Index: user/ngie/bug-237403/MAINTAINERS =================================================================== --- user/ngie/bug-237403/MAINTAINERS (revision 346925) +++ user/ngie/bug-237403/MAINTAINERS (revision 346926) @@ -1,125 +1,132 @@ $FreeBSD$ Please note that the content of this file is strictly advisory. No locks listed here are valid. The only strict review requirements are granted by core. These are documented in head/LOCKS and enforced by svnadmin/conf/approvers. The source tree is a community effort. However, some folks go to the trouble of looking after particular areas of the tree. In return for their active caretaking of the code it is polite to coordinate changes with them. This is a list of people who have expressed an interest in part of the code or listed their active caretaking role so that other committers can easily find somebody who is familiar with it. The notes should specify if there is a 3rd party source tree involved or other things that should be kept in mind. However, this is not a 'big stick', it is an offer to help and a source of guidance. It does not override the communal nature of the tree. It is not a registry of 'turf' or private property. *** This list is prone to becoming stale quickly. The best way to find the recent maintainer of a sub-system is to check recent logs for that directory or sub-system. *** *** Maintainers are encouraged to visit: https://reviews.freebsd.org/herald and configure notifications for parts of the tree which they maintain. Notifications can automatically be sent when someone proposes a revision or makes a commit to the specified subtree. *** subsystem login notes ----------------------------- -atf freebsd-testing,jmmv,ngie Pre-commit review requested. ath(4) adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org +contrib/atf ngie,#test Pre-commit review requested. +contrib/capsicum-test ngie,#capsicum,#test Pre-commit review requested. contrib/compiler-rt dim Pre-commit review preferred. +contrib/googletest ngie,#test Pre-commit review requested. contrib/ipfilter cy Pre-commit review requested. contrib/libc++ dim Pre-commit review preferred. contrib/libcxxrt dim Pre-commit review preferred. contrib/libunwind dim,emaste,jhb Pre-commit review preferred. contrib/llvm dim Pre-commit review preferred. contrib/llvm/tools/lldb dim,emaste Pre-commit review preferred. -contrib/netbsd-tests freebsd-testing,ngie Pre-commit review requested. -contrib/pjdfstest freebsd-testing,asomers,ngie,pjd Pre-commit review requested. +contrib/netbsd-tests ngie,#test Pre-commit review requested. +contrib/pjdfstest asomers,ngie,pjd,#test Pre-commit review requested. *env(3) secteam Due to the problematic security history of this code, please have patches reviewed by secteam. etc/mail gshapiro Pre-commit review requested. Keep in sync with -STABLE. etc/sendmail gshapiro Pre-commit review requested. Keep in sync with -STABLE. fetch des Pre-commit review requested, email only. geli pjd Pre-commit review requested (both sys/geom/eli/ and sbin/geom/class/eli/). isci(4) jimharris Pre-commit review requested. iwm(4) adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org iwn(4) adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org kqueue jmg Pre-commit review requested. Documentation Required. libdpv dteske Pre-commit review requested. Keep in sync with dpv(1). libfetch des Pre-commit review requested, email only. libfigpar dteske Pre-commit review requested. 
libm freebsd-numerics Send email with patches to freebsd-numerics@ libpam des Pre-commit review requested, email only. linprocfs des Pre-commit review requested, email only. lpr gad Pre-commit review requested, particularly for lpd/recvjob.c and lpd/printjob.c. nanobsd imp Pre-commit phabricator review requested. net80211 adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org nfs freebsd-fs@FreeBSD.org, rmacklem is best for reviews. nvd(4) jimharris Pre-commit review requested. nvme(4) jimharris Pre-commit review requested. nvmecontrol(8) jimharris Pre-commit review requested. opencrypto jmg Pre-commit review requested. Documentation Required. openssh des Pre-commit review requested, email only. openssl benl,jkim Pre-commit review requested. otus(4) adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org pci bus imp,jhb Pre-commit review requested. pmcstudy(8) rrs Pre-commit review requested. procfs des Pre-commit review requested, email only. pseudofs des Pre-commit review requested, email only. release/release.sh gjb,re Pre-commit review and regression tests requested. sctp rrs,tuexen Pre-commit review requested (changes need to be backported to github). sendmail gshapiro Pre-commit review requested. sh(1) jilles Pre-commit review requested. This also applies to kill(1), printf(1) and test(1) which are compiled in as builtins. share/mk imp, bapt, bdrewery, emaste, sjg Make is hard. -share/mk/*.test.mk freebsd-testing,ngie (same list as share/mk too) Pre-commit review requested. +share/mk/*.test.mk imp,bapt,bdrewery, Pre-commit review requested. + emaste,ngie,sjg,#test stand/forth dteske Pre-commit review requested. stand/lua kevans Pre-commit review requested -sys/compat/linuxkpi hselasky If in doubt, ask. +sys/compat/linuxkpi hselasky If in doubt, ask. + zeising, johalun pre-commit review requested via + #x11 phabricator group. + (to avoid drm graphics drivers + impact) sys/contrib/ipfilter cy Pre-commit review requested. sys/dev/e1000 erj Pre-commit phabricator review requested. sys/dev/ixgbe erj Pre-commit phabricator review requested. sys/dev/ixl erj Pre-commit phabricator review requested. sys/dev/sound/usb hselasky If in doubt, ask. sys/dev/usb hselasky If in doubt, ask. sys/dev/xen royger Pre-commit review recommended. sys/netinet/ip_carp.c glebius Pre-commit review recommended. sys/netpfil/pf kp,glebius Pre-commit review recommended. sys/x86/xen royger Pre-commit review recommended. sys/xen royger Pre-commit review recommended. -tests freebsd-testing,ngie Pre-commit review requested. +tests ngie,#test Pre-commit review requested. tools/build imp Pre-commit review requested, especially to fix bootstrap issues. top(1) eadler Pre-commit review requested. usr.sbin/bsdconfig dteske Pre-commit phabricator review requested. usr.sbin/dpv dteske Pre-commit review requested. Keep in sync with libdpv. usr.sbin/pkg pkg@ Please coordinate behavior or flag changes with pkg team. usr.sbin/sysrc dteske Pre-commit phabricator review requested. Keep in sync with bsdconfig(8) sysrc.subr. vmm(4) tychon, jhb Pre-commit review requested via #bhyve phabricator group. libvmmapi tychon, jhb Pre-commit review requested via #bhyve phabricator group. usr.sbin/bhyve* tychon, jhb Pre-commit review requested via #bhyve phabricator group. autofs(5) trasz Pre-commit review recommended. iscsi(4) trasz Pre-commit review recommended. rctl(8) trasz Pre-commit review recommended. sys/dev/ofw nwhitehorn Pre-commit review recommended. 
sys/dev/drm* imp Pre-commit review requested in phabricator. Changes need to be mirrored in github repo. sys/dev/usb/wlan adrian Pre-commit review requested, send to freebsd-wireless@freebsd.org sys/arm/allwinner manu Pre-commit review requested sys/arm64/rockchip manu Pre-commit review requested Property changes on: user/ngie/bug-237403/MAINTAINERS ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/MAINTAINERS:r346444-346925 Index: user/ngie/bug-237403/bin/date/date.1 =================================================================== --- user/ngie/bug-237403/bin/date/date.1 (revision 346925) +++ user/ngie/bug-237403/bin/date/date.1 (revision 346926) @@ -1,471 +1,473 @@ .\"- .\" Copyright (c) 1980, 1990, 1993 .\" The Regents of the University of California. All rights reserved. .\" .\" This code is derived from software contributed to Berkeley by .\" the Institute of Electrical and Electronics Engineers, Inc. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" 3. Neither the name of the University nor the names of its contributors .\" may be used to endorse or promote products derived from this software .\" without specific prior written permission. .\" .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" @(#)date.1 8.3 (Berkeley) 4/28/95 .\" $FreeBSD$ .\" -.Dd March 20, 2019 +.Dd April 23, 2019 .Dt DATE 1 .Os .Sh NAME .Nm date .Nd display or set date and time .Sh SYNOPSIS .Nm -.Op Fl jRu +.Op Fl jnRu .Op Fl r Ar seconds | Ar filename .Oo .Fl v .Sm off .Op Cm + | - .Ar val Op Ar ymwdHMS .Sm on .Oc .Ar ... .Op Cm + Ns Ar output_fmt .Nm .Op Fl ju .Sm off .Op Oo Oo Oo Oo Ar cc Oc Ar yy Oc Ar mm Oc Ar dd Oc Ar HH .Ar MM Op Ar .ss .Sm on .Nm .Op Fl jRu .Fl f Ar input_fmt new_date .Op Cm + Ns Ar output_fmt .Nm .Op Fl jnu .Op Fl I Ns Op Ar FMT .Op Fl f Ar input_fmt .Op Fl r Ar ... .Op Fl v Ar ... .Op Ar new_date .Sh DESCRIPTION When invoked without arguments, the .Nm utility displays the current date and time. Otherwise, depending on the options specified, .Nm will set the date and time or print it in a user-defined way. .Pp The .Nm utility displays the date and time read from the kernel clock. When used to set the date and time, both the kernel clock and the hardware clock are updated. 
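
The DESCRIPTION above notes that date(1) reads the kernel clock and, when setting the time, updates both the kernel and hardware clocks. As a point of reference for the read side, here is a minimal C sketch (illustrative only, not part of this change) of what a bare `date` invocation boils down to; it uses the same libc calls date.c below relies on, with "%c" standing in for the utility's default "%+" format string:

/*
 * Illustrative sketch, not code from this patch: read the kernel clock
 * and print it roughly the way a bare `date` invocation does.
 * date(1)'s default format string is "%+"; "%c" is a portable stand-in.
 */
#include <stdio.h>
#include <time.h>

int
main(void)
{
	char buf[1024];
	time_t now;
	struct tm *lt;

	if (time(&now) == (time_t)-1)		/* kernel clock, seconds since the Epoch */
		return (1);
	if ((lt = localtime(&now)) == NULL)	/* apply TZ to get local broken-down time */
		return (1);
	(void)strftime(buf, sizeof(buf), "%c", lt);
	(void)printf("%s\n", buf);
	return (0);
}
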
.Pp Only the superuser may set the date, and if the system securelevel (see .Xr securelevel 7 ) is greater than 1, the time may not be changed by more than 1 second. .Pp The options are as follows: .Bl -tag -width Ds .It Fl f Use .Ar input_fmt as the format string to parse the .Ar new_date provided rather than using the default .Sm off .Oo Oo Oo Oo Oo .Ar cc Oc .Ar yy Oc .Ar mm Oc .Ar dd Oc .Ar HH .Oc Ar MM Op Ar .ss .Sm on format. Parsing is done using .Xr strptime 3 . .It Fl I Ns Op Ar FMT Use .St -iso8601 output format. .Ar FMT may be omitted, in which case the default is .Sq date . Valid .Ar FMT values are .Sq date , .Sq hours , .Sq minutes , and .Sq seconds . The date and time is formatted to the specified precision. When .Ar FMT is .Sq hours (or the more precise .Sq minutes or .Sq seconds ) , the .St -iso8601 format includes the timezone. .It Fl j Do not try to set the date. This allows you to use the .Fl f flag in addition to the .Cm + option to convert one date format to another. +.It Fl n +Obsolete flag, accepted and ignored for compatibility. .It Fl R Use RFC 2822 date and time output format. This is equivalent to using .Dq Li %a, %d %b %Y \&%T %z as .Ar output_fmt while .Ev LC_TIME is set to the .Dq C locale . .It Fl r Ar seconds Print the date and time represented by .Ar seconds , where .Ar seconds is the number of seconds since the Epoch (00:00:00 UTC, January 1, 1970; see .Xr time 3 ) , and can be specified in decimal, octal, or hex. .It Fl r Ar filename Print the date and time of the last modification of .Ar filename . .It Fl u Display or set the date in .Tn UTC (Coordinated Universal) time. .It Fl v Adjust (i.e., take the current date and display the result of the adjustment; not actually set the date) the second, minute, hour, month day, week day, month or year according to .Ar val . If .Ar val is preceded with a plus or minus sign, the date is adjusted forwards or backwards according to the remaining string, otherwise the relevant part of the date is set. The date can be adjusted as many times as required using these flags. Flags are processed in the order given. .Pp When setting values (rather than adjusting them), seconds are in the range 0-59, minutes are in the range 0-59, hours are in the range 0-23, month days are in the range 1-31, week days are in the range 0-6 (Sun-Sat), months are in the range 1-12 (Jan-Dec) and years are in the range 80-38 or 1980-2038. .Pp If .Ar val is numeric, one of either .Ar y , .Ar m , .Ar w , .Ar d , .Ar H , .Ar M or .Ar S must be used to specify which part of the date is to be adjusted. .Pp The week day or month may be specified using a name rather than a number. If a name is used with the plus (or minus) sign, the date will be put forwards (or backwards) to the next (previous) date that matches the given week day or month. This will not adjust the date, if the given week day or month is the same as the current one. .Pp When a date is adjusted to a specific value or in units greater than hours, daylight savings time considerations are ignored. Adjustments in units of hours or less honor daylight saving time. So, assuming the current date is March 26, 0:30 and that the DST adjustment means that the clock goes forward at 01:00 to 02:00, using .Fl v No +1H will adjust the date to March 26, 2:30. Likewise, if the date is October 29, 0:30 and the DST adjustment means that the clock goes back at 02:00 to 01:00, using .Fl v No +3H will be necessary to reach October 29, 2:30. 
.Pp When the date is adjusted to a specific value that does not actually exist (for example March 26, 1:30 BST 2000 in the Europe/London timezone), the date will be silently adjusted forwards in units of one hour until it reaches a valid time. When the date is adjusted to a specific value that occurs twice (for example October 29, 1:30 2000), the resulting timezone will be set so that the date matches the earlier of the two times. .Pp It is not possible to adjust a date to an invalid absolute day, so using the switches .Fl v No 31d Fl v No 12m will simply fail five months of the year. It is therefore usual to set the month before setting the day; using .Fl v No 12m Fl v No 31d always works. .Pp Adjusting the date by months is inherently ambiguous because a month is a unit of variable length depending on the current date. This kind of date adjustment is applied in the most intuitive way. First of all, .Nm tries to preserve the day of the month. If it is impossible because the target month is shorter than the present one, the last day of the target month will be the result. For example, using .Fl v No +1m on May 31 will adjust the date to June 30, while using the same option on January 30 will result in the date adjusted to the last day of February. This approach is also believed to make the most sense for shell scripting. Nevertheless, be aware that going forth and back by the same number of months may take you to a different date. .Pp Refer to the examples below for further details. .El .Pp An operand with a leading plus .Pq Sq + sign signals a user-defined format string which specifies the format in which to display the date and time. The format string may contain any of the conversion specifications described in the .Xr strftime 3 manual page, as well as any arbitrary text. A newline .Pq Ql \en character is always output after the characters specified by the format string. The format string for the default display is .Dq +%+ . .Pp If an operand does not have a leading plus sign, it is interpreted as a value for setting the system's notion of the current date and time. The canonical representation for setting the date and time is: .Pp .Bl -tag -width Ds -compact -offset indent .It Ar cc Century (either 19 or 20) prepended to the abbreviated year. .It Ar yy Year in abbreviated form (e.g., 89 for 1989, 06 for 2006). .It Ar mm Numeric month, a number from 1 to 12. .It Ar dd Day, a number from 1 to 31. .It Ar HH Hour, a number from 0 to 23. .It Ar MM Minutes, a number from 0 to 59. .It Ar ss Seconds, a number from 0 to 60 (59 plus a potential leap second). .El .Pp Everything but the minutes is optional. .Pp Time changes for Daylight Saving Time, standard time, leap seconds, and leap years are handled automatically. .Sh ENVIRONMENT The following environment variables affect the execution of .Nm : .Bl -tag -width Ds .It Ev TZ The timezone to use when displaying dates. The normal format is a pathname relative to .Pa /usr/share/zoneinfo . For example, the command .Dq TZ=America/Los_Angeles date displays the current time in California. See .Xr environ 7 for more information. .El .Sh FILES .Bl -tag -width /var/log/messages -compact .It Pa /var/log/utx.log record of date resets and time changes .It Pa /var/log/messages record of the user setting the time .El .Sh EXIT STATUS The .Nm utility exits 0 on success, 1 if unable to set the date, and 2 if able to set the local date, but unable to set it globally. 
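
One note on the -R flag documented above before the EXAMPLES: the manual states it is equivalent to the output format "%a, %d %b %Y %T %z" with LC_TIME set to the C locale, and date.c below carries exactly that string as rfc2822_format. A minimal, self-contained C sketch of that equivalence (illustrative only, not part of the patch):

/*
 * Sketch of the -R (RFC 2822) output path: force the C locale for
 * LC_TIME and format with the same string date.c uses.
 */
#include <locale.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	char buf[64];
	time_t now;

	(void)setlocale(LC_TIME, "C");	/* RFC 2822 day/month names must not be localized */
	(void)time(&now);
	(void)strftime(buf, sizeof(buf), "%a, %d %b %Y %T %z", localtime(&now));
	(void)printf("%s\n", buf);	/* e.g. Tue, 23 Apr 2019 10:11:12 +0000 */
	return (0);
}

This is also why date.c switches LC_TIME to "C" only when format == rfc2822_format: every other output format honors the user's locale.
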
.Sh EXAMPLES The command: .Pp .Dl "date ""+DATE: %Y-%m-%d%nTIME: %H:%M:%S""" .Pp will display: .Bd -literal -offset indent DATE: 1987-11-21 TIME: 13:36:16 .Ed .Pp In the Europe/London timezone, the command: .Pp .Dl "date -v1m -v+1y" .Pp will display: .Pp .Dl "Sun Jan 4 04:15:24 GMT 1998" .Pp where it is currently .Li "Mon Aug 4 04:15:24 BST 1997" . .Pp The command: .Pp .Dl "date -v1d -v3m -v0y -v-1d" .Pp will display the last day of February in the year 2000: .Pp .Dl "Tue Feb 29 03:18:00 GMT 2000" .Pp So will the command: .Pp .Dl "date -v3m -v30d -v0y -v-1m" .Pp because there is no such date as the 30th of February. .Pp The command: .Pp .Dl "date -v1d -v+1m -v-1d -v-fri" .Pp will display the last Friday of the month: .Pp .Dl "Fri Aug 29 04:31:11 BST 1997" .Pp where it is currently .Li "Mon Aug 4 04:31:11 BST 1997" . .Pp The command: .Pp .Dl "date 8506131627" .Pp sets the date to .Dq Li "June 13, 1985, 4:27 PM" . .Pp .Dl "date ""+%Y%m%d%H%M.%S""" .Pp may be used on one machine to print out the date suitable for setting on another. .Qq ( Li "+%m%d%H%M%Y.%S" for use on .Tn Linux . ) .Pp The command: .Pp .Dl "date 1432" .Pp sets the time to .Li "2:32 PM" , without modifying the date. .Pp The command .Pp .Dl "TZ=America/Los_Angeles date -Iseconds -r 1533415339" .Pp will display .Pp .Dl "2018-08-04T13:42:19-07:00" .Pp Finally the command: .Pp .Dl "date -j -f ""%a %b %d %T %Z %Y"" ""`date`"" ""+%s""" .Pp can be used to parse the output from .Nm and express it in Epoch time. .Sh DIAGNOSTICS It is invalid to combine the .Fl I flag with either .Fl R or an output format .Dq ( + Ns ... ) operand. If this occurs, .Nm prints: .Ql multiple output formats specified and exits with an error status. .Sh SEE ALSO .Xr locale 1 , .Xr gettimeofday 2 , .Xr getutxent 3 , .Xr strftime 3 , .Xr strptime 3 .Rs .%T "TSP: The Time Synchronization Protocol for UNIX 4.3BSD" .%A R. Gusella .%A S. Zatti .Re .Sh STANDARDS The .Nm utility is expected to be compatible with .St -p1003.2 . The .Fl d , f , I , j , r , t , and .Fl v options are all extensions to the standard. .Pp The format selected by the .Fl I flag is compatible with .St -iso8601 . .Sh HISTORY A .Nm command appeared in .At v1 . .Pp The .Fl I flag was added in .Fx 12.0 . Index: user/ngie/bug-237403/bin/date/date.c =================================================================== --- user/ngie/bug-237403/bin/date/date.c (revision 346925) +++ user/ngie/bug-237403/bin/date/date.c (revision 346926) @@ -1,391 +1,393 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1985, 1987, 1988, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef lint static char const copyright[] = "@(#) Copyright (c) 1985, 1987, 1988, 1993\n\ The Regents of the University of California. All rights reserved.\n"; #endif /* not lint */ #if 0 #ifndef lint static char sccsid[] = "@(#)date.c 8.2 (Berkeley) 4/28/95"; #endif /* not lint */ #endif #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vary.h" #ifndef TM_YEAR_BASE #define TM_YEAR_BASE 1900 #endif static time_t tval; static void badformat(void); static void iso8601_usage(const char *); static void multipleformats(void); static void printdate(const char *); static void printisodate(struct tm *); static void setthetime(const char *, const char *, int); static void usage(void); static const struct iso8601_fmt { const char *refname; const char *format_string; } iso8601_fmts[] = { { "date", "%Y-%m-%d" }, { "hours", "T%H" }, { "minutes", ":%M" }, { "seconds", ":%S" }, }; static const struct iso8601_fmt *iso8601_selected; static const char *rfc2822_format = "%a, %d %b %Y %T %z"; int main(int argc, char *argv[]) { int ch, rflag; bool Iflag, jflag, Rflag; const char *format; char buf[1024]; char *fmt; char *tmp; struct vary *v; const struct vary *badv; struct tm *lt; struct stat sb; size_t i; v = NULL; fmt = NULL; (void) setlocale(LC_TIME, ""); rflag = 0; Iflag = jflag = Rflag = 0; - while ((ch = getopt(argc, argv, "f:I::jRr:uv:")) != -1) + while ((ch = getopt(argc, argv, "f:I::jnRr:uv:")) != -1) switch((char)ch) { case 'f': fmt = optarg; break; case 'I': if (Rflag) multipleformats(); Iflag = 1; if (optarg == NULL) { iso8601_selected = iso8601_fmts; break; } for (i = 0; i < nitems(iso8601_fmts); i++) if (strcmp(optarg, iso8601_fmts[i].refname) == 0) break; if (i == nitems(iso8601_fmts)) iso8601_usage(optarg); iso8601_selected = &iso8601_fmts[i]; break; case 'j': jflag = 1; /* don't set time */ + break; + case 'n': break; case 'R': /* RFC 2822 datetime format */ if (Iflag) multipleformats(); Rflag = 1; break; case 'r': /* user specified seconds */ rflag = 1; tval = strtoq(optarg, &tmp, 0); if (*tmp != 0) { if (stat(optarg, &sb) == 0) tval = sb.st_mtim.tv_sec; else usage(); } break; case 'u': /* do everything in UTC */ (void)setenv("TZ", "UTC0", 1); break; case 'v': v = vary_append(v, optarg); break; default: usage(); } argc -= optind; argv += optind; if (!rflag && time(&tval) == -1) err(1, "time"); format = "%+"; if (Rflag) format = rfc2822_format; /* allow the operands in any order */ if (*argv && **argv == '+') { if (Iflag) multipleformats(); format = *argv + 1; ++argv; } if (*argv) { setthetime(fmt, *argv, jflag); ++argv; } else if (fmt != NULL) usage(); if (*argv && **argv == '+') { if (Iflag) multipleformats(); format = *argv + 1; } 
lt = localtime(&tval); if (lt == NULL) errx(1, "invalid time"); badv = vary_apply(v, lt); if (badv) { fprintf(stderr, "%s: Cannot apply date adjustment\n", badv->arg); vary_destroy(v); usage(); } vary_destroy(v); if (Iflag) printisodate(lt); if (format == rfc2822_format) /* * When using RFC 2822 datetime format, don't honor the * locale. */ setlocale(LC_TIME, "C"); (void)strftime(buf, sizeof(buf), format, lt); printdate(buf); } static void printdate(const char *buf) { (void)printf("%s\n", buf); if (fflush(stdout)) err(1, "stdout"); exit(EXIT_SUCCESS); } static void printisodate(struct tm *lt) { const struct iso8601_fmt *it; char fmtbuf[32], buf[32], tzbuf[8]; fmtbuf[0] = 0; for (it = iso8601_fmts; it <= iso8601_selected; it++) strlcat(fmtbuf, it->format_string, sizeof(fmtbuf)); (void)strftime(buf, sizeof(buf), fmtbuf, lt); if (iso8601_selected > iso8601_fmts) { (void)strftime(tzbuf, sizeof(tzbuf), "%z", lt); memmove(&tzbuf[4], &tzbuf[3], 3); tzbuf[3] = ':'; strlcat(buf, tzbuf, sizeof(buf)); } printdate(buf); } #define ATOI2(s) ((s) += 2, ((s)[-2] - '0') * 10 + ((s)[-1] - '0')) static void setthetime(const char *fmt, const char *p, int jflag) { struct utmpx utx; struct tm *lt; struct timeval tv; const char *dot, *t; int century; lt = localtime(&tval); if (lt == NULL) errx(1, "invalid time"); lt->tm_isdst = -1; /* divine correct DST */ if (fmt != NULL) { t = strptime(p, fmt, lt); if (t == NULL) { fprintf(stderr, "Failed conversion of ``%s''" " using format ``%s''\n", p, fmt); badformat(); } else if (*t != '\0') fprintf(stderr, "Warning: Ignoring %ld extraneous" " characters in date string (%s)\n", (long) strlen(t), t); } else { for (t = p, dot = NULL; *t; ++t) { if (isdigit(*t)) continue; if (*t == '.' && dot == NULL) { dot = t; continue; } badformat(); } if (dot != NULL) { /* .ss */ dot++; /* *dot++ = '\0'; */ if (strlen(dot) != 2) badformat(); lt->tm_sec = ATOI2(dot); if (lt->tm_sec > 61) badformat(); } else lt->tm_sec = 0; century = 0; /* if p has a ".ss" field then let's pretend it's not there */ switch (strlen(p) - ((dot != NULL) ? 
3 : 0)) { case 12: /* cc */ lt->tm_year = ATOI2(p) * 100 - TM_YEAR_BASE; century = 1; /* FALLTHROUGH */ case 10: /* yy */ if (century) lt->tm_year += ATOI2(p); else { lt->tm_year = ATOI2(p); if (lt->tm_year < 69) /* hack for 2000 ;-} */ lt->tm_year += 2000 - TM_YEAR_BASE; else lt->tm_year += 1900 - TM_YEAR_BASE; } /* FALLTHROUGH */ case 8: /* mm */ lt->tm_mon = ATOI2(p); if (lt->tm_mon > 12) badformat(); --lt->tm_mon; /* time struct is 0 - 11 */ /* FALLTHROUGH */ case 6: /* dd */ lt->tm_mday = ATOI2(p); if (lt->tm_mday > 31) badformat(); /* FALLTHROUGH */ case 4: /* HH */ lt->tm_hour = ATOI2(p); if (lt->tm_hour > 23) badformat(); /* FALLTHROUGH */ case 2: /* MM */ lt->tm_min = ATOI2(p); if (lt->tm_min > 59) badformat(); break; default: badformat(); } } /* convert broken-down time to GMT clock time */ if ((tval = mktime(lt)) == -1) errx(1, "nonexistent time"); if (!jflag) { utx.ut_type = OLD_TIME; memset(utx.ut_id, 0, sizeof(utx.ut_id)); (void)gettimeofday(&utx.ut_tv, NULL); pututxline(&utx); tv.tv_sec = tval; tv.tv_usec = 0; if (settimeofday(&tv, NULL) != 0) err(1, "settimeofday (timeval)"); utx.ut_type = NEW_TIME; (void)gettimeofday(&utx.ut_tv, NULL); pututxline(&utx); if ((p = getlogin()) == NULL) p = "???"; syslog(LOG_AUTH | LOG_NOTICE, "date set by %s", p); } } static void badformat(void) { warnx("illegal time format"); usage(); } static void iso8601_usage(const char *badarg) { errx(1, "invalid argument '%s' for -I", badarg); } static void multipleformats(void) { errx(1, "multiple output formats specified"); } static void usage(void) { (void)fprintf(stderr, "%s\n%s\n%s\n", "usage: date [-jnRu] [-r seconds|file] [-v[+|-]val[ymwdHMS]]", " " "[-I[date | hours | minutes | seconds]]", " " "[-f fmt date | [[[[[cc]yy]mm]dd]HH]MM[.ss]] [+format]" ); exit(1); } Index: user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.ipv4localsctp.ksh =================================================================== --- user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.ipv4localsctp.ksh (revision 346925) +++ user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.ipv4localsctp.ksh (revision 346926) @@ -1,137 +1,153 @@ #!/usr/bin/env ksh # # CDDL HEADER START # # The contents of this file are subject to the terms of the # Common Development and Distribution License (the "License"). # You may not use this file except in compliance with the License. # # You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE # or http://www.opensolaris.org/os/licensing. # See the License for the specific language governing permissions # and limitations under the License. # # When distributing Covered Code, include this CDDL HEADER in each # file and include the License file at usr/src/OPENSOLARIS.LICENSE. # If applicable, add the following below this CDDL HEADER, with the # fields enclosed by brackets "[]" replaced with your own identifying # information: Portions Copyright [yyyy] [name of copyright owner] # # CDDL HEADER END # # # Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved. # # # Test {ip,sctp}:::{send,receive} of IPv4 SCTP to local host. # # This may fail due to: # # 1. A change to the ip stack breaking expected probe behavior, # which is the reason we are testing. # 2. The lo0 interface missing or not up. # 3. An unlikely race causes the unlocked global send/receive # variables to be corrupted. 
# # This test performs a SCTP association and checks that at least the # following packet counts were traced: # # 7 x ip:::send (4 during the setup, 3 during the teardown) # 7 x sctp:::send (4 during the setup, 3 during the teardown) # 7 x ip:::receive (4 during the setup, 3 during the teardown) # 7 x sctp:::receive (4 during the setup, 3 during the teardown) # The actual count tested is 7 each way, since we are tracing both # source and destination events. # if (( $# != 1 )); then print -u2 "expected one argument: " exit 2 fi dtrace=$1 local=127.0.0.1 DIR=/var/tmp/dtest.$$ sctpport=1024 bound=5000 -while [ $sctpport -lt $bound ]; do - ncat --sctp -z $local $sctpport > /dev/null || break - sctpport=$(($sctpport + 1)) -done -if [ $sctpport -eq $bound ]; then - echo "couldn't find an available SCTP port" - exit 1 -fi mkdir $DIR cd $DIR -# ncat will exit when the association is closed. -ncat --sctp --listen $local $sctpport & - -cat > test.pl <<-EOPERL +cat > client.pl <<-EOPERL use IO::Socket; my \$s = IO::Socket::INET->new( Type => SOCK_STREAM, Proto => "sctp", LocalAddr => "$local", PeerAddr => "$local", - PeerPort => $sctpport, + PeerPort => \$ARGV[0], Timeout => 3); - die "Could not connect to host $local port $sctpport \$@" unless \$s; + die "Could not connect to host $local port \$ARGV[0] \$@" unless \$s; close \$s; - sleep(2); + sleep(\$ARGV[1]); EOPERL -$dtrace -c 'perl test.pl' -qs /dev/stdin <&- || break + sctpport=$(($sctpport + 1)) +done +if [ $sctpport -eq $bound ]; then + echo "couldn't find an available SCTP port" + exit 1 +fi + +cat > server.pl <<-EOPERL + use IO::Socket; + my \$l = IO::Socket::INET->new( + Type => SOCK_STREAM, + Proto => "sctp", + LocalAddr => "$local", + LocalPort => $sctpport, + Listen => 1, + Reuse => 1); + die "Could not listen on $local port $sctpport \$@" unless \$l; + my \$c = \$l->accept(); + close \$l; + while (<\$c>) {}; + close \$c; +EOPERL + +perl server.pl & + +$dtrace -c "perl client.pl $sctpport 2" -qs /dev/stdin <ip_saddr == "$local" && args[2]->ip_daddr == "$local" && args[4]->ipv4_protocol == IPPROTO_SCTP/ { ipsend++; } sctp:::send /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local"/ { sctpsend++; } ip:::receive /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local" && args[4]->ipv4_protocol == IPPROTO_SCTP/ { ipreceive++; } sctp:::receive /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local"/ { sctpreceive++; } END { printf("Minimum SCTP events seen\n\n"); - printf("ip:::send (%d) - %s\n", ipsend, ipsend >= 7 ? "yes" : "no"); - printf("ip:::receive (%d) - %s\n", ipreceive, ipreceive >= 7 ? "yes" : "no"); - printf("sctp:::send (%d) - %s\n", sctpsend, sctpsend >= 7 ? "yes" : "no"); - printf("sctp:::receive (%d) - %s\n", sctpreceive, sctpreceive >= 7 ? "yes" : "no"); + printf("ip:::send - %s\n", ipsend >= 7 ? "yes" : "no"); + printf("ip:::receive - %s\n", ipreceive >= 7 ? "yes" : "no"); + printf("sctp:::send - %s\n", sctpsend >= 7 ? "yes" : "no"); + printf("sctp:::receive - %s\n", sctpreceive >= 7 ? "yes" : "no"); } EODTRACE status=$? 
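
The rewritten test above no longer depends on ncat: it searches for a usable SCTP port and then runs its own Perl listener (server.pl) alongside the client (client.pl). For readers who prefer the C API, this is a rough sketch of one way such a port probe can be done, by attempting to bind a one-to-one style SCTP socket on the loopback address. The helper name sctp_port_is_free is hypothetical and this is not the script's actual probing code:

/*
 * Hypothetical helper, not part of the test: returns 1 if `port` can be
 * bound on 127.0.0.1 as a one-to-one SCTP socket (i.e. looks free), else 0.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int
sctp_port_is_free(uint16_t port)
{
	struct sockaddr_in sin;
	int s, ok;

	s = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	if (s == -1)
		return (0);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);		/* BSD sockaddr convention */
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = inet_addr("127.0.0.1");
	ok = (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0);
	close(s);
	return (ok);
}

Looping over a port range and stopping at the first port for which such a probe succeeds mirrors the intent of the script's while loop from $sctpport up to $bound.
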
cd / /bin/rm -rf $DIR exit $status Index: user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh =================================================================== --- user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh (revision 346925) +++ user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh (revision 346926) @@ -1,159 +1,175 @@ #!/usr/bin/env ksh # # CDDL HEADER START # # The contents of this file are subject to the terms of the # Common Development and Distribution License (the "License"). # You may not use this file except in compliance with the License. # # You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE # or http://www.opensolaris.org/os/licensing. # See the License for the specific language governing permissions # and limitations under the License. # # When distributing Covered Code, include this CDDL HEADER in each # file and include the License file at usr/src/OPENSOLARIS.LICENSE. # If applicable, add the following below this CDDL HEADER, with the # fields enclosed by brackets "[]" replaced with your own identifying # information: Portions Copyright [yyyy] [name of copyright owner] # # CDDL HEADER END # # # Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved. # # # Test sctp:::state-change and sctp:::{send,receive} by connecting to # the local discard service. # A number of state transition events along with SCTP send and # receive events for the message should result. # # This may fail due to: # # 1. A change to the ip stack breaking expected probe behavior, # which is the reason we are testing. # 2. The lo0 interface missing or not up. # 3. An unlikely race causes the unlocked global send/receive # variables to be corrupted. # # This test performs a SCTP connection and checks that at least the # following packet counts were traced: # # 7 x ip:::send (4 during the setup, 3 during the teardown) # 7 x sctp:::send (4 during the setup, 3 during the teardown) # 7 x ip:::receive (4 during the setup, 3 during the teardown) # 7 x sctp:::receive (4 during the setup, 3 during the teardown) # # The actual count tested is 7 each way, since we are tracing both # source and destination events. # if (( $# != 1 )); then print -u2 "expected one argument: " exit 2 fi dtrace=$1 local=127.0.0.1 DIR=/var/tmp/dtest.$$ sctpport=1024 bound=5000 -while [ $sctpport -lt $bound ]; do - ncat --sctp -z $local $sctpport > /dev/null || break - sctpport=$(($sctpport + 1)) -done -if [ $sctpport -eq $bound ]; then - echo "couldn't find an available SCTP port" - exit 1 -fi mkdir $DIR cd $DIR -# ncat will exit when the association is closed. 
-ncat --sctp --listen $local $sctpport & - -cat > test.pl <<-EOPERL +cat > client.pl <<-EOPERL use IO::Socket; my \$s = IO::Socket::INET->new( Type => SOCK_STREAM, Proto => "sctp", LocalAddr => "$local", PeerAddr => "$local", - PeerPort => $sctpport, + PeerPort => \$ARGV[0], Timeout => 3); - die "Could not connect to host $local port $sctpport \$@" unless \$s; + die "Could not connect to host $local port \$ARGV[0] \$@" unless \$s; close \$s; - sleep(2); + sleep(\$ARGV[1]); EOPERL -$dtrace -c 'perl test.pl' -qs /dev/stdin <&- || break + sctpport=$(($sctpport + 1)) +done +if [ $sctpport -eq $bound ]; then + echo "couldn't find an available SCTP port" + exit 1 +fi + +cat > server.pl <<-EOPERL + use IO::Socket; + my \$l = IO::Socket::INET->new( + Type => SOCK_STREAM, + Proto => "sctp", + LocalAddr => "$local", + LocalPort => $sctpport, + Listen => 1, + Reuse => 1); + die "Could not listen on $local port $sctpport \$@" unless \$l; + my \$c = \$l->accept(); + close \$l; + while (<\$c>) {}; + close \$c; +EOPERL + +perl server.pl & + +$dtrace -c "perl client.pl $sctpport 2" -qs /dev/stdin <ip_saddr == "$local" && args[2]->ip_daddr == "$local" && args[4]->ipv4_protocol == IPPROTO_SCTP/ { ipsend++; } sctp:::send /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local" && (args[4]->sctp_sport == $sctpport || args[4]->sctp_dport == $sctpport)/ { sctpsend++; } ip:::receive /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local" && args[4]->ipv4_protocol == IPPROTO_SCTP/ { ipreceive++; } sctp:::receive /args[2]->ip_saddr == "$local" && args[2]->ip_daddr == "$local" && (args[4]->sctp_sport == $sctpport || args[4]->sctp_dport == $sctpport)/ { sctpreceive++; } sctp:::state-change { state_event[args[3]->sctps_state]++; } END { printf("Minimum SCTP events seen\n\n"); printf("ip:::send - %s\n", ipsend >= 7 ? "yes" : "no"); printf("ip:::receive - %s\n", ipreceive >= 7 ? "yes" : "no"); printf("sctp:::send - %s\n", sctpsend >= 7 ? "yes" : "no"); printf("sctp:::receive - %s\n", sctpreceive >= 7 ? "yes" : "no"); printf("sctp:::state-change to cookie-wait - %s\n", state_event[SCTP_STATE_COOKIE_WAIT] >=1 ? "yes" : "no"); printf("sctp:::state-change to cookie-echoed - %s\n", state_event[SCTP_STATE_COOKIE_ECHOED] >=1 ? "yes" : "no"); printf("sctp:::state-change to established - %s\n", state_event[SCTP_STATE_ESTABLISHED] >= 2 ? "yes" : "no"); printf("sctp:::state-change to shutdown-sent - %s\n", state_event[SCTP_STATE_SHUTDOWN_SENT] >= 1 ? "yes" : "no"); printf("sctp:::state-change to shutdown-received - %s\n", state_event[SCTP_STATE_SHUTDOWN_RECEIVED] >= 1 ? "yes" : "no"); printf("sctp:::state-change to shutdown-ack-sent - %s\n", state_event[SCTP_STATE_SHUTDOWN_ACK_SENT] >= 1 ? "yes" : "no"); } EODTRACE status=$? 
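
Both tests now drive the association with the small Perl programs embedded above; server.pl accepts a single association, drains it, and closes. As an aside for C readers, a rough analogue of that listener using the one-to-one SCTP socket API is sketched below. The function name sctp_serve_once is hypothetical and the code is illustrative, not part of this change:

/*
 * Rough C analogue of server.pl: accept one one-to-one SCTP association
 * on 127.0.0.1:port, drain it, and close, which produces the teardown
 * packets the D script counts.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int
sctp_serve_once(uint16_t port)
{
	struct sockaddr_in sin;
	char buf[256];
	int l, c, on = 1;

	l = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	if (l == -1)
		return (-1);
	(void)setsockopt(l, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = inet_addr("127.0.0.1");
	if (bind(l, (struct sockaddr *)&sin, sizeof(sin)) != 0 || listen(l, 1) != 0) {
		close(l);
		return (-1);
	}
	c = accept(l, NULL, NULL);		/* block until the client associates */
	close(l);
	if (c == -1)
		return (-1);
	while (read(c, buf, sizeof(buf)) > 0)	/* drain, as "while (<$c>) {}" does */
		;
	close(c);				/* triggers the SHUTDOWN sequence the test expects */
	return (0);
}
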
cd / /bin/rm -rf $DIR exit $status Index: user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh.out =================================================================== --- user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh.out (revision 346925) +++ user/ngie/bug-237403/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/ip/tst.localsctpstate.ksh.out (revision 346926) @@ -1,12 +1,13 @@ Minimum SCTP events seen ip:::send - yes ip:::receive - yes sctp:::send - yes sctp:::receive - yes sctp:::state-change to cookie-wait - yes sctp:::state-change to cookie-echoed - yes sctp:::state-change to established - yes sctp:::state-change to shutdown-sent - yes sctp:::state-change to shutdown-received - yes sctp:::state-change to shutdown-ack-sent - yes + Index: user/ngie/bug-237403/cddl/contrib/opensolaris =================================================================== --- user/ngie/bug-237403/cddl/contrib/opensolaris (revision 346925) +++ user/ngie/bug-237403/cddl/contrib/opensolaris (revision 346926) Property changes on: user/ngie/bug-237403/cddl/contrib/opensolaris ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/cddl/contrib/opensolaris:r346444-346925 Index: user/ngie/bug-237403/cddl =================================================================== --- user/ngie/bug-237403/cddl (revision 346925) +++ user/ngie/bug-237403/cddl (revision 346926) Property changes on: user/ngie/bug-237403/cddl ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/cddl:r346444-346925 Index: user/ngie/bug-237403/etc/mtree/BSD.sendmail.dist =================================================================== --- user/ngie/bug-237403/etc/mtree/BSD.sendmail.dist (revision 346925) +++ user/ngie/bug-237403/etc/mtree/BSD.sendmail.dist (revision 346926) @@ -1,14 +1,14 @@ # $FreeBSD$ # # Please see the file src/etc/mtree/README before making changes to this file. # /set type=dir uname=root gname=wheel mode=0755 . nochange var nochange - spool nochange + spool nochange tags=package=runtime clientmqueue uname=smmsp gname=smmsp mode=0770 .. .. .. .. Index: user/ngie/bug-237403/etc/mtree/BSD.usr.dist =================================================================== --- user/ngie/bug-237403/etc/mtree/BSD.usr.dist (revision 346925) +++ user/ngie/bug-237403/etc/mtree/BSD.usr.dist (revision 346926) @@ -1,1254 +1,1254 @@ # $FreeBSD$ # # Please see the file src/etc/mtree/README before making changes to this file. # /set type=dir uname=root gname=wheel mode=0755 . bin .. include private bsdstat .. event .. gmock internal custom .. .. .. gtest internal custom .. .. .. sqlite3 .. ucl .. zstd .. .. .. lib aout .. clang 8.0.0 include sanitizer .. .. lib freebsd .. .. .. .. compat aout .. .. dtrace .. engines .. i18n .. libxo encoder .. .. .. libdata gcc .. ldscripts .. pkgconfig .. .. libexec bsdconfig 020.docsinstall include .. .. 030.packages include .. .. 040.password include .. .. 050.diskmgmt include .. .. 070.usermgmt include .. .. 080.console include .. .. 090.timezone include .. .. 110.mouse include .. .. 120.networking include .. .. 130.security include .. .. 140.startup include .. .. 150.ttys include .. .. dot include .. .. include .. includes include .. .. .. bsdinstall .. dwatch .. hyperv .. lpr ru .. .. sendmail .. sm.bin .. .. local .. obj nochange .. sbin .. share atf .. bsdconfig media .. 
networking .. packages .. password .. startup .. timezone .. usermgmt .. .. calendar de_AT.ISO_8859-15 .. de_DE.ISO8859-1 .. fr_FR.ISO8859-1 .. hr_HR.ISO8859-2 .. hu_HU.ISO8859-2 .. pt_BR.ISO8859-1 .. pt_BR.UTF-8 .. ru_RU.KOI8-R .. ru_RU.UTF-8 .. uk_UA.KOI8-U .. .. dict .. doc IPv6 .. atf .. legal .. llvm clang .. .. ncurses .. ntp drivers icons .. scripts .. .. hints .. icons .. pic .. scripts .. .. pjdfstest .. .. dtrace .. examples BSD_daemon .. FreeBSD_version .. IPv6 .. bhyve .. bootforth .. bsdconfig .. csh .. diskless .. dma .. drivers .. dwatch .. etc defaults .. .. find_interface .. hast .. hostapd .. indent .. ipfilter .. ipfw .. jails .. kld cdev module .. test .. .. dyn_sysctl .. firmware fwconsumer .. fwimage .. .. khelp .. syscall module .. test .. .. .. libusb20 .. libvgl .. mdoc .. netgraph .. pc-sysinstall .. perfmon .. pf .. ppi .. ppp .. printing .. scsi_target .. ses getencstat .. sesd .. setencstat .. setobjstat .. srcs .. .. smbfs print .. .. sunrpc dir .. msg .. sort .. .. tcsh .. uefisign .. ypldap .. .. firmware .. games fortune .. .. i18n csmapper APPLE .. AST .. BIG5 .. CNS .. CP .. EBCDIC .. GB .. GEORGIAN .. ISO-8859 .. ISO646 .. JIS .. KAZAKH .. KOI .. KS .. MISC .. TCVN .. .. esdb APPLE .. AST .. BIG5 .. CP .. DEC .. EBCDIC .. EUC .. GB .. GEORGIAN .. ISO-2022 .. ISO-8859 .. ISO646 .. KAZAKH .. KOI .. MISC .. TCVN .. UTF .. .. .. keys pkg - revoked + revoked tags=package=runtime .. - trusted + trusted tags=package=runtime .. .. .. locale af_ZA.ISO8859-1 .. af_ZA.ISO8859-15 .. af_ZA.UTF-8 .. ar_AE.UTF-8 .. ar_EG.UTF-8 .. ar_JO.UTF-8 .. ar_MA.UTF-8 .. ar_QA.UTF-8 .. ar_SA.UTF-8 .. am_ET.UTF-8 .. be_BY.CP1131 .. be_BY.CP1251 .. be_BY.ISO8859-5 .. be_BY.UTF-8 .. bg_BG.CP1251 .. bg_BG.UTF-8 .. ca_AD.ISO8859-1 .. ca_AD.ISO8859-15 .. ca_ES.ISO8859-1 .. ca_ES.ISO8859-15 .. ca_FR.ISO8859-1 .. ca_FR.ISO8859-15 .. ca_IT.ISO8859-1 .. ca_IT.ISO8859-15 .. ca_AD.UTF-8 .. ca_ES.UTF-8 .. ca_FR.UTF-8 .. ca_IT.UTF-8 .. cs_CZ.ISO8859-2 .. cs_CZ.UTF-8 .. da_DK.ISO8859-1 .. da_DK.ISO8859-15 .. da_DK.UTF-8 .. de_AT.ISO8859-1 .. de_AT.ISO8859-15 .. de_AT.UTF-8 .. de_CH.ISO8859-1 .. de_CH.ISO8859-15 .. de_CH.UTF-8 .. de_DE.ISO8859-1 .. de_DE.ISO8859-15 .. de_DE.UTF-8 .. el_GR.ISO8859-7 .. el_GR.UTF-8 .. en_AU.ISO8859-1 .. en_AU.ISO8859-15 .. en_AU.US-ASCII .. en_AU.UTF-8 .. en_CA.ISO8859-1 .. en_CA.ISO8859-15 .. en_CA.US-ASCII .. en_CA.UTF-8 .. en_GB.ISO8859-1 .. en_GB.ISO8859-15 .. en_GB.US-ASCII .. en_GB.UTF-8 .. en_HK.ISO8859-1 .. en_HK.UTF-8 .. en_IE.ISO8859-1 .. en_IE.ISO8859-15 .. en_IE.UTF-8 .. en_NZ.ISO8859-1 .. en_NZ.ISO8859-15 .. en_NZ.US-ASCII .. en_NZ.UTF-8 .. en_PH.UTF-8 .. en_SG.ISO8859-1 .. en_SG.UTF-8 .. en_US.ISO8859-1 .. en_US.ISO8859-15 .. en_US.US-ASCII .. en_US.UTF-8 .. en_ZA.ISO8859-1 .. en_ZA.ISO8859-15 .. en_ZA.US-ASCII .. en_ZA.UTF-8 .. es_AR.ISO8859-1 .. es_AR.UTF-8 .. es_CR.UTF-8 .. es_ES.ISO8859-1 .. es_ES.ISO8859-15 .. es_ES.UTF-8 .. es_MX.ISO8859-1 .. es_MX.UTF-8 .. et_EE.ISO8859-1 .. et_EE.ISO8859-15 .. et_EE.UTF-8 .. eu_ES.ISO8859-1 .. eu_ES.ISO8859-15 .. eu_ES.UTF-8 .. fi_FI.ISO8859-1 .. fi_FI.ISO8859-15 .. fi_FI.UTF-8 .. fr_BE.ISO8859-1 .. fr_BE.ISO8859-15 .. fr_BE.UTF-8 .. fr_CA.ISO8859-1 .. fr_CA.ISO8859-15 .. fr_CA.UTF-8 .. fr_CH.ISO8859-1 .. fr_CH.ISO8859-15 .. fr_CH.UTF-8 .. fr_FR.ISO8859-1 .. fr_FR.ISO8859-15 .. fr_FR.UTF-8 .. ga_IE.UTF-8 .. he_IL.UTF-8 .. hi_IN.ISCII-DEV .. hi_IN.UTF-8 .. hr_HR.ISO8859-2 .. hr_HR.UTF-8 .. hu_HU.ISO8859-2 .. hu_HU.UTF-8 .. hy_AM.ARMSCII-8 .. hy_AM.UTF-8 .. is_IS.ISO8859-1 .. is_IS.ISO8859-15 .. is_IS.UTF-8 .. 
it_CH.ISO8859-1 .. it_CH.ISO8859-15 .. it_CH.UTF-8 .. it_IT.ISO8859-1 .. it_IT.ISO8859-15 .. it_IT.UTF-8 .. ja_JP.SJIS .. ja_JP.UTF-8 .. ja_JP.eucJP .. kk_KZ.UTF-8 .. ko_KR.CP949 .. ko_KR.UTF-8 .. ko_KR.eucKR .. lt_LT.ISO8859-13 .. lt_LT.UTF-8 .. lv_LV.ISO8859-13 .. lv_LV.UTF-8 .. mn_MN.UTF-8 .. nb_NO.ISO8859-1 .. nb_NO.ISO8859-15 .. nb_NO.UTF-8 .. nl_BE.ISO8859-1 .. nl_BE.ISO8859-15 .. nl_BE.UTF-8 .. nl_NL.ISO8859-1 .. nl_NL.ISO8859-15 .. nl_NL.UTF-8 .. nn_NO.ISO8859-1 .. nn_NO.ISO8859-15 .. nn_NO.UTF-8 .. pl_PL.ISO8859-2 .. pl_PL.UTF-8 .. pt_BR.ISO8859-1 .. pt_BR.UTF-8 .. pt_PT.ISO8859-1 .. pt_PT.ISO8859-15 .. pt_PT.UTF-8 .. ro_RO.ISO8859-2 .. ro_RO.UTF-8 .. ru_RU.CP1251 .. ru_RU.CP866 .. ru_RU.ISO8859-5 .. ru_RU.KOI8-R .. ru_RU.UTF-8 .. se_FI.UTF-8 .. se_NO.UTF-8 .. sk_SK.ISO8859-2 .. sk_SK.UTF-8 .. sl_SI.ISO8859-2 .. sl_SI.UTF-8 .. sr_RS.ISO8859-5 .. sr_RS.UTF-8 .. sr_RS.ISO8859-2 .. sr_RS.UTF-8@latin .. sv_FI.ISO8859-1 .. sv_FI.ISO8859-15 .. sv_FI.UTF-8 .. sv_SE.ISO8859-1 .. sv_SE.ISO8859-15 .. sv_SE.UTF-8 .. tr_TR.ISO8859-9 .. tr_TR.UTF-8 .. uk_UA.CP1251 .. uk_UA.ISO8859-5 .. uk_UA.KOI8-U .. uk_UA.UTF-8 .. zh_CN.GB18030 .. zh_CN.GB2312 .. zh_CN.GBK .. zh_CN.eucCN .. zh_CN.UTF-8 .. zh_HK.UTF-8 .. zh_TW.Big5 .. zh_TW.UTF-8 .. .. man man1 .. man2 .. man3 .. man4 aarch64 .. amd64 .. arm .. i386 .. powerpc .. sparc64 .. .. man5 .. man6 .. man7 .. man8 amd64 .. i386 .. powerpc .. sparc64 .. .. man9 .. .. misc fonts .. .. mk .. nls C .. af_ZA.ISO8859-1 .. af_ZA.ISO8859-15 .. af_ZA.UTF-8 .. am_ET.UTF-8 .. be_BY.CP1131 .. be_BY.CP1251 .. be_BY.ISO8859-5 .. be_BY.UTF-8 .. bg_BG.CP1251 .. bg_BG.UTF-8 .. ca_ES.ISO8859-1 .. ca_ES.ISO8859-15 .. ca_ES.UTF-8 .. cs_CZ.ISO8859-2 .. cs_CZ.UTF-8 .. da_DK.ISO8859-1 .. da_DK.ISO8859-15 .. da_DK.UTF-8 .. de_AT.ISO8859-1 .. de_AT.ISO8859-15 .. de_AT.UTF-8 .. de_CH.ISO8859-1 .. de_CH.ISO8859-15 .. de_CH.UTF-8 .. de_DE.ISO8859-1 .. de_DE.ISO8859-15 .. de_DE.UTF-8 .. el_GR.ISO8859-7 .. el_GR.UTF-8 .. en_AU.ISO8859-1 .. en_AU.ISO8859-15 .. en_AU.US-ASCII .. en_AU.UTF-8 .. en_CA.ISO8859-1 .. en_CA.ISO8859-15 .. en_CA.US-ASCII .. en_CA.UTF-8 .. en_GB.ISO8859-1 .. en_GB.ISO8859-15 .. en_GB.US-ASCII .. en_GB.UTF-8 .. en_IE.UTF-8 .. en_NZ.ISO8859-1 .. en_NZ.ISO8859-15 .. en_NZ.US-ASCII .. en_NZ.UTF-8 .. en_US.ISO8859-1 .. en_US.ISO8859-15 .. en_US.UTF-8 .. es_ES.ISO8859-1 .. es_ES.ISO8859-15 .. es_ES.UTF-8 .. et_EE.ISO8859-15 .. et_EE.UTF-8 .. fi_FI.ISO8859-1 .. fi_FI.ISO8859-15 .. fi_FI.UTF-8 .. fr_BE.ISO8859-1 .. fr_BE.ISO8859-15 .. fr_BE.UTF-8 .. fr_CA.ISO8859-1 .. fr_CA.ISO8859-15 .. fr_CA.UTF-8 .. fr_CH.ISO8859-1 .. fr_CH.ISO8859-15 .. fr_CH.UTF-8 .. fr_FR.ISO8859-1 .. fr_FR.ISO8859-15 .. fr_FR.UTF-8 .. gl_ES.ISO8859-1 .. he_IL.UTF-8 .. hi_IN.ISCII-DEV .. hr_HR.ISO8859-2 .. hr_HR.UTF-8 .. hu_HU.ISO8859-2 .. hu_HU.UTF-8 .. hy_AM.ARMSCII-8 .. hy_AM.UTF-8 .. is_IS.ISO8859-1 .. is_IS.ISO8859-15 .. is_IS.UTF-8 .. it_CH.ISO8859-1 .. it_CH.ISO8859-15 .. it_CH.UTF-8 .. it_IT.ISO8859-1 .. it_IT.ISO8859-15 .. it_IT.UTF-8 .. ja_JP.SJIS .. ja_JP.UTF-8 .. ja_JP.eucJP .. kk_KZ.PT154 .. kk_KZ.UTF-8 .. ko_KR.CP949 .. ko_KR.UTF-8 .. ko_KR.eucKR .. lt_LT.ISO8859-13 .. lt_LT.UTF-8 .. lv_LV.ISO8859-13 .. lv_LV.UTF-8 .. mn_MN.UTF-8 .. nl_BE.ISO8859-1 .. nl_BE.ISO8859-15 .. nl_BE.UTF-8 .. nl_NL.ISO8859-1 .. nl_NL.ISO8859-15 .. nl_NL.UTF-8 .. no_NO.ISO8859-1 .. no_NO.ISO8859-15 .. no_NO.UTF-8 .. pl_PL.ISO8859-2 .. pl_PL.UTF-8 .. pt_BR.ISO8859-1 .. pt_BR.UTF-8 .. pt_PT.ISO8859-1 .. pt_PT.ISO8859-15 .. pt_PT.UTF-8 .. ro_RO.ISO8859-2 .. ro_RO.UTF-8 .. ru_RU.CP1251 .. ru_RU.CP866 .. 
ru_RU.ISO8859-5 .. ru_RU.KOI8-R .. ru_RU.UTF-8 .. sk_SK.ISO8859-2 .. sk_SK.UTF-8 .. sl_SI.ISO8859-2 .. sl_SI.UTF-8 .. sr_YU.ISO8859-2 .. sr_YU.ISO8859-5 .. sr_YU.UTF-8 .. sv_SE.ISO8859-1 .. sv_SE.ISO8859-15 .. sv_SE.UTF-8 .. tr_TR.ISO8859-9 .. tr_TR.UTF-8 .. uk_UA.ISO8859-5 .. uk_UA.KOI8-U .. uk_UA.UTF-8 .. zh_CN.GB18030 .. zh_CN.GB2312 .. zh_CN.GBK .. zh_CN.UTF-8 .. zh_CN.eucCN .. zh_HK.UTF-8 .. zh_TW.UTF-8 .. .. openssl man man1 .. man3 .. .. .. pc-sysinstall backend .. backend-partmanager .. backend-query .. conf license .. .. doc .. .. security .. sendmail .. skel .. snmp defs .. mibs .. .. syscons fonts .. keymaps .. scrnmaps .. .. tabset .. vi catalog .. .. vt fonts .. keymaps .. .. zoneinfo Africa .. America Argentina .. Indiana .. Kentucky .. North_Dakota .. .. Antarctica .. Arctic .. Asia .. Atlantic .. Australia .. Etc .. Europe .. Indian .. Pacific .. SystemV .. .. .. src nochange .. .. Index: user/ngie/bug-237403/etc/mtree/BSD.var.dist =================================================================== --- user/ngie/bug-237403/etc/mtree/BSD.var.dist (revision 346925) +++ user/ngie/bug-237403/etc/mtree/BSD.var.dist (revision 346926) @@ -1,114 +1,114 @@ # $FreeBSD$ # # Please see the file src/etc/mtree/README before making changes to this file. # /set type=dir uname=root gname=wheel mode=0755 . account .. at /set uname=daemon jobs tags=package=at .. spool tags=package=at .. /set uname=root .. /set mode=0750 /set gname=audit audit dist uname=auditdistd gname=audit mode=0770 .. remote uname=auditdistd gname=wheel mode=0700 .. .. authpf uname=root gname=authpf mode=0770 .. /set gname=wheel backups .. cache mode=0755 .. crash .. - cron + cron tags=package=runtime tabs mode=0700 .. .. /set mode=0755 db entropy uname=operator gname=operator mode=0700 .. freebsd-update mode=0700 .. hyperv mode=0700 .. ipf mode=0700 .. ntp uname=ntpd gname=ntpd .. pkg .. ports .. portsnap .. zfsd cases .. .. .. - empty mode=0555 flags=schg + empty mode=0555 flags=schg tags=package=runtime .. games gname=games mode=0775 .. heimdal mode=0700 .. - log + log tags=package=runtime .. - mail gname=mail mode=0775 + mail gname=mail mode=0775 tags=package=runtime .. msgs uname=daemon .. preserve .. - run + run tags=package=runtime dhclient .. ppp gname=network mode=0770 .. wpa_supplicant .. .. rwho gname=daemon mode=0775 .. spool dma uname=root gname=mail mode=0770 .. lock uname=uucp gname=dialer mode=0775 .. /set gname=daemon lpd .. mqueue .. opielocks mode=0700 .. output lpd .. .. /set gname=wheel .. - tmp mode=01777 + tmp mode=01777 tags=package=runtime vi.recover mode=01777 .. .. unbound uname=unbound gname=unbound mode=0755 tags=package=unbound conf.d uname=unbound gname=unbound mode=0755 tags=package=unbound .. .. yp .. .. 
Index: user/ngie/bug-237403/lib/libbe/Makefile =================================================================== --- user/ngie/bug-237403/lib/libbe/Makefile (revision 346925) +++ user/ngie/bug-237403/lib/libbe/Makefile (revision 346926) @@ -1,36 +1,37 @@ # $FreeBSD$ +SHLIBDIR?= /lib + .include PACKAGE= lib${LIB} LIB= be -SHLIBDIR?= /lib SHLIB_MAJOR= 1 SHLIB_MINOR= 0 SRCS= be.c be_access.c be_error.c be_info.c INCS= be.h MAN= libbe.3 WARNS?= 2 IGNORE_PRAGMA= yes LIBADD+= zfs LIBADD+= nvpair CFLAGS+= -I${SRCTOP}/cddl/contrib/opensolaris/lib/libzfs/common CFLAGS+= -I${SRCTOP}/sys/cddl/compat/opensolaris CFLAGS+= -I${SRCTOP}/cddl/compat/opensolaris/include CFLAGS+= -I${SRCTOP}/cddl/compat/opensolaris/lib/libumem CFLAGS+= -I${SRCTOP}/cddl/contrib/opensolaris/lib/libzpool/common CFLAGS+= -I${SRCTOP}/sys/cddl/contrib/opensolaris/common/zfs CFLAGS+= -I${SRCTOP}/sys/cddl/contrib/opensolaris/uts/common/fs/zfs CFLAGS+= -I${SRCTOP}/sys/cddl/contrib/opensolaris/uts/common CFLAGS+= -I${SRCTOP}/cddl/contrib/opensolaris/head CFLAGS+= -DNEED_SOLARIS_BOOLEAN HAS_TESTS= YES SUBDIR.${MK_TESTS}+= tests .include Index: user/ngie/bug-237403/lib/libbe/be.c =================================================================== --- user/ngie/bug-237403/lib/libbe/be.c (revision 346925) +++ user/ngie/bug-237403/lib/libbe/be.c (revision 346926) @@ -1,1093 +1,1097 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2017 Kyle J. Kneitinger * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include "be.h" #include "be_impl.h" struct be_destroy_data { libbe_handle_t *lbh; char *snapname; }; #if SOON static int be_create_child_noent(libbe_handle_t *lbh, const char *active, const char *child_path); static int be_create_child_cloned(libbe_handle_t *lbh, const char *active); #endif /* Arbitrary... should tune */ #define BE_SNAP_SERIAL_MAX 1024 /* * Iterator function for locating the rootfs amongst the children of the * zfs_be_root set by loader(8). data is expected to be a libbe_handle_t *. 
*/ static int be_locate_rootfs(libbe_handle_t *lbh) { struct statfs sfs; struct extmnttab entry; zfs_handle_t *zfs; /* * Check first if root is ZFS; if not, we'll bail on rootfs capture. * Unfortunately needed because zfs_path_to_zhandle will emit to * stderr if / isn't actually a ZFS filesystem, which we'd like * to avoid. */ if (statfs("/", &sfs) == 0) { statfs2mnttab(&sfs, &entry); if (strcmp(entry.mnt_fstype, MNTTYPE_ZFS) != 0) return (1); } else return (1); zfs = zfs_path_to_zhandle(lbh->lzh, "/", ZFS_TYPE_FILESYSTEM); if (zfs == NULL) return (1); strlcpy(lbh->rootfs, zfs_get_name(zfs), sizeof(lbh->rootfs)); zfs_close(zfs); return (0); } /* * Initializes the libbe context to operate in the root boot environment * dataset, for example, zroot/ROOT. */ libbe_handle_t * libbe_init(const char *root) { char altroot[MAXPATHLEN]; libbe_handle_t *lbh; char *poolname, *pos; int pnamelen; lbh = NULL; poolname = pos = NULL; if ((lbh = calloc(1, sizeof(libbe_handle_t))) == NULL) goto err; if ((lbh->lzh = libzfs_init()) == NULL) goto err; /* * Grab rootfs, we'll work backwards from there if an optional BE root * has not been passed in. */ if (be_locate_rootfs(lbh) != 0) { if (root == NULL) goto err; *lbh->rootfs = '\0'; } if (root == NULL) { /* Strip off the final slash from rootfs to get the be root */ strlcpy(lbh->root, lbh->rootfs, sizeof(lbh->root)); pos = strrchr(lbh->root, '/'); if (pos == NULL) goto err; *pos = '\0'; } else strlcpy(lbh->root, root, sizeof(lbh->root)); if ((pos = strchr(lbh->root, '/')) == NULL) goto err; pnamelen = pos - lbh->root; poolname = malloc(pnamelen + 1); if (poolname == NULL) goto err; strlcpy(poolname, lbh->root, pnamelen + 1); if ((lbh->active_phandle = zpool_open(lbh->lzh, poolname)) == NULL) goto err; free(poolname); poolname = NULL; if (zpool_get_prop(lbh->active_phandle, ZPOOL_PROP_BOOTFS, lbh->bootfs, sizeof(lbh->bootfs), NULL, true) != 0) goto err; if (zpool_get_prop(lbh->active_phandle, ZPOOL_PROP_ALTROOT, altroot, sizeof(altroot), NULL, true) == 0 && strcmp(altroot, "-") != 0) lbh->altroot_len = strlen(altroot); return (lbh); err: if (lbh != NULL) { if (lbh->active_phandle != NULL) zpool_close(lbh->active_phandle); if (lbh->lzh != NULL) libzfs_fini(lbh->lzh); free(lbh); } free(poolname); return (NULL); } /* * Free memory allocated by libbe_init() */ void libbe_close(libbe_handle_t *lbh) { if (lbh->active_phandle != NULL) zpool_close(lbh->active_phandle); libzfs_fini(lbh->lzh); free(lbh); } /* * Proxy through to libzfs for the moment. */ void be_nicenum(uint64_t num, char *buf, size_t buflen) { zfs_nicenum(num, buf, buflen); } static int be_destroy_cb(zfs_handle_t *zfs_hdl, void *data) { char path[BE_MAXPATHLEN]; struct be_destroy_data *bdd; zfs_handle_t *snap; int err; bdd = (struct be_destroy_data *)data; if (bdd->snapname == NULL) { err = zfs_iter_children(zfs_hdl, be_destroy_cb, data); if (err != 0) return (err); return (zfs_destroy(zfs_hdl, false)); } /* If we're dealing with snapshots instead, delete that one alone */ err = zfs_iter_filesystems(zfs_hdl, be_destroy_cb, data); if (err != 0) return (err); /* * This part is intentionally glossing over any potential errors, * because there's a lot less potential for errors when we're cleaning * up snapshots rather than a full deep BE. The primary error case * here being if the snapshot doesn't exist in the first place, which * the caller will likely deem insignificant as long as it doesn't * exist after the call. Thus, such a missing snapshot shouldn't jam * up the destruction. 
*/ snprintf(path, sizeof(path), "%s@%s", zfs_get_name(zfs_hdl), bdd->snapname); if (!zfs_dataset_exists(bdd->lbh->lzh, path, ZFS_TYPE_SNAPSHOT)) return (0); snap = zfs_open(bdd->lbh->lzh, path, ZFS_TYPE_SNAPSHOT); if (snap != NULL) zfs_destroy(snap, false); return (0); } /* * Destroy the boot environment or snapshot specified by the name * parameter. Options are or'd together with the possible values: * BE_DESTROY_FORCE : forces operation on mounted datasets * BE_DESTROY_ORIGIN: destroy the origin snapshot as well */ int be_destroy(libbe_handle_t *lbh, const char *name, int options) { struct be_destroy_data bdd; char origin[BE_MAXPATHLEN], path[BE_MAXPATHLEN]; zfs_handle_t *fs; char *snapdelim; int err, force, mounted; size_t rootlen; bdd.lbh = lbh; bdd.snapname = NULL; force = options & BE_DESTROY_FORCE; *origin = '\0'; be_root_concat(lbh, name, path); if ((snapdelim = strchr(path, '@')) == NULL) { if (!zfs_dataset_exists(lbh->lzh, path, ZFS_TYPE_FILESYSTEM)) return (set_error(lbh, BE_ERR_NOENT)); if (strcmp(path, lbh->rootfs) == 0 || strcmp(path, lbh->bootfs) == 0) return (set_error(lbh, BE_ERR_DESTROYACT)); fs = zfs_open(lbh->lzh, path, ZFS_TYPE_FILESYSTEM); if (fs == NULL) return (set_error(lbh, BE_ERR_ZFSOPEN)); if ((options & BE_DESTROY_ORIGIN) != 0 && zfs_prop_get(fs, ZFS_PROP_ORIGIN, origin, sizeof(origin), NULL, NULL, 0, 1) != 0) return (set_error(lbh, BE_ERR_NOORIGIN)); /* Don't destroy a mounted dataset unless force is specified */ if ((mounted = zfs_is_mounted(fs, NULL)) != 0) { if (force) { zfs_unmount(fs, NULL, 0); } else { free(bdd.snapname); return (set_error(lbh, BE_ERR_DESTROYMNT)); } } } else { if (!zfs_dataset_exists(lbh->lzh, path, ZFS_TYPE_SNAPSHOT)) return (set_error(lbh, BE_ERR_NOENT)); bdd.snapname = strdup(snapdelim + 1); if (bdd.snapname == NULL) return (set_error(lbh, BE_ERR_NOMEM)); *snapdelim = '\0'; fs = zfs_open(lbh->lzh, path, ZFS_TYPE_DATASET); if (fs == NULL) { free(bdd.snapname); return (set_error(lbh, BE_ERR_ZFSOPEN)); } } err = be_destroy_cb(fs, &bdd); zfs_close(fs); free(bdd.snapname); if (err != 0) { /* Children are still present or the mount is referenced */ if (err == EBUSY) return (set_error(lbh, BE_ERR_DESTROYMNT)); return (set_error(lbh, BE_ERR_UNKNOWN)); } if ((options & BE_DESTROY_ORIGIN) == 0) return (0); /* The origin can't possibly be shorter than the BE root */ rootlen = strlen(lbh->root); if (*origin == '\0' || strlen(origin) <= rootlen + 1) return (set_error(lbh, BE_ERR_INVORIGIN)); /* * We'll be chopping off the BE root and running this back through * be_destroy, so that we properly handle the origin snapshot whether * it be that of a deep BE or not. */ if (strncmp(origin, lbh->root, rootlen) != 0 || origin[rootlen] != '/') return (0); return (be_destroy(lbh, origin + rootlen + 1, options & ~BE_DESTROY_ORIGIN)); } static void be_setup_snapshot_name(libbe_handle_t *lbh, char *buf, size_t buflen) { time_t rawtime; int len, serial; time(&rawtime); len = strlen(buf); len += strftime(buf + len, buflen - len, "@%F-%T", localtime(&rawtime)); /* No room for serial... 
caller will do its best */ if (buflen - len < 2) return; for (serial = 0; serial < BE_SNAP_SERIAL_MAX; ++serial) { snprintf(buf + len, buflen - len, "-%d", serial); if (!zfs_dataset_exists(lbh->lzh, buf, ZFS_TYPE_SNAPSHOT)) return; } } int be_snapshot(libbe_handle_t *lbh, const char *source, const char *snap_name, bool recursive, char *result) { char buf[BE_MAXPATHLEN]; int err; be_root_concat(lbh, source, buf); if ((err = be_exists(lbh, buf)) != 0) return (set_error(lbh, err)); if (snap_name != NULL) { if (strlcat(buf, "@", sizeof(buf)) >= sizeof(buf)) return (set_error(lbh, BE_ERR_INVALIDNAME)); if (strlcat(buf, snap_name, sizeof(buf)) >= sizeof(buf)) return (set_error(lbh, BE_ERR_INVALIDNAME)); if (result != NULL) snprintf(result, BE_MAXPATHLEN, "%s@%s", source, snap_name); } else { be_setup_snapshot_name(lbh, buf, sizeof(buf)); if (result != NULL && strlcpy(result, strrchr(buf, '/') + 1, sizeof(buf)) >= sizeof(buf)) return (set_error(lbh, BE_ERR_INVALIDNAME)); } if ((err = zfs_snapshot(lbh->lzh, buf, recursive, NULL)) != 0) { switch (err) { case EZFS_INVALIDNAME: return (set_error(lbh, BE_ERR_INVALIDNAME)); default: /* * The other errors that zfs_ioc_snapshot might return * shouldn't happen if we've set things up properly, so * we'll gloss over them and call it UNKNOWN as it will * require further triage. */ if (errno == ENOTSUP) return (set_error(lbh, BE_ERR_NOPOOL)); return (set_error(lbh, BE_ERR_UNKNOWN)); } } return (BE_ERR_SUCCESS); } /* * Create the boot environment specified by the name parameter */ int be_create(libbe_handle_t *lbh, const char *name) { int err; err = be_create_from_existing(lbh, name, be_active_path(lbh)); return (set_error(lbh, err)); } static int be_deep_clone_prop(int prop, void *cb) { int err; struct libbe_dccb *dccb; zprop_source_t src; char pval[BE_MAXPATHLEN]; char source[BE_MAXPATHLEN]; char *val; dccb = cb; /* Skip some properties we don't want to touch */ if (prop == ZFS_PROP_CANMOUNT) return (ZPROP_CONT); /* Don't copy readonly properties */ if (zfs_prop_readonly(prop)) return (ZPROP_CONT); if ((err = zfs_prop_get(dccb->zhp, prop, (char *)&pval, sizeof(pval), &src, (char *)&source, sizeof(source), false))) /* Just continue if we fail to read a property */ return (ZPROP_CONT); - /* Only copy locally defined properties */ - if (src != ZPROP_SRC_LOCAL) + /* + * Only copy locally defined or received properties. This continues + * to avoid temporary/default/local properties intentionally without + * breaking received datasets. + */ + if (src != ZPROP_SRC_LOCAL && src != ZPROP_SRC_RECEIVED) return (ZPROP_CONT); /* Augment mountpoint with altroot, if needed */ val = pval; if (prop == ZFS_PROP_MOUNTPOINT) val = be_mountpoint_augmented(dccb->lbh, val); nvlist_add_string(dccb->props, zfs_prop_to_name(prop), val); return (ZPROP_CONT); } /* * Return the corresponding boot environment path for a given * dataset path, the constructed path is placed in 'result'. * * example: say our new boot environment name is 'bootenv' and * the dataset path is 'zroot/ROOT/default/data/set'. * * result should produce: 'zroot/ROOT/bootenv/data/set' */ static int be_get_path(struct libbe_deep_clone *ldc, const char *dspath, char *result, int result_size) { char *pos; char *child_dataset; /* match the root path for the boot environments */ pos = strstr(dspath, ldc->lbh->root); /* no match, different pools? 
*/ if (pos == NULL) return (BE_ERR_BADPATH); /* root path of the new boot environment */ snprintf(result, result_size, "%s/%s", ldc->lbh->root, ldc->bename); /* gets us to the parent dataset, the +1 consumes a trailing slash */ pos += strlen(ldc->lbh->root) + 1; /* skip the parent dataset */ if ((child_dataset = strchr(pos, '/')) != NULL) strlcat(result, child_dataset, result_size); return (BE_ERR_SUCCESS); } static int be_clone_cb(zfs_handle_t *ds, void *data) { int err; char be_path[BE_MAXPATHLEN]; char snap_path[BE_MAXPATHLEN]; const char *dspath; zfs_handle_t *snap_hdl; nvlist_t *props; struct libbe_deep_clone *ldc; struct libbe_dccb dccb; ldc = (struct libbe_deep_clone *)data; dspath = zfs_get_name(ds); snprintf(snap_path, sizeof(snap_path), "%s@%s", dspath, ldc->snapname); /* construct the boot environment path from the dataset we're cloning */ if (be_get_path(ldc, dspath, be_path, sizeof(be_path)) != BE_ERR_SUCCESS) return (set_error(ldc->lbh, BE_ERR_UNKNOWN)); /* the dataset to be created (i.e. the boot environment) already exists */ if (zfs_dataset_exists(ldc->lbh->lzh, be_path, ZFS_TYPE_DATASET)) return (set_error(ldc->lbh, BE_ERR_EXISTS)); /* no snapshot found for this dataset, silently skip it */ if (!zfs_dataset_exists(ldc->lbh->lzh, snap_path, ZFS_TYPE_SNAPSHOT)) return (0); if ((snap_hdl = zfs_open(ldc->lbh->lzh, snap_path, ZFS_TYPE_SNAPSHOT)) == NULL) return (set_error(ldc->lbh, BE_ERR_ZFSOPEN)); nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP); nvlist_add_string(props, "canmount", "noauto"); dccb.lbh = ldc->lbh; dccb.zhp = ds; dccb.props = props; if (zprop_iter(be_deep_clone_prop, &dccb, B_FALSE, B_FALSE, ZFS_TYPE_FILESYSTEM) == ZPROP_INVAL) return (-1); if ((err = zfs_clone(snap_hdl, be_path, props)) != 0) return (set_error(ldc->lbh, BE_ERR_ZFSCLONE)); nvlist_free(props); zfs_close(snap_hdl); if (ldc->depth_limit == -1 || ldc->depth < ldc->depth_limit) { ldc->depth++; err = zfs_iter_filesystems(ds, be_clone_cb, ldc); ldc->depth--; } return (set_error(ldc->lbh, err)); } /* * Create a boot environment with a given name from a given snapshot. * Snapshots can be in the format 'zroot/ROOT/default@snapshot' or * 'default@snapshot'. In the latter case, 'default@snapshot' will be prepended * with the root path that libbe was initailized with. */ static int be_clone(libbe_handle_t *lbh, const char *bename, const char *snapshot, int depth) { int err; char snap_path[BE_MAXPATHLEN]; char *parentname, *snapname; zfs_handle_t *parent_hdl; struct libbe_deep_clone ldc; /* ensure the boot environment name is valid */ if ((err = be_validate_name(lbh, bename)) != 0) return (set_error(lbh, err)); /* * prepend the boot environment root path if we're * given a partial snapshot name. 
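 *
 * e.g. a hypothetical "default@snap" becomes "zroot/ROOT/default@snap"
 * when the library root is "zroot/ROOT"; a fully qualified snapshot
 * path is passed through unchanged.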
*/ if ((err = be_root_concat(lbh, snapshot, snap_path)) != 0) return (set_error(lbh, err)); /* ensure the snapshot exists */ if ((err = be_validate_snap(lbh, snap_path)) != 0) return (set_error(lbh, err)); /* get a copy of the snapshot path so we can disect it */ if ((parentname = strdup(snap_path)) == NULL) return (set_error(lbh, BE_ERR_UNKNOWN)); /* split dataset name from snapshot name */ snapname = strchr(parentname, '@'); if (snapname == NULL) { free(parentname); return (set_error(lbh, BE_ERR_UNKNOWN)); } *snapname = '\0'; snapname++; /* set-up the boot environment */ ldc.lbh = lbh; ldc.bename = bename; ldc.snapname = snapname; ldc.depth = 0; ldc.depth_limit = depth; /* the boot environment will be cloned from this dataset */ parent_hdl = zfs_open(lbh->lzh, parentname, ZFS_TYPE_DATASET); /* create the boot environment */ err = be_clone_cb(parent_hdl, &ldc); free(parentname); return (set_error(lbh, err)); } /* * Create a boot environment from pre-existing snapshot, specifying a depth. */ int be_create_depth(libbe_handle_t *lbh, const char *bename, const char *snap, int depth) { return (be_clone(lbh, bename, snap, depth)); } /* * Create the boot environment from pre-existing snapshot */ int be_create_from_existing_snap(libbe_handle_t *lbh, const char *bename, const char *snap) { return (be_clone(lbh, bename, snap, -1)); } /* * Create a boot environment from an existing boot environment */ int be_create_from_existing(libbe_handle_t *lbh, const char *bename, const char *old) { int err; char snap[BE_MAXPATHLEN]; if ((err = be_snapshot(lbh, old, NULL, true, snap)) != 0) return (set_error(lbh, err)); err = be_clone(lbh, bename, snap, -1); return (set_error(lbh, err)); } /* * Verifies that a snapshot has a valid name, exists, and has a mountpoint of * '/'. Returns BE_ERR_SUCCESS (0), upon success, or the relevant BE_ERR_* upon * failure. Does not set the internal library error state. */ int be_validate_snap(libbe_handle_t *lbh, const char *snap_name) { if (strlen(snap_name) >= BE_MAXPATHLEN) return (BE_ERR_PATHLEN); if (!zfs_name_valid(snap_name, ZFS_TYPE_SNAPSHOT)) return (BE_ERR_INVALIDNAME); if (!zfs_dataset_exists(lbh->lzh, snap_name, ZFS_TYPE_SNAPSHOT)) return (BE_ERR_NOENT); return (BE_ERR_SUCCESS); } /* * Idempotently appends the name argument to the root boot environment path * and copies the resulting string into the result buffer (which is assumed * to be at least BE_MAXPATHLEN characters long. Returns BE_ERR_SUCCESS upon * success, BE_ERR_PATHLEN if the resulting path is longer than BE_MAXPATHLEN, * or BE_ERR_INVALIDNAME if the name is a path that does not begin with * zfs_be_root. Does not set internal library error state. */ int be_root_concat(libbe_handle_t *lbh, const char *name, char *result) { size_t name_len, root_len; name_len = strlen(name); root_len = strlen(lbh->root); /* Act idempotently; return be name if it is already a full path */ if (strrchr(name, '/') != NULL) { if (strstr(name, lbh->root) != name) return (BE_ERR_INVALIDNAME); if (name_len >= BE_MAXPATHLEN) return (BE_ERR_PATHLEN); strlcpy(result, name, BE_MAXPATHLEN); return (BE_ERR_SUCCESS); } else if (name_len + root_len + 1 < BE_MAXPATHLEN) { snprintf(result, BE_MAXPATHLEN, "%s/%s", lbh->root, name); return (BE_ERR_SUCCESS); } return (BE_ERR_PATHLEN); } /* * Verifies the validity of a boot environment name (A-Za-z0-9-_.). Returns * BE_ERR_SUCCESS (0) if name is valid, otherwise returns BE_ERR_INVALIDNAME * or BE_ERR_PATHLEN. * Does not set internal library error state. 
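 *
 * For example, a hypothetical "11.2-p4_backup" satisfies the documented
 * character set, while a snapshot-style name such as "default@backup"
 * does not; the additional length check below rejects names whose full
 * dataset path would exceed MAXNAMELEN.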
*/ int be_validate_name(libbe_handle_t *lbh, const char *name) { /* * Impose the additional restriction that the entire dataset name must * not exceed the maximum length of a dataset, i.e. MAXNAMELEN. */ if (strlen(lbh->root) + 1 + strlen(name) > MAXNAMELEN) return (BE_ERR_PATHLEN); if (!zfs_name_valid(name, ZFS_TYPE_DATASET)) return (BE_ERR_INVALIDNAME); return (BE_ERR_SUCCESS); } /* * usage */ int be_rename(libbe_handle_t *lbh, const char *old, const char *new) { char full_old[BE_MAXPATHLEN]; char full_new[BE_MAXPATHLEN]; zfs_handle_t *zfs_hdl; int err; /* * be_validate_name is documented not to set error state, so we should * do so here. */ if ((err = be_validate_name(lbh, new)) != 0) return (set_error(lbh, err)); if ((err = be_root_concat(lbh, old, full_old)) != 0) return (set_error(lbh, err)); if ((err = be_root_concat(lbh, new, full_new)) != 0) return (set_error(lbh, err)); if (!zfs_dataset_exists(lbh->lzh, full_old, ZFS_TYPE_DATASET)) return (set_error(lbh, BE_ERR_NOENT)); if (zfs_dataset_exists(lbh->lzh, full_new, ZFS_TYPE_DATASET)) return (set_error(lbh, BE_ERR_EXISTS)); if ((zfs_hdl = zfs_open(lbh->lzh, full_old, ZFS_TYPE_FILESYSTEM)) == NULL) return (set_error(lbh, BE_ERR_ZFSOPEN)); /* recurse, nounmount, forceunmount */ struct renameflags flags = { .nounmount = 1, }; err = zfs_rename(zfs_hdl, NULL, full_new, flags); zfs_close(zfs_hdl); if (err != 0) return (set_error(lbh, BE_ERR_UNKNOWN)); return (0); } int be_export(libbe_handle_t *lbh, const char *bootenv, int fd) { char snap_name[BE_MAXPATHLEN]; char buf[BE_MAXPATHLEN]; zfs_handle_t *zfs; int err; if ((err = be_snapshot(lbh, bootenv, NULL, true, snap_name)) != 0) /* Use the error set by be_snapshot */ return (err); be_root_concat(lbh, snap_name, buf); if ((zfs = zfs_open(lbh->lzh, buf, ZFS_TYPE_DATASET)) == NULL) return (set_error(lbh, BE_ERR_ZFSOPEN)); err = zfs_send_one(zfs, NULL, fd, 0); zfs_close(zfs); return (err); } int be_import(libbe_handle_t *lbh, const char *bootenv, int fd) { char buf[BE_MAXPATHLEN]; nvlist_t *props; zfs_handle_t *zfs; recvflags_t flags = { .nomount = 1 }; int err; be_root_concat(lbh, bootenv, buf); if ((err = zfs_receive(lbh->lzh, buf, NULL, &flags, fd, NULL)) != 0) { switch (err) { case EINVAL: return (set_error(lbh, BE_ERR_NOORIGIN)); case ENOENT: return (set_error(lbh, BE_ERR_NOENT)); case EIO: return (set_error(lbh, BE_ERR_IO)); default: return (set_error(lbh, BE_ERR_UNKNOWN)); } } if ((zfs = zfs_open(lbh->lzh, buf, ZFS_TYPE_FILESYSTEM)) == NULL) return (set_error(lbh, BE_ERR_ZFSOPEN)); nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP); nvlist_add_string(props, "canmount", "noauto"); nvlist_add_string(props, "mountpoint", "/"); err = zfs_prop_set_list(zfs, props); nvlist_free(props); zfs_close(zfs); if (err != 0) return (set_error(lbh, BE_ERR_UNKNOWN)); return (0); } #if SOON static int be_create_child_noent(libbe_handle_t *lbh, const char *active, const char *child_path) { nvlist_t *props; zfs_handle_t *zfs; int err; nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP); nvlist_add_string(props, "canmount", "noauto"); nvlist_add_string(props, "mountpoint", child_path); /* Create */ if ((err = zfs_create(lbh->lzh, active, ZFS_TYPE_DATASET, props)) != 0) { switch (err) { case EZFS_EXISTS: return (set_error(lbh, BE_ERR_EXISTS)); case EZFS_NOENT: return (set_error(lbh, BE_ERR_NOENT)); case EZFS_BADTYPE: case EZFS_BADVERSION: return (set_error(lbh, BE_ERR_NOPOOL)); case EZFS_BADPROP: default: /* We set something up wrong, probably... 
*/ return (set_error(lbh, BE_ERR_UNKNOWN)); } } nvlist_free(props); if ((zfs = zfs_open(lbh->lzh, active, ZFS_TYPE_DATASET)) == NULL) return (set_error(lbh, BE_ERR_ZFSOPEN)); /* Set props */ if ((err = zfs_prop_set(zfs, "canmount", "noauto")) != 0) { zfs_close(zfs); /* * Similar to other cases, this shouldn't fail unless we've * done something wrong. This is a new dataset that shouldn't * have been mounted anywhere between creation and now. */ if (err == EZFS_NOMEM) return (set_error(lbh, BE_ERR_NOMEM)); return (set_error(lbh, BE_ERR_UNKNOWN)); } zfs_close(zfs); return (BE_ERR_SUCCESS); } static int be_create_child_cloned(libbe_handle_t *lbh, const char *active) { char buf[BE_MAXPATHLEN], tmp[BE_MAXPATHLEN];; zfs_handle_t *zfs; int err; /* XXX TODO ? */ /* * Establish if the existing path is a zfs dataset or just * the subdirectory of one */ strlcpy(tmp, "tmp/be_snap.XXXXX", sizeof(tmp)); if (mktemp(tmp) == NULL) return (set_error(lbh, BE_ERR_UNKNOWN)); be_root_concat(lbh, tmp, buf); printf("Here %s?\n", buf); if ((err = zfs_snapshot(lbh->lzh, buf, false, NULL)) != 0) { switch (err) { case EZFS_INVALIDNAME: return (set_error(lbh, BE_ERR_INVALIDNAME)); default: /* * The other errors that zfs_ioc_snapshot might return * shouldn't happen if we've set things up properly, so * we'll gloss over them and call it UNKNOWN as it will * require further triage. */ if (errno == ENOTSUP) return (set_error(lbh, BE_ERR_NOPOOL)); return (set_error(lbh, BE_ERR_UNKNOWN)); } } /* Clone */ if ((zfs = zfs_open(lbh->lzh, buf, ZFS_TYPE_SNAPSHOT)) == NULL) return (BE_ERR_ZFSOPEN); if ((err = zfs_clone(zfs, active, NULL)) != 0) /* XXX TODO correct error */ return (set_error(lbh, BE_ERR_UNKNOWN)); /* set props */ zfs_close(zfs); return (BE_ERR_SUCCESS); } int be_add_child(libbe_handle_t *lbh, const char *child_path, bool cp_if_exists) { struct stat sb; char active[BE_MAXPATHLEN], buf[BE_MAXPATHLEN]; nvlist_t *props; const char *s; /* Require absolute paths */ if (*child_path != '/') return (set_error(lbh, BE_ERR_BADPATH)); strlcpy(active, be_active_path(lbh), BE_MAXPATHLEN); strcpy(buf, active); /* Create non-mountable parent dataset(s) */ s = child_path; for (char *p; (p = strchr(s+1, '/')) != NULL; s = p) { size_t len = p - s; strncat(buf, s, len); nvlist_alloc(&props, NV_UNIQUE_NAME, KM_SLEEP); nvlist_add_string(props, "canmount", "off"); nvlist_add_string(props, "mountpoint", "none"); zfs_create(lbh->lzh, buf, ZFS_TYPE_DATASET, props); nvlist_free(props); } /* Path does not exist as a descendent of / yet */ if (strlcat(active, child_path, BE_MAXPATHLEN) >= BE_MAXPATHLEN) return (set_error(lbh, BE_ERR_PATHLEN)); if (stat(child_path, &sb) != 0) { /* Verify that error is ENOENT */ if (errno != ENOENT) return (set_error(lbh, BE_ERR_UNKNOWN)); return (be_create_child_noent(lbh, active, child_path)); } else if (cp_if_exists) /* Path is already a descendent of / and should be copied */ return (be_create_child_cloned(lbh, active)); return (set_error(lbh, BE_ERR_EXISTS)); } #endif /* SOON */ static int be_set_nextboot(libbe_handle_t *lbh, nvlist_t *config, uint64_t pool_guid, const char *zfsdev) { nvlist_t **child; uint64_t vdev_guid; int c, children; if (nvlist_lookup_nvlist_array(config, ZPOOL_CONFIG_CHILDREN, &child, &children) == 0) { for (c = 0; c < children; ++c) if (be_set_nextboot(lbh, child[c], pool_guid, zfsdev) != 0) return (1); return (0); } if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_GUID, &vdev_guid) != 0) { return (1); } if (zpool_nextboot(lbh->lzh, pool_guid, vdev_guid, zfsdev) != 0) { 
perror("ZFS_IOC_NEXTBOOT failed"); return (1); } return (0); } /* * Deactivate old BE dataset; currently just sets canmount=noauto */ static int be_deactivate(libbe_handle_t *lbh, const char *ds) { zfs_handle_t *zfs; if ((zfs = zfs_open(lbh->lzh, ds, ZFS_TYPE_DATASET)) == NULL) return (1); if (zfs_prop_set(zfs, "canmount", "noauto") != 0) return (1); zfs_close(zfs); return (0); } int be_activate(libbe_handle_t *lbh, const char *bootenv, bool temporary) { char be_path[BE_MAXPATHLEN]; char buf[BE_MAXPATHLEN]; nvlist_t *config, *dsprops, *vdevs; char *origin; uint64_t pool_guid; zfs_handle_t *zhp; int err; be_root_concat(lbh, bootenv, be_path); /* Note: be_exists fails if mountpoint is not / */ if ((err = be_exists(lbh, be_path)) != 0) return (set_error(lbh, err)); if (temporary) { config = zpool_get_config(lbh->active_phandle, NULL); if (config == NULL) /* config should be fetchable... */ return (set_error(lbh, BE_ERR_UNKNOWN)); if (nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &pool_guid) != 0) /* Similarly, it shouldn't be possible */ return (set_error(lbh, BE_ERR_UNKNOWN)); /* Expected format according to zfsbootcfg(8) man */ snprintf(buf, sizeof(buf), "zfs:%s:", be_path); /* We have no config tree */ if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &vdevs) != 0) return (set_error(lbh, BE_ERR_NOPOOL)); return (be_set_nextboot(lbh, vdevs, pool_guid, buf)); } else { if (be_deactivate(lbh, lbh->bootfs) != 0) return (-1); /* Obtain bootenv zpool */ err = zpool_set_prop(lbh->active_phandle, "bootfs", be_path); if (err) return (-1); zhp = zfs_open(lbh->lzh, be_path, ZFS_TYPE_FILESYSTEM); if (zhp == NULL) return (-1); if (be_prop_list_alloc(&dsprops) != 0) return (-1); if (be_get_dataset_props(lbh, be_path, dsprops) != 0) { nvlist_free(dsprops); return (-1); } if (nvlist_lookup_string(dsprops, "origin", &origin) == 0) err = zfs_promote(zhp); nvlist_free(dsprops); zfs_close(zhp); if (err) return (-1); } return (BE_ERR_SUCCESS); } Index: user/ngie/bug-237403/lib/libc/gen/Makefile.inc =================================================================== --- user/ngie/bug-237403/lib/libc/gen/Makefile.inc (revision 346925) +++ user/ngie/bug-237403/lib/libc/gen/Makefile.inc (revision 346926) @@ -1,543 +1,545 @@ # @(#)Makefile.inc 8.6 (Berkeley) 5/4/95 # $FreeBSD$ # machine-independent gen sources .PATH: ${LIBC_SRCTOP}/${LIBC_ARCH}/gen ${LIBC_SRCTOP}/gen CONFS= shells SRCS+= __getosreldate.c \ __pthread_mutex_init_calloc_cb_stub.c \ __xuname.c \ _once_stub.c \ _pthread_stubs.c \ _rand48.c \ _spinlock_stub.c \ _thread_init.c \ alarm.c \ arc4random.c \ arc4random-compat.c \ arc4random_uniform.c \ assert.c \ auxv.c \ basename.c \ basename_compat.c \ cap_sandboxed.c \ check_utility_compat.c \ clock.c \ clock_getcpuclockid.c \ closedir.c \ confstr.c \ crypt.c \ ctermid.c \ daemon.c \ devname.c \ dirfd.c \ dirname.c \ dirname_compat.c \ disklabel.c \ dlfcn.c \ drand48.c \ dup3.c \ elf_utils.c \ erand48.c \ err.c \ errlst.c \ errno.c \ exec.c \ exect.c \ fdevname.c \ feature_present.c \ fmtcheck.c \ fmtmsg.c \ fnmatch.c \ fpclassify.c \ frexp.c \ fstab.c \ ftok.c \ fts.c \ ftw.c \ getbootfile.c \ getbsize.c \ getcap.c \ getcwd.c \ getdomainname.c \ getentropy.c \ getgrent.c \ getgrouplist.c \ gethostname.c \ getloadavg.c \ getlogin.c \ getmntinfo.c \ getnetgrent.c \ getosreldate.c \ getpagesize.c \ getpagesizes.c \ getpeereid.c \ getprogname.c \ getpwent.c \ getttyent.c \ getusershell.c \ getutxent.c \ getvfsbyname.c \ glob.c \ initgroups.c \ isatty.c \ isinf.c \ isnan.c \ jrand48.c \ lcong48.c \ 
libc_dlopen.c \ lockf.c \ lrand48.c \ mrand48.c \ nftw.c \ nice.c \ nlist.c \ nrand48.c \ opendir.c \ pause.c \ pmadvise.c \ popen.c \ posix_spawn.c \ psignal.c \ pututxline.c \ pw_scan.c \ raise.c \ readdir.c \ readpassphrase.c \ recvmmsg.c \ rewinddir.c \ scandir.c \ seed48.c \ seekdir.c \ semctl.c \ sendmmsg.c \ setdomainname.c \ sethostname.c \ setjmperr.c \ setmode.c \ setproctitle.c \ setprogname.c \ siginterrupt.c \ siglist.c \ signal.c \ sigsetops.c \ sleep.c \ srand48.c \ statvfs.c \ stringlist.c \ strtofflags.c \ sysconf.c \ sysctl.c \ sysctlbyname.c \ sysctlnametomib.c \ syslog.c \ telldir.c \ termios.c \ time.c \ times.c \ timespec_get.c \ timezone.c \ tls.c \ ttyname.c \ ttyslot.c \ ualarm.c \ ulimit.c \ uname.c \ usleep.c \ utime.c \ utxdb.c \ valloc.c \ wait.c \ wait3.c \ waitpid.c \ waitid.c \ wordexp.c .if ${MK_SYMVER} == yes SRCS+= devname-compat11.c \ fts-compat.c \ fts-compat11.c \ ftw-compat11.c \ getmntinfo-compat11.c \ glob-compat11.c \ nftw-compat11.c \ readdir-compat11.c \ scandir-compat11.c \ unvis-compat.c .endif CFLAGS.arc4random.c= -I${SRCTOP}/sys -I${SRCTOP}/sys/crypto/chacha20 .PATH: ${SRCTOP}/contrib/libc-pwcache SRCS+= pwcache.c pwcache.h .PATH: ${SRCTOP}/contrib/libc-vis CFLAGS+= -I${SRCTOP}/contrib/libc-vis SRCS+= unvis.c vis.c MISRCS+=modf.c CANCELPOINTS_SRCS=sem.c sem_new.c .for src in ${CANCELPOINTS_SRCS} SRCS+=cancelpoints_${src} CLEANFILES+=cancelpoints_${src} cancelpoints_${src}: ${LIBC_SRCTOP}/gen/${src} .NOMETA ln -sf ${.ALLSRC} ${.TARGET} .endfor SYM_MAPS+=${LIBC_SRCTOP}/gen/Symbol.map # machine-dependent gen sources .sinclude "${LIBC_SRCTOP}/${LIBC_ARCH}/gen/Makefile.inc" MAN+= alarm.3 \ arc4random.3 \ + auxv.3 \ basename.3 \ cap_rights_get.3 \ cap_sandboxed.3 \ check_utility_compat.3 \ clock.3 \ clock_getcpuclockid.3 \ confstr.3 \ ctermid.3 \ daemon.3 \ devname.3 \ directory.3 \ dirname.3 \ dl_iterate_phdr.3 \ dladdr.3 \ dlinfo.3 \ dllockinit.3 \ dlopen.3 \ dup3.3 \ err.3 \ exec.3 \ feature_present.3 \ fmtcheck.3 \ fmtmsg.3 \ fnmatch.3 \ fpclassify.3 \ frexp.3 \ ftok.3 \ fts.3 \ ftw.3 \ getbootfile.3 \ getbsize.3 \ getcap.3 \ getcontext.3 \ getcwd.3 \ getdiskbyname.3 \ getdomainname.3 \ getentropy.3 \ getfsent.3 \ getgrent.3 \ getgrouplist.3 \ gethostname.3 \ getloadavg.3 \ getmntinfo.3 \ getnetgrent.3 \ getosreldate.3 \ getpagesize.3 \ getpagesizes.3 \ getpass.3 \ getpeereid.3 \ getprogname.3 \ getpwent.3 \ getttyent.3 \ getusershell.3 \ getutxent.3 \ getvfsbyname.3 \ glob.3 \ initgroups.3 \ isgreater.3 \ ldexp.3 \ lockf.3 \ makecontext.3 \ modf.3 \ nice.3 \ nlist.3 \ pause.3 \ popen.3 \ posix_spawn.3 \ posix_spawn_file_actions_addopen.3 \ posix_spawn_file_actions_init.3 \ posix_spawnattr_getflags.3 \ posix_spawnattr_getpgroup.3 \ posix_spawnattr_getschedparam.3 \ posix_spawnattr_getschedpolicy.3 \ posix_spawnattr_init.3 \ posix_spawnattr_getsigdefault.3 \ posix_spawnattr_getsigmask.3 \ psignal.3 \ pwcache.3 \ raise.3 \ rand48.3 \ readpassphrase.3 \ rfork_thread.3 \ scandir.3 \ sem_destroy.3 \ sem_getvalue.3 \ sem_init.3 \ sem_open.3 \ sem_post.3 \ sem_timedwait.3 \ sem_wait.3 \ setjmp.3 \ setmode.3 \ setproctitle.3 \ siginterrupt.3 \ signal.3 \ sigsetops.3 \ sleep.3 \ statvfs.3 \ stringlist.3 \ strtofflags.3 \ sysconf.3 \ sysctl.3 \ syslog.3 \ tcgetpgrp.3 \ tcgetsid.3 \ tcsendbreak.3 \ tcsetattr.3 \ tcsetpgrp.3 \ tcsetsid.3 \ time.3 \ times.3 \ timespec_get.3 \ timezone.3 \ ttyname.3 \ tzset.3 \ ualarm.3 \ ucontext.3 \ ulimit.3 \ uname.3 \ unvis.3 \ usleep.3 \ utime.3 \ valloc.3 \ vis.3 \ wordexp.3 MLINKS+=arc4random.3 arc4random_buf.3 \ 
arc4random.3 arc4random_uniform.3 +MLINKS+=auxv.3 elf_aux_info.3 MLINKS+=ctermid.3 ctermid_r.3 MLINKS+=devname.3 devname_r.3 MLINKS+=devname.3 fdevname.3 MLINKS+=devname.3 fdevname_r.3 MLINKS+=directory.3 closedir.3 \ directory.3 dirfd.3 \ directory.3 fdclosedir.3 \ directory.3 fdopendir.3 \ directory.3 opendir.3 \ directory.3 readdir.3 \ directory.3 readdir_r.3 \ directory.3 rewinddir.3 \ directory.3 seekdir.3 \ directory.3 telldir.3 MLINKS+=dlopen.3 fdlopen.3 \ dlopen.3 dlclose.3 \ dlopen.3 dlerror.3 \ dlopen.3 dlfunc.3 \ dlopen.3 dlsym.3 \ dlopen.3 dlvsym.3 MLINKS+=err.3 err_set_exit.3 \ err.3 err_set_file.3 \ err.3 errc.3 \ err.3 errx.3 \ err.3 verr.3 \ err.3 verrc.3 \ err.3 verrx.3 \ err.3 vwarn.3 \ err.3 vwarnc.3 \ err.3 vwarnx.3 \ err.3 warnc.3 \ err.3 warn.3 \ err.3 warnx.3 MLINKS+=exec.3 execl.3 \ exec.3 execle.3 \ exec.3 execlp.3 \ exec.3 exect.3 \ exec.3 execv.3 \ exec.3 execvP.3 \ exec.3 execvp.3 MLINKS+=fpclassify.3 finite.3 \ fpclassify.3 finitef.3 \ fpclassify.3 isfinite.3 \ fpclassify.3 isinf.3 \ fpclassify.3 isnan.3 \ fpclassify.3 isnormal.3 MLINKS+=frexp.3 frexpf.3 \ frexp.3 frexpl.3 MLINKS+=fts.3 fts_children.3 \ fts.3 fts_close.3 \ fts.3 fts_open.3 \ fts.3 fts_read.3 \ fts.3 fts_set.3 \ fts.3 fts_set_clientptr.3 \ fts.3 fts_get_clientptr.3 \ fts.3 fts_get_stream.3 MLINKS+=ftw.3 nftw.3 MLINKS+=getcap.3 cgetcap.3 \ getcap.3 cgetclose.3 \ getcap.3 cgetent.3 \ getcap.3 cgetfirst.3 \ getcap.3 cgetmatch.3 \ getcap.3 cgetnext.3 \ getcap.3 cgetnum.3 \ getcap.3 cgetset.3 \ getcap.3 cgetstr.3 \ getcap.3 cgetustr.3 MLINKS+=getcwd.3 getwd.3 MLINKS+=getcontext.3 getcontextx.3 MLINKS+=getcontext.3 setcontext.3 MLINKS+=getdomainname.3 setdomainname.3 MLINKS+=getfsent.3 endfsent.3 \ getfsent.3 getfsfile.3 \ getfsent.3 getfsspec.3 \ getfsent.3 getfstype.3 \ getfsent.3 setfsent.3 \ getfsent.3 setfstab.3 \ getfsent.3 getfstab.3 MLINKS+=getgrent.3 endgrent.3 \ getgrent.3 getgrgid.3 \ getgrent.3 getgrnam.3 \ getgrent.3 setgrent.3 \ getgrent.3 setgroupent.3 \ getgrent.3 getgrent_r.3 \ getgrent.3 getgrnam_r.3 \ getgrent.3 getgrgid_r.3 MLINKS+=gethostname.3 sethostname.3 MLINKS+=getnetgrent.3 endnetgrent.3 \ getnetgrent.3 getnetgrent_r.3 \ getnetgrent.3 innetgr.3 \ getnetgrent.3 setnetgrent.3 MLINKS+=getprogname.3 setprogname.3 MLINKS+=getpwent.3 endpwent.3 \ getpwent.3 getpwnam.3 \ getpwent.3 getpwuid.3 \ getpwent.3 setpassent.3 \ getpwent.3 setpwent.3 \ getpwent.3 setpwfile.3 \ getpwent.3 getpwent_r.3 \ getpwent.3 getpwnam_r.3 \ getpwent.3 getpwuid_r.3 MLINKS+=getttyent.3 endttyent.3 \ getttyent.3 getttynam.3 \ getttyent.3 isdialuptty.3 \ getttyent.3 isnettty.3 \ getttyent.3 setttyent.3 MLINKS+=getusershell.3 endusershell.3 \ getusershell.3 setusershell.3 MLINKS+=getutxent.3 endutxent.3 \ getutxent.3 getutxid.3 \ getutxent.3 getutxline.3 \ getutxent.3 getutxuser.3 \ getutxent.3 pututxline.3 \ getutxent.3 setutxdb.3 \ getutxent.3 setutxent.3 \ getutxent.3 utmpx.3 MLINKS+=glob.3 globfree.3 MLINKS+=isgreater.3 isgreaterequal.3 \ isgreater.3 isless.3 \ isgreater.3 islessequal.3 \ isgreater.3 islessgreater.3 \ isgreater.3 isunordered.3 MLINKS+=ldexp.3 ldexpf.3 \ ldexp.3 ldexpl.3 MLINKS+=makecontext.3 swapcontext.3 MLINKS+=modf.3 modff.3 \ modf.3 modfl.3 MLINKS+=popen.3 pclose.3 MLINKS+=posix_spawn.3 posix_spawnp.3 \ posix_spawn_file_actions_addopen.3 posix_spawn_file_actions_addclose.3 \ posix_spawn_file_actions_addopen.3 posix_spawn_file_actions_adddup2.3 \ posix_spawn_file_actions_init.3 posix_spawn_file_actions_destroy.3 \ posix_spawnattr_getflags.3 posix_spawnattr_setflags.3 \ 
posix_spawnattr_getpgroup.3 posix_spawnattr_setpgroup.3 \ posix_spawnattr_getschedparam.3 posix_spawnattr_setschedparam.3 \ posix_spawnattr_getschedpolicy.3 posix_spawnattr_setschedpolicy.3 \ posix_spawnattr_getsigdefault.3 posix_spawnattr_setsigdefault.3 \ posix_spawnattr_getsigmask.3 posix_spawnattr_setsigmask.3 \ posix_spawnattr_init.3 posix_spawnattr_destroy.3 MLINKS+=psignal.3 strsignal.3 \ psignal.3 sys_siglist.3 \ psignal.3 sys_signame.3 MLINKS+=pwcache.3 group_from_gid.3 \ pwcache.3 user_from_uid.3 MLINKS+=rand48.3 _rand48.3 \ rand48.3 drand48.3 \ rand48.3 erand48.3 \ rand48.3 jrand48.3 \ rand48.3 lcong48.3 \ rand48.3 lrand48.3 \ rand48.3 mrand48.3 \ rand48.3 nrand48.3 \ rand48.3 seed48.3 \ rand48.3 srand48.3 MLINKS+=recv.2 recvmmsg.2 MLINKS+=scandir.3 alphasort.3 MLINKS+=sem_open.3 sem_close.3 \ sem_open.3 sem_unlink.3 MLINKS+=sem_wait.3 sem_trywait.3 MLINKS+=sem_timedwait.3 sem_clockwait_np.3 MLINKS+=send.2 sendmmsg.2 MLINKS+=setjmp.3 _longjmp.3 \ setjmp.3 _setjmp.3 \ setjmp.3 longjmp.3 \ setjmp.3 longjmperr.3 \ setjmp.3 longjmperror.3 \ setjmp.3 siglongjmp.3 \ setjmp.3 sigsetjmp.3 MLINKS+=setmode.3 getmode.3 MLINKS+=setproctitle.3 setproctitle_fast.3 MLINKS+=sigsetops.3 sigaddset.3 \ sigsetops.3 sigdelset.3 \ sigsetops.3 sigemptyset.3 \ sigsetops.3 sigfillset.3 \ sigsetops.3 sigismember.3 MLINKS+=statvfs.3 fstatvfs.3 MLINKS+=stringlist.3 sl_add.3 \ stringlist.3 sl_find.3 \ stringlist.3 sl_free.3 \ stringlist.3 sl_init.3 MLINKS+=strtofflags.3 fflagstostr.3 MLINKS+=sysctl.3 sysctlbyname.3 \ sysctl.3 sysctlnametomib.3 MLINKS+=syslog.3 closelog.3 \ syslog.3 openlog.3 \ syslog.3 setlogmask.3 \ syslog.3 vsyslog.3 MLINKS+=tcsendbreak.3 tcdrain.3 \ tcsendbreak.3 tcflow.3 \ tcsendbreak.3 tcflush.3 MLINKS+=tcsetattr.3 cfgetispeed.3 \ tcsetattr.3 cfgetospeed.3 \ tcsetattr.3 cfmakeraw.3 \ tcsetattr.3 cfmakesane.3 \ tcsetattr.3 cfsetispeed.3 \ tcsetattr.3 cfsetospeed.3 \ tcsetattr.3 cfsetspeed.3 \ tcsetattr.3 tcgetattr.3 MLINKS+=ttyname.3 isatty.3 \ ttyname.3 ttyname_r.3 MLINKS+=tzset.3 tzsetwall.3 MLINKS+=unvis.3 strunvis.3 \ unvis.3 strunvisx.3 MLINKS+=vis.3 nvis.3 \ vis.3 snvis.3 \ vis.3 strenvisx.3 \ vis.3 strnunvis.3 \ vis.3 strnunvisx.3 \ vis.3 strnvis.3 \ vis.3 strnvisx.3 \ vis.3 strsenvisx.3 \ vis.3 strsnvis.3 \ vis.3 strsnvisx.3 \ vis.3 strsvis.3 \ vis.3 strsvisx.3 \ vis.3 strvis.3 \ vis.3 strvisx.3 \ vis.3 svis.3 MLINKS+=wordexp.3 wordfree.3 Index: user/ngie/bug-237403/lib/libc/gen/auxv.3 =================================================================== --- user/ngie/bug-237403/lib/libc/gen/auxv.3 (nonexistent) +++ user/ngie/bug-237403/lib/libc/gen/auxv.3 (revision 346926) @@ -0,0 +1,86 @@ +.\" +.\" Copyright (c) 2019 Ian Lepore +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR +.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
+.\" IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, +.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF +.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 25, 2019 +.Dt ELF_AUX_INFO 3 +.Os +.Sh NAME +.Nm elf_aux_info +.Nd extract data from the elf auxiliary vector of the current process +.Sh LIBRARY +.Lb libc +.Sh SYNOPSIS +.In sys/auxv.h +.Ft int +.Fn elf_aux_info "int aux" "void *buf" "int buflen" +.Sh DESCRIPTION +The +.Fn elf_aux_info +function retrieves the auxiliary info vector requested in +.Va aux . +The information is stored into the provided buffer if it will fit. +The following values, defined in +.In sys/elf_common.h +can be requested: +.Bl -tag -width AT_OSRELDATE +.It AT_CANARY +The canary value for SSP. +.It AT_HWCAP +CPU / hardware feature flags. +.It AT_HWCAP2 +CPU / hardware feature flags. +.It AT_NCPUS +Number of CPUs. +.It AT_OSRELDATE +Kernel OSRELDATE. +.It AT_PAGESIZES +Vector of page sizes. +.It AT_PAGESZ +Page size in bytes. +.It AT_TIMEKEEP +Pointer to VDSO timehands (for library internal use). +.El +.Sh RETURN VALUES +Returns zero on success, or an error number on failure. +.Sh ERRORS +.Bl -tag -width Er +.It Bq Er EINVAL +An unknown item was requested. +.It Bq Er EINVAL +The provided buffer was not the right size for the requested item. +.It Bq Er ENOENT +The requested item is not available. +.El +.Sh HISTORY +The +.Fn elf_aux_info +function appeared in +.Fx 12.0 . +.Sh BUGS +Only a small subset of available auxiliary info vector items are +accessible with this function. +Some items require a "right-sized" buffer while others just require a +"big enough" buffer. Property changes on: user/ngie/bug-237403/lib/libc/gen/auxv.3 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/lib/libvgl/bitmap.c =================================================================== --- user/ngie/bug-237403/lib/libvgl/bitmap.c (revision 346925) +++ user/ngie/bug-237403/lib/libvgl/bitmap.c (revision 346926) @@ -1,300 +1,320 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991-1997 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer, * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include "vgl.h" #define min(x, y) (((x) < (y)) ? (x) : (y)) static byte mask[8] = {0xff, 0x7f, 0x3f, 0x1f, 0x0f, 0x07, 0x03, 0x01}; static int color2bit[16] = {0x00000000, 0x00000001, 0x00000100, 0x00000101, 0x00010000, 0x00010001, 0x00010100, 0x00010101, 0x01000000, 0x01000001, 0x01000100, 0x01000101, 0x01010000, 0x01010001, 0x01010100, 0x01010101}; static void WriteVerticalLine(VGLBitmap *dst, int x, int y, int width, byte *line) { int i, pos, last, planepos, start_offset, end_offset, offset; int len; unsigned int word = 0; byte *address; byte *VGLPlane[4]; switch (dst->Type) { case VIDBUF4: case VIDBUF4S: start_offset = (x & 0x07); end_offset = (x + width) & 0x07; i = (width + start_offset) / 8; if (end_offset) i++; VGLPlane[0] = VGLBuf; VGLPlane[1] = VGLPlane[0] + i; VGLPlane[2] = VGLPlane[1] + i; VGLPlane[3] = VGLPlane[2] + i; pos = 0; planepos = 0; last = 8 - start_offset; while (pos < width) { word = 0; while (pos < last && pos < width) word = (word<<1) | color2bit[line[pos++]&0x0f]; VGLPlane[0][planepos] = word; VGLPlane[1][planepos] = word>>8; VGLPlane[2][planepos] = word>>16; VGLPlane[3][planepos] = word>>24; planepos++; last += 8; } planepos--; if (end_offset) { word <<= (8 - end_offset); VGLPlane[0][planepos] = word; VGLPlane[1][planepos] = word>>8; VGLPlane[2][planepos] = word>>16; VGLPlane[3][planepos] = word>>24; } if (start_offset || end_offset) width+=8; width /= 8; outb(0x3ce, 0x01); outb(0x3cf, 0x00); /* set/reset enable */ outb(0x3ce, 0x08); outb(0x3cf, 0xff); /* bit mask */ for (i=0; i<4; i++) { outb(0x3c4, 0x02); outb(0x3c5, 0x01<Type == VIDBUF4) { if (end_offset) VGLPlane[i][planepos] |= dst->Bitmap[pos+planepos] & mask[end_offset]; if (start_offset) VGLPlane[i][0] |= dst->Bitmap[pos] & ~mask[start_offset]; bcopy(&VGLPlane[i][0], dst->Bitmap + pos, width); } else { /* VIDBUF4S */ if (end_offset) { offset = VGLSetSegment(pos + planepos); VGLPlane[i][planepos] |= dst->Bitmap[offset] & mask[end_offset]; } offset = VGLSetSegment(pos); if (start_offset) VGLPlane[i][0] |= dst->Bitmap[offset] & ~mask[start_offset]; for (last = width; ; ) { len = min(VGLAdpInfo.va_window_size - offset, last); bcopy(&VGLPlane[i][width - last], dst->Bitmap + offset, len); pos += len; last -= len; if (last <= 0) break; offset = VGLSetSegment(pos); } } } break; case VIDBUF8X: address = dst->Bitmap + VGLAdpInfo.va_line_width * y + x/4; for (i=0; i<4; i++) { outb(0x3c4, 0x02); outb(0x3c5, 0x01 << ((x + i)%4)); for (planepos=0, pos=i; posPixelBytes; pos = (dst->VXsize * y + x) * dst->PixelBytes; while (width > 0) { offset = VGLSetSegment(pos); i = min(VGLAdpInfo.va_window_size - offset, width); bcopy(line, dst->Bitmap + offset, i); line += i; pos += i; width -= i; } break; case MEMBUF: case VIDBUF8: case VIDBUF16: case 
VIDBUF24: case VIDBUF32: address = dst->Bitmap + (dst->VXsize * y + x) * dst->PixelBytes; bcopy(line, address, width * dst->PixelBytes); break; default: ; } } int __VGLBitmapCopy(VGLBitmap *src, int srcx, int srcy, VGLBitmap *dst, int dstx, int dsty, int width, int hight) { - int srcline, dstline, yend, yextra, ystep; - + byte *buffer, *p; + int mousemerge, srcline, dstline, yend, yextra, ystep; + + mousemerge = 0; + if (hight < 0) { + hight = -hight; + mousemerge = (dst == VGLDisplay && + VGLMouseOverlap(dstx, dsty, width, hight)); + if (mousemerge) + buffer = alloca(width*src->PixelBytes); + } if (srcx>src->VXsize || srcy>src->VYsize || dstx>dst->VXsize || dsty>dst->VYsize) return -1; if (srcx < 0) { width=width+srcx; dstx-=srcx; srcx=0; } if (srcy < 0) { hight=hight+srcy; dsty-=srcy; srcy=0; } if (dstx < 0) { width=width+dstx; srcx-=dstx; dstx=0; } if (dsty < 0) { hight=hight+dsty; srcy-=dsty; dsty=0; } if (srcx+width > src->VXsize) width=src->VXsize-srcx; if (srcy+hight > src->VYsize) hight=src->VYsize-srcy; if (dstx+width > dst->VXsize) width=dst->VXsize-dstx; if (dsty+hight > dst->VYsize) hight=dst->VYsize-dsty; if (width < 0 || hight < 0) return -1; yend = srcy + hight; yextra = 0; ystep = 1; if (src->Bitmap == dst->Bitmap && srcy < dsty) { - yend = srcy; + yend = srcy - 1; yextra = hight - 1; ystep = -1; } for (srcline = srcy + yextra, dstline = dsty + yextra; srcline != yend; srcline += ystep, dstline += ystep) { - WriteVerticalLine(dst, dstx, dstline, width, - src->Bitmap+(srcline*src->VXsize+srcx)*dst->PixelBytes); + p = src->Bitmap+(srcline*src->VXsize+srcx)*dst->PixelBytes; + if (mousemerge && VGLMouseOverlap(dstx, dstline, width, 1)) { + bcopy(p, buffer, width*src->PixelBytes); + p = buffer; + VGLMouseMerge(dstx, dstline, width, p); + } + WriteVerticalLine(dst, dstx, dstline, width, p); } return 0; } int VGLBitmapCopy(VGLBitmap *src, int srcx, int srcy, VGLBitmap *dst, int dstx, int dsty, int width, int hight) { int error; + if (hight < 0) + return -1; if (src == VGLDisplay) src = &VGLVDisplay; if (src->Type != MEMBUF) return -1; /* invalid */ if (dst == VGLDisplay) { - VGLMouseFreeze(dstx, dsty, width, hight, 0); + VGLMouseFreeze(); __VGLBitmapCopy(src, srcx, srcy, &VGLVDisplay, dstx, dsty, width, hight); + error = __VGLBitmapCopy(src, srcx, srcy, &VGLVDisplay, dstx, dsty, + width, hight); + if (error != 0) + return error; src = &VGLVDisplay; srcx = dstx; srcy = dsty; } else if (dst->Type != MEMBUF) return -1; /* invalid */ - error = __VGLBitmapCopy(src, srcx, srcy, dst, dstx, dsty, width, hight); + error = __VGLBitmapCopy(src, srcx, srcy, dst, dstx, dsty, width, -hight); if (dst == VGLDisplay) VGLMouseUnFreeze(); return error; } VGLBitmap *VGLBitmapCreate(int type, int xsize, int ysize, byte *bits) { VGLBitmap *object; if (type != MEMBUF) return NULL; if (xsize < 0 || ysize < 0) return NULL; object = (VGLBitmap *)malloc(sizeof(*object)); if (object == NULL) return NULL; object->Type = type; object->Xsize = xsize; object->Ysize = ysize; object->VXsize = xsize; object->VYsize = ysize; object->Xorigin = 0; object->Yorigin = 0; object->Bitmap = bits; object->PixelBytes = VGLDisplay->PixelBytes; return object; } void VGLBitmapDestroy(VGLBitmap *object) { if (object->Bitmap) free(object->Bitmap); free(object); } int VGLBitmapAllocateBits(VGLBitmap *object) { object->Bitmap = malloc(object->VXsize*object->VYsize*object->PixelBytes); if (object->Bitmap == NULL) return -1; return 0; } void VGLBitmapCvt(VGLBitmap *src, VGLBitmap *dst) { u_long color; int dstpos, i, pb, size, srcpb, 
srcpos; size = src->VXsize * src->VYsize; srcpb = src->PixelBytes; if (srcpb <= 0) srcpb = 1; pb = dst->PixelBytes; if (pb == srcpb) { bcopy(src->Bitmap, dst->Bitmap, size * pb); return; } if (srcpb != 1) return; /* not supported */ for (srcpos = dstpos = 0; srcpos < size; srcpos++) { color = VGLrgb332ToNative(src->Bitmap[srcpos]); for (i = 0; i < pb; i++, color >>= 8) dst->Bitmap[dstpos++] = color; } } Index: user/ngie/bug-237403/lib/libvgl/main.c =================================================================== --- user/ngie/bug-237403/lib/libvgl/main.c (revision 346925) +++ user/ngie/bug-237403/lib/libvgl/main.c (revision 346926) @@ -1,530 +1,531 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991-1997 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include "vgl.h" -/* XXX Direct Color 24bits modes unsupported */ - #define min(x, y) (((x) < (y)) ? (x) : (y)) #define max(x, y) (((x) > (y)) ? 
(x) : (y)) VGLBitmap *VGLDisplay; VGLBitmap VGLVDisplay; video_info_t VGLModeInfo; video_adapter_info_t VGLAdpInfo; byte *VGLBuf; static int VGLMode; static int VGLOldMode; static size_t VGLBufSize; static byte *VGLMem = MAP_FAILED; static int VGLSwitchPending; static int VGLAbortPending; static int VGLOnDisplay; static unsigned int VGLCurWindow; static int VGLInitDone = 0; static video_info_t VGLOldModeInfo; static vid_info_t VGLOldVInfo; +static int VGLOldVXsize; void VGLEnd() { struct vt_mode smode; int size[3]; if (!VGLInitDone) return; VGLInitDone = 0; + signal(SIGUSR1, SIG_IGN); + signal(SIGUSR2, SIG_IGN); VGLSwitchPending = 0; VGLAbortPending = 0; + VGLMouseMode(VGL_MOUSEHIDE); - signal(SIGUSR1, SIG_IGN); - if (VGLMem != MAP_FAILED) { VGLClear(VGLDisplay, 0); munmap(VGLMem, VGLAdpInfo.va_window_size); } + ioctl(0, FBIO_SETLINEWIDTH, &VGLOldVXsize); + if (VGLOldMode >= M_VESA_BASE) ioctl(0, _IO('V', VGLOldMode - M_VESA_BASE), 0); else ioctl(0, _IO('S', VGLOldMode), 0); if (VGLOldModeInfo.vi_flags & V_INFO_GRAPHICS) { size[0] = VGLOldVInfo.mv_csz; size[1] = VGLOldVInfo.mv_rsz; size[2] = VGLOldVInfo.font_size;; ioctl(0, KDRASTER, size); } if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) ioctl(0, KDDISABIO, 0); ioctl(0, KDSETMODE, KD_TEXT); smode.mode = VT_AUTO; ioctl(0, VT_SETMODE, &smode); if (VGLBuf) free(VGLBuf); VGLBuf = NULL; free(VGLDisplay); VGLDisplay = NULL; VGLKeyboardEnd(); } static void VGLAbort(int arg) { sigset_t mask; VGLAbortPending = 1; signal(SIGINT, SIG_IGN); signal(SIGTERM, SIG_IGN); signal(SIGUSR2, SIG_IGN); if (arg == SIGBUS || arg == SIGSEGV) { signal(arg, SIG_DFL); sigemptyset(&mask); sigaddset(&mask, arg); sigprocmask(SIG_UNBLOCK, &mask, NULL); VGLEnd(); kill(getpid(), arg); } } static void VGLSwitch(int arg __unused) { if (!VGLOnDisplay) VGLOnDisplay = 1; else VGLOnDisplay = 0; VGLSwitchPending = 1; signal(SIGUSR1, VGLSwitch); } int VGLInit(int mode) { struct vt_mode smode; int adptype, depth; if (VGLInitDone) return -1; signal(SIGUSR1, VGLSwitch); signal(SIGINT, VGLAbort); signal(SIGTERM, VGLAbort); signal(SIGSEGV, VGLAbort); signal(SIGBUS, VGLAbort); signal(SIGUSR2, SIG_IGN); VGLOnDisplay = 1; VGLSwitchPending = 0; VGLAbortPending = 0; if (ioctl(0, CONS_GET, &VGLOldMode) || ioctl(0, CONS_CURRENT, &adptype)) return -1; if (IOCGROUP(mode) == 'V') /* XXX: this is ugly */ VGLModeInfo.vi_mode = (mode & 0x0ff) + M_VESA_BASE; else VGLModeInfo.vi_mode = mode & 0x0ff; if (ioctl(0, CONS_MODEINFO, &VGLModeInfo)) /* FBIO_MODEINFO */ return -1; /* Save info for old mode to restore font size if old mode is graphics. */ VGLOldModeInfo.vi_mode = VGLOldMode; if (ioctl(0, CONS_MODEINFO, &VGLOldModeInfo)) return -1; VGLOldVInfo.size = sizeof(VGLOldVInfo); if (ioctl(0, CONS_GETINFO, &VGLOldVInfo)) return -1; VGLDisplay = (VGLBitmap *)malloc(sizeof(VGLBitmap)); if (VGLDisplay == NULL) return -2; if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT && ioctl(0, KDENABIO, 0)) { free(VGLDisplay); return -3; } VGLInitDone = 1; /* * vi_mem_model specifies the memory model of the current video mode * in -CURRENT. 
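 *
 * The switch below maps it to a VGLBitmap type: 4bpp planar EGA/VGA
 * modes become VIDBUF4, 8bpp packed modes VIDBUF8, VGA mode-X VIDBUF8X,
 * and direct-color modes VIDBUF16/VIDBUF24/VIDBUF32 according to
 * vi_pixel_size; anything else makes VGLInit() fail with -4.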
*/ switch (VGLModeInfo.vi_mem_model) { case V_INFO_MM_PLANAR: /* we can handle EGA/VGA planner modes only */ if (VGLModeInfo.vi_depth != 4 || VGLModeInfo.vi_planes != 4 || (adptype != KD_EGA && adptype != KD_VGA)) { VGLEnd(); return -4; } VGLDisplay->Type = VIDBUF4; VGLDisplay->PixelBytes = 1; break; case V_INFO_MM_PACKED: /* we can do only 256 color packed modes */ if (VGLModeInfo.vi_depth != 8) { VGLEnd(); return -4; } VGLDisplay->Type = VIDBUF8; VGLDisplay->PixelBytes = 1; break; case V_INFO_MM_VGAX: VGLDisplay->Type = VIDBUF8X; VGLDisplay->PixelBytes = 1; break; case V_INFO_MM_DIRECT: VGLDisplay->PixelBytes = VGLModeInfo.vi_pixel_size; switch (VGLDisplay->PixelBytes) { case 2: VGLDisplay->Type = VIDBUF16; break; -#if notyet case 3: VGLDisplay->Type = VIDBUF24; break; -#endif case 4: VGLDisplay->Type = VIDBUF32; break; default: VGLEnd(); return -4; } break; default: VGLEnd(); return -4; } ioctl(0, VT_WAITACTIVE, 0); ioctl(0, KDSETMODE, KD_GRAPHICS); if (ioctl(0, mode, 0)) { VGLEnd(); return -5; } if (ioctl(0, CONS_ADPINFO, &VGLAdpInfo)) { /* FBIO_ADPINFO */ VGLEnd(); return -6; } /* * Calculate the shadow screen buffer size. In -CURRENT, va_buffer_size * always holds the entire frame buffer size, wheather it's in the linear * mode or windowed mode. * VGLBufSize = VGLAdpInfo.va_buffer_size; * In -STABLE, va_buffer_size holds the frame buffer size, only if * the linear frame buffer mode is supported. Otherwise the field is zero. * We shall calculate the minimal size in this case: * VGLAdpInfo.va_line_width*VGLModeInfo.vi_height*VGLModeInfo.vi_planes * or * VGLAdpInfo.va_window_size*VGLModeInfo.vi_planes; * Use whichever is larger. */ if (VGLAdpInfo.va_buffer_size != 0) VGLBufSize = VGLAdpInfo.va_buffer_size; else VGLBufSize = max(VGLAdpInfo.va_line_width*VGLModeInfo.vi_height, VGLAdpInfo.va_window_size)*VGLModeInfo.vi_planes; /* * The above is for old -CURRENT. Current -CURRENT since r203535 and/or * r248799 restricts va_buffer_size to the displayed size in VESA modes to * avoid wasting kva for mapping unused parts of the frame buffer. But all * parts were usable here. Applying the same restriction to user mappings * makes our virtualization useless and breaks our panning, but large frame * buffers are also difficult for us to manage (clearing and switching may * be too slow, and malloc() may fail). Restrict ourselves similarly to * get the same efficiency and bugs for all kernels. 
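 *
 * Concretely, for VESA modes the shadow buffer below is sized to the
 * displayed area only, va_line_width * vi_height * vi_planes, rather
 * than to the full frame buffer size reported by the adapter.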
*/ if (VGLModeInfo.vi_mode >= M_VESA_BASE) VGLBufSize = VGLAdpInfo.va_line_width*VGLModeInfo.vi_height* VGLModeInfo.vi_planes; VGLBuf = malloc(VGLBufSize); if (VGLBuf == NULL) { VGLEnd(); return -7; } #ifdef LIBVGL_DEBUG fprintf(stderr, "VGLBufSize:0x%x\n", VGLBufSize); #endif /* see if we are in the windowed buffer mode or in the linear buffer mode */ if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) { switch (VGLDisplay->Type) { case VIDBUF4: VGLDisplay->Type = VIDBUF4S; break; case VIDBUF8: VGLDisplay->Type = VIDBUF8S; break; case VIDBUF16: VGLDisplay->Type = VIDBUF16S; break; case VIDBUF24: VGLDisplay->Type = VIDBUF24S; break; case VIDBUF32: VGLDisplay->Type = VIDBUF32S; break; default: VGLEnd(); return -8; } } VGLMode = mode; VGLCurWindow = 0; VGLDisplay->Xsize = VGLModeInfo.vi_width; VGLDisplay->Ysize = VGLModeInfo.vi_height; depth = VGLModeInfo.vi_depth; if (depth == 15) depth = 16; + VGLOldVXsize = VGLDisplay->VXsize = VGLAdpInfo.va_line_width *8/(depth/VGLModeInfo.vi_planes); VGLDisplay->VYsize = VGLBufSize/VGLModeInfo.vi_planes/VGLAdpInfo.va_line_width; VGLDisplay->Xorigin = 0; VGLDisplay->Yorigin = 0; VGLMem = (byte*)mmap(0, VGLAdpInfo.va_window_size, PROT_READ|PROT_WRITE, MAP_FILE | MAP_SHARED, 0, 0); if (VGLMem == MAP_FAILED) { VGLEnd(); return -7; } VGLDisplay->Bitmap = VGLMem; VGLVDisplay = *VGLDisplay; VGLVDisplay.Type = MEMBUF; if (VGLModeInfo.vi_depth < 8) VGLVDisplay.Bitmap = malloc(2 * VGLBufSize); else VGLVDisplay.Bitmap = VGLBuf; VGLSavePalette(); #ifdef LIBVGL_DEBUG fprintf(stderr, "va_line_width:%d\n", VGLAdpInfo.va_line_width); fprintf(stderr, "VGLXsize:%d, Ysize:%d, VXsize:%d, VYsize:%d\n", VGLDisplay->Xsize, VGLDisplay->Ysize, VGLDisplay->VXsize, VGLDisplay->VYsize); #endif smode.mode = VT_PROCESS; smode.waitv = 0; smode.relsig = SIGUSR1; smode.acqsig = SIGUSR1; smode.frsig = SIGINT; if (ioctl(0, VT_SETMODE, &smode)) { VGLEnd(); return -9; } VGLTextSetFontFile((byte*)0); VGLClear(VGLDisplay, 0); return 0; } void VGLCheckSwitch() { if (VGLAbortPending) { VGLEnd(); exit(0); } while (VGLSwitchPending) { VGLSwitchPending = 0; if (VGLOnDisplay) { if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) ioctl(0, KDENABIO, 0); ioctl(0, KDSETMODE, KD_GRAPHICS); ioctl(0, VGLMode, 0); VGLCurWindow = 0; VGLMem = (byte*)mmap(0, VGLAdpInfo.va_window_size, PROT_READ|PROT_WRITE, MAP_FILE | MAP_SHARED, 0, 0); /* XXX: what if mmap() has failed! 
*/ VGLDisplay->Type = VIDBUF8; /* XXX */ switch (VGLModeInfo.vi_mem_model) { case V_INFO_MM_PLANAR: if (VGLModeInfo.vi_depth == 4 && VGLModeInfo.vi_planes == 4) { if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) VGLDisplay->Type = VIDBUF4S; else VGLDisplay->Type = VIDBUF4; } else { /* shouldn't be happening */ } break; case V_INFO_MM_PACKED: if (VGLModeInfo.vi_depth == 8) { if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) VGLDisplay->Type = VIDBUF8S; else VGLDisplay->Type = VIDBUF8; } break; case V_INFO_MM_VGAX: VGLDisplay->Type = VIDBUF8X; break; case V_INFO_MM_DIRECT: switch (VGLModeInfo.vi_pixel_size) { case 2: if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) VGLDisplay->Type = VIDBUF16S; else VGLDisplay->Type = VIDBUF16; break; case 3: if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) VGLDisplay->Type = VIDBUF24S; else VGLDisplay->Type = VIDBUF24; break; case 4: if (VGLBufSize/VGLModeInfo.vi_planes > VGLAdpInfo.va_window_size) VGLDisplay->Type = VIDBUF32S; else VGLDisplay->Type = VIDBUF32; break; default: /* shouldn't be happening */ break; } default: /* shouldn't be happening */ break; } VGLDisplay->Bitmap = VGLMem; VGLDisplay->Xsize = VGLModeInfo.vi_width; VGLDisplay->Ysize = VGLModeInfo.vi_height; VGLSetVScreenSize(VGLDisplay, VGLDisplay->VXsize, VGLDisplay->VYsize); VGLRestoreBlank(); VGLRestoreBorder(); VGLMouseRestore(); VGLPanScreen(VGLDisplay, VGLDisplay->Xorigin, VGLDisplay->Yorigin); VGLBitmapCopy(&VGLVDisplay, 0, 0, VGLDisplay, 0, 0, VGLDisplay->VXsize, VGLDisplay->VYsize); VGLRestorePalette(); ioctl(0, VT_RELDISP, VT_ACKACQ); } else { VGLMem = MAP_FAILED; munmap(VGLDisplay->Bitmap, VGLAdpInfo.va_window_size); ioctl(0, VGLOldMode, 0); ioctl(0, KDSETMODE, KD_TEXT); if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) ioctl(0, KDDISABIO, 0); ioctl(0, VT_RELDISP, VT_TRUE); VGLDisplay->Bitmap = VGLBuf; VGLDisplay->Type = MEMBUF; VGLDisplay->Xsize = VGLDisplay->VXsize; VGLDisplay->Ysize = VGLDisplay->VYsize; while (!VGLOnDisplay) pause(); } } } int VGLSetSegment(unsigned int offset) { if (offset/VGLAdpInfo.va_window_size != VGLCurWindow) { ioctl(0, CONS_SETWINORG, offset); /* FBIO_SETWINORG */ VGLCurWindow = offset/VGLAdpInfo.va_window_size; } return (offset%VGLAdpInfo.va_window_size); } int VGLSetVScreenSize(VGLBitmap *object, int VXsize, int VYsize) { int depth; if (VXsize < object->Xsize || VYsize < object->Ysize) return -1; if (object->Type == MEMBUF) return -1; if (ioctl(0, FBIO_SETLINEWIDTH, &VXsize)) return -1; ioctl(0, CONS_ADPINFO, &VGLAdpInfo); /* FBIO_ADPINFO */ depth = VGLModeInfo.vi_depth; if (depth == 15) depth = 16; object->VXsize = VGLAdpInfo.va_line_width *8/(depth/VGLModeInfo.vi_planes); object->VYsize = VGLBufSize/VGLModeInfo.vi_planes/VGLAdpInfo.va_line_width; if (VYsize < object->VYsize) object->VYsize = VYsize; #ifdef LIBVGL_DEBUG fprintf(stderr, "new size: VGLXsize:%d, Ysize:%d, VXsize:%d, VYsize:%d\n", object->Xsize, object->Ysize, object->VXsize, object->VYsize); #endif return 0; } int VGLPanScreen(VGLBitmap *object, int x, int y) { video_display_start_t origin; if (x < 0 || x + object->Xsize > object->VXsize || y < 0 || y + object->Ysize > object->VYsize) return -1; if (object->Type == MEMBUF) return 0; origin.x = x; origin.y = y; if (ioctl(0, FBIO_SETDISPSTART, &origin)) return -1; object->Xorigin = x; object->Yorigin = y; #ifdef LIBVGL_DEBUG fprintf(stderr, "new origin: (%d, %d)\n", x, y); #endif return 0; } Index: user/ngie/bug-237403/lib/libvgl/mouse.c 
=================================================================== --- user/ngie/bug-237403/lib/libvgl/mouse.c (revision 346925) +++ user/ngie/bug-237403/lib/libvgl/mouse.c (revision 346926) @@ -1,363 +1,426 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991-1997 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include "vgl.h" +static void VGLMouseAction(int dummy); + #define BORDER 0xff /* default border -- light white in rgb 3:3:2 */ #define INTERIOR 0xa0 /* default interior -- red in rgb 3:3:2 */ +#define LARGE_MOUSE_IMG_XSIZE 19 +#define LARGE_MOUSE_IMG_YSIZE 32 +#define SMALL_MOUSE_IMG_XSIZE 10 +#define SMALL_MOUSE_IMG_YSIZE 16 #define X 0xff /* any nonzero in And mask means part of cursor */ #define B BORDER #define I INTERIOR -static byte StdAndMask[MOUSE_IMG_SIZE*MOUSE_IMG_SIZE] = { - X,X,0,0,0,0,0,0,0,0,0,0,0,0,0,0, - X,X,X,0,0,0,0,0,0,0,0,0,0,0,0,0, - X,X,X,X,0,0,0,0,0,0,0,0,0,0,0,0, - X,X,X,X,X,0,0,0,0,0,0,0,0,0,0,0, - X,X,X,X,X,X,0,0,0,0,0,0,0,0,0,0, - X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0, - X,X,X,X,X,X,X,X,0,0,0,0,0,0,0,0, - X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0, - X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0, - X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0, - X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0, - X,X,X,0,X,X,X,X,0,0,0,0,0,0,0,0, - X,X,0,0,X,X,X,X,0,0,0,0,0,0,0,0, - 0,0,0,0,0,X,X,X,X,0,0,0,0,0,0,0, - 0,0,0,0,0,X,X,X,X,0,0,0,0,0,0,0, - 0,0,0,0,0,0,X,X,0,0,0,0,0,0,0,0, +static byte LargeAndMask[] = { + X,X,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,0,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,0,0, + 
X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,0, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X, + X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X, + X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0, + X,X,X,X,X,X,X,X,X,X,X,X,0,0,0,0,0,0,0, + X,X,X,X,X,X,0,X,X,X,X,X,X,0,0,0,0,0,0, + X,X,X,X,X,0,0,X,X,X,X,X,X,0,0,0,0,0,0, + X,X,X,X,0,0,0,0,X,X,X,X,X,X,0,0,0,0,0, + X,X,X,0,0,0,0,0,X,X,X,X,X,X,0,0,0,0,0, + X,X,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0,0,0, + 0,0,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0,0,0, + 0,0,0,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0,0, + 0,0,0,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0,0, + 0,0,0,0,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0, + 0,0,0,0,0,0,0,0,0,0,0,X,X,X,X,X,X,0,0, + 0,0,0,0,0,0,0,0,0,0,0,0,X,X,X,X,0,0,0, }; -static byte StdOrMask[MOUSE_IMG_SIZE*MOUSE_IMG_SIZE] = { - B,B,0,0,0,0,0,0,0,0,0,0,0,0,0,0, - B,I,B,0,0,0,0,0,0,0,0,0,0,0,0,0, - B,I,I,B,0,0,0,0,0,0,0,0,0,0,0,0, - B,I,I,I,B,0,0,0,0,0,0,0,0,0,0,0, - B,I,I,I,I,B,0,0,0,0,0,0,0,0,0,0, - B,I,I,I,I,I,B,0,0,0,0,0,0,0,0,0, - B,I,I,I,I,I,I,B,0,0,0,0,0,0,0,0, - B,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0, - B,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0, - B,I,I,I,I,I,B,B,B,B,0,0,0,0,0,0, - B,I,I,B,I,I,B,0,0,0,0,0,0,0,0,0, - B,I,B,0,B,I,I,B,0,0,0,0,0,0,0,0, - B,B,0,0,B,I,I,B,0,0,0,0,0,0,0,0, - 0,0,0,0,0,B,I,I,B,0,0,0,0,0,0,0, - 0,0,0,0,0,B,I,I,B,0,0,0,0,0,0,0, - 0,0,0,0,0,0,B,B,0,0,0,0,0,0,0,0, +static byte LargeOrMask[] = { + B,B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,B,0,0,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,B,0,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,B,0,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,B,0,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,B,0,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,I,B,0,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,I,I,B,0,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,B,0,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,B,0, + B,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,I,B, + B,I,I,I,I,I,I,I,I,I,I,B,B,B,B,B,B,B,B, + B,I,I,I,I,I,I,I,I,I,I,B,0,0,0,0,0,0,0, + B,I,I,I,I,I,B,I,I,I,I,B,0,0,0,0,0,0,0, + B,I,I,I,I,B,0,B,I,I,I,I,B,0,0,0,0,0,0, + B,I,I,I,B,0,0,B,I,I,I,I,B,0,0,0,0,0,0, + B,I,I,B,0,0,0,0,B,I,I,I,I,B,0,0,0,0,0, + B,I,B,0,0,0,0,0,B,I,I,I,I,B,0,0,0,0,0, + B,B,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0,0,0, + 0,0,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0,0,0, + 0,0,0,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0,0, + 0,0,0,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0,0, + 0,0,0,0,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0, + 0,0,0,0,0,0,0,0,0,0,0,B,I,I,I,I,B,0,0, + 0,0,0,0,0,0,0,0,0,0,0,0,B,B,B,B,0,0,0, }; +static byte SmallAndMask[] = { + X,X,0,0,0,0,0,0,0,0, + X,X,X,0,0,0,0,0,0,0, + X,X,X,X,0,0,0,0,0,0, + X,X,X,X,X,0,0,0,0,0, + X,X,X,X,X,X,0,0,0,0, + X,X,X,X,X,X,X,0,0,0, + X,X,X,X,X,X,X,X,0,0, + X,X,X,X,X,X,X,X,X,0, + X,X,X,X,X,X,X,X,X,X, + X,X,X,X,X,X,X,X,X,X, + X,X,X,X,X,X,X,0,0,0, + X,X,X,0,X,X,X,X,0,0, + X,X,0,0,X,X,X,X,0,0, + 0,0,0,0,0,X,X,X,X,0, + 0,0,0,0,0,X,X,X,X,0, + 0,0,0,0,0,0,X,X,0,0, +}; +static byte SmallOrMask[] = { + B,B,0,0,0,0,0,0,0,0, + B,I,B,0,0,0,0,0,0,0, + B,I,I,B,0,0,0,0,0,0, + B,I,I,I,B,0,0,0,0,0, + B,I,I,I,I,B,0,0,0,0, + B,I,I,I,I,I,B,0,0,0, + B,I,I,I,I,I,I,B,0,0, + B,I,I,I,I,I,I,I,B,0, + B,I,I,I,I,I,I,I,I,B, + B,I,I,I,I,I,B,B,B,B, + B,I,I,B,I,I,B,0,0,0, + B,I,B,0,B,I,I,B,0,0, + B,B,0,0,B,I,I,B,0,0, + 0,0,0,0,0,B,I,I,B,0, + 0,0,0,0,0,B,I,I,B,0, + 0,0,0,0,0,0,B,B,0,0, +}; #undef X #undef B #undef I -static VGLBitmap VGLMouseStdAndMask = - VGLBITMAP_INITIALIZER(MEMBUF, MOUSE_IMG_SIZE, 
MOUSE_IMG_SIZE, StdAndMask); -static VGLBitmap VGLMouseStdOrMask = - VGLBITMAP_INITIALIZER(MEMBUF, MOUSE_IMG_SIZE, MOUSE_IMG_SIZE, StdOrMask); +static VGLBitmap VGLMouseLargeAndMask = + VGLBITMAP_INITIALIZER(MEMBUF, LARGE_MOUSE_IMG_XSIZE, LARGE_MOUSE_IMG_YSIZE, + LargeAndMask); +static VGLBitmap VGLMouseLargeOrMask = + VGLBITMAP_INITIALIZER(MEMBUF, LARGE_MOUSE_IMG_XSIZE, LARGE_MOUSE_IMG_YSIZE, + LargeOrMask); +static VGLBitmap VGLMouseSmallAndMask = + VGLBITMAP_INITIALIZER(MEMBUF, SMALL_MOUSE_IMG_XSIZE, SMALL_MOUSE_IMG_YSIZE, + SmallAndMask); +static VGLBitmap VGLMouseSmallOrMask = + VGLBITMAP_INITIALIZER(MEMBUF, SMALL_MOUSE_IMG_XSIZE, SMALL_MOUSE_IMG_YSIZE, + SmallOrMask); static VGLBitmap *VGLMouseAndMask, *VGLMouseOrMask; -static int VGLMouseVisible = 0; -static int VGLMouseShown = 0; +static int VGLMouseShown = VGL_MOUSEHIDE; static int VGLMouseXpos = 0; static int VGLMouseYpos = 0; static int VGLMouseButtons = 0; static volatile sig_atomic_t VGLMintpending; static volatile sig_atomic_t VGLMsuppressint; #define INTOFF() (VGLMsuppressint++) #define INTON() do { \ if (--VGLMsuppressint == 0 && VGLMintpending) \ VGLMouseAction(0); \ } while (0) -void -VGLMousePointerShow() +int +__VGLMouseMode(int mode) { - byte buf[MOUSE_IMG_SIZE*MOUSE_IMG_SIZE*4]; - VGLBitmap buffer = - VGLBITMAP_INITIALIZER(MEMBUF, MOUSE_IMG_SIZE, MOUSE_IMG_SIZE, buf); - byte crtcidx, crtcval, gdcidx, gdcval; - int pos; + int oldmode; - if (!VGLMouseVisible) { - INTOFF(); - VGLMouseVisible = 1; - if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) { - crtcidx = inb(0x3c4); - crtcval = inb(0x3c5); - gdcidx = inb(0x3ce); - gdcval = inb(0x3cf); - } - buffer.PixelBytes = VGLDisplay->PixelBytes; - __VGLBitmapCopy(&VGLVDisplay, VGLMouseXpos, VGLMouseYpos, - &buffer, 0, 0, MOUSE_IMG_SIZE, MOUSE_IMG_SIZE); - for (pos = 0; pos < MOUSE_IMG_SIZE*MOUSE_IMG_SIZE; pos++) - if (VGLMouseAndMask->Bitmap[pos]) - bcopy(&VGLMouseOrMask->Bitmap[pos*VGLDisplay->PixelBytes], - &buffer.Bitmap[pos*VGLDisplay->PixelBytes], - VGLDisplay->PixelBytes); - __VGLBitmapCopy(&buffer, 0, 0, VGLDisplay, - VGLMouseXpos, VGLMouseYpos, MOUSE_IMG_SIZE, MOUSE_IMG_SIZE); - if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) { - outb(0x3c4, crtcidx); - outb(0x3c5, crtcval); - outb(0x3ce, gdcidx); - outb(0x3cf, gdcval); - } - INTON(); - } -} - -void -VGLMousePointerHide() -{ - byte crtcidx, crtcval, gdcidx, gdcval; - - if (VGLMouseVisible) { - INTOFF(); - VGLMouseVisible = 0; - if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) { - crtcidx = inb(0x3c4); - crtcval = inb(0x3c5); - gdcidx = inb(0x3ce); - gdcval = inb(0x3cf); - } - __VGLBitmapCopy(&VGLVDisplay, VGLMouseXpos, VGLMouseYpos, VGLDisplay, - VGLMouseXpos, VGLMouseYpos, MOUSE_IMG_SIZE, MOUSE_IMG_SIZE); - if (VGLModeInfo.vi_mem_model != V_INFO_MM_DIRECT) { - outb(0x3c4, crtcidx); - outb(0x3c5, crtcval); - outb(0x3ce, gdcidx); - outb(0x3cf, gdcval); - } - INTON(); - } -} - -void -VGLMouseMode(int mode) -{ + INTOFF(); + oldmode = VGLMouseShown; if (mode == VGL_MOUSESHOW) { if (VGLMouseShown == VGL_MOUSEHIDE) { - VGLMousePointerShow(); VGLMouseShown = VGL_MOUSESHOW; + __VGLBitmapCopy(&VGLVDisplay, VGLMouseXpos, VGLMouseYpos, + VGLDisplay, VGLMouseXpos, VGLMouseYpos, + VGLMouseAndMask->VXsize, -VGLMouseAndMask->VYsize); } } else { if (VGLMouseShown == VGL_MOUSESHOW) { - VGLMousePointerHide(); VGLMouseShown = VGL_MOUSEHIDE; + __VGLBitmapCopy(&VGLVDisplay, VGLMouseXpos, VGLMouseYpos, + VGLDisplay, VGLMouseXpos, VGLMouseYpos, + VGLMouseAndMask->VXsize, VGLMouseAndMask->VYsize); } } + INTON(); + return oldmode; } void 
+VGLMouseMode(int mode) +{ + __VGLMouseMode(mode); +} + +static void VGLMouseAction(int dummy) { struct mouse_info mouseinfo; + int mousemode; if (VGLMsuppressint) { VGLMintpending = 1; return; } again: INTOFF(); VGLMintpending = 0; mouseinfo.operation = MOUSE_GETINFO; ioctl(0, CONS_MOUSECTL, &mouseinfo); - if (VGLMouseShown == VGL_MOUSESHOW) - VGLMousePointerHide(); - VGLMouseXpos = mouseinfo.u.data.x; - VGLMouseYpos = mouseinfo.u.data.y; + if (VGLMouseXpos != mouseinfo.u.data.x || + VGLMouseYpos != mouseinfo.u.data.y) { + mousemode = __VGLMouseMode(VGL_MOUSEHIDE); + VGLMouseXpos = mouseinfo.u.data.x; + VGLMouseYpos = mouseinfo.u.data.y; + __VGLMouseMode(mousemode); + } VGLMouseButtons = mouseinfo.u.data.buttons; - if (VGLMouseShown == VGL_MOUSESHOW) - VGLMousePointerShow(); /* * Loop to handle any new (suppressed) signals. This is INTON() without * recursion. !SA_RESTART prevents recursion in signal handling. So the * maximum recursion is 2 levels. */ VGLMsuppressint = 0; if (VGLMintpending) goto again; } void VGLMouseSetImage(VGLBitmap *AndMask, VGLBitmap *OrMask) { - if (VGLMouseShown == VGL_MOUSESHOW) - VGLMousePointerHide(); + int mousemode; + mousemode = __VGLMouseMode(VGL_MOUSEHIDE); + VGLMouseAndMask = AndMask; if (VGLMouseOrMask != NULL) { free(VGLMouseOrMask->Bitmap); free(VGLMouseOrMask); } VGLMouseOrMask = VGLBitmapCreate(MEMBUF, OrMask->VXsize, OrMask->VYsize, 0); VGLBitmapAllocateBits(VGLMouseOrMask); VGLBitmapCvt(OrMask, VGLMouseOrMask); - if (VGLMouseShown == VGL_MOUSESHOW) - VGLMousePointerShow(); + __VGLMouseMode(mousemode); } void VGLMouseSetStdImage() { - VGLMouseSetImage(&VGLMouseStdAndMask, &VGLMouseStdOrMask); + if (VGLDisplay->VXsize > 800) + VGLMouseSetImage(&VGLMouseLargeAndMask, &VGLMouseLargeOrMask); + else + VGLMouseSetImage(&VGLMouseSmallAndMask, &VGLMouseSmallOrMask); } int VGLMouseInit(int mode) { struct mouse_info mouseinfo; + VGLBitmap *ormask; int andmask, border, error, i, interior; switch (VGLModeInfo.vi_mem_model) { case V_INFO_MM_PACKED: case V_INFO_MM_PLANAR: andmask = 0x0f; border = 0x0f; interior = 0x04; break; case V_INFO_MM_VGAX: andmask = 0x3f; border = 0x3f; interior = 0x24; break; default: andmask = 0xff; border = BORDER; interior = INTERIOR; break; } if (VGLModeInfo.vi_mode == M_BG640x480) border = 0; /* XXX (palette makes 0x04 look like 0x0f) */ if (getenv("VGLMOUSEBORDERCOLOR") != NULL) border = strtoul(getenv("VGLMOUSEBORDERCOLOR"), NULL, 0); if (getenv("VGLMOUSEINTERIORCOLOR") != NULL) interior = strtoul(getenv("VGLMOUSEINTERIORCOLOR"), NULL, 0); - for (i = 0; i < MOUSE_IMG_SIZE*MOUSE_IMG_SIZE; i++) - VGLMouseStdOrMask.Bitmap[i] = VGLMouseStdOrMask.Bitmap[i] == BORDER ? - border : VGLMouseStdOrMask.Bitmap[i] == INTERIOR ? interior : 0; + ormask = &VGLMouseLargeOrMask; + for (i = 0; i < ormask->VXsize * ormask->VYsize; i++) + ormask->Bitmap[i] = ormask->Bitmap[i] == BORDER ? border : + ormask->Bitmap[i] == INTERIOR ? interior : 0; + ormask = &VGLMouseSmallOrMask; + for (i = 0; i < ormask->VXsize * ormask->VYsize; i++) + ormask->Bitmap[i] = ormask->Bitmap[i] == BORDER ? border : + ormask->Bitmap[i] == INTERIOR ? 
interior : 0; VGLMouseSetStdImage(); mouseinfo.operation = MOUSE_MODE; mouseinfo.u.mode.signal = SIGUSR2; if ((error = ioctl(0, CONS_MOUSECTL, &mouseinfo))) return error; signal(SIGUSR2, VGLMouseAction); mouseinfo.operation = MOUSE_GETINFO; ioctl(0, CONS_MOUSECTL, &mouseinfo); VGLMouseXpos = mouseinfo.u.data.x; VGLMouseYpos = mouseinfo.u.data.y; VGLMouseButtons = mouseinfo.u.data.buttons; VGLMouseMode(mode); return 0; } void VGLMouseRestore(void) { struct mouse_info mouseinfo; INTOFF(); mouseinfo.operation = MOUSE_GETINFO; if (ioctl(0, CONS_MOUSECTL, &mouseinfo) == 0) { mouseinfo.operation = MOUSE_MOVEABS; mouseinfo.u.data.x = VGLMouseXpos; mouseinfo.u.data.y = VGLMouseYpos; ioctl(0, CONS_MOUSECTL, &mouseinfo); } INTON(); } int VGLMouseStatus(int *x, int *y, char *buttons) { INTOFF(); *x = VGLMouseXpos; *y = VGLMouseYpos; *buttons = VGLMouseButtons; INTON(); return VGLMouseShown; } -int -VGLMouseFreeze(int x, int y, int width, int hight, u_long color) +void +VGLMouseFreeze(void) { - INTOFF(); - if (width > 1 || hight > 1 || (color & 0xc0000000) == 0) { /* bitmap */ - if (VGLMouseShown == 1) { - int overlap; + INTOFF(); +} - if (x > VGLMouseXpos) - overlap = (VGLMouseXpos + MOUSE_IMG_SIZE) - x; - else - overlap = (x + width) - VGLMouseXpos; - if (overlap > 0) { - if (y > VGLMouseYpos) - overlap = (VGLMouseYpos + MOUSE_IMG_SIZE) - y; - else - overlap = (y + hight) - VGLMouseYpos; - if (overlap > 0) - VGLMousePointerHide(); - } - } - } - else { /* bit */ - if (VGLMouseShown && - x >= VGLMouseXpos && x < VGLMouseXpos + MOUSE_IMG_SIZE && - y >= VGLMouseYpos && y < VGLMouseYpos + MOUSE_IMG_SIZE) { - if (color & 0x80000000) { /* Set */ - if (VGLMouseAndMask->Bitmap - [(y-VGLMouseYpos)*MOUSE_IMG_SIZE+(x-VGLMouseXpos)]) { - return 1; - } - } - } - } +int +VGLMouseFreezeXY(int x, int y) +{ + INTOFF(); + if (VGLMouseShown != VGL_MOUSESHOW) + return 0; + if (x >= VGLMouseXpos && x < VGLMouseXpos + VGLMouseAndMask->VXsize && + y >= VGLMouseYpos && y < VGLMouseYpos + VGLMouseAndMask->VYsize && + VGLMouseAndMask->Bitmap[(y-VGLMouseYpos)*VGLMouseAndMask->VXsize+ + (x-VGLMouseXpos)]) + return 1; return 0; } +int +VGLMouseOverlap(int x, int y, int width, int hight) +{ + int overlap; + + if (VGLMouseShown != VGL_MOUSESHOW) + return 0; + if (x > VGLMouseXpos) + overlap = (VGLMouseXpos + VGLMouseAndMask->VXsize) - x; + else + overlap = (x + width) - VGLMouseXpos; + if (overlap <= 0) + return 0; + if (y > VGLMouseYpos) + overlap = (VGLMouseYpos + VGLMouseAndMask->VYsize) - y; + else + overlap = (y + hight) - VGLMouseYpos; + return overlap > 0; +} + void +VGLMouseMerge(int x, int y, int width, byte *line) +{ + int pos, x1, xend, xstart; + + xstart = x; + if (xstart < VGLMouseXpos) + xstart = VGLMouseXpos; + xend = x + width; + if (xend > VGLMouseXpos + VGLMouseAndMask->VXsize) + xend = VGLMouseXpos + VGLMouseAndMask->VXsize; + for (x1 = xstart; x1 < xend; x1++) { + pos = (y - VGLMouseYpos) * VGLMouseAndMask->VXsize + x1 - VGLMouseXpos; + if (VGLMouseAndMask->Bitmap[pos]) + bcopy(&VGLMouseOrMask->Bitmap[pos * VGLDisplay->PixelBytes], + &line[(x1 - x) * VGLDisplay->PixelBytes], VGLDisplay->PixelBytes); + } +} + +void VGLMouseUnFreeze() { - if (VGLMouseShown == VGL_MOUSESHOW && !VGLMouseVisible && !VGLMintpending) - VGLMousePointerShow(); - while (VGLMsuppressint) - INTON(); + INTON(); } Index: user/ngie/bug-237403/lib/libvgl/simple.c =================================================================== --- user/ngie/bug-237403/lib/libvgl/simple.c (revision 346925) +++ user/ngie/bug-237403/lib/libvgl/simple.c 
(revision 346926) @@ -1,670 +1,688 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991-1997 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include "vgl.h" static int VGLBlank; static byte VGLBorderColor; static byte VGLSavePaletteRed[256]; static byte VGLSavePaletteGreen[256]; static byte VGLSavePaletteBlue[256]; #define ABS(a) (((a)<0) ? -(a) : (a)) #define SGN(a) (((a)<0) ? -1 : 1) #define min(x, y) (((x) < (y)) ? (x) : (y)) #define max(x, y) (((x) > (y)) ? 
(x) : (y)) void VGLSetXY(VGLBitmap *object, int x, int y, u_long color) { - int offset; + int offset, soffset, undermouse; VGLCheckSwitch(); if (x>=0 && xVXsize && y>=0 && yVYsize) { - if (object == VGLDisplay) + if (object == VGLDisplay) { + undermouse = VGLMouseFreezeXY(x, y); VGLSetXY(&VGLVDisplay, x, y, color); - if (object->Type == MEMBUF || - !VGLMouseFreeze(x, y, 1, 1, 0x80000000 | color)) { + } else if (object->Type != MEMBUF) + return; /* invalid */ + else + undermouse = 0; + if (!undermouse) { offset = (y * object->VXsize + x) * object->PixelBytes; switch (object->Type) { case VIDBUF8S: case VIDBUF16S: - case VIDBUF24S: case VIDBUF32S: offset = VGLSetSegment(offset); /* FALLTHROUGH */ case MEMBUF: case VIDBUF8: case VIDBUF16: case VIDBUF24: case VIDBUF32: color = htole32(color); switch (object->PixelBytes) { case 1: memcpy(&object->Bitmap[offset], &color, 1); break; case 2: memcpy(&object->Bitmap[offset], &color, 2); break; case 3: memcpy(&object->Bitmap[offset], &color, 3); break; case 4: memcpy(&object->Bitmap[offset], &color, 4); break; } break; + case VIDBUF24S: + soffset = VGLSetSegment(offset); + color = htole32(color); + switch (VGLAdpInfo.va_window_size - soffset) { + case 1: + memcpy(&object->Bitmap[soffset], &color, 1); + soffset = VGLSetSegment(offset + 1); + memcpy(&object->Bitmap[soffset], (byte *)&color + 1, 2); + break; + case 2: + memcpy(&object->Bitmap[soffset], &color, 2); + soffset = VGLSetSegment(offset + 2); + memcpy(&object->Bitmap[soffset], (byte *)&color + 2, 1); + break; + default: + memcpy(&object->Bitmap[soffset], &color, 3); + break; + } + break; case VIDBUF8X: outb(0x3c4, 0x02); outb(0x3c5, 0x01 << (x&0x3)); object->Bitmap[(unsigned)(VGLAdpInfo.va_line_width*y)+(x/4)] = ((byte)color); break; case VIDBUF4S: offset = VGLSetSegment(y*VGLAdpInfo.va_line_width + x/8); goto set_planar; case VIDBUF4: offset = y*VGLAdpInfo.va_line_width + x/8; set_planar: outb(0x3c4, 0x02); outb(0x3c5, 0x0f); outb(0x3ce, 0x00); outb(0x3cf, (byte)color & 0x0f); /* set/reset */ outb(0x3ce, 0x01); outb(0x3cf, 0x0f); /* set/reset enable */ outb(0x3ce, 0x08); outb(0x3cf, 0x80 >> (x%8)); /* bit mask */ object->Bitmap[offset] |= (byte)color; } } - if (object->Type != MEMBUF) + if (object == VGLDisplay) VGLMouseUnFreeze(); } } -static u_long -__VGLGetXY(VGLBitmap *object, int x, int y) +u_long +VGLGetXY(VGLBitmap *object, int x, int y) { - int offset; u_long color; + int offset; + VGLCheckSwitch(); + if (x<0 || x>=object->VXsize || y<0 || y>=object->VYsize) + return 0; + if (object == VGLDisplay) + object = &VGLVDisplay; + else if (object->Type != MEMBUF) + return 0; /* invalid */ offset = (y * object->VXsize + x) * object->PixelBytes; switch (object->PixelBytes) { case 1: memcpy(&color, &object->Bitmap[offset], 1); return le32toh(color) & 0xff; case 2: memcpy(&color, &object->Bitmap[offset], 2); return le32toh(color) & 0xffff; case 3: memcpy(&color, &object->Bitmap[offset], 3); return le32toh(color) & 0xffffff; case 4: memcpy(&color, &object->Bitmap[offset], 4); return le32toh(color); } return 0; /* invalid */ } -u_long -VGLGetXY(VGLBitmap *object, int x, int y) -{ - VGLCheckSwitch(); - if (x<0 || x>=object->VXsize || y<0 || y>=object->VYsize) - return 0; - if (object == VGLDisplay) - object = &VGLVDisplay; - if (object->Type != MEMBUF) - return 0; /* invalid */ - return __VGLGetXY(object, x, y); -} - /* * Symmetric Double Step Line Algorithm by Brian Wyvill from * "Graphics Gems", Academic Press, 1990. 
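
At the call-site level the effect of the rewrite above is that a write to VGLDisplay is
mirrored into the software shadow (VGLVDisplay) and skipped on the hardware only while
the pixel sits under the mouse pointer, while VGLGetXY() always reads back from the
shadow; the separate VIDBUF24S case exists because a 3-byte pixel can straddle the bank
boundary, in which case the write is split and VGLSetSegment() is called again for the
remaining bytes.  A minimal usage sketch, assuming a direct-color mode so that
VGLrgb332ToNative() applies:

    u_long c;

    c = VGLrgb332ToNative(0xe0);        /* bright red in rgb 3:3:2           */
    VGLSetXY(VGLDisplay, 10, 20, c);
    c = VGLGetXY(VGLDisplay, 10, 20);   /* comes back from the shadow buffer */
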
*/ #define SL_SWAP(a,b) {a^=b; b^=a; a^=b;} #define SL_ABSOLUTE(i,j,k) ( (i-j)*(k = ( (i-j)<0 ? -1 : 1))) void plot(VGLBitmap * object, int x, int y, int flag, u_long color) { /* non-zero flag indicates the pixels need swapping back. */ if (flag) VGLSetXY(object, y, x, color); else VGLSetXY(object, x, y, color); } void VGLLine(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color) { int dx, dy, incr1, incr2, D, x, y, xend, c, pixels_left; int sign_x, sign_y, step, reverse, i; dx = SL_ABSOLUTE(x2, x1, sign_x); dy = SL_ABSOLUTE(y2, y1, sign_y); /* decide increment sign by the slope sign */ if (sign_x == sign_y) step = 1; else step = -1; if (dy > dx) { /* chooses axis of greatest movement (make dx) */ SL_SWAP(x1, y1); SL_SWAP(x2, y2); SL_SWAP(dx, dy); reverse = 1; } else reverse = 0; /* note error check for dx==0 should be included here */ if (x1 > x2) { /* start from the smaller coordinate */ x = x2; y = y2; /* x1 = x1; y1 = y1; */ } else { x = x1; y = y1; x1 = x2; y1 = y2; } /* Note dx=n implies 0 - n or (dx+1) pixels to be set */ /* Go round loop dx/4 times then plot last 0,1,2 or 3 pixels */ /* In fact (dx-1)/4 as 2 pixels are already plotted */ xend = (dx - 1) / 4; pixels_left = (dx - 1) % 4; /* number of pixels left over at the * end */ plot(object, x, y, reverse, color); if (pixels_left < 0) return; /* plot only one pixel for zero length * vectors */ plot(object, x1, y1, reverse, color); /* plot first two points */ incr2 = 4 * dy - 2 * dx; if (incr2 < 0) { /* slope less than 1/2 */ c = 2 * dy; incr1 = 2 * c; D = incr1 - dx; for (i = 0; i < xend; i++) { /* plotting loop */ ++x; --x1; if (D < 0) { /* pattern 1 forwards */ plot(object, x, y, reverse, color); plot(object, ++x, y, reverse, color); /* pattern 1 backwards */ plot(object, x1, y1, reverse, color); plot(object, --x1, y1, reverse, color); D += incr1; } else { if (D < c) { /* pattern 2 forwards */ plot(object, x, y, reverse, color); plot(object, ++x, y += step, reverse, color); /* pattern 2 backwards */ plot(object, x1, y1, reverse, color); plot(object, --x1, y1 -= step, reverse, color); } else { /* pattern 3 forwards */ plot(object, x, y += step, reverse, color); plot(object, ++x, y, reverse, color); /* pattern 3 backwards */ plot(object, x1, y1 -= step, reverse, color); plot(object, --x1, y1, reverse, color); } D += incr2; } } /* end for */ /* plot last pattern */ if (pixels_left) { if (D < 0) { plot(object, ++x, y, reverse, color); /* pattern 1 */ if (pixels_left > 1) plot(object, ++x, y, reverse, color); if (pixels_left > 2) plot(object, --x1, y1, reverse, color); } else { if (D < c) { plot(object, ++x, y, reverse, color); /* pattern 2 */ if (pixels_left > 1) plot(object, ++x, y += step, reverse, color); if (pixels_left > 2) plot(object, --x1, y1, reverse, color); } else { /* pattern 3 */ plot(object, ++x, y += step, reverse, color); if (pixels_left > 1) plot(object, ++x, y, reverse, color); if (pixels_left > 2) plot(object, --x1, y1 -= step, reverse, color); } } } /* end if pixels_left */ } /* end slope < 1/2 */ else { /* slope greater than 1/2 */ c = 2 * (dy - dx); incr1 = 2 * c; D = incr1 + dx; for (i = 0; i < xend; i++) { ++x; --x1; if (D > 0) { /* pattern 4 forwards */ plot(object, x, y += step, reverse, color); plot(object, ++x, y += step, reverse, color); /* pattern 4 backwards */ plot(object, x1, y1 -= step, reverse, color); plot(object, --x1, y1 -= step, reverse, color); D += incr1; } else { if (D < c) { /* pattern 2 forwards */ plot(object, x, y, reverse, color); plot(object, ++x, y += step, reverse, 
color); /* pattern 2 backwards */ plot(object, x1, y1, reverse, color); plot(object, --x1, y1 -= step, reverse, color); } else { /* pattern 3 forwards */ plot(object, x, y += step, reverse, color); plot(object, ++x, y, reverse, color); /* pattern 3 backwards */ plot(object, x1, y1 -= step, reverse, color); plot(object, --x1, y1, reverse, color); } D += incr2; } } /* end for */ /* plot last pattern */ if (pixels_left) { if (D > 0) { plot(object, ++x, y += step, reverse, color); /* pattern 4 */ if (pixels_left > 1) plot(object, ++x, y += step, reverse, color); if (pixels_left > 2) plot(object, --x1, y1 -= step, reverse, color); } else { if (D < c) { plot(object, ++x, y, reverse, color); /* pattern 2 */ if (pixels_left > 1) plot(object, ++x, y += step, reverse, color); if (pixels_left > 2) plot(object, --x1, y1, reverse, color); } else { /* pattern 3 */ plot(object, ++x, y += step, reverse, color); if (pixels_left > 1) plot(object, ++x, y, reverse, color); if (pixels_left > 2) { if (D > c) /* step 3 */ plot(object, --x1, y1 -= step, reverse, color); else /* step 2 */ plot(object, --x1, y1, reverse, color); } } } } } } void VGLBox(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color) { VGLLine(object, x1, y1, x2, y1, color); VGLLine(object, x2, y1, x2, y2, color); VGLLine(object, x2, y2, x1, y2, color); VGLLine(object, x1, y2, x1, y1, color); } void VGLFilledBox(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color) { int y; for (y=y1; y<=y2; y++) VGLLine(object, x1, y, x2, y, color); } static inline void set4pixels(VGLBitmap *object, int x, int y, int xc, int yc, u_long color) { if (x!=0) { VGLSetXY(object, xc+x, yc+y, color); VGLSetXY(object, xc-x, yc+y, color); if (y!=0) { VGLSetXY(object, xc+x, yc-y, color); VGLSetXY(object, xc-x, yc-y, color); } } else { VGLSetXY(object, xc, yc+y, color); if (y!=0) VGLSetXY(object, xc, yc-y, color); } } void VGLEllipse(VGLBitmap *object, int xc, int yc, int a, int b, u_long color) { int x = 0, y = b, asq = a*a, asq2 = a*a*2, bsq = b*b; int bsq2 = b*b*2, d = bsq-asq*b+asq/4, dx = 0, dy = asq2*b; while (dx0) { y--; dy-=asq2; d-=dy; } x++; dx+=bsq2; d+=bsq+dx; } d+=(3*(asq-bsq)/2-(dx+dy))/2; while (y>=0) { set4pixels(object, x, y, xc, yc, color); if (d<0) { x++; dx+=bsq2; d+=dx; } y--; dy-=asq2; d+=asq-dy; } } static inline void set2lines(VGLBitmap *object, int x, int y, int xc, int yc, u_long color) { if (x!=0) { VGLLine(object, xc+x, yc+y, xc-x, yc+y, color); if (y!=0) VGLLine(object, xc+x, yc-y, xc-x, yc-y, color); } else { VGLLine(object, xc, yc+y, xc, yc-y, color); } } void VGLFilledEllipse(VGLBitmap *object, int xc, int yc, int a, int b, u_long color) { int x = 0, y = b, asq = a*a, asq2 = a*a*2, bsq = b*b; int bsq2 = b*b*2, d = bsq-asq*b+asq/4, dx = 0, dy = asq2*b; while (dx0) { y--; dy-=asq2; d-=dy; } x++; dx+=bsq2; d+=bsq+dx; } d+=(3*(asq-bsq)/2-(dx+dy))/2; while (y>=0) { set2lines(object, x, y, xc, yc, color); if (d<0) { x++; dx+=bsq2; d+=dx; } y--; dy-=asq2; d+=asq-dy; } } void VGLClear(VGLBitmap *object, u_long color) { VGLBitmap src; - int offset; - int len; - int i; + int i, len, mousemode, offset; VGLCheckSwitch(); if (object == VGLDisplay) { - VGLMouseFreeze(0, 0, object->Xsize, object->Ysize, color); + VGLMouseFreeze(); VGLClear(&VGLVDisplay, color); } else if (object->Type != MEMBUF) return; /* invalid */ switch (object->Type) { case MEMBUF: case VIDBUF8: case VIDBUF8S: case VIDBUF16: case VIDBUF16S: case VIDBUF24: case VIDBUF24S: case VIDBUF32: case VIDBUF32S: src.Type = MEMBUF; src.Xsize = object->Xsize; src.VXsize = 
object->VXsize; src.Ysize = 1; src.VYsize = 1; src.Xorigin = 0; src.Yorigin = 0; src.Bitmap = alloca(object->VXsize * object->PixelBytes); src.PixelBytes = object->PixelBytes; color = htole32(color); for (i = 0; i < object->VXsize; i++) bcopy(&color, src.Bitmap + i * object->PixelBytes, object->PixelBytes); for (i = 0; i < object->VYsize; i++) - __VGLBitmapCopy(&src, 0, 0, object, 0, i, object->VXsize, 1); + __VGLBitmapCopy(&src, 0, 0, object, 0, i, object->VXsize, -1); break; case VIDBUF8X: + mousemode = __VGLMouseMode(VGL_MOUSEHIDE); /* XXX works only for Xsize % 4 = 0 */ outb(0x3c6, 0xff); outb(0x3c4, 0x02); outb(0x3c5, 0x0f); memset(object->Bitmap, (byte)color, VGLAdpInfo.va_line_width*object->VYsize); + __VGLMouseMode(mousemode); break; case VIDBUF4: case VIDBUF4S: + mousemode = __VGLMouseMode(VGL_MOUSEHIDE); /* XXX works only for Xsize % 8 = 0 */ outb(0x3c4, 0x02); outb(0x3c5, 0x0f); outb(0x3ce, 0x05); outb(0x3cf, 0x02); /* mode 2 */ outb(0x3ce, 0x01); outb(0x3cf, 0x00); /* set/reset enable */ outb(0x3ce, 0x08); outb(0x3cf, 0xff); /* bit mask */ for (offset = 0; offset < VGLAdpInfo.va_line_width*object->VYsize; ) { VGLSetSegment(offset); len = min(object->VXsize*object->VYsize - offset, VGLAdpInfo.va_window_size); memset(object->Bitmap, (byte)color, len); offset += len; } outb(0x3ce, 0x05); outb(0x3cf, 0x00); + __VGLMouseMode(mousemode); break; } if (object == VGLDisplay) VGLMouseUnFreeze(); } static inline u_long VGLrgbToNative(uint16_t r, uint16_t g, uint16_t b) { int nr, ng, nb; nr = VGLModeInfo.vi_pixel_fsizes[2]; ng = VGLModeInfo.vi_pixel_fsizes[1]; nb = VGLModeInfo.vi_pixel_fsizes[0]; return (r >> (16 - nr) << (ng + nb)) | (g >> (16 - ng) << nb) | (b >> (16 - nb) << 0); } u_long VGLrgb332ToNative(byte c) { uint16_t r, g, b; /* 3:3:2 to 16:16:16 */ r = ((c & 0xe0) >> 5) * 0xffff / 7; g = ((c & 0x1c) >> 2) * 0xffff / 7; b = ((c & 0x03) >> 0) * 0xffff / 3; return VGLrgbToNative(r, g, b); } void VGLRestorePalette() { int i; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) return; outb(0x3C6, 0xFF); inb(0x3DA); outb(0x3C8, 0x00); for (i=0; i<256; i++) { outb(0x3C9, VGLSavePaletteRed[i]); inb(0x84); outb(0x3C9, VGLSavePaletteGreen[i]); inb(0x84); outb(0x3C9, VGLSavePaletteBlue[i]); inb(0x84); } inb(0x3DA); outb(0x3C0, 0x20); } void VGLSavePalette() { int i; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) return; outb(0x3C6, 0xFF); inb(0x3DA); outb(0x3C7, 0x00); for (i=0; i<256; i++) { VGLSavePaletteRed[i] = inb(0x3C9); inb(0x84); VGLSavePaletteGreen[i] = inb(0x3C9); inb(0x84); VGLSavePaletteBlue[i] = inb(0x3C9); inb(0x84); } inb(0x3DA); outb(0x3C0, 0x20); } void VGLSetPalette(byte *red, byte *green, byte *blue) { int i; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) return; for (i=0; i<256; i++) { VGLSavePaletteRed[i] = red[i]; VGLSavePaletteGreen[i] = green[i]; VGLSavePaletteBlue[i] = blue[i]; } VGLCheckSwitch(); outb(0x3C6, 0xFF); inb(0x3DA); outb(0x3C8, 0x00); for (i=0; i<256; i++) { outb(0x3C9, VGLSavePaletteRed[i]); inb(0x84); outb(0x3C9, VGLSavePaletteGreen[i]); inb(0x84); outb(0x3C9, VGLSavePaletteBlue[i]); inb(0x84); } inb(0x3DA); outb(0x3C0, 0x20); } void VGLSetPaletteIndex(byte color, byte red, byte green, byte blue) { if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) return; VGLSavePaletteRed[color] = red; VGLSavePaletteGreen[color] = green; VGLSavePaletteBlue[color] = blue; VGLCheckSwitch(); outb(0x3C6, 0xFF); inb(0x3DA); outb(0x3C8, color); outb(0x3C9, red); outb(0x3C9, green); outb(0x3C9, blue); inb(0x3DA); outb(0x3C0, 0x20); } void VGLRestoreBorder(void) { 
VGLSetBorder(VGLBorderColor); } void VGLSetBorder(byte color) { if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT && ioctl(0, KDENABIO, 0)) return; VGLCheckSwitch(); inb(0x3DA); outb(0x3C0,0x11); outb(0x3C0, color); inb(0x3DA); outb(0x3C0, 0x20); VGLBorderColor = color; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) ioctl(0, KDDISABIO, 0); } void VGLRestoreBlank(void) { VGLBlankDisplay(VGLBlank); } void VGLBlankDisplay(int blank) { byte val; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT && ioctl(0, KDENABIO, 0)) return; VGLCheckSwitch(); outb(0x3C4, 0x01); val = inb(0x3C5); outb(0x3C4, 0x01); outb(0x3C5, ((blank) ? (val |= 0x20) : (val &= 0xDF))); VGLBlank = blank; if (VGLModeInfo.vi_mem_model == V_INFO_MM_DIRECT) ioctl(0, KDDISABIO, 0); } Index: user/ngie/bug-237403/lib/libvgl/vgl.h =================================================================== --- user/ngie/bug-237403/lib/libvgl/vgl.h (revision 346925) +++ user/ngie/bug-237403/lib/libvgl/vgl.h (revision 346926) @@ -1,162 +1,163 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1991-1997 Søren Schmidt * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * $FreeBSD$ */ #ifndef _VGL_H_ #define _VGL_H_ #include #include #include #include typedef unsigned char byte; typedef struct { byte Type; int Xsize, Ysize; int VXsize, VYsize; int Xorigin, Yorigin; byte *Bitmap; int PixelBytes; } VGLBitmap; #define VGLBITMAP_INITIALIZER(t, x, y, bits) \ { (t), (x), (y), (x), (y), 0, 0, (bits), -1 } /* * Defined Type's */ #define MEMBUF 0 #define VIDBUF4 1 #define VIDBUF8 2 #define VIDBUF8X 3 #define VIDBUF8S 4 #define VIDBUF4S 5 #define VIDBUF16 6 /* Direct Color linear buffer */ #define VIDBUF24 7 /* Direct Color linear buffer */ #define VIDBUF32 8 /* Direct Color linear buffer */ #define VIDBUF16S 9 /* Direct Color segmented buffer */ #define VIDBUF24S 10 /* Direct Color segmented buffer */ #define VIDBUF32S 11 /* Direct Color segmented buffer */ #define NOBUF 255 typedef struct VGLText { byte Width, Height; byte *BitmapArray; } VGLText; typedef struct VGLObject { int Id; int Type; int Status; int Xpos, Ypos; int Xhot, Yhot; VGLBitmap *Image; VGLBitmap *Mask; int (*CallBackFunction)(); } VGLObject; #define MOUSE_IMG_SIZE 16 #define VGL_MOUSEHIDE 0 #define VGL_MOUSESHOW 1 #define VGL_MOUSEFREEZE 0 #define VGL_MOUSEUNFREEZE 1 #define VGL_DIR_RIGHT 0 #define VGL_DIR_UP 1 #define VGL_DIR_LEFT 2 #define VGL_DIR_DOWN 3 #define VGL_RAWKEYS 1 #define VGL_CODEKEYS 2 #define VGL_XLATEKEYS 3 extern video_adapter_info_t VGLAdpInfo; extern video_info_t VGLModeInfo; extern VGLBitmap *VGLDisplay; extern VGLBitmap VGLVDisplay; extern byte *VGLBuf; /* * Prototypes */ /* bitmap.c */ int __VGLBitmapCopy(VGLBitmap *src, int srcx, int srcy, VGLBitmap *dst, int dstx, int dsty, int width, int hight); int VGLBitmapCopy(VGLBitmap *src, int srcx, int srcy, VGLBitmap *dst, int dstx, int dsty, int width, int hight); VGLBitmap *VGLBitmapCreate(int type, int xsize, int ysize, byte *bits); void VGLBitmapDestroy(VGLBitmap *object); int VGLBitmapAllocateBits(VGLBitmap *object); void VGLBitmapCvt(VGLBitmap *src, VGLBitmap *dst); /* keyboard.c */ int VGLKeyboardInit(int mode); void VGLKeyboardEnd(void); int VGLKeyboardGetCh(void); /* main.c */ void VGLEnd(void); int VGLInit(int mode); void VGLCheckSwitch(void); int VGLSetVScreenSize(VGLBitmap *object, int VXsize, int VYsize); int VGLPanScreen(VGLBitmap *object, int x, int y); int VGLSetSegment(unsigned int offset); /* mouse.c */ -void VGLMousePointerShow(void); -void VGLMousePointerHide(void); +int __VGLMouseMode(int mode); void VGLMouseMode(int mode); -void VGLMouseAction(int dummy); void VGLMouseSetImage(VGLBitmap *AndMask, VGLBitmap *OrMask); void VGLMouseSetStdImage(void); int VGLMouseInit(int mode); void VGLMouseRestore(void); int VGLMouseStatus(int *x, int *y, char *buttons); -int VGLMouseFreeze(int x, int y, int width, int hight, u_long color); +void VGLMouseFreeze(void); +int VGLMouseFreezeXY(int x, int y); +void VGLMouseMerge(int x, int y, int width, byte *line); +int VGLMouseOverlap(int x, int y, int width, int hight); void VGLMouseUnFreeze(void); /* simple.c */ void VGLSetXY(VGLBitmap *object, int x, int y, u_long color); u_long VGLGetXY(VGLBitmap *object, int x, int y); void VGLLine(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color); void VGLBox(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color); void VGLFilledBox(VGLBitmap *object, int x1, int y1, int x2, int y2, u_long color); void VGLEllipse(VGLBitmap *object, int xc, int yc, int a, int b, u_long color); void VGLFilledEllipse(VGLBitmap *object, int xc, int yc, int a, int b, u_long color); void VGLClear(VGLBitmap *object, u_long color); u_long 
VGLrgb332ToNative(byte c); void VGLRestoreBlank(void); void VGLRestoreBorder(void); void VGLRestorePalette(void); void VGLSavePalette(void); void VGLSetPalette(byte *red, byte *green, byte *blue); void VGLSetPaletteIndex(byte color, byte red, byte green, byte blue); void VGLSetBorder(byte color); void VGLBlankDisplay(int blank); /* text.c */ int VGLTextSetFontFile(char *filename); void VGLBitmapPutChar(VGLBitmap *Object, int x, int y, byte ch, u_long fgcol, u_long bgcol, int fill, int dir); void VGLBitmapString(VGLBitmap *Object, int x, int y, char *str, u_long fgcol, u_long bgcol, int fill, int dir); #endif /* !_VGL_H_ */ Index: user/ngie/bug-237403/libexec/rc/rc.initdiskless =================================================================== --- user/ngie/bug-237403/libexec/rc/rc.initdiskless (revision 346925) +++ user/ngie/bug-237403/libexec/rc/rc.initdiskless (revision 346926) @@ -1,400 +1,404 @@ #!/bin/sh # # Copyright (c) 1999 Matt Dillon # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE # FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. # # $FreeBSD$ # On entry to this script the entire system consists of a read-only root # mounted via NFS. The kernel has run BOOTP and configured an interface # (otherwise it would not have been able to mount the NFS root!) # # We use the contents of /conf to create and populate memory filesystems # that are mounted on top of this root to implement the writable # (and host-specific) parts of the root filesystem, and other volatile # filesystems. # # The hierarchy in /conf has the form /conf/T/M/ where M are directories # for which memory filesystems will be created and filled, # and T is one of the "template" directories below: # # base universal base, typically a replica of the original root; # default secondary universal base, typically overriding some # of the files in the original root; # ${ipba} where ${ipba} is the assigned broadcast IP address # bcast/${ipba} same as above # ${class} where ${class} is a list of directories supplied by # bootp/dhcp through the T134 option. 
# ${ipba} and ${class} are typically used to configure features # for group of diskless clients, or even individual features; # ${ip} where ${ip} is the machine's assigned IP address, typically # used to set host-specific features; # ip/${ip} same as above # # Template directories are scanned in the order they are listed above, # with each successive directory overriding (merged into) the previous one; # non-existing directories are ignored. The subdirectory forms exist to # help keep the top level /conf manageable in large installations. # # The existence of a directory /conf/T/M causes this script to create a # memory filesystem mounted as /M on the client. # # Some files in /conf have special meaning, namely: # # Filename Action # ---------------------------------------------------------------- # /conf/T/M/remount # The contents of the file is a mount command. E.g. if # /conf/1.2.3.4/foo/remount contains "mount -o ro /dev/ad0s3", # then /dev/ad0s3 will be mounted on /conf/1.2.3.4/foo/ # # /conf/T/M/remount_optional # If this file exists, then failure to execute the mount # command contained in /conf/T/M/remount is non-fatal. # # /conf/T/M/remount_subdir # If this file exists, then the behaviour of /conf/T/M/remount # changes as follows: # 1. /conf/T/M/remount is invoked to mount the root of the # filesystem where the configuration data exists on a # temporary mountpoint. # 2. /conf/T/M/remount_subdir is then invoked to mount a # *subdirectory* of the filesystem mounted by # /conf/T/M/remount on /conf/T/M/. # # /conf/T/M/diskless_remount # The contents of the file points to an NFS filesystem, # possibly followed by mount_nfs options. If the server name # is omitted, the script will prepend the root path used when # booting. E.g. if you booted from foo.com:/path/to/root, # an entry for /conf/base/etc/diskless_remount could be any of # foo.com:/path/to/root/etc # /etc -o ro # Because mount_nfs understands ".." in paths, it is # possible to mount from locations above the NFS root with # paths such as "/../../etc". # # /conf/T/M/md_size # The contents of the file specifies the size of the memory # filesystem to be created, in 512 byte blocks. # The default size is 10240 blocks (5MB). E.g. if # /conf/base/etc/md_size contains "30000" then a 15MB MFS # will be created. In case of multiple entries for the same # directory M, the last one in the scanning order is used. # NOTE: If you only need to create a memory filesystem but not # initialize it from a template, it is preferable to specify # it in fstab e.g. as "md /tmp mfs -s=30m,rw 0 0" # # /conf/T/SUBDIR.cpio.gz # The file is cpio'd into /SUBDIR (and a memory filesystem is # created for /SUBDIR if necessary). The presence of this file # prevents the copy from /conf/T/SUBDIR/ # # /conf/T/M/extract # This is alternative to SUBDIR.cpio.gz and remount. # Similar to remount case, a memory filesystem is created # for /M and initialized from a template but no mounting # performed. Instead, this file is run passing /M as single # argument. It is expected to extract template override to /M # using auxiliary storage found in some embedded systems # having NVRAM too small to hold mountable file system. # # /conf/T/SUBDIR.remove # The list of paths contained in the file are rm -rf'd # relative to /SUBDIR. # # /conf/diskless_remount # Similar to /conf/T/M/diskless_remount above, but allows # all of /conf to be remounted. This can be used to allow # multiple roots to share the same /conf. 
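# As a concrete illustration of the two less obvious forms above (all names
# here are only examples): a /conf/base/etc.remove file containing
#
#	resolv.conf dhclient.conf
#
# causes those files to be deleted from the freshly populated /etc, and a
# /conf/base/etc.cpio.gz archive, which then takes the place of copying
# /conf/base/etc/, can be built on the server with e.g.
#
#	cd /conf/base && find etc | cpio -o -H newc | gzip > etc.cpio.gz
#
# (the archive is unpacked from /, so member paths carry the etc/ prefix).
#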
# # # You will almost universally want to create the following files under /conf # # File Content # ---------------------------- ---------------------------------- # /conf/base/etc/md_size size of /etc filesystem # /conf/base/etc/diskless_remount "/etc" # /conf/default/etc/rc.conf generic diskless config parameters # /conf/default/etc/fstab generic diskless fstab e.g. like this # # foo:/root_part / nfs ro 0 0 # foo:/usr_part /usr nfs ro 0 0 # foo:/home_part /home nfs rw 0 0 # md /tmp mfs -s=30m,rw 0 0 # md /var mfs -s=30m,rw 0 0 # proc /proc procfs rw 0 0 # # plus, possibly, overrides for password files etc. # # NOTE! /var, /tmp, and /dev will be typically created elsewhere, e.g. # as entries in the fstab as above. # Those filesystems should not be specified in /conf. # # (end of documentation, now get to the real code) dlv=`/sbin/sysctl -n vfs.nfs.diskless_valid 2> /dev/null` # DEBUGGING # log something on stdout if verbose. o_verbose=0 # set to 1 or 2 if you want more debugging log() { [ ${o_verbose} -gt 0 ] && echo "*** $* ***" [ ${o_verbose} -gt 1 ] && read -p "=== Press enter to continue" foo } # chkerr: # # Routine to check for error # # checks error code and drops into shell on failure. # if shell exits, terminates script as well as /etc/rc. # if remount_optional exists under the mountpoint, skip this check. # chkerr() { lastitem () ( n=$(($# - 1)) ; shift $n ; echo $1 ) mountpoint="$(lastitem $2)" [ -r $mountpoint/remount_optional ] && ( echo "$2 failed: ignoring due to remount_optional" ; return ) case $1 in 0) ;; *) echo "$2 failed: dropping into /bin/sh" /bin/sh # RESUME ;; esac } # The list of filesystems to umount after the copy to_umount="" handle_remount() { # $1 = mount point local nfspt mountopts b b=$1 log handle_remount $1 [ -d $b -a -f $b/diskless_remount ] || return read nfspt mountopts < $b/diskless_remount log "nfspt ${nfspt} mountopts ${mountopts}" # prepend the nfs root if not present [ `expr "$nfspt" : '\(.\)'` = "/" ] && nfspt="${nfsroot}${nfspt}" mount_nfs $mountopts $nfspt $b chkerr $? "mount_nfs $nfspt $b" to_umount="$b ${to_umount}" } # Create a generic memory disk. # The 'auto' parameter will attempt to use tmpfs(5), falls back to md(4). # $1 is size in 512-byte sectors, $2 is the mount point. mount_md() { - /sbin/mdmfs -s $1 auto $2 + if [ ${o_verbose} -gt 0 ] ; then + /sbin/mdmfs -XL -s $1 auto $2 + else + /sbin/mdmfs -s $1 auto $2 + fi } # Create the memory filesystem if it has not already been created # create_md() { [ "x`eval echo \\$md_created_$1`" = "x" ] || return # only once if [ "x`eval echo \\$md_size_$1`" = "x" ]; then md_size=10240 else md_size=`eval echo \\$md_size_$1` fi log create_md $1 with size $md_size mount_md $md_size /$1 /bin/chmod 755 /$1 eval md_created_$1=created } # DEBUGGING # # set -v # Figure out our interface and IP. # bootp_ifc="" bootp_ipa="" bootp_ipbca="" class="" if [ ${dlv:=0} -ne 0 ] ; then iflist=`ifconfig -l` for i in ${iflist} ; do set -- `ifconfig ${i}` while [ $# -ge 1 ] ; do if [ "${bootp_ifc}" = "" -a "$1" = "inet" ] ; then bootp_ifc=${i} ; bootp_ipa=${2} ; shift fi if [ "${bootp_ipbca}" = "" -a "$1" = "broadcast" ] ; then bootp_ipbca=$2; shift fi shift done if [ "${bootp_ifc}" != "" ] ; then break fi done # Get the values passed with the T134 bootp cookie. 
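# For example, if the DHCP/BOOTP server hands out the string "lab mediaplayer"
# in option 134, kern.bootp_cookie reads back as "lab mediaplayer" and the
# template list assembled below gains /conf/lab and /conf/mediaplayer
# (directory names here are purely illustrative).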
class="`/sbin/sysctl -qn kern.bootp_cookie`" echo "Interface ${bootp_ifc} IP-Address ${bootp_ipa} Broadcast ${bootp_ipbca} ${class}" fi log Figure out our NFS root path # set -- `mount -t nfs` while [ $# -ge 1 ] ; do if [ "$2" = "on" -a "$3" = "/" ]; then nfsroot="$1" break fi shift done # The list of directories with template files templates="base default" if [ -n "${bootp_ipbca}" ]; then templates="${templates} ${bootp_ipbca} bcast/${bootp_ipbca}" fi if [ -n "${class}" ]; then templates="${templates} ${class}" fi if [ -n "${bootp_ipa}" ]; then templates="${templates} ${bootp_ipa} ip/${bootp_ipa}" fi # If /conf/diskless_remount exists, remount all of /conf. handle_remount /conf # Resolve templates in /conf/base, /conf/default, /conf/${bootp_ipbca}, # and /conf/${bootp_ipa}. For each subdirectory found within these # directories: # # - calculate memory filesystem sizes. If the subdirectory (prior to # NFS remounting) contains the file 'md_size', the contents specified # in 512 byte sectors will be used to size the memory filesystem. Otherwise # 8192 sectors (4MB) is used. # # - handle NFS remounts. If the subdirectory contains the file # diskless_remount, the contents of the file is NFS mounted over # the directory. For example /conf/base/etc/diskless_remount # might contain 'myserver:/etc'. NFS remounts allow you to avoid # having to dup your system directories in /conf. Your server must # be sure to export those filesystems -alldirs, however. # If the diskless_remount file contains a string beginning with a # '/' it is assumed that the local nfsroot should be prepended to # it before attemping to the remount. This allows the root to be # relocated without needing to change the remount files. # log "templates are ${templates}" for i in ${templates} ; do for j in /conf/$i/* ; do [ -d $j ] || continue # memory filesystem size specification subdir=${j##*/} [ -f $j/md_size ] && eval md_size_$subdir=`cat $j/md_size` # remount. Beware, the command is in the file itself! if [ -f $j/remount ]; then if [ -f $j/remount_subdir ]; then k="/conf.tmp/$i/$subdir" [ -d $k ] || continue # Mount the filesystem root where the config data is # on the temporary mount point. nfspt=`/bin/cat $j/remount` $nfspt $k chkerr $? "$nfspt $k" # Now use a nullfs mount to get the data where we # really want to see it. remount_subdir=`/bin/cat $j/remount_subdir` remount_subdir_cmd="mount -t nullfs $k/$remount_subdir" $remount_subdir_cmd $j chkerr $? "$remount_subdir_cmd $j" # XXX check order -- we must force $k to be unmounted # after j, as j depends on k. to_umount="$j $k ${to_umount}" else nfspt=`/bin/cat $j/remount` $nfspt $j chkerr $? "$nfspt $j" to_umount="$j ${to_umount}" # XXX hope it is really a mount! fi fi # NFS remount handle_remount $j done done # - Create all required MFS filesystems and populate them from # our templates. Support both a direct template and a dir.cpio.gz # archive. Support for auxiliary NVRAM. Support dir.remove files containing # a list of relative paths to remove. # # The dir.cpio.gz form is there to make the copy process more efficient, # so if the cpio archive is present, it prevents the files from dir/ # from being copied. for i in ${templates} ; do for j in /conf/$i/* ; do subdir=${j##*/} if [ -d $j -a ! 
-f $j.cpio.gz ]; then create_md $subdir cp -Rp $j/ /$subdir fi done for j in /conf/$i/*.cpio.gz ; do subdir=${j%*.cpio.gz} subdir=${subdir##*/} if [ -f $j ]; then create_md $subdir echo "Loading /$subdir from cpio archive $j" (cd / ; /rescue/tar -xpf $j) fi done for j in /conf/$i/*/extract ; do if [ -x $j ]; then subdir=${j%*/extract} subdir=${subdir##*/} create_md $subdir echo "Loading /$subdir using auxiliary command $j" $j /$subdir fi done for j in /conf/$i/*.remove ; do subdir=${j%*.remove} subdir=${subdir##*/} if [ -f $j ]; then # doubly sure it is a memory disk before rm -rf'ing create_md $subdir (cd /$subdir; rm -rf `/bin/cat $j`) fi done done # umount partitions used to fill the memory filesystems [ -n "${to_umount}" ] && umount $to_umount Index: user/ngie/bug-237403/sbin/ifconfig/ifgre.c =================================================================== --- user/ngie/bug-237403/sbin/ifconfig/ifgre.c (revision 346925) +++ user/ngie/bug-237403/sbin/ifconfig/ifgre.c (revision 346926) @@ -1,123 +1,144 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2008 Andrew Thompson. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR OR HIS RELATIVES BE LIABLE FOR ANY DIRECT, * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR * SERVICES; LOSS OF MIND, USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF * THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include "ifconfig.h" -#define GREBITS "\020\01ENABLE_CSUM\02ENABLE_SEQ" +#define GREBITS "\020\01ENABLE_CSUM\02ENABLE_SEQ\03UDPENCAP" static void gre_status(int s); static void gre_status(int s) { - uint32_t opts = 0; + uint32_t opts, port; + opts = 0; ifr.ifr_data = (caddr_t)&opts; if (ioctl(s, GREGKEY, &ifr) == 0) if (opts != 0) printf("\tgrekey: 0x%x (%u)\n", opts, opts); opts = 0; if (ioctl(s, GREGOPTS, &ifr) != 0 || opts == 0) return; + + port = 0; + ifr.ifr_data = (caddr_t)&port; + if (ioctl(s, GREGPORT, &ifr) == 0 && port != 0) + printf("\tudpport: %u\n", port); printb("\toptions", opts, GREBITS); putchar('\n'); } static void setifgrekey(const char *val, int dummy __unused, int s, const struct afswtch *afp) { uint32_t grekey = strtol(val, NULL, 0); strlcpy(ifr.ifr_name, name, sizeof (ifr.ifr_name)); ifr.ifr_data = (caddr_t)&grekey; if (ioctl(s, GRESKEY, (caddr_t)&ifr) < 0) warn("ioctl (set grekey)"); } static void +setifgreport(const char *val, int dummy __unused, int s, + const struct afswtch *afp) +{ + uint32_t udpport = strtol(val, NULL, 0); + + strlcpy(ifr.ifr_name, name, sizeof (ifr.ifr_name)); + ifr.ifr_data = (caddr_t)&udpport; + if (ioctl(s, GRESPORT, (caddr_t)&ifr) < 0) + warn("ioctl (set udpport)"); +} + +static void setifgreopts(const char *val, int d, int s, const struct afswtch *afp) { uint32_t opts; ifr.ifr_data = (caddr_t)&opts; if (ioctl(s, GREGOPTS, &ifr) == -1) { warn("ioctl(GREGOPTS)"); return; } if (d < 0) opts &= ~(-d); else opts |= d; if (ioctl(s, GRESOPTS, &ifr) == -1) { warn("ioctl(GIFSOPTS)"); return; } } static struct cmd gre_cmds[] = { DEF_CMD_ARG("grekey", setifgrekey), + DEF_CMD_ARG("udpport", setifgreport), DEF_CMD("enable_csum", GRE_ENABLE_CSUM, setifgreopts), DEF_CMD("-enable_csum",-GRE_ENABLE_CSUM,setifgreopts), DEF_CMD("enable_seq", GRE_ENABLE_SEQ, setifgreopts), DEF_CMD("-enable_seq",-GRE_ENABLE_SEQ, setifgreopts), + DEF_CMD("udpencap", GRE_UDPENCAP, setifgreopts), + DEF_CMD("-udpencap",-GRE_UDPENCAP, setifgreopts), }; static struct afswtch af_gre = { .af_name = "af_gre", .af_af = AF_UNSPEC, .af_other_status = gre_status, }; static __constructor void gre_ctor(void) { size_t i; for (i = 0; i < nitems(gre_cmds); i++) cmd_register(&gre_cmds[i]); af_register(&af_gre); } Index: user/ngie/bug-237403/sbin/ipfw/ipfw2.c =================================================================== --- user/ngie/bug-237403/sbin/ipfw/ipfw2.c (revision 346925) +++ user/ngie/bug-237403/sbin/ipfw/ipfw2.c (revision 346926) @@ -1,5583 +1,5587 @@ /*- * Copyright (c) 2002-2003 Luigi Rizzo * Copyright (c) 1996 Alex Nash, Paul Traina, Poul-Henning Kamp * Copyright (c) 1994 Ugen J.S.Antsilevich * * Idea and grammar partially left from: * Copyright (c) 1993 Daniel Boulet * * Redistribution and use in source forms, with and without modification, * are permitted provided that this entire comment appears intact. * * Redistribution in binary form may occur without any restrictions. * Obviously, it would be nice if you gave credit where credit is due * but requiring it would be too onerous. * * This software is provided ``AS IS'' without any warranties of any kind. 
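
The new ifgre.c verbs above map onto command-line usage roughly as follows (addresses,
key and port are placeholder values, and the kernel gre(4) in use must support UDP
encapsulation):

    ifconfig gre0 create
    ifconfig gre0 tunnel 198.51.100.1 198.51.100.2
    ifconfig gre0 inet 192.0.2.1 192.0.2.2
    ifconfig gre0 grekey 100 udpencap
    ifconfig gre0 udpport 4754
    ifconfig gre0        # status now also reports UDPENCAP and, when set, udpport

As with enable_csum/enable_seq, the -udpencap form clears the corresponding option bit
again.
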
* * NEW command line interface for IP firewall facility * * $FreeBSD$ */ #include #include #include #include #include #include "ipfw2.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* ctime */ #include /* _long_to_time */ #include #include #include /* offsetof */ #include #include /* only IFNAMSIZ */ #include #include /* only n_short, n_long */ #include #include #include #include #include struct cmdline_opts co; /* global options */ struct format_opts { int bcwidth; int pcwidth; int show_counters; int show_time; /* show timestamp */ uint32_t set_mask; /* enabled sets mask */ uint32_t flags; /* request flags */ uint32_t first; /* first rule to request */ uint32_t last; /* last rule to request */ uint32_t dcnt; /* number of dynamic states */ ipfw_obj_ctlv *tstate; /* table state data */ }; int resvd_set_number = RESVD_SET; int ipfw_socket = -1; #define CHECK_LENGTH(v, len) do { \ if ((v) < (len)) \ errx(EX_DATAERR, "Rule too long"); \ } while (0) /* * Check if we have enough space in cmd buffer. Note that since * first 8? u32 words are reserved by reserved header, full cmd * buffer can't be used, so we need to protect from buffer overrun * only. At the beginning, cblen is less than actual buffer size by * size of ipfw_insn_u32 instruction + 1 u32 work. This eliminates need * for checking small instructions fitting in given range. * We also (ab)use the fact that ipfw_insn is always the first field * for any custom instruction. */ #define CHECK_CMDLEN CHECK_LENGTH(cblen, F_LEN((ipfw_insn *)cmd)) #define GET_UINT_ARG(arg, min, max, tok, s_x) do { \ if (!av[0]) \ errx(EX_USAGE, "%s: missing argument", match_value(s_x, tok)); \ if (_substrcmp(*av, "tablearg") == 0) { \ arg = IP_FW_TARG; \ break; \ } \ \ { \ long _xval; \ char *end; \ \ _xval = strtol(*av, &end, 10); \ \ if (!isdigit(**av) || *end != '\0' || (_xval == 0 && errno == EINVAL)) \ errx(EX_DATAERR, "%s: invalid argument: %s", \ match_value(s_x, tok), *av); \ \ if (errno == ERANGE || _xval < min || _xval > max) \ errx(EX_DATAERR, "%s: argument is out of range (%u..%u): %s", \ match_value(s_x, tok), min, max, *av); \ \ if (_xval == IP_FW_TARG) \ errx(EX_DATAERR, "%s: illegal argument value: %s", \ match_value(s_x, tok), *av); \ arg = _xval; \ } \ } while (0) static struct _s_x f_tcpflags[] = { { "syn", TH_SYN }, { "fin", TH_FIN }, { "ack", TH_ACK }, { "psh", TH_PUSH }, { "rst", TH_RST }, { "urg", TH_URG }, { "tcp flag", 0 }, { NULL, 0 } }; static struct _s_x f_tcpopts[] = { { "mss", IP_FW_TCPOPT_MSS }, { "maxseg", IP_FW_TCPOPT_MSS }, { "window", IP_FW_TCPOPT_WINDOW }, { "sack", IP_FW_TCPOPT_SACK }, { "ts", IP_FW_TCPOPT_TS }, { "timestamp", IP_FW_TCPOPT_TS }, { "cc", IP_FW_TCPOPT_CC }, { "tcp option", 0 }, { NULL, 0 } }; /* * IP options span the range 0 to 255 so we need to remap them * (though in fact only the low 5 bits are significant). 
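 *
 * Illustrative example (not part of the committed source): an option such as
 *
 *	ipoptions ssrr,!ts
 *
 * is parsed against this table into a set mask (IP_FW_IPOPT_SSRR) and a
 * clear mask (IP_FW_IPOPT_TS) packed into the opcode's arg1, and
 * print_flags() later performs the reverse mapping when the rule is listed.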
*/ static struct _s_x f_ipopts[] = { { "ssrr", IP_FW_IPOPT_SSRR}, { "lsrr", IP_FW_IPOPT_LSRR}, { "rr", IP_FW_IPOPT_RR}, { "ts", IP_FW_IPOPT_TS}, { "ip option", 0 }, { NULL, 0 } }; static struct _s_x f_iptos[] = { { "lowdelay", IPTOS_LOWDELAY}, { "throughput", IPTOS_THROUGHPUT}, { "reliability", IPTOS_RELIABILITY}, { "mincost", IPTOS_MINCOST}, { "congestion", IPTOS_ECN_CE}, { "ecntransport", IPTOS_ECN_ECT0}, { "ip tos option", 0}, { NULL, 0 } }; struct _s_x f_ipdscp[] = { { "af11", IPTOS_DSCP_AF11 >> 2 }, /* 001010 */ { "af12", IPTOS_DSCP_AF12 >> 2 }, /* 001100 */ { "af13", IPTOS_DSCP_AF13 >> 2 }, /* 001110 */ { "af21", IPTOS_DSCP_AF21 >> 2 }, /* 010010 */ { "af22", IPTOS_DSCP_AF22 >> 2 }, /* 010100 */ { "af23", IPTOS_DSCP_AF23 >> 2 }, /* 010110 */ { "af31", IPTOS_DSCP_AF31 >> 2 }, /* 011010 */ { "af32", IPTOS_DSCP_AF32 >> 2 }, /* 011100 */ { "af33", IPTOS_DSCP_AF33 >> 2 }, /* 011110 */ { "af41", IPTOS_DSCP_AF41 >> 2 }, /* 100010 */ { "af42", IPTOS_DSCP_AF42 >> 2 }, /* 100100 */ { "af43", IPTOS_DSCP_AF43 >> 2 }, /* 100110 */ { "be", IPTOS_DSCP_CS0 >> 2 }, /* 000000 */ { "ef", IPTOS_DSCP_EF >> 2 }, /* 101110 */ { "cs0", IPTOS_DSCP_CS0 >> 2 }, /* 000000 */ { "cs1", IPTOS_DSCP_CS1 >> 2 }, /* 001000 */ { "cs2", IPTOS_DSCP_CS2 >> 2 }, /* 010000 */ { "cs3", IPTOS_DSCP_CS3 >> 2 }, /* 011000 */ { "cs4", IPTOS_DSCP_CS4 >> 2 }, /* 100000 */ { "cs5", IPTOS_DSCP_CS5 >> 2 }, /* 101000 */ { "cs6", IPTOS_DSCP_CS6 >> 2 }, /* 110000 */ { "cs7", IPTOS_DSCP_CS7 >> 2 }, /* 100000 */ { NULL, 0 } }; static struct _s_x limit_masks[] = { {"all", DYN_SRC_ADDR|DYN_SRC_PORT|DYN_DST_ADDR|DYN_DST_PORT}, {"src-addr", DYN_SRC_ADDR}, {"src-port", DYN_SRC_PORT}, {"dst-addr", DYN_DST_ADDR}, {"dst-port", DYN_DST_PORT}, {NULL, 0} }; /* * we use IPPROTO_ETHERTYPE as a fake protocol id to call the print routines * This is only used in this code. */ #define IPPROTO_ETHERTYPE 0x1000 static struct _s_x ether_types[] = { /* * Note, we cannot use "-:&/" in the names because they are field * separators in the type specifications. Also, we use s = NULL as * end-delimiter, because a type of 0 can be legal. 
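 *
 * Illustrative use (not part of the committed source): with name resolution
 * enabled, a rule option such as
 *
 *	mac-type arp,rarp
 *
 * is assumed to be resolved by strtoport() with proto == IPPROTO_ETHERTYPE,
 * so "arp" and "rarp" map to 0x0806 and 0x8035 via this table, and
 * print_port() maps the numbers back to names on output.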
*/ { "ip", 0x0800 }, { "ipv4", 0x0800 }, { "ipv6", 0x86dd }, { "arp", 0x0806 }, { "rarp", 0x8035 }, { "vlan", 0x8100 }, { "loop", 0x9000 }, { "trail", 0x1000 }, { "at", 0x809b }, { "atalk", 0x809b }, { "aarp", 0x80f3 }, { "pppoe_disc", 0x8863 }, { "pppoe_sess", 0x8864 }, { "ipx_8022", 0x00E0 }, { "ipx_8023", 0x0000 }, { "ipx_ii", 0x8137 }, { "ipx_snap", 0x8137 }, { "ipx", 0x8137 }, { "ns", 0x0600 }, { NULL, 0 } }; static struct _s_x rule_eactions[] = { { "nat64clat", TOK_NAT64CLAT }, { "nat64lsn", TOK_NAT64LSN }, { "nat64stl", TOK_NAT64STL }, { "nptv6", TOK_NPTV6 }, { "tcp-setmss", TOK_TCPSETMSS }, { NULL, 0 } /* terminator */ }; static struct _s_x rule_actions[] = { { "abort6", TOK_ABORT6 }, { "abort", TOK_ABORT }, { "accept", TOK_ACCEPT }, { "pass", TOK_ACCEPT }, { "allow", TOK_ACCEPT }, { "permit", TOK_ACCEPT }, { "count", TOK_COUNT }, { "pipe", TOK_PIPE }, { "queue", TOK_QUEUE }, { "divert", TOK_DIVERT }, { "tee", TOK_TEE }, { "netgraph", TOK_NETGRAPH }, { "ngtee", TOK_NGTEE }, { "fwd", TOK_FORWARD }, { "forward", TOK_FORWARD }, { "skipto", TOK_SKIPTO }, { "deny", TOK_DENY }, { "drop", TOK_DENY }, { "reject", TOK_REJECT }, { "reset6", TOK_RESET6 }, { "reset", TOK_RESET }, { "unreach6", TOK_UNREACH6 }, { "unreach", TOK_UNREACH }, { "check-state", TOK_CHECKSTATE }, { "//", TOK_COMMENT }, { "nat", TOK_NAT }, { "reass", TOK_REASS }, { "setfib", TOK_SETFIB }, { "setdscp", TOK_SETDSCP }, { "call", TOK_CALL }, { "return", TOK_RETURN }, { "eaction", TOK_EACTION }, { "tcp-setmss", TOK_TCPSETMSS }, { NULL, 0 } /* terminator */ }; static struct _s_x rule_action_params[] = { { "altq", TOK_ALTQ }, { "log", TOK_LOG }, { "tag", TOK_TAG }, { "untag", TOK_UNTAG }, { NULL, 0 } /* terminator */ }; /* * The 'lookup' instruction accepts one of the following arguments. * -1 is a terminator for the list. * Arguments are passed as v[1] in O_DST_LOOKUP options. 
*/ static int lookup_key[] = { TOK_DSTIP, TOK_SRCIP, TOK_DSTPORT, TOK_SRCPORT, TOK_UID, TOK_JAIL, TOK_DSCP, -1 }; static struct _s_x rule_options[] = { { "tagged", TOK_TAGGED }, { "uid", TOK_UID }, { "gid", TOK_GID }, { "jail", TOK_JAIL }, { "in", TOK_IN }, { "limit", TOK_LIMIT }, { "set-limit", TOK_SETLIMIT }, { "keep-state", TOK_KEEPSTATE }, { "record-state", TOK_RECORDSTATE }, { "bridged", TOK_LAYER2 }, { "layer2", TOK_LAYER2 }, { "out", TOK_OUT }, { "diverted", TOK_DIVERTED }, { "diverted-loopback", TOK_DIVERTEDLOOPBACK }, { "diverted-output", TOK_DIVERTEDOUTPUT }, { "xmit", TOK_XMIT }, { "recv", TOK_RECV }, { "via", TOK_VIA }, { "fragment", TOK_FRAG }, { "frag", TOK_FRAG }, { "fib", TOK_FIB }, { "ipoptions", TOK_IPOPTS }, { "ipopts", TOK_IPOPTS }, { "iplen", TOK_IPLEN }, { "ipid", TOK_IPID }, { "ipprecedence", TOK_IPPRECEDENCE }, { "dscp", TOK_DSCP }, { "iptos", TOK_IPTOS }, { "ipttl", TOK_IPTTL }, { "ipversion", TOK_IPVER }, { "ipver", TOK_IPVER }, { "estab", TOK_ESTAB }, { "established", TOK_ESTAB }, { "setup", TOK_SETUP }, { "sockarg", TOK_SOCKARG }, { "tcpdatalen", TOK_TCPDATALEN }, { "tcpflags", TOK_TCPFLAGS }, { "tcpflgs", TOK_TCPFLAGS }, { "tcpoptions", TOK_TCPOPTS }, { "tcpopts", TOK_TCPOPTS }, { "tcpseq", TOK_TCPSEQ }, { "tcpack", TOK_TCPACK }, { "tcpwin", TOK_TCPWIN }, { "icmptype", TOK_ICMPTYPES }, { "icmptypes", TOK_ICMPTYPES }, { "dst-ip", TOK_DSTIP }, { "src-ip", TOK_SRCIP }, { "dst-port", TOK_DSTPORT }, { "src-port", TOK_SRCPORT }, { "proto", TOK_PROTO }, { "MAC", TOK_MAC }, { "mac", TOK_MAC }, { "mac-type", TOK_MACTYPE }, { "verrevpath", TOK_VERREVPATH }, { "versrcreach", TOK_VERSRCREACH }, { "antispoof", TOK_ANTISPOOF }, { "ipsec", TOK_IPSEC }, { "icmp6type", TOK_ICMP6TYPES }, { "icmp6types", TOK_ICMP6TYPES }, { "ext6hdr", TOK_EXT6HDR}, { "flow-id", TOK_FLOWID}, { "ipv6", TOK_IPV6}, { "ip6", TOK_IPV6}, { "ipv4", TOK_IPV4}, { "ip4", TOK_IPV4}, { "dst-ipv6", TOK_DSTIP6}, { "dst-ip6", TOK_DSTIP6}, { "src-ipv6", TOK_SRCIP6}, { "src-ip6", TOK_SRCIP6}, { "lookup", TOK_LOOKUP}, { "flow", TOK_FLOW}, { "defer-action", TOK_SKIPACTION }, { "defer-immediate-action", TOK_SKIPACTION }, { "//", TOK_COMMENT }, { "not", TOK_NOT }, /* pseudo option */ { "!", /* escape ? */ TOK_NOT }, /* pseudo option */ { "or", TOK_OR }, /* pseudo option */ { "|", /* escape */ TOK_OR }, /* pseudo option */ { "{", TOK_STARTBRACE }, /* pseudo option */ { "(", TOK_STARTBRACE }, /* pseudo option */ { "}", TOK_ENDBRACE }, /* pseudo option */ { ")", TOK_ENDBRACE }, /* pseudo option */ { NULL, 0 } /* terminator */ }; void bprint_uint_arg(struct buf_pr *bp, const char *str, uint32_t arg); static int ipfw_get_config(struct cmdline_opts *co, struct format_opts *fo, ipfw_cfg_lheader **pcfg, size_t *psize); static int ipfw_show_config(struct cmdline_opts *co, struct format_opts *fo, ipfw_cfg_lheader *cfg, size_t sz, int ac, char **av); static void ipfw_list_tifaces(void); struct tidx; static uint16_t pack_object(struct tidx *tstate, char *name, int otype); static uint16_t pack_table(struct tidx *tstate, char *name); static char *table_search_ctlv(ipfw_obj_ctlv *ctlv, uint16_t idx); static void object_sort_ctlv(ipfw_obj_ctlv *ctlv); static char *object_search_ctlv(ipfw_obj_ctlv *ctlv, uint16_t idx, uint16_t type); /* * Simple string buffer API. * Used to simplify buffer passing between function and for * transparent overrun handling. */ /* * Allocates new buffer of given size @sz. * * Returns 0 on success. 
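 *
 * Illustrative usage sketch (not from the original source), mirroring how
 * the rest of this file drives the buffer API:
 *
 *	struct buf_pr bp;
 *
 *	if (bp_alloc(&bp, 4096) == 0) {
 *		bprintf(&bp, "rule %05u", 100U);
 *		printf("%s\n", bp.buf);
 *		bp_free(&bp);
 *	}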
*/ int bp_alloc(struct buf_pr *b, size_t size) { memset(b, 0, sizeof(struct buf_pr)); if ((b->buf = calloc(1, size)) == NULL) return (ENOMEM); b->ptr = b->buf; b->size = size; b->avail = b->size; return (0); } void bp_free(struct buf_pr *b) { free(b->buf); } /* * Flushes buffer so new writer start from beginning. */ void bp_flush(struct buf_pr *b) { b->ptr = b->buf; b->avail = b->size; b->buf[0] = '\0'; } /* * Print message specified by @format and args. * Automatically manage buffer space and transparently handle * buffer overruns. * * Returns number of bytes that should have been printed. */ int bprintf(struct buf_pr *b, char *format, ...) { va_list args; int i; va_start(args, format); i = vsnprintf(b->ptr, b->avail, format, args); va_end(args); if (i > b->avail || i < 0) { /* Overflow or print error */ b->avail = 0; } else { b->ptr += i; b->avail -= i; } b->needed += i; return (i); } /* * Special values printer for tablearg-aware opcodes. */ void bprint_uint_arg(struct buf_pr *bp, const char *str, uint32_t arg) { if (str != NULL) bprintf(bp, "%s", str); if (arg == IP_FW_TARG) bprintf(bp, "tablearg"); else bprintf(bp, "%u", arg); } /* * Helper routine to print a possibly unaligned uint64_t on * various platform. If width > 0, print the value with * the desired width, followed by a space; * otherwise, return the required width. */ int pr_u64(struct buf_pr *b, uint64_t *pd, int width) { #ifdef TCC #define U64_FMT "I64" #else #define U64_FMT "llu" #endif uint64_t u; unsigned long long d; bcopy (pd, &u, sizeof(u)); d = u; return (width > 0) ? bprintf(b, "%*" U64_FMT " ", width, d) : snprintf(NULL, 0, "%" U64_FMT, d) ; #undef U64_FMT } void * safe_calloc(size_t number, size_t size) { void *ret = calloc(number, size); if (ret == NULL) err(EX_OSERR, "calloc"); return ret; } void * safe_realloc(void *ptr, size_t size) { void *ret = realloc(ptr, size); if (ret == NULL) err(EX_OSERR, "realloc"); return ret; } /* * Compare things like interface or table names. */ int stringnum_cmp(const char *a, const char *b) { int la, lb; la = strlen(a); lb = strlen(b); if (la > lb) return (1); else if (la < lb) return (-01); return (strcmp(a, b)); } /* * conditionally runs the command. * Selected options or negative -> getsockopt */ int do_cmd(int optname, void *optval, uintptr_t optlen) { int i; if (co.test_only) return 0; if (ipfw_socket == -1) ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); if (ipfw_socket < 0) err(EX_UNAVAILABLE, "socket"); if (optname == IP_FW_GET || optname == IP_DUMMYNET_GET || optname == IP_FW_ADD || optname == IP_FW3 || optname == IP_FW_NAT_GET_CONFIG || optname < 0 || optname == IP_FW_NAT_GET_LOG) { if (optname < 0) optname = -optname; i = getsockopt(ipfw_socket, IPPROTO_IP, optname, optval, (socklen_t *)optlen); } else { i = setsockopt(ipfw_socket, IPPROTO_IP, optname, optval, optlen); } return i; } /* * do_set3 - pass ipfw control cmd to kernel * @optname: option name * @optval: pointer to option data * @optlen: option length * * Assumes op3 header is already embedded. * Calls setsockopt() with IP_FW3 as kernel-visible opcode. * Returns 0 on success or errno otherwise. 
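 *
 * Illustrative calling pattern (not from the original source); the one
 * requirement noted above is that the ip_fw3_opheader sits at the very
 * start of the request buffer:
 *
 *	struct {
 *		ip_fw3_opheader	opheader;	(must be first)
 *		uint32_t	payload;	(request-specific data)
 *	} req;
 *
 *	memset(&req, 0, sizeof(req));
 *	if (do_set3(SOME_IP_FW3_OPCODE, &req.opheader, sizeof(req)) != 0)
 *		warn("do_set3");
 *
 * SOME_IP_FW3_OPCODE stands in for whichever IP_FW3 sub-opcode the caller
 * needs; it is a placeholder, not a real constant.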
*/ int do_set3(int optname, ip_fw3_opheader *op3, size_t optlen) { if (co.test_only) return (0); if (ipfw_socket == -1) ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); if (ipfw_socket < 0) err(EX_UNAVAILABLE, "socket"); op3->opcode = optname; return (setsockopt(ipfw_socket, IPPROTO_IP, IP_FW3, op3, optlen)); } /* * do_get3 - pass ipfw control cmd to kernel * @optname: option name * @optval: pointer to option data * @optlen: pointer to option length * * Assumes op3 header is already embedded. * Calls getsockopt() with IP_FW3 as kernel-visible opcode. * Returns 0 on success or errno otherwise. */ int do_get3(int optname, ip_fw3_opheader *op3, size_t *optlen) { int error; socklen_t len; if (co.test_only) return (0); if (ipfw_socket == -1) ipfw_socket = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); if (ipfw_socket < 0) err(EX_UNAVAILABLE, "socket"); op3->opcode = optname; len = *optlen; error = getsockopt(ipfw_socket, IPPROTO_IP, IP_FW3, op3, &len); *optlen = len; return (error); } /** * match_token takes a table and a string, returns the value associated * with the string (-1 in case of failure). */ int match_token(struct _s_x *table, const char *string) { struct _s_x *pt; uint i = strlen(string); for (pt = table ; i && pt->s != NULL ; pt++) if (strlen(pt->s) == i && !bcmp(string, pt->s, i)) return pt->x; return (-1); } /** * match_token_relaxed takes a table and a string, returns the value associated * with the string for the best match. * * Returns: * value from @table for matched records * -1 for non-matched records * -2 if more than one records match @string. */ int match_token_relaxed(struct _s_x *table, const char *string) { struct _s_x *pt, *m; int i, c; i = strlen(string); c = 0; for (pt = table ; i != 0 && pt->s != NULL ; pt++) { if (strncmp(pt->s, string, i) != 0) continue; m = pt; c++; } if (c == 1) return (m->x); return (c > 0 ? -2: -1); } int get_token(struct _s_x *table, const char *string, const char *errbase) { int tcmd; if ((tcmd = match_token_relaxed(table, string)) < 0) errx(EX_USAGE, "%s %s %s", (tcmd == 0) ? "invalid" : "ambiguous", errbase, string); return (tcmd); } /** * match_value takes a table and a value, returns the string associated * with the value (NULL in case of failure). */ char const * match_value(struct _s_x *p, int value) { for (; p->s != NULL; p++) if (p->x == value) return p->s; return NULL; } size_t concat_tokens(char *buf, size_t bufsize, struct _s_x *table, char *delimiter) { struct _s_x *pt; int l; size_t sz; for (sz = 0, pt = table ; pt->s != NULL; pt++) { l = snprintf(buf + sz, bufsize - sz, "%s%s", (sz == 0) ? "" : delimiter, pt->s); sz += l; bufsize += l; if (sz > bufsize) return (bufsize); } return (sz); } /* * helper function to process a set of flags and set bits in the * appropriate masks. 
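 *
 * Illustrative example (not part of the original source): parsing the
 * specification "syn,!ack" against f_tcpflags
 *
 *	uint32_t set = 0, clear = 0;
 *	char *bad, spec[] = "syn,!ack";
 *	int rc = fill_flags(f_tcpflags, spec, &bad, &set, &clear);
 *
 * leaves rc == 0, set == TH_SYN and clear == TH_ACK.  The buffer must be
 * writable because the parser splits it in place.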
*/ int fill_flags(struct _s_x *flags, char *p, char **e, uint32_t *set, uint32_t *clear) { char *q; /* points to the separator */ int val; uint32_t *which; /* mask we are working on */ while (p && *p) { if (*p == '!') { p++; which = clear; } else which = set; q = strchr(p, ','); if (q) *q++ = '\0'; val = match_token(flags, p); if (val <= 0) { if (e != NULL) *e = p; return (-1); } *which |= (uint32_t)val; p = q; } return (0); } void print_flags_buffer(char *buf, size_t sz, struct _s_x *list, uint32_t set) { char const *comma = ""; int i, l; for (i = 0; list[i].x != 0; i++) { if ((set & list[i].x) == 0) continue; set &= ~list[i].x; l = snprintf(buf, sz, "%s%s", comma, list[i].s); if (l >= sz) return; comma = ","; buf += l; sz -=l; } } /* * _substrcmp takes two strings and returns 1 if they do not match, * and 0 if they match exactly or the first string is a sub-string * of the second. A warning is printed to stderr in the case that the * first string is a sub-string of the second. * * This function will be removed in the future through the usual * deprecation process. */ int _substrcmp(const char *str1, const char* str2) { if (strncmp(str1, str2, strlen(str1)) != 0) return 1; if (strlen(str1) != strlen(str2)) warnx("DEPRECATED: '%s' matched '%s' as a sub-string", str1, str2); return 0; } /* * _substrcmp2 takes three strings and returns 1 if the first two do not match, * and 0 if they match exactly or the second string is a sub-string * of the first. A warning is printed to stderr in the case that the * first string does not match the third. * * This function exists to warn about the bizarre construction * strncmp(str, "by", 2) which is used to allow people to use a shortcut * for "bytes". The problem is that in addition to accepting "by", * "byt", "byte", and "bytes", it also excepts "by_rabid_dogs" and any * other string beginning with "by". * * This function will be removed in the future through the usual * deprecation process. */ int _substrcmp2(const char *str1, const char* str2, const char* str3) { if (strncmp(str1, str2, strlen(str2)) != 0) return 1; if (strcmp(str1, str3) != 0) warnx("DEPRECATED: '%s' matched '%s'", str1, str3); return 0; } /* * prints one port, symbolic or numeric */ static void print_port(struct buf_pr *bp, int proto, uint16_t port) { if (proto == IPPROTO_ETHERTYPE) { char const *s; if (co.do_resolv && (s = match_value(ether_types, port)) ) bprintf(bp, "%s", s); else bprintf(bp, "0x%04x", port); } else { struct servent *se = NULL; if (co.do_resolv) { struct protoent *pe = getprotobynumber(proto); se = getservbyport(htons(port), pe ? pe->p_name : NULL); } if (se) bprintf(bp, "%s", se->s_name); else bprintf(bp, "%d", port); } } static struct _s_x _port_name[] = { {"dst-port", O_IP_DSTPORT}, {"src-port", O_IP_SRCPORT}, {"ipid", O_IPID}, {"iplen", O_IPLEN}, {"ipttl", O_IPTTL}, {"mac-type", O_MAC_TYPE}, {"tcpdatalen", O_TCPDATALEN}, {"tcpwin", O_TCPWIN}, {"tagged", O_TAGGED}, {NULL, 0} }; /* * Print the values in a list 16-bit items of the types above. * XXX todo: add support for mask. 
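 *
 * For illustration (not from the original source): a dst-port opcode holding
 * the pairs {80,80} and {8000,8080} is rendered as
 *
 *	 dst-port 80,8000-8080
 *
 * i.e. the opcode name from _port_name, single ports as-is, unequal pairs as
 * ranges, and commas in between.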
*/ static void print_newports(struct buf_pr *bp, ipfw_insn_u16 *cmd, int proto, int opcode) { uint16_t *p = cmd->ports; int i; char const *sep; if (opcode != 0) { sep = match_value(_port_name, opcode); if (sep == NULL) sep = "???"; bprintf(bp, " %s", sep); } sep = " "; for (i = F_LEN((ipfw_insn *)cmd) - 1; i > 0; i--, p += 2) { bprintf(bp, "%s", sep); print_port(bp, proto, p[0]); if (p[0] != p[1]) { bprintf(bp, "-"); print_port(bp, proto, p[1]); } sep = ","; } } /* * Like strtol, but also translates service names into port numbers * for some protocols. * In particular: * proto == -1 disables the protocol check; * proto == IPPROTO_ETHERTYPE looks up an internal table * proto == matches the values there. * Returns *end == s in case the parameter is not found. */ static int strtoport(char *s, char **end, int base, int proto) { char *p, *buf; char *s1; int i; *end = s; /* default - not found */ if (*s == '\0') return 0; /* not found */ if (isdigit(*s)) return strtol(s, end, base); /* * find separator. '\\' escapes the next char. */ for (s1 = s; *s1 && (isalnum(*s1) || *s1 == '\\' || *s1 == '_' || *s1 == '.') ; s1++) if (*s1 == '\\' && s1[1] != '\0') s1++; buf = safe_calloc(s1 - s + 1, 1); /* * copy into a buffer skipping backslashes */ for (p = s, i = 0; p != s1 ; p++) if (*p != '\\') buf[i++] = *p; buf[i++] = '\0'; if (proto == IPPROTO_ETHERTYPE) { i = match_token(ether_types, buf); free(buf); if (i != -1) { /* found */ *end = s1; return i; } } else { struct protoent *pe = NULL; struct servent *se; if (proto != 0) pe = getprotobynumber(proto); setservent(1); se = getservbyname(buf, pe ? pe->p_name : NULL); free(buf); if (se != NULL) { *end = s1; return ntohs(se->s_port); } } return 0; /* not found */ } /* * Fill the body of the command with the list of port ranges. */ static int fill_newports(ipfw_insn_u16 *cmd, char *av, int proto, int cblen) { uint16_t a, b, *p = cmd->ports; int i = 0; char *s = av; while (*s) { a = strtoport(av, &s, 0, proto); if (s == av) /* empty or invalid argument */ return (0); CHECK_LENGTH(cblen, i + 2); switch (*s) { case '-': /* a range */ av = s + 1; b = strtoport(av, &s, 0, proto); /* Reject expressions like '1-abc' or '1-2-3'. */ if (s == av || (*s != ',' && *s != '\0')) return (0); p[0] = a; p[1] = b; break; case ',': /* comma separated list */ case '\0': p[0] = p[1] = a; break; default: warnx("port list: invalid separator <%c> in <%s>", *s, av); return (0); } i++; p += 2; av = s + 1; } if (i > 0) { if (i + 1 > F_LEN_MASK) errx(EX_DATAERR, "too many ports/ranges\n"); cmd->o.len |= i + 1; /* leave F_NOT and F_OR untouched */ } return (i); } /* * Fill the body of the command with the list of DiffServ codepoints. 
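 *
 * Worked example (illustrative): the 64 codepoints are split across two
 * 32-bit words at code 32, so for av = "af11,cs6"
 *
 *	af11 -> code 10 -> low  |= 1 << 10
 *	cs6  -> code 48 -> high |= 1 << (48 - 32)
 *
 * per the f_ipdscp table above and the split performed below.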
*/ static void fill_dscp(ipfw_insn *cmd, char *av, int cblen) { uint32_t *low, *high; char *s = av, *a; int code; cmd->opcode = O_DSCP; cmd->len |= F_INSN_SIZE(ipfw_insn_u32) + 1; CHECK_CMDLEN; low = (uint32_t *)(cmd + 1); high = low + 1; *low = 0; *high = 0; while (s != NULL) { a = strchr(s, ','); if (a != NULL) *a++ = '\0'; if (isalpha(*s)) { if ((code = match_token(f_ipdscp, s)) == -1) errx(EX_DATAERR, "Unknown DSCP code"); } else { code = strtoul(s, NULL, 10); if (code < 0 || code > 63) errx(EX_DATAERR, "Invalid DSCP value"); } if (code >= 32) *high |= 1 << (code - 32); else *low |= 1 << code; s = a; } } static struct _s_x icmpcodes[] = { { "net", ICMP_UNREACH_NET }, { "host", ICMP_UNREACH_HOST }, { "protocol", ICMP_UNREACH_PROTOCOL }, { "port", ICMP_UNREACH_PORT }, { "needfrag", ICMP_UNREACH_NEEDFRAG }, { "srcfail", ICMP_UNREACH_SRCFAIL }, { "net-unknown", ICMP_UNREACH_NET_UNKNOWN }, { "host-unknown", ICMP_UNREACH_HOST_UNKNOWN }, { "isolated", ICMP_UNREACH_ISOLATED }, { "net-prohib", ICMP_UNREACH_NET_PROHIB }, { "host-prohib", ICMP_UNREACH_HOST_PROHIB }, { "tosnet", ICMP_UNREACH_TOSNET }, { "toshost", ICMP_UNREACH_TOSHOST }, { "filter-prohib", ICMP_UNREACH_FILTER_PROHIB }, { "host-precedence", ICMP_UNREACH_HOST_PRECEDENCE }, { "precedence-cutoff", ICMP_UNREACH_PRECEDENCE_CUTOFF }, { NULL, 0 } }; static void fill_reject_code(u_short *codep, char *str) { int val; char *s; val = strtoul(str, &s, 0); if (s == str || *s != '\0' || val >= 0x100) val = match_token(icmpcodes, str); if (val < 0) errx(EX_DATAERR, "unknown ICMP unreachable code ``%s''", str); *codep = val; return; } static void print_reject_code(struct buf_pr *bp, uint16_t code) { char const *s; if ((s = match_value(icmpcodes, code)) != NULL) bprintf(bp, "unreach %s", s); else bprintf(bp, "unreach %u", code); } /* * Returns the number of bits set (from left) in a contiguous bitmask, * or -1 if the mask is not contiguous. * XXX this needs a proper fix. * This effectively works on masks in big-endian (network) format. * when compiled on little endian architectures. * * First bit is bit 7 of the first byte -- note, for MAC addresses, * the first bit on the wire is bit 0 of the first byte. * len is the max length in bits. */ int contigmask(uint8_t *p, int len) { int i, n; for (i=0; iarg1 & 0xff; uint8_t clear = (cmd->arg1 >> 8) & 0xff; if (list == f_tcpflags && set == TH_SYN && clear == TH_ACK) { bprintf(bp, " setup"); return; } bprintf(bp, " %s ", name); for (i=0; list[i].x != 0; i++) { if (set & list[i].x) { set &= ~list[i].x; bprintf(bp, "%s%s", comma, list[i].s); comma = ","; } if (clear & list[i].x) { clear &= ~list[i].x; bprintf(bp, "%s!%s", comma, list[i].s); comma = ","; } } } /* * Print the ip address contained in a command. 
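 *
 * Sample renderings (illustrative, not from the original source): "me" for
 * the O_IP_*_ME opcodes, "any" for an all-zero mask, "1.2.3.0/24" for an
 * address/mask pair, and "table(hosts,42)" for a lookup against a
 * hypothetical table named hosts with value 42.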
*/ static void print_ip(struct buf_pr *bp, const struct format_opts *fo, ipfw_insn_ip *cmd) { struct hostent *he = NULL; struct in_addr *ia; uint32_t len = F_LEN((ipfw_insn *)cmd); uint32_t *a = ((ipfw_insn_u32 *)cmd)->d; char *t; bprintf(bp, " "); if (cmd->o.opcode == O_IP_DST_LOOKUP && len > F_INSN_SIZE(ipfw_insn_u32)) { uint32_t d = a[1]; const char *arg = ""; if (d < sizeof(lookup_key)/sizeof(lookup_key[0])) arg = match_value(rule_options, lookup_key[d]); t = table_search_ctlv(fo->tstate, ((ipfw_insn *)cmd)->arg1); bprintf(bp, "lookup %s %s", arg, t); return; } if (cmd->o.opcode == O_IP_SRC_ME || cmd->o.opcode == O_IP_DST_ME) { bprintf(bp, "me"); return; } if (cmd->o.opcode == O_IP_SRC_LOOKUP || cmd->o.opcode == O_IP_DST_LOOKUP) { t = table_search_ctlv(fo->tstate, ((ipfw_insn *)cmd)->arg1); bprintf(bp, "table(%s", t); if (len == F_INSN_SIZE(ipfw_insn_u32)) bprintf(bp, ",%u", *a); bprintf(bp, ")"); return; } if (cmd->o.opcode == O_IP_SRC_SET || cmd->o.opcode == O_IP_DST_SET) { uint32_t x, *map = (uint32_t *)&(cmd->mask); int i, j; char comma = '{'; x = cmd->o.arg1 - 1; x = htonl( ~x ); cmd->addr.s_addr = htonl(cmd->addr.s_addr); bprintf(bp, "%s/%d", inet_ntoa(cmd->addr), contigmask((uint8_t *)&x, 32)); x = cmd->addr.s_addr = htonl(cmd->addr.s_addr); x &= 0xff; /* base */ /* * Print bits and ranges. * Locate first bit set (i), then locate first bit unset (j). * If we have 3+ consecutive bits set, then print them as a * range, otherwise only print the initial bit and rescan. */ for (i=0; i < cmd->o.arg1; i++) if (map[i/32] & (1<<(i & 31))) { for (j=i+1; j < cmd->o.arg1; j++) if (!(map[ j/32] & (1<<(j & 31)))) break; bprintf(bp, "%c%d", comma, i+x); if (j>i+2) { /* range has at least 3 elements */ bprintf(bp, "-%d", j-1+x); i = j-1; } comma = ','; } bprintf(bp, "}"); return; } /* * len == 2 indicates a single IP, whereas lists of 1 or more * addr/mask pairs have len = (2n+1). We convert len to n so we * use that to count the number of entries. */ for (len = len / 2; len > 0; len--, a += 2) { int mb = /* mask length */ (cmd->o.opcode == O_IP_SRC || cmd->o.opcode == O_IP_DST) ? 
32 : contigmask((uint8_t *)&(a[1]), 32); if (mb == 32 && co.do_resolv) he = gethostbyaddr((char *)&(a[0]), sizeof(in_addr_t), AF_INET); if (he != NULL) /* resolved to name */ bprintf(bp, "%s", he->h_name); else if (mb == 0) /* any */ bprintf(bp, "any"); else { /* numeric IP followed by some kind of mask */ ia = (struct in_addr *)&a[0]; bprintf(bp, "%s", inet_ntoa(*ia)); if (mb < 0) { ia = (struct in_addr *)&a[1]; bprintf(bp, ":%s", inet_ntoa(*ia)); } else if (mb < 32) bprintf(bp, "/%d", mb); } if (len > 1) bprintf(bp, ","); } } /* * prints a MAC address/mask pair */ static void format_mac(struct buf_pr *bp, uint8_t *addr, uint8_t *mask) { int l = contigmask(mask, 48); if (l == 0) bprintf(bp, " any"); else { bprintf(bp, " %02x:%02x:%02x:%02x:%02x:%02x", addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]); if (l == -1) bprintf(bp, "&%02x:%02x:%02x:%02x:%02x:%02x", mask[0], mask[1], mask[2], mask[3], mask[4], mask[5]); else if (l < 48) bprintf(bp, "/%d", l); } } static void print_mac(struct buf_pr *bp, ipfw_insn_mac *mac) { bprintf(bp, " MAC"); format_mac(bp, mac->addr, mac->mask); format_mac(bp, mac->addr + 6, mac->mask + 6); } static void fill_icmptypes(ipfw_insn_u32 *cmd, char *av) { uint8_t type; cmd->d[0] = 0; while (*av) { if (*av == ',') av++; type = strtoul(av, &av, 0); if (*av != ',' && *av != '\0') errx(EX_DATAERR, "invalid ICMP type"); if (type > 31) errx(EX_DATAERR, "ICMP type out of range"); cmd->d[0] |= 1 << type; } cmd->o.opcode = O_ICMPTYPE; cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32); } static void print_icmptypes(struct buf_pr *bp, ipfw_insn_u32 *cmd) { int i; char sep= ' '; bprintf(bp, " icmptypes"); for (i = 0; i < 32; i++) { if ( (cmd->d[0] & (1 << (i))) == 0) continue; bprintf(bp, "%c%d", sep, i); sep = ','; } } static void print_dscp(struct buf_pr *bp, ipfw_insn_u32 *cmd) { int i = 0; uint32_t *v; char sep= ' '; const char *code; bprintf(bp, " dscp"); v = cmd->d; while (i < 64) { if (*v & (1 << i)) { if ((code = match_value(f_ipdscp, i)) != NULL) bprintf(bp, "%c%s", sep, code); else bprintf(bp, "%c%d", sep, i); sep = ','; } if ((++i % 32) == 0) v++; } } #define insntod(cmd, type) ((ipfw_insn_ ## type *)(cmd)) struct show_state { struct ip_fw_rule *rule; const ipfw_insn *eaction; uint8_t *printed; int flags; #define HAVE_PROTO 0x0001 #define HAVE_SRCIP 0x0002 #define HAVE_DSTIP 0x0004 #define HAVE_PROBE_STATE 0x0008 int proto; int or_block; }; static int init_show_state(struct show_state *state, struct ip_fw_rule *rule) { state->printed = calloc(rule->cmd_len, sizeof(uint8_t)); if (state->printed == NULL) return (ENOMEM); state->rule = rule; state->eaction = NULL; state->flags = 0; state->proto = 0; state->or_block = 0; return (0); } static void free_show_state(struct show_state *state) { free(state->printed); } static uint8_t is_printed_opcode(struct show_state *state, const ipfw_insn *cmd) { return (state->printed[cmd - state->rule->cmd]); } static void mark_printed(struct show_state *state, const ipfw_insn *cmd) { state->printed[cmd - state->rule->cmd] = 1; } static void print_limit_mask(struct buf_pr *bp, const ipfw_insn_limit *limit) { struct _s_x *p = limit_masks; char const *comma = " "; uint8_t x; for (x = limit->limit_mask; p->x != 0; p++) { if ((x & p->x) == p->x) { x &= ~p->x; bprintf(bp, "%s%s", comma, p->s); comma = ","; } } bprint_uint_arg(bp, " ", limit->conn_limit); } static int print_instruction(struct buf_pr *bp, const struct format_opts *fo, struct show_state *state, ipfw_insn *cmd) { struct protoent *pe; struct passwd *pwd; struct group *grp; const 
char *s; double d; if (is_printed_opcode(state, cmd)) return (0); if ((cmd->len & F_OR) != 0 && state->or_block == 0) bprintf(bp, " {"); if (cmd->opcode != O_IN && (cmd->len & F_NOT) != 0) bprintf(bp, " not"); switch (cmd->opcode) { case O_PROB: d = 1.0 * insntod(cmd, u32)->d[0] / 0x7fffffff; bprintf(bp, "prob %f ", d); break; case O_PROBE_STATE: /* no need to print anything here */ state->flags |= HAVE_PROBE_STATE; break; case O_IP_SRC: case O_IP_SRC_LOOKUP: case O_IP_SRC_MASK: case O_IP_SRC_ME: case O_IP_SRC_SET: if (state->flags & HAVE_SRCIP) bprintf(bp, " src-ip"); print_ip(bp, fo, insntod(cmd, ip)); break; case O_IP_DST: case O_IP_DST_LOOKUP: case O_IP_DST_MASK: case O_IP_DST_ME: case O_IP_DST_SET: if (state->flags & HAVE_DSTIP) bprintf(bp, " dst-ip"); print_ip(bp, fo, insntod(cmd, ip)); break; case O_IP6_SRC: case O_IP6_SRC_MASK: case O_IP6_SRC_ME: if (state->flags & HAVE_SRCIP) bprintf(bp, " src-ip6"); print_ip6(bp, insntod(cmd, ip6)); break; case O_IP6_DST: case O_IP6_DST_MASK: case O_IP6_DST_ME: if (state->flags & HAVE_DSTIP) bprintf(bp, " dst-ip6"); print_ip6(bp, insntod(cmd, ip6)); break; case O_FLOW6ID: print_flow6id(bp, insntod(cmd, u32)); break; case O_IP_DSTPORT: case O_IP_SRCPORT: print_newports(bp, insntod(cmd, u16), state->proto, (state->flags & (HAVE_SRCIP | HAVE_DSTIP)) == (HAVE_SRCIP | HAVE_DSTIP) ? cmd->opcode: 0); break; case O_PROTO: pe = getprotobynumber(cmd->arg1); if (state->flags & HAVE_PROTO) bprintf(bp, " proto"); if (pe != NULL) bprintf(bp, " %s", pe->p_name); else bprintf(bp, " %u", cmd->arg1); state->proto = cmd->arg1; break; case O_MACADDR2: print_mac(bp, insntod(cmd, mac)); break; case O_MAC_TYPE: print_newports(bp, insntod(cmd, u16), IPPROTO_ETHERTYPE, cmd->opcode); break; case O_FRAG: bprintf(bp, " frag"); break; case O_FIB: bprintf(bp, " fib %u", cmd->arg1); break; case O_SOCKARG: bprintf(bp, " sockarg"); break; case O_IN: bprintf(bp, cmd->len & F_NOT ? 
" out" : " in"); break; case O_DIVERTED: switch (cmd->arg1) { case 3: bprintf(bp, " diverted"); break; case 2: bprintf(bp, " diverted-output"); break; case 1: bprintf(bp, " diverted-loopback"); break; default: bprintf(bp, " diverted-?<%u>", cmd->arg1); break; } break; case O_LAYER2: bprintf(bp, " layer2"); break; case O_XMIT: case O_RECV: case O_VIA: if (cmd->opcode == O_XMIT) s = "xmit"; else if (cmd->opcode == O_RECV) s = "recv"; else /* if (cmd->opcode == O_VIA) */ s = "via"; switch (insntod(cmd, if)->name[0]) { case '\0': bprintf(bp, " %s %s", s, inet_ntoa(insntod(cmd, if)->p.ip)); break; case '\1': bprintf(bp, " %s table(%s)", s, table_search_ctlv(fo->tstate, insntod(cmd, if)->p.kidx)); break; default: bprintf(bp, " %s %s", s, insntod(cmd, if)->name); } break; case O_IP_FLOW_LOOKUP: s = table_search_ctlv(fo->tstate, cmd->arg1); bprintf(bp, " flow table(%s", s); if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) bprintf(bp, ",%u", insntod(cmd, u32)->d[0]); bprintf(bp, ")"); break; case O_IPID: case O_IPTTL: case O_IPLEN: case O_TCPDATALEN: case O_TCPWIN: if (F_LEN(cmd) == 1) { switch (cmd->opcode) { case O_IPID: s = "ipid"; break; case O_IPTTL: s = "ipttl"; break; case O_IPLEN: s = "iplen"; break; case O_TCPDATALEN: s = "tcpdatalen"; break; case O_TCPWIN: s = "tcpwin"; break; } bprintf(bp, " %s %u", s, cmd->arg1); } else print_newports(bp, insntod(cmd, u16), 0, cmd->opcode); break; case O_IPVER: bprintf(bp, " ipver %u", cmd->arg1); break; case O_IPPRECEDENCE: bprintf(bp, " ipprecedence %u", cmd->arg1 >> 5); break; case O_DSCP: print_dscp(bp, insntod(cmd, u32)); break; case O_IPOPT: print_flags(bp, "ipoptions", cmd, f_ipopts); break; case O_IPTOS: print_flags(bp, "iptos", cmd, f_iptos); break; case O_ICMPTYPE: print_icmptypes(bp, insntod(cmd, u32)); break; case O_ESTAB: bprintf(bp, " established"); break; case O_TCPFLAGS: print_flags(bp, "tcpflags", cmd, f_tcpflags); break; case O_TCPOPTS: print_flags(bp, "tcpoptions", cmd, f_tcpopts); break; case O_TCPACK: bprintf(bp, " tcpack %d", ntohl(insntod(cmd, u32)->d[0])); break; case O_TCPSEQ: bprintf(bp, " tcpseq %d", ntohl(insntod(cmd, u32)->d[0])); break; case O_UID: pwd = getpwuid(insntod(cmd, u32)->d[0]); if (pwd != NULL) bprintf(bp, " uid %s", pwd->pw_name); else bprintf(bp, " uid %u", insntod(cmd, u32)->d[0]); break; case O_GID: grp = getgrgid(insntod(cmd, u32)->d[0]); if (grp != NULL) bprintf(bp, " gid %s", grp->gr_name); else bprintf(bp, " gid %u", insntod(cmd, u32)->d[0]); break; case O_JAIL: bprintf(bp, " jail %d", insntod(cmd, u32)->d[0]); break; case O_VERREVPATH: bprintf(bp, " verrevpath"); break; case O_VERSRCREACH: bprintf(bp, " versrcreach"); break; case O_ANTISPOOF: bprintf(bp, " antispoof"); break; case O_IPSEC: bprintf(bp, " ipsec"); break; case O_NOP: bprintf(bp, " // %s", (char *)(cmd + 1)); break; case O_KEEP_STATE: if (state->flags & HAVE_PROBE_STATE) bprintf(bp, " keep-state"); else bprintf(bp, " record-state"); bprintf(bp, " :%s", object_search_ctlv(fo->tstate, cmd->arg1, IPFW_TLV_STATE_NAME)); break; case O_LIMIT: if (state->flags & HAVE_PROBE_STATE) bprintf(bp, " limit"); else bprintf(bp, " set-limit"); print_limit_mask(bp, insntod(cmd, limit)); bprintf(bp, " :%s", object_search_ctlv(fo->tstate, cmd->arg1, IPFW_TLV_STATE_NAME)); break; case O_IP6: + if (state->flags & HAVE_PROTO) + bprintf(bp, " proto"); bprintf(bp, " ip6"); break; case O_IP4: + if (state->flags & HAVE_PROTO) + bprintf(bp, " proto"); bprintf(bp, " ip4"); break; case O_ICMP6TYPE: print_icmp6types(bp, insntod(cmd, u32)); break; case O_EXT_HDR: 
print_ext6hdr(bp, cmd); break; case O_TAGGED: if (F_LEN(cmd) == 1) bprint_uint_arg(bp, " tagged ", cmd->arg1); else print_newports(bp, insntod(cmd, u16), 0, O_TAGGED); break; case O_SKIP_ACTION: bprintf(bp, " defer-immediate-action"); break; default: bprintf(bp, " [opcode %d len %d]", cmd->opcode, cmd->len); } if (cmd->len & F_OR) { bprintf(bp, " or"); state->or_block = 1; } else if (state->or_block != 0) { bprintf(bp, " }"); state->or_block = 0; } mark_printed(state, cmd); return (1); } static ipfw_insn * print_opcode(struct buf_pr *bp, struct format_opts *fo, struct show_state *state, int opcode) { ipfw_insn *cmd; int l; for (l = state->rule->act_ofs, cmd = state->rule->cmd; l > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { /* We use zero opcode to print the rest of options */ if (opcode >= 0 && cmd->opcode != opcode) continue; /* * Skip O_NOP, when we printing the rest * of options, it will be handled separately. */ if (cmd->opcode == O_NOP && opcode != O_NOP) continue; if (!print_instruction(bp, fo, state, cmd)) continue; return (cmd); } return (NULL); } static void print_fwd(struct buf_pr *bp, const ipfw_insn *cmd) { char buf[INET6_ADDRSTRLEN + IF_NAMESIZE + 2]; ipfw_insn_sa6 *sa6; ipfw_insn_sa *sa; uint16_t port; if (cmd->opcode == O_FORWARD_IP) { sa = insntod(cmd, sa); port = sa->sa.sin_port; if (sa->sa.sin_addr.s_addr == INADDR_ANY) bprintf(bp, "fwd tablearg"); else bprintf(bp, "fwd %s", inet_ntoa(sa->sa.sin_addr)); } else { sa6 = insntod(cmd, sa6); port = sa6->sa.sin6_port; bprintf(bp, "fwd "); if (getnameinfo((const struct sockaddr *)&sa6->sa, sizeof(struct sockaddr_in6), buf, sizeof(buf), NULL, 0, NI_NUMERICHOST) == 0) bprintf(bp, "%s", buf); } if (port != 0) bprintf(bp, ",%u", port); } static int print_action_instruction(struct buf_pr *bp, const struct format_opts *fo, struct show_state *state, const ipfw_insn *cmd) { const char *s; if (is_printed_opcode(state, cmd)) return (0); switch (cmd->opcode) { case O_CHECK_STATE: bprintf(bp, "check-state"); if (cmd->arg1 != 0) s = object_search_ctlv(fo->tstate, cmd->arg1, IPFW_TLV_STATE_NAME); else s = NULL; bprintf(bp, " :%s", s ? s: "any"); break; case O_ACCEPT: bprintf(bp, "allow"); break; case O_COUNT: bprintf(bp, "count"); break; case O_DENY: bprintf(bp, "deny"); break; case O_REJECT: if (cmd->arg1 == ICMP_REJECT_RST) bprintf(bp, "reset"); else if (cmd->arg1 == ICMP_REJECT_ABORT) bprintf(bp, "abort"); else if (cmd->arg1 == ICMP_UNREACH_HOST) bprintf(bp, "reject"); else print_reject_code(bp, cmd->arg1); break; case O_UNREACH6: if (cmd->arg1 == ICMP6_UNREACH_RST) bprintf(bp, "reset6"); else if (cmd->arg1 == ICMP6_UNREACH_ABORT) bprintf(bp, "abort6"); else print_unreach6_code(bp, cmd->arg1); break; case O_SKIPTO: bprint_uint_arg(bp, "skipto ", cmd->arg1); break; case O_PIPE: bprint_uint_arg(bp, "pipe ", cmd->arg1); break; case O_QUEUE: bprint_uint_arg(bp, "queue ", cmd->arg1); break; case O_DIVERT: bprint_uint_arg(bp, "divert ", cmd->arg1); break; case O_TEE: bprint_uint_arg(bp, "tee ", cmd->arg1); break; case O_NETGRAPH: bprint_uint_arg(bp, "netgraph ", cmd->arg1); break; case O_NGTEE: bprint_uint_arg(bp, "ngtee ", cmd->arg1); break; case O_FORWARD_IP: case O_FORWARD_IP6: print_fwd(bp, cmd); break; case O_LOG: if (insntod(cmd, log)->max_log > 0) bprintf(bp, " log logamount %d", insntod(cmd, log)->max_log); else bprintf(bp, " log"); break; case O_ALTQ: #ifndef NO_ALTQ print_altq_cmd(bp, insntod(cmd, altq)); #endif break; case O_TAG: bprint_uint_arg(bp, cmd->len & F_NOT ? 
" untag ": " tag ", cmd->arg1); break; case O_NAT: if (cmd->arg1 != IP_FW_NAT44_GLOBAL) bprint_uint_arg(bp, "nat ", cmd->arg1); else bprintf(bp, "nat global"); break; case O_SETFIB: if (cmd->arg1 == IP_FW_TARG) bprint_uint_arg(bp, "setfib ", cmd->arg1); else bprintf(bp, "setfib %u", cmd->arg1 & 0x7FFF); break; case O_EXTERNAL_ACTION: /* * The external action can consists of two following * each other opcodes - O_EXTERNAL_ACTION and * O_EXTERNAL_INSTANCE. The first contains the ID of * name of external action. The second contains the ID * of name of external action instance. * NOTE: in case when external action has no named * instances support, the second opcode isn't needed. */ state->eaction = cmd; s = object_search_ctlv(fo->tstate, cmd->arg1, IPFW_TLV_EACTION); if (match_token(rule_eactions, s) != -1) bprintf(bp, "%s", s); else bprintf(bp, "eaction %s", s); break; case O_EXTERNAL_INSTANCE: if (state->eaction == NULL) break; /* * XXX: we need to teach ipfw(9) to rewrite opcodes * in the user buffer on rule addition. When we add * the rule, we specify zero TLV type for * O_EXTERNAL_INSTANCE object. To show correct * rule after `ipfw add` we need to search instance * name with zero type. But when we do `ipfw show` * we calculate TLV type using IPFW_TLV_EACTION_NAME() * macro. */ s = object_search_ctlv(fo->tstate, cmd->arg1, 0); if (s == NULL) s = object_search_ctlv(fo->tstate, cmd->arg1, IPFW_TLV_EACTION_NAME( state->eaction->arg1)); bprintf(bp, " %s", s); break; case O_EXTERNAL_DATA: if (state->eaction == NULL) break; /* * Currently we support data formatting only for * external data with datalen u16. For unknown data * print its size in bytes. */ if (cmd->len == F_INSN_SIZE(ipfw_insn)) bprintf(bp, " %u", cmd->arg1); else bprintf(bp, " %ubytes", cmd->len * sizeof(uint32_t)); break; case O_SETDSCP: if (cmd->arg1 == IP_FW_TARG) { bprintf(bp, "setdscp tablearg"); break; } s = match_value(f_ipdscp, cmd->arg1 & 0x3F); if (s != NULL) bprintf(bp, "setdscp %s", s); else bprintf(bp, "setdscp %u", cmd->arg1 & 0x3F); break; case O_REASS: bprintf(bp, "reass"); break; case O_CALLRETURN: if (cmd->len & F_NOT) bprintf(bp, "return"); else bprint_uint_arg(bp, "call ", cmd->arg1); break; default: bprintf(bp, "** unrecognized action %d len %d ", cmd->opcode, cmd->len); } mark_printed(state, cmd); return (1); } static ipfw_insn * print_action(struct buf_pr *bp, struct format_opts *fo, struct show_state *state, uint8_t opcode) { ipfw_insn *cmd; int l; for (l = state->rule->cmd_len - state->rule->act_ofs, cmd = ACTION_PTR(state->rule); l > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { if (cmd->opcode != opcode) continue; if (!print_action_instruction(bp, fo, state, cmd)) continue; return (cmd); } return (NULL); } static void print_proto(struct buf_pr *bp, struct format_opts *fo, struct show_state *state) { ipfw_insn *cmd; int l, proto, ip4, ip6; /* Count all O_PROTO, O_IP4, O_IP6 instructions. */ proto = ip4 = ip6 = 0; for (l = state->rule->act_ofs, cmd = state->rule->cmd; l > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { switch (cmd->opcode) { case O_PROTO: proto++; break; case O_IP4: ip4 = 1; if (cmd->len & F_OR) ip4++; break; case O_IP6: ip6 = 1; if (cmd->len & F_OR) ip6++; break; default: continue; } } if (proto == 0 && ip4 == 0 && ip6 == 0) { state->proto = IPPROTO_IP; state->flags |= HAVE_PROTO; bprintf(bp, " ip"); return; } /* To handle the case { ip4 or ip6 }, print opcode with F_OR first */ cmd = NULL; if (ip4 || ip6) cmd = print_opcode(bp, fo, state, ip4 > ip6 ? 
O_IP4: O_IP6); if (cmd != NULL && (cmd->len & F_OR)) cmd = print_opcode(bp, fo, state, ip4 > ip6 ? O_IP6: O_IP4); if (cmd == NULL || (cmd->len & F_OR)) for (l = proto; l > 0; l--) { cmd = print_opcode(bp, fo, state, O_PROTO); if (cmd == NULL || (cmd->len & F_OR) == 0) break; } /* Initialize proto, it is used by print_newports() */ state->flags |= HAVE_PROTO; if (state->proto == 0 && ip6 != 0) state->proto = IPPROTO_IPV6; } static int match_opcode(int opcode, const int opcodes[], size_t nops) { int i; for (i = 0; i < nops; i++) if (opcode == opcodes[i]) return (1); return (0); } static void print_address(struct buf_pr *bp, struct format_opts *fo, struct show_state *state, const int opcodes[], size_t nops, int portop, int flag) { ipfw_insn *cmd; int count, l, portcnt, pf; count = portcnt = 0; for (l = state->rule->act_ofs, cmd = state->rule->cmd; l > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { if (match_opcode(cmd->opcode, opcodes, nops)) count++; else if (cmd->opcode == portop) portcnt++; } if (count == 0) bprintf(bp, " any"); for (l = state->rule->act_ofs, cmd = state->rule->cmd; l > 0 && count > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { if (!match_opcode(cmd->opcode, opcodes, nops)) continue; print_instruction(bp, fo, state, cmd); if ((cmd->len & F_OR) == 0) break; count--; } /* * If several O_IP_?PORT opcodes specified, leave them to the * options section. */ if (portcnt == 1) { for (l = state->rule->act_ofs, cmd = state->rule->cmd, pf = 0; l > 0; l -= F_LEN(cmd), cmd += F_LEN(cmd)) { if (cmd->opcode != portop) { pf = (cmd->len & F_OR); continue; } /* Print opcode iff it is not in OR block. */ if (pf == 0 && (cmd->len & F_OR) == 0) print_instruction(bp, fo, state, cmd); break; } } state->flags |= flag; } static const int action_opcodes[] = { O_CHECK_STATE, O_ACCEPT, O_COUNT, O_DENY, O_REJECT, O_UNREACH6, O_SKIPTO, O_PIPE, O_QUEUE, O_DIVERT, O_TEE, O_NETGRAPH, O_NGTEE, O_FORWARD_IP, O_FORWARD_IP6, O_NAT, O_SETFIB, O_SETDSCP, O_REASS, O_CALLRETURN, /* keep the following opcodes at the end of the list */ O_EXTERNAL_ACTION, O_EXTERNAL_INSTANCE, O_EXTERNAL_DATA }; static const int modifier_opcodes[] = { O_LOG, O_ALTQ, O_TAG }; static const int src_opcodes[] = { O_IP_SRC, O_IP_SRC_LOOKUP, O_IP_SRC_MASK, O_IP_SRC_ME, O_IP_SRC_SET, O_IP6_SRC, O_IP6_SRC_MASK, O_IP6_SRC_ME }; static const int dst_opcodes[] = { O_IP_DST, O_IP_DST_LOOKUP, O_IP_DST_MASK, O_IP_DST_ME, O_IP_DST_SET, O_IP6_DST, O_IP6_DST_MASK, O_IP6_DST_ME }; static void show_static_rule(struct cmdline_opts *co, struct format_opts *fo, struct buf_pr *bp, struct ip_fw_rule *rule, struct ip_fw_bcounter *cntr) { struct show_state state; ipfw_insn *cmd; static int twidth = 0; int i; /* Print # DISABLED or skip the rule */ if ((fo->set_mask & (1 << rule->set)) == 0) { /* disabled mask */ if (!co->show_sets) return; else bprintf(bp, "# DISABLED "); } if (init_show_state(&state, rule) != 0) { warn("init_show_state() failed"); return; } bprintf(bp, "%05u ", rule->rulenum); /* Print counters if enabled */ if (fo->pcwidth > 0 || fo->bcwidth > 0) { pr_u64(bp, &cntr->pcnt, fo->pcwidth); pr_u64(bp, &cntr->bcnt, fo->bcwidth); } /* Print timestamp */ if (co->do_time == TIMESTAMP_NUMERIC) bprintf(bp, "%10u ", cntr->timestamp); else if (co->do_time == TIMESTAMP_STRING) { char timestr[30]; time_t t = (time_t)0; if (twidth == 0) { strcpy(timestr, ctime(&t)); *strchr(timestr, '\n') = '\0'; twidth = strlen(timestr); } if (cntr->timestamp > 0) { t = _long_to_time(cntr->timestamp); strcpy(timestr, ctime(&t)); *strchr(timestr, '\n') = '\0'; bprintf(bp, "%s ", 
timestr); } else { bprintf(bp, "%*s", twidth, " "); } } /* Print set number */ if (co->show_sets) bprintf(bp, "set %d ", rule->set); /* Print the optional "match probability" */ cmd = print_opcode(bp, fo, &state, O_PROB); /* Print rule action */ for (i = 0; i < nitems(action_opcodes); i++) { cmd = print_action(bp, fo, &state, action_opcodes[i]); if (cmd == NULL) continue; /* Handle special cases */ switch (cmd->opcode) { case O_CHECK_STATE: goto end; case O_EXTERNAL_ACTION: case O_EXTERNAL_INSTANCE: /* External action can have several instructions */ continue; } break; } /* Print rule modifiers */ for (i = 0; i < nitems(modifier_opcodes); i++) print_action(bp, fo, &state, modifier_opcodes[i]); /* * Print rule body */ if (co->comment_only != 0) goto end; if (rule->flags & IPFW_RULE_JUSTOPTS) { state.flags |= HAVE_PROTO | HAVE_SRCIP | HAVE_DSTIP; goto justopts; } print_proto(bp, fo, &state); /* Print source */ bprintf(bp, " from"); print_address(bp, fo, &state, src_opcodes, nitems(src_opcodes), O_IP_SRCPORT, HAVE_SRCIP); /* Print destination */ bprintf(bp, " to"); print_address(bp, fo, &state, dst_opcodes, nitems(dst_opcodes), O_IP_DSTPORT, HAVE_DSTIP); justopts: /* Print the rest of options */ while (print_opcode(bp, fo, &state, -1)) ; end: /* Print comment at the end */ cmd = print_opcode(bp, fo, &state, O_NOP); if (co->comment_only != 0 && cmd == NULL) bprintf(bp, " // ..."); bprintf(bp, "\n"); free_show_state(&state); } static void show_dyn_state(struct cmdline_opts *co, struct format_opts *fo, struct buf_pr *bp, ipfw_dyn_rule *d) { struct protoent *pe; struct in_addr a; uint16_t rulenum; char buf[INET6_ADDRSTRLEN]; if (d->expire == 0 && d->dyn_type != O_LIMIT_PARENT) return; bcopy(&d->rule, &rulenum, sizeof(rulenum)); bprintf(bp, "%05d", rulenum); if (fo->pcwidth > 0 || fo->bcwidth > 0) { bprintf(bp, " "); pr_u64(bp, &d->pcnt, fo->pcwidth); pr_u64(bp, &d->bcnt, fo->bcwidth); bprintf(bp, "(%ds)", d->expire); } switch (d->dyn_type) { case O_LIMIT_PARENT: bprintf(bp, " PARENT %d", d->count); break; case O_LIMIT: bprintf(bp, " LIMIT"); break; case O_KEEP_STATE: /* bidir, no mask */ bprintf(bp, " STATE"); break; } if ((pe = getprotobynumber(d->id.proto)) != NULL) bprintf(bp, " %s", pe->p_name); else bprintf(bp, " proto %u", d->id.proto); if (d->id.addr_type == 4) { a.s_addr = htonl(d->id.src_ip); bprintf(bp, " %s %d", inet_ntoa(a), d->id.src_port); a.s_addr = htonl(d->id.dst_ip); bprintf(bp, " <-> %s %d", inet_ntoa(a), d->id.dst_port); } else if (d->id.addr_type == 6) { bprintf(bp, " %s %d", inet_ntop(AF_INET6, &d->id.src_ip6, buf, sizeof(buf)), d->id.src_port); bprintf(bp, " <-> %s %d", inet_ntop(AF_INET6, &d->id.dst_ip6, buf, sizeof(buf)), d->id.dst_port); } else bprintf(bp, " UNKNOWN <-> UNKNOWN"); if (d->kidx != 0) bprintf(bp, " :%s", object_search_ctlv(fo->tstate, d->kidx, IPFW_TLV_STATE_NAME)); #define BOTH_SYN (TH_SYN | (TH_SYN << 8)) #define BOTH_FIN (TH_FIN | (TH_FIN << 8)) if (co->verbose) { bprintf(bp, " state 0x%08x%s", d->state, d->state ? 
" ": ","); if (d->state & IPFW_DYN_ORPHANED) bprintf(bp, "ORPHANED,"); if ((d->state & BOTH_SYN) == BOTH_SYN) bprintf(bp, "BOTH_SYN,"); else { if (d->state & TH_SYN) bprintf(bp, "F_SYN,"); if (d->state & (TH_SYN << 8)) bprintf(bp, "R_SYN,"); } if ((d->state & BOTH_FIN) == BOTH_FIN) bprintf(bp, "BOTH_FIN,"); else { if (d->state & TH_FIN) bprintf(bp, "F_FIN,"); if (d->state & (TH_FIN << 8)) bprintf(bp, "R_FIN,"); } bprintf(bp, " f_ack 0x%x, r_ack 0x%x", d->ack_fwd, d->ack_rev); } } static int do_range_cmd(int cmd, ipfw_range_tlv *rt) { ipfw_range_header rh; size_t sz; memset(&rh, 0, sizeof(rh)); memcpy(&rh.range, rt, sizeof(*rt)); rh.range.head.length = sizeof(*rt); rh.range.head.type = IPFW_TLV_RANGE; sz = sizeof(rh); if (do_get3(cmd, &rh.opheader, &sz) != 0) return (-1); /* Save number of matched objects */ rt->new_set = rh.range.new_set; return (0); } /* * This one handles all set-related commands * ipfw set { show | enable | disable } * ipfw set swap X Y * ipfw set move X to Y * ipfw set move rule X to Y */ void ipfw_sets_handler(char *av[]) { ipfw_range_tlv rt; char *msg; size_t size; uint32_t masks[2]; int i; uint16_t rulenum; uint8_t cmd; av++; memset(&rt, 0, sizeof(rt)); if (av[0] == NULL) errx(EX_USAGE, "set needs command"); if (_substrcmp(*av, "show") == 0) { struct format_opts fo; ipfw_cfg_lheader *cfg; memset(&fo, 0, sizeof(fo)); if (ipfw_get_config(&co, &fo, &cfg, &size) != 0) err(EX_OSERR, "requesting config failed"); for (i = 0, msg = "disable"; i < RESVD_SET; i++) if ((cfg->set_mask & (1<set_mask != (uint32_t)-1) ? " enable" : "enable"; for (i = 0; i < RESVD_SET; i++) if ((cfg->set_mask & (1< RESVD_SET) errx(EX_DATAERR, "invalid set number %s\n", av[0]); if (!isdigit(*(av[1])) || rt.new_set > RESVD_SET) errx(EX_DATAERR, "invalid set number %s\n", av[1]); i = do_range_cmd(IP_FW_SET_SWAP, &rt); } else if (_substrcmp(*av, "move") == 0) { av++; if (av[0] && _substrcmp(*av, "rule") == 0) { rt.flags = IPFW_RCFLAG_RANGE; /* move rules to new set */ cmd = IP_FW_XMOVE; av++; } else cmd = IP_FW_SET_MOVE; /* Move set to new one */ if (av[0] == NULL || av[1] == NULL || av[2] == NULL || av[3] != NULL || _substrcmp(av[1], "to") != 0) errx(EX_USAGE, "syntax: set move [rule] X to Y\n"); rulenum = atoi(av[0]); rt.new_set = atoi(av[2]); if (cmd == IP_FW_XMOVE) { rt.start_rule = rulenum; rt.end_rule = rulenum; } else rt.set = rulenum; rt.new_set = atoi(av[2]); if (!isdigit(*(av[0])) || (cmd == 3 && rt.set > RESVD_SET) || (cmd == 2 && rt.start_rule == IPFW_DEFAULT_RULE) ) errx(EX_DATAERR, "invalid source number %s\n", av[0]); if (!isdigit(*(av[2])) || rt.new_set > RESVD_SET) errx(EX_DATAERR, "invalid dest. set %s\n", av[1]); i = do_range_cmd(cmd, &rt); if (i < 0) err(EX_OSERR, "failed to move %s", cmd == IP_FW_SET_MOVE ? "set": "rule"); } else if (_substrcmp(*av, "disable") == 0 || _substrcmp(*av, "enable") == 0 ) { int which = _substrcmp(*av, "enable") == 0 ? 
1 : 0; av++; masks[0] = masks[1] = 0; while (av[0]) { if (isdigit(**av)) { i = atoi(*av); if (i < 0 || i > RESVD_SET) errx(EX_DATAERR, "invalid set number %d\n", i); masks[which] |= (1<dcnt++; if (fo->show_counters == 0) return; if (co->use_set) { /* skip states from another set */ bcopy((char *)&d->rule + sizeof(uint16_t), &set, sizeof(uint8_t)); if (set != co->use_set - 1) return; } width = pr_u64(NULL, &d->pcnt, 0); if (width > fo->pcwidth) fo->pcwidth = width; width = pr_u64(NULL, &d->bcnt, 0); if (width > fo->bcwidth) fo->bcwidth = width; } static int foreach_state(struct cmdline_opts *co, struct format_opts *fo, caddr_t base, size_t sz, state_cb dyn_bc, void *dyn_arg) { int ttype; state_cb *fptr; void *farg; ipfw_obj_tlv *tlv; ipfw_obj_ctlv *ctlv; fptr = NULL; ttype = 0; while (sz > 0) { ctlv = (ipfw_obj_ctlv *)base; switch (ctlv->head.type) { case IPFW_TLV_DYNSTATE_LIST: base += sizeof(*ctlv); sz -= sizeof(*ctlv); ttype = IPFW_TLV_DYN_ENT; fptr = dyn_bc; farg = dyn_arg; break; default: return (sz); } while (sz > 0) { tlv = (ipfw_obj_tlv *)base; if (tlv->type != ttype) break; fptr(co, fo, farg, tlv + 1); sz -= tlv->length; base += tlv->length; } } return (sz); } static void prepare_format_opts(struct cmdline_opts *co, struct format_opts *fo, ipfw_obj_tlv *rtlv, int rcnt, caddr_t dynbase, size_t dynsz) { int bcwidth, pcwidth, width; int n; struct ip_fw_bcounter *cntr; struct ip_fw_rule *r; bcwidth = 0; pcwidth = 0; if (fo->show_counters != 0) { for (n = 0; n < rcnt; n++, rtlv = (ipfw_obj_tlv *)((caddr_t)rtlv + rtlv->length)) { cntr = (struct ip_fw_bcounter *)(rtlv + 1); r = (struct ip_fw_rule *)((caddr_t)cntr + cntr->size); /* skip rules from another set */ if (co->use_set && r->set != co->use_set - 1) continue; /* packet counter */ width = pr_u64(NULL, &cntr->pcnt, 0); if (width > pcwidth) pcwidth = width; /* byte counter */ width = pr_u64(NULL, &cntr->bcnt, 0); if (width > bcwidth) bcwidth = width; } } fo->bcwidth = bcwidth; fo->pcwidth = pcwidth; fo->dcnt = 0; if (co->do_dynamic && dynsz > 0) foreach_state(co, fo, dynbase, dynsz, prepare_format_dyn, NULL); } static int list_static_range(struct cmdline_opts *co, struct format_opts *fo, struct buf_pr *bp, ipfw_obj_tlv *rtlv, int rcnt) { int n, seen; struct ip_fw_rule *r; struct ip_fw_bcounter *cntr; int c = 0; for (n = seen = 0; n < rcnt; n++, rtlv = (ipfw_obj_tlv *)((caddr_t)rtlv + rtlv->length)) { if ((fo->show_counters | fo->show_time) != 0) { cntr = (struct ip_fw_bcounter *)(rtlv + 1); r = (struct ip_fw_rule *)((caddr_t)cntr + cntr->size); } else { cntr = NULL; r = (struct ip_fw_rule *)(rtlv + 1); } if (r->rulenum > fo->last) break; if (co->use_set && r->set != co->use_set - 1) continue; if (r->rulenum >= fo->first && r->rulenum <= fo->last) { show_static_rule(co, fo, bp, r, cntr); printf("%s", bp->buf); c += rtlv->length; bp_flush(bp); seen++; } } return (seen); } static void list_dyn_state(struct cmdline_opts *co, struct format_opts *fo, void *_arg, void *_state) { uint16_t rulenum; uint8_t set; ipfw_dyn_rule *d; struct buf_pr *bp; d = (ipfw_dyn_rule *)_state; bp = (struct buf_pr *)_arg; bcopy(&d->rule, &rulenum, sizeof(rulenum)); if (rulenum > fo->last) return; if (co->use_set) { bcopy((char *)&d->rule + sizeof(uint16_t), &set, sizeof(uint8_t)); if (set != co->use_set - 1) return; } if (rulenum >= fo->first) { show_dyn_state(co, fo, bp, d); printf("%s\n", bp->buf); bp_flush(bp); } } static int list_dyn_range(struct cmdline_opts *co, struct format_opts *fo, struct buf_pr *bp, caddr_t base, size_t sz) { sz = foreach_state(co, 
fo, base, sz, list_dyn_state, bp); return (sz); } void ipfw_list(int ac, char *av[], int show_counters) { ipfw_cfg_lheader *cfg; struct format_opts sfo; size_t sz; int error; int lac; char **lav; uint32_t rnum; char *endptr; if (co.test_only) { fprintf(stderr, "Testing only, list disabled\n"); return; } if (co.do_pipe) { dummynet_list(ac, av, show_counters); return; } ac--; av++; memset(&sfo, 0, sizeof(sfo)); /* Determine rule range to request */ if (ac > 0) { for (lac = ac, lav = av; lac != 0; lac--) { rnum = strtoul(*lav++, &endptr, 10); if (sfo.first == 0 || rnum < sfo.first) sfo.first = rnum; if (*endptr == '-') rnum = strtoul(endptr + 1, &endptr, 10); if (sfo.last == 0 || rnum > sfo.last) sfo.last = rnum; } } /* get configuraion from kernel */ cfg = NULL; sfo.show_counters = show_counters; sfo.show_time = co.do_time; if (co.do_dynamic != 2) sfo.flags |= IPFW_CFG_GET_STATIC; if (co.do_dynamic != 0) sfo.flags |= IPFW_CFG_GET_STATES; if ((sfo.show_counters | sfo.show_time) != 0) sfo.flags |= IPFW_CFG_GET_COUNTERS; if (ipfw_get_config(&co, &sfo, &cfg, &sz) != 0) err(EX_OSERR, "retrieving config failed"); error = ipfw_show_config(&co, &sfo, cfg, sz, ac, av); free(cfg); if (error != EX_OK) exit(error); } static int ipfw_show_config(struct cmdline_opts *co, struct format_opts *fo, ipfw_cfg_lheader *cfg, size_t sz, int ac, char *av[]) { caddr_t dynbase; size_t dynsz; int rcnt; int exitval = EX_OK; int lac; char **lav; char *endptr; size_t readsz; struct buf_pr bp; ipfw_obj_ctlv *ctlv, *tstate; ipfw_obj_tlv *rbase; /* * Handle tablenames TLV first, if any */ tstate = NULL; rbase = NULL; dynbase = NULL; dynsz = 0; readsz = sizeof(*cfg); rcnt = 0; fo->set_mask = cfg->set_mask; ctlv = (ipfw_obj_ctlv *)(cfg + 1); if (ctlv->head.type == IPFW_TLV_TBLNAME_LIST) { object_sort_ctlv(ctlv); fo->tstate = ctlv; readsz += ctlv->head.length; ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length); } if (cfg->flags & IPFW_CFG_GET_STATIC) { /* We've requested static rules */ if (ctlv->head.type == IPFW_TLV_RULE_LIST) { rbase = (ipfw_obj_tlv *)(ctlv + 1); rcnt = ctlv->count; readsz += ctlv->head.length; ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length); } } if ((cfg->flags & IPFW_CFG_GET_STATES) && (readsz != sz)) { /* We may have some dynamic states */ dynsz = sz - readsz; /* Skip empty header */ if (dynsz != sizeof(ipfw_obj_ctlv)) dynbase = (caddr_t)ctlv; else dynsz = 0; } prepare_format_opts(co, fo, rbase, rcnt, dynbase, dynsz); bp_alloc(&bp, 4096); /* if no rule numbers were specified, list all rules */ if (ac == 0) { fo->first = 0; fo->last = IPFW_DEFAULT_RULE; if (cfg->flags & IPFW_CFG_GET_STATIC) list_static_range(co, fo, &bp, rbase, rcnt); if (co->do_dynamic && dynsz > 0) { printf("## Dynamic rules (%d %zu):\n", fo->dcnt, dynsz); list_dyn_range(co, fo, &bp, dynbase, dynsz); } bp_free(&bp); return (EX_OK); } /* display specific rules requested on command line */ for (lac = ac, lav = av; lac != 0; lac--) { /* convert command line rule # */ fo->last = fo->first = strtoul(*lav++, &endptr, 10); if (*endptr == '-') fo->last = strtoul(endptr + 1, &endptr, 10); if (*endptr) { exitval = EX_USAGE; warnx("invalid rule number: %s", *(lav - 1)); continue; } if ((cfg->flags & IPFW_CFG_GET_STATIC) == 0) continue; if (list_static_range(co, fo, &bp, rbase, rcnt) == 0) { /* give precedence to other error(s) */ if (exitval == EX_OK) exitval = EX_UNAVAILABLE; if (fo->first == fo->last) warnx("rule %u does not exist", fo->first); else warnx("no rules in range %u-%u", fo->first, fo->last); } } if 
(co->do_dynamic && dynsz > 0) { printf("## Dynamic rules:\n"); for (lac = ac, lav = av; lac != 0; lac--) { fo->last = fo->first = strtoul(*lav++, &endptr, 10); if (*endptr == '-') fo->last = strtoul(endptr+1, &endptr, 10); if (*endptr) /* already warned */ continue; list_dyn_range(co, fo, &bp, dynbase, dynsz); } } bp_free(&bp); return (exitval); } /* * Retrieves current ipfw configuration of given type * and stores its pointer to @pcfg. * * Caller is responsible for freeing @pcfg. * * Returns 0 on success. */ static int ipfw_get_config(struct cmdline_opts *co, struct format_opts *fo, ipfw_cfg_lheader **pcfg, size_t *psize) { ipfw_cfg_lheader *cfg; size_t sz; int i; if (co->test_only != 0) { fprintf(stderr, "Testing only, list disabled\n"); return (0); } /* Start with some data size */ sz = 4096; cfg = NULL; for (i = 0; i < 16; i++) { if (cfg != NULL) free(cfg); if ((cfg = calloc(1, sz)) == NULL) return (ENOMEM); cfg->flags = fo->flags; cfg->start_rule = fo->first; cfg->end_rule = fo->last; if (do_get3(IP_FW_XGET, &cfg->opheader, &sz) != 0) { if (errno != ENOMEM) { free(cfg); return (errno); } /* Buffer size is not enough. Try to increase */ sz = sz * 2; if (sz < cfg->size) sz = cfg->size; continue; } *pcfg = cfg; *psize = sz; return (0); } free(cfg); return (ENOMEM); } static int lookup_host (char *host, struct in_addr *ipaddr) { struct hostent *he; if (!inet_aton(host, ipaddr)) { if ((he = gethostbyname(host)) == NULL) return(-1); *ipaddr = *(struct in_addr *)he->h_addr_list[0]; } return(0); } struct tidx { ipfw_obj_ntlv *idx; uint32_t count; uint32_t size; uint16_t counter; uint8_t set; }; int ipfw_check_object_name(const char *name) { int c, i, l; /* * Check that name is null-terminated and contains * valid symbols only. Valid mask is: * [a-zA-Z0-9\-_\.]{1,63} */ l = strlen(name); if (l == 0 || l >= 64) return (EINVAL); for (i = 0; i < l; i++) { c = name[i]; if (isalpha(c) || isdigit(c) || c == '_' || c == '-' || c == '.') continue; return (EINVAL); } return (0); } static char *default_state_name = "default"; static int state_check_name(const char *name) { if (ipfw_check_object_name(name) != 0) return (EINVAL); if (strcmp(name, "any") == 0) return (EINVAL); return (0); } static int eaction_check_name(const char *name) { if (ipfw_check_object_name(name) != 0) return (EINVAL); /* Restrict some 'special' names */ if (match_token(rule_actions, name) != -1 && match_token(rule_action_params, name) != -1) return (EINVAL); return (0); } static uint16_t pack_object(struct tidx *tstate, char *name, int otype) { int i; ipfw_obj_ntlv *ntlv; for (i = 0; i < tstate->count; i++) { if (strcmp(tstate->idx[i].name, name) != 0) continue; if (tstate->idx[i].set != tstate->set) continue; if (tstate->idx[i].head.type != otype) continue; return (tstate->idx[i].idx); } if (tstate->count + 1 > tstate->size) { tstate->size += 4; tstate->idx = realloc(tstate->idx, tstate->size * sizeof(ipfw_obj_ntlv)); if (tstate->idx == NULL) return (0); } ntlv = &tstate->idx[i]; memset(ntlv, 0, sizeof(ipfw_obj_ntlv)); strlcpy(ntlv->name, name, sizeof(ntlv->name)); ntlv->head.type = otype; ntlv->head.length = sizeof(ipfw_obj_ntlv); ntlv->set = tstate->set; ntlv->idx = ++tstate->counter; tstate->count++; return (ntlv->idx); } static uint16_t pack_table(struct tidx *tstate, char *name) { if (table_check_name(name) != 0) return (0); return (pack_object(tstate, name, IPFW_TLV_TBL_NAME)); } void fill_table(struct _ipfw_insn *cmd, char *av, uint8_t opcode, struct tidx *tstate) { uint32_t *d = ((ipfw_insn_u32 *)cmd)->d; uint16_t uidx; 
char *p; if ((p = strchr(av + 6, ')')) == NULL) errx(EX_DATAERR, "forgotten parenthesis: '%s'", av); *p = '\0'; p = strchr(av + 6, ','); if (p) *p++ = '\0'; if ((uidx = pack_table(tstate, av + 6)) == 0) errx(EX_DATAERR, "Invalid table name: %s", av + 6); cmd->opcode = opcode; cmd->arg1 = uidx; if (p) { cmd->len |= F_INSN_SIZE(ipfw_insn_u32); d[0] = strtoul(p, NULL, 0); } else cmd->len |= F_INSN_SIZE(ipfw_insn); } /* * fills the addr and mask fields in the instruction as appropriate from av. * Update length as appropriate. * The following formats are allowed: * me returns O_IP_*_ME * 1.2.3.4 single IP address * 1.2.3.4:5.6.7.8 address:mask * 1.2.3.4/24 address/mask * 1.2.3.4/26{1,6,5,4,23} set of addresses in a subnet * We can have multiple comma-separated address/mask entries. */ static void fill_ip(ipfw_insn_ip *cmd, char *av, int cblen, struct tidx *tstate) { int len = 0; uint32_t *d = ((ipfw_insn_u32 *)cmd)->d; cmd->o.len &= ~F_LEN_MASK; /* zero len */ if (_substrcmp(av, "any") == 0) return; if (_substrcmp(av, "me") == 0) { cmd->o.len |= F_INSN_SIZE(ipfw_insn); return; } if (strncmp(av, "table(", 6) == 0) { fill_table(&cmd->o, av, O_IP_DST_LOOKUP, tstate); return; } while (av) { /* * After the address we can have '/' or ':' indicating a mask, * ',' indicating another address follows, '{' indicating a * set of addresses of unspecified size. */ char *t = NULL, *p = strpbrk(av, "/:,{"); int masklen; char md, nd = '\0'; CHECK_LENGTH(cblen, F_INSN_SIZE(ipfw_insn) + 2 + len); if (p) { md = *p; *p++ = '\0'; if ((t = strpbrk(p, ",{")) != NULL) { nd = *t; *t = '\0'; } } else md = '\0'; if (lookup_host(av, (struct in_addr *)&d[0]) != 0) errx(EX_NOHOST, "hostname ``%s'' unknown", av); switch (md) { case ':': if (!inet_aton(p, (struct in_addr *)&d[1])) errx(EX_DATAERR, "bad netmask ``%s''", p); break; case '/': masklen = atoi(p); if (masklen == 0) d[1] = htonl(0U); /* mask */ else if (masklen > 32) errx(EX_DATAERR, "bad width ``%s''", p); else d[1] = htonl(~0U << (32 - masklen)); break; case '{': /* no mask, assume /24 and put back the '{' */ d[1] = htonl(~0U << (32 - 24)); *(--p) = md; break; case ',': /* single address plus continuation */ *(--p) = md; /* FALLTHROUGH */ case 0: /* initialization value */ default: d[1] = htonl(~0U); /* force /32 */ break; } d[0] &= d[1]; /* mask base address with mask */ if (t) *t = nd; /* find next separator */ if (p) p = strpbrk(p, ",{"); if (p && *p == '{') { /* * We have a set of addresses. They are stored as follows: * arg1 is the set size (powers of 2, 2..256) * addr is the base address IN HOST FORMAT * mask.. is an array of arg1 bits (rounded up to * the next multiple of 32) with bits set * for each host in the map. */ uint32_t *map = (uint32_t *)&cmd->mask; int low, high; int i = contigmask((uint8_t *)&(d[1]), 32); if (len > 0) errx(EX_DATAERR, "address set cannot be in a list"); if (i < 24 || i > 31) errx(EX_DATAERR, "invalid set with mask %d\n", i); cmd->o.arg1 = 1<<(32-i); /* map length */ d[0] = ntohl(d[0]); /* base addr in host format */ cmd->o.opcode = O_IP_DST_SET; /* default */ cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32) + (cmd->o.arg1+31)/32; for (i = 0; i < (cmd->o.arg1+31)/32 ; i++) map[i] = 0; /* clear map */ av = p + 1; low = d[0] & 0xff; high = low + cmd->o.arg1 - 1; /* * Here, i stores the previous value when we specify a range * of addresses within a mask, e.g. 45-63. i = -1 means we * have no previous value. 
*/ i = -1; /* previous value in a range */ while (isdigit(*av)) { char *s; int a = strtol(av, &s, 0); if (s == av) { /* no parameter */ if (*av != '}') errx(EX_DATAERR, "set not closed\n"); if (i != -1) errx(EX_DATAERR, "incomplete range %d-", i); break; } if (a < low || a > high) errx(EX_DATAERR, "addr %d out of range [%d-%d]\n", a, low, high); a -= low; if (i == -1) /* no previous in range */ i = a; else { /* check that range is valid */ if (i > a) errx(EX_DATAERR, "invalid range %d-%d", i+low, a+low); if (*s == '-') errx(EX_DATAERR, "double '-' in range"); } for (; i <= a; i++) map[i/32] |= 1<<(i & 31); i = -1; if (*s == '-') i = a; else if (*s == '}') break; av = s+1; } return; } av = p; if (av) /* then *av must be a ',' */ av++; /* Check this entry */ if (d[1] == 0) { /* "any", specified as x.x.x.x/0 */ /* * 'any' turns the entire list into a NOP. * 'not any' never matches, so it is removed from the * list unless it is the only item, in which case we * report an error. */ if (cmd->o.len & F_NOT) { /* "not any" never matches */ if (av == NULL && len == 0) /* only this entry */ errx(EX_DATAERR, "not any never matches"); } /* else do nothing and skip this entry */ return; } /* A single IP can be stored in an optimized format */ if (d[1] == (uint32_t)~0 && av == NULL && len == 0) { cmd->o.len |= F_INSN_SIZE(ipfw_insn_u32); return; } len += 2; /* two words... */ d += 2; } /* end while */ if (len + 1 > F_LEN_MASK) errx(EX_DATAERR, "address list too long"); cmd->o.len |= len+1; } /* n2mask sets n bits of the mask */ void n2mask(struct in6_addr *mask, int n) { static int minimask[9] = { 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff }; u_char *p; memset(mask, 0, sizeof(struct in6_addr)); p = (u_char *) mask; for (; n > 0; p++, n -= 8) { if (n >= 8) *p = 0xff; else *p = minimask[n]; } return; } static void fill_flags_cmd(ipfw_insn *cmd, enum ipfw_opcodes opcode, struct _s_x *flags, char *p) { char *e; uint32_t set = 0, clear = 0; if (fill_flags(flags, p, &e, &set, &clear) != 0) errx(EX_DATAERR, "invalid flag %s", e); cmd->opcode = opcode; cmd->len = (cmd->len & (F_NOT | F_OR)) | 1; cmd->arg1 = (set & 0xff) | ( (clear & 0xff) << 8); } void ipfw_delete(char *av[]) { ipfw_range_tlv rt; char *sep; int i, j; int exitval = EX_OK; int do_set = 0; av++; NEED1("missing rule specification"); if ( *av && _substrcmp(*av, "set") == 0) { /* Do not allow using the following syntax: * ipfw set N delete set M */ if (co.use_set) errx(EX_DATAERR, "invalid syntax"); do_set = 1; /* delete set */ av++; } /* Rule number */ while (*av && isdigit(**av)) { i = strtol(*av, &sep, 10); j = i; if (*sep== '-') j = strtol(sep + 1, NULL, 10); av++; if (co.do_nat) { exitval = do_cmd(IP_FW_NAT_DEL, &i, sizeof i); if (exitval) { exitval = EX_UNAVAILABLE; if (co.do_quiet) continue; warn("nat %u not available", i); } } else if (co.do_pipe) { exitval = ipfw_delete_pipe(co.do_pipe, i); } else { memset(&rt, 0, sizeof(rt)); if (do_set != 0) { rt.set = i & 31; rt.flags = IPFW_RCFLAG_SET; } else { rt.start_rule = i & 0xffff; rt.end_rule = j & 0xffff; if (rt.start_rule == 0 && rt.end_rule == 0) rt.flags |= IPFW_RCFLAG_ALL; else rt.flags |= IPFW_RCFLAG_RANGE; if (co.use_set != 0) { rt.set = co.use_set - 1; rt.flags |= IPFW_RCFLAG_SET; } } if (co.do_dynamic == 2) rt.flags |= IPFW_RCFLAG_DYNAMIC; i = do_range_cmd(IP_FW_XDEL, &rt); if (i != 0) { exitval = EX_UNAVAILABLE; if (co.do_quiet) continue; warn("rule %u: setsockopt(IP_FW_XDEL)", rt.start_rule); } else if (rt.new_set == 0 && do_set == 0 && co.do_dynamic != 2) { exitval = 
EX_UNAVAILABLE; if (co.do_quiet) continue; if (rt.start_rule != rt.end_rule) warnx("no rules rules in %u-%u range", rt.start_rule, rt.end_rule); else warnx("rule %u not found", rt.start_rule); } } } if (exitval != EX_OK && co.do_force == 0) exit(exitval); } /* * fill the interface structure. We do not check the name as we can * create interfaces dynamically, so checking them at insert time * makes relatively little sense. * Interface names containing '*', '?', or '[' are assumed to be shell * patterns which match interfaces. */ static void fill_iface(ipfw_insn_if *cmd, char *arg, int cblen, struct tidx *tstate) { char *p; uint16_t uidx; cmd->name[0] = '\0'; cmd->o.len |= F_INSN_SIZE(ipfw_insn_if); CHECK_CMDLEN; /* Parse the interface or address */ if (strcmp(arg, "any") == 0) cmd->o.len = 0; /* effectively ignore this command */ else if (strncmp(arg, "table(", 6) == 0) { if ((p = strchr(arg + 6, ')')) == NULL) errx(EX_DATAERR, "forgotten parenthesis: '%s'", arg); *p = '\0'; p = strchr(arg + 6, ','); if (p) *p++ = '\0'; if ((uidx = pack_table(tstate, arg + 6)) == 0) errx(EX_DATAERR, "Invalid table name: %s", arg + 6); cmd->name[0] = '\1'; /* Special value indicating table */ cmd->p.kidx = uidx; } else if (!isdigit(*arg)) { strlcpy(cmd->name, arg, sizeof(cmd->name)); cmd->p.glob = strpbrk(arg, "*?[") != NULL ? 1 : 0; } else if (!inet_aton(arg, &cmd->p.ip)) errx(EX_DATAERR, "bad ip address ``%s''", arg); } static void get_mac_addr_mask(const char *p, uint8_t *addr, uint8_t *mask) { int i; size_t l; char *ap, *ptr, *optr; struct ether_addr *mac; const char *macset = "0123456789abcdefABCDEF:"; if (strcmp(p, "any") == 0) { for (i = 0; i < ETHER_ADDR_LEN; i++) addr[i] = mask[i] = 0; return; } optr = ptr = strdup(p); if ((ap = strsep(&ptr, "&/")) != NULL && *ap != 0) { l = strlen(ap); if (strspn(ap, macset) != l || (mac = ether_aton(ap)) == NULL) errx(EX_DATAERR, "Incorrect MAC address"); bcopy(mac, addr, ETHER_ADDR_LEN); } else errx(EX_DATAERR, "Incorrect MAC address"); if (ptr != NULL) { /* we have mask? */ if (p[ptr - optr - 1] == '/') { /* mask len */ long ml = strtol(ptr, &ap, 10); if (*ap != 0 || ml > ETHER_ADDR_LEN * 8 || ml < 0) errx(EX_DATAERR, "Incorrect mask length"); for (i = 0; ml > 0 && i < ETHER_ADDR_LEN; ml -= 8, i++) mask[i] = (ml >= 8) ? 0xff: (~0) << (8 - ml); } else { /* mask */ l = strlen(ptr); if (strspn(ptr, macset) != l || (mac = ether_aton(ptr)) == NULL) errx(EX_DATAERR, "Incorrect mask"); bcopy(mac, mask, ETHER_ADDR_LEN); } } else { /* default mask: ff:ff:ff:ff:ff:ff */ for (i = 0; i < ETHER_ADDR_LEN; i++) mask[i] = 0xff; } for (i = 0; i < ETHER_ADDR_LEN; i++) addr[i] &= mask[i]; free(optr); } /* * helper function, updates the pointer to cmd with the length * of the current command, and also cleans up the first word of * the new command in case it has been clobbered before. */ static ipfw_insn * next_cmd(ipfw_insn *cmd, int *len) { *len -= F_LEN(cmd); CHECK_LENGTH(*len, 0); cmd += F_LEN(cmd); bzero(cmd, sizeof(*cmd)); return cmd; } /* * Takes arguments and copies them into a comment */ static void fill_comment(ipfw_insn *cmd, char **av, int cblen) { int i, l; char *p = (char *)(cmd + 1); cmd->opcode = O_NOP; cmd->len = (cmd->len & (F_NOT | F_OR)); /* Compute length of comment string. 
*/ for (i = 0, l = 0; av[i] != NULL; i++) l += strlen(av[i]) + 1; if (l == 0) return; if (l > 84) errx(EX_DATAERR, "comment too long (max 80 chars)"); l = 1 + (l+3)/4; cmd->len = (cmd->len & (F_NOT | F_OR)) | l; CHECK_CMDLEN; for (i = 0; av[i] != NULL; i++) { strcpy(p, av[i]); p += strlen(av[i]); *p++ = ' '; } *(--p) = '\0'; } /* * A function to fill simple commands of size 1. * Existing flags are preserved. */ static void fill_cmd(ipfw_insn *cmd, enum ipfw_opcodes opcode, int flags, uint16_t arg) { cmd->opcode = opcode; cmd->len = ((cmd->len | flags) & (F_NOT | F_OR)) | 1; cmd->arg1 = arg; } /* * Fetch and add the MAC address and type, with masks. This generates one or * two microinstructions, and returns the pointer to the last one. */ static ipfw_insn * add_mac(ipfw_insn *cmd, char *av[], int cblen) { ipfw_insn_mac *mac; if ( ( av[0] == NULL ) || ( av[1] == NULL ) ) errx(EX_DATAERR, "MAC dst src"); cmd->opcode = O_MACADDR2; cmd->len = (cmd->len & (F_NOT | F_OR)) | F_INSN_SIZE(ipfw_insn_mac); CHECK_CMDLEN; mac = (ipfw_insn_mac *)cmd; get_mac_addr_mask(av[0], mac->addr, mac->mask); /* dst */ get_mac_addr_mask(av[1], &(mac->addr[ETHER_ADDR_LEN]), &(mac->mask[ETHER_ADDR_LEN])); /* src */ return cmd; } static ipfw_insn * add_mactype(ipfw_insn *cmd, char *av, int cblen) { if (!av) errx(EX_DATAERR, "missing MAC type"); if (strcmp(av, "any") != 0) { /* we have a non-null type */ fill_newports((ipfw_insn_u16 *)cmd, av, IPPROTO_ETHERTYPE, cblen); cmd->opcode = O_MAC_TYPE; return cmd; } else return NULL; } static ipfw_insn * add_proto0(ipfw_insn *cmd, char *av, u_char *protop) { struct protoent *pe; char *ep; int proto; proto = strtol(av, &ep, 10); if (*ep != '\0' || proto <= 0) { if ((pe = getprotobyname(av)) == NULL) return NULL; proto = pe->p_proto; } fill_cmd(cmd, O_PROTO, 0, proto); *protop = proto; return cmd; } static ipfw_insn * add_proto(ipfw_insn *cmd, char *av, u_char *protop) { u_char proto = IPPROTO_IP; if (_substrcmp(av, "all") == 0 || strcmp(av, "ip") == 0) ; /* do not set O_IP4 nor O_IP6 */ else if (strcmp(av, "ip4") == 0) /* explicit "just IPv4" rule */ fill_cmd(cmd, O_IP4, 0, 0); else if (strcmp(av, "ip6") == 0) { /* explicit "just IPv6" rule */ proto = IPPROTO_IPV6; fill_cmd(cmd, O_IP6, 0, 0); } else return add_proto0(cmd, av, protop); *protop = proto; return cmd; } static ipfw_insn * add_proto_compat(ipfw_insn *cmd, char *av, u_char *protop) { u_char proto = IPPROTO_IP; if (_substrcmp(av, "all") == 0 || strcmp(av, "ip") == 0) ; /* do not set O_IP4 nor O_IP6 */ else if (strcmp(av, "ipv4") == 0 || strcmp(av, "ip4") == 0) /* explicit "just IPv4" rule */ fill_cmd(cmd, O_IP4, 0, 0); else if (strcmp(av, "ipv6") == 0 || strcmp(av, "ip6") == 0) { /* explicit "just IPv6" rule */ proto = IPPROTO_IPV6; fill_cmd(cmd, O_IP6, 0, 0); } else return add_proto0(cmd, av, protop); *protop = proto; return cmd; } static ipfw_insn * add_srcip(ipfw_insn *cmd, char *av, int cblen, struct tidx *tstate) { fill_ip((ipfw_insn_ip *)cmd, av, cblen, tstate); if (cmd->opcode == O_IP_DST_SET) /* set */ cmd->opcode = O_IP_SRC_SET; else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ cmd->opcode = O_IP_SRC_LOOKUP; else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) /* me */ cmd->opcode = O_IP_SRC_ME; else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) /* one IP */ cmd->opcode = O_IP_SRC; else /* addr/mask */ cmd->opcode = O_IP_SRC_MASK; return cmd; } static ipfw_insn * add_dstip(ipfw_insn *cmd, char *av, int cblen, struct tidx *tstate) { fill_ip((ipfw_insn_ip *)cmd, av, cblen, tstate); if (cmd->opcode == O_IP_DST_SET) 
/* set */ ; else if (cmd->opcode == O_IP_DST_LOOKUP) /* table */ ; else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn)) /* me */ cmd->opcode = O_IP_DST_ME; else if (F_LEN(cmd) == F_INSN_SIZE(ipfw_insn_u32)) /* one IP */ cmd->opcode = O_IP_DST; else /* addr/mask */ cmd->opcode = O_IP_DST_MASK; return cmd; } static struct _s_x f_reserved_keywords[] = { { "altq", TOK_OR }, { "//", TOK_OR }, { "diverted", TOK_OR }, { "dst-port", TOK_OR }, { "src-port", TOK_OR }, { "established", TOK_OR }, { "keep-state", TOK_OR }, { "frag", TOK_OR }, { "icmptypes", TOK_OR }, { "in", TOK_OR }, { "out", TOK_OR }, { "ip6", TOK_OR }, { "any", TOK_OR }, { "to", TOK_OR }, { "via", TOK_OR }, { "{", TOK_OR }, { NULL, 0 } /* terminator */ }; static ipfw_insn * add_ports(ipfw_insn *cmd, char *av, u_char proto, int opcode, int cblen) { if (match_token(f_reserved_keywords, av) != -1) return (NULL); if (fill_newports((ipfw_insn_u16 *)cmd, av, proto, cblen)) { /* XXX todo: check that we have a protocol with ports */ cmd->opcode = opcode; return cmd; } return NULL; } static ipfw_insn * add_src(ipfw_insn *cmd, char *av, u_char proto, int cblen, struct tidx *tstate) { struct in6_addr a; char *host, *ch, buf[INET6_ADDRSTRLEN]; ipfw_insn *ret = NULL; int len; /* Copy first address in set if needed */ if ((ch = strpbrk(av, "/,")) != NULL) { len = ch - av; strlcpy(buf, av, sizeof(buf)); if (len < sizeof(buf)) buf[len] = '\0'; host = buf; } else host = av; if (proto == IPPROTO_IPV6 || strcmp(av, "me6") == 0 || inet_pton(AF_INET6, host, &a) == 1) ret = add_srcip6(cmd, av, cblen, tstate); /* XXX: should check for IPv4, not !IPv6 */ if (ret == NULL && (proto == IPPROTO_IP || strcmp(av, "me") == 0 || inet_pton(AF_INET6, host, &a) != 1)) ret = add_srcip(cmd, av, cblen, tstate); if (ret == NULL && strcmp(av, "any") != 0) ret = cmd; return ret; } static ipfw_insn * add_dst(ipfw_insn *cmd, char *av, u_char proto, int cblen, struct tidx *tstate) { struct in6_addr a; char *host, *ch, buf[INET6_ADDRSTRLEN]; ipfw_insn *ret = NULL; int len; /* Copy first address in set if needed */ if ((ch = strpbrk(av, "/,")) != NULL) { len = ch - av; strlcpy(buf, av, sizeof(buf)); if (len < sizeof(buf)) buf[len] = '\0'; host = buf; } else host = av; if (proto == IPPROTO_IPV6 || strcmp(av, "me6") == 0 || inet_pton(AF_INET6, host, &a) == 1) ret = add_dstip6(cmd, av, cblen, tstate); /* XXX: should check for IPv4, not !IPv6 */ if (ret == NULL && (proto == IPPROTO_IP || strcmp(av, "me") == 0 || inet_pton(AF_INET6, host, &a) != 1)) ret = add_dstip(cmd, av, cblen, tstate); if (ret == NULL && strcmp(av, "any") != 0) ret = cmd; return ret; } /* * Parse arguments and assemble the microinstructions which make up a rule. * Rules are added into the 'rulebuf' and then copied in the correct order * into the actual rule. * * The syntax for a rule starts with the action, followed by * optional action parameters, and the various match patterns. * In the assembled microcode, the first opcode must be an O_PROBE_STATE * (generated if the rule includes a keep-state option), then the * various match patterns, log/altq actions, and the actual action. * */ void compile_rule(char *av[], uint32_t *rbuf, int *rbufsize, struct tidx *tstate) { /* * rules are added into the 'rulebuf' and then copied in * the correct order into the actual rule. * Some things that need to go out of order (prob, action etc.) * go into actbuf[]. 
*/ static uint32_t actbuf[255], cmdbuf[255]; int rblen, ablen, cblen; ipfw_insn *src, *dst, *cmd, *action, *prev=NULL; ipfw_insn *first_cmd; /* first match pattern */ struct ip_fw_rule *rule; /* * various flags used to record that we entered some fields. */ ipfw_insn *have_state = NULL; /* any state-related option */ int have_rstate = 0; ipfw_insn *have_log = NULL, *have_altq = NULL, *have_tag = NULL; ipfw_insn *have_skipcmd = NULL; size_t len; int i; int open_par = 0; /* open parenthesis ( */ /* proto is here because it is used to fetch ports */ u_char proto = IPPROTO_IP; /* default protocol */ double match_prob = 1; /* match probability, default is always match */ bzero(actbuf, sizeof(actbuf)); /* actions go here */ bzero(cmdbuf, sizeof(cmdbuf)); bzero(rbuf, *rbufsize); rule = (struct ip_fw_rule *)rbuf; cmd = (ipfw_insn *)cmdbuf; action = (ipfw_insn *)actbuf; rblen = *rbufsize / sizeof(uint32_t); rblen -= sizeof(struct ip_fw_rule) / sizeof(uint32_t); ablen = sizeof(actbuf) / sizeof(actbuf[0]); cblen = sizeof(cmdbuf) / sizeof(cmdbuf[0]); cblen -= F_INSN_SIZE(ipfw_insn_u32) + 1; #define CHECK_RBUFLEN(len) { CHECK_LENGTH(rblen, len); rblen -= len; } #define CHECK_ACTLEN CHECK_LENGTH(ablen, action->len) av++; /* [rule N] -- Rule number optional */ if (av[0] && isdigit(**av)) { rule->rulenum = atoi(*av); av++; } /* [set N] -- set number (0..RESVD_SET), optional */ if (av[0] && av[1] && _substrcmp(*av, "set") == 0) { int set = strtoul(av[1], NULL, 10); if (set < 0 || set > RESVD_SET) errx(EX_DATAERR, "illegal set %s", av[1]); rule->set = set; tstate->set = set; av += 2; } /* [prob D] -- match probability, optional */ if (av[0] && av[1] && _substrcmp(*av, "prob") == 0) { match_prob = strtod(av[1], NULL); if (match_prob <= 0 || match_prob > 1) errx(EX_DATAERR, "illegal match prob. 
%s", av[1]); av += 2; } /* action -- mandatory */ NEED1("missing action"); i = match_token(rule_actions, *av); av++; action->len = 1; /* default */ CHECK_ACTLEN; switch(i) { case TOK_CHECKSTATE: have_state = action; action->opcode = O_CHECK_STATE; if (*av == NULL || match_token(rule_options, *av) == TOK_COMMENT) { action->arg1 = pack_object(tstate, default_state_name, IPFW_TLV_STATE_NAME); break; } if (*av[0] == ':') { if (strcmp(*av + 1, "any") == 0) action->arg1 = 0; else if (state_check_name(*av + 1) == 0) action->arg1 = pack_object(tstate, *av + 1, IPFW_TLV_STATE_NAME); else errx(EX_DATAERR, "Invalid state name %s", *av); av++; break; } errx(EX_DATAERR, "Invalid state name %s", *av); break; case TOK_ABORT: action->opcode = O_REJECT; action->arg1 = ICMP_REJECT_ABORT; break; case TOK_ABORT6: action->opcode = O_UNREACH6; action->arg1 = ICMP6_UNREACH_ABORT; break; case TOK_ACCEPT: action->opcode = O_ACCEPT; break; case TOK_DENY: action->opcode = O_DENY; action->arg1 = 0; break; case TOK_REJECT: action->opcode = O_REJECT; action->arg1 = ICMP_UNREACH_HOST; break; case TOK_RESET: action->opcode = O_REJECT; action->arg1 = ICMP_REJECT_RST; break; case TOK_RESET6: action->opcode = O_UNREACH6; action->arg1 = ICMP6_UNREACH_RST; break; case TOK_UNREACH: action->opcode = O_REJECT; NEED1("missing reject code"); fill_reject_code(&action->arg1, *av); av++; break; case TOK_UNREACH6: action->opcode = O_UNREACH6; NEED1("missing unreach code"); fill_unreach6_code(&action->arg1, *av); av++; break; case TOK_COUNT: action->opcode = O_COUNT; break; case TOK_NAT: action->opcode = O_NAT; action->len = F_INSN_SIZE(ipfw_insn_nat); CHECK_ACTLEN; if (*av != NULL && _substrcmp(*av, "global") == 0) { action->arg1 = IP_FW_NAT44_GLOBAL; av++; break; } else goto chkarg; case TOK_QUEUE: action->opcode = O_QUEUE; goto chkarg; case TOK_PIPE: action->opcode = O_PIPE; goto chkarg; case TOK_SKIPTO: action->opcode = O_SKIPTO; goto chkarg; case TOK_NETGRAPH: action->opcode = O_NETGRAPH; goto chkarg; case TOK_NGTEE: action->opcode = O_NGTEE; goto chkarg; case TOK_DIVERT: action->opcode = O_DIVERT; goto chkarg; case TOK_TEE: action->opcode = O_TEE; goto chkarg; case TOK_CALL: action->opcode = O_CALLRETURN; chkarg: if (!av[0]) errx(EX_USAGE, "missing argument for %s", *(av - 1)); if (isdigit(**av)) { action->arg1 = strtoul(*av, NULL, 10); if (action->arg1 <= 0 || action->arg1 >= IP_FW_TABLEARG) errx(EX_DATAERR, "illegal argument for %s", *(av - 1)); } else if (_substrcmp(*av, "tablearg") == 0) { action->arg1 = IP_FW_TARG; } else if (i == TOK_DIVERT || i == TOK_TEE) { struct servent *s; setservent(1); s = getservbyname(av[0], "divert"); if (s != NULL) action->arg1 = ntohs(s->s_port); else errx(EX_DATAERR, "illegal divert/tee port"); } else errx(EX_DATAERR, "illegal argument for %s", *(av - 1)); av++; break; case TOK_FORWARD: { /* * Locate the address-port separator (':' or ','). * Could be one of the following: * hostname:port * IPv4 a.b.c.d,port * IPv4 a.b.c.d:port * IPv6 w:x:y::z,port * The ':' can only be used with hostname and IPv4 address. * XXX-BZ Should we also support [w:x:y::z]:port? */ struct sockaddr_storage result; struct addrinfo *res; char *s, *end; int family; u_short port_number; NEED1("missing forward address[:port]"); /* * locate the address-port separator (':' or ',') */ s = strchr(*av, ','); if (s == NULL) { /* Distinguish between IPv4:port and IPv6 cases. 
*/ s = strchr(*av, ':'); if (s && strchr(s+1, ':')) s = NULL; /* no port */ } port_number = 0; if (s != NULL) { /* Terminate host portion and set s to start of port. */ *(s++) = '\0'; i = strtoport(s, &end, 0 /* base */, 0 /* proto */); if (s == end) errx(EX_DATAERR, "illegal forwarding port ``%s''", s); port_number = (u_short)i; } if (_substrcmp(*av, "tablearg") == 0) { family = PF_INET; ((struct sockaddr_in*)&result)->sin_addr.s_addr = INADDR_ANY; } else { /* * Resolve the host name or address to a family and a * network representation of the address. */ if (getaddrinfo(*av, NULL, NULL, &res)) errx(EX_DATAERR, NULL); /* Just use the first host in the answer. */ family = res->ai_family; memcpy(&result, res->ai_addr, res->ai_addrlen); freeaddrinfo(res); } if (family == PF_INET) { ipfw_insn_sa *p = (ipfw_insn_sa *)action; action->opcode = O_FORWARD_IP; action->len = F_INSN_SIZE(ipfw_insn_sa); CHECK_ACTLEN; /* * In the kernel we assume AF_INET and use only * sin_port and sin_addr. Remember to set sin_len as * the routing code seems to use it too. */ p->sa.sin_len = sizeof(struct sockaddr_in); p->sa.sin_family = AF_INET; p->sa.sin_port = port_number; p->sa.sin_addr.s_addr = ((struct sockaddr_in *)&result)->sin_addr.s_addr; } else if (family == PF_INET6) { ipfw_insn_sa6 *p = (ipfw_insn_sa6 *)action; action->opcode = O_FORWARD_IP6; action->len = F_INSN_SIZE(ipfw_insn_sa6); CHECK_ACTLEN; p->sa.sin6_len = sizeof(struct sockaddr_in6); p->sa.sin6_family = AF_INET6; p->sa.sin6_port = port_number; p->sa.sin6_flowinfo = 0; p->sa.sin6_scope_id = ((struct sockaddr_in6 *)&result)->sin6_scope_id; bcopy(&((struct sockaddr_in6*)&result)->sin6_addr, &p->sa.sin6_addr, sizeof(p->sa.sin6_addr)); } else { errx(EX_DATAERR, "Invalid address family in forward action"); } av++; break; } case TOK_COMMENT: /* pretend it is a 'count' rule followed by the comment */ action->opcode = O_COUNT; av--; /* go back... 
*/ break; case TOK_SETFIB: { int numfibs; size_t intsize = sizeof(int); action->opcode = O_SETFIB; NEED1("missing fib number"); if (_substrcmp(*av, "tablearg") == 0) { action->arg1 = IP_FW_TARG; } else { action->arg1 = strtoul(*av, NULL, 10); if (sysctlbyname("net.fibs", &numfibs, &intsize, NULL, 0) == -1) errx(EX_DATAERR, "fibs not suported.\n"); if (action->arg1 >= numfibs) /* Temporary */ errx(EX_DATAERR, "fib too large.\n"); /* Add high-order bit to fib to make room for tablearg*/ action->arg1 |= 0x8000; } av++; break; } case TOK_SETDSCP: { int code; action->opcode = O_SETDSCP; NEED1("missing DSCP code"); if (_substrcmp(*av, "tablearg") == 0) { action->arg1 = IP_FW_TARG; } else { if (isalpha(*av[0])) { if ((code = match_token(f_ipdscp, *av)) == -1) errx(EX_DATAERR, "Unknown DSCP code"); action->arg1 = code; } else action->arg1 = strtoul(*av, NULL, 10); /* * Add high-order bit to DSCP to make room * for tablearg */ action->arg1 |= 0x8000; } av++; break; } case TOK_REASS: action->opcode = O_REASS; break; case TOK_RETURN: fill_cmd(action, O_CALLRETURN, F_NOT, 0); break; case TOK_TCPSETMSS: { u_long mss; uint16_t idx; idx = pack_object(tstate, "tcp-setmss", IPFW_TLV_EACTION); if (idx == 0) errx(EX_DATAERR, "pack_object failed"); fill_cmd(action, O_EXTERNAL_ACTION, 0, idx); NEED1("Missing MSS value"); action = next_cmd(action, &ablen); action->len = 1; CHECK_ACTLEN; mss = strtoul(*av, NULL, 10); if (mss == 0 || mss > UINT16_MAX) errx(EX_USAGE, "invalid MSS value %s", *av); fill_cmd(action, O_EXTERNAL_DATA, 0, (uint16_t)mss); av++; break; } default: av--; if (match_token(rule_eactions, *av) == -1) errx(EX_DATAERR, "invalid action %s\n", *av); /* * External actions support. * XXX: we support only syntax with instance name. * For known external actions (from rule_eactions list) * we can handle syntax directly. But with `eaction' * keyword we can use only `eaction ' * syntax. */ case TOK_EACTION: { uint16_t idx; NEED1("Missing eaction name"); if (eaction_check_name(*av) != 0) errx(EX_DATAERR, "Invalid eaction name %s", *av); idx = pack_object(tstate, *av, IPFW_TLV_EACTION); if (idx == 0) errx(EX_DATAERR, "pack_object failed"); fill_cmd(action, O_EXTERNAL_ACTION, 0, idx); av++; NEED1("Missing eaction instance name"); action = next_cmd(action, &ablen); action->len = 1; CHECK_ACTLEN; if (eaction_check_name(*av) != 0) errx(EX_DATAERR, "Invalid eaction instance name %s", *av); /* * External action instance object has TLV type depended * from the external action name object index. Since we * currently don't know this index, use zero as TLV type. */ idx = pack_object(tstate, *av, 0); if (idx == 0) errx(EX_DATAERR, "pack_object failed"); fill_cmd(action, O_EXTERNAL_INSTANCE, 0, idx); av++; } } action = next_cmd(action, &ablen); /* * [altq queuename] -- altq tag, optional * [log [logamount N]] -- log, optional * * If they exist, it go first in the cmdbuf, but then it is * skipped in the copy section to the end of the buffer. 
*/ while (av[0] != NULL && (i = match_token(rule_action_params, *av)) != -1) { av++; switch (i) { case TOK_LOG: { ipfw_insn_log *c = (ipfw_insn_log *)cmd; int l; if (have_log) errx(EX_DATAERR, "log cannot be specified more than once"); have_log = (ipfw_insn *)c; cmd->len = F_INSN_SIZE(ipfw_insn_log); CHECK_CMDLEN; cmd->opcode = O_LOG; if (av[0] && _substrcmp(*av, "logamount") == 0) { av++; NEED1("logamount requires argument"); l = atoi(*av); if (l < 0) errx(EX_DATAERR, "logamount must be positive"); c->max_log = l; av++; } else { len = sizeof(c->max_log); if (sysctlbyname("net.inet.ip.fw.verbose_limit", &c->max_log, &len, NULL, 0) == -1) { if (co.test_only) { c->max_log = 0; break; } errx(1, "sysctlbyname(\"%s\")", "net.inet.ip.fw.verbose_limit"); } } } break; #ifndef NO_ALTQ case TOK_ALTQ: { ipfw_insn_altq *a = (ipfw_insn_altq *)cmd; NEED1("missing altq queue name"); if (have_altq) errx(EX_DATAERR, "altq cannot be specified more than once"); have_altq = (ipfw_insn *)a; cmd->len = F_INSN_SIZE(ipfw_insn_altq); CHECK_CMDLEN; cmd->opcode = O_ALTQ; a->qid = altq_name_to_qid(*av); av++; } break; #endif case TOK_TAG: case TOK_UNTAG: { uint16_t tag; if (have_tag) errx(EX_USAGE, "tag and untag cannot be " "specified more than once"); GET_UINT_ARG(tag, IPFW_ARG_MIN, IPFW_ARG_MAX, i, rule_action_params); have_tag = cmd; fill_cmd(cmd, O_TAG, (i == TOK_TAG) ? 0: F_NOT, tag); av++; break; } default: abort(); } cmd = next_cmd(cmd, &cblen); } if (have_state) { /* must be a check-state, we are done */ if (*av != NULL && match_token(rule_options, *av) == TOK_COMMENT) { /* check-state has a comment */ av++; fill_comment(cmd, av, cblen); cmd = next_cmd(cmd, &cblen); av[0] = NULL; } goto done; } #define OR_START(target) \ if (av[0] && (*av[0] == '(' || *av[0] == '{')) { \ if (open_par) \ errx(EX_USAGE, "nested \"(\" not allowed\n"); \ prev = NULL; \ open_par = 1; \ if ( (av[0])[1] == '\0') { \ av++; \ } else \ (*av)++; \ } \ target: \ #define CLOSE_PAR \ if (open_par) { \ if (av[0] && ( \ strcmp(*av, ")") == 0 || \ strcmp(*av, "}") == 0)) { \ prev = NULL; \ open_par = 0; \ av++; \ } else \ errx(EX_USAGE, "missing \")\"\n"); \ } #define NOT_BLOCK \ if (av[0] && _substrcmp(*av, "not") == 0) { \ if (cmd->len & F_NOT) \ errx(EX_USAGE, "double \"not\" not allowed\n"); \ cmd->len |= F_NOT; \ av++; \ } #define OR_BLOCK(target) \ if (av[0] && _substrcmp(*av, "or") == 0) { \ if (prev == NULL || open_par == 0) \ errx(EX_DATAERR, "invalid OR block"); \ prev->len |= F_OR; \ av++; \ goto target; \ } \ CLOSE_PAR; first_cmd = cmd; #if 0 /* * MAC addresses, optional. * If we have this, we skip the part "proto from src to dst" * and jump straight to the option parsing. 
*/ NOT_BLOCK; NEED1("missing protocol"); if (_substrcmp(*av, "MAC") == 0 || _substrcmp(*av, "mac") == 0) { av++; /* the "MAC" keyword */ add_mac(cmd, av); /* exits in case of errors */ cmd = next_cmd(cmd); av += 2; /* dst-mac and src-mac */ NOT_BLOCK; NEED1("missing mac type"); if (add_mactype(cmd, av[0])) cmd = next_cmd(cmd); av++; /* any or mac-type */ goto read_options; } #endif /* * protocol, mandatory */ OR_START(get_proto); NOT_BLOCK; NEED1("missing protocol"); if (add_proto_compat(cmd, *av, &proto)) { av++; if (F_LEN(cmd) != 0) { prev = cmd; cmd = next_cmd(cmd, &cblen); } } else if (first_cmd != cmd) { errx(EX_DATAERR, "invalid protocol ``%s''", *av); } else { rule->flags |= IPFW_RULE_JUSTOPTS; goto read_options; } OR_BLOCK(get_proto); /* * "from", mandatory */ if ((av[0] == NULL) || _substrcmp(*av, "from") != 0) errx(EX_USAGE, "missing ``from''"); av++; /* * source IP, mandatory */ OR_START(source_ip); NOT_BLOCK; /* optional "not" */ NEED1("missing source address"); if (add_src(cmd, *av, proto, cblen, tstate)) { av++; if (F_LEN(cmd) != 0) { /* ! any */ prev = cmd; cmd = next_cmd(cmd, &cblen); } } else errx(EX_USAGE, "bad source address %s", *av); OR_BLOCK(source_ip); /* * source ports, optional */ NOT_BLOCK; /* optional "not" */ if ( av[0] != NULL ) { if (_substrcmp(*av, "any") == 0 || add_ports(cmd, *av, proto, O_IP_SRCPORT, cblen)) { av++; if (F_LEN(cmd) != 0) cmd = next_cmd(cmd, &cblen); } } /* * "to", mandatory */ if ( (av[0] == NULL) || _substrcmp(*av, "to") != 0 ) errx(EX_USAGE, "missing ``to''"); av++; /* * destination, mandatory */ OR_START(dest_ip); NOT_BLOCK; /* optional "not" */ NEED1("missing dst address"); if (add_dst(cmd, *av, proto, cblen, tstate)) { av++; if (F_LEN(cmd) != 0) { /* ! any */ prev = cmd; cmd = next_cmd(cmd, &cblen); } } else errx( EX_USAGE, "bad destination address %s", *av); OR_BLOCK(dest_ip); /* * dest. 
ports, optional */ NOT_BLOCK; /* optional "not" */ if (av[0]) { if (_substrcmp(*av, "any") == 0 || add_ports(cmd, *av, proto, O_IP_DSTPORT, cblen)) { av++; if (F_LEN(cmd) != 0) cmd = next_cmd(cmd, &cblen); } } read_options: prev = NULL; while ( av[0] != NULL ) { char *s; ipfw_insn_u32 *cmd32; /* alias for cmd */ s = *av; cmd32 = (ipfw_insn_u32 *)cmd; if (*s == '!') { /* alternate syntax for NOT */ if (cmd->len & F_NOT) errx(EX_USAGE, "double \"not\" not allowed\n"); cmd->len = F_NOT; s++; } i = match_token(rule_options, s); av++; switch(i) { case TOK_NOT: if (cmd->len & F_NOT) errx(EX_USAGE, "double \"not\" not allowed\n"); cmd->len = F_NOT; break; case TOK_OR: if (open_par == 0 || prev == NULL) errx(EX_USAGE, "invalid \"or\" block\n"); prev->len |= F_OR; break; case TOK_STARTBRACE: if (open_par) errx(EX_USAGE, "+nested \"(\" not allowed\n"); open_par = 1; break; case TOK_ENDBRACE: if (!open_par) errx(EX_USAGE, "+missing \")\"\n"); open_par = 0; prev = NULL; break; case TOK_IN: fill_cmd(cmd, O_IN, 0, 0); break; case TOK_OUT: cmd->len ^= F_NOT; /* toggle F_NOT */ fill_cmd(cmd, O_IN, 0, 0); break; case TOK_DIVERTED: fill_cmd(cmd, O_DIVERTED, 0, 3); break; case TOK_DIVERTEDLOOPBACK: fill_cmd(cmd, O_DIVERTED, 0, 1); break; case TOK_DIVERTEDOUTPUT: fill_cmd(cmd, O_DIVERTED, 0, 2); break; case TOK_FRAG: fill_cmd(cmd, O_FRAG, 0, 0); break; case TOK_LAYER2: fill_cmd(cmd, O_LAYER2, 0, 0); break; case TOK_XMIT: case TOK_RECV: case TOK_VIA: NEED1("recv, xmit, via require interface name" " or address"); fill_iface((ipfw_insn_if *)cmd, av[0], cblen, tstate); av++; if (F_LEN(cmd) == 0) /* not a valid address */ break; if (i == TOK_XMIT) cmd->opcode = O_XMIT; else if (i == TOK_RECV) cmd->opcode = O_RECV; else if (i == TOK_VIA) cmd->opcode = O_VIA; break; case TOK_ICMPTYPES: NEED1("icmptypes requires list of types"); fill_icmptypes((ipfw_insn_u32 *)cmd, *av); av++; break; case TOK_ICMP6TYPES: NEED1("icmptypes requires list of types"); fill_icmp6types((ipfw_insn_icmp6 *)cmd, *av, cblen); av++; break; case TOK_IPTTL: NEED1("ipttl requires TTL"); if (strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_IPTTL, cblen)) errx(EX_DATAERR, "invalid ipttl %s", *av); } else fill_cmd(cmd, O_IPTTL, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_IPID: NEED1("ipid requires id"); if (strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_IPID, cblen)) errx(EX_DATAERR, "invalid ipid %s", *av); } else fill_cmd(cmd, O_IPID, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_IPLEN: NEED1("iplen requires length"); if (strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_IPLEN, cblen)) errx(EX_DATAERR, "invalid ip len %s", *av); } else fill_cmd(cmd, O_IPLEN, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_IPVER: NEED1("ipver requires version"); fill_cmd(cmd, O_IPVER, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_IPPRECEDENCE: NEED1("ipprecedence requires value"); fill_cmd(cmd, O_IPPRECEDENCE, 0, (strtoul(*av, NULL, 0) & 7) << 5); av++; break; case TOK_DSCP: NEED1("missing DSCP code"); fill_dscp(cmd, *av, cblen); av++; break; case TOK_IPOPTS: NEED1("missing argument for ipoptions"); fill_flags_cmd(cmd, O_IPOPT, f_ipopts, *av); av++; break; case TOK_IPTOS: NEED1("missing argument for iptos"); fill_flags_cmd(cmd, O_IPTOS, f_iptos, *av); av++; break; case TOK_UID: NEED1("uid requires argument"); { char *end; uid_t uid; struct passwd *pwd; cmd->opcode = O_UID; uid = strtoul(*av, &end, 0); pwd = (*end == '\0') ? 
getpwuid(uid) : getpwnam(*av); if (pwd == NULL) errx(EX_DATAERR, "uid \"%s\" nonexistent", *av); cmd32->d[0] = pwd->pw_uid; cmd->len |= F_INSN_SIZE(ipfw_insn_u32); av++; } break; case TOK_GID: NEED1("gid requires argument"); { char *end; gid_t gid; struct group *grp; cmd->opcode = O_GID; gid = strtoul(*av, &end, 0); grp = (*end == '\0') ? getgrgid(gid) : getgrnam(*av); if (grp == NULL) errx(EX_DATAERR, "gid \"%s\" nonexistent", *av); cmd32->d[0] = grp->gr_gid; cmd->len |= F_INSN_SIZE(ipfw_insn_u32); av++; } break; case TOK_JAIL: NEED1("jail requires argument"); { int jid; cmd->opcode = O_JAIL; jid = jail_getid(*av); if (jid < 0) errx(EX_DATAERR, "%s", jail_errmsg); cmd32->d[0] = (uint32_t)jid; cmd->len |= F_INSN_SIZE(ipfw_insn_u32); av++; } break; case TOK_ESTAB: fill_cmd(cmd, O_ESTAB, 0, 0); break; case TOK_SETUP: fill_cmd(cmd, O_TCPFLAGS, 0, (TH_SYN) | ( (TH_ACK) & 0xff) <<8 ); break; case TOK_TCPDATALEN: NEED1("tcpdatalen requires length"); if (strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_TCPDATALEN, cblen)) errx(EX_DATAERR, "invalid tcpdata len %s", *av); } else fill_cmd(cmd, O_TCPDATALEN, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_TCPOPTS: NEED1("missing argument for tcpoptions"); fill_flags_cmd(cmd, O_TCPOPTS, f_tcpopts, *av); av++; break; case TOK_TCPSEQ: case TOK_TCPACK: NEED1("tcpseq/tcpack requires argument"); cmd->len = F_INSN_SIZE(ipfw_insn_u32); cmd->opcode = (i == TOK_TCPSEQ) ? O_TCPSEQ : O_TCPACK; cmd32->d[0] = htonl(strtoul(*av, NULL, 0)); av++; break; case TOK_TCPWIN: NEED1("tcpwin requires length"); if (strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_TCPWIN, cblen)) errx(EX_DATAERR, "invalid tcpwin len %s", *av); } else fill_cmd(cmd, O_TCPWIN, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_TCPFLAGS: NEED1("missing argument for tcpflags"); cmd->opcode = O_TCPFLAGS; fill_flags_cmd(cmd, O_TCPFLAGS, f_tcpflags, *av); av++; break; case TOK_KEEPSTATE: case TOK_RECORDSTATE: { uint16_t uidx; if (open_par) errx(EX_USAGE, "keep-state or record-state cannot be part " "of an or block"); if (have_state) errx(EX_USAGE, "only one of keep-state, record-state, " " limit and set-limit is allowed"); if (*av != NULL && *av[0] == ':') { if (state_check_name(*av + 1) != 0) errx(EX_DATAERR, "Invalid state name %s", *av); uidx = pack_object(tstate, *av + 1, IPFW_TLV_STATE_NAME); av++; } else uidx = pack_object(tstate, default_state_name, IPFW_TLV_STATE_NAME); have_state = cmd; have_rstate = i == TOK_RECORDSTATE; fill_cmd(cmd, O_KEEP_STATE, 0, uidx); break; } case TOK_LIMIT: case TOK_SETLIMIT: { ipfw_insn_limit *c = (ipfw_insn_limit *)cmd; int val; if (open_par) errx(EX_USAGE, "limit or set-limit cannot be part of an or block"); if (have_state) errx(EX_USAGE, "only one of keep-state, record-state, " " limit and set-limit is allowed"); have_state = cmd; have_rstate = i == TOK_SETLIMIT; cmd->len = F_INSN_SIZE(ipfw_insn_limit); CHECK_CMDLEN; cmd->opcode = O_LIMIT; c->limit_mask = c->conn_limit = 0; while ( av[0] != NULL ) { if ((val = match_token(limit_masks, *av)) <= 0) break; c->limit_mask |= val; av++; } if (c->limit_mask == 0) errx(EX_USAGE, "limit: missing limit mask"); GET_UINT_ARG(c->conn_limit, IPFW_ARG_MIN, IPFW_ARG_MAX, TOK_LIMIT, rule_options); av++; if (*av != NULL && *av[0] == ':') { if (state_check_name(*av + 1) != 0) errx(EX_DATAERR, "Invalid state name %s", *av); cmd->arg1 = pack_object(tstate, *av + 1, IPFW_TLV_STATE_NAME); av++; } else cmd->arg1 = pack_object(tstate, default_state_name, IPFW_TLV_STATE_NAME); break; } case TOK_PROTO: NEED1("missing protocol"); if 
(add_proto(cmd, *av, &proto)) { av++; } else errx(EX_DATAERR, "invalid protocol ``%s''", *av); break; case TOK_SRCIP: NEED1("missing source IP"); if (add_srcip(cmd, *av, cblen, tstate)) { av++; } break; case TOK_DSTIP: NEED1("missing destination IP"); if (add_dstip(cmd, *av, cblen, tstate)) { av++; } break; case TOK_SRCIP6: NEED1("missing source IP6"); if (add_srcip6(cmd, *av, cblen, tstate)) { av++; } break; case TOK_DSTIP6: NEED1("missing destination IP6"); if (add_dstip6(cmd, *av, cblen, tstate)) { av++; } break; case TOK_SRCPORT: NEED1("missing source port"); if (_substrcmp(*av, "any") == 0 || add_ports(cmd, *av, proto, O_IP_SRCPORT, cblen)) { av++; } else errx(EX_DATAERR, "invalid source port %s", *av); break; case TOK_DSTPORT: NEED1("missing destination port"); if (_substrcmp(*av, "any") == 0 || add_ports(cmd, *av, proto, O_IP_DSTPORT, cblen)) { av++; } else errx(EX_DATAERR, "invalid destination port %s", *av); break; case TOK_MAC: if (add_mac(cmd, av, cblen)) av += 2; break; case TOK_MACTYPE: NEED1("missing mac type"); if (!add_mactype(cmd, *av, cblen)) errx(EX_DATAERR, "invalid mac type %s", *av); av++; break; case TOK_VERREVPATH: fill_cmd(cmd, O_VERREVPATH, 0, 0); break; case TOK_VERSRCREACH: fill_cmd(cmd, O_VERSRCREACH, 0, 0); break; case TOK_ANTISPOOF: fill_cmd(cmd, O_ANTISPOOF, 0, 0); break; case TOK_IPSEC: fill_cmd(cmd, O_IPSEC, 0, 0); break; case TOK_IPV6: fill_cmd(cmd, O_IP6, 0, 0); break; case TOK_IPV4: fill_cmd(cmd, O_IP4, 0, 0); break; case TOK_EXT6HDR: fill_ext6hdr( cmd, *av ); av++; break; case TOK_FLOWID: if (proto != IPPROTO_IPV6 ) errx( EX_USAGE, "flow-id filter is active " "only for ipv6 protocol\n"); fill_flow6( (ipfw_insn_u32 *) cmd, *av, cblen); av++; break; case TOK_COMMENT: fill_comment(cmd, av, cblen); av[0]=NULL; break; case TOK_TAGGED: if (av[0] && strpbrk(*av, "-,")) { if (!add_ports(cmd, *av, 0, O_TAGGED, cblen)) errx(EX_DATAERR, "tagged: invalid tag" " list: %s", *av); } else { uint16_t tag; GET_UINT_ARG(tag, IPFW_ARG_MIN, IPFW_ARG_MAX, TOK_TAGGED, rule_options); fill_cmd(cmd, O_TAGGED, 0, tag); } av++; break; case TOK_FIB: NEED1("fib requires fib number"); fill_cmd(cmd, O_FIB, 0, strtoul(*av, NULL, 0)); av++; break; case TOK_SOCKARG: fill_cmd(cmd, O_SOCKARG, 0, 0); break; case TOK_LOOKUP: { ipfw_insn_u32 *c = (ipfw_insn_u32 *)cmd; int j; if (!av[0] || !av[1]) errx(EX_USAGE, "format: lookup argument tablenum"); cmd->opcode = O_IP_DST_LOOKUP; cmd->len |= F_INSN_SIZE(ipfw_insn) + 2; i = match_token(rule_options, *av); for (j = 0; lookup_key[j] >= 0 ; j++) { if (i == lookup_key[j]) break; } if (lookup_key[j] <= 0) errx(EX_USAGE, "format: cannot lookup on %s", *av); __PAST_END(c->d, 1) = j; // i converted to option av++; if ((j = pack_table(tstate, *av)) == 0) errx(EX_DATAERR, "Invalid table name: %s", *av); cmd->arg1 = j; av++; } break; case TOK_FLOW: NEED1("missing table name"); if (strncmp(*av, "table(", 6) != 0) errx(EX_DATAERR, "enclose table name into \"table()\""); fill_table(cmd, *av, O_IP_FLOW_LOOKUP, tstate); av++; break; case TOK_SKIPACTION: if (have_skipcmd) errx(EX_USAGE, "only one defer-action " "is allowed"); have_skipcmd = cmd; fill_cmd(cmd, O_SKIP_ACTION, 0, 0); break; default: errx(EX_USAGE, "unrecognised option [%d] %s\n", i, s); } if (F_LEN(cmd) > 0) { /* prepare to advance */ prev = cmd; cmd = next_cmd(cmd, &cblen); } } done: if (!have_state && have_skipcmd) warnx("Rule contains \"defer-immediate-action\" " "and doesn't contain any state-related options."); /* * Now copy stuff into the rule. 
* If we have a keep-state option, the first instruction * must be a PROBE_STATE (which is generated here). * If we have a LOG option, it was stored as the first command, * and now must be moved to the top of the action part. */ dst = (ipfw_insn *)rule->cmd; /* * First thing to write into the command stream is the match probability. */ if (match_prob != 1) { /* 1 means always match */ dst->opcode = O_PROB; dst->len = 2; *((int32_t *)(dst+1)) = (int32_t)(match_prob * 0x7fffffff); dst += dst->len; } /* * generate O_PROBE_STATE if necessary */ if (have_state && have_state->opcode != O_CHECK_STATE && !have_rstate) { fill_cmd(dst, O_PROBE_STATE, 0, have_state->arg1); dst = next_cmd(dst, &rblen); } /* * copy all commands but O_LOG, O_KEEP_STATE, O_LIMIT, O_ALTQ, O_TAG, * O_SKIP_ACTION */ for (src = (ipfw_insn *)cmdbuf; src != cmd; src += i) { i = F_LEN(src); CHECK_RBUFLEN(i); switch (src->opcode) { case O_LOG: case O_KEEP_STATE: case O_LIMIT: case O_ALTQ: case O_TAG: case O_SKIP_ACTION: break; default: bcopy(src, dst, i * sizeof(uint32_t)); dst += i; } } /* * put back the have_state command as last opcode */ if (have_state && have_state->opcode != O_CHECK_STATE) { i = F_LEN(have_state); CHECK_RBUFLEN(i); bcopy(have_state, dst, i * sizeof(uint32_t)); dst += i; } /* * put back the have_skipcmd command as very last opcode */ if (have_skipcmd) { i = F_LEN(have_skipcmd); CHECK_RBUFLEN(i); bcopy(have_skipcmd, dst, i * sizeof(uint32_t)); dst += i; } /* * start action section */ rule->act_ofs = dst - rule->cmd; /* put back O_LOG, O_ALTQ, O_TAG if necessary */ if (have_log) { i = F_LEN(have_log); CHECK_RBUFLEN(i); bcopy(have_log, dst, i * sizeof(uint32_t)); dst += i; } if (have_altq) { i = F_LEN(have_altq); CHECK_RBUFLEN(i); bcopy(have_altq, dst, i * sizeof(uint32_t)); dst += i; } if (have_tag) { i = F_LEN(have_tag); CHECK_RBUFLEN(i); bcopy(have_tag, dst, i * sizeof(uint32_t)); dst += i; } /* * copy all other actions */ for (src = (ipfw_insn *)actbuf; src != action; src += i) { i = F_LEN(src); CHECK_RBUFLEN(i); bcopy(src, dst, i * sizeof(uint32_t)); dst += i; } rule->cmd_len = (uint32_t *)dst - (uint32_t *)(rule->cmd); *rbufsize = (char *)dst - (char *)rule; } static int compare_ntlv(const void *_a, const void *_b) { ipfw_obj_ntlv *a, *b; a = (ipfw_obj_ntlv *)_a; b = (ipfw_obj_ntlv *)_b; if (a->set < b->set) return (-1); else if (a->set > b->set) return (1); if (a->idx < b->idx) return (-1); else if (a->idx > b->idx) return (1); if (a->head.type < b->head.type) return (-1); else if (a->head.type > b->head.type) return (1); return (0); } /* * Provide kernel with sorted list of referenced objects */ static void object_sort_ctlv(ipfw_obj_ctlv *ctlv) { qsort(ctlv + 1, ctlv->count, ctlv->objsize, compare_ntlv); } struct object_kt { uint16_t uidx; uint16_t type; }; static int compare_object_kntlv(const void *k, const void *v) { ipfw_obj_ntlv *ntlv; struct object_kt key; key = *((struct object_kt *)k); ntlv = (ipfw_obj_ntlv *)v; if (key.uidx < ntlv->idx) return (-1); else if (key.uidx > ntlv->idx) return (1); if (key.type < ntlv->head.type) return (-1); else if (key.type > ntlv->head.type) return (1); return (0); } /* * Finds object name in @ctlv by @idx and @type. * Uses the following facts: * 1) All TLVs are the same size * 2) Kernel implementation provides already sorted list. * * Returns table name or NULL. 
*/ static char * object_search_ctlv(ipfw_obj_ctlv *ctlv, uint16_t idx, uint16_t type) { ipfw_obj_ntlv *ntlv; struct object_kt key; key.uidx = idx; key.type = type; ntlv = bsearch(&key, (ctlv + 1), ctlv->count, ctlv->objsize, compare_object_kntlv); if (ntlv != NULL) return (ntlv->name); return (NULL); } static char * table_search_ctlv(ipfw_obj_ctlv *ctlv, uint16_t idx) { return (object_search_ctlv(ctlv, idx, IPFW_TLV_TBL_NAME)); } /* * Adds one or more rules to ipfw chain. * Data layout: * Request: * [ * ip_fw3_opheader * [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional *1) * [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) [ ip_fw_rule ip_fw_insn ] x N ] (*2) (*3) * ] * Reply: * [ * ip_fw3_opheader * [ ipfw_obj_ctlv(IPFW_TLV_TBL_LIST) ipfw_obj_ntlv x N ] (optional) * [ ipfw_obj_ctlv(IPFW_TLV_RULE_LIST) [ ip_fw_rule ip_fw_insn ] x N ] * ] * * Rules in reply are modified to store their actual ruleset number. * * (*1) TLVs inside IPFW_TLV_TBL_LIST needs to be sorted ascending * according to their idx field and there has to be no duplicates. * (*2) Numbered rules inside IPFW_TLV_RULE_LIST needs to be sorted ascending. * (*3) Each ip_fw structure needs to be aligned to u64 boundary. */ void ipfw_add(char *av[]) { uint32_t rulebuf[1024]; int rbufsize, default_off, tlen, rlen; size_t sz; struct tidx ts; struct ip_fw_rule *rule; caddr_t tbuf; ip_fw3_opheader *op3; ipfw_obj_ctlv *ctlv, *tstate; rbufsize = sizeof(rulebuf); memset(rulebuf, 0, rbufsize); memset(&ts, 0, sizeof(ts)); /* Optimize case with no tables */ default_off = sizeof(ipfw_obj_ctlv) + sizeof(ip_fw3_opheader); op3 = (ip_fw3_opheader *)rulebuf; ctlv = (ipfw_obj_ctlv *)(op3 + 1); rule = (struct ip_fw_rule *)(ctlv + 1); rbufsize -= default_off; compile_rule(av, (uint32_t *)rule, &rbufsize, &ts); /* Align rule size to u64 boundary */ rlen = roundup2(rbufsize, sizeof(uint64_t)); tbuf = NULL; sz = 0; tstate = NULL; if (ts.count != 0) { /* Some tables. We have to alloc more data */ tlen = ts.count * sizeof(ipfw_obj_ntlv); sz = default_off + sizeof(ipfw_obj_ctlv) + tlen + rlen; if ((tbuf = calloc(1, sz)) == NULL) err(EX_UNAVAILABLE, "malloc() failed for IP_FW_ADD"); op3 = (ip_fw3_opheader *)tbuf; /* Tables first */ ctlv = (ipfw_obj_ctlv *)(op3 + 1); ctlv->head.type = IPFW_TLV_TBLNAME_LIST; ctlv->head.length = sizeof(ipfw_obj_ctlv) + tlen; ctlv->count = ts.count; ctlv->objsize = sizeof(ipfw_obj_ntlv); memcpy(ctlv + 1, ts.idx, tlen); object_sort_ctlv(ctlv); tstate = ctlv; /* Rule next */ ctlv = (ipfw_obj_ctlv *)((caddr_t)ctlv + ctlv->head.length); ctlv->head.type = IPFW_TLV_RULE_LIST; ctlv->head.length = sizeof(ipfw_obj_ctlv) + rlen; ctlv->count = 1; memcpy(ctlv + 1, rule, rbufsize); } else { /* Simply add header */ sz = rlen + default_off; memset(ctlv, 0, sizeof(*ctlv)); ctlv->head.type = IPFW_TLV_RULE_LIST; ctlv->head.length = sizeof(ipfw_obj_ctlv) + rlen; ctlv->count = 1; } if (do_get3(IP_FW_XADD, op3, &sz) != 0) err(EX_UNAVAILABLE, "getsockopt(%s)", "IP_FW_XADD"); if (!co.do_quiet) { struct format_opts sfo; struct buf_pr bp; memset(&sfo, 0, sizeof(sfo)); sfo.tstate = tstate; sfo.set_mask = (uint32_t)(-1); bp_alloc(&bp, 4096); show_static_rule(&co, &sfo, &bp, rule, NULL); printf("%s", bp.buf); bp_free(&bp); } if (tbuf != NULL) free(tbuf); if (ts.idx != NULL) free(ts.idx); } /* * clear the counters or the log counters. * optname has the following values: * 0 (zero both counters and logging) * 1 (zero logging only) */ void ipfw_zero(int ac, char *av[], int optname) { ipfw_range_tlv rt; char const *errstr; char const *name = optname ? 
"RESETLOG" : "ZERO"; uint32_t arg; int failed = EX_OK; optname = optname ? IP_FW_XRESETLOG : IP_FW_XZERO; av++; ac--; if (ac == 0) { /* clear all entries */ memset(&rt, 0, sizeof(rt)); rt.flags = IPFW_RCFLAG_ALL; if (do_range_cmd(optname, &rt) < 0) err(EX_UNAVAILABLE, "setsockopt(IP_FW_X%s)", name); if (!co.do_quiet) printf("%s.\n", optname == IP_FW_XZERO ? "Accounting cleared":"Logging counts reset"); return; } while (ac) { /* Rule number */ if (isdigit(**av)) { arg = strtonum(*av, 0, 0xffff, &errstr); if (errstr) errx(EX_DATAERR, "invalid rule number %s\n", *av); memset(&rt, 0, sizeof(rt)); rt.start_rule = arg; rt.end_rule = arg; rt.flags |= IPFW_RCFLAG_RANGE; if (co.use_set != 0) { rt.set = co.use_set - 1; rt.flags |= IPFW_RCFLAG_SET; } if (do_range_cmd(optname, &rt) != 0) { warn("rule %u: setsockopt(IP_FW_X%s)", arg, name); failed = EX_UNAVAILABLE; } else if (rt.new_set == 0) { printf("Entry %d not found\n", arg); failed = EX_UNAVAILABLE; } else if (!co.do_quiet) printf("Entry %d %s.\n", arg, optname == IP_FW_XZERO ? "cleared" : "logging count reset"); } else { errx(EX_USAGE, "invalid rule number ``%s''", *av); } av++; ac--; } if (failed != EX_OK) exit(failed); } void ipfw_flush(int force) { ipfw_range_tlv rt; if (!force && !co.do_quiet) { /* need to ask user */ int c; printf("Are you sure? [yn] "); fflush(stdout); do { c = toupper(getc(stdin)); while (c != '\n' && getc(stdin) != '\n') if (feof(stdin)) return; /* and do not flush */ } while (c != 'Y' && c != 'N'); printf("\n"); if (c == 'N') /* user said no */ return; } if (co.do_pipe) { dummynet_flush(); return; } /* `ipfw set N flush` - is the same that `ipfw delete set N` */ memset(&rt, 0, sizeof(rt)); if (co.use_set != 0) { rt.set = co.use_set - 1; rt.flags = IPFW_RCFLAG_SET; } else rt.flags = IPFW_RCFLAG_ALL; if (do_range_cmd(IP_FW_XDEL, &rt) != 0) err(EX_UNAVAILABLE, "setsockopt(IP_FW_XDEL)"); if (!co.do_quiet) printf("Flushed all %s.\n", co.do_pipe ? 
"pipes" : "rules"); } static struct _s_x intcmds[] = { { "talist", TOK_TALIST }, { "iflist", TOK_IFLIST }, { "olist", TOK_OLIST }, { "vlist", TOK_VLIST }, { NULL, 0 } }; static struct _s_x otypes[] = { { "EACTION", IPFW_TLV_EACTION }, { "DYNSTATE", IPFW_TLV_STATE_NAME }, { NULL, 0 } }; static const char* lookup_eaction_name(ipfw_obj_ntlv *ntlv, int cnt, uint16_t type) { const char *name; int i; name = NULL; for (i = 0; i < cnt; i++) { if (ntlv[i].head.type != IPFW_TLV_EACTION) continue; if (IPFW_TLV_EACTION_NAME(ntlv[i].idx) != type) continue; name = ntlv[i].name; break; } return (name); } static void ipfw_list_objects(int ac, char *av[]) { ipfw_obj_lheader req, *olh; ipfw_obj_ntlv *ntlv; const char *name; size_t sz; int i; memset(&req, 0, sizeof(req)); sz = sizeof(req); if (do_get3(IP_FW_DUMP_SRVOBJECTS, &req.opheader, &sz) != 0) if (errno != ENOMEM) return; sz = req.size; if ((olh = calloc(1, sz)) == NULL) return; olh->size = sz; if (do_get3(IP_FW_DUMP_SRVOBJECTS, &olh->opheader, &sz) != 0) { free(olh); return; } if (olh->count > 0) printf("Objects list:\n"); else printf("There are no objects\n"); ntlv = (ipfw_obj_ntlv *)(olh + 1); for (i = 0; i < olh->count; i++) { name = match_value(otypes, ntlv->head.type); if (name == NULL) name = lookup_eaction_name( (ipfw_obj_ntlv *)(olh + 1), olh->count, ntlv->head.type); if (name == NULL) printf(" kidx: %4d\ttype: %10d\tname: %s\n", ntlv->idx, ntlv->head.type, ntlv->name); else printf(" kidx: %4d\ttype: %10s\tname: %s\n", ntlv->idx, name, ntlv->name); ntlv++; } free(olh); } void ipfw_internal_handler(int ac, char *av[]) { int tcmd; ac--; av++; NEED1("internal cmd required"); if ((tcmd = match_token(intcmds, *av)) == -1) errx(EX_USAGE, "invalid internal sub-cmd: %s", *av); switch (tcmd) { case TOK_IFLIST: ipfw_list_tifaces(); break; case TOK_TALIST: ipfw_list_ta(ac, av); break; case TOK_OLIST: ipfw_list_objects(ac, av); break; case TOK_VLIST: ipfw_list_values(ac, av); break; } } static int ipfw_get_tracked_ifaces(ipfw_obj_lheader **polh) { ipfw_obj_lheader req, *olh; size_t sz; memset(&req, 0, sizeof(req)); sz = sizeof(req); if (do_get3(IP_FW_XIFLIST, &req.opheader, &sz) != 0) { if (errno != ENOMEM) return (errno); } sz = req.size; if ((olh = calloc(1, sz)) == NULL) return (ENOMEM); olh->size = sz; if (do_get3(IP_FW_XIFLIST, &olh->opheader, &sz) != 0) { free(olh); return (errno); } *polh = olh; return (0); } static int ifinfo_cmp(const void *a, const void *b) { ipfw_iface_info *ia, *ib; ia = (ipfw_iface_info *)a; ib = (ipfw_iface_info *)b; return (stringnum_cmp(ia->ifname, ib->ifname)); } /* * Retrieves table list from kernel, * optionally sorts it and calls requested function for each table. * Returns 0 on success. 
*/ static void ipfw_list_tifaces() { ipfw_obj_lheader *olh; ipfw_iface_info *info; int i, error; if ((error = ipfw_get_tracked_ifaces(&olh)) != 0) err(EX_OSERR, "Unable to request ipfw tracked interface list"); qsort(olh + 1, olh->count, olh->objsize, ifinfo_cmp); info = (ipfw_iface_info *)(olh + 1); for (i = 0; i < olh->count; i++) { if (info->flags & IPFW_IFFLAG_RESOLVED) printf("%s ifindex: %d refcount: %u changes: %u\n", info->ifname, info->ifindex, info->refcnt, info->gencnt); else printf("%s ifindex: unresolved refcount: %u changes: %u\n", info->ifname, info->refcnt, info->gencnt); info = (ipfw_iface_info *)((caddr_t)info + olh->objsize); } free(olh); } Index: user/ngie/bug-237403/share/man/man4/Makefile =================================================================== --- user/ngie/bug-237403/share/man/man4/Makefile (revision 346925) +++ user/ngie/bug-237403/share/man/man4/Makefile (revision 346926) @@ -1,1009 +1,1012 @@ # @(#)Makefile 8.1 (Berkeley) 6/18/93 # $FreeBSD$ .include PACKAGE=runtime-manuals MAN= aac.4 \ aacraid.4 \ acpi.4 \ ${_acpi_asus.4} \ ${_acpi_asus_wmi.4} \ ${_acpi_dock.4} \ ${_acpi_fujitsu.4} \ ${_acpi_hp.4} \ ${_acpi_ibm.4} \ ${_acpi_panasonic.4} \ ${_acpi_rapidstart.4} \ ${_acpi_sony.4} \ acpi_thermal.4 \ ${_acpi_toshiba.4} \ acpi_video.4 \ ${_acpi_wmi.4} \ ada.4 \ adm6996fc.4 \ ae.4 \ ${_aesni.4} \ age.4 \ agp.4 \ ahc.4 \ ahci.4 \ ahd.4 \ ${_aibs.4} \ aio.4 \ alc.4 \ ale.4 \ alpm.4 \ altera_atse.4 \ altera_avgen.4 \ altera_jtag_uart.4 \ altera_sdcard.4 \ altq.4 \ amdpm.4 \ ${_amdsbwd.4} \ ${_amdsmb.4} \ ${_amdsmn.4} \ ${_amdtemp.4} \ ${_bxe.4} \ amr.4 \ an.4 \ ${_aout.4} \ ${_apic.4} \ arcmsr.4 \ ${_asmc.4} \ at45d.4 \ ata.4 \ ath.4 \ ath_ahb.4 \ ath_hal.4 \ ath_pci.4 \ atkbd.4 \ atkbdc.4 \ atp.4 \ ${_atf_test_case.4} \ ${_atrtc.4} \ ${_attimer.4} \ audit.4 \ auditpipe.4 \ aue.4 \ axe.4 \ axge.4 \ bce.4 \ bcma.4 \ bfe.4 \ bge.4 \ ${_bhyve.4} \ bhnd.4 \ bhnd_chipc.4 \ bhnd_pmu.4 \ bhndb.4 \ bhndb_pci.4 \ bktr.4 \ blackhole.4 \ bnxt.4 \ bpf.4 \ bridge.4 \ bt.4 \ bwi.4 \ bwn.4 \ ${_bytgpio.4} \ ${_chvgpio.4} \ capsicum.4 \ cardbus.4 \ carp.4 \ cas.4 \ cc_cdg.4 \ cc_chd.4 \ cc_cubic.4 \ cc_dctcp.4 \ cc_hd.4 \ cc_htcp.4 \ cc_newreno.4 \ cc_vegas.4 \ ${_ccd.4} \ ccr.4 \ cd.4 \ cdce.4 \ cfi.4 \ cfumass.4 \ ch.4 \ chromebook_platform.4 \ ciss.4 \ cloudabi.4 \ cmx.4 \ ${_coretemp.4} \ ${_cpuctl.4} \ cpufreq.4 \ crypto.4 \ ctl.4 \ cue.4 \ cxgb.4 \ cxgbe.4 \ cxgbev.4 \ cy.4 \ cyapa.4 \ da.4 \ dc.4 \ dcons.4 \ dcons_crom.4 \ ddb.4 \ de.4 \ devctl.4 \ disc.4 \ divert.4 \ ${_dpms.4} \ ds1307.4 \ ds3231.4 \ ${_dtrace_provs} \ dummynet.4 \ ed.4 \ edsc.4 \ ehci.4 \ em.4 \ ena.4 \ enc.4 \ epair.4 \ esp.4 \ est.4 \ et.4 \ etherswitch.4 \ eventtimers.4 \ exca.4 \ e6060sw.4 \ fd.4 \ fdc.4 \ fdt.4 \ fdt_pinctrl.4 \ fdtbus.4 \ ffclock.4 \ filemon.4 \ firewire.4 \ full.4 \ fwe.4 \ fwip.4 \ fwohci.4 \ fxp.4 \ gbde.4 \ gdb.4 \ gem.4 \ geom.4 \ geom_fox.4 \ geom_linux_lvm.4 \ geom_map.4 \ geom_uzip.4 \ gif.4 \ gpio.4 \ gpioiic.4 \ gpioled.4 \ gre.4 \ h_ertt.4 \ hifn.4 \ hme.4 \ hpet.4 \ ${_hpt27xx.4} \ ${_hptiop.4} \ ${_hptmv.4} \ ${_hptnr.4} \ ${_hptrr.4} \ ${_hv_kvp.4} \ ${_hv_netvsc.4} \ ${_hv_storvsc.4} \ ${_hv_utils.4} \ ${_hv_vmbus.4} \ ${_hv_vss.4} \ hwpmc.4 \ iavf.4 \ ichsmb.4 \ ${_ichwd.4} \ icmp.4 \ icmp6.4 \ ida.4 \ if_ipsec.4 \ iflib.4 \ ifmib.4 \ ig4.4 \ igmp.4 \ iic.4 \ iicbb.4 \ iicbus.4 \ iicsmb.4 \ iir.4 \ ${_imcsmb.4} \ inet.4 \ inet6.4 \ intpm.4 \ intro.4 \ ${_io.4} \ ${_ioat.4} \ ip.4 \ ip6.4 \ ipfirewall.4 \ ipheth.4 \ ${_ipmi.4} \ ips.4 \ ipsec.4 \ ipw.4 \ ipwfw.4 
\ isci.4 \ isl.4 \ ismt.4 \ isp.4 \ ispfw.4 \ iwi.4 \ iwifw.4 \ iwm.4 \ iwmfw.4 \ iwn.4 \ iwnfw.4 \ ixgbe.4 \ ixl.4 \ jedec_dimm.4 \ jme.4 \ kbdmux.4 \ keyboard.4 \ kld.4 \ ksyms.4 \ ksz8995ma.4 \ ktr.4 \ kue.4 \ lagg.4 \ le.4 \ led.4 \ lge.4 \ ${_linux.4} \ liquidio.4 \ lm75.4 \ lo.4 \ lp.4 \ lpbb.4 \ lpt.4 \ mac.4 \ mac_biba.4 \ mac_bsdextended.4 \ mac_ifoff.4 \ mac_lomac.4 \ mac_mls.4 \ mac_none.4 \ mac_ntpd.4 \ mac_partition.4 \ mac_portacl.4 \ mac_seeotheruids.4 \ mac_stub.4 \ mac_test.4 \ malo.4 \ md.4 \ mdio.4 \ me.4 \ mem.4 \ meteor.4 \ mfi.4 \ miibus.4 \ mk48txx.4 \ mld.4 \ mlx.4 \ mlx4en.4 \ mlx5en.4 \ mly.4 \ mmc.4 \ mmcsd.4 \ mn.4 \ mod_cc.4 \ mos.4 \ mouse.4 \ mpr.4 \ mps.4 \ mpt.4 \ mrsas.4 \ msk.4 \ mtio.4 \ multicast.4 \ muge.4 \ mvs.4 \ mwl.4 \ mwlfw.4 \ mx25l.4 \ mxge.4 \ my.4 \ nand.4 \ nandsim.4 \ ${_ndis.4} \ net80211.4 \ netdump.4 \ netfpga10g_nf10bmac.4 \ netgraph.4 \ netintro.4 \ netmap.4 \ ${_nfe.4} \ ${_nfsmb.4} \ ng_async.4 \ ngatmbase.4 \ ng_atmllc.4 \ ng_bpf.4 \ ng_bridge.4 \ ng_bt3c.4 \ ng_btsocket.4 \ ng_car.4 \ ng_ccatm.4 \ ng_checksum.4 \ ng_cisco.4 \ ng_deflate.4 \ ng_device.4 \ nge.4 \ ng_echo.4 \ ng_eiface.4 \ ng_etf.4 \ ng_ether.4 \ ng_ether_echo.4 \ ng_frame_relay.4 \ ng_gif.4 \ ng_gif_demux.4 \ ng_h4.4 \ ng_hci.4 \ ng_hole.4 \ ng_hub.4 \ ng_iface.4 \ ng_ipfw.4 \ ng_ip_input.4 \ ng_ksocket.4 \ ng_l2cap.4 \ ng_l2tp.4 \ ng_lmi.4 \ ng_mppc.4 \ ng_nat.4 \ ng_netflow.4 \ ng_one2many.4 \ ng_patch.4 \ ng_ppp.4 \ ng_pppoe.4 \ ng_pptpgre.4 \ ng_pred1.4 \ ng_rfc1490.4 \ ng_socket.4 \ ng_source.4 \ ng_split.4 \ ng_sppp.4 \ ng_sscfu.4 \ ng_sscop.4 \ ng_tag.4 \ ng_tcpmss.4 \ ng_tee.4 \ ng_tty.4 \ ng_ubt.4 \ ng_UI.4 \ ng_uni.4 \ ng_vjc.4 \ ng_vlan.4 \ nmdm.4 \ ${_ntb.4} \ ${_ntb_hw_intel.4} \ ${_ntb_hw_plx.4} \ ${_ntb_transport.4} \ ${_nda.4} \ ${_if_ntb.4} \ null.4 \ numa.4 \ ${_nvd.4} \ ${_nvme.4} \ ${_nvram.4} \ ${_nvram2env.4} \ oce.4 \ ocs_fc.4\ ohci.4 \ orm.4 \ ow.4 \ ow_temp.4 \ owc.4 \ ${_padlock.4} \ pass.4 \ pccard.4 \ pccbb.4 \ pcf.4 \ pci.4 \ pcib.4 \ pcic.4 \ pcm.4 \ pcn.4 \ ${_pf.4} \ ${_pflog.4} \ ${_pfsync.4} \ pim.4 \ pms.4 \ polling.4 \ ppbus.4 \ ppc.4 \ ppi.4 \ procdesc.4 \ proto.4 \ psm.4 \ pst.4 \ pt.4 \ pts.4 \ pty.4 \ puc.4 \ ${_qlxge.4} \ ${_qlxgb.4} \ ${_qlxgbe.4} \ ${_qlnxe.4} \ ral.4 \ random.4 \ rc.4 \ rctl.4 \ re.4 \ rgephy.4 \ rights.4 \ rl.4 \ rndtest.4 \ route.4 \ rp.4 \ rtwn.4 \ rtwnfw.4 \ rtwn_pci.4 \ rue.4 \ sa.4 \ safe.4 \ sbp.4 \ sbp_targ.4 \ scc.4 \ sched_4bsd.4 \ sched_ule.4 \ screen.4 \ scsi.4 \ sctp.4 \ sdhci.4 \ sem.4 \ send.4 \ ses.4 \ sf.4 \ ${_sfxge.4} \ sge.4 \ siba.4 \ siftr.4 \ siis.4 \ simplebus.4 \ sio.4 \ sis.4 \ sk.4 \ ${_smartpqi.4} \ smb.4 \ smbus.4 \ smp.4 \ smsc.4 \ sn.4 \ snd_ad1816.4 \ snd_als4000.4 \ snd_atiixp.4 \ snd_cmi.4 \ snd_cs4281.4 \ snd_csa.4 \ snd_ds1.4 \ snd_emu10k1.4 \ snd_emu10kx.4 \ snd_envy24.4 \ snd_envy24ht.4 \ snd_es137x.4 \ snd_ess.4 \ snd_fm801.4 \ snd_gusc.4 \ snd_hda.4 \ snd_hdspe.4 \ snd_ich.4 \ snd_maestro3.4 \ snd_maestro.4 \ snd_mss.4 \ snd_neomagic.4 \ snd_sbc.4 \ snd_solo.4 \ snd_spicds.4 \ snd_t4dwave.4 \ snd_uaudio.4 \ snd_via8233.4 \ snd_via82c686.4 \ snd_vibes.4 \ snp.4 \ spigen.4 \ ${_spkr.4} \ splash.4 \ sppp.4 \ ste.4 \ stf.4 \ stge.4 \ sym.4 \ syncache.4 \ syncer.4 \ syscons.4 \ sysmouse.4 \ tap.4 \ targ.4 \ tcp.4 \ tdfx.4 \ terasic_mtl.4 \ termios.4 \ textdump.4 \ ti.4 \ timecounters.4 \ tl.4 \ ${_tpm.4} \ trm.4 \ tty.4 \ tun.4 \ twa.4 \ twe.4 \ tws.4 \ tx.4 \ txp.4 \ udp.4 \ udplite.4 \ ure.4 \ vale.4 \ vga.4 \ vge.4 \ viapm.4 \ ${_viawd.4} \ ${_virtio.4} \ 
${_virtio_balloon.4} \ ${_virtio_blk.4} \ ${_virtio_console.4} \ ${_virtio_random.4} \ ${_virtio_scsi.4} \ ${_vmci.4} \ vkbd.4 \ vlan.4 \ vxlan.4 \ ${_vmm.4} \ ${_vmx.4} \ vpo.4 \ vr.4 \ vt.4 \ vte.4 \ ${_vtnet.4} \ watchdog.4 \ wb.4 \ ${_wbwd.4} \ wi.4 \ witness.4 \ wlan.4 \ wlan_acl.4 \ wlan_amrr.4 \ wlan_ccmp.4 \ wlan_tkip.4 \ wlan_wep.4 \ wlan_xauth.4 \ wmt.4 \ ${_wpi.4} \ wsp.4 \ xe.4 \ ${_xen.4} \ xhci.4 \ xl.4 \ ${_xnb.4} \ xpt.4 \ zero.4 MLINKS= ae.4 if_ae.4 MLINKS+=age.4 if_age.4 MLINKS+=agp.4 agpgart.4 MLINKS+=alc.4 if_alc.4 MLINKS+=ale.4 if_ale.4 MLINKS+=altera_atse.4 atse.4 MLINKS+=altera_sdcard.4 altera_sdcardc.4 MLINKS+=altq.4 ALTQ.4 MLINKS+=ath.4 if_ath.4 MLINKS+=ath_pci.4 if_ath_pci.4 MLINKS+=an.4 if_an.4 MLINKS+=aue.4 if_aue.4 MLINKS+=axe.4 if_axe.4 MLINKS+=bce.4 if_bce.4 MLINKS+=bfe.4 if_bfe.4 MLINKS+=bge.4 if_bge.4 MLINKS+=bktr.4 brooktree.4 MLINKS+=bnxt.4 if_bnxt.4 MLINKS+=bridge.4 if_bridge.4 MLINKS+=bwi.4 if_bwi.4 MLINKS+=bwn.4 if_bwn.4 MLINKS+=${_bxe.4} ${_if_bxe.4} MLINKS+=cas.4 if_cas.4 MLINKS+=cdce.4 if_cdce.4 MLINKS+=cfi.4 cfid.4 MLINKS+=cloudabi.4 cloudabi32.4 \ cloudabi.4 cloudabi64.4 MLINKS+=crypto.4 cryptodev.4 MLINKS+=cue.4 if_cue.4 MLINKS+=cxgb.4 if_cxgb.4 MLINKS+=cxgbe.4 if_cxgbe.4 \ cxgbe.4 vcxgbe.4 \ cxgbe.4 if_vcxgbe.4 \ cxgbe.4 cxl.4 \ cxgbe.4 if_cxl.4 \ cxgbe.4 vcxl.4 \ cxgbe.4 if_vcxl.4 \ cxgbe.4 cc.4 \ cxgbe.4 if_cc.4 \ cxgbe.4 vcc.4 \ cxgbe.4 if_vcc.4 MLINKS+=cxgbev.4 if_cxgbev.4 \ cxgbev.4 cxlv.4 \ cxgbev.4 if_cxlv.4 \ cxgbev.4 ccv.4 \ cxgbev.4 if_ccv.4 MLINKS+=dc.4 if_dc.4 MLINKS+=de.4 if_de.4 MLINKS+=disc.4 if_disc.4 MLINKS+=ed.4 if_ed.4 MLINKS+=edsc.4 if_edsc.4 MLINKS+=em.4 if_em.4 MLINKS+=enc.4 if_enc.4 MLINKS+=epair.4 if_epair.4 MLINKS+=et.4 if_et.4 MLINKS+=fd.4 stderr.4 \ fd.4 stdin.4 \ fd.4 stdout.4 MLINKS+=fdt.4 FDT.4 MLINKS+=firewire.4 ieee1394.4 MLINKS+=fwe.4 if_fwe.4 MLINKS+=fwip.4 if_fwip.4 MLINKS+=fxp.4 if_fxp.4 MLINKS+=gem.4 if_gem.4 MLINKS+=geom.4 GEOM.4 MLINKS+=gif.4 if_gif.4 MLINKS+=gpio.4 gpiobus.4 MLINKS+=gre.4 if_gre.4 MLINKS+=hme.4 if_hme.4 MLINKS+=hpet.4 acpi_hpet.4 MLINKS+=${_hptrr.4} ${_rr232x.4} MLINKS+=${_attimer.4} ${_i8254.4} MLINKS+=ip.4 rawip.4 MLINKS+=ipfirewall.4 ipaccounting.4 \ ipfirewall.4 ipacct.4 \ ipfirewall.4 ipfw.4 MLINKS+=ipheth.4 if_ipheth.4 MLINKS+=ipw.4 if_ipw.4 MLINKS+=iwi.4 if_iwi.4 MLINKS+=iwm.4 if_iwm.4 MLINKS+=iwn.4 if_iwn.4 MLINKS+=ixgbe.4 ix.4 MLINKS+=ixgbe.4 if_ix.4 MLINKS+=ixgbe.4 if_ixgbe.4 MLINKS+=ixl.4 if_ixl.4 MLINKS+=iavf.4 if_iavf.4 MLINKS+=jme.4 if_jme.4 MLINKS+=kue.4 if_kue.4 MLINKS+=lagg.4 trunk.4 MLINKS+=lagg.4 if_lagg.4 MLINKS+=le.4 if_le.4 MLINKS+=lge.4 if_lge.4 MLINKS+=lo.4 loop.4 MLINKS+=lp.4 plip.4 MLINKS+=malo.4 if_malo.4 MLINKS+=md.4 vn.4 MLINKS+=mem.4 kmem.4 MLINKS+=mfi.4 mfi_linux.4 \ mfi.4 mfip.4 MLINKS+=mlx5en.4 mce.4 MLINKS+=mn.4 if_mn.4 MLINKS+=mos.4 if_mos.4 MLINKS+=msk.4 if_msk.4 MLINKS+=mwl.4 if_mwl.4 MLINKS+=mxge.4 if_mxge.4 MLINKS+=my.4 if_my.4 MLINKS+=${_ndis.4} ${_if_ndis.4} MLINKS+=netfpga10g_nf10bmac.4 if_nf10bmac.4 MLINKS+=netintro.4 net.4 \ netintro.4 networking.4 MLINKS+=${_nfe.4} ${_if_nfe.4} MLINKS+=nge.4 if_nge.4 MLINKS+=ow.4 onewire.4 MLINKS+=pccbb.4 cbb.4 MLINKS+=pcm.4 snd.4 \ pcm.4 sound.4 MLINKS+=pcn.4 if_pcn.4 MLINKS+=pms.4 pmspcv.4 MLINKS+=ral.4 if_ral.4 MLINKS+=re.4 if_re.4 MLINKS+=rl.4 if_rl.4 MLINKS+=rtwn_pci.4 if_rtwn_pci.4 MLINKS+=rue.4 if_rue.4 MLINKS+=scsi.4 CAM.4 \ scsi.4 cam.4 \ scsi.4 scbus.4 \ scsi.4 SCSI.4 MLINKS+=sf.4 if_sf.4 MLINKS+=sge.4 if_sge.4 MLINKS+=sis.4 if_sis.4 MLINKS+=sk.4 if_sk.4 MLINKS+=smp.4 SMP.4 MLINKS+=smsc.4 if_smsc.4 
MLINKS+=sn.4 if_sn.4 MLINKS+=snd_envy24.4 snd_ak452x.4 MLINKS+=snd_sbc.4 snd_sb16.4 \ snd_sbc.4 snd_sb8.4 MLINKS+=${_spkr.4} ${_speaker.4} MLINKS+=splash.4 screensaver.4 MLINKS+=ste.4 if_ste.4 MLINKS+=stf.4 if_stf.4 MLINKS+=stge.4 if_stge.4 MLINKS+=syncache.4 syncookies.4 MLINKS+=syscons.4 sc.4 MLINKS+=tap.4 if_tap.4 MLINKS+=tdfx.4 tdfx_linux.4 MLINKS+=ti.4 if_ti.4 MLINKS+=tl.4 if_tl.4 MLINKS+=tun.4 if_tun.4 MLINKS+=tx.4 if_tx.4 MLINKS+=txp.4 if_txp.4 MLINKS+=ure.4 if_ure.4 MLINKS+=vge.4 if_vge.4 MLINKS+=vlan.4 if_vlan.4 MLINKS+=vxlan.4 if_vxlan.4 MLINKS+=${_vmx.4} ${_if_vmx.4} MLINKS+=vpo.4 imm.4 MLINKS+=vr.4 if_vr.4 MLINKS+=vte.4 if_vte.4 MLINKS+=${_vtnet.4} ${_if_vtnet.4} MLINKS+=watchdog.4 SW_WATCHDOG.4 MLINKS+=wb.4 if_wb.4 MLINKS+=wi.4 if_wi.4 MLINKS+=${_wpi.4} ${_if_wpi.4} MLINKS+=xe.4 if_xe.4 MLINKS+=xl.4 if_xl.4 .if ${MACHINE_CPUARCH} == "amd64" || ${MACHINE_CPUARCH} == "i386" _acpi_asus.4= acpi_asus.4 _acpi_asus_wmi.4= acpi_asus_wmi.4 _acpi_dock.4= acpi_dock.4 _acpi_fujitsu.4=acpi_fujitsu.4 _acpi_hp.4= acpi_hp.4 _acpi_ibm.4= acpi_ibm.4 _acpi_panasonic.4=acpi_panasonic.4 _acpi_rapidstart.4=acpi_rapidstart.4 _acpi_sony.4= acpi_sony.4 _acpi_toshiba.4=acpi_toshiba.4 _acpi_wmi.4= acpi_wmi.4 _aesni.4= aesni.4 _aout.4= aout.4 _apic.4= apic.4 _atrtc.4= atrtc.4 _attimer.4= attimer.4 _aibs.4= aibs.4 _amdsbwd.4= amdsbwd.4 _amdsmb.4= amdsmb.4 _amdsmn.4= amdsmn.4 _amdtemp.4= amdtemp.4 _asmc.4= asmc.4 _bxe.4= bxe.4 _bytgpio.4= bytgpio.4 _chvgpio.4= chvgpio.4 _coretemp.4= coretemp.4 _cpuctl.4= cpuctl.4 _dpms.4= dpms.4 _hpt27xx.4= hpt27xx.4 _hptiop.4= hptiop.4 _hptmv.4= hptmv.4 _hptnr.4= hptnr.4 _hptrr.4= hptrr.4 _hv_kvp.4= hv_kvp.4 _hv_netvsc.4= hv_netvsc.4 _hv_storvsc.4= hv_storvsc.4 _hv_utils.4= hv_utils.4 _hv_vmbus.4= hv_vmbus.4 _hv_vss.4= hv_vss.4 _i8254.4= i8254.4 _ichwd.4= ichwd.4 _if_bxe.4= if_bxe.4 _if_ndis.4= if_ndis.4 _if_nfe.4= if_nfe.4 _if_urtw.4= if_urtw.4 _if_vmx.4= if_vmx.4 _if_vtnet.4= if_vtnet.4 _if_wpi.4= if_wpi.4 _imcsmb.4= imcsmb.4 _ipmi.4= ipmi.4 _io.4= io.4 _linux.4= linux.4 _nda.4= nda.4 _ndis.4= ndis.4 _nfe.4= nfe.4 _nfsmb.4= nfsmb.4 _nvd.4= nvd.4 _nvme.4= nvme.4 _nvram.4= nvram.4 _virtio.4= virtio.4 _virtio_balloon.4=virtio_balloon.4 _virtio_blk.4= virtio_blk.4 _virtio_console.4=virtio_console.4 _virtio_random.4= virtio_random.4 _virtio_scsi.4= virtio_scsi.4 _vmx.4= vmx.4 _vtnet.4= vtnet.4 _padlock.4= padlock.4 _rr232x.4= rr232x.4 _speaker.4= speaker.4 _spkr.4= spkr.4 _tpm.4= tpm.4 _urtw.4= urtw.4 _viawd.4= viawd.4 _vmci.4= vmci.4 _wbwd.4= wbwd.4 _wpi.4= wpi.4 _xen.4= xen.4 _xnb.4= xnb.4 .endif .if ${MACHINE_CPUARCH} == "amd64" _if_ntb.4= if_ntb.4 _ioat.4= ioat.4 _ntb.4= ntb.4 _ntb_hw_intel.4= ntb_hw_intel.4 _ntb_hw_plx.4= ntb_hw_plx.4 _ntb_transport.4=ntb_transport.4 _qlxge.4= qlxge.4 _qlxgb.4= qlxgb.4 _qlxgbe.4= qlxgbe.4 _qlnxe.4= qlnxe.4 _sfxge.4= sfxge.4 _smartpqi.4= smartpqi.4 MLINKS+=qlxge.4 if_qlxge.4 MLINKS+=qlxgb.4 if_qlxgb.4 MLINKS+=qlxgbe.4 if_qlxgbe.4 MLINKS+=qlnxe.4 if_qlnxe.4 MLINKS+=sfxge.4 if_sfxge.4 .if ${MK_BHYVE} != "no" _bhyve.4= bhyve.4 _vmm.4= vmm.4 .endif .endif .if ${MACHINE_CPUARCH} == "mips" _nvram2env.4= nvram2env.4 .endif .if ${MACHINE_CPUARCH} == "powerpc" _nvd.4= nvd.4 _nvme.4= nvme.4 .endif .if empty(MAN_ARCH) __arches= ${MACHINE} ${MACHINE_ARCH} ${MACHINE_CPUARCH} .elif ${MAN_ARCH} == "all" __arches= ${:!/bin/sh -c "/bin/ls -d ${.CURDIR}/man4.*"!:E} .else __arches= ${MAN_ARCH} .endif .for __arch in ${__arches:O:u} .if exists(${.CURDIR}/man4.${__arch}) SUBDIR+= man4.${__arch} .endif .endfor .if ${MK_BLUETOOTH} != "no" MAN+= ng_bluetooth.4 
.endif .if ${MK_CCD} != "no" _ccd.4= ccd.4 .endif .if ${MK_CDDL} != "no" -_dtrace_provs= dtrace_io.4 \ +_dtrace_provs= dtrace_audit.4 \ + dtrace_io.4 \ dtrace_ip.4 \ dtrace_lockstat.4 \ dtrace_proc.4 \ dtrace_sched.4 \ dtrace_sctp.4 \ dtrace_tcp.4 \ dtrace_udp.4 \ dtrace_udplite.4 + +MLINKS+= dtrace_audit.4 dtaudit.4 .endif .if ${MK_EFI} != "no" MAN+= efidev.4 MLINKS+= efidev.4 efirtc.4 .endif .if ${MK_ISCSI} != "no" MAN+= cfiscsi.4 MAN+= iscsi.4 MAN+= iscsi_initiator.4 MAN+= iser.4 .endif .if ${MK_OFED} != "no" MAN+= mlx4ib.4 MAN+= mlx5ib.4 .endif .if ${MK_MLX5TOOL} != "no" MAN+= mlx5io.4 .endif .if ${MK_TESTS} != "no" ATF= ${SRCTOP}/contrib/atf .PATH: ${ATF}/doc _atf_test_case.4= atf-test-case.4 .endif .if ${MK_PF} != "no" _pf.4= pf.4 _pflog.4= pflog.4 _pfsync.4= pfsync.4 .endif .if ${MK_USB} != "no" MAN+= \ otus.4 \ otusfw.4 \ rsu.4 \ rsufw.4 \ rtwn_usb.4 \ rum.4 \ run.4 \ runfw.4 \ u3g.4 \ uark.4 \ uart.4 \ uath.4 \ ubsa.4 \ ubsec.4 \ ubser.4 \ ubtbcmfw.4 \ uchcom.4 \ ucom.4 \ ucycom.4 \ udav.4 \ udbp.4 \ udl.4 \ uep.4 \ ufm.4 \ ufoma.4 \ uftdi.4 \ ugen.4 \ ugold.4 \ uhci.4 \ uhid.4 \ uhso.4 \ uipaq.4 \ ukbd.4 \ uled.4 \ ulpt.4 \ umass.4 \ umcs.4 \ umct.4 \ umodem.4 \ umoscom.4 \ ums.4 \ unix.4 \ upgt.4 \ uplcom.4 \ ural.4 \ urio.4 \ urndis.4 \ ${_urtw.4} \ usb.4 \ usb_quirk.4 \ usb_template.4 \ usfs.4 \ uslcom.4 \ uvisor.4 \ uvscom.4 \ zyd.4 MLINKS+=otus.4 if_otus.4 MLINKS+=rsu.4 if_rsu.4 MLINKS+=rtwn_usb.4 if_rtwn_usb.4 MLINKS+=rum.4 if_rum.4 MLINKS+=run.4 if_run.4 MLINKS+=u3g.4 u3gstub.4 MLINKS+=uath.4 if_uath.4 MLINKS+=udav.4 if_udav.4 MLINKS+=upgt.4 if_upgt.4 MLINKS+=ural.4 if_ural.4 MLINKS+=urndis.4 if_urndis.4 MLINKS+=${_urtw.4} ${_if_urtw.4} MLINKS+=zyd.4 if_zyd.4 .endif .include Index: user/ngie/bug-237403/share/man/man4/audit.4 =================================================================== --- user/ngie/bug-237403/share/man/man4/audit.4 (revision 346925) +++ user/ngie/bug-237403/share/man/man4/audit.4 (revision 346926) @@ -1,148 +1,160 @@ -.\" Copyright (c) 2006 Robert N. M. Watson +.\" Copyright (c) 2006, 2019 Robert N. M. Watson .\" All rights reserved. .\" +.\" This software was developed in part by BAE Systems, the University of +.\" Cambridge Computer Laboratory, and Memorial University under DARPA/AFRL +.\" contract FA8650-15-C-7558 ("CADETS"), as part of the DARPA Transparent +.\" Computing (TC) research program. +.\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd May 31, 2009 +.Dd April 28, 2019 .Dt AUDIT 4 .Os .Sh NAME .Nm audit .Nd Security Event Audit .Sh SYNOPSIS .Cd "options AUDIT" .Sh DESCRIPTION Security Event Audit is a facility to provide fine-grained, configurable logging of security-relevant events, and is intended to meet the requirements of the Common Criteria (CC) Common Access Protection Profile (CAPP) evaluation. The .Fx .Nm facility implements the de facto industry standard BSM API, file formats, and command line interface, first found in the Solaris operating system. Information on the user space implementation can be found in .Xr libbsm 3 . .Pp Audit support is enabled at boot, if present in the kernel, using an .Xr rc.conf 5 flag. The audit daemon, .Xr auditd 8 , is responsible for configuring the kernel to perform .Nm , pushing configuration data from the various audit configuration files into the kernel. .Ss Audit Special Device The kernel .Nm facility provides a special device, .Pa /dev/audit , which is used by .Xr auditd 8 to monitor for .Nm events, such as requests to cycle the log, low disk space conditions, and requests to terminate auditing. This device is not intended for use by applications. .Ss Audit Pipe Special Devices Audit pipe special devices, discussed in .Xr auditpipe 4 , provide a configurable live tracking mechanism to allow applications to tee the audit trail, as well as to configure custom preselection parameters to track users and events in a fine-grained manner. +.Ss DTrace Audit Provider +The DTrace Audit Provider, +.Xr dtaudit 4 , +allows D scripts to enable capture of in-kernel audit records for kernel audit +event types, and then process their contents during audit commit or BSM +generation. .Sh SEE ALSO .Xr auditreduce 1 , .Xr praudit 1 , .Xr audit 2 , .Xr auditctl 2 , .Xr auditon 2 , .Xr getaudit 2 , .Xr getauid 2 , .Xr poll 2 , .Xr select 2 , .Xr setaudit 2 , .Xr setauid 2 , .Xr libbsm 3 , .Xr auditpipe 4 , +.Xr dtaudit 4 , .Xr audit.log 5 , .Xr audit_class 5 , .Xr audit_control 5 , .Xr audit_event 5 , .Xr audit_user 5 , .Xr audit_warn 5 , .Xr rc.conf 5 , .Xr audit 8 , .Xr auditd 8 , .Xr auditdistd 8 .Sh HISTORY The .Tn OpenBSM implementation was created by McAfee Research, the security division of McAfee Inc., under contract to Apple Computer Inc.\& in 2004. It was subsequently adopted by the TrustedBSD Project as the foundation for the OpenBSM distribution. .Pp Support for kernel .Nm first appeared in .Fx 6.2 . .Sh AUTHORS .An -nosplit This software was created by McAfee Research, the security research division of McAfee, Inc., under contract to Apple Computer Inc. Additional authors include .An Wayne Salamon , .An Robert Watson , and SPARTA Inc. .Pp The Basic Security Module (BSM) interface to audit records and audit event stream format were defined by Sun Microsystems. .Pp This manual page was written by .An Robert Watson Aq Mt rwatson@FreeBSD.org . 
.Sh BUGS The .Fx kernel does not fully validate that audit records submitted by user applications are syntactically valid BSM; as submission of records is limited to privileged processes, this is not a critical bug. .Pp Instrumentation of auditable events in the kernel is not complete, as some system calls do not generate audit records, or generate audit records with incomplete argument information. .Pp Mandatory Access Control (MAC) labels, as provided by the .Xr mac 4 facility, are not audited as part of records involving MAC decisions. .Pp Currently the .Nm syscalls are not supported for jailed processes. However, if a process has .Nm session state associated with it, audit records will still be produced and a zonename token containing the jail's ID or name will be present in the audit records. Index: user/ngie/bug-237403/share/man/man4/auditpipe.4 =================================================================== --- user/ngie/bug-237403/share/man/man4/auditpipe.4 (revision 346925) +++ user/ngie/bug-237403/share/man/man4/auditpipe.4 (revision 346926) @@ -1,256 +1,257 @@ .\" Copyright (c) 2006 Robert N. M. Watson .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd May 30, 2018 +.Dd April 28, 2019 .Dt AUDITPIPE 4 .Os .Sh NAME .Nm auditpipe .Nd "pseudo-device for live audit event tracking" .Sh SYNOPSIS .Cd "options AUDIT" .Sh DESCRIPTION While audit trail files generated with .Xr audit 4 and maintained by .Xr auditd 8 provide a reliable long-term store for audit log information, current log files are owned by the audit daemon until terminated making them somewhat unwieldy for live monitoring applications such as host-based intrusion detection. For example, the log may be cycled and new records written to a new file without notice to applications that may be accessing the file. .Pp The audit facility provides an audit pipe facility for applications requiring direct access to live BSM audit data for the purposes of real-time monitoring. Audit pipes are available via a clonable special device, .Pa /dev/auditpipe , subject to the permissions on the device node, and provide a .Qq tee of the audit event stream. 
As the device is clonable, more than one instance of the device may be opened at a time; each device instance will provide independent access to all records. .Pp The audit pipe device provides discrete BSM audit records; if the read buffer passed by the application is too small to hold the next record in the sequence, it will be dropped. Unlike audit data written to the audit trail, the reliability of record delivery is not guaranteed. In particular, when an audit pipe queue fills, records will be dropped. Audit pipe devices are blocking by default, but support non-blocking I/O, asynchronous I/O using .Dv SIGIO , and polled operation via .Xr select 2 and .Xr poll 2 . .Pp Applications may choose to track the global audit trail, or configure local preselection parameters independent of the global audit trail parameters. .Ss Audit Pipe Queue Ioctls The following ioctls retrieve and set various audit pipe record queue properties: .Bl -tag -width ".Dv AUDITPIPE_GET_MAXAUDITDATA" .It Dv AUDITPIPE_GET_QLEN Query the current number of records available for reading on the pipe. .It Dv AUDITPIPE_GET_QLIMIT Retrieve the current maximum number of records that may be queued for reading on the pipe. .It Dv AUDITPIPE_SET_QLIMIT Set the current maximum number of records that may be queued for reading on the pipe. The new limit must fall between the queue limit minimum and queue limit maximum queryable using the following two ioctls. .It Dv AUDITPIPE_GET_QLIMIT_MIN Query the lowest possible maximum number of records that may be queued for reading on the pipe. .It Dv AUDITPIPE_GET_QLIMIT_MAX Query the highest possible maximum number of records that may be queued for reading on the pipe. .It Dv AUDITPIPE_FLUSH Flush all outstanding records on the audit pipe; useful after setting initial preselection properties to delete records queued during the configuration process which may not match the interests of the user process. .It Dv AUDITPIPE_GET_MAXAUDITDATA Query the maximum size of an audit record, which is a useful minimum size for a user space buffer intended to hold audit records read from the audit pipe. .El .Ss Audit Pipe Preselection Mode Ioctls By default, the audit pipe facility configures pipes to present records matched by the system-wide audit trail, configured by .Xr auditd 8 . However, the preselection mechanism for audit pipes can be configured using alternative criteria, including pipe-local flags and naflags settings, as well as auid-specific selection masks. This allows applications to track events not captured in the global audit trail, as well as limit records presented to those of specific interest to the application. .Pp The following ioctls configure the preselection mode on an audit pipe: .Bl -tag -width ".Dv AUDITPIPE_GET_PRESELECT_MODE" .It Dv AUDITPIPE_GET_PRESELECT_MODE Return the current preselect mode on the audit pipe. The ioctl argument should be of type .Vt int . .It Dv AUDITPIPE_SET_PRESELECT_MODE Set the current preselection mode on the audit pipe. The ioctl argument should be of type .Vt int . .El .Pp Possible preselection mode values are: .Bl -tag -width ".Dv AUDITPIPE_PRESELECT_MODE_TRAIL" .It Dv AUDITPIPE_PRESELECT_MODE_TRAIL Use the global audit trail preselection parameters to select records for the audit pipe. .It Dv AUDITPIPE_PRESELECT_MODE_LOCAL Use local audit pipe preselection; this model is similar to the global audit trail configuration model, consisting of global flags and naflags parameters, as well as a set of per-auid masks. 
These parameters are configured using further ioctls. .El .Pp After changing the audit pipe preselection mode, records selected under earlier preselection configuration may still be in the audit pipe queue. The application may flush the current record queue after changing the configuration to remove possibly undesired records. .Ss Audit Pipe Local Preselection Mode Ioctls The following ioctls configure the preselection parameters used when an audit pipe is configured for the .Dv AUDITPIPE_PRESELECT_MODE_LOCAL preselection mode. .Bl -tag -width ".Dv AUDITPIPE_GET_PRESELECT_NAFLAGS" .It Dv AUDITPIPE_GET_PRESELECT_FLAGS Retrieve the current default preselection flags for attributable events on the pipe. These flags correspond to the .Va flags field in .Xr audit_control 5 . The ioctl argument should be of type .Vt au_mask_t . .It Dv AUDITPIPE_SET_PRESELECT_FLAGS Set the current default preselection flags for attributable events on the pipe. These flags correspond to the .Va flags field in .Xr audit_control 5 . The ioctl argument should be of type .Vt au_mask_t . .It Dv AUDITPIPE_GET_PRESELECT_NAFLAGS Retrieve the current default preselection flags for non-attributable events on the pipe. These flags correspond to the .Va naflags field in .Xr audit_control 5 . The ioctl argument should be of type .Vt au_mask_t . .It Dv AUDITPIPE_SET_PRESELECT_NAFLAGS Set the current default preselection flags for non-attributable events on the pipe. These flags correspond to the .Va naflags field in .Xr audit_control 5 . The ioctl argument should be of type .Vt au_mask_t . .It Dv AUDITPIPE_GET_PRESELECT_AUID Query the current preselection masks for a specific auid on the pipe. The ioctl argument should be of type .Vt "struct auditpipe_ioctl_preselect" . The auid to query is specified via the .Va ap_auid field of type .Vt au_id_t ; the mask will be returned via .Va ap_mask of type .Vt au_mask_t . .It Dv AUDITPIPE_SET_PRESELECT_AUID Set the current preselection masks for a specific auid on the pipe. Arguments are identical to .Dv AUDITPIPE_GET_PRESELECT_AUID , except that the caller should properly initialize the .Va ap_mask field to hold the desired preselection mask. .It Dv AUDITPIPE_DELETE_PRESELECT_AUID Delete the current preselection mask for a specific auid on the pipe. Once called, events associated with the specified auid will use the default flags mask. The ioctl argument should be of type .Vt au_id_t . .It Dv AUDITPIPE_FLUSH_PRESELECT_AUID Delete all auid specific preselection specifications. .El .Sh EXAMPLES The .Xr praudit 1 utility may be directly executed on .Pa /dev/auditpipe to review the default audit trail. .Sh SEE ALSO .Xr poll 2 , .Xr select 2 , .Xr audit 4 , +.Xr dtaudit 4 , .Xr audit_control 5 , .Xr audit 8 , .Xr auditd 8 .Sh HISTORY The OpenBSM implementation was created by McAfee Research, the security division of McAfee Inc., under contract to Apple Computer Inc.\& in 2004. It was subsequently adopted by the TrustedBSD Project as the foundation for the OpenBSM distribution. .Pp Support for kernel audit first appeared in .Fx 6.2 . .Sh AUTHORS The audit pipe facility was designed and implemented by .An Robert Watson Aq Mt rwatson@FreeBSD.org . .Pp The Basic Security Module (BSM) interface to audit records and audit event stream format were defined by Sun Microsystems. .Sh BUGS See the .Xr audit 4 manual page for information on audit-related bugs and limitations. .Pp The configurable preselection mechanism mirrors the selection model present for the global audit trail. 
It might be desirable to provide a more flexible selection model. .Pp The per-pipe audit event queue is fifo, with drops occurring if either the user thread provides in sufficient for the record on the queue head, or on enqueue if there is insufficient room. It might be desirable to support partial reads of records, which would be more compatible with buffered I/O as implemented in system libraries, and to allow applications to select which records are dropped, possibly in the style of preselection. Index: user/ngie/bug-237403/share/man/man4/dtrace_audit.4 =================================================================== --- user/ngie/bug-237403/share/man/man4/dtrace_audit.4 (nonexistent) +++ user/ngie/bug-237403/share/man/man4/dtrace_audit.4 (revision 346926) @@ -0,0 +1,178 @@ +.\"- +.\" SPDX-License-Identifier: BSD-2-Clause +.\" +.\" Copyright (c) 2019 Robert N. M. Watson +.\" +.\" This software was developed by BAE Systems, the University of Cambridge +.\" Computer Laboratory, and Memorial University under DARPA/AFRL contract +.\" FA8650-15-C-7558 ("CADETS"), as part of the DARPA Transparent Computing +.\" (TC) research program. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 28, 2019 +.Dt DTRACE_AUDIT 4 +.Os +.Sh NAME +.Nm dtrace_audit +.Nd A DTrace provider for tracing +.Xr audit 4 +events +.Sh SYNOPSIS +.Pp +.Fn audit:event:aue_*:commit "char *eventname" "struct audit_record *ar" +.Fn audit:event:aue_*:bsm "char *eventname" "struct audit_record *ar" "const void *" "size_t" +.Pp +To compile this module into the kernel, place the following in your kernel +configuration file: +.Pp +.Bd -literal -offset indent +.Cd "options DTAUDIT" +.Ed +.Pp +Alternatively, to load the module at boot time, place the following line in +.Xr loader.conf 5 : +.Bd -literal -offset indent +dtaudit_load="YES" +.Ed +.Sh DESCRIPTION +The DTrace +.Nm dtaudit +provider allows users to trace events in the kernel security auditing +subsystem, +.Xr audit 4 . +.Xr audit 4 +provides detailed logging of a configurable set of security-relevant system +calls, including key arguments (such as file paths) and return values that are +copied race-free as the system call proceeds. 
+The +.Nm dtaudit +provider allows DTrace scripts to selectively enable in-kernel audit-record +capture for system calls, and then access those records in either the +in-kernel format or BSM format (\c +.Xr audit.log 5 ) +when the system call completes. +While the in-kernel audit record data structure is subject to change as the +kernel changes over time, it is a much more friendly interface for use in D +scripts than either those available via the DTrace system-call provider or the +BSM trail itself. +.Ss Configuration +The +.Nm dtaudit +provider relies on +.Xr audit 4 +being compiled into the kernel. +.Nm dtaudit +probes become available only once there is an event-to-name mapping installed +in the kernel, normally done by +.Xr auditd 8 +during the boot process, if audit is enabled in +.Xr rc.conf 5 : +.Bd -literal -offset indent +auditd_enable="YES" +.Ed +.Pp +If +.Nm dtaudit +probes are required earlier in boot -- for example, in single-user mode -- or +without enabling +.Xr audit 4 , +they can be preloaded in the boot loader by adding this line to +.Xr loader.conf 5 . +.Bd -literal -offset indent +audit_event_load="YES" +.Ed +.Ss Probes +The +.Fn audit:event:aue_*:commit +probes fire synchronously during system-call return, giving access to two +arguments: a +.Vt char * +audit event name, and +the +.Vt struct audit_record * +in-kernel audit record. +Because the probe fires in system-call return, the user thread has not yet +regained control, and additional information from the thread and process +remains available for capture by the script. +.Pp +The +.Fn audit:event:aue_*:bsm +probes fire asynchonously from system-call return, following BSM conversion +and just prior to being written to disk, giving access to four arguments: a +.Vt char * +audit event name, the +.Vt struct audit_record * +in-kernel audit record, a +.Vt const void * +pointer to the converted BSM record, and a +.Vt size_t +for the length of the BSM record. +.Sh IMPLEMENTATION NOTES +When a set of +.Nm dtaudit +probes are registered, corresponding in-kernel audit records will be captured +and their probes will fire regardless of whether the +.Xr audit 4 +subsystem itself would have captured the record for the purposes of writing it +to the audit trail, or for delivery to a +.Xr auditpipe 4 . +In-kernel audit records allocated only because of enabled +.Xr dtaudit 4 +probes will not be unnecessarily written to the audit trail or enabled pipes. +.Sh SEE ALSO +.Xr dtrace 1 , +.Xr audit 4 , +.Xr audit.log 5 , +.Xr loader.conf 5 , +.Xr rc.conf 5 , +.Xr auditd 8 +.Sh HISTORY +The +.Nm dtaudit +provider first appeared in +.Fx 12.0 . +.Sh AUTHORS +This software and this manual page were developed by BAE Systems, the +University of Cambridge Computer Laboratory, and Memorial University under +DARPA/AFRL contract +.Pq FA8650-15-C-7558 +.Pq Do CADETS Dc , +as part of the DARPA Transparent Computing (TC) research program. +The +.Nm dtaudit +provider and this manual page were written by +.An Robert Watson Aq Mt rwatson@FreeBSD.org . +.Sh BUGS +Because +.Xr audit 4 +maintains its primary event-to-name mapping database in userspace, that +database must be loaded into the kernel before +.Nm dtaudit +probes become available. +.Pp +.Nm dtaudit +is only able to provide access to system-call audit events, not the full +scope of userspace events, such as those relating to login, password change, +and so on. 
Property changes on: user/ngie/bug-237403/share/man/man4/dtrace_audit.4 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/share/man/man4/gre.4 =================================================================== --- user/ngie/bug-237403/share/man/man4/gre.4 (revision 346925) +++ user/ngie/bug-237403/share/man/man4/gre.4 (revision 346926) @@ -1,194 +1,232 @@ .\" $NetBSD: gre.4,v 1.28 2002/06/10 02:49:35 itojun Exp $ .\" .\" Copyright 1998 (c) The NetBSD Foundation, Inc. .\" All rights reserved. .\" .\" This code is derived from software contributed to The NetBSD Foundation .\" by Heiko W.Rupp .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS .\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED .\" TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR .\" PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS .\" BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR .\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF .\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS .\" INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN .\" CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE .\" POSSIBILITY OF SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd June 2, 2015 +.Dd April 24, 2019 .Dt GRE 4 .Os .Sh NAME .Nm gre .Nd encapsulating network device .Sh SYNOPSIS To compile the driver into the kernel, place the following line in the kernel configuration file: .Bd -ragged -offset indent .Cd "device gre" .Ed .Pp Alternatively, to load the driver as a module at boot time, place the following line in .Xr loader.conf 5 : .Bd -literal -offset indent if_gre_load="YES" .Ed .Sh DESCRIPTION The .Nm network interface pseudo device encapsulates datagrams into IP. These encapsulated datagrams are routed to a destination host, where they are decapsulated and further routed to their final destination. The .Dq tunnel appears to the inner datagrams as one hop. .Pp .Nm interfaces are dynamically created and destroyed with the .Xr ifconfig 8 .Cm create and .Cm destroy subcommands. .Pp This driver corresponds to RFC 2784. Encapsulated datagrams are prepended an outer datagram and a GRE header. The GRE header specifies the type of the encapsulated datagram and thus allows for tunneling other protocols than IP. GRE mode is also the default tunnel mode on Cisco routers. .Nm also supports Cisco WCCP protocol, both version 1 and version 2. 
.Pp The .Nm interfaces support a number of additional parameters to the .Xr ifconfig 8 : .Bl -tag -width "enable_csum" .It Ar grekey Set the GRE key used for outgoing packets. A value of 0 disables the key option. .It Ar enable_csum Enables checksum calculation for outgoing packets. .It Ar enable_seq Enables use of sequence number field in the GRE header for outgoing packets. +.It Ar udpencap +Enables UDP-in-GRE encapsulation (see the +.Sx GRE-IN-UDP ENCAPSULATION +Section below for details). +.It Ar udpport +Set the source UDP port for outgoing packets. +A value of 0 disables the persistence of source UDP port for outgoing packets. +See the +.Sx GRE-IN-UDP ENCAPSULATION +Section below for details. .El +.Sh GRE-IN-UDP ENCAPSULATION +The +.Nm +supports GRE in UDP encapsulation as defined in RFC 8086. +A GRE in UDP tunnel offers the possibility of better performance for +load-balancing GRE traffic in transit networks. +Encapsulating GRE in UDP enables use of the UDP source port to provide +entropy to ECMP hashing. +.Pp +The GRE in UDP tunnel uses single value 4754 as UDP destination port. +The UDP source port contains a 14-bit entropy value that is generated +by the encapsulator to identify a flow for the encapsulated packet. +The +.Ar udpport +option can be used to disable this behaviour and use single source UDP +port value. +The value of +.Ar udpport +should be within the ephemeral port range, i.e., 49152 to 65535 by default. +.Pp +Note that a GRE in UDP tunnel is unidirectional; the tunnel traffic is not +expected to be returned back to the UDP source port values used to generate +entropy. +This may impact NAPT (Network Address Port Translator) middleboxes. +If such tunnels are expected to be used on a path with a middlebox, +the tunnel can be configured either to disable use of the UDP source port +for entropy or to enable middleboxes to pass packets with UDP source port +entropy. .Sh EXAMPLES .Bd -literal 192.168.1.* --- Router A -------tunnel-------- Router B --- 192.168.2.* \\ / \\ / +------ the Internet ------+ .Ed .Pp Assuming router A has the (external) IP address A and the internal address 192.168.1.1, while router B has external address B and internal address 192.168.2.1, the following commands will configure the tunnel: .Pp On router A: .Bd -literal -offset indent ifconfig greN create ifconfig greN inet 192.168.1.1 192.168.2.1 ifconfig greN inet tunnel A B route add -net 192.168.2 -netmask 255.255.255.0 192.168.2.1 .Ed .Pp On router B: .Bd -literal -offset indent ifconfig greN create ifconfig greN inet 192.168.2.1 192.168.1.1 ifconfig greN inet tunnel B A route add -net 192.168.1 -netmask 255.255.255.0 192.168.1.1 .Ed .Pp In case when internal and external IP addresses are the same, different routing tables (FIB) should be used. The default FIB will be applied to IP packets before GRE encapsulation. After encapsulation GRE interface should set different FIB number to outgoing packet. Then different FIB will be applied to such encapsulated packets. According to this FIB packet should be routed to tunnel endpoint. 
.Bd -literal Host X -- Host A (198.51.100.1) ---tunnel--- Cisco D (203.0.113.1) -- Host E \\ / \\ / +----- Host B ----- Host C -----+ (198.51.100.254) .Ed .Pp On Host A (FreeBSD): .Pp First of multiple FIBs should be configured via loader.conf: .Bd -literal -offset indent net.fibs=2 net.add_addr_allfibs=0 .Ed .Pp Then routes to the gateway and remote tunnel endpoint via this gateway should be added to the second FIB: .Bd -literal -offset indent route add -net 198.51.100.0 -netmask 255.255.255.0 -fib 1 -iface em0 route add -host 203.0.113.1 -fib 1 198.51.100.254 .Ed .Pp And GRE tunnel should be configured to change FIB for encapsulated packets: .Bd -literal -offset indent ifconfig greN create ifconfig greN inet 198.51.100.1 203.0.113.1 ifconfig greN inet tunnel 198.51.100.1 203.0.113.1 tunnelfib 1 .Ed .Sh NOTES The MTU of .Nm interfaces is set to 1476 by default, to match the value used by Cisco routers. This may not be an optimal value, depending on the link between the two tunnel endpoints. It can be adjusted via .Xr ifconfig 8 . .Pp For correct operation, the .Nm device needs a route to the decapsulating host that does not run over the tunnel, as this would be a loop. .Pp The kernel must be set to forward datagrams by setting the .Va net.inet.ip.forwarding .Xr sysctl 8 variable to non-zero. .Sh SEE ALSO .Xr gif 4 , .Xr inet 4 , .Xr ip 4 , .Xr me 4 , .Xr netintro 4 , .Xr protocols 5 , .Xr ifconfig 8 , .Xr sysctl 8 .Pp A description of GRE encapsulation can be found in RFC 2784 and RFC 2890. .Sh AUTHORS .An Andrey V. Elsukov Aq Mt ae@FreeBSD.org .An Heiko W.Rupp Aq Mt hwr@pilhuhn.de .Sh BUGS The current implementation uses the key only for outgoing packets. Incoming packets with a different key or without a key will be treated as if they would belong to this interface. .Pp The sequence number field also used only for outgoing packets. Index: user/ngie/bug-237403/share/man/man4/iflib.4 =================================================================== --- user/ngie/bug-237403/share/man/man4/iflib.4 (revision 346925) +++ user/ngie/bug-237403/share/man/man4/iflib.4 (revision 346926) @@ -1,204 +1,214 @@ .\" $FreeBSD$ .Dd September 27, 2018 .Dt IFLIB 4 .Os .Sh NAME .Nm iflib .Nd Network Interface Driver Framework .Sh SYNOPSIS .Cd "device pci" .Cd "device iflib" .Sh DESCRIPTION .Nm is a framework for network interface drivers for .Fx . It is designed to remove a large amount of the boilerplate that is often needed for modern network interface devices, allowing driver authors to focus on the specific code needed for their hardware. This allows for a shared set of .Xr sysctl 8 names, rather than each driver naming them individually. .Sh SYSCTL VARIABLES These variables must be set before loading the driver, either via .Xr loader.conf 5 or through the use of .Xr kenv 1 . They are all prefixed by .Va dev.X.Y.iflib\&. where X is the driver name, and Y is the instance number. .Bl -tag -width indent .It Va override_nrxds Override the number of RX descriptors for each queue. The value is a comma separated list of positive integers. Some drivers only use a single value, but others may use more. These numbers must be powers of two, and zero means to use the default. Individual drivers may have additional restrictions on allowable values. Defaults to all zeros. .It Va override_ntxds Override the number of TX descriptors for each queue. The value is a comma separated list of positive integers. Some drivers only use a single value, but others may use more. 
These numbers must be powers of two, and zero means to use the default. Individual drivers may have additional restrictions on allowable values. Defaults to all zeros. .It Va override_qs_enable When set, allows the number of transmit and receive queues to be different. If not set, the lower of the number of TX or RX queues will be used for both. .It Va override_nrxqs Set the number of RX queues. If zero, the number of RX queues is derived from the number of cores on the socket connected to the controller. Defaults to 0. .It Va override_ntxqs Set the number of TX queues. If zero, the number of TX queues is derived from the number of cores on the socket connected to the controller. .It Va disable_msix Disables MSI-X interrupts for the device. +.It Va core_offset +Specifies a starting core offset to assign queues to. +If the value is unspecified or 65535, cores are assigned sequentially across +controllers. +.It Va separate_txrx +Requests that RX and TX queues not be paired on the same core. +If this is zero or not set, an RX and TX queue pair will be assigned to each +core. +When set to a non-zero value, TX queues are assigned to cores following the +last RX queue. .El .Pp These .Xr sysctl 8 variables can be changed at any time: .Bl -tag -width indent .It Va tx_abdicate Controls how the transmit ring is serviced. If set to zero, when a frame is submitted to the transmission ring, the same task that is submitting it will service the ring unless there's already a task servicing the TX ring. This ensures that whenever there is a pending transmission, the transmit ring is being serviced. This results in higher transmit throughput. If set to a non-zero value, task returns immediately and the transmit ring is serviced by a different task. This returns control to the caller faster and under high receive load, may result in fewer dropped RX frames. .It Va rx_budget Sets the maximum number of frames to be received at a time. Zero (the default) indicates the default (currently 16) should be used. .El .Pp There are also some global sysctls which can change behaviour for all drivers, and may be changed at any time. .Bl -tag -width indent .It Va net.iflib.min_tx_latency If this is set to a non-zero value, iflib will avoid any attempt to combine multiple transmits, and notify the hardware as quickly as possible of new descriptors. This will lower the maximum throughput, but will also lower transmit latency. .It Va net.iflib.no_tx_batch Some NICs allow processing completed transmit descriptors in batches. Doing so usually increases the transmit throughput by reducing the number of transmit interrupts. Setting this to a non-zero value will disable the use of this feature. .El .Pp These .Xr sysctl 8 variables are read-only: .Bl -tag -width indent .It Va driver_version A string indicating the internal version of the driver. .El .Pp There are a number of queue state .Xr sysctl 8 variables as well: .Bl -tag -width indent .It Va txqZ The following are repeated for each transmit queue, where Z is the transmit queue instance number: .Bl -tag -width indent .It Va r_abdications Number of consumer abdications in the MP ring for this queue. An abdication occurs on every ring submission when tx_abdicate is true. .It Va r_restarts Number of consumer restarts in the MP ring for this queue. A restart occurs when an attempt to drain a non-empty ring fails, and the ring is already in the STALLED state. .It Va r_stalls Number of consumer stalls in the MP ring for this queue. 
A stall occurs when an attempt to drain a non-empty ring fails. .It Va r_starts Number of normal consumer starts in the MP ring for this queue. A start occurs when the MP ring transitions from IDLE to BUSY. .It Va r_drops Number of drops in the MP ring for this queue. A drop occurs when there is an attempt to add an entry to an MP ring with no available space. .It Va r_enqueues Number of entries which have been enqueued to the MP ring for this queue. .It Va ring_state MP (soft) ring state. This privides a snapshot of the current MP ring state, including the producer head and tail indexes, the consumer index, and the state. The state is one of "IDLE", "BUSY", "STALLED", or "ABDICATED". .It Va txq_cleaned The number of transmit descriptors which have been reclaimed. Total cleaned. .It Va txq_processed The number of transmit descriptors which have been processed, but may not yet have been reclaimed. .It Va txq_in_use Descriptors which have been added to the transmit queue, but have not yet been cleaned. This value will include both untransmitted descriptors as well as descriptors which have been processed. .It Va txq_cidx_processed The transmit queue consumer index of the next descriptor to process. .It Va txq_cidx The transmit queue consumer index of the oldest descriptor to reclaim. .It Va txq_pidx The transmit queue producer index where the next descriptor to transmit will be inserted. .It Va no_tx_dma_setup Number of times DMA mapping a transmit mbuf failed for reasons other than .Er EFBIG . .It Va txd_encap_efbig Number of times DMA mapping a transmit mbuf failed due to requiring too many segments. .It Va tx_map_failed Number of times DMA mapping a transmit mbuf failed for any reason (sum of no_tx_dma_setup and txd_encap_efbig) .It Va no_desc_avail Number of times a descriptor couldn't be added to the transmit ring because the transmit ring was full. .It Va mbuf_defrag_failed Number of times both .Xr m_collapse 9 and .Xr m_defrag 9 failed after an .Er EFBIG error result from DMA mapping a transmit mbuf. .It Va m_pullups Number of times .Xr m_pullup 9 was called attempting to parse a header. .It Va mbuf_defrag Number of times .Xr m_defrag 9 was called. .El .It Va rxqZ The following are repeated for each receive queue, where Z is the receive queue instance number: .Bl -tag -width indent .It Va rxq_fl0.credits Credits currently available in the receive ring. .It Va rxq_fl0.cidx Current receive ring consumer index. .It Va rxq_fl0.pidx Current receive ring producer index. .El .El .Pp Additional OIDs useful for driver and iflib development are exposed when the INVARIANTS and/or WITNESS options are enabled in the kernel. .Sh SEE ALSO .Xr iflib 9 .Sh HISTORY This framework was introduced in .Fx 11.0 . Index: user/ngie/bug-237403/stand/common/disk.c =================================================================== --- user/ngie/bug-237403/stand/common/disk.c (revision 346925) +++ user/ngie/bug-237403/stand/common/disk.c (revision 346926) @@ -1,438 +1,450 @@ /*- * Copyright (c) 1998 Michael Smith * Copyright (c) 2012 Andrey V. Elsukov * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include "disk.h" #ifdef DISK_DEBUG # define DPRINTF(fmt, args...) printf("%s: " fmt "\n" , __func__ , ## args) #else # define DPRINTF(fmt, args...) #endif struct open_disk { struct ptable *table; uint64_t mediasize; uint64_t entrysize; u_int sectorsize; }; struct print_args { struct disk_devdesc *dev; const char *prefix; int verbose; }; /* Convert size to a human-readable number. */ static char * display_size(uint64_t size, u_int sectorsize) { static char buf[80]; char unit; size = size * sectorsize / 1024; unit = 'K'; if (size >= 10485760000LL) { size /= 1073741824; unit = 'T'; } else if (size >= 10240000) { size /= 1048576; unit = 'G'; } else if (size >= 10000) { size /= 1024; unit = 'M'; } sprintf(buf, "%4ld%cB", (long)size, unit); return (buf); } int ptblread(void *d, void *buf, size_t blocks, uint64_t offset) { struct disk_devdesc *dev; struct open_disk *od; dev = (struct disk_devdesc *)d; od = (struct open_disk *)dev->dd.d_opendata; /* * The strategy function assumes the offset is in units of 512 byte * sectors. For larger sector sizes, we need to adjust the offset to * match the actual sector size. */ offset *= (od->sectorsize / 512); /* * As the GPT backup partition is located at the end of the disk, * to avoid reading past disk end, flag bcache not to use RA. */ return (dev->dd.d_dev->dv_strategy(dev, F_READ | F_NORA, offset, blocks * od->sectorsize, (char *)buf, NULL)); } static int ptable_print(void *arg, const char *pname, const struct ptable_entry *part) { struct disk_devdesc dev; struct print_args *pa, bsd; struct open_disk *od; struct ptable *table; char line[80]; int res; u_int sectsize; uint64_t partsize; pa = (struct print_args *)arg; od = (struct open_disk *)pa->dev->dd.d_opendata; sectsize = od->sectorsize; partsize = part->end - part->start + 1; sprintf(line, " %s%s: %s\t%s\n", pa->prefix, pname, parttype2str(part->type), pa->verbose ? 
display_size(partsize, sectsize) : ""); if (pager_output(line)) return 1; res = 0; if (part->type == PART_FREEBSD) { /* Open slice with BSD label */ dev.dd.d_dev = pa->dev->dd.d_dev; dev.dd.d_unit = pa->dev->dd.d_unit; dev.d_slice = part->index; dev.d_partition = D_PARTNONE; if (disk_open(&dev, partsize, sectsize) == 0) { table = ptable_open(&dev, partsize, sectsize, ptblread); if (table != NULL) { sprintf(line, " %s%s", pa->prefix, pname); bsd.dev = pa->dev; bsd.prefix = line; bsd.verbose = pa->verbose; res = ptable_iterate(table, &bsd, ptable_print); ptable_close(table); } disk_close(&dev); } } return (res); } int disk_print(struct disk_devdesc *dev, char *prefix, int verbose) { struct open_disk *od; struct print_args pa; /* Disk should be opened */ od = (struct open_disk *)dev->dd.d_opendata; pa.dev = dev; pa.prefix = prefix; pa.verbose = verbose; return (ptable_iterate(od->table, &pa, ptable_print)); } int disk_read(struct disk_devdesc *dev, void *buf, uint64_t offset, u_int blocks) { struct open_disk *od; int ret; od = (struct open_disk *)dev->dd.d_opendata; ret = dev->dd.d_dev->dv_strategy(dev, F_READ, dev->d_offset + offset, blocks * od->sectorsize, buf, NULL); return (ret); } int disk_write(struct disk_devdesc *dev, void *buf, uint64_t offset, u_int blocks) { struct open_disk *od; int ret; od = (struct open_disk *)dev->dd.d_opendata; ret = dev->dd.d_dev->dv_strategy(dev, F_WRITE, dev->d_offset + offset, blocks * od->sectorsize, buf, NULL); return (ret); } int disk_ioctl(struct disk_devdesc *dev, u_long cmd, void *data) { struct open_disk *od = dev->dd.d_opendata; if (od == NULL) return (ENOTTY); switch (cmd) { case DIOCGSECTORSIZE: *(u_int *)data = od->sectorsize; break; case DIOCGMEDIASIZE: if (dev->d_offset == 0) *(uint64_t *)data = od->mediasize; else *(uint64_t *)data = od->entrysize * od->sectorsize; break; default: return (ENOTTY); } return (0); } int disk_open(struct disk_devdesc *dev, uint64_t mediasize, u_int sectorsize) { struct disk_devdesc partdev; struct open_disk *od; struct ptable *table; struct ptable_entry part; int rc, slice, partition; rc = 0; od = (struct open_disk *)malloc(sizeof(struct open_disk)); if (od == NULL) { DPRINTF("no memory"); return (ENOMEM); } dev->dd.d_opendata = od; od->entrysize = 0; od->mediasize = mediasize; od->sectorsize = sectorsize; /* * While we are reading disk metadata, make sure we do it relative * to the start of the disk */ memcpy(&partdev, dev, sizeof(partdev)); partdev.d_offset = 0; partdev.d_slice = D_SLICENONE; partdev.d_partition = D_PARTNONE; dev->d_offset = 0; table = NULL; slice = dev->d_slice; partition = dev->d_partition; DPRINTF("%s unit %d, slice %d, partition %d => %p", disk_fmtdev(dev), dev->dd.d_unit, dev->d_slice, dev->d_partition, od); /* Determine disk layout. 
*/ od->table = ptable_open(&partdev, mediasize / sectorsize, sectorsize, ptblread); if (od->table == NULL) { DPRINTF("Can't read partition table"); rc = ENXIO; goto out; } if (ptable_getsize(od->table, &mediasize) != 0) { rc = ENXIO; goto out; } od->mediasize = mediasize; if (ptable_gettype(od->table) == PTABLE_BSD && partition >= 0) { /* It doesn't matter what value has d_slice */ rc = ptable_getpart(od->table, &part, partition); if (rc == 0) { dev->d_offset = part.start; od->entrysize = part.end - part.start + 1; } } else if (ptable_gettype(od->table) == PTABLE_ISO9660) { dev->d_offset = 0; od->entrysize = mediasize; } else if (slice >= 0) { /* Try to get information about partition */ if (slice == 0) rc = ptable_getbestpart(od->table, &part); else rc = ptable_getpart(od->table, &part, slice); if (rc != 0) /* Partition doesn't exist */ goto out; dev->d_offset = part.start; od->entrysize = part.end - part.start + 1; slice = part.index; if (ptable_gettype(od->table) == PTABLE_GPT) { partition = 255; goto out; /* Nothing more to do */ } else if (partition == 255) { /* * When we try to open GPT partition, but partition * table isn't GPT, reset d_partition value to -1 * and try to autodetect appropriate value. */ partition = -1; } /* * If d_partition < 0 and we are looking at a BSD slice, * then try to read BSD label, otherwise return the * whole MBR slice. */ if (partition == -1 && part.type != PART_FREEBSD) goto out; /* Try to read BSD label */ table = ptable_open(dev, part.end - part.start + 1, od->sectorsize, ptblread); if (table == NULL) { DPRINTF("Can't read BSD label"); rc = ENXIO; goto out; } /* * If slice contains BSD label and d_partition < 0, then * assume the 'a' partition. Otherwise just return the * whole MBR slice, because it can contain ZFS. */ if (partition < 0) { if (ptable_gettype(table) != PTABLE_BSD) goto out; partition = 0; } rc = ptable_getpart(table, &part, partition); if (rc != 0) goto out; dev->d_offset += part.start; od->entrysize = part.end - part.start + 1; } out: if (table != NULL) ptable_close(table); if (rc != 0) { if (od->table != NULL) ptable_close(od->table); free(od); DPRINTF("%s could not open", disk_fmtdev(dev)); } else { /* Save the slice and partition number to the dev */ dev->d_slice = slice; dev->d_partition = partition; DPRINTF("%s offset %lld => %p", disk_fmtdev(dev), (long long)dev->d_offset, od); } return (rc); } int disk_close(struct disk_devdesc *dev) { struct open_disk *od; od = (struct open_disk *)dev->dd.d_opendata; DPRINTF("%s closed => %p", disk_fmtdev(dev), od); ptable_close(od->table); free(od); return (0); } char* disk_fmtdev(struct disk_devdesc *dev) { static char buf[128]; char *cp; cp = buf + sprintf(buf, "%s%d", dev->dd.d_dev->dv_name, dev->dd.d_unit); if (dev->d_slice > D_SLICENONE) { #ifdef LOADER_GPT_SUPPORT if (dev->d_partition == D_PARTISGPT) { sprintf(cp, "p%d:", dev->d_slice); return (buf); } else #endif #ifdef LOADER_MBR_SUPPORT cp += sprintf(cp, "s%d", dev->d_slice); #endif } if (dev->d_partition > D_PARTNONE) cp += sprintf(cp, "%c", dev->d_partition + 'a'); strcat(cp, ":"); return (buf); } int disk_parsedev(struct disk_devdesc *dev, const char *devspec, const char **path) { int unit, slice, partition; const char *np; char *cp; np = devspec; unit = -1; - slice = D_SLICEWILD; - partition = D_PARTWILD; + /* + * If there is path/file info after the device info, then any missing + * slice or partition info should be considered a request to search for + * an appropriate partition. 
Otherwise we want to open the raw device + itself and not try to fill in missing info by searching. + */ + if ((cp = strchr(np, ':')) != NULL && cp[1] != '\0') { + slice = D_SLICEWILD; + partition = D_PARTWILD; + } else { + slice = D_SLICENONE; + partition = D_PARTNONE; + } + if (*np != '\0' && *np != ':') { unit = strtol(np, &cp, 10); if (cp == np) return (EUNIT); #ifdef LOADER_GPT_SUPPORT if (*cp == 'p') { np = cp + 1; slice = strtol(np, &cp, 10); if (np == cp) return (ESLICE); /* we don't support nested partitions on GPT */ if (*cp != '\0' && *cp != ':') return (EINVAL); partition = 255; } else #endif #ifdef LOADER_MBR_SUPPORT if (*cp == 's') { np = cp + 1; slice = strtol(np, &cp, 10); if (np == cp) return (ESLICE); } #endif if (*cp != '\0' && *cp != ':') { partition = *cp - 'a'; if (partition < 0) return (EPART); cp++; } } else return (EINVAL); if (*cp != '\0' && *cp != ':') return (EINVAL); dev->dd.d_unit = unit; dev->d_slice = slice; dev->d_partition = partition; if (path != NULL) *path = (*cp == '\0') ? cp: cp + 1; return (0); } Index: user/ngie/bug-237403/stand/common/help.common =================================================================== --- user/ngie/bug-237403/stand/common/help.common (revision 346925) +++ user/ngie/bug-237403/stand/common/help.common (revision 346926) @@ -1,407 +1,421 @@ ################################################################################ # Thelp DDisplay command help help [topic [subtopic]] help index The help command displays help on commands and their usage. In command help, a term enclosed with <...> indicates a value as described by the term. A term enclosed with [...] is optional, and may not be required by all forms of the command. Some commands may not be available. Use the '?' command to list most available commands. ################################################################################ # T? DList available commands ? Lists all available commands. ################################################################################ # Tautoboot DBoot after a delay autoboot [<delay> [<prompt>]] Displays <prompt> or a default prompt, and counts down <delay> seconds before attempting to boot. If <delay> is not specified, the default value is 10. ################################################################################ # Tboot DBoot immediately boot [<kernelname>] [-<arg> ...] Boot the system. If arguments are specified, they are added to the arguments for the kernel. If <kernelname> is specified, and a kernel has not already been loaded, it will be booted instead of the default kernel. ################################################################################ # Tbcachestat DGet disk block cache stats bcachestat Displays statistics about disk cache usage. For debugging only. ################################################################################ # Techo DEcho arguments echo [-n] [<message>] Emits <message>, with no trailing newline if -n is specified. This is most useful in conjunction with scripts and the '@' line prefix. Variables are substituted by prefixing them with $, e.g. echo Current device is $currdev will print the current device. ################################################################################ # Tload DLoad a kernel or module load [-t <type>] <filename> Loads the module contained in <filename> into memory. If no other modules are loaded, <filename> must be a kernel or the command will fail. If -t is specified, the module is loaded as raw data of <type>, for later use by the kernel or other modules. <type> may be any string.
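	For illustration only, a typical sequence at the loader prompt might
	look like this (the kernel path is the conventional default location,
	and the type string below is just an example, since any string is
	accepted):

		load /boot/kernel/kernel
		load -t example_type /boot/example_image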
################################################################################ # Tls DList files ls [-l] [] Displays a listing of files in the directory , or the root directory of the current device if is not specified. The -l argument displays file sizes as well; the process of obtaining file sizes on some media may be very slow. ################################################################################ # Tlsdev DList devices lsdev [-v] List all of the devices from which it may be possible to load modules. If -v is specified, print more details. ################################################################################ # Tlsmod DList modules lsmod [-v] List loaded modules. If [-v] is specified, print more details. ################################################################################ +# Tmap-vdisk DMap virtual disk + + map-vdisk filename + + Map file as virtual disk. + +################################################################################ # Tmore DPage files more [ ...] Show contents of text files. When displaying the contents of more, than one file, if the user elects to quit displaying a file, the remaining files will not be shown. ################################################################################ # Tpnpscan DScan for PnP devices pnpscan [-v] Scan for Plug-and-Play devices. This command is normally automatically run as part of the boot process, in order to dynamically load modules required for system operation. If the -v argument is specified, details on the devices found will be printed. ################################################################################ # Tset DSet a variable set set = The set command is used to set variables. ################################################################################ # Tset Sautoboot_delay DSet the default autoboot delay set autoboot_delay= Sets the default delay for the autoboot command to seconds. Set value to -1 if you don't want to allow user to interrupt autoboot process and escape to the loader prompt. ################################################################################ # Tset Sbootfile DSet the default boot file set set bootfile=[;...] Sets the default set of kernel boot filename(s). It may be overridden by setting the bootfile variable to a semicolon-separated list of filenames, each of which will be searched for in the module_path directories. The default bootfile set is "kernel". ################################################################################ # Tset Sboot_askname DPrompt for root device set boot_askname Instructs the kernel to prompt the user for the name of the root device when the kernel is booted. ################################################################################ # Tset Sboot_cdrom DMount root file system from CD-ROM set boot_cdrom Instructs the kernel to try to mount the root file system from CD-ROM. ################################################################################ # Tset Sboot_ddb DDrop to the kernel debugger (DDB) set boot_ddb Instructs the kernel to start in the DDB debugger, rather than proceeding to initialize when booted. ################################################################################ # Tset Sboot_dfltroot DUse default root file system set boot_dfltroot Instructs the kernel to mount the statically compiled-in root file system. 
################################################################################ # Tset Sboot_gdb DSelect gdb-remote mode for the kernel debugger set boot_gdb Selects gdb-remote mode for the kernel debugger by default. ################################################################################ # Tset Sboot_multicons DUse multiple consoles set boot_multicons Enables multiple console support in the kernel early on boot. In a running system, console configuration can be manipulated by the conscontrol(8) utility. ################################################################################ # Tset Sboot_mute DMute the console set boot_mute All console output is suppressed when console is muted. In a running system, the state of console muting can be manipulated by the conscontrol(8) utility. ################################################################################ # Tset Sboot_pause DPause after each line during device probing set boot_pause During the device probe, pause after each line is printed. ################################################################################ # Tset Sboot_serial DUse serial console set boot_serial Force the use of a serial console even when an internal console is present. ################################################################################ # Tset Sboot_single DStart system in single-user mode set boot_single Prevents the kernel from initiating a multi-user startup; instead, a single-user mode will be entered when the kernel has finished device probes. ################################################################################ # Tset Sboot_verbose DVerbose boot messages set boot_verbose Setting this variable causes extra debugging information to be printed by the kernel during the boot phase. ################################################################################ # Tset Sconsole DSet the current console set console[=] Sets the current console. If is omitted, a list of valid consoles will be displayed. ################################################################################ # Tset Scurrdev DSet the current device set currdev= Selects the default device. See lsdev for available devices. ################################################################################ # Tset Sinit_path DSet the list of init candidates set init_path=[:...] Sets the list of binaries which the kernel will try to run as initial process. ################################################################################ # Tset Smodule_path DSet the module search path set module_path=[;...] Sets the list of directories which will be searched in for modules named in a load command or implicitly required by a dependency. The default module_path is "/boot/modules" with the kernel directory prepended. ################################################################################ # Tset Sprompt DSet the command prompt set prompt= The command prompt is displayed when the loader is waiting for input. Variable substitution is performed on the prompt. The default prompt can be set with: set prompt=\${interpret} ################################################################################ # Tset Srootdev DSet the root filesystem set rootdev= By default the value of $currdev is used to set the root filesystem when the kernel is booted. This can be overridden by setting $rootdev explicitly. 
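	As an illustration only (the device name below is hypothetical; use
	lsdev to see the devices actually present on a given system):

		set rootdev=disk0s1a:
		boot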
################################################################################ # Tset Stunables DSet kernel tunable values Various kernel tunable parameters can be overridden by specifying new values in the environment. set kern.ipc.nmbclusters= Set the number of mbuf clusters to be allocated. The value cannot be set below the default determined when the kernel was compiled. set kern.ipc.nsfbufs= NSFBUFS Set the number of sendfile buffers to be allocated. This overrides the value determined when the kernel was compiled. set vm.kmem_size= VM_KMEM_SIZE Sets the size of kernel memory (bytes). This overrides the value determined when the kernel was compiled. set machdep.disable_mtrrs=1 Disable the use of i686 MTRRs (i386 only) set net.inet.tcp.tcbhashsize= TCBHASHSIZE Overrides the compile-time set value of TCBHASHSIZE or the preset default of 512. Must be a power of 2. hw.syscons.sc_no_suspend_vtswitch= Disable VT switching on suspend. value is 0 (default) or non-zero to enable. set hw.physmem= MAXMEM (i386 only) Limits the amount of physical memory space available to the system to bytes. may have a k, M or G suffix to indicate kilobytes, megabytes and gigabytes respectively. Note that the current i386 architecture limits this value to 4GB. On systems where memory cannot be accurately probed, this option provides a hint as to the actual size of system memory (which will be tested before use). set hw.{acpi,pci}.host_start_mem= Sets the lowest address that the pci code will assign when it doesn't have other information about the address to assign (like from a pci bridge). This is only useful in older systems without a pci bridge. Also, it only impacts devices that the BIOS doesn't assign to, typically CardBus bridges. The default is 0x80000000, but some systems need values like 0xf0000000, 0xfc000000 or 0xfe000000 may be suitable for older systems (the older the system, the higher the number typically should be). set hw.pci.enable_io_modes= Enable PCI resources which are left off by some BIOSes or are not enabled correctly by the device driver. value is 1 (default), but this may cause problems with some peripherals. Set to 0 to disable. ################################################################################ # Tshow DShow the values of variables show [] Displays the value of , or all variables if not specified. Multiple paths can be separated with a semicolon. ################################################################################ # Tinclude DRead commands from a script file include [ ...] The entire contents of are read into memory before executing commands, so it is safe to source a file from removable media. ################################################################################ # Tread DRead input from the terminal read [-t ] [-p ] [] The read command reads a line of input from the terminal. If the -t argument is specified, it will return nothing if no input has been received after seconds. (Any keypress will cancel the timeout). If -p is specified, is printed before reading input. No newline is emitted after the prompt. If a variable name is supplied, the variable is set to the value read, less any terminating newline. ################################################################################ # Tunload DRemove all modules from memory unload This command removes any kernel and all loaded modules from memory. 
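	For example, to discard everything currently loaded and boot a
	different kernel instead (the path shown is illustrative):

		unload
		load /boot/kernel.old/kernel
		boot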
+ +################################################################################ +# Tunmap-vdisk DUnmap virtual disk + + unmap-vdisk diskname + + Delete virtual disk mapping. ################################################################################ # Tunset DUnset a variable unset If allowed, the named variable's value is discarded and the variable is removed. ################################################################################ Index: user/ngie/bug-237403/stand/common/vdisk.c =================================================================== --- user/ngie/bug-237403/stand/common/vdisk.c (nonexistent) +++ user/ngie/bug-237403/stand/common/vdisk.c (revision 346926) @@ -0,0 +1,417 @@ +/*- + * Copyright 2019 Toomas Soome + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include +__FBSDID("$FreeBSD$"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int vdisk_init(void); +static int vdisk_strategy(void *, int, daddr_t, size_t, char *, size_t *); +static int vdisk_open(struct open_file *, ...); +static int vdisk_close(struct open_file *); +static int vdisk_ioctl(struct open_file *, u_long, void *); +static int vdisk_print(int); + +struct devsw vdisk_dev = { + .dv_name = "vdisk", + .dv_type = DEVT_DISK, + .dv_init = vdisk_init, + .dv_strategy = vdisk_strategy, + .dv_open = vdisk_open, + .dv_close = vdisk_close, + .dv_ioctl = vdisk_ioctl, + .dv_print = vdisk_print, + .dv_cleanup = NULL +}; + +typedef STAILQ_HEAD(vdisk_info_list, vdisk_info) vdisk_info_list_t; + +typedef struct vdisk_info +{ + STAILQ_ENTRY(vdisk_info) vdisk_link; /* link in device list */ + char *vdisk_path; + int vdisk_unit; + int vdisk_fd; + uint64_t vdisk_size; /* size in bytes */ + uint32_t vdisk_sectorsz; + uint32_t vdisk_open; /* reference counter */ +} vdisk_info_t; + +static vdisk_info_list_t vdisk_list; /* list of mapped vdisks. 
*/ + +static vdisk_info_t * +vdisk_get_info(struct devdesc *dev) +{ + vdisk_info_t *vd; + + STAILQ_FOREACH(vd, &vdisk_list, vdisk_link) { + if (vd->vdisk_unit == dev->d_unit) + return (vd); + } + return (vd); +} + +COMMAND_SET(map_vdisk, "map-vdisk", "map file as virtual disk", command_mapvd); + +static int +command_mapvd(int argc, char *argv[]) +{ + vdisk_info_t *vd, *p; + struct stat sb; + + if (argc != 2) { + printf("usage: %s filename\n", argv[0]); + return (CMD_ERROR); + } + + STAILQ_FOREACH(vd, &vdisk_list, vdisk_link) { + if (strcmp(vd->vdisk_path, argv[1]) == 0) { + printf("%s: file %s is already mapped as %s%d\n", + argv[0], argv[1], vdisk_dev.dv_name, + vd->vdisk_unit); + return (CMD_ERROR); + } + } + + if (stat(argv[1], &sb) < 0) { + /* + * ENOSYS is really ENOENT because we did try to walk + * through devsw list to try to open this file. + */ + if (errno == ENOSYS) + errno = ENOENT; + + printf("%s: stat failed: %s\n", argv[0], strerror(errno)); + return (CMD_ERROR); + } + + /* + * Avoid mapping small files. + */ + if (sb.st_size < 1024 * 1024) { + printf("%s: file %s is too small.\n", argv[0], argv[1]); + return (CMD_ERROR); + } + + vd = calloc(1, sizeof (*vd)); + if (vd == NULL) { + printf("%s: out of memory\n", argv[0]); + return (CMD_ERROR); + } + vd->vdisk_path = strdup(argv[1]); + if (vd->vdisk_path == NULL) { + free (vd); + printf("%s: out of memory\n", argv[0]); + return (CMD_ERROR); + } + vd->vdisk_fd = open(vd->vdisk_path, O_RDONLY); + if (vd->vdisk_fd < 0) { + printf("%s: open failed: %s\n", argv[0], strerror(errno)); + free(vd->vdisk_path); + free(vd); + return (CMD_ERROR); + } + + vd->vdisk_size = sb.st_size; + vd->vdisk_sectorsz = DEV_BSIZE; + STAILQ_FOREACH(p, &vdisk_list, vdisk_link) { + vdisk_info_t *n; + if (p->vdisk_unit == vd->vdisk_unit) { + vd->vdisk_unit++; + continue; + } + n = STAILQ_NEXT(p, vdisk_link); + if (p->vdisk_unit < vd->vdisk_unit) { + if (n == NULL) { + /* p is last elem */ + STAILQ_INSERT_TAIL(&vdisk_list, vd, vdisk_link); + break; + } + if (n->vdisk_unit > vd->vdisk_unit) { + /* p < vd < n */ + STAILQ_INSERT_AFTER(&vdisk_list, p, vd, + vdisk_link); + break; + } + /* else n < vd or n == vd */ + vd->vdisk_unit++; + continue; + } + /* p > vd only if p is the first element */ + STAILQ_INSERT_HEAD(&vdisk_list, vd, vdisk_link); + break; + } + + /* if the list was empty or contiguous */ + if (p == NULL) + STAILQ_INSERT_TAIL(&vdisk_list, vd, vdisk_link); + + printf("%s: file %s is mapped as %s%d\n", argv[0], vd->vdisk_path, + vdisk_dev.dv_name, vd->vdisk_unit); + return (CMD_OK); +} + +COMMAND_SET(unmap_vdisk, "unmap-vdisk", "unmap virtual disk", command_unmapvd); + +/* + * unmap-vdisk vdiskX + */ +static int +command_unmapvd(int argc, char *argv[]) +{ + size_t len; + vdisk_info_t *vd; + long unit; + char *end; + + if (argc != 2) { + printf("usage: %s %sN\n", argv[0], vdisk_dev.dv_name); + return (CMD_ERROR); + } + + len = strlen(vdisk_dev.dv_name); + if (strncmp(vdisk_dev.dv_name, argv[1], len) != 0) { + printf("%s: unknown device %s\n", argv[0], argv[1]); + return (CMD_ERROR); + } + errno = 0; + unit = strtol(argv[1] + len, &end, 10); + if (errno != 0 || (*end != '\0' && strcmp(end, ":") != 0)) { + printf("%s: unknown device %s\n", argv[0], argv[1]); + return (CMD_ERROR); + } + + STAILQ_FOREACH(vd, &vdisk_list, vdisk_link) { + if (vd->vdisk_unit == unit) + break; + } + + if (vd == NULL) { + printf("%s: unknown device %s\n", argv[0], argv[1]); + return (CMD_ERROR); + } + + if (vd->vdisk_open != 0) { + printf("%s: %s is in use, unable to unmap.\n", 
+ argv[0], argv[1]); + return (CMD_ERROR); + } + + STAILQ_REMOVE(&vdisk_list, vd, vdisk_info, vdisk_link); + close(vd->vdisk_fd); + free(vd->vdisk_path); + free(vd); + printf("%s (%s) unmapped\n", argv[1], vd->vdisk_path); + + return (CMD_OK); +} + +static int +vdisk_init(void) +{ + STAILQ_INIT(&vdisk_list); + return (0); +} + +static int +vdisk_strategy(void *devdata, int rw, daddr_t blk, size_t size, + char *buf, size_t *rsize) +{ + struct disk_devdesc *dev; + vdisk_info_t *vd; + ssize_t rv; + + dev = devdata; + if (dev == NULL) + return (EINVAL); + vd = vdisk_get_info((struct devdesc *)dev); + if (vd == NULL) + return (EINVAL); + + if (size == 0 || (size % 512) != 0) + return (EIO); + + if (dev->dd.d_dev->dv_type == DEVT_DISK) { + daddr_t offset; + + offset = dev->d_offset * vd->vdisk_sectorsz; + offset /= 512; + blk += offset; + } + if (lseek(vd->vdisk_fd, blk << 9, SEEK_SET) == -1) + return (EIO); + + errno = 0; + switch (rw & F_MASK) { + case F_READ: + rv = read(vd->vdisk_fd, buf, size); + break; + case F_WRITE: + rv = write(vd->vdisk_fd, buf, size); + break; + default: + return (ENOSYS); + } + + if (errno == 0 && rsize != NULL) { + *rsize = rv; + } + return (errno); +} + +static int +vdisk_open(struct open_file *f, ...) +{ + va_list args; + struct disk_devdesc *dev; + vdisk_info_t *vd; + int rc = 0; + + va_start(args, f); + dev = va_arg(args, struct disk_devdesc *); + va_end(args); + if (dev == NULL) + return (EINVAL); + vd = vdisk_get_info((struct devdesc *)dev); + if (vd == NULL) + return (EINVAL); + + if (dev->dd.d_dev->dv_type == DEVT_DISK) { + rc = disk_open(dev, vd->vdisk_size, vd->vdisk_sectorsz); + } + if (rc == 0) + vd->vdisk_open++; + return (rc); +} + +static int +vdisk_close(struct open_file *f) +{ + struct disk_devdesc *dev; + vdisk_info_t *vd; + + dev = (struct disk_devdesc *)(f->f_devdata); + if (dev == NULL) + return (EINVAL); + vd = vdisk_get_info((struct devdesc *)dev); + if (vd == NULL) + return (EINVAL); + + vd->vdisk_open--; + if (dev->dd.d_dev->dv_type == DEVT_DISK) + return (disk_close(dev)); + return (0); +} + +static int +vdisk_ioctl(struct open_file *f, u_long cmd, void *data) +{ + struct disk_devdesc *dev; + vdisk_info_t *vd; + int rc; + + dev = (struct disk_devdesc *)(f->f_devdata); + if (dev == NULL) + return (EINVAL); + vd = vdisk_get_info((struct devdesc *)dev); + if (vd == NULL) + return (EINVAL); + + if (dev->dd.d_dev->dv_type == DEVT_DISK) { + rc = disk_ioctl(dev, cmd, data); + if (rc != ENOTTY) + return (rc); + } + + switch (cmd) { + case DIOCGSECTORSIZE: + *(u_int *)data = vd->vdisk_sectorsz; + break; + case DIOCGMEDIASIZE: + *(uint64_t *)data = vd->vdisk_size; + break; + default: + return (ENOTTY); + } + return (0); +} + +static int +vdisk_print(int verbose) +{ + int ret = 0; + vdisk_info_t *vd; + char line[80]; + + if (STAILQ_EMPTY(&vdisk_list)) + return (ret); + + printf("%s devices:", vdisk_dev.dv_name); + if ((ret = pager_output("\n")) != 0) + return (ret); + + STAILQ_FOREACH(vd, &vdisk_list, vdisk_link) { + struct disk_devdesc vd_dev; + + if (verbose) { + printf(" %s", vd->vdisk_path); + if ((ret = pager_output("\n")) != 0) + break; + } + snprintf(line, sizeof(line), + " %s%d", vdisk_dev.dv_name, vd->vdisk_unit); + printf("%s: %" PRIu64 " X %u blocks", line, + vd->vdisk_size / vd->vdisk_sectorsz, + vd->vdisk_sectorsz); + if ((ret = pager_output("\n")) != 0) + break; + + vd_dev.dd.d_dev = &vdisk_dev; + vd_dev.dd.d_unit = vd->vdisk_unit; + vd_dev.d_slice = -1; + vd_dev.d_partition = -1; + + ret = disk_open(&vd_dev, vd->vdisk_size, 
vd->vdisk_sectorsz); + if (ret == 0) { + ret = disk_print(&vd_dev, line, verbose); + disk_close(&vd_dev); + if (ret != 0) + break; + } else { + ret = 0; + } + } + + return (ret); +} Property changes on: user/ngie/bug-237403/stand/common/vdisk.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/stand/efi/include/efilib.h =================================================================== --- user/ngie/bug-237403/stand/efi/include/efilib.h (revision 346925) +++ user/ngie/bug-237403/stand/efi/include/efilib.h (revision 346926) @@ -1,144 +1,145 @@ /*- * Copyright (c) 2000 Doug Rabson * Copyright (c) 2006 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #ifndef _LOADER_EFILIB_H #define _LOADER_EFILIB_H #include #include #include extern EFI_HANDLE IH; extern EFI_SYSTEM_TABLE *ST; extern EFI_BOOT_SERVICES *BS; extern EFI_RUNTIME_SERVICES *RS; extern struct devsw efipart_fddev; extern struct devsw efipart_cddev; extern struct devsw efipart_hddev; extern struct devsw efinet_dev; extern struct netif_driver efinetif; /* EFI block device data, included here to help efi_zfs_probe() */ typedef STAILQ_HEAD(pdinfo_list, pdinfo) pdinfo_list_t; typedef struct pdinfo { STAILQ_ENTRY(pdinfo) pd_link; /* link in device list */ pdinfo_list_t pd_part; /* list of partitions */ EFI_HANDLE pd_handle; EFI_HANDLE pd_alias; EFI_DEVICE_PATH *pd_devpath; EFI_BLOCK_IO *pd_blkio; uint32_t pd_unit; /* unit number */ uint32_t pd_open; /* reference counter */ void *pd_bcache; /* buffer cache data */ struct pdinfo *pd_parent; /* Linked items (eg partitions) */ struct devsw *pd_devsw; /* Back pointer to devsw */ } pdinfo_t; pdinfo_list_t *efiblk_get_pdinfo_list(struct devsw *dev); pdinfo_t *efiblk_get_pdinfo(struct devdesc *dev); pdinfo_t *efiblk_get_pdinfo_by_handle(EFI_HANDLE h); pdinfo_t *efiblk_get_pdinfo_by_device_path(EFI_DEVICE_PATH *path); void *efi_get_table(EFI_GUID *tbl); int efi_getdev(void **vdev, const char *devspec, const char **path); char *efi_fmtdev(void *vdev); int efi_setcurrdev(struct env_var *ev, int flags, const void *value); int efi_register_handles(struct devsw *, EFI_HANDLE *, EFI_HANDLE *, int); EFI_HANDLE efi_find_handle(struct devsw *, int); int efi_handle_lookup(EFI_HANDLE, struct devsw **, int *, uint64_t *); int efi_handle_update_dev(EFI_HANDLE, struct devsw *, int, uint64_t); EFI_DEVICE_PATH *efi_lookup_image_devpath(EFI_HANDLE); EFI_DEVICE_PATH *efi_lookup_devpath(EFI_HANDLE); EFI_HANDLE efi_devpath_handle(EFI_DEVICE_PATH *); EFI_DEVICE_PATH *efi_devpath_last_node(EFI_DEVICE_PATH *); EFI_DEVICE_PATH *efi_devpath_trim(EFI_DEVICE_PATH *); bool efi_devpath_match(EFI_DEVICE_PATH *, EFI_DEVICE_PATH *); bool efi_devpath_match_node(EFI_DEVICE_PATH *, EFI_DEVICE_PATH *); bool efi_devpath_is_prefix(EFI_DEVICE_PATH *, EFI_DEVICE_PATH *); CHAR16 *efi_devpath_name(EFI_DEVICE_PATH *); void efi_free_devpath_name(CHAR16 *); EFI_DEVICE_PATH *efi_devpath_to_media_path(EFI_DEVICE_PATH *); UINTN efi_devpath_length(EFI_DEVICE_PATH *); EFI_DEVICE_PATH *efi_name_to_devpath(const char *path); EFI_DEVICE_PATH *efi_name_to_devpath16(CHAR16 *path); void efi_devpath_free(EFI_DEVICE_PATH *dp); int efi_status_to_errno(EFI_STATUS); EFI_STATUS errno_to_efi_status(int errno); void efi_time_init(void); void efi_time_fini(void); EFI_STATUS efi_main(EFI_HANDLE Ximage, EFI_SYSTEM_TABLE* Xsystab); EFI_STATUS main(int argc, CHAR16 *argv[]); void efi_exit(EFI_STATUS status) __dead2; void delay(int usecs); /* EFI environment initialization. */ void efi_init_environment(void); /* EFI Memory type strings. */ const char *efi_memory_type(EFI_MEMORY_TYPE); /* CHAR16 utility functions. */ int wcscmp(CHAR16 *, CHAR16 *); void cpy8to16(const char *, CHAR16 *, size_t); void cpy16to8(const CHAR16 *, char *, size_t); /* * Routines for interacting with EFI's env vars in a more unix-like * way than the standard APIs. In addition, convenience routines for * the loader setting / getting FreeBSD specific variables. 
*/ EFI_STATUS efi_delenv(EFI_GUID *guid, const char *varname); +EFI_STATUS efi_freebsd_delenv(const char *varname); EFI_STATUS efi_freebsd_getenv(const char *v, void *data, __size_t *len); EFI_STATUS efi_getenv(EFI_GUID *g, const char *v, void *data, __size_t *len); EFI_STATUS efi_global_getenv(const char *v, void *data, __size_t *len); EFI_STATUS efi_setenv(EFI_GUID *guid, const char *varname, UINT32 attr, void *data, __size_t len); EFI_STATUS efi_setenv_freebsd_wcs(const char *varname, CHAR16 *valstr); /* guids and names */ bool efi_guid_to_str(const EFI_GUID *, char **); bool efi_str_to_guid(const char *, EFI_GUID *); bool efi_name_to_guid(const char *, EFI_GUID *); bool efi_guid_to_name(EFI_GUID *, char **); /* efipart.c */ int efipart_inithandles(void); #endif /* _LOADER_EFILIB_H */ Index: user/ngie/bug-237403/stand/efi/libefi/efienv.c =================================================================== --- user/ngie/bug-237403/stand/efi/libefi/efienv.c (revision 346925) +++ user/ngie/bug-237403/stand/efi/libefi/efienv.c (revision 346926) @@ -1,123 +1,129 @@ /*- * Copyright (c) 2018 Netflix, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include static EFI_GUID FreeBSDBootVarGUID = FREEBSD_BOOT_VAR_GUID; static EFI_GUID GlobalBootVarGUID = EFI_GLOBAL_VARIABLE; EFI_STATUS efi_getenv(EFI_GUID *g, const char *v, void *data, size_t *len) { size_t ul; CHAR16 *uv; UINT32 attr; UINTN dl; EFI_STATUS rv; uv = NULL; if (utf8_to_ucs2(v, &uv, &ul) != 0) return (EFI_OUT_OF_RESOURCES); dl = *len; rv = RS->GetVariable(uv, g, &attr, &dl, data); if (rv == EFI_SUCCESS || rv == EFI_BUFFER_TOO_SMALL) *len = dl; free(uv); return (rv); } EFI_STATUS efi_global_getenv(const char *v, void *data, size_t *len) { return (efi_getenv(&GlobalBootVarGUID, v, data, len)); } EFI_STATUS efi_freebsd_getenv(const char *v, void *data, size_t *len) { return (efi_getenv(&FreeBSDBootVarGUID, v, data, len)); } /* * efi_setenv -- Sets an env variable. 
*/ EFI_STATUS efi_setenv(EFI_GUID *guid, const char *varname, UINT32 attr, void *data, __size_t len) { EFI_STATUS rv; CHAR16 *uv; size_t ul; uv = NULL; if (utf8_to_ucs2(varname, &uv, &ul) != 0) return (EFI_OUT_OF_RESOURCES); rv = RS->SetVariable(uv, guid, attr, len, data); free(uv); return (rv); } EFI_STATUS efi_setenv_freebsd_wcs(const char *varname, CHAR16 *valstr) { CHAR16 *var = NULL; size_t len; EFI_STATUS rv; if (utf8_to_ucs2(varname, &var, &len) != 0) return (EFI_OUT_OF_RESOURCES); rv = RS->SetVariable(var, &FreeBSDBootVarGUID, EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_RUNTIME_ACCESS, (ucs2len(valstr) + 1) * sizeof(efi_char), valstr); free(var); return (rv); } /* * efi_delenv -- deletes the specified env variable */ EFI_STATUS efi_delenv(EFI_GUID *guid, const char *name) { CHAR16 *var; size_t len; EFI_STATUS rv; var = NULL; if (utf8_to_ucs2(name, &var, &len) != 0) return (EFI_OUT_OF_RESOURCES); rv = RS->SetVariable(var, guid, 0, 0, NULL); free(var); - return rv; + return (rv); +} + +EFI_STATUS +efi_freebsd_delenv(const char *name) +{ + return (efi_delenv(&FreeBSDBootVarGUID, name)); } Index: user/ngie/bug-237403/stand/efi/loader/autoload.c =================================================================== --- user/ngie/bug-237403/stand/efi/loader/autoload.c (revision 346925) +++ user/ngie/bug-237403/stand/efi/loader/autoload.c (revision 346926) @@ -1,56 +1,57 @@ /*- * Copyright (c) 2010 Rui Paulo * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #if defined(LOADER_FDT_SUPPORT) #include #include #endif #include "loader_efi.h" int efi_autoload(void) { #if defined(LOADER_FDT_SUPPORT) /* * Setup the FDT early so that we're not loading files during bi_load. * Any such loading is inherently broken since bi_load uses the space * just after all currently loaded files for the data that will be * passed to the kernel and newly loaded files will be positioned in * that same space. * * We're glossing over errors here because LOADER_FDT_SUPPORT does not * imply that we're on a platform where FDT is a requirement. If we * fix this, then the error handling here should be fixed accordingly. 
*/ - fdt_setup_fdtp(); + if (fdt_is_setup() == 0) + fdt_setup_fdtp(); #endif return (0); } Index: user/ngie/bug-237403/stand/efi/loader/conf.c =================================================================== --- user/ngie/bug-237403/stand/efi/loader/conf.c (revision 346925) +++ user/ngie/bug-237403/stand/efi/loader/conf.c (revision 346926) @@ -1,81 +1,84 @@ /*- * Copyright (c) 2006 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include +extern struct devsw vdisk_dev; + struct devsw *devsw[] = { &efipart_fddev, &efipart_cddev, &efipart_hddev, &efinet_dev, + &vdisk_dev, #ifdef EFI_ZFS_BOOT &zfs_dev, #endif NULL }; struct fs_ops *file_system[] = { #ifdef EFI_ZFS_BOOT &zfs_fsops, #endif &dosfs_fsops, &ufs_fsops, &cd9660_fsops, &tftp_fsops, &nfs_fsops, &gzipfs_fsops, &bzipfs_fsops, NULL }; struct netif_driver *netif_drivers[] = { &efinetif, NULL }; extern struct console efi_console; #if defined(__amd64__) || defined(__i386__) extern struct console comconsole; extern struct console nullconsole; extern struct console spinconsole; #endif struct console *consoles[] = { &efi_console, #if defined(__amd64__) || defined(__i386__) &comconsole, &nullconsole, &spinconsole, #endif NULL }; Index: user/ngie/bug-237403/stand/efi/loader/main.c =================================================================== --- user/ngie/bug-237403/stand/efi/loader/main.c (revision 346925) +++ user/ngie/bug-237403/stand/efi/loader/main.c (revision 346926) @@ -1,1420 +1,1555 @@ /*- * Copyright (c) 2008-2010 Rui Paulo * Copyright (c) 2006 Marcel Moolenaar * All rights reserved. * - * Copyright (c) 2018 Netflix, Inc. + * Copyright (c) 2016-2019 Netflix, Inc. written by M. Warner Losh * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "efizfs.h" #include "loader_efi.h" struct arch_switch archsw; /* MI/MD interface boundary */ EFI_GUID acpi = ACPI_TABLE_GUID; EFI_GUID acpi20 = ACPI_20_TABLE_GUID; EFI_GUID devid = DEVICE_PATH_PROTOCOL; EFI_GUID imgid = LOADED_IMAGE_PROTOCOL; EFI_GUID mps = MPS_TABLE_GUID; EFI_GUID netid = EFI_SIMPLE_NETWORK_PROTOCOL; EFI_GUID smbios = SMBIOS_TABLE_GUID; EFI_GUID smbios3 = SMBIOS3_TABLE_GUID; EFI_GUID dxe = DXE_SERVICES_TABLE_GUID; EFI_GUID hoblist = HOB_LIST_TABLE_GUID; EFI_GUID lzmadecomp = LZMA_DECOMPRESSION_GUID; EFI_GUID mpcore = ARM_MP_CORE_INFO_TABLE_GUID; EFI_GUID esrt = ESRT_TABLE_GUID; EFI_GUID memtype = MEMORY_TYPE_INFORMATION_TABLE_GUID; EFI_GUID debugimg = DEBUG_IMAGE_INFO_TABLE_GUID; EFI_GUID fdtdtb = FDT_TABLE_GUID; EFI_GUID inputid = SIMPLE_TEXT_INPUT_PROTOCOL; /* * Number of seconds to wait for a keystroke before exiting with failure * in the event no currdev is found. -2 means always break, -1 means * never break, 0 means poll once and then reboot, > 0 means wait for * that many seconds. "fail_timeout" can be set in the environment as * well. */ static int fail_timeout = 5; /* * Current boot variable */ UINT16 boot_current; /* * Image that we booted from. */ EFI_LOADED_IMAGE *boot_img; static bool has_keyboard(void) { EFI_STATUS status; EFI_DEVICE_PATH *path; EFI_HANDLE *hin, *hin_end, *walker; UINTN sz; bool retval = false; /* * Find all the handles that support the SIMPLE_TEXT_INPUT_PROTOCOL and * do the typical dance to get the right sized buffer. */ sz = 0; hin = NULL; status = BS->LocateHandle(ByProtocol, &inputid, 0, &sz, 0); if (status == EFI_BUFFER_TOO_SMALL) { hin = (EFI_HANDLE *)malloc(sz); status = BS->LocateHandle(ByProtocol, &inputid, 0, &sz, hin); if (EFI_ERROR(status)) free(hin); } if (EFI_ERROR(status)) return retval; /* * Look at each of the handles. If it supports the device path protocol, * use it to get the device path for this handle. Then see if that * device path matches either the USB device path for keyboards or the * legacy device path for keyboards. */ hin_end = &hin[sz / sizeof(*hin)]; for (walker = hin; walker < hin_end; walker++) { status = BS->HandleProtocol(*walker, &devid, (VOID **)&path); if (EFI_ERROR(status)) continue; while (!IsDevicePathEnd(path)) { /* * Check for the ACPI keyboard node. All PNP3xx nodes * are keyboards of different flavors. Note: It is * unclear of there's always a keyboard node when * there's a keyboard controller, or if there's only one * when a keyboard is detected at boot. 
*/ if (DevicePathType(path) == ACPI_DEVICE_PATH && (DevicePathSubType(path) == ACPI_DP || DevicePathSubType(path) == ACPI_EXTENDED_DP)) { ACPI_HID_DEVICE_PATH *acpi; acpi = (ACPI_HID_DEVICE_PATH *)(void *)path; if ((EISA_ID_TO_NUM(acpi->HID) & 0xff00) == 0x300 && (acpi->HID & 0xffff) == PNP_EISA_ID_CONST) { retval = true; goto out; } /* * Check for USB keyboard node, if present. Unlike a * PS/2 keyboard, these definitely only appear when * connected to the system. */ } else if (DevicePathType(path) == MESSAGING_DEVICE_PATH && DevicePathSubType(path) == MSG_USB_CLASS_DP) { USB_CLASS_DEVICE_PATH *usb; usb = (USB_CLASS_DEVICE_PATH *)(void *)path; if (usb->DeviceClass == 3 && /* HID */ usb->DeviceSubClass == 1 && /* Boot devices */ usb->DeviceProtocol == 1) { /* Boot keyboards */ retval = true; goto out; } } path = NextDevicePathNode(path); } } out: free(hin); return retval; } static void set_currdev(const char *devname) { env_setenv("currdev", EV_VOLATILE, devname, efi_setcurrdev, env_nounset); env_setenv("loaddev", EV_VOLATILE, devname, env_noset, env_nounset); } static void set_currdev_devdesc(struct devdesc *currdev) { const char *devname; devname = efi_fmtdev(currdev); printf("Setting currdev to %s\n", devname); set_currdev(devname); } static void set_currdev_devsw(struct devsw *dev, int unit) { struct devdesc currdev; currdev.d_dev = dev; currdev.d_unit = unit; set_currdev_devdesc(&currdev); } static void set_currdev_pdinfo(pdinfo_t *dp) { /* * Disks are special: they have partitions. if the parent * pointer is non-null, we're a partition not a full disk * and we need to adjust currdev appropriately. */ if (dp->pd_devsw->dv_type == DEVT_DISK) { struct disk_devdesc currdev; currdev.dd.d_dev = dp->pd_devsw; if (dp->pd_parent == NULL) { currdev.dd.d_unit = dp->pd_unit; currdev.d_slice = D_SLICENONE; currdev.d_partition = D_PARTNONE; } else { currdev.dd.d_unit = dp->pd_parent->pd_unit; currdev.d_slice = dp->pd_unit; currdev.d_partition = D_PARTISGPT; /* XXX Assumes GPT */ } set_currdev_devdesc((struct devdesc *)&currdev); } else { set_currdev_devsw(dp->pd_devsw, dp->pd_unit); } } static bool sanity_check_currdev(void) { struct stat st; return (stat("/boot/defaults/loader.conf", &st) == 0 || stat("/boot/kernel/kernel", &st) == 0); } #ifdef EFI_ZFS_BOOT static bool probe_zfs_currdev(uint64_t guid) { char *devname; struct zfs_devdesc currdev; currdev.dd.d_dev = &zfs_dev; currdev.dd.d_unit = 0; currdev.pool_guid = guid; currdev.root_guid = 0; set_currdev_devdesc((struct devdesc *)&currdev); devname = efi_fmtdev(&currdev); init_zfs_bootenv(devname); return (sanity_check_currdev()); } #endif static bool try_as_currdev(pdinfo_t *hd, pdinfo_t *pp) { uint64_t guid; #ifdef EFI_ZFS_BOOT /* * If there's a zpool on this device, try it as a ZFS * filesystem, which has somewhat different setup than all * other types of fs due to imperfect loader integration. * This all stems from ZFS being both a device (zpool) and * a filesystem, plus the boot env feature. */ if (efizfs_get_guid_by_handle(pp->pd_handle, &guid)) return (probe_zfs_currdev(guid)); #endif /* * All other filesystems just need the pdinfo * initialized in the standard way. */ set_currdev_pdinfo(pp); return (sanity_check_currdev()); } /* * Sometimes we get filenames that are all upper case * and/or have backslashes in them. Filter all this out * if it looks like we need to do so. 
*/ static void fix_dosisms(char *p) { while (*p) { if (isupper(*p)) *p = tolower(*p); else if (*p == '\\') *p = '/'; p++; } } #define SIZE(dp, edp) (size_t)((intptr_t)(void *)edp - (intptr_t)(void *)dp) enum { BOOT_INFO_OK = 0, BAD_CHOICE = 1, NOT_SPECIFIC = 2 }; static int match_boot_info(char *boot_info, size_t bisz) { uint32_t attr; uint16_t fplen; size_t len; char *walker, *ep; EFI_DEVICE_PATH *dp, *edp, *first_dp, *last_dp; pdinfo_t *pp; CHAR16 *descr; char *kernel = NULL; FILEPATH_DEVICE_PATH *fp; struct stat st; CHAR16 *text; /* * FreeBSD encodes it's boot loading path into the boot loader * BootXXXX variable. We look for the last one in the path * and use that to load the kernel. However, if we only fine * one DEVICE_PATH, then there's nothing specific and we should * fall back. * * In an ideal world, we'd look at the image handle we were * passed, match up with the loader we are and then return the * next one in the path. This would be most flexible and cover * many chain booting scenarios where you need to use this * boot loader to get to the next boot loader. However, that * doesn't work. We rarely have the path to the image booted * (just the device) so we can't count on that. So, we do the * enxt best thing, we look through the device path(s) passed * in the BootXXXX varaible. If there's only one, we return * NOT_SPECIFIC. Otherwise, we look at the last one and try to * load that. If we can, we return BOOT_INFO_OK. Otherwise we * return BAD_CHOICE for the caller to sort out. */ if (bisz < sizeof(attr) + sizeof(fplen) + sizeof(CHAR16)) return NOT_SPECIFIC; walker = boot_info; ep = walker + bisz; memcpy(&attr, walker, sizeof(attr)); walker += sizeof(attr); memcpy(&fplen, walker, sizeof(fplen)); walker += sizeof(fplen); descr = (CHAR16 *)(intptr_t)walker; len = ucs2len(descr); walker += (len + 1) * sizeof(CHAR16); last_dp = first_dp = dp = (EFI_DEVICE_PATH *)walker; edp = (EFI_DEVICE_PATH *)(walker + fplen); if ((char *)edp > ep) return NOT_SPECIFIC; while (dp < edp && SIZE(dp, edp) > sizeof(EFI_DEVICE_PATH)) { text = efi_devpath_name(dp); if (text != NULL) { printf(" BootInfo Path: %S\n", text); efi_free_devpath_name(text); } last_dp = dp; dp = (EFI_DEVICE_PATH *)((char *)dp + efi_devpath_length(dp)); } /* * If there's only one item in the list, then nothing was * specified. Or if the last path doesn't have a media * path in it. Those show up as various VenHw() nodes * which are basically opaque to us. Don't count those * as something specifc. */ if (last_dp == first_dp) { printf("Ignoring Boot%04x: Only one DP found\n", boot_current); return NOT_SPECIFIC; } if (efi_devpath_to_media_path(last_dp) == NULL) { printf("Ignoring Boot%04x: No Media Path\n", boot_current); return NOT_SPECIFIC; } /* * OK. At this point we either have a good path or a bad one. * Let's check. */ pp = efiblk_get_pdinfo_by_device_path(last_dp); if (pp == NULL) { printf("Ignoring Boot%04x: Device Path not found\n", boot_current); return BAD_CHOICE; } set_currdev_pdinfo(pp); if (!sanity_check_currdev()) { printf("Ignoring Boot%04x: sanity check failed\n", boot_current); return BAD_CHOICE; } /* * OK. We've found a device that matches, next we need to check the last * component of the path. If it's a file, then we set the default kernel * to that. Otherwise, just use this as the default root. * * Reminder: we're running very early, before we've parsed the defaults * file, so we may need to have a hack override. 
*/ dp = efi_devpath_last_node(last_dp); if (DevicePathType(dp) != MEDIA_DEVICE_PATH || DevicePathSubType(dp) != MEDIA_FILEPATH_DP) { printf("Using Boot%04x for root partition\n", boot_current); return (BOOT_INFO_OK); /* use currdir, default kernel */ } fp = (FILEPATH_DEVICE_PATH *)dp; ucs2_to_utf8(fp->PathName, &kernel); if (kernel == NULL) { printf("Not using Boot%04x: can't decode kernel\n", boot_current); return (BAD_CHOICE); } if (*kernel == '\\' || isupper(*kernel)) fix_dosisms(kernel); if (stat(kernel, &st) != 0) { free(kernel); printf("Not using Boot%04x: can't find %s\n", boot_current, kernel); return (BAD_CHOICE); } setenv("kernel", kernel, 1); free(kernel); text = efi_devpath_name(last_dp); if (text) { printf("Using Boot%04x %S + %s\n", boot_current, text, kernel); efi_free_devpath_name(text); } return (BOOT_INFO_OK); } /* * Look at the passed-in boot_info, if any. If we find it then we need * to see if we can find ourselves in the boot chain. If we can, and * there's another specified thing to boot next, assume that the file * is loaded from / and use that for the root filesystem. If can't * find the specified thing, we must fail the boot. If we're last on * the list, then we fallback to looking for the first available / * candidate (ZFS, if there's a bootable zpool, otherwise a UFS * partition that has either /boot/defaults/loader.conf on it or * /boot/kernel/kernel (the default kernel) that we can use. * * We always fail if we can't find the right thing. However, as * a concession to buggy UEFI implementations, like u-boot, if * we have determined that the host is violating the UEFI boot * manager protocol, we'll signal the rest of the program that * a drop to the OK boot loader prompt is possible. */ static int find_currdev(bool do_bootmgr, bool is_last, char *boot_info, size_t boot_info_sz) { pdinfo_t *dp, *pp; EFI_DEVICE_PATH *devpath, *copy; EFI_HANDLE h; CHAR16 *text; struct devsw *dev; int unit; uint64_t extra; int rv; char *rootdev; /* * First choice: if rootdev is already set, use that, even if * it's wrong. */ rootdev = getenv("rootdev"); if (rootdev != NULL) { - printf("Setting currdev to configured rootdev %s\n", rootdev); + printf(" Setting currdev to configured rootdev %s\n", + rootdev); set_currdev(rootdev); return (0); } /* - * Second choice: If we can find out image boot_info, and there's + * Second choice: If uefi_rootdev is set, translate that UEFI device + * path to the loader's internal name and use that. + */ + do { + rootdev = getenv("uefi_rootdev"); + if (rootdev == NULL) + break; + devpath = efi_name_to_devpath(rootdev); + if (devpath == NULL) + break; + dp = efiblk_get_pdinfo_by_device_path(devpath); + efi_devpath_free(devpath); + if (dp == NULL) + break; + printf(" Setting currdev to UEFI path %s\n", + rootdev); + set_currdev_pdinfo(dp); + return (0); + } while (0); + + /* + * Third choice: If we can find out image boot_info, and there's * a follow-on boot image in that boot_info, use that. In this * case root will be the partition specified in that image and * we'll load the kernel specified by the file path. Should there * not be a filepath, we use the default. This filepath overrides * loader.conf. */ if (do_bootmgr) { rv = match_boot_info(boot_info, boot_info_sz); switch (rv) { case BOOT_INFO_OK: /* We found it */ return (0); case BAD_CHOICE: /* specified file not found -> error */ /* XXX do we want to have an escape hatch for last in boot order? 
*/ return (ENOENT); } /* Nothing specified, try normal match */ } #ifdef EFI_ZFS_BOOT /* * Did efi_zfs_probe() detect the boot pool? If so, use the zpool * it found, if it's sane. ZFS is the only thing that looks for * disks and pools to boot. This may change in the future, however, * if we allow specifying which pool to boot from via UEFI variables * rather than the bootenv stuff that FreeBSD uses today. */ if (pool_guid != 0) { printf("Trying ZFS pool\n"); if (probe_zfs_currdev(pool_guid)) return (0); } #endif /* EFI_ZFS_BOOT */ /* * Try to find the block device by its handle based on the * image we're booting. If we can't find a sane partition, * search all the other partitions of the disk. We do not * search other disks because it's a violation of the UEFI * boot protocol to do so. We fail and let UEFI go on to * the next candidate. */ dp = efiblk_get_pdinfo_by_handle(boot_img->DeviceHandle); if (dp != NULL) { text = efi_devpath_name(dp->pd_devpath); if (text != NULL) { printf("Trying ESP: %S\n", text); efi_free_devpath_name(text); } set_currdev_pdinfo(dp); if (sanity_check_currdev()) return (0); if (dp->pd_parent != NULL) { pdinfo_t *espdp = dp; dp = dp->pd_parent; STAILQ_FOREACH(pp, &dp->pd_part, pd_link) { /* Already tried the ESP */ if (espdp == pp) continue; /* * Roll up the ZFS special case * for those partitions that have * zpools on them. */ text = efi_devpath_name(pp->pd_devpath); if (text != NULL) { printf("Trying: %S\n", text); efi_free_devpath_name(text); } if (try_as_currdev(dp, pp)) return (0); } } } /* * Try the device handle from our loaded image first. If that * fails, use the device path from the loaded image and see if * any of the nodes in that path match one of the enumerated * handles. Currently, this handle list is only for netboot. */ if (efi_handle_lookup(boot_img->DeviceHandle, &dev, &unit, &extra) == 0) { set_currdev_devsw(dev, unit); if (sanity_check_currdev()) return (0); } copy = NULL; devpath = efi_lookup_image_devpath(IH); while (devpath != NULL) { h = efi_devpath_handle(devpath); if (h == NULL) break; free(copy); copy = NULL; if (efi_handle_lookup(h, &dev, &unit, &extra) == 0) { set_currdev_devsw(dev, unit); if (sanity_check_currdev()) return (0); } devpath = efi_lookup_devpath(h); if (devpath != NULL) { copy = efi_devpath_trim(devpath); devpath = copy; } } free(copy); return (ENOENT); } static bool interactive_interrupt(const char *msg) { time_t now, then, last; last = 0; now = then = getsecs(); printf("%s\n", msg); if (fail_timeout == -2) /* Always break to OK */ return (true); if (fail_timeout == -1) /* Never break to OK */ return (false); do { if (last != now) { printf("press any key to interrupt reboot in %d seconds\r", fail_timeout - (int)(now - then)); last = now; } /* XXX no pause or timeout wait for char */ if (ischar()) return (true); now = getsecs(); } while (now - then < fail_timeout); return (false); } static int parse_args(int argc, CHAR16 *argv[]) { int i, j, howto; bool vargood; char var[128]; /* * Parse the args to set the console settings, etc * boot1.efi passes these in, if it can read /boot.config or /boot/config * or iPXE may be setup to pass these in. Or the optional argument in the * boot environment was used to pass these arguments in (in which case * neither /boot.config nor /boot/config are consulted). * * Loop through the args, and for each one that contains an '=' that is * not the first character, add it to the environment. This allows * loader and kernel env vars to be passed on the command line. 
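/*
 * Illustrative sketch (the helper name is made up): how the fail_timeout
 * value consumed by interactive_interrupt() above is interpreted.  Inside
 * the countdown window the user still has to press a key before the
 * loader drops to the OK prompt; otherwise control returns to UEFI.
 */
static bool
fail_timeout_allows_prompt(int timeout, time_t waited)
{

	if (timeout == -2)		/* always break to the OK prompt */
		return (true);
	if (timeout == -1)		/* never break; let UEFI try the next entry */
		return (false);
	return (waited < timeout);	/* still within the countdown window */
}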
Convert * args from UCS-2 to ASCII (16 to 8 bit) as they are copied (though this * method is flawed for non-ASCII characters). */ howto = 0; for (i = 1; i < argc; i++) { cpy16to8(argv[i], var, sizeof(var)); howto |= boot_parse_arg(var); } return (howto); } static void setenv_int(const char *key, int val) { char buf[20]; snprintf(buf, sizeof(buf), "%d", val); setenv(key, buf, 1); } /* * Parse ConOut (the list of consoles active) and see if we can find a * serial port and/or a video port. It would be nice to also walk the * ACPI name space to map the UID for the serial port to a port. The * latter is especially hard. */ static int parse_uefi_con_out(void) { int how, rv; int vid_seen = 0, com_seen = 0, seen = 0; size_t sz; char buf[4096], *ep; EFI_DEVICE_PATH *node; ACPI_HID_DEVICE_PATH *acpi; UART_DEVICE_PATH *uart; bool pci_pending; how = 0; sz = sizeof(buf); rv = efi_global_getenv("ConOut", buf, &sz); if (rv != EFI_SUCCESS) goto out; ep = buf + sz; node = (EFI_DEVICE_PATH *)buf; while ((char *)node < ep) { pci_pending = false; if (DevicePathType(node) == ACPI_DEVICE_PATH && DevicePathSubType(node) == ACPI_DP) { /* Check for Serial node */ acpi = (void *)node; if (EISA_ID_TO_NUM(acpi->HID) == 0x501) { setenv_int("efi_8250_uid", acpi->UID); com_seen = ++seen; } } else if (DevicePathType(node) == MESSAGING_DEVICE_PATH && DevicePathSubType(node) == MSG_UART_DP) { uart = (void *)node; setenv_int("efi_com_speed", uart->BaudRate); } else if (DevicePathType(node) == ACPI_DEVICE_PATH && DevicePathSubType(node) == ACPI_ADR_DP) { /* Check for AcpiAdr() Node for video */ vid_seen = ++seen; } else if (DevicePathType(node) == HARDWARE_DEVICE_PATH && DevicePathSubType(node) == HW_PCI_DP) { /* * Note, vmware fusion has a funky console device * PciRoot(0x0)/Pci(0xf,0x0) * which we can only detect at the end since we also * have to cope with: * PciRoot(0x0)/Pci(0x1f,0x0)/Serial(0x1) * so only match it if it's last. */ pci_pending = true; } node = NextDevicePathNode(node); /* Skip the end node */ } if (pci_pending && vid_seen == 0) vid_seen = ++seen; /* * Truth table for RB_MULTIPLE | RB_SERIAL * Value Result * 0 Use only video console * RB_SERIAL Use only serial console * RB_MULTIPLE Use both video and serial console * (but video is primary so gets rc messages) * both Use both video and serial console * (but serial is primary so gets rc messages) * * Try to honor this as best we can. If only one of serial / video * found, then use that. Otherwise, use the first one we found. * This also implies if we found nothing, default to video. 
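/*
 * Background note for the ConOut walk above: ACPI device-path nodes carry
 * an EISA-encoded _HID and EISA_ID_TO_NUM() strips the vendor letters, so
 * the 0x501 test matches PNP0501, the generic 16550-compatible COM port.
 * That is why such a node is counted as the serial console, while video
 * outputs are recognized by their AcpiAdr() (ACPI_ADR_DP) nodes.
 */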
*/ how = 0; if (vid_seen && com_seen) { how |= RB_MULTIPLE; if (com_seen < vid_seen) how |= RB_SERIAL; } else if (com_seen) how |= RB_SERIAL; out: return (how); } +void +parse_loader_efi_config(EFI_HANDLE h, const char *env_fn) +{ + pdinfo_t *dp; + struct stat st; + int fd = -1; + char *env = NULL; + + dp = efiblk_get_pdinfo_by_handle(h); + if (dp == NULL) + return; + set_currdev_pdinfo(dp); + if (stat(env_fn, &st) != 0) + return; + fd = open(env_fn, O_RDONLY); + if (fd == -1) + return; + env = malloc(st.st_size + 1); + if (env == NULL) + goto out; + if (read(fd, env, st.st_size) != st.st_size) + goto out; + env[st.st_size] = '\0'; + boot_parse_cmdline(env); +out: + free(env); + close(fd); +} + +static void +read_loader_env(const char *name, char *def_fn, bool once) +{ + UINTN len; + char *fn, *freeme = NULL; + + len = 0; + fn = def_fn; + if (efi_freebsd_getenv(name, NULL, &len) == EFI_BUFFER_TOO_SMALL) { + freeme = fn = malloc(len + 1); + if (fn != NULL) { + if (efi_freebsd_getenv(name, fn, &len) != EFI_SUCCESS) { + free(fn); + fn = NULL; + printf( + "Can't fetch FreeBSD::%s we know is there\n", name); + } else { + /* + * if tagged as 'once' delete the env variable so we + * only use it once. + */ + if (once) + efi_freebsd_delenv(name); + /* + * We malloced 1 more than len above, then redid the call. + * so now we have room at the end of the string to NUL terminate + * it here, even if the typical idium would have '- 1' here to + * not overflow. len should be the same on return both times. + */ + fn[len] = '\0'; + } + } else { + printf( + "Can't allocate %d bytes to fetch FreeBSD::%s env var\n", + len, name); + } + } + if (fn) { + printf(" Reading loader env vars from %s\n", fn); + parse_loader_efi_config(boot_img->DeviceHandle, fn); + } +} + + + EFI_STATUS main(int argc, CHAR16 *argv[]) { EFI_GUID *guid; int howto, i, uhowto; UINTN k; bool has_kbd, is_last; char *s; EFI_DEVICE_PATH *imgpath; CHAR16 *text; EFI_STATUS rv; size_t sz, bosz = 0, bisz = 0; UINT16 boot_order[100]; char boot_info[4096]; char buf[32]; bool uefi_boot_mgr; archsw.arch_autoload = efi_autoload; archsw.arch_getdev = efi_getdev; archsw.arch_copyin = efi_copyin; archsw.arch_copyout = efi_copyout; archsw.arch_readin = efi_readin; archsw.arch_zfs_probe = efi_zfs_probe; /* Get our loaded image protocol interface structure. */ BS->HandleProtocol(IH, &imgid, (VOID**)&boot_img); /* * Chicken-and-egg problem; we want to have console output early, but * some console attributes may depend on reading from eg. the boot * device, which we can't do yet. We can use printf() etc. once this is * done. So, we set it to the efi console, then call console init. This * gets us printf early, but also primes the pump for all future console * changes to take effect, regardless of where they come from. */ setenv("console", "efi", 1); cons_probe(); /* Init the time source */ efi_time_init(); - has_kbd = has_keyboard(); - /* * Initialise the block cache. Set the upper limit. */ bcache_init(32768, 512); + /* + * Scan the BLOCK IO MEDIA handles then + * march through the device switch probing for things. + */ + i = efipart_inithandles(); + if (i != 0 && i != ENOENT) { + printf("efipart_inithandles failed with ERRNO %d, expect " + "failures\n", i); + } + + for (i = 0; devsw[i] != NULL; i++) + if (devsw[i]->dv_init != NULL) + (devsw[i]->dv_init)(); + + /* + * Detect console settings two different ways: one via the command + * args (eg -h) or via the UEFI ConOut variable. 
+ */ + has_kbd = has_keyboard(); howto = parse_args(argc, argv); if (!has_kbd && (howto & RB_PROBE)) howto |= RB_SERIAL | RB_MULTIPLE; howto &= ~RB_PROBE; uhowto = parse_uefi_con_out(); /* + * Scan the BLOCK IO MEDIA handles then + * march through the device switch probing for things. + */ + i = efipart_inithandles(); + if (i != 0 && i != ENOENT) { + printf("efipart_inithandles failed with ERRNO %d, expect " + "failures\n", i); + } + + for (i = 0; devsw[i] != NULL; i++) + if (devsw[i]->dv_init != NULL) + (devsw[i]->dv_init)(); + + /* + * Read additional environment variables from the boot device's + * "LoaderEnv" file. Any boot loader environment variable may be set + * there, which are subtly different than loader.conf variables. Only + * the 'simple' ones may be set so things like foo_load="YES" won't work + * for two reasons. First, the parser is simplistic and doesn't grok + * quotes. Second, because the variables that cause an action to happen + * are parsed by the lua, 4th or whatever code that's not yet + * loaded. This is relative to the root directory when loader.efi is + * loaded off the UFS root drive (when chain booted), or from the ESP + * when directly loaded by the BIOS. + * + * We also read in NextLoaderEnv if it was specified. This allows next boot + * functionality to be implemented and to override anything in LoaderEnv. + */ + read_loader_env("LoaderEnv", "/efi/freebsd/loader.env", false); + read_loader_env("NextLoaderEnv", NULL, true); + + /* * We now have two notions of console. howto should be viewed as * overrides. If console is already set, don't set it again. */ #define VIDEO_ONLY 0 #define SERIAL_ONLY RB_SERIAL #define VID_SER_BOTH RB_MULTIPLE #define SER_VID_BOTH (RB_SERIAL | RB_MULTIPLE) #define CON_MASK (RB_SERIAL | RB_MULTIPLE) if (strcmp(getenv("console"), "efi") == 0) { if ((howto & CON_MASK) == 0) { /* No override, uhowto is controlling and efi cons is perfect */ howto = howto | (uhowto & CON_MASK); } else if ((howto & CON_MASK) == (uhowto & CON_MASK)) { /* override matches what UEFI told us, efi console is perfect */ } else if ((uhowto & (CON_MASK)) != 0) { /* * We detected a serial console on ConOut. All possible * overrides include serial. We can't really override what efi * gives us, so we use it knowing it's the best choice. */ /* Do nothing */ } else { /* * We detected some kind of serial in the override, but ConOut * has no serial, so we have to sort out which case it really is. */ switch (howto & CON_MASK) { case SERIAL_ONLY: setenv("console", "comconsole", 1); break; case VID_SER_BOTH: setenv("console", "efi comconsole", 1); break; case SER_VID_BOTH: setenv("console", "comconsole efi", 1); break; /* case VIDEO_ONLY can't happen -- it's the first if above */ } } } /* * howto is set now how we want to export the flags to the kernel, so * set the env based on it. */ boot_howto_to_env(howto); if (efi_copy_init()) { printf("failed to allocate staging area\n"); return (EFI_BUFFER_TOO_SMALL); } if ((s = getenv("fail_timeout")) != NULL) fail_timeout = strtol(s, NULL, 10); - /* - * Scan the BLOCK IO MEDIA handles then - * march through the device switch probing for things. 
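/*
 * Illustrative example (file contents are hypothetical): by default
 * read_loader_env() looks for /efi/freebsd/loader.env on the boot device,
 * and the FreeBSD::LoaderEnv / FreeBSD::NextLoaderEnv UEFI variables can
 * point it at a different file.  The file is a flat list of simple
 * name=value assignments for boot_parse_cmdline(), for example:
 *
 *	console=comconsole
 *	fail_timeout=10
 *	uefi_ignore_boot_mgr=YES
 *	rootdev=disk0p2:
 *
 * Quoted values and foo_load="YES" style knobs belong in loader.conf
 * instead, for the reasons given above.
 */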
- */ - i = efipart_inithandles(); - if (i != 0 && i != ENOENT) { - printf("efipart_inithandles failed with ERRNO %d, expect " - "failures\n", i); - } - - for (i = 0; devsw[i] != NULL; i++) - if (devsw[i]->dv_init != NULL) - (devsw[i]->dv_init)(); - printf("%s\n", bootprog_info); printf(" Command line arguments:"); for (i = 0; i < argc; i++) printf(" %S", argv[i]); printf("\n"); printf(" EFI version: %d.%02d\n", ST->Hdr.Revision >> 16, ST->Hdr.Revision & 0xffff); printf(" EFI Firmware: %S (rev %d.%02d)\n", ST->FirmwareVendor, ST->FirmwareRevision >> 16, ST->FirmwareRevision & 0xffff); printf(" Console: %s (%#x)\n", getenv("console"), howto); - - /* Determine the devpath of our image so we can prefer it. */ text = efi_devpath_name(boot_img->FilePath); if (text != NULL) { printf(" Load Path: %S\n", text); efi_setenv_freebsd_wcs("LoaderPath", text); efi_free_devpath_name(text); } rv = BS->HandleProtocol(boot_img->DeviceHandle, &devid, (void **)&imgpath); if (rv == EFI_SUCCESS) { text = efi_devpath_name(imgpath); if (text != NULL) { printf(" Load Device: %S\n", text); efi_setenv_freebsd_wcs("LoaderDev", text); efi_free_devpath_name(text); } } - uefi_boot_mgr = true; - boot_current = 0; - sz = sizeof(boot_current); - rv = efi_global_getenv("BootCurrent", &boot_current, &sz); - if (rv == EFI_SUCCESS) - printf(" BootCurrent: %04x\n", boot_current); - else { - boot_current = 0xffff; + if (getenv("uefi_ignore_boot_mgr") != NULL) { + printf(" Ignoring UEFI boot manager\n"); uefi_boot_mgr = false; - } + } else { + uefi_boot_mgr = true; + boot_current = 0; + sz = sizeof(boot_current); + rv = efi_global_getenv("BootCurrent", &boot_current, &sz); + if (rv == EFI_SUCCESS) + printf(" BootCurrent: %04x\n", boot_current); + else { + boot_current = 0xffff; + uefi_boot_mgr = false; + } - sz = sizeof(boot_order); - rv = efi_global_getenv("BootOrder", &boot_order, &sz); - if (rv == EFI_SUCCESS) { - printf(" BootOrder:"); - for (i = 0; i < sz / sizeof(boot_order[0]); i++) - printf(" %04x%s", boot_order[i], - boot_order[i] == boot_current ? "[*]" : ""); - printf("\n"); - is_last = boot_order[(sz / sizeof(boot_order[0])) - 1] == boot_current; - bosz = sz; - } else if (uefi_boot_mgr) { - /* - * u-boot doesn't set BootOrder, but otherwise participates in the - * boot manager protocol. So we fake it here and don't consider it - * a failure. - */ - bosz = sizeof(boot_order[0]); - boot_order[0] = boot_current; - is_last = true; + sz = sizeof(boot_order); + rv = efi_global_getenv("BootOrder", &boot_order, &sz); + if (rv == EFI_SUCCESS) { + printf(" BootOrder:"); + for (i = 0; i < sz / sizeof(boot_order[0]); i++) + printf(" %04x%s", boot_order[i], + boot_order[i] == boot_current ? "[*]" : ""); + printf("\n"); + is_last = boot_order[(sz / sizeof(boot_order[0])) - 1] == boot_current; + bosz = sz; + } else if (uefi_boot_mgr) { + /* + * u-boot doesn't set BootOrder, but otherwise participates in the + * boot manager protocol. So we fake it here and don't consider it + * a failure. + */ + bosz = sizeof(boot_order[0]); + boot_order[0] = boot_current; + is_last = true; + } } /* * Next, find the boot info structure the UEFI boot manager is * supposed to setup. We need this so we can walk through it to * find where we are in the booting process and what to try to * boot next. */ if (uefi_boot_mgr) { snprintf(buf, sizeof(buf), "Boot%04X", boot_current); sz = sizeof(boot_info); rv = efi_global_getenv(buf, &boot_info, &sz); if (rv == EFI_SUCCESS) bisz = sz; else uefi_boot_mgr = false; } /* * Disable the watchdog timer. 
By default the boot manager sets * the timer to 5 minutes before invoking a boot option. If we * want to return to the boot manager, we have to disable the * watchdog timer and since we're an interactive program, we don't * want to wait until the user types "quit". The timer may have * fired by then. We don't care if this fails. It does not prevent * normal functioning in any way... */ BS->SetWatchdogTimer(0, 0, 0, NULL); /* * Initialize the trusted/forbidden certificates from UEFI. * They will be later used to verify the manifest(s), * which should contain hashes of verified files. * This needs to be initialized before any configuration files * are loaded. */ #ifdef EFI_SECUREBOOT ve_efi_init(); #endif /* * Try and find a good currdev based on the image that was booted. * It might be desirable here to have a short pause to allow falling * through to the boot loader instead of returning instantly to follow * the boot protocol and also allow an escape hatch for users wishing * to try something different. */ if (find_currdev(uefi_boot_mgr, is_last, boot_info, bisz) != 0) - if (!interactive_interrupt("Failed to find bootable partition")) + if (uefi_boot_mgr && + !interactive_interrupt("Failed to find bootable partition")) return (EFI_NOT_FOUND); efi_init_environment(); #if !defined(__arm__) for (k = 0; k < ST->NumberOfTableEntries; k++) { guid = &ST->ConfigurationTable[k].VendorGuid; if (!memcmp(guid, &smbios, sizeof(EFI_GUID))) { char buf[40]; snprintf(buf, sizeof(buf), "%p", ST->ConfigurationTable[k].VendorTable); setenv("hint.smbios.0.mem", buf, 1); smbios_detect(ST->ConfigurationTable[k].VendorTable); break; } } #endif interact(); /* doesn't return */ return (EFI_SUCCESS); /* keep compiler happy */ } COMMAND_SET(poweroff, "poweroff", "power off the system", command_poweroff); static int command_poweroff(int argc __unused, char *argv[] __unused) { int i; for (i = 0; devsw[i] != NULL; ++i) if (devsw[i]->dv_cleanup != NULL) (devsw[i]->dv_cleanup)(); RS->ResetSystem(EfiResetShutdown, EFI_SUCCESS, 0, NULL); /* NOTREACHED */ return (CMD_ERROR); } COMMAND_SET(reboot, "reboot", "reboot the system", command_reboot); static int command_reboot(int argc, char *argv[]) { int i; for (i = 0; devsw[i] != NULL; ++i) if (devsw[i]->dv_cleanup != NULL) (devsw[i]->dv_cleanup)(); RS->ResetSystem(EfiResetCold, EFI_SUCCESS, 0, NULL); /* NOTREACHED */ return (CMD_ERROR); } COMMAND_SET(quit, "quit", "exit the loader", command_quit); static int command_quit(int argc, char *argv[]) { exit(0); return (CMD_OK); } COMMAND_SET(memmap, "memmap", "print memory map", command_memmap); static int command_memmap(int argc __unused, char *argv[] __unused) { UINTN sz; EFI_MEMORY_DESCRIPTOR *map, *p; UINTN key, dsz; UINT32 dver; EFI_STATUS status; int i, ndesc; char line[80]; sz = 0; status = BS->GetMemoryMap(&sz, 0, &key, &dsz, &dver); if (status != EFI_BUFFER_TOO_SMALL) { printf("Can't determine memory map size\n"); return (CMD_ERROR); } map = malloc(sz); status = BS->GetMemoryMap(&sz, map, &key, &dsz, &dver); if (EFI_ERROR(status)) { printf("Can't read memory map\n"); return (CMD_ERROR); } ndesc = sz / dsz; snprintf(line, sizeof(line), "%23s %12s %12s %8s %4s\n", "Type", "Physical", "Virtual", "#Pages", "Attr"); pager_open(); if (pager_output(line)) { pager_close(); return (CMD_OK); } for (i = 0, p = map; i < ndesc; i++, p = NextMemoryDescriptor(p, dsz)) { snprintf(line, sizeof(line), "%23s %012jx %012jx %08jx ", efi_memory_type(p->Type), (uintmax_t)p->PhysicalStart, (uintmax_t)p->VirtualStart, (uintmax_t)p->NumberOfPages); if 
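/*
 * Illustrative sketch of the UEFI "ask for the size first" idiom used by
 * command_memmap() above (and by command_lsefi() below with
 * LocateHandle()): the first call is made with a zero-length buffer so
 * the firmware answers EFI_BUFFER_TOO_SMALL and reports the required
 * size, then the real call is issued.  The helper name is made up.
 */
static EFI_MEMORY_DESCRIPTOR *
fetch_memory_map(UINTN *szp, UINTN *keyp, UINTN *dszp, UINT32 *dverp)
{
	EFI_MEMORY_DESCRIPTOR *map;
	EFI_STATUS status;

	*szp = 0;
	status = BS->GetMemoryMap(szp, NULL, keyp, dszp, dverp);
	if (status != EFI_BUFFER_TOO_SMALL)
		return (NULL);
	map = malloc(*szp);	/* the map may grow; a real caller might retry */
	if (map == NULL)
		return (NULL);
	status = BS->GetMemoryMap(szp, map, keyp, dszp, dverp);
	if (EFI_ERROR(status)) {
		free(map);
		return (NULL);
	}
	return (map);
}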
(pager_output(line)) break; if (p->Attribute & EFI_MEMORY_UC) printf("UC "); if (p->Attribute & EFI_MEMORY_WC) printf("WC "); if (p->Attribute & EFI_MEMORY_WT) printf("WT "); if (p->Attribute & EFI_MEMORY_WB) printf("WB "); if (p->Attribute & EFI_MEMORY_UCE) printf("UCE "); if (p->Attribute & EFI_MEMORY_WP) printf("WP "); if (p->Attribute & EFI_MEMORY_RP) printf("RP "); if (p->Attribute & EFI_MEMORY_XP) printf("XP "); if (p->Attribute & EFI_MEMORY_NV) printf("NV "); if (p->Attribute & EFI_MEMORY_MORE_RELIABLE) printf("MR "); if (p->Attribute & EFI_MEMORY_RO) printf("RO "); if (pager_output("\n")) break; } pager_close(); return (CMD_OK); } COMMAND_SET(configuration, "configuration", "print configuration tables", command_configuration); static int command_configuration(int argc, char *argv[]) { UINTN i; char *name; printf("NumberOfTableEntries=%lu\n", (unsigned long)ST->NumberOfTableEntries); for (i = 0; i < ST->NumberOfTableEntries; i++) { EFI_GUID *guid; printf(" "); guid = &ST->ConfigurationTable[i].VendorGuid; if (efi_guid_to_name(guid, &name) == true) { printf(name); free(name); } else { printf("Error while translating UUID to name"); } printf(" at %p\n", ST->ConfigurationTable[i].VendorTable); } return (CMD_OK); } COMMAND_SET(mode, "mode", "change or display EFI text modes", command_mode); static int command_mode(int argc, char *argv[]) { UINTN cols, rows; unsigned int mode; int i; char *cp; char rowenv[8]; EFI_STATUS status; SIMPLE_TEXT_OUTPUT_INTERFACE *conout; extern void HO(void); conout = ST->ConOut; if (argc > 1) { mode = strtol(argv[1], &cp, 0); if (cp[0] != '\0') { printf("Invalid mode\n"); return (CMD_ERROR); } status = conout->QueryMode(conout, mode, &cols, &rows); if (EFI_ERROR(status)) { printf("invalid mode %d\n", mode); return (CMD_ERROR); } status = conout->SetMode(conout, mode); if (EFI_ERROR(status)) { printf("couldn't set mode %d\n", mode); return (CMD_ERROR); } sprintf(rowenv, "%u", (unsigned)rows); setenv("LINES", rowenv, 1); HO(); /* set cursor */ return (CMD_OK); } printf("Current mode: %d\n", conout->Mode->Mode); for (i = 0; i <= conout->Mode->MaxMode; i++) { status = conout->QueryMode(conout, i, &cols, &rows); if (EFI_ERROR(status)) continue; printf("Mode %d: %u columns, %u rows\n", i, (unsigned)cols, (unsigned)rows); } if (i != 0) printf("Select a mode with the command \"mode \"\n"); return (CMD_OK); } COMMAND_SET(lsefi, "lsefi", "list EFI handles", command_lsefi); static int command_lsefi(int argc __unused, char *argv[] __unused) { char *name; EFI_HANDLE *buffer = NULL; EFI_HANDLE handle; UINTN bufsz = 0, i, j; EFI_STATUS status; int ret = 0; status = BS->LocateHandle(AllHandles, NULL, NULL, &bufsz, buffer); if (status != EFI_BUFFER_TOO_SMALL) { snprintf(command_errbuf, sizeof (command_errbuf), "unexpected error: %lld", (long long)status); return (CMD_ERROR); } if ((buffer = malloc(bufsz)) == NULL) { sprintf(command_errbuf, "out of memory"); return (CMD_ERROR); } status = BS->LocateHandle(AllHandles, NULL, NULL, &bufsz, buffer); if (EFI_ERROR(status)) { free(buffer); snprintf(command_errbuf, sizeof (command_errbuf), "LocateHandle() error: %lld", (long long)status); return (CMD_ERROR); } pager_open(); for (i = 0; i < (bufsz / sizeof (EFI_HANDLE)); i++) { UINTN nproto = 0; EFI_GUID **protocols = NULL; handle = buffer[i]; printf("Handle %p", handle); if (pager_output("\n")) break; /* device path */ status = BS->ProtocolsPerHandle(handle, &protocols, &nproto); if (EFI_ERROR(status)) { snprintf(command_errbuf, sizeof (command_errbuf), "ProtocolsPerHandle() error: 
%lld", (long long)status); continue; } for (j = 0; j < nproto; j++) { if (efi_guid_to_name(protocols[j], &name) == true) { printf(" %s", name); free(name); } else { printf("Error while translating UUID to name"); } if ((ret = pager_output("\n")) != 0) break; } BS->FreePool(protocols); if (ret != 0) break; } pager_close(); free(buffer); return (CMD_OK); } #ifdef LOADER_FDT_SUPPORT extern int command_fdt_internal(int argc, char *argv[]); /* * Since proper fdt command handling function is defined in fdt_loader_cmd.c, * and declaring it as extern is in contradiction with COMMAND_SET() macro * (which uses static pointer), we're defining wrapper function, which * calls the proper fdt handling routine. */ static int command_fdt(int argc, char *argv[]) { return (command_fdt_internal(argc, argv)); } COMMAND_SET(fdt, "fdt", "flattened device tree handling", command_fdt); #endif /* * Chain load another efi loader. */ static int command_chain(int argc, char *argv[]) { EFI_GUID LoadedImageGUID = LOADED_IMAGE_PROTOCOL; EFI_HANDLE loaderhandle; EFI_LOADED_IMAGE *loaded_image; EFI_STATUS status; struct stat st; struct devdesc *dev; char *name, *path; void *buf; int fd; if (argc < 2) { command_errmsg = "wrong number of arguments"; return (CMD_ERROR); } name = argv[1]; if ((fd = open(name, O_RDONLY)) < 0) { command_errmsg = "no such file"; return (CMD_ERROR); } if (fstat(fd, &st) < -1) { command_errmsg = "stat failed"; close(fd); return (CMD_ERROR); } status = BS->AllocatePool(EfiLoaderCode, (UINTN)st.st_size, &buf); if (status != EFI_SUCCESS) { command_errmsg = "failed to allocate buffer"; close(fd); return (CMD_ERROR); } if (read(fd, buf, st.st_size) != st.st_size) { command_errmsg = "error while reading the file"; (void)BS->FreePool(buf); close(fd); return (CMD_ERROR); } close(fd); status = BS->LoadImage(FALSE, IH, NULL, buf, st.st_size, &loaderhandle); (void)BS->FreePool(buf); if (status != EFI_SUCCESS) { command_errmsg = "LoadImage failed"; return (CMD_ERROR); } status = BS->HandleProtocol(loaderhandle, &LoadedImageGUID, (void **)&loaded_image); if (argc > 2) { int i, len = 0; CHAR16 *argp; for (i = 2; i < argc; i++) len += strlen(argv[i]) + 1; len *= sizeof (*argp); loaded_image->LoadOptions = argp = malloc (len); loaded_image->LoadOptionsSize = len; for (i = 2; i < argc; i++) { char *ptr = argv[i]; while (*ptr) *(argp++) = *(ptr++); *(argp++) = ' '; } *(--argv) = 0; } if (efi_getdev((void **)&dev, name, (const char **)&path) == 0) { #ifdef EFI_ZFS_BOOT struct zfs_devdesc *z_dev; #endif struct disk_devdesc *d_dev; pdinfo_t *hd, *pd; switch (dev->d_dev->dv_type) { #ifdef EFI_ZFS_BOOT case DEVT_ZFS: z_dev = (struct zfs_devdesc *)dev; loaded_image->DeviceHandle = efizfs_get_handle_by_guid(z_dev->pool_guid); break; #endif case DEVT_NET: loaded_image->DeviceHandle = efi_find_handle(dev->d_dev, dev->d_unit); break; default: hd = efiblk_get_pdinfo(dev); if (STAILQ_EMPTY(&hd->pd_part)) { loaded_image->DeviceHandle = hd->pd_handle; break; } d_dev = (struct disk_devdesc *)dev; STAILQ_FOREACH(pd, &hd->pd_part, pd_link) { /* * d_partition should be 255 */ if (pd->pd_unit == (uint32_t)d_dev->d_slice) { loaded_image->DeviceHandle = pd->pd_handle; break; } } break; } } dev_cleanup(); status = BS->StartImage(loaderhandle, NULL, NULL); if (status != EFI_SUCCESS) { command_errmsg = "StartImage failed"; free(loaded_image->LoadOptions); loaded_image->LoadOptions = NULL; status = BS->UnloadImage(loaded_image); return (CMD_ERROR); } return (CMD_ERROR); /* not reached */ } COMMAND_SET(chain, "chain", "chain load file", 
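/*
 * Usage note (the paths are examples only): from the loader prompt the
 * chain command loads another EFI binary from the current device and
 * transfers control to it; any extra words are passed through the image's
 * LoadOptions:
 *
 *	OK chain /efi/boot/bootx64.efi
 *	OK chain /efi/other/loader.efi -v
 */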
command_chain); Index: user/ngie/bug-237403/stand/fdt/fdt_loader_cmd.c =================================================================== --- user/ngie/bug-237403/stand/fdt/fdt_loader_cmd.c (revision 346925) +++ user/ngie/bug-237403/stand/fdt/fdt_loader_cmd.c (revision 346926) @@ -1,1851 +1,1861 @@ /*- * Copyright (c) 2009-2010 The FreeBSD Foundation * All rights reserved. * * This software was developed by Semihalf under sponsorship from * the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include "bootstrap.h" #include "fdt_platform.h" #ifdef DEBUG #define debugf(fmt, args...) do { printf("%s(): ", __func__); \ printf(fmt,##args); } while (0) #else #define debugf(fmt, args...) #endif #define FDT_CWD_LEN 256 #define FDT_MAX_DEPTH 12 #define FDT_PROP_SEP " = " #define COPYOUT(s,d,l) archsw.arch_copyout(s, d, l) #define COPYIN(s,d,l) archsw.arch_copyin(s, d, l) #define FDT_STATIC_DTB_SYMBOL "fdt_static_dtb" #define CMD_REQUIRES_BLOB 0x01 /* Location of FDT yet to be loaded. */ /* This may be in read-only memory, so can't be manipulated directly. */ static struct fdt_header *fdt_to_load = NULL; /* Location of FDT on heap. */ /* This is the copy we actually manipulate. 
*/ static struct fdt_header *fdtp = NULL; /* Size of FDT blob */ static size_t fdtp_size = 0; static int fdt_load_dtb(vm_offset_t va); static void fdt_print_overlay_load_error(int err, const char *filename); static int fdt_check_overlay_compatible(void *base_fdt, void *overlay_fdt); static int fdt_cmd_nyi(int argc, char *argv[]); static int fdt_load_dtb_overlays_string(const char * filenames); static int fdt_cmd_addr(int argc, char *argv[]); static int fdt_cmd_mkprop(int argc, char *argv[]); static int fdt_cmd_cd(int argc, char *argv[]); static int fdt_cmd_hdr(int argc, char *argv[]); static int fdt_cmd_ls(int argc, char *argv[]); static int fdt_cmd_prop(int argc, char *argv[]); static int fdt_cmd_pwd(int argc, char *argv[]); static int fdt_cmd_rm(int argc, char *argv[]); static int fdt_cmd_mknode(int argc, char *argv[]); static int fdt_cmd_mres(int argc, char *argv[]); typedef int cmdf_t(int, char *[]); struct cmdtab { const char *name; cmdf_t *handler; int flags; }; static const struct cmdtab commands[] = { { "addr", &fdt_cmd_addr, 0 }, { "alias", &fdt_cmd_nyi, 0 }, { "cd", &fdt_cmd_cd, CMD_REQUIRES_BLOB }, { "header", &fdt_cmd_hdr, CMD_REQUIRES_BLOB }, { "ls", &fdt_cmd_ls, CMD_REQUIRES_BLOB }, { "mknode", &fdt_cmd_mknode, CMD_REQUIRES_BLOB }, { "mkprop", &fdt_cmd_mkprop, CMD_REQUIRES_BLOB }, { "mres", &fdt_cmd_mres, CMD_REQUIRES_BLOB }, { "prop", &fdt_cmd_prop, CMD_REQUIRES_BLOB }, { "pwd", &fdt_cmd_pwd, CMD_REQUIRES_BLOB }, { "rm", &fdt_cmd_rm, CMD_REQUIRES_BLOB }, { NULL, NULL } }; static char cwd[FDT_CWD_LEN] = "/"; static vm_offset_t fdt_find_static_dtb() { Elf_Ehdr *ehdr; Elf_Shdr *shdr; Elf_Sym sym; vm_offset_t strtab, symtab, fdt_start; uint64_t offs; struct preloaded_file *kfp; struct file_metadata *md; char *strp; int i, sym_count; debugf("fdt_find_static_dtb()\n"); sym_count = symtab = strtab = 0; strp = NULL; offs = __elfN(relocation_offset); kfp = file_findfile(NULL, NULL); if (kfp == NULL) return (0); /* Locate the dynamic symbols and strtab. */ md = file_findmetadata(kfp, MODINFOMD_ELFHDR); if (md == NULL) return (0); ehdr = (Elf_Ehdr *)md->md_data; md = file_findmetadata(kfp, MODINFOMD_SHDR); if (md == NULL) return (0); shdr = (Elf_Shdr *)md->md_data; for (i = 0; i < ehdr->e_shnum; ++i) { if (shdr[i].sh_type == SHT_DYNSYM && symtab == 0) { symtab = shdr[i].sh_addr + offs; sym_count = shdr[i].sh_size / sizeof(Elf_Sym); } else if (shdr[i].sh_type == SHT_STRTAB && strtab == 0) { strtab = shdr[i].sh_addr + offs; } } /* * The most efficient way to find a symbol would be to calculate a * hash, find proper bucket and chain, and thus find a symbol. * However, that would involve code duplication (e.g. for hash * function). So we're using simpler and a bit slower way: we're * iterating through symbols, searching for the one which name is * 'equal' to 'fdt_static_dtb'. To speed up the process a little bit, * we are eliminating symbols type of which is not STT_NOTYPE, or(and) * those which binding attribute is not STB_GLOBAL. 
*/ fdt_start = 0; while (sym_count > 0 && fdt_start == 0) { COPYOUT(symtab, &sym, sizeof(sym)); symtab += sizeof(sym); --sym_count; if (ELF_ST_BIND(sym.st_info) != STB_GLOBAL || ELF_ST_TYPE(sym.st_info) != STT_NOTYPE) continue; strp = strdupout(strtab + sym.st_name); if (strcmp(strp, FDT_STATIC_DTB_SYMBOL) == 0) fdt_start = (vm_offset_t)sym.st_value + offs; free(strp); } return (fdt_start); } static int fdt_load_dtb(vm_offset_t va) { struct fdt_header header; int err; debugf("fdt_load_dtb(0x%08jx)\n", (uintmax_t)va); COPYOUT(va, &header, sizeof(header)); err = fdt_check_header(&header); if (err < 0) { if (err == -FDT_ERR_BADVERSION) { snprintf(command_errbuf, sizeof(command_errbuf), "incompatible blob version: %d, should be: %d", fdt_version(fdtp), FDT_LAST_SUPPORTED_VERSION); } else { snprintf(command_errbuf, sizeof(command_errbuf), "error validating blob: %s", fdt_strerror(err)); } return (1); } /* * Release previous blob */ if (fdtp) free(fdtp); fdtp_size = fdt_totalsize(&header); fdtp = malloc(fdtp_size); if (fdtp == NULL) { command_errmsg = "can't allocate memory for device tree copy"; return (1); } COPYOUT(va, fdtp, fdtp_size); debugf("DTB blob found at 0x%jx, size: 0x%jx\n", (uintmax_t)va, (uintmax_t)fdtp_size); return (0); } int fdt_load_dtb_addr(struct fdt_header *header) { int err; debugf("fdt_load_dtb_addr(%p)\n", header); fdtp_size = fdt_totalsize(header); err = fdt_check_header(header); if (err < 0) { snprintf(command_errbuf, sizeof(command_errbuf), "error validating blob: %s", fdt_strerror(err)); return (err); } free(fdtp); if ((fdtp = malloc(fdtp_size)) == NULL) { command_errmsg = "can't allocate memory for device tree copy"; return (1); } bcopy(header, fdtp, fdtp_size); return (0); } int fdt_load_dtb_file(const char * filename) { struct preloaded_file *bfp, *oldbfp; int err; debugf("fdt_load_dtb_file(%s)\n", filename); oldbfp = file_findfile(NULL, "dtb"); /* Attempt to load and validate a new dtb from a file. */ if ((bfp = file_loadraw(filename, "dtb", 1)) == NULL) { snprintf(command_errbuf, sizeof(command_errbuf), "failed to load file '%s'", filename); return (1); } if ((err = fdt_load_dtb(bfp->f_addr)) != 0) { file_discard(bfp); return (err); } /* A new dtb was validated, discard any previous file. */ if (oldbfp) file_discard(oldbfp); return (0); } static int fdt_load_dtb_overlay(const char * filename) { struct preloaded_file *bfp; struct fdt_header header; int err; debugf("fdt_load_dtb_overlay(%s)\n", filename); /* Attempt to load and validate a new dtb from a file. FDT_ERR_NOTFOUND * is normally a libfdt error code, but libfdt would actually return * -FDT_ERR_NOTFOUND. We re-purpose the error code here to convey a * similar meaning: the file itself was not found, which can still be * considered an error dealing with FDT pieces. 
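/*
 * Usage note (the path is an example, and this reflects common loader
 * usage rather than anything added here): fdt_load_dtb_file() picks up
 * blobs preloaded as type "dtb", e.g. from the loader prompt:
 *
 *	OK load -t dtb /boot/dtb/myboard.dtb
 *
 * fdt_setup_fdtp() will then prefer that file over a blob found in memory
 * or compiled into the kernel.
 */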
*/ if ((bfp = file_loadraw(filename, "dtbo", 1)) == NULL) return (FDT_ERR_NOTFOUND); COPYOUT(bfp->f_addr, &header, sizeof(header)); err = fdt_check_header(&header); if (err < 0) { file_discard(bfp); return (err); } return (0); } static void fdt_print_overlay_load_error(int err, const char *filename) { switch (err) { case FDT_ERR_NOTFOUND: printf("%s: failed to load file\n", filename); break; case -FDT_ERR_BADVERSION: printf("%s: incompatible blob version: %d, should be: %d\n", filename, fdt_version(fdtp), FDT_LAST_SUPPORTED_VERSION); break; default: /* libfdt errs are negative */ if (err < 0) printf("%s: error validating blob: %s\n", filename, fdt_strerror(err)); else printf("%s: unknown load error\n", filename); break; } } static int fdt_load_dtb_overlays_string(const char * filenames) { char *names; char *name, *name_ext; char *comaptr; int err, namesz; debugf("fdt_load_dtb_overlays_string(%s)\n", filenames); names = strdup(filenames); if (names == NULL) return (1); name = names; do { comaptr = strchr(name, ','); if (comaptr) *comaptr = '\0'; err = fdt_load_dtb_overlay(name); if (err == FDT_ERR_NOTFOUND) { /* Allocate enough to append ".dtbo" */ namesz = strlen(name) + 6; name_ext = malloc(namesz); if (name_ext == NULL) { fdt_print_overlay_load_error(err, name); name = comaptr + 1; continue; } snprintf(name_ext, namesz, "%s.dtbo", name); err = fdt_load_dtb_overlay(name_ext); free(name_ext); } /* Catch error with either initial load or fallback load */ if (err != 0) fdt_print_overlay_load_error(err, name); name = comaptr + 1; } while(comaptr); free(names); return (0); } /* * fdt_check_overlay_compatible - check that the overlay_fdt is compatible with * base_fdt before we attempt to apply it. It will need to re-calculate offsets * in the base every time, rather than trying to cache them earlier in the * process, because the overlay application process can/will invalidate a lot of * offsets. */ static int fdt_check_overlay_compatible(void *base_fdt, void *overlay_fdt) { const char *compat; int compat_len, ocompat_len; int oroot_offset, root_offset; int slidx, sllen; oroot_offset = fdt_path_offset(overlay_fdt, "/"); if (oroot_offset < 0) return (oroot_offset); /* * If /compatible in the overlay does not exist or if it is empty, then * we're automatically compatible. We do this for the sake of rapid * overlay development for overlays that aren't intended to be deployed. * The user assumes the risk of using an overlay without /compatible. */ if (fdt_get_property(overlay_fdt, oroot_offset, "compatible", &ocompat_len) == NULL || ocompat_len == 0) return (0); root_offset = fdt_path_offset(base_fdt, "/"); if (root_offset < 0) return (root_offset); /* * However, an empty or missing /compatible on the base is an error, * because allowing this offers no advantages. */ if (fdt_get_property(base_fdt, root_offset, "compatible", &compat_len) == NULL) return (compat_len); else if(compat_len == 0) return (1); slidx = 0; compat = fdt_stringlist_get(overlay_fdt, oroot_offset, "compatible", slidx, &sllen); while (compat != NULL) { if (fdt_stringlist_search(base_fdt, root_offset, "compatible", compat) >= 0) return (0); ++slidx; compat = fdt_stringlist_get(overlay_fdt, oroot_offset, "compatible", slidx, &sllen); }; /* We've exhausted the overlay's /compatible property... 
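/*
 * Usage note (overlay names are made up): overlays arrive as a
 * comma-separated list, either from the pre-loader environment or from
 * the fdt_overlays variable, e.g. in loader.conf:
 *
 *	fdt_overlays="myboard-i2c0,myboard-spi0.dtbo"
 *
 * An entry that is not found as given gets ".dtbo" appended and is tried
 * again, as implemented above.
 */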
no match */ return (1); } void fdt_apply_overlays() { struct preloaded_file *fp; size_t max_overlay_size, next_fdtp_size; size_t current_fdtp_size; void *current_fdtp; void *next_fdtp; void *overlay; int rv; if ((fdtp == NULL) || (fdtp_size == 0)) return; max_overlay_size = 0; for (fp = file_findfile(NULL, "dtbo"); fp != NULL; fp = fp->f_next) { if (max_overlay_size < fp->f_size) max_overlay_size = fp->f_size; } /* Nothing to apply */ if (max_overlay_size == 0) return; overlay = malloc(max_overlay_size); if (overlay == NULL) { printf("failed to allocate memory for DTB blob with overlays\n"); return; } current_fdtp = fdtp; current_fdtp_size = fdtp_size; for (fp = file_findfile(NULL, "dtbo"); fp != NULL; fp = fp->f_next) { COPYOUT(fp->f_addr, overlay, fp->f_size); /* Check compatible first to avoid unnecessary allocation */ rv = fdt_check_overlay_compatible(current_fdtp, overlay); if (rv != 0) { printf("DTB overlay '%s' not compatible\n", fp->f_name); continue; } printf("applying DTB overlay '%s'\n", fp->f_name); next_fdtp_size = current_fdtp_size + fp->f_size; next_fdtp = malloc(next_fdtp_size); if (next_fdtp == NULL) { /* * Output warning, then move on to applying other * overlays in case this one is simply too large. */ printf("failed to allocate memory for overlay base\n"); continue; } rv = fdt_open_into(current_fdtp, next_fdtp, next_fdtp_size); if (rv != 0) { free(next_fdtp); printf("failed to open base dtb into overlay base\n"); continue; } /* Both overlay and next_fdtp may be modified in place */ rv = fdt_overlay_apply(next_fdtp, overlay); if (rv == 0) { /* Rotate next -> current */ if (current_fdtp != fdtp) free(current_fdtp); current_fdtp = next_fdtp; current_fdtp_size = next_fdtp_size; } else { /* * Assume here that the base we tried to apply on is * either trashed or in an inconsistent state. Trying to * load it might work, but it's better to discard it and * play it safe. */ free(next_fdtp); printf("failed to apply overlay: %s\n", fdt_strerror(rv)); } } /* We could have failed to apply all overlays; then we do nothing */ if (current_fdtp != fdtp) { free(fdtp); fdtp = current_fdtp; fdtp_size = current_fdtp_size; } free(overlay); } int +fdt_is_setup(void) +{ + + if (fdtp != NULL) + return (1); + + return (0); +} + +int fdt_setup_fdtp() { struct preloaded_file *bfp; vm_offset_t va; debugf("fdt_setup_fdtp()\n"); /* If we already loaded a file, use it. */ if ((bfp = file_findfile(NULL, "dtb")) != NULL) { if (fdt_load_dtb(bfp->f_addr) == 0) { printf("Using DTB from loaded file '%s'.\n", bfp->f_name); fdt_platform_load_overlays(); return (0); } } /* If we were given the address of a valid blob in memory, use it. */ if (fdt_to_load != NULL) { if (fdt_load_dtb_addr(fdt_to_load) == 0) { printf("Using DTB from memory address %p.\n", fdt_to_load); fdt_platform_load_overlays(); return (0); } } if (fdt_platform_load_dtb() == 0) { fdt_platform_load_overlays(); return (0); } /* If there is a dtb compiled into the kernel, use it. 
*/ if ((va = fdt_find_static_dtb()) != 0) { if (fdt_load_dtb(va) == 0) { printf("Using DTB compiled into kernel.\n"); return (0); } } command_errmsg = "No device tree blob found!\n"; return (1); } #define fdt_strtovect(str, cellbuf, lim, cellsize) _fdt_strtovect((str), \ (cellbuf), (lim), (cellsize), 0); /* Force using base 16 */ #define fdt_strtovectx(str, cellbuf, lim, cellsize) _fdt_strtovect((str), \ (cellbuf), (lim), (cellsize), 16); static int _fdt_strtovect(const char *str, void *cellbuf, int lim, unsigned char cellsize, uint8_t base) { const char *buf = str; const char *end = str + strlen(str) - 2; uint32_t *u32buf = NULL; uint8_t *u8buf = NULL; int cnt = 0; if (cellsize == sizeof(uint32_t)) u32buf = (uint32_t *)cellbuf; else u8buf = (uint8_t *)cellbuf; if (lim == 0) return (0); while (buf < end) { /* Skip white whitespace(s)/separators */ while (!isxdigit(*buf) && buf < end) buf++; if (u32buf != NULL) u32buf[cnt] = cpu_to_fdt32((uint32_t)strtol(buf, NULL, base)); else u8buf[cnt] = (uint8_t)strtol(buf, NULL, base); if (cnt + 1 <= lim - 1) cnt++; else break; buf++; /* Find another number */ while ((isxdigit(*buf) || *buf == 'x') && buf < end) buf++; } return (cnt); } void fdt_fixup_ethernet(const char *str, char *ethstr, int len) { uint8_t tmp_addr[6]; /* Convert macaddr string into a vector of uints */ fdt_strtovectx(str, &tmp_addr, 6, sizeof(uint8_t)); /* Set actual property to a value from vect */ fdt_setprop(fdtp, fdt_path_offset(fdtp, ethstr), "local-mac-address", &tmp_addr, 6 * sizeof(uint8_t)); } void fdt_fixup_cpubusfreqs(unsigned long cpufreq, unsigned long busfreq) { int lo, o = 0, o2, maxo = 0, depth; const uint32_t zero = 0; /* We want to modify every subnode of /cpus */ o = fdt_path_offset(fdtp, "/cpus"); if (o < 0) return; /* maxo should contain offset of node next to /cpus */ depth = 0; maxo = o; while (depth != -1) maxo = fdt_next_node(fdtp, maxo, &depth); /* Find CPU frequency properties */ o = fdt_node_offset_by_prop_value(fdtp, o, "clock-frequency", &zero, sizeof(uint32_t)); o2 = fdt_node_offset_by_prop_value(fdtp, o, "bus-frequency", &zero, sizeof(uint32_t)); lo = MIN(o, o2); while (o != -FDT_ERR_NOTFOUND && o2 != -FDT_ERR_NOTFOUND) { o = fdt_node_offset_by_prop_value(fdtp, lo, "clock-frequency", &zero, sizeof(uint32_t)); o2 = fdt_node_offset_by_prop_value(fdtp, lo, "bus-frequency", &zero, sizeof(uint32_t)); /* We're only interested in /cpus subnode(s) */ if (lo > maxo) break; fdt_setprop_inplace_cell(fdtp, lo, "clock-frequency", (uint32_t)cpufreq); fdt_setprop_inplace_cell(fdtp, lo, "bus-frequency", (uint32_t)busfreq); lo = MIN(o, o2); } } #ifdef notyet static int fdt_reg_valid(uint32_t *reg, int len, int addr_cells, int size_cells) { int cells_in_tuple, i, tuples, tuple_size; uint32_t cur_start, cur_size; cells_in_tuple = (addr_cells + size_cells); tuple_size = cells_in_tuple * sizeof(uint32_t); tuples = len / tuple_size; if (tuples == 0) return (EINVAL); for (i = 0; i < tuples; i++) { if (addr_cells == 2) cur_start = fdt64_to_cpu(reg[i * cells_in_tuple]); else cur_start = fdt32_to_cpu(reg[i * cells_in_tuple]); if (size_cells == 2) cur_size = fdt64_to_cpu(reg[i * cells_in_tuple + 2]); else cur_size = fdt32_to_cpu(reg[i * cells_in_tuple + 1]); if (cur_size == 0) return (EINVAL); debugf(" reg#%d (start: 0x%0x size: 0x%0x) valid!\n", i, cur_start, cur_size); } return (0); } #endif void fdt_fixup_memory(struct fdt_mem_region *region, size_t num) { struct fdt_mem_region *curmr; uint32_t addr_cells, size_cells; uint32_t *addr_cellsp, *size_cellsp; int err, i, len, 
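/*
 * Illustrative example (address and node path are made up) of the MAC
 * fixup below: fdt_strtovectx() converts the textual address into six
 * bytes which then become the node's local-mac-address property:
 *
 *	fdt_fixup_ethernet("00:11:22:33:44:55", "/soc/ethernet@1c30000", 0);
 *	=> local-mac-address = [00 11 22 33 44 55];
 */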
memory, root; size_t realmrno; uint8_t *buf, *sb; uint64_t rstart, rsize; int reserved; root = fdt_path_offset(fdtp, "/"); if (root < 0) { sprintf(command_errbuf, "Could not find root node !"); return; } memory = fdt_path_offset(fdtp, "/memory"); if (memory <= 0) { /* Create proper '/memory' node. */ memory = fdt_add_subnode(fdtp, root, "memory"); if (memory <= 0) { snprintf(command_errbuf, sizeof(command_errbuf), "Could not fixup '/memory' " "node, error code : %d!\n", memory); return; } err = fdt_setprop(fdtp, memory, "device_type", "memory", sizeof("memory")); if (err < 0) return; } addr_cellsp = (uint32_t *)fdt_getprop(fdtp, root, "#address-cells", NULL); size_cellsp = (uint32_t *)fdt_getprop(fdtp, root, "#size-cells", NULL); if (addr_cellsp == NULL || size_cellsp == NULL) { snprintf(command_errbuf, sizeof(command_errbuf), "Could not fixup '/memory' node : " "%s %s property not found in root node!\n", (!addr_cellsp) ? "#address-cells" : "", (!size_cellsp) ? "#size-cells" : ""); return; } addr_cells = fdt32_to_cpu(*addr_cellsp); size_cells = fdt32_to_cpu(*size_cellsp); /* * Convert memreserve data to memreserve property * Check if property already exists */ reserved = fdt_num_mem_rsv(fdtp); if (reserved && (fdt_getprop(fdtp, root, "memreserve", NULL) == NULL)) { len = (addr_cells + size_cells) * reserved * sizeof(uint32_t); sb = buf = (uint8_t *)malloc(len); if (!buf) return; bzero(buf, len); for (i = 0; i < reserved; i++) { if (fdt_get_mem_rsv(fdtp, i, &rstart, &rsize)) break; if (rsize) { /* Ensure endianness, and put cells into a buffer */ if (addr_cells == 2) *(uint64_t *)buf = cpu_to_fdt64(rstart); else *(uint32_t *)buf = cpu_to_fdt32(rstart); buf += sizeof(uint32_t) * addr_cells; if (size_cells == 2) *(uint64_t *)buf = cpu_to_fdt64(rsize); else *(uint32_t *)buf = cpu_to_fdt32(rsize); buf += sizeof(uint32_t) * size_cells; } } /* Set property */ if ((err = fdt_setprop(fdtp, root, "memreserve", sb, len)) < 0) printf("Could not fixup 'memreserve' property.\n"); free(sb); } /* Count valid memory regions entries in sysinfo. 
*/ realmrno = num; for (i = 0; i < num; i++) if (region[i].start == 0 && region[i].size == 0) realmrno--; if (realmrno == 0) { sprintf(command_errbuf, "Could not fixup '/memory' node : " "sysinfo doesn't contain valid memory regions info!\n"); return; } len = (addr_cells + size_cells) * realmrno * sizeof(uint32_t); sb = buf = (uint8_t *)malloc(len); if (!buf) return; bzero(buf, len); for (i = 0; i < num; i++) { curmr = ®ion[i]; if (curmr->size != 0) { /* Ensure endianness, and put cells into a buffer */ if (addr_cells == 2) *(uint64_t *)buf = cpu_to_fdt64(curmr->start); else *(uint32_t *)buf = cpu_to_fdt32(curmr->start); buf += sizeof(uint32_t) * addr_cells; if (size_cells == 2) *(uint64_t *)buf = cpu_to_fdt64(curmr->size); else *(uint32_t *)buf = cpu_to_fdt32(curmr->size); buf += sizeof(uint32_t) * size_cells; } } /* Set property */ if ((err = fdt_setprop(fdtp, memory, "reg", sb, len)) < 0) sprintf(command_errbuf, "Could not fixup '/memory' node.\n"); free(sb); } void fdt_fixup_stdout(const char *str) { char *ptr; int len, no, sero; const struct fdt_property *prop; char *tmp[10]; ptr = (char *)str + strlen(str) - 1; while (ptr > str && isdigit(*(str - 1))) str--; if (ptr == str) return; no = fdt_path_offset(fdtp, "/chosen"); if (no < 0) return; prop = fdt_get_property(fdtp, no, "stdout", &len); /* If /chosen/stdout does not extist, create it */ if (prop == NULL || (prop != NULL && len == 0)) { bzero(tmp, 10 * sizeof(char)); strcpy((char *)&tmp, "serial"); if (strlen(ptr) > 3) /* Serial number too long */ return; strncpy((char *)tmp + 6, ptr, 3); sero = fdt_path_offset(fdtp, (const char *)tmp); if (sero < 0) /* * If serial device we're trying to assign * stdout to doesn't exist in DT -- return. */ return; fdt_setprop(fdtp, no, "stdout", &tmp, strlen((char *)&tmp) + 1); fdt_setprop(fdtp, no, "stdin", &tmp, strlen((char *)&tmp) + 1); } } void fdt_load_dtb_overlays(const char *extras) { const char *s; /* Any extra overlays supplied by pre-loader environment */ if (extras != NULL && *extras != '\0') { printf("Loading DTB overlays: '%s'\n", extras); fdt_load_dtb_overlays_string(extras); } /* Any overlays supplied by loader environment */ s = getenv("fdt_overlays"); if (s != NULL && *s != '\0') { printf("Loading DTB overlays: '%s'\n", s); fdt_load_dtb_overlays_string(s); } } /* * Locate the blob, fix it up and return its location. */ static int fdt_fixup(void) { int chosen; debugf("fdt_fixup()\n"); if (fdtp == NULL && fdt_setup_fdtp() != 0) return (0); /* Create /chosen node (if not exists) */ if ((chosen = fdt_subnode_offset(fdtp, 0, "chosen")) == -FDT_ERR_NOTFOUND) chosen = fdt_add_subnode(fdtp, 0, "chosen"); /* Value assigned to fixup-applied does not matter. */ if (fdt_getprop(fdtp, chosen, "fixup-applied", NULL)) return (1); fdt_platform_fixups(); /* * Re-fetch the /chosen subnode; our fixups may apply overlays or add * nodes/properties that invalidate the offset we grabbed or created * above, so we can no longer trust it. 
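/*
 * Illustrative result of the /memory fixup above for a board with
 * #address-cells = <2> and #size-cells = <2> and a single 2 GiB region
 * starting at 0x40000000 (numbers are made up):
 *
 *	memory {
 *		device_type = "memory";
 *		reg = <0x0 0x40000000 0x0 0x80000000>;
 *	};
 */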
*/ chosen = fdt_subnode_offset(fdtp, 0, "chosen"); fdt_setprop(fdtp, chosen, "fixup-applied", NULL, 0); return (1); } /* * Copy DTB blob to specified location and return size */ int fdt_copy(vm_offset_t va) { int err; debugf("fdt_copy va 0x%08x\n", va); if (fdtp == NULL) { err = fdt_setup_fdtp(); if (err) { printf("No valid device tree blob found!\n"); return (0); } } if (fdt_fixup() == 0) return (0); COPYIN(fdtp, va, fdtp_size); return (fdtp_size); } int command_fdt_internal(int argc, char *argv[]) { cmdf_t *cmdh; int flags; int i, err; if (argc < 2) { command_errmsg = "usage is 'fdt []"; return (CMD_ERROR); } /* * Validate fdt . */ i = 0; cmdh = NULL; while (!(commands[i].name == NULL)) { if (strcmp(argv[1], commands[i].name) == 0) { /* found it */ cmdh = commands[i].handler; flags = commands[i].flags; break; } i++; } if (cmdh == NULL) { command_errmsg = "unknown command"; return (CMD_ERROR); } if (flags & CMD_REQUIRES_BLOB) { /* * Check if uboot env vars were parsed already. If not, do it now. */ if (fdt_fixup() == 0) return (CMD_ERROR); } /* * Call command handler. */ err = (*cmdh)(argc, argv); return (err); } static int fdt_cmd_addr(int argc, char *argv[]) { struct preloaded_file *fp; struct fdt_header *hdr; const char *addr; char *cp; fdt_to_load = NULL; if (argc > 2) addr = argv[2]; else { sprintf(command_errbuf, "no address specified"); return (CMD_ERROR); } hdr = (struct fdt_header *)strtoul(addr, &cp, 16); if (cp == addr) { snprintf(command_errbuf, sizeof(command_errbuf), "Invalid address: %s", addr); return (CMD_ERROR); } while ((fp = file_findfile(NULL, "dtb")) != NULL) { file_discard(fp); } fdt_to_load = hdr; return (CMD_OK); } static int fdt_cmd_cd(int argc, char *argv[]) { char *path; char tmp[FDT_CWD_LEN]; int len, o; path = (argc > 2) ? 
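/*
 * Usage note (paths and the address are examples): the subcommands in the
 * commands[] table are reached through the loader's "fdt" command, e.g.
 *
 *	OK fdt header
 *	OK fdt addr 0x90000000
 *	OK fdt ls /soc
 *	OK fdt cd /chosen
 *	OK fdt prop stdout
 *
 * Subcommands flagged CMD_REQUIRES_BLOB cause the blob to be located and
 * fixed up first via fdt_fixup(), as done above.
 */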
argv[2] : "/"; if (path[0] == '/') { len = strlen(path); if (len >= FDT_CWD_LEN) goto fail; } else { /* Handle path specification relative to cwd */ len = strlen(cwd) + strlen(path) + 1; if (len >= FDT_CWD_LEN) goto fail; strcpy(tmp, cwd); strcat(tmp, "/"); strcat(tmp, path); path = tmp; } o = fdt_path_offset(fdtp, path); if (o < 0) { snprintf(command_errbuf, sizeof(command_errbuf), "could not find node: '%s'", path); return (CMD_ERROR); } strcpy(cwd, path); return (CMD_OK); fail: snprintf(command_errbuf, sizeof(command_errbuf), "path too long: %d, max allowed: %d", len, FDT_CWD_LEN - 1); return (CMD_ERROR); } static int fdt_cmd_hdr(int argc __unused, char *argv[] __unused) { char line[80]; int ver; if (fdtp == NULL) { command_errmsg = "no device tree blob pointer?!"; return (CMD_ERROR); } ver = fdt_version(fdtp); pager_open(); sprintf(line, "\nFlattened device tree header (%p):\n", fdtp); if (pager_output(line)) goto out; sprintf(line, " magic = 0x%08x\n", fdt_magic(fdtp)); if (pager_output(line)) goto out; sprintf(line, " size = %d\n", fdt_totalsize(fdtp)); if (pager_output(line)) goto out; sprintf(line, " off_dt_struct = 0x%08x\n", fdt_off_dt_struct(fdtp)); if (pager_output(line)) goto out; sprintf(line, " off_dt_strings = 0x%08x\n", fdt_off_dt_strings(fdtp)); if (pager_output(line)) goto out; sprintf(line, " off_mem_rsvmap = 0x%08x\n", fdt_off_mem_rsvmap(fdtp)); if (pager_output(line)) goto out; sprintf(line, " version = %d\n", ver); if (pager_output(line)) goto out; sprintf(line, " last compatible version = %d\n", fdt_last_comp_version(fdtp)); if (pager_output(line)) goto out; if (ver >= 2) { sprintf(line, " boot_cpuid = %d\n", fdt_boot_cpuid_phys(fdtp)); if (pager_output(line)) goto out; } if (ver >= 3) { sprintf(line, " size_dt_strings = %d\n", fdt_size_dt_strings(fdtp)); if (pager_output(line)) goto out; } if (ver >= 17) { sprintf(line, " size_dt_struct = %d\n", fdt_size_dt_struct(fdtp)); if (pager_output(line)) goto out; } out: pager_close(); return (CMD_OK); } static int fdt_cmd_ls(int argc, char *argv[]) { const char *prevname[FDT_MAX_DEPTH] = { NULL }; const char *name; char *path; int i, o, depth; path = (argc > 2) ? argv[2] : NULL; if (path == NULL) path = cwd; o = fdt_path_offset(fdtp, path); if (o < 0) { snprintf(command_errbuf, sizeof(command_errbuf), "could not find node: '%s'", path); return (CMD_ERROR); } for (depth = 0; (o >= 0) && (depth >= 0); o = fdt_next_node(fdtp, o, &depth)) { name = fdt_get_name(fdtp, o, NULL); if (depth > FDT_MAX_DEPTH) { printf("max depth exceeded: %d\n", depth); continue; } prevname[depth] = name; /* Skip root (i = 1) when printing devices */ for (i = 1; i <= depth; i++) { if (prevname[i] == NULL) break; if (strcmp(cwd, "/") == 0) printf("/"); printf("%s", prevname[i]); } printf("\n"); } return (CMD_OK); } static __inline int isprint(int c) { return (c >= ' ' && c <= 0x7e); } static int fdt_isprint(const void *data, int len, int *count) { const char *d; char ch; int yesno, i; if (len == 0) return (0); d = (const char *)data; if (d[len - 1] != '\0') return (0); *count = 0; yesno = 1; for (i = 0; i < len; i++) { ch = *(d + i); if (isprint(ch) || (ch == '\0' && i > 0)) { /* Count strings */ if (ch == '\0') (*count)++; continue; } yesno = 0; break; } return (yesno); } static int fdt_data_str(const void *data, int len, int count, char **buf) { char *b, *tmp; const char *d; int buf_len, i, l; /* * Calculate the length for the string and allocate memory. * * Note that 'len' already includes at least one terminator. 
*/ buf_len = len; if (count > 1) { /* * Each token had already a terminator buried in 'len', but we * only need one eventually, don't count space for these. */ buf_len -= count - 1; /* Each consecutive token requires a ", " separator. */ buf_len += count * 2; } /* Add some space for surrounding double quotes. */ buf_len += count * 2; /* Note that string being put in 'tmp' may be as big as 'buf_len'. */ b = (char *)malloc(buf_len); tmp = (char *)malloc(buf_len); if (b == NULL) goto error; if (tmp == NULL) { free(b); goto error; } b[0] = '\0'; /* * Now that we have space, format the string. */ i = 0; do { d = (const char *)data + i; l = strlen(d) + 1; sprintf(tmp, "\"%s\"%s", d, (i + l) < len ? ", " : ""); strcat(b, tmp); i += l; } while (i < len); *buf = b; free(tmp); return (0); error: return (1); } static int fdt_data_cell(const void *data, int len, char **buf) { char *b, *tmp; const uint32_t *c; int count, i, l; /* Number of cells */ count = len / 4; /* * Calculate the length for the string and allocate memory. */ /* Each byte translates to 2 output characters */ l = len * 2; if (count > 1) { /* Each consecutive cell requires a " " separator. */ l += (count - 1) * 1; } /* Each cell will have a "0x" prefix */ l += count * 2; /* Space for surrounding <> and terminator */ l += 3; b = (char *)malloc(l); tmp = (char *)malloc(l); if (b == NULL) goto error; if (tmp == NULL) { free(b); goto error; } b[0] = '\0'; strcat(b, "<"); for (i = 0; i < len; i += 4) { c = (const uint32_t *)((const uint8_t *)data + i); sprintf(tmp, "0x%08x%s", fdt32_to_cpu(*c), i < (len - 4) ? " " : ""); strcat(b, tmp); } strcat(b, ">"); *buf = b; free(tmp); return (0); error: return (1); } static int fdt_data_bytes(const void *data, int len, char **buf) { char *b, *tmp; const char *d; int i, l; /* * Calculate the length for the string and allocate memory. */ /* Each byte translates to 2 output characters */ l = len * 2; if (len > 1) /* Each consecutive byte requires a " " separator. */ l += (len - 1) * 1; /* Each byte will have a "0x" prefix */ l += len * 2; /* Space for surrounding [] and terminator. */ l += 3; b = (char *)malloc(l); tmp = (char *)malloc(l); if (b == NULL) goto error; if (tmp == NULL) { free(b); goto error; } b[0] = '\0'; strcat(b, "["); for (i = 0, d = data; i < len; i++) { sprintf(tmp, "0x%02x%s", d[i], i < len - 1 ? 
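/*
 * Illustrative output of the property formatters in this block (values
 * are made up): fdt_data_str() quotes printable strings, fdt_data_cell()
 * prints 32-bit cells, and fdt_data_bytes() falls back to raw bytes, so
 * "fdt prop" renders properties roughly like:
 *
 *	compatible = "vendor,soc", "simple-bus"
 *	reg = <0x01c30000 0x00000104>
 *	mac = [0x00 0x11 0x22 0x33 0x44 0x55]
 */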
" " : ""); strcat(b, tmp); } strcat(b, "]"); *buf = b; free(tmp); return (0); error: return (1); } static int fdt_data_fmt(const void *data, int len, char **buf) { int count; if (len == 0) { *buf = NULL; return (1); } if (fdt_isprint(data, len, &count)) return (fdt_data_str(data, len, count, buf)); else if ((len % 4) == 0) return (fdt_data_cell(data, len, buf)); else return (fdt_data_bytes(data, len, buf)); } static int fdt_prop(int offset) { char *line, *buf; const struct fdt_property *prop; const char *name; const void *data; int len, rv; line = NULL; prop = fdt_offset_ptr(fdtp, offset, sizeof(*prop)); if (prop == NULL) return (1); name = fdt_string(fdtp, fdt32_to_cpu(prop->nameoff)); len = fdt32_to_cpu(prop->len); rv = 0; buf = NULL; if (len == 0) { /* Property without value */ line = (char *)malloc(strlen(name) + 2); if (line == NULL) { rv = 2; goto out2; } sprintf(line, "%s\n", name); goto out1; } /* * Process property with value */ data = prop->data; if (fdt_data_fmt(data, len, &buf) != 0) { rv = 3; goto out2; } line = (char *)malloc(strlen(name) + strlen(FDT_PROP_SEP) + strlen(buf) + 2); if (line == NULL) { sprintf(command_errbuf, "could not allocate space for string"); rv = 4; goto out2; } sprintf(line, "%s" FDT_PROP_SEP "%s\n", name, buf); out1: pager_open(); pager_output(line); pager_close(); out2: if (buf) free(buf); if (line) free(line); return (rv); } static int fdt_modprop(int nodeoff, char *propname, void *value, char mode) { uint32_t cells[100]; const char *buf; int len, rv; const struct fdt_property *p; p = fdt_get_property(fdtp, nodeoff, propname, NULL); if (p != NULL) { if (mode == 1) { /* Adding inexistant value in mode 1 is forbidden */ sprintf(command_errbuf, "property already exists!"); return (CMD_ERROR); } } else if (mode == 0) { sprintf(command_errbuf, "property does not exist!"); return (CMD_ERROR); } rv = 0; buf = value; switch (*buf) { case '&': /* phandles */ break; case '<': /* Data cells */ len = fdt_strtovect(buf, (void *)&cells, 100, sizeof(uint32_t)); rv = fdt_setprop(fdtp, nodeoff, propname, &cells, len * sizeof(uint32_t)); break; case '[': /* Data bytes */ len = fdt_strtovect(buf, (void *)&cells, 100, sizeof(uint8_t)); rv = fdt_setprop(fdtp, nodeoff, propname, &cells, len * sizeof(uint8_t)); break; case '"': default: /* Default -- string */ rv = fdt_setprop_string(fdtp, nodeoff, propname, value); break; } if (rv != 0) { if (rv == -FDT_ERR_NOSPACE) sprintf(command_errbuf, "Device tree blob is too small!\n"); else sprintf(command_errbuf, "Could not add/modify property!\n"); } return (rv); } /* Merge strings from argv into a single string */ static int fdt_merge_strings(int argc, char *argv[], int start, char **buffer) { char *buf; int i, idx, sz; *buffer = NULL; sz = 0; for (i = start; i < argc; i++) sz += strlen(argv[i]); /* Additional bytes for whitespaces between args */ sz += argc - start; buf = (char *)malloc(sizeof(char) * sz); if (buf == NULL) { sprintf(command_errbuf, "could not allocate space " "for string"); return (1); } bzero(buf, sizeof(char) * sz); idx = 0; for (i = start, idx = 0; i < argc; i++) { strcpy(buf + idx, argv[i]); idx += strlen(argv[i]); buf[idx] = ' '; idx++; } buf[sz - 1] = '\0'; *buffer = buf; return (0); } /* Extract offset and name of node/property from a given path */ static int fdt_extract_nameloc(char **pathp, char **namep, int *nodeoff) { int o; char *path = *pathp, *name = NULL, *subpath = NULL; subpath = strrchr(path, '/'); if (subpath == NULL) { o = fdt_path_offset(fdtp, cwd); name = path; path = (char *)&cwd; } else 
{ *subpath = '\0'; if (strlen(path) == 0) path = cwd; name = subpath + 1; o = fdt_path_offset(fdtp, path); } if (strlen(name) == 0) { sprintf(command_errbuf, "name not specified"); return (1); } if (o < 0) { snprintf(command_errbuf, sizeof(command_errbuf), "could not find node: '%s'", path); return (1); } *namep = name; *nodeoff = o; *pathp = path; return (0); } static int fdt_cmd_prop(int argc, char *argv[]) { char *path, *propname, *value; int o, next, depth, rv; uint32_t tag; path = (argc > 2) ? argv[2] : NULL; value = NULL; if (argc > 3) { /* Merge property value strings into one */ if (fdt_merge_strings(argc, argv, 3, &value) != 0) return (CMD_ERROR); } else value = NULL; if (path == NULL) path = cwd; rv = CMD_OK; if (value) { /* If value is specified -- try to modify prop. */ if (fdt_extract_nameloc(&path, &propname, &o) != 0) return (CMD_ERROR); rv = fdt_modprop(o, propname, value, 0); if (rv) return (CMD_ERROR); return (CMD_OK); } /* User wants to display properties */ o = fdt_path_offset(fdtp, path); if (o < 0) { snprintf(command_errbuf, sizeof(command_errbuf), "could not find node: '%s'", path); rv = CMD_ERROR; goto out; } depth = 0; while (depth >= 0) { tag = fdt_next_tag(fdtp, o, &next); switch (tag) { case FDT_NOP: break; case FDT_PROP: if (depth > 1) /* Don't process properties of nested nodes */ break; if (fdt_prop(o) != 0) { sprintf(command_errbuf, "could not process " "property"); rv = CMD_ERROR; goto out; } break; case FDT_BEGIN_NODE: depth++; if (depth > FDT_MAX_DEPTH) { printf("warning: nesting too deep: %d\n", depth); goto out; } break; case FDT_END_NODE: depth--; if (depth == 0) /* * This is the end of our starting node, force * the loop finish. */ depth--; break; } o = next; } out: return (rv); } static int fdt_cmd_mkprop(int argc, char *argv[]) { int o; char *path, *propname, *value; path = (argc > 2) ? argv[2] : NULL; value = NULL; if (argc > 3) { /* Merge property value strings into one */ if (fdt_merge_strings(argc, argv, 3, &value) != 0) return (CMD_ERROR); } else value = NULL; if (fdt_extract_nameloc(&path, &propname, &o) != 0) return (CMD_ERROR); if (fdt_modprop(o, propname, value, 1)) return (CMD_ERROR); return (CMD_OK); } static int fdt_cmd_rm(int argc, char *argv[]) { int o, rv; char *path = NULL, *propname; if (argc > 2) path = argv[2]; else { sprintf(command_errbuf, "no node/property name specified"); return (CMD_ERROR); } o = fdt_path_offset(fdtp, path); if (o < 0) { /* If node not found -- try to find & delete property */ if (fdt_extract_nameloc(&path, &propname, &o) != 0) return (CMD_ERROR); if ((rv = fdt_delprop(fdtp, o, propname)) != 0) { snprintf(command_errbuf, sizeof(command_errbuf), "could not delete %s\n", (rv == -FDT_ERR_NOTFOUND) ? 
"(property/node does not exist)" : ""); return (CMD_ERROR); } else return (CMD_OK); } /* If node exists -- remove node */ rv = fdt_del_node(fdtp, o); if (rv) { sprintf(command_errbuf, "could not delete node"); return (CMD_ERROR); } return (CMD_OK); } static int fdt_cmd_mknode(int argc, char *argv[]) { int o, rv; char *path = NULL, *nodename = NULL; if (argc > 2) path = argv[2]; else { sprintf(command_errbuf, "no node name specified"); return (CMD_ERROR); } if (fdt_extract_nameloc(&path, &nodename, &o) != 0) return (CMD_ERROR); rv = fdt_add_subnode(fdtp, o, nodename); if (rv < 0) { if (rv == -FDT_ERR_NOSPACE) sprintf(command_errbuf, "Device tree blob is too small!\n"); else sprintf(command_errbuf, "Could not add node!\n"); return (CMD_ERROR); } return (CMD_OK); } static int fdt_cmd_pwd(int argc, char *argv[]) { char line[FDT_CWD_LEN]; pager_open(); sprintf(line, "%s\n", cwd); pager_output(line); pager_close(); return (CMD_OK); } static int fdt_cmd_mres(int argc, char *argv[]) { uint64_t start, size; int i, total; char line[80]; pager_open(); total = fdt_num_mem_rsv(fdtp); if (total > 0) { if (pager_output("Reserved memory regions:\n")) goto out; for (i = 0; i < total; i++) { fdt_get_mem_rsv(fdtp, i, &start, &size); sprintf(line, "reg#%d: (start: 0x%jx, size: 0x%jx)\n", i, start, size); if (pager_output(line)) goto out; } } else pager_output("No reserved memory regions\n"); out: pager_close(); return (CMD_OK); } static int fdt_cmd_nyi(int argc, char *argv[]) { printf("command not yet implemented\n"); return (CMD_ERROR); } Index: user/ngie/bug-237403/stand/fdt/fdt_platform.h =================================================================== --- user/ngie/bug-237403/stand/fdt/fdt_platform.h (revision 346925) +++ user/ngie/bug-237403/stand/fdt/fdt_platform.h (revision 346926) @@ -1,57 +1,58 @@ /*- * Copyright (c) 2014 Andrew Turner * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #ifndef FDT_PLATFORM_H #define FDT_PLATFORM_H struct fdt_header; struct fdt_mem_region { unsigned long start; unsigned long size; }; #define TMP_MAX_ETH 8 int fdt_copy(vm_offset_t); void fdt_fixup_cpubusfreqs(unsigned long, unsigned long); void fdt_fixup_ethernet(const char *, char *, int); void fdt_fixup_memory(struct fdt_mem_region *, size_t); void fdt_fixup_stdout(const char *); void fdt_apply_overlays(void); int fdt_load_dtb_addr(struct fdt_header *); int fdt_load_dtb_file(const char *); void fdt_load_dtb_overlays(const char *); int fdt_setup_fdtp(void); +int fdt_is_setup(void); /* The platform library needs to implement these functions */ int fdt_platform_load_dtb(void); void fdt_platform_load_overlays(void); void fdt_platform_fixups(void); #endif /* FDT_PLATFORM_H */ Index: user/ngie/bug-237403/stand/i386/loader/conf.c =================================================================== --- user/ngie/bug-237403/stand/i386/loader/conf.c (revision 346925) +++ user/ngie/bug-237403/stand/i386/loader/conf.c (revision 346926) @@ -1,168 +1,170 @@ /*- * Copyright (c) 1998 Michael Smith * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include "libi386/libi386.h" #if defined(LOADER_ZFS_SUPPORT) #include "libzfs.h" #endif /* * We could use linker sets for some or all of these, but * then we would have to control what ended up linked into * the bootstrap. So it's easier to conditionalise things * here. * * XXX rename these arrays to be consistent and less namespace-hostile * * XXX as libi386 and biosboot merge, some of these can become linker sets. 
*/ #if defined(LOADER_FIREWIRE_SUPPORT) extern struct devsw fwohci; #endif +extern struct devsw vdisk_dev; /* Exported for libstand */ struct devsw *devsw[] = { &biosfd, &bioscd, &bioshd, #if defined(LOADER_NFS_SUPPORT) || defined(LOADER_TFTP_SUPPORT) &pxedisk, #endif #if defined(LOADER_FIREWIRE_SUPPORT) &fwohci, #endif + &vdisk_dev, #if defined(LOADER_ZFS_SUPPORT) &zfs_dev, #endif NULL }; struct fs_ops *file_system[] = { #if defined(LOADER_ZFS_SUPPORT) &zfs_fsops, #endif #if defined(LOADER_UFS_SUPPORT) &ufs_fsops, #endif #if defined(LOADER_EXT2FS_SUPPORT) &ext2fs_fsops, #endif #if defined(LOADER_MSDOS_SUPPORT) &dosfs_fsops, #endif #if defined(LOADER_CD9660_SUPPORT) &cd9660_fsops, #endif #if defined(LOADER_NANDFS_SUPPORT) &nandfs_fsops, #endif #ifdef LOADER_NFS_SUPPORT &nfs_fsops, #endif #ifdef LOADER_TFTP_SUPPORT &tftp_fsops, #endif #ifdef LOADER_GZIP_SUPPORT &gzipfs_fsops, #endif #ifdef LOADER_BZIP2_SUPPORT &bzipfs_fsops, #endif #ifdef LOADER_SPLIT_SUPPORT &splitfs_fsops, #endif NULL }; /* Exported for i386 only */ /* * Sort formats so that those that can detect based on arguments * rather than reading the file go first. */ extern struct file_format i386_elf; extern struct file_format i386_elf_obj; extern struct file_format amd64_elf; extern struct file_format amd64_elf_obj; extern struct file_format multiboot; extern struct file_format multiboot_obj; struct file_format *file_formats[] = { &multiboot, &multiboot_obj, #ifdef LOADER_PREFER_AMD64 &amd64_elf, &amd64_elf_obj, #endif &i386_elf, &i386_elf_obj, #ifndef LOADER_PREFER_AMD64 &amd64_elf, &amd64_elf_obj, #endif NULL }; /* * Consoles * * We don't prototype these in libi386.h because they require * data structures from bootstrap.h as well. */ extern struct console vidconsole; extern struct console comconsole; #if defined(LOADER_FIREWIRE_SUPPORT) extern struct console dconsole; #endif extern struct console nullconsole; extern struct console spinconsole; struct console *consoles[] = { &vidconsole, &comconsole, #if defined(LOADER_FIREWIRE_SUPPORT) &dconsole, #endif &nullconsole, &spinconsole, NULL }; extern struct pnphandler isapnphandler; extern struct pnphandler biospnphandler; extern struct pnphandler biospcihandler; struct pnphandler *pnphandlers[] = { &biospnphandler, /* should go first, as it may set isapnp_readport */ &isapnphandler, &biospcihandler, NULL }; Index: user/ngie/bug-237403/stand/loader.mk =================================================================== --- user/ngie/bug-237403/stand/loader.mk (revision 346925) +++ user/ngie/bug-237403/stand/loader.mk (revision 346926) @@ -1,178 +1,178 @@ # $FreeBSD$ .PATH: ${LDRSRC} ${BOOTSRC}/libsa CFLAGS+=-I${LDRSRC} SRCS+= boot.c commands.c console.c devopen.c interp.c SRCS+= interp_backslash.c interp_parse.c ls.c misc.c SRCS+= module.c .if ${MACHINE} == "i386" || ${MACHINE_CPUARCH} == "amd64" SRCS+= load_elf32.c load_elf32_obj.c reloc_elf32.c SRCS+= load_elf64.c load_elf64_obj.c reloc_elf64.c .elif ${MACHINE_CPUARCH} == "aarch64" SRCS+= load_elf64.c reloc_elf64.c .elif ${MACHINE_CPUARCH} == "arm" SRCS+= load_elf32.c reloc_elf32.c .elif ${MACHINE_CPUARCH} == "powerpc" SRCS+= load_elf32.c reloc_elf32.c SRCS+= load_elf64.c reloc_elf64.c SRCS+= metadata.c .elif ${MACHINE_CPUARCH} == "sparc64" SRCS+= load_elf64.c reloc_elf64.c SRCS+= metadata.c .elif ${MACHINE_ARCH:Mmips64*} != "" SRCS+= load_elf64.c reloc_elf64.c SRCS+= metadata.c .elif ${MACHINE} == "mips" SRCS+= load_elf32.c reloc_elf32.c SRCS+= metadata.c .endif .if ${LOADER_DISK_SUPPORT:Uyes} == "yes" -SRCS+= disk.c part.c 
+SRCS+= disk.c part.c vdisk.c .endif .if ${LOADER_NET_SUPPORT:Uno} == "yes" SRCS+= dev_net.c .endif .if defined(HAVE_BCACHE) SRCS+= bcache.c .endif .if defined(MD_IMAGE_SIZE) CFLAGS+= -DMD_IMAGE_SIZE=${MD_IMAGE_SIZE} SRCS+= md.c .else CLEANFILES+= md.o .endif # Machine-independant ISA PnP .if defined(HAVE_ISABUS) SRCS+= isapnp.c .endif .if defined(HAVE_PNP) SRCS+= pnp.c .endif .if ${LOADER_INTERP} == "lua" SRCS+= interp_lua.c .include "${BOOTSRC}/lua.mk" LDR_INTERP= ${LIBLUA} LDR_INTERP32= ${LIBLUA32} .elif ${LOADER_INTERP} == "4th" SRCS+= interp_forth.c .include "${BOOTSRC}/ficl.mk" LDR_INTERP= ${LIBFICL} LDR_INTERP32= ${LIBFICL32} .elif ${LOADER_INTERP} == "simp" SRCS+= interp_simple.c .else .error Unknown interpreter ${LOADER_INTERP} .endif .if ${MK_LOADER_VERIEXEC} != "no" CFLAGS+= -DLOADER_VERIEXEC -I${SRCTOP}/lib/libsecureboot/h .endif .if ${MK_LOADER_VERIEXEC_PASS_MANIFEST} != "no" CFLAGS+= -DLOADER_VERIEXEC_PASS_MANIFEST -I${SRCTOP}/lib/libsecureboot/h .endif .if defined(BOOT_PROMPT_123) CFLAGS+= -DBOOT_PROMPT_123 .endif .if defined(LOADER_INSTALL_SUPPORT) SRCS+= install.c .endif # Filesystem support .if ${LOADER_CD9660_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_CD9660_SUPPORT .endif .if ${LOADER_EXT2FS_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_EXT2FS_SUPPORT .endif .if ${LOADER_MSDOS_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_MSDOS_SUPPORT .endif .if ${LOADER_NANDFS_SUPPORT:U${MK_NAND}} == "yes" CFLAGS+= -DLOADER_NANDFS_SUPPORT .endif .if ${LOADER_UFS_SUPPORT:Uyes} == "yes" CFLAGS+= -DLOADER_UFS_SUPPORT .endif # Compression .if ${LOADER_GZIP_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_GZIP_SUPPORT .endif .if ${LOADER_BZIP2_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_BZIP2_SUPPORT .endif # Network related things .if ${LOADER_NET_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_NET_SUPPORT .endif .if ${LOADER_NFS_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_NFS_SUPPORT .endif .if ${LOADER_TFTP_SUPPORT:Uno} == "yes" CFLAGS+= -DLOADER_TFTP_SUPPORT .endif # Partition support .if ${LOADER_GPT_SUPPORT:Uyes} == "yes" CFLAGS+= -DLOADER_GPT_SUPPORT .endif .if ${LOADER_MBR_SUPPORT:Uyes} == "yes" CFLAGS+= -DLOADER_MBR_SUPPORT .endif .if ${HAVE_ZFS:Uno} == "yes" CFLAGS+= -DLOADER_ZFS_SUPPORT CFLAGS+= -I${ZFSSRC} CFLAGS+= -I${SYSDIR}/cddl/boot/zfs SRCS+= zfs_cmd.c .endif LIBFICL= ${BOOTOBJ}/ficl/libficl.a .if ${MACHINE} == "i386" LIBFICL32= ${LIBFICL} .else LIBFICL32= ${BOOTOBJ}/ficl32/libficl.a .endif LIBLUA= ${BOOTOBJ}/liblua/liblua.a .if ${MACHINE} == "i386" LIBLUA32= ${LIBLUA} .else LIBLUA32= ${BOOTOBJ}/liblua32/liblua.a .endif CLEANFILES+= vers.c VERSION_FILE?= ${.CURDIR}/version .if ${MK_REPRODUCIBLE_BUILD} != no REPRO_FLAG= -r .endif vers.c: ${LDRSRC}/newvers.sh ${VERSION_FILE} sh ${LDRSRC}/newvers.sh ${REPRO_FLAG} ${VERSION_FILE} \ ${NEWVERSWHAT} .if ${MK_LOADER_VERBOSE} != "no" CFLAGS+= -DELF_VERBOSE .endif .if !empty(HELP_FILES) HELP_FILES+= ${LDRSRC}/help.common CLEANFILES+= loader.help FILES+= loader.help loader.help: ${HELP_FILES} cat ${HELP_FILES} | awk -f ${LDRSRC}/merge_help.awk > ${.TARGET} .endif Index: user/ngie/bug-237403/sys/amd64/include/vmm.h =================================================================== --- user/ngie/bug-237403/sys/amd64/include/vmm.h (revision 346925) +++ user/ngie/bug-237403/sys/amd64/include/vmm.h (revision 346926) @@ -1,701 +1,702 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _VMM_H_ #define _VMM_H_ #include #include #ifdef _KERNEL SDT_PROVIDER_DECLARE(vmm); #endif enum vm_suspend_how { VM_SUSPEND_NONE, VM_SUSPEND_RESET, VM_SUSPEND_POWEROFF, VM_SUSPEND_HALT, VM_SUSPEND_TRIPLEFAULT, VM_SUSPEND_LAST }; /* * Identifiers for architecturally defined registers. */ enum vm_reg_name { VM_REG_GUEST_RAX, VM_REG_GUEST_RBX, VM_REG_GUEST_RCX, VM_REG_GUEST_RDX, VM_REG_GUEST_RSI, VM_REG_GUEST_RDI, VM_REG_GUEST_RBP, VM_REG_GUEST_R8, VM_REG_GUEST_R9, VM_REG_GUEST_R10, VM_REG_GUEST_R11, VM_REG_GUEST_R12, VM_REG_GUEST_R13, VM_REG_GUEST_R14, VM_REG_GUEST_R15, VM_REG_GUEST_CR0, VM_REG_GUEST_CR3, VM_REG_GUEST_CR4, VM_REG_GUEST_DR7, VM_REG_GUEST_RSP, VM_REG_GUEST_RIP, VM_REG_GUEST_RFLAGS, VM_REG_GUEST_ES, VM_REG_GUEST_CS, VM_REG_GUEST_SS, VM_REG_GUEST_DS, VM_REG_GUEST_FS, VM_REG_GUEST_GS, VM_REG_GUEST_LDTR, VM_REG_GUEST_TR, VM_REG_GUEST_IDTR, VM_REG_GUEST_GDTR, VM_REG_GUEST_EFER, VM_REG_GUEST_CR2, VM_REG_GUEST_PDPTE0, VM_REG_GUEST_PDPTE1, VM_REG_GUEST_PDPTE2, VM_REG_GUEST_PDPTE3, VM_REG_GUEST_INTR_SHADOW, VM_REG_GUEST_DR0, VM_REG_GUEST_DR1, VM_REG_GUEST_DR2, VM_REG_GUEST_DR3, VM_REG_GUEST_DR6, VM_REG_LAST }; enum x2apic_state { X2APIC_DISABLED, X2APIC_ENABLED, X2APIC_STATE_LAST }; #define VM_INTINFO_VECTOR(info) ((info) & 0xff) #define VM_INTINFO_DEL_ERRCODE 0x800 #define VM_INTINFO_RSVD 0x7ffff000 #define VM_INTINFO_VALID 0x80000000 #define VM_INTINFO_TYPE 0x700 #define VM_INTINFO_HWINTR (0 << 8) #define VM_INTINFO_NMI (2 << 8) #define VM_INTINFO_HWEXCEPTION (3 << 8) #define VM_INTINFO_SWINTR (4 << 8) #ifdef _KERNEL #define VM_MAX_NAMELEN 32 struct vm; struct vm_exception; struct seg_desc; struct vm_exit; struct vm_run; struct vhpet; struct vioapic; struct vlapic; struct vmspace; struct vm_object; struct vm_guest_paging; struct pmap; struct vm_eventinfo { void *rptr; /* rendezvous cookie */ int *sptr; /* suspend cookie */ int *iptr; /* reqidle cookie */ }; typedef int (*vmm_init_func_t)(int ipinum); typedef int (*vmm_cleanup_func_t)(void); typedef void (*vmm_resume_func_t)(void); typedef void * (*vmi_init_func_t)(struct vm *vm, struct pmap *pmap); typedef int (*vmi_run_func_t)(void *vmi, int vcpu, register_t rip, struct pmap *pmap, struct vm_eventinfo *info); typedef void (*vmi_cleanup_func_t)(void *vmi); typedef int 
(*vmi_get_register_t)(void *vmi, int vcpu, int num, uint64_t *retval); typedef int (*vmi_set_register_t)(void *vmi, int vcpu, int num, uint64_t val); typedef int (*vmi_get_desc_t)(void *vmi, int vcpu, int num, struct seg_desc *desc); typedef int (*vmi_set_desc_t)(void *vmi, int vcpu, int num, struct seg_desc *desc); typedef int (*vmi_get_cap_t)(void *vmi, int vcpu, int num, int *retval); typedef int (*vmi_set_cap_t)(void *vmi, int vcpu, int num, int val); typedef struct vmspace * (*vmi_vmspace_alloc)(vm_offset_t min, vm_offset_t max); typedef void (*vmi_vmspace_free)(struct vmspace *vmspace); typedef struct vlapic * (*vmi_vlapic_init)(void *vmi, int vcpu); typedef void (*vmi_vlapic_cleanup)(void *vmi, struct vlapic *vlapic); struct vmm_ops { vmm_init_func_t init; /* module wide initialization */ vmm_cleanup_func_t cleanup; vmm_resume_func_t resume; vmi_init_func_t vminit; /* vm-specific initialization */ vmi_run_func_t vmrun; vmi_cleanup_func_t vmcleanup; vmi_get_register_t vmgetreg; vmi_set_register_t vmsetreg; vmi_get_desc_t vmgetdesc; vmi_set_desc_t vmsetdesc; vmi_get_cap_t vmgetcap; vmi_set_cap_t vmsetcap; vmi_vmspace_alloc vmspace_alloc; vmi_vmspace_free vmspace_free; vmi_vlapic_init vlapic_init; vmi_vlapic_cleanup vlapic_cleanup; }; extern struct vmm_ops vmm_ops_intel; extern struct vmm_ops vmm_ops_amd; int vm_create(const char *name, struct vm **retvm); void vm_destroy(struct vm *vm); int vm_reinit(struct vm *vm); const char *vm_name(struct vm *vm); +uint16_t vm_get_maxcpus(struct vm *vm); void vm_get_topology(struct vm *vm, uint16_t *sockets, uint16_t *cores, uint16_t *threads, uint16_t *maxcpus); int vm_set_topology(struct vm *vm, uint16_t sockets, uint16_t cores, uint16_t threads, uint16_t maxcpus); /* * APIs that modify the guest memory map require all vcpus to be frozen. */ int vm_mmap_memseg(struct vm *vm, vm_paddr_t gpa, int segid, vm_ooffset_t off, size_t len, int prot, int flags); int vm_alloc_memseg(struct vm *vm, int ident, size_t len, bool sysmem); void vm_free_memseg(struct vm *vm, int ident); int vm_map_mmio(struct vm *vm, vm_paddr_t gpa, size_t len, vm_paddr_t hpa); int vm_unmap_mmio(struct vm *vm, vm_paddr_t gpa, size_t len); int vm_assign_pptdev(struct vm *vm, int bus, int slot, int func); int vm_unassign_pptdev(struct vm *vm, int bus, int slot, int func); /* * APIs that inspect the guest memory map require only a *single* vcpu to * be frozen. This acts like a read lock on the guest memory map since any * modification requires *all* vcpus to be frozen. 
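 *
 * For example, vm_mmap_getnext(), vm_get_memseg() and vm_gpa_hold(),
 * declared below, fall into this category.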
*/ int vm_mmap_getnext(struct vm *vm, vm_paddr_t *gpa, int *segid, vm_ooffset_t *segoff, size_t *len, int *prot, int *flags); int vm_get_memseg(struct vm *vm, int ident, size_t *len, bool *sysmem, struct vm_object **objptr); vm_paddr_t vmm_sysmem_maxaddr(struct vm *vm); void *vm_gpa_hold(struct vm *, int vcpuid, vm_paddr_t gpa, size_t len, int prot, void **cookie); void vm_gpa_release(void *cookie); bool vm_mem_allocated(struct vm *vm, int vcpuid, vm_paddr_t gpa); int vm_get_register(struct vm *vm, int vcpu, int reg, uint64_t *retval); int vm_set_register(struct vm *vm, int vcpu, int reg, uint64_t val); int vm_get_seg_desc(struct vm *vm, int vcpu, int reg, struct seg_desc *ret_desc); int vm_set_seg_desc(struct vm *vm, int vcpu, int reg, struct seg_desc *desc); int vm_run(struct vm *vm, struct vm_run *vmrun); int vm_suspend(struct vm *vm, enum vm_suspend_how how); int vm_inject_nmi(struct vm *vm, int vcpu); int vm_nmi_pending(struct vm *vm, int vcpuid); void vm_nmi_clear(struct vm *vm, int vcpuid); int vm_inject_extint(struct vm *vm, int vcpu); int vm_extint_pending(struct vm *vm, int vcpuid); void vm_extint_clear(struct vm *vm, int vcpuid); struct vlapic *vm_lapic(struct vm *vm, int cpu); struct vioapic *vm_ioapic(struct vm *vm); struct vhpet *vm_hpet(struct vm *vm); int vm_get_capability(struct vm *vm, int vcpu, int type, int *val); int vm_set_capability(struct vm *vm, int vcpu, int type, int val); int vm_get_x2apic_state(struct vm *vm, int vcpu, enum x2apic_state *state); int vm_set_x2apic_state(struct vm *vm, int vcpu, enum x2apic_state state); int vm_apicid2vcpuid(struct vm *vm, int apicid); int vm_activate_cpu(struct vm *vm, int vcpu); int vm_suspend_cpu(struct vm *vm, int vcpu); int vm_resume_cpu(struct vm *vm, int vcpu); struct vm_exit *vm_exitinfo(struct vm *vm, int vcpuid); void vm_exit_suspended(struct vm *vm, int vcpuid, uint64_t rip); void vm_exit_debug(struct vm *vm, int vcpuid, uint64_t rip); void vm_exit_rendezvous(struct vm *vm, int vcpuid, uint64_t rip); void vm_exit_astpending(struct vm *vm, int vcpuid, uint64_t rip); void vm_exit_reqidle(struct vm *vm, int vcpuid, uint64_t rip); #ifdef _SYS__CPUSET_H_ /* * Rendezvous all vcpus specified in 'dest' and execute 'func(arg)'. * The rendezvous 'func(arg)' is not allowed to do anything that will * cause the thread to be put to sleep. * * If the rendezvous is being initiated from a vcpu context then the * 'vcpuid' must refer to that vcpu, otherwise it should be set to -1. * * The caller cannot hold any locks when initiating the rendezvous. * * The implementation of this API may cause vcpus other than those specified * by 'dest' to be stalled. The caller should not rely on any vcpus making * forward progress when the rendezvous is in progress. */ typedef void (*vm_rendezvous_func_t)(struct vm *vm, int vcpuid, void *arg); void vm_smp_rendezvous(struct vm *vm, int vcpuid, cpuset_t dest, vm_rendezvous_func_t func, void *arg); cpuset_t vm_active_cpus(struct vm *vm); cpuset_t vm_debug_cpus(struct vm *vm); cpuset_t vm_suspended_cpus(struct vm *vm); #endif /* _SYS__CPUSET_H_ */ static __inline int vcpu_rendezvous_pending(struct vm_eventinfo *info) { return (*((uintptr_t *)(info->rptr)) != 0); } static __inline int vcpu_suspended(struct vm_eventinfo *info) { return (*info->sptr); } static __inline int vcpu_reqidle(struct vm_eventinfo *info) { return (*info->iptr); } int vcpu_debugged(struct vm *vm, int vcpuid); /* * Return 1 if device indicated by bus/slot/func is supposed to be a * pci passthrough device. * * Return 0 otherwise. 
*/ int vmm_is_pptdev(int bus, int slot, int func); void *vm_iommu_domain(struct vm *vm); enum vcpu_state { VCPU_IDLE, VCPU_FROZEN, VCPU_RUNNING, VCPU_SLEEPING, }; int vcpu_set_state(struct vm *vm, int vcpu, enum vcpu_state state, bool from_idle); enum vcpu_state vcpu_get_state(struct vm *vm, int vcpu, int *hostcpu); static int __inline vcpu_is_running(struct vm *vm, int vcpu, int *hostcpu) { return (vcpu_get_state(vm, vcpu, hostcpu) == VCPU_RUNNING); } #ifdef _SYS_PROC_H_ static int __inline vcpu_should_yield(struct vm *vm, int vcpu) { if (curthread->td_flags & (TDF_ASTPENDING | TDF_NEEDRESCHED)) return (1); else if (curthread->td_owepreempt) return (1); else return (0); } #endif void *vcpu_stats(struct vm *vm, int vcpu); void vcpu_notify_event(struct vm *vm, int vcpuid, bool lapic_intr); struct vmspace *vm_get_vmspace(struct vm *vm); struct vatpic *vm_atpic(struct vm *vm); struct vatpit *vm_atpit(struct vm *vm); struct vpmtmr *vm_pmtmr(struct vm *vm); struct vrtc *vm_rtc(struct vm *vm); /* * Inject exception 'vector' into the guest vcpu. This function returns 0 on * success and non-zero on failure. * * Wrapper functions like 'vm_inject_gp()' should be preferred to calling * this function directly because they enforce the trap-like or fault-like * behavior of an exception. * * This function should only be called in the context of the thread that is * executing this vcpu. */ int vm_inject_exception(struct vm *vm, int vcpuid, int vector, int err_valid, uint32_t errcode, int restart_instruction); /* * This function is called after a VM-exit that occurred during exception or * interrupt delivery through the IDT. The format of 'intinfo' is described * in Figure 15-1, "EXITINTINFO for All Intercepts", APM, Vol 2. * * If a VM-exit handler completes the event delivery successfully then it * should call vm_exit_intinfo() to extinguish the pending event. For e.g., * if the task switch emulation is triggered via a task gate then it should * call this function with 'intinfo=0' to indicate that the external event * is not pending anymore. * * Return value is 0 on success and non-zero on failure. */ int vm_exit_intinfo(struct vm *vm, int vcpuid, uint64_t intinfo); /* * This function is called before every VM-entry to retrieve a pending * event that should be injected into the guest. This function combines * nested events into a double or triple fault. * * Returns 0 if there are no events that need to be injected into the guest * and non-zero otherwise. */ int vm_entry_intinfo(struct vm *vm, int vcpuid, uint64_t *info); int vm_get_intinfo(struct vm *vm, int vcpuid, uint64_t *info1, uint64_t *info2); enum vm_reg_name vm_segment_name(int seg_encoding); struct vm_copyinfo { uint64_t gpa; size_t len; void *hva; void *cookie; }; /* * Set up 'copyinfo[]' to copy to/from guest linear address space starting * at 'gla' and 'len' bytes long. The 'prot' should be set to PROT_READ for * a copyin or PROT_WRITE for a copyout. * * retval is_fault Interpretation * 0 0 Success * 0 1 An exception was injected into the guest * EFAULT N/A Unrecoverable error * * The 'copyinfo[]' can be passed to 'vm_copyin()' or 'vm_copyout()' only if * the return value is 0. The 'copyinfo[]' resources should be freed by calling * 'vm_copy_teardown()' after the copy is done. 
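 *
 * A minimal usage sketch (illustrative only; 'paging', 'gla', 'len', 'buf'
 * and 'copyinfo[]' are assumed to be provided by the caller):
 *
 *	int fault;
 *	if (vm_copy_setup(vm, vcpuid, &paging, gla, len, PROT_READ,
 *	    copyinfo, nitems(copyinfo), &fault) == 0 && !fault) {
 *		vm_copyin(vm, vcpuid, copyinfo, buf, len);
 *		vm_copy_teardown(vm, vcpuid, copyinfo, nitems(copyinfo));
 *	}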
*/ int vm_copy_setup(struct vm *vm, int vcpuid, struct vm_guest_paging *paging, uint64_t gla, size_t len, int prot, struct vm_copyinfo *copyinfo, int num_copyinfo, int *is_fault); void vm_copy_teardown(struct vm *vm, int vcpuid, struct vm_copyinfo *copyinfo, int num_copyinfo); void vm_copyin(struct vm *vm, int vcpuid, struct vm_copyinfo *copyinfo, void *kaddr, size_t len); void vm_copyout(struct vm *vm, int vcpuid, const void *kaddr, struct vm_copyinfo *copyinfo, size_t len); int vcpu_trace_exceptions(struct vm *vm, int vcpuid); #endif /* KERNEL */ #define VM_MAXCPU 16 /* maximum virtual cpus */ /* * Identifiers for optional vmm capabilities */ enum vm_cap_type { VM_CAP_HALT_EXIT, VM_CAP_MTRAP_EXIT, VM_CAP_PAUSE_EXIT, VM_CAP_UNRESTRICTED_GUEST, VM_CAP_ENABLE_INVPCID, VM_CAP_MAX }; enum vm_intr_trigger { EDGE_TRIGGER, LEVEL_TRIGGER }; /* * The 'access' field has the format specified in Table 21-2 of the Intel * Architecture Manual vol 3b. * * XXX The contents of the 'access' field are architecturally defined except * bit 16 - Segment Unusable. */ struct seg_desc { uint64_t base; uint32_t limit; uint32_t access; }; #define SEG_DESC_TYPE(access) ((access) & 0x001f) #define SEG_DESC_DPL(access) (((access) >> 5) & 0x3) #define SEG_DESC_PRESENT(access) (((access) & 0x0080) ? 1 : 0) #define SEG_DESC_DEF32(access) (((access) & 0x4000) ? 1 : 0) #define SEG_DESC_GRANULARITY(access) (((access) & 0x8000) ? 1 : 0) #define SEG_DESC_UNUSABLE(access) (((access) & 0x10000) ? 1 : 0) enum vm_cpu_mode { CPU_MODE_REAL, CPU_MODE_PROTECTED, CPU_MODE_COMPATIBILITY, /* IA-32E mode (CS.L = 0) */ CPU_MODE_64BIT, /* IA-32E mode (CS.L = 1) */ }; enum vm_paging_mode { PAGING_MODE_FLAT, PAGING_MODE_32, PAGING_MODE_PAE, PAGING_MODE_64, }; struct vm_guest_paging { uint64_t cr3; int cpl; enum vm_cpu_mode cpu_mode; enum vm_paging_mode paging_mode; }; /* * The data structures 'vie' and 'vie_op' are meant to be opaque to the * consumers of instruction decoding. The only reason why their contents * need to be exposed is because they are part of the 'vm_exit' structure. */ struct vie_op { uint8_t op_byte; /* actual opcode byte */ uint8_t op_type; /* type of operation (e.g. 
MOV) */ uint16_t op_flags; }; #define VIE_INST_SIZE 15 struct vie { uint8_t inst[VIE_INST_SIZE]; /* instruction bytes */ uint8_t num_valid; /* size of the instruction */ uint8_t num_processed; uint8_t addrsize:4, opsize:4; /* address and operand sizes */ uint8_t rex_w:1, /* REX prefix */ rex_r:1, rex_x:1, rex_b:1, rex_present:1, repz_present:1, /* REP/REPE/REPZ prefix */ repnz_present:1, /* REPNE/REPNZ prefix */ opsize_override:1, /* Operand size override */ addrsize_override:1, /* Address size override */ segment_override:1; /* Segment override */ uint8_t mod:2, /* ModRM byte */ reg:4, rm:4; uint8_t ss:2, /* SIB byte */ index:4, base:4; uint8_t disp_bytes; uint8_t imm_bytes; uint8_t scale; int base_register; /* VM_REG_GUEST_xyz */ int index_register; /* VM_REG_GUEST_xyz */ int segment_register; /* VM_REG_GUEST_xyz */ int64_t displacement; /* optional addr displacement */ int64_t immediate; /* optional immediate operand */ uint8_t decoded; /* set to 1 if successfully decoded */ struct vie_op op; /* opcode description */ }; enum vm_exitcode { VM_EXITCODE_INOUT, VM_EXITCODE_VMX, VM_EXITCODE_BOGUS, VM_EXITCODE_RDMSR, VM_EXITCODE_WRMSR, VM_EXITCODE_HLT, VM_EXITCODE_MTRAP, VM_EXITCODE_PAUSE, VM_EXITCODE_PAGING, VM_EXITCODE_INST_EMUL, VM_EXITCODE_SPINUP_AP, VM_EXITCODE_DEPRECATED1, /* used to be SPINDOWN_CPU */ VM_EXITCODE_RENDEZVOUS, VM_EXITCODE_IOAPIC_EOI, VM_EXITCODE_SUSPENDED, VM_EXITCODE_INOUT_STR, VM_EXITCODE_TASK_SWITCH, VM_EXITCODE_MONITOR, VM_EXITCODE_MWAIT, VM_EXITCODE_SVM, VM_EXITCODE_REQIDLE, VM_EXITCODE_DEBUG, VM_EXITCODE_VMINSN, VM_EXITCODE_MAX }; struct vm_inout { uint16_t bytes:3; /* 1 or 2 or 4 */ uint16_t in:1; uint16_t string:1; uint16_t rep:1; uint16_t port; uint32_t eax; /* valid for out */ }; struct vm_inout_str { struct vm_inout inout; /* must be the first element */ struct vm_guest_paging paging; uint64_t rflags; uint64_t cr0; uint64_t index; uint64_t count; /* rep=1 (%rcx), rep=0 (1) */ int addrsize; enum vm_reg_name seg_name; struct seg_desc seg_desc; }; enum task_switch_reason { TSR_CALL, TSR_IRET, TSR_JMP, TSR_IDT_GATE, /* task gate in IDT */ }; struct vm_task_switch { uint16_t tsssel; /* new TSS selector */ int ext; /* task switch due to external event */ uint32_t errcode; int errcode_valid; /* push 'errcode' on the new stack */ enum task_switch_reason reason; struct vm_guest_paging paging; }; struct vm_exit { enum vm_exitcode exitcode; int inst_length; /* 0 means unknown */ uint64_t rip; union { struct vm_inout inout; struct vm_inout_str inout_str; struct { uint64_t gpa; int fault_type; } paging; struct { uint64_t gpa; uint64_t gla; uint64_t cs_base; int cs_d; /* CS.D */ struct vm_guest_paging paging; struct vie vie; } inst_emul; /* * VMX specific payload. Used when there is no "better" * exitcode to represent the VM-exit. */ struct { int status; /* vmx inst status */ /* * 'exit_reason' and 'exit_qualification' are valid * only if 'status' is zero. */ uint32_t exit_reason; uint64_t exit_qualification; /* * 'inst_error' and 'inst_type' are valid * only if 'status' is non-zero. */ int inst_type; int inst_error; } vmx; /* * SVM specific payload. 
*/ struct { uint64_t exitcode; uint64_t exitinfo1; uint64_t exitinfo2; } svm; struct { uint32_t code; /* ecx value */ uint64_t wval; } msr; struct { int vcpu; uint64_t rip; } spinup_ap; struct { uint64_t rflags; uint64_t intr_status; } hlt; struct { int vector; } ioapic_eoi; struct { enum vm_suspend_how how; } suspended; struct vm_task_switch task_switch; } u; }; /* APIs to inject faults into the guest */ void vm_inject_fault(void *vm, int vcpuid, int vector, int errcode_valid, int errcode); static __inline void vm_inject_ud(void *vm, int vcpuid) { vm_inject_fault(vm, vcpuid, IDT_UD, 0, 0); } static __inline void vm_inject_gp(void *vm, int vcpuid) { vm_inject_fault(vm, vcpuid, IDT_GP, 1, 0); } static __inline void vm_inject_ac(void *vm, int vcpuid, int errcode) { vm_inject_fault(vm, vcpuid, IDT_AC, 1, errcode); } static __inline void vm_inject_ss(void *vm, int vcpuid, int errcode) { vm_inject_fault(vm, vcpuid, IDT_SS, 1, errcode); } void vm_inject_pf(void *vm, int vcpuid, int error_code, uint64_t cr2); int vm_restart_instruction(void *vm, int vcpuid); #endif /* _VMM_H_ */ Index: user/ngie/bug-237403/sys/amd64/vmm/amd/svm.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/amd/svm.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/amd/svm.c (revision 346926) @@ -1,2300 +1,2302 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2013, Anish Gupta (akgupt3@gmail.com) * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vmm_lapic.h" #include "vmm_stat.h" #include "vmm_ktr.h" #include "vmm_ioport.h" #include "vatpic.h" #include "vlapic.h" #include "vlapic_priv.h" #include "x86.h" #include "vmcb.h" #include "svm.h" #include "svm_softc.h" #include "svm_msr.h" #include "npt.h" SYSCTL_DECL(_hw_vmm); SYSCTL_NODE(_hw_vmm, OID_AUTO, svm, CTLFLAG_RW, NULL, NULL); /* * SVM CPUID function 0x8000_000A, edx bit decoding. 
*/ #define AMD_CPUID_SVM_NP BIT(0) /* Nested paging or RVI */ #define AMD_CPUID_SVM_LBR BIT(1) /* Last branch virtualization */ #define AMD_CPUID_SVM_SVML BIT(2) /* SVM lock */ #define AMD_CPUID_SVM_NRIP_SAVE BIT(3) /* Next RIP is saved */ #define AMD_CPUID_SVM_TSC_RATE BIT(4) /* TSC rate control. */ #define AMD_CPUID_SVM_VMCB_CLEAN BIT(5) /* VMCB state caching */ #define AMD_CPUID_SVM_FLUSH_BY_ASID BIT(6) /* Flush by ASID */ #define AMD_CPUID_SVM_DECODE_ASSIST BIT(7) /* Decode assist */ #define AMD_CPUID_SVM_PAUSE_INC BIT(10) /* Pause intercept filter. */ #define AMD_CPUID_SVM_PAUSE_FTH BIT(12) /* Pause filter threshold */ #define AMD_CPUID_SVM_AVIC BIT(13) /* AVIC present */ #define VMCB_CACHE_DEFAULT (VMCB_CACHE_ASID | \ VMCB_CACHE_IOPM | \ VMCB_CACHE_I | \ VMCB_CACHE_TPR | \ VMCB_CACHE_CR2 | \ VMCB_CACHE_CR | \ VMCB_CACHE_DR | \ VMCB_CACHE_DT | \ VMCB_CACHE_SEG | \ VMCB_CACHE_NP) static uint32_t vmcb_clean = VMCB_CACHE_DEFAULT; SYSCTL_INT(_hw_vmm_svm, OID_AUTO, vmcb_clean, CTLFLAG_RDTUN, &vmcb_clean, 0, NULL); static MALLOC_DEFINE(M_SVM, "svm", "svm"); static MALLOC_DEFINE(M_SVM_VLAPIC, "svm-vlapic", "svm-vlapic"); /* Per-CPU context area. */ extern struct pcpu __pcpu[]; static uint32_t svm_feature = ~0U; /* AMD SVM features. */ SYSCTL_UINT(_hw_vmm_svm, OID_AUTO, features, CTLFLAG_RDTUN, &svm_feature, 0, "SVM features advertised by CPUID.8000000AH:EDX"); static int disable_npf_assist; SYSCTL_INT(_hw_vmm_svm, OID_AUTO, disable_npf_assist, CTLFLAG_RWTUN, &disable_npf_assist, 0, NULL); /* Maximum ASIDs supported by the processor */ static uint32_t nasid; SYSCTL_UINT(_hw_vmm_svm, OID_AUTO, num_asids, CTLFLAG_RDTUN, &nasid, 0, "Number of ASIDs supported by this processor"); /* Current ASID generation for each host cpu */ static struct asid asid[MAXCPU]; /* * SVM host state saved area of size 4KB for each core. */ static uint8_t hsave[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE); static VMM_STAT_AMD(VCPU_EXITINTINFO, "VM exits during event delivery"); static VMM_STAT_AMD(VCPU_INTINFO_INJECTED, "Events pending at VM entry"); static VMM_STAT_AMD(VMEXIT_VINTR, "VM exits due to interrupt window"); static int svm_setreg(void *arg, int vcpu, int ident, uint64_t val); static __inline int flush_by_asid(void) { return (svm_feature & AMD_CPUID_SVM_FLUSH_BY_ASID); } static __inline int decode_assist(void) { return (svm_feature & AMD_CPUID_SVM_DECODE_ASSIST); } static void svm_disable(void *arg __unused) { uint64_t efer; efer = rdmsr(MSR_EFER); efer &= ~EFER_SVM; wrmsr(MSR_EFER, efer); } /* * Disable SVM on all CPUs. */ static int svm_cleanup(void) { smp_rendezvous(NULL, svm_disable, NULL, NULL); return (0); } /* * Verify that all the features required by bhyve are available. */ static int check_svm_features(void) { u_int regs[4]; /* CPUID Fn8000_000A is for SVM */ do_cpuid(0x8000000A, regs); svm_feature &= regs[3]; /* * The number of ASIDs can be configured to be less than what is * supported by the hardware but not more. 
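 *
 * CPUID Fn8000_000A reports the hardware limit in %ebx, which is what
 * regs[1] holds in the check below.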
*/ if (nasid == 0 || nasid > regs[1]) nasid = regs[1]; KASSERT(nasid > 1, ("Insufficient ASIDs for guests: %#x", nasid)); /* bhyve requires the Nested Paging feature */ if (!(svm_feature & AMD_CPUID_SVM_NP)) { printf("SVM: Nested Paging feature not available.\n"); return (ENXIO); } /* bhyve requires the NRIP Save feature */ if (!(svm_feature & AMD_CPUID_SVM_NRIP_SAVE)) { printf("SVM: NRIP Save feature not available.\n"); return (ENXIO); } return (0); } static void svm_enable(void *arg __unused) { uint64_t efer; efer = rdmsr(MSR_EFER); efer |= EFER_SVM; wrmsr(MSR_EFER, efer); wrmsr(MSR_VM_HSAVE_PA, vtophys(hsave[curcpu])); } /* * Return 1 if SVM is enabled on this processor and 0 otherwise. */ static int svm_available(void) { uint64_t msr; /* Section 15.4 Enabling SVM from APM2. */ if ((amd_feature2 & AMDID2_SVM) == 0) { printf("SVM: not available.\n"); return (0); } msr = rdmsr(MSR_VM_CR); if ((msr & VM_CR_SVMDIS) != 0) { printf("SVM: disabled by BIOS.\n"); return (0); } return (1); } static int svm_init(int ipinum) { int error, cpu; if (!svm_available()) return (ENXIO); error = check_svm_features(); if (error) return (error); vmcb_clean &= VMCB_CACHE_DEFAULT; for (cpu = 0; cpu < MAXCPU; cpu++) { /* * Initialize the host ASIDs to their "highest" valid values. * * The next ASID allocation will rollover both 'gen' and 'num' * and start off the sequence at {1,1}. */ asid[cpu].gen = ~0UL; asid[cpu].num = nasid - 1; } svm_msr_init(); svm_npt_init(ipinum); /* Enable SVM on all CPUs */ smp_rendezvous(NULL, svm_enable, NULL, NULL); return (0); } static void svm_restore(void) { svm_enable(NULL); } /* Pentium compatible MSRs */ #define MSR_PENTIUM_START 0 #define MSR_PENTIUM_END 0x1FFF /* AMD 6th generation and Intel compatible MSRs */ #define MSR_AMD6TH_START 0xC0000000UL #define MSR_AMD6TH_END 0xC0001FFFUL /* AMD 7th and 8th generation compatible MSRs */ #define MSR_AMD7TH_START 0xC0010000UL #define MSR_AMD7TH_END 0xC0011FFFUL /* * Get the index and bit position for a MSR in permission bitmap. * Two bits are used for each MSR: lower bit for read and higher bit for write. */ static int svm_msr_index(uint64_t msr, int *index, int *bit) { uint32_t base, off; *index = -1; *bit = (msr % 4) * 2; base = 0; if (msr >= MSR_PENTIUM_START && msr <= MSR_PENTIUM_END) { *index = msr / 4; return (0); } base += (MSR_PENTIUM_END - MSR_PENTIUM_START + 1); if (msr >= MSR_AMD6TH_START && msr <= MSR_AMD6TH_END) { off = (msr - MSR_AMD6TH_START); *index = (off + base) / 4; return (0); } base += (MSR_AMD6TH_END - MSR_AMD6TH_START + 1); if (msr >= MSR_AMD7TH_START && msr <= MSR_AMD7TH_END) { off = (msr - MSR_AMD7TH_START); *index = (off + base) / 4; return (0); } return (EINVAL); } /* * Allow vcpu to read or write the 'msr' without trapping into the hypervisor. 
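 *
 * Worked example (illustrative): for MSR_EFER (0xC0000080), svm_msr_index()
 * above computes index = (0x80 + 0x2000) / 4 = 0x820 and bit = 0, so
 * svm_msr_rd_ok() clears bit 0 of perm_bitmap[0x820] (reads allowed) while
 * leaving bit 1 set (writes still intercepted).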
*/ static void svm_msr_perm(uint8_t *perm_bitmap, uint64_t msr, bool read, bool write) { int index, bit, error; error = svm_msr_index(msr, &index, &bit); KASSERT(error == 0, ("%s: invalid msr %#lx", __func__, msr)); KASSERT(index >= 0 && index < SVM_MSR_BITMAP_SIZE, ("%s: invalid index %d for msr %#lx", __func__, index, msr)); KASSERT(bit >= 0 && bit <= 6, ("%s: invalid bit position %d " "msr %#lx", __func__, bit, msr)); if (read) perm_bitmap[index] &= ~(1UL << bit); if (write) perm_bitmap[index] &= ~(2UL << bit); } static void svm_msr_rw_ok(uint8_t *perm_bitmap, uint64_t msr) { svm_msr_perm(perm_bitmap, msr, true, true); } static void svm_msr_rd_ok(uint8_t *perm_bitmap, uint64_t msr) { svm_msr_perm(perm_bitmap, msr, true, false); } static __inline int svm_get_intercept(struct svm_softc *sc, int vcpu, int idx, uint32_t bitmask) { struct vmcb_ctrl *ctrl; KASSERT(idx >=0 && idx < 5, ("invalid intercept index %d", idx)); ctrl = svm_get_vmcb_ctrl(sc, vcpu); return (ctrl->intercept[idx] & bitmask ? 1 : 0); } static __inline void svm_set_intercept(struct svm_softc *sc, int vcpu, int idx, uint32_t bitmask, int enabled) { struct vmcb_ctrl *ctrl; uint32_t oldval; KASSERT(idx >=0 && idx < 5, ("invalid intercept index %d", idx)); ctrl = svm_get_vmcb_ctrl(sc, vcpu); oldval = ctrl->intercept[idx]; if (enabled) ctrl->intercept[idx] |= bitmask; else ctrl->intercept[idx] &= ~bitmask; if (ctrl->intercept[idx] != oldval) { svm_set_dirty(sc, vcpu, VMCB_CACHE_I); VCPU_CTR3(sc->vm, vcpu, "intercept[%d] modified " "from %#x to %#x", idx, oldval, ctrl->intercept[idx]); } } static __inline void svm_disable_intercept(struct svm_softc *sc, int vcpu, int off, uint32_t bitmask) { svm_set_intercept(sc, vcpu, off, bitmask, 0); } static __inline void svm_enable_intercept(struct svm_softc *sc, int vcpu, int off, uint32_t bitmask) { svm_set_intercept(sc, vcpu, off, bitmask, 1); } static void vmcb_init(struct svm_softc *sc, int vcpu, uint64_t iopm_base_pa, uint64_t msrpm_base_pa, uint64_t np_pml4) { struct vmcb_ctrl *ctrl; struct vmcb_state *state; uint32_t mask; int n; ctrl = svm_get_vmcb_ctrl(sc, vcpu); state = svm_get_vmcb_state(sc, vcpu); ctrl->iopm_base_pa = iopm_base_pa; ctrl->msrpm_base_pa = msrpm_base_pa; /* Enable nested paging */ ctrl->np_enable = 1; ctrl->n_cr3 = np_pml4; /* * Intercept accesses to the control registers that are not shadowed * in the VMCB - i.e. all except cr0, cr2, cr3, cr4 and cr8. */ for (n = 0; n < 16; n++) { mask = (BIT(n) << 16) | BIT(n); if (n == 0 || n == 2 || n == 3 || n == 4 || n == 8) svm_disable_intercept(sc, vcpu, VMCB_CR_INTCPT, mask); else svm_enable_intercept(sc, vcpu, VMCB_CR_INTCPT, mask); } /* * Intercept everything when tracing guest exceptions otherwise * just intercept machine check exception. */ if (vcpu_trace_exceptions(sc->vm, vcpu)) { for (n = 0; n < 32; n++) { /* * Skip unimplemented vectors in the exception bitmap. */ if (n == 2 || n == 9) { continue; } svm_enable_intercept(sc, vcpu, VMCB_EXC_INTCPT, BIT(n)); } } else { svm_enable_intercept(sc, vcpu, VMCB_EXC_INTCPT, BIT(IDT_MC)); } /* Intercept various events (for e.g. 
I/O, MSR and CPUID accesses) */ svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_IO); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_MSR); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_CPUID); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_INTR); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_INIT); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_NMI); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_SMI); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_SHUTDOWN); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_FERR_FREEZE); svm_enable_intercept(sc, vcpu, VMCB_CTRL2_INTCPT, VMCB_INTCPT_MONITOR); svm_enable_intercept(sc, vcpu, VMCB_CTRL2_INTCPT, VMCB_INTCPT_MWAIT); /* * From section "Canonicalization and Consistency Checks" in APMv2 * the VMRUN intercept bit must be set to pass the consistency check. */ svm_enable_intercept(sc, vcpu, VMCB_CTRL2_INTCPT, VMCB_INTCPT_VMRUN); /* * The ASID will be set to a non-zero value just before VMRUN. */ ctrl->asid = 0; /* * Section 15.21.1, Interrupt Masking in EFLAGS * Section 15.21.2, Virtualizing APIC.TPR * * This must be set for %rflag and %cr8 isolation of guest and host. */ ctrl->v_intr_masking = 1; /* Enable Last Branch Record aka LBR for debugging */ ctrl->lbr_virt_en = 1; state->dbgctl = BIT(0); /* EFER_SVM must always be set when the guest is executing */ state->efer = EFER_SVM; /* Set up the PAT to power-on state */ state->g_pat = PAT_VALUE(0, PAT_WRITE_BACK) | PAT_VALUE(1, PAT_WRITE_THROUGH) | PAT_VALUE(2, PAT_UNCACHED) | PAT_VALUE(3, PAT_UNCACHEABLE) | PAT_VALUE(4, PAT_WRITE_BACK) | PAT_VALUE(5, PAT_WRITE_THROUGH) | PAT_VALUE(6, PAT_UNCACHED) | PAT_VALUE(7, PAT_UNCACHEABLE); /* Set up DR6/7 to power-on state */ state->dr6 = DBREG_DR6_RESERVED1; state->dr7 = DBREG_DR7_RESERVED1; } /* * Initialize a virtual machine. */ static void * svm_vminit(struct vm *vm, pmap_t pmap) { struct svm_softc *svm_sc; struct svm_vcpu *vcpu; vm_paddr_t msrpm_pa, iopm_pa, pml4_pa; int i; + uint16_t maxcpus; svm_sc = malloc(sizeof (*svm_sc), M_SVM, M_WAITOK | M_ZERO); if (((uintptr_t)svm_sc & PAGE_MASK) != 0) panic("malloc of svm_softc not aligned on page boundary"); svm_sc->msr_bitmap = contigmalloc(SVM_MSR_BITMAP_SIZE, M_SVM, M_WAITOK, 0, ~(vm_paddr_t)0, PAGE_SIZE, 0); if (svm_sc->msr_bitmap == NULL) panic("contigmalloc of SVM MSR bitmap failed"); svm_sc->iopm_bitmap = contigmalloc(SVM_IO_BITMAP_SIZE, M_SVM, M_WAITOK, 0, ~(vm_paddr_t)0, PAGE_SIZE, 0); if (svm_sc->iopm_bitmap == NULL) panic("contigmalloc of SVM IO bitmap failed"); svm_sc->vm = vm; svm_sc->nptp = (vm_offset_t)vtophys(pmap->pm_pml4); /* * Intercept read and write accesses to all MSRs. */ memset(svm_sc->msr_bitmap, 0xFF, SVM_MSR_BITMAP_SIZE); /* * Access to the following MSRs is redirected to the VMCB when the * guest is executing. Therefore it is safe to allow the guest to * read/write these MSRs directly without hypervisor involvement. 
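 *
 * (The svm_msr_rw_ok()/svm_msr_rd_ok() calls below simply clear the
 * corresponding read/write intercept bits in the permission bitmap.)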
*/ svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_GSBASE); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_FSBASE); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_KGSBASE); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_STAR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_LSTAR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_CSTAR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_SF_MASK); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_SYSENTER_CS_MSR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_SYSENTER_ESP_MSR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_SYSENTER_EIP_MSR); svm_msr_rw_ok(svm_sc->msr_bitmap, MSR_PAT); svm_msr_rd_ok(svm_sc->msr_bitmap, MSR_TSC); /* * Intercept writes to make sure that the EFER_SVM bit is not cleared. */ svm_msr_rd_ok(svm_sc->msr_bitmap, MSR_EFER); /* Intercept access to all I/O ports. */ memset(svm_sc->iopm_bitmap, 0xFF, SVM_IO_BITMAP_SIZE); iopm_pa = vtophys(svm_sc->iopm_bitmap); msrpm_pa = vtophys(svm_sc->msr_bitmap); pml4_pa = svm_sc->nptp; - for (i = 0; i < VM_MAXCPU; i++) { + maxcpus = vm_get_maxcpus(svm_sc->vm); + for (i = 0; i < maxcpus; i++) { vcpu = svm_get_vcpu(svm_sc, i); vcpu->nextrip = ~0; vcpu->lastcpu = NOCPU; vcpu->vmcb_pa = vtophys(&vcpu->vmcb); vmcb_init(svm_sc, i, iopm_pa, msrpm_pa, pml4_pa); svm_msr_guest_init(svm_sc, i); } return (svm_sc); } /* * Collateral for a generic SVM VM-exit. */ static void vm_exit_svm(struct vm_exit *vme, uint64_t code, uint64_t info1, uint64_t info2) { vme->exitcode = VM_EXITCODE_SVM; vme->u.svm.exitcode = code; vme->u.svm.exitinfo1 = info1; vme->u.svm.exitinfo2 = info2; } static int svm_cpl(struct vmcb_state *state) { /* * From APMv2: * "Retrieve the CPL from the CPL field in the VMCB, not * from any segment DPL" */ return (state->cpl); } static enum vm_cpu_mode svm_vcpu_mode(struct vmcb *vmcb) { struct vmcb_segment seg; struct vmcb_state *state; int error; state = &vmcb->state; if (state->efer & EFER_LMA) { error = vmcb_seg(vmcb, VM_REG_GUEST_CS, &seg); KASSERT(error == 0, ("%s: vmcb_seg(cs) error %d", __func__, error)); /* * Section 4.8.1 for APM2, check if Code Segment has * Long attribute set in descriptor. */ if (seg.attrib & VMCB_CS_ATTRIB_L) return (CPU_MODE_64BIT); else return (CPU_MODE_COMPATIBILITY); } else if (state->cr0 & CR0_PE) { return (CPU_MODE_PROTECTED); } else { return (CPU_MODE_REAL); } } static enum vm_paging_mode svm_paging_mode(uint64_t cr0, uint64_t cr4, uint64_t efer) { if ((cr0 & CR0_PG) == 0) return (PAGING_MODE_FLAT); if ((cr4 & CR4_PAE) == 0) return (PAGING_MODE_32); if (efer & EFER_LME) return (PAGING_MODE_64); else return (PAGING_MODE_PAE); } /* * ins/outs utility routines */ static uint64_t svm_inout_str_index(struct svm_regctx *regs, int in) { uint64_t val; val = in ? regs->sctx_rdi : regs->sctx_rsi; return (val); } static uint64_t svm_inout_str_count(struct svm_regctx *regs, int rep) { uint64_t val; val = rep ? 
regs->sctx_rcx : 1; return (val); } static void svm_inout_str_seginfo(struct svm_softc *svm_sc, int vcpu, int64_t info1, int in, struct vm_inout_str *vis) { int error, s; if (in) { vis->seg_name = VM_REG_GUEST_ES; } else { /* The segment field has standard encoding */ s = (info1 >> 10) & 0x7; vis->seg_name = vm_segment_name(s); } error = vmcb_getdesc(svm_sc, vcpu, vis->seg_name, &vis->seg_desc); KASSERT(error == 0, ("%s: svm_getdesc error %d", __func__, error)); } static int svm_inout_str_addrsize(uint64_t info1) { uint32_t size; size = (info1 >> 7) & 0x7; switch (size) { case 1: return (2); /* 16 bit */ case 2: return (4); /* 32 bit */ case 4: return (8); /* 64 bit */ default: panic("%s: invalid size encoding %d", __func__, size); } } static void svm_paging_info(struct vmcb *vmcb, struct vm_guest_paging *paging) { struct vmcb_state *state; state = &vmcb->state; paging->cr3 = state->cr3; paging->cpl = svm_cpl(state); paging->cpu_mode = svm_vcpu_mode(vmcb); paging->paging_mode = svm_paging_mode(state->cr0, state->cr4, state->efer); } #define UNHANDLED 0 /* * Handle guest I/O intercept. */ static int svm_handle_io(struct svm_softc *svm_sc, int vcpu, struct vm_exit *vmexit) { struct vmcb_ctrl *ctrl; struct vmcb_state *state; struct svm_regctx *regs; struct vm_inout_str *vis; uint64_t info1; int inout_string; state = svm_get_vmcb_state(svm_sc, vcpu); ctrl = svm_get_vmcb_ctrl(svm_sc, vcpu); regs = svm_get_guest_regctx(svm_sc, vcpu); info1 = ctrl->exitinfo1; inout_string = info1 & BIT(2) ? 1 : 0; /* * The effective segment number in EXITINFO1[12:10] is populated * only if the processor has the DecodeAssist capability. * * XXX this is not specified explicitly in APMv2 but can be verified * empirically. */ if (inout_string && !decode_assist()) return (UNHANDLED); vmexit->exitcode = VM_EXITCODE_INOUT; vmexit->u.inout.in = (info1 & BIT(0)) ? 1 : 0; vmexit->u.inout.string = inout_string; vmexit->u.inout.rep = (info1 & BIT(3)) ? 
1 : 0; vmexit->u.inout.bytes = (info1 >> 4) & 0x7; vmexit->u.inout.port = (uint16_t)(info1 >> 16); vmexit->u.inout.eax = (uint32_t)(state->rax); if (inout_string) { vmexit->exitcode = VM_EXITCODE_INOUT_STR; vis = &vmexit->u.inout_str; svm_paging_info(svm_get_vmcb(svm_sc, vcpu), &vis->paging); vis->rflags = state->rflags; vis->cr0 = state->cr0; vis->index = svm_inout_str_index(regs, vmexit->u.inout.in); vis->count = svm_inout_str_count(regs, vmexit->u.inout.rep); vis->addrsize = svm_inout_str_addrsize(info1); svm_inout_str_seginfo(svm_sc, vcpu, info1, vmexit->u.inout.in, vis); } return (UNHANDLED); } static int npf_fault_type(uint64_t exitinfo1) { if (exitinfo1 & VMCB_NPF_INFO1_W) return (VM_PROT_WRITE); else if (exitinfo1 & VMCB_NPF_INFO1_ID) return (VM_PROT_EXECUTE); else return (VM_PROT_READ); } static bool svm_npf_emul_fault(uint64_t exitinfo1) { if (exitinfo1 & VMCB_NPF_INFO1_ID) { return (false); } if (exitinfo1 & VMCB_NPF_INFO1_GPT) { return (false); } if ((exitinfo1 & VMCB_NPF_INFO1_GPA) == 0) { return (false); } return (true); } static void svm_handle_inst_emul(struct vmcb *vmcb, uint64_t gpa, struct vm_exit *vmexit) { struct vm_guest_paging *paging; struct vmcb_segment seg; struct vmcb_ctrl *ctrl; char *inst_bytes; int error, inst_len; ctrl = &vmcb->ctrl; paging = &vmexit->u.inst_emul.paging; vmexit->exitcode = VM_EXITCODE_INST_EMUL; vmexit->u.inst_emul.gpa = gpa; vmexit->u.inst_emul.gla = VIE_INVALID_GLA; svm_paging_info(vmcb, paging); error = vmcb_seg(vmcb, VM_REG_GUEST_CS, &seg); KASSERT(error == 0, ("%s: vmcb_seg(CS) error %d", __func__, error)); switch(paging->cpu_mode) { case CPU_MODE_REAL: vmexit->u.inst_emul.cs_base = seg.base; vmexit->u.inst_emul.cs_d = 0; break; case CPU_MODE_PROTECTED: case CPU_MODE_COMPATIBILITY: vmexit->u.inst_emul.cs_base = seg.base; /* * Section 4.8.1 of APM2, Default Operand Size or D bit. */ vmexit->u.inst_emul.cs_d = (seg.attrib & VMCB_CS_ATTRIB_D) ? 1 : 0; break; default: vmexit->u.inst_emul.cs_base = 0; vmexit->u.inst_emul.cs_d = 0; break; } /* * Copy the instruction bytes into 'vie' if available. */ if (decode_assist() && !disable_npf_assist) { inst_len = ctrl->inst_len; inst_bytes = ctrl->inst_bytes; } else { inst_len = 0; inst_bytes = NULL; } vie_init(&vmexit->u.inst_emul.vie, inst_bytes, inst_len); } #ifdef KTR static const char * intrtype_to_str(int intr_type) { switch (intr_type) { case VMCB_EVENTINJ_TYPE_INTR: return ("hwintr"); case VMCB_EVENTINJ_TYPE_NMI: return ("nmi"); case VMCB_EVENTINJ_TYPE_INTn: return ("swintr"); case VMCB_EVENTINJ_TYPE_EXCEPTION: return ("exception"); default: panic("%s: unknown intr_type %d", __func__, intr_type); } } #endif /* * Inject an event to vcpu as described in section 15.20, "Event injection". 
*/ static void svm_eventinject(struct svm_softc *sc, int vcpu, int intr_type, int vector, uint32_t error, bool ec_valid) { struct vmcb_ctrl *ctrl; ctrl = svm_get_vmcb_ctrl(sc, vcpu); KASSERT((ctrl->eventinj & VMCB_EVENTINJ_VALID) == 0, ("%s: event already pending %#lx", __func__, ctrl->eventinj)); KASSERT(vector >=0 && vector <= 255, ("%s: invalid vector %d", __func__, vector)); switch (intr_type) { case VMCB_EVENTINJ_TYPE_INTR: case VMCB_EVENTINJ_TYPE_NMI: case VMCB_EVENTINJ_TYPE_INTn: break; case VMCB_EVENTINJ_TYPE_EXCEPTION: if (vector >= 0 && vector <= 31 && vector != 2) break; /* FALLTHROUGH */ default: panic("%s: invalid intr_type/vector: %d/%d", __func__, intr_type, vector); } ctrl->eventinj = vector | (intr_type << 8) | VMCB_EVENTINJ_VALID; if (ec_valid) { ctrl->eventinj |= VMCB_EVENTINJ_EC_VALID; ctrl->eventinj |= (uint64_t)error << 32; VCPU_CTR3(sc->vm, vcpu, "Injecting %s at vector %d errcode %#x", intrtype_to_str(intr_type), vector, error); } else { VCPU_CTR2(sc->vm, vcpu, "Injecting %s at vector %d", intrtype_to_str(intr_type), vector); } } static void svm_update_virqinfo(struct svm_softc *sc, int vcpu) { struct vm *vm; struct vlapic *vlapic; struct vmcb_ctrl *ctrl; vm = sc->vm; vlapic = vm_lapic(vm, vcpu); ctrl = svm_get_vmcb_ctrl(sc, vcpu); /* Update %cr8 in the emulated vlapic */ vlapic_set_cr8(vlapic, ctrl->v_tpr); /* Virtual interrupt injection is not used. */ KASSERT(ctrl->v_intr_vector == 0, ("%s: invalid " "v_intr_vector %d", __func__, ctrl->v_intr_vector)); } static void svm_save_intinfo(struct svm_softc *svm_sc, int vcpu) { struct vmcb_ctrl *ctrl; uint64_t intinfo; ctrl = svm_get_vmcb_ctrl(svm_sc, vcpu); intinfo = ctrl->exitintinfo; if (!VMCB_EXITINTINFO_VALID(intinfo)) return; /* * From APMv2, Section "Intercepts during IDT interrupt delivery" * * If a #VMEXIT happened during event delivery then record the event * that was being delivered. 
*/ VCPU_CTR2(svm_sc->vm, vcpu, "SVM:Pending INTINFO(0x%lx), vector=%d.\n", intinfo, VMCB_EXITINTINFO_VECTOR(intinfo)); vmm_stat_incr(svm_sc->vm, vcpu, VCPU_EXITINTINFO, 1); vm_exit_intinfo(svm_sc->vm, vcpu, intinfo); } #ifdef INVARIANTS static __inline int vintr_intercept_enabled(struct svm_softc *sc, int vcpu) { return (svm_get_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_VINTR)); } #endif static __inline void enable_intr_window_exiting(struct svm_softc *sc, int vcpu) { struct vmcb_ctrl *ctrl; ctrl = svm_get_vmcb_ctrl(sc, vcpu); if (ctrl->v_irq && ctrl->v_intr_vector == 0) { KASSERT(ctrl->v_ign_tpr, ("%s: invalid v_ign_tpr", __func__)); KASSERT(vintr_intercept_enabled(sc, vcpu), ("%s: vintr intercept should be enabled", __func__)); return; } VCPU_CTR0(sc->vm, vcpu, "Enable intr window exiting"); ctrl->v_irq = 1; ctrl->v_ign_tpr = 1; ctrl->v_intr_vector = 0; svm_set_dirty(sc, vcpu, VMCB_CACHE_TPR); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_VINTR); } static __inline void disable_intr_window_exiting(struct svm_softc *sc, int vcpu) { struct vmcb_ctrl *ctrl; ctrl = svm_get_vmcb_ctrl(sc, vcpu); if (!ctrl->v_irq && ctrl->v_intr_vector == 0) { KASSERT(!vintr_intercept_enabled(sc, vcpu), ("%s: vintr intercept should be disabled", __func__)); return; } VCPU_CTR0(sc->vm, vcpu, "Disable intr window exiting"); ctrl->v_irq = 0; ctrl->v_intr_vector = 0; svm_set_dirty(sc, vcpu, VMCB_CACHE_TPR); svm_disable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_VINTR); } static int svm_modify_intr_shadow(struct svm_softc *sc, int vcpu, uint64_t val) { struct vmcb_ctrl *ctrl; int oldval, newval; ctrl = svm_get_vmcb_ctrl(sc, vcpu); oldval = ctrl->intr_shadow; newval = val ? 1 : 0; if (newval != oldval) { ctrl->intr_shadow = newval; VCPU_CTR1(sc->vm, vcpu, "Setting intr_shadow to %d", newval); } return (0); } static int svm_get_intr_shadow(struct svm_softc *sc, int vcpu, uint64_t *val) { struct vmcb_ctrl *ctrl; ctrl = svm_get_vmcb_ctrl(sc, vcpu); *val = ctrl->intr_shadow; return (0); } /* * Once an NMI is injected it blocks delivery of further NMIs until the handler * executes an IRET. The IRET intercept is enabled when an NMI is injected to * to track when the vcpu is done handling the NMI. */ static int nmi_blocked(struct svm_softc *sc, int vcpu) { int blocked; blocked = svm_get_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_IRET); return (blocked); } static void enable_nmi_blocking(struct svm_softc *sc, int vcpu) { KASSERT(!nmi_blocked(sc, vcpu), ("vNMI already blocked")); VCPU_CTR0(sc->vm, vcpu, "vNMI blocking enabled"); svm_enable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_IRET); } static void clear_nmi_blocking(struct svm_softc *sc, int vcpu) { int error; KASSERT(nmi_blocked(sc, vcpu), ("vNMI already unblocked")); VCPU_CTR0(sc->vm, vcpu, "vNMI blocking cleared"); /* * When the IRET intercept is cleared the vcpu will attempt to execute * the "iret" when it runs next. However, it is possible to inject * another NMI into the vcpu before the "iret" has actually executed. * * For e.g. if the "iret" encounters a #NPF when accessing the stack * it will trap back into the hypervisor. If an NMI is pending for * the vcpu it will be injected into the guest. * * XXX this needs to be fixed */ svm_disable_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_IRET); /* * Set 'intr_shadow' to prevent an NMI from being injected on the * immediate VMRUN. 
*/ error = svm_modify_intr_shadow(sc, vcpu, 1); KASSERT(!error, ("%s: error %d setting intr_shadow", __func__, error)); } #define EFER_MBZ_BITS 0xFFFFFFFFFFFF0200UL static int svm_write_efer(struct svm_softc *sc, int vcpu, uint64_t newval, bool *retu) { struct vm_exit *vme; struct vmcb_state *state; uint64_t changed, lma, oldval; int error; state = svm_get_vmcb_state(sc, vcpu); oldval = state->efer; VCPU_CTR2(sc->vm, vcpu, "wrmsr(efer) %#lx/%#lx", oldval, newval); newval &= ~0xFE; /* clear the Read-As-Zero (RAZ) bits */ changed = oldval ^ newval; if (newval & EFER_MBZ_BITS) goto gpf; /* APMv2 Table 14-5 "Long-Mode Consistency Checks" */ if (changed & EFER_LME) { if (state->cr0 & CR0_PG) goto gpf; } /* EFER.LMA = EFER.LME & CR0.PG */ if ((newval & EFER_LME) != 0 && (state->cr0 & CR0_PG) != 0) lma = EFER_LMA; else lma = 0; if ((newval & EFER_LMA) != lma) goto gpf; if (newval & EFER_NXE) { if (!vm_cpuid_capability(sc->vm, vcpu, VCC_NO_EXECUTE)) goto gpf; } /* * XXX bhyve does not enforce segment limits in 64-bit mode. Until * this is fixed flag guest attempt to set EFER_LMSLE as an error. */ if (newval & EFER_LMSLE) { vme = vm_exitinfo(sc->vm, vcpu); vm_exit_svm(vme, VMCB_EXIT_MSR, 1, 0); *retu = true; return (0); } if (newval & EFER_FFXSR) { if (!vm_cpuid_capability(sc->vm, vcpu, VCC_FFXSR)) goto gpf; } if (newval & EFER_TCE) { if (!vm_cpuid_capability(sc->vm, vcpu, VCC_TCE)) goto gpf; } error = svm_setreg(sc, vcpu, VM_REG_GUEST_EFER, newval); KASSERT(error == 0, ("%s: error %d updating efer", __func__, error)); return (0); gpf: vm_inject_gp(sc->vm, vcpu); return (0); } static int emulate_wrmsr(struct svm_softc *sc, int vcpu, u_int num, uint64_t val, bool *retu) { int error; if (lapic_msr(num)) error = lapic_wrmsr(sc->vm, vcpu, num, val, retu); else if (num == MSR_EFER) error = svm_write_efer(sc, vcpu, val, retu); else error = svm_wrmsr(sc, vcpu, num, val, retu); return (error); } static int emulate_rdmsr(struct svm_softc *sc, int vcpu, u_int num, bool *retu) { struct vmcb_state *state; struct svm_regctx *ctx; uint64_t result; int error; if (lapic_msr(num)) error = lapic_rdmsr(sc->vm, vcpu, num, &result, retu); else error = svm_rdmsr(sc, vcpu, num, &result, retu); if (error == 0) { state = svm_get_vmcb_state(sc, vcpu); ctx = svm_get_guest_regctx(sc, vcpu); state->rax = result & 0xffffffff; ctx->sctx_rdx = result >> 32; } return (error); } #ifdef KTR static const char * exit_reason_to_str(uint64_t reason) { static char reasonbuf[32]; switch (reason) { case VMCB_EXIT_INVALID: return ("invalvmcb"); case VMCB_EXIT_SHUTDOWN: return ("shutdown"); case VMCB_EXIT_NPF: return ("nptfault"); case VMCB_EXIT_PAUSE: return ("pause"); case VMCB_EXIT_HLT: return ("hlt"); case VMCB_EXIT_CPUID: return ("cpuid"); case VMCB_EXIT_IO: return ("inout"); case VMCB_EXIT_MC: return ("mchk"); case VMCB_EXIT_INTR: return ("extintr"); case VMCB_EXIT_NMI: return ("nmi"); case VMCB_EXIT_VINTR: return ("vintr"); case VMCB_EXIT_MSR: return ("msr"); case VMCB_EXIT_IRET: return ("iret"); case VMCB_EXIT_MONITOR: return ("monitor"); case VMCB_EXIT_MWAIT: return ("mwait"); default: snprintf(reasonbuf, sizeof(reasonbuf), "%#lx", reason); return (reasonbuf); } } #endif /* KTR */ /* * From section "State Saved on Exit" in APMv2: nRIP is saved for all #VMEXITs * that are due to instruction intercepts as well as MSR and IOIO intercepts * and exceptions caused by INT3, INTO and BOUND instructions. * * Return 1 if the nRIP is valid and 0 otherwise. */ static int nrip_valid(uint64_t exitcode) { switch (exitcode) { case 0x00 ... 
0x0F: /* read of CR0 through CR15 */ case 0x10 ... 0x1F: /* write of CR0 through CR15 */ case 0x20 ... 0x2F: /* read of DR0 through DR15 */ case 0x30 ... 0x3F: /* write of DR0 through DR15 */ case 0x43: /* INT3 */ case 0x44: /* INTO */ case 0x45: /* BOUND */ case 0x65 ... 0x7C: /* VMEXIT_CR0_SEL_WRITE ... VMEXIT_MSR */ case 0x80 ... 0x8D: /* VMEXIT_VMRUN ... VMEXIT_XSETBV */ return (1); default: return (0); } } static int svm_vmexit(struct svm_softc *svm_sc, int vcpu, struct vm_exit *vmexit) { struct vmcb *vmcb; struct vmcb_state *state; struct vmcb_ctrl *ctrl; struct svm_regctx *ctx; uint64_t code, info1, info2, val; uint32_t eax, ecx, edx; int error, errcode_valid, handled, idtvec, reflect; bool retu; ctx = svm_get_guest_regctx(svm_sc, vcpu); vmcb = svm_get_vmcb(svm_sc, vcpu); state = &vmcb->state; ctrl = &vmcb->ctrl; handled = 0; code = ctrl->exitcode; info1 = ctrl->exitinfo1; info2 = ctrl->exitinfo2; vmexit->exitcode = VM_EXITCODE_BOGUS; vmexit->rip = state->rip; vmexit->inst_length = nrip_valid(code) ? ctrl->nrip - state->rip : 0; vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_COUNT, 1); /* * #VMEXIT(INVALID) needs to be handled early because the VMCB is * in an inconsistent state and can trigger assertions that would * never happen otherwise. */ if (code == VMCB_EXIT_INVALID) { vm_exit_svm(vmexit, code, info1, info2); return (0); } KASSERT((ctrl->eventinj & VMCB_EVENTINJ_VALID) == 0, ("%s: event " "injection valid bit is set %#lx", __func__, ctrl->eventinj)); KASSERT(vmexit->inst_length >= 0 && vmexit->inst_length <= 15, ("invalid inst_length %d: code (%#lx), info1 (%#lx), info2 (%#lx)", vmexit->inst_length, code, info1, info2)); svm_update_virqinfo(svm_sc, vcpu); svm_save_intinfo(svm_sc, vcpu); switch (code) { case VMCB_EXIT_IRET: /* * Restart execution at "iret" but with the intercept cleared. */ vmexit->inst_length = 0; clear_nmi_blocking(svm_sc, vcpu); handled = 1; break; case VMCB_EXIT_VINTR: /* interrupt window exiting */ vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_VINTR, 1); handled = 1; break; case VMCB_EXIT_INTR: /* external interrupt */ vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_EXTINT, 1); handled = 1; break; case VMCB_EXIT_NMI: /* external NMI */ handled = 1; break; case 0x40 ... 0x5F: vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_EXCEPTION, 1); reflect = 1; idtvec = code - 0x40; switch (idtvec) { case IDT_MC: /* * Call the machine check handler by hand. Also don't * reflect the machine check back into the guest. */ reflect = 0; VCPU_CTR0(svm_sc->vm, vcpu, "Vectoring to MCE handler"); __asm __volatile("int $18"); break; case IDT_PF: error = svm_setreg(svm_sc, vcpu, VM_REG_GUEST_CR2, info2); KASSERT(error == 0, ("%s: error %d updating cr2", __func__, error)); /* fallthru */ case IDT_NP: case IDT_SS: case IDT_GP: case IDT_AC: case IDT_TS: errcode_valid = 1; break; case IDT_DF: errcode_valid = 1; info1 = 0; break; case IDT_BP: case IDT_OF: case IDT_BR: /* * The 'nrip' field is populated for INT3, INTO and * BOUND exceptions and this also implies that * 'inst_length' is non-zero. * * Reset 'inst_length' to zero so the guest %rip at * event injection is identical to what it was when * the exception originally happened. 
*/ VCPU_CTR2(svm_sc->vm, vcpu, "Reset inst_length from %d " "to zero before injecting exception %d", vmexit->inst_length, idtvec); vmexit->inst_length = 0; /* fallthru */ default: errcode_valid = 0; info1 = 0; break; } KASSERT(vmexit->inst_length == 0, ("invalid inst_length (%d) " "when reflecting exception %d into guest", vmexit->inst_length, idtvec)); if (reflect) { /* Reflect the exception back into the guest */ VCPU_CTR2(svm_sc->vm, vcpu, "Reflecting exception " "%d/%#x into the guest", idtvec, (int)info1); error = vm_inject_exception(svm_sc->vm, vcpu, idtvec, errcode_valid, info1, 0); KASSERT(error == 0, ("%s: vm_inject_exception error %d", __func__, error)); } handled = 1; break; case VMCB_EXIT_MSR: /* MSR access. */ eax = state->rax; ecx = ctx->sctx_rcx; edx = ctx->sctx_rdx; retu = false; if (info1) { vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_WRMSR, 1); val = (uint64_t)edx << 32 | eax; VCPU_CTR2(svm_sc->vm, vcpu, "wrmsr %#x val %#lx", ecx, val); if (emulate_wrmsr(svm_sc, vcpu, ecx, val, &retu)) { vmexit->exitcode = VM_EXITCODE_WRMSR; vmexit->u.msr.code = ecx; vmexit->u.msr.wval = val; } else if (!retu) { handled = 1; } else { KASSERT(vmexit->exitcode != VM_EXITCODE_BOGUS, ("emulate_wrmsr retu with bogus exitcode")); } } else { VCPU_CTR1(svm_sc->vm, vcpu, "rdmsr %#x", ecx); vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_RDMSR, 1); if (emulate_rdmsr(svm_sc, vcpu, ecx, &retu)) { vmexit->exitcode = VM_EXITCODE_RDMSR; vmexit->u.msr.code = ecx; } else if (!retu) { handled = 1; } else { KASSERT(vmexit->exitcode != VM_EXITCODE_BOGUS, ("emulate_rdmsr retu with bogus exitcode")); } } break; case VMCB_EXIT_IO: handled = svm_handle_io(svm_sc, vcpu, vmexit); vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_INOUT, 1); break; case VMCB_EXIT_CPUID: vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_CPUID, 1); handled = x86_emulate_cpuid(svm_sc->vm, vcpu, (uint32_t *)&state->rax, (uint32_t *)&ctx->sctx_rbx, (uint32_t *)&ctx->sctx_rcx, (uint32_t *)&ctx->sctx_rdx); break; case VMCB_EXIT_HLT: vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_HLT, 1); vmexit->exitcode = VM_EXITCODE_HLT; vmexit->u.hlt.rflags = state->rflags; break; case VMCB_EXIT_PAUSE: vmexit->exitcode = VM_EXITCODE_PAUSE; vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_PAUSE, 1); break; case VMCB_EXIT_NPF: /* EXITINFO2 contains the faulting guest physical address */ if (info1 & VMCB_NPF_INFO1_RSV) { VCPU_CTR2(svm_sc->vm, vcpu, "nested page fault with " "reserved bits set: info1(%#lx) info2(%#lx)", info1, info2); } else if (vm_mem_allocated(svm_sc->vm, vcpu, info2)) { vmexit->exitcode = VM_EXITCODE_PAGING; vmexit->u.paging.gpa = info2; vmexit->u.paging.fault_type = npf_fault_type(info1); vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_NESTED_FAULT, 1); VCPU_CTR3(svm_sc->vm, vcpu, "nested page fault " "on gpa %#lx/%#lx at rip %#lx", info2, info1, state->rip); } else if (svm_npf_emul_fault(info1)) { svm_handle_inst_emul(vmcb, info2, vmexit); vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_INST_EMUL, 1); VCPU_CTR3(svm_sc->vm, vcpu, "inst_emul fault " "for gpa %#lx/%#lx at rip %#lx", info2, info1, state->rip); } break; case VMCB_EXIT_MONITOR: vmexit->exitcode = VM_EXITCODE_MONITOR; break; case VMCB_EXIT_MWAIT: vmexit->exitcode = VM_EXITCODE_MWAIT; break; default: vmm_stat_incr(svm_sc->vm, vcpu, VMEXIT_UNKNOWN, 1); break; } VCPU_CTR4(svm_sc->vm, vcpu, "%s %s vmexit at %#lx/%d", handled ? 
"handled" : "unhandled", exit_reason_to_str(code), vmexit->rip, vmexit->inst_length); if (handled) { vmexit->rip += vmexit->inst_length; vmexit->inst_length = 0; state->rip = vmexit->rip; } else { if (vmexit->exitcode == VM_EXITCODE_BOGUS) { /* * If this VM exit was not claimed by anybody then * treat it as a generic SVM exit. */ vm_exit_svm(vmexit, code, info1, info2); } else { /* * The exitcode and collateral have been populated. * The VM exit will be processed further in userland. */ } } return (handled); } static void svm_inj_intinfo(struct svm_softc *svm_sc, int vcpu) { uint64_t intinfo; if (!vm_entry_intinfo(svm_sc->vm, vcpu, &intinfo)) return; KASSERT(VMCB_EXITINTINFO_VALID(intinfo), ("%s: entry intinfo is not " "valid: %#lx", __func__, intinfo)); svm_eventinject(svm_sc, vcpu, VMCB_EXITINTINFO_TYPE(intinfo), VMCB_EXITINTINFO_VECTOR(intinfo), VMCB_EXITINTINFO_EC(intinfo), VMCB_EXITINTINFO_EC_VALID(intinfo)); vmm_stat_incr(svm_sc->vm, vcpu, VCPU_INTINFO_INJECTED, 1); VCPU_CTR1(svm_sc->vm, vcpu, "Injected entry intinfo: %#lx", intinfo); } /* * Inject event to virtual cpu. */ static void svm_inj_interrupts(struct svm_softc *sc, int vcpu, struct vlapic *vlapic) { struct vmcb_ctrl *ctrl; struct vmcb_state *state; struct svm_vcpu *vcpustate; uint8_t v_tpr; int vector, need_intr_window; int extint_pending; state = svm_get_vmcb_state(sc, vcpu); ctrl = svm_get_vmcb_ctrl(sc, vcpu); vcpustate = svm_get_vcpu(sc, vcpu); need_intr_window = 0; if (vcpustate->nextrip != state->rip) { ctrl->intr_shadow = 0; VCPU_CTR2(sc->vm, vcpu, "Guest interrupt blocking " "cleared due to rip change: %#lx/%#lx", vcpustate->nextrip, state->rip); } /* * Inject pending events or exceptions for this vcpu. * * An event might be pending because the previous #VMEXIT happened * during event delivery (i.e. ctrl->exitintinfo). * * An event might also be pending because an exception was injected * by the hypervisor (e.g. #PF during instruction emulation). */ svm_inj_intinfo(sc, vcpu); /* NMI event has priority over interrupts. */ if (vm_nmi_pending(sc->vm, vcpu)) { if (nmi_blocked(sc, vcpu)) { /* * Can't inject another NMI if the guest has not * yet executed an "iret" after the last NMI. */ VCPU_CTR0(sc->vm, vcpu, "Cannot inject NMI due " "to NMI-blocking"); } else if (ctrl->intr_shadow) { /* * Can't inject an NMI if the vcpu is in an intr_shadow. */ VCPU_CTR0(sc->vm, vcpu, "Cannot inject NMI due to " "interrupt shadow"); need_intr_window = 1; goto done; } else if (ctrl->eventinj & VMCB_EVENTINJ_VALID) { /* * If there is already an exception/interrupt pending * then defer the NMI until after that. */ VCPU_CTR1(sc->vm, vcpu, "Cannot inject NMI due to " "eventinj %#lx", ctrl->eventinj); /* * Use self-IPI to trigger a VM-exit as soon as * possible after the event injection is completed. * * This works only if the external interrupt exiting * is at a lower priority than the event injection. * * Although not explicitly specified in APMv2 the * relative priorities were verified empirically. */ ipi_cpu(curcpu, IPI_AST); /* XXX vmm_ipinum? 
*/ } else { vm_nmi_clear(sc->vm, vcpu); /* Inject NMI, vector number is not used */ svm_eventinject(sc, vcpu, VMCB_EVENTINJ_TYPE_NMI, IDT_NMI, 0, false); /* virtual NMI blocking is now in effect */ enable_nmi_blocking(sc, vcpu); VCPU_CTR0(sc->vm, vcpu, "Injecting vNMI"); } } extint_pending = vm_extint_pending(sc->vm, vcpu); if (!extint_pending) { if (!vlapic_pending_intr(vlapic, &vector)) goto done; KASSERT(vector >= 16 && vector <= 255, ("invalid vector %d from local APIC", vector)); } else { /* Ask the legacy pic for a vector to inject */ vatpic_pending_intr(sc->vm, &vector); KASSERT(vector >= 0 && vector <= 255, ("invalid vector %d from INTR", vector)); } /* * If the guest has disabled interrupts or is in an interrupt shadow * then we cannot inject the pending interrupt. */ if ((state->rflags & PSL_I) == 0) { VCPU_CTR2(sc->vm, vcpu, "Cannot inject vector %d due to " "rflags %#lx", vector, state->rflags); need_intr_window = 1; goto done; } if (ctrl->intr_shadow) { VCPU_CTR1(sc->vm, vcpu, "Cannot inject vector %d due to " "interrupt shadow", vector); need_intr_window = 1; goto done; } if (ctrl->eventinj & VMCB_EVENTINJ_VALID) { VCPU_CTR2(sc->vm, vcpu, "Cannot inject vector %d due to " "eventinj %#lx", vector, ctrl->eventinj); need_intr_window = 1; goto done; } svm_eventinject(sc, vcpu, VMCB_EVENTINJ_TYPE_INTR, vector, 0, false); if (!extint_pending) { vlapic_intr_accepted(vlapic, vector); } else { vm_extint_clear(sc->vm, vcpu); vatpic_intr_accepted(sc->vm, vector); } /* * Force a VM-exit as soon as the vcpu is ready to accept another * interrupt. This is done because the PIC might have another vector * that it wants to inject. Also, if the APIC has a pending interrupt * that was preempted by the ExtInt then it allows us to inject the * APIC vector as soon as possible. */ need_intr_window = 1; done: /* * The guest can modify the TPR by writing to %CR8. In guest mode * the processor reflects this write to V_TPR without hypervisor * intervention. * * The guest can also modify the TPR by writing to it via the memory * mapped APIC page. In this case, the write will be emulated by the * hypervisor. For this reason V_TPR must be updated before every * VMRUN. */ v_tpr = vlapic_get_cr8(vlapic); KASSERT(v_tpr <= 15, ("invalid v_tpr %#x", v_tpr)); if (ctrl->v_tpr != v_tpr) { VCPU_CTR2(sc->vm, vcpu, "VMCB V_TPR changed from %#x to %#x", ctrl->v_tpr, v_tpr); ctrl->v_tpr = v_tpr; svm_set_dirty(sc, vcpu, VMCB_CACHE_TPR); } if (need_intr_window) { /* * We use V_IRQ in conjunction with the VINTR intercept to * trap into the hypervisor as soon as a virtual interrupt * can be delivered. * * Since injected events are not subject to intercept checks * we need to ensure that the V_IRQ is not actually going to * be delivered on VM entry. The KASSERT below enforces this. */ KASSERT((ctrl->eventinj & VMCB_EVENTINJ_VALID) != 0 || (state->rflags & PSL_I) == 0 || ctrl->intr_shadow, ("Bogus intr_window_exiting: eventinj (%#lx), " "intr_shadow (%u), rflags (%#lx)", ctrl->eventinj, ctrl->intr_shadow, state->rflags)); enable_intr_window_exiting(sc, vcpu); } else { disable_intr_window_exiting(sc, vcpu); } } static __inline void restore_host_tss(void) { struct system_segment_descriptor *tss_sd; /* * The TSS descriptor was in use prior to launching the guest so it * has been marked busy. * * 'ltr' requires the descriptor to be marked available so change the * type to "64-bit available TSS". 
*/ tss_sd = PCPU_GET(tss); tss_sd->sd_type = SDT_SYSTSS; ltr(GSEL(GPROC0_SEL, SEL_KPL)); } static void check_asid(struct svm_softc *sc, int vcpuid, pmap_t pmap, u_int thiscpu) { struct svm_vcpu *vcpustate; struct vmcb_ctrl *ctrl; long eptgen; bool alloc_asid; KASSERT(CPU_ISSET(thiscpu, &pmap->pm_active), ("%s: nested pmap not " "active on cpu %u", __func__, thiscpu)); vcpustate = svm_get_vcpu(sc, vcpuid); ctrl = svm_get_vmcb_ctrl(sc, vcpuid); /* * The TLB entries associated with the vcpu's ASID are not valid * if either of the following conditions is true: * * 1. The vcpu's ASID generation is different than the host cpu's * ASID generation. This happens when the vcpu migrates to a new * host cpu. It can also happen when the number of vcpus executing * on a host cpu is greater than the number of ASIDs available. * * 2. The pmap generation number is different than the value cached in * the 'vcpustate'. This happens when the host invalidates pages * belonging to the guest. * * asidgen eptgen Action * mismatch mismatch * 0 0 (a) * 0 1 (b1) or (b2) * 1 0 (c) * 1 1 (d) * * (a) There is no mismatch in eptgen or ASID generation and therefore * no further action is needed. * * (b1) If the cpu supports FlushByAsid then the vcpu's ASID is * retained and the TLB entries associated with this ASID * are flushed by VMRUN. * * (b2) If the cpu does not support FlushByAsid then a new ASID is * allocated. * * (c) A new ASID is allocated. * * (d) A new ASID is allocated. */ alloc_asid = false; eptgen = pmap->pm_eptgen; ctrl->tlb_ctrl = VMCB_TLB_FLUSH_NOTHING; if (vcpustate->asid.gen != asid[thiscpu].gen) { alloc_asid = true; /* (c) and (d) */ } else if (vcpustate->eptgen != eptgen) { if (flush_by_asid()) ctrl->tlb_ctrl = VMCB_TLB_FLUSH_GUEST; /* (b1) */ else alloc_asid = true; /* (b2) */ } else { /* * This is the common case (a). */ KASSERT(!alloc_asid, ("ASID allocation not necessary")); KASSERT(ctrl->tlb_ctrl == VMCB_TLB_FLUSH_NOTHING, ("Invalid VMCB tlb_ctrl: %#x", ctrl->tlb_ctrl)); } if (alloc_asid) { if (++asid[thiscpu].num >= nasid) { asid[thiscpu].num = 1; if (++asid[thiscpu].gen == 0) asid[thiscpu].gen = 1; /* * If this cpu does not support "flush-by-asid" * then flush the entire TLB on a generation * bump. Subsequent ASID allocation in this * generation can be done without a TLB flush. */ if (!flush_by_asid()) ctrl->tlb_ctrl = VMCB_TLB_FLUSH_ALL; } vcpustate->asid.gen = asid[thiscpu].gen; vcpustate->asid.num = asid[thiscpu].num; ctrl->asid = vcpustate->asid.num; svm_set_dirty(sc, vcpuid, VMCB_CACHE_ASID); /* * If this cpu supports "flush-by-asid" then the TLB * was not flushed after the generation bump. The TLB * is flushed selectively after every new ASID allocation. */ if (flush_by_asid()) ctrl->tlb_ctrl = VMCB_TLB_FLUSH_GUEST; } vcpustate->eptgen = eptgen; KASSERT(ctrl->asid != 0, ("Guest ASID must be non-zero")); KASSERT(ctrl->asid == vcpustate->asid.num, ("ASID mismatch: %u/%u", ctrl->asid, vcpustate->asid.num)); } static __inline void disable_gintr(void) { __asm __volatile("clgi"); } static __inline void enable_gintr(void) { __asm __volatile("stgi"); } static __inline void svm_dr_enter_guest(struct svm_regctx *gctx) { /* Save host control debug registers. */ gctx->host_dr7 = rdr7(); gctx->host_debugctl = rdmsr(MSR_DEBUGCTLMSR); /* * Disable debugging in DR7 and DEBUGCTL to avoid triggering * exceptions in the host based on the guest DRx values. The * guest DR6, DR7, and DEBUGCTL are saved/restored in the * VMCB. */ load_dr7(0); wrmsr(MSR_DEBUGCTLMSR, 0); /* Save host debug registers. 
*/ gctx->host_dr0 = rdr0(); gctx->host_dr1 = rdr1(); gctx->host_dr2 = rdr2(); gctx->host_dr3 = rdr3(); gctx->host_dr6 = rdr6(); /* Restore guest debug registers. */ load_dr0(gctx->sctx_dr0); load_dr1(gctx->sctx_dr1); load_dr2(gctx->sctx_dr2); load_dr3(gctx->sctx_dr3); } static __inline void svm_dr_leave_guest(struct svm_regctx *gctx) { /* Save guest debug registers. */ gctx->sctx_dr0 = rdr0(); gctx->sctx_dr1 = rdr1(); gctx->sctx_dr2 = rdr2(); gctx->sctx_dr3 = rdr3(); /* * Restore host debug registers. Restore DR7 and DEBUGCTL * last. */ load_dr0(gctx->host_dr0); load_dr1(gctx->host_dr1); load_dr2(gctx->host_dr2); load_dr3(gctx->host_dr3); load_dr6(gctx->host_dr6); wrmsr(MSR_DEBUGCTLMSR, gctx->host_debugctl); load_dr7(gctx->host_dr7); } /* * Start vcpu with specified RIP. */ static int svm_vmrun(void *arg, int vcpu, register_t rip, pmap_t pmap, struct vm_eventinfo *evinfo) { struct svm_regctx *gctx; struct svm_softc *svm_sc; struct svm_vcpu *vcpustate; struct vmcb_state *state; struct vmcb_ctrl *ctrl; struct vm_exit *vmexit; struct vlapic *vlapic; struct vm *vm; uint64_t vmcb_pa; int handled; uint16_t ldt_sel; svm_sc = arg; vm = svm_sc->vm; vcpustate = svm_get_vcpu(svm_sc, vcpu); state = svm_get_vmcb_state(svm_sc, vcpu); ctrl = svm_get_vmcb_ctrl(svm_sc, vcpu); vmexit = vm_exitinfo(vm, vcpu); vlapic = vm_lapic(vm, vcpu); gctx = svm_get_guest_regctx(svm_sc, vcpu); vmcb_pa = svm_sc->vcpu[vcpu].vmcb_pa; if (vcpustate->lastcpu != curcpu) { /* * Force new ASID allocation by invalidating the generation. */ vcpustate->asid.gen = 0; /* * Invalidate the VMCB state cache by marking all fields dirty. */ svm_set_dirty(svm_sc, vcpu, 0xffffffff); /* * XXX * Setting 'vcpustate->lastcpu' here is bit premature because * we may return from this function without actually executing * the VMRUN instruction. This could happen if a rendezvous * or an AST is pending on the first time through the loop. * * This works for now but any new side-effects of vcpu * migration should take this case into account. */ vcpustate->lastcpu = curcpu; vmm_stat_incr(vm, vcpu, VCPU_MIGRATIONS, 1); } svm_msr_guest_enter(svm_sc, vcpu); /* Update Guest RIP */ state->rip = rip; do { /* * Disable global interrupts to guarantee atomicity during * loading of guest state. This includes not only the state * loaded by the "vmrun" instruction but also software state * maintained by the hypervisor: suspended and rendezvous * state, NPT generation number, vlapic interrupts etc. */ disable_gintr(); if (vcpu_suspended(evinfo)) { enable_gintr(); vm_exit_suspended(vm, vcpu, state->rip); break; } if (vcpu_rendezvous_pending(evinfo)) { enable_gintr(); vm_exit_rendezvous(vm, vcpu, state->rip); break; } if (vcpu_reqidle(evinfo)) { enable_gintr(); vm_exit_reqidle(vm, vcpu, state->rip); break; } /* We are asked to give the cpu by scheduler. */ if (vcpu_should_yield(vm, vcpu)) { enable_gintr(); vm_exit_astpending(vm, vcpu, state->rip); break; } if (vcpu_debugged(vm, vcpu)) { enable_gintr(); vm_exit_debug(vm, vcpu, state->rip); break; } /* * #VMEXIT resumes the host with the guest LDTR, so * save the current LDT selector so it can be restored * after an exit. The userspace hypervisor probably * doesn't use a LDT, but save and restore it to be * safe. */ ldt_sel = sldt(); svm_inj_interrupts(svm_sc, vcpu, vlapic); /* Activate the nested pmap on 'curcpu' */ CPU_SET_ATOMIC_ACQ(curcpu, &pmap->pm_active); /* * Check the pmap generation and the ASID generation to * ensure that the vcpu does not use stale TLB mappings. 
check_asid(svm_sc, vcpu, pmap, curcpu); ctrl->vmcb_clean = vmcb_clean & ~vcpustate->dirty; vcpustate->dirty = 0; VCPU_CTR1(vm, vcpu, "vmcb clean %#x", ctrl->vmcb_clean); /* Launch Virtual Machine. */ VCPU_CTR1(vm, vcpu, "Resume execution at %#lx", state->rip); svm_dr_enter_guest(gctx); svm_launch(vmcb_pa, gctx, &__pcpu[curcpu]); svm_dr_leave_guest(gctx); CPU_CLR_ATOMIC(curcpu, &pmap->pm_active); /* * The host GDTR and IDTR is saved by VMRUN and restored * automatically on #VMEXIT. However, the host TSS needs * to be restored explicitly. */ restore_host_tss(); /* Restore host LDTR. */ lldt(ldt_sel); /* #VMEXIT disables interrupts so re-enable them here. */ enable_gintr(); /* Update 'nextrip' */ vcpustate->nextrip = state->rip; /* Handle #VMEXIT and if required return to user space. */ handled = svm_vmexit(svm_sc, vcpu, vmexit); } while (handled); svm_msr_guest_exit(svm_sc, vcpu); return (0); } static void svm_vmcleanup(void *arg) { struct svm_softc *sc = arg; contigfree(sc->iopm_bitmap, SVM_IO_BITMAP_SIZE, M_SVM); contigfree(sc->msr_bitmap, SVM_MSR_BITMAP_SIZE, M_SVM); free(sc, M_SVM); } static register_t * swctx_regptr(struct svm_regctx *regctx, int reg) { switch (reg) { case VM_REG_GUEST_RBX: return (&regctx->sctx_rbx); case VM_REG_GUEST_RCX: return (&regctx->sctx_rcx); case VM_REG_GUEST_RDX: return (&regctx->sctx_rdx); case VM_REG_GUEST_RDI: return (&regctx->sctx_rdi); case VM_REG_GUEST_RSI: return (&regctx->sctx_rsi); case VM_REG_GUEST_RBP: return (&regctx->sctx_rbp); case VM_REG_GUEST_R8: return (&regctx->sctx_r8); case VM_REG_GUEST_R9: return (&regctx->sctx_r9); case VM_REG_GUEST_R10: return (&regctx->sctx_r10); case VM_REG_GUEST_R11: return (&regctx->sctx_r11); case VM_REG_GUEST_R12: return (&regctx->sctx_r12); case VM_REG_GUEST_R13: return (&regctx->sctx_r13); case VM_REG_GUEST_R14: return (&regctx->sctx_r14); case VM_REG_GUEST_R15: return (&regctx->sctx_r15); case VM_REG_GUEST_DR0: return (&regctx->sctx_dr0); case VM_REG_GUEST_DR1: return (&regctx->sctx_dr1); case VM_REG_GUEST_DR2: return (&regctx->sctx_dr2); case VM_REG_GUEST_DR3: return (&regctx->sctx_dr3); default: return (NULL); } } static int svm_getreg(void *arg, int vcpu, int ident, uint64_t *val) { struct svm_softc *svm_sc; register_t *reg; svm_sc = arg; if (ident == VM_REG_GUEST_INTR_SHADOW) { return (svm_get_intr_shadow(svm_sc, vcpu, val)); } if (vmcb_read(svm_sc, vcpu, ident, val) == 0) { return (0); } reg = swctx_regptr(svm_get_guest_regctx(svm_sc, vcpu), ident); if (reg != NULL) { *val = *reg; return (0); } VCPU_CTR1(svm_sc->vm, vcpu, "svm_getreg: unknown register %#x", ident); return (EINVAL); } static int svm_setreg(void *arg, int vcpu, int ident, uint64_t val) { struct svm_softc *svm_sc; register_t *reg; svm_sc = arg; if (ident == VM_REG_GUEST_INTR_SHADOW) { return (svm_modify_intr_shadow(svm_sc, vcpu, val)); } if (vmcb_write(svm_sc, vcpu, ident, val) == 0) { return (0); } reg = swctx_regptr(svm_get_guest_regctx(svm_sc, vcpu), ident); if (reg != NULL) { *reg = val; return (0); } /* * XXX deal with CR3 and invalidate TLB entries tagged with the * vcpu's ASID. This needs to be treated differently depending on * whether 'running' is true/false.
*/ VCPU_CTR1(svm_sc->vm, vcpu, "svm_setreg: unknown register %#x", ident); return (EINVAL); } static int svm_setcap(void *arg, int vcpu, int type, int val) { struct svm_softc *sc; int error; sc = arg; error = 0; switch (type) { case VM_CAP_HALT_EXIT: svm_set_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_HLT, val); break; case VM_CAP_PAUSE_EXIT: svm_set_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_PAUSE, val); break; case VM_CAP_UNRESTRICTED_GUEST: /* Unrestricted guest execution cannot be disabled in SVM */ if (val == 0) error = EINVAL; break; default: error = ENOENT; break; } return (error); } static int svm_getcap(void *arg, int vcpu, int type, int *retval) { struct svm_softc *sc; int error; sc = arg; error = 0; switch (type) { case VM_CAP_HALT_EXIT: *retval = svm_get_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_HLT); break; case VM_CAP_PAUSE_EXIT: *retval = svm_get_intercept(sc, vcpu, VMCB_CTRL1_INTCPT, VMCB_INTCPT_PAUSE); break; case VM_CAP_UNRESTRICTED_GUEST: *retval = 1; /* unrestricted guest is always enabled */ break; default: error = ENOENT; break; } return (error); } static struct vlapic * svm_vlapic_init(void *arg, int vcpuid) { struct svm_softc *svm_sc; struct vlapic *vlapic; svm_sc = arg; vlapic = malloc(sizeof(struct vlapic), M_SVM_VLAPIC, M_WAITOK | M_ZERO); vlapic->vm = svm_sc->vm; vlapic->vcpuid = vcpuid; vlapic->apic_page = (struct LAPIC *)&svm_sc->apic_page[vcpuid]; vlapic_init(vlapic); return (vlapic); } static void svm_vlapic_cleanup(void *arg, struct vlapic *vlapic) { vlapic_cleanup(vlapic); free(vlapic, M_SVM_VLAPIC); } struct vmm_ops vmm_ops_amd = { svm_init, svm_cleanup, svm_restore, svm_vminit, svm_vmrun, svm_vmcleanup, svm_getreg, svm_setreg, vmcb_getdesc, vmcb_setdesc, svm_getcap, svm_setcap, svm_npt_alloc, svm_npt_free, svm_vlapic_init, svm_vlapic_cleanup }; Index: user/ngie/bug-237403/sys/amd64/vmm/intel/vmx.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/intel/vmx.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/intel/vmx.c (revision 346926) @@ -1,3805 +1,3809 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * Copyright (c) 2018 Joyent, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vmm_lapic.h" #include "vmm_host.h" #include "vmm_ioport.h" #include "vmm_ktr.h" #include "vmm_stat.h" #include "vatpic.h" #include "vlapic.h" #include "vlapic_priv.h" #include "ept.h" #include "vmx_cpufunc.h" #include "vmx.h" #include "vmx_msr.h" #include "x86.h" #include "vmx_controls.h" #define PINBASED_CTLS_ONE_SETTING \ (PINBASED_EXTINT_EXITING | \ PINBASED_NMI_EXITING | \ PINBASED_VIRTUAL_NMI) #define PINBASED_CTLS_ZERO_SETTING 0 #define PROCBASED_CTLS_WINDOW_SETTING \ (PROCBASED_INT_WINDOW_EXITING | \ PROCBASED_NMI_WINDOW_EXITING) #define PROCBASED_CTLS_ONE_SETTING \ (PROCBASED_SECONDARY_CONTROLS | \ PROCBASED_MWAIT_EXITING | \ PROCBASED_MONITOR_EXITING | \ PROCBASED_IO_EXITING | \ PROCBASED_MSR_BITMAPS | \ PROCBASED_CTLS_WINDOW_SETTING | \ PROCBASED_CR8_LOAD_EXITING | \ PROCBASED_CR8_STORE_EXITING) #define PROCBASED_CTLS_ZERO_SETTING \ (PROCBASED_CR3_LOAD_EXITING | \ PROCBASED_CR3_STORE_EXITING | \ PROCBASED_IO_BITMAPS) #define PROCBASED_CTLS2_ONE_SETTING PROCBASED2_ENABLE_EPT #define PROCBASED_CTLS2_ZERO_SETTING 0 #define VM_EXIT_CTLS_ONE_SETTING \ (VM_EXIT_SAVE_DEBUG_CONTROLS | \ VM_EXIT_HOST_LMA | \ VM_EXIT_SAVE_EFER | \ VM_EXIT_LOAD_EFER | \ VM_EXIT_ACKNOWLEDGE_INTERRUPT) #define VM_EXIT_CTLS_ZERO_SETTING 0 #define VM_ENTRY_CTLS_ONE_SETTING \ (VM_ENTRY_LOAD_DEBUG_CONTROLS | \ VM_ENTRY_LOAD_EFER) #define VM_ENTRY_CTLS_ZERO_SETTING \ (VM_ENTRY_INTO_SMM | \ VM_ENTRY_DEACTIVATE_DUAL_MONITOR) #define HANDLED 1 #define UNHANDLED 0 static MALLOC_DEFINE(M_VMX, "vmx", "vmx"); static MALLOC_DEFINE(M_VLAPIC, "vlapic", "vlapic"); SYSCTL_DECL(_hw_vmm); SYSCTL_NODE(_hw_vmm, OID_AUTO, vmx, CTLFLAG_RW, NULL, NULL); int vmxon_enabled[MAXCPU]; static char vmxon_region[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE); static uint32_t pinbased_ctls, procbased_ctls, procbased_ctls2; static uint32_t exit_ctls, entry_ctls; static uint64_t cr0_ones_mask, cr0_zeros_mask; SYSCTL_ULONG(_hw_vmm_vmx, OID_AUTO, cr0_ones_mask, CTLFLAG_RD, &cr0_ones_mask, 0, NULL); SYSCTL_ULONG(_hw_vmm_vmx, OID_AUTO, cr0_zeros_mask, CTLFLAG_RD, &cr0_zeros_mask, 0, NULL); static uint64_t cr4_ones_mask, cr4_zeros_mask; SYSCTL_ULONG(_hw_vmm_vmx, OID_AUTO, cr4_ones_mask, CTLFLAG_RD, &cr4_ones_mask, 0, NULL); SYSCTL_ULONG(_hw_vmm_vmx, OID_AUTO, cr4_zeros_mask, CTLFLAG_RD, &cr4_zeros_mask, 0, NULL); static int vmx_initialized; SYSCTL_INT(_hw_vmm_vmx, OID_AUTO, initialized, CTLFLAG_RD, &vmx_initialized, 0, "Intel VMX initialized"); /* * Optional capabilities */ static SYSCTL_NODE(_hw_vmm_vmx, OID_AUTO, cap, CTLFLAG_RW, NULL, NULL); static int cap_halt_exit; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, halt_exit, CTLFLAG_RD, &cap_halt_exit, 0, "HLT triggers a VM-exit"); static int cap_pause_exit; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, pause_exit, CTLFLAG_RD, &cap_pause_exit, 0, "PAUSE triggers a VM-exit"); static int cap_unrestricted_guest; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, unrestricted_guest, CTLFLAG_RD, &cap_unrestricted_guest, 0, "Unrestricted guests"); static int cap_monitor_trap; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, monitor_trap, CTLFLAG_RD, &cap_monitor_trap, 0, "Monitor trap flag"); static int cap_invpcid; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, invpcid, CTLFLAG_RD, &cap_invpcid, 0, "Guests are allowed to use INVPCID"); static int virtual_interrupt_delivery; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, 
virtual_interrupt_delivery, CTLFLAG_RD, &virtual_interrupt_delivery, 0, "APICv virtual interrupt delivery support"); static int posted_interrupts; SYSCTL_INT(_hw_vmm_vmx_cap, OID_AUTO, posted_interrupts, CTLFLAG_RD, &posted_interrupts, 0, "APICv posted interrupt support"); static int pirvec = -1; SYSCTL_INT(_hw_vmm_vmx, OID_AUTO, posted_interrupt_vector, CTLFLAG_RD, &pirvec, 0, "APICv posted interrupt vector"); static struct unrhdr *vpid_unr; static u_int vpid_alloc_failed; SYSCTL_UINT(_hw_vmm_vmx, OID_AUTO, vpid_alloc_failed, CTLFLAG_RD, &vpid_alloc_failed, 0, NULL); static int guest_l1d_flush; SYSCTL_INT(_hw_vmm_vmx, OID_AUTO, l1d_flush, CTLFLAG_RD, &guest_l1d_flush, 0, NULL); static int guest_l1d_flush_sw; SYSCTL_INT(_hw_vmm_vmx, OID_AUTO, l1d_flush_sw, CTLFLAG_RD, &guest_l1d_flush_sw, 0, NULL); static struct msr_entry msr_load_list[1] __aligned(16); /* * The definitions of SDT probes for VMX. */ SDT_PROBE_DEFINE3(vmm, vmx, exit, entry, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE4(vmm, vmx, exit, taskswitch, "struct vmx *", "int", "struct vm_exit *", "struct vm_task_switch *"); SDT_PROBE_DEFINE4(vmm, vmx, exit, craccess, "struct vmx *", "int", "struct vm_exit *", "uint64_t"); SDT_PROBE_DEFINE4(vmm, vmx, exit, rdmsr, "struct vmx *", "int", "struct vm_exit *", "uint32_t"); SDT_PROBE_DEFINE5(vmm, vmx, exit, wrmsr, "struct vmx *", "int", "struct vm_exit *", "uint32_t", "uint64_t"); SDT_PROBE_DEFINE3(vmm, vmx, exit, halt, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, mtrap, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, pause, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, intrwindow, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE4(vmm, vmx, exit, interrupt, "struct vmx *", "int", "struct vm_exit *", "uint32_t"); SDT_PROBE_DEFINE3(vmm, vmx, exit, nmiwindow, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, inout, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, cpuid, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE5(vmm, vmx, exit, exception, "struct vmx *", "int", "struct vm_exit *", "uint32_t", "int"); SDT_PROBE_DEFINE5(vmm, vmx, exit, nestedfault, "struct vmx *", "int", "struct vm_exit *", "uint64_t", "uint64_t"); SDT_PROBE_DEFINE4(vmm, vmx, exit, mmiofault, "struct vmx *", "int", "struct vm_exit *", "uint64_t"); SDT_PROBE_DEFINE3(vmm, vmx, exit, eoi, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, apicaccess, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE4(vmm, vmx, exit, apicwrite, "struct vmx *", "int", "struct vm_exit *", "struct vlapic *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, xsetbv, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, monitor, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, mwait, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE3(vmm, vmx, exit, vminsn, "struct vmx *", "int", "struct vm_exit *"); SDT_PROBE_DEFINE4(vmm, vmx, exit, unknown, "struct vmx *", "int", "struct vm_exit *", "uint32_t"); SDT_PROBE_DEFINE4(vmm, vmx, exit, return, "struct vmx *", "int", "struct vm_exit *", "int"); /* * Use the last page below 4GB as the APIC access address. This address is * occupied by the boot firmware so it is guaranteed that it will not conflict * with a page in system memory. 
*/ #define APIC_ACCESS_ADDRESS 0xFFFFF000 static int vmx_getdesc(void *arg, int vcpu, int reg, struct seg_desc *desc); static int vmx_getreg(void *arg, int vcpu, int reg, uint64_t *retval); static int vmxctx_setreg(struct vmxctx *vmxctx, int reg, uint64_t val); static void vmx_inject_pir(struct vlapic *vlapic); #ifdef KTR static const char * exit_reason_to_str(int reason) { static char reasonbuf[32]; switch (reason) { case EXIT_REASON_EXCEPTION: return "exception"; case EXIT_REASON_EXT_INTR: return "extint"; case EXIT_REASON_TRIPLE_FAULT: return "triplefault"; case EXIT_REASON_INIT: return "init"; case EXIT_REASON_SIPI: return "sipi"; case EXIT_REASON_IO_SMI: return "iosmi"; case EXIT_REASON_SMI: return "smi"; case EXIT_REASON_INTR_WINDOW: return "intrwindow"; case EXIT_REASON_NMI_WINDOW: return "nmiwindow"; case EXIT_REASON_TASK_SWITCH: return "taskswitch"; case EXIT_REASON_CPUID: return "cpuid"; case EXIT_REASON_GETSEC: return "getsec"; case EXIT_REASON_HLT: return "hlt"; case EXIT_REASON_INVD: return "invd"; case EXIT_REASON_INVLPG: return "invlpg"; case EXIT_REASON_RDPMC: return "rdpmc"; case EXIT_REASON_RDTSC: return "rdtsc"; case EXIT_REASON_RSM: return "rsm"; case EXIT_REASON_VMCALL: return "vmcall"; case EXIT_REASON_VMCLEAR: return "vmclear"; case EXIT_REASON_VMLAUNCH: return "vmlaunch"; case EXIT_REASON_VMPTRLD: return "vmptrld"; case EXIT_REASON_VMPTRST: return "vmptrst"; case EXIT_REASON_VMREAD: return "vmread"; case EXIT_REASON_VMRESUME: return "vmresume"; case EXIT_REASON_VMWRITE: return "vmwrite"; case EXIT_REASON_VMXOFF: return "vmxoff"; case EXIT_REASON_VMXON: return "vmxon"; case EXIT_REASON_CR_ACCESS: return "craccess"; case EXIT_REASON_DR_ACCESS: return "draccess"; case EXIT_REASON_INOUT: return "inout"; case EXIT_REASON_RDMSR: return "rdmsr"; case EXIT_REASON_WRMSR: return "wrmsr"; case EXIT_REASON_INVAL_VMCS: return "invalvmcs"; case EXIT_REASON_INVAL_MSR: return "invalmsr"; case EXIT_REASON_MWAIT: return "mwait"; case EXIT_REASON_MTF: return "mtf"; case EXIT_REASON_MONITOR: return "monitor"; case EXIT_REASON_PAUSE: return "pause"; case EXIT_REASON_MCE_DURING_ENTRY: return "mce-during-entry"; case EXIT_REASON_TPR: return "tpr"; case EXIT_REASON_APIC_ACCESS: return "apic-access"; case EXIT_REASON_GDTR_IDTR: return "gdtridtr"; case EXIT_REASON_LDTR_TR: return "ldtrtr"; case EXIT_REASON_EPT_FAULT: return "eptfault"; case EXIT_REASON_EPT_MISCONFIG: return "eptmisconfig"; case EXIT_REASON_INVEPT: return "invept"; case EXIT_REASON_RDTSCP: return "rdtscp"; case EXIT_REASON_VMX_PREEMPT: return "vmxpreempt"; case EXIT_REASON_INVVPID: return "invvpid"; case EXIT_REASON_WBINVD: return "wbinvd"; case EXIT_REASON_XSETBV: return "xsetbv"; case EXIT_REASON_APIC_WRITE: return "apic-write"; default: snprintf(reasonbuf, sizeof(reasonbuf), "%d", reason); return (reasonbuf); } } #endif /* KTR */ static int vmx_allow_x2apic_msrs(struct vmx *vmx) { int i, error; error = 0; /* * Allow readonly access to the following x2APIC MSRs from the guest. 
*/ error += guest_msr_ro(vmx, MSR_APIC_ID); error += guest_msr_ro(vmx, MSR_APIC_VERSION); error += guest_msr_ro(vmx, MSR_APIC_LDR); error += guest_msr_ro(vmx, MSR_APIC_SVR); for (i = 0; i < 8; i++) error += guest_msr_ro(vmx, MSR_APIC_ISR0 + i); for (i = 0; i < 8; i++) error += guest_msr_ro(vmx, MSR_APIC_TMR0 + i); for (i = 0; i < 8; i++) error += guest_msr_ro(vmx, MSR_APIC_IRR0 + i); error += guest_msr_ro(vmx, MSR_APIC_ESR); error += guest_msr_ro(vmx, MSR_APIC_LVT_TIMER); error += guest_msr_ro(vmx, MSR_APIC_LVT_THERMAL); error += guest_msr_ro(vmx, MSR_APIC_LVT_PCINT); error += guest_msr_ro(vmx, MSR_APIC_LVT_LINT0); error += guest_msr_ro(vmx, MSR_APIC_LVT_LINT1); error += guest_msr_ro(vmx, MSR_APIC_LVT_ERROR); error += guest_msr_ro(vmx, MSR_APIC_ICR_TIMER); error += guest_msr_ro(vmx, MSR_APIC_DCR_TIMER); error += guest_msr_ro(vmx, MSR_APIC_ICR); /* * Allow TPR, EOI and SELF_IPI MSRs to be read and written by the guest. * * These registers get special treatment described in the section * "Virtualizing MSR-Based APIC Accesses". */ error += guest_msr_rw(vmx, MSR_APIC_TPR); error += guest_msr_rw(vmx, MSR_APIC_EOI); error += guest_msr_rw(vmx, MSR_APIC_SELF_IPI); return (error); } u_long vmx_fix_cr0(u_long cr0) { return ((cr0 | cr0_ones_mask) & ~cr0_zeros_mask); } u_long vmx_fix_cr4(u_long cr4) { return ((cr4 | cr4_ones_mask) & ~cr4_zeros_mask); } static void vpid_free(int vpid) { if (vpid < 0 || vpid > 0xffff) panic("vpid_free: invalid vpid %d", vpid); /* * VPIDs [0,VM_MAXCPU] are special and are not allocated from * the unit number allocator. */ if (vpid > VM_MAXCPU) free_unr(vpid_unr, vpid); } static void vpid_alloc(uint16_t *vpid, int num) { int i, x; if (num <= 0 || num > VM_MAXCPU) panic("invalid number of vpids requested: %d", num); /* * If the "enable vpid" execution control is not enabled then the * VPID is required to be 0 for all vcpus. */ if ((procbased_ctls2 & PROCBASED2_ENABLE_VPID) == 0) { for (i = 0; i < num; i++) vpid[i] = 0; return; } /* * Allocate a unique VPID for each vcpu from the unit number allocator. */ for (i = 0; i < num; i++) { x = alloc_unr(vpid_unr); if (x == -1) break; else vpid[i] = x; } if (i < num) { atomic_add_int(&vpid_alloc_failed, 1); /* * If the unit number allocator does not have enough unique * VPIDs then we need to allocate from the [1,VM_MAXCPU] range. * * These VPIDs are not be unique across VMs but this does not * affect correctness because the combined mappings are also * tagged with the EP4TA which is unique for each VM. * * It is still sub-optimal because the invvpid will invalidate * combined mappings for a particular VPID across all EP4TAs. */ while (i-- > 0) vpid_free(vpid[i]); for (i = 0; i < num; i++) vpid[i] = i + 1; } } static void vpid_init(void) { /* * VPID 0 is required when the "enable VPID" execution control is * disabled. * * VPIDs [1,VM_MAXCPU] are used as the "overflow namespace" when the * unit number allocator does not have sufficient unique VPIDs to * satisfy the allocation. * * The remaining VPIDs are managed by the unit number allocator. */ vpid_unr = new_unrhdr(VM_MAXCPU + 1, 0xffff, NULL); } static void vmx_disable(void *arg __unused) { struct invvpid_desc invvpid_desc = { 0 }; struct invept_desc invept_desc = { 0 }; if (vmxon_enabled[curcpu]) { /* * See sections 25.3.3.3 and 25.3.3.4 in Intel Vol 3b. * * VMXON or VMXOFF are not required to invalidate any TLB * caching structures. This prevents potential retention of * cached information in the TLB between distinct VMX episodes. 
*/ invvpid(INVVPID_TYPE_ALL_CONTEXTS, invvpid_desc); invept(INVEPT_TYPE_ALL_CONTEXTS, invept_desc); vmxoff(); } load_cr4(rcr4() & ~CR4_VMXE); } static int vmx_cleanup(void) { if (pirvec >= 0) lapic_ipi_free(pirvec); if (vpid_unr != NULL) { delete_unrhdr(vpid_unr); vpid_unr = NULL; } if (nmi_flush_l1d_sw == 1) nmi_flush_l1d_sw = 0; smp_rendezvous(NULL, vmx_disable, NULL, NULL); return (0); } static void vmx_enable(void *arg __unused) { int error; uint64_t feature_control; feature_control = rdmsr(MSR_IA32_FEATURE_CONTROL); if ((feature_control & IA32_FEATURE_CONTROL_LOCK) == 0 || (feature_control & IA32_FEATURE_CONTROL_VMX_EN) == 0) { wrmsr(MSR_IA32_FEATURE_CONTROL, feature_control | IA32_FEATURE_CONTROL_VMX_EN | IA32_FEATURE_CONTROL_LOCK); } load_cr4(rcr4() | CR4_VMXE); *(uint32_t *)vmxon_region[curcpu] = vmx_revision(); error = vmxon(vmxon_region[curcpu]); if (error == 0) vmxon_enabled[curcpu] = 1; } static void vmx_restore(void) { if (vmxon_enabled[curcpu]) vmxon(vmxon_region[curcpu]); } static int vmx_init(int ipinum) { int error, use_tpr_shadow; uint64_t basic, fixed0, fixed1, feature_control; uint32_t tmp, procbased2_vid_bits; /* CPUID.1:ECX[bit 5] must be 1 for processor to support VMX */ if (!(cpu_feature2 & CPUID2_VMX)) { printf("vmx_init: processor does not support VMX operation\n"); return (ENXIO); } /* * Verify that MSR_IA32_FEATURE_CONTROL lock and VMXON enable bits * are set (bits 0 and 2 respectively). */ feature_control = rdmsr(MSR_IA32_FEATURE_CONTROL); if ((feature_control & IA32_FEATURE_CONTROL_LOCK) == 1 && (feature_control & IA32_FEATURE_CONTROL_VMX_EN) == 0) { printf("vmx_init: VMX operation disabled by BIOS\n"); return (ENXIO); } /* * Verify capabilities MSR_VMX_BASIC: * - bit 54 indicates support for INS/OUTS decoding */ basic = rdmsr(MSR_VMX_BASIC); if ((basic & (1UL << 54)) == 0) { printf("vmx_init: processor does not support desired basic " "capabilities\n"); return (EINVAL); } /* Check support for primary processor-based VM-execution controls */ error = vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS, MSR_VMX_TRUE_PROCBASED_CTLS, PROCBASED_CTLS_ONE_SETTING, PROCBASED_CTLS_ZERO_SETTING, &procbased_ctls); if (error) { printf("vmx_init: processor does not support desired primary " "processor-based controls\n"); return (error); } /* Clear the processor-based ctl bits that are set on demand */ procbased_ctls &= ~PROCBASED_CTLS_WINDOW_SETTING; /* Check support for secondary processor-based VM-execution controls */ error = vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS2, MSR_VMX_PROCBASED_CTLS2, PROCBASED_CTLS2_ONE_SETTING, PROCBASED_CTLS2_ZERO_SETTING, &procbased_ctls2); if (error) { printf("vmx_init: processor does not support desired secondary " "processor-based controls\n"); return (error); } /* Check support for VPID */ error = vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS2, MSR_VMX_PROCBASED_CTLS2, PROCBASED2_ENABLE_VPID, 0, &tmp); if (error == 0) procbased_ctls2 |= PROCBASED2_ENABLE_VPID; /* Check support for pin-based VM-execution controls */ error = vmx_set_ctlreg(MSR_VMX_PINBASED_CTLS, MSR_VMX_TRUE_PINBASED_CTLS, PINBASED_CTLS_ONE_SETTING, PINBASED_CTLS_ZERO_SETTING, &pinbased_ctls); if (error) { printf("vmx_init: processor does not support desired " "pin-based controls\n"); return (error); } /* Check support for VM-exit controls */ error = vmx_set_ctlreg(MSR_VMX_EXIT_CTLS, MSR_VMX_TRUE_EXIT_CTLS, VM_EXIT_CTLS_ONE_SETTING, VM_EXIT_CTLS_ZERO_SETTING, &exit_ctls); if (error) { printf("vmx_init: processor does not support desired " "exit controls\n"); return (error); } /* Check support for 
VM-entry controls */ error = vmx_set_ctlreg(MSR_VMX_ENTRY_CTLS, MSR_VMX_TRUE_ENTRY_CTLS, VM_ENTRY_CTLS_ONE_SETTING, VM_ENTRY_CTLS_ZERO_SETTING, &entry_ctls); if (error) { printf("vmx_init: processor does not support desired " "entry controls\n"); return (error); } /* * Check support for optional features by testing them * as individual bits */ cap_halt_exit = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS, MSR_VMX_TRUE_PROCBASED_CTLS, PROCBASED_HLT_EXITING, 0, &tmp) == 0); cap_monitor_trap = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS, MSR_VMX_PROCBASED_CTLS, PROCBASED_MTF, 0, &tmp) == 0); cap_pause_exit = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS, MSR_VMX_TRUE_PROCBASED_CTLS, PROCBASED_PAUSE_EXITING, 0, &tmp) == 0); cap_unrestricted_guest = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS2, MSR_VMX_PROCBASED_CTLS2, PROCBASED2_UNRESTRICTED_GUEST, 0, &tmp) == 0); cap_invpcid = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS2, MSR_VMX_PROCBASED_CTLS2, PROCBASED2_ENABLE_INVPCID, 0, &tmp) == 0); /* * Check support for virtual interrupt delivery. */ procbased2_vid_bits = (PROCBASED2_VIRTUALIZE_APIC_ACCESSES | PROCBASED2_VIRTUALIZE_X2APIC_MODE | PROCBASED2_APIC_REGISTER_VIRTUALIZATION | PROCBASED2_VIRTUAL_INTERRUPT_DELIVERY); use_tpr_shadow = (vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS, MSR_VMX_TRUE_PROCBASED_CTLS, PROCBASED_USE_TPR_SHADOW, 0, &tmp) == 0); error = vmx_set_ctlreg(MSR_VMX_PROCBASED_CTLS2, MSR_VMX_PROCBASED_CTLS2, procbased2_vid_bits, 0, &tmp); if (error == 0 && use_tpr_shadow) { virtual_interrupt_delivery = 1; TUNABLE_INT_FETCH("hw.vmm.vmx.use_apic_vid", &virtual_interrupt_delivery); } if (virtual_interrupt_delivery) { procbased_ctls |= PROCBASED_USE_TPR_SHADOW; procbased_ctls2 |= procbased2_vid_bits; procbased_ctls2 &= ~PROCBASED2_VIRTUALIZE_X2APIC_MODE; /* * No need to emulate accesses to %CR8 if virtual * interrupt delivery is enabled. */ procbased_ctls &= ~PROCBASED_CR8_LOAD_EXITING; procbased_ctls &= ~PROCBASED_CR8_STORE_EXITING; /* * Check for Posted Interrupts only if Virtual Interrupt * Delivery is enabled. */ error = vmx_set_ctlreg(MSR_VMX_PINBASED_CTLS, MSR_VMX_TRUE_PINBASED_CTLS, PINBASED_POSTED_INTERRUPT, 0, &tmp); if (error == 0) { pirvec = lapic_ipi_alloc(pti ? &IDTVEC(justreturn1_pti) : &IDTVEC(justreturn)); if (pirvec < 0) { if (bootverbose) { printf("vmx_init: unable to allocate " "posted interrupt vector\n"); } } else { posted_interrupts = 1; TUNABLE_INT_FETCH("hw.vmm.vmx.use_apic_pir", &posted_interrupts); } } } if (posted_interrupts) pinbased_ctls |= PINBASED_POSTED_INTERRUPT; /* Initialize EPT */ error = ept_init(ipinum); if (error) { printf("vmx_init: ept initialization failed (%d)\n", error); return (error); } guest_l1d_flush = (cpu_ia32_arch_caps & IA32_ARCH_CAP_SKIP_L1DFL_VMENTRY) == 0; TUNABLE_INT_FETCH("hw.vmm.l1d_flush", &guest_l1d_flush); /* * L1D cache flush is enabled. Use IA32_FLUSH_CMD MSR when * available. Otherwise fall back to the software flush * method which loads enough data from the kernel text to * flush existing L1D content, both on VMX entry and on NMI * return. 
*/ if (guest_l1d_flush) { if ((cpu_stdext_feature3 & CPUID_STDEXT3_L1D_FLUSH) == 0) { guest_l1d_flush_sw = 1; TUNABLE_INT_FETCH("hw.vmm.l1d_flush_sw", &guest_l1d_flush_sw); } if (guest_l1d_flush_sw) { if (nmi_flush_l1d_sw <= 1) nmi_flush_l1d_sw = 1; } else { msr_load_list[0].index = MSR_IA32_FLUSH_CMD; msr_load_list[0].val = IA32_FLUSH_CMD_L1D; } } /* * Stash the cr0 and cr4 bits that must be fixed to 0 or 1 */ fixed0 = rdmsr(MSR_VMX_CR0_FIXED0); fixed1 = rdmsr(MSR_VMX_CR0_FIXED1); cr0_ones_mask = fixed0 & fixed1; cr0_zeros_mask = ~fixed0 & ~fixed1; /* * CR0_PE and CR0_PG can be set to zero in VMX non-root operation * if unrestricted guest execution is allowed. */ if (cap_unrestricted_guest) cr0_ones_mask &= ~(CR0_PG | CR0_PE); /* * Do not allow the guest to set CR0_NW or CR0_CD. */ cr0_zeros_mask |= (CR0_NW | CR0_CD); fixed0 = rdmsr(MSR_VMX_CR4_FIXED0); fixed1 = rdmsr(MSR_VMX_CR4_FIXED1); cr4_ones_mask = fixed0 & fixed1; cr4_zeros_mask = ~fixed0 & ~fixed1; vpid_init(); vmx_msr_init(); /* enable VMX operation */ smp_rendezvous(NULL, vmx_enable, NULL, NULL); vmx_initialized = 1; return (0); } static void vmx_trigger_hostintr(int vector) { uintptr_t func; struct gate_descriptor *gd; gd = &idt[vector]; KASSERT(vector >= 32 && vector <= 255, ("vmx_trigger_hostintr: " "invalid vector %d", vector)); KASSERT(gd->gd_p == 1, ("gate descriptor for vector %d not present", vector)); KASSERT(gd->gd_type == SDT_SYSIGT, ("gate descriptor for vector %d " "has invalid type %d", vector, gd->gd_type)); KASSERT(gd->gd_dpl == SEL_KPL, ("gate descriptor for vector %d " "has invalid dpl %d", vector, gd->gd_dpl)); KASSERT(gd->gd_selector == GSEL(GCODE_SEL, SEL_KPL), ("gate descriptor " "for vector %d has invalid selector %d", vector, gd->gd_selector)); KASSERT(gd->gd_ist == 0, ("gate descriptor for vector %d has invalid " "IST %d", vector, gd->gd_ist)); func = ((long)gd->gd_hioffset << 16 | gd->gd_looffset); vmx_call_isr(func); } static int vmx_setup_cr_shadow(int which, struct vmcs *vmcs, uint32_t initial) { int error, mask_ident, shadow_ident; uint64_t mask_value; if (which != 0 && which != 4) panic("vmx_setup_cr_shadow: unknown cr%d", which); if (which == 0) { mask_ident = VMCS_CR0_MASK; mask_value = cr0_ones_mask | cr0_zeros_mask; shadow_ident = VMCS_CR0_SHADOW; } else { mask_ident = VMCS_CR4_MASK; mask_value = cr4_ones_mask | cr4_zeros_mask; shadow_ident = VMCS_CR4_SHADOW; } error = vmcs_setreg(vmcs, 0, VMCS_IDENT(mask_ident), mask_value); if (error) return (error); error = vmcs_setreg(vmcs, 0, VMCS_IDENT(shadow_ident), initial); if (error) return (error); return (0); } #define vmx_setup_cr0_shadow(vmcs,init) vmx_setup_cr_shadow(0, (vmcs), (init)) #define vmx_setup_cr4_shadow(vmcs,init) vmx_setup_cr_shadow(4, (vmcs), (init)) static void * vmx_vminit(struct vm *vm, pmap_t pmap) { uint16_t vpid[VM_MAXCPU]; int i, error; struct vmx *vmx; struct vmcs *vmcs; uint32_t exc_bitmap; + uint16_t maxcpus; vmx = malloc(sizeof(struct vmx), M_VMX, M_WAITOK | M_ZERO); if ((uintptr_t)vmx & PAGE_MASK) { panic("malloc of struct vmx not aligned on %d byte boundary", PAGE_SIZE); } vmx->vm = vm; vmx->eptp = eptp(vtophys((vm_offset_t)pmap->pm_pml4)); /* * Clean up EPTP-tagged guest physical and combined mappings * * VMX transitions are not required to invalidate any guest physical * mappings. So, it may be possible for stale guest physical mappings * to be present in the processor TLBs. * * Combined mappings for this EP4TA are also invalidated for all VPIDs. 
*/ ept_invalidate_mappings(vmx->eptp); msr_bitmap_initialize(vmx->msr_bitmap); /* * It is safe to allow direct access to MSR_GSBASE and MSR_FSBASE. * The guest FSBASE and GSBASE are saved and restored during * vm-exit and vm-entry respectively. The host FSBASE and GSBASE are * always restored from the vmcs host state area on vm-exit. * * The SYSENTER_CS/ESP/EIP MSRs are identical to FS/GSBASE in * how they are saved/restored so can be directly accessed by the * guest. * * MSR_EFER is saved and restored in the guest VMCS area on a * VM exit and entry respectively. It is also restored from the * host VMCS area on a VM exit. * * The TSC MSR is exposed read-only. Writes are disallowed as * that will impact the host TSC. If the guest does a write * the "use TSC offsetting" execution control is enabled and the * difference between the host TSC and the guest TSC is written * into the TSC offset in the VMCS. */ if (guest_msr_rw(vmx, MSR_GSBASE) || guest_msr_rw(vmx, MSR_FSBASE) || guest_msr_rw(vmx, MSR_SYSENTER_CS_MSR) || guest_msr_rw(vmx, MSR_SYSENTER_ESP_MSR) || guest_msr_rw(vmx, MSR_SYSENTER_EIP_MSR) || guest_msr_rw(vmx, MSR_EFER) || guest_msr_ro(vmx, MSR_TSC)) panic("vmx_vminit: error setting guest msr access"); vpid_alloc(vpid, VM_MAXCPU); if (virtual_interrupt_delivery) { error = vm_map_mmio(vm, DEFAULT_APIC_BASE, PAGE_SIZE, APIC_ACCESS_ADDRESS); /* XXX this should really return an error to the caller */ KASSERT(error == 0, ("vm_map_mmio(apicbase) error %d", error)); } - for (i = 0; i < VM_MAXCPU; i++) { + maxcpus = vm_get_maxcpus(vm); + for (i = 0; i < maxcpus; i++) { vmcs = &vmx->vmcs[i]; vmcs->identifier = vmx_revision(); error = vmclear(vmcs); if (error != 0) { panic("vmx_vminit: vmclear error %d on vcpu %d\n", error, i); } vmx_msr_guest_init(vmx, i); error = vmcs_init(vmcs); KASSERT(error == 0, ("vmcs_init error %d", error)); VMPTRLD(vmcs); error = 0; error += vmwrite(VMCS_HOST_RSP, (u_long)&vmx->ctx[i]); error += vmwrite(VMCS_EPTP, vmx->eptp); error += vmwrite(VMCS_PIN_BASED_CTLS, pinbased_ctls); error += vmwrite(VMCS_PRI_PROC_BASED_CTLS, procbased_ctls); error += vmwrite(VMCS_SEC_PROC_BASED_CTLS, procbased_ctls2); error += vmwrite(VMCS_EXIT_CTLS, exit_ctls); error += vmwrite(VMCS_ENTRY_CTLS, entry_ctls); error += vmwrite(VMCS_MSR_BITMAP, vtophys(vmx->msr_bitmap)); error += vmwrite(VMCS_VPID, vpid[i]); if (guest_l1d_flush && !guest_l1d_flush_sw) { vmcs_write(VMCS_ENTRY_MSR_LOAD, pmap_kextract( (vm_offset_t)&msr_load_list[0])); vmcs_write(VMCS_ENTRY_MSR_LOAD_COUNT, nitems(msr_load_list)); vmcs_write(VMCS_EXIT_MSR_STORE, 0); vmcs_write(VMCS_EXIT_MSR_STORE_COUNT, 0); } /* exception bitmap */ if (vcpu_trace_exceptions(vm, i)) exc_bitmap = 0xffffffff; else exc_bitmap = 1 << IDT_MC; error += vmwrite(VMCS_EXCEPTION_BITMAP, exc_bitmap); vmx->ctx[i].guest_dr6 = DBREG_DR6_RESERVED1; error += vmwrite(VMCS_GUEST_DR7, DBREG_DR7_RESERVED1); if (virtual_interrupt_delivery) { error += vmwrite(VMCS_APIC_ACCESS, APIC_ACCESS_ADDRESS); error += vmwrite(VMCS_VIRTUAL_APIC, vtophys(&vmx->apic_page[i])); error += vmwrite(VMCS_EOI_EXIT0, 0); error += vmwrite(VMCS_EOI_EXIT1, 0); error += vmwrite(VMCS_EOI_EXIT2, 0); error += vmwrite(VMCS_EOI_EXIT3, 0); } if (posted_interrupts) { error += vmwrite(VMCS_PIR_VECTOR, pirvec); error += vmwrite(VMCS_PIR_DESC, vtophys(&vmx->pir_desc[i])); } VMCLEAR(vmcs); KASSERT(error == 0, ("vmx_vminit: error customizing the vmcs")); vmx->cap[i].set = 0; vmx->cap[i].proc_ctls = procbased_ctls; vmx->cap[i].proc_ctls2 = procbased_ctls2; vmx->state[i].nextrip = ~0; vmx->state[i].lastcpu = 
NOCPU; vmx->state[i].vpid = vpid[i]; /* * Set up the CR0/4 shadows, and init the read shadow * to the power-on register value from the Intel Sys Arch. * CR0 - 0x60000010 * CR4 - 0 */ error = vmx_setup_cr0_shadow(vmcs, 0x60000010); if (error != 0) panic("vmx_setup_cr0_shadow %d", error); error = vmx_setup_cr4_shadow(vmcs, 0); if (error != 0) panic("vmx_setup_cr4_shadow %d", error); vmx->ctx[i].pmap = pmap; } return (vmx); } static int vmx_handle_cpuid(struct vm *vm, int vcpu, struct vmxctx *vmxctx) { int handled, func; func = vmxctx->guest_rax; handled = x86_emulate_cpuid(vm, vcpu, (uint32_t*)(&vmxctx->guest_rax), (uint32_t*)(&vmxctx->guest_rbx), (uint32_t*)(&vmxctx->guest_rcx), (uint32_t*)(&vmxctx->guest_rdx)); return (handled); } static __inline void vmx_run_trace(struct vmx *vmx, int vcpu) { #ifdef KTR VCPU_CTR1(vmx->vm, vcpu, "Resume execution at %#lx", vmcs_guest_rip()); #endif } static __inline void vmx_exit_trace(struct vmx *vmx, int vcpu, uint64_t rip, uint32_t exit_reason, int handled) { #ifdef KTR VCPU_CTR3(vmx->vm, vcpu, "%s %s vmexit at 0x%0lx", handled ? "handled" : "unhandled", exit_reason_to_str(exit_reason), rip); #endif } static __inline void vmx_astpending_trace(struct vmx *vmx, int vcpu, uint64_t rip) { #ifdef KTR VCPU_CTR1(vmx->vm, vcpu, "astpending vmexit at 0x%0lx", rip); #endif } static VMM_STAT_INTEL(VCPU_INVVPID_SAVED, "Number of vpid invalidations saved"); static VMM_STAT_INTEL(VCPU_INVVPID_DONE, "Number of vpid invalidations done"); /* * Invalidate guest mappings identified by its vpid from the TLB. */ static __inline void vmx_invvpid(struct vmx *vmx, int vcpu, pmap_t pmap, int running) { struct vmxstate *vmxstate; struct invvpid_desc invvpid_desc; vmxstate = &vmx->state[vcpu]; if (vmxstate->vpid == 0) return; if (!running) { /* * Set the 'lastcpu' to an invalid host cpu. * * This will invalidate TLB entries tagged with the vcpu's * vpid the next time it runs via vmx_set_pcpu_defaults(). */ vmxstate->lastcpu = NOCPU; return; } KASSERT(curthread->td_critnest > 0, ("%s: vcpu %d running outside " "critical section", __func__, vcpu)); /* * Invalidate all mappings tagged with 'vpid' * * We do this because this vcpu was executing on a different host * cpu when it last ran. We do not track whether it invalidated * mappings associated with its 'vpid' during that run. So we must * assume that the mappings associated with 'vpid' on 'curcpu' are * stale and invalidate them. * * Note that we incur this penalty only when the scheduler chooses to * move the thread associated with this vcpu between host cpus. * * Note also that this will invalidate mappings tagged with 'vpid' * for "all" EP4TAs. */ if (pmap->pm_eptgen == vmx->eptgen[curcpu]) { invvpid_desc._res1 = 0; invvpid_desc._res2 = 0; invvpid_desc.vpid = vmxstate->vpid; invvpid_desc.linear_addr = 0; invvpid(INVVPID_TYPE_SINGLE_CONTEXT, invvpid_desc); vmm_stat_incr(vmx->vm, vcpu, VCPU_INVVPID_DONE, 1); } else { /* * The invvpid can be skipped if an invept is going to * be performed before entering the guest. The invept * will invalidate combined mappings tagged with * 'vmx->eptp' for all vpids. 
*/ vmm_stat_incr(vmx->vm, vcpu, VCPU_INVVPID_SAVED, 1); } } static void vmx_set_pcpu_defaults(struct vmx *vmx, int vcpu, pmap_t pmap) { struct vmxstate *vmxstate; vmxstate = &vmx->state[vcpu]; if (vmxstate->lastcpu == curcpu) return; vmxstate->lastcpu = curcpu; vmm_stat_incr(vmx->vm, vcpu, VCPU_MIGRATIONS, 1); vmcs_write(VMCS_HOST_TR_BASE, vmm_get_host_trbase()); vmcs_write(VMCS_HOST_GDTR_BASE, vmm_get_host_gdtrbase()); vmcs_write(VMCS_HOST_GS_BASE, vmm_get_host_gsbase()); vmx_invvpid(vmx, vcpu, pmap, 1); } /* * We depend on 'procbased_ctls' to have the Interrupt Window Exiting bit set. */ CTASSERT((PROCBASED_CTLS_ONE_SETTING & PROCBASED_INT_WINDOW_EXITING) != 0); static void __inline vmx_set_int_window_exiting(struct vmx *vmx, int vcpu) { if ((vmx->cap[vcpu].proc_ctls & PROCBASED_INT_WINDOW_EXITING) == 0) { vmx->cap[vcpu].proc_ctls |= PROCBASED_INT_WINDOW_EXITING; vmcs_write(VMCS_PRI_PROC_BASED_CTLS, vmx->cap[vcpu].proc_ctls); VCPU_CTR0(vmx->vm, vcpu, "Enabling interrupt window exiting"); } } static void __inline vmx_clear_int_window_exiting(struct vmx *vmx, int vcpu) { KASSERT((vmx->cap[vcpu].proc_ctls & PROCBASED_INT_WINDOW_EXITING) != 0, ("intr_window_exiting not set: %#x", vmx->cap[vcpu].proc_ctls)); vmx->cap[vcpu].proc_ctls &= ~PROCBASED_INT_WINDOW_EXITING; vmcs_write(VMCS_PRI_PROC_BASED_CTLS, vmx->cap[vcpu].proc_ctls); VCPU_CTR0(vmx->vm, vcpu, "Disabling interrupt window exiting"); } static void __inline vmx_set_nmi_window_exiting(struct vmx *vmx, int vcpu) { if ((vmx->cap[vcpu].proc_ctls & PROCBASED_NMI_WINDOW_EXITING) == 0) { vmx->cap[vcpu].proc_ctls |= PROCBASED_NMI_WINDOW_EXITING; vmcs_write(VMCS_PRI_PROC_BASED_CTLS, vmx->cap[vcpu].proc_ctls); VCPU_CTR0(vmx->vm, vcpu, "Enabling NMI window exiting"); } } static void __inline vmx_clear_nmi_window_exiting(struct vmx *vmx, int vcpu) { KASSERT((vmx->cap[vcpu].proc_ctls & PROCBASED_NMI_WINDOW_EXITING) != 0, ("nmi_window_exiting not set %#x", vmx->cap[vcpu].proc_ctls)); vmx->cap[vcpu].proc_ctls &= ~PROCBASED_NMI_WINDOW_EXITING; vmcs_write(VMCS_PRI_PROC_BASED_CTLS, vmx->cap[vcpu].proc_ctls); VCPU_CTR0(vmx->vm, vcpu, "Disabling NMI window exiting"); } int vmx_set_tsc_offset(struct vmx *vmx, int vcpu, uint64_t offset) { int error; if ((vmx->cap[vcpu].proc_ctls & PROCBASED_TSC_OFFSET) == 0) { vmx->cap[vcpu].proc_ctls |= PROCBASED_TSC_OFFSET; vmcs_write(VMCS_PRI_PROC_BASED_CTLS, vmx->cap[vcpu].proc_ctls); VCPU_CTR0(vmx->vm, vcpu, "Enabling TSC offsetting"); } error = vmwrite(VMCS_TSC_OFFSET, offset); return (error); } #define NMI_BLOCKING (VMCS_INTERRUPTIBILITY_NMI_BLOCKING | \ VMCS_INTERRUPTIBILITY_MOVSS_BLOCKING) #define HWINTR_BLOCKING (VMCS_INTERRUPTIBILITY_STI_BLOCKING | \ VMCS_INTERRUPTIBILITY_MOVSS_BLOCKING) static void vmx_inject_nmi(struct vmx *vmx, int vcpu) { uint32_t gi, info; gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); KASSERT((gi & NMI_BLOCKING) == 0, ("vmx_inject_nmi: invalid guest " "interruptibility-state %#x", gi)); info = vmcs_read(VMCS_ENTRY_INTR_INFO); KASSERT((info & VMCS_INTR_VALID) == 0, ("vmx_inject_nmi: invalid " "VM-entry interruption information %#x", info)); /* * Inject the virtual NMI. The vector must be the NMI IDT entry * or the VMCS entry check will fail. 
*/ info = IDT_NMI | VMCS_INTR_T_NMI | VMCS_INTR_VALID; vmcs_write(VMCS_ENTRY_INTR_INFO, info); VCPU_CTR0(vmx->vm, vcpu, "Injecting vNMI"); /* Clear the request */ vm_nmi_clear(vmx->vm, vcpu); } static void vmx_inject_interrupts(struct vmx *vmx, int vcpu, struct vlapic *vlapic, uint64_t guestrip) { int vector, need_nmi_exiting, extint_pending; uint64_t rflags, entryinfo; uint32_t gi, info; if (vmx->state[vcpu].nextrip != guestrip) { gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); if (gi & HWINTR_BLOCKING) { VCPU_CTR2(vmx->vm, vcpu, "Guest interrupt blocking " "cleared due to rip change: %#lx/%#lx", vmx->state[vcpu].nextrip, guestrip); gi &= ~HWINTR_BLOCKING; vmcs_write(VMCS_GUEST_INTERRUPTIBILITY, gi); } } if (vm_entry_intinfo(vmx->vm, vcpu, &entryinfo)) { KASSERT((entryinfo & VMCS_INTR_VALID) != 0, ("%s: entry " "intinfo is not valid: %#lx", __func__, entryinfo)); info = vmcs_read(VMCS_ENTRY_INTR_INFO); KASSERT((info & VMCS_INTR_VALID) == 0, ("%s: cannot inject " "pending exception: %#lx/%#x", __func__, entryinfo, info)); info = entryinfo; vector = info & 0xff; if (vector == IDT_BP || vector == IDT_OF) { /* * VT-x requires #BP and #OF to be injected as software * exceptions. */ info &= ~VMCS_INTR_T_MASK; info |= VMCS_INTR_T_SWEXCEPTION; } if (info & VMCS_INTR_DEL_ERRCODE) vmcs_write(VMCS_ENTRY_EXCEPTION_ERROR, entryinfo >> 32); vmcs_write(VMCS_ENTRY_INTR_INFO, info); } if (vm_nmi_pending(vmx->vm, vcpu)) { /* * If there are no conditions blocking NMI injection then * inject it directly here otherwise enable "NMI window * exiting" to inject it as soon as we can. * * We also check for STI_BLOCKING because some implementations * don't allow NMI injection in this case. If we are running * on a processor that doesn't have this restriction it will * immediately exit and the NMI will be injected in the * "NMI window exiting" handler. */ need_nmi_exiting = 1; gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); if ((gi & (HWINTR_BLOCKING | NMI_BLOCKING)) == 0) { info = vmcs_read(VMCS_ENTRY_INTR_INFO); if ((info & VMCS_INTR_VALID) == 0) { vmx_inject_nmi(vmx, vcpu); need_nmi_exiting = 0; } else { VCPU_CTR1(vmx->vm, vcpu, "Cannot inject NMI " "due to VM-entry intr info %#x", info); } } else { VCPU_CTR1(vmx->vm, vcpu, "Cannot inject NMI due to " "Guest Interruptibility-state %#x", gi); } if (need_nmi_exiting) vmx_set_nmi_window_exiting(vmx, vcpu); } extint_pending = vm_extint_pending(vmx->vm, vcpu); if (!extint_pending && virtual_interrupt_delivery) { vmx_inject_pir(vlapic); return; } /* * If interrupt-window exiting is already in effect then don't bother * checking for pending interrupts. This is just an optimization and * not needed for correctness. */ if ((vmx->cap[vcpu].proc_ctls & PROCBASED_INT_WINDOW_EXITING) != 0) { VCPU_CTR0(vmx->vm, vcpu, "Skip interrupt injection due to " "pending int_window_exiting"); return; } if (!extint_pending) { /* Ask the local apic for a vector to inject */ if (!vlapic_pending_intr(vlapic, &vector)) return; /* * From the Intel SDM, Volume 3, Section "Maskable * Hardware Interrupts": * - maskable interrupt vectors [16,255] can be delivered * through the local APIC. */ KASSERT(vector >= 16 && vector <= 255, ("invalid vector %d from local APIC", vector)); } else { /* Ask the legacy pic for a vector to inject */ vatpic_pending_intr(vmx->vm, &vector); /* * From the Intel SDM, Volume 3, Section "Maskable * Hardware Interrupts": * - maskable interrupt vectors [0,255] can be delivered * through the INTR pin. 
*/ KASSERT(vector >= 0 && vector <= 255, ("invalid vector %d from INTR", vector)); } /* Check RFLAGS.IF and the interruptibility state of the guest */ rflags = vmcs_read(VMCS_GUEST_RFLAGS); if ((rflags & PSL_I) == 0) { VCPU_CTR2(vmx->vm, vcpu, "Cannot inject vector %d due to " "rflags %#lx", vector, rflags); goto cantinject; } gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); if (gi & HWINTR_BLOCKING) { VCPU_CTR2(vmx->vm, vcpu, "Cannot inject vector %d due to " "Guest Interruptibility-state %#x", vector, gi); goto cantinject; } info = vmcs_read(VMCS_ENTRY_INTR_INFO); if (info & VMCS_INTR_VALID) { /* * This is expected and could happen for multiple reasons: * - A vectoring VM-entry was aborted due to astpending * - A VM-exit happened during event injection. * - An exception was injected above. * - An NMI was injected above or after "NMI window exiting" */ VCPU_CTR2(vmx->vm, vcpu, "Cannot inject vector %d due to " "VM-entry intr info %#x", vector, info); goto cantinject; } /* Inject the interrupt */ info = VMCS_INTR_T_HWINTR | VMCS_INTR_VALID; info |= vector; vmcs_write(VMCS_ENTRY_INTR_INFO, info); if (!extint_pending) { /* Update the Local APIC ISR */ vlapic_intr_accepted(vlapic, vector); } else { vm_extint_clear(vmx->vm, vcpu); vatpic_intr_accepted(vmx->vm, vector); /* * After we accepted the current ExtINT the PIC may * have posted another one. If that is the case, set * the Interrupt Window Exiting execution control so * we can inject that one too. * * Also, interrupt window exiting allows us to inject any * pending APIC vector that was preempted by the ExtINT * as soon as possible. This applies both for the software * emulated vlapic and the hardware assisted virtual APIC. */ vmx_set_int_window_exiting(vmx, vcpu); } VCPU_CTR1(vmx->vm, vcpu, "Injecting hwintr at vector %d", vector); return; cantinject: /* * Set the Interrupt Window Exiting execution control so we can inject * the interrupt as soon as blocking condition goes away. */ vmx_set_int_window_exiting(vmx, vcpu); } /* * If the Virtual NMIs execution control is '1' then the logical processor * tracks virtual-NMI blocking in the Guest Interruptibility-state field of * the VMCS. An IRET instruction in VMX non-root operation will remove any * virtual-NMI blocking. * * This unblocking occurs even if the IRET causes a fault. In this case the * hypervisor needs to restore virtual-NMI blocking before resuming the guest. */ static void vmx_restore_nmi_blocking(struct vmx *vmx, int vcpuid) { uint32_t gi; VCPU_CTR0(vmx->vm, vcpuid, "Restore Virtual-NMI blocking"); gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); gi |= VMCS_INTERRUPTIBILITY_NMI_BLOCKING; vmcs_write(VMCS_GUEST_INTERRUPTIBILITY, gi); } static void vmx_clear_nmi_blocking(struct vmx *vmx, int vcpuid) { uint32_t gi; VCPU_CTR0(vmx->vm, vcpuid, "Clear Virtual-NMI blocking"); gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); gi &= ~VMCS_INTERRUPTIBILITY_NMI_BLOCKING; vmcs_write(VMCS_GUEST_INTERRUPTIBILITY, gi); } static void vmx_assert_nmi_blocking(struct vmx *vmx, int vcpuid) { uint32_t gi; gi = vmcs_read(VMCS_GUEST_INTERRUPTIBILITY); KASSERT(gi & VMCS_INTERRUPTIBILITY_NMI_BLOCKING, ("NMI blocking is not in effect %#x", gi)); } static int vmx_emulate_xsetbv(struct vmx *vmx, int vcpu, struct vm_exit *vmexit) { struct vmxctx *vmxctx; uint64_t xcrval; const struct xsave_limits *limits; vmxctx = &vmx->ctx[vcpu]; limits = vmm_get_xsave_limits(); /* * Note that the processor raises a GP# fault on its own if * xsetbv is executed for CPL != 0, so we do not have to * emulate that fault here. 
*/ /* Only xcr0 is supported. */ if (vmxctx->guest_rcx != 0) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } /* We only handle xcr0 if both the host and guest have XSAVE enabled. */ if (!limits->xsave_enabled || !(vmcs_read(VMCS_GUEST_CR4) & CR4_XSAVE)) { vm_inject_ud(vmx->vm, vcpu); return (HANDLED); } xcrval = vmxctx->guest_rdx << 32 | (vmxctx->guest_rax & 0xffffffff); if ((xcrval & ~limits->xcr0_allowed) != 0) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } if (!(xcrval & XFEATURE_ENABLED_X87)) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } /* AVX (YMM_Hi128) requires SSE. */ if (xcrval & XFEATURE_ENABLED_AVX && (xcrval & XFEATURE_AVX) != XFEATURE_AVX) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } /* * AVX512 requires base AVX (YMM_Hi128) as well as OpMask, * ZMM_Hi256, and Hi16_ZMM. */ if (xcrval & XFEATURE_AVX512 && (xcrval & (XFEATURE_AVX512 | XFEATURE_AVX)) != (XFEATURE_AVX512 | XFEATURE_AVX)) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } /* * Intel MPX requires both bound register state flags to be * set. */ if (((xcrval & XFEATURE_ENABLED_BNDREGS) != 0) != ((xcrval & XFEATURE_ENABLED_BNDCSR) != 0)) { vm_inject_gp(vmx->vm, vcpu); return (HANDLED); } /* * This runs "inside" vmrun() with the guest's FPU state, so * modifying xcr0 directly modifies the guest's xcr0, not the * host's. */ load_xcr(0, xcrval); return (HANDLED); } static uint64_t vmx_get_guest_reg(struct vmx *vmx, int vcpu, int ident) { const struct vmxctx *vmxctx; vmxctx = &vmx->ctx[vcpu]; switch (ident) { case 0: return (vmxctx->guest_rax); case 1: return (vmxctx->guest_rcx); case 2: return (vmxctx->guest_rdx); case 3: return (vmxctx->guest_rbx); case 4: return (vmcs_read(VMCS_GUEST_RSP)); case 5: return (vmxctx->guest_rbp); case 6: return (vmxctx->guest_rsi); case 7: return (vmxctx->guest_rdi); case 8: return (vmxctx->guest_r8); case 9: return (vmxctx->guest_r9); case 10: return (vmxctx->guest_r10); case 11: return (vmxctx->guest_r11); case 12: return (vmxctx->guest_r12); case 13: return (vmxctx->guest_r13); case 14: return (vmxctx->guest_r14); case 15: return (vmxctx->guest_r15); default: panic("invalid vmx register %d", ident); } } static void vmx_set_guest_reg(struct vmx *vmx, int vcpu, int ident, uint64_t regval) { struct vmxctx *vmxctx; vmxctx = &vmx->ctx[vcpu]; switch (ident) { case 0: vmxctx->guest_rax = regval; break; case 1: vmxctx->guest_rcx = regval; break; case 2: vmxctx->guest_rdx = regval; break; case 3: vmxctx->guest_rbx = regval; break; case 4: vmcs_write(VMCS_GUEST_RSP, regval); break; case 5: vmxctx->guest_rbp = regval; break; case 6: vmxctx->guest_rsi = regval; break; case 7: vmxctx->guest_rdi = regval; break; case 8: vmxctx->guest_r8 = regval; break; case 9: vmxctx->guest_r9 = regval; break; case 10: vmxctx->guest_r10 = regval; break; case 11: vmxctx->guest_r11 = regval; break; case 12: vmxctx->guest_r12 = regval; break; case 13: vmxctx->guest_r13 = regval; break; case 14: vmxctx->guest_r14 = regval; break; case 15: vmxctx->guest_r15 = regval; break; default: panic("invalid vmx register %d", ident); } } static int vmx_emulate_cr0_access(struct vmx *vmx, int vcpu, uint64_t exitqual) { uint64_t crval, regval; /* We only handle mov to %cr0 at this time */ if ((exitqual & 0xf0) != 0x00) return (UNHANDLED); regval = vmx_get_guest_reg(vmx, vcpu, (exitqual >> 8) & 0xf); vmcs_write(VMCS_CR0_SHADOW, regval); crval = regval | cr0_ones_mask; crval &= ~cr0_zeros_mask; vmcs_write(VMCS_GUEST_CR0, crval); if (regval & CR0_PG) { uint64_t efer, entry_ctls; /* * If CR0.PG is 1 and EFER.LME is 1 
then EFER.LMA and
		 * the "IA-32e mode guest" bit in VM-entry control must be
		 * equal.
		 */
		efer = vmcs_read(VMCS_GUEST_IA32_EFER);
		if (efer & EFER_LME) {
			efer |= EFER_LMA;
			vmcs_write(VMCS_GUEST_IA32_EFER, efer);
			entry_ctls = vmcs_read(VMCS_ENTRY_CTLS);
			entry_ctls |= VM_ENTRY_GUEST_LMA;
			vmcs_write(VMCS_ENTRY_CTLS, entry_ctls);
		}
	}

	return (HANDLED);
}

static int
vmx_emulate_cr4_access(struct vmx *vmx, int vcpu, uint64_t exitqual)
{
	uint64_t crval, regval;

	/* We only handle mov to %cr4 at this time */
	if ((exitqual & 0xf0) != 0x00)
		return (UNHANDLED);

	regval = vmx_get_guest_reg(vmx, vcpu, (exitqual >> 8) & 0xf);

	vmcs_write(VMCS_CR4_SHADOW, regval);

	crval = regval | cr4_ones_mask;
	crval &= ~cr4_zeros_mask;
	vmcs_write(VMCS_GUEST_CR4, crval);

	return (HANDLED);
}

static int
vmx_emulate_cr8_access(struct vmx *vmx, int vcpu, uint64_t exitqual)
{
	struct vlapic *vlapic;
	uint64_t cr8;
	int regnum;

	/* We only handle mov %cr8 to/from a register at this time. */
	if ((exitqual & 0xe0) != 0x00) {
		return (UNHANDLED);
	}

	vlapic = vm_lapic(vmx->vm, vcpu);
	regnum = (exitqual >> 8) & 0xf;
	if (exitqual & 0x10) {
		cr8 = vlapic_get_cr8(vlapic);
		vmx_set_guest_reg(vmx, vcpu, regnum, cr8);
	} else {
		cr8 = vmx_get_guest_reg(vmx, vcpu, regnum);
		vlapic_set_cr8(vlapic, cr8);
	}

	return (HANDLED);
}

/*
 * From section "Guest Register State" in the Intel SDM: CPL = SS.DPL
 */
static int
vmx_cpl(void)
{
	uint32_t ssar;

	ssar = vmcs_read(VMCS_GUEST_SS_ACCESS_RIGHTS);
	return ((ssar >> 5) & 0x3);
}

static enum vm_cpu_mode
vmx_cpu_mode(void)
{
	uint32_t csar;

	if (vmcs_read(VMCS_GUEST_IA32_EFER) & EFER_LMA) {
		csar = vmcs_read(VMCS_GUEST_CS_ACCESS_RIGHTS);
		if (csar & 0x2000)
			return (CPU_MODE_64BIT);	/* CS.L = 1 */
		else
			return (CPU_MODE_COMPATIBILITY);
	} else if (vmcs_read(VMCS_GUEST_CR0) & CR0_PE) {
		return (CPU_MODE_PROTECTED);
	} else {
		return (CPU_MODE_REAL);
	}
}

static enum vm_paging_mode
vmx_paging_mode(void)
{
	if (!(vmcs_read(VMCS_GUEST_CR0) & CR0_PG))
		return (PAGING_MODE_FLAT);
	if (!(vmcs_read(VMCS_GUEST_CR4) & CR4_PAE))
		return (PAGING_MODE_32);
	if (vmcs_read(VMCS_GUEST_IA32_EFER) & EFER_LME)
		return (PAGING_MODE_64);
	else
		return (PAGING_MODE_PAE);
}

static uint64_t
inout_str_index(struct vmx *vmx, int vcpuid, int in)
{
	uint64_t val;
	int error;
	enum vm_reg_name reg;

	reg = in ? VM_REG_GUEST_RDI : VM_REG_GUEST_RSI;
	error = vmx_getreg(vmx, vcpuid, reg, &val);
	KASSERT(error == 0, ("%s: vmx_getreg error %d", __func__, error));
	return (val);
}

static uint64_t
inout_str_count(struct vmx *vmx, int vcpuid, int rep)
{
	uint64_t val;
	int error;

	if (rep) {
		error = vmx_getreg(vmx, vcpuid, VM_REG_GUEST_RCX, &val);
		KASSERT(!error, ("%s: vmx_getreg error %d", __func__, error));
	} else {
		val = 1;
	}
	return (val);
}

static int
inout_str_addrsize(uint32_t inst_info)
{
	uint32_t size;

	size = (inst_info >> 7) & 0x7;
	switch (size) {
	case 0:
		return (2);	/* 16 bit */
	case 1:
		return (4);	/* 32 bit */
	case 2:
		return (8);	/* 64 bit */
	default:
		panic("%s: invalid size encoding %d", __func__, size);
	}
}

static void
inout_str_seginfo(struct vmx *vmx, int vcpuid, uint32_t inst_info, int in,
    struct vm_inout_str *vis)
{
	int error, s;

	if (in) {
		vis->seg_name = VM_REG_GUEST_ES;
	} else {
		s = (inst_info >> 15) & 0x7;
		vis->seg_name = vm_segment_name(s);
	}

	error = vmx_getdesc(vmx, vcpuid, vis->seg_name, &vis->seg_desc);
	KASSERT(error == 0, ("%s: vmx_getdesc error %d", __func__, error));
}

static void
vmx_paging_info(struct vm_guest_paging *paging)
{
	paging->cr3 = vmcs_guest_cr3();
	paging->cpl = vmx_cpl();
	paging->cpu_mode = vmx_cpu_mode();
	paging->paging_mode = vmx_paging_mode();
}

static void
vmexit_inst_emul(struct vm_exit *vmexit, uint64_t gpa, uint64_t gla)
{
	struct vm_guest_paging *paging;
	uint32_t csar;

	paging = &vmexit->u.inst_emul.paging;

	vmexit->exitcode = VM_EXITCODE_INST_EMUL;
	vmexit->inst_length = 0;
	vmexit->u.inst_emul.gpa = gpa;
	vmexit->u.inst_emul.gla = gla;
	vmx_paging_info(paging);
	switch (paging->cpu_mode) {
	case CPU_MODE_REAL:
		vmexit->u.inst_emul.cs_base = vmcs_read(VMCS_GUEST_CS_BASE);
		vmexit->u.inst_emul.cs_d = 0;
		break;
	case CPU_MODE_PROTECTED:
	case CPU_MODE_COMPATIBILITY:
		vmexit->u.inst_emul.cs_base = vmcs_read(VMCS_GUEST_CS_BASE);
		csar = vmcs_read(VMCS_GUEST_CS_ACCESS_RIGHTS);
		vmexit->u.inst_emul.cs_d = SEG_DESC_DEF32(csar);
		break;
	default:
		vmexit->u.inst_emul.cs_base = 0;
		vmexit->u.inst_emul.cs_d = 0;
		break;
	}
	vie_init(&vmexit->u.inst_emul.vie, NULL, 0);
}

static int
ept_fault_type(uint64_t ept_qual)
{
	int fault_type;

	if (ept_qual & EPT_VIOLATION_DATA_WRITE)
		fault_type = VM_PROT_WRITE;
	else if (ept_qual & EPT_VIOLATION_INST_FETCH)
		fault_type = VM_PROT_EXECUTE;
	else
		fault_type = VM_PROT_READ;

	return (fault_type);
}

static boolean_t
ept_emulation_fault(uint64_t ept_qual)
{
	int read, write;

	/* EPT fault on an instruction fetch doesn't make sense here */
	if (ept_qual & EPT_VIOLATION_INST_FETCH)
		return (FALSE);

	/* EPT fault must be a read fault or a write fault */
	read = ept_qual & EPT_VIOLATION_DATA_READ ? 1 : 0;
	write = ept_qual & EPT_VIOLATION_DATA_WRITE ? 1 : 0;
	if ((read | write) == 0)
		return (FALSE);

	/*
	 * The EPT violation must have been caused by accessing a
	 * guest-physical address that is a translation of a guest-linear
	 * address.
	 */
	if ((ept_qual & EPT_VIOLATION_GLA_VALID) == 0 ||
	    (ept_qual & EPT_VIOLATION_XLAT_VALID) == 0) {
		return (FALSE);
	}

	return (TRUE);
}

static __inline int
apic_access_virtualization(struct vmx *vmx, int vcpuid)
{
	uint32_t proc_ctls2;

	proc_ctls2 = vmx->cap[vcpuid].proc_ctls2;
	return ((proc_ctls2 & PROCBASED2_VIRTUALIZE_APIC_ACCESSES) ? 1 : 0);
}

static __inline int
x2apic_virtualization(struct vmx *vmx, int vcpuid)
{
	uint32_t proc_ctls2;

	proc_ctls2 = vmx->cap[vcpuid].proc_ctls2;
	return ((proc_ctls2 & PROCBASED2_VIRTUALIZE_X2APIC_MODE) ?
1 : 0); } static int vmx_handle_apic_write(struct vmx *vmx, int vcpuid, struct vlapic *vlapic, uint64_t qual) { int error, handled, offset; uint32_t *apic_regs, vector; bool retu; handled = HANDLED; offset = APIC_WRITE_OFFSET(qual); if (!apic_access_virtualization(vmx, vcpuid)) { /* * In general there should not be any APIC write VM-exits * unless APIC-access virtualization is enabled. * * However self-IPI virtualization can legitimately trigger * an APIC-write VM-exit so treat it specially. */ if (x2apic_virtualization(vmx, vcpuid) && offset == APIC_OFFSET_SELF_IPI) { apic_regs = (uint32_t *)(vlapic->apic_page); vector = apic_regs[APIC_OFFSET_SELF_IPI / 4]; vlapic_self_ipi_handler(vlapic, vector); return (HANDLED); } else return (UNHANDLED); } switch (offset) { case APIC_OFFSET_ID: vlapic_id_write_handler(vlapic); break; case APIC_OFFSET_LDR: vlapic_ldr_write_handler(vlapic); break; case APIC_OFFSET_DFR: vlapic_dfr_write_handler(vlapic); break; case APIC_OFFSET_SVR: vlapic_svr_write_handler(vlapic); break; case APIC_OFFSET_ESR: vlapic_esr_write_handler(vlapic); break; case APIC_OFFSET_ICR_LOW: retu = false; error = vlapic_icrlo_write_handler(vlapic, &retu); if (error != 0 || retu) handled = UNHANDLED; break; case APIC_OFFSET_CMCI_LVT: case APIC_OFFSET_TIMER_LVT ... APIC_OFFSET_ERROR_LVT: vlapic_lvt_write_handler(vlapic, offset); break; case APIC_OFFSET_TIMER_ICR: vlapic_icrtmr_write_handler(vlapic); break; case APIC_OFFSET_TIMER_DCR: vlapic_dcr_write_handler(vlapic); break; default: handled = UNHANDLED; break; } return (handled); } static bool apic_access_fault(struct vmx *vmx, int vcpuid, uint64_t gpa) { if (apic_access_virtualization(vmx, vcpuid) && (gpa >= DEFAULT_APIC_BASE && gpa < DEFAULT_APIC_BASE + PAGE_SIZE)) return (true); else return (false); } static int vmx_handle_apic_access(struct vmx *vmx, int vcpuid, struct vm_exit *vmexit) { uint64_t qual; int access_type, offset, allowed; if (!apic_access_virtualization(vmx, vcpuid)) return (UNHANDLED); qual = vmexit->u.vmx.exit_qualification; access_type = APIC_ACCESS_TYPE(qual); offset = APIC_ACCESS_OFFSET(qual); allowed = 0; if (access_type == 0) { /* * Read data access to the following registers is expected. */ switch (offset) { case APIC_OFFSET_APR: case APIC_OFFSET_PPR: case APIC_OFFSET_RRR: case APIC_OFFSET_CMCI_LVT: case APIC_OFFSET_TIMER_CCR: allowed = 1; break; default: break; } } else if (access_type == 1) { /* * Write data access to the following registers is expected. */ switch (offset) { case APIC_OFFSET_VER: case APIC_OFFSET_APR: case APIC_OFFSET_PPR: case APIC_OFFSET_RRR: case APIC_OFFSET_ISR0 ... APIC_OFFSET_ISR7: case APIC_OFFSET_TMR0 ... APIC_OFFSET_TMR7: case APIC_OFFSET_IRR0 ... APIC_OFFSET_IRR7: case APIC_OFFSET_CMCI_LVT: case APIC_OFFSET_TIMER_CCR: allowed = 1; break; default: break; } } if (allowed) { vmexit_inst_emul(vmexit, DEFAULT_APIC_BASE + offset, VIE_INVALID_GLA); } /* * Regardless of whether the APIC-access is allowed this handler * always returns UNHANDLED: * - if the access is allowed then it is handled by emulating the * instruction that caused the VM-exit (outside the critical section) * - if the access is not allowed then it will be converted to an * exitcode of VM_EXITCODE_VMX and will be dealt with in userland. 
*/ return (UNHANDLED); } static enum task_switch_reason vmx_task_switch_reason(uint64_t qual) { int reason; reason = (qual >> 30) & 0x3; switch (reason) { case 0: return (TSR_CALL); case 1: return (TSR_IRET); case 2: return (TSR_JMP); case 3: return (TSR_IDT_GATE); default: panic("%s: invalid reason %d", __func__, reason); } } static int emulate_wrmsr(struct vmx *vmx, int vcpuid, u_int num, uint64_t val, bool *retu) { int error; if (lapic_msr(num)) error = lapic_wrmsr(vmx->vm, vcpuid, num, val, retu); else error = vmx_wrmsr(vmx, vcpuid, num, val, retu); return (error); } static int emulate_rdmsr(struct vmx *vmx, int vcpuid, u_int num, bool *retu) { struct vmxctx *vmxctx; uint64_t result; uint32_t eax, edx; int error; if (lapic_msr(num)) error = lapic_rdmsr(vmx->vm, vcpuid, num, &result, retu); else error = vmx_rdmsr(vmx, vcpuid, num, &result, retu); if (error == 0) { eax = result; vmxctx = &vmx->ctx[vcpuid]; error = vmxctx_setreg(vmxctx, VM_REG_GUEST_RAX, eax); KASSERT(error == 0, ("vmxctx_setreg(rax) error %d", error)); edx = result >> 32; error = vmxctx_setreg(vmxctx, VM_REG_GUEST_RDX, edx); KASSERT(error == 0, ("vmxctx_setreg(rdx) error %d", error)); } return (error); } static int vmx_exit_process(struct vmx *vmx, int vcpu, struct vm_exit *vmexit) { int error, errcode, errcode_valid, handled, in; struct vmxctx *vmxctx; struct vlapic *vlapic; struct vm_inout_str *vis; struct vm_task_switch *ts; uint32_t eax, ecx, edx, idtvec_info, idtvec_err, intr_info, inst_info; uint32_t intr_type, intr_vec, reason; uint64_t exitintinfo, qual, gpa; bool retu; CTASSERT((PINBASED_CTLS_ONE_SETTING & PINBASED_VIRTUAL_NMI) != 0); CTASSERT((PINBASED_CTLS_ONE_SETTING & PINBASED_NMI_EXITING) != 0); handled = UNHANDLED; vmxctx = &vmx->ctx[vcpu]; qual = vmexit->u.vmx.exit_qualification; reason = vmexit->u.vmx.exit_reason; vmexit->exitcode = VM_EXITCODE_BOGUS; vmm_stat_incr(vmx->vm, vcpu, VMEXIT_COUNT, 1); SDT_PROBE3(vmm, vmx, exit, entry, vmx, vcpu, vmexit); /* * VM-entry failures during or after loading guest state. * * These VM-exits are uncommon but must be handled specially * as most VM-exit fields are not populated as usual. */ if (__predict_false(reason == EXIT_REASON_MCE_DURING_ENTRY)) { VCPU_CTR0(vmx->vm, vcpu, "Handling MCE during VM-entry"); __asm __volatile("int $18"); return (1); } /* * VM exits that can be triggered during event delivery need to * be handled specially by re-injecting the event if the IDT * vectoring information field's valid bit is set. * * See "Information for VM Exits During Event Delivery" in Intel SDM * for details. */ idtvec_info = vmcs_idt_vectoring_info(); if (idtvec_info & VMCS_IDT_VEC_VALID) { idtvec_info &= ~(1 << 12); /* clear undefined bit */ exitintinfo = idtvec_info; if (idtvec_info & VMCS_IDT_VEC_ERRCODE_VALID) { idtvec_err = vmcs_idt_vectoring_err(); exitintinfo |= (uint64_t)idtvec_err << 32; } error = vm_exit_intinfo(vmx->vm, vcpu, exitintinfo); KASSERT(error == 0, ("%s: vm_set_intinfo error %d", __func__, error)); /* * If 'virtual NMIs' are being used and the VM-exit * happened while injecting an NMI during the previous * VM-entry, then clear "blocking by NMI" in the * Guest Interruptibility-State so the NMI can be * reinjected on the subsequent VM-entry. * * However, if the NMI was being delivered through a task * gate, then the new task must start execution with NMIs * blocked so don't clear NMI blocking in this case. 
*/ intr_type = idtvec_info & VMCS_INTR_T_MASK; if (intr_type == VMCS_INTR_T_NMI) { if (reason != EXIT_REASON_TASK_SWITCH) vmx_clear_nmi_blocking(vmx, vcpu); else vmx_assert_nmi_blocking(vmx, vcpu); } /* * Update VM-entry instruction length if the event being * delivered was a software interrupt or software exception. */ if (intr_type == VMCS_INTR_T_SWINTR || intr_type == VMCS_INTR_T_PRIV_SWEXCEPTION || intr_type == VMCS_INTR_T_SWEXCEPTION) { vmcs_write(VMCS_ENTRY_INST_LENGTH, vmexit->inst_length); } } switch (reason) { case EXIT_REASON_TASK_SWITCH: ts = &vmexit->u.task_switch; ts->tsssel = qual & 0xffff; ts->reason = vmx_task_switch_reason(qual); ts->ext = 0; ts->errcode_valid = 0; vmx_paging_info(&ts->paging); /* * If the task switch was due to a CALL, JMP, IRET, software * interrupt (INT n) or software exception (INT3, INTO), * then the saved %rip references the instruction that caused * the task switch. The instruction length field in the VMCS * is valid in this case. * * In all other cases (e.g., NMI, hardware exception) the * saved %rip is one that would have been saved in the old TSS * had the task switch completed normally so the instruction * length field is not needed in this case and is explicitly * set to 0. */ if (ts->reason == TSR_IDT_GATE) { KASSERT(idtvec_info & VMCS_IDT_VEC_VALID, ("invalid idtvec_info %#x for IDT task switch", idtvec_info)); intr_type = idtvec_info & VMCS_INTR_T_MASK; if (intr_type != VMCS_INTR_T_SWINTR && intr_type != VMCS_INTR_T_SWEXCEPTION && intr_type != VMCS_INTR_T_PRIV_SWEXCEPTION) { /* Task switch triggered by external event */ ts->ext = 1; vmexit->inst_length = 0; if (idtvec_info & VMCS_IDT_VEC_ERRCODE_VALID) { ts->errcode_valid = 1; ts->errcode = vmcs_idt_vectoring_err(); } } } vmexit->exitcode = VM_EXITCODE_TASK_SWITCH; SDT_PROBE4(vmm, vmx, exit, taskswitch, vmx, vcpu, vmexit, ts); VCPU_CTR4(vmx->vm, vcpu, "task switch reason %d, tss 0x%04x, " "%s errcode 0x%016lx", ts->reason, ts->tsssel, ts->ext ? 
"external" : "internal", ((uint64_t)ts->errcode << 32) | ts->errcode_valid); break; case EXIT_REASON_CR_ACCESS: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_CR_ACCESS, 1); SDT_PROBE4(vmm, vmx, exit, craccess, vmx, vcpu, vmexit, qual); switch (qual & 0xf) { case 0: handled = vmx_emulate_cr0_access(vmx, vcpu, qual); break; case 4: handled = vmx_emulate_cr4_access(vmx, vcpu, qual); break; case 8: handled = vmx_emulate_cr8_access(vmx, vcpu, qual); break; } break; case EXIT_REASON_RDMSR: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_RDMSR, 1); retu = false; ecx = vmxctx->guest_rcx; VCPU_CTR1(vmx->vm, vcpu, "rdmsr 0x%08x", ecx); SDT_PROBE4(vmm, vmx, exit, rdmsr, vmx, vcpu, vmexit, ecx); error = emulate_rdmsr(vmx, vcpu, ecx, &retu); if (error) { vmexit->exitcode = VM_EXITCODE_RDMSR; vmexit->u.msr.code = ecx; } else if (!retu) { handled = HANDLED; } else { /* Return to userspace with a valid exitcode */ KASSERT(vmexit->exitcode != VM_EXITCODE_BOGUS, ("emulate_rdmsr retu with bogus exitcode")); } break; case EXIT_REASON_WRMSR: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_WRMSR, 1); retu = false; eax = vmxctx->guest_rax; ecx = vmxctx->guest_rcx; edx = vmxctx->guest_rdx; VCPU_CTR2(vmx->vm, vcpu, "wrmsr 0x%08x value 0x%016lx", ecx, (uint64_t)edx << 32 | eax); SDT_PROBE5(vmm, vmx, exit, wrmsr, vmx, vmexit, vcpu, ecx, (uint64_t)edx << 32 | eax); error = emulate_wrmsr(vmx, vcpu, ecx, (uint64_t)edx << 32 | eax, &retu); if (error) { vmexit->exitcode = VM_EXITCODE_WRMSR; vmexit->u.msr.code = ecx; vmexit->u.msr.wval = (uint64_t)edx << 32 | eax; } else if (!retu) { handled = HANDLED; } else { /* Return to userspace with a valid exitcode */ KASSERT(vmexit->exitcode != VM_EXITCODE_BOGUS, ("emulate_wrmsr retu with bogus exitcode")); } break; case EXIT_REASON_HLT: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_HLT, 1); SDT_PROBE3(vmm, vmx, exit, halt, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_HLT; vmexit->u.hlt.rflags = vmcs_read(VMCS_GUEST_RFLAGS); if (virtual_interrupt_delivery) vmexit->u.hlt.intr_status = vmcs_read(VMCS_GUEST_INTR_STATUS); else vmexit->u.hlt.intr_status = 0; break; case EXIT_REASON_MTF: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_MTRAP, 1); SDT_PROBE3(vmm, vmx, exit, mtrap, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_MTRAP; vmexit->inst_length = 0; break; case EXIT_REASON_PAUSE: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_PAUSE, 1); SDT_PROBE3(vmm, vmx, exit, pause, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_PAUSE; break; case EXIT_REASON_INTR_WINDOW: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_INTR_WINDOW, 1); SDT_PROBE3(vmm, vmx, exit, intrwindow, vmx, vcpu, vmexit); vmx_clear_int_window_exiting(vmx, vcpu); return (1); case EXIT_REASON_EXT_INTR: /* * External interrupts serve only to cause VM exits and allow * the host interrupt handler to run. * * If this external interrupt triggers a virtual interrupt * to a VM, then that state will be recorded by the * host interrupt handler in the VM's softc. We will inject * this virtual interrupt during the subsequent VM enter. */ intr_info = vmcs_read(VMCS_EXIT_INTR_INFO); SDT_PROBE4(vmm, vmx, exit, interrupt, vmx, vcpu, vmexit, intr_info); /* * XXX: Ignore this exit if VMCS_INTR_VALID is not set. * This appears to be a bug in VMware Fusion? */ if (!(intr_info & VMCS_INTR_VALID)) return (1); KASSERT((intr_info & VMCS_INTR_VALID) != 0 && (intr_info & VMCS_INTR_T_MASK) == VMCS_INTR_T_HWINTR, ("VM exit interruption info invalid: %#x", intr_info)); vmx_trigger_hostintr(intr_info & 0xff); /* * This is special. 
We want to treat this as an 'handled' * VM-exit but not increment the instruction pointer. */ vmm_stat_incr(vmx->vm, vcpu, VMEXIT_EXTINT, 1); return (1); case EXIT_REASON_NMI_WINDOW: SDT_PROBE3(vmm, vmx, exit, nmiwindow, vmx, vcpu, vmexit); /* Exit to allow the pending virtual NMI to be injected */ if (vm_nmi_pending(vmx->vm, vcpu)) vmx_inject_nmi(vmx, vcpu); vmx_clear_nmi_window_exiting(vmx, vcpu); vmm_stat_incr(vmx->vm, vcpu, VMEXIT_NMI_WINDOW, 1); return (1); case EXIT_REASON_INOUT: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_INOUT, 1); vmexit->exitcode = VM_EXITCODE_INOUT; vmexit->u.inout.bytes = (qual & 0x7) + 1; vmexit->u.inout.in = in = (qual & 0x8) ? 1 : 0; vmexit->u.inout.string = (qual & 0x10) ? 1 : 0; vmexit->u.inout.rep = (qual & 0x20) ? 1 : 0; vmexit->u.inout.port = (uint16_t)(qual >> 16); vmexit->u.inout.eax = (uint32_t)(vmxctx->guest_rax); if (vmexit->u.inout.string) { inst_info = vmcs_read(VMCS_EXIT_INSTRUCTION_INFO); vmexit->exitcode = VM_EXITCODE_INOUT_STR; vis = &vmexit->u.inout_str; vmx_paging_info(&vis->paging); vis->rflags = vmcs_read(VMCS_GUEST_RFLAGS); vis->cr0 = vmcs_read(VMCS_GUEST_CR0); vis->index = inout_str_index(vmx, vcpu, in); vis->count = inout_str_count(vmx, vcpu, vis->inout.rep); vis->addrsize = inout_str_addrsize(inst_info); inout_str_seginfo(vmx, vcpu, inst_info, in, vis); } SDT_PROBE3(vmm, vmx, exit, inout, vmx, vcpu, vmexit); break; case EXIT_REASON_CPUID: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_CPUID, 1); SDT_PROBE3(vmm, vmx, exit, cpuid, vmx, vcpu, vmexit); handled = vmx_handle_cpuid(vmx->vm, vcpu, vmxctx); break; case EXIT_REASON_EXCEPTION: vmm_stat_incr(vmx->vm, vcpu, VMEXIT_EXCEPTION, 1); intr_info = vmcs_read(VMCS_EXIT_INTR_INFO); KASSERT((intr_info & VMCS_INTR_VALID) != 0, ("VM exit interruption info invalid: %#x", intr_info)); intr_vec = intr_info & 0xff; intr_type = intr_info & VMCS_INTR_T_MASK; /* * If Virtual NMIs control is 1 and the VM-exit is due to a * fault encountered during the execution of IRET then we must * restore the state of "virtual-NMI blocking" before resuming * the guest. * * See "Resuming Guest Software after Handling an Exception". * See "Information for VM Exits Due to Vectored Events". */ if ((idtvec_info & VMCS_IDT_VEC_VALID) == 0 && (intr_vec != IDT_DF) && (intr_info & EXIT_QUAL_NMIUDTI) != 0) vmx_restore_nmi_blocking(vmx, vcpu); /* * The NMI has already been handled in vmx_exit_handle_nmi(). */ if (intr_type == VMCS_INTR_T_NMI) return (1); /* * Call the machine check handler by hand. Also don't reflect * the machine check back into the guest. */ if (intr_vec == IDT_MC) { VCPU_CTR0(vmx->vm, vcpu, "Vectoring to MCE handler"); __asm __volatile("int $18"); return (1); } if (intr_vec == IDT_PF) { error = vmxctx_setreg(vmxctx, VM_REG_GUEST_CR2, qual); KASSERT(error == 0, ("%s: vmxctx_setreg(cr2) error %d", __func__, error)); } /* * Software exceptions exhibit trap-like behavior. This in * turn requires populating the VM-entry instruction length * so that the %rip in the trap frame is past the INT3/INTO * instruction. 
*/ if (intr_type == VMCS_INTR_T_SWEXCEPTION) vmcs_write(VMCS_ENTRY_INST_LENGTH, vmexit->inst_length); /* Reflect all other exceptions back into the guest */ errcode_valid = errcode = 0; if (intr_info & VMCS_INTR_DEL_ERRCODE) { errcode_valid = 1; errcode = vmcs_read(VMCS_EXIT_INTR_ERRCODE); } VCPU_CTR2(vmx->vm, vcpu, "Reflecting exception %d/%#x into " "the guest", intr_vec, errcode); SDT_PROBE5(vmm, vmx, exit, exception, vmx, vcpu, vmexit, intr_vec, errcode); error = vm_inject_exception(vmx->vm, vcpu, intr_vec, errcode_valid, errcode, 0); KASSERT(error == 0, ("%s: vm_inject_exception error %d", __func__, error)); return (1); case EXIT_REASON_EPT_FAULT: /* * If 'gpa' lies within the address space allocated to * memory then this must be a nested page fault otherwise * this must be an instruction that accesses MMIO space. */ gpa = vmcs_gpa(); if (vm_mem_allocated(vmx->vm, vcpu, gpa) || apic_access_fault(vmx, vcpu, gpa)) { vmexit->exitcode = VM_EXITCODE_PAGING; vmexit->inst_length = 0; vmexit->u.paging.gpa = gpa; vmexit->u.paging.fault_type = ept_fault_type(qual); vmm_stat_incr(vmx->vm, vcpu, VMEXIT_NESTED_FAULT, 1); SDT_PROBE5(vmm, vmx, exit, nestedfault, vmx, vcpu, vmexit, gpa, qual); } else if (ept_emulation_fault(qual)) { vmexit_inst_emul(vmexit, gpa, vmcs_gla()); vmm_stat_incr(vmx->vm, vcpu, VMEXIT_INST_EMUL, 1); SDT_PROBE4(vmm, vmx, exit, mmiofault, vmx, vcpu, vmexit, gpa); } /* * If Virtual NMIs control is 1 and the VM-exit is due to an * EPT fault during the execution of IRET then we must restore * the state of "virtual-NMI blocking" before resuming. * * See description of "NMI unblocking due to IRET" in * "Exit Qualification for EPT Violations". */ if ((idtvec_info & VMCS_IDT_VEC_VALID) == 0 && (qual & EXIT_QUAL_NMIUDTI) != 0) vmx_restore_nmi_blocking(vmx, vcpu); break; case EXIT_REASON_VIRTUALIZED_EOI: vmexit->exitcode = VM_EXITCODE_IOAPIC_EOI; vmexit->u.ioapic_eoi.vector = qual & 0xFF; SDT_PROBE3(vmm, vmx, exit, eoi, vmx, vcpu, vmexit); vmexit->inst_length = 0; /* trap-like */ break; case EXIT_REASON_APIC_ACCESS: SDT_PROBE3(vmm, vmx, exit, apicaccess, vmx, vcpu, vmexit); handled = vmx_handle_apic_access(vmx, vcpu, vmexit); break; case EXIT_REASON_APIC_WRITE: /* * APIC-write VM exit is trap-like so the %rip is already * pointing to the next instruction. */ vmexit->inst_length = 0; vlapic = vm_lapic(vmx->vm, vcpu); SDT_PROBE4(vmm, vmx, exit, apicwrite, vmx, vcpu, vmexit, vlapic); handled = vmx_handle_apic_write(vmx, vcpu, vlapic, qual); break; case EXIT_REASON_XSETBV: SDT_PROBE3(vmm, vmx, exit, xsetbv, vmx, vcpu, vmexit); handled = vmx_emulate_xsetbv(vmx, vcpu, vmexit); break; case EXIT_REASON_MONITOR: SDT_PROBE3(vmm, vmx, exit, monitor, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_MONITOR; break; case EXIT_REASON_MWAIT: SDT_PROBE3(vmm, vmx, exit, mwait, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_MWAIT; break; case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR: case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD: case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD: case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE: case EXIT_REASON_VMXOFF: case EXIT_REASON_VMXON: SDT_PROBE3(vmm, vmx, exit, vminsn, vmx, vcpu, vmexit); vmexit->exitcode = VM_EXITCODE_VMINSN; break; default: SDT_PROBE4(vmm, vmx, exit, unknown, vmx, vcpu, vmexit, reason); vmm_stat_incr(vmx->vm, vcpu, VMEXIT_UNKNOWN, 1); break; } if (handled) { /* * It is possible that control is returned to userland * even though we were able to handle the VM exit in the * kernel. 
* * In such a case we want to make sure that the userland * restarts guest execution at the instruction *after* * the one we just processed. Therefore we update the * guest rip in the VMCS and in 'vmexit'. */ vmexit->rip += vmexit->inst_length; vmexit->inst_length = 0; vmcs_write(VMCS_GUEST_RIP, vmexit->rip); } else { if (vmexit->exitcode == VM_EXITCODE_BOGUS) { /* * If this VM exit was not claimed by anybody then * treat it as a generic VMX exit. */ vmexit->exitcode = VM_EXITCODE_VMX; vmexit->u.vmx.status = VM_SUCCESS; vmexit->u.vmx.inst_type = 0; vmexit->u.vmx.inst_error = 0; } else { /* * The exitcode and collateral have been populated. * The VM exit will be processed further in userland. */ } } SDT_PROBE4(vmm, vmx, exit, return, vmx, vcpu, vmexit, handled); return (handled); } static __inline void vmx_exit_inst_error(struct vmxctx *vmxctx, int rc, struct vm_exit *vmexit) { KASSERT(vmxctx->inst_fail_status != VM_SUCCESS, ("vmx_exit_inst_error: invalid inst_fail_status %d", vmxctx->inst_fail_status)); vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_VMX; vmexit->u.vmx.status = vmxctx->inst_fail_status; vmexit->u.vmx.inst_error = vmcs_instruction_error(); vmexit->u.vmx.exit_reason = ~0; vmexit->u.vmx.exit_qualification = ~0; switch (rc) { case VMX_VMRESUME_ERROR: case VMX_VMLAUNCH_ERROR: case VMX_INVEPT_ERROR: vmexit->u.vmx.inst_type = rc; break; default: panic("vm_exit_inst_error: vmx_enter_guest returned %d", rc); } } /* * If the NMI-exiting VM execution control is set to '1' then an NMI in * non-root operation causes a VM-exit. NMI blocking is in effect so it is * sufficient to simply vector to the NMI handler via a software interrupt. * However, this must be done before maskable interrupts are enabled * otherwise the "iret" issued by an interrupt handler will incorrectly * clear NMI blocking. */ static __inline void vmx_exit_handle_nmi(struct vmx *vmx, int vcpuid, struct vm_exit *vmexit) { uint32_t intr_info; KASSERT((read_rflags() & PSL_I) == 0, ("interrupts enabled")); if (vmexit->u.vmx.exit_reason != EXIT_REASON_EXCEPTION) return; intr_info = vmcs_read(VMCS_EXIT_INTR_INFO); KASSERT((intr_info & VMCS_INTR_VALID) != 0, ("VM exit interruption info invalid: %#x", intr_info)); if ((intr_info & VMCS_INTR_T_MASK) == VMCS_INTR_T_NMI) { KASSERT((intr_info & 0xff) == IDT_NMI, ("VM exit due " "to NMI has invalid vector: %#x", intr_info)); VCPU_CTR0(vmx->vm, vcpuid, "Vectoring to NMI handler"); __asm __volatile("int $2"); } } static __inline void vmx_dr_enter_guest(struct vmxctx *vmxctx) { register_t rflags; /* Save host control debug registers. */ vmxctx->host_dr7 = rdr7(); vmxctx->host_debugctl = rdmsr(MSR_DEBUGCTLMSR); /* * Disable debugging in DR7 and DEBUGCTL to avoid triggering * exceptions in the host based on the guest DRx values. The * guest DR7 and DEBUGCTL are saved/restored in the VMCS. */ load_dr7(0); wrmsr(MSR_DEBUGCTLMSR, 0); /* * Disable single stepping the kernel to avoid corrupting the * guest DR6. A debugger might still be able to corrupt the * guest DR6 by setting a breakpoint after this point and then * single stepping. */ rflags = read_rflags(); vmxctx->host_tf = rflags & PSL_T; write_rflags(rflags & ~PSL_T); /* Save host debug registers. */ vmxctx->host_dr0 = rdr0(); vmxctx->host_dr1 = rdr1(); vmxctx->host_dr2 = rdr2(); vmxctx->host_dr3 = rdr3(); vmxctx->host_dr6 = rdr6(); /* Restore guest debug registers. 
*/ load_dr0(vmxctx->guest_dr0); load_dr1(vmxctx->guest_dr1); load_dr2(vmxctx->guest_dr2); load_dr3(vmxctx->guest_dr3); load_dr6(vmxctx->guest_dr6); } static __inline void vmx_dr_leave_guest(struct vmxctx *vmxctx) { /* Save guest debug registers. */ vmxctx->guest_dr0 = rdr0(); vmxctx->guest_dr1 = rdr1(); vmxctx->guest_dr2 = rdr2(); vmxctx->guest_dr3 = rdr3(); vmxctx->guest_dr6 = rdr6(); /* * Restore host debug registers. Restore DR7, DEBUGCTL, and * PSL_T last. */ load_dr0(vmxctx->host_dr0); load_dr1(vmxctx->host_dr1); load_dr2(vmxctx->host_dr2); load_dr3(vmxctx->host_dr3); load_dr6(vmxctx->host_dr6); wrmsr(MSR_DEBUGCTLMSR, vmxctx->host_debugctl); load_dr7(vmxctx->host_dr7); write_rflags(read_rflags() | vmxctx->host_tf); } static int vmx_run(void *arg, int vcpu, register_t rip, pmap_t pmap, struct vm_eventinfo *evinfo) { int rc, handled, launched; struct vmx *vmx; struct vm *vm; struct vmxctx *vmxctx; struct vmcs *vmcs; struct vm_exit *vmexit; struct vlapic *vlapic; uint32_t exit_reason; struct region_descriptor gdtr, idtr; uint16_t ldt_sel; vmx = arg; vm = vmx->vm; vmcs = &vmx->vmcs[vcpu]; vmxctx = &vmx->ctx[vcpu]; vlapic = vm_lapic(vm, vcpu); vmexit = vm_exitinfo(vm, vcpu); launched = 0; KASSERT(vmxctx->pmap == pmap, ("pmap %p different than ctx pmap %p", pmap, vmxctx->pmap)); vmx_msr_guest_enter(vmx, vcpu); VMPTRLD(vmcs); /* * XXX * We do this every time because we may setup the virtual machine * from a different process than the one that actually runs it. * * If the life of a virtual machine was spent entirely in the context * of a single process we could do this once in vmx_vminit(). */ vmcs_write(VMCS_HOST_CR3, rcr3()); vmcs_write(VMCS_GUEST_RIP, rip); vmx_set_pcpu_defaults(vmx, vcpu, pmap); do { KASSERT(vmcs_guest_rip() == rip, ("%s: vmcs guest rip mismatch " "%#lx/%#lx", __func__, vmcs_guest_rip(), rip)); handled = UNHANDLED; /* * Interrupts are disabled from this point on until the * guest starts executing. This is done for the following * reasons: * * If an AST is asserted on this thread after the check below, * then the IPI_AST notification will not be lost, because it * will cause a VM exit due to external interrupt as soon as * the guest state is loaded. * * A posted interrupt after 'vmx_inject_interrupts()' will * not be "lost" because it will be held pending in the host * APIC because interrupts are disabled. The pending interrupt * will be recognized as soon as the guest state is loaded. * * The same reasoning applies to the IPI generated by * pmap_invalidate_ept(). */ disable_intr(); vmx_inject_interrupts(vmx, vcpu, vlapic, rip); /* * Check for vcpu suspension after injecting events because * vmx_inject_interrupts() can suspend the vcpu due to a * triple fault. */ if (vcpu_suspended(evinfo)) { enable_intr(); vm_exit_suspended(vmx->vm, vcpu, rip); break; } if (vcpu_rendezvous_pending(evinfo)) { enable_intr(); vm_exit_rendezvous(vmx->vm, vcpu, rip); break; } if (vcpu_reqidle(evinfo)) { enable_intr(); vm_exit_reqidle(vmx->vm, vcpu, rip); break; } if (vcpu_should_yield(vm, vcpu)) { enable_intr(); vm_exit_astpending(vmx->vm, vcpu, rip); vmx_astpending_trace(vmx, vcpu, rip); handled = HANDLED; break; } if (vcpu_debugged(vm, vcpu)) { enable_intr(); vm_exit_debug(vmx->vm, vcpu, rip); break; } /* * VM exits restore the base address but not the * limits of GDTR and IDTR. The VMCS only stores the * base address, so VM exits set the limits to 0xffff. * Save and restore the full GDTR and IDTR to restore * the limits. 
* * The VMCS does not save the LDTR at all, and VM * exits clear LDTR as if a NULL selector were loaded. * The userspace hypervisor probably doesn't use a * LDT, but save and restore it to be safe. */ sgdt(&gdtr); sidt(&idtr); ldt_sel = sldt(); vmx_run_trace(vmx, vcpu); vmx_dr_enter_guest(vmxctx); rc = vmx_enter_guest(vmxctx, vmx, launched); vmx_dr_leave_guest(vmxctx); bare_lgdt(&gdtr); lidt(&idtr); lldt(ldt_sel); /* Collect some information for VM exit processing */ vmexit->rip = rip = vmcs_guest_rip(); vmexit->inst_length = vmexit_instruction_length(); vmexit->u.vmx.exit_reason = exit_reason = vmcs_exit_reason(); vmexit->u.vmx.exit_qualification = vmcs_exit_qualification(); /* Update 'nextrip' */ vmx->state[vcpu].nextrip = rip; if (rc == VMX_GUEST_VMEXIT) { vmx_exit_handle_nmi(vmx, vcpu, vmexit); enable_intr(); handled = vmx_exit_process(vmx, vcpu, vmexit); } else { enable_intr(); vmx_exit_inst_error(vmxctx, rc, vmexit); } launched = 1; vmx_exit_trace(vmx, vcpu, rip, exit_reason, handled); rip = vmexit->rip; } while (handled); /* * If a VM exit has been handled then the exitcode must be BOGUS * If a VM exit is not handled then the exitcode must not be BOGUS */ if ((handled && vmexit->exitcode != VM_EXITCODE_BOGUS) || (!handled && vmexit->exitcode == VM_EXITCODE_BOGUS)) { panic("Mismatch between handled (%d) and exitcode (%d)", handled, vmexit->exitcode); } if (!handled) vmm_stat_incr(vm, vcpu, VMEXIT_USERSPACE, 1); VCPU_CTR1(vm, vcpu, "returning from vmx_run: exitcode %d", vmexit->exitcode); VMCLEAR(vmcs); vmx_msr_guest_exit(vmx, vcpu); return (0); } static void vmx_vmcleanup(void *arg) { int i; struct vmx *vmx = arg; + uint16_t maxcpus; if (apic_access_virtualization(vmx, 0)) vm_unmap_mmio(vmx->vm, DEFAULT_APIC_BASE, PAGE_SIZE); - for (i = 0; i < VM_MAXCPU; i++) + maxcpus = vm_get_maxcpus(vmx->vm); + for (i = 0; i < maxcpus; i++) vpid_free(vmx->state[i].vpid); free(vmx, M_VMX); return; } static register_t * vmxctx_regptr(struct vmxctx *vmxctx, int reg) { switch (reg) { case VM_REG_GUEST_RAX: return (&vmxctx->guest_rax); case VM_REG_GUEST_RBX: return (&vmxctx->guest_rbx); case VM_REG_GUEST_RCX: return (&vmxctx->guest_rcx); case VM_REG_GUEST_RDX: return (&vmxctx->guest_rdx); case VM_REG_GUEST_RSI: return (&vmxctx->guest_rsi); case VM_REG_GUEST_RDI: return (&vmxctx->guest_rdi); case VM_REG_GUEST_RBP: return (&vmxctx->guest_rbp); case VM_REG_GUEST_R8: return (&vmxctx->guest_r8); case VM_REG_GUEST_R9: return (&vmxctx->guest_r9); case VM_REG_GUEST_R10: return (&vmxctx->guest_r10); case VM_REG_GUEST_R11: return (&vmxctx->guest_r11); case VM_REG_GUEST_R12: return (&vmxctx->guest_r12); case VM_REG_GUEST_R13: return (&vmxctx->guest_r13); case VM_REG_GUEST_R14: return (&vmxctx->guest_r14); case VM_REG_GUEST_R15: return (&vmxctx->guest_r15); case VM_REG_GUEST_CR2: return (&vmxctx->guest_cr2); case VM_REG_GUEST_DR0: return (&vmxctx->guest_dr0); case VM_REG_GUEST_DR1: return (&vmxctx->guest_dr1); case VM_REG_GUEST_DR2: return (&vmxctx->guest_dr2); case VM_REG_GUEST_DR3: return (&vmxctx->guest_dr3); case VM_REG_GUEST_DR6: return (&vmxctx->guest_dr6); default: break; } return (NULL); } static int vmxctx_getreg(struct vmxctx *vmxctx, int reg, uint64_t *retval) { register_t *regp; if ((regp = vmxctx_regptr(vmxctx, reg)) != NULL) { *retval = *regp; return (0); } else return (EINVAL); } static int vmxctx_setreg(struct vmxctx *vmxctx, int reg, uint64_t val) { register_t *regp; if ((regp = vmxctx_regptr(vmxctx, reg)) != NULL) { *regp = val; return (0); } else return (EINVAL); } static int 
vmx_get_intr_shadow(struct vmx *vmx, int vcpu, int running, uint64_t *retval) { uint64_t gi; int error; error = vmcs_getreg(&vmx->vmcs[vcpu], running, VMCS_IDENT(VMCS_GUEST_INTERRUPTIBILITY), &gi); *retval = (gi & HWINTR_BLOCKING) ? 1 : 0; return (error); } static int vmx_modify_intr_shadow(struct vmx *vmx, int vcpu, int running, uint64_t val) { struct vmcs *vmcs; uint64_t gi; int error, ident; /* * Forcing the vcpu into an interrupt shadow is not supported. */ if (val) { error = EINVAL; goto done; } vmcs = &vmx->vmcs[vcpu]; ident = VMCS_IDENT(VMCS_GUEST_INTERRUPTIBILITY); error = vmcs_getreg(vmcs, running, ident, &gi); if (error == 0) { gi &= ~HWINTR_BLOCKING; error = vmcs_setreg(vmcs, running, ident, gi); } done: VCPU_CTR2(vmx->vm, vcpu, "Setting intr_shadow to %#lx %s", val, error ? "failed" : "succeeded"); return (error); } static int vmx_shadow_reg(int reg) { int shreg; shreg = -1; switch (reg) { case VM_REG_GUEST_CR0: shreg = VMCS_CR0_SHADOW; break; case VM_REG_GUEST_CR4: shreg = VMCS_CR4_SHADOW; break; default: break; } return (shreg); } static int vmx_getreg(void *arg, int vcpu, int reg, uint64_t *retval) { int running, hostcpu; struct vmx *vmx = arg; running = vcpu_is_running(vmx->vm, vcpu, &hostcpu); if (running && hostcpu != curcpu) panic("vmx_getreg: %s%d is running", vm_name(vmx->vm), vcpu); if (reg == VM_REG_GUEST_INTR_SHADOW) return (vmx_get_intr_shadow(vmx, vcpu, running, retval)); if (vmxctx_getreg(&vmx->ctx[vcpu], reg, retval) == 0) return (0); return (vmcs_getreg(&vmx->vmcs[vcpu], running, reg, retval)); } static int vmx_setreg(void *arg, int vcpu, int reg, uint64_t val) { int error, hostcpu, running, shadow; uint64_t ctls; pmap_t pmap; struct vmx *vmx = arg; running = vcpu_is_running(vmx->vm, vcpu, &hostcpu); if (running && hostcpu != curcpu) panic("vmx_setreg: %s%d is running", vm_name(vmx->vm), vcpu); if (reg == VM_REG_GUEST_INTR_SHADOW) return (vmx_modify_intr_shadow(vmx, vcpu, running, val)); if (vmxctx_setreg(&vmx->ctx[vcpu], reg, val) == 0) return (0); error = vmcs_setreg(&vmx->vmcs[vcpu], running, reg, val); if (error == 0) { /* * If the "load EFER" VM-entry control is 1 then the * value of EFER.LMA must be identical to "IA-32e mode guest" * bit in the VM-entry control. */ if ((entry_ctls & VM_ENTRY_LOAD_EFER) != 0 && (reg == VM_REG_GUEST_EFER)) { vmcs_getreg(&vmx->vmcs[vcpu], running, VMCS_IDENT(VMCS_ENTRY_CTLS), &ctls); if (val & EFER_LMA) ctls |= VM_ENTRY_GUEST_LMA; else ctls &= ~VM_ENTRY_GUEST_LMA; vmcs_setreg(&vmx->vmcs[vcpu], running, VMCS_IDENT(VMCS_ENTRY_CTLS), ctls); } shadow = vmx_shadow_reg(reg); if (shadow > 0) { /* * Store the unmodified value in the shadow */ error = vmcs_setreg(&vmx->vmcs[vcpu], running, VMCS_IDENT(shadow), val); } if (reg == VM_REG_GUEST_CR3) { /* * Invalidate the guest vcpu's TLB mappings to emulate * the behavior of updating %cr3. * * XXX the processor retains global mappings when %cr3 * is updated but vmx_invvpid() does not. 
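* The over-invalidation should only cost performance, not correctness: guest TLB entries for global pages are simply flushed more often than real hardware would flush them on a %cr3 write.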
*/ pmap = vmx->ctx[vcpu].pmap; vmx_invvpid(vmx, vcpu, pmap, running); } } return (error); } static int vmx_getdesc(void *arg, int vcpu, int reg, struct seg_desc *desc) { int hostcpu, running; struct vmx *vmx = arg; running = vcpu_is_running(vmx->vm, vcpu, &hostcpu); if (running && hostcpu != curcpu) panic("vmx_getdesc: %s%d is running", vm_name(vmx->vm), vcpu); return (vmcs_getdesc(&vmx->vmcs[vcpu], running, reg, desc)); } static int vmx_setdesc(void *arg, int vcpu, int reg, struct seg_desc *desc) { int hostcpu, running; struct vmx *vmx = arg; running = vcpu_is_running(vmx->vm, vcpu, &hostcpu); if (running && hostcpu != curcpu) panic("vmx_setdesc: %s%d is running", vm_name(vmx->vm), vcpu); return (vmcs_setdesc(&vmx->vmcs[vcpu], running, reg, desc)); } static int vmx_getcap(void *arg, int vcpu, int type, int *retval) { struct vmx *vmx = arg; int vcap; int ret; ret = ENOENT; vcap = vmx->cap[vcpu].set; switch (type) { case VM_CAP_HALT_EXIT: if (cap_halt_exit) ret = 0; break; case VM_CAP_PAUSE_EXIT: if (cap_pause_exit) ret = 0; break; case VM_CAP_MTRAP_EXIT: if (cap_monitor_trap) ret = 0; break; case VM_CAP_UNRESTRICTED_GUEST: if (cap_unrestricted_guest) ret = 0; break; case VM_CAP_ENABLE_INVPCID: if (cap_invpcid) ret = 0; break; default: break; } if (ret == 0) *retval = (vcap & (1 << type)) ? 1 : 0; return (ret); } static int vmx_setcap(void *arg, int vcpu, int type, int val) { struct vmx *vmx = arg; struct vmcs *vmcs = &vmx->vmcs[vcpu]; uint32_t baseval; uint32_t *pptr; int error; int flag; int reg; int retval; retval = ENOENT; pptr = NULL; switch (type) { case VM_CAP_HALT_EXIT: if (cap_halt_exit) { retval = 0; pptr = &vmx->cap[vcpu].proc_ctls; baseval = *pptr; flag = PROCBASED_HLT_EXITING; reg = VMCS_PRI_PROC_BASED_CTLS; } break; case VM_CAP_MTRAP_EXIT: if (cap_monitor_trap) { retval = 0; pptr = &vmx->cap[vcpu].proc_ctls; baseval = *pptr; flag = PROCBASED_MTF; reg = VMCS_PRI_PROC_BASED_CTLS; } break; case VM_CAP_PAUSE_EXIT: if (cap_pause_exit) { retval = 0; pptr = &vmx->cap[vcpu].proc_ctls; baseval = *pptr; flag = PROCBASED_PAUSE_EXITING; reg = VMCS_PRI_PROC_BASED_CTLS; } break; case VM_CAP_UNRESTRICTED_GUEST: if (cap_unrestricted_guest) { retval = 0; pptr = &vmx->cap[vcpu].proc_ctls2; baseval = *pptr; flag = PROCBASED2_UNRESTRICTED_GUEST; reg = VMCS_SEC_PROC_BASED_CTLS; } break; case VM_CAP_ENABLE_INVPCID: if (cap_invpcid) { retval = 0; pptr = &vmx->cap[vcpu].proc_ctls2; baseval = *pptr; flag = PROCBASED2_ENABLE_INVPCID; reg = VMCS_SEC_PROC_BASED_CTLS; } break; default: break; } if (retval == 0) { if (val) { baseval |= flag; } else { baseval &= ~flag; } VMPTRLD(vmcs); error = vmwrite(reg, baseval); VMCLEAR(vmcs); if (error) { retval = error; } else { /* * Update optional stored flags, and record * setting */ if (pptr != NULL) { *pptr = baseval; } if (val) { vmx->cap[vcpu].set |= (1 << type); } else { vmx->cap[vcpu].set &= ~(1 << type); } } } return (retval); } struct vlapic_vtx { struct vlapic vlapic; struct pir_desc *pir_desc; struct vmx *vmx; u_int pending_prio; }; #define VPR_PRIO_BIT(vpr) (1 << ((vpr) >> 4)) #define VMX_CTR_PIR(vm, vcpuid, pir_desc, notify, vector, level, msg) \ do { \ VCPU_CTR2(vm, vcpuid, msg " assert %s-triggered vector %d", \ level ? 
"level" : "edge", vector); \ VCPU_CTR1(vm, vcpuid, msg " pir0 0x%016lx", pir_desc->pir[0]); \ VCPU_CTR1(vm, vcpuid, msg " pir1 0x%016lx", pir_desc->pir[1]); \ VCPU_CTR1(vm, vcpuid, msg " pir2 0x%016lx", pir_desc->pir[2]); \ VCPU_CTR1(vm, vcpuid, msg " pir3 0x%016lx", pir_desc->pir[3]); \ VCPU_CTR1(vm, vcpuid, msg " notify: %s", notify ? "yes" : "no");\ } while (0) /* * vlapic->ops handlers that utilize the APICv hardware assist described in * Chapter 29 of the Intel SDM. */ static int vmx_set_intr_ready(struct vlapic *vlapic, int vector, bool level) { struct vlapic_vtx *vlapic_vtx; struct pir_desc *pir_desc; uint64_t mask; int idx, notify = 0; vlapic_vtx = (struct vlapic_vtx *)vlapic; pir_desc = vlapic_vtx->pir_desc; /* * Keep track of interrupt requests in the PIR descriptor. This is * because the virtual APIC page pointed to by the VMCS cannot be * modified if the vcpu is running. */ idx = vector / 64; mask = 1UL << (vector % 64); atomic_set_long(&pir_desc->pir[idx], mask); /* * A notification is required whenever the 'pending' bit makes a * transition from 0->1. * * Even if the 'pending' bit is already asserted, notification about * the incoming interrupt may still be necessary. For example, if a * vCPU is HLTed with a high PPR, a low priority interrupt would cause * the 0->1 'pending' transition with a notification, but the vCPU * would ignore the interrupt for the time being. The same vCPU would * need to then be notified if a high-priority interrupt arrived which * satisfied the PPR. * * The priorities of interrupts injected while 'pending' is asserted * are tracked in a custom bitfield 'pending_prio'. Should the * to-be-injected interrupt exceed the priorities already present, the * notification is sent. The priorities recorded in 'pending_prio' are * cleared whenever the 'pending' bit makes another 0->1 transition. */ if (atomic_cmpset_long(&pir_desc->pending, 0, 1) != 0) { notify = 1; vlapic_vtx->pending_prio = 0; } else { const u_int old_prio = vlapic_vtx->pending_prio; const u_int prio_bit = VPR_PRIO_BIT(vector & APIC_TPR_INT); if ((old_prio & prio_bit) == 0 && prio_bit > old_prio) { atomic_set_int(&vlapic_vtx->pending_prio, prio_bit); notify = 1; } } VMX_CTR_PIR(vlapic->vm, vlapic->vcpuid, pir_desc, notify, vector, level, "vmx_set_intr_ready"); return (notify); } static int vmx_pending_intr(struct vlapic *vlapic, int *vecptr) { struct vlapic_vtx *vlapic_vtx; struct pir_desc *pir_desc; struct LAPIC *lapic; uint64_t pending, pirval; uint32_t ppr, vpr; int i; /* * This function is only expected to be called from the 'HLT' exit * handler which does not care about the vector that is pending. */ KASSERT(vecptr == NULL, ("vmx_pending_intr: vecptr must be NULL")); vlapic_vtx = (struct vlapic_vtx *)vlapic; pir_desc = vlapic_vtx->pir_desc; pending = atomic_load_acq_long(&pir_desc->pending); if (!pending) { /* * While a virtual interrupt may have already been * processed the actual delivery maybe pending the * interruptibility of the guest. Recognize a pending * interrupt by reevaluating virtual interrupts * following Section 29.2.1 in the Intel SDM Volume 3. 
*/ struct vm_exit *vmexit; uint8_t rvi, ppr; vmexit = vm_exitinfo(vlapic->vm, vlapic->vcpuid); KASSERT(vmexit->exitcode == VM_EXITCODE_HLT, ("vmx_pending_intr: exitcode not 'HLT'")); rvi = vmexit->u.hlt.intr_status & APIC_TPR_INT; lapic = vlapic->apic_page; ppr = lapic->ppr & APIC_TPR_INT; if (rvi > ppr) { return (1); } return (0); } /* * If there is an interrupt pending then it will be recognized only * if its priority is greater than the processor priority. * * Special case: if the processor priority is zero then any pending * interrupt will be recognized. */ lapic = vlapic->apic_page; ppr = lapic->ppr & APIC_TPR_INT; if (ppr == 0) return (1); VCPU_CTR1(vlapic->vm, vlapic->vcpuid, "HLT with non-zero PPR %d", lapic->ppr); vpr = 0; for (i = 3; i >= 0; i--) { pirval = pir_desc->pir[i]; if (pirval != 0) { vpr = (i * 64 + flsl(pirval) - 1) & APIC_TPR_INT; break; } } /* * If the highest-priority pending interrupt falls short of the * processor priority of this vCPU, ensure that 'pending_prio' does not * have any stale bits which would preclude a higher-priority interrupt * from incurring a notification later. */ if (vpr <= ppr) { const u_int prio_bit = VPR_PRIO_BIT(vpr); const u_int old = vlapic_vtx->pending_prio; if (old > prio_bit && (old & prio_bit) == 0) { vlapic_vtx->pending_prio = prio_bit; } return (0); } return (1); } static void vmx_intr_accepted(struct vlapic *vlapic, int vector) { panic("vmx_intr_accepted: not expected to be called"); } static void vmx_set_tmr(struct vlapic *vlapic, int vector, bool level) { struct vlapic_vtx *vlapic_vtx; struct vmx *vmx; struct vmcs *vmcs; uint64_t mask, val; KASSERT(vector >= 0 && vector <= 255, ("invalid vector %d", vector)); KASSERT(!vcpu_is_running(vlapic->vm, vlapic->vcpuid, NULL), ("vmx_set_tmr: vcpu cannot be running")); vlapic_vtx = (struct vlapic_vtx *)vlapic; vmx = vlapic_vtx->vmx; vmcs = &vmx->vmcs[vlapic->vcpuid]; mask = 1UL << (vector % 64); VMPTRLD(vmcs); val = vmcs_read(VMCS_EOI_EXIT(vector)); if (level) val |= mask; else val &= ~mask; vmcs_write(VMCS_EOI_EXIT(vector), val); VMCLEAR(vmcs); } static void vmx_enable_x2apic_mode(struct vlapic *vlapic) { struct vmx *vmx; struct vmcs *vmcs; uint32_t proc_ctls2; int vcpuid, error; vcpuid = vlapic->vcpuid; vmx = ((struct vlapic_vtx *)vlapic)->vmx; vmcs = &vmx->vmcs[vcpuid]; proc_ctls2 = vmx->cap[vcpuid].proc_ctls2; KASSERT((proc_ctls2 & PROCBASED2_VIRTUALIZE_APIC_ACCESSES) != 0, ("%s: invalid proc_ctls2 %#x", __func__, proc_ctls2)); proc_ctls2 &= ~PROCBASED2_VIRTUALIZE_APIC_ACCESSES; proc_ctls2 |= PROCBASED2_VIRTUALIZE_X2APIC_MODE; vmx->cap[vcpuid].proc_ctls2 = proc_ctls2; VMPTRLD(vmcs); vmcs_write(VMCS_SEC_PROC_BASED_CTLS, proc_ctls2); VMCLEAR(vmcs); if (vlapic->vcpuid == 0) { /* * The nested page table mappings are shared by all vcpus * so unmap the APIC access page just once. */ error = vm_unmap_mmio(vmx->vm, DEFAULT_APIC_BASE, PAGE_SIZE); KASSERT(error == 0, ("%s: vm_unmap_mmio error %d", __func__, error)); /* * The MSR bitmap is shared by all vcpus so modify it only * once in the context of vcpu 0. */ error = vmx_allow_x2apic_msrs(vmx); KASSERT(error == 0, ("%s: vmx_allow_x2apic_msrs error %d", __func__, error)); } } static void vmx_post_intr(struct vlapic *vlapic, int hostcpu) { ipi_cpu(hostcpu, pirvec); } /* * Transfer the pending interrupts in the PIR descriptor to the IRR * in the virtual APIC page. 
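* Each 64-bit pir[] word is atomically read-and-cleared and OR'ed into the pair of 32-bit IRR words covering the same vector range; pir[1], for instance, holds vectors 64-127 and lands in irr2/irr3.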
*/ static void vmx_inject_pir(struct vlapic *vlapic) { struct vlapic_vtx *vlapic_vtx; struct pir_desc *pir_desc; struct LAPIC *lapic; uint64_t val, pirval; int rvi, pirbase = -1; uint16_t intr_status_old, intr_status_new; vlapic_vtx = (struct vlapic_vtx *)vlapic; pir_desc = vlapic_vtx->pir_desc; if (atomic_cmpset_long(&pir_desc->pending, 1, 0) == 0) { VCPU_CTR0(vlapic->vm, vlapic->vcpuid, "vmx_inject_pir: " "no posted interrupt pending"); return; } pirval = 0; pirbase = -1; lapic = vlapic->apic_page; val = atomic_readandclear_long(&pir_desc->pir[0]); if (val != 0) { lapic->irr0 |= val; lapic->irr1 |= val >> 32; pirbase = 0; pirval = val; } val = atomic_readandclear_long(&pir_desc->pir[1]); if (val != 0) { lapic->irr2 |= val; lapic->irr3 |= val >> 32; pirbase = 64; pirval = val; } val = atomic_readandclear_long(&pir_desc->pir[2]); if (val != 0) { lapic->irr4 |= val; lapic->irr5 |= val >> 32; pirbase = 128; pirval = val; } val = atomic_readandclear_long(&pir_desc->pir[3]); if (val != 0) { lapic->irr6 |= val; lapic->irr7 |= val >> 32; pirbase = 192; pirval = val; } VLAPIC_CTR_IRR(vlapic, "vmx_inject_pir"); /* * Update RVI so the processor can evaluate pending virtual * interrupts on VM-entry. * * It is possible for pirval to be 0 here, even though the * pending bit has been set. The scenario is: * CPU-Y is sending a posted interrupt to CPU-X, which * is running a guest and processing posted interrupts in h/w. * CPU-X will eventually exit and the state seen in s/w is * the pending bit set, but no PIR bits set. * * CPU-X CPU-Y * (vm running) (host running) * rx posted interrupt * CLEAR pending bit * SET PIR bit * READ/CLEAR PIR bits * SET pending bit * (vm exit) * pending bit set, PIR 0 */ if (pirval != 0) { rvi = pirbase + flsl(pirval) - 1; intr_status_old = vmcs_read(VMCS_GUEST_INTR_STATUS); intr_status_new = (intr_status_old & 0xFF00) | rvi; if (intr_status_new > intr_status_old) { vmcs_write(VMCS_GUEST_INTR_STATUS, intr_status_new); VCPU_CTR2(vlapic->vm, vlapic->vcpuid, "vmx_inject_pir: " "guest_intr_status changed from 0x%04x to 0x%04x", intr_status_old, intr_status_new); } } } static struct vlapic * vmx_vlapic_init(void *arg, int vcpuid) { struct vmx *vmx; struct vlapic *vlapic; struct vlapic_vtx *vlapic_vtx; vmx = arg; vlapic = malloc(sizeof(struct vlapic_vtx), M_VLAPIC, M_WAITOK | M_ZERO); vlapic->vm = vmx->vm; vlapic->vcpuid = vcpuid; vlapic->apic_page = (struct LAPIC *)&vmx->apic_page[vcpuid]; vlapic_vtx = (struct vlapic_vtx *)vlapic; vlapic_vtx->pir_desc = &vmx->pir_desc[vcpuid]; vlapic_vtx->vmx = vmx; if (virtual_interrupt_delivery) { vlapic->ops.set_intr_ready = vmx_set_intr_ready; vlapic->ops.pending_intr = vmx_pending_intr; vlapic->ops.intr_accepted = vmx_intr_accepted; vlapic->ops.set_tmr = vmx_set_tmr; vlapic->ops.enable_x2apic_mode = vmx_enable_x2apic_mode; } if (posted_interrupts) vlapic->ops.post_intr = vmx_post_intr; vlapic_init(vlapic); return (vlapic); } static void vmx_vlapic_cleanup(void *arg, struct vlapic *vlapic) { vlapic_cleanup(vlapic); free(vlapic, M_VLAPIC); } struct vmm_ops vmm_ops_intel = { vmx_init, vmx_cleanup, vmx_restore, vmx_vminit, vmx_run, vmx_vmcleanup, vmx_getreg, vmx_setreg, vmx_getdesc, vmx_setdesc, vmx_getcap, vmx_setcap, ept_vmspace_alloc, ept_vmspace_free, vmx_vlapic_init, vmx_vlapic_cleanup, }; Index: user/ngie/bug-237403/sys/amd64/vmm/io/vlapic.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/io/vlapic.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/io/vlapic.c (revision 
346926) @@ -1,1656 +1,1659 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include "vmm_lapic.h" #include "vmm_ktr.h" #include "vmm_stat.h" #include "vlapic.h" #include "vlapic_priv.h" #include "vioapic.h" #define PRIO(x) ((x) >> 4) #define VLAPIC_VERSION (16) #define x2apic(vlapic) (((vlapic)->msr_apicbase & APICBASE_X2APIC) ? 1 : 0) /* * The 'vlapic->timer_mtx' is used to provide mutual exclusion between the * vlapic_callout_handler() and vcpu accesses to: * - timer_freq_bt, timer_period_bt, timer_fire_bt * - timer LVT register */ #define VLAPIC_TIMER_LOCK(vlapic) mtx_lock_spin(&((vlapic)->timer_mtx)) #define VLAPIC_TIMER_UNLOCK(vlapic) mtx_unlock_spin(&((vlapic)->timer_mtx)) #define VLAPIC_TIMER_LOCKED(vlapic) mtx_owned(&((vlapic)->timer_mtx)) /* * APIC timer frequency: * - arbitrary but chosen to be in the ballpark of contemporary hardware. * - power-of-two to avoid loss of precision when converted to a bintime. 
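*   (a bintime fraction counts units of 2^-64 seconds, so a 2^27 Hz frequency converts to an exact per-tick fraction of 2^37, whereas a non-power-of-two choice such as 100 MHz would be truncated and the error would grow when multiplied by large ICR counts)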
*/ #define VLAPIC_BUS_FREQ (128 * 1024 * 1024) static __inline uint32_t vlapic_get_id(struct vlapic *vlapic) { if (x2apic(vlapic)) return (vlapic->vcpuid); else return (vlapic->vcpuid << 24); } static uint32_t x2apic_ldr(struct vlapic *vlapic) { int apicid; uint32_t ldr; apicid = vlapic_get_id(vlapic); ldr = 1 << (apicid & 0xf); ldr |= (apicid & 0xffff0) << 12; return (ldr); } void vlapic_dfr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; lapic = vlapic->apic_page; if (x2apic(vlapic)) { VM_CTR1(vlapic->vm, "ignoring write to DFR in x2apic mode: %#x", lapic->dfr); lapic->dfr = 0; return; } lapic->dfr &= APIC_DFR_MODEL_MASK; lapic->dfr |= APIC_DFR_RESERVED; if ((lapic->dfr & APIC_DFR_MODEL_MASK) == APIC_DFR_MODEL_FLAT) VLAPIC_CTR0(vlapic, "vlapic DFR in Flat Model"); else if ((lapic->dfr & APIC_DFR_MODEL_MASK) == APIC_DFR_MODEL_CLUSTER) VLAPIC_CTR0(vlapic, "vlapic DFR in Cluster Model"); else VLAPIC_CTR1(vlapic, "DFR in Unknown Model %#x", lapic->dfr); } void vlapic_ldr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; lapic = vlapic->apic_page; /* LDR is read-only in x2apic mode */ if (x2apic(vlapic)) { VLAPIC_CTR1(vlapic, "ignoring write to LDR in x2apic mode: %#x", lapic->ldr); lapic->ldr = x2apic_ldr(vlapic); } else { lapic->ldr &= ~APIC_LDR_RESERVED; VLAPIC_CTR1(vlapic, "vlapic LDR set to %#x", lapic->ldr); } } void vlapic_id_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; /* * We don't allow the ID register to be modified so reset it back to * its default value. */ lapic = vlapic->apic_page; lapic->id = vlapic_get_id(vlapic); } static int vlapic_timer_divisor(uint32_t dcr) { switch (dcr & 0xB) { case APIC_TDCR_1: return (1); case APIC_TDCR_2: return (2); case APIC_TDCR_4: return (4); case APIC_TDCR_8: return (8); case APIC_TDCR_16: return (16); case APIC_TDCR_32: return (32); case APIC_TDCR_64: return (64); case APIC_TDCR_128: return (128); default: panic("vlapic_timer_divisor: invalid dcr 0x%08x", dcr); } } #if 0 static inline void vlapic_dump_lvt(uint32_t offset, uint32_t *lvt) { printf("Offset %x: lvt %08x (V:%02x DS:%x M:%x)\n", offset, *lvt, *lvt & APIC_LVTT_VECTOR, *lvt & APIC_LVTT_DS, *lvt & APIC_LVTT_M); } #endif static uint32_t vlapic_get_ccr(struct vlapic *vlapic) { struct bintime bt_now, bt_rem; struct LAPIC *lapic; uint32_t ccr; ccr = 0; lapic = vlapic->apic_page; VLAPIC_TIMER_LOCK(vlapic); if (callout_active(&vlapic->callout)) { /* * If the timer is scheduled to expire in the future then * compute the value of 'ccr' based on the remaining time. */ binuptime(&bt_now); if (bintime_cmp(&vlapic->timer_fire_bt, &bt_now, >)) { bt_rem = vlapic->timer_fire_bt; bintime_sub(&bt_rem, &bt_now); ccr += bt_rem.sec * BT2FREQ(&vlapic->timer_freq_bt); ccr += bt_rem.frac / vlapic->timer_freq_bt.frac; } } KASSERT(ccr <= lapic->icr_timer, ("vlapic_get_ccr: invalid ccr %#x, " "icr_timer is %#x", ccr, lapic->icr_timer)); VLAPIC_CTR2(vlapic, "vlapic ccr_timer = %#x, icr_timer = %#x", ccr, lapic->icr_timer); VLAPIC_TIMER_UNLOCK(vlapic); return (ccr); } void vlapic_dcr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; int divisor; lapic = vlapic->apic_page; VLAPIC_TIMER_LOCK(vlapic); divisor = vlapic_timer_divisor(lapic->dcr_timer); VLAPIC_CTR2(vlapic, "vlapic dcr_timer=%#x, divisor=%d", lapic->dcr_timer, divisor); /* * Update the timer frequency and the timer period. * * XXX changes to the frequency divider will not take effect until * the timer is reloaded. 
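* For example, with the divider set to 8 the effective frequency is 2^27 / 8 = 16,777,216 Hz, so an ICR value of 1,000,000 corresponds to a period of roughly 59.6 ms.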
*/ FREQ2BT(VLAPIC_BUS_FREQ / divisor, &vlapic->timer_freq_bt); vlapic->timer_period_bt = vlapic->timer_freq_bt; bintime_mul(&vlapic->timer_period_bt, lapic->icr_timer); VLAPIC_TIMER_UNLOCK(vlapic); } void vlapic_esr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; lapic = vlapic->apic_page; lapic->esr = vlapic->esr_pending; vlapic->esr_pending = 0; } int vlapic_set_intr_ready(struct vlapic *vlapic, int vector, bool level) { struct LAPIC *lapic; uint32_t *irrptr, *tmrptr, mask; int idx; KASSERT(vector >= 0 && vector < 256, ("invalid vector %d", vector)); lapic = vlapic->apic_page; if (!(lapic->svr & APIC_SVR_ENABLE)) { VLAPIC_CTR1(vlapic, "vlapic is software disabled, ignoring " "interrupt %d", vector); return (0); } if (vector < 16) { vlapic_set_error(vlapic, APIC_ESR_RECEIVE_ILLEGAL_VECTOR); VLAPIC_CTR1(vlapic, "vlapic ignoring interrupt to vector %d", vector); return (1); } if (vlapic->ops.set_intr_ready) return ((*vlapic->ops.set_intr_ready)(vlapic, vector, level)); idx = (vector / 32) * 4; mask = 1 << (vector % 32); irrptr = &lapic->irr0; atomic_set_int(&irrptr[idx], mask); /* * Verify that the trigger-mode of the interrupt matches with * the vlapic TMR registers. */ tmrptr = &lapic->tmr0; if ((tmrptr[idx] & mask) != (level ? mask : 0)) { VLAPIC_CTR3(vlapic, "vlapic TMR[%d] is 0x%08x but " "interrupt is %s-triggered", idx / 4, tmrptr[idx], level ? "level" : "edge"); } VLAPIC_CTR_IRR(vlapic, "vlapic_set_intr_ready"); return (1); } static __inline uint32_t * vlapic_get_lvtptr(struct vlapic *vlapic, uint32_t offset) { struct LAPIC *lapic = vlapic->apic_page; int i; switch (offset) { case APIC_OFFSET_CMCI_LVT: return (&lapic->lvt_cmci); case APIC_OFFSET_TIMER_LVT ... APIC_OFFSET_ERROR_LVT: i = (offset - APIC_OFFSET_TIMER_LVT) >> 2; return ((&lapic->lvt_timer) + i);; default: panic("vlapic_get_lvt: invalid LVT\n"); } } static __inline int lvt_off_to_idx(uint32_t offset) { int index; switch (offset) { case APIC_OFFSET_CMCI_LVT: index = APIC_LVT_CMCI; break; case APIC_OFFSET_TIMER_LVT: index = APIC_LVT_TIMER; break; case APIC_OFFSET_THERM_LVT: index = APIC_LVT_THERMAL; break; case APIC_OFFSET_PERF_LVT: index = APIC_LVT_PMC; break; case APIC_OFFSET_LINT0_LVT: index = APIC_LVT_LINT0; break; case APIC_OFFSET_LINT1_LVT: index = APIC_LVT_LINT1; break; case APIC_OFFSET_ERROR_LVT: index = APIC_LVT_ERROR; break; default: index = -1; break; } KASSERT(index >= 0 && index <= VLAPIC_MAXLVT_INDEX, ("lvt_off_to_idx: " "invalid lvt index %d for offset %#x", index, offset)); return (index); } static __inline uint32_t vlapic_get_lvt(struct vlapic *vlapic, uint32_t offset) { int idx; uint32_t val; idx = lvt_off_to_idx(offset); val = atomic_load_acq_32(&vlapic->lvt_last[idx]); return (val); } void vlapic_lvt_write_handler(struct vlapic *vlapic, uint32_t offset) { uint32_t *lvtptr, mask, val; struct LAPIC *lapic; int idx; lapic = vlapic->apic_page; lvtptr = vlapic_get_lvtptr(vlapic, offset); val = *lvtptr; idx = lvt_off_to_idx(offset); if (!(lapic->svr & APIC_SVR_ENABLE)) val |= APIC_LVT_M; mask = APIC_LVT_M | APIC_LVT_DS | APIC_LVT_VECTOR; switch (offset) { case APIC_OFFSET_TIMER_LVT: mask |= APIC_LVTT_TM; break; case APIC_OFFSET_ERROR_LVT: break; case APIC_OFFSET_LINT0_LVT: case APIC_OFFSET_LINT1_LVT: mask |= APIC_LVT_TM | APIC_LVT_RIRR | APIC_LVT_IIPP; /* FALLTHROUGH */ default: mask |= APIC_LVT_DM; break; } val &= mask; *lvtptr = val; atomic_store_rel_32(&vlapic->lvt_last[idx], val); } static void vlapic_mask_lvts(struct vlapic *vlapic) { struct LAPIC *lapic = vlapic->apic_page; lapic->lvt_cmci |= 
APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_CMCI_LVT); lapic->lvt_timer |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_TIMER_LVT); lapic->lvt_thermal |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_THERM_LVT); lapic->lvt_pcint |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_PERF_LVT); lapic->lvt_lint0 |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_LINT0_LVT); lapic->lvt_lint1 |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_LINT1_LVT); lapic->lvt_error |= APIC_LVT_M; vlapic_lvt_write_handler(vlapic, APIC_OFFSET_ERROR_LVT); } static int vlapic_fire_lvt(struct vlapic *vlapic, uint32_t lvt) { uint32_t vec, mode; if (lvt & APIC_LVT_M) return (0); vec = lvt & APIC_LVT_VECTOR; mode = lvt & APIC_LVT_DM; switch (mode) { case APIC_LVT_DM_FIXED: if (vec < 16) { vlapic_set_error(vlapic, APIC_ESR_SEND_ILLEGAL_VECTOR); return (0); } if (vlapic_set_intr_ready(vlapic, vec, false)) vcpu_notify_event(vlapic->vm, vlapic->vcpuid, true); break; case APIC_LVT_DM_NMI: vm_inject_nmi(vlapic->vm, vlapic->vcpuid); break; case APIC_LVT_DM_EXTINT: vm_inject_extint(vlapic->vm, vlapic->vcpuid); break; default: // Other modes ignored return (0); } return (1); } #if 1 static void dump_isrvec_stk(struct vlapic *vlapic) { int i; uint32_t *isrptr; isrptr = &vlapic->apic_page->isr0; for (i = 0; i < 8; i++) printf("ISR%d 0x%08x\n", i, isrptr[i * 4]); for (i = 0; i <= vlapic->isrvec_stk_top; i++) printf("isrvec_stk[%d] = %d\n", i, vlapic->isrvec_stk[i]); } #endif /* * Algorithm adopted from section "Interrupt, Task and Processor Priority" * in Intel Architecture Manual Vol 3a. */ static void vlapic_update_ppr(struct vlapic *vlapic) { int isrvec, tpr, ppr; /* * Note that the value on the stack at index 0 is always 0. * * This is a placeholder for the value of ISRV when none of the * bits is set in the ISRx registers. */ isrvec = vlapic->isrvec_stk[vlapic->isrvec_stk_top]; tpr = vlapic->apic_page->tpr; #if 1 { int i, lastprio, curprio, vector, idx; uint32_t *isrptr; if (vlapic->isrvec_stk_top == 0 && isrvec != 0) panic("isrvec_stk is corrupted: %d", isrvec); /* * Make sure that the priority of the nested interrupts is * always increasing. */ lastprio = -1; for (i = 1; i <= vlapic->isrvec_stk_top; i++) { curprio = PRIO(vlapic->isrvec_stk[i]); if (curprio <= lastprio) { dump_isrvec_stk(vlapic); panic("isrvec_stk does not satisfy invariant"); } lastprio = curprio; } /* * Make sure that each bit set in the ISRx registers has a * corresponding entry on the isrvec stack. 
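* For example, with vectors 0x30 and 0x61 in service the stack must read isrvec_stk[1] = 0x30 and isrvec_stk[2] = 0x61 (index 0 is always the 0 placeholder), so walking the ISR bits from the lowest vector up must match the stack entries one for one.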
*/ i = 1; isrptr = &vlapic->apic_page->isr0; for (vector = 0; vector < 256; vector++) { idx = (vector / 32) * 4; if (isrptr[idx] & (1 << (vector % 32))) { if (i > vlapic->isrvec_stk_top || vlapic->isrvec_stk[i] != vector) { dump_isrvec_stk(vlapic); panic("ISR and isrvec_stk out of sync"); } i++; } } } #endif if (PRIO(tpr) >= PRIO(isrvec)) ppr = tpr; else ppr = isrvec & 0xf0; vlapic->apic_page->ppr = ppr; VLAPIC_CTR1(vlapic, "vlapic_update_ppr 0x%02x", ppr); } static VMM_STAT(VLAPIC_GRATUITOUS_EOI, "EOI without any in-service interrupt"); static void vlapic_process_eoi(struct vlapic *vlapic) { struct LAPIC *lapic = vlapic->apic_page; uint32_t *isrptr, *tmrptr; int i, idx, bitpos, vector; isrptr = &lapic->isr0; tmrptr = &lapic->tmr0; for (i = 7; i >= 0; i--) { idx = i * 4; bitpos = fls(isrptr[idx]); if (bitpos-- != 0) { if (vlapic->isrvec_stk_top <= 0) { panic("invalid vlapic isrvec_stk_top %d", vlapic->isrvec_stk_top); } isrptr[idx] &= ~(1 << bitpos); vector = i * 32 + bitpos; VCPU_CTR1(vlapic->vm, vlapic->vcpuid, "EOI vector %d", vector); VLAPIC_CTR_ISR(vlapic, "vlapic_process_eoi"); vlapic->isrvec_stk_top--; vlapic_update_ppr(vlapic); if ((tmrptr[idx] & (1 << bitpos)) != 0) { vioapic_process_eoi(vlapic->vm, vlapic->vcpuid, vector); } return; } } VCPU_CTR0(vlapic->vm, vlapic->vcpuid, "Gratuitous EOI"); vmm_stat_incr(vlapic->vm, vlapic->vcpuid, VLAPIC_GRATUITOUS_EOI, 1); } static __inline int vlapic_get_lvt_field(uint32_t lvt, uint32_t mask) { return (lvt & mask); } static __inline int vlapic_periodic_timer(struct vlapic *vlapic) { uint32_t lvt; lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_TIMER_LVT); return (vlapic_get_lvt_field(lvt, APIC_LVTT_TM_PERIODIC)); } static VMM_STAT(VLAPIC_INTR_ERROR, "error interrupts generated by vlapic"); void vlapic_set_error(struct vlapic *vlapic, uint32_t mask) { uint32_t lvt; vlapic->esr_pending |= mask; if (vlapic->esr_firing) return; vlapic->esr_firing = 1; // The error LVT always uses the fixed delivery mode. lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_ERROR_LVT); if (vlapic_fire_lvt(vlapic, lvt | APIC_LVT_DM_FIXED)) { vmm_stat_incr(vlapic->vm, vlapic->vcpuid, VLAPIC_INTR_ERROR, 1); } vlapic->esr_firing = 0; } static VMM_STAT(VLAPIC_INTR_TIMER, "timer interrupts generated by vlapic"); static void vlapic_fire_timer(struct vlapic *vlapic) { uint32_t lvt; KASSERT(VLAPIC_TIMER_LOCKED(vlapic), ("vlapic_fire_timer not locked")); // The timer LVT always uses the fixed delivery mode. lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_TIMER_LVT); if (vlapic_fire_lvt(vlapic, lvt | APIC_LVT_DM_FIXED)) { VLAPIC_CTR0(vlapic, "vlapic timer fired"); vmm_stat_incr(vlapic->vm, vlapic->vcpuid, VLAPIC_INTR_TIMER, 1); } } static VMM_STAT(VLAPIC_INTR_CMC, "corrected machine check interrupts generated by vlapic"); void vlapic_fire_cmci(struct vlapic *vlapic) { uint32_t lvt; lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_CMCI_LVT); if (vlapic_fire_lvt(vlapic, lvt)) { vmm_stat_incr(vlapic->vm, vlapic->vcpuid, VLAPIC_INTR_CMC, 1); } } static VMM_STAT_ARRAY(LVTS_TRIGGERRED, VLAPIC_MAXLVT_INDEX + 1, "lvts triggered"); int vlapic_trigger_lvt(struct vlapic *vlapic, int vector) { uint32_t lvt; if (vlapic_enabled(vlapic) == false) { /* * When the local APIC is global/hardware disabled, * LINT[1:0] pins are configured as INTR and NMI pins, * respectively. 
*/ switch (vector) { case APIC_LVT_LINT0: vm_inject_extint(vlapic->vm, vlapic->vcpuid); break; case APIC_LVT_LINT1: vm_inject_nmi(vlapic->vm, vlapic->vcpuid); break; default: break; } return (0); } switch (vector) { case APIC_LVT_LINT0: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_LINT0_LVT); break; case APIC_LVT_LINT1: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_LINT1_LVT); break; case APIC_LVT_TIMER: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_TIMER_LVT); lvt |= APIC_LVT_DM_FIXED; break; case APIC_LVT_ERROR: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_ERROR_LVT); lvt |= APIC_LVT_DM_FIXED; break; case APIC_LVT_PMC: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_PERF_LVT); break; case APIC_LVT_THERMAL: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_THERM_LVT); break; case APIC_LVT_CMCI: lvt = vlapic_get_lvt(vlapic, APIC_OFFSET_CMCI_LVT); break; default: return (EINVAL); } if (vlapic_fire_lvt(vlapic, lvt)) { vmm_stat_array_incr(vlapic->vm, vlapic->vcpuid, LVTS_TRIGGERRED, vector, 1); } return (0); } static void vlapic_callout_handler(void *arg) { struct vlapic *vlapic; struct bintime bt, btnow; sbintime_t rem_sbt; vlapic = arg; VLAPIC_TIMER_LOCK(vlapic); if (callout_pending(&vlapic->callout)) /* callout was reset */ goto done; if (!callout_active(&vlapic->callout)) /* callout was stopped */ goto done; callout_deactivate(&vlapic->callout); vlapic_fire_timer(vlapic); if (vlapic_periodic_timer(vlapic)) { binuptime(&btnow); KASSERT(bintime_cmp(&btnow, &vlapic->timer_fire_bt, >=), ("vlapic callout at %#lx.%#lx, expected at %#lx.#%lx", btnow.sec, btnow.frac, vlapic->timer_fire_bt.sec, vlapic->timer_fire_bt.frac)); /* * Compute the delta between when the timer was supposed to * fire and the present time. */ bt = btnow; bintime_sub(&bt, &vlapic->timer_fire_bt); rem_sbt = bttosbt(vlapic->timer_period_bt); if (bintime_cmp(&bt, &vlapic->timer_period_bt, <)) { /* * Adjust the time until the next countdown downward * to account for the lost time. */ rem_sbt -= bttosbt(bt); } else { /* * If the delta is greater than the timer period then * just reset our time base instead of trying to catch * up. */ vlapic->timer_fire_bt = btnow; VLAPIC_CTR2(vlapic, "vlapic timer lagging by %lu " "usecs, period is %lu usecs - resetting time base", bttosbt(bt) / SBT_1US, bttosbt(vlapic->timer_period_bt) / SBT_1US); } bintime_add(&vlapic->timer_fire_bt, &vlapic->timer_period_bt); callout_reset_sbt(&vlapic->callout, rem_sbt, 0, vlapic_callout_handler, vlapic, 0); } done: VLAPIC_TIMER_UNLOCK(vlapic); } void vlapic_icrtmr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; sbintime_t sbt; uint32_t icr_timer; VLAPIC_TIMER_LOCK(vlapic); lapic = vlapic->apic_page; icr_timer = lapic->icr_timer; vlapic->timer_period_bt = vlapic->timer_freq_bt; bintime_mul(&vlapic->timer_period_bt, icr_timer); if (icr_timer != 0) { binuptime(&vlapic->timer_fire_bt); bintime_add(&vlapic->timer_fire_bt, &vlapic->timer_period_bt); sbt = bttosbt(vlapic->timer_period_bt); callout_reset_sbt(&vlapic->callout, sbt, 0, vlapic_callout_handler, vlapic, 0); } else callout_stop(&vlapic->callout); VLAPIC_TIMER_UNLOCK(vlapic); } /* * This function populates 'dmask' with the set of vcpus that match the * addressing specified by the (dest, phys, lowprio) tuple. * * 'x2apic_dest' specifies whether 'dest' is interpreted as x2APIC (32-bit) * or xAPIC (8-bit) destination field. 
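* For example, in the xAPIC cluster model a logical 'dest' of 0x35 decodes to cluster 3 with member mask 0x5 and matches any vcpu whose LDR names cluster 3 and has one of bits 0x5 set; the same destination in x2APIC form would be encoded as 0x00030005.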
*/ static void vlapic_calcdest(struct vm *vm, cpuset_t *dmask, uint32_t dest, bool phys, bool lowprio, bool x2apic_dest) { struct vlapic *vlapic; uint32_t dfr, ldr, ldest, cluster; uint32_t mda_flat_ldest, mda_cluster_ldest, mda_ldest, mda_cluster_id; cpuset_t amask; int vcpuid; if ((x2apic_dest && dest == 0xffffffff) || (!x2apic_dest && dest == 0xff)) { /* * Broadcast in both logical and physical modes. */ *dmask = vm_active_cpus(vm); return; } if (phys) { /* * Physical mode: destination is APIC ID. */ CPU_ZERO(dmask); vcpuid = vm_apicid2vcpuid(vm, dest); - if (vcpuid < VM_MAXCPU) + if (vcpuid < vm_get_maxcpus(vm)) CPU_SET(vcpuid, dmask); } else { /* * In the "Flat Model" the MDA is interpreted as an 8-bit wide * bitmask. This model is only available in the xAPIC mode. */ mda_flat_ldest = dest & 0xff; /* * In the "Cluster Model" the MDA is used to identify a * specific cluster and a set of APICs in that cluster. */ if (x2apic_dest) { mda_cluster_id = dest >> 16; mda_cluster_ldest = dest & 0xffff; } else { mda_cluster_id = (dest >> 4) & 0xf; mda_cluster_ldest = dest & 0xf; } /* * Logical mode: match each APIC that has a bit set * in its LDR that matches a bit in the ldest. */ CPU_ZERO(dmask); amask = vm_active_cpus(vm); while ((vcpuid = CPU_FFS(&amask)) != 0) { vcpuid--; CPU_CLR(vcpuid, &amask); vlapic = vm_lapic(vm, vcpuid); dfr = vlapic->apic_page->dfr; ldr = vlapic->apic_page->ldr; if ((dfr & APIC_DFR_MODEL_MASK) == APIC_DFR_MODEL_FLAT) { ldest = ldr >> 24; mda_ldest = mda_flat_ldest; } else if ((dfr & APIC_DFR_MODEL_MASK) == APIC_DFR_MODEL_CLUSTER) { if (x2apic(vlapic)) { cluster = ldr >> 16; ldest = ldr & 0xffff; } else { cluster = ldr >> 28; ldest = (ldr >> 24) & 0xf; } if (cluster != mda_cluster_id) continue; mda_ldest = mda_cluster_ldest; } else { /* * Guest has configured a bad logical * model for this vcpu - skip it. 
*/ VLAPIC_CTR1(vlapic, "vlapic has bad logical " "model %x - cannot deliver interrupt", dfr); continue; } if ((mda_ldest & ldest) != 0) { CPU_SET(vcpuid, dmask); if (lowprio) break; } } } } static VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu"); static void vlapic_set_tpr(struct vlapic *vlapic, uint8_t val) { struct LAPIC *lapic = vlapic->apic_page; if (lapic->tpr != val) { VCPU_CTR2(vlapic->vm, vlapic->vcpuid, "vlapic TPR changed " "from %#x to %#x", lapic->tpr, val); lapic->tpr = val; vlapic_update_ppr(vlapic); } } static uint8_t vlapic_get_tpr(struct vlapic *vlapic) { struct LAPIC *lapic = vlapic->apic_page; return (lapic->tpr); } void vlapic_set_cr8(struct vlapic *vlapic, uint64_t val) { uint8_t tpr; if (val & ~0xf) { vm_inject_gp(vlapic->vm, vlapic->vcpuid); return; } tpr = val << 4; vlapic_set_tpr(vlapic, tpr); } uint64_t vlapic_get_cr8(struct vlapic *vlapic) { uint8_t tpr; tpr = vlapic_get_tpr(vlapic); return (tpr >> 4); } int vlapic_icrlo_write_handler(struct vlapic *vlapic, bool *retu) { int i; bool phys; cpuset_t dmask; uint64_t icrval; uint32_t dest, vec, mode; struct vlapic *vlapic2; struct vm_exit *vmexit; struct LAPIC *lapic; + uint16_t maxcpus; lapic = vlapic->apic_page; lapic->icr_lo &= ~APIC_DELSTAT_PEND; icrval = ((uint64_t)lapic->icr_hi << 32) | lapic->icr_lo; if (x2apic(vlapic)) dest = icrval >> 32; else dest = icrval >> (32 + 24); vec = icrval & APIC_VECTOR_MASK; mode = icrval & APIC_DELMODE_MASK; if (mode == APIC_DELMODE_FIXED && vec < 16) { vlapic_set_error(vlapic, APIC_ESR_SEND_ILLEGAL_VECTOR); VLAPIC_CTR1(vlapic, "Ignoring invalid IPI %d", vec); return (0); } VLAPIC_CTR2(vlapic, "icrlo 0x%016lx triggered ipi %d", icrval, vec); if (mode == APIC_DELMODE_FIXED || mode == APIC_DELMODE_NMI) { switch (icrval & APIC_DEST_MASK) { case APIC_DEST_DESTFLD: phys = ((icrval & APIC_DESTMODE_LOG) == 0); vlapic_calcdest(vlapic->vm, &dmask, dest, phys, false, x2apic(vlapic)); break; case APIC_DEST_SELF: CPU_SETOF(vlapic->vcpuid, &dmask); break; case APIC_DEST_ALLISELF: dmask = vm_active_cpus(vlapic->vm); break; case APIC_DEST_ALLESELF: dmask = vm_active_cpus(vlapic->vm); CPU_CLR(vlapic->vcpuid, &dmask); break; default: CPU_ZERO(&dmask); /* satisfy gcc */ break; } while ((i = CPU_FFS(&dmask)) != 0) { i--; CPU_CLR(i, &dmask); if (mode == APIC_DELMODE_FIXED) { lapic_intr_edge(vlapic->vm, i, vec); vmm_stat_array_incr(vlapic->vm, vlapic->vcpuid, IPIS_SENT, i, 1); VLAPIC_CTR2(vlapic, "vlapic sending ipi %d " "to vcpuid %d", vec, i); } else { vm_inject_nmi(vlapic->vm, i); VLAPIC_CTR1(vlapic, "vlapic sending ipi nmi " "to vcpuid %d", i); } } return (0); /* handled completely in the kernel */ } + maxcpus = vm_get_maxcpus(vlapic->vm); if (mode == APIC_DELMODE_INIT) { if ((icrval & APIC_LEVEL_MASK) == APIC_LEVEL_DEASSERT) return (0); - if (vlapic->vcpuid == 0 && dest != 0 && dest < VM_MAXCPU) { + if (vlapic->vcpuid == 0 && dest != 0 && dest < maxcpus) { vlapic2 = vm_lapic(vlapic->vm, dest); /* move from INIT to waiting-for-SIPI state */ if (vlapic2->boot_state == BS_INIT) { vlapic2->boot_state = BS_SIPI; } return (0); } } if (mode == APIC_DELMODE_STARTUP) { - if (vlapic->vcpuid == 0 && dest != 0 && dest < VM_MAXCPU) { + if (vlapic->vcpuid == 0 && dest != 0 && dest < maxcpus) { vlapic2 = vm_lapic(vlapic->vm, dest); /* * Ignore SIPIs in any state other than wait-for-SIPI */ if (vlapic2->boot_state != BS_SIPI) return (0); vlapic2->boot_state = BS_RUNNING; *retu = true; vmexit = vm_exitinfo(vlapic->vm, vlapic->vcpuid); vmexit->exitcode = VM_EXITCODE_SPINUP_AP; vmexit->u.spinup_ap.vcpu = 
dest; vmexit->u.spinup_ap.rip = vec << PAGE_SHIFT; return (0); } } /* * This will cause a return to userland. */ return (1); } void vlapic_self_ipi_handler(struct vlapic *vlapic, uint64_t val) { int vec; KASSERT(x2apic(vlapic), ("SELF_IPI does not exist in xAPIC mode")); vec = val & 0xff; lapic_intr_edge(vlapic->vm, vlapic->vcpuid, vec); vmm_stat_array_incr(vlapic->vm, vlapic->vcpuid, IPIS_SENT, vlapic->vcpuid, 1); VLAPIC_CTR1(vlapic, "vlapic self-ipi %d", vec); } int vlapic_pending_intr(struct vlapic *vlapic, int *vecptr) { struct LAPIC *lapic = vlapic->apic_page; int idx, i, bitpos, vector; uint32_t *irrptr, val; if (vlapic->ops.pending_intr) return ((*vlapic->ops.pending_intr)(vlapic, vecptr)); irrptr = &lapic->irr0; for (i = 7; i >= 0; i--) { idx = i * 4; val = atomic_load_acq_int(&irrptr[idx]); bitpos = fls(val); if (bitpos != 0) { vector = i * 32 + (bitpos - 1); if (PRIO(vector) > PRIO(lapic->ppr)) { VLAPIC_CTR1(vlapic, "pending intr %d", vector); if (vecptr != NULL) *vecptr = vector; return (1); } else break; } } return (0); } void vlapic_intr_accepted(struct vlapic *vlapic, int vector) { struct LAPIC *lapic = vlapic->apic_page; uint32_t *irrptr, *isrptr; int idx, stk_top; if (vlapic->ops.intr_accepted) return ((*vlapic->ops.intr_accepted)(vlapic, vector)); /* * clear the ready bit for vector being accepted in irr * and set the vector as in service in isr. */ idx = (vector / 32) * 4; irrptr = &lapic->irr0; atomic_clear_int(&irrptr[idx], 1 << (vector % 32)); VLAPIC_CTR_IRR(vlapic, "vlapic_intr_accepted"); isrptr = &lapic->isr0; isrptr[idx] |= 1 << (vector % 32); VLAPIC_CTR_ISR(vlapic, "vlapic_intr_accepted"); /* * Update the PPR */ vlapic->isrvec_stk_top++; stk_top = vlapic->isrvec_stk_top; if (stk_top >= ISRVEC_STK_SIZE) panic("isrvec_stk_top overflow %d", stk_top); vlapic->isrvec_stk[stk_top] = vector; vlapic_update_ppr(vlapic); } void vlapic_svr_write_handler(struct vlapic *vlapic) { struct LAPIC *lapic; uint32_t old, new, changed; lapic = vlapic->apic_page; new = lapic->svr; old = vlapic->svr_last; vlapic->svr_last = new; changed = old ^ new; if ((changed & APIC_SVR_ENABLE) != 0) { if ((new & APIC_SVR_ENABLE) == 0) { /* * The apic is now disabled so stop the apic timer * and mask all the LVT entries. */ VLAPIC_CTR0(vlapic, "vlapic is software-disabled"); VLAPIC_TIMER_LOCK(vlapic); callout_stop(&vlapic->callout); VLAPIC_TIMER_UNLOCK(vlapic); vlapic_mask_lvts(vlapic); } else { /* * The apic is now enabled so restart the apic timer * if it is configured in periodic mode. 
*/ VLAPIC_CTR0(vlapic, "vlapic is software-enabled"); if (vlapic_periodic_timer(vlapic)) vlapic_icrtmr_write_handler(vlapic); } } } int vlapic_read(struct vlapic *vlapic, int mmio_access, uint64_t offset, uint64_t *data, bool *retu) { struct LAPIC *lapic = vlapic->apic_page; uint32_t *reg; int i; /* Ignore MMIO accesses in x2APIC mode */ if (x2apic(vlapic) && mmio_access) { VLAPIC_CTR1(vlapic, "MMIO read from offset %#lx in x2APIC mode", offset); *data = 0; goto done; } if (!x2apic(vlapic) && !mmio_access) { /* * XXX Generate GP fault for MSR accesses in xAPIC mode */ VLAPIC_CTR1(vlapic, "x2APIC MSR read from offset %#lx in " "xAPIC mode", offset); *data = 0; goto done; } if (offset > sizeof(*lapic)) { *data = 0; goto done; } offset &= ~3; switch(offset) { case APIC_OFFSET_ID: *data = lapic->id; break; case APIC_OFFSET_VER: *data = lapic->version; break; case APIC_OFFSET_TPR: *data = vlapic_get_tpr(vlapic); break; case APIC_OFFSET_APR: *data = lapic->apr; break; case APIC_OFFSET_PPR: *data = lapic->ppr; break; case APIC_OFFSET_EOI: *data = lapic->eoi; break; case APIC_OFFSET_LDR: *data = lapic->ldr; break; case APIC_OFFSET_DFR: *data = lapic->dfr; break; case APIC_OFFSET_SVR: *data = lapic->svr; break; case APIC_OFFSET_ISR0 ... APIC_OFFSET_ISR7: i = (offset - APIC_OFFSET_ISR0) >> 2; reg = &lapic->isr0; *data = *(reg + i); break; case APIC_OFFSET_TMR0 ... APIC_OFFSET_TMR7: i = (offset - APIC_OFFSET_TMR0) >> 2; reg = &lapic->tmr0; *data = *(reg + i); break; case APIC_OFFSET_IRR0 ... APIC_OFFSET_IRR7: i = (offset - APIC_OFFSET_IRR0) >> 2; reg = &lapic->irr0; *data = atomic_load_acq_int(reg + i); break; case APIC_OFFSET_ESR: *data = lapic->esr; break; case APIC_OFFSET_ICR_LOW: *data = lapic->icr_lo; if (x2apic(vlapic)) *data |= (uint64_t)lapic->icr_hi << 32; break; case APIC_OFFSET_ICR_HI: *data = lapic->icr_hi; break; case APIC_OFFSET_CMCI_LVT: case APIC_OFFSET_TIMER_LVT ... 
APIC_OFFSET_ERROR_LVT: *data = vlapic_get_lvt(vlapic, offset); #ifdef INVARIANTS reg = vlapic_get_lvtptr(vlapic, offset); KASSERT(*data == *reg, ("inconsistent lvt value at " "offset %#lx: %#lx/%#x", offset, *data, *reg)); #endif break; case APIC_OFFSET_TIMER_ICR: *data = lapic->icr_timer; break; case APIC_OFFSET_TIMER_CCR: *data = vlapic_get_ccr(vlapic); break; case APIC_OFFSET_TIMER_DCR: *data = lapic->dcr_timer; break; case APIC_OFFSET_SELF_IPI: /* * XXX generate a GP fault if vlapic is in x2apic mode */ *data = 0; break; case APIC_OFFSET_RRR: default: *data = 0; break; } done: VLAPIC_CTR2(vlapic, "vlapic read offset %#x, data %#lx", offset, *data); return 0; } int vlapic_write(struct vlapic *vlapic, int mmio_access, uint64_t offset, uint64_t data, bool *retu) { struct LAPIC *lapic = vlapic->apic_page; uint32_t *regptr; int retval; KASSERT((offset & 0xf) == 0 && offset < PAGE_SIZE, ("vlapic_write: invalid offset %#lx", offset)); VLAPIC_CTR2(vlapic, "vlapic write offset %#lx, data %#lx", offset, data); if (offset > sizeof(*lapic)) return (0); /* Ignore MMIO accesses in x2APIC mode */ if (x2apic(vlapic) && mmio_access) { VLAPIC_CTR2(vlapic, "MMIO write of %#lx to offset %#lx " "in x2APIC mode", data, offset); return (0); } /* * XXX Generate GP fault for MSR accesses in xAPIC mode */ if (!x2apic(vlapic) && !mmio_access) { VLAPIC_CTR2(vlapic, "x2APIC MSR write of %#lx to offset %#lx " "in xAPIC mode", data, offset); return (0); } retval = 0; switch(offset) { case APIC_OFFSET_ID: lapic->id = data; vlapic_id_write_handler(vlapic); break; case APIC_OFFSET_TPR: vlapic_set_tpr(vlapic, data & 0xff); break; case APIC_OFFSET_EOI: vlapic_process_eoi(vlapic); break; case APIC_OFFSET_LDR: lapic->ldr = data; vlapic_ldr_write_handler(vlapic); break; case APIC_OFFSET_DFR: lapic->dfr = data; vlapic_dfr_write_handler(vlapic); break; case APIC_OFFSET_SVR: lapic->svr = data; vlapic_svr_write_handler(vlapic); break; case APIC_OFFSET_ICR_LOW: lapic->icr_lo = data; if (x2apic(vlapic)) lapic->icr_hi = data >> 32; retval = vlapic_icrlo_write_handler(vlapic, retu); break; case APIC_OFFSET_ICR_HI: lapic->icr_hi = data; break; case APIC_OFFSET_CMCI_LVT: case APIC_OFFSET_TIMER_LVT ... APIC_OFFSET_ERROR_LVT: regptr = vlapic_get_lvtptr(vlapic, offset); *regptr = data; vlapic_lvt_write_handler(vlapic, offset); break; case APIC_OFFSET_TIMER_ICR: lapic->icr_timer = data; vlapic_icrtmr_write_handler(vlapic); break; case APIC_OFFSET_TIMER_DCR: lapic->dcr_timer = data; vlapic_dcr_write_handler(vlapic); break; case APIC_OFFSET_ESR: vlapic_esr_write_handler(vlapic); break; case APIC_OFFSET_SELF_IPI: if (x2apic(vlapic)) vlapic_self_ipi_handler(vlapic, data); break; case APIC_OFFSET_VER: case APIC_OFFSET_APR: case APIC_OFFSET_PPR: case APIC_OFFSET_RRR: case APIC_OFFSET_ISR0 ... APIC_OFFSET_ISR7: case APIC_OFFSET_TMR0 ... APIC_OFFSET_TMR7: case APIC_OFFSET_IRR0 ... APIC_OFFSET_IRR7: case APIC_OFFSET_TIMER_CCR: default: // Read only. 
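// Writes to these read-only offsets are dropped silently; no fault is injected back into the guest.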
break; } return (retval); } static void vlapic_reset(struct vlapic *vlapic) { struct LAPIC *lapic; lapic = vlapic->apic_page; bzero(lapic, sizeof(struct LAPIC)); lapic->id = vlapic_get_id(vlapic); lapic->version = VLAPIC_VERSION; lapic->version |= (VLAPIC_MAXLVT_INDEX << MAXLVTSHIFT); lapic->dfr = 0xffffffff; lapic->svr = APIC_SVR_VECTOR; vlapic_mask_lvts(vlapic); vlapic_reset_tmr(vlapic); lapic->dcr_timer = 0; vlapic_dcr_write_handler(vlapic); if (vlapic->vcpuid == 0) vlapic->boot_state = BS_RUNNING; /* BSP */ else vlapic->boot_state = BS_INIT; /* AP */ vlapic->svr_last = lapic->svr; } void vlapic_init(struct vlapic *vlapic) { KASSERT(vlapic->vm != NULL, ("vlapic_init: vm is not initialized")); - KASSERT(vlapic->vcpuid >= 0 && vlapic->vcpuid < VM_MAXCPU, + KASSERT(vlapic->vcpuid >= 0 && + vlapic->vcpuid < vm_get_maxcpus(vlapic->vm), ("vlapic_init: vcpuid is not initialized")); KASSERT(vlapic->apic_page != NULL, ("vlapic_init: apic_page is not " "initialized")); /* * If the vlapic is configured in x2apic mode then it will be * accessed in the critical section via the MSR emulation code. * * Therefore the timer mutex must be a spinlock because blockable * mutexes cannot be acquired in a critical section. */ mtx_init(&vlapic->timer_mtx, "vlapic timer mtx", NULL, MTX_SPIN); callout_init(&vlapic->callout, 1); vlapic->msr_apicbase = DEFAULT_APIC_BASE | APICBASE_ENABLED; if (vlapic->vcpuid == 0) vlapic->msr_apicbase |= APICBASE_BSP; vlapic_reset(vlapic); } void vlapic_cleanup(struct vlapic *vlapic) { callout_drain(&vlapic->callout); } uint64_t vlapic_get_apicbase(struct vlapic *vlapic) { return (vlapic->msr_apicbase); } int vlapic_set_apicbase(struct vlapic *vlapic, uint64_t new) { if (vlapic->msr_apicbase != new) { VLAPIC_CTR2(vlapic, "Changing APIC_BASE MSR from %#lx to %#lx " "not supported", vlapic->msr_apicbase, new); return (-1); } return (0); } void vlapic_set_x2apic_state(struct vm *vm, int vcpuid, enum x2apic_state state) { struct vlapic *vlapic; struct LAPIC *lapic; vlapic = vm_lapic(vm, vcpuid); if (state == X2APIC_DISABLED) vlapic->msr_apicbase &= ~APICBASE_X2APIC; else vlapic->msr_apicbase |= APICBASE_X2APIC; /* * Reset the local APIC registers whose values are mode-dependent. * * XXX this works because the APIC mode can be changed only at vcpu * initialization time. */ lapic = vlapic->apic_page; lapic->id = vlapic_get_id(vlapic); if (x2apic(vlapic)) { lapic->ldr = x2apic_ldr(vlapic); lapic->dfr = 0; } else { lapic->ldr = 0; lapic->dfr = 0xffffffff; } if (state == X2APIC_ENABLED) { if (vlapic->ops.enable_x2apic_mode) (*vlapic->ops.enable_x2apic_mode)(vlapic); } } void vlapic_deliver_intr(struct vm *vm, bool level, uint32_t dest, bool phys, int delmode, int vec) { bool lowprio; int vcpuid; cpuset_t dmask; if (delmode != IOART_DELFIXED && delmode != IOART_DELLOPRI && delmode != IOART_DELEXINT) { VM_CTR1(vm, "vlapic intr invalid delmode %#x", delmode); return; } lowprio = (delmode == IOART_DELLOPRI); /* * We don't provide any virtual interrupt redirection hardware so * all interrupts originating from the ioapic or MSI specify the * 'dest' in the legacy xAPIC format. */ vlapic_calcdest(vm, &dmask, dest, phys, lowprio, false); while ((vcpuid = CPU_FFS(&dmask)) != 0) { vcpuid--; CPU_CLR(vcpuid, &dmask); if (delmode == IOART_DELEXINT) { vm_inject_extint(vm, vcpuid); } else { lapic_set_intr(vm, vcpuid, vec, level); } } } void vlapic_post_intr(struct vlapic *vlapic, int hostcpu, int ipinum) { /* * Post an interrupt to the vcpu currently running on 'hostcpu'. 
* * This is done by leveraging features like Posted Interrupts (Intel) * Doorbell MSR (AMD AVIC) that avoid a VM exit. * * If neither of these features are available then fallback to * sending an IPI to 'hostcpu'. */ if (vlapic->ops.post_intr) (*vlapic->ops.post_intr)(vlapic, hostcpu); else ipi_cpu(hostcpu, ipinum); } bool vlapic_enabled(struct vlapic *vlapic) { struct LAPIC *lapic = vlapic->apic_page; if ((vlapic->msr_apicbase & APICBASE_ENABLED) != 0 && (lapic->svr & APIC_SVR_ENABLE) != 0) return (true); else return (false); } static void vlapic_set_tmr(struct vlapic *vlapic, int vector, bool level) { struct LAPIC *lapic; uint32_t *tmrptr, mask; int idx; lapic = vlapic->apic_page; tmrptr = &lapic->tmr0; idx = (vector / 32) * 4; mask = 1 << (vector % 32); if (level) tmrptr[idx] |= mask; else tmrptr[idx] &= ~mask; if (vlapic->ops.set_tmr != NULL) (*vlapic->ops.set_tmr)(vlapic, vector, level); } void vlapic_reset_tmr(struct vlapic *vlapic) { int vector; VLAPIC_CTR0(vlapic, "vlapic resetting all vectors to edge-triggered"); for (vector = 0; vector <= 255; vector++) vlapic_set_tmr(vlapic, vector, false); } void vlapic_set_tmr_level(struct vlapic *vlapic, uint32_t dest, bool phys, int delmode, int vector) { cpuset_t dmask; bool lowprio; KASSERT(vector >= 0 && vector <= 255, ("invalid vector %d", vector)); /* * A level trigger is valid only for fixed and lowprio delivery modes. */ if (delmode != APIC_DELMODE_FIXED && delmode != APIC_DELMODE_LOWPRIO) { VLAPIC_CTR1(vlapic, "Ignoring level trigger-mode for " "delivery-mode %d", delmode); return; } lowprio = (delmode == APIC_DELMODE_LOWPRIO); vlapic_calcdest(vlapic->vm, &dmask, dest, phys, lowprio, false); if (!CPU_ISSET(vlapic->vcpuid, &dmask)) return; VLAPIC_CTR1(vlapic, "vector %d set to level-triggered", vector); vlapic_set_tmr(vlapic, vector, true); } Index: user/ngie/bug-237403/sys/amd64/vmm/vmm.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/vmm.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/vmm.c (revision 346926) @@ -1,2711 +1,2718 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vmm_ioport.h" #include "vmm_ktr.h" #include "vmm_host.h" #include "vmm_mem.h" #include "vmm_util.h" #include "vatpic.h" #include "vatpit.h" #include "vhpet.h" #include "vioapic.h" #include "vlapic.h" #include "vpmtmr.h" #include "vrtc.h" #include "vmm_stat.h" #include "vmm_lapic.h" #include "io/ppt.h" #include "io/iommu.h" struct vlapic; /* * Initialization: * (a) allocated when vcpu is created * (i) initialized when vcpu is created and when it is reinitialized * (o) initialized the first time the vcpu is created * (x) initialized before use */ struct vcpu { struct mtx mtx; /* (o) protects 'state' and 'hostcpu' */ enum vcpu_state state; /* (o) vcpu state */ int hostcpu; /* (o) vcpu's host cpu */ int reqidle; /* (i) request vcpu to idle */ struct vlapic *vlapic; /* (i) APIC device model */ enum x2apic_state x2apic_state; /* (i) APIC mode */ uint64_t exitintinfo; /* (i) events pending at VM exit */ int nmi_pending; /* (i) NMI pending */ int extint_pending; /* (i) INTR pending */ int exception_pending; /* (i) exception pending */ int exc_vector; /* (x) exception collateral */ int exc_errcode_valid; uint32_t exc_errcode; struct savefpu *guestfpu; /* (a,i) guest fpu state */ uint64_t guest_xcr0; /* (i) guest %xcr0 register */ void *stats; /* (a,i) statistics */ struct vm_exit exitinfo; /* (x) exit reason and collateral */ uint64_t nextrip; /* (x) next instruction to execute */ }; #define vcpu_lock_initialized(v) mtx_initialized(&((v)->mtx)) #define vcpu_lock_init(v) mtx_init(&((v)->mtx), "vcpu lock", 0, MTX_SPIN) #define vcpu_lock(v) mtx_lock_spin(&((v)->mtx)) #define vcpu_unlock(v) mtx_unlock_spin(&((v)->mtx)) #define vcpu_assert_locked(v) mtx_assert(&((v)->mtx), MA_OWNED) struct mem_seg { size_t len; bool sysmem; struct vm_object *object; }; #define VM_MAX_MEMSEGS 3 struct mem_map { vm_paddr_t gpa; size_t len; vm_ooffset_t segoff; int segid; int prot; int flags; }; #define VM_MAX_MEMMAPS 4 /* * Initialization: * (o) initialized the first time the VM is created * (i) initialized when VM is created and when it is reinitialized * (x) initialized before use */ struct vm { void *cookie; /* (i) cpu-specific data */ void *iommu; /* (x) iommu-specific data */ struct vhpet *vhpet; /* (i) virtual HPET */ struct vioapic *vioapic; /* (i) virtual ioapic */ struct vatpic *vatpic; /* (i) virtual atpic */ struct vatpit *vatpit; /* (i) virtual atpit */ struct vpmtmr *vpmtmr; /* (i) virtual ACPI PM timer */ struct vrtc *vrtc; /* (o) virtual RTC */ volatile cpuset_t active_cpus; /* (i) active vcpus */ volatile cpuset_t debug_cpus; /* (i) vcpus stopped for debug */ int suspend; /* (i) stop VM execution */ volatile cpuset_t suspended_cpus; /* (i) suspended vcpus */ volatile cpuset_t halted_cpus; /* (x) cpus in a hard halt */ cpuset_t rendezvous_req_cpus; /* (x) rendezvous requested */ cpuset_t rendezvous_done_cpus; /* (x) rendezvous finished */ void *rendezvous_arg; /* (x) rendezvous func/arg */ vm_rendezvous_func_t rendezvous_func; struct mtx rendezvous_mtx; /* (o) rendezvous lock */ struct mem_map mem_maps[VM_MAX_MEMMAPS]; /* (i) guest address space */ struct mem_seg mem_segs[VM_MAX_MEMSEGS]; /* (o) guest memory regions */ struct vmspace *vmspace; /* (o) guest's address space */ char 
name[VM_MAX_NAMELEN]; /* (o) virtual machine name */ struct vcpu vcpu[VM_MAXCPU]; /* (i) guest vcpus */ /* The following describe the vm cpu topology */ uint16_t sockets; /* (o) num of sockets */ uint16_t cores; /* (o) num of cores/socket */ uint16_t threads; /* (o) num of threads/core */ uint16_t maxcpus; /* (o) max pluggable cpus */ }; static int vmm_initialized; static struct vmm_ops *ops; #define VMM_INIT(num) (ops != NULL ? (*ops->init)(num) : 0) #define VMM_CLEANUP() (ops != NULL ? (*ops->cleanup)() : 0) #define VMM_RESUME() (ops != NULL ? (*ops->resume)() : 0) #define VMINIT(vm, pmap) (ops != NULL ? (*ops->vminit)(vm, pmap): NULL) #define VMRUN(vmi, vcpu, rip, pmap, evinfo) \ (ops != NULL ? (*ops->vmrun)(vmi, vcpu, rip, pmap, evinfo) : ENXIO) #define VMCLEANUP(vmi) (ops != NULL ? (*ops->vmcleanup)(vmi) : NULL) #define VMSPACE_ALLOC(min, max) \ (ops != NULL ? (*ops->vmspace_alloc)(min, max) : NULL) #define VMSPACE_FREE(vmspace) \ (ops != NULL ? (*ops->vmspace_free)(vmspace) : ENXIO) #define VMGETREG(vmi, vcpu, num, retval) \ (ops != NULL ? (*ops->vmgetreg)(vmi, vcpu, num, retval) : ENXIO) #define VMSETREG(vmi, vcpu, num, val) \ (ops != NULL ? (*ops->vmsetreg)(vmi, vcpu, num, val) : ENXIO) #define VMGETDESC(vmi, vcpu, num, desc) \ (ops != NULL ? (*ops->vmgetdesc)(vmi, vcpu, num, desc) : ENXIO) #define VMSETDESC(vmi, vcpu, num, desc) \ (ops != NULL ? (*ops->vmsetdesc)(vmi, vcpu, num, desc) : ENXIO) #define VMGETCAP(vmi, vcpu, num, retval) \ (ops != NULL ? (*ops->vmgetcap)(vmi, vcpu, num, retval) : ENXIO) #define VMSETCAP(vmi, vcpu, num, val) \ (ops != NULL ? (*ops->vmsetcap)(vmi, vcpu, num, val) : ENXIO) #define VLAPIC_INIT(vmi, vcpu) \ (ops != NULL ? (*ops->vlapic_init)(vmi, vcpu) : NULL) #define VLAPIC_CLEANUP(vmi, vlapic) \ (ops != NULL ? (*ops->vlapic_cleanup)(vmi, vlapic) : NULL) #define fpu_start_emulating() load_cr0(rcr0() | CR0_TS) #define fpu_stop_emulating() clts() SDT_PROVIDER_DEFINE(vmm); static MALLOC_DEFINE(M_VM, "vm", "vm"); /* statistics */ static VMM_STAT(VCPU_TOTAL_RUNTIME, "vcpu total runtime"); SYSCTL_NODE(_hw, OID_AUTO, vmm, CTLFLAG_RW, NULL, NULL); /* * Halt the guest if all vcpus are executing a HLT instruction with * interrupts disabled. 
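/*
 * Illustrative, self-contained sketch (the demo_* names are hypothetical and
 * not part of the vmm code): every VMGETREG()-style macro above dispatches
 * through the backend 'ops' table, which module initialization points at the
 * Intel or AMD implementation, and degrades to ENXIO, NULL or 0 when no
 * backend is present.
 */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_vmm_ops {
	int	(*vmgetreg)(void *vmi, int vcpu, int num, uint64_t *retval);
};

static struct demo_vmm_ops *demo_ops;	/* stays NULL unless a backend probes */

static int
demo_vmgetreg(void *vmi, int vcpu, int num, uint64_t *retval)
{
	/* Same guard as VMGETREG(): without a backend the request cannot work. */
	return (demo_ops != NULL ?
	    (*demo_ops->vmgetreg)(vmi, vcpu, num, retval) : ENXIO);
}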
*/ static int halt_detection_enabled = 1; SYSCTL_INT(_hw_vmm, OID_AUTO, halt_detection, CTLFLAG_RDTUN, &halt_detection_enabled, 0, "Halt VM if all vcpus execute HLT with interrupts disabled"); static int vmm_ipinum; SYSCTL_INT(_hw_vmm, OID_AUTO, ipinum, CTLFLAG_RD, &vmm_ipinum, 0, "IPI vector used for vcpu notifications"); static int trace_guest_exceptions; SYSCTL_INT(_hw_vmm, OID_AUTO, trace_guest_exceptions, CTLFLAG_RDTUN, &trace_guest_exceptions, 0, "Trap into hypervisor on all guest exceptions and reflect them back"); static void vm_free_memmap(struct vm *vm, int ident); static bool sysmem_mapping(struct vm *vm, struct mem_map *mm); static void vcpu_notify_event_locked(struct vcpu *vcpu, bool lapic_intr); #ifdef KTR static const char * vcpu_state2str(enum vcpu_state state) { switch (state) { case VCPU_IDLE: return ("idle"); case VCPU_FROZEN: return ("frozen"); case VCPU_RUNNING: return ("running"); case VCPU_SLEEPING: return ("sleeping"); default: return ("unknown"); } } #endif static void vcpu_cleanup(struct vm *vm, int i, bool destroy) { struct vcpu *vcpu = &vm->vcpu[i]; VLAPIC_CLEANUP(vm->cookie, vcpu->vlapic); if (destroy) { vmm_stat_free(vcpu->stats); fpu_save_area_free(vcpu->guestfpu); } } static void vcpu_init(struct vm *vm, int vcpu_id, bool create) { struct vcpu *vcpu; - KASSERT(vcpu_id >= 0 && vcpu_id < VM_MAXCPU, + KASSERT(vcpu_id >= 0 && vcpu_id < vm->maxcpus, ("vcpu_init: invalid vcpu %d", vcpu_id)); vcpu = &vm->vcpu[vcpu_id]; if (create) { KASSERT(!vcpu_lock_initialized(vcpu), ("vcpu %d already " "initialized", vcpu_id)); vcpu_lock_init(vcpu); vcpu->state = VCPU_IDLE; vcpu->hostcpu = NOCPU; vcpu->guestfpu = fpu_save_area_alloc(); vcpu->stats = vmm_stat_alloc(); } vcpu->vlapic = VLAPIC_INIT(vm->cookie, vcpu_id); vm_set_x2apic_state(vm, vcpu_id, X2APIC_DISABLED); vcpu->reqidle = 0; vcpu->exitintinfo = 0; vcpu->nmi_pending = 0; vcpu->extint_pending = 0; vcpu->exception_pending = 0; vcpu->guest_xcr0 = XFEATURE_ENABLED_X87; fpu_save_area_reset(vcpu->guestfpu); vmm_stat_init(vcpu->stats); } int vcpu_trace_exceptions(struct vm *vm, int vcpuid) { return (trace_guest_exceptions); } struct vm_exit * vm_exitinfo(struct vm *vm, int cpuid) { struct vcpu *vcpu; - if (cpuid < 0 || cpuid >= VM_MAXCPU) + if (cpuid < 0 || cpuid >= vm->maxcpus) panic("vm_exitinfo: invalid cpuid %d", cpuid); vcpu = &vm->vcpu[cpuid]; return (&vcpu->exitinfo); } static void vmm_resume(void) { VMM_RESUME(); } static int vmm_init(void) { int error; vmm_host_state_init(); vmm_ipinum = lapic_ipi_alloc(pti ? 
&IDTVEC(justreturn1_pti) : &IDTVEC(justreturn)); if (vmm_ipinum < 0) vmm_ipinum = IPI_AST; error = vmm_mem_init(); if (error) return (error); if (vmm_is_intel()) ops = &vmm_ops_intel; else if (vmm_is_amd()) ops = &vmm_ops_amd; else return (ENXIO); vmm_resume_p = vmm_resume; return (VMM_INIT(vmm_ipinum)); } static int vmm_handler(module_t mod, int what, void *arg) { int error; switch (what) { case MOD_LOAD: vmmdev_init(); error = vmm_init(); if (error == 0) vmm_initialized = 1; break; case MOD_UNLOAD: error = vmmdev_cleanup(); if (error == 0) { vmm_resume_p = NULL; iommu_cleanup(); if (vmm_ipinum != IPI_AST) lapic_ipi_free(vmm_ipinum); error = VMM_CLEANUP(); /* * Something bad happened - prevent new * VMs from being created */ if (error) vmm_initialized = 0; } break; default: error = 0; break; } return (error); } static moduledata_t vmm_kmod = { "vmm", vmm_handler, NULL }; /* * vmm initialization has the following dependencies: * * - VT-x initialization requires smp_rendezvous() and therefore must happen * after SMP is fully functional (after SI_SUB_SMP). */ DECLARE_MODULE(vmm, vmm_kmod, SI_SUB_SMP + 1, SI_ORDER_ANY); MODULE_VERSION(vmm, 1); static void vm_init(struct vm *vm, bool create) { int i; vm->cookie = VMINIT(vm, vmspace_pmap(vm->vmspace)); vm->iommu = NULL; vm->vioapic = vioapic_init(vm); vm->vhpet = vhpet_init(vm); vm->vatpic = vatpic_init(vm); vm->vatpit = vatpit_init(vm); vm->vpmtmr = vpmtmr_init(vm); if (create) vm->vrtc = vrtc_init(vm); CPU_ZERO(&vm->active_cpus); CPU_ZERO(&vm->debug_cpus); vm->suspend = 0; CPU_ZERO(&vm->suspended_cpus); - for (i = 0; i < VM_MAXCPU; i++) + for (i = 0; i < vm->maxcpus; i++) vcpu_init(vm, i, create); } /* * The default CPU topology is a single thread per package. */ u_int cores_per_package = 1; u_int threads_per_core = 1; int vm_create(const char *name, struct vm **retvm) { struct vm *vm; struct vmspace *vmspace; /* * If vmm.ko could not be successfully initialized then don't attempt * to create the virtual machine. */ if (!vmm_initialized) return (ENXIO); if (name == NULL || strlen(name) >= VM_MAX_NAMELEN) return (EINVAL); vmspace = VMSPACE_ALLOC(0, VM_MAXUSER_ADDRESS); if (vmspace == NULL) return (ENOMEM); vm = malloc(sizeof(struct vm), M_VM, M_WAITOK | M_ZERO); strcpy(vm->name, name); vm->vmspace = vmspace; mtx_init(&vm->rendezvous_mtx, "vm rendezvous lock", 0, MTX_DEF); vm->sockets = 1; vm->cores = cores_per_package; /* XXX backwards compatibility */ vm->threads = threads_per_core; /* XXX backwards compatibility */ - vm->maxcpus = 0; /* XXX not implemented */ + vm->maxcpus = VM_MAXCPU; /* XXX temp to keep code working */ vm_init(vm, true); *retvm = vm; return (0); } void vm_get_topology(struct vm *vm, uint16_t *sockets, uint16_t *cores, uint16_t *threads, uint16_t *maxcpus) { *sockets = vm->sockets; *cores = vm->cores; *threads = vm->threads; *maxcpus = vm->maxcpus; } +uint16_t +vm_get_maxcpus(struct vm *vm) +{ + return (vm->maxcpus); +} + int vm_set_topology(struct vm *vm, uint16_t sockets, uint16_t cores, uint16_t threads, uint16_t maxcpus) { if (maxcpus != 0) return (EINVAL); /* XXX remove when supported */ - if ((sockets * cores * threads) > VM_MAXCPU) + if ((sockets * cores * threads) > vm->maxcpus) return (EINVAL); /* XXX need to check sockets * cores * threads == vCPU, how? 
*/ vm->sockets = sockets; vm->cores = cores; vm->threads = threads; - vm->maxcpus = maxcpus; + vm->maxcpus = VM_MAXCPU; /* XXX temp to keep code working */ return(0); } static void vm_cleanup(struct vm *vm, bool destroy) { struct mem_map *mm; int i; ppt_unassign_all(vm); if (vm->iommu != NULL) iommu_destroy_domain(vm->iommu); if (destroy) vrtc_cleanup(vm->vrtc); else vrtc_reset(vm->vrtc); vpmtmr_cleanup(vm->vpmtmr); vatpit_cleanup(vm->vatpit); vhpet_cleanup(vm->vhpet); vatpic_cleanup(vm->vatpic); vioapic_cleanup(vm->vioapic); - for (i = 0; i < VM_MAXCPU; i++) + for (i = 0; i < vm->maxcpus; i++) vcpu_cleanup(vm, i, destroy); VMCLEANUP(vm->cookie); /* * System memory is removed from the guest address space only when * the VM is destroyed. This is because the mapping remains the same * across VM reset. * * Device memory can be relocated by the guest (e.g. using PCI BARs) * so those mappings are removed on a VM reset. */ for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (destroy || !sysmem_mapping(vm, mm)) vm_free_memmap(vm, i); } if (destroy) { for (i = 0; i < VM_MAX_MEMSEGS; i++) vm_free_memseg(vm, i); VMSPACE_FREE(vm->vmspace); vm->vmspace = NULL; } } void vm_destroy(struct vm *vm) { vm_cleanup(vm, true); free(vm, M_VM); } int vm_reinit(struct vm *vm) { int error; /* * A virtual machine can be reset only if all vcpus are suspended. */ if (CPU_CMP(&vm->suspended_cpus, &vm->active_cpus) == 0) { vm_cleanup(vm, false); vm_init(vm, false); error = 0; } else { error = EBUSY; } return (error); } const char * vm_name(struct vm *vm) { return (vm->name); } int vm_map_mmio(struct vm *vm, vm_paddr_t gpa, size_t len, vm_paddr_t hpa) { vm_object_t obj; if ((obj = vmm_mmio_alloc(vm->vmspace, gpa, len, hpa)) == NULL) return (ENOMEM); else return (0); } int vm_unmap_mmio(struct vm *vm, vm_paddr_t gpa, size_t len) { vmm_mmio_free(vm->vmspace, gpa, len); return (0); } /* * Return 'true' if 'gpa' is allocated in the guest address space. * * This function is called in the context of a running vcpu which acts as * an implicit lock on 'vm->mem_maps[]'. 
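/*
 * Illustrative sketch (hypothetical demo_* names): vm_mem_allocated() below
 * answers "is this guest-physical address backed?" with a linear scan of the
 * small fixed-size mem_maps[] array.  The same range test in isolation:
 */
#include <stdbool.h>
#include <stdint.h>

struct demo_map {
	uint64_t	gpa;	/* guest-physical base of the mapping */
	uint64_t	len;	/* 0 marks an unused slot */
};

static bool
demo_gpa_mapped(const struct demo_map *maps, int nmaps, uint64_t gpa)
{
	int i;

	for (i = 0; i < nmaps; i++) {
		if (maps[i].len != 0 && gpa >= maps[i].gpa &&
		    gpa < maps[i].gpa + maps[i].len)
			return (true);
	}
	return (false);
}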
*/ bool vm_mem_allocated(struct vm *vm, int vcpuid, vm_paddr_t gpa) { struct mem_map *mm; int i; #ifdef INVARIANTS int hostcpu, state; state = vcpu_get_state(vm, vcpuid, &hostcpu); KASSERT(state == VCPU_RUNNING && hostcpu == curcpu, ("%s: invalid vcpu state %d/%d", __func__, state, hostcpu)); #endif for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (mm->len != 0 && gpa >= mm->gpa && gpa < mm->gpa + mm->len) return (true); /* 'gpa' is sysmem or devmem */ } if (ppt_is_mmio(vm, gpa)) return (true); /* 'gpa' is pci passthru mmio */ return (false); } int vm_alloc_memseg(struct vm *vm, int ident, size_t len, bool sysmem) { struct mem_seg *seg; vm_object_t obj; if (ident < 0 || ident >= VM_MAX_MEMSEGS) return (EINVAL); if (len == 0 || (len & PAGE_MASK)) return (EINVAL); seg = &vm->mem_segs[ident]; if (seg->object != NULL) { if (seg->len == len && seg->sysmem == sysmem) return (EEXIST); else return (EINVAL); } obj = vm_object_allocate(OBJT_DEFAULT, len >> PAGE_SHIFT); if (obj == NULL) return (ENOMEM); seg->len = len; seg->object = obj; seg->sysmem = sysmem; return (0); } int vm_get_memseg(struct vm *vm, int ident, size_t *len, bool *sysmem, vm_object_t *objptr) { struct mem_seg *seg; if (ident < 0 || ident >= VM_MAX_MEMSEGS) return (EINVAL); seg = &vm->mem_segs[ident]; if (len) *len = seg->len; if (sysmem) *sysmem = seg->sysmem; if (objptr) *objptr = seg->object; return (0); } void vm_free_memseg(struct vm *vm, int ident) { struct mem_seg *seg; KASSERT(ident >= 0 && ident < VM_MAX_MEMSEGS, ("%s: invalid memseg ident %d", __func__, ident)); seg = &vm->mem_segs[ident]; if (seg->object != NULL) { vm_object_deallocate(seg->object); bzero(seg, sizeof(struct mem_seg)); } } int vm_mmap_memseg(struct vm *vm, vm_paddr_t gpa, int segid, vm_ooffset_t first, size_t len, int prot, int flags) { struct mem_seg *seg; struct mem_map *m, *map; vm_ooffset_t last; int i, error; if (prot == 0 || (prot & ~(VM_PROT_ALL)) != 0) return (EINVAL); if (flags & ~VM_MEMMAP_F_WIRED) return (EINVAL); if (segid < 0 || segid >= VM_MAX_MEMSEGS) return (EINVAL); seg = &vm->mem_segs[segid]; if (seg->object == NULL) return (EINVAL); last = first + len; if (first < 0 || first >= last || last > seg->len) return (EINVAL); if ((gpa | first | last) & PAGE_MASK) return (EINVAL); map = NULL; for (i = 0; i < VM_MAX_MEMMAPS; i++) { m = &vm->mem_maps[i]; if (m->len == 0) { map = m; break; } } if (map == NULL) return (ENOSPC); error = vm_map_find(&vm->vmspace->vm_map, seg->object, first, &gpa, len, 0, VMFS_NO_SPACE, prot, prot, 0); if (error != KERN_SUCCESS) return (EFAULT); vm_object_reference(seg->object); if (flags & VM_MEMMAP_F_WIRED) { error = vm_map_wire(&vm->vmspace->vm_map, gpa, gpa + len, VM_MAP_WIRE_USER | VM_MAP_WIRE_NOHOLES); if (error != KERN_SUCCESS) { vm_map_remove(&vm->vmspace->vm_map, gpa, gpa + len); return (EFAULT); } } map->gpa = gpa; map->len = len; map->segoff = first; map->segid = segid; map->prot = prot; map->flags = flags; return (0); } int vm_mmap_getnext(struct vm *vm, vm_paddr_t *gpa, int *segid, vm_ooffset_t *segoff, size_t *len, int *prot, int *flags) { struct mem_map *mm, *mmnext; int i; mmnext = NULL; for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (mm->len == 0 || mm->gpa < *gpa) continue; if (mmnext == NULL || mm->gpa < mmnext->gpa) mmnext = mm; } if (mmnext != NULL) { *gpa = mmnext->gpa; if (segid) *segid = mmnext->segid; if (segoff) *segoff = mmnext->segoff; if (len) *len = mmnext->len; if (prot) *prot = mmnext->prot; if (flags) *flags = mmnext->flags; return (0); } else { 
return (ENOENT); } } static void vm_free_memmap(struct vm *vm, int ident) { struct mem_map *mm; int error; mm = &vm->mem_maps[ident]; if (mm->len) { error = vm_map_remove(&vm->vmspace->vm_map, mm->gpa, mm->gpa + mm->len); KASSERT(error == KERN_SUCCESS, ("%s: vm_map_remove error %d", __func__, error)); bzero(mm, sizeof(struct mem_map)); } } static __inline bool sysmem_mapping(struct vm *vm, struct mem_map *mm) { if (mm->len != 0 && vm->mem_segs[mm->segid].sysmem) return (true); else return (false); } vm_paddr_t vmm_sysmem_maxaddr(struct vm *vm) { struct mem_map *mm; vm_paddr_t maxaddr; int i; maxaddr = 0; for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (sysmem_mapping(vm, mm)) { if (maxaddr < mm->gpa + mm->len) maxaddr = mm->gpa + mm->len; } } return (maxaddr); } static void vm_iommu_modify(struct vm *vm, boolean_t map) { int i, sz; vm_paddr_t gpa, hpa; struct mem_map *mm; void *vp, *cookie, *host_domain; sz = PAGE_SIZE; host_domain = iommu_host_domain(); for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (!sysmem_mapping(vm, mm)) continue; if (map) { KASSERT((mm->flags & VM_MEMMAP_F_IOMMU) == 0, ("iommu map found invalid memmap %#lx/%#lx/%#x", mm->gpa, mm->len, mm->flags)); if ((mm->flags & VM_MEMMAP_F_WIRED) == 0) continue; mm->flags |= VM_MEMMAP_F_IOMMU; } else { if ((mm->flags & VM_MEMMAP_F_IOMMU) == 0) continue; mm->flags &= ~VM_MEMMAP_F_IOMMU; KASSERT((mm->flags & VM_MEMMAP_F_WIRED) != 0, ("iommu unmap found invalid memmap %#lx/%#lx/%#x", mm->gpa, mm->len, mm->flags)); } gpa = mm->gpa; while (gpa < mm->gpa + mm->len) { vp = vm_gpa_hold(vm, -1, gpa, PAGE_SIZE, VM_PROT_WRITE, &cookie); KASSERT(vp != NULL, ("vm(%s) could not map gpa %#lx", vm_name(vm), gpa)); vm_gpa_release(cookie); hpa = DMAP_TO_PHYS((uintptr_t)vp); if (map) { iommu_create_mapping(vm->iommu, gpa, hpa, sz); iommu_remove_mapping(host_domain, hpa, sz); } else { iommu_remove_mapping(vm->iommu, gpa, sz); iommu_create_mapping(host_domain, hpa, hpa, sz); } gpa += PAGE_SIZE; } } /* * Invalidate the cached translations associated with the domain * from which pages were removed. */ if (map) iommu_invalidate_tlb(host_domain); else iommu_invalidate_tlb(vm->iommu); } #define vm_iommu_unmap(vm) vm_iommu_modify((vm), FALSE) #define vm_iommu_map(vm) vm_iommu_modify((vm), TRUE) int vm_unassign_pptdev(struct vm *vm, int bus, int slot, int func) { int error; error = ppt_unassign_device(vm, bus, slot, func); if (error) return (error); if (ppt_assigned_devices(vm) == 0) vm_iommu_unmap(vm); return (0); } int vm_assign_pptdev(struct vm *vm, int bus, int slot, int func) { int error; vm_paddr_t maxaddr; /* Set up the IOMMU to do the 'gpa' to 'hpa' translation */ if (ppt_assigned_devices(vm) == 0) { KASSERT(vm->iommu == NULL, ("vm_assign_pptdev: iommu must be NULL")); maxaddr = vmm_sysmem_maxaddr(vm); vm->iommu = iommu_create_domain(maxaddr); if (vm->iommu == NULL) return (ENXIO); vm_iommu_map(vm); } error = ppt_assign_device(vm, bus, slot, func); return (error); } void * vm_gpa_hold(struct vm *vm, int vcpuid, vm_paddr_t gpa, size_t len, int reqprot, void **cookie) { int i, count, pageoff; struct mem_map *mm; vm_page_t m; #ifdef INVARIANTS /* * All vcpus are frozen by ioctls that modify the memory map * (e.g. VM_MMAP_MEMSEG). Therefore 'vm->memmap[]' stability is * guaranteed if at least one vcpu is in the VCPU_FROZEN state. 
*/ int state; - KASSERT(vcpuid >= -1 && vcpuid < VM_MAXCPU, ("%s: invalid vcpuid %d", + KASSERT(vcpuid >= -1 && vcpuid < vm->maxcpus, ("%s: invalid vcpuid %d", __func__, vcpuid)); - for (i = 0; i < VM_MAXCPU; i++) { + for (i = 0; i < vm->maxcpus; i++) { if (vcpuid != -1 && vcpuid != i) continue; state = vcpu_get_state(vm, i, NULL); KASSERT(state == VCPU_FROZEN, ("%s: invalid vcpu state %d", __func__, state)); } #endif pageoff = gpa & PAGE_MASK; if (len > PAGE_SIZE - pageoff) panic("vm_gpa_hold: invalid gpa/len: 0x%016lx/%lu", gpa, len); count = 0; for (i = 0; i < VM_MAX_MEMMAPS; i++) { mm = &vm->mem_maps[i]; if (sysmem_mapping(vm, mm) && gpa >= mm->gpa && gpa < mm->gpa + mm->len) { count = vm_fault_quick_hold_pages(&vm->vmspace->vm_map, trunc_page(gpa), PAGE_SIZE, reqprot, &m, 1); break; } } if (count == 1) { *cookie = m; return ((void *)(PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)) + pageoff)); } else { *cookie = NULL; return (NULL); } } void vm_gpa_release(void *cookie) { vm_page_t m = cookie; vm_page_lock(m); vm_page_unhold(m); vm_page_unlock(m); } int vm_get_register(struct vm *vm, int vcpu, int reg, uint64_t *retval) { - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm->maxcpus) return (EINVAL); if (reg >= VM_REG_LAST) return (EINVAL); return (VMGETREG(vm->cookie, vcpu, reg, retval)); } int vm_set_register(struct vm *vm, int vcpuid, int reg, uint64_t val) { struct vcpu *vcpu; int error; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); if (reg >= VM_REG_LAST) return (EINVAL); error = VMSETREG(vm->cookie, vcpuid, reg, val); if (error || reg != VM_REG_GUEST_RIP) return (error); /* Set 'nextrip' to match the value of %rip */ VCPU_CTR1(vm, vcpuid, "Setting nextrip to %#lx", val); vcpu = &vm->vcpu[vcpuid]; vcpu->nextrip = val; return (0); } static boolean_t is_descriptor_table(int reg) { switch (reg) { case VM_REG_GUEST_IDTR: case VM_REG_GUEST_GDTR: return (TRUE); default: return (FALSE); } } static boolean_t is_segment_register(int reg) { switch (reg) { case VM_REG_GUEST_ES: case VM_REG_GUEST_CS: case VM_REG_GUEST_SS: case VM_REG_GUEST_DS: case VM_REG_GUEST_FS: case VM_REG_GUEST_GS: case VM_REG_GUEST_TR: case VM_REG_GUEST_LDTR: return (TRUE); default: return (FALSE); } } int vm_get_seg_desc(struct vm *vm, int vcpu, int reg, struct seg_desc *desc) { - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm->maxcpus) return (EINVAL); if (!is_segment_register(reg) && !is_descriptor_table(reg)) return (EINVAL); return (VMGETDESC(vm->cookie, vcpu, reg, desc)); } int vm_set_seg_desc(struct vm *vm, int vcpu, int reg, struct seg_desc *desc) { - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm->maxcpus) return (EINVAL); if (!is_segment_register(reg) && !is_descriptor_table(reg)) return (EINVAL); return (VMSETDESC(vm->cookie, vcpu, reg, desc)); } static void restore_guest_fpustate(struct vcpu *vcpu) { /* flush host state to the pcb */ fpuexit(curthread); /* restore guest FPU state */ fpu_stop_emulating(); fpurestore(vcpu->guestfpu); /* restore guest XCR0 if XSAVE is enabled in the host */ if (rcr4() & CR4_XSAVE) load_xcr(0, vcpu->guest_xcr0); /* * The FPU is now "dirty" with the guest's state so turn on emulation * to trap any access to the FPU by the host. 
*/ fpu_start_emulating(); } static void save_guest_fpustate(struct vcpu *vcpu) { if ((rcr0() & CR0_TS) == 0) panic("fpu emulation not enabled in host!"); /* save guest XCR0 and restore host XCR0 */ if (rcr4() & CR4_XSAVE) { vcpu->guest_xcr0 = rxcr(0); load_xcr(0, vmm_get_host_xcr0()); } /* save guest FPU state */ fpu_stop_emulating(); fpusave(vcpu->guestfpu); fpu_start_emulating(); } static VMM_STAT(VCPU_IDLE_TICKS, "number of ticks vcpu was idle"); static int vcpu_set_state_locked(struct vm *vm, int vcpuid, enum vcpu_state newstate, bool from_idle) { struct vcpu *vcpu; int error; vcpu = &vm->vcpu[vcpuid]; vcpu_assert_locked(vcpu); /* * State transitions from the vmmdev_ioctl() must always begin from * the VCPU_IDLE state. This guarantees that there is only a single * ioctl() operating on a vcpu at any point. */ if (from_idle) { while (vcpu->state != VCPU_IDLE) { vcpu->reqidle = 1; vcpu_notify_event_locked(vcpu, false); VCPU_CTR1(vm, vcpuid, "vcpu state change from %s to " "idle requested", vcpu_state2str(vcpu->state)); msleep_spin(&vcpu->state, &vcpu->mtx, "vmstat", hz); } } else { KASSERT(vcpu->state != VCPU_IDLE, ("invalid transition from " "vcpu idle state")); } if (vcpu->state == VCPU_RUNNING) { KASSERT(vcpu->hostcpu == curcpu, ("curcpu %d and hostcpu %d " "mismatch for running vcpu", curcpu, vcpu->hostcpu)); } else { KASSERT(vcpu->hostcpu == NOCPU, ("Invalid hostcpu %d for a " "vcpu that is not running", vcpu->hostcpu)); } /* * The following state transitions are allowed: * IDLE -> FROZEN -> IDLE * FROZEN -> RUNNING -> FROZEN * FROZEN -> SLEEPING -> FROZEN */ switch (vcpu->state) { case VCPU_IDLE: case VCPU_RUNNING: case VCPU_SLEEPING: error = (newstate != VCPU_FROZEN); break; case VCPU_FROZEN: error = (newstate == VCPU_FROZEN); break; default: error = 1; break; } if (error) return (EBUSY); VCPU_CTR2(vm, vcpuid, "vcpu state changed from %s to %s", vcpu_state2str(vcpu->state), vcpu_state2str(newstate)); vcpu->state = newstate; if (newstate == VCPU_RUNNING) vcpu->hostcpu = curcpu; else vcpu->hostcpu = NOCPU; if (newstate == VCPU_IDLE) wakeup(&vcpu->state); return (0); } static void vcpu_require_state(struct vm *vm, int vcpuid, enum vcpu_state newstate) { int error; if ((error = vcpu_set_state(vm, vcpuid, newstate, false)) != 0) panic("Error %d setting state to %d\n", error, newstate); } static void vcpu_require_state_locked(struct vm *vm, int vcpuid, enum vcpu_state newstate) { int error; if ((error = vcpu_set_state_locked(vm, vcpuid, newstate, false)) != 0) panic("Error %d setting state to %d", error, newstate); } static void vm_set_rendezvous_func(struct vm *vm, vm_rendezvous_func_t func) { KASSERT(mtx_owned(&vm->rendezvous_mtx), ("rendezvous_mtx not locked")); /* * Update 'rendezvous_func' and execute a write memory barrier to * ensure that it is visible across all host cpus. This is not needed * for correctness but it does ensure that all the vcpus will notice * that the rendezvous is requested immediately. 
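/*
 * Illustrative sketch (hypothetical demo_* names): the switch in
 * vcpu_set_state_locked() above enforces the transition diagram
 * IDLE <-> FROZEN and FROZEN <-> RUNNING/SLEEPING; any other hop makes the
 * caller fail with EBUSY.  The same rule as a standalone predicate:
 */
#include <stdbool.h>

enum demo_vcpu_state { DEMO_IDLE, DEMO_FROZEN, DEMO_RUNNING, DEMO_SLEEPING };

static bool
demo_transition_ok(enum demo_vcpu_state from, enum demo_vcpu_state to)
{
	switch (from) {
	case DEMO_IDLE:
	case DEMO_RUNNING:
	case DEMO_SLEEPING:
		return (to == DEMO_FROZEN);	/* may only be frozen next */
	case DEMO_FROZEN:
		return (to != DEMO_FROZEN);	/* may thaw to any other state */
	default:
		return (false);
	}
}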
*/ vm->rendezvous_func = func; wmb(); } #define RENDEZVOUS_CTR0(vm, vcpuid, fmt) \ do { \ if (vcpuid >= 0) \ VCPU_CTR0(vm, vcpuid, fmt); \ else \ VM_CTR0(vm, fmt); \ } while (0) static void vm_handle_rendezvous(struct vm *vm, int vcpuid) { - KASSERT(vcpuid == -1 || (vcpuid >= 0 && vcpuid < VM_MAXCPU), + KASSERT(vcpuid == -1 || (vcpuid >= 0 && vcpuid < vm->maxcpus), ("vm_handle_rendezvous: invalid vcpuid %d", vcpuid)); mtx_lock(&vm->rendezvous_mtx); while (vm->rendezvous_func != NULL) { /* 'rendezvous_req_cpus' must be a subset of 'active_cpus' */ CPU_AND(&vm->rendezvous_req_cpus, &vm->active_cpus); if (vcpuid != -1 && CPU_ISSET(vcpuid, &vm->rendezvous_req_cpus) && !CPU_ISSET(vcpuid, &vm->rendezvous_done_cpus)) { VCPU_CTR0(vm, vcpuid, "Calling rendezvous func"); (*vm->rendezvous_func)(vm, vcpuid, vm->rendezvous_arg); CPU_SET(vcpuid, &vm->rendezvous_done_cpus); } if (CPU_CMP(&vm->rendezvous_req_cpus, &vm->rendezvous_done_cpus) == 0) { VCPU_CTR0(vm, vcpuid, "Rendezvous completed"); vm_set_rendezvous_func(vm, NULL); wakeup(&vm->rendezvous_func); break; } RENDEZVOUS_CTR0(vm, vcpuid, "Wait for rendezvous completion"); mtx_sleep(&vm->rendezvous_func, &vm->rendezvous_mtx, 0, "vmrndv", 0); } mtx_unlock(&vm->rendezvous_mtx); } /* * Emulate a guest 'hlt' by sleeping until the vcpu is ready to run. */ static int vm_handle_hlt(struct vm *vm, int vcpuid, bool intr_disabled, bool *retu) { struct vcpu *vcpu; const char *wmesg; int t, vcpu_halted, vm_halted; KASSERT(!CPU_ISSET(vcpuid, &vm->halted_cpus), ("vcpu already halted")); vcpu = &vm->vcpu[vcpuid]; vcpu_halted = 0; vm_halted = 0; vcpu_lock(vcpu); while (1) { /* * Do a final check for pending NMI or interrupts before * really putting this thread to sleep. Also check for * software events that would cause this vcpu to wakeup. * * These interrupts/events could have happened after the * vcpu returned from VMRUN() and before it acquired the * vcpu lock above. */ if (vm->rendezvous_func != NULL || vm->suspend || vcpu->reqidle) break; if (vm_nmi_pending(vm, vcpuid)) break; if (!intr_disabled) { if (vm_extint_pending(vm, vcpuid) || vlapic_pending_intr(vcpu->vlapic, NULL)) { break; } } /* Don't go to sleep if the vcpu thread needs to yield */ if (vcpu_should_yield(vm, vcpuid)) break; if (vcpu_debugged(vm, vcpuid)) break; /* * Some Linux guests implement "halt" by having all vcpus * execute HLT with interrupts disabled. 'halted_cpus' keeps * track of the vcpus that have entered this state. When all * vcpus enter the halted state the virtual machine is halted. */ if (intr_disabled) { wmesg = "vmhalt"; VCPU_CTR0(vm, vcpuid, "Halted"); if (!vcpu_halted && halt_detection_enabled) { vcpu_halted = 1; CPU_SET_ATOMIC(vcpuid, &vm->halted_cpus); } if (CPU_CMP(&vm->halted_cpus, &vm->active_cpus) == 0) { vm_halted = 1; break; } } else { wmesg = "vmidle"; } t = ticks; vcpu_require_state_locked(vm, vcpuid, VCPU_SLEEPING); /* * XXX msleep_spin() cannot be interrupted by signals so * wake up periodically to check pending signals. 
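/*
 * Illustrative sketch (plain bitmasks standing in for cpuset_t): the halt
 * detection in vm_handle_hlt() above adds each vcpu that executes HLT with
 * interrupts disabled to 'halted_cpus' and, subject to the
 * hw.vmm.halt_detection tunable, suspends the whole VM once that set matches
 * 'active_cpus' (the CPU_CMP() == 0 check).
 */
#include <stdbool.h>
#include <stdint.h>

static bool
demo_vm_fully_halted(uint64_t active_vcpus, uint64_t halted_vcpus)
{
	return (halted_vcpus == active_vcpus);
}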
*/ msleep_spin(vcpu, &vcpu->mtx, wmesg, hz); vcpu_require_state_locked(vm, vcpuid, VCPU_FROZEN); vmm_stat_incr(vm, vcpuid, VCPU_IDLE_TICKS, ticks - t); } if (vcpu_halted) CPU_CLR_ATOMIC(vcpuid, &vm->halted_cpus); vcpu_unlock(vcpu); if (vm_halted) vm_suspend(vm, VM_SUSPEND_HALT); return (0); } static int vm_handle_paging(struct vm *vm, int vcpuid, bool *retu) { int rv, ftype; struct vm_map *map; struct vcpu *vcpu; struct vm_exit *vme; vcpu = &vm->vcpu[vcpuid]; vme = &vcpu->exitinfo; KASSERT(vme->inst_length == 0, ("%s: invalid inst_length %d", __func__, vme->inst_length)); ftype = vme->u.paging.fault_type; KASSERT(ftype == VM_PROT_READ || ftype == VM_PROT_WRITE || ftype == VM_PROT_EXECUTE, ("vm_handle_paging: invalid fault_type %d", ftype)); if (ftype == VM_PROT_READ || ftype == VM_PROT_WRITE) { rv = pmap_emulate_accessed_dirty(vmspace_pmap(vm->vmspace), vme->u.paging.gpa, ftype); if (rv == 0) { VCPU_CTR2(vm, vcpuid, "%s bit emulation for gpa %#lx", ftype == VM_PROT_READ ? "accessed" : "dirty", vme->u.paging.gpa); goto done; } } map = &vm->vmspace->vm_map; rv = vm_fault(map, vme->u.paging.gpa, ftype, VM_FAULT_NORMAL); VCPU_CTR3(vm, vcpuid, "vm_handle_paging rv = %d, gpa = %#lx, " "ftype = %d", rv, vme->u.paging.gpa, ftype); if (rv != KERN_SUCCESS) return (EFAULT); done: return (0); } static int vm_handle_inst_emul(struct vm *vm, int vcpuid, bool *retu) { struct vie *vie; struct vcpu *vcpu; struct vm_exit *vme; uint64_t gla, gpa, cs_base; struct vm_guest_paging *paging; mem_region_read_t mread; mem_region_write_t mwrite; enum vm_cpu_mode cpu_mode; int cs_d, error, fault; vcpu = &vm->vcpu[vcpuid]; vme = &vcpu->exitinfo; KASSERT(vme->inst_length == 0, ("%s: invalid inst_length %d", __func__, vme->inst_length)); gla = vme->u.inst_emul.gla; gpa = vme->u.inst_emul.gpa; cs_base = vme->u.inst_emul.cs_base; cs_d = vme->u.inst_emul.cs_d; vie = &vme->u.inst_emul.vie; paging = &vme->u.inst_emul.paging; cpu_mode = paging->cpu_mode; VCPU_CTR1(vm, vcpuid, "inst_emul fault accessing gpa %#lx", gpa); /* Fetch, decode and emulate the faulting instruction */ if (vie->num_valid == 0) { error = vmm_fetch_instruction(vm, vcpuid, paging, vme->rip + cs_base, VIE_INST_SIZE, vie, &fault); } else { /* * The instruction bytes have already been copied into 'vie' */ error = fault = 0; } if (error || fault) return (error); if (vmm_decode_instruction(vm, vcpuid, gla, cpu_mode, cs_d, vie) != 0) { VCPU_CTR1(vm, vcpuid, "Error decoding instruction at %#lx", vme->rip + cs_base); *retu = true; /* dump instruction bytes in userspace */ return (0); } /* * Update 'nextrip' based on the length of the emulated instruction. 
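/*
 * Illustrative sketch (hypothetical demo_* names): after decoding, the
 * vm_handle_inst_emul() code below picks an in-kernel device model purely by
 * guest-physical range (local APIC page, I/O APIC, HPET) and otherwise
 * bounces the exit to userspace.  The same dispatch, table-driven:
 */
#include <stddef.h>
#include <stdint.h>

typedef int (*demo_mmio_handler_t)(uint64_t gpa, uint64_t *val);

struct demo_mmio_range {
	uint64_t		base;
	uint64_t		size;
	demo_mmio_handler_t	read;
	demo_mmio_handler_t	write;
};

/* Returns the range covering 'gpa', or NULL to emulate in userspace. */
static const struct demo_mmio_range *
demo_mmio_lookup(const struct demo_mmio_range *ranges, size_t n, uint64_t gpa)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (gpa >= ranges[i].base &&
		    gpa < ranges[i].base + ranges[i].size)
			return (&ranges[i]);
	}
	return (NULL);
}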
*/ vme->inst_length = vie->num_processed; vcpu->nextrip += vie->num_processed; VCPU_CTR1(vm, vcpuid, "nextrip updated to %#lx after instruction " "decoding", vcpu->nextrip); /* return to userland unless this is an in-kernel emulated device */ if (gpa >= DEFAULT_APIC_BASE && gpa < DEFAULT_APIC_BASE + PAGE_SIZE) { mread = lapic_mmio_read; mwrite = lapic_mmio_write; } else if (gpa >= VIOAPIC_BASE && gpa < VIOAPIC_BASE + VIOAPIC_SIZE) { mread = vioapic_mmio_read; mwrite = vioapic_mmio_write; } else if (gpa >= VHPET_BASE && gpa < VHPET_BASE + VHPET_SIZE) { mread = vhpet_mmio_read; mwrite = vhpet_mmio_write; } else { *retu = true; return (0); } error = vmm_emulate_instruction(vm, vcpuid, gpa, vie, paging, mread, mwrite, retu); return (error); } static int vm_handle_suspend(struct vm *vm, int vcpuid, bool *retu) { int i, done; struct vcpu *vcpu; done = 0; vcpu = &vm->vcpu[vcpuid]; CPU_SET_ATOMIC(vcpuid, &vm->suspended_cpus); /* * Wait until all 'active_cpus' have suspended themselves. * * Since a VM may be suspended at any time including when one or * more vcpus are doing a rendezvous we need to call the rendezvous * handler while we are waiting to prevent a deadlock. */ vcpu_lock(vcpu); while (1) { if (CPU_CMP(&vm->suspended_cpus, &vm->active_cpus) == 0) { VCPU_CTR0(vm, vcpuid, "All vcpus suspended"); break; } if (vm->rendezvous_func == NULL) { VCPU_CTR0(vm, vcpuid, "Sleeping during suspend"); vcpu_require_state_locked(vm, vcpuid, VCPU_SLEEPING); msleep_spin(vcpu, &vcpu->mtx, "vmsusp", hz); vcpu_require_state_locked(vm, vcpuid, VCPU_FROZEN); } else { VCPU_CTR0(vm, vcpuid, "Rendezvous during suspend"); vcpu_unlock(vcpu); vm_handle_rendezvous(vm, vcpuid); vcpu_lock(vcpu); } } vcpu_unlock(vcpu); /* * Wakeup the other sleeping vcpus and return to userspace. */ - for (i = 0; i < VM_MAXCPU; i++) { + for (i = 0; i < vm->maxcpus; i++) { if (CPU_ISSET(i, &vm->suspended_cpus)) { vcpu_notify_event(vm, i, false); } } *retu = true; return (0); } static int vm_handle_reqidle(struct vm *vm, int vcpuid, bool *retu) { struct vcpu *vcpu = &vm->vcpu[vcpuid]; vcpu_lock(vcpu); KASSERT(vcpu->reqidle, ("invalid vcpu reqidle %d", vcpu->reqidle)); vcpu->reqidle = 0; vcpu_unlock(vcpu); *retu = true; return (0); } int vm_suspend(struct vm *vm, enum vm_suspend_how how) { int i; if (how <= VM_SUSPEND_NONE || how >= VM_SUSPEND_LAST) return (EINVAL); if (atomic_cmpset_int(&vm->suspend, 0, how) == 0) { VM_CTR2(vm, "virtual machine already suspended %d/%d", vm->suspend, how); return (EALREADY); } VM_CTR1(vm, "virtual machine successfully suspended %d", how); /* * Notify all active vcpus that they are now suspended. 
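/*
 * Illustrative sketch (C11 atomics standing in for atomic_cmpset_int()):
 * vm_suspend() above publishes the suspend reason exactly once; whichever
 * caller wins the compare-and-swap proceeds and every later caller is told
 * the machine is already suspending.
 */
#include <errno.h>
#include <stdatomic.h>

static int
demo_suspend_once(atomic_int *suspend, int how)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(suspend, &expected, how))
		return (EALREADY);	/* somebody else suspended first */
	return (0);
}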
*/ - for (i = 0; i < VM_MAXCPU; i++) { + for (i = 0; i < vm->maxcpus; i++) { if (CPU_ISSET(i, &vm->active_cpus)) vcpu_notify_event(vm, i, false); } return (0); } void vm_exit_suspended(struct vm *vm, int vcpuid, uint64_t rip) { struct vm_exit *vmexit; KASSERT(vm->suspend > VM_SUSPEND_NONE && vm->suspend < VM_SUSPEND_LAST, ("vm_exit_suspended: invalid suspend type %d", vm->suspend)); vmexit = vm_exitinfo(vm, vcpuid); vmexit->rip = rip; vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_SUSPENDED; vmexit->u.suspended.how = vm->suspend; } void vm_exit_debug(struct vm *vm, int vcpuid, uint64_t rip) { struct vm_exit *vmexit; vmexit = vm_exitinfo(vm, vcpuid); vmexit->rip = rip; vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_DEBUG; } void vm_exit_rendezvous(struct vm *vm, int vcpuid, uint64_t rip) { struct vm_exit *vmexit; KASSERT(vm->rendezvous_func != NULL, ("rendezvous not in progress")); vmexit = vm_exitinfo(vm, vcpuid); vmexit->rip = rip; vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_RENDEZVOUS; vmm_stat_incr(vm, vcpuid, VMEXIT_RENDEZVOUS, 1); } void vm_exit_reqidle(struct vm *vm, int vcpuid, uint64_t rip) { struct vm_exit *vmexit; vmexit = vm_exitinfo(vm, vcpuid); vmexit->rip = rip; vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_REQIDLE; vmm_stat_incr(vm, vcpuid, VMEXIT_REQIDLE, 1); } void vm_exit_astpending(struct vm *vm, int vcpuid, uint64_t rip) { struct vm_exit *vmexit; vmexit = vm_exitinfo(vm, vcpuid); vmexit->rip = rip; vmexit->inst_length = 0; vmexit->exitcode = VM_EXITCODE_BOGUS; vmm_stat_incr(vm, vcpuid, VMEXIT_ASTPENDING, 1); } int vm_run(struct vm *vm, struct vm_run *vmrun) { struct vm_eventinfo evinfo; int error, vcpuid; struct vcpu *vcpu; struct pcb *pcb; uint64_t tscval; struct vm_exit *vme; bool retu, intr_disabled; pmap_t pmap; vcpuid = vmrun->cpuid; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); if (!CPU_ISSET(vcpuid, &vm->active_cpus)) return (EINVAL); if (CPU_ISSET(vcpuid, &vm->suspended_cpus)) return (EINVAL); pmap = vmspace_pmap(vm->vmspace); vcpu = &vm->vcpu[vcpuid]; vme = &vcpu->exitinfo; evinfo.rptr = &vm->rendezvous_func; evinfo.sptr = &vm->suspend; evinfo.iptr = &vcpu->reqidle; restart: critical_enter(); KASSERT(!CPU_ISSET(curcpu, &pmap->pm_active), ("vm_run: absurd pm_active")); tscval = rdtsc(); pcb = PCPU_GET(curpcb); set_pcb_flags(pcb, PCB_FULL_IRET); restore_guest_fpustate(vcpu); vcpu_require_state(vm, vcpuid, VCPU_RUNNING); error = VMRUN(vm->cookie, vcpuid, vcpu->nextrip, pmap, &evinfo); vcpu_require_state(vm, vcpuid, VCPU_FROZEN); save_guest_fpustate(vcpu); vmm_stat_incr(vm, vcpuid, VCPU_TOTAL_RUNTIME, rdtsc() - tscval); critical_exit(); if (error == 0) { retu = false; vcpu->nextrip = vme->rip + vme->inst_length; switch (vme->exitcode) { case VM_EXITCODE_REQIDLE: error = vm_handle_reqidle(vm, vcpuid, &retu); break; case VM_EXITCODE_SUSPENDED: error = vm_handle_suspend(vm, vcpuid, &retu); break; case VM_EXITCODE_IOAPIC_EOI: vioapic_process_eoi(vm, vcpuid, vme->u.ioapic_eoi.vector); break; case VM_EXITCODE_RENDEZVOUS: vm_handle_rendezvous(vm, vcpuid); error = 0; break; case VM_EXITCODE_HLT: intr_disabled = ((vme->u.hlt.rflags & PSL_I) == 0); error = vm_handle_hlt(vm, vcpuid, intr_disabled, &retu); break; case VM_EXITCODE_PAGING: error = vm_handle_paging(vm, vcpuid, &retu); break; case VM_EXITCODE_INST_EMUL: error = vm_handle_inst_emul(vm, vcpuid, &retu); break; case VM_EXITCODE_INOUT: case VM_EXITCODE_INOUT_STR: error = vm_handle_inout(vm, vcpuid, vme, &retu); break; case 
VM_EXITCODE_MONITOR: case VM_EXITCODE_MWAIT: case VM_EXITCODE_VMINSN: vm_inject_ud(vm, vcpuid); break; default: retu = true; /* handled in userland */ break; } } if (error == 0 && retu == false) goto restart; VCPU_CTR2(vm, vcpuid, "retu %d/%d", error, vme->exitcode); /* copy the exit information */ bcopy(vme, &vmrun->vm_exit, sizeof(struct vm_exit)); return (error); } int vm_restart_instruction(void *arg, int vcpuid) { struct vm *vm; struct vcpu *vcpu; enum vcpu_state state; uint64_t rip; int error; vm = arg; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; state = vcpu_get_state(vm, vcpuid, NULL); if (state == VCPU_RUNNING) { /* * When a vcpu is "running" the next instruction is determined * by adding 'rip' and 'inst_length' in the vcpu's 'exitinfo'. * Thus setting 'inst_length' to zero will cause the current * instruction to be restarted. */ vcpu->exitinfo.inst_length = 0; VCPU_CTR1(vm, vcpuid, "restarting instruction at %#lx by " "setting inst_length to zero", vcpu->exitinfo.rip); } else if (state == VCPU_FROZEN) { /* * When a vcpu is "frozen" it is outside the critical section * around VMRUN() and 'nextrip' points to the next instruction. * Thus instruction restart is achieved by setting 'nextrip' * to the vcpu's %rip. */ error = vm_get_register(vm, vcpuid, VM_REG_GUEST_RIP, &rip); KASSERT(!error, ("%s: error %d getting rip", __func__, error)); VCPU_CTR2(vm, vcpuid, "restarting instruction by updating " "nextrip from %#lx to %#lx", vcpu->nextrip, rip); vcpu->nextrip = rip; } else { panic("%s: invalid state %d", __func__, state); } return (0); } int vm_exit_intinfo(struct vm *vm, int vcpuid, uint64_t info) { struct vcpu *vcpu; int type, vector; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; if (info & VM_INTINFO_VALID) { type = info & VM_INTINFO_TYPE; vector = info & 0xff; if (type == VM_INTINFO_NMI && vector != IDT_NMI) return (EINVAL); if (type == VM_INTINFO_HWEXCEPTION && vector >= 32) return (EINVAL); if (info & VM_INTINFO_RSVD) return (EINVAL); } else { info = 0; } VCPU_CTR2(vm, vcpuid, "%s: info1(%#lx)", __func__, info); vcpu->exitintinfo = info; return (0); } enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGEFAULT }; #define IDT_VE 20 /* Virtualization Exception (Intel specific) */ static enum exc_class exception_class(uint64_t info) { int type, vector; KASSERT(info & VM_INTINFO_VALID, ("intinfo must be valid: %#lx", info)); type = info & VM_INTINFO_TYPE; vector = info & 0xff; /* Table 6-4, "Interrupt and Exception Classes", Intel SDM, Vol 3 */ switch (type) { case VM_INTINFO_HWINTR: case VM_INTINFO_SWINTR: case VM_INTINFO_NMI: return (EXC_BENIGN); default: /* * Hardware exception. * * SVM and VT-x use identical type values to represent NMI, * hardware interrupt and software interrupt. * * SVM uses type '3' for all exceptions. VT-x uses type '3' * for exceptions except #BP and #OF. #BP and #OF use a type * value of '5' or '6'. Therefore we don't check for explicit * values of 'type' to classify 'intinfo' into a hardware * exception. 
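/*
 * Illustrative sketch: an event travels as a single 64-bit 'intinfo' word,
 * vector in the low byte, error code in the upper 32 bits, plus type, valid
 * and deliver-errcode bits, as consumed by vm_exit_intinfo() above and
 * produced by vcpu_exception_intinfo() below.  The DEMO_* bit positions are
 * chosen for this sketch only; the kernel's VM_INTINFO_* constants follow
 * the hardware event-injection layout.
 */
#include <stdint.h>

#define	DEMO_INTINFO_VECTOR(v)		((uint64_t)((v) & 0xff))
#define	DEMO_INTINFO_HWEXCEPTION	(3ULL << 8)
#define	DEMO_INTINFO_DEL_ERRCODE	(1ULL << 11)
#define	DEMO_INTINFO_VALID		(1ULL << 31)

static uint64_t
demo_exception_intinfo(int vector, int errcode_valid, uint32_t errcode)
{
	uint64_t info;

	info = DEMO_INTINFO_VECTOR(vector) | DEMO_INTINFO_HWEXCEPTION |
	    DEMO_INTINFO_VALID;
	if (errcode_valid)
		info |= DEMO_INTINFO_DEL_ERRCODE | (uint64_t)errcode << 32;
	return (info);
}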
*/ break; } switch (vector) { case IDT_PF: case IDT_VE: return (EXC_PAGEFAULT); case IDT_DE: case IDT_TS: case IDT_NP: case IDT_SS: case IDT_GP: return (EXC_CONTRIBUTORY); default: return (EXC_BENIGN); } } static int nested_fault(struct vm *vm, int vcpuid, uint64_t info1, uint64_t info2, uint64_t *retinfo) { enum exc_class exc1, exc2; int type1, vector1; KASSERT(info1 & VM_INTINFO_VALID, ("info1 %#lx is not valid", info1)); KASSERT(info2 & VM_INTINFO_VALID, ("info2 %#lx is not valid", info2)); /* * If an exception occurs while attempting to call the double-fault * handler the processor enters shutdown mode (aka triple fault). */ type1 = info1 & VM_INTINFO_TYPE; vector1 = info1 & 0xff; if (type1 == VM_INTINFO_HWEXCEPTION && vector1 == IDT_DF) { VCPU_CTR2(vm, vcpuid, "triple fault: info1(%#lx), info2(%#lx)", info1, info2); vm_suspend(vm, VM_SUSPEND_TRIPLEFAULT); *retinfo = 0; return (0); } /* * Table 6-5 "Conditions for Generating a Double Fault", Intel SDM, Vol3 */ exc1 = exception_class(info1); exc2 = exception_class(info2); if ((exc1 == EXC_CONTRIBUTORY && exc2 == EXC_CONTRIBUTORY) || (exc1 == EXC_PAGEFAULT && exc2 != EXC_BENIGN)) { /* Convert nested fault into a double fault. */ *retinfo = IDT_DF; *retinfo |= VM_INTINFO_VALID | VM_INTINFO_HWEXCEPTION; *retinfo |= VM_INTINFO_DEL_ERRCODE; } else { /* Handle exceptions serially */ *retinfo = info2; } return (1); } static uint64_t vcpu_exception_intinfo(struct vcpu *vcpu) { uint64_t info = 0; if (vcpu->exception_pending) { info = vcpu->exc_vector & 0xff; info |= VM_INTINFO_VALID | VM_INTINFO_HWEXCEPTION; if (vcpu->exc_errcode_valid) { info |= VM_INTINFO_DEL_ERRCODE; info |= (uint64_t)vcpu->exc_errcode << 32; } } return (info); } int vm_entry_intinfo(struct vm *vm, int vcpuid, uint64_t *retinfo) { struct vcpu *vcpu; uint64_t info1, info2; int valid; - KASSERT(vcpuid >= 0 && vcpuid < VM_MAXCPU, ("invalid vcpu %d", vcpuid)); + KASSERT(vcpuid >= 0 && + vcpuid < vm->maxcpus, ("invalid vcpu %d", vcpuid)); vcpu = &vm->vcpu[vcpuid]; info1 = vcpu->exitintinfo; vcpu->exitintinfo = 0; info2 = 0; if (vcpu->exception_pending) { info2 = vcpu_exception_intinfo(vcpu); vcpu->exception_pending = 0; VCPU_CTR2(vm, vcpuid, "Exception %d delivered: %#lx", vcpu->exc_vector, info2); } if ((info1 & VM_INTINFO_VALID) && (info2 & VM_INTINFO_VALID)) { valid = nested_fault(vm, vcpuid, info1, info2, retinfo); } else if (info1 & VM_INTINFO_VALID) { *retinfo = info1; valid = 1; } else if (info2 & VM_INTINFO_VALID) { *retinfo = info2; valid = 1; } else { valid = 0; } if (valid) { VCPU_CTR4(vm, vcpuid, "%s: info1(%#lx), info2(%#lx), " "retinfo(%#lx)", __func__, info1, info2, *retinfo); } return (valid); } int vm_get_intinfo(struct vm *vm, int vcpuid, uint64_t *info1, uint64_t *info2) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; *info1 = vcpu->exitintinfo; *info2 = vcpu_exception_intinfo(vcpu); return (0); } int vm_inject_exception(struct vm *vm, int vcpuid, int vector, int errcode_valid, uint32_t errcode, int restart_instruction) { struct vcpu *vcpu; uint64_t regval; int error; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); if (vector < 0 || vector >= 32) return (EINVAL); /* * A double fault exception should never be injected directly into * the guest. It is a derived exception that results from specific * combinations of nested faults. 
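/*
 * Illustrative sketch (hypothetical demo_* names): nested_fault() above
 * applies the rule from Table 6-5 of the Intel SDM, Vol 3: two contributory
 * exceptions back to back, or a page fault followed by anything that is not
 * benign, collapse into a double fault; everything else is delivered
 * serially.
 */
#include <stdbool.h>

enum demo_exc_class { DEMO_BENIGN, DEMO_CONTRIBUTORY, DEMO_PAGEFAULT };

static bool
demo_promote_to_double_fault(enum demo_exc_class first,
    enum demo_exc_class second)
{
	return ((first == DEMO_CONTRIBUTORY && second == DEMO_CONTRIBUTORY) ||
	    (first == DEMO_PAGEFAULT && second != DEMO_BENIGN));
}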
*/ if (vector == IDT_DF) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; if (vcpu->exception_pending) { VCPU_CTR2(vm, vcpuid, "Unable to inject exception %d due to " "pending exception %d", vector, vcpu->exc_vector); return (EBUSY); } if (errcode_valid) { /* * Exceptions don't deliver an error code in real mode. */ error = vm_get_register(vm, vcpuid, VM_REG_GUEST_CR0, ®val); KASSERT(!error, ("%s: error %d getting CR0", __func__, error)); if (!(regval & CR0_PE)) errcode_valid = 0; } /* * From section 26.6.1 "Interruptibility State" in Intel SDM: * * Event blocking by "STI" or "MOV SS" is cleared after guest executes * one instruction or incurs an exception. */ error = vm_set_register(vm, vcpuid, VM_REG_GUEST_INTR_SHADOW, 0); KASSERT(error == 0, ("%s: error %d clearing interrupt shadow", __func__, error)); if (restart_instruction) vm_restart_instruction(vm, vcpuid); vcpu->exception_pending = 1; vcpu->exc_vector = vector; vcpu->exc_errcode = errcode; vcpu->exc_errcode_valid = errcode_valid; VCPU_CTR1(vm, vcpuid, "Exception %d pending", vector); return (0); } void vm_inject_fault(void *vmarg, int vcpuid, int vector, int errcode_valid, int errcode) { struct vm *vm; int error, restart_instruction; vm = vmarg; restart_instruction = 1; error = vm_inject_exception(vm, vcpuid, vector, errcode_valid, errcode, restart_instruction); KASSERT(error == 0, ("vm_inject_exception error %d", error)); } void vm_inject_pf(void *vmarg, int vcpuid, int error_code, uint64_t cr2) { struct vm *vm; int error; vm = vmarg; VCPU_CTR2(vm, vcpuid, "Injecting page fault: error_code %#x, cr2 %#lx", error_code, cr2); error = vm_set_register(vm, vcpuid, VM_REG_GUEST_CR2, cr2); KASSERT(error == 0, ("vm_set_register(cr2) error %d", error)); vm_inject_fault(vm, vcpuid, IDT_PF, 1, error_code); } static VMM_STAT(VCPU_NMI_COUNT, "number of NMIs delivered to vcpu"); int vm_inject_nmi(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; vcpu->nmi_pending = 1; vcpu_notify_event(vm, vcpuid, false); return (0); } int vm_nmi_pending(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_nmi_pending: invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; return (vcpu->nmi_pending); } void vm_nmi_clear(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_nmi_pending: invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; if (vcpu->nmi_pending == 0) panic("vm_nmi_clear: inconsistent nmi_pending state"); vcpu->nmi_pending = 0; vmm_stat_incr(vm, vcpuid, VCPU_NMI_COUNT, 1); } static VMM_STAT(VCPU_EXTINT_COUNT, "number of ExtINTs delivered to vcpu"); int vm_inject_extint(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); vcpu = &vm->vcpu[vcpuid]; vcpu->extint_pending = 1; vcpu_notify_event(vm, vcpuid, false); return (0); } int vm_extint_pending(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_extint_pending: invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; return (vcpu->extint_pending); } void vm_extint_clear(struct vm *vm, int vcpuid) { struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_extint_pending: 
invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; if (vcpu->extint_pending == 0) panic("vm_extint_clear: inconsistent extint_pending state"); vcpu->extint_pending = 0; vmm_stat_incr(vm, vcpuid, VCPU_EXTINT_COUNT, 1); } int vm_get_capability(struct vm *vm, int vcpu, int type, int *retval) { - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm->maxcpus) return (EINVAL); if (type < 0 || type >= VM_CAP_MAX) return (EINVAL); return (VMGETCAP(vm->cookie, vcpu, type, retval)); } int vm_set_capability(struct vm *vm, int vcpu, int type, int val) { - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm->maxcpus) return (EINVAL); if (type < 0 || type >= VM_CAP_MAX) return (EINVAL); return (VMSETCAP(vm->cookie, vcpu, type, val)); } struct vlapic * vm_lapic(struct vm *vm, int cpu) { return (vm->vcpu[cpu].vlapic); } struct vioapic * vm_ioapic(struct vm *vm) { return (vm->vioapic); } struct vhpet * vm_hpet(struct vm *vm) { return (vm->vhpet); } boolean_t vmm_is_pptdev(int bus, int slot, int func) { int found, i, n; int b, s, f; char *val, *cp, *cp2; /* * XXX * The length of an environment variable is limited to 128 bytes which * puts an upper limit on the number of passthru devices that may be * specified using a single environment variable. * * Work around this by scanning multiple environment variable * names instead of a single one - yuck! */ const char *names[] = { "pptdevs", "pptdevs2", "pptdevs3", NULL }; /* set pptdevs="1/2/3 4/5/6 7/8/9 10/11/12" */ found = 0; for (i = 0; names[i] != NULL && !found; i++) { cp = val = kern_getenv(names[i]); while (cp != NULL && *cp != '\0') { if ((cp2 = strchr(cp, ' ')) != NULL) *cp2 = '\0'; n = sscanf(cp, "%d/%d/%d", &b, &s, &f); if (n == 3 && bus == b && slot == s && func == f) { found = 1; break; } if (cp2 != NULL) *cp2++ = ' '; cp = cp2; } freeenv(val); } return (found); } void * vm_iommu_domain(struct vm *vm) { return (vm->iommu); } int vcpu_set_state(struct vm *vm, int vcpuid, enum vcpu_state newstate, bool from_idle) { int error; struct vcpu *vcpu; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_set_run_state: invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; vcpu_lock(vcpu); error = vcpu_set_state_locked(vm, vcpuid, newstate, from_idle); vcpu_unlock(vcpu); return (error); } enum vcpu_state vcpu_get_state(struct vm *vm, int vcpuid, int *hostcpu) { struct vcpu *vcpu; enum vcpu_state state; - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) panic("vm_get_run_state: invalid vcpuid %d", vcpuid); vcpu = &vm->vcpu[vcpuid]; vcpu_lock(vcpu); state = vcpu->state; if (hostcpu != NULL) *hostcpu = vcpu->hostcpu; vcpu_unlock(vcpu); return (state); } int vm_activate_cpu(struct vm *vm, int vcpuid) { - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); if (CPU_ISSET(vcpuid, &vm->active_cpus)) return (EBUSY); VCPU_CTR0(vm, vcpuid, "activated"); CPU_SET_ATOMIC(vcpuid, &vm->active_cpus); return (0); } int vm_suspend_cpu(struct vm *vm, int vcpuid) { int i; - if (vcpuid < -1 || vcpuid >= VM_MAXCPU) + if (vcpuid < -1 || vcpuid >= vm->maxcpus) return (EINVAL); if (vcpuid == -1) { vm->debug_cpus = vm->active_cpus; - for (i = 0; i < VM_MAXCPU; i++) { + for (i = 0; i < vm->maxcpus; i++) { if (CPU_ISSET(i, &vm->active_cpus)) vcpu_notify_event(vm, i, false); } } else { if (!CPU_ISSET(vcpuid, &vm->active_cpus)) return (EINVAL); CPU_SET_ATOMIC(vcpuid, &vm->debug_cpus); vcpu_notify_event(vm, vcpuid, false); } return (0); } int 
vm_resume_cpu(struct vm *vm, int vcpuid) { - if (vcpuid < -1 || vcpuid >= VM_MAXCPU) + if (vcpuid < -1 || vcpuid >= vm->maxcpus) return (EINVAL); if (vcpuid == -1) { CPU_ZERO(&vm->debug_cpus); } else { if (!CPU_ISSET(vcpuid, &vm->debug_cpus)) return (EINVAL); CPU_CLR_ATOMIC(vcpuid, &vm->debug_cpus); } return (0); } int vcpu_debugged(struct vm *vm, int vcpuid) { return (CPU_ISSET(vcpuid, &vm->debug_cpus)); } cpuset_t vm_active_cpus(struct vm *vm) { return (vm->active_cpus); } cpuset_t vm_debug_cpus(struct vm *vm) { return (vm->debug_cpus); } cpuset_t vm_suspended_cpus(struct vm *vm) { return (vm->suspended_cpus); } void * vcpu_stats(struct vm *vm, int vcpuid) { return (vm->vcpu[vcpuid].stats); } int vm_get_x2apic_state(struct vm *vm, int vcpuid, enum x2apic_state *state) { - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); *state = vm->vcpu[vcpuid].x2apic_state; return (0); } int vm_set_x2apic_state(struct vm *vm, int vcpuid, enum x2apic_state state) { - if (vcpuid < 0 || vcpuid >= VM_MAXCPU) + if (vcpuid < 0 || vcpuid >= vm->maxcpus) return (EINVAL); if (state >= X2APIC_STATE_LAST) return (EINVAL); vm->vcpu[vcpuid].x2apic_state = state; vlapic_set_x2apic_state(vm, vcpuid, state); return (0); } /* * This function is called to ensure that a vcpu "sees" a pending event * as soon as possible: * - If the vcpu thread is sleeping then it is woken up. * - If the vcpu is running on a different host_cpu then an IPI will be directed * to the host_cpu to cause the vcpu to trap into the hypervisor. */ static void vcpu_notify_event_locked(struct vcpu *vcpu, bool lapic_intr) { int hostcpu; hostcpu = vcpu->hostcpu; if (vcpu->state == VCPU_RUNNING) { KASSERT(hostcpu != NOCPU, ("vcpu running on invalid hostcpu")); if (hostcpu != curcpu) { if (lapic_intr) { vlapic_post_intr(vcpu->vlapic, hostcpu, vmm_ipinum); } else { ipi_cpu(hostcpu, vmm_ipinum); } } else { /* * If the 'vcpu' is running on 'curcpu' then it must * be sending a notification to itself (e.g. SELF_IPI). * The pending event will be picked up when the vcpu * transitions back to guest context. */ } } else { KASSERT(hostcpu == NOCPU, ("vcpu state %d not consistent " "with hostcpu %d", vcpu->state, hostcpu)); if (vcpu->state == VCPU_SLEEPING) wakeup_one(vcpu); } } void vcpu_notify_event(struct vm *vm, int vcpuid, bool lapic_intr) { struct vcpu *vcpu = &vm->vcpu[vcpuid]; vcpu_lock(vcpu); vcpu_notify_event_locked(vcpu, lapic_intr); vcpu_unlock(vcpu); } struct vmspace * vm_get_vmspace(struct vm *vm) { return (vm->vmspace); } int vm_apicid2vcpuid(struct vm *vm, int apicid) { /* * XXX apic id is assumed to be numerically identical to vcpu id */ return (apicid); } void vm_smp_rendezvous(struct vm *vm, int vcpuid, cpuset_t dest, vm_rendezvous_func_t func, void *arg) { int i; /* * Enforce that this function is called without any locks */ WITNESS_WARN(WARN_PANIC, NULL, "vm_smp_rendezvous"); - KASSERT(vcpuid == -1 || (vcpuid >= 0 && vcpuid < VM_MAXCPU), + KASSERT(vcpuid == -1 || (vcpuid >= 0 && vcpuid < vm->maxcpus), ("vm_smp_rendezvous: invalid vcpuid %d", vcpuid)); restart: mtx_lock(&vm->rendezvous_mtx); if (vm->rendezvous_func != NULL) { /* * If a rendezvous is already in progress then we need to * call the rendezvous handler in case this 'vcpuid' is one * of the targets of the rendezvous. 
*/ RENDEZVOUS_CTR0(vm, vcpuid, "Rendezvous already in progress"); mtx_unlock(&vm->rendezvous_mtx); vm_handle_rendezvous(vm, vcpuid); goto restart; } KASSERT(vm->rendezvous_func == NULL, ("vm_smp_rendezvous: previous " "rendezvous is still in progress")); RENDEZVOUS_CTR0(vm, vcpuid, "Initiating rendezvous"); vm->rendezvous_req_cpus = dest; CPU_ZERO(&vm->rendezvous_done_cpus); vm->rendezvous_arg = arg; vm_set_rendezvous_func(vm, func); mtx_unlock(&vm->rendezvous_mtx); /* * Wake up any sleeping vcpus and trigger a VM-exit in any running * vcpus so they handle the rendezvous as soon as possible. */ - for (i = 0; i < VM_MAXCPU; i++) { + for (i = 0; i < vm->maxcpus; i++) { if (CPU_ISSET(i, &dest)) vcpu_notify_event(vm, i, false); } vm_handle_rendezvous(vm, vcpuid); } struct vatpic * vm_atpic(struct vm *vm) { return (vm->vatpic); } struct vatpit * vm_atpit(struct vm *vm) { return (vm->vatpit); } struct vpmtmr * vm_pmtmr(struct vm *vm) { return (vm->vpmtmr); } struct vrtc * vm_rtc(struct vm *vm) { return (vm->vrtc); } enum vm_reg_name vm_segment_name(int seg) { static enum vm_reg_name seg_names[] = { VM_REG_GUEST_ES, VM_REG_GUEST_CS, VM_REG_GUEST_SS, VM_REG_GUEST_DS, VM_REG_GUEST_FS, VM_REG_GUEST_GS }; KASSERT(seg >= 0 && seg < nitems(seg_names), ("%s: invalid segment encoding %d", __func__, seg)); return (seg_names[seg]); } void vm_copy_teardown(struct vm *vm, int vcpuid, struct vm_copyinfo *copyinfo, int num_copyinfo) { int idx; for (idx = 0; idx < num_copyinfo; idx++) { if (copyinfo[idx].cookie != NULL) vm_gpa_release(copyinfo[idx].cookie); } bzero(copyinfo, num_copyinfo * sizeof(struct vm_copyinfo)); } int vm_copy_setup(struct vm *vm, int vcpuid, struct vm_guest_paging *paging, uint64_t gla, size_t len, int prot, struct vm_copyinfo *copyinfo, int num_copyinfo, int *fault) { int error, idx, nused; size_t n, off, remaining; void *hva, *cookie; uint64_t gpa; bzero(copyinfo, sizeof(struct vm_copyinfo) * num_copyinfo); nused = 0; remaining = len; while (remaining > 0) { KASSERT(nused < num_copyinfo, ("insufficient vm_copyinfo")); error = vm_gla2gpa(vm, vcpuid, paging, gla, prot, &gpa, fault); if (error || *fault) return (error); off = gpa & PAGE_MASK; n = min(remaining, PAGE_SIZE - off); copyinfo[nused].gpa = gpa; copyinfo[nused].len = n; remaining -= n; gla += n; nused++; } for (idx = 0; idx < nused; idx++) { hva = vm_gpa_hold(vm, vcpuid, copyinfo[idx].gpa, copyinfo[idx].len, prot, &cookie); if (hva == NULL) break; copyinfo[idx].hva = hva; copyinfo[idx].cookie = cookie; } if (idx != nused) { vm_copy_teardown(vm, vcpuid, copyinfo, num_copyinfo); return (EFAULT); } else { *fault = 0; return (0); } } void vm_copyin(struct vm *vm, int vcpuid, struct vm_copyinfo *copyinfo, void *kaddr, size_t len) { char *dst; int idx; dst = kaddr; idx = 0; while (len > 0) { bcopy(copyinfo[idx].hva, dst, copyinfo[idx].len); len -= copyinfo[idx].len; dst += copyinfo[idx].len; idx++; } } void vm_copyout(struct vm *vm, int vcpuid, const void *kaddr, struct vm_copyinfo *copyinfo, size_t len) { const char *src; int idx; src = kaddr; idx = 0; while (len > 0) { bcopy(src, copyinfo[idx].hva, copyinfo[idx].len); len -= copyinfo[idx].len; src += copyinfo[idx].len; idx++; } } /* * Return the amount of in-use and wired memory for the VM. 
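/*
 * Illustrative sketch (hypothetical demo_* names): vm_copy_setup() above
 * splits a guest linear range into chunks that never cross a page boundary,
 * so every chunk can be translated and held with one vm_gpa_hold() call.
 * The same splitting in isolation; 'lens' is assumed large enough, which the
 * kernel asserts with "insufficient vm_copyinfo".
 */
#include <stddef.h>
#include <stdint.h>

#define	DEMO_PAGE_SIZE	4096UL
#define	DEMO_PAGE_MASK	(DEMO_PAGE_SIZE - 1)

static size_t
demo_split_by_page(uint64_t gla, size_t len, size_t *lens)
{
	size_t n, nchunks, off, remaining;

	nchunks = 0;
	remaining = len;
	while (remaining > 0) {
		off = gla & DEMO_PAGE_MASK;	/* page offset of this chunk */
		n = remaining < DEMO_PAGE_SIZE - off ?
		    remaining : DEMO_PAGE_SIZE - off;
		lens[nchunks++] = n;
		gla += n;
		remaining -= n;
	}
	return (nchunks);
}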
Since * these are global stats, only return the values with for vCPU 0 */ VMM_STAT_DECLARE(VMM_MEM_RESIDENT); VMM_STAT_DECLARE(VMM_MEM_WIRED); static void vm_get_rescnt(struct vm *vm, int vcpu, struct vmm_stat_type *stat) { if (vcpu == 0) { vmm_stat_set(vm, vcpu, VMM_MEM_RESIDENT, PAGE_SIZE * vmspace_resident_count(vm->vmspace)); } } static void vm_get_wiredcnt(struct vm *vm, int vcpu, struct vmm_stat_type *stat) { if (vcpu == 0) { vmm_stat_set(vm, vcpu, VMM_MEM_WIRED, PAGE_SIZE * pmap_wired_count(vmspace_pmap(vm->vmspace))); } } VMM_STAT_FUNC(VMM_MEM_RESIDENT, "Resident memory", vm_get_rescnt); VMM_STAT_FUNC(VMM_MEM_WIRED, "Wired memory", vm_get_wiredcnt); Index: user/ngie/bug-237403/sys/amd64/vmm/vmm_dev.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/vmm_dev.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/vmm_dev.c (revision 346926) @@ -1,1131 +1,1142 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "vmm_lapic.h" #include "vmm_stat.h" #include "vmm_mem.h" #include "io/ppt.h" #include "io/vatpic.h" #include "io/vioapic.h" #include "io/vhpet.h" #include "io/vrtc.h" struct devmem_softc { int segid; char *name; struct cdev *cdev; struct vmmdev_softc *sc; SLIST_ENTRY(devmem_softc) link; }; struct vmmdev_softc { struct vm *vm; /* vm instance cookie */ struct cdev *cdev; SLIST_ENTRY(vmmdev_softc) link; SLIST_HEAD(, devmem_softc) devmem; int flags; }; #define VSC_LINKED 0x01 static SLIST_HEAD(, vmmdev_softc) head; static unsigned pr_allow_flag; static struct mtx vmmdev_mtx; static MALLOC_DEFINE(M_VMMDEV, "vmmdev", "vmmdev"); SYSCTL_DECL(_hw_vmm); static int vmm_priv_check(struct ucred *ucred); static int devmem_create_cdev(const char *vmname, int id, char *devmem); static void devmem_destroy(void *arg); static int vmm_priv_check(struct ucred *ucred) { if (jailed(ucred) && !(ucred->cr_prison->pr_allow & pr_allow_flag)) return (EPERM); return (0); } static int vcpu_lock_one(struct vmmdev_softc *sc, int vcpu) { int error; - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm_get_maxcpus(sc->vm)) return (EINVAL); error = vcpu_set_state(sc->vm, vcpu, VCPU_FROZEN, true); return (error); } static void vcpu_unlock_one(struct vmmdev_softc *sc, int vcpu) { enum vcpu_state state; state = vcpu_get_state(sc->vm, vcpu, NULL); if (state != VCPU_FROZEN) { panic("vcpu %s(%d) has invalid state %d", vm_name(sc->vm), vcpu, state); } vcpu_set_state(sc->vm, vcpu, VCPU_IDLE, false); } static int vcpu_lock_all(struct vmmdev_softc *sc) { int error, vcpu; + uint16_t maxcpus; - for (vcpu = 0; vcpu < VM_MAXCPU; vcpu++) { + maxcpus = vm_get_maxcpus(sc->vm); + for (vcpu = 0; vcpu < maxcpus; vcpu++) { error = vcpu_lock_one(sc, vcpu); if (error) break; } if (error) { while (--vcpu >= 0) vcpu_unlock_one(sc, vcpu); } return (error); } static void vcpu_unlock_all(struct vmmdev_softc *sc) { int vcpu; + uint16_t maxcpus; - for (vcpu = 0; vcpu < VM_MAXCPU; vcpu++) + maxcpus = vm_get_maxcpus(sc->vm); + for (vcpu = 0; vcpu < maxcpus; vcpu++) vcpu_unlock_one(sc, vcpu); } static struct vmmdev_softc * vmmdev_lookup(const char *name) { struct vmmdev_softc *sc; #ifdef notyet /* XXX kernel is not compiled with invariants */ mtx_assert(&vmmdev_mtx, MA_OWNED); #endif SLIST_FOREACH(sc, &head, link) { if (strcmp(name, vm_name(sc->vm)) == 0) break; } return (sc); } static struct vmmdev_softc * vmmdev_lookup2(struct cdev *cdev) { return (cdev->si_drv1); } static int vmmdev_rw(struct cdev *cdev, struct uio *uio, int flags) { int error, off, c, prot; vm_paddr_t gpa, maxaddr; void *hpa, *cookie; struct vmmdev_softc *sc; + uint16_t lastcpu; error = vmm_priv_check(curthread->td_ucred); if (error) return (error); sc = vmmdev_lookup2(cdev); if (sc == NULL) return (ENXIO); /* * Get a read lock on the guest memory map by freezing any vcpu. */ - error = vcpu_lock_one(sc, VM_MAXCPU - 1); + lastcpu = vm_get_maxcpus(sc->vm) - 1; + error = vcpu_lock_one(sc, lastcpu); if (error) return (error); prot = (uio->uio_rw == UIO_WRITE ? VM_PROT_WRITE : VM_PROT_READ); maxaddr = vmm_sysmem_maxaddr(sc->vm); while (uio->uio_resid > 0 && error == 0) { gpa = uio->uio_offset; off = gpa & PAGE_MASK; c = min(uio->uio_resid, PAGE_SIZE - off); /* * The VM has a hole in its physical memory map. 
If we want to * use 'dd' to inspect memory beyond the hole we need to * provide bogus data for memory that lies in the hole. * * Since this device does not support lseek(2), dd(1) will * read(2) blocks of data to simulate the lseek(2). */ - hpa = vm_gpa_hold(sc->vm, VM_MAXCPU - 1, gpa, c, prot, &cookie); + hpa = vm_gpa_hold(sc->vm, lastcpu, gpa, c, + prot, &cookie); if (hpa == NULL) { if (uio->uio_rw == UIO_READ && gpa < maxaddr) error = uiomove(__DECONST(void *, zero_region), c, uio); else error = EFAULT; } else { error = uiomove(hpa, c, uio); vm_gpa_release(cookie); } } - vcpu_unlock_one(sc, VM_MAXCPU - 1); + vcpu_unlock_one(sc, lastcpu); return (error); } CTASSERT(sizeof(((struct vm_memseg *)0)->name) >= SPECNAMELEN + 1); static int get_memseg(struct vmmdev_softc *sc, struct vm_memseg *mseg) { struct devmem_softc *dsc; int error; bool sysmem; error = vm_get_memseg(sc->vm, mseg->segid, &mseg->len, &sysmem, NULL); if (error || mseg->len == 0) return (error); if (!sysmem) { SLIST_FOREACH(dsc, &sc->devmem, link) { if (dsc->segid == mseg->segid) break; } KASSERT(dsc != NULL, ("%s: devmem segment %d not found", __func__, mseg->segid)); error = copystr(dsc->name, mseg->name, SPECNAMELEN + 1, NULL); } else { bzero(mseg->name, sizeof(mseg->name)); } return (error); } static int alloc_memseg(struct vmmdev_softc *sc, struct vm_memseg *mseg) { char *name; int error; bool sysmem; error = 0; name = NULL; sysmem = true; if (VM_MEMSEG_NAME(mseg)) { sysmem = false; name = malloc(SPECNAMELEN + 1, M_VMMDEV, M_WAITOK); error = copystr(mseg->name, name, SPECNAMELEN + 1, 0); if (error) goto done; } error = vm_alloc_memseg(sc->vm, mseg->segid, mseg->len, sysmem); if (error) goto done; if (VM_MEMSEG_NAME(mseg)) { error = devmem_create_cdev(vm_name(sc->vm), mseg->segid, name); if (error) vm_free_memseg(sc->vm, mseg->segid); else name = NULL; /* freed when 'cdev' is destroyed */ } done: free(name, M_VMMDEV); return (error); } static int vm_get_register_set(struct vm *vm, int vcpu, unsigned int count, int *regnum, uint64_t *regval) { int error, i; error = 0; for (i = 0; i < count; i++) { error = vm_get_register(vm, vcpu, regnum[i], ®val[i]); if (error) break; } return (error); } static int vm_set_register_set(struct vm *vm, int vcpu, unsigned int count, int *regnum, uint64_t *regval) { int error, i; error = 0; for (i = 0; i < count; i++) { error = vm_set_register(vm, vcpu, regnum[i], regval[i]); if (error) break; } return (error); } static int vmmdev_ioctl(struct cdev *cdev, u_long cmd, caddr_t data, int fflag, struct thread *td) { int error, vcpu, state_changed, size; cpuset_t *cpuset; struct vmmdev_softc *sc; struct vm_register *vmreg; struct vm_seg_desc *vmsegdesc; struct vm_register_set *vmregset; struct vm_run *vmrun; struct vm_exception *vmexc; struct vm_lapic_irq *vmirq; struct vm_lapic_msi *vmmsi; struct vm_ioapic_irq *ioapic_irq; struct vm_isa_irq *isa_irq; struct vm_isa_irq_trigger *isa_irq_trigger; struct vm_capability *vmcap; struct vm_pptdev *pptdev; struct vm_pptdev_mmio *pptmmio; struct vm_pptdev_msi *pptmsi; struct vm_pptdev_msix *pptmsix; struct vm_nmi *vmnmi; struct vm_stats *vmstats; struct vm_stat_desc *statdesc; struct vm_x2apic *x2apic; struct vm_gpa_pte *gpapte; struct vm_suspend *vmsuspend; struct vm_gla2gpa *gg; struct vm_activate_cpu *vac; struct vm_cpuset *vm_cpuset; struct vm_intinfo *vmii; struct vm_rtc_time *rtctime; struct vm_rtc_data *rtcdata; struct vm_memmap *mm; struct vm_cpu_topology *topology; uint64_t *regvals; int *regnums; error = vmm_priv_check(curthread->td_ucred); if 
(error) return (error); sc = vmmdev_lookup2(cdev); if (sc == NULL) return (ENXIO); vcpu = -1; state_changed = 0; /* * Some VMM ioctls can operate only on vcpus that are not running. */ switch (cmd) { case VM_RUN: case VM_GET_REGISTER: case VM_SET_REGISTER: case VM_GET_SEGMENT_DESCRIPTOR: case VM_SET_SEGMENT_DESCRIPTOR: case VM_GET_REGISTER_SET: case VM_SET_REGISTER_SET: case VM_INJECT_EXCEPTION: case VM_GET_CAPABILITY: case VM_SET_CAPABILITY: case VM_PPTDEV_MSI: case VM_PPTDEV_MSIX: case VM_SET_X2APIC_STATE: case VM_GLA2GPA: case VM_GLA2GPA_NOFAULT: case VM_ACTIVATE_CPU: case VM_SET_INTINFO: case VM_GET_INTINFO: case VM_RESTART_INSTRUCTION: /* * XXX fragile, handle with care * Assumes that the first field of the ioctl data is the vcpu. */ vcpu = *(int *)data; error = vcpu_lock_one(sc, vcpu); if (error) goto done; state_changed = 1; break; case VM_MAP_PPTDEV_MMIO: case VM_BIND_PPTDEV: case VM_UNBIND_PPTDEV: case VM_ALLOC_MEMSEG: case VM_MMAP_MEMSEG: case VM_REINIT: /* * ioctls that operate on the entire virtual machine must * prevent all vcpus from running. */ error = vcpu_lock_all(sc); if (error) goto done; state_changed = 2; break; case VM_GET_MEMSEG: case VM_MMAP_GETNEXT: /* * Lock a vcpu to make sure that the memory map cannot be * modified while it is being inspected. */ - vcpu = VM_MAXCPU - 1; + vcpu = vm_get_maxcpus(sc->vm) - 1; error = vcpu_lock_one(sc, vcpu); if (error) goto done; state_changed = 1; break; default: break; } switch(cmd) { case VM_RUN: vmrun = (struct vm_run *)data; error = vm_run(sc->vm, vmrun); break; case VM_SUSPEND: vmsuspend = (struct vm_suspend *)data; error = vm_suspend(sc->vm, vmsuspend->how); break; case VM_REINIT: error = vm_reinit(sc->vm); break; case VM_STAT_DESC: { statdesc = (struct vm_stat_desc *)data; error = vmm_stat_desc_copy(statdesc->index, statdesc->desc, sizeof(statdesc->desc)); break; } case VM_STATS: { CTASSERT(MAX_VM_STATS >= MAX_VMM_STAT_ELEMS); vmstats = (struct vm_stats *)data; getmicrotime(&vmstats->tv); error = vmm_stat_copy(sc->vm, vmstats->cpuid, &vmstats->num_entries, vmstats->statbuf); break; } case VM_PPTDEV_MSI: pptmsi = (struct vm_pptdev_msi *)data; error = ppt_setup_msi(sc->vm, pptmsi->vcpu, pptmsi->bus, pptmsi->slot, pptmsi->func, pptmsi->addr, pptmsi->msg, pptmsi->numvec); break; case VM_PPTDEV_MSIX: pptmsix = (struct vm_pptdev_msix *)data; error = ppt_setup_msix(sc->vm, pptmsix->vcpu, pptmsix->bus, pptmsix->slot, pptmsix->func, pptmsix->idx, pptmsix->addr, pptmsix->msg, pptmsix->vector_control); break; case VM_MAP_PPTDEV_MMIO: pptmmio = (struct vm_pptdev_mmio *)data; error = ppt_map_mmio(sc->vm, pptmmio->bus, pptmmio->slot, pptmmio->func, pptmmio->gpa, pptmmio->len, pptmmio->hpa); break; case VM_BIND_PPTDEV: pptdev = (struct vm_pptdev *)data; error = vm_assign_pptdev(sc->vm, pptdev->bus, pptdev->slot, pptdev->func); break; case VM_UNBIND_PPTDEV: pptdev = (struct vm_pptdev *)data; error = vm_unassign_pptdev(sc->vm, pptdev->bus, pptdev->slot, pptdev->func); break; case VM_INJECT_EXCEPTION: vmexc = (struct vm_exception *)data; error = vm_inject_exception(sc->vm, vmexc->cpuid, vmexc->vector, vmexc->error_code_valid, vmexc->error_code, vmexc->restart_instruction); break; case VM_INJECT_NMI: vmnmi = (struct vm_nmi *)data; error = vm_inject_nmi(sc->vm, vmnmi->cpuid); break; case VM_LAPIC_IRQ: vmirq = (struct vm_lapic_irq *)data; error = lapic_intr_edge(sc->vm, vmirq->cpuid, vmirq->vector); break; case VM_LAPIC_LOCAL_IRQ: vmirq = (struct vm_lapic_irq *)data; error = lapic_set_local_intr(sc->vm, vmirq->cpuid, vmirq->vector); break; 
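/*
 * [Editor's illustration, not part of revision 346926.]  The per-vcpu
 * ioctl cases above rely on the locking scheme described at the top of
 * vmmdev_ioctl(): the vcpu id is read from the first field of the ioctl
 * argument and frozen with vcpu_lock_one().  After this change the bound
 * on valid vcpu ids comes from the per-VM topology rather than the
 * compile-time VM_MAXCPU, mirroring the new body of vcpu_lock_one():
 *
 *	uint16_t maxcpus;
 *
 *	maxcpus = vm_get_maxcpus(sc->vm);
 *	if (vcpu < 0 || vcpu >= maxcpus)
 *		return (EINVAL);
 *	error = vcpu_set_state(sc->vm, vcpu, VCPU_FROZEN, true);
 *
 * Ioctls that only inspect the guest memory map freeze the last vcpu,
 * which is now vm_get_maxcpus(sc->vm) - 1 rather than VM_MAXCPU - 1.
 */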
case VM_LAPIC_MSI: vmmsi = (struct vm_lapic_msi *)data; error = lapic_intr_msi(sc->vm, vmmsi->addr, vmmsi->msg); break; case VM_IOAPIC_ASSERT_IRQ: ioapic_irq = (struct vm_ioapic_irq *)data; error = vioapic_assert_irq(sc->vm, ioapic_irq->irq); break; case VM_IOAPIC_DEASSERT_IRQ: ioapic_irq = (struct vm_ioapic_irq *)data; error = vioapic_deassert_irq(sc->vm, ioapic_irq->irq); break; case VM_IOAPIC_PULSE_IRQ: ioapic_irq = (struct vm_ioapic_irq *)data; error = vioapic_pulse_irq(sc->vm, ioapic_irq->irq); break; case VM_IOAPIC_PINCOUNT: *(int *)data = vioapic_pincount(sc->vm); break; case VM_ISA_ASSERT_IRQ: isa_irq = (struct vm_isa_irq *)data; error = vatpic_assert_irq(sc->vm, isa_irq->atpic_irq); if (error == 0 && isa_irq->ioapic_irq != -1) error = vioapic_assert_irq(sc->vm, isa_irq->ioapic_irq); break; case VM_ISA_DEASSERT_IRQ: isa_irq = (struct vm_isa_irq *)data; error = vatpic_deassert_irq(sc->vm, isa_irq->atpic_irq); if (error == 0 && isa_irq->ioapic_irq != -1) error = vioapic_deassert_irq(sc->vm, isa_irq->ioapic_irq); break; case VM_ISA_PULSE_IRQ: isa_irq = (struct vm_isa_irq *)data; error = vatpic_pulse_irq(sc->vm, isa_irq->atpic_irq); if (error == 0 && isa_irq->ioapic_irq != -1) error = vioapic_pulse_irq(sc->vm, isa_irq->ioapic_irq); break; case VM_ISA_SET_IRQ_TRIGGER: isa_irq_trigger = (struct vm_isa_irq_trigger *)data; error = vatpic_set_irq_trigger(sc->vm, isa_irq_trigger->atpic_irq, isa_irq_trigger->trigger); break; case VM_MMAP_GETNEXT: mm = (struct vm_memmap *)data; error = vm_mmap_getnext(sc->vm, &mm->gpa, &mm->segid, &mm->segoff, &mm->len, &mm->prot, &mm->flags); break; case VM_MMAP_MEMSEG: mm = (struct vm_memmap *)data; error = vm_mmap_memseg(sc->vm, mm->gpa, mm->segid, mm->segoff, mm->len, mm->prot, mm->flags); break; case VM_ALLOC_MEMSEG: error = alloc_memseg(sc, (struct vm_memseg *)data); break; case VM_GET_MEMSEG: error = get_memseg(sc, (struct vm_memseg *)data); break; case VM_GET_REGISTER: vmreg = (struct vm_register *)data; error = vm_get_register(sc->vm, vmreg->cpuid, vmreg->regnum, &vmreg->regval); break; case VM_SET_REGISTER: vmreg = (struct vm_register *)data; error = vm_set_register(sc->vm, vmreg->cpuid, vmreg->regnum, vmreg->regval); break; case VM_SET_SEGMENT_DESCRIPTOR: vmsegdesc = (struct vm_seg_desc *)data; error = vm_set_seg_desc(sc->vm, vmsegdesc->cpuid, vmsegdesc->regnum, &vmsegdesc->desc); break; case VM_GET_SEGMENT_DESCRIPTOR: vmsegdesc = (struct vm_seg_desc *)data; error = vm_get_seg_desc(sc->vm, vmsegdesc->cpuid, vmsegdesc->regnum, &vmsegdesc->desc); break; case VM_GET_REGISTER_SET: vmregset = (struct vm_register_set *)data; if (vmregset->count > VM_REG_LAST) { error = EINVAL; break; } regvals = malloc(sizeof(regvals[0]) * vmregset->count, M_VMMDEV, M_WAITOK); regnums = malloc(sizeof(regnums[0]) * vmregset->count, M_VMMDEV, M_WAITOK); error = copyin(vmregset->regnums, regnums, sizeof(regnums[0]) * vmregset->count); if (error == 0) error = vm_get_register_set(sc->vm, vmregset->cpuid, vmregset->count, regnums, regvals); if (error == 0) error = copyout(regvals, vmregset->regvals, sizeof(regvals[0]) * vmregset->count); free(regvals, M_VMMDEV); free(regnums, M_VMMDEV); break; case VM_SET_REGISTER_SET: vmregset = (struct vm_register_set *)data; if (vmregset->count > VM_REG_LAST) { error = EINVAL; break; } regvals = malloc(sizeof(regvals[0]) * vmregset->count, M_VMMDEV, M_WAITOK); regnums = malloc(sizeof(regnums[0]) * vmregset->count, M_VMMDEV, M_WAITOK); error = copyin(vmregset->regnums, regnums, sizeof(regnums[0]) * vmregset->count); if (error == 0) error = 
copyin(vmregset->regvals, regvals, sizeof(regvals[0]) * vmregset->count); if (error == 0) error = vm_set_register_set(sc->vm, vmregset->cpuid, vmregset->count, regnums, regvals); free(regvals, M_VMMDEV); free(regnums, M_VMMDEV); break; case VM_GET_CAPABILITY: vmcap = (struct vm_capability *)data; error = vm_get_capability(sc->vm, vmcap->cpuid, vmcap->captype, &vmcap->capval); break; case VM_SET_CAPABILITY: vmcap = (struct vm_capability *)data; error = vm_set_capability(sc->vm, vmcap->cpuid, vmcap->captype, vmcap->capval); break; case VM_SET_X2APIC_STATE: x2apic = (struct vm_x2apic *)data; error = vm_set_x2apic_state(sc->vm, x2apic->cpuid, x2apic->state); break; case VM_GET_X2APIC_STATE: x2apic = (struct vm_x2apic *)data; error = vm_get_x2apic_state(sc->vm, x2apic->cpuid, &x2apic->state); break; case VM_GET_GPA_PMAP: gpapte = (struct vm_gpa_pte *)data; pmap_get_mapping(vmspace_pmap(vm_get_vmspace(sc->vm)), gpapte->gpa, gpapte->pte, &gpapte->ptenum); error = 0; break; case VM_GET_HPET_CAPABILITIES: error = vhpet_getcap((struct vm_hpet_cap *)data); break; case VM_GLA2GPA: { CTASSERT(PROT_READ == VM_PROT_READ); CTASSERT(PROT_WRITE == VM_PROT_WRITE); CTASSERT(PROT_EXEC == VM_PROT_EXECUTE); gg = (struct vm_gla2gpa *)data; error = vm_gla2gpa(sc->vm, gg->vcpuid, &gg->paging, gg->gla, gg->prot, &gg->gpa, &gg->fault); KASSERT(error == 0 || error == EFAULT, ("%s: vm_gla2gpa unknown error %d", __func__, error)); break; } case VM_GLA2GPA_NOFAULT: gg = (struct vm_gla2gpa *)data; error = vm_gla2gpa_nofault(sc->vm, gg->vcpuid, &gg->paging, gg->gla, gg->prot, &gg->gpa, &gg->fault); KASSERT(error == 0 || error == EFAULT, ("%s: vm_gla2gpa unknown error %d", __func__, error)); break; case VM_ACTIVATE_CPU: vac = (struct vm_activate_cpu *)data; error = vm_activate_cpu(sc->vm, vac->vcpuid); break; case VM_GET_CPUS: error = 0; vm_cpuset = (struct vm_cpuset *)data; size = vm_cpuset->cpusetsize; if (size < sizeof(cpuset_t) || size > CPU_MAXSIZE / NBBY) { error = ERANGE; break; } cpuset = malloc(size, M_TEMP, M_WAITOK | M_ZERO); if (vm_cpuset->which == VM_ACTIVE_CPUS) *cpuset = vm_active_cpus(sc->vm); else if (vm_cpuset->which == VM_SUSPENDED_CPUS) *cpuset = vm_suspended_cpus(sc->vm); else if (vm_cpuset->which == VM_DEBUG_CPUS) *cpuset = vm_debug_cpus(sc->vm); else error = EINVAL; if (error == 0) error = copyout(cpuset, vm_cpuset->cpus, size); free(cpuset, M_TEMP); break; case VM_SUSPEND_CPU: vac = (struct vm_activate_cpu *)data; error = vm_suspend_cpu(sc->vm, vac->vcpuid); break; case VM_RESUME_CPU: vac = (struct vm_activate_cpu *)data; error = vm_resume_cpu(sc->vm, vac->vcpuid); break; case VM_SET_INTINFO: vmii = (struct vm_intinfo *)data; error = vm_exit_intinfo(sc->vm, vmii->vcpuid, vmii->info1); break; case VM_GET_INTINFO: vmii = (struct vm_intinfo *)data; error = vm_get_intinfo(sc->vm, vmii->vcpuid, &vmii->info1, &vmii->info2); break; case VM_RTC_WRITE: rtcdata = (struct vm_rtc_data *)data; error = vrtc_nvram_write(sc->vm, rtcdata->offset, rtcdata->value); break; case VM_RTC_READ: rtcdata = (struct vm_rtc_data *)data; error = vrtc_nvram_read(sc->vm, rtcdata->offset, &rtcdata->value); break; case VM_RTC_SETTIME: rtctime = (struct vm_rtc_time *)data; error = vrtc_set_time(sc->vm, rtctime->secs); break; case VM_RTC_GETTIME: error = 0; rtctime = (struct vm_rtc_time *)data; rtctime->secs = vrtc_get_time(sc->vm); break; case VM_RESTART_INSTRUCTION: error = vm_restart_instruction(sc->vm, vcpu); break; case VM_SET_TOPOLOGY: topology = (struct vm_cpu_topology *)data; error = vm_set_topology(sc->vm, topology->sockets, 
topology->cores, topology->threads, topology->maxcpus); break; case VM_GET_TOPOLOGY: topology = (struct vm_cpu_topology *)data; vm_get_topology(sc->vm, &topology->sockets, &topology->cores, &topology->threads, &topology->maxcpus); error = 0; break; default: error = ENOTTY; break; } if (state_changed == 1) vcpu_unlock_one(sc, vcpu); else if (state_changed == 2) vcpu_unlock_all(sc); done: /* Make sure that no handler returns a bogus value like ERESTART */ KASSERT(error >= 0, ("vmmdev_ioctl: invalid error return %d", error)); return (error); } static int vmmdev_mmap_single(struct cdev *cdev, vm_ooffset_t *offset, vm_size_t mapsize, struct vm_object **objp, int nprot) { struct vmmdev_softc *sc; vm_paddr_t gpa; size_t len; vm_ooffset_t segoff, first, last; int error, found, segid; + uint16_t lastcpu; bool sysmem; error = vmm_priv_check(curthread->td_ucred); if (error) return (error); first = *offset; last = first + mapsize; if ((nprot & PROT_EXEC) || first < 0 || first >= last) return (EINVAL); sc = vmmdev_lookup2(cdev); if (sc == NULL) { /* virtual machine is in the process of being created */ return (EINVAL); } /* * Get a read lock on the guest memory map by freezing any vcpu. */ - error = vcpu_lock_one(sc, VM_MAXCPU - 1); + lastcpu = vm_get_maxcpus(sc->vm) - 1; + error = vcpu_lock_one(sc, lastcpu); if (error) return (error); gpa = 0; found = 0; while (!found) { error = vm_mmap_getnext(sc->vm, &gpa, &segid, &segoff, &len, NULL, NULL); if (error) break; if (first >= gpa && last <= gpa + len) found = 1; else gpa += len; } if (found) { error = vm_get_memseg(sc->vm, segid, &len, &sysmem, objp); KASSERT(error == 0 && *objp != NULL, ("%s: invalid memory segment %d", __func__, segid)); if (sysmem) { vm_object_reference(*objp); *offset = segoff + (first - gpa); } else { error = EINVAL; } } - vcpu_unlock_one(sc, VM_MAXCPU - 1); + vcpu_unlock_one(sc, lastcpu); return (error); } static void vmmdev_destroy(void *arg) { struct vmmdev_softc *sc = arg; struct devmem_softc *dsc; int error; error = vcpu_lock_all(sc); KASSERT(error == 0, ("%s: error %d freezing vcpus", __func__, error)); while ((dsc = SLIST_FIRST(&sc->devmem)) != NULL) { KASSERT(dsc->cdev == NULL, ("%s: devmem not free", __func__)); SLIST_REMOVE_HEAD(&sc->devmem, link); free(dsc->name, M_VMMDEV); free(dsc, M_VMMDEV); } if (sc->cdev != NULL) destroy_dev(sc->cdev); if (sc->vm != NULL) vm_destroy(sc->vm); if ((sc->flags & VSC_LINKED) != 0) { mtx_lock(&vmmdev_mtx); SLIST_REMOVE(&head, sc, vmmdev_softc, link); mtx_unlock(&vmmdev_mtx); } free(sc, M_VMMDEV); } static int sysctl_vmm_destroy(SYSCTL_HANDLER_ARGS) { int error; char buf[VM_MAX_NAMELEN]; struct devmem_softc *dsc; struct vmmdev_softc *sc; struct cdev *cdev; error = vmm_priv_check(req->td->td_ucred); if (error) return (error); strlcpy(buf, "beavis", sizeof(buf)); error = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (error != 0 || req->newptr == NULL) return (error); mtx_lock(&vmmdev_mtx); sc = vmmdev_lookup(buf); if (sc == NULL || sc->cdev == NULL) { mtx_unlock(&vmmdev_mtx); return (EINVAL); } /* * The 'cdev' will be destroyed asynchronously when 'si_threadcount' * goes down to 0 so we should not do it again in the callback. * * Setting 'sc->cdev' to NULL is also used to indicate that the VM * is scheduled for destruction. */ cdev = sc->cdev; sc->cdev = NULL; mtx_unlock(&vmmdev_mtx); /* * Schedule all cdevs to be destroyed: * * - any new operations on the 'cdev' will return an error (ENXIO). 
* * - when the 'si_threadcount' dwindles down to zero the 'cdev' will * be destroyed and the callback will be invoked in a taskqueue * context. * * - the 'devmem' cdevs are destroyed before the virtual machine 'cdev' */ SLIST_FOREACH(dsc, &sc->devmem, link) { KASSERT(dsc->cdev != NULL, ("devmem cdev already destroyed")); destroy_dev_sched_cb(dsc->cdev, devmem_destroy, dsc); } destroy_dev_sched_cb(cdev, vmmdev_destroy, sc); return (0); } SYSCTL_PROC(_hw_vmm, OID_AUTO, destroy, CTLTYPE_STRING | CTLFLAG_RW | CTLFLAG_PRISON, NULL, 0, sysctl_vmm_destroy, "A", NULL); static struct cdevsw vmmdevsw = { .d_name = "vmmdev", .d_version = D_VERSION, .d_ioctl = vmmdev_ioctl, .d_mmap_single = vmmdev_mmap_single, .d_read = vmmdev_rw, .d_write = vmmdev_rw, }; static int sysctl_vmm_create(SYSCTL_HANDLER_ARGS) { int error; struct vm *vm; struct cdev *cdev; struct vmmdev_softc *sc, *sc2; char buf[VM_MAX_NAMELEN]; error = vmm_priv_check(req->td->td_ucred); if (error) return (error); strlcpy(buf, "beavis", sizeof(buf)); error = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (error != 0 || req->newptr == NULL) return (error); mtx_lock(&vmmdev_mtx); sc = vmmdev_lookup(buf); mtx_unlock(&vmmdev_mtx); if (sc != NULL) return (EEXIST); error = vm_create(buf, &vm); if (error != 0) return (error); sc = malloc(sizeof(struct vmmdev_softc), M_VMMDEV, M_WAITOK | M_ZERO); sc->vm = vm; SLIST_INIT(&sc->devmem); /* * Lookup the name again just in case somebody sneaked in when we * dropped the lock. */ mtx_lock(&vmmdev_mtx); sc2 = vmmdev_lookup(buf); if (sc2 == NULL) { SLIST_INSERT_HEAD(&head, sc, link); sc->flags |= VSC_LINKED; } mtx_unlock(&vmmdev_mtx); if (sc2 != NULL) { vmmdev_destroy(sc); return (EEXIST); } error = make_dev_p(MAKEDEV_CHECKNAME, &cdev, &vmmdevsw, NULL, UID_ROOT, GID_WHEEL, 0600, "vmm/%s", buf); if (error != 0) { vmmdev_destroy(sc); return (error); } mtx_lock(&vmmdev_mtx); sc->cdev = cdev; sc->cdev->si_drv1 = sc; mtx_unlock(&vmmdev_mtx); return (0); } SYSCTL_PROC(_hw_vmm, OID_AUTO, create, CTLTYPE_STRING | CTLFLAG_RW | CTLFLAG_PRISON, NULL, 0, sysctl_vmm_create, "A", NULL); void vmmdev_init(void) { mtx_init(&vmmdev_mtx, "vmm device mutex", NULL, MTX_DEF); pr_allow_flag = prison_add_allow(NULL, "vmm", NULL, "Allow use of vmm in a jail."); } int vmmdev_cleanup(void) { int error; if (SLIST_EMPTY(&head)) error = 0; else error = EBUSY; return (error); } static int devmem_mmap_single(struct cdev *cdev, vm_ooffset_t *offset, vm_size_t len, struct vm_object **objp, int nprot) { struct devmem_softc *dsc; vm_ooffset_t first, last; size_t seglen; int error; + uint16_t lastcpu; bool sysmem; dsc = cdev->si_drv1; if (dsc == NULL) { /* 'cdev' has been created but is not ready for use */ return (ENXIO); } first = *offset; last = *offset + len; if ((nprot & PROT_EXEC) || first < 0 || first >= last) return (EINVAL); - error = vcpu_lock_one(dsc->sc, VM_MAXCPU - 1); + lastcpu = vm_get_maxcpus(dsc->sc->vm) - 1; + error = vcpu_lock_one(dsc->sc, lastcpu); if (error) return (error); error = vm_get_memseg(dsc->sc->vm, dsc->segid, &seglen, &sysmem, objp); KASSERT(error == 0 && !sysmem && *objp != NULL, ("%s: invalid devmem segment %d", __func__, dsc->segid)); - vcpu_unlock_one(dsc->sc, VM_MAXCPU - 1); + vcpu_unlock_one(dsc->sc, lastcpu); if (seglen >= last) { vm_object_reference(*objp); return (0); } else { return (EINVAL); } } static struct cdevsw devmemsw = { .d_name = "devmem", .d_version = D_VERSION, .d_mmap_single = devmem_mmap_single, }; static int devmem_create_cdev(const char *vmname, int segid, char *devname) { struct 
devmem_softc *dsc; struct vmmdev_softc *sc; struct cdev *cdev; int error; error = make_dev_p(MAKEDEV_CHECKNAME, &cdev, &devmemsw, NULL, UID_ROOT, GID_WHEEL, 0600, "vmm.io/%s.%s", vmname, devname); if (error) return (error); dsc = malloc(sizeof(struct devmem_softc), M_VMMDEV, M_WAITOK | M_ZERO); mtx_lock(&vmmdev_mtx); sc = vmmdev_lookup(vmname); KASSERT(sc != NULL, ("%s: vm %s softc not found", __func__, vmname)); if (sc->cdev == NULL) { /* virtual machine is being created or destroyed */ mtx_unlock(&vmmdev_mtx); free(dsc, M_VMMDEV); destroy_dev_sched_cb(cdev, NULL, 0); return (ENODEV); } dsc->segid = segid; dsc->name = devname; dsc->cdev = cdev; dsc->sc = sc; SLIST_INSERT_HEAD(&sc->devmem, dsc, link); mtx_unlock(&vmmdev_mtx); /* The 'cdev' is ready for use after 'si_drv1' is initialized */ cdev->si_drv1 = dsc; return (0); } static void devmem_destroy(void *arg) { struct devmem_softc *dsc = arg; KASSERT(dsc->cdev, ("%s: devmem cdev already destroyed", __func__)); dsc->cdev = NULL; dsc->sc = NULL; } Index: user/ngie/bug-237403/sys/amd64/vmm/vmm_lapic.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/vmm_lapic.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/vmm_lapic.c (revision 346926) @@ -1,249 +1,249 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include "vmm_ktr.h" #include "vmm_lapic.h" #include "vlapic.h" /* * Some MSI message definitions */ #define MSI_X86_ADDR_MASK 0xfff00000 #define MSI_X86_ADDR_BASE 0xfee00000 #define MSI_X86_ADDR_RH 0x00000008 /* Redirection Hint */ #define MSI_X86_ADDR_LOG 0x00000004 /* Destination Mode */ int lapic_set_intr(struct vm *vm, int cpu, int vector, bool level) { struct vlapic *vlapic; - if (cpu < 0 || cpu >= VM_MAXCPU) + if (cpu < 0 || cpu >= vm_get_maxcpus(vm)) return (EINVAL); /* * According to section "Maskable Hardware Interrupts" in Intel SDM * vectors 16 through 255 can be delivered through the local APIC. 
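 *
 * [Editor's illustration, not part of revision 346926: a caller injecting
 * an edge-triggered interrupt therefore passes a vector in that range and
 * a vcpu id below vm_get_maxcpus(vm), for example
 *
 *	error = lapic_set_intr(vm, 0, 48, false);
 *
 * which targets vcpu 0 with vector 48; level == false requests edge
 * triggering.]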
*/ if (vector < 16 || vector > 255) return (EINVAL); vlapic = vm_lapic(vm, cpu); if (vlapic_set_intr_ready(vlapic, vector, level)) vcpu_notify_event(vm, cpu, true); return (0); } int lapic_set_local_intr(struct vm *vm, int cpu, int vector) { struct vlapic *vlapic; cpuset_t dmask; int error; - if (cpu < -1 || cpu >= VM_MAXCPU) + if (cpu < -1 || cpu >= vm_get_maxcpus(vm)) return (EINVAL); if (cpu == -1) dmask = vm_active_cpus(vm); else CPU_SETOF(cpu, &dmask); error = 0; while ((cpu = CPU_FFS(&dmask)) != 0) { cpu--; CPU_CLR(cpu, &dmask); vlapic = vm_lapic(vm, cpu); error = vlapic_trigger_lvt(vlapic, vector); if (error) break; } return (error); } int lapic_intr_msi(struct vm *vm, uint64_t addr, uint64_t msg) { int delmode, vec; uint32_t dest; bool phys; VM_CTR2(vm, "lapic MSI addr: %#lx msg: %#lx", addr, msg); if ((addr & MSI_X86_ADDR_MASK) != MSI_X86_ADDR_BASE) { VM_CTR1(vm, "lapic MSI invalid addr %#lx", addr); return (-1); } /* * Extract the x86-specific fields from the MSI addr/msg * params according to the Intel Arch spec, Vol3 Ch 10. * * The PCI specification does not support level triggered * MSI/MSI-X so ignore trigger level in 'msg'. * * The 'dest' is interpreted as a logical APIC ID if both * the Redirection Hint and Destination Mode are '1' and * physical otherwise. */ dest = (addr >> 12) & 0xff; phys = ((addr & (MSI_X86_ADDR_RH | MSI_X86_ADDR_LOG)) != (MSI_X86_ADDR_RH | MSI_X86_ADDR_LOG)); delmode = msg & APIC_DELMODE_MASK; vec = msg & 0xff; VM_CTR3(vm, "lapic MSI %s dest %#x, vec %d", phys ? "physical" : "logical", dest, vec); vlapic_deliver_intr(vm, LAPIC_TRIG_EDGE, dest, phys, delmode, vec); return (0); } static boolean_t x2apic_msr(u_int msr) { if (msr >= 0x800 && msr <= 0xBFF) return (TRUE); else return (FALSE); } static u_int x2apic_msr_to_regoff(u_int msr) { return ((msr - 0x800) << 4); } boolean_t lapic_msr(u_int msr) { if (x2apic_msr(msr) || (msr == MSR_APICBASE)) return (TRUE); else return (FALSE); } int lapic_rdmsr(struct vm *vm, int cpu, u_int msr, uint64_t *rval, bool *retu) { int error; u_int offset; struct vlapic *vlapic; vlapic = vm_lapic(vm, cpu); if (msr == MSR_APICBASE) { *rval = vlapic_get_apicbase(vlapic); error = 0; } else { offset = x2apic_msr_to_regoff(msr); error = vlapic_read(vlapic, 0, offset, rval, retu); } return (error); } int lapic_wrmsr(struct vm *vm, int cpu, u_int msr, uint64_t val, bool *retu) { int error; u_int offset; struct vlapic *vlapic; vlapic = vm_lapic(vm, cpu); if (msr == MSR_APICBASE) { error = vlapic_set_apicbase(vlapic, val); } else { offset = x2apic_msr_to_regoff(msr); error = vlapic_write(vlapic, 0, offset, val, retu); } return (error); } int lapic_mmio_write(void *vm, int cpu, uint64_t gpa, uint64_t wval, int size, void *arg) { int error; uint64_t off; struct vlapic *vlapic; off = gpa - DEFAULT_APIC_BASE; /* * Memory mapped local apic accesses must be 4 bytes wide and * aligned on a 16-byte boundary. */ if (size != 4 || off & 0xf) return (EINVAL); vlapic = vm_lapic(vm, cpu); error = vlapic_write(vlapic, 1, off, wval, arg); return (error); } int lapic_mmio_read(void *vm, int cpu, uint64_t gpa, uint64_t *rval, int size, void *arg) { int error; uint64_t off; struct vlapic *vlapic; off = gpa - DEFAULT_APIC_BASE; /* * Memory mapped local apic accesses should be aligned on a * 16-byte boundary. They are also suggested to be 4 bytes * wide, alas not all OSes follow suggestions. 
*/ off &= ~3; if (off & 0xf) return (EINVAL); vlapic = vm_lapic(vm, cpu); error = vlapic_read(vlapic, 1, off, rval, arg); return (error); } Index: user/ngie/bug-237403/sys/amd64/vmm/vmm_stat.c =================================================================== --- user/ngie/bug-237403/sys/amd64/vmm/vmm_stat.c (revision 346925) +++ user/ngie/bug-237403/sys/amd64/vmm/vmm_stat.c (revision 346926) @@ -1,172 +1,172 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include "vmm_util.h" #include "vmm_stat.h" /* * 'vst_num_elems' is the total number of addressable statistic elements * 'vst_num_types' is the number of unique statistic types * * It is always true that 'vst_num_elems' is greater than or equal to * 'vst_num_types'. This is because a stat type may represent more than * one element (for e.g. VMM_STAT_ARRAY). 
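 *
 * [Editor's illustration, not part of revision 346926: a scalar stat
 * declared with VMM_STAT() contributes one element and one type, while
 * an array stat such as
 *
 *	VMM_STAT_ARRAY(IPIS_SENT, VM_MAXCPU, "ipis sent to vcpu");
 *
 * contributes VM_MAXCPU elements but only one type, which is why
 * vst_num_elems is always greater than or equal to vst_num_types.]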
*/ static int vst_num_elems, vst_num_types; static struct vmm_stat_type *vsttab[MAX_VMM_STAT_ELEMS]; static MALLOC_DEFINE(M_VMM_STAT, "vmm stat", "vmm stat"); #define vst_size ((size_t)vst_num_elems * sizeof(uint64_t)) void vmm_stat_register(void *arg) { struct vmm_stat_type *vst = arg; /* We require all stats to identify themselves with a description */ if (vst->desc == NULL) return; if (vst->scope == VMM_STAT_SCOPE_INTEL && !vmm_is_intel()) return; if (vst->scope == VMM_STAT_SCOPE_AMD && !vmm_is_amd()) return; if (vst_num_elems + vst->nelems >= MAX_VMM_STAT_ELEMS) { printf("Cannot accommodate vmm stat type \"%s\"!\n", vst->desc); return; } vst->index = vst_num_elems; vst_num_elems += vst->nelems; vsttab[vst_num_types++] = vst; } int vmm_stat_copy(struct vm *vm, int vcpu, int *num_stats, uint64_t *buf) { struct vmm_stat_type *vst; uint64_t *stats; int i; - if (vcpu < 0 || vcpu >= VM_MAXCPU) + if (vcpu < 0 || vcpu >= vm_get_maxcpus(vm)) return (EINVAL); /* Let stats functions update their counters */ for (i = 0; i < vst_num_types; i++) { vst = vsttab[i]; if (vst->func != NULL) (*vst->func)(vm, vcpu, vst); } /* Copy over the stats */ stats = vcpu_stats(vm, vcpu); for (i = 0; i < vst_num_elems; i++) buf[i] = stats[i]; *num_stats = vst_num_elems; return (0); } void * vmm_stat_alloc(void) { return (malloc(vst_size, M_VMM_STAT, M_WAITOK)); } void vmm_stat_init(void *vp) { bzero(vp, vst_size); } void vmm_stat_free(void *vp) { free(vp, M_VMM_STAT); } int vmm_stat_desc_copy(int index, char *buf, int bufsize) { int i; struct vmm_stat_type *vst; for (i = 0; i < vst_num_types; i++) { vst = vsttab[i]; if (index >= vst->index && index < vst->index + vst->nelems) { if (vst->nelems > 1) { snprintf(buf, bufsize, "%s[%d]", vst->desc, index - vst->index); } else { strlcpy(buf, vst->desc, bufsize); } return (0); /* found it */ } } return (EINVAL); } /* global statistics */ VMM_STAT(VCPU_MIGRATIONS, "vcpu migration across host cpus"); VMM_STAT(VMEXIT_COUNT, "total number of vm exits"); VMM_STAT(VMEXIT_EXTINT, "vm exits due to external interrupt"); VMM_STAT(VMEXIT_HLT, "number of times hlt was intercepted"); VMM_STAT(VMEXIT_CR_ACCESS, "number of times %cr access was intercepted"); VMM_STAT(VMEXIT_RDMSR, "number of times rdmsr was intercepted"); VMM_STAT(VMEXIT_WRMSR, "number of times wrmsr was intercepted"); VMM_STAT(VMEXIT_MTRAP, "number of monitor trap exits"); VMM_STAT(VMEXIT_PAUSE, "number of times pause was intercepted"); VMM_STAT(VMEXIT_INTR_WINDOW, "vm exits due to interrupt window opening"); VMM_STAT(VMEXIT_NMI_WINDOW, "vm exits due to nmi window opening"); VMM_STAT(VMEXIT_INOUT, "number of times in/out was intercepted"); VMM_STAT(VMEXIT_CPUID, "number of times cpuid was intercepted"); VMM_STAT(VMEXIT_NESTED_FAULT, "vm exits due to nested page fault"); VMM_STAT(VMEXIT_INST_EMUL, "vm exits for instruction emulation"); VMM_STAT(VMEXIT_UNKNOWN, "number of vm exits for unknown reason"); VMM_STAT(VMEXIT_ASTPENDING, "number of times astpending at exit"); VMM_STAT(VMEXIT_REQIDLE, "number of times idle requested at exit"); VMM_STAT(VMEXIT_USERSPACE, "number of vm exits handled in userspace"); VMM_STAT(VMEXIT_RENDEZVOUS, "number of times rendezvous pending at exit"); VMM_STAT(VMEXIT_EXCEPTION, "number of vm exits due to exceptions"); Index: user/ngie/bug-237403/sys/arm/allwinner/a10/a10_padconf.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/a10/a10_padconf.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/a10/a10_padconf.c (revision 
346926) @@ -1,231 +1,231 @@ /*- * Copyright (c) 2016 Emmanuel Vadot * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #ifdef SOC_ALLWINNER_A10 const static struct allwinner_pins a10_pins[] = { {"PA0", 0, 0, {"gpio_in", "gpio_out", "emac", "spi1", "uart2", NULL, NULL, NULL}}, {"PA1", 0, 1, {"gpio_in", "gpio_out", "emac", "spi1", "uart2", NULL, NULL, NULL}}, {"PA2", 0, 2, {"gpio_in", "gpio_out", "emac", "spi1", "uart2", NULL, NULL, NULL}}, {"PA3", 0, 3, {"gpio_in", "gpio_out", "emac", "spi1", "uart2", NULL, NULL, NULL}}, {"PA4", 0, 4, {"gpio_in", "gpio_out", "emac", "spi1", NULL, NULL, NULL, NULL}}, {"PA5", 0, 5, {"gpio_in", "gpio_out", "emac", "spi3", NULL, NULL, NULL, NULL}}, {"PA6", 0, 6, {"gpio_in", "gpio_out", "emac", "spi3", NULL, NULL, NULL, NULL}}, {"PA7", 0, 7, {"gpio_in", "gpio_out", "emac", "spi3", NULL, NULL, NULL, NULL}}, {"PA8", 0, 8, {"gpio_in", "gpio_out", "emac", "spi3", NULL, NULL, NULL, NULL}}, {"PA9", 0, 9, {"gpio_in", "gpio_out", "emac", "spi3", NULL, NULL, NULL, NULL}}, {"PA10", 0, 10, {"gpio_in", "gpio_out", "emac", NULL, "uart1", NULL, NULL, NULL}}, {"PA11", 0, 11, {"gpio_in", "gpio_out", "emac", NULL, "uart1", NULL, NULL, NULL}}, {"PA12", 0, 12, {"gpio_in", "gpio_out", "emac", "uart6", "uart1", NULL, NULL, NULL}}, {"PA13", 0, 13, {"gpio_in", "gpio_out", "emac", "uart6", "uart1", NULL, NULL, NULL}}, {"PA14", 0, 14, {"gpio_in", "gpio_out", "emac", "uart7", "uart1", NULL, NULL, NULL}}, {"PA15", 0, 15, {"gpio_in", "gpio_out", "emac", "uart7", "uart1", NULL, NULL, NULL}}, {"PA16", 0, 16, {"gpio_in", "gpio_out", NULL, "can", "uart1", NULL, NULL, NULL}}, {"PA17", 0, 17, {"gpio_in", "gpio_out", NULL, "can", "uart1", NULL, NULL, NULL}}, {"PB0", 1, 0, {"gpio_in", "gpio_out", "i2c0", NULL, NULL, NULL, NULL, NULL}}, {"PB1", 1, 1, {"gpio_in", "gpio_out", "i2c0", NULL, NULL, NULL, NULL, NULL}}, {"PB2", 1, 2, {"gpio_in", "gpio_out", "pwm", NULL, NULL, NULL, NULL, NULL}}, {"PB3", 1, 3, {"gpio_in", "gpio_out", "ir0", NULL, NULL, NULL, NULL, NULL}}, {"PB4", 1, 4, {"gpio_in", "gpio_out", "ir0", NULL, NULL, NULL, NULL, NULL}}, {"PB5", 1, 5, {"gpio_in", "gpio_out", "i2s", "ac97", NULL, NULL, NULL, NULL}}, {"PB6", 1, 6, {"gpio_in", "gpio_out", "i2s", "ac97", NULL, NULL, NULL, NULL}}, {"PB7", 1, 7, 
{"gpio_in", "gpio_out", "i2s", "ac97", NULL, NULL, NULL, NULL}}, {"PB8", 1, 8, {"gpio_in", "gpio_out", "i2s", "ac97", NULL, NULL, NULL, NULL}}, {"PB9", 1, 9, {"gpio_in", "gpio_out", "i2s", NULL, NULL, NULL, NULL, NULL}}, {"PB10", 1, 10, {"gpio_in", "gpio_out", "i2s", NULL, NULL, NULL, NULL, NULL}}, {"PB11", 1, 11, {"gpio_in", "gpio_out", "i2s", NULL, NULL, NULL, NULL, NULL}}, {"PB12", 1, 12, {"gpio_in", "gpio_out", "i2s", "ac97", NULL, NULL, NULL, NULL}}, {"PB13", 1, 13, {"gpio_in", "gpio_out", "spi2", NULL, NULL, NULL, NULL, NULL}}, {"PB14", 1, 14, {"gpio_in", "gpio_out", "spi2", "jtag", NULL, NULL, NULL, NULL}}, {"PB15", 1, 15, {"gpio_in", "gpio_out", "spi2", "jtag", NULL, NULL, NULL, NULL}}, {"PB16", 1, 16, {"gpio_in", "gpio_out", "spi2", "jtag", NULL, NULL, NULL, NULL}}, {"PB17", 1, 17, {"gpio_in", "gpio_out", "spi2", "jtag", NULL, NULL, NULL, NULL}}, {"PB18", 1, 18, {"gpio_in", "gpio_out", "i2c1", NULL, NULL, NULL, NULL, NULL}}, {"PB19", 1, 19, {"gpio_in", "gpio_out", "i2c1", NULL, NULL, NULL, NULL, NULL}}, - {"PB20", 1, 20, {"gpio_in", "gpio_out", "i2c1", NULL, NULL, NULL, NULL, NULL}}, - {"PB21", 1, 21, {"gpio_in", "gpio_out", "i2c1", NULL, NULL, NULL, NULL, NULL}}, + {"PB20", 1, 20, {"gpio_in", "gpio_out", "i2c2", NULL, NULL, NULL, NULL, NULL}}, + {"PB21", 1, 21, {"gpio_in", "gpio_out", "i2c2", NULL, NULL, NULL, NULL, NULL}}, {"PB22", 1, 22, {"gpio_in", "gpio_out", "uart0", "ir1", NULL, NULL, NULL, NULL}}, {"PB23", 1, 23, {"gpio_in", "gpio_out", "uart0", "ir1", NULL, NULL, NULL, NULL}}, {"PC0", 2, 0, {"gpio_in", "gpio_out", "nand", "spi0", NULL, NULL, NULL, NULL}}, {"PC1", 2, 1, {"gpio_in", "gpio_out", "nand", "spi0", NULL, NULL, NULL, NULL}}, {"PC2", 2, 2, {"gpio_in", "gpio_out", "nand", "spi0", NULL, NULL, NULL, NULL}}, {"PC3", 2, 3, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC4", 2, 4, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC5", 2, 5, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC6", 2, 6, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC7", 2, 7, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC8", 2, 8, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC9", 2, 9, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC10", 2, 10, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC11", 2, 11, {"gpio_in", "gpio_out", "nand", "mmc2", NULL, NULL, NULL, NULL}}, {"PC12", 2, 12, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC13", 2, 13, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC14", 2, 14, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC15", 2, 15, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC16", 2, 16, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC17", 2, 17, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC18", 2, 18, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PC19", 2, 19, {"gpio_in", "gpio_out", "nand", "spi2", NULL, NULL, NULL, NULL}}, {"PC20", 2, 20, {"gpio_in", "gpio_out", "nand", "spi2", NULL, NULL, NULL, NULL}}, {"PC21", 2, 21, {"gpio_in", "gpio_out", "nand", "spi2", NULL, NULL, NULL, NULL}}, {"PC22", 2, 22, {"gpio_in", "gpio_out", "nand", "spi2", NULL, NULL, NULL, NULL}}, {"PC23", 2, 23, {"gpio_in", "gpio_out", "spi0", NULL, NULL, NULL, NULL, NULL}}, {"PC24", 2, 24, {"gpio_in", "gpio_out", "nand", NULL, NULL, NULL, NULL, NULL}}, {"PD0", 3, 0, {"gpio_in", 
"gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD1", 3, 1, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD2", 3, 2, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD3", 3, 3, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD4", 3, 4, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD5", 3, 5, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD6", 3, 6, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD7", 3, 7, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD8", 3, 8, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD9", 3, 9, {"gpio_in", "gpio_out", "lcd0", "lvds0", NULL, NULL, NULL, NULL}}, {"PD10", 3, 10, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD11", 3, 11, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD12", 3, 12, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD13", 3, 13, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD14", 3, 14, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD15", 3, 15, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD16", 3, 16, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD17", 3, 17, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD18", 3, 18, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD19", 3, 19, {"gpio_in", "gpio_out", "lcd0", "lvds1", NULL, NULL, NULL, NULL}}, {"PD20", 3, 20, {"gpio_in", "gpio_out", "lcd0", "csi1", NULL, NULL, NULL, NULL}}, {"PD21", 3, 21, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD22", 3, 22, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD23", 3, 23, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD24", 3, 24, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD25", 3, 25, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD26", 3, 26, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PD27", 3, 27, {"gpio_in", "gpio_out", "lcd0", "sim", NULL, NULL, NULL, NULL}}, {"PE0", 4, 0, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE1", 4, 1, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE2", 4, 2, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE3", 4, 3, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE4", 4, 4, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE5", 4, 5, {"gpio_in", "gpio_out", "ts0", "csi0", "sim", NULL, NULL, NULL}}, {"PE6", 4, 6, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE7", 4, 7, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE8", 4, 8, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE9", 4, 9, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE10", 4, 10, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PE11", 4, 11, {"gpio_in", "gpio_out", "ts0", "csi0", NULL, NULL, NULL, NULL}}, {"PF0", 5, 0, {"gpio_in", "gpio_out", "mmc0", NULL, "jtag", NULL, NULL, NULL}}, {"PF1", 5, 1, {"gpio_in", "gpio_out", "mmc0", NULL, "jtag", NULL, NULL, NULL}}, {"PF2", 5, 2, {"gpio_in", "gpio_out", "mmc0", NULL, "uart0", NULL, NULL, NULL}}, {"PF3", 5, 3, {"gpio_in", "gpio_out", "mmc0", NULL, "jtag", NULL, NULL, NULL}}, {"PF4", 5, 4, 
{"gpio_in", "gpio_out", "mmc0", NULL, "jtag", NULL, NULL, NULL}}, {"PF5", 5, 5, {"gpio_in", "gpio_out", "mmc0", NULL, "jtag", NULL, NULL, NULL}}, {"PG0", 6, 0, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", NULL, NULL, NULL}}, {"PG1", 6, 1, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", NULL, NULL, NULL}}, {"PG2", 6, 2, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", NULL, NULL, NULL}}, {"PG3", 6, 3, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", NULL, NULL, NULL}}, {"PG4", 6, 4, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", "csi0", NULL, NULL}}, {"PG5", 6, 5, {"gpio_in", "gpio_out", "ts1", "csi1", "mmc1", "csi0", NULL, NULL}}, {"PG6", 6, 6, {"gpio_in", "gpio_out", "ts1", "csi1", "uart3", "csi0", NULL, NULL}}, {"PG7", 6, 7, {"gpio_in", "gpio_out", "ts1", "csi1", "uart3", "csi0", NULL, NULL}}, {"PG8", 6, 8, {"gpio_in", "gpio_out", "ts1", "csi1", "uart3", "csi0", NULL, NULL}}, {"PG9", 6, 9, {"gpio_in", "gpio_out", "ts1", "csi1", "uart3", "csi0", NULL, NULL}}, {"PG10", 6, 10, {"gpio_in", "gpio_out", "ts1", "csi1", "uart4", "csi0", NULL, NULL}}, {"PG11", 6, 11, {"gpio_in", "gpio_out", "ts1", "csi1", "uart4", "csi0", NULL, NULL}}, {"PH0", 7, 0, {"gpio_in", "gpio_out", "lcd1", "pata", "uart3", NULL, "eint0", "csi1"}, 6, 0}, {"PH1", 7, 1, {"gpio_in", "gpio_out", "lcd1", "pata", "uart3", NULL, "eint1", "csi1"}, 6, 1}, {"PH2", 7, 2, {"gpio_in", "gpio_out", "lcd1", "pata", "uart3", NULL, "eint2", "csi1"}, 6, 2}, {"PH3", 7, 3, {"gpio_in", "gpio_out", "lcd1", "pata", "uart3", NULL, "eint3", "csi1"}, 6, 3}, {"PH4", 7, 4, {"gpio_in", "gpio_out", "lcd1", "pata", "uart4", NULL, "eint4", "csi1"}, 6, 4}, {"PH5", 7, 5, {"gpio_in", "gpio_out", "lcd1", "pata", "uart4", NULL, "eint5", "csi1"}, 6, 5}, {"PH6", 7, 6, {"gpio_in", "gpio_out", "lcd1", "pata", "uart5", "ms", "eint6", "csi1"}, 6, 6}, {"PH7", 7, 7, {"gpio_in", "gpio_out", "lcd1", "pata", "uart5", "ms", "eint7", "csi1"}, 6, 7}, {"PH8", 7, 8, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "ms", "eint8", "csi1"}, 6, 8}, {"PH9", 7, 9, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "ms", "eint9", "csi1"}, 6, 9}, {"PH10", 7, 10, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "ms", "eint10", "csi1"}, 6, 10}, {"PH11", 7, 11, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "ms", "eint11", "csi1"}, 6, 11}, {"PH12", 7, 12, {"gpio_in", "gpio_out", "lcd1", "pata", "ps2", NULL, "eint12", "csi1"}, 6, 12}, {"PH13", 7, 13, {"gpio_in", "gpio_out", "lcd1", "pata", "ps2", "sim", "eint13", "csi1"}, 6, 13}, {"PH14", 7, 14, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "sim", "eint14", "csi1"}, 6, 14}, {"PH15", 7, 15, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "sim", "eint15", "csi1"}, 6, 15}, {"PH16", 7, 16, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", NULL, "eint16", "csi1"}, 6, 16}, {"PH17", 7, 17, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "sim", "eint17", "csi1"}, 6, 17}, {"PH18", 7, 18, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "sim", "eint18", "csi1"}, 6, 18}, {"PH19", 7, 19, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "sim", "eint19", "csi1"}, 6, 19}, {"PH20", 7, 20, {"gpio_in", "gpio_out", "lcd1", "pata", "can", NULL, "eint20", "csi1"}, 6, 20}, {"PH21", 7, 21, {"gpio_in", "gpio_out", "lcd1", "pata", "can", NULL, "eint21", "csi1"}, 6, 21}, {"PH22", 7, 22, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PH23", 7, 23, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PH24", 7, 24, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PH25", 7, 25, 
{"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PH26", 7, 26, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PH27", 7, 27, {"gpio_in", "gpio_out", "lcd1", "pata", "keypad", "mmc1", NULL, "csi1"}}, {"PI0", 8, 0, {"gpio_in", "gpio_out", NULL, NULL, NULL, NULL, NULL, NULL}}, {"PI1", 8, 1, {"gpio_in", "gpio_out", NULL, NULL, NULL, NULL, NULL, NULL}}, {"PI2", 8, 2, {"gpio_in", "gpio_out", NULL, NULL, NULL, NULL, NULL, NULL}}, {"PI3", 8, 3, {"gpio_in", "gpio_out", "pwm", NULL, NULL, NULL, NULL, NULL}}, {"PI4", 8, 4, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI5", 8, 5, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI6", 8, 6, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI7", 8, 7, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI8", 8, 8, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI9", 8, 9, {"gpio_in", "gpio_out", "mmc3", NULL, NULL, NULL, NULL, NULL}}, {"PI10", 8, 10, {"gpio_in", "gpio_out", "spi0", "uart5", NULL, NULL, "eint22", NULL}, 6, 22}, {"PI11", 8, 11, {"gpio_in", "gpio_out", "spi0", "uart5", NULL, NULL, "eint23", NULL}, 6, 23}, {"PI12", 8, 12, {"gpio_in", "gpio_out", "spi0", "uart6", NULL, NULL, "eint24", NULL}, 6, 24}, {"PI13", 8, 13, {"gpio_in", "gpio_out", "spi0", "uart6", NULL, NULL, "eint25", NULL}, 6, 25}, {"PI14", 8, 14, {"gpio_in", "gpio_out", "spi0", "ps2", "timer4", NULL, "eint26", NULL}, 6, 26}, {"PI15", 8, 15, {"gpio_in", "gpio_out", "spi1", "ps2", "timer5", NULL, "eint27", NULL}, 6, 27}, {"PI16", 8, 16, {"gpio_in", "gpio_out", "spi1", "uart2", NULL, NULL, "eint28", NULL}, 6, 28}, {"PI17", 8, 17, {"gpio_in", "gpio_out", "spi1", "uart2", NULL, NULL, "eint29", NULL}, 6, 29}, {"PI18", 8, 18, {"gpio_in", "gpio_out", "spi1", "uart2", NULL, NULL, "eint30", NULL}, 6, 30}, {"PI19", 8, 19, {"gpio_in", "gpio_out", "spi1", "uart2", NULL, NULL, "eint31", NULL}, 6, 31}, {"PI20", 8, 20, {"gpio_in", "gpio_out", "ps2", "uart7", "hdmi", NULL, NULL, NULL}}, {"PI21", 8, 21, {"gpio_in", "gpio_out", "ps2", "uart7", "hdmi", NULL, NULL, NULL}}, }; const struct allwinner_padconf a10_padconf = { .npins = sizeof(a10_pins) / sizeof(struct allwinner_pins), .pins = a10_pins, }; #endif /* SOC_ALLWINNER_A10 */ Index: user/ngie/bug-237403/sys/arm/allwinner/aw_rsb.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/aw_rsb.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/aw_rsb.c (revision 346926) @@ -1,498 +1,500 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * Allwinner RSB (Reduced Serial Bus) and P2WI (Push-Pull Two Wire Interface) */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include "iicbus_if.h" #define RSB_CTRL 0x00 #define START_TRANS (1 << 7) #define GLOBAL_INT_ENB (1 << 1) #define SOFT_RESET (1 << 0) #define RSB_CCR 0x04 #define RSB_INTE 0x08 #define RSB_INTS 0x0c #define INT_TRANS_ERR_ID(x) (((x) >> 8) & 0xf) #define INT_LOAD_BSY (1 << 2) #define INT_TRANS_ERR (1 << 1) #define INT_TRANS_OVER (1 << 0) #define INT_MASK (INT_LOAD_BSY|INT_TRANS_ERR|INT_TRANS_OVER) #define RSB_DADDR0 0x10 #define RSB_DADDR1 0x14 #define RSB_DLEN 0x18 #define DLEN_READ (1 << 4) #define RSB_DATA0 0x1c #define RSB_DATA1 0x20 #define RSB_CMD 0x2c #define CMD_SRTA 0xe8 #define CMD_RD8 0x8b #define CMD_RD16 0x9c #define CMD_RD32 0xa6 #define CMD_WR8 0x4e #define CMD_WR16 0x59 #define CMD_WR32 0x63 #define RSB_DAR 0x30 #define DAR_RTA (0xff << 16) #define DAR_RTA_SHIFT 16 #define DAR_DA (0xffff << 0) #define DAR_DA_SHIFT 0 #define RSB_MAXLEN 8 #define RSB_RESET_RETRY 100 #define RSB_I2C_TIMEOUT hz #define RSB_ADDR_PMIC_PRIMARY 0x3a3 #define RSB_ADDR_PMIC_SECONDARY 0x745 #define RSB_ADDR_PERIPH_IC 0xe89 #define A31_P2WI 1 #define A23_RSB 2 static struct ofw_compat_data compat_data[] = { { "allwinner,sun6i-a31-p2wi", A31_P2WI }, { "allwinner,sun8i-a23-rsb", A23_RSB }, { NULL, 0 } }; static struct resource_spec rsb_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { -1, 0 } }; /* * Device address to Run-time address mappings. * * Run-time address (RTA) is an 8-bit value used to address the device during * a read or write transaction. The following are valid RTAs: * 0x17 0x2d 0x3a 0x4e 0x59 0x63 0x74 0x8b 0x9c 0xa6 0xb1 0xc5 0xd2 0xe8 0xff * * Allwinner uses RTA 0x2d for the primary PMIC, 0x3a for the secondary PMIC, * and 0x4e for the peripheral IC (where applicable). */ static const struct { uint16_t addr; uint8_t rta; } rsb_rtamap[] = { { .addr = RSB_ADDR_PMIC_PRIMARY, .rta = 0x2d }, { .addr = RSB_ADDR_PMIC_SECONDARY, .rta = 0x3a }, { .addr = RSB_ADDR_PERIPH_IC, .rta = 0x4e }, { .addr = 0, .rta = 0 } }; struct rsb_softc { struct resource *res; struct mtx mtx; clk_t clk; hwreset_t rst; device_t iicbus; int busy; uint32_t status; uint16_t cur_addr; int type; struct iic_msg *msg; }; #define RSB_LOCK(sc) mtx_lock(&(sc)->mtx) #define RSB_UNLOCK(sc) mtx_unlock(&(sc)->mtx) #define RSB_ASSERT_LOCKED(sc) mtx_assert(&(sc)->mtx, MA_OWNED) #define RSB_READ(sc, reg) bus_read_4((sc)->res, (reg)) #define RSB_WRITE(sc, reg, val) bus_write_4((sc)->res, (reg), (val)) static phandle_t rsb_get_node(device_t bus, device_t dev) { return (ofw_bus_get_node(bus)); } static int rsb_reset(device_t dev, u_char speed, u_char addr, u_char *oldaddr) { struct rsb_softc *sc; int retry; sc = device_get_softc(dev); RSB_LOCK(sc); /* Write soft-reset bit and wait for it to self-clear. 
*/ RSB_WRITE(sc, RSB_CTRL, SOFT_RESET); for (retry = RSB_RESET_RETRY; retry > 0; retry--) if ((RSB_READ(sc, RSB_CTRL) & SOFT_RESET) == 0) break; RSB_UNLOCK(sc); if (retry == 0) { device_printf(dev, "soft reset timeout\n"); return (ETIMEDOUT); } return (IIC_ENOADDR); } static uint32_t rsb_encode(const uint8_t *buf, u_int len, u_int off) { uint32_t val; u_int n; val = 0; for (n = off; n < MIN(len, 4 + off); n++) val |= ((uint32_t)buf[n] << ((n - off) * NBBY)); return val; } static void rsb_decode(const uint32_t val, uint8_t *buf, u_int len, u_int off) { u_int n; for (n = off; n < MIN(len, 4 + off); n++) buf[n] = (val >> ((n - off) * NBBY)) & 0xff; } static int rsb_start(device_t dev) { struct rsb_softc *sc; int error, retry; sc = device_get_softc(dev); RSB_ASSERT_LOCKED(sc); /* Start the transfer */ RSB_WRITE(sc, RSB_CTRL, GLOBAL_INT_ENB | START_TRANS); /* Wait for transfer to complete */ error = ETIMEDOUT; for (retry = RSB_I2C_TIMEOUT; retry > 0; retry--) { sc->status |= RSB_READ(sc, RSB_INTS); if ((sc->status & INT_TRANS_OVER) != 0) { error = 0; break; } DELAY((1000 * hz) / RSB_I2C_TIMEOUT); } if (error == 0 && (sc->status & INT_TRANS_OVER) == 0) { device_printf(dev, "transfer error, status 0x%08x\n", sc->status); error = EIO; } return (error); } static int rsb_set_rta(device_t dev, uint16_t addr) { struct rsb_softc *sc; uint8_t rta; int i; sc = device_get_softc(dev); RSB_ASSERT_LOCKED(sc); /* Lookup run-time address for given device address */ for (rta = 0, i = 0; rsb_rtamap[i].rta != 0; i++) if (rsb_rtamap[i].addr == addr) { rta = rsb_rtamap[i].rta; break; } if (rta == 0) { device_printf(dev, "RTA not known for address %#x\n", addr); return (ENXIO); } /* Set run-time address */ RSB_WRITE(sc, RSB_INTS, RSB_READ(sc, RSB_INTS)); RSB_WRITE(sc, RSB_DAR, (addr << DAR_DA_SHIFT) | (rta << DAR_RTA_SHIFT)); RSB_WRITE(sc, RSB_CMD, CMD_SRTA); return (rsb_start(dev)); } static int rsb_transfer(device_t dev, struct iic_msg *msgs, uint32_t nmsgs) { struct rsb_softc *sc; uint32_t daddr[2], data[2], dlen; uint16_t device_addr; uint8_t cmd; int error; sc = device_get_softc(dev); /* * P2WI and RSB are not really I2C or SMBus controllers, so there are * some restrictions imposed by the driver. * * Transfers must contain exactly two messages. The first is always * a write, containing a single data byte offset. Data will either * be read from or written to the corresponding data byte in the * second message. The slave address in both messages must be the * same. */ if (nmsgs != 2 || (msgs[0].flags & IIC_M_RD) == IIC_M_RD || (msgs[0].slave >> 1) != (msgs[1].slave >> 1) || msgs[0].len != 1 || msgs[1].len > RSB_MAXLEN) return (EINVAL); /* The RSB controller can read or write 1, 2, or 4 bytes at a time. 
*/ if (sc->type == A23_RSB) { if ((msgs[1].flags & IIC_M_RD) != 0) { switch (msgs[1].len) { case 1: cmd = CMD_RD8; break; case 2: cmd = CMD_RD16; break; case 4: cmd = CMD_RD32; break; default: return (EINVAL); } } else { switch (msgs[1].len) { case 1: cmd = CMD_WR8; break; case 2: cmd = CMD_WR16; break; case 4: cmd = CMD_WR32; break; default: return (EINVAL); } } } RSB_LOCK(sc); while (sc->busy) mtx_sleep(sc, &sc->mtx, 0, "i2cbuswait", 0); sc->busy = 1; sc->status = 0; /* Select current run-time address if necessary */ if (sc->type == A23_RSB) { device_addr = msgs[0].slave >> 1; if (sc->cur_addr != device_addr) { error = rsb_set_rta(dev, device_addr); if (error != 0) goto done; sc->cur_addr = device_addr; sc->status = 0; } } /* Clear interrupt status */ RSB_WRITE(sc, RSB_INTS, RSB_READ(sc, RSB_INTS)); /* Program data access address registers */ daddr[0] = rsb_encode(msgs[0].buf, msgs[0].len, 0); RSB_WRITE(sc, RSB_DADDR0, daddr[0]); /* Write data */ if ((msgs[1].flags & IIC_M_RD) == 0) { data[0] = rsb_encode(msgs[1].buf, msgs[1].len, 0); RSB_WRITE(sc, RSB_DATA0, data[0]); } /* Set command type for RSB */ if (sc->type == A23_RSB) RSB_WRITE(sc, RSB_CMD, cmd); /* Program data length register and transfer direction */ dlen = msgs[0].len - 1; if ((msgs[1].flags & IIC_M_RD) == IIC_M_RD) dlen |= DLEN_READ; RSB_WRITE(sc, RSB_DLEN, dlen); /* Start transfer */ error = rsb_start(dev); if (error != 0) goto done; /* Read data */ if ((msgs[1].flags & IIC_M_RD) == IIC_M_RD) { data[0] = RSB_READ(sc, RSB_DATA0); rsb_decode(data[0], msgs[1].buf, msgs[1].len, 0); } done: sc->msg = NULL; sc->busy = 0; wakeup(sc); RSB_UNLOCK(sc); return (error); } static int rsb_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); switch (ofw_bus_search_compatible(dev, compat_data)->ocd_data) { case A23_RSB: device_set_desc(dev, "Allwinner RSB"); break; case A31_P2WI: device_set_desc(dev, "Allwinner P2WI"); break; default: return (ENXIO); } return (BUS_PROBE_DEFAULT); } static int rsb_attach(device_t dev) { struct rsb_softc *sc; int error; sc = device_get_softc(dev); mtx_init(&sc->mtx, device_get_nameunit(dev), "rsb", MTX_DEF); sc->type = ofw_bus_search_compatible(dev, compat_data)->ocd_data; if (clk_get_by_ofw_index(dev, 0, 0, &sc->clk) == 0) { error = clk_enable(sc->clk); if (error != 0) { device_printf(dev, "cannot enable clock\n"); goto fail; } } if (hwreset_get_by_ofw_idx(dev, 0, 0, &sc->rst) == 0) { error = hwreset_deassert(sc->rst); if (error != 0) { device_printf(dev, "cannot de-assert reset\n"); goto fail; } } if (bus_alloc_resources(dev, rsb_spec, &sc->res) != 0) { device_printf(dev, "cannot allocate resources for device\n"); error = ENXIO; goto fail; } sc->iicbus = device_add_child(dev, "iicbus", -1); if (sc->iicbus == NULL) { device_printf(dev, "cannot add iicbus child device\n"); error = ENXIO; goto fail; } bus_generic_attach(dev); return (0); fail: bus_release_resources(dev, rsb_spec, &sc->res); if (sc->rst != NULL) hwreset_release(sc->rst); if (sc->clk != NULL) clk_release(sc->clk); mtx_destroy(&sc->mtx); return (error); } static device_method_t rsb_methods[] = { /* Device interface */ DEVMETHOD(device_probe, rsb_probe), DEVMETHOD(device_attach, rsb_attach), /* Bus interface */ DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), DEVMETHOD(bus_alloc_resource, bus_generic_alloc_resource), DEVMETHOD(bus_release_resource, bus_generic_release_resource), DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), 
DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), DEVMETHOD(bus_adjust_resource, bus_generic_adjust_resource), DEVMETHOD(bus_set_resource, bus_generic_rl_set_resource), DEVMETHOD(bus_get_resource, bus_generic_rl_get_resource), /* OFW methods */ DEVMETHOD(ofw_bus_get_node, rsb_get_node), /* iicbus interface */ DEVMETHOD(iicbus_callback, iicbus_null_callback), DEVMETHOD(iicbus_reset, rsb_reset), DEVMETHOD(iicbus_transfer, rsb_transfer), DEVMETHOD_END }; static driver_t rsb_driver = { "iichb", rsb_methods, sizeof(struct rsb_softc), }; static devclass_t rsb_devclass; EARLY_DRIVER_MODULE(iicbus, rsb, iicbus_driver, iicbus_devclass, 0, 0, BUS_PASS_RESOURCE + BUS_PASS_ORDER_MIDDLE); EARLY_DRIVER_MODULE(rsb, simplebus, rsb_driver, rsb_devclass, 0, 0, BUS_PASS_RESOURCE + BUS_PASS_ORDER_MIDDLE); MODULE_VERSION(rsb, 1); +MODULE_DEPEND(rsb, iicbus, 1, 1, 1); +SIMPLEBUS_PNP_INFO(compat_data); Index: user/ngie/bug-237403/sys/arm/allwinner/aw_rtc.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/aw_rtc.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/aw_rtc.c (revision 346926) @@ -1,364 +1,367 @@ /*- * Copyright (c) 2019 Emmanuel Vadot * Copyright (c) 2016 Vladimir Belian * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "clock_if.h" #define LOSC_CTRL_REG 0x00 #define A10_RTC_DATE_REG 0x04 #define A10_RTC_TIME_REG 0x08 #define A31_LOSC_AUTO_SWT_STA 0x04 #define A31_RTC_DATE_REG 0x10 #define A31_RTC_TIME_REG 0x14 #define TIME_MASK 0x001f3f3f #define LOSC_OSC_SRC (1 << 0) #define LOSC_GSM (1 << 3) #define LOSC_AUTO_SW_EN (1 << 14) #define LOSC_MAGIC 0x16aa0000 #define LOSC_BUSY_MASK 0x00000380 #define IS_SUN7I (sc->conf->is_a20 == true) #define YEAR_MIN (IS_SUN7I ? 1970 : 2010) #define YEAR_MAX (IS_SUN7I ? 2100 : 2073) #define YEAR_OFFSET (IS_SUN7I ? 1900 : 2010) #define YEAR_MASK (IS_SUN7I ? 0xff : 0x3f) #define LEAP_BIT (IS_SUN7I ? 
24 : 22) #define GET_SEC_VALUE(x) ((x) & 0x0000003f) #define GET_MIN_VALUE(x) (((x) & 0x00003f00) >> 8) #define GET_HOUR_VALUE(x) (((x) & 0x001f0000) >> 16) #define GET_DAY_VALUE(x) ((x) & 0x0000001f) #define GET_MON_VALUE(x) (((x) & 0x00000f00) >> 8) #define GET_YEAR_VALUE(x) (((x) >> 16) & YEAR_MASK) #define SET_DAY_VALUE(x) GET_DAY_VALUE(x) #define SET_MON_VALUE(x) (((x) & 0x0000000f) << 8) #define SET_YEAR_VALUE(x) (((x) & YEAR_MASK) << 16) #define SET_LEAP_VALUE(x) (((x) & 0x00000001) << LEAP_BIT) #define SET_SEC_VALUE(x) GET_SEC_VALUE(x) #define SET_MIN_VALUE(x) (((x) & 0x0000003f) << 8) #define SET_HOUR_VALUE(x) (((x) & 0x0000001f) << 16) #define HALF_OF_SEC_NS 500000000 #define RTC_RES_US 1000000 #define RTC_TIMEOUT 70 #define RTC_READ(sc, reg) bus_read_4((sc)->res, (reg)) #define RTC_WRITE(sc, reg, val) bus_write_4((sc)->res, (reg), (val)) #define IS_LEAP_YEAR(y) (((y) % 400) == 0 || (((y) % 100) != 0 && ((y) % 4) == 0)) struct aw_rtc_conf { uint64_t iosc_freq; bus_size_t rtc_date; bus_size_t rtc_time; bus_size_t rtc_losc_sta; bool is_a20; }; struct aw_rtc_conf a10_conf = { .rtc_date = A10_RTC_DATE_REG, .rtc_time = A10_RTC_TIME_REG, .rtc_losc_sta = LOSC_CTRL_REG, }; struct aw_rtc_conf a20_conf = { .rtc_date = A10_RTC_DATE_REG, .rtc_time = A10_RTC_TIME_REG, .rtc_losc_sta = LOSC_CTRL_REG, .is_a20 = true, }; struct aw_rtc_conf a31_conf = { .iosc_freq = 650000, /* between 600 and 700 Khz */ .rtc_date = A31_RTC_DATE_REG, .rtc_time = A31_RTC_TIME_REG, .rtc_losc_sta = A31_LOSC_AUTO_SWT_STA, }; struct aw_rtc_conf h3_conf = { .iosc_freq = 16000000, .rtc_date = A31_RTC_DATE_REG, .rtc_time = A31_RTC_TIME_REG, .rtc_losc_sta = A31_LOSC_AUTO_SWT_STA, }; static struct ofw_compat_data compat_data[] = { { "allwinner,sun4i-a10-rtc", (uintptr_t) &a10_conf }, { "allwinner,sun7i-a20-rtc", (uintptr_t) &a20_conf }, { "allwinner,sun6i-a31-rtc", (uintptr_t) &a31_conf }, { "allwinner,sun8i-h3-rtc", (uintptr_t) &h3_conf }, + { "allwinner,sun50i-h5-rtc", (uintptr_t) &h3_conf }, { NULL, 0 } }; struct aw_rtc_softc { struct resource *res; struct aw_rtc_conf *conf; int type; }; static struct clk_fixed_def aw_rtc_osc32k = { .clkdef.id = 0, .freq = 32768, }; static struct clk_fixed_def aw_rtc_iosc = { .clkdef.id = 2, }; static void aw_rtc_install_clocks(struct aw_rtc_softc *sc, device_t dev); static int aw_rtc_probe(device_t dev); static int aw_rtc_attach(device_t dev); static int aw_rtc_detach(device_t dev); static int aw_rtc_gettime(device_t dev, struct timespec *ts); static int aw_rtc_settime(device_t dev, struct timespec *ts); static device_method_t aw_rtc_methods[] = { DEVMETHOD(device_probe, aw_rtc_probe), DEVMETHOD(device_attach, aw_rtc_attach), DEVMETHOD(device_detach, aw_rtc_detach), DEVMETHOD(clock_gettime, aw_rtc_gettime), DEVMETHOD(clock_settime, aw_rtc_settime), DEVMETHOD_END }; static driver_t aw_rtc_driver = { "rtc", aw_rtc_methods, sizeof(struct aw_rtc_softc), }; static devclass_t aw_rtc_devclass; EARLY_DRIVER_MODULE(aw_rtc, simplebus, aw_rtc_driver, aw_rtc_devclass, 0, 0, BUS_PASS_BUS + BUS_PASS_ORDER_MIDDLE); +MODULE_VERSION(aw_rtc, 1); +SIMPLEBUS_PNP_INFO(compat_data); static int aw_rtc_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_search_compatible(dev, compat_data)->ocd_data) return (ENXIO); device_set_desc(dev, "Allwinner RTC"); return (BUS_PROBE_DEFAULT); } static int aw_rtc_attach(device_t dev) { struct aw_rtc_softc *sc = device_get_softc(dev); uint32_t val; int rid = 0; sc->res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (!sc->res) 
{ device_printf(dev, "could not allocate resources\n"); return (ENXIO); } sc->conf = (struct aw_rtc_conf *)ofw_bus_search_compatible(dev, compat_data)->ocd_data; val = RTC_READ(sc, LOSC_CTRL_REG); val |= LOSC_AUTO_SW_EN; val |= LOSC_MAGIC | LOSC_GSM | LOSC_OSC_SRC; RTC_WRITE(sc, LOSC_CTRL_REG, val); DELAY(100); if (bootverbose) { val = RTC_READ(sc, sc->conf->rtc_losc_sta); if ((val & LOSC_OSC_SRC) == 0) device_printf(dev, "Using internal oscillator\n"); else device_printf(dev, "Using external oscillator\n"); } aw_rtc_install_clocks(sc, dev); clock_register(dev, RTC_RES_US); return (0); } static int aw_rtc_detach(device_t dev) { /* can't support detach, since there's no clock_unregister function */ return (EBUSY); } static void aw_rtc_install_clocks(struct aw_rtc_softc *sc, device_t dev) { struct clkdom *clkdom; const char **clknames; phandle_t node; int nclocks; node = ofw_bus_get_node(dev); nclocks = ofw_bus_string_list_to_array(node, "clock-output-names", &clknames); /* No clocks to export */ if (nclocks <= 0) return; if (nclocks != 3) { device_printf(dev, "Having only %d clocks instead of 3, aborting\n", nclocks); return; } clkdom = clkdom_create(dev); aw_rtc_osc32k.clkdef.name = clknames[0]; if (clknode_fixed_register(clkdom, &aw_rtc_osc32k) != 0) device_printf(dev, "Cannot register osc32k clock\n"); aw_rtc_iosc.clkdef.name = clknames[2]; aw_rtc_iosc.freq = sc->conf->iosc_freq; if (clknode_fixed_register(clkdom, &aw_rtc_iosc) != 0) device_printf(dev, "Cannot register iosc clock\n"); clkdom_finit(clkdom); if (bootverbose) clkdom_dump(clkdom); } static int aw_rtc_gettime(device_t dev, struct timespec *ts) { struct aw_rtc_softc *sc = device_get_softc(dev); struct clocktime ct; uint32_t rdate, rtime; rdate = RTC_READ(sc, sc->conf->rtc_date); rtime = RTC_READ(sc, sc->conf->rtc_time); if ((rtime & TIME_MASK) == 0) rdate = RTC_READ(sc, sc->conf->rtc_date); ct.sec = GET_SEC_VALUE(rtime); ct.min = GET_MIN_VALUE(rtime); ct.hour = GET_HOUR_VALUE(rtime); ct.day = GET_DAY_VALUE(rdate); ct.mon = GET_MON_VALUE(rdate); ct.year = GET_YEAR_VALUE(rdate) + YEAR_OFFSET; ct.dow = -1; /* RTC resolution is 1 sec */ ct.nsec = 0; return (clock_ct_to_ts(&ct, ts)); } static int aw_rtc_settime(device_t dev, struct timespec *ts) { struct aw_rtc_softc *sc = device_get_softc(dev); struct clocktime ct; uint32_t clk, rdate, rtime; /* RTC resolution is 1 sec */ if (ts->tv_nsec >= HALF_OF_SEC_NS) ts->tv_sec++; ts->tv_nsec = 0; clock_ts_to_ct(ts, &ct); if ((ct.year < YEAR_MIN) || (ct.year > YEAR_MAX)) { device_printf(dev, "could not set time, year out of range\n"); return (EINVAL); } for (clk = 0; RTC_READ(sc, LOSC_CTRL_REG) & LOSC_BUSY_MASK; clk++) { if (clk > RTC_TIMEOUT) { device_printf(dev, "could not set time, RTC busy\n"); return (EINVAL); } DELAY(1); } /* reset time register to avoid unexpected date increment */ RTC_WRITE(sc, sc->conf->rtc_time, 0); rdate = SET_DAY_VALUE(ct.day) | SET_MON_VALUE(ct.mon) | SET_YEAR_VALUE(ct.year - YEAR_OFFSET) | SET_LEAP_VALUE(IS_LEAP_YEAR(ct.year)); rtime = SET_SEC_VALUE(ct.sec) | SET_MIN_VALUE(ct.min) | SET_HOUR_VALUE(ct.hour); for (clk = 0; RTC_READ(sc, LOSC_CTRL_REG) & LOSC_BUSY_MASK; clk++) { if (clk > RTC_TIMEOUT) { device_printf(dev, "could not set date, RTC busy\n"); return (EINVAL); } DELAY(1); } RTC_WRITE(sc, sc->conf->rtc_date, rdate); for (clk = 0; RTC_READ(sc, LOSC_CTRL_REG) & LOSC_BUSY_MASK; clk++) { if (clk > RTC_TIMEOUT) { device_printf(dev, "could not set time, RTC busy\n"); return (EINVAL); } DELAY(1); } RTC_WRITE(sc, sc->conf->rtc_time, rtime); DELAY(RTC_TIMEOUT); 
return (0); } Index: user/ngie/bug-237403/sys/arm/allwinner/aw_sid.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/aw_sid.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/aw_sid.c (revision 346926) @@ -1,416 +1,417 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * Allwinner secure ID controller */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "nvmem_if.h" /* * Starting at least from sun8iw6 (A83T), the EFUSE starts at 0x200. * There are 3 registers in the low area to read/write protected EFUSE.
*/ #define SID_PRCTL 0x40 #define SID_PRCTL_OFFSET_MASK 0xff #define SID_PRCTL_OFFSET(n) (((n) & SID_PRCTL_OFFSET_MASK) << 16) #define SID_PRCTL_LOCK (0xac << 8) #define SID_PRCTL_READ (0x01 << 1) #define SID_PRCTL_WRITE (0x01 << 0) #define SID_PRKEY 0x50 #define SID_RDKEY 0x60 #define EFUSE_OFFSET 0x200 #define EFUSE_NAME_SIZE 32 #define EFUSE_DESC_SIZE 64 struct aw_sid_efuse { char name[EFUSE_NAME_SIZE]; char desc[EFUSE_DESC_SIZE]; bus_size_t base; bus_size_t offset; uint32_t size; enum aw_sid_fuse_id id; bool public; }; static struct aw_sid_efuse a10_efuses[] = { { .name = "rootkey", .desc = "Root Key or ChipID", .offset = 0x0, .size = 16, .id = AW_SID_FUSE_ROOTKEY, .public = true, }, }; static struct aw_sid_efuse a64_efuses[] = { { .name = "rootkey", .desc = "Root Key or ChipID", .base = EFUSE_OFFSET, .offset = 0x00, .size = 16, .id = AW_SID_FUSE_ROOTKEY, .public = true, }, { .name = "ths-calib", .desc = "Thermal Sensor Calibration Data", .base = EFUSE_OFFSET, .offset = 0x34, .size = 6, .id = AW_SID_FUSE_THSSENSOR, .public = true, }, }; static struct aw_sid_efuse a83t_efuses[] = { { .name = "rootkey", .desc = "Root Key or ChipID", .base = EFUSE_OFFSET, .offset = 0x00, .size = 16, .id = AW_SID_FUSE_ROOTKEY, .public = true, }, { .name = "ths-calib", .desc = "Thermal Sensor Calibration Data", .base = EFUSE_OFFSET, .offset = 0x34, .size = 8, .id = AW_SID_FUSE_THSSENSOR, .public = true, }, }; static struct aw_sid_efuse h3_efuses[] = { { .name = "rootkey", .desc = "Root Key or ChipID", .base = EFUSE_OFFSET, .offset = 0x00, .size = 16, .id = AW_SID_FUSE_ROOTKEY, .public = true, }, { .name = "ths-calib", .desc = "Thermal Sensor Calibration Data", .base = EFUSE_OFFSET, .offset = 0x34, .size = 2, .id = AW_SID_FUSE_THSSENSOR, .public = false, }, }; static struct aw_sid_efuse h5_efuses[] = { { .name = "rootkey", .desc = "Root Key or ChipID", .base = EFUSE_OFFSET, .offset = 0x00, .size = 16, .id = AW_SID_FUSE_ROOTKEY, .public = true, }, { .name = "ths-calib", .desc = "Thermal Sensor Calibration Data", .base = EFUSE_OFFSET, .offset = 0x34, .size = 4, .id = AW_SID_FUSE_THSSENSOR, .public = true, }, }; struct aw_sid_conf { struct aw_sid_efuse *efuses; size_t nfuses; }; static const struct aw_sid_conf a10_conf = { .efuses = a10_efuses, .nfuses = nitems(a10_efuses), }; static const struct aw_sid_conf a20_conf = { .efuses = a10_efuses, .nfuses = nitems(a10_efuses), }; static const struct aw_sid_conf a64_conf = { .efuses = a64_efuses, .nfuses = nitems(a64_efuses), }; static const struct aw_sid_conf a83t_conf = { .efuses = a83t_efuses, .nfuses = nitems(a83t_efuses), }; static const struct aw_sid_conf h3_conf = { .efuses = h3_efuses, .nfuses = nitems(h3_efuses), }; static const struct aw_sid_conf h5_conf = { .efuses = h5_efuses, .nfuses = nitems(h5_efuses), }; static struct ofw_compat_data compat_data[] = { { "allwinner,sun4i-a10-sid", (uintptr_t)&a10_conf}, { "allwinner,sun7i-a20-sid", (uintptr_t)&a20_conf}, { "allwinner,sun50i-a64-sid", (uintptr_t)&a64_conf}, { "allwinner,sun8i-a83t-sid", (uintptr_t)&a83t_conf}, { "allwinner,sun8i-h3-sid", (uintptr_t)&h3_conf}, { "allwinner,sun50i-h5-sid", (uintptr_t)&h5_conf}, { NULL, 0 } }; struct aw_sid_softc { device_t sid_dev; struct resource *res; struct aw_sid_conf *sid_conf; struct mtx prctl_mtx; }; static struct aw_sid_softc *aw_sid_sc; static struct resource_spec aw_sid_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { -1, 0 } }; #define RD1(sc, reg) bus_read_1((sc)->res, (reg)) #define RD4(sc, reg) bus_read_4((sc)->res, (reg)) #define WR4(sc, reg, val) 
bus_write_4((sc)->res, (reg), (val)) static int aw_sid_sysctl(SYSCTL_HANDLER_ARGS); static int aw_sid_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "Allwinner Secure ID Controller"); return (BUS_PROBE_DEFAULT); } static int aw_sid_attach(device_t dev) { struct aw_sid_softc *sc; phandle_t node; int i; node = ofw_bus_get_node(dev); sc = device_get_softc(dev); sc->sid_dev = dev; if (bus_alloc_resources(dev, aw_sid_spec, &sc->res) != 0) { device_printf(dev, "cannot allocate resources for device\n"); return (ENXIO); } mtx_init(&sc->prctl_mtx, device_get_nameunit(dev), NULL, MTX_DEF); sc->sid_conf = (struct aw_sid_conf *)ofw_bus_search_compatible(dev, compat_data)->ocd_data; aw_sid_sc = sc; /* Register ourself so device can resolve who we are */ OF_device_register_xref(OF_xref_from_node(node), dev); for (i = 0; i < sc->sid_conf->nfuses ;i++) {\ SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, sc->sid_conf->efuses[i].name, CTLTYPE_STRING | CTLFLAG_RD, dev, sc->sid_conf->efuses[i].id, aw_sid_sysctl, "A", sc->sid_conf->efuses[i].desc); } return (0); } int aw_sid_get_fuse(enum aw_sid_fuse_id id, uint8_t *out, uint32_t *size) { struct aw_sid_softc *sc; uint32_t val; int i, j; sc = aw_sid_sc; if (sc == NULL) return (ENXIO); for (i = 0; i < sc->sid_conf->nfuses; i++) if (id == sc->sid_conf->efuses[i].id) break; if (i == sc->sid_conf->nfuses) return (ENOENT); if (*size != sc->sid_conf->efuses[i].size) { *size = sc->sid_conf->efuses[i].size; return (ENOMEM); } if (out == NULL) return (ENOMEM); if (sc->sid_conf->efuses[i].public == false) mtx_lock(&sc->prctl_mtx); for (j = 0; j < sc->sid_conf->efuses[i].size; j += 4) { if (sc->sid_conf->efuses[i].public == false) { val = SID_PRCTL_OFFSET(sc->sid_conf->efuses[i].offset + j) | SID_PRCTL_LOCK | SID_PRCTL_READ; WR4(sc, SID_PRCTL, val); /* Read bit will be cleared once read has concluded */ while (RD4(sc, SID_PRCTL) & SID_PRCTL_READ) continue; val = RD4(sc, SID_RDKEY); } else val = RD4(sc, sc->sid_conf->efuses[i].base + sc->sid_conf->efuses[i].offset + j); out[j] = val & 0xFF; if (j + 1 < *size) out[j + 1] = (val & 0xFF00) >> 8; if (j + 2 < *size) out[j + 2] = (val & 0xFF0000) >> 16; if (j + 3 < *size) out[j + 3] = (val & 0xFF000000) >> 24; } if (sc->sid_conf->efuses[i].public == false) mtx_unlock(&sc->prctl_mtx); return (0); } static int aw_sid_read(device_t dev, uint32_t offset, uint32_t size, uint8_t *buffer) { struct aw_sid_softc *sc; enum aw_sid_fuse_id fuse_id = 0; int i; sc = device_get_softc(dev); for (i = 0; i < sc->sid_conf->nfuses; i++) if (offset == (sc->sid_conf->efuses[i].base + sc->sid_conf->efuses[i].offset)) { fuse_id = sc->sid_conf->efuses[i].id; break; } if (fuse_id == 0) return (ENOENT); return (aw_sid_get_fuse(fuse_id, buffer, &size)); } static int aw_sid_sysctl(SYSCTL_HANDLER_ARGS) { struct aw_sid_softc *sc; device_t dev = arg1; enum aw_sid_fuse_id fuse = arg2; uint8_t data[32]; char out[128]; uint32_t size; int ret, i; sc = device_get_softc(dev); /* Get the size of the efuse data */ size = 0; aw_sid_get_fuse(fuse, NULL, &size); /* We now have the real size */ ret = aw_sid_get_fuse(fuse, data, &size); if (ret != 0) { device_printf(dev, "Cannot get fuse id %d: %d\n", fuse, ret); return (ENOENT); } for (i = 0; i < size; i++) snprintf(out + (i * 2), sizeof(out) - (i * 2), "%.2x", data[i]); return sysctl_handle_string(oidp, out, sizeof(out), req); } static 
device_method_t aw_sid_methods[] = { /* Device interface */ DEVMETHOD(device_probe, aw_sid_probe), DEVMETHOD(device_attach, aw_sid_attach), /* NVMEM interface */ DEVMETHOD(nvmem_read, aw_sid_read), DEVMETHOD_END }; static driver_t aw_sid_driver = { "aw_sid", aw_sid_methods, sizeof(struct aw_sid_softc), }; static devclass_t aw_sid_devclass; EARLY_DRIVER_MODULE(aw_sid, simplebus, aw_sid_driver, aw_sid_devclass, 0, 0, BUS_PASS_RESOURCE + BUS_PASS_ORDER_FIRST); MODULE_VERSION(aw_sid, 1); +SIMPLEBUS_PNP_INFO(compat_data); Index: user/ngie/bug-237403/sys/arm/allwinner/aw_syscon.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/aw_syscon.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/aw_syscon.c (revision 346926) @@ -1,85 +1,86 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2018 Kyle Evans * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ /* * Allwinner syscon driver */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include static struct ofw_compat_data compat_data[] = { {"allwinner,sun50i-a64-system-controller", 1}, {"allwinner,sun50i-a64-system-control", 1}, {"allwinner,sun8i-a83t-system-controller", 1}, {"allwinner,sun8i-h3-system-controller", 1}, {"allwinner,sun8i-h3-system-control", 1}, + {"allwinner,sun50i-h5-system-control", 1}, {NULL, 0} }; static int aw_syscon_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "Allwinner syscon"); return (BUS_PROBE_DEFAULT); } static device_method_t aw_syscon_methods[] = { DEVMETHOD(device_probe, aw_syscon_probe), DEVMETHOD_END }; DEFINE_CLASS_1(aw_syscon, aw_syscon_driver, aw_syscon_methods, sizeof(struct syscon_generic_softc), syscon_generic_driver); static devclass_t aw_syscon_devclass; /* aw_syscon needs to attach prior to if_awg */ EARLY_DRIVER_MODULE(aw_syscon, simplebus, aw_syscon_driver, aw_syscon_devclass, 0, 0, BUS_PASS_SUPPORTDEV + BUS_PASS_ORDER_MIDDLE); MODULE_VERSION(aw_syscon, 1); Index: user/ngie/bug-237403/sys/arm/allwinner/aw_thermal.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/aw_thermal.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/aw_thermal.c (revision 346926) @@ -1,730 +1,732 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ /* * Allwinner thermal sensor controller */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "cpufreq_if.h" #include "nvmem_if.h" #define THS_CTRL0 0x00 #define THS_CTRL1 0x04 #define ADC_CALI_EN (1 << 17) #define THS_CTRL2 0x40 #define SENSOR_ACQ1_SHIFT 16 #define SENSOR2_EN (1 << 2) #define SENSOR1_EN (1 << 1) #define SENSOR0_EN (1 << 0) #define THS_INTC 0x44 #define THS_THERMAL_PER_SHIFT 12 #define THS_INTS 0x48 #define THS2_DATA_IRQ_STS (1 << 10) #define THS1_DATA_IRQ_STS (1 << 9) #define THS0_DATA_IRQ_STS (1 << 8) #define SHUT_INT2_STS (1 << 6) #define SHUT_INT1_STS (1 << 5) #define SHUT_INT0_STS (1 << 4) #define ALARM_INT2_STS (1 << 2) #define ALARM_INT1_STS (1 << 1) #define ALARM_INT0_STS (1 << 0) #define THS_ALARM0_CTRL 0x50 #define ALARM_T_HOT_MASK 0xfff #define ALARM_T_HOT_SHIFT 16 #define ALARM_T_HYST_MASK 0xfff #define ALARM_T_HYST_SHIFT 0 #define THS_SHUTDOWN0_CTRL 0x60 #define SHUT_T_HOT_MASK 0xfff #define SHUT_T_HOT_SHIFT 16 #define THS_FILTER 0x70 #define THS_CALIB0 0x74 #define THS_CALIB1 0x78 #define THS_DATA0 0x80 #define THS_DATA1 0x84 #define THS_DATA2 0x88 #define DATA_MASK 0xfff #define A83T_CLK_RATE 24000000 #define A83T_ADC_ACQUIRE_TIME 23 /* 24Mhz/(23 + 1) = 1us */ #define A83T_THERMAL_PER 1 /* 4096 * (1 + 1) / 24Mhz = 341 us */ #define A83T_FILTER 0x5 /* Filter enabled, avg of 4 */ #define A83T_TEMP_BASE 2719000 #define A83T_TEMP_MUL 1000 #define A83T_TEMP_DIV 14186 #define A64_CLK_RATE 4000000 #define A64_ADC_ACQUIRE_TIME 400 /* 4Mhz/(400 + 1) = 100 us */ #define A64_THERMAL_PER 24 /* 4096 * (24 + 1) / 4Mhz = 25.6 ms */ #define A64_FILTER 0x6 /* Filter enabled, avg of 8 */ #define A64_TEMP_BASE 2170000 #define A64_TEMP_MUL 1000 #define A64_TEMP_DIV 8560 #define H3_CLK_RATE 4000000 #define H3_ADC_ACQUIRE_TIME 0x3f #define H3_THERMAL_PER 401 #define H3_FILTER 0x6 /* Filter enabled, avg of 8 */ #define H3_TEMP_BASE 217 #define H3_TEMP_MUL 1000 #define H3_TEMP_DIV 8253 #define H3_TEMP_MINUS 1794000 #define H3_INIT_ALARM 90 /* degC */ #define H3_INIT_SHUT 105 /* degC */ #define H5_CLK_RATE 24000000 #define H5_ADC_ACQUIRE_TIME 479 /* 24Mhz/479 = 20us */ #define H5_THERMAL_PER 58 /* 4096 * (58 + 1) / 24Mhz = 10ms */ #define H5_FILTER 0x6 /* Filter enabled, avg of 8 */ #define H5_TEMP_BASE 233832448 #define H5_TEMP_MUL 124885 #define H5_TEMP_DIV 20 #define H5_TEMP_BASE_CPU 271581184 #define H5_TEMP_MUL_CPU 152253 #define H5_TEMP_BASE_GPU 289406976 #define H5_TEMP_MUL_GPU 166724 #define H5_INIT_CPU_ALARM 80 /* degC */ #define H5_INIT_CPU_SHUT 96 /* degC */ #define H5_INIT_GPU_ALARM 84 /* degC */ #define H5_INIT_GPU_SHUT 100 /* degC */ #define TEMP_C_TO_K 273 #define SENSOR_ENABLE_ALL (SENSOR0_EN|SENSOR1_EN|SENSOR2_EN) #define SHUT_INT_ALL (SHUT_INT0_STS|SHUT_INT1_STS|SHUT_INT2_STS) #define ALARM_INT_ALL (ALARM_INT0_STS) #define MAX_SENSORS 3 #define MAX_CF_LEVELS 64 #define THROTTLE_ENABLE_DEFAULT 1 /* Enable thermal throttling */ static int aw_thermal_throttle_enable = THROTTLE_ENABLE_DEFAULT; TUNABLE_INT("hw.aw_thermal.throttle_enable", &aw_thermal_throttle_enable); struct aw_thermal_sensor { const char *name; const char *desc; int init_alarm; int init_shut; }; struct aw_thermal_config { struct aw_thermal_sensor sensors[MAX_SENSORS]; int nsensors; uint64_t clk_rate; uint32_t adc_acquire_time; int adc_cali_en; uint32_t filter; uint32_t thermal_per; int (*to_temp)(uint32_t, int); uint32_t (*to_reg)(int, int); 
int temp_base; int temp_mul; int temp_div; int calib0, calib1; uint32_t calib0_mask, calib1_mask; }; static int a83t_to_temp(uint32_t val, int sensor) { return ((A83T_TEMP_BASE - (val * A83T_TEMP_MUL)) / A83T_TEMP_DIV); } static const struct aw_thermal_config a83t_config = { .nsensors = 3, .sensors = { [0] = { .name = "cluster0", .desc = "CPU cluster 0 temperature", }, [1] = { .name = "cluster1", .desc = "CPU cluster 1 temperature", }, [2] = { .name = "gpu", .desc = "GPU temperature", }, }, .clk_rate = A83T_CLK_RATE, .adc_acquire_time = A83T_ADC_ACQUIRE_TIME, .adc_cali_en = 1, .filter = A83T_FILTER, .thermal_per = A83T_THERMAL_PER, .to_temp = a83t_to_temp, .calib0_mask = 0xffffffff, .calib1_mask = 0xffff, }; static int a64_to_temp(uint32_t val, int sensor) { return ((A64_TEMP_BASE - (val * A64_TEMP_MUL)) / A64_TEMP_DIV); } static const struct aw_thermal_config a64_config = { .nsensors = 3, .sensors = { [0] = { .name = "cpu", .desc = "CPU temperature", }, [1] = { .name = "gpu1", .desc = "GPU temperature 1", }, [2] = { .name = "gpu2", .desc = "GPU temperature 2", }, }, .clk_rate = A64_CLK_RATE, .adc_acquire_time = A64_ADC_ACQUIRE_TIME, .adc_cali_en = 1, .filter = A64_FILTER, .thermal_per = A64_THERMAL_PER, .to_temp = a64_to_temp, .calib0_mask = 0xffffffff, .calib1_mask = 0xffff, }; static int h3_to_temp(uint32_t val, int sensor) { return (H3_TEMP_BASE - ((val * H3_TEMP_MUL) / H3_TEMP_DIV)); } static uint32_t h3_to_reg(int val, int sensor) { return ((H3_TEMP_MINUS - (val * H3_TEMP_DIV)) / H3_TEMP_MUL); } static const struct aw_thermal_config h3_config = { .nsensors = 1, .sensors = { [0] = { .name = "cpu", .desc = "CPU temperature", .init_alarm = H3_INIT_ALARM, .init_shut = H3_INIT_SHUT, }, }, .clk_rate = H3_CLK_RATE, .adc_acquire_time = H3_ADC_ACQUIRE_TIME, .adc_cali_en = 1, .filter = H3_FILTER, .thermal_per = H3_THERMAL_PER, .to_temp = h3_to_temp, .to_reg = h3_to_reg, .calib0_mask = 0xffff, }; static int h5_to_temp(uint32_t val, int sensor) { int tmp; /* Temp is lower than 70 degrees */ if (val > 0x500) { tmp = H5_TEMP_BASE - (val * H5_TEMP_MUL); tmp >>= H5_TEMP_DIV; return (tmp); } if (sensor == 0) tmp = H5_TEMP_BASE_CPU - (val * H5_TEMP_MUL_CPU); else if (sensor == 1) tmp = H5_TEMP_BASE_GPU - (val * H5_TEMP_MUL_GPU); else { printf("Unknown sensor %d\n", sensor); return (val); } tmp >>= H5_TEMP_DIV; return (tmp); } static uint32_t h5_to_reg(int val, int sensor) { int tmp; if (val < 70) { tmp = H5_TEMP_BASE - (val << H5_TEMP_DIV); tmp /= H5_TEMP_MUL; } else { if (sensor == 0) { tmp = H5_TEMP_BASE_CPU - (val << H5_TEMP_DIV); tmp /= H5_TEMP_MUL_CPU; } else if (sensor == 1) { tmp = H5_TEMP_BASE_GPU - (val << H5_TEMP_DIV); tmp /= H5_TEMP_MUL_GPU; } else { printf("Unknown sensor %d\n", sensor); return (val); } } return ((uint32_t)tmp); } static const struct aw_thermal_config h5_config = { .nsensors = 2, .sensors = { [0] = { .name = "cpu", .desc = "CPU temperature", .init_alarm = H5_INIT_CPU_ALARM, .init_shut = H5_INIT_CPU_SHUT, }, [1] = { .name = "gpu", .desc = "GPU temperature", .init_alarm = H5_INIT_GPU_ALARM, .init_shut = H5_INIT_GPU_SHUT, }, }, .clk_rate = H5_CLK_RATE, .adc_acquire_time = H5_ADC_ACQUIRE_TIME, .filter = H5_FILTER, .thermal_per = H5_THERMAL_PER, .to_temp = h5_to_temp, .to_reg = h5_to_reg, .calib0_mask = 0xffffffff, }; static struct ofw_compat_data compat_data[] = { { "allwinner,sun8i-a83t-ths", (uintptr_t)&a83t_config }, { "allwinner,sun8i-h3-ths", (uintptr_t)&h3_config }, { "allwinner,sun50i-a64-ths", (uintptr_t)&a64_config }, { "allwinner,sun50i-h5-ths", (uintptr_t)&h5_config 
}, { NULL, (uintptr_t)NULL } }; #define THS_CONF(d) \ (void *)ofw_bus_search_compatible((d), compat_data)->ocd_data struct aw_thermal_softc { device_t dev; struct resource *res[2]; struct aw_thermal_config *conf; struct task cf_task; int throttle; int min_freq; struct cf_level levels[MAX_CF_LEVELS]; eventhandler_tag cf_pre_tag; clk_t clk_apb; clk_t clk_ths; }; static struct resource_spec aw_thermal_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { SYS_RES_IRQ, 0, RF_ACTIVE }, { -1, 0 } }; #define RD4(sc, reg) bus_read_4((sc)->res[0], (reg)) #define WR4(sc, reg, val) bus_write_4((sc)->res[0], (reg), (val)) static int aw_thermal_init(struct aw_thermal_softc *sc) { phandle_t node; uint32_t calib[2]; int error; node = ofw_bus_get_node(sc->dev); if (nvmem_get_cell_len(node, "ths-calib") > sizeof(calib)) { device_printf(sc->dev, "ths-calib nvmem cell is too large\n"); return (ENXIO); } error = nvmem_read_cell_by_name(node, "ths-calib", (void *)&calib, nvmem_get_cell_len(node, "ths-calib")); /* Read calibration settings from EFUSE */ if (error != 0) { device_printf(sc->dev, "Cannot read THS efuse\n"); return (error); } calib[0] &= sc->conf->calib0_mask; calib[1] &= sc->conf->calib1_mask; /* Write calibration settings to thermal controller */ if (calib[0] != 0) WR4(sc, THS_CALIB0, calib[0]); if (calib[1] != 0) WR4(sc, THS_CALIB1, calib[1]); /* Configure ADC acquire time (CLK_IN/(N+1)) and enable sensors */ WR4(sc, THS_CTRL1, ADC_CALI_EN); WR4(sc, THS_CTRL0, sc->conf->adc_acquire_time); WR4(sc, THS_CTRL2, sc->conf->adc_acquire_time << SENSOR_ACQ1_SHIFT); /* Set thermal period */ WR4(sc, THS_INTC, sc->conf->thermal_per << THS_THERMAL_PER_SHIFT); /* Enable average filter */ WR4(sc, THS_FILTER, sc->conf->filter); /* Enable interrupts */ WR4(sc, THS_INTS, RD4(sc, THS_INTS)); WR4(sc, THS_INTC, RD4(sc, THS_INTC) | SHUT_INT_ALL | ALARM_INT_ALL); /* Enable sensors */ WR4(sc, THS_CTRL2, RD4(sc, THS_CTRL2) | SENSOR_ENABLE_ALL); return (0); } static int aw_thermal_gettemp(struct aw_thermal_softc *sc, int sensor) { uint32_t val; val = RD4(sc, THS_DATA0 + (sensor * 4)); return (sc->conf->to_temp(val, sensor)); } static int aw_thermal_getshut(struct aw_thermal_softc *sc, int sensor) { uint32_t val; val = RD4(sc, THS_SHUTDOWN0_CTRL + (sensor * 4)); val = (val >> SHUT_T_HOT_SHIFT) & SHUT_T_HOT_MASK; return (sc->conf->to_temp(val, sensor)); } static void aw_thermal_setshut(struct aw_thermal_softc *sc, int sensor, int temp) { uint32_t val; val = RD4(sc, THS_SHUTDOWN0_CTRL + (sensor * 4)); val &= ~(SHUT_T_HOT_MASK << SHUT_T_HOT_SHIFT); val |= (sc->conf->to_reg(temp, sensor) << SHUT_T_HOT_SHIFT); WR4(sc, THS_SHUTDOWN0_CTRL + (sensor * 4), val); } static int aw_thermal_gethyst(struct aw_thermal_softc *sc, int sensor) { uint32_t val; val = RD4(sc, THS_ALARM0_CTRL + (sensor * 4)); val = (val >> ALARM_T_HYST_SHIFT) & ALARM_T_HYST_MASK; return (sc->conf->to_temp(val, sensor)); } static int aw_thermal_getalarm(struct aw_thermal_softc *sc, int sensor) { uint32_t val; val = RD4(sc, THS_ALARM0_CTRL + (sensor * 4)); val = (val >> ALARM_T_HOT_SHIFT) & ALARM_T_HOT_MASK; return (sc->conf->to_temp(val, sensor)); } static void aw_thermal_setalarm(struct aw_thermal_softc *sc, int sensor, int temp) { uint32_t val; val = RD4(sc, THS_ALARM0_CTRL + (sensor * 4)); val &= ~(ALARM_T_HOT_MASK << ALARM_T_HOT_SHIFT); val |= (sc->conf->to_reg(temp, sensor) << ALARM_T_HOT_SHIFT); WR4(sc, THS_ALARM0_CTRL + (sensor * 4), val); } static int aw_thermal_sysctl(SYSCTL_HANDLER_ARGS) { struct aw_thermal_softc *sc; int sensor, val; sc = arg1; sensor = arg2; 
val = aw_thermal_gettemp(sc, sensor) + TEMP_C_TO_K; return sysctl_handle_opaque(oidp, &val, sizeof(val), req); } static void aw_thermal_throttle(struct aw_thermal_softc *sc, int enable) { device_t cf_dev; int count, error; if (enable == sc->throttle) return; if (enable != 0) { /* Set the lowest available frequency */ cf_dev = devclass_get_device(devclass_find("cpufreq"), 0); if (cf_dev == NULL) return; count = MAX_CF_LEVELS; error = CPUFREQ_LEVELS(cf_dev, sc->levels, &count); if (error != 0 || count == 0) return; sc->min_freq = sc->levels[count - 1].total_set.freq; error = CPUFREQ_SET(cf_dev, &sc->levels[count - 1], CPUFREQ_PRIO_USER); if (error != 0) return; } sc->throttle = enable; } static void aw_thermal_cf_task(void *arg, int pending) { struct aw_thermal_softc *sc; sc = arg; aw_thermal_throttle(sc, 1); } static void aw_thermal_cf_pre_change(void *arg, const struct cf_level *level, int *status) { struct aw_thermal_softc *sc; int temp_cur, temp_alarm; sc = arg; if (aw_thermal_throttle_enable == 0 || sc->throttle == 0 || level->total_set.freq == sc->min_freq) return; temp_cur = aw_thermal_gettemp(sc, 0); temp_alarm = aw_thermal_getalarm(sc, 0); if (temp_cur < temp_alarm) aw_thermal_throttle(sc, 0); else *status = ENXIO; } static void aw_thermal_intr(void *arg) { struct aw_thermal_softc *sc; device_t dev; uint32_t ints; dev = arg; sc = device_get_softc(dev); ints = RD4(sc, THS_INTS); WR4(sc, THS_INTS, ints); if ((ints & SHUT_INT_ALL) != 0) { device_printf(dev, "WARNING - current temperature exceeds safe limits\n"); shutdown_nice(RB_POWEROFF); } if ((ints & ALARM_INT_ALL) != 0) taskqueue_enqueue(taskqueue_thread, &sc->cf_task); } static int aw_thermal_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (THS_CONF(dev) == NULL) return (ENXIO); device_set_desc(dev, "Allwinner Thermal Sensor Controller"); return (BUS_PROBE_DEFAULT); } static int aw_thermal_attach(device_t dev) { struct aw_thermal_softc *sc; hwreset_t rst; int i, error; void *ih; sc = device_get_softc(dev); sc->dev = dev; rst = NULL; ih = NULL; sc->conf = THS_CONF(dev); TASK_INIT(&sc->cf_task, 0, aw_thermal_cf_task, sc); if (bus_alloc_resources(dev, aw_thermal_spec, sc->res) != 0) { device_printf(dev, "cannot allocate resources for device\n"); return (ENXIO); } if (clk_get_by_ofw_name(dev, 0, "apb", &sc->clk_apb) == 0) { error = clk_enable(sc->clk_apb); if (error != 0) { device_printf(dev, "cannot enable apb clock\n"); goto fail; } } if (clk_get_by_ofw_name(dev, 0, "ths", &sc->clk_ths) == 0) { error = clk_set_freq(sc->clk_ths, sc->conf->clk_rate, 0); if (error != 0) { device_printf(dev, "cannot set ths clock rate\n"); goto fail; } error = clk_enable(sc->clk_ths); if (error != 0) { device_printf(dev, "cannot enable ths clock\n"); goto fail; } } if (hwreset_get_by_ofw_idx(dev, 0, 0, &rst) == 0) { error = hwreset_deassert(rst); if (error != 0) { device_printf(dev, "cannot de-assert reset\n"); goto fail; } } error = bus_setup_intr(dev, sc->res[1], INTR_TYPE_MISC | INTR_MPSAFE, NULL, aw_thermal_intr, dev, &ih); if (error != 0) { device_printf(dev, "cannot setup interrupt handler\n"); goto fail; } for (i = 0; i < sc->conf->nsensors; i++) { if (sc->conf->sensors[i].init_alarm > 0) aw_thermal_setalarm(sc, i, sc->conf->sensors[i].init_alarm); if (sc->conf->sensors[i].init_shut > 0) aw_thermal_setshut(sc, i, sc->conf->sensors[i].init_shut); } if (aw_thermal_init(sc) != 0) goto fail; for (i = 0; i < sc->conf->nsensors; i++) SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), 
OID_AUTO, sc->conf->sensors[i].name, CTLTYPE_INT | CTLFLAG_RD, sc, i, aw_thermal_sysctl, "IK0", sc->conf->sensors[i].desc); if (bootverbose) for (i = 0; i < sc->conf->nsensors; i++) { device_printf(dev, "%s: alarm %dC hyst %dC shut %dC\n", sc->conf->sensors[i].name, aw_thermal_getalarm(sc, i), aw_thermal_gethyst(sc, i), aw_thermal_getshut(sc, i)); } sc->cf_pre_tag = EVENTHANDLER_REGISTER(cpufreq_pre_change, aw_thermal_cf_pre_change, sc, EVENTHANDLER_PRI_FIRST); return (0); fail: if (ih != NULL) bus_teardown_intr(dev, sc->res[1], ih); if (rst != NULL) hwreset_release(rst); if (sc->clk_apb != NULL) clk_release(sc->clk_apb); if (sc->clk_ths != NULL) clk_release(sc->clk_ths); bus_release_resources(dev, aw_thermal_spec, sc->res); return (ENXIO); } static device_method_t aw_thermal_methods[] = { /* Device interface */ DEVMETHOD(device_probe, aw_thermal_probe), DEVMETHOD(device_attach, aw_thermal_attach), DEVMETHOD_END }; static driver_t aw_thermal_driver = { "aw_thermal", aw_thermal_methods, sizeof(struct aw_thermal_softc), }; static devclass_t aw_thermal_devclass; DRIVER_MODULE(aw_thermal, simplebus, aw_thermal_driver, aw_thermal_devclass, 0, 0); MODULE_VERSION(aw_thermal, 1); +MODULE_DEPEND(aw_thermal, aw_sid, 1, 1, 1); +SIMPLEBUS_PNP_INFO(compat_data); Index: user/ngie/bug-237403/sys/arm/allwinner/axp81x.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/axp81x.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/axp81x.c (revision 346926) @@ -1,1621 +1,1622 @@ /*- * Copyright (c) 2018 Emmanuel Vadot * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ /* * X-Powers AXP803/813/818 PMU for Allwinner SoCs */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "gpio_if.h" #include "iicbus_if.h" #include "regdev_if.h" MALLOC_DEFINE(M_AXP8XX_REG, "AXP8xx regulator", "AXP8xx power regulator"); #define AXP_POWERSRC 0x00 #define AXP_POWERSRC_ACIN (1 << 7) #define AXP_POWERSRC_VBUS (1 << 5) #define AXP_POWERSRC_VBAT (1 << 3) #define AXP_POWERSRC_CHARING (1 << 2) /* Charging Direction */ #define AXP_POWERSRC_SHORTED (1 << 1) #define AXP_POWERSRC_STARTUP (1 << 0) #define AXP_POWERMODE 0x01 #define AXP_POWERMODE_BAT_CHARGING (1 << 6) #define AXP_POWERMODE_BAT_PRESENT (1 << 5) #define AXP_POWERMODE_BAT_VALID (1 << 4) #define AXP_ICTYPE 0x03 #define AXP_POWERCTL1 0x10 #define AXP_POWERCTL1_DCDC7 (1 << 6) /* AXP813/818 only */ #define AXP_POWERCTL1_DCDC6 (1 << 5) #define AXP_POWERCTL1_DCDC5 (1 << 4) #define AXP_POWERCTL1_DCDC4 (1 << 3) #define AXP_POWERCTL1_DCDC3 (1 << 2) #define AXP_POWERCTL1_DCDC2 (1 << 1) #define AXP_POWERCTL1_DCDC1 (1 << 0) #define AXP_POWERCTL2 0x12 #define AXP_POWERCTL2_DC1SW (1 << 7) /* AXP803 only */ #define AXP_POWERCTL2_DLDO4 (1 << 6) #define AXP_POWERCTL2_DLDO3 (1 << 5) #define AXP_POWERCTL2_DLDO2 (1 << 4) #define AXP_POWERCTL2_DLDO1 (1 << 3) #define AXP_POWERCTL2_ELDO3 (1 << 2) #define AXP_POWERCTL2_ELDO2 (1 << 1) #define AXP_POWERCTL2_ELDO1 (1 << 0) #define AXP_POWERCTL3 0x13 #define AXP_POWERCTL3_ALDO3 (1 << 7) #define AXP_POWERCTL3_ALDO2 (1 << 6) #define AXP_POWERCTL3_ALDO1 (1 << 5) #define AXP_POWERCTL3_FLDO3 (1 << 4) /* AXP813/818 only */ #define AXP_POWERCTL3_FLDO2 (1 << 3) #define AXP_POWERCTL3_FLDO1 (1 << 2) #define AXP_VOLTCTL_DLDO1 0x15 #define AXP_VOLTCTL_DLDO2 0x16 #define AXP_VOLTCTL_DLDO3 0x17 #define AXP_VOLTCTL_DLDO4 0x18 #define AXP_VOLTCTL_ELDO1 0x19 #define AXP_VOLTCTL_ELDO2 0x1A #define AXP_VOLTCTL_ELDO3 0x1B #define AXP_VOLTCTL_FLDO1 0x1C #define AXP_VOLTCTL_FLDO2 0x1D #define AXP_VOLTCTL_DCDC1 0x20 #define AXP_VOLTCTL_DCDC2 0x21 #define AXP_VOLTCTL_DCDC3 0x22 #define AXP_VOLTCTL_DCDC4 0x23 #define AXP_VOLTCTL_DCDC5 0x24 #define AXP_VOLTCTL_DCDC6 0x25 #define AXP_VOLTCTL_DCDC7 0x26 #define AXP_VOLTCTL_ALDO1 0x28 #define AXP_VOLTCTL_ALDO2 0x29 #define AXP_VOLTCTL_ALDO3 0x2A #define AXP_VOLTCTL_STATUS (1 << 7) #define AXP_VOLTCTL_MASK 0x7f #define AXP_POWERBAT 0x32 #define AXP_POWERBAT_SHUTDOWN (1 << 7) #define AXP_CHARGERCTL1 0x33 #define AXP_CHARGERCTL1_MIN 0 #define AXP_CHARGERCTL1_MAX 13 #define AXP_CHARGERCTL1_CMASK 0xf #define AXP_IRQEN1 0x40 #define AXP_IRQEN1_ACIN_HI (1 << 6) #define AXP_IRQEN1_ACIN_LO (1 << 5) #define AXP_IRQEN1_VBUS_HI (1 << 3) #define AXP_IRQEN1_VBUS_LO (1 << 2) #define AXP_IRQEN2 0x41 #define AXP_IRQEN2_BAT_IN (1 << 7) #define AXP_IRQEN2_BAT_NO (1 << 6) #define AXP_IRQEN2_BATCHGC (1 << 3) #define AXP_IRQEN2_BATCHGD (1 << 2) #define AXP_IRQEN3 0x42 #define AXP_IRQEN4 0x43 #define AXP_IRQEN4_BATLVL_LO1 (1 << 1) #define AXP_IRQEN4_BATLVL_LO0 (1 << 0) #define AXP_IRQEN5 0x44 #define AXP_IRQEN5_POKSIRQ (1 << 4) #define AXP_IRQEN5_POKLIRQ (1 << 3) #define AXP_IRQEN6 0x45 #define AXP_IRQSTAT1 0x48 #define AXP_IRQSTAT1_ACIN_HI (1 << 6) #define AXP_IRQSTAT1_ACIN_LO (1 << 5) #define AXP_IRQSTAT1_VBUS_HI (1 << 3) #define AXP_IRQSTAT1_VBUS_LO (1 << 2) #define AXP_IRQSTAT2 0x49 #define AXP_IRQSTAT2_BAT_IN (1 << 7) #define AXP_IRQSTAT2_BAT_NO (1 << 6) #define AXP_IRQSTAT2_BATCHGC (1 << 3) #define AXP_IRQSTAT2_BATCHGD (1 << 2) #define 
AXP_IRQSTAT3 0x4a #define AXP_IRQSTAT4 0x4b #define AXP_IRQSTAT4_BATLVL_LO1 (1 << 1) #define AXP_IRQSTAT4_BATLVL_LO0 (1 << 0) #define AXP_IRQSTAT5 0x4c #define AXP_IRQSTAT5_POKSIRQ (1 << 4) #define AXP_IRQEN5_POKLIRQ (1 << 3) #define AXP_IRQSTAT6 0x4d #define AXP_BATSENSE_HI 0x78 #define AXP_BATSENSE_LO 0x79 #define AXP_BATCHG_HI 0x7a #define AXP_BATCHG_LO 0x7b #define AXP_BATDISCHG_HI 0x7c #define AXP_BATDISCHG_LO 0x7d #define AXP_GPIO0_CTRL 0x90 #define AXP_GPIO0LDO_CTRL 0x91 #define AXP_GPIO1_CTRL 0x92 #define AXP_GPIO1LDO_CTRL 0x93 #define AXP_GPIO_FUNC (0x7 << 0) #define AXP_GPIO_FUNC_SHIFT 0 #define AXP_GPIO_FUNC_DRVLO 0 #define AXP_GPIO_FUNC_DRVHI 1 #define AXP_GPIO_FUNC_INPUT 2 #define AXP_GPIO_FUNC_LDO_ON 3 #define AXP_GPIO_FUNC_LDO_OFF 4 #define AXP_GPIO_SIGBIT 0x94 #define AXP_GPIO_PD 0x97 #define AXP_FUEL_GAUGECTL 0xb8 #define AXP_FUEL_GAUGECTL_EN (1 << 7) #define AXP_BAT_CAP 0xb9 #define AXP_BAT_CAP_VALID (1 << 7) #define AXP_BAT_CAP_PERCENT 0x7f #define AXP_BAT_MAX_CAP_HI 0xe0 #define AXP_BAT_MAX_CAP_VALID (1 << 7) #define AXP_BAT_MAX_CAP_LO 0xe1 #define AXP_BAT_COULOMB_HI 0xe2 #define AXP_BAT_COULOMB_VALID (1 << 7) #define AXP_BAT_COULOMB_LO 0xe3 #define AXP_BAT_CAP_WARN 0xe6 #define AXP_BAT_CAP_WARN_LV1 0xf0 /* Bits 4, 5, 6, 7 */ #define AXP_BAP_CAP_WARN_LV1BASE 5 /* 5-20%, 1% per step */ #define AXP_BAT_CAP_WARN_LV2 0xf /* Bits 0, 1, 2, 3 */ /* Sensor conversion macros */ #define AXP_SENSOR_BAT_H(hi) ((hi) << 4) #define AXP_SENSOR_BAT_L(lo) ((lo) & 0xf) #define AXP_SENSOR_COULOMB(hi, lo) (((hi & ~(1 << 7)) << 8) | (lo)) static const struct { const char *name; uint8_t ctrl_reg; } axp8xx_pins[] = { { "GPIO0", AXP_GPIO0_CTRL }, { "GPIO1", AXP_GPIO1_CTRL }, }; enum AXP8XX_TYPE { AXP803 = 1, AXP813, }; static struct ofw_compat_data compat_data[] = { { "x-powers,axp803", AXP803 }, { "x-powers,axp813", AXP813 }, { "x-powers,axp818", AXP813 }, { NULL, 0 } }; static struct resource_spec axp8xx_spec[] = { { SYS_RES_IRQ, 0, RF_ACTIVE }, { -1, 0 } }; struct axp8xx_regdef { intptr_t id; char *name; char *supply_name; uint8_t enable_reg; uint8_t enable_mask; uint8_t enable_value; uint8_t disable_value; uint8_t voltage_reg; int voltage_min; int voltage_max; int voltage_step1; int voltage_nstep1; int voltage_step2; int voltage_nstep2; }; enum axp8xx_reg_id { AXP8XX_REG_ID_DCDC1 = 100, AXP8XX_REG_ID_DCDC2, AXP8XX_REG_ID_DCDC3, AXP8XX_REG_ID_DCDC4, AXP8XX_REG_ID_DCDC5, AXP8XX_REG_ID_DCDC6, AXP813_REG_ID_DCDC7, AXP803_REG_ID_DC1SW, AXP8XX_REG_ID_DLDO1, AXP8XX_REG_ID_DLDO2, AXP8XX_REG_ID_DLDO3, AXP8XX_REG_ID_DLDO4, AXP8XX_REG_ID_ELDO1, AXP8XX_REG_ID_ELDO2, AXP8XX_REG_ID_ELDO3, AXP8XX_REG_ID_ALDO1, AXP8XX_REG_ID_ALDO2, AXP8XX_REG_ID_ALDO3, AXP8XX_REG_ID_FLDO1, AXP8XX_REG_ID_FLDO2, AXP813_REG_ID_FLDO3, AXP8XX_REG_ID_GPIO0_LDO, AXP8XX_REG_ID_GPIO1_LDO, }; static struct axp8xx_regdef axp803_regdefs[] = { { .id = AXP803_REG_ID_DC1SW, .name = "dc1sw", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_DC1SW, .enable_value = AXP_POWERCTL2_DC1SW, }, }; static struct axp8xx_regdef axp813_regdefs[] = { { .id = AXP813_REG_ID_DCDC7, .name = "dcdc7", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC7, .enable_value = AXP_POWERCTL1_DCDC7, .voltage_reg = AXP_VOLTCTL_DCDC7, .voltage_min = 600, .voltage_max = 1520, .voltage_step1 = 10, .voltage_nstep1 = 50, .voltage_step2 = 20, .voltage_nstep2 = 21, }, }; static struct axp8xx_regdef axp8xx_common_regdefs[] = { { .id = AXP8XX_REG_ID_DCDC1, .name = "dcdc1", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) 
AXP_POWERCTL1_DCDC1, .enable_value = AXP_POWERCTL1_DCDC1, .voltage_reg = AXP_VOLTCTL_DCDC1, .voltage_min = 1600, .voltage_max = 3400, .voltage_step1 = 100, .voltage_nstep1 = 18, }, { .id = AXP8XX_REG_ID_DCDC2, .name = "dcdc2", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC2, .enable_value = AXP_POWERCTL1_DCDC2, .voltage_reg = AXP_VOLTCTL_DCDC2, .voltage_min = 500, .voltage_max = 1300, .voltage_step1 = 10, .voltage_nstep1 = 70, .voltage_step2 = 20, .voltage_nstep2 = 5, }, { .id = AXP8XX_REG_ID_DCDC3, .name = "dcdc3", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC3, .enable_value = AXP_POWERCTL1_DCDC3, .voltage_reg = AXP_VOLTCTL_DCDC3, .voltage_min = 500, .voltage_max = 1300, .voltage_step1 = 10, .voltage_nstep1 = 70, .voltage_step2 = 20, .voltage_nstep2 = 5, }, { .id = AXP8XX_REG_ID_DCDC4, .name = "dcdc4", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC4, .enable_value = AXP_POWERCTL1_DCDC4, .voltage_reg = AXP_VOLTCTL_DCDC4, .voltage_min = 500, .voltage_max = 1300, .voltage_step1 = 10, .voltage_nstep1 = 70, .voltage_step2 = 20, .voltage_nstep2 = 5, }, { .id = AXP8XX_REG_ID_DCDC5, .name = "dcdc5", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC5, .enable_value = AXP_POWERCTL1_DCDC5, .voltage_reg = AXP_VOLTCTL_DCDC5, .voltage_min = 800, .voltage_max = 1840, .voltage_step1 = 10, .voltage_nstep1 = 42, .voltage_step2 = 20, .voltage_nstep2 = 36, }, { .id = AXP8XX_REG_ID_DCDC6, .name = "dcdc6", .enable_reg = AXP_POWERCTL1, .enable_mask = (uint8_t) AXP_POWERCTL1_DCDC6, .enable_value = AXP_POWERCTL1_DCDC6, .voltage_reg = AXP_VOLTCTL_DCDC6, .voltage_min = 600, .voltage_max = 1520, .voltage_step1 = 10, .voltage_nstep1 = 50, .voltage_step2 = 20, .voltage_nstep2 = 21, }, { .id = AXP8XX_REG_ID_DLDO1, .name = "dldo1", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_DLDO1, .enable_value = AXP_POWERCTL2_DLDO1, .voltage_reg = AXP_VOLTCTL_DLDO1, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_DLDO2, .name = "dldo2", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_DLDO2, .enable_value = AXP_POWERCTL2_DLDO2, .voltage_reg = AXP_VOLTCTL_DLDO2, .voltage_min = 700, .voltage_max = 4200, .voltage_step1 = 100, .voltage_nstep1 = 27, .voltage_step2 = 200, .voltage_nstep2 = 4, }, { .id = AXP8XX_REG_ID_DLDO3, .name = "dldo3", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_DLDO3, .enable_value = AXP_POWERCTL2_DLDO3, .voltage_reg = AXP_VOLTCTL_DLDO3, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_DLDO4, .name = "dldo4", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_DLDO4, .enable_value = AXP_POWERCTL2_DLDO4, .voltage_reg = AXP_VOLTCTL_DLDO4, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_ALDO1, .name = "aldo1", .enable_reg = AXP_POWERCTL3, .enable_mask = (uint8_t) AXP_POWERCTL3_ALDO1, .enable_value = AXP_POWERCTL3_ALDO1, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_ALDO2, .name = "aldo2", .enable_reg = AXP_POWERCTL3, .enable_mask = (uint8_t) AXP_POWERCTL3_ALDO2, .enable_value = AXP_POWERCTL3_ALDO2, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_ALDO3, .name = "aldo3", .enable_reg = AXP_POWERCTL3, .enable_mask = (uint8_t) 
AXP_POWERCTL3_ALDO3, .enable_value = AXP_POWERCTL3_ALDO3, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_ELDO1, .name = "eldo1", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_ELDO1, .enable_value = AXP_POWERCTL2_ELDO1, .voltage_min = 700, .voltage_max = 1900, .voltage_step1 = 50, .voltage_nstep1 = 24, }, { .id = AXP8XX_REG_ID_ELDO2, .name = "eldo2", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_ELDO2, .enable_value = AXP_POWERCTL2_ELDO2, .voltage_min = 700, .voltage_max = 1900, .voltage_step1 = 50, .voltage_nstep1 = 24, }, { .id = AXP8XX_REG_ID_ELDO3, .name = "eldo3", .enable_reg = AXP_POWERCTL2, .enable_mask = (uint8_t) AXP_POWERCTL2_ELDO3, .enable_value = AXP_POWERCTL2_ELDO3, .voltage_min = 700, .voltage_max = 1900, .voltage_step1 = 50, .voltage_nstep1 = 24, }, { .id = AXP8XX_REG_ID_FLDO1, .name = "fldo1", .enable_reg = AXP_POWERCTL3, .enable_mask = (uint8_t) AXP_POWERCTL3_FLDO1, .enable_value = AXP_POWERCTL3_FLDO1, .voltage_min = 700, .voltage_max = 1450, .voltage_step1 = 50, .voltage_nstep1 = 15, }, { .id = AXP8XX_REG_ID_FLDO2, .name = "fldo2", .enable_reg = AXP_POWERCTL3, .enable_mask = (uint8_t) AXP_POWERCTL3_FLDO2, .enable_value = AXP_POWERCTL3_FLDO2, .voltage_min = 700, .voltage_max = 1450, .voltage_step1 = 50, .voltage_nstep1 = 15, }, { .id = AXP8XX_REG_ID_GPIO0_LDO, .name = "ldo-io0", .enable_reg = AXP_GPIO0_CTRL, .enable_mask = (uint8_t) AXP_GPIO_FUNC, .enable_value = AXP_GPIO_FUNC_LDO_ON, .disable_value = AXP_GPIO_FUNC_LDO_OFF, .voltage_reg = AXP_GPIO0LDO_CTRL, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, { .id = AXP8XX_REG_ID_GPIO1_LDO, .name = "ldo-io1", .enable_reg = AXP_GPIO1_CTRL, .enable_mask = (uint8_t) AXP_GPIO_FUNC, .enable_value = AXP_GPIO_FUNC_LDO_ON, .disable_value = AXP_GPIO_FUNC_LDO_OFF, .voltage_reg = AXP_GPIO1LDO_CTRL, .voltage_min = 700, .voltage_max = 3300, .voltage_step1 = 100, .voltage_nstep1 = 26, }, }; enum axp8xx_sensor { AXP_SENSOR_ACIN_PRESENT, AXP_SENSOR_VBUS_PRESENT, AXP_SENSOR_BATT_PRESENT, AXP_SENSOR_BATT_CHARGING, AXP_SENSOR_BATT_CHARGE_STATE, AXP_SENSOR_BATT_VOLTAGE, AXP_SENSOR_BATT_CHARGE_CURRENT, AXP_SENSOR_BATT_DISCHARGE_CURRENT, AXP_SENSOR_BATT_CAPACITY_PERCENT, AXP_SENSOR_BATT_MAXIMUM_CAPACITY, AXP_SENSOR_BATT_CURRENT_CAPACITY, }; enum battery_capacity_state { BATT_CAPACITY_NORMAL = 1, /* normal cap in battery */ BATT_CAPACITY_WARNING, /* warning cap in battery */ BATT_CAPACITY_CRITICAL, /* critical cap in battery */ BATT_CAPACITY_HIGH, /* high cap in battery */ BATT_CAPACITY_MAX, /* maximum cap in battery */ BATT_CAPACITY_LOW /* low cap in battery */ }; struct axp8xx_sensors { int id; const char *name; const char *desc; const char *format; }; static const struct axp8xx_sensors axp8xx_common_sensors[] = { { .id = AXP_SENSOR_ACIN_PRESENT, .name = "acin", .format = "I", .desc = "ACIN Present", }, { .id = AXP_SENSOR_VBUS_PRESENT, .name = "vbus", .format = "I", .desc = "VBUS Present", }, { .id = AXP_SENSOR_BATT_PRESENT, .name = "bat", .format = "I", .desc = "Battery Present", }, { .id = AXP_SENSOR_BATT_CHARGING, .name = "batcharging", .format = "I", .desc = "Battery Charging", }, { .id = AXP_SENSOR_BATT_CHARGE_STATE, .name = "batchargestate", .format = "I", .desc = "Battery Charge State", }, { .id = AXP_SENSOR_BATT_VOLTAGE, .name = "batvolt", .format = "I", .desc = "Battery Voltage", }, { .id = AXP_SENSOR_BATT_CHARGE_CURRENT, .name = "batchargecurrent", .format = "I", .desc = "Average Battery 
Charging Current", }, { .id = AXP_SENSOR_BATT_DISCHARGE_CURRENT, .name = "batdischargecurrent", .format = "I", .desc = "Average Battery Discharging Current", }, { .id = AXP_SENSOR_BATT_CAPACITY_PERCENT, .name = "batcapacitypercent", .format = "I", .desc = "Battery Capacity Percentage", }, { .id = AXP_SENSOR_BATT_MAXIMUM_CAPACITY, .name = "batmaxcapacity", .format = "I", .desc = "Battery Maximum Capacity", }, { .id = AXP_SENSOR_BATT_CURRENT_CAPACITY, .name = "batcurrentcapacity", .format = "I", .desc = "Battery Current Capacity", }, }; struct axp8xx_config { const char *name; int batsense_step; /* uV */ int charge_step; /* uA */ int discharge_step; /* uA */ int maxcap_step; /* uAh */ int coulomb_step; /* uAh */ }; static struct axp8xx_config axp803_config = { .name = "AXP803", .batsense_step = 1100, .charge_step = 1000, .discharge_step = 1000, .maxcap_step = 1456, .coulomb_step = 1456, }; struct axp8xx_softc; struct axp8xx_reg_sc { struct regnode *regnode; device_t base_dev; struct axp8xx_regdef *def; phandle_t xref; struct regnode_std_param *param; }; struct axp8xx_softc { struct resource *res; uint16_t addr; void *ih; device_t gpiodev; struct mtx mtx; int busy; int type; /* Configs */ const struct axp8xx_config *config; /* Sensors */ const struct axp8xx_sensors *sensors; int nsensors; /* Regulators */ struct axp8xx_reg_sc **regs; int nregs; /* Warning, shutdown thresholds */ int warn_thres; int shut_thres; }; #define AXP_LOCK(sc) mtx_lock(&(sc)->mtx) #define AXP_UNLOCK(sc) mtx_unlock(&(sc)->mtx) static int axp8xx_read(device_t dev, uint8_t reg, uint8_t *data, uint8_t size) { struct axp8xx_softc *sc; struct iic_msg msg[2]; sc = device_get_softc(dev); msg[0].slave = sc->addr; msg[0].flags = IIC_M_WR; msg[0].len = 1; msg[0].buf = ® msg[1].slave = sc->addr; msg[1].flags = IIC_M_RD; msg[1].len = size; msg[1].buf = data; return (iicbus_transfer(dev, msg, 2)); } static int axp8xx_write(device_t dev, uint8_t reg, uint8_t val) { struct axp8xx_softc *sc; struct iic_msg msg[2]; sc = device_get_softc(dev); msg[0].slave = sc->addr; msg[0].flags = IIC_M_WR; msg[0].len = 1; msg[0].buf = ® msg[1].slave = sc->addr; msg[1].flags = IIC_M_WR; msg[1].len = 1; msg[1].buf = &val; return (iicbus_transfer(dev, msg, 2)); } static int axp8xx_regnode_init(struct regnode *regnode) { return (0); } static int axp8xx_regnode_enable(struct regnode *regnode, bool enable, int *udelay) { struct axp8xx_reg_sc *sc; uint8_t val; sc = regnode_get_softc(regnode); if (bootverbose) device_printf(sc->base_dev, "%sable %s (%s)\n", enable ? 
"En" : "Dis", regnode_get_name(regnode), sc->def->name); axp8xx_read(sc->base_dev, sc->def->enable_reg, &val, 1); val &= ~sc->def->enable_mask; if (enable) val |= sc->def->enable_value; else { if (sc->def->disable_value) val |= sc->def->disable_value; else val &= ~sc->def->enable_value; } axp8xx_write(sc->base_dev, sc->def->enable_reg, val); *udelay = 0; return (0); } static void axp8xx_regnode_reg_to_voltage(struct axp8xx_reg_sc *sc, uint8_t val, int *uv) { if (val < sc->def->voltage_nstep1) *uv = sc->def->voltage_min + val * sc->def->voltage_step1; else *uv = sc->def->voltage_min + (sc->def->voltage_nstep1 * sc->def->voltage_step1) + ((val - sc->def->voltage_nstep1) * sc->def->voltage_step2); *uv *= 1000; } static int axp8xx_regnode_voltage_to_reg(struct axp8xx_reg_sc *sc, int min_uvolt, int max_uvolt, uint8_t *val) { uint8_t nval; int nstep, uvolt; nval = 0; uvolt = sc->def->voltage_min * 1000; for (nstep = 0; nstep < sc->def->voltage_nstep1 && uvolt < min_uvolt; nstep++) { ++nval; uvolt += (sc->def->voltage_step1 * 1000); } for (nstep = 0; nstep < sc->def->voltage_nstep2 && uvolt < min_uvolt; nstep++) { ++nval; uvolt += (sc->def->voltage_step2 * 1000); } if (uvolt > max_uvolt) return (EINVAL); *val = nval; return (0); } static int axp8xx_regnode_set_voltage(struct regnode *regnode, int min_uvolt, int max_uvolt, int *udelay) { struct axp8xx_reg_sc *sc; uint8_t val; sc = regnode_get_softc(regnode); if (bootverbose) device_printf(sc->base_dev, "Setting %s (%s) to %d<->%d\n", regnode_get_name(regnode), sc->def->name, min_uvolt, max_uvolt); if (sc->def->voltage_step1 == 0) return (ENXIO); if (axp8xx_regnode_voltage_to_reg(sc, min_uvolt, max_uvolt, &val) != 0) return (ERANGE); axp8xx_write(sc->base_dev, sc->def->voltage_reg, val); *udelay = 0; return (0); } static int axp8xx_regnode_get_voltage(struct regnode *regnode, int *uvolt) { struct axp8xx_reg_sc *sc; uint8_t val; sc = regnode_get_softc(regnode); if (!sc->def->voltage_step1 || !sc->def->voltage_step2) return (ENXIO); axp8xx_read(sc->base_dev, sc->def->voltage_reg, &val, 1); axp8xx_regnode_reg_to_voltage(sc, val & AXP_VOLTCTL_MASK, uvolt); return (0); } static regnode_method_t axp8xx_regnode_methods[] = { /* Regulator interface */ REGNODEMETHOD(regnode_init, axp8xx_regnode_init), REGNODEMETHOD(regnode_enable, axp8xx_regnode_enable), REGNODEMETHOD(regnode_set_voltage, axp8xx_regnode_set_voltage), REGNODEMETHOD(regnode_get_voltage, axp8xx_regnode_get_voltage), REGNODEMETHOD_END }; DEFINE_CLASS_1(axp8xx_regnode, axp8xx_regnode_class, axp8xx_regnode_methods, sizeof(struct axp8xx_reg_sc), regnode_class); static void axp8xx_shutdown(void *devp, int howto) { device_t dev; if ((howto & RB_POWEROFF) == 0) return; dev = devp; if (bootverbose) device_printf(dev, "Shutdown Axp8xx\n"); axp8xx_write(dev, AXP_POWERBAT, AXP_POWERBAT_SHUTDOWN); } static int axp8xx_sysctl_chargecurrent(SYSCTL_HANDLER_ARGS) { device_t dev = arg1; uint8_t data; int val, error; error = axp8xx_read(dev, AXP_CHARGERCTL1, &data, 1); if (error != 0) return (error); if (bootverbose) device_printf(dev, "Raw CHARGECTL1 val: 0x%0x\n", data); val = (data & AXP_CHARGERCTL1_CMASK); error = sysctl_handle_int(oidp, &val, 0, req); if (error || !req->newptr) /* error || read request */ return (error); if ((val < AXP_CHARGERCTL1_MIN) || (val > AXP_CHARGERCTL1_MAX)) return (EINVAL); val |= (data & (AXP_CHARGERCTL1_CMASK << 4)); axp8xx_write(dev, AXP_CHARGERCTL1, val); return (0); } static int axp8xx_sysctl(SYSCTL_HANDLER_ARGS) { struct axp8xx_softc *sc; device_t dev = arg1; enum 
axp8xx_sensor sensor = arg2; const struct axp8xx_config *c; uint8_t data; int val, i, found, batt_val; uint8_t lo, hi; sc = device_get_softc(dev); c = sc->config; for (found = 0, i = 0; i < sc->nsensors; i++) { if (sc->sensors[i].id == sensor) { found = 1; break; } } if (found == 0) return (ENOENT); switch (sensor) { case AXP_SENSOR_ACIN_PRESENT: if (axp8xx_read(dev, AXP_POWERSRC, &data, 1) == 0) val = !!(data & AXP_POWERSRC_ACIN); break; case AXP_SENSOR_VBUS_PRESENT: if (axp8xx_read(dev, AXP_POWERSRC, &data, 1) == 0) val = !!(data & AXP_POWERSRC_VBUS); break; case AXP_SENSOR_BATT_PRESENT: if (axp8xx_read(dev, AXP_POWERMODE, &data, 1) == 0) { if (data & AXP_POWERMODE_BAT_VALID) val = !!(data & AXP_POWERMODE_BAT_PRESENT); } break; case AXP_SENSOR_BATT_CHARGING: if (axp8xx_read(dev, AXP_POWERMODE, &data, 1) == 0) val = !!(data & AXP_POWERMODE_BAT_CHARGING); break; case AXP_SENSOR_BATT_CHARGE_STATE: if (axp8xx_read(dev, AXP_BAT_CAP, &data, 1) == 0 && (data & AXP_BAT_CAP_VALID) != 0) { batt_val = (data & AXP_BAT_CAP_PERCENT); if (batt_val <= sc->shut_thres) val = BATT_CAPACITY_CRITICAL; else if (batt_val <= sc->warn_thres) val = BATT_CAPACITY_WARNING; else val = BATT_CAPACITY_NORMAL; } break; case AXP_SENSOR_BATT_CAPACITY_PERCENT: if (axp8xx_read(dev, AXP_BAT_CAP, &data, 1) == 0 && (data & AXP_BAT_CAP_VALID) != 0) val = (data & AXP_BAT_CAP_PERCENT); break; case AXP_SENSOR_BATT_VOLTAGE: if (axp8xx_read(dev, AXP_BATSENSE_HI, &hi, 1) == 0 && axp8xx_read(dev, AXP_BATSENSE_LO, &lo, 1) == 0) { val = (AXP_SENSOR_BAT_H(hi) | AXP_SENSOR_BAT_L(lo)); val *= c->batsense_step; } break; case AXP_SENSOR_BATT_CHARGE_CURRENT: if (axp8xx_read(dev, AXP_POWERSRC, &data, 1) == 0 && (data & AXP_POWERSRC_CHARING) != 0 && axp8xx_read(dev, AXP_BATCHG_HI, &hi, 1) == 0 && axp8xx_read(dev, AXP_BATCHG_LO, &lo, 1) == 0) { val = (AXP_SENSOR_BAT_H(hi) | AXP_SENSOR_BAT_L(lo)); val *= c->charge_step; } break; case AXP_SENSOR_BATT_DISCHARGE_CURRENT: if (axp8xx_read(dev, AXP_POWERSRC, &data, 1) == 0 && (data & AXP_POWERSRC_CHARING) == 0 && axp8xx_read(dev, AXP_BATDISCHG_HI, &hi, 1) == 0 && axp8xx_read(dev, AXP_BATDISCHG_LO, &lo, 1) == 0) { val = (AXP_SENSOR_BAT_H(hi) | AXP_SENSOR_BAT_L(lo)); val *= c->discharge_step; } break; case AXP_SENSOR_BATT_MAXIMUM_CAPACITY: if (axp8xx_read(dev, AXP_BAT_MAX_CAP_HI, &hi, 1) == 0 && axp8xx_read(dev, AXP_BAT_MAX_CAP_LO, &lo, 1) == 0) { val = AXP_SENSOR_COULOMB(hi, lo); val *= c->maxcap_step; } break; case AXP_SENSOR_BATT_CURRENT_CAPACITY: if (axp8xx_read(dev, AXP_BAT_COULOMB_HI, &hi, 1) == 0 && axp8xx_read(dev, AXP_BAT_COULOMB_LO, &lo, 1) == 0) { val = AXP_SENSOR_COULOMB(hi, lo); val *= c->coulomb_step; } break; } return sysctl_handle_opaque(oidp, &val, sizeof(val), req); } static void axp8xx_intr(void *arg) { device_t dev; uint8_t val; int error; dev = arg; error = axp8xx_read(dev, AXP_IRQSTAT1, &val, 1); if (error != 0) return; if (val) { if (bootverbose) device_printf(dev, "AXP_IRQSTAT1 val: %x\n", val); if (val & AXP_IRQSTAT1_ACIN_HI) devctl_notify("PMU", "AC", "plugged", NULL); if (val & AXP_IRQSTAT1_ACIN_LO) devctl_notify("PMU", "AC", "unplugged", NULL); if (val & AXP_IRQSTAT1_VBUS_HI) devctl_notify("PMU", "USB", "plugged", NULL); if (val & AXP_IRQSTAT1_VBUS_LO) devctl_notify("PMU", "USB", "unplugged", NULL); /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT1, val); } error = axp8xx_read(dev, AXP_IRQSTAT2, &val, 1); if (error != 0) return; if (val) { if (bootverbose) device_printf(dev, "AXP_IRQSTAT2 val: %x\n", val); if (val & AXP_IRQSTAT2_BATCHGD) devctl_notify("PMU", "Battery", 
"charged", NULL); if (val & AXP_IRQSTAT2_BATCHGC) devctl_notify("PMU", "Battery", "charging", NULL); if (val & AXP_IRQSTAT2_BAT_NO) devctl_notify("PMU", "Battery", "absent", NULL); if (val & AXP_IRQSTAT2_BAT_IN) devctl_notify("PMU", "Battery", "plugged", NULL); /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT2, val); } error = axp8xx_read(dev, AXP_IRQSTAT3, &val, 1); if (error != 0) return; if (val) { /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT3, val); } error = axp8xx_read(dev, AXP_IRQSTAT4, &val, 1); if (error != 0) return; if (val) { if (bootverbose) device_printf(dev, "AXP_IRQSTAT4 val: %x\n", val); if (val & AXP_IRQSTAT4_BATLVL_LO0) devctl_notify("PMU", "Battery", "shutdown threshold", NULL); if (val & AXP_IRQSTAT4_BATLVL_LO1) devctl_notify("PMU", "Battery", "warning threshold", NULL); /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT4, val); } error = axp8xx_read(dev, AXP_IRQSTAT5, &val, 1); if (error != 0) return; if (val != 0) { if ((val & AXP_IRQSTAT5_POKSIRQ) != 0) { if (bootverbose) device_printf(dev, "Power button pressed\n"); shutdown_nice(RB_POWEROFF); } /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT5, val); } error = axp8xx_read(dev, AXP_IRQSTAT6, &val, 1); if (error != 0) return; if (val) { /* Acknowledge */ axp8xx_write(dev, AXP_IRQSTAT6, val); } } static device_t axp8xx_gpio_get_bus(device_t dev) { struct axp8xx_softc *sc; sc = device_get_softc(dev); return (sc->gpiodev); } static int axp8xx_gpio_pin_max(device_t dev, int *maxpin) { *maxpin = nitems(axp8xx_pins) - 1; return (0); } static int axp8xx_gpio_pin_getname(device_t dev, uint32_t pin, char *name) { if (pin >= nitems(axp8xx_pins)) return (EINVAL); snprintf(name, GPIOMAXNAME, "%s", axp8xx_pins[pin].name); return (0); } static int axp8xx_gpio_pin_getcaps(device_t dev, uint32_t pin, uint32_t *caps) { if (pin >= nitems(axp8xx_pins)) return (EINVAL); *caps = GPIO_PIN_INPUT | GPIO_PIN_OUTPUT; return (0); } static int axp8xx_gpio_pin_getflags(device_t dev, uint32_t pin, uint32_t *flags) { struct axp8xx_softc *sc; uint8_t data, func; int error; if (pin >= nitems(axp8xx_pins)) return (EINVAL); sc = device_get_softc(dev); AXP_LOCK(sc); error = axp8xx_read(dev, axp8xx_pins[pin].ctrl_reg, &data, 1); if (error == 0) { func = (data & AXP_GPIO_FUNC) >> AXP_GPIO_FUNC_SHIFT; if (func == AXP_GPIO_FUNC_INPUT) *flags = GPIO_PIN_INPUT; else if (func == AXP_GPIO_FUNC_DRVLO || func == AXP_GPIO_FUNC_DRVHI) *flags = GPIO_PIN_OUTPUT; else *flags = 0; } AXP_UNLOCK(sc); return (error); } static int axp8xx_gpio_pin_setflags(device_t dev, uint32_t pin, uint32_t flags) { struct axp8xx_softc *sc; uint8_t data; int error; if (pin >= nitems(axp8xx_pins)) return (EINVAL); sc = device_get_softc(dev); AXP_LOCK(sc); error = axp8xx_read(dev, axp8xx_pins[pin].ctrl_reg, &data, 1); if (error == 0) { data &= ~AXP_GPIO_FUNC; if ((flags & (GPIO_PIN_INPUT|GPIO_PIN_OUTPUT)) != 0) { if ((flags & GPIO_PIN_OUTPUT) == 0) data |= AXP_GPIO_FUNC_INPUT; } error = axp8xx_write(dev, axp8xx_pins[pin].ctrl_reg, data); } AXP_UNLOCK(sc); return (error); } static int axp8xx_gpio_pin_get(device_t dev, uint32_t pin, unsigned int *val) { struct axp8xx_softc *sc; uint8_t data, func; int error; if (pin >= nitems(axp8xx_pins)) return (EINVAL); sc = device_get_softc(dev); AXP_LOCK(sc); error = axp8xx_read(dev, axp8xx_pins[pin].ctrl_reg, &data, 1); if (error == 0) { func = (data & AXP_GPIO_FUNC) >> AXP_GPIO_FUNC_SHIFT; switch (func) { case AXP_GPIO_FUNC_DRVLO: *val = 0; break; case AXP_GPIO_FUNC_DRVHI: *val = 1; break; case AXP_GPIO_FUNC_INPUT: error = axp8xx_read(dev, 
AXP_GPIO_SIGBIT, &data, 1); if (error == 0) *val = (data & (1 << pin)) ? 1 : 0; break; default: error = EIO; break; } } AXP_UNLOCK(sc); return (error); } static int axp8xx_gpio_pin_set(device_t dev, uint32_t pin, unsigned int val) { struct axp8xx_softc *sc; uint8_t data, func; int error; if (pin >= nitems(axp8xx_pins)) return (EINVAL); sc = device_get_softc(dev); AXP_LOCK(sc); error = axp8xx_read(dev, axp8xx_pins[pin].ctrl_reg, &data, 1); if (error == 0) { func = (data & AXP_GPIO_FUNC) >> AXP_GPIO_FUNC_SHIFT; switch (func) { case AXP_GPIO_FUNC_DRVLO: case AXP_GPIO_FUNC_DRVHI: data &= ~AXP_GPIO_FUNC; data |= (val << AXP_GPIO_FUNC_SHIFT); break; default: error = EIO; break; } } if (error == 0) error = axp8xx_write(dev, axp8xx_pins[pin].ctrl_reg, data); AXP_UNLOCK(sc); return (error); } static int axp8xx_gpio_pin_toggle(device_t dev, uint32_t pin) { struct axp8xx_softc *sc; uint8_t data, func; int error; if (pin >= nitems(axp8xx_pins)) return (EINVAL); sc = device_get_softc(dev); AXP_LOCK(sc); error = axp8xx_read(dev, axp8xx_pins[pin].ctrl_reg, &data, 1); if (error == 0) { func = (data & AXP_GPIO_FUNC) >> AXP_GPIO_FUNC_SHIFT; switch (func) { case AXP_GPIO_FUNC_DRVLO: data &= ~AXP_GPIO_FUNC; data |= (AXP_GPIO_FUNC_DRVHI << AXP_GPIO_FUNC_SHIFT); break; case AXP_GPIO_FUNC_DRVHI: data &= ~AXP_GPIO_FUNC; data |= (AXP_GPIO_FUNC_DRVLO << AXP_GPIO_FUNC_SHIFT); break; default: error = EIO; break; } } if (error == 0) error = axp8xx_write(dev, axp8xx_pins[pin].ctrl_reg, data); AXP_UNLOCK(sc); return (error); } static int axp8xx_gpio_map_gpios(device_t bus, phandle_t dev, phandle_t gparent, int gcells, pcell_t *gpios, uint32_t *pin, uint32_t *flags) { if (gpios[0] >= nitems(axp8xx_pins)) return (EINVAL); *pin = gpios[0]; *flags = gpios[1]; return (0); } static phandle_t axp8xx_get_node(device_t dev, device_t bus) { return (ofw_bus_get_node(dev)); } static struct axp8xx_reg_sc * axp8xx_reg_attach(device_t dev, phandle_t node, struct axp8xx_regdef *def) { struct axp8xx_reg_sc *reg_sc; struct regnode_init_def initdef; struct regnode *regnode; memset(&initdef, 0, sizeof(initdef)); if (regulator_parse_ofw_stdparam(dev, node, &initdef) != 0) return (NULL); if (initdef.std_param.min_uvolt == 0) initdef.std_param.min_uvolt = def->voltage_min * 1000; if (initdef.std_param.max_uvolt == 0) initdef.std_param.max_uvolt = def->voltage_max * 1000; initdef.id = def->id; initdef.ofw_node = node; regnode = regnode_create(dev, &axp8xx_regnode_class, &initdef); if (regnode == NULL) { device_printf(dev, "cannot create regulator\n"); return (NULL); } reg_sc = regnode_get_softc(regnode); reg_sc->regnode = regnode; reg_sc->base_dev = dev; reg_sc->def = def; reg_sc->xref = OF_xref_from_node(node); reg_sc->param = regnode_get_stdparam(regnode); regnode_register(regnode); return (reg_sc); } static int axp8xx_regdev_map(device_t dev, phandle_t xref, int ncells, pcell_t *cells, intptr_t *num) { struct axp8xx_softc *sc; int i; sc = device_get_softc(dev); for (i = 0; i < sc->nregs; i++) { if (sc->regs[i] == NULL) continue; if (sc->regs[i]->xref == xref) { *num = sc->regs[i]->def->id; return (0); } } return (ENXIO); } static int axp8xx_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); switch (ofw_bus_search_compatible(dev, compat_data)->ocd_data) { case AXP803: device_set_desc(dev, "X-Powers AXP803 Power Management Unit"); break; case AXP813: device_set_desc(dev, "X-Powers AXP813 Power Management Unit"); break; default: return (ENXIO); } return (BUS_PROBE_DEFAULT); } static int axp8xx_attach(device_t dev) { struct 
axp8xx_softc *sc; struct axp8xx_reg_sc *reg; uint8_t chip_id, val; phandle_t rnode, child; int error, i; sc = device_get_softc(dev); sc->addr = iicbus_get_addr(dev); mtx_init(&sc->mtx, device_get_nameunit(dev), NULL, MTX_DEF); error = bus_alloc_resources(dev, axp8xx_spec, &sc->res); if (error != 0) { device_printf(dev, "cannot allocate resources for device\n"); return (error); } if (bootverbose) { axp8xx_read(dev, AXP_ICTYPE, &chip_id, 1); device_printf(dev, "chip ID 0x%02x\n", chip_id); } sc->nregs = nitems(axp8xx_common_regdefs); sc->type = ofw_bus_search_compatible(dev, compat_data)->ocd_data; switch (sc->type) { case AXP803: sc->nregs += nitems(axp803_regdefs); break; case AXP813: sc->nregs += nitems(axp813_regdefs); break; } sc->config = &axp803_config; sc->sensors = axp8xx_common_sensors; sc->nsensors = nitems(axp8xx_common_sensors); sc->regs = malloc(sizeof(struct axp8xx_reg_sc *) * sc->nregs, M_AXP8XX_REG, M_WAITOK | M_ZERO); /* Attach known regulators that exist in the DT */ rnode = ofw_bus_find_child(ofw_bus_get_node(dev), "regulators"); if (rnode > 0) { for (i = 0; i < sc->nregs; i++) { char *regname; struct axp8xx_regdef *regdef; if (i <= nitems(axp8xx_common_regdefs)) { regname = axp8xx_common_regdefs[i].name; regdef = &axp8xx_common_regdefs[i]; } else { int off; off = i - nitems(axp8xx_common_regdefs); switch (sc->type) { case AXP803: regname = axp803_regdefs[off].name; regdef = &axp803_regdefs[off]; break; case AXP813: regname = axp813_regdefs[off].name; regdef = &axp813_regdefs[off]; break; } } child = ofw_bus_find_child(rnode, regname); if (child == 0) continue; reg = axp8xx_reg_attach(dev, child, regdef); if (reg == NULL) { device_printf(dev, "cannot attach regulator %s\n", regname); continue; } sc->regs[i] = reg; } } /* Add sensors */ for (i = 0; i < sc->nsensors; i++) { SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, sc->sensors[i].name, CTLTYPE_INT | CTLFLAG_RD, dev, sc->sensors[i].id, axp8xx_sysctl, sc->sensors[i].format, sc->sensors[i].desc); } SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "batchargecurrentstep", CTLTYPE_INT | CTLFLAG_RW, dev, 0, axp8xx_sysctl_chargecurrent, "I", "Battery Charging Current Step, " "0: 200mA, 1: 400mA, 2: 600mA, 3: 800mA, " "4: 1000mA, 5: 1200mA, 6: 1400mA, 7: 1600mA, " "8: 1800mA, 9: 2000mA, 10: 2200mA, 11: 2400mA, " "12: 2600mA, 13: 2800mA"); /* Get thresholds */ if (axp8xx_read(dev, AXP_BAT_CAP_WARN, &val, 1) == 0) { sc->warn_thres = (val & AXP_BAT_CAP_WARN_LV1) >> 4; sc->warn_thres += AXP_BAP_CAP_WARN_LV1BASE; sc->shut_thres = (val & AXP_BAT_CAP_WARN_LV2); if (bootverbose) { device_printf(dev, "Raw reg val: 0x%02x\n", val); device_printf(dev, "Warning threshold: 0x%02x\n", sc->warn_thres); device_printf(dev, "Shutdown threshold: 0x%02x\n", sc->shut_thres); } } /* Enable interrupts */ axp8xx_write(dev, AXP_IRQEN1, AXP_IRQEN1_VBUS_LO | AXP_IRQEN1_VBUS_HI | AXP_IRQEN1_ACIN_LO | AXP_IRQEN1_ACIN_HI); axp8xx_write(dev, AXP_IRQEN2, AXP_IRQEN2_BATCHGD | AXP_IRQEN2_BATCHGC | AXP_IRQEN2_BAT_NO | AXP_IRQEN2_BAT_IN); axp8xx_write(dev, AXP_IRQEN3, 0); axp8xx_write(dev, AXP_IRQEN4, AXP_IRQEN4_BATLVL_LO0 | AXP_IRQEN4_BATLVL_LO1); axp8xx_write(dev, AXP_IRQEN5, AXP_IRQEN5_POKSIRQ | AXP_IRQEN5_POKLIRQ); axp8xx_write(dev, AXP_IRQEN6, 0); /* Install interrupt handler */ error = bus_setup_intr(dev, sc->res, INTR_TYPE_MISC | INTR_MPSAFE, NULL, axp8xx_intr, dev, &sc->ih); if (error != 0) { device_printf(dev, "cannot setup interrupt handler\n"); 
return (error); } EVENTHANDLER_REGISTER(shutdown_final, axp8xx_shutdown, dev, SHUTDOWN_PRI_LAST); sc->gpiodev = gpiobus_attach_bus(dev); return (0); } static device_method_t axp8xx_methods[] = { /* Device interface */ DEVMETHOD(device_probe, axp8xx_probe), DEVMETHOD(device_attach, axp8xx_attach), /* GPIO interface */ DEVMETHOD(gpio_get_bus, axp8xx_gpio_get_bus), DEVMETHOD(gpio_pin_max, axp8xx_gpio_pin_max), DEVMETHOD(gpio_pin_getname, axp8xx_gpio_pin_getname), DEVMETHOD(gpio_pin_getcaps, axp8xx_gpio_pin_getcaps), DEVMETHOD(gpio_pin_getflags, axp8xx_gpio_pin_getflags), DEVMETHOD(gpio_pin_setflags, axp8xx_gpio_pin_setflags), DEVMETHOD(gpio_pin_get, axp8xx_gpio_pin_get), DEVMETHOD(gpio_pin_set, axp8xx_gpio_pin_set), DEVMETHOD(gpio_pin_toggle, axp8xx_gpio_pin_toggle), DEVMETHOD(gpio_map_gpios, axp8xx_gpio_map_gpios), /* Regdev interface */ DEVMETHOD(regdev_map, axp8xx_regdev_map), /* OFW bus interface */ DEVMETHOD(ofw_bus_get_node, axp8xx_get_node), DEVMETHOD_END }; static driver_t axp8xx_driver = { "axp8xx_pmu", axp8xx_methods, sizeof(struct axp8xx_softc), }; static devclass_t axp8xx_devclass; extern devclass_t ofwgpiobus_devclass, gpioc_devclass; extern driver_t ofw_gpiobus_driver, gpioc_driver; EARLY_DRIVER_MODULE(axp8xx, iicbus, axp8xx_driver, axp8xx_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LAST); EARLY_DRIVER_MODULE(ofw_gpiobus, axp8xx_pmu, ofw_gpiobus_driver, ofwgpiobus_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LAST); DRIVER_MODULE(gpioc, axp8xx_pmu, gpioc_driver, gpioc_devclass, 0, 0); MODULE_VERSION(axp8xx, 1); MODULE_DEPEND(axp8xx, iicbus, 1, 1, 1); +SIMPLEBUS_PNP_INFO(compat_data); Index: user/ngie/bug-237403/sys/arm/allwinner/clkng/ccu_de2.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/clkng/ccu_de2.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/clkng/ccu_de2.c (revision 346926) @@ -1,167 +1,166 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2018 Emmanuel Vadot * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include "opt_soc.h" #include #include #include #include #include #include /* Non exported clocks */ #define CLK_MIXER0_DIV 3 #define CLK_MIXER1_DIV 4 #define CLK_WB_DIV 5 static struct aw_ccung_reset de2_ccu_resets[] = { CCU_RESET(RST_MIXER0, 0x08, 0) CCU_RESET(RST_MIXER1, 0x08, 1) CCU_RESET(RST_WB, 0x08, 2) }; static struct aw_ccung_gate de2_ccu_gates[] = { CCU_GATE(CLK_BUS_MIXER0, "mixer0", "mixer0-div", 0x00, 0) CCU_GATE(CLK_BUS_MIXER1, "mixer1", "mixer1-div", 0x00, 1) CCU_GATE(CLK_BUS_WB, "wb", "wb-div", 0x00, 2) CCU_GATE(CLK_MIXER0, "bus-mixer0", "bus-de", 0x04, 0) CCU_GATE(CLK_MIXER1, "bus-mixer1", "bus-de", 0x04, 1) CCU_GATE(CLK_WB, "bus-wb", "bus-de", 0x04, 2) }; static const char *div_parents[] = {"de"}; NM_CLK(mixer0_div_clk, CLK_MIXER0_DIV, /* id */ "mixer0-div", div_parents, /* names, parents */ 0x0C, /* offset */ 0, 0, 1, AW_CLK_FACTOR_FIXED, /* N factor (fake)*/ 0, 4, 0, 0, /* M flags */ 0, 0, /* mux */ 0, /* gate */ AW_CLK_SCALE_CHANGE); /* flags */ NM_CLK(mixer1_div_clk, CLK_MIXER1_DIV, /* id */ "mixer1-div", div_parents, /* names, parents */ 0x0C, /* offset */ 0, 0, 1, AW_CLK_FACTOR_FIXED, /* N factor (fake)*/ 4, 4, 0, 0, /* M flags */ 0, 0, /* mux */ 0, /* gate */ AW_CLK_SCALE_CHANGE); /* flags */ NM_CLK(wb_div_clk, CLK_WB_DIV, /* id */ "wb-div", div_parents, /* names, parents */ 0x0C, /* offset */ 0, 0, 1, AW_CLK_FACTOR_FIXED, /* N factor (fake)*/ 8, 4, 0, 0, /* M flags */ 0, 0, /* mux */ 0, /* gate */ AW_CLK_SCALE_CHANGE); /* flags */ static struct aw_ccung_clk de2_ccu_clks[] = { { .type = AW_CLK_NM, .clk.nm = &mixer0_div_clk}, { .type = AW_CLK_NM, .clk.nm = &mixer1_div_clk}, { .type = AW_CLK_NM, .clk.nm = &wb_div_clk}, }; static struct ofw_compat_data compat_data[] = { {"allwinner,sun50i-a64-de2-clk", 1}, - {"allwinner,sun50i-h5-de2-clk", 1}, {NULL, 0} }; static int ccu_de2_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "Allwinner DE2 Clock Control Unit"); return (BUS_PROBE_DEFAULT); } static int ccu_de2_attach(device_t dev) { struct aw_ccung_softc *sc; sc = device_get_softc(dev); sc->resets = de2_ccu_resets; sc->nresets = nitems(de2_ccu_resets); sc->gates = de2_ccu_gates; sc->ngates = nitems(de2_ccu_gates); sc->clks = de2_ccu_clks; sc->nclks = nitems(de2_ccu_clks); return (aw_ccung_attach(dev)); } static device_method_t ccu_de2_methods[] = { /* Device interface */ DEVMETHOD(device_probe, ccu_de2_probe), DEVMETHOD(device_attach, ccu_de2_attach), DEVMETHOD_END }; static devclass_t ccu_de2ng_devclass; DEFINE_CLASS_1(ccu_de2, ccu_de2_driver, ccu_de2_methods, sizeof(struct aw_ccung_softc), aw_ccung_driver); EARLY_DRIVER_MODULE(ccu_de2, simplebus, ccu_de2_driver, ccu_de2ng_devclass, 0, 0, BUS_PASS_BUS + BUS_PASS_ORDER_LAST); Index: user/ngie/bug-237403/sys/arm/allwinner/if_awg.c =================================================================== --- user/ngie/bug-237403/sys/arm/allwinner/if_awg.c (revision 346925) +++ user/ngie/bug-237403/sys/arm/allwinner/if_awg.c (revision 346926) @@ -1,1972 +1,1973 @@ /*- * Copyright (c) 2016 Jared McNeill * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * Allwinner Gigabit Ethernet MAC (EMAC) controller */ #include "opt_device_polling.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "syscon_if.h" #include "miibus_if.h" #include "gpio_if.h" #define RD4(sc, reg) bus_read_4((sc)->res[_RES_EMAC], (reg)) #define WR4(sc, reg, val) bus_write_4((sc)->res[_RES_EMAC], (reg), (val)) #define AWG_LOCK(sc) mtx_lock(&(sc)->mtx) #define AWG_UNLOCK(sc) mtx_unlock(&(sc)->mtx); #define AWG_ASSERT_LOCKED(sc) mtx_assert(&(sc)->mtx, MA_OWNED) #define AWG_ASSERT_UNLOCKED(sc) mtx_assert(&(sc)->mtx, MA_NOTOWNED) #define DESC_ALIGN 4 #define TX_DESC_COUNT 1024 #define TX_DESC_SIZE (sizeof(struct emac_desc) * TX_DESC_COUNT) #define RX_DESC_COUNT 256 #define RX_DESC_SIZE (sizeof(struct emac_desc) * RX_DESC_COUNT) #define DESC_OFF(n) ((n) * sizeof(struct emac_desc)) #define TX_NEXT(n) (((n) + 1) & (TX_DESC_COUNT - 1)) #define TX_SKIP(n, o) (((n) + (o)) & (TX_DESC_COUNT - 1)) #define RX_NEXT(n) (((n) + 1) & (RX_DESC_COUNT - 1)) #define TX_MAX_SEGS 20 #define SOFT_RST_RETRY 1000 #define MII_BUSY_RETRY 1000 #define MDIO_FREQ 2500000 #define BURST_LEN_DEFAULT 8 #define RX_TX_PRI_DEFAULT 0 #define PAUSE_TIME_DEFAULT 0x400 #define TX_INTERVAL_DEFAULT 64 #define RX_BATCH_DEFAULT 64 /* syscon EMAC clock register */ #define EMAC_CLK_REG 0x30 #define EMAC_CLK_EPHY_ADDR (0x1f << 20) /* H3 */ #define EMAC_CLK_EPHY_ADDR_SHIFT 20 #define EMAC_CLK_EPHY_LED_POL (1 << 17) /* H3 */ #define EMAC_CLK_EPHY_SHUTDOWN (1 << 16) /* H3 */ #define EMAC_CLK_EPHY_SELECT (1 << 15) /* H3 */ #define EMAC_CLK_RMII_EN (1 << 13) #define EMAC_CLK_ETXDC (0x7 << 10) #define EMAC_CLK_ETXDC_SHIFT 10 #define EMAC_CLK_ERXDC (0x1f << 5) #define EMAC_CLK_ERXDC_SHIFT 5 #define EMAC_CLK_PIT (0x1 << 2) #define EMAC_CLK_PIT_MII (0 << 2) #define EMAC_CLK_PIT_RGMII (1 << 2) #define EMAC_CLK_SRC (0x3 << 0) #define EMAC_CLK_SRC_MII (0 << 0) #define EMAC_CLK_SRC_EXT_RGMII (1 << 0) #define EMAC_CLK_SRC_RGMII (2 << 0) /* Burst length of RX and TX DMA transfers */ static int awg_burst_len = BURST_LEN_DEFAULT; TUNABLE_INT("hw.awg.burst_len", &awg_burst_len); /* RX / TX DMA priority. If 1, RX DMA has priority over TX DMA. 
*/ static int awg_rx_tx_pri = RX_TX_PRI_DEFAULT; TUNABLE_INT("hw.awg.rx_tx_pri", &awg_rx_tx_pri); /* Pause time field in the transmitted control frame */ static int awg_pause_time = PAUSE_TIME_DEFAULT; TUNABLE_INT("hw.awg.pause_time", &awg_pause_time); /* Request a TX interrupt every descriptors */ static int awg_tx_interval = TX_INTERVAL_DEFAULT; TUNABLE_INT("hw.awg.tx_interval", &awg_tx_interval); /* Maximum number of mbufs to send to if_input */ static int awg_rx_batch = RX_BATCH_DEFAULT; TUNABLE_INT("hw.awg.rx_batch", &awg_rx_batch); enum awg_type { EMAC_A83T = 1, EMAC_H3, EMAC_A64, }; static struct ofw_compat_data compat_data[] = { { "allwinner,sun8i-a83t-emac", EMAC_A83T }, { "allwinner,sun8i-h3-emac", EMAC_H3 }, { "allwinner,sun50i-a64-emac", EMAC_A64 }, { NULL, 0 } }; struct awg_bufmap { bus_dmamap_t map; struct mbuf *mbuf; }; struct awg_txring { bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; struct emac_desc *desc_ring; bus_addr_t desc_ring_paddr; bus_dma_tag_t buf_tag; struct awg_bufmap buf_map[TX_DESC_COUNT]; u_int cur, next, queued; u_int segs; }; struct awg_rxring { bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; struct emac_desc *desc_ring; bus_addr_t desc_ring_paddr; bus_dma_tag_t buf_tag; struct awg_bufmap buf_map[RX_DESC_COUNT]; bus_dmamap_t buf_spare_map; u_int cur; }; enum { _RES_EMAC, _RES_IRQ, _RES_SYSCON, _RES_NITEMS }; struct awg_softc { struct resource *res[_RES_NITEMS]; struct mtx mtx; if_t ifp; device_t dev; device_t miibus; struct callout stat_ch; struct task link_task; void *ih; u_int mdc_div_ratio_m; int link; int if_flags; enum awg_type type; struct syscon *syscon; struct awg_txring tx; struct awg_rxring rx; }; static struct resource_spec awg_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, { SYS_RES_IRQ, 0, RF_ACTIVE }, { SYS_RES_MEMORY, 1, RF_ACTIVE | RF_OPTIONAL }, { -1, 0 } }; static void awg_txeof(struct awg_softc *sc); static int awg_parse_delay(device_t dev, uint32_t *tx_delay, uint32_t *rx_delay); static uint32_t syscon_read_emac_clk_reg(device_t dev); static void syscon_write_emac_clk_reg(device_t dev, uint32_t val); static phandle_t awg_get_phy_node(device_t dev); static bool awg_has_internal_phy(device_t dev); static int awg_miibus_readreg(device_t dev, int phy, int reg) { struct awg_softc *sc; int retry, val; sc = device_get_softc(dev); val = 0; WR4(sc, EMAC_MII_CMD, (sc->mdc_div_ratio_m << MDC_DIV_RATIO_M_SHIFT) | (phy << PHY_ADDR_SHIFT) | (reg << PHY_REG_ADDR_SHIFT) | MII_BUSY); for (retry = MII_BUSY_RETRY; retry > 0; retry--) { if ((RD4(sc, EMAC_MII_CMD) & MII_BUSY) == 0) { val = RD4(sc, EMAC_MII_DATA); break; } DELAY(10); } if (retry == 0) device_printf(dev, "phy read timeout, phy=%d reg=%d\n", phy, reg); return (val); } static int awg_miibus_writereg(device_t dev, int phy, int reg, int val) { struct awg_softc *sc; int retry; sc = device_get_softc(dev); WR4(sc, EMAC_MII_DATA, val); WR4(sc, EMAC_MII_CMD, (sc->mdc_div_ratio_m << MDC_DIV_RATIO_M_SHIFT) | (phy << PHY_ADDR_SHIFT) | (reg << PHY_REG_ADDR_SHIFT) | MII_WR | MII_BUSY); for (retry = MII_BUSY_RETRY; retry > 0; retry--) { if ((RD4(sc, EMAC_MII_CMD) & MII_BUSY) == 0) break; DELAY(10); } if (retry == 0) device_printf(dev, "phy write timeout, phy=%d reg=%d\n", phy, reg); return (0); } static void awg_update_link_locked(struct awg_softc *sc) { struct mii_data *mii; uint32_t val; AWG_ASSERT_LOCKED(sc); if ((if_getdrvflags(sc->ifp) & IFF_DRV_RUNNING) == 0) return; mii = device_get_softc(sc->miibus); if ((mii->mii_media_status & (IFM_ACTIVE | IFM_AVALID)) == (IFM_ACTIVE | IFM_AVALID)) { switch 
(IFM_SUBTYPE(mii->mii_media_active)) { case IFM_1000_T: case IFM_1000_SX: case IFM_100_TX: case IFM_10_T: sc->link = 1; break; default: sc->link = 0; break; } } else sc->link = 0; if (sc->link == 0) return; val = RD4(sc, EMAC_BASIC_CTL_0); val &= ~(BASIC_CTL_SPEED | BASIC_CTL_DUPLEX); if (IFM_SUBTYPE(mii->mii_media_active) == IFM_1000_T || IFM_SUBTYPE(mii->mii_media_active) == IFM_1000_SX) val |= BASIC_CTL_SPEED_1000 << BASIC_CTL_SPEED_SHIFT; else if (IFM_SUBTYPE(mii->mii_media_active) == IFM_100_TX) val |= BASIC_CTL_SPEED_100 << BASIC_CTL_SPEED_SHIFT; else val |= BASIC_CTL_SPEED_10 << BASIC_CTL_SPEED_SHIFT; if ((IFM_OPTIONS(mii->mii_media_active) & IFM_FDX) != 0) val |= BASIC_CTL_DUPLEX; WR4(sc, EMAC_BASIC_CTL_0, val); val = RD4(sc, EMAC_RX_CTL_0); val &= ~RX_FLOW_CTL_EN; if ((IFM_OPTIONS(mii->mii_media_active) & IFM_ETH_RXPAUSE) != 0) val |= RX_FLOW_CTL_EN; WR4(sc, EMAC_RX_CTL_0, val); val = RD4(sc, EMAC_TX_FLOW_CTL); val &= ~(PAUSE_TIME|TX_FLOW_CTL_EN); if ((IFM_OPTIONS(mii->mii_media_active) & IFM_ETH_TXPAUSE) != 0) val |= TX_FLOW_CTL_EN; if ((IFM_OPTIONS(mii->mii_media_active) & IFM_FDX) != 0) val |= awg_pause_time << PAUSE_TIME_SHIFT; WR4(sc, EMAC_TX_FLOW_CTL, val); } static void awg_link_task(void *arg, int pending) { struct awg_softc *sc; sc = arg; AWG_LOCK(sc); awg_update_link_locked(sc); AWG_UNLOCK(sc); } static void awg_miibus_statchg(device_t dev) { struct awg_softc *sc; sc = device_get_softc(dev); taskqueue_enqueue(taskqueue_swi, &sc->link_task); } static void awg_media_status(if_t ifp, struct ifmediareq *ifmr) { struct awg_softc *sc; struct mii_data *mii; sc = if_getsoftc(ifp); mii = device_get_softc(sc->miibus); AWG_LOCK(sc); mii_pollstat(mii); ifmr->ifm_active = mii->mii_media_active; ifmr->ifm_status = mii->mii_media_status; AWG_UNLOCK(sc); } static int awg_media_change(if_t ifp) { struct awg_softc *sc; struct mii_data *mii; int error; sc = if_getsoftc(ifp); mii = device_get_softc(sc->miibus); AWG_LOCK(sc); error = mii_mediachg(mii); AWG_UNLOCK(sc); return (error); } static int awg_encap(struct awg_softc *sc, struct mbuf **mp) { bus_dmamap_t map; bus_dma_segment_t segs[TX_MAX_SEGS]; int error, nsegs, cur, first, last, i; u_int csum_flags; uint32_t flags, status; struct mbuf *m; cur = first = sc->tx.cur; map = sc->tx.buf_map[first].map; m = *mp; error = bus_dmamap_load_mbuf_sg(sc->tx.buf_tag, map, m, segs, &nsegs, BUS_DMA_NOWAIT); if (error == EFBIG) { m = m_collapse(m, M_NOWAIT, TX_MAX_SEGS); if (m == NULL) { device_printf(sc->dev, "awg_encap: m_collapse failed\n"); m_freem(*mp); *mp = NULL; return (ENOMEM); } *mp = m; error = bus_dmamap_load_mbuf_sg(sc->tx.buf_tag, map, m, segs, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { m_freem(*mp); *mp = NULL; } } if (error != 0) { device_printf(sc->dev, "awg_encap: bus_dmamap_load_mbuf_sg failed\n"); return (error); } if (nsegs == 0) { m_freem(*mp); *mp = NULL; return (EIO); } if (sc->tx.queued + nsegs > TX_DESC_COUNT) { bus_dmamap_unload(sc->tx.buf_tag, map); return (ENOBUFS); } bus_dmamap_sync(sc->tx.buf_tag, map, BUS_DMASYNC_PREWRITE); flags = TX_FIR_DESC; status = 0; if ((m->m_pkthdr.csum_flags & CSUM_IP) != 0) { if ((m->m_pkthdr.csum_flags & (CSUM_TCP|CSUM_UDP)) != 0) csum_flags = TX_CHECKSUM_CTL_FULL; else csum_flags = TX_CHECKSUM_CTL_IP; flags |= (csum_flags << TX_CHECKSUM_CTL_SHIFT); } for (i = 0; i < nsegs; i++) { sc->tx.segs++; if (i == nsegs - 1) { flags |= TX_LAST_DESC; /* * Can only request TX completion * interrupt on last descriptor. 
*/ if (sc->tx.segs >= awg_tx_interval) { sc->tx.segs = 0; flags |= TX_INT_CTL; } } sc->tx.desc_ring[cur].addr = htole32((uint32_t)segs[i].ds_addr); sc->tx.desc_ring[cur].size = htole32(flags | segs[i].ds_len); sc->tx.desc_ring[cur].status = htole32(status); flags &= ~TX_FIR_DESC; /* * Setting of the valid bit in the first descriptor is * deferred until the whole chain is fully set up. */ status = TX_DESC_CTL; ++sc->tx.queued; cur = TX_NEXT(cur); } sc->tx.cur = cur; /* Store mapping and mbuf in the last segment */ last = TX_SKIP(cur, TX_DESC_COUNT - 1); sc->tx.buf_map[first].map = sc->tx.buf_map[last].map; sc->tx.buf_map[last].map = map; sc->tx.buf_map[last].mbuf = m; /* * The whole mbuf chain has been DMA mapped, * fix the first descriptor. */ sc->tx.desc_ring[first].status = htole32(TX_DESC_CTL); return (0); } static void awg_clean_txbuf(struct awg_softc *sc, int index) { struct awg_bufmap *bmap; --sc->tx.queued; bmap = &sc->tx.buf_map[index]; if (bmap->mbuf != NULL) { bus_dmamap_sync(sc->tx.buf_tag, bmap->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(sc->tx.buf_tag, bmap->map); m_freem(bmap->mbuf); bmap->mbuf = NULL; } } static void awg_setup_rxdesc(struct awg_softc *sc, int index, bus_addr_t paddr) { uint32_t status, size; status = RX_DESC_CTL; size = MCLBYTES - 1; sc->rx.desc_ring[index].addr = htole32((uint32_t)paddr); sc->rx.desc_ring[index].size = htole32(size); sc->rx.desc_ring[index].status = htole32(status); } static void awg_reuse_rxdesc(struct awg_softc *sc, int index) { sc->rx.desc_ring[index].status = htole32(RX_DESC_CTL); } static int awg_newbuf_rx(struct awg_softc *sc, int index) { struct mbuf *m; bus_dma_segment_t seg; bus_dmamap_t map; int nsegs; m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR); if (m == NULL) return (ENOBUFS); m->m_pkthdr.len = m->m_len = m->m_ext.ext_size; m_adj(m, ETHER_ALIGN); if (bus_dmamap_load_mbuf_sg(sc->rx.buf_tag, sc->rx.buf_spare_map, m, &seg, &nsegs, BUS_DMA_NOWAIT) != 0) { m_freem(m); return (ENOBUFS); } if (sc->rx.buf_map[index].mbuf != NULL) { bus_dmamap_sync(sc->rx.buf_tag, sc->rx.buf_map[index].map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(sc->rx.buf_tag, sc->rx.buf_map[index].map); } map = sc->rx.buf_map[index].map; sc->rx.buf_map[index].map = sc->rx.buf_spare_map; sc->rx.buf_spare_map = map; bus_dmamap_sync(sc->rx.buf_tag, sc->rx.buf_map[index].map, BUS_DMASYNC_PREREAD); sc->rx.buf_map[index].mbuf = m; awg_setup_rxdesc(sc, index, seg.ds_addr); return (0); } static void awg_start_locked(struct awg_softc *sc) { struct mbuf *m; uint32_t val; if_t ifp; int cnt, err; AWG_ASSERT_LOCKED(sc); if (!sc->link) return; ifp = sc->ifp; if ((if_getdrvflags(ifp) & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return; for (cnt = 0; ; cnt++) { m = if_dequeue(ifp); if (m == NULL) break; err = awg_encap(sc, &m); if (err != 0) { if (err == ENOBUFS) if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, 0); if (m != NULL) if_sendq_prepend(ifp, m); break; } if_bpfmtap(ifp, m); } if (cnt != 0) { bus_dmamap_sync(sc->tx.desc_tag, sc->tx.desc_map, BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE); /* Start and run TX DMA */ val = RD4(sc, EMAC_TX_CTL_1); WR4(sc, EMAC_TX_CTL_1, val | TX_DMA_START); } } static void awg_start(if_t ifp) { struct awg_softc *sc; sc = if_getsoftc(ifp); AWG_LOCK(sc); awg_start_locked(sc); AWG_UNLOCK(sc); } static void awg_tick(void *softc) { struct awg_softc *sc; struct mii_data *mii; if_t ifp; int link; sc = softc; ifp = sc->ifp; mii = device_get_softc(sc->miibus); AWG_ASSERT_LOCKED(sc); if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) == 0) return; link = 
sc->link; mii_tick(mii); if (sc->link && !link) awg_start_locked(sc); callout_reset(&sc->stat_ch, hz, awg_tick, sc); } /* Bit Reversal - http://aggregate.org/MAGIC/#Bit%20Reversal */ static uint32_t bitrev32(uint32_t x) { x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1)); x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2)); x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4)); x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8)); return (x >> 16) | (x << 16); } static void awg_setup_rxfilter(struct awg_softc *sc) { uint32_t val, crc, hashreg, hashbit, hash[2], machi, maclo; int mc_count, mcnt, i; uint8_t *eaddr, *mta; if_t ifp; AWG_ASSERT_LOCKED(sc); ifp = sc->ifp; val = 0; hash[0] = hash[1] = 0; mc_count = if_multiaddr_count(ifp, -1); if (if_getflags(ifp) & IFF_PROMISC) val |= DIS_ADDR_FILTER; else if (if_getflags(ifp) & IFF_ALLMULTI) { val |= RX_ALL_MULTICAST; hash[0] = hash[1] = ~0; } else if (mc_count > 0) { val |= HASH_MULTICAST; mta = malloc(sizeof(unsigned char) * ETHER_ADDR_LEN * mc_count, M_DEVBUF, M_NOWAIT); if (mta == NULL) { if_printf(ifp, "failed to allocate temporary multicast list\n"); return; } if_multiaddr_array(ifp, mta, &mcnt, mc_count); for (i = 0; i < mcnt; i++) { crc = ether_crc32_le(mta + (i * ETHER_ADDR_LEN), ETHER_ADDR_LEN) & 0x7f; crc = bitrev32(~crc) >> 26; hashreg = (crc >> 5); hashbit = (crc & 0x1f); hash[hashreg] |= (1 << hashbit); } free(mta, M_DEVBUF); } /* Write our unicast address */ eaddr = IF_LLADDR(ifp); machi = (eaddr[5] << 8) | eaddr[4]; maclo = (eaddr[3] << 24) | (eaddr[2] << 16) | (eaddr[1] << 8) | (eaddr[0] << 0); WR4(sc, EMAC_ADDR_HIGH(0), machi); WR4(sc, EMAC_ADDR_LOW(0), maclo); /* Multicast hash filters */ WR4(sc, EMAC_RX_HASH_0, hash[1]); WR4(sc, EMAC_RX_HASH_1, hash[0]); /* RX frame filter config */ WR4(sc, EMAC_RX_FRM_FLT, val); } static void awg_enable_intr(struct awg_softc *sc) { /* Enable interrupts */ WR4(sc, EMAC_INT_EN, RX_INT_EN | TX_INT_EN | TX_BUF_UA_INT_EN); } static void awg_disable_intr(struct awg_softc *sc) { /* Disable interrupts */ WR4(sc, EMAC_INT_EN, 0); } static void awg_init_locked(struct awg_softc *sc) { struct mii_data *mii; uint32_t val; if_t ifp; mii = device_get_softc(sc->miibus); ifp = sc->ifp; AWG_ASSERT_LOCKED(sc); if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) return; awg_setup_rxfilter(sc); /* Configure DMA burst length and priorities */ val = awg_burst_len << BASIC_CTL_BURST_LEN_SHIFT; if (awg_rx_tx_pri) val |= BASIC_CTL_RX_TX_PRI; WR4(sc, EMAC_BASIC_CTL_1, val); /* Enable interrupts */ #ifdef DEVICE_POLLING if ((if_getcapenable(ifp) & IFCAP_POLLING) == 0) awg_enable_intr(sc); else awg_disable_intr(sc); #else awg_enable_intr(sc); #endif /* Enable transmit DMA */ val = RD4(sc, EMAC_TX_CTL_1); WR4(sc, EMAC_TX_CTL_1, val | TX_DMA_EN | TX_MD | TX_NEXT_FRAME); /* Enable receive DMA */ val = RD4(sc, EMAC_RX_CTL_1); WR4(sc, EMAC_RX_CTL_1, val | RX_DMA_EN | RX_MD); /* Enable transmitter */ val = RD4(sc, EMAC_TX_CTL_0); WR4(sc, EMAC_TX_CTL_0, val | TX_EN); /* Enable receiver */ val = RD4(sc, EMAC_RX_CTL_0); WR4(sc, EMAC_RX_CTL_0, val | RX_EN | CHECK_CRC); if_setdrvflagbits(ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE); mii_mediachg(mii); callout_reset(&sc->stat_ch, hz, awg_tick, sc); } static void awg_init(void *softc) { struct awg_softc *sc; sc = softc; AWG_LOCK(sc); awg_init_locked(sc); AWG_UNLOCK(sc); } static void awg_stop(struct awg_softc *sc) { if_t ifp; uint32_t val; int i; AWG_ASSERT_LOCKED(sc); ifp = sc->ifp; callout_stop(&sc->stat_ch); /* Stop transmit DMA and flush data in the TX FIFO */ val = RD4(sc, 
EMAC_TX_CTL_1); val &= ~TX_DMA_EN; val |= FLUSH_TX_FIFO; WR4(sc, EMAC_TX_CTL_1, val); /* Disable transmitter */ val = RD4(sc, EMAC_TX_CTL_0); WR4(sc, EMAC_TX_CTL_0, val & ~TX_EN); /* Disable receiver */ val = RD4(sc, EMAC_RX_CTL_0); WR4(sc, EMAC_RX_CTL_0, val & ~RX_EN); /* Disable interrupts */ awg_disable_intr(sc); /* Disable transmit DMA */ val = RD4(sc, EMAC_TX_CTL_1); WR4(sc, EMAC_TX_CTL_1, val & ~TX_DMA_EN); /* Disable receive DMA */ val = RD4(sc, EMAC_RX_CTL_1); WR4(sc, EMAC_RX_CTL_1, val & ~RX_DMA_EN); sc->link = 0; /* Finish handling transmitted buffers */ awg_txeof(sc); /* Release any untransmitted buffers. */ for (i = sc->tx.next; sc->tx.queued > 0; i = TX_NEXT(i)) { val = le32toh(sc->tx.desc_ring[i].status); if ((val & TX_DESC_CTL) != 0) break; awg_clean_txbuf(sc, i); } sc->tx.next = i; for (; sc->tx.queued > 0; i = TX_NEXT(i)) { sc->tx.desc_ring[i].status = 0; awg_clean_txbuf(sc, i); } sc->tx.cur = sc->tx.next; bus_dmamap_sync(sc->tx.desc_tag, sc->tx.desc_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* Setup RX buffers for reuse */ bus_dmamap_sync(sc->rx.desc_tag, sc->rx.desc_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); for (i = sc->rx.cur; ; i = RX_NEXT(i)) { val = le32toh(sc->rx.desc_ring[i].status); if ((val & RX_DESC_CTL) != 0) break; awg_reuse_rxdesc(sc, i); } sc->rx.cur = i; bus_dmamap_sync(sc->rx.desc_tag, sc->rx.desc_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); if_setdrvflagbits(ifp, 0, IFF_DRV_RUNNING | IFF_DRV_OACTIVE); } static int awg_rxintr(struct awg_softc *sc) { if_t ifp; struct mbuf *m, *mh, *mt; int error, index, len, cnt, npkt; uint32_t status; ifp = sc->ifp; mh = mt = NULL; cnt = 0; npkt = 0; bus_dmamap_sync(sc->rx.desc_tag, sc->rx.desc_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); for (index = sc->rx.cur; ; index = RX_NEXT(index)) { status = le32toh(sc->rx.desc_ring[index].status); if ((status & RX_DESC_CTL) != 0) break; len = (status & RX_FRM_LEN) >> RX_FRM_LEN_SHIFT; if (len == 0) { if ((status & (RX_NO_ENOUGH_BUF_ERR | RX_OVERFLOW_ERR)) != 0) if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); awg_reuse_rxdesc(sc, index); continue; } m = sc->rx.buf_map[index].mbuf; error = awg_newbuf_rx(sc, index); if (error != 0) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); awg_reuse_rxdesc(sc, index); continue; } m->m_pkthdr.rcvif = ifp; m->m_pkthdr.len = len; m->m_len = len; if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); if ((if_getcapenable(ifp) & IFCAP_RXCSUM) != 0 && (status & RX_FRM_TYPE) != 0) { m->m_pkthdr.csum_flags = CSUM_IP_CHECKED; if ((status & RX_HEADER_ERR) == 0) m->m_pkthdr.csum_flags |= CSUM_IP_VALID; if ((status & RX_PAYLOAD_ERR) == 0) { m->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } } m->m_nextpkt = NULL; if (mh == NULL) mh = m; else mt->m_nextpkt = m; mt = m; ++cnt; ++npkt; if (cnt == awg_rx_batch) { AWG_UNLOCK(sc); if_input(ifp, mh); AWG_LOCK(sc); mh = mt = NULL; cnt = 0; } } if (index != sc->rx.cur) { bus_dmamap_sync(sc->rx.desc_tag, sc->rx.desc_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); } if (mh != NULL) { AWG_UNLOCK(sc); if_input(ifp, mh); AWG_LOCK(sc); } sc->rx.cur = index; return (npkt); } static void awg_txeof(struct awg_softc *sc) { struct emac_desc *desc; uint32_t status, size; if_t ifp; int i, prog; AWG_ASSERT_LOCKED(sc); bus_dmamap_sync(sc->tx.desc_tag, sc->tx.desc_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); ifp = sc->ifp; prog = 0; for (i = sc->tx.next; sc->tx.queued > 0; i = TX_NEXT(i)) { desc = &sc->tx.desc_ring[i]; status = le32toh(desc->status); if ((status & 
TX_DESC_CTL) != 0) break; size = le32toh(desc->size); if (size & TX_LAST_DESC) { if ((status & (TX_HEADER_ERR | TX_PAYLOAD_ERR)) != 0) if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); else if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); } prog++; awg_clean_txbuf(sc, i); } if (prog > 0) { sc->tx.next = i; if_setdrvflagbits(ifp, 0, IFF_DRV_OACTIVE); } } static void awg_intr(void *arg) { struct awg_softc *sc; uint32_t val; sc = arg; AWG_LOCK(sc); val = RD4(sc, EMAC_INT_STA); WR4(sc, EMAC_INT_STA, val); if (val & RX_INT) awg_rxintr(sc); if (val & TX_INT) awg_txeof(sc); if (val & (TX_INT | TX_BUF_UA_INT)) { if (!if_sendq_empty(sc->ifp)) awg_start_locked(sc); } AWG_UNLOCK(sc); } #ifdef DEVICE_POLLING static int awg_poll(if_t ifp, enum poll_cmd cmd, int count) { struct awg_softc *sc; uint32_t val; int rx_npkts; sc = if_getsoftc(ifp); rx_npkts = 0; AWG_LOCK(sc); if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) == 0) { AWG_UNLOCK(sc); return (0); } rx_npkts = awg_rxintr(sc); awg_txeof(sc); if (!if_sendq_empty(ifp)) awg_start_locked(sc); if (cmd == POLL_AND_CHECK_STATUS) { val = RD4(sc, EMAC_INT_STA); if (val != 0) WR4(sc, EMAC_INT_STA, val); } AWG_UNLOCK(sc); return (rx_npkts); } #endif static int awg_ioctl(if_t ifp, u_long cmd, caddr_t data) { struct awg_softc *sc; struct mii_data *mii; struct ifreq *ifr; int flags, mask, error; sc = if_getsoftc(ifp); mii = device_get_softc(sc->miibus); ifr = (struct ifreq *)data; error = 0; switch (cmd) { case SIOCSIFFLAGS: AWG_LOCK(sc); if (if_getflags(ifp) & IFF_UP) { if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { flags = if_getflags(ifp) ^ sc->if_flags; if ((flags & (IFF_PROMISC|IFF_ALLMULTI)) != 0) awg_setup_rxfilter(sc); } else awg_init_locked(sc); } else { if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) awg_stop(sc); } sc->if_flags = if_getflags(ifp); AWG_UNLOCK(sc); break; case SIOCADDMULTI: case SIOCDELMULTI: if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { AWG_LOCK(sc); awg_setup_rxfilter(sc); AWG_UNLOCK(sc); } break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: error = ifmedia_ioctl(ifp, ifr, &mii->mii_media, cmd); break; case SIOCSIFCAP: mask = ifr->ifr_reqcap ^ if_getcapenable(ifp); #ifdef DEVICE_POLLING if (mask & IFCAP_POLLING) { if ((ifr->ifr_reqcap & IFCAP_POLLING) != 0) { error = ether_poll_register(awg_poll, ifp); if (error != 0) break; AWG_LOCK(sc); awg_disable_intr(sc); if_setcapenablebit(ifp, IFCAP_POLLING, 0); AWG_UNLOCK(sc); } else { error = ether_poll_deregister(ifp); AWG_LOCK(sc); awg_enable_intr(sc); if_setcapenablebit(ifp, 0, IFCAP_POLLING); AWG_UNLOCK(sc); } } #endif if (mask & IFCAP_VLAN_MTU) if_togglecapenable(ifp, IFCAP_VLAN_MTU); if (mask & IFCAP_RXCSUM) if_togglecapenable(ifp, IFCAP_RXCSUM); if (mask & IFCAP_TXCSUM) if_togglecapenable(ifp, IFCAP_TXCSUM); if ((if_getcapenable(ifp) & IFCAP_TXCSUM) != 0) if_sethwassistbits(ifp, CSUM_IP | CSUM_UDP | CSUM_TCP, 0); else if_sethwassistbits(ifp, 0, CSUM_IP | CSUM_UDP | CSUM_TCP); break; default: error = ether_ioctl(ifp, cmd, data); break; } return (error); } static uint32_t syscon_read_emac_clk_reg(device_t dev) { struct awg_softc *sc; sc = device_get_softc(dev); if (sc->syscon != NULL) return (SYSCON_READ_4(sc->syscon, EMAC_CLK_REG)); else if (sc->res[_RES_SYSCON] != NULL) return (bus_read_4(sc->res[_RES_SYSCON], 0)); return (0); } static void syscon_write_emac_clk_reg(device_t dev, uint32_t val) { struct awg_softc *sc; sc = device_get_softc(dev); if (sc->syscon != NULL) SYSCON_WRITE_4(sc->syscon, EMAC_CLK_REG, val); else if (sc->res[_RES_SYSCON] != NULL) bus_write_4(sc->res[_RES_SYSCON], 0, val); } static phandle_t 
awg_get_phy_node(device_t dev) { phandle_t node; pcell_t phy_handle; node = ofw_bus_get_node(dev); if (OF_getencprop(node, "phy-handle", (void *)&phy_handle, sizeof(phy_handle)) <= 0) return (0); return (OF_node_from_xref(phy_handle)); } static bool awg_has_internal_phy(device_t dev) { phandle_t node, phy_node; node = ofw_bus_get_node(dev); /* Legacy binding */ if (OF_hasprop(node, "allwinner,use-internal-phy")) return (true); phy_node = awg_get_phy_node(dev); return (phy_node != 0 && ofw_bus_node_is_compatible(OF_parent(phy_node), "allwinner,sun8i-h3-mdio-internal") != 0); } static int awg_parse_delay(device_t dev, uint32_t *tx_delay, uint32_t *rx_delay) { phandle_t node; uint32_t delay; if (tx_delay == NULL || rx_delay == NULL) return (EINVAL); *tx_delay = *rx_delay = 0; node = ofw_bus_get_node(dev); if (OF_getencprop(node, "tx-delay", &delay, sizeof(delay)) >= 0) *tx_delay = delay; else if (OF_getencprop(node, "allwinner,tx-delay-ps", &delay, sizeof(delay)) >= 0) { if ((delay % 100) != 0) { device_printf(dev, "tx-delay-ps is not a multiple of 100\n"); return (EDOM); } *tx_delay = delay / 100; } if (*tx_delay > 7) { device_printf(dev, "tx-delay out of range\n"); return (ERANGE); } if (OF_getencprop(node, "rx-delay", &delay, sizeof(delay)) >= 0) *rx_delay = delay; else if (OF_getencprop(node, "allwinner,rx-delay-ps", &delay, sizeof(delay)) >= 0) { if ((delay % 100) != 0) { device_printf(dev, "rx-delay-ps is not within documented domain\n"); return (EDOM); } *rx_delay = delay / 100; } if (*rx_delay > 31) { device_printf(dev, "rx-delay out of range\n"); return (ERANGE); } return (0); } static int awg_setup_phy(device_t dev) { struct awg_softc *sc; clk_t clk_tx, clk_tx_parent; const char *tx_parent_name; char *phy_type; phandle_t node; uint32_t reg, tx_delay, rx_delay; int error; bool use_syscon; sc = device_get_softc(dev); node = ofw_bus_get_node(dev); use_syscon = false; if (OF_getprop_alloc(node, "phy-mode", (void **)&phy_type) == 0) return (0); if (sc->syscon != NULL || sc->res[_RES_SYSCON] != NULL) use_syscon = true; if (bootverbose) device_printf(dev, "PHY type: %s, conf mode: %s\n", phy_type, use_syscon ? "reg" : "clk"); if (use_syscon) { /* * Abstract away writing to syscon for devices like the pine64. * For the pine64, we get dtb from U-Boot and it still uses the * legacy setup of specifying syscon register in emac node * rather than as its own node and using an xref in emac. * These abstractions can go away once U-Boot dts is up-to-date. */ reg = syscon_read_emac_clk_reg(dev); reg &= ~(EMAC_CLK_PIT | EMAC_CLK_SRC | EMAC_CLK_RMII_EN); if (strncmp(phy_type, "rgmii", 5) == 0) reg |= EMAC_CLK_PIT_RGMII | EMAC_CLK_SRC_RGMII; else if (strcmp(phy_type, "rmii") == 0) reg |= EMAC_CLK_RMII_EN; else reg |= EMAC_CLK_PIT_MII | EMAC_CLK_SRC_MII; /* * Fail attach if we fail to parse either of the delay * parameters. If we don't have the proper delay to write to * syscon, then awg likely won't function properly anyways. * Lack of delay is not an error! */ error = awg_parse_delay(dev, &tx_delay, &rx_delay); if (error != 0) goto fail; /* Default to 0 and we'll increase it if we need to. 
*/ reg &= ~(EMAC_CLK_ETXDC | EMAC_CLK_ERXDC); if (tx_delay > 0) reg |= (tx_delay << EMAC_CLK_ETXDC_SHIFT); if (rx_delay > 0) reg |= (rx_delay << EMAC_CLK_ERXDC_SHIFT); if (sc->type == EMAC_H3) { if (awg_has_internal_phy(dev)) { reg |= EMAC_CLK_EPHY_SELECT; reg &= ~EMAC_CLK_EPHY_SHUTDOWN; if (OF_hasprop(node, "allwinner,leds-active-low")) reg |= EMAC_CLK_EPHY_LED_POL; else reg &= ~EMAC_CLK_EPHY_LED_POL; /* Set internal PHY addr to 1 */ reg &= ~EMAC_CLK_EPHY_ADDR; reg |= (1 << EMAC_CLK_EPHY_ADDR_SHIFT); } else { reg &= ~EMAC_CLK_EPHY_SELECT; } } if (bootverbose) device_printf(dev, "EMAC clock: 0x%08x\n", reg); syscon_write_emac_clk_reg(dev, reg); } else { if (strncmp(phy_type, "rgmii", 5) == 0) tx_parent_name = "emac_int_tx"; else tx_parent_name = "mii_phy_tx"; /* Get the TX clock */ error = clk_get_by_ofw_name(dev, 0, "tx", &clk_tx); if (error != 0) { device_printf(dev, "cannot get tx clock\n"); goto fail; } /* Find the desired parent clock based on phy-mode property */ error = clk_get_by_name(dev, tx_parent_name, &clk_tx_parent); if (error != 0) { device_printf(dev, "cannot get clock '%s'\n", tx_parent_name); goto fail; } /* Set TX clock parent */ error = clk_set_parent_by_clk(clk_tx, clk_tx_parent); if (error != 0) { device_printf(dev, "cannot set tx clock parent\n"); goto fail; } /* Enable TX clock */ error = clk_enable(clk_tx); if (error != 0) { device_printf(dev, "cannot enable tx clock\n"); goto fail; } } error = 0; fail: OF_prop_free(phy_type); return (error); } static int awg_setup_extres(device_t dev) { struct awg_softc *sc; phandle_t node, phy_node; hwreset_t rst_ahb, rst_ephy; clk_t clk_ahb, clk_ephy; regulator_t reg; uint64_t freq; int error, div; sc = device_get_softc(dev); rst_ahb = rst_ephy = NULL; clk_ahb = clk_ephy = NULL; reg = NULL; node = ofw_bus_get_node(dev); phy_node = awg_get_phy_node(dev); if (phy_node == 0 && OF_hasprop(node, "phy-handle")) { error = ENXIO; device_printf(dev, "cannot get phy handle\n"); goto fail; } /* Get AHB clock and reset resources */ error = hwreset_get_by_ofw_name(dev, 0, "stmmaceth", &rst_ahb); if (error != 0) error = hwreset_get_by_ofw_name(dev, 0, "ahb", &rst_ahb); if (error != 0) { device_printf(dev, "cannot get ahb reset\n"); goto fail; } if (hwreset_get_by_ofw_name(dev, 0, "ephy", &rst_ephy) != 0) if (phy_node == 0 || hwreset_get_by_ofw_idx(dev, phy_node, 0, &rst_ephy) != 0) rst_ephy = NULL; error = clk_get_by_ofw_name(dev, 0, "stmmaceth", &clk_ahb); if (error != 0) error = clk_get_by_ofw_name(dev, 0, "ahb", &clk_ahb); if (error != 0) { device_printf(dev, "cannot get ahb clock\n"); goto fail; } if (clk_get_by_ofw_name(dev, 0, "ephy", &clk_ephy) != 0) if (phy_node == 0 || clk_get_by_ofw_index(dev, phy_node, 0, &clk_ephy) != 0) clk_ephy = NULL; if (OF_hasprop(node, "syscon") && syscon_get_by_ofw_property(dev, node, "syscon", &sc->syscon) != 0) { device_printf(dev, "cannot get syscon driver handle\n"); goto fail; } /* Configure PHY for MII or RGMII mode */ if (awg_setup_phy(dev) != 0) goto fail; /* Enable clocks */ error = clk_enable(clk_ahb); if (error != 0) { device_printf(dev, "cannot enable ahb clock\n"); goto fail; } if (clk_ephy != NULL) { error = clk_enable(clk_ephy); if (error != 0) { device_printf(dev, "cannot enable ephy clock\n"); goto fail; } } /* De-assert reset */ error = hwreset_deassert(rst_ahb); if (error != 0) { device_printf(dev, "cannot de-assert ahb reset\n"); goto fail; } if (rst_ephy != NULL) { /* * The ephy reset is left de-asserted by U-Boot. 
Assert it * here to make sure that we're in a known good state going * into the PHY reset. */ hwreset_assert(rst_ephy); error = hwreset_deassert(rst_ephy); if (error != 0) { device_printf(dev, "cannot de-assert ephy reset\n"); goto fail; } } /* Enable PHY regulator if applicable */ if (regulator_get_by_ofw_property(dev, 0, "phy-supply", ®) == 0) { error = regulator_enable(reg); if (error != 0) { device_printf(dev, "cannot enable PHY regulator\n"); goto fail; } } /* Determine MDC clock divide ratio based on AHB clock */ error = clk_get_freq(clk_ahb, &freq); if (error != 0) { device_printf(dev, "cannot get AHB clock frequency\n"); goto fail; } div = freq / MDIO_FREQ; if (div <= 16) sc->mdc_div_ratio_m = MDC_DIV_RATIO_M_16; else if (div <= 32) sc->mdc_div_ratio_m = MDC_DIV_RATIO_M_32; else if (div <= 64) sc->mdc_div_ratio_m = MDC_DIV_RATIO_M_64; else if (div <= 128) sc->mdc_div_ratio_m = MDC_DIV_RATIO_M_128; else { device_printf(dev, "cannot determine MDC clock divide ratio\n"); error = ENXIO; goto fail; } if (bootverbose) device_printf(dev, "AHB frequency %ju Hz, MDC div: 0x%x\n", (uintmax_t)freq, sc->mdc_div_ratio_m); return (0); fail: if (reg != NULL) regulator_release(reg); if (clk_ephy != NULL) clk_release(clk_ephy); if (clk_ahb != NULL) clk_release(clk_ahb); if (rst_ephy != NULL) hwreset_release(rst_ephy); if (rst_ahb != NULL) hwreset_release(rst_ahb); return (error); } static void awg_get_eaddr(device_t dev, uint8_t *eaddr) { struct awg_softc *sc; uint32_t maclo, machi, rnd; u_char rootkey[16]; uint32_t rootkey_size; sc = device_get_softc(dev); machi = RD4(sc, EMAC_ADDR_HIGH(0)) & 0xffff; maclo = RD4(sc, EMAC_ADDR_LOW(0)); rootkey_size = sizeof(rootkey); if (maclo == 0xffffffff && machi == 0xffff) { /* MAC address in hardware is invalid, create one */ if (aw_sid_get_fuse(AW_SID_FUSE_ROOTKEY, rootkey, &rootkey_size) == 0 && (rootkey[3] | rootkey[12] | rootkey[13] | rootkey[14] | rootkey[15]) != 0) { /* MAC address is derived from the root key in SID */ maclo = (rootkey[13] << 24) | (rootkey[12] << 16) | (rootkey[3] << 8) | 0x02; machi = (rootkey[15] << 8) | rootkey[14]; } else { /* Create one */ rnd = arc4random(); maclo = 0x00f2 | (rnd & 0xffff0000); machi = rnd & 0xffff; } } eaddr[0] = maclo & 0xff; eaddr[1] = (maclo >> 8) & 0xff; eaddr[2] = (maclo >> 16) & 0xff; eaddr[3] = (maclo >> 24) & 0xff; eaddr[4] = machi & 0xff; eaddr[5] = (machi >> 8) & 0xff; } #ifdef AWG_DEBUG static void awg_dump_regs(device_t dev) { static const struct { const char *name; u_int reg; } regs[] = { { "BASIC_CTL_0", EMAC_BASIC_CTL_0 }, { "BASIC_CTL_1", EMAC_BASIC_CTL_1 }, { "INT_STA", EMAC_INT_STA }, { "INT_EN", EMAC_INT_EN }, { "TX_CTL_0", EMAC_TX_CTL_0 }, { "TX_CTL_1", EMAC_TX_CTL_1 }, { "TX_FLOW_CTL", EMAC_TX_FLOW_CTL }, { "TX_DMA_LIST", EMAC_TX_DMA_LIST }, { "RX_CTL_0", EMAC_RX_CTL_0 }, { "RX_CTL_1", EMAC_RX_CTL_1 }, { "RX_DMA_LIST", EMAC_RX_DMA_LIST }, { "RX_FRM_FLT", EMAC_RX_FRM_FLT }, { "RX_HASH_0", EMAC_RX_HASH_0 }, { "RX_HASH_1", EMAC_RX_HASH_1 }, { "MII_CMD", EMAC_MII_CMD }, { "ADDR_HIGH0", EMAC_ADDR_HIGH(0) }, { "ADDR_LOW0", EMAC_ADDR_LOW(0) }, { "TX_DMA_STA", EMAC_TX_DMA_STA }, { "TX_DMA_CUR_DESC", EMAC_TX_DMA_CUR_DESC }, { "TX_DMA_CUR_BUF", EMAC_TX_DMA_CUR_BUF }, { "RX_DMA_STA", EMAC_RX_DMA_STA }, { "RX_DMA_CUR_DESC", EMAC_RX_DMA_CUR_DESC }, { "RX_DMA_CUR_BUF", EMAC_RX_DMA_CUR_BUF }, { "RGMII_STA", EMAC_RGMII_STA }, }; struct awg_softc *sc; unsigned int n; sc = device_get_softc(dev); for (n = 0; n < nitems(regs); n++) device_printf(dev, " %-20s %08x\n", regs[n].name, RD4(sc, regs[n].reg)); } 
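/*
 * Worked example for the MDC divider selection in awg_setup_extres() above
 * (illustrative only, and assuming MDIO_FREQ is the usual 2.5 MHz MDC
 * target): a 300 MHz AHB clock gives div = 300000000 / 2500000 = 120, which
 * falls into the "<= 128" bucket, so MDC_DIV_RATIO_M_128 is programmed.
 */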
#endif #define GPIO_ACTIVE_LOW 1 static int awg_phy_reset(device_t dev) { pcell_t gpio_prop[4], delay_prop[3]; phandle_t node, gpio_node; device_t gpio; uint32_t pin, flags; uint32_t pin_value; node = ofw_bus_get_node(dev); if (OF_getencprop(node, "allwinner,reset-gpio", gpio_prop, sizeof(gpio_prop)) <= 0) return (0); if (OF_getencprop(node, "allwinner,reset-delays-us", delay_prop, sizeof(delay_prop)) <= 0) return (ENXIO); gpio_node = OF_node_from_xref(gpio_prop[0]); if ((gpio = OF_device_from_xref(gpio_prop[0])) == NULL) return (ENXIO); if (GPIO_MAP_GPIOS(gpio, node, gpio_node, nitems(gpio_prop) - 1, gpio_prop + 1, &pin, &flags) != 0) return (ENXIO); pin_value = GPIO_PIN_LOW; if (OF_hasprop(node, "allwinner,reset-active-low")) pin_value = GPIO_PIN_HIGH; if (flags & GPIO_ACTIVE_LOW) pin_value = !pin_value; GPIO_PIN_SETFLAGS(gpio, pin, GPIO_PIN_OUTPUT); GPIO_PIN_SET(gpio, pin, pin_value); DELAY(delay_prop[0]); GPIO_PIN_SET(gpio, pin, !pin_value); DELAY(delay_prop[1]); GPIO_PIN_SET(gpio, pin, pin_value); DELAY(delay_prop[2]); return (0); } static int awg_reset(device_t dev) { struct awg_softc *sc; int retry; sc = device_get_softc(dev); /* Reset PHY if necessary */ if (awg_phy_reset(dev) != 0) { device_printf(dev, "failed to reset PHY\n"); return (ENXIO); } /* Soft reset all registers and logic */ WR4(sc, EMAC_BASIC_CTL_1, BASIC_CTL_SOFT_RST); /* Wait for soft reset bit to self-clear */ for (retry = SOFT_RST_RETRY; retry > 0; retry--) { if ((RD4(sc, EMAC_BASIC_CTL_1) & BASIC_CTL_SOFT_RST) == 0) break; DELAY(10); } if (retry == 0) { device_printf(dev, "soft reset timed out\n"); #ifdef AWG_DEBUG awg_dump_regs(dev); #endif return (ETIMEDOUT); } return (0); } static void awg_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { if (error != 0) return; *(bus_addr_t *)arg = segs[0].ds_addr; } static int awg_setup_dma(device_t dev) { struct awg_softc *sc; int error, i; sc = device_get_softc(dev); /* Setup TX ring */ error = bus_dma_tag_create( bus_get_dma_tag(dev), /* Parent tag */ DESC_ALIGN, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ TX_DESC_SIZE, 1, /* maxsize, nsegs */ TX_DESC_SIZE, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockarg */ &sc->tx.desc_tag); if (error != 0) { device_printf(dev, "cannot create TX descriptor ring tag\n"); return (error); } error = bus_dmamem_alloc(sc->tx.desc_tag, (void **)&sc->tx.desc_ring, BUS_DMA_COHERENT | BUS_DMA_WAITOK | BUS_DMA_ZERO, &sc->tx.desc_map); if (error != 0) { device_printf(dev, "cannot allocate TX descriptor ring\n"); return (error); } error = bus_dmamap_load(sc->tx.desc_tag, sc->tx.desc_map, sc->tx.desc_ring, TX_DESC_SIZE, awg_dmamap_cb, &sc->tx.desc_ring_paddr, 0); if (error != 0) { device_printf(dev, "cannot load TX descriptor ring\n"); return (error); } for (i = 0; i < TX_DESC_COUNT; i++) sc->tx.desc_ring[i].next = htole32(sc->tx.desc_ring_paddr + DESC_OFF(TX_NEXT(i))); error = bus_dma_tag_create( bus_get_dma_tag(dev), /* Parent tag */ 1, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MCLBYTES, TX_MAX_SEGS, /* maxsize, nsegs */ MCLBYTES, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockarg */ &sc->tx.buf_tag); if (error != 0) { device_printf(dev, "cannot create TX buffer tag\n"); return (error); } sc->tx.queued = 0; for (i = 0; i < TX_DESC_COUNT; i++) { error = bus_dmamap_create(sc->tx.buf_tag, 0, &sc->tx.buf_map[i].map); 
if (error != 0) { device_printf(dev, "cannot create TX buffer map\n"); return (error); } } /* Setup RX ring */ error = bus_dma_tag_create( bus_get_dma_tag(dev), /* Parent tag */ DESC_ALIGN, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ RX_DESC_SIZE, 1, /* maxsize, nsegs */ RX_DESC_SIZE, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockarg */ &sc->rx.desc_tag); if (error != 0) { device_printf(dev, "cannot create RX descriptor ring tag\n"); return (error); } error = bus_dmamem_alloc(sc->rx.desc_tag, (void **)&sc->rx.desc_ring, BUS_DMA_COHERENT | BUS_DMA_WAITOK | BUS_DMA_ZERO, &sc->rx.desc_map); if (error != 0) { device_printf(dev, "cannot allocate RX descriptor ring\n"); return (error); } error = bus_dmamap_load(sc->rx.desc_tag, sc->rx.desc_map, sc->rx.desc_ring, RX_DESC_SIZE, awg_dmamap_cb, &sc->rx.desc_ring_paddr, 0); if (error != 0) { device_printf(dev, "cannot load RX descriptor ring\n"); return (error); } error = bus_dma_tag_create( bus_get_dma_tag(dev), /* Parent tag */ 1, 0, /* alignment, boundary */ BUS_SPACE_MAXADDR_32BIT, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MCLBYTES, 1, /* maxsize, nsegs */ MCLBYTES, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockarg */ &sc->rx.buf_tag); if (error != 0) { device_printf(dev, "cannot create RX buffer tag\n"); return (error); } error = bus_dmamap_create(sc->rx.buf_tag, 0, &sc->rx.buf_spare_map); if (error != 0) { device_printf(dev, "cannot create RX buffer spare map\n"); return (error); } for (i = 0; i < RX_DESC_COUNT; i++) { sc->rx.desc_ring[i].next = htole32(sc->rx.desc_ring_paddr + DESC_OFF(RX_NEXT(i))); error = bus_dmamap_create(sc->rx.buf_tag, 0, &sc->rx.buf_map[i].map); if (error != 0) { device_printf(dev, "cannot create RX buffer map\n"); return (error); } sc->rx.buf_map[i].mbuf = NULL; error = awg_newbuf_rx(sc, i); if (error != 0) { device_printf(dev, "cannot create RX buffer\n"); return (error); } } bus_dmamap_sync(sc->rx.desc_tag, sc->rx.desc_map, BUS_DMASYNC_PREWRITE); /* Write transmit and receive descriptor base address registers */ WR4(sc, EMAC_TX_DMA_LIST, sc->tx.desc_ring_paddr); WR4(sc, EMAC_RX_DMA_LIST, sc->rx.desc_ring_paddr); return (0); } static int awg_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "Allwinner Gigabit Ethernet"); return (BUS_PROBE_DEFAULT); } static int awg_attach(device_t dev) { uint8_t eaddr[ETHER_ADDR_LEN]; struct awg_softc *sc; int error; sc = device_get_softc(dev); sc->dev = dev; sc->type = ofw_bus_search_compatible(dev, compat_data)->ocd_data; if (bus_alloc_resources(dev, awg_spec, sc->res) != 0) { device_printf(dev, "cannot allocate resources for device\n"); return (ENXIO); } mtx_init(&sc->mtx, device_get_nameunit(dev), MTX_NETWORK_LOCK, MTX_DEF); callout_init_mtx(&sc->stat_ch, &sc->mtx, 0); TASK_INIT(&sc->link_task, 0, awg_link_task, sc); /* Setup clocks and regulators */ error = awg_setup_extres(dev); if (error != 0) return (error); /* Read MAC address before resetting the chip */ awg_get_eaddr(dev, eaddr); /* Soft reset EMAC core */ error = awg_reset(dev); if (error != 0) return (error); /* Setup DMA descriptors */ error = awg_setup_dma(dev); if (error != 0) return (error); /* Install interrupt handler */ error = bus_setup_intr(dev, sc->res[_RES_IRQ], INTR_TYPE_NET | INTR_MPSAFE, NULL, awg_intr, sc, &sc->ih); if (error 
!= 0) { device_printf(dev, "cannot setup interrupt handler\n"); return (error); } /* Setup ethernet interface */ sc->ifp = if_alloc(IFT_ETHER); if_setsoftc(sc->ifp, sc); if_initname(sc->ifp, device_get_name(dev), device_get_unit(dev)); if_setflags(sc->ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST); if_setstartfn(sc->ifp, awg_start); if_setioctlfn(sc->ifp, awg_ioctl); if_setinitfn(sc->ifp, awg_init); if_setsendqlen(sc->ifp, TX_DESC_COUNT - 1); if_setsendqready(sc->ifp); if_sethwassist(sc->ifp, CSUM_IP | CSUM_UDP | CSUM_TCP); if_setcapabilities(sc->ifp, IFCAP_VLAN_MTU | IFCAP_HWCSUM); if_setcapenable(sc->ifp, if_getcapabilities(sc->ifp)); #ifdef DEVICE_POLLING if_setcapabilitiesbit(sc->ifp, IFCAP_POLLING, 0); #endif /* Attach MII driver */ error = mii_attach(dev, &sc->miibus, sc->ifp, awg_media_change, awg_media_status, BMSR_DEFCAPMASK, MII_PHY_ANY, MII_OFFSET_ANY, MIIF_DOPAUSE); if (error != 0) { device_printf(dev, "cannot attach PHY\n"); return (error); } /* Attach ethernet interface */ ether_ifattach(sc->ifp, eaddr); return (0); } static device_method_t awg_methods[] = { /* Device interface */ DEVMETHOD(device_probe, awg_probe), DEVMETHOD(device_attach, awg_attach), /* MII interface */ DEVMETHOD(miibus_readreg, awg_miibus_readreg), DEVMETHOD(miibus_writereg, awg_miibus_writereg), DEVMETHOD(miibus_statchg, awg_miibus_statchg), DEVMETHOD_END }; static driver_t awg_driver = { "awg", awg_methods, sizeof(struct awg_softc), }; static devclass_t awg_devclass; DRIVER_MODULE(awg, simplebus, awg_driver, awg_devclass, 0, 0); DRIVER_MODULE(miibus, awg, miibus_driver, miibus_devclass, 0, 0); - MODULE_DEPEND(awg, ether, 1, 1, 1); MODULE_DEPEND(awg, miibus, 1, 1, 1); +MODULE_DEPEND(awg, aw_sid, 1, 1, 1); +SIMPLEBUS_PNP_INFO(compat_data); Index: user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c =================================================================== --- user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c (revision 346925) +++ user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c (revision 346926) @@ -1,1364 +1,1366 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011, 2017 by Delphix. All rights reserved. * Copyright (c) 2013 Steven Hartland. All rights reserved. * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved. * Copyright (c) 2014 Integros [integros.com] * Copyright 2016 Nexenta Systems, Inc. All rights reserved. 
*/ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(__FreeBSD__) && defined(_KERNEL) #include #include #endif /* * ZFS Write Throttle * ------------------ * * ZFS must limit the rate of incoming writes to the rate at which it is able * to sync data modifications to the backend storage. Throttling by too much * creates an artificial limit; throttling by too little can only be sustained * for short periods and would lead to highly lumpy performance. On a per-pool * basis, ZFS tracks the amount of modified (dirty) data. As operations change * data, the amount of dirty data increases; as ZFS syncs out data, the amount * of dirty data decreases. When the amount of dirty data exceeds a * predetermined threshold further modifications are blocked until the amount * of dirty data decreases (as data is synced out). * * The limit on dirty data is tunable, and should be adjusted according to * both the IO capacity and available memory of the system. The larger the * window, the more ZFS is able to aggregate and amortize metadata (and data) * changes. However, memory is a limited resource, and allowing for more dirty * data comes at the cost of keeping other useful data in memory (for example * ZFS data cached by the ARC). * * Implementation * * As buffers are modified dsl_pool_willuse_space() increments both the per- * txg (dp_dirty_pertxg[]) and poolwide (dp_dirty_total) accounting of * dirty space used; dsl_pool_dirty_space() decrements those values as data * is synced out from dsl_pool_sync(). While only the poolwide value is * relevant, the per-txg value is useful for debugging. The tunable * zfs_dirty_data_max determines the dirty space limit. Once that value is * exceeded, new writes are halted until space frees up. * * The zfs_dirty_data_sync tunable dictates the threshold at which we * ensure that there is a txg syncing (see the comment in txg.c for a full * description of transaction group stages). * * The IO scheduler uses both the dirty space limit and current amount of * dirty data as inputs. Those values affect the number of concurrent IOs ZFS * issues. See the comment in vdev_queue.c for details of the IO scheduler. * * The delay is also calculated based on the amount of dirty data. See the * comment above dmu_tx_delay() for details. */ /* * zfs_dirty_data_max will be set to zfs_dirty_data_max_percent% of all memory, * capped at zfs_dirty_data_max_max. It can also be overridden in /etc/system. */ uint64_t zfs_dirty_data_max; uint64_t zfs_dirty_data_max_max = 4ULL * 1024 * 1024 * 1024; int zfs_dirty_data_max_percent = 10; /* * If there is at least this much dirty data, push out a txg. */ uint64_t zfs_dirty_data_sync = 64 * 1024 * 1024; /* * Once there is this amount of dirty data, the dmu_tx_delay() will kick in * and delay each transaction. * This value should be >= zfs_vdev_async_write_active_max_dirty_percent. */ int zfs_delay_min_dirty_percent = 60; /* * This controls how quickly the delay approaches infinity. * Larger values cause it to delay more for a given amount of dirty data. * Therefore larger values will cause there to be less dirty data for a * given throughput. * * For the smoothest delay, this value should be about 1 billion divided * by the maximum number of operations per second. This will smoothly * handle between 10x and 1/10th this number. 
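*
* As a worked example, the default initializer below targets roughly 2000
* operations per second: 1,000,000,000 / 2000 = 500,000.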
* * Note: zfs_delay_scale * zfs_dirty_data_max must be < 2^64, due to the * multiply in dmu_tx_delay(). */ uint64_t zfs_delay_scale = 1000 * 1000 * 1000 / 2000; /* * This determines the number of threads used by the dp_sync_taskq. */ int zfs_sync_taskq_batch_pct = 75; /* * These tunables determine the behavior of how zil_itxg_clean() is * called via zil_clean() in the context of spa_sync(). When an itxg * list needs to be cleaned, TQ_NOSLEEP will be used when dispatching. * If the dispatch fails, the call to zil_itxg_clean() will occur * synchronously in the context of spa_sync(), which can negatively * impact the performance of spa_sync() (e.g. in the case of the itxg * list having a large number of itxs that needs to be cleaned). * * Thus, these tunables can be used to manipulate the behavior of the * taskq used by zil_clean(); they determine the number of taskq entries * that are pre-populated when the taskq is first created (via the * "zfs_zil_clean_taskq_minalloc" tunable) and the maximum number of * taskq entries that are cached after an on-demand allocation (via the * "zfs_zil_clean_taskq_maxalloc"). * * The idea being, we want to try reasonably hard to ensure there will * already be a taskq entry pre-allocated by the time that it is needed * by zil_clean(). This way, we can avoid the possibility of an * on-demand allocation of a new taskq entry from failing, which would * result in zil_itxg_clean() being called synchronously from zil_clean() * (which can adversely affect performance of spa_sync()). * * Additionally, the number of threads used by the taskq can be * configured via the "zfs_zil_clean_taskq_nthr_pct" tunable. */ int zfs_zil_clean_taskq_nthr_pct = 100; int zfs_zil_clean_taskq_minalloc = 1024; int zfs_zil_clean_taskq_maxalloc = 1024 * 1024; #if defined(__FreeBSD__) && defined(_KERNEL) extern int zfs_vdev_async_write_active_max_dirty_percent; SYSCTL_DECL(_vfs_zfs); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, dirty_data_max, CTLFLAG_RWTUN, &zfs_dirty_data_max, 0, "The maximum amount of dirty data in bytes after which new writes are " "halted until space becomes available"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, dirty_data_max_max, CTLFLAG_RDTUN, &zfs_dirty_data_max_max, 0, "The absolute cap on dirty_data_max when auto calculating"); static int sysctl_zfs_dirty_data_max_percent(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vfs_zfs, OID_AUTO, dirty_data_max_percent, CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RWTUN, 0, sizeof(int), sysctl_zfs_dirty_data_max_percent, "I", "The percent of physical memory used to auto calculate dirty_data_max"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, dirty_data_sync, CTLFLAG_RWTUN, &zfs_dirty_data_sync, 0, "Force a txg if the number of dirty buffer bytes exceed this value"); static int sysctl_zfs_delay_min_dirty_percent(SYSCTL_HANDLER_ARGS); /* No zfs_delay_min_dirty_percent tunable due to limit requirements */ SYSCTL_PROC(_vfs_zfs, OID_AUTO, delay_min_dirty_percent, CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, 0, sizeof(int), sysctl_zfs_delay_min_dirty_percent, "I", "The limit of outstanding dirty data before transactions are delayed"); static int sysctl_zfs_delay_scale(SYSCTL_HANDLER_ARGS); /* No zfs_delay_scale tunable due to limit requirements */ SYSCTL_PROC(_vfs_zfs, OID_AUTO, delay_scale, CTLTYPE_U64 | CTLFLAG_MPSAFE | CTLFLAG_RW, 0, sizeof(uint64_t), sysctl_zfs_delay_scale, "QU", "Controls how quickly the delay approaches infinity"); static int sysctl_zfs_dirty_data_max_percent(SYSCTL_HANDLER_ARGS) { int val, err; val = zfs_dirty_data_max_percent; err = sysctl_handle_int(oidp, 
&val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val < 0 || val > 100) return (EINVAL); zfs_dirty_data_max_percent = val; return (0); } static int sysctl_zfs_delay_min_dirty_percent(SYSCTL_HANDLER_ARGS) { int val, err; val = zfs_delay_min_dirty_percent; err = sysctl_handle_int(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val < zfs_vdev_async_write_active_max_dirty_percent) return (EINVAL); zfs_delay_min_dirty_percent = val; return (0); } static int sysctl_zfs_delay_scale(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = zfs_delay_scale; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val > UINT64_MAX / zfs_dirty_data_max) return (EINVAL); zfs_delay_scale = val; return (0); } #endif int dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **ddp) { uint64_t obj; int err; err = zap_lookup(dp->dp_meta_objset, dsl_dir_phys(dp->dp_root_dir)->dd_child_dir_zapobj, name, sizeof (obj), 1, &obj); if (err) return (err); return (dsl_dir_hold_obj(dp, obj, name, dp, ddp)); } static dsl_pool_t * dsl_pool_open_impl(spa_t *spa, uint64_t txg) { dsl_pool_t *dp; blkptr_t *bp = spa_get_rootblkptr(spa); dp = kmem_zalloc(sizeof (dsl_pool_t), KM_SLEEP); dp->dp_spa = spa; dp->dp_meta_rootbp = *bp; rrw_init(&dp->dp_config_rwlock, B_TRUE); txg_init(dp, txg); txg_list_create(&dp->dp_dirty_datasets, spa, offsetof(dsl_dataset_t, ds_dirty_link)); txg_list_create(&dp->dp_dirty_zilogs, spa, offsetof(zilog_t, zl_dirty_link)); txg_list_create(&dp->dp_dirty_dirs, spa, offsetof(dsl_dir_t, dd_dirty_link)); txg_list_create(&dp->dp_sync_tasks, spa, offsetof(dsl_sync_task_t, dst_node)); txg_list_create(&dp->dp_early_sync_tasks, spa, offsetof(dsl_sync_task_t, dst_node)); dp->dp_sync_taskq = taskq_create("dp_sync_taskq", zfs_sync_taskq_batch_pct, minclsyspri, 1, INT_MAX, TASKQ_THREADS_CPU_PCT); dp->dp_zil_clean_taskq = taskq_create("dp_zil_clean_taskq", zfs_zil_clean_taskq_nthr_pct, minclsyspri, zfs_zil_clean_taskq_minalloc, zfs_zil_clean_taskq_maxalloc, TASKQ_PREPOPULATE | TASKQ_THREADS_CPU_PCT); mutex_init(&dp->dp_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&dp->dp_spaceavail_cv, NULL, CV_DEFAULT, NULL); dp->dp_vnrele_taskq = taskq_create("zfs_vn_rele_taskq", 1, minclsyspri, 1, 4, 0); return (dp); } int dsl_pool_init(spa_t *spa, uint64_t txg, dsl_pool_t **dpp) { int err; dsl_pool_t *dp = dsl_pool_open_impl(spa, txg); err = dmu_objset_open_impl(spa, NULL, &dp->dp_meta_rootbp, &dp->dp_meta_objset); if (err != 0) dsl_pool_close(dp); else *dpp = dp; return (err); } int dsl_pool_open(dsl_pool_t *dp) { int err; dsl_dir_t *dd; dsl_dataset_t *ds; uint64_t obj; rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1, &dp->dp_root_dir_obj); if (err) goto out; err = dsl_dir_hold_obj(dp, dp->dp_root_dir_obj, NULL, dp, &dp->dp_root_dir); if (err) goto out; err = dsl_pool_open_special_dir(dp, MOS_DIR_NAME, &dp->dp_mos_dir); if (err) goto out; if (spa_version(dp->dp_spa) >= SPA_VERSION_ORIGIN) { err = dsl_pool_open_special_dir(dp, ORIGIN_DIR_NAME, &dd); if (err) goto out; err = dsl_dataset_hold_obj(dp, dsl_dir_phys(dd)->dd_head_dataset_obj, FTAG, &ds); if (err == 0) { err = dsl_dataset_hold_obj(dp, dsl_dataset_phys(ds)->ds_prev_snap_obj, dp, &dp->dp_origin_snap); dsl_dataset_rele(ds, FTAG); } dsl_dir_rele(dd, dp); if (err) goto out; } if (spa_version(dp->dp_spa) >= SPA_VERSION_DEADLISTS) { err = dsl_pool_open_special_dir(dp, 
FREE_DIR_NAME, &dp->dp_free_dir); if (err) goto out; err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj); if (err) goto out; VERIFY0(bpobj_open(&dp->dp_free_bpobj, dp->dp_meta_objset, obj)); } if (spa_feature_is_active(dp->dp_spa, SPA_FEATURE_OBSOLETE_COUNTS)) { err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_OBSOLETE_BPOBJ, sizeof (uint64_t), 1, &obj); if (err == 0) { VERIFY0(bpobj_open(&dp->dp_obsolete_bpobj, dp->dp_meta_objset, obj)); } else if (err == ENOENT) { /* * We might not have created the remap bpobj yet. */ err = 0; } else { goto out; } } /* * Note: errors ignored, because the these special dirs, used for * space accounting, are only created on demand. */ (void) dsl_pool_open_special_dir(dp, LEAK_DIR_NAME, &dp->dp_leak_dir); if (spa_feature_is_active(dp->dp_spa, SPA_FEATURE_ASYNC_DESTROY)) { err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_BPTREE_OBJ, sizeof (uint64_t), 1, &dp->dp_bptree_obj); if (err != 0) goto out; } if (spa_feature_is_active(dp->dp_spa, SPA_FEATURE_EMPTY_BPOBJ)) { err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_EMPTY_BPOBJ, sizeof (uint64_t), 1, &dp->dp_empty_bpobj); if (err != 0) goto out; } err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_TMP_USERREFS, sizeof (uint64_t), 1, &dp->dp_tmp_userrefs_obj); if (err == ENOENT) err = 0; if (err) goto out; err = dsl_scan_init(dp, dp->dp_tx.tx_open_txg); out: rrw_exit(&dp->dp_config_rwlock, FTAG); return (err); } void dsl_pool_close(dsl_pool_t *dp) { /* * Drop our references from dsl_pool_open(). * * Since we held the origin_snap from "syncing" context (which * includes pool-opening context), it actually only got a "ref" * and not a hold, so just drop that here. */ if (dp->dp_origin_snap != NULL) dsl_dataset_rele(dp->dp_origin_snap, dp); if (dp->dp_mos_dir != NULL) dsl_dir_rele(dp->dp_mos_dir, dp); if (dp->dp_free_dir != NULL) dsl_dir_rele(dp->dp_free_dir, dp); if (dp->dp_leak_dir != NULL) dsl_dir_rele(dp->dp_leak_dir, dp); if (dp->dp_root_dir != NULL) dsl_dir_rele(dp->dp_root_dir, dp); bpobj_close(&dp->dp_free_bpobj); bpobj_close(&dp->dp_obsolete_bpobj); /* undo the dmu_objset_open_impl(mos) from dsl_pool_open() */ if (dp->dp_meta_objset != NULL) dmu_objset_evict(dp->dp_meta_objset); txg_list_destroy(&dp->dp_dirty_datasets); txg_list_destroy(&dp->dp_dirty_zilogs); txg_list_destroy(&dp->dp_sync_tasks); txg_list_destroy(&dp->dp_early_sync_tasks); txg_list_destroy(&dp->dp_dirty_dirs); taskq_destroy(dp->dp_zil_clean_taskq); taskq_destroy(dp->dp_sync_taskq); /* * We can't set retry to TRUE since we're explicitly specifying * a spa to flush. This is good enough; any missed buffers for * this spa won't cause trouble, and they'll eventually fall * out of the ARC just like any other unused buffer. */ arc_flush(dp->dp_spa, FALSE); txg_fini(dp); dsl_scan_fini(dp); dmu_buf_user_evict_wait(); rrw_destroy(&dp->dp_config_rwlock); mutex_destroy(&dp->dp_lock); taskq_destroy(dp->dp_vnrele_taskq); - if (dp->dp_blkstats != NULL) + if (dp->dp_blkstats != NULL) { + mutex_destroy(&dp->dp_blkstats->zab_lock); kmem_free(dp->dp_blkstats, sizeof (zfs_all_blkstats_t)); + } kmem_free(dp, sizeof (dsl_pool_t)); } void dsl_pool_create_obsolete_bpobj(dsl_pool_t *dp, dmu_tx_t *tx) { uint64_t obj; /* * Currently, we only create the obsolete_bpobj where there are * indirect vdevs with referenced mappings. 
*/ ASSERT(spa_feature_is_active(dp->dp_spa, SPA_FEATURE_DEVICE_REMOVAL)); /* create and open the obsolete_bpobj */ obj = bpobj_alloc(dp->dp_meta_objset, SPA_OLD_MAXBLOCKSIZE, tx); VERIFY0(bpobj_open(&dp->dp_obsolete_bpobj, dp->dp_meta_objset, obj)); VERIFY0(zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_OBSOLETE_BPOBJ, sizeof (uint64_t), 1, &obj, tx)); spa_feature_incr(dp->dp_spa, SPA_FEATURE_OBSOLETE_COUNTS, tx); } void dsl_pool_destroy_obsolete_bpobj(dsl_pool_t *dp, dmu_tx_t *tx) { spa_feature_decr(dp->dp_spa, SPA_FEATURE_OBSOLETE_COUNTS, tx); VERIFY0(zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_OBSOLETE_BPOBJ, tx)); bpobj_free(dp->dp_meta_objset, dp->dp_obsolete_bpobj.bpo_object, tx); bpobj_close(&dp->dp_obsolete_bpobj); } dsl_pool_t * dsl_pool_create(spa_t *spa, nvlist_t *zplprops, uint64_t txg) { int err; dsl_pool_t *dp = dsl_pool_open_impl(spa, txg); dmu_tx_t *tx = dmu_tx_create_assigned(dp, txg); dsl_dataset_t *ds; uint64_t obj; rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); /* create and open the MOS (meta-objset) */ dp->dp_meta_objset = dmu_objset_create_impl(spa, NULL, &dp->dp_meta_rootbp, DMU_OST_META, tx); /* create the pool directory */ err = zap_create_claim(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_OT_OBJECT_DIRECTORY, DMU_OT_NONE, 0, tx); ASSERT0(err); /* Initialize scan structures */ VERIFY0(dsl_scan_init(dp, txg)); /* create and open the root dir */ dp->dp_root_dir_obj = dsl_dir_create_sync(dp, NULL, NULL, tx); VERIFY0(dsl_dir_hold_obj(dp, dp->dp_root_dir_obj, NULL, dp, &dp->dp_root_dir)); /* create and open the meta-objset dir */ (void) dsl_dir_create_sync(dp, dp->dp_root_dir, MOS_DIR_NAME, tx); VERIFY0(dsl_pool_open_special_dir(dp, MOS_DIR_NAME, &dp->dp_mos_dir)); if (spa_version(spa) >= SPA_VERSION_DEADLISTS) { /* create and open the free dir */ (void) dsl_dir_create_sync(dp, dp->dp_root_dir, FREE_DIR_NAME, tx); VERIFY0(dsl_pool_open_special_dir(dp, FREE_DIR_NAME, &dp->dp_free_dir)); /* create and open the free_bplist */ obj = bpobj_alloc(dp->dp_meta_objset, SPA_OLD_MAXBLOCKSIZE, tx); VERIFY(zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx) == 0); VERIFY0(bpobj_open(&dp->dp_free_bpobj, dp->dp_meta_objset, obj)); } if (spa_version(spa) >= SPA_VERSION_DSL_SCRUB) dsl_pool_create_origin(dp, tx); /* create the root dataset */ obj = dsl_dataset_create_sync_dd(dp->dp_root_dir, NULL, 0, tx); /* create the root objset */ VERIFY0(dsl_dataset_hold_obj(dp, obj, FTAG, &ds)); #ifdef _KERNEL { objset_t *os; rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG); os = dmu_objset_create_impl(dp->dp_spa, ds, dsl_dataset_get_blkptr(ds), DMU_OST_ZFS, tx); rrw_exit(&ds->ds_bp_rwlock, FTAG); zfs_create_fs(os, kcred, zplprops, tx); } #endif dsl_dataset_rele(ds, FTAG); dmu_tx_commit(tx); rrw_exit(&dp->dp_config_rwlock, FTAG); return (dp); } /* * Account for the meta-objset space in its placeholder dsl_dir. 
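* The deltas recorded here (under dp_lock) are folded into the $MOS dsl_dir
* from dsl_pool_sync(), since the MOS itself cannot be dirtied while it is
* being synced.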
*/ void dsl_pool_mos_diduse_space(dsl_pool_t *dp, int64_t used, int64_t comp, int64_t uncomp) { ASSERT3U(comp, ==, uncomp); /* it's all metadata */ mutex_enter(&dp->dp_lock); dp->dp_mos_used_delta += used; dp->dp_mos_compressed_delta += comp; dp->dp_mos_uncompressed_delta += uncomp; mutex_exit(&dp->dp_lock); } static void dsl_pool_sync_mos(dsl_pool_t *dp, dmu_tx_t *tx) { zio_t *zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED); dmu_objset_sync(dp->dp_meta_objset, zio, tx); VERIFY0(zio_wait(zio)); dprintf_bp(&dp->dp_meta_rootbp, "meta objset rootbp is %s", ""); spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp); } static void dsl_pool_dirty_delta(dsl_pool_t *dp, int64_t delta) { ASSERT(MUTEX_HELD(&dp->dp_lock)); if (delta < 0) ASSERT3U(-delta, <=, dp->dp_dirty_total); dp->dp_dirty_total += delta; /* * Note: we signal even when increasing dp_dirty_total. * This ensures forward progress -- each thread wakes the next waiter. */ if (dp->dp_dirty_total < zfs_dirty_data_max) cv_signal(&dp->dp_spaceavail_cv); } static boolean_t dsl_early_sync_task_verify(dsl_pool_t *dp, uint64_t txg) { spa_t *spa = dp->dp_spa; vdev_t *rvd = spa->spa_root_vdev; for (uint64_t c = 0; c < rvd->vdev_children; c++) { vdev_t *vd = rvd->vdev_child[c]; txg_list_t *tl = &vd->vdev_ms_list; metaslab_t *ms; for (ms = txg_list_head(tl, TXG_CLEAN(txg)); ms; ms = txg_list_next(tl, ms, TXG_CLEAN(txg))) { VERIFY(range_tree_is_empty(ms->ms_freeing)); VERIFY(range_tree_is_empty(ms->ms_checkpointing)); } } return (B_TRUE); } void dsl_pool_sync(dsl_pool_t *dp, uint64_t txg) { zio_t *zio; dmu_tx_t *tx; dsl_dir_t *dd; dsl_dataset_t *ds; objset_t *mos = dp->dp_meta_objset; list_t synced_datasets; list_create(&synced_datasets, sizeof (dsl_dataset_t), offsetof(dsl_dataset_t, ds_synced_link)); tx = dmu_tx_create_assigned(dp, txg); /* * Run all early sync tasks before writing out any dirty blocks. * For more info on early sync tasks see block comment in * dsl_early_sync_task(). */ if (!txg_list_empty(&dp->dp_early_sync_tasks, txg)) { dsl_sync_task_t *dst; ASSERT3U(spa_sync_pass(dp->dp_spa), ==, 1); while ((dst = txg_list_remove(&dp->dp_early_sync_tasks, txg)) != NULL) { ASSERT(dsl_early_sync_task_verify(dp, txg)); dsl_sync_task_sync(dst, tx); } ASSERT(dsl_early_sync_task_verify(dp, txg)); } /* * Write out all dirty blocks of dirty datasets. */ zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED); while ((ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) != NULL) { /* * We must not sync any non-MOS datasets twice, because * we may have taken a snapshot of them. However, we * may sync newly-created datasets on pass 2. */ ASSERT(!list_link_active(&ds->ds_synced_link)); list_insert_tail(&synced_datasets, ds); dsl_dataset_sync(ds, zio, tx); } VERIFY0(zio_wait(zio)); /* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now. */ dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg); /* * Update the long range free counter after * we're done syncing user data */ mutex_enter(&dp->dp_lock); ASSERT(spa_sync_pass(dp->dp_spa) == 1 || dp->dp_long_free_dirty_pertxg[txg & TXG_MASK] == 0); dp->dp_long_free_dirty_pertxg[txg & TXG_MASK] = 0; mutex_exit(&dp->dp_lock); /* * After the data blocks have been written (ensured by the zio_wait() * above), update the user/group space accounting. 
This happens * in tasks dispatched to dp_sync_taskq, so wait for them before * continuing. */ for (ds = list_head(&synced_datasets); ds != NULL; ds = list_next(&synced_datasets, ds)) { dmu_objset_do_userquota_updates(ds->ds_objset, tx); } taskq_wait(dp->dp_sync_taskq); /* * Sync the datasets again to push out the changes due to * userspace updates. This must be done before we process the * sync tasks, so that any snapshots will have the correct * user accounting information (and we won't get confused * about which blocks are part of the snapshot). */ zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED); while ((ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) != NULL) { ASSERT(list_link_active(&ds->ds_synced_link)); dmu_buf_rele(ds->ds_dbuf, ds); dsl_dataset_sync(ds, zio, tx); } VERIFY0(zio_wait(zio)); /* * Now that the datasets have been completely synced, we can * clean up our in-memory structures accumulated while syncing: * * - move dead blocks from the pending deadlist to the on-disk deadlist * - release hold from dsl_dataset_dirty() */ while ((ds = list_remove_head(&synced_datasets)) != NULL) { dsl_dataset_sync_done(ds, tx); } while ((dd = txg_list_remove(&dp->dp_dirty_dirs, txg)) != NULL) { dsl_dir_sync(dd, tx); } /* * The MOS's space is accounted for in the pool/$MOS * (dp_mos_dir). We can't modify the mos while we're syncing * it, so we remember the deltas and apply them here. */ if (dp->dp_mos_used_delta != 0 || dp->dp_mos_compressed_delta != 0 || dp->dp_mos_uncompressed_delta != 0) { dsl_dir_diduse_space(dp->dp_mos_dir, DD_USED_HEAD, dp->dp_mos_used_delta, dp->dp_mos_compressed_delta, dp->dp_mos_uncompressed_delta, tx); dp->dp_mos_used_delta = 0; dp->dp_mos_compressed_delta = 0; dp->dp_mos_uncompressed_delta = 0; } if (!multilist_is_empty(mos->os_dirty_dnodes[txg & TXG_MASK])) { dsl_pool_sync_mos(dp, tx); } /* * If we modify a dataset in the same txg that we want to destroy it, * its dsl_dir's dd_dbuf will be dirty, and thus have a hold on it. * dsl_dir_destroy_check() will fail if there are unexpected holds. * Therefore, we want to sync the MOS (thus syncing the dd_dbuf * and clearing the hold on it) before we process the sync_tasks. * The MOS data dirtied by the sync_tasks will be synced on the next * pass. */ if (!txg_list_empty(&dp->dp_sync_tasks, txg)) { dsl_sync_task_t *dst; /* * No more sync tasks should have been added while we * were syncing. */ ASSERT3U(spa_sync_pass(dp->dp_spa), ==, 1); while ((dst = txg_list_remove(&dp->dp_sync_tasks, txg)) != NULL) dsl_sync_task_sync(dst, tx); } dmu_tx_commit(tx); DTRACE_PROBE2(dsl_pool_sync__done, dsl_pool_t *dp, dp, uint64_t, txg); } void dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg) { zilog_t *zilog; while (zilog = txg_list_head(&dp->dp_dirty_zilogs, txg)) { dsl_dataset_t *ds = dmu_objset_ds(zilog->zl_os); /* * We don't remove the zilog from the dp_dirty_zilogs * list until after we've cleaned it. This ensures that * callers of zilog_is_dirty() receive an accurate * answer when they are racing with the spa sync thread. */ zil_clean(zilog, txg); (void) txg_list_remove_this(&dp->dp_dirty_zilogs, zilog, txg); ASSERT(!dmu_objset_is_dirty(zilog->zl_os, txg)); dmu_buf_rele(ds->ds_dbuf, zilog); } ASSERT(!dmu_objset_is_dirty(dp->dp_meta_objset, txg)); } /* * TRUE if the current thread is the tx_sync_thread or if we * are being called from SPA context during pool initialization. 
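* It also returns TRUE for the dp_sync_taskq worker threads (note the
* taskq_member() check below).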
*/ int dsl_pool_sync_context(dsl_pool_t *dp) { return (curthread == dp->dp_tx.tx_sync_thread || spa_is_initializing(dp->dp_spa) || taskq_member(dp->dp_sync_taskq, curthread)); } /* * This function returns the amount of allocatable space in the pool * minus whatever space is currently reserved by ZFS for specific * purposes. Specifically: * * 1] Any reserved SLOP space * 2] Any space used by the checkpoint * 3] Any space used for deferred frees * * The latter 2 are especially important because they are needed to * rectify the SPA's and DMU's different understanding of how much space * is used. Now the DMU is aware of that extra space tracked by the SPA * without having to maintain a separate special dir (e.g similar to * $MOS, $FREEING, and $LEAKED). * * Note: By deferred frees here, we mean the frees that were deferred * in spa_sync() after sync pass 1 (spa_deferred_bpobj), and not the * segments placed in ms_defer trees during metaslab_sync_done(). */ uint64_t dsl_pool_adjustedsize(dsl_pool_t *dp, zfs_space_check_t slop_policy) { spa_t *spa = dp->dp_spa; uint64_t space, resv, adjustedsize; uint64_t spa_deferred_frees = spa->spa_deferred_bpobj.bpo_phys->bpo_bytes; space = spa_get_dspace(spa) - spa_get_checkpoint_space(spa) - spa_deferred_frees; resv = spa_get_slop_space(spa); switch (slop_policy) { case ZFS_SPACE_CHECK_NORMAL: break; case ZFS_SPACE_CHECK_RESERVED: resv >>= 1; break; case ZFS_SPACE_CHECK_EXTRA_RESERVED: resv >>= 2; break; case ZFS_SPACE_CHECK_NONE: resv = 0; break; default: panic("invalid slop policy value: %d", slop_policy); break; } adjustedsize = (space >= resv) ? (space - resv) : 0; return (adjustedsize); } uint64_t dsl_pool_unreserved_space(dsl_pool_t *dp, zfs_space_check_t slop_policy) { uint64_t poolsize = dsl_pool_adjustedsize(dp, slop_policy); uint64_t deferred = metaslab_class_get_deferred(spa_normal_class(dp->dp_spa)); uint64_t quota = (poolsize >= deferred) ? (poolsize - deferred) : 0; return (quota); } boolean_t dsl_pool_need_dirty_delay(dsl_pool_t *dp) { uint64_t delay_min_bytes = zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100; boolean_t rv; mutex_enter(&dp->dp_lock); if (dp->dp_dirty_total > zfs_dirty_data_sync) txg_kick(dp); rv = (dp->dp_dirty_total > delay_min_bytes); mutex_exit(&dp->dp_lock); return (rv); } void dsl_pool_dirty_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx) { if (space > 0) { mutex_enter(&dp->dp_lock); dp->dp_dirty_pertxg[tx->tx_txg & TXG_MASK] += space; dsl_pool_dirty_delta(dp, space); mutex_exit(&dp->dp_lock); } } void dsl_pool_undirty_space(dsl_pool_t *dp, int64_t space, uint64_t txg) { ASSERT3S(space, >=, 0); if (space == 0) return; mutex_enter(&dp->dp_lock); if (dp->dp_dirty_pertxg[txg & TXG_MASK] < space) { /* XXX writing something we didn't dirty? 
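* Seldom-used paths (e.g. dbuf_undirty(), or rounding in
* dbuf_write_physdone(), as noted in dsl_pool_sync() above) can undirty more
* than was accounted for this txg, so clamp to the per-txg figure.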
*/ space = dp->dp_dirty_pertxg[txg & TXG_MASK]; } ASSERT3U(dp->dp_dirty_pertxg[txg & TXG_MASK], >=, space); dp->dp_dirty_pertxg[txg & TXG_MASK] -= space; ASSERT3U(dp->dp_dirty_total, >=, space); dsl_pool_dirty_delta(dp, -space); mutex_exit(&dp->dp_lock); } /* ARGSUSED */ static int upgrade_clones_cb(dsl_pool_t *dp, dsl_dataset_t *hds, void *arg) { dmu_tx_t *tx = arg; dsl_dataset_t *ds, *prev = NULL; int err; err = dsl_dataset_hold_obj(dp, hds->ds_object, FTAG, &ds); if (err) return (err); while (dsl_dataset_phys(ds)->ds_prev_snap_obj != 0) { err = dsl_dataset_hold_obj(dp, dsl_dataset_phys(ds)->ds_prev_snap_obj, FTAG, &prev); if (err) { dsl_dataset_rele(ds, FTAG); return (err); } if (dsl_dataset_phys(prev)->ds_next_snap_obj != ds->ds_object) break; dsl_dataset_rele(ds, FTAG); ds = prev; prev = NULL; } if (prev == NULL) { prev = dp->dp_origin_snap; /* * The $ORIGIN can't have any data, or the accounting * will be wrong. */ rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG); ASSERT0(dsl_dataset_phys(prev)->ds_bp.blk_birth); rrw_exit(&ds->ds_bp_rwlock, FTAG); /* The origin doesn't get attached to itself */ if (ds->ds_object == prev->ds_object) { dsl_dataset_rele(ds, FTAG); return (0); } dmu_buf_will_dirty(ds->ds_dbuf, tx); dsl_dataset_phys(ds)->ds_prev_snap_obj = prev->ds_object; dsl_dataset_phys(ds)->ds_prev_snap_txg = dsl_dataset_phys(prev)->ds_creation_txg; dmu_buf_will_dirty(ds->ds_dir->dd_dbuf, tx); dsl_dir_phys(ds->ds_dir)->dd_origin_obj = prev->ds_object; dmu_buf_will_dirty(prev->ds_dbuf, tx); dsl_dataset_phys(prev)->ds_num_children++; if (dsl_dataset_phys(ds)->ds_next_snap_obj == 0) { ASSERT(ds->ds_prev == NULL); VERIFY0(dsl_dataset_hold_obj(dp, dsl_dataset_phys(ds)->ds_prev_snap_obj, ds, &ds->ds_prev)); } } ASSERT3U(dsl_dir_phys(ds->ds_dir)->dd_origin_obj, ==, prev->ds_object); ASSERT3U(dsl_dataset_phys(ds)->ds_prev_snap_obj, ==, prev->ds_object); if (dsl_dataset_phys(prev)->ds_next_clones_obj == 0) { dmu_buf_will_dirty(prev->ds_dbuf, tx); dsl_dataset_phys(prev)->ds_next_clones_obj = zap_create(dp->dp_meta_objset, DMU_OT_NEXT_CLONES, DMU_OT_NONE, 0, tx); } VERIFY0(zap_add_int(dp->dp_meta_objset, dsl_dataset_phys(prev)->ds_next_clones_obj, ds->ds_object, tx)); dsl_dataset_rele(ds, FTAG); if (prev != dp->dp_origin_snap) dsl_dataset_rele(prev, FTAG); return (0); } void dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx) { ASSERT(dmu_tx_is_syncing(tx)); ASSERT(dp->dp_origin_snap != NULL); VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj, upgrade_clones_cb, tx, DS_FIND_CHILDREN | DS_FIND_SERIALIZE)); } /* ARGSUSED */ static int upgrade_dir_clones_cb(dsl_pool_t *dp, dsl_dataset_t *ds, void *arg) { dmu_tx_t *tx = arg; objset_t *mos = dp->dp_meta_objset; if (dsl_dir_phys(ds->ds_dir)->dd_origin_obj != 0) { dsl_dataset_t *origin; VERIFY0(dsl_dataset_hold_obj(dp, dsl_dir_phys(ds->ds_dir)->dd_origin_obj, FTAG, &origin)); if (dsl_dir_phys(origin->ds_dir)->dd_clones == 0) { dmu_buf_will_dirty(origin->ds_dir->dd_dbuf, tx); dsl_dir_phys(origin->ds_dir)->dd_clones = zap_create(mos, DMU_OT_DSL_CLONES, DMU_OT_NONE, 0, tx); } VERIFY0(zap_add_int(dp->dp_meta_objset, dsl_dir_phys(origin->ds_dir)->dd_clones, ds->ds_object, tx)); dsl_dataset_rele(origin, FTAG); } return (0); } void dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx) { ASSERT(dmu_tx_is_syncing(tx)); uint64_t obj; (void) dsl_dir_create_sync(dp, dp->dp_root_dir, FREE_DIR_NAME, tx); VERIFY0(dsl_pool_open_special_dir(dp, FREE_DIR_NAME, &dp->dp_free_dir)); /* * We can't use bpobj_alloc(), because spa_version() still * returns the old version, 
and we need a new-version bpobj with * subobj support. So call dmu_object_alloc() directly. */ obj = dmu_object_alloc(dp->dp_meta_objset, DMU_OT_BPOBJ, SPA_OLD_MAXBLOCKSIZE, DMU_OT_BPOBJ_HDR, sizeof (bpobj_phys_t), tx); VERIFY0(zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx)); VERIFY0(bpobj_open(&dp->dp_free_bpobj, dp->dp_meta_objset, obj)); VERIFY0(dmu_objset_find_dp(dp, dp->dp_root_dir_obj, upgrade_dir_clones_cb, tx, DS_FIND_CHILDREN | DS_FIND_SERIALIZE)); } void dsl_pool_create_origin(dsl_pool_t *dp, dmu_tx_t *tx) { uint64_t dsobj; dsl_dataset_t *ds; ASSERT(dmu_tx_is_syncing(tx)); ASSERT(dp->dp_origin_snap == NULL); ASSERT(rrw_held(&dp->dp_config_rwlock, RW_WRITER)); /* create the origin dir, ds, & snap-ds */ dsobj = dsl_dataset_create_sync(dp->dp_root_dir, ORIGIN_DIR_NAME, NULL, 0, kcred, tx); VERIFY0(dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds)); dsl_dataset_snapshot_sync_impl(ds, ORIGIN_DIR_NAME, tx); VERIFY0(dsl_dataset_hold_obj(dp, dsl_dataset_phys(ds)->ds_prev_snap_obj, dp, &dp->dp_origin_snap)); dsl_dataset_rele(ds, FTAG); } taskq_t * dsl_pool_vnrele_taskq(dsl_pool_t *dp) { return (dp->dp_vnrele_taskq); } /* * Walk through the pool-wide zap object of temporary snapshot user holds * and release them. */ void dsl_pool_clean_tmp_userrefs(dsl_pool_t *dp) { zap_attribute_t za; zap_cursor_t zc; objset_t *mos = dp->dp_meta_objset; uint64_t zapobj = dp->dp_tmp_userrefs_obj; nvlist_t *holds; if (zapobj == 0) return; ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS); holds = fnvlist_alloc(); for (zap_cursor_init(&zc, mos, zapobj); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { char *htag; nvlist_t *tags; htag = strchr(za.za_name, '-'); *htag = '\0'; ++htag; if (nvlist_lookup_nvlist(holds, za.za_name, &tags) != 0) { tags = fnvlist_alloc(); fnvlist_add_boolean(tags, htag); fnvlist_add_nvlist(holds, za.za_name, tags); fnvlist_free(tags); } else { fnvlist_add_boolean(tags, htag); } } dsl_dataset_user_release_tmp(dp, holds); fnvlist_free(holds); zap_cursor_fini(&zc); } /* * Create the pool-wide zap object for storing temporary snapshot holds. */ void dsl_pool_user_hold_create_obj(dsl_pool_t *dp, dmu_tx_t *tx) { objset_t *mos = dp->dp_meta_objset; ASSERT(dp->dp_tmp_userrefs_obj == 0); ASSERT(dmu_tx_is_syncing(tx)); dp->dp_tmp_userrefs_obj = zap_create_link(mos, DMU_OT_USERREFS, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_TMP_USERREFS, tx); } static int dsl_pool_user_hold_rele_impl(dsl_pool_t *dp, uint64_t dsobj, const char *tag, uint64_t now, dmu_tx_t *tx, boolean_t holding) { objset_t *mos = dp->dp_meta_objset; uint64_t zapobj = dp->dp_tmp_userrefs_obj; char *name; int error; ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS); ASSERT(dmu_tx_is_syncing(tx)); /* * If the pool was created prior to SPA_VERSION_USERREFS, the * zap object for temporary holds might not exist yet. */ if (zapobj == 0) { if (holding) { dsl_pool_user_hold_create_obj(dp, tx); zapobj = dp->dp_tmp_userrefs_obj; } else { return (SET_ERROR(ENOENT)); } } name = kmem_asprintf("%llx-%s", (u_longlong_t)dsobj, tag); if (holding) error = zap_add(mos, zapobj, name, 8, 1, &now, tx); else error = zap_remove(mos, zapobj, name, tx); strfree(name); return (error); } /* * Add a temporary hold for the given dataset object and tag. 
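* The hold is recorded in the pool-wide temporary-holds ZAP (see
* dsl_pool_user_hold_rele_impl() above) under a key of the form
* "<dsobj-in-hex>-<tag>", with the requested timestamp as its value.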
*/ int dsl_pool_user_hold(dsl_pool_t *dp, uint64_t dsobj, const char *tag, uint64_t now, dmu_tx_t *tx) { return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, now, tx, B_TRUE)); } /* * Release a temporary hold for the given dataset object and tag. */ int dsl_pool_user_release(dsl_pool_t *dp, uint64_t dsobj, const char *tag, dmu_tx_t *tx) { return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, 0, tx, B_FALSE)); } /* * DSL Pool Configuration Lock * * The dp_config_rwlock protects against changes to DSL state (e.g. dataset * creation / destruction / rename / property setting). It must be held for * read to hold a dataset or dsl_dir. I.e. you must call * dsl_pool_config_enter() or dsl_pool_hold() before calling * dsl_{dataset,dir}_hold{_obj}. In most circumstances, the dp_config_rwlock * must be held continuously until all datasets and dsl_dirs are released. * * The only exception to this rule is that if a "long hold" is placed on * a dataset, then the dp_config_rwlock may be dropped while the dataset * is still held. The long hold will prevent the dataset from being * destroyed -- the destroy will fail with EBUSY. A long hold can be * obtained by calling dsl_dataset_long_hold(), or by "owning" a dataset * (by calling dsl_{dataset,objset}_{try}own{_obj}). * * Legitimate long-holders (including owners) should be long-running, cancelable * tasks that should cause "zfs destroy" to fail. This includes DMU * consumers (i.e. a ZPL filesystem being mounted or ZVOL being open), * "zfs send", and "zfs diff". There are several other long-holders whose * uses are suboptimal (e.g. "zfs promote", and zil_suspend()). * * The usual formula for long-holding would be: * dsl_pool_hold() * dsl_dataset_hold() * ... perform checks ... * dsl_dataset_long_hold() * dsl_pool_rele() * ... perform long-running task ... * dsl_dataset_long_rele() * dsl_dataset_rele() * * Note that when the long hold is released, the dataset is still held but * the pool is not held. The dataset may change arbitrarily during this time * (e.g. it could be destroyed). Therefore you shouldn't do anything to the * dataset except release it. * * User-initiated operations (e.g. ioctls, zfs_ioc_*()) are either read-only * or modifying operations. * * Modifying operations should generally use dsl_sync_task(). The synctask * infrastructure enforces proper locking strategy with respect to the * dp_config_rwlock. See the comment above dsl_sync_task() for details. * * Read-only operations will manually hold the pool, then the dataset, obtain * information from the dataset, then release the pool and dataset. * dmu_objset_{hold,rele}() are convenience routines that also do the pool * hold/rele. */ int dsl_pool_hold(const char *name, void *tag, dsl_pool_t **dp) { spa_t *spa; int error; error = spa_open(name, &spa, tag); if (error == 0) { *dp = spa_get_dsl(spa); dsl_pool_config_enter(*dp, tag); } return (error); } void dsl_pool_rele(dsl_pool_t *dp, void *tag) { dsl_pool_config_exit(dp, tag); spa_close(dp->dp_spa, tag); } void dsl_pool_config_enter(dsl_pool_t *dp, void *tag) { /* * We use a "reentrant" reader-writer lock, but not reentrantly. * * The rrwlock can (with the track_all flag) track all reading threads, * which is very useful for debugging which code path failed to release * the lock, and for verifying that the *current* thread does hold * the lock. * * (Unlike a rwlock, which knows that N threads hold it for * read, but not *which* threads, so rw_held(RW_READER) returns TRUE * if any thread holds it for read, even if this thread doesn't). 
ASSERT(!rrw_held(&dp->dp_config_rwlock, RW_READER)); rrw_enter(&dp->dp_config_rwlock, RW_READER, tag); } void dsl_pool_config_enter_prio(dsl_pool_t *dp, void *tag) { ASSERT(!rrw_held(&dp->dp_config_rwlock, RW_READER)); rrw_enter_read_prio(&dp->dp_config_rwlock, tag); } void dsl_pool_config_exit(dsl_pool_t *dp, void *tag) { rrw_exit(&dp->dp_config_rwlock, tag); } boolean_t dsl_pool_config_held(dsl_pool_t *dp) { return (RRW_LOCK_HELD(&dp->dp_config_rwlock)); } boolean_t dsl_pool_config_held_writer(dsl_pool_t *dp) { return (RRW_WRITE_HELD(&dp->dp_config_rwlock)); } Index: user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/range_tree.h =================================================================== --- user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/range_tree.h (revision 346925) +++ user/ngie/bug-237403/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/range_tree.h (revision 346926) @@ -1,130 +1,123 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright 2009 Sun Microsystems, Inc. All rights reserved. * Use is subject to license terms. */ /* * Copyright (c) 2013, 2017 by Delphix. All rights reserved. */ #ifndef _SYS_RANGE_TREE_H #define _SYS_RANGE_TREE_H #include #include #ifdef __cplusplus extern "C" { #endif #define RANGE_TREE_HISTOGRAM_SIZE 64 typedef struct range_tree_ops range_tree_ops_t; /* * Note: the range_tree may not be accessed concurrently; consumers * must provide external locking if required. */ typedef struct range_tree { avl_tree_t rt_root; /* offset-ordered segment AVL tree */ uint64_t rt_space; /* sum of all segments in the map */ range_tree_ops_t *rt_ops; void *rt_arg; /* rt_avl_compare should only be set if rt_arg is an AVL tree */ uint64_t rt_gap; /* allowable inter-segment gap */ int (*rt_avl_compare)(const void *, const void *); /* * The rt_histogram maintains a histogram of ranges.
Each bucket, * rt_histogram[i], contains the number of ranges whose size is: * 2^i <= size of range in bytes < 2^(i+1) */ uint64_t rt_histogram[RANGE_TREE_HISTOGRAM_SIZE]; } range_tree_t; typedef struct range_seg { avl_node_t rs_node; /* AVL node */ avl_node_t rs_pp_node; /* AVL picker-private node */ uint64_t rs_start; /* starting offset of this segment */ uint64_t rs_end; /* ending offset (non-inclusive) */ uint64_t rs_fill; /* actual fill if gap mode is on */ } range_seg_t; struct range_tree_ops { void (*rtop_create)(range_tree_t *rt, void *arg); void (*rtop_destroy)(range_tree_t *rt, void *arg); void (*rtop_add)(range_tree_t *rt, range_seg_t *rs, void *arg); void (*rtop_remove)(range_tree_t *rt, range_seg_t *rs, void *arg); void (*rtop_vacate)(range_tree_t *rt, void *arg); }; typedef void range_tree_func_t(void *arg, uint64_t start, uint64_t size); void range_tree_init(void); void range_tree_fini(void); range_tree_t *range_tree_create_impl(range_tree_ops_t *ops, void *arg, int (*avl_compare)(const void*, const void*), uint64_t gap); - range_tree_t *range_tree_create(range_tree_ops_t *ops, void *arg); +range_tree_t *range_tree_create(range_tree_ops_t *ops, void *arg); void range_tree_destroy(range_tree_t *rt); boolean_t range_tree_contains(range_tree_t *rt, uint64_t start, uint64_t size); range_seg_t *range_tree_find(range_tree_t *rt, uint64_t start, uint64_t size); void range_tree_resize_segment(range_tree_t *rt, range_seg_t *rs, uint64_t newstart, uint64_t newsize); uint64_t range_tree_space(range_tree_t *rt); boolean_t range_tree_is_empty(range_tree_t *rt); void range_tree_verify(range_tree_t *rt, uint64_t start, uint64_t size); void range_tree_swap(range_tree_t **rtsrc, range_tree_t **rtdst); void range_tree_stat_verify(range_tree_t *rt); uint64_t range_tree_min(range_tree_t *rt); uint64_t range_tree_max(range_tree_t *rt); uint64_t range_tree_span(range_tree_t *rt); void range_tree_add(void *arg, uint64_t start, uint64_t size); void range_tree_remove(void *arg, uint64_t start, uint64_t size); void range_tree_remove_fill(range_tree_t *rt, uint64_t start, uint64_t size); void range_tree_adjust_fill(range_tree_t *rt, range_seg_t *rs, int64_t delta); void range_tree_clear(range_tree_t *rt, uint64_t start, uint64_t size); void range_tree_vacate(range_tree_t *rt, range_tree_func_t *func, void *arg); void range_tree_walk(range_tree_t *rt, range_tree_func_t *func, void *arg); range_seg_t *range_tree_first(range_tree_t *rt); - -void rt_avl_create(range_tree_t *rt, void *arg); -void rt_avl_destroy(range_tree_t *rt, void *arg); -void rt_avl_add(range_tree_t *rt, range_seg_t *rs, void *arg); -void rt_avl_remove(range_tree_t *rt, range_seg_t *rs, void *arg); -void rt_avl_vacate(range_tree_t *rt, void *arg); -extern struct range_tree_ops rt_avl_ops; void rt_avl_create(range_tree_t *rt, void *arg); void rt_avl_destroy(range_tree_t *rt, void *arg); void rt_avl_add(range_tree_t *rt, range_seg_t *rs, void *arg); void rt_avl_remove(range_tree_t *rt, range_seg_t *rs, void *arg); void rt_avl_vacate(range_tree_t *rt, void *arg); extern struct range_tree_ops rt_avl_ops; #ifdef __cplusplus } #endif #endif /* _SYS_RANGE_TREE_H */ Index: user/ngie/bug-237403/sys/cddl/contrib/opensolaris =================================================================== --- user/ngie/bug-237403/sys/cddl/contrib/opensolaris (revision 346925) +++ user/ngie/bug-237403/sys/cddl/contrib/opensolaris (revision 346926) Property changes on: user/ngie/bug-237403/sys/cddl/contrib/opensolaris 
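[Editorial aside, not part of the diff: a minimal usage sketch of the range_tree interface declared in the header above. The offsets and sizes are made-up values, and the sketch assumes the caller has arranged the external locking the header note requires; only the range_tree_* calls come from the header.]

#include <sys/range_tree.h>

/* Hypothetical example of creating, populating, and tearing down a range_tree. */
static uint64_t
example_range_tree_usage(void)
{
	range_tree_t *rt;
	uint64_t space;

	/* No ops callbacks and no per-tree argument are needed for this sketch. */
	rt = range_tree_create(NULL, NULL);

	range_tree_add(rt, 0x1000, 0x2000);	/* segment [0x1000, 0x3000) */
	range_tree_add(rt, 0x8000, 0x1000);	/* segment [0x8000, 0x9000) */
	/*
	 * Per the header comment, the first segment (size 0x2000 = 2^13) is
	 * counted in rt_histogram[13], since 2^13 <= 0x2000 < 2^14.
	 */
	range_tree_remove(rt, 0x1000, 0x1000);	/* trim the first segment to [0x2000, 0x3000) */

	space = range_tree_space(rt);		/* 0x1000 + 0x1000 = 0x2000 bytes tracked */

	range_tree_vacate(rt, NULL, NULL);	/* drop all remaining segments */
	range_tree_destroy(rt);
	return (space);
}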
___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys/cddl/contrib/opensolaris:r346444-346925 Index: user/ngie/bug-237403/sys/compat/freebsd32/freebsd32_systrace_args.c =================================================================== --- user/ngie/bug-237403/sys/compat/freebsd32/freebsd32_systrace_args.c (revision 346925) +++ user/ngie/bug-237403/sys/compat/freebsd32/freebsd32_systrace_args.c (revision 346926) @@ -1,10816 +1,10816 @@ /* * System call argument to DTrace register array converstion. * * DO NOT EDIT-- this file is automatically generated. * $FreeBSD$ * This file is part of the DTrace syscall provider. */ static void systrace_args(int sysnum, void *params, uint64_t *uarg, int *n_args) { int64_t *iarg = (int64_t *) uarg; switch (sysnum) { #if !defined(PAD64_REQUIRED) && !defined(__amd64__) #define PAD64_REQUIRED #endif /* nosys */ case 0: { *n_args = 0; break; } /* sys_exit */ case 1: { struct sys_exit_args *p = params; iarg[0] = p->rval; /* int */ *n_args = 1; break; } /* fork */ case 2: { *n_args = 0; break; } /* read */ case 3: { struct read_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->nbyte; /* size_t */ *n_args = 3; break; } /* write */ case 4: { struct write_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->nbyte; /* size_t */ *n_args = 3; break; } /* open */ case 5: { struct open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* close */ case 6: { struct close_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* freebsd32_wait4 */ case 7: { struct freebsd32_wait4_args *p = params; iarg[0] = p->pid; /* int */ uarg[1] = (intptr_t) p->status; /* int * */ iarg[2] = p->options; /* int */ uarg[3] = (intptr_t) p->rusage; /* struct rusage32 * */ *n_args = 4; break; } /* link */ case 9: { struct link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->link; /* const char * */ *n_args = 2; break; } /* unlink */ case 10: { struct unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* chdir */ case 12: { struct chdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* fchdir */ case 13: { struct fchdir_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* chmod */ case 15: { struct chmod_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* chown */ case 16: { struct chown_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* break */ case 17: { struct break_args *p = params; uarg[0] = (intptr_t) p->nsize; /* char * */ *n_args = 1; break; } /* getpid */ case 20: { *n_args = 0; break; } /* mount */ case 21: { struct mount_args *p = params; uarg[0] = (intptr_t) p->type; /* const char * */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flags; /* int */ uarg[3] = (intptr_t) p->data; /* void * */ *n_args = 4; break; } /* unmount */ case 22: { struct unmount_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* setuid */ case 23: { struct setuid_args *p = params; uarg[0] = p->uid; /* uid_t */ *n_args = 1; 
break; } /* getuid */ case 24: { *n_args = 0; break; } /* geteuid */ case 25: { *n_args = 0; break; } /* ptrace */ case 26: { struct ptrace_args *p = params; iarg[0] = p->req; /* int */ iarg[1] = p->pid; /* pid_t */ uarg[2] = (intptr_t) p->addr; /* caddr_t */ iarg[3] = p->data; /* int */ *n_args = 4; break; } /* freebsd32_recvmsg */ case 27: { struct freebsd32_recvmsg_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->msg; /* struct msghdr32 * */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* freebsd32_sendmsg */ case 28: { struct freebsd32_sendmsg_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->msg; /* struct msghdr32 * */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* freebsd32_recvfrom */ case 29: { struct freebsd32_recvfrom_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->len; /* uint32_t */ iarg[3] = p->flags; /* int */ uarg[4] = (intptr_t) p->from; /* struct sockaddr * */ uarg[5] = p->fromlenaddr; /* uint32_t */ *n_args = 6; break; } /* accept */ case 30: { struct accept_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* struct sockaddr * */ uarg[2] = (intptr_t) p->anamelen; /* int * */ *n_args = 3; break; } /* getpeername */ case 31: { struct getpeername_args *p = params; iarg[0] = p->fdes; /* int */ uarg[1] = (intptr_t) p->asa; /* struct sockaddr * */ uarg[2] = (intptr_t) p->alen; /* int * */ *n_args = 3; break; } /* getsockname */ case 32: { struct getsockname_args *p = params; iarg[0] = p->fdes; /* int */ uarg[1] = (intptr_t) p->asa; /* struct sockaddr * */ uarg[2] = (intptr_t) p->alen; /* int * */ *n_args = 3; break; } /* access */ case 33: { struct access_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->amode; /* int */ *n_args = 2; break; } /* chflags */ case 34: { struct chflags_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* fchflags */ case 35: { struct fchflags_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* sync */ case 36: { *n_args = 0; break; } /* kill */ case 37: { struct kill_args *p = params; iarg[0] = p->pid; /* int */ iarg[1] = p->signum; /* int */ *n_args = 2; break; } /* getppid */ case 39: { *n_args = 0; break; } /* dup */ case 41: { struct dup_args *p = params; uarg[0] = p->fd; /* u_int */ *n_args = 1; break; } /* getegid */ case 43: { *n_args = 0; break; } /* profil */ case 44: { struct profil_args *p = params; uarg[0] = (intptr_t) p->samples; /* char * */ uarg[1] = p->size; /* size_t */ uarg[2] = p->offset; /* size_t */ uarg[3] = p->scale; /* u_int */ *n_args = 4; break; } /* ktrace */ case 45: { struct ktrace_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ iarg[1] = p->ops; /* int */ iarg[2] = p->facs; /* int */ iarg[3] = p->pid; /* int */ *n_args = 4; break; } /* getgid */ case 47: { *n_args = 0; break; } /* getlogin */ case 49: { struct getlogin_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* char * */ uarg[1] = p->namelen; /* u_int */ *n_args = 2; break; } /* setlogin */ case 50: { struct setlogin_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* const char * */ *n_args = 1; break; } /* acct */ case 51: { struct acct_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* freebsd32_sigaltstack */ case 53: { struct freebsd32_sigaltstack_args *p = params; uarg[0] = (intptr_t) p->ss; /* struct 
sigaltstack32 * */ uarg[1] = (intptr_t) p->oss; /* struct sigaltstack32 * */ *n_args = 2; break; } /* freebsd32_ioctl */ case 54: { struct freebsd32_ioctl_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->com; /* uint32_t */ uarg[2] = (intptr_t) p->data; /* struct md_ioctl32 * */ *n_args = 3; break; } /* reboot */ case 55: { struct reboot_args *p = params; iarg[0] = p->opt; /* int */ *n_args = 1; break; } /* revoke */ case 56: { struct revoke_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* symlink */ case 57: { struct symlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->link; /* const char * */ *n_args = 2; break; } /* readlink */ case 58: { struct readlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->count; /* size_t */ *n_args = 3; break; } /* freebsd32_execve */ case 59: { struct freebsd32_execve_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = (intptr_t) p->argv; /* uint32_t * */ uarg[2] = (intptr_t) p->envv; /* uint32_t * */ *n_args = 3; break; } /* umask */ case 60: { struct umask_args *p = params; iarg[0] = p->newmask; /* mode_t */ *n_args = 1; break; } /* chroot */ case 61: { struct chroot_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* msync */ case 65: { struct msync_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* vfork */ case 66: { *n_args = 0; break; } /* sbrk */ case 69: { struct sbrk_args *p = params; iarg[0] = p->incr; /* int */ *n_args = 1; break; } /* sstk */ case 70: { struct sstk_args *p = params; iarg[0] = p->incr; /* int */ *n_args = 1; break; } /* munmap */ case 73: { struct munmap_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* freebsd32_mprotect */ case 74: { struct freebsd32_mprotect_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->prot; /* int */ *n_args = 3; break; } /* madvise */ case 75: { struct madvise_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->behav; /* int */ *n_args = 3; break; } /* mincore */ case 78: { struct mincore_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ uarg[2] = (intptr_t) p->vec; /* char * */ *n_args = 3; break; } /* getgroups */ case 79: { struct getgroups_args *p = params; uarg[0] = p->gidsetsize; /* u_int */ uarg[1] = (intptr_t) p->gidset; /* gid_t * */ *n_args = 2; break; } /* setgroups */ case 80: { struct setgroups_args *p = params; uarg[0] = p->gidsetsize; /* u_int */ uarg[1] = (intptr_t) p->gidset; /* gid_t * */ *n_args = 2; break; } /* getpgrp */ case 81: { *n_args = 0; break; } /* setpgid */ case 82: { struct setpgid_args *p = params; iarg[0] = p->pid; /* int */ iarg[1] = p->pgid; /* int */ *n_args = 2; break; } /* freebsd32_setitimer */ case 83: { struct freebsd32_setitimer_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->itv; /* struct itimerval32 * */ uarg[2] = (intptr_t) p->oitv; /* struct itimerval32 * */ *n_args = 3; break; } /* swapon */ case 85: { struct swapon_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* freebsd32_getitimer */ case 86: { struct freebsd32_getitimer_args *p = 
params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->itv; /* struct itimerval32 * */ *n_args = 2; break; } /* getdtablesize */ case 89: { *n_args = 0; break; } /* dup2 */ case 90: { struct dup2_args *p = params; uarg[0] = p->from; /* u_int */ uarg[1] = p->to; /* u_int */ *n_args = 2; break; } /* freebsd32_fcntl */ case 92: { struct freebsd32_fcntl_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->cmd; /* int */ iarg[2] = p->arg; /* int */ *n_args = 3; break; } /* freebsd32_select */ case 93: { struct freebsd32_select_args *p = params; iarg[0] = p->nd; /* int */ uarg[1] = (intptr_t) p->in; /* fd_set * */ uarg[2] = (intptr_t) p->ou; /* fd_set * */ uarg[3] = (intptr_t) p->ex; /* fd_set * */ uarg[4] = (intptr_t) p->tv; /* struct timeval32 * */ *n_args = 5; break; } /* fsync */ case 95: { struct fsync_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* setpriority */ case 96: { struct setpriority_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->who; /* int */ iarg[2] = p->prio; /* int */ *n_args = 3; break; } /* socket */ case 97: { struct socket_args *p = params; iarg[0] = p->domain; /* int */ iarg[1] = p->type; /* int */ iarg[2] = p->protocol; /* int */ *n_args = 3; break; } /* connect */ case 98: { struct connect_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[2] = p->namelen; /* int */ *n_args = 3; break; } /* getpriority */ case 100: { struct getpriority_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->who; /* int */ *n_args = 2; break; } /* bind */ case 104: { struct bind_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[2] = p->namelen; /* int */ *n_args = 3; break; } /* setsockopt */ case 105: { struct setsockopt_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->level; /* int */ iarg[2] = p->name; /* int */ uarg[3] = (intptr_t) p->val; /* const void * */ iarg[4] = p->valsize; /* int */ *n_args = 5; break; } /* listen */ case 106: { struct listen_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->backlog; /* int */ *n_args = 2; break; } /* freebsd32_gettimeofday */ case 116: { struct freebsd32_gettimeofday_args *p = params; uarg[0] = (intptr_t) p->tp; /* struct timeval32 * */ uarg[1] = (intptr_t) p->tzp; /* struct timezone * */ *n_args = 2; break; } /* freebsd32_getrusage */ case 117: { struct freebsd32_getrusage_args *p = params; iarg[0] = p->who; /* int */ uarg[1] = (intptr_t) p->rusage; /* struct rusage32 * */ *n_args = 2; break; } /* getsockopt */ case 118: { struct getsockopt_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->level; /* int */ iarg[2] = p->name; /* int */ uarg[3] = (intptr_t) p->val; /* void * */ uarg[4] = (intptr_t) p->avalsize; /* int * */ *n_args = 5; break; } /* freebsd32_readv */ case 120: { struct freebsd32_readv_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[2] = p->iovcnt; /* u_int */ *n_args = 3; break; } /* freebsd32_writev */ case 121: { struct freebsd32_writev_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[2] = p->iovcnt; /* u_int */ *n_args = 3; break; } /* freebsd32_settimeofday */ case 122: { struct freebsd32_settimeofday_args *p = params; uarg[0] = (intptr_t) p->tv; /* struct timeval32 * */ uarg[1] = (intptr_t) p->tzp; /* struct timezone * */ *n_args = 2; break; } /* fchown */ case 123: { struct fchown_args *p = params; iarg[0] = p->fd; /* int 
*/ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* fchmod */ case 124: { struct fchmod_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* setreuid */ case 126: { struct setreuid_args *p = params; iarg[0] = p->ruid; /* int */ iarg[1] = p->euid; /* int */ *n_args = 2; break; } /* setregid */ case 127: { struct setregid_args *p = params; iarg[0] = p->rgid; /* int */ iarg[1] = p->egid; /* int */ *n_args = 2; break; } /* rename */ case 128: { struct rename_args *p = params; uarg[0] = (intptr_t) p->from; /* const char * */ uarg[1] = (intptr_t) p->to; /* const char * */ *n_args = 2; break; } /* flock */ case 131: { struct flock_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->how; /* int */ *n_args = 2; break; } /* mkfifo */ case 132: { struct mkfifo_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* sendto */ case 133: { struct sendto_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->len; /* size_t */ iarg[3] = p->flags; /* int */ uarg[4] = (intptr_t) p->to; /* const struct sockaddr * */ iarg[5] = p->tolen; /* int */ *n_args = 6; break; } /* shutdown */ case 134: { struct shutdown_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->how; /* int */ *n_args = 2; break; } /* socketpair */ case 135: { struct socketpair_args *p = params; iarg[0] = p->domain; /* int */ iarg[1] = p->type; /* int */ iarg[2] = p->protocol; /* int */ uarg[3] = (intptr_t) p->rsv; /* int * */ *n_args = 4; break; } /* mkdir */ case 136: { struct mkdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* rmdir */ case 137: { struct rmdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* freebsd32_utimes */ case 138: { struct freebsd32_utimes_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->tptr; /* struct timeval32 * */ *n_args = 2; break; } /* freebsd32_adjtime */ case 140: { struct freebsd32_adjtime_args *p = params; uarg[0] = (intptr_t) p->delta; /* struct timeval32 * */ uarg[1] = (intptr_t) p->olddelta; /* struct timeval32 * */ *n_args = 2; break; } /* setsid */ case 147: { *n_args = 0; break; } /* quotactl */ case 148: { struct quotactl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->cmd; /* int */ iarg[2] = p->uid; /* int */ uarg[3] = (intptr_t) p->arg; /* void * */ *n_args = 4; break; } /* getfh */ case 161: { struct getfh_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = (intptr_t) p->fhp; /* struct fhandle * */ *n_args = 2; break; } /* freebsd32_sysarch */ case 165: { struct freebsd32_sysarch_args *p = params; iarg[0] = p->op; /* int */ uarg[1] = (intptr_t) p->parms; /* char * */ *n_args = 2; break; } /* rtprio */ case 166: { struct rtprio_args *p = params; iarg[0] = p->function; /* int */ iarg[1] = p->pid; /* pid_t */ uarg[2] = (intptr_t) p->rtp; /* struct rtprio * */ *n_args = 3; break; } /* freebsd32_semsys */ case 169: { struct freebsd32_semsys_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->a2; /* int */ iarg[2] = p->a3; /* int */ iarg[3] = p->a4; /* int */ iarg[4] = p->a5; /* int */ *n_args = 5; break; } /* freebsd32_msgsys */ case 170: { struct freebsd32_msgsys_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->a2; /* int */ iarg[2] = p->a3; /* int */ 
iarg[3] = p->a4; /* int */ iarg[4] = p->a5; /* int */ iarg[5] = p->a6; /* int */ *n_args = 6; break; } /* freebsd32_shmsys */ case 171: { struct freebsd32_shmsys_args *p = params; uarg[0] = p->which; /* uint32_t */ uarg[1] = p->a2; /* uint32_t */ uarg[2] = p->a3; /* uint32_t */ uarg[3] = p->a4; /* uint32_t */ *n_args = 4; break; } /* ntp_adjtime */ case 176: { struct ntp_adjtime_args *p = params; uarg[0] = (intptr_t) p->tp; /* struct timex * */ *n_args = 1; break; } /* setgid */ case 181: { struct setgid_args *p = params; iarg[0] = p->gid; /* gid_t */ *n_args = 1; break; } /* setegid */ case 182: { struct setegid_args *p = params; iarg[0] = p->egid; /* gid_t */ *n_args = 1; break; } /* seteuid */ case 183: { struct seteuid_args *p = params; uarg[0] = p->euid; /* uid_t */ *n_args = 1; break; } /* pathconf */ case 191: { struct pathconf_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* fpathconf */ case 192: { struct fpathconf_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* getrlimit */ case 194: { struct __getrlimit_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->rlp; /* struct rlimit * */ *n_args = 2; break; } /* setrlimit */ case 195: { struct __setrlimit_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->rlp; /* struct rlimit * */ *n_args = 2; break; } /* nosys */ case 198: { *n_args = 0; break; } /* freebsd32___sysctl */ case 202: { struct freebsd32___sysctl_args *p = params; uarg[0] = (intptr_t) p->name; /* int * */ uarg[1] = p->namelen; /* u_int */ uarg[2] = (intptr_t) p->old; /* void * */ uarg[3] = (intptr_t) p->oldlenp; /* uint32_t * */ uarg[4] = (intptr_t) p->new; /* const void * */ uarg[5] = p->newlen; /* uint32_t */ *n_args = 6; break; } /* mlock */ case 203: { struct mlock_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* munlock */ case 204: { struct munlock_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* undelete */ case 205: { struct undelete_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* freebsd32_futimes */ case 206: { struct freebsd32_futimes_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->tptr; /* struct timeval32 * */ *n_args = 2; break; } /* getpgid */ case 207: { struct getpgid_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* poll */ case 209: { struct poll_args *p = params; uarg[0] = (intptr_t) p->fds; /* struct pollfd * */ uarg[1] = p->nfds; /* u_int */ iarg[2] = p->timeout; /* int */ *n_args = 3; break; } /* lkmnosys */ case 210: { *n_args = 0; break; } /* lkmnosys */ case 211: { *n_args = 0; break; } /* lkmnosys */ case 212: { *n_args = 0; break; } /* lkmnosys */ case 213: { *n_args = 0; break; } /* lkmnosys */ case 214: { *n_args = 0; break; } /* lkmnosys */ case 215: { *n_args = 0; break; } /* lkmnosys */ case 216: { *n_args = 0; break; } /* lkmnosys */ case 217: { *n_args = 0; break; } /* lkmnosys */ case 218: { *n_args = 0; break; } /* lkmnosys */ case 219: { *n_args = 0; break; } /* semget */ case 221: { struct semget_args *p = params; iarg[0] = p->key; /* key_t */ iarg[1] = p->nsems; /* int */ iarg[2] = p->semflg; /* int */ *n_args = 3; break; } /* semop */ case 222: { struct semop_args *p = params; iarg[0] = p->semid; /* int */ uarg[1] = (intptr_t) p->sops; 
/* struct sembuf * */ uarg[2] = p->nsops; /* u_int */ *n_args = 3; break; } /* msgget */ case 225: { struct msgget_args *p = params; iarg[0] = p->key; /* key_t */ iarg[1] = p->msgflg; /* int */ *n_args = 2; break; } /* freebsd32_msgsnd */ case 226: { struct freebsd32_msgsnd_args *p = params; iarg[0] = p->msqid; /* int */ uarg[1] = (intptr_t) p->msgp; /* void * */ uarg[2] = p->msgsz; /* size_t */ iarg[3] = p->msgflg; /* int */ *n_args = 4; break; } /* freebsd32_msgrcv */ case 227: { struct freebsd32_msgrcv_args *p = params; iarg[0] = p->msqid; /* int */ uarg[1] = (intptr_t) p->msgp; /* void * */ uarg[2] = p->msgsz; /* size_t */ iarg[3] = p->msgtyp; /* long */ iarg[4] = p->msgflg; /* int */ *n_args = 5; break; } /* shmat */ case 228: { struct shmat_args *p = params; iarg[0] = p->shmid; /* int */ uarg[1] = (intptr_t) p->shmaddr; /* void * */ iarg[2] = p->shmflg; /* int */ *n_args = 3; break; } /* shmdt */ case 230: { struct shmdt_args *p = params; uarg[0] = (intptr_t) p->shmaddr; /* void * */ *n_args = 1; break; } /* shmget */ case 231: { struct shmget_args *p = params; iarg[0] = p->key; /* key_t */ iarg[1] = p->size; /* int */ iarg[2] = p->shmflg; /* int */ *n_args = 3; break; } /* freebsd32_clock_gettime */ case 232: { struct freebsd32_clock_gettime_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* struct timespec32 * */ *n_args = 2; break; } /* freebsd32_clock_settime */ case 233: { struct freebsd32_clock_settime_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* const struct timespec32 * */ *n_args = 2; break; } /* freebsd32_clock_getres */ case 234: { struct freebsd32_clock_getres_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* struct timespec32 * */ *n_args = 2; break; } /* freebsd32_ktimer_create */ case 235: { struct freebsd32_ktimer_create_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->evp; /* struct sigevent32 * */ uarg[2] = (intptr_t) p->timerid; /* int * */ *n_args = 3; break; } /* ktimer_delete */ case 236: { struct ktimer_delete_args *p = params; iarg[0] = p->timerid; /* int */ *n_args = 1; break; } /* freebsd32_ktimer_settime */ case 237: { struct freebsd32_ktimer_settime_args *p = params; iarg[0] = p->timerid; /* int */ iarg[1] = p->flags; /* int */ uarg[2] = (intptr_t) p->value; /* const struct itimerspec32 * */ uarg[3] = (intptr_t) p->ovalue; /* struct itimerspec32 * */ *n_args = 4; break; } /* freebsd32_ktimer_gettime */ case 238: { struct freebsd32_ktimer_gettime_args *p = params; iarg[0] = p->timerid; /* int */ uarg[1] = (intptr_t) p->value; /* struct itimerspec32 * */ *n_args = 2; break; } /* ktimer_getoverrun */ case 239: { struct ktimer_getoverrun_args *p = params; iarg[0] = p->timerid; /* int */ *n_args = 1; break; } /* freebsd32_nanosleep */ case 240: { struct freebsd32_nanosleep_args *p = params; uarg[0] = (intptr_t) p->rqtp; /* const struct timespec32 * */ uarg[1] = (intptr_t) p->rmtp; /* struct timespec32 * */ *n_args = 2; break; } /* ffclock_getcounter */ case 241: { struct ffclock_getcounter_args *p = params; uarg[0] = (intptr_t) p->ffcount; /* ffcounter * */ *n_args = 1; break; } /* ffclock_setestimate */ case 242: { struct ffclock_setestimate_args *p = params; uarg[0] = (intptr_t) p->cest; /* struct ffclock_estimate * */ *n_args = 1; break; } /* ffclock_getestimate */ case 243: { struct ffclock_getestimate_args *p = params; uarg[0] = (intptr_t) p->cest; /* struct ffclock_estimate * */ *n_args = 1; break; } /* 
freebsd32_clock_nanosleep */ case 244: { struct freebsd32_clock_nanosleep_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ iarg[1] = p->flags; /* int */ uarg[2] = (intptr_t) p->rqtp; /* const struct timespec32 * */ uarg[3] = (intptr_t) p->rmtp; /* struct timespec32 * */ *n_args = 4; break; } /* freebsd32_clock_getcpuclockid2 */ case 247: { struct freebsd32_clock_getcpuclockid2_args *p = params; uarg[0] = p->id1; /* uint32_t */ uarg[1] = p->id2; /* uint32_t */ iarg[2] = p->which; /* int */ uarg[3] = (intptr_t) p->clock_id; /* clockid_t * */ *n_args = 4; break; } /* minherit */ case 250: { struct minherit_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->inherit; /* int */ *n_args = 3; break; } /* rfork */ case 251: { struct rfork_args *p = params; iarg[0] = p->flags; /* int */ *n_args = 1; break; } /* issetugid */ case 253: { *n_args = 0; break; } /* lchown */ case 254: { struct lchown_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* freebsd32_aio_read */ case 255: { struct freebsd32_aio_read_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 1; break; } /* freebsd32_aio_write */ case 256: { struct freebsd32_aio_write_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 1; break; } /* freebsd32_lio_listio */ case 257: { struct freebsd32_lio_listio_args *p = params; iarg[0] = p->mode; /* int */ uarg[1] = (intptr_t) p->acb_list; /* struct aiocb32 *const * */ iarg[2] = p->nent; /* int */ uarg[3] = (intptr_t) p->sig; /* struct sigevent32 * */ *n_args = 4; break; } /* lchmod */ case 274: { struct lchmod_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* freebsd32_lutimes */ case 276: { struct freebsd32_lutimes_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->tptr; /* struct timeval32 * */ *n_args = 2; break; } /* freebsd32_preadv */ case 289: { struct freebsd32_preadv_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[2] = p->iovcnt; /* u_int */ uarg[3] = p->offset1; /* uint32_t */ uarg[4] = p->offset2; /* uint32_t */ *n_args = 5; break; } /* freebsd32_pwritev */ case 290: { struct freebsd32_pwritev_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[2] = p->iovcnt; /* u_int */ uarg[3] = p->offset1; /* uint32_t */ uarg[4] = p->offset2; /* uint32_t */ *n_args = 5; break; } /* fhopen */ case 298: { struct fhopen_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* modnext */ case 300: { struct modnext_args *p = params; iarg[0] = p->modid; /* int */ *n_args = 1; break; } /* freebsd32_modstat */ case 301: { struct freebsd32_modstat_args *p = params; iarg[0] = p->modid; /* int */ uarg[1] = (intptr_t) p->stat; /* struct module_stat32 * */ *n_args = 2; break; } /* modfnext */ case 302: { struct modfnext_args *p = params; iarg[0] = p->modid; /* int */ *n_args = 1; break; } /* modfind */ case 303: { struct modfind_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* kldload */ case 304: { struct kldload_args *p = params; uarg[0] = (intptr_t) p->file; /* const char * */ *n_args = 1; break; } /* kldunload */ case 305: { struct kldunload_args *p = params; 
iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* kldfind */ case 306: { struct kldfind_args *p = params; uarg[0] = (intptr_t) p->file; /* const char * */ *n_args = 1; break; } /* kldnext */ case 307: { struct kldnext_args *p = params; iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* freebsd32_kldstat */ case 308: { struct freebsd32_kldstat_args *p = params; iarg[0] = p->fileid; /* int */ uarg[1] = (intptr_t) p->stat; /* struct kld32_file_stat * */ *n_args = 2; break; } /* kldfirstmod */ case 309: { struct kldfirstmod_args *p = params; iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* getsid */ case 310: { struct getsid_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* setresuid */ case 311: { struct setresuid_args *p = params; uarg[0] = p->ruid; /* uid_t */ uarg[1] = p->euid; /* uid_t */ uarg[2] = p->suid; /* uid_t */ *n_args = 3; break; } /* setresgid */ case 312: { struct setresgid_args *p = params; iarg[0] = p->rgid; /* gid_t */ iarg[1] = p->egid; /* gid_t */ iarg[2] = p->sgid; /* gid_t */ *n_args = 3; break; } /* freebsd32_aio_return */ case 314: { struct freebsd32_aio_return_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 1; break; } /* freebsd32_aio_suspend */ case 315: { struct freebsd32_aio_suspend_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 *const * */ iarg[1] = p->nent; /* int */ uarg[2] = (intptr_t) p->timeout; /* const struct timespec32 * */ *n_args = 3; break; } /* aio_cancel */ case 316: { struct aio_cancel_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 2; break; } /* freebsd32_aio_error */ case 317: { struct freebsd32_aio_error_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 1; break; } /* yield */ case 321: { *n_args = 0; break; } /* mlockall */ case 324: { struct mlockall_args *p = params; iarg[0] = p->how; /* int */ *n_args = 1; break; } /* munlockall */ case 325: { *n_args = 0; break; } /* __getcwd */ case 326: { struct __getcwd_args *p = params; uarg[0] = (intptr_t) p->buf; /* char * */ uarg[1] = p->buflen; /* size_t */ *n_args = 2; break; } /* sched_setparam */ case 327: { struct sched_setparam_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->param; /* const struct sched_param * */ *n_args = 2; break; } /* sched_getparam */ case 328: { struct sched_getparam_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->param; /* struct sched_param * */ *n_args = 2; break; } /* sched_setscheduler */ case 329: { struct sched_setscheduler_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->policy; /* int */ uarg[2] = (intptr_t) p->param; /* const struct sched_param * */ *n_args = 3; break; } /* sched_getscheduler */ case 330: { struct sched_getscheduler_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* sched_yield */ case 331: { *n_args = 0; break; } /* sched_get_priority_max */ case 332: { struct sched_get_priority_max_args *p = params; iarg[0] = p->policy; /* int */ *n_args = 1; break; } /* sched_get_priority_min */ case 333: { struct sched_get_priority_min_args *p = params; iarg[0] = p->policy; /* int */ *n_args = 1; break; } /* freebsd32_sched_rr_get_interval */ case 334: { struct freebsd32_sched_rr_get_interval_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->interval; /* struct timespec32 * */ *n_args = 2; break; } /* utrace */ case 335: { struct utrace_args *p = params; uarg[0] = 
(intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* kldsym */ case 337: { struct kldsym_args *p = params; iarg[0] = p->fileid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ *n_args = 3; break; } /* freebsd32_jail */ case 338: { struct freebsd32_jail_args *p = params; uarg[0] = (intptr_t) p->jail; /* struct jail32 * */ *n_args = 1; break; } /* sigprocmask */ case 340: { struct sigprocmask_args *p = params; iarg[0] = p->how; /* int */ uarg[1] = (intptr_t) p->set; /* const sigset_t * */ uarg[2] = (intptr_t) p->oset; /* sigset_t * */ *n_args = 3; break; } /* sigsuspend */ case 341: { struct sigsuspend_args *p = params; uarg[0] = (intptr_t) p->sigmask; /* const sigset_t * */ *n_args = 1; break; } /* sigpending */ case 343: { struct sigpending_args *p = params; uarg[0] = (intptr_t) p->set; /* sigset_t * */ *n_args = 1; break; } /* freebsd32_sigtimedwait */ case 345: { struct freebsd32_sigtimedwait_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->info; /* siginfo_t * */ uarg[2] = (intptr_t) p->timeout; /* const struct timespec * */ *n_args = 3; break; } /* freebsd32_sigwaitinfo */ case 346: { struct freebsd32_sigwaitinfo_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->info; /* siginfo_t * */ *n_args = 2; break; } /* __acl_get_file */ case 347: { struct __acl_get_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_file */ case 348: { struct __acl_set_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_get_fd */ case 349: { struct __acl_get_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_fd */ case 350: { struct __acl_set_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_delete_file */ case 351: { struct __acl_delete_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_delete_fd */ case 352: { struct __acl_delete_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_aclcheck_file */ case 353: { struct __acl_aclcheck_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_aclcheck_fd */ case 354: { struct __acl_aclcheck_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* extattrctl */ case 355: { struct extattrctl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->filename; /* const char * */ iarg[3] = p->attrnamespace; /* int */ uarg[4] = (intptr_t) p->attrname; /* const char * */ *n_args = 5; break; } /* extattr_set_file */ case 356: { struct extattr_set_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ 
uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_file */ case 357: { struct extattr_get_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_file */ case 358: { struct extattr_delete_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* freebsd32_aio_waitcomplete */ case 359: { struct freebsd32_aio_waitcomplete_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 ** */ uarg[1] = (intptr_t) p->timeout; /* struct timespec32 * */ *n_args = 2; break; } /* getresuid */ case 360: { struct getresuid_args *p = params; uarg[0] = (intptr_t) p->ruid; /* uid_t * */ uarg[1] = (intptr_t) p->euid; /* uid_t * */ uarg[2] = (intptr_t) p->suid; /* uid_t * */ *n_args = 3; break; } /* getresgid */ case 361: { struct getresgid_args *p = params; uarg[0] = (intptr_t) p->rgid; /* gid_t * */ uarg[1] = (intptr_t) p->egid; /* gid_t * */ uarg[2] = (intptr_t) p->sgid; /* gid_t * */ *n_args = 3; break; } /* kqueue */ case 362: { *n_args = 0; break; } /* extattr_set_fd */ case 371: { struct extattr_set_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_fd */ case 372: { struct extattr_get_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_fd */ case 373: { struct extattr_delete_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* __setugid */ case 374: { struct __setugid_args *p = params; iarg[0] = p->flag; /* int */ *n_args = 1; break; } /* eaccess */ case 376: { struct eaccess_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->amode; /* int */ *n_args = 2; break; } /* freebsd32_nmount */ case 378: { struct freebsd32_nmount_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* kenv */ case 390: { struct kenv_args *p = params; iarg[0] = p->what; /* int */ uarg[1] = (intptr_t) p->name; /* const char * */ uarg[2] = (intptr_t) p->value; /* char * */ iarg[3] = p->len; /* int */ *n_args = 4; break; } /* lchflags */ case 391: { struct lchflags_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* uuidgen */ case 392: { struct uuidgen_args *p = params; uarg[0] = (intptr_t) p->store; /* struct uuid * */ iarg[1] = p->count; /* int */ *n_args = 2; break; } /* freebsd32_sendfile */ case 393: { struct freebsd32_sendfile_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ uarg[2] = p->offset1; /* uint32_t */ uarg[3] = p->offset2; /* uint32_t */ uarg[4] = p->nbytes; /* size_t */ uarg[5] = (intptr_t) p->hdtr; /* 
struct sf_hdtr32 * */ uarg[6] = (intptr_t) p->sbytes; /* off_t * */ iarg[7] = p->flags; /* int */ *n_args = 8; break; } /* ksem_close */ case 400: { struct ksem_close_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_post */ case 401: { struct ksem_post_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_wait */ case 402: { struct ksem_wait_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_trywait */ case 403: { struct ksem_trywait_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* freebsd32_ksem_init */ case 404: { struct freebsd32_ksem_init_args *p = params; uarg[0] = (intptr_t) p->idp; /* semid_t * */ uarg[1] = p->value; /* unsigned int */ *n_args = 2; break; } /* freebsd32_ksem_open */ case 405: { struct freebsd32_ksem_open_args *p = params; uarg[0] = (intptr_t) p->idp; /* semid_t * */ uarg[1] = (intptr_t) p->name; /* const char * */ iarg[2] = p->oflag; /* int */ iarg[3] = p->mode; /* mode_t */ uarg[4] = p->value; /* unsigned int */ *n_args = 5; break; } /* ksem_unlink */ case 406: { struct ksem_unlink_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* ksem_getvalue */ case 407: { struct ksem_getvalue_args *p = params; iarg[0] = p->id; /* semid_t */ uarg[1] = (intptr_t) p->val; /* int * */ *n_args = 2; break; } /* ksem_destroy */ case 408: { struct ksem_destroy_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* extattr_set_link */ case 412: { struct extattr_set_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_link */ case 413: { struct extattr_get_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_link */ case 414: { struct extattr_delete_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* freebsd32_sigaction */ case 416: { struct freebsd32_sigaction_args *p = params; iarg[0] = p->sig; /* int */ uarg[1] = (intptr_t) p->act; /* struct sigaction32 * */ uarg[2] = (intptr_t) p->oact; /* struct sigaction32 * */ *n_args = 3; break; } /* freebsd32_sigreturn */ case 417: { struct freebsd32_sigreturn_args *p = params; uarg[0] = (intptr_t) p->sigcntxp; /* const struct freebsd32_ucontext * */ *n_args = 1; break; } /* freebsd32_getcontext */ case 421: { struct freebsd32_getcontext_args *p = params; uarg[0] = (intptr_t) p->ucp; /* struct freebsd32_ucontext * */ *n_args = 1; break; } /* freebsd32_setcontext */ case 422: { struct freebsd32_setcontext_args *p = params; uarg[0] = (intptr_t) p->ucp; /* const struct freebsd32_ucontext * */ *n_args = 1; break; } /* freebsd32_swapcontext */ case 423: { struct freebsd32_swapcontext_args *p = params; uarg[0] = (intptr_t) p->oucp; /* struct freebsd32_ucontext * */ uarg[1] = (intptr_t) p->ucp; /* const struct freebsd32_ucontext * */ *n_args = 2; break; } /* __acl_get_link */ case 425: { struct __acl_get_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ 
uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_link */ case 426: { struct __acl_set_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_delete_link */ case 427: { struct __acl_delete_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_aclcheck_link */ case 428: { struct __acl_aclcheck_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* sigwait */ case 429: { struct sigwait_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->sig; /* int * */ *n_args = 2; break; } /* thr_exit */ case 431: { struct thr_exit_args *p = params; uarg[0] = (intptr_t) p->state; /* long * */ *n_args = 1; break; } /* thr_self */ case 432: { struct thr_self_args *p = params; uarg[0] = (intptr_t) p->id; /* long * */ *n_args = 1; break; } /* thr_kill */ case 433: { struct thr_kill_args *p = params; iarg[0] = p->id; /* long */ iarg[1] = p->sig; /* int */ *n_args = 2; break; } /* jail_attach */ case 436: { struct jail_attach_args *p = params; iarg[0] = p->jid; /* int */ *n_args = 1; break; } /* extattr_list_fd */ case 437: { struct extattr_list_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* extattr_list_file */ case 438: { struct extattr_list_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* extattr_list_link */ case 439: { struct extattr_list_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* freebsd32_ksem_timedwait */ case 441: { struct freebsd32_ksem_timedwait_args *p = params; iarg[0] = p->id; /* semid_t */ uarg[1] = (intptr_t) p->abstime; /* const struct timespec32 * */ *n_args = 2; break; } /* freebsd32_thr_suspend */ case 442: { struct freebsd32_thr_suspend_args *p = params; uarg[0] = (intptr_t) p->timeout; /* const struct timespec32 * */ *n_args = 1; break; } /* thr_wake */ case 443: { struct thr_wake_args *p = params; iarg[0] = p->id; /* long */ *n_args = 1; break; } /* kldunloadf */ case 444: { struct kldunloadf_args *p = params; iarg[0] = p->fileid; /* int */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* audit */ case 445: { struct audit_args *p = params; uarg[0] = (intptr_t) p->record; /* const void * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* auditon */ case 446: { struct auditon_args *p = params; iarg[0] = p->cmd; /* int */ uarg[1] = (intptr_t) p->data; /* void * */ uarg[2] = p->length; /* u_int */ *n_args = 3; break; } /* getauid */ case 447: { struct getauid_args *p = params; uarg[0] = (intptr_t) p->auid; /* uid_t * */ *n_args = 1; break; } /* setauid */ case 448: { struct setauid_args *p = params; uarg[0] = (intptr_t) p->auid; /* uid_t * */ *n_args = 1; break; } /* getaudit */ case 449: { struct getaudit_args *p = params; uarg[0] = (intptr_t) p->auditinfo; /* struct auditinfo * */ *n_args = 1; 
break; } /* setaudit */ case 450: { struct setaudit_args *p = params; uarg[0] = (intptr_t) p->auditinfo; /* struct auditinfo * */ *n_args = 1; break; } /* getaudit_addr */ case 451: { struct getaudit_addr_args *p = params; uarg[0] = (intptr_t) p->auditinfo_addr; /* struct auditinfo_addr * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* setaudit_addr */ case 452: { struct setaudit_addr_args *p = params; uarg[0] = (intptr_t) p->auditinfo_addr; /* struct auditinfo_addr * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* auditctl */ case 453: { struct auditctl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* freebsd32__umtx_op */ case 454: { struct freebsd32__umtx_op_args *p = params; uarg[0] = (intptr_t) p->obj; /* void * */ iarg[1] = p->op; /* int */ uarg[2] = p->val; /* u_long */ uarg[3] = (intptr_t) p->uaddr; /* void * */ uarg[4] = (intptr_t) p->uaddr2; /* void * */ *n_args = 5; break; } /* freebsd32_thr_new */ case 455: { struct freebsd32_thr_new_args *p = params; uarg[0] = (intptr_t) p->param; /* struct thr_param32 * */ iarg[1] = p->param_size; /* int */ *n_args = 2; break; } /* freebsd32_sigqueue */ case 456: { struct freebsd32_sigqueue_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->signum; /* int */ iarg[2] = p->value; /* int */ *n_args = 3; break; } /* freebsd32_kmq_open */ case 457: { struct freebsd32_kmq_open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ uarg[3] = (intptr_t) p->attr; /* const struct mq_attr32 * */ *n_args = 4; break; } /* freebsd32_kmq_setattr */ case 458: { struct freebsd32_kmq_setattr_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->attr; /* const struct mq_attr32 * */ uarg[2] = (intptr_t) p->oattr; /* struct mq_attr32 * */ *n_args = 3; break; } /* freebsd32_kmq_timedreceive */ case 459: { struct freebsd32_kmq_timedreceive_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->msg_ptr; /* char * */ uarg[2] = p->msg_len; /* size_t */ uarg[3] = (intptr_t) p->msg_prio; /* unsigned * */ uarg[4] = (intptr_t) p->abs_timeout; /* const struct timespec32 * */ *n_args = 5; break; } /* freebsd32_kmq_timedsend */ case 460: { struct freebsd32_kmq_timedsend_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->msg_ptr; /* const char * */ uarg[2] = p->msg_len; /* size_t */ uarg[3] = p->msg_prio; /* unsigned */ uarg[4] = (intptr_t) p->abs_timeout; /* const struct timespec32 * */ *n_args = 5; break; } /* freebsd32_kmq_notify */ case 461: { struct freebsd32_kmq_notify_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->sigev; /* const struct sigevent32 * */ *n_args = 2; break; } /* kmq_unlink */ case 462: { struct kmq_unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* abort2 */ case 463: { struct abort2_args *p = params; uarg[0] = (intptr_t) p->why; /* const char * */ iarg[1] = p->nargs; /* int */ uarg[2] = (intptr_t) p->args; /* void ** */ *n_args = 3; break; } /* thr_set_name */ case 464: { struct thr_set_name_args *p = params; iarg[0] = p->id; /* long */ uarg[1] = (intptr_t) p->name; /* const char * */ *n_args = 2; break; } /* freebsd32_aio_fsync */ case 465: { struct freebsd32_aio_fsync_args *p = params; iarg[0] = p->op; /* int */ uarg[1] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 2; break; } /* rtprio_thread */ case 466: { struct rtprio_thread_args *p = params; iarg[0] = p->function; /* 
int */ iarg[1] = p->lwpid; /* lwpid_t */ uarg[2] = (intptr_t) p->rtp; /* struct rtprio * */ *n_args = 3; break; } /* sctp_peeloff */ case 471: { struct sctp_peeloff_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = p->name; /* uint32_t */ *n_args = 2; break; } /* sctp_generic_sendmsg */ case 472: { struct sctp_generic_sendmsg_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->msg; /* void * */ iarg[2] = p->mlen; /* int */ uarg[3] = (intptr_t) p->to; /* struct sockaddr * */ iarg[4] = p->tolen; /* __socklen_t */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ iarg[6] = p->flags; /* int */ *n_args = 7; break; } /* sctp_generic_sendmsg_iov */ case 473: { struct sctp_generic_sendmsg_iov_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->iov; /* struct iovec * */ iarg[2] = p->iovlen; /* int */ uarg[3] = (intptr_t) p->to; /* struct sockaddr * */ iarg[4] = p->tolen; /* __socklen_t */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ iarg[6] = p->flags; /* int */ *n_args = 7; break; } /* sctp_generic_recvmsg */ case 474: { struct sctp_generic_recvmsg_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->iov; /* struct iovec * */ iarg[2] = p->iovlen; /* int */ uarg[3] = (intptr_t) p->from; /* struct sockaddr * */ uarg[4] = (intptr_t) p->fromlenaddr; /* __socklen_t * */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ uarg[6] = (intptr_t) p->msg_flags; /* int * */ *n_args = 7; break; } #ifdef PAD64_REQUIRED /* freebsd32_pread */ case 475: { struct freebsd32_pread_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->nbyte; /* size_t */ iarg[3] = p->pad; /* int */ uarg[4] = p->offset1; /* uint32_t */ uarg[5] = p->offset2; /* uint32_t */ *n_args = 6; break; } /* freebsd32_pwrite */ case 476: { struct freebsd32_pwrite_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->nbyte; /* size_t */ iarg[3] = p->pad; /* int */ uarg[4] = p->offset1; /* uint32_t */ uarg[5] = p->offset2; /* uint32_t */ *n_args = 6; break; } /* freebsd32_mmap */ case 477: { struct freebsd32_mmap_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->prot; /* int */ iarg[3] = p->flags; /* int */ iarg[4] = p->fd; /* int */ iarg[5] = p->pad; /* int */ uarg[6] = p->pos1; /* uint32_t */ uarg[7] = p->pos2; /* uint32_t */ *n_args = 8; break; } /* freebsd32_lseek */ case 478: { struct freebsd32_lseek_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->offset1; /* uint32_t */ uarg[3] = p->offset2; /* uint32_t */ iarg[4] = p->whence; /* int */ *n_args = 5; break; } /* freebsd32_truncate */ case 479: { struct freebsd32_truncate_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->pad; /* int */ uarg[2] = p->length1; /* uint32_t */ uarg[3] = p->length2; /* uint32_t */ *n_args = 4; break; } /* freebsd32_ftruncate */ case 480: { struct freebsd32_ftruncate_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->length1; /* uint32_t */ uarg[3] = p->length2; /* uint32_t */ *n_args = 4; break; } #else /* freebsd32_pread */ case 475: { struct freebsd32_pread_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->nbyte; /* size_t */ uarg[3] = p->offset1; /* uint32_t */ uarg[4] = p->offset2; /* uint32_t */ *n_args = 5; break; } /* freebsd32_pwrite */ case 476: { struct 
freebsd32_pwrite_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->nbyte; /* size_t */ uarg[3] = p->offset1; /* uint32_t */ uarg[4] = p->offset2; /* uint32_t */ *n_args = 5; break; } /* freebsd32_mmap */ case 477: { struct freebsd32_mmap_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->prot; /* int */ iarg[3] = p->flags; /* int */ iarg[4] = p->fd; /* int */ uarg[5] = p->pos1; /* uint32_t */ uarg[6] = p->pos2; /* uint32_t */ *n_args = 7; break; } /* freebsd32_lseek */ case 478: { struct freebsd32_lseek_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->offset1; /* uint32_t */ uarg[2] = p->offset2; /* uint32_t */ iarg[3] = p->whence; /* int */ *n_args = 4; break; } /* freebsd32_truncate */ case 479: { struct freebsd32_truncate_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = p->length1; /* uint32_t */ uarg[2] = p->length2; /* uint32_t */ *n_args = 3; break; } /* freebsd32_ftruncate */ case 480: { struct freebsd32_ftruncate_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->length1; /* uint32_t */ uarg[2] = p->length2; /* uint32_t */ *n_args = 3; break; } #endif /* thr_kill2 */ case 481: { struct thr_kill2_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->id; /* long */ iarg[2] = p->sig; /* int */ *n_args = 3; break; } /* shm_open */ case 482: { struct shm_open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* shm_unlink */ case 483: { struct shm_unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* cpuset */ case 484: { struct cpuset_args *p = params; uarg[0] = (intptr_t) p->setid; /* cpusetid_t * */ *n_args = 1; break; } #ifdef PAD64_REQUIRED /* freebsd32_cpuset_setid */ case 485: { struct freebsd32_cpuset_setid_args *p = params; iarg[0] = p->which; /* cpuwhich_t */ iarg[1] = p->pad; /* int */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ iarg[4] = p->setid; /* cpusetid_t */ *n_args = 5; break; } #else /* freebsd32_cpuset_setid */ case 485: { struct freebsd32_cpuset_setid_args *p = params; iarg[0] = p->which; /* cpuwhich_t */ uarg[1] = p->id1; /* uint32_t */ uarg[2] = p->id2; /* uint32_t */ iarg[3] = p->setid; /* cpusetid_t */ *n_args = 4; break; } #endif /* freebsd32_cpuset_getid */ case 486: { struct freebsd32_cpuset_getid_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = (intptr_t) p->setid; /* cpusetid_t * */ *n_args = 5; break; } /* freebsd32_cpuset_getaffinity */ case 487: { struct freebsd32_cpuset_getaffinity_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = p->cpusetsize; /* size_t */ uarg[5] = (intptr_t) p->mask; /* cpuset_t * */ *n_args = 6; break; } /* freebsd32_cpuset_setaffinity */ case 488: { struct freebsd32_cpuset_setaffinity_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = p->cpusetsize; /* size_t */ uarg[5] = (intptr_t) p->mask; /* const cpuset_t * */ *n_args = 6; break; } /* faccessat */ case 489: { struct faccessat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* 
const char * */ iarg[2] = p->amode; /* int */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fchmodat */ case 490: { struct fchmodat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fchownat */ case 491: { struct fchownat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = p->uid; /* uid_t */ iarg[3] = p->gid; /* gid_t */ iarg[4] = p->flag; /* int */ *n_args = 5; break; } /* freebsd32_fexecve */ case 492: { struct freebsd32_fexecve_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->argv; /* uint32_t * */ uarg[2] = (intptr_t) p->envv; /* uint32_t * */ *n_args = 3; break; } /* freebsd32_futimesat */ case 494: { struct freebsd32_futimesat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->times; /* struct timeval * */ *n_args = 3; break; } /* linkat */ case 495: { struct linkat_args *p = params; iarg[0] = p->fd1; /* int */ uarg[1] = (intptr_t) p->path1; /* const char * */ iarg[2] = p->fd2; /* int */ uarg[3] = (intptr_t) p->path2; /* const char * */ iarg[4] = p->flag; /* int */ *n_args = 5; break; } /* mkdirat */ case 496: { struct mkdirat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* mkfifoat */ case 497: { struct mkfifoat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* openat */ case 499: { struct openat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flag; /* int */ iarg[3] = p->mode; /* mode_t */ *n_args = 4; break; } /* readlinkat */ case 500: { struct readlinkat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->buf; /* char * */ uarg[3] = p->bufsize; /* size_t */ *n_args = 4; break; } /* renameat */ case 501: { struct renameat_args *p = params; iarg[0] = p->oldfd; /* int */ uarg[1] = (intptr_t) p->old; /* const char * */ iarg[2] = p->newfd; /* int */ uarg[3] = (intptr_t) p->new; /* const char * */ *n_args = 4; break; } /* symlinkat */ case 502: { struct symlinkat_args *p = params; uarg[0] = (intptr_t) p->path1; /* const char * */ iarg[1] = p->fd; /* int */ uarg[2] = (intptr_t) p->path2; /* const char * */ *n_args = 3; break; } /* unlinkat */ case 503: { struct unlinkat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flag; /* int */ *n_args = 3; break; } /* posix_openpt */ case 504: { struct posix_openpt_args *p = params; iarg[0] = p->flags; /* int */ *n_args = 1; break; } /* freebsd32_jail_get */ case 506: { struct freebsd32_jail_get_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* freebsd32_jail_set */ case 507: { struct freebsd32_jail_set_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec32 * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* jail_remove */ case 508: { struct jail_remove_args *p = params; iarg[0] = p->jid; /* int */ *n_args = 1; break; } /* closefrom */ case 509: { struct closefrom_args *p = params; iarg[0] = p->lowfd; /* int */ *n_args = 
1; break; } /* freebsd32_semctl */ case 510: { struct freebsd32_semctl_args *p = params; iarg[0] = p->semid; /* int */ iarg[1] = p->semnum; /* int */ iarg[2] = p->cmd; /* int */ uarg[3] = (intptr_t) p->arg; /* union semun32 * */ *n_args = 4; break; } /* freebsd32_msgctl */ case 511: { struct freebsd32_msgctl_args *p = params; iarg[0] = p->msqid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->buf; /* struct msqid_ds32 * */ *n_args = 3; break; } /* freebsd32_shmctl */ case 512: { struct freebsd32_shmctl_args *p = params; iarg[0] = p->shmid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->buf; /* struct shmid_ds32 * */ *n_args = 3; break; } /* lpathconf */ case 513: { struct lpathconf_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* __cap_rights_get */ case 515: { struct __cap_rights_get_args *p = params; iarg[0] = p->version; /* int */ iarg[1] = p->fd; /* int */ uarg[2] = (intptr_t) p->rightsp; /* cap_rights_t * */ *n_args = 3; break; } /* cap_enter */ case 516: { *n_args = 0; break; } /* cap_getmode */ case 517: { struct cap_getmode_args *p = params; uarg[0] = (intptr_t) p->modep; /* u_int * */ *n_args = 1; break; } /* pdfork */ case 518: { struct pdfork_args *p = params; uarg[0] = (intptr_t) p->fdp; /* int * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* pdkill */ case 519: { struct pdkill_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->signum; /* int */ *n_args = 2; break; } /* pdgetpid */ case 520: { struct pdgetpid_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->pidp; /* pid_t * */ *n_args = 2; break; } /* freebsd32_pselect */ case 522: { struct freebsd32_pselect_args *p = params; iarg[0] = p->nd; /* int */ uarg[1] = (intptr_t) p->in; /* fd_set * */ uarg[2] = (intptr_t) p->ou; /* fd_set * */ uarg[3] = (intptr_t) p->ex; /* fd_set * */ uarg[4] = (intptr_t) p->ts; /* const struct timespec32 * */ uarg[5] = (intptr_t) p->sm; /* const sigset_t * */ *n_args = 6; break; } /* getloginclass */ case 523: { struct getloginclass_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* char * */ uarg[1] = p->namelen; /* size_t */ *n_args = 2; break; } /* setloginclass */ case 524: { struct setloginclass_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* const char * */ *n_args = 1; break; } /* rctl_get_racct */ case 525: { struct rctl_get_racct_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_get_rules */ case 526: { struct rctl_get_rules_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_get_limits */ case 527: { struct rctl_get_limits_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_add_rule */ case 528: { struct rctl_add_rule_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_remove_rule */ case 529: { struct rctl_remove_rule_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ 
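	/*
	 * As in the other cases handled by this switch, the generated code
	 * unpacks the syscall's argument structure into iarg[] (signed
	 * integer arguments) and uarg[] (unsigned values and pointers, the
	 * latter cast through intptr_t), and records the argument count in
	 * *n_args.
	 */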
uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } #ifdef PAD64_REQUIRED /* freebsd32_posix_fallocate */ case 530: { struct freebsd32_posix_fallocate_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->offset1; /* uint32_t */ uarg[3] = p->offset2; /* uint32_t */ uarg[4] = p->len1; /* uint32_t */ uarg[5] = p->len2; /* uint32_t */ *n_args = 6; break; } /* freebsd32_posix_fadvise */ case 531: { struct freebsd32_posix_fadvise_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->offset1; /* uint32_t */ uarg[3] = p->offset2; /* uint32_t */ uarg[4] = p->len1; /* uint32_t */ uarg[5] = p->len2; /* uint32_t */ iarg[6] = p->advice; /* int */ *n_args = 7; break; } /* freebsd32_wait6 */ case 532: { struct freebsd32_wait6_args *p = params; iarg[0] = p->idtype; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = (intptr_t) p->status; /* int * */ iarg[5] = p->options; /* int */ uarg[6] = (intptr_t) p->wrusage; /* struct wrusage32 * */ uarg[7] = (intptr_t) p->info; /* siginfo_t * */ *n_args = 8; break; } #else /* freebsd32_posix_fallocate */ case 530: { struct freebsd32_posix_fallocate_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->offset1; /* uint32_t */ uarg[2] = p->offset2; /* uint32_t */ uarg[3] = p->len1; /* uint32_t */ uarg[4] = p->len2; /* uint32_t */ *n_args = 5; break; } /* freebsd32_posix_fadvise */ case 531: { struct freebsd32_posix_fadvise_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->offset1; /* uint32_t */ uarg[2] = p->offset2; /* uint32_t */ uarg[3] = p->len1; /* uint32_t */ uarg[4] = p->len2; /* uint32_t */ iarg[5] = p->advice; /* int */ *n_args = 6; break; } /* freebsd32_wait6 */ case 532: { struct freebsd32_wait6_args *p = params; iarg[0] = p->idtype; /* int */ uarg[1] = p->id1; /* uint32_t */ uarg[2] = p->id2; /* uint32_t */ uarg[3] = (intptr_t) p->status; /* int * */ iarg[4] = p->options; /* int */ uarg[5] = (intptr_t) p->wrusage; /* struct wrusage32 * */ uarg[6] = (intptr_t) p->info; /* siginfo_t * */ *n_args = 7; break; } #endif /* cap_rights_limit */ case 533: { struct cap_rights_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->rightsp; /* cap_rights_t * */ *n_args = 2; break; } /* freebsd32_cap_ioctls_limit */ case 534: { struct freebsd32_cap_ioctls_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->cmds; /* const uint32_t * */ uarg[2] = p->ncmds; /* size_t */ *n_args = 3; break; } /* freebsd32_cap_ioctls_get */ case 535: { struct freebsd32_cap_ioctls_get_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->cmds; /* uint32_t * */ uarg[2] = p->maxcmds; /* size_t */ *n_args = 3; break; } /* cap_fcntls_limit */ case 536: { struct cap_fcntls_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->fcntlrights; /* uint32_t */ *n_args = 2; break; } /* cap_fcntls_get */ case 537: { struct cap_fcntls_get_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->fcntlrightsp; /* uint32_t * */ *n_args = 2; break; } /* bindat */ case 538: { struct bindat_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ uarg[2] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[3] = p->namelen; /* int */ *n_args = 4; break; } /* connectat */ case 539: { struct connectat_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ uarg[2] = (intptr_t) 
p->name; /* const struct sockaddr * */ iarg[3] = p->namelen; /* int */ *n_args = 4; break; } /* chflagsat */ case 540: { struct chflagsat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = p->flags; /* u_long */ iarg[3] = p->atflag; /* int */ *n_args = 4; break; } /* accept4 */ case 541: { struct accept4_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* struct sockaddr * */ uarg[2] = (intptr_t) p->anamelen; /* __socklen_t * */ iarg[3] = p->flags; /* int */ *n_args = 4; break; } /* pipe2 */ case 542: { struct pipe2_args *p = params; uarg[0] = (intptr_t) p->fildes; /* int * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* freebsd32_aio_mlock */ case 543: { struct freebsd32_aio_mlock_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb32 * */ *n_args = 1; break; } #ifdef PAD64_REQUIRED /* freebsd32_procctl */ case 544: { struct freebsd32_procctl_args *p = params; iarg[0] = p->idtype; /* int */ iarg[1] = p->pad; /* int */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ iarg[4] = p->com; /* int */ uarg[5] = (intptr_t) p->data; /* void * */ *n_args = 6; break; } #else /* freebsd32_procctl */ case 544: { struct freebsd32_procctl_args *p = params; iarg[0] = p->idtype; /* int */ uarg[1] = p->id1; /* uint32_t */ uarg[2] = p->id2; /* uint32_t */ iarg[3] = p->com; /* int */ uarg[4] = (intptr_t) p->data; /* void * */ *n_args = 5; break; } #endif /* freebsd32_ppoll */ case 545: { struct freebsd32_ppoll_args *p = params; uarg[0] = (intptr_t) p->fds; /* struct pollfd * */ uarg[1] = p->nfds; /* u_int */ uarg[2] = (intptr_t) p->ts; /* const struct timespec32 * */ uarg[3] = (intptr_t) p->set; /* const sigset_t * */ *n_args = 4; break; } /* freebsd32_futimens */ case 546: { struct freebsd32_futimens_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->times; /* struct timespec * */ *n_args = 2; break; } /* freebsd32_utimensat */ case 547: { struct freebsd32_utimensat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->times; /* struct timespec * */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fdatasync */ case 550: { struct fdatasync_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* freebsd32_fstat */ case 551: { struct freebsd32_fstat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->ub; /* struct stat32 * */ *n_args = 2; break; } /* freebsd32_fstatat */ case 552: { struct freebsd32_fstatat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->buf; /* struct stat32 * */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* freebsd32_fhstat */ case 553: { struct freebsd32_fhstat_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ uarg[1] = (intptr_t) p->sb; /* struct stat32 * */ *n_args = 2; break; } /* getdirentries */ case 554: { struct getdirentries_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->count; /* size_t */ uarg[3] = (intptr_t) p->basep; /* off_t * */ *n_args = 4; break; } /* statfs */ case 555: { struct statfs_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->buf; /* struct statfs32 * */ *n_args = 2; break; } /* fstatfs */ case 556: { struct fstatfs_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* struct statfs32 * */ *n_args = 2; break; } /* 
getfsstat */ case 557: { struct getfsstat_args *p = params; uarg[0] = (intptr_t) p->buf; /* struct statfs32 * */ iarg[1] = p->bufsize; /* long */ iarg[2] = p->mode; /* int */ *n_args = 3; break; } /* fhstatfs */ case 558: { struct fhstatfs_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ uarg[1] = (intptr_t) p->buf; /* struct statfs32 * */ *n_args = 2; break; } #ifdef PAD64_REQUIRED /* freebsd32_mknodat */ case 559: { struct freebsd32_mknodat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ iarg[3] = p->pad; /* int */ uarg[4] = p->dev1; /* uint32_t */ uarg[5] = p->dev2; /* uint32_t */ *n_args = 6; break; } #else /* freebsd32_mknodat */ case 559: { struct freebsd32_mknodat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ uarg[3] = p->dev1; /* uint32_t */ uarg[4] = p->dev2; /* uint32_t */ *n_args = 5; break; } #endif /* freebsd32_kevent */ case 560: { struct freebsd32_kevent_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->changelist; /* const struct kevent32 * */ iarg[2] = p->nchanges; /* int */ uarg[3] = (intptr_t) p->eventlist; /* struct kevent32 * */ iarg[4] = p->nevents; /* int */ uarg[5] = (intptr_t) p->timeout; /* const struct timespec32 * */ *n_args = 6; break; } /* freebsd32_cpuset_getdomain */ case 561: { struct freebsd32_cpuset_getdomain_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = p->domainsetsize; /* size_t */ uarg[5] = (intptr_t) p->mask; /* domainset_t * */ uarg[6] = (intptr_t) p->policy; /* int * */ *n_args = 7; break; } /* freebsd32_cpuset_setdomain */ case 562: { struct freebsd32_cpuset_setdomain_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ uarg[2] = p->id1; /* uint32_t */ uarg[3] = p->id2; /* uint32_t */ uarg[4] = p->domainsetsize; /* size_t */ uarg[5] = (intptr_t) p->mask; /* domainset_t * */ iarg[6] = p->policy; /* int */ *n_args = 7; break; } /* getrandom */ case 563: { struct getrandom_args *p = params; uarg[0] = (intptr_t) p->buf; /* void * */ uarg[1] = p->buflen; /* size_t */ uarg[2] = p->flags; /* unsigned int */ *n_args = 3; break; } /* getfhat */ case 564: { struct getfhat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* char * */ uarg[2] = (intptr_t) p->fhp; /* struct fhandle * */ iarg[3] = p->flags; /* int */ *n_args = 4; break; } /* fhlink */ case 565: { struct fhlink_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ uarg[1] = (intptr_t) p->to; /* const char * */ *n_args = 2; break; } /* fhlinkat */ case 566: { struct fhlinkat_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ iarg[1] = p->tofd; /* int */ uarg[2] = (intptr_t) p->to; /* const char * */ *n_args = 3; break; } /* fhreadlink */ case 567: { struct fhreadlink_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->bufsize; /* size_t */ *n_args = 3; break; } /* funlinkat */ case 568: { struct funlinkat_args *p = params; iarg[0] = p->dfd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->fd; /* int */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } default: *n_args = 0; break; }; } static void systrace_entry_setargdesc(int sysnum, int ndx, char *desc, size_t descsz) { const char 
*p = NULL; switch (sysnum) { #if !defined(PAD64_REQUIRED) && !defined(__amd64__) #define PAD64_REQUIRED #endif /* nosys */ case 0: break; /* sys_exit */ case 1: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* fork */ case 2: break; /* read */ case 3: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; default: break; }; break; /* write */ case 4: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; default: break; }; break; /* open */ case 5: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; default: break; }; break; /* close */ case 6: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_wait4 */ case 7: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland int *"; break; case 2: p = "int"; break; case 3: p = "userland struct rusage32 *"; break; default: break; }; break; /* link */ case 9: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* unlink */ case 10: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* chdir */ case 12: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* fchdir */ case 13: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* chmod */ case 15: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* chown */ case 16: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* break */ case 17: switch(ndx) { case 0: p = "userland char *"; break; default: break; }; break; /* getpid */ case 20: break; /* mount */ case 21: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; default: break; }; break; /* unmount */ case 22: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* setuid */ case 23: switch(ndx) { case 0: p = "uid_t"; break; default: break; }; break; /* getuid */ case 24: break; /* geteuid */ case 25: break; /* ptrace */ case 26: switch(ndx) { case 0: p = "int"; break; case 1: p = "pid_t"; break; case 2: p = "caddr_t"; break; case 3: p = "int"; break; default: break; }; break; /* freebsd32_recvmsg */ case 27: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct msghdr32 *"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_sendmsg */ case 28: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct msghdr32 *"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_recvfrom */ case 29: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "uint32_t"; break; case 3: p = "int"; break; case 4: p = "userland struct sockaddr *"; break; case 5: p = "uint32_t"; break; default: break; }; break; /* accept */ case 30: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland int *"; break; default: break; }; break; /* getpeername */ case 31: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland int *"; break; default: break; }; break; /* getsockname */ case 32: 
switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland int *"; break; default: break; }; break; /* access */ case 33: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* chflags */ case 34: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "u_long"; break; default: break; }; break; /* fchflags */ case 35: switch(ndx) { case 0: p = "int"; break; case 1: p = "u_long"; break; default: break; }; break; /* sync */ case 36: break; /* kill */ case 37: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* getppid */ case 39: break; /* dup */ case 41: switch(ndx) { case 0: p = "u_int"; break; default: break; }; break; /* getegid */ case 43: break; /* profil */ case 44: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; case 2: p = "size_t"; break; case 3: p = "u_int"; break; default: break; }; break; /* ktrace */ case 45: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; /* getgid */ case 47: break; /* getlogin */ case 49: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "u_int"; break; default: break; }; break; /* setlogin */ case 50: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* acct */ case 51: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* freebsd32_sigaltstack */ case 53: switch(ndx) { case 0: p = "userland struct sigaltstack32 *"; break; case 1: p = "userland struct sigaltstack32 *"; break; default: break; }; break; /* freebsd32_ioctl */ case 54: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "userland struct md_ioctl32 *"; break; default: break; }; break; /* reboot */ case 55: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* revoke */ case 56: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* symlink */ case 57: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* readlink */ case 58: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; default: break; }; break; /* freebsd32_execve */ case 59: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland uint32_t *"; break; case 2: p = "userland uint32_t *"; break; default: break; }; break; /* umask */ case 60: switch(ndx) { case 0: p = "mode_t"; break; default: break; }; break; /* chroot */ case 61: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* msync */ case 65: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* vfork */ case 66: break; /* sbrk */ case 69: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* sstk */ case 70: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* munmap */ case 73: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* freebsd32_mprotect */ case 74: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* madvise */ case 75: switch(ndx) { case 0: p = "userland 
void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* mincore */ case 78: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland char *"; break; default: break; }; break; /* getgroups */ case 79: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland gid_t *"; break; default: break; }; break; /* setgroups */ case 80: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland gid_t *"; break; default: break; }; break; /* getpgrp */ case 81: break; /* setpgid */ case 82: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_setitimer */ case 83: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct itimerval32 *"; break; case 2: p = "userland struct itimerval32 *"; break; default: break; }; break; /* swapon */ case 85: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* freebsd32_getitimer */ case 86: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct itimerval32 *"; break; default: break; }; break; /* getdtablesize */ case 89: break; /* dup2 */ case 90: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "u_int"; break; default: break; }; break; /* freebsd32_fcntl */ case 92: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_select */ case 93: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland fd_set *"; break; case 2: p = "userland fd_set *"; break; case 3: p = "userland fd_set *"; break; case 4: p = "userland struct timeval32 *"; break; default: break; }; break; /* fsync */ case 95: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* setpriority */ case 96: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* socket */ case 97: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* connect */ case 98: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sockaddr *"; break; case 2: p = "int"; break; default: break; }; break; /* getpriority */ case 100: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* bind */ case 104: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sockaddr *"; break; case 2: p = "int"; break; default: break; }; break; /* setsockopt */ case 105: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland const void *"; break; case 4: p = "int"; break; default: break; }; break; /* listen */ case 106: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_gettimeofday */ case 116: switch(ndx) { case 0: p = "userland struct timeval32 *"; break; case 1: p = "userland struct timezone *"; break; default: break; }; break; /* freebsd32_getrusage */ case 117: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct rusage32 *"; break; default: break; }; break; /* getsockopt */ case 118: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; case 4: p = "userland int *"; break; default: break; }; break; /* freebsd32_readv */ case 120: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec32 *"; break; case 2: 
p = "u_int"; break; default: break; }; break; /* freebsd32_writev */ case 121: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec32 *"; break; case 2: p = "u_int"; break; default: break; }; break; /* freebsd32_settimeofday */ case 122: switch(ndx) { case 0: p = "userland struct timeval32 *"; break; case 1: p = "userland struct timezone *"; break; default: break; }; break; /* fchown */ case 123: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* fchmod */ case 124: switch(ndx) { case 0: p = "int"; break; case 1: p = "mode_t"; break; default: break; }; break; /* setreuid */ case 126: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* setregid */ case 127: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* rename */ case 128: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* flock */ case 131: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* mkfifo */ case 132: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* sendto */ case 133: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; case 4: p = "userland const struct sockaddr *"; break; case 5: p = "int"; break; default: break; }; break; /* shutdown */ case 134: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* socketpair */ case 135: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland int *"; break; default: break; }; break; /* mkdir */ case 136: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* rmdir */ case 137: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* freebsd32_utimes */ case 138: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct timeval32 *"; break; default: break; }; break; /* freebsd32_adjtime */ case 140: switch(ndx) { case 0: p = "userland struct timeval32 *"; break; case 1: p = "userland struct timeval32 *"; break; default: break; }; break; /* setsid */ case 147: break; /* quotactl */ case 148: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; default: break; }; break; /* getfh */ case 161: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct fhandle *"; break; default: break; }; break; /* freebsd32_sysarch */ case 165: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; default: break; }; break; /* rtprio */ case 166: switch(ndx) { case 0: p = "int"; break; case 1: p = "pid_t"; break; case 2: p = "userland struct rtprio *"; break; default: break; }; break; /* freebsd32_semsys */ case 169: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; default: break; }; break; /* freebsd32_msgsys */ case 170: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; case 5: p = "int"; break; default: break; }; 
break; /* freebsd32_shmsys */ case 171: switch(ndx) { case 0: p = "uint32_t"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; default: break; }; break; /* ntp_adjtime */ case 176: switch(ndx) { case 0: p = "userland struct timex *"; break; default: break; }; break; /* setgid */ case 181: switch(ndx) { case 0: p = "gid_t"; break; default: break; }; break; /* setegid */ case 182: switch(ndx) { case 0: p = "gid_t"; break; default: break; }; break; /* seteuid */ case 183: switch(ndx) { case 0: p = "uid_t"; break; default: break; }; break; /* pathconf */ case 191: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* fpathconf */ case 192: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* getrlimit */ case 194: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct rlimit *"; break; default: break; }; break; /* setrlimit */ case 195: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct rlimit *"; break; default: break; }; break; /* nosys */ case 198: break; /* freebsd32___sysctl */ case 202: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "u_int"; break; case 2: p = "userland void *"; break; case 3: p = "userland uint32_t *"; break; case 4: p = "userland const void *"; break; case 5: p = "uint32_t"; break; default: break; }; break; /* mlock */ case 203: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* munlock */ case 204: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* undelete */ case 205: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* freebsd32_futimes */ case 206: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct timeval32 *"; break; default: break; }; break; /* getpgid */ case 207: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* poll */ case 209: switch(ndx) { case 0: p = "userland struct pollfd *"; break; case 1: p = "u_int"; break; case 2: p = "int"; break; default: break; }; break; /* lkmnosys */ case 210: break; /* lkmnosys */ case 211: break; /* lkmnosys */ case 212: break; /* lkmnosys */ case 213: break; /* lkmnosys */ case 214: break; /* lkmnosys */ case 215: break; /* lkmnosys */ case 216: break; /* lkmnosys */ case 217: break; /* lkmnosys */ case 218: break; /* lkmnosys */ case 219: break; /* semget */ case 221: switch(ndx) { case 0: p = "key_t"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* semop */ case 222: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sembuf *"; break; case 2: p = "u_int"; break; default: break; }; break; /* msgget */ case 225: switch(ndx) { case 0: p = "key_t"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_msgsnd */ case 226: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; default: break; }; break; /* freebsd32_msgrcv */ case 227: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "long"; break; case 4: p = "int"; break; default: break; }; break; /* shmat */ case 228: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "int"; break; default: break; }; break; /* shmdt */ 
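	/*
	 * The cases below continue the same pattern: for syscall "sysnum",
	 * argument index "ndx" selects a type-name string.  The "userland"
	 * prefix marks pointer arguments that refer to the 32-bit process's
	 * address space rather than to kernel memory.
	 */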
case 230: switch(ndx) { case 0: p = "userland void *"; break; default: break; }; break; /* shmget */ case 231: switch(ndx) { case 0: p = "key_t"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_clock_gettime */ case 232: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct timespec32 *"; break; default: break; }; break; /* freebsd32_clock_settime */ case 233: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland const struct timespec32 *"; break; default: break; }; break; /* freebsd32_clock_getres */ case 234: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct timespec32 *"; break; default: break; }; break; /* freebsd32_ktimer_create */ case 235: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct sigevent32 *"; break; case 2: p = "userland int *"; break; default: break; }; break; /* ktimer_delete */ case 236: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_ktimer_settime */ case 237: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct itimerspec32 *"; break; case 3: p = "userland struct itimerspec32 *"; break; default: break; }; break; /* freebsd32_ktimer_gettime */ case 238: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct itimerspec32 *"; break; default: break; }; break; /* ktimer_getoverrun */ case 239: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_nanosleep */ case 240: switch(ndx) { case 0: p = "userland const struct timespec32 *"; break; case 1: p = "userland struct timespec32 *"; break; default: break; }; break; /* ffclock_getcounter */ case 241: switch(ndx) { case 0: p = "userland ffcounter *"; break; default: break; }; break; /* ffclock_setestimate */ case 242: switch(ndx) { case 0: p = "userland struct ffclock_estimate *"; break; default: break; }; break; /* ffclock_getestimate */ case 243: switch(ndx) { case 0: p = "userland struct ffclock_estimate *"; break; default: break; }; break; /* freebsd32_clock_nanosleep */ case 244: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "int"; break; case 2: p = "userland const struct timespec32 *"; break; case 3: p = "userland struct timespec32 *"; break; default: break; }; break; /* freebsd32_clock_getcpuclockid2 */ case 247: switch(ndx) { case 0: p = "uint32_t"; break; case 1: p = "uint32_t"; break; case 2: p = "int"; break; case 3: p = "userland clockid_t *"; break; default: break; }; break; /* minherit */ case 250: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* rfork */ case 251: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* issetugid */ case 253: break; /* lchown */ case 254: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_aio_read */ case 255: switch(ndx) { case 0: p = "userland struct aiocb32 *"; break; default: break; }; break; /* freebsd32_aio_write */ case 256: switch(ndx) { case 0: p = "userland struct aiocb32 *"; break; default: break; }; break; /* freebsd32_lio_listio */ case 257: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb32 *const *"; break; case 2: p = "int"; break; case 3: p = "userland struct sigevent32 *"; break; default: break; }; break; /* lchmod */ case 274: switch(ndx) { case 0: p = "userland const char *"; 
break; case 1: p = "mode_t"; break; default: break; }; break; /* freebsd32_lutimes */ case 276: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct timeval32 *"; break; default: break; }; break; /* freebsd32_preadv */ case 289: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec32 *"; break; case 2: p = "u_int"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; /* freebsd32_pwritev */ case 290: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec32 *"; break; case 2: p = "u_int"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; /* fhopen */ case 298: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "int"; break; default: break; }; break; /* modnext */ case 300: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_modstat */ case 301: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct module_stat32 *"; break; default: break; }; break; /* modfnext */ case 302: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* modfind */ case 303: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldload */ case 304: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldunload */ case 305: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* kldfind */ case 306: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldnext */ case 307: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_kldstat */ case 308: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct kld32_file_stat *"; break; default: break; }; break; /* kldfirstmod */ case 309: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* getsid */ case 310: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* setresuid */ case 311: switch(ndx) { case 0: p = "uid_t"; break; case 1: p = "uid_t"; break; case 2: p = "uid_t"; break; default: break; }; break; /* setresgid */ case 312: switch(ndx) { case 0: p = "gid_t"; break; case 1: p = "gid_t"; break; case 2: p = "gid_t"; break; default: break; }; break; /* freebsd32_aio_return */ case 314: switch(ndx) { case 0: p = "userland struct aiocb32 *"; break; default: break; }; break; /* freebsd32_aio_suspend */ case 315: switch(ndx) { case 0: p = "userland struct aiocb32 *const *"; break; case 1: p = "int"; break; case 2: p = "userland const struct timespec32 *"; break; default: break; }; break; /* aio_cancel */ case 316: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb *"; break; default: break; }; break; /* freebsd32_aio_error */ case 317: switch(ndx) { case 0: p = "userland struct aiocb32 *"; break; default: break; }; break; /* yield */ case 321: break; /* mlockall */ case 324: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* munlockall */ case 325: break; /* __getcwd */ case 326: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; default: break; }; break; /* sched_setparam */ case 327: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland const struct sched_param *"; break; default: break; }; break; /* sched_getparam */ case 328: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland struct sched_param *"; break; default: break; }; break; /* sched_setscheduler */ 
case 329: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "int"; break; case 2: p = "userland const struct sched_param *"; break; default: break; }; break; /* sched_getscheduler */ case 330: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* sched_yield */ case 331: break; /* sched_get_priority_max */ case 332: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* sched_get_priority_min */ case 333: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_sched_rr_get_interval */ case 334: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland struct timespec32 *"; break; default: break; }; break; /* utrace */ case 335: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* kldsym */ case 337: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; default: break; }; break; /* freebsd32_jail */ case 338: switch(ndx) { case 0: p = "userland struct jail32 *"; break; default: break; }; break; /* sigprocmask */ case 340: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const sigset_t *"; break; case 2: p = "userland sigset_t *"; break; default: break; }; break; /* sigsuspend */ case 341: switch(ndx) { case 0: p = "userland const sigset_t *"; break; default: break; }; break; /* sigpending */ case 343: switch(ndx) { case 0: p = "userland sigset_t *"; break; default: break; }; break; /* freebsd32_sigtimedwait */ case 345: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland siginfo_t *"; break; case 2: p = "userland const struct timespec *"; break; default: break; }; break; /* freebsd32_sigwaitinfo */ case 346: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland siginfo_t *"; break; default: break; }; break; /* __acl_get_file */ case 347: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_file */ case 348: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_get_fd */ case 349: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_fd */ case 350: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_delete_file */ case 351: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_delete_fd */ case 352: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_aclcheck_file */ case 353: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_aclcheck_fd */ case 354: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* extattrctl */ case 355: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "int"; break; case 4: p = "userland const char *"; break; default: break; }; break; /* extattr_set_file */ case 356: 
switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_get_file */ case 357: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_file */ case 358: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* freebsd32_aio_waitcomplete */ case 359: switch(ndx) { case 0: p = "userland struct aiocb32 **"; break; case 1: p = "userland struct timespec32 *"; break; default: break; }; break; /* getresuid */ case 360: switch(ndx) { case 0: p = "userland uid_t *"; break; case 1: p = "userland uid_t *"; break; case 2: p = "userland uid_t *"; break; default: break; }; break; /* getresgid */ case 361: switch(ndx) { case 0: p = "userland gid_t *"; break; case 1: p = "userland gid_t *"; break; case 2: p = "userland gid_t *"; break; default: break; }; break; /* kqueue */ case 362: break; /* extattr_set_fd */ case 371: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_get_fd */ case 372: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_fd */ case 373: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* __setugid */ case 374: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* eaccess */ case 376: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_nmount */ case 378: switch(ndx) { case 0: p = "userland struct iovec32 *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* kenv */ case 390: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland char *"; break; case 3: p = "int"; break; default: break; }; break; /* lchflags */ case 391: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "u_long"; break; default: break; }; break; /* uuidgen */ case 392: switch(ndx) { case 0: p = "userland struct uuid *"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_sendfile */ case 393: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "size_t"; break; case 5: p = "userland struct sf_hdtr32 *"; break; case 6: p = "userland off_t *"; break; case 7: p = "int"; break; default: break; }; break; /* ksem_close */ case 400: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_post */ case 401: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_wait */ case 402: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_trywait */ case 403: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* freebsd32_ksem_init */ case 404: switch(ndx) { case 0: p = "userland 
semid_t *"; break; case 1: p = "unsigned int"; break; default: break; }; break; /* freebsd32_ksem_open */ case 405: switch(ndx) { case 0: p = "userland semid_t *"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "mode_t"; break; case 4: p = "unsigned int"; break; default: break; }; break; /* ksem_unlink */ case 406: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* ksem_getvalue */ case 407: switch(ndx) { case 0: p = "semid_t"; break; case 1: p = "userland int *"; break; default: break; }; break; /* ksem_destroy */ case 408: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* extattr_set_link */ case 412: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_get_link */ case 413: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_link */ case 414: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* freebsd32_sigaction */ case 416: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sigaction32 *"; break; case 2: p = "userland struct sigaction32 *"; break; default: break; }; break; /* freebsd32_sigreturn */ case 417: switch(ndx) { case 0: p = "userland const struct freebsd32_ucontext *"; break; default: break; }; break; /* freebsd32_getcontext */ case 421: switch(ndx) { case 0: p = "userland struct freebsd32_ucontext *"; break; default: break; }; break; /* freebsd32_setcontext */ case 422: switch(ndx) { case 0: p = "userland const struct freebsd32_ucontext *"; break; default: break; }; break; /* freebsd32_swapcontext */ case 423: switch(ndx) { case 0: p = "userland struct freebsd32_ucontext *"; break; case 1: p = "userland const struct freebsd32_ucontext *"; break; default: break; }; break; /* __acl_get_link */ case 425: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_link */ case 426: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_delete_link */ case 427: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_aclcheck_link */ case 428: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* sigwait */ case 429: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland int *"; break; default: break; }; break; /* thr_exit */ case 431: switch(ndx) { case 0: p = "userland long *"; break; default: break; }; break; /* thr_self */ case 432: switch(ndx) { case 0: p = "userland long *"; break; default: break; }; break; /* thr_kill */ case 433: switch(ndx) { case 0: p = "long"; break; case 1: p = "int"; break; default: break; }; break; /* jail_attach */ case 436: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* extattr_list_fd */ case 437: switch(ndx) { case 
0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* extattr_list_file */ case 438: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* extattr_list_link */ case 439: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* freebsd32_ksem_timedwait */ case 441: switch(ndx) { case 0: p = "semid_t"; break; case 1: p = "userland const struct timespec32 *"; break; default: break; }; break; /* freebsd32_thr_suspend */ case 442: switch(ndx) { case 0: p = "userland const struct timespec32 *"; break; default: break; }; break; /* thr_wake */ case 443: switch(ndx) { case 0: p = "long"; break; default: break; }; break; /* kldunloadf */ case 444: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* audit */ case 445: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "u_int"; break; default: break; }; break; /* auditon */ case 446: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "u_int"; break; default: break; }; break; /* getauid */ case 447: switch(ndx) { case 0: p = "userland uid_t *"; break; default: break; }; break; /* setauid */ case 448: switch(ndx) { case 0: p = "userland uid_t *"; break; default: break; }; break; /* getaudit */ case 449: switch(ndx) { case 0: p = "userland struct auditinfo *"; break; default: break; }; break; /* setaudit */ case 450: switch(ndx) { case 0: p = "userland struct auditinfo *"; break; default: break; }; break; /* getaudit_addr */ case 451: switch(ndx) { case 0: p = "userland struct auditinfo_addr *"; break; case 1: p = "u_int"; break; default: break; }; break; /* setaudit_addr */ case 452: switch(ndx) { case 0: p = "userland struct auditinfo_addr *"; break; case 1: p = "u_int"; break; default: break; }; break; /* auditctl */ case 453: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* freebsd32__umtx_op */ case 454: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "int"; break; case 2: p = "u_long"; break; case 3: p = "userland void *"; break; case 4: p = "userland void *"; break; default: break; }; break; /* freebsd32_thr_new */ case 455: switch(ndx) { case 0: p = "userland struct thr_param32 *"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_sigqueue */ case 456: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_kmq_open */ case 457: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; case 3: p = "userland const struct mq_attr32 *"; break; default: break; }; break; /* freebsd32_kmq_setattr */ case 458: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct mq_attr32 *"; break; case 2: p = "userland struct mq_attr32 *"; break; default: break; }; break; /* freebsd32_kmq_timedreceive */ case 459: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; case 3: p = "userland unsigned *"; break; case 4: p = "userland const struct timespec32 *"; break; default: break; }; break; /* freebsd32_kmq_timedsend */ case 460: switch(ndx) { case 0: p = 
"int"; break; case 1: p = "userland const char *"; break; case 2: p = "size_t"; break; case 3: p = "unsigned"; break; case 4: p = "userland const struct timespec32 *"; break; default: break; }; break; /* freebsd32_kmq_notify */ case 461: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sigevent32 *"; break; default: break; }; break; /* kmq_unlink */ case 462: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* abort2 */ case 463: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void **"; break; default: break; }; break; /* thr_set_name */ case 464: switch(ndx) { case 0: p = "long"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* freebsd32_aio_fsync */ case 465: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb32 *"; break; default: break; }; break; /* rtprio_thread */ case 466: switch(ndx) { case 0: p = "int"; break; case 1: p = "lwpid_t"; break; case 2: p = "userland struct rtprio *"; break; default: break; }; break; /* sctp_peeloff */ case 471: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; default: break; }; break; /* sctp_generic_sendmsg */ case 472: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "__socklen_t"; break; case 5: p = "userland struct sctp_sndrcvinfo *"; break; case 6: p = "int"; break; default: break; }; break; /* sctp_generic_sendmsg_iov */ case 473: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "__socklen_t"; break; case 5: p = "userland struct sctp_sndrcvinfo *"; break; case 6: p = "int"; break; default: break; }; break; /* sctp_generic_recvmsg */ case 474: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "userland __socklen_t *"; break; case 5: p = "userland struct sctp_sndrcvinfo *"; break; case 6: p = "userland int *"; break; default: break; }; break; #ifdef PAD64_REQUIRED /* freebsd32_pread */ case 475: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; case 4: p = "uint32_t"; break; case 5: p = "uint32_t"; break; default: break; }; break; /* freebsd32_pwrite */ case 476: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; case 4: p = "uint32_t"; break; case 5: p = "uint32_t"; break; default: break; }; break; /* freebsd32_mmap */ case 477: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; case 5: p = "int"; break; case 6: p = "uint32_t"; break; case 7: p = "uint32_t"; break; default: break; }; break; /* freebsd32_lseek */ case 478: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "int"; break; default: break; }; break; /* freebsd32_truncate */ case 479: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; default: break; }; break; /* freebsd32_ftruncate */ 
case 480: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; default: break; }; break; #else /* freebsd32_pread */ case 475: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; /* freebsd32_pwrite */ case 476: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; /* freebsd32_mmap */ case 477: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; case 5: p = "uint32_t"; break; case 6: p = "uint32_t"; break; default: break; }; break; /* freebsd32_lseek */ case 478: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "int"; break; default: break; }; break; /* freebsd32_truncate */ case 479: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; default: break; }; break; /* freebsd32_ftruncate */ case 480: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; default: break; }; break; #endif /* thr_kill2 */ case 481: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "long"; break; case 2: p = "int"; break; default: break; }; break; /* shm_open */ case 482: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; default: break; }; break; /* shm_unlink */ case 483: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* cpuset */ case 484: switch(ndx) { case 0: p = "userland cpusetid_t *"; break; default: break; }; break; #ifdef PAD64_REQUIRED /* freebsd32_cpuset_setid */ case 485: switch(ndx) { case 0: p = "cpuwhich_t"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "cpusetid_t"; break; default: break; }; break; #else /* freebsd32_cpuset_setid */ case 485: switch(ndx) { case 0: p = "cpuwhich_t"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "cpusetid_t"; break; default: break; }; break; #endif /* freebsd32_cpuset_getid */ case 486: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "userland cpusetid_t *"; break; default: break; }; break; /* freebsd32_cpuset_getaffinity */ case 487: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "size_t"; break; case 5: p = "userland cpuset_t *"; break; default: break; }; break; /* freebsd32_cpuset_setaffinity */ case 488: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "size_t"; break; case 5: p = "userland const cpuset_t *"; break; default: break; }; break; /* faccessat */ case 489: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; /* fchmodat */ case 490: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland 
const char *"; break; case 2: p = "mode_t"; break; case 3: p = "int"; break; default: break; }; break; /* fchownat */ case 491: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "uid_t"; break; case 3: p = "gid_t"; break; case 4: p = "int"; break; default: break; }; break; /* freebsd32_fexecve */ case 492: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland uint32_t *"; break; case 2: p = "userland uint32_t *"; break; default: break; }; break; /* freebsd32_futimesat */ case 494: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct timeval *"; break; default: break; }; break; /* linkat */ case 495: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland const char *"; break; case 4: p = "int"; break; default: break; }; break; /* mkdirat */ case 496: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; default: break; }; break; /* mkfifoat */ case 497: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; default: break; }; break; /* openat */ case 499: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "mode_t"; break; default: break; }; break; /* readlinkat */ case 500: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland char *"; break; case 3: p = "size_t"; break; default: break; }; break; /* renameat */ case 501: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland const char *"; break; default: break; }; break; /* symlinkat */ case 502: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* unlinkat */ case 503: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; default: break; }; break; /* posix_openpt */ case 504: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_jail_get */ case 506: switch(ndx) { case 0: p = "userland struct iovec32 *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* freebsd32_jail_set */ case 507: switch(ndx) { case 0: p = "userland struct iovec32 *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* jail_remove */ case 508: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* closefrom */ case 509: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_semctl */ case 510: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland union semun32 *"; break; default: break; }; break; /* freebsd32_msgctl */ case 511: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland struct msqid_ds32 *"; break; default: break; }; break; /* freebsd32_shmctl */ case 512: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland struct shmid_ds32 *"; break; default: break; }; break; /* lpathconf */ case 513: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* __cap_rights_get */ case 
515: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland cap_rights_t *"; break; default: break; }; break; /* cap_enter */ case 516: break; /* cap_getmode */ case 517: switch(ndx) { case 0: p = "userland u_int *"; break; default: break; }; break; /* pdfork */ case 518: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "int"; break; default: break; }; break; /* pdkill */ case 519: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* pdgetpid */ case 520: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland pid_t *"; break; default: break; }; break; /* freebsd32_pselect */ case 522: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland fd_set *"; break; case 2: p = "userland fd_set *"; break; case 3: p = "userland fd_set *"; break; case 4: p = "userland const struct timespec32 *"; break; case 5: p = "userland const sigset_t *"; break; default: break; }; break; /* getloginclass */ case 523: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; default: break; }; break; /* setloginclass */ case 524: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* rctl_get_racct */ case 525: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_get_rules */ case 526: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_get_limits */ case 527: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_add_rule */ case 528: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_remove_rule */ case 529: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; #ifdef PAD64_REQUIRED /* freebsd32_posix_fallocate */ case 530: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; case 5: p = "uint32_t"; break; default: break; }; break; /* freebsd32_posix_fadvise */ case 531: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; case 5: p = "uint32_t"; break; case 6: p = "int"; break; default: break; }; break; /* freebsd32_wait6 */ case 532: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "userland int *"; break; case 5: p = "int"; break; case 6: p = "userland struct wrusage32 *"; break; case 7: p = "userland siginfo_t *"; break; default: break; }; break; #else /* freebsd32_posix_fallocate */ case 530: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; /* freebsd32_posix_fadvise */ case 531: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = 
"uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; case 5: p = "int"; break; default: break; }; break; /* freebsd32_wait6 */ case 532: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "userland int *"; break; case 4: p = "int"; break; case 5: p = "userland struct wrusage32 *"; break; case 6: p = "userland siginfo_t *"; break; default: break; }; break; #endif /* cap_rights_limit */ case 533: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland cap_rights_t *"; break; default: break; }; break; /* freebsd32_cap_ioctls_limit */ case 534: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const uint32_t *"; break; case 2: p = "size_t"; break; default: break; }; break; /* freebsd32_cap_ioctls_get */ case 535: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland uint32_t *"; break; case 2: p = "size_t"; break; default: break; }; break; /* cap_fcntls_limit */ case 536: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; default: break; }; break; /* cap_fcntls_get */ case 537: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland uint32_t *"; break; default: break; }; break; /* bindat */ case 538: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct sockaddr *"; break; case 3: p = "int"; break; default: break; }; break; /* connectat */ case 539: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct sockaddr *"; break; case 3: p = "int"; break; default: break; }; break; /* chflagsat */ case 540: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "u_long"; break; case 3: p = "int"; break; default: break; }; break; /* accept4 */ case 541: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland __socklen_t *"; break; case 3: p = "int"; break; default: break; }; break; /* pipe2 */ case 542: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "int"; break; default: break; }; break; /* freebsd32_aio_mlock */ case 543: switch(ndx) { case 0: p = "userland struct aiocb32 *"; break; default: break; }; break; #ifdef PAD64_REQUIRED /* freebsd32_procctl */ case 544: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "int"; break; case 5: p = "userland void *"; break; default: break; }; break; #else /* freebsd32_procctl */ case 544: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; case 2: p = "uint32_t"; break; case 3: p = "int"; break; case 4: p = "userland void *"; break; default: break; }; break; #endif /* freebsd32_ppoll */ case 545: switch(ndx) { case 0: p = "userland struct pollfd *"; break; case 1: p = "u_int"; break; case 2: p = "userland const struct timespec32 *"; break; case 3: p = "userland const sigset_t *"; break; default: break; }; break; /* freebsd32_futimens */ case 546: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* freebsd32_utimensat */ case 547: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct timespec *"; break; case 3: p = "int"; break; default: break; }; break; /* fdatasync */ case 550: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* freebsd32_fstat */ case 551: switch(ndx) { 
case 0: p = "int"; break; case 1: p = "userland struct stat32 *"; break; default: break; }; break; /* freebsd32_fstatat */ case 552: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct stat32 *"; break; case 3: p = "int"; break; default: break; }; break; /* freebsd32_fhstat */ case 553: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "userland struct stat32 *"; break; default: break; }; break; /* getdirentries */ case 554: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; case 3: p = "userland off_t *"; break; default: break; }; break; /* statfs */ case 555: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct statfs32 *"; break; default: break; }; break; /* fstatfs */ case 556: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct statfs32 *"; break; default: break; }; break; /* getfsstat */ case 557: switch(ndx) { case 0: p = "userland struct statfs32 *"; break; case 1: p = "long"; break; case 2: p = "int"; break; default: break; }; break; /* fhstatfs */ case 558: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "userland struct statfs32 *"; break; default: break; }; break; #ifdef PAD64_REQUIRED /* freebsd32_mknodat */ case 559: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; case 3: p = "int"; break; case 4: p = "uint32_t"; break; case 5: p = "uint32_t"; break; default: break; }; break; #else /* freebsd32_mknodat */ case 559: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; case 3: p = "uint32_t"; break; case 4: p = "uint32_t"; break; default: break; }; break; #endif /* freebsd32_kevent */ case 560: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct kevent32 *"; break; case 2: p = "int"; break; case 3: p = "userland struct kevent32 *"; break; case 4: p = "int"; break; case 5: p = "userland const struct timespec32 *"; break; default: break; }; break; /* freebsd32_cpuset_getdomain */ case 561: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "size_t"; break; case 5: p = "userland domainset_t *"; break; case 6: p = "userland int *"; break; default: break; }; break; /* freebsd32_cpuset_setdomain */ case 562: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "uint32_t"; break; case 3: p = "uint32_t"; break; case 4: p = "size_t"; break; case 5: p = "userland domainset_t *"; break; case 6: p = "int"; break; default: break; }; break; /* getrandom */ case 563: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "unsigned int"; break; default: break; }; break; /* getfhat */ case 564: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "userland struct fhandle *"; break; case 3: p = "int"; break; default: break; }; break; /* fhlink */ case 565: switch(ndx) { case 0: p = "userland struct fhandle *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* fhlinkat */ case 566: switch(ndx) { case 0: p = "userland struct fhandle *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* fhreadlink */ case 567: switch(ndx) { case 
0: p = "userland struct fhandle *"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; default: break; }; break; /* funlinkat */ case 568: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; default: break; }; if (p != NULL) strlcpy(desc, p, descsz); } static void systrace_return_setargdesc(int sysnum, int ndx, char *desc, size_t descsz) { const char *p = NULL; switch (sysnum) { #if !defined(PAD64_REQUIRED) && !defined(__amd64__) #define PAD64_REQUIRED #endif /* nosys */ case 0: /* sys_exit */ case 1: if (ndx == 0 || ndx == 1) p = "void"; break; /* fork */ case 2: /* read */ case 3: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* write */ case 4: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* open */ case 5: if (ndx == 0 || ndx == 1) p = "int"; break; /* close */ case 6: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_wait4 */ case 7: if (ndx == 0 || ndx == 1) p = "int"; break; /* link */ case 9: if (ndx == 0 || ndx == 1) p = "int"; break; /* unlink */ case 10: if (ndx == 0 || ndx == 1) p = "int"; break; /* chdir */ case 12: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchdir */ case 13: if (ndx == 0 || ndx == 1) p = "int"; break; /* chmod */ case 15: if (ndx == 0 || ndx == 1) p = "int"; break; /* chown */ case 16: if (ndx == 0 || ndx == 1) p = "int"; break; /* break */ case 17: if (ndx == 0 || ndx == 1) p = "void *"; break; /* getpid */ case 20: /* mount */ case 21: if (ndx == 0 || ndx == 1) p = "int"; break; /* unmount */ case 22: if (ndx == 0 || ndx == 1) p = "int"; break; /* setuid */ case 23: if (ndx == 0 || ndx == 1) p = "int"; break; /* getuid */ case 24: /* geteuid */ case 25: /* ptrace */ case 26: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_recvmsg */ case 27: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sendmsg */ case 28: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_recvfrom */ case 29: if (ndx == 0 || ndx == 1) p = "int"; break; /* accept */ case 30: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpeername */ case 31: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsockname */ case 32: if (ndx == 0 || ndx == 1) p = "int"; break; /* access */ case 33: if (ndx == 0 || ndx == 1) p = "int"; break; /* chflags */ case 34: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchflags */ case 35: if (ndx == 0 || ndx == 1) p = "int"; break; /* sync */ case 36: /* kill */ case 37: if (ndx == 0 || ndx == 1) p = "int"; break; /* getppid */ case 39: /* dup */ case 41: if (ndx == 0 || ndx == 1) p = "int"; break; /* getegid */ case 43: /* profil */ case 44: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktrace */ case 45: if (ndx == 0 || ndx == 1) p = "int"; break; /* getgid */ case 47: /* getlogin */ case 49: if (ndx == 0 || ndx == 1) p = "int"; break; /* setlogin */ case 50: if (ndx == 0 || ndx == 1) p = "int"; break; /* acct */ case 51: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigaltstack */ case 53: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ioctl */ case 54: if (ndx == 0 || ndx == 1) p = "int"; break; /* reboot */ case 55: if (ndx == 0 || ndx == 1) p = "int"; break; /* revoke */ case 56: if (ndx == 0 || ndx == 1) p = "int"; break; /* symlink */ case 57: if (ndx == 0 || ndx == 1) p = "int"; break; /* readlink */ case 58: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_execve */ case 59: if (ndx == 0 || ndx == 1) p = "int"; break; /* umask */ case 60: if (ndx == 0 
|| ndx == 1) p = "int"; break; /* chroot */ case 61: if (ndx == 0 || ndx == 1) p = "int"; break; /* msync */ case 65: if (ndx == 0 || ndx == 1) p = "int"; break; /* vfork */ case 66: /* sbrk */ case 69: if (ndx == 0 || ndx == 1) p = "int"; break; /* sstk */ case 70: if (ndx == 0 || ndx == 1) p = "int"; break; /* munmap */ case 73: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_mprotect */ case 74: if (ndx == 0 || ndx == 1) p = "int"; break; /* madvise */ case 75: if (ndx == 0 || ndx == 1) p = "int"; break; /* mincore */ case 78: if (ndx == 0 || ndx == 1) p = "int"; break; /* getgroups */ case 79: if (ndx == 0 || ndx == 1) p = "int"; break; /* setgroups */ case 80: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpgrp */ case 81: /* setpgid */ case 82: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_setitimer */ case 83: if (ndx == 0 || ndx == 1) p = "int"; break; /* swapon */ case 85: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_getitimer */ case 86: if (ndx == 0 || ndx == 1) p = "int"; break; /* getdtablesize */ case 89: /* dup2 */ case 90: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_fcntl */ case 92: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_select */ case 93: if (ndx == 0 || ndx == 1) p = "int"; break; /* fsync */ case 95: if (ndx == 0 || ndx == 1) p = "int"; break; /* setpriority */ case 96: if (ndx == 0 || ndx == 1) p = "int"; break; /* socket */ case 97: if (ndx == 0 || ndx == 1) p = "int"; break; /* connect */ case 98: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpriority */ case 100: if (ndx == 0 || ndx == 1) p = "int"; break; /* bind */ case 104: if (ndx == 0 || ndx == 1) p = "int"; break; /* setsockopt */ case 105: if (ndx == 0 || ndx == 1) p = "int"; break; /* listen */ case 106: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_gettimeofday */ case 116: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_getrusage */ case 117: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsockopt */ case 118: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_readv */ case 120: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_writev */ case 121: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_settimeofday */ case 122: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchown */ case 123: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchmod */ case 124: if (ndx == 0 || ndx == 1) p = "int"; break; /* setreuid */ case 126: if (ndx == 0 || ndx == 1) p = "int"; break; /* setregid */ case 127: if (ndx == 0 || ndx == 1) p = "int"; break; /* rename */ case 128: if (ndx == 0 || ndx == 1) p = "int"; break; /* flock */ case 131: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkfifo */ case 132: if (ndx == 0 || ndx == 1) p = "int"; break; /* sendto */ case 133: if (ndx == 0 || ndx == 1) p = "int"; break; /* shutdown */ case 134: if (ndx == 0 || ndx == 1) p = "int"; break; /* socketpair */ case 135: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkdir */ case 136: if (ndx == 0 || ndx == 1) p = "int"; break; /* rmdir */ case 137: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_utimes */ case 138: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_adjtime */ case 140: if (ndx == 0 || ndx == 1) p = "int"; break; /* setsid */ case 147: /* quotactl */ case 148: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfh */ case 161: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sysarch */ case 165: if (ndx == 0 || ndx == 1) p = "int"; break; /* rtprio */ case 166: if (ndx == 0 || ndx == 1) p = 
"int"; break; /* freebsd32_semsys */ case 169: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_msgsys */ case 170: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_shmsys */ case 171: if (ndx == 0 || ndx == 1) p = "int"; break; /* ntp_adjtime */ case 176: if (ndx == 0 || ndx == 1) p = "int"; break; /* setgid */ case 181: if (ndx == 0 || ndx == 1) p = "int"; break; /* setegid */ case 182: if (ndx == 0 || ndx == 1) p = "int"; break; /* seteuid */ case 183: if (ndx == 0 || ndx == 1) p = "int"; break; /* pathconf */ case 191: if (ndx == 0 || ndx == 1) p = "int"; break; /* fpathconf */ case 192: if (ndx == 0 || ndx == 1) p = "int"; break; /* getrlimit */ case 194: if (ndx == 0 || ndx == 1) p = "int"; break; /* setrlimit */ case 195: if (ndx == 0 || ndx == 1) p = "int"; break; /* nosys */ case 198: /* freebsd32___sysctl */ case 202: if (ndx == 0 || ndx == 1) p = "int"; break; /* mlock */ case 203: if (ndx == 0 || ndx == 1) p = "int"; break; /* munlock */ case 204: if (ndx == 0 || ndx == 1) p = "int"; break; /* undelete */ case 205: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_futimes */ case 206: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpgid */ case 207: if (ndx == 0 || ndx == 1) p = "int"; break; /* poll */ case 209: if (ndx == 0 || ndx == 1) p = "int"; break; /* lkmnosys */ case 210: /* lkmnosys */ case 211: /* lkmnosys */ case 212: /* lkmnosys */ case 213: /* lkmnosys */ case 214: /* lkmnosys */ case 215: /* lkmnosys */ case 216: /* lkmnosys */ case 217: /* lkmnosys */ case 218: /* lkmnosys */ case 219: /* semget */ case 221: if (ndx == 0 || ndx == 1) p = "int"; break; /* semop */ case 222: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgget */ case 225: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_msgsnd */ case 226: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_msgrcv */ case 227: if (ndx == 0 || ndx == 1) p = "int"; break; /* shmat */ case 228: if (ndx == 0 || ndx == 1) p = "void *"; break; /* shmdt */ case 230: if (ndx == 0 || ndx == 1) p = "int"; break; /* shmget */ case 231: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_clock_gettime */ case 232: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_clock_settime */ case 233: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_clock_getres */ case 234: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ktimer_create */ case 235: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_delete */ case 236: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ktimer_settime */ case 237: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ktimer_gettime */ case 238: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_getoverrun */ case 239: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_nanosleep */ case 240: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_getcounter */ case 241: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_setestimate */ case 242: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_getestimate */ case 243: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_clock_nanosleep */ case 244: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_clock_getcpuclockid2 */ case 247: if (ndx == 0 || ndx == 1) p = "int"; break; /* minherit */ case 250: if (ndx == 0 || ndx == 1) p = "int"; break; /* rfork */ case 251: if (ndx == 0 || ndx == 1) p = "int"; break; /* issetugid */ case 253: /* lchown */ case 254: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_read */ case 255: if (ndx == 0 || 
ndx == 1) p = "int"; break; /* freebsd32_aio_write */ case 256: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_lio_listio */ case 257: if (ndx == 0 || ndx == 1) p = "int"; break; /* lchmod */ case 274: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_lutimes */ case 276: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_preadv */ case 289: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_pwritev */ case 290: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* fhopen */ case 298: if (ndx == 0 || ndx == 1) p = "int"; break; /* modnext */ case 300: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_modstat */ case 301: if (ndx == 0 || ndx == 1) p = "int"; break; /* modfnext */ case 302: if (ndx == 0 || ndx == 1) p = "int"; break; /* modfind */ case 303: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldload */ case 304: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldunload */ case 305: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldfind */ case 306: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldnext */ case 307: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kldstat */ case 308: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldfirstmod */ case 309: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsid */ case 310: if (ndx == 0 || ndx == 1) p = "int"; break; /* setresuid */ case 311: if (ndx == 0 || ndx == 1) p = "int"; break; /* setresgid */ case 312: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_return */ case 314: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_suspend */ case 315: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_cancel */ case 316: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_error */ case 317: if (ndx == 0 || ndx == 1) p = "int"; break; /* yield */ case 321: /* mlockall */ case 324: if (ndx == 0 || ndx == 1) p = "int"; break; /* munlockall */ case 325: /* __getcwd */ case 326: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_setparam */ case 327: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_getparam */ case 328: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_setscheduler */ case 329: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_getscheduler */ case 330: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_yield */ case 331: /* sched_get_priority_max */ case 332: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_get_priority_min */ case 333: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sched_rr_get_interval */ case 334: if (ndx == 0 || ndx == 1) p = "int"; break; /* utrace */ case 335: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldsym */ case 337: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_jail */ case 338: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigprocmask */ case 340: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigsuspend */ case 341: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigpending */ case 343: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigtimedwait */ case 345: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigwaitinfo */ case 346: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_file */ case 347: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_file */ case 348: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_fd */ case 349: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_fd */ case 350: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_file */ case 351: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_fd 
*/ case 352: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_file */ case 353: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_fd */ case 354: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattrctl */ case 355: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_set_file */ case 356: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_file */ case 357: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_file */ case 358: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_waitcomplete */ case 359: if (ndx == 0 || ndx == 1) p = "int"; break; /* getresuid */ case 360: if (ndx == 0 || ndx == 1) p = "int"; break; /* getresgid */ case 361: if (ndx == 0 || ndx == 1) p = "int"; break; /* kqueue */ case 362: /* extattr_set_fd */ case 371: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_fd */ case 372: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_fd */ case 373: if (ndx == 0 || ndx == 1) p = "int"; break; /* __setugid */ case 374: if (ndx == 0 || ndx == 1) p = "int"; break; /* eaccess */ case 376: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_nmount */ case 378: if (ndx == 0 || ndx == 1) p = "int"; break; /* kenv */ case 390: if (ndx == 0 || ndx == 1) p = "int"; break; /* lchflags */ case 391: if (ndx == 0 || ndx == 1) p = "int"; break; /* uuidgen */ case 392: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sendfile */ case 393: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_close */ case 400: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_post */ case 401: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_wait */ case 402: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_trywait */ case 403: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ksem_init */ case 404: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ksem_open */ case 405: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_unlink */ case 406: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_getvalue */ case 407: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_destroy */ case 408: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_set_link */ case 412: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_link */ case 413: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_link */ case 414: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigaction */ case 416: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigreturn */ case 417: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_getcontext */ case 421: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_setcontext */ case 422: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_swapcontext */ case 423: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_link */ case 425: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_link */ case 426: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_link */ case 427: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_link */ case 428: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigwait */ case 429: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_exit */ case 431: if (ndx == 0 || ndx == 1) p = "void"; break; /* thr_self */ case 432: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_kill */ case 433: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_attach */ case 436: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_list_fd */ case 437: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; 
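/*
 * Descriptive note (added for clarity; not part of the generated
 * systrace_args.c): each case in this switch maps a freebsd32 syscall
 * number to the C type of its return value, which
 * systrace_return_setargdesc() copies into "desc" with strlcpy() for
 * consumption by the DTrace systrace provider.  The substantive change
 * in this revision is case 500 (readlinkat), whose return type is
 * corrected from "int" to "ssize_t" further below.
 */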
/* extattr_list_file */ case 438: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_list_link */ case 439: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_ksem_timedwait */ case 441: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_thr_suspend */ case 442: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_wake */ case 443: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldunloadf */ case 444: if (ndx == 0 || ndx == 1) p = "int"; break; /* audit */ case 445: if (ndx == 0 || ndx == 1) p = "int"; break; /* auditon */ case 446: if (ndx == 0 || ndx == 1) p = "int"; break; /* getauid */ case 447: if (ndx == 0 || ndx == 1) p = "int"; break; /* setauid */ case 448: if (ndx == 0 || ndx == 1) p = "int"; break; /* getaudit */ case 449: if (ndx == 0 || ndx == 1) p = "int"; break; /* setaudit */ case 450: if (ndx == 0 || ndx == 1) p = "int"; break; /* getaudit_addr */ case 451: if (ndx == 0 || ndx == 1) p = "int"; break; /* setaudit_addr */ case 452: if (ndx == 0 || ndx == 1) p = "int"; break; /* auditctl */ case 453: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32__umtx_op */ case 454: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_thr_new */ case 455: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_sigqueue */ case 456: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kmq_open */ case 457: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kmq_setattr */ case 458: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kmq_timedreceive */ case 459: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kmq_timedsend */ case 460: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_kmq_notify */ case 461: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_unlink */ case 462: if (ndx == 0 || ndx == 1) p = "int"; break; /* abort2 */ case 463: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_set_name */ case 464: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_fsync */ case 465: if (ndx == 0 || ndx == 1) p = "int"; break; /* rtprio_thread */ case 466: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_peeloff */ case 471: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_sendmsg */ case 472: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_sendmsg_iov */ case 473: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_recvmsg */ case 474: if (ndx == 0 || ndx == 1) p = "int"; break; #ifdef PAD64_REQUIRED /* freebsd32_pread */ case 475: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_pwrite */ case 476: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_mmap */ case 477: if (ndx == 0 || ndx == 1) p = "void *"; break; /* freebsd32_lseek */ case 478: if (ndx == 0 || ndx == 1) p = "off_t"; break; /* freebsd32_truncate */ case 479: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ftruncate */ case 480: if (ndx == 0 || ndx == 1) p = "int"; break; #else /* freebsd32_pread */ case 475: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_pwrite */ case 476: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* freebsd32_mmap */ case 477: if (ndx == 0 || ndx == 1) p = "void *"; break; /* freebsd32_lseek */ case 478: if (ndx == 0 || ndx == 1) p = "off_t"; break; /* freebsd32_truncate */ case 479: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_ftruncate */ case 480: if (ndx == 0 || ndx == 1) p = "int"; break; #endif /* thr_kill2 */ case 481: if (ndx == 0 || ndx == 1) p = "int"; break; /* shm_open */ case 482: if (ndx == 0 || ndx == 1) p = "int"; 
break; /* shm_unlink */ case 483: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset */ case 484: if (ndx == 0 || ndx == 1) p = "int"; break; #ifdef PAD64_REQUIRED /* freebsd32_cpuset_setid */ case 485: if (ndx == 0 || ndx == 1) p = "int"; break; #else /* freebsd32_cpuset_setid */ case 485: if (ndx == 0 || ndx == 1) p = "int"; break; #endif /* freebsd32_cpuset_getid */ case 486: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cpuset_getaffinity */ case 487: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cpuset_setaffinity */ case 488: if (ndx == 0 || ndx == 1) p = "int"; break; /* faccessat */ case 489: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchmodat */ case 490: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchownat */ case 491: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_fexecve */ case 492: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_futimesat */ case 494: if (ndx == 0 || ndx == 1) p = "int"; break; /* linkat */ case 495: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkdirat */ case 496: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkfifoat */ case 497: if (ndx == 0 || ndx == 1) p = "int"; break; /* openat */ case 499: if (ndx == 0 || ndx == 1) p = "int"; break; /* readlinkat */ case 500: if (ndx == 0 || ndx == 1) - p = "int"; + p = "ssize_t"; break; /* renameat */ case 501: if (ndx == 0 || ndx == 1) p = "int"; break; /* symlinkat */ case 502: if (ndx == 0 || ndx == 1) p = "int"; break; /* unlinkat */ case 503: if (ndx == 0 || ndx == 1) p = "int"; break; /* posix_openpt */ case 504: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_jail_get */ case 506: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_jail_set */ case 507: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_remove */ case 508: if (ndx == 0 || ndx == 1) p = "int"; break; /* closefrom */ case 509: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_semctl */ case 510: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_msgctl */ case 511: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_shmctl */ case 512: if (ndx == 0 || ndx == 1) p = "int"; break; /* lpathconf */ case 513: if (ndx == 0 || ndx == 1) p = "int"; break; /* __cap_rights_get */ case 515: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_enter */ case 516: /* cap_getmode */ case 517: if (ndx == 0 || ndx == 1) p = "int"; break; /* pdfork */ case 518: if (ndx == 0 || ndx == 1) p = "int"; break; /* pdkill */ case 519: if (ndx == 0 || ndx == 1) p = "int"; break; /* pdgetpid */ case 520: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_pselect */ case 522: if (ndx == 0 || ndx == 1) p = "int"; break; /* getloginclass */ case 523: if (ndx == 0 || ndx == 1) p = "int"; break; /* setloginclass */ case 524: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_racct */ case 525: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_rules */ case 526: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_limits */ case 527: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_add_rule */ case 528: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_remove_rule */ case 529: if (ndx == 0 || ndx == 1) p = "int"; break; #ifdef PAD64_REQUIRED /* freebsd32_posix_fallocate */ case 530: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_posix_fadvise */ case 531: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_wait6 */ case 532: if (ndx == 0 || ndx == 1) p = "int"; break; #else /* freebsd32_posix_fallocate */ case 530: if (ndx == 0 || ndx == 1) p = "int"; 
break; /* freebsd32_posix_fadvise */ case 531: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_wait6 */ case 532: if (ndx == 0 || ndx == 1) p = "int"; break; #endif /* cap_rights_limit */ case 533: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cap_ioctls_limit */ case 534: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cap_ioctls_get */ case 535: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* cap_fcntls_limit */ case 536: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_fcntls_get */ case 537: if (ndx == 0 || ndx == 1) p = "int"; break; /* bindat */ case 538: if (ndx == 0 || ndx == 1) p = "int"; break; /* connectat */ case 539: if (ndx == 0 || ndx == 1) p = "int"; break; /* chflagsat */ case 540: if (ndx == 0 || ndx == 1) p = "int"; break; /* accept4 */ case 541: if (ndx == 0 || ndx == 1) p = "int"; break; /* pipe2 */ case 542: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_aio_mlock */ case 543: if (ndx == 0 || ndx == 1) p = "int"; break; #ifdef PAD64_REQUIRED /* freebsd32_procctl */ case 544: if (ndx == 0 || ndx == 1) p = "int"; break; #else /* freebsd32_procctl */ case 544: if (ndx == 0 || ndx == 1) p = "int"; break; #endif /* freebsd32_ppoll */ case 545: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_futimens */ case 546: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_utimensat */ case 547: if (ndx == 0 || ndx == 1) p = "int"; break; /* fdatasync */ case 550: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_fstat */ case 551: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_fstatat */ case 552: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_fhstat */ case 553: if (ndx == 0 || ndx == 1) p = "int"; break; /* getdirentries */ case 554: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* statfs */ case 555: if (ndx == 0 || ndx == 1) p = "int"; break; /* fstatfs */ case 556: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfsstat */ case 557: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhstatfs */ case 558: if (ndx == 0 || ndx == 1) p = "int"; break; #ifdef PAD64_REQUIRED /* freebsd32_mknodat */ case 559: if (ndx == 0 || ndx == 1) p = "int"; break; #else /* freebsd32_mknodat */ case 559: if (ndx == 0 || ndx == 1) p = "int"; break; #endif /* freebsd32_kevent */ case 560: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cpuset_getdomain */ case 561: if (ndx == 0 || ndx == 1) p = "int"; break; /* freebsd32_cpuset_setdomain */ case 562: if (ndx == 0 || ndx == 1) p = "int"; break; /* getrandom */ case 563: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfhat */ case 564: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhlink */ case 565: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhlinkat */ case 566: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhreadlink */ case 567: if (ndx == 0 || ndx == 1) p = "int"; break; /* funlinkat */ case 568: if (ndx == 0 || ndx == 1) p = "int"; break; default: break; }; if (p != NULL) strlcpy(desc, p, descsz); } Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/device.h =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/device.h (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/device.h (revision 346926) @@ -1,553 +1,553 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013-2016 Mellanox Technologies, Ltd. * All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_DEVICE_H_ #define _LINUX_DEVICE_H_ #include #include #include #include #include #include #include #include #include #include #include #include #include struct device; struct fwnode_handle; struct class { const char *name; struct module *owner; struct kobject kobj; devclass_t bsdclass; const struct dev_pm_ops *pm; void (*class_release)(struct class *class); void (*dev_release)(struct device *dev); char * (*devnode)(struct device *dev, umode_t *mode); }; struct dev_pm_ops { int (*suspend)(struct device *dev); int (*suspend_late)(struct device *dev); int (*resume)(struct device *dev); int (*resume_early)(struct device *dev); int (*freeze)(struct device *dev); int (*freeze_late)(struct device *dev); int (*thaw)(struct device *dev); int (*thaw_early)(struct device *dev); int (*poweroff)(struct device *dev); int (*poweroff_late)(struct device *dev); int (*restore)(struct device *dev); int (*restore_early)(struct device *dev); int (*runtime_suspend)(struct device *dev); int (*runtime_resume)(struct device *dev); int (*runtime_idle)(struct device *dev); }; struct device_driver { const char *name; const struct dev_pm_ops *pm; }; struct device_type { const char *name; }; struct device { struct device *parent; struct list_head irqents; device_t bsddev; /* * The following flag is used to determine if the LinuxKPI is * responsible for detaching the BSD device or not. If the * LinuxKPI got the BSD device using devclass_get_device(), it * must not try to detach or delete it, because it's already * done somewhere else. 
*/ bool bsddev_attached_here; struct device_driver *driver; struct device_type *type; dev_t devt; struct class *class; void (*release)(struct device *dev); struct kobject kobj; - uint64_t *dma_mask; + void *dma_priv; void *driver_data; unsigned int irq; #define LINUX_IRQ_INVALID 65535 unsigned int msix; unsigned int msix_max; const struct attribute_group **groups; struct fwnode_handle *fwnode; spinlock_t devres_lock; struct list_head devres_head; }; extern struct device linux_root_device; extern struct kobject linux_class_root; extern const struct kobj_type linux_dev_ktype; extern const struct kobj_type linux_class_ktype; struct class_attribute { struct attribute attr; ssize_t (*show)(struct class *, struct class_attribute *, char *); ssize_t (*store)(struct class *, struct class_attribute *, const char *, size_t); const void *(*namespace)(struct class *, const struct class_attribute *); }; #define CLASS_ATTR(_name, _mode, _show, _store) \ struct class_attribute class_attr_##_name = \ { { #_name, NULL, _mode }, _show, _store } struct device_attribute { struct attribute attr; ssize_t (*show)(struct device *, struct device_attribute *, char *); ssize_t (*store)(struct device *, struct device_attribute *, const char *, size_t); }; #define DEVICE_ATTR(_name, _mode, _show, _store) \ struct device_attribute dev_attr_##_name = \ __ATTR(_name, _mode, _show, _store) #define DEVICE_ATTR_RO(_name) \ struct device_attribute dev_attr_##_name = __ATTR_RO(_name) #define DEVICE_ATTR_WO(_name) \ struct device_attribute dev_attr_##_name = __ATTR_WO(_name) #define DEVICE_ATTR_RW(_name) \ struct device_attribute dev_attr_##_name = __ATTR_RW(_name) /* Simple class attribute that is just a static string */ struct class_attribute_string { struct class_attribute attr; char *str; }; static inline ssize_t show_class_attr_string(struct class *class, struct class_attribute *attr, char *buf) { struct class_attribute_string *cs; cs = container_of(attr, struct class_attribute_string, attr); return snprintf(buf, PAGE_SIZE, "%s\n", cs->str); } /* Currently read-only only */ #define _CLASS_ATTR_STRING(_name, _mode, _str) \ { __ATTR(_name, _mode, show_class_attr_string, NULL), _str } #define CLASS_ATTR_STRING(_name, _mode, _str) \ struct class_attribute_string class_attr_##_name = \ _CLASS_ATTR_STRING(_name, _mode, _str) #define dev_err(dev, fmt, ...) device_printf((dev)->bsddev, fmt, ##__VA_ARGS__) #define dev_warn(dev, fmt, ...) device_printf((dev)->bsddev, fmt, ##__VA_ARGS__) #define dev_info(dev, fmt, ...) device_printf((dev)->bsddev, fmt, ##__VA_ARGS__) #define dev_notice(dev, fmt, ...) device_printf((dev)->bsddev, fmt, ##__VA_ARGS__) #define dev_dbg(dev, fmt, ...) do { } while (0) #define dev_printk(lvl, dev, fmt, ...) \ device_printf((dev)->bsddev, fmt, ##__VA_ARGS__) #define dev_err_once(dev, ...) do { \ static bool __dev_err_once; \ if (!__dev_err_once) { \ __dev_err_once = 1; \ dev_err(dev, __VA_ARGS__); \ } \ } while (0) #define dev_err_ratelimited(dev, ...) do { \ static linux_ratelimit_t __ratelimited; \ if (linux_ratelimited(&__ratelimited)) \ dev_err(dev, __VA_ARGS__); \ } while (0) #define dev_warn_ratelimited(dev, ...) 
do { \ static linux_ratelimit_t __ratelimited; \ if (linux_ratelimited(&__ratelimited)) \ dev_warn(dev, __VA_ARGS__); \ } while (0) static inline void * dev_get_drvdata(const struct device *dev) { return dev->driver_data; } static inline void dev_set_drvdata(struct device *dev, void *data) { dev->driver_data = data; } static inline struct device * get_device(struct device *dev) { if (dev) kobject_get(&dev->kobj); return (dev); } static inline char * dev_name(const struct device *dev) { return kobject_name(&dev->kobj); } #define dev_set_name(_dev, _fmt, ...) \ kobject_set_name(&(_dev)->kobj, (_fmt), ##__VA_ARGS__) static inline void put_device(struct device *dev) { if (dev) kobject_put(&dev->kobj); } static inline int class_register(struct class *class) { class->bsdclass = devclass_create(class->name); kobject_init(&class->kobj, &linux_class_ktype); kobject_set_name(&class->kobj, class->name); kobject_add(&class->kobj, &linux_class_root, class->name); return (0); } static inline void class_unregister(struct class *class) { kobject_put(&class->kobj); } static inline struct device *kobj_to_dev(struct kobject *kobj) { return container_of(kobj, struct device, kobj); } /* * Devices are registered and created for exporting to sysfs. Create * implies register and register assumes the device fields have been * setup appropriately before being called. */ static inline void device_initialize(struct device *dev) { device_t bsddev = NULL; int unit = -1; if (dev->devt) { unit = MINOR(dev->devt); bsddev = devclass_get_device(dev->class->bsdclass, unit); dev->bsddev_attached_here = false; } else if (dev->parent == NULL) { bsddev = devclass_get_device(dev->class->bsdclass, 0); dev->bsddev_attached_here = false; } else { dev->bsddev_attached_here = true; } if (bsddev == NULL && dev->parent != NULL) { bsddev = device_add_child(dev->parent->bsddev, dev->class->kobj.name, unit); } if (bsddev != NULL) device_set_softc(bsddev, dev); dev->bsddev = bsddev; MPASS(dev->bsddev != NULL); kobject_init(&dev->kobj, &linux_dev_ktype); spin_lock_init(&dev->devres_lock); INIT_LIST_HEAD(&dev->devres_head); } static inline int device_add(struct device *dev) { if (dev->bsddev != NULL) { if (dev->devt == 0) dev->devt = makedev(0, device_get_unit(dev->bsddev)); } kobject_add(&dev->kobj, &dev->class->kobj, dev_name(dev)); return (0); } static inline void device_create_release(struct device *dev) { kfree(dev); } static inline struct device * device_create_groups_vargs(struct class *class, struct device *parent, dev_t devt, void *drvdata, const struct attribute_group **groups, const char *fmt, va_list args) { struct device *dev = NULL; int retval = -ENODEV; if (class == NULL || IS_ERR(class)) goto error; dev = kzalloc(sizeof(*dev), GFP_KERNEL); if (!dev) { retval = -ENOMEM; goto error; } dev->devt = devt; dev->class = class; dev->parent = parent; dev->groups = groups; dev->release = device_create_release; /* device_initialize() needs the class and parent to be set */ device_initialize(dev); dev_set_drvdata(dev, drvdata); retval = kobject_set_name_vargs(&dev->kobj, fmt, args); if (retval) goto error; retval = device_add(dev); if (retval) goto error; return dev; error: put_device(dev); return ERR_PTR(retval); } static inline struct device * device_create_with_groups(struct class *class, struct device *parent, dev_t devt, void *drvdata, const struct attribute_group **groups, const char *fmt, ...) 
{ va_list vargs; struct device *dev; va_start(vargs, fmt); dev = device_create_groups_vargs(class, parent, devt, drvdata, groups, fmt, vargs); va_end(vargs); return dev; } static inline bool device_is_registered(struct device *dev) { return (dev->bsddev != NULL); } static inline int device_register(struct device *dev) { device_t bsddev = NULL; int unit = -1; if (device_is_registered(dev)) goto done; if (dev->devt) { unit = MINOR(dev->devt); bsddev = devclass_get_device(dev->class->bsdclass, unit); dev->bsddev_attached_here = false; } else if (dev->parent == NULL) { bsddev = devclass_get_device(dev->class->bsdclass, 0); dev->bsddev_attached_here = false; } else { dev->bsddev_attached_here = true; } if (bsddev == NULL && dev->parent != NULL) { bsddev = device_add_child(dev->parent->bsddev, dev->class->kobj.name, unit); } if (bsddev != NULL) { if (dev->devt == 0) dev->devt = makedev(0, device_get_unit(bsddev)); device_set_softc(bsddev, dev); } dev->bsddev = bsddev; done: kobject_init(&dev->kobj, &linux_dev_ktype); kobject_add(&dev->kobj, &dev->class->kobj, dev_name(dev)); return (0); } static inline void device_unregister(struct device *dev) { device_t bsddev; bsddev = dev->bsddev; dev->bsddev = NULL; if (bsddev != NULL && dev->bsddev_attached_here) { mtx_lock(&Giant); device_delete_child(device_get_parent(bsddev), bsddev); mtx_unlock(&Giant); } put_device(dev); } static inline void device_del(struct device *dev) { device_t bsddev; bsddev = dev->bsddev; dev->bsddev = NULL; if (bsddev != NULL && dev->bsddev_attached_here) { mtx_lock(&Giant); device_delete_child(device_get_parent(bsddev), bsddev); mtx_unlock(&Giant); } } struct device *device_create(struct class *class, struct device *parent, dev_t devt, void *drvdata, const char *fmt, ...); static inline void device_destroy(struct class *class, dev_t devt) { device_t bsddev; int unit; unit = MINOR(devt); bsddev = devclass_get_device(class->bsdclass, unit); if (bsddev != NULL) device_unregister(device_get_softc(bsddev)); } #define dev_pm_set_driver_flags(dev, flags) do { \ } while (0) static inline void linux_class_kfree(struct class *class) { kfree(class); } static inline struct class * class_create(struct module *owner, const char *name) { struct class *class; int error; class = kzalloc(sizeof(*class), M_WAITOK); class->owner = owner; class->name = name; class->class_release = linux_class_kfree; error = class_register(class); if (error) { kfree(class); return (NULL); } return (class); } static inline void class_destroy(struct class *class) { if (class == NULL) return; class_unregister(class); } static inline int device_create_file(struct device *dev, const struct device_attribute *attr) { if (dev) return sysfs_create_file(&dev->kobj, &attr->attr); return -EINVAL; } static inline void device_remove_file(struct device *dev, const struct device_attribute *attr) { if (dev) sysfs_remove_file(&dev->kobj, &attr->attr); } static inline int class_create_file(struct class *class, const struct class_attribute *attr) { if (class) return sysfs_create_file(&class->kobj, &attr->attr); return -EINVAL; } static inline void class_remove_file(struct class *class, const struct class_attribute *attr) { if (class) sysfs_remove_file(&class->kobj, &attr->attr); } static inline int dev_to_node(struct device *dev) { return -1; } char *kvasprintf(gfp_t, const char *, va_list); char *kasprintf(gfp_t, const char *, ...); #endif /* _LINUX_DEVICE_H_ */ Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dma-mapping.h 
=================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dma-mapping.h (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dma-mapping.h (revision 346926) @@ -1,301 +1,294 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013, 2014 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_DMA_MAPPING_H_ #define _LINUX_DMA_MAPPING_H_ #include #include #include #include #include #include #include #include #include #include #include #include #include enum dma_data_direction { DMA_BIDIRECTIONAL = 0, DMA_TO_DEVICE = 1, DMA_FROM_DEVICE = 2, DMA_NONE = 3, }; struct dma_map_ops { void* (*alloc_coherent)(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp); void (*free_coherent)(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle); dma_addr_t (*map_page)(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs); void (*unmap_page)(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs); int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, struct dma_attrs *attrs); void (*unmap_sg)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, struct dma_attrs *attrs); void (*sync_single_for_cpu)(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir); void (*sync_single_for_device)(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction dir); void (*sync_single_range_for_cpu)(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, enum dma_data_direction dir); void (*sync_single_range_for_device)(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, enum dma_data_direction dir); void (*sync_sg_for_cpu)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir); void (*sync_sg_for_device)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir); int (*mapping_error)(struct device *dev, dma_addr_t 
dma_addr); int (*dma_supported)(struct device *dev, u64 mask); int is_phys; }; #define DMA_BIT_MASK(n) ((2ULL << ((n) - 1)) - 1ULL) +int linux_dma_tag_init(struct device *dev, u64 mask); +void *linux_dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag); +dma_addr_t linux_dma_map_phys(struct device *dev, vm_paddr_t phys, size_t len); +void linux_dma_unmap(struct device *dev, dma_addr_t dma_addr, size_t size); +int linux_dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, struct dma_attrs *attrs); +void linux_dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir, struct dma_attrs *attrs); + static inline int dma_supported(struct device *dev, u64 mask) { /* XXX busdma takes care of this elsewhere. */ return (1); } static inline int dma_set_mask(struct device *dev, u64 dma_mask) { - if (!dev->dma_mask || !dma_supported(dev, dma_mask)) + if (!dev->dma_priv || !dma_supported(dev, dma_mask)) return -EIO; - *dev->dma_mask = dma_mask; - return (0); + return (linux_dma_tag_init(dev, dma_mask)); } static inline int dma_set_coherent_mask(struct device *dev, u64 mask) { if (!dma_supported(dev, mask)) return -EIO; /* XXX Currently we don't support a separate coherent mask. */ return 0; } static inline int dma_set_mask_and_coherent(struct device *dev, u64 mask) { int r; r = dma_set_mask(dev, mask); if (r == 0) dma_set_coherent_mask(dev, mask); return (r); } static inline void * dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag) { - vm_paddr_t high; - size_t align; - void *mem; - - if (dev != NULL && dev->dma_mask) - high = *dev->dma_mask; - else if (flag & GFP_DMA32) - high = BUS_SPACE_MAXADDR_32BIT; - else - high = BUS_SPACE_MAXADDR; - align = PAGE_SIZE << get_order(size); - mem = (void *)kmem_alloc_contig(size, flag, 0, high, align, 0, - VM_MEMATTR_DEFAULT); - if (mem) - *dma_handle = vtophys(mem); - else - *dma_handle = 0; - return (mem); + return (linux_dma_alloc_coherent(dev, size, dma_handle, flag)); } static inline void * dma_zalloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag) { return (dma_alloc_coherent(dev, size, dma_handle, flag | __GFP_ZERO)); } static inline void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle) + dma_addr_t dma_addr) { + linux_dma_unmap(dev, dma_addr, size); kmem_free((vm_offset_t)cpu_addr, size); } -/* XXX This only works with no iommu. 
*/ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs) { - return vtophys(ptr); + return (linux_dma_map_phys(dev, vtophys(ptr), size)); } static inline void -dma_unmap_single_attrs(struct device *dev, dma_addr_t addr, size_t size, +dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction dir, struct dma_attrs *attrs) { + + linux_dma_unmap(dev, dma_addr, size); } static inline dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, size_t offset, size_t size, enum dma_data_direction dir, unsigned long attrs) { - return (VM_PAGE_TO_PHYS(page) + offset); + return (linux_dma_map_phys(dev, VM_PAGE_TO_PHYS(page) + offset, size)); } static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, struct dma_attrs *attrs) { - struct scatterlist *sg; - int i; - for_each_sg(sgl, sg, nents, i) - sg_dma_address(sg) = sg_phys(sg); - - return (nents); + return (linux_dma_map_sg_attrs(dev, sgl, nents, dir, attrs)); } static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir, struct dma_attrs *attrs) { + + linux_dma_unmap_sg_attrs(dev, sg, nents, dir, attrs); } static inline dma_addr_t dma_map_page(struct device *dev, struct page *page, unsigned long offset, size_t size, enum dma_data_direction direction) { - return VM_PAGE_TO_PHYS(page) + offset; + return (linux_dma_map_phys(dev, VM_PAGE_TO_PHYS(page) + offset, size)); } static inline void dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, enum dma_data_direction direction) { + + linux_dma_unmap(dev, dma_address, size); } static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction) { } static inline void dma_sync_single(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir) { dma_sync_single_for_cpu(dev, addr, size, dir); } static inline void dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction) { } static inline void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems, enum dma_data_direction direction) { } static inline void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nelems, enum dma_data_direction direction) { } static inline void dma_sync_single_range_for_cpu(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, int direction) { } static inline void dma_sync_single_range_for_device(struct device *dev, dma_addr_t dma_handle, unsigned long offset, size_t size, int direction) { } static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr) { - return (0); + return (dma_addr == 0); } static inline unsigned int dma_set_max_seg_size(struct device *dev, unsigned int size) { return (0); } #define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, NULL) #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, NULL) #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, NULL) #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, NULL) #define DEFINE_DMA_UNMAP_ADDR(name) dma_addr_t name #define DEFINE_DMA_UNMAP_LEN(name) __u32 name #define dma_unmap_addr(p, name) ((p)->name) #define dma_unmap_addr_set(p, name, v) (((p)->name) = (v)) #define dma_unmap_len(p, name) ((p)->name) #define 
dma_unmap_len_set(p, name, v) (((p)->name) = (v)) extern int uma_align_cache; #define dma_get_cache_alignment() uma_align_cache #endif /* _LINUX_DMA_MAPPING_H_ */ Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dmapool.h =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dmapool.h (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/dmapool.h (revision 346926) @@ -1,94 +1,95 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013, 2014 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_DMAPOOL_H_ #define _LINUX_DMAPOOL_H_ #include #include #include #include #include +struct dma_pool *linux_dma_pool_create(char *name, struct device *dev, + size_t size, size_t align, size_t boundary); +void linux_dma_pool_destroy(struct dma_pool *pool); +void *linux_dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, + dma_addr_t *handle); +void linux_dma_pool_free(struct dma_pool *pool, void *vaddr, + dma_addr_t dma_addr); + struct dma_pool { + struct pci_dev *pool_pdev; uma_zone_t pool_zone; + struct mtx pool_dma_lock; + bus_dma_tag_t pool_dmat; + size_t pool_entry_size; + struct mtx pool_ptree_lock; + struct pctrie pool_ptree; }; static inline struct dma_pool * dma_pool_create(char *name, struct device *dev, size_t size, size_t align, size_t boundary) { - struct dma_pool *pool; - pool = kmalloc(sizeof(*pool), GFP_KERNEL); - align--; - /* - * XXX Eventually this could use a separate allocf to honor boundary - * and physical address requirements of the device. 
- */ - pool->pool_zone = uma_zcreate(name, size, NULL, NULL, NULL, NULL, - align, UMA_ZONE_OFFPAGE|UMA_ZONE_HASH); - - return (pool); + return (linux_dma_pool_create(name, dev, size, align, boundary)); } static inline void dma_pool_destroy(struct dma_pool *pool) { - uma_zdestroy(pool->pool_zone); - kfree(pool); + + linux_dma_pool_destroy(pool); } static inline void * dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, dma_addr_t *handle) { - void *vaddr; - vaddr = uma_zalloc(pool->pool_zone, mem_flags); - if (vaddr) - *handle = vtophys(vaddr); - return (vaddr); + return (linux_dma_pool_alloc(pool, mem_flags, handle)); } static inline void * dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, dma_addr_t *handle) { return (dma_pool_alloc(pool, mem_flags | __GFP_ZERO, handle)); } static inline void -dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr) +dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma_addr) { - uma_zfree(pool->pool_zone, vaddr); + + linux_dma_pool_free(pool, vaddr, dma_addr); } #endif /* _LINUX_DMAPOOL_H_ */ Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/pci.h =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/pci.h (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/pci.h (revision 346926) @@ -1,926 +1,925 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013-2016 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
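The dma-mapping.h and dmapool.h hunks above replace the old vtophys()-based mappings and the raw UMA pool with calls into the new linux_dma_*() and linux_dma_pool_*() helpers: dma_set_mask() now initializes a per-device DMA tag via linux_dma_tag_init(), and dma_mapping_error() treats a returned address of 0 as a failed mapping. A minimal usage sketch, illustrative only (the device, buffer and mydrv_* name are hypothetical):

/* Sketch only -- assumes <linux/dma-mapping.h>. */
static int
mydrv_dma_example(struct device *dev, void *buf, size_t len)
{
	dma_addr_t addr;

	/* Setting the mask now sets up the device's DMA tag. */
	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)) != 0)
		return (-EIO);

	/* dma_map_single() now goes through linux_dma_map_phys()... */
	addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	/* ...so a failed mapping is reported as address 0. */
	if (dma_mapping_error(dev, addr))
		return (-ENOMEM);

	/* ... hardware would consume "addr" here ... */

	dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
	return (0);
}

The dma_pool_create()/dma_pool_alloc()/dma_pool_free() wrappers follow the same pattern, forwarding to their linux_dma_pool_*() counterparts instead of allocating from a private UMA zone.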
* * $FreeBSD$ */ #ifndef _LINUX_PCI_H_ #define _LINUX_PCI_H_ #define CONFIG_PCI_MSI #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct pci_device_id { uint32_t vendor; uint32_t device; uint32_t subvendor; uint32_t subdevice; uint32_t class; uint32_t class_mask; uintptr_t driver_data; }; #define MODULE_DEVICE_TABLE(bus, table) #define PCI_BASE_CLASS_DISPLAY 0x03 #define PCI_CLASS_DISPLAY_VGA 0x0300 #define PCI_CLASS_DISPLAY_OTHER 0x0380 #define PCI_BASE_CLASS_BRIDGE 0x06 #define PCI_CLASS_BRIDGE_ISA 0x0601 #define PCI_ANY_ID -1U #define PCI_VENDOR_ID_APPLE 0x106b #define PCI_VENDOR_ID_ASUSTEK 0x1043 #define PCI_VENDOR_ID_ATI 0x1002 #define PCI_VENDOR_ID_DELL 0x1028 #define PCI_VENDOR_ID_HP 0x103c #define PCI_VENDOR_ID_IBM 0x1014 #define PCI_VENDOR_ID_INTEL 0x8086 #define PCI_VENDOR_ID_MELLANOX 0x15b3 #define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 #define PCI_VENDOR_ID_SERVERWORKS 0x1166 #define PCI_VENDOR_ID_SONY 0x104d #define PCI_VENDOR_ID_TOPSPIN 0x1867 #define PCI_VENDOR_ID_VIA 0x1106 #define PCI_SUBVENDOR_ID_REDHAT_QUMRANET 0x1af4 #define PCI_DEVICE_ID_ATI_RADEON_QY 0x5159 #define PCI_DEVICE_ID_MELLANOX_TAVOR 0x5a44 #define PCI_DEVICE_ID_MELLANOX_TAVOR_BRIDGE 0x5a46 #define PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT 0x6278 #define PCI_DEVICE_ID_MELLANOX_ARBEL 0x6282 #define PCI_DEVICE_ID_MELLANOX_SINAI_OLD 0x5e8c #define PCI_DEVICE_ID_MELLANOX_SINAI 0x6274 #define PCI_SUBDEVICE_ID_QEMU 0x1100 #define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07)) #define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f) #define PCI_FUNC(devfn) ((devfn) & 0x07) #define PCI_BUS_NUM(devfn) (((devfn) >> 8) & 0xff) #define PCI_VDEVICE(_vendor, _device) \ .vendor = PCI_VENDOR_ID_##_vendor, .device = (_device), \ .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID #define PCI_DEVICE(_vendor, _device) \ .vendor = (_vendor), .device = (_device), \ .subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID #define to_pci_dev(n) container_of(n, struct pci_dev, dev) #define PCI_VENDOR_ID PCIR_DEVVENDOR #define PCI_COMMAND PCIR_COMMAND #define PCI_EXP_DEVCTL PCIER_DEVICE_CTL /* Device Control */ #define PCI_EXP_LNKCTL PCIER_LINK_CTL /* Link Control */ #define PCI_EXP_FLAGS_TYPE PCIEM_FLAGS_TYPE /* Device/Port type */ #define PCI_EXP_DEVCAP PCIER_DEVICE_CAP /* Device capabilities */ #define PCI_EXP_DEVSTA PCIER_DEVICE_STA /* Device Status */ #define PCI_EXP_LNKCAP PCIER_LINK_CAP /* Link Capabilities */ #define PCI_EXP_LNKSTA PCIER_LINK_STA /* Link Status */ #define PCI_EXP_SLTCAP PCIER_SLOT_CAP /* Slot Capabilities */ #define PCI_EXP_SLTCTL PCIER_SLOT_CTL /* Slot Control */ #define PCI_EXP_SLTSTA PCIER_SLOT_STA /* Slot Status */ #define PCI_EXP_RTCTL PCIER_ROOT_CTL /* Root Control */ #define PCI_EXP_RTCAP PCIER_ROOT_CAP /* Root Capabilities */ #define PCI_EXP_RTSTA PCIER_ROOT_STA /* Root Status */ #define PCI_EXP_DEVCAP2 PCIER_DEVICE_CAP2 /* Device Capabilities 2 */ #define PCI_EXP_DEVCTL2 PCIER_DEVICE_CTL2 /* Device Control 2 */ #define PCI_EXP_LNKCAP2 PCIER_LINK_CAP2 /* Link Capabilities 2 */ #define PCI_EXP_LNKCTL2 PCIER_LINK_CTL2 /* Link Control 2 */ #define PCI_EXP_LNKSTA2 PCIER_LINK_STA2 /* Link Status 2 */ #define PCI_EXP_FLAGS PCIER_FLAGS /* Capabilities register */ #define PCI_EXP_FLAGS_VERS PCIEM_FLAGS_VERSION /* Capability version */ #define PCI_EXP_TYPE_ROOT_PORT PCIEM_TYPE_ROOT_PORT /* Root Port */ #define PCI_EXP_TYPE_ENDPOINT PCIEM_TYPE_ENDPOINT /* Express Endpoint */ #define PCI_EXP_TYPE_LEG_END 
PCIEM_TYPE_LEGACY_ENDPOINT /* Legacy Endpoint */ #define PCI_EXP_TYPE_DOWNSTREAM PCIEM_TYPE_DOWNSTREAM_PORT /* Downstream Port */ #define PCI_EXP_FLAGS_SLOT PCIEM_FLAGS_SLOT /* Slot implemented */ #define PCI_EXP_TYPE_RC_EC PCIEM_TYPE_ROOT_EC /* Root Complex Event Collector */ #define PCI_EXP_LNKCAP_SLS_2_5GB 0x01 /* Supported Link Speed 2.5GT/s */ #define PCI_EXP_LNKCAP_SLS_5_0GB 0x02 /* Supported Link Speed 5.0GT/s */ #define PCI_EXP_LNKCAP_SLS_8_0GB 0x04 /* Supported Link Speed 8.0GT/s */ #define PCI_EXP_LNKCAP_SLS_16_0GB 0x08 /* Supported Link Speed 16.0GT/s */ #define PCI_EXP_LNKCAP_MLW 0x03f0 /* Maximum Link Width */ #define PCI_EXP_LNKCAP2_SLS_2_5GB 0x02 /* Supported Link Speed 2.5GT/s */ #define PCI_EXP_LNKCAP2_SLS_5_0GB 0x04 /* Supported Link Speed 5.0GT/s */ #define PCI_EXP_LNKCAP2_SLS_8_0GB 0x08 /* Supported Link Speed 8.0GT/s */ #define PCI_EXP_LNKCAP2_SLS_16_0GB 0x10 /* Supported Link Speed 16.0GT/s */ #define PCI_EXP_LNKCTL_HAWD PCIEM_LINK_CTL_HAWD #define PCI_EXP_LNKCAP_CLKPM 0x00040000 #define PCI_EXP_DEVSTA_TRPND 0x0020 #define IORESOURCE_MEM (1 << SYS_RES_MEMORY) #define IORESOURCE_IO (1 << SYS_RES_IOPORT) #define IORESOURCE_IRQ (1 << SYS_RES_IRQ) enum pci_bus_speed { PCI_SPEED_UNKNOWN = -1, PCIE_SPEED_2_5GT, PCIE_SPEED_5_0GT, PCIE_SPEED_8_0GT, PCIE_SPEED_16_0GT, }; enum pcie_link_width { PCIE_LNK_WIDTH_RESRV = 0x00, PCIE_LNK_X1 = 0x01, PCIE_LNK_X2 = 0x02, PCIE_LNK_X4 = 0x04, PCIE_LNK_X8 = 0x08, PCIE_LNK_X12 = 0x0c, PCIE_LNK_X16 = 0x10, PCIE_LNK_X32 = 0x20, PCIE_LNK_WIDTH_UNKNOWN = 0xff, }; typedef int pci_power_t; #define PCI_D0 PCI_POWERSTATE_D0 #define PCI_D1 PCI_POWERSTATE_D1 #define PCI_D2 PCI_POWERSTATE_D2 #define PCI_D3hot PCI_POWERSTATE_D3 #define PCI_D3cold 4 #define PCI_POWER_ERROR PCI_POWERSTATE_UNKNOWN struct pci_dev; struct pci_driver { struct list_head links; char *name; const struct pci_device_id *id_table; int (*probe)(struct pci_dev *dev, const struct pci_device_id *id); void (*remove)(struct pci_dev *dev); int (*suspend) (struct pci_dev *dev, pm_message_t state); /* Device suspended */ int (*resume) (struct pci_dev *dev); /* Device woken up */ void (*shutdown) (struct pci_dev *dev); /* Device shutdown */ driver_t bsddriver; devclass_t bsdclass; struct device_driver driver; const struct pci_error_handlers *err_handler; bool isdrm; }; struct pci_bus { struct pci_dev *self; int number; }; extern struct list_head pci_drivers; extern struct list_head pci_devices; extern spinlock_t pci_lock; #define __devexit_p(x) x struct pci_dev { struct device dev; struct list_head links; struct pci_driver *pdrv; struct pci_bus *bus; - uint64_t dma_mask; uint16_t device; uint16_t vendor; uint16_t subsystem_vendor; uint16_t subsystem_device; unsigned int irq; unsigned int devfn; uint32_t class; uint8_t revision; }; static inline struct resource_list_entry * linux_pci_get_rle(struct pci_dev *pdev, int type, int rid) { struct pci_devinfo *dinfo; struct resource_list *rl; dinfo = device_get_ivars(pdev->dev.bsddev); rl = &dinfo->resources; return resource_list_find(rl, type, rid); } static inline struct resource_list_entry * linux_pci_get_bar(struct pci_dev *pdev, int bar) { struct resource_list_entry *rle; bar = PCIR_BAR(bar); if ((rle = linux_pci_get_rle(pdev, SYS_RES_MEMORY, bar)) == NULL) rle = linux_pci_get_rle(pdev, SYS_RES_IOPORT, bar); return (rle); } static inline struct device * linux_pci_find_irq_dev(unsigned int irq) { struct pci_dev *pdev; struct device *found; found = NULL; spin_lock(&pci_lock); list_for_each_entry(pdev, &pci_devices, links) { if (irq == 
pdev->dev.irq || (irq >= pdev->dev.msix && irq < pdev->dev.msix_max)) { found = &pdev->dev; break; } } spin_unlock(&pci_lock); return (found); } static inline unsigned long pci_resource_start(struct pci_dev *pdev, int bar) { struct resource_list_entry *rle; if ((rle = linux_pci_get_bar(pdev, bar)) == NULL) return (0); return rle->start; } static inline unsigned long pci_resource_len(struct pci_dev *pdev, int bar) { struct resource_list_entry *rle; if ((rle = linux_pci_get_bar(pdev, bar)) == NULL) return (0); return rle->count; } static inline int pci_resource_type(struct pci_dev *pdev, int bar) { struct pci_map *pm; pm = pci_find_bar(pdev->dev.bsddev, PCIR_BAR(bar)); if (!pm) return (-1); if (PCI_BAR_IO(pm->pm_value)) return (SYS_RES_IOPORT); else return (SYS_RES_MEMORY); } /* * All drivers just seem to want to inspect the type not flags. */ static inline int pci_resource_flags(struct pci_dev *pdev, int bar) { int type; type = pci_resource_type(pdev, bar); if (type < 0) return (0); return (1 << type); } static inline const char * pci_name(struct pci_dev *d) { return device_get_desc(d->dev.bsddev); } static inline void * pci_get_drvdata(struct pci_dev *pdev) { return dev_get_drvdata(&pdev->dev); } static inline void pci_set_drvdata(struct pci_dev *pdev, void *data) { dev_set_drvdata(&pdev->dev, data); } static inline int pci_enable_device(struct pci_dev *pdev) { pci_enable_io(pdev->dev.bsddev, SYS_RES_IOPORT); pci_enable_io(pdev->dev.bsddev, SYS_RES_MEMORY); return (0); } static inline void pci_disable_device(struct pci_dev *pdev) { pci_disable_io(pdev->dev.bsddev, SYS_RES_IOPORT); pci_disable_io(pdev->dev.bsddev, SYS_RES_MEMORY); pci_disable_busmaster(pdev->dev.bsddev); } static inline int pci_set_master(struct pci_dev *pdev) { pci_enable_busmaster(pdev->dev.bsddev); return (0); } static inline int pci_set_power_state(struct pci_dev *pdev, int state) { pci_set_powerstate(pdev->dev.bsddev, state); return (0); } static inline int pci_clear_master(struct pci_dev *pdev) { pci_disable_busmaster(pdev->dev.bsddev); return (0); } static inline int pci_request_region(struct pci_dev *pdev, int bar, const char *res_name) { int rid; int type; type = pci_resource_type(pdev, bar); if (type < 0) return (-ENODEV); rid = PCIR_BAR(bar); if (bus_alloc_resource_any(pdev->dev.bsddev, type, &rid, RF_ACTIVE) == NULL) return (-EINVAL); return (0); } static inline void pci_release_region(struct pci_dev *pdev, int bar) { struct resource_list_entry *rle; if ((rle = linux_pci_get_bar(pdev, bar)) == NULL) return; bus_release_resource(pdev->dev.bsddev, rle->type, rle->rid, rle->res); } static inline void pci_release_regions(struct pci_dev *pdev) { int i; for (i = 0; i <= PCIR_MAX_BAR_0; i++) pci_release_region(pdev, i); } static inline int pci_request_regions(struct pci_dev *pdev, const char *res_name) { int error; int i; for (i = 0; i <= PCIR_MAX_BAR_0; i++) { error = pci_request_region(pdev, i, res_name); if (error && error != -ENODEV) { pci_release_regions(pdev); return (error); } } return (0); } static inline void pci_disable_msix(struct pci_dev *pdev) { pci_release_msi(pdev->dev.bsddev); /* * The MSIX IRQ numbers associated with this PCI device are no * longer valid and might be re-assigned. 
Make sure * linux_pci_find_irq_dev() does no longer see them by * resetting their references to zero: */ pdev->dev.msix = 0; pdev->dev.msix_max = 0; } static inline bus_addr_t pci_bus_address(struct pci_dev *pdev, int bar) { return (pci_resource_start(pdev, bar)); } #define PCI_CAP_ID_EXP PCIY_EXPRESS #define PCI_CAP_ID_PCIX PCIY_PCIX #define PCI_CAP_ID_AGP PCIY_AGP #define PCI_CAP_ID_PM PCIY_PMG #define PCI_EXP_DEVCTL PCIER_DEVICE_CTL #define PCI_EXP_DEVCTL_PAYLOAD PCIEM_CTL_MAX_PAYLOAD #define PCI_EXP_DEVCTL_READRQ PCIEM_CTL_MAX_READ_REQUEST #define PCI_EXP_LNKCTL PCIER_LINK_CTL #define PCI_EXP_LNKSTA PCIER_LINK_STA static inline int pci_find_capability(struct pci_dev *pdev, int capid) { int reg; if (pci_find_cap(pdev->dev.bsddev, capid, ®)) return (0); return (reg); } static inline int pci_pcie_cap(struct pci_dev *dev) { return pci_find_capability(dev, PCI_CAP_ID_EXP); } static inline int pci_read_config_byte(struct pci_dev *pdev, int where, u8 *val) { *val = (u8)pci_read_config(pdev->dev.bsddev, where, 1); return (0); } static inline int pci_read_config_word(struct pci_dev *pdev, int where, u16 *val) { *val = (u16)pci_read_config(pdev->dev.bsddev, where, 2); return (0); } static inline int pci_read_config_dword(struct pci_dev *pdev, int where, u32 *val) { *val = (u32)pci_read_config(pdev->dev.bsddev, where, 4); return (0); } static inline int pci_write_config_byte(struct pci_dev *pdev, int where, u8 val) { pci_write_config(pdev->dev.bsddev, where, val, 1); return (0); } static inline int pci_write_config_word(struct pci_dev *pdev, int where, u16 val) { pci_write_config(pdev->dev.bsddev, where, val, 2); return (0); } static inline int pci_write_config_dword(struct pci_dev *pdev, int where, u32 val) { pci_write_config(pdev->dev.bsddev, where, val, 4); return (0); } int linux_pci_register_driver(struct pci_driver *pdrv); int linux_pci_register_drm_driver(struct pci_driver *pdrv); void linux_pci_unregister_driver(struct pci_driver *pdrv); #define pci_register_driver(pdrv) linux_pci_register_driver(pdrv) #define pci_unregister_driver(pdrv) linux_pci_unregister_driver(pdrv) struct msix_entry { int entry; int vector; }; /* * Enable msix, positive errors indicate actual number of available * vectors. Negative errors are failures. * * NB: define added to prevent this definition of pci_enable_msix from * clashing with the native FreeBSD version. */ #define pci_enable_msix(...) \ linux_pci_enable_msix(__VA_ARGS__) static inline int pci_enable_msix(struct pci_dev *pdev, struct msix_entry *entries, int nreq) { struct resource_list_entry *rle; int error; int avail; int i; avail = pci_msix_count(pdev->dev.bsddev); if (avail < nreq) { if (avail == 0) return -EINVAL; return avail; } avail = nreq; if ((error = -pci_alloc_msix(pdev->dev.bsddev, &avail)) != 0) return error; /* * Handle case where "pci_alloc_msix()" may allocate less * interrupts than available and return with no error: */ if (avail < nreq) { pci_release_msi(pdev->dev.bsddev); return avail; } rle = linux_pci_get_rle(pdev, SYS_RES_IRQ, 1); pdev->dev.msix = rle->start; pdev->dev.msix_max = rle->start + avail; for (i = 0; i < nreq; i++) entries[i].vector = pdev->dev.msix + i; return (0); } #define pci_enable_msix_range(...) 
\ linux_pci_enable_msix_range(__VA_ARGS__) static inline int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int minvec, int maxvec) { int nvec = maxvec; int rc; if (maxvec < minvec) return (-ERANGE); do { rc = pci_enable_msix(dev, entries, nvec); if (rc < 0) { return (rc); } else if (rc > 0) { if (rc < minvec) return (-ENOSPC); nvec = rc; } } while (rc); return (nvec); } static inline int pci_channel_offline(struct pci_dev *pdev) { return (pci_get_vendor(pdev->dev.bsddev) == 0xffff); } static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn) { return -ENODEV; } static inline void pci_disable_sriov(struct pci_dev *dev) { } #define DEFINE_PCI_DEVICE_TABLE(_table) \ const struct pci_device_id _table[] __devinitdata /* XXX This should not be necessary. */ #define pcix_set_mmrbc(d, v) 0 #define pcix_get_max_mmrbc(d) 0 #define pcie_set_readrq(d, v) 0 #define PCI_DMA_BIDIRECTIONAL 0 #define PCI_DMA_TODEVICE 1 #define PCI_DMA_FROMDEVICE 2 #define PCI_DMA_NONE 3 #define pci_pool dma_pool #define pci_pool_destroy(...) dma_pool_destroy(__VA_ARGS__) #define pci_pool_alloc(...) dma_pool_alloc(__VA_ARGS__) #define pci_pool_free(...) dma_pool_free(__VA_ARGS__) #define pci_pool_create(_name, _pdev, _size, _align, _alloc) \ dma_pool_create(_name, &(_pdev)->dev, _size, _align, _alloc) #define pci_free_consistent(_hwdev, _size, _vaddr, _dma_handle) \ dma_free_coherent((_hwdev) == NULL ? NULL : &(_hwdev)->dev, \ _size, _vaddr, _dma_handle) #define pci_map_sg(_hwdev, _sg, _nents, _dir) \ dma_map_sg((_hwdev) == NULL ? NULL : &(_hwdev->dev), \ _sg, _nents, (enum dma_data_direction)_dir) #define pci_map_single(_hwdev, _ptr, _size, _dir) \ dma_map_single((_hwdev) == NULL ? NULL : &(_hwdev->dev), \ (_ptr), (_size), (enum dma_data_direction)_dir) #define pci_unmap_single(_hwdev, _addr, _size, _dir) \ dma_unmap_single((_hwdev) == NULL ? NULL : &(_hwdev)->dev, \ _addr, _size, (enum dma_data_direction)_dir) #define pci_unmap_sg(_hwdev, _sg, _nents, _dir) \ dma_unmap_sg((_hwdev) == NULL ? NULL : &(_hwdev)->dev, \ _sg, _nents, (enum dma_data_direction)_dir) #define pci_map_page(_hwdev, _page, _offset, _size, _dir) \ dma_map_page((_hwdev) == NULL ? NULL : &(_hwdev)->dev, _page,\ _offset, _size, (enum dma_data_direction)_dir) #define pci_unmap_page(_hwdev, _dma_address, _size, _dir) \ dma_unmap_page((_hwdev) == NULL ? 
NULL : &(_hwdev)->dev, \ _dma_address, _size, (enum dma_data_direction)_dir) #define pci_set_dma_mask(_pdev, mask) dma_set_mask(&(_pdev)->dev, (mask)) #define pci_dma_mapping_error(_pdev, _dma_addr) \ dma_mapping_error(&(_pdev)->dev, _dma_addr) #define pci_set_consistent_dma_mask(_pdev, _mask) \ dma_set_coherent_mask(&(_pdev)->dev, (_mask)) #define DECLARE_PCI_UNMAP_ADDR(x) DEFINE_DMA_UNMAP_ADDR(x); #define DECLARE_PCI_UNMAP_LEN(x) DEFINE_DMA_UNMAP_LEN(x); #define pci_unmap_addr dma_unmap_addr #define pci_unmap_addr_set dma_unmap_addr_set #define pci_unmap_len dma_unmap_len #define pci_unmap_len_set dma_unmap_len_set typedef unsigned int __bitwise pci_channel_state_t; typedef unsigned int __bitwise pci_ers_result_t; enum pci_channel_state { pci_channel_io_normal = 1, pci_channel_io_frozen = 2, pci_channel_io_perm_failure = 3, }; enum pci_ers_result { PCI_ERS_RESULT_NONE = 1, PCI_ERS_RESULT_CAN_RECOVER = 2, PCI_ERS_RESULT_NEED_RESET = 3, PCI_ERS_RESULT_DISCONNECT = 4, PCI_ERS_RESULT_RECOVERED = 5, }; /* PCI bus error event callbacks */ struct pci_error_handlers { pci_ers_result_t (*error_detected)(struct pci_dev *dev, enum pci_channel_state error); pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev); pci_ers_result_t (*link_reset)(struct pci_dev *dev); pci_ers_result_t (*slot_reset)(struct pci_dev *dev); void (*resume)(struct pci_dev *dev); }; /* FreeBSD does not support SRIOV - yet */ static inline struct pci_dev *pci_physfn(struct pci_dev *dev) { return dev; } static inline bool pci_is_pcie(struct pci_dev *dev) { return !!pci_pcie_cap(dev); } static inline u16 pcie_flags_reg(struct pci_dev *dev) { int pos; u16 reg16; pos = pci_find_capability(dev, PCI_CAP_ID_EXP); if (!pos) return 0; pci_read_config_word(dev, pos + PCI_EXP_FLAGS, ®16); return reg16; } static inline int pci_pcie_type(struct pci_dev *dev) { return (pcie_flags_reg(dev) & PCI_EXP_FLAGS_TYPE) >> 4; } static inline int pcie_cap_version(struct pci_dev *dev) { return pcie_flags_reg(dev) & PCI_EXP_FLAGS_VERS; } static inline bool pcie_cap_has_lnkctl(struct pci_dev *dev) { int type = pci_pcie_type(dev); return pcie_cap_version(dev) > 1 || type == PCI_EXP_TYPE_ROOT_PORT || type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END; } static inline bool pcie_cap_has_devctl(const struct pci_dev *dev) { return true; } static inline bool pcie_cap_has_sltctl(struct pci_dev *dev) { int type = pci_pcie_type(dev); return pcie_cap_version(dev) > 1 || type == PCI_EXP_TYPE_ROOT_PORT || (type == PCI_EXP_TYPE_DOWNSTREAM && pcie_flags_reg(dev) & PCI_EXP_FLAGS_SLOT); } static inline bool pcie_cap_has_rtctl(struct pci_dev *dev) { int type = pci_pcie_type(dev); return pcie_cap_version(dev) > 1 || type == PCI_EXP_TYPE_ROOT_PORT || type == PCI_EXP_TYPE_RC_EC; } static bool pcie_capability_reg_implemented(struct pci_dev *dev, int pos) { if (!pci_is_pcie(dev)) return false; switch (pos) { case PCI_EXP_FLAGS_TYPE: return true; case PCI_EXP_DEVCAP: case PCI_EXP_DEVCTL: case PCI_EXP_DEVSTA: return pcie_cap_has_devctl(dev); case PCI_EXP_LNKCAP: case PCI_EXP_LNKCTL: case PCI_EXP_LNKSTA: return pcie_cap_has_lnkctl(dev); case PCI_EXP_SLTCAP: case PCI_EXP_SLTCTL: case PCI_EXP_SLTSTA: return pcie_cap_has_sltctl(dev); case PCI_EXP_RTCTL: case PCI_EXP_RTCAP: case PCI_EXP_RTSTA: return pcie_cap_has_rtctl(dev); case PCI_EXP_DEVCAP2: case PCI_EXP_DEVCTL2: case PCI_EXP_LNKCAP2: case PCI_EXP_LNKCTL2: case PCI_EXP_LNKSTA2: return pcie_cap_version(dev) > 1; default: return false; } } static inline int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 
*dst) { if (pos & 3) return -EINVAL; if (!pcie_capability_reg_implemented(dev, pos)) return -EINVAL; return pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, dst); } static inline int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *dst) { if (pos & 3) return -EINVAL; if (!pcie_capability_reg_implemented(dev, pos)) return -EINVAL; return pci_read_config_word(dev, pci_pcie_cap(dev) + pos, dst); } static inline int pcie_capability_write_word(struct pci_dev *dev, int pos, u16 val) { if (pos & 1) return -EINVAL; if (!pcie_capability_reg_implemented(dev, pos)) return 0; return pci_write_config_word(dev, pci_pcie_cap(dev) + pos, val); } static inline int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed, enum pcie_link_width *width) { *speed = PCI_SPEED_UNKNOWN; *width = PCIE_LNK_WIDTH_UNKNOWN; return (0); } static inline int pci_num_vf(struct pci_dev *dev) { return (0); } static inline enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev) { device_t root; uint32_t lnkcap, lnkcap2; int error, pos; root = device_get_parent(dev->dev.bsddev); if (root == NULL) return (PCI_SPEED_UNKNOWN); root = device_get_parent(root); if (root == NULL) return (PCI_SPEED_UNKNOWN); root = device_get_parent(root); if (root == NULL) return (PCI_SPEED_UNKNOWN); if (pci_get_vendor(root) == PCI_VENDOR_ID_VIA || pci_get_vendor(root) == PCI_VENDOR_ID_SERVERWORKS) return (PCI_SPEED_UNKNOWN); if ((error = pci_find_cap(root, PCIY_EXPRESS, &pos)) != 0) return (PCI_SPEED_UNKNOWN); lnkcap2 = pci_read_config(root, pos + PCIER_LINK_CAP2, 4); if (lnkcap2) { /* PCIe r3.0-compliant */ if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_2_5GB) return (PCIE_SPEED_2_5GT); if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_5_0GB) return (PCIE_SPEED_5_0GT); if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_8_0GB) return (PCIE_SPEED_8_0GT); if (lnkcap2 & PCI_EXP_LNKCAP2_SLS_16_0GB) return (PCIE_SPEED_16_0GT); } else { /* pre-r3.0 */ lnkcap = pci_read_config(root, pos + PCIER_LINK_CAP, 4); if (lnkcap & PCI_EXP_LNKCAP_SLS_2_5GB) return (PCIE_SPEED_2_5GT); if (lnkcap & PCI_EXP_LNKCAP_SLS_5_0GB) return (PCIE_SPEED_5_0GT); if (lnkcap & PCI_EXP_LNKCAP_SLS_8_0GB) return (PCIE_SPEED_8_0GT); if (lnkcap & PCI_EXP_LNKCAP_SLS_16_0GB) return (PCIE_SPEED_16_0GT); } return (PCI_SPEED_UNKNOWN); } static inline enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev) { uint32_t lnkcap; pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap); if (lnkcap) return ((lnkcap & PCI_EXP_LNKCAP_MLW) >> 4); return (PCIE_LNK_WIDTH_UNKNOWN); } #endif /* _LINUX_PCI_H_ */ Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/scatterlist.h =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/scatterlist.h (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/include/linux/scatterlist.h (revision 346926) @@ -1,457 +1,458 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013-2017 Mellanox Technologies, Ltd. * Copyright (c) 2015 Matthew Dillon * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. 
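With the pci.h hunk above, struct pci_dev loses its dma_mask member; pci_set_dma_mask() and pci_set_consistent_dma_mask() simply forward to dma_set_mask() and dma_set_coherent_mask() on &pdev->dev, so the per-device DMA tag replaces the removed field. A hypothetical probe-time helper, illustrative only:

/* Sketch only -- assumes <linux/pci.h>. */
static int
mydrv_pci_setup_dma(struct pci_dev *pdev)
{
	/* Prefer 64-bit DMA, fall back to 32-bit. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0)
		return (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)));
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0)
		return (-EIO);
	return (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)));
}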
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_SCATTERLIST_H_ #define _LINUX_SCATTERLIST_H_ #include #include #include struct scatterlist { unsigned long page_link; #define SG_PAGE_LINK_CHAIN 0x1UL #define SG_PAGE_LINK_LAST 0x2UL #define SG_PAGE_LINK_MASK 0x3UL unsigned int offset; unsigned int length; - dma_addr_t address; + dma_addr_t dma_address; + unsigned int dma_length; }; CTASSERT((sizeof(struct scatterlist) & SG_PAGE_LINK_MASK) == 0); struct sg_table { struct scatterlist *sgl; unsigned int nents; unsigned int orig_nents; }; struct sg_page_iter { struct scatterlist *sg; unsigned int sg_pgoffset; unsigned int maxents; struct { unsigned int nents; int pg_advance; } internal; }; #define SCATTERLIST_MAX_SEGMENT (-1U & ~(PAGE_SIZE - 1)) #define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist)) #define SG_MAGIC 0x87654321UL #define SG_CHAIN SG_PAGE_LINK_CHAIN #define SG_END SG_PAGE_LINK_LAST #define sg_is_chain(sg) ((sg)->page_link & SG_PAGE_LINK_CHAIN) #define sg_is_last(sg) ((sg)->page_link & SG_PAGE_LINK_LAST) #define sg_chain_ptr(sg) \ ((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK)) -#define sg_dma_address(sg) (sg)->address -#define sg_dma_len(sg) (sg)->length +#define sg_dma_address(sg) (sg)->dma_address +#define sg_dma_len(sg) (sg)->dma_length #define for_each_sg_page(sgl, iter, nents, pgoffset) \ for (_sg_iter_init(sgl, iter, nents, pgoffset); \ (iter)->sg; _sg_iter_next(iter)) #define for_each_sg(sglist, sg, sgmax, iter) \ for (iter = 0, sg = (sglist); iter < (sgmax); iter++, sg = sg_next(sg)) typedef struct scatterlist *(sg_alloc_fn) (unsigned int, gfp_t); typedef void (sg_free_fn) (struct scatterlist *, unsigned int); static inline void sg_assign_page(struct scatterlist *sg, struct page *page) { unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK; sg->page_link = page_link | (unsigned long)page; } static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) { sg_assign_page(sg, page); sg->offset = offset; sg->length = len; } static inline struct page * sg_page(struct scatterlist *sg) { return ((struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK)); } static inline void sg_set_buf(struct scatterlist *sg, const void *buf, unsigned int buflen) { sg_set_page(sg, virt_to_page(buf), buflen, ((uintptr_t)buf) & (PAGE_SIZE - 1)); } static inline struct scatterlist * sg_next(struct scatterlist *sg) { if (sg_is_last(sg)) return (NULL); sg++; if (sg_is_chain(sg)) sg = sg_chain_ptr(sg); return (sg); } static inline vm_paddr_t sg_phys(struct scatterlist *sg) { return (VM_PAGE_TO_PHYS(sg_page(sg)) + 
sg->offset); } static inline void * sg_virt(struct scatterlist *sg) { return ((void *)((unsigned long)page_address(sg_page(sg)) + sg->offset)); } static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents, struct scatterlist *sgl) { struct scatterlist *sg = &prv[prv_nents - 1]; sg->offset = 0; sg->length = 0; sg->page_link = ((unsigned long)sgl | SG_PAGE_LINK_CHAIN) & ~SG_PAGE_LINK_LAST; } static inline void sg_mark_end(struct scatterlist *sg) { sg->page_link |= SG_PAGE_LINK_LAST; sg->page_link &= ~SG_PAGE_LINK_CHAIN; } static inline void sg_init_table(struct scatterlist *sg, unsigned int nents) { bzero(sg, sizeof(*sg) * nents); sg_mark_end(&sg[nents - 1]); } static struct scatterlist * sg_kmalloc(unsigned int nents, gfp_t gfp_mask) { if (nents == SG_MAX_SINGLE_ALLOC) { return ((void *)__get_free_page(gfp_mask)); } else return (kmalloc(nents * sizeof(struct scatterlist), gfp_mask)); } static inline void sg_kfree(struct scatterlist *sg, unsigned int nents) { if (nents == SG_MAX_SINGLE_ALLOC) { free_page((unsigned long)sg); } else kfree(sg); } static inline void __sg_free_table(struct sg_table *table, unsigned int max_ents, bool skip_first_chunk, sg_free_fn * free_fn) { struct scatterlist *sgl, *next; if (unlikely(!table->sgl)) return; sgl = table->sgl; while (table->orig_nents) { unsigned int alloc_size = table->orig_nents; unsigned int sg_size; if (alloc_size > max_ents) { next = sg_chain_ptr(&sgl[max_ents - 1]); alloc_size = max_ents; sg_size = alloc_size - 1; } else { sg_size = alloc_size; next = NULL; } table->orig_nents -= sg_size; if (skip_first_chunk) skip_first_chunk = 0; else free_fn(sgl, alloc_size); sgl = next; } table->sgl = NULL; } static inline void sg_free_table(struct sg_table *table) { __sg_free_table(table, SG_MAX_SINGLE_ALLOC, 0, sg_kfree); } static inline int __sg_alloc_table(struct sg_table *table, unsigned int nents, unsigned int max_ents, struct scatterlist *first_chunk, gfp_t gfp_mask, sg_alloc_fn *alloc_fn) { struct scatterlist *sg, *prv; unsigned int left; memset(table, 0, sizeof(*table)); if (nents == 0) return (-EINVAL); left = nents; prv = NULL; do { unsigned int sg_size; unsigned int alloc_size = left; if (alloc_size > max_ents) { alloc_size = max_ents; sg_size = alloc_size - 1; } else sg_size = alloc_size; left -= sg_size; if (first_chunk) { sg = first_chunk; first_chunk = NULL; } else { sg = alloc_fn(alloc_size, gfp_mask); } if (unlikely(!sg)) { if (prv) table->nents = ++table->orig_nents; return (-ENOMEM); } sg_init_table(sg, alloc_size); table->nents = table->orig_nents += sg_size; if (prv) sg_chain(prv, max_ents, sg); else table->sgl = sg; if (!left) sg_mark_end(&sg[sg_size - 1]); prv = sg; } while (left); return (0); } static inline int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask) { int ret; ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC, NULL, gfp_mask, sg_kmalloc); if (unlikely(ret)) __sg_free_table(table, SG_MAX_SINGLE_ALLOC, 0, sg_kfree); return (ret); } static inline int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages, unsigned int count, unsigned long off, unsigned long size, unsigned int max_segment, gfp_t gfp_mask) { unsigned int i, segs, cur, len; int rc; struct scatterlist *s; if (__predict_false(!max_segment || offset_in_page(max_segment))) return (-EINVAL); len = 0; for (segs = i = 1; i < count; ++i) { len += PAGE_SIZE; if (len >= max_segment || page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) { ++segs; len = 0; } } if (__predict_false((rc = 
sg_alloc_table(sgt, segs, gfp_mask)))) return (rc); cur = 0; for_each_sg(sgt->sgl, s, sgt->orig_nents, i) { unsigned long seg_size; unsigned int j; len = 0; for (j = cur + 1; j < count; ++j) { len += PAGE_SIZE; if (len >= max_segment || page_to_pfn(pages[j]) != page_to_pfn(pages[j - 1]) + 1) break; } seg_size = ((j - cur) << PAGE_SHIFT) - off; sg_set_page(s, pages[cur], min(size, seg_size), off); size -= seg_size; off = 0; cur = j; } return (0); } static inline int sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages, unsigned int count, unsigned long off, unsigned long size, gfp_t gfp_mask) { return (__sg_alloc_table_from_pages(sgt, pages, count, off, size, SCATTERLIST_MAX_SEGMENT, gfp_mask)); } static inline int sg_nents(struct scatterlist *sg) { int nents; for (nents = 0; sg; sg = sg_next(sg)) nents++; return (nents); } static inline void __sg_page_iter_start(struct sg_page_iter *piter, struct scatterlist *sglist, unsigned int nents, unsigned long pgoffset) { piter->internal.pg_advance = 0; piter->internal.nents = nents; piter->sg = sglist; piter->sg_pgoffset = pgoffset; } static inline void _sg_iter_next(struct sg_page_iter *iter) { struct scatterlist *sg; unsigned int pgcount; sg = iter->sg; pgcount = (sg->offset + sg->length + PAGE_SIZE - 1) >> PAGE_SHIFT; ++iter->sg_pgoffset; while (iter->sg_pgoffset >= pgcount) { iter->sg_pgoffset -= pgcount; sg = sg_next(sg); --iter->maxents; if (sg == NULL || iter->maxents == 0) break; pgcount = (sg->offset + sg->length + PAGE_SIZE - 1) >> PAGE_SHIFT; } iter->sg = sg; } static inline int sg_page_count(struct scatterlist *sg) { return (PAGE_ALIGN(sg->offset + sg->length) >> PAGE_SHIFT); } static inline bool __sg_page_iter_next(struct sg_page_iter *piter) { if (piter->internal.nents == 0) return (0); if (piter->sg == NULL) return (0); piter->sg_pgoffset += piter->internal.pg_advance; piter->internal.pg_advance = 1; while (piter->sg_pgoffset >= sg_page_count(piter->sg)) { piter->sg_pgoffset -= sg_page_count(piter->sg); piter->sg = sg_next(piter->sg); if (--piter->internal.nents == 0) return (0); if (piter->sg == NULL) return (0); } return (1); } static inline void _sg_iter_init(struct scatterlist *sgl, struct sg_page_iter *iter, unsigned int nents, unsigned long pgoffset) { if (nents) { iter->sg = sgl; iter->sg_pgoffset = pgoffset - 1; iter->maxents = nents; _sg_iter_next(iter); } else { iter->sg = NULL; iter->sg_pgoffset = 0; iter->maxents = 0; } } static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *spi) { - return (spi->sg->address + (spi->sg_pgoffset << PAGE_SHIFT)); + return (spi->sg->dma_address + (spi->sg_pgoffset << PAGE_SHIFT)); } static inline struct page * sg_page_iter_page(struct sg_page_iter *piter) { return (nth_page(sg_page(piter->sg), piter->sg_pgoffset)); } #endif /* _LINUX_SCATTERLIST_H_ */ Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_compat.c =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_compat.c (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_compat.c (revision 346926) @@ -1,2446 +1,2446 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. * Copyright (c) 2013-2018 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
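The scatterlist.h hunk above gives struct scatterlist dedicated dma_address and dma_length fields, so sg_dma_len() no longer aliases the CPU-side length and the bus address filled in by dma_map_sg() (now backed by linux_dma_map_sg_attrs()) is kept apart from the page/offset pair. An illustrative walk over the mapped segments; program_hw_segment() is a hypothetical stand-in for whatever the driver programs into hardware:

/* Sketch only -- assumes <linux/dma-mapping.h> and <linux/scatterlist.h>. */
extern void program_hw_segment(dma_addr_t addr, unsigned int len);	/* hypothetical */

static int
mydrv_map_sgtable(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i, nmapped;

	nmapped = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	if (nmapped == 0)
		return (-ENOMEM);
	for_each_sg(sgt->sgl, sg, nmapped, i) {
		/* Bus address/length now live in sg->dma_address/dma_length. */
		program_hw_segment(sg_dma_address(sg), sg_dma_len(sg));
	}
	/* ... later, once the I/O has completed: */
	dma_unmap_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	return (0);
}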
Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_stack.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(__i386__) || defined(__amd64__) #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(__i386__) || defined(__amd64__) #include #endif SYSCTL_NODE(_compat, OID_AUTO, linuxkpi, CTLFLAG_RW, 0, "LinuxKPI parameters"); MALLOC_DEFINE(M_KMALLOC, "linux", "Linux kmalloc compat"); #include /* Undo Linux compat changes. */ #undef RB_ROOT #undef file #undef cdev #define RB_ROOT(head) (head)->rbh_root static void linux_cdev_deref(struct linux_cdev *ldev); static struct vm_area_struct *linux_cdev_handle_find(void *handle); struct kobject linux_class_root; struct device linux_root_device; struct class linux_class_misc; struct list_head pci_drivers; struct list_head pci_devices; spinlock_t pci_lock; unsigned long linux_timer_hz_mask; int panic_cmp(struct rb_node *one, struct rb_node *two) { panic("no cmp"); } RB_GENERATE(linux_root, rb_node, __entry, panic_cmp); int kobject_set_name_vargs(struct kobject *kobj, const char *fmt, va_list args) { va_list tmp_va; int len; char *old; char *name; char dummy; old = kobj->name; if (old && fmt == NULL) return (0); /* compute length of string */ va_copy(tmp_va, args); len = vsnprintf(&dummy, 0, fmt, tmp_va); va_end(tmp_va); /* account for zero termination */ len++; /* check for error */ if (len < 1) return (-EINVAL); /* allocate memory for string */ name = kzalloc(len, GFP_KERNEL); if (name == NULL) return (-ENOMEM); vsnprintf(name, len, fmt, args); kobj->name = name; /* free old string */ kfree(old); /* filter new string */ for (; *name != '\0'; name++) if (*name == '/') *name = '!'; return (0); } int kobject_set_name(struct kobject *kobj, const char *fmt, ...) 
{ va_list args; int error; va_start(args, fmt); error = kobject_set_name_vargs(kobj, fmt, args); va_end(args); return (error); } static int kobject_add_complete(struct kobject *kobj, struct kobject *parent) { const struct kobj_type *t; int error; kobj->parent = parent; error = sysfs_create_dir(kobj); if (error == 0 && kobj->ktype && kobj->ktype->default_attrs) { struct attribute **attr; t = kobj->ktype; for (attr = t->default_attrs; *attr != NULL; attr++) { error = sysfs_create_file(kobj, *attr); if (error) break; } if (error) sysfs_remove_dir(kobj); } return (error); } int kobject_add(struct kobject *kobj, struct kobject *parent, const char *fmt, ...) { va_list args; int error; va_start(args, fmt); error = kobject_set_name_vargs(kobj, fmt, args); va_end(args); if (error) return (error); return kobject_add_complete(kobj, parent); } void linux_kobject_release(struct kref *kref) { struct kobject *kobj; char *name; kobj = container_of(kref, struct kobject, kref); sysfs_remove_dir(kobj); name = kobj->name; if (kobj->ktype && kobj->ktype->release) kobj->ktype->release(kobj); kfree(name); } static void linux_kobject_kfree(struct kobject *kobj) { kfree(kobj); } static void linux_kobject_kfree_name(struct kobject *kobj) { if (kobj) { kfree(kobj->name); } } const struct kobj_type linux_kfree_type = { .release = linux_kobject_kfree }; static void linux_device_release(struct device *dev) { pr_debug("linux_device_release: %s\n", dev_name(dev)); kfree(dev); } static ssize_t linux_class_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct class_attribute *dattr; ssize_t error; dattr = container_of(attr, struct class_attribute, attr); error = -EIO; if (dattr->show) error = dattr->show(container_of(kobj, struct class, kobj), dattr, buf); return (error); } static ssize_t linux_class_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { struct class_attribute *dattr; ssize_t error; dattr = container_of(attr, struct class_attribute, attr); error = -EIO; if (dattr->store) error = dattr->store(container_of(kobj, struct class, kobj), dattr, buf, count); return (error); } static void linux_class_release(struct kobject *kobj) { struct class *class; class = container_of(kobj, struct class, kobj); if (class->class_release) class->class_release(class); } static const struct sysfs_ops linux_class_sysfs = { .show = linux_class_show, .store = linux_class_store, }; const struct kobj_type linux_class_ktype = { .release = linux_class_release, .sysfs_ops = &linux_class_sysfs }; static void linux_dev_release(struct kobject *kobj) { struct device *dev; dev = container_of(kobj, struct device, kobj); /* This is the precedence defined by linux. 
*/ if (dev->release) dev->release(dev); else if (dev->class && dev->class->dev_release) dev->class->dev_release(dev); } static ssize_t linux_dev_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct device_attribute *dattr; ssize_t error; dattr = container_of(attr, struct device_attribute, attr); error = -EIO; if (dattr->show) error = dattr->show(container_of(kobj, struct device, kobj), dattr, buf); return (error); } static ssize_t linux_dev_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { struct device_attribute *dattr; ssize_t error; dattr = container_of(attr, struct device_attribute, attr); error = -EIO; if (dattr->store) error = dattr->store(container_of(kobj, struct device, kobj), dattr, buf, count); return (error); } static const struct sysfs_ops linux_dev_sysfs = { .show = linux_dev_show, .store = linux_dev_store, }; const struct kobj_type linux_dev_ktype = { .release = linux_dev_release, .sysfs_ops = &linux_dev_sysfs }; struct device * device_create(struct class *class, struct device *parent, dev_t devt, void *drvdata, const char *fmt, ...) { struct device *dev; va_list args; dev = kzalloc(sizeof(*dev), M_WAITOK); dev->parent = parent; dev->class = class; dev->devt = devt; dev->driver_data = drvdata; dev->release = linux_device_release; va_start(args, fmt); kobject_set_name_vargs(&dev->kobj, fmt, args); va_end(args); device_register(dev); return (dev); } int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype, struct kobject *parent, const char *fmt, ...) { va_list args; int error; kobject_init(kobj, ktype); kobj->ktype = ktype; kobj->parent = parent; kobj->name = NULL; va_start(args, fmt); error = kobject_set_name_vargs(kobj, fmt, args); va_end(args); if (error) return (error); return kobject_add_complete(kobj, parent); } static void linux_kq_lock(void *arg) { spinlock_t *s = arg; spin_lock(s); } static void linux_kq_unlock(void *arg) { spinlock_t *s = arg; spin_unlock(s); } static void linux_kq_lock_owned(void *arg) { #ifdef INVARIANTS spinlock_t *s = arg; mtx_assert(&s->m, MA_OWNED); #endif } static void linux_kq_lock_unowned(void *arg) { #ifdef INVARIANTS spinlock_t *s = arg; mtx_assert(&s->m, MA_NOTOWNED); #endif } static void linux_file_kqfilter_poll(struct linux_file *, int); struct linux_file * linux_file_alloc(void) { struct linux_file *filp; filp = kzalloc(sizeof(*filp), GFP_KERNEL); /* set initial refcount */ filp->f_count = 1; /* setup fields needed by kqueue support */ spin_lock_init(&filp->f_kqlock); knlist_init(&filp->f_selinfo.si_note, &filp->f_kqlock, linux_kq_lock, linux_kq_unlock, linux_kq_lock_owned, linux_kq_lock_unowned); return (filp); } void linux_file_free(struct linux_file *filp) { if (filp->_file == NULL) { if (filp->f_shmem != NULL) vm_object_deallocate(filp->f_shmem); kfree(filp); } else { /* * The close method of the character device or file * will free the linux_file structure: */ _fdrop(filp->_file, curthread); } } static int linux_cdev_pager_fault(vm_object_t vm_obj, vm_ooffset_t offset, int prot, vm_page_t *mres) { struct vm_area_struct *vmap; vmap = linux_cdev_handle_find(vm_obj->handle); MPASS(vmap != NULL); MPASS(vmap->vm_private_data == vm_obj->handle); if (likely(vmap->vm_ops != NULL && offset < vmap->vm_len)) { vm_paddr_t paddr = IDX_TO_OFF(vmap->vm_pfn) + offset; vm_page_t page; if (((*mres)->flags & PG_FICTITIOUS) != 0) { /* * If the passed in result page is a fake * page, update it with the new physical * address. 
*/ page = *mres; vm_page_updatefake(page, paddr, vm_obj->memattr); } else { /* * Replace the passed in "mres" page with our * own fake page and free up the all of the * original pages. */ VM_OBJECT_WUNLOCK(vm_obj); page = vm_page_getfake(paddr, vm_obj->memattr); VM_OBJECT_WLOCK(vm_obj); vm_page_replace_checked(page, vm_obj, (*mres)->pindex, *mres); vm_page_lock(*mres); vm_page_free(*mres); vm_page_unlock(*mres); *mres = page; } page->valid = VM_PAGE_BITS_ALL; return (VM_PAGER_OK); } return (VM_PAGER_FAIL); } static int linux_cdev_pager_populate(vm_object_t vm_obj, vm_pindex_t pidx, int fault_type, vm_prot_t max_prot, vm_pindex_t *first, vm_pindex_t *last) { struct vm_area_struct *vmap; int err; linux_set_current(curthread); /* get VM area structure */ vmap = linux_cdev_handle_find(vm_obj->handle); MPASS(vmap != NULL); MPASS(vmap->vm_private_data == vm_obj->handle); VM_OBJECT_WUNLOCK(vm_obj); down_write(&vmap->vm_mm->mmap_sem); if (unlikely(vmap->vm_ops == NULL)) { err = VM_FAULT_SIGBUS; } else { struct vm_fault vmf; /* fill out VM fault structure */ vmf.virtual_address = (void *)(uintptr_t)IDX_TO_OFF(pidx); vmf.flags = (fault_type & VM_PROT_WRITE) ? FAULT_FLAG_WRITE : 0; vmf.pgoff = 0; vmf.page = NULL; vmf.vma = vmap; vmap->vm_pfn_count = 0; vmap->vm_pfn_pcount = &vmap->vm_pfn_count; vmap->vm_obj = vm_obj; err = vmap->vm_ops->fault(vmap, &vmf); while (vmap->vm_pfn_count == 0 && err == VM_FAULT_NOPAGE) { kern_yield(PRI_USER); err = vmap->vm_ops->fault(vmap, &vmf); } } /* translate return code */ switch (err) { case VM_FAULT_OOM: err = VM_PAGER_AGAIN; break; case VM_FAULT_SIGBUS: err = VM_PAGER_BAD; break; case VM_FAULT_NOPAGE: /* * By contract the fault handler will return having * busied all the pages itself. If pidx is already * found in the object, it will simply xbusy the first * page and return with vm_pfn_count set to 1. */ *first = vmap->vm_pfn_first; *last = *first + vmap->vm_pfn_count - 1; err = VM_PAGER_OK; break; default: err = VM_PAGER_ERROR; break; } up_write(&vmap->vm_mm->mmap_sem); VM_OBJECT_WLOCK(vm_obj); return (err); } static struct rwlock linux_vma_lock; static TAILQ_HEAD(, vm_area_struct) linux_vma_head = TAILQ_HEAD_INITIALIZER(linux_vma_head); static void linux_cdev_handle_free(struct vm_area_struct *vmap) { /* Drop reference on vm_file */ if (vmap->vm_file != NULL) fput(vmap->vm_file); /* Drop reference on mm_struct */ mmput(vmap->vm_mm); kfree(vmap); } static void linux_cdev_handle_remove(struct vm_area_struct *vmap) { rw_wlock(&linux_vma_lock); TAILQ_REMOVE(&linux_vma_head, vmap, vm_entry); rw_wunlock(&linux_vma_lock); } static struct vm_area_struct * linux_cdev_handle_find(void *handle) { struct vm_area_struct *vmap; rw_rlock(&linux_vma_lock); TAILQ_FOREACH(vmap, &linux_vma_head, vm_entry) { if (vmap->vm_private_data == handle) break; } rw_runlock(&linux_vma_lock); return (vmap); } static int linux_cdev_pager_ctor(void *handle, vm_ooffset_t size, vm_prot_t prot, vm_ooffset_t foff, struct ucred *cred, u_short *color) { MPASS(linux_cdev_handle_find(handle) != NULL); *color = 0; return (0); } static void linux_cdev_pager_dtor(void *handle) { const struct vm_operations_struct *vm_ops; struct vm_area_struct *vmap; vmap = linux_cdev_handle_find(handle); MPASS(vmap != NULL); /* * Remove handle before calling close operation to prevent * other threads from reusing the handle pointer. 
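 * The vm_ops->close() callback below then runs with the mmap_sem held
 * for writing, mirroring how linux_file_mmap_single() invokes the
 * driver's mmap method on the open side.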
*/ linux_cdev_handle_remove(vmap); down_write(&vmap->vm_mm->mmap_sem); vm_ops = vmap->vm_ops; if (likely(vm_ops != NULL)) vm_ops->close(vmap); up_write(&vmap->vm_mm->mmap_sem); linux_cdev_handle_free(vmap); } static struct cdev_pager_ops linux_cdev_pager_ops[2] = { { /* OBJT_MGTDEVICE */ .cdev_pg_populate = linux_cdev_pager_populate, .cdev_pg_ctor = linux_cdev_pager_ctor, .cdev_pg_dtor = linux_cdev_pager_dtor }, { /* OBJT_DEVICE */ .cdev_pg_fault = linux_cdev_pager_fault, .cdev_pg_ctor = linux_cdev_pager_ctor, .cdev_pg_dtor = linux_cdev_pager_dtor }, }; int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size) { vm_object_t obj; vm_page_t m; obj = vma->vm_obj; if (obj == NULL || (obj->flags & OBJ_UNMANAGED) != 0) return (-ENOTSUP); VM_OBJECT_RLOCK(obj); for (m = vm_page_find_least(obj, OFF_TO_IDX(address)); m != NULL && m->pindex < OFF_TO_IDX(address + size); m = TAILQ_NEXT(m, listq)) pmap_remove_all(m); VM_OBJECT_RUNLOCK(obj); return (0); } static struct file_operations dummy_ldev_ops = { /* XXXKIB */ }; static struct linux_cdev dummy_ldev = { .ops = &dummy_ldev_ops, }; #define LDEV_SI_DTR 0x0001 #define LDEV_SI_REF 0x0002 static void linux_get_fop(struct linux_file *filp, const struct file_operations **fop, struct linux_cdev **dev) { struct linux_cdev *ldev; u_int siref; ldev = filp->f_cdev; *fop = filp->f_op; if (ldev != NULL) { for (siref = ldev->siref;;) { if ((siref & LDEV_SI_DTR) != 0) { ldev = &dummy_ldev; siref = ldev->siref; *fop = ldev->ops; MPASS((ldev->siref & LDEV_SI_DTR) == 0); } else if (atomic_fcmpset_int(&ldev->siref, &siref, siref + LDEV_SI_REF)) { break; } } } *dev = ldev; } static void linux_drop_fop(struct linux_cdev *ldev) { if (ldev == NULL) return; MPASS((ldev->siref & ~LDEV_SI_DTR) != 0); atomic_subtract_int(&ldev->siref, LDEV_SI_REF); } #define OPW(fp,td,code) ({ \ struct file *__fpop; \ __typeof(code) __retval; \ \ __fpop = (td)->td_fpop; \ (td)->td_fpop = (fp); \ __retval = (code); \ (td)->td_fpop = __fpop; \ __retval; \ }) static int linux_dev_fdopen(struct cdev *dev, int fflags, struct thread *td, struct file *file) { struct linux_cdev *ldev; struct linux_file *filp; const struct file_operations *fop; int error; ldev = dev->si_drv1; filp = linux_file_alloc(); filp->f_dentry = &filp->f_dentry_store; filp->f_op = ldev->ops; filp->f_mode = file->f_flag; filp->f_flags = file->f_flag; filp->f_vnode = file->f_vnode; filp->_file = file; refcount_acquire(&ldev->refs); filp->f_cdev = ldev; linux_set_current(td); linux_get_fop(filp, &fop, &ldev); if (fop->open != NULL) { error = -fop->open(file->f_vnode, filp); if (error != 0) { linux_drop_fop(ldev); linux_cdev_deref(filp->f_cdev); kfree(filp); return (error); } } /* hold on to the vnode - used for fstat() */ vhold(filp->f_vnode); /* release the file from devfs */ finit(file, filp->f_mode, DTYPE_DEV, filp, &linuxfileops); linux_drop_fop(ldev); return (ENXIO); } #define LINUX_IOCTL_MIN_PTR 0x10000UL #define LINUX_IOCTL_MAX_PTR (LINUX_IOCTL_MIN_PTR + IOCPARM_MAX) static inline int linux_remap_address(void **uaddr, size_t len) { uintptr_t uaddr_val = (uintptr_t)(*uaddr); if (unlikely(uaddr_val >= LINUX_IOCTL_MIN_PTR && uaddr_val < LINUX_IOCTL_MAX_PTR)) { struct task_struct *pts = current; if (pts == NULL) { *uaddr = NULL; return (1); } /* compute data offset */ uaddr_val -= LINUX_IOCTL_MIN_PTR; /* check that length is within bounds */ if ((len > IOCPARM_MAX) || (uaddr_val + len) > pts->bsd_ioctl_len) { *uaddr = NULL; return (1); } /* re-add kernel buffer address */ uaddr_val += 
(uintptr_t)pts->bsd_ioctl_data; /* update address location */ *uaddr = (void *)uaddr_val; return (1); } return (0); } int linux_copyin(const void *uaddr, void *kaddr, size_t len) { if (linux_remap_address(__DECONST(void **, &uaddr), len)) { if (uaddr == NULL) return (-EFAULT); memcpy(kaddr, uaddr, len); return (0); } return (-copyin(uaddr, kaddr, len)); } int linux_copyout(const void *kaddr, void *uaddr, size_t len) { if (linux_remap_address(&uaddr, len)) { if (uaddr == NULL) return (-EFAULT); memcpy(uaddr, kaddr, len); return (0); } return (-copyout(kaddr, uaddr, len)); } size_t linux_clear_user(void *_uaddr, size_t _len) { uint8_t *uaddr = _uaddr; size_t len = _len; /* make sure uaddr is aligned before going into the fast loop */ while (((uintptr_t)uaddr & 7) != 0 && len > 7) { if (subyte(uaddr, 0)) return (_len); uaddr++; len--; } /* zero 8 bytes at a time */ while (len > 7) { #ifdef __LP64__ if (suword64(uaddr, 0)) return (_len); #else if (suword32(uaddr, 0)) return (_len); if (suword32(uaddr + 4, 0)) return (_len); #endif uaddr += 8; len -= 8; } /* zero fill end, if any */ while (len > 0) { if (subyte(uaddr, 0)) return (_len); uaddr++; len--; } return (0); } int linux_access_ok(int rw, const void *uaddr, size_t len) { uintptr_t saddr; uintptr_t eaddr; /* get start and end address */ saddr = (uintptr_t)uaddr; eaddr = (uintptr_t)uaddr + len; /* verify addresses are valid for userspace */ return ((saddr == eaddr) || (eaddr > saddr && eaddr <= VM_MAXUSER_ADDRESS)); } /* * This function should return either EINTR or ERESTART depending on * the signal type sent to this thread: */ static int linux_get_error(struct task_struct *task, int error) { /* check for signal type interrupt code */ if (error == EINTR || error == ERESTARTSYS || error == ERESTART) { error = -linux_schedule_get_interrupt_value(task); if (error == 0) error = EINTR; } return (error); } static int linux_file_ioctl_sub(struct file *fp, struct linux_file *filp, const struct file_operations *fop, u_long cmd, caddr_t data, struct thread *td) { struct task_struct *task = current; unsigned size; int error; size = IOCPARM_LEN(cmd); /* refer to logic in sys_ioctl() */ if (size > 0) { /* * Setup hint for linux_copyin() and linux_copyout(). * * Background: Linux code expects a user-space address * while FreeBSD supplies a kernel-space address. */ task->bsd_ioctl_data = data; task->bsd_ioctl_len = size; data = (void *)LINUX_IOCTL_MIN_PTR; } else { /* fetch user-space pointer */ data = *(void **)data; } #if defined(__amd64__) if (td->td_proc->p_elf_machine == EM_386) { /* try the compat IOCTL handler first */ if (fop->compat_ioctl != NULL) { error = -OPW(fp, td, fop->compat_ioctl(filp, cmd, (u_long)data)); } else { error = ENOTTY; } /* fallback to the regular IOCTL handler, if any */ if (error == ENOTTY && fop->unlocked_ioctl != NULL) { error = -OPW(fp, td, fop->unlocked_ioctl(filp, cmd, (u_long)data)); } } else #endif { if (fop->unlocked_ioctl != NULL) { error = -OPW(fp, td, fop->unlocked_ioctl(filp, cmd, (u_long)data)); } else { error = ENOTTY; } } if (size > 0) { task->bsd_ioctl_data = NULL; task->bsd_ioctl_len = 0; } if (error == EWOULDBLOCK) { /* update kqfilter status, if any */ linux_file_kqfilter_poll(filp, LINUX_KQ_FLAG_HAS_READ | LINUX_KQ_FLAG_HAS_WRITE); } else { error = linux_get_error(task, error); } return (error); } #define LINUX_POLL_TABLE_NORMAL ((poll_table *)1) /* * This function atomically updates the poll wakeup state and returns * the previous state at the time of update. 
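 * The pstate argument is a translation table indexed by the current
 * state: the loop retries atomic_cmpxchg() until the pstate[old]
 * transition has been applied, and the state observed at that point is
 * returned so that a caller can act exactly once on transitions such
 * as LINUX_FWQ_STATE_QUEUED -> LINUX_FWQ_STATE_READY.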
*/ static uint8_t linux_poll_wakeup_state(atomic_t *v, const uint8_t *pstate) { int c, old; c = v->counter; while ((old = atomic_cmpxchg(v, c, pstate[c])) != c) c = old; return (c); } static int linux_poll_wakeup_callback(wait_queue_t *wq, unsigned int wq_state, int flags, void *key) { static const uint8_t state[LINUX_FWQ_STATE_MAX] = { [LINUX_FWQ_STATE_INIT] = LINUX_FWQ_STATE_INIT, /* NOP */ [LINUX_FWQ_STATE_NOT_READY] = LINUX_FWQ_STATE_NOT_READY, /* NOP */ [LINUX_FWQ_STATE_QUEUED] = LINUX_FWQ_STATE_READY, [LINUX_FWQ_STATE_READY] = LINUX_FWQ_STATE_READY, /* NOP */ }; struct linux_file *filp = container_of(wq, struct linux_file, f_wait_queue.wq); switch (linux_poll_wakeup_state(&filp->f_wait_queue.state, state)) { case LINUX_FWQ_STATE_QUEUED: linux_poll_wakeup(filp); return (1); default: return (0); } } void linux_poll_wait(struct linux_file *filp, wait_queue_head_t *wqh, poll_table *p) { static const uint8_t state[LINUX_FWQ_STATE_MAX] = { [LINUX_FWQ_STATE_INIT] = LINUX_FWQ_STATE_NOT_READY, [LINUX_FWQ_STATE_NOT_READY] = LINUX_FWQ_STATE_NOT_READY, /* NOP */ [LINUX_FWQ_STATE_QUEUED] = LINUX_FWQ_STATE_QUEUED, /* NOP */ [LINUX_FWQ_STATE_READY] = LINUX_FWQ_STATE_QUEUED, }; /* check if we are called inside the select system call */ if (p == LINUX_POLL_TABLE_NORMAL) selrecord(curthread, &filp->f_selinfo); switch (linux_poll_wakeup_state(&filp->f_wait_queue.state, state)) { case LINUX_FWQ_STATE_INIT: /* NOTE: file handles can only belong to one wait-queue */ filp->f_wait_queue.wqh = wqh; filp->f_wait_queue.wq.func = &linux_poll_wakeup_callback; add_wait_queue(wqh, &filp->f_wait_queue.wq); atomic_set(&filp->f_wait_queue.state, LINUX_FWQ_STATE_QUEUED); break; default: break; } } static void linux_poll_wait_dequeue(struct linux_file *filp) { static const uint8_t state[LINUX_FWQ_STATE_MAX] = { [LINUX_FWQ_STATE_INIT] = LINUX_FWQ_STATE_INIT, /* NOP */ [LINUX_FWQ_STATE_NOT_READY] = LINUX_FWQ_STATE_INIT, [LINUX_FWQ_STATE_QUEUED] = LINUX_FWQ_STATE_INIT, [LINUX_FWQ_STATE_READY] = LINUX_FWQ_STATE_INIT, }; seldrain(&filp->f_selinfo); switch (linux_poll_wakeup_state(&filp->f_wait_queue.state, state)) { case LINUX_FWQ_STATE_NOT_READY: case LINUX_FWQ_STATE_QUEUED: case LINUX_FWQ_STATE_READY: remove_wait_queue(filp->f_wait_queue.wqh, &filp->f_wait_queue.wq); break; default: break; } } void linux_poll_wakeup(struct linux_file *filp) { /* this function should be NULL-safe */ if (filp == NULL) return; selwakeup(&filp->f_selinfo); spin_lock(&filp->f_kqlock); filp->f_kqflags |= LINUX_KQ_FLAG_NEED_READ | LINUX_KQ_FLAG_NEED_WRITE; /* make sure the "knote" gets woken up */ KNOTE_LOCKED(&filp->f_selinfo.si_note, 1); spin_unlock(&filp->f_kqlock); } static void linux_file_kqfilter_detach(struct knote *kn) { struct linux_file *filp = kn->kn_hook; spin_lock(&filp->f_kqlock); knlist_remove(&filp->f_selinfo.si_note, kn, 1); spin_unlock(&filp->f_kqlock); } static int linux_file_kqfilter_read_event(struct knote *kn, long hint) { struct linux_file *filp = kn->kn_hook; mtx_assert(&filp->f_kqlock.m, MA_OWNED); return ((filp->f_kqflags & LINUX_KQ_FLAG_NEED_READ) ? 1 : 0); } static int linux_file_kqfilter_write_event(struct knote *kn, long hint) { struct linux_file *filp = kn->kn_hook; mtx_assert(&filp->f_kqlock.m, MA_OWNED); return ((filp->f_kqflags & LINUX_KQ_FLAG_NEED_WRITE) ? 
1 : 0); } static struct filterops linux_dev_kqfiltops_read = { .f_isfd = 1, .f_detach = linux_file_kqfilter_detach, .f_event = linux_file_kqfilter_read_event, }; static struct filterops linux_dev_kqfiltops_write = { .f_isfd = 1, .f_detach = linux_file_kqfilter_detach, .f_event = linux_file_kqfilter_write_event, }; static void linux_file_kqfilter_poll(struct linux_file *filp, int kqflags) { struct thread *td; const struct file_operations *fop; struct linux_cdev *ldev; int temp; if ((filp->f_kqflags & kqflags) == 0) return; td = curthread; linux_get_fop(filp, &fop, &ldev); /* get the latest polling state */ temp = OPW(filp->_file, td, fop->poll(filp, NULL)); linux_drop_fop(ldev); spin_lock(&filp->f_kqlock); /* clear kqflags */ filp->f_kqflags &= ~(LINUX_KQ_FLAG_NEED_READ | LINUX_KQ_FLAG_NEED_WRITE); /* update kqflags */ if ((temp & (POLLIN | POLLOUT)) != 0) { if ((temp & POLLIN) != 0) filp->f_kqflags |= LINUX_KQ_FLAG_NEED_READ; if ((temp & POLLOUT) != 0) filp->f_kqflags |= LINUX_KQ_FLAG_NEED_WRITE; /* make sure the "knote" gets woken up */ KNOTE_LOCKED(&filp->f_selinfo.si_note, 0); } spin_unlock(&filp->f_kqlock); } static int linux_file_kqfilter(struct file *file, struct knote *kn) { struct linux_file *filp; struct thread *td; int error; td = curthread; filp = (struct linux_file *)file->f_data; filp->f_flags = file->f_flag; if (filp->f_op->poll == NULL) return (EINVAL); spin_lock(&filp->f_kqlock); switch (kn->kn_filter) { case EVFILT_READ: filp->f_kqflags |= LINUX_KQ_FLAG_HAS_READ; kn->kn_fop = &linux_dev_kqfiltops_read; kn->kn_hook = filp; knlist_add(&filp->f_selinfo.si_note, kn, 1); error = 0; break; case EVFILT_WRITE: filp->f_kqflags |= LINUX_KQ_FLAG_HAS_WRITE; kn->kn_fop = &linux_dev_kqfiltops_write; kn->kn_hook = filp; knlist_add(&filp->f_selinfo.si_note, kn, 1); error = 0; break; default: error = EINVAL; break; } spin_unlock(&filp->f_kqlock); if (error == 0) { linux_set_current(td); /* update kqfilter status, if any */ linux_file_kqfilter_poll(filp, LINUX_KQ_FLAG_HAS_READ | LINUX_KQ_FLAG_HAS_WRITE); } return (error); } static int linux_file_mmap_single(struct file *fp, const struct file_operations *fop, vm_ooffset_t *offset, vm_size_t size, struct vm_object **object, int nprot, struct thread *td) { struct task_struct *task; struct vm_area_struct *vmap; struct mm_struct *mm; struct linux_file *filp; vm_memattr_t attr; int error; filp = (struct linux_file *)fp->f_data; filp->f_flags = fp->f_flag; if (fop->mmap == NULL) return (EOPNOTSUPP); linux_set_current(td); /* * The same VM object might be shared by multiple processes * and the mm_struct is usually freed when a process exits. * * The atomic reference below makes sure the mm_struct is * available as long as the vmap is in the linux_vma_head. 
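 * The reference taken here is dropped again by the mmput() call in
 * linux_cdev_handle_free() when the mapping handle is destroyed.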
*/ task = current; mm = task->mm; if (atomic_inc_not_zero(&mm->mm_users) == 0) return (EINVAL); vmap = kzalloc(sizeof(*vmap), GFP_KERNEL); vmap->vm_start = 0; vmap->vm_end = size; vmap->vm_pgoff = *offset / PAGE_SIZE; vmap->vm_pfn = 0; vmap->vm_flags = vmap->vm_page_prot = (nprot & VM_PROT_ALL); vmap->vm_ops = NULL; vmap->vm_file = get_file(filp); vmap->vm_mm = mm; if (unlikely(down_write_killable(&vmap->vm_mm->mmap_sem))) { error = linux_get_error(task, EINTR); } else { error = -OPW(fp, td, fop->mmap(filp, vmap)); error = linux_get_error(task, error); up_write(&vmap->vm_mm->mmap_sem); } if (error != 0) { linux_cdev_handle_free(vmap); return (error); } attr = pgprot2cachemode(vmap->vm_page_prot); if (vmap->vm_ops != NULL) { struct vm_area_struct *ptr; void *vm_private_data; bool vm_no_fault; if (vmap->vm_ops->open == NULL || vmap->vm_ops->close == NULL || vmap->vm_private_data == NULL) { /* free allocated VM area struct */ linux_cdev_handle_free(vmap); return (EINVAL); } vm_private_data = vmap->vm_private_data; rw_wlock(&linux_vma_lock); TAILQ_FOREACH(ptr, &linux_vma_head, vm_entry) { if (ptr->vm_private_data == vm_private_data) break; } /* check if there is an existing VM area struct */ if (ptr != NULL) { /* check if the VM area structure is invalid */ if (ptr->vm_ops == NULL || ptr->vm_ops->open == NULL || ptr->vm_ops->close == NULL) { error = ESTALE; vm_no_fault = 1; } else { error = EEXIST; vm_no_fault = (ptr->vm_ops->fault == NULL); } } else { /* insert VM area structure into list */ TAILQ_INSERT_TAIL(&linux_vma_head, vmap, vm_entry); error = 0; vm_no_fault = (vmap->vm_ops->fault == NULL); } rw_wunlock(&linux_vma_lock); if (error != 0) { /* free allocated VM area struct */ linux_cdev_handle_free(vmap); /* check for stale VM area struct */ if (error != EEXIST) return (error); } /* check if there is no fault handler */ if (vm_no_fault) { *object = cdev_pager_allocate(vm_private_data, OBJT_DEVICE, &linux_cdev_pager_ops[1], size, nprot, *offset, td->td_ucred); } else { *object = cdev_pager_allocate(vm_private_data, OBJT_MGTDEVICE, &linux_cdev_pager_ops[0], size, nprot, *offset, td->td_ucred); } /* check if allocating the VM object failed */ if (*object == NULL) { if (error == 0) { /* remove VM area struct from list */ linux_cdev_handle_remove(vmap); /* free allocated VM area struct */ linux_cdev_handle_free(vmap); } return (EINVAL); } } else { struct sglist *sg; sg = sglist_alloc(1, M_WAITOK); sglist_append_phys(sg, (vm_paddr_t)vmap->vm_pfn << PAGE_SHIFT, vmap->vm_len); *object = vm_pager_allocate(OBJT_SG, sg, vmap->vm_len, nprot, 0, td->td_ucred); linux_cdev_handle_free(vmap); if (*object == NULL) { sglist_free(sg); return (EINVAL); } } if (attr != VM_MEMATTR_DEFAULT) { VM_OBJECT_WLOCK(*object); vm_object_set_memattr(*object, attr); VM_OBJECT_WUNLOCK(*object); } *offset = 0; return (0); } struct cdevsw linuxcdevsw = { .d_version = D_VERSION, .d_fdopen = linux_dev_fdopen, .d_name = "lkpidev", }; static int linux_file_read(struct file *file, struct uio *uio, struct ucred *active_cred, int flags, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; ssize_t bytes; int error; error = 0; filp = (struct linux_file *)file->f_data; filp->f_flags = file->f_flag; /* XXX no support for I/O vectors currently */ if (uio->uio_iovcnt != 1) return (EOPNOTSUPP); if (uio->uio_resid > DEVFS_IOSIZE_MAX) return (EINVAL); linux_set_current(td); linux_get_fop(filp, &fop, &ldev); if (fop->read != NULL) { bytes = OPW(file, td, fop->read(filp, 
uio->uio_iov->iov_base, uio->uio_iov->iov_len, &uio->uio_offset)); if (bytes >= 0) { uio->uio_iov->iov_base = ((uint8_t *)uio->uio_iov->iov_base) + bytes; uio->uio_iov->iov_len -= bytes; uio->uio_resid -= bytes; } else { error = linux_get_error(current, -bytes); } } else error = ENXIO; /* update kqfilter status, if any */ linux_file_kqfilter_poll(filp, LINUX_KQ_FLAG_HAS_READ); linux_drop_fop(ldev); return (error); } static int linux_file_write(struct file *file, struct uio *uio, struct ucred *active_cred, int flags, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; ssize_t bytes; int error; filp = (struct linux_file *)file->f_data; filp->f_flags = file->f_flag; /* XXX no support for I/O vectors currently */ if (uio->uio_iovcnt != 1) return (EOPNOTSUPP); if (uio->uio_resid > DEVFS_IOSIZE_MAX) return (EINVAL); linux_set_current(td); linux_get_fop(filp, &fop, &ldev); if (fop->write != NULL) { bytes = OPW(file, td, fop->write(filp, uio->uio_iov->iov_base, uio->uio_iov->iov_len, &uio->uio_offset)); if (bytes >= 0) { uio->uio_iov->iov_base = ((uint8_t *)uio->uio_iov->iov_base) + bytes; uio->uio_iov->iov_len -= bytes; uio->uio_resid -= bytes; error = 0; } else { error = linux_get_error(current, -bytes); } } else error = ENXIO; /* update kqfilter status, if any */ linux_file_kqfilter_poll(filp, LINUX_KQ_FLAG_HAS_WRITE); linux_drop_fop(ldev); return (error); } static int linux_file_poll(struct file *file, int events, struct ucred *active_cred, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; int revents; filp = (struct linux_file *)file->f_data; filp->f_flags = file->f_flag; linux_set_current(td); linux_get_fop(filp, &fop, &ldev); if (fop->poll != NULL) { revents = OPW(file, td, fop->poll(filp, LINUX_POLL_TABLE_NORMAL)) & events; } else { revents = 0; } linux_drop_fop(ldev); return (revents); } static int linux_file_close(struct file *file, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; int error; filp = (struct linux_file *)file->f_data; KASSERT(file_count(filp) == 0, ("File refcount(%d) is not zero", file_count(filp))); error = 0; filp->f_flags = file->f_flag; linux_set_current(td); linux_poll_wait_dequeue(filp); linux_get_fop(filp, &fop, &ldev); if (fop->release != NULL) error = -OPW(file, td, fop->release(filp->f_vnode, filp)); funsetown(&filp->f_sigio); if (filp->f_vnode != NULL) vdrop(filp->f_vnode); linux_drop_fop(ldev); if (filp->f_cdev != NULL) linux_cdev_deref(filp->f_cdev); kfree(filp); return (error); } static int linux_file_ioctl(struct file *fp, u_long cmd, void *data, struct ucred *cred, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; int error; error = 0; filp = (struct linux_file *)fp->f_data; filp->f_flags = fp->f_flag; linux_get_fop(filp, &fop, &ldev); linux_set_current(td); switch (cmd) { case FIONBIO: break; case FIOASYNC: if (fop->fasync == NULL) break; error = -OPW(fp, td, fop->fasync(0, filp, fp->f_flag & FASYNC)); break; case FIOSETOWN: error = fsetown(*(int *)data, &filp->f_sigio); if (error == 0) { if (fop->fasync == NULL) break; error = -OPW(fp, td, fop->fasync(0, filp, fp->f_flag & FASYNC)); } break; case FIOGETOWN: *(int *)data = fgetown(&filp->f_sigio); break; default: error = linux_file_ioctl_sub(fp, filp, fop, cmd, data, td); break; } linux_drop_fop(ldev); return (error); } static int linux_file_mmap_sub(struct thread *td, vm_size_t 
objsize, vm_prot_t prot, vm_prot_t *maxprotp, int *flagsp, struct file *fp, vm_ooffset_t *foff, const struct file_operations *fop, vm_object_t *objp) { /* * Character devices do not provide private mappings * of any kind: */ if ((*maxprotp & VM_PROT_WRITE) == 0 && (prot & VM_PROT_WRITE) != 0) return (EACCES); if ((*flagsp & (MAP_PRIVATE | MAP_COPY)) != 0) return (EINVAL); return (linux_file_mmap_single(fp, fop, foff, objsize, objp, (int)prot, td)); } static int linux_file_mmap(struct file *fp, vm_map_t map, vm_offset_t *addr, vm_size_t size, vm_prot_t prot, vm_prot_t cap_maxprot, int flags, vm_ooffset_t foff, struct thread *td) { struct linux_file *filp; const struct file_operations *fop; struct linux_cdev *ldev; struct mount *mp; struct vnode *vp; vm_object_t object; vm_prot_t maxprot; int error; filp = (struct linux_file *)fp->f_data; vp = filp->f_vnode; if (vp == NULL) return (EOPNOTSUPP); /* * Ensure that file and memory protections are * compatible. */ mp = vp->v_mount; if (mp != NULL && (mp->mnt_flag & MNT_NOEXEC) != 0) { maxprot = VM_PROT_NONE; if ((prot & VM_PROT_EXECUTE) != 0) return (EACCES); } else maxprot = VM_PROT_EXECUTE; if ((fp->f_flag & FREAD) != 0) maxprot |= VM_PROT_READ; else if ((prot & VM_PROT_READ) != 0) return (EACCES); /* * If we are sharing potential changes via MAP_SHARED and we * are trying to get write permission although we opened it * without asking for it, bail out. * * Note that most character devices always share mappings. * * Rely on linux_file_mmap_sub() to fail invalid MAP_PRIVATE * requests rather than doing it here. */ if ((flags & MAP_SHARED) != 0) { if ((fp->f_flag & FWRITE) != 0) maxprot |= VM_PROT_WRITE; else if ((prot & VM_PROT_WRITE) != 0) return (EACCES); } maxprot &= cap_maxprot; linux_get_fop(filp, &fop, &ldev); error = linux_file_mmap_sub(td, size, prot, &maxprot, &flags, fp, &foff, fop, &object); if (error != 0) goto out; error = vm_mmap_object(map, addr, size, prot, maxprot, flags, object, foff, FALSE, td); if (error != 0) vm_object_deallocate(object); out: linux_drop_fop(ldev); return (error); } static int linux_file_stat(struct file *fp, struct stat *sb, struct ucred *active_cred, struct thread *td) { struct linux_file *filp; struct vnode *vp; int error; filp = (struct linux_file *)fp->f_data; if (filp->f_vnode == NULL) return (EOPNOTSUPP); vp = filp->f_vnode; vn_lock(vp, LK_SHARED | LK_RETRY); error = vn_stat(vp, sb, td->td_ucred, NOCRED, td); VOP_UNLOCK(vp, 0); return (error); } static int linux_file_fill_kinfo(struct file *fp, struct kinfo_file *kif, struct filedesc *fdp) { struct linux_file *filp; struct vnode *vp; int error; filp = fp->f_data; vp = filp->f_vnode; if (vp == NULL) { error = 0; kif->kf_type = KF_TYPE_DEV; } else { vref(vp); FILEDESC_SUNLOCK(fdp); error = vn_fill_kinfo_vnode(vp, kif); vrele(vp); kif->kf_type = KF_TYPE_VNODE; FILEDESC_SLOCK(fdp); } return (error); } unsigned int linux_iminor(struct inode *inode) { struct linux_cdev *ldev; if (inode == NULL || inode->v_rdev == NULL || inode->v_rdev->si_devsw != &linuxcdevsw) return (-1U); ldev = inode->v_rdev->si_drv1; if (ldev == NULL) return (-1U); return (minor(ldev->dev)); } struct fileops linuxfileops = { .fo_read = linux_file_read, .fo_write = linux_file_write, .fo_truncate = invfo_truncate, .fo_kqfilter = linux_file_kqfilter, .fo_stat = linux_file_stat, .fo_fill_kinfo = linux_file_fill_kinfo, .fo_poll = linux_file_poll, .fo_close = linux_file_close, .fo_ioctl = linux_file_ioctl, .fo_mmap = linux_file_mmap, .fo_chmod = invfo_chmod, .fo_chown = invfo_chown, 
.fo_sendfile = invfo_sendfile, .fo_flags = DFLAG_PASSABLE, }; /* * Hash of vmmap addresses. This is infrequently accessed and does not * need to be particularly large. This is done because we must store the * caller's idea of the map size to properly unmap. */ struct vmmap { LIST_ENTRY(vmmap) vm_next; void *vm_addr; unsigned long vm_size; }; struct vmmaphd { struct vmmap *lh_first; }; #define VMMAP_HASH_SIZE 64 #define VMMAP_HASH_MASK (VMMAP_HASH_SIZE - 1) #define VM_HASH(addr) ((uintptr_t)(addr) >> PAGE_SHIFT) & VMMAP_HASH_MASK static struct vmmaphd vmmaphead[VMMAP_HASH_SIZE]; static struct mtx vmmaplock; static void vmmap_add(void *addr, unsigned long size) { struct vmmap *vmmap; vmmap = kmalloc(sizeof(*vmmap), GFP_KERNEL); mtx_lock(&vmmaplock); vmmap->vm_size = size; vmmap->vm_addr = addr; LIST_INSERT_HEAD(&vmmaphead[VM_HASH(addr)], vmmap, vm_next); mtx_unlock(&vmmaplock); } static struct vmmap * vmmap_remove(void *addr) { struct vmmap *vmmap; mtx_lock(&vmmaplock); LIST_FOREACH(vmmap, &vmmaphead[VM_HASH(addr)], vm_next) if (vmmap->vm_addr == addr) break; if (vmmap) LIST_REMOVE(vmmap, vm_next); mtx_unlock(&vmmaplock); return (vmmap); } #if defined(__i386__) || defined(__amd64__) || defined(__powerpc__) || defined(__aarch64__) void * _ioremap_attr(vm_paddr_t phys_addr, unsigned long size, int attr) { void *addr; addr = pmap_mapdev_attr(phys_addr, size, attr); if (addr == NULL) return (NULL); vmmap_add(addr, size); return (addr); } #endif void iounmap(void *addr) { struct vmmap *vmmap; vmmap = vmmap_remove(addr); if (vmmap == NULL) return; #if defined(__i386__) || defined(__amd64__) || defined(__powerpc__) || defined(__aarch64__) pmap_unmapdev((vm_offset_t)addr, vmmap->vm_size); #endif kfree(vmmap); } void * vmap(struct page **pages, unsigned int count, unsigned long flags, int prot) { vm_offset_t off; size_t size; size = count * PAGE_SIZE; off = kva_alloc(size); if (off == 0) return (NULL); vmmap_add((void *)off, size); pmap_qenter(off, pages, count); return ((void *)off); } void vunmap(void *addr) { struct vmmap *vmmap; vmmap = vmmap_remove(addr); if (vmmap == NULL) return; pmap_qremove((vm_offset_t)addr, vmmap->vm_size / PAGE_SIZE); kva_free((vm_offset_t)addr, vmmap->vm_size); kfree(vmmap); } char * kvasprintf(gfp_t gfp, const char *fmt, va_list ap) { unsigned int len; char *p; va_list aq; va_copy(aq, ap); len = vsnprintf(NULL, 0, fmt, aq); va_end(aq); p = kmalloc(len + 1, gfp); if (p != NULL) vsnprintf(p, len + 1, fmt, ap); return (p); } char * kasprintf(gfp_t gfp, const char *fmt, ...) 
{ va_list ap; char *p; va_start(ap, fmt); p = kvasprintf(gfp, fmt, ap); va_end(ap); return (p); } static void linux_timer_callback_wrapper(void *context) { struct timer_list *timer; linux_set_current(curthread); timer = context; timer->function(timer->data); } void mod_timer(struct timer_list *timer, int expires) { timer->expires = expires; callout_reset(&timer->callout, linux_timer_jiffies_until(expires), &linux_timer_callback_wrapper, timer); } void add_timer(struct timer_list *timer) { callout_reset(&timer->callout, linux_timer_jiffies_until(timer->expires), &linux_timer_callback_wrapper, timer); } void add_timer_on(struct timer_list *timer, int cpu) { callout_reset_on(&timer->callout, linux_timer_jiffies_until(timer->expires), &linux_timer_callback_wrapper, timer, cpu); } static void linux_timer_init(void *arg) { /* * Compute an internal HZ value which can divide 2**32 to * avoid timer rounding problems when the tick value wraps * around 2**32: */ linux_timer_hz_mask = 1; while (linux_timer_hz_mask < (unsigned long)hz) linux_timer_hz_mask *= 2; linux_timer_hz_mask--; } SYSINIT(linux_timer, SI_SUB_DRIVERS, SI_ORDER_FIRST, linux_timer_init, NULL); void linux_complete_common(struct completion *c, int all) { int wakeup_swapper; sleepq_lock(c); if (all) { c->done = UINT_MAX; wakeup_swapper = sleepq_broadcast(c, SLEEPQ_SLEEP, 0, 0); } else { if (c->done != UINT_MAX) c->done++; wakeup_swapper = sleepq_signal(c, SLEEPQ_SLEEP, 0, 0); } sleepq_release(c); if (wakeup_swapper) kick_proc0(); } /* * Indefinite wait for done != 0 with or without signals. */ int linux_wait_for_common(struct completion *c, int flags) { struct task_struct *task; int error; if (SCHEDULER_STOPPED()) return (0); task = current; if (flags != 0) flags = SLEEPQ_INTERRUPTIBLE | SLEEPQ_SLEEP; else flags = SLEEPQ_SLEEP; error = 0; for (;;) { sleepq_lock(c); if (c->done) break; sleepq_add(c, NULL, "completion", flags, 0); if (flags & SLEEPQ_INTERRUPTIBLE) { DROP_GIANT(); error = -sleepq_wait_sig(c, 0); PICKUP_GIANT(); if (error != 0) { linux_schedule_save_interrupt_value(task, error); error = -ERESTARTSYS; goto intr; } } else { DROP_GIANT(); sleepq_wait(c, 0); PICKUP_GIANT(); } } if (c->done != UINT_MAX) c->done--; sleepq_release(c); intr: return (error); } /* * Time limited wait for done != 0 with or without signals. 
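 * Returns the number of jiffies remaining when the completion was
 * signalled, 0 if the timeout expired first, or -ERESTARTSYS if the
 * sleep was interrupted by a signal.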
*/ int linux_wait_for_timeout_common(struct completion *c, int timeout, int flags) { struct task_struct *task; int end = jiffies + timeout; int error; if (SCHEDULER_STOPPED()) return (0); task = current; if (flags != 0) flags = SLEEPQ_INTERRUPTIBLE | SLEEPQ_SLEEP; else flags = SLEEPQ_SLEEP; for (;;) { sleepq_lock(c); if (c->done) break; sleepq_add(c, NULL, "completion", flags, 0); sleepq_set_timeout(c, linux_timer_jiffies_until(end)); DROP_GIANT(); if (flags & SLEEPQ_INTERRUPTIBLE) error = -sleepq_timedwait_sig(c, 0); else error = -sleepq_timedwait(c, 0); PICKUP_GIANT(); if (error != 0) { /* check for timeout */ if (error == -EWOULDBLOCK) { error = 0; /* timeout */ } else { /* signal happened */ linux_schedule_save_interrupt_value(task, error); error = -ERESTARTSYS; } goto done; } } if (c->done != UINT_MAX) c->done--; sleepq_release(c); /* return how many jiffies are left */ error = linux_timer_jiffies_until(end); done: return (error); } int linux_try_wait_for_completion(struct completion *c) { int isdone; sleepq_lock(c); isdone = (c->done != 0); if (c->done != 0 && c->done != UINT_MAX) c->done--; sleepq_release(c); return (isdone); } int linux_completion_done(struct completion *c) { int isdone; sleepq_lock(c); isdone = (c->done != 0); sleepq_release(c); return (isdone); } static void linux_cdev_deref(struct linux_cdev *ldev) { if (refcount_release(&ldev->refs)) kfree(ldev); } static void linux_cdev_release(struct kobject *kobj) { struct linux_cdev *cdev; struct kobject *parent; cdev = container_of(kobj, struct linux_cdev, kobj); parent = kobj->parent; linux_destroy_dev(cdev); linux_cdev_deref(cdev); kobject_put(parent); } static void linux_cdev_static_release(struct kobject *kobj) { struct linux_cdev *cdev; struct kobject *parent; cdev = container_of(kobj, struct linux_cdev, kobj); parent = kobj->parent; linux_destroy_dev(cdev); kobject_put(parent); } void linux_destroy_dev(struct linux_cdev *ldev) { if (ldev->cdev == NULL) return; MPASS((ldev->siref & LDEV_SI_DTR) == 0); atomic_set_int(&ldev->siref, LDEV_SI_DTR); while ((atomic_load_int(&ldev->siref) & ~LDEV_SI_DTR) != 0) pause("ldevdtr", hz / 4); destroy_dev(ldev->cdev); ldev->cdev = NULL; } const struct kobj_type linux_cdev_ktype = { .release = linux_cdev_release, }; const struct kobj_type linux_cdev_static_ktype = { .release = linux_cdev_static_release, }; static void linux_handle_ifnet_link_event(void *arg, struct ifnet *ifp, int linkstate) { struct notifier_block *nb; nb = arg; if (linkstate == LINK_STATE_UP) nb->notifier_call(nb, NETDEV_UP, ifp); else nb->notifier_call(nb, NETDEV_DOWN, ifp); } static void linux_handle_ifnet_arrival_event(void *arg, struct ifnet *ifp) { struct notifier_block *nb; nb = arg; nb->notifier_call(nb, NETDEV_REGISTER, ifp); } static void linux_handle_ifnet_departure_event(void *arg, struct ifnet *ifp) { struct notifier_block *nb; nb = arg; nb->notifier_call(nb, NETDEV_UNREGISTER, ifp); } static void linux_handle_iflladdr_event(void *arg, struct ifnet *ifp) { struct notifier_block *nb; nb = arg; nb->notifier_call(nb, NETDEV_CHANGEADDR, ifp); } static void linux_handle_ifaddr_event(void *arg, struct ifnet *ifp) { struct notifier_block *nb; nb = arg; nb->notifier_call(nb, NETDEV_CHANGEIFADDR, ifp); } int register_netdevice_notifier(struct notifier_block *nb) { nb->tags[NETDEV_UP] = EVENTHANDLER_REGISTER( ifnet_link_event, linux_handle_ifnet_link_event, nb, 0); nb->tags[NETDEV_REGISTER] = EVENTHANDLER_REGISTER( ifnet_arrival_event, linux_handle_ifnet_arrival_event, nb, 0); nb->tags[NETDEV_UNREGISTER] = 
EVENTHANDLER_REGISTER( ifnet_departure_event, linux_handle_ifnet_departure_event, nb, 0); nb->tags[NETDEV_CHANGEADDR] = EVENTHANDLER_REGISTER( iflladdr_event, linux_handle_iflladdr_event, nb, 0); return (0); } int register_inetaddr_notifier(struct notifier_block *nb) { nb->tags[NETDEV_CHANGEIFADDR] = EVENTHANDLER_REGISTER( ifaddr_event, linux_handle_ifaddr_event, nb, 0); return (0); } int unregister_netdevice_notifier(struct notifier_block *nb) { EVENTHANDLER_DEREGISTER(ifnet_link_event, nb->tags[NETDEV_UP]); EVENTHANDLER_DEREGISTER(ifnet_arrival_event, nb->tags[NETDEV_REGISTER]); EVENTHANDLER_DEREGISTER(ifnet_departure_event, nb->tags[NETDEV_UNREGISTER]); EVENTHANDLER_DEREGISTER(iflladdr_event, nb->tags[NETDEV_CHANGEADDR]); return (0); } int unregister_inetaddr_notifier(struct notifier_block *nb) { EVENTHANDLER_DEREGISTER(ifaddr_event, nb->tags[NETDEV_CHANGEIFADDR]); return (0); } struct list_sort_thunk { int (*cmp)(void *, struct list_head *, struct list_head *); void *priv; }; static inline int linux_le_cmp(void *priv, const void *d1, const void *d2) { struct list_head *le1, *le2; struct list_sort_thunk *thunk; thunk = priv; le1 = *(__DECONST(struct list_head **, d1)); le2 = *(__DECONST(struct list_head **, d2)); return ((thunk->cmp)(thunk->priv, le1, le2)); } void list_sort(void *priv, struct list_head *head, int (*cmp)(void *priv, struct list_head *a, struct list_head *b)) { struct list_sort_thunk thunk; struct list_head **ar, *le; size_t count, i; count = 0; list_for_each(le, head) count++; ar = malloc(sizeof(struct list_head *) * count, M_KMALLOC, M_WAITOK); i = 0; list_for_each(le, head) ar[i++] = le; thunk.cmp = cmp; thunk.priv = priv; qsort_r(ar, count, sizeof(struct list_head *), &thunk, linux_le_cmp); INIT_LIST_HEAD(head); for (i = 0; i < count; i++) list_add_tail(ar[i], head); free(ar, M_KMALLOC); } void linux_irq_handler(void *ent) { struct irq_ent *irqe; linux_set_current(curthread); irqe = ent; irqe->handler(irqe->irq, irqe->arg); } #if defined(__i386__) || defined(__amd64__) int linux_wbinvd_on_all_cpus(void) { pmap_invalidate_cache(); return (0); } #endif int linux_on_each_cpu(void callback(void *), void *data) { smp_rendezvous(smp_no_rendezvous_barrier, callback, smp_no_rendezvous_barrier, data); return (0); } int linux_in_atomic(void) { return ((curthread->td_pflags & TDP_NOFAULTING) != 0); } struct linux_cdev * linux_find_cdev(const char *name, unsigned major, unsigned minor) { dev_t dev = MKDEV(major, minor); struct cdev *cdev; dev_lock(); LIST_FOREACH(cdev, &linuxcdevsw.d_devs, si_list) { struct linux_cdev *ldev = cdev->si_drv1; if (ldev->dev == dev && strcmp(kobject_name(&ldev->kobj), name) == 0) { break; } } dev_unlock(); return (cdev != NULL ? 
cdev->si_drv1 : NULL); } int __register_chrdev(unsigned int major, unsigned int baseminor, unsigned int count, const char *name, const struct file_operations *fops) { struct linux_cdev *cdev; int ret = 0; int i; for (i = baseminor; i < baseminor + count; i++) { cdev = cdev_alloc(); - cdev_init(cdev, fops); + cdev->ops = fops; kobject_set_name(&cdev->kobj, name); ret = cdev_add(cdev, makedev(major, i), 1); if (ret != 0) break; } return (ret); } int __register_chrdev_p(unsigned int major, unsigned int baseminor, unsigned int count, const char *name, const struct file_operations *fops, uid_t uid, gid_t gid, int mode) { struct linux_cdev *cdev; int ret = 0; int i; for (i = baseminor; i < baseminor + count; i++) { cdev = cdev_alloc(); - cdev_init(cdev, fops); + cdev->ops = fops; kobject_set_name(&cdev->kobj, name); ret = cdev_add_ext(cdev, makedev(major, i), uid, gid, mode); if (ret != 0) break; } return (ret); } void __unregister_chrdev(unsigned int major, unsigned int baseminor, unsigned int count, const char *name) { struct linux_cdev *cdevp; int i; for (i = baseminor; i < baseminor + count; i++) { cdevp = linux_find_cdev(name, major, i); if (cdevp != NULL) cdev_del(cdevp); } } void linux_dump_stack(void) { #ifdef STACK struct stack st; stack_zero(&st); stack_save(&st); stack_print(&st); #endif } #if defined(__i386__) || defined(__amd64__) bool linux_cpu_has_clflush; #endif static void linux_compat_init(void *arg) { struct sysctl_oid *rootoid; int i; #if defined(__i386__) || defined(__amd64__) linux_cpu_has_clflush = (cpu_feature & CPUID_CLFSH); #endif rw_init(&linux_vma_lock, "lkpi-vma-lock"); rootoid = SYSCTL_ADD_ROOT_NODE(NULL, OID_AUTO, "sys", CTLFLAG_RD|CTLFLAG_MPSAFE, NULL, "sys"); kobject_init(&linux_class_root, &linux_class_ktype); kobject_set_name(&linux_class_root, "class"); linux_class_root.oidp = SYSCTL_ADD_NODE(NULL, SYSCTL_CHILDREN(rootoid), OID_AUTO, "class", CTLFLAG_RD|CTLFLAG_MPSAFE, NULL, "class"); kobject_init(&linux_root_device.kobj, &linux_dev_ktype); kobject_set_name(&linux_root_device.kobj, "device"); linux_root_device.kobj.oidp = SYSCTL_ADD_NODE(NULL, SYSCTL_CHILDREN(rootoid), OID_AUTO, "device", CTLFLAG_RD, NULL, "device"); linux_root_device.bsddev = root_bus; linux_class_misc.name = "misc"; class_register(&linux_class_misc); INIT_LIST_HEAD(&pci_drivers); INIT_LIST_HEAD(&pci_devices); spin_lock_init(&pci_lock); mtx_init(&vmmaplock, "IO Map lock", NULL, MTX_DEF); for (i = 0; i < VMMAP_HASH_SIZE; i++) LIST_INIT(&vmmaphead[i]); } SYSINIT(linux_compat, SI_SUB_DRIVERS, SI_ORDER_SECOND, linux_compat_init, NULL); static void linux_compat_uninit(void *arg) { linux_kobject_kfree_name(&linux_class_root); linux_kobject_kfree_name(&linux_root_device.kobj); linux_kobject_kfree_name(&linux_class_misc.kobj); mtx_destroy(&vmmaplock); spin_lock_destroy(&pci_lock); rw_destroy(&linux_vma_lock); } SYSUNINIT(linux_compat, SI_SUB_DRIVERS, SI_ORDER_SECOND, linux_compat_uninit, NULL); /* * NOTE: Linux frequently uses "unsigned long" for pointer to integer * conversion and vice versa, where in FreeBSD "uintptr_t" would be * used. 
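 * For example, a Linux timer callback receives its argument through the
 * "unsigned long" data member of struct timer_list and casts it back to
 * a pointer, which is only safe when both types have the same width.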
Assert these types have the same size, else some parts of the * LinuxKPI may not work like expected: */ CTASSERT(sizeof(unsigned long) == sizeof(uintptr_t)); Index: user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_pci.c =================================================================== --- user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_pci.c (revision 346925) +++ user/ngie/bug-237403/sys/compat/linuxkpi/common/src/linux_pci.c (revision 346926) @@ -1,332 +1,828 @@ /*- * Copyright (c) 2015-2016 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static device_probe_t linux_pci_probe; static device_attach_t linux_pci_attach; static device_detach_t linux_pci_detach; static device_suspend_t linux_pci_suspend; static device_resume_t linux_pci_resume; static device_shutdown_t linux_pci_shutdown; static device_method_t pci_methods[] = { DEVMETHOD(device_probe, linux_pci_probe), DEVMETHOD(device_attach, linux_pci_attach), DEVMETHOD(device_detach, linux_pci_detach), DEVMETHOD(device_suspend, linux_pci_suspend), DEVMETHOD(device_resume, linux_pci_resume), DEVMETHOD(device_shutdown, linux_pci_shutdown), DEVMETHOD_END }; +struct linux_dma_priv { + uint64_t dma_mask; + struct mtx dma_lock; + bus_dma_tag_t dmat; + struct mtx ptree_lock; + struct pctrie ptree; +}; + +static int +linux_pdev_dma_init(struct pci_dev *pdev) +{ + struct linux_dma_priv *priv; + + priv = malloc(sizeof(*priv), M_DEVBUF, M_WAITOK | M_ZERO); + pdev->dev.dma_priv = priv; + + mtx_init(&priv->dma_lock, "linux_dma", NULL, MTX_DEF); + + mtx_init(&priv->ptree_lock, "linux_dma_ptree", NULL, MTX_DEF); + pctrie_init(&priv->ptree); + + return (0); +} + +static int +linux_pdev_dma_uninit(struct pci_dev *pdev) +{ + struct linux_dma_priv *priv; + + priv = pdev->dev.dma_priv; + if (priv->dmat) + bus_dma_tag_destroy(priv->dmat); + mtx_destroy(&priv->dma_lock); + mtx_destroy(&priv->ptree_lock); + free(priv, M_DEVBUF); + pdev->dev.dma_priv = NULL; + return (0); +} + +int +linux_dma_tag_init(struct device *dev, u64 dma_mask) +{ + struct linux_dma_priv *priv; + int error; + + priv = dev->dma_priv; + + if (priv->dmat) { + if (priv->dma_mask == dma_mask) + return (0); + + bus_dma_tag_destroy(priv->dmat); + } + + priv->dma_mask = dma_mask; + + error = bus_dma_tag_create(bus_get_dma_tag(dev->bsddev), + 1, 0, /* alignment, boundary */ + dma_mask, /* lowaddr */ + BUS_SPACE_MAXADDR, /* highaddr */ + NULL, NULL, /* filtfunc, filtfuncarg */ + BUS_SPACE_MAXSIZE, /* maxsize */ + 1, /* nsegments */ + BUS_SPACE_MAXSIZE, /* maxsegsz */ + 0, /* flags */ + NULL, NULL, /* lockfunc, lockfuncarg */ + &priv->dmat); + return (-error); +} + static struct pci_driver * linux_pci_find(device_t dev, const struct pci_device_id **idp) { const struct pci_device_id *id; struct pci_driver *pdrv; uint16_t vendor; uint16_t device; uint16_t subvendor; uint16_t subdevice; vendor = pci_get_vendor(dev); device = pci_get_device(dev); subvendor = pci_get_subvendor(dev); subdevice = pci_get_subdevice(dev); spin_lock(&pci_lock); list_for_each_entry(pdrv, &pci_drivers, links) { for (id = pdrv->id_table; id->vendor != 0; id++) { if (vendor == id->vendor && (PCI_ANY_ID == id->device || device == id->device) && (PCI_ANY_ID == id->subvendor || subvendor == id->subvendor) && (PCI_ANY_ID == id->subdevice || subdevice == id->subdevice)) { *idp = id; spin_unlock(&pci_lock); return (pdrv); } } } spin_unlock(&pci_lock); return (NULL); } static int linux_pci_probe(device_t dev) { const struct pci_device_id *id; struct pci_driver *pdrv; if ((pdrv = linux_pci_find(dev, &id)) == NULL) return (ENXIO); if (device_get_driver(dev) != &pdrv->bsddriver) return (ENXIO); device_set_desc(dev, pdrv->name); return (0); } static int linux_pci_attach(device_t dev) { struct resource_list_entry *rle; struct pci_bus *pbus; struct pci_dev *pdev; struct pci_devinfo *dinfo; struct 
pci_driver *pdrv; const struct pci_device_id *id; device_t parent; devclass_t devclass; int error; linux_set_current(curthread); pdrv = linux_pci_find(dev, &id); pdev = device_get_softc(dev); parent = device_get_parent(dev); devclass = device_get_devclass(parent); if (pdrv->isdrm) { dinfo = device_get_ivars(parent); device_set_ivars(dev, dinfo); } else { dinfo = device_get_ivars(dev); } pdev->dev.parent = &linux_root_device; pdev->dev.bsddev = dev; INIT_LIST_HEAD(&pdev->dev.irqents); pdev->devfn = PCI_DEVFN(pci_get_slot(dev), pci_get_function(dev)); pdev->device = dinfo->cfg.device; pdev->vendor = dinfo->cfg.vendor; pdev->subsystem_vendor = dinfo->cfg.subvendor; pdev->subsystem_device = dinfo->cfg.subdevice; pdev->class = pci_get_class(dev); pdev->revision = pci_get_revid(dev); - pdev->dev.dma_mask = &pdev->dma_mask; pdev->pdrv = pdrv; kobject_init(&pdev->dev.kobj, &linux_dev_ktype); kobject_set_name(&pdev->dev.kobj, device_get_nameunit(dev)); kobject_add(&pdev->dev.kobj, &linux_root_device.kobj, kobject_name(&pdev->dev.kobj)); rle = linux_pci_get_rle(pdev, SYS_RES_IRQ, 0); if (rle != NULL) pdev->dev.irq = rle->start; else pdev->dev.irq = LINUX_IRQ_INVALID; pdev->irq = pdev->dev.irq; + error = linux_pdev_dma_init(pdev); + if (error) + goto out; if (pdev->bus == NULL) { pbus = malloc(sizeof(*pbus), M_DEVBUF, M_WAITOK | M_ZERO); pbus->self = pdev; pbus->number = pci_get_bus(dev); pdev->bus = pbus; } spin_lock(&pci_lock); list_add(&pdev->links, &pci_devices); spin_unlock(&pci_lock); error = pdrv->probe(pdev, id); +out: if (error) { spin_lock(&pci_lock); list_del(&pdev->links); spin_unlock(&pci_lock); put_device(&pdev->dev); error = -error; } return (error); } static int linux_pci_detach(device_t dev) { struct pci_dev *pdev; linux_set_current(curthread); pdev = device_get_softc(dev); pdev->pdrv->remove(pdev); + linux_pdev_dma_uninit(pdev); spin_lock(&pci_lock); list_del(&pdev->links); spin_unlock(&pci_lock); device_set_desc(dev, NULL); put_device(&pdev->dev); return (0); } static int linux_pci_suspend(device_t dev) { const struct dev_pm_ops *pmops; struct pm_message pm = { }; struct pci_dev *pdev; int error; error = 0; linux_set_current(curthread); pdev = device_get_softc(dev); pmops = pdev->pdrv->driver.pm; if (pdev->pdrv->suspend != NULL) error = -pdev->pdrv->suspend(pdev, pm); else if (pmops != NULL && pmops->suspend != NULL) { error = -pmops->suspend(&pdev->dev); if (error == 0 && pmops->suspend_late != NULL) error = -pmops->suspend_late(&pdev->dev); } return (error); } static int linux_pci_resume(device_t dev) { const struct dev_pm_ops *pmops; struct pci_dev *pdev; int error; error = 0; linux_set_current(curthread); pdev = device_get_softc(dev); pmops = pdev->pdrv->driver.pm; if (pdev->pdrv->resume != NULL) error = -pdev->pdrv->resume(pdev); else if (pmops != NULL && pmops->resume != NULL) { if (pmops->resume_early != NULL) error = -pmops->resume_early(&pdev->dev); if (error == 0 && pmops->resume != NULL) error = -pmops->resume(&pdev->dev); } return (error); } static int linux_pci_shutdown(device_t dev) { struct pci_dev *pdev; linux_set_current(curthread); pdev = device_get_softc(dev); if (pdev->pdrv->shutdown != NULL) pdev->pdrv->shutdown(pdev); return (0); } static int _linux_pci_register_driver(struct pci_driver *pdrv, devclass_t dc) { int error; linux_set_current(curthread); spin_lock(&pci_lock); list_add(&pdrv->links, &pci_drivers); spin_unlock(&pci_lock); pdrv->bsddriver.name = pdrv->name; pdrv->bsddriver.methods = pci_methods; pdrv->bsddriver.size = sizeof(struct pci_dev); 
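	/*
	 * The Linux pci_driver is now wrapped in a native newbus driver;
	 * devclass_add_driver() below attaches it to the "pci" (or, for DRM,
	 * "vgapci") devclass.  A consumer normally only sees the Linux-style
	 * entry point, roughly as sketched here with illustrative names:
	 *
	 *	static struct pci_driver foo_driver = {
	 *		.name = "foo",
	 *		.id_table = foo_pci_ids,
	 *		.probe = foo_probe,
	 *		.remove = foo_remove,
	 *	};
	 *
	 *	pci_register_driver(&foo_driver);
	 */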
mtx_lock(&Giant); error = devclass_add_driver(dc, &pdrv->bsddriver, BUS_PASS_DEFAULT, &pdrv->bsdclass); mtx_unlock(&Giant); return (-error); } int linux_pci_register_driver(struct pci_driver *pdrv) { devclass_t dc; dc = devclass_find("pci"); if (dc == NULL) return (-ENXIO); pdrv->isdrm = false; return (_linux_pci_register_driver(pdrv, dc)); } int linux_pci_register_drm_driver(struct pci_driver *pdrv) { devclass_t dc; dc = devclass_create("vgapci"); if (dc == NULL) return (-ENXIO); pdrv->isdrm = true; pdrv->name = "drmn"; return (_linux_pci_register_driver(pdrv, dc)); } void linux_pci_unregister_driver(struct pci_driver *pdrv) { devclass_t bus; bus = devclass_find("pci"); spin_lock(&pci_lock); list_del(&pdrv->links); spin_unlock(&pci_lock); mtx_lock(&Giant); if (bus != NULL) devclass_delete_driver(bus, &pdrv->bsddriver); mtx_unlock(&Giant); +} + +CTASSERT(sizeof(dma_addr_t) <= sizeof(uint64_t)); + +struct linux_dma_obj { + void *vaddr; + uint64_t dma_addr; + bus_dmamap_t dmamap; +}; + +static uma_zone_t linux_dma_trie_zone; +static uma_zone_t linux_dma_obj_zone; + +static void +linux_dma_init(void *arg) +{ + + linux_dma_trie_zone = uma_zcreate("linux_dma_pctrie", + pctrie_node_size(), NULL, NULL, pctrie_zone_init, NULL, + UMA_ALIGN_PTR, 0); + linux_dma_obj_zone = uma_zcreate("linux_dma_object", + sizeof(struct linux_dma_obj), NULL, NULL, NULL, NULL, + UMA_ALIGN_PTR, 0); + +} +SYSINIT(linux_dma, SI_SUB_DRIVERS, SI_ORDER_THIRD, linux_dma_init, NULL); + +static void +linux_dma_uninit(void *arg) +{ + + uma_zdestroy(linux_dma_obj_zone); + uma_zdestroy(linux_dma_trie_zone); +} +SYSUNINIT(linux_dma, SI_SUB_DRIVERS, SI_ORDER_THIRD, linux_dma_uninit, NULL); + +static void * +linux_dma_trie_alloc(struct pctrie *ptree) +{ + + return (uma_zalloc(linux_dma_trie_zone, 0)); +} + +static void +linux_dma_trie_free(struct pctrie *ptree, void *node) +{ + + uma_zfree(linux_dma_trie_zone, node); +} + + +PCTRIE_DEFINE(LINUX_DMA, linux_dma_obj, dma_addr, linux_dma_trie_alloc, + linux_dma_trie_free); + +void * +linux_dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) +{ + struct linux_dma_priv *priv; + vm_paddr_t high; + size_t align; + void *mem; + + if (dev == NULL || dev->dma_priv == NULL) { + *dma_handle = 0; + return (NULL); + } + priv = dev->dma_priv; + if (priv->dma_mask) + high = priv->dma_mask; + else if (flag & GFP_DMA32) + high = BUS_SPACE_MAXADDR_32BIT; + else + high = BUS_SPACE_MAXADDR; + align = PAGE_SIZE << get_order(size); + mem = (void *)kmem_alloc_contig(size, flag, 0, high, align, 0, + VM_MEMATTR_DEFAULT); + if (mem) + *dma_handle = linux_dma_map_phys(dev, vtophys(mem), size); + else + *dma_handle = 0; + return (mem); +} + +dma_addr_t +linux_dma_map_phys(struct device *dev, vm_paddr_t phys, size_t len) +{ + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + int error, nseg; + bus_dma_segment_t seg; + + priv = dev->dma_priv; + + obj = uma_zalloc(linux_dma_obj_zone, 0); + + if (bus_dmamap_create(priv->dmat, 0, &obj->dmamap) != 0) { + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + + nseg = -1; + mtx_lock(&priv->dma_lock); + if (_bus_dmamap_load_phys(priv->dmat, obj->dmamap, phys, len, + BUS_DMA_NOWAIT, &seg, &nseg) != 0) { + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + mtx_unlock(&priv->dma_lock); + + KASSERT(++nseg == 1, ("More than one segment (nseg=%d)", nseg)); + obj->dma_addr = seg.ds_addr; + + mtx_lock(&priv->ptree_lock); + error = 
LINUX_DMA_PCTRIE_INSERT(&priv->ptree, obj); + mtx_unlock(&priv->ptree_lock); + if (error != 0) { + mtx_lock(&priv->dma_lock); + bus_dmamap_unload(priv->dmat, obj->dmamap); + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + + return (obj->dma_addr); +} + +void +linux_dma_unmap(struct device *dev, dma_addr_t dma_addr, size_t len) +{ + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + + priv = dev->dma_priv; + + mtx_lock(&priv->ptree_lock); + obj = LINUX_DMA_PCTRIE_LOOKUP(&priv->ptree, dma_addr); + if (obj == NULL) { + mtx_unlock(&priv->ptree_lock); + return; + } + LINUX_DMA_PCTRIE_REMOVE(&priv->ptree, dma_addr); + mtx_unlock(&priv->ptree_lock); + + mtx_lock(&priv->dma_lock); + bus_dmamap_unload(priv->dmat, obj->dmamap); + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + + uma_zfree(linux_dma_obj_zone, obj); +} + +int +linux_dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, int nents, + enum dma_data_direction dir, struct dma_attrs *attrs) +{ + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + struct scatterlist *dma_sg, *sg; + int dma_nents, error, nseg; + size_t seg_len; + vm_paddr_t seg_phys, prev_phys_end; + bus_dma_segment_t seg; + + priv = dev->dma_priv; + + obj = uma_zalloc(linux_dma_obj_zone, 0); + + if (bus_dmamap_create(priv->dmat, 0, &obj->dmamap) != 0) { + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + + sg = sgl; + dma_sg = sg; + dma_nents = 0; + while (nents > 0) { + seg_phys = sg_phys(sg); + seg_len = sg->length; + while (--nents > 0) { + prev_phys_end = sg_phys(sg) + sg->length; + sg = sg_next(sg); + if (prev_phys_end != sg_phys(sg)) + break; + seg_len += sg->length; + } + + nseg = -1; + mtx_lock(&priv->dma_lock); + if (_bus_dmamap_load_phys(priv->dmat, obj->dmamap, + seg_phys, seg_len, BUS_DMA_NOWAIT, + &seg, &nseg) != 0) { + bus_dmamap_unload(priv->dmat, obj->dmamap); + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + mtx_unlock(&priv->dma_lock); + KASSERT(++nseg == 1, ("More than one segment (nseg=%d)", nseg)); + + sg_dma_address(dma_sg) = seg.ds_addr; + sg_dma_len(dma_sg) = seg.ds_len; + + dma_sg = sg_next(dma_sg); + dma_nents++; + } + + obj->dma_addr = sg_dma_address(sgl); + + mtx_lock(&priv->ptree_lock); + error = LINUX_DMA_PCTRIE_INSERT(&priv->ptree, obj); + mtx_unlock(&priv->ptree_lock); + if (error != 0) { + mtx_lock(&priv->dma_lock); + bus_dmamap_unload(priv->dmat, obj->dmamap); + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + uma_zfree(linux_dma_obj_zone, obj); + return (0); + } + + return (dma_nents); +} + +void +linux_dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, struct dma_attrs *attrs) +{ + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + + priv = dev->dma_priv; + + mtx_lock(&priv->ptree_lock); + obj = LINUX_DMA_PCTRIE_LOOKUP(&priv->ptree, sg_dma_address(sgl)); + if (obj == NULL) { + mtx_unlock(&priv->ptree_lock); + return; + } + LINUX_DMA_PCTRIE_REMOVE(&priv->ptree, sg_dma_address(sgl)); + mtx_unlock(&priv->ptree_lock); + + mtx_lock(&priv->dma_lock); + bus_dmamap_unload(priv->dmat, obj->dmamap); + bus_dmamap_destroy(priv->dmat, obj->dmamap); + mtx_unlock(&priv->dma_lock); + + uma_zfree(linux_dma_obj_zone, obj); +} + +static inline int +dma_pool_obj_ctor(void *mem, int size, void *arg, int flags) +{ + struct linux_dma_obj *obj 
= mem; + struct dma_pool *pool = arg; + int error, nseg; + bus_dma_segment_t seg; + + nseg = -1; + mtx_lock(&pool->pool_dma_lock); + error = _bus_dmamap_load_phys(pool->pool_dmat, obj->dmamap, + vtophys(obj->vaddr), pool->pool_entry_size, BUS_DMA_NOWAIT, + &seg, &nseg); + mtx_unlock(&pool->pool_dma_lock); + if (error != 0) { + return (error); + } + KASSERT(++nseg == 1, ("More than one segment (nseg=%d)", nseg)); + obj->dma_addr = seg.ds_addr; + + return (0); +} + +static void +dma_pool_obj_dtor(void *mem, int size, void *arg) +{ + struct linux_dma_obj *obj = mem; + struct dma_pool *pool = arg; + + mtx_lock(&pool->pool_dma_lock); + bus_dmamap_unload(pool->pool_dmat, obj->dmamap); + mtx_unlock(&pool->pool_dma_lock); +} + +static int +dma_pool_obj_import(void *arg, void **store, int count, int domain __unused, + int flags) +{ + struct dma_pool *pool = arg; + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + int error, i; + + priv = pool->pool_pdev->dev.dma_priv; + for (i = 0; i < count; i++) { + obj = uma_zalloc(linux_dma_obj_zone, flags); + if (obj == NULL) + break; + + error = bus_dmamem_alloc(pool->pool_dmat, &obj->vaddr, + BUS_DMA_NOWAIT, &obj->dmamap); + if (error != 0) { + uma_zfree(linux_dma_obj_zone, obj); + break; + } + + store[i] = obj; + } + + return (i); +} + +static void +dma_pool_obj_release(void *arg, void **store, int count) +{ + struct dma_pool *pool = arg; + struct linux_dma_priv *priv; + struct linux_dma_obj *obj; + int i; + + priv = pool->pool_pdev->dev.dma_priv; + for (i = 0; i < count; i++) { + obj = store[i]; + bus_dmamem_free(pool->pool_dmat, obj->vaddr, obj->dmamap); + uma_zfree(linux_dma_obj_zone, obj); + } +} + +struct dma_pool * +linux_dma_pool_create(char *name, struct device *dev, size_t size, + size_t align, size_t boundary) +{ + struct linux_dma_priv *priv; + struct dma_pool *pool; + + priv = dev->dma_priv; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + pool->pool_pdev = to_pci_dev(dev); + pool->pool_entry_size = size; + + if (bus_dma_tag_create(bus_get_dma_tag(dev->bsddev), + align, boundary, /* alignment, boundary */ + priv->dma_mask, /* lowaddr */ + BUS_SPACE_MAXADDR, /* highaddr */ + NULL, NULL, /* filtfunc, filtfuncarg */ + size, /* maxsize */ + 1, /* nsegments */ + size, /* maxsegsz */ + 0, /* flags */ + NULL, NULL, /* lockfunc, lockfuncarg */ + &pool->pool_dmat)) { + kfree(pool); + return (NULL); + } + + pool->pool_zone = uma_zcache_create(name, -1, dma_pool_obj_ctor, + dma_pool_obj_dtor, NULL, NULL, dma_pool_obj_import, + dma_pool_obj_release, pool, 0); + + mtx_init(&pool->pool_dma_lock, "linux_dma_pool", NULL, MTX_DEF); + + mtx_init(&pool->pool_ptree_lock, "linux_dma_pool_ptree", NULL, + MTX_DEF); + pctrie_init(&pool->pool_ptree); + + return (pool); +} + +void +linux_dma_pool_destroy(struct dma_pool *pool) +{ + + uma_zdestroy(pool->pool_zone); + bus_dma_tag_destroy(pool->pool_dmat); + mtx_destroy(&pool->pool_ptree_lock); + mtx_destroy(&pool->pool_dma_lock); + kfree(pool); +} + +void * +linux_dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, + dma_addr_t *handle) +{ + struct linux_dma_obj *obj; + + obj = uma_zalloc_arg(pool->pool_zone, pool, mem_flags); + if (obj == NULL) + return (NULL); + + mtx_lock(&pool->pool_ptree_lock); + if (LINUX_DMA_PCTRIE_INSERT(&pool->pool_ptree, obj) != 0) { + mtx_unlock(&pool->pool_ptree_lock); + uma_zfree_arg(pool->pool_zone, obj, pool); + return (NULL); + } + mtx_unlock(&pool->pool_ptree_lock); + + *handle = obj->dma_addr; + return (obj->vaddr); +} + +void +linux_dma_pool_free(struct dma_pool *pool, void
*vaddr, dma_addr_t dma_addr) +{ + struct linux_dma_obj *obj; + + mtx_lock(&pool->pool_ptree_lock); + obj = LINUX_DMA_PCTRIE_LOOKUP(&pool->pool_ptree, dma_addr); + if (obj == NULL) { + mtx_unlock(&pool->pool_ptree_lock); + return; + } + LINUX_DMA_PCTRIE_REMOVE(&pool->pool_ptree, dma_addr); + mtx_unlock(&pool->pool_ptree_lock); + + uma_zfree_arg(pool->pool_zone, obj, pool); } Index: user/ngie/bug-237403/sys/conf/files.powerpc =================================================================== --- user/ngie/bug-237403/sys/conf/files.powerpc (revision 346925) +++ user/ngie/bug-237403/sys/conf/files.powerpc (revision 346926) @@ -1,278 +1,278 @@ # This file tells config what files go into building a kernel, # files marked standard are always included. # # $FreeBSD$ # # The long compile-with and dependency lines are required because of # limitations in config: backslash-newline doesn't work in strings, and # dependency lines other than the first are silently ignored. # # font.h optional sc \ compile-with "uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x16.fnt && file2c 'u_char dflt_font_16[16*256] = {' '};' < ${SC_DFLT_FONT}-8x16 > font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x14.fnt && file2c 'u_char dflt_font_14[14*256] = {' '};' < ${SC_DFLT_FONT}-8x14 >> font.h && uudecode < /usr/share/syscons/fonts/${SC_DFLT_FONT}-8x8.fnt && file2c 'u_char dflt_font_8[8*256] = {' '};' < ${SC_DFLT_FONT}-8x8 >> font.h" \ no-obj no-implicit-rule before-depend \ clean "font.h ${SC_DFLT_FONT}-8x14 ${SC_DFLT_FONT}-8x16 ${SC_DFLT_FONT}-8x8" # # There is only an asm version on ppc64. cddl/compat/opensolaris/kern/opensolaris_atomic.c optional zfs powerpc | dtrace powerpc | zfs powerpcspe | dtrace powerpcspe compile-with "${ZFS_C}" cddl/contrib/opensolaris/common/atomic/powerpc64/opensolaris_atomic.S optional zfs powerpc64 | dtrace powerpc64 compile-with "${ZFS_S}" cddl/dev/dtrace/powerpc/dtrace_asm.S optional dtrace compile-with "${DTRACE_S}" cddl/dev/dtrace/powerpc/dtrace_subr.c optional dtrace compile-with "${DTRACE_C}" cddl/dev/fbt/powerpc/fbt_isa.c optional dtrace_fbt | dtraceall compile-with "${FBT_C}" crypto/blowfish/bf_enc.c optional crypto | ipsec | ipsec_support crypto/des/des_enc.c optional crypto | ipsec | ipsec_support | netsmb dev/bm/if_bm.c optional bm powermac dev/adb/adb_bus.c optional adb dev/adb/adb_kbd.c optional adb dev/adb/adb_mouse.c optional adb dev/adb/adb_hb_if.m optional adb dev/adb/adb_if.m optional adb dev/adb/adb_buttons.c optional adb dev/agp/agp_apple.c optional agp powermac dev/fb/fb.c optional sc dev/hwpmc/hwpmc_e500.c optional hwpmc dev/hwpmc/hwpmc_mpc7xxx.c optional hwpmc dev/hwpmc/hwpmc_powerpc.c optional hwpmc dev/hwpmc/hwpmc_ppc970.c optional hwpmc dev/iicbus/ad7417.c optional ad7417 powermac dev/iicbus/adm1030.c optional powermac windtunnel | adm1030 powermac dev/iicbus/adt746x.c optional adt746x powermac dev/iicbus/ds1631.c optional ds1631 powermac dev/iicbus/ds1775.c optional ds1775 powermac dev/iicbus/max6690.c optional max6690 powermac dev/iicbus/ofw_iicbus.c optional iicbus aim dev/ipmi/ipmi.c optional ipmi dev/ipmi/ipmi_opal.c optional powernv ipmi dev/nand/nfc_fsl.c optional nand mpc85xx dev/nand/nfc_rb.c optional nand mpc85xx # Most ofw stuff below is brought in by conf/files for options FDT, but # we always want it, even on non-FDT platforms. 
dev/fdt/simplebus.c standard dev/ofw/openfirm.c standard dev/ofw/openfirmio.c standard dev/ofw/ofw_bus_if.m standard dev/ofw/ofw_cpu.c standard dev/ofw/ofw_if.m standard dev/ofw/ofw_bus_subr.c standard dev/ofw/ofw_console.c optional aim dev/ofw/ofw_disk.c optional ofwd aim dev/ofw/ofwbus.c standard dev/ofw/ofwpci.c optional pci dev/ofw/ofw_standard.c optional aim powerpc dev/ofw/ofw_subr.c standard dev/powermac_nvram/powermac_nvram.c optional powermac_nvram powermac dev/quicc/quicc_bfe_fdt.c optional quicc mpc85xx dev/random/darn.c optional powerpc64 random dev/scc/scc_bfe_macio.c optional scc powermac dev/sdhci/fsl_sdhci.c optional mpc85xx sdhci dev/sec/sec.c optional sec mpc85xx dev/sound/macio/aoa.c optional snd_davbus | snd_ai2s powermac dev/sound/macio/davbus.c optional snd_davbus powermac dev/sound/macio/i2s.c optional snd_ai2s powermac dev/sound/macio/onyx.c optional snd_ai2s iicbus powermac dev/sound/macio/snapper.c optional snd_ai2s iicbus powermac dev/sound/macio/tumbler.c optional snd_ai2s iicbus powermac dev/syscons/scgfbrndr.c optional sc dev/tsec/if_tsec.c optional tsec dev/tsec/if_tsec_fdt.c optional tsec dev/uart/uart_cpu_powerpc.c optional uart dev/usb/controller/ehci_fsl.c optional ehci mpc85xx dev/vt/hw/ofwfb/ofwfb.c optional vt aim kern/kern_clocksource.c standard kern/subr_dummy_vdso_tc.c standard kern/syscalls.c optional ktr kern/subr_sfbuf.c standard libkern/ashldi3.c optional powerpc | powerpcspe libkern/ashrdi3.c optional powerpc | powerpcspe libkern/bcmp.c standard libkern/bcopy.c standard libkern/cmpdi2.c optional powerpc | powerpcspe libkern/divdi3.c optional powerpc | powerpcspe libkern/ffs.c standard libkern/ffsl.c standard libkern/ffsll.c standard libkern/fls.c standard libkern/flsl.c standard libkern/flsll.c standard libkern/lshrdi3.c optional powerpc | powerpcspe libkern/memcmp.c standard libkern/memset.c standard libkern/moddi3.c optional powerpc | powerpcspe libkern/qdivrem.c optional powerpc | powerpcspe libkern/ucmpdi2.c optional powerpc | powerpcspe libkern/udivdi3.c optional powerpc | powerpcspe libkern/umoddi3.c optional powerpc | powerpcspe powerpc/aim/locore.S optional aim no-obj powerpc/aim/aim_machdep.c optional aim powerpc/aim/mmu_oea.c optional aim powerpc powerpc/aim/mmu_oea64.c optional aim powerpc/aim/moea64_if.m optional aim powerpc/aim/moea64_native.c optional aim powerpc/aim/mp_cpudep.c optional aim powerpc/aim/slb.c optional aim powerpc64 powerpc/booke/locore.S optional booke no-obj powerpc/booke/booke_machdep.c optional booke powerpc/booke/machdep_e500.c optional booke_e500 powerpc/booke/mp_cpudep.c optional booke smp powerpc/booke/platform_bare.c optional booke powerpc/booke/pmap.c optional booke powerpc/booke/spe.c optional powerpcspe powerpc/cpufreq/dfs.c optional cpufreq powerpc/cpufreq/mpc85xx_jog.c optional cpufreq mpc85xx powerpc/cpufreq/pcr.c optional cpufreq aim powerpc/cpufreq/pmcr.c optional cpufreq aim powerpc64 powerpc/cpufreq/pmufreq.c optional cpufreq aim pmu powerpc/fpu/fpu_add.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_compare.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_div.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_emu.c optional fpu_emu powerpc/fpu/fpu_explode.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_implode.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_mul.c optional fpu_emu | powerpcspe powerpc/fpu/fpu_sqrt.c optional fpu_emu powerpc/fpu/fpu_subr.c optional fpu_emu | powerpcspe powerpc/mambo/mambocall.S optional mambo powerpc/mambo/mambo.c optional mambo powerpc/mambo/mambo_console.c optional 
mambo powerpc/mambo/mambo_disk.c optional mambo powerpc/mikrotik/platform_rb.c optional mikrotik powerpc/mikrotik/rb_led.c optional mikrotik powerpc/mpc85xx/atpic.c optional mpc85xx isa powerpc/mpc85xx/ds1553_bus_fdt.c optional ds1553 powerpc/mpc85xx/ds1553_core.c optional ds1553 powerpc/mpc85xx/fsl_diu.c optional mpc85xx diu powerpc/mpc85xx/fsl_espi.c optional mpc85xx spibus powerpc/mpc85xx/fsl_sata.c optional mpc85xx ata powerpc/mpc85xx/i2c.c optional iicbus powerpc/mpc85xx/isa.c optional mpc85xx isa powerpc/mpc85xx/lbc.c optional mpc85xx powerpc/mpc85xx/mpc85xx.c optional mpc85xx powerpc/mpc85xx/mpc85xx_cache.c optional mpc85xx powerpc/mpc85xx/mpc85xx_gpio.c optional mpc85xx gpio powerpc/mpc85xx/platform_mpc85xx.c optional mpc85xx powerpc/mpc85xx/pci_mpc85xx.c optional pci mpc85xx powerpc/mpc85xx/pci_mpc85xx_pcib.c optional pci mpc85xx powerpc/mpc85xx/qoriq_gpio.c optional mpc85xx gpio powerpc/ofw/ofw_machdep.c standard powerpc/ofw/ofw_pcibus.c optional pci powerpc/ofw/ofw_pcib_pci.c optional pci powerpc/ofw/ofw_real.c optional aim powerpc/ofw/ofw_syscons.c optional sc aim powerpc/ofw/ofwcall32.S optional aim powerpc powerpc/ofw/ofwcall64.S optional aim powerpc64 powerpc/ofw/openpic_ofw.c standard powerpc/ofw/rtas.c optional aim powerpc/ofw/ofw_initrd.c optional md_root_mem powerpc64 powerpc/powermac/ata_kauai.c optional powermac ata | powermac atamacio powerpc/powermac/ata_macio.c optional powermac ata | powermac atamacio powerpc/powermac/ata_dbdma.c optional powermac ata | powermac atamacio powerpc/powermac/atibl.c optional powermac atibl powerpc/powermac/cuda.c optional powermac cuda powerpc/powermac/cpcht.c optional powermac pci powerpc/powermac/dbdma.c optional powermac pci powerpc/powermac/fcu.c optional powermac fcu powerpc/powermac/grackle.c optional powermac pci powerpc/powermac/hrowpic.c optional powermac pci powerpc/powermac/kiic.c optional powermac kiic powerpc/powermac/macgpio.c optional powermac pci powerpc/powermac/macio.c optional powermac pci powerpc/powermac/nvbl.c optional powermac nvbl powerpc/powermac/platform_powermac.c optional powermac powerpc/powermac/powermac_thermal.c optional powermac powerpc/powermac/pswitch.c optional powermac pswitch powerpc/powermac/pmu.c optional powermac pmu powerpc/powermac/smu.c optional powermac smu powerpc/powermac/smusat.c optional powermac smu powerpc/powermac/uninorth.c optional powermac powerpc/powermac/uninorthpci.c optional powermac pci powerpc/powermac/vcoregpio.c optional powermac powerpc/powernv/opal.c optional powernv powerpc/powernv/opal_async.c optional powernv powerpc/powernv/opal_console.c optional powernv powerpc/powernv/opal_dev.c optional powernv -powerpc/powernv/opal_flash.c optional powernv +powerpc/powernv/opal_flash.c optional powernv opalflash powerpc/powernv/opal_hmi.c optional powernv powerpc/powernv/opal_i2c.c optional iicbus fdt powernv powerpc/powernv/opal_i2cm.c optional iicbus fdt powernv powerpc/powernv/opal_pci.c optional powernv pci powerpc/powernv/opal_sensor.c optional powernv powerpc/powernv/opalcall.S optional powernv powerpc/powernv/platform_powernv.c optional powernv powerpc/powernv/powernv_centaur.c optional powernv powerpc/powernv/powernv_xscom.c optional powernv powerpc/powernv/xive.c optional powernv powerpc/powerpc/altivec.c optional powerpc | powerpc64 powerpc/powerpc/autoconf.c standard powerpc/powerpc/bus_machdep.c standard powerpc/powerpc/busdma_machdep.c standard powerpc/powerpc/clock.c standard powerpc/powerpc/copyinout.c standard powerpc/powerpc/copystr.c standard 
powerpc/powerpc/cpu.c standard powerpc/powerpc/cpu_subr64.S optional powerpc64 powerpc/powerpc/db_disasm.c optional ddb powerpc/powerpc/db_hwwatch.c optional ddb powerpc/powerpc/db_interface.c optional ddb powerpc/powerpc/db_trace.c optional ddb powerpc/powerpc/dump_machdep.c standard powerpc/powerpc/elf32_machdep.c optional powerpc | powerpcspe | compat_freebsd32 powerpc/powerpc/elf64_machdep.c optional powerpc64 powerpc/powerpc/exec_machdep.c standard powerpc/powerpc/fpu.c standard powerpc/powerpc/gdb_machdep.c optional gdb powerpc/powerpc/in_cksum.c optional inet | inet6 powerpc/powerpc/interrupt.c standard powerpc/powerpc/intr_machdep.c standard powerpc/powerpc/iommu_if.m standard powerpc/powerpc/machdep.c standard powerpc/powerpc/mem.c optional mem powerpc/powerpc/mmu_if.m standard powerpc/powerpc/mp_machdep.c optional smp powerpc/powerpc/nexus.c standard powerpc/powerpc/openpic.c standard powerpc/powerpc/pic_if.m standard powerpc/powerpc/pmap_dispatch.c standard powerpc/powerpc/platform.c standard powerpc/powerpc/platform_if.m standard powerpc/powerpc/ptrace_machdep.c standard powerpc/powerpc/sc_machdep.c optional sc powerpc/powerpc/setjmp.S standard powerpc/powerpc/sigcode32.S optional powerpc | powerpcspe | compat_freebsd32 powerpc/powerpc/sigcode64.S optional powerpc64 powerpc/powerpc/swtch32.S optional powerpc | powerpcspe powerpc/powerpc/swtch64.S optional powerpc64 powerpc/powerpc/stack_machdep.c optional ddb | stack powerpc/powerpc/syncicache.c standard powerpc/powerpc/sys_machdep.c standard powerpc/powerpc/trap.c standard powerpc/powerpc/uio_machdep.c standard powerpc/powerpc/uma_machdep.c standard powerpc/powerpc/vm_machdep.c standard powerpc/ps3/ehci_ps3.c optional ps3 ehci powerpc/ps3/ohci_ps3.c optional ps3 ohci powerpc/ps3/if_glc.c optional ps3 glc powerpc/ps3/mmu_ps3.c optional ps3 powerpc/ps3/platform_ps3.c optional ps3 powerpc/ps3/ps3bus.c optional ps3 powerpc/ps3/ps3cdrom.c optional ps3 scbus powerpc/ps3/ps3disk.c optional ps3 powerpc/ps3/ps3pic.c optional ps3 powerpc/ps3/ps3_syscons.c optional ps3 vt powerpc/ps3/ps3-hvcall.S optional ps3 powerpc/pseries/phyp-hvcall.S optional pseries powerpc64 powerpc/pseries/mmu_phyp.c optional pseries powerpc64 powerpc/pseries/phyp_console.c optional pseries powerpc64 uart powerpc/pseries/phyp_llan.c optional llan powerpc/pseries/phyp_vscsi.c optional pseries powerpc64 scbus powerpc/pseries/platform_chrp.c optional pseries powerpc/pseries/plpar_iommu.c optional pseries powerpc64 powerpc/pseries/plpar_pcibus.c optional pseries powerpc64 pci powerpc/pseries/rtas_dev.c optional pseries powerpc/pseries/rtas_pci.c optional pseries pci powerpc/pseries/vdevice.c optional pseries powerpc64 powerpc/pseries/xics.c optional pseries powerpc64 powerpc/psim/iobus.c optional psim powerpc/psim/ata_iobus.c optional ata psim powerpc/psim/openpic_iobus.c optional psim powerpc/psim/uart_iobus.c optional uart psim Index: user/ngie/bug-237403/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c =================================================================== --- user/ngie/bug-237403/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c (revision 346925) +++ user/ngie/bug-237403/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c (revision 346926) @@ -1,1447 +1,1447 @@ /* $FreeBSD$ */ /* * Copyright (C) 2012 by Darren Reed. * * See the IPFILTER.LICENCE file for details on licencing. 
*/ #if !defined(lint) static const char sccsid[] = "@(#)ip_fil.c 2.41 6/5/96 (C) 1993-2000 Darren Reed"; static const char rcsid[] = "@(#)$Id$"; #endif #if defined(KERNEL) || defined(_KERNEL) # undef KERNEL # undef _KERNEL # define KERNEL 1 # define _KERNEL 1 #endif #if defined(__FreeBSD_version) && (__FreeBSD_version >= 400000) && \ !defined(KLD_MODULE) && !defined(IPFILTER_LKM) # include "opt_inet6.h" #endif #if defined(__FreeBSD_version) && (__FreeBSD_version >= 440000) && \ !defined(KLD_MODULE) && !defined(IPFILTER_LKM) # include "opt_random_ip_id.h" #endif #include #include #include #include #include # include # include #include #include # include #if defined(__FreeBSD_version) && (__FreeBSD_version >= 800000) #include #endif # include # include # include #include # include # include #include # include # include #include #include #include #include #include #include #include #include #include #include #include #include #include "netinet/ip_compat.h" #ifdef USE_INET6 # include #endif #include "netinet/ip_fil.h" #include "netinet/ip_nat.h" #include "netinet/ip_frag.h" #include "netinet/ip_state.h" #include "netinet/ip_proxy.h" #include "netinet/ip_auth.h" #include "netinet/ip_sync.h" #include "netinet/ip_lookup.h" #include "netinet/ip_dstlist.h" #ifdef IPFILTER_SCAN #include "netinet/ip_scan.h" #endif #include "netinet/ip_pool.h" # include #include #ifdef CSUM_DATA_VALID #include #endif extern int ip_optcopy __P((struct ip *, struct ip *)); # ifdef IPFILTER_M_IPFILTER MALLOC_DEFINE(M_IPFILTER, "ipfilter", "IP Filter packet filter data structures"); # endif static int ipf_send_ip __P((fr_info_t *, mb_t *)); static void ipf_timer_func __P((void *arg)); VNET_DEFINE(ipf_main_softc_t, ipfmain) = { .ipf_running = -2, }; #define V_ipfmain VNET(ipfmain) # include # include static eventhandler_tag ipf_arrivetag, ipf_departtag; #if 0 /* * Disable the "cloner" event handler; we are getting interface * events before the firewall is fully initiallized and also no vnet * information thus leading to uninitialised memory accesses. * In addition it is unclear why we need it in first place. * If it turns out to be needed, well need a dedicated event handler * for it to deal with the ifc and the correct vnet. */ static eventhandler_tag ipf_clonetag; #endif static void ipf_ifevent(void *arg, struct ifnet *ifp); static void ipf_ifevent(arg, ifp) void *arg; struct ifnet *ifp; { CURVNET_SET(ifp->if_vnet); if (V_ipfmain.ipf_running > 0) ipf_sync(&V_ipfmain, NULL); CURVNET_RESTORE(); } static pfil_return_t ipf_check_wrapper(struct mbuf **mp, struct ifnet *ifp, int flags, void *ruleset __unused, struct inpcb *inp) { struct ip *ip = mtod(*mp, struct ip *); pfil_return_t rv; CURVNET_SET(ifp->if_vnet); rv = ipf_check(&V_ipfmain, ip, ip->ip_hl << 2, ifp, !!(flags & PFIL_OUT), mp); CURVNET_RESTORE(); return (rv == 0 ? PFIL_PASS : PFIL_DROPPED); } #ifdef USE_INET6 static pfil_return_t ipf_check_wrapper6(struct mbuf **mp, struct ifnet *ifp, int flags, void *ruleset __unused, struct inpcb *inp) { pfil_return_t rv; CURVNET_SET(ifp->if_vnet); rv = ipf_check(&V_ipfmain, mtod(*mp, struct ip *), sizeof(struct ip6_hdr), ifp, !!(flags & PFIL_OUT), mp); CURVNET_RESTORE(); return (rv == 0 ? 
PFIL_PASS : PFIL_DROPPED); } # endif #if defined(IPFILTER_LKM) int ipf_identify(s) char *s; { if (strcmp(s, "ipl") == 0) return 1; return 0; } #endif /* IPFILTER_LKM */ static void ipf_timer_func(arg) void *arg; { ipf_main_softc_t *softc = arg; SPL_INT(s); SPL_NET(s); READ_ENTER(&softc->ipf_global); if (softc->ipf_running > 0) ipf_slowtimer(softc); if (softc->ipf_running == -1 || softc->ipf_running == 1) { #if 0 softc->ipf_slow_ch = timeout(ipf_timer_func, softc, hz/2); #endif callout_init(&softc->ipf_slow_ch, 1); callout_reset(&softc->ipf_slow_ch, (hz / IPF_HZ_DIVIDE) * IPF_HZ_MULT, ipf_timer_func, softc); } RWLOCK_EXIT(&softc->ipf_global); SPL_X(s); } int ipfattach(softc) ipf_main_softc_t *softc; { #ifdef USE_SPL int s; #endif SPL_NET(s); if (softc->ipf_running > 0) { SPL_X(s); return EBUSY; } if (ipf_init_all(softc) < 0) { SPL_X(s); return EIO; } bzero((char *)V_ipfmain.ipf_selwait, sizeof(V_ipfmain.ipf_selwait)); softc->ipf_running = 1; if (softc->ipf_control_forwarding & 1) V_ipforwarding = 1; SPL_X(s); #if 0 softc->ipf_slow_ch = timeout(ipf_timer_func, softc, (hz / IPF_HZ_DIVIDE) * IPF_HZ_MULT); #endif callout_init(&softc->ipf_slow_ch, 1); callout_reset(&softc->ipf_slow_ch, (hz / IPF_HZ_DIVIDE) * IPF_HZ_MULT, ipf_timer_func, softc); return 0; } /* * Disable the filter by removing the hooks from the IP input/output * stream. */ int ipfdetach(softc) ipf_main_softc_t *softc; { #ifdef USE_SPL int s; #endif if (softc->ipf_control_forwarding & 2) V_ipforwarding = 0; SPL_NET(s); #if 0 if (softc->ipf_slow_ch.callout != NULL) untimeout(ipf_timer_func, softc, softc->ipf_slow_ch); bzero(&softc->ipf_slow, sizeof(softc->ipf_slow)); #endif callout_drain(&softc->ipf_slow_ch); ipf_fini_all(softc); softc->ipf_running = -2; SPL_X(s); return 0; } /* * Filter ioctl interface. */ int ipfioctl(dev, cmd, data, mode, p) struct thread *p; # define p_cred td_ucred # define p_uid td_ucred->cr_ruid struct cdev *dev; ioctlcmd_t cmd; caddr_t data; int mode; { int error = 0, unit = 0; SPL_INT(s); CURVNET_SET(TD_TO_VNET(p)); #if (BSD >= 199306) if (securelevel_ge(p->p_cred, 3) && (mode & FWRITE)) { V_ipfmain.ipf_interror = 130001; CURVNET_RESTORE(); return EPERM; } #endif unit = GET_MINOR(dev); if ((IPL_LOGMAX < unit) || (unit < 0)) { V_ipfmain.ipf_interror = 130002; CURVNET_RESTORE(); return ENXIO; } if (V_ipfmain.ipf_running <= 0) { if (unit != IPL_LOGIPF && cmd != SIOCIPFINTERROR) { V_ipfmain.ipf_interror = 130003; CURVNET_RESTORE(); return EIO; } if (cmd != SIOCIPFGETNEXT && cmd != SIOCIPFGET && cmd != SIOCIPFSET && cmd != SIOCFRENB && cmd != SIOCGETFS && cmd != SIOCGETFF && cmd != SIOCIPFINTERROR) { V_ipfmain.ipf_interror = 130004; CURVNET_RESTORE(); return EIO; } } SPL_NET(s); error = ipf_ioctlswitch(&V_ipfmain, unit, data, cmd, mode, p->p_uid, p); CURVNET_RESTORE(); if (error != -1) { SPL_X(s); return error; } SPL_X(s); return error; } /* * ipf_send_reset - this could conceivably be a call to tcp_respond(), but that * requires a large amount of setting up and isn't any more efficient. */ int ipf_send_reset(fin) fr_info_t *fin; { struct tcphdr *tcp, *tcp2; int tlen = 0, hlen; struct mbuf *m; #ifdef USE_INET6 ip6_t *ip6; #endif ip_t *ip; tcp = fin->fin_dp; if (tcp->th_flags & TH_RST) return -1; /* feedback loop */ if (ipf_checkl4sum(fin) == -1) return -1; tlen = fin->fin_dlen - (TCP_OFF(tcp) << 2) + ((tcp->th_flags & TH_SYN) ? 1 : 0) + ((tcp->th_flags & TH_FIN) ? 1 : 0); #ifdef USE_INET6 hlen = (fin->fin_v == 6) ? 
sizeof(ip6_t) : sizeof(ip_t); #else hlen = sizeof(ip_t); #endif #ifdef MGETHDR MGETHDR(m, M_NOWAIT, MT_HEADER); #else MGET(m, M_NOWAIT, MT_HEADER); #endif if (m == NULL) return -1; if (sizeof(*tcp2) + hlen > MLEN) { if (!(MCLGET(m, M_NOWAIT))) { FREE_MB_T(m); return -1; } } m->m_len = sizeof(*tcp2) + hlen; #if (BSD >= 199103) m->m_data += max_linkhdr; m->m_pkthdr.len = m->m_len; m->m_pkthdr.rcvif = (struct ifnet *)0; #endif ip = mtod(m, struct ip *); bzero((char *)ip, hlen); #ifdef USE_INET6 ip6 = (ip6_t *)ip; #endif tcp2 = (struct tcphdr *)((char *)ip + hlen); tcp2->th_sport = tcp->th_dport; tcp2->th_dport = tcp->th_sport; if (tcp->th_flags & TH_ACK) { tcp2->th_seq = tcp->th_ack; tcp2->th_flags = TH_RST; tcp2->th_ack = 0; } else { tcp2->th_seq = 0; tcp2->th_ack = ntohl(tcp->th_seq); tcp2->th_ack += tlen; tcp2->th_ack = htonl(tcp2->th_ack); tcp2->th_flags = TH_RST|TH_ACK; } TCP_X2_A(tcp2, 0); TCP_OFF_A(tcp2, sizeof(*tcp2) >> 2); tcp2->th_win = tcp->th_win; tcp2->th_sum = 0; tcp2->th_urp = 0; #ifdef USE_INET6 if (fin->fin_v == 6) { ip6->ip6_flow = ((ip6_t *)fin->fin_ip)->ip6_flow; ip6->ip6_plen = htons(sizeof(struct tcphdr)); ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_hlim = 0; ip6->ip6_src = fin->fin_dst6.in6; ip6->ip6_dst = fin->fin_src6.in6; tcp2->th_sum = in6_cksum(m, IPPROTO_TCP, sizeof(*ip6), sizeof(*tcp2)); return ipf_send_ip(fin, m); } #endif ip->ip_p = IPPROTO_TCP; ip->ip_len = htons(sizeof(struct tcphdr)); ip->ip_src.s_addr = fin->fin_daddr; ip->ip_dst.s_addr = fin->fin_saddr; tcp2->th_sum = in_cksum(m, hlen + sizeof(*tcp2)); ip->ip_len = htons(hlen + sizeof(*tcp2)); return ipf_send_ip(fin, m); } /* * ip_len must be in network byte order when called. */ static int ipf_send_ip(fin, m) fr_info_t *fin; mb_t *m; { fr_info_t fnew; ip_t *ip, *oip; int hlen; ip = mtod(m, ip_t *); bzero((char *)&fnew, sizeof(fnew)); fnew.fin_main_soft = fin->fin_main_soft; IP_V_A(ip, fin->fin_v); switch (fin->fin_v) { case 4 : oip = fin->fin_ip; hlen = sizeof(*oip); fnew.fin_v = 4; fnew.fin_p = ip->ip_p; fnew.fin_plen = ntohs(ip->ip_len); IP_HL_A(ip, sizeof(*oip) >> 2); ip->ip_tos = oip->ip_tos; ip->ip_id = fin->fin_ip->ip_id; ip->ip_off = htons(V_path_mtu_discovery ? 
IP_DF : 0); ip->ip_ttl = V_ip_defttl; ip->ip_sum = 0; break; #ifdef USE_INET6 case 6 : { ip6_t *ip6 = (ip6_t *)ip; ip6->ip6_vfc = 0x60; ip6->ip6_hlim = IPDEFTTL; hlen = sizeof(*ip6); fnew.fin_p = ip6->ip6_nxt; fnew.fin_v = 6; fnew.fin_plen = ntohs(ip6->ip6_plen) + hlen; break; } #endif default : return EINVAL; } #ifdef IPSEC m->m_pkthdr.rcvif = NULL; #endif fnew.fin_ifp = fin->fin_ifp; fnew.fin_flx = FI_NOCKSUM; fnew.fin_m = m; fnew.fin_ip = ip; fnew.fin_mp = &m; fnew.fin_hlen = hlen; fnew.fin_dp = (char *)ip + hlen; (void) ipf_makefrip(hlen, ip, &fnew); return ipf_fastroute(m, &m, &fnew, NULL); } int ipf_send_icmp_err(type, fin, dst) int type; fr_info_t *fin; int dst; { int err, hlen, xtra, iclen, ohlen, avail, code; struct in_addr dst4; struct icmp *icmp; struct mbuf *m; i6addr_t dst6; void *ifp; #ifdef USE_INET6 ip6_t *ip6; #endif ip_t *ip, *ip2; if ((type < 0) || (type >= ICMP_MAXTYPE)) return -1; code = fin->fin_icode; #ifdef USE_INET6 /* See NetBSD ip_fil_netbsd.c r1.4: */ if ((code < 0) || (code >= sizeof(icmptoicmp6unreach)/sizeof(int))) return -1; #endif if (ipf_checkl4sum(fin) == -1) return -1; #ifdef MGETHDR MGETHDR(m, M_NOWAIT, MT_HEADER); #else MGET(m, M_NOWAIT, MT_HEADER); #endif if (m == NULL) return -1; avail = MHLEN; xtra = 0; hlen = 0; ohlen = 0; dst4.s_addr = 0; ifp = fin->fin_ifp; if (fin->fin_v == 4) { if ((fin->fin_p == IPPROTO_ICMP) && !(fin->fin_flx & FI_SHORT)) switch (ntohs(fin->fin_data[0]) >> 8) { case ICMP_ECHO : case ICMP_TSTAMP : case ICMP_IREQ : case ICMP_MASKREQ : break; default : FREE_MB_T(m); return 0; } if (dst == 0) { if (ipf_ifpaddr(&V_ipfmain, 4, FRI_NORMAL, ifp, &dst6, NULL) == -1) { FREE_MB_T(m); return -1; } dst4 = dst6.in4; } else dst4.s_addr = fin->fin_daddr; hlen = sizeof(ip_t); ohlen = fin->fin_hlen; iclen = hlen + offsetof(struct icmp, icmp_ip) + ohlen; if (fin->fin_hlen < fin->fin_plen) xtra = MIN(fin->fin_dlen, 8); else xtra = 0; } #ifdef USE_INET6 else if (fin->fin_v == 6) { hlen = sizeof(ip6_t); ohlen = sizeof(ip6_t); iclen = hlen + offsetof(struct icmp, icmp_ip) + ohlen; type = icmptoicmp6types[type]; if (type == ICMP6_DST_UNREACH) code = icmptoicmp6unreach[code]; if (iclen + max_linkhdr + fin->fin_plen > avail) { if (!(MCLGET(m, M_NOWAIT))) { FREE_MB_T(m); return -1; } avail = MCLBYTES; } xtra = MIN(fin->fin_plen, avail - iclen - max_linkhdr); xtra = MIN(xtra, IPV6_MMTU - iclen); if (dst == 0) { if (ipf_ifpaddr(&V_ipfmain, 6, FRI_NORMAL, ifp, &dst6, NULL) == -1) { FREE_MB_T(m); return -1; } } else dst6 = fin->fin_dst6; } #endif else { FREE_MB_T(m); return -1; } avail -= (max_linkhdr + iclen); if (avail < 0) { FREE_MB_T(m); return -1; } if (xtra > avail) xtra = avail; iclen += xtra; m->m_data += max_linkhdr; m->m_pkthdr.rcvif = (struct ifnet *)0; m->m_pkthdr.len = iclen; m->m_len = iclen; ip = mtod(m, ip_t *); icmp = (struct icmp *)((char *)ip + hlen); ip2 = (ip_t *)&icmp->icmp_ip; icmp->icmp_type = type; icmp->icmp_code = fin->fin_icode; icmp->icmp_cksum = 0; #ifdef icmp_nextmtu if (type == ICMP_UNREACH && fin->fin_icode == ICMP_UNREACH_NEEDFRAG) { if (fin->fin_mtu != 0) { icmp->icmp_nextmtu = htons(fin->fin_mtu); } else if (ifp != NULL) { icmp->icmp_nextmtu = htons(GETIFMTU_4(ifp)); } else { /* make up a number... 
*/ icmp->icmp_nextmtu = htons(fin->fin_plen - 20); } } #endif bcopy((char *)fin->fin_ip, (char *)ip2, ohlen); #ifdef USE_INET6 ip6 = (ip6_t *)ip; if (fin->fin_v == 6) { ip6->ip6_flow = ((ip6_t *)fin->fin_ip)->ip6_flow; ip6->ip6_plen = htons(iclen - hlen); ip6->ip6_nxt = IPPROTO_ICMPV6; ip6->ip6_hlim = 0; ip6->ip6_src = dst6.in6; ip6->ip6_dst = fin->fin_src6.in6; if (xtra > 0) bcopy((char *)fin->fin_ip + ohlen, (char *)&icmp->icmp_ip + ohlen, xtra); icmp->icmp_cksum = in6_cksum(m, IPPROTO_ICMPV6, sizeof(*ip6), iclen - hlen); } else #endif { ip->ip_p = IPPROTO_ICMP; ip->ip_src.s_addr = dst4.s_addr; ip->ip_dst.s_addr = fin->fin_saddr; if (xtra > 0) bcopy((char *)fin->fin_ip + ohlen, (char *)&icmp->icmp_ip + ohlen, xtra); icmp->icmp_cksum = ipf_cksum((u_short *)icmp, sizeof(*icmp) + 8); ip->ip_len = htons(iclen); ip->ip_p = IPPROTO_ICMP; } err = ipf_send_ip(fin, m); return err; } /* * m0 - pointer to mbuf where the IP packet starts * mpp - pointer to the mbuf pointer that is the start of the mbuf chain */ int ipf_fastroute(m0, mpp, fin, fdp) mb_t *m0, **mpp; fr_info_t *fin; frdest_t *fdp; { register struct ip *ip, *mhip; register struct mbuf *m = *mpp; int len, off, error = 0, hlen, code; struct ifnet *ifp, *sifp; struct sockaddr_in dst; struct nhop4_extended nh4; int has_nhop = 0; u_long fibnum = 0; u_short ip_off; frdest_t node; frentry_t *fr; #ifdef M_WRITABLE /* * HOT FIX/KLUDGE: * * If the mbuf we're about to send is not writable (because of * a cluster reference, for example) we'll need to make a copy * of it since this routine modifies the contents. * * If you have non-crappy network hardware that can transmit data * from the mbuf, rather than making a copy, this is gonna be a * problem. */ if (M_WRITABLE(m) == 0) { m0 = m_dup(m, M_NOWAIT); if (m0 != NULL) { FREE_MB_T(m); m = m0; *mpp = m; } else { error = ENOBUFS; FREE_MB_T(m); goto done; } } #endif #ifdef USE_INET6 if (fin->fin_v == 6) { /* * currently "to " and "to :ip#" are not supported * for IPv6 */ return ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL); } #endif hlen = fin->fin_hlen; ip = mtod(m0, struct ip *); ifp = NULL; /* * Route packet. */ bzero(&dst, sizeof (dst)); dst.sin_family = AF_INET; dst.sin_addr = ip->ip_dst; dst.sin_len = sizeof(dst); fr = fin->fin_fr; if ((fr != NULL) && !(fr->fr_flags & FR_KEEPSTATE) && (fdp != NULL) && (fdp->fd_type == FRD_DSTLIST)) { if (ipf_dstlist_select_node(fin, fdp->fd_ptr, NULL, &node) == 0) fdp = &node; } if (fdp != NULL) ifp = fdp->fd_ptr; else ifp = fin->fin_ifp; if ((ifp == NULL) && ((fr == NULL) || !(fr->fr_flags & FR_FASTROUTE))) { error = -2; goto bad; } if ((fdp != NULL) && (fdp->fd_ip.s_addr != 0)) dst.sin_addr = fdp->fd_ip; fibnum = M_GETFIB(m0); if (fib4_lookup_nh_ext(fibnum, dst.sin_addr, NHR_REF, 0, &nh4) != 0) { if (in_localaddr(ip->ip_dst)) error = EHOSTUNREACH; else error = ENETUNREACH; goto bad; } has_nhop = 1; if (ifp == NULL) ifp = nh4.nh_ifp; if (nh4.nh_flags & NHF_GATEWAY) dst.sin_addr = nh4.nh_addr; /* * For input packets which are being "fastrouted", they won't * go back through output filtering and miss their chance to get * NAT'd and counted. Duplicated packets aren't considered to be * part of the normal packet stream, so do not NAT them or pass * them through stateful checking, etc. 
*/ if ((fdp != &fr->fr_dif) && (fin->fin_out == 0)) { sifp = fin->fin_ifp; fin->fin_ifp = ifp; fin->fin_out = 1; (void) ipf_acctpkt(fin, NULL); fin->fin_fr = NULL; if (!fr || !(fr->fr_flags & FR_RETMASK)) { u_32_t pass; (void) ipf_state_check(fin, &pass); } switch (ipf_nat_checkout(fin, NULL)) { case 0 : break; case 1 : ip->ip_sum = 0; break; case -1 : error = -1; goto bad; break; } fin->fin_ifp = sifp; fin->fin_out = 0; } else ip->ip_sum = 0; /* * If small enough for interface, can just send directly. */ if (ntohs(ip->ip_len) <= ifp->if_mtu) { if (!ip->ip_sum) ip->ip_sum = in_cksum(m, hlen); error = (*ifp->if_output)(ifp, m, (struct sockaddr *)&dst, NULL ); goto done; } /* * Too large for interface; fragment if possible. * Must be able to put at least 8 bytes per fragment. */ ip_off = ntohs(ip->ip_off); if (ip_off & IP_DF) { error = EMSGSIZE; goto bad; } len = (ifp->if_mtu - hlen) &~ 7; if (len < 8) { error = EMSGSIZE; goto bad; } { int mhlen, firstlen = len; struct mbuf **mnext = &m->m_act; /* * Loop through length of segment after first fragment, * make new header and copy data of each part and link onto chain. */ m0 = m; mhlen = sizeof (struct ip); for (off = hlen + len; off < ntohs(ip->ip_len); off += len) { #ifdef MGETHDR MGETHDR(m, M_NOWAIT, MT_HEADER); #else MGET(m, M_NOWAIT, MT_HEADER); #endif if (m == NULL) { m = m0; error = ENOBUFS; goto bad; } m->m_data += max_linkhdr; mhip = mtod(m, struct ip *); bcopy((char *)ip, (char *)mhip, sizeof(*ip)); if (hlen > sizeof (struct ip)) { mhlen = ip_optcopy(ip, mhip) + sizeof (struct ip); IP_HL_A(mhip, mhlen >> 2); } m->m_len = mhlen; mhip->ip_off = ((off - hlen) >> 3) + ip_off; if (off + len >= ntohs(ip->ip_len)) len = ntohs(ip->ip_len) - off; else mhip->ip_off |= IP_MF; mhip->ip_len = htons((u_short)(len + mhlen)); *mnext = m; m->m_next = m_copym(m0, off, len, M_NOWAIT); if (m->m_next == 0) { error = ENOBUFS; /* ??? */ goto sendorfree; } m->m_pkthdr.len = mhlen + len; m->m_pkthdr.rcvif = NULL; mhip->ip_off = htons((u_short)mhip->ip_off); mhip->ip_sum = 0; mhip->ip_sum = in_cksum(m, mhlen); mnext = &m->m_act; } /* * Update first fragment by trimming what's been copied out * and updating header, then send each fragment (in order). 
*/ m_adj(m0, hlen + firstlen - ip->ip_len); ip->ip_len = htons((u_short)(hlen + firstlen)); ip->ip_off = htons((u_short)IP_MF); ip->ip_sum = 0; ip->ip_sum = in_cksum(m0, hlen); sendorfree: for (m = m0; m; m = m0) { m0 = m->m_act; m->m_act = 0; if (error == 0) error = (*ifp->if_output)(ifp, m, (struct sockaddr *)&dst, NULL ); else FREE_MB_T(m); } } done: if (!error) V_ipfmain.ipf_frouteok[0]++; else V_ipfmain.ipf_frouteok[1]++; if (has_nhop) fib4_free_nh_ext(fibnum, &nh4); return 0; bad: if (error == EMSGSIZE) { sifp = fin->fin_ifp; code = fin->fin_icode; fin->fin_icode = ICMP_UNREACH_NEEDFRAG; fin->fin_ifp = ifp; (void) ipf_send_icmp_err(ICMP_UNREACH, fin, 1); fin->fin_ifp = sifp; fin->fin_icode = code; } FREE_MB_T(m); goto done; } int ipf_verifysrc(fin) fr_info_t *fin; { struct nhop4_basic nh4; if (fib4_lookup_nh_basic(0, fin->fin_src, 0, 0, &nh4) != 0) return (0); return (fin->fin_ifp == nh4.nh_ifp); } /* * return the first IP Address associated with an interface */ int ipf_ifpaddr(softc, v, atype, ifptr, inp, inpmask) ipf_main_softc_t *softc; int v, atype; void *ifptr; i6addr_t *inp, *inpmask; { #ifdef USE_INET6 struct in6_addr *inp6 = NULL; #endif struct sockaddr *sock, *mask; struct sockaddr_in *sin; struct ifaddr *ifa; struct ifnet *ifp; if ((ifptr == NULL) || (ifptr == (void *)-1)) return -1; sin = NULL; ifp = ifptr; if (v == 4) inp->in4.s_addr = 0; #ifdef USE_INET6 else if (v == 6) bzero((char *)inp, sizeof(*inp)); #endif ifa = CK_STAILQ_FIRST(&ifp->if_addrhead); sock = ifa->ifa_addr; while (sock != NULL && ifa != NULL) { sin = (struct sockaddr_in *)sock; if ((v == 4) && (sin->sin_family == AF_INET)) break; #ifdef USE_INET6 if ((v == 6) && (sin->sin_family == AF_INET6)) { inp6 = &((struct sockaddr_in6 *)sin)->sin6_addr; if (!IN6_IS_ADDR_LINKLOCAL(inp6) && !IN6_IS_ADDR_LOOPBACK(inp6)) break; } #endif ifa = CK_STAILQ_NEXT(ifa, ifa_link); if (ifa != NULL) sock = ifa->ifa_addr; } if (ifa == NULL || sin == NULL) return -1; mask = ifa->ifa_netmask; if (atype == FRI_BROADCAST) sock = ifa->ifa_broadaddr; else if (atype == FRI_PEERADDR) sock = ifa->ifa_dstaddr; if (sock == NULL) return -1; #ifdef USE_INET6 if (v == 6) { return ipf_ifpfillv6addr(atype, (struct sockaddr_in6 *)sock, (struct sockaddr_in6 *)mask, inp, inpmask); } #endif return ipf_ifpfillv4addr(atype, (struct sockaddr_in *)sock, (struct sockaddr_in *)mask, &inp->in4, &inpmask->in4); } u_32_t ipf_newisn(fin) fr_info_t *fin; { u_32_t newiss; newiss = arc4random(); return newiss; } INLINE int ipf_checkv4sum(fin) fr_info_t *fin; { #ifdef CSUM_DATA_VALID int manual = 0; u_short sum; ip_t *ip; mb_t *m; if ((fin->fin_flx & FI_NOCKSUM) != 0) return 0; if ((fin->fin_flx & FI_SHORT) != 0) return 1; if (fin->fin_cksum != FI_CK_NEEDED) return (fin->fin_cksum > FI_CK_NEEDED) ? 
0 : -1; m = fin->fin_m; if (m == NULL) { manual = 1; goto skipauto; } ip = fin->fin_ip; if ((m->m_pkthdr.csum_flags & (CSUM_IP_CHECKED|CSUM_IP_VALID)) == CSUM_IP_CHECKED) { fin->fin_cksum = FI_CK_BAD; fin->fin_flx |= FI_BAD; DT2(ipf_fi_bad_checkv4sum_csum_ip_checked, fr_info_t *, fin, u_int, m->m_pkthdr.csum_flags & (CSUM_IP_CHECKED|CSUM_IP_VALID)); return -1; } if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID) { /* Depending on the driver, UDP may have zero checksum */ if (fin->fin_p == IPPROTO_UDP && (fin->fin_flx & (FI_FRAG|FI_SHORT|FI_BAD)) == 0) { udphdr_t *udp = fin->fin_dp; if (udp->uh_sum == 0) { /* * we're good no matter what the hardware * checksum flags and csum_data say (handling * of csum_data for zero UDP checksum is not * consistent across all drivers) */ fin->fin_cksum = 1; return 0; } } if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR) sum = m->m_pkthdr.csum_data; else sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htonl(m->m_pkthdr.csum_data + fin->fin_dlen + fin->fin_p)); sum ^= 0xffff; if (sum != 0) { fin->fin_cksum = FI_CK_BAD; fin->fin_flx |= FI_BAD; DT2(ipf_fi_bad_checkv4sum_sum, fr_info_t *, fin, u_int, sum); } else { fin->fin_cksum = FI_CK_SUMOK; return 0; } } else { if (m->m_pkthdr.csum_flags == CSUM_DELAY_DATA) { fin->fin_cksum = FI_CK_L4FULL; return 0; } else if (m->m_pkthdr.csum_flags == CSUM_TCP || m->m_pkthdr.csum_flags == CSUM_UDP) { fin->fin_cksum = FI_CK_L4PART; return 0; } else if (m->m_pkthdr.csum_flags == CSUM_IP) { fin->fin_cksum = FI_CK_L4PART; return 0; } else { manual = 1; } } skipauto: if (manual != 0) { if (ipf_checkl4sum(fin) == -1) { fin->fin_flx |= FI_BAD; DT2(ipf_fi_bad_checkv4sum_manual, fr_info_t *, fin, u_int, manual); return -1; } } #else if (ipf_checkl4sum(fin) == -1) { fin->fin_flx |= FI_BAD; DT2(ipf_fi_bad_checkv4sum_checkl4sum, fr_info_t *, fin, u_int, -1); return -1; } #endif return 0; } #ifdef USE_INET6 INLINE int ipf_checkv6sum(fin) fr_info_t *fin; { if ((fin->fin_flx & FI_NOCKSUM) != 0) { DT(ipf_checkv6sum_fi_nocksum); return 0; } if ((fin->fin_flx & FI_SHORT) != 0) { DT(ipf_checkv6sum_fi_short); return 1; } if (fin->fin_cksum != FI_CK_NEEDED) { DT(ipf_checkv6sum_fi_ck_needed); return (fin->fin_cksum > FI_CK_NEEDED) ? 0 : -1; } if (ipf_checkl4sum(fin) == -1) { fin->fin_flx |= FI_BAD; DT2(ipf_fi_bad_checkv6sum_checkl4sum, fr_info_t *, fin, u_int, -1); return -1; } return 0; } #endif /* USE_INET6 */ size_t mbufchainlen(m0) struct mbuf *m0; - { +{ size_t len; if ((m0->m_flags & M_PKTHDR) != 0) { len = m0->m_pkthdr.len; } else { struct mbuf *m; for (m = m0, len = 0; m != NULL; m = m->m_next) len += m->m_len; } return len; } /* ------------------------------------------------------------------------ */ /* Function: ipf_pullup */ /* Returns: NULL == pullup failed, else pointer to protocol header */ /* Parameters: xmin(I)- pointer to buffer where data packet starts */ /* fin(I) - pointer to packet information */ /* len(I) - number of bytes to pullup */ /* */ /* Attempt to move at least len bytes (from the start of the buffer) into a */ /* single buffer for ease of access. Operating system native functions are */ /* used to manage buffers - if necessary. If the entire packet ends up in */ /* a single buffer, set the FI_COALESCE flag even though ipf_coalesce() has */ /* not been called. Both fin_ip and fin_dp are updated before exiting _IF_ */ /* and ONLY if the pullup succeeds. */ /* */ /* We assume that 'xmin' is a pointer to a buffer that is part of the chain */ /* of buffers that starts at *fin->fin_mp. 
*/ /* ------------------------------------------------------------------------ */ void * ipf_pullup(xmin, fin, len) mb_t *xmin; fr_info_t *fin; int len; { int dpoff, ipoff; mb_t *m = xmin; char *ip; if (m == NULL) return NULL; ip = (char *)fin->fin_ip; if ((fin->fin_flx & FI_COALESCE) != 0) return ip; ipoff = fin->fin_ipoff; if (fin->fin_dp != NULL) dpoff = (char *)fin->fin_dp - (char *)ip; else dpoff = 0; if (M_LEN(m) < len) { mb_t *n = *fin->fin_mp; /* * Assume that M_PKTHDR is set and just work with what is left * rather than check.. * Should not make any real difference, anyway. */ if (m != n) { /* * Record the mbuf that points to the mbuf that we're * about to go to work on so that we can update the * m_next appropriately later. */ for (; n->m_next != m; n = n->m_next) ; } else { n = NULL; } #ifdef MHLEN if (len > MHLEN) #else if (len > MLEN) #endif { #ifdef HAVE_M_PULLDOWN if (m_pulldown(m, 0, len, NULL) == NULL) m = NULL; #else FREE_MB_T(*fin->fin_mp); m = NULL; n = NULL; #endif } else { m = m_pullup(m, len); } if (n != NULL) n->m_next = m; if (m == NULL) { /* * When n is non-NULL, it indicates that m pointed to * a sub-chain (tail) of the mbuf and that the head * of this chain has not yet been free'd. */ if (n != NULL) { FREE_MB_T(*fin->fin_mp); } *fin->fin_mp = NULL; fin->fin_m = NULL; return NULL; } if (n == NULL) *fin->fin_mp = m; while (M_LEN(m) == 0) { m = m->m_next; } fin->fin_m = m; ip = MTOD(m, char *) + ipoff; fin->fin_ip = (ip_t *)ip; if (fin->fin_dp != NULL) fin->fin_dp = (char *)fin->fin_ip + dpoff; if (fin->fin_fraghdr != NULL) fin->fin_fraghdr = (char *)ip + ((char *)fin->fin_fraghdr - (char *)fin->fin_ip); } if (len == fin->fin_plen) fin->fin_flx |= FI_COALESCE; return ip; } int ipf_inject(fin, m) fr_info_t *fin; mb_t *m; { int error = 0; if (fin->fin_out == 0) { netisr_dispatch(NETISR_IP, m); } else { fin->fin_ip->ip_len = ntohs(fin->fin_ip->ip_len); fin->fin_ip->ip_off = ntohs(fin->fin_ip->ip_off); error = ip_output(m, NULL, NULL, IP_FORWARDING, NULL, NULL); } return error; } VNET_DEFINE_STATIC(pfil_hook_t, ipf_inet_hook); VNET_DEFINE_STATIC(pfil_hook_t, ipf_inet6_hook); #define V_ipf_inet_hook VNET(ipf_inet_hook) #define V_ipf_inet6_hook VNET(ipf_inet6_hook) int ipf_pfil_unhook(void) { pfil_remove_hook(V_ipf_inet_hook); #ifdef USE_INET6 pfil_remove_hook(V_ipf_inet6_hook); #endif return (0); } int ipf_pfil_hook(void) { struct pfil_hook_args pha; struct pfil_link_args pla; int error, error6; pha.pa_version = PFIL_VERSION; pha.pa_flags = PFIL_IN | PFIL_OUT; pha.pa_modname = "ipfilter"; pha.pa_rulname = "default-ip4"; pha.pa_func = ipf_check_wrapper; pha.pa_ruleset = NULL; pha.pa_type = PFIL_TYPE_IP4; V_ipf_inet_hook = pfil_add_hook(&pha); #ifdef USE_INET6 pha.pa_rulname = "default-ip6"; pha.pa_func = ipf_check_wrapper6; pha.pa_type = PFIL_TYPE_IP6; V_ipf_inet6_hook = pfil_add_hook(&pha); #endif pla.pa_version = PFIL_VERSION; pla.pa_flags = PFIL_IN | PFIL_OUT | PFIL_HEADPTR | PFIL_HOOKPTR; pla.pa_head = V_inet_pfil_head; pla.pa_hook = V_ipf_inet_hook; error = pfil_link(&pla); error6 = 0; #ifdef USE_INET6 pla.pa_head = V_inet6_pfil_head; pla.pa_hook = V_ipf_inet6_hook; error6 = pfil_link(&pla); #endif if (error || error6) error = ENODEV; else error = 0; return (error); } void ipf_event_reg(void) { ipf_arrivetag = EVENTHANDLER_REGISTER(ifnet_arrival_event, \ ipf_ifevent, NULL, \ EVENTHANDLER_PRI_ANY); ipf_departtag = EVENTHANDLER_REGISTER(ifnet_departure_event, \ ipf_ifevent, NULL, \ EVENTHANDLER_PRI_ANY); #if 0 ipf_clonetag = EVENTHANDLER_REGISTER(if_clone_event, 
ipf_ifevent, \ NULL, EVENTHANDLER_PRI_ANY); #endif } void ipf_event_dereg(void) { if (ipf_arrivetag != NULL) { EVENTHANDLER_DEREGISTER(ifnet_arrival_event, ipf_arrivetag); } if (ipf_departtag != NULL) { EVENTHANDLER_DEREGISTER(ifnet_departure_event, ipf_departtag); } #if 0 if (ipf_clonetag != NULL) { EVENTHANDLER_DEREGISTER(if_clone_event, ipf_clonetag); } #endif } u_32_t ipf_random() { return arc4random(); } u_int ipf_pcksum(fin, hlen, sum) fr_info_t *fin; int hlen; u_int sum; { struct mbuf *m; u_int sum2; int off; m = fin->fin_m; off = (char *)fin->fin_dp - (char *)fin->fin_ip; m->m_data += hlen; m->m_len -= hlen; sum2 = in_cksum(fin->fin_m, fin->fin_plen - off); m->m_len += hlen; m->m_data -= hlen; /* * Both sum and sum2 are partial sums, so combine them together. */ sum += ~sum2 & 0xffff; while (sum > 0xffff) sum = (sum & 0xffff) + (sum >> 16); sum2 = ~sum & 0xffff; return sum2; } Index: user/ngie/bug-237403/sys/contrib/ipfilter =================================================================== --- user/ngie/bug-237403/sys/contrib/ipfilter (revision 346925) +++ user/ngie/bug-237403/sys/contrib/ipfilter (revision 346926) Property changes on: user/ngie/bug-237403/sys/contrib/ipfilter ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head/sys/contrib/ipfilter:r346444-346925 Index: user/ngie/bug-237403/sys/dev/acpi_support/acpi_ibm.c =================================================================== --- user/ngie/bug-237403/sys/dev/acpi_support/acpi_ibm.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/acpi_support/acpi_ibm.c (revision 346926) @@ -1,1380 +1,1412 @@ /*- * Copyright (c) 2004 Takanori Watanabe * Copyright (c) 2005 Markus Brueffer * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * Driver for extra ACPI-controlled gadgets found on IBM ThinkPad laptops. * Inspired by the ibm-acpi and tpb projects which implement these features * on Linux. 
* * acpi-ibm: * tpb: */ #include "opt_acpi.h" #include #include #include #include #include #include #include #include "acpi_if.h" #include #include #include #include #include #include #include #define _COMPONENT ACPI_OEM ACPI_MODULE_NAME("IBM") /* Internal methods */ #define ACPI_IBM_METHOD_EVENTS 1 #define ACPI_IBM_METHOD_EVENTMASK 2 #define ACPI_IBM_METHOD_HOTKEY 3 #define ACPI_IBM_METHOD_BRIGHTNESS 4 #define ACPI_IBM_METHOD_VOLUME 5 #define ACPI_IBM_METHOD_MUTE 6 #define ACPI_IBM_METHOD_THINKLIGHT 7 #define ACPI_IBM_METHOD_BLUETOOTH 8 #define ACPI_IBM_METHOD_WLAN 9 #define ACPI_IBM_METHOD_FANSPEED 10 #define ACPI_IBM_METHOD_FANLEVEL 11 #define ACPI_IBM_METHOD_FANSTATUS 12 #define ACPI_IBM_METHOD_THERMAL 13 #define ACPI_IBM_METHOD_HANDLEREVENTS 14 #define ACPI_IBM_METHOD_MIC_LED 15 /* Hotkeys/Buttons */ #define IBM_RTC_HOTKEY1 0x64 #define IBM_RTC_MASK_HOME (1 << 0) #define IBM_RTC_MASK_SEARCH (1 << 1) #define IBM_RTC_MASK_MAIL (1 << 2) #define IBM_RTC_MASK_WLAN (1 << 5) #define IBM_RTC_HOTKEY2 0x65 #define IBM_RTC_MASK_THINKPAD (1 << 3) #define IBM_RTC_MASK_ZOOM (1 << 5) #define IBM_RTC_MASK_VIDEO (1 << 6) #define IBM_RTC_MASK_HIBERNATE (1 << 7) #define IBM_RTC_THINKLIGHT 0x66 #define IBM_RTC_MASK_THINKLIGHT (1 << 4) #define IBM_RTC_SCREENEXPAND 0x67 #define IBM_RTC_MASK_SCREENEXPAND (1 << 5) #define IBM_RTC_BRIGHTNESS 0x6c #define IBM_RTC_MASK_BRIGHTNESS (1 << 5) #define IBM_RTC_VOLUME 0x6e #define IBM_RTC_MASK_VOLUME (1 << 7) /* Embedded Controller registers */ #define IBM_EC_BRIGHTNESS 0x31 #define IBM_EC_MASK_BRI 0x7 #define IBM_EC_VOLUME 0x30 #define IBM_EC_MASK_VOL 0xf #define IBM_EC_MASK_MUTE (1 << 6) #define IBM_EC_FANSTATUS 0x2F #define IBM_EC_MASK_FANLEVEL 0x3f #define IBM_EC_MASK_FANDISENGAGED (1 << 6) #define IBM_EC_MASK_FANSTATUS (1 << 7) #define IBM_EC_FANSPEED 0x84 /* CMOS Commands */ #define IBM_CMOS_VOLUME_DOWN 0 #define IBM_CMOS_VOLUME_UP 1 #define IBM_CMOS_VOLUME_MUTE 2 #define IBM_CMOS_BRIGHTNESS_UP 4 #define IBM_CMOS_BRIGHTNESS_DOWN 5 /* ACPI methods */ #define IBM_NAME_KEYLIGHT "KBLT" #define IBM_NAME_WLAN_BT_GET "GBDC" #define IBM_NAME_WLAN_BT_SET "SBDC" #define IBM_NAME_MASK_BT (1 << 1) #define IBM_NAME_MASK_WLAN (1 << 2) #define IBM_NAME_THERMAL_GET "TMP7" #define IBM_NAME_THERMAL_UPDT "UPDT" #define IBM_NAME_EVENTS_STATUS_GET "DHKC" #define IBM_NAME_EVENTS_MASK_GET "DHKN" #define IBM_NAME_EVENTS_STATUS_SET "MHKC" #define IBM_NAME_EVENTS_MASK_SET "MHKM" #define IBM_NAME_EVENTS_GET "MHKP" #define IBM_NAME_EVENTS_AVAILMASK "MHKA" /* Event Code */ #define IBM_EVENT_LCD_BACKLIGHT 0x03 #define IBM_EVENT_SUSPEND_TO_RAM 0x04 #define IBM_EVENT_BLUETOOTH 0x05 #define IBM_EVENT_SCREEN_EXPAND 0x07 #define IBM_EVENT_SUSPEND_TO_DISK 0x0c #define IBM_EVENT_BRIGHTNESS_UP 0x10 #define IBM_EVENT_BRIGHTNESS_DOWN 0x11 #define IBM_EVENT_THINKLIGHT 0x12 #define IBM_EVENT_ZOOM 0x14 #define IBM_EVENT_VOLUME_UP 0x15 #define IBM_EVENT_VOLUME_DOWN 0x16 #define IBM_EVENT_MUTE 0x17 #define IBM_EVENT_ACCESS_IBM_BUTTON 0x18 #define ABS(x) (((x) < 0)? 
-(x) : (x)) struct acpi_ibm_softc { device_t dev; ACPI_HANDLE handle; /* Embedded controller */ device_t ec_dev; ACPI_HANDLE ec_handle; /* CMOS */ ACPI_HANDLE cmos_handle; /* Fan status */ ACPI_HANDLE fan_handle; int fan_levels; /* Keylight commands and states */ ACPI_HANDLE light_handle; int light_cmd_on; int light_cmd_off; int light_val; int light_get_supported; int light_set_supported; /* led(4) interface */ struct cdev *led_dev; int led_busy; int led_state; /* Mic led handle */ ACPI_HANDLE mic_led_handle; int mic_led_state; int wlan_bt_flags; int thermal_updt_supported; unsigned int events_availmask; unsigned int events_initialmask; int events_mask_supported; int events_enable; unsigned int handler_events; struct sysctl_ctx_list *sysctl_ctx; struct sysctl_oid *sysctl_tree; }; static struct { char *name; int method; char *description; int flag_rdonly; } acpi_ibm_sysctls[] = { { .name = "events", .method = ACPI_IBM_METHOD_EVENTS, .description = "ACPI events enable", }, { .name = "eventmask", .method = ACPI_IBM_METHOD_EVENTMASK, .description = "ACPI eventmask", }, { .name = "hotkey", .method = ACPI_IBM_METHOD_HOTKEY, .description = "Key Status", .flag_rdonly = 1 }, { .name = "lcd_brightness", .method = ACPI_IBM_METHOD_BRIGHTNESS, .description = "LCD Brightness", }, { .name = "volume", .method = ACPI_IBM_METHOD_VOLUME, .description = "Volume", }, { .name = "mute", .method = ACPI_IBM_METHOD_MUTE, .description = "Mute", }, { .name = "thinklight", .method = ACPI_IBM_METHOD_THINKLIGHT, .description = "Thinklight enable", }, { .name = "bluetooth", .method = ACPI_IBM_METHOD_BLUETOOTH, .description = "Bluetooth enable", }, { .name = "wlan", .method = ACPI_IBM_METHOD_WLAN, .description = "WLAN enable", .flag_rdonly = 1 }, { .name = "fan_speed", .method = ACPI_IBM_METHOD_FANSPEED, .description = "Fan speed", .flag_rdonly = 1 }, { .name = "fan_level", .method = ACPI_IBM_METHOD_FANLEVEL, .description = "Fan level", }, { .name = "fan", .method = ACPI_IBM_METHOD_FANSTATUS, .description = "Fan enable", }, { .name = "mic_led", .method = ACPI_IBM_METHOD_MIC_LED, .description = "Mic led", }, { NULL, 0, NULL, 0 } }; /* * Per-model default list of event mask. 
*/ #define ACPI_IBM_HKEY_RFKILL_MASK (1 << 4) #define ACPI_IBM_HKEY_DSWITCH_MASK (1 << 6) #define ACPI_IBM_HKEY_BRIGHTNESS_UP_MASK (1 << 15) #define ACPI_IBM_HKEY_BRIGHTNESS_DOWN_MASK (1 << 16) #define ACPI_IBM_HKEY_SEARCH_MASK (1 << 18) #define ACPI_IBM_HKEY_MICMUTE_MASK (1 << 26) #define ACPI_IBM_HKEY_SETTINGS_MASK (1 << 28) #define ACPI_IBM_HKEY_VIEWOPEN_MASK (1 << 30) #define ACPI_IBM_HKEY_VIEWALL_MASK (1 << 31) struct acpi_ibm_models { const char *maker; const char *product; uint32_t eventmask; } acpi_ibm_models[] = { { "LENOVO", "20BSCTO1WW", ACPI_IBM_HKEY_RFKILL_MASK | ACPI_IBM_HKEY_DSWITCH_MASK | ACPI_IBM_HKEY_BRIGHTNESS_UP_MASK | ACPI_IBM_HKEY_BRIGHTNESS_DOWN_MASK | ACPI_IBM_HKEY_SEARCH_MASK | ACPI_IBM_HKEY_MICMUTE_MASK | ACPI_IBM_HKEY_SETTINGS_MASK | ACPI_IBM_HKEY_VIEWOPEN_MASK | ACPI_IBM_HKEY_VIEWALL_MASK } }; ACPI_SERIAL_DECL(ibm, "ACPI IBM extras"); static int acpi_ibm_probe(device_t dev); static int acpi_ibm_attach(device_t dev); static int acpi_ibm_detach(device_t dev); static int acpi_ibm_resume(device_t dev); static void ibm_led(void *softc, int onoff); static void ibm_led_task(struct acpi_ibm_softc *sc, int pending __unused); static int acpi_ibm_sysctl(SYSCTL_HANDLER_ARGS); static int acpi_ibm_sysctl_init(struct acpi_ibm_softc *sc, int method); static int acpi_ibm_sysctl_get(struct acpi_ibm_softc *sc, int method); static int acpi_ibm_sysctl_set(struct acpi_ibm_softc *sc, int method, int val); static int acpi_ibm_eventmask_set(struct acpi_ibm_softc *sc, int val); static int acpi_ibm_thermal_sysctl(SYSCTL_HANDLER_ARGS); static int acpi_ibm_handlerevents_sysctl(SYSCTL_HANDLER_ARGS); static void acpi_ibm_notify(ACPI_HANDLE h, UINT32 notify, void *context); static int acpi_ibm_brightness_set(struct acpi_ibm_softc *sc, int arg); static int acpi_ibm_bluetooth_set(struct acpi_ibm_softc *sc, int arg); static int acpi_ibm_thinklight_set(struct acpi_ibm_softc *sc, int arg); static int acpi_ibm_volume_set(struct acpi_ibm_softc *sc, int arg); static int acpi_ibm_mute_set(struct acpi_ibm_softc *sc, int arg); static device_method_t acpi_ibm_methods[] = { /* Device interface */ DEVMETHOD(device_probe, acpi_ibm_probe), DEVMETHOD(device_attach, acpi_ibm_attach), DEVMETHOD(device_detach, acpi_ibm_detach), DEVMETHOD(device_resume, acpi_ibm_resume), DEVMETHOD_END }; static driver_t acpi_ibm_driver = { "acpi_ibm", acpi_ibm_methods, sizeof(struct acpi_ibm_softc), }; static devclass_t acpi_ibm_devclass; DRIVER_MODULE(acpi_ibm, acpi, acpi_ibm_driver, acpi_ibm_devclass, 0, 0); MODULE_DEPEND(acpi_ibm, acpi, 1, 1, 1); -static char *ibm_ids[] = {"IBM0068", "LEN0068", NULL}; +static char *ibm_ids[] = {"IBM0068", "LEN0068", "LEN0268", NULL}; static void ibm_led(void *softc, int onoff) { struct acpi_ibm_softc* sc = (struct acpi_ibm_softc*) softc; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); if (sc->led_busy) return; sc->led_busy = 1; sc->led_state = onoff; AcpiOsExecute(OSL_NOTIFY_HANDLER, (void *)ibm_led_task, sc); } static void ibm_led_task(struct acpi_ibm_softc *sc, int pending __unused) { ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_BEGIN(ibm); acpi_ibm_sysctl_set(sc, ACPI_IBM_METHOD_THINKLIGHT, sc->led_state); ACPI_SERIAL_END(ibm); sc->led_busy = 0; } static int acpi_ibm_mic_led_set (struct acpi_ibm_softc *sc, int arg) { ACPI_OBJECT_LIST input; ACPI_OBJECT params[1]; ACPI_STATUS status; if (arg < 0 || arg > 1) return (EINVAL); if (sc->mic_led_handle) { params[0].Type = ACPI_TYPE_INTEGER; params[0].Integer.Value = 0; /* mic led: 0 off, 2 on */ if (arg == 1) 
params[0].Integer.Value = 2; input.Pointer = params; input.Count = 1; status = AcpiEvaluateObject (sc->handle, "MMTS", &input, NULL); if (ACPI_SUCCESS(status)) sc->mic_led_state = arg; return(status); } return (0); } static int acpi_ibm_probe(device_t dev) { int rv; if (acpi_disabled("ibm") || device_get_unit(dev) != 0) return (ENXIO); rv = ACPI_ID_PROBE(device_get_parent(dev), dev, ibm_ids, NULL); if (rv <= 0) device_set_desc(dev, "IBM ThinkPad ACPI Extras"); return (rv); } static int acpi_ibm_attach(device_t dev) { int i; + int hkey; struct acpi_ibm_softc *sc; char *maker, *product; - devclass_t ec_devclass; + ACPI_OBJECT_LIST input; + ACPI_OBJECT params[1]; + ACPI_OBJECT out_obj; + ACPI_BUFFER result; + devclass_t ec_devclass; ACPI_FUNCTION_TRACE((char *)(uintptr_t) __func__); sc = device_get_softc(dev); sc->dev = dev; sc->handle = acpi_get_handle(dev); /* Look for the first embedded controller */ if (!(ec_devclass = devclass_find ("acpi_ec"))) { if (bootverbose) device_printf(dev, "Couldn't find acpi_ec devclass\n"); return (EINVAL); } if (!(sc->ec_dev = devclass_get_device(ec_devclass, 0))) { if (bootverbose) device_printf(dev, "Couldn't find acpi_ec device\n"); return (EINVAL); } sc->ec_handle = acpi_get_handle(sc->ec_dev); /* Get the sysctl tree */ sc->sysctl_ctx = device_get_sysctl_ctx(dev); sc->sysctl_tree = device_get_sysctl_tree(dev); /* Look for event mask and hook up the nodes */ sc->events_mask_supported = ACPI_SUCCESS(acpi_GetInteger(sc->handle, IBM_NAME_EVENTS_MASK_GET, &sc->events_initialmask)); if (sc->events_mask_supported) { SYSCTL_ADD_UINT(sc->sysctl_ctx, SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "initialmask", CTLFLAG_RD, &sc->events_initialmask, 0, "Initial eventmask"); - /* The availmask is the bitmask of supported events */ - if (ACPI_FAILURE(acpi_GetInteger(sc->handle, - IBM_NAME_EVENTS_AVAILMASK, &sc->events_availmask))) + if (ACPI_SUCCESS (acpi_GetInteger(sc->handle, "MHKV", &hkey))) { + device_printf(dev, "Firmware version is 0x%X\n", hkey); + switch(hkey >> 8) + { + case 1: + /* The availmask is the bitmask of supported events */ + if (ACPI_FAILURE(acpi_GetInteger(sc->handle, + IBM_NAME_EVENTS_AVAILMASK, &sc->events_availmask))) + sc->events_availmask = 0xffffffff; + break; + + case 2: + result.Length = sizeof(out_obj); + result.Pointer = &out_obj; + params[0].Type = ACPI_TYPE_INTEGER; + params[0].Integer.Value = 1; + input.Pointer = params; + input.Count = 1; + + sc->events_availmask = 0xffffffff; + + if (ACPI_SUCCESS(AcpiEvaluateObject (sc->handle, + IBM_NAME_EVENTS_AVAILMASK, &input, &result))) + sc->events_availmask = out_obj.Integer.Value; + break; + default: + device_printf(dev, "Unknown firmware version 0x%x\n", hkey); + break; + } + } else sc->events_availmask = 0xffffffff; SYSCTL_ADD_UINT(sc->sysctl_ctx, - SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, - "availmask", CTLFLAG_RD, - &sc->events_availmask, 0, "Mask of supported events"); + SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, + "availmask", CTLFLAG_RD, + &sc->events_availmask, 0, "Mask of supported events"); } /* Hook up proc nodes */ for (int i = 0; acpi_ibm_sysctls[i].name != NULL; i++) { if (!acpi_ibm_sysctl_init(sc, acpi_ibm_sysctls[i].method)) continue; if (acpi_ibm_sysctls[i].flag_rdonly != 0) { SYSCTL_ADD_PROC(sc->sysctl_ctx, SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, acpi_ibm_sysctls[i].name, CTLTYPE_INT | CTLFLAG_RD, sc, i, acpi_ibm_sysctl, "I", acpi_ibm_sysctls[i].description); } else { SYSCTL_ADD_PROC(sc->sysctl_ctx, SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, acpi_ibm_sysctls[i].name, 
CTLTYPE_INT | CTLFLAG_RW, sc, i, acpi_ibm_sysctl, "I", acpi_ibm_sysctls[i].description); } } /* Hook up thermal node */ if (acpi_ibm_sysctl_init(sc, ACPI_IBM_METHOD_THERMAL)) { SYSCTL_ADD_PROC(sc->sysctl_ctx, SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "thermal", CTLTYPE_INT | CTLFLAG_RD, sc, 0, acpi_ibm_thermal_sysctl, "I", "Thermal zones"); } /* Hook up handlerevents node */ if (acpi_ibm_sysctl_init(sc, ACPI_IBM_METHOD_HANDLEREVENTS)) { SYSCTL_ADD_PROC(sc->sysctl_ctx, SYSCTL_CHILDREN(sc->sysctl_tree), OID_AUTO, "handlerevents", CTLTYPE_STRING | CTLFLAG_RW, sc, 0, acpi_ibm_handlerevents_sysctl, "I", "devd(8) events handled by acpi_ibm"); } /* Handle notifies */ AcpiInstallNotifyHandler(sc->handle, ACPI_DEVICE_NOTIFY, acpi_ibm_notify, dev); /* Hook up light to led(4) */ if (sc->light_set_supported) sc->led_dev = led_create_state(ibm_led, sc, "thinklight", (sc->light_val ? 1 : 0)); /* Enable per-model events. */ maker = kern_getenv("smbios.system.maker"); product = kern_getenv("smbios.system.product"); if (maker == NULL || product == NULL) goto nosmbios; for (i = 0; i < nitems(acpi_ibm_models); i++) { if (strcmp(maker, acpi_ibm_models[i].maker) == 0 && strcmp(product, acpi_ibm_models[i].product) == 0) { ACPI_SERIAL_BEGIN(ibm); acpi_ibm_sysctl_set(sc, ACPI_IBM_METHOD_EVENTMASK, acpi_ibm_models[i].eventmask); ACPI_SERIAL_END(ibm); } } nosmbios: freeenv(maker); freeenv(product); /* Enable events by default. */ ACPI_SERIAL_BEGIN(ibm); acpi_ibm_sysctl_set(sc, ACPI_IBM_METHOD_EVENTS, 1); ACPI_SERIAL_END(ibm); return (0); } static int acpi_ibm_detach(device_t dev) { ACPI_FUNCTION_TRACE((char *)(uintptr_t) __func__); struct acpi_ibm_softc *sc = device_get_softc(dev); /* Disable events and restore eventmask */ ACPI_SERIAL_BEGIN(ibm); acpi_ibm_sysctl_set(sc, ACPI_IBM_METHOD_EVENTS, 0); acpi_ibm_sysctl_set(sc, ACPI_IBM_METHOD_EVENTMASK, sc->events_initialmask); ACPI_SERIAL_END(ibm); AcpiRemoveNotifyHandler(sc->handle, ACPI_DEVICE_NOTIFY, acpi_ibm_notify); if (sc->led_dev != NULL) led_destroy(sc->led_dev); return (0); } static int acpi_ibm_resume(device_t dev) { struct acpi_ibm_softc *sc = device_get_softc(dev); ACPI_FUNCTION_TRACE((char *)(uintptr_t) __func__); ACPI_SERIAL_BEGIN(ibm); for (int i = 0; acpi_ibm_sysctls[i].name != NULL; i++) { int val; val = acpi_ibm_sysctl_get(sc, i); if (acpi_ibm_sysctls[i].flag_rdonly != 0) continue; acpi_ibm_sysctl_set(sc, i, val); } ACPI_SERIAL_END(ibm); /* The mic led does not turn back on when sysctl_set is called in the above loop */ acpi_ibm_mic_led_set(sc, sc->mic_led_state); return (0); } static int acpi_ibm_eventmask_set(struct acpi_ibm_softc *sc, int val) { ACPI_OBJECT arg[2]; ACPI_OBJECT_LIST args; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); args.Count = 2; args.Pointer = arg; arg[0].Type = ACPI_TYPE_INTEGER; arg[1].Type = ACPI_TYPE_INTEGER; for (int i = 0; i < 32; ++i) { arg[0].Integer.Value = i+1; arg[1].Integer.Value = (((1 << i) & val) != 0); status = AcpiEvaluateObject(sc->handle, IBM_NAME_EVENTS_MASK_SET, &args, NULL); if (ACPI_FAILURE(status)) return (status); } return (0); } static int acpi_ibm_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_ibm_softc *sc; int arg; int error = 0; int function; int method; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); sc = (struct acpi_ibm_softc *)oidp->oid_arg1; function = oidp->oid_arg2; method = acpi_ibm_sysctls[function].method; ACPI_SERIAL_BEGIN(ibm); arg = acpi_ibm_sysctl_get(sc, method); error = sysctl_handle_int(oidp, &arg, 0, req); /* Sanity check */ if (error 
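acpi_ibm_eventmask_set() above programs the firmware one event at a time: each of the 32 mask bits becomes an (event number, enable) pair, with event numbers counted from 1. The sketch below reproduces just that expansion in plain C; the helper name and the sample mask are made up.

#include <stdint.h>
#include <stdio.h>

/* Expand a 32-bit event mask the same way acpi_ibm_eventmask_set() does. */
static void
expand_eventmask(uint32_t mask)
{
	for (int i = 0; i < 32; i++)
		printf("event %2d -> %s\n", i + 1,
		    ((mask >> i) & 1) ? "enable" : "disable");
}

int
main(void)
{
	expand_eventmask(0x00000005);	/* example mask: events 1 and 3 */
	return (0);
}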
!= 0 || req->newptr == NULL) goto out; /* Update */ error = acpi_ibm_sysctl_set(sc, method, arg); out: ACPI_SERIAL_END(ibm); return (error); } static int acpi_ibm_sysctl_get(struct acpi_ibm_softc *sc, int method) { UINT64 val_ec; int val = 0, key; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); switch (method) { case ACPI_IBM_METHOD_EVENTS: acpi_GetInteger(sc->handle, IBM_NAME_EVENTS_STATUS_GET, &val); break; case ACPI_IBM_METHOD_EVENTMASK: if (sc->events_mask_supported) acpi_GetInteger(sc->handle, IBM_NAME_EVENTS_MASK_GET, &val); break; case ACPI_IBM_METHOD_HOTKEY: /* * Construct the hotkey as a bitmask as illustrated below. * Note that whenever a key was pressed, the respecting bit * toggles and nothing else changes. * +--+--+-+-+-+-+-+-+-+-+-+-+ * |11|10|9|8|7|6|5|4|3|2|1|0| * +--+--+-+-+-+-+-+-+-+-+-+-+ * | | | | | | | | | | | | * | | | | | | | | | | | +- Home Button * | | | | | | | | | | +--- Search Button * | | | | | | | | | +----- Mail Button * | | | | | | | | +------- Thinkpad Button * | | | | | | | +--------- Zoom (Fn + Space) * | | | | | | +----------- WLAN Button * | | | | | +------------- Video Button * | | | | +--------------- Hibernate Button * | | | +----------------- Thinklight Button * | | +------------------- Screen expand (Fn + F8) * | +--------------------- Brightness * +------------------------ Volume/Mute */ key = rtcin(IBM_RTC_HOTKEY1); val = (IBM_RTC_MASK_HOME | IBM_RTC_MASK_SEARCH | IBM_RTC_MASK_MAIL | IBM_RTC_MASK_WLAN) & key; key = rtcin(IBM_RTC_HOTKEY2); val |= (IBM_RTC_MASK_THINKPAD | IBM_RTC_MASK_VIDEO | IBM_RTC_MASK_HIBERNATE) & key; val |= (IBM_RTC_MASK_ZOOM & key) >> 1; key = rtcin(IBM_RTC_THINKLIGHT); val |= (IBM_RTC_MASK_THINKLIGHT & key) << 4; key = rtcin(IBM_RTC_SCREENEXPAND); val |= (IBM_RTC_MASK_THINKLIGHT & key) << 4; key = rtcin(IBM_RTC_BRIGHTNESS); val |= (IBM_RTC_MASK_BRIGHTNESS & key) << 5; key = rtcin(IBM_RTC_VOLUME); val |= (IBM_RTC_MASK_VOLUME & key) << 4; break; case ACPI_IBM_METHOD_BRIGHTNESS: ACPI_EC_READ(sc->ec_dev, IBM_EC_BRIGHTNESS, &val_ec, 1); val = val_ec & IBM_EC_MASK_BRI; break; case ACPI_IBM_METHOD_VOLUME: ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); val = val_ec & IBM_EC_MASK_VOL; break; case ACPI_IBM_METHOD_MUTE: ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); val = ((val_ec & IBM_EC_MASK_MUTE) == IBM_EC_MASK_MUTE); break; case ACPI_IBM_METHOD_THINKLIGHT: if (sc->light_get_supported) acpi_GetInteger(sc->ec_handle, IBM_NAME_KEYLIGHT, &val); else val = sc->light_val; break; case ACPI_IBM_METHOD_BLUETOOTH: acpi_GetInteger(sc->handle, IBM_NAME_WLAN_BT_GET, &val); sc->wlan_bt_flags = val; val = ((val & IBM_NAME_MASK_BT) != 0); break; case ACPI_IBM_METHOD_WLAN: acpi_GetInteger(sc->handle, IBM_NAME_WLAN_BT_GET, &val); sc->wlan_bt_flags = val; val = ((val & IBM_NAME_MASK_WLAN) != 0); break; case ACPI_IBM_METHOD_FANSPEED: if (sc->fan_handle) { if(ACPI_FAILURE(acpi_GetInteger(sc->fan_handle, NULL, &val))) val = -1; } else { ACPI_EC_READ(sc->ec_dev, IBM_EC_FANSPEED, &val_ec, 2); val = val_ec; } break; case ACPI_IBM_METHOD_FANLEVEL: /* * The IBM_EC_FANSTATUS register works as follows: * Bit 0-5 indicate the level at which the fan operates. Only * values between 0 and 7 have an effect. 
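The comment above documents the synthetic hotkey word assembled from several RTC/CMOS bytes, and it stresses that each bit toggles on a key press rather than reporting current state. The standalone sketch below shows how a consumer would use that property; the HK_* macros and the sample values are illustrative, not driver constants.

#include <stdint.h>
#include <stdio.h>

/* Bit positions follow the diagram in the driver comment above. */
#define	HK_HOME		(1u << 0)
#define	HK_THINKPAD	(1u << 3)
#define	HK_BRIGHTNESS	(1u << 10)

int
main(void)
{
	/* Made-up samples: compare against the previous reading, since
	 * a key press only toggles its bit. */
	uint32_t prev = 0x000, cur = 0x408;
	uint32_t changed = prev ^ cur;

	if (changed & HK_THINKPAD)
		printf("ThinkPad button toggled\n");
	if (changed & HK_BRIGHTNESS)
		printf("Brightness key toggled\n");
	return (0);
}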
Everything * above 7 is treated the same as level 7 * Bit 6 overrides the fan speed limit if set to 1 * Bit 7 indicates at which mode the fan operates: * manual (0) or automatic (1) */ if (!sc->fan_handle) { ACPI_EC_READ(sc->ec_dev, IBM_EC_FANSTATUS, &val_ec, 1); val = val_ec & IBM_EC_MASK_FANLEVEL; } break; case ACPI_IBM_METHOD_FANSTATUS: if (!sc->fan_handle) { ACPI_EC_READ(sc->ec_dev, IBM_EC_FANSTATUS, &val_ec, 1); val = (val_ec & IBM_EC_MASK_FANSTATUS) == IBM_EC_MASK_FANSTATUS; } else val = -1; break; case ACPI_IBM_METHOD_MIC_LED: if (sc->mic_led_handle) return sc->mic_led_state; else val = -1; break; } return (val); } static int acpi_ibm_sysctl_set(struct acpi_ibm_softc *sc, int method, int arg) { int val; UINT64 val_ec; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); switch (method) { case ACPI_IBM_METHOD_EVENTS: if (arg < 0 || arg > 1) return (EINVAL); status = acpi_SetInteger(sc->handle, IBM_NAME_EVENTS_STATUS_SET, arg); if (ACPI_FAILURE(status)) return (status); if (sc->events_mask_supported) return acpi_ibm_eventmask_set(sc, sc->events_availmask); break; case ACPI_IBM_METHOD_EVENTMASK: if (sc->events_mask_supported) return acpi_ibm_eventmask_set(sc, arg); break; case ACPI_IBM_METHOD_BRIGHTNESS: return acpi_ibm_brightness_set(sc, arg); break; case ACPI_IBM_METHOD_VOLUME: return acpi_ibm_volume_set(sc, arg); break; case ACPI_IBM_METHOD_MUTE: return acpi_ibm_mute_set(sc, arg); break; case ACPI_IBM_METHOD_MIC_LED: return acpi_ibm_mic_led_set (sc, arg); break; case ACPI_IBM_METHOD_THINKLIGHT: return acpi_ibm_thinklight_set(sc, arg); break; case ACPI_IBM_METHOD_BLUETOOTH: return acpi_ibm_bluetooth_set(sc, arg); break; case ACPI_IBM_METHOD_FANLEVEL: if (arg < 0 || arg > 7) return (EINVAL); if (!sc->fan_handle) { /* Read the current fanstatus */ ACPI_EC_READ(sc->ec_dev, IBM_EC_FANSTATUS, &val_ec, 1); val = val_ec & (~IBM_EC_MASK_FANLEVEL); return ACPI_EC_WRITE(sc->ec_dev, IBM_EC_FANSTATUS, val | arg, 1); } break; case ACPI_IBM_METHOD_FANSTATUS: if (arg < 0 || arg > 1) return (EINVAL); if (!sc->fan_handle) { /* Read the current fanstatus */ ACPI_EC_READ(sc->ec_dev, IBM_EC_FANSTATUS, &val_ec, 1); return ACPI_EC_WRITE(sc->ec_dev, IBM_EC_FANSTATUS, (arg == 1) ? 
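The fan status description above (bits 0-5 level, bit 6 override, bit 7 manual/automatic) is what the FANLEVEL setter relies on when it read-modify-writes the EC register. A minimal sketch of that update follows, with stand-in mask values rather than the driver's IBM_EC_MASK_* constants.

#include <stdint.h>
#include <stdio.h>

#define	FAN_LEVEL_MASK	0x3f	/* bits 0-5: level, 0-7 meaningful */
#define	FAN_OVERRIDE	0x40	/* bit 6: speed-limit override */
#define	FAN_AUTO	0x80	/* bit 7: 0 = manual, 1 = automatic */

/* Same read-modify-write shape as the ACPI_IBM_METHOD_FANLEVEL case. */
static uint8_t
fan_set_level(uint8_t reg, uint8_t level)
{
	return ((reg & (uint8_t)~FAN_LEVEL_MASK) | (level & FAN_LEVEL_MASK));
}

int
main(void)
{
	uint8_t reg = FAN_AUTO | 0x02;	/* example: automatic mode, level 2 */

	reg = fan_set_level(reg, 7);
	printf("mode=%s level=%u\n",
	    (reg & FAN_AUTO) ? "auto" : "manual", reg & FAN_LEVEL_MASK);
	return (0);
}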
(val_ec | IBM_EC_MASK_FANSTATUS) : (val_ec & (~IBM_EC_MASK_FANSTATUS)), 1); } break; } return (0); } static int acpi_ibm_sysctl_init(struct acpi_ibm_softc *sc, int method) { int dummy; ACPI_OBJECT_TYPE cmos_t; ACPI_HANDLE ledb_handle; switch (method) { case ACPI_IBM_METHOD_EVENTS: return (TRUE); case ACPI_IBM_METHOD_EVENTMASK: return (sc->events_mask_supported); case ACPI_IBM_METHOD_HOTKEY: case ACPI_IBM_METHOD_BRIGHTNESS: case ACPI_IBM_METHOD_VOLUME: case ACPI_IBM_METHOD_MUTE: /* EC is required here, which was already checked before */ return (TRUE); case ACPI_IBM_METHOD_MIC_LED: if (ACPI_SUCCESS(AcpiGetHandle(sc->handle, "MMTS", &sc->mic_led_handle))) { /* Turn off mic led by default */ acpi_ibm_mic_led_set (sc, 0); return(TRUE); } else sc->mic_led_handle = NULL; return (FALSE); case ACPI_IBM_METHOD_THINKLIGHT: sc->cmos_handle = NULL; sc->light_get_supported = ACPI_SUCCESS(acpi_GetInteger( sc->ec_handle, IBM_NAME_KEYLIGHT, &sc->light_val)); if ((ACPI_SUCCESS(AcpiGetHandle(sc->handle, "\\UCMS", &sc->light_handle)) || ACPI_SUCCESS(AcpiGetHandle(sc->handle, "\\CMOS", &sc->light_handle)) || ACPI_SUCCESS(AcpiGetHandle(sc->handle, "\\CMS", &sc->light_handle))) && ACPI_SUCCESS(AcpiGetType(sc->light_handle, &cmos_t)) && cmos_t == ACPI_TYPE_METHOD) { sc->light_cmd_on = 0x0c; sc->light_cmd_off = 0x0d; sc->cmos_handle = sc->light_handle; } else if (ACPI_SUCCESS(AcpiGetHandle(sc->handle, "\\LGHT", &sc->light_handle))) { sc->light_cmd_on = 1; sc->light_cmd_off = 0; } else sc->light_handle = NULL; sc->light_set_supported = (sc->light_handle && ACPI_FAILURE(AcpiGetHandle(sc->ec_handle, "LEDB", &ledb_handle))); if (sc->light_get_supported) return (TRUE); if (sc->light_set_supported) { sc->light_val = 0; return (TRUE); } return (FALSE); case ACPI_IBM_METHOD_BLUETOOTH: case ACPI_IBM_METHOD_WLAN: if (ACPI_SUCCESS(acpi_GetInteger(sc->handle, IBM_NAME_WLAN_BT_GET, &dummy))) return (TRUE); return (FALSE); case ACPI_IBM_METHOD_FANSPEED: /* * Some models report the fan speed in levels from 0-7 * Newer models report it contiguously */ sc->fan_levels = (ACPI_SUCCESS(AcpiGetHandle(sc->handle, "GFAN", &sc->fan_handle)) || ACPI_SUCCESS(AcpiGetHandle(sc->handle, "\\FSPD", &sc->fan_handle))); return (TRUE); case ACPI_IBM_METHOD_FANLEVEL: case ACPI_IBM_METHOD_FANSTATUS: /* * Fan status is only supported on those models, * which report fan RPM contiguously, not in levels */ if (sc->fan_levels) return (FALSE); return (TRUE); case ACPI_IBM_METHOD_THERMAL: if (ACPI_SUCCESS(acpi_GetInteger(sc->ec_handle, IBM_NAME_THERMAL_GET, &dummy))) { sc->thermal_updt_supported = ACPI_SUCCESS(acpi_GetInteger(sc->ec_handle, IBM_NAME_THERMAL_UPDT, &dummy)); return (TRUE); } return (FALSE); case ACPI_IBM_METHOD_HANDLEREVENTS: return (TRUE); } return (FALSE); } static int acpi_ibm_thermal_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_ibm_softc *sc; int error = 0; char temp_cmd[] = "TMP0"; int temp[8]; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); sc = (struct acpi_ibm_softc *)oidp->oid_arg1; ACPI_SERIAL_BEGIN(ibm); for (int i = 0; i < 8; ++i) { temp_cmd[3] = '0' + i; /* * The TMPx methods seem to return +/- 128 or 0 * when the respecting sensor is not available */ if (ACPI_FAILURE(acpi_GetInteger(sc->ec_handle, temp_cmd, &temp[i])) || ABS(temp[i]) == 128 || temp[i] == 0) temp[i] = -1; else if (sc->thermal_updt_supported) /* Temperature is reported in tenth of Kelvin */ temp[i] = (temp[i] - 2731 + 5) / 10; } error = sysctl_handle_opaque(oidp, &temp, 8*sizeof(int), req); ACPI_SERIAL_END(ibm); return (error); } static int 
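acpi_ibm_thermal_sysctl() above treats readings of 0 or +/-128 as "sensor absent" and, on firmware with the thermal update method, converts tenths of Kelvin to whole degrees Celsius. The conversion is easy to get wrong, so here is a runnable version with two sample values; the helper name is made up.

#include <stdio.h>

/* 0 degC == 273.15 K; the firmware reports tenths of Kelvin and the
 * "+ 5" rounds to the nearest degree, as in the driver above. */
static int
decikelvin_to_celsius(int dk)
{
	return ((dk - 2731 + 5) / 10);
}

int
main(void)
{
	printf("3131 dK -> %d C\n", decikelvin_to_celsius(3131)); /* ~40 C */
	printf("2731 dK -> %d C\n", decikelvin_to_celsius(2731)); /*   0 C */
	return (0);
}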
acpi_ibm_handlerevents_sysctl(SYSCTL_HANDLER_ARGS) { struct acpi_ibm_softc *sc; int error = 0; struct sbuf sb; char *cp, *ep; int l, val; unsigned int handler_events; char temp[128]; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); sc = (struct acpi_ibm_softc *)oidp->oid_arg1; if (sbuf_new(&sb, NULL, 128, SBUF_AUTOEXTEND) == NULL) return (ENOMEM); ACPI_SERIAL_BEGIN(ibm); /* Get old values if this is a get request. */ if (req->newptr == NULL) { for (int i = 0; i < 8 * sizeof(sc->handler_events); i++) if (sc->handler_events & (1 << i)) sbuf_printf(&sb, "0x%02x ", i + 1); if (sbuf_len(&sb) == 0) sbuf_printf(&sb, "NONE"); } sbuf_trim(&sb); sbuf_finish(&sb); strlcpy(temp, sbuf_data(&sb), sizeof(temp)); sbuf_delete(&sb); error = sysctl_handle_string(oidp, temp, sizeof(temp), req); /* Check for error or no change */ if (error != 0 || req->newptr == NULL) goto out; /* If the user is setting a string, parse it. */ handler_events = 0; cp = temp; while (*cp) { if (isspace(*cp)) { cp++; continue; } ep = cp; while (*ep && !isspace(*ep)) ep++; l = ep - cp; if (l == 0) break; if (strncmp(cp, "NONE", 4) == 0) { cp = ep; continue; } if (l >= 3 && cp[0] == '0' && (cp[1] == 'X' || cp[1] == 'x')) val = strtoul(cp, &ep, 16); else val = strtoul(cp, &ep, 10); if (val == 0 || ep == cp || val >= 8 * sizeof(handler_events)) { cp[l] = '\0'; device_printf(sc->dev, "invalid event code: %s\n", cp); error = EINVAL; goto out; } handler_events |= 1 << (val - 1); cp = ep; } sc->handler_events = handler_events; out: ACPI_SERIAL_END(ibm); return (error); } static int acpi_ibm_brightness_set(struct acpi_ibm_softc *sc, int arg) { int val, step; UINT64 val_ec; ACPI_OBJECT Arg; ACPI_OBJECT_LIST Args; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); if (arg < 0 || arg > 7) return (EINVAL); /* Read the current brightness */ status = ACPI_EC_READ(sc->ec_dev, IBM_EC_BRIGHTNESS, &val_ec, 1); if (ACPI_FAILURE(status)) return (status); if (sc->cmos_handle) { val = val_ec & IBM_EC_MASK_BRI; Args.Count = 1; Args.Pointer = &Arg; Arg.Type = ACPI_TYPE_INTEGER; Arg.Integer.Value = (arg > val) ? IBM_CMOS_BRIGHTNESS_UP : IBM_CMOS_BRIGHTNESS_DOWN; step = (arg > val) ? 1 : -1; for (int i = val; i != arg; i += step) { status = AcpiEvaluateObject(sc->cmos_handle, NULL, &Args, NULL); if (ACPI_FAILURE(status)) { /* Record the last value */ if (i != val) { ACPI_EC_WRITE(sc->ec_dev, IBM_EC_BRIGHTNESS, i - step, 1); } return (status); } } } return ACPI_EC_WRITE(sc->ec_dev, IBM_EC_BRIGHTNESS, arg, 1); } static int acpi_ibm_bluetooth_set(struct acpi_ibm_softc *sc, int arg) { int val; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); if (arg < 0 || arg > 1) return (EINVAL); val = (arg == 1) ? sc->wlan_bt_flags | IBM_NAME_MASK_BT : sc->wlan_bt_flags & (~IBM_NAME_MASK_BT); return acpi_SetInteger(sc->handle, IBM_NAME_WLAN_BT_SET, val); } static int acpi_ibm_thinklight_set(struct acpi_ibm_softc *sc, int arg) { ACPI_OBJECT Arg; ACPI_OBJECT_LIST Args; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); if (arg < 0 || arg > 1) return (EINVAL); if (sc->light_set_supported) { Args.Count = 1; Args.Pointer = &Arg; Arg.Type = ACPI_TYPE_INTEGER; Arg.Integer.Value = arg ? 
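The handlerevents handler above accepts a space-separated list of event codes (hex or decimal, or the literal NONE) and folds code N into bit N-1 of a mask. Below is a userland approximation of that parser for experimentation; it uses strtol with base 0 instead of the driver's explicit hex/decimal split, and the bounds check is simplified.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

static unsigned int
parse_handler_events(const char *s)
{
	unsigned int mask = 0;
	char *ep;

	while (*s != '\0') {
		if (isspace((unsigned char)*s)) {
			s++;
			continue;
		}
		if (strncmp(s, "NONE", 4) == 0) {
			s += 4;
			continue;
		}
		long val = strtol(s, &ep, 0);	/* base 0 accepts 0x... too */
		if (ep == s || val <= 0 || val > 32)
			break;			/* invalid token: stop */
		mask |= 1u << (val - 1);	/* code N -> bit N-1 */
		s = ep;
	}
	return (mask);
}

int
main(void)
{
	printf("0x%08x\n", parse_handler_events("0x04 0x1b 3"));
	return (0);
}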
sc->light_cmd_on : sc->light_cmd_off; status = AcpiEvaluateObject(sc->light_handle, NULL, &Args, NULL); if (ACPI_SUCCESS(status)) sc->light_val = arg; return (status); } return (0); } static int acpi_ibm_volume_set(struct acpi_ibm_softc *sc, int arg) { int val, step; UINT64 val_ec; ACPI_OBJECT Arg; ACPI_OBJECT_LIST Args; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); if (arg < 0 || arg > 14) return (EINVAL); /* Read the current volume */ status = ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); if (ACPI_FAILURE(status)) return (status); if (sc->cmos_handle) { val = val_ec & IBM_EC_MASK_VOL; Args.Count = 1; Args.Pointer = &Arg; Arg.Type = ACPI_TYPE_INTEGER; Arg.Integer.Value = (arg > val) ? IBM_CMOS_VOLUME_UP : IBM_CMOS_VOLUME_DOWN; step = (arg > val) ? 1 : -1; for (int i = val; i != arg; i += step) { status = AcpiEvaluateObject(sc->cmos_handle, NULL, &Args, NULL); if (ACPI_FAILURE(status)) { /* Record the last value */ if (i != val) { val_ec = i - step + (val_ec & (~IBM_EC_MASK_VOL)); ACPI_EC_WRITE(sc->ec_dev, IBM_EC_VOLUME, val_ec, 1); } return (status); } } } val_ec = arg + (val_ec & (~IBM_EC_MASK_VOL)); return ACPI_EC_WRITE(sc->ec_dev, IBM_EC_VOLUME, val_ec, 1); } static int acpi_ibm_mute_set(struct acpi_ibm_softc *sc, int arg) { UINT64 val_ec; ACPI_OBJECT Arg; ACPI_OBJECT_LIST Args; ACPI_STATUS status; ACPI_FUNCTION_TRACE((char *)(uintptr_t)__func__); ACPI_SERIAL_ASSERT(ibm); if (arg < 0 || arg > 1) return (EINVAL); status = ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); if (ACPI_FAILURE(status)) return (status); if (sc->cmos_handle) { Args.Count = 1; Args.Pointer = &Arg; Arg.Type = ACPI_TYPE_INTEGER; Arg.Integer.Value = IBM_CMOS_VOLUME_MUTE; status = AcpiEvaluateObject(sc->cmos_handle, NULL, &Args, NULL); if (ACPI_FAILURE(status)) return (status); } val_ec = (arg == 1) ? val_ec | IBM_EC_MASK_MUTE : val_ec & (~IBM_EC_MASK_MUTE); return ACPI_EC_WRITE(sc->ec_dev, IBM_EC_VOLUME, val_ec, 1); } static void acpi_ibm_eventhandler(struct acpi_ibm_softc *sc, int arg) { int val; UINT64 val_ec; ACPI_STATUS status; ACPI_SERIAL_BEGIN(ibm); switch (arg) { case IBM_EVENT_SUSPEND_TO_RAM: power_pm_suspend(POWER_SLEEP_STATE_SUSPEND); break; case IBM_EVENT_BLUETOOTH: acpi_ibm_bluetooth_set(sc, (sc->wlan_bt_flags == 0)); break; case IBM_EVENT_BRIGHTNESS_UP: case IBM_EVENT_BRIGHTNESS_DOWN: /* Read the current brightness */ status = ACPI_EC_READ(sc->ec_dev, IBM_EC_BRIGHTNESS, &val_ec, 1); if (ACPI_FAILURE(status)) return; val = val_ec & IBM_EC_MASK_BRI; val = (arg == IBM_EVENT_BRIGHTNESS_UP) ? val + 1 : val - 1; acpi_ibm_brightness_set(sc, val); break; case IBM_EVENT_THINKLIGHT: acpi_ibm_thinklight_set(sc, (sc->light_val == 0)); break; case IBM_EVENT_VOLUME_UP: case IBM_EVENT_VOLUME_DOWN: /* Read the current volume */ status = ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); if (ACPI_FAILURE(status)) return; val = val_ec & IBM_EC_MASK_VOL; val = (arg == IBM_EVENT_VOLUME_UP) ? 
val + 1 : val - 1; acpi_ibm_volume_set(sc, val); break; case IBM_EVENT_MUTE: /* Read the current value */ status = ACPI_EC_READ(sc->ec_dev, IBM_EC_VOLUME, &val_ec, 1); if (ACPI_FAILURE(status)) return; val = ((val_ec & IBM_EC_MASK_MUTE) == IBM_EC_MASK_MUTE); acpi_ibm_mute_set(sc, (val == 0)); break; default: break; } ACPI_SERIAL_END(ibm); } static void acpi_ibm_notify(ACPI_HANDLE h, UINT32 notify, void *context) { int event, arg, type; device_t dev = context; struct acpi_ibm_softc *sc = device_get_softc(dev); ACPI_FUNCTION_TRACE_U32((char *)(uintptr_t)__func__, notify); if (notify != 0x80) device_printf(dev, "Unknown notify\n"); for (;;) { acpi_GetInteger(acpi_get_handle(dev), IBM_NAME_EVENTS_GET, &event); if (event == 0) break; type = (event >> 12) & 0xf; arg = event & 0xfff; switch (type) { case 1: if (!(sc->events_availmask & (1 << (arg - 1)))) { device_printf(dev, "Unknown key %d\n", arg); break; } /* Execute event handler */ if (sc->handler_events & (1 << (arg - 1))) acpi_ibm_eventhandler(sc, (arg & 0xff)); /* Notify devd(8) */ acpi_UserNotify("IBM", h, (arg & 0xff)); break; default: break; } } } Index: user/ngie/bug-237403/sys/dev/altera/atse/if_atse.c =================================================================== --- user/ngie/bug-237403/sys/dev/altera/atse/if_atse.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/altera/atse/if_atse.c (revision 346926) @@ -1,1603 +1,1608 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2012, 2013 Bjoern A. Zeeb * Copyright (c) 2014 Robert N. M. Watson * Copyright (c) 2016-2017 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract (FA8750-11-C-0249) * ("MRC2"), as part of the DARPA MRC research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Altera Triple-Speed Ethernet MegaCore, Function User Guide * UG-01008-3.0, Software Version: 12.0, June 2012. * Available at the time of writing at: * http://www.altera.com/literature/ug/ug_ethernet.pdf * * We are using an Marvell E1111 (Alaska) PHY on the DE4. See mii/e1000phy.c. */ /* * XXX-BZ NOTES: * - ifOutBroadcastPkts are only counted if both ether dst and src are all-1s; * seems an IP core bug, they count ether broadcasts as multicast. 
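acpi_ibm_notify() above drains the events-get method and splits each returned value into a 4-bit type and a 12-bit argument, where type 1 means a hotkey press. A tiny decoder sketch follows; the sample event value is invented.

#include <stdint.h>
#include <stdio.h>

static void
decode_event(uint32_t event)
{
	uint32_t type = (event >> 12) & 0xf;	/* high nibble: event type */
	uint32_t arg = event & 0xfff;		/* low 12 bits: argument */

	if (type == 1)
		printf("hotkey %u pressed\n", arg);
	else
		printf("type %u, arg %u (ignored)\n", type, arg);
}

int
main(void)
{
	decode_event(0x1004);	/* example: type 1, hotkey 4 */
	return (0);
}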
Is this * still the case? * - figure out why the TX FIFO fill status and intr did not work as expected. * - test 100Mbit/s and 10Mbit/s * - blacklist the one special factory programmed ethernet address (for now * hardcoded, later from loader?) * - resolve all XXX, left as reminders to shake out details later * - Jumbo frame support */ #include __FBSDID("$FreeBSD$"); #include "opt_device_polling.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define RX_QUEUE_SIZE 4096 #define TX_QUEUE_SIZE 4096 #define NUM_RX_MBUF 512 #define BUFRING_SIZE 8192 #include /* XXX once we'd do parallel attach, we need a global lock for this. */ #define ATSE_ETHERNET_OPTION_BITS_UNDEF 0 #define ATSE_ETHERNET_OPTION_BITS_READ 1 static int atse_ethernet_option_bits_flag = ATSE_ETHERNET_OPTION_BITS_UNDEF; static uint8_t atse_ethernet_option_bits[ALTERA_ETHERNET_OPTION_BITS_LEN]; /* * Softc and critical resource locking. */ #define ATSE_LOCK(_sc) mtx_lock(&(_sc)->atse_mtx) #define ATSE_UNLOCK(_sc) mtx_unlock(&(_sc)->atse_mtx) #define ATSE_LOCK_ASSERT(_sc) mtx_assert(&(_sc)->atse_mtx, MA_OWNED) #define ATSE_DEBUG #undef ATSE_DEBUG #ifdef ATSE_DEBUG #define DPRINTF(format, ...) printf(format, __VA_ARGS__) #else #define DPRINTF(format, ...) #endif /* * Register space access macros. */ static inline void csr_write_4(struct atse_softc *sc, uint32_t reg, uint32_t val4, const char *f, const int l) { val4 = htole32(val4); DPRINTF("[%s:%d] CSR W %s 0x%08x (0x%08x) = 0x%08x\n", f, l, "atse_mem_res", reg, reg * 4, val4); bus_write_4(sc->atse_mem_res, reg * 4, val4); } static inline uint32_t csr_read_4(struct atse_softc *sc, uint32_t reg, const char *f, const int l) { uint32_t val4; val4 = le32toh(bus_read_4(sc->atse_mem_res, reg * 4)); DPRINTF("[%s:%d] CSR R %s 0x%08x (0x%08x) = 0x%08x\n", f, l, "atse_mem_res", reg, reg * 4, val4); return (val4); } /* * See page 5-2 that it's all dword offsets and the MS 16 bits need to be zero * on write and ignored on read. 
*/ static inline void pxx_write_2(struct atse_softc *sc, bus_addr_t bmcr, uint32_t reg, uint16_t val, const char *f, const int l, const char *s) { uint32_t val4; val4 = htole32(val & 0x0000ffff); DPRINTF("[%s:%d] %s W %s 0x%08x (0x%08jx) = 0x%08x\n", f, l, s, "atse_mem_res", reg, (bmcr + reg) * 4, val4); bus_write_4(sc->atse_mem_res, (bmcr + reg) * 4, val4); } static inline uint16_t pxx_read_2(struct atse_softc *sc, bus_addr_t bmcr, uint32_t reg, const char *f, const int l, const char *s) { uint32_t val4; uint16_t val; val4 = bus_read_4(sc->atse_mem_res, (bmcr + reg) * 4); val = le32toh(val4) & 0x0000ffff; DPRINTF("[%s:%d] %s R %s 0x%08x (0x%08jx) = 0x%04x\n", f, l, s, "atse_mem_res", reg, (bmcr + reg) * 4, val); return (val); } #define CSR_WRITE_4(sc, reg, val) \ csr_write_4((sc), (reg), (val), __func__, __LINE__) #define CSR_READ_4(sc, reg) \ csr_read_4((sc), (reg), __func__, __LINE__) #define PCS_WRITE_2(sc, reg, val) \ pxx_write_2((sc), sc->atse_bmcr0, (reg), (val), __func__, __LINE__, \ "PCS") #define PCS_READ_2(sc, reg) \ pxx_read_2((sc), sc->atse_bmcr0, (reg), __func__, __LINE__, "PCS") #define PHY_WRITE_2(sc, reg, val) \ pxx_write_2((sc), sc->atse_bmcr1, (reg), (val), __func__, __LINE__, \ "PHY") #define PHY_READ_2(sc, reg) \ pxx_read_2((sc), sc->atse_bmcr1, (reg), __func__, __LINE__, "PHY") static void atse_tick(void *); static int atse_detach(device_t); devclass_t atse_devclass; static int atse_rx_enqueue(struct atse_softc *sc, uint32_t n) { struct mbuf *m; int i; for (i = 0; i < n; i++) { m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR); if (m == NULL) { device_printf(sc->dev, "%s: Can't alloc rx mbuf\n", __func__); return (-1); } m->m_pkthdr.len = m->m_len = m->m_ext.ext_size; xdma_enqueue_mbuf(sc->xchan_rx, &m, 0, 4, 4, XDMA_DEV_TO_MEM); } return (0); } static int atse_xdma_tx_intr(void *arg, xdma_transfer_status_t *status) { xdma_transfer_status_t st; struct atse_softc *sc; struct ifnet *ifp; struct mbuf *m; int err; sc = arg; ATSE_LOCK(sc); ifp = sc->atse_ifp; for (;;) { err = xdma_dequeue_mbuf(sc->xchan_tx, &m, &st); if (err != 0) { break; } if (st.error != 0) { if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); } m_freem(m); sc->txcount--; } ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; ATSE_UNLOCK(sc); return (0); } static int atse_xdma_rx_intr(void *arg, xdma_transfer_status_t *status) { xdma_transfer_status_t st; struct atse_softc *sc; struct ifnet *ifp; struct mbuf *m; int err; uint32_t cnt_processed; sc = arg; ATSE_LOCK(sc); ifp = sc->atse_ifp; cnt_processed = 0; for (;;) { err = xdma_dequeue_mbuf(sc->xchan_rx, &m, &st); if (err != 0) { break; } cnt_processed++; if (st.error != 0) { if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); m_freem(m); continue; } m->m_pkthdr.len = m->m_len = st.transferred; m->m_pkthdr.rcvif = ifp; m_adj(m, ETHER_ALIGN); ATSE_UNLOCK(sc); (*ifp->if_input)(ifp, m); ATSE_LOCK(sc); } atse_rx_enqueue(sc, cnt_processed); ATSE_UNLOCK(sc); return (0); } static int atse_transmit_locked(struct ifnet *ifp) { struct atse_softc *sc; struct mbuf *m; struct buf_ring *br; int error; int enq; sc = ifp->if_softc; br = sc->br; enq = 0; while ((m = drbr_peek(ifp, br)) != NULL) { error = xdma_enqueue_mbuf(sc->xchan_tx, &m, 0, 4, 4, XDMA_MEM_TO_DEV); if (error != 0) { /* No space in request queue available yet. */ drbr_putback(ifp, br, m); break; } drbr_advance(ifp, br); sc->txcount++; enq++; /* If anyone is interested give them a copy. 
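atse_transmit_locked() above follows the usual drbr discipline: peek at the head packet, try to enqueue it to the DMA channel, put it back and stop when the request queue is full, and advance only after a successful enqueue, submitting the batch at the end. The standalone sketch below mimics that control flow with a plain array standing in for both the buf_ring and the DMA queue; all names and numbers are invented.

#include <stdio.h>

#define	QLEN	4

static int dma_slots = 2;			/* pretend DMA queue space */

static int
dma_enqueue(int pkt)
{
	if (dma_slots == 0)
		return (-1);			/* no space yet */
	dma_slots--;
	printf("queued packet %d\n", pkt);
	return (0);
}

int
main(void)
{
	int ring[QLEN] = { 10, 11, 12, 13 };
	int head = 0, enq = 0;

	while (head < QLEN) {
		int pkt = ring[head];		/* drbr_peek() */
		if (dma_enqueue(pkt) != 0)
			break;			/* drbr_putback(): retry later */
		head++;				/* drbr_advance() */
		enq++;
	}
	if (enq > 0)
		printf("submit %d request(s) to hardware\n", enq);
	return (0);
}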
*/ ETHER_BPF_MTAP(ifp, m); } if (enq > 0) xdma_queue_submit(sc->xchan_tx); return (0); } static int atse_transmit(struct ifnet *ifp, struct mbuf *m) { struct atse_softc *sc; struct buf_ring *br; int error; sc = ifp->if_softc; br = sc->br; ATSE_LOCK(sc); mtx_lock(&sc->br_mtx); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) { error = drbr_enqueue(ifp, sc->br, m); mtx_unlock(&sc->br_mtx); ATSE_UNLOCK(sc); return (error); } if ((sc->atse_flags & ATSE_FLAGS_LINK) == 0) { error = drbr_enqueue(ifp, sc->br, m); mtx_unlock(&sc->br_mtx); ATSE_UNLOCK(sc); return (error); } error = drbr_enqueue(ifp, br, m); if (error) { mtx_unlock(&sc->br_mtx); ATSE_UNLOCK(sc); return (error); } error = atse_transmit_locked(ifp); mtx_unlock(&sc->br_mtx); ATSE_UNLOCK(sc); return (error); } static void atse_qflush(struct ifnet *ifp) { struct atse_softc *sc; sc = ifp->if_softc; printf("%s\n", __func__); } static int atse_stop_locked(struct atse_softc *sc) { uint32_t mask, val4; struct ifnet *ifp; int i; ATSE_LOCK_ASSERT(sc); callout_stop(&sc->atse_tick); ifp = sc->atse_ifp; ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE); /* Disable MAC transmit and receive datapath. */ mask = BASE_CFG_COMMAND_CONFIG_TX_ENA|BASE_CFG_COMMAND_CONFIG_RX_ENA; val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); val4 &= ~mask; CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); /* Wait for bits to be cleared; i=100 is excessive. */ for (i = 0; i < 100; i++) { val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); if ((val4 & mask) == 0) { break; } DELAY(10); } if ((val4 & mask) != 0) { device_printf(sc->atse_dev, "Disabling MAC TX/RX timed out.\n"); /* Punt. */ } sc->atse_flags &= ~ATSE_FLAGS_LINK; return (0); } static uint8_t atse_mchash(struct atse_softc *sc __unused, const uint8_t *addr) { uint8_t x, y; int i, j; x = 0; for (i = 0; i < ETHER_ADDR_LEN; i++) { y = addr[i] & 0x01; for (j = 1; j < 8; j++) y ^= (addr[i] >> j) & 0x01; x |= (y << i); } return (x); } static int atse_rxfilter_locked(struct atse_softc *sc) { struct ifmultiaddr *ifma; struct ifnet *ifp; uint32_t val4; int i; /* XXX-BZ can we find out if we have the MHASH synthesized? */ val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); /* For simplicity always hash full 48 bits of addresses. */ if ((val4 & BASE_CFG_COMMAND_CONFIG_MHASH_SEL) != 0) val4 &= ~BASE_CFG_COMMAND_CONFIG_MHASH_SEL; ifp = sc->atse_ifp; if (ifp->if_flags & IFF_PROMISC) { val4 |= BASE_CFG_COMMAND_CONFIG_PROMIS_EN; } else { val4 &= ~BASE_CFG_COMMAND_CONFIG_PROMIS_EN; } CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); if (ifp->if_flags & IFF_ALLMULTI) { /* Accept all multicast addresses. */ for (i = 0; i <= MHASH_LEN; i++) CSR_WRITE_4(sc, MHASH_START + i, 0x1); } else { /* * Can hold MHASH_LEN entries. * XXX-BZ bitstring.h would be more general. */ uint64_t h; h = 0; /* * Re-build and re-program hash table. First build the * bit-field "yes" or "no" for each slot per address, then * do all the programming afterwards. */ if_maddr_rlock(ifp); CK_STAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_LINK) { continue; } h |= (1 << atse_mchash(sc, LLADDR((struct sockaddr_dl *)ifma->ifma_addr))); } if_maddr_runlock(ifp); for (i = 0; i <= MHASH_LEN; i++) { CSR_WRITE_4(sc, MHASH_START + i, (h & (1 << i)) ? 
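atse_rxfilter_locked() above programs a 64-slot multicast hash where, per atse_mchash(), bit i of the slot index is simply the parity of address byte i. The sketch below is a standalone copy of that hash so the mapping can be checked outside the kernel.

#include <stdint.h>
#include <stdio.h>

static uint8_t
atse_style_mchash(const uint8_t addr[6])
{
	uint8_t x = 0;

	for (int i = 0; i < 6; i++) {
		uint8_t y = addr[i] & 0x01;

		for (int j = 1; j < 8; j++)
			y ^= (addr[i] >> j) & 0x01;	/* parity of byte i */
		x |= (uint8_t)(y << i);			/* becomes bit i */
	}
	return (x);
}

int
main(void)
{
	/* 01:00:5e:00:00:01 is the usual all-hosts IPv4 multicast MAC. */
	const uint8_t grp[6] = { 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 };

	printf("hash slot %u\n", atse_style_mchash(grp));
	return (0);
}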
0x01 : 0x00); } } return (0); } static int atse_ethernet_option_bits_read_fdt(device_t dev) { struct resource *res; device_t fdev; int i, rid; if (atse_ethernet_option_bits_flag & ATSE_ETHERNET_OPTION_BITS_READ) { return (0); } fdev = device_find_child(device_get_parent(dev), "cfi", 0); if (fdev == NULL) { return (ENOENT); } rid = 0; res = bus_alloc_resource_any(fdev, SYS_RES_MEMORY, &rid, RF_ACTIVE | RF_SHAREABLE); if (res == NULL) { return (ENXIO); } for (i = 0; i < ALTERA_ETHERNET_OPTION_BITS_LEN; i++) { atse_ethernet_option_bits[i] = bus_read_1(res, ALTERA_ETHERNET_OPTION_BITS_OFF + i); } bus_release_resource(fdev, SYS_RES_MEMORY, rid, res); atse_ethernet_option_bits_flag |= ATSE_ETHERNET_OPTION_BITS_READ; return (0); } static int atse_ethernet_option_bits_read(device_t dev) { int error; error = atse_ethernet_option_bits_read_fdt(dev); if (error == 0) return (0); device_printf(dev, "Cannot read Ethernet addresses from flash.\n"); return (error); } static int atse_get_eth_address(struct atse_softc *sc) { unsigned long hostid; uint32_t val4; int unit; /* * Make sure to only ever do this once. Otherwise a reset would * possibly change our ethernet address, which is not good at all. */ if (sc->atse_eth_addr[0] != 0x00 || sc->atse_eth_addr[1] != 0x00 || sc->atse_eth_addr[2] != 0x00) { return (0); } if ((atse_ethernet_option_bits_flag & ATSE_ETHERNET_OPTION_BITS_READ) == 0) { goto get_random; } val4 = atse_ethernet_option_bits[0] << 24; val4 |= atse_ethernet_option_bits[1] << 16; val4 |= atse_ethernet_option_bits[2] << 8; val4 |= atse_ethernet_option_bits[3]; /* They chose "safe". */ if (val4 != le32toh(0x00005afe)) { device_printf(sc->atse_dev, "Magic '5afe' is not safe: 0x%08x. " "Falling back to random numbers for hardware address.\n", val4); goto get_random; } sc->atse_eth_addr[0] = atse_ethernet_option_bits[4]; sc->atse_eth_addr[1] = atse_ethernet_option_bits[5]; sc->atse_eth_addr[2] = atse_ethernet_option_bits[6]; sc->atse_eth_addr[3] = atse_ethernet_option_bits[7]; sc->atse_eth_addr[4] = atse_ethernet_option_bits[8]; sc->atse_eth_addr[5] = atse_ethernet_option_bits[9]; /* Handle factory default ethernet addresss: 00:07:ed:ff:ed:15 */ if (sc->atse_eth_addr[0] == 0x00 && sc->atse_eth_addr[1] == 0x07 && sc->atse_eth_addr[2] == 0xed && sc->atse_eth_addr[3] == 0xff && sc->atse_eth_addr[4] == 0xed && sc->atse_eth_addr[5] == 0x15) { device_printf(sc->atse_dev, "Factory programmed Ethernet " "hardware address blacklisted. Falling back to random " "address to avoid collisions.\n"); device_printf(sc->atse_dev, "Please re-program your flash.\n"); goto get_random; } if (sc->atse_eth_addr[0] == 0x00 && sc->atse_eth_addr[1] == 0x00 && sc->atse_eth_addr[2] == 0x00 && sc->atse_eth_addr[3] == 0x00 && sc->atse_eth_addr[4] == 0x00 && sc->atse_eth_addr[5] == 0x00) { device_printf(sc->atse_dev, "All zero's Ethernet hardware " "address blacklisted. Falling back to random address.\n"); device_printf(sc->atse_dev, "Please re-program your flash.\n"); goto get_random; } if (ETHER_IS_MULTICAST(sc->atse_eth_addr)) { device_printf(sc->atse_dev, "Multicast Ethernet hardware " "address blacklisted. Falling back to random address.\n"); device_printf(sc->atse_dev, "Please re-program your flash.\n"); goto get_random; } /* * If we find an Altera prefixed address with a 0x0 ending * adjust by device unit. If not and this is not the first * Ethernet, go to random. 
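atse_get_eth_address() above only trusts the flash option bits when they start with the 0x00005afe ("safe") magic, and it still rejects the factory default 00:07:ed:ff:ed:15, all-zero, and multicast addresses. The following sketch condenses those checks into one validation helper; the helper name and sample data are invented.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static int
option_bits_mac_ok(const uint8_t bits[10], uint8_t mac[6])
{
	static const uint8_t factory[6] =
	    { 0x00, 0x07, 0xed, 0xff, 0xed, 0x15 };
	static const uint8_t zero[6] = { 0 };
	uint32_t magic;

	magic = (uint32_t)bits[0] << 24 | bits[1] << 16 | bits[2] << 8 |
	    bits[3];
	if (magic != 0x00005afe)		/* "They chose 'safe'." */
		return (0);
	memcpy(mac, &bits[4], 6);
	if (memcmp(mac, factory, 6) == 0 ||	/* blacklisted factory MAC */
	    memcmp(mac, zero, 6) == 0 ||	/* all zeroes */
	    (mac[0] & 0x01) != 0)		/* multicast bit set */
		return (0);
	return (1);
}

int
main(void)
{
	const uint8_t bits[10] =
	    { 0x00, 0x00, 0x5a, 0xfe, 0x00, 0x07, 0xed, 0x12, 0x34, 0x50 };
	uint8_t mac[6];

	printf("usable: %d\n", option_bits_mac_ok(bits, mac));
	return (0);
}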
*/ unit = device_get_unit(sc->atse_dev); if (unit == 0x00) { return (0); } if (unit > 0x0f) { device_printf(sc->atse_dev, "We do not support Ethernet " "addresses for more than 16 MACs. Falling back to " "random hadware address.\n"); goto get_random; } if ((sc->atse_eth_addr[0] & ~0x2) != 0 || sc->atse_eth_addr[1] != 0x07 || sc->atse_eth_addr[2] != 0xed || (sc->atse_eth_addr[5] & 0x0f) != 0x0) { device_printf(sc->atse_dev, "Ethernet address not meeting our " "multi-MAC standards. Falling back to random hadware " "address.\n"); goto get_random; } sc->atse_eth_addr[5] |= (unit & 0x0f); return (0); get_random: /* * Fall back to random code we also use on bridge(4). */ getcredhostid(curthread->td_ucred, &hostid); if (hostid == 0) { arc4rand(sc->atse_eth_addr, ETHER_ADDR_LEN, 1); sc->atse_eth_addr[0] &= ~1;/* clear multicast bit */ sc->atse_eth_addr[0] |= 2; /* set the LAA bit */ } else { sc->atse_eth_addr[0] = 0x2; sc->atse_eth_addr[1] = (hostid >> 24) & 0xff; sc->atse_eth_addr[2] = (hostid >> 16) & 0xff; sc->atse_eth_addr[3] = (hostid >> 8 ) & 0xff; sc->atse_eth_addr[4] = hostid & 0xff; sc->atse_eth_addr[5] = sc->atse_unit & 0xff; } return (0); } static int atse_set_eth_address(struct atse_softc *sc, int n) { uint32_t v0, v1; v0 = (sc->atse_eth_addr[3] << 24) | (sc->atse_eth_addr[2] << 16) | (sc->atse_eth_addr[1] << 8) | sc->atse_eth_addr[0]; v1 = (sc->atse_eth_addr[5] << 8) | sc->atse_eth_addr[4]; if (n & ATSE_ETH_ADDR_DEF) { CSR_WRITE_4(sc, BASE_CFG_MAC_0, v0); CSR_WRITE_4(sc, BASE_CFG_MAC_1, v1); } if (n & ATSE_ETH_ADDR_SUPP1) { CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_0_0, v0); CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_0_1, v1); } if (n & ATSE_ETH_ADDR_SUPP2) { CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_1_0, v0); CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_1_1, v1); } if (n & ATSE_ETH_ADDR_SUPP3) { CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_2_0, v0); CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_2_1, v1); } if (n & ATSE_ETH_ADDR_SUPP4) { CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_3_0, v0); CSR_WRITE_4(sc, SUPPL_ADDR_SMAC_3_1, v1); } return (0); } static int atse_reset(struct atse_softc *sc) { uint32_t val4, mask; uint16_t val; int i; /* 1. External PHY Initialization using MDIO. */ /* * We select the right MDIO space in atse_attach() and let MII do * anything else. */ /* 2. PCS Configuration Register Initialization. */ /* a. Set auto negotiation link timer to 1.6ms for SGMII. */ PCS_WRITE_2(sc, PCS_EXT_LINK_TIMER_0, 0x0D40); PCS_WRITE_2(sc, PCS_EXT_LINK_TIMER_1, 0x0003); /* b. Configure SGMII. */ val = PCS_EXT_IF_MODE_SGMII_ENA|PCS_EXT_IF_MODE_USE_SGMII_AN; PCS_WRITE_2(sc, PCS_EXT_IF_MODE, val); /* c. Enable auto negotiation. */ /* Ignore Bits 6,8,13; should be set,set,unset. */ val = PCS_READ_2(sc, PCS_CONTROL); val &= ~(PCS_CONTROL_ISOLATE|PCS_CONTROL_POWERDOWN); val &= ~PCS_CONTROL_LOOPBACK; /* Make this a -link1 option? */ val |= PCS_CONTROL_AUTO_NEGOTIATION_ENABLE; PCS_WRITE_2(sc, PCS_CONTROL, val); /* d. PCS reset. */ val = PCS_READ_2(sc, PCS_CONTROL); val |= PCS_CONTROL_RESET; PCS_WRITE_2(sc, PCS_CONTROL, val); /* Wait for reset bit to clear; i=100 is excessive. */ for (i = 0; i < 100; i++) { val = PCS_READ_2(sc, PCS_CONTROL); if ((val & PCS_CONTROL_RESET) == 0) { break; } DELAY(10); } if ((val & PCS_CONTROL_RESET) != 0) { device_printf(sc->atse_dev, "PCS reset timed out.\n"); return (ENXIO); } /* 3. MAC Configuration Register Initialization. */ /* a. Disable MAC transmit and receive datapath. 
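When no usable flash address exists, the code above falls back to a locally administered address derived from the host id (or a random one when the host id is zero), with the device unit in the last octet. A standalone sketch of the host-id variant follows; the example host id is arbitrary.

#include <stdint.h>
#include <stdio.h>

static void
mac_from_hostid(unsigned long hostid, int unit, uint8_t mac[6])
{
	mac[0] = 0x02;			/* locally administered, unicast */
	mac[1] = (hostid >> 24) & 0xff;
	mac[2] = (hostid >> 16) & 0xff;
	mac[3] = (hostid >> 8) & 0xff;
	mac[4] = hostid & 0xff;
	mac[5] = unit & 0xff;		/* distinguishes atse0, atse1, ... */
}

int
main(void)
{
	uint8_t mac[6];

	mac_from_hostid(0xdeadbeefUL, 1, mac);	/* example host id, unit 1 */
	printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
	    mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	return (0);
}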
*/ mask = BASE_CFG_COMMAND_CONFIG_TX_ENA|BASE_CFG_COMMAND_CONFIG_RX_ENA; val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); val4 &= ~mask; /* Samples in the manual do have the SW_RESET bit set here, why? */ CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); /* Wait for bits to be cleared; i=100 is excessive. */ for (i = 0; i < 100; i++) { val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); if ((val4 & mask) == 0) { break; } DELAY(10); } if ((val4 & mask) != 0) { device_printf(sc->atse_dev, "Disabling MAC TX/RX timed out.\n"); return (ENXIO); } /* b. MAC FIFO configuration. */ CSR_WRITE_4(sc, BASE_CFG_TX_SECTION_EMPTY, FIFO_DEPTH_TX - 16); CSR_WRITE_4(sc, BASE_CFG_TX_ALMOST_FULL, 3); CSR_WRITE_4(sc, BASE_CFG_TX_ALMOST_EMPTY, 8); CSR_WRITE_4(sc, BASE_CFG_RX_SECTION_EMPTY, FIFO_DEPTH_RX - 16); CSR_WRITE_4(sc, BASE_CFG_RX_ALMOST_FULL, 8); CSR_WRITE_4(sc, BASE_CFG_RX_ALMOST_EMPTY, 8); #if 0 CSR_WRITE_4(sc, BASE_CFG_TX_SECTION_FULL, 16); CSR_WRITE_4(sc, BASE_CFG_RX_SECTION_FULL, 16); #else /* For store-and-forward mode, set this threshold to 0. */ CSR_WRITE_4(sc, BASE_CFG_TX_SECTION_FULL, 0); CSR_WRITE_4(sc, BASE_CFG_RX_SECTION_FULL, 0); #endif /* c. MAC address configuration. */ /* Also intialize supplementary addresses to our primary one. */ /* XXX-BZ FreeBSD really needs to grow and API for using these. */ atse_get_eth_address(sc); atse_set_eth_address(sc, ATSE_ETH_ADDR_ALL); /* d. MAC function configuration. */ CSR_WRITE_4(sc, BASE_CFG_FRM_LENGTH, 1518); /* Default. */ CSR_WRITE_4(sc, BASE_CFG_TX_IPG_LENGTH, 12); CSR_WRITE_4(sc, BASE_CFG_PAUSE_QUANT, 0xFFFF); val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); /* * If 1000BASE-X/SGMII PCS is initialized, set the ETH_SPEED (bit 3) * and ENA_10 (bit 25) in command_config register to 0. If half duplex * is reported in the PHY/PCS status register, set the HD_ENA (bit 10) * to 1 in command_config register. * BZ: We shoot for 1000 instead. */ #if 0 val4 |= BASE_CFG_COMMAND_CONFIG_ETH_SPEED; #else val4 &= ~BASE_CFG_COMMAND_CONFIG_ETH_SPEED; #endif val4 &= ~BASE_CFG_COMMAND_CONFIG_ENA_10; #if 0 /* * We do not want to set this, otherwise, we could not even send * random raw ethernet frames for various other research. By default * FreeBSD will use the right ether source address. */ val4 |= BASE_CFG_COMMAND_CONFIG_TX_ADDR_INS; #endif val4 |= BASE_CFG_COMMAND_CONFIG_PAD_EN; val4 &= ~BASE_CFG_COMMAND_CONFIG_CRC_FWD; #if 0 val4 |= BASE_CFG_COMMAND_CONFIG_CNTL_FRM_ENA; #endif #if 1 val4 |= BASE_CFG_COMMAND_CONFIG_RX_ERR_DISC; #endif val &= ~BASE_CFG_COMMAND_CONFIG_LOOP_ENA; /* link0? */ CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); /* * Make sure we do not enable 32bit alignment; FreeBSD cannot * cope with the additional padding (though we should!?). * Also make sure we get the CRC appended. */ val4 = CSR_READ_4(sc, TX_CMD_STAT); val4 &= ~(TX_CMD_STAT_OMIT_CRC|TX_CMD_STAT_TX_SHIFT16); CSR_WRITE_4(sc, TX_CMD_STAT, val4); val4 = CSR_READ_4(sc, RX_CMD_STAT); val4 &= ~RX_CMD_STAT_RX_SHIFT16; val4 |= RX_CMD_STAT_RX_SHIFT16; CSR_WRITE_4(sc, RX_CMD_STAT, val4); /* e. Reset MAC. */ val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); val4 |= BASE_CFG_COMMAND_CONFIG_SW_RESET; CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); /* Wait for bits to be cleared; i=100 is excessive. */ for (i = 0; i < 100; i++) { val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); if ((val4 & BASE_CFG_COMMAND_CONFIG_SW_RESET) == 0) { break; } DELAY(10); } if ((val4 & BASE_CFG_COMMAND_CONFIG_SW_RESET) != 0) { device_printf(sc->atse_dev, "MAC reset timed out.\n"); return (ENXIO); } /* f. 
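atse_reset() above repeats the same "write, then poll until the bit clears, give up after roughly a millisecond" idiom for the PCS reset, the MAC disable, the software reset, and the re-enable. The sketch below factors that idiom into one helper; the callback-based register read exists only so the example runs without hardware.

#include <stdint.h>
#include <stdio.h>

static int
poll_until_clear(uint32_t (*read_reg)(void), uint32_t mask, int tries)
{
	for (int i = 0; i < tries; i++) {
		if ((read_reg() & mask) == 0)
			return (0);	/* bit(s) cleared in time */
		/* DELAY(10) would go here in the driver. */
	}
	return (-1);			/* timed out; caller reports it */
}

static uint32_t
fake_reg_read(void)
{
	static int calls;

	return (++calls < 3 ? 0x1 : 0x0);	/* clears on the third read */
}

int
main(void)
{
	printf("poll result: %d\n", poll_until_clear(fake_reg_read, 0x1, 100));
	return (0);
}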
Enable MAC transmit and receive datapath. */ mask = BASE_CFG_COMMAND_CONFIG_TX_ENA|BASE_CFG_COMMAND_CONFIG_RX_ENA; val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); val4 |= mask; CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); /* Wait for bits to be cleared; i=100 is excessive. */ for (i = 0; i < 100; i++) { val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); if ((val4 & mask) == mask) { break; } DELAY(10); } if ((val4 & mask) != mask) { device_printf(sc->atse_dev, "Enabling MAC TX/RX timed out.\n"); return (ENXIO); } return (0); } static void atse_init_locked(struct atse_softc *sc) { struct ifnet *ifp; struct mii_data *mii; uint8_t *eaddr; ATSE_LOCK_ASSERT(sc); ifp = sc->atse_ifp; if ((ifp->if_drv_flags & IFF_DRV_RUNNING) != 0) { return; } /* * Must update the ether address if changed. Given we do not handle * in atse_ioctl() but it's in the general framework, just always * do it here before atse_reset(). */ eaddr = IF_LLADDR(sc->atse_ifp); bcopy(eaddr, &sc->atse_eth_addr, ETHER_ADDR_LEN); /* Make things frind to halt, cleanup, ... */ atse_stop_locked(sc); atse_reset(sc); /* ... and fire up the engine again. */ atse_rxfilter_locked(sc); sc->atse_flags &= ATSE_FLAGS_LINK; /* Preserve. */ mii = device_get_softc(sc->atse_miibus); sc->atse_flags &= ~ATSE_FLAGS_LINK; mii_mediachg(mii); ifp->if_drv_flags |= IFF_DRV_RUNNING; ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; callout_reset(&sc->atse_tick, hz, atse_tick, sc); } static void atse_init(void *xsc) { struct atse_softc *sc; /* * XXXRW: There is some argument that we should immediately do RX * processing after enabling interrupts, or one may not fire if there * are buffered packets. */ sc = (struct atse_softc *)xsc; ATSE_LOCK(sc); atse_init_locked(sc); ATSE_UNLOCK(sc); } static int atse_ioctl(struct ifnet *ifp, u_long command, caddr_t data) { struct atse_softc *sc; struct ifreq *ifr; int error, mask; error = 0; sc = ifp->if_softc; ifr = (struct ifreq *)data; switch (command) { case SIOCSIFFLAGS: ATSE_LOCK(sc); if (ifp->if_flags & IFF_UP) { if ((ifp->if_drv_flags & IFF_DRV_RUNNING) != 0 && ((ifp->if_flags ^ sc->atse_if_flags) & (IFF_PROMISC | IFF_ALLMULTI)) != 0) atse_rxfilter_locked(sc); else atse_init_locked(sc); } else if (ifp->if_drv_flags & IFF_DRV_RUNNING) atse_stop_locked(sc); sc->atse_if_flags = ifp->if_flags; ATSE_UNLOCK(sc); break; case SIOCSIFCAP: ATSE_LOCK(sc); mask = ifr->ifr_reqcap ^ ifp->if_capenable; ATSE_UNLOCK(sc); break; case SIOCADDMULTI: case SIOCDELMULTI: ATSE_LOCK(sc); atse_rxfilter_locked(sc); ATSE_UNLOCK(sc); break; case SIOCGIFMEDIA: case SIOCSIFMEDIA: { struct mii_data *mii; struct ifreq *ifr; mii = device_get_softc(sc->atse_miibus); ifr = (struct ifreq *)data; error = ifmedia_ioctl(ifp, ifr, &mii->mii_media, command); break; } default: error = ether_ioctl(ifp, command, data); break; } return (error); } static void atse_tick(void *xsc) { struct atse_softc *sc; struct mii_data *mii; struct ifnet *ifp; sc = (struct atse_softc *)xsc; ATSE_LOCK_ASSERT(sc); ifp = sc->atse_ifp; mii = device_get_softc(sc->atse_miibus); mii_tick(mii); if ((sc->atse_flags & ATSE_FLAGS_LINK) == 0) { atse_miibus_statchg(sc->atse_dev); } callout_reset(&sc->atse_tick, hz, atse_tick, sc); } /* * Set media options. 
*/ static int atse_ifmedia_upd(struct ifnet *ifp) { struct atse_softc *sc; struct mii_data *mii; struct mii_softc *miisc; int error; sc = ifp->if_softc; ATSE_LOCK(sc); mii = device_get_softc(sc->atse_miibus); LIST_FOREACH(miisc, &mii->mii_phys, mii_list) { PHY_RESET(miisc); } error = mii_mediachg(mii); ATSE_UNLOCK(sc); return (error); } /* * Report current media status. */ static void atse_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr) { struct atse_softc *sc; struct mii_data *mii; sc = ifp->if_softc; ATSE_LOCK(sc); mii = device_get_softc(sc->atse_miibus); mii_pollstat(mii); ifmr->ifm_active = mii->mii_media_active; ifmr->ifm_status = mii->mii_media_status; ATSE_UNLOCK(sc); } static struct atse_mac_stats_regs { const char *name; const char *descr; /* Mostly copied from Altera datasheet. */ } atse_mac_stats_regs[] = { [0x1a] = { "aFramesTransmittedOK", "The number of frames that are successfully transmitted including " "the pause frames." }, { "aFramesReceivedOK", "The number of frames that are successfully received including the " "pause frames." }, { "aFrameCheckSequenceErrors", "The number of receive frames with CRC error." }, { "aAlignmentErrors", "The number of receive frames with alignment error." }, { "aOctetsTransmittedOK", "The lower 32 bits of the number of data and padding octets that " "are successfully transmitted." }, { "aOctetsReceivedOK", "The lower 32 bits of the number of data and padding octets that " " are successfully received." }, { "aTxPAUSEMACCtrlFrames", "The number of pause frames transmitted." }, { "aRxPAUSEMACCtrlFrames", "The number received pause frames received." }, { "ifInErrors", "The number of errored frames received." }, { "ifOutErrors", "The number of transmit frames with either a FIFO overflow error, " "a FIFO underflow error, or a error defined by the user " "application." }, { "ifInUcastPkts", "The number of valid unicast frames received." }, { "ifInMulticastPkts", "The number of valid multicast frames received. The count does " "not include pause frames." }, { "ifInBroadcastPkts", "The number of valid broadcast frames received." }, { "ifOutDiscards", "This statistics counter is not in use. The MAC function does not " "discard frames that are written to the FIFO buffer by the user " "application." }, { "ifOutUcastPkts", "The number of valid unicast frames transmitted." }, { "ifOutMulticastPkts", "The number of valid multicast frames transmitted, excluding pause " "frames." }, { "ifOutBroadcastPkts", "The number of valid broadcast frames transmitted." }, { "etherStatsDropEvents", "The number of frames that are dropped due to MAC internal errors " "when FIFO buffer overflow persists." }, { "etherStatsOctets", "The lower 32 bits of the total number of octets received. This " "count includes both good and errored frames." }, { "etherStatsPkts", "The total number of good and errored frames received." }, { "etherStatsUndersizePkts", "The number of frames received with length less than 64 bytes. " "This count does not include errored frames." }, { "etherStatsOversizePkts", "The number of frames received that are longer than the value " "configured in the frm_length register. This count does not " "include errored frames." }, { "etherStatsPkts64Octets", "The number of 64-byte frames received. This count includes good " "and errored frames." }, { "etherStatsPkts65to127Octets", "The number of received good and errored frames between the length " "of 65 and 127 bytes." 
}, { "etherStatsPkts128to255Octets", "The number of received good and errored frames between the length " "of 128 and 255 bytes." }, { "etherStatsPkts256to511Octets", "The number of received good and errored frames between the length " "of 256 and 511 bytes." }, { "etherStatsPkts512to1023Octets", "The number of received good and errored frames between the length " "of 512 and 1023 bytes." }, { "etherStatsPkts1024to1518Octets", "The number of received good and errored frames between the length " "of 1024 and 1518 bytes." }, { "etherStatsPkts1519toXOctets", "The number of received good and errored frames between the length " "of 1519 and the maximum frame length configured in the frm_length " "register." }, { "etherStatsJabbers", "Too long frames with CRC error." }, { "etherStatsFragments", "Too short frames with CRC error." }, /* 0x39 unused, 0x3a/b non-stats. */ [0x3c] = /* Extended Statistics Counters */ { "msb_aOctetsTransmittedOK", "Upper 32 bits of the number of data and padding octets that are " "successfully transmitted." }, { "msb_aOctetsReceivedOK", "Upper 32 bits of the number of data and padding octets that are " "successfully received." }, { "msb_etherStatsOctets", "Upper 32 bits of the total number of octets received. This count " "includes both good and errored frames." } }; static int sysctl_atse_mac_stats_proc(SYSCTL_HANDLER_ARGS) { struct atse_softc *sc; int error, offset, s; sc = arg1; offset = arg2; s = CSR_READ_4(sc, offset); error = sysctl_handle_int(oidp, &s, 0, req); if (error || !req->newptr) { return (error); } return (0); } static struct atse_rx_err_stats_regs { const char *name; const char *descr; } atse_rx_err_stats_regs[] = { #define ATSE_RX_ERR_FIFO_THRES_EOP 0 /* FIFO threshold reached, on EOP. */ #define ATSE_RX_ERR_ELEN 1 /* Frame/payload length not valid. */ #define ATSE_RX_ERR_CRC32 2 /* CRC-32 error. */ #define ATSE_RX_ERR_FIFO_THRES_TRUNC 3 /* FIFO thresh., truncated frame. */ #define ATSE_RX_ERR_4 4 /* ? */ #define ATSE_RX_ERR_5 5 /* / */ { "rx_err_fifo_thres_eop", "FIFO threshold reached, reported on EOP." }, { "rx_err_fifo_elen", "Frame or payload length not valid." }, { "rx_err_fifo_crc32", "CRC-32 error." }, { "rx_err_fifo_thres_trunc", "FIFO threshold reached, truncated frame" }, { "rx_err_4", "?" }, { "rx_err_5", "?" }, }; static int sysctl_atse_rx_err_stats_proc(SYSCTL_HANDLER_ARGS) { struct atse_softc *sc; int error, offset, s; sc = arg1; offset = arg2; s = sc->atse_rx_err[offset]; error = sysctl_handle_int(oidp, &s, 0, req); if (error || !req->newptr) { return (error); } return (0); } static void atse_sysctl_stats_attach(device_t dev) { struct sysctl_ctx_list *sctx; struct sysctl_oid *soid; struct atse_softc *sc; int i; sc = device_get_softc(dev); sctx = device_get_sysctl_ctx(dev); soid = device_get_sysctl_tree(dev); /* MAC statistics. */ for (i = 0; i < nitems(atse_mac_stats_regs); i++) { if (atse_mac_stats_regs[i].name == NULL || atse_mac_stats_regs[i].descr == NULL) { continue; } SYSCTL_ADD_PROC(sctx, SYSCTL_CHILDREN(soid), OID_AUTO, atse_mac_stats_regs[i].name, CTLTYPE_UINT|CTLFLAG_RD, sc, i, sysctl_atse_mac_stats_proc, "IU", atse_mac_stats_regs[i].descr); } /* rx_err[]. */ for (i = 0; i < ATSE_RX_ERR_MAX; i++) { if (atse_rx_err_stats_regs[i].name == NULL || atse_rx_err_stats_regs[i].descr == NULL) { continue; } SYSCTL_ADD_PROC(sctx, SYSCTL_CHILDREN(soid), OID_AUTO, atse_rx_err_stats_regs[i].name, CTLTYPE_UINT|CTLFLAG_RD, sc, i, sysctl_atse_rx_err_stats_proc, "IU", atse_rx_err_stats_regs[i].descr); } } /* * Generic device handling routines. 
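The statistics table above leans on designated initializers: the first counter of each run is pinned to its CSR offset (0x1a for the basic counters, 0x3c for the 64-bit extensions), so the array index handed to the sysctl handler as arg2 doubles as the register to read. The sketch below shows that sparse-table trick in isolation, with only a few of the entries.

#include <stdio.h>

struct stat_reg {
	const char *name;
};

static const struct stat_reg stats[] = {
	[0x1a] = { "aFramesTransmittedOK" },
		 { "aFramesReceivedOK" },	/* implicitly index 0x1b */
	[0x3c] = { "msb_aOctetsTransmittedOK" },
};

int
main(void)
{
	/* Gaps left by the designators stay NULL and are skipped. */
	for (unsigned i = 0; i < sizeof(stats) / sizeof(stats[0]); i++)
		if (stats[i].name != NULL)
			printf("CSR 0x%02x -> %s\n", i, stats[i].name);
	return (0);
}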
*/ int atse_attach(device_t dev) { struct atse_softc *sc; struct ifnet *ifp; uint32_t caps; int error; sc = device_get_softc(dev); sc->dev = dev; /* Get xDMA controller */ sc->xdma_tx = xdma_ofw_get(sc->dev, "tx"); if (sc->xdma_tx == NULL) { device_printf(dev, "Can't find DMA controller.\n"); return (ENXIO); } /* * Only final (EOP) write can be less than "symbols per beat" value * so we have to defrag mbuf chain. * Chapter 15. On-Chip FIFO Memory Core. * Embedded Peripherals IP User Guide. */ - caps = XCHAN_CAP_BUSDMA_NOSEG; + caps = XCHAN_CAP_NOSEG; /* Alloc xDMA virtual channel. */ sc->xchan_tx = xdma_channel_alloc(sc->xdma_tx, caps); if (sc->xchan_tx == NULL) { device_printf(dev, "Can't alloc virtual DMA channel.\n"); return (ENXIO); } /* Setup interrupt handler. */ error = xdma_setup_intr(sc->xchan_tx, atse_xdma_tx_intr, sc, &sc->ih_tx); if (error) { device_printf(sc->dev, "Can't setup xDMA interrupt handler.\n"); return (ENXIO); } xdma_prep_sg(sc->xchan_tx, TX_QUEUE_SIZE, /* xchan requests queue size */ MCLBYTES, /* maxsegsize */ 8, /* maxnsegs */ 16, /* alignment */ 0, /* boundary */ BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR); /* Get RX xDMA controller */ sc->xdma_rx = xdma_ofw_get(sc->dev, "rx"); if (sc->xdma_rx == NULL) { device_printf(dev, "Can't find DMA controller.\n"); return (ENXIO); } /* Alloc xDMA virtual channel. */ sc->xchan_rx = xdma_channel_alloc(sc->xdma_rx, caps); if (sc->xchan_rx == NULL) { device_printf(dev, "Can't alloc virtual DMA channel.\n"); return (ENXIO); } /* Setup interrupt handler. */ error = xdma_setup_intr(sc->xchan_rx, atse_xdma_rx_intr, sc, &sc->ih_rx); if (error) { device_printf(sc->dev, "Can't setup xDMA interrupt handler.\n"); return (ENXIO); } xdma_prep_sg(sc->xchan_rx, RX_QUEUE_SIZE, /* xchan requests queue size */ MCLBYTES, /* maxsegsize */ 1, /* maxnsegs */ 16, /* alignment */ 0, /* boundary */ BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR); mtx_init(&sc->br_mtx, "buf ring mtx", NULL, MTX_DEF); sc->br = buf_ring_alloc(BUFRING_SIZE, M_DEVBUF, M_NOWAIT, &sc->br_mtx); if (sc->br == NULL) { return (ENOMEM); } atse_ethernet_option_bits_read(dev); mtx_init(&sc->atse_mtx, device_get_nameunit(dev), MTX_NETWORK_LOCK, MTX_DEF); callout_init_mtx(&sc->atse_tick, &sc->atse_mtx, 0); /* * We are only doing single-PHY with this driver currently. The * defaults would be right so that BASE_CFG_MDIO_ADDR0 points to the * 1st PHY address (0) apart from the fact that BMCR0 is always * the PCS mapping, so we always use BMCR1. See Table 5-1 0xA0-0xBF. */ #if 0 /* Always PCS. */ sc->atse_bmcr0 = MDIO_0_START; CSR_WRITE_4(sc, BASE_CFG_MDIO_ADDR0, 0x00); #endif /* Always use matching PHY for atse[0..]. */ sc->atse_phy_addr = device_get_unit(dev); sc->atse_bmcr1 = MDIO_1_START; CSR_WRITE_4(sc, BASE_CFG_MDIO_ADDR1, sc->atse_phy_addr); /* Reset the adapter. */ atse_reset(sc); /* Setup interface. */ ifp = sc->atse_ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "if_alloc() failed\n"); error = ENOSPC; goto err; } ifp->if_softc = sc; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl = atse_ioctl; ifp->if_transmit = atse_transmit; ifp->if_qflush = atse_qflush; ifp->if_init = atse_init; IFQ_SET_MAXLEN(&ifp->if_snd, ATSE_TX_LIST_CNT - 1); ifp->if_snd.ifq_drv_maxlen = ATSE_TX_LIST_CNT - 1; IFQ_SET_READY(&ifp->if_snd); /* MII setup. 
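The attach path above requests a no-segmentation DMA channel (the XCHAN_CAP_NOSEG change in this hunk) because, as the quoted FIFO core documentation notes, only the final end-of-packet write may be narrower than the symbols-per-beat width. The sketch below illustrates that constraint for a 4-byte beat; the beat width is an assumption chosen for illustration, not a value taken from the hardware.

#include <stdio.h>

int
main(void)
{
	int len = 61, beat = 4;		/* example packet length and beat */

	for (int off = 0; off < len; off += beat) {
		int chunk = (len - off < beat) ? len - off : beat;

		printf("write %d byte%s%s\n", chunk, chunk == 1 ? "" : "s",
		    (off + beat >= len) ? " (EOP, may be short)" : "");
	}
	return (0);
}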
*/ error = mii_attach(dev, &sc->atse_miibus, ifp, atse_ifmedia_upd, atse_ifmedia_sts, BMSR_DEFCAPMASK, MII_PHY_ANY, MII_OFFSET_ANY, 0); if (error != 0) { device_printf(dev, "attaching PHY failed: %d\n", error); goto err; } /* Call media-indepedent attach routine. */ ether_ifattach(ifp, sc->atse_eth_addr); /* Tell the upper layer(s) about vlan mtu support. */ ifp->if_hdrlen = sizeof(struct ether_vlan_header); ifp->if_capabilities |= IFCAP_VLAN_MTU; ifp->if_capenable = ifp->if_capabilities; err: if (error != 0) { atse_detach(dev); } if (error == 0) { atse_sysctl_stats_attach(dev); } atse_rx_enqueue(sc, NUM_RX_MBUF); xdma_queue_submit(sc->xchan_rx); return (error); } static int atse_detach(device_t dev) { struct atse_softc *sc; struct ifnet *ifp; sc = device_get_softc(dev); KASSERT(mtx_initialized(&sc->atse_mtx), ("%s: mutex not initialized", device_get_nameunit(dev))); ifp = sc->atse_ifp; /* Only cleanup if attach succeeded. */ if (device_is_attached(dev)) { ATSE_LOCK(sc); atse_stop_locked(sc); ATSE_UNLOCK(sc); callout_drain(&sc->atse_tick); ether_ifdetach(ifp); } if (sc->atse_miibus != NULL) { device_delete_child(dev, sc->atse_miibus); } if (ifp != NULL) { if_free(ifp); } mtx_destroy(&sc->atse_mtx); + + xdma_channel_free(sc->xchan_tx); + xdma_channel_free(sc->xchan_rx); + xdma_put(sc->xdma_tx); + xdma_put(sc->xdma_rx); return (0); } /* Shared between nexus and fdt implementation. */ void atse_detach_resources(device_t dev) { struct atse_softc *sc; sc = device_get_softc(dev); if (sc->atse_mem_res != NULL) { bus_release_resource(dev, SYS_RES_MEMORY, sc->atse_mem_rid, sc->atse_mem_res); sc->atse_mem_res = NULL; } } int atse_detach_dev(device_t dev) { int error; error = atse_detach(dev); if (error) { /* We are basically in undefined state now. */ device_printf(dev, "atse_detach() failed: %d\n", error); return (error); } atse_detach_resources(dev); return (0); } int atse_miibus_readreg(device_t dev, int phy, int reg) { struct atse_softc *sc; int val; sc = device_get_softc(dev); /* * We currently do not support re-mapping of MDIO space on-the-fly * but de-facto hard-code the phy#. */ if (phy != sc->atse_phy_addr) { return (0); } val = PHY_READ_2(sc, reg); return (val); } int atse_miibus_writereg(device_t dev, int phy, int reg, int data) { struct atse_softc *sc; sc = device_get_softc(dev); /* * We currently do not support re-mapping of MDIO space on-the-fly * but de-facto hard-code the phy#. */ if (phy != sc->atse_phy_addr) { return (0); } PHY_WRITE_2(sc, reg, data); return (0); } void atse_miibus_statchg(device_t dev) { struct atse_softc *sc; struct mii_data *mii; struct ifnet *ifp; uint32_t val4; sc = device_get_softc(dev); ATSE_LOCK_ASSERT(sc); mii = device_get_softc(sc->atse_miibus); ifp = sc->atse_ifp; if (mii == NULL || ifp == NULL || (ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) { return; } val4 = CSR_READ_4(sc, BASE_CFG_COMMAND_CONFIG); /* Assume no link. 
*/ sc->atse_flags &= ~ATSE_FLAGS_LINK; if ((mii->mii_media_status & (IFM_ACTIVE | IFM_AVALID)) == (IFM_ACTIVE | IFM_AVALID)) { switch (IFM_SUBTYPE(mii->mii_media_active)) { case IFM_10_T: val4 |= BASE_CFG_COMMAND_CONFIG_ENA_10; val4 &= ~BASE_CFG_COMMAND_CONFIG_ETH_SPEED; sc->atse_flags |= ATSE_FLAGS_LINK; break; case IFM_100_TX: val4 &= ~BASE_CFG_COMMAND_CONFIG_ENA_10; val4 &= ~BASE_CFG_COMMAND_CONFIG_ETH_SPEED; sc->atse_flags |= ATSE_FLAGS_LINK; break; case IFM_1000_T: val4 &= ~BASE_CFG_COMMAND_CONFIG_ENA_10; val4 |= BASE_CFG_COMMAND_CONFIG_ETH_SPEED; sc->atse_flags |= ATSE_FLAGS_LINK; break; default: break; } } if ((sc->atse_flags & ATSE_FLAGS_LINK) == 0) { /* Need to stop the MAC? */ return; } if (IFM_OPTIONS(mii->mii_media_active & IFM_FDX) != 0) { val4 &= ~BASE_CFG_COMMAND_CONFIG_HD_ENA; } else { val4 |= BASE_CFG_COMMAND_CONFIG_HD_ENA; } /* flow control? */ /* Make sure the MAC is activated. */ val4 |= BASE_CFG_COMMAND_CONFIG_TX_ENA; val4 |= BASE_CFG_COMMAND_CONFIG_RX_ENA; CSR_WRITE_4(sc, BASE_CFG_COMMAND_CONFIG, val4); } MODULE_DEPEND(atse, ether, 1, 1, 1); MODULE_DEPEND(atse, miibus, 1, 1, 1); Index: user/ngie/bug-237403/sys/dev/altera/softdma/softdma.c =================================================================== --- user/ngie/bug-237403/sys/dev/altera/softdma/softdma.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/altera/softdma/softdma.c (revision 346926) @@ -1,864 +1,888 @@ /*- * Copyright (c) 2017-2018 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237 * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* This is driver for SoftDMA device built using Altera FIFO component. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef FDT #include #include #include #endif #include #include #include "xdma_if.h" #define SOFTDMA_DEBUG #undef SOFTDMA_DEBUG #ifdef SOFTDMA_DEBUG #define dprintf(fmt, ...) printf(fmt, ##__VA_ARGS__) #else #define dprintf(fmt, ...) 
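/*
 * To get the dprintf() tracing in this file, drop the #undef above so that
 * SOFTDMA_DEBUG stays defined.
 */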
#endif #define AVALON_FIFO_TX_BASIC_OPTS_DEPTH 16 #define SOFTDMA_NCHANNELS 1 #define CONTROL_GEN_SOP (1 << 0) #define CONTROL_GEN_EOP (1 << 1) #define CONTROL_OWN (1 << 31) #define SOFTDMA_RX_EVENTS \ (A_ONCHIP_FIFO_MEM_CORE_INTR_FULL | \ A_ONCHIP_FIFO_MEM_CORE_INTR_OVERFLOW | \ A_ONCHIP_FIFO_MEM_CORE_INTR_UNDERFLOW) #define SOFTDMA_TX_EVENTS \ (A_ONCHIP_FIFO_MEM_CORE_INTR_EMPTY | \ A_ONCHIP_FIFO_MEM_CORE_INTR_OVERFLOW | \ A_ONCHIP_FIFO_MEM_CORE_INTR_UNDERFLOW) struct softdma_channel { struct softdma_softc *sc; struct mtx mtx; xdma_channel_t *xchan; struct proc *p; int used; int index; int run; uint32_t idx_tail; uint32_t idx_head; struct softdma_desc *descs; uint32_t descs_num; uint32_t descs_used_count; }; struct softdma_desc { uint64_t src_addr; uint64_t dst_addr; uint32_t len; uint32_t access_width; uint32_t count; uint16_t src_incr; uint16_t dst_incr; uint32_t direction; struct softdma_desc *next; uint32_t transfered; uint32_t status; uint32_t reserved; uint32_t control; }; struct softdma_softc { device_t dev; struct resource *res[3]; bus_space_tag_t bst; bus_space_handle_t bsh; bus_space_tag_t bst_c; bus_space_handle_t bsh_c; void *ih; struct softdma_channel channels[SOFTDMA_NCHANNELS]; }; static struct resource_spec softdma_spec[] = { { SYS_RES_MEMORY, 0, RF_ACTIVE }, /* fifo */ { SYS_RES_MEMORY, 1, RF_ACTIVE }, /* core */ { SYS_RES_IRQ, 0, RF_ACTIVE }, { -1, 0 } }; static int softdma_probe(device_t dev); static int softdma_attach(device_t dev); static int softdma_detach(device_t dev); static inline uint32_t softdma_next_desc(struct softdma_channel *chan, uint32_t curidx) { return ((curidx + 1) % chan->descs_num); } static void softdma_mem_write(struct softdma_softc *sc, uint32_t reg, uint32_t val) { bus_write_4(sc->res[0], reg, htole32(val)); } static uint32_t softdma_mem_read(struct softdma_softc *sc, uint32_t reg) { uint32_t val; val = bus_read_4(sc->res[0], reg); return (le32toh(val)); } static void softdma_memc_write(struct softdma_softc *sc, uint32_t reg, uint32_t val) { bus_write_4(sc->res[1], reg, htole32(val)); } static uint32_t softdma_memc_read(struct softdma_softc *sc, uint32_t reg) { uint32_t val; val = bus_read_4(sc->res[1], reg); return (le32toh(val)); } static uint32_t softdma_fill_level(struct softdma_softc *sc) { uint32_t val; val = softdma_memc_read(sc, A_ONCHIP_FIFO_MEM_CORE_STATUS_REG_FILL_LEVEL); return (val); } +static uint32_t +fifo_fill_level_wait(struct softdma_softc *sc) +{ + uint32_t val; + + do + val = softdma_fill_level(sc); + while (val == AVALON_FIFO_TX_BASIC_OPTS_DEPTH); + + return (val); +} + static void softdma_intr(void *arg) { struct softdma_channel *chan; struct softdma_softc *sc; int reg; int err; sc = arg; chan = &sc->channels[0]; reg = softdma_memc_read(sc, A_ONCHIP_FIFO_MEM_CORE_STATUS_REG_EVENT); if (reg & (A_ONCHIP_FIFO_MEM_CORE_EVENT_OVERFLOW | A_ONCHIP_FIFO_MEM_CORE_EVENT_UNDERFLOW)) { /* Errors */ err = (((reg & A_ONCHIP_FIFO_MEM_CORE_ERROR_MASK) >> \ A_ONCHIP_FIFO_MEM_CORE_ERROR_SHIFT) & 0xff); } if (reg != 0) { softdma_memc_write(sc, A_ONCHIP_FIFO_MEM_CORE_STATUS_REG_EVENT, reg); chan->run = 1; wakeup(chan); } } static int softdma_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "altr,softdma")) return (ENXIO); device_set_desc(dev, "SoftDMA"); return (BUS_PROBE_DEFAULT); } static int softdma_attach(device_t dev) { struct softdma_softc *sc; phandle_t xref, node; int err; sc = device_get_softc(dev); sc->dev = dev; if (bus_alloc_resources(dev, softdma_spec, sc->res)) { 
device_printf(dev, "could not allocate resources for device\n"); return (ENXIO); } /* FIFO memory interface */ sc->bst = rman_get_bustag(sc->res[0]); sc->bsh = rman_get_bushandle(sc->res[0]); /* FIFO control memory interface */ sc->bst_c = rman_get_bustag(sc->res[1]); sc->bsh_c = rman_get_bushandle(sc->res[1]); /* Setup interrupt handler */ err = bus_setup_intr(dev, sc->res[2], INTR_TYPE_MISC | INTR_MPSAFE, NULL, softdma_intr, sc, &sc->ih); if (err) { device_printf(dev, "Unable to alloc interrupt resource.\n"); return (ENXIO); } node = ofw_bus_get_node(dev); xref = OF_xref_from_node(node); OF_device_register_xref(xref, dev); return (0); } static int softdma_detach(device_t dev) { struct softdma_softc *sc; sc = device_get_softc(dev); return (0); } static int softdma_process_tx(struct softdma_channel *chan, struct softdma_desc *desc) { struct softdma_softc *sc; - uint32_t src_offs, dst_offs; + uint64_t addr; + uint64_t buf; + uint32_t word; + uint32_t missing; uint32_t reg; - uint32_t fill_level; - uint32_t leftm; - uint32_t tmp; - uint32_t val; - uint32_t c; + int got_bits; + int len; sc = chan->sc; - fill_level = softdma_fill_level(sc); - while (fill_level == AVALON_FIFO_TX_BASIC_OPTS_DEPTH) - fill_level = softdma_fill_level(sc); + fifo_fill_level_wait(sc); /* Set start of packet. */ - if (desc->control & CONTROL_GEN_SOP) { - reg = 0; - reg |= A_ONCHIP_FIFO_MEM_CORE_SOP; - softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, reg); - } + if (desc->control & CONTROL_GEN_SOP) + softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, + A_ONCHIP_FIFO_MEM_CORE_SOP); - src_offs = dst_offs = 0; - c = 0; - while ((desc->len - c) >= 4) { - val = *(uint32_t *)(desc->src_addr + src_offs); - bus_write_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA, val); - if (desc->src_incr) - src_offs += 4; - if (desc->dst_incr) - dst_offs += 4; - fill_level += 1; + got_bits = 0; + buf = 0; - while (fill_level == AVALON_FIFO_TX_BASIC_OPTS_DEPTH) { - fill_level = softdma_fill_level(sc); - } - c += 4; + addr = desc->src_addr; + len = desc->len; + + if (addr & 1) { + buf = (buf << 8) | *(uint8_t *)addr; + got_bits += 8; + addr += 1; + len -= 1; } - val = 0; - leftm = (desc->len - c); + if (len >= 2 && addr & 2) { + buf = (buf << 16) | *(uint16_t *)addr; + got_bits += 16; + addr += 2; + len -= 2; + } - switch (leftm) { - case 1: - val = *(uint8_t *)(desc->src_addr + src_offs); - val <<= 24; - src_offs += 1; - break; - case 2: - case 3: - val = *(uint16_t *)(desc->src_addr + src_offs); - val <<= 16; - src_offs += 2; + while (len >= 4) { + buf = (buf << 32) | (uint64_t)*(uint32_t *)addr; + addr += 4; + len -= 4; + word = (uint32_t)((buf >> got_bits) & 0xffffffff); - if (leftm == 3) { - tmp = *(uint8_t *)(desc->src_addr + src_offs); - val |= (tmp << 8); - src_offs += 1; - } - break; - case 0: - default: - break; + fifo_fill_level_wait(sc); + if (len == 0 && got_bits == 0 && + (desc->control & CONTROL_GEN_EOP) != 0) + softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, + A_ONCHIP_FIFO_MEM_CORE_EOP); + bus_write_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA, word); } - /* Set end of packet. */ - reg = 0; - if (desc->control & CONTROL_GEN_EOP) - reg |= A_ONCHIP_FIFO_MEM_CORE_EOP; - reg |= ((4 - leftm) << A_ONCHIP_FIFO_MEM_CORE_EMPTY_SHIFT); - softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, reg); + if (len & 2) { + buf = (buf << 16) | *(uint16_t *)addr; + got_bits += 16; + addr += 2; + len -= 2; + } - /* Ensure there is a FIFO entry available. 
*/ - fill_level = softdma_fill_level(sc); - while (fill_level == AVALON_FIFO_TX_BASIC_OPTS_DEPTH) - fill_level = softdma_fill_level(sc); + if (len & 1) { + buf = (buf << 8) | *(uint8_t *)addr; + got_bits += 8; + addr += 1; + len -= 1; + } - /* Final write */ - bus_write_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA, val); + if (got_bits >= 32) { + got_bits -= 32; + word = (uint32_t)((buf >> got_bits) & 0xffffffff); - return (dst_offs); + fifo_fill_level_wait(sc); + if (len == 0 && got_bits == 0 && + (desc->control & CONTROL_GEN_EOP) != 0) + softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, + A_ONCHIP_FIFO_MEM_CORE_EOP); + bus_write_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA, word); + } + + if (got_bits) { + missing = 32 - got_bits; + got_bits /= 8; + + fifo_fill_level_wait(sc); + reg = A_ONCHIP_FIFO_MEM_CORE_EOP | + ((4 - got_bits) << A_ONCHIP_FIFO_MEM_CORE_EMPTY_SHIFT); + softdma_mem_write(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA, reg); + word = (uint32_t)((buf << missing) & 0xffffffff); + bus_write_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA, word); + } + + return (desc->len); } static int softdma_process_rx(struct softdma_channel *chan, struct softdma_desc *desc) { uint32_t src_offs, dst_offs; struct softdma_softc *sc; uint32_t fill_level; uint32_t empty; uint32_t meta; uint32_t data; int sop_rcvd; int timeout; size_t len; int error; sc = chan->sc; empty = 0; src_offs = dst_offs = 0; error = 0; fill_level = softdma_fill_level(sc); if (fill_level == 0) { /* Nothing to receive. */ return (0); } len = desc->len; sop_rcvd = 0; while (fill_level) { empty = 0; data = bus_read_4(sc->res[0], A_ONCHIP_FIFO_MEM_CORE_DATA); meta = softdma_mem_read(sc, A_ONCHIP_FIFO_MEM_CORE_METADATA); if (meta & A_ONCHIP_FIFO_MEM_CORE_ERROR_MASK) { error = 1; break; } if ((meta & A_ONCHIP_FIFO_MEM_CORE_CHANNEL_MASK) != 0) { error = 1; break; } if (meta & A_ONCHIP_FIFO_MEM_CORE_SOP) { sop_rcvd = 1; } if (meta & A_ONCHIP_FIFO_MEM_CORE_EOP) { empty = (meta & A_ONCHIP_FIFO_MEM_CORE_EMPTY_MASK) >> A_ONCHIP_FIFO_MEM_CORE_EMPTY_SHIFT; } if (sop_rcvd == 0) { error = 1; break; } if (empty == 0) { *(uint32_t *)(desc->dst_addr + dst_offs) = data; dst_offs += 4; } else if (empty == 1) { *(uint16_t *)(desc->dst_addr + dst_offs) = ((data >> 16) & 0xffff); dst_offs += 2; *(uint8_t *)(desc->dst_addr + dst_offs) = ((data >> 8) & 0xff); dst_offs += 1; } else { panic("empty %d\n", empty); } if (meta & A_ONCHIP_FIFO_MEM_CORE_EOP) break; fill_level = softdma_fill_level(sc); timeout = 100; while (fill_level == 0 && timeout--) fill_level = softdma_fill_level(sc); if (timeout == 0) { /* No EOP received. Broken packet. */ error = 1; break; } } if (error) { return (-1); } return (dst_offs); } static uint32_t softdma_process_descriptors(struct softdma_channel *chan, xdma_transfer_status_t *status) { struct xdma_channel *xchan; struct softdma_desc *desc; struct softdma_softc *sc; xdma_transfer_status_t st; int ret; sc = chan->sc; xchan = chan->xchan; desc = &chan->descs[chan->idx_tail]; while (desc != NULL) { if ((desc->control & CONTROL_OWN) == 0) { break; } if (desc->direction == XDMA_MEM_TO_DEV) { ret = softdma_process_tx(chan, desc); } else { ret = softdma_process_rx(chan, desc); if (ret == 0) { /* No new data available. */ break; } } /* Descriptor processed. 
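 * Clearing the control word drops the CONTROL_OWN bit, handing the
 * descriptor back to the submitter.  A non-negative return from the
 * per-direction process routine is the byte count reported through
 * xchan_seg_done(); a negative return marks the whole transfer as failed
 * and stops the loop.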
*/ desc->control = 0; if (ret >= 0) { st.error = 0; st.transferred = ret; } else { st.error = ret; st.transferred = 0; } xchan_seg_done(xchan, &st); atomic_subtract_int(&chan->descs_used_count, 1); if (ret >= 0) { status->transferred += ret; } else { status->error = 1; break; } chan->idx_tail = softdma_next_desc(chan, chan->idx_tail); /* Process next descriptor, if any. */ desc = desc->next; } return (0); } static void softdma_worker(void *arg) { xdma_transfer_status_t status; struct softdma_channel *chan; struct softdma_softc *sc; chan = arg; sc = chan->sc; while (1) { mtx_lock(&chan->mtx); do { mtx_sleep(chan, &chan->mtx, 0, "softdma_wait", hz / 2); } while (chan->run == 0); status.error = 0; status.transferred = 0; softdma_process_descriptors(chan, &status); /* Finish operation */ chan->run = 0; xdma_callback(chan->xchan, &status); mtx_unlock(&chan->mtx); } } static int softdma_proc_create(struct softdma_channel *chan) { struct softdma_softc *sc; sc = chan->sc; if (chan->p != NULL) { /* Already created */ return (0); } mtx_init(&chan->mtx, "SoftDMA", NULL, MTX_DEF); if (kproc_create(softdma_worker, (void *)chan, &chan->p, 0, 0, "softdma_worker") != 0) { device_printf(sc->dev, "%s: Failed to create worker thread.\n", __func__); return (-1); } return (0); } static int softdma_channel_alloc(device_t dev, struct xdma_channel *xchan) { struct softdma_channel *chan; struct softdma_softc *sc; int i; sc = device_get_softc(dev); for (i = 0; i < SOFTDMA_NCHANNELS; i++) { chan = &sc->channels[i]; if (chan->used == 0) { chan->xchan = xchan; xchan->chan = (void *)chan; + xchan->caps |= XCHAN_CAP_NOBUFS; + xchan->caps |= XCHAN_CAP_NOSEG; chan->index = i; chan->idx_head = 0; chan->idx_tail = 0; chan->descs_used_count = 0; chan->descs_num = 1024; chan->sc = sc; if (softdma_proc_create(chan) != 0) { return (-1); } chan->used = 1; return (0); } } return (-1); } static int softdma_channel_free(device_t dev, struct xdma_channel *xchan) { struct softdma_channel *chan; struct softdma_softc *sc; sc = device_get_softc(dev); chan = (struct softdma_channel *)xchan->chan; if (chan->descs != NULL) { free(chan->descs, M_DEVBUF); } chan->used = 0; return (0); } static int softdma_desc_alloc(struct xdma_channel *xchan) { struct softdma_channel *chan; uint32_t nsegments; chan = (struct softdma_channel *)xchan->chan; nsegments = chan->descs_num; chan->descs = malloc(nsegments * sizeof(struct softdma_desc), M_DEVBUF, (M_WAITOK | M_ZERO)); return (0); } static int softdma_channel_prep_sg(device_t dev, struct xdma_channel *xchan) { struct softdma_channel *chan; struct softdma_desc *desc; struct softdma_softc *sc; int ret; int i; sc = device_get_softc(dev); chan = (struct softdma_channel *)xchan->chan; ret = softdma_desc_alloc(xchan); if (ret != 0) { device_printf(sc->dev, "%s: Can't allocate descriptors.\n", __func__); return (-1); } for (i = 0; i < chan->descs_num; i++) { desc = &chan->descs[i]; if (i == (chan->descs_num - 1)) { desc->next = &chan->descs[0]; } else { desc->next = &chan->descs[i+1]; } } return (0); } static int softdma_channel_capacity(device_t dev, xdma_channel_t *xchan, uint32_t *capacity) { struct softdma_channel *chan; uint32_t c; chan = (struct softdma_channel *)xchan->chan; /* At least one descriptor must be left empty. 
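 * Reserving one slot presumably keeps idx_head from catching up with
 * idx_tail while descriptors are outstanding, which is why the capacity
 * reported below is descs_num - descs_used_count - 1 rather than the full
 * ring size.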
*/ c = (chan->descs_num - chan->descs_used_count - 1); *capacity = c; return (0); } static int softdma_channel_submit_sg(device_t dev, struct xdma_channel *xchan, struct xdma_sglist *sg, uint32_t sg_n) { struct softdma_channel *chan; struct softdma_desc *desc; struct softdma_softc *sc; uint32_t enqueued; uint32_t saved_dir; uint32_t tmp; uint32_t len; int i; sc = device_get_softc(dev); chan = (struct softdma_channel *)xchan->chan; enqueued = 0; for (i = 0; i < sg_n; i++) { len = (uint32_t)sg[i].len; desc = &chan->descs[chan->idx_head]; desc->src_addr = sg[i].src_addr; desc->dst_addr = sg[i].dst_addr; if (sg[i].direction == XDMA_MEM_TO_DEV) { desc->src_incr = 1; desc->dst_incr = 0; } else { desc->src_incr = 0; desc->dst_incr = 1; } desc->direction = sg[i].direction; saved_dir = sg[i].direction; desc->len = len; desc->transfered = 0; desc->status = 0; desc->reserved = 0; desc->control = 0; if (sg[i].first == 1) desc->control |= CONTROL_GEN_SOP; if (sg[i].last == 1) desc->control |= CONTROL_GEN_EOP; tmp = chan->idx_head; chan->idx_head = softdma_next_desc(chan, chan->idx_head); atomic_add_int(&chan->descs_used_count, 1); desc->control |= CONTROL_OWN; enqueued += 1; } if (enqueued == 0) return (0); if (saved_dir == XDMA_MEM_TO_DEV) { chan->run = 1; wakeup(chan); } else softdma_memc_write(sc, A_ONCHIP_FIFO_MEM_CORE_STATUS_REG_INT_ENABLE, SOFTDMA_RX_EVENTS); return (0); } static int softdma_channel_request(device_t dev, struct xdma_channel *xchan, struct xdma_request *req) { struct softdma_channel *chan; struct softdma_desc *desc; struct softdma_softc *sc; int ret; sc = device_get_softc(dev); chan = (struct softdma_channel *)xchan->chan; ret = softdma_desc_alloc(xchan); if (ret != 0) { device_printf(sc->dev, "%s: Can't allocate descriptors.\n", __func__); return (-1); } desc = &chan->descs[0]; desc->src_addr = req->src_addr; desc->dst_addr = req->dst_addr; desc->len = req->block_len; desc->src_incr = 1; desc->dst_incr = 1; desc->next = NULL; return (0); } static int softdma_channel_control(device_t dev, xdma_channel_t *xchan, int cmd) { struct softdma_channel *chan; struct softdma_softc *sc; sc = device_get_softc(dev); chan = (struct softdma_channel *)xchan->chan; switch (cmd) { case XDMA_CMD_BEGIN: case XDMA_CMD_TERMINATE: case XDMA_CMD_PAUSE: /* TODO: implement me */ return (-1); } return (0); } #ifdef FDT static int softdma_ofw_md_data(device_t dev, pcell_t *cells, int ncells, void **ptr) { return (0); } #endif static device_method_t softdma_methods[] = { /* Device interface */ DEVMETHOD(device_probe, softdma_probe), DEVMETHOD(device_attach, softdma_attach), DEVMETHOD(device_detach, softdma_detach), /* xDMA Interface */ DEVMETHOD(xdma_channel_alloc, softdma_channel_alloc), DEVMETHOD(xdma_channel_free, softdma_channel_free), DEVMETHOD(xdma_channel_request, softdma_channel_request), DEVMETHOD(xdma_channel_control, softdma_channel_control), /* xDMA SG Interface */ DEVMETHOD(xdma_channel_prep_sg, softdma_channel_prep_sg), DEVMETHOD(xdma_channel_submit_sg, softdma_channel_submit_sg), DEVMETHOD(xdma_channel_capacity, softdma_channel_capacity), #ifdef FDT DEVMETHOD(xdma_ofw_md_data, softdma_ofw_md_data), #endif DEVMETHOD_END }; static driver_t softdma_driver = { "softdma", softdma_methods, sizeof(struct softdma_softc), }; static devclass_t softdma_devclass; EARLY_DRIVER_MODULE(softdma, simplebus, softdma_driver, softdma_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_LATE); Index: user/ngie/bug-237403/sys/dev/cadence/if_cgem.c =================================================================== 
--- user/ngie/bug-237403/sys/dev/cadence/if_cgem.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/cadence/if_cgem.c (revision 346926) @@ -1,1868 +1,1874 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2012-2014 Thomas Skibo * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * A network interface driver for Cadence GEM Gigabit Ethernet * interface such as the one used in Xilinx Zynq-7000 SoC. * * Reference: Zynq-7000 All Programmable SoC Technical Reference Manual. * (v1.4) November 16, 2012. Xilinx doc UG585. GEM is covered in Ch. 16 * and register definitions are in appendix B.18. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET #include #include #include #include #endif #include #include #include #include #include #include #include #include #include "miibus_if.h" #define IF_CGEM_NAME "cgem" #define CGEM_NUM_RX_DESCS 512 /* size of receive descriptor ring */ #define CGEM_NUM_TX_DESCS 512 /* size of transmit descriptor ring */ #define MAX_DESC_RING_SIZE (MAX(CGEM_NUM_RX_DESCS*sizeof(struct cgem_rx_desc),\ CGEM_NUM_TX_DESCS*sizeof(struct cgem_tx_desc))) /* Default for sysctl rxbufs. Must be < CGEM_NUM_RX_DESCS of course. */ #define DEFAULT_NUM_RX_BUFS 256 /* number of receive bufs to queue. 
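 * This is the initial value behind the writable "rxbufs" sysctl registered
 * in cgem_add_sysctls(), so it can be tuned per device at run time.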
*/ #define TX_MAX_DMA_SEGS 8 /* maximum segs in a tx mbuf dma */ #define CGEM_CKSUM_ASSIST (CSUM_IP | CSUM_TCP | CSUM_UDP | \ CSUM_TCP_IPV6 | CSUM_UDP_IPV6) +static struct ofw_compat_data compat_data[] = { + { "cadence,gem", 1 }, + { "cdns,macb", 1 }, + { NULL, 0 }, +}; + struct cgem_softc { if_t ifp; struct mtx sc_mtx; device_t dev; device_t miibus; u_int mii_media_active; /* last active media */ int if_old_flags; struct resource *mem_res; struct resource *irq_res; void *intrhand; struct callout tick_ch; uint32_t net_ctl_shadow; int ref_clk_num; u_char eaddr[6]; bus_dma_tag_t desc_dma_tag; bus_dma_tag_t mbuf_dma_tag; /* receive descriptor ring */ struct cgem_rx_desc *rxring; bus_addr_t rxring_physaddr; struct mbuf *rxring_m[CGEM_NUM_RX_DESCS]; bus_dmamap_t rxring_m_dmamap[CGEM_NUM_RX_DESCS]; int rxring_hd_ptr; /* where to put rcv bufs */ int rxring_tl_ptr; /* where to get receives */ int rxring_queued; /* how many rcv bufs queued */ bus_dmamap_t rxring_dma_map; int rxbufs; /* tunable number rcv bufs */ int rxhangwar; /* rx hang work-around */ u_int rxoverruns; /* rx overruns */ u_int rxnobufs; /* rx buf ring empty events */ u_int rxdmamapfails; /* rx dmamap failures */ uint32_t rx_frames_prev; /* transmit descriptor ring */ struct cgem_tx_desc *txring; bus_addr_t txring_physaddr; struct mbuf *txring_m[CGEM_NUM_TX_DESCS]; bus_dmamap_t txring_m_dmamap[CGEM_NUM_TX_DESCS]; int txring_hd_ptr; /* where to put next xmits */ int txring_tl_ptr; /* next xmit mbuf to free */ int txring_queued; /* num xmits segs queued */ bus_dmamap_t txring_dma_map; u_int txfull; /* tx ring full events */ u_int txdefrags; /* tx calls to m_defrag() */ u_int txdefragfails; /* tx m_defrag() failures */ u_int txdmamapfails; /* tx dmamap failures */ /* hardware provided statistics */ struct cgem_hw_stats { uint64_t tx_bytes; uint32_t tx_frames; uint32_t tx_frames_bcast; uint32_t tx_frames_multi; uint32_t tx_frames_pause; uint32_t tx_frames_64b; uint32_t tx_frames_65to127b; uint32_t tx_frames_128to255b; uint32_t tx_frames_256to511b; uint32_t tx_frames_512to1023b; uint32_t tx_frames_1024to1536b; uint32_t tx_under_runs; uint32_t tx_single_collisn; uint32_t tx_multi_collisn; uint32_t tx_excsv_collisn; uint32_t tx_late_collisn; uint32_t tx_deferred_frames; uint32_t tx_carrier_sense_errs; uint64_t rx_bytes; uint32_t rx_frames; uint32_t rx_frames_bcast; uint32_t rx_frames_multi; uint32_t rx_frames_pause; uint32_t rx_frames_64b; uint32_t rx_frames_65to127b; uint32_t rx_frames_128to255b; uint32_t rx_frames_256to511b; uint32_t rx_frames_512to1023b; uint32_t rx_frames_1024to1536b; uint32_t rx_frames_undersize; uint32_t rx_frames_oversize; uint32_t rx_frames_jabber; uint32_t rx_frames_fcs_errs; uint32_t rx_frames_length_errs; uint32_t rx_symbol_errs; uint32_t rx_align_errs; uint32_t rx_resource_errs; uint32_t rx_overrun_errs; uint32_t rx_ip_hdr_csum_errs; uint32_t rx_tcp_csum_errs; uint32_t rx_udp_csum_errs; } stats; }; #define RD4(sc, off) (bus_read_4((sc)->mem_res, (off))) #define WR4(sc, off, val) (bus_write_4((sc)->mem_res, (off), (val))) #define BARRIER(sc, off, len, flags) \ (bus_barrier((sc)->mem_res, (off), (len), (flags)) #define CGEM_LOCK(sc) mtx_lock(&(sc)->sc_mtx) #define CGEM_UNLOCK(sc) mtx_unlock(&(sc)->sc_mtx) #define CGEM_LOCK_INIT(sc) \ mtx_init(&(sc)->sc_mtx, device_get_nameunit((sc)->dev), \ MTX_NETWORK_LOCK, MTX_DEF) #define CGEM_LOCK_DESTROY(sc) mtx_destroy(&(sc)->sc_mtx) #define CGEM_ASSERT_LOCKED(sc) mtx_assert(&(sc)->sc_mtx, MA_OWNED) /* Allow platforms to optionally provide a way to set the reference clock. 
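 * cgem_set_ref_clk() is only a weak symbol here: the stub
 * cgem_default_set_ref_clk() further down returns 0, and a platform that
 * actually controls the reference clock can provide its own strong
 * definition.  cgem_mediachange() calls it whenever the media speed
 * changes.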
*/ int cgem_set_ref_clk(int unit, int frequency); static devclass_t cgem_devclass; static int cgem_probe(device_t dev); static int cgem_attach(device_t dev); static int cgem_detach(device_t dev); static void cgem_tick(void *); static void cgem_intr(void *); static void cgem_mediachange(struct cgem_softc *, struct mii_data *); static void cgem_get_mac(struct cgem_softc *sc, u_char eaddr[]) { int i; uint32_t rnd; /* See if boot loader gave us a MAC address already. */ for (i = 0; i < 4; i++) { uint32_t low = RD4(sc, CGEM_SPEC_ADDR_LOW(i)); uint32_t high = RD4(sc, CGEM_SPEC_ADDR_HI(i)) & 0xffff; if (low != 0 || high != 0) { eaddr[0] = low & 0xff; eaddr[1] = (low >> 8) & 0xff; eaddr[2] = (low >> 16) & 0xff; eaddr[3] = (low >> 24) & 0xff; eaddr[4] = high & 0xff; eaddr[5] = (high >> 8) & 0xff; break; } } /* No MAC from boot loader? Assign a random one. */ if (i == 4) { rnd = arc4random(); eaddr[0] = 'b'; eaddr[1] = 's'; eaddr[2] = 'd'; eaddr[3] = (rnd >> 16) & 0xff; eaddr[4] = (rnd >> 8) & 0xff; eaddr[5] = rnd & 0xff; device_printf(sc->dev, "no mac address found, assigning " "random: %02x:%02x:%02x:%02x:%02x:%02x\n", eaddr[0], eaddr[1], eaddr[2], eaddr[3], eaddr[4], eaddr[5]); } /* Move address to first slot and zero out the rest. */ WR4(sc, CGEM_SPEC_ADDR_LOW(0), (eaddr[3] << 24) | (eaddr[2] << 16) | (eaddr[1] << 8) | eaddr[0]); WR4(sc, CGEM_SPEC_ADDR_HI(0), (eaddr[5] << 8) | eaddr[4]); for (i = 1; i < 4; i++) { WR4(sc, CGEM_SPEC_ADDR_LOW(i), 0); WR4(sc, CGEM_SPEC_ADDR_HI(i), 0); } } /* cgem_mac_hash(): map 48-bit address to a 6-bit hash. * The 6-bit hash corresponds to a bit in a 64-bit hash * register. Setting that bit in the hash register enables * reception of all frames with a destination address that hashes * to that 6-bit value. * * The hash function is described in sec. 16.2.3 in the Zynq-7000 Tech * Reference Manual. Bits 0-5 in the hash are the exclusive-or of * every sixth bit in the destination address. */ static int cgem_mac_hash(u_char eaddr[]) { int hash; int i, j; hash = 0; for (i = 0; i < 6; i++) for (j = i; j < 48; j += 6) if ((eaddr[j >> 3] & (1 << (j & 7))) != 0) hash ^= (1 << i); return hash; } /* After any change in rx flags or multi-cast addresses, set up * hash registers and net config register bits. */ static void cgem_rx_filter(struct cgem_softc *sc) { if_t ifp = sc->ifp; u_char *mta; int index, i, mcnt; uint32_t hash_hi, hash_lo; uint32_t net_cfg; hash_hi = 0; hash_lo = 0; net_cfg = RD4(sc, CGEM_NET_CFG); net_cfg &= ~(CGEM_NET_CFG_MULTI_HASH_EN | CGEM_NET_CFG_NO_BCAST | CGEM_NET_CFG_COPY_ALL); if ((if_getflags(ifp) & IFF_PROMISC) != 0) net_cfg |= CGEM_NET_CFG_COPY_ALL; else { if ((if_getflags(ifp) & IFF_BROADCAST) == 0) net_cfg |= CGEM_NET_CFG_NO_BCAST; if ((if_getflags(ifp) & IFF_ALLMULTI) != 0) { hash_hi = 0xffffffff; hash_lo = 0xffffffff; } else { mcnt = if_multiaddr_count(ifp, -1); mta = malloc(ETHER_ADDR_LEN * mcnt, M_DEVBUF, M_NOWAIT); if (mta == NULL) { device_printf(sc->dev, "failed to allocate temp mcast list\n"); return; } if_multiaddr_array(ifp, mta, &mcnt, mcnt); for (i = 0; i < mcnt; i++) { index = cgem_mac_hash( LLADDR((struct sockaddr_dl *) (mta + (i * ETHER_ADDR_LEN)))); if (index > 31) hash_hi |= (1 << (index - 32)); else hash_lo |= (1 << index); } free(mta, M_DEVBUF); } if (hash_hi != 0 || hash_lo != 0) net_cfg |= CGEM_NET_CFG_MULTI_HASH_EN; } WR4(sc, CGEM_HASH_TOP, hash_hi); WR4(sc, CGEM_HASH_BOT, hash_lo); WR4(sc, CGEM_NET_CFG, net_cfg); } /* For bus_dmamap_load() callback. 
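 * cgem_getaddr() simply copies the single segment's bus address into the
 * bus_addr_t the caller passed as its argument; it is used when loading
 * the RX and TX descriptor rings.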
*/ static void cgem_getaddr(void *arg, bus_dma_segment_t *segs, int nsegs, int error) { if (nsegs != 1 || error != 0) return; *(bus_addr_t *)arg = segs[0].ds_addr; } /* Create DMA'able descriptor rings. */ static int cgem_setup_descs(struct cgem_softc *sc) { int i, err; sc->txring = NULL; sc->rxring = NULL; /* Allocate non-cached DMA space for RX and TX descriptors. */ err = bus_dma_tag_create(bus_get_dma_tag(sc->dev), 1, 0, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, MAX_DESC_RING_SIZE, 1, MAX_DESC_RING_SIZE, 0, busdma_lock_mutex, &sc->sc_mtx, &sc->desc_dma_tag); if (err) return (err); /* Set up a bus_dma_tag for mbufs. */ err = bus_dma_tag_create(bus_get_dma_tag(sc->dev), 1, 0, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, MCLBYTES, TX_MAX_DMA_SEGS, MCLBYTES, 0, busdma_lock_mutex, &sc->sc_mtx, &sc->mbuf_dma_tag); if (err) return (err); /* Allocate DMA memory in non-cacheable space. */ err = bus_dmamem_alloc(sc->desc_dma_tag, (void **)&sc->rxring, BUS_DMA_NOWAIT | BUS_DMA_COHERENT, &sc->rxring_dma_map); if (err) return (err); /* Load descriptor DMA memory. */ err = bus_dmamap_load(sc->desc_dma_tag, sc->rxring_dma_map, (void *)sc->rxring, CGEM_NUM_RX_DESCS*sizeof(struct cgem_rx_desc), cgem_getaddr, &sc->rxring_physaddr, BUS_DMA_NOWAIT); if (err) return (err); /* Initialize RX descriptors. */ for (i = 0; i < CGEM_NUM_RX_DESCS; i++) { sc->rxring[i].addr = CGEM_RXDESC_OWN; sc->rxring[i].ctl = 0; sc->rxring_m[i] = NULL; sc->rxring_m_dmamap[i] = NULL; } sc->rxring[CGEM_NUM_RX_DESCS - 1].addr |= CGEM_RXDESC_WRAP; sc->rxring_hd_ptr = 0; sc->rxring_tl_ptr = 0; sc->rxring_queued = 0; /* Allocate DMA memory for TX descriptors in non-cacheable space. */ err = bus_dmamem_alloc(sc->desc_dma_tag, (void **)&sc->txring, BUS_DMA_NOWAIT | BUS_DMA_COHERENT, &sc->txring_dma_map); if (err) return (err); /* Load TX descriptor DMA memory. */ err = bus_dmamap_load(sc->desc_dma_tag, sc->txring_dma_map, (void *)sc->txring, CGEM_NUM_TX_DESCS*sizeof(struct cgem_tx_desc), cgem_getaddr, &sc->txring_physaddr, BUS_DMA_NOWAIT); if (err) return (err); /* Initialize TX descriptor ring. */ for (i = 0; i < CGEM_NUM_TX_DESCS; i++) { sc->txring[i].addr = 0; sc->txring[i].ctl = CGEM_TXDESC_USED; sc->txring_m[i] = NULL; sc->txring_m_dmamap[i] = NULL; } sc->txring[CGEM_NUM_TX_DESCS - 1].ctl |= CGEM_TXDESC_WRAP; sc->txring_hd_ptr = 0; sc->txring_tl_ptr = 0; sc->txring_queued = 0; return (0); } /* Fill receive descriptor ring with mbufs. */ static void cgem_fill_rqueue(struct cgem_softc *sc) { struct mbuf *m = NULL; bus_dma_segment_t segs[TX_MAX_DMA_SEGS]; int nsegs; CGEM_ASSERT_LOCKED(sc); while (sc->rxring_queued < sc->rxbufs) { /* Get a cluster mbuf. */ m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR); if (m == NULL) break; m->m_len = MCLBYTES; m->m_pkthdr.len = MCLBYTES; m->m_pkthdr.rcvif = sc->ifp; /* Load map and plug in physical address. */ if (bus_dmamap_create(sc->mbuf_dma_tag, 0, &sc->rxring_m_dmamap[sc->rxring_hd_ptr])) { sc->rxdmamapfails++; m_free(m); break; } if (bus_dmamap_load_mbuf_sg(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_hd_ptr], m, segs, &nsegs, BUS_DMA_NOWAIT)) { sc->rxdmamapfails++; bus_dmamap_destroy(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_hd_ptr]); sc->rxring_m_dmamap[sc->rxring_hd_ptr] = NULL; m_free(m); break; } sc->rxring_m[sc->rxring_hd_ptr] = m; /* Sync cache with receive buffer. */ bus_dmamap_sync(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_hd_ptr], BUS_DMASYNC_PREREAD); /* Write rx descriptor and increment head pointer. 
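 * The last slot in the ring gets CGEM_RXDESC_WRAP or'ed into its address
 * word and the head pointer is reset to zero; otherwise the address is
 * written as-is and the pointer simply advances.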
*/ sc->rxring[sc->rxring_hd_ptr].ctl = 0; if (sc->rxring_hd_ptr == CGEM_NUM_RX_DESCS - 1) { sc->rxring[sc->rxring_hd_ptr].addr = segs[0].ds_addr | CGEM_RXDESC_WRAP; sc->rxring_hd_ptr = 0; } else sc->rxring[sc->rxring_hd_ptr++].addr = segs[0].ds_addr; sc->rxring_queued++; } } /* Pull received packets off of receive descriptor ring. */ static void cgem_recv(struct cgem_softc *sc) { if_t ifp = sc->ifp; struct mbuf *m, *m_hd, **m_tl; uint32_t ctl; CGEM_ASSERT_LOCKED(sc); /* Pick up all packets in which the OWN bit is set. */ m_hd = NULL; m_tl = &m_hd; while (sc->rxring_queued > 0 && (sc->rxring[sc->rxring_tl_ptr].addr & CGEM_RXDESC_OWN) != 0) { ctl = sc->rxring[sc->rxring_tl_ptr].ctl; /* Grab filled mbuf. */ m = sc->rxring_m[sc->rxring_tl_ptr]; sc->rxring_m[sc->rxring_tl_ptr] = NULL; /* Sync cache with receive buffer. */ bus_dmamap_sync(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_tl_ptr], BUS_DMASYNC_POSTREAD); /* Unload and destroy dmamap. */ bus_dmamap_unload(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_tl_ptr]); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->rxring_m_dmamap[sc->rxring_tl_ptr]); sc->rxring_m_dmamap[sc->rxring_tl_ptr] = NULL; /* Increment tail pointer. */ if (++sc->rxring_tl_ptr == CGEM_NUM_RX_DESCS) sc->rxring_tl_ptr = 0; sc->rxring_queued--; /* Check FCS and make sure entire packet landed in one mbuf * cluster (which is much bigger than the largest ethernet * packet). */ if ((ctl & CGEM_RXDESC_BAD_FCS) != 0 || (ctl & (CGEM_RXDESC_SOF | CGEM_RXDESC_EOF)) != (CGEM_RXDESC_SOF | CGEM_RXDESC_EOF)) { /* discard. */ m_free(m); if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); continue; } /* Ready it to hand off to upper layers. */ m->m_data += ETHER_ALIGN; m->m_len = (ctl & CGEM_RXDESC_LENGTH_MASK); m->m_pkthdr.rcvif = ifp; m->m_pkthdr.len = m->m_len; /* Are we using hardware checksumming? Check the * status in the receive descriptor. */ if ((if_getcapenable(ifp) & IFCAP_RXCSUM) != 0) { /* TCP or UDP checks out, IP checks out too. */ if ((ctl & CGEM_RXDESC_CKSUM_STAT_MASK) == CGEM_RXDESC_CKSUM_STAT_TCP_GOOD || (ctl & CGEM_RXDESC_CKSUM_STAT_MASK) == CGEM_RXDESC_CKSUM_STAT_UDP_GOOD) { m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } else if ((ctl & CGEM_RXDESC_CKSUM_STAT_MASK) == CGEM_RXDESC_CKSUM_STAT_IP_GOOD) { /* Only IP checks out. */ m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED | CSUM_IP_VALID; m->m_pkthdr.csum_data = 0xffff; } } /* Queue it up for delivery below. */ *m_tl = m; m_tl = &m->m_next; } /* Replenish receive buffers. */ cgem_fill_rqueue(sc); /* Unlock and send up packets. */ CGEM_UNLOCK(sc); while (m_hd != NULL) { m = m_hd; m_hd = m_hd->m_next; m->m_next = NULL; if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); if_input(ifp, m); } CGEM_LOCK(sc); } /* Find completed transmits and free their mbufs. */ static void cgem_clean_tx(struct cgem_softc *sc) { struct mbuf *m; uint32_t ctl; CGEM_ASSERT_LOCKED(sc); /* free up finished transmits. */ while (sc->txring_queued > 0 && ((ctl = sc->txring[sc->txring_tl_ptr].ctl) & CGEM_TXDESC_USED) != 0) { /* Sync cache. */ bus_dmamap_sync(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_tl_ptr], BUS_DMASYNC_POSTWRITE); /* Unload and destroy DMA map. */ bus_dmamap_unload(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_tl_ptr]); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_tl_ptr]); sc->txring_m_dmamap[sc->txring_tl_ptr] = NULL; /* Free up the mbuf. 
*/ m = sc->txring_m[sc->txring_tl_ptr]; sc->txring_m[sc->txring_tl_ptr] = NULL; m_freem(m); /* Check the status. */ if ((ctl & CGEM_TXDESC_AHB_ERR) != 0) { /* Serious bus error. log to console. */ device_printf(sc->dev, "cgem_clean_tx: Whoa! " "AHB error, addr=0x%x\n", sc->txring[sc->txring_tl_ptr].addr); } else if ((ctl & (CGEM_TXDESC_RETRY_ERR | CGEM_TXDESC_LATE_COLL)) != 0) { if_inc_counter(sc->ifp, IFCOUNTER_OERRORS, 1); } else if_inc_counter(sc->ifp, IFCOUNTER_OPACKETS, 1); /* If the packet spanned more than one tx descriptor, * skip descriptors until we find the end so that only * start-of-frame descriptors are processed. */ while ((ctl & CGEM_TXDESC_LAST_BUF) == 0) { if ((ctl & CGEM_TXDESC_WRAP) != 0) sc->txring_tl_ptr = 0; else sc->txring_tl_ptr++; sc->txring_queued--; ctl = sc->txring[sc->txring_tl_ptr].ctl; sc->txring[sc->txring_tl_ptr].ctl = ctl | CGEM_TXDESC_USED; } /* Next descriptor. */ if ((ctl & CGEM_TXDESC_WRAP) != 0) sc->txring_tl_ptr = 0; else sc->txring_tl_ptr++; sc->txring_queued--; if_setdrvflagbits(sc->ifp, 0, IFF_DRV_OACTIVE); } } /* Start transmits. */ static void cgem_start_locked(if_t ifp) { struct cgem_softc *sc = (struct cgem_softc *) if_getsoftc(ifp); struct mbuf *m; bus_dma_segment_t segs[TX_MAX_DMA_SEGS]; uint32_t ctl; int i, nsegs, wrap, err; CGEM_ASSERT_LOCKED(sc); if ((if_getdrvflags(ifp) & IFF_DRV_OACTIVE) != 0) return; for (;;) { /* Check that there is room in the descriptor ring. */ if (sc->txring_queued >= CGEM_NUM_TX_DESCS - TX_MAX_DMA_SEGS * 2) { /* Try to make room. */ cgem_clean_tx(sc); /* Still no room? */ if (sc->txring_queued >= CGEM_NUM_TX_DESCS - TX_MAX_DMA_SEGS * 2) { if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, 0); sc->txfull++; break; } } /* Grab next transmit packet. */ m = if_dequeue(ifp); if (m == NULL) break; /* Create and load DMA map. */ if (bus_dmamap_create(sc->mbuf_dma_tag, 0, &sc->txring_m_dmamap[sc->txring_hd_ptr])) { m_freem(m); sc->txdmamapfails++; continue; } err = bus_dmamap_load_mbuf_sg(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_hd_ptr], m, segs, &nsegs, BUS_DMA_NOWAIT); if (err == EFBIG) { /* Too many segments! defrag and try again. */ struct mbuf *m2 = m_defrag(m, M_NOWAIT); if (m2 == NULL) { sc->txdefragfails++; m_freem(m); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_hd_ptr]); sc->txring_m_dmamap[sc->txring_hd_ptr] = NULL; continue; } m = m2; err = bus_dmamap_load_mbuf_sg(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_hd_ptr], m, segs, &nsegs, BUS_DMA_NOWAIT); sc->txdefrags++; } if (err) { /* Give up. */ m_freem(m); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_hd_ptr]); sc->txring_m_dmamap[sc->txring_hd_ptr] = NULL; sc->txdmamapfails++; continue; } sc->txring_m[sc->txring_hd_ptr] = m; /* Sync tx buffer with cache. */ bus_dmamap_sync(sc->mbuf_dma_tag, sc->txring_m_dmamap[sc->txring_hd_ptr], BUS_DMASYNC_PREWRITE); /* Set wrap flag if next packet might run off end of ring. */ wrap = sc->txring_hd_ptr + nsegs + TX_MAX_DMA_SEGS >= CGEM_NUM_TX_DESCS; /* Fill in the TX descriptors back to front so that USED * bit in first descriptor is cleared last. */ for (i = nsegs - 1; i >= 0; i--) { /* Descriptor address. */ sc->txring[sc->txring_hd_ptr + i].addr = segs[i].ds_addr; /* Descriptor control word. 
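 * ctl starts out as the segment length; the last segment also gets
 * CGEM_TXDESC_LAST_BUF, plus CGEM_TXDESC_WRAP when the ring is about to
 * wrap.  Writing ctl clears the USED bit, and because the segments are
 * filled back to front the first descriptor is handed to the hardware
 * last.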
*/ ctl = segs[i].ds_len; if (i == nsegs - 1) { ctl |= CGEM_TXDESC_LAST_BUF; if (wrap) ctl |= CGEM_TXDESC_WRAP; } sc->txring[sc->txring_hd_ptr + i].ctl = ctl; if (i != 0) sc->txring_m[sc->txring_hd_ptr + i] = NULL; } if (wrap) sc->txring_hd_ptr = 0; else sc->txring_hd_ptr += nsegs; sc->txring_queued += nsegs; /* Kick the transmitter. */ WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow | CGEM_NET_CTRL_START_TX); /* If there is a BPF listener, bounce a copy to him. */ ETHER_BPF_MTAP(ifp, m); } } static void cgem_start(if_t ifp) { struct cgem_softc *sc = (struct cgem_softc *) if_getsoftc(ifp); CGEM_LOCK(sc); cgem_start_locked(ifp); CGEM_UNLOCK(sc); } static void cgem_poll_hw_stats(struct cgem_softc *sc) { uint32_t n; CGEM_ASSERT_LOCKED(sc); sc->stats.tx_bytes += RD4(sc, CGEM_OCTETS_TX_BOT); sc->stats.tx_bytes += (uint64_t)RD4(sc, CGEM_OCTETS_TX_TOP) << 32; sc->stats.tx_frames += RD4(sc, CGEM_FRAMES_TX); sc->stats.tx_frames_bcast += RD4(sc, CGEM_BCAST_FRAMES_TX); sc->stats.tx_frames_multi += RD4(sc, CGEM_MULTI_FRAMES_TX); sc->stats.tx_frames_pause += RD4(sc, CGEM_PAUSE_FRAMES_TX); sc->stats.tx_frames_64b += RD4(sc, CGEM_FRAMES_64B_TX); sc->stats.tx_frames_65to127b += RD4(sc, CGEM_FRAMES_65_127B_TX); sc->stats.tx_frames_128to255b += RD4(sc, CGEM_FRAMES_128_255B_TX); sc->stats.tx_frames_256to511b += RD4(sc, CGEM_FRAMES_256_511B_TX); sc->stats.tx_frames_512to1023b += RD4(sc, CGEM_FRAMES_512_1023B_TX); sc->stats.tx_frames_1024to1536b += RD4(sc, CGEM_FRAMES_1024_1518B_TX); sc->stats.tx_under_runs += RD4(sc, CGEM_TX_UNDERRUNS); n = RD4(sc, CGEM_SINGLE_COLL_FRAMES); sc->stats.tx_single_collisn += n; if_inc_counter(sc->ifp, IFCOUNTER_COLLISIONS, n); n = RD4(sc, CGEM_MULTI_COLL_FRAMES); sc->stats.tx_multi_collisn += n; if_inc_counter(sc->ifp, IFCOUNTER_COLLISIONS, n); n = RD4(sc, CGEM_EXCESSIVE_COLL_FRAMES); sc->stats.tx_excsv_collisn += n; if_inc_counter(sc->ifp, IFCOUNTER_COLLISIONS, n); n = RD4(sc, CGEM_LATE_COLL); sc->stats.tx_late_collisn += n; if_inc_counter(sc->ifp, IFCOUNTER_COLLISIONS, n); sc->stats.tx_deferred_frames += RD4(sc, CGEM_DEFERRED_TX_FRAMES); sc->stats.tx_carrier_sense_errs += RD4(sc, CGEM_CARRIER_SENSE_ERRS); sc->stats.rx_bytes += RD4(sc, CGEM_OCTETS_RX_BOT); sc->stats.rx_bytes += (uint64_t)RD4(sc, CGEM_OCTETS_RX_TOP) << 32; sc->stats.rx_frames += RD4(sc, CGEM_FRAMES_RX); sc->stats.rx_frames_bcast += RD4(sc, CGEM_BCAST_FRAMES_RX); sc->stats.rx_frames_multi += RD4(sc, CGEM_MULTI_FRAMES_RX); sc->stats.rx_frames_pause += RD4(sc, CGEM_PAUSE_FRAMES_RX); sc->stats.rx_frames_64b += RD4(sc, CGEM_FRAMES_64B_RX); sc->stats.rx_frames_65to127b += RD4(sc, CGEM_FRAMES_65_127B_RX); sc->stats.rx_frames_128to255b += RD4(sc, CGEM_FRAMES_128_255B_RX); sc->stats.rx_frames_256to511b += RD4(sc, CGEM_FRAMES_256_511B_RX); sc->stats.rx_frames_512to1023b += RD4(sc, CGEM_FRAMES_512_1023B_RX); sc->stats.rx_frames_1024to1536b += RD4(sc, CGEM_FRAMES_1024_1518B_RX); sc->stats.rx_frames_undersize += RD4(sc, CGEM_UNDERSZ_RX); sc->stats.rx_frames_oversize += RD4(sc, CGEM_OVERSZ_RX); sc->stats.rx_frames_jabber += RD4(sc, CGEM_JABBERS_RX); sc->stats.rx_frames_fcs_errs += RD4(sc, CGEM_FCS_ERRS); sc->stats.rx_frames_length_errs += RD4(sc, CGEM_LENGTH_FIELD_ERRS); sc->stats.rx_symbol_errs += RD4(sc, CGEM_RX_SYMBOL_ERRS); sc->stats.rx_align_errs += RD4(sc, CGEM_ALIGN_ERRS); sc->stats.rx_resource_errs += RD4(sc, CGEM_RX_RESOURCE_ERRS); sc->stats.rx_overrun_errs += RD4(sc, CGEM_RX_OVERRUN_ERRS); sc->stats.rx_ip_hdr_csum_errs += RD4(sc, CGEM_IP_HDR_CKSUM_ERRS); sc->stats.rx_tcp_csum_errs += RD4(sc, CGEM_TCP_CKSUM_ERRS); 
sc->stats.rx_udp_csum_errs += RD4(sc, CGEM_UDP_CKSUM_ERRS); } static void cgem_tick(void *arg) { struct cgem_softc *sc = (struct cgem_softc *)arg; struct mii_data *mii; CGEM_ASSERT_LOCKED(sc); /* Poll the phy. */ if (sc->miibus != NULL) { mii = device_get_softc(sc->miibus); mii_tick(mii); } /* Poll statistics registers. */ cgem_poll_hw_stats(sc); /* Check for receiver hang. */ if (sc->rxhangwar && sc->rx_frames_prev == sc->stats.rx_frames) { /* * Reset receiver logic by toggling RX_EN bit. 1usec * delay is necessary especially when operating at 100mbps * and 10mbps speeds. */ WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow & ~CGEM_NET_CTRL_RX_EN); DELAY(1); WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow); } sc->rx_frames_prev = sc->stats.rx_frames; /* Next callout in one second. */ callout_reset(&sc->tick_ch, hz, cgem_tick, sc); } /* Interrupt handler. */ static void cgem_intr(void *arg) { struct cgem_softc *sc = (struct cgem_softc *)arg; if_t ifp = sc->ifp; uint32_t istatus; CGEM_LOCK(sc); if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) == 0) { CGEM_UNLOCK(sc); return; } /* Read interrupt status and immediately clear the bits. */ istatus = RD4(sc, CGEM_INTR_STAT); WR4(sc, CGEM_INTR_STAT, istatus); /* Packets received. */ if ((istatus & CGEM_INTR_RX_COMPLETE) != 0) cgem_recv(sc); /* Free up any completed transmit buffers. */ cgem_clean_tx(sc); /* Hresp not ok. Something is very bad with DMA. Try to clear. */ if ((istatus & CGEM_INTR_HRESP_NOT_OK) != 0) { device_printf(sc->dev, "cgem_intr: hresp not okay! " "rx_status=0x%x\n", RD4(sc, CGEM_RX_STAT)); WR4(sc, CGEM_RX_STAT, CGEM_RX_STAT_HRESP_NOT_OK); } /* Receiver overrun. */ if ((istatus & CGEM_INTR_RX_OVERRUN) != 0) { /* Clear status bit. */ WR4(sc, CGEM_RX_STAT, CGEM_RX_STAT_OVERRUN); sc->rxoverruns++; } /* Receiver ran out of bufs. */ if ((istatus & CGEM_INTR_RX_USED_READ) != 0) { WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow | CGEM_NET_CTRL_FLUSH_DPRAM_PKT); cgem_fill_rqueue(sc); sc->rxnobufs++; } /* Restart transmitter if needed. */ if (!if_sendq_empty(ifp)) cgem_start_locked(ifp); CGEM_UNLOCK(sc); } /* Reset hardware. */ static void cgem_reset(struct cgem_softc *sc) { CGEM_ASSERT_LOCKED(sc); WR4(sc, CGEM_NET_CTRL, 0); WR4(sc, CGEM_NET_CFG, 0); WR4(sc, CGEM_NET_CTRL, CGEM_NET_CTRL_CLR_STAT_REGS); WR4(sc, CGEM_TX_STAT, CGEM_TX_STAT_ALL); WR4(sc, CGEM_RX_STAT, CGEM_RX_STAT_ALL); WR4(sc, CGEM_INTR_DIS, CGEM_INTR_ALL); WR4(sc, CGEM_HASH_BOT, 0); WR4(sc, CGEM_HASH_TOP, 0); WR4(sc, CGEM_TX_QBAR, 0); /* manual says do this. */ WR4(sc, CGEM_RX_QBAR, 0); /* Get management port running even if interface is down. */ WR4(sc, CGEM_NET_CFG, CGEM_NET_CFG_DBUS_WIDTH_32 | CGEM_NET_CFG_MDC_CLK_DIV_64); sc->net_ctl_shadow = CGEM_NET_CTRL_MGMT_PORT_EN; WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow); } /* Bring up the hardware. */ static void cgem_config(struct cgem_softc *sc) { if_t ifp = sc->ifp; uint32_t net_cfg; uint32_t dma_cfg; u_char *eaddr = if_getlladdr(ifp); CGEM_ASSERT_LOCKED(sc); /* Program Net Config Register. */ net_cfg = CGEM_NET_CFG_DBUS_WIDTH_32 | CGEM_NET_CFG_MDC_CLK_DIV_64 | CGEM_NET_CFG_FCS_REMOVE | CGEM_NET_CFG_RX_BUF_OFFSET(ETHER_ALIGN) | CGEM_NET_CFG_GIGE_EN | CGEM_NET_CFG_1536RXEN | CGEM_NET_CFG_FULL_DUPLEX | CGEM_NET_CFG_SPEED100; /* Enable receive checksum offloading? */ if ((if_getcapenable(ifp) & IFCAP_RXCSUM) != 0) net_cfg |= CGEM_NET_CFG_RX_CHKSUM_OFFLD_EN; WR4(sc, CGEM_NET_CFG, net_cfg); /* Program DMA Config Register. 
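 * Besides the MCLBYTES receive buffer size, this picks the 8K RX packet
 * buffer size, the TX packet buffer size select, a fixed AHB burst length
 * of 16 and CGEM_DMA_CFG_DISC_WHEN_NO_AHB; checksum generation offload is
 * only added when IFCAP_TXCSUM is enabled.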
*/ dma_cfg = CGEM_DMA_CFG_RX_BUF_SIZE(MCLBYTES) | CGEM_DMA_CFG_RX_PKTBUF_MEMSZ_SEL_8K | CGEM_DMA_CFG_TX_PKTBUF_MEMSZ_SEL | CGEM_DMA_CFG_AHB_FIXED_BURST_LEN_16 | CGEM_DMA_CFG_DISC_WHEN_NO_AHB; /* Enable transmit checksum offloading? */ if ((if_getcapenable(ifp) & IFCAP_TXCSUM) != 0) dma_cfg |= CGEM_DMA_CFG_CHKSUM_GEN_OFFLOAD_EN; WR4(sc, CGEM_DMA_CFG, dma_cfg); /* Write the rx and tx descriptor ring addresses to the QBAR regs. */ WR4(sc, CGEM_RX_QBAR, (uint32_t) sc->rxring_physaddr); WR4(sc, CGEM_TX_QBAR, (uint32_t) sc->txring_physaddr); /* Enable rx and tx. */ sc->net_ctl_shadow |= (CGEM_NET_CTRL_TX_EN | CGEM_NET_CTRL_RX_EN); WR4(sc, CGEM_NET_CTRL, sc->net_ctl_shadow); /* Set receive address in case it changed. */ WR4(sc, CGEM_SPEC_ADDR_LOW(0), (eaddr[3] << 24) | (eaddr[2] << 16) | (eaddr[1] << 8) | eaddr[0]); WR4(sc, CGEM_SPEC_ADDR_HI(0), (eaddr[5] << 8) | eaddr[4]); /* Set up interrupts. */ WR4(sc, CGEM_INTR_EN, CGEM_INTR_RX_COMPLETE | CGEM_INTR_RX_OVERRUN | CGEM_INTR_TX_USED_READ | CGEM_INTR_RX_USED_READ | CGEM_INTR_HRESP_NOT_OK); } /* Turn on interface and load up receive ring with buffers. */ static void cgem_init_locked(struct cgem_softc *sc) { struct mii_data *mii; CGEM_ASSERT_LOCKED(sc); if ((if_getdrvflags(sc->ifp) & IFF_DRV_RUNNING) != 0) return; cgem_config(sc); cgem_fill_rqueue(sc); if_setdrvflagbits(sc->ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE); mii = device_get_softc(sc->miibus); mii_mediachg(mii); callout_reset(&sc->tick_ch, hz, cgem_tick, sc); } static void cgem_init(void *arg) { struct cgem_softc *sc = (struct cgem_softc *)arg; CGEM_LOCK(sc); cgem_init_locked(sc); CGEM_UNLOCK(sc); } /* Turn off interface. Free up any buffers in transmit or receive queues. */ static void cgem_stop(struct cgem_softc *sc) { int i; CGEM_ASSERT_LOCKED(sc); callout_stop(&sc->tick_ch); /* Shut down hardware. */ cgem_reset(sc); /* Clear out transmit queue. */ for (i = 0; i < CGEM_NUM_TX_DESCS; i++) { sc->txring[i].ctl = CGEM_TXDESC_USED; sc->txring[i].addr = 0; if (sc->txring_m[i]) { /* Unload and destroy dmamap. */ bus_dmamap_unload(sc->mbuf_dma_tag, sc->txring_m_dmamap[i]); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->txring_m_dmamap[i]); sc->txring_m_dmamap[i] = NULL; m_freem(sc->txring_m[i]); sc->txring_m[i] = NULL; } } sc->txring[CGEM_NUM_TX_DESCS - 1].ctl |= CGEM_TXDESC_WRAP; sc->txring_hd_ptr = 0; sc->txring_tl_ptr = 0; sc->txring_queued = 0; /* Clear out receive queue. */ for (i = 0; i < CGEM_NUM_RX_DESCS; i++) { sc->rxring[i].addr = CGEM_RXDESC_OWN; sc->rxring[i].ctl = 0; if (sc->rxring_m[i]) { /* Unload and destroy dmamap. */ bus_dmamap_unload(sc->mbuf_dma_tag, sc->rxring_m_dmamap[i]); bus_dmamap_destroy(sc->mbuf_dma_tag, sc->rxring_m_dmamap[i]); sc->rxring_m_dmamap[i] = NULL; m_freem(sc->rxring_m[i]); sc->rxring_m[i] = NULL; } } sc->rxring[CGEM_NUM_RX_DESCS - 1].addr |= CGEM_RXDESC_WRAP; sc->rxring_hd_ptr = 0; sc->rxring_tl_ptr = 0; sc->rxring_queued = 0; /* Force next statchg or linkchg to program net config register. 
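 * cgem_miibus_statchg() and cgem_miibus_linkchg() only call
 * cgem_mediachange() when mii_media_active differs from the current MII
 * state, so zeroing it here guarantees the next link event reprograms the
 * speed and duplex bits.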
*/ sc->mii_media_active = 0; } static int cgem_ioctl(if_t ifp, u_long cmd, caddr_t data) { struct cgem_softc *sc = if_getsoftc(ifp); struct ifreq *ifr = (struct ifreq *)data; struct mii_data *mii; int error = 0, mask; switch (cmd) { case SIOCSIFFLAGS: CGEM_LOCK(sc); if ((if_getflags(ifp) & IFF_UP) != 0) { if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) != 0) { if (((if_getflags(ifp) ^ sc->if_old_flags) & (IFF_PROMISC | IFF_ALLMULTI)) != 0) { cgem_rx_filter(sc); } } else { cgem_init_locked(sc); } } else if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) != 0) { if_setdrvflagbits(ifp, 0, IFF_DRV_RUNNING); cgem_stop(sc); } sc->if_old_flags = if_getflags(ifp); CGEM_UNLOCK(sc); break; case SIOCADDMULTI: case SIOCDELMULTI: /* Set up multi-cast filters. */ if ((if_getdrvflags(ifp) & IFF_DRV_RUNNING) != 0) { CGEM_LOCK(sc); cgem_rx_filter(sc); CGEM_UNLOCK(sc); } break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: mii = device_get_softc(sc->miibus); error = ifmedia_ioctl(ifp, ifr, &mii->mii_media, cmd); break; case SIOCSIFCAP: CGEM_LOCK(sc); mask = if_getcapenable(ifp) ^ ifr->ifr_reqcap; if ((mask & IFCAP_TXCSUM) != 0) { if ((ifr->ifr_reqcap & IFCAP_TXCSUM) != 0) { /* Turn on TX checksumming. */ if_setcapenablebit(ifp, IFCAP_TXCSUM | IFCAP_TXCSUM_IPV6, 0); if_sethwassistbits(ifp, CGEM_CKSUM_ASSIST, 0); WR4(sc, CGEM_DMA_CFG, RD4(sc, CGEM_DMA_CFG) | CGEM_DMA_CFG_CHKSUM_GEN_OFFLOAD_EN); } else { /* Turn off TX checksumming. */ if_setcapenablebit(ifp, 0, IFCAP_TXCSUM | IFCAP_TXCSUM_IPV6); if_sethwassistbits(ifp, 0, CGEM_CKSUM_ASSIST); WR4(sc, CGEM_DMA_CFG, RD4(sc, CGEM_DMA_CFG) & ~CGEM_DMA_CFG_CHKSUM_GEN_OFFLOAD_EN); } } if ((mask & IFCAP_RXCSUM) != 0) { if ((ifr->ifr_reqcap & IFCAP_RXCSUM) != 0) { /* Turn on RX checksumming. */ if_setcapenablebit(ifp, IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6, 0); WR4(sc, CGEM_NET_CFG, RD4(sc, CGEM_NET_CFG) | CGEM_NET_CFG_RX_CHKSUM_OFFLD_EN); } else { /* Turn off RX checksumming. */ if_setcapenablebit(ifp, 0, IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6); WR4(sc, CGEM_NET_CFG, RD4(sc, CGEM_NET_CFG) & ~CGEM_NET_CFG_RX_CHKSUM_OFFLD_EN); } } if ((if_getcapenable(ifp) & (IFCAP_RXCSUM | IFCAP_TXCSUM)) == (IFCAP_RXCSUM | IFCAP_TXCSUM)) if_setcapenablebit(ifp, IFCAP_VLAN_HWCSUM, 0); else if_setcapenablebit(ifp, 0, IFCAP_VLAN_HWCSUM); CGEM_UNLOCK(sc); break; default: error = ether_ioctl(ifp, cmd, data); break; } return (error); } /* MII bus support routines. 
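 * The readreg/writereg methods below poll CGEM_NET_STAT for the
 * PHY_MGMT_IDLE bit, giving up after roughly a millisecond (200 iterations
 * with a 5 microsecond delay) and returning -1 on timeout.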
*/ static void cgem_child_detached(device_t dev, device_t child) { struct cgem_softc *sc = device_get_softc(dev); if (child == sc->miibus) sc->miibus = NULL; } static int cgem_ifmedia_upd(if_t ifp) { struct cgem_softc *sc = (struct cgem_softc *) if_getsoftc(ifp); struct mii_data *mii; struct mii_softc *miisc; int error = 0; mii = device_get_softc(sc->miibus); CGEM_LOCK(sc); if ((if_getflags(ifp) & IFF_UP) != 0) { LIST_FOREACH(miisc, &mii->mii_phys, mii_list) PHY_RESET(miisc); error = mii_mediachg(mii); } CGEM_UNLOCK(sc); return (error); } static void cgem_ifmedia_sts(if_t ifp, struct ifmediareq *ifmr) { struct cgem_softc *sc = (struct cgem_softc *) if_getsoftc(ifp); struct mii_data *mii; mii = device_get_softc(sc->miibus); CGEM_LOCK(sc); mii_pollstat(mii); ifmr->ifm_active = mii->mii_media_active; ifmr->ifm_status = mii->mii_media_status; CGEM_UNLOCK(sc); } static int cgem_miibus_readreg(device_t dev, int phy, int reg) { struct cgem_softc *sc = device_get_softc(dev); int tries, val; WR4(sc, CGEM_PHY_MAINT, CGEM_PHY_MAINT_CLAUSE_22 | CGEM_PHY_MAINT_MUST_10 | CGEM_PHY_MAINT_OP_READ | (phy << CGEM_PHY_MAINT_PHY_ADDR_SHIFT) | (reg << CGEM_PHY_MAINT_REG_ADDR_SHIFT)); /* Wait for completion. */ tries=0; while ((RD4(sc, CGEM_NET_STAT) & CGEM_NET_STAT_PHY_MGMT_IDLE) == 0) { DELAY(5); if (++tries > 200) { device_printf(dev, "phy read timeout: %d\n", reg); return (-1); } } val = RD4(sc, CGEM_PHY_MAINT) & CGEM_PHY_MAINT_DATA_MASK; if (reg == MII_EXTSR) /* * MAC does not support half-duplex at gig speeds. * Let mii(4) exclude the capability. */ val &= ~(EXTSR_1000XHDX | EXTSR_1000THDX); return (val); } static int cgem_miibus_writereg(device_t dev, int phy, int reg, int data) { struct cgem_softc *sc = device_get_softc(dev); int tries; WR4(sc, CGEM_PHY_MAINT, CGEM_PHY_MAINT_CLAUSE_22 | CGEM_PHY_MAINT_MUST_10 | CGEM_PHY_MAINT_OP_WRITE | (phy << CGEM_PHY_MAINT_PHY_ADDR_SHIFT) | (reg << CGEM_PHY_MAINT_REG_ADDR_SHIFT) | (data & CGEM_PHY_MAINT_DATA_MASK)); /* Wait for completion. */ tries = 0; while ((RD4(sc, CGEM_NET_STAT) & CGEM_NET_STAT_PHY_MGMT_IDLE) == 0) { DELAY(5); if (++tries > 200) { device_printf(dev, "phy write timeout: %d\n", reg); return (-1); } } return (0); } static void cgem_miibus_statchg(device_t dev) { struct cgem_softc *sc = device_get_softc(dev); struct mii_data *mii = device_get_softc(sc->miibus); CGEM_ASSERT_LOCKED(sc); if ((mii->mii_media_status & (IFM_ACTIVE | IFM_AVALID)) == (IFM_ACTIVE | IFM_AVALID) && sc->mii_media_active != mii->mii_media_active) cgem_mediachange(sc, mii); } static void cgem_miibus_linkchg(device_t dev) { struct cgem_softc *sc = device_get_softc(dev); struct mii_data *mii = device_get_softc(sc->miibus); CGEM_ASSERT_LOCKED(sc); if ((mii->mii_media_status & (IFM_ACTIVE | IFM_AVALID)) == (IFM_ACTIVE | IFM_AVALID) && sc->mii_media_active != mii->mii_media_active) cgem_mediachange(sc, mii); } /* * Overridable weak symbol cgem_set_ref_clk(). This allows platforms to * provide a function to set the cgem's reference clock. */ static int __used cgem_default_set_ref_clk(int unit, int frequency) { return 0; } __weak_reference(cgem_default_set_ref_clk, cgem_set_ref_clk); /* Call to set reference clock and network config bits according to media. */ static void cgem_mediachange(struct cgem_softc *sc, struct mii_data *mii) { uint32_t net_cfg; int ref_clk_freq; CGEM_ASSERT_LOCKED(sc); /* Update hardware to reflect media. 
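 * The reference clock is requested at 125 MHz for 1000BASE-T, 25 MHz for
 * 100BASE-TX and 2.5 MHz otherwise, matching the ref_clk_freq values
 * chosen in the switch below.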
*/ net_cfg = RD4(sc, CGEM_NET_CFG); net_cfg &= ~(CGEM_NET_CFG_SPEED100 | CGEM_NET_CFG_GIGE_EN | CGEM_NET_CFG_FULL_DUPLEX); switch (IFM_SUBTYPE(mii->mii_media_active)) { case IFM_1000_T: net_cfg |= (CGEM_NET_CFG_SPEED100 | CGEM_NET_CFG_GIGE_EN); ref_clk_freq = 125000000; break; case IFM_100_TX: net_cfg |= CGEM_NET_CFG_SPEED100; ref_clk_freq = 25000000; break; default: ref_clk_freq = 2500000; } if ((mii->mii_media_active & IFM_FDX) != 0) net_cfg |= CGEM_NET_CFG_FULL_DUPLEX; WR4(sc, CGEM_NET_CFG, net_cfg); /* Set the reference clock if necessary. */ if (cgem_set_ref_clk(sc->ref_clk_num, ref_clk_freq)) device_printf(sc->dev, "cgem_mediachange: " "could not set ref clk%d to %d.\n", sc->ref_clk_num, ref_clk_freq); sc->mii_media_active = mii->mii_media_active; } static void cgem_add_sysctls(device_t dev) { struct cgem_softc *sc = device_get_softc(dev); struct sysctl_ctx_list *ctx; struct sysctl_oid_list *child; struct sysctl_oid *tree; ctx = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rxbufs", CTLFLAG_RW, &sc->rxbufs, 0, "Number receive buffers to provide"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rxhangwar", CTLFLAG_RW, &sc->rxhangwar, 0, "Enable receive hang work-around"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_rxoverruns", CTLFLAG_RD, &sc->rxoverruns, 0, "Receive overrun events"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_rxnobufs", CTLFLAG_RD, &sc->rxnobufs, 0, "Receive buf queue empty events"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_rxdmamapfails", CTLFLAG_RD, &sc->rxdmamapfails, 0, "Receive DMA map failures"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_txfull", CTLFLAG_RD, &sc->txfull, 0, "Transmit ring full events"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_txdmamapfails", CTLFLAG_RD, &sc->txdmamapfails, 0, "Transmit DMA map failures"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_txdefrags", CTLFLAG_RD, &sc->txdefrags, 0, "Transmit m_defrag() calls"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "_txdefragfails", CTLFLAG_RD, &sc->txdefragfails, 0, "Transmit m_defrag() failures"); tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "stats", CTLFLAG_RD, NULL, "GEM statistics"); child = SYSCTL_CHILDREN(tree); SYSCTL_ADD_UQUAD(ctx, child, OID_AUTO, "tx_bytes", CTLFLAG_RD, &sc->stats.tx_bytes, "Total bytes transmitted"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames", CTLFLAG_RD, &sc->stats.tx_frames, 0, "Total frames transmitted"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_bcast", CTLFLAG_RD, &sc->stats.tx_frames_bcast, 0, "Number broadcast frames transmitted"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_multi", CTLFLAG_RD, &sc->stats.tx_frames_multi, 0, "Number multicast frames transmitted"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_pause", CTLFLAG_RD, &sc->stats.tx_frames_pause, 0, "Number pause frames transmitted"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_64b", CTLFLAG_RD, &sc->stats.tx_frames_64b, 0, "Number frames transmitted of size 64 bytes or less"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_65to127b", CTLFLAG_RD, &sc->stats.tx_frames_65to127b, 0, "Number frames transmitted of size 65-127 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_128to255b", CTLFLAG_RD, &sc->stats.tx_frames_128to255b, 0, "Number frames transmitted of size 128-255 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_256to511b", CTLFLAG_RD, &sc->stats.tx_frames_256to511b, 0, "Number frames transmitted of size 256-511 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_512to1023b", CTLFLAG_RD, 
&sc->stats.tx_frames_512to1023b, 0, "Number frames transmitted of size 512-1023 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_frames_1024to1536b", CTLFLAG_RD, &sc->stats.tx_frames_1024to1536b, 0, "Number frames transmitted of size 1024-1536 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_under_runs", CTLFLAG_RD, &sc->stats.tx_under_runs, 0, "Number transmit under-run events"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_single_collisn", CTLFLAG_RD, &sc->stats.tx_single_collisn, 0, "Number single-collision transmit frames"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_multi_collisn", CTLFLAG_RD, &sc->stats.tx_multi_collisn, 0, "Number multi-collision transmit frames"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_excsv_collisn", CTLFLAG_RD, &sc->stats.tx_excsv_collisn, 0, "Number excessive collision transmit frames"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_late_collisn", CTLFLAG_RD, &sc->stats.tx_late_collisn, 0, "Number late-collision transmit frames"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_deferred_frames", CTLFLAG_RD, &sc->stats.tx_deferred_frames, 0, "Number deferred transmit frames"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "tx_carrier_sense_errs", CTLFLAG_RD, &sc->stats.tx_carrier_sense_errs, 0, "Number carrier sense errors on transmit"); SYSCTL_ADD_UQUAD(ctx, child, OID_AUTO, "rx_bytes", CTLFLAG_RD, &sc->stats.rx_bytes, "Total bytes received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames", CTLFLAG_RD, &sc->stats.rx_frames, 0, "Total frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_bcast", CTLFLAG_RD, &sc->stats.rx_frames_bcast, 0, "Number broadcast frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_multi", CTLFLAG_RD, &sc->stats.rx_frames_multi, 0, "Number multicast frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_pause", CTLFLAG_RD, &sc->stats.rx_frames_pause, 0, "Number pause frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_64b", CTLFLAG_RD, &sc->stats.rx_frames_64b, 0, "Number frames received of size 64 bytes or less"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_65to127b", CTLFLAG_RD, &sc->stats.rx_frames_65to127b, 0, "Number frames received of size 65-127 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_128to255b", CTLFLAG_RD, &sc->stats.rx_frames_128to255b, 0, "Number frames received of size 128-255 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_256to511b", CTLFLAG_RD, &sc->stats.rx_frames_256to511b, 0, "Number frames received of size 256-511 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_512to1023b", CTLFLAG_RD, &sc->stats.rx_frames_512to1023b, 0, "Number frames received of size 512-1023 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_1024to1536b", CTLFLAG_RD, &sc->stats.rx_frames_1024to1536b, 0, "Number frames received of size 1024-1536 bytes"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_undersize", CTLFLAG_RD, &sc->stats.rx_frames_undersize, 0, "Number undersize frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_oversize", CTLFLAG_RD, &sc->stats.rx_frames_oversize, 0, "Number oversize frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_jabber", CTLFLAG_RD, &sc->stats.rx_frames_jabber, 0, "Number jabber frames received"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_fcs_errs", CTLFLAG_RD, &sc->stats.rx_frames_fcs_errs, 0, "Number frames received with FCS errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_length_errs", CTLFLAG_RD, &sc->stats.rx_frames_length_errs, 0, "Number frames received with 
length errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_symbol_errs", CTLFLAG_RD, &sc->stats.rx_symbol_errs, 0, "Number receive symbol errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_align_errs", CTLFLAG_RD, &sc->stats.rx_align_errs, 0, "Number receive alignment errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_resource_errs", CTLFLAG_RD, &sc->stats.rx_resource_errs, 0, "Number frames received when no rx buffer available"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_overrun_errs", CTLFLAG_RD, &sc->stats.rx_overrun_errs, 0, "Number frames received but not copied due to " "receive overrun"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_ip_hdr_csum_errs", CTLFLAG_RD, &sc->stats.rx_ip_hdr_csum_errs, 0, "Number frames received with IP header checksum " "errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_tcp_csum_errs", CTLFLAG_RD, &sc->stats.rx_tcp_csum_errs, 0, "Number frames received with TCP checksum errors"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "rx_frames_udp_csum_errs", CTLFLAG_RD, &sc->stats.rx_udp_csum_errs, 0, "Number frames received with UDP checksum errors"); } static int cgem_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); - if (!ofw_bus_is_compatible(dev, "cadence,gem")) + if (ofw_bus_search_compatible(dev, compat_data)->ocd_data == 0) return (ENXIO); device_set_desc(dev, "Cadence CGEM Gigabit Ethernet Interface"); return (0); } static int cgem_attach(device_t dev) { struct cgem_softc *sc = device_get_softc(dev); if_t ifp = NULL; phandle_t node; pcell_t cell; int rid, err; u_char eaddr[ETHER_ADDR_LEN]; sc->dev = dev; CGEM_LOCK_INIT(sc); /* Get reference clock number and base divider from fdt. */ node = ofw_bus_get_node(dev); sc->ref_clk_num = 0; if (OF_getprop(node, "ref-clock-num", &cell, sizeof(cell)) > 0) sc->ref_clk_num = fdt32_to_cpu(cell); /* Get memory resource. */ rid = 0; sc->mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (sc->mem_res == NULL) { device_printf(dev, "could not allocate memory resources.\n"); return (ENOMEM); } /* Get IRQ resource. */ rid = 0; sc->irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_ACTIVE); if (sc->irq_res == NULL) { device_printf(dev, "could not allocate interrupt resource.\n"); cgem_detach(dev); return (ENOMEM); } /* Set up ifnet structure. */ ifp = sc->ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "could not allocate ifnet structure\n"); cgem_detach(dev); return (ENOMEM); } if_setsoftc(ifp, sc); if_initname(ifp, IF_CGEM_NAME, device_get_unit(dev)); if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST); if_setinitfn(ifp, cgem_init); if_setioctlfn(ifp, cgem_ioctl); if_setstartfn(ifp, cgem_start); if_setcapabilitiesbit(ifp, IFCAP_HWCSUM | IFCAP_HWCSUM_IPV6 | IFCAP_VLAN_MTU | IFCAP_VLAN_HWCSUM, 0); if_setsendqlen(ifp, CGEM_NUM_TX_DESCS); if_setsendqready(ifp); /* Disable hardware checksumming by default. */ if_sethwassist(ifp, 0); if_setcapenable(ifp, if_getcapabilities(ifp) & ~(IFCAP_HWCSUM | IFCAP_HWCSUM_IPV6 | IFCAP_VLAN_HWCSUM)); sc->if_old_flags = if_getflags(ifp); sc->rxbufs = DEFAULT_NUM_RX_BUFS; sc->rxhangwar = 1; /* Reset hardware. */ CGEM_LOCK(sc); cgem_reset(sc); CGEM_UNLOCK(sc); /* Attach phy to mii bus. */ err = mii_attach(dev, &sc->miibus, ifp, cgem_ifmedia_upd, cgem_ifmedia_sts, BMSR_DEFCAPMASK, MII_PHY_ANY, MII_OFFSET_ANY, 0); if (err) { device_printf(dev, "attaching PHYs failed\n"); cgem_detach(dev); return (err); } /* Set up TX and RX descriptor area. 
*/ err = cgem_setup_descs(sc); if (err) { device_printf(dev, "could not set up dma mem for descs.\n"); cgem_detach(dev); return (ENOMEM); } /* Get a MAC address. */ cgem_get_mac(sc, eaddr); /* Start ticks. */ callout_init_mtx(&sc->tick_ch, &sc->sc_mtx, 0); ether_ifattach(ifp, eaddr); err = bus_setup_intr(dev, sc->irq_res, INTR_TYPE_NET | INTR_MPSAFE | INTR_EXCL, NULL, cgem_intr, sc, &sc->intrhand); if (err) { device_printf(dev, "could not set interrupt handler.\n"); ether_ifdetach(ifp); cgem_detach(dev); return (err); } cgem_add_sysctls(dev); return (0); } static int cgem_detach(device_t dev) { struct cgem_softc *sc = device_get_softc(dev); int i; if (sc == NULL) return (ENODEV); if (device_is_attached(dev)) { CGEM_LOCK(sc); cgem_stop(sc); CGEM_UNLOCK(sc); callout_drain(&sc->tick_ch); if_setflagbits(sc->ifp, 0, IFF_UP); ether_ifdetach(sc->ifp); } if (sc->miibus != NULL) { device_delete_child(dev, sc->miibus); sc->miibus = NULL; } /* Release resources. */ if (sc->mem_res != NULL) { bus_release_resource(dev, SYS_RES_MEMORY, rman_get_rid(sc->mem_res), sc->mem_res); sc->mem_res = NULL; } if (sc->irq_res != NULL) { if (sc->intrhand) bus_teardown_intr(dev, sc->irq_res, sc->intrhand); bus_release_resource(dev, SYS_RES_IRQ, rman_get_rid(sc->irq_res), sc->irq_res); sc->irq_res = NULL; } /* Release DMA resources. */ if (sc->rxring != NULL) { if (sc->rxring_physaddr != 0) { bus_dmamap_unload(sc->desc_dma_tag, sc->rxring_dma_map); sc->rxring_physaddr = 0; } bus_dmamem_free(sc->desc_dma_tag, sc->rxring, sc->rxring_dma_map); sc->rxring = NULL; for (i = 0; i < CGEM_NUM_RX_DESCS; i++) if (sc->rxring_m_dmamap[i] != NULL) { bus_dmamap_destroy(sc->mbuf_dma_tag, sc->rxring_m_dmamap[i]); sc->rxring_m_dmamap[i] = NULL; } } if (sc->txring != NULL) { if (sc->txring_physaddr != 0) { bus_dmamap_unload(sc->desc_dma_tag, sc->txring_dma_map); sc->txring_physaddr = 0; } bus_dmamem_free(sc->desc_dma_tag, sc->txring, sc->txring_dma_map); sc->txring = NULL; for (i = 0; i < CGEM_NUM_TX_DESCS; i++) if (sc->txring_m_dmamap[i] != NULL) { bus_dmamap_destroy(sc->mbuf_dma_tag, sc->txring_m_dmamap[i]); sc->txring_m_dmamap[i] = NULL; } } if (sc->desc_dma_tag != NULL) { bus_dma_tag_destroy(sc->desc_dma_tag); sc->desc_dma_tag = NULL; } if (sc->mbuf_dma_tag != NULL) { bus_dma_tag_destroy(sc->mbuf_dma_tag); sc->mbuf_dma_tag = NULL; } bus_generic_detach(dev); CGEM_LOCK_DESTROY(sc); return (0); } static device_method_t cgem_methods[] = { /* Device interface */ DEVMETHOD(device_probe, cgem_probe), DEVMETHOD(device_attach, cgem_attach), DEVMETHOD(device_detach, cgem_detach), /* Bus interface */ DEVMETHOD(bus_child_detached, cgem_child_detached), /* MII interface */ DEVMETHOD(miibus_readreg, cgem_miibus_readreg), DEVMETHOD(miibus_writereg, cgem_miibus_writereg), DEVMETHOD(miibus_statchg, cgem_miibus_statchg), DEVMETHOD(miibus_linkchg, cgem_miibus_linkchg), DEVMETHOD_END }; static driver_t cgem_driver = { "cgem", cgem_methods, sizeof(struct cgem_softc), }; DRIVER_MODULE(cgem, simplebus, cgem_driver, cgem_devclass, NULL, NULL); DRIVER_MODULE(miibus, cgem, miibus_driver, miibus_devclass, NULL, NULL); MODULE_DEPEND(cgem, miibus, 1, 1, 1); MODULE_DEPEND(cgem, ether, 1, 1, 1); Index: user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.c =================================================================== --- user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.c (revision 346926) @@ -1,2388 +1,2952 @@ /*- * Copyright (c) 2017 Chelsio Communications, Inc. 
* All rights reserved. * Written by: John Baldwin * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include "cryptodev_if.h" #include "common/common.h" #include "crypto/t4_crypto.h" /* * Requests consist of: * * +-------------------------------+ * | struct fw_crypto_lookaside_wr | * +-------------------------------+ * | struct ulp_txpkt | * +-------------------------------+ * | struct ulptx_idata | * +-------------------------------+ * | struct cpl_tx_sec_pdu | * +-------------------------------+ * | struct cpl_tls_tx_scmd_fmt | * +-------------------------------+ * | key context header | * +-------------------------------+ * | AES key | ----- For requests with AES * +-------------------------------+ * | Hash state | ----- For hash-only requests * +-------------------------------+ - * | IPAD (16-byte aligned) | \ * +-------------------------------+ +---- For requests with HMAC * | OPAD (16-byte aligned) | / * +-------------------------------+ - * | GMAC H | ----- For AES-GCM * +-------------------------------+ - * | struct cpl_rx_phys_dsgl | \ * +-------------------------------+ +---- Destination buffer for * | PHYS_DSGL entries | / non-hash-only requests * +-------------------------------+ - * | 16 dummy bytes | ----- Only for HMAC/hash-only requests * +-------------------------------+ * | IV | ----- If immediate IV * +-------------------------------+ * | Payload | ----- If immediate Payload * +-------------------------------+ - * | struct ulptx_sgl | \ * +-------------------------------+ +---- If payload via SGL * | SGL entries | / * +-------------------------------+ - * * Note that the key context must be padded to ensure 16-byte alignment. * For HMAC requests, the key consists of the partial hash of the IPAD * followed by the partial hash of the OPAD. * * Replies consist of: * * +-------------------------------+ * | struct cpl_fw6_pld | * +-------------------------------+ * | hash digest | ----- For HMAC request with * +-------------------------------+ 'hash_size' set in work request * * A 32-bit big-endian error status word is supplied in the last 4 * bytes of data[0] in the CPL_FW6_PLD message. bit 0 indicates a * "MAC" error and bit 1 indicates a "PAD" error. 
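 * (The PAD bit is what ccr_authenc_done() below tests via
 * CHK_PAD_ERR_BIT() when deciding whether a failed MAC compare can
 * be recovered by copying the hash from the reply into the buffer.)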
* * The 64-bit 'cookie' field from the fw_crypto_lookaside_wr message * in the request is returned in data[1] of the CPL_FW6_PLD message. * * For block cipher replies, the updated IV is supplied in data[2] and * data[3] of the CPL_FW6_PLD message. * * For hash replies where the work request set 'hash_size' to request * a copy of the hash in the reply, the hash digest is supplied * immediately following the CPL_FW6_PLD message. */ /* * The crypto engine supports a maximum AAD size of 511 bytes. */ #define MAX_AAD_LEN 511 /* * The documentation for CPL_RX_PHYS_DSGL claims a maximum of 32 SG * entries. While the CPL includes a 16-bit length field, the T6 can * sometimes hang if an error occurs while processing a request with a * single DSGL entry larger than 2k. */ #define MAX_RX_PHYS_DSGL_SGE 32 #define DSGL_SGE_MAXLEN 2048 /* * The adapter only supports requests with a total input or output * length of 64k-1 or smaller. Longer requests either result in hung * requests or incorrect results. */ #define MAX_REQUEST_SIZE 65535 static MALLOC_DEFINE(M_CCR, "ccr", "Chelsio T6 crypto"); struct ccr_session_hmac { struct auth_hash *auth_hash; int hash_len; unsigned int partial_digest_len; unsigned int auth_mode; unsigned int mk_size; char ipad[CHCR_HASH_MAX_BLOCK_SIZE_128]; char opad[CHCR_HASH_MAX_BLOCK_SIZE_128]; }; struct ccr_session_gmac { int hash_len; char ghash_h[GMAC_BLOCK_LEN]; }; +struct ccr_session_ccm_mac { + int hash_len; +}; + struct ccr_session_blkcipher { unsigned int cipher_mode; unsigned int key_len; unsigned int iv_len; __be32 key_ctx_hdr; char enckey[CHCR_AES_MAX_KEY_LEN]; char deckey[CHCR_AES_MAX_KEY_LEN]; }; struct ccr_session { bool active; int pending; - enum { HASH, HMAC, BLKCIPHER, AUTHENC, GCM } mode; + enum { HASH, HMAC, BLKCIPHER, AUTHENC, GCM, CCM } mode; union { struct ccr_session_hmac hmac; struct ccr_session_gmac gmac; + struct ccr_session_ccm_mac ccm_mac; }; struct ccr_session_blkcipher blkcipher; }; struct ccr_softc { struct adapter *adapter; device_t dev; uint32_t cid; int tx_channel_id; struct mtx lock; bool detaching; struct sge_wrq *txq; struct sge_rxq *rxq; /* * Pre-allocate S/G lists used when preparing a work request. * 'sg_crp' contains an sglist describing the entire buffer * for a 'struct cryptop'. 'sg_ulptx' is used to describe * the data the engine should DMA as input via ULPTX_SGL. * 'sg_dsgl' is used to describe the destination that cipher * text and a tag should be written to. */ struct sglist *sg_crp; struct sglist *sg_ulptx; struct sglist *sg_dsgl; /* * Pre-allocate a dummy output buffer for the IV and AAD for * AEAD requests. */ char *iv_aad_buf; struct sglist *sg_iv_aad; /* Statistics. */ uint64_t stats_blkcipher_encrypt; uint64_t stats_blkcipher_decrypt; uint64_t stats_hash; uint64_t stats_hmac; uint64_t stats_authenc_encrypt; uint64_t stats_authenc_decrypt; uint64_t stats_gcm_encrypt; uint64_t stats_gcm_decrypt; + uint64_t stats_ccm_encrypt; + uint64_t stats_ccm_decrypt; uint64_t stats_wr_nomem; uint64_t stats_inflight; uint64_t stats_mac_error; uint64_t stats_pad_error; uint64_t stats_bad_session; uint64_t stats_sglist_error; uint64_t stats_process_error; uint64_t stats_sw_fallback; }; /* * Crypto requests involve two kind of scatter/gather lists. * * Non-hash-only requests require a PHYS_DSGL that describes the * location to store the results of the encryption or decryption * operation. This SGL uses a different format (PHYS_DSGL) and should * exclude the crd_skip bytes at the start of the data as well as * any AAD or IV. 
For authenticated encryption requests it should * cover include the destination of the hash or tag. * * The input payload may either be supplied inline as immediate data, * or via a standard ULP_TX SGL. This SGL should include AAD, * ciphertext, and the hash or tag for authenticated decryption * requests. * * These scatter/gather lists can describe different subsets of the * buffer described by the crypto operation. ccr_populate_sglist() * generates a scatter/gather list that covers the entire crypto * operation buffer that is then used to construct the other * scatter/gather lists. */ static int ccr_populate_sglist(struct sglist *sg, struct cryptop *crp) { int error; sglist_reset(sg); if (crp->crp_flags & CRYPTO_F_IMBUF) error = sglist_append_mbuf(sg, (struct mbuf *)crp->crp_buf); else if (crp->crp_flags & CRYPTO_F_IOV) error = sglist_append_uio(sg, (struct uio *)crp->crp_buf); else error = sglist_append(sg, crp->crp_buf, crp->crp_ilen); return (error); } /* * Segments in 'sg' larger than 'maxsegsize' are counted as multiple * segments. */ static int ccr_count_sgl(struct sglist *sg, int maxsegsize) { int i, nsegs; nsegs = 0; for (i = 0; i < sg->sg_nseg; i++) nsegs += howmany(sg->sg_segs[i].ss_len, maxsegsize); return (nsegs); } /* These functions deal with PHYS_DSGL for the reply buffer. */ static inline int ccr_phys_dsgl_len(int nsegs) { int len; len = (nsegs / 8) * sizeof(struct phys_sge_pairs); if ((nsegs % 8) != 0) { len += sizeof(uint16_t) * 8; len += roundup2(nsegs % 8, 2) * sizeof(uint64_t); } return (len); } static void ccr_write_phys_dsgl(struct ccr_softc *sc, void *dst, int nsegs) { struct sglist *sg; struct cpl_rx_phys_dsgl *cpl; struct phys_sge_pairs *sgl; vm_paddr_t paddr; size_t seglen; u_int i, j; sg = sc->sg_dsgl; cpl = dst; cpl->op_to_tid = htobe32(V_CPL_RX_PHYS_DSGL_OPCODE(CPL_RX_PHYS_DSGL) | V_CPL_RX_PHYS_DSGL_ISRDMA(0)); cpl->pcirlxorder_to_noofsgentr = htobe32( V_CPL_RX_PHYS_DSGL_PCIRLXORDER(0) | V_CPL_RX_PHYS_DSGL_PCINOSNOOP(0) | V_CPL_RX_PHYS_DSGL_PCITPHNTENB(0) | V_CPL_RX_PHYS_DSGL_DCAID(0) | V_CPL_RX_PHYS_DSGL_NOOFSGENTR(nsegs)); cpl->rss_hdr_int.opcode = CPL_RX_PHYS_ADDR; cpl->rss_hdr_int.qid = htobe16(sc->rxq->iq.abs_id); cpl->rss_hdr_int.hash_val = 0; sgl = (struct phys_sge_pairs *)(cpl + 1); j = 0; for (i = 0; i < sg->sg_nseg; i++) { seglen = sg->sg_segs[i].ss_len; paddr = sg->sg_segs[i].ss_paddr; do { sgl->addr[j] = htobe64(paddr); if (seglen > DSGL_SGE_MAXLEN) { sgl->len[j] = htobe16(DSGL_SGE_MAXLEN); paddr += DSGL_SGE_MAXLEN; seglen -= DSGL_SGE_MAXLEN; } else { sgl->len[j] = htobe16(seglen); seglen = 0; } j++; if (j == 8) { sgl++; j = 0; } } while (seglen != 0); } MPASS(j + 8 * (sgl - (struct phys_sge_pairs *)(cpl + 1)) == nsegs); } /* These functions deal with the ULPTX_SGL for input payload. 
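 * The first segment rides in the ulptx_sgl header itself
 * (len0/addr0); each further pair of segments takes one 24-byte
 * ulptx_sge_pair, an odd leftover segment takes a padded half pair,
 * and the total is rounded up to a 16-byte multiple.  That is the
 * arithmetic ccr_ulptx_sgl_len() below encodes.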
*/ static inline int ccr_ulptx_sgl_len(int nsegs) { u_int n; nsegs--; /* first segment is part of ulptx_sgl */ n = sizeof(struct ulptx_sgl) + 8 * ((3 * nsegs) / 2 + (nsegs & 1)); return (roundup2(n, 16)); } static void ccr_write_ulptx_sgl(struct ccr_softc *sc, void *dst, int nsegs) { struct ulptx_sgl *usgl; struct sglist *sg; struct sglist_seg *ss; int i; sg = sc->sg_ulptx; MPASS(nsegs == sg->sg_nseg); ss = &sg->sg_segs[0]; usgl = dst; usgl->cmd_nsge = htobe32(V_ULPTX_CMD(ULP_TX_SC_DSGL) | V_ULPTX_NSGE(nsegs)); usgl->len0 = htobe32(ss->ss_len); usgl->addr0 = htobe64(ss->ss_paddr); ss++; for (i = 0; i < sg->sg_nseg - 1; i++) { usgl->sge[i / 2].len[i & 1] = htobe32(ss->ss_len); usgl->sge[i / 2].addr[i & 1] = htobe64(ss->ss_paddr); ss++; } } static bool ccr_use_imm_data(u_int transhdr_len, u_int input_len) { if (input_len > CRYPTO_MAX_IMM_TX_PKT_LEN) return (false); if (roundup2(transhdr_len, 16) + roundup2(input_len, 16) > SGE_MAX_WR_LEN) return (false); return (true); } static void ccr_populate_wreq(struct ccr_softc *sc, struct chcr_wr *crwr, u_int kctx_len, u_int wr_len, u_int imm_len, u_int sgl_len, u_int hash_size, struct cryptop *crp) { - u_int cctx_size; + u_int cctx_size, idata_len; cctx_size = sizeof(struct _key_ctx) + kctx_len; crwr->wreq.op_to_cctx_size = htobe32( V_FW_CRYPTO_LOOKASIDE_WR_OPCODE(FW_CRYPTO_LOOKASIDE_WR) | V_FW_CRYPTO_LOOKASIDE_WR_COMPL(0) | V_FW_CRYPTO_LOOKASIDE_WR_IMM_LEN(imm_len) | V_FW_CRYPTO_LOOKASIDE_WR_CCTX_LOC(1) | V_FW_CRYPTO_LOOKASIDE_WR_CCTX_SIZE(cctx_size >> 4)); crwr->wreq.len16_pkd = htobe32( V_FW_CRYPTO_LOOKASIDE_WR_LEN16(wr_len / 16)); crwr->wreq.session_id = 0; crwr->wreq.rx_chid_to_rx_q_id = htobe32( V_FW_CRYPTO_LOOKASIDE_WR_RX_CHID(sc->tx_channel_id) | V_FW_CRYPTO_LOOKASIDE_WR_LCB(0) | V_FW_CRYPTO_LOOKASIDE_WR_PHASH(0) | V_FW_CRYPTO_LOOKASIDE_WR_IV(IV_NOP) | V_FW_CRYPTO_LOOKASIDE_WR_FQIDX(0) | V_FW_CRYPTO_LOOKASIDE_WR_TX_CH(0) | V_FW_CRYPTO_LOOKASIDE_WR_RX_Q_ID(sc->rxq->iq.abs_id)); crwr->wreq.key_addr = 0; crwr->wreq.pld_size_hash_size = htobe32( V_FW_CRYPTO_LOOKASIDE_WR_PLD_SIZE(sgl_len) | V_FW_CRYPTO_LOOKASIDE_WR_HASH_SIZE(hash_size)); crwr->wreq.cookie = htobe64((uintptr_t)crp); crwr->ulptx.cmd_dest = htobe32(V_ULPTX_CMD(ULP_TX_PKT) | V_ULP_TXPKT_DATAMODIFY(0) | V_ULP_TXPKT_CHANNELID(sc->tx_channel_id) | V_ULP_TXPKT_DEST(0) | V_ULP_TXPKT_FID(0) | V_ULP_TXPKT_RO(1)); crwr->ulptx.len = htobe32( ((wr_len - sizeof(struct fw_crypto_lookaside_wr)) / 16)); crwr->sc_imm.cmd_more = htobe32(V_ULPTX_CMD(ULP_TX_SC_IMM) | - V_ULP_TX_SC_MORE(imm_len != 0 ? 0 : 1)); - crwr->sc_imm.len = htobe32(wr_len - offsetof(struct chcr_wr, sec_cpl) - - sgl_len); + V_ULP_TX_SC_MORE(sgl_len != 0 ? 1 : 0)); + idata_len = wr_len - offsetof(struct chcr_wr, sec_cpl) - sgl_len; + if (imm_len % 16 != 0) + idata_len -= 16 - imm_len % 16; + crwr->sc_imm.len = htobe32(idata_len); } static int ccr_hash(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp) { struct chcr_wr *crwr; struct wrqe *wr; struct auth_hash *axf; struct cryptodesc *crd; char *dst; u_int hash_size_in_response, kctx_flits, kctx_len, transhdr_len, wr_len; u_int hmac_ctrl, imm_len, iopad_size; int error, sgl_nsegs, sgl_len, use_opad; crd = crp->crp_desc; /* Reject requests with too large of an input buffer. */ if (crd->crd_len > MAX_REQUEST_SIZE) return (EFBIG); axf = s->hmac.auth_hash; if (s->mode == HMAC) { use_opad = 1; hmac_ctrl = SCMD_HMAC_CTRL_NO_TRUNC; } else { use_opad = 0; hmac_ctrl = SCMD_HMAC_CTRL_NOP; } /* PADs must be 128-bit aligned. 
*/ iopad_size = roundup2(s->hmac.partial_digest_len, 16); /* * The 'key' part of the context includes the aligned IPAD and * OPAD. */ kctx_len = iopad_size; if (use_opad) kctx_len += iopad_size; hash_size_in_response = axf->hashsize; transhdr_len = HASH_TRANSHDR_SIZE(kctx_len); if (crd->crd_len == 0) { imm_len = axf->blocksize; sgl_nsegs = 0; sgl_len = 0; } else if (ccr_use_imm_data(transhdr_len, crd->crd_len)) { imm_len = crd->crd_len; sgl_nsegs = 0; sgl_len = 0; } else { imm_len = 0; sglist_reset(sc->sg_ulptx); error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crd->crd_skip, crd->crd_len); if (error) return (error); sgl_nsegs = sc->sg_ulptx->sg_nseg; sgl_len = ccr_ulptx_sgl_len(sgl_nsegs); } wr_len = roundup2(transhdr_len, 16) + roundup2(imm_len, 16) + sgl_len; if (wr_len > SGE_MAX_WR_LEN) return (EFBIG); wr = alloc_wrqe(wr_len, sc->txq); if (wr == NULL) { sc->stats_wr_nomem++; return (ENOMEM); } crwr = wrtod(wr); memset(crwr, 0, wr_len); ccr_populate_wreq(sc, crwr, kctx_len, wr_len, imm_len, sgl_len, hash_size_in_response, crp); /* XXX: Hardcodes SGE loopback channel of 0. */ crwr->sec_cpl.op_ivinsrtofst = htobe32( V_CPL_TX_SEC_PDU_OPCODE(CPL_TX_SEC_PDU) | V_CPL_TX_SEC_PDU_RXCHID(sc->tx_channel_id) | V_CPL_TX_SEC_PDU_ACKFOLLOWS(0) | V_CPL_TX_SEC_PDU_ULPTXLPBK(1) | V_CPL_TX_SEC_PDU_CPLLEN(2) | V_CPL_TX_SEC_PDU_PLACEHOLDER(0) | V_CPL_TX_SEC_PDU_IVINSRTOFST(0)); crwr->sec_cpl.pldlen = htobe32(crd->crd_len == 0 ? axf->blocksize : crd->crd_len); crwr->sec_cpl.cipherstop_lo_authinsert = htobe32( V_CPL_TX_SEC_PDU_AUTHSTART(1) | V_CPL_TX_SEC_PDU_AUTHSTOP(0)); /* These two flits are actually a CPL_TLS_TX_SCMD_FMT. */ crwr->sec_cpl.seqno_numivs = htobe32( V_SCMD_SEQ_NO_CTRL(0) | V_SCMD_PROTO_VERSION(SCMD_PROTO_VERSION_GENERIC) | V_SCMD_CIPH_MODE(SCMD_CIPH_MODE_NOP) | V_SCMD_AUTH_MODE(s->hmac.auth_mode) | V_SCMD_HMAC_CTRL(hmac_ctrl)); crwr->sec_cpl.ivgen_hdrlen = htobe32( V_SCMD_LAST_FRAG(0) | V_SCMD_MORE_FRAGS(crd->crd_len == 0 ? 1 : 0) | V_SCMD_MAC_ONLY(1)); memcpy(crwr->key_ctx.key, s->hmac.ipad, s->hmac.partial_digest_len); if (use_opad) memcpy(crwr->key_ctx.key + iopad_size, s->hmac.opad, s->hmac.partial_digest_len); /* XXX: F_KEY_CONTEXT_SALT_PRESENT set, but 'salt' not set. 
*/ kctx_flits = (sizeof(struct _key_ctx) + kctx_len) / 16; crwr->key_ctx.ctx_hdr = htobe32(V_KEY_CONTEXT_CTX_LEN(kctx_flits) | V_KEY_CONTEXT_OPAD_PRESENT(use_opad) | V_KEY_CONTEXT_SALT_PRESENT(1) | V_KEY_CONTEXT_CK_SIZE(CHCR_KEYCTX_NO_KEY) | V_KEY_CONTEXT_MK_SIZE(s->hmac.mk_size) | V_KEY_CONTEXT_VALID(1)); dst = (char *)(crwr + 1) + kctx_len + DUMMY_BYTES; if (crd->crd_len == 0) { dst[0] = 0x80; - *(uint64_t *)(dst + axf->blocksize - sizeof(uint64_t)) = - htobe64(axf->blocksize << 3); + if (s->mode == HMAC) + *(uint64_t *)(dst + axf->blocksize - sizeof(uint64_t)) = + htobe64(axf->blocksize << 3); } else if (imm_len != 0) crypto_copydata(crp->crp_flags, crp->crp_buf, crd->crd_skip, crd->crd_len, dst); else ccr_write_ulptx_sgl(sc, dst, sgl_nsegs); /* XXX: TODO backpressure */ t4_wrq_tx(sc->adapter, wr); return (0); } static int ccr_hash_done(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, const struct cpl_fw6_pld *cpl, int error) { struct cryptodesc *crd; crd = crp->crp_desc; if (error == 0) { crypto_copyback(crp->crp_flags, crp->crp_buf, crd->crd_inject, s->hmac.hash_len, (c_caddr_t)(cpl + 1)); } return (error); } static int ccr_blkcipher(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp) { char iv[CHCR_MAX_CRYPTO_IV_LEN]; struct chcr_wr *crwr; struct wrqe *wr; struct cryptodesc *crd; char *dst; u_int kctx_len, key_half, op_type, transhdr_len, wr_len; u_int imm_len; int dsgl_nsegs, dsgl_len; int sgl_nsegs, sgl_len; int error; crd = crp->crp_desc; if (s->blkcipher.key_len == 0 || crd->crd_len == 0) return (EINVAL); if (crd->crd_alg == CRYPTO_AES_CBC && (crd->crd_len % AES_BLOCK_LEN) != 0) return (EINVAL); /* Reject requests with too large of an input buffer. */ if (crd->crd_len > MAX_REQUEST_SIZE) return (EFBIG); if (crd->crd_flags & CRD_F_ENCRYPT) op_type = CHCR_ENCRYPT_OP; else op_type = CHCR_DECRYPT_OP; sglist_reset(sc->sg_dsgl); error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crd->crd_skip, crd->crd_len); if (error) return (error); dsgl_nsegs = ccr_count_sgl(sc->sg_dsgl, DSGL_SGE_MAXLEN); if (dsgl_nsegs > MAX_RX_PHYS_DSGL_SGE) return (EFBIG); dsgl_len = ccr_phys_dsgl_len(dsgl_nsegs); /* The 'key' must be 128-bit aligned. */ kctx_len = roundup2(s->blkcipher.key_len, 16); transhdr_len = CIPHER_TRANSHDR_SIZE(kctx_len, dsgl_len); if (ccr_use_imm_data(transhdr_len, crd->crd_len + s->blkcipher.iv_len)) { imm_len = crd->crd_len; sgl_nsegs = 0; sgl_len = 0; } else { imm_len = 0; sglist_reset(sc->sg_ulptx); error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crd->crd_skip, crd->crd_len); if (error) return (error); sgl_nsegs = sc->sg_ulptx->sg_nseg; sgl_len = ccr_ulptx_sgl_len(sgl_nsegs); } wr_len = roundup2(transhdr_len, 16) + s->blkcipher.iv_len + roundup2(imm_len, 16) + sgl_len; if (wr_len > SGE_MAX_WR_LEN) return (EFBIG); wr = alloc_wrqe(wr_len, sc->txq); if (wr == NULL) { sc->stats_wr_nomem++; return (ENOMEM); } crwr = wrtod(wr); memset(crwr, 0, wr_len); /* * Read the existing IV from the request or generate a random * one if none is provided. Optionally copy the generated IV * into the output buffer if requested. 
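 * CRD_F_IV_EXPLICIT means the IV is passed in crd_iv.  For
 * encryption without an explicit IV, a fresh one is generated with
 * arc4rand() and, unless CRD_F_IV_PRESENT is set, written back to
 * the buffer at crd_inject; for decryption without an explicit IV,
 * the IV is read from the buffer at crd_inject.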
*/ if (op_type == CHCR_ENCRYPT_OP) { if (crd->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crd->crd_iv, s->blkcipher.iv_len); else arc4rand(iv, s->blkcipher.iv_len, 0); if ((crd->crd_flags & CRD_F_IV_PRESENT) == 0) crypto_copyback(crp->crp_flags, crp->crp_buf, crd->crd_inject, s->blkcipher.iv_len, iv); } else { if (crd->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crd->crd_iv, s->blkcipher.iv_len); else crypto_copydata(crp->crp_flags, crp->crp_buf, crd->crd_inject, s->blkcipher.iv_len, iv); } ccr_populate_wreq(sc, crwr, kctx_len, wr_len, imm_len, sgl_len, 0, crp); /* XXX: Hardcodes SGE loopback channel of 0. */ crwr->sec_cpl.op_ivinsrtofst = htobe32( V_CPL_TX_SEC_PDU_OPCODE(CPL_TX_SEC_PDU) | V_CPL_TX_SEC_PDU_RXCHID(sc->tx_channel_id) | V_CPL_TX_SEC_PDU_ACKFOLLOWS(0) | V_CPL_TX_SEC_PDU_ULPTXLPBK(1) | V_CPL_TX_SEC_PDU_CPLLEN(2) | V_CPL_TX_SEC_PDU_PLACEHOLDER(0) | V_CPL_TX_SEC_PDU_IVINSRTOFST(1)); crwr->sec_cpl.pldlen = htobe32(s->blkcipher.iv_len + crd->crd_len); crwr->sec_cpl.aadstart_cipherstop_hi = htobe32( V_CPL_TX_SEC_PDU_CIPHERSTART(s->blkcipher.iv_len + 1) | V_CPL_TX_SEC_PDU_CIPHERSTOP_HI(0)); crwr->sec_cpl.cipherstop_lo_authinsert = htobe32( V_CPL_TX_SEC_PDU_CIPHERSTOP_LO(0)); /* These two flits are actually a CPL_TLS_TX_SCMD_FMT. */ crwr->sec_cpl.seqno_numivs = htobe32( V_SCMD_SEQ_NO_CTRL(0) | V_SCMD_PROTO_VERSION(SCMD_PROTO_VERSION_GENERIC) | V_SCMD_ENC_DEC_CTRL(op_type) | V_SCMD_CIPH_MODE(s->blkcipher.cipher_mode) | V_SCMD_AUTH_MODE(SCMD_AUTH_MODE_NOP) | V_SCMD_HMAC_CTRL(SCMD_HMAC_CTRL_NOP) | V_SCMD_IV_SIZE(s->blkcipher.iv_len / 2) | V_SCMD_NUM_IVS(0)); crwr->sec_cpl.ivgen_hdrlen = htobe32( V_SCMD_IV_GEN_CTRL(0) | V_SCMD_MORE_FRAGS(0) | V_SCMD_LAST_FRAG(0) | V_SCMD_MAC_ONLY(0) | V_SCMD_AADIVDROP(1) | V_SCMD_HDR_LEN(dsgl_len)); crwr->key_ctx.ctx_hdr = s->blkcipher.key_ctx_hdr; switch (crd->crd_alg) { case CRYPTO_AES_CBC: if (crd->crd_flags & CRD_F_ENCRYPT) memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); else memcpy(crwr->key_ctx.key, s->blkcipher.deckey, s->blkcipher.key_len); break; case CRYPTO_AES_ICM: memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); break; case CRYPTO_AES_XTS: key_half = s->blkcipher.key_len / 2; memcpy(crwr->key_ctx.key, s->blkcipher.enckey + key_half, key_half); if (crd->crd_flags & CRD_F_ENCRYPT) memcpy(crwr->key_ctx.key + key_half, s->blkcipher.enckey, key_half); else memcpy(crwr->key_ctx.key + key_half, s->blkcipher.deckey, key_half); break; } dst = (char *)(crwr + 1) + kctx_len; ccr_write_phys_dsgl(sc, dst, dsgl_nsegs); dst += sizeof(struct cpl_rx_phys_dsgl) + dsgl_len; memcpy(dst, iv, s->blkcipher.iv_len); dst += s->blkcipher.iv_len; if (imm_len != 0) crypto_copydata(crp->crp_flags, crp->crp_buf, crd->crd_skip, crd->crd_len, dst); else ccr_write_ulptx_sgl(sc, dst, sgl_nsegs); /* XXX: TODO backpressure */ t4_wrq_tx(sc->adapter, wr); return (0); } static int ccr_blkcipher_done(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, const struct cpl_fw6_pld *cpl, int error) { /* * The updated IV to permit chained requests is at * cpl->data[2], but OCF doesn't permit chained requests. */ return (error); } /* * 'hashsize' is the length of a full digest. 'authsize' is the * requested digest length for this operation which may be less * than 'hashsize'. 
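 * For example, with SHA-256 (hashsize 32) an authsize of 16 maps to
 * SCMD_HMAC_CTRL_DIV2, 12 to SCMD_HMAC_CTRL_IPSEC_96BIT, 10 to
 * SCMD_HMAC_CTRL_TRUNC_RFC4366, and the full 32 bytes to
 * SCMD_HMAC_CTRL_NO_TRUNC.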
*/ static int ccr_hmac_ctrl(unsigned int hashsize, unsigned int authsize) { if (authsize == 10) return (SCMD_HMAC_CTRL_TRUNC_RFC4366); if (authsize == 12) return (SCMD_HMAC_CTRL_IPSEC_96BIT); if (authsize == hashsize / 2) return (SCMD_HMAC_CTRL_DIV2); return (SCMD_HMAC_CTRL_NO_TRUNC); } static int ccr_authenc(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, struct cryptodesc *crda, struct cryptodesc *crde) { char iv[CHCR_MAX_CRYPTO_IV_LEN]; struct chcr_wr *crwr; struct wrqe *wr; struct auth_hash *axf; char *dst; u_int kctx_len, key_half, op_type, transhdr_len, wr_len; u_int hash_size_in_response, imm_len, iopad_size; u_int aad_start, aad_len, aad_stop; u_int auth_start, auth_stop, auth_insert; u_int cipher_start, cipher_stop; u_int hmac_ctrl, input_len; int dsgl_nsegs, dsgl_len; int sgl_nsegs, sgl_len; int error; /* * If there is a need in the future, requests with an empty * payload could be supported as HMAC-only requests. */ if (s->blkcipher.key_len == 0 || crde->crd_len == 0) return (EINVAL); if (crde->crd_alg == CRYPTO_AES_CBC && (crde->crd_len % AES_BLOCK_LEN) != 0) return (EINVAL); /* * Compute the length of the AAD (data covered by the * authentication descriptor but not the encryption * descriptor). To simplify the logic, AAD is only permitted * before the cipher/plain text, not after. This is true of * all currently-generated requests. */ if (crda->crd_len + crda->crd_skip > crde->crd_len + crde->crd_skip) return (EINVAL); if (crda->crd_skip < crde->crd_skip) { if (crda->crd_skip + crda->crd_len > crde->crd_skip) aad_len = (crde->crd_skip - crda->crd_skip); else aad_len = crda->crd_len; } else aad_len = 0; if (aad_len + s->blkcipher.iv_len > MAX_AAD_LEN) return (EINVAL); axf = s->hmac.auth_hash; hash_size_in_response = s->hmac.hash_len; if (crde->crd_flags & CRD_F_ENCRYPT) op_type = CHCR_ENCRYPT_OP; else op_type = CHCR_DECRYPT_OP; /* * The output buffer consists of the cipher text followed by * the hash when encrypting. For decryption it only contains * the plain text. * * Due to a firmware bug, the output buffer must include a * dummy output buffer for the IV and AAD prior to the real * output buffer. */ if (op_type == CHCR_ENCRYPT_OP) { if (s->blkcipher.iv_len + aad_len + crde->crd_len + hash_size_in_response > MAX_REQUEST_SIZE) return (EFBIG); } else { if (s->blkcipher.iv_len + aad_len + crde->crd_len > MAX_REQUEST_SIZE) return (EFBIG); } sglist_reset(sc->sg_dsgl); error = sglist_append_sglist(sc->sg_dsgl, sc->sg_iv_aad, 0, s->blkcipher.iv_len + aad_len); if (error) return (error); error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crde->crd_skip, crde->crd_len); if (error) return (error); if (op_type == CHCR_ENCRYPT_OP) { error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crda->crd_inject, hash_size_in_response); if (error) return (error); } dsgl_nsegs = ccr_count_sgl(sc->sg_dsgl, DSGL_SGE_MAXLEN); if (dsgl_nsegs > MAX_RX_PHYS_DSGL_SGE) return (EFBIG); dsgl_len = ccr_phys_dsgl_len(dsgl_nsegs); /* PADs must be 128-bit aligned. */ iopad_size = roundup2(s->hmac.partial_digest_len, 16); /* * The 'key' part of the key context consists of the key followed * by the IPAD and OPAD. */ kctx_len = roundup2(s->blkcipher.key_len, 16) + iopad_size * 2; transhdr_len = CIPHER_TRANSHDR_SIZE(kctx_len, dsgl_len); /* * The input buffer consists of the IV, any AAD, and then the * cipher/plain text. For decryption requests the hash is * appended after the cipher text. 
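 *
 * Seen by the engine, the payload is laid out as
 *
 *	| IV | AAD | cipher/plain text | hash (decrypt only) |
 *
 * and the AADSTART/AADSTOP and CIPHERSTART values computed below are
 * 1-based offsets into this buffer, counting the IV.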
* * The IV is always stored at the start of the input buffer * even though it may be duplicated in the payload. The * crypto engine doesn't work properly if the IV offset points * inside of the AAD region, so a second copy is always * required. */ input_len = aad_len + crde->crd_len; /* * The firmware hangs if sent a request which is a * bit smaller than MAX_REQUEST_SIZE. In particular, the * firmware appears to require 512 - 16 bytes of spare room * along with the size of the hash even if the hash isn't * included in the input buffer. */ if (input_len + roundup2(axf->hashsize, 16) + (512 - 16) > MAX_REQUEST_SIZE) return (EFBIG); if (op_type == CHCR_DECRYPT_OP) input_len += hash_size_in_response; if (ccr_use_imm_data(transhdr_len, s->blkcipher.iv_len + input_len)) { imm_len = input_len; sgl_nsegs = 0; sgl_len = 0; } else { imm_len = 0; sglist_reset(sc->sg_ulptx); if (aad_len != 0) { error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crda->crd_skip, aad_len); if (error) return (error); } error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crde->crd_skip, crde->crd_len); if (error) return (error); if (op_type == CHCR_DECRYPT_OP) { error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crda->crd_inject, hash_size_in_response); if (error) return (error); } sgl_nsegs = sc->sg_ulptx->sg_nseg; sgl_len = ccr_ulptx_sgl_len(sgl_nsegs); } /* * Any auth-only data before the cipher region is marked as AAD. * Auth-data that overlaps with the cipher region is placed in * the auth section. */ if (aad_len != 0) { aad_start = s->blkcipher.iv_len + 1; aad_stop = aad_start + aad_len - 1; } else { aad_start = 0; aad_stop = 0; } cipher_start = s->blkcipher.iv_len + aad_len + 1; if (op_type == CHCR_DECRYPT_OP) cipher_stop = hash_size_in_response; else cipher_stop = 0; if (aad_len == crda->crd_len) { auth_start = 0; auth_stop = 0; } else { if (aad_len != 0) auth_start = cipher_start; else auth_start = s->blkcipher.iv_len + crda->crd_skip - crde->crd_skip + 1; auth_stop = (crde->crd_skip + crde->crd_len) - (crda->crd_skip + crda->crd_len) + cipher_stop; } if (op_type == CHCR_DECRYPT_OP) auth_insert = hash_size_in_response; else auth_insert = 0; wr_len = roundup2(transhdr_len, 16) + s->blkcipher.iv_len + roundup2(imm_len, 16) + sgl_len; if (wr_len > SGE_MAX_WR_LEN) return (EFBIG); wr = alloc_wrqe(wr_len, sc->txq); if (wr == NULL) { sc->stats_wr_nomem++; return (ENOMEM); } crwr = wrtod(wr); memset(crwr, 0, wr_len); /* * Read the existing IV from the request or generate a random * one if none is provided. Optionally copy the generated IV * into the output buffer if requested. */ if (op_type == CHCR_ENCRYPT_OP) { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, s->blkcipher.iv_len); else arc4rand(iv, s->blkcipher.iv_len, 0); if ((crde->crd_flags & CRD_F_IV_PRESENT) == 0) crypto_copyback(crp->crp_flags, crp->crp_buf, crde->crd_inject, s->blkcipher.iv_len, iv); } else { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, s->blkcipher.iv_len); else crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_inject, s->blkcipher.iv_len, iv); } ccr_populate_wreq(sc, crwr, kctx_len, wr_len, imm_len, sgl_len, op_type == CHCR_DECRYPT_OP ? hash_size_in_response : 0, crp); /* XXX: Hardcodes SGE loopback channel of 0. 
*/ crwr->sec_cpl.op_ivinsrtofst = htobe32( V_CPL_TX_SEC_PDU_OPCODE(CPL_TX_SEC_PDU) | V_CPL_TX_SEC_PDU_RXCHID(sc->tx_channel_id) | V_CPL_TX_SEC_PDU_ACKFOLLOWS(0) | V_CPL_TX_SEC_PDU_ULPTXLPBK(1) | V_CPL_TX_SEC_PDU_CPLLEN(2) | V_CPL_TX_SEC_PDU_PLACEHOLDER(0) | V_CPL_TX_SEC_PDU_IVINSRTOFST(1)); crwr->sec_cpl.pldlen = htobe32(s->blkcipher.iv_len + input_len); crwr->sec_cpl.aadstart_cipherstop_hi = htobe32( V_CPL_TX_SEC_PDU_AADSTART(aad_start) | V_CPL_TX_SEC_PDU_AADSTOP(aad_stop) | V_CPL_TX_SEC_PDU_CIPHERSTART(cipher_start) | V_CPL_TX_SEC_PDU_CIPHERSTOP_HI(cipher_stop >> 4)); crwr->sec_cpl.cipherstop_lo_authinsert = htobe32( V_CPL_TX_SEC_PDU_CIPHERSTOP_LO(cipher_stop & 0xf) | V_CPL_TX_SEC_PDU_AUTHSTART(auth_start) | V_CPL_TX_SEC_PDU_AUTHSTOP(auth_stop) | V_CPL_TX_SEC_PDU_AUTHINSERT(auth_insert)); /* These two flits are actually a CPL_TLS_TX_SCMD_FMT. */ hmac_ctrl = ccr_hmac_ctrl(axf->hashsize, hash_size_in_response); crwr->sec_cpl.seqno_numivs = htobe32( V_SCMD_SEQ_NO_CTRL(0) | V_SCMD_PROTO_VERSION(SCMD_PROTO_VERSION_GENERIC) | V_SCMD_ENC_DEC_CTRL(op_type) | V_SCMD_CIPH_AUTH_SEQ_CTRL(op_type == CHCR_ENCRYPT_OP ? 1 : 0) | V_SCMD_CIPH_MODE(s->blkcipher.cipher_mode) | V_SCMD_AUTH_MODE(s->hmac.auth_mode) | V_SCMD_HMAC_CTRL(hmac_ctrl) | V_SCMD_IV_SIZE(s->blkcipher.iv_len / 2) | V_SCMD_NUM_IVS(0)); crwr->sec_cpl.ivgen_hdrlen = htobe32( V_SCMD_IV_GEN_CTRL(0) | V_SCMD_MORE_FRAGS(0) | V_SCMD_LAST_FRAG(0) | V_SCMD_MAC_ONLY(0) | V_SCMD_AADIVDROP(0) | V_SCMD_HDR_LEN(dsgl_len)); crwr->key_ctx.ctx_hdr = s->blkcipher.key_ctx_hdr; switch (crde->crd_alg) { case CRYPTO_AES_CBC: if (crde->crd_flags & CRD_F_ENCRYPT) memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); else memcpy(crwr->key_ctx.key, s->blkcipher.deckey, s->blkcipher.key_len); break; case CRYPTO_AES_ICM: memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); break; case CRYPTO_AES_XTS: key_half = s->blkcipher.key_len / 2; memcpy(crwr->key_ctx.key, s->blkcipher.enckey + key_half, key_half); if (crde->crd_flags & CRD_F_ENCRYPT) memcpy(crwr->key_ctx.key + key_half, s->blkcipher.enckey, key_half); else memcpy(crwr->key_ctx.key + key_half, s->blkcipher.deckey, key_half); break; } dst = crwr->key_ctx.key + roundup2(s->blkcipher.key_len, 16); memcpy(dst, s->hmac.ipad, s->hmac.partial_digest_len); memcpy(dst + iopad_size, s->hmac.opad, s->hmac.partial_digest_len); dst = (char *)(crwr + 1) + kctx_len; ccr_write_phys_dsgl(sc, dst, dsgl_nsegs); dst += sizeof(struct cpl_rx_phys_dsgl) + dsgl_len; memcpy(dst, iv, s->blkcipher.iv_len); dst += s->blkcipher.iv_len; if (imm_len != 0) { if (aad_len != 0) { crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_skip, aad_len, dst); dst += aad_len; } crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip, crde->crd_len, dst); dst += crde->crd_len; if (op_type == CHCR_DECRYPT_OP) crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_inject, hash_size_in_response, dst); } else ccr_write_ulptx_sgl(sc, dst, sgl_nsegs); /* XXX: TODO backpressure */ t4_wrq_tx(sc->adapter, wr); return (0); } static int ccr_authenc_done(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, const struct cpl_fw6_pld *cpl, int error) { struct cryptodesc *crd; /* * The updated IV to permit chained requests is at * cpl->data[2], but OCF doesn't permit chained requests. * * For a decryption request, the hardware may do a verification * of the HMAC which will fail if the existing HMAC isn't in the * buffer. If that happens, clear the error and copy the HMAC * from the CPL reply into the buffer. 
* * For encryption requests, crd should be the cipher request * which will have CRD_F_ENCRYPT set. For decryption * requests, crp_desc will be the HMAC request which should * not have this flag set. */ crd = crp->crp_desc; if (error == EBADMSG && !CHK_PAD_ERR_BIT(be64toh(cpl->data[0])) && !(crd->crd_flags & CRD_F_ENCRYPT)) { crypto_copyback(crp->crp_flags, crp->crp_buf, crd->crd_inject, s->hmac.hash_len, (c_caddr_t)(cpl + 1)); error = 0; } return (error); } static int ccr_gcm(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, struct cryptodesc *crda, struct cryptodesc *crde) { char iv[CHCR_MAX_CRYPTO_IV_LEN]; struct chcr_wr *crwr; struct wrqe *wr; char *dst; u_int iv_len, kctx_len, op_type, transhdr_len, wr_len; u_int hash_size_in_response, imm_len; u_int aad_start, aad_stop, cipher_start, cipher_stop, auth_insert; u_int hmac_ctrl, input_len; int dsgl_nsegs, dsgl_len; int sgl_nsegs, sgl_len; int error; if (s->blkcipher.key_len == 0) return (EINVAL); /* * The crypto engine doesn't handle GCM requests with an empty * payload, so handle those in software instead. */ if (crde->crd_len == 0) return (EMSGSIZE); /* * AAD is only permitted before the cipher/plain text, not * after. */ if (crda->crd_len + crda->crd_skip > crde->crd_len + crde->crd_skip) return (EMSGSIZE); if (crda->crd_len + AES_BLOCK_LEN > MAX_AAD_LEN) return (EMSGSIZE); hash_size_in_response = s->gmac.hash_len; if (crde->crd_flags & CRD_F_ENCRYPT) op_type = CHCR_ENCRYPT_OP; else op_type = CHCR_DECRYPT_OP; /* * The IV handling for GCM in OCF is a bit more complicated in * that IPSec provides a full 16-byte IV (including the * counter), whereas the /dev/crypto interface sometimes * provides a full 16-byte IV (if no IV is provided in the * ioctl) and sometimes a 12-byte IV (if the IV was explicit). * * When provided a 12-byte IV, assume the IV is really 16 bytes * with a counter in the last 4 bytes initialized to 1. * * While iv_len is checked below, the value is currently * always set to 12 when creating a GCM session in this driver * due to limitations in OCF (there is no way to know what the * IV length of a given request will be). This means that the * driver always assumes as 12-byte IV for now. */ if (s->blkcipher.iv_len == 12) iv_len = AES_BLOCK_LEN; else iv_len = s->blkcipher.iv_len; /* * The output buffer consists of the cipher text followed by * the tag when encrypting. For decryption it only contains * the plain text. * * Due to a firmware bug, the output buffer must include a * dummy output buffer for the IV and AAD prior to the real * output buffer. */ if (op_type == CHCR_ENCRYPT_OP) { if (iv_len + crda->crd_len + crde->crd_len + hash_size_in_response > MAX_REQUEST_SIZE) return (EFBIG); } else { if (iv_len + crda->crd_len + crde->crd_len > MAX_REQUEST_SIZE) return (EFBIG); } sglist_reset(sc->sg_dsgl); error = sglist_append_sglist(sc->sg_dsgl, sc->sg_iv_aad, 0, iv_len + crda->crd_len); if (error) return (error); error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crde->crd_skip, crde->crd_len); if (error) return (error); if (op_type == CHCR_ENCRYPT_OP) { error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crda->crd_inject, hash_size_in_response); if (error) return (error); } dsgl_nsegs = ccr_count_sgl(sc->sg_dsgl, DSGL_SGE_MAXLEN); if (dsgl_nsegs > MAX_RX_PHYS_DSGL_SGE) return (EFBIG); dsgl_len = ccr_phys_dsgl_len(dsgl_nsegs); /* * The 'key' part of the key context consists of the key followed * by the Galois hash key. 
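 * The Galois hash key H is the AES encryption of the all-zero block;
 * it is kept in s->gmac.ghash_h and copied in right after the AES
 * key below.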
*/ kctx_len = roundup2(s->blkcipher.key_len, 16) + GMAC_BLOCK_LEN; transhdr_len = CIPHER_TRANSHDR_SIZE(kctx_len, dsgl_len); /* * The input buffer consists of the IV, any AAD, and then the * cipher/plain text. For decryption requests the hash is * appended after the cipher text. * * The IV is always stored at the start of the input buffer * even though it may be duplicated in the payload. The * crypto engine doesn't work properly if the IV offset points * inside of the AAD region, so a second copy is always * required. */ input_len = crda->crd_len + crde->crd_len; if (op_type == CHCR_DECRYPT_OP) input_len += hash_size_in_response; if (input_len > MAX_REQUEST_SIZE) return (EFBIG); if (ccr_use_imm_data(transhdr_len, iv_len + input_len)) { imm_len = input_len; sgl_nsegs = 0; sgl_len = 0; } else { imm_len = 0; sglist_reset(sc->sg_ulptx); if (crda->crd_len != 0) { error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crda->crd_skip, crda->crd_len); if (error) return (error); } error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crde->crd_skip, crde->crd_len); if (error) return (error); if (op_type == CHCR_DECRYPT_OP) { error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, crda->crd_inject, hash_size_in_response); if (error) return (error); } sgl_nsegs = sc->sg_ulptx->sg_nseg; sgl_len = ccr_ulptx_sgl_len(sgl_nsegs); } if (crda->crd_len != 0) { aad_start = iv_len + 1; aad_stop = aad_start + crda->crd_len - 1; } else { aad_start = 0; aad_stop = 0; } cipher_start = iv_len + crda->crd_len + 1; if (op_type == CHCR_DECRYPT_OP) cipher_stop = hash_size_in_response; else cipher_stop = 0; if (op_type == CHCR_DECRYPT_OP) auth_insert = hash_size_in_response; else auth_insert = 0; wr_len = roundup2(transhdr_len, 16) + iv_len + roundup2(imm_len, 16) + sgl_len; if (wr_len > SGE_MAX_WR_LEN) return (EFBIG); wr = alloc_wrqe(wr_len, sc->txq); if (wr == NULL) { sc->stats_wr_nomem++; return (ENOMEM); } crwr = wrtod(wr); memset(crwr, 0, wr_len); /* * Read the existing IV from the request or generate a random * one if none is provided. Optionally copy the generated IV * into the output buffer if requested. * * If the input IV is 12 bytes, append an explicit 4-byte * counter of 1. */ if (op_type == CHCR_ENCRYPT_OP) { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, s->blkcipher.iv_len); else arc4rand(iv, s->blkcipher.iv_len, 0); if ((crde->crd_flags & CRD_F_IV_PRESENT) == 0) crypto_copyback(crp->crp_flags, crp->crp_buf, crde->crd_inject, s->blkcipher.iv_len, iv); } else { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, s->blkcipher.iv_len); else crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_inject, s->blkcipher.iv_len, iv); } if (s->blkcipher.iv_len == 12) *(uint32_t *)&iv[12] = htobe32(1); ccr_populate_wreq(sc, crwr, kctx_len, wr_len, imm_len, sgl_len, 0, crp); /* XXX: Hardcodes SGE loopback channel of 0. */ crwr->sec_cpl.op_ivinsrtofst = htobe32( V_CPL_TX_SEC_PDU_OPCODE(CPL_TX_SEC_PDU) | V_CPL_TX_SEC_PDU_RXCHID(sc->tx_channel_id) | V_CPL_TX_SEC_PDU_ACKFOLLOWS(0) | V_CPL_TX_SEC_PDU_ULPTXLPBK(1) | V_CPL_TX_SEC_PDU_CPLLEN(2) | V_CPL_TX_SEC_PDU_PLACEHOLDER(0) | V_CPL_TX_SEC_PDU_IVINSRTOFST(1)); crwr->sec_cpl.pldlen = htobe32(iv_len + input_len); /* * NB: cipherstop is explicitly set to 0. On encrypt it * should normally be set to 0 anyway (as the encrypt crd ends * at the end of the input). However, for decrypt the cipher * ends before the tag in the AUTHENC case (and authstop is * set to stop before the tag), but for GCM the cipher still * runs to the end of the buffer. 
Not sure if this is * intentional or a firmware quirk, but it is required for * working tag validation with GCM decryption. */ crwr->sec_cpl.aadstart_cipherstop_hi = htobe32( V_CPL_TX_SEC_PDU_AADSTART(aad_start) | V_CPL_TX_SEC_PDU_AADSTOP(aad_stop) | V_CPL_TX_SEC_PDU_CIPHERSTART(cipher_start) | V_CPL_TX_SEC_PDU_CIPHERSTOP_HI(0)); crwr->sec_cpl.cipherstop_lo_authinsert = htobe32( V_CPL_TX_SEC_PDU_CIPHERSTOP_LO(0) | V_CPL_TX_SEC_PDU_AUTHSTART(cipher_start) | V_CPL_TX_SEC_PDU_AUTHSTOP(cipher_stop) | V_CPL_TX_SEC_PDU_AUTHINSERT(auth_insert)); /* These two flits are actually a CPL_TLS_TX_SCMD_FMT. */ hmac_ctrl = ccr_hmac_ctrl(AES_GMAC_HASH_LEN, hash_size_in_response); crwr->sec_cpl.seqno_numivs = htobe32( V_SCMD_SEQ_NO_CTRL(0) | V_SCMD_PROTO_VERSION(SCMD_PROTO_VERSION_GENERIC) | V_SCMD_ENC_DEC_CTRL(op_type) | V_SCMD_CIPH_AUTH_SEQ_CTRL(op_type == CHCR_ENCRYPT_OP ? 1 : 0) | V_SCMD_CIPH_MODE(SCMD_CIPH_MODE_AES_GCM) | V_SCMD_AUTH_MODE(SCMD_AUTH_MODE_GHASH) | V_SCMD_HMAC_CTRL(hmac_ctrl) | V_SCMD_IV_SIZE(iv_len / 2) | V_SCMD_NUM_IVS(0)); crwr->sec_cpl.ivgen_hdrlen = htobe32( V_SCMD_IV_GEN_CTRL(0) | V_SCMD_MORE_FRAGS(0) | V_SCMD_LAST_FRAG(0) | V_SCMD_MAC_ONLY(0) | V_SCMD_AADIVDROP(0) | V_SCMD_HDR_LEN(dsgl_len)); crwr->key_ctx.ctx_hdr = s->blkcipher.key_ctx_hdr; memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); dst = crwr->key_ctx.key + roundup2(s->blkcipher.key_len, 16); memcpy(dst, s->gmac.ghash_h, GMAC_BLOCK_LEN); dst = (char *)(crwr + 1) + kctx_len; ccr_write_phys_dsgl(sc, dst, dsgl_nsegs); dst += sizeof(struct cpl_rx_phys_dsgl) + dsgl_len; memcpy(dst, iv, iv_len); dst += iv_len; if (imm_len != 0) { if (crda->crd_len != 0) { crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_skip, crda->crd_len, dst); dst += crda->crd_len; } crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip, crde->crd_len, dst); dst += crde->crd_len; if (op_type == CHCR_DECRYPT_OP) crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_inject, hash_size_in_response, dst); } else ccr_write_ulptx_sgl(sc, dst, sgl_nsegs); /* XXX: TODO backpressure */ t4_wrq_tx(sc->adapter, wr); return (0); } static int ccr_gcm_done(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, const struct cpl_fw6_pld *cpl, int error) { /* * The updated IV to permit chained requests is at * cpl->data[2], but OCF doesn't permit chained requests. * * Note that the hardware should always verify the GMAC hash. */ return (error); } /* * Handle a GCM request that is not supported by the crypto engine by * performing the operation in software. Derived from swcr_authenc(). */ static void ccr_gcm_soft(struct ccr_session *s, struct cryptop *crp, struct cryptodesc *crda, struct cryptodesc *crde) { struct auth_hash *axf; struct enc_xform *exf; void *auth_ctx; uint8_t *kschedule; char block[GMAC_BLOCK_LEN]; char digest[GMAC_DIGEST_LEN]; char iv[AES_BLOCK_LEN]; int error, i, len; auth_ctx = NULL; kschedule = NULL; /* Initialize the MAC. */ switch (s->blkcipher.key_len) { case 16: axf = &auth_hash_nist_gmac_aes_128; break; case 24: axf = &auth_hash_nist_gmac_aes_192; break; case 32: axf = &auth_hash_nist_gmac_aes_256; break; default: error = EINVAL; goto out; } auth_ctx = malloc(axf->ctxsize, M_CCR, M_NOWAIT); if (auth_ctx == NULL) { error = ENOMEM; goto out; } axf->Init(auth_ctx); axf->Setkey(auth_ctx, s->blkcipher.enckey, s->blkcipher.key_len); /* Initialize the cipher. 
*/ exf = &enc_xform_aes_nist_gcm; error = exf->setkey(&kschedule, s->blkcipher.enckey, s->blkcipher.key_len); if (error) goto out; /* * This assumes a 12-byte IV from the crp. See longer comment * above in ccr_gcm() for more details. */ if (crde->crd_flags & CRD_F_ENCRYPT) { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, 12); else arc4rand(iv, 12, 0); if ((crde->crd_flags & CRD_F_IV_PRESENT) == 0) crypto_copyback(crp->crp_flags, crp->crp_buf, crde->crd_inject, 12, iv); } else { if (crde->crd_flags & CRD_F_IV_EXPLICIT) memcpy(iv, crde->crd_iv, 12); else crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_inject, 12, iv); } *(uint32_t *)&iv[12] = htobe32(1); axf->Reinit(auth_ctx, iv, sizeof(iv)); /* MAC the AAD. */ for (i = 0; i < crda->crd_len; i += sizeof(block)) { len = imin(crda->crd_len - i, sizeof(block)); crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_skip + i, len, block); bzero(block + len, sizeof(block) - len); axf->Update(auth_ctx, block, sizeof(block)); } exf->reinit(kschedule, iv); /* Do encryption with MAC */ for (i = 0; i < crde->crd_len; i += sizeof(block)) { len = imin(crde->crd_len - i, sizeof(block)); crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip + i, len, block); bzero(block + len, sizeof(block) - len); if (crde->crd_flags & CRD_F_ENCRYPT) { exf->encrypt(kschedule, block); axf->Update(auth_ctx, block, len); crypto_copyback(crp->crp_flags, crp->crp_buf, crde->crd_skip + i, len, block); } else { axf->Update(auth_ctx, block, len); } } /* Length block. */ bzero(block, sizeof(block)); ((uint32_t *)block)[1] = htobe32(crda->crd_len * 8); ((uint32_t *)block)[3] = htobe32(crde->crd_len * 8); axf->Update(auth_ctx, block, sizeof(block)); /* Finalize MAC. */ axf->Final(digest, auth_ctx); /* Inject or validate tag. */ if (crde->crd_flags & CRD_F_ENCRYPT) { crypto_copyback(crp->crp_flags, crp->crp_buf, crda->crd_inject, sizeof(digest), digest); error = 0; } else { char digest2[GMAC_DIGEST_LEN]; crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_inject, sizeof(digest2), digest2); if (timingsafe_bcmp(digest, digest2, sizeof(digest)) == 0) { error = 0; /* Tag matches, decrypt data. */ for (i = 0; i < crde->crd_len; i += sizeof(block)) { len = imin(crde->crd_len - i, sizeof(block)); crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip + i, len, block); bzero(block + len, sizeof(block) - len); exf->decrypt(kschedule, block); crypto_copyback(crp->crp_flags, crp->crp_buf, crde->crd_skip + i, len, block); } } else error = EBADMSG; } exf->zerokey(&kschedule); out: if (auth_ctx != NULL) { memset(auth_ctx, 0, axf->ctxsize); free(auth_ctx, M_CCR); } crp->crp_etype = error; crypto_done(crp); } static void +generate_ccm_b0(struct cryptodesc *crda, struct cryptodesc *crde, + u_int hash_size_in_response, const char *iv, char *b0) +{ + u_int i, payload_len; + + /* NB: L is already set in the first byte of the IV. */ + memcpy(b0, iv, CCM_B0_SIZE); + + /* Set length of hash in bits 3 - 5. */ + b0[0] |= (((hash_size_in_response - 2) / 2) << 3); + + /* Store the payload length as a big-endian value. */ + payload_len = crde->crd_len; + for (i = 0; i < iv[0]; i++) { + b0[CCM_CBC_BLOCK_LEN - 1 - i] = payload_len; + payload_len >>= 8; + } + + /* + * If there is AAD in the request, set bit 6 in the flags + * field and store the AAD length as a big-endian value at the + * start of block 1. This only assumes a 16-bit AAD length + * since T6 doesn't support large AAD sizes. 
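+ *
+ * This matches the B_0 layout from RFC 3610: the low three bits of
+ * the flags byte hold L - 1 (already present in iv[0]), bits 3-5
+ * hold (M - 2) / 2 for an M-byte tag, and bit 6 is the Adata flag.
+ * For example, a 12-byte nonce (L = 3), a 16-byte tag, and non-empty
+ * AAD give a flags byte of 0x02 | (7 << 3) | 0x40 = 0x7a.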
+ */ + if (crda->crd_len != 0) { + b0[0] |= (1 << 6); + *(uint16_t *)(b0 + CCM_B0_SIZE) = htobe16(crda->crd_len); + } +} + +static int +ccr_ccm(struct ccr_softc *sc, struct ccr_session *s, struct cryptop *crp, + struct cryptodesc *crda, struct cryptodesc *crde) +{ + char iv[CHCR_MAX_CRYPTO_IV_LEN]; + struct ulptx_idata *idata; + struct chcr_wr *crwr; + struct wrqe *wr; + char *dst; + u_int iv_len, kctx_len, op_type, transhdr_len, wr_len; + u_int aad_len, b0_len, hash_size_in_response, imm_len; + u_int aad_start, aad_stop, cipher_start, cipher_stop, auth_insert; + u_int hmac_ctrl, input_len; + int dsgl_nsegs, dsgl_len; + int sgl_nsegs, sgl_len; + int error; + + if (s->blkcipher.key_len == 0) + return (EINVAL); + + /* + * The crypto engine doesn't handle CCM requests with an empty + * payload, so handle those in software instead. + */ + if (crde->crd_len == 0) + return (EMSGSIZE); + + /* + * AAD is only permitted before the cipher/plain text, not + * after. + */ + if (crda->crd_len + crda->crd_skip > crde->crd_len + crde->crd_skip) + return (EMSGSIZE); + + /* + * CCM always includes block 0 in the AAD before AAD from the + * request. + */ + b0_len = CCM_B0_SIZE; + if (crda->crd_len != 0) + b0_len += CCM_AAD_FIELD_SIZE; + aad_len = b0_len + crda->crd_len; + + /* + * Always assume a 12 byte input IV for now since that is what + * OCF always generates. The full IV in the work request is + * 16 bytes. + */ + iv_len = AES_BLOCK_LEN; + + if (iv_len + aad_len > MAX_AAD_LEN) + return (EMSGSIZE); + + hash_size_in_response = s->ccm_mac.hash_len; + if (crde->crd_flags & CRD_F_ENCRYPT) + op_type = CHCR_ENCRYPT_OP; + else + op_type = CHCR_DECRYPT_OP; + + /* + * The output buffer consists of the cipher text followed by + * the tag when encrypting. For decryption it only contains + * the plain text. + * + * Due to a firmware bug, the output buffer must include a + * dummy output buffer for the IV and AAD prior to the real + * output buffer. + */ + if (op_type == CHCR_ENCRYPT_OP) { + if (iv_len + aad_len + crde->crd_len + hash_size_in_response > + MAX_REQUEST_SIZE) + return (EFBIG); + } else { + if (iv_len + aad_len + crde->crd_len > MAX_REQUEST_SIZE) + return (EFBIG); + } + sglist_reset(sc->sg_dsgl); + error = sglist_append_sglist(sc->sg_dsgl, sc->sg_iv_aad, 0, iv_len + + aad_len); + if (error) + return (error); + error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, crde->crd_skip, + crde->crd_len); + if (error) + return (error); + if (op_type == CHCR_ENCRYPT_OP) { + error = sglist_append_sglist(sc->sg_dsgl, sc->sg_crp, + crda->crd_inject, hash_size_in_response); + if (error) + return (error); + } + dsgl_nsegs = ccr_count_sgl(sc->sg_dsgl, DSGL_SGE_MAXLEN); + if (dsgl_nsegs > MAX_RX_PHYS_DSGL_SGE) + return (EFBIG); + dsgl_len = ccr_phys_dsgl_len(dsgl_nsegs); + + /* + * The 'key' part of the key context consists of two copies of + * the AES key. + */ + kctx_len = roundup2(s->blkcipher.key_len, 16) * 2; + transhdr_len = CIPHER_TRANSHDR_SIZE(kctx_len, dsgl_len); + + /* + * The input buffer consists of the IV, AAD (including block + * 0), and then the cipher/plain text. For decryption + * requests the hash is appended after the cipher text. + * + * The IV is always stored at the start of the input buffer + * even though it may be duplicated in the payload. The + * crypto engine doesn't work properly if the IV offset points + * inside of the AAD region, so a second copy is always + * required. 
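+ *
+ * The 16-byte IV constructed below is the initial CCM counter
+ * block: one flags byte holding L - 1, the 12-byte nonce, and a
+ * counter in the remaining three bytes initialized to zero.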
+ */ + input_len = aad_len + crde->crd_len; + if (op_type == CHCR_DECRYPT_OP) + input_len += hash_size_in_response; + if (input_len > MAX_REQUEST_SIZE) + return (EFBIG); + if (ccr_use_imm_data(transhdr_len, iv_len + input_len)) { + imm_len = input_len; + sgl_nsegs = 0; + sgl_len = 0; + } else { + /* Block 0 is passed as immediate data. */ + imm_len = b0_len; + + sglist_reset(sc->sg_ulptx); + if (crda->crd_len != 0) { + error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, + crda->crd_skip, crda->crd_len); + if (error) + return (error); + } + error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, + crde->crd_skip, crde->crd_len); + if (error) + return (error); + if (op_type == CHCR_DECRYPT_OP) { + error = sglist_append_sglist(sc->sg_ulptx, sc->sg_crp, + crda->crd_inject, hash_size_in_response); + if (error) + return (error); + } + sgl_nsegs = sc->sg_ulptx->sg_nseg; + sgl_len = ccr_ulptx_sgl_len(sgl_nsegs); + } + + aad_start = iv_len + 1; + aad_stop = aad_start + aad_len - 1; + cipher_start = aad_stop + 1; + if (op_type == CHCR_DECRYPT_OP) + cipher_stop = hash_size_in_response; + else + cipher_stop = 0; + if (op_type == CHCR_DECRYPT_OP) + auth_insert = hash_size_in_response; + else + auth_insert = 0; + + wr_len = roundup2(transhdr_len, 16) + iv_len + roundup2(imm_len, 16) + + sgl_len; + if (wr_len > SGE_MAX_WR_LEN) + return (EFBIG); + wr = alloc_wrqe(wr_len, sc->txq); + if (wr == NULL) { + sc->stats_wr_nomem++; + return (ENOMEM); + } + crwr = wrtod(wr); + memset(crwr, 0, wr_len); + + /* + * Read the nonce from the request or generate a random one if + * none is provided. Use the nonce to generate the full IV + * with the counter set to 0. + */ + memset(iv, 0, iv_len); + iv[0] = (15 - AES_CCM_IV_LEN) - 1; + if (op_type == CHCR_ENCRYPT_OP) { + if (crde->crd_flags & CRD_F_IV_EXPLICIT) + memcpy(iv + 1, crde->crd_iv, AES_CCM_IV_LEN); + else + arc4rand(iv + 1, AES_CCM_IV_LEN, 0); + if ((crde->crd_flags & CRD_F_IV_PRESENT) == 0) + crypto_copyback(crp->crp_flags, crp->crp_buf, + crde->crd_inject, AES_CCM_IV_LEN, iv + 1); + } else { + if (crde->crd_flags & CRD_F_IV_EXPLICIT) + memcpy(iv + 1, crde->crd_iv, AES_CCM_IV_LEN); + else + crypto_copydata(crp->crp_flags, crp->crp_buf, + crde->crd_inject, AES_CCM_IV_LEN, iv + 1); + } + + ccr_populate_wreq(sc, crwr, kctx_len, wr_len, imm_len, sgl_len, 0, + crp); + + /* XXX: Hardcodes SGE loopback channel of 0. */ + crwr->sec_cpl.op_ivinsrtofst = htobe32( + V_CPL_TX_SEC_PDU_OPCODE(CPL_TX_SEC_PDU) | + V_CPL_TX_SEC_PDU_RXCHID(sc->tx_channel_id) | + V_CPL_TX_SEC_PDU_ACKFOLLOWS(0) | V_CPL_TX_SEC_PDU_ULPTXLPBK(1) | + V_CPL_TX_SEC_PDU_CPLLEN(2) | V_CPL_TX_SEC_PDU_PLACEHOLDER(0) | + V_CPL_TX_SEC_PDU_IVINSRTOFST(1)); + + crwr->sec_cpl.pldlen = htobe32(iv_len + input_len); + + /* + * NB: cipherstop is explicitly set to 0. See comments above + * in ccr_gcm(). + */ + crwr->sec_cpl.aadstart_cipherstop_hi = htobe32( + V_CPL_TX_SEC_PDU_AADSTART(aad_start) | + V_CPL_TX_SEC_PDU_AADSTOP(aad_stop) | + V_CPL_TX_SEC_PDU_CIPHERSTART(cipher_start) | + V_CPL_TX_SEC_PDU_CIPHERSTOP_HI(0)); + crwr->sec_cpl.cipherstop_lo_authinsert = htobe32( + V_CPL_TX_SEC_PDU_CIPHERSTOP_LO(0) | + V_CPL_TX_SEC_PDU_AUTHSTART(cipher_start) | + V_CPL_TX_SEC_PDU_AUTHSTOP(cipher_stop) | + V_CPL_TX_SEC_PDU_AUTHINSERT(auth_insert)); + + /* These two flits are actually a CPL_TLS_TX_SCMD_FMT. 
*/ + hmac_ctrl = ccr_hmac_ctrl(AES_CBC_MAC_HASH_LEN, hash_size_in_response); + crwr->sec_cpl.seqno_numivs = htobe32( + V_SCMD_SEQ_NO_CTRL(0) | + V_SCMD_PROTO_VERSION(SCMD_PROTO_VERSION_GENERIC) | + V_SCMD_ENC_DEC_CTRL(op_type) | + V_SCMD_CIPH_AUTH_SEQ_CTRL(op_type == CHCR_ENCRYPT_OP ? 0 : 1) | + V_SCMD_CIPH_MODE(SCMD_CIPH_MODE_AES_CCM) | + V_SCMD_AUTH_MODE(SCMD_AUTH_MODE_CBCMAC) | + V_SCMD_HMAC_CTRL(hmac_ctrl) | + V_SCMD_IV_SIZE(iv_len / 2) | + V_SCMD_NUM_IVS(0)); + crwr->sec_cpl.ivgen_hdrlen = htobe32( + V_SCMD_IV_GEN_CTRL(0) | + V_SCMD_MORE_FRAGS(0) | V_SCMD_LAST_FRAG(0) | V_SCMD_MAC_ONLY(0) | + V_SCMD_AADIVDROP(0) | V_SCMD_HDR_LEN(dsgl_len)); + + crwr->key_ctx.ctx_hdr = s->blkcipher.key_ctx_hdr; + memcpy(crwr->key_ctx.key, s->blkcipher.enckey, s->blkcipher.key_len); + memcpy(crwr->key_ctx.key + roundup(s->blkcipher.key_len, 16), + s->blkcipher.enckey, s->blkcipher.key_len); + + dst = (char *)(crwr + 1) + kctx_len; + ccr_write_phys_dsgl(sc, dst, dsgl_nsegs); + dst += sizeof(struct cpl_rx_phys_dsgl) + dsgl_len; + memcpy(dst, iv, iv_len); + dst += iv_len; + generate_ccm_b0(crda, crde, hash_size_in_response, iv, dst); + if (sgl_nsegs == 0) { + dst += b0_len; + if (crda->crd_len != 0) { + crypto_copydata(crp->crp_flags, crp->crp_buf, + crda->crd_skip, crda->crd_len, dst); + dst += crda->crd_len; + } + crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip, + crde->crd_len, dst); + dst += crde->crd_len; + if (op_type == CHCR_DECRYPT_OP) + crypto_copydata(crp->crp_flags, crp->crp_buf, + crda->crd_inject, hash_size_in_response, dst); + } else { + dst += CCM_B0_SIZE; + if (b0_len > CCM_B0_SIZE) { + /* + * If there is AAD, insert padding including a + * ULP_TX_SC_NOOP so that the ULP_TX_SC_DSGL + * is 16-byte aligned. + */ + KASSERT(b0_len - CCM_B0_SIZE == CCM_AAD_FIELD_SIZE, + ("b0_len mismatch")); + memset(dst + CCM_AAD_FIELD_SIZE, 0, + 8 - CCM_AAD_FIELD_SIZE); + idata = (void *)(dst + 8); + idata->cmd_more = htobe32(V_ULPTX_CMD(ULP_TX_SC_NOOP)); + idata->len = htobe32(0); + dst = (void *)(idata + 1); + } + ccr_write_ulptx_sgl(sc, dst, sgl_nsegs); + } + + /* XXX: TODO backpressure */ + t4_wrq_tx(sc->adapter, wr); + + return (0); +} + +static int +ccr_ccm_done(struct ccr_softc *sc, struct ccr_session *s, + struct cryptop *crp, const struct cpl_fw6_pld *cpl, int error) +{ + + /* + * The updated IV to permit chained requests is at + * cpl->data[2], but OCF doesn't permit chained requests. + * + * Note that the hardware should always verify the CBC MAC + * hash. + */ + return (error); +} + +/* + * Handle a CCM request that is not supported by the crypto engine by + * performing the operation in software. Derived from swcr_authenc(). + */ +static void +ccr_ccm_soft(struct ccr_session *s, struct cryptop *crp, + struct cryptodesc *crda, struct cryptodesc *crde) +{ + struct auth_hash *axf; + struct enc_xform *exf; + union authctx *auth_ctx; + uint8_t *kschedule; + char block[CCM_CBC_BLOCK_LEN]; + char digest[AES_CBC_MAC_HASH_LEN]; + char iv[AES_CCM_IV_LEN]; + int error, i, len; + + auth_ctx = NULL; + kschedule = NULL; + + /* Initialize the MAC. 
*/ + switch (s->blkcipher.key_len) { + case 16: + axf = &auth_hash_ccm_cbc_mac_128; + break; + case 24: + axf = &auth_hash_ccm_cbc_mac_192; + break; + case 32: + axf = &auth_hash_ccm_cbc_mac_256; + break; + default: + error = EINVAL; + goto out; + } + auth_ctx = malloc(axf->ctxsize, M_CCR, M_NOWAIT); + if (auth_ctx == NULL) { + error = ENOMEM; + goto out; + } + axf->Init(auth_ctx); + axf->Setkey(auth_ctx, s->blkcipher.enckey, s->blkcipher.key_len); + + /* Initialize the cipher. */ + exf = &enc_xform_ccm; + error = exf->setkey(&kschedule, s->blkcipher.enckey, + s->blkcipher.key_len); + if (error) + goto out; + + if (crde->crd_flags & CRD_F_ENCRYPT) { + if (crde->crd_flags & CRD_F_IV_EXPLICIT) + memcpy(iv, crde->crd_iv, AES_CCM_IV_LEN); + else + arc4rand(iv, AES_CCM_IV_LEN, 0); + if ((crde->crd_flags & CRD_F_IV_PRESENT) == 0) + crypto_copyback(crp->crp_flags, crp->crp_buf, + crde->crd_inject, AES_CCM_IV_LEN, iv); + } else { + if (crde->crd_flags & CRD_F_IV_EXPLICIT) + memcpy(iv, crde->crd_iv, AES_CCM_IV_LEN); + else + crypto_copydata(crp->crp_flags, crp->crp_buf, + crde->crd_inject, AES_CCM_IV_LEN, iv); + } + + auth_ctx->aes_cbc_mac_ctx.authDataLength = crda->crd_len; + auth_ctx->aes_cbc_mac_ctx.cryptDataLength = crde->crd_len; + axf->Reinit(auth_ctx, iv, sizeof(iv)); + + /* MAC the AAD. */ + for (i = 0; i < crda->crd_len; i += sizeof(block)) { + len = imin(crda->crd_len - i, sizeof(block)); + crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_skip + + i, len, block); + bzero(block + len, sizeof(block) - len); + axf->Update(auth_ctx, block, sizeof(block)); + } + + exf->reinit(kschedule, iv); + + /* Do encryption/decryption with MAC */ + for (i = 0; i < crde->crd_len; i += sizeof(block)) { + len = imin(crde->crd_len - i, sizeof(block)); + crypto_copydata(crp->crp_flags, crp->crp_buf, crde->crd_skip + + i, len, block); + bzero(block + len, sizeof(block) - len); + if (crde->crd_flags & CRD_F_ENCRYPT) { + axf->Update(auth_ctx, block, len); + exf->encrypt(kschedule, block); + crypto_copyback(crp->crp_flags, crp->crp_buf, + crde->crd_skip + i, len, block); + } else { + exf->decrypt(kschedule, block); + axf->Update(auth_ctx, block, len); + } + } + + /* Finalize MAC. */ + axf->Final(digest, auth_ctx); + + /* Inject or validate tag. */ + if (crde->crd_flags & CRD_F_ENCRYPT) { + crypto_copyback(crp->crp_flags, crp->crp_buf, crda->crd_inject, + sizeof(digest), digest); + error = 0; + } else { + char digest2[GMAC_DIGEST_LEN]; + + crypto_copydata(crp->crp_flags, crp->crp_buf, crda->crd_inject, + sizeof(digest2), digest2); + if (timingsafe_bcmp(digest, digest2, sizeof(digest)) == 0) { + error = 0; + + /* Tag matches, decrypt data. 
*/ + exf->reinit(kschedule, iv); + for (i = 0; i < crde->crd_len; i += sizeof(block)) { + len = imin(crde->crd_len - i, sizeof(block)); + crypto_copydata(crp->crp_flags, crp->crp_buf, + crde->crd_skip + i, len, block); + bzero(block + len, sizeof(block) - len); + exf->decrypt(kschedule, block); + crypto_copyback(crp->crp_flags, crp->crp_buf, + crde->crd_skip + i, len, block); + } + } else + error = EBADMSG; + } + + exf->zerokey(&kschedule); +out: + if (auth_ctx != NULL) { + memset(auth_ctx, 0, axf->ctxsize); + free(auth_ctx, M_CCR); + } + crp->crp_etype = error; + crypto_done(crp); +} + +static void ccr_identify(driver_t *driver, device_t parent) { struct adapter *sc; sc = device_get_softc(parent); if (sc->cryptocaps & FW_CAPS_CONFIG_CRYPTO_LOOKASIDE && device_find_child(parent, "ccr", -1) == NULL) device_add_child(parent, "ccr", -1); } static int ccr_probe(device_t dev) { device_set_desc(dev, "Chelsio Crypto Accelerator"); return (BUS_PROBE_DEFAULT); } static void ccr_sysctls(struct ccr_softc *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid *oid; struct sysctl_oid_list *children; ctx = device_get_sysctl_ctx(sc->dev); /* * dev.ccr.X. */ oid = device_get_sysctl_tree(sc->dev); children = SYSCTL_CHILDREN(oid); /* * dev.ccr.X.stats. */ oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "stats", CTLFLAG_RD, NULL, "statistics"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "hash", CTLFLAG_RD, &sc->stats_hash, 0, "Hash requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "hmac", CTLFLAG_RD, &sc->stats_hmac, 0, "HMAC requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "cipher_encrypt", CTLFLAG_RD, &sc->stats_blkcipher_encrypt, 0, "Cipher encryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "cipher_decrypt", CTLFLAG_RD, &sc->stats_blkcipher_decrypt, 0, "Cipher decryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "authenc_encrypt", CTLFLAG_RD, &sc->stats_authenc_encrypt, 0, "Combined AES+HMAC encryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "authenc_decrypt", CTLFLAG_RD, &sc->stats_authenc_decrypt, 0, "Combined AES+HMAC decryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "gcm_encrypt", CTLFLAG_RD, &sc->stats_gcm_encrypt, 0, "AES-GCM encryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "gcm_decrypt", CTLFLAG_RD, &sc->stats_gcm_decrypt, 0, "AES-GCM decryption requests submitted"); + SYSCTL_ADD_U64(ctx, children, OID_AUTO, "ccm_encrypt", CTLFLAG_RD, + &sc->stats_ccm_encrypt, 0, "AES-CCM encryption requests submitted"); + SYSCTL_ADD_U64(ctx, children, OID_AUTO, "ccm_decrypt", CTLFLAG_RD, + &sc->stats_ccm_decrypt, 0, "AES-CCM decryption requests submitted"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "wr_nomem", CTLFLAG_RD, &sc->stats_wr_nomem, 0, "Work request memory allocation failures"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "inflight", CTLFLAG_RD, &sc->stats_inflight, 0, "Requests currently pending"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "mac_error", CTLFLAG_RD, &sc->stats_mac_error, 0, "MAC errors"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "pad_error", CTLFLAG_RD, &sc->stats_pad_error, 0, "Padding errors"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "bad_session", CTLFLAG_RD, &sc->stats_bad_session, 0, "Requests with invalid session ID"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "sglist_error", CTLFLAG_RD, &sc->stats_sglist_error, 0, "Requests for which DMA mapping failed"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "process_error", CTLFLAG_RD, 
&sc->stats_process_error, 0, "Requests failed during queueing"); SYSCTL_ADD_U64(ctx, children, OID_AUTO, "sw_fallback", CTLFLAG_RD, &sc->stats_sw_fallback, 0, "Requests processed by falling back to software"); } static int ccr_attach(device_t dev) { struct ccr_softc *sc; int32_t cid; sc = device_get_softc(dev); sc->dev = dev; sc->adapter = device_get_softc(device_get_parent(dev)); sc->txq = &sc->adapter->sge.ctrlq[0]; sc->rxq = &sc->adapter->sge.rxq[0]; cid = crypto_get_driverid(dev, sizeof(struct ccr_session), CRYPTOCAP_F_HARDWARE); if (cid < 0) { device_printf(dev, "could not get crypto driver id\n"); return (ENXIO); } sc->cid = cid; sc->adapter->ccr_softc = sc; /* XXX: TODO? */ sc->tx_channel_id = 0; mtx_init(&sc->lock, "ccr", NULL, MTX_DEF); sc->sg_crp = sglist_alloc(TX_SGL_SEGS, M_WAITOK); sc->sg_ulptx = sglist_alloc(TX_SGL_SEGS, M_WAITOK); sc->sg_dsgl = sglist_alloc(MAX_RX_PHYS_DSGL_SGE, M_WAITOK); sc->iv_aad_buf = malloc(MAX_AAD_LEN, M_CCR, M_WAITOK); sc->sg_iv_aad = sglist_build(sc->iv_aad_buf, MAX_AAD_LEN, M_WAITOK); ccr_sysctls(sc); crypto_register(cid, CRYPTO_SHA1, 0, 0); crypto_register(cid, CRYPTO_SHA2_224, 0, 0); crypto_register(cid, CRYPTO_SHA2_256, 0, 0); crypto_register(cid, CRYPTO_SHA2_384, 0, 0); crypto_register(cid, CRYPTO_SHA2_512, 0, 0); crypto_register(cid, CRYPTO_SHA1_HMAC, 0, 0); crypto_register(cid, CRYPTO_SHA2_224_HMAC, 0, 0); crypto_register(cid, CRYPTO_SHA2_256_HMAC, 0, 0); crypto_register(cid, CRYPTO_SHA2_384_HMAC, 0, 0); crypto_register(cid, CRYPTO_SHA2_512_HMAC, 0, 0); crypto_register(cid, CRYPTO_AES_CBC, 0, 0); crypto_register(cid, CRYPTO_AES_ICM, 0, 0); crypto_register(cid, CRYPTO_AES_NIST_GCM_16, 0, 0); crypto_register(cid, CRYPTO_AES_128_NIST_GMAC, 0, 0); crypto_register(cid, CRYPTO_AES_192_NIST_GMAC, 0, 0); crypto_register(cid, CRYPTO_AES_256_NIST_GMAC, 0, 0); crypto_register(cid, CRYPTO_AES_XTS, 0, 0); + crypto_register(cid, CRYPTO_AES_CCM_16, 0, 0); + crypto_register(cid, CRYPTO_AES_CCM_CBC_MAC, 0, 0); return (0); } static int ccr_detach(device_t dev) { struct ccr_softc *sc; sc = device_get_softc(dev); mtx_lock(&sc->lock); sc->detaching = true; mtx_unlock(&sc->lock); crypto_unregister_all(sc->cid); mtx_destroy(&sc->lock); sglist_free(sc->sg_iv_aad); free(sc->iv_aad_buf, M_CCR); sglist_free(sc->sg_dsgl); sglist_free(sc->sg_ulptx); sglist_free(sc->sg_crp); sc->adapter->ccr_softc = NULL; return (0); } static void ccr_copy_partial_hash(void *dst, int cri_alg, union authctx *auth_ctx) { uint32_t *u32; uint64_t *u64; u_int i; u32 = (uint32_t *)dst; u64 = (uint64_t *)dst; switch (cri_alg) { case CRYPTO_SHA1: case CRYPTO_SHA1_HMAC: for (i = 0; i < SHA1_HASH_LEN / 4; i++) u32[i] = htobe32(auth_ctx->sha1ctx.h.b32[i]); break; case CRYPTO_SHA2_224: case CRYPTO_SHA2_224_HMAC: for (i = 0; i < SHA2_256_HASH_LEN / 4; i++) u32[i] = htobe32(auth_ctx->sha224ctx.state[i]); break; case CRYPTO_SHA2_256: case CRYPTO_SHA2_256_HMAC: for (i = 0; i < SHA2_256_HASH_LEN / 4; i++) u32[i] = htobe32(auth_ctx->sha256ctx.state[i]); break; case CRYPTO_SHA2_384: case CRYPTO_SHA2_384_HMAC: for (i = 0; i < SHA2_512_HASH_LEN / 8; i++) u64[i] = htobe64(auth_ctx->sha384ctx.state[i]); break; case CRYPTO_SHA2_512: case CRYPTO_SHA2_512_HMAC: for (i = 0; i < SHA2_512_HASH_LEN / 8; i++) u64[i] = htobe64(auth_ctx->sha512ctx.state[i]); break; } } static void ccr_init_hash_digest(struct ccr_session *s, int cri_alg) { union authctx auth_ctx; struct auth_hash *axf; axf = s->hmac.auth_hash; axf->Init(&auth_ctx); ccr_copy_partial_hash(s->hmac.ipad, cri_alg, &auth_ctx); } static void 
ccr_init_hmac_digest(struct ccr_session *s, int cri_alg, char *key, int klen) { union authctx auth_ctx; struct auth_hash *axf; u_int i; /* * If the key is larger than the block size, use the digest of * the key as the key instead. */ axf = s->hmac.auth_hash; klen /= 8; if (klen > axf->blocksize) { axf->Init(&auth_ctx); axf->Update(&auth_ctx, key, klen); axf->Final(s->hmac.ipad, &auth_ctx); klen = axf->hashsize; } else memcpy(s->hmac.ipad, key, klen); memset(s->hmac.ipad + klen, 0, axf->blocksize - klen); memcpy(s->hmac.opad, s->hmac.ipad, axf->blocksize); for (i = 0; i < axf->blocksize; i++) { s->hmac.ipad[i] ^= HMAC_IPAD_VAL; s->hmac.opad[i] ^= HMAC_OPAD_VAL; } /* * Hash the raw ipad and opad and store the partial result in * the same buffer. */ axf->Init(&auth_ctx); axf->Update(&auth_ctx, s->hmac.ipad, axf->blocksize); ccr_copy_partial_hash(s->hmac.ipad, cri_alg, &auth_ctx); axf->Init(&auth_ctx); axf->Update(&auth_ctx, s->hmac.opad, axf->blocksize); ccr_copy_partial_hash(s->hmac.opad, cri_alg, &auth_ctx); } /* * Borrowed from AES_GMAC_Setkey(). */ static void ccr_init_gmac_hash(struct ccr_session *s, char *key, int klen) { static char zeroes[GMAC_BLOCK_LEN]; uint32_t keysched[4 * (RIJNDAEL_MAXNR + 1)]; int rounds; rounds = rijndaelKeySetupEnc(keysched, key, klen); rijndaelEncrypt(keysched, rounds, zeroes, s->gmac.ghash_h); } static int ccr_aes_check_keylen(int alg, int klen) { switch (klen) { case 128: case 192: if (alg == CRYPTO_AES_XTS) return (EINVAL); break; case 256: break; case 512: if (alg != CRYPTO_AES_XTS) return (EINVAL); break; default: return (EINVAL); } return (0); } static void ccr_aes_setkey(struct ccr_session *s, int alg, const void *key, int klen) { unsigned int ck_size, iopad_size, kctx_flits, kctx_len, kbits, mk_size; unsigned int opad_present; if (alg == CRYPTO_AES_XTS) kbits = klen / 2; else kbits = klen; switch (kbits) { case 128: ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_128; break; case 192: ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_192; break; case 256: ck_size = CHCR_KEYCTX_CIPHER_KEY_SIZE_256; break; default: panic("should not get here"); } s->blkcipher.key_len = klen / 8; memcpy(s->blkcipher.enckey, key, s->blkcipher.key_len); switch (alg) { case CRYPTO_AES_CBC: case CRYPTO_AES_XTS: t4_aes_getdeckey(s->blkcipher.deckey, key, kbits); break; } kctx_len = roundup2(s->blkcipher.key_len, 16); switch (s->mode) { case AUTHENC: mk_size = s->hmac.mk_size; opad_present = 1; iopad_size = roundup2(s->hmac.partial_digest_len, 16); kctx_len += iopad_size * 2; break; case GCM: mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_128; opad_present = 0; kctx_len += GMAC_BLOCK_LEN; break; + case CCM: + switch (kbits) { + case 128: + mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_128; + break; + case 192: + mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_192; + break; + case 256: + mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_256; + break; + default: + panic("should not get here"); + } + opad_present = 0; + kctx_len *= 2; + break; default: mk_size = CHCR_KEYCTX_NO_KEY; opad_present = 0; break; } kctx_flits = (sizeof(struct _key_ctx) + kctx_len) / 16; s->blkcipher.key_ctx_hdr = htobe32(V_KEY_CONTEXT_CTX_LEN(kctx_flits) | V_KEY_CONTEXT_DUAL_CK(alg == CRYPTO_AES_XTS) | V_KEY_CONTEXT_OPAD_PRESENT(opad_present) | V_KEY_CONTEXT_SALT_PRESENT(1) | V_KEY_CONTEXT_CK_SIZE(ck_size) | V_KEY_CONTEXT_MK_SIZE(mk_size) | V_KEY_CONTEXT_VALID(1)); } static int ccr_newsession(device_t dev, crypto_session_t cses, struct cryptoini *cri) { struct ccr_softc *sc; struct ccr_session *s; struct auth_hash *auth_hash; struct cryptoini *c, *hash, *cipher; unsigned int 
auth_mode, cipher_mode, iv_len, mk_size; unsigned int partial_digest_len; int error; bool gcm_hash, hmac; if (cri == NULL) return (EINVAL); gcm_hash = false; hmac = false; cipher = NULL; hash = NULL; auth_hash = NULL; auth_mode = SCMD_AUTH_MODE_NOP; cipher_mode = SCMD_CIPH_MODE_NOP; iv_len = 0; mk_size = 0; partial_digest_len = 0; for (c = cri; c != NULL; c = c->cri_next) { switch (c->cri_alg) { case CRYPTO_SHA1: case CRYPTO_SHA2_224: case CRYPTO_SHA2_256: case CRYPTO_SHA2_384: case CRYPTO_SHA2_512: case CRYPTO_SHA1_HMAC: case CRYPTO_SHA2_224_HMAC: case CRYPTO_SHA2_256_HMAC: case CRYPTO_SHA2_384_HMAC: case CRYPTO_SHA2_512_HMAC: case CRYPTO_AES_128_NIST_GMAC: case CRYPTO_AES_192_NIST_GMAC: case CRYPTO_AES_256_NIST_GMAC: + case CRYPTO_AES_CCM_CBC_MAC: if (hash) return (EINVAL); hash = c; switch (c->cri_alg) { case CRYPTO_SHA1: case CRYPTO_SHA1_HMAC: auth_hash = &auth_hash_hmac_sha1; auth_mode = SCMD_AUTH_MODE_SHA1; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_160; partial_digest_len = SHA1_HASH_LEN; break; case CRYPTO_SHA2_224: case CRYPTO_SHA2_224_HMAC: auth_hash = &auth_hash_hmac_sha2_224; auth_mode = SCMD_AUTH_MODE_SHA224; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_256; partial_digest_len = SHA2_256_HASH_LEN; break; case CRYPTO_SHA2_256: case CRYPTO_SHA2_256_HMAC: auth_hash = &auth_hash_hmac_sha2_256; auth_mode = SCMD_AUTH_MODE_SHA256; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_256; partial_digest_len = SHA2_256_HASH_LEN; break; case CRYPTO_SHA2_384: case CRYPTO_SHA2_384_HMAC: auth_hash = &auth_hash_hmac_sha2_384; auth_mode = SCMD_AUTH_MODE_SHA512_384; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_512; partial_digest_len = SHA2_512_HASH_LEN; break; case CRYPTO_SHA2_512: case CRYPTO_SHA2_512_HMAC: auth_hash = &auth_hash_hmac_sha2_512; auth_mode = SCMD_AUTH_MODE_SHA512_512; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_512; partial_digest_len = SHA2_512_HASH_LEN; break; case CRYPTO_AES_128_NIST_GMAC: case CRYPTO_AES_192_NIST_GMAC: case CRYPTO_AES_256_NIST_GMAC: gcm_hash = true; auth_mode = SCMD_AUTH_MODE_GHASH; mk_size = CHCR_KEYCTX_MAC_KEY_SIZE_128; break; + case CRYPTO_AES_CCM_CBC_MAC: + auth_mode = SCMD_AUTH_MODE_CBCMAC; + break; } switch (c->cri_alg) { case CRYPTO_SHA1_HMAC: case CRYPTO_SHA2_224_HMAC: case CRYPTO_SHA2_256_HMAC: case CRYPTO_SHA2_384_HMAC: case CRYPTO_SHA2_512_HMAC: hmac = true; break; } break; case CRYPTO_AES_CBC: case CRYPTO_AES_ICM: case CRYPTO_AES_NIST_GCM_16: case CRYPTO_AES_XTS: + case CRYPTO_AES_CCM_16: if (cipher) return (EINVAL); cipher = c; switch (c->cri_alg) { case CRYPTO_AES_CBC: cipher_mode = SCMD_CIPH_MODE_AES_CBC; iv_len = AES_BLOCK_LEN; break; case CRYPTO_AES_ICM: cipher_mode = SCMD_CIPH_MODE_AES_CTR; iv_len = AES_BLOCK_LEN; break; case CRYPTO_AES_NIST_GCM_16: cipher_mode = SCMD_CIPH_MODE_AES_GCM; iv_len = AES_GCM_IV_LEN; break; case CRYPTO_AES_XTS: cipher_mode = SCMD_CIPH_MODE_AES_XTS; iv_len = AES_BLOCK_LEN; break; + case CRYPTO_AES_CCM_16: + cipher_mode = SCMD_CIPH_MODE_AES_CCM; + iv_len = AES_CCM_IV_LEN; + break; } if (c->cri_key != NULL) { error = ccr_aes_check_keylen(c->cri_alg, c->cri_klen); if (error) return (error); } break; default: return (EINVAL); } } if (gcm_hash != (cipher_mode == SCMD_CIPH_MODE_AES_GCM)) return (EINVAL); + if ((auth_mode == SCMD_AUTH_MODE_CBCMAC) != + (cipher_mode == SCMD_CIPH_MODE_AES_CCM)) + return (EINVAL); if (hash == NULL && cipher == NULL) return (EINVAL); if (hash != NULL) { - if ((hmac || gcm_hash) && hash->cri_key == NULL) - return (EINVAL); - if (!(hmac || gcm_hash) && hash->cri_key != NULL) - return (EINVAL); + if (hmac || gcm_hash || auth_mode == 
SCMD_AUTH_MODE_CBCMAC) { + if (hash->cri_key == NULL) + return (EINVAL); + } else { + if (hash->cri_key != NULL) + return (EINVAL); + } } sc = device_get_softc(dev); /* * XXX: Don't create a session if the queues aren't * initialized. This is racy as the rxq can be destroyed by * the associated VI detaching. Eventually ccr should use * dedicated queues. */ if (sc->rxq->iq.adapter == NULL || sc->txq->adapter == NULL) return (ENXIO); mtx_lock(&sc->lock); if (sc->detaching) { mtx_unlock(&sc->lock); return (ENXIO); } s = crypto_get_driver_session(cses); if (gcm_hash) s->mode = GCM; + else if (cipher_mode == SCMD_CIPH_MODE_AES_CCM) + s->mode = CCM; else if (hash != NULL && cipher != NULL) s->mode = AUTHENC; else if (hash != NULL) { if (hmac) s->mode = HMAC; else s->mode = HASH; } else { MPASS(cipher != NULL); s->mode = BLKCIPHER; } if (gcm_hash) { if (hash->cri_mlen == 0) s->gmac.hash_len = AES_GMAC_HASH_LEN; else s->gmac.hash_len = hash->cri_mlen; ccr_init_gmac_hash(s, hash->cri_key, hash->cri_klen); + } else if (auth_mode == SCMD_AUTH_MODE_CBCMAC) { + if (hash->cri_mlen == 0) + s->ccm_mac.hash_len = AES_CBC_MAC_HASH_LEN; + else + s->ccm_mac.hash_len = hash->cri_mlen; } else if (hash != NULL) { s->hmac.auth_hash = auth_hash; s->hmac.auth_mode = auth_mode; s->hmac.mk_size = mk_size; s->hmac.partial_digest_len = partial_digest_len; if (hash->cri_mlen == 0) s->hmac.hash_len = auth_hash->hashsize; else s->hmac.hash_len = hash->cri_mlen; if (hmac) ccr_init_hmac_digest(s, hash->cri_alg, hash->cri_key, hash->cri_klen); else ccr_init_hash_digest(s, hash->cri_alg); } if (cipher != NULL) { s->blkcipher.cipher_mode = cipher_mode; s->blkcipher.iv_len = iv_len; if (cipher->cri_key != NULL) ccr_aes_setkey(s, cipher->cri_alg, cipher->cri_key, cipher->cri_klen); } s->active = true; mtx_unlock(&sc->lock); return (0); } static void ccr_freesession(device_t dev, crypto_session_t cses) { struct ccr_softc *sc; struct ccr_session *s; sc = device_get_softc(dev); s = crypto_get_driver_session(cses); mtx_lock(&sc->lock); if (s->pending != 0) device_printf(dev, "session %p freed with %d pending requests\n", s, s->pending); s->active = false; mtx_unlock(&sc->lock); } static int ccr_process(device_t dev, struct cryptop *crp, int hint) { struct ccr_softc *sc; struct ccr_session *s; struct cryptodesc *crd, *crda, *crde; int error; if (crp == NULL) return (EINVAL); crd = crp->crp_desc; s = crypto_get_driver_session(crp->crp_session); sc = device_get_softc(dev); mtx_lock(&sc->lock); error = ccr_populate_sglist(sc->sg_crp, crp); if (error) { sc->stats_sglist_error++; goto out; } switch (s->mode) { case HASH: error = ccr_hash(sc, s, crp); if (error == 0) sc->stats_hash++; break; case HMAC: if (crd->crd_flags & CRD_F_KEY_EXPLICIT) ccr_init_hmac_digest(s, crd->crd_alg, crd->crd_key, crd->crd_klen); error = ccr_hash(sc, s, crp); if (error == 0) sc->stats_hmac++; break; case BLKCIPHER: if (crd->crd_flags & CRD_F_KEY_EXPLICIT) { error = ccr_aes_check_keylen(crd->crd_alg, crd->crd_klen); if (error) break; ccr_aes_setkey(s, crd->crd_alg, crd->crd_key, crd->crd_klen); } error = ccr_blkcipher(sc, s, crp); if (error == 0) { if (crd->crd_flags & CRD_F_ENCRYPT) sc->stats_blkcipher_encrypt++; else sc->stats_blkcipher_decrypt++; } break; case AUTHENC: error = 0; switch (crd->crd_alg) { case CRYPTO_AES_CBC: case CRYPTO_AES_ICM: case CRYPTO_AES_XTS: /* Only encrypt-then-authenticate supported. 
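For reference, the session that ccr_newsession() accepts for the new CCM mode pairs a CRYPTO_AES_CCM_16 cipher descriptor with a CRYPTO_AES_CCM_CBC_MAC hash descriptor on the cryptoini chain. A minimal consumer-side sketch against the cryptoini-era opencrypto KPI used on this branch; the function and variable names are illustrative, and requesting a hardware driver via the crid argument is an assumption, not something this diff mandates:

#include <sys/param.h>
#include <sys/systm.h>
#include <opencrypto/cryptodev.h>

/* Open an AES-128-CCM session; 'key' is an illustrative 16-byte AES key. */
static int
example_ccm_newsession(const uint8_t *key, crypto_session_t *csesp)
{
	struct cryptoini crie, cria;

	memset(&crie, 0, sizeof(crie));
	crie.cri_alg = CRYPTO_AES_CCM_16;
	crie.cri_klen = 128;			/* key length in bits */
	crie.cri_key = __DECONST(caddr_t, key);

	memset(&cria, 0, sizeof(cria));
	cria.cri_alg = CRYPTO_AES_CCM_CBC_MAC;
	cria.cri_klen = 128;			/* CBC-MAC reuses the AES key */
	cria.cri_key = __DECONST(caddr_t, key);
	cria.cri_mlen = 0;			/* 0 selects the full-length tag */

	crie.cri_next = &cria;
	return (crypto_newsession(csesp, &crie, CRYPTOCAP_F_HARDWARE));
}

A request on such a session then carries two chained cryptodesc entries, matching what ccr_process() and ccr_ccm() expect: one for the payload (CRYPTO_AES_CCM_16, with crd_skip/crd_len covering the data and crd_inject giving the nonce location in the buffer) and one for the AAD and tag (CRYPTO_AES_CCM_CBC_MAC, with crd_skip/crd_len covering the AAD and crd_inject giving the tag location).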
*/ crde = crd; crda = crd->crd_next; if (!(crde->crd_flags & CRD_F_ENCRYPT)) { error = EINVAL; break; } break; default: crda = crd; crde = crd->crd_next; if (crde->crd_flags & CRD_F_ENCRYPT) { error = EINVAL; break; } break; } if (error) break; if (crda->crd_flags & CRD_F_KEY_EXPLICIT) ccr_init_hmac_digest(s, crda->crd_alg, crda->crd_key, crda->crd_klen); if (crde->crd_flags & CRD_F_KEY_EXPLICIT) { error = ccr_aes_check_keylen(crde->crd_alg, crde->crd_klen); if (error) break; ccr_aes_setkey(s, crde->crd_alg, crde->crd_key, crde->crd_klen); } error = ccr_authenc(sc, s, crp, crda, crde); if (error == 0) { if (crde->crd_flags & CRD_F_ENCRYPT) sc->stats_authenc_encrypt++; else sc->stats_authenc_decrypt++; } break; case GCM: error = 0; if (crd->crd_alg == CRYPTO_AES_NIST_GCM_16) { crde = crd; crda = crd->crd_next; } else { crda = crd; crde = crd->crd_next; } if (crda->crd_flags & CRD_F_KEY_EXPLICIT) ccr_init_gmac_hash(s, crda->crd_key, crda->crd_klen); if (crde->crd_flags & CRD_F_KEY_EXPLICIT) { error = ccr_aes_check_keylen(crde->crd_alg, crde->crd_klen); if (error) break; ccr_aes_setkey(s, crde->crd_alg, crde->crd_key, crde->crd_klen); } if (crde->crd_len == 0) { mtx_unlock(&sc->lock); ccr_gcm_soft(s, crp, crda, crde); return (0); } error = ccr_gcm(sc, s, crp, crda, crde); if (error == EMSGSIZE) { sc->stats_sw_fallback++; mtx_unlock(&sc->lock); ccr_gcm_soft(s, crp, crda, crde); return (0); } if (error == 0) { if (crde->crd_flags & CRD_F_ENCRYPT) sc->stats_gcm_encrypt++; else sc->stats_gcm_decrypt++; } break; + case CCM: + error = 0; + if (crd->crd_alg == CRYPTO_AES_CCM_16) { + crde = crd; + crda = crd->crd_next; + } else { + crda = crd; + crde = crd->crd_next; + } + if (crde->crd_flags & CRD_F_KEY_EXPLICIT) { + error = ccr_aes_check_keylen(crde->crd_alg, + crde->crd_klen); + if (error) + break; + ccr_aes_setkey(s, crde->crd_alg, crde->crd_key, + crde->crd_klen); + } + error = ccr_ccm(sc, s, crp, crda, crde); + if (error == EMSGSIZE) { + sc->stats_sw_fallback++; + mtx_unlock(&sc->lock); + ccr_ccm_soft(s, crp, crda, crde); + return (0); + } + if (error == 0) { + if (crde->crd_flags & CRD_F_ENCRYPT) + sc->stats_ccm_encrypt++; + else + sc->stats_ccm_decrypt++; + } + break; } if (error == 0) { s->pending++; sc->stats_inflight++; } else sc->stats_process_error++; out: mtx_unlock(&sc->lock); if (error) { crp->crp_etype = error; crypto_done(crp); } return (0); } static int do_cpl6_fw_pld(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { struct ccr_softc *sc = iq->adapter->ccr_softc; struct ccr_session *s; const struct cpl_fw6_pld *cpl; struct cryptop *crp; uint32_t status; int error; if (m != NULL) cpl = mtod(m, const void *); else cpl = (const void *)(rss + 1); crp = (struct cryptop *)(uintptr_t)be64toh(cpl->data[1]); s = crypto_get_driver_session(crp->crp_session); status = be64toh(cpl->data[0]); if (CHK_MAC_ERR_BIT(status) || CHK_PAD_ERR_BIT(status)) error = EBADMSG; else error = 0; mtx_lock(&sc->lock); s->pending--; sc->stats_inflight--; switch (s->mode) { case HASH: case HMAC: error = ccr_hash_done(sc, s, crp, cpl, error); break; case BLKCIPHER: error = ccr_blkcipher_done(sc, s, crp, cpl, error); break; case AUTHENC: error = ccr_authenc_done(sc, s, crp, cpl, error); break; case GCM: error = ccr_gcm_done(sc, s, crp, cpl, error); + break; + case CCM: + error = ccr_ccm_done(sc, s, crp, cpl, error); break; } if (error == EBADMSG) { if (CHK_MAC_ERR_BIT(status)) sc->stats_mac_error++; if (CHK_PAD_ERR_BIT(status)) sc->stats_pad_error++; } mtx_unlock(&sc->lock); crp->crp_etype = error; 
crypto_done(crp); m_freem(m); return (0); } static int ccr_modevent(module_t mod, int cmd, void *arg) { switch (cmd) { case MOD_LOAD: t4_register_cpl_handler(CPL_FW6_PLD, do_cpl6_fw_pld); return (0); case MOD_UNLOAD: t4_register_cpl_handler(CPL_FW6_PLD, NULL); return (0); default: return (EOPNOTSUPP); } } static device_method_t ccr_methods[] = { DEVMETHOD(device_identify, ccr_identify), DEVMETHOD(device_probe, ccr_probe), DEVMETHOD(device_attach, ccr_attach), DEVMETHOD(device_detach, ccr_detach), DEVMETHOD(cryptodev_newsession, ccr_newsession), DEVMETHOD(cryptodev_freesession, ccr_freesession), DEVMETHOD(cryptodev_process, ccr_process), DEVMETHOD_END }; static driver_t ccr_driver = { "ccr", ccr_methods, sizeof(struct ccr_softc) }; static devclass_t ccr_devclass; DRIVER_MODULE(ccr, t6nex, ccr_driver, ccr_devclass, ccr_modevent, NULL); MODULE_VERSION(ccr, 1); MODULE_DEPEND(ccr, crypto, 1, 1, 1); MODULE_DEPEND(ccr, t6nex, 1, 1, 1); Index: user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.h =================================================================== --- user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.h (revision 346925) +++ user/ngie/bug-237403/sys/dev/cxgbe/crypto/t4_crypto.h (revision 346926) @@ -1,191 +1,194 @@ /*- * Copyright (c) 2017 Chelsio Communications, Inc. * All rights reserved. * Written by: John Baldwin * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #ifndef __T4_CRYPTO_H__ #define __T4_CRYPTO_H__ /* From chr_core.h */ #define PAD_ERROR_BIT 1 #define CHK_PAD_ERR_BIT(x) (((x) >> PAD_ERROR_BIT) & 1) #define MAC_ERROR_BIT 0 #define CHK_MAC_ERR_BIT(x) (((x) >> MAC_ERROR_BIT) & 1) #define MAX_SALT 4 struct _key_ctx { __be32 ctx_hdr; u8 salt[MAX_SALT]; __be64 reserverd; unsigned char key[0]; }; struct chcr_wr { struct fw_crypto_lookaside_wr wreq; struct ulp_txpkt ulptx; struct ulptx_idata sc_imm; struct cpl_tx_sec_pdu sec_cpl; struct _key_ctx key_ctx; }; /* From chr_algo.h */ /* Crypto key context */ #define S_KEY_CONTEXT_CTX_LEN 24 #define M_KEY_CONTEXT_CTX_LEN 0xff #define V_KEY_CONTEXT_CTX_LEN(x) ((x) << S_KEY_CONTEXT_CTX_LEN) #define G_KEY_CONTEXT_CTX_LEN(x) \ (((x) >> S_KEY_CONTEXT_CTX_LEN) & M_KEY_CONTEXT_CTX_LEN) #define S_KEY_CONTEXT_DUAL_CK 12 #define M_KEY_CONTEXT_DUAL_CK 0x1 #define V_KEY_CONTEXT_DUAL_CK(x) ((x) << S_KEY_CONTEXT_DUAL_CK) #define G_KEY_CONTEXT_DUAL_CK(x) \ (((x) >> S_KEY_CONTEXT_DUAL_CK) & M_KEY_CONTEXT_DUAL_CK) #define F_KEY_CONTEXT_DUAL_CK V_KEY_CONTEXT_DUAL_CK(1U) #define S_KEY_CONTEXT_OPAD_PRESENT 11 #define M_KEY_CONTEXT_OPAD_PRESENT 0x1 #define V_KEY_CONTEXT_OPAD_PRESENT(x) ((x) << S_KEY_CONTEXT_OPAD_PRESENT) #define G_KEY_CONTEXT_OPAD_PRESENT(x) \ (((x) >> S_KEY_CONTEXT_OPAD_PRESENT) & \ M_KEY_CONTEXT_OPAD_PRESENT) #define F_KEY_CONTEXT_OPAD_PRESENT V_KEY_CONTEXT_OPAD_PRESENT(1U) #define S_KEY_CONTEXT_SALT_PRESENT 10 #define M_KEY_CONTEXT_SALT_PRESENT 0x1 #define V_KEY_CONTEXT_SALT_PRESENT(x) ((x) << S_KEY_CONTEXT_SALT_PRESENT) #define G_KEY_CONTEXT_SALT_PRESENT(x) \ (((x) >> S_KEY_CONTEXT_SALT_PRESENT) & \ M_KEY_CONTEXT_SALT_PRESENT) #define F_KEY_CONTEXT_SALT_PRESENT V_KEY_CONTEXT_SALT_PRESENT(1U) #define S_KEY_CONTEXT_CK_SIZE 6 #define M_KEY_CONTEXT_CK_SIZE 0xf #define V_KEY_CONTEXT_CK_SIZE(x) ((x) << S_KEY_CONTEXT_CK_SIZE) #define G_KEY_CONTEXT_CK_SIZE(x) \ (((x) >> S_KEY_CONTEXT_CK_SIZE) & M_KEY_CONTEXT_CK_SIZE) #define S_KEY_CONTEXT_MK_SIZE 2 #define M_KEY_CONTEXT_MK_SIZE 0xf #define V_KEY_CONTEXT_MK_SIZE(x) ((x) << S_KEY_CONTEXT_MK_SIZE) #define G_KEY_CONTEXT_MK_SIZE(x) \ (((x) >> S_KEY_CONTEXT_MK_SIZE) & M_KEY_CONTEXT_MK_SIZE) #define S_KEY_CONTEXT_VALID 0 #define M_KEY_CONTEXT_VALID 0x1 #define V_KEY_CONTEXT_VALID(x) ((x) << S_KEY_CONTEXT_VALID) #define G_KEY_CONTEXT_VALID(x) \ (((x) >> S_KEY_CONTEXT_VALID) & \ M_KEY_CONTEXT_VALID) #define F_KEY_CONTEXT_VALID V_KEY_CONTEXT_VALID(1U) #define CHCR_HASH_MAX_DIGEST_SIZE 64 #define DUMMY_BYTES 16 #define TRANSHDR_SIZE(kctx_len)\ (sizeof(struct chcr_wr) +\ kctx_len) #define CIPHER_TRANSHDR_SIZE(kctx_len, sge_pairs) \ (TRANSHDR_SIZE((kctx_len)) + (sge_pairs) +\ sizeof(struct cpl_rx_phys_dsgl)) #define HASH_TRANSHDR_SIZE(kctx_len)\ (TRANSHDR_SIZE(kctx_len) + DUMMY_BYTES) #define CRYPTO_MAX_IMM_TX_PKT_LEN 256 struct phys_sge_pairs { __be16 len[8]; __be64 addr[8]; }; /* From chr_crypto.h */ +#define CCM_B0_SIZE 16 +#define CCM_AAD_FIELD_SIZE 2 + #define CHCR_AES_MAX_KEY_LEN (AES_XTS_MAX_KEY) #define CHCR_MAX_CRYPTO_IV_LEN 16 /* AES IV len */ #define CHCR_ENCRYPT_OP 0 #define CHCR_DECRYPT_OP 1 #define SCMD_ENCDECCTRL_ENCRYPT 0 #define SCMD_ENCDECCTRL_DECRYPT 1 #define SCMD_PROTO_VERSION_TLS_1_2 0 #define SCMD_PROTO_VERSION_TLS_1_1 1 #define SCMD_PROTO_VERSION_GENERIC 4 #define SCMD_CIPH_MODE_NOP 0 #define SCMD_CIPH_MODE_AES_CBC 1 #define SCMD_CIPH_MODE_AES_GCM 2 #define SCMD_CIPH_MODE_AES_CTR 3 #define SCMD_CIPH_MODE_GENERIC_AES 4 #define SCMD_CIPH_MODE_AES_XTS 6 #define SCMD_CIPH_MODE_AES_CCM 7 #define SCMD_AUTH_MODE_NOP 0 #define 
SCMD_AUTH_MODE_SHA1 1 #define SCMD_AUTH_MODE_SHA224 2 #define SCMD_AUTH_MODE_SHA256 3 #define SCMD_AUTH_MODE_GHASH 4 #define SCMD_AUTH_MODE_SHA512_224 5 #define SCMD_AUTH_MODE_SHA512_256 6 #define SCMD_AUTH_MODE_SHA512_384 7 #define SCMD_AUTH_MODE_SHA512_512 8 #define SCMD_AUTH_MODE_CBCMAC 9 #define SCMD_AUTH_MODE_CMAC 10 #define SCMD_HMAC_CTRL_NOP 0 #define SCMD_HMAC_CTRL_NO_TRUNC 1 #define SCMD_HMAC_CTRL_TRUNC_RFC4366 2 #define SCMD_HMAC_CTRL_IPSEC_96BIT 3 #define SCMD_HMAC_CTRL_PL1 4 #define SCMD_HMAC_CTRL_PL2 5 #define SCMD_HMAC_CTRL_PL3 6 #define SCMD_HMAC_CTRL_DIV2 7 /* This are not really mac key size. They are intermediate values * of sha engine and its size */ #define CHCR_KEYCTX_MAC_KEY_SIZE_128 0 #define CHCR_KEYCTX_MAC_KEY_SIZE_160 1 #define CHCR_KEYCTX_MAC_KEY_SIZE_192 2 #define CHCR_KEYCTX_MAC_KEY_SIZE_256 3 #define CHCR_KEYCTX_MAC_KEY_SIZE_512 4 #define CHCR_KEYCTX_CIPHER_KEY_SIZE_128 0 #define CHCR_KEYCTX_CIPHER_KEY_SIZE_192 1 #define CHCR_KEYCTX_CIPHER_KEY_SIZE_256 2 #define CHCR_KEYCTX_NO_KEY 15 #define IV_NOP 0 #define IV_IMMEDIATE 1 #define IV_DSGL 2 #define CHCR_HASH_MAX_BLOCK_SIZE_64 64 #define CHCR_HASH_MAX_BLOCK_SIZE_128 128 #endif /* !__T4_CRYPTO_H__ */ Index: user/ngie/bug-237403/sys/dev/cxgbe/t4_sge.c =================================================================== --- user/ngie/bug-237403/sys/dev/cxgbe/t4_sge.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/cxgbe/t4_sge.c (revision 346926) @@ -1,6051 +1,6054 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 Chelsio Communications, Inc. * All rights reserved. * Written by: Navdeep Parhar * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ratelimit.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DEV_NETMAP #include #include #include #include #include #endif #include "common/common.h" #include "common/t4_regs.h" #include "common/t4_regs_values.h" #include "common/t4_msg.h" #include "t4_l2t.h" #include "t4_mp_ring.h" #ifdef T4_PKT_TIMESTAMP #define RX_COPY_THRESHOLD (MINCLSIZE - 8) #else #define RX_COPY_THRESHOLD MINCLSIZE #endif /* Internal mbuf flags stored in PH_loc.eight[1]. */ #define MC_RAW_WR 0x02 /* * Ethernet frames are DMA'd at this byte offset into the freelist buffer. * 0-7 are valid values. */ static int fl_pktshift = 0; SYSCTL_INT(_hw_cxgbe, OID_AUTO, fl_pktshift, CTLFLAG_RDTUN, &fl_pktshift, 0, "payload DMA offset in rx buffer (bytes)"); /* * Pad ethernet payload up to this boundary. * -1: driver should figure out a good value. * 0: disable padding. * Any power of 2 from 32 to 4096 (both inclusive) is also a valid value. */ int fl_pad = -1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, fl_pad, CTLFLAG_RDTUN, &fl_pad, 0, "payload pad boundary (bytes)"); /* * Status page length. * -1: driver should figure out a good value. * 64 or 128 are the only other valid values. */ static int spg_len = -1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, spg_len, CTLFLAG_RDTUN, &spg_len, 0, "status page size (bytes)"); /* * Congestion drops. * -1: no congestion feedback (not recommended). * 0: backpressure the channel instead of dropping packets right away. * 1: no backpressure, drop packets for the congested queue immediately. */ static int cong_drop = 0; SYSCTL_INT(_hw_cxgbe, OID_AUTO, cong_drop, CTLFLAG_RDTUN, &cong_drop, 0, "Congestion control for RX queues (0 = backpressure, 1 = drop"); /* * Deliver multiple frames in the same free list buffer if they fit. * -1: let the driver decide whether to enable buffer packing or not. * 0: disable buffer packing. * 1: enable buffer packing. */ static int buffer_packing = -1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, buffer_packing, CTLFLAG_RDTUN, &buffer_packing, 0, "Enable buffer packing"); /* * Start next frame in a packed buffer at this boundary. * -1: driver should figure out a good value. * T4: driver will ignore this and use the same value as fl_pad above. * T5: 16, or a power of 2 from 64 to 4096 (both inclusive) is a valid value. */ static int fl_pack = -1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, fl_pack, CTLFLAG_RDTUN, &fl_pack, 0, "payload pack boundary (bytes)"); /* * Allow the driver to create mbuf(s) in a cluster allocated for rx. * 0: never; always allocate mbufs from the zone_mbuf UMA zone. * 1: ok to create mbuf(s) within a cluster if there is room. */ static int allow_mbufs_in_cluster = 1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, allow_mbufs_in_cluster, CTLFLAG_RDTUN, &allow_mbufs_in_cluster, 0, "Allow driver to create mbufs within a rx cluster"); /* * Largest rx cluster size that the driver is allowed to allocate. */ static int largest_rx_cluster = MJUM16BYTES; SYSCTL_INT(_hw_cxgbe, OID_AUTO, largest_rx_cluster, CTLFLAG_RDTUN, &largest_rx_cluster, 0, "Largest rx cluster (bytes)"); /* * Size of cluster allocation that's most likely to succeed. The driver will * fall back to this size if it fails to allocate clusters larger than this. 
*/ static int safest_rx_cluster = PAGE_SIZE; SYSCTL_INT(_hw_cxgbe, OID_AUTO, safest_rx_cluster, CTLFLAG_RDTUN, &safest_rx_cluster, 0, "Safe rx cluster (bytes)"); #ifdef RATELIMIT /* * Knob to control TCP timestamp rewriting, and the granularity of the tick used * for rewriting. -1 and 0-3 are all valid values. * -1: hardware should leave the TCP timestamps alone. * 0: 1ms * 1: 100us * 2: 10us * 3: 1us */ static int tsclk = -1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, tsclk, CTLFLAG_RDTUN, &tsclk, 0, "Control TCP timestamp rewriting when using pacing"); static int eo_max_backlog = 1024 * 1024; SYSCTL_INT(_hw_cxgbe, OID_AUTO, eo_max_backlog, CTLFLAG_RDTUN, &eo_max_backlog, 0, "Maximum backlog of ratelimited data per flow"); #endif /* * The interrupt holdoff timers are multiplied by this value on T6+. * 1 and 3-17 (both inclusive) are legal values. */ static int tscale = 1; SYSCTL_INT(_hw_cxgbe, OID_AUTO, tscale, CTLFLAG_RDTUN, &tscale, 0, "Interrupt holdoff timer scale on T6+"); /* * Number of LRO entries in the lro_ctrl structure per rx queue. */ static int lro_entries = TCP_LRO_ENTRIES; SYSCTL_INT(_hw_cxgbe, OID_AUTO, lro_entries, CTLFLAG_RDTUN, &lro_entries, 0, "Number of LRO entries per RX queue"); /* * This enables presorting of frames before they're fed into tcp_lro_rx. */ static int lro_mbufs = 0; SYSCTL_INT(_hw_cxgbe, OID_AUTO, lro_mbufs, CTLFLAG_RDTUN, &lro_mbufs, 0, "Enable presorting of LRO frames"); struct txpkts { u_int wr_type; /* type 0 or type 1 */ u_int npkt; /* # of packets in this work request */ u_int plen; /* total payload (sum of all packets) */ u_int len16; /* # of 16B pieces used by this work request */ }; /* A packet's SGL. This + m_pkthdr has all info needed for tx */ struct sgl { struct sglist sg; struct sglist_seg seg[TX_SGL_SEGS]; }; static int service_iq(struct sge_iq *, int); static int service_iq_fl(struct sge_iq *, int); static struct mbuf *get_fl_payload(struct adapter *, struct sge_fl *, uint32_t); static int t4_eth_rx(struct sge_iq *, const struct rss_header *, struct mbuf *); static inline void init_iq(struct sge_iq *, struct adapter *, int, int, int); static inline void init_fl(struct adapter *, struct sge_fl *, int, int, char *); static inline void init_eq(struct adapter *, struct sge_eq *, int, int, uint8_t, uint16_t, char *); static int alloc_ring(struct adapter *, size_t, bus_dma_tag_t *, bus_dmamap_t *, bus_addr_t *, void **); static int free_ring(struct adapter *, bus_dma_tag_t, bus_dmamap_t, bus_addr_t, void *); static int alloc_iq_fl(struct vi_info *, struct sge_iq *, struct sge_fl *, int, int); static int free_iq_fl(struct vi_info *, struct sge_iq *, struct sge_fl *); static void add_iq_sysctls(struct sysctl_ctx_list *, struct sysctl_oid *, struct sge_iq *); static void add_fl_sysctls(struct adapter *, struct sysctl_ctx_list *, struct sysctl_oid *, struct sge_fl *); static int alloc_fwq(struct adapter *); static int free_fwq(struct adapter *); static int alloc_ctrlq(struct adapter *, struct sge_wrq *, int, struct sysctl_oid *); static int alloc_rxq(struct vi_info *, struct sge_rxq *, int, int, struct sysctl_oid *); static int free_rxq(struct vi_info *, struct sge_rxq *); #ifdef TCP_OFFLOAD static int alloc_ofld_rxq(struct vi_info *, struct sge_ofld_rxq *, int, int, struct sysctl_oid *); static int free_ofld_rxq(struct vi_info *, struct sge_ofld_rxq *); #endif #ifdef DEV_NETMAP static int alloc_nm_rxq(struct vi_info *, struct sge_nm_rxq *, int, int, struct sysctl_oid *); static int free_nm_rxq(struct vi_info *, struct sge_nm_rxq *); static int 
alloc_nm_txq(struct vi_info *, struct sge_nm_txq *, int, int, struct sysctl_oid *); static int free_nm_txq(struct vi_info *, struct sge_nm_txq *); #endif static int ctrl_eq_alloc(struct adapter *, struct sge_eq *); static int eth_eq_alloc(struct adapter *, struct vi_info *, struct sge_eq *); #if defined(TCP_OFFLOAD) || defined(RATELIMIT) static int ofld_eq_alloc(struct adapter *, struct vi_info *, struct sge_eq *); #endif static int alloc_eq(struct adapter *, struct vi_info *, struct sge_eq *); static int free_eq(struct adapter *, struct sge_eq *); static int alloc_wrq(struct adapter *, struct vi_info *, struct sge_wrq *, struct sysctl_oid *); static int free_wrq(struct adapter *, struct sge_wrq *); static int alloc_txq(struct vi_info *, struct sge_txq *, int, struct sysctl_oid *); static int free_txq(struct vi_info *, struct sge_txq *); static void oneseg_dma_callback(void *, bus_dma_segment_t *, int, int); static inline void ring_fl_db(struct adapter *, struct sge_fl *); static int refill_fl(struct adapter *, struct sge_fl *, int); static void refill_sfl(void *); static int alloc_fl_sdesc(struct sge_fl *); static void free_fl_sdesc(struct adapter *, struct sge_fl *); static void find_best_refill_source(struct adapter *, struct sge_fl *, int); static void find_safe_refill_source(struct adapter *, struct sge_fl *); static void add_fl_to_sfl(struct adapter *, struct sge_fl *); static inline void get_pkt_gl(struct mbuf *, struct sglist *); static inline u_int txpkt_len16(u_int, u_int); static inline u_int txpkt_vm_len16(u_int, u_int); static inline u_int txpkts0_len16(u_int); static inline u_int txpkts1_len16(void); static u_int write_raw_wr(struct sge_txq *, void *, struct mbuf *, u_int); static u_int write_txpkt_wr(struct sge_txq *, struct fw_eth_tx_pkt_wr *, struct mbuf *, u_int); static u_int write_txpkt_vm_wr(struct adapter *, struct sge_txq *, struct fw_eth_tx_pkt_vm_wr *, struct mbuf *, u_int); static int try_txpkts(struct mbuf *, struct mbuf *, struct txpkts *, u_int); static int add_to_txpkts(struct mbuf *, struct txpkts *, u_int); static u_int write_txpkts_wr(struct sge_txq *, struct fw_eth_tx_pkts_wr *, struct mbuf *, const struct txpkts *, u_int); static void write_gl_to_txd(struct sge_txq *, struct mbuf *, caddr_t *, int); static inline void copy_to_txd(struct sge_eq *, caddr_t, caddr_t *, int); static inline void ring_eq_db(struct adapter *, struct sge_eq *, u_int); static inline uint16_t read_hw_cidx(struct sge_eq *); static inline u_int reclaimable_tx_desc(struct sge_eq *); static inline u_int total_available_tx_desc(struct sge_eq *); static u_int reclaim_tx_descs(struct sge_txq *, u_int); static void tx_reclaim(void *, int); static __be64 get_flit(struct sglist_seg *, int, int); static int handle_sge_egr_update(struct sge_iq *, const struct rss_header *, struct mbuf *); static int handle_fw_msg(struct sge_iq *, const struct rss_header *, struct mbuf *); static int t4_handle_wrerr_rpl(struct adapter *, const __be64 *); static void wrq_tx_drain(void *, int); static void drain_wrq_wr_list(struct adapter *, struct sge_wrq *); static int sysctl_uint16(SYSCTL_HANDLER_ARGS); static int sysctl_bufsizes(SYSCTL_HANDLER_ARGS); #ifdef RATELIMIT static inline u_int txpkt_eo_len16(u_int, u_int, u_int); static int ethofld_fw4_ack(struct sge_iq *, const struct rss_header *, struct mbuf *); #endif static counter_u64_t extfree_refs; static counter_u64_t extfree_rels; an_handler_t t4_an_handler; fw_msg_handler_t t4_fw_msg_handler[NUM_FW6_TYPES]; cpl_handler_t t4_cpl_handler[NUM_CPL_CMDS]; 
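These handler tables are filled in at runtime rather than statically: single-consumer opcodes go through t4_register_cpl_handler() (the pattern ccr_modevent() uses earlier in this diff for CPL_FW6_PLD), while opcodes shared by several consumers are demultiplexed by cookie through t4_register_shared_cpl_handler(); passing NULL removes a handler again. A short sketch with a hypothetical handler, included only to show the registration pattern:

/* Matches cpl_handler_t; the CPL either follows the RSS header or rides in m. */
static int
example_cpl_handler(struct sge_iq *iq, const struct rss_header *rss,
    struct mbuf *m)
{
	m_freem(m);		/* m_freem() is a no-op on NULL */
	return (0);
}

static void
example_register(void)
{
	/* Sole owner of an opcode. */
	t4_register_cpl_handler(CPL_FW6_PLD, example_cpl_handler);

	/* One of several consumers of a shared opcode, keyed by cookie. */
	t4_register_shared_cpl_handler(CPL_SET_TCB_RPL, example_cpl_handler,
	    CPL_COOKIE_TOM);
}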
cpl_handler_t set_tcb_rpl_handlers[NUM_CPL_COOKIES]; cpl_handler_t l2t_write_rpl_handlers[NUM_CPL_COOKIES]; cpl_handler_t act_open_rpl_handlers[NUM_CPL_COOKIES]; cpl_handler_t abort_rpl_rss_handlers[NUM_CPL_COOKIES]; cpl_handler_t fw4_ack_handlers[NUM_CPL_COOKIES]; void t4_register_an_handler(an_handler_t h) { uintptr_t *loc; MPASS(h == NULL || t4_an_handler == NULL); loc = (uintptr_t *)&t4_an_handler; atomic_store_rel_ptr(loc, (uintptr_t)h); } void t4_register_fw_msg_handler(int type, fw_msg_handler_t h) { uintptr_t *loc; MPASS(type < nitems(t4_fw_msg_handler)); MPASS(h == NULL || t4_fw_msg_handler[type] == NULL); /* * These are dispatched by the handler for FW{4|6}_CPL_MSG using the CPL * handler dispatch table. Reject any attempt to install a handler for * this subtype. */ MPASS(type != FW_TYPE_RSSCPL); MPASS(type != FW6_TYPE_RSSCPL); loc = (uintptr_t *)&t4_fw_msg_handler[type]; atomic_store_rel_ptr(loc, (uintptr_t)h); } void t4_register_cpl_handler(int opcode, cpl_handler_t h) { uintptr_t *loc; MPASS(opcode < nitems(t4_cpl_handler)); MPASS(h == NULL || t4_cpl_handler[opcode] == NULL); loc = (uintptr_t *)&t4_cpl_handler[opcode]; atomic_store_rel_ptr(loc, (uintptr_t)h); } static int set_tcb_rpl_handler(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { const struct cpl_set_tcb_rpl *cpl = (const void *)(rss + 1); u_int tid; int cookie; MPASS(m == NULL); tid = GET_TID(cpl); if (is_hpftid(iq->adapter, tid) || is_ftid(iq->adapter, tid)) { /* * The return code for filter-write is put in the CPL cookie so * we have to rely on the hardware tid (is_ftid) to determine * that this is a response to a filter. */ cookie = CPL_COOKIE_FILTER; } else { cookie = G_COOKIE(cpl->cookie); } MPASS(cookie > CPL_COOKIE_RESERVED); MPASS(cookie < nitems(set_tcb_rpl_handlers)); return (set_tcb_rpl_handlers[cookie](iq, rss, m)); } static int l2t_write_rpl_handler(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { const struct cpl_l2t_write_rpl *rpl = (const void *)(rss + 1); unsigned int cookie; MPASS(m == NULL); cookie = GET_TID(rpl) & F_SYNC_WR ? 
CPL_COOKIE_TOM : CPL_COOKIE_FILTER; return (l2t_write_rpl_handlers[cookie](iq, rss, m)); } static int act_open_rpl_handler(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { const struct cpl_act_open_rpl *cpl = (const void *)(rss + 1); u_int cookie = G_TID_COOKIE(G_AOPEN_ATID(be32toh(cpl->atid_status))); MPASS(m == NULL); MPASS(cookie != CPL_COOKIE_RESERVED); return (act_open_rpl_handlers[cookie](iq, rss, m)); } static int abort_rpl_rss_handler(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { struct adapter *sc = iq->adapter; u_int cookie; MPASS(m == NULL); if (is_hashfilter(sc)) cookie = CPL_COOKIE_HASHFILTER; else cookie = CPL_COOKIE_TOM; return (abort_rpl_rss_handlers[cookie](iq, rss, m)); } static int fw4_ack_handler(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { struct adapter *sc = iq->adapter; const struct cpl_fw4_ack *cpl = (const void *)(rss + 1); unsigned int tid = G_CPL_FW4_ACK_FLOWID(be32toh(OPCODE_TID(cpl))); u_int cookie; MPASS(m == NULL); if (is_etid(sc, tid)) cookie = CPL_COOKIE_ETHOFLD; else cookie = CPL_COOKIE_TOM; return (fw4_ack_handlers[cookie](iq, rss, m)); } static void t4_init_shared_cpl_handlers(void) { t4_register_cpl_handler(CPL_SET_TCB_RPL, set_tcb_rpl_handler); t4_register_cpl_handler(CPL_L2T_WRITE_RPL, l2t_write_rpl_handler); t4_register_cpl_handler(CPL_ACT_OPEN_RPL, act_open_rpl_handler); t4_register_cpl_handler(CPL_ABORT_RPL_RSS, abort_rpl_rss_handler); t4_register_cpl_handler(CPL_FW4_ACK, fw4_ack_handler); } void t4_register_shared_cpl_handler(int opcode, cpl_handler_t h, int cookie) { uintptr_t *loc; MPASS(opcode < nitems(t4_cpl_handler)); MPASS(cookie > CPL_COOKIE_RESERVED); MPASS(cookie < NUM_CPL_COOKIES); MPASS(t4_cpl_handler[opcode] != NULL); switch (opcode) { case CPL_SET_TCB_RPL: loc = (uintptr_t *)&set_tcb_rpl_handlers[cookie]; break; case CPL_L2T_WRITE_RPL: loc = (uintptr_t *)&l2t_write_rpl_handlers[cookie]; break; case CPL_ACT_OPEN_RPL: loc = (uintptr_t *)&act_open_rpl_handlers[cookie]; break; case CPL_ABORT_RPL_RSS: loc = (uintptr_t *)&abort_rpl_rss_handlers[cookie]; break; case CPL_FW4_ACK: loc = (uintptr_t *)&fw4_ack_handlers[cookie]; break; default: MPASS(0); return; } MPASS(h == NULL || *loc == (uintptr_t)NULL); atomic_store_rel_ptr(loc, (uintptr_t)h); } /* * Called on MOD_LOAD. Validates and calculates the SGE tunables. */ void t4_sge_modload(void) { if (fl_pktshift < 0 || fl_pktshift > 7) { printf("Invalid hw.cxgbe.fl_pktshift value (%d)," " using 0 instead.\n", fl_pktshift); fl_pktshift = 0; } if (spg_len != 64 && spg_len != 128) { int len; #if defined(__i386__) || defined(__amd64__) len = cpu_clflush_line_size > 64 ? 
128 : 64; #else len = 64; #endif if (spg_len != -1) { printf("Invalid hw.cxgbe.spg_len value (%d)," " using %d instead.\n", spg_len, len); } spg_len = len; } if (cong_drop < -1 || cong_drop > 1) { printf("Invalid hw.cxgbe.cong_drop value (%d)," " using 0 instead.\n", cong_drop); cong_drop = 0; } if (tscale != 1 && (tscale < 3 || tscale > 17)) { printf("Invalid hw.cxgbe.tscale value (%d)," " using 1 instead.\n", tscale); tscale = 1; } extfree_refs = counter_u64_alloc(M_WAITOK); extfree_rels = counter_u64_alloc(M_WAITOK); counter_u64_zero(extfree_refs); counter_u64_zero(extfree_rels); t4_init_shared_cpl_handlers(); t4_register_cpl_handler(CPL_FW4_MSG, handle_fw_msg); t4_register_cpl_handler(CPL_FW6_MSG, handle_fw_msg); t4_register_cpl_handler(CPL_SGE_EGR_UPDATE, handle_sge_egr_update); t4_register_cpl_handler(CPL_RX_PKT, t4_eth_rx); #ifdef RATELIMIT t4_register_shared_cpl_handler(CPL_FW4_ACK, ethofld_fw4_ack, CPL_COOKIE_ETHOFLD); #endif t4_register_fw_msg_handler(FW6_TYPE_CMD_RPL, t4_handle_fw_rpl); t4_register_fw_msg_handler(FW6_TYPE_WRERR_RPL, t4_handle_wrerr_rpl); } void t4_sge_modunload(void) { counter_u64_free(extfree_refs); counter_u64_free(extfree_rels); } uint64_t t4_sge_extfree_refs(void) { uint64_t refs, rels; rels = counter_u64_fetch(extfree_rels); refs = counter_u64_fetch(extfree_refs); return (refs - rels); } static inline void setup_pad_and_pack_boundaries(struct adapter *sc) { uint32_t v, m; int pad, pack, pad_shift; pad_shift = chip_id(sc) > CHELSIO_T5 ? X_T6_INGPADBOUNDARY_SHIFT : X_INGPADBOUNDARY_SHIFT; pad = fl_pad; if (fl_pad < (1 << pad_shift) || fl_pad > (1 << (pad_shift + M_INGPADBOUNDARY)) || !powerof2(fl_pad)) { /* * If there is any chance that we might use buffer packing and * the chip is a T4, then pick 64 as the pad/pack boundary. Set * it to the minimum allowed in all other cases. */ pad = is_t4(sc) && buffer_packing ? 64 : 1 << pad_shift; /* * For fl_pad = 0 we'll still write a reasonable value to the * register but all the freelists will opt out of padding. * We'll complain here only if the user tried to set it to a * value greater than 0 that was invalid. */ if (fl_pad > 0) { device_printf(sc->dev, "Invalid hw.cxgbe.fl_pad value" " (%d), using %d instead.\n", fl_pad, pad); } } m = V_INGPADBOUNDARY(M_INGPADBOUNDARY); v = V_INGPADBOUNDARY(ilog2(pad) - pad_shift); t4_set_reg_field(sc, A_SGE_CONTROL, m, v); if (is_t4(sc)) { if (fl_pack != -1 && fl_pack != pad) { /* Complain but carry on. */ device_printf(sc->dev, "hw.cxgbe.fl_pack (%d) ignored," " using %d instead.\n", fl_pack, pad); } return; } pack = fl_pack; if (fl_pack < 16 || fl_pack == 32 || fl_pack > 4096 || !powerof2(fl_pack)) { pack = max(sc->params.pci.mps, CACHE_LINE_SIZE); MPASS(powerof2(pack)); if (pack < 16) pack = 16; if (pack == 32) pack = 64; if (pack > 4096) pack = 4096; if (fl_pack != -1) { device_printf(sc->dev, "Invalid hw.cxgbe.fl_pack value" " (%d), using %d instead.\n", fl_pack, pack); } } m = V_INGPACKBOUNDARY(M_INGPACKBOUNDARY); if (pack == 16) v = V_INGPACKBOUNDARY(0); else v = V_INGPACKBOUNDARY(ilog2(pack) - 5); MPASS(!is_t4(sc)); /* T4 doesn't have SGE_CONTROL2 */ t4_set_reg_field(sc, A_SGE_CONTROL2, m, v); } /* * adap->params.vpd.cclk must be set up before this is called. 
*/ void t4_tweak_chip_settings(struct adapter *sc) { int i; uint32_t v, m; int intr_timer[SGE_NTIMERS] = {1, 5, 10, 50, 100, 200}; int timer_max = M_TIMERVALUE0 * 1000 / sc->params.vpd.cclk; int intr_pktcount[SGE_NCOUNTERS] = {1, 8, 16, 32}; /* 63 max */ uint16_t indsz = min(RX_COPY_THRESHOLD - 1, M_INDICATESIZE); static int sge_flbuf_sizes[] = { MCLBYTES, #if MJUMPAGESIZE != MCLBYTES MJUMPAGESIZE, MJUMPAGESIZE - CL_METADATA_SIZE, MJUMPAGESIZE - 2 * MSIZE - CL_METADATA_SIZE, #endif MJUM9BYTES, MJUM16BYTES, MCLBYTES - MSIZE - CL_METADATA_SIZE, MJUM9BYTES - CL_METADATA_SIZE, MJUM16BYTES - CL_METADATA_SIZE, }; KASSERT(sc->flags & MASTER_PF, ("%s: trying to change chip settings when not master.", __func__)); m = V_PKTSHIFT(M_PKTSHIFT) | F_RXPKTCPLMODE | F_EGRSTATUSPAGESIZE; v = V_PKTSHIFT(fl_pktshift) | F_RXPKTCPLMODE | V_EGRSTATUSPAGESIZE(spg_len == 128); t4_set_reg_field(sc, A_SGE_CONTROL, m, v); setup_pad_and_pack_boundaries(sc); v = V_HOSTPAGESIZEPF0(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF1(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF2(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF3(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF4(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF5(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF6(PAGE_SHIFT - 10) | V_HOSTPAGESIZEPF7(PAGE_SHIFT - 10); t4_write_reg(sc, A_SGE_HOST_PAGE_SIZE, v); KASSERT(nitems(sge_flbuf_sizes) <= SGE_FLBUF_SIZES, ("%s: hw buffer size table too big", __func__)); t4_write_reg(sc, A_SGE_FL_BUFFER_SIZE0, 4096); t4_write_reg(sc, A_SGE_FL_BUFFER_SIZE1, 65536); for (i = 0; i < min(nitems(sge_flbuf_sizes), SGE_FLBUF_SIZES); i++) { t4_write_reg(sc, A_SGE_FL_BUFFER_SIZE15 - (4 * i), sge_flbuf_sizes[i]); } v = V_THRESHOLD_0(intr_pktcount[0]) | V_THRESHOLD_1(intr_pktcount[1]) | V_THRESHOLD_2(intr_pktcount[2]) | V_THRESHOLD_3(intr_pktcount[3]); t4_write_reg(sc, A_SGE_INGRESS_RX_THRESHOLD, v); KASSERT(intr_timer[0] <= timer_max, ("%s: not a single usable timer (%d, %d)", __func__, intr_timer[0], timer_max)); for (i = 1; i < nitems(intr_timer); i++) { KASSERT(intr_timer[i] >= intr_timer[i - 1], ("%s: timers not listed in increasing order (%d)", __func__, i)); while (intr_timer[i] > timer_max) { if (i == nitems(intr_timer) - 1) { intr_timer[i] = timer_max; break; } intr_timer[i] += intr_timer[i - 1]; intr_timer[i] /= 2; } } v = V_TIMERVALUE0(us_to_core_ticks(sc, intr_timer[0])) | V_TIMERVALUE1(us_to_core_ticks(sc, intr_timer[1])); t4_write_reg(sc, A_SGE_TIMER_VALUE_0_AND_1, v); v = V_TIMERVALUE2(us_to_core_ticks(sc, intr_timer[2])) | V_TIMERVALUE3(us_to_core_ticks(sc, intr_timer[3])); t4_write_reg(sc, A_SGE_TIMER_VALUE_2_AND_3, v); v = V_TIMERVALUE4(us_to_core_ticks(sc, intr_timer[4])) | V_TIMERVALUE5(us_to_core_ticks(sc, intr_timer[5])); t4_write_reg(sc, A_SGE_TIMER_VALUE_4_AND_5, v); if (chip_id(sc) >= CHELSIO_T6) { m = V_TSCALE(M_TSCALE); if (tscale == 1) v = 0; else v = V_TSCALE(tscale - 2); t4_set_reg_field(sc, A_SGE_ITP_CONTROL, m, v); if (sc->debug_flags & DF_DISABLE_TCB_CACHE) { m = V_RDTHRESHOLD(M_RDTHRESHOLD) | F_WRTHRTHRESHEN | V_WRTHRTHRESH(M_WRTHRTHRESH); t4_tp_pio_read(sc, &v, 1, A_TP_CMM_CONFIG, 1); v &= ~m; v |= V_RDTHRESHOLD(1) | F_WRTHRTHRESHEN | V_WRTHRTHRESH(16); t4_tp_pio_write(sc, &v, 1, A_TP_CMM_CONFIG, 1); } } /* 4K, 16K, 64K, 256K DDP "page sizes" for TDDP */ v = V_HPZ0(0) | V_HPZ1(2) | V_HPZ2(4) | V_HPZ3(6); t4_write_reg(sc, A_ULP_RX_TDDP_PSZ, v); /* * 4K, 8K, 16K, 64K DDP "page sizes" for iSCSI DDP. These have been * chosen with MAXPHYS = 128K in mind. The largest DDP buffer that we * may have to deal with is MAXPHYS + 1 page. 
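* Each V_HPZn() value is a left shift applied to the 4KB base page size.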
*/ v = V_HPZ0(0) | V_HPZ1(1) | V_HPZ2(2) | V_HPZ3(4); t4_write_reg(sc, A_ULP_RX_ISCSI_PSZ, v); /* We use multiple DDP page sizes both in plain-TOE and ISCSI modes. */ m = v = F_TDDPTAGTCB | F_ISCSITAGTCB; t4_set_reg_field(sc, A_ULP_RX_CTL, m, v); m = V_INDICATESIZE(M_INDICATESIZE) | F_REARMDDPOFFSET | F_RESETDDPOFFSET; v = V_INDICATESIZE(indsz) | F_REARMDDPOFFSET | F_RESETDDPOFFSET; t4_set_reg_field(sc, A_TP_PARA_REG5, m, v); } /* * SGE wants the buffer to be at least 64B and then a multiple of 16. If * padding is in use, the buffer's start and end need to be aligned to the pad * boundary as well. We'll just make sure that the size is a multiple of the * boundary here, it is up to the buffer allocation code to make sure the start * of the buffer is aligned as well. */ static inline int hwsz_ok(struct adapter *sc, int hwsz) { int mask = fl_pad ? sc->params.sge.pad_boundary - 1 : 16 - 1; return (hwsz >= 64 && (hwsz & mask) == 0); } /* * XXX: driver really should be able to deal with unexpected settings. */ int t4_read_chip_settings(struct adapter *sc) { struct sge *s = &sc->sge; struct sge_params *sp = &sc->params.sge; int i, j, n, rc = 0; uint32_t m, v, r; uint16_t indsz = min(RX_COPY_THRESHOLD - 1, M_INDICATESIZE); static int sw_buf_sizes[] = { /* Sorted by size */ MCLBYTES, #if MJUMPAGESIZE != MCLBYTES MJUMPAGESIZE, #endif MJUM9BYTES, MJUM16BYTES }; struct sw_zone_info *swz, *safe_swz; struct hw_buf_info *hwb; m = F_RXPKTCPLMODE; v = F_RXPKTCPLMODE; r = sc->params.sge.sge_control; if ((r & m) != v) { device_printf(sc->dev, "invalid SGE_CONTROL(0x%x)\n", r); rc = EINVAL; } /* * If this changes then every single use of PAGE_SHIFT in the driver * needs to be carefully reviewed for PAGE_SHIFT vs sp->page_shift. */ if (sp->page_shift != PAGE_SHIFT) { device_printf(sc->dev, "invalid SGE_HOST_PAGE_SIZE(0x%x)\n", r); rc = EINVAL; } /* Filter out unusable hw buffer sizes entirely (mark with -2). */ hwb = &s->hw_buf_info[0]; for (i = 0; i < nitems(s->hw_buf_info); i++, hwb++) { r = sc->params.sge.sge_fl_buffer_size[i]; hwb->size = r; hwb->zidx = hwsz_ok(sc, r) ? -1 : -2; hwb->next = -1; } /* * Create a sorted list in decreasing order of hw buffer sizes (and so * increasing order of spare area) for each software zone. * * If padding is enabled then the start and end of the buffer must align * to the pad boundary; if packing is enabled then they must align with * the pack boundary as well. Allocations from the cluster zones are * aligned to min(size, 4K), so the buffer starts at that alignment and * ends at hwb->size alignment. If mbuf inlining is allowed the * starting alignment will be reduced to MSIZE and the driver will * exercise appropriate caution when deciding on the best buffer layout * to use. 
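* A size that duplicates one already in a zone's list is marked unusable (zidx -2) so each hw buffer size appears at most once per zone.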
*/ n = 0; /* no usable buffer size to begin with */ swz = &s->sw_zone_info[0]; safe_swz = NULL; for (i = 0; i < SW_ZONE_SIZES; i++, swz++) { int8_t head = -1, tail = -1; swz->size = sw_buf_sizes[i]; swz->zone = m_getzone(swz->size); swz->type = m_gettype(swz->size); if (swz->size < PAGE_SIZE) { MPASS(powerof2(swz->size)); if (fl_pad && (swz->size % sp->pad_boundary != 0)) continue; } if (swz->size == safest_rx_cluster) safe_swz = swz; hwb = &s->hw_buf_info[0]; for (j = 0; j < SGE_FLBUF_SIZES; j++, hwb++) { if (hwb->zidx != -1 || hwb->size > swz->size) continue; #ifdef INVARIANTS if (fl_pad) MPASS(hwb->size % sp->pad_boundary == 0); #endif hwb->zidx = i; if (head == -1) head = tail = j; else if (hwb->size < s->hw_buf_info[tail].size) { s->hw_buf_info[tail].next = j; tail = j; } else { int8_t *cur; struct hw_buf_info *t; for (cur = &head; *cur != -1; cur = &t->next) { t = &s->hw_buf_info[*cur]; if (hwb->size == t->size) { hwb->zidx = -2; break; } if (hwb->size > t->size) { hwb->next = *cur; *cur = j; break; } } } } swz->head_hwidx = head; swz->tail_hwidx = tail; if (tail != -1) { n++; if (swz->size - s->hw_buf_info[tail].size >= CL_METADATA_SIZE) sc->flags |= BUF_PACKING_OK; } } if (n == 0) { device_printf(sc->dev, "no usable SGE FL buffer size.\n"); rc = EINVAL; } s->safe_hwidx1 = -1; s->safe_hwidx2 = -1; if (safe_swz != NULL) { s->safe_hwidx1 = safe_swz->head_hwidx; for (i = safe_swz->head_hwidx; i != -1; i = hwb->next) { int spare; hwb = &s->hw_buf_info[i]; #ifdef INVARIANTS if (fl_pad) MPASS(hwb->size % sp->pad_boundary == 0); #endif spare = safe_swz->size - hwb->size; if (spare >= CL_METADATA_SIZE) { s->safe_hwidx2 = i; break; } } } if (sc->flags & IS_VF) return (0); v = V_HPZ0(0) | V_HPZ1(2) | V_HPZ2(4) | V_HPZ3(6); r = t4_read_reg(sc, A_ULP_RX_TDDP_PSZ); if (r != v) { device_printf(sc->dev, "invalid ULP_RX_TDDP_PSZ(0x%x)\n", r); rc = EINVAL; } m = v = F_TDDPTAGTCB; r = t4_read_reg(sc, A_ULP_RX_CTL); if ((r & m) != v) { device_printf(sc->dev, "invalid ULP_RX_CTL(0x%x)\n", r); rc = EINVAL; } m = V_INDICATESIZE(M_INDICATESIZE) | F_REARMDDPOFFSET | F_RESETDDPOFFSET; v = V_INDICATESIZE(indsz) | F_REARMDDPOFFSET | F_RESETDDPOFFSET; r = t4_read_reg(sc, A_TP_PARA_REG5); if ((r & m) != v) { device_printf(sc->dev, "invalid TP_PARA_REG5(0x%x)\n", r); rc = EINVAL; } t4_init_tp_params(sc, 1); t4_read_mtu_tbl(sc, sc->params.mtus, NULL); t4_load_mtus(sc, sc->params.mtus, sc->params.a_wnd, sc->params.b_wnd); return (rc); } int t4_create_dma_tag(struct adapter *sc) { int rc; rc = bus_dma_tag_create(bus_get_dma_tag(sc->dev), 1, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, BUS_SPACE_MAXSIZE, BUS_SPACE_UNRESTRICTED, BUS_SPACE_MAXSIZE, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->dmat); if (rc != 0) { device_printf(sc->dev, "failed to create main DMA tag: %d\n", rc); } return (rc); } void t4_sge_sysctls(struct adapter *sc, struct sysctl_ctx_list *ctx, struct sysctl_oid_list *children) { struct sge_params *sp = &sc->params.sge; SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "buffer_sizes", CTLTYPE_STRING | CTLFLAG_RD, &sc->sge, 0, sysctl_bufsizes, "A", "freelist buffer sizes"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "fl_pktshift", CTLFLAG_RD, NULL, sp->fl_pktshift, "payload DMA offset in rx buffer (bytes)"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "fl_pad", CTLFLAG_RD, NULL, sp->pad_boundary, "payload pad boundary (bytes)"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "spg_len", CTLFLAG_RD, NULL, sp->spg_len, "status page size (bytes)"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "cong_drop", CTLFLAG_RD, NULL, 
cong_drop, "congestion drop setting"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "fl_pack", CTLFLAG_RD, NULL, sp->pack_boundary, "payload pack boundary (bytes)"); } int t4_destroy_dma_tag(struct adapter *sc) { if (sc->dmat) bus_dma_tag_destroy(sc->dmat); return (0); } /* * Allocate and initialize the firmware event queue, control queues, and special * purpose rx queues owned by the adapter. * * Returns errno on failure. Resources allocated up to that point may still be * allocated. Caller is responsible for cleanup in case this function fails. */ int t4_setup_adapter_queues(struct adapter *sc) { struct sysctl_oid *oid; struct sysctl_oid_list *children; int rc, i; ADAPTER_LOCK_ASSERT_NOTOWNED(sc); sysctl_ctx_init(&sc->ctx); sc->flags |= ADAP_SYSCTL_CTX; /* * Firmware event queue */ rc = alloc_fwq(sc); if (rc != 0) return (rc); /* * That's all for the VF driver. */ if (sc->flags & IS_VF) return (rc); oid = device_get_sysctl_tree(sc->dev); children = SYSCTL_CHILDREN(oid); /* * XXX: General purpose rx queues, one per port. */ /* * Control queues, one per port. */ oid = SYSCTL_ADD_NODE(&sc->ctx, children, OID_AUTO, "ctrlq", CTLFLAG_RD, NULL, "control queues"); for_each_port(sc, i) { struct sge_wrq *ctrlq = &sc->sge.ctrlq[i]; rc = alloc_ctrlq(sc, ctrlq, i, oid); if (rc != 0) return (rc); } return (rc); } /* * Idempotent */ int t4_teardown_adapter_queues(struct adapter *sc) { int i; ADAPTER_LOCK_ASSERT_NOTOWNED(sc); /* Do this before freeing the queue */ if (sc->flags & ADAP_SYSCTL_CTX) { sysctl_ctx_free(&sc->ctx); sc->flags &= ~ADAP_SYSCTL_CTX; } if (!(sc->flags & IS_VF)) { for_each_port(sc, i) free_wrq(sc, &sc->sge.ctrlq[i]); } free_fwq(sc); return (0); } /* Maximum payload that can be delivered with a single iq descriptor */ static inline int mtu_to_max_payload(struct adapter *sc, int mtu, const int toe) { int payload; #ifdef TCP_OFFLOAD if (toe) { int rxcs = G_RXCOALESCESIZE(t4_read_reg(sc, A_TP_PARA_REG2)); /* Note that COP can set rx_coalesce on/off per connection. */ payload = max(mtu, rxcs); } else { #endif /* large enough even when hw VLAN extraction is disabled */ payload = sc->params.sge.fl_pktshift + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN + mtu; #ifdef TCP_OFFLOAD } #endif return (payload); } int t4_setup_vi_queues(struct vi_info *vi) { int rc = 0, i, intr_idx, iqidx; struct sge_rxq *rxq; struct sge_txq *txq; #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; #endif #if defined(TCP_OFFLOAD) || defined(RATELIMIT) struct sge_wrq *ofld_txq; #endif #ifdef DEV_NETMAP int saved_idx; struct sge_nm_rxq *nm_rxq; struct sge_nm_txq *nm_txq; #endif char name[16]; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct ifnet *ifp = vi->ifp; struct sysctl_oid *oid = device_get_sysctl_tree(vi->dev); struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); int maxp, mtu = ifp->if_mtu; /* Interrupt vector to start from (when using multiple vectors) */ intr_idx = vi->first_intr; #ifdef DEV_NETMAP saved_idx = intr_idx; if (ifp->if_capabilities & IFCAP_NETMAP) { /* netmap is supported with direct interrupts only. */ MPASS(!forwarding_intr_to_fwq(sc)); /* * We don't have buffers to back the netmap rx queues * right now so we create the queues in a way that * doesn't set off any congestion signal in the chip. 
*/ oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "nm_rxq", CTLFLAG_RD, NULL, "rx queues"); for_each_nm_rxq(vi, i, nm_rxq) { rc = alloc_nm_rxq(vi, nm_rxq, intr_idx, i, oid); if (rc != 0) goto done; intr_idx++; } oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "nm_txq", CTLFLAG_RD, NULL, "tx queues"); for_each_nm_txq(vi, i, nm_txq) { iqidx = vi->first_nm_rxq + (i % vi->nnmrxq); rc = alloc_nm_txq(vi, nm_txq, iqidx, i, oid); if (rc != 0) goto done; } } /* Normal rx queues and netmap rx queues share the same interrupts. */ intr_idx = saved_idx; #endif /* * Allocate rx queues first because a default iqid is required when * creating a tx queue. */ maxp = mtu_to_max_payload(sc, mtu, 0); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "rxq", CTLFLAG_RD, NULL, "rx queues"); for_each_rxq(vi, i, rxq) { init_iq(&rxq->iq, sc, vi->tmr_idx, vi->pktc_idx, vi->qsize_rxq); snprintf(name, sizeof(name), "%s rxq%d-fl", device_get_nameunit(vi->dev), i); init_fl(sc, &rxq->fl, vi->qsize_rxq / 8, maxp, name); rc = alloc_rxq(vi, rxq, forwarding_intr_to_fwq(sc) ? -1 : intr_idx, i, oid); if (rc != 0) goto done; intr_idx++; } #ifdef DEV_NETMAP if (ifp->if_capabilities & IFCAP_NETMAP) intr_idx = saved_idx + max(vi->nrxq, vi->nnmrxq); #endif #ifdef TCP_OFFLOAD maxp = mtu_to_max_payload(sc, mtu, 1); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "ofld_rxq", CTLFLAG_RD, NULL, "rx queues for offloaded TCP connections"); for_each_ofld_rxq(vi, i, ofld_rxq) { init_iq(&ofld_rxq->iq, sc, vi->ofld_tmr_idx, vi->ofld_pktc_idx, vi->qsize_rxq); snprintf(name, sizeof(name), "%s ofld_rxq%d-fl", device_get_nameunit(vi->dev), i); init_fl(sc, &ofld_rxq->fl, vi->qsize_rxq / 8, maxp, name); rc = alloc_ofld_rxq(vi, ofld_rxq, forwarding_intr_to_fwq(sc) ? -1 : intr_idx, i, oid); if (rc != 0) goto done; intr_idx++; } #endif /* * Now the tx queues. 
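* Each tx queue's eq is bound to an rx queue's iq (via its cntxt_id) so the hardware has somewhere to deliver egress update notifications.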
*/ oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "txq", CTLFLAG_RD, NULL, "tx queues"); for_each_txq(vi, i, txq) { iqidx = vi->first_rxq + (i % vi->nrxq); snprintf(name, sizeof(name), "%s txq%d", device_get_nameunit(vi->dev), i); init_eq(sc, &txq->eq, EQ_ETH, vi->qsize_txq, pi->tx_chan, sc->sge.rxq[iqidx].iq.cntxt_id, name); rc = alloc_txq(vi, txq, i, oid); if (rc != 0) goto done; } #if defined(TCP_OFFLOAD) || defined(RATELIMIT) oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, "ofld_txq", CTLFLAG_RD, NULL, "tx queues for TOE/ETHOFLD"); for_each_ofld_txq(vi, i, ofld_txq) { struct sysctl_oid *oid2; snprintf(name, sizeof(name), "%s ofld_txq%d", device_get_nameunit(vi->dev), i); if (vi->nofldrxq > 0) { iqidx = vi->first_ofld_rxq + (i % vi->nofldrxq); init_eq(sc, &ofld_txq->eq, EQ_OFLD, vi->qsize_txq, pi->tx_chan, sc->sge.ofld_rxq[iqidx].iq.cntxt_id, name); } else { iqidx = vi->first_rxq + (i % vi->nrxq); init_eq(sc, &ofld_txq->eq, EQ_OFLD, vi->qsize_txq, pi->tx_chan, sc->sge.rxq[iqidx].iq.cntxt_id, name); } snprintf(name, sizeof(name), "%d", i); oid2 = SYSCTL_ADD_NODE(&vi->ctx, SYSCTL_CHILDREN(oid), OID_AUTO, name, CTLFLAG_RD, NULL, "offload tx queue"); rc = alloc_wrq(sc, vi, ofld_txq, oid2); if (rc != 0) goto done; } #endif done: if (rc) t4_teardown_vi_queues(vi); return (rc); } /* * Idempotent */ int t4_teardown_vi_queues(struct vi_info *vi) { int i; struct sge_rxq *rxq; struct sge_txq *txq; #if defined(TCP_OFFLOAD) || defined(RATELIMIT) struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct sge_wrq *ofld_txq; #endif #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; #endif #ifdef DEV_NETMAP struct sge_nm_rxq *nm_rxq; struct sge_nm_txq *nm_txq; #endif /* Do this before freeing the queues */ if (vi->flags & VI_SYSCTL_CTX) { sysctl_ctx_free(&vi->ctx); vi->flags &= ~VI_SYSCTL_CTX; } #ifdef DEV_NETMAP if (vi->ifp->if_capabilities & IFCAP_NETMAP) { for_each_nm_txq(vi, i, nm_txq) { free_nm_txq(vi, nm_txq); } for_each_nm_rxq(vi, i, nm_rxq) { free_nm_rxq(vi, nm_rxq); } } #endif /* * Take down all the tx queues first, as they reference the rx queues * (for egress updates, etc.). */ for_each_txq(vi, i, txq) { free_txq(vi, txq); } #if defined(TCP_OFFLOAD) || defined(RATELIMIT) for_each_ofld_txq(vi, i, ofld_txq) { free_wrq(sc, ofld_txq); } #endif /* * Then take down the rx queues. */ for_each_rxq(vi, i, rxq) { free_rxq(vi, rxq); } #ifdef TCP_OFFLOAD for_each_ofld_rxq(vi, i, ofld_rxq) { free_ofld_rxq(vi, ofld_rxq); } #endif return (0); } /* * Interrupt handler when the driver is using only 1 interrupt. This is a very * unusual scenario. * * a) Deals with errors, if any. * b) Services firmware event queue, which is taking interrupts for all other * queues. */ void t4_intr_all(void *arg) { struct adapter *sc = arg; struct sge_iq *fwq = &sc->sge.fwq; MPASS(sc->intr_count == 1); if (sc->intr_type == INTR_INTX) t4_write_reg(sc, MYPF_REG(A_PCIE_PF_CLI), 0); t4_intr_err(arg); t4_intr_evt(fwq); } /* * Interrupt handler for errors (installed directly when multiple interrupts are * being used, or called by t4_intr_all). */ void t4_intr_err(void *arg) { struct adapter *sc = arg; uint32_t v; const bool verbose = (sc->debug_flags & DF_VERBOSE_SLOWINTR) != 0; if (sc->flags & ADAP_ERR) return; v = t4_read_reg(sc, MYPF_REG(A_PL_PF_INT_CAUSE)); if (v & F_PFSW) { sc->swintr++; t4_write_reg(sc, MYPF_REG(A_PL_PF_INT_CAUSE), v); } t4_slow_intr_handler(sc, verbose); } /* * Interrupt handler for iq-only queues. The firmware event queue is the only * such queue right now. 
*/ void t4_intr_evt(void *arg) { struct sge_iq *iq = arg; if (atomic_cmpset_int(&iq->state, IQS_IDLE, IQS_BUSY)) { service_iq(iq, 0); (void) atomic_cmpset_int(&iq->state, IQS_BUSY, IQS_IDLE); } } /* * Interrupt handler for iq+fl queues. */ void t4_intr(void *arg) { struct sge_iq *iq = arg; if (atomic_cmpset_int(&iq->state, IQS_IDLE, IQS_BUSY)) { service_iq_fl(iq, 0); (void) atomic_cmpset_int(&iq->state, IQS_BUSY, IQS_IDLE); } } #ifdef DEV_NETMAP /* * Interrupt handler for netmap rx queues. */ void t4_nm_intr(void *arg) { struct sge_nm_rxq *nm_rxq = arg; if (atomic_cmpset_int(&nm_rxq->nm_state, NM_ON, NM_BUSY)) { service_nm_rxq(nm_rxq); (void) atomic_cmpset_int(&nm_rxq->nm_state, NM_BUSY, NM_ON); } } /* * Interrupt handler for vectors shared between NIC and netmap rx queues. */ void t4_vi_intr(void *arg) { struct irq *irq = arg; MPASS(irq->nm_rxq != NULL); t4_nm_intr(irq->nm_rxq); MPASS(irq->rxq != NULL); t4_intr(irq->rxq); } #endif /* * Deals with interrupts on an iq-only (no freelist) queue. */ static int service_iq(struct sge_iq *iq, int budget) { struct sge_iq *q; struct adapter *sc = iq->adapter; struct iq_desc *d = &iq->desc[iq->cidx]; int ndescs = 0, limit; int rsp_type; uint32_t lq; STAILQ_HEAD(, sge_iq) iql = STAILQ_HEAD_INITIALIZER(iql); KASSERT(iq->state == IQS_BUSY, ("%s: iq %p not BUSY", __func__, iq)); KASSERT((iq->flags & IQ_HAS_FL) == 0, ("%s: called for iq %p with fl (iq->flags 0x%x)", __func__, iq, iq->flags)); MPASS((iq->flags & IQ_ADJ_CREDIT) == 0); MPASS((iq->flags & IQ_LRO_ENABLED) == 0); limit = budget ? budget : iq->qsize / 16; /* * We always come back and check the descriptor ring for new indirect * interrupts and other responses after running a single handler. */ for (;;) { while ((d->rsp.u.type_gen & F_RSPD_GEN) == iq->gen) { rmb(); rsp_type = G_RSPD_TYPE(d->rsp.u.type_gen); lq = be32toh(d->rsp.pldbuflen_qid); switch (rsp_type) { case X_RSPD_TYPE_FLBUF: panic("%s: data for an iq (%p) with no freelist", __func__, iq); /* NOTREACHED */ case X_RSPD_TYPE_CPL: KASSERT(d->rss.opcode < NUM_CPL_CMDS, ("%s: bad opcode %02x.", __func__, d->rss.opcode)); t4_cpl_handler[d->rss.opcode](iq, &d->rss, NULL); break; case X_RSPD_TYPE_INTR: /* * There are 1K interrupt-capable queues (qids 0 * through 1023). A response type indicating a * forwarded interrupt with a qid >= 1K is an * iWARP async notification. */ if (__predict_true(lq >= 1024)) { t4_an_handler(iq, &d->rsp); break; } q = sc->sge.iqmap[lq - sc->sge.iq_start - sc->sge.iq_base]; if (atomic_cmpset_int(&q->state, IQS_IDLE, IQS_BUSY)) { if (service_iq_fl(q, q->qsize / 16) == 0) { (void) atomic_cmpset_int(&q->state, IQS_BUSY, IQS_IDLE); } else { STAILQ_INSERT_TAIL(&iql, q, link); } } break; default: KASSERT(0, ("%s: illegal response type %d on iq %p", __func__, rsp_type, iq)); log(LOG_ERR, "%s: illegal response type %d on iq %p", device_get_nameunit(sc->dev), rsp_type, iq); break; } d++; if (__predict_false(++iq->cidx == iq->sidx)) { iq->cidx = 0; iq->gen ^= F_RSPD_GEN; d = &iq->desc[0]; } if (__predict_false(++ndescs == limit)) { t4_write_reg(sc, sc->sge_gts_reg, V_CIDXINC(ndescs) | V_INGRESSQID(iq->cntxt_id) | V_SEINTARM(V_QINTR_TIMER_IDX(X_TIMERREG_UPDATE_CIDX))); ndescs = 0; if (budget) { return (EINPROGRESS); } } } if (STAILQ_EMPTY(&iql)) break; /* * Process the head only, and send it to the back of the list if * it's still not done. 
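* Limiting each pass to one queue keeps a single busy forwarded queue from starving the others.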
*/ q = STAILQ_FIRST(&iql); STAILQ_REMOVE_HEAD(&iql, link); if (service_iq_fl(q, q->qsize / 8) == 0) (void) atomic_cmpset_int(&q->state, IQS_BUSY, IQS_IDLE); else STAILQ_INSERT_TAIL(&iql, q, link); } t4_write_reg(sc, sc->sge_gts_reg, V_CIDXINC(ndescs) | V_INGRESSQID((u32)iq->cntxt_id) | V_SEINTARM(iq->intr_params)); return (0); } static inline int sort_before_lro(struct lro_ctrl *lro) { return (lro->lro_mbuf_max != 0); } static inline uint64_t last_flit_to_ns(struct adapter *sc, uint64_t lf) { uint64_t n = be64toh(lf) & 0xfffffffffffffff; /* 60b, not 64b. */ if (n > UINT64_MAX / 1000000) return (n / sc->params.vpd.cclk * 1000000); else return (n * 1000000 / sc->params.vpd.cclk); } /* * Deals with interrupts on an iq+fl queue. */ static int service_iq_fl(struct sge_iq *iq, int budget) { struct sge_rxq *rxq = iq_to_rxq(iq); struct sge_fl *fl; struct adapter *sc = iq->adapter; struct iq_desc *d = &iq->desc[iq->cidx]; int ndescs = 0, limit; int rsp_type, refill, starved; uint32_t lq; uint16_t fl_hw_cidx; struct mbuf *m0; #if defined(INET) || defined(INET6) const struct timeval lro_timeout = {0, sc->lro_timeout}; struct lro_ctrl *lro = &rxq->lro; #endif KASSERT(iq->state == IQS_BUSY, ("%s: iq %p not BUSY", __func__, iq)); MPASS(iq->flags & IQ_HAS_FL); limit = budget ? budget : iq->qsize / 16; fl = &rxq->fl; fl_hw_cidx = fl->hw_cidx; /* stable snapshot */ #if defined(INET) || defined(INET6) if (iq->flags & IQ_ADJ_CREDIT) { MPASS(sort_before_lro(lro)); iq->flags &= ~IQ_ADJ_CREDIT; if ((d->rsp.u.type_gen & F_RSPD_GEN) != iq->gen) { tcp_lro_flush_all(lro); t4_write_reg(sc, sc->sge_gts_reg, V_CIDXINC(1) | V_INGRESSQID((u32)iq->cntxt_id) | V_SEINTARM(iq->intr_params)); return (0); } ndescs = 1; } #else MPASS((iq->flags & IQ_ADJ_CREDIT) == 0); #endif while ((d->rsp.u.type_gen & F_RSPD_GEN) == iq->gen) { rmb(); refill = 0; m0 = NULL; rsp_type = G_RSPD_TYPE(d->rsp.u.type_gen); lq = be32toh(d->rsp.pldbuflen_qid); switch (rsp_type) { case X_RSPD_TYPE_FLBUF: m0 = get_fl_payload(sc, fl, lq); if (__predict_false(m0 == NULL)) goto out; refill = IDXDIFF(fl->hw_cidx, fl_hw_cidx, fl->sidx) > 2; if (iq->flags & IQ_RX_TIMESTAMP) { /* * Fill up rcv_tstmp but do not set M_TSTMP. * rcv_tstmp is not in the format that the * kernel expects and we don't want to mislead * it. For now this is only for custom code * that knows how to interpret cxgbe's stamp. */ m0->m_pkthdr.rcv_tstmp = last_flit_to_ns(sc, d->rsp.u.last_flit); #ifdef notyet m0->m_flags |= M_TSTMP; #endif } /* fall through */ case X_RSPD_TYPE_CPL: KASSERT(d->rss.opcode < NUM_CPL_CMDS, ("%s: bad opcode %02x.", __func__, d->rss.opcode)); t4_cpl_handler[d->rss.opcode](iq, &d->rss, m0); break; case X_RSPD_TYPE_INTR: /* * There are 1K interrupt-capable queues (qids 0 * through 1023). A response type indicating a * forwarded interrupt with a qid >= 1K is an * iWARP async notification. That is the only * acceptable indirect interrupt on this queue. 
*/ if (__predict_false(lq < 1024)) { panic("%s: indirect interrupt on iq_fl %p " "with qid %u", __func__, iq, lq); } t4_an_handler(iq, &d->rsp); break; default: KASSERT(0, ("%s: illegal response type %d on iq %p", __func__, rsp_type, iq)); log(LOG_ERR, "%s: illegal response type %d on iq %p", device_get_nameunit(sc->dev), rsp_type, iq); break; } d++; if (__predict_false(++iq->cidx == iq->sidx)) { iq->cidx = 0; iq->gen ^= F_RSPD_GEN; d = &iq->desc[0]; } if (__predict_false(++ndescs == limit)) { t4_write_reg(sc, sc->sge_gts_reg, V_CIDXINC(ndescs) | V_INGRESSQID(iq->cntxt_id) | V_SEINTARM(V_QINTR_TIMER_IDX(X_TIMERREG_UPDATE_CIDX))); ndescs = 0; #if defined(INET) || defined(INET6) if (iq->flags & IQ_LRO_ENABLED && !sort_before_lro(lro) && sc->lro_timeout != 0) { tcp_lro_flush_inactive(lro, &lro_timeout); } #endif if (budget) { FL_LOCK(fl); refill_fl(sc, fl, 32); FL_UNLOCK(fl); return (EINPROGRESS); } } if (refill) { FL_LOCK(fl); refill_fl(sc, fl, 32); FL_UNLOCK(fl); fl_hw_cidx = fl->hw_cidx; } } out: #if defined(INET) || defined(INET6) if (iq->flags & IQ_LRO_ENABLED) { if (ndescs > 0 && lro->lro_mbuf_count > 8) { MPASS(sort_before_lro(lro)); /* hold back one credit and don't flush LRO state */ iq->flags |= IQ_ADJ_CREDIT; ndescs--; } else { tcp_lro_flush_all(lro); } } #endif t4_write_reg(sc, sc->sge_gts_reg, V_CIDXINC(ndescs) | V_INGRESSQID((u32)iq->cntxt_id) | V_SEINTARM(iq->intr_params)); FL_LOCK(fl); starved = refill_fl(sc, fl, 64); FL_UNLOCK(fl); if (__predict_false(starved != 0)) add_fl_to_sfl(sc, fl); return (0); } static inline int cl_has_metadata(struct sge_fl *fl, struct cluster_layout *cll) { int rc = fl->flags & FL_BUF_PACKING || cll->region1 > 0; if (rc) MPASS(cll->region3 >= CL_METADATA_SIZE); return (rc); } static inline struct cluster_metadata * cl_metadata(struct adapter *sc, struct sge_fl *fl, struct cluster_layout *cll, caddr_t cl) { if (cl_has_metadata(fl, cll)) { struct sw_zone_info *swz = &sc->sge.sw_zone_info[cll->zidx]; return ((struct cluster_metadata *)(cl + swz->size) - 1); } return (NULL); } static void rxb_free(struct mbuf *m) { uma_zone_t zone = m->m_ext.ext_arg1; void *cl = m->m_ext.ext_arg2; uma_zfree(zone, cl); counter_u64_add(extfree_rels, 1); } /* * The mbuf returned by this function could be allocated from zone_mbuf or * constructed in spare room in the cluster. * * The mbuf carries the payload in one of these ways * a) frame inside the mbuf (mbuf from zone_mbuf) * b) m_cljset (for clusters without metadata) zone_mbuf * c) m_extaddref (cluster with metadata) inline mbuf * d) m_extaddref (cluster with metadata) zone_mbuf */ static struct mbuf * get_scatter_segment(struct adapter *sc, struct sge_fl *fl, int fr_offset, int remaining) { struct mbuf *m; struct fl_sdesc *sd = &fl->sdesc[fl->cidx]; struct cluster_layout *cll = &sd->cll; struct sw_zone_info *swz = &sc->sge.sw_zone_info[cll->zidx]; struct hw_buf_info *hwb = &sc->sge.hw_buf_info[cll->hwidx]; struct cluster_metadata *clm = cl_metadata(sc, fl, cll, sd->cl); int len, blen; caddr_t payload; blen = hwb->size - fl->rx_offset; /* max possible in this buf */ len = min(remaining, blen); payload = sd->cl + cll->region1 + fl->rx_offset; if (fl->flags & FL_BUF_PACKING) { const u_int l = fr_offset + len; const u_int pad = roundup2(l, fl->buf_boundary) - l; if (fl->rx_offset + len + pad < hwb->size) blen = len + pad; MPASS(fl->rx_offset + blen <= hwb->size); } else { MPASS(fl->rx_offset == 0); /* not packing */ } if (sc->sc_do_rxcopy && len < RX_COPY_THRESHOLD) { /* * Copy payload into a freshly allocated mbuf. 
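* Copying small frames is cheaper than tying up an entire cluster for them.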
*/ m = fr_offset == 0 ? m_gethdr(M_NOWAIT, MT_DATA) : m_get(M_NOWAIT, MT_DATA); if (m == NULL) return (NULL); fl->mbuf_allocated++; /* copy data to mbuf */ bcopy(payload, mtod(m, caddr_t), len); } else if (sd->nmbuf * MSIZE < cll->region1) { /* * There's spare room in the cluster for an mbuf. Create one * and associate it with the payload that's in the cluster. */ MPASS(clm != NULL); m = (struct mbuf *)(sd->cl + sd->nmbuf * MSIZE); /* No bzero required */ if (m_init(m, M_NOWAIT, MT_DATA, fr_offset == 0 ? M_PKTHDR | M_NOFREE : M_NOFREE)) return (NULL); fl->mbuf_inlined++; m_extaddref(m, payload, blen, &clm->refcount, rxb_free, swz->zone, sd->cl); if (sd->nmbuf++ == 0) counter_u64_add(extfree_refs, 1); } else { /* * Grab an mbuf from zone_mbuf and associate it with the * payload in the cluster. */ m = fr_offset == 0 ? m_gethdr(M_NOWAIT, MT_DATA) : m_get(M_NOWAIT, MT_DATA); if (m == NULL) return (NULL); fl->mbuf_allocated++; if (clm != NULL) { m_extaddref(m, payload, blen, &clm->refcount, rxb_free, swz->zone, sd->cl); if (sd->nmbuf++ == 0) counter_u64_add(extfree_refs, 1); } else { m_cljset(m, sd->cl, swz->type); sd->cl = NULL; /* consumed, not a recycle candidate */ } } if (fr_offset == 0) m->m_pkthdr.len = remaining; m->m_len = len; if (fl->flags & FL_BUF_PACKING) { fl->rx_offset += blen; MPASS(fl->rx_offset <= hwb->size); if (fl->rx_offset < hwb->size) return (m); /* without advancing the cidx */ } if (__predict_false(++fl->cidx % 8 == 0)) { uint16_t cidx = fl->cidx / 8; if (__predict_false(cidx == fl->sidx)) fl->cidx = cidx = 0; fl->hw_cidx = cidx; } fl->rx_offset = 0; return (m); } static struct mbuf * get_fl_payload(struct adapter *sc, struct sge_fl *fl, uint32_t len_newbuf) { struct mbuf *m0, *m, **pnext; u_int remaining; const u_int total = G_RSPD_LEN(len_newbuf); if (__predict_false(fl->flags & FL_BUF_RESUME)) { M_ASSERTPKTHDR(fl->m0); MPASS(fl->m0->m_pkthdr.len == total); MPASS(fl->remaining < total); m0 = fl->m0; pnext = fl->pnext; remaining = fl->remaining; fl->flags &= ~FL_BUF_RESUME; goto get_segment; } if (fl->rx_offset > 0 && len_newbuf & F_RSPD_NEWBUF) { fl->rx_offset = 0; if (__predict_false(++fl->cidx % 8 == 0)) { uint16_t cidx = fl->cidx / 8; if (__predict_false(cidx == fl->sidx)) fl->cidx = cidx = 0; fl->hw_cidx = cidx; } } /* * Payload starts at rx_offset in the current hw buffer. Its length is * 'len' and it may span multiple hw buffers. 
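* ('len' here is the total frame length, G_RSPD_LEN(len_newbuf).)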
*/ m0 = get_scatter_segment(sc, fl, 0, total); if (m0 == NULL) return (NULL); remaining = total - m0->m_len; pnext = &m0->m_next; while (remaining > 0) { get_segment: MPASS(fl->rx_offset == 0); m = get_scatter_segment(sc, fl, total - remaining, remaining); if (__predict_false(m == NULL)) { fl->m0 = m0; fl->pnext = pnext; fl->remaining = remaining; fl->flags |= FL_BUF_RESUME; return (NULL); } *pnext = m; pnext = &m->m_next; remaining -= m->m_len; } *pnext = NULL; M_ASSERTPKTHDR(m0); return (m0); } static int t4_eth_rx(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m0) { struct sge_rxq *rxq = iq_to_rxq(iq); struct ifnet *ifp = rxq->ifp; struct adapter *sc = iq->adapter; const struct cpl_rx_pkt *cpl = (const void *)(rss + 1); #if defined(INET) || defined(INET6) struct lro_ctrl *lro = &rxq->lro; #endif static const int sw_hashtype[4][2] = { {M_HASHTYPE_NONE, M_HASHTYPE_NONE}, {M_HASHTYPE_RSS_IPV4, M_HASHTYPE_RSS_IPV6}, {M_HASHTYPE_RSS_TCP_IPV4, M_HASHTYPE_RSS_TCP_IPV6}, {M_HASHTYPE_RSS_UDP_IPV4, M_HASHTYPE_RSS_UDP_IPV6}, }; KASSERT(m0 != NULL, ("%s: no payload with opcode %02x", __func__, rss->opcode)); m0->m_pkthdr.len -= sc->params.sge.fl_pktshift; m0->m_len -= sc->params.sge.fl_pktshift; m0->m_data += sc->params.sge.fl_pktshift; m0->m_pkthdr.rcvif = ifp; M_HASHTYPE_SET(m0, sw_hashtype[rss->hash_type][rss->ipv6]); m0->m_pkthdr.flowid = be32toh(rss->hash_val); if (cpl->csum_calc && !(cpl->err_vec & sc->params.tp.err_vec_mask)) { if (ifp->if_capenable & IFCAP_RXCSUM && cpl->l2info & htobe32(F_RXF_IP)) { m0->m_pkthdr.csum_flags = (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); rxq->rxcsum++; } else if (ifp->if_capenable & IFCAP_RXCSUM_IPV6 && cpl->l2info & htobe32(F_RXF_IP6)) { m0->m_pkthdr.csum_flags = (CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR); rxq->rxcsum++; } if (__predict_false(cpl->ip_frag)) m0->m_pkthdr.csum_data = be16toh(cpl->csum); else m0->m_pkthdr.csum_data = 0xffff; } if (cpl->vlan_ex) { m0->m_pkthdr.ether_vtag = be16toh(cpl->vlan); m0->m_flags |= M_VLANTAG; rxq->vlan_extraction++; } +#ifdef NUMA + m0->m_pkthdr.numa_domain = ifp->if_numa_domain; +#endif #if defined(INET) || defined(INET6) if (iq->flags & IQ_LRO_ENABLED) { if (sort_before_lro(lro)) { tcp_lro_queue_mbuf(lro, m0); return (0); /* queued for sort, then LRO */ } if (tcp_lro_rx(lro, m0, 0) == 0) return (0); /* queued for LRO */ } #endif ifp->if_input(ifp, m0); return (0); } /* * Must drain the wrq or make sure that someone else will. */ static void wrq_tx_drain(void *arg, int n) { struct sge_wrq *wrq = arg; struct sge_eq *eq = &wrq->eq; EQ_LOCK(eq); if (TAILQ_EMPTY(&wrq->incomplete_wrs) && !STAILQ_EMPTY(&wrq->wr_list)) drain_wrq_wr_list(wrq->adapter, wrq); EQ_UNLOCK(eq); } static void drain_wrq_wr_list(struct adapter *sc, struct sge_wrq *wrq) { struct sge_eq *eq = &wrq->eq; u_int available, dbdiff; /* # of hardware descriptors */ u_int n; struct wrqe *wr; struct fw_eth_tx_pkt_wr *dst; /* any fw WR struct will do */ EQ_LOCK_ASSERT_OWNED(eq); MPASS(TAILQ_EMPTY(&wrq->incomplete_wrs)); wr = STAILQ_FIRST(&wrq->wr_list); MPASS(wr != NULL); /* Must be called with something useful to do */ MPASS(eq->pidx == eq->dbidx); dbdiff = 0; do { eq->cidx = read_hw_cidx(eq); if (eq->pidx == eq->cidx) available = eq->sidx - 1; else available = IDXDIFF(eq->cidx, eq->pidx, eq->sidx) - 1; MPASS(wr->wrq == wrq); n = howmany(wr->wr_len, EQ_ESIZE); if (available < n) break; dst = (void *)&eq->desc[eq->pidx]; if (__predict_true(eq->sidx - eq->pidx > n)) { /* Won't wrap, won't end exactly at the status page. 
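* A single contiguous copy into the descriptor ring is sufficient.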
*/ bcopy(&wr->wr[0], dst, wr->wr_len); eq->pidx += n; } else { int first_portion = (eq->sidx - eq->pidx) * EQ_ESIZE; bcopy(&wr->wr[0], dst, first_portion); if (wr->wr_len > first_portion) { bcopy(&wr->wr[first_portion], &eq->desc[0], wr->wr_len - first_portion); } eq->pidx = n - (eq->sidx - eq->pidx); } wrq->tx_wrs_copied++; if (available < eq->sidx / 4 && atomic_cmpset_int(&eq->equiq, 0, 1)) { /* * XXX: This is not 100% reliable with some * types of WRs. But this is a very unusual * situation for an ofld/ctrl queue anyway. */ dst->equiq_to_len16 |= htobe32(F_FW_WR_EQUIQ | F_FW_WR_EQUEQ); } dbdiff += n; if (dbdiff >= 16) { ring_eq_db(sc, eq, dbdiff); dbdiff = 0; } STAILQ_REMOVE_HEAD(&wrq->wr_list, link); free_wrqe(wr); MPASS(wrq->nwr_pending > 0); wrq->nwr_pending--; MPASS(wrq->ndesc_needed >= n); wrq->ndesc_needed -= n; } while ((wr = STAILQ_FIRST(&wrq->wr_list)) != NULL); if (dbdiff) ring_eq_db(sc, eq, dbdiff); } /* * Doesn't fail. Holds on to work requests it can't send right away. */ void t4_wrq_tx_locked(struct adapter *sc, struct sge_wrq *wrq, struct wrqe *wr) { #ifdef INVARIANTS struct sge_eq *eq = &wrq->eq; #endif EQ_LOCK_ASSERT_OWNED(eq); MPASS(wr != NULL); MPASS(wr->wr_len > 0 && wr->wr_len <= SGE_MAX_WR_LEN); MPASS((wr->wr_len & 0x7) == 0); STAILQ_INSERT_TAIL(&wrq->wr_list, wr, link); wrq->nwr_pending++; wrq->ndesc_needed += howmany(wr->wr_len, EQ_ESIZE); if (!TAILQ_EMPTY(&wrq->incomplete_wrs)) return; /* commit_wrq_wr will drain wr_list as well. */ drain_wrq_wr_list(sc, wrq); /* Doorbell must have caught up to the pidx. */ MPASS(eq->pidx == eq->dbidx); } void t4_update_fl_bufsize(struct ifnet *ifp) { struct vi_info *vi = ifp->if_softc; struct adapter *sc = vi->pi->adapter; struct sge_rxq *rxq; #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; #endif struct sge_fl *fl; int i, maxp, mtu = ifp->if_mtu; maxp = mtu_to_max_payload(sc, mtu, 0); for_each_rxq(vi, i, rxq) { fl = &rxq->fl; FL_LOCK(fl); find_best_refill_source(sc, fl, maxp); FL_UNLOCK(fl); } #ifdef TCP_OFFLOAD maxp = mtu_to_max_payload(sc, mtu, 1); for_each_ofld_rxq(vi, i, ofld_rxq) { fl = &ofld_rxq->fl; FL_LOCK(fl); find_best_refill_source(sc, fl, maxp); FL_UNLOCK(fl); } #endif } static inline int mbuf_nsegs(struct mbuf *m) { M_ASSERTPKTHDR(m); KASSERT(m->m_pkthdr.l5hlen > 0, ("%s: mbuf %p missing information on # of segments.", __func__, m)); return (m->m_pkthdr.l5hlen); } static inline void set_mbuf_nsegs(struct mbuf *m, uint8_t nsegs) { M_ASSERTPKTHDR(m); m->m_pkthdr.l5hlen = nsegs; } static inline int mbuf_cflags(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.PH_loc.eight[4]); } static inline void set_mbuf_cflags(struct mbuf *m, uint8_t flags) { M_ASSERTPKTHDR(m); m->m_pkthdr.PH_loc.eight[4] = flags; } static inline int mbuf_len16(struct mbuf *m) { int n; M_ASSERTPKTHDR(m); n = m->m_pkthdr.PH_loc.eight[0]; MPASS(n > 0 && n <= SGE_MAX_WR_LEN / 16); return (n); } static inline void set_mbuf_len16(struct mbuf *m, uint8_t len16) { M_ASSERTPKTHDR(m); m->m_pkthdr.PH_loc.eight[0] = len16; } #ifdef RATELIMIT static inline int mbuf_eo_nsegs(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.PH_loc.eight[1]); } static inline void set_mbuf_eo_nsegs(struct mbuf *m, uint8_t nsegs) { M_ASSERTPKTHDR(m); m->m_pkthdr.PH_loc.eight[1] = nsegs; } static inline int mbuf_eo_len16(struct mbuf *m) { int n; M_ASSERTPKTHDR(m); n = m->m_pkthdr.PH_loc.eight[2]; MPASS(n > 0 && n <= SGE_MAX_WR_LEN / 16); return (n); } static inline void set_mbuf_eo_len16(struct mbuf *m, uint8_t len16) { M_ASSERTPKTHDR(m); 
m->m_pkthdr.PH_loc.eight[2] = len16; } static inline int mbuf_eo_tsclk_tsoff(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.PH_loc.eight[3]); } static inline void set_mbuf_eo_tsclk_tsoff(struct mbuf *m, uint8_t tsclk_tsoff) { M_ASSERTPKTHDR(m); m->m_pkthdr.PH_loc.eight[3] = tsclk_tsoff; } static inline int needs_eo(struct mbuf *m) { return (m->m_pkthdr.snd_tag != NULL); } #endif /* * Try to allocate an mbuf to contain a raw work request. To make it * easy to construct the work request, don't allocate a chain but a * single mbuf. */ struct mbuf * alloc_wr_mbuf(int len, int how) { struct mbuf *m; if (len <= MHLEN) m = m_gethdr(how, MT_DATA); else if (len <= MCLBYTES) m = m_getcl(how, MT_DATA, M_PKTHDR); else m = NULL; if (m == NULL) return (NULL); m->m_pkthdr.len = len; m->m_len = len; set_mbuf_cflags(m, MC_RAW_WR); set_mbuf_len16(m, howmany(len, 16)); return (m); } static inline int needs_tso(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.csum_flags & CSUM_TSO); } static inline int needs_l3_csum(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.csum_flags & (CSUM_IP | CSUM_TSO)); } static inline int needs_l4_csum(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.csum_flags & (CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6 | CSUM_TSO)); } static inline int needs_tcp_csum(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.csum_flags & (CSUM_TCP | CSUM_TCP_IPV6 | CSUM_TSO)); } #ifdef RATELIMIT static inline int needs_udp_csum(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_pkthdr.csum_flags & (CSUM_UDP | CSUM_UDP_IPV6)); } #endif static inline int needs_vlan_insertion(struct mbuf *m) { M_ASSERTPKTHDR(m); return (m->m_flags & M_VLANTAG); } static void * m_advance(struct mbuf **pm, int *poffset, int len) { struct mbuf *m = *pm; int offset = *poffset; uintptr_t p = 0; MPASS(len > 0); for (;;) { if (offset + len < m->m_len) { offset += len; p = mtod(m, uintptr_t) + offset; break; } len -= m->m_len - offset; m = m->m_next; offset = 0; MPASS(m != NULL); } *poffset = offset; *pm = m; return ((void *)p); } /* * Can deal with empty mbufs in the chain that have m_len = 0, but the chain * must have at least one mbuf that's not empty. It is possible for this * routine to return 0 if skip accounts for all the contents of the mbuf chain. */ static inline int count_mbuf_nsegs(struct mbuf *m, int skip) { vm_paddr_t lastb, next; vm_offset_t va; int len, nsegs; M_ASSERTPKTHDR(m); MPASS(m->m_pkthdr.len > 0); MPASS(m->m_pkthdr.len >= skip); nsegs = 0; lastb = 0; for (; m; m = m->m_next) { len = m->m_len; if (__predict_false(len == 0)) continue; if (skip >= len) { skip -= len; continue; } va = mtod(m, vm_offset_t) + skip; len -= skip; skip = 0; next = pmap_kextract(va); nsegs += sglist_count((void *)(uintptr_t)va, len); if (lastb + 1 == next) nsegs--; lastb = pmap_kextract(va + len - 1); } return (nsegs); } /* * Analyze the mbuf to determine its tx needs. The mbuf passed in may change: * a) caller can assume it's been freed if this function returns with an error. * b) it may get defragged up if the gather list is too long for the hardware. 
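* c) its packet header is updated with the segment count and work request length for the tx path to use.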
*/ int parse_pkt(struct adapter *sc, struct mbuf **mp) { struct mbuf *m0 = *mp, *m; int rc, nsegs, defragged = 0, offset; struct ether_header *eh; void *l3hdr; #if defined(INET) || defined(INET6) struct tcphdr *tcp; #endif uint16_t eh_type; M_ASSERTPKTHDR(m0); if (__predict_false(m0->m_pkthdr.len < ETHER_HDR_LEN)) { rc = EINVAL; fail: m_freem(m0); *mp = NULL; return (rc); } restart: /* * First count the number of gather list segments in the payload. * Defrag the mbuf if nsegs exceeds the hardware limit. */ M_ASSERTPKTHDR(m0); MPASS(m0->m_pkthdr.len > 0); nsegs = count_mbuf_nsegs(m0, 0); if (nsegs > (needs_tso(m0) ? TX_SGL_SEGS_TSO : TX_SGL_SEGS)) { if (defragged++ > 0 || (m = m_defrag(m0, M_NOWAIT)) == NULL) { rc = EFBIG; goto fail; } *mp = m0 = m; /* update caller's copy after defrag */ goto restart; } if (__predict_false(nsegs > 2 && m0->m_pkthdr.len <= MHLEN)) { m0 = m_pullup(m0, m0->m_pkthdr.len); if (m0 == NULL) { /* Should have left well enough alone. */ rc = EFBIG; goto fail; } *mp = m0; /* update caller's copy after pullup */ goto restart; } set_mbuf_nsegs(m0, nsegs); set_mbuf_cflags(m0, 0); if (sc->flags & IS_VF) set_mbuf_len16(m0, txpkt_vm_len16(nsegs, needs_tso(m0))); else set_mbuf_len16(m0, txpkt_len16(nsegs, needs_tso(m0))); #ifdef RATELIMIT /* * Ethofld is limited to TCP and UDP for now, and only when L4 hw * checksumming is enabled. needs_l4_csum happens to check for all the * right things. */ if (__predict_false(needs_eo(m0) && !needs_l4_csum(m0))) m0->m_pkthdr.snd_tag = NULL; #endif if (!needs_tso(m0) && #ifdef RATELIMIT !needs_eo(m0) && #endif !(sc->flags & IS_VF && (needs_l3_csum(m0) || needs_l4_csum(m0)))) return (0); m = m0; eh = mtod(m, struct ether_header *); eh_type = ntohs(eh->ether_type); if (eh_type == ETHERTYPE_VLAN) { struct ether_vlan_header *evh = (void *)eh; eh_type = ntohs(evh->evl_proto); m0->m_pkthdr.l2hlen = sizeof(*evh); } else m0->m_pkthdr.l2hlen = sizeof(*eh); offset = 0; l3hdr = m_advance(&m, &offset, m0->m_pkthdr.l2hlen); switch (eh_type) { #ifdef INET6 case ETHERTYPE_IPV6: { struct ip6_hdr *ip6 = l3hdr; MPASS(!needs_tso(m0) || ip6->ip6_nxt == IPPROTO_TCP); m0->m_pkthdr.l3hlen = sizeof(*ip6); break; } #endif #ifdef INET case ETHERTYPE_IP: { struct ip *ip = l3hdr; m0->m_pkthdr.l3hlen = ip->ip_hl * 4; break; } #endif default: panic("%s: ethertype 0x%04x unknown. if_cxgbe must be compiled" " with the same INET/INET6 options as the kernel.", __func__, eh_type); } #if defined(INET) || defined(INET6) if (needs_tcp_csum(m0)) { tcp = m_advance(&m, &offset, m0->m_pkthdr.l3hlen); m0->m_pkthdr.l4hlen = tcp->th_off * 4; #ifdef RATELIMIT if (tsclk >= 0 && *(uint32_t *)(tcp + 1) == ntohl(0x0101080a)) { set_mbuf_eo_tsclk_tsoff(m0, V_FW_ETH_TX_EO_WR_TSCLK(tsclk) | V_FW_ETH_TX_EO_WR_TSOFF(sizeof(*tcp) / 2 + 1)); } else set_mbuf_eo_tsclk_tsoff(m0, 0); } else if (needs_udp_csum(m)) { m0->m_pkthdr.l4hlen = sizeof(struct udphdr); #endif } #ifdef RATELIMIT if (needs_eo(m0)) { u_int immhdrs; /* EO WRs have the headers in the WR and not the GL. 
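* Skip those header bytes when counting gather list segments.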
*/ immhdrs = m0->m_pkthdr.l2hlen + m0->m_pkthdr.l3hlen + m0->m_pkthdr.l4hlen; nsegs = count_mbuf_nsegs(m0, immhdrs); set_mbuf_eo_nsegs(m0, nsegs); set_mbuf_eo_len16(m0, txpkt_eo_len16(nsegs, immhdrs, needs_tso(m0))); } #endif #endif MPASS(m0 == *mp); return (0); } void * start_wrq_wr(struct sge_wrq *wrq, int len16, struct wrq_cookie *cookie) { struct sge_eq *eq = &wrq->eq; struct adapter *sc = wrq->adapter; int ndesc, available; struct wrqe *wr; void *w; MPASS(len16 > 0); ndesc = howmany(len16, EQ_ESIZE / 16); MPASS(ndesc > 0 && ndesc <= SGE_MAX_WR_NDESC); EQ_LOCK(eq); if (TAILQ_EMPTY(&wrq->incomplete_wrs) && !STAILQ_EMPTY(&wrq->wr_list)) drain_wrq_wr_list(sc, wrq); if (!STAILQ_EMPTY(&wrq->wr_list)) { slowpath: EQ_UNLOCK(eq); wr = alloc_wrqe(len16 * 16, wrq); if (__predict_false(wr == NULL)) return (NULL); cookie->pidx = -1; cookie->ndesc = ndesc; return (&wr->wr); } eq->cidx = read_hw_cidx(eq); if (eq->pidx == eq->cidx) available = eq->sidx - 1; else available = IDXDIFF(eq->cidx, eq->pidx, eq->sidx) - 1; if (available < ndesc) goto slowpath; cookie->pidx = eq->pidx; cookie->ndesc = ndesc; TAILQ_INSERT_TAIL(&wrq->incomplete_wrs, cookie, link); w = &eq->desc[eq->pidx]; IDXINCR(eq->pidx, ndesc, eq->sidx); if (__predict_false(cookie->pidx + ndesc > eq->sidx)) { w = &wrq->ss[0]; wrq->ss_pidx = cookie->pidx; wrq->ss_len = len16 * 16; } EQ_UNLOCK(eq); return (w); } void commit_wrq_wr(struct sge_wrq *wrq, void *w, struct wrq_cookie *cookie) { struct sge_eq *eq = &wrq->eq; struct adapter *sc = wrq->adapter; int ndesc, pidx; struct wrq_cookie *prev, *next; if (cookie->pidx == -1) { struct wrqe *wr = __containerof(w, struct wrqe, wr); t4_wrq_tx(sc, wr); return; } if (__predict_false(w == &wrq->ss[0])) { int n = (eq->sidx - wrq->ss_pidx) * EQ_ESIZE; MPASS(wrq->ss_len > n); /* WR had better wrap around. */ bcopy(&wrq->ss[0], &eq->desc[wrq->ss_pidx], n); bcopy(&wrq->ss[n], &eq->desc[0], wrq->ss_len - n); wrq->tx_wrs_ss++; } else wrq->tx_wrs_direct++; EQ_LOCK(eq); ndesc = cookie->ndesc; /* Can be more than SGE_MAX_WR_NDESC here. */ pidx = cookie->pidx; MPASS(pidx >= 0 && pidx < eq->sidx); prev = TAILQ_PREV(cookie, wrq_incomplete_wrs, link); next = TAILQ_NEXT(cookie, link); if (prev == NULL) { MPASS(pidx == eq->dbidx); if (next == NULL || ndesc >= 16) { int available; struct fw_eth_tx_pkt_wr *dst; /* any fw WR struct will do */ /* * Note that the WR via which we'll request tx updates * is at pidx and not eq->pidx, which has moved on * already. */ dst = (void *)&eq->desc[pidx]; available = IDXDIFF(eq->cidx, eq->pidx, eq->sidx) - 1; if (available < eq->sidx / 4 && atomic_cmpset_int(&eq->equiq, 0, 1)) { /* * XXX: This is not 100% reliable with some * types of WRs. But this is a very unusual * situation for an ofld/ctrl queue anyway. */ dst->equiq_to_len16 |= htobe32(F_FW_WR_EQUIQ | F_FW_WR_EQUEQ); } ring_eq_db(wrq->adapter, eq, ndesc); } else { MPASS(IDXDIFF(next->pidx, pidx, eq->sidx) == ndesc); next->pidx = pidx; next->ndesc += ndesc; } } else { MPASS(IDXDIFF(pidx, prev->pidx, eq->sidx) == prev->ndesc); prev->ndesc += ndesc; } TAILQ_REMOVE(&wrq->incomplete_wrs, cookie, link); if (TAILQ_EMPTY(&wrq->incomplete_wrs) && !STAILQ_EMPTY(&wrq->wr_list)) drain_wrq_wr_list(sc, wrq); #ifdef INVARIANTS if (TAILQ_EMPTY(&wrq->incomplete_wrs)) { /* Doorbell must have caught up to the pidx. 
*/ MPASS(wrq->eq.pidx == wrq->eq.dbidx); } #endif EQ_UNLOCK(eq); } static u_int can_resume_eth_tx(struct mp_ring *r) { struct sge_eq *eq = r->cookie; return (total_available_tx_desc(eq) > eq->sidx / 8); } static inline int cannot_use_txpkts(struct mbuf *m) { /* maybe put a GL limit too, to avoid silliness? */ return (needs_tso(m) || (mbuf_cflags(m) & MC_RAW_WR) != 0); } static inline int discard_tx(struct sge_eq *eq) { return ((eq->flags & (EQ_ENABLED | EQ_QFLUSH)) != EQ_ENABLED); } static inline int wr_can_update_eq(struct fw_eth_tx_pkts_wr *wr) { switch (G_FW_WR_OP(be32toh(wr->op_pkd))) { case FW_ULPTX_WR: case FW_ETH_TX_PKT_WR: case FW_ETH_TX_PKTS_WR: case FW_ETH_TX_PKT_VM_WR: return (1); default: return (0); } } /* * r->items[cidx] to r->items[pidx], with a wraparound at r->size, are ready to * be consumed. Return the actual number consumed. 0 indicates a stall. */ static u_int eth_tx(struct mp_ring *r, u_int cidx, u_int pidx) { struct sge_txq *txq = r->cookie; struct sge_eq *eq = &txq->eq; struct ifnet *ifp = txq->ifp; struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; u_int total, remaining; /* # of packets */ u_int available, dbdiff; /* # of hardware descriptors */ u_int n, next_cidx; struct mbuf *m0, *tail; struct txpkts txp; struct fw_eth_tx_pkts_wr *wr; /* any fw WR struct will do */ remaining = IDXDIFF(pidx, cidx, r->size); MPASS(remaining > 0); /* Must not be called without work to do. */ total = 0; TXQ_LOCK(txq); if (__predict_false(discard_tx(eq))) { while (cidx != pidx) { m0 = r->items[cidx]; m_freem(m0); if (++cidx == r->size) cidx = 0; } reclaim_tx_descs(txq, 2048); total = remaining; goto done; } /* How many hardware descriptors do we have readily available. */ if (eq->pidx == eq->cidx) available = eq->sidx - 1; else available = IDXDIFF(eq->cidx, eq->pidx, eq->sidx) - 1; dbdiff = IDXDIFF(eq->pidx, eq->dbidx, eq->sidx); while (remaining > 0) { m0 = r->items[cidx]; M_ASSERTPKTHDR(m0); MPASS(m0->m_nextpkt == NULL); if (available < SGE_MAX_WR_NDESC) { available += reclaim_tx_descs(txq, 64); if (available < howmany(mbuf_len16(m0), EQ_ESIZE / 16)) break; /* out of descriptors */ } next_cidx = cidx + 1; if (__predict_false(next_cidx == r->size)) next_cidx = 0; wr = (void *)&eq->desc[eq->pidx]; if (sc->flags & IS_VF) { total++; remaining--; ETHER_BPF_MTAP(ifp, m0); n = write_txpkt_vm_wr(sc, txq, (void *)wr, m0, available); } else if (remaining > 1 && try_txpkts(m0, r->items[next_cidx], &txp, available) == 0) { /* pkts at cidx, next_cidx should both be in txp. 
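* (try_txpkts succeeded, so the two frames will be coalesced into a single work request.)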
*/ MPASS(txp.npkt == 2); tail = r->items[next_cidx]; MPASS(tail->m_nextpkt == NULL); ETHER_BPF_MTAP(ifp, m0); ETHER_BPF_MTAP(ifp, tail); m0->m_nextpkt = tail; if (__predict_false(++next_cidx == r->size)) next_cidx = 0; while (next_cidx != pidx) { if (add_to_txpkts(r->items[next_cidx], &txp, available) != 0) break; tail->m_nextpkt = r->items[next_cidx]; tail = tail->m_nextpkt; ETHER_BPF_MTAP(ifp, tail); if (__predict_false(++next_cidx == r->size)) next_cidx = 0; } n = write_txpkts_wr(txq, wr, m0, &txp, available); total += txp.npkt; remaining -= txp.npkt; } else if (mbuf_cflags(m0) & MC_RAW_WR) { total++; remaining--; n = write_raw_wr(txq, (void *)wr, m0, available); } else { total++; remaining--; ETHER_BPF_MTAP(ifp, m0); n = write_txpkt_wr(txq, (void *)wr, m0, available); } MPASS(n >= 1 && n <= available && n <= SGE_MAX_WR_NDESC); available -= n; dbdiff += n; IDXINCR(eq->pidx, n, eq->sidx); if (wr_can_update_eq(wr)) { if (total_available_tx_desc(eq) < eq->sidx / 4 && atomic_cmpset_int(&eq->equiq, 0, 1)) { wr->equiq_to_len16 |= htobe32(F_FW_WR_EQUIQ | F_FW_WR_EQUEQ); eq->equeqidx = eq->pidx; } else if (IDXDIFF(eq->pidx, eq->equeqidx, eq->sidx) >= 32) { wr->equiq_to_len16 |= htobe32(F_FW_WR_EQUEQ); eq->equeqidx = eq->pidx; } } if (dbdiff >= 16 && remaining >= 4) { ring_eq_db(sc, eq, dbdiff); available += reclaim_tx_descs(txq, 4 * dbdiff); dbdiff = 0; } cidx = next_cidx; } if (dbdiff != 0) { ring_eq_db(sc, eq, dbdiff); reclaim_tx_descs(txq, 32); } done: TXQ_UNLOCK(txq); return (total); } static inline void init_iq(struct sge_iq *iq, struct adapter *sc, int tmr_idx, int pktc_idx, int qsize) { KASSERT(tmr_idx >= 0 && tmr_idx < SGE_NTIMERS, ("%s: bad tmr_idx %d", __func__, tmr_idx)); KASSERT(pktc_idx < SGE_NCOUNTERS, /* -ve is ok, means don't use */ ("%s: bad pktc_idx %d", __func__, pktc_idx)); iq->flags = 0; iq->adapter = sc; iq->intr_params = V_QINTR_TIMER_IDX(tmr_idx); iq->intr_pktc_idx = SGE_NCOUNTERS - 1; if (pktc_idx >= 0) { iq->intr_params |= F_QINTR_CNT_EN; iq->intr_pktc_idx = pktc_idx; } iq->qsize = roundup2(qsize, 16); /* See FW_IQ_CMD/iqsize */ iq->sidx = iq->qsize - sc->params.sge.spg_len / IQ_ESIZE; } static inline void init_fl(struct adapter *sc, struct sge_fl *fl, int qsize, int maxp, char *name) { fl->qsize = qsize; fl->sidx = qsize - sc->params.sge.spg_len / EQ_ESIZE; strlcpy(fl->lockname, name, sizeof(fl->lockname)); if (sc->flags & BUF_PACKING_OK && ((!is_t4(sc) && buffer_packing) || /* T5+: enabled unless 0 */ (is_t4(sc) && buffer_packing == 1)))/* T4: disabled unless 1 */ fl->flags |= FL_BUF_PACKING; find_best_refill_source(sc, fl, maxp); find_safe_refill_source(sc, fl); } static inline void init_eq(struct adapter *sc, struct sge_eq *eq, int eqtype, int qsize, uint8_t tx_chan, uint16_t iqid, char *name) { KASSERT(eqtype <= EQ_TYPEMASK, ("%s: bad qtype %d", __func__, eqtype)); eq->flags = eqtype & EQ_TYPEMASK; eq->tx_chan = tx_chan; eq->iqid = iqid; eq->sidx = qsize - sc->params.sge.spg_len / EQ_ESIZE; strlcpy(eq->lockname, name, sizeof(eq->lockname)); } static int alloc_ring(struct adapter *sc, size_t len, bus_dma_tag_t *tag, bus_dmamap_t *map, bus_addr_t *pa, void **va) { int rc; rc = bus_dma_tag_create(sc->dmat, 512, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, len, 1, len, 0, NULL, NULL, tag); if (rc != 0) { device_printf(sc->dev, "cannot allocate DMA tag: %d\n", rc); goto done; } rc = bus_dmamem_alloc(*tag, va, BUS_DMA_WAITOK | BUS_DMA_COHERENT | BUS_DMA_ZERO, map); if (rc != 0) { device_printf(sc->dev, "cannot allocate DMA memory: %d\n", rc); goto done; } rc = 
bus_dmamap_load(*tag, *map, *va, len, oneseg_dma_callback, pa, 0); if (rc != 0) { device_printf(sc->dev, "cannot load DMA map: %d\n", rc); goto done; } done: if (rc) free_ring(sc, *tag, *map, *pa, *va); return (rc); } static int free_ring(struct adapter *sc, bus_dma_tag_t tag, bus_dmamap_t map, bus_addr_t pa, void *va) { if (pa) bus_dmamap_unload(tag, map); if (va) bus_dmamem_free(tag, va, map); if (tag) bus_dma_tag_destroy(tag); return (0); } /* * Allocates the ring for an ingress queue and an optional freelist. If the * freelist is specified it will be allocated and then associated with the * ingress queue. * * Returns errno on failure. Resources allocated up to that point may still be * allocated. Caller is responsible for cleanup in case this function fails. * * If the ingress queue will take interrupts directly then the intr_idx * specifies the vector, starting from 0. -1 means the interrupts for this * queue should be forwarded to the fwq. */ static int alloc_iq_fl(struct vi_info *vi, struct sge_iq *iq, struct sge_fl *fl, int intr_idx, int cong) { int rc, i, cntxt_id; size_t len; struct fw_iq_cmd c; struct port_info *pi = vi->pi; struct adapter *sc = iq->adapter; struct sge_params *sp = &sc->params.sge; __be32 v = 0; len = iq->qsize * IQ_ESIZE; rc = alloc_ring(sc, len, &iq->desc_tag, &iq->desc_map, &iq->ba, (void **)&iq->desc); if (rc != 0) return (rc); bzero(&c, sizeof(c)); c.op_to_vfn = htobe32(V_FW_CMD_OP(FW_IQ_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_WRITE | F_FW_CMD_EXEC | V_FW_IQ_CMD_PFN(sc->pf) | V_FW_IQ_CMD_VFN(0)); c.alloc_to_len16 = htobe32(F_FW_IQ_CMD_ALLOC | F_FW_IQ_CMD_IQSTART | FW_LEN16(c)); /* Special handling for firmware event queue */ if (iq == &sc->sge.fwq) v |= F_FW_IQ_CMD_IQASYNCH; if (intr_idx < 0) { /* Forwarded interrupts, all headed to fwq */ v |= F_FW_IQ_CMD_IQANDST; v |= V_FW_IQ_CMD_IQANDSTINDEX(sc->sge.fwq.cntxt_id); } else { KASSERT(intr_idx < sc->intr_count, ("%s: invalid direct intr_idx %d", __func__, intr_idx)); v |= V_FW_IQ_CMD_IQANDSTINDEX(intr_idx); } c.type_to_iqandstindex = htobe32(v | V_FW_IQ_CMD_TYPE(FW_IQ_TYPE_FL_INT_CAP) | V_FW_IQ_CMD_VIID(vi->viid) | V_FW_IQ_CMD_IQANUD(X_UPDATEDELIVERY_INTERRUPT)); c.iqdroprss_to_iqesize = htobe16(V_FW_IQ_CMD_IQPCIECH(pi->tx_chan) | F_FW_IQ_CMD_IQGTSMODE | V_FW_IQ_CMD_IQINTCNTTHRESH(iq->intr_pktc_idx) | V_FW_IQ_CMD_IQESIZE(ilog2(IQ_ESIZE) - 4)); c.iqsize = htobe16(iq->qsize); c.iqaddr = htobe64(iq->ba); if (cong >= 0) c.iqns_to_fl0congen = htobe32(F_FW_IQ_CMD_IQFLINTCONGEN); if (fl) { mtx_init(&fl->fl_lock, fl->lockname, NULL, MTX_DEF); len = fl->qsize * EQ_ESIZE; rc = alloc_ring(sc, len, &fl->desc_tag, &fl->desc_map, &fl->ba, (void **)&fl->desc); if (rc) return (rc); /* Allocate space for one software descriptor per buffer. */ rc = alloc_fl_sdesc(fl); if (rc != 0) { device_printf(sc->dev, "failed to setup fl software descriptors: %d\n", rc); return (rc); } if (fl->flags & FL_BUF_PACKING) { fl->lowat = roundup2(sp->fl_starve_threshold2, 8); fl->buf_boundary = sp->pack_boundary; } else { fl->lowat = roundup2(sp->fl_starve_threshold, 8); fl->buf_boundary = 16; } if (fl_pad && fl->buf_boundary < sp->pad_boundary) fl->buf_boundary = sp->pad_boundary; c.iqns_to_fl0congen |= htobe32(V_FW_IQ_CMD_FL0HOSTFCMODE(X_HOSTFCMODE_NONE) | F_FW_IQ_CMD_FL0FETCHRO | F_FW_IQ_CMD_FL0DATARO | (fl_pad ? F_FW_IQ_CMD_FL0PADEN : 0) | (fl->flags & FL_BUF_PACKING ? 
F_FW_IQ_CMD_FL0PACKEN : 0)); if (cong >= 0) { c.iqns_to_fl0congen |= htobe32(V_FW_IQ_CMD_FL0CNGCHMAP(cong) | F_FW_IQ_CMD_FL0CONGCIF | F_FW_IQ_CMD_FL0CONGEN); } c.fl0dcaen_to_fl0cidxfthresh = htobe16(V_FW_IQ_CMD_FL0FBMIN(chip_id(sc) <= CHELSIO_T5 ? X_FETCHBURSTMIN_128B : X_FETCHBURSTMIN_64B) | V_FW_IQ_CMD_FL0FBMAX(chip_id(sc) <= CHELSIO_T5 ? X_FETCHBURSTMAX_512B : X_FETCHBURSTMAX_256B)); c.fl0size = htobe16(fl->qsize); c.fl0addr = htobe64(fl->ba); } rc = -t4_wr_mbox(sc, sc->mbox, &c, sizeof(c), &c); if (rc != 0) { device_printf(sc->dev, "failed to create ingress queue: %d\n", rc); return (rc); } iq->cidx = 0; iq->gen = F_RSPD_GEN; iq->intr_next = iq->intr_params; iq->cntxt_id = be16toh(c.iqid); iq->abs_id = be16toh(c.physiqid); iq->flags |= IQ_ALLOCATED; cntxt_id = iq->cntxt_id - sc->sge.iq_start; if (cntxt_id >= sc->sge.niq) { panic ("%s: iq->cntxt_id (%d) more than the max (%d)", __func__, cntxt_id, sc->sge.niq - 1); } sc->sge.iqmap[cntxt_id] = iq; if (fl) { u_int qid; iq->flags |= IQ_HAS_FL; fl->cntxt_id = be16toh(c.fl0id); fl->pidx = fl->cidx = 0; cntxt_id = fl->cntxt_id - sc->sge.eq_start; if (cntxt_id >= sc->sge.neq) { panic("%s: fl->cntxt_id (%d) more than the max (%d)", __func__, cntxt_id, sc->sge.neq - 1); } sc->sge.eqmap[cntxt_id] = (void *)fl; qid = fl->cntxt_id; if (isset(&sc->doorbells, DOORBELL_UDB)) { uint32_t s_qpp = sc->params.sge.eq_s_qpp; uint32_t mask = (1 << s_qpp) - 1; volatile uint8_t *udb; udb = sc->udbs_base + UDBS_DB_OFFSET; udb += (qid >> s_qpp) << PAGE_SHIFT; qid &= mask; if (qid < PAGE_SIZE / UDBS_SEG_SIZE) { udb += qid << UDBS_SEG_SHIFT; qid = 0; } fl->udb = (volatile void *)udb; } fl->dbval = V_QID(qid) | sc->chip_params->sge_fl_db; FL_LOCK(fl); /* Enough to make sure the SGE doesn't think it's starved */ refill_fl(sc, fl, fl->lowat); FL_UNLOCK(fl); } if (chip_id(sc) >= CHELSIO_T5 && !(sc->flags & IS_VF) && cong >= 0) { uint32_t param, val; param = V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_DMAQ) | V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_DMAQ_CONM_CTXT) | V_FW_PARAMS_PARAM_YZ(iq->cntxt_id); if (cong == 0) val = 1 << 19; else { val = 2 << 19; for (i = 0; i < 4; i++) { if (cong & (1 << i)) val |= 1 << (i << 2); } } rc = -t4_set_params(sc, sc->mbox, sc->pf, 0, 1, ¶m, &val); if (rc != 0) { /* report error but carry on */ device_printf(sc->dev, "failed to set congestion manager context for " "ingress queue %d: %d\n", iq->cntxt_id, rc); } } /* Enable IQ interrupts */ atomic_store_rel_int(&iq->state, IQS_IDLE); t4_write_reg(sc, sc->sge_gts_reg, V_SEINTARM(iq->intr_params) | V_INGRESSQID(iq->cntxt_id)); return (0); } static int free_iq_fl(struct vi_info *vi, struct sge_iq *iq, struct sge_fl *fl) { int rc; struct adapter *sc = iq->adapter; device_t dev; if (sc == NULL) return (0); /* nothing to do */ dev = vi ? vi->dev : sc->dev; if (iq->flags & IQ_ALLOCATED) { rc = -t4_iq_free(sc, sc->mbox, sc->pf, 0, FW_IQ_TYPE_FL_INT_CAP, iq->cntxt_id, fl ? 
fl->cntxt_id : 0xffff, 0xffff); if (rc != 0) { device_printf(dev, "failed to free queue %p: %d\n", iq, rc); return (rc); } iq->flags &= ~IQ_ALLOCATED; } free_ring(sc, iq->desc_tag, iq->desc_map, iq->ba, iq->desc); bzero(iq, sizeof(*iq)); if (fl) { free_ring(sc, fl->desc_tag, fl->desc_map, fl->ba, fl->desc); if (fl->sdesc) free_fl_sdesc(sc, fl); if (mtx_initialized(&fl->fl_lock)) mtx_destroy(&fl->fl_lock); bzero(fl, sizeof(*fl)); } return (0); } static void add_iq_sysctls(struct sysctl_ctx_list *ctx, struct sysctl_oid *oid, struct sge_iq *iq) { struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UAUTO(ctx, children, OID_AUTO, "ba", CTLFLAG_RD, &iq->ba, "bus address of descriptor ring"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "dmalen", CTLFLAG_RD, NULL, iq->qsize * IQ_ESIZE, "descriptor ring size in bytes"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "abs_id", CTLTYPE_INT | CTLFLAG_RD, &iq->abs_id, 0, sysctl_uint16, "I", "absolute id of the queue"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cntxt_id", CTLTYPE_INT | CTLFLAG_RD, &iq->cntxt_id, 0, sysctl_uint16, "I", "SGE context id of the queue"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cidx", CTLTYPE_INT | CTLFLAG_RD, &iq->cidx, 0, sysctl_uint16, "I", "consumer index"); } static void add_fl_sysctls(struct adapter *sc, struct sysctl_ctx_list *ctx, struct sysctl_oid *oid, struct sge_fl *fl) { struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "fl", CTLFLAG_RD, NULL, "freelist"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UAUTO(ctx, children, OID_AUTO, "ba", CTLFLAG_RD, &fl->ba, "bus address of descriptor ring"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "dmalen", CTLFLAG_RD, NULL, fl->sidx * EQ_ESIZE + sc->params.sge.spg_len, "desc ring size in bytes"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cntxt_id", CTLTYPE_INT | CTLFLAG_RD, &fl->cntxt_id, 0, sysctl_uint16, "I", "SGE context id of the freelist"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "padding", CTLFLAG_RD, NULL, fl_pad ? 1 : 0, "padding enabled"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "packing", CTLFLAG_RD, NULL, fl->flags & FL_BUF_PACKING ? 1 : 0, "packing enabled"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "cidx", CTLFLAG_RD, &fl->cidx, 0, "consumer index"); if (fl->flags & FL_BUF_PACKING) { SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_offset", CTLFLAG_RD, &fl->rx_offset, 0, "packing rx offset"); } SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "pidx", CTLFLAG_RD, &fl->pidx, 0, "producer index"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "mbuf_allocated", CTLFLAG_RD, &fl->mbuf_allocated, "# of mbuf allocated"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "mbuf_inlined", CTLFLAG_RD, &fl->mbuf_inlined, "# of mbuf inlined in clusters"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "cluster_allocated", CTLFLAG_RD, &fl->cl_allocated, "# of clusters allocated"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "cluster_recycled", CTLFLAG_RD, &fl->cl_recycled, "# of clusters recycled"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "cluster_fast_recycled", CTLFLAG_RD, &fl->cl_fast_recycled, "# of clusters recycled (fast)"); } static int alloc_fwq(struct adapter *sc) { int rc, intr_idx; struct sge_iq *fwq = &sc->sge.fwq; struct sysctl_oid *oid = device_get_sysctl_tree(sc->dev); struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); init_iq(fwq, sc, 0, 0, FW_IQ_QSIZE); if (sc->flags & IS_VF) intr_idx = 0; else intr_idx = sc->intr_count > 1 ? 
1 : 0; rc = alloc_iq_fl(&sc->port[0]->vi[0], fwq, NULL, intr_idx, -1); if (rc != 0) { device_printf(sc->dev, "failed to create firmware event queue: %d\n", rc); return (rc); } oid = SYSCTL_ADD_NODE(&sc->ctx, children, OID_AUTO, "fwq", CTLFLAG_RD, NULL, "firmware event queue"); add_iq_sysctls(&sc->ctx, oid, fwq); return (0); } static int free_fwq(struct adapter *sc) { return free_iq_fl(NULL, &sc->sge.fwq, NULL); } static int alloc_ctrlq(struct adapter *sc, struct sge_wrq *ctrlq, int idx, struct sysctl_oid *oid) { int rc; char name[16]; struct sysctl_oid_list *children; snprintf(name, sizeof(name), "%s ctrlq%d", device_get_nameunit(sc->dev), idx); init_eq(sc, &ctrlq->eq, EQ_CTRL, CTRL_EQ_QSIZE, sc->port[idx]->tx_chan, sc->sge.fwq.cntxt_id, name); children = SYSCTL_CHILDREN(oid); snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(&sc->ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "ctrl queue"); rc = alloc_wrq(sc, NULL, ctrlq, oid); return (rc); } int tnl_cong(struct port_info *pi, int drop) { if (drop == -1) return (-1); else if (drop == 1) return (0); else return (pi->rx_e_chan_map); } static int alloc_rxq(struct vi_info *vi, struct sge_rxq *rxq, int intr_idx, int idx, struct sysctl_oid *oid) { int rc; struct adapter *sc = vi->pi->adapter; struct sysctl_oid_list *children; char name[16]; rc = alloc_iq_fl(vi, &rxq->iq, &rxq->fl, intr_idx, tnl_cong(vi->pi, cong_drop)); if (rc != 0) return (rc); if (idx == 0) sc->sge.iq_base = rxq->iq.abs_id - rxq->iq.cntxt_id; else KASSERT(rxq->iq.cntxt_id + sc->sge.iq_base == rxq->iq.abs_id, ("iq_base mismatch")); KASSERT(sc->sge.iq_base == 0 || sc->flags & IS_VF, ("PF with non-zero iq_base")); /* * The freelist is just barely above the starvation threshold right now, * fill it up a bit more. */ FL_LOCK(&rxq->fl); refill_fl(sc, &rxq->fl, 128); FL_UNLOCK(&rxq->fl); #if defined(INET) || defined(INET6) rc = tcp_lro_init_args(&rxq->lro, vi->ifp, lro_entries, lro_mbufs); if (rc != 0) return (rc); MPASS(rxq->lro.ifp == vi->ifp); /* also indicates LRO init'ed */ if (vi->ifp->if_capenable & IFCAP_LRO) rxq->iq.flags |= IQ_LRO_ENABLED; #endif if (vi->ifp->if_capenable & IFCAP_HWRXTSTMP) rxq->iq.flags |= IQ_RX_TIMESTAMP; rxq->ifp = vi->ifp; children = SYSCTL_CHILDREN(oid); snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "rx queue"); children = SYSCTL_CHILDREN(oid); add_iq_sysctls(&vi->ctx, oid, &rxq->iq); #if defined(INET) || defined(INET6) SYSCTL_ADD_U64(&vi->ctx, children, OID_AUTO, "lro_queued", CTLFLAG_RD, &rxq->lro.lro_queued, 0, NULL); SYSCTL_ADD_U64(&vi->ctx, children, OID_AUTO, "lro_flushed", CTLFLAG_RD, &rxq->lro.lro_flushed, 0, NULL); #endif SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "rxcsum", CTLFLAG_RD, &rxq->rxcsum, "# of times hardware assisted with checksum"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "vlan_extraction", CTLFLAG_RD, &rxq->vlan_extraction, "# of times hardware extracted 802.1Q tag"); add_fl_sysctls(sc, &vi->ctx, oid, &rxq->fl); return (rc); } static int free_rxq(struct vi_info *vi, struct sge_rxq *rxq) { int rc; #if defined(INET) || defined(INET6) if (rxq->lro.ifp) { tcp_lro_free(&rxq->lro); rxq->lro.ifp = NULL; } #endif rc = free_iq_fl(vi, &rxq->iq, &rxq->fl); if (rc == 0) bzero(rxq, sizeof(*rxq)); return (rc); } #ifdef TCP_OFFLOAD static int alloc_ofld_rxq(struct vi_info *vi, struct sge_ofld_rxq *ofld_rxq, int intr_idx, int idx, struct sysctl_oid *oid) { struct port_info *pi = vi->pi; int rc; struct sysctl_oid_list *children; char name[16]; rc = 
alloc_iq_fl(vi, &ofld_rxq->iq, &ofld_rxq->fl, intr_idx, 0); if (rc != 0) return (rc); children = SYSCTL_CHILDREN(oid); snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "rx queue"); add_iq_sysctls(&vi->ctx, oid, &ofld_rxq->iq); add_fl_sysctls(pi->adapter, &vi->ctx, oid, &ofld_rxq->fl); return (rc); } static int free_ofld_rxq(struct vi_info *vi, struct sge_ofld_rxq *ofld_rxq) { int rc; rc = free_iq_fl(vi, &ofld_rxq->iq, &ofld_rxq->fl); if (rc == 0) bzero(ofld_rxq, sizeof(*ofld_rxq)); return (rc); } #endif #ifdef DEV_NETMAP static int alloc_nm_rxq(struct vi_info *vi, struct sge_nm_rxq *nm_rxq, int intr_idx, int idx, struct sysctl_oid *oid) { int rc; struct sysctl_oid_list *children; struct sysctl_ctx_list *ctx; char name[16]; size_t len; struct adapter *sc = vi->pi->adapter; struct netmap_adapter *na = NA(vi->ifp); MPASS(na != NULL); len = vi->qsize_rxq * IQ_ESIZE; rc = alloc_ring(sc, len, &nm_rxq->iq_desc_tag, &nm_rxq->iq_desc_map, &nm_rxq->iq_ba, (void **)&nm_rxq->iq_desc); if (rc != 0) return (rc); len = na->num_rx_desc * EQ_ESIZE + sc->params.sge.spg_len; rc = alloc_ring(sc, len, &nm_rxq->fl_desc_tag, &nm_rxq->fl_desc_map, &nm_rxq->fl_ba, (void **)&nm_rxq->fl_desc); if (rc != 0) return (rc); nm_rxq->vi = vi; nm_rxq->nid = idx; nm_rxq->iq_cidx = 0; nm_rxq->iq_sidx = vi->qsize_rxq - sc->params.sge.spg_len / IQ_ESIZE; nm_rxq->iq_gen = F_RSPD_GEN; nm_rxq->fl_pidx = nm_rxq->fl_cidx = 0; nm_rxq->fl_sidx = na->num_rx_desc; nm_rxq->intr_idx = intr_idx; nm_rxq->iq_cntxt_id = INVALID_NM_RXQ_CNTXT_ID; ctx = &vi->ctx; children = SYSCTL_CHILDREN(oid); snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "rx queue"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "abs_id", CTLTYPE_INT | CTLFLAG_RD, &nm_rxq->iq_abs_id, 0, sysctl_uint16, "I", "absolute id of the queue"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cntxt_id", CTLTYPE_INT | CTLFLAG_RD, &nm_rxq->iq_cntxt_id, 0, sysctl_uint16, "I", "SGE context id of the queue"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cidx", CTLTYPE_INT | CTLFLAG_RD, &nm_rxq->iq_cidx, 0, sysctl_uint16, "I", "consumer index"); children = SYSCTL_CHILDREN(oid); oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "fl", CTLFLAG_RD, NULL, "freelist"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cntxt_id", CTLTYPE_INT | CTLFLAG_RD, &nm_rxq->fl_cntxt_id, 0, sysctl_uint16, "I", "SGE context id of the freelist"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "cidx", CTLFLAG_RD, &nm_rxq->fl_cidx, 0, "consumer index"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "pidx", CTLFLAG_RD, &nm_rxq->fl_pidx, 0, "producer index"); return (rc); } static int free_nm_rxq(struct vi_info *vi, struct sge_nm_rxq *nm_rxq) { struct adapter *sc = vi->pi->adapter; if (vi->flags & VI_INIT_DONE) MPASS(nm_rxq->iq_cntxt_id == INVALID_NM_RXQ_CNTXT_ID); else MPASS(nm_rxq->iq_cntxt_id == 0); free_ring(sc, nm_rxq->iq_desc_tag, nm_rxq->iq_desc_map, nm_rxq->iq_ba, nm_rxq->iq_desc); free_ring(sc, nm_rxq->fl_desc_tag, nm_rxq->fl_desc_map, nm_rxq->fl_ba, nm_rxq->fl_desc); return (0); } static int alloc_nm_txq(struct vi_info *vi, struct sge_nm_txq *nm_txq, int iqidx, int idx, struct sysctl_oid *oid) { int rc; size_t len; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct netmap_adapter *na = NA(vi->ifp); char name[16]; struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); len = na->num_tx_desc * EQ_ESIZE + 
sc->params.sge.spg_len; rc = alloc_ring(sc, len, &nm_txq->desc_tag, &nm_txq->desc_map, &nm_txq->ba, (void **)&nm_txq->desc); if (rc) return (rc); nm_txq->pidx = nm_txq->cidx = 0; nm_txq->sidx = na->num_tx_desc; nm_txq->nid = idx; nm_txq->iqidx = iqidx; nm_txq->cpl_ctrl0 = htobe32(V_TXPKT_OPCODE(CPL_TX_PKT) | V_TXPKT_INTF(pi->tx_chan) | V_TXPKT_PF(sc->pf) | V_TXPKT_VF(vi->vin) | V_TXPKT_VF_VLD(vi->vfvld)); nm_txq->cntxt_id = INVALID_NM_TXQ_CNTXT_ID; snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "netmap tx queue"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UINT(&vi->ctx, children, OID_AUTO, "cntxt_id", CTLFLAG_RD, &nm_txq->cntxt_id, 0, "SGE context id of the queue"); SYSCTL_ADD_PROC(&vi->ctx, children, OID_AUTO, "cidx", CTLTYPE_INT | CTLFLAG_RD, &nm_txq->cidx, 0, sysctl_uint16, "I", "consumer index"); SYSCTL_ADD_PROC(&vi->ctx, children, OID_AUTO, "pidx", CTLTYPE_INT | CTLFLAG_RD, &nm_txq->pidx, 0, sysctl_uint16, "I", "producer index"); return (rc); } static int free_nm_txq(struct vi_info *vi, struct sge_nm_txq *nm_txq) { struct adapter *sc = vi->pi->adapter; if (vi->flags & VI_INIT_DONE) MPASS(nm_txq->cntxt_id == INVALID_NM_TXQ_CNTXT_ID); else MPASS(nm_txq->cntxt_id == 0); free_ring(sc, nm_txq->desc_tag, nm_txq->desc_map, nm_txq->ba, nm_txq->desc); return (0); } #endif /* * Returns a reasonable automatic cidx flush threshold for a given queue size. */ static u_int qsize_to_fthresh(int qsize) { u_int fthresh; while (!powerof2(qsize)) qsize++; fthresh = ilog2(qsize); if (fthresh > X_CIDXFLUSHTHRESH_128) fthresh = X_CIDXFLUSHTHRESH_128; return (fthresh); } static int ctrl_eq_alloc(struct adapter *sc, struct sge_eq *eq) { int rc, cntxt_id; struct fw_eq_ctrl_cmd c; int qsize = eq->sidx + sc->params.sge.spg_len / EQ_ESIZE; bzero(&c, sizeof(c)); c.op_to_vfn = htobe32(V_FW_CMD_OP(FW_EQ_CTRL_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_WRITE | F_FW_CMD_EXEC | V_FW_EQ_CTRL_CMD_PFN(sc->pf) | V_FW_EQ_CTRL_CMD_VFN(0)); c.alloc_to_len16 = htobe32(F_FW_EQ_CTRL_CMD_ALLOC | F_FW_EQ_CTRL_CMD_EQSTART | FW_LEN16(c)); c.cmpliqid_eqid = htonl(V_FW_EQ_CTRL_CMD_CMPLIQID(eq->iqid)); c.physeqid_pkd = htobe32(0); c.fetchszm_to_iqid = htobe32(V_FW_EQ_CTRL_CMD_HOSTFCMODE(X_HOSTFCMODE_STATUS_PAGE) | V_FW_EQ_CTRL_CMD_PCIECHN(eq->tx_chan) | F_FW_EQ_CTRL_CMD_FETCHRO | V_FW_EQ_CTRL_CMD_IQID(eq->iqid)); c.dcaen_to_eqsize = htobe32(V_FW_EQ_CTRL_CMD_FBMIN(X_FETCHBURSTMIN_64B) | V_FW_EQ_CTRL_CMD_FBMAX(X_FETCHBURSTMAX_512B) | V_FW_EQ_CTRL_CMD_CIDXFTHRESH(qsize_to_fthresh(qsize)) | V_FW_EQ_CTRL_CMD_EQSIZE(qsize)); c.eqaddr = htobe64(eq->ba); rc = -t4_wr_mbox(sc, sc->mbox, &c, sizeof(c), &c); if (rc != 0) { device_printf(sc->dev, "failed to create control queue %d: %d\n", eq->tx_chan, rc); return (rc); } eq->flags |= EQ_ALLOCATED; eq->cntxt_id = G_FW_EQ_CTRL_CMD_EQID(be32toh(c.cmpliqid_eqid)); cntxt_id = eq->cntxt_id - sc->sge.eq_start; if (cntxt_id >= sc->sge.neq) panic("%s: eq->cntxt_id (%d) more than the max (%d)", __func__, cntxt_id, sc->sge.neq - 1); sc->sge.eqmap[cntxt_id] = eq; return (rc); } static int eth_eq_alloc(struct adapter *sc, struct vi_info *vi, struct sge_eq *eq) { int rc, cntxt_id; struct fw_eq_eth_cmd c; int qsize = eq->sidx + sc->params.sge.spg_len / EQ_ESIZE; bzero(&c, sizeof(c)); c.op_to_vfn = htobe32(V_FW_CMD_OP(FW_EQ_ETH_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_WRITE | F_FW_CMD_EXEC | V_FW_EQ_ETH_CMD_PFN(sc->pf) | V_FW_EQ_ETH_CMD_VFN(0)); c.alloc_to_len16 = htobe32(F_FW_EQ_ETH_CMD_ALLOC | F_FW_EQ_ETH_CMD_EQSTART | FW_LEN16(c)); c.autoequiqe_to_viid = 
htobe32(F_FW_EQ_ETH_CMD_AUTOEQUIQE | F_FW_EQ_ETH_CMD_AUTOEQUEQE | V_FW_EQ_ETH_CMD_VIID(vi->viid)); c.fetchszm_to_iqid = htobe32(V_FW_EQ_ETH_CMD_HOSTFCMODE(X_HOSTFCMODE_NONE) | V_FW_EQ_ETH_CMD_PCIECHN(eq->tx_chan) | F_FW_EQ_ETH_CMD_FETCHRO | V_FW_EQ_ETH_CMD_IQID(eq->iqid)); c.dcaen_to_eqsize = htobe32(V_FW_EQ_ETH_CMD_FBMIN(X_FETCHBURSTMIN_64B) | V_FW_EQ_ETH_CMD_FBMAX(X_FETCHBURSTMAX_512B) | V_FW_EQ_ETH_CMD_EQSIZE(qsize)); c.eqaddr = htobe64(eq->ba); rc = -t4_wr_mbox(sc, sc->mbox, &c, sizeof(c), &c); if (rc != 0) { device_printf(vi->dev, "failed to create Ethernet egress queue: %d\n", rc); return (rc); } eq->flags |= EQ_ALLOCATED; eq->cntxt_id = G_FW_EQ_ETH_CMD_EQID(be32toh(c.eqid_pkd)); eq->abs_id = G_FW_EQ_ETH_CMD_PHYSEQID(be32toh(c.physeqid_pkd)); cntxt_id = eq->cntxt_id - sc->sge.eq_start; if (cntxt_id >= sc->sge.neq) panic("%s: eq->cntxt_id (%d) more than the max (%d)", __func__, cntxt_id, sc->sge.neq - 1); sc->sge.eqmap[cntxt_id] = eq; return (rc); } #if defined(TCP_OFFLOAD) || defined(RATELIMIT) static int ofld_eq_alloc(struct adapter *sc, struct vi_info *vi, struct sge_eq *eq) { int rc, cntxt_id; struct fw_eq_ofld_cmd c; int qsize = eq->sidx + sc->params.sge.spg_len / EQ_ESIZE; bzero(&c, sizeof(c)); c.op_to_vfn = htonl(V_FW_CMD_OP(FW_EQ_OFLD_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_WRITE | F_FW_CMD_EXEC | V_FW_EQ_OFLD_CMD_PFN(sc->pf) | V_FW_EQ_OFLD_CMD_VFN(0)); c.alloc_to_len16 = htonl(F_FW_EQ_OFLD_CMD_ALLOC | F_FW_EQ_OFLD_CMD_EQSTART | FW_LEN16(c)); c.fetchszm_to_iqid = htonl(V_FW_EQ_OFLD_CMD_HOSTFCMODE(X_HOSTFCMODE_STATUS_PAGE) | V_FW_EQ_OFLD_CMD_PCIECHN(eq->tx_chan) | F_FW_EQ_OFLD_CMD_FETCHRO | V_FW_EQ_OFLD_CMD_IQID(eq->iqid)); c.dcaen_to_eqsize = htobe32(V_FW_EQ_OFLD_CMD_FBMIN(X_FETCHBURSTMIN_64B) | V_FW_EQ_OFLD_CMD_FBMAX(X_FETCHBURSTMAX_512B) | V_FW_EQ_OFLD_CMD_CIDXFTHRESH(qsize_to_fthresh(qsize)) | V_FW_EQ_OFLD_CMD_EQSIZE(qsize)); c.eqaddr = htobe64(eq->ba); rc = -t4_wr_mbox(sc, sc->mbox, &c, sizeof(c), &c); if (rc != 0) { device_printf(vi->dev, "failed to create egress queue for TCP offload: %d\n", rc); return (rc); } eq->flags |= EQ_ALLOCATED; eq->cntxt_id = G_FW_EQ_OFLD_CMD_EQID(be32toh(c.eqid_pkd)); cntxt_id = eq->cntxt_id - sc->sge.eq_start; if (cntxt_id >= sc->sge.neq) panic("%s: eq->cntxt_id (%d) more than the max (%d)", __func__, cntxt_id, sc->sge.neq - 1); sc->sge.eqmap[cntxt_id] = eq; return (rc); } #endif static int alloc_eq(struct adapter *sc, struct vi_info *vi, struct sge_eq *eq) { int rc, qsize; size_t len; mtx_init(&eq->eq_lock, eq->lockname, NULL, MTX_DEF); qsize = eq->sidx + sc->params.sge.spg_len / EQ_ESIZE; len = qsize * EQ_ESIZE; rc = alloc_ring(sc, len, &eq->desc_tag, &eq->desc_map, &eq->ba, (void **)&eq->desc); if (rc) return (rc); eq->pidx = eq->cidx = eq->dbidx = 0; /* Note that equeqidx is not used with sge_wrq (OFLD/CTRL) queues. 
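It records the pidx at which the last egress update request (EQUEQ) was posted, which the Ethernet tx path uses to decide when to ask the hardware for another update.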
*/ eq->equeqidx = 0; eq->doorbells = sc->doorbells; switch (eq->flags & EQ_TYPEMASK) { case EQ_CTRL: rc = ctrl_eq_alloc(sc, eq); break; case EQ_ETH: rc = eth_eq_alloc(sc, vi, eq); break; #if defined(TCP_OFFLOAD) || defined(RATELIMIT) case EQ_OFLD: rc = ofld_eq_alloc(sc, vi, eq); break; #endif default: panic("%s: invalid eq type %d.", __func__, eq->flags & EQ_TYPEMASK); } if (rc != 0) { device_printf(sc->dev, "failed to allocate egress queue(%d): %d\n", eq->flags & EQ_TYPEMASK, rc); } if (isset(&eq->doorbells, DOORBELL_UDB) || isset(&eq->doorbells, DOORBELL_UDBWC) || isset(&eq->doorbells, DOORBELL_WCWR)) { uint32_t s_qpp = sc->params.sge.eq_s_qpp; uint32_t mask = (1 << s_qpp) - 1; volatile uint8_t *udb; udb = sc->udbs_base + UDBS_DB_OFFSET; udb += (eq->cntxt_id >> s_qpp) << PAGE_SHIFT; /* pg offset */ eq->udb_qid = eq->cntxt_id & mask; /* id in page */ if (eq->udb_qid >= PAGE_SIZE / UDBS_SEG_SIZE) clrbit(&eq->doorbells, DOORBELL_WCWR); else { udb += eq->udb_qid << UDBS_SEG_SHIFT; /* seg offset */ eq->udb_qid = 0; } eq->udb = (volatile void *)udb; } return (rc); } static int free_eq(struct adapter *sc, struct sge_eq *eq) { int rc; if (eq->flags & EQ_ALLOCATED) { switch (eq->flags & EQ_TYPEMASK) { case EQ_CTRL: rc = -t4_ctrl_eq_free(sc, sc->mbox, sc->pf, 0, eq->cntxt_id); break; case EQ_ETH: rc = -t4_eth_eq_free(sc, sc->mbox, sc->pf, 0, eq->cntxt_id); break; #if defined(TCP_OFFLOAD) || defined(RATELIMIT) case EQ_OFLD: rc = -t4_ofld_eq_free(sc, sc->mbox, sc->pf, 0, eq->cntxt_id); break; #endif default: panic("%s: invalid eq type %d.", __func__, eq->flags & EQ_TYPEMASK); } if (rc != 0) { device_printf(sc->dev, "failed to free egress queue (%d): %d\n", eq->flags & EQ_TYPEMASK, rc); return (rc); } eq->flags &= ~EQ_ALLOCATED; } free_ring(sc, eq->desc_tag, eq->desc_map, eq->ba, eq->desc); if (mtx_initialized(&eq->eq_lock)) mtx_destroy(&eq->eq_lock); bzero(eq, sizeof(*eq)); return (0); } static int alloc_wrq(struct adapter *sc, struct vi_info *vi, struct sge_wrq *wrq, struct sysctl_oid *oid) { int rc; struct sysctl_ctx_list *ctx = vi ? 
&vi->ctx : &sc->ctx; struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); rc = alloc_eq(sc, vi, &wrq->eq); if (rc) return (rc); wrq->adapter = sc; TASK_INIT(&wrq->wrq_tx_task, 0, wrq_tx_drain, wrq); TAILQ_INIT(&wrq->incomplete_wrs); STAILQ_INIT(&wrq->wr_list); wrq->nwr_pending = 0; wrq->ndesc_needed = 0; SYSCTL_ADD_UAUTO(ctx, children, OID_AUTO, "ba", CTLFLAG_RD, &wrq->eq.ba, "bus address of descriptor ring"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "dmalen", CTLFLAG_RD, NULL, wrq->eq.sidx * EQ_ESIZE + sc->params.sge.spg_len, "desc ring size in bytes"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "cntxt_id", CTLFLAG_RD, &wrq->eq.cntxt_id, 0, "SGE context id of the queue"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cidx", CTLTYPE_INT | CTLFLAG_RD, &wrq->eq.cidx, 0, sysctl_uint16, "I", "consumer index"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "pidx", CTLTYPE_INT | CTLFLAG_RD, &wrq->eq.pidx, 0, sysctl_uint16, "I", "producer index"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "sidx", CTLFLAG_RD, NULL, wrq->eq.sidx, "status page index"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "tx_wrs_direct", CTLFLAG_RD, &wrq->tx_wrs_direct, "# of work requests (direct)"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "tx_wrs_copied", CTLFLAG_RD, &wrq->tx_wrs_copied, "# of work requests (copied)"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "tx_wrs_sspace", CTLFLAG_RD, &wrq->tx_wrs_ss, "# of work requests (copied from scratch space)"); return (rc); } static int free_wrq(struct adapter *sc, struct sge_wrq *wrq) { int rc; rc = free_eq(sc, &wrq->eq); if (rc) return (rc); bzero(wrq, sizeof(*wrq)); return (0); } static int alloc_txq(struct vi_info *vi, struct sge_txq *txq, int idx, struct sysctl_oid *oid) { int rc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct sge_eq *eq = &txq->eq; char name[16]; struct sysctl_oid_list *children = SYSCTL_CHILDREN(oid); rc = mp_ring_alloc(&txq->r, eq->sidx, txq, eth_tx, can_resume_eth_tx, M_CXGBE, M_WAITOK); if (rc != 0) { device_printf(sc->dev, "failed to allocate mp_ring: %d\n", rc); return (rc); } rc = alloc_eq(sc, vi, eq); if (rc != 0) { mp_ring_free(txq->r); txq->r = NULL; return (rc); } /* Can't fail after this point. 
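The remaining allocations use M_WAITOK and the sysctl additions are not error-checked, so the function returns 0 from here on.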
*/ if (idx == 0) sc->sge.eq_base = eq->abs_id - eq->cntxt_id; else KASSERT(eq->cntxt_id + sc->sge.eq_base == eq->abs_id, ("eq_base mismatch")); KASSERT(sc->sge.eq_base == 0 || sc->flags & IS_VF, ("PF with non-zero eq_base")); TASK_INIT(&txq->tx_reclaim_task, 0, tx_reclaim, eq); txq->ifp = vi->ifp; txq->gl = sglist_alloc(TX_SGL_SEGS, M_WAITOK); if (sc->flags & IS_VF) txq->cpl_ctrl0 = htobe32(V_TXPKT_OPCODE(CPL_TX_PKT_XT) | V_TXPKT_INTF(pi->tx_chan)); else txq->cpl_ctrl0 = htobe32(V_TXPKT_OPCODE(CPL_TX_PKT) | V_TXPKT_INTF(pi->tx_chan) | V_TXPKT_PF(sc->pf) | V_TXPKT_VF(vi->vin) | V_TXPKT_VF_VLD(vi->vfvld)); txq->tc_idx = -1; txq->sdesc = malloc(eq->sidx * sizeof(struct tx_sdesc), M_CXGBE, M_ZERO | M_WAITOK); snprintf(name, sizeof(name), "%d", idx); oid = SYSCTL_ADD_NODE(&vi->ctx, children, OID_AUTO, name, CTLFLAG_RD, NULL, "tx queue"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UAUTO(&vi->ctx, children, OID_AUTO, "ba", CTLFLAG_RD, &eq->ba, "bus address of descriptor ring"); SYSCTL_ADD_INT(&vi->ctx, children, OID_AUTO, "dmalen", CTLFLAG_RD, NULL, eq->sidx * EQ_ESIZE + sc->params.sge.spg_len, "desc ring size in bytes"); SYSCTL_ADD_UINT(&vi->ctx, children, OID_AUTO, "abs_id", CTLFLAG_RD, &eq->abs_id, 0, "absolute id of the queue"); SYSCTL_ADD_UINT(&vi->ctx, children, OID_AUTO, "cntxt_id", CTLFLAG_RD, &eq->cntxt_id, 0, "SGE context id of the queue"); SYSCTL_ADD_PROC(&vi->ctx, children, OID_AUTO, "cidx", CTLTYPE_INT | CTLFLAG_RD, &eq->cidx, 0, sysctl_uint16, "I", "consumer index"); SYSCTL_ADD_PROC(&vi->ctx, children, OID_AUTO, "pidx", CTLTYPE_INT | CTLFLAG_RD, &eq->pidx, 0, sysctl_uint16, "I", "producer index"); SYSCTL_ADD_INT(&vi->ctx, children, OID_AUTO, "sidx", CTLFLAG_RD, NULL, eq->sidx, "status page index"); SYSCTL_ADD_PROC(&vi->ctx, children, OID_AUTO, "tc", CTLTYPE_INT | CTLFLAG_RW, vi, idx, sysctl_tc, "I", "traffic class (-1 means none)"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txcsum", CTLFLAG_RD, &txq->txcsum, "# of times hardware assisted with checksum"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "vlan_insertion", CTLFLAG_RD, &txq->vlan_insertion, "# of times hardware inserted 802.1Q tag"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "tso_wrs", CTLFLAG_RD, &txq->tso_wrs, "# of TSO work requests"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "imm_wrs", CTLFLAG_RD, &txq->imm_wrs, "# of work requests with immediate data"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "sgl_wrs", CTLFLAG_RD, &txq->sgl_wrs, "# of work requests with direct SGL"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txpkt_wrs", CTLFLAG_RD, &txq->txpkt_wrs, "# of txpkt work requests (one pkt/WR)"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txpkts0_wrs", CTLFLAG_RD, &txq->txpkts0_wrs, "# of txpkts (type 0) work requests"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txpkts1_wrs", CTLFLAG_RD, &txq->txpkts1_wrs, "# of txpkts (type 1) work requests"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txpkts0_pkts", CTLFLAG_RD, &txq->txpkts0_pkts, "# of frames tx'd using type0 txpkts work requests"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "txpkts1_pkts", CTLFLAG_RD, &txq->txpkts1_pkts, "# of frames tx'd using type1 txpkts work requests"); SYSCTL_ADD_UQUAD(&vi->ctx, children, OID_AUTO, "raw_wrs", CTLFLAG_RD, &txq->raw_wrs, "# of raw work requests (non-packets)"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_enqueues", CTLFLAG_RD, &txq->r->enqueues, "# of enqueues to the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_drops", CTLFLAG_RD, 
&txq->r->drops, "# of drops in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_starts", CTLFLAG_RD, &txq->r->starts, "# of normal consumer starts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_stalls", CTLFLAG_RD, &txq->r->stalls, "# of consumer stalls in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_restarts", CTLFLAG_RD, &txq->r->restarts, "# of consumer restarts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(&vi->ctx, children, OID_AUTO, "r_abdications", CTLFLAG_RD, &txq->r->abdications, "# of consumer abdications in the mp_ring for this queue"); return (0); } static int free_txq(struct vi_info *vi, struct sge_txq *txq) { int rc; struct adapter *sc = vi->pi->adapter; struct sge_eq *eq = &txq->eq; rc = free_eq(sc, eq); if (rc) return (rc); sglist_free(txq->gl); free(txq->sdesc, M_CXGBE); mp_ring_free(txq->r); bzero(txq, sizeof(*txq)); return (0); } static void oneseg_dma_callback(void *arg, bus_dma_segment_t *segs, int nseg, int error) { bus_addr_t *ba = arg; KASSERT(nseg == 1, ("%s meant for single segment mappings only.", __func__)); *ba = error ? 0 : segs->ds_addr; } static inline void ring_fl_db(struct adapter *sc, struct sge_fl *fl) { uint32_t n, v; n = IDXDIFF(fl->pidx / 8, fl->dbidx, fl->sidx); MPASS(n > 0); wmb(); v = fl->dbval | V_PIDX(n); if (fl->udb) *fl->udb = htole32(v); else t4_write_reg(sc, sc->sge_kdoorbell_reg, v); IDXINCR(fl->dbidx, n, fl->sidx); } /* * Fills up the freelist by allocating up to 'n' buffers. Buffers that are * recycled do not count towards this allocation budget. * * Returns non-zero to indicate that this freelist should be added to the list * of starving freelists. */ static int refill_fl(struct adapter *sc, struct sge_fl *fl, int n) { __be64 *d; struct fl_sdesc *sd; uintptr_t pa; caddr_t cl; struct cluster_layout *cll; struct sw_zone_info *swz; struct cluster_metadata *clm; uint16_t max_pidx; uint16_t hw_cidx = fl->hw_cidx; /* stable snapshot */ FL_LOCK_ASSERT_OWNED(fl); /* * We always stop at the beginning of the hardware descriptor that's just * before the one with the hw cidx. This is to avoid hw pidx = hw cidx, * which would mean an empty freelist to the chip. */ max_pidx = __predict_false(hw_cidx == 0) ? fl->sidx - 1 : hw_cidx - 1; if (fl->pidx == max_pidx * 8) return (0); d = &fl->desc[fl->pidx]; sd = &fl->sdesc[fl->pidx]; cll = &fl->cll_def; /* default layout */ swz = &sc->sge.sw_zone_info[cll->zidx]; while (n > 0) { if (sd->cl != NULL) { if (sd->nmbuf == 0) { /* * Fast recycle without involving any atomics on * the cluster's metadata (if the cluster has * metadata). This happens when all frames * received in the cluster were small enough to * fit within a single mbuf each. */ fl->cl_fast_recycled++; #ifdef INVARIANTS clm = cl_metadata(sc, fl, &sd->cll, sd->cl); if (clm != NULL) MPASS(clm->refcount == 1); #endif goto recycled_fast; } /* * Cluster is guaranteed to have metadata. Clusters * without metadata always take the fast recycle path * when they're recycled. 
*/ clm = cl_metadata(sc, fl, &sd->cll, sd->cl); MPASS(clm != NULL); if (atomic_fetchadd_int(&clm->refcount, -1) == 1) { fl->cl_recycled++; counter_u64_add(extfree_rels, 1); goto recycled; } sd->cl = NULL; /* gave up my reference */ } MPASS(sd->cl == NULL); alloc: cl = uma_zalloc(swz->zone, M_NOWAIT); if (__predict_false(cl == NULL)) { if (cll == &fl->cll_alt || fl->cll_alt.zidx == -1 || fl->cll_def.zidx == fl->cll_alt.zidx) break; /* fall back to the safe zone */ cll = &fl->cll_alt; swz = &sc->sge.sw_zone_info[cll->zidx]; goto alloc; } fl->cl_allocated++; n--; pa = pmap_kextract((vm_offset_t)cl); pa += cll->region1; sd->cl = cl; sd->cll = *cll; *d = htobe64(pa | cll->hwidx); clm = cl_metadata(sc, fl, cll, cl); if (clm != NULL) { recycled: #ifdef INVARIANTS clm->sd = sd; #endif clm->refcount = 1; } sd->nmbuf = 0; recycled_fast: d++; sd++; if (__predict_false(++fl->pidx % 8 == 0)) { uint16_t pidx = fl->pidx / 8; if (__predict_false(pidx == fl->sidx)) { fl->pidx = 0; pidx = 0; sd = fl->sdesc; d = fl->desc; } if (pidx == max_pidx) break; if (IDXDIFF(pidx, fl->dbidx, fl->sidx) >= 4) ring_fl_db(sc, fl); } } if (fl->pidx / 8 != fl->dbidx) ring_fl_db(sc, fl); return (FL_RUNNING_LOW(fl) && !(fl->flags & FL_STARVING)); } /* * Attempt to refill all starving freelists. */ static void refill_sfl(void *arg) { struct adapter *sc = arg; struct sge_fl *fl, *fl_temp; mtx_assert(&sc->sfl_lock, MA_OWNED); TAILQ_FOREACH_SAFE(fl, &sc->sfl, link, fl_temp) { FL_LOCK(fl); refill_fl(sc, fl, 64); if (FL_NOT_RUNNING_LOW(fl) || fl->flags & FL_DOOMED) { TAILQ_REMOVE(&sc->sfl, fl, link); fl->flags &= ~FL_STARVING; } FL_UNLOCK(fl); } if (!TAILQ_EMPTY(&sc->sfl)) callout_schedule(&sc->sfl_callout, hz / 5); } static int alloc_fl_sdesc(struct sge_fl *fl) { fl->sdesc = malloc(fl->sidx * 8 * sizeof(struct fl_sdesc), M_CXGBE, M_ZERO | M_WAITOK); return (0); } static void free_fl_sdesc(struct adapter *sc, struct sge_fl *fl) { struct fl_sdesc *sd; struct cluster_metadata *clm; struct cluster_layout *cll; int i; sd = fl->sdesc; for (i = 0; i < fl->sidx * 8; i++, sd++) { if (sd->cl == NULL) continue; cll = &sd->cll; clm = cl_metadata(sc, fl, cll, sd->cl); if (sd->nmbuf == 0) uma_zfree(sc->sge.sw_zone_info[cll->zidx].zone, sd->cl); else if (clm && atomic_fetchadd_int(&clm->refcount, -1) == 1) { uma_zfree(sc->sge.sw_zone_info[cll->zidx].zone, sd->cl); counter_u64_add(extfree_rels, 1); } sd->cl = NULL; } free(fl->sdesc, M_CXGBE); fl->sdesc = NULL; } static inline void get_pkt_gl(struct mbuf *m, struct sglist *gl) { int rc; M_ASSERTPKTHDR(m); sglist_reset(gl); rc = sglist_append_mbuf(gl, m); if (__predict_false(rc != 0)) { panic("%s: mbuf %p (%d segs) was vetted earlier but now fails " "with %d.", __func__, m, mbuf_nsegs(m), rc); } KASSERT(gl->sg_nseg == mbuf_nsegs(m), ("%s: nsegs changed for mbuf %p from %d to %d", __func__, m, mbuf_nsegs(m), gl->sg_nseg)); KASSERT(gl->sg_nseg > 0 && gl->sg_nseg <= (needs_tso(m) ? TX_SGL_SEGS_TSO : TX_SGL_SEGS), ("%s: %d segments, should have been 1 <= nsegs <= %d", __func__, gl->sg_nseg, needs_tso(m) ? TX_SGL_SEGS_TSO : TX_SGL_SEGS)); } /* * len16 for a txpkt WR with a GL. Includes the firmware work request header. */ static inline u_int txpkt_len16(u_int nsegs, u_int tso) { u_int n; MPASS(nsegs > 0); nsegs--; /* first segment is part of ulptx_sgl */ n = sizeof(struct fw_eth_tx_pkt_wr) + sizeof(struct cpl_tx_pkt_core) + sizeof(struct ulptx_sgl) + 8 * ((3 * nsegs) / 2 + (nsegs & 1)); if (tso) n += sizeof(struct cpl_tx_pkt_lso_core); return (howmany(n, 16)); } /* * len16 for a txpkt_vm WR with a GL. 
Includes the firmware work * request header. */ static inline u_int txpkt_vm_len16(u_int nsegs, u_int tso) { u_int n; MPASS(nsegs > 0); nsegs--; /* first segment is part of ulptx_sgl */ n = sizeof(struct fw_eth_tx_pkt_vm_wr) + sizeof(struct cpl_tx_pkt_core) + sizeof(struct ulptx_sgl) + 8 * ((3 * nsegs) / 2 + (nsegs & 1)); if (tso) n += sizeof(struct cpl_tx_pkt_lso_core); return (howmany(n, 16)); } /* * len16 for a txpkts type 0 WR with a GL. Does not include the firmware work * request header. */ static inline u_int txpkts0_len16(u_int nsegs) { u_int n; MPASS(nsegs > 0); nsegs--; /* first segment is part of ulptx_sgl */ n = sizeof(struct ulp_txpkt) + sizeof(struct ulptx_idata) + sizeof(struct cpl_tx_pkt_core) + sizeof(struct ulptx_sgl) + 8 * ((3 * nsegs) / 2 + (nsegs & 1)); return (howmany(n, 16)); } /* * len16 for a txpkts type 1 WR with a GL. Does not include the firmware work * request header. */ static inline u_int txpkts1_len16(void) { u_int n; n = sizeof(struct cpl_tx_pkt_core) + sizeof(struct ulptx_sgl); return (howmany(n, 16)); } static inline u_int imm_payload(u_int ndesc) { u_int n; n = ndesc * EQ_ESIZE - sizeof(struct fw_eth_tx_pkt_wr) - sizeof(struct cpl_tx_pkt_core); return (n); } /* * Write a VM txpkt WR for this packet to the hardware descriptors, update the * software descriptor, and advance the pidx. It is guaranteed that enough * descriptors are available. * * The return value is the # of hardware descriptors used. */ static u_int write_txpkt_vm_wr(struct adapter *sc, struct sge_txq *txq, struct fw_eth_tx_pkt_vm_wr *wr, struct mbuf *m0, u_int available) { struct sge_eq *eq = &txq->eq; struct tx_sdesc *txsd; struct cpl_tx_pkt_core *cpl; uint32_t ctrl; /* used in many unrelated places */ uint64_t ctrl1; int csum_type, len16, ndesc, pktlen, nsegs; caddr_t dst; TXQ_LOCK_ASSERT_OWNED(txq); M_ASSERTPKTHDR(m0); MPASS(available > 0 && available < eq->sidx); len16 = mbuf_len16(m0); nsegs = mbuf_nsegs(m0); pktlen = m0->m_pkthdr.len; ctrl = sizeof(struct cpl_tx_pkt_core); if (needs_tso(m0)) ctrl += sizeof(struct cpl_tx_pkt_lso_core); ndesc = howmany(len16, EQ_ESIZE / 16); MPASS(ndesc <= available); /* Firmware work request header */ MPASS(wr == (void *)&eq->desc[eq->pidx]); wr->op_immdlen = htobe32(V_FW_WR_OP(FW_ETH_TX_PKT_VM_WR) | V_FW_ETH_TX_PKT_WR_IMMDLEN(ctrl)); ctrl = V_FW_WR_LEN16(len16); wr->equiq_to_len16 = htobe32(ctrl); wr->r3[0] = 0; wr->r3[1] = 0; /* * Copy over ethmacdst, ethmacsrc, ethtype, and vlantci. * vlantci is ignored unless the ethtype is 0x8100, so it's * simpler to always copy it rather than making it * conditional. Also, it seems that we do not have to set * vlantci or fake the ethtype when doing VLAN tag insertion. 
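The 16 bytes copied below cover the ethmacdst, ethmacsrc, ethtype, and vlantci fields of the work request header.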
*/ m_copydata(m0, 0, sizeof(struct ether_header) + 2, wr->ethmacdst); csum_type = -1; if (needs_tso(m0)) { struct cpl_tx_pkt_lso_core *lso = (void *)(wr + 1); KASSERT(m0->m_pkthdr.l2hlen > 0 && m0->m_pkthdr.l3hlen > 0 && m0->m_pkthdr.l4hlen > 0, ("%s: mbuf %p needs TSO but missing header lengths", __func__, m0)); ctrl = V_LSO_OPCODE(CPL_TX_PKT_LSO) | F_LSO_FIRST_SLICE | F_LSO_LAST_SLICE | V_LSO_IPHDR_LEN(m0->m_pkthdr.l3hlen >> 2) | V_LSO_TCPHDR_LEN(m0->m_pkthdr.l4hlen >> 2); if (m0->m_pkthdr.l2hlen == sizeof(struct ether_vlan_header)) ctrl |= V_LSO_ETHHDR_LEN(1); if (m0->m_pkthdr.l3hlen == sizeof(struct ip6_hdr)) ctrl |= F_LSO_IPV6; lso->lso_ctrl = htobe32(ctrl); lso->ipid_ofst = htobe16(0); lso->mss = htobe16(m0->m_pkthdr.tso_segsz); lso->seqno_offset = htobe32(0); lso->len = htobe32(pktlen); if (m0->m_pkthdr.l3hlen == sizeof(struct ip6_hdr)) csum_type = TX_CSUM_TCPIP6; else csum_type = TX_CSUM_TCPIP; cpl = (void *)(lso + 1); txq->tso_wrs++; } else { if (m0->m_pkthdr.csum_flags & CSUM_IP_TCP) csum_type = TX_CSUM_TCPIP; else if (m0->m_pkthdr.csum_flags & CSUM_IP_UDP) csum_type = TX_CSUM_UDPIP; else if (m0->m_pkthdr.csum_flags & CSUM_IP6_TCP) csum_type = TX_CSUM_TCPIP6; else if (m0->m_pkthdr.csum_flags & CSUM_IP6_UDP) csum_type = TX_CSUM_UDPIP6; #if defined(INET) else if (m0->m_pkthdr.csum_flags & CSUM_IP) { /* * XXX: The firmware appears to stomp on the * fragment/flags field of the IP header when * using TX_CSUM_IP. Fall back to doing * software checksums. */ u_short *sump; struct mbuf *m; int offset; m = m0; offset = 0; sump = m_advance(&m, &offset, m0->m_pkthdr.l2hlen + offsetof(struct ip, ip_sum)); *sump = in_cksum_skip(m0, m0->m_pkthdr.l2hlen + m0->m_pkthdr.l3hlen, m0->m_pkthdr.l2hlen); m0->m_pkthdr.csum_flags &= ~CSUM_IP; } #endif cpl = (void *)(wr + 1); } /* Checksum offload */ ctrl1 = 0; if (needs_l3_csum(m0) == 0) ctrl1 |= F_TXPKT_IPCSUM_DIS; if (csum_type >= 0) { KASSERT(m0->m_pkthdr.l2hlen > 0 && m0->m_pkthdr.l3hlen > 0, ("%s: mbuf %p needs checksum offload but missing header lengths", __func__, m0)); if (chip_id(sc) <= CHELSIO_T5) { ctrl1 |= V_TXPKT_ETHHDR_LEN(m0->m_pkthdr.l2hlen - ETHER_HDR_LEN); } else { ctrl1 |= V_T6_TXPKT_ETHHDR_LEN(m0->m_pkthdr.l2hlen - ETHER_HDR_LEN); } ctrl1 |= V_TXPKT_IPHDR_LEN(m0->m_pkthdr.l3hlen); ctrl1 |= V_TXPKT_CSUM_TYPE(csum_type); } else ctrl1 |= F_TXPKT_L4CSUM_DIS; if (m0->m_pkthdr.csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6 | CSUM_TSO)) txq->txcsum++; /* some hardware assistance provided */ /* VLAN tag insertion */ if (needs_vlan_insertion(m0)) { ctrl1 |= F_TXPKT_VLAN_VLD | V_TXPKT_VLAN(m0->m_pkthdr.ether_vtag); txq->vlan_insertion++; } /* CPL header */ cpl->ctrl0 = txq->cpl_ctrl0; cpl->pack = 0; cpl->len = htobe16(pktlen); cpl->ctrl1 = htobe64(ctrl1); /* SGL */ dst = (void *)(cpl + 1); /* * A packet using TSO will use up an entire descriptor for the * firmware work request header, LSO CPL, and TX_PKT_XT CPL. * If this descriptor is the last descriptor in the ring, wrap * around to the front of the ring explicitly for the start of * the sgl. */ if (dst == (void *)&eq->desc[eq->sidx]) { dst = (void *)&eq->desc[0]; write_gl_to_txd(txq, m0, &dst, 0); } else write_gl_to_txd(txq, m0, &dst, eq->sidx - ndesc < eq->pidx); txq->sgl_wrs++; txq->txpkt_wrs++; txsd = &txq->sdesc[eq->pidx]; txsd->m = m0; txsd->desc_used = ndesc; return (ndesc); } /* * Write a raw WR to the hardware descriptors, update the software * descriptor, and advance the pidx. It is guaranteed that enough * descriptors are available. 
* * The return value is the # of hardware descriptors used. */ static u_int write_raw_wr(struct sge_txq *txq, void *wr, struct mbuf *m0, u_int available) { struct sge_eq *eq = &txq->eq; struct tx_sdesc *txsd; struct mbuf *m; caddr_t dst; int len16, ndesc; len16 = mbuf_len16(m0); ndesc = howmany(len16, EQ_ESIZE / 16); MPASS(ndesc <= available); dst = wr; for (m = m0; m != NULL; m = m->m_next) copy_to_txd(eq, mtod(m, caddr_t), &dst, m->m_len); txq->raw_wrs++; txsd = &txq->sdesc[eq->pidx]; txsd->m = m0; txsd->desc_used = ndesc; return (ndesc); } /* * Write a txpkt WR for this packet to the hardware descriptors, update the * software descriptor, and advance the pidx. It is guaranteed that enough * descriptors are available. * * The return value is the # of hardware descriptors used. */ static u_int write_txpkt_wr(struct sge_txq *txq, struct fw_eth_tx_pkt_wr *wr, struct mbuf *m0, u_int available) { struct sge_eq *eq = &txq->eq; struct tx_sdesc *txsd; struct cpl_tx_pkt_core *cpl; uint32_t ctrl; /* used in many unrelated places */ uint64_t ctrl1; int len16, ndesc, pktlen, nsegs; caddr_t dst; TXQ_LOCK_ASSERT_OWNED(txq); M_ASSERTPKTHDR(m0); MPASS(available > 0 && available < eq->sidx); len16 = mbuf_len16(m0); nsegs = mbuf_nsegs(m0); pktlen = m0->m_pkthdr.len; ctrl = sizeof(struct cpl_tx_pkt_core); if (needs_tso(m0)) ctrl += sizeof(struct cpl_tx_pkt_lso_core); else if (pktlen <= imm_payload(2) && available >= 2) { /* Immediate data. Recalculate len16 and set nsegs to 0. */ ctrl += pktlen; len16 = howmany(sizeof(struct fw_eth_tx_pkt_wr) + sizeof(struct cpl_tx_pkt_core) + pktlen, 16); nsegs = 0; } ndesc = howmany(len16, EQ_ESIZE / 16); MPASS(ndesc <= available); /* Firmware work request header */ MPASS(wr == (void *)&eq->desc[eq->pidx]); wr->op_immdlen = htobe32(V_FW_WR_OP(FW_ETH_TX_PKT_WR) | V_FW_ETH_TX_PKT_WR_IMMDLEN(ctrl)); ctrl = V_FW_WR_LEN16(len16); wr->equiq_to_len16 = htobe32(ctrl); wr->r3 = 0; if (needs_tso(m0)) { struct cpl_tx_pkt_lso_core *lso = (void *)(wr + 1); KASSERT(m0->m_pkthdr.l2hlen > 0 && m0->m_pkthdr.l3hlen > 0 && m0->m_pkthdr.l4hlen > 0, ("%s: mbuf %p needs TSO but missing header lengths", __func__, m0)); ctrl = V_LSO_OPCODE(CPL_TX_PKT_LSO) | F_LSO_FIRST_SLICE | F_LSO_LAST_SLICE | V_LSO_IPHDR_LEN(m0->m_pkthdr.l3hlen >> 2) | V_LSO_TCPHDR_LEN(m0->m_pkthdr.l4hlen >> 2); if (m0->m_pkthdr.l2hlen == sizeof(struct ether_vlan_header)) ctrl |= V_LSO_ETHHDR_LEN(1); if (m0->m_pkthdr.l3hlen == sizeof(struct ip6_hdr)) ctrl |= F_LSO_IPV6; lso->lso_ctrl = htobe32(ctrl); lso->ipid_ofst = htobe16(0); lso->mss = htobe16(m0->m_pkthdr.tso_segsz); lso->seqno_offset = htobe32(0); lso->len = htobe32(pktlen); cpl = (void *)(lso + 1); txq->tso_wrs++; } else cpl = (void *)(wr + 1); /* Checksum offload */ ctrl1 = 0; if (needs_l3_csum(m0) == 0) ctrl1 |= F_TXPKT_IPCSUM_DIS; if (needs_l4_csum(m0) == 0) ctrl1 |= F_TXPKT_L4CSUM_DIS; if (m0->m_pkthdr.csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6 | CSUM_TSO)) txq->txcsum++; /* some hardware assistance provided */ /* VLAN tag insertion */ if (needs_vlan_insertion(m0)) { ctrl1 |= F_TXPKT_VLAN_VLD | V_TXPKT_VLAN(m0->m_pkthdr.ether_vtag); txq->vlan_insertion++; } /* CPL header */ cpl->ctrl0 = txq->cpl_ctrl0; cpl->pack = 0; cpl->len = htobe16(pktlen); cpl->ctrl1 = htobe64(ctrl1); /* SGL */ dst = (void *)(cpl + 1); if (nsegs > 0) { write_gl_to_txd(txq, m0, &dst, eq->sidx - ndesc < eq->pidx); txq->sgl_wrs++; } else { struct mbuf *m; for (m = m0; m != NULL; m = m->m_next) { copy_to_txd(eq, mtod(m, caddr_t), &dst, m->m_len); #ifdef INVARIANTS 
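/* Deduct each copied mbuf's length; the KASSERT below verifies that the entire packet was written as immediate data. */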
pktlen -= m->m_len; #endif } #ifdef INVARIANTS KASSERT(pktlen == 0, ("%s: %d bytes left.", __func__, pktlen)); #endif txq->imm_wrs++; } txq->txpkt_wrs++; txsd = &txq->sdesc[eq->pidx]; txsd->m = m0; txsd->desc_used = ndesc; return (ndesc); } static int try_txpkts(struct mbuf *m, struct mbuf *n, struct txpkts *txp, u_int available) { u_int needed, nsegs1, nsegs2, l1, l2; if (cannot_use_txpkts(m) || cannot_use_txpkts(n)) return (1); nsegs1 = mbuf_nsegs(m); nsegs2 = mbuf_nsegs(n); if (nsegs1 + nsegs2 == 2) { txp->wr_type = 1; l1 = l2 = txpkts1_len16(); } else { txp->wr_type = 0; l1 = txpkts0_len16(nsegs1); l2 = txpkts0_len16(nsegs2); } txp->len16 = howmany(sizeof(struct fw_eth_tx_pkts_wr), 16) + l1 + l2; needed = howmany(txp->len16, EQ_ESIZE / 16); if (needed > SGE_MAX_WR_NDESC || needed > available) return (1); txp->plen = m->m_pkthdr.len + n->m_pkthdr.len; if (txp->plen > 65535) return (1); txp->npkt = 2; set_mbuf_len16(m, l1); set_mbuf_len16(n, l2); return (0); } static int add_to_txpkts(struct mbuf *m, struct txpkts *txp, u_int available) { u_int plen, len16, needed, nsegs; MPASS(txp->wr_type == 0 || txp->wr_type == 1); if (cannot_use_txpkts(m)) return (1); nsegs = mbuf_nsegs(m); if (txp->wr_type == 1 && nsegs != 1) return (1); plen = txp->plen + m->m_pkthdr.len; if (plen > 65535) return (1); if (txp->wr_type == 0) len16 = txpkts0_len16(nsegs); else len16 = txpkts1_len16(); needed = howmany(txp->len16 + len16, EQ_ESIZE / 16); if (needed > SGE_MAX_WR_NDESC || needed > available) return (1); txp->npkt++; txp->plen = plen; txp->len16 += len16; set_mbuf_len16(m, len16); return (0); } /* * Write a txpkts WR for the packets in txp to the hardware descriptors, update * the software descriptor, and advance the pidx. It is guaranteed that enough * descriptors are available. * * The return value is the # of hardware descriptors used. */ static u_int write_txpkts_wr(struct sge_txq *txq, struct fw_eth_tx_pkts_wr *wr, struct mbuf *m0, const struct txpkts *txp, u_int available) { struct sge_eq *eq = &txq->eq; struct tx_sdesc *txsd; struct cpl_tx_pkt_core *cpl; uint32_t ctrl; uint64_t ctrl1; int ndesc, checkwrap; struct mbuf *m; void *flitp; TXQ_LOCK_ASSERT_OWNED(txq); MPASS(txp->npkt > 0); MPASS(txp->plen < 65536); MPASS(m0 != NULL); MPASS(m0->m_nextpkt != NULL); MPASS(txp->len16 <= howmany(SGE_MAX_WR_LEN, 16)); MPASS(available > 0 && available < eq->sidx); ndesc = howmany(txp->len16, EQ_ESIZE / 16); MPASS(ndesc <= available); MPASS(wr == (void *)&eq->desc[eq->pidx]); wr->op_pkd = htobe32(V_FW_WR_OP(FW_ETH_TX_PKTS_WR)); ctrl = V_FW_WR_LEN16(txp->len16); wr->equiq_to_len16 = htobe32(ctrl); wr->plen = htobe16(txp->plen); wr->npkt = txp->npkt; wr->r3 = 0; wr->type = txp->wr_type; flitp = wr + 1; /* * At this point we are 16B into a hardware descriptor. If checkwrap is * set then we know the WR is going to wrap around somewhere. We'll * check for that at appropriate points. 
*/ checkwrap = eq->sidx - ndesc < eq->pidx; for (m = m0; m != NULL; m = m->m_nextpkt) { if (txp->wr_type == 0) { struct ulp_txpkt *ulpmc; struct ulptx_idata *ulpsc; /* ULP master command */ ulpmc = flitp; ulpmc->cmd_dest = htobe32(V_ULPTX_CMD(ULP_TX_PKT) | V_ULP_TXPKT_DEST(0) | V_ULP_TXPKT_FID(eq->iqid)); ulpmc->len = htobe32(mbuf_len16(m)); /* ULP subcommand */ ulpsc = (void *)(ulpmc + 1); ulpsc->cmd_more = htobe32(V_ULPTX_CMD(ULP_TX_SC_IMM) | F_ULP_TX_SC_MORE); ulpsc->len = htobe32(sizeof(struct cpl_tx_pkt_core)); cpl = (void *)(ulpsc + 1); if (checkwrap && (uintptr_t)cpl == (uintptr_t)&eq->desc[eq->sidx]) cpl = (void *)&eq->desc[0]; } else { cpl = flitp; } /* Checksum offload */ ctrl1 = 0; if (needs_l3_csum(m) == 0) ctrl1 |= F_TXPKT_IPCSUM_DIS; if (needs_l4_csum(m) == 0) ctrl1 |= F_TXPKT_L4CSUM_DIS; if (m->m_pkthdr.csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6 | CSUM_TSO)) txq->txcsum++; /* some hardware assistance provided */ /* VLAN tag insertion */ if (needs_vlan_insertion(m)) { ctrl1 |= F_TXPKT_VLAN_VLD | V_TXPKT_VLAN(m->m_pkthdr.ether_vtag); txq->vlan_insertion++; } /* CPL header */ cpl->ctrl0 = txq->cpl_ctrl0; cpl->pack = 0; cpl->len = htobe16(m->m_pkthdr.len); cpl->ctrl1 = htobe64(ctrl1); flitp = cpl + 1; if (checkwrap && (uintptr_t)flitp == (uintptr_t)&eq->desc[eq->sidx]) flitp = (void *)&eq->desc[0]; write_gl_to_txd(txq, m, (caddr_t *)(&flitp), checkwrap); } if (txp->wr_type == 0) { txq->txpkts0_pkts += txp->npkt; txq->txpkts0_wrs++; } else { txq->txpkts1_pkts += txp->npkt; txq->txpkts1_wrs++; } txsd = &txq->sdesc[eq->pidx]; txsd->m = m0; txsd->desc_used = ndesc; return (ndesc); } /* * If the SGL ends on an address that is not 16 byte aligned, this function will * add a 0 filled flit at the end. */ static void write_gl_to_txd(struct sge_txq *txq, struct mbuf *m, caddr_t *to, int checkwrap) { struct sge_eq *eq = &txq->eq; struct sglist *gl = txq->gl; struct sglist_seg *seg; __be64 *flitp, *wrap; struct ulptx_sgl *usgl; int i, nflits, nsegs; KASSERT(((uintptr_t)(*to) & 0xf) == 0, ("%s: SGL must start at a 16 byte boundary: %p", __func__, *to)); MPASS((uintptr_t)(*to) >= (uintptr_t)&eq->desc[0]); MPASS((uintptr_t)(*to) < (uintptr_t)&eq->desc[eq->sidx]); get_pkt_gl(m, gl); nsegs = gl->sg_nseg; MPASS(nsegs > 0); nflits = (3 * (nsegs - 1)) / 2 + ((nsegs - 1) & 1) + 2; flitp = (__be64 *)(*to); wrap = (__be64 *)(&eq->desc[eq->sidx]); seg = &gl->sg_segs[0]; usgl = (void *)flitp; /* * We start at a 16 byte boundary somewhere inside the tx descriptor * ring, so we're at least 16 bytes away from the status page. There is * no chance of a wrap around in the middle of usgl (which is 16 bytes). 
*/ usgl->cmd_nsge = htobe32(V_ULPTX_CMD(ULP_TX_SC_DSGL) | V_ULPTX_NSGE(nsegs)); usgl->len0 = htobe32(seg->ss_len); usgl->addr0 = htobe64(seg->ss_paddr); seg++; if (checkwrap == 0 || (uintptr_t)(flitp + nflits) <= (uintptr_t)wrap) { /* Won't wrap around at all */ for (i = 0; i < nsegs - 1; i++, seg++) { usgl->sge[i / 2].len[i & 1] = htobe32(seg->ss_len); usgl->sge[i / 2].addr[i & 1] = htobe64(seg->ss_paddr); } if (i & 1) usgl->sge[i / 2].len[1] = htobe32(0); flitp += nflits; } else { /* Will wrap somewhere in the rest of the SGL */ /* 2 flits already written, write the rest flit by flit */ flitp = (void *)(usgl + 1); for (i = 0; i < nflits - 2; i++) { if (flitp == wrap) flitp = (void *)eq->desc; *flitp++ = get_flit(seg, nsegs - 1, i); } } if (nflits & 1) { MPASS(((uintptr_t)flitp) & 0xf); *flitp++ = 0; } MPASS((((uintptr_t)flitp) & 0xf) == 0); if (__predict_false(flitp == wrap)) *to = (void *)eq->desc; else *to = (void *)flitp; } static inline void copy_to_txd(struct sge_eq *eq, caddr_t from, caddr_t *to, int len) { MPASS((uintptr_t)(*to) >= (uintptr_t)&eq->desc[0]); MPASS((uintptr_t)(*to) < (uintptr_t)&eq->desc[eq->sidx]); if (__predict_true((uintptr_t)(*to) + len <= (uintptr_t)&eq->desc[eq->sidx])) { bcopy(from, *to, len); (*to) += len; } else { int portion = (uintptr_t)&eq->desc[eq->sidx] - (uintptr_t)(*to); bcopy(from, *to, portion); from += portion; portion = len - portion; /* remaining */ bcopy(from, (void *)eq->desc, portion); (*to) = (caddr_t)eq->desc + portion; } } static inline void ring_eq_db(struct adapter *sc, struct sge_eq *eq, u_int n) { u_int db; MPASS(n > 0); db = eq->doorbells; if (n > 1) clrbit(&db, DOORBELL_WCWR); wmb(); switch (ffs(db) - 1) { case DOORBELL_UDB: *eq->udb = htole32(V_QID(eq->udb_qid) | V_PIDX(n)); break; case DOORBELL_WCWR: { volatile uint64_t *dst, *src; int i; /* * Queues whose 128B doorbell segment fits in the page do not * use relative qid (udb_qid is always 0). Only queues with * doorbell segments can do WCWR. */ KASSERT(eq->udb_qid == 0 && n == 1, ("%s: inappropriate doorbell (0x%x, %d, %d) for eq %p", __func__, eq->doorbells, n, eq->dbidx, eq)); dst = (volatile void *)((uintptr_t)eq->udb + UDBS_WR_OFFSET - UDBS_DB_OFFSET); i = eq->dbidx; src = (void *)&eq->desc[i]; while (src != (void *)&eq->desc[i + 1]) *dst++ = *src++; wmb(); break; } case DOORBELL_UDBWC: *eq->udb = htole32(V_QID(eq->udb_qid) | V_PIDX(n)); wmb(); break; case DOORBELL_KDB: t4_write_reg(sc, sc->sge_kdoorbell_reg, V_QID(eq->cntxt_id) | V_PIDX(n)); break; } IDXINCR(eq->dbidx, n, eq->sidx); } static inline u_int reclaimable_tx_desc(struct sge_eq *eq) { uint16_t hw_cidx; hw_cidx = read_hw_cidx(eq); return (IDXDIFF(hw_cidx, eq->cidx, eq->sidx)); } static inline u_int total_available_tx_desc(struct sge_eq *eq) { uint16_t hw_cidx, pidx; hw_cidx = read_hw_cidx(eq); pidx = eq->pidx; if (pidx == hw_cidx) return (eq->sidx - 1); else return (IDXDIFF(hw_cidx, pidx, eq->sidx) - 1); } static inline uint16_t read_hw_cidx(struct sge_eq *eq) { struct sge_qstat *spg = (void *)&eq->desc[eq->sidx]; uint16_t cidx = spg->cidx; /* stable snapshot */ return (be16toh(cidx)); } /* * Reclaim 'n' descriptors approximately. 
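Whole work requests are reclaimed, so the count may exceed 'n'; the firmware never returns partial credits for a work request.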
*/ static u_int reclaim_tx_descs(struct sge_txq *txq, u_int n) { struct tx_sdesc *txsd; struct sge_eq *eq = &txq->eq; u_int can_reclaim, reclaimed; TXQ_LOCK_ASSERT_OWNED(txq); MPASS(n > 0); reclaimed = 0; can_reclaim = reclaimable_tx_desc(eq); while (can_reclaim && reclaimed < n) { int ndesc; struct mbuf *m, *nextpkt; txsd = &txq->sdesc[eq->cidx]; ndesc = txsd->desc_used; /* Firmware doesn't return "partial" credits. */ KASSERT(can_reclaim >= ndesc, ("%s: unexpected number of credits: %d, %d", __func__, can_reclaim, ndesc)); KASSERT(ndesc != 0, ("%s: descriptor with no credits: cidx %d", __func__, eq->cidx)); for (m = txsd->m; m != NULL; m = nextpkt) { nextpkt = m->m_nextpkt; m->m_nextpkt = NULL; m_freem(m); } reclaimed += ndesc; can_reclaim -= ndesc; IDXINCR(eq->cidx, ndesc, eq->sidx); } return (reclaimed); } static void tx_reclaim(void *arg, int n) { struct sge_txq *txq = arg; struct sge_eq *eq = &txq->eq; do { if (TXQ_TRYLOCK(txq) == 0) break; n = reclaim_tx_descs(txq, 32); if (eq->cidx == eq->pidx) eq->equeqidx = eq->pidx; TXQ_UNLOCK(txq); } while (n > 0); } static __be64 get_flit(struct sglist_seg *segs, int nsegs, int idx) { int i = (idx / 3) * 2; switch (idx % 3) { case 0: { uint64_t rc; rc = (uint64_t)segs[i].ss_len << 32; if (i + 1 < nsegs) rc |= (uint64_t)(segs[i + 1].ss_len); return (htobe64(rc)); } case 1: return (htobe64(segs[i].ss_paddr)); case 2: return (htobe64(segs[i + 1].ss_paddr)); } return (0); } static void find_best_refill_source(struct adapter *sc, struct sge_fl *fl, int maxp) { int8_t zidx, hwidx, idx; uint16_t region1, region3; int spare, spare_needed, n; struct sw_zone_info *swz; struct hw_buf_info *hwb, *hwb_list = &sc->sge.hw_buf_info[0]; /* * Buffer Packing: Look for PAGE_SIZE or larger zone which has a bufsize * large enough for the max payload and cluster metadata. Otherwise * settle for the largest bufsize that leaves enough room in the cluster * for metadata. * * Without buffer packing: Look for the smallest zone which has a * bufsize large enough for the max payload. Settle for the largest * bufsize available if there's nothing big enough for max payload. */ spare_needed = fl->flags & FL_BUF_PACKING ? CL_METADATA_SIZE : 0; swz = &sc->sge.sw_zone_info[0]; hwidx = -1; for (zidx = 0; zidx < SW_ZONE_SIZES; zidx++, swz++) { if (swz->size > largest_rx_cluster) { if (__predict_true(hwidx != -1)) break; /* * This is a misconfiguration. largest_rx_cluster is * preventing us from finding a refill source. See * dev.t5nex..buffer_sizes to figure out why. */ device_printf(sc->dev, "largest_rx_cluster=%u leaves no" " refill source for fl %p (dma %u). Ignored.\n", largest_rx_cluster, fl, maxp); } for (idx = swz->head_hwidx; idx != -1; idx = hwb->next) { hwb = &hwb_list[idx]; spare = swz->size - hwb->size; if (spare < spare_needed) continue; hwidx = idx; /* best option so far */ if (hwb->size >= maxp) { if ((fl->flags & FL_BUF_PACKING) == 0) goto done; /* stop looking (not packing) */ if (swz->size >= safest_rx_cluster) goto done; /* stop looking (packing) */ } break; /* keep looking, next zone */ } } done: /* A usable hwidx has been located. */ MPASS(hwidx != -1); hwb = &hwb_list[hwidx]; zidx = hwb->zidx; swz = &sc->sge.sw_zone_info[zidx]; region1 = 0; region3 = swz->size - hwb->size; /* * Stay within this zone and see if there is a better match when mbuf * inlining is allowed. Remember that the hwidx's are sorted in * decreasing order of size (so in increasing order of spare area). 
*/ for (idx = hwidx; idx != -1; idx = hwb->next) { hwb = &hwb_list[idx]; spare = swz->size - hwb->size; if (allow_mbufs_in_cluster == 0 || hwb->size < maxp) break; /* * Do not inline mbufs if doing so would violate the pad/pack * boundary alignment requirement. */ if (fl_pad && (MSIZE % sc->params.sge.pad_boundary) != 0) continue; if (fl->flags & FL_BUF_PACKING && (MSIZE % sc->params.sge.pack_boundary) != 0) continue; if (spare < CL_METADATA_SIZE + MSIZE) continue; n = (spare - CL_METADATA_SIZE) / MSIZE; if (n > howmany(hwb->size, maxp)) break; hwidx = idx; if (fl->flags & FL_BUF_PACKING) { region1 = n * MSIZE; region3 = spare - region1; } else { region1 = MSIZE; region3 = spare - region1; break; } } KASSERT(zidx >= 0 && zidx < SW_ZONE_SIZES, ("%s: bad zone %d for fl %p, maxp %d", __func__, zidx, fl, maxp)); KASSERT(hwidx >= 0 && hwidx <= SGE_FLBUF_SIZES, ("%s: bad hwidx %d for fl %p, maxp %d", __func__, hwidx, fl, maxp)); KASSERT(region1 + sc->sge.hw_buf_info[hwidx].size + region3 == sc->sge.sw_zone_info[zidx].size, ("%s: bad buffer layout for fl %p, maxp %d. " "cl %d; r1 %d, payload %d, r3 %d", __func__, fl, maxp, sc->sge.sw_zone_info[zidx].size, region1, sc->sge.hw_buf_info[hwidx].size, region3)); if (fl->flags & FL_BUF_PACKING || region1 > 0) { KASSERT(region3 >= CL_METADATA_SIZE, ("%s: no room for metadata. fl %p, maxp %d; " "cl %d; r1 %d, payload %d, r3 %d", __func__, fl, maxp, sc->sge.sw_zone_info[zidx].size, region1, sc->sge.hw_buf_info[hwidx].size, region3)); KASSERT(region1 % MSIZE == 0, ("%s: bad mbuf region for fl %p, maxp %d. " "cl %d; r1 %d, payload %d, r3 %d", __func__, fl, maxp, sc->sge.sw_zone_info[zidx].size, region1, sc->sge.hw_buf_info[hwidx].size, region3)); } fl->cll_def.zidx = zidx; fl->cll_def.hwidx = hwidx; fl->cll_def.region1 = region1; fl->cll_def.region3 = region3; } static void find_safe_refill_source(struct adapter *sc, struct sge_fl *fl) { struct sge *s = &sc->sge; struct hw_buf_info *hwb; struct sw_zone_info *swz; int spare; int8_t hwidx; if (fl->flags & FL_BUF_PACKING) hwidx = s->safe_hwidx2; /* with room for metadata */ else if (allow_mbufs_in_cluster && s->safe_hwidx2 != -1) { hwidx = s->safe_hwidx2; hwb = &s->hw_buf_info[hwidx]; swz = &s->sw_zone_info[hwb->zidx]; spare = swz->size - hwb->size; /* no good if there isn't room for an mbuf as well */ if (spare < CL_METADATA_SIZE + MSIZE) hwidx = s->safe_hwidx1; } else hwidx = s->safe_hwidx1; if (hwidx == -1) { /* No fallback source */ fl->cll_alt.hwidx = -1; fl->cll_alt.zidx = -1; return; } hwb = &s->hw_buf_info[hwidx]; swz = &s->sw_zone_info[hwb->zidx]; spare = swz->size - hwb->size; fl->cll_alt.hwidx = hwidx; fl->cll_alt.zidx = hwb->zidx; if (allow_mbufs_in_cluster && (fl_pad == 0 || (MSIZE % sc->params.sge.pad_boundary) == 0)) fl->cll_alt.region1 = ((spare - CL_METADATA_SIZE) / MSIZE) * MSIZE; else fl->cll_alt.region1 = 0; fl->cll_alt.region3 = spare - fl->cll_alt.region1; } static void add_fl_to_sfl(struct adapter *sc, struct sge_fl *fl) { mtx_lock(&sc->sfl_lock); FL_LOCK(fl); if ((fl->flags & FL_DOOMED) == 0) { fl->flags |= FL_STARVING; TAILQ_INSERT_TAIL(&sc->sfl, fl, link); callout_reset(&sc->sfl_callout, hz / 5, refill_sfl, sc); } FL_UNLOCK(fl); mtx_unlock(&sc->sfl_lock); } static void handle_wrq_egr_update(struct adapter *sc, struct sge_eq *eq) { struct sge_wrq *wrq = (void *)eq; atomic_readandclear_int(&eq->equiq); taskqueue_enqueue(sc->tq[eq->tx_chan], &wrq->wrq_tx_task); } static void handle_eth_egr_update(struct adapter *sc, struct sge_eq *eq) { struct sge_txq *txq = (void *)eq; MPASS((eq->flags 
& EQ_TYPEMASK) == EQ_ETH); atomic_readandclear_int(&eq->equiq); mp_ring_check_drainage(txq->r, 0); taskqueue_enqueue(sc->tq[eq->tx_chan], &txq->tx_reclaim_task); } static int handle_sge_egr_update(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { const struct cpl_sge_egr_update *cpl = (const void *)(rss + 1); unsigned int qid = G_EGR_QID(ntohl(cpl->opcode_qid)); struct adapter *sc = iq->adapter; struct sge *s = &sc->sge; struct sge_eq *eq; static void (*h[])(struct adapter *, struct sge_eq *) = {NULL, &handle_wrq_egr_update, &handle_eth_egr_update, &handle_wrq_egr_update}; KASSERT(m == NULL, ("%s: payload with opcode %02x", __func__, rss->opcode)); eq = s->eqmap[qid - s->eq_start - s->eq_base]; (*h[eq->flags & EQ_TYPEMASK])(sc, eq); return (0); } /* handle_fw_msg works for both fw4_msg and fw6_msg because this is valid */ CTASSERT(offsetof(struct cpl_fw4_msg, data) == \ offsetof(struct cpl_fw6_msg, data)); static int handle_fw_msg(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { struct adapter *sc = iq->adapter; const struct cpl_fw6_msg *cpl = (const void *)(rss + 1); KASSERT(m == NULL, ("%s: payload with opcode %02x", __func__, rss->opcode)); if (cpl->type == FW_TYPE_RSSCPL || cpl->type == FW6_TYPE_RSSCPL) { const struct rss_header *rss2; rss2 = (const struct rss_header *)&cpl->data[0]; return (t4_cpl_handler[rss2->opcode](iq, rss2, m)); } return (t4_fw_msg_handler[cpl->type](sc, &cpl->data[0])); } /** * t4_handle_wrerr_rpl - process a FW work request error message * @adap: the adapter * @rpl: start of the FW message */ static int t4_handle_wrerr_rpl(struct adapter *adap, const __be64 *rpl) { u8 opcode = *(const u8 *)rpl; const struct fw_error_cmd *e = (const void *)rpl; unsigned int i; if (opcode != FW_ERROR_CMD) { log(LOG_ERR, "%s: Received WRERR_RPL message with opcode %#x\n", device_get_nameunit(adap->dev), opcode); return (EINVAL); } log(LOG_ERR, "%s: FW_ERROR (%s) ", device_get_nameunit(adap->dev), G_FW_ERROR_CMD_FATAL(be32toh(e->op_to_type)) ? "fatal" : "non-fatal"); switch (G_FW_ERROR_CMD_TYPE(be32toh(e->op_to_type))) { case FW_ERROR_TYPE_EXCEPTION: log(LOG_ERR, "exception info:\n"); for (i = 0; i < nitems(e->u.exception.info); i++) log(LOG_ERR, "%s%08x", i == 0 ? "\t" : " ", be32toh(e->u.exception.info[i])); log(LOG_ERR, "\n"); break; case FW_ERROR_TYPE_HWMODULE: log(LOG_ERR, "HW module regaddr %08x regval %08x\n", be32toh(e->u.hwmodule.regaddr), be32toh(e->u.hwmodule.regval)); break; case FW_ERROR_TYPE_WR: log(LOG_ERR, "WR cidx %d PF %d VF %d eqid %d hdr:\n", be16toh(e->u.wr.cidx), G_FW_ERROR_CMD_PFN(be16toh(e->u.wr.pfn_vfn)), G_FW_ERROR_CMD_VFN(be16toh(e->u.wr.pfn_vfn)), be32toh(e->u.wr.eqid)); for (i = 0; i < nitems(e->u.wr.wrhdr); i++) log(LOG_ERR, "%s%02x", i == 0 ? "\t" : " ", e->u.wr.wrhdr[i]); log(LOG_ERR, "\n"); break; case FW_ERROR_TYPE_ACL: log(LOG_ERR, "ACL cidx %d PF %d VF %d eqid %d %s", be16toh(e->u.acl.cidx), G_FW_ERROR_CMD_PFN(be16toh(e->u.acl.pfn_vfn)), G_FW_ERROR_CMD_VFN(be16toh(e->u.acl.pfn_vfn)), be32toh(e->u.acl.eqid), G_FW_ERROR_CMD_MV(be16toh(e->u.acl.mv_pkd)) ? 
"vlanid" : "MAC"); for (i = 0; i < nitems(e->u.acl.val); i++) log(LOG_ERR, " %02x", e->u.acl.val[i]); log(LOG_ERR, "\n"); break; default: log(LOG_ERR, "type %#x\n", G_FW_ERROR_CMD_TYPE(be32toh(e->op_to_type))); return (EINVAL); } return (0); } static int sysctl_uint16(SYSCTL_HANDLER_ARGS) { uint16_t *id = arg1; int i = *id; return sysctl_handle_int(oidp, &i, 0, req); } static int sysctl_bufsizes(SYSCTL_HANDLER_ARGS) { struct sge *s = arg1; struct hw_buf_info *hwb = &s->hw_buf_info[0]; struct sw_zone_info *swz = &s->sw_zone_info[0]; int i, rc; struct sbuf sb; char c; sbuf_new(&sb, NULL, 32, SBUF_AUTOEXTEND); for (i = 0; i < SGE_FLBUF_SIZES; i++, hwb++) { if (hwb->zidx >= 0 && swz[hwb->zidx].size <= largest_rx_cluster) c = '*'; else c = '\0'; sbuf_printf(&sb, "%u%c ", hwb->size, c); } sbuf_trim(&sb); sbuf_finish(&sb); rc = sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req); sbuf_delete(&sb); return (rc); } #ifdef RATELIMIT /* * len16 for a txpkt WR with a GL. Includes the firmware work request header. */ static inline u_int txpkt_eo_len16(u_int nsegs, u_int immhdrs, u_int tso) { u_int n; MPASS(immhdrs > 0); n = roundup2(sizeof(struct fw_eth_tx_eo_wr) + sizeof(struct cpl_tx_pkt_core) + immhdrs, 16); if (__predict_false(nsegs == 0)) goto done; nsegs--; /* first segment is part of ulptx_sgl */ n += sizeof(struct ulptx_sgl) + 8 * ((3 * nsegs) / 2 + (nsegs & 1)); if (tso) n += sizeof(struct cpl_tx_pkt_lso_core); done: return (howmany(n, 16)); } #define ETID_FLOWC_NPARAMS 6 #define ETID_FLOWC_LEN (roundup2((sizeof(struct fw_flowc_wr) + \ ETID_FLOWC_NPARAMS * sizeof(struct fw_flowc_mnemval)), 16)) #define ETID_FLOWC_LEN16 (howmany(ETID_FLOWC_LEN, 16)) static int send_etid_flowc_wr(struct cxgbe_snd_tag *cst, struct port_info *pi, struct vi_info *vi) { struct wrq_cookie cookie; u_int pfvf = pi->adapter->pf << S_FW_VIID_PFN; struct fw_flowc_wr *flowc; mtx_assert(&cst->lock, MA_OWNED); MPASS((cst->flags & (EO_FLOWC_PENDING | EO_FLOWC_RPL_PENDING)) == EO_FLOWC_PENDING); flowc = start_wrq_wr(cst->eo_txq, ETID_FLOWC_LEN16, &cookie); if (__predict_false(flowc == NULL)) return (ENOMEM); bzero(flowc, ETID_FLOWC_LEN); flowc->op_to_nparams = htobe32(V_FW_WR_OP(FW_FLOWC_WR) | V_FW_FLOWC_WR_NPARAMS(ETID_FLOWC_NPARAMS) | V_FW_WR_COMPL(0)); flowc->flowid_len16 = htonl(V_FW_WR_LEN16(ETID_FLOWC_LEN16) | V_FW_WR_FLOWID(cst->etid)); flowc->mnemval[0].mnemonic = FW_FLOWC_MNEM_PFNVFN; flowc->mnemval[0].val = htobe32(pfvf); flowc->mnemval[1].mnemonic = FW_FLOWC_MNEM_CH; flowc->mnemval[1].val = htobe32(pi->tx_chan); flowc->mnemval[2].mnemonic = FW_FLOWC_MNEM_PORT; flowc->mnemval[2].val = htobe32(pi->tx_chan); flowc->mnemval[3].mnemonic = FW_FLOWC_MNEM_IQID; flowc->mnemval[3].val = htobe32(cst->iqid); flowc->mnemval[4].mnemonic = FW_FLOWC_MNEM_EOSTATE; flowc->mnemval[4].val = htobe32(FW_FLOWC_MNEM_EOSTATE_ESTABLISHED); flowc->mnemval[5].mnemonic = FW_FLOWC_MNEM_SCHEDCLASS; flowc->mnemval[5].val = htobe32(cst->schedcl); commit_wrq_wr(cst->eo_txq, flowc, &cookie); cst->flags &= ~EO_FLOWC_PENDING; cst->flags |= EO_FLOWC_RPL_PENDING; MPASS(cst->tx_credits >= ETID_FLOWC_LEN16); /* flowc is first WR. 
*/ cst->tx_credits -= ETID_FLOWC_LEN16; return (0); } #define ETID_FLUSH_LEN16 (howmany(sizeof (struct fw_flowc_wr), 16)) void send_etid_flush_wr(struct cxgbe_snd_tag *cst) { struct fw_flowc_wr *flowc; struct wrq_cookie cookie; mtx_assert(&cst->lock, MA_OWNED); flowc = start_wrq_wr(cst->eo_txq, ETID_FLUSH_LEN16, &cookie); if (__predict_false(flowc == NULL)) CXGBE_UNIMPLEMENTED(__func__); bzero(flowc, ETID_FLUSH_LEN16 * 16); flowc->op_to_nparams = htobe32(V_FW_WR_OP(FW_FLOWC_WR) | V_FW_FLOWC_WR_NPARAMS(0) | F_FW_WR_COMPL); flowc->flowid_len16 = htobe32(V_FW_WR_LEN16(ETID_FLUSH_LEN16) | V_FW_WR_FLOWID(cst->etid)); commit_wrq_wr(cst->eo_txq, flowc, &cookie); cst->flags |= EO_FLUSH_RPL_PENDING; MPASS(cst->tx_credits >= ETID_FLUSH_LEN16); cst->tx_credits -= ETID_FLUSH_LEN16; cst->ncompl++; } static void write_ethofld_wr(struct cxgbe_snd_tag *cst, struct fw_eth_tx_eo_wr *wr, struct mbuf *m0, int compl) { struct cpl_tx_pkt_core *cpl; uint64_t ctrl1; uint32_t ctrl; /* used in many unrelated places */ int len16, pktlen, nsegs, immhdrs; caddr_t dst; uintptr_t p; struct ulptx_sgl *usgl; struct sglist sg; struct sglist_seg segs[38]; /* XXX: find real limit. XXX: get off the stack */ mtx_assert(&cst->lock, MA_OWNED); M_ASSERTPKTHDR(m0); KASSERT(m0->m_pkthdr.l2hlen > 0 && m0->m_pkthdr.l3hlen > 0 && m0->m_pkthdr.l4hlen > 0, ("%s: ethofld mbuf %p is missing header lengths", __func__, m0)); len16 = mbuf_eo_len16(m0); nsegs = mbuf_eo_nsegs(m0); pktlen = m0->m_pkthdr.len; ctrl = sizeof(struct cpl_tx_pkt_core); if (needs_tso(m0)) ctrl += sizeof(struct cpl_tx_pkt_lso_core); immhdrs = m0->m_pkthdr.l2hlen + m0->m_pkthdr.l3hlen + m0->m_pkthdr.l4hlen; ctrl += immhdrs; wr->op_immdlen = htobe32(V_FW_WR_OP(FW_ETH_TX_EO_WR) | V_FW_ETH_TX_EO_WR_IMMDLEN(ctrl) | V_FW_WR_COMPL(!!compl)); wr->equiq_to_len16 = htobe32(V_FW_WR_LEN16(len16) | V_FW_WR_FLOWID(cst->etid)); wr->r3 = 0; if (needs_udp_csum(m0)) { wr->u.udpseg.type = FW_ETH_TX_EO_TYPE_UDPSEG; wr->u.udpseg.ethlen = m0->m_pkthdr.l2hlen; wr->u.udpseg.iplen = htobe16(m0->m_pkthdr.l3hlen); wr->u.udpseg.udplen = m0->m_pkthdr.l4hlen; wr->u.udpseg.rtplen = 0; wr->u.udpseg.r4 = 0; wr->u.udpseg.mss = htobe16(pktlen - immhdrs); wr->u.udpseg.schedpktsize = wr->u.udpseg.mss; wr->u.udpseg.plen = htobe32(pktlen - immhdrs); cpl = (void *)(wr + 1); } else { MPASS(needs_tcp_csum(m0)); wr->u.tcpseg.type = FW_ETH_TX_EO_TYPE_TCPSEG; wr->u.tcpseg.ethlen = m0->m_pkthdr.l2hlen; wr->u.tcpseg.iplen = htobe16(m0->m_pkthdr.l3hlen); wr->u.tcpseg.tcplen = m0->m_pkthdr.l4hlen; wr->u.tcpseg.tsclk_tsoff = mbuf_eo_tsclk_tsoff(m0); wr->u.tcpseg.r4 = 0; wr->u.tcpseg.r5 = 0; wr->u.tcpseg.plen = htobe32(pktlen - immhdrs); if (needs_tso(m0)) { struct cpl_tx_pkt_lso_core *lso = (void *)(wr + 1); wr->u.tcpseg.mss = htobe16(m0->m_pkthdr.tso_segsz); ctrl = V_LSO_OPCODE(CPL_TX_PKT_LSO) | F_LSO_FIRST_SLICE | F_LSO_LAST_SLICE | V_LSO_IPHDR_LEN(m0->m_pkthdr.l3hlen >> 2) | V_LSO_TCPHDR_LEN(m0->m_pkthdr.l4hlen >> 2); if (m0->m_pkthdr.l2hlen == sizeof(struct ether_vlan_header)) ctrl |= V_LSO_ETHHDR_LEN(1); if (m0->m_pkthdr.l3hlen == sizeof(struct ip6_hdr)) ctrl |= F_LSO_IPV6; lso->lso_ctrl = htobe32(ctrl); lso->ipid_ofst = htobe16(0); lso->mss = htobe16(m0->m_pkthdr.tso_segsz); lso->seqno_offset = htobe32(0); lso->len = htobe32(pktlen); cpl = (void *)(lso + 1); } else { wr->u.tcpseg.mss = htobe16(0xffff); cpl = (void *)(wr + 1); } } /* Checksum offload must be requested for ethofld. 
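 * The MPASS just below enforces this, and ctrl1 carries only the VLAN bits,
 * so no checksum-disable flags are set in the CPL.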
*/ ctrl1 = 0; MPASS(needs_l4_csum(m0)); /* VLAN tag insertion */ if (needs_vlan_insertion(m0)) { ctrl1 |= F_TXPKT_VLAN_VLD | V_TXPKT_VLAN(m0->m_pkthdr.ether_vtag); } /* CPL header */ cpl->ctrl0 = cst->ctrl0; cpl->pack = 0; cpl->len = htobe16(pktlen); cpl->ctrl1 = htobe64(ctrl1); /* Copy Ethernet, IP & TCP/UDP hdrs as immediate data */ p = (uintptr_t)(cpl + 1); m_copydata(m0, 0, immhdrs, (void *)p); /* SGL */ dst = (void *)(cpl + 1); if (nsegs > 0) { int i, pad; /* zero-pad upto next 16Byte boundary, if not 16Byte aligned */ p += immhdrs; pad = 16 - (immhdrs & 0xf); bzero((void *)p, pad); usgl = (void *)(p + pad); usgl->cmd_nsge = htobe32(V_ULPTX_CMD(ULP_TX_SC_DSGL) | V_ULPTX_NSGE(nsegs)); sglist_init(&sg, nitems(segs), segs); for (; m0 != NULL; m0 = m0->m_next) { if (__predict_false(m0->m_len == 0)) continue; if (immhdrs >= m0->m_len) { immhdrs -= m0->m_len; continue; } sglist_append(&sg, mtod(m0, char *) + immhdrs, m0->m_len - immhdrs); immhdrs = 0; } MPASS(sg.sg_nseg == nsegs); /* * Zero pad last 8B in case the WR doesn't end on a 16B * boundary. */ *(uint64_t *)((char *)wr + len16 * 16 - 8) = 0; usgl->len0 = htobe32(segs[0].ss_len); usgl->addr0 = htobe64(segs[0].ss_paddr); for (i = 0; i < nsegs - 1; i++) { usgl->sge[i / 2].len[i & 1] = htobe32(segs[i + 1].ss_len); usgl->sge[i / 2].addr[i & 1] = htobe64(segs[i + 1].ss_paddr); } if (i & 1) usgl->sge[i / 2].len[1] = htobe32(0); } } static void ethofld_tx(struct cxgbe_snd_tag *cst) { struct mbuf *m; struct wrq_cookie cookie; int next_credits, compl; struct fw_eth_tx_eo_wr *wr; mtx_assert(&cst->lock, MA_OWNED); while ((m = mbufq_first(&cst->pending_tx)) != NULL) { M_ASSERTPKTHDR(m); /* How many len16 credits do we need to send this mbuf. */ next_credits = mbuf_eo_len16(m); MPASS(next_credits > 0); if (next_credits > cst->tx_credits) { /* * Tx will make progress eventually because there is at * least one outstanding fw4_ack that will return * credits and kick the tx. */ MPASS(cst->ncompl > 0); return; } wr = start_wrq_wr(cst->eo_txq, next_credits, &cookie); if (__predict_false(wr == NULL)) { /* XXX: wishful thinking, not a real assertion. 
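 * If the WR queue is full, bail out here and rely on an outstanding
 * fw4_ack completion to restart transmission later.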
*/ MPASS(cst->ncompl > 0); return; } cst->tx_credits -= next_credits; cst->tx_nocompl += next_credits; compl = cst->ncompl == 0 || cst->tx_nocompl >= cst->tx_total / 2; ETHER_BPF_MTAP(cst->com.ifp, m); write_ethofld_wr(cst, wr, m, compl); commit_wrq_wr(cst->eo_txq, wr, &cookie); if (compl) { cst->ncompl++; cst->tx_nocompl = 0; } (void) mbufq_dequeue(&cst->pending_tx); mbufq_enqueue(&cst->pending_fwack, m); } } int ethofld_transmit(struct ifnet *ifp, struct mbuf *m0) { struct cxgbe_snd_tag *cst; int rc; MPASS(m0->m_nextpkt == NULL); MPASS(m0->m_pkthdr.snd_tag != NULL); cst = mst_to_cst(m0->m_pkthdr.snd_tag); mtx_lock(&cst->lock); MPASS(cst->flags & EO_SND_TAG_REF); if (__predict_false(cst->flags & EO_FLOWC_PENDING)) { struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; const uint32_t rss_mask = vi->rss_size - 1; uint32_t rss_hash; cst->eo_txq = &sc->sge.ofld_txq[vi->first_ofld_txq]; if (M_HASHTYPE_ISHASH(m0)) rss_hash = m0->m_pkthdr.flowid; else rss_hash = arc4random(); /* We assume RSS hashing */ cst->iqid = vi->rss[rss_hash & rss_mask]; cst->eo_txq += rss_hash % vi->nofldtxq; rc = send_etid_flowc_wr(cst, pi, vi); if (rc != 0) goto done; } if (__predict_false(cst->plen + m0->m_pkthdr.len > eo_max_backlog)) { rc = ENOBUFS; goto done; } mbufq_enqueue(&cst->pending_tx, m0); cst->plen += m0->m_pkthdr.len; ethofld_tx(cst); rc = 0; done: mtx_unlock(&cst->lock); if (__predict_false(rc != 0)) m_freem(m0); return (rc); } static int ethofld_fw4_ack(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m0) { struct adapter *sc = iq->adapter; const struct cpl_fw4_ack *cpl = (const void *)(rss + 1); struct mbuf *m; u_int etid = G_CPL_FW4_ACK_FLOWID(be32toh(OPCODE_TID(cpl))); struct cxgbe_snd_tag *cst; uint8_t credits = cpl->credits; cst = lookup_etid(sc, etid); mtx_lock(&cst->lock); if (__predict_false(cst->flags & EO_FLOWC_RPL_PENDING)) { MPASS(credits >= ETID_FLOWC_LEN16); credits -= ETID_FLOWC_LEN16; cst->flags &= ~EO_FLOWC_RPL_PENDING; } KASSERT(cst->ncompl > 0, ("%s: etid %u (%p) wasn't expecting completion.", __func__, etid, cst)); cst->ncompl--; while (credits > 0) { m = mbufq_dequeue(&cst->pending_fwack); if (__predict_false(m == NULL)) { /* * The remaining credits are for the final flush that * was issued when the tag was freed by the kernel. */ MPASS((cst->flags & (EO_FLUSH_RPL_PENDING | EO_SND_TAG_REF)) == EO_FLUSH_RPL_PENDING); MPASS(credits == ETID_FLUSH_LEN16); MPASS(cst->tx_credits + cpl->credits == cst->tx_total); MPASS(cst->ncompl == 0); cst->flags &= ~EO_FLUSH_RPL_PENDING; cst->tx_credits += cpl->credits; freetag: cxgbe_snd_tag_free_locked(cst); return (0); /* cst is gone. 
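 * Do not touch cst (or its lock) after the free above.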
*/ } KASSERT(m != NULL, ("%s: too many credits (%u, %u)", __func__, cpl->credits, credits)); KASSERT(credits >= mbuf_eo_len16(m), ("%s: too few credits (%u, %u, %u)", __func__, cpl->credits, credits, mbuf_eo_len16(m))); credits -= mbuf_eo_len16(m); cst->plen -= m->m_pkthdr.len; m_freem(m); } cst->tx_credits += cpl->credits; MPASS(cst->tx_credits <= cst->tx_total); m = mbufq_first(&cst->pending_tx); if (m != NULL && cst->tx_credits >= mbuf_eo_len16(m)) ethofld_tx(cst); if (__predict_false((cst->flags & EO_SND_TAG_REF) == 0) && cst->ncompl == 0) { if (cst->tx_credits == cst->tx_total) goto freetag; else { MPASS((cst->flags & EO_FLUSH_RPL_PENDING) == 0); send_etid_flush_wr(cst); } } mtx_unlock(&cst->lock); return (0); } #endif Index: user/ngie/bug-237403/sys/dev/gpio/gpioc.c =================================================================== --- user/ngie/bug-237403/sys/dev/gpio/gpioc.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/gpio/gpioc.c (revision 346926) @@ -1,231 +1,231 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2009 Oleksandr Tymoshenko * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include "gpio_if.h" #include "gpiobus_if.h" #undef GPIOC_DEBUG #ifdef GPIOC_DEBUG #define dprintf printf #else #define dprintf(x, arg...) 
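/* Without GPIOC_DEBUG, dprintf() expands to nothing. */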
#endif static int gpioc_probe(device_t dev); static int gpioc_attach(device_t dev); static int gpioc_detach(device_t dev); static d_ioctl_t gpioc_ioctl; static struct cdevsw gpioc_cdevsw = { .d_version = D_VERSION, .d_ioctl = gpioc_ioctl, .d_name = "gpioc", }; struct gpioc_softc { device_t sc_dev; /* gpiocX dev */ device_t sc_pdev; /* gpioX dev */ struct cdev *sc_ctl_dev; /* controller device */ int sc_unit; }; static int gpioc_probe(device_t dev) { device_set_desc(dev, "GPIO controller"); return (0); } static int gpioc_attach(device_t dev) { int err; struct gpioc_softc *sc; struct make_dev_args devargs; sc = device_get_softc(dev); sc->sc_dev = dev; sc->sc_pdev = device_get_parent(dev); sc->sc_unit = device_get_unit(dev); make_dev_args_init(&devargs); devargs.mda_devsw = &gpioc_cdevsw; devargs.mda_uid = UID_ROOT; devargs.mda_gid = GID_WHEEL; devargs.mda_mode = 0600; devargs.mda_si_drv1 = sc; err = make_dev_s(&devargs, &sc->sc_ctl_dev, "gpioc%d", sc->sc_unit); if (err != 0) { printf("Failed to create gpioc%d", sc->sc_unit); return (ENXIO); } return (0); } static int gpioc_detach(device_t dev) { struct gpioc_softc *sc = device_get_softc(dev); int err; if (sc->sc_ctl_dev) destroy_dev(sc->sc_ctl_dev); if ((err = bus_generic_detach(dev)) != 0) return (err); return (0); } static int gpioc_ioctl(struct cdev *cdev, u_long cmd, caddr_t arg, int fflag, struct thread *td) { device_t bus; int max_pin, res; struct gpioc_softc *sc = cdev->si_drv1; struct gpio_pin pin; struct gpio_req req; struct gpio_access_32 *a32; struct gpio_config_32 *c32; uint32_t caps; bus = GPIO_GET_BUS(sc->sc_pdev); if (bus == NULL) return (EINVAL); switch (cmd) { case GPIOMAXPIN: max_pin = -1; res = GPIO_PIN_MAX(sc->sc_pdev, &max_pin); bcopy(&max_pin, arg, sizeof(max_pin)); break; case GPIOGETCONFIG: bcopy(arg, &pin, sizeof(pin)); dprintf("get config pin %d\n", pin.gp_pin); res = GPIO_PIN_GETFLAGS(sc->sc_pdev, pin.gp_pin, &pin.gp_flags); /* Fail early */ if (res) break; GPIO_PIN_GETCAPS(sc->sc_pdev, pin.gp_pin, &pin.gp_caps); GPIOBUS_PIN_GETNAME(bus, pin.gp_pin, pin.gp_name); bcopy(&pin, arg, sizeof(pin)); break; case GPIOSETCONFIG: bcopy(arg, &pin, sizeof(pin)); dprintf("set config pin %d\n", pin.gp_pin); res = GPIO_PIN_GETCAPS(sc->sc_pdev, pin.gp_pin, &caps); if (res == 0) res = gpio_check_flags(caps, pin.gp_flags); if (res == 0) res = GPIO_PIN_SETFLAGS(sc->sc_pdev, pin.gp_pin, pin.gp_flags); break; case GPIOGET: bcopy(arg, &req, sizeof(req)); res = GPIO_PIN_GET(sc->sc_pdev, req.gp_pin, &req.gp_value); dprintf("read pin %d -> %d\n", req.gp_pin, req.gp_value); bcopy(&req, arg, sizeof(req)); break; case GPIOSET: bcopy(arg, &req, sizeof(req)); res = GPIO_PIN_SET(sc->sc_pdev, req.gp_pin, req.gp_value); dprintf("write pin %d -> %d\n", req.gp_pin, req.gp_value); break; case GPIOTOGGLE: bcopy(arg, &req, sizeof(req)); dprintf("toggle pin %d\n", req.gp_pin); res = GPIO_PIN_TOGGLE(sc->sc_pdev, req.gp_pin); break; case GPIOSETNAME: bcopy(arg, &pin, sizeof(pin)); dprintf("set name on pin %d\n", pin.gp_pin); res = GPIOBUS_PIN_SETNAME(bus, pin.gp_pin, pin.gp_name); break; case GPIOACCESS32: a32 = (struct gpio_access_32 *)arg; res = GPIO_PIN_ACCESS_32(sc->sc_pdev, a32->first_pin, - a32->clear_pins, a32->orig_pins, &a32->orig_pins); + a32->clear_pins, a32->change_pins, &a32->orig_pins); break; case GPIOCONFIG32: c32 = (struct gpio_config_32 *)arg; res = GPIO_PIN_CONFIG_32(sc->sc_pdev, c32->first_pin, c32->num_pins, c32->pin_flags); break; default: return (ENOTTY); break; } return (res); } static device_method_t gpioc_methods[] = { /* Device 
interface */ DEVMETHOD(device_probe, gpioc_probe), DEVMETHOD(device_attach, gpioc_attach), DEVMETHOD(device_detach, gpioc_detach), DEVMETHOD(device_shutdown, bus_generic_shutdown), DEVMETHOD(device_suspend, bus_generic_suspend), DEVMETHOD(device_resume, bus_generic_resume), DEVMETHOD_END }; driver_t gpioc_driver = { "gpioc", gpioc_methods, sizeof(struct gpioc_softc) }; devclass_t gpioc_devclass; DRIVER_MODULE(gpioc, gpio, gpioc_driver, gpioc_devclass, 0, 0); MODULE_VERSION(gpioc, 1); Index: user/ngie/bug-237403/sys/dev/isp/isp_pci.c =================================================================== --- user/ngie/bug-237403/sys/dev/isp/isp_pci.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/isp/isp_pci.c (revision 346926) @@ -1,2001 +1,2010 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2009-2018 Alexander Motin * Copyright (c) 1997-2008 by Matthew Jacob * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice immediately at the beginning of the file, without modification, * this list of conditions, and the following disclaimer. * 2. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * PCI specific probe and attach routines for Qlogic ISP SCSI adapters. * FreeBSD Version. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef __sparc64__ #include #include #endif #include static uint32_t isp_pci_rd_reg(ispsoftc_t *, int); static void isp_pci_wr_reg(ispsoftc_t *, int, uint32_t); static uint32_t isp_pci_rd_reg_1080(ispsoftc_t *, int); static void isp_pci_wr_reg_1080(ispsoftc_t *, int, uint32_t); static uint32_t isp_pci_rd_reg_2400(ispsoftc_t *, int); static void isp_pci_wr_reg_2400(ispsoftc_t *, int, uint32_t); static uint32_t isp_pci_rd_reg_2600(ispsoftc_t *, int); static void isp_pci_wr_reg_2600(ispsoftc_t *, int, uint32_t); static void isp_pci_run_isr(ispsoftc_t *); static void isp_pci_run_isr_2300(ispsoftc_t *); static void isp_pci_run_isr_2400(ispsoftc_t *); static int isp_pci_mbxdma(ispsoftc_t *); static void isp_pci_mbxdmafree(ispsoftc_t *); static int isp_pci_dmasetup(ispsoftc_t *, XS_T *, void *); static int isp_pci_irqsetup(ispsoftc_t *); static void isp_pci_dumpregs(ispsoftc_t *, const char *); static struct ispmdvec mdvec = { isp_pci_run_isr, isp_pci_rd_reg, isp_pci_wr_reg, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs, NULL, BIU_BURST_ENABLE|BIU_PCI_CONF1_FIFO_64 }; static struct ispmdvec mdvec_1080 = { isp_pci_run_isr, isp_pci_rd_reg_1080, isp_pci_wr_reg_1080, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs, NULL, BIU_BURST_ENABLE|BIU_PCI_CONF1_FIFO_64 }; static struct ispmdvec mdvec_12160 = { isp_pci_run_isr, isp_pci_rd_reg_1080, isp_pci_wr_reg_1080, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs, NULL, BIU_BURST_ENABLE|BIU_PCI_CONF1_FIFO_64 }; static struct ispmdvec mdvec_2100 = { isp_pci_run_isr, isp_pci_rd_reg, isp_pci_wr_reg, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs }; static struct ispmdvec mdvec_2200 = { isp_pci_run_isr, isp_pci_rd_reg, isp_pci_wr_reg, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs }; static struct ispmdvec mdvec_2300 = { isp_pci_run_isr_2300, isp_pci_rd_reg, isp_pci_wr_reg, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, isp_pci_dumpregs }; static struct ispmdvec mdvec_2400 = { isp_pci_run_isr_2400, isp_pci_rd_reg_2400, isp_pci_wr_reg_2400, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, NULL }; static struct ispmdvec mdvec_2500 = { isp_pci_run_isr_2400, isp_pci_rd_reg_2400, isp_pci_wr_reg_2400, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, NULL }; static struct ispmdvec mdvec_2600 = { isp_pci_run_isr_2400, isp_pci_rd_reg_2600, isp_pci_wr_reg_2600, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, NULL }; static struct ispmdvec mdvec_2700 = { isp_pci_run_isr_2400, isp_pci_rd_reg_2600, isp_pci_wr_reg_2600, isp_pci_mbxdma, isp_pci_dmasetup, isp_common_dmateardown, isp_pci_irqsetup, NULL }; #ifndef PCIM_CMD_INVEN #define PCIM_CMD_INVEN 0x10 #endif #ifndef PCIM_CMD_BUSMASTEREN #define PCIM_CMD_BUSMASTEREN 0x0004 #endif #ifndef PCIM_CMD_PERRESPEN #define PCIM_CMD_PERRESPEN 0x0040 #endif #ifndef PCIM_CMD_SEREN #define PCIM_CMD_SEREN 0x0100 #endif #ifndef PCIM_CMD_INTX_DISABLE #define PCIM_CMD_INTX_DISABLE 0x0400 #endif #ifndef PCIR_COMMAND #define PCIR_COMMAND 0x04 #endif #ifndef PCIR_CACHELNSZ #define PCIR_CACHELNSZ 0x0c #endif #ifndef PCIR_LATTIMER #define 
PCIR_LATTIMER 0x0d #endif #ifndef PCIR_ROMADDR #define PCIR_ROMADDR 0x30 #endif #define PCI_VENDOR_QLOGIC 0x1077 #define PCI_PRODUCT_QLOGIC_ISP1020 0x1020 #define PCI_PRODUCT_QLOGIC_ISP1080 0x1080 #define PCI_PRODUCT_QLOGIC_ISP10160 0x1016 #define PCI_PRODUCT_QLOGIC_ISP12160 0x1216 #define PCI_PRODUCT_QLOGIC_ISP1240 0x1240 #define PCI_PRODUCT_QLOGIC_ISP1280 0x1280 #define PCI_PRODUCT_QLOGIC_ISP2100 0x2100 #define PCI_PRODUCT_QLOGIC_ISP2200 0x2200 #define PCI_PRODUCT_QLOGIC_ISP2300 0x2300 #define PCI_PRODUCT_QLOGIC_ISP2312 0x2312 #define PCI_PRODUCT_QLOGIC_ISP2322 0x2322 #define PCI_PRODUCT_QLOGIC_ISP2422 0x2422 #define PCI_PRODUCT_QLOGIC_ISP2432 0x2432 #define PCI_PRODUCT_QLOGIC_ISP2532 0x2532 #define PCI_PRODUCT_QLOGIC_ISP5432 0x5432 #define PCI_PRODUCT_QLOGIC_ISP6312 0x6312 #define PCI_PRODUCT_QLOGIC_ISP6322 0x6322 #define PCI_PRODUCT_QLOGIC_ISP2031 0x2031 #define PCI_PRODUCT_QLOGIC_ISP8031 0x8031 #define PCI_PRODUCT_QLOGIC_ISP2684 0x2171 #define PCI_PRODUCT_QLOGIC_ISP2692 0x2b61 #define PCI_PRODUCT_QLOGIC_ISP2714 0x2071 #define PCI_PRODUCT_QLOGIC_ISP2722 0x2261 #define PCI_QLOGIC_ISP1020 \ ((PCI_PRODUCT_QLOGIC_ISP1020 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP1080 \ ((PCI_PRODUCT_QLOGIC_ISP1080 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP10160 \ ((PCI_PRODUCT_QLOGIC_ISP10160 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP12160 \ ((PCI_PRODUCT_QLOGIC_ISP12160 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP1240 \ ((PCI_PRODUCT_QLOGIC_ISP1240 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP1280 \ ((PCI_PRODUCT_QLOGIC_ISP1280 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2100 \ ((PCI_PRODUCT_QLOGIC_ISP2100 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2200 \ ((PCI_PRODUCT_QLOGIC_ISP2200 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2300 \ ((PCI_PRODUCT_QLOGIC_ISP2300 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2312 \ ((PCI_PRODUCT_QLOGIC_ISP2312 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2322 \ ((PCI_PRODUCT_QLOGIC_ISP2322 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2422 \ ((PCI_PRODUCT_QLOGIC_ISP2422 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2432 \ ((PCI_PRODUCT_QLOGIC_ISP2432 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2532 \ ((PCI_PRODUCT_QLOGIC_ISP2532 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP5432 \ ((PCI_PRODUCT_QLOGIC_ISP5432 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP6312 \ ((PCI_PRODUCT_QLOGIC_ISP6312 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP6322 \ ((PCI_PRODUCT_QLOGIC_ISP6322 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2031 \ ((PCI_PRODUCT_QLOGIC_ISP2031 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP8031 \ ((PCI_PRODUCT_QLOGIC_ISP8031 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2684 \ ((PCI_PRODUCT_QLOGIC_ISP2684 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2692 \ ((PCI_PRODUCT_QLOGIC_ISP2692 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2714 \ ((PCI_PRODUCT_QLOGIC_ISP2714 << 16) | PCI_VENDOR_QLOGIC) #define PCI_QLOGIC_ISP2722 \ ((PCI_PRODUCT_QLOGIC_ISP2722 << 16) | PCI_VENDOR_QLOGIC) /* * Odd case for some AMI raid cards... We need to *not* attach to this. 
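 * isp_pci_probe() below returns ENXIO for ISP12160 devices whose PCI
 * subvendor matches AMI_RAID_SUBVENDOR_ID.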
*/ #define AMI_RAID_SUBVENDOR_ID 0x101e #define PCI_DFLT_LTNCY 0x40 #define PCI_DFLT_LNSZ 0x10 static int isp_pci_probe (device_t); static int isp_pci_attach (device_t); static int isp_pci_detach (device_t); #define ISP_PCD(isp) ((struct isp_pcisoftc *)isp)->pci_dev struct isp_pcisoftc { ispsoftc_t pci_isp; device_t pci_dev; struct resource * regs; struct resource * regs1; struct resource * regs2; struct { int iqd; struct resource * irq; void * ih; } irq[ISP_MAX_IRQS]; int rtp; int rgd; int rtp1; int rgd1; int rtp2; int rgd2; int16_t pci_poff[_NREG_BLKS]; bus_dma_tag_t dmat; int msicount; }; static device_method_t isp_pci_methods[] = { /* Device interface */ DEVMETHOD(device_probe, isp_pci_probe), DEVMETHOD(device_attach, isp_pci_attach), DEVMETHOD(device_detach, isp_pci_detach), { 0, 0 } }; static driver_t isp_pci_driver = { "isp", isp_pci_methods, sizeof (struct isp_pcisoftc) }; static devclass_t isp_devclass; DRIVER_MODULE(isp, pci, isp_pci_driver, isp_devclass, 0, 0); MODULE_DEPEND(isp, cam, 1, 1, 1); MODULE_DEPEND(isp, firmware, 1, 1, 1); static int isp_nvports = 0; static int isp_pci_probe(device_t dev) { switch ((pci_get_device(dev) << 16) | (pci_get_vendor(dev))) { case PCI_QLOGIC_ISP1020: device_set_desc(dev, "Qlogic ISP 1020/1040 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP1080: device_set_desc(dev, "Qlogic ISP 1080 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP1240: device_set_desc(dev, "Qlogic ISP 1240 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP1280: device_set_desc(dev, "Qlogic ISP 1280 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP10160: device_set_desc(dev, "Qlogic ISP 10160 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP12160: if (pci_get_subvendor(dev) == AMI_RAID_SUBVENDOR_ID) { return (ENXIO); } device_set_desc(dev, "Qlogic ISP 12160 PCI SCSI Adapter"); break; case PCI_QLOGIC_ISP2100: device_set_desc(dev, "Qlogic ISP 2100 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2200: device_set_desc(dev, "Qlogic ISP 2200 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2300: device_set_desc(dev, "Qlogic ISP 2300 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2312: device_set_desc(dev, "Qlogic ISP 2312 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2322: device_set_desc(dev, "Qlogic ISP 2322 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2422: device_set_desc(dev, "Qlogic ISP 2422 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2432: device_set_desc(dev, "Qlogic ISP 2432 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2532: device_set_desc(dev, "Qlogic ISP 2532 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP5432: device_set_desc(dev, "Qlogic ISP 5432 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP6312: device_set_desc(dev, "Qlogic ISP 6312 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP6322: device_set_desc(dev, "Qlogic ISP 6322 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP2031: device_set_desc(dev, "Qlogic ISP 2031 PCI FC-AL Adapter"); break; case PCI_QLOGIC_ISP8031: device_set_desc(dev, "Qlogic ISP 8031 PCI FCoE Adapter"); break; case PCI_QLOGIC_ISP2684: device_set_desc(dev, "Qlogic ISP 2684 PCI FC Adapter"); break; case PCI_QLOGIC_ISP2692: device_set_desc(dev, "Qlogic ISP 2692 PCI FC Adapter"); break; case PCI_QLOGIC_ISP2714: device_set_desc(dev, "Qlogic ISP 2714 PCI FC Adapter"); break; case PCI_QLOGIC_ISP2722: device_set_desc(dev, "Qlogic ISP 2722 PCI FC Adapter"); break; default: return (ENXIO); } if (isp_announced == 0 && bootverbose) { printf("Qlogic ISP Driver, FreeBSD Version %d.%d, " "Core Version %d.%d\n", ISP_PLATFORM_VERSION_MAJOR, ISP_PLATFORM_VERSION_MINOR, 
ISP_CORE_VERSION_MAJOR, ISP_CORE_VERSION_MINOR); isp_announced++; } /* * XXXX: Here is where we might load the f/w module * XXXX: (or increase a reference count to it). */ return (BUS_PROBE_DEFAULT); } static void isp_get_generic_options(device_t dev, ispsoftc_t *isp) { int tval; tval = 0; if (resource_int_value(device_get_name(dev), device_get_unit(dev), "fwload_disable", &tval) == 0 && tval != 0) { isp->isp_confopts |= ISP_CFG_NORELOAD; } tval = 0; if (resource_int_value(device_get_name(dev), device_get_unit(dev), "ignore_nvram", &tval) == 0 && tval != 0) { isp->isp_confopts |= ISP_CFG_NONVRAM; } tval = 0; (void) resource_int_value(device_get_name(dev), device_get_unit(dev), "debug", &tval); if (tval) { isp->isp_dblev = tval; } else { isp->isp_dblev = ISP_LOGWARN|ISP_LOGERR; } if (bootverbose) { isp->isp_dblev |= ISP_LOGCONFIG|ISP_LOGINFO; } tval = -1; (void) resource_int_value(device_get_name(dev), device_get_unit(dev), "vports", &tval); if (tval > 0 && tval <= 254) { isp_nvports = tval; } tval = 7; (void) resource_int_value(device_get_name(dev), device_get_unit(dev), "quickboot_time", &tval); isp_quickboot_time = tval; } static void isp_get_specific_options(device_t dev, int chan, ispsoftc_t *isp) { const char *sptr; int tval = 0; char prefix[12], name[16]; if (chan == 0) prefix[0] = 0; else snprintf(prefix, sizeof(prefix), "chan%d.", chan); snprintf(name, sizeof(name), "%siid", prefix); if (resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval)) { if (IS_FC(isp)) { ISP_FC_PC(isp, chan)->default_id = 109 - chan; } else { #ifdef __sparc64__ ISP_SPI_PC(isp, chan)->iid = OF_getscsinitid(dev); #else ISP_SPI_PC(isp, chan)->iid = 7; #endif } } else { if (IS_FC(isp)) { ISP_FC_PC(isp, chan)->default_id = tval - chan; } else { ISP_SPI_PC(isp, chan)->iid = tval; } isp->isp_confopts |= ISP_CFG_OWNLOOPID; } if (IS_SCSI(isp)) return; tval = -1; snprintf(name, sizeof(name), "%srole", prefix); if (resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval) == 0) { switch (tval) { case ISP_ROLE_NONE: case ISP_ROLE_INITIATOR: case ISP_ROLE_TARGET: case ISP_ROLE_BOTH: device_printf(dev, "Chan %d setting role to 0x%x\n", chan, tval); break; default: tval = -1; break; } } if (tval == -1) { tval = ISP_DEFAULT_ROLES; } ISP_FC_PC(isp, chan)->def_role = tval; tval = 0; snprintf(name, sizeof(name), "%sfullduplex", prefix); if (resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval) == 0 && tval != 0) { isp->isp_confopts |= ISP_CFG_FULL_DUPLEX; } sptr = NULL; snprintf(name, sizeof(name), "%stopology", prefix); if (resource_string_value(device_get_name(dev), device_get_unit(dev), name, (const char **) &sptr) == 0 && sptr != NULL) { if (strcmp(sptr, "lport") == 0) { isp->isp_confopts |= ISP_CFG_LPORT; } else if (strcmp(sptr, "nport") == 0) { isp->isp_confopts |= ISP_CFG_NPORT; } else if (strcmp(sptr, "lport-only") == 0) { isp->isp_confopts |= ISP_CFG_LPORT_ONLY; } else if (strcmp(sptr, "nport-only") == 0) { isp->isp_confopts |= ISP_CFG_NPORT_ONLY; } } #ifdef ISP_FCTAPE_OFF isp->isp_confopts |= ISP_CFG_NOFCTAPE; #else isp->isp_confopts |= ISP_CFG_FCTAPE; #endif tval = 0; snprintf(name, sizeof(name), "%snofctape", prefix); (void) resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval); if (tval) { isp->isp_confopts &= ~ISP_CFG_FCTAPE; isp->isp_confopts |= ISP_CFG_NOFCTAPE; } tval = 0; snprintf(name, sizeof(name), "%sfctape", prefix); (void) resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval); if (tval) { 
isp->isp_confopts &= ~ISP_CFG_NOFCTAPE; isp->isp_confopts |= ISP_CFG_FCTAPE; } /* * Because the resource_*_value functions can neither return * 64 bit integer values, nor can they be directly coerced * to interpret the right hand side of the assignment as * you want them to interpret it, we have to force WWN * hint replacement to specify WWN strings with a leading * 'w' (e..g w50000000aaaa0001). Sigh. */ sptr = NULL; snprintf(name, sizeof(name), "%sportwwn", prefix); tval = resource_string_value(device_get_name(dev), device_get_unit(dev), name, (const char **) &sptr); if (tval == 0 && sptr != NULL && *sptr++ == 'w') { char *eptr = NULL; ISP_FC_PC(isp, chan)->def_wwpn = strtouq(sptr, &eptr, 16); if (eptr < sptr + 16 || ISP_FC_PC(isp, chan)->def_wwpn == -1) { device_printf(dev, "mangled portwwn hint '%s'\n", sptr); ISP_FC_PC(isp, chan)->def_wwpn = 0; } } sptr = NULL; snprintf(name, sizeof(name), "%snodewwn", prefix); tval = resource_string_value(device_get_name(dev), device_get_unit(dev), name, (const char **) &sptr); if (tval == 0 && sptr != NULL && *sptr++ == 'w') { char *eptr = NULL; ISP_FC_PC(isp, chan)->def_wwnn = strtouq(sptr, &eptr, 16); if (eptr < sptr + 16 || ISP_FC_PC(isp, chan)->def_wwnn == 0) { device_printf(dev, "mangled nodewwn hint '%s'\n", sptr); ISP_FC_PC(isp, chan)->def_wwnn = 0; } } tval = -1; snprintf(name, sizeof(name), "%sloop_down_limit", prefix); (void) resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval); if (tval >= 0 && tval < 0xffff) { ISP_FC_PC(isp, chan)->loop_down_limit = tval; } else { ISP_FC_PC(isp, chan)->loop_down_limit = isp_loop_down_limit; } tval = -1; snprintf(name, sizeof(name), "%sgone_device_time", prefix); (void) resource_int_value(device_get_name(dev), device_get_unit(dev), name, &tval); if (tval >= 0 && tval < 0xffff) { ISP_FC_PC(isp, chan)->gone_device_time = tval; } else { ISP_FC_PC(isp, chan)->gone_device_time = isp_gone_device_time; } } static int isp_pci_attach(device_t dev) { struct isp_pcisoftc *pcs = device_get_softc(dev); ispsoftc_t *isp = &pcs->pci_isp; int i; uint32_t data, cmd, linesz, did; size_t psize, xsize; char fwname[32]; pcs->pci_dev = dev; isp->isp_dev = dev; isp->isp_nchan = 1; mtx_init(&isp->isp_lock, "isp", NULL, MTX_DEF); /* * Get Generic Options */ isp_nvports = 0; isp_get_generic_options(dev, isp); linesz = PCI_DFLT_LNSZ; pcs->regs = pcs->regs2 = NULL; pcs->rgd = pcs->rtp = 0; pcs->pci_dev = dev; pcs->pci_poff[BIU_BLOCK >> _BLK_REG_SHFT] = BIU_REGS_OFF; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS_OFF; pcs->pci_poff[SXP_BLOCK >> _BLK_REG_SHFT] = PCI_SXP_REGS_OFF; pcs->pci_poff[RISC_BLOCK >> _BLK_REG_SHFT] = PCI_RISC_REGS_OFF; pcs->pci_poff[DMA_BLOCK >> _BLK_REG_SHFT] = DMA_REGS_OFF; switch (pci_get_devid(dev)) { case PCI_QLOGIC_ISP1020: did = 0x1040; isp->isp_mdvec = &mdvec; isp->isp_type = ISP_HA_SCSI_UNKNOWN; break; case PCI_QLOGIC_ISP1080: did = 0x1080; isp->isp_mdvec = &mdvec_1080; isp->isp_type = ISP_HA_SCSI_1080; pcs->pci_poff[DMA_BLOCK >> _BLK_REG_SHFT] = ISP1080_DMA_REGS_OFF; break; case PCI_QLOGIC_ISP1240: did = 0x1080; isp->isp_mdvec = &mdvec_1080; isp->isp_type = ISP_HA_SCSI_1240; isp->isp_nchan = 2; pcs->pci_poff[DMA_BLOCK >> _BLK_REG_SHFT] = ISP1080_DMA_REGS_OFF; break; case PCI_QLOGIC_ISP1280: did = 0x1080; isp->isp_mdvec = &mdvec_1080; isp->isp_type = ISP_HA_SCSI_1280; pcs->pci_poff[DMA_BLOCK >> _BLK_REG_SHFT] = ISP1080_DMA_REGS_OFF; break; case PCI_QLOGIC_ISP10160: did = 0x12160; isp->isp_mdvec = &mdvec_12160; isp->isp_type = ISP_HA_SCSI_10160; pcs->pci_poff[DMA_BLOCK 
>> _BLK_REG_SHFT] = ISP1080_DMA_REGS_OFF; break; case PCI_QLOGIC_ISP12160: did = 0x12160; isp->isp_nchan = 2; isp->isp_mdvec = &mdvec_12160; isp->isp_type = ISP_HA_SCSI_12160; pcs->pci_poff[DMA_BLOCK >> _BLK_REG_SHFT] = ISP1080_DMA_REGS_OFF; break; case PCI_QLOGIC_ISP2100: did = 0x2100; isp->isp_mdvec = &mdvec_2100; isp->isp_type = ISP_HA_FC_2100; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2100_OFF; if (pci_get_revid(dev) < 3) { /* * XXX: Need to get the actual revision * XXX: number of the 2100 FB. At any rate, * XXX: lower cache line size for early revision * XXX; boards. */ linesz = 1; } break; case PCI_QLOGIC_ISP2200: did = 0x2200; isp->isp_mdvec = &mdvec_2200; isp->isp_type = ISP_HA_FC_2200; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2100_OFF; break; case PCI_QLOGIC_ISP2300: did = 0x2300; isp->isp_mdvec = &mdvec_2300; isp->isp_type = ISP_HA_FC_2300; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2300_OFF; break; case PCI_QLOGIC_ISP2312: case PCI_QLOGIC_ISP6312: did = 0x2300; isp->isp_mdvec = &mdvec_2300; isp->isp_type = ISP_HA_FC_2312; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2300_OFF; break; case PCI_QLOGIC_ISP2322: case PCI_QLOGIC_ISP6322: did = 0x2322; isp->isp_mdvec = &mdvec_2300; isp->isp_type = ISP_HA_FC_2322; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2300_OFF; break; case PCI_QLOGIC_ISP2422: case PCI_QLOGIC_ISP2432: did = 0x2400; isp->isp_nchan += isp_nvports; isp->isp_mdvec = &mdvec_2400; isp->isp_type = ISP_HA_FC_2400; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2400_OFF; break; case PCI_QLOGIC_ISP2532: did = 0x2500; isp->isp_nchan += isp_nvports; isp->isp_mdvec = &mdvec_2500; isp->isp_type = ISP_HA_FC_2500; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2400_OFF; break; case PCI_QLOGIC_ISP5432: did = 0x2500; isp->isp_mdvec = &mdvec_2500; isp->isp_type = ISP_HA_FC_2500; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2400_OFF; break; case PCI_QLOGIC_ISP2031: case PCI_QLOGIC_ISP8031: did = 0x2600; isp->isp_nchan += isp_nvports; isp->isp_mdvec = &mdvec_2600; isp->isp_type = ISP_HA_FC_2600; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2400_OFF; break; case PCI_QLOGIC_ISP2684: case PCI_QLOGIC_ISP2692: case PCI_QLOGIC_ISP2714: case PCI_QLOGIC_ISP2722: did = 0x2700; isp->isp_nchan += isp_nvports; isp->isp_mdvec = &mdvec_2700; isp->isp_type = ISP_HA_FC_2700; pcs->pci_poff[MBOX_BLOCK >> _BLK_REG_SHFT] = PCI_MBOX_REGS2400_OFF; break; default: device_printf(dev, "unknown device type\n"); goto bad; break; } isp->isp_revision = pci_get_revid(dev); if (IS_26XX(isp)) { pcs->rtp = SYS_RES_MEMORY; pcs->rgd = PCIR_BAR(0); pcs->regs = bus_alloc_resource_any(dev, pcs->rtp, &pcs->rgd, RF_ACTIVE); pcs->rtp1 = SYS_RES_MEMORY; pcs->rgd1 = PCIR_BAR(2); pcs->regs1 = bus_alloc_resource_any(dev, pcs->rtp1, &pcs->rgd1, RF_ACTIVE); pcs->rtp2 = SYS_RES_MEMORY; pcs->rgd2 = PCIR_BAR(4); pcs->regs2 = bus_alloc_resource_any(dev, pcs->rtp2, &pcs->rgd2, RF_ACTIVE); } else { pcs->rtp = SYS_RES_MEMORY; pcs->rgd = PCIR_BAR(1); pcs->regs = bus_alloc_resource_any(dev, pcs->rtp, &pcs->rgd, RF_ACTIVE); if (pcs->regs == NULL) { pcs->rtp = SYS_RES_IOPORT; pcs->rgd = PCIR_BAR(0); pcs->regs = bus_alloc_resource_any(dev, pcs->rtp, &pcs->rgd, RF_ACTIVE); } } if (pcs->regs == NULL) { device_printf(dev, "Unable to map any ports\n"); goto bad; } if (bootverbose) { device_printf(dev, "Using %s space register mapping\n", (pcs->rtp == SYS_RES_IOPORT)? 
"I/O" : "Memory"); } isp->isp_regs = pcs->regs; isp->isp_regs2 = pcs->regs2; if (IS_FC(isp)) { psize = sizeof (fcparam); xsize = sizeof (struct isp_fc); } else { psize = sizeof (sdparam); xsize = sizeof (struct isp_spi); } psize *= isp->isp_nchan; xsize *= isp->isp_nchan; isp->isp_param = malloc(psize, M_DEVBUF, M_NOWAIT | M_ZERO); if (isp->isp_param == NULL) { device_printf(dev, "cannot allocate parameter data\n"); goto bad; } isp->isp_osinfo.pc.ptr = malloc(xsize, M_DEVBUF, M_NOWAIT | M_ZERO); if (isp->isp_osinfo.pc.ptr == NULL) { device_printf(dev, "cannot allocate parameter data\n"); goto bad; } /* * Now that we know who we are (roughly) get/set specific options */ for (i = 0; i < isp->isp_nchan; i++) { isp_get_specific_options(dev, i, isp); } isp->isp_osinfo.fw = NULL; if (isp->isp_osinfo.fw == NULL) { snprintf(fwname, sizeof (fwname), "isp_%04x", did); isp->isp_osinfo.fw = firmware_get(fwname); } if (isp->isp_osinfo.fw != NULL) { isp_prt(isp, ISP_LOGCONFIG, "loaded firmware %s", fwname); isp->isp_mdvec->dv_ispfw = isp->isp_osinfo.fw->data; } /* * Make sure that SERR, PERR, WRITE INVALIDATE and BUSMASTER are set. */ cmd = pci_read_config(dev, PCIR_COMMAND, 2); cmd |= PCIM_CMD_SEREN | PCIM_CMD_PERRESPEN | PCIM_CMD_BUSMASTEREN | PCIM_CMD_INVEN; if (IS_2300(isp)) { /* per QLogic errata */ cmd &= ~PCIM_CMD_INVEN; } if (IS_2322(isp) || pci_get_devid(dev) == PCI_QLOGIC_ISP6312) { cmd &= ~PCIM_CMD_INTX_DISABLE; } if (IS_24XX(isp)) { cmd &= ~PCIM_CMD_INTX_DISABLE; } pci_write_config(dev, PCIR_COMMAND, cmd, 2); /* * Make sure the Cache Line Size register is set sensibly. */ data = pci_read_config(dev, PCIR_CACHELNSZ, 1); if (data == 0 || (linesz != PCI_DFLT_LNSZ && data != linesz)) { isp_prt(isp, ISP_LOGDEBUG0, "set PCI line size to %d from %d", linesz, data); data = linesz; pci_write_config(dev, PCIR_CACHELNSZ, data, 1); } /* * Make sure the Latency Timer is sane. */ data = pci_read_config(dev, PCIR_LATTIMER, 1); if (data < PCI_DFLT_LTNCY) { data = PCI_DFLT_LTNCY; isp_prt(isp, ISP_LOGDEBUG0, "set PCI latency to %d", data); pci_write_config(dev, PCIR_LATTIMER, data, 1); } /* * Make sure we've disabled the ROM. */ data = pci_read_config(dev, PCIR_ROMADDR, 4); data &= ~1; pci_write_config(dev, PCIR_ROMADDR, data, 4); /* * Last minute checks... */ if (IS_23XX(isp) || IS_24XX(isp)) { isp->isp_port = pci_get_function(dev); } /* * Make sure we're in reset state. */ ISP_LOCK(isp); if (isp_reinit(isp, 1) != 0) { ISP_UNLOCK(isp); goto bad; } ISP_UNLOCK(isp); if (isp_attach(isp)) { ISP_LOCK(isp); isp_shutdown(isp); ISP_UNLOCK(isp); goto bad; } return (0); bad: + if (isp->isp_osinfo.fw == NULL && !IS_26XX(isp)) { + /* + * Failure to attach at boot time might have been caused + * by a missing ispfw(4). Except for for 16Gb adapters, + * there's no loadable firmware for them. 
+ */ + isp_prt(isp, ISP_LOGWARN, "See the ispfw(4) man page on " + "how to load known good firmware at boot time"); + } for (i = 0; i < isp->isp_nirq; i++) { (void) bus_teardown_intr(dev, pcs->irq[i].irq, pcs->irq[i].ih); (void) bus_release_resource(dev, SYS_RES_IRQ, pcs->irq[i].iqd, pcs->irq[0].irq); } if (pcs->msicount) { pci_release_msi(dev); } if (pcs->regs) (void) bus_release_resource(dev, pcs->rtp, pcs->rgd, pcs->regs); if (pcs->regs1) (void) bus_release_resource(dev, pcs->rtp1, pcs->rgd1, pcs->regs1); if (pcs->regs2) (void) bus_release_resource(dev, pcs->rtp2, pcs->rgd2, pcs->regs2); if (pcs->pci_isp.isp_param) { free(pcs->pci_isp.isp_param, M_DEVBUF); pcs->pci_isp.isp_param = NULL; } if (pcs->pci_isp.isp_osinfo.pc.ptr) { free(pcs->pci_isp.isp_osinfo.pc.ptr, M_DEVBUF); pcs->pci_isp.isp_osinfo.pc.ptr = NULL; } mtx_destroy(&isp->isp_lock); return (ENXIO); } static int isp_pci_detach(device_t dev) { struct isp_pcisoftc *pcs = device_get_softc(dev); ispsoftc_t *isp = &pcs->pci_isp; int i, status; status = isp_detach(isp); if (status) return (status); ISP_LOCK(isp); isp_shutdown(isp); ISP_UNLOCK(isp); for (i = 0; i < isp->isp_nirq; i++) { (void) bus_teardown_intr(dev, pcs->irq[i].irq, pcs->irq[i].ih); (void) bus_release_resource(dev, SYS_RES_IRQ, pcs->irq[i].iqd, pcs->irq[i].irq); } if (pcs->msicount) pci_release_msi(dev); (void) bus_release_resource(dev, pcs->rtp, pcs->rgd, pcs->regs); if (pcs->regs1) (void) bus_release_resource(dev, pcs->rtp1, pcs->rgd1, pcs->regs1); if (pcs->regs2) (void) bus_release_resource(dev, pcs->rtp2, pcs->rgd2, pcs->regs2); isp_pci_mbxdmafree(isp); if (pcs->pci_isp.isp_param) { free(pcs->pci_isp.isp_param, M_DEVBUF); pcs->pci_isp.isp_param = NULL; } if (pcs->pci_isp.isp_osinfo.pc.ptr) { free(pcs->pci_isp.isp_osinfo.pc.ptr, M_DEVBUF); pcs->pci_isp.isp_osinfo.pc.ptr = NULL; } mtx_destroy(&isp->isp_lock); return (0); } #define IspVirt2Off(a, x) \ (((struct isp_pcisoftc *)a)->pci_poff[((x) & _BLK_REG_MASK) >> \ _BLK_REG_SHFT] + ((x) & 0xfff)) #define BXR2(isp, off) bus_read_2((isp)->isp_regs, (off)) #define BXW2(isp, off, v) bus_write_2((isp)->isp_regs, (off), (v)) #define BXR4(isp, off) bus_read_4((isp)->isp_regs, (off)) #define BXW4(isp, off, v) bus_write_4((isp)->isp_regs, (off), (v)) #define B2R4(isp, off) bus_read_4((isp)->isp_regs2, (off)) #define B2W4(isp, off, v) bus_write_4((isp)->isp_regs2, (off), (v)) static ISP_INLINE uint16_t isp_pci_rd_debounced(ispsoftc_t *isp, int off) { uint16_t val, prev; val = BXR2(isp, IspVirt2Off(isp, off)); do { prev = val; val = BXR2(isp, IspVirt2Off(isp, off)); } while (val != prev); return (val); } static void isp_pci_run_isr(ispsoftc_t *isp) { uint16_t isr, sema, info; if (IS_2100(isp)) { isr = isp_pci_rd_debounced(isp, BIU_ISR); sema = isp_pci_rd_debounced(isp, BIU_SEMA); } else { isr = BXR2(isp, IspVirt2Off(isp, BIU_ISR)); sema = BXR2(isp, IspVirt2Off(isp, BIU_SEMA)); } isp_prt(isp, ISP_LOGDEBUG3, "ISR 0x%x SEMA 0x%x", isr, sema); isr &= INT_PENDING_MASK(isp); sema &= BIU_SEMA_LOCK; if (isr == 0 && sema == 0) return; if (sema != 0) { if (IS_2100(isp)) info = isp_pci_rd_debounced(isp, OUTMAILBOX0); else info = BXR2(isp, IspVirt2Off(isp, OUTMAILBOX0)); if (info & MBOX_COMMAND_COMPLETE) isp_intr_mbox(isp, info); else isp_intr_async(isp, info); if (!IS_FC(isp) && isp->isp_state == ISP_RUNSTATE) isp_intr_respq(isp); } else isp_intr_respq(isp); ISP_WRITE(isp, HCCR, HCCR_CMD_CLEAR_RISC_INT); if (sema) ISP_WRITE(isp, BIU_SEMA, 0); } static void isp_pci_run_isr_2300(ispsoftc_t *isp) { uint32_t hccr, r2hisr; uint16_t isr, info; 
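	/* Nothing to do unless the RISC-to-host interrupt is asserted. */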
if ((BXR2(isp, IspVirt2Off(isp, BIU_ISR)) & BIU2100_ISR_RISC_INT) == 0) return; r2hisr = BXR4(isp, IspVirt2Off(isp, BIU_R2HSTSLO)); isp_prt(isp, ISP_LOGDEBUG3, "RISC2HOST ISR 0x%x", r2hisr); if ((r2hisr & BIU_R2HST_INTR) == 0) return; isr = r2hisr & BIU_R2HST_ISTAT_MASK; info = r2hisr >> 16; switch (isr) { case ISPR2HST_ROM_MBX_OK: case ISPR2HST_ROM_MBX_FAIL: case ISPR2HST_MBX_OK: case ISPR2HST_MBX_FAIL: isp_intr_mbox(isp, info); break; case ISPR2HST_ASYNC_EVENT: isp_intr_async(isp, info); break; case ISPR2HST_RIO_16: isp_intr_async(isp, ASYNC_RIO16_1); break; case ISPR2HST_FPOST: isp_intr_async(isp, ASYNC_CMD_CMPLT); break; case ISPR2HST_FPOST_CTIO: isp_intr_async(isp, ASYNC_CTIO_DONE); break; case ISPR2HST_RSPQ_UPDATE: isp_intr_respq(isp); break; default: hccr = ISP_READ(isp, HCCR); if (hccr & HCCR_PAUSE) { ISP_WRITE(isp, HCCR, HCCR_RESET); isp_prt(isp, ISP_LOGERR, "RISC paused at interrupt (%x->%x)", hccr, ISP_READ(isp, HCCR)); ISP_WRITE(isp, BIU_ICR, 0); } else { isp_prt(isp, ISP_LOGERR, "unknown interrupt 0x%x\n", r2hisr); } } ISP_WRITE(isp, HCCR, HCCR_CMD_CLEAR_RISC_INT); ISP_WRITE(isp, BIU_SEMA, 0); } static void isp_pci_run_isr_2400(ispsoftc_t *isp) { uint32_t r2hisr; uint16_t isr, info; r2hisr = BXR4(isp, IspVirt2Off(isp, BIU2400_R2HSTSLO)); isp_prt(isp, ISP_LOGDEBUG3, "RISC2HOST ISR 0x%x", r2hisr); if ((r2hisr & BIU_R2HST_INTR) == 0) return; isr = r2hisr & BIU_R2HST_ISTAT_MASK; info = (r2hisr >> 16); switch (isr) { case ISPR2HST_ROM_MBX_OK: case ISPR2HST_ROM_MBX_FAIL: case ISPR2HST_MBX_OK: case ISPR2HST_MBX_FAIL: isp_intr_mbox(isp, info); break; case ISPR2HST_ASYNC_EVENT: isp_intr_async(isp, info); break; case ISPR2HST_RSPQ_UPDATE: isp_intr_respq(isp); break; case ISPR2HST_RSPQ_UPDATE2: #ifdef ISP_TARGET_MODE case ISPR2HST_ATIO_RSPQ_UPDATE: #endif isp_intr_respq(isp); /* FALLTHROUGH */ #ifdef ISP_TARGET_MODE case ISPR2HST_ATIO_UPDATE: case ISPR2HST_ATIO_UPDATE2: isp_intr_atioq(isp); #endif break; default: isp_prt(isp, ISP_LOGERR, "unknown interrupt 0x%x\n", r2hisr); } ISP_WRITE(isp, BIU2400_HCCR, HCCR_2400_CMD_CLEAR_RISC_INT); } static uint32_t isp_pci_rd_reg(ispsoftc_t *isp, int regoff) { uint16_t rv; int oldconf = 0; if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { /* * We will assume that someone has paused the RISC processor. */ oldconf = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oldconf | BIU_PCI_CONF1_SXP); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } rv = BXR2(isp, IspVirt2Off(isp, regoff)); if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oldconf); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } return (rv); } static void isp_pci_wr_reg(ispsoftc_t *isp, int regoff, uint32_t val) { int oldconf = 0; if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { /* * We will assume that someone has paused the RISC processor. */ oldconf = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oldconf | BIU_PCI_CONF1_SXP); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } BXW2(isp, IspVirt2Off(isp, regoff), val); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, regoff), 2, -1); if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oldconf); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } } static uint32_t isp_pci_rd_reg_1080(ispsoftc_t *isp, int regoff) { uint32_t rv, oc = 0; if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { uint32_t tc; /* * We will assume that someone has paused the RISC processor. 
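 * The SXP and DMA register banks are windowed through BIU_CONF1:
 * select the bank, perform the access, then restore the saved value.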
*/ oc = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); tc = oc & ~BIU_PCI1080_CONF1_DMA; if (regoff & SXP_BANK1_SELECT) tc |= BIU_PCI1080_CONF1_SXP1; else tc |= BIU_PCI1080_CONF1_SXP0; BXW2(isp, IspVirt2Off(isp, BIU_CONF1), tc); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } else if ((regoff & _BLK_REG_MASK) == DMA_BLOCK) { oc = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oc | BIU_PCI1080_CONF1_DMA); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } rv = BXR2(isp, IspVirt2Off(isp, regoff)); if (oc) { BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oc); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } return (rv); } static void isp_pci_wr_reg_1080(ispsoftc_t *isp, int regoff, uint32_t val) { int oc = 0; if ((regoff & _BLK_REG_MASK) == SXP_BLOCK) { uint32_t tc; /* * We will assume that someone has paused the RISC processor. */ oc = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); tc = oc & ~BIU_PCI1080_CONF1_DMA; if (regoff & SXP_BANK1_SELECT) tc |= BIU_PCI1080_CONF1_SXP1; else tc |= BIU_PCI1080_CONF1_SXP0; BXW2(isp, IspVirt2Off(isp, BIU_CONF1), tc); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } else if ((regoff & _BLK_REG_MASK) == DMA_BLOCK) { oc = BXR2(isp, IspVirt2Off(isp, BIU_CONF1)); BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oc | BIU_PCI1080_CONF1_DMA); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } BXW2(isp, IspVirt2Off(isp, regoff), val); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, regoff), 2, -1); if (oc) { BXW2(isp, IspVirt2Off(isp, BIU_CONF1), oc); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, BIU_CONF1), 2, -1); } } static uint32_t isp_pci_rd_reg_2400(ispsoftc_t *isp, int regoff) { uint32_t rv; int block = regoff & _BLK_REG_MASK; switch (block) { case BIU_BLOCK: break; case MBOX_BLOCK: return (BXR2(isp, IspVirt2Off(isp, regoff))); case SXP_BLOCK: isp_prt(isp, ISP_LOGERR, "SXP_BLOCK read at 0x%x", regoff); return (0xffffffff); case RISC_BLOCK: isp_prt(isp, ISP_LOGERR, "RISC_BLOCK read at 0x%x", regoff); return (0xffffffff); case DMA_BLOCK: isp_prt(isp, ISP_LOGERR, "DMA_BLOCK read at 0x%x", regoff); return (0xffffffff); default: isp_prt(isp, ISP_LOGERR, "unknown block read at 0x%x", regoff); return (0xffffffff); } switch (regoff) { case BIU2400_FLASH_ADDR: case BIU2400_FLASH_DATA: case BIU2400_ICR: case BIU2400_ISR: case BIU2400_CSR: case BIU2400_REQINP: case BIU2400_REQOUTP: case BIU2400_RSPINP: case BIU2400_RSPOUTP: case BIU2400_PRI_REQINP: case BIU2400_PRI_REQOUTP: case BIU2400_ATIO_RSPINP: case BIU2400_ATIO_RSPOUTP: case BIU2400_HCCR: case BIU2400_GPIOD: case BIU2400_GPIOE: case BIU2400_HSEMA: rv = BXR4(isp, IspVirt2Off(isp, regoff)); break; case BIU2400_R2HSTSLO: rv = BXR4(isp, IspVirt2Off(isp, regoff)); break; case BIU2400_R2HSTSHI: rv = BXR4(isp, IspVirt2Off(isp, regoff)) >> 16; break; default: isp_prt(isp, ISP_LOGERR, "unknown register read at 0x%x", regoff); rv = 0xffffffff; break; } return (rv); } static void isp_pci_wr_reg_2400(ispsoftc_t *isp, int regoff, uint32_t val) { int block = regoff & _BLK_REG_MASK; switch (block) { case BIU_BLOCK: break; case MBOX_BLOCK: BXW2(isp, IspVirt2Off(isp, regoff), val); MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, regoff), 2, -1); return; case SXP_BLOCK: isp_prt(isp, ISP_LOGERR, "SXP_BLOCK write at 0x%x", regoff); return; case RISC_BLOCK: isp_prt(isp, ISP_LOGERR, "RISC_BLOCK write at 0x%x", regoff); return; case DMA_BLOCK: isp_prt(isp, ISP_LOGERR, "DMA_BLOCK write at 0x%x", regoff); return; default: isp_prt(isp, ISP_LOGERR, 
"unknown block write at 0x%x", regoff); break; } switch (regoff) { case BIU2400_FLASH_ADDR: case BIU2400_FLASH_DATA: case BIU2400_ICR: case BIU2400_ISR: case BIU2400_CSR: case BIU2400_REQINP: case BIU2400_REQOUTP: case BIU2400_RSPINP: case BIU2400_RSPOUTP: case BIU2400_PRI_REQINP: case BIU2400_PRI_REQOUTP: case BIU2400_ATIO_RSPINP: case BIU2400_ATIO_RSPOUTP: case BIU2400_HCCR: case BIU2400_GPIOD: case BIU2400_GPIOE: case BIU2400_HSEMA: BXW4(isp, IspVirt2Off(isp, regoff), val); #ifdef MEMORYBARRIERW if (regoff == BIU2400_REQINP || regoff == BIU2400_RSPOUTP || regoff == BIU2400_PRI_REQINP || regoff == BIU2400_ATIO_RSPOUTP) MEMORYBARRIERW(isp, SYNC_REG, IspVirt2Off(isp, regoff), 4, -1) else #endif MEMORYBARRIER(isp, SYNC_REG, IspVirt2Off(isp, regoff), 4, -1); break; default: isp_prt(isp, ISP_LOGERR, "unknown register write at 0x%x", regoff); break; } } static uint32_t isp_pci_rd_reg_2600(ispsoftc_t *isp, int regoff) { uint32_t rv; switch (regoff) { case BIU2400_PRI_REQINP: case BIU2400_PRI_REQOUTP: isp_prt(isp, ISP_LOGERR, "unknown register read at 0x%x", regoff); rv = 0xffffffff; break; case BIU2400_REQINP: rv = B2R4(isp, 0x00); break; case BIU2400_REQOUTP: rv = B2R4(isp, 0x04); break; case BIU2400_RSPINP: rv = B2R4(isp, 0x08); break; case BIU2400_RSPOUTP: rv = B2R4(isp, 0x0c); break; case BIU2400_ATIO_RSPINP: rv = B2R4(isp, 0x10); break; case BIU2400_ATIO_RSPOUTP: rv = B2R4(isp, 0x14); break; default: rv = isp_pci_rd_reg_2400(isp, regoff); break; } return (rv); } static void isp_pci_wr_reg_2600(ispsoftc_t *isp, int regoff, uint32_t val) { int off; switch (regoff) { case BIU2400_PRI_REQINP: case BIU2400_PRI_REQOUTP: isp_prt(isp, ISP_LOGERR, "unknown register write at 0x%x", regoff); return; case BIU2400_REQINP: off = 0x00; break; case BIU2400_REQOUTP: off = 0x04; break; case BIU2400_RSPINP: off = 0x08; break; case BIU2400_RSPOUTP: off = 0x0c; break; case BIU2400_ATIO_RSPINP: off = 0x10; break; case BIU2400_ATIO_RSPOUTP: off = 0x14; break; default: isp_pci_wr_reg_2400(isp, regoff, val); return; } B2W4(isp, off, val); } struct imush { bus_addr_t maddr; int error; }; static void imc(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct imush *imushp = (struct imush *) arg; if (!(imushp->error = error)) imushp->maddr = segs[0].ds_addr; } static int isp_pci_mbxdma(ispsoftc_t *isp) { caddr_t base; uint32_t len, nsegs; int i, error, cmap = 0; bus_size_t slim; /* segment size */ bus_addr_t llim; /* low limit of unavailable dma */ bus_addr_t hlim; /* high limit of unavailable dma */ struct imush im; isp_ecmd_t *ecmd; /* Already been here? If so, leave... */ if (isp->isp_xflist != NULL) return (0); if (isp->isp_rquest != NULL && isp->isp_maxcmds == 0) return (0); ISP_UNLOCK(isp); if (isp->isp_rquest != NULL) goto gotmaxcmds; hlim = BUS_SPACE_MAXADDR; if (IS_ULTRA2(isp) || IS_FC(isp) || IS_1240(isp)) { if (sizeof (bus_size_t) > 4) slim = (bus_size_t) (1ULL << 32); else slim = (bus_size_t) (1UL << 31); llim = BUS_SPACE_MAXADDR; } else { slim = (1UL << 24); llim = BUS_SPACE_MAXADDR_32BIT; } if (sizeof (bus_size_t) > 4) nsegs = ISP_NSEG64_MAX; else nsegs = ISP_NSEG_MAX; if (bus_dma_tag_create(bus_get_dma_tag(ISP_PCD(isp)), 1, slim, llim, hlim, NULL, NULL, BUS_SPACE_MAXSIZE, nsegs, slim, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.dmat)) { ISP_LOCK(isp); isp_prt(isp, ISP_LOGERR, "could not create master dma tag"); return (1); } /* * Allocate and map the request queue and a region for external * DMA addressable command/status structures (22XX and later). 
*/ len = ISP_QUEUE_SIZE(RQUEST_QUEUE_LEN(isp)); if (isp->isp_type >= ISP_HA_FC_2200) len += (N_XCMDS * XCMD_SIZE); if (bus_dma_tag_create(isp->isp_osinfo.dmat, QENTRY_LEN, slim, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, len, 1, len, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.reqdmat)) { isp_prt(isp, ISP_LOGERR, "cannot create request DMA tag"); goto bad; } if (bus_dmamem_alloc(isp->isp_osinfo.reqdmat, (void **)&base, BUS_DMA_COHERENT, &isp->isp_osinfo.reqmap) != 0) { isp_prt(isp, ISP_LOGERR, "cannot allocate request DMA memory"); bus_dma_tag_destroy(isp->isp_osinfo.reqdmat); goto bad; } isp->isp_rquest = base; im.error = 0; if (bus_dmamap_load(isp->isp_osinfo.reqdmat, isp->isp_osinfo.reqmap, base, len, imc, &im, 0) || im.error) { isp_prt(isp, ISP_LOGERR, "error loading request DMA map %d", im.error); goto bad; } isp_prt(isp, ISP_LOGDEBUG0, "request area @ 0x%jx/0x%jx", (uintmax_t)im.maddr, (uintmax_t)len); isp->isp_rquest_dma = im.maddr; base += ISP_QUEUE_SIZE(RQUEST_QUEUE_LEN(isp)); im.maddr += ISP_QUEUE_SIZE(RQUEST_QUEUE_LEN(isp)); if (isp->isp_type >= ISP_HA_FC_2200) { isp->isp_osinfo.ecmd_dma = im.maddr; isp->isp_osinfo.ecmd_free = (isp_ecmd_t *)base; isp->isp_osinfo.ecmd_base = isp->isp_osinfo.ecmd_free; for (ecmd = isp->isp_osinfo.ecmd_free; ecmd < &isp->isp_osinfo.ecmd_free[N_XCMDS]; ecmd++) { if (ecmd == &isp->isp_osinfo.ecmd_free[N_XCMDS - 1]) ecmd->next = NULL; else ecmd->next = ecmd + 1; } } /* * Allocate and map the result queue. */ len = ISP_QUEUE_SIZE(RESULT_QUEUE_LEN(isp)); if (bus_dma_tag_create(isp->isp_osinfo.dmat, QENTRY_LEN, slim, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, len, 1, len, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.respdmat)) { isp_prt(isp, ISP_LOGERR, "cannot create response DMA tag"); goto bad; } if (bus_dmamem_alloc(isp->isp_osinfo.respdmat, (void **)&base, BUS_DMA_COHERENT, &isp->isp_osinfo.respmap) != 0) { isp_prt(isp, ISP_LOGERR, "cannot allocate response DMA memory"); bus_dma_tag_destroy(isp->isp_osinfo.respdmat); goto bad; } isp->isp_result = base; im.error = 0; if (bus_dmamap_load(isp->isp_osinfo.respdmat, isp->isp_osinfo.respmap, base, len, imc, &im, 0) || im.error) { isp_prt(isp, ISP_LOGERR, "error loading response DMA map %d", im.error); goto bad; } isp_prt(isp, ISP_LOGDEBUG0, "response area @ 0x%jx/0x%jx", (uintmax_t)im.maddr, (uintmax_t)len); isp->isp_result_dma = im.maddr; #ifdef ISP_TARGET_MODE /* * Allocate and map ATIO queue on 24xx with target mode. 
*/ if (IS_24XX(isp)) { len = ISP_QUEUE_SIZE(RESULT_QUEUE_LEN(isp)); if (bus_dma_tag_create(isp->isp_osinfo.dmat, QENTRY_LEN, slim, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, len, 1, len, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.atiodmat)) { isp_prt(isp, ISP_LOGERR, "cannot create ATIO DMA tag"); goto bad; } if (bus_dmamem_alloc(isp->isp_osinfo.atiodmat, (void **)&base, BUS_DMA_COHERENT, &isp->isp_osinfo.atiomap) != 0) { isp_prt(isp, ISP_LOGERR, "cannot allocate ATIO DMA memory"); bus_dma_tag_destroy(isp->isp_osinfo.atiodmat); goto bad; } isp->isp_atioq = base; im.error = 0; if (bus_dmamap_load(isp->isp_osinfo.atiodmat, isp->isp_osinfo.atiomap, base, len, imc, &im, 0) || im.error) { isp_prt(isp, ISP_LOGERR, "error loading ATIO DMA map %d", im.error); goto bad; } isp_prt(isp, ISP_LOGDEBUG0, "ATIO area @ 0x%jx/0x%jx", (uintmax_t)im.maddr, (uintmax_t)len); isp->isp_atioq_dma = im.maddr; } #endif if (IS_FC(isp)) { if (bus_dma_tag_create(isp->isp_osinfo.dmat, 64, slim, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, 2*QENTRY_LEN, 1, 2*QENTRY_LEN, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.iocbdmat)) { goto bad; } if (bus_dmamem_alloc(isp->isp_osinfo.iocbdmat, (void **)&base, BUS_DMA_COHERENT, &isp->isp_osinfo.iocbmap) != 0) goto bad; isp->isp_iocb = base; im.error = 0; if (bus_dmamap_load(isp->isp_osinfo.iocbdmat, isp->isp_osinfo.iocbmap, base, 2*QENTRY_LEN, imc, &im, 0) || im.error) goto bad; isp->isp_iocb_dma = im.maddr; if (bus_dma_tag_create(isp->isp_osinfo.dmat, 64, slim, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, ISP_FC_SCRLEN, 1, ISP_FC_SCRLEN, 0, busdma_lock_mutex, &isp->isp_lock, &isp->isp_osinfo.scdmat)) goto bad; for (cmap = 0; cmap < isp->isp_nchan; cmap++) { struct isp_fc *fc = ISP_FC_PC(isp, cmap); if (bus_dmamem_alloc(isp->isp_osinfo.scdmat, (void **)&base, BUS_DMA_COHERENT, &fc->scmap) != 0) goto bad; FCPARAM(isp, cmap)->isp_scratch = base; im.error = 0; if (bus_dmamap_load(isp->isp_osinfo.scdmat, fc->scmap, base, ISP_FC_SCRLEN, imc, &im, 0) || im.error) { bus_dmamem_free(isp->isp_osinfo.scdmat, base, fc->scmap); FCPARAM(isp, cmap)->isp_scratch = NULL; goto bad; } FCPARAM(isp, cmap)->isp_scdma = im.maddr; if (!IS_2100(isp)) { for (i = 0; i < INITIAL_NEXUS_COUNT; i++) { struct isp_nexus *n = malloc(sizeof (struct isp_nexus), M_DEVBUF, M_NOWAIT | M_ZERO); if (n == NULL) { while (fc->nexus_free_list) { n = fc->nexus_free_list; fc->nexus_free_list = n->next; free(n, M_DEVBUF); } goto bad; } n->next = fc->nexus_free_list; fc->nexus_free_list = n; } } } } if (isp->isp_maxcmds == 0) { ISP_LOCK(isp); return (0); } gotmaxcmds: len = isp->isp_maxcmds * sizeof (struct isp_pcmd); isp->isp_osinfo.pcmd_pool = (struct isp_pcmd *) malloc(len, M_DEVBUF, M_WAITOK | M_ZERO); for (i = 0; i < isp->isp_maxcmds; i++) { struct isp_pcmd *pcmd = &isp->isp_osinfo.pcmd_pool[i]; error = bus_dmamap_create(isp->isp_osinfo.dmat, 0, &pcmd->dmap); if (error) { isp_prt(isp, ISP_LOGERR, "error %d creating per-cmd DMA maps", error); while (--i >= 0) { bus_dmamap_destroy(isp->isp_osinfo.dmat, isp->isp_osinfo.pcmd_pool[i].dmap); } goto bad; } callout_init_mtx(&pcmd->wdog, &isp->isp_lock, 0); if (i == isp->isp_maxcmds-1) pcmd->next = NULL; else pcmd->next = &isp->isp_osinfo.pcmd_pool[i+1]; } isp->isp_osinfo.pcmd_free = &isp->isp_osinfo.pcmd_pool[0]; len = sizeof (isp_hdl_t) * isp->isp_maxcmds; isp->isp_xflist = (isp_hdl_t *) malloc(len, M_DEVBUF, M_WAITOK | M_ZERO); for (len = 0; len < isp->isp_maxcmds - 1; len++) isp->isp_xflist[len].cmd = &isp->isp_xflist[len+1]; 
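/*
 * The handle array was allocated with M_ZERO, so the final entry's cmd
 * pointer is left NULL and terminates the free list threaded through the
 * cmd fields above; isp_xffree below points at its head.
 */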
isp->isp_xffree = isp->isp_xflist; ISP_LOCK(isp); return (0); bad: isp_pci_mbxdmafree(isp); ISP_LOCK(isp); return (1); } static void isp_pci_mbxdmafree(ispsoftc_t *isp) { int i; if (isp->isp_xflist != NULL) { free(isp->isp_xflist, M_DEVBUF); isp->isp_xflist = NULL; } if (isp->isp_osinfo.pcmd_pool != NULL) { for (i = 0; i < isp->isp_maxcmds; i++) { bus_dmamap_destroy(isp->isp_osinfo.dmat, isp->isp_osinfo.pcmd_pool[i].dmap); } free(isp->isp_osinfo.pcmd_pool, M_DEVBUF); isp->isp_osinfo.pcmd_pool = NULL; } if (IS_FC(isp)) { for (i = 0; i < isp->isp_nchan; i++) { struct isp_fc *fc = ISP_FC_PC(isp, i); if (FCPARAM(isp, i)->isp_scdma != 0) { bus_dmamap_unload(isp->isp_osinfo.scdmat, fc->scmap); FCPARAM(isp, i)->isp_scdma = 0; } if (FCPARAM(isp, i)->isp_scratch != NULL) { bus_dmamem_free(isp->isp_osinfo.scdmat, FCPARAM(isp, i)->isp_scratch, fc->scmap); FCPARAM(isp, i)->isp_scratch = NULL; } while (fc->nexus_free_list) { struct isp_nexus *n = fc->nexus_free_list; fc->nexus_free_list = n->next; free(n, M_DEVBUF); } } if (isp->isp_iocb_dma != 0) { bus_dma_tag_destroy(isp->isp_osinfo.scdmat); bus_dmamap_unload(isp->isp_osinfo.iocbdmat, isp->isp_osinfo.iocbmap); isp->isp_iocb_dma = 0; } if (isp->isp_iocb != NULL) { bus_dmamem_free(isp->isp_osinfo.iocbdmat, isp->isp_iocb, isp->isp_osinfo.iocbmap); bus_dma_tag_destroy(isp->isp_osinfo.iocbdmat); } } #ifdef ISP_TARGET_MODE if (IS_24XX(isp)) { if (isp->isp_atioq_dma != 0) { bus_dmamap_unload(isp->isp_osinfo.atiodmat, isp->isp_osinfo.atiomap); isp->isp_atioq_dma = 0; } if (isp->isp_atioq != NULL) { bus_dmamem_free(isp->isp_osinfo.atiodmat, isp->isp_atioq, isp->isp_osinfo.atiomap); bus_dma_tag_destroy(isp->isp_osinfo.atiodmat); isp->isp_atioq = NULL; } } #endif if (isp->isp_result_dma != 0) { bus_dmamap_unload(isp->isp_osinfo.respdmat, isp->isp_osinfo.respmap); isp->isp_result_dma = 0; } if (isp->isp_result != NULL) { bus_dmamem_free(isp->isp_osinfo.respdmat, isp->isp_result, isp->isp_osinfo.respmap); bus_dma_tag_destroy(isp->isp_osinfo.respdmat); isp->isp_result = NULL; } if (isp->isp_rquest_dma != 0) { bus_dmamap_unload(isp->isp_osinfo.reqdmat, isp->isp_osinfo.reqmap); isp->isp_rquest_dma = 0; } if (isp->isp_rquest != NULL) { bus_dmamem_free(isp->isp_osinfo.reqdmat, isp->isp_rquest, isp->isp_osinfo.reqmap); bus_dma_tag_destroy(isp->isp_osinfo.reqdmat); isp->isp_rquest = NULL; } } typedef struct { ispsoftc_t *isp; void *cmd_token; void *rq; /* original request */ int error; } mush_t; #define MUSHERR_NOQENTRIES -2 static void dma2(void *arg, bus_dma_segment_t *dm_segs, int nseg, int error) { mush_t *mp = (mush_t *) arg; ispsoftc_t *isp= mp->isp; struct ccb_scsiio *csio = mp->cmd_token; isp_ddir_t ddir; int sdir; if (error) { mp->error = error; return; } if (nseg == 0) { ddir = ISP_NOXFR; } else { if ((csio->ccb_h.flags & CAM_DIR_MASK) == CAM_DIR_IN) { ddir = ISP_FROM_DEVICE; } else { ddir = ISP_TO_DEVICE; } if ((csio->ccb_h.func_code == XPT_CONT_TARGET_IO) ^ ((csio->ccb_h.flags & CAM_DIR_MASK) == CAM_DIR_IN)) { sdir = BUS_DMASYNC_PREREAD; } else { sdir = BUS_DMASYNC_PREWRITE; } bus_dmamap_sync(isp->isp_osinfo.dmat, PISP_PCMD(csio)->dmap, sdir); } error = isp_send_cmd(isp, mp->rq, dm_segs, nseg, XS_XFRLEN(csio), ddir, (ispds64_t *)csio->req_map); switch (error) { case CMD_EAGAIN: mp->error = MUSHERR_NOQENTRIES; break; case CMD_QUEUED: break; default: mp->error = EIO; break; } } static int isp_pci_dmasetup(ispsoftc_t *isp, struct ccb_scsiio *csio, void *ff) { mush_t mush, *mp; int error; mp = &mush; mp->isp = isp; mp->cmd_token = csio; mp->rq = ff; mp->error = 
0; error = bus_dmamap_load_ccb(isp->isp_osinfo.dmat, PISP_PCMD(csio)->dmap, (union ccb *)csio, dma2, mp, 0); if (error == EINPROGRESS) { bus_dmamap_unload(isp->isp_osinfo.dmat, PISP_PCMD(csio)->dmap); mp->error = EINVAL; isp_prt(isp, ISP_LOGERR, "deferred dma allocation not supported"); } else if (error && mp->error == 0) { #ifdef DIAGNOSTIC isp_prt(isp, ISP_LOGERR, "error %d in dma mapping code", error); #endif mp->error = error; } if (mp->error) { int retval = CMD_COMPLETE; if (mp->error == MUSHERR_NOQENTRIES) { retval = CMD_EAGAIN; } else if (mp->error == EFBIG) { csio->ccb_h.status = CAM_REQ_TOO_BIG; } else if (mp->error == EINVAL) { csio->ccb_h.status = CAM_REQ_INVALID; } else { csio->ccb_h.status = CAM_UNREC_HBA_ERROR; } return (retval); } return (CMD_QUEUED); } static int isp_pci_irqsetup(ispsoftc_t *isp) { device_t dev = isp->isp_osinfo.dev; struct isp_pcisoftc *pcs = device_get_softc(dev); driver_intr_t *f; int i, max_irq; /* Allocate IRQs only once. */ if (isp->isp_nirq > 0) return (0); ISP_UNLOCK(isp); if (ISP_CAP_MSIX(isp)) { max_irq = IS_26XX(isp) ? 3 : (IS_25XX(isp) ? 2 : 0); resource_int_value(device_get_name(dev), device_get_unit(dev), "msix", &max_irq); max_irq = imin(ISP_MAX_IRQS, max_irq); pcs->msicount = imin(pci_msix_count(dev), max_irq); if (pcs->msicount > 0 && pci_alloc_msix(dev, &pcs->msicount) != 0) pcs->msicount = 0; } if (pcs->msicount == 0) { max_irq = 1; resource_int_value(device_get_name(dev), device_get_unit(dev), "msi", &max_irq); max_irq = imin(1, max_irq); pcs->msicount = imin(pci_msi_count(dev), max_irq); if (pcs->msicount > 0 && pci_alloc_msi(dev, &pcs->msicount) != 0) pcs->msicount = 0; } for (i = 0; i < MAX(1, pcs->msicount); i++) { pcs->irq[i].iqd = i + (pcs->msicount > 0); pcs->irq[i].irq = bus_alloc_resource_any(dev, SYS_RES_IRQ, &pcs->irq[i].iqd, RF_ACTIVE | RF_SHAREABLE); if (pcs->irq[i].irq == NULL) { device_printf(dev, "could not allocate interrupt\n"); break; } if (i == 0) f = isp_platform_intr; else if (i == 1) f = isp_platform_intr_resp; else f = isp_platform_intr_atio; if (bus_setup_intr(dev, pcs->irq[i].irq, ISP_IFLAGS, NULL, f, isp, &pcs->irq[i].ih)) { device_printf(dev, "could not setup interrupt\n"); (void) bus_release_resource(dev, SYS_RES_IRQ, pcs->irq[i].iqd, pcs->irq[i].irq); break; } if (pcs->msicount > 1) { bus_describe_intr(dev, pcs->irq[i].irq, pcs->irq[i].ih, "%d", i); } isp->isp_nirq = i + 1; } ISP_LOCK(isp); return (isp->isp_nirq == 0); } static void isp_pci_dumpregs(ispsoftc_t *isp, const char *msg) { struct isp_pcisoftc *pcs = (struct isp_pcisoftc *)isp; if (msg) printf("%s: %s\n", device_get_nameunit(isp->isp_dev), msg); else printf("%s:\n", device_get_nameunit(isp->isp_dev)); if (IS_SCSI(isp)) printf(" biu_conf1=%x", ISP_READ(isp, BIU_CONF1)); else printf(" biu_csr=%x", ISP_READ(isp, BIU2100_CSR)); printf(" biu_icr=%x biu_isr=%x biu_sema=%x ", ISP_READ(isp, BIU_ICR), ISP_READ(isp, BIU_ISR), ISP_READ(isp, BIU_SEMA)); printf("risc_hccr=%x\n", ISP_READ(isp, HCCR)); if (IS_SCSI(isp)) { ISP_WRITE(isp, HCCR, HCCR_CMD_PAUSE); printf(" cdma_conf=%x cdma_sts=%x cdma_fifostat=%x\n", ISP_READ(isp, CDMA_CONF), ISP_READ(isp, CDMA_STATUS), ISP_READ(isp, CDMA_FIFO_STS)); printf(" ddma_conf=%x ddma_sts=%x ddma_fifostat=%x\n", ISP_READ(isp, DDMA_CONF), ISP_READ(isp, DDMA_STATUS), ISP_READ(isp, DDMA_FIFO_STS)); printf(" sxp_int=%x sxp_gross=%x sxp(scsi_ctrl)=%x\n", ISP_READ(isp, SXP_INTERRUPT), ISP_READ(isp, SXP_GROSS_ERR), ISP_READ(isp, SXP_PINS_CTRL)); ISP_WRITE(isp, HCCR, HCCR_CMD_RELEASE); } printf(" mbox regs: %x %x %x %x %x\n", 
ISP_READ(isp, OUTMAILBOX0), ISP_READ(isp, OUTMAILBOX1), ISP_READ(isp, OUTMAILBOX2), ISP_READ(isp, OUTMAILBOX3), ISP_READ(isp, OUTMAILBOX4)); printf(" PCI Status Command/Status=%x\n", pci_read_config(pcs->pci_dev, PCIR_COMMAND, 1)); } Index: user/ngie/bug-237403/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c =================================================================== --- user/ngie/bug-237403/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c (revision 346926) @@ -1,588 +1,591 @@ /*- * Copyright (c) 2015 Mellanox Technologies. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #include "en.h" #include static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix) { bus_dma_segment_t segs[rq->nsegs]; struct mbuf *mb; int nsegs; int err; #if (MLX5E_MAX_RX_SEGS != 1) struct mbuf *mb_head; int i; #endif if (rq->mbuf[ix].mbuf != NULL) return (0); #if (MLX5E_MAX_RX_SEGS == 1) mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rq->wqe_sz); if (unlikely(!mb)) return (-ENOMEM); mb->m_pkthdr.len = mb->m_len = rq->wqe_sz; #else mb_head = mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MLX5E_MAX_RX_BYTES); if (unlikely(mb == NULL)) return (-ENOMEM); mb->m_len = MLX5E_MAX_RX_BYTES; mb->m_pkthdr.len = MLX5E_MAX_RX_BYTES; for (i = 1; i < rq->nsegs; i++) { if (mb_head->m_pkthdr.len >= rq->wqe_sz) break; mb = mb->m_next = m_getjcl(M_NOWAIT, MT_DATA, 0, MLX5E_MAX_RX_BYTES); if (unlikely(mb == NULL)) { m_freem(mb_head); return (-ENOMEM); } mb->m_len = MLX5E_MAX_RX_BYTES; mb_head->m_pkthdr.len += MLX5E_MAX_RX_BYTES; } /* rewind to first mbuf in chain */ mb = mb_head; #endif /* get IP header aligned */ m_adj(mb, MLX5E_NET_IP_ALIGN); err = -bus_dmamap_load_mbuf_sg(rq->dma_tag, rq->mbuf[ix].dma_map, mb, segs, &nsegs, BUS_DMA_NOWAIT); if (err != 0) goto err_free_mbuf; if (unlikely(nsegs == 0)) { bus_dmamap_unload(rq->dma_tag, rq->mbuf[ix].dma_map); err = -ENOMEM; goto err_free_mbuf; } #if (MLX5E_MAX_RX_SEGS == 1) wqe->data[0].addr = cpu_to_be64(segs[0].ds_addr); #else wqe->data[0].addr = cpu_to_be64(segs[0].ds_addr); wqe->data[0].byte_count = cpu_to_be32(segs[0].ds_len | MLX5_HW_START_PADDING); for (i = 1; i != nsegs; i++) { wqe->data[i].addr = cpu_to_be64(segs[i].ds_addr); wqe->data[i].byte_count = cpu_to_be32(segs[i].ds_len); } for (; i < rq->nsegs; i++) { wqe->data[i].addr = 0; wqe->data[i].byte_count = 0; } #endif rq->mbuf[ix].mbuf = mb; rq->mbuf[ix].data = mb->m_data; bus_dmamap_sync(rq->dma_tag, rq->mbuf[ix].dma_map, BUS_DMASYNC_PREREAD); return (0); err_free_mbuf: m_freem(mb); return (err); } static void mlx5e_post_rx_wqes(struct mlx5e_rq *rq) { if (unlikely(rq->enabled == 0)) return; while (!mlx5_wq_ll_is_full(&rq->wq)) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, rq->wq.head); if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, rq->wq.head))) { callout_reset_curcpu(&rq->watchdog, 1, (void *)&mlx5e_post_rx_wqes, rq); break; } mlx5_wq_ll_push(&rq->wq, be16_to_cpu(wqe->next.next_wqe_index)); } /* ensure wqes are visible to device before updating doorbell record */ atomic_thread_fence_rel(); mlx5_wq_ll_update_db_record(&rq->wq); } static void mlx5e_lro_update_hdr(struct mbuf *mb, struct mlx5_cqe64 *cqe) { /* TODO: consider vlans, ip options, ... 
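 *
 * This rewrites the Ethernet/IP/TCP headers of the first mbuf of an
 * LRO-aggregated packet so that the total length, TTL/hop limit, TCP
 * flags, ack/window fields and (when present) the TCP timestamp option
 * describe the merged segment that is handed up to the stack.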
*/ struct ether_header *eh; uint16_t eh_type; uint16_t tot_len; struct ip6_hdr *ip6 = NULL; struct ip *ip4 = NULL; struct tcphdr *th; uint32_t *ts_ptr; uint8_t l4_hdr_type; int tcp_ack; eh = mtod(mb, struct ether_header *); eh_type = ntohs(eh->ether_type); l4_hdr_type = get_cqe_l4_hdr_type(cqe); tcp_ack = ((CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA == l4_hdr_type) || (CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA == l4_hdr_type)); /* TODO: consider vlan */ tot_len = be32_to_cpu(cqe->byte_cnt) - ETHER_HDR_LEN; switch (eh_type) { case ETHERTYPE_IP: ip4 = (struct ip *)(eh + 1); th = (struct tcphdr *)(ip4 + 1); break; case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(eh + 1); th = (struct tcphdr *)(ip6 + 1); break; default: return; } ts_ptr = (uint32_t *)(th + 1); if (get_cqe_lro_tcppsh(cqe)) th->th_flags |= TH_PUSH; if (tcp_ack) { th->th_flags |= TH_ACK; th->th_ack = cqe->lro_ack_seq_num; th->th_win = cqe->lro_tcp_win; /* * FreeBSD handles only 32bit aligned timestamp right after * the TCP hdr * +--------+--------+--------+--------+ * | NOP | NOP | TSopt | 10 | * +--------+--------+--------+--------+ * | TSval timestamp | * +--------+--------+--------+--------+ * | TSecr timestamp | * +--------+--------+--------+--------+ */ if (get_cqe_lro_timestamp_valid(cqe) && (__predict_true(*ts_ptr) == ntohl(TCPOPT_NOP << 24 | TCPOPT_NOP << 16 | TCPOPT_TIMESTAMP << 8 | TCPOLEN_TIMESTAMP))) { /* * cqe->timestamp is 64bit long. * [0-31] - timestamp. * [32-64] - timestamp echo replay. */ ts_ptr[1] = *(uint32_t *)&cqe->timestamp; ts_ptr[2] = *((uint32_t *)&cqe->timestamp + 1); } } if (ip4) { ip4->ip_ttl = cqe->lro_min_ttl; ip4->ip_len = cpu_to_be16(tot_len); ip4->ip_sum = 0; ip4->ip_sum = in_cksum(mb, ip4->ip_hl << 2); } else { ip6->ip6_hlim = cqe->lro_min_ttl; ip6->ip6_plen = cpu_to_be16(tot_len - sizeof(struct ip6_hdr)); } /* TODO: handle tcp checksum */ } static uint64_t mlx5e_mbuf_tstmp(struct mlx5e_priv *priv, uint64_t hw_tstmp) { struct mlx5e_clbr_point *cp; uint64_t a1, a2, res; u_int gen; do { cp = &priv->clbr_points[priv->clbr_curr]; gen = atomic_load_acq_int(&cp->clbr_gen); a1 = (hw_tstmp - cp->clbr_hw_prev) >> MLX5E_TSTMP_PREC; a2 = (cp->base_curr - cp->base_prev) >> MLX5E_TSTMP_PREC; res = (a1 * a2) << MLX5E_TSTMP_PREC; /* * Divisor cannot be zero because calibration callback * checks for the condition and disables timestamping * if clock halted. 
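 *
 * The conversion is a linear interpolation between the two most recent
 * calibration points:
 *
 *   res = base_prev + (hw_tstmp - clbr_hw_prev) *
 *         (base_curr - base_prev) / (clbr_hw_curr - clbr_hw_prev)
 *
 * with the operands pre-shifted by MLX5E_TSTMP_PREC so the 64-bit
 * intermediate product cannot overflow.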
*/ res /= (cp->clbr_hw_curr - cp->clbr_hw_prev) >> MLX5E_TSTMP_PREC; res += cp->base_prev; atomic_thread_fence_acq(); } while (gen == 0 || gen != cp->clbr_gen); return (res); } static inline void mlx5e_build_rx_mbuf(struct mlx5_cqe64 *cqe, struct mlx5e_rq *rq, struct mbuf *mb, u32 cqe_bcnt) { struct ifnet *ifp = rq->ifp; struct mlx5e_channel *c; #if (MLX5E_MAX_RX_SEGS != 1) struct mbuf *mb_head; #endif int lro_num_seg; /* HW LRO session aggregated packets counter */ uint64_t tstmp; lro_num_seg = be32_to_cpu(cqe->srqn) >> 24; if (lro_num_seg > 1) { mlx5e_lro_update_hdr(mb, cqe); rq->stats.lro_packets++; rq->stats.lro_bytes += cqe_bcnt; } #if (MLX5E_MAX_RX_SEGS == 1) mb->m_pkthdr.len = mb->m_len = cqe_bcnt; #else mb->m_pkthdr.len = cqe_bcnt; for (mb_head = mb; mb != NULL; mb = mb->m_next) { if (mb->m_len > cqe_bcnt) mb->m_len = cqe_bcnt; cqe_bcnt -= mb->m_len; if (likely(cqe_bcnt == 0)) { if (likely(mb->m_next != NULL)) { /* trim off empty mbufs */ m_freem(mb->m_next); mb->m_next = NULL; } break; } } /* rewind to first mbuf in chain */ mb = mb_head; #endif /* check if a Toeplitz hash was computed */ if (cqe->rss_hash_type != 0) { mb->m_pkthdr.flowid = be32_to_cpu(cqe->rss_hash_result); #ifdef RSS /* decode the RSS hash type */ switch (cqe->rss_hash_type & (CQE_RSS_DST_HTYPE_L4 | CQE_RSS_DST_HTYPE_IP)) { /* IPv4 */ case (CQE_RSS_DST_HTYPE_TCP | CQE_RSS_DST_HTYPE_IPV4): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_TCP_IPV4); break; case (CQE_RSS_DST_HTYPE_UDP | CQE_RSS_DST_HTYPE_IPV4): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_UDP_IPV4); break; case CQE_RSS_DST_HTYPE_IPV4: M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_IPV4); break; /* IPv6 */ case (CQE_RSS_DST_HTYPE_TCP | CQE_RSS_DST_HTYPE_IPV6): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_TCP_IPV6); break; case (CQE_RSS_DST_HTYPE_UDP | CQE_RSS_DST_HTYPE_IPV6): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_UDP_IPV6); break; case CQE_RSS_DST_HTYPE_IPV6: M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_IPV6); break; default: /* Other */ M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE_HASH); break; } #else M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE_HASH); #endif } else { mb->m_pkthdr.flowid = rq->ix; M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE); } mb->m_pkthdr.rcvif = ifp; if (likely(ifp->if_capenable & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) && ((cqe->hds_ip_ext & (CQE_L2_OK | CQE_L3_OK | CQE_L4_OK)) == (CQE_L2_OK | CQE_L3_OK | CQE_L4_OK))) { mb->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mb->m_pkthdr.csum_data = htons(0xffff); } else { rq->stats.csum_none++; } if (cqe_has_vlan(cqe)) { mb->m_pkthdr.ether_vtag = be16_to_cpu(cqe->vlan_info); mb->m_flags |= M_VLANTAG; } c = container_of(rq, struct mlx5e_channel, rq); if (c->priv->clbr_done >= 2) { tstmp = mlx5e_mbuf_tstmp(c->priv, be64_to_cpu(cqe->timestamp)); if ((tstmp & MLX5_CQE_TSTMP_PTP) != 0) { /* * Timestamp was taken on the packet entrance, * instead of the cqe generation. 
*/ tstmp &= ~MLX5_CQE_TSTMP_PTP; mb->m_flags |= M_TSTMP_HPREC; } mb->m_pkthdr.rcv_tstmp = tstmp; mb->m_flags |= M_TSTMP; } } static inline void mlx5e_read_cqe_slot(struct mlx5e_cq *cq, u32 cc, void *data) { memcpy(data, mlx5_cqwq_get_wqe(&cq->wq, (cc & cq->wq.sz_m1)), sizeof(struct mlx5_cqe64)); } static inline void mlx5e_write_cqe_slot(struct mlx5e_cq *cq, u32 cc, void *data) { memcpy(mlx5_cqwq_get_wqe(&cq->wq, cc & cq->wq.sz_m1), data, sizeof(struct mlx5_cqe64)); } static inline void mlx5e_decompress_cqe(struct mlx5e_cq *cq, struct mlx5_cqe64 *title, struct mlx5_mini_cqe8 *mini, u16 wqe_counter, int i) { /* * NOTE: The fields which are not set here are copied from the * initial and common title. See memcpy() in * mlx5e_write_cqe_slot(). */ title->byte_cnt = mini->byte_cnt; title->wqe_counter = cpu_to_be16((wqe_counter + i) & cq->wq.sz_m1); title->check_sum = mini->checksum; title->op_own = (title->op_own & 0xf0) | (((cq->wq.cc + i) >> cq->wq.log_sz) & 1); } #define MLX5E_MINI_ARRAY_SZ 8 /* Make sure structs are not packet differently */ CTASSERT(sizeof(struct mlx5_cqe64) == sizeof(struct mlx5_mini_cqe8) * MLX5E_MINI_ARRAY_SZ); static void mlx5e_decompress_cqes(struct mlx5e_cq *cq) { struct mlx5_mini_cqe8 mini_array[MLX5E_MINI_ARRAY_SZ]; struct mlx5_cqe64 title; u32 cqe_count; u32 i = 0; u16 title_wqe_counter; mlx5e_read_cqe_slot(cq, cq->wq.cc, &title); title_wqe_counter = be16_to_cpu(title.wqe_counter); cqe_count = be32_to_cpu(title.byte_cnt); /* Make sure we won't overflow */ KASSERT(cqe_count <= cq->wq.sz_m1, ("%s: cqe_count %u > cq->wq.sz_m1 %u", __func__, cqe_count, cq->wq.sz_m1)); mlx5e_read_cqe_slot(cq, cq->wq.cc + 1, mini_array); while (true) { mlx5e_decompress_cqe(cq, &title, &mini_array[i % MLX5E_MINI_ARRAY_SZ], title_wqe_counter, i); mlx5e_write_cqe_slot(cq, cq->wq.cc + i, &title); i++; if (i == cqe_count) break; if (i % MLX5E_MINI_ARRAY_SZ == 0) mlx5e_read_cqe_slot(cq, cq->wq.cc + i, mini_array); } } static int mlx5e_poll_rx_cq(struct mlx5e_rq *rq, int budget) { struct pfil_head *pfil; int i, rv; CURVNET_SET_QUIET(rq->ifp->if_vnet); pfil = rq->channel->priv->pfil; for (i = 0; i < budget; i++) { struct mlx5e_rx_wqe *wqe; struct mlx5_cqe64 *cqe; struct mbuf *mb; __be16 wqe_counter_be; u16 wqe_counter; u32 byte_cnt, seglen; cqe = mlx5e_get_cqe(&rq->cq); if (!cqe) break; if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) mlx5e_decompress_cqes(&rq->cq); mlx5_cqwq_pop(&rq->cq.wq); wqe_counter_be = cqe->wqe_counter; wqe_counter = be16_to_cpu(wqe_counter_be); wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter); byte_cnt = be32_to_cpu(cqe->byte_cnt); bus_dmamap_sync(rq->dma_tag, rq->mbuf[wqe_counter].dma_map, BUS_DMASYNC_POSTREAD); if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) { rq->stats.wqe_err++; goto wq_ll_pop; } if (pfil != NULL && PFIL_HOOKED_IN(pfil)) { seglen = MIN(byte_cnt, MLX5E_MAX_RX_BYTES); rv = pfil_run_hooks(rq->channel->priv->pfil, rq->mbuf[wqe_counter].data, rq->ifp, seglen | PFIL_MEMPTR | PFIL_IN, NULL); switch (rv) { case PFIL_DROPPED: case PFIL_CONSUMED: /* * Filter dropped or consumed it. In * either case, we can just recycle * buffer; there is no more work to do. */ rq->stats.packets++; goto wq_ll_pop; case PFIL_REALLOCED: /* * Filter copied it; recycle buffer * and receive the new mbuf allocated * by the Filter */ mb = pfil_mem2mbuf(rq->mbuf[wqe_counter].data); goto rx_common; default: /* * The Filter said it was OK, so * receive like normal. 
*/ KASSERT(rv == PFIL_PASS, ("Filter returned %d!\n", rv)); } } if ((MHLEN - MLX5E_NET_IP_ALIGN) >= byte_cnt && (mb = m_gethdr(M_NOWAIT, MT_DATA)) != NULL) { #if (MLX5E_MAX_RX_SEGS != 1) /* set maximum mbuf length */ mb->m_len = MHLEN - MLX5E_NET_IP_ALIGN; #endif /* get IP header aligned */ mb->m_data += MLX5E_NET_IP_ALIGN; bcopy(rq->mbuf[wqe_counter].data, mtod(mb, caddr_t), byte_cnt); } else { mb = rq->mbuf[wqe_counter].mbuf; rq->mbuf[wqe_counter].mbuf = NULL; /* safety clear */ bus_dmamap_unload(rq->dma_tag, rq->mbuf[wqe_counter].dma_map); } rx_common: mlx5e_build_rx_mbuf(cqe, rq, mb, byte_cnt); rq->stats.bytes += byte_cnt; rq->stats.packets++; +#ifdef NUMA + mb->m_pkthdr.numa_domain = rq->ifp->if_numa_domain; +#endif #if !defined(HAVE_TCP_LRO_RX) tcp_lro_queue_mbuf(&rq->lro, mb); #else if (mb->m_pkthdr.csum_flags == 0 || (rq->ifp->if_capenable & IFCAP_LRO) == 0 || rq->lro.lro_cnt == 0 || tcp_lro_rx(&rq->lro, mb, 0) != 0) { rq->ifp->if_input(rq->ifp, mb); } #endif wq_ll_pop: mlx5_wq_ll_pop(&rq->wq, wqe_counter_be, &wqe->next.next_wqe_index); } CURVNET_RESTORE(); mlx5_cqwq_update_db_record(&rq->cq.wq); /* ensure cq space is freed before enabling more cqes */ atomic_thread_fence_rel(); return (i); } void mlx5e_rx_cq_comp(struct mlx5_core_cq *mcq) { struct mlx5e_rq *rq = container_of(mcq, struct mlx5e_rq, cq.mcq); int i = 0; #ifdef HAVE_PER_CQ_EVENT_PACKET #if (MHLEN < 15) #error "MHLEN is too small" #endif struct mbuf *mb = m_gethdr(M_NOWAIT, MT_DATA); if (mb != NULL) { /* this code is used for debugging purpose only */ mb->m_pkthdr.len = mb->m_len = 15; memset(mb->m_data, 255, 14); mb->m_data[14] = rq->ix; mb->m_pkthdr.rcvif = rq->ifp; rq->ifp->if_input(rq->ifp, mb); } #endif mtx_lock(&rq->mtx); /* * Polling the entire CQ without posting new WQEs results in * lack of receive WQEs during heavy traffic scenarios. */ while (1) { if (mlx5e_poll_rx_cq(rq, MLX5E_RX_BUDGET_MAX) != MLX5E_RX_BUDGET_MAX) break; i += MLX5E_RX_BUDGET_MAX; if (i >= MLX5E_BUDGET_MAX) break; mlx5e_post_rx_wqes(rq); } mlx5e_post_rx_wqes(rq); mlx5e_cq_arm(&rq->cq, MLX5_GET_DOORBELL_LOCK(&rq->channel->priv->doorbell_lock)); tcp_lro_flush_all(&rq->lro); mtx_unlock(&rq->mtx); } Index: user/ngie/bug-237403/sys/dev/uart/uart_cpu_arm64.c =================================================================== --- user/ngie/bug-237403/sys/dev/uart/uart_cpu_arm64.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/uart/uart_cpu_arm64.c (revision 346926) @@ -1,217 +1,224 @@ /*- * Copyright (c) 2016 The FreeBSD Foundation * All rights reserved. * * This software was developed by Andrew Turner under sponsorship from * the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include "opt_acpi.h" #include "opt_platform.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #ifdef DEV_ACPI #include #include #include #include #endif #ifdef FDT #include #include #include #include #endif /* * UART console routines. */ extern struct bus_space memmap_bus; bus_space_tag_t uart_bus_space_io; bus_space_tag_t uart_bus_space_mem = &memmap_bus; int uart_cpu_eqres(struct uart_bas *b1, struct uart_bas *b2) { if (pmap_kextract(b1->bsh) == 0) return (0); if (pmap_kextract(b2->bsh) == 0) return (0); return ((pmap_kextract(b1->bsh) == pmap_kextract(b2->bsh)) ? 1 : 0); } #ifdef DEV_ACPI static struct acpi_uart_compat_data * uart_cpu_acpi_scan(uint8_t interface_type) { struct acpi_uart_compat_data **cd, *curcd; int i; SET_FOREACH(cd, uart_acpi_class_and_device_set) { curcd = *cd; for (i = 0; curcd[i].cd_hid != NULL; i++) { if (curcd[i].cd_port_subtype == interface_type) return (&curcd[i]); } } SET_FOREACH(cd, uart_acpi_class_set) { curcd = *cd; for (i = 0; curcd[i].cd_hid != NULL; i++) { if (curcd[i].cd_port_subtype == interface_type) return (&curcd[i]); } } return (NULL); } static int uart_cpu_acpi_probe(struct uart_class **classp, bus_space_tag_t *bst, bus_space_handle_t *bsh, int *baud, u_int *rclk, u_int *shiftp, u_int *iowidthp) { struct acpi_uart_compat_data *cd; ACPI_TABLE_SPCR *spcr; vm_paddr_t spcr_physaddr; int err; err = ENXIO; spcr_physaddr = acpi_find_table(ACPI_SIG_SPCR); if (spcr_physaddr == 0) return (ENXIO); spcr = acpi_map_table(spcr_physaddr, ACPI_SIG_SPCR); cd = uart_cpu_acpi_scan(spcr->InterfaceType); if (cd == NULL) goto out; switch(spcr->BaudRate) { + case 0: + /* + * A BaudRate of 0 is a special value which means not to + * change the rate that's already programmed. + */ + *baud = 0; + break; case 3: *baud = 9600; break; case 4: *baud = 19200; break; case 6: *baud = 57600; break; case 7: *baud = 115200; break; default: goto out; } err = acpi_map_addr(&spcr->SerialPort, bst, bsh, PAGE_SIZE); if (err != 0) goto out; *classp = cd->cd_class; *rclk = 0; *shiftp = spcr->SerialPort.AccessWidth - 1; *iowidthp = spcr->SerialPort.BitWidth / 8; if ((cd->cd_quirks & UART_F_IGNORE_SPCR_REGSHFT) == UART_F_IGNORE_SPCR_REGSHFT) { *shiftp = cd->cd_regshft; } out: acpi_unmap_table(spcr); return (err); } #endif int uart_cpu_getdev(int devtype, struct uart_devinfo *di) { struct uart_class *class; bus_space_handle_t bsh; bus_space_tag_t bst; u_int rclk, shift, iowidth; int br, err; /* Allow overriding the FDT using the environment. */ class = &uart_ns8250_class; err = uart_getenv(devtype, di, class); if (err == 0) return (0); if (devtype != UART_DEV_CONSOLE) return (ENXIO); err = ENXIO; #ifdef DEV_ACPI err = uart_cpu_acpi_probe(&class, &bst, &bsh, &br, &rclk, &shift, &iowidth); #endif #ifdef FDT if (err != 0) { err = uart_cpu_fdt_probe(&class, &bst, &bsh, &br, &rclk, &shift, &iowidth); } #endif if (err != 0) return (err); /* * Finalize configuration. 
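 * A baud rate of 0 (the SPCR "keep the current rate" value handled above)
 * is passed through in di->baudrate unchanged, so the UART driver can
 * leave the firmware-programmed divisor alone.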
*/ di->bas.chan = 0; di->bas.regshft = shift; di->bas.regiowidth = iowidth; di->baudrate = br; di->bas.rclk = rclk; di->ops = uart_getops(class); di->databits = 8; di->stopbits = 1; di->parity = UART_PARITY_NONE; di->bas.bst = bst; di->bas.bsh = bsh; uart_bus_space_mem = di->bas.bst; uart_bus_space_io = NULL; return (0); } Index: user/ngie/bug-237403/sys/dev/xdma/xdma.h =================================================================== --- user/ngie/bug-237403/sys/dev/xdma/xdma.h (revision 346925) +++ user/ngie/bug-237403/sys/dev/xdma/xdma.h (revision 346926) @@ -1,264 +1,264 @@ /*- * Copyright (c) 2016-2018 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237 * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _DEV_XDMA_XDMA_H_ #define _DEV_XDMA_XDMA_H_ #include enum xdma_direction { XDMA_MEM_TO_MEM, XDMA_MEM_TO_DEV, XDMA_DEV_TO_MEM, XDMA_DEV_TO_DEV, }; enum xdma_operation_type { XDMA_MEMCPY, XDMA_CYCLIC, XDMA_FIFO, XDMA_SG, }; enum xdma_request_type { XR_TYPE_PHYS, XR_TYPE_VIRT, XR_TYPE_MBUF, XR_TYPE_BIO, }; enum xdma_command { XDMA_CMD_BEGIN, XDMA_CMD_PAUSE, XDMA_CMD_TERMINATE, }; struct xdma_transfer_status { uint32_t transferred; int error; }; typedef struct xdma_transfer_status xdma_transfer_status_t; struct xdma_controller { device_t dev; /* DMA consumer device_t. */ device_t dma_dev; /* A real DMA device_t. */ void *data; /* OFW MD part. */ /* List of virtual channels allocated. 
*/ TAILQ_HEAD(xdma_channel_list, xdma_channel) channels; }; typedef struct xdma_controller xdma_controller_t; struct xchan_buf { bus_dmamap_t map; uint32_t nsegs; uint32_t nsegs_left; - void *cbuf; }; struct xdma_request { struct mbuf *m; struct bio *bp; enum xdma_operation_type operation; enum xdma_request_type req_type; enum xdma_direction direction; bus_addr_t src_addr; bus_addr_t dst_addr; uint8_t src_width; uint8_t dst_width; bus_size_t block_num; bus_size_t block_len; xdma_transfer_status_t status; void *user; TAILQ_ENTRY(xdma_request) xr_next; struct xchan_buf buf; }; struct xdma_sglist { bus_addr_t src_addr; bus_addr_t dst_addr; size_t len; uint8_t src_width; uint8_t dst_width; enum xdma_direction direction; bool first; bool last; }; struct xdma_channel { xdma_controller_t *xdma; uint32_t flags; #define XCHAN_BUFS_ALLOCATED (1 << 0) #define XCHAN_SGLIST_ALLOCATED (1 << 1) #define XCHAN_CONFIGURED (1 << 2) #define XCHAN_TYPE_CYCLIC (1 << 3) #define XCHAN_TYPE_MEMCPY (1 << 4) #define XCHAN_TYPE_FIFO (1 << 5) #define XCHAN_TYPE_SG (1 << 6) uint32_t caps; #define XCHAN_CAP_BUSDMA (1 << 0) -#define XCHAN_CAP_BUSDMA_NOSEG (1 << 1) +#define XCHAN_CAP_NOSEG (1 << 1) +#define XCHAN_CAP_NOBUFS (1 << 2) /* A real hardware driver channel. */ void *chan; /* Interrupt handlers. */ TAILQ_HEAD(, xdma_intr_handler) ie_handlers; TAILQ_ENTRY(xdma_channel) xchan_next; struct sx sx_lock; struct sx sx_qin_lock; struct sx sx_qout_lock; struct sx sx_bank_lock; struct sx sx_proc_lock; /* Request queue. */ bus_dma_tag_t dma_tag_bufs; struct xdma_request *xr_mem; uint32_t xr_num; /* Bus dma tag options. */ bus_size_t maxsegsize; bus_size_t maxnsegs; bus_size_t alignment; bus_addr_t boundary; bus_addr_t lowaddr; bus_addr_t highaddr; struct xdma_sglist *sg; TAILQ_HEAD(, xdma_request) bank; TAILQ_HEAD(, xdma_request) queue_in; TAILQ_HEAD(, xdma_request) queue_out; TAILQ_HEAD(, xdma_request) processing; }; typedef struct xdma_channel xdma_channel_t; struct xdma_intr_handler { int (*cb)(void *cb_user, xdma_transfer_status_t *status); void *cb_user; TAILQ_ENTRY(xdma_intr_handler) ih_next; }; static MALLOC_DEFINE(M_XDMA, "xdma", "xDMA framework"); #define XCHAN_LOCK(xchan) sx_xlock(&(xchan)->sx_lock) #define XCHAN_UNLOCK(xchan) sx_xunlock(&(xchan)->sx_lock) #define XCHAN_ASSERT_LOCKED(xchan) \ sx_assert(&(xchan)->sx_lock, SX_XLOCKED) #define QUEUE_IN_LOCK(xchan) sx_xlock(&(xchan)->sx_qin_lock) #define QUEUE_IN_UNLOCK(xchan) sx_xunlock(&(xchan)->sx_qin_lock) #define QUEUE_IN_ASSERT_LOCKED(xchan) \ sx_assert(&(xchan)->sx_qin_lock, SX_XLOCKED) #define QUEUE_OUT_LOCK(xchan) sx_xlock(&(xchan)->sx_qout_lock) #define QUEUE_OUT_UNLOCK(xchan) sx_xunlock(&(xchan)->sx_qout_lock) #define QUEUE_OUT_ASSERT_LOCKED(xchan) \ sx_assert(&(xchan)->sx_qout_lock, SX_XLOCKED) #define QUEUE_BANK_LOCK(xchan) sx_xlock(&(xchan)->sx_bank_lock) #define QUEUE_BANK_UNLOCK(xchan) sx_xunlock(&(xchan)->sx_bank_lock) #define QUEUE_BANK_ASSERT_LOCKED(xchan) \ sx_assert(&(xchan)->sx_bank_lock, SX_XLOCKED) #define QUEUE_PROC_LOCK(xchan) sx_xlock(&(xchan)->sx_proc_lock) #define QUEUE_PROC_UNLOCK(xchan) sx_xunlock(&(xchan)->sx_proc_lock) #define QUEUE_PROC_ASSERT_LOCKED(xchan) \ sx_assert(&(xchan)->sx_proc_lock, SX_XLOCKED) #define XDMA_SGLIST_MAXLEN 2048 #define XDMA_MAX_SEG 128 /* xDMA controller ops */ xdma_controller_t *xdma_ofw_get(device_t dev, const char *prop); int xdma_put(xdma_controller_t *xdma); /* xDMA channel ops */ xdma_channel_t * xdma_channel_alloc(xdma_controller_t *, uint32_t caps); int xdma_channel_free(xdma_channel_t *); int 
xdma_request(xdma_channel_t *xchan, struct xdma_request *r); /* SG interface */ int xdma_prep_sg(xdma_channel_t *, uint32_t, bus_size_t, bus_size_t, bus_size_t, bus_addr_t, bus_addr_t, bus_addr_t); void xdma_channel_free_sg(xdma_channel_t *xchan); int xdma_queue_submit_sg(xdma_channel_t *xchan); void xchan_seg_done(xdma_channel_t *xchan, xdma_transfer_status_t *); /* Queue operations */ int xdma_dequeue_mbuf(xdma_channel_t *xchan, struct mbuf **m, xdma_transfer_status_t *); int xdma_enqueue_mbuf(xdma_channel_t *xchan, struct mbuf **m, uintptr_t addr, uint8_t, uint8_t, enum xdma_direction dir); int xdma_dequeue_bio(xdma_channel_t *xchan, struct bio **bp, xdma_transfer_status_t *status); int xdma_enqueue_bio(xdma_channel_t *xchan, struct bio **bp, bus_addr_t addr, uint8_t, uint8_t, enum xdma_direction dir); int xdma_dequeue(xdma_channel_t *xchan, void **user, xdma_transfer_status_t *status); int xdma_enqueue(xdma_channel_t *xchan, uintptr_t src, uintptr_t dst, uint8_t, uint8_t, bus_size_t, enum xdma_direction dir, void *); int xdma_queue_submit(xdma_channel_t *xchan); /* Mbuf operations */ uint32_t xdma_mbuf_defrag(xdma_channel_t *xchan, struct xdma_request *xr); uint32_t xdma_mbuf_chain_count(struct mbuf *m0); /* Channel Control */ int xdma_control(xdma_channel_t *xchan, enum xdma_command cmd); /* Interrupt callback */ int xdma_setup_intr(xdma_channel_t *xchan, int (*cb)(void *, xdma_transfer_status_t *), void *arg, void **); int xdma_teardown_intr(xdma_channel_t *xchan, struct xdma_intr_handler *ih); int xdma_teardown_all_intr(xdma_channel_t *xchan); void xdma_callback(struct xdma_channel *xchan, xdma_transfer_status_t *status); /* Sglist */ int xchan_sglist_alloc(xdma_channel_t *xchan); void xchan_sglist_free(xdma_channel_t *xchan); int xdma_sglist_add(struct xdma_sglist *sg, struct bus_dma_segment *seg, uint32_t nsegs, struct xdma_request *xr); /* Requests bank */ void xchan_bank_init(xdma_channel_t *xchan); int xchan_bank_free(xdma_channel_t *xchan); struct xdma_request * xchan_bank_get(xdma_channel_t *xchan); int xchan_bank_put(xdma_channel_t *xchan, struct xdma_request *xr); #endif /* !_DEV_XDMA_XDMA_H_ */ Index: user/ngie/bug-237403/sys/dev/xdma/xdma_mbuf.c =================================================================== --- user/ngie/bug-237403/sys/dev/xdma/xdma_mbuf.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/xdma/xdma_mbuf.c (revision 346926) @@ -1,154 +1,150 @@ /*- * Copyright (c) 2017-2018 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237 * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #ifdef FDT #include #include #include #endif #include int xdma_dequeue_mbuf(xdma_channel_t *xchan, struct mbuf **mp, xdma_transfer_status_t *status) { struct xdma_request *xr; struct xdma_request *xr_tmp; QUEUE_OUT_LOCK(xchan); TAILQ_FOREACH_SAFE(xr, &xchan->queue_out, xr_next, xr_tmp) { TAILQ_REMOVE(&xchan->queue_out, xr, xr_next); break; } QUEUE_OUT_UNLOCK(xchan); if (xr == NULL) return (-1); *mp = xr->m; status->error = xr->status.error; status->transferred = xr->status.transferred; xchan_bank_put(xchan, xr); return (0); } int xdma_enqueue_mbuf(xdma_channel_t *xchan, struct mbuf **mp, uintptr_t addr, uint8_t src_width, uint8_t dst_width, enum xdma_direction dir) { struct xdma_request *xr; xdma_controller_t *xdma; xdma = xchan->xdma; xr = xchan_bank_get(xchan); if (xr == NULL) return (-1); /* No space is available yet. */ xr->direction = dir; xr->m = *mp; xr->req_type = XR_TYPE_MBUF; if (dir == XDMA_MEM_TO_DEV) { xr->dst_addr = addr; xr->src_addr = 0; } else { xr->src_addr = addr; xr->dst_addr = 0; } xr->src_width = src_width; xr->dst_width = dst_width; QUEUE_IN_LOCK(xchan); TAILQ_INSERT_TAIL(&xchan->queue_in, xr, xr_next); QUEUE_IN_UNLOCK(xchan); return (0); } uint32_t xdma_mbuf_chain_count(struct mbuf *m0) { struct mbuf *m; uint32_t c; c = 0; for (m = m0; m != NULL; m = m->m_next) c++; return (c); } uint32_t xdma_mbuf_defrag(xdma_channel_t *xchan, struct xdma_request *xr) { xdma_controller_t *xdma; struct mbuf *m; uint32_t c; xdma = xchan->xdma; c = xdma_mbuf_chain_count(xr->m); if (c == 1) return (c); /* Nothing to do. */ - if (xchan->caps & XCHAN_CAP_BUSDMA) { - if ((xchan->caps & XCHAN_CAP_BUSDMA_NOSEG) || \ - (c > xchan->maxnsegs)) { - if ((m = m_defrag(xr->m, M_NOWAIT)) == NULL) { - device_printf(xdma->dma_dev, - "%s: Can't defrag mbuf\n", - __func__); - return (c); - } - xr->m = m; - c = 1; - } + if ((m = m_defrag(xr->m, M_NOWAIT)) == NULL) { + device_printf(xdma->dma_dev, + "%s: Can't defrag mbuf\n", + __func__); + return (c); } + + xr->m = m; + c = 1; return (c); } Index: user/ngie/bug-237403/sys/dev/xdma/xdma_sg.c =================================================================== --- user/ngie/bug-237403/sys/dev/xdma/xdma_sg.c (revision 346925) +++ user/ngie/bug-237403/sys/dev/xdma/xdma_sg.c (revision 346926) @@ -1,594 +1,586 @@ /*- * Copyright (c) 2018 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory under DARPA/AFRL contract FA8750-10-C-0237 * ("CTSRD"), as part of the DARPA CRASH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_platform.h" #include #include #include #include #include #include #include #include #ifdef FDT #include #include #include #endif #include #include struct seg_load_request { struct bus_dma_segment *seg; uint32_t nsegs; uint32_t error; }; static int _xchan_bufs_alloc(xdma_channel_t *xchan) { xdma_controller_t *xdma; struct xdma_request *xr; int i; xdma = xchan->xdma; for (i = 0; i < xchan->xr_num; i++) { xr = &xchan->xr_mem[i]; - xr->buf.cbuf = contigmalloc(xchan->maxsegsize, - M_XDMA, 0, 0, ~0, PAGE_SIZE, 0); - if (xr->buf.cbuf == NULL) { - device_printf(xdma->dev, - "%s: Can't allocate contiguous kernel" - " physical memory\n", __func__); - return (-1); - } + /* TODO: bounce buffer */ } return (0); } static int _xchan_bufs_alloc_busdma(xdma_channel_t *xchan) { xdma_controller_t *xdma; struct xdma_request *xr; int err; int i; xdma = xchan->xdma; /* Create bus_dma tag */ err = bus_dma_tag_create( bus_get_dma_tag(xdma->dev), /* Parent tag. */ xchan->alignment, /* alignment */ xchan->boundary, /* boundary */ xchan->lowaddr, /* lowaddr */ xchan->highaddr, /* highaddr */ NULL, NULL, /* filter, filterarg */ xchan->maxsegsize * xchan->maxnsegs, /* maxsize */ xchan->maxnsegs, /* nsegments */ xchan->maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockarg */ &xchan->dma_tag_bufs); if (err != 0) { device_printf(xdma->dev, "%s: Can't create bus_dma tag.\n", __func__); return (-1); } for (i = 0; i < xchan->xr_num; i++) { xr = &xchan->xr_mem[i]; err = bus_dmamap_create(xchan->dma_tag_bufs, 0, &xr->buf.map); if (err != 0) { device_printf(xdma->dev, "%s: Can't create buf DMA map.\n", __func__); /* Cleanup. 
*/ bus_dma_tag_destroy(xchan->dma_tag_bufs); return (-1); } } return (0); } static int xchan_bufs_alloc(xdma_channel_t *xchan) { xdma_controller_t *xdma; int ret; xdma = xchan->xdma; if (xdma == NULL) { device_printf(xdma->dev, "%s: Channel was not allocated properly.\n", __func__); return (-1); } if (xchan->caps & XCHAN_CAP_BUSDMA) ret = _xchan_bufs_alloc_busdma(xchan); else ret = _xchan_bufs_alloc(xchan); if (ret != 0) { device_printf(xdma->dev, "%s: Can't allocate bufs.\n", __func__); return (-1); } xchan->flags |= XCHAN_BUFS_ALLOCATED; return (0); } static int xchan_bufs_free(xdma_channel_t *xchan) { struct xdma_request *xr; struct xchan_buf *b; int i; if ((xchan->flags & XCHAN_BUFS_ALLOCATED) == 0) return (-1); if (xchan->caps & XCHAN_CAP_BUSDMA) { for (i = 0; i < xchan->xr_num; i++) { xr = &xchan->xr_mem[i]; b = &xr->buf; bus_dmamap_destroy(xchan->dma_tag_bufs, b->map); } bus_dma_tag_destroy(xchan->dma_tag_bufs); } else { for (i = 0; i < xchan->xr_num; i++) { xr = &xchan->xr_mem[i]; - contigfree(xr->buf.cbuf, xchan->maxsegsize, M_XDMA); + /* TODO: bounce buffer */ } } xchan->flags &= ~XCHAN_BUFS_ALLOCATED; return (0); } void xdma_channel_free_sg(xdma_channel_t *xchan) { xchan_bufs_free(xchan); xchan_sglist_free(xchan); xchan_bank_free(xchan); } /* * Prepare xchan for a scatter-gather transfer. * xr_num - xdma requests queue size, * maxsegsize - maximum allowed scatter-gather list element size in bytes */ int xdma_prep_sg(xdma_channel_t *xchan, uint32_t xr_num, bus_size_t maxsegsize, bus_size_t maxnsegs, bus_size_t alignment, bus_addr_t boundary, bus_addr_t lowaddr, bus_addr_t highaddr) { xdma_controller_t *xdma; int ret; xdma = xchan->xdma; KASSERT(xdma != NULL, ("xdma is NULL")); if (xchan->flags & XCHAN_CONFIGURED) { device_printf(xdma->dev, "%s: Channel is already configured.\n", __func__); return (-1); } xchan->xr_num = xr_num; xchan->maxsegsize = maxsegsize; xchan->maxnsegs = maxnsegs; xchan->alignment = alignment; xchan->boundary = boundary; xchan->lowaddr = lowaddr; xchan->highaddr = highaddr; if (xchan->maxnsegs > XDMA_MAX_SEG) { device_printf(xdma->dev, "%s: maxnsegs is too big\n", __func__); return (-1); } xchan_bank_init(xchan); /* Allocate sglist. */ ret = xchan_sglist_alloc(xchan); if (ret != 0) { device_printf(xdma->dev, "%s: Can't allocate sglist.\n", __func__); return (-1); } - /* Allocate bufs. */ - ret = xchan_bufs_alloc(xchan); - if (ret != 0) { - device_printf(xdma->dev, - "%s: Can't allocate bufs.\n", __func__); + /* Allocate buffers if required. 
*/ + if ((xchan->caps & XCHAN_CAP_NOBUFS) == 0) { + ret = xchan_bufs_alloc(xchan); + if (ret != 0) { + device_printf(xdma->dev, + "%s: Can't allocate bufs.\n", __func__); - /* Cleanup */ - xchan_sglist_free(xchan); - xchan_bank_free(xchan); + /* Cleanup */ + xchan_sglist_free(xchan); + xchan_bank_free(xchan); - return (-1); + return (-1); + } } xchan->flags |= (XCHAN_CONFIGURED | XCHAN_TYPE_SG); XCHAN_LOCK(xchan); ret = XDMA_CHANNEL_PREP_SG(xdma->dma_dev, xchan); if (ret != 0) { device_printf(xdma->dev, "%s: Can't prepare SG transfer.\n", __func__); XCHAN_UNLOCK(xchan); return (-1); } XCHAN_UNLOCK(xchan); return (0); } void xchan_seg_done(xdma_channel_t *xchan, struct xdma_transfer_status *st) { struct xdma_request *xr; xdma_controller_t *xdma; struct xchan_buf *b; xdma = xchan->xdma; xr = TAILQ_FIRST(&xchan->processing); if (xr == NULL) panic("request not found\n"); b = &xr->buf; atomic_subtract_int(&b->nsegs_left, 1); if (b->nsegs_left == 0) { if (xchan->caps & XCHAN_CAP_BUSDMA) { if (xr->direction == XDMA_MEM_TO_DEV) bus_dmamap_sync(xchan->dma_tag_bufs, b->map, BUS_DMASYNC_POSTWRITE); else bus_dmamap_sync(xchan->dma_tag_bufs, b->map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(xchan->dma_tag_bufs, b->map); } xr->status.error = st->error; xr->status.transferred = st->transferred; QUEUE_PROC_LOCK(xchan); TAILQ_REMOVE(&xchan->processing, xr, xr_next); QUEUE_PROC_UNLOCK(xchan); QUEUE_OUT_LOCK(xchan); TAILQ_INSERT_TAIL(&xchan->queue_out, xr, xr_next); QUEUE_OUT_UNLOCK(xchan); } } static void xdma_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nsegs, int error) { struct seg_load_request *slr; struct bus_dma_segment *seg; int i; slr = arg; seg = slr->seg; if (error != 0) { slr->error = error; return; } slr->nsegs = nsegs; for (i = 0; i < nsegs; i++) { seg[i].ds_addr = segs[i].ds_addr; seg[i].ds_len = segs[i].ds_len; } } static int _xdma_load_data_busdma(xdma_channel_t *xchan, struct xdma_request *xr, struct bus_dma_segment *seg) { xdma_controller_t *xdma; struct seg_load_request slr; uint32_t nsegs; void *addr; int error; xdma = xchan->xdma; error = 0; nsegs = 0; switch (xr->req_type) { case XR_TYPE_MBUF: error = bus_dmamap_load_mbuf_sg(xchan->dma_tag_bufs, xr->buf.map, xr->m, seg, &nsegs, BUS_DMA_NOWAIT); break; case XR_TYPE_BIO: slr.nsegs = 0; slr.error = 0; slr.seg = seg; error = bus_dmamap_load_bio(xchan->dma_tag_bufs, xr->buf.map, xr->bp, xdma_dmamap_cb, &slr, BUS_DMA_NOWAIT); if (slr.error != 0) { device_printf(xdma->dma_dev, "%s: bus_dmamap_load failed, err %d\n", __func__, slr.error); return (0); } nsegs = slr.nsegs; break; case XR_TYPE_VIRT: switch (xr->direction) { case XDMA_MEM_TO_DEV: addr = (void *)xr->src_addr; break; case XDMA_DEV_TO_MEM: addr = (void *)xr->dst_addr; break; default: device_printf(xdma->dma_dev, "%s: Direction is not supported\n", __func__); return (0); } slr.nsegs = 0; slr.error = 0; slr.seg = seg; error = bus_dmamap_load(xchan->dma_tag_bufs, xr->buf.map, addr, (xr->block_len * xr->block_num), xdma_dmamap_cb, &slr, BUS_DMA_NOWAIT); if (slr.error != 0) { device_printf(xdma->dma_dev, "%s: bus_dmamap_load failed, err %d\n", __func__, slr.error); return (0); } nsegs = slr.nsegs; break; default: break; } if (error != 0) { if (error == ENOMEM) { /* * Out of memory. Try again later. * TODO: count errors. 
*/ } else device_printf(xdma->dma_dev, "%s: bus_dmamap_load failed with err %d\n", __func__, error); return (0); } if (xr->direction == XDMA_MEM_TO_DEV) bus_dmamap_sync(xchan->dma_tag_bufs, xr->buf.map, BUS_DMASYNC_PREWRITE); else bus_dmamap_sync(xchan->dma_tag_bufs, xr->buf.map, BUS_DMASYNC_PREREAD); return (nsegs); } static int _xdma_load_data(xdma_channel_t *xchan, struct xdma_request *xr, struct bus_dma_segment *seg) { xdma_controller_t *xdma; struct mbuf *m; uint32_t nsegs; xdma = xchan->xdma; m = xr->m; nsegs = 1; switch (xr->req_type) { case XR_TYPE_MBUF: - if (xr->direction == XDMA_MEM_TO_DEV) { - m_copydata(m, 0, m->m_pkthdr.len, xr->buf.cbuf); - seg[0].ds_addr = (bus_addr_t)xr->buf.cbuf; - seg[0].ds_len = m->m_pkthdr.len; - } else { - seg[0].ds_addr = mtod(m, bus_addr_t); - seg[0].ds_len = m->m_pkthdr.len; - } + seg[0].ds_addr = mtod(m, bus_addr_t); + seg[0].ds_len = m->m_pkthdr.len; break; case XR_TYPE_BIO: case XR_TYPE_VIRT: default: panic("implement me\n"); } return (nsegs); } static int xdma_load_data(xdma_channel_t *xchan, struct xdma_request *xr, struct bus_dma_segment *seg) { xdma_controller_t *xdma; int error; int nsegs; xdma = xchan->xdma; error = 0; nsegs = 0; if (xchan->caps & XCHAN_CAP_BUSDMA) nsegs = _xdma_load_data_busdma(xchan, xr, seg); else nsegs = _xdma_load_data(xchan, xr, seg); if (nsegs == 0) return (0); /* Try again later. */ xr->buf.nsegs = nsegs; xr->buf.nsegs_left = nsegs; return (nsegs); } static int xdma_process(xdma_channel_t *xchan, struct xdma_sglist *sg) { struct bus_dma_segment seg[XDMA_MAX_SEG]; struct xdma_request *xr; struct xdma_request *xr_tmp; xdma_controller_t *xdma; uint32_t capacity; uint32_t n; uint32_t c; int nsegs; int ret; XCHAN_ASSERT_LOCKED(xchan); xdma = xchan->xdma; n = 0; ret = XDMA_CHANNEL_CAPACITY(xdma->dma_dev, xchan, &capacity); if (ret != 0) { device_printf(xdma->dev, "%s: Can't get DMA controller capacity.\n", __func__); return (-1); } TAILQ_FOREACH_SAFE(xr, &xchan->queue_in, xr_next, xr_tmp) { switch (xr->req_type) { case XR_TYPE_MBUF: - c = xdma_mbuf_defrag(xchan, xr); + if ((xchan->caps & XCHAN_CAP_NOSEG) || + (c > xchan->maxnsegs)) + c = xdma_mbuf_defrag(xchan, xr); break; case XR_TYPE_BIO: case XR_TYPE_VIRT: default: c = 1; } if (capacity <= (c + n)) { /* * No space yet available for the entire * request in the DMA engine. */ break; } if ((c + n + xchan->maxnsegs) >= XDMA_SGLIST_MAXLEN) { /* Sglist is full. */ break; } nsegs = xdma_load_data(xchan, xr, seg); if (nsegs == 0) break; xdma_sglist_add(&sg[n], seg, nsegs, xr); n += nsegs; QUEUE_IN_LOCK(xchan); TAILQ_REMOVE(&xchan->queue_in, xr, xr_next); QUEUE_IN_UNLOCK(xchan); QUEUE_PROC_LOCK(xchan); TAILQ_INSERT_TAIL(&xchan->processing, xr, xr_next); QUEUE_PROC_UNLOCK(xchan); } return (n); } int xdma_queue_submit_sg(xdma_channel_t *xchan) { struct xdma_sglist *sg; xdma_controller_t *xdma; uint32_t sg_n; int ret; xdma = xchan->xdma; KASSERT(xdma != NULL, ("xdma is NULL")); XCHAN_ASSERT_LOCKED(xchan); sg = xchan->sg; - if ((xchan->flags & XCHAN_BUFS_ALLOCATED) == 0) { + if ((xchan->caps & XCHAN_CAP_NOBUFS) == 0 && + (xchan->flags & XCHAN_BUFS_ALLOCATED) == 0) { device_printf(xdma->dev, "%s: Can't submit a transfer: no bufs\n", __func__); return (-1); } sg_n = xdma_process(xchan, sg); if (sg_n == 0) return (0); /* Nothing to submit */ /* Now submit sglist to DMA engine driver. 
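 * sg[0..sg_n - 1] was filled by xdma_process() with the channel lock
 * held; the controller driver consumes it through the
 * XDMA_CHANNEL_SUBMIT_SG method.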
*/ ret = XDMA_CHANNEL_SUBMIT_SG(xdma->dma_dev, xchan, sg, sg_n); if (ret != 0) { device_printf(xdma->dev, "%s: Can't submit an sglist.\n", __func__); return (-1); } return (0); } Index: user/ngie/bug-237403/sys/geom/geom_dev.c =================================================================== --- user/ngie/bug-237403/sys/geom/geom_dev.c (revision 346925) +++ user/ngie/bug-237403/sys/geom/geom_dev.c (revision 346926) @@ -1,869 +1,870 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2002 Poul-Henning Kamp * Copyright (c) 2002 Networks Associates Technology, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project by Poul-Henning Kamp * and NAI Labs, the Security Research Division of Network Associates, Inc. * under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The names of the authors may not be used to endorse or promote * products derived from this software without specific prior written * permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct g_dev_softc { struct mtx sc_mtx; struct cdev *sc_dev; struct cdev *sc_alias; int sc_open; u_int sc_active; #define SC_A_DESTROY (1 << 31) #define SC_A_OPEN (1 << 30) #define SC_A_ACTIVE (SC_A_OPEN - 1) }; static d_open_t g_dev_open; static d_close_t g_dev_close; static d_strategy_t g_dev_strategy; static d_ioctl_t g_dev_ioctl; static struct cdevsw g_dev_cdevsw = { .d_version = D_VERSION, .d_open = g_dev_open, .d_close = g_dev_close, .d_read = physread, .d_write = physwrite, .d_ioctl = g_dev_ioctl, .d_strategy = g_dev_strategy, .d_name = "g_dev", .d_flags = D_DISK | D_TRACKCLOSE, }; static g_init_t g_dev_init; static g_fini_t g_dev_fini; static g_taste_t g_dev_taste; static g_orphan_t g_dev_orphan; static g_attrchanged_t g_dev_attrchanged; static g_resize_t g_dev_resize; static struct g_class g_dev_class = { .name = "DEV", .version = G_VERSION, .init = g_dev_init, .fini = g_dev_fini, .taste = g_dev_taste, .orphan = g_dev_orphan, .attrchanged = g_dev_attrchanged, .resize = g_dev_resize }; /* * We target 262144 (8 x 32768) sectors by default as this significantly * increases the throughput on commonly used SSD's with a marginal * increase in non-interruptible request latency. */ static uint64_t g_dev_del_max_sectors = 262144; SYSCTL_DECL(_kern_geom); SYSCTL_NODE(_kern_geom, OID_AUTO, dev, CTLFLAG_RW, 0, "GEOM_DEV stuff"); SYSCTL_QUAD(_kern_geom_dev, OID_AUTO, delete_max_sectors, CTLFLAG_RW, &g_dev_del_max_sectors, 0, "Maximum number of sectors in a single " "delete request sent to the provider. Larger requests are chunked " "so they can be interrupted. 
(0 = disable chunking)"); static char *dumpdev = NULL; static void g_dev_init(struct g_class *mp) { dumpdev = kern_getenv("dumpdev"); } static void g_dev_fini(struct g_class *mp) { freeenv(dumpdev); dumpdev = NULL; } static int g_dev_setdumpdev(struct cdev *dev, struct diocskerneldump_arg *kda, struct thread *td) { struct g_kerneldump kd; struct g_consumer *cp; int error, len; if (dev == NULL || kda == NULL) return (clear_dumper(td)); cp = dev->si_drv2; len = sizeof(kd); memset(&kd, 0, len); kd.offset = 0; kd.length = OFF_MAX; error = g_io_getattr("GEOM::kerneldump", cp, &len, &kd); if (error != 0) return (error); error = set_dumper(&kd.di, devtoname(dev), td, kda->kda_compression, kda->kda_encryption, kda->kda_key, kda->kda_encryptedkeysize, kda->kda_encryptedkey); if (error == 0) dev->si_flags |= SI_DUMPDEV; return (error); } static int init_dumpdev(struct cdev *dev) { struct diocskerneldump_arg kda; struct g_consumer *cp; const char *devprefix = "/dev/", *devname; int error; size_t len; bzero(&kda, sizeof(kda)); kda.kda_enable = 1; if (dumpdev == NULL) return (0); len = strlen(devprefix); devname = devtoname(dev); if (strcmp(devname, dumpdev) != 0 && (strncmp(dumpdev, devprefix, len) != 0 || strcmp(devname, dumpdev + len) != 0)) return (0); cp = (struct g_consumer *)dev->si_drv2; error = g_access(cp, 1, 0, 0); if (error != 0) return (error); error = g_dev_setdumpdev(dev, &kda, curthread); if (error == 0) { freeenv(dumpdev); dumpdev = NULL; } (void)g_access(cp, -1, 0, 0); return (error); } static void g_dev_destroy(void *arg, int flags __unused) { struct g_consumer *cp; struct g_geom *gp; struct g_dev_softc *sc; char buf[SPECNAMELEN + 6]; g_topology_assert(); cp = arg; gp = cp->geom; sc = cp->private; g_trace(G_T_TOPOLOGY, "g_dev_destroy(%p(%s))", cp, gp->name); snprintf(buf, sizeof(buf), "cdev=%s", gp->name); devctl_notify_f("GEOM", "DEV", "DESTROY", buf, M_WAITOK); if (cp->acr > 0 || cp->acw > 0 || cp->ace > 0) g_access(cp, -cp->acr, -cp->acw, -cp->ace); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); mtx_destroy(&sc->sc_mtx); g_free(sc); } void g_dev_print(void) { struct g_geom *gp; char const *p = ""; LIST_FOREACH(gp, &g_dev_class.geom, geom) { printf("%s%s", p, gp->name); p = " "; } printf("\n"); } static void g_dev_set_physpath(struct g_consumer *cp) { struct g_dev_softc *sc; char *physpath; int error, physpath_len; if (g_access(cp, 1, 0, 0) != 0) return; sc = cp->private; physpath_len = MAXPATHLEN; physpath = g_malloc(physpath_len, M_WAITOK|M_ZERO); error = g_io_getattr("GEOM::physpath", cp, &physpath_len, physpath); g_access(cp, -1, 0, 0); if (error == 0 && strlen(physpath) != 0) { struct cdev *dev, *old_alias_dev; struct cdev **alias_devp; dev = sc->sc_dev; old_alias_dev = sc->sc_alias; alias_devp = (struct cdev **)&sc->sc_alias; make_dev_physpath_alias(MAKEDEV_WAITOK, alias_devp, dev, old_alias_dev, physpath); } else if (sc->sc_alias) { destroy_dev((struct cdev *)sc->sc_alias); sc->sc_alias = NULL; } g_free(physpath); } static void g_dev_set_media(struct g_consumer *cp) { struct g_dev_softc *sc; struct cdev *dev; char buf[SPECNAMELEN + 6]; sc = cp->private; dev = sc->sc_dev; snprintf(buf, sizeof(buf), "cdev=%s", dev->si_name); devctl_notify_f("DEVFS", "CDEV", "MEDIACHANGE", buf, M_WAITOK); devctl_notify_f("GEOM", "DEV", "MEDIACHANGE", buf, M_WAITOK); dev = sc->sc_alias; if (dev != NULL) { snprintf(buf, sizeof(buf), "cdev=%s", dev->si_name); devctl_notify_f("DEVFS", "CDEV", "MEDIACHANGE", buf, M_WAITOK); devctl_notify_f("GEOM", "DEV", "MEDIACHANGE", buf, M_WAITOK); } } 
static void g_dev_attrchanged(struct g_consumer *cp, const char *attr) { if (strcmp(attr, "GEOM::media") == 0) { g_dev_set_media(cp); return; } if (strcmp(attr, "GEOM::physpath") == 0) { g_dev_set_physpath(cp); return; } } static void g_dev_resize(struct g_consumer *cp) { char buf[SPECNAMELEN + 6]; snprintf(buf, sizeof(buf), "cdev=%s", cp->provider->name); devctl_notify_f("GEOM", "DEV", "SIZECHANGE", buf, M_WAITOK); } struct g_provider * g_dev_getprovider(struct cdev *dev) { struct g_consumer *cp; g_topology_assert(); if (dev == NULL) return (NULL); if (dev->si_devsw != &g_dev_cdevsw) return (NULL); cp = dev->si_drv2; return (cp->provider); } static struct g_geom * g_dev_taste(struct g_class *mp, struct g_provider *pp, int insist __unused) { struct g_geom *gp; struct g_geom_alias *gap; struct g_consumer *cp; struct g_dev_softc *sc; int error; struct cdev *dev, *adev; char buf[SPECNAMELEN + 6]; g_trace(G_T_TOPOLOGY, "dev_taste(%s,%s)", mp->name, pp->name); g_topology_assert(); gp = g_new_geomf(mp, "%s", pp->name); sc = g_malloc(sizeof(*sc), M_WAITOK | M_ZERO); mtx_init(&sc->sc_mtx, "g_dev", NULL, MTX_DEF); cp = g_new_consumer(gp); cp->private = sc; cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE; error = g_attach(cp, pp); KASSERT(error == 0, ("g_dev_taste(%s) failed to g_attach, err=%d", pp->name, error)); error = make_dev_p(MAKEDEV_CHECKNAME | MAKEDEV_WAITOK, &dev, &g_dev_cdevsw, NULL, UID_ROOT, GID_OPERATOR, 0640, "%s", gp->name); if (error != 0) { printf("%s: make_dev_p() failed (gp->name=%s, error=%d)\n", __func__, gp->name, error); g_detach(cp); g_destroy_consumer(cp); g_destroy_geom(gp); mtx_destroy(&sc->sc_mtx); g_free(sc); return (NULL); } dev->si_flags |= SI_UNMAPPED; sc->sc_dev = dev; dev->si_iosize_max = MAXPHYS; dev->si_drv2 = cp; error = init_dumpdev(dev); if (error != 0) printf("%s: init_dumpdev() failed (gp->name=%s, error=%d)\n", __func__, gp->name, error); g_dev_attrchanged(cp, "GEOM::physpath"); snprintf(buf, sizeof(buf), "cdev=%s", gp->name); devctl_notify_f("GEOM", "DEV", "CREATE", buf, M_WAITOK); /* * Now add all the aliases for this drive */ LIST_FOREACH(gap, &pp->geom->aliases, ga_next) { error = make_dev_alias_p(MAKEDEV_CHECKNAME | MAKEDEV_WAITOK, &adev, dev, "%s", gap->ga_alias); if (error) { printf("%s: make_dev_alias_p() failed (name=%s, error=%d)\n", __func__, gap->ga_alias, error); continue; } snprintf(buf, sizeof(buf), "cdev=%s", gap->ga_alias); devctl_notify_f("GEOM", "DEV", "CREATE", buf, M_WAITOK); } return (gp); } static int g_dev_open(struct cdev *dev, int flags, int fmt, struct thread *td) { struct g_consumer *cp; struct g_dev_softc *sc; int error, r, w, e; cp = dev->si_drv2; if (cp == NULL) return (ENXIO); /* g_dev_taste() not done yet */ g_trace(G_T_ACCESS, "g_dev_open(%s, %d, %d, %p)", cp->geom->name, flags, fmt, td); r = flags & FREAD ? 1 : 0; w = flags & FWRITE ? 1 : 0; #ifdef notyet e = flags & O_EXCL ? 1 : 0; #else e = 0; #endif /* * This happens on attempt to open a device node with O_EXEC. */ if (r + w + e == 0) return (EINVAL); if (w) { /* * When running in very secure mode, do not allow * opens for writing of any disks. 
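 * securelevel_ge(td->td_ucred, 2) returns EPERM at securelevel 2 or
 * higher, so such opens fail before any g_access() counts are taken.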
*/ error = securelevel_ge(td->td_ucred, 2); if (error) return (error); } g_topology_lock(); error = g_access(cp, r, w, e); g_topology_unlock(); if (error == 0) { sc = cp->private; mtx_lock(&sc->sc_mtx); if (sc->sc_open == 0 && (sc->sc_active & SC_A_ACTIVE) != 0) wakeup(&sc->sc_active); sc->sc_open += r + w + e; if (sc->sc_open == 0) atomic_clear_int(&sc->sc_active, SC_A_OPEN); else atomic_set_int(&sc->sc_active, SC_A_OPEN); mtx_unlock(&sc->sc_mtx); } return (error); } static int g_dev_close(struct cdev *dev, int flags, int fmt, struct thread *td) { struct g_consumer *cp; struct g_dev_softc *sc; int error, r, w, e; cp = dev->si_drv2; if (cp == NULL) return (ENXIO); g_trace(G_T_ACCESS, "g_dev_close(%s, %d, %d, %p)", cp->geom->name, flags, fmt, td); r = flags & FREAD ? -1 : 0; w = flags & FWRITE ? -1 : 0; #ifdef notyet e = flags & O_EXCL ? -1 : 0; #else e = 0; #endif /* * The vgonel(9) - caused by eg. forced unmount of devfs - calls * VOP_CLOSE(9) on devfs vnode without any FREAD or FWRITE flags, * which would result in zero deltas, which in turn would cause * panic in g_access(9). * * Note that we cannot zero the counters (ie. do "r = cp->acr" * etc) instead, because the consumer might be opened in another * devfs instance. */ if (r + w + e == 0) return (EINVAL); sc = cp->private; mtx_lock(&sc->sc_mtx); sc->sc_open += r + w + e; if (sc->sc_open == 0) atomic_clear_int(&sc->sc_active, SC_A_OPEN); else atomic_set_int(&sc->sc_active, SC_A_OPEN); while (sc->sc_open == 0 && (sc->sc_active & SC_A_ACTIVE) != 0) msleep(&sc->sc_active, &sc->sc_mtx, 0, "g_dev_close", hz / 10); mtx_unlock(&sc->sc_mtx); g_topology_lock(); error = g_access(cp, r, w, e); g_topology_unlock(); return (error); } /* * XXX: Until we have unmessed the ioctl situation, there is a race against * XXX: a concurrent orphanization. We cannot close it by holding topology * XXX: since that would prevent us from doing our job, and stalling events * XXX: will break (actually: stall) the BSD disklabel hacks. 
*/ static int g_dev_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int fflag, struct thread *td) { struct g_consumer *cp; struct g_provider *pp; off_t offset, length, chunk, odd; int i, error; cp = dev->si_drv2; pp = cp->provider; error = 0; KASSERT(cp->acr || cp->acw, ("Consumer with zero access count in g_dev_ioctl")); i = IOCPARM_LEN(cmd); switch (cmd) { case DIOCGSECTORSIZE: *(u_int *)data = cp->provider->sectorsize; if (*(u_int *)data == 0) error = ENOENT; break; case DIOCGMEDIASIZE: *(off_t *)data = cp->provider->mediasize; if (*(off_t *)data == 0) error = ENOENT; break; case DIOCGFWSECTORS: error = g_io_getattr("GEOM::fwsectors", cp, &i, data); if (error == 0 && *(u_int *)data == 0) error = ENOENT; break; case DIOCGFWHEADS: error = g_io_getattr("GEOM::fwheads", cp, &i, data); if (error == 0 && *(u_int *)data == 0) error = ENOENT; break; case DIOCGFRONTSTUFF: error = g_io_getattr("GEOM::frontstuff", cp, &i, data); break; #ifdef COMPAT_FREEBSD11 case DIOCSKERNELDUMP_FREEBSD11: { struct diocskerneldump_arg kda; bzero(&kda, sizeof(kda)); kda.kda_encryption = KERNELDUMP_ENC_NONE; kda.kda_enable = (uint8_t)*(u_int *)data; if (kda.kda_enable == 0) error = g_dev_setdumpdev(NULL, NULL, td); else error = g_dev_setdumpdev(dev, &kda, td); break; } #endif case DIOCSKERNELDUMP: { struct diocskerneldump_arg *kda; uint8_t *encryptedkey; kda = (struct diocskerneldump_arg *)data; if (kda->kda_enable == 0) { error = g_dev_setdumpdev(NULL, NULL, td); break; } if (kda->kda_encryption != KERNELDUMP_ENC_NONE) { if (kda->kda_encryptedkeysize <= 0 || kda->kda_encryptedkeysize > KERNELDUMP_ENCKEY_MAX_SIZE) { return (EINVAL); } encryptedkey = malloc(kda->kda_encryptedkeysize, M_TEMP, M_WAITOK); error = copyin(kda->kda_encryptedkey, encryptedkey, kda->kda_encryptedkeysize); } else { encryptedkey = NULL; } if (error == 0) { kda->kda_encryptedkey = encryptedkey; error = g_dev_setdumpdev(dev, kda, td); } if (encryptedkey != NULL) { explicit_bzero(encryptedkey, kda->kda_encryptedkeysize); free(encryptedkey, M_TEMP); } explicit_bzero(kda, sizeof(*kda)); break; } case DIOCGFLUSH: error = g_io_flush(cp); break; case DIOCGDELETE: offset = ((off_t *)data)[0]; length = ((off_t *)data)[1]; if ((offset % cp->provider->sectorsize) != 0 || (length % cp->provider->sectorsize) != 0 || length <= 0) { printf("%s: offset=%jd length=%jd\n", __func__, offset, length); error = EINVAL; break; } if ((cp->provider->mediasize > 0) && (offset >= cp->provider->mediasize)) { /* * Catch out-of-bounds requests here. The problem is * that due to historical GEOM I/O implementation * peculatities, g_delete_data() would always return * success for requests starting just the next byte * after providers media boundary. Condition check on * non-zero media size, since that condition would * (most likely) cause ENXIO instead. */ error = EIO; break; } while (length > 0) { chunk = length; if (g_dev_del_max_sectors != 0 && chunk > g_dev_del_max_sectors * cp->provider->sectorsize) { chunk = g_dev_del_max_sectors * cp->provider->sectorsize; if (cp->provider->stripesize > 0) { odd = (offset + chunk + cp->provider->stripeoffset) % cp->provider->stripesize; if (chunk > odd) chunk -= odd; } } error = g_delete_data(cp, offset, chunk); length -= chunk; offset += chunk; if (error) break; /* * Since the request size can be large, the service * time can be is likewise. We make this ioctl * interruptible by checking for signals for each bio. 
*/ if (SIGPENDING(td)) break; } break; case DIOCGIDENT: error = g_io_getattr("GEOM::ident", cp, &i, data); break; case DIOCGPROVIDERNAME: if (pp == NULL) return (ENOENT); strlcpy(data, pp->name, i); break; case DIOCGSTRIPESIZE: *(off_t *)data = cp->provider->stripesize; break; case DIOCGSTRIPEOFFSET: *(off_t *)data = cp->provider->stripeoffset; break; case DIOCGPHYSPATH: error = g_io_getattr("GEOM::physpath", cp, &i, data); if (error == 0 && *(char *)data == '\0') error = ENOENT; break; case DIOCGATTR: { struct diocgattr_arg *arg = (struct diocgattr_arg *)data; if (arg->len > sizeof(arg->value)) { error = EINVAL; break; } error = g_io_getattr(arg->name, cp, &arg->len, &arg->value); break; } case DIOCZONECMD: { struct disk_zone_args *zone_args =(struct disk_zone_args *)data; struct disk_zone_rep_entry *new_entries, *old_entries; struct disk_zone_report *rep; size_t alloc_size; old_entries = NULL; new_entries = NULL; rep = NULL; alloc_size = 0; if (zone_args->zone_cmd == DISK_ZONE_REPORT_ZONES) { rep = &zone_args->zone_params.report; #define MAXENTRIES (MAXPHYS / sizeof(struct disk_zone_rep_entry)) if (rep->entries_allocated > MAXENTRIES) rep->entries_allocated = MAXENTRIES; alloc_size = rep->entries_allocated * sizeof(struct disk_zone_rep_entry); if (alloc_size != 0) new_entries = g_malloc(alloc_size, M_WAITOK| M_ZERO); old_entries = rep->entries; rep->entries = new_entries; } error = g_io_zonecmd(zone_args, cp); if (zone_args->zone_cmd == DISK_ZONE_REPORT_ZONES && alloc_size != 0 && error == 0) error = copyout(new_entries, old_entries, alloc_size); if (old_entries != NULL && rep != NULL) rep->entries = old_entries; if (new_entries != NULL) g_free(new_entries); break; } default: if (cp->provider->geom->ioctl != NULL) { error = cp->provider->geom->ioctl(cp->provider, cmd, data, fflag, td); } else { error = ENOIOCTL; } } return (error); } static void g_dev_done(struct bio *bp2) { struct g_consumer *cp; struct g_dev_softc *sc; struct bio *bp; int active; cp = bp2->bio_from; sc = cp->private; bp = bp2->bio_parent; bp->bio_error = bp2->bio_error; bp->bio_completed = bp2->bio_completed; bp->bio_resid = bp->bio_length - bp2->bio_completed; if (bp2->bio_cmd == BIO_ZONE) bcopy(&bp2->bio_zone, &bp->bio_zone, sizeof(bp->bio_zone)); if (bp2->bio_error != 0) { g_trace(G_T_BIO, "g_dev_done(%p) had error %d", bp2, bp2->bio_error); bp->bio_flags |= BIO_ERROR; } else { g_trace(G_T_BIO, "g_dev_done(%p/%p) resid %ld completed %jd", bp2, bp, bp2->bio_resid, (intmax_t)bp2->bio_completed); } g_destroy_bio(bp2); active = atomic_fetchadd_int(&sc->sc_active, -1) - 1; if ((active & SC_A_ACTIVE) == 0) { if ((active & SC_A_OPEN) == 0) wakeup(&sc->sc_active); if (active & SC_A_DESTROY) g_post_event(g_dev_destroy, cp, M_NOWAIT, NULL); } biodone(bp); } static void g_dev_strategy(struct bio *bp) { struct g_consumer *cp; struct bio *bp2; struct cdev *dev; struct g_dev_softc *sc; KASSERT(bp->bio_cmd == BIO_READ || bp->bio_cmd == BIO_WRITE || bp->bio_cmd == BIO_DELETE || bp->bio_cmd == BIO_FLUSH || bp->bio_cmd == BIO_ZONE, ("Wrong bio_cmd bio=%p cmd=%d", bp, bp->bio_cmd)); dev = bp->bio_dev; cp = dev->si_drv2; sc = cp->private; KASSERT(cp->acr || cp->acw, ("Consumer with zero access count in g_dev_strategy")); biotrack(bp, __func__); #ifdef INVARIANTS if ((bp->bio_offset % cp->provider->sectorsize) != 0 || (bp->bio_bcount % cp->provider->sectorsize) != 0) { bp->bio_resid = bp->bio_bcount; biofinish(bp, NULL, EINVAL); return; } #endif KASSERT(sc->sc_open > 0, ("Closed device in g_dev_strategy")); 
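	/*
	 * The low SC_A_ACTIVE bits of sc_active count in-flight bios; the
	 * top bits (SC_A_OPEN, SC_A_DESTROY) are flags.  g_dev_done() drops
	 * the count and, once it reaches zero with SC_A_DESTROY set,
	 * schedules the deferred geom destruction.
	 */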
atomic_add_int(&sc->sc_active, 1); for (;;) { /* * XXX: This is not an ideal solution, but I believe it to * XXX: deadlock safely, all things considered. */ bp2 = g_clone_bio(bp); if (bp2 != NULL) break; pause("gdstrat", hz / 10); } KASSERT(bp2 != NULL, ("XXX: ENOMEM in a bad place")); bp2->bio_done = g_dev_done; g_trace(G_T_BIO, "g_dev_strategy(%p/%p) offset %jd length %jd data %p cmd %d", bp, bp2, (intmax_t)bp->bio_offset, (intmax_t)bp2->bio_length, bp2->bio_data, bp2->bio_cmd); g_io_request(bp2, cp); KASSERT(cp->acr || cp->acw, ("g_dev_strategy raced with g_dev_close and lost")); } /* * g_dev_callback() * * Called by devfs when asynchronous device destruction is completed. * - Mark that we have no attached device any more. * - If there are no outstanding requests, schedule geom destruction. * Otherwise destruction will be scheduled later by g_dev_done(). */ static void g_dev_callback(void *arg) { struct g_consumer *cp; struct g_dev_softc *sc; int active; cp = arg; sc = cp->private; g_trace(G_T_TOPOLOGY, "g_dev_callback(%p(%s))", cp, cp->geom->name); sc->sc_dev = NULL; sc->sc_alias = NULL; active = atomic_fetchadd_int(&sc->sc_active, SC_A_DESTROY); if ((active & SC_A_ACTIVE) == 0) g_post_event(g_dev_destroy, cp, M_WAITOK, NULL); } /* * g_dev_orphan() * * Called from below when the provider orphaned us. * - Clear any dump settings. * - Request asynchronous device destruction to prevent any more requests * from coming in. The provider is already marked with an error, so * anything which comes in the interim will be returned immediately. */ static void g_dev_orphan(struct g_consumer *cp) { struct cdev *dev; struct g_dev_softc *sc; g_topology_assert(); sc = cp->private; dev = sc->sc_dev; g_trace(G_T_TOPOLOGY, "g_dev_orphan(%p(%s))", cp, cp->geom->name); /* Reset any dump-area set on this device */ if (dev->si_flags & SI_DUMPDEV) (void)clear_dumper(curthread); /* Destroy the struct cdev *so we get no more requests */ + delist_dev(dev); destroy_dev_sched_cb(dev, g_dev_callback, cp); } DECLARE_GEOM_CLASS(g_dev_class, g_dev); Index: user/ngie/bug-237403/sys/kern/kern_sig.c =================================================================== --- user/ngie/bug-237403/sys/kern/kern_sig.c (revision 346925) +++ user/ngie/bug-237403/sys/kern/kern_sig.c (revision 346926) @@ -1,3840 +1,3838 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_sig.c 8.7 (Berkeley) 4/18/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_ktrace.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define ONSIG 32 /* NSIG for osig* syscalls. XXX. */ SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE3(proc, , , signal__send, "struct thread *", "struct proc *", "int"); SDT_PROBE_DEFINE2(proc, , , signal__clear, "int", "ksiginfo_t *"); SDT_PROBE_DEFINE3(proc, , , signal__discard, "struct thread *", "struct proc *", "int"); static int coredump(struct thread *); static int killpg1(struct thread *td, int sig, int pgid, int all, ksiginfo_t *ksi); static int issignal(struct thread *td); static int sigprop(int sig); static void tdsigwakeup(struct thread *, int, sig_t, int); static int sig_suspend_threads(struct thread *, struct proc *, int); static int filt_sigattach(struct knote *kn); static void filt_sigdetach(struct knote *kn); static int filt_signal(struct knote *kn, long hint); static struct thread *sigtd(struct proc *p, int sig, int prop); static void sigqueue_start(void); static uma_zone_t ksiginfo_zone = NULL; struct filterops sig_filtops = { .f_isfd = 0, .f_attach = filt_sigattach, .f_detach = filt_sigdetach, .f_event = filt_signal, }; static int kern_logsigexit = 1; SYSCTL_INT(_kern, KERN_LOGSIGEXIT, logsigexit, CTLFLAG_RW, &kern_logsigexit, 0, "Log processes quitting on abnormal signals to syslog(3)"); static int kern_forcesigexit = 1; SYSCTL_INT(_kern, OID_AUTO, forcesigexit, CTLFLAG_RW, &kern_forcesigexit, 0, "Force trap signal to be handled"); static SYSCTL_NODE(_kern, OID_AUTO, sigqueue, CTLFLAG_RW, 0, "POSIX real time signal"); static int max_pending_per_proc = 128; SYSCTL_INT(_kern_sigqueue, OID_AUTO, max_pending_per_proc, CTLFLAG_RW, &max_pending_per_proc, 0, "Max pending signals per proc"); static int preallocate_siginfo = 1024; SYSCTL_INT(_kern_sigqueue, OID_AUTO, preallocate, CTLFLAG_RDTUN, &preallocate_siginfo, 0, "Preallocated signal memory size"); static int signal_overflow = 0; SYSCTL_INT(_kern_sigqueue, OID_AUTO, overflow, CTLFLAG_RD, &signal_overflow, 0, "Number of signals overflew"); static int signal_alloc_fail = 0; SYSCTL_INT(_kern_sigqueue, OID_AUTO, alloc_fail, CTLFLAG_RD, &signal_alloc_fail, 0, "signals failed to be allocated"); static int kern_lognosys = 0; SYSCTL_INT(_kern, OID_AUTO, lognosys, CTLFLAG_RWTUN, &kern_lognosys, 0, "Log invalid syscalls"); SYSINIT(signal, SI_SUB_P1003_1B, SI_ORDER_FIRST+3, 
sigqueue_start, NULL); /* * Policy -- Can ucred cr1 send SIGIO to process cr2? * Should use cr_cansignal() once cr_cansignal() allows SIGIO and SIGURG * in the right situations. */ #define CANSIGIO(cr1, cr2) \ ((cr1)->cr_uid == 0 || \ (cr1)->cr_ruid == (cr2)->cr_ruid || \ (cr1)->cr_uid == (cr2)->cr_ruid || \ (cr1)->cr_ruid == (cr2)->cr_uid || \ (cr1)->cr_uid == (cr2)->cr_uid) static int sugid_coredump; SYSCTL_INT(_kern, OID_AUTO, sugid_coredump, CTLFLAG_RWTUN, &sugid_coredump, 0, "Allow setuid and setgid processes to dump core"); static int capmode_coredump; SYSCTL_INT(_kern, OID_AUTO, capmode_coredump, CTLFLAG_RWTUN, &capmode_coredump, 0, "Allow processes in capability mode to dump core"); static int do_coredump = 1; SYSCTL_INT(_kern, OID_AUTO, coredump, CTLFLAG_RW, &do_coredump, 0, "Enable/Disable coredumps"); static int set_core_nodump_flag = 0; SYSCTL_INT(_kern, OID_AUTO, nodump_coredump, CTLFLAG_RW, &set_core_nodump_flag, 0, "Enable setting the NODUMP flag on coredump files"); static int coredump_devctl = 0; SYSCTL_INT(_kern, OID_AUTO, coredump_devctl, CTLFLAG_RW, &coredump_devctl, 0, "Generate a devctl notification when processes coredump"); /* * Signal properties and actions. * The array below categorizes the signals and their default actions * according to the following properties: */ #define SIGPROP_KILL 0x01 /* terminates process by default */ #define SIGPROP_CORE 0x02 /* ditto and coredumps */ #define SIGPROP_STOP 0x04 /* suspend process */ #define SIGPROP_TTYSTOP 0x08 /* ditto, from tty */ #define SIGPROP_IGNORE 0x10 /* ignore by default */ #define SIGPROP_CONT 0x20 /* continue if suspended */ #define SIGPROP_CANTMASK 0x40 /* non-maskable, catchable */ static int sigproptbl[NSIG] = { [SIGHUP] = SIGPROP_KILL, [SIGINT] = SIGPROP_KILL, [SIGQUIT] = SIGPROP_KILL | SIGPROP_CORE, [SIGILL] = SIGPROP_KILL | SIGPROP_CORE, [SIGTRAP] = SIGPROP_KILL | SIGPROP_CORE, [SIGABRT] = SIGPROP_KILL | SIGPROP_CORE, [SIGEMT] = SIGPROP_KILL | SIGPROP_CORE, [SIGFPE] = SIGPROP_KILL | SIGPROP_CORE, [SIGKILL] = SIGPROP_KILL, [SIGBUS] = SIGPROP_KILL | SIGPROP_CORE, [SIGSEGV] = SIGPROP_KILL | SIGPROP_CORE, [SIGSYS] = SIGPROP_KILL | SIGPROP_CORE, [SIGPIPE] = SIGPROP_KILL, [SIGALRM] = SIGPROP_KILL, [SIGTERM] = SIGPROP_KILL, [SIGURG] = SIGPROP_IGNORE, [SIGSTOP] = SIGPROP_STOP, [SIGTSTP] = SIGPROP_STOP | SIGPROP_TTYSTOP, [SIGCONT] = SIGPROP_IGNORE | SIGPROP_CONT, [SIGCHLD] = SIGPROP_IGNORE, [SIGTTIN] = SIGPROP_STOP | SIGPROP_TTYSTOP, [SIGTTOU] = SIGPROP_STOP | SIGPROP_TTYSTOP, [SIGIO] = SIGPROP_IGNORE, [SIGXCPU] = SIGPROP_KILL, [SIGXFSZ] = SIGPROP_KILL, [SIGVTALRM] = SIGPROP_KILL, [SIGPROF] = SIGPROP_KILL, [SIGWINCH] = SIGPROP_IGNORE, [SIGINFO] = SIGPROP_IGNORE, [SIGUSR1] = SIGPROP_KILL, [SIGUSR2] = SIGPROP_KILL, }; static void reschedule_signals(struct proc *p, sigset_t block, int flags); static void sigqueue_start(void) { ksiginfo_zone = uma_zcreate("ksiginfo", sizeof(ksiginfo_t), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); uma_prealloc(ksiginfo_zone, preallocate_siginfo); p31b_setcfg(CTL_P1003_1B_REALTIME_SIGNALS, _POSIX_REALTIME_SIGNALS); p31b_setcfg(CTL_P1003_1B_RTSIG_MAX, SIGRTMAX - SIGRTMIN + 1); p31b_setcfg(CTL_P1003_1B_SIGQUEUE_MAX, max_pending_per_proc); } ksiginfo_t * ksiginfo_alloc(int wait) { int flags; flags = M_ZERO; if (! 
wait) flags |= M_NOWAIT; if (ksiginfo_zone != NULL) return ((ksiginfo_t *)uma_zalloc(ksiginfo_zone, flags)); return (NULL); } void ksiginfo_free(ksiginfo_t *ksi) { uma_zfree(ksiginfo_zone, ksi); } static __inline int ksiginfo_tryfree(ksiginfo_t *ksi) { if (!(ksi->ksi_flags & KSI_EXT)) { uma_zfree(ksiginfo_zone, ksi); return (1); } return (0); } void sigqueue_init(sigqueue_t *list, struct proc *p) { SIGEMPTYSET(list->sq_signals); SIGEMPTYSET(list->sq_kill); SIGEMPTYSET(list->sq_ptrace); TAILQ_INIT(&list->sq_list); list->sq_proc = p; list->sq_flags = SQ_INIT; } /* * Get a signal's ksiginfo. * Return: * 0 - signal not found * others - signal number */ static int sigqueue_get(sigqueue_t *sq, int signo, ksiginfo_t *si) { struct proc *p = sq->sq_proc; struct ksiginfo *ksi, *next; int count = 0; KASSERT(sq->sq_flags & SQ_INIT, ("sigqueue not inited")); if (!SIGISMEMBER(sq->sq_signals, signo)) return (0); if (SIGISMEMBER(sq->sq_ptrace, signo)) { count++; SIGDELSET(sq->sq_ptrace, signo); si->ksi_flags |= KSI_PTRACE; } if (SIGISMEMBER(sq->sq_kill, signo)) { count++; if (count == 1) SIGDELSET(sq->sq_kill, signo); } TAILQ_FOREACH_SAFE(ksi, &sq->sq_list, ksi_link, next) { if (ksi->ksi_signo == signo) { if (count == 0) { TAILQ_REMOVE(&sq->sq_list, ksi, ksi_link); ksi->ksi_sigq = NULL; ksiginfo_copy(ksi, si); if (ksiginfo_tryfree(ksi) && p != NULL) p->p_pendingcnt--; } if (++count > 1) break; } } if (count <= 1) SIGDELSET(sq->sq_signals, signo); si->ksi_signo = signo; return (signo); } void sigqueue_take(ksiginfo_t *ksi) { struct ksiginfo *kp; struct proc *p; sigqueue_t *sq; if (ksi == NULL || (sq = ksi->ksi_sigq) == NULL) return; p = sq->sq_proc; TAILQ_REMOVE(&sq->sq_list, ksi, ksi_link); ksi->ksi_sigq = NULL; if (!(ksi->ksi_flags & KSI_EXT) && p != NULL) p->p_pendingcnt--; for (kp = TAILQ_FIRST(&sq->sq_list); kp != NULL; kp = TAILQ_NEXT(kp, ksi_link)) { if (kp->ksi_signo == ksi->ksi_signo) break; } if (kp == NULL && !SIGISMEMBER(sq->sq_kill, ksi->ksi_signo) && !SIGISMEMBER(sq->sq_ptrace, ksi->ksi_signo)) SIGDELSET(sq->sq_signals, ksi->ksi_signo); } static int sigqueue_add(sigqueue_t *sq, int signo, ksiginfo_t *si) { struct proc *p = sq->sq_proc; struct ksiginfo *ksi; int ret = 0; KASSERT(sq->sq_flags & SQ_INIT, ("sigqueue not inited")); /* * SIGKILL/SIGSTOP cannot be caught or masked, so take the fast path * for these signals. 
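 * The fast path records the signal in sq_kill only, so no ksiginfo
 * allocation is needed and queueing cannot fail with EAGAIN.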
*/ if (signo == SIGKILL || signo == SIGSTOP || si == NULL) { SIGADDSET(sq->sq_kill, signo); goto out_set_bit; } /* directly insert the ksi, don't copy it */ if (si->ksi_flags & KSI_INS) { if (si->ksi_flags & KSI_HEAD) TAILQ_INSERT_HEAD(&sq->sq_list, si, ksi_link); else TAILQ_INSERT_TAIL(&sq->sq_list, si, ksi_link); si->ksi_sigq = sq; goto out_set_bit; } if (__predict_false(ksiginfo_zone == NULL)) { SIGADDSET(sq->sq_kill, signo); goto out_set_bit; } if (p != NULL && p->p_pendingcnt >= max_pending_per_proc) { signal_overflow++; ret = EAGAIN; } else if ((ksi = ksiginfo_alloc(0)) == NULL) { signal_alloc_fail++; ret = EAGAIN; } else { if (p != NULL) p->p_pendingcnt++; ksiginfo_copy(si, ksi); ksi->ksi_signo = signo; if (si->ksi_flags & KSI_HEAD) TAILQ_INSERT_HEAD(&sq->sq_list, ksi, ksi_link); else TAILQ_INSERT_TAIL(&sq->sq_list, ksi, ksi_link); ksi->ksi_sigq = sq; } if (ret != 0) { if ((si->ksi_flags & KSI_PTRACE) != 0) { SIGADDSET(sq->sq_ptrace, signo); ret = 0; goto out_set_bit; } else if ((si->ksi_flags & KSI_TRAP) != 0 || (si->ksi_flags & KSI_SIGQ) == 0) { SIGADDSET(sq->sq_kill, signo); ret = 0; goto out_set_bit; } return (ret); } out_set_bit: SIGADDSET(sq->sq_signals, signo); return (ret); } void sigqueue_flush(sigqueue_t *sq) { struct proc *p = sq->sq_proc; ksiginfo_t *ksi; KASSERT(sq->sq_flags & SQ_INIT, ("sigqueue not inited")); if (p != NULL) PROC_LOCK_ASSERT(p, MA_OWNED); while ((ksi = TAILQ_FIRST(&sq->sq_list)) != NULL) { TAILQ_REMOVE(&sq->sq_list, ksi, ksi_link); ksi->ksi_sigq = NULL; if (ksiginfo_tryfree(ksi) && p != NULL) p->p_pendingcnt--; } SIGEMPTYSET(sq->sq_signals); SIGEMPTYSET(sq->sq_kill); SIGEMPTYSET(sq->sq_ptrace); } static void sigqueue_move_set(sigqueue_t *src, sigqueue_t *dst, const sigset_t *set) { sigset_t tmp; struct proc *p1, *p2; ksiginfo_t *ksi, *next; KASSERT(src->sq_flags & SQ_INIT, ("src sigqueue not inited")); KASSERT(dst->sq_flags & SQ_INIT, ("dst sigqueue not inited")); p1 = src->sq_proc; p2 = dst->sq_proc; /* Move siginfo to target list */ TAILQ_FOREACH_SAFE(ksi, &src->sq_list, ksi_link, next) { if (SIGISMEMBER(*set, ksi->ksi_signo)) { TAILQ_REMOVE(&src->sq_list, ksi, ksi_link); if (p1 != NULL) p1->p_pendingcnt--; TAILQ_INSERT_TAIL(&dst->sq_list, ksi, ksi_link); ksi->ksi_sigq = dst; if (p2 != NULL) p2->p_pendingcnt++; } } /* Move pending bits to target list */ tmp = src->sq_kill; SIGSETAND(tmp, *set); SIGSETOR(dst->sq_kill, tmp); SIGSETNAND(src->sq_kill, tmp); tmp = src->sq_ptrace; SIGSETAND(tmp, *set); SIGSETOR(dst->sq_ptrace, tmp); SIGSETNAND(src->sq_ptrace, tmp); tmp = src->sq_signals; SIGSETAND(tmp, *set); SIGSETOR(dst->sq_signals, tmp); SIGSETNAND(src->sq_signals, tmp); } #if 0 static void sigqueue_move(sigqueue_t *src, sigqueue_t *dst, int signo) { sigset_t set; SIGEMPTYSET(set); SIGADDSET(set, signo); sigqueue_move_set(src, dst, &set); } #endif static void sigqueue_delete_set(sigqueue_t *sq, const sigset_t *set) { struct proc *p = sq->sq_proc; ksiginfo_t *ksi, *next; KASSERT(sq->sq_flags & SQ_INIT, ("src sigqueue not inited")); /* Remove siginfo queue */ TAILQ_FOREACH_SAFE(ksi, &sq->sq_list, ksi_link, next) { if (SIGISMEMBER(*set, ksi->ksi_signo)) { TAILQ_REMOVE(&sq->sq_list, ksi, ksi_link); ksi->ksi_sigq = NULL; if (ksiginfo_tryfree(ksi) && p != NULL) p->p_pendingcnt--; } } SIGSETNAND(sq->sq_kill, *set); SIGSETNAND(sq->sq_ptrace, *set); SIGSETNAND(sq->sq_signals, *set); } void sigqueue_delete(sigqueue_t *sq, int signo) { sigset_t set; SIGEMPTYSET(set); SIGADDSET(set, signo); sigqueue_delete_set(sq, &set); } /* Remove a set of signals for a process 
*/ static void sigqueue_delete_set_proc(struct proc *p, const sigset_t *set) { sigqueue_t worklist; struct thread *td0; PROC_LOCK_ASSERT(p, MA_OWNED); sigqueue_init(&worklist, NULL); sigqueue_move_set(&p->p_sigqueue, &worklist, set); FOREACH_THREAD_IN_PROC(p, td0) sigqueue_move_set(&td0->td_sigqueue, &worklist, set); sigqueue_flush(&worklist); } void sigqueue_delete_proc(struct proc *p, int signo) { sigset_t set; SIGEMPTYSET(set); SIGADDSET(set, signo); sigqueue_delete_set_proc(p, &set); } static void sigqueue_delete_stopmask_proc(struct proc *p) { sigset_t set; SIGEMPTYSET(set); SIGADDSET(set, SIGSTOP); SIGADDSET(set, SIGTSTP); SIGADDSET(set, SIGTTIN); SIGADDSET(set, SIGTTOU); sigqueue_delete_set_proc(p, &set); } /* * Determine signal that should be delivered to thread td, the current * thread, 0 if none. If there is a pending stop signal with default * action, the process stops in issignal(). */ int cursig(struct thread *td) { PROC_LOCK_ASSERT(td->td_proc, MA_OWNED); mtx_assert(&td->td_proc->p_sigacts->ps_mtx, MA_OWNED); THREAD_LOCK_ASSERT(td, MA_NOTOWNED); return (SIGPENDING(td) ? issignal(td) : 0); } /* * Arrange for ast() to handle unmasked pending signals on return to user * mode. This must be called whenever a signal is added to td_sigqueue or * unmasked in td_sigmask. */ void signotify(struct thread *td) { PROC_LOCK_ASSERT(td->td_proc, MA_OWNED); if (SIGPENDING(td)) { thread_lock(td); td->td_flags |= TDF_NEEDSIGCHK | TDF_ASTPENDING; thread_unlock(td); } } /* * Returns 1 (true) if altstack is configured for the thread, and the * passed stack bottom address falls into the altstack range. Handles * the 43 compat special case where the alt stack size is zero. */ int sigonstack(size_t sp) { struct thread *td; td = curthread; if ((td->td_pflags & TDP_ALTSTACK) == 0) return (0); #if defined(COMPAT_43) if (td->td_sigstk.ss_size == 0) return ((td->td_sigstk.ss_flags & SS_ONSTACK) != 0); #endif return (sp >= (size_t)td->td_sigstk.ss_sp && sp < td->td_sigstk.ss_size + (size_t)td->td_sigstk.ss_sp); } static __inline int sigprop(int sig) { if (sig > 0 && sig < nitems(sigproptbl)) return (sigproptbl[sig]); return (0); } int sig_ffs(sigset_t *set) { int i; for (i = 0; i < _SIG_WORDS; i++) if (set->__bits[i]) return (ffs(set->__bits[i]) + (i * 32)); return (0); } static bool sigact_flag_test(const struct sigaction *act, int flag) { /* * SA_SIGINFO is reset when signal disposition is set to * ignore or default. Other flags are kept according to user * settings. 
*/ return ((act->sa_flags & flag) != 0 && (flag != SA_SIGINFO || ((__sighandler_t *)act->sa_sigaction != SIG_IGN && (__sighandler_t *)act->sa_sigaction != SIG_DFL))); } /* * kern_sigaction * sigaction * freebsd4_sigaction * osigaction */ int kern_sigaction(struct thread *td, int sig, const struct sigaction *act, struct sigaction *oact, int flags) { struct sigacts *ps; struct proc *p = td->td_proc; if (!_SIG_VALID(sig)) return (EINVAL); if (act != NULL && act->sa_handler != SIG_DFL && act->sa_handler != SIG_IGN && (act->sa_flags & ~(SA_ONSTACK | SA_RESTART | SA_RESETHAND | SA_NOCLDSTOP | SA_NODEFER | SA_NOCLDWAIT | SA_SIGINFO)) != 0) return (EINVAL); PROC_LOCK(p); ps = p->p_sigacts; mtx_lock(&ps->ps_mtx); if (oact) { memset(oact, 0, sizeof(*oact)); oact->sa_mask = ps->ps_catchmask[_SIG_IDX(sig)]; if (SIGISMEMBER(ps->ps_sigonstack, sig)) oact->sa_flags |= SA_ONSTACK; if (!SIGISMEMBER(ps->ps_sigintr, sig)) oact->sa_flags |= SA_RESTART; if (SIGISMEMBER(ps->ps_sigreset, sig)) oact->sa_flags |= SA_RESETHAND; if (SIGISMEMBER(ps->ps_signodefer, sig)) oact->sa_flags |= SA_NODEFER; if (SIGISMEMBER(ps->ps_siginfo, sig)) { oact->sa_flags |= SA_SIGINFO; oact->sa_sigaction = (__siginfohandler_t *)ps->ps_sigact[_SIG_IDX(sig)]; } else oact->sa_handler = ps->ps_sigact[_SIG_IDX(sig)]; if (sig == SIGCHLD && ps->ps_flag & PS_NOCLDSTOP) oact->sa_flags |= SA_NOCLDSTOP; if (sig == SIGCHLD && ps->ps_flag & PS_NOCLDWAIT) oact->sa_flags |= SA_NOCLDWAIT; } if (act) { if ((sig == SIGKILL || sig == SIGSTOP) && act->sa_handler != SIG_DFL) { mtx_unlock(&ps->ps_mtx); PROC_UNLOCK(p); return (EINVAL); } /* * Change setting atomically. */ ps->ps_catchmask[_SIG_IDX(sig)] = act->sa_mask; SIG_CANTMASK(ps->ps_catchmask[_SIG_IDX(sig)]); if (sigact_flag_test(act, SA_SIGINFO)) { ps->ps_sigact[_SIG_IDX(sig)] = (__sighandler_t *)act->sa_sigaction; SIGADDSET(ps->ps_siginfo, sig); } else { ps->ps_sigact[_SIG_IDX(sig)] = act->sa_handler; SIGDELSET(ps->ps_siginfo, sig); } if (!sigact_flag_test(act, SA_RESTART)) SIGADDSET(ps->ps_sigintr, sig); else SIGDELSET(ps->ps_sigintr, sig); if (sigact_flag_test(act, SA_ONSTACK)) SIGADDSET(ps->ps_sigonstack, sig); else SIGDELSET(ps->ps_sigonstack, sig); if (sigact_flag_test(act, SA_RESETHAND)) SIGADDSET(ps->ps_sigreset, sig); else SIGDELSET(ps->ps_sigreset, sig); if (sigact_flag_test(act, SA_NODEFER)) SIGADDSET(ps->ps_signodefer, sig); else SIGDELSET(ps->ps_signodefer, sig); if (sig == SIGCHLD) { if (act->sa_flags & SA_NOCLDSTOP) ps->ps_flag |= PS_NOCLDSTOP; else ps->ps_flag &= ~PS_NOCLDSTOP; if (act->sa_flags & SA_NOCLDWAIT) { /* * Paranoia: since SA_NOCLDWAIT is implemented * by reparenting the dying child to PID 1 (and * trust it to reap the zombie), PID 1 itself * is forbidden to set SA_NOCLDWAIT. */ if (p->p_pid == 1) ps->ps_flag &= ~PS_NOCLDWAIT; else ps->ps_flag |= PS_NOCLDWAIT; } else ps->ps_flag &= ~PS_NOCLDWAIT; if (ps->ps_sigact[_SIG_IDX(SIGCHLD)] == SIG_IGN) ps->ps_flag |= PS_CLDSIGIGN; else ps->ps_flag &= ~PS_CLDSIGIGN; } /* * Set bit in ps_sigignore for signals that are set to SIG_IGN, * and for signals set to SIG_DFL where the default is to * ignore. However, don't put SIGCONT in ps_sigignore, as we * have to restart the process. 
*/ if (ps->ps_sigact[_SIG_IDX(sig)] == SIG_IGN || (sigprop(sig) & SIGPROP_IGNORE && ps->ps_sigact[_SIG_IDX(sig)] == SIG_DFL)) { /* never to be seen again */ sigqueue_delete_proc(p, sig); if (sig != SIGCONT) /* easier in psignal */ SIGADDSET(ps->ps_sigignore, sig); SIGDELSET(ps->ps_sigcatch, sig); } else { SIGDELSET(ps->ps_sigignore, sig); if (ps->ps_sigact[_SIG_IDX(sig)] == SIG_DFL) SIGDELSET(ps->ps_sigcatch, sig); else SIGADDSET(ps->ps_sigcatch, sig); } #ifdef COMPAT_FREEBSD4 if (ps->ps_sigact[_SIG_IDX(sig)] == SIG_IGN || ps->ps_sigact[_SIG_IDX(sig)] == SIG_DFL || (flags & KSA_FREEBSD4) == 0) SIGDELSET(ps->ps_freebsd4, sig); else SIGADDSET(ps->ps_freebsd4, sig); #endif #ifdef COMPAT_43 if (ps->ps_sigact[_SIG_IDX(sig)] == SIG_IGN || ps->ps_sigact[_SIG_IDX(sig)] == SIG_DFL || (flags & KSA_OSIGSET) == 0) SIGDELSET(ps->ps_osigset, sig); else SIGADDSET(ps->ps_osigset, sig); #endif } mtx_unlock(&ps->ps_mtx); PROC_UNLOCK(p); return (0); } #ifndef _SYS_SYSPROTO_H_ struct sigaction_args { int sig; struct sigaction *act; struct sigaction *oact; }; #endif int sys_sigaction(struct thread *td, struct sigaction_args *uap) { struct sigaction act, oact; struct sigaction *actp, *oactp; int error; actp = (uap->act != NULL) ? &act : NULL; oactp = (uap->oact != NULL) ? &oact : NULL; if (actp) { error = copyin(uap->act, actp, sizeof(act)); if (error) return (error); } error = kern_sigaction(td, uap->sig, actp, oactp, 0); if (oactp && !error) error = copyout(oactp, uap->oact, sizeof(oact)); return (error); } #ifdef COMPAT_FREEBSD4 #ifndef _SYS_SYSPROTO_H_ struct freebsd4_sigaction_args { int sig; struct sigaction *act; struct sigaction *oact; }; #endif int freebsd4_sigaction(struct thread *td, struct freebsd4_sigaction_args *uap) { struct sigaction act, oact; struct sigaction *actp, *oactp; int error; actp = (uap->act != NULL) ? &act : NULL; oactp = (uap->oact != NULL) ? &oact : NULL; if (actp) { error = copyin(uap->act, actp, sizeof(act)); if (error) return (error); } error = kern_sigaction(td, uap->sig, actp, oactp, KSA_FREEBSD4); if (oactp && !error) error = copyout(oactp, uap->oact, sizeof(oact)); return (error); } #endif /* COMAPT_FREEBSD4 */ #ifdef COMPAT_43 /* XXX - COMPAT_FBSD3 */ #ifndef _SYS_SYSPROTO_H_ struct osigaction_args { int signum; struct osigaction *nsa; struct osigaction *osa; }; #endif int osigaction(struct thread *td, struct osigaction_args *uap) { struct osigaction sa; struct sigaction nsa, osa; struct sigaction *nsap, *osap; int error; if (uap->signum <= 0 || uap->signum >= ONSIG) return (EINVAL); nsap = (uap->nsa != NULL) ? &nsa : NULL; osap = (uap->osa != NULL) ? &osa : NULL; if (nsap) { error = copyin(uap->nsa, &sa, sizeof(sa)); if (error) return (error); nsap->sa_handler = sa.sa_handler; nsap->sa_flags = sa.sa_flags; OSIG2SIG(sa.sa_mask, nsap->sa_mask); } error = kern_sigaction(td, uap->signum, nsap, osap, KSA_OSIGSET); if (osap && !error) { sa.sa_handler = osap->sa_handler; sa.sa_flags = osap->sa_flags; SIG2OSIG(osap->sa_mask, sa.sa_mask); error = copyout(&sa, uap->osa, sizeof(sa)); } return (error); } #if !defined(__i386__) /* Avoid replicating the same stub everywhere */ int osigreturn(struct thread *td, struct osigreturn_args *uap) { return (nosys(td, (struct nosys_args *)uap)); } #endif #endif /* COMPAT_43 */ /* * Initialize signal state for process 0; * set to ignore signals that are ignored by default. 
*/ void siginit(struct proc *p) { int i; struct sigacts *ps; PROC_LOCK(p); ps = p->p_sigacts; mtx_lock(&ps->ps_mtx); for (i = 1; i <= NSIG; i++) { if (sigprop(i) & SIGPROP_IGNORE && i != SIGCONT) { SIGADDSET(ps->ps_sigignore, i); } } mtx_unlock(&ps->ps_mtx); PROC_UNLOCK(p); } /* * Reset specified signal to the default disposition. */ static void sigdflt(struct sigacts *ps, int sig) { mtx_assert(&ps->ps_mtx, MA_OWNED); SIGDELSET(ps->ps_sigcatch, sig); if ((sigprop(sig) & SIGPROP_IGNORE) != 0 && sig != SIGCONT) SIGADDSET(ps->ps_sigignore, sig); ps->ps_sigact[_SIG_IDX(sig)] = SIG_DFL; SIGDELSET(ps->ps_siginfo, sig); } /* * Reset signals for an exec of the specified process. */ void execsigs(struct proc *p) { sigset_t osigignore; struct sigacts *ps; int sig; struct thread *td; /* * Reset caught signals. Held signals remain held * through td_sigmask (unless they were caught, * and are now ignored by default). */ PROC_LOCK_ASSERT(p, MA_OWNED); ps = p->p_sigacts; mtx_lock(&ps->ps_mtx); while (SIGNOTEMPTY(ps->ps_sigcatch)) { sig = sig_ffs(&ps->ps_sigcatch); sigdflt(ps, sig); if ((sigprop(sig) & SIGPROP_IGNORE) != 0) sigqueue_delete_proc(p, sig); } /* * As CloudABI processes cannot modify signal handlers, fully * reset all signals to their default behavior. Do ignore * SIGPIPE, as it would otherwise be impossible to recover from * writes to broken pipes and sockets. */ if (SV_PROC_ABI(p) == SV_ABI_CLOUDABI) { osigignore = ps->ps_sigignore; while (SIGNOTEMPTY(osigignore)) { sig = sig_ffs(&osigignore); SIGDELSET(osigignore, sig); if (sig != SIGPIPE) sigdflt(ps, sig); } SIGADDSET(ps->ps_sigignore, SIGPIPE); } /* * Reset stack state to the user stack. * Clear set of signals caught on the signal stack. */ td = curthread; MPASS(td->td_proc == p); td->td_sigstk.ss_flags = SS_DISABLE; td->td_sigstk.ss_size = 0; td->td_sigstk.ss_sp = 0; td->td_pflags &= ~TDP_ALTSTACK; /* * Reset no zombies if child dies flag as Solaris does. */ ps->ps_flag &= ~(PS_NOCLDWAIT | PS_CLDSIGIGN); if (ps->ps_sigact[_SIG_IDX(SIGCHLD)] == SIG_IGN) ps->ps_sigact[_SIG_IDX(SIGCHLD)] = SIG_DFL; mtx_unlock(&ps->ps_mtx); } /* * kern_sigprocmask() * * Manipulate signal mask. */ int kern_sigprocmask(struct thread *td, int how, sigset_t *set, sigset_t *oset, int flags) { sigset_t new_block, oset1; struct proc *p; int error; p = td->td_proc; if ((flags & SIGPROCMASK_PROC_LOCKED) != 0) PROC_LOCK_ASSERT(p, MA_OWNED); else PROC_LOCK(p); mtx_assert(&p->p_sigacts->ps_mtx, (flags & SIGPROCMASK_PS_LOCKED) != 0 ? MA_OWNED : MA_NOTOWNED); if (oset != NULL) *oset = td->td_sigmask; error = 0; if (set != NULL) { switch (how) { case SIG_BLOCK: SIG_CANTMASK(*set); oset1 = td->td_sigmask; SIGSETOR(td->td_sigmask, *set); new_block = td->td_sigmask; SIGSETNAND(new_block, oset1); break; case SIG_UNBLOCK: SIGSETNAND(td->td_sigmask, *set); signotify(td); goto out; case SIG_SETMASK: SIG_CANTMASK(*set); oset1 = td->td_sigmask; if (flags & SIGPROCMASK_OLD) SIGSETLO(td->td_sigmask, *set); else td->td_sigmask = *set; new_block = td->td_sigmask; SIGSETNAND(new_block, oset1); signotify(td); break; default: error = EINVAL; goto out; } /* * The new_block set contains signals that were not previously * blocked, but are blocked now. * * In case we block any signal that was not previously blocked * for td, and process has the signal pending, try to schedule * signal delivery to some thread that does not block the * signal, possibly waking it up. 
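	 * A single-threaded process has nobody else to hand the signal to,
	 * so the reschedule below is skipped in that case.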
*/ if (p->p_numthreads != 1) reschedule_signals(p, new_block, flags); } out: if (!(flags & SIGPROCMASK_PROC_LOCKED)) PROC_UNLOCK(p); return (error); } #ifndef _SYS_SYSPROTO_H_ struct sigprocmask_args { int how; const sigset_t *set; sigset_t *oset; }; #endif int sys_sigprocmask(struct thread *td, struct sigprocmask_args *uap) { sigset_t set, oset; sigset_t *setp, *osetp; int error; setp = (uap->set != NULL) ? &set : NULL; osetp = (uap->oset != NULL) ? &oset : NULL; if (setp) { error = copyin(uap->set, setp, sizeof(set)); if (error) return (error); } error = kern_sigprocmask(td, uap->how, setp, osetp, 0); if (osetp && !error) { error = copyout(osetp, uap->oset, sizeof(oset)); } return (error); } #ifdef COMPAT_43 /* XXX - COMPAT_FBSD3 */ #ifndef _SYS_SYSPROTO_H_ struct osigprocmask_args { int how; osigset_t mask; }; #endif int osigprocmask(struct thread *td, struct osigprocmask_args *uap) { sigset_t set, oset; int error; OSIG2SIG(uap->mask, set); error = kern_sigprocmask(td, uap->how, &set, &oset, 1); SIG2OSIG(oset, td->td_retval[0]); return (error); } #endif /* COMPAT_43 */ int sys_sigwait(struct thread *td, struct sigwait_args *uap) { ksiginfo_t ksi; sigset_t set; int error; error = copyin(uap->set, &set, sizeof(set)); if (error) { td->td_retval[0] = error; return (0); } error = kern_sigtimedwait(td, set, &ksi, NULL); if (error) { if (error == EINTR && td->td_proc->p_osrel < P_OSREL_SIGWAIT) error = ERESTART; if (error == ERESTART) return (error); td->td_retval[0] = error; return (0); } error = copyout(&ksi.ksi_signo, uap->sig, sizeof(ksi.ksi_signo)); td->td_retval[0] = error; return (0); } int sys_sigtimedwait(struct thread *td, struct sigtimedwait_args *uap) { struct timespec ts; struct timespec *timeout; sigset_t set; ksiginfo_t ksi; int error; if (uap->timeout) { error = copyin(uap->timeout, &ts, sizeof(ts)); if (error) return (error); timeout = &ts; } else timeout = NULL; error = copyin(uap->set, &set, sizeof(set)); if (error) return (error); error = kern_sigtimedwait(td, set, &ksi, timeout); if (error) return (error); if (uap->info) error = copyout(&ksi.ksi_info, uap->info, sizeof(siginfo_t)); if (error == 0) td->td_retval[0] = ksi.ksi_signo; return (error); } int sys_sigwaitinfo(struct thread *td, struct sigwaitinfo_args *uap) { ksiginfo_t ksi; sigset_t set; int error; error = copyin(uap->set, &set, sizeof(set)); if (error) return (error); error = kern_sigtimedwait(td, set, &ksi, NULL); if (error) return (error); if (uap->info) error = copyout(&ksi.ksi_info, uap->info, sizeof(siginfo_t)); if (error == 0) td->td_retval[0] = ksi.ksi_signo; return (error); } static void proc_td_siginfo_capture(struct thread *td, siginfo_t *si) { struct thread *thr; FOREACH_THREAD_IN_PROC(td->td_proc, thr) { if (thr == td) thr->td_si = *si; else thr->td_si.si_signo = 0; } } int kern_sigtimedwait(struct thread *td, sigset_t waitset, ksiginfo_t *ksi, struct timespec *timeout) { struct sigacts *ps; sigset_t saved_mask, new_block; struct proc *p; int error, sig, timo, timevalid = 0; struct timespec rts, ets, ts; struct timeval tv; p = td->td_proc; error = 0; ets.tv_sec = 0; ets.tv_nsec = 0; if (timeout != NULL) { if (timeout->tv_nsec >= 0 && timeout->tv_nsec < 1000000000) { timevalid = 1; getnanouptime(&rts); timespecadd(&rts, timeout, &ets); } } ksiginfo_init(ksi); /* Some signals can not be waited for. 
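 * SIG_CANTMASK() strips SIGKILL and SIGSTOP from the wait set; those
 * signals keep their normal disposition instead of being consumed by
 * the wait.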
*/ SIG_CANTMASK(waitset); ps = p->p_sigacts; PROC_LOCK(p); saved_mask = td->td_sigmask; SIGSETNAND(td->td_sigmask, waitset); for (;;) { mtx_lock(&ps->ps_mtx); sig = cursig(td); mtx_unlock(&ps->ps_mtx); KASSERT(sig >= 0, ("sig %d", sig)); if (sig != 0 && SIGISMEMBER(waitset, sig)) { if (sigqueue_get(&td->td_sigqueue, sig, ksi) != 0 || sigqueue_get(&p->p_sigqueue, sig, ksi) != 0) { error = 0; break; } } if (error != 0) break; /* * POSIX says this must be checked after looking for pending * signals. */ if (timeout != NULL) { if (!timevalid) { error = EINVAL; break; } getnanouptime(&rts); if (timespeccmp(&rts, &ets, >=)) { error = EAGAIN; break; } timespecsub(&ets, &rts, &ts); TIMESPEC_TO_TIMEVAL(&tv, &ts); timo = tvtohz(&tv); } else { timo = 0; } error = msleep(ps, &p->p_mtx, PPAUSE|PCATCH, "sigwait", timo); if (timeout != NULL) { if (error == ERESTART) { /* Timeout can not be restarted. */ error = EINTR; } else if (error == EAGAIN) { /* We will calculate timeout by ourself. */ error = 0; } } } new_block = saved_mask; SIGSETNAND(new_block, td->td_sigmask); td->td_sigmask = saved_mask; /* * Fewer signals can be delivered to us, reschedule signal * notification. */ if (p->p_numthreads != 1) reschedule_signals(p, new_block, 0); if (error == 0) { SDT_PROBE2(proc, , , signal__clear, sig, ksi); if (ksi->ksi_code == SI_TIMER) itimer_accept(p, ksi->ksi_timerid, ksi); #ifdef KTRACE if (KTRPOINT(td, KTR_PSIG)) { sig_t action; mtx_lock(&ps->ps_mtx); action = ps->ps_sigact[_SIG_IDX(sig)]; mtx_unlock(&ps->ps_mtx); ktrpsig(sig, action, &td->td_sigmask, ksi->ksi_code); } #endif if (sig == SIGKILL) { proc_td_siginfo_capture(td, &ksi->ksi_info); sigexit(td, sig); } } PROC_UNLOCK(p); return (error); } #ifndef _SYS_SYSPROTO_H_ struct sigpending_args { sigset_t *set; }; #endif int sys_sigpending(struct thread *td, struct sigpending_args *uap) { struct proc *p = td->td_proc; sigset_t pending; PROC_LOCK(p); pending = p->p_sigqueue.sq_signals; SIGSETOR(pending, td->td_sigqueue.sq_signals); PROC_UNLOCK(p); return (copyout(&pending, uap->set, sizeof(sigset_t))); } #ifdef COMPAT_43 /* XXX - COMPAT_FBSD3 */ #ifndef _SYS_SYSPROTO_H_ struct osigpending_args { int dummy; }; #endif int osigpending(struct thread *td, struct osigpending_args *uap) { struct proc *p = td->td_proc; sigset_t pending; PROC_LOCK(p); pending = p->p_sigqueue.sq_signals; SIGSETOR(pending, td->td_sigqueue.sq_signals); PROC_UNLOCK(p); SIG2OSIG(pending, td->td_retval[0]); return (0); } #endif /* COMPAT_43 */ #if defined(COMPAT_43) /* * Generalized interface signal handler, 4.3-compatible. */ #ifndef _SYS_SYSPROTO_H_ struct osigvec_args { int signum; struct sigvec *nsv; struct sigvec *osv; }; #endif /* ARGSUSED */ int osigvec(struct thread *td, struct osigvec_args *uap) { struct sigvec vec; struct sigaction nsa, osa; struct sigaction *nsap, *osap; int error; if (uap->signum <= 0 || uap->signum >= ONSIG) return (EINVAL); nsap = (uap->nsv != NULL) ? &nsa : NULL; osap = (uap->osv != NULL) ? 
&osa : NULL; if (nsap) { error = copyin(uap->nsv, &vec, sizeof(vec)); if (error) return (error); nsap->sa_handler = vec.sv_handler; OSIG2SIG(vec.sv_mask, nsap->sa_mask); nsap->sa_flags = vec.sv_flags; nsap->sa_flags ^= SA_RESTART; /* opposite of SV_INTERRUPT */ } error = kern_sigaction(td, uap->signum, nsap, osap, KSA_OSIGSET); if (osap && !error) { vec.sv_handler = osap->sa_handler; SIG2OSIG(osap->sa_mask, vec.sv_mask); vec.sv_flags = osap->sa_flags; vec.sv_flags &= ~SA_NOCLDWAIT; vec.sv_flags ^= SA_RESTART; error = copyout(&vec, uap->osv, sizeof(vec)); } return (error); } #ifndef _SYS_SYSPROTO_H_ struct osigblock_args { int mask; }; #endif int osigblock(struct thread *td, struct osigblock_args *uap) { sigset_t set, oset; OSIG2SIG(uap->mask, set); kern_sigprocmask(td, SIG_BLOCK, &set, &oset, 0); SIG2OSIG(oset, td->td_retval[0]); return (0); } #ifndef _SYS_SYSPROTO_H_ struct osigsetmask_args { int mask; }; #endif int osigsetmask(struct thread *td, struct osigsetmask_args *uap) { sigset_t set, oset; OSIG2SIG(uap->mask, set); kern_sigprocmask(td, SIG_SETMASK, &set, &oset, 0); SIG2OSIG(oset, td->td_retval[0]); return (0); } #endif /* COMPAT_43 */ /* * Suspend calling thread until signal, providing mask to be set in the * meantime. */ #ifndef _SYS_SYSPROTO_H_ struct sigsuspend_args { const sigset_t *sigmask; }; #endif /* ARGSUSED */ int sys_sigsuspend(struct thread *td, struct sigsuspend_args *uap) { sigset_t mask; int error; error = copyin(uap->sigmask, &mask, sizeof(mask)); if (error) return (error); return (kern_sigsuspend(td, mask)); } int kern_sigsuspend(struct thread *td, sigset_t mask) { struct proc *p = td->td_proc; int has_sig, sig; /* * When returning from sigsuspend, we want * the old mask to be restored after the * signal handler has finished. Thus, we * save it here and mark the sigacts structure * to indicate this. */ PROC_LOCK(p); kern_sigprocmask(td, SIG_SETMASK, &mask, &td->td_oldsigmask, SIGPROCMASK_PROC_LOCKED); td->td_pflags |= TDP_OLDMASK; /* * Process signals now. Otherwise, we can get spurious wakeup * due to signal entered process queue, but delivered to other * thread. But sigsuspend should return only on signal * delivery. */ (p->p_sysent->sv_set_syscall_retval)(td, EINTR); for (has_sig = 0; !has_sig;) { while (msleep(&p->p_sigacts, &p->p_mtx, PPAUSE|PCATCH, "pause", 0) == 0) /* void */; thread_suspend_check(0); mtx_lock(&p->p_sigacts->ps_mtx); while ((sig = cursig(td)) != 0) { KASSERT(sig >= 0, ("sig %d", sig)); has_sig += postsig(sig); } mtx_unlock(&p->p_sigacts->ps_mtx); } PROC_UNLOCK(p); td->td_errno = EINTR; td->td_pflags |= TDP_NERRNO; return (EJUSTRETURN); } #ifdef COMPAT_43 /* XXX - COMPAT_FBSD3 */ /* * Compatibility sigsuspend call for old binaries. Note nonstandard calling * convention: libc stub passes mask, not pointer, to save a copyin. 
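 *
 * For reference, the usual race-free wait against the modern
 * sigsuspend(2) above looks like this in userland (an illustrative
 * sketch, not part of this change; got_usr1 is a hypothetical flag set
 * by the SIGUSR1 handler):
 *
 *	sigset_t block, old;
 *
 *	sigemptyset(&block);
 *	sigaddset(&block, SIGUSR1);
 *	sigprocmask(SIG_BLOCK, &block, &old);
 *	while (!got_usr1)
 *		sigsuspend(&old);
 *	sigprocmask(SIG_SETMASK, &old, NULL);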
*/ #ifndef _SYS_SYSPROTO_H_ struct osigsuspend_args { osigset_t mask; }; #endif /* ARGSUSED */ int osigsuspend(struct thread *td, struct osigsuspend_args *uap) { sigset_t mask; OSIG2SIG(uap->mask, mask); return (kern_sigsuspend(td, mask)); } #endif /* COMPAT_43 */ #if defined(COMPAT_43) #ifndef _SYS_SYSPROTO_H_ struct osigstack_args { struct sigstack *nss; struct sigstack *oss; }; #endif /* ARGSUSED */ int osigstack(struct thread *td, struct osigstack_args *uap) { struct sigstack nss, oss; int error = 0; if (uap->nss != NULL) { error = copyin(uap->nss, &nss, sizeof(nss)); if (error) return (error); } oss.ss_sp = td->td_sigstk.ss_sp; oss.ss_onstack = sigonstack(cpu_getstack(td)); if (uap->nss != NULL) { td->td_sigstk.ss_sp = nss.ss_sp; td->td_sigstk.ss_size = 0; td->td_sigstk.ss_flags |= nss.ss_onstack & SS_ONSTACK; td->td_pflags |= TDP_ALTSTACK; } if (uap->oss != NULL) error = copyout(&oss, uap->oss, sizeof(oss)); return (error); } #endif /* COMPAT_43 */ #ifndef _SYS_SYSPROTO_H_ struct sigaltstack_args { stack_t *ss; stack_t *oss; }; #endif /* ARGSUSED */ int sys_sigaltstack(struct thread *td, struct sigaltstack_args *uap) { stack_t ss, oss; int error; if (uap->ss != NULL) { error = copyin(uap->ss, &ss, sizeof(ss)); if (error) return (error); } error = kern_sigaltstack(td, (uap->ss != NULL) ? &ss : NULL, (uap->oss != NULL) ? &oss : NULL); if (error) return (error); if (uap->oss != NULL) error = copyout(&oss, uap->oss, sizeof(stack_t)); return (error); } int kern_sigaltstack(struct thread *td, stack_t *ss, stack_t *oss) { struct proc *p = td->td_proc; int oonstack; oonstack = sigonstack(cpu_getstack(td)); if (oss != NULL) { *oss = td->td_sigstk; oss->ss_flags = (td->td_pflags & TDP_ALTSTACK) ? ((oonstack) ? SS_ONSTACK : 0) : SS_DISABLE; } if (ss != NULL) { if (oonstack) return (EPERM); if ((ss->ss_flags & ~SS_DISABLE) != 0) return (EINVAL); if (!(ss->ss_flags & SS_DISABLE)) { if (ss->ss_size < p->p_sysent->sv_minsigstksz) return (ENOMEM); td->td_sigstk = *ss; td->td_pflags |= TDP_ALTSTACK; } else { td->td_pflags &= ~TDP_ALTSTACK; } } return (0); } /* * Common code for kill process group/broadcast kill. * cp is calling process. */ static int killpg1(struct thread *td, int sig, int pgid, int all, ksiginfo_t *ksi) { struct proc *p; struct pgrp *pgrp; int err; int ret; ret = ESRCH; if (all) { /* * broadcast */ sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { if (p->p_pid <= 1 || p->p_flag & P_SYSTEM || p == td->td_proc || p->p_state == PRS_NEW) { continue; } PROC_LOCK(p); err = p_cansignal(td, p, sig); if (err == 0) { if (sig) pksignal(p, sig, ksi); ret = err; } else if (ret == ESRCH) ret = err; PROC_UNLOCK(p); } sx_sunlock(&allproc_lock); } else { sx_slock(&proctree_lock); if (pgid == 0) { /* * zero pgid means send to my process group. 
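 * (the kill(0, sig) / killpg(0, sig) case), so the caller's own pgrp is
 * used directly instead of being looked up with pgfind().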
*/ pgrp = td->td_proc->p_pgrp; PGRP_LOCK(pgrp); } else { pgrp = pgfind(pgid); if (pgrp == NULL) { sx_sunlock(&proctree_lock); return (ESRCH); } } sx_sunlock(&proctree_lock); LIST_FOREACH(p, &pgrp->pg_members, p_pglist) { PROC_LOCK(p); if (p->p_pid <= 1 || p->p_flag & P_SYSTEM || p->p_state == PRS_NEW) { PROC_UNLOCK(p); continue; } err = p_cansignal(td, p, sig); if (err == 0) { if (sig) pksignal(p, sig, ksi); ret = err; } else if (ret == ESRCH) ret = err; PROC_UNLOCK(p); } PGRP_UNLOCK(pgrp); } return (ret); } #ifndef _SYS_SYSPROTO_H_ struct kill_args { int pid; int signum; }; #endif /* ARGSUSED */ int sys_kill(struct thread *td, struct kill_args *uap) { ksiginfo_t ksi; struct proc *p; int error; /* * A process in capability mode can send signals only to himself. * The main rationale behind this is that abort(3) is implemented as * kill(getpid(), SIGABRT). */ if (IN_CAPABILITY_MODE(td) && uap->pid != td->td_proc->p_pid) return (ECAPMODE); AUDIT_ARG_SIGNUM(uap->signum); AUDIT_ARG_PID(uap->pid); if ((u_int)uap->signum > _SIG_MAXSIG) return (EINVAL); ksiginfo_init(&ksi); ksi.ksi_signo = uap->signum; ksi.ksi_code = SI_USER; ksi.ksi_pid = td->td_proc->p_pid; ksi.ksi_uid = td->td_ucred->cr_ruid; if (uap->pid > 0) { /* kill single process */ if ((p = pfind_any(uap->pid)) == NULL) return (ESRCH); AUDIT_ARG_PROCESS(p); error = p_cansignal(td, p, uap->signum); if (error == 0 && uap->signum) pksignal(p, uap->signum, &ksi); PROC_UNLOCK(p); return (error); } switch (uap->pid) { case -1: /* broadcast signal */ return (killpg1(td, uap->signum, 0, 1, &ksi)); case 0: /* signal own process group */ return (killpg1(td, uap->signum, 0, 0, &ksi)); default: /* negative explicit process group */ return (killpg1(td, uap->signum, -uap->pid, 0, &ksi)); } /* NOTREACHED */ } int sys_pdkill(struct thread *td, struct pdkill_args *uap) { struct proc *p; int error; AUDIT_ARG_SIGNUM(uap->signum); AUDIT_ARG_FD(uap->fd); if ((u_int)uap->signum > _SIG_MAXSIG) return (EINVAL); error = procdesc_find(td, uap->fd, &cap_pdkill_rights, &p); if (error) return (error); AUDIT_ARG_PROCESS(p); error = p_cansignal(td, p, uap->signum); if (error == 0 && uap->signum) kern_psignal(p, uap->signum); PROC_UNLOCK(p); return (error); } #if defined(COMPAT_43) #ifndef _SYS_SYSPROTO_H_ struct okillpg_args { int pgid; int signum; }; #endif /* ARGSUSED */ int okillpg(struct thread *td, struct okillpg_args *uap) { ksiginfo_t ksi; AUDIT_ARG_SIGNUM(uap->signum); AUDIT_ARG_PID(uap->pgid); if ((u_int)uap->signum > _SIG_MAXSIG) return (EINVAL); ksiginfo_init(&ksi); ksi.ksi_signo = uap->signum; ksi.ksi_code = SI_USER; ksi.ksi_pid = td->td_proc->p_pid; ksi.ksi_uid = td->td_ucred->cr_ruid; return (killpg1(td, uap->signum, uap->pgid, 0, &ksi)); } #endif /* COMPAT_43 */ #ifndef _SYS_SYSPROTO_H_ struct sigqueue_args { pid_t pid; int signum; /* union sigval */ void *value; }; #endif int sys_sigqueue(struct thread *td, struct sigqueue_args *uap) { union sigval sv; sv.sival_ptr = uap->value; return (kern_sigqueue(td, uap->pid, uap->signum, &sv)); } int kern_sigqueue(struct thread *td, pid_t pid, int signum, union sigval *value) { ksiginfo_t ksi; struct proc *p; int error; if ((u_int)signum > _SIG_MAXSIG) return (EINVAL); /* * Specification says sigqueue can only send signal to * single process. 
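 * Unlike kill(2), a zero or negative pid does not address a process
 * group here; it is rejected with EINVAL just below.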
*/ if (pid <= 0) return (EINVAL); if ((p = pfind_any(pid)) == NULL) return (ESRCH); error = p_cansignal(td, p, signum); if (error == 0 && signum != 0) { ksiginfo_init(&ksi); ksi.ksi_flags = KSI_SIGQ; ksi.ksi_signo = signum; ksi.ksi_code = SI_QUEUE; ksi.ksi_pid = td->td_proc->p_pid; ksi.ksi_uid = td->td_ucred->cr_ruid; ksi.ksi_value = *value; error = pksignal(p, ksi.ksi_signo, &ksi); } PROC_UNLOCK(p); return (error); } /* * Send a signal to a process group. */ void gsignal(int pgid, int sig, ksiginfo_t *ksi) { struct pgrp *pgrp; if (pgid != 0) { sx_slock(&proctree_lock); pgrp = pgfind(pgid); sx_sunlock(&proctree_lock); if (pgrp != NULL) { pgsignal(pgrp, sig, 0, ksi); PGRP_UNLOCK(pgrp); } } } /* * Send a signal to a process group. If checktty is 1, * limit to members which have a controlling terminal. */ void pgsignal(struct pgrp *pgrp, int sig, int checkctty, ksiginfo_t *ksi) { struct proc *p; if (pgrp) { PGRP_LOCK_ASSERT(pgrp, MA_OWNED); LIST_FOREACH(p, &pgrp->pg_members, p_pglist) { PROC_LOCK(p); if (p->p_state == PRS_NORMAL && (checkctty == 0 || p->p_flag & P_CONTROLT)) pksignal(p, sig, ksi); PROC_UNLOCK(p); } } } /* * Recalculate the signal mask and reset the signal disposition after * usermode frame for delivery is formed. Should be called after * mach-specific routine, because sysent->sv_sendsig() needs correct * ps_siginfo and signal mask. */ static void postsig_done(int sig, struct thread *td, struct sigacts *ps) { sigset_t mask; mtx_assert(&ps->ps_mtx, MA_OWNED); td->td_ru.ru_nsignals++; mask = ps->ps_catchmask[_SIG_IDX(sig)]; if (!SIGISMEMBER(ps->ps_signodefer, sig)) SIGADDSET(mask, sig); kern_sigprocmask(td, SIG_BLOCK, &mask, NULL, SIGPROCMASK_PROC_LOCKED | SIGPROCMASK_PS_LOCKED); if (SIGISMEMBER(ps->ps_sigreset, sig)) sigdflt(ps, sig); } /* * Send a signal caused by a trap to the current thread. If it will be * caught immediately, deliver it with correct code. Otherwise, post it * normally. */ void trapsignal(struct thread *td, ksiginfo_t *ksi) { struct sigacts *ps; struct proc *p; int sig; int code; p = td->td_proc; sig = ksi->ksi_signo; code = ksi->ksi_code; KASSERT(_SIG_VALID(sig), ("invalid signal")); PROC_LOCK(p); ps = p->p_sigacts; mtx_lock(&ps->ps_mtx); if ((p->p_flag & P_TRACED) == 0 && SIGISMEMBER(ps->ps_sigcatch, sig) && !SIGISMEMBER(td->td_sigmask, sig)) { #ifdef KTRACE if (KTRPOINT(curthread, KTR_PSIG)) ktrpsig(sig, ps->ps_sigact[_SIG_IDX(sig)], &td->td_sigmask, code); #endif (*p->p_sysent->sv_sendsig)(ps->ps_sigact[_SIG_IDX(sig)], ksi, &td->td_sigmask); postsig_done(sig, td, ps); mtx_unlock(&ps->ps_mtx); } else { /* * Avoid a possible infinite loop if the thread * masking the signal or process is ignoring the * signal. */ if (kern_forcesigexit && (SIGISMEMBER(td->td_sigmask, sig) || ps->ps_sigact[_SIG_IDX(sig)] == SIG_IGN)) { SIGDELSET(td->td_sigmask, sig); SIGDELSET(ps->ps_sigcatch, sig); SIGDELSET(ps->ps_sigignore, sig); ps->ps_sigact[_SIG_IDX(sig)] = SIG_DFL; } mtx_unlock(&ps->ps_mtx); - p->p_code = code; /* XXX for core dump/debugger */ p->p_sig = sig; /* XXX to verify code */ tdsendsignal(p, td, sig, ksi); } PROC_UNLOCK(p); } static struct thread * sigtd(struct proc *p, int sig, int prop) { struct thread *td, *signal_td; PROC_LOCK_ASSERT(p, MA_OWNED); /* * Check if current thread can handle the signal without * switching context to another thread. 
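 * Otherwise, prefer the first thread in the process that does not have
 * the signal blocked, falling back to the first thread when every
 * thread blocks it.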
*/ if (curproc == p && !SIGISMEMBER(curthread->td_sigmask, sig)) return (curthread); signal_td = NULL; FOREACH_THREAD_IN_PROC(p, td) { if (!SIGISMEMBER(td->td_sigmask, sig)) { signal_td = td; break; } } if (signal_td == NULL) signal_td = FIRST_THREAD_IN_PROC(p); return (signal_td); } /* * Send the signal to the process. If the signal has an action, the action * is usually performed by the target process rather than the caller; we add * the signal to the set of pending signals for the process. * * Exceptions: * o When a stop signal is sent to a sleeping process that takes the * default action, the process is stopped without awakening it. * o SIGCONT restarts stopped processes (or puts them back to sleep) * regardless of the signal action (eg, blocked or ignored). * * Other ignored signals are discarded immediately. * * NB: This function may be entered from the debugger via the "kill" DDB * command. There is little that can be done to mitigate the possibly messy * side effects of this unwise possibility. */ void kern_psignal(struct proc *p, int sig) { ksiginfo_t ksi; ksiginfo_init(&ksi); ksi.ksi_signo = sig; ksi.ksi_code = SI_KERNEL; (void) tdsendsignal(p, NULL, sig, &ksi); } int pksignal(struct proc *p, int sig, ksiginfo_t *ksi) { return (tdsendsignal(p, NULL, sig, ksi)); } /* Utility function for finding a thread to send signal event to. */ int sigev_findtd(struct proc *p ,struct sigevent *sigev, struct thread **ttd) { struct thread *td; if (sigev->sigev_notify == SIGEV_THREAD_ID) { td = tdfind(sigev->sigev_notify_thread_id, p->p_pid); if (td == NULL) return (ESRCH); *ttd = td; } else { *ttd = NULL; PROC_LOCK(p); } return (0); } void tdsignal(struct thread *td, int sig) { ksiginfo_t ksi; ksiginfo_init(&ksi); ksi.ksi_signo = sig; ksi.ksi_code = SI_KERNEL; (void) tdsendsignal(td->td_proc, td, sig, &ksi); } void tdksignal(struct thread *td, int sig, ksiginfo_t *ksi) { (void) tdsendsignal(td->td_proc, td, sig, ksi); } int tdsendsignal(struct proc *p, struct thread *td, int sig, ksiginfo_t *ksi) { sig_t action; sigqueue_t *sigqueue; int prop; struct sigacts *ps; int intrval; int ret = 0; int wakeup_swapper; MPASS(td == NULL || p == td->td_proc); PROC_LOCK_ASSERT(p, MA_OWNED); if (!_SIG_VALID(sig)) panic("%s(): invalid signal %d", __func__, sig); KASSERT(ksi == NULL || !KSI_ONQ(ksi), ("%s: ksi on queue", __func__)); /* * IEEE Std 1003.1-2001: return success when killing a zombie. */ if (p->p_state == PRS_ZOMBIE) { if (ksi && (ksi->ksi_flags & KSI_INS)) ksiginfo_tryfree(ksi); return (ret); } ps = p->p_sigacts; KNOTE_LOCKED(p->p_klist, NOTE_SIGNAL | sig); prop = sigprop(sig); if (td == NULL) { td = sigtd(p, sig, prop); sigqueue = &p->p_sigqueue; } else sigqueue = &td->td_sigqueue; SDT_PROBE3(proc, , , signal__send, td, p, sig); /* * If the signal is being ignored, * then we forget about it immediately. * (Note: we don't set SIGCONT in ps_sigignore, * and if it is set to SIG_IGN, * action will be SIG_DFL here.) 
*/ mtx_lock(&ps->ps_mtx); if (SIGISMEMBER(ps->ps_sigignore, sig)) { SDT_PROBE3(proc, , , signal__discard, td, p, sig); mtx_unlock(&ps->ps_mtx); if (ksi && (ksi->ksi_flags & KSI_INS)) ksiginfo_tryfree(ksi); return (ret); } if (SIGISMEMBER(td->td_sigmask, sig)) action = SIG_HOLD; else if (SIGISMEMBER(ps->ps_sigcatch, sig)) action = SIG_CATCH; else action = SIG_DFL; if (SIGISMEMBER(ps->ps_sigintr, sig)) intrval = EINTR; else intrval = ERESTART; mtx_unlock(&ps->ps_mtx); if (prop & SIGPROP_CONT) sigqueue_delete_stopmask_proc(p); else if (prop & SIGPROP_STOP) { /* * If sending a tty stop signal to a member of an orphaned * process group, discard the signal here if the action * is default; don't stop the process below if sleeping, * and don't clear any pending SIGCONT. */ if ((prop & SIGPROP_TTYSTOP) && (p->p_pgrp->pg_jobc == 0) && (action == SIG_DFL)) { if (ksi && (ksi->ksi_flags & KSI_INS)) ksiginfo_tryfree(ksi); return (ret); } sigqueue_delete_proc(p, SIGCONT); if (p->p_flag & P_CONTINUED) { p->p_flag &= ~P_CONTINUED; PROC_LOCK(p->p_pptr); sigqueue_take(p->p_ksi); PROC_UNLOCK(p->p_pptr); } } ret = sigqueue_add(sigqueue, sig, ksi); if (ret != 0) return (ret); signotify(td); /* * Defer further processing for signals which are held, * except that stopped processes must be continued by SIGCONT. */ if (action == SIG_HOLD && !((prop & SIGPROP_CONT) && (p->p_flag & P_STOPPED_SIG))) return (ret); /* SIGKILL: Remove procfs STOPEVENTs. */ if (sig == SIGKILL) { /* from procfs_ioctl.c: PIOCBIC */ p->p_stops = 0; /* from procfs_ioctl.c: PIOCCONT */ p->p_step = 0; wakeup(&p->p_step); } /* * Some signals have a process-wide effect and a per-thread * component. Most processing occurs when the process next * tries to cross the user boundary, however there are some * times when processing needs to be done immediately, such as * waking up threads so that they can cross the user boundary. * We try to do the per-process part here. */ if (P_SHOULDSTOP(p)) { KASSERT(!(p->p_flag & P_WEXIT), ("signal to stopped but exiting process")); if (sig == SIGKILL) { /* * If traced process is already stopped, * then no further action is necessary. */ if (p->p_flag & P_TRACED) goto out; /* * SIGKILL sets process running. * It will die elsewhere. * All threads must be restarted. */ p->p_flag &= ~P_STOPPED_SIG; goto runfast; } if (prop & SIGPROP_CONT) { /* * If traced process is already stopped, * then no further action is necessary. */ if (p->p_flag & P_TRACED) goto out; /* * If SIGCONT is default (or ignored), we continue the * process but don't leave the signal in sigqueue as * it has no further action. If SIGCONT is held, we * continue the process and leave the signal in * sigqueue. If the process catches SIGCONT, let it * handle the signal itself. If it isn't waiting on * an event, it goes back to run state. * Otherwise, process goes back to sleep state. */ p->p_flag &= ~P_STOPPED_SIG; PROC_SLOCK(p); if (p->p_numthreads == p->p_suspcount) { PROC_SUNLOCK(p); p->p_flag |= P_CONTINUED; p->p_xsig = SIGCONT; PROC_LOCK(p->p_pptr); childproc_continued(p); PROC_UNLOCK(p->p_pptr); PROC_SLOCK(p); } if (action == SIG_DFL) { thread_unsuspend(p); PROC_SUNLOCK(p); sigqueue_delete(sigqueue, sig); goto out; } if (action == SIG_CATCH) { /* * The process wants to catch it so it needs * to run at least one thread, but which one? */ PROC_SUNLOCK(p); goto runfast; } /* * The signal is not ignored or caught. 
*/ thread_unsuspend(p); PROC_SUNLOCK(p); goto out; } if (prop & SIGPROP_STOP) { /* * If traced process is already stopped, * then no further action is necessary. */ if (p->p_flag & P_TRACED) goto out; /* * Already stopped, don't need to stop again * (If we did the shell could get confused). * Just make sure the signal STOP bit set. */ p->p_flag |= P_STOPPED_SIG; sigqueue_delete(sigqueue, sig); goto out; } /* * All other kinds of signals: * If a thread is sleeping interruptibly, simulate a * wakeup so that when it is continued it will be made * runnable and can look at the signal. However, don't make * the PROCESS runnable, leave it stopped. * It may run a bit until it hits a thread_suspend_check(). */ wakeup_swapper = 0; PROC_SLOCK(p); thread_lock(td); if (TD_ON_SLEEPQ(td) && (td->td_flags & TDF_SINTR)) wakeup_swapper = sleepq_abort(td, intrval); thread_unlock(td); PROC_SUNLOCK(p); if (wakeup_swapper) kick_proc0(); goto out; /* * Mutexes are short lived. Threads waiting on them will * hit thread_suspend_check() soon. */ } else if (p->p_state == PRS_NORMAL) { if (p->p_flag & P_TRACED || action == SIG_CATCH) { tdsigwakeup(td, sig, action, intrval); goto out; } MPASS(action == SIG_DFL); if (prop & SIGPROP_STOP) { if (p->p_flag & (P_PPWAIT|P_WEXIT)) goto out; p->p_flag |= P_STOPPED_SIG; p->p_xsig = sig; PROC_SLOCK(p); wakeup_swapper = sig_suspend_threads(td, p, 1); if (p->p_numthreads == p->p_suspcount) { /* * only thread sending signal to another * process can reach here, if thread is sending * signal to its process, because thread does * not suspend itself here, p_numthreads * should never be equal to p_suspcount. */ thread_stopped(p); PROC_SUNLOCK(p); sigqueue_delete_proc(p, p->p_xsig); } else PROC_SUNLOCK(p); if (wakeup_swapper) kick_proc0(); goto out; } } else { /* Not in "NORMAL" state. discard the signal. */ sigqueue_delete(sigqueue, sig); goto out; } /* * The process is not stopped so we need to apply the signal to all the * running threads. */ runfast: tdsigwakeup(td, sig, action, intrval); PROC_SLOCK(p); thread_unsuspend(p); PROC_SUNLOCK(p); out: /* If we jump here, proc slock should not be owned. */ PROC_SLOCK_ASSERT(p, MA_NOTOWNED); return (ret); } /* * The force of a signal has been directed against a single * thread. We need to see what we can do about knocking it * out of any sleep it may be in etc. */ static void tdsigwakeup(struct thread *td, int sig, sig_t action, int intrval) { struct proc *p = td->td_proc; int prop; int wakeup_swapper; wakeup_swapper = 0; PROC_LOCK_ASSERT(p, MA_OWNED); prop = sigprop(sig); PROC_SLOCK(p); thread_lock(td); /* * Bring the priority of a thread up if we want it to get * killed in this lifetime. Be careful to avoid bumping the * priority of the idle thread, since we still allow to signal * kernel processes. */ if (action == SIG_DFL && (prop & SIGPROP_KILL) != 0 && td->td_priority > PUSER && !TD_IS_IDLETHREAD(td)) sched_prio(td, PUSER); if (TD_ON_SLEEPQ(td)) { /* * If thread is sleeping uninterruptibly * we can't interrupt the sleep... the signal will * be noticed when the process returns through * trap() or syscall(). */ if ((td->td_flags & TDF_SINTR) == 0) goto out; /* * If SIGCONT is default (or ignored) and process is * asleep, we are finished; the process should not * be awakened. */ if ((prop & SIGPROP_CONT) && action == SIG_DFL) { thread_unlock(td); PROC_SUNLOCK(p); sigqueue_delete(&p->p_sigqueue, sig); /* * It may be on either list in this state. * Remove from both for now. 
*/ sigqueue_delete(&td->td_sigqueue, sig); return; } /* * Don't awaken a sleeping thread for SIGSTOP if the * STOP signal is deferred. */ if ((prop & SIGPROP_STOP) != 0 && (td->td_flags & (TDF_SBDRY | TDF_SERESTART | TDF_SEINTR)) == TDF_SBDRY) goto out; /* * Give low priority threads a better chance to run. */ if (td->td_priority > PUSER && !TD_IS_IDLETHREAD(td)) sched_prio(td, PUSER); wakeup_swapper = sleepq_abort(td, intrval); } else { /* * Other states do nothing with the signal immediately, * other than kicking ourselves if we are running. * It will either never be noticed, or noticed very soon. */ #ifdef SMP if (TD_IS_RUNNING(td) && td != curthread) forward_signal(td); #endif } out: PROC_SUNLOCK(p); thread_unlock(td); if (wakeup_swapper) kick_proc0(); } static int sig_suspend_threads(struct thread *td, struct proc *p, int sending) { struct thread *td2; int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); MPASS(sending || td == curthread); wakeup_swapper = 0; FOREACH_THREAD_IN_PROC(p, td2) { thread_lock(td2); td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK; if ((TD_IS_SLEEPING(td2) || TD_IS_SWAPPED(td2)) && (td2->td_flags & TDF_SINTR)) { if (td2->td_flags & TDF_SBDRY) { /* * Once a thread is asleep with * TDF_SBDRY and without TDF_SERESTART * or TDF_SEINTR set, it should never * become suspended due to this check. */ KASSERT(!TD_IS_SUSPENDED(td2), ("thread with deferred stops suspended")); if (TD_SBDRY_INTR(td2)) wakeup_swapper |= sleepq_abort(td2, TD_SBDRY_ERRNO(td2)); } else if (!TD_IS_SUSPENDED(td2)) { thread_suspend_one(td2); } } else if (!TD_IS_SUSPENDED(td2)) { if (sending || td != td2) td2->td_flags |= TDF_ASTPENDING; #ifdef SMP if (TD_IS_RUNNING(td2) && td2 != td) forward_signal(td2); #endif } thread_unlock(td2); } return (wakeup_swapper); } /* * Stop the process for an event deemed interesting to the debugger. If si is * non-NULL, this is a signal exchange; the new signal requested by the * debugger will be returned for handling. If si is NULL, this is some other * type of interesting event. The debugger may request a signal be delivered in * that case as well, however it will be deferred until it can be handled. */ int ptracestop(struct thread *td, int sig, ksiginfo_t *si) { struct proc *p = td->td_proc; struct thread *td2; ksiginfo_t ksi; int prop; PROC_LOCK_ASSERT(p, MA_OWNED); KASSERT(!(p->p_flag & P_WEXIT), ("Stopping exiting process")); WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, &p->p_mtx.lock_object, "Stopping for traced signal"); td->td_xsig = sig; if (si == NULL || (si->ksi_flags & KSI_PTRACE) == 0) { td->td_dbgflags |= TDB_XSIG; CTR4(KTR_PTRACE, "ptracestop: tid %d (pid %d) flags %#x sig %d", td->td_tid, p->p_pid, td->td_dbgflags, sig); PROC_SLOCK(p); while ((p->p_flag & P_TRACED) && (td->td_dbgflags & TDB_XSIG)) { if (P_KILLED(p)) { /* * Ensure that, if we've been PT_KILLed, the * exit status reflects that. Another thread * may also be in ptracestop(), having just * received the SIGKILL, but this thread was * unsuspended first. */ td->td_dbgflags &= ~TDB_XSIG; td->td_xsig = SIGKILL; p->p_ptevents = 0; break; } if (p->p_flag & P_SINGLE_EXIT && !(td->td_dbgflags & TDB_EXIT)) { /* * Ignore ptrace stops except for thread exit * events when the process exits. */ td->td_dbgflags &= ~TDB_XSIG; PROC_SUNLOCK(p); return (0); } /* * Make wait(2) work. Ensure that right after the * attach, the thread which was decided to become the * leader of attach gets reported to the waiter. 
* Otherwise, just avoid overwriting another thread's * assignment to p_xthread. If another thread has * already set p_xthread, the current thread will get * a chance to report itself upon the next iteration. */ if ((td->td_dbgflags & TDB_FSTP) != 0 || ((p->p_flag2 & P2_PTRACE_FSTP) == 0 && p->p_xthread == NULL)) { p->p_xsig = sig; p->p_xthread = td; td->td_dbgflags &= ~TDB_FSTP; p->p_flag2 &= ~P2_PTRACE_FSTP; p->p_flag |= P_STOPPED_SIG | P_STOPPED_TRACE; sig_suspend_threads(td, p, 0); } if ((td->td_dbgflags & TDB_STOPATFORK) != 0) { td->td_dbgflags &= ~TDB_STOPATFORK; } stopme: thread_suspend_switch(td, p); if (p->p_xthread == td) p->p_xthread = NULL; if (!(p->p_flag & P_TRACED)) break; if (td->td_dbgflags & TDB_SUSPEND) { if (p->p_flag & P_SINGLE_EXIT) break; goto stopme; } } PROC_SUNLOCK(p); } if (si != NULL && sig == td->td_xsig) { /* Parent wants us to take the original signal unchanged. */ si->ksi_flags |= KSI_HEAD; if (sigqueue_add(&td->td_sigqueue, sig, si) != 0) si->ksi_signo = 0; } else if (td->td_xsig != 0) { /* * If parent wants us to take a new signal, then it will leave * it in td->td_xsig; otherwise we just look for signals again. */ ksiginfo_init(&ksi); ksi.ksi_signo = td->td_xsig; ksi.ksi_flags |= KSI_PTRACE; prop = sigprop(td->td_xsig); td2 = sigtd(p, td->td_xsig, prop); tdsendsignal(p, td2, td->td_xsig, &ksi); if (td != td2) return (0); } return (td->td_xsig); } static void reschedule_signals(struct proc *p, sigset_t block, int flags) { struct sigacts *ps; struct thread *td; int sig; PROC_LOCK_ASSERT(p, MA_OWNED); ps = p->p_sigacts; mtx_assert(&ps->ps_mtx, (flags & SIGPROCMASK_PS_LOCKED) != 0 ? MA_OWNED : MA_NOTOWNED); if (SIGISEMPTY(p->p_siglist)) return; SIGSETAND(block, p->p_siglist); while ((sig = sig_ffs(&block)) != 0) { SIGDELSET(block, sig); td = sigtd(p, sig, 0); signotify(td); if (!(flags & SIGPROCMASK_PS_LOCKED)) mtx_lock(&ps->ps_mtx); if (p->p_flag & P_TRACED || (SIGISMEMBER(ps->ps_sigcatch, sig) && !SIGISMEMBER(td->td_sigmask, sig))) tdsigwakeup(td, sig, SIG_CATCH, (SIGISMEMBER(ps->ps_sigintr, sig) ? EINTR : ERESTART)); if (!(flags & SIGPROCMASK_PS_LOCKED)) mtx_unlock(&ps->ps_mtx); } } void tdsigcleanup(struct thread *td) { struct proc *p; sigset_t unblocked; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); sigqueue_flush(&td->td_sigqueue); if (p->p_numthreads == 1) return; /* * Since we cannot handle signals, notify signal post code * about this by filling the sigmask. * * Also, if needed, wake up thread(s) that do not block the * same signals as the exiting thread, since the thread might * have been selected for delivery and woken up. */ SIGFILLSET(unblocked); SIGSETNAND(unblocked, td->td_sigmask); SIGFILLSET(td->td_sigmask); reschedule_signals(p, unblocked, 0); } static int sigdeferstop_curr_flags(int cflags) { MPASS((cflags & (TDF_SEINTR | TDF_SERESTART)) == 0 || (cflags & TDF_SBDRY) != 0); return (cflags & (TDF_SBDRY | TDF_SEINTR | TDF_SERESTART)); } /* * Defer the delivery of SIGSTOP for the current thread, according to * the requested mode. Returns previous flags, which must be restored * by sigallowstop(). * * TDF_SBDRY, TDF_SEINTR, and TDF_SERESTART flags are only set and * cleared by the current thread, which allow the lock-less read-only * accesses below. 
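 *
 * Typical use is through the sigdeferstop()/sigallowstop() wrappers
 * (assumed here; they are thin inlines around the *_impl() functions
 * below), e.g.:
 *
 *	int sds;
 *
 *	sds = sigdeferstop(SIGDEFERSTOP_SILENT);
 *	error = some_sleepable_operation();
 *	sigallowstop(sds);
 *
 * where some_sleepable_operation() stands in for any code that must not
 * be suspended part-way by a stop signal.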
*/ int sigdeferstop_impl(int mode) { struct thread *td; int cflags, nflags; td = curthread; cflags = sigdeferstop_curr_flags(td->td_flags); switch (mode) { case SIGDEFERSTOP_NOP: nflags = cflags; break; case SIGDEFERSTOP_OFF: nflags = 0; break; case SIGDEFERSTOP_SILENT: nflags = (cflags | TDF_SBDRY) & ~(TDF_SEINTR | TDF_SERESTART); break; case SIGDEFERSTOP_EINTR: nflags = (cflags | TDF_SBDRY | TDF_SEINTR) & ~TDF_SERESTART; break; case SIGDEFERSTOP_ERESTART: nflags = (cflags | TDF_SBDRY | TDF_SERESTART) & ~TDF_SEINTR; break; default: panic("sigdeferstop: invalid mode %x", mode); break; } if (cflags == nflags) return (SIGDEFERSTOP_VAL_NCHG); thread_lock(td); td->td_flags = (td->td_flags & ~cflags) | nflags; thread_unlock(td); return (cflags); } /* * Restores the STOP handling mode, typically permitting the delivery * of SIGSTOP for the current thread. This does not immediately * suspend if a stop was posted. Instead, the thread will suspend * either via ast() or a subsequent interruptible sleep. */ void sigallowstop_impl(int prev) { struct thread *td; int cflags; KASSERT(prev != SIGDEFERSTOP_VAL_NCHG, ("failed sigallowstop")); KASSERT((prev & ~(TDF_SBDRY | TDF_SEINTR | TDF_SERESTART)) == 0, ("sigallowstop: incorrect previous mode %x", prev)); td = curthread; cflags = sigdeferstop_curr_flags(td->td_flags); if (cflags != prev) { thread_lock(td); td->td_flags = (td->td_flags & ~cflags) | prev; thread_unlock(td); } } /* * If the current process has received a signal (should be caught or cause * termination, should interrupt current syscall), return the signal number. * Stop signals with default action are processed immediately, then cleared; * they aren't returned. This is checked after each entry to the system for * a syscall or trap (though this can usually be done without calling issignal * by checking the pending signal masks in cursig.) The normal call * sequence is * * while (sig = cursig(curthread)) * postsig(sig); */ static int issignal(struct thread *td) { struct proc *p; struct sigacts *ps; struct sigqueue *queue; sigset_t sigpending; ksiginfo_t ksi; int prop, sig, traced; p = td->td_proc; ps = p->p_sigacts; mtx_assert(&ps->ps_mtx, MA_OWNED); PROC_LOCK_ASSERT(p, MA_OWNED); for (;;) { traced = (p->p_flag & P_TRACED) || (p->p_stops & S_SIG); sigpending = td->td_sigqueue.sq_signals; SIGSETOR(sigpending, p->p_sigqueue.sq_signals); SIGSETNAND(sigpending, td->td_sigmask); if ((p->p_flag & P_PPWAIT) != 0 || (td->td_flags & (TDF_SBDRY | TDF_SERESTART | TDF_SEINTR)) == TDF_SBDRY) SIG_STOPSIGMASK(sigpending); if (SIGISEMPTY(sigpending)) /* no signal to send */ return (0); if ((p->p_flag & (P_TRACED | P_PPTRACE)) == P_TRACED && (p->p_flag2 & P2_PTRACE_FSTP) != 0 && SIGISMEMBER(sigpending, SIGSTOP)) { /* * If debugger just attached, always consume * SIGSTOP from ptrace(PT_ATTACH) first, to * execute the debugger attach ritual in * order. */ sig = SIGSTOP; td->td_dbgflags |= TDB_FSTP; } else { sig = sig_ffs(&sigpending); } if (p->p_stops & S_SIG) { mtx_unlock(&ps->ps_mtx); stopevent(p, S_SIG, sig); mtx_lock(&ps->ps_mtx); } /* * We should see pending but ignored signals * only if P_TRACED was on when they were posted. */ if (SIGISMEMBER(ps->ps_sigignore, sig) && (traced == 0)) { sigqueue_delete(&td->td_sigqueue, sig); sigqueue_delete(&p->p_sigqueue, sig); continue; } if ((p->p_flag & (P_TRACED | P_PPTRACE)) == P_TRACED) { /* * If traced, always stop. * Remove old signal from queue before the stop. * XXX shrug off debugger, it causes siginfo to * be thrown away. 
*/ queue = &td->td_sigqueue; ksiginfo_init(&ksi); if (sigqueue_get(queue, sig, &ksi) == 0) { queue = &p->p_sigqueue; sigqueue_get(queue, sig, &ksi); } td->td_si = ksi.ksi_info; mtx_unlock(&ps->ps_mtx); sig = ptracestop(td, sig, &ksi); mtx_lock(&ps->ps_mtx); td->td_si.si_signo = 0; /* * Keep looking if the debugger discarded or * replaced the signal. */ if (sig == 0) continue; /* * If the signal became masked, re-queue it. */ if (SIGISMEMBER(td->td_sigmask, sig)) { ksi.ksi_flags |= KSI_HEAD; sigqueue_add(&p->p_sigqueue, sig, &ksi); continue; } /* * If the traced bit got turned off, requeue * the signal and go back up to the top to * rescan signals. This ensures that p_sig* * and p_sigact are consistent. */ if ((p->p_flag & P_TRACED) == 0) { ksi.ksi_flags |= KSI_HEAD; sigqueue_add(queue, sig, &ksi); continue; } } prop = sigprop(sig); /* * Decide whether the signal should be returned. * Return the signal's number, or fall through * to clear it from the pending mask. */ switch ((intptr_t)p->p_sigacts->ps_sigact[_SIG_IDX(sig)]) { case (intptr_t)SIG_DFL: /* * Don't take default actions on system processes. */ if (p->p_pid <= 1) { #ifdef DIAGNOSTIC /* * Are you sure you want to ignore SIGSEGV * in init? XXX */ printf("Process (pid %lu) got signal %d\n", (u_long)p->p_pid, sig); #endif break; /* == ignore */ } /* * If there is a pending stop signal to process with * default action, stop here, then clear the signal. * Traced or exiting processes should ignore stops. * Additionally, a member of an orphaned process group * should ignore tty stops. */ if (prop & SIGPROP_STOP) { if (p->p_flag & (P_TRACED | P_WEXIT | P_SINGLE_EXIT) || (p->p_pgrp->pg_jobc == 0 && prop & SIGPROP_TTYSTOP)) break; /* == ignore */ if (TD_SBDRY_INTR(td)) { KASSERT((td->td_flags & TDF_SBDRY) != 0, ("lost TDF_SBDRY")); return (-1); } mtx_unlock(&ps->ps_mtx); WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, &p->p_mtx.lock_object, "Catching SIGSTOP"); sigqueue_delete(&td->td_sigqueue, sig); sigqueue_delete(&p->p_sigqueue, sig); p->p_flag |= P_STOPPED_SIG; p->p_xsig = sig; PROC_SLOCK(p); sig_suspend_threads(td, p, 0); thread_suspend_switch(td, p); PROC_SUNLOCK(p); mtx_lock(&ps->ps_mtx); goto next; } else if (prop & SIGPROP_IGNORE) { /* * Except for SIGCONT, shouldn't get here. * Default action is to ignore; drop it. */ break; /* == ignore */ } else return (sig); /*NOTREACHED*/ case (intptr_t)SIG_IGN: /* * Masking above should prevent us ever trying * to take action on an ignored signal other * than SIGCONT, unless process is traced. */ if ((prop & SIGPROP_CONT) == 0 && (p->p_flag & P_TRACED) == 0) printf("issignal\n"); break; /* == ignore */ default: /* * This signal has an action, let * postsig() process it. */ return (sig); } sigqueue_delete(&td->td_sigqueue, sig); /* take the signal! */ sigqueue_delete(&p->p_sigqueue, sig); next:; } /* NOTREACHED */ } void thread_stopped(struct proc *p) { int n; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); n = p->p_suspcount; if (p == curproc) n++; if ((p->p_flag & P_STOPPED_SIG) && (n == p->p_numthreads)) { PROC_SUNLOCK(p); p->p_flag &= ~P_WAITED; PROC_LOCK(p->p_pptr); childproc_stopped(p, (p->p_flag & P_TRACED) ? CLD_TRAPPED : CLD_STOPPED); PROC_UNLOCK(p->p_pptr); PROC_SLOCK(p); } } /* * Take the action for the specified signal * from the current set of pending signals. 
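 * Returns 1 if an action was taken, or 0 if the signal was no longer
 * pending on either the thread or the process queue.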
*/ int postsig(int sig) { struct thread *td; struct proc *p; struct sigacts *ps; sig_t action; ksiginfo_t ksi; sigset_t returnmask; KASSERT(sig != 0, ("postsig")); td = curthread; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); ps = p->p_sigacts; mtx_assert(&ps->ps_mtx, MA_OWNED); ksiginfo_init(&ksi); if (sigqueue_get(&td->td_sigqueue, sig, &ksi) == 0 && sigqueue_get(&p->p_sigqueue, sig, &ksi) == 0) return (0); ksi.ksi_signo = sig; if (ksi.ksi_code == SI_TIMER) itimer_accept(p, ksi.ksi_timerid, &ksi); action = ps->ps_sigact[_SIG_IDX(sig)]; #ifdef KTRACE if (KTRPOINT(td, KTR_PSIG)) ktrpsig(sig, action, td->td_pflags & TDP_OLDMASK ? &td->td_oldsigmask : &td->td_sigmask, ksi.ksi_code); #endif if ((p->p_stops & S_SIG) != 0) { mtx_unlock(&ps->ps_mtx); stopevent(p, S_SIG, sig); mtx_lock(&ps->ps_mtx); } if (action == SIG_DFL) { /* * Default action, where the default is to kill * the process. (Other cases were ignored above.) */ mtx_unlock(&ps->ps_mtx); proc_td_siginfo_capture(td, &ksi.ksi_info); sigexit(td, sig); /* NOTREACHED */ } else { /* * If we get here, the signal must be caught. */ KASSERT(action != SIG_IGN, ("postsig action %p", action)); KASSERT(!SIGISMEMBER(td->td_sigmask, sig), ("postsig action: blocked sig %d", sig)); /* * Set the new mask value and also defer further * occurrences of this signal. * * Special case: user has done a sigsuspend. Here the * current mask is not of interest, but rather the * mask from before the sigsuspend is what we want * restored after the signal processing is completed. */ if (td->td_pflags & TDP_OLDMASK) { returnmask = td->td_oldsigmask; td->td_pflags &= ~TDP_OLDMASK; } else returnmask = td->td_sigmask; if (p->p_sig == sig) { - p->p_code = 0; p->p_sig = 0; } (*p->p_sysent->sv_sendsig)(action, &ksi, &returnmask); postsig_done(sig, td, ps); } return (1); } void proc_wkilled(struct proc *p) { PROC_LOCK_ASSERT(p, MA_OWNED); if ((p->p_flag & P_WKILLED) == 0) { p->p_flag |= P_WKILLED; /* * Notify swapper that there is a process to swap in. * The notification is racy, at worst it would take 10 * seconds for the swapper process to notice. */ if ((p->p_flag & (P_INMEM | P_SWAPPINGIN)) == 0) wakeup(&proc0); } } /* * Kill the current process for stated reason. */ void killproc(struct proc *p, char *why) { PROC_LOCK_ASSERT(p, MA_OWNED); CTR3(KTR_PROC, "killproc: proc %p (pid %d, %s)", p, p->p_pid, p->p_comm); log(LOG_ERR, "pid %d (%s), jid %d, uid %d, was killed: %s\n", p->p_pid, p->p_comm, p->p_ucred->cr_prison->pr_id, p->p_ucred->cr_uid, why); proc_wkilled(p); kern_psignal(p, SIGKILL); } /* * Force the current process to exit with the specified signal, dumping core * if appropriate. We bypass the normal tests for masked and caught signals, * allowing unrecoverable failures to terminate the process without changing * signal state. Mark the accounting record with the signal termination. * If dumping core, save the signal number for the debugger. Calls exit and * does not return. */ void sigexit(struct thread *td, int sig) { struct proc *p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); p->p_acflag |= AXSIG; /* * We must be single-threading to generate a core dump. This * ensures that the registers in the core file are up-to-date. * Also, the ELF dump handler assumes that the thread list doesn't * change out from under it. * * XXX If another thread attempts to single-thread before us * (e.g. via fork()), we won't get a dump at all. 
*/ if ((sigprop(sig) & SIGPROP_CORE) && thread_single(p, SINGLE_NO_EXIT) == 0) { p->p_sig = sig; /* * Log signals which would cause core dumps * (Log as LOG_INFO to appease those who don't want * these messages.) * XXX : Todo, as well as euid, write out ruid too * Note that coredump() drops proc lock. */ if (coredump(td) == 0) sig |= WCOREFLAG; if (kern_logsigexit) log(LOG_INFO, "pid %d (%s), jid %d, uid %d: exited on " "signal %d%s\n", p->p_pid, p->p_comm, p->p_ucred->cr_prison->pr_id, td->td_ucred->cr_uid, sig &~ WCOREFLAG, sig & WCOREFLAG ? " (core dumped)" : ""); } else PROC_UNLOCK(p); exit1(td, 0, sig); /* NOTREACHED */ } /* * Send queued SIGCHLD to parent when child process's state * is changed. */ static void sigparent(struct proc *p, int reason, int status) { PROC_LOCK_ASSERT(p, MA_OWNED); PROC_LOCK_ASSERT(p->p_pptr, MA_OWNED); if (p->p_ksi != NULL) { p->p_ksi->ksi_signo = SIGCHLD; p->p_ksi->ksi_code = reason; p->p_ksi->ksi_status = status; p->p_ksi->ksi_pid = p->p_pid; p->p_ksi->ksi_uid = p->p_ucred->cr_ruid; if (KSI_ONQ(p->p_ksi)) return; } pksignal(p->p_pptr, SIGCHLD, p->p_ksi); } static void childproc_jobstate(struct proc *p, int reason, int sig) { struct sigacts *ps; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_LOCK_ASSERT(p->p_pptr, MA_OWNED); /* * Wake up parent sleeping in kern_wait(), also send * SIGCHLD to parent, but SIGCHLD does not guarantee * that parent will awake, because parent may masked * the signal. */ p->p_pptr->p_flag |= P_STATCHILD; wakeup(p->p_pptr); ps = p->p_pptr->p_sigacts; mtx_lock(&ps->ps_mtx); if ((ps->ps_flag & PS_NOCLDSTOP) == 0) { mtx_unlock(&ps->ps_mtx); sigparent(p, reason, sig); } else mtx_unlock(&ps->ps_mtx); } void childproc_stopped(struct proc *p, int reason) { childproc_jobstate(p, reason, p->p_xsig); } void childproc_continued(struct proc *p) { childproc_jobstate(p, CLD_CONTINUED, SIGCONT); } void childproc_exited(struct proc *p) { int reason, status; if (WCOREDUMP(p->p_xsig)) { reason = CLD_DUMPED; status = WTERMSIG(p->p_xsig); } else if (WIFSIGNALED(p->p_xsig)) { reason = CLD_KILLED; status = WTERMSIG(p->p_xsig); } else { reason = CLD_EXITED; status = p->p_xexit; } /* * XXX avoid calling wakeup(p->p_pptr), the work is * done in exit1(). 
*/ sigparent(p, reason, status); } #define MAX_NUM_CORE_FILES 100000 #ifndef NUM_CORE_FILES #define NUM_CORE_FILES 5 #endif CTASSERT(NUM_CORE_FILES >= 0 && NUM_CORE_FILES <= MAX_NUM_CORE_FILES); static int num_cores = NUM_CORE_FILES; static int sysctl_debug_num_cores_check (SYSCTL_HANDLER_ARGS) { int error; int new_val; new_val = num_cores; error = sysctl_handle_int(oidp, &new_val, 0, req); if (error != 0 || req->newptr == NULL) return (error); if (new_val > MAX_NUM_CORE_FILES) new_val = MAX_NUM_CORE_FILES; if (new_val < 0) new_val = 0; num_cores = new_val; return (0); } SYSCTL_PROC(_debug, OID_AUTO, ncores, CTLTYPE_INT|CTLFLAG_RW, 0, sizeof(int), sysctl_debug_num_cores_check, "I", "Maximum number of generated process corefiles while using index format"); #define GZIP_SUFFIX ".gz" #define ZSTD_SUFFIX ".zst" int compress_user_cores = 0; static int sysctl_compress_user_cores(SYSCTL_HANDLER_ARGS) { int error, val; val = compress_user_cores; error = sysctl_handle_int(oidp, &val, 0, req); if (error != 0 || req->newptr == NULL) return (error); if (val != 0 && !compressor_avail(val)) return (EINVAL); compress_user_cores = val; return (error); } SYSCTL_PROC(_kern, OID_AUTO, compress_user_cores, CTLTYPE_INT | CTLFLAG_RWTUN, 0, sizeof(int), sysctl_compress_user_cores, "I", "Enable compression of user corefiles (" __XSTRING(COMPRESS_GZIP) " = gzip, " __XSTRING(COMPRESS_ZSTD) " = zstd)"); int compress_user_cores_level = 6; SYSCTL_INT(_kern, OID_AUTO, compress_user_cores_level, CTLFLAG_RWTUN, &compress_user_cores_level, 0, "Corefile compression level"); /* * Protect the access to corefilename[] by allproc_lock. */ #define corefilename_lock allproc_lock static char corefilename[MAXPATHLEN] = {"%N.core"}; TUNABLE_STR("kern.corefile", corefilename, sizeof(corefilename)); static int sysctl_kern_corefile(SYSCTL_HANDLER_ARGS) { int error; sx_xlock(&corefilename_lock); error = sysctl_handle_string(oidp, corefilename, sizeof(corefilename), req); sx_xunlock(&corefilename_lock); return (error); } SYSCTL_PROC(_kern, OID_AUTO, corefile, CTLTYPE_STRING | CTLFLAG_RW | CTLFLAG_MPSAFE, 0, 0, sysctl_kern_corefile, "A", "Process corefile name format string"); static void vnode_close_locked(struct thread *td, struct vnode *vp) { VOP_UNLOCK(vp, 0); vn_close(vp, FWRITE, td->td_ucred, td); } /* * If the core format has a %I in it, then we need to check * for existing corefiles before defining a name. * To do this we iterate over 0..ncores to find a * non-existing core file name to use. If all core files are * already used we choose the oldest one. */ static int corefile_open_last(struct thread *td, char *name, int indexpos, int indexlen, int ncores, struct vnode **vpp) { struct vnode *oldvp, *nextvp, *vp; struct vattr vattr; struct nameidata nd; int error, i, flags, oflags, cmode; char ch; struct timespec lasttime; nextvp = oldvp = NULL; cmode = S_IRUSR | S_IWUSR; oflags = VN_OPEN_NOAUDIT | VN_OPEN_NAMECACHE | (capmode_coredump ? 
VN_OPEN_NOCAPCHECK : 0); for (i = 0; i < ncores; i++) { flags = O_CREAT | FWRITE | O_NOFOLLOW; ch = name[indexpos + indexlen]; (void)snprintf(name + indexpos, indexlen + 1, "%.*u", indexlen, i); name[indexpos + indexlen] = ch; NDINIT(&nd, LOOKUP, NOFOLLOW, UIO_SYSSPACE, name, td); error = vn_open_cred(&nd, &flags, cmode, oflags, td->td_ucred, NULL); if (error != 0) break; vp = nd.ni_vp; NDFREE(&nd, NDF_ONLY_PNBUF); if ((flags & O_CREAT) == O_CREAT) { nextvp = vp; break; } error = VOP_GETATTR(vp, &vattr, td->td_ucred); if (error != 0) { vnode_close_locked(td, vp); break; } if (oldvp == NULL || lasttime.tv_sec > vattr.va_mtime.tv_sec || (lasttime.tv_sec == vattr.va_mtime.tv_sec && lasttime.tv_nsec >= vattr.va_mtime.tv_nsec)) { if (oldvp != NULL) vnode_close_locked(td, oldvp); oldvp = vp; lasttime = vattr.va_mtime; } else { vnode_close_locked(td, vp); } } if (oldvp != NULL) { if (nextvp == NULL) nextvp = oldvp; else vnode_close_locked(td, oldvp); } if (error != 0) { if (nextvp != NULL) vnode_close_locked(td, oldvp); } else { *vpp = nextvp; } return (error); } /* * corefile_open(comm, uid, pid, td, compress, vpp, namep) * Expand the name described in corefilename, using name, uid, and pid * and open/create core file. * corefilename is a printf-like string, with three format specifiers: * %N name of process ("name") * %P process id (pid) * %U user id (uid) * For example, "%N.core" is the default; they can be disabled completely * by using "/dev/null", or all core files can be stored in "/cores/%U/%N-%P". * This is controlled by the sysctl variable kern.corefile (see above). */ static int corefile_open(const char *comm, uid_t uid, pid_t pid, struct thread *td, int compress, struct vnode **vpp, char **namep) { struct sbuf sb; struct nameidata nd; const char *format; char *hostname, *name; int cmode, error, flags, i, indexpos, indexlen, oflags, ncores; hostname = NULL; format = corefilename; name = malloc(MAXPATHLEN, M_TEMP, M_WAITOK | M_ZERO); indexlen = 0; indexpos = -1; ncores = num_cores; (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN); sx_slock(&corefilename_lock); for (i = 0; format[i] != '\0'; i++) { switch (format[i]) { case '%': /* Format character */ i++; switch (format[i]) { case '%': sbuf_putc(&sb, '%'); break; case 'H': /* hostname */ if (hostname == NULL) { hostname = malloc(MAXHOSTNAMELEN, M_TEMP, M_WAITOK); } getcredhostname(td->td_ucred, hostname, MAXHOSTNAMELEN); sbuf_printf(&sb, "%s", hostname); break; case 'I': /* autoincrementing index */ if (indexpos != -1) { sbuf_printf(&sb, "%%I"); break; } indexpos = sbuf_len(&sb); sbuf_printf(&sb, "%u", ncores - 1); indexlen = sbuf_len(&sb) - indexpos; break; case 'N': /* process name */ sbuf_printf(&sb, "%s", comm); break; case 'P': /* process id */ sbuf_printf(&sb, "%u", pid); break; case 'U': /* user id */ sbuf_printf(&sb, "%u", uid); break; default: log(LOG_ERR, "Unknown format character %c in " "corename `%s'\n", format[i], format); break; } break; default: sbuf_putc(&sb, format[i]); break; } } sx_sunlock(&corefilename_lock); free(hostname, M_TEMP); if (compress == COMPRESS_GZIP) sbuf_printf(&sb, GZIP_SUFFIX); else if (compress == COMPRESS_ZSTD) sbuf_printf(&sb, ZSTD_SUFFIX); if (sbuf_error(&sb) != 0) { log(LOG_ERR, "pid %ld (%s), uid (%lu): corename is too " "long\n", (long)pid, comm, (u_long)uid); sbuf_delete(&sb); free(name, M_TEMP); return (ENOMEM); } sbuf_finish(&sb); sbuf_delete(&sb); if (indexpos != -1) { error = corefile_open_last(td, name, indexpos, indexlen, ncores, vpp); if (error != 0) { log(LOG_ERR, "pid %d (%s), 
uid (%u): Path `%s' failed " "on initial open test, error = %d\n", pid, comm, uid, name, error); } } else { cmode = S_IRUSR | S_IWUSR; oflags = VN_OPEN_NOAUDIT | VN_OPEN_NAMECACHE | (capmode_coredump ? VN_OPEN_NOCAPCHECK : 0); flags = O_CREAT | FWRITE | O_NOFOLLOW; NDINIT(&nd, LOOKUP, NOFOLLOW, UIO_SYSSPACE, name, td); error = vn_open_cred(&nd, &flags, cmode, oflags, td->td_ucred, NULL); if (error == 0) { *vpp = nd.ni_vp; NDFREE(&nd, NDF_ONLY_PNBUF); } } if (error != 0) { #ifdef AUDIT audit_proc_coredump(td, name, error); #endif free(name, M_TEMP); return (error); } *namep = name; return (0); } /* * Dump a process' core. The main routine does some * policy checking, and creates the name of the coredump; * then it passes on a vnode and a size limit to the process-specific * coredump routine if there is one; if there _is not_ one, it returns * ENOSYS; otherwise it returns the error from the process-specific routine. */ static int coredump(struct thread *td) { struct proc *p = td->td_proc; struct ucred *cred = td->td_ucred; struct vnode *vp; struct flock lf; struct vattr vattr; int error, error1, locked; char *name; /* name of corefile */ void *rl_cookie; off_t limit; char *fullpath, *freepath = NULL; struct sbuf *sb; PROC_LOCK_ASSERT(p, MA_OWNED); MPASS((p->p_flag & P_HADTHREADS) == 0 || p->p_singlethread == td); _STOPEVENT(p, S_CORE, 0); if (!do_coredump || (!sugid_coredump && (p->p_flag & P_SUGID) != 0) || (p->p_flag2 & P2_NOTRACE) != 0) { PROC_UNLOCK(p); return (EFAULT); } /* * Note that the bulk of limit checking is done after * the corefile is created. The exception is if the limit * for corefiles is 0, in which case we don't bother * creating the corefile at all. This layout means that * a corefile is truncated instead of not being created, * if it is larger than the limit. */ limit = (off_t)lim_cur(td, RLIMIT_CORE); if (limit == 0 || racct_get_available(p, RACCT_CORE) == 0) { PROC_UNLOCK(p); return (EFBIG); } PROC_UNLOCK(p); error = corefile_open(p->p_comm, cred->cr_uid, p->p_pid, td, compress_user_cores, &vp, &name); if (error != 0) return (error); /* * Don't dump to non-regular files or files with links. * Do not dump into system files. */ if (vp->v_type != VREG || VOP_GETATTR(vp, &vattr, cred) != 0 || vattr.va_nlink != 1 || (vp->v_vflag & VV_SYSTEM) != 0) { VOP_UNLOCK(vp, 0); error = EFAULT; goto out; } VOP_UNLOCK(vp, 0); /* Postpone other writers, including core dumps of other processes. */ rl_cookie = vn_rangelock_wlock(vp, 0, OFF_MAX); lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; lf.l_type = F_WRLCK; locked = (VOP_ADVLOCK(vp, (caddr_t)p, F_SETLK, &lf, F_FLOCK) == 0); VATTR_NULL(&vattr); vattr.va_size = 0; if (set_core_nodump_flag) vattr.va_flags = UF_NODUMP; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); VOP_SETATTR(vp, &vattr, cred); VOP_UNLOCK(vp, 0); PROC_LOCK(p); p->p_acflag |= ACORE; PROC_UNLOCK(p); if (p->p_sysent->sv_coredump != NULL) { error = p->p_sysent->sv_coredump(td, vp, limit, 0); } else { error = ENOSYS; } if (locked) { lf.l_type = F_UNLCK; VOP_ADVLOCK(vp, (caddr_t)p, F_UNLCK, &lf, F_FLOCK); } vn_rangelock_unlock(vp, rl_cookie); /* * Notify the userland helper that a process triggered a core dump. * This allows the helper to run an automated debugging session. 
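 * The notification is posted via devctl on system "kernel", subsystem
 * "signal", type "coredump", with comm="..." and core="..." in its data,
 * so a devd(8) rule can match on it and act on the freshly written core.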
*/ if (error != 0 || coredump_devctl == 0) goto out; sb = sbuf_new_auto(); if (vn_fullpath_global(td, p->p_textvp, &fullpath, &freepath) != 0) goto out2; sbuf_printf(sb, "comm=\""); devctl_safe_quote_sb(sb, fullpath); free(freepath, M_TEMP); sbuf_printf(sb, "\" core=\""); /* * We can't lookup core file vp directly. When we're replacing a core, and * other random times, we flush the name cache, so it will fail. Instead, * if the path of the core is relative, add the current dir in front if it. */ if (name[0] != '/') { fullpath = malloc(MAXPATHLEN, M_TEMP, M_WAITOK); if (kern___getcwd(td, fullpath, UIO_SYSSPACE, MAXPATHLEN, MAXPATHLEN) != 0) { free(fullpath, M_TEMP); goto out2; } devctl_safe_quote_sb(sb, fullpath); free(fullpath, M_TEMP); sbuf_putc(sb, '/'); } devctl_safe_quote_sb(sb, name); sbuf_printf(sb, "\""); if (sbuf_finish(sb) == 0) devctl_notify("kernel", "signal", "coredump", sbuf_data(sb)); out2: sbuf_delete(sb); out: error1 = vn_close(vp, FWRITE, cred, td); if (error == 0) error = error1; #ifdef AUDIT audit_proc_coredump(td, name, error); #endif free(name, M_TEMP); return (error); } /* * Nonexistent system call-- signal process (may want to handle it). Flag * error in case process won't see signal immediately (blocked or ignored). */ #ifndef _SYS_SYSPROTO_H_ struct nosys_args { int dummy; }; #endif /* ARGSUSED */ int nosys(struct thread *td, struct nosys_args *args) { struct proc *p; p = td->td_proc; PROC_LOCK(p); tdsignal(td, SIGSYS); PROC_UNLOCK(p); if (kern_lognosys == 1 || kern_lognosys == 3) { uprintf("pid %d comm %s: nosys %d\n", p->p_pid, p->p_comm, td->td_sa.code); } if (kern_lognosys == 2 || kern_lognosys == 3) { printf("pid %d comm %s: nosys %d\n", p->p_pid, p->p_comm, td->td_sa.code); } return (ENOSYS); } /* * Send a SIGIO or SIGURG signal to a process or process group using stored * credentials rather than those of the current process. */ void pgsigio(struct sigio **sigiop, int sig, int checkctty) { ksiginfo_t ksi; struct sigio *sigio; ksiginfo_init(&ksi); ksi.ksi_signo = sig; ksi.ksi_code = SI_KERNEL; SIGIO_LOCK(); sigio = *sigiop; if (sigio == NULL) { SIGIO_UNLOCK(); return; } if (sigio->sio_pgid > 0) { PROC_LOCK(sigio->sio_proc); if (CANSIGIO(sigio->sio_ucred, sigio->sio_proc->p_ucred)) kern_psignal(sigio->sio_proc, sig); PROC_UNLOCK(sigio->sio_proc); } else if (sigio->sio_pgid < 0) { struct proc *p; PGRP_LOCK(sigio->sio_pgrp); LIST_FOREACH(p, &sigio->sio_pgrp->pg_members, p_pglist) { PROC_LOCK(p); if (p->p_state == PRS_NORMAL && CANSIGIO(sigio->sio_ucred, p->p_ucred) && (checkctty == 0 || (p->p_flag & P_CONTROLT))) kern_psignal(p, sig); PROC_UNLOCK(p); } PGRP_UNLOCK(sigio->sio_pgrp); } SIGIO_UNLOCK(); } static int filt_sigattach(struct knote *kn) { struct proc *p = curproc; kn->kn_ptr.p_proc = p; kn->kn_flags |= EV_CLEAR; /* automatically set */ knlist_add(p->p_klist, kn, 0); return (0); } static void filt_sigdetach(struct knote *kn) { struct proc *p = kn->kn_ptr.p_proc; knlist_remove(p->p_klist, kn, 0); } /* * signal knotes are shared with proc knotes, so we apply a mask to * the hint in order to differentiate them from process hints. This * could be avoided by using a signal-specific knote list, but probably * isn't worth the trouble. 
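 *
 * From userland this is the EVFILT_SIGNAL filter; a minimal consumer
 * sketch (illustrative only, kq assumed to come from kqueue()) is:
 *
 *	struct kevent kev;
 *
 *	signal(SIGUSR1, SIG_IGN);
 *	EV_SET(&kev, SIGUSR1, EVFILT_SIGNAL, EV_ADD, 0, 0, NULL);
 *	kevent(kq, &kev, 1, NULL, 0, NULL);
 *
 * after which the returned kev.data counts deliveries of that signal
 * since the previous retrieval (EV_CLEAR is set automatically above).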
*/ static int filt_signal(struct knote *kn, long hint) { if (hint & NOTE_SIGNAL) { hint &= ~NOTE_SIGNAL; if (kn->kn_id == hint) kn->kn_data++; } return (kn->kn_data != 0); } struct sigacts * sigacts_alloc(void) { struct sigacts *ps; ps = malloc(sizeof(struct sigacts), M_SUBPROC, M_WAITOK | M_ZERO); refcount_init(&ps->ps_refcnt, 1); mtx_init(&ps->ps_mtx, "sigacts", NULL, MTX_DEF); return (ps); } void sigacts_free(struct sigacts *ps) { if (refcount_release(&ps->ps_refcnt) == 0) return; mtx_destroy(&ps->ps_mtx); free(ps, M_SUBPROC); } struct sigacts * sigacts_hold(struct sigacts *ps) { refcount_acquire(&ps->ps_refcnt); return (ps); } void sigacts_copy(struct sigacts *dest, struct sigacts *src) { KASSERT(dest->ps_refcnt == 1, ("sigacts_copy to shared dest")); mtx_lock(&src->ps_mtx); bcopy(src, dest, offsetof(struct sigacts, ps_refcnt)); mtx_unlock(&src->ps_mtx); } int sigacts_shared(struct sigacts *ps) { return (ps->ps_refcnt > 1); } Index: user/ngie/bug-237403/sys/kern/kern_thread.c =================================================================== --- user/ngie/bug-237403/sys/kern/kern_thread.c (revision 346925) +++ user/ngie/bug-237403/sys/kern/kern_thread.c (revision 346926) @@ -1,1267 +1,1267 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2001 Julian Elischer . * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice(s), this list of conditions and the following disclaimer as * the first lines of this file unmodified other than the possible * addition of one or more copyright notices. * 2. Redistributions in binary form must reproduce the above copyright * notice(s), this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER(S) ``AS IS'' AND ANY * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER(S) BE LIABLE FOR ANY * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH * DAMAGE. */ #include "opt_witness.h" #include "opt_hwpmc_hooks.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef HWPMC_HOOKS #include #endif #include #include #include #include #include /* * Asserts below verify the stability of struct thread and struct proc * layout, as exposed by KBI to modules. On head, the KBI is allowed * to drift, change to the structures must be accompanied by the * assert update. * * On the stable branches after KBI freeze, conditions must not be * violated. Typically new fields are moved to the end of the * structures. 
*/ #ifdef __amd64__ _Static_assert(offsetof(struct thread, td_flags) == 0xfc, "struct thread KBI td_flags"); _Static_assert(offsetof(struct thread, td_pflags) == 0x104, "struct thread KBI td_pflags"); _Static_assert(offsetof(struct thread, td_frame) == 0x478, "struct thread KBI td_frame"); _Static_assert(offsetof(struct thread, td_emuldata) == 0x530, "struct thread KBI td_emuldata"); _Static_assert(offsetof(struct proc, p_flag) == 0xb0, "struct proc KBI p_flag"); _Static_assert(offsetof(struct proc, p_pid) == 0xbc, "struct proc KBI p_pid"); -_Static_assert(offsetof(struct proc, p_filemon) == 0x3d0, +_Static_assert(offsetof(struct proc, p_filemon) == 0x3c8, "struct proc KBI p_filemon"); -_Static_assert(offsetof(struct proc, p_comm) == 0x3e8, +_Static_assert(offsetof(struct proc, p_comm) == 0x3e0, "struct proc KBI p_comm"); -_Static_assert(offsetof(struct proc, p_emuldata) == 0x4c8, +_Static_assert(offsetof(struct proc, p_emuldata) == 0x4c0, "struct proc KBI p_emuldata"); #endif #ifdef __i386__ _Static_assert(offsetof(struct thread, td_flags) == 0x98, "struct thread KBI td_flags"); _Static_assert(offsetof(struct thread, td_pflags) == 0xa0, "struct thread KBI td_pflags"); _Static_assert(offsetof(struct thread, td_frame) == 0x2ec, "struct thread KBI td_frame"); _Static_assert(offsetof(struct thread, td_emuldata) == 0x338, "struct thread KBI td_emuldata"); _Static_assert(offsetof(struct proc, p_flag) == 0x68, "struct proc KBI p_flag"); _Static_assert(offsetof(struct proc, p_pid) == 0x74, "struct proc KBI p_pid"); -_Static_assert(offsetof(struct proc, p_filemon) == 0x27c, +_Static_assert(offsetof(struct proc, p_filemon) == 0x278, "struct proc KBI p_filemon"); -_Static_assert(offsetof(struct proc, p_comm) == 0x290, +_Static_assert(offsetof(struct proc, p_comm) == 0x28c, "struct proc KBI p_comm"); -_Static_assert(offsetof(struct proc, p_emuldata) == 0x31c, +_Static_assert(offsetof(struct proc, p_emuldata) == 0x318, "struct proc KBI p_emuldata"); #endif SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE(proc, , , lwp__exit); /* * thread related storage. */ static uma_zone_t thread_zone; TAILQ_HEAD(, thread) zombie_threads = TAILQ_HEAD_INITIALIZER(zombie_threads); static struct mtx zombie_lock; MTX_SYSINIT(zombie_lock, &zombie_lock, "zombie lock", MTX_SPIN); static void thread_zombie(struct thread *); static int thread_unsuspend_one(struct thread *td, struct proc *p, bool boundary); #define TID_BUFFER_SIZE 1024 struct mtx tid_lock; static struct unrhdr *tid_unrhdr; static lwpid_t tid_buffer[TID_BUFFER_SIZE]; static int tid_head, tid_tail; static MALLOC_DEFINE(M_TIDHASH, "tidhash", "thread hash"); struct tidhashhead *tidhashtbl; u_long tidhash; struct rwlock tidhash_lock; EVENTHANDLER_LIST_DEFINE(thread_ctor); EVENTHANDLER_LIST_DEFINE(thread_dtor); EVENTHANDLER_LIST_DEFINE(thread_init); EVENTHANDLER_LIST_DEFINE(thread_fini); static lwpid_t tid_alloc(void) { lwpid_t tid; tid = alloc_unr(tid_unrhdr); if (tid != -1) return (tid); mtx_lock(&tid_lock); if (tid_head == tid_tail) { mtx_unlock(&tid_lock); return (-1); } tid = tid_buffer[tid_head]; tid_head = (tid_head + 1) % TID_BUFFER_SIZE; mtx_unlock(&tid_lock); return (tid); } static void tid_free(lwpid_t tid) { lwpid_t tmp_tid = -1; mtx_lock(&tid_lock); if ((tid_tail + 1) % TID_BUFFER_SIZE == tid_head) { tmp_tid = tid_buffer[tid_head]; tid_head = (tid_head + 1) % TID_BUFFER_SIZE; } tid_buffer[tid_tail] = tid; tid_tail = (tid_tail + 1) % TID_BUFFER_SIZE; mtx_unlock(&tid_lock); if (tmp_tid != -1) free_unr(tid_unrhdr, tmp_tid); } /* * Prepare a thread for use. 
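 * thread_ctor()/thread_dtor() run on every allocation from and release to
 * the UMA zone, while thread_init()/thread_fini() run only when the zone
 * creates or destroys the backing item, so the sleep queue, turnstile and
 * umtx state set up in thread_init() remain type-stable across reuse.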
*/ static int thread_ctor(void *mem, int size, void *arg, int flags) { struct thread *td; td = (struct thread *)mem; td->td_state = TDS_INACTIVE; td->td_lastcpu = td->td_oncpu = NOCPU; td->td_tid = tid_alloc(); /* * Note that td_critnest begins life as 1 because the thread is not * running and is thereby implicitly waiting to be on the receiving * end of a context switch. */ td->td_critnest = 1; td->td_lend_user_pri = PRI_MAX; EVENTHANDLER_DIRECT_INVOKE(thread_ctor, td); #ifdef AUDIT audit_thread_alloc(td); #endif umtx_thread_alloc(td); return (0); } /* * Reclaim a thread after use. */ static void thread_dtor(void *mem, int size, void *arg) { struct thread *td; td = (struct thread *)mem; #ifdef INVARIANTS /* Verify that this thread is in a safe state to free. */ switch (td->td_state) { case TDS_INHIBITED: case TDS_RUNNING: case TDS_CAN_RUN: case TDS_RUNQ: /* * We must never unlink a thread that is in one of * these states, because it is currently active. */ panic("bad state for thread unlinking"); /* NOTREACHED */ case TDS_INACTIVE: break; default: panic("bad thread state"); /* NOTREACHED */ } #endif #ifdef AUDIT audit_thread_free(td); #endif /* Free all OSD associated to this thread. */ osd_thread_exit(td); td_softdep_cleanup(td); MPASS(td->td_su == NULL); EVENTHANDLER_DIRECT_INVOKE(thread_dtor, td); tid_free(td->td_tid); } /* * Initialize type-stable parts of a thread (when newly created). */ static int thread_init(void *mem, int size, int flags) { struct thread *td; td = (struct thread *)mem; td->td_sleepqueue = sleepq_alloc(); td->td_turnstile = turnstile_alloc(); td->td_rlqe = NULL; EVENTHANDLER_DIRECT_INVOKE(thread_init, td); umtx_thread_init(td); epoch_thread_init(td); td->td_kstack = 0; td->td_sel = NULL; return (0); } /* * Tear down type-stable parts of a thread (just before being discarded). */ static void thread_fini(void *mem, int size) { struct thread *td; td = (struct thread *)mem; EVENTHANDLER_DIRECT_INVOKE(thread_fini, td); rlqentry_free(td->td_rlqe); turnstile_free(td->td_turnstile); sleepq_free(td->td_sleepqueue); umtx_thread_fini(td); epoch_thread_fini(td); seltdfini(td); } /* * For a newly created process, * link up all the structures and its initial threads etc. * called from: * {arch}/{arch}/machdep.c {arch}_init(), init386() etc. * proc_dtor() (should go away) * proc_init() */ void proc_linkup0(struct proc *p, struct thread *td) { TAILQ_INIT(&p->p_threads); /* all threads in proc */ proc_linkup(p, td); } void proc_linkup(struct proc *p, struct thread *td) { sigqueue_init(&p->p_sigqueue, p); p->p_ksi = ksiginfo_alloc(1); if (p->p_ksi != NULL) { /* XXX p_ksi may be null if ksiginfo zone is not ready */ p->p_ksi->ksi_flags = KSI_EXT | KSI_INS; } LIST_INIT(&p->p_mqnotifier); p->p_numthreads = 0; thread_link(td, p); } /* * Initialize global thread allocation resources. */ void threadinit(void) { mtx_init(&tid_lock, "TID lock", NULL, MTX_DEF); /* * pid_max cannot be greater than PID_MAX. * leave one number for thread0. */ tid_unrhdr = new_unrhdr(PID_MAX + 2, INT_MAX, &tid_lock); thread_zone = uma_zcreate("THREAD", sched_sizeof_thread(), thread_ctor, thread_dtor, thread_init, thread_fini, 32 - 1, UMA_ZONE_NOFREE); tidhashtbl = hashinit(maxproc / 2, M_TIDHASH, &tidhash); rw_init(&tidhash_lock, "tidhash"); } /* * Place an unused thread on the zombie list. * Use the slpq as that must be unused by now. 
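 * An exiting thread cannot free itself while it is still running on its
 * own kernel stack, so it is parked here and reclaimed later by
 * thread_reap(), which thread_alloc() invokes before handing out a new
 * thread.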
*/ void thread_zombie(struct thread *td) { mtx_lock_spin(&zombie_lock); TAILQ_INSERT_HEAD(&zombie_threads, td, td_slpq); mtx_unlock_spin(&zombie_lock); } /* * Release a thread that has exited after cpu_throw(). */ void thread_stash(struct thread *td) { atomic_subtract_rel_int(&td->td_proc->p_exitthreads, 1); thread_zombie(td); } /* * Reap zombie resources. */ void thread_reap(void) { struct thread *td_first, *td_next; /* * Don't even bother to lock if none at this instant, * we really don't care about the next instant. */ if (!TAILQ_EMPTY(&zombie_threads)) { mtx_lock_spin(&zombie_lock); td_first = TAILQ_FIRST(&zombie_threads); if (td_first) TAILQ_INIT(&zombie_threads); mtx_unlock_spin(&zombie_lock); while (td_first) { td_next = TAILQ_NEXT(td_first, td_slpq); thread_cow_free(td_first); thread_free(td_first); td_first = td_next; } } } /* * Allocate a thread. */ struct thread * thread_alloc(int pages) { struct thread *td; thread_reap(); /* check if any zombies to get */ td = (struct thread *)uma_zalloc(thread_zone, M_WAITOK); KASSERT(td->td_kstack == 0, ("thread_alloc got thread with kstack")); if (!vm_thread_new(td, pages)) { uma_zfree(thread_zone, td); return (NULL); } cpu_thread_alloc(td); return (td); } int thread_alloc_stack(struct thread *td, int pages) { KASSERT(td->td_kstack == 0, ("thread_alloc_stack called on a thread with kstack")); if (!vm_thread_new(td, pages)) return (0); cpu_thread_alloc(td); return (1); } /* * Deallocate a thread. */ void thread_free(struct thread *td) { lock_profile_thread_exit(td); if (td->td_cpuset) cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_free(td); if (td->td_kstack != 0) vm_thread_dispose(td); callout_drain(&td->td_slpcallout); uma_zfree(thread_zone, td); } void thread_cow_get_proc(struct thread *newtd, struct proc *p) { PROC_LOCK_ASSERT(p, MA_OWNED); newtd->td_ucred = crhold(p->p_ucred); newtd->td_limit = lim_hold(p->p_limit); newtd->td_cowgen = p->p_cowgen; } void thread_cow_get(struct thread *newtd, struct thread *td) { newtd->td_ucred = crhold(td->td_ucred); newtd->td_limit = lim_hold(td->td_limit); newtd->td_cowgen = td->td_cowgen; } void thread_cow_free(struct thread *td) { if (td->td_ucred != NULL) crfree(td->td_ucred); if (td->td_limit != NULL) lim_free(td->td_limit); } void thread_cow_update(struct thread *td) { struct proc *p; struct ucred *oldcred; struct plimit *oldlimit; p = td->td_proc; oldcred = NULL; oldlimit = NULL; PROC_LOCK(p); if (td->td_ucred != p->p_ucred) { oldcred = td->td_ucred; td->td_ucred = crhold(p->p_ucred); } if (td->td_limit != p->p_limit) { oldlimit = td->td_limit; td->td_limit = lim_hold(p->p_limit); } td->td_cowgen = p->p_cowgen; PROC_UNLOCK(p); if (oldcred != NULL) crfree(oldcred); if (oldlimit != NULL) lim_free(oldlimit); } /* * Discard the current thread and exit from its context. * Always called with scheduler locked. * * Because we can't free a thread while we're operating under its context, * push the current thread into our CPU's deadthread holder. This means * we needn't worry about someone else grabbing our context before we * do a cpu_throw(). 
*/ void thread_exit(void) { uint64_t runtime, new_switchtime; struct thread *td; struct thread *td2; struct proc *p; int wakeup_swapper; td = curthread; p = td->td_proc; PROC_SLOCK_ASSERT(p, MA_OWNED); mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); KASSERT(p != NULL, ("thread exiting without a process")); CTR3(KTR_PROC, "thread_exit: thread %p (pid %ld, %s)", td, (long)p->p_pid, td->td_name); SDT_PROBE0(proc, , , lwp__exit); KASSERT(TAILQ_EMPTY(&td->td_sigqueue.sq_list), ("signal pending")); /* * drop FPU & debug register state storage, or any other * architecture specific resources that * would not be on a new untouched process. */ cpu_thread_exit(td); /* * The last thread is left attached to the process * So that the whole bundle gets recycled. Skip * all this stuff if we never had threads. * EXIT clears all sign of other threads when * it goes to single threading, so the last thread always * takes the short path. */ if (p->p_flag & P_HADTHREADS) { if (p->p_numthreads > 1) { atomic_add_int(&td->td_proc->p_exitthreads, 1); thread_unlink(td); td2 = FIRST_THREAD_IN_PROC(p); sched_exit_thread(td2, td); /* * The test below is NOT true if we are the * sole exiting thread. P_STOPPED_SINGLE is unset * in exit1() after it is the only survivor. */ if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { if (p->p_numthreads == p->p_suspcount) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one( p->p_singlethread, p, false); thread_unlock(p->p_singlethread); if (wakeup_swapper) kick_proc0(); } } PCPU_SET(deadthread, td); } else { /* * The last thread is exiting.. but not through exit() */ panic ("thread_exit: Last thread exiting on its own"); } } #ifdef HWPMC_HOOKS /* * If this thread is part of a process that is being tracked by hwpmc(4), * inform the module of the thread's impending exit. */ if (PMC_PROC_IS_USING_PMCS(td->td_proc)) { PMC_SWITCH_CONTEXT(td, PMC_FN_CSW_OUT); PMC_CALL_HOOK_UNLOCKED(td, PMC_FN_THR_EXIT, NULL); } else if (PMC_SYSTEM_SAMPLING_ACTIVE()) PMC_CALL_HOOK_UNLOCKED(td, PMC_FN_THR_EXIT_LOG, NULL); #endif PROC_UNLOCK(p); PROC_STATLOCK(p); thread_lock(td); PROC_SUNLOCK(p); /* Do the same timestamp bookkeeping that mi_switch() would do. */ new_switchtime = cpu_ticks(); runtime = new_switchtime - PCPU_GET(switchtime); td->td_runtime += runtime; td->td_incruntime += runtime; PCPU_SET(switchtime, new_switchtime); PCPU_SET(switchticks, ticks); VM_CNT_INC(v_swtch); /* Save our resource usage in our process. */ td->td_ru.ru_nvcsw++; ruxagg(p, td); rucollect(&p->p_ru, &td->td_ru); PROC_STATUNLOCK(p); td->td_state = TDS_INACTIVE; #ifdef WITNESS witness_thread_exit(td); #endif CTR1(KTR_PROC, "thread_exit: cpu_throw() thread %p", td); sched_throw(td); panic("I'm a teapot!"); /* NOTREACHED */ } /* * Do any thread specific cleanups that may be needed in wait() * called with Giant, proc and schedlock not held. */ void thread_wait(struct proc *p) { struct thread *td; mtx_assert(&Giant, MA_NOTOWNED); KASSERT(p->p_numthreads == 1, ("multiple threads in thread_wait()")); KASSERT(p->p_exitthreads == 0, ("p_exitthreads leaking")); td = FIRST_THREAD_IN_PROC(p); /* Lock the last thread so we spin until it exits cpu_throw(). */ thread_lock(td); thread_unlock(td); lock_profile_thread_exit(td); cpuset_rel(td->td_cpuset); td->td_cpuset = NULL; cpu_thread_clean(td); thread_cow_free(td); callout_drain(&td->td_slpcallout); thread_reap(); /* check for zombie threads etc. */ } /* * Link a thread to a process. 
* set up anything that needs to be initialized for it to * be used by the process. */ void thread_link(struct thread *td, struct proc *p) { /* * XXX This can't be enabled because it's called for proc0 before * its lock has been created. * PROC_LOCK_ASSERT(p, MA_OWNED); */ td->td_state = TDS_INACTIVE; td->td_proc = p; td->td_flags = TDF_INMEM; LIST_INIT(&td->td_contested); LIST_INIT(&td->td_lprof[0]); LIST_INIT(&td->td_lprof[1]); sigqueue_init(&td->td_sigqueue, p); callout_init(&td->td_slpcallout, 1); TAILQ_INSERT_TAIL(&p->p_threads, td, td_plist); p->p_numthreads++; } /* * Called from: * thread_exit() */ void thread_unlink(struct thread *td) { struct proc *p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); TAILQ_REMOVE(&p->p_threads, td, td_plist); p->p_numthreads--; /* could clear a few other things here */ /* Must NOT clear links to proc! */ } static int calc_remaining(struct proc *p, int mode) { int remaining; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); if (mode == SINGLE_EXIT) remaining = p->p_numthreads; else if (mode == SINGLE_BOUNDARY) remaining = p->p_numthreads - p->p_boundary_count; else if (mode == SINGLE_NO_EXIT || mode == SINGLE_ALLPROC) remaining = p->p_numthreads - p->p_suspcount; else panic("calc_remaining: wrong mode %d", mode); return (remaining); } static int remain_for_mode(int mode) { return (mode == SINGLE_ALLPROC ? 0 : 1); } static int weed_inhib(int mode, struct thread *td2, struct proc *p) { int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); THREAD_LOCK_ASSERT(td2, MA_OWNED); wakeup_swapper = 0; switch (mode) { case SINGLE_EXIT: if (TD_IS_SUSPENDED(td2)) wakeup_swapper |= thread_unsuspend_one(td2, p, true); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) wakeup_swapper |= sleepq_abort(td2, EINTR); break; case SINGLE_BOUNDARY: case SINGLE_NO_EXIT: if (TD_IS_SUSPENDED(td2) && (td2->td_flags & TDF_BOUNDARY) == 0) wakeup_swapper |= thread_unsuspend_one(td2, p, false); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) wakeup_swapper |= sleepq_abort(td2, ERESTART); break; case SINGLE_ALLPROC: /* * ALLPROC suspend tries to avoid spurious EINTR for * threads sleeping interruptable, by suspending the * thread directly, similarly to sig_suspend_threads(). * Since such sleep is not performed at the user * boundary, TDF_BOUNDARY flag is not set, and TDF_ALLPROCSUSP * is used to avoid immediate un-suspend. */ if (TD_IS_SUSPENDED(td2) && (td2->td_flags & (TDF_BOUNDARY | TDF_ALLPROCSUSP)) == 0) wakeup_swapper |= thread_unsuspend_one(td2, p, false); if (TD_ON_SLEEPQ(td2) && (td2->td_flags & TDF_SINTR) != 0) { if ((td2->td_flags & TDF_SBDRY) == 0) { thread_suspend_one(td2); td2->td_flags |= TDF_ALLPROCSUSP; } else { wakeup_swapper |= sleepq_abort(td2, ERESTART); } } break; } return (wakeup_swapper); } /* * Enforce single-threading. * * Returns 1 if the caller must abort (another thread is waiting to * exit the process or similar). Process is locked! * Returns 0 when you are successfully the only thread running. * A process has successfully single threaded in the suspend mode when * There are no threads in user mode. Threads in the kernel must be * allowed to continue until they get to the user boundary. They may even * copy out their return values and data before suspending. They may however be * accelerated in reaching the user boundary as we will wake up * any sleeping threads that are interruptable. (PCATCH). 
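 *
 * A caller therefore looks roughly like the following sketch (the exact
 * mode and error handling vary by call site):
 *
 *	PROC_LOCK(p);
 *	if (thread_single(p, SINGLE_BOUNDARY) != 0) {
 *		PROC_UNLOCK(p);
 *		return (ERESTART);	<- another thread is already
 *					   single-threading the process
 *	}
 *	... work while the process is single-threaded ...
 *	thread_single_end(p, SINGLE_BOUNDARY);
 *	PROC_UNLOCK(p);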
*/ int thread_single(struct proc *p, int mode) { struct thread *td; struct thread *td2; int remaining, wakeup_swapper; td = curthread; KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY || mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT, ("invalid mode %d", mode)); /* * If allowing non-ALLPROC singlethreading for non-curproc * callers, calc_remaining() and remain_for_mode() should be * adjusted to also account for td->td_proc != p. For now * this is not implemented because it is not used. */ KASSERT((mode == SINGLE_ALLPROC && td->td_proc != p) || (mode != SINGLE_ALLPROC && td->td_proc == p), ("mode %d proc %p curproc %p", mode, p, td->td_proc)); mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); if ((p->p_flag & P_HADTHREADS) == 0 && mode != SINGLE_ALLPROC) return (0); /* Is someone already single threading? */ if (p->p_singlethread != NULL && p->p_singlethread != td) return (1); if (mode == SINGLE_EXIT) { p->p_flag |= P_SINGLE_EXIT; p->p_flag &= ~P_SINGLE_BOUNDARY; } else { p->p_flag &= ~P_SINGLE_EXIT; if (mode == SINGLE_BOUNDARY) p->p_flag |= P_SINGLE_BOUNDARY; else p->p_flag &= ~P_SINGLE_BOUNDARY; } if (mode == SINGLE_ALLPROC) p->p_flag |= P_TOTAL_STOP; p->p_flag |= P_STOPPED_SINGLE; PROC_SLOCK(p); p->p_singlethread = td; remaining = calc_remaining(p, mode); while (remaining != remain_for_mode(mode)) { if (P_SHOULDSTOP(p) != P_STOPPED_SINGLE) goto stopme; wakeup_swapper = 0; FOREACH_THREAD_IN_PROC(p, td2) { if (td2 == td) continue; thread_lock(td2); td2->td_flags |= TDF_ASTPENDING | TDF_NEEDSUSPCHK; if (TD_IS_INHIBITED(td2)) { wakeup_swapper |= weed_inhib(mode, td2, p); #ifdef SMP } else if (TD_IS_RUNNING(td2) && td != td2) { forward_signal(td2); #endif } thread_unlock(td2); } if (wakeup_swapper) kick_proc0(); remaining = calc_remaining(p, mode); /* * Maybe we suspended some threads.. was it enough? */ if (remaining == remain_for_mode(mode)) break; stopme: /* * Wake us up when everyone else has suspended. * In the mean time we suspend as well. */ thread_suspend_switch(td, p); remaining = calc_remaining(p, mode); } if (mode == SINGLE_EXIT) { /* * Convert the process to an unthreaded process. The * SINGLE_EXIT is called by exit1() or execve(), in * both cases other threads must be retired. */ KASSERT(p->p_numthreads == 1, ("Unthreading with >1 threads")); p->p_singlethread = NULL; p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_HADTHREADS); /* * Wait for any remaining threads to exit cpu_throw(). */ while (p->p_exitthreads != 0) { PROC_SUNLOCK(p); PROC_UNLOCK(p); sched_relinquish(td); PROC_LOCK(p); PROC_SLOCK(p); } } else if (mode == SINGLE_BOUNDARY) { /* * Wait until all suspended threads are removed from * the processors. The thread_suspend_check() * increments p_boundary_count while it is still * running, which makes it possible for the execve() * to destroy vmspace while our other threads are * still using the address space. * * We lock the thread, which is only allowed to * succeed after context switch code finished using * the address space. 
*/ FOREACH_THREAD_IN_PROC(p, td2) { if (td2 == td) continue; thread_lock(td2); KASSERT((td2->td_flags & TDF_BOUNDARY) != 0, ("td %p not on boundary", td2)); KASSERT(TD_IS_SUSPENDED(td2), ("td %p is not suspended", td2)); thread_unlock(td2); } } PROC_SUNLOCK(p); return (0); } bool thread_suspend_check_needed(void) { struct proc *p; struct thread *td; td = curthread; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); return (P_SHOULDSTOP(p) || ((p->p_flag & P_TRACED) != 0 && (td->td_dbgflags & TDB_SUSPEND) != 0)); } /* * Called in from locations that can safely check to see * whether we have to suspend or at least throttle for a * single-thread event (e.g. fork). * * Such locations include userret(). * If the "return_instead" argument is non zero, the thread must be able to * accept 0 (caller may continue), or 1 (caller must abort) as a result. * * The 'return_instead' argument tells the function if it may do a * thread_exit() or suspend, or whether the caller must abort and back * out instead. * * If the thread that set the single_threading request has set the * P_SINGLE_EXIT bit in the process flags then this call will never return * if 'return_instead' is false, but will exit. * * P_SINGLE_EXIT | return_instead == 0| return_instead != 0 *---------------+--------------------+--------------------- * 0 | returns 0 | returns 0 or 1 * | when ST ends | immediately *---------------+--------------------+--------------------- * 1 | thread exits | returns 1 * | | immediately * 0 = thread_exit() or suspension ok, * other = return error instead of stopping the thread. * * While a full suspension is under effect, even a single threading * thread would be suspended if it made this call (but it shouldn't). * This call should only be made from places where * thread_exit() would be safe as that may be the outcome unless * return_instead is set. */ int thread_suspend_check(int return_instead) { struct thread *td; struct proc *p; int wakeup_swapper; td = curthread; p = td->td_proc; mtx_assert(&Giant, MA_NOTOWNED); PROC_LOCK_ASSERT(p, MA_OWNED); while (thread_suspend_check_needed()) { if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { KASSERT(p->p_singlethread != NULL, ("singlethread not set")); /* * The only suspension in action is a * single-threading. Single threader need not stop. * It is safe to access p->p_singlethread unlocked * because it can only be set to our address by us. */ if (p->p_singlethread == td) return (0); /* Exempt from stopping. */ } if ((p->p_flag & P_SINGLE_EXIT) && return_instead) return (EINTR); /* Should we goto user boundary if we didn't come from there? */ if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE && (p->p_flag & P_SINGLE_BOUNDARY) && return_instead) return (ERESTART); /* * Ignore suspend requests if they are deferred. */ if ((td->td_flags & TDF_SBDRY) != 0) { KASSERT(return_instead, ("TDF_SBDRY set for unsafe thread_suspend_check")); KASSERT((td->td_flags & (TDF_SEINTR | TDF_SERESTART)) != (TDF_SEINTR | TDF_SERESTART), ("both TDF_SEINTR and TDF_SERESTART")); return (TD_SBDRY_INTR(td) ? TD_SBDRY_ERRNO(td) : 0); } /* * If the process is waiting for us to exit, * this thread should just suicide. * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE. */ if ((p->p_flag & P_SINGLE_EXIT) && (p->p_singlethread != td)) { PROC_UNLOCK(p); /* * Allow Linux emulation layer to do some work * before thread suicide. 
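 * (for example, clearing and futex-waking a tid address the emulated
 * program registered via set_tid_address(2), so a joining thread is not
 * left waiting forever).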
*/ if (__predict_false(p->p_sysent->sv_thread_detach != NULL)) (p->p_sysent->sv_thread_detach)(td); umtx_thread_exit(td); kern_thr_exit(td); panic("stopped thread did not exit"); } PROC_SLOCK(p); thread_stopped(p); if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE) { if (p->p_numthreads == p->p_suspcount + 1) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one( p->p_singlethread, p, false); thread_unlock(p->p_singlethread); if (wakeup_swapper) kick_proc0(); } } PROC_UNLOCK(p); thread_lock(td); /* * When a thread suspends, it just * gets taken off all queues. */ thread_suspend_one(td); if (return_instead == 0) { p->p_boundary_count++; td->td_flags |= TDF_BOUNDARY; } PROC_SUNLOCK(p); mi_switch(SW_INVOL | SWT_SUSPEND, NULL); thread_unlock(td); PROC_LOCK(p); } return (0); } void thread_suspend_switch(struct thread *td, struct proc *p) { KASSERT(!TD_IS_SUSPENDED(td), ("already suspended")); PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); /* * We implement thread_suspend_one in stages here to avoid * dropping the proc lock while the thread lock is owned. */ if (p == td->td_proc) { thread_stopped(p); p->p_suspcount++; } PROC_UNLOCK(p); thread_lock(td); td->td_flags &= ~TDF_NEEDSUSPCHK; TD_SET_SUSPENDED(td); sched_sleep(td, 0); PROC_SUNLOCK(p); DROP_GIANT(); mi_switch(SW_VOL | SWT_SUSPEND, NULL); thread_unlock(td); PICKUP_GIANT(); PROC_LOCK(p); PROC_SLOCK(p); } void thread_suspend_one(struct thread *td) { struct proc *p; p = td->td_proc; PROC_SLOCK_ASSERT(p, MA_OWNED); THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(!TD_IS_SUSPENDED(td), ("already suspended")); p->p_suspcount++; td->td_flags &= ~TDF_NEEDSUSPCHK; TD_SET_SUSPENDED(td); sched_sleep(td, 0); } static int thread_unsuspend_one(struct thread *td, struct proc *p, bool boundary) { THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(TD_IS_SUSPENDED(td), ("Thread not suspended")); TD_CLR_SUSPENDED(td); td->td_flags &= ~TDF_ALLPROCSUSP; if (td->td_proc == p) { PROC_SLOCK_ASSERT(p, MA_OWNED); p->p_suspcount--; if (boundary && (td->td_flags & TDF_BOUNDARY) != 0) { td->td_flags &= ~TDF_BOUNDARY; p->p_boundary_count--; } } return (setrunnable(td)); } /* * Allow all threads blocked by single threading to continue running. */ void thread_unsuspend(struct proc *p) { struct thread *td; int wakeup_swapper; PROC_LOCK_ASSERT(p, MA_OWNED); PROC_SLOCK_ASSERT(p, MA_OWNED); wakeup_swapper = 0; if (!P_SHOULDSTOP(p)) { FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (TD_IS_SUSPENDED(td)) { wakeup_swapper |= thread_unsuspend_one(td, p, true); } thread_unlock(td); } } else if (P_SHOULDSTOP(p) == P_STOPPED_SINGLE && p->p_numthreads == p->p_suspcount) { /* * Stopping everything also did the job for the single * threading request. Now we've downgraded to single-threaded, * let it continue. */ if (p->p_singlethread->td_proc == p) { thread_lock(p->p_singlethread); wakeup_swapper = thread_unsuspend_one( p->p_singlethread, p, false); thread_unlock(p->p_singlethread); } } if (wakeup_swapper) kick_proc0(); } /* * End the single threading mode.. 
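 * The mode must match the one previously passed to thread_single(); the
 * single-threading flags are cleared and any siblings still suspended at
 * the boundary are allowed to run again, unless a blanket stop is still
 * in effect for the process.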
*/ void thread_single_end(struct proc *p, int mode) { struct thread *td; int wakeup_swapper; KASSERT(mode == SINGLE_EXIT || mode == SINGLE_BOUNDARY || mode == SINGLE_ALLPROC || mode == SINGLE_NO_EXIT, ("invalid mode %d", mode)); PROC_LOCK_ASSERT(p, MA_OWNED); KASSERT((mode == SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) != 0) || (mode != SINGLE_ALLPROC && (p->p_flag & P_TOTAL_STOP) == 0), ("mode %d does not match P_TOTAL_STOP", mode)); KASSERT(mode == SINGLE_ALLPROC || p->p_singlethread == curthread, ("thread_single_end from other thread %p %p", curthread, p->p_singlethread)); KASSERT(mode != SINGLE_BOUNDARY || (p->p_flag & P_SINGLE_BOUNDARY) != 0, ("mis-matched SINGLE_BOUNDARY flags %x", p->p_flag)); p->p_flag &= ~(P_STOPPED_SINGLE | P_SINGLE_EXIT | P_SINGLE_BOUNDARY | P_TOTAL_STOP); PROC_SLOCK(p); p->p_singlethread = NULL; wakeup_swapper = 0; /* * If there are other threads they may now run, * unless of course there is a blanket 'stop order' * on the process. The single threader must be allowed * to continue however as this is a bad place to stop. */ if (p->p_numthreads != remain_for_mode(mode) && !P_SHOULDSTOP(p)) { FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); if (TD_IS_SUSPENDED(td)) { wakeup_swapper |= thread_unsuspend_one(td, p, mode == SINGLE_BOUNDARY); } thread_unlock(td); } } KASSERT(mode != SINGLE_BOUNDARY || p->p_boundary_count == 0, ("inconsistent boundary count %d", p->p_boundary_count)); PROC_SUNLOCK(p); if (wakeup_swapper) kick_proc0(); } struct thread * thread_find(struct proc *p, lwpid_t tid) { struct thread *td; PROC_LOCK_ASSERT(p, MA_OWNED); FOREACH_THREAD_IN_PROC(p, td) { if (td->td_tid == tid) break; } return (td); } /* Locate a thread by number; return with proc lock held. */ struct thread * tdfind(lwpid_t tid, pid_t pid) { #define RUN_THRESH 16 struct thread *td; int run = 0; rw_rlock(&tidhash_lock); LIST_FOREACH(td, TIDHASH(tid), td_hash) { if (td->td_tid == tid) { if (pid != -1 && td->td_proc->p_pid != pid) { td = NULL; break; } PROC_LOCK(td->td_proc); if (td->td_proc->p_state == PRS_NEW) { PROC_UNLOCK(td->td_proc); td = NULL; break; } if (run > RUN_THRESH) { if (rw_try_upgrade(&tidhash_lock)) { LIST_REMOVE(td, td_hash); LIST_INSERT_HEAD(TIDHASH(td->td_tid), td, td_hash); rw_wunlock(&tidhash_lock); return (td); } } break; } run++; } rw_runlock(&tidhash_lock); return (td); } void tidhash_add(struct thread *td) { rw_wlock(&tidhash_lock); LIST_INSERT_HEAD(TIDHASH(td->td_tid), td, td_hash); rw_wunlock(&tidhash_lock); } void tidhash_remove(struct thread *td) { rw_wlock(&tidhash_lock); LIST_REMOVE(td, td_hash); rw_wunlock(&tidhash_lock); } Index: user/ngie/bug-237403/sys/kern/systrace_args.c =================================================================== --- user/ngie/bug-237403/sys/kern/systrace_args.c (revision 346925) +++ user/ngie/bug-237403/sys/kern/systrace_args.c (revision 346926) @@ -1,10748 +1,10748 @@ /* * System call argument to DTrace register array converstion. * * DO NOT EDIT-- this file is automatically generated. * $FreeBSD$ * This file is part of the DTrace syscall provider. 
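 * Each case below copies one syscall's arguments into the uarg[] array
 * (signed values through the overlaid iarg[] alias, pointer arguments
 * cast through intptr_t) and stores the argument count in *n_args for
 * the DTrace syscall provider.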
*/ static void systrace_args(int sysnum, void *params, uint64_t *uarg, int *n_args) { int64_t *iarg = (int64_t *) uarg; switch (sysnum) { /* nosys */ case 0: { *n_args = 0; break; } /* sys_exit */ case 1: { struct sys_exit_args *p = params; iarg[0] = p->rval; /* int */ *n_args = 1; break; } /* fork */ case 2: { *n_args = 0; break; } /* read */ case 3: { struct read_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->nbyte; /* size_t */ *n_args = 3; break; } /* write */ case 4: { struct write_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->nbyte; /* size_t */ *n_args = 3; break; } /* open */ case 5: { struct open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* close */ case 6: { struct close_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* wait4 */ case 7: { struct wait4_args *p = params; iarg[0] = p->pid; /* int */ uarg[1] = (intptr_t) p->status; /* int * */ iarg[2] = p->options; /* int */ uarg[3] = (intptr_t) p->rusage; /* struct rusage * */ *n_args = 4; break; } /* link */ case 9: { struct link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->link; /* const char * */ *n_args = 2; break; } /* unlink */ case 10: { struct unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* chdir */ case 12: { struct chdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* fchdir */ case 13: { struct fchdir_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* chmod */ case 15: { struct chmod_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* chown */ case 16: { struct chown_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* break */ case 17: { struct break_args *p = params; uarg[0] = (intptr_t) p->nsize; /* char * */ *n_args = 1; break; } /* getpid */ case 20: { *n_args = 0; break; } /* mount */ case 21: { struct mount_args *p = params; uarg[0] = (intptr_t) p->type; /* const char * */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flags; /* int */ uarg[3] = (intptr_t) p->data; /* void * */ *n_args = 4; break; } /* unmount */ case 22: { struct unmount_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* setuid */ case 23: { struct setuid_args *p = params; uarg[0] = p->uid; /* uid_t */ *n_args = 1; break; } /* getuid */ case 24: { *n_args = 0; break; } /* geteuid */ case 25: { *n_args = 0; break; } /* ptrace */ case 26: { struct ptrace_args *p = params; iarg[0] = p->req; /* int */ iarg[1] = p->pid; /* pid_t */ uarg[2] = (intptr_t) p->addr; /* caddr_t */ iarg[3] = p->data; /* int */ *n_args = 4; break; } /* recvmsg */ case 27: { struct recvmsg_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->msg; /* struct msghdr * */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* sendmsg */ case 28: { struct sendmsg_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->msg; /* struct msghdr * */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* recvfrom */ case 29: { struct recvfrom_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->buf; /* 
void * */ uarg[2] = p->len; /* size_t */ iarg[3] = p->flags; /* int */ uarg[4] = (intptr_t) p->from; /* struct sockaddr * */ uarg[5] = (intptr_t) p->fromlenaddr; /* __socklen_t * */ *n_args = 6; break; } /* accept */ case 30: { struct accept_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* struct sockaddr * */ uarg[2] = (intptr_t) p->anamelen; /* __socklen_t * */ *n_args = 3; break; } /* getpeername */ case 31: { struct getpeername_args *p = params; iarg[0] = p->fdes; /* int */ uarg[1] = (intptr_t) p->asa; /* struct sockaddr * */ uarg[2] = (intptr_t) p->alen; /* __socklen_t * */ *n_args = 3; break; } /* getsockname */ case 32: { struct getsockname_args *p = params; iarg[0] = p->fdes; /* int */ uarg[1] = (intptr_t) p->asa; /* struct sockaddr * */ uarg[2] = (intptr_t) p->alen; /* __socklen_t * */ *n_args = 3; break; } /* access */ case 33: { struct access_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->amode; /* int */ *n_args = 2; break; } /* chflags */ case 34: { struct chflags_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* fchflags */ case 35: { struct fchflags_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* sync */ case 36: { *n_args = 0; break; } /* kill */ case 37: { struct kill_args *p = params; iarg[0] = p->pid; /* int */ iarg[1] = p->signum; /* int */ *n_args = 2; break; } /* getppid */ case 39: { *n_args = 0; break; } /* dup */ case 41: { struct dup_args *p = params; uarg[0] = p->fd; /* u_int */ *n_args = 1; break; } /* getegid */ case 43: { *n_args = 0; break; } /* profil */ case 44: { struct profil_args *p = params; uarg[0] = (intptr_t) p->samples; /* char * */ uarg[1] = p->size; /* size_t */ uarg[2] = p->offset; /* size_t */ uarg[3] = p->scale; /* u_int */ *n_args = 4; break; } /* ktrace */ case 45: { struct ktrace_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ iarg[1] = p->ops; /* int */ iarg[2] = p->facs; /* int */ iarg[3] = p->pid; /* int */ *n_args = 4; break; } /* getgid */ case 47: { *n_args = 0; break; } /* getlogin */ case 49: { struct getlogin_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* char * */ uarg[1] = p->namelen; /* u_int */ *n_args = 2; break; } /* setlogin */ case 50: { struct setlogin_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* const char * */ *n_args = 1; break; } /* acct */ case 51: { struct acct_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* sigaltstack */ case 53: { struct sigaltstack_args *p = params; uarg[0] = (intptr_t) p->ss; /* stack_t * */ uarg[1] = (intptr_t) p->oss; /* stack_t * */ *n_args = 2; break; } /* ioctl */ case 54: { struct ioctl_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->com; /* u_long */ uarg[2] = (intptr_t) p->data; /* char * */ *n_args = 3; break; } /* reboot */ case 55: { struct reboot_args *p = params; iarg[0] = p->opt; /* int */ *n_args = 1; break; } /* revoke */ case 56: { struct revoke_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* symlink */ case 57: { struct symlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->link; /* const char * */ *n_args = 2; break; } /* readlink */ case 58: { struct readlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->count; /* size_t */ *n_args = 3; 
break; } /* execve */ case 59: { struct execve_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = (intptr_t) p->argv; /* char ** */ uarg[2] = (intptr_t) p->envv; /* char ** */ *n_args = 3; break; } /* umask */ case 60: { struct umask_args *p = params; iarg[0] = p->newmask; /* mode_t */ *n_args = 1; break; } /* chroot */ case 61: { struct chroot_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* msync */ case 65: { struct msync_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* vfork */ case 66: { *n_args = 0; break; } /* sbrk */ case 69: { struct sbrk_args *p = params; iarg[0] = p->incr; /* int */ *n_args = 1; break; } /* sstk */ case 70: { struct sstk_args *p = params; iarg[0] = p->incr; /* int */ *n_args = 1; break; } /* munmap */ case 73: { struct munmap_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* mprotect */ case 74: { struct mprotect_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->prot; /* int */ *n_args = 3; break; } /* madvise */ case 75: { struct madvise_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->behav; /* int */ *n_args = 3; break; } /* mincore */ case 78: { struct mincore_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ uarg[2] = (intptr_t) p->vec; /* char * */ *n_args = 3; break; } /* getgroups */ case 79: { struct getgroups_args *p = params; uarg[0] = p->gidsetsize; /* u_int */ uarg[1] = (intptr_t) p->gidset; /* gid_t * */ *n_args = 2; break; } /* setgroups */ case 80: { struct setgroups_args *p = params; uarg[0] = p->gidsetsize; /* u_int */ uarg[1] = (intptr_t) p->gidset; /* gid_t * */ *n_args = 2; break; } /* getpgrp */ case 81: { *n_args = 0; break; } /* setpgid */ case 82: { struct setpgid_args *p = params; iarg[0] = p->pid; /* int */ iarg[1] = p->pgid; /* int */ *n_args = 2; break; } /* setitimer */ case 83: { struct setitimer_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->itv; /* struct itimerval * */ uarg[2] = (intptr_t) p->oitv; /* struct itimerval * */ *n_args = 3; break; } /* swapon */ case 85: { struct swapon_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* getitimer */ case 86: { struct getitimer_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->itv; /* struct itimerval * */ *n_args = 2; break; } /* getdtablesize */ case 89: { *n_args = 0; break; } /* dup2 */ case 90: { struct dup2_args *p = params; uarg[0] = p->from; /* u_int */ uarg[1] = p->to; /* u_int */ *n_args = 2; break; } /* fcntl */ case 92: { struct fcntl_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->cmd; /* int */ iarg[2] = p->arg; /* long */ *n_args = 3; break; } /* select */ case 93: { struct select_args *p = params; iarg[0] = p->nd; /* int */ uarg[1] = (intptr_t) p->in; /* fd_set * */ uarg[2] = (intptr_t) p->ou; /* fd_set * */ uarg[3] = (intptr_t) p->ex; /* fd_set * */ uarg[4] = (intptr_t) p->tv; /* struct timeval * */ *n_args = 5; break; } /* fsync */ case 95: { struct fsync_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* setpriority */ case 96: { struct setpriority_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->who; /* int */ iarg[2] = p->prio; /* int */ *n_args = 
3; break; } /* socket */ case 97: { struct socket_args *p = params; iarg[0] = p->domain; /* int */ iarg[1] = p->type; /* int */ iarg[2] = p->protocol; /* int */ *n_args = 3; break; } /* connect */ case 98: { struct connect_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[2] = p->namelen; /* int */ *n_args = 3; break; } /* getpriority */ case 100: { struct getpriority_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->who; /* int */ *n_args = 2; break; } /* bind */ case 104: { struct bind_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[2] = p->namelen; /* int */ *n_args = 3; break; } /* setsockopt */ case 105: { struct setsockopt_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->level; /* int */ iarg[2] = p->name; /* int */ uarg[3] = (intptr_t) p->val; /* const void * */ iarg[4] = p->valsize; /* int */ *n_args = 5; break; } /* listen */ case 106: { struct listen_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->backlog; /* int */ *n_args = 2; break; } /* gettimeofday */ case 116: { struct gettimeofday_args *p = params; uarg[0] = (intptr_t) p->tp; /* struct timeval * */ uarg[1] = (intptr_t) p->tzp; /* struct timezone * */ *n_args = 2; break; } /* getrusage */ case 117: { struct getrusage_args *p = params; iarg[0] = p->who; /* int */ uarg[1] = (intptr_t) p->rusage; /* struct rusage * */ *n_args = 2; break; } /* getsockopt */ case 118: { struct getsockopt_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->level; /* int */ iarg[2] = p->name; /* int */ uarg[3] = (intptr_t) p->val; /* void * */ uarg[4] = (intptr_t) p->avalsize; /* int * */ *n_args = 5; break; } /* readv */ case 120: { struct readv_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec * */ uarg[2] = p->iovcnt; /* u_int */ *n_args = 3; break; } /* writev */ case 121: { struct writev_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec * */ uarg[2] = p->iovcnt; /* u_int */ *n_args = 3; break; } /* settimeofday */ case 122: { struct settimeofday_args *p = params; uarg[0] = (intptr_t) p->tv; /* struct timeval * */ uarg[1] = (intptr_t) p->tzp; /* struct timezone * */ *n_args = 2; break; } /* fchown */ case 123: { struct fchown_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* fchmod */ case 124: { struct fchmod_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* setreuid */ case 126: { struct setreuid_args *p = params; iarg[0] = p->ruid; /* int */ iarg[1] = p->euid; /* int */ *n_args = 2; break; } /* setregid */ case 127: { struct setregid_args *p = params; iarg[0] = p->rgid; /* int */ iarg[1] = p->egid; /* int */ *n_args = 2; break; } /* rename */ case 128: { struct rename_args *p = params; uarg[0] = (intptr_t) p->from; /* const char * */ uarg[1] = (intptr_t) p->to; /* const char * */ *n_args = 2; break; } /* flock */ case 131: { struct flock_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->how; /* int */ *n_args = 2; break; } /* mkfifo */ case 132: { struct mkfifo_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* sendto */ case 133: { struct sendto_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->len; /* size_t */ iarg[3] = p->flags; /* int */ uarg[4] 
= (intptr_t) p->to; /* const struct sockaddr * */ iarg[5] = p->tolen; /* int */ *n_args = 6; break; } /* shutdown */ case 134: { struct shutdown_args *p = params; iarg[0] = p->s; /* int */ iarg[1] = p->how; /* int */ *n_args = 2; break; } /* socketpair */ case 135: { struct socketpair_args *p = params; iarg[0] = p->domain; /* int */ iarg[1] = p->type; /* int */ iarg[2] = p->protocol; /* int */ uarg[3] = (intptr_t) p->rsv; /* int * */ *n_args = 4; break; } /* mkdir */ case 136: { struct mkdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* rmdir */ case 137: { struct rmdir_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* utimes */ case 138: { struct utimes_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->tptr; /* struct timeval * */ *n_args = 2; break; } /* adjtime */ case 140: { struct adjtime_args *p = params; uarg[0] = (intptr_t) p->delta; /* struct timeval * */ uarg[1] = (intptr_t) p->olddelta; /* struct timeval * */ *n_args = 2; break; } /* setsid */ case 147: { *n_args = 0; break; } /* quotactl */ case 148: { struct quotactl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->cmd; /* int */ iarg[2] = p->uid; /* int */ uarg[3] = (intptr_t) p->arg; /* void * */ *n_args = 4; break; } /* nlm_syscall */ case 154: { struct nlm_syscall_args *p = params; iarg[0] = p->debug_level; /* int */ iarg[1] = p->grace_period; /* int */ iarg[2] = p->addr_count; /* int */ uarg[3] = (intptr_t) p->addrs; /* char ** */ *n_args = 4; break; } /* nfssvc */ case 155: { struct nfssvc_args *p = params; iarg[0] = p->flag; /* int */ uarg[1] = (intptr_t) p->argp; /* void * */ *n_args = 2; break; } /* lgetfh */ case 160: { struct lgetfh_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = (intptr_t) p->fhp; /* struct fhandle * */ *n_args = 2; break; } /* getfh */ case 161: { struct getfh_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = (intptr_t) p->fhp; /* struct fhandle * */ *n_args = 2; break; } /* sysarch */ case 165: { struct sysarch_args *p = params; iarg[0] = p->op; /* int */ uarg[1] = (intptr_t) p->parms; /* char * */ *n_args = 2; break; } /* rtprio */ case 166: { struct rtprio_args *p = params; iarg[0] = p->function; /* int */ iarg[1] = p->pid; /* pid_t */ uarg[2] = (intptr_t) p->rtp; /* struct rtprio * */ *n_args = 3; break; } /* semsys */ case 169: { struct semsys_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->a2; /* int */ iarg[2] = p->a3; /* int */ iarg[3] = p->a4; /* int */ iarg[4] = p->a5; /* int */ *n_args = 5; break; } /* msgsys */ case 170: { struct msgsys_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->a2; /* int */ iarg[2] = p->a3; /* int */ iarg[3] = p->a4; /* int */ iarg[4] = p->a5; /* int */ iarg[5] = p->a6; /* int */ *n_args = 6; break; } /* shmsys */ case 171: { struct shmsys_args *p = params; iarg[0] = p->which; /* int */ iarg[1] = p->a2; /* int */ iarg[2] = p->a3; /* int */ iarg[3] = p->a4; /* int */ *n_args = 4; break; } /* setfib */ case 175: { struct setfib_args *p = params; iarg[0] = p->fibnum; /* int */ *n_args = 1; break; } /* ntp_adjtime */ case 176: { struct ntp_adjtime_args *p = params; uarg[0] = (intptr_t) p->tp; /* struct timex * */ *n_args = 1; break; } /* setgid */ case 181: { struct setgid_args *p = params; iarg[0] = p->gid; /* gid_t */ *n_args = 1; break; } /* setegid */ case 182: { struct setegid_args *p = 
params; iarg[0] = p->egid; /* gid_t */ *n_args = 1; break; } /* seteuid */ case 183: { struct seteuid_args *p = params; uarg[0] = p->euid; /* uid_t */ *n_args = 1; break; } /* pathconf */ case 191: { struct pathconf_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* fpathconf */ case 192: { struct fpathconf_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* getrlimit */ case 194: { struct __getrlimit_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->rlp; /* struct rlimit * */ *n_args = 2; break; } /* setrlimit */ case 195: { struct __setrlimit_args *p = params; uarg[0] = p->which; /* u_int */ uarg[1] = (intptr_t) p->rlp; /* struct rlimit * */ *n_args = 2; break; } /* nosys */ case 198: { *n_args = 0; break; } /* __sysctl */ case 202: { struct sysctl_args *p = params; uarg[0] = (intptr_t) p->name; /* int * */ uarg[1] = p->namelen; /* u_int */ uarg[2] = (intptr_t) p->old; /* void * */ uarg[3] = (intptr_t) p->oldlenp; /* size_t * */ uarg[4] = (intptr_t) p->new; /* const void * */ uarg[5] = p->newlen; /* size_t */ *n_args = 6; break; } /* mlock */ case 203: { struct mlock_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* munlock */ case 204: { struct munlock_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* undelete */ case 205: { struct undelete_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* futimes */ case 206: { struct futimes_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->tptr; /* struct timeval * */ *n_args = 2; break; } /* getpgid */ case 207: { struct getpgid_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* poll */ case 209: { struct poll_args *p = params; uarg[0] = (intptr_t) p->fds; /* struct pollfd * */ uarg[1] = p->nfds; /* u_int */ iarg[2] = p->timeout; /* int */ *n_args = 3; break; } /* lkmnosys */ case 210: { *n_args = 0; break; } /* lkmnosys */ case 211: { *n_args = 0; break; } /* lkmnosys */ case 212: { *n_args = 0; break; } /* lkmnosys */ case 213: { *n_args = 0; break; } /* lkmnosys */ case 214: { *n_args = 0; break; } /* lkmnosys */ case 215: { *n_args = 0; break; } /* lkmnosys */ case 216: { *n_args = 0; break; } /* lkmnosys */ case 217: { *n_args = 0; break; } /* lkmnosys */ case 218: { *n_args = 0; break; } /* lkmnosys */ case 219: { *n_args = 0; break; } /* semget */ case 221: { struct semget_args *p = params; iarg[0] = p->key; /* key_t */ iarg[1] = p->nsems; /* int */ iarg[2] = p->semflg; /* int */ *n_args = 3; break; } /* semop */ case 222: { struct semop_args *p = params; iarg[0] = p->semid; /* int */ uarg[1] = (intptr_t) p->sops; /* struct sembuf * */ uarg[2] = p->nsops; /* size_t */ *n_args = 3; break; } /* msgget */ case 225: { struct msgget_args *p = params; iarg[0] = p->key; /* key_t */ iarg[1] = p->msgflg; /* int */ *n_args = 2; break; } /* msgsnd */ case 226: { struct msgsnd_args *p = params; iarg[0] = p->msqid; /* int */ uarg[1] = (intptr_t) p->msgp; /* const void * */ uarg[2] = p->msgsz; /* size_t */ iarg[3] = p->msgflg; /* int */ *n_args = 4; break; } /* msgrcv */ case 227: { struct msgrcv_args *p = params; iarg[0] = p->msqid; /* int */ uarg[1] = (intptr_t) p->msgp; /* void * */ uarg[2] = p->msgsz; /* size_t */ iarg[3] = p->msgtyp; /* long */ iarg[4] = p->msgflg; /* int */ *n_args = 5; 
break; } /* shmat */ case 228: { struct shmat_args *p = params; iarg[0] = p->shmid; /* int */ uarg[1] = (intptr_t) p->shmaddr; /* const void * */ iarg[2] = p->shmflg; /* int */ *n_args = 3; break; } /* shmdt */ case 230: { struct shmdt_args *p = params; uarg[0] = (intptr_t) p->shmaddr; /* const void * */ *n_args = 1; break; } /* shmget */ case 231: { struct shmget_args *p = params; iarg[0] = p->key; /* key_t */ uarg[1] = p->size; /* size_t */ iarg[2] = p->shmflg; /* int */ *n_args = 3; break; } /* clock_gettime */ case 232: { struct clock_gettime_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* struct timespec * */ *n_args = 2; break; } /* clock_settime */ case 233: { struct clock_settime_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* const struct timespec * */ *n_args = 2; break; } /* clock_getres */ case 234: { struct clock_getres_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->tp; /* struct timespec * */ *n_args = 2; break; } /* ktimer_create */ case 235: { struct ktimer_create_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ uarg[1] = (intptr_t) p->evp; /* struct sigevent * */ uarg[2] = (intptr_t) p->timerid; /* int * */ *n_args = 3; break; } /* ktimer_delete */ case 236: { struct ktimer_delete_args *p = params; iarg[0] = p->timerid; /* int */ *n_args = 1; break; } /* ktimer_settime */ case 237: { struct ktimer_settime_args *p = params; iarg[0] = p->timerid; /* int */ iarg[1] = p->flags; /* int */ uarg[2] = (intptr_t) p->value; /* const struct itimerspec * */ uarg[3] = (intptr_t) p->ovalue; /* struct itimerspec * */ *n_args = 4; break; } /* ktimer_gettime */ case 238: { struct ktimer_gettime_args *p = params; iarg[0] = p->timerid; /* int */ uarg[1] = (intptr_t) p->value; /* struct itimerspec * */ *n_args = 2; break; } /* ktimer_getoverrun */ case 239: { struct ktimer_getoverrun_args *p = params; iarg[0] = p->timerid; /* int */ *n_args = 1; break; } /* nanosleep */ case 240: { struct nanosleep_args *p = params; uarg[0] = (intptr_t) p->rqtp; /* const struct timespec * */ uarg[1] = (intptr_t) p->rmtp; /* struct timespec * */ *n_args = 2; break; } /* ffclock_getcounter */ case 241: { struct ffclock_getcounter_args *p = params; uarg[0] = (intptr_t) p->ffcount; /* ffcounter * */ *n_args = 1; break; } /* ffclock_setestimate */ case 242: { struct ffclock_setestimate_args *p = params; uarg[0] = (intptr_t) p->cest; /* struct ffclock_estimate * */ *n_args = 1; break; } /* ffclock_getestimate */ case 243: { struct ffclock_getestimate_args *p = params; uarg[0] = (intptr_t) p->cest; /* struct ffclock_estimate * */ *n_args = 1; break; } /* clock_nanosleep */ case 244: { struct clock_nanosleep_args *p = params; iarg[0] = p->clock_id; /* clockid_t */ iarg[1] = p->flags; /* int */ uarg[2] = (intptr_t) p->rqtp; /* const struct timespec * */ uarg[3] = (intptr_t) p->rmtp; /* struct timespec * */ *n_args = 4; break; } /* clock_getcpuclockid2 */ case 247: { struct clock_getcpuclockid2_args *p = params; iarg[0] = p->id; /* id_t */ iarg[1] = p->which; /* int */ uarg[2] = (intptr_t) p->clock_id; /* clockid_t * */ *n_args = 3; break; } /* ntp_gettime */ case 248: { struct ntp_gettime_args *p = params; uarg[0] = (intptr_t) p->ntvp; /* struct ntptimeval * */ *n_args = 1; break; } /* minherit */ case 250: { struct minherit_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->inherit; /* int */ *n_args = 3; break; } /* rfork */ case 251: { struct 
rfork_args *p = params; iarg[0] = p->flags; /* int */ *n_args = 1; break; } /* issetugid */ case 253: { *n_args = 0; break; } /* lchown */ case 254: { struct lchown_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->uid; /* int */ iarg[2] = p->gid; /* int */ *n_args = 3; break; } /* aio_read */ case 255: { struct aio_read_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 1; break; } /* aio_write */ case 256: { struct aio_write_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 1; break; } /* lio_listio */ case 257: { struct lio_listio_args *p = params; iarg[0] = p->mode; /* int */ uarg[1] = (intptr_t) p->acb_list; /* struct aiocb *const * */ iarg[2] = p->nent; /* int */ uarg[3] = (intptr_t) p->sig; /* struct sigevent * */ *n_args = 4; break; } /* lchmod */ case 274: { struct lchmod_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->mode; /* mode_t */ *n_args = 2; break; } /* lutimes */ case 276: { struct lutimes_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) p->tptr; /* struct timeval * */ *n_args = 2; break; } /* preadv */ case 289: { struct preadv_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec * */ uarg[2] = p->iovcnt; /* u_int */ iarg[3] = p->offset; /* off_t */ *n_args = 4; break; } /* pwritev */ case 290: { struct pwritev_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->iovp; /* struct iovec * */ uarg[2] = p->iovcnt; /* u_int */ iarg[3] = p->offset; /* off_t */ *n_args = 4; break; } /* fhopen */ case 298: { struct fhopen_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* modnext */ case 300: { struct modnext_args *p = params; iarg[0] = p->modid; /* int */ *n_args = 1; break; } /* modstat */ case 301: { struct modstat_args *p = params; iarg[0] = p->modid; /* int */ uarg[1] = (intptr_t) p->stat; /* struct module_stat * */ *n_args = 2; break; } /* modfnext */ case 302: { struct modfnext_args *p = params; iarg[0] = p->modid; /* int */ *n_args = 1; break; } /* modfind */ case 303: { struct modfind_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* kldload */ case 304: { struct kldload_args *p = params; uarg[0] = (intptr_t) p->file; /* const char * */ *n_args = 1; break; } /* kldunload */ case 305: { struct kldunload_args *p = params; iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* kldfind */ case 306: { struct kldfind_args *p = params; uarg[0] = (intptr_t) p->file; /* const char * */ *n_args = 1; break; } /* kldnext */ case 307: { struct kldnext_args *p = params; iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* kldstat */ case 308: { struct kldstat_args *p = params; iarg[0] = p->fileid; /* int */ uarg[1] = (intptr_t) p->stat; /* struct kld_file_stat * */ *n_args = 2; break; } /* kldfirstmod */ case 309: { struct kldfirstmod_args *p = params; iarg[0] = p->fileid; /* int */ *n_args = 1; break; } /* getsid */ case 310: { struct getsid_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* setresuid */ case 311: { struct setresuid_args *p = params; uarg[0] = p->ruid; /* uid_t */ uarg[1] = p->euid; /* uid_t */ uarg[2] = p->suid; /* uid_t */ *n_args = 3; break; } /* setresgid */ case 312: { struct setresgid_args *p = params; iarg[0] = p->rgid; /* gid_t */ iarg[1] = p->egid; /* gid_t */ iarg[2] = p->sgid; /* gid_t */ 
*n_args = 3; break; } /* aio_return */ case 314: { struct aio_return_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 1; break; } /* aio_suspend */ case 315: { struct aio_suspend_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb *const * */ iarg[1] = p->nent; /* int */ uarg[2] = (intptr_t) p->timeout; /* const struct timespec * */ *n_args = 3; break; } /* aio_cancel */ case 316: { struct aio_cancel_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 2; break; } /* aio_error */ case 317: { struct aio_error_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 1; break; } /* yield */ case 321: { *n_args = 0; break; } /* mlockall */ case 324: { struct mlockall_args *p = params; iarg[0] = p->how; /* int */ *n_args = 1; break; } /* munlockall */ case 325: { *n_args = 0; break; } /* __getcwd */ case 326: { struct __getcwd_args *p = params; uarg[0] = (intptr_t) p->buf; /* char * */ uarg[1] = p->buflen; /* size_t */ *n_args = 2; break; } /* sched_setparam */ case 327: { struct sched_setparam_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->param; /* const struct sched_param * */ *n_args = 2; break; } /* sched_getparam */ case 328: { struct sched_getparam_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->param; /* struct sched_param * */ *n_args = 2; break; } /* sched_setscheduler */ case 329: { struct sched_setscheduler_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->policy; /* int */ uarg[2] = (intptr_t) p->param; /* const struct sched_param * */ *n_args = 3; break; } /* sched_getscheduler */ case 330: { struct sched_getscheduler_args *p = params; iarg[0] = p->pid; /* pid_t */ *n_args = 1; break; } /* sched_yield */ case 331: { *n_args = 0; break; } /* sched_get_priority_max */ case 332: { struct sched_get_priority_max_args *p = params; iarg[0] = p->policy; /* int */ *n_args = 1; break; } /* sched_get_priority_min */ case 333: { struct sched_get_priority_min_args *p = params; iarg[0] = p->policy; /* int */ *n_args = 1; break; } /* sched_rr_get_interval */ case 334: { struct sched_rr_get_interval_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->interval; /* struct timespec * */ *n_args = 2; break; } /* utrace */ case 335: { struct utrace_args *p = params; uarg[0] = (intptr_t) p->addr; /* const void * */ uarg[1] = p->len; /* size_t */ *n_args = 2; break; } /* kldsym */ case 337: { struct kldsym_args *p = params; iarg[0] = p->fileid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ *n_args = 3; break; } /* jail */ case 338: { struct jail_args *p = params; uarg[0] = (intptr_t) p->jail; /* struct jail * */ *n_args = 1; break; } /* nnpfs_syscall */ case 339: { struct nnpfs_syscall_args *p = params; iarg[0] = p->operation; /* int */ uarg[1] = (intptr_t) p->a_pathP; /* char * */ iarg[2] = p->a_opcode; /* int */ uarg[3] = (intptr_t) p->a_paramsP; /* void * */ iarg[4] = p->a_followSymlinks; /* int */ *n_args = 5; break; } /* sigprocmask */ case 340: { struct sigprocmask_args *p = params; iarg[0] = p->how; /* int */ uarg[1] = (intptr_t) p->set; /* const sigset_t * */ uarg[2] = (intptr_t) p->oset; /* sigset_t * */ *n_args = 3; break; } /* sigsuspend */ case 341: { struct sigsuspend_args *p = params; uarg[0] = (intptr_t) p->sigmask; /* const sigset_t * */ *n_args = 1; break; } /* sigpending */ case 343: { struct sigpending_args *p = params; uarg[0] = (intptr_t) p->set; 
/* sigset_t * */ *n_args = 1; break; } /* sigtimedwait */ case 345: { struct sigtimedwait_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->info; /* siginfo_t * */ uarg[2] = (intptr_t) p->timeout; /* const struct timespec * */ *n_args = 3; break; } /* sigwaitinfo */ case 346: { struct sigwaitinfo_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->info; /* siginfo_t * */ *n_args = 2; break; } /* __acl_get_file */ case 347: { struct __acl_get_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_file */ case 348: { struct __acl_set_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_get_fd */ case 349: { struct __acl_get_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_fd */ case 350: { struct __acl_set_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_delete_file */ case 351: { struct __acl_delete_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_delete_fd */ case 352: { struct __acl_delete_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_aclcheck_file */ case 353: { struct __acl_aclcheck_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_aclcheck_fd */ case 354: { struct __acl_aclcheck_fd_args *p = params; iarg[0] = p->filedes; /* int */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* extattrctl */ case 355: { struct extattrctl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->filename; /* const char * */ iarg[3] = p->attrnamespace; /* int */ uarg[4] = (intptr_t) p->attrname; /* const char * */ *n_args = 5; break; } /* extattr_set_file */ case 356: { struct extattr_set_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_file */ case 357: { struct extattr_get_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_file */ case 358: { struct extattr_delete_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* aio_waitcomplete */ case 359: { struct aio_waitcomplete_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb ** */ uarg[1] = (intptr_t) p->timeout; /* struct timespec * */ *n_args = 2; 
break; } /* getresuid */ case 360: { struct getresuid_args *p = params; uarg[0] = (intptr_t) p->ruid; /* uid_t * */ uarg[1] = (intptr_t) p->euid; /* uid_t * */ uarg[2] = (intptr_t) p->suid; /* uid_t * */ *n_args = 3; break; } /* getresgid */ case 361: { struct getresgid_args *p = params; uarg[0] = (intptr_t) p->rgid; /* gid_t * */ uarg[1] = (intptr_t) p->egid; /* gid_t * */ uarg[2] = (intptr_t) p->sgid; /* gid_t * */ *n_args = 3; break; } /* kqueue */ case 362: { *n_args = 0; break; } /* extattr_set_fd */ case 371: { struct extattr_set_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_fd */ case 372: { struct extattr_get_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_fd */ case 373: { struct extattr_delete_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* __setugid */ case 374: { struct __setugid_args *p = params; iarg[0] = p->flag; /* int */ *n_args = 1; break; } /* eaccess */ case 376: { struct eaccess_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->amode; /* int */ *n_args = 2; break; } /* afs3_syscall */ case 377: { struct afs3_syscall_args *p = params; iarg[0] = p->syscall; /* long */ iarg[1] = p->parm1; /* long */ iarg[2] = p->parm2; /* long */ iarg[3] = p->parm3; /* long */ iarg[4] = p->parm4; /* long */ iarg[5] = p->parm5; /* long */ iarg[6] = p->parm6; /* long */ *n_args = 7; break; } /* nmount */ case 378: { struct nmount_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* __mac_get_proc */ case 384: { struct __mac_get_proc_args *p = params; uarg[0] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 1; break; } /* __mac_set_proc */ case 385: { struct __mac_set_proc_args *p = params; uarg[0] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 1; break; } /* __mac_get_fd */ case 386: { struct __mac_get_fd_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* __mac_get_file */ case 387: { struct __mac_get_file_args *p = params; uarg[0] = (intptr_t) p->path_p; /* const char * */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* __mac_set_fd */ case 388: { struct __mac_set_fd_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* __mac_set_file */ case 389: { struct __mac_set_file_args *p = params; uarg[0] = (intptr_t) p->path_p; /* const char * */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* kenv */ case 390: { struct kenv_args *p = params; iarg[0] = p->what; /* int */ uarg[1] = (intptr_t) p->name; /* const char * */ uarg[2] = (intptr_t) p->value; /* char * */ iarg[3] = p->len; /* int */ *n_args = 4; break; } /* lchflags */ case 391: { struct lchflags_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = p->flags; /* u_long */ *n_args = 2; break; } /* uuidgen */ case 392: { struct uuidgen_args *p = params; uarg[0] = 
(intptr_t) p->store; /* struct uuid * */ iarg[1] = p->count; /* int */ *n_args = 2; break; } /* sendfile */ case 393: { struct sendfile_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ iarg[2] = p->offset; /* off_t */ uarg[3] = p->nbytes; /* size_t */ uarg[4] = (intptr_t) p->hdtr; /* struct sf_hdtr * */ uarg[5] = (intptr_t) p->sbytes; /* off_t * */ iarg[6] = p->flags; /* int */ *n_args = 7; break; } /* mac_syscall */ case 394: { struct mac_syscall_args *p = params; uarg[0] = (intptr_t) p->policy; /* const char * */ iarg[1] = p->call; /* int */ uarg[2] = (intptr_t) p->arg; /* void * */ *n_args = 3; break; } /* ksem_close */ case 400: { struct ksem_close_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_post */ case 401: { struct ksem_post_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_wait */ case 402: { struct ksem_wait_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_trywait */ case 403: { struct ksem_trywait_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* ksem_init */ case 404: { struct ksem_init_args *p = params; uarg[0] = (intptr_t) p->idp; /* semid_t * */ uarg[1] = p->value; /* unsigned int */ *n_args = 2; break; } /* ksem_open */ case 405: { struct ksem_open_args *p = params; uarg[0] = (intptr_t) p->idp; /* semid_t * */ uarg[1] = (intptr_t) p->name; /* const char * */ iarg[2] = p->oflag; /* int */ iarg[3] = p->mode; /* mode_t */ uarg[4] = p->value; /* unsigned int */ *n_args = 5; break; } /* ksem_unlink */ case 406: { struct ksem_unlink_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* ksem_getvalue */ case 407: { struct ksem_getvalue_args *p = params; iarg[0] = p->id; /* semid_t */ uarg[1] = (intptr_t) p->val; /* int * */ *n_args = 2; break; } /* ksem_destroy */ case 408: { struct ksem_destroy_args *p = params; iarg[0] = p->id; /* semid_t */ *n_args = 1; break; } /* __mac_get_pid */ case 409: { struct __mac_get_pid_args *p = params; iarg[0] = p->pid; /* pid_t */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* __mac_get_link */ case 410: { struct __mac_get_link_args *p = params; uarg[0] = (intptr_t) p->path_p; /* const char * */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* __mac_set_link */ case 411: { struct __mac_set_link_args *p = params; uarg[0] = (intptr_t) p->path_p; /* const char * */ uarg[1] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 2; break; } /* extattr_set_link */ case 412: { struct extattr_set_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_get_link */ case 413: { struct extattr_get_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ uarg[3] = (intptr_t) p->data; /* void * */ uarg[4] = p->nbytes; /* size_t */ *n_args = 5; break; } /* extattr_delete_link */ case 414: { struct extattr_delete_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->attrname; /* const char * */ *n_args = 3; break; } /* __mac_execve */ case 415: { struct __mac_execve_args *p = params; uarg[0] = (intptr_t) p->fname; /* const char * */ uarg[1] = 
(intptr_t) p->argv; /* char ** */ uarg[2] = (intptr_t) p->envv; /* char ** */ uarg[3] = (intptr_t) p->mac_p; /* struct mac * */ *n_args = 4; break; } /* sigaction */ case 416: { struct sigaction_args *p = params; iarg[0] = p->sig; /* int */ uarg[1] = (intptr_t) p->act; /* const struct sigaction * */ uarg[2] = (intptr_t) p->oact; /* struct sigaction * */ *n_args = 3; break; } /* sigreturn */ case 417: { struct sigreturn_args *p = params; uarg[0] = (intptr_t) p->sigcntxp; /* const struct __ucontext * */ *n_args = 1; break; } /* getcontext */ case 421: { struct getcontext_args *p = params; uarg[0] = (intptr_t) p->ucp; /* struct __ucontext * */ *n_args = 1; break; } /* setcontext */ case 422: { struct setcontext_args *p = params; uarg[0] = (intptr_t) p->ucp; /* const struct __ucontext * */ *n_args = 1; break; } /* swapcontext */ case 423: { struct swapcontext_args *p = params; uarg[0] = (intptr_t) p->oucp; /* struct __ucontext * */ uarg[1] = (intptr_t) p->ucp; /* const struct __ucontext * */ *n_args = 2; break; } /* swapoff */ case 424: { struct swapoff_args *p = params; uarg[0] = (intptr_t) p->name; /* const char * */ *n_args = 1; break; } /* __acl_get_link */ case 425: { struct __acl_get_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_set_link */ case 426: { struct __acl_set_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* __acl_delete_link */ case 427: { struct __acl_delete_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ *n_args = 2; break; } /* __acl_aclcheck_link */ case 428: { struct __acl_aclcheck_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->type; /* acl_type_t */ uarg[2] = (intptr_t) p->aclp; /* struct acl * */ *n_args = 3; break; } /* sigwait */ case 429: { struct sigwait_args *p = params; uarg[0] = (intptr_t) p->set; /* const sigset_t * */ uarg[1] = (intptr_t) p->sig; /* int * */ *n_args = 2; break; } /* thr_create */ case 430: { struct thr_create_args *p = params; uarg[0] = (intptr_t) p->ctx; /* ucontext_t * */ uarg[1] = (intptr_t) p->id; /* long * */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* thr_exit */ case 431: { struct thr_exit_args *p = params; uarg[0] = (intptr_t) p->state; /* long * */ *n_args = 1; break; } /* thr_self */ case 432: { struct thr_self_args *p = params; uarg[0] = (intptr_t) p->id; /* long * */ *n_args = 1; break; } /* thr_kill */ case 433: { struct thr_kill_args *p = params; iarg[0] = p->id; /* long */ iarg[1] = p->sig; /* int */ *n_args = 2; break; } /* jail_attach */ case 436: { struct jail_attach_args *p = params; iarg[0] = p->jid; /* int */ *n_args = 1; break; } /* extattr_list_fd */ case 437: { struct extattr_list_fd_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* extattr_list_file */ case 438: { struct extattr_list_file_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* extattr_list_link */ case 439: { struct extattr_list_link_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * 
*/ iarg[1] = p->attrnamespace; /* int */ uarg[2] = (intptr_t) p->data; /* void * */ uarg[3] = p->nbytes; /* size_t */ *n_args = 4; break; } /* ksem_timedwait */ case 441: { struct ksem_timedwait_args *p = params; iarg[0] = p->id; /* semid_t */ uarg[1] = (intptr_t) p->abstime; /* const struct timespec * */ *n_args = 2; break; } /* thr_suspend */ case 442: { struct thr_suspend_args *p = params; uarg[0] = (intptr_t) p->timeout; /* const struct timespec * */ *n_args = 1; break; } /* thr_wake */ case 443: { struct thr_wake_args *p = params; iarg[0] = p->id; /* long */ *n_args = 1; break; } /* kldunloadf */ case 444: { struct kldunloadf_args *p = params; iarg[0] = p->fileid; /* int */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* audit */ case 445: { struct audit_args *p = params; uarg[0] = (intptr_t) p->record; /* const void * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* auditon */ case 446: { struct auditon_args *p = params; iarg[0] = p->cmd; /* int */ uarg[1] = (intptr_t) p->data; /* void * */ uarg[2] = p->length; /* u_int */ *n_args = 3; break; } /* getauid */ case 447: { struct getauid_args *p = params; uarg[0] = (intptr_t) p->auid; /* uid_t * */ *n_args = 1; break; } /* setauid */ case 448: { struct setauid_args *p = params; uarg[0] = (intptr_t) p->auid; /* uid_t * */ *n_args = 1; break; } /* getaudit */ case 449: { struct getaudit_args *p = params; uarg[0] = (intptr_t) p->auditinfo; /* struct auditinfo * */ *n_args = 1; break; } /* setaudit */ case 450: { struct setaudit_args *p = params; uarg[0] = (intptr_t) p->auditinfo; /* struct auditinfo * */ *n_args = 1; break; } /* getaudit_addr */ case 451: { struct getaudit_addr_args *p = params; uarg[0] = (intptr_t) p->auditinfo_addr; /* struct auditinfo_addr * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* setaudit_addr */ case 452: { struct setaudit_addr_args *p = params; uarg[0] = (intptr_t) p->auditinfo_addr; /* struct auditinfo_addr * */ uarg[1] = p->length; /* u_int */ *n_args = 2; break; } /* auditctl */ case 453: { struct auditctl_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* _umtx_op */ case 454: { struct _umtx_op_args *p = params; uarg[0] = (intptr_t) p->obj; /* void * */ iarg[1] = p->op; /* int */ uarg[2] = p->val; /* u_long */ uarg[3] = (intptr_t) p->uaddr1; /* void * */ uarg[4] = (intptr_t) p->uaddr2; /* void * */ *n_args = 5; break; } /* thr_new */ case 455: { struct thr_new_args *p = params; uarg[0] = (intptr_t) p->param; /* struct thr_param * */ iarg[1] = p->param_size; /* int */ *n_args = 2; break; } /* sigqueue */ case 456: { struct sigqueue_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->signum; /* int */ uarg[2] = (intptr_t) p->value; /* void * */ *n_args = 3; break; } /* kmq_open */ case 457: { struct kmq_open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ uarg[3] = (intptr_t) p->attr; /* const struct mq_attr * */ *n_args = 4; break; } /* kmq_setattr */ case 458: { struct kmq_setattr_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->attr; /* const struct mq_attr * */ uarg[2] = (intptr_t) p->oattr; /* struct mq_attr * */ *n_args = 3; break; } /* kmq_timedreceive */ case 459: { struct kmq_timedreceive_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->msg_ptr; /* char * */ uarg[2] = p->msg_len; /* size_t */ uarg[3] = (intptr_t) p->msg_prio; /* unsigned * */ uarg[4] = (intptr_t) p->abs_timeout; /* const struct 
timespec * */ *n_args = 5; break; } /* kmq_timedsend */ case 460: { struct kmq_timedsend_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->msg_ptr; /* const char * */ uarg[2] = p->msg_len; /* size_t */ uarg[3] = p->msg_prio; /* unsigned */ uarg[4] = (intptr_t) p->abs_timeout; /* const struct timespec * */ *n_args = 5; break; } /* kmq_notify */ case 461: { struct kmq_notify_args *p = params; iarg[0] = p->mqd; /* int */ uarg[1] = (intptr_t) p->sigev; /* const struct sigevent * */ *n_args = 2; break; } /* kmq_unlink */ case 462: { struct kmq_unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* abort2 */ case 463: { struct abort2_args *p = params; uarg[0] = (intptr_t) p->why; /* const char * */ iarg[1] = p->nargs; /* int */ uarg[2] = (intptr_t) p->args; /* void ** */ *n_args = 3; break; } /* thr_set_name */ case 464: { struct thr_set_name_args *p = params; iarg[0] = p->id; /* long */ uarg[1] = (intptr_t) p->name; /* const char * */ *n_args = 2; break; } /* aio_fsync */ case 465: { struct aio_fsync_args *p = params; iarg[0] = p->op; /* int */ uarg[1] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 2; break; } /* rtprio_thread */ case 466: { struct rtprio_thread_args *p = params; iarg[0] = p->function; /* int */ iarg[1] = p->lwpid; /* lwpid_t */ uarg[2] = (intptr_t) p->rtp; /* struct rtprio * */ *n_args = 3; break; } /* sctp_peeloff */ case 471: { struct sctp_peeloff_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = p->name; /* uint32_t */ *n_args = 2; break; } /* sctp_generic_sendmsg */ case 472: { struct sctp_generic_sendmsg_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->msg; /* void * */ iarg[2] = p->mlen; /* int */ uarg[3] = (intptr_t) p->to; /* struct sockaddr * */ iarg[4] = p->tolen; /* __socklen_t */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ iarg[6] = p->flags; /* int */ *n_args = 7; break; } /* sctp_generic_sendmsg_iov */ case 473: { struct sctp_generic_sendmsg_iov_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->iov; /* struct iovec * */ iarg[2] = p->iovlen; /* int */ uarg[3] = (intptr_t) p->to; /* struct sockaddr * */ iarg[4] = p->tolen; /* __socklen_t */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ iarg[6] = p->flags; /* int */ *n_args = 7; break; } /* sctp_generic_recvmsg */ case 474: { struct sctp_generic_recvmsg_args *p = params; iarg[0] = p->sd; /* int */ uarg[1] = (intptr_t) p->iov; /* struct iovec * */ iarg[2] = p->iovlen; /* int */ uarg[3] = (intptr_t) p->from; /* struct sockaddr * */ uarg[4] = (intptr_t) p->fromlenaddr; /* __socklen_t * */ uarg[5] = (intptr_t) p->sinfo; /* struct sctp_sndrcvinfo * */ uarg[6] = (intptr_t) p->msg_flags; /* int * */ *n_args = 7; break; } /* pread */ case 475: { struct pread_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* void * */ uarg[2] = p->nbyte; /* size_t */ iarg[3] = p->offset; /* off_t */ *n_args = 4; break; } /* pwrite */ case 476: { struct pwrite_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* const void * */ uarg[2] = p->nbyte; /* size_t */ iarg[3] = p->offset; /* off_t */ *n_args = 4; break; } /* mmap */ case 477: { struct mmap_args *p = params; uarg[0] = (intptr_t) p->addr; /* void * */ uarg[1] = p->len; /* size_t */ iarg[2] = p->prot; /* int */ iarg[3] = p->flags; /* int */ iarg[4] = p->fd; /* int */ iarg[5] = p->pos; /* off_t */ *n_args = 6; break; } /* lseek */ case 478: { struct lseek_args *p = params; iarg[0] = p->fd; 
/* int */ iarg[1] = p->offset; /* off_t */ iarg[2] = p->whence; /* int */ *n_args = 3; break; } /* truncate */ case 479: { struct truncate_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->length; /* off_t */ *n_args = 2; break; } /* ftruncate */ case 480: { struct ftruncate_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->length; /* off_t */ *n_args = 2; break; } /* thr_kill2 */ case 481: { struct thr_kill2_args *p = params; iarg[0] = p->pid; /* pid_t */ iarg[1] = p->id; /* long */ iarg[2] = p->sig; /* int */ *n_args = 3; break; } /* shm_open */ case 482: { struct shm_open_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->flags; /* int */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* shm_unlink */ case 483: { struct shm_unlink_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* cpuset */ case 484: { struct cpuset_args *p = params; uarg[0] = (intptr_t) p->setid; /* cpusetid_t * */ *n_args = 1; break; } /* cpuset_setid */ case 485: { struct cpuset_setid_args *p = params; iarg[0] = p->which; /* cpuwhich_t */ iarg[1] = p->id; /* id_t */ iarg[2] = p->setid; /* cpusetid_t */ *n_args = 3; break; } /* cpuset_getid */ case 486: { struct cpuset_getid_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ iarg[2] = p->id; /* id_t */ uarg[3] = (intptr_t) p->setid; /* cpusetid_t * */ *n_args = 4; break; } /* cpuset_getaffinity */ case 487: { struct cpuset_getaffinity_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ iarg[2] = p->id; /* id_t */ uarg[3] = p->cpusetsize; /* size_t */ uarg[4] = (intptr_t) p->mask; /* cpuset_t * */ *n_args = 5; break; } /* cpuset_setaffinity */ case 488: { struct cpuset_setaffinity_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ iarg[2] = p->id; /* id_t */ uarg[3] = p->cpusetsize; /* size_t */ uarg[4] = (intptr_t) p->mask; /* const cpuset_t * */ *n_args = 5; break; } /* faccessat */ case 489: { struct faccessat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->amode; /* int */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fchmodat */ case 490: { struct fchmodat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fchownat */ case 491: { struct fchownat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = p->uid; /* uid_t */ iarg[3] = p->gid; /* gid_t */ iarg[4] = p->flag; /* int */ *n_args = 5; break; } /* fexecve */ case 492: { struct fexecve_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->argv; /* char ** */ uarg[2] = (intptr_t) p->envv; /* char ** */ *n_args = 3; break; } /* futimesat */ case 494: { struct futimesat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->times; /* struct timeval * */ *n_args = 3; break; } /* linkat */ case 495: { struct linkat_args *p = params; iarg[0] = p->fd1; /* int */ uarg[1] = (intptr_t) p->path1; /* const char * */ iarg[2] = p->fd2; /* int */ uarg[3] = (intptr_t) p->path2; /* const char * */ iarg[4] = p->flag; /* int */ *n_args = 5; break; } /* mkdirat */ case 496: { struct mkdirat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * 
*/ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* mkfifoat */ case 497: { struct mkfifoat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ *n_args = 3; break; } /* openat */ case 499: { struct openat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flag; /* int */ iarg[3] = p->mode; /* mode_t */ *n_args = 4; break; } /* readlinkat */ case 500: { struct readlinkat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->buf; /* char * */ uarg[3] = p->bufsize; /* size_t */ *n_args = 4; break; } /* renameat */ case 501: { struct renameat_args *p = params; iarg[0] = p->oldfd; /* int */ uarg[1] = (intptr_t) p->old; /* const char * */ iarg[2] = p->newfd; /* int */ uarg[3] = (intptr_t) p->new; /* const char * */ *n_args = 4; break; } /* symlinkat */ case 502: { struct symlinkat_args *p = params; uarg[0] = (intptr_t) p->path1; /* const char * */ iarg[1] = p->fd; /* int */ uarg[2] = (intptr_t) p->path2; /* const char * */ *n_args = 3; break; } /* unlinkat */ case 503: { struct unlinkat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->flag; /* int */ *n_args = 3; break; } /* posix_openpt */ case 504: { struct posix_openpt_args *p = params; iarg[0] = p->flags; /* int */ *n_args = 1; break; } /* gssd_syscall */ case 505: { struct gssd_syscall_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ *n_args = 1; break; } /* jail_get */ case 506: { struct jail_get_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* jail_set */ case 507: { struct jail_set_args *p = params; uarg[0] = (intptr_t) p->iovp; /* struct iovec * */ uarg[1] = p->iovcnt; /* unsigned int */ iarg[2] = p->flags; /* int */ *n_args = 3; break; } /* jail_remove */ case 508: { struct jail_remove_args *p = params; iarg[0] = p->jid; /* int */ *n_args = 1; break; } /* closefrom */ case 509: { struct closefrom_args *p = params; iarg[0] = p->lowfd; /* int */ *n_args = 1; break; } /* __semctl */ case 510: { struct __semctl_args *p = params; iarg[0] = p->semid; /* int */ iarg[1] = p->semnum; /* int */ iarg[2] = p->cmd; /* int */ uarg[3] = (intptr_t) p->arg; /* union semun * */ *n_args = 4; break; } /* msgctl */ case 511: { struct msgctl_args *p = params; iarg[0] = p->msqid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->buf; /* struct msqid_ds * */ *n_args = 3; break; } /* shmctl */ case 512: { struct shmctl_args *p = params; iarg[0] = p->shmid; /* int */ iarg[1] = p->cmd; /* int */ uarg[2] = (intptr_t) p->buf; /* struct shmid_ds * */ *n_args = 3; break; } /* lpathconf */ case 513: { struct lpathconf_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ iarg[1] = p->name; /* int */ *n_args = 2; break; } /* __cap_rights_get */ case 515: { struct __cap_rights_get_args *p = params; iarg[0] = p->version; /* int */ iarg[1] = p->fd; /* int */ uarg[2] = (intptr_t) p->rightsp; /* cap_rights_t * */ *n_args = 3; break; } /* cap_enter */ case 516: { *n_args = 0; break; } /* cap_getmode */ case 517: { struct cap_getmode_args *p = params; uarg[0] = (intptr_t) p->modep; /* u_int * */ *n_args = 1; break; } /* pdfork */ case 518: { struct pdfork_args *p = params; uarg[0] = (intptr_t) p->fdp; /* int * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } 
/* pdkill */ case 519: { struct pdkill_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->signum; /* int */ *n_args = 2; break; } /* pdgetpid */ case 520: { struct pdgetpid_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->pidp; /* pid_t * */ *n_args = 2; break; } /* pselect */ case 522: { struct pselect_args *p = params; iarg[0] = p->nd; /* int */ uarg[1] = (intptr_t) p->in; /* fd_set * */ uarg[2] = (intptr_t) p->ou; /* fd_set * */ uarg[3] = (intptr_t) p->ex; /* fd_set * */ uarg[4] = (intptr_t) p->ts; /* const struct timespec * */ uarg[5] = (intptr_t) p->sm; /* const sigset_t * */ *n_args = 6; break; } /* getloginclass */ case 523: { struct getloginclass_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* char * */ uarg[1] = p->namelen; /* size_t */ *n_args = 2; break; } /* setloginclass */ case 524: { struct setloginclass_args *p = params; uarg[0] = (intptr_t) p->namebuf; /* const char * */ *n_args = 1; break; } /* rctl_get_racct */ case 525: { struct rctl_get_racct_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_get_rules */ case 526: { struct rctl_get_rules_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_get_limits */ case 527: { struct rctl_get_limits_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_add_rule */ case 528: { struct rctl_add_rule_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* rctl_remove_rule */ case 529: { struct rctl_remove_rule_args *p = params; uarg[0] = (intptr_t) p->inbufp; /* const void * */ uarg[1] = p->inbuflen; /* size_t */ uarg[2] = (intptr_t) p->outbufp; /* void * */ uarg[3] = p->outbuflen; /* size_t */ *n_args = 4; break; } /* posix_fallocate */ case 530: { struct posix_fallocate_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->offset; /* off_t */ iarg[2] = p->len; /* off_t */ *n_args = 3; break; } /* posix_fadvise */ case 531: { struct posix_fadvise_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->offset; /* off_t */ iarg[2] = p->len; /* off_t */ iarg[3] = p->advice; /* int */ *n_args = 4; break; } /* wait6 */ case 532: { struct wait6_args *p = params; iarg[0] = p->idtype; /* idtype_t */ iarg[1] = p->id; /* id_t */ uarg[2] = (intptr_t) p->status; /* int * */ iarg[3] = p->options; /* int */ uarg[4] = (intptr_t) p->wrusage; /* struct __wrusage * */ uarg[5] = (intptr_t) p->info; /* siginfo_t * */ *n_args = 6; break; } /* cap_rights_limit */ case 533: { struct cap_rights_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->rightsp; /* cap_rights_t * */ *n_args = 2; break; } /* cap_ioctls_limit */ case 534: { struct cap_ioctls_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->cmds; /* const u_long * */ uarg[2] = p->ncmds; /* size_t */ *n_args = 3; break; } /* cap_ioctls_get */ case 535: { struct cap_ioctls_get_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->cmds; /* u_long * */ uarg[2] = 
p->maxcmds; /* size_t */ *n_args = 3; break; } /* cap_fcntls_limit */ case 536: { struct cap_fcntls_limit_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = p->fcntlrights; /* uint32_t */ *n_args = 2; break; } /* cap_fcntls_get */ case 537: { struct cap_fcntls_get_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->fcntlrightsp; /* uint32_t * */ *n_args = 2; break; } /* bindat */ case 538: { struct bindat_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ uarg[2] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[3] = p->namelen; /* int */ *n_args = 4; break; } /* connectat */ case 539: { struct connectat_args *p = params; iarg[0] = p->fd; /* int */ iarg[1] = p->s; /* int */ uarg[2] = (intptr_t) p->name; /* const struct sockaddr * */ iarg[3] = p->namelen; /* int */ *n_args = 4; break; } /* chflagsat */ case 540: { struct chflagsat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = p->flags; /* u_long */ iarg[3] = p->atflag; /* int */ *n_args = 4; break; } /* accept4 */ case 541: { struct accept4_args *p = params; iarg[0] = p->s; /* int */ uarg[1] = (intptr_t) p->name; /* struct sockaddr * */ uarg[2] = (intptr_t) p->anamelen; /* __socklen_t * */ iarg[3] = p->flags; /* int */ *n_args = 4; break; } /* pipe2 */ case 542: { struct pipe2_args *p = params; uarg[0] = (intptr_t) p->fildes; /* int * */ iarg[1] = p->flags; /* int */ *n_args = 2; break; } /* aio_mlock */ case 543: { struct aio_mlock_args *p = params; uarg[0] = (intptr_t) p->aiocbp; /* struct aiocb * */ *n_args = 1; break; } /* procctl */ case 544: { struct procctl_args *p = params; iarg[0] = p->idtype; /* idtype_t */ iarg[1] = p->id; /* id_t */ iarg[2] = p->com; /* int */ uarg[3] = (intptr_t) p->data; /* void * */ *n_args = 4; break; } /* ppoll */ case 545: { struct ppoll_args *p = params; uarg[0] = (intptr_t) p->fds; /* struct pollfd * */ uarg[1] = p->nfds; /* u_int */ uarg[2] = (intptr_t) p->ts; /* const struct timespec * */ uarg[3] = (intptr_t) p->set; /* const sigset_t * */ *n_args = 4; break; } /* futimens */ case 546: { struct futimens_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->times; /* struct timespec * */ *n_args = 2; break; } /* utimensat */ case 547: { struct utimensat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->times; /* struct timespec * */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fdatasync */ case 550: { struct fdatasync_args *p = params; iarg[0] = p->fd; /* int */ *n_args = 1; break; } /* fstat */ case 551: { struct fstat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->sb; /* struct stat * */ *n_args = 2; break; } /* fstatat */ case 552: { struct fstatat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ uarg[2] = (intptr_t) p->buf; /* struct stat * */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } /* fhstat */ case 553: { struct fhstat_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ uarg[1] = (intptr_t) p->sb; /* struct stat * */ *n_args = 2; break; } /* getdirentries */ case 554: { struct getdirentries_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->count; /* size_t */ uarg[3] = (intptr_t) p->basep; /* off_t * */ *n_args = 4; break; } /* statfs */ case 555: { struct statfs_args *p = params; uarg[0] = (intptr_t) p->path; /* const char * */ uarg[1] = (intptr_t) 
p->buf; /* struct statfs * */ *n_args = 2; break; } /* fstatfs */ case 556: { struct fstatfs_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->buf; /* struct statfs * */ *n_args = 2; break; } /* getfsstat */ case 557: { struct getfsstat_args *p = params; uarg[0] = (intptr_t) p->buf; /* struct statfs * */ iarg[1] = p->bufsize; /* long */ iarg[2] = p->mode; /* int */ *n_args = 3; break; } /* fhstatfs */ case 558: { struct fhstatfs_args *p = params; uarg[0] = (intptr_t) p->u_fhp; /* const struct fhandle * */ uarg[1] = (intptr_t) p->buf; /* struct statfs * */ *n_args = 2; break; } /* mknodat */ case 559: { struct mknodat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->mode; /* mode_t */ iarg[3] = p->dev; /* dev_t */ *n_args = 4; break; } /* kevent */ case 560: { struct kevent_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->changelist; /* struct kevent * */ iarg[2] = p->nchanges; /* int */ uarg[3] = (intptr_t) p->eventlist; /* struct kevent * */ iarg[4] = p->nevents; /* int */ uarg[5] = (intptr_t) p->timeout; /* const struct timespec * */ *n_args = 6; break; } /* cpuset_getdomain */ case 561: { struct cpuset_getdomain_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ iarg[2] = p->id; /* id_t */ uarg[3] = p->domainsetsize; /* size_t */ uarg[4] = (intptr_t) p->mask; /* domainset_t * */ uarg[5] = (intptr_t) p->policy; /* int * */ *n_args = 6; break; } /* cpuset_setdomain */ case 562: { struct cpuset_setdomain_args *p = params; iarg[0] = p->level; /* cpulevel_t */ iarg[1] = p->which; /* cpuwhich_t */ iarg[2] = p->id; /* id_t */ uarg[3] = p->domainsetsize; /* size_t */ uarg[4] = (intptr_t) p->mask; /* domainset_t * */ iarg[5] = p->policy; /* int */ *n_args = 6; break; } /* getrandom */ case 563: { struct getrandom_args *p = params; uarg[0] = (intptr_t) p->buf; /* void * */ uarg[1] = p->buflen; /* size_t */ uarg[2] = p->flags; /* unsigned int */ *n_args = 3; break; } /* getfhat */ case 564: { struct getfhat_args *p = params; iarg[0] = p->fd; /* int */ uarg[1] = (intptr_t) p->path; /* char * */ uarg[2] = (intptr_t) p->fhp; /* struct fhandle * */ iarg[3] = p->flags; /* int */ *n_args = 4; break; } /* fhlink */ case 565: { struct fhlink_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ uarg[1] = (intptr_t) p->to; /* const char * */ *n_args = 2; break; } /* fhlinkat */ case 566: { struct fhlinkat_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ iarg[1] = p->tofd; /* int */ uarg[2] = (intptr_t) p->to; /* const char * */ *n_args = 3; break; } /* fhreadlink */ case 567: { struct fhreadlink_args *p = params; uarg[0] = (intptr_t) p->fhp; /* struct fhandle * */ uarg[1] = (intptr_t) p->buf; /* char * */ uarg[2] = p->bufsize; /* size_t */ *n_args = 3; break; } /* funlinkat */ case 568: { struct funlinkat_args *p = params; iarg[0] = p->dfd; /* int */ uarg[1] = (intptr_t) p->path; /* const char * */ iarg[2] = p->fd; /* int */ iarg[3] = p->flag; /* int */ *n_args = 4; break; } default: *n_args = 0; break; }; } static void systrace_entry_setargdesc(int sysnum, int ndx, char *desc, size_t descsz) { const char *p = NULL; switch (sysnum) { /* nosys */ case 0: break; /* sys_exit */ case 1: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* fork */ case 2: break; /* read */ case 3: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; default: break; }; break; /* write */ 
case 4: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; default: break; }; break; /* open */ case 5: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; default: break; }; break; /* close */ case 6: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* wait4 */ case 7: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland int *"; break; case 2: p = "int"; break; case 3: p = "userland struct rusage *"; break; default: break; }; break; /* link */ case 9: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* unlink */ case 10: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* chdir */ case 12: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* fchdir */ case 13: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* chmod */ case 15: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* chown */ case 16: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* break */ case 17: switch(ndx) { case 0: p = "userland char *"; break; default: break; }; break; /* getpid */ case 20: break; /* mount */ case 21: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; default: break; }; break; /* unmount */ case 22: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* setuid */ case 23: switch(ndx) { case 0: p = "uid_t"; break; default: break; }; break; /* getuid */ case 24: break; /* geteuid */ case 25: break; /* ptrace */ case 26: switch(ndx) { case 0: p = "int"; break; case 1: p = "pid_t"; break; case 2: p = "caddr_t"; break; case 3: p = "int"; break; default: break; }; break; /* recvmsg */ case 27: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct msghdr *"; break; case 2: p = "int"; break; default: break; }; break; /* sendmsg */ case 28: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct msghdr *"; break; case 2: p = "int"; break; default: break; }; break; /* recvfrom */ case 29: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; case 4: p = "userland struct sockaddr *"; break; case 5: p = "userland __socklen_t *"; break; default: break; }; break; /* accept */ case 30: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland __socklen_t *"; break; default: break; }; break; /* getpeername */ case 31: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland __socklen_t *"; break; default: break; }; break; /* getsockname */ case 32: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland __socklen_t *"; break; default: break; }; break; /* access */ case 33: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* chflags */ case 34: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "u_long"; break; default: break; }; break; /* 
fchflags */ case 35: switch(ndx) { case 0: p = "int"; break; case 1: p = "u_long"; break; default: break; }; break; /* sync */ case 36: break; /* kill */ case 37: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* getppid */ case 39: break; /* dup */ case 41: switch(ndx) { case 0: p = "u_int"; break; default: break; }; break; /* getegid */ case 43: break; /* profil */ case 44: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; case 2: p = "size_t"; break; case 3: p = "u_int"; break; default: break; }; break; /* ktrace */ case 45: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; /* getgid */ case 47: break; /* getlogin */ case 49: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "u_int"; break; default: break; }; break; /* setlogin */ case 50: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* acct */ case 51: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* sigaltstack */ case 53: switch(ndx) { case 0: p = "userland stack_t *"; break; case 1: p = "userland stack_t *"; break; default: break; }; break; /* ioctl */ case 54: switch(ndx) { case 0: p = "int"; break; case 1: p = "u_long"; break; case 2: p = "userland char *"; break; default: break; }; break; /* reboot */ case 55: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* revoke */ case 56: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* symlink */ case 57: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* readlink */ case 58: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; default: break; }; break; /* execve */ case 59: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland char **"; break; case 2: p = "userland char **"; break; default: break; }; break; /* umask */ case 60: switch(ndx) { case 0: p = "mode_t"; break; default: break; }; break; /* chroot */ case 61: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* msync */ case 65: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* vfork */ case 66: break; /* sbrk */ case 69: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* sstk */ case 70: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* munmap */ case 73: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* mprotect */ case 74: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* madvise */ case 75: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* mincore */ case 78: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland char *"; break; default: break; }; break; /* getgroups */ case 79: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland gid_t *"; break; default: break; }; break; /* setgroups */ case 80: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland gid_t *"; break; 
default: break; }; break; /* getpgrp */ case 81: break; /* setpgid */ case 82: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* setitimer */ case 83: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct itimerval *"; break; case 2: p = "userland struct itimerval *"; break; default: break; }; break; /* swapon */ case 85: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* getitimer */ case 86: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct itimerval *"; break; default: break; }; break; /* getdtablesize */ case 89: break; /* dup2 */ case 90: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "u_int"; break; default: break; }; break; /* fcntl */ case 92: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "long"; break; default: break; }; break; /* select */ case 93: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland fd_set *"; break; case 2: p = "userland fd_set *"; break; case 3: p = "userland fd_set *"; break; case 4: p = "userland struct timeval *"; break; default: break; }; break; /* fsync */ case 95: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* setpriority */ case 96: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* socket */ case 97: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* connect */ case 98: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sockaddr *"; break; case 2: p = "int"; break; default: break; }; break; /* getpriority */ case 100: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* bind */ case 104: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sockaddr *"; break; case 2: p = "int"; break; default: break; }; break; /* setsockopt */ case 105: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland const void *"; break; case 4: p = "int"; break; default: break; }; break; /* listen */ case 106: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* gettimeofday */ case 116: switch(ndx) { case 0: p = "userland struct timeval *"; break; case 1: p = "userland struct timezone *"; break; default: break; }; break; /* getrusage */ case 117: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct rusage *"; break; default: break; }; break; /* getsockopt */ case 118: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; case 4: p = "userland int *"; break; default: break; }; break; /* readv */ case 120: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "u_int"; break; default: break; }; break; /* writev */ case 121: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "u_int"; break; default: break; }; break; /* settimeofday */ case 122: switch(ndx) { case 0: p = "userland struct timeval *"; break; case 1: p = "userland struct timezone *"; break; default: break; }; break; /* fchown */ case 123: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* fchmod */ case 124: switch(ndx) { case 0: p = "int"; break; case 1: p = 
"mode_t"; break; default: break; }; break; /* setreuid */ case 126: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* setregid */ case 127: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* rename */ case 128: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* flock */ case 131: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* mkfifo */ case 132: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* sendto */ case 133: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; case 4: p = "userland const struct sockaddr *"; break; case 5: p = "int"; break; default: break; }; break; /* shutdown */ case 134: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* socketpair */ case 135: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland int *"; break; default: break; }; break; /* mkdir */ case 136: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* rmdir */ case 137: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* utimes */ case 138: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct timeval *"; break; default: break; }; break; /* adjtime */ case 140: switch(ndx) { case 0: p = "userland struct timeval *"; break; case 1: p = "userland struct timeval *"; break; default: break; }; break; /* setsid */ case 147: break; /* quotactl */ case 148: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; default: break; }; break; /* nlm_syscall */ case 154: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland char **"; break; default: break; }; break; /* nfssvc */ case 155: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; default: break; }; break; /* lgetfh */ case 160: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct fhandle *"; break; default: break; }; break; /* getfh */ case 161: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct fhandle *"; break; default: break; }; break; /* sysarch */ case 165: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; default: break; }; break; /* rtprio */ case 166: switch(ndx) { case 0: p = "int"; break; case 1: p = "pid_t"; break; case 2: p = "userland struct rtprio *"; break; default: break; }; break; /* semsys */ case 169: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; default: break; }; break; /* msgsys */ case 170: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; case 5: p = "int"; break; default: break; }; break; /* shmsys */ case 171: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; /* setfib */ case 175: 
switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* ntp_adjtime */ case 176: switch(ndx) { case 0: p = "userland struct timex *"; break; default: break; }; break; /* setgid */ case 181: switch(ndx) { case 0: p = "gid_t"; break; default: break; }; break; /* setegid */ case 182: switch(ndx) { case 0: p = "gid_t"; break; default: break; }; break; /* seteuid */ case 183: switch(ndx) { case 0: p = "uid_t"; break; default: break; }; break; /* pathconf */ case 191: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* fpathconf */ case 192: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* getrlimit */ case 194: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct rlimit *"; break; default: break; }; break; /* setrlimit */ case 195: switch(ndx) { case 0: p = "u_int"; break; case 1: p = "userland struct rlimit *"; break; default: break; }; break; /* nosys */ case 198: break; /* __sysctl */ case 202: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "u_int"; break; case 2: p = "userland void *"; break; case 3: p = "userland size_t *"; break; case 4: p = "userland const void *"; break; case 5: p = "size_t"; break; default: break; }; break; /* mlock */ case 203: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* munlock */ case 204: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* undelete */ case 205: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* futimes */ case 206: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct timeval *"; break; default: break; }; break; /* getpgid */ case 207: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* poll */ case 209: switch(ndx) { case 0: p = "userland struct pollfd *"; break; case 1: p = "u_int"; break; case 2: p = "int"; break; default: break; }; break; /* lkmnosys */ case 210: break; /* lkmnosys */ case 211: break; /* lkmnosys */ case 212: break; /* lkmnosys */ case 213: break; /* lkmnosys */ case 214: break; /* lkmnosys */ case 215: break; /* lkmnosys */ case 216: break; /* lkmnosys */ case 217: break; /* lkmnosys */ case 218: break; /* lkmnosys */ case 219: break; /* semget */ case 221: switch(ndx) { case 0: p = "key_t"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* semop */ case 222: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sembuf *"; break; case 2: p = "size_t"; break; default: break; }; break; /* msgget */ case 225: switch(ndx) { case 0: p = "key_t"; break; case 1: p = "int"; break; default: break; }; break; /* msgsnd */ case 226: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "int"; break; default: break; }; break; /* msgrcv */ case 227: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "long"; break; case 4: p = "int"; break; default: break; }; break; /* shmat */ case 228: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "int"; break; default: break; }; break; /* shmdt */ case 230: switch(ndx) { case 0: p = "userland const void *"; break; default: break; }; break; /* shmget */ case 231: switch(ndx) { case 0: p = "key_t"; break; case 1: p = 
"size_t"; break; case 2: p = "int"; break; default: break; }; break; /* clock_gettime */ case 232: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* clock_settime */ case 233: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland const struct timespec *"; break; default: break; }; break; /* clock_getres */ case 234: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* ktimer_create */ case 235: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "userland struct sigevent *"; break; case 2: p = "userland int *"; break; default: break; }; break; /* ktimer_delete */ case 236: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* ktimer_settime */ case 237: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct itimerspec *"; break; case 3: p = "userland struct itimerspec *"; break; default: break; }; break; /* ktimer_gettime */ case 238: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct itimerspec *"; break; default: break; }; break; /* ktimer_getoverrun */ case 239: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* nanosleep */ case 240: switch(ndx) { case 0: p = "userland const struct timespec *"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* ffclock_getcounter */ case 241: switch(ndx) { case 0: p = "userland ffcounter *"; break; default: break; }; break; /* ffclock_setestimate */ case 242: switch(ndx) { case 0: p = "userland struct ffclock_estimate *"; break; default: break; }; break; /* ffclock_getestimate */ case 243: switch(ndx) { case 0: p = "userland struct ffclock_estimate *"; break; default: break; }; break; /* clock_nanosleep */ case 244: switch(ndx) { case 0: p = "clockid_t"; break; case 1: p = "int"; break; case 2: p = "userland const struct timespec *"; break; case 3: p = "userland struct timespec *"; break; default: break; }; break; /* clock_getcpuclockid2 */ case 247: switch(ndx) { case 0: p = "id_t"; break; case 1: p = "int"; break; case 2: p = "userland clockid_t *"; break; default: break; }; break; /* ntp_gettime */ case 248: switch(ndx) { case 0: p = "userland struct ntptimeval *"; break; default: break; }; break; /* minherit */ case 250: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; default: break; }; break; /* rfork */ case 251: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* issetugid */ case 253: break; /* lchown */ case 254: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "int"; break; default: break; }; break; /* aio_read */ case 255: switch(ndx) { case 0: p = "userland struct aiocb *"; break; default: break; }; break; /* aio_write */ case 256: switch(ndx) { case 0: p = "userland struct aiocb *"; break; default: break; }; break; /* lio_listio */ case 257: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb *const *"; break; case 2: p = "int"; break; case 3: p = "userland struct sigevent *"; break; default: break; }; break; /* lchmod */ case 274: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "mode_t"; break; default: break; }; break; /* lutimes */ case 276: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct timeval *"; break; default: break; }; break; /* preadv 
*/ case 289: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "u_int"; break; case 3: p = "off_t"; break; default: break; }; break; /* pwritev */ case 290: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "u_int"; break; case 3: p = "off_t"; break; default: break; }; break; /* fhopen */ case 298: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "int"; break; default: break; }; break; /* modnext */ case 300: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* modstat */ case 301: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct module_stat *"; break; default: break; }; break; /* modfnext */ case 302: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* modfind */ case 303: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldload */ case 304: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldunload */ case 305: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* kldfind */ case 306: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* kldnext */ case 307: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* kldstat */ case 308: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct kld_file_stat *"; break; default: break; }; break; /* kldfirstmod */ case 309: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* getsid */ case 310: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* setresuid */ case 311: switch(ndx) { case 0: p = "uid_t"; break; case 1: p = "uid_t"; break; case 2: p = "uid_t"; break; default: break; }; break; /* setresgid */ case 312: switch(ndx) { case 0: p = "gid_t"; break; case 1: p = "gid_t"; break; case 2: p = "gid_t"; break; default: break; }; break; /* aio_return */ case 314: switch(ndx) { case 0: p = "userland struct aiocb *"; break; default: break; }; break; /* aio_suspend */ case 315: switch(ndx) { case 0: p = "userland struct aiocb *const *"; break; case 1: p = "int"; break; case 2: p = "userland const struct timespec *"; break; default: break; }; break; /* aio_cancel */ case 316: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb *"; break; default: break; }; break; /* aio_error */ case 317: switch(ndx) { case 0: p = "userland struct aiocb *"; break; default: break; }; break; /* yield */ case 321: break; /* mlockall */ case 324: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* munlockall */ case 325: break; /* __getcwd */ case 326: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; default: break; }; break; /* sched_setparam */ case 327: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland const struct sched_param *"; break; default: break; }; break; /* sched_getparam */ case 328: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland struct sched_param *"; break; default: break; }; break; /* sched_setscheduler */ case 329: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "int"; break; case 2: p = "userland const struct sched_param *"; break; default: break; }; break; /* sched_getscheduler */ case 330: switch(ndx) { case 0: p = "pid_t"; break; default: break; }; break; /* sched_yield */ case 331: break; /* sched_get_priority_max */ case 332: switch(ndx) { case 0: p = "int"; break; default: break; 
}; break; /* sched_get_priority_min */ case 333: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* sched_rr_get_interval */ case 334: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* utrace */ case 335: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; default: break; }; break; /* kldsym */ case 337: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; default: break; }; break; /* jail */ case 338: switch(ndx) { case 0: p = "userland struct jail *"; break; default: break; }; break; /* nnpfs_syscall */ case 339: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; case 4: p = "int"; break; default: break; }; break; /* sigprocmask */ case 340: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const sigset_t *"; break; case 2: p = "userland sigset_t *"; break; default: break; }; break; /* sigsuspend */ case 341: switch(ndx) { case 0: p = "userland const sigset_t *"; break; default: break; }; break; /* sigpending */ case 343: switch(ndx) { case 0: p = "userland sigset_t *"; break; default: break; }; break; /* sigtimedwait */ case 345: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland siginfo_t *"; break; case 2: p = "userland const struct timespec *"; break; default: break; }; break; /* sigwaitinfo */ case 346: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland siginfo_t *"; break; default: break; }; break; /* __acl_get_file */ case 347: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_file */ case 348: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_get_fd */ case 349: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_fd */ case 350: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_delete_file */ case 351: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_delete_fd */ case 352: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_aclcheck_file */ case 353: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_aclcheck_fd */ case 354: switch(ndx) { case 0: p = "int"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* extattrctl */ case 355: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "int"; break; case 4: p = "userland const char *"; break; default: break; }; break; /* extattr_set_file */ case 356: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; 
break; /* extattr_get_file */ case 357: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_file */ case 358: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* aio_waitcomplete */ case 359: switch(ndx) { case 0: p = "userland struct aiocb **"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* getresuid */ case 360: switch(ndx) { case 0: p = "userland uid_t *"; break; case 1: p = "userland uid_t *"; break; case 2: p = "userland uid_t *"; break; default: break; }; break; /* getresgid */ case 361: switch(ndx) { case 0: p = "userland gid_t *"; break; case 1: p = "userland gid_t *"; break; case 2: p = "userland gid_t *"; break; default: break; }; break; /* kqueue */ case 362: break; /* extattr_set_fd */ case 371: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_get_fd */ case 372: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_fd */ case 373: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* __setugid */ case 374: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* eaccess */ case 376: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* afs3_syscall */ case 377: switch(ndx) { case 0: p = "long"; break; case 1: p = "long"; break; case 2: p = "long"; break; case 3: p = "long"; break; case 4: p = "long"; break; case 5: p = "long"; break; case 6: p = "long"; break; default: break; }; break; /* nmount */ case 378: switch(ndx) { case 0: p = "userland struct iovec *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* __mac_get_proc */ case 384: switch(ndx) { case 0: p = "userland struct mac *"; break; default: break; }; break; /* __mac_set_proc */ case 385: switch(ndx) { case 0: p = "userland struct mac *"; break; default: break; }; break; /* __mac_get_fd */ case 386: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* __mac_get_file */ case 387: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* __mac_set_fd */ case 388: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* __mac_set_file */ case 389: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* kenv */ case 390: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland char *"; break; case 3: p = "int"; break; default: break; }; break; /* lchflags */ case 391: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "u_long"; break; default: break; }; break; /* uuidgen */ case 392: switch(ndx) { case 0: p = "userland struct uuid *"; break; case 
1: p = "int"; break; default: break; }; break; /* sendfile */ case 393: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "off_t"; break; case 3: p = "size_t"; break; case 4: p = "userland struct sf_hdtr *"; break; case 5: p = "userland off_t *"; break; case 6: p = "int"; break; default: break; }; break; /* mac_syscall */ case 394: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; default: break; }; break; /* ksem_close */ case 400: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_post */ case 401: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_wait */ case 402: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_trywait */ case 403: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* ksem_init */ case 404: switch(ndx) { case 0: p = "userland semid_t *"; break; case 1: p = "unsigned int"; break; default: break; }; break; /* ksem_open */ case 405: switch(ndx) { case 0: p = "userland semid_t *"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "mode_t"; break; case 4: p = "unsigned int"; break; default: break; }; break; /* ksem_unlink */ case 406: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* ksem_getvalue */ case 407: switch(ndx) { case 0: p = "semid_t"; break; case 1: p = "userland int *"; break; default: break; }; break; /* ksem_destroy */ case 408: switch(ndx) { case 0: p = "semid_t"; break; default: break; }; break; /* __mac_get_pid */ case 409: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* __mac_get_link */ case 410: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* __mac_set_link */ case 411: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct mac *"; break; default: break; }; break; /* extattr_set_link */ case 412: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_get_link */ case 413: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; case 3: p = "userland void *"; break; case 4: p = "size_t"; break; default: break; }; break; /* extattr_delete_link */ case 414: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* __mac_execve */ case 415: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland char **"; break; case 2: p = "userland char **"; break; case 3: p = "userland struct mac *"; break; default: break; }; break; /* sigaction */ case 416: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sigaction *"; break; case 2: p = "userland struct sigaction *"; break; default: break; }; break; /* sigreturn */ case 417: switch(ndx) { case 0: p = "userland const struct __ucontext *"; break; default: break; }; break; /* getcontext */ case 421: switch(ndx) { case 0: p = "userland struct __ucontext *"; break; default: break; }; break; /* setcontext */ case 422: switch(ndx) { case 0: p = "userland const struct __ucontext *"; 
break; default: break; }; break; /* swapcontext */ case 423: switch(ndx) { case 0: p = "userland struct __ucontext *"; break; case 1: p = "userland const struct __ucontext *"; break; default: break; }; break; /* swapoff */ case 424: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* __acl_get_link */ case 425: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_set_link */ case 426: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* __acl_delete_link */ case 427: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; default: break; }; break; /* __acl_aclcheck_link */ case 428: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "acl_type_t"; break; case 2: p = "userland struct acl *"; break; default: break; }; break; /* sigwait */ case 429: switch(ndx) { case 0: p = "userland const sigset_t *"; break; case 1: p = "userland int *"; break; default: break; }; break; /* thr_create */ case 430: switch(ndx) { case 0: p = "userland ucontext_t *"; break; case 1: p = "userland long *"; break; case 2: p = "int"; break; default: break; }; break; /* thr_exit */ case 431: switch(ndx) { case 0: p = "userland long *"; break; default: break; }; break; /* thr_self */ case 432: switch(ndx) { case 0: p = "userland long *"; break; default: break; }; break; /* thr_kill */ case 433: switch(ndx) { case 0: p = "long"; break; case 1: p = "int"; break; default: break; }; break; /* jail_attach */ case 436: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* extattr_list_fd */ case 437: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* extattr_list_file */ case 438: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* extattr_list_link */ case 439: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* ksem_timedwait */ case 441: switch(ndx) { case 0: p = "semid_t"; break; case 1: p = "userland const struct timespec *"; break; default: break; }; break; /* thr_suspend */ case 442: switch(ndx) { case 0: p = "userland const struct timespec *"; break; default: break; }; break; /* thr_wake */ case 443: switch(ndx) { case 0: p = "long"; break; default: break; }; break; /* kldunloadf */ case 444: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* audit */ case 445: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "u_int"; break; default: break; }; break; /* auditon */ case 446: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "u_int"; break; default: break; }; break; /* getauid */ case 447: switch(ndx) { case 0: p = "userland uid_t *"; break; default: break; }; break; /* setauid */ case 448: switch(ndx) { case 0: p = "userland uid_t *"; break; default: break; }; break; /* getaudit */ case 449: switch(ndx) { case 0: p = "userland struct auditinfo *"; break; default: break; }; break; /* setaudit */ case 450: switch(ndx) { case 0: p 
= "userland struct auditinfo *"; break; default: break; }; break; /* getaudit_addr */ case 451: switch(ndx) { case 0: p = "userland struct auditinfo_addr *"; break; case 1: p = "u_int"; break; default: break; }; break; /* setaudit_addr */ case 452: switch(ndx) { case 0: p = "userland struct auditinfo_addr *"; break; case 1: p = "u_int"; break; default: break; }; break; /* auditctl */ case 453: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* _umtx_op */ case 454: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "int"; break; case 2: p = "u_long"; break; case 3: p = "userland void *"; break; case 4: p = "userland void *"; break; default: break; }; break; /* thr_new */ case 455: switch(ndx) { case 0: p = "userland struct thr_param *"; break; case 1: p = "int"; break; default: break; }; break; /* sigqueue */ case 456: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "int"; break; case 2: p = "userland void *"; break; default: break; }; break; /* kmq_open */ case 457: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; case 3: p = "userland const struct mq_attr *"; break; default: break; }; break; /* kmq_setattr */ case 458: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct mq_attr *"; break; case 2: p = "userland struct mq_attr *"; break; default: break; }; break; /* kmq_timedreceive */ case 459: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; case 3: p = "userland unsigned *"; break; case 4: p = "userland const struct timespec *"; break; default: break; }; break; /* kmq_timedsend */ case 460: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "size_t"; break; case 3: p = "unsigned"; break; case 4: p = "userland const struct timespec *"; break; default: break; }; break; /* kmq_notify */ case 461: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const struct sigevent *"; break; default: break; }; break; /* kmq_unlink */ case 462: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* abort2 */ case 463: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland void **"; break; default: break; }; break; /* thr_set_name */ case 464: switch(ndx) { case 0: p = "long"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* aio_fsync */ case 465: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct aiocb *"; break; default: break; }; break; /* rtprio_thread */ case 466: switch(ndx) { case 0: p = "int"; break; case 1: p = "lwpid_t"; break; case 2: p = "userland struct rtprio *"; break; default: break; }; break; /* sctp_peeloff */ case 471: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; default: break; }; break; /* sctp_generic_sendmsg */ case 472: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "__socklen_t"; break; case 5: p = "userland struct sctp_sndrcvinfo *"; break; case 6: p = "int"; break; default: break; }; break; /* sctp_generic_sendmsg_iov */ case 473: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "__socklen_t"; break; case 5: p = "userland struct 
sctp_sndrcvinfo *"; break; case 6: p = "int"; break; default: break; }; break; /* sctp_generic_recvmsg */ case 474: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct iovec *"; break; case 2: p = "int"; break; case 3: p = "userland struct sockaddr *"; break; case 4: p = "userland __socklen_t *"; break; case 5: p = "userland struct sctp_sndrcvinfo *"; break; case 6: p = "userland int *"; break; default: break; }; break; /* pread */ case 475: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland void *"; break; case 2: p = "size_t"; break; case 3: p = "off_t"; break; default: break; }; break; /* pwrite */ case 476: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const void *"; break; case 2: p = "size_t"; break; case 3: p = "off_t"; break; default: break; }; break; /* mmap */ case 477: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "int"; break; case 3: p = "int"; break; case 4: p = "int"; break; case 5: p = "off_t"; break; default: break; }; break; /* lseek */ case 478: switch(ndx) { case 0: p = "int"; break; case 1: p = "off_t"; break; case 2: p = "int"; break; default: break; }; break; /* truncate */ case 479: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "off_t"; break; default: break; }; break; /* ftruncate */ case 480: switch(ndx) { case 0: p = "int"; break; case 1: p = "off_t"; break; default: break; }; break; /* thr_kill2 */ case 481: switch(ndx) { case 0: p = "pid_t"; break; case 1: p = "long"; break; case 2: p = "int"; break; default: break; }; break; /* shm_open */ case 482: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "mode_t"; break; default: break; }; break; /* shm_unlink */ case 483: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* cpuset */ case 484: switch(ndx) { case 0: p = "userland cpusetid_t *"; break; default: break; }; break; /* cpuset_setid */ case 485: switch(ndx) { case 0: p = "cpuwhich_t"; break; case 1: p = "id_t"; break; case 2: p = "cpusetid_t"; break; default: break; }; break; /* cpuset_getid */ case 486: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "id_t"; break; case 3: p = "userland cpusetid_t *"; break; default: break; }; break; /* cpuset_getaffinity */ case 487: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "id_t"; break; case 3: p = "size_t"; break; case 4: p = "userland cpuset_t *"; break; default: break; }; break; /* cpuset_setaffinity */ case 488: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "id_t"; break; case 3: p = "size_t"; break; case 4: p = "userland const cpuset_t *"; break; default: break; }; break; /* faccessat */ case 489: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; /* fchmodat */ case 490: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; case 3: p = "int"; break; default: break; }; break; /* fchownat */ case 491: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "uid_t"; break; case 3: p = "gid_t"; break; case 4: p = "int"; break; default: break; }; break; /* fexecve */ case 492: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char **"; break; case 2: p = "userland char 
**"; break; default: break; }; break; /* futimesat */ case 494: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct timeval *"; break; default: break; }; break; /* linkat */ case 495: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland const char *"; break; case 4: p = "int"; break; default: break; }; break; /* mkdirat */ case 496: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; default: break; }; break; /* mkfifoat */ case 497: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; default: break; }; break; /* openat */ case 499: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "mode_t"; break; default: break; }; break; /* readlinkat */ case 500: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland char *"; break; case 3: p = "size_t"; break; default: break; }; break; /* renameat */ case 501: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "userland const char *"; break; default: break; }; break; /* symlinkat */ case 502: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* unlinkat */ case 503: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; default: break; }; break; /* posix_openpt */ case 504: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* gssd_syscall */ case 505: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* jail_get */ case 506: switch(ndx) { case 0: p = "userland struct iovec *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* jail_set */ case 507: switch(ndx) { case 0: p = "userland struct iovec *"; break; case 1: p = "unsigned int"; break; case 2: p = "int"; break; default: break; }; break; /* jail_remove */ case 508: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* closefrom */ case 509: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* __semctl */ case 510: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "int"; break; case 3: p = "userland union semun *"; break; default: break; }; break; /* msgctl */ case 511: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland struct msqid_ds *"; break; default: break; }; break; /* shmctl */ case 512: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland struct shmid_ds *"; break; default: break; }; break; /* lpathconf */ case 513: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "int"; break; default: break; }; break; /* __cap_rights_get */ case 515: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland cap_rights_t *"; break; default: break; }; break; /* cap_enter */ case 516: break; /* cap_getmode */ case 517: switch(ndx) { case 0: p = "userland u_int *"; break; default: break; }; break; /* pdfork */ case 518: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "int"; break; default: break; }; break; /* pdkill 
*/ case 519: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; default: break; }; break; /* pdgetpid */ case 520: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland pid_t *"; break; default: break; }; break; /* pselect */ case 522: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland fd_set *"; break; case 2: p = "userland fd_set *"; break; case 3: p = "userland fd_set *"; break; case 4: p = "userland const struct timespec *"; break; case 5: p = "userland const sigset_t *"; break; default: break; }; break; /* getloginclass */ case 523: switch(ndx) { case 0: p = "userland char *"; break; case 1: p = "size_t"; break; default: break; }; break; /* setloginclass */ case 524: switch(ndx) { case 0: p = "userland const char *"; break; default: break; }; break; /* rctl_get_racct */ case 525: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_get_rules */ case 526: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_get_limits */ case 527: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_add_rule */ case 528: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* rctl_remove_rule */ case 529: switch(ndx) { case 0: p = "userland const void *"; break; case 1: p = "size_t"; break; case 2: p = "userland void *"; break; case 3: p = "size_t"; break; default: break; }; break; /* posix_fallocate */ case 530: switch(ndx) { case 0: p = "int"; break; case 1: p = "off_t"; break; case 2: p = "off_t"; break; default: break; }; break; /* posix_fadvise */ case 531: switch(ndx) { case 0: p = "int"; break; case 1: p = "off_t"; break; case 2: p = "off_t"; break; case 3: p = "int"; break; default: break; }; break; /* wait6 */ case 532: switch(ndx) { case 0: p = "idtype_t"; break; case 1: p = "id_t"; break; case 2: p = "userland int *"; break; case 3: p = "int"; break; case 4: p = "userland struct __wrusage *"; break; case 5: p = "userland siginfo_t *"; break; default: break; }; break; /* cap_rights_limit */ case 533: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland cap_rights_t *"; break; default: break; }; break; /* cap_ioctls_limit */ case 534: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const u_long *"; break; case 2: p = "size_t"; break; default: break; }; break; /* cap_ioctls_get */ case 535: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland u_long *"; break; case 2: p = "size_t"; break; default: break; }; break; /* cap_fcntls_limit */ case 536: switch(ndx) { case 0: p = "int"; break; case 1: p = "uint32_t"; break; default: break; }; break; /* cap_fcntls_get */ case 537: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland uint32_t *"; break; default: break; }; break; /* bindat */ case 538: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct sockaddr *"; break; case 3: p = "int"; break; default: break; }; break; /* connectat */ case 539: switch(ndx) { case 0: p = "int"; break; case 1: p = "int"; break; case 2: p = "userland const struct sockaddr *"; 
break; case 3: p = "int"; break; default: break; }; break; /* chflagsat */ case 540: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "u_long"; break; case 3: p = "int"; break; default: break; }; break; /* accept4 */ case 541: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct sockaddr *"; break; case 2: p = "userland __socklen_t *"; break; case 3: p = "int"; break; default: break; }; break; /* pipe2 */ case 542: switch(ndx) { case 0: p = "userland int *"; break; case 1: p = "int"; break; default: break; }; break; /* aio_mlock */ case 543: switch(ndx) { case 0: p = "userland struct aiocb *"; break; default: break; }; break; /* procctl */ case 544: switch(ndx) { case 0: p = "idtype_t"; break; case 1: p = "id_t"; break; case 2: p = "int"; break; case 3: p = "userland void *"; break; default: break; }; break; /* ppoll */ case 545: switch(ndx) { case 0: p = "userland struct pollfd *"; break; case 1: p = "u_int"; break; case 2: p = "userland const struct timespec *"; break; case 3: p = "userland const sigset_t *"; break; default: break; }; break; /* futimens */ case 546: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct timespec *"; break; default: break; }; break; /* utimensat */ case 547: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct timespec *"; break; case 3: p = "int"; break; default: break; }; break; /* fdatasync */ case 550: switch(ndx) { case 0: p = "int"; break; default: break; }; break; /* fstat */ case 551: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct stat *"; break; default: break; }; break; /* fstatat */ case 552: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "userland struct stat *"; break; case 3: p = "int"; break; default: break; }; break; /* fhstat */ case 553: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "userland struct stat *"; break; default: break; }; break; /* getdirentries */ case 554: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; case 3: p = "userland off_t *"; break; default: break; }; break; /* statfs */ case 555: switch(ndx) { case 0: p = "userland const char *"; break; case 1: p = "userland struct statfs *"; break; default: break; }; break; /* fstatfs */ case 556: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct statfs *"; break; default: break; }; break; /* getfsstat */ case 557: switch(ndx) { case 0: p = "userland struct statfs *"; break; case 1: p = "long"; break; case 2: p = "int"; break; default: break; }; break; /* fhstatfs */ case 558: switch(ndx) { case 0: p = "userland const struct fhandle *"; break; case 1: p = "userland struct statfs *"; break; default: break; }; break; /* mknodat */ case 559: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "mode_t"; break; case 3: p = "dev_t"; break; default: break; }; break; /* kevent */ case 560: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland struct kevent *"; break; case 2: p = "int"; break; case 3: p = "userland struct kevent *"; break; case 4: p = "int"; break; case 5: p = "userland const struct timespec *"; break; default: break; }; break; /* cpuset_getdomain */ case 561: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "id_t"; break; case 3: p = "size_t"; break; case 4: p = 
"userland domainset_t *"; break; case 5: p = "userland int *"; break; default: break; }; break; /* cpuset_setdomain */ case 562: switch(ndx) { case 0: p = "cpulevel_t"; break; case 1: p = "cpuwhich_t"; break; case 2: p = "id_t"; break; case 3: p = "size_t"; break; case 4: p = "userland domainset_t *"; break; case 5: p = "int"; break; default: break; }; break; /* getrandom */ case 563: switch(ndx) { case 0: p = "userland void *"; break; case 1: p = "size_t"; break; case 2: p = "unsigned int"; break; default: break; }; break; /* getfhat */ case 564: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland char *"; break; case 2: p = "userland struct fhandle *"; break; case 3: p = "int"; break; default: break; }; break; /* fhlink */ case 565: switch(ndx) { case 0: p = "userland struct fhandle *"; break; case 1: p = "userland const char *"; break; default: break; }; break; /* fhlinkat */ case 566: switch(ndx) { case 0: p = "userland struct fhandle *"; break; case 1: p = "int"; break; case 2: p = "userland const char *"; break; default: break; }; break; /* fhreadlink */ case 567: switch(ndx) { case 0: p = "userland struct fhandle *"; break; case 1: p = "userland char *"; break; case 2: p = "size_t"; break; default: break; }; break; /* funlinkat */ case 568: switch(ndx) { case 0: p = "int"; break; case 1: p = "userland const char *"; break; case 2: p = "int"; break; case 3: p = "int"; break; default: break; }; break; default: break; }; if (p != NULL) strlcpy(desc, p, descsz); } static void systrace_return_setargdesc(int sysnum, int ndx, char *desc, size_t descsz) { const char *p = NULL; switch (sysnum) { /* nosys */ case 0: /* sys_exit */ case 1: if (ndx == 0 || ndx == 1) p = "void"; break; /* fork */ case 2: /* read */ case 3: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* write */ case 4: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* open */ case 5: if (ndx == 0 || ndx == 1) p = "int"; break; /* close */ case 6: if (ndx == 0 || ndx == 1) p = "int"; break; /* wait4 */ case 7: if (ndx == 0 || ndx == 1) p = "int"; break; /* link */ case 9: if (ndx == 0 || ndx == 1) p = "int"; break; /* unlink */ case 10: if (ndx == 0 || ndx == 1) p = "int"; break; /* chdir */ case 12: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchdir */ case 13: if (ndx == 0 || ndx == 1) p = "int"; break; /* chmod */ case 15: if (ndx == 0 || ndx == 1) p = "int"; break; /* chown */ case 16: if (ndx == 0 || ndx == 1) p = "int"; break; /* break */ case 17: if (ndx == 0 || ndx == 1) p = "void *"; break; /* getpid */ case 20: /* mount */ case 21: if (ndx == 0 || ndx == 1) p = "int"; break; /* unmount */ case 22: if (ndx == 0 || ndx == 1) p = "int"; break; /* setuid */ case 23: if (ndx == 0 || ndx == 1) p = "int"; break; /* getuid */ case 24: /* geteuid */ case 25: /* ptrace */ case 26: if (ndx == 0 || ndx == 1) p = "int"; break; /* recvmsg */ case 27: if (ndx == 0 || ndx == 1) p = "int"; break; /* sendmsg */ case 28: if (ndx == 0 || ndx == 1) p = "int"; break; /* recvfrom */ case 29: if (ndx == 0 || ndx == 1) p = "int"; break; /* accept */ case 30: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpeername */ case 31: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsockname */ case 32: if (ndx == 0 || ndx == 1) p = "int"; break; /* access */ case 33: if (ndx == 0 || ndx == 1) p = "int"; break; /* chflags */ case 34: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchflags */ case 35: if (ndx == 0 || ndx == 1) p = "int"; break; /* sync */ case 36: /* kill */ case 37: if (ndx == 0 || ndx == 1) p = "int"; break; /* 
getppid */ case 39: /* dup */ case 41: if (ndx == 0 || ndx == 1) p = "int"; break; /* getegid */ case 43: /* profil */ case 44: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktrace */ case 45: if (ndx == 0 || ndx == 1) p = "int"; break; /* getgid */ case 47: /* getlogin */ case 49: if (ndx == 0 || ndx == 1) p = "int"; break; /* setlogin */ case 50: if (ndx == 0 || ndx == 1) p = "int"; break; /* acct */ case 51: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigaltstack */ case 53: if (ndx == 0 || ndx == 1) p = "int"; break; /* ioctl */ case 54: if (ndx == 0 || ndx == 1) p = "int"; break; /* reboot */ case 55: if (ndx == 0 || ndx == 1) p = "int"; break; /* revoke */ case 56: if (ndx == 0 || ndx == 1) p = "int"; break; /* symlink */ case 57: if (ndx == 0 || ndx == 1) p = "int"; break; /* readlink */ case 58: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* execve */ case 59: if (ndx == 0 || ndx == 1) p = "int"; break; /* umask */ case 60: if (ndx == 0 || ndx == 1) p = "int"; break; /* chroot */ case 61: if (ndx == 0 || ndx == 1) p = "int"; break; /* msync */ case 65: if (ndx == 0 || ndx == 1) p = "int"; break; /* vfork */ case 66: /* sbrk */ case 69: if (ndx == 0 || ndx == 1) p = "int"; break; /* sstk */ case 70: if (ndx == 0 || ndx == 1) p = "int"; break; /* munmap */ case 73: if (ndx == 0 || ndx == 1) p = "int"; break; /* mprotect */ case 74: if (ndx == 0 || ndx == 1) p = "int"; break; /* madvise */ case 75: if (ndx == 0 || ndx == 1) p = "int"; break; /* mincore */ case 78: if (ndx == 0 || ndx == 1) p = "int"; break; /* getgroups */ case 79: if (ndx == 0 || ndx == 1) p = "int"; break; /* setgroups */ case 80: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpgrp */ case 81: /* setpgid */ case 82: if (ndx == 0 || ndx == 1) p = "int"; break; /* setitimer */ case 83: if (ndx == 0 || ndx == 1) p = "int"; break; /* swapon */ case 85: if (ndx == 0 || ndx == 1) p = "int"; break; /* getitimer */ case 86: if (ndx == 0 || ndx == 1) p = "int"; break; /* getdtablesize */ case 89: /* dup2 */ case 90: if (ndx == 0 || ndx == 1) p = "int"; break; /* fcntl */ case 92: if (ndx == 0 || ndx == 1) p = "int"; break; /* select */ case 93: if (ndx == 0 || ndx == 1) p = "int"; break; /* fsync */ case 95: if (ndx == 0 || ndx == 1) p = "int"; break; /* setpriority */ case 96: if (ndx == 0 || ndx == 1) p = "int"; break; /* socket */ case 97: if (ndx == 0 || ndx == 1) p = "int"; break; /* connect */ case 98: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpriority */ case 100: if (ndx == 0 || ndx == 1) p = "int"; break; /* bind */ case 104: if (ndx == 0 || ndx == 1) p = "int"; break; /* setsockopt */ case 105: if (ndx == 0 || ndx == 1) p = "int"; break; /* listen */ case 106: if (ndx == 0 || ndx == 1) p = "int"; break; /* gettimeofday */ case 116: if (ndx == 0 || ndx == 1) p = "int"; break; /* getrusage */ case 117: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsockopt */ case 118: if (ndx == 0 || ndx == 1) p = "int"; break; /* readv */ case 120: if (ndx == 0 || ndx == 1) p = "int"; break; /* writev */ case 121: if (ndx == 0 || ndx == 1) p = "int"; break; /* settimeofday */ case 122: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchown */ case 123: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchmod */ case 124: if (ndx == 0 || ndx == 1) p = "int"; break; /* setreuid */ case 126: if (ndx == 0 || ndx == 1) p = "int"; break; /* setregid */ case 127: if (ndx == 0 || ndx == 1) p = "int"; break; /* rename */ case 128: if (ndx == 0 || ndx == 1) p = "int"; break; /* flock */ case 131: if (ndx == 0 || ndx == 1) p 
= "int"; break; /* mkfifo */ case 132: if (ndx == 0 || ndx == 1) p = "int"; break; /* sendto */ case 133: if (ndx == 0 || ndx == 1) p = "int"; break; /* shutdown */ case 134: if (ndx == 0 || ndx == 1) p = "int"; break; /* socketpair */ case 135: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkdir */ case 136: if (ndx == 0 || ndx == 1) p = "int"; break; /* rmdir */ case 137: if (ndx == 0 || ndx == 1) p = "int"; break; /* utimes */ case 138: if (ndx == 0 || ndx == 1) p = "int"; break; /* adjtime */ case 140: if (ndx == 0 || ndx == 1) p = "int"; break; /* setsid */ case 147: /* quotactl */ case 148: if (ndx == 0 || ndx == 1) p = "int"; break; /* nlm_syscall */ case 154: if (ndx == 0 || ndx == 1) p = "int"; break; /* nfssvc */ case 155: if (ndx == 0 || ndx == 1) p = "int"; break; /* lgetfh */ case 160: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfh */ case 161: if (ndx == 0 || ndx == 1) p = "int"; break; /* sysarch */ case 165: if (ndx == 0 || ndx == 1) p = "int"; break; /* rtprio */ case 166: if (ndx == 0 || ndx == 1) p = "int"; break; /* semsys */ case 169: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgsys */ case 170: if (ndx == 0 || ndx == 1) p = "int"; break; /* shmsys */ case 171: if (ndx == 0 || ndx == 1) p = "int"; break; /* setfib */ case 175: if (ndx == 0 || ndx == 1) p = "int"; break; /* ntp_adjtime */ case 176: if (ndx == 0 || ndx == 1) p = "int"; break; /* setgid */ case 181: if (ndx == 0 || ndx == 1) p = "int"; break; /* setegid */ case 182: if (ndx == 0 || ndx == 1) p = "int"; break; /* seteuid */ case 183: if (ndx == 0 || ndx == 1) p = "int"; break; /* pathconf */ case 191: if (ndx == 0 || ndx == 1) p = "int"; break; /* fpathconf */ case 192: if (ndx == 0 || ndx == 1) p = "int"; break; /* getrlimit */ case 194: if (ndx == 0 || ndx == 1) p = "int"; break; /* setrlimit */ case 195: if (ndx == 0 || ndx == 1) p = "int"; break; /* nosys */ case 198: /* __sysctl */ case 202: if (ndx == 0 || ndx == 1) p = "int"; break; /* mlock */ case 203: if (ndx == 0 || ndx == 1) p = "int"; break; /* munlock */ case 204: if (ndx == 0 || ndx == 1) p = "int"; break; /* undelete */ case 205: if (ndx == 0 || ndx == 1) p = "int"; break; /* futimes */ case 206: if (ndx == 0 || ndx == 1) p = "int"; break; /* getpgid */ case 207: if (ndx == 0 || ndx == 1) p = "int"; break; /* poll */ case 209: if (ndx == 0 || ndx == 1) p = "int"; break; /* lkmnosys */ case 210: /* lkmnosys */ case 211: /* lkmnosys */ case 212: /* lkmnosys */ case 213: /* lkmnosys */ case 214: /* lkmnosys */ case 215: /* lkmnosys */ case 216: /* lkmnosys */ case 217: /* lkmnosys */ case 218: /* lkmnosys */ case 219: /* semget */ case 221: if (ndx == 0 || ndx == 1) p = "int"; break; /* semop */ case 222: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgget */ case 225: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgsnd */ case 226: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgrcv */ case 227: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* shmat */ case 228: if (ndx == 0 || ndx == 1) p = "void *"; break; /* shmdt */ case 230: if (ndx == 0 || ndx == 1) p = "int"; break; /* shmget */ case 231: if (ndx == 0 || ndx == 1) p = "int"; break; /* clock_gettime */ case 232: if (ndx == 0 || ndx == 1) p = "int"; break; /* clock_settime */ case 233: if (ndx == 0 || ndx == 1) p = "int"; break; /* clock_getres */ case 234: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_create */ case 235: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_delete */ case 236: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_settime */ 
case 237: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_gettime */ case 238: if (ndx == 0 || ndx == 1) p = "int"; break; /* ktimer_getoverrun */ case 239: if (ndx == 0 || ndx == 1) p = "int"; break; /* nanosleep */ case 240: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_getcounter */ case 241: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_setestimate */ case 242: if (ndx == 0 || ndx == 1) p = "int"; break; /* ffclock_getestimate */ case 243: if (ndx == 0 || ndx == 1) p = "int"; break; /* clock_nanosleep */ case 244: if (ndx == 0 || ndx == 1) p = "int"; break; /* clock_getcpuclockid2 */ case 247: if (ndx == 0 || ndx == 1) p = "int"; break; /* ntp_gettime */ case 248: if (ndx == 0 || ndx == 1) p = "int"; break; /* minherit */ case 250: if (ndx == 0 || ndx == 1) p = "int"; break; /* rfork */ case 251: if (ndx == 0 || ndx == 1) p = "int"; break; /* issetugid */ case 253: /* lchown */ case 254: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_read */ case 255: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_write */ case 256: if (ndx == 0 || ndx == 1) p = "int"; break; /* lio_listio */ case 257: if (ndx == 0 || ndx == 1) p = "int"; break; /* lchmod */ case 274: if (ndx == 0 || ndx == 1) p = "int"; break; /* lutimes */ case 276: if (ndx == 0 || ndx == 1) p = "int"; break; /* preadv */ case 289: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* pwritev */ case 290: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* fhopen */ case 298: if (ndx == 0 || ndx == 1) p = "int"; break; /* modnext */ case 300: if (ndx == 0 || ndx == 1) p = "int"; break; /* modstat */ case 301: if (ndx == 0 || ndx == 1) p = "int"; break; /* modfnext */ case 302: if (ndx == 0 || ndx == 1) p = "int"; break; /* modfind */ case 303: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldload */ case 304: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldunload */ case 305: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldfind */ case 306: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldnext */ case 307: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldstat */ case 308: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldfirstmod */ case 309: if (ndx == 0 || ndx == 1) p = "int"; break; /* getsid */ case 310: if (ndx == 0 || ndx == 1) p = "int"; break; /* setresuid */ case 311: if (ndx == 0 || ndx == 1) p = "int"; break; /* setresgid */ case 312: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_return */ case 314: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* aio_suspend */ case 315: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_cancel */ case 316: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_error */ case 317: if (ndx == 0 || ndx == 1) p = "int"; break; /* yield */ case 321: /* mlockall */ case 324: if (ndx == 0 || ndx == 1) p = "int"; break; /* munlockall */ case 325: /* __getcwd */ case 326: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_setparam */ case 327: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_getparam */ case 328: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_setscheduler */ case 329: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_getscheduler */ case 330: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_yield */ case 331: /* sched_get_priority_max */ case 332: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_get_priority_min */ case 333: if (ndx == 0 || ndx == 1) p = "int"; break; /* sched_rr_get_interval */ case 334: if (ndx == 0 || ndx == 1) p = "int"; break; /* utrace */ case 335: if (ndx == 0 || ndx == 1) p = "int"; break; /* 
kldsym */ case 337: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail */ case 338: if (ndx == 0 || ndx == 1) p = "int"; break; /* nnpfs_syscall */ case 339: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigprocmask */ case 340: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigsuspend */ case 341: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigpending */ case 343: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigtimedwait */ case 345: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigwaitinfo */ case 346: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_file */ case 347: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_file */ case 348: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_fd */ case 349: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_fd */ case 350: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_file */ case 351: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_fd */ case 352: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_file */ case 353: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_fd */ case 354: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattrctl */ case 355: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_set_file */ case 356: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_file */ case 357: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_file */ case 358: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_waitcomplete */ case 359: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* getresuid */ case 360: if (ndx == 0 || ndx == 1) p = "int"; break; /* getresgid */ case 361: if (ndx == 0 || ndx == 1) p = "int"; break; /* kqueue */ case 362: /* extattr_set_fd */ case 371: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_fd */ case 372: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_fd */ case 373: if (ndx == 0 || ndx == 1) p = "int"; break; /* __setugid */ case 374: if (ndx == 0 || ndx == 1) p = "int"; break; /* eaccess */ case 376: if (ndx == 0 || ndx == 1) p = "int"; break; /* afs3_syscall */ case 377: if (ndx == 0 || ndx == 1) p = "int"; break; /* nmount */ case 378: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_get_proc */ case 384: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_set_proc */ case 385: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_get_fd */ case 386: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_get_file */ case 387: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_set_fd */ case 388: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_set_file */ case 389: if (ndx == 0 || ndx == 1) p = "int"; break; /* kenv */ case 390: if (ndx == 0 || ndx == 1) p = "int"; break; /* lchflags */ case 391: if (ndx == 0 || ndx == 1) p = "int"; break; /* uuidgen */ case 392: if (ndx == 0 || ndx == 1) p = "int"; break; /* sendfile */ case 393: if (ndx == 0 || ndx == 1) p = "int"; break; /* mac_syscall */ case 394: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_close */ case 400: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_post */ case 401: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_wait */ case 402: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_trywait */ case 403: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_init */ case 404: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_open */ case 405: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_unlink */ case 406: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_getvalue */ case 
407: if (ndx == 0 || ndx == 1) p = "int"; break; /* ksem_destroy */ case 408: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_get_pid */ case 409: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_get_link */ case 410: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_set_link */ case 411: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_set_link */ case 412: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_get_link */ case 413: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_delete_link */ case 414: if (ndx == 0 || ndx == 1) p = "int"; break; /* __mac_execve */ case 415: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigaction */ case 416: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigreturn */ case 417: if (ndx == 0 || ndx == 1) p = "int"; break; /* getcontext */ case 421: if (ndx == 0 || ndx == 1) p = "int"; break; /* setcontext */ case 422: if (ndx == 0 || ndx == 1) p = "int"; break; /* swapcontext */ case 423: if (ndx == 0 || ndx == 1) p = "int"; break; /* swapoff */ case 424: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_get_link */ case 425: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_set_link */ case 426: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_delete_link */ case 427: if (ndx == 0 || ndx == 1) p = "int"; break; /* __acl_aclcheck_link */ case 428: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigwait */ case 429: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_create */ case 430: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_exit */ case 431: if (ndx == 0 || ndx == 1) p = "void"; break; /* thr_self */ case 432: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_kill */ case 433: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_attach */ case 436: if (ndx == 0 || ndx == 1) p = "int"; break; /* extattr_list_fd */ case 437: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_list_file */ case 438: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* extattr_list_link */ case 439: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* ksem_timedwait */ case 441: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_suspend */ case 442: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_wake */ case 443: if (ndx == 0 || ndx == 1) p = "int"; break; /* kldunloadf */ case 444: if (ndx == 0 || ndx == 1) p = "int"; break; /* audit */ case 445: if (ndx == 0 || ndx == 1) p = "int"; break; /* auditon */ case 446: if (ndx == 0 || ndx == 1) p = "int"; break; /* getauid */ case 447: if (ndx == 0 || ndx == 1) p = "int"; break; /* setauid */ case 448: if (ndx == 0 || ndx == 1) p = "int"; break; /* getaudit */ case 449: if (ndx == 0 || ndx == 1) p = "int"; break; /* setaudit */ case 450: if (ndx == 0 || ndx == 1) p = "int"; break; /* getaudit_addr */ case 451: if (ndx == 0 || ndx == 1) p = "int"; break; /* setaudit_addr */ case 452: if (ndx == 0 || ndx == 1) p = "int"; break; /* auditctl */ case 453: if (ndx == 0 || ndx == 1) p = "int"; break; /* _umtx_op */ case 454: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_new */ case 455: if (ndx == 0 || ndx == 1) p = "int"; break; /* sigqueue */ case 456: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_open */ case 457: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_setattr */ case 458: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_timedreceive */ case 459: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_timedsend */ case 460: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_notify */ case 461: if (ndx == 0 || ndx == 1) p = "int"; break; /* kmq_unlink */ 
case 462: if (ndx == 0 || ndx == 1) p = "int"; break; /* abort2 */ case 463: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_set_name */ case 464: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_fsync */ case 465: if (ndx == 0 || ndx == 1) p = "int"; break; /* rtprio_thread */ case 466: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_peeloff */ case 471: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_sendmsg */ case 472: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_sendmsg_iov */ case 473: if (ndx == 0 || ndx == 1) p = "int"; break; /* sctp_generic_recvmsg */ case 474: if (ndx == 0 || ndx == 1) p = "int"; break; /* pread */ case 475: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* pwrite */ case 476: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* mmap */ case 477: if (ndx == 0 || ndx == 1) p = "void *"; break; /* lseek */ case 478: if (ndx == 0 || ndx == 1) p = "off_t"; break; /* truncate */ case 479: if (ndx == 0 || ndx == 1) p = "int"; break; /* ftruncate */ case 480: if (ndx == 0 || ndx == 1) p = "int"; break; /* thr_kill2 */ case 481: if (ndx == 0 || ndx == 1) p = "int"; break; /* shm_open */ case 482: if (ndx == 0 || ndx == 1) p = "int"; break; /* shm_unlink */ case 483: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset */ case 484: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_setid */ case 485: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_getid */ case 486: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_getaffinity */ case 487: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_setaffinity */ case 488: if (ndx == 0 || ndx == 1) p = "int"; break; /* faccessat */ case 489: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchmodat */ case 490: if (ndx == 0 || ndx == 1) p = "int"; break; /* fchownat */ case 491: if (ndx == 0 || ndx == 1) p = "int"; break; /* fexecve */ case 492: if (ndx == 0 || ndx == 1) p = "int"; break; /* futimesat */ case 494: if (ndx == 0 || ndx == 1) p = "int"; break; /* linkat */ case 495: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkdirat */ case 496: if (ndx == 0 || ndx == 1) p = "int"; break; /* mkfifoat */ case 497: if (ndx == 0 || ndx == 1) p = "int"; break; /* openat */ case 499: if (ndx == 0 || ndx == 1) p = "int"; break; /* readlinkat */ case 500: if (ndx == 0 || ndx == 1) - p = "int"; + p = "ssize_t"; break; /* renameat */ case 501: if (ndx == 0 || ndx == 1) p = "int"; break; /* symlinkat */ case 502: if (ndx == 0 || ndx == 1) p = "int"; break; /* unlinkat */ case 503: if (ndx == 0 || ndx == 1) p = "int"; break; /* posix_openpt */ case 504: if (ndx == 0 || ndx == 1) p = "int"; break; /* gssd_syscall */ case 505: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_get */ case 506: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_set */ case 507: if (ndx == 0 || ndx == 1) p = "int"; break; /* jail_remove */ case 508: if (ndx == 0 || ndx == 1) p = "int"; break; /* closefrom */ case 509: if (ndx == 0 || ndx == 1) p = "int"; break; /* __semctl */ case 510: if (ndx == 0 || ndx == 1) p = "int"; break; /* msgctl */ case 511: if (ndx == 0 || ndx == 1) p = "int"; break; /* shmctl */ case 512: if (ndx == 0 || ndx == 1) p = "int"; break; /* lpathconf */ case 513: if (ndx == 0 || ndx == 1) p = "int"; break; /* __cap_rights_get */ case 515: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_enter */ case 516: /* cap_getmode */ case 517: if (ndx == 0 || ndx == 1) p = "int"; break; /* pdfork */ case 518: if (ndx == 0 || ndx == 1) p = "int"; break; /* pdkill */ case 519: if (ndx == 0 || ndx 
== 1) p = "int"; break; /* pdgetpid */ case 520: if (ndx == 0 || ndx == 1) p = "int"; break; /* pselect */ case 522: if (ndx == 0 || ndx == 1) p = "int"; break; /* getloginclass */ case 523: if (ndx == 0 || ndx == 1) p = "int"; break; /* setloginclass */ case 524: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_racct */ case 525: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_rules */ case 526: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_get_limits */ case 527: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_add_rule */ case 528: if (ndx == 0 || ndx == 1) p = "int"; break; /* rctl_remove_rule */ case 529: if (ndx == 0 || ndx == 1) p = "int"; break; /* posix_fallocate */ case 530: if (ndx == 0 || ndx == 1) p = "int"; break; /* posix_fadvise */ case 531: if (ndx == 0 || ndx == 1) p = "int"; break; /* wait6 */ case 532: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_rights_limit */ case 533: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_ioctls_limit */ case 534: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_ioctls_get */ case 535: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* cap_fcntls_limit */ case 536: if (ndx == 0 || ndx == 1) p = "int"; break; /* cap_fcntls_get */ case 537: if (ndx == 0 || ndx == 1) p = "int"; break; /* bindat */ case 538: if (ndx == 0 || ndx == 1) p = "int"; break; /* connectat */ case 539: if (ndx == 0 || ndx == 1) p = "int"; break; /* chflagsat */ case 540: if (ndx == 0 || ndx == 1) p = "int"; break; /* accept4 */ case 541: if (ndx == 0 || ndx == 1) p = "int"; break; /* pipe2 */ case 542: if (ndx == 0 || ndx == 1) p = "int"; break; /* aio_mlock */ case 543: if (ndx == 0 || ndx == 1) p = "int"; break; /* procctl */ case 544: if (ndx == 0 || ndx == 1) p = "int"; break; /* ppoll */ case 545: if (ndx == 0 || ndx == 1) p = "int"; break; /* futimens */ case 546: if (ndx == 0 || ndx == 1) p = "int"; break; /* utimensat */ case 547: if (ndx == 0 || ndx == 1) p = "int"; break; /* fdatasync */ case 550: if (ndx == 0 || ndx == 1) p = "int"; break; /* fstat */ case 551: if (ndx == 0 || ndx == 1) p = "int"; break; /* fstatat */ case 552: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhstat */ case 553: if (ndx == 0 || ndx == 1) p = "int"; break; /* getdirentries */ case 554: if (ndx == 0 || ndx == 1) p = "ssize_t"; break; /* statfs */ case 555: if (ndx == 0 || ndx == 1) p = "int"; break; /* fstatfs */ case 556: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfsstat */ case 557: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhstatfs */ case 558: if (ndx == 0 || ndx == 1) p = "int"; break; /* mknodat */ case 559: if (ndx == 0 || ndx == 1) p = "int"; break; /* kevent */ case 560: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_getdomain */ case 561: if (ndx == 0 || ndx == 1) p = "int"; break; /* cpuset_setdomain */ case 562: if (ndx == 0 || ndx == 1) p = "int"; break; /* getrandom */ case 563: if (ndx == 0 || ndx == 1) p = "int"; break; /* getfhat */ case 564: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhlink */ case 565: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhlinkat */ case 566: if (ndx == 0 || ndx == 1) p = "int"; break; /* fhreadlink */ case 567: if (ndx == 0 || ndx == 1) p = "int"; break; /* funlinkat */ case 568: if (ndx == 0 || ndx == 1) p = "int"; break; default: break; }; if (p != NULL) strlcpy(desc, p, descsz); } Index: user/ngie/bug-237403/sys/kern/uipc_mbuf.c =================================================================== --- user/ngie/bug-237403/sys/kern/uipc_mbuf.c (revision 346925) +++ 
user/ngie/bug-237403/sys/kern/uipc_mbuf.c (revision 346926) @@ -1,1872 +1,1875 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1988, 1991, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)uipc_mbuf.c 8.2 (Berkeley) 1/4/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_param.h" #include "opt_mbuf_stress_test.h" #include "opt_mbuf_profiling.h" #include #include #include #include #include #include #include #include #include #include #include #include SDT_PROBE_DEFINE5_XLATE(sdt, , , m__init, "struct mbuf *", "mbufinfo_t *", "uint32_t", "uint32_t", "uint16_t", "uint16_t", "uint32_t", "uint32_t", "uint32_t", "uint32_t"); SDT_PROBE_DEFINE3_XLATE(sdt, , , m__gethdr, "uint32_t", "uint32_t", "uint16_t", "uint16_t", "struct mbuf *", "mbufinfo_t *"); SDT_PROBE_DEFINE3_XLATE(sdt, , , m__get, "uint32_t", "uint32_t", "uint16_t", "uint16_t", "struct mbuf *", "mbufinfo_t *"); SDT_PROBE_DEFINE4_XLATE(sdt, , , m__getcl, "uint32_t", "uint32_t", "uint16_t", "uint16_t", "uint32_t", "uint32_t", "struct mbuf *", "mbufinfo_t *"); SDT_PROBE_DEFINE3_XLATE(sdt, , , m__clget, "struct mbuf *", "mbufinfo_t *", "uint32_t", "uint32_t", "uint32_t", "uint32_t"); SDT_PROBE_DEFINE4_XLATE(sdt, , , m__cljget, "struct mbuf *", "mbufinfo_t *", "uint32_t", "uint32_t", "uint32_t", "uint32_t", "void*", "void*"); SDT_PROBE_DEFINE(sdt, , , m__cljset); SDT_PROBE_DEFINE1_XLATE(sdt, , , m__free, "struct mbuf *", "mbufinfo_t *"); SDT_PROBE_DEFINE1_XLATE(sdt, , , m__freem, "struct mbuf *", "mbufinfo_t *"); #include int max_linkhdr; int max_protohdr; int max_hdr; int max_datalen; #ifdef MBUF_STRESS_TEST int m_defragpackets; int m_defragbytes; int m_defraguseless; int m_defragfailure; int m_defragrandomfailures; #endif /* * sysctl(8) exported objects */ SYSCTL_INT(_kern_ipc, KIPC_MAX_LINKHDR, max_linkhdr, CTLFLAG_RD, &max_linkhdr, 0, "Size of largest link layer header"); SYSCTL_INT(_kern_ipc, KIPC_MAX_PROTOHDR, max_protohdr, CTLFLAG_RD, &max_protohdr, 0, "Size of largest protocol layer header"); SYSCTL_INT(_kern_ipc, KIPC_MAX_HDR, max_hdr, CTLFLAG_RD, 
&max_hdr, 0, "Size of largest link plus protocol header"); SYSCTL_INT(_kern_ipc, KIPC_MAX_DATALEN, max_datalen, CTLFLAG_RD, &max_datalen, 0, "Minimum space left in mbuf after max_hdr"); #ifdef MBUF_STRESS_TEST SYSCTL_INT(_kern_ipc, OID_AUTO, m_defragpackets, CTLFLAG_RD, &m_defragpackets, 0, ""); SYSCTL_INT(_kern_ipc, OID_AUTO, m_defragbytes, CTLFLAG_RD, &m_defragbytes, 0, ""); SYSCTL_INT(_kern_ipc, OID_AUTO, m_defraguseless, CTLFLAG_RD, &m_defraguseless, 0, ""); SYSCTL_INT(_kern_ipc, OID_AUTO, m_defragfailure, CTLFLAG_RD, &m_defragfailure, 0, ""); SYSCTL_INT(_kern_ipc, OID_AUTO, m_defragrandomfailures, CTLFLAG_RW, &m_defragrandomfailures, 0, ""); #endif /* * Ensure the correct size of various mbuf parameters. It could be off due * to compiler-induced padding and alignment artifacts. */ CTASSERT(MSIZE - offsetof(struct mbuf, m_dat) == MLEN); CTASSERT(MSIZE - offsetof(struct mbuf, m_pktdat) == MHLEN); /* * mbuf data storage should be 64-bit aligned regardless of architectural * pointer size; check this is the case with and without a packet header. */ CTASSERT(offsetof(struct mbuf, m_dat) % 8 == 0); CTASSERT(offsetof(struct mbuf, m_pktdat) % 8 == 0); /* * While the specific values here don't matter too much (i.e., +/- a few * words), we do want to ensure that changes to these values are carefully * reasoned about and properly documented. This is especially the case as * network-protocol and device-driver modules encode these layouts, and must * be recompiled if the structures change. Check these values at compile time * against the ones documented in comments in mbuf.h. * * NB: Possibly they should be documented there via #define's and not just * comments. */ #if defined(__LP64__) CTASSERT(offsetof(struct mbuf, m_dat) == 32); CTASSERT(sizeof(struct pkthdr) == 56); CTASSERT(sizeof(struct m_ext) == 48); #else CTASSERT(offsetof(struct mbuf, m_dat) == 24); CTASSERT(sizeof(struct pkthdr) == 48); CTASSERT(sizeof(struct m_ext) == 28); #endif /* * Assert that the queue(3) macros produce code of the same size as an old * plain pointer does. */ #ifdef INVARIANTS static struct mbuf __used m_assertbuf; CTASSERT(sizeof(m_assertbuf.m_slist) == sizeof(m_assertbuf.m_next)); CTASSERT(sizeof(m_assertbuf.m_stailq) == sizeof(m_assertbuf.m_next)); CTASSERT(sizeof(m_assertbuf.m_slistpkt) == sizeof(m_assertbuf.m_nextpkt)); CTASSERT(sizeof(m_assertbuf.m_stailqpkt) == sizeof(m_assertbuf.m_nextpkt)); #endif /* * Attach the cluster from *m to *n, set up m_ext in *n * and bump the refcount of the cluster. */ void mb_dupcl(struct mbuf *n, struct mbuf *m) { volatile u_int *refcnt; KASSERT(m->m_flags & M_EXT, ("%s: M_EXT not set on %p", __func__, m)); KASSERT(!(n->m_flags & M_EXT), ("%s: M_EXT set on %p", __func__, n)); /* * Cache access optimization. For most kinds of external * storage we don't need full copy of m_ext, since the * holder of the 'ext_count' is responsible to carry the * free routine and its arguments. Exclusion is EXT_EXTREF, * where 'ext_cnt' doesn't point into mbuf at all. */ if (m->m_ext.ext_type == EXT_EXTREF) bcopy(&m->m_ext, &n->m_ext, sizeof(struct m_ext)); else bcopy(&m->m_ext, &n->m_ext, m_ext_copylen); n->m_flags |= M_EXT; n->m_flags |= m->m_flags & M_RDONLY; /* See if this is the mbuf that holds the embedded refcount. 
*/ if (m->m_ext.ext_flags & EXT_FLAG_EMBREF) { refcnt = n->m_ext.ext_cnt = &m->m_ext.ext_count; n->m_ext.ext_flags &= ~EXT_FLAG_EMBREF; } else { KASSERT(m->m_ext.ext_cnt != NULL, ("%s: no refcounting pointer on %p", __func__, m)); refcnt = m->m_ext.ext_cnt; } if (*refcnt == 1) *refcnt += 1; else atomic_add_int(refcnt, 1); } void m_demote_pkthdr(struct mbuf *m) { M_ASSERTPKTHDR(m); m_tag_delete_chain(m, NULL); m->m_flags &= ~M_PKTHDR; bzero(&m->m_pkthdr, sizeof(struct pkthdr)); } /* * Clean up mbuf (chain) from any tags and packet headers. * If "all" is set then the first mbuf in the chain will be * cleaned too. */ void m_demote(struct mbuf *m0, int all, int flags) { struct mbuf *m; for (m = all ? m0 : m0->m_next; m != NULL; m = m->m_next) { KASSERT(m->m_nextpkt == NULL, ("%s: m_nextpkt in m %p, m0 %p", __func__, m, m0)); if (m->m_flags & M_PKTHDR) m_demote_pkthdr(m); m->m_flags = m->m_flags & (M_EXT | M_RDONLY | M_NOFREE | flags); } } /* * Sanity checks on mbuf (chain) for use in KASSERT() and general * debugging. * Returns 0 or panics when bad and 1 on all tests passed. * Sanitize, 0 to run M_SANITY_ACTION, 1 to garble things so they * blow up later. */ int m_sanity(struct mbuf *m0, int sanitize) { struct mbuf *m; caddr_t a, b; int pktlen = 0; #ifdef INVARIANTS #define M_SANITY_ACTION(s) panic("mbuf %p: " s, m) #else #define M_SANITY_ACTION(s) printf("mbuf %p: " s, m) #endif for (m = m0; m != NULL; m = m->m_next) { /* * Basic pointer checks. If any of these fails then some * unrelated kernel memory before or after us is trashed. * No way to recover from that. */ a = M_START(m); b = a + M_SIZE(m); if ((caddr_t)m->m_data < a) M_SANITY_ACTION("m_data outside mbuf data range left"); if ((caddr_t)m->m_data > b) M_SANITY_ACTION("m_data outside mbuf data range right"); if ((caddr_t)m->m_data + m->m_len > b) M_SANITY_ACTION("m_data + m_len exeeds mbuf space"); /* m->m_nextpkt may only be set on first mbuf in chain. */ if (m != m0 && m->m_nextpkt != NULL) { if (sanitize) { m_freem(m->m_nextpkt); m->m_nextpkt = (struct mbuf *)0xDEADC0DE; } else M_SANITY_ACTION("m->m_nextpkt on in-chain mbuf"); } /* packet length (not mbuf length!) calculation */ if (m0->m_flags & M_PKTHDR) pktlen += m->m_len; /* m_tags may only be attached to first mbuf in chain. */ if (m != m0 && m->m_flags & M_PKTHDR && !SLIST_EMPTY(&m->m_pkthdr.tags)) { if (sanitize) { m_tag_delete_chain(m, NULL); /* put in 0xDEADC0DE perhaps? */ } else M_SANITY_ACTION("m_tags on in-chain mbuf"); } /* M_PKTHDR may only be set on first mbuf in chain */ if (m != m0 && m->m_flags & M_PKTHDR) { if (sanitize) { bzero(&m->m_pkthdr, sizeof(m->m_pkthdr)); m->m_flags &= ~M_PKTHDR; /* put in 0xDEADCODE and leave hdr flag in */ } else M_SANITY_ACTION("M_PKTHDR on in-chain mbuf"); } } m = m0; if (pktlen && pktlen != m->m_pkthdr.len) { if (sanitize) m->m_pkthdr.len = 0; else M_SANITY_ACTION("m_pkthdr.len != mbuf chain length"); } return 1; #undef M_SANITY_ACTION } /* * Non-inlined part of m_init(). */ int m_pkthdr_init(struct mbuf *m, int how) { #ifdef MAC int error; #endif m->m_data = m->m_pktdat; bzero(&m->m_pkthdr, sizeof(m->m_pkthdr)); +#ifdef NUMA + m->m_pkthdr.numa_domain = M_NODOM; +#endif #ifdef MAC /* If the label init fails, fail the alloc */ error = mac_mbuf_init(m, how); if (error) return (error); #endif return (0); } /* * "Move" mbuf pkthdr from "from" to "to". * "from" must have M_PKTHDR set, and "to" must be empty. 
*/ void m_move_pkthdr(struct mbuf *to, struct mbuf *from) { #if 0 /* see below for why these are not enabled */ M_ASSERTPKTHDR(to); /* Note: with MAC, this may not be a good assertion. */ KASSERT(SLIST_EMPTY(&to->m_pkthdr.tags), ("m_move_pkthdr: to has tags")); #endif #ifdef MAC /* * XXXMAC: It could be this should also occur for non-MAC? */ if (to->m_flags & M_PKTHDR) m_tag_delete_chain(to, NULL); #endif to->m_flags = (from->m_flags & M_COPYFLAGS) | (to->m_flags & M_EXT); if ((to->m_flags & M_EXT) == 0) to->m_data = to->m_pktdat; to->m_pkthdr = from->m_pkthdr; /* especially tags */ SLIST_INIT(&from->m_pkthdr.tags); /* purge tags from src */ from->m_flags &= ~M_PKTHDR; } /* * Duplicate "from"'s mbuf pkthdr in "to". * "from" must have M_PKTHDR set, and "to" must be empty. * In particular, this does a deep copy of the packet tags. */ int m_dup_pkthdr(struct mbuf *to, const struct mbuf *from, int how) { #if 0 /* * The mbuf allocator only initializes the pkthdr * when the mbuf is allocated with m_gethdr(). Many users * (e.g. m_copy*, m_prepend) use m_get() and then * smash the pkthdr as needed causing these * assertions to trip. For now just disable them. */ M_ASSERTPKTHDR(to); /* Note: with MAC, this may not be a good assertion. */ KASSERT(SLIST_EMPTY(&to->m_pkthdr.tags), ("m_dup_pkthdr: to has tags")); #endif MBUF_CHECKSLEEP(how); #ifdef MAC if (to->m_flags & M_PKTHDR) m_tag_delete_chain(to, NULL); #endif to->m_flags = (from->m_flags & M_COPYFLAGS) | (to->m_flags & M_EXT); if ((to->m_flags & M_EXT) == 0) to->m_data = to->m_pktdat; to->m_pkthdr = from->m_pkthdr; SLIST_INIT(&to->m_pkthdr.tags); return (m_tag_copy_chain(to, from, how)); } /* * Lesser-used path for M_PREPEND: * allocate new mbuf to prepend to chain, * copy junk along. */ struct mbuf * m_prepend(struct mbuf *m, int len, int how) { struct mbuf *mn; if (m->m_flags & M_PKTHDR) mn = m_gethdr(how, m->m_type); else mn = m_get(how, m->m_type); if (mn == NULL) { m_freem(m); return (NULL); } if (m->m_flags & M_PKTHDR) m_move_pkthdr(mn, m); mn->m_next = m; m = mn; if (len < M_SIZE(m)) M_ALIGN(m, len); m->m_len = len; return (m); } /* * Make a copy of an mbuf chain starting "off0" bytes from the beginning, * continuing for "len" bytes. If len is M_COPYALL, copy to end of mbuf. * The wait parameter is a choice of M_WAITOK/M_NOWAIT from caller. * Note that the copy is read-only, because clusters are not copied, * only their reference counts are incremented. 
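 *
 * A minimal usage sketch (illustrative only, not part of this revision):
 * copy the first "hlen" bytes of a packet for inspection and release the
 * copy afterwards.  The variable names and "hlen" are hypothetical; only
 * m_copym() and m_freem() are interfaces defined in this file.
 *
 *	struct mbuf *hdr;
 *
 *	hdr = m_copym(m, 0, hlen, M_NOWAIT);
 *	if (hdr == NULL)
 *		return (ENOBUFS);	(allocation failed; "m" itself is not freed)
 *	(... inspect hdr; it may share read-only clusters with "m" ...)
 *	m_freem(hdr);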
*/ struct mbuf * m_copym(struct mbuf *m, int off0, int len, int wait) { struct mbuf *n, **np; int off = off0; struct mbuf *top; int copyhdr = 0; KASSERT(off >= 0, ("m_copym, negative off %d", off)); KASSERT(len >= 0, ("m_copym, negative len %d", len)); MBUF_CHECKSLEEP(wait); if (off == 0 && m->m_flags & M_PKTHDR) copyhdr = 1; while (off > 0) { KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain")); if (off < m->m_len) break; off -= m->m_len; m = m->m_next; } np = &top; top = NULL; while (len > 0) { if (m == NULL) { KASSERT(len == M_COPYALL, ("m_copym, length > size of mbuf chain")); break; } if (copyhdr) n = m_gethdr(wait, m->m_type); else n = m_get(wait, m->m_type); *np = n; if (n == NULL) goto nospace; if (copyhdr) { if (!m_dup_pkthdr(n, m, wait)) goto nospace; if (len == M_COPYALL) n->m_pkthdr.len -= off0; else n->m_pkthdr.len = len; copyhdr = 0; } n->m_len = min(len, m->m_len - off); if (m->m_flags & M_EXT) { n->m_data = m->m_data + off; mb_dupcl(n, m); } else bcopy(mtod(m, caddr_t)+off, mtod(n, caddr_t), (u_int)n->m_len); if (len != M_COPYALL) len -= n->m_len; off = 0; m = m->m_next; np = &n->m_next; } return (top); nospace: m_freem(top); return (NULL); } /* * Copy an entire packet, including header (which must be present). * An optimization of the common case `m_copym(m, 0, M_COPYALL, how)'. * Note that the copy is read-only, because clusters are not copied, * only their reference counts are incremented. * Preserve alignment of the first mbuf so if the creator has left * some room at the beginning (e.g. for inserting protocol headers) * the copies still have the room available. */ struct mbuf * m_copypacket(struct mbuf *m, int how) { struct mbuf *top, *n, *o; MBUF_CHECKSLEEP(how); n = m_get(how, m->m_type); top = n; if (n == NULL) goto nospace; if (!m_dup_pkthdr(n, m, how)) goto nospace; n->m_len = m->m_len; if (m->m_flags & M_EXT) { n->m_data = m->m_data; mb_dupcl(n, m); } else { n->m_data = n->m_pktdat + (m->m_data - m->m_pktdat ); bcopy(mtod(m, char *), mtod(n, char *), n->m_len); } m = m->m_next; while (m) { o = m_get(how, m->m_type); if (o == NULL) goto nospace; n->m_next = o; n = n->m_next; n->m_len = m->m_len; if (m->m_flags & M_EXT) { n->m_data = m->m_data; mb_dupcl(n, m); } else { bcopy(mtod(m, char *), mtod(n, char *), n->m_len); } m = m->m_next; } return top; nospace: m_freem(top); return (NULL); } /* * Copy data from an mbuf chain starting "off" bytes from the beginning, * continuing for "len" bytes, into the indicated buffer. */ void m_copydata(const struct mbuf *m, int off, int len, caddr_t cp) { u_int count; KASSERT(off >= 0, ("m_copydata, negative off %d", off)); KASSERT(len >= 0, ("m_copydata, negative len %d", len)); while (off > 0) { KASSERT(m != NULL, ("m_copydata, offset > size of mbuf chain")); if (off < m->m_len) break; off -= m->m_len; m = m->m_next; } while (len > 0) { KASSERT(m != NULL, ("m_copydata, length > size of mbuf chain")); count = min(m->m_len - off, len); bcopy(mtod(m, caddr_t) + off, cp, count); len -= count; cp += count; off = 0; m = m->m_next; } } /* * Copy a packet header mbuf chain into a completely new chain, including * copying any mbuf clusters. Use this instead of m_copypacket() when * you need a writable copy of an mbuf chain.
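 *
 * Hypothetical sketch (not part of this revision) contrasting the two
 * copy routines; "m" and the error handling are placeholders, and "m" is
 * assumed to carry a packet header:
 *
 *	struct mbuf *ro, *rw;
 *
 *	ro = m_copym(m, 0, M_COPYALL, M_NOWAIT);	(shares clusters, read-only)
 *	rw = m_dup(m, M_NOWAIT);			(deep copy, safe to modify)
 *	if (ro == NULL || rw == NULL)
 *		goto fail;
 *	*mtod(rw, uint8_t *) ^= 0xff;			(does not affect "m" or "ro")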
*/ struct mbuf * m_dup(const struct mbuf *m, int how) { struct mbuf **p, *top = NULL; int remain, moff, nsize; MBUF_CHECKSLEEP(how); /* Sanity check */ if (m == NULL) return (NULL); M_ASSERTPKTHDR(m); /* While there's more data, get a new mbuf, tack it on, and fill it */ remain = m->m_pkthdr.len; moff = 0; p = &top; while (remain > 0 || top == NULL) { /* allow m->m_pkthdr.len == 0 */ struct mbuf *n; /* Get the next new mbuf */ if (remain >= MINCLSIZE) { n = m_getcl(how, m->m_type, 0); nsize = MCLBYTES; } else { n = m_get(how, m->m_type); nsize = MLEN; } if (n == NULL) goto nospace; if (top == NULL) { /* First one, must be PKTHDR */ if (!m_dup_pkthdr(n, m, how)) { m_free(n); goto nospace; } if ((n->m_flags & M_EXT) == 0) nsize = MHLEN; n->m_flags &= ~M_RDONLY; } n->m_len = 0; /* Link it into the new chain */ *p = n; p = &n->m_next; /* Copy data from original mbuf(s) into new mbuf */ while (n->m_len < nsize && m != NULL) { int chunk = min(nsize - n->m_len, m->m_len - moff); bcopy(m->m_data + moff, n->m_data + n->m_len, chunk); moff += chunk; n->m_len += chunk; remain -= chunk; if (moff == m->m_len) { m = m->m_next; moff = 0; } } /* Check correct total mbuf length */ KASSERT((remain > 0 && m != NULL) || (remain == 0 && m == NULL), ("%s: bogus m_pkthdr.len", __func__)); } return (top); nospace: m_freem(top); return (NULL); } /* * Concatenate mbuf chain n to m. * Both chains must be of the same type (e.g. MT_DATA). * Any m_pkthdr is not updated. */ void m_cat(struct mbuf *m, struct mbuf *n) { while (m->m_next) m = m->m_next; while (n) { if (!M_WRITABLE(m) || M_TRAILINGSPACE(m) < n->m_len) { /* just join the two chains */ m->m_next = n; return; } /* splat the data from one into the other */ bcopy(mtod(n, caddr_t), mtod(m, caddr_t) + m->m_len, (u_int)n->m_len); m->m_len += n->m_len; n = m_free(n); } } /* * Concatenate two pkthdr mbuf chains. */ void m_catpkt(struct mbuf *m, struct mbuf *n) { M_ASSERTPKTHDR(m); M_ASSERTPKTHDR(n); m->m_pkthdr.len += n->m_pkthdr.len; m_demote(n, 1, 0); m_cat(m, n); } void m_adj(struct mbuf *mp, int req_len) { int len = req_len; struct mbuf *m; int count; if ((m = mp) == NULL) return; if (len >= 0) { /* * Trim from head. */ while (m != NULL && len > 0) { if (m->m_len <= len) { len -= m->m_len; m->m_len = 0; m = m->m_next; } else { m->m_len -= len; m->m_data += len; len = 0; } } if (mp->m_flags & M_PKTHDR) mp->m_pkthdr.len -= (req_len - len); } else { /* * Trim from tail. Scan the mbuf chain, * calculating its length and finding the last mbuf. * If the adjustment only affects this mbuf, then just * adjust and return. Otherwise, rescan and truncate * after the remaining size. */ len = -len; count = 0; for (;;) { count += m->m_len; if (m->m_next == (struct mbuf *)0) break; m = m->m_next; } if (m->m_len >= len) { m->m_len -= len; if (mp->m_flags & M_PKTHDR) mp->m_pkthdr.len -= len; return; } count -= len; if (count < 0) count = 0; /* * Correct length for chain is "count". * Find the mbuf with last data, adjust its length, * and toss data from remaining mbufs on chain. */ m = mp; if (m->m_flags & M_PKTHDR) m->m_pkthdr.len = count; for (; m; m = m->m_next) { if (m->m_len >= count) { m->m_len = count; if (m->m_next != NULL) { m_freem(m->m_next); m->m_next = NULL; } break; } count -= m->m_len; } } } /* * Rearrange an mbuf chain so that len bytes are contiguous * and in the data area of an mbuf (so that mtod will work * for a structure of size len). Returns the resulting * mbuf chain on success, frees it and returns null on failure.
* If there is room, it will add up to max_protohdr-len extra bytes to the * contiguous region in an attempt to avoid being called next time. */ struct mbuf * m_pullup(struct mbuf *n, int len) { struct mbuf *m; int count; int space; /* * If first mbuf has no cluster, and has room for len bytes * without shifting current data, pullup into it, * otherwise allocate a new mbuf to prepend to the chain. */ if ((n->m_flags & M_EXT) == 0 && n->m_data + len < &n->m_dat[MLEN] && n->m_next) { if (n->m_len >= len) return (n); m = n; n = n->m_next; len -= m->m_len; } else { if (len > MHLEN) goto bad; m = m_get(M_NOWAIT, n->m_type); if (m == NULL) goto bad; if (n->m_flags & M_PKTHDR) m_move_pkthdr(m, n); } space = &m->m_dat[MLEN] - (m->m_data + m->m_len); do { count = min(min(max(len, max_protohdr), space), n->m_len); bcopy(mtod(n, caddr_t), mtod(m, caddr_t) + m->m_len, (u_int)count); len -= count; m->m_len += count; n->m_len -= count; space -= count; if (n->m_len) n->m_data += count; else n = m_free(n); } while (len > 0 && n); if (len > 0) { (void) m_free(m); goto bad; } m->m_next = n; return (m); bad: m_freem(n); return (NULL); } /* * Like m_pullup(), except a new mbuf is always allocated, and we allow * the amount of empty space before the data in the new mbuf to be specified * (in the event that the caller expects to prepend later). */ struct mbuf * m_copyup(struct mbuf *n, int len, int dstoff) { struct mbuf *m; int count, space; if (len > (MHLEN - dstoff)) goto bad; m = m_get(M_NOWAIT, n->m_type); if (m == NULL) goto bad; if (n->m_flags & M_PKTHDR) m_move_pkthdr(m, n); m->m_data += dstoff; space = &m->m_dat[MLEN] - (m->m_data + m->m_len); do { count = min(min(max(len, max_protohdr), space), n->m_len); memcpy(mtod(m, caddr_t) + m->m_len, mtod(n, caddr_t), (unsigned)count); len -= count; m->m_len += count; n->m_len -= count; space -= count; if (n->m_len) n->m_data += count; else n = m_free(n); } while (len > 0 && n); if (len > 0) { (void) m_free(m); goto bad; } m->m_next = n; return (m); bad: m_freem(n); return (NULL); } /* * Partition an mbuf chain in two pieces, returning the tail -- * all but the first len0 bytes. In case of failure, it returns NULL and * attempts to restore the chain to its original state. * * Note that the resulting mbufs might be read-only, because the new * mbuf can end up sharing an mbuf cluster with the original mbuf if * the "breaking point" happens to lie within a cluster mbuf. Use the * M_WRITABLE() macro to check for this case. 
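 *
 * Illustrative sketch (hypothetical, not taken from this change): split off
 * the payload after an assumed "hdrlen" bytes.  Only m_split() and
 * M_WRITABLE() below are existing mbuf(9) interfaces.
 *
 *	struct mbuf *tail;
 *
 *	tail = m_split(m, hdrlen, M_NOWAIT);
 *	if (tail == NULL)
 *		return (ENOBUFS);	(the original chain is restored, see above)
 *	("m" now carries the first hdrlen bytes, "tail" the remainder; check
 *	 M_WRITABLE() on either before modifying data in place.)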
*/ struct mbuf * m_split(struct mbuf *m0, int len0, int wait) { struct mbuf *m, *n; u_int len = len0, remain; MBUF_CHECKSLEEP(wait); for (m = m0; m && len > m->m_len; m = m->m_next) len -= m->m_len; if (m == NULL) return (NULL); remain = m->m_len - len; if (m0->m_flags & M_PKTHDR && remain == 0) { n = m_gethdr(wait, m0->m_type); if (n == NULL) return (NULL); n->m_next = m->m_next; m->m_next = NULL; n->m_pkthdr.rcvif = m0->m_pkthdr.rcvif; n->m_pkthdr.len = m0->m_pkthdr.len - len0; m0->m_pkthdr.len = len0; return (n); } else if (m0->m_flags & M_PKTHDR) { n = m_gethdr(wait, m0->m_type); if (n == NULL) return (NULL); n->m_pkthdr.rcvif = m0->m_pkthdr.rcvif; n->m_pkthdr.len = m0->m_pkthdr.len - len0; m0->m_pkthdr.len = len0; if (m->m_flags & M_EXT) goto extpacket; if (remain > MHLEN) { /* m can't be the lead packet */ M_ALIGN(n, 0); n->m_next = m_split(m, len, wait); if (n->m_next == NULL) { (void) m_free(n); return (NULL); } else { n->m_len = 0; return (n); } } else M_ALIGN(n, remain); } else if (remain == 0) { n = m->m_next; m->m_next = NULL; return (n); } else { n = m_get(wait, m->m_type); if (n == NULL) return (NULL); M_ALIGN(n, remain); } extpacket: if (m->m_flags & M_EXT) { n->m_data = m->m_data + len; mb_dupcl(n, m); } else { bcopy(mtod(m, caddr_t) + len, mtod(n, caddr_t), remain); } n->m_len = remain; m->m_len = len; n->m_next = m->m_next; m->m_next = NULL; return (n); } /* * Routine to copy from device local memory into mbufs. * Note that `off' argument is offset into first mbuf of target chain from * which to begin copying the data to. */ struct mbuf * m_devget(char *buf, int totlen, int off, struct ifnet *ifp, void (*copy)(char *from, caddr_t to, u_int len)) { struct mbuf *m; struct mbuf *top = NULL, **mp = &top; int len; if (off < 0 || off > MHLEN) return (NULL); while (totlen > 0) { if (top == NULL) { /* First one, must be PKTHDR */ if (totlen + off >= MINCLSIZE) { m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR); len = MCLBYTES; } else { m = m_gethdr(M_NOWAIT, MT_DATA); len = MHLEN; /* Place initial small packet/header at end of mbuf */ if (m && totlen + off + max_linkhdr <= MHLEN) { m->m_data += max_linkhdr; len -= max_linkhdr; } } if (m == NULL) return NULL; m->m_pkthdr.rcvif = ifp; m->m_pkthdr.len = totlen; } else { if (totlen + off >= MINCLSIZE) { m = m_getcl(M_NOWAIT, MT_DATA, 0); len = MCLBYTES; } else { m = m_get(M_NOWAIT, MT_DATA); len = MLEN; } if (m == NULL) { m_freem(top); return NULL; } } if (off) { m->m_data += off; len -= off; off = 0; } m->m_len = len = min(totlen, len); if (copy) copy(buf, mtod(m, caddr_t), (u_int)len); else bcopy(buf, mtod(m, caddr_t), (u_int)len); buf += len; *mp = m; mp = &m->m_next; totlen -= len; } return (top); } /* * Copy data from a buffer back into the indicated mbuf chain, * starting "off" bytes from the beginning, extending the mbuf * chain if necessary.
*/ void m_copyback(struct mbuf *m0, int off, int len, c_caddr_t cp) { int mlen; struct mbuf *m = m0, *n; int totlen = 0; if (m0 == NULL) return; while (off > (mlen = m->m_len)) { off -= mlen; totlen += mlen; if (m->m_next == NULL) { n = m_get(M_NOWAIT, m->m_type); if (n == NULL) goto out; bzero(mtod(n, caddr_t), MLEN); n->m_len = min(MLEN, len + off); m->m_next = n; } m = m->m_next; } while (len > 0) { if (m->m_next == NULL && (len > m->m_len - off)) { m->m_len += min(len - (m->m_len - off), M_TRAILINGSPACE(m)); } mlen = min (m->m_len - off, len); bcopy(cp, off + mtod(m, caddr_t), (u_int)mlen); cp += mlen; len -= mlen; mlen += off; off = 0; totlen += mlen; if (len == 0) break; if (m->m_next == NULL) { n = m_get(M_NOWAIT, m->m_type); if (n == NULL) break; n->m_len = min(MLEN, len); m->m_next = n; } m = m->m_next; } out: if (((m = m0)->m_flags & M_PKTHDR) && (m->m_pkthdr.len < totlen)) m->m_pkthdr.len = totlen; } /* * Append the specified data to the indicated mbuf chain, * Extend the mbuf chain if the new data does not fit in * existing space. * * Return 1 if able to complete the job; otherwise 0. */ int m_append(struct mbuf *m0, int len, c_caddr_t cp) { struct mbuf *m, *n; int remainder, space; for (m = m0; m->m_next != NULL; m = m->m_next) ; remainder = len; space = M_TRAILINGSPACE(m); if (space > 0) { /* * Copy into available space. */ if (space > remainder) space = remainder; bcopy(cp, mtod(m, caddr_t) + m->m_len, space); m->m_len += space; cp += space, remainder -= space; } while (remainder > 0) { /* * Allocate a new mbuf; could check space * and allocate a cluster instead. */ n = m_get(M_NOWAIT, m->m_type); if (n == NULL) break; n->m_len = min(MLEN, remainder); bcopy(cp, mtod(n, caddr_t), n->m_len); cp += n->m_len, remainder -= n->m_len; m->m_next = n; m = n; } if (m0->m_flags & M_PKTHDR) m0->m_pkthdr.len += len - remainder; return (remainder == 0); } /* * Apply function f to the data in an mbuf chain starting "off" bytes from * the beginning, continuing for "len" bytes. */ int m_apply(struct mbuf *m, int off, int len, int (*f)(void *, void *, u_int), void *arg) { u_int count; int rval; KASSERT(off >= 0, ("m_apply, negative off %d", off)); KASSERT(len >= 0, ("m_apply, negative len %d", len)); while (off > 0) { KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain")); if (off < m->m_len) break; off -= m->m_len; m = m->m_next; } while (len > 0) { KASSERT(m != NULL, ("m_apply, offset > size of mbuf chain")); count = min(m->m_len - off, len); rval = (*f)(arg, mtod(m, caddr_t) + off, count); if (rval) return (rval); len -= count; off = 0; m = m->m_next; } return (0); } /* * Return a pointer to mbuf/offset of location in mbuf chain. */ struct mbuf * m_getptr(struct mbuf *m, int loc, int *off) { while (loc >= 0) { /* Normal end of search. */ if (m->m_len > loc) { *off = loc; return (m); } else { loc -= m->m_len; if (m->m_next == NULL) { if (loc == 0) { /* Point at the end of valid data. */ *off = m->m_len; return (m); } return (NULL); } m = m->m_next; } } return (NULL); } void m_print(const struct mbuf *m, int maxlen) { int len; int pdata; const struct mbuf *m2; if (m == NULL) { printf("mbuf: %p\n", m); return; } if (m->m_flags & M_PKTHDR) len = m->m_pkthdr.len; else len = -1; m2 = m; while (m2 != NULL && (len == -1 || len)) { pdata = m2->m_len; if (maxlen != -1 && pdata > maxlen) pdata = maxlen; printf("mbuf: %p len: %d, next: %p, %b%s", m2, m2->m_len, m2->m_next, m2->m_flags, "\20\20freelist\17skipfw" "\11proto5\10proto4\7proto3\6proto2\5proto1\4rdonly" "\3eor\2pkthdr\1ext", pdata ? 
"" : "\n"); if (pdata) printf(", %*D\n", pdata, (u_char *)m2->m_data, "-"); if (len != -1) len -= m2->m_len; m2 = m2->m_next; } if (len > 0) printf("%d bytes unaccounted for.\n", len); return; } u_int m_fixhdr(struct mbuf *m0) { u_int len; len = m_length(m0, NULL); m0->m_pkthdr.len = len; return (len); } u_int m_length(struct mbuf *m0, struct mbuf **last) { struct mbuf *m; u_int len; len = 0; for (m = m0; m != NULL; m = m->m_next) { len += m->m_len; if (m->m_next == NULL) break; } if (last != NULL) *last = m; return (len); } /* * Defragment a mbuf chain, returning the shortest possible * chain of mbufs and clusters. If allocation fails and * this cannot be completed, NULL will be returned, but * the passed in chain will be unchanged. Upon success, * the original chain will be freed, and the new chain * will be returned. * * If a non-packet header is passed in, the original * mbuf (chain?) will be returned unharmed. */ struct mbuf * m_defrag(struct mbuf *m0, int how) { struct mbuf *m_new = NULL, *m_final = NULL; int progress = 0, length; MBUF_CHECKSLEEP(how); if (!(m0->m_flags & M_PKTHDR)) return (m0); m_fixhdr(m0); /* Needed sanity check */ #ifdef MBUF_STRESS_TEST if (m_defragrandomfailures) { int temp = arc4random() & 0xff; if (temp == 0xba) goto nospace; } #endif if (m0->m_pkthdr.len > MHLEN) m_final = m_getcl(how, MT_DATA, M_PKTHDR); else m_final = m_gethdr(how, MT_DATA); if (m_final == NULL) goto nospace; if (m_dup_pkthdr(m_final, m0, how) == 0) goto nospace; m_new = m_final; while (progress < m0->m_pkthdr.len) { length = m0->m_pkthdr.len - progress; if (length > MCLBYTES) length = MCLBYTES; if (m_new == NULL) { if (length > MLEN) m_new = m_getcl(how, MT_DATA, 0); else m_new = m_get(how, MT_DATA); if (m_new == NULL) goto nospace; } m_copydata(m0, progress, length, mtod(m_new, caddr_t)); progress += length; m_new->m_len = length; if (m_new != m_final) m_cat(m_final, m_new); m_new = NULL; } #ifdef MBUF_STRESS_TEST if (m0->m_next == NULL) m_defraguseless++; #endif m_freem(m0); m0 = m_final; #ifdef MBUF_STRESS_TEST m_defragpackets++; m_defragbytes += m0->m_pkthdr.len; #endif return (m0); nospace: #ifdef MBUF_STRESS_TEST m_defragfailure++; #endif if (m_final) m_freem(m_final); return (NULL); } /* * Defragment an mbuf chain, returning at most maxfrags separate * mbufs+clusters. If this is not possible NULL is returned and * the original mbuf chain is left in its present (potentially * modified) state. We use two techniques: collapsing consecutive * mbufs and replacing consecutive mbufs by a cluster. * * NB: this should really be named m_defrag but that name is taken */ struct mbuf * m_collapse(struct mbuf *m0, int how, int maxfrags) { struct mbuf *m, *n, *n2, **prev; u_int curfrags; /* * Calculate the current number of frags. */ curfrags = 0; for (m = m0; m != NULL; m = m->m_next) curfrags++; /* * First, try to collapse mbufs. Note that we always collapse * towards the front so we don't need to deal with moving the * pkthdr. This may be suboptimal if the first mbuf has much * less data than the following. */ m = m0; again: for (;;) { n = m->m_next; if (n == NULL) break; if (M_WRITABLE(m) && n->m_len < M_TRAILINGSPACE(m)) { bcopy(mtod(n, void *), mtod(m, char *) + m->m_len, n->m_len); m->m_len += n->m_len; m->m_next = n->m_next; m_free(n); if (--curfrags <= maxfrags) return m0; } else m = n; } KASSERT(maxfrags > 1, ("maxfrags %u, but normal collapse failed", maxfrags)); /* * Collapse consecutive mbufs to a cluster. 
*/ prev = &m0->m_next; /* NB: not the first mbuf */ while ((n = *prev) != NULL) { if ((n2 = n->m_next) != NULL && n->m_len + n2->m_len < MCLBYTES) { m = m_getcl(how, MT_DATA, 0); if (m == NULL) goto bad; bcopy(mtod(n, void *), mtod(m, void *), n->m_len); bcopy(mtod(n2, void *), mtod(m, char *) + n->m_len, n2->m_len); m->m_len = n->m_len + n2->m_len; m->m_next = n2->m_next; *prev = m; m_free(n); m_free(n2); if (--curfrags <= maxfrags) /* +1 cl -2 mbufs */ return m0; /* * Still not there, try the normal collapse * again before we allocate another cluster. */ goto again; } prev = &n->m_next; } /* * No place where we can collapse to a cluster; punt. * This can occur if, for example, you request 2 frags * but the packet requires that both be clusters (we * never reallocate the first mbuf to avoid moving the * packet header). */ bad: return NULL; } #ifdef MBUF_STRESS_TEST /* * Fragment an mbuf chain. There's no reason you'd ever want to do * this in normal usage, but it's great for stress testing various * mbuf consumers. * * If fragmentation is not possible, the original chain will be * returned. * * Possible length values: * 0 no fragmentation will occur * > 0 each fragment will be of the specified length * -1 each fragment will be the same random value in length * -2 each fragment's length will be entirely random * (Random values range from 1 to 256) */ struct mbuf * m_fragment(struct mbuf *m0, int how, int length) { struct mbuf *m_first, *m_last; int divisor = 255, progress = 0, fraglen; if (!(m0->m_flags & M_PKTHDR)) return (m0); if (length == 0 || length < -2) return (m0); if (length > MCLBYTES) length = MCLBYTES; if (length < 0 && divisor > MCLBYTES) divisor = MCLBYTES; if (length == -1) length = 1 + (arc4random() % divisor); if (length > 0) fraglen = length; m_fixhdr(m0); /* Needed sanity check */ m_first = m_getcl(how, MT_DATA, M_PKTHDR); if (m_first == NULL) goto nospace; if (m_dup_pkthdr(m_first, m0, how) == 0) goto nospace; m_last = m_first; while (progress < m0->m_pkthdr.len) { if (length == -2) fraglen = 1 + (arc4random() % divisor); if (fraglen > m0->m_pkthdr.len - progress) fraglen = m0->m_pkthdr.len - progress; if (progress != 0) { struct mbuf *m_new = m_getcl(how, MT_DATA, 0); if (m_new == NULL) goto nospace; m_last->m_next = m_new; m_last = m_new; } m_copydata(m0, progress, fraglen, mtod(m_last, caddr_t)); progress += fraglen; m_last->m_len = fraglen; } m_freem(m0); m0 = m_first; return (m0); nospace: if (m_first) m_freem(m_first); /* Return the original chain on failure */ return (m0); } #endif /* * Copy the contents of uio into a properly sized mbuf chain. */ struct mbuf * m_uiotombuf(struct uio *uio, int how, int len, int align, int flags) { struct mbuf *m, *mb; int error, length; ssize_t total; int progress = 0; /* * len can be zero or an arbitrary large value bound by * the total data supplied by the uio. */ if (len > 0) total = (uio->uio_resid < len) ? uio->uio_resid : len; else total = uio->uio_resid; /* * The smallest unit returned by m_getm2() is a single mbuf * with pkthdr. We can't align past it. */ if (align >= MHLEN) return (NULL); /* * Give us the full allocation or nothing. * If len is zero return the smallest empty mbuf. */ m = m_getm2(NULL, max(total + align, 1), how, MT_DATA, flags); if (m == NULL) return (NULL); m->m_data += align; /* Fill all mbufs with uio data and update header information. 
*/ for (mb = m; mb != NULL; mb = mb->m_next) { length = min(M_TRAILINGSPACE(mb), total - progress); error = uiomove(mtod(mb, void *), length, uio); if (error) { m_freem(m); return (NULL); } mb->m_len = length; progress += length; if (flags & M_PKTHDR) m->m_pkthdr.len += length; } KASSERT(progress == total, ("%s: progress != total", __func__)); return (m); } /* * Copy an mbuf chain into a uio limited by len if set. */ int m_mbuftouio(struct uio *uio, const struct mbuf *m, int len) { int error, length, total; int progress = 0; if (len > 0) total = min(uio->uio_resid, len); else total = uio->uio_resid; /* Fill the uio with data from the mbufs. */ for (; m != NULL; m = m->m_next) { length = min(m->m_len, total - progress); error = uiomove(mtod(m, void *), length, uio); if (error) return (error); progress += length; } return (0); } /* * Create a writable copy of the mbuf chain. While doing this * we compact the chain with a goal of producing a chain with * at most two mbufs. The second mbuf in this chain is likely * to be a cluster. The primary purpose of this work is to create * a writable packet for encryption, compression, etc. The * secondary goal is to linearize the data so the data can be * passed to crypto hardware in the most efficient manner possible. */ struct mbuf * m_unshare(struct mbuf *m0, int how) { struct mbuf *m, *mprev; struct mbuf *n, *mfirst, *mlast; int len, off; mprev = NULL; for (m = m0; m != NULL; m = mprev->m_next) { /* * Regular mbufs are ignored unless there's a cluster * in front of it that we can use to coalesce. We do * the latter mainly so later clusters can be coalesced * also w/o having to handle them specially (i.e. convert * mbuf+cluster -> cluster). This optimization is heavily * influenced by the assumption that we're running over * Ethernet where MCLBYTES is large enough that the max * packet size will permit lots of coalescing into a * single cluster. This in turn permits efficient * crypto operations, especially when using hardware. */ if ((m->m_flags & M_EXT) == 0) { if (mprev && (mprev->m_flags & M_EXT) && m->m_len <= M_TRAILINGSPACE(mprev)) { /* XXX: this ignores mbuf types */ memcpy(mtod(mprev, caddr_t) + mprev->m_len, mtod(m, caddr_t), m->m_len); mprev->m_len += m->m_len; mprev->m_next = m->m_next; /* unlink from chain */ m_free(m); /* reclaim mbuf */ } else { mprev = m; } continue; } /* * Writable mbufs are left alone (for now). */ if (M_WRITABLE(m)) { mprev = m; continue; } /* * Not writable, replace with a copy or coalesce with * the previous mbuf if possible (since we have to copy * it anyway, we try to reduce the number of mbufs and * clusters so that future work is easier). */ KASSERT(m->m_flags & M_EXT, ("m_flags 0x%x", m->m_flags)); /* NB: we only coalesce into a cluster or larger */ if (mprev != NULL && (mprev->m_flags & M_EXT) && m->m_len <= M_TRAILINGSPACE(mprev)) { /* XXX: this ignores mbuf types */ memcpy(mtod(mprev, caddr_t) + mprev->m_len, mtod(m, caddr_t), m->m_len); mprev->m_len += m->m_len; mprev->m_next = m->m_next; /* unlink from chain */ m_free(m); /* reclaim mbuf */ continue; } /* * Allocate new space to hold the copy and copy the data. * We deal with jumbo mbufs (i.e. m_len > MCLBYTES) by * splitting them into clusters. We could just malloc a * buffer and make it external but too many device drivers * don't know how to break up the non-contiguous memory when * doing DMA. 
*/ n = m_getcl(how, m->m_type, m->m_flags & M_COPYFLAGS); if (n == NULL) { m_freem(m0); return (NULL); } if (m->m_flags & M_PKTHDR) { KASSERT(mprev == NULL, ("%s: m0 %p, m %p has M_PKTHDR", __func__, m0, m)); m_move_pkthdr(n, m); } len = m->m_len; off = 0; mfirst = n; mlast = NULL; for (;;) { int cc = min(len, MCLBYTES); memcpy(mtod(n, caddr_t), mtod(m, caddr_t) + off, cc); n->m_len = cc; if (mlast != NULL) mlast->m_next = n; mlast = n; #if 0 newipsecstat.ips_clcopied++; #endif len -= cc; if (len <= 0) break; off += cc; n = m_getcl(how, m->m_type, m->m_flags & M_COPYFLAGS); if (n == NULL) { m_freem(mfirst); m_freem(m0); return (NULL); } } n->m_next = m->m_next; if (mprev == NULL) m0 = mfirst; /* new head of chain */ else mprev->m_next = mfirst; /* replace old mbuf */ m_free(m); /* release old mbuf */ mprev = mfirst; } return (m0); } #ifdef MBUF_PROFILING #define MP_BUCKETS 32 /* don't just change this as things may overflow.*/ struct mbufprofile { uintmax_t wasted[MP_BUCKETS]; uintmax_t used[MP_BUCKETS]; uintmax_t segments[MP_BUCKETS]; } mbprof; #define MP_MAXDIGITS 21 /* strlen("16,000,000,000,000,000,000") == 21 */ #define MP_NUMLINES 6 #define MP_NUMSPERLINE 16 #define MP_EXTRABYTES 64 /* > strlen("used:\nwasted:\nsegments:\n") */ /* work out max space needed and add a bit of spare space too */ #define MP_MAXLINE ((MP_MAXDIGITS+1) * MP_NUMSPERLINE) #define MP_BUFSIZE ((MP_MAXLINE * MP_NUMLINES) + 1 + MP_EXTRABYTES) char mbprofbuf[MP_BUFSIZE]; void m_profile(struct mbuf *m) { int segments = 0; int used = 0; int wasted = 0; while (m) { segments++; used += m->m_len; if (m->m_flags & M_EXT) { wasted += MHLEN - sizeof(m->m_ext) + m->m_ext.ext_size - m->m_len; } else { if (m->m_flags & M_PKTHDR) wasted += MHLEN - m->m_len; else wasted += MLEN - m->m_len; } m = m->m_next; } /* be paranoid.. it helps */ if (segments > MP_BUCKETS - 1) segments = MP_BUCKETS - 1; if (used > 100000) used = 100000; if (wasted > 100000) wasted = 100000; /* store in the appropriate bucket */ /* don't bother locking. if it's slightly off, so what? 
*/ mbprof.segments[segments]++; mbprof.used[fls(used)]++; mbprof.wasted[fls(wasted)]++; } static void mbprof_textify(void) { int offset; char *c; uint64_t *p; p = &mbprof.wasted[0]; c = mbprofbuf; offset = snprintf(c, MP_MAXLINE + 10, "wasted:\n" "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %ju\n", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #ifdef BIG_ARRAY p = &mbprof.wasted[16]; c += offset; offset = snprintf(c, MP_MAXLINE, "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %ju\n", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #endif p = &mbprof.used[0]; c += offset; offset = snprintf(c, MP_MAXLINE + 10, "used:\n" "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %ju\n", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #ifdef BIG_ARRAY p = &mbprof.used[16]; c += offset; offset = snprintf(c, MP_MAXLINE, "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %ju\n", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #endif p = &mbprof.segments[0]; c += offset; offset = snprintf(c, MP_MAXLINE + 10, "segments:\n" "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %ju\n", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #ifdef BIG_ARRAY p = &mbprof.segments[16]; c += offset; offset = snprintf(c, MP_MAXLINE, "%ju %ju %ju %ju %ju %ju %ju %ju " "%ju %ju %ju %ju %ju %ju %ju %jju", p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]); #endif } static int mbprof_handler(SYSCTL_HANDLER_ARGS) { int error; mbprof_textify(); error = SYSCTL_OUT(req, mbprofbuf, strlen(mbprofbuf) + 1); return (error); } static int mbprof_clr_handler(SYSCTL_HANDLER_ARGS) { int clear, error; clear = 0; error = sysctl_handle_int(oidp, &clear, 0, req); if (error || !req->newptr) return (error); if (clear) { bzero(&mbprof, sizeof(mbprof)); } return (error); } SYSCTL_PROC(_kern_ipc, OID_AUTO, mbufprofile, CTLTYPE_STRING|CTLFLAG_RD, NULL, 0, mbprof_handler, "A", "mbuf profiling statistics"); SYSCTL_PROC(_kern_ipc, OID_AUTO, mbufprofileclr, CTLTYPE_INT|CTLFLAG_RW, NULL, 0, mbprof_clr_handler, "I", "clear mbuf profiling statistics"); #endif Index: user/ngie/bug-237403/sys/kern/vfs_bio.c =================================================================== --- user/ngie/bug-237403/sys/kern/vfs_bio.c (revision 346925) +++ user/ngie/bug-237403/sys/kern/vfs_bio.c (revision 346926) @@ -1,5505 +1,5499 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2004 Poul-Henning Kamp * Copyright (c) 1994,1997 John S. Dyson * Copyright (c) 2013 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by Konstantin Belousov * under sponsorship from the FreeBSD Foundation. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * this file contains a new buffer I/O scheme implementing a coherent * VM object and buffer cache scheme. Pains have been taken to make * sure that the performance degradation associated with schemes such * as this is not realized. * * Author: John S. Dyson * Significant help during the development and debugging phases * had been provided by David Greenman, also of the FreeBSD core team. * * see man buf(9) for more info. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_BIOBUF, "biobuf", "BIO buffer"); struct bio_ops bioops; /* I/O operation notification */ struct buf_ops buf_ops_bio = { .bop_name = "buf_ops_bio", .bop_write = bufwrite, .bop_strategy = bufstrategy, .bop_sync = bufsync, .bop_bdflush = bufbdflush, }; struct bufqueue { struct mtx_padalign bq_lock; TAILQ_HEAD(, buf) bq_queue; uint8_t bq_index; uint16_t bq_subqueue; int bq_len; } __aligned(CACHE_LINE_SIZE); #define BQ_LOCKPTR(bq) (&(bq)->bq_lock) #define BQ_LOCK(bq) mtx_lock(BQ_LOCKPTR((bq))) #define BQ_UNLOCK(bq) mtx_unlock(BQ_LOCKPTR((bq))) #define BQ_ASSERT_LOCKED(bq) mtx_assert(BQ_LOCKPTR((bq)), MA_OWNED) struct bufdomain { struct bufqueue bd_subq[MAXCPU + 1]; /* Per-cpu sub queues + global */ struct bufqueue bd_dirtyq; struct bufqueue *bd_cleanq; struct mtx_padalign bd_run_lock; /* Constants */ long bd_maxbufspace; long bd_hibufspace; long bd_lobufspace; long bd_bufspacethresh; int bd_hifreebuffers; int bd_lofreebuffers; int bd_hidirtybuffers; int bd_lodirtybuffers; int bd_dirtybufthresh; int bd_lim; /* atomics */ int bd_wanted; int __aligned(CACHE_LINE_SIZE) bd_numdirtybuffers; int __aligned(CACHE_LINE_SIZE) bd_running; long __aligned(CACHE_LINE_SIZE) bd_bufspace; int __aligned(CACHE_LINE_SIZE) bd_freebuffers; } __aligned(CACHE_LINE_SIZE); #define BD_LOCKPTR(bd) (&(bd)->bd_cleanq->bq_lock) #define BD_LOCK(bd) mtx_lock(BD_LOCKPTR((bd))) #define BD_UNLOCK(bd) mtx_unlock(BD_LOCKPTR((bd))) #define BD_ASSERT_LOCKED(bd) mtx_assert(BD_LOCKPTR((bd)), MA_OWNED) #define BD_RUN_LOCKPTR(bd) (&(bd)->bd_run_lock) #define BD_RUN_LOCK(bd) mtx_lock(BD_RUN_LOCKPTR((bd))) #define BD_RUN_UNLOCK(bd) mtx_unlock(BD_RUN_LOCKPTR((bd))) #define BD_DOMAIN(bd) (bd - bdomain) static struct buf *buf; /* buffer header pool */ extern struct buf *swbuf; /* Swap buffer header pool. 
*/ caddr_t unmapped_buf; /* Used below and for softdep flushing threads in ufs/ffs/ffs_softdep.c */ struct proc *bufdaemonproc; static int inmem(struct vnode *vp, daddr_t blkno); static void vm_hold_free_pages(struct buf *bp, int newbsize); static void vm_hold_load_pages(struct buf *bp, vm_offset_t from, vm_offset_t to); static void vfs_page_set_valid(struct buf *bp, vm_ooffset_t off, vm_page_t m); static void vfs_page_set_validclean(struct buf *bp, vm_ooffset_t off, vm_page_t m); static void vfs_clean_pages_dirty_buf(struct buf *bp); static void vfs_setdirty_locked_object(struct buf *bp); static void vfs_vmio_invalidate(struct buf *bp); static void vfs_vmio_truncate(struct buf *bp, int npages); static void vfs_vmio_extend(struct buf *bp, int npages, int size); static int vfs_bio_clcheck(struct vnode *vp, int size, daddr_t lblkno, daddr_t blkno); static void breada(struct vnode *, daddr_t *, int *, int, struct ucred *, int, void (*)(struct buf *)); static int buf_flush(struct vnode *vp, struct bufdomain *, int); static int flushbufqueues(struct vnode *, struct bufdomain *, int, int); static void buf_daemon(void); static __inline void bd_wakeup(void); static int sysctl_runningspace(SYSCTL_HANDLER_ARGS); static void bufkva_reclaim(vmem_t *, int); static void bufkva_free(struct buf *); static int buf_import(void *, void **, int, int, int); static void buf_release(void *, void **, int); static void maxbcachebuf_adjust(void); static inline struct bufdomain *bufdomain(struct buf *); static void bq_remove(struct bufqueue *bq, struct buf *bp); static void bq_insert(struct bufqueue *bq, struct buf *bp, bool unlock); static int buf_recycle(struct bufdomain *, bool kva); static void bq_init(struct bufqueue *bq, int qindex, int cpu, const char *lockname); static void bd_init(struct bufdomain *bd); static int bd_flushall(struct bufdomain *bd); static int sysctl_bufdomain_long(SYSCTL_HANDLER_ARGS); static int sysctl_bufdomain_int(SYSCTL_HANDLER_ARGS); static int sysctl_bufspace(SYSCTL_HANDLER_ARGS); int vmiodirenable = TRUE; SYSCTL_INT(_vfs, OID_AUTO, vmiodirenable, CTLFLAG_RW, &vmiodirenable, 0, "Use the VM system for directory writes"); long runningbufspace; SYSCTL_LONG(_vfs, OID_AUTO, runningbufspace, CTLFLAG_RD, &runningbufspace, 0, "Amount of presently outstanding async buffer io"); SYSCTL_PROC(_vfs, OID_AUTO, bufspace, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RD, NULL, 0, sysctl_bufspace, "L", "Physical memory used for buffers"); static counter_u64_t bufkvaspace; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, bufkvaspace, CTLFLAG_RD, &bufkvaspace, "Kernel virtual memory used for buffers"); static long maxbufspace; SYSCTL_PROC(_vfs, OID_AUTO, maxbufspace, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &maxbufspace, __offsetof(struct bufdomain, bd_maxbufspace), sysctl_bufdomain_long, "L", "Maximum allowed value of bufspace (including metadata)"); static long bufmallocspace; SYSCTL_LONG(_vfs, OID_AUTO, bufmallocspace, CTLFLAG_RD, &bufmallocspace, 0, "Amount of malloced memory for buffers"); static long maxbufmallocspace; SYSCTL_LONG(_vfs, OID_AUTO, maxmallocbufspace, CTLFLAG_RW, &maxbufmallocspace, 0, "Maximum amount of malloced memory for buffers"); static long lobufspace; SYSCTL_PROC(_vfs, OID_AUTO, lobufspace, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &lobufspace, __offsetof(struct bufdomain, bd_lobufspace), sysctl_bufdomain_long, "L", "Minimum amount of buffers we want to have"); long hibufspace; SYSCTL_PROC(_vfs, OID_AUTO, hibufspace, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &hibufspace, __offsetof(struct bufdomain, 
bd_hibufspace), sysctl_bufdomain_long, "L", "Maximum allowed value of bufspace (excluding metadata)"); long bufspacethresh; SYSCTL_PROC(_vfs, OID_AUTO, bufspacethresh, CTLTYPE_LONG|CTLFLAG_MPSAFE|CTLFLAG_RW, &bufspacethresh, __offsetof(struct bufdomain, bd_bufspacethresh), sysctl_bufdomain_long, "L", "Bufspace consumed before waking the daemon to free some"); static counter_u64_t buffreekvacnt; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, buffreekvacnt, CTLFLAG_RW, &buffreekvacnt, "Number of times we have freed the KVA space from some buffer"); static counter_u64_t bufdefragcnt; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, bufdefragcnt, CTLFLAG_RW, &bufdefragcnt, "Number of times we have had to repeat buffer allocation to defragment"); static long lorunningspace; SYSCTL_PROC(_vfs, OID_AUTO, lorunningspace, CTLTYPE_LONG | CTLFLAG_MPSAFE | CTLFLAG_RW, &lorunningspace, 0, sysctl_runningspace, "L", "Minimum preferred space used for in-progress I/O"); static long hirunningspace; SYSCTL_PROC(_vfs, OID_AUTO, hirunningspace, CTLTYPE_LONG | CTLFLAG_MPSAFE | CTLFLAG_RW, &hirunningspace, 0, sysctl_runningspace, "L", "Maximum amount of space to use for in-progress I/O"); int dirtybufferflushes; SYSCTL_INT(_vfs, OID_AUTO, dirtybufferflushes, CTLFLAG_RW, &dirtybufferflushes, 0, "Number of bdwrite to bawrite conversions to limit dirty buffers"); int bdwriteskip; SYSCTL_INT(_vfs, OID_AUTO, bdwriteskip, CTLFLAG_RW, &bdwriteskip, 0, "Number of buffers supplied to bdwrite with snapshot deadlock risk"); int altbufferflushes; SYSCTL_INT(_vfs, OID_AUTO, altbufferflushes, CTLFLAG_RW, &altbufferflushes, 0, "Number of fsync flushes to limit dirty buffers"); static int recursiveflushes; SYSCTL_INT(_vfs, OID_AUTO, recursiveflushes, CTLFLAG_RW, &recursiveflushes, 0, "Number of flushes skipped due to being recursive"); static int sysctl_numdirtybuffers(SYSCTL_HANDLER_ARGS); SYSCTL_PROC(_vfs, OID_AUTO, numdirtybuffers, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RD, NULL, 0, sysctl_numdirtybuffers, "I", "Number of buffers that are dirty (has unwritten changes) at the moment"); static int lodirtybuffers; SYSCTL_PROC(_vfs, OID_AUTO, lodirtybuffers, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &lodirtybuffers, __offsetof(struct bufdomain, bd_lodirtybuffers), sysctl_bufdomain_int, "I", "How many buffers we want to have free before bufdaemon can sleep"); static int hidirtybuffers; SYSCTL_PROC(_vfs, OID_AUTO, hidirtybuffers, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &hidirtybuffers, __offsetof(struct bufdomain, bd_hidirtybuffers), sysctl_bufdomain_int, "I", "When the number of dirty buffers is considered severe"); int dirtybufthresh; SYSCTL_PROC(_vfs, OID_AUTO, dirtybufthresh, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &dirtybufthresh, __offsetof(struct bufdomain, bd_dirtybufthresh), sysctl_bufdomain_int, "I", "Number of bdwrite to bawrite conversions to clear dirty buffers"); static int numfreebuffers; SYSCTL_INT(_vfs, OID_AUTO, numfreebuffers, CTLFLAG_RD, &numfreebuffers, 0, "Number of free buffers"); static int lofreebuffers; SYSCTL_PROC(_vfs, OID_AUTO, lofreebuffers, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &lofreebuffers, __offsetof(struct bufdomain, bd_lofreebuffers), sysctl_bufdomain_int, "I", "Target number of free buffers"); static int hifreebuffers; SYSCTL_PROC(_vfs, OID_AUTO, hifreebuffers, CTLTYPE_INT|CTLFLAG_MPSAFE|CTLFLAG_RW, &hifreebuffers, __offsetof(struct bufdomain, bd_hifreebuffers), sysctl_bufdomain_int, "I", "Threshold for clean buffer recycling"); static counter_u64_t getnewbufcalls; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, getnewbufcalls, CTLFLAG_RD, 
&getnewbufcalls, "Number of calls to getnewbuf"); static counter_u64_t getnewbufrestarts; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, getnewbufrestarts, CTLFLAG_RD, &getnewbufrestarts, "Number of times getnewbuf has had to restart a buffer acquisition"); static counter_u64_t mappingrestarts; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, mappingrestarts, CTLFLAG_RD, &mappingrestarts, "Number of times getblk has had to restart a buffer mapping for " "unmapped buffer"); static counter_u64_t numbufallocfails; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, numbufallocfails, CTLFLAG_RW, &numbufallocfails, "Number of times buffer allocations failed"); static int flushbufqtarget = 100; SYSCTL_INT(_vfs, OID_AUTO, flushbufqtarget, CTLFLAG_RW, &flushbufqtarget, 0, "Amount of work to do in flushbufqueues when helping bufdaemon"); static counter_u64_t notbufdflushes; SYSCTL_COUNTER_U64(_vfs, OID_AUTO, notbufdflushes, CTLFLAG_RD, ¬bufdflushes, "Number of dirty buffer flushes done by the bufdaemon helpers"); static long barrierwrites; SYSCTL_LONG(_vfs, OID_AUTO, barrierwrites, CTLFLAG_RW, &barrierwrites, 0, "Number of barrier writes"); SYSCTL_INT(_vfs, OID_AUTO, unmapped_buf_allowed, CTLFLAG_RD, &unmapped_buf_allowed, 0, "Permit the use of the unmapped i/o"); int maxbcachebuf = MAXBCACHEBUF; SYSCTL_INT(_vfs, OID_AUTO, maxbcachebuf, CTLFLAG_RDTUN, &maxbcachebuf, 0, "Maximum size of a buffer cache block"); /* * This lock synchronizes access to bd_request. */ static struct mtx_padalign __exclusive_cache_line bdlock; /* * This lock protects the runningbufreq and synchronizes runningbufwakeup and * waitrunningbufspace(). */ static struct mtx_padalign __exclusive_cache_line rbreqlock; /* * Lock that protects bdirtywait. */ static struct mtx_padalign __exclusive_cache_line bdirtylock; /* * Wakeup point for bufdaemon, as well as indicator of whether it is already * active. Set to 1 when the bufdaemon is already "on" the queue, 0 when it * is idling. */ static int bd_request; /* * Request for the buf daemon to write more buffers than is indicated by * lodirtybuf. This may be necessary to push out excess dependencies or * defragment the address space where a simple count of the number of dirty * buffers is insufficient to characterize the demand for flushing them. */ static int bd_speedupreq; /* * Synchronization (sleep/wakeup) variable for active buffer space requests. * Set when wait starts, cleared prior to wakeup(). * Used in runningbufwakeup() and waitrunningbufspace(). */ static int runningbufreq; /* * Synchronization for bwillwrite() waiters. */ static int bdirtywait; /* * Definitions for the buffer free lists. */ #define QUEUE_NONE 0 /* on no queue */ #define QUEUE_EMPTY 1 /* empty buffer headers */ #define QUEUE_DIRTY 2 /* B_DELWRI buffers */ #define QUEUE_CLEAN 3 /* non-B_DELWRI buffers */ #define QUEUE_SENTINEL 4 /* not an queue index, but mark for sentinel */ /* Maximum number of buffer domains. */ #define BUF_DOMAINS 8 struct bufdomainset bdlodirty; /* Domains > lodirty */ struct bufdomainset bdhidirty; /* Domains > hidirty */ /* Configured number of clean queues. */ static int __read_mostly buf_domains; BITSET_DEFINE(bufdomainset, BUF_DOMAINS); struct bufdomain __exclusive_cache_line bdomain[BUF_DOMAINS]; struct bufqueue __exclusive_cache_line bqempty; /* * per-cpu empty buffer cache. */ uma_zone_t buf_zone; /* * Single global constant for BUF_WMESG, to avoid getting multiple references. * buf_wmesg is referred from macros. 
*/ const char *buf_wmesg = BUF_WMESG; static int sysctl_runningspace(SYSCTL_HANDLER_ARGS) { long value; int error; value = *(long *)arg1; error = sysctl_handle_long(oidp, &value, 0, req); if (error != 0 || req->newptr == NULL) return (error); mtx_lock(&rbreqlock); if (arg1 == &hirunningspace) { if (value < lorunningspace) error = EINVAL; else hirunningspace = value; } else { KASSERT(arg1 == &lorunningspace, ("%s: unknown arg1", __func__)); if (value > hirunningspace) error = EINVAL; else lorunningspace = value; } mtx_unlock(&rbreqlock); return (error); } static int sysctl_bufdomain_int(SYSCTL_HANDLER_ARGS) { int error; int value; int i; value = *(int *)arg1; error = sysctl_handle_int(oidp, &value, 0, req); if (error != 0 || req->newptr == NULL) return (error); *(int *)arg1 = value; for (i = 0; i < buf_domains; i++) *(int *)(uintptr_t)(((uintptr_t)&bdomain[i]) + arg2) = value / buf_domains; return (error); } static int sysctl_bufdomain_long(SYSCTL_HANDLER_ARGS) { long value; int error; int i; value = *(long *)arg1; error = sysctl_handle_long(oidp, &value, 0, req); if (error != 0 || req->newptr == NULL) return (error); *(long *)arg1 = value; for (i = 0; i < buf_domains; i++) *(long *)(uintptr_t)(((uintptr_t)&bdomain[i]) + arg2) = value / buf_domains; return (error); } #if defined(COMPAT_FREEBSD4) || defined(COMPAT_FREEBSD5) || \ defined(COMPAT_FREEBSD6) || defined(COMPAT_FREEBSD7) static int sysctl_bufspace(SYSCTL_HANDLER_ARGS) { long lvalue; int ivalue; int i; lvalue = 0; for (i = 0; i < buf_domains; i++) lvalue += bdomain[i].bd_bufspace; if (sizeof(int) == sizeof(long) || req->oldlen >= sizeof(long)) return (sysctl_handle_long(oidp, &lvalue, 0, req)); if (lvalue > INT_MAX) /* On overflow, still write out a long to trigger ENOMEM. */ return (sysctl_handle_long(oidp, &lvalue, 0, req)); ivalue = lvalue; return (sysctl_handle_int(oidp, &ivalue, 0, req)); } #else static int sysctl_bufspace(SYSCTL_HANDLER_ARGS) { long lvalue; int i; lvalue = 0; for (i = 0; i < buf_domains; i++) lvalue += bdomain[i].bd_bufspace; return (sysctl_handle_long(oidp, &lvalue, 0, req)); } #endif static int sysctl_numdirtybuffers(SYSCTL_HANDLER_ARGS) { int value; int i; value = 0; for (i = 0; i < buf_domains; i++) value += bdomain[i].bd_numdirtybuffers; return (sysctl_handle_int(oidp, &value, 0, req)); } /* * bdirtywakeup: * * Wakeup any bwillwrite() waiters. */ static void bdirtywakeup(void) { mtx_lock(&bdirtylock); if (bdirtywait) { bdirtywait = 0; wakeup(&bdirtywait); } mtx_unlock(&bdirtylock); } /* * bd_clear: * * Clear a domain from the appropriate bitsets when dirtybuffers * is decremented. */ static void bd_clear(struct bufdomain *bd) { mtx_lock(&bdirtylock); if (bd->bd_numdirtybuffers <= bd->bd_lodirtybuffers) BIT_CLR(BUF_DOMAINS, BD_DOMAIN(bd), &bdlodirty); if (bd->bd_numdirtybuffers <= bd->bd_hidirtybuffers) BIT_CLR(BUF_DOMAINS, BD_DOMAIN(bd), &bdhidirty); mtx_unlock(&bdirtylock); } /* * bd_set: * * Set a domain in the appropriate bitsets when dirtybuffers * is incremented. */ static void bd_set(struct bufdomain *bd) { mtx_lock(&bdirtylock); if (bd->bd_numdirtybuffers > bd->bd_lodirtybuffers) BIT_SET(BUF_DOMAINS, BD_DOMAIN(bd), &bdlodirty); if (bd->bd_numdirtybuffers > bd->bd_hidirtybuffers) BIT_SET(BUF_DOMAINS, BD_DOMAIN(bd), &bdhidirty); mtx_unlock(&bdirtylock); } /* * bdirtysub: * * Decrement the numdirtybuffers count by one and wakeup any * threads blocked in bwillwrite(). 
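The sysctl_bufdomain_long()/sysctl_bufdomain_int() handlers above accept a single global tunable and spread it evenly across every buffer domain by writing value / buf_domains into a field located by the byte offset passed as arg2. A minimal userland sketch of that offset-based fan-out follows; the struct, field, and function names and the four-domain count are hypothetical, not the kernel's.

#include <stddef.h>
#include <stdio.h>

#define NDOMAINS 4

struct domain {
	long d_maxspace;
	long d_hispace;
};

static struct domain domains[NDOMAINS];

/* Write value / NDOMAINS into the field at byte offset "off" of each domain. */
static void
fanout_long(size_t off, long value)
{
	int i;

	for (i = 0; i < NDOMAINS; i++)
		*(long *)((char *)&domains[i] + off) = value / NDOMAINS;
}

int
main(void)
{
	fanout_long(offsetof(struct domain, d_maxspace), 1000);
	printf("per-domain maxspace: %ld\n", domains[0].d_maxspace);	/* 250 */
	return (0);
}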
*/ static void bdirtysub(struct buf *bp) { struct bufdomain *bd; int num; bd = bufdomain(bp); num = atomic_fetchadd_int(&bd->bd_numdirtybuffers, -1); if (num == (bd->bd_lodirtybuffers + bd->bd_hidirtybuffers) / 2) bdirtywakeup(); if (num == bd->bd_lodirtybuffers || num == bd->bd_hidirtybuffers) bd_clear(bd); } /* * bdirtyadd: * * Increment the numdirtybuffers count by one and wakeup the buf * daemon if needed. */ static void bdirtyadd(struct buf *bp) { struct bufdomain *bd; int num; /* * Only do the wakeup once as we cross the boundary. The * buf daemon will keep running until the condition clears. */ bd = bufdomain(bp); num = atomic_fetchadd_int(&bd->bd_numdirtybuffers, 1); if (num == (bd->bd_lodirtybuffers + bd->bd_hidirtybuffers) / 2) bd_wakeup(); if (num == bd->bd_lodirtybuffers || num == bd->bd_hidirtybuffers) bd_set(bd); } /* * bufspace_daemon_wakeup: * * Wakeup the daemons responsible for freeing clean bufs. */ static void bufspace_daemon_wakeup(struct bufdomain *bd) { /* * avoid the lock if the daemon is running. */ if (atomic_fetchadd_int(&bd->bd_running, 1) == 0) { BD_RUN_LOCK(bd); atomic_store_int(&bd->bd_running, 1); wakeup(&bd->bd_running); BD_RUN_UNLOCK(bd); } } /* * bufspace_daemon_wait: * * Sleep until the domain falls below a limit or one second passes. */ static void bufspace_daemon_wait(struct bufdomain *bd) { /* * Re-check our limits and sleep. bd_running must be * cleared prior to checking the limits to avoid missed * wakeups. The waker will adjust one of bufspace or * freebuffers prior to checking bd_running. */ BD_RUN_LOCK(bd); atomic_store_int(&bd->bd_running, 0); if (bd->bd_bufspace < bd->bd_bufspacethresh && bd->bd_freebuffers > bd->bd_lofreebuffers) { msleep(&bd->bd_running, BD_RUN_LOCKPTR(bd), PRIBIO|PDROP, "-", hz); } else { /* Avoid spurious wakeups while running. */ atomic_store_int(&bd->bd_running, 1); BD_RUN_UNLOCK(bd); } } /* * bufspace_adjust: * * Adjust the reported bufspace for a KVA managed buffer, possibly * waking any waiters. */ static void bufspace_adjust(struct buf *bp, int bufsize) { struct bufdomain *bd; long space; int diff; KASSERT((bp->b_flags & B_MALLOC) == 0, ("bufspace_adjust: malloc buf %p", bp)); bd = bufdomain(bp); diff = bufsize - bp->b_bufsize; if (diff < 0) { atomic_subtract_long(&bd->bd_bufspace, -diff); } else if (diff > 0) { space = atomic_fetchadd_long(&bd->bd_bufspace, diff); /* Wake up the daemon on the transition. */ if (space < bd->bd_bufspacethresh && space + diff >= bd->bd_bufspacethresh) bufspace_daemon_wakeup(bd); } bp->b_bufsize = bufsize; } /* * bufspace_reserve: * * Reserve bufspace before calling allocbuf(). metadata has a * different space limit than data. */ static int bufspace_reserve(struct bufdomain *bd, int size, bool metadata) { long limit, new; long space; if (metadata) limit = bd->bd_maxbufspace; else limit = bd->bd_hibufspace; space = atomic_fetchadd_long(&bd->bd_bufspace, size); new = space + size; if (new > limit) { atomic_subtract_long(&bd->bd_bufspace, size); return (ENOSPC); } /* Wake up the daemon on the transition. */ if (space < bd->bd_bufspacethresh && new >= bd->bd_bufspacethresh) bufspace_daemon_wakeup(bd); return (0); } /* * bufspace_release: * * Release reserved bufspace after bufspace_adjust() has consumed it. */ static void bufspace_release(struct bufdomain *bd, int size) { atomic_subtract_long(&bd->bd_bufspace, size); } /* * bufspace_wait: * * Wait for bufspace, acting as the buf daemon if a locked vnode is * supplied. bd_wanted must be set prior to polling for space. 
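bufspace_adjust() and bufspace_reserve() above wake the bufspace daemon only on the transition across bd_bufspacethresh: atomic_fetchadd_long() returns the pre-add total, so comparing the old and new totals against the threshold fires the wakeup exactly once per crossing. A minimal sketch of the same idiom with C11 atomics; the threshold value, names, and the wake_daemon() stub are assumptions for illustration.

#include <stdatomic.h>
#include <stdio.h>

static atomic_long space;
static long thresh = 100;

static void
wake_daemon(void)
{
	printf("daemon woken\n");
}

static void
space_add(long diff)
{
	long old;

	old = atomic_fetch_add(&space, diff);
	if (old < thresh && old + diff >= thresh)
		wake_daemon();	/* fires once, on the crossing only */
}

int
main(void)
{
	space_add(60);	/* 0 -> 60: below the threshold, no wakeup */
	space_add(60);	/* 60 -> 120: crosses 100, wakes the daemon */
	space_add(60);	/* 120 -> 180: already above, no wakeup */
	return (0);
}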
The * operation must be re-tried on return. */ static void bufspace_wait(struct bufdomain *bd, struct vnode *vp, int gbflags, int slpflag, int slptimeo) { struct thread *td; int error, fl, norunbuf; if ((gbflags & GB_NOWAIT_BD) != 0) return; td = curthread; BD_LOCK(bd); while (bd->bd_wanted) { if (vp != NULL && vp->v_type != VCHR && (td->td_pflags & TDP_BUFNEED) == 0) { BD_UNLOCK(bd); /* * getblk() is called with a vnode locked, and * some majority of the dirty buffers may as * well belong to the vnode. Flushing the * buffers there would make a progress that * cannot be achieved by the buf_daemon, that * cannot lock the vnode. */ norunbuf = ~(TDP_BUFNEED | TDP_NORUNNINGBUF) | (td->td_pflags & TDP_NORUNNINGBUF); /* * Play bufdaemon. The getnewbuf() function * may be called while the thread owns lock * for another dirty buffer for the same * vnode, which makes it impossible to use * VOP_FSYNC() there, due to the buffer lock * recursion. */ td->td_pflags |= TDP_BUFNEED | TDP_NORUNNINGBUF; fl = buf_flush(vp, bd, flushbufqtarget); td->td_pflags &= norunbuf; BD_LOCK(bd); if (fl != 0) continue; if (bd->bd_wanted == 0) break; } error = msleep(&bd->bd_wanted, BD_LOCKPTR(bd), (PRIBIO + 4) | slpflag, "newbuf", slptimeo); if (error != 0) break; } BD_UNLOCK(bd); } /* * bufspace_daemon: * * buffer space management daemon. Tries to maintain some marginal * amount of free buffer space so that requesting processes neither * block nor work to reclaim buffers. */ static void bufspace_daemon(void *arg) { struct bufdomain *bd; EVENTHANDLER_REGISTER(shutdown_pre_sync, kthread_shutdown, curthread, SHUTDOWN_PRI_LAST + 100); bd = arg; for (;;) { kthread_suspend_check(); /* * Free buffers from the clean queue until we meet our * targets. * * Theory of operation: The buffer cache is most efficient * when some free buffer headers and space are always * available to getnewbuf(). This daemon attempts to prevent * the excessive blocking and synchronization associated * with shortfall. It goes through three phases according * demand: * * 1) The daemon wakes up voluntarily once per-second * during idle periods when the counters are below * the wakeup thresholds (bufspacethresh, lofreebuffers). * * 2) The daemon wakes up as we cross the thresholds * ahead of any potential blocking. This may bounce * slightly according to the rate of consumption and * release. * * 3) The daemon and consumers are starved for working * clean buffers. This is the 'bufspace' sleep below * which will inefficiently trade bufs with bqrelse * until we return to condition 2. */ while (bd->bd_bufspace > bd->bd_lobufspace || bd->bd_freebuffers < bd->bd_hifreebuffers) { if (buf_recycle(bd, false) != 0) { if (bd_flushall(bd)) continue; /* * Speedup dirty if we've run out of clean * buffers. This is possible in particular * because softdep may held many bufs locked * pending writes to other bufs which are * marked for delayed write, exhausting * clean space until they are written. */ bd_speedup(); BD_LOCK(bd); if (bd->bd_wanted) { msleep(&bd->bd_wanted, BD_LOCKPTR(bd), PRIBIO|PDROP, "bufspace", hz/10); } else BD_UNLOCK(bd); } maybe_yield(); } bufspace_daemon_wait(bd); } } /* * bufmallocadjust: * * Adjust the reported bufspace for a malloc managed buffer, possibly * waking any waiters. 
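bufspace_daemon_wakeup() and bufspace_daemon_wait() above avoid both lost wakeups and needless locking: the waker takes the run lock only on the 0 -> 1 transition of bd_running, and the daemon clears bd_running before re-checking its limits so a wakeup that races with it is still observed. A rough pthread sketch of that handshake, with hypothetical names and a plain pending_work counter standing in for the bufspace/freebuffers limits; the kernel daemon additionally bounds its sleep at one second, which the sketch omits.

#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t run_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t run_cv = PTHREAD_COND_INITIALIZER;
static atomic_int running;
static atomic_long pending_work;

/* Callers update pending_work before calling this. */
static void
daemon_wakeup(void)
{
	/* Skip the lock entirely if the daemon is already awake. */
	if (atomic_fetch_add(&running, 1) == 0) {
		pthread_mutex_lock(&run_lock);
		atomic_store(&running, 1);
		pthread_cond_signal(&run_cv);
		pthread_mutex_unlock(&run_lock);
	}
}

static void
daemon_wait(void)
{
	pthread_mutex_lock(&run_lock);
	/* Clear "running" before checking for work so a racing wakeup is seen. */
	atomic_store(&running, 0);
	while (atomic_load(&pending_work) == 0)
		pthread_cond_wait(&run_cv, &run_lock);
	atomic_store(&running, 1);
	pthread_mutex_unlock(&run_lock);
}

int
main(void)
{
	atomic_fetch_add(&pending_work, 1);	/* produce work ... */
	daemon_wakeup();			/* ... then notify */
	daemon_wait();				/* returns at once: work pending */
	return (0);
}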
*/ static void bufmallocadjust(struct buf *bp, int bufsize) { int diff; KASSERT((bp->b_flags & B_MALLOC) != 0, ("bufmallocadjust: non-malloc buf %p", bp)); diff = bufsize - bp->b_bufsize; if (diff < 0) atomic_subtract_long(&bufmallocspace, -diff); else atomic_add_long(&bufmallocspace, diff); bp->b_bufsize = bufsize; } /* * runningwakeup: * * Wake up processes that are waiting on asynchronous writes to fall * below lorunningspace. */ static void runningwakeup(void) { mtx_lock(&rbreqlock); if (runningbufreq) { runningbufreq = 0; wakeup(&runningbufreq); } mtx_unlock(&rbreqlock); } /* * runningbufwakeup: * * Decrement the outstanding write count according. */ void runningbufwakeup(struct buf *bp) { long space, bspace; bspace = bp->b_runningbufspace; if (bspace == 0) return; space = atomic_fetchadd_long(&runningbufspace, -bspace); KASSERT(space >= bspace, ("runningbufspace underflow %ld %ld", space, bspace)); bp->b_runningbufspace = 0; /* * Only acquire the lock and wakeup on the transition from exceeding * the threshold to falling below it. */ if (space < lorunningspace) return; if (space - bspace > lorunningspace) return; runningwakeup(); } /* * waitrunningbufspace() * * runningbufspace is a measure of the amount of I/O currently * running. This routine is used in async-write situations to * prevent creating huge backups of pending writes to a device. * Only asynchronous writes are governed by this function. * * This does NOT turn an async write into a sync write. It waits * for earlier writes to complete and generally returns before the * caller's write has reached the device. */ void waitrunningbufspace(void) { mtx_lock(&rbreqlock); while (runningbufspace > hirunningspace) { runningbufreq = 1; msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0); } mtx_unlock(&rbreqlock); } /* * vfs_buf_test_cache: * * Called when a buffer is extended. This function clears the B_CACHE * bit if the newly extended portion of the buffer does not contain * valid data. */ static __inline void vfs_buf_test_cache(struct buf *bp, vm_ooffset_t foff, vm_offset_t off, vm_offset_t size, vm_page_t m) { VM_OBJECT_ASSERT_LOCKED(m->object); if (bp->b_flags & B_CACHE) { int base = (foff + off) & PAGE_MASK; if (vm_page_is_valid(m, base, size) == 0) bp->b_flags &= ~B_CACHE; } } /* Wake up the buffer daemon if necessary */ static void bd_wakeup(void) { mtx_lock(&bdlock); if (bd_request == 0) { bd_request = 1; wakeup(&bd_request); } mtx_unlock(&bdlock); } /* * Adjust the maxbcachbuf tunable. */ static void maxbcachebuf_adjust(void) { int i; /* * maxbcachebuf must be a power of 2 >= MAXBSIZE. */ i = 2; while (i * 2 <= maxbcachebuf) i *= 2; maxbcachebuf = i; if (maxbcachebuf < MAXBSIZE) maxbcachebuf = MAXBSIZE; if (maxbcachebuf > MAXPHYS) maxbcachebuf = MAXPHYS; if (bootverbose != 0 && maxbcachebuf != MAXBCACHEBUF) printf("maxbcachebuf=%d\n", maxbcachebuf); } /* * bd_speedup - speedup the buffer cache flushing code */ void bd_speedup(void) { int needwake; mtx_lock(&bdlock); needwake = 0; if (bd_speedupreq == 0 || bd_request == 0) needwake = 1; bd_speedupreq = 1; bd_request = 1; if (needwake) wakeup(&bd_request); mtx_unlock(&bdlock); } #ifdef __i386__ #define TRANSIENT_DENOM 5 #else #define TRANSIENT_DENOM 10 #endif /* * Calculating buffer cache scaling values and reserve space for buffer * headers. This is called during low level kernel initialization and * may be called more then once. We CANNOT write to the memory area * being reserved at this time. 
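maxbcachebuf_adjust() above forces the maxbcachebuf tunable down to a power of two and clamps it between MAXBSIZE and MAXPHYS. A worked userland example of the same rounding; the 64 KiB and 128 KiB stand-ins for those two kernel constants are assumptions, not the values on every platform.

#include <stdio.h>

#define MAXBSIZE_EX (64 * 1024)		/* stand-in for MAXBSIZE */
#define MAXPHYS_EX (128 * 1024)		/* stand-in for MAXPHYS */

static int
adjust(int request)
{
	int i;

	/* Largest power of two that is <= request. */
	i = 2;
	while (i * 2 <= request)
		i *= 2;
	if (i < MAXBSIZE_EX)
		i = MAXBSIZE_EX;
	if (i > MAXPHYS_EX)
		i = MAXPHYS_EX;
	return (i);
}

int
main(void)
{
	printf("%d\n", adjust(100 * 1024));	/* 65536: rounded down to 2^16 */
	printf("%d\n", adjust(1024));		/* 65536: clamped up to MAXBSIZE_EX */
	printf("%d\n", adjust(1024 * 1024));	/* 131072: clamped down to MAXPHYS_EX */
	return (0);
}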
*/ caddr_t kern_vfs_bio_buffer_alloc(caddr_t v, long physmem_est) { int tuned_nbuf; long maxbuf, maxbuf_sz, buf_sz, biotmap_sz; /* * physmem_est is in pages. Convert it to kilobytes (assumes * PAGE_SIZE is >= 1K) */ physmem_est = physmem_est * (PAGE_SIZE / 1024); maxbcachebuf_adjust(); /* * The nominal buffer size (and minimum KVA allocation) is BKVASIZE. * For the first 64MB of ram nominally allocate sufficient buffers to * cover 1/4 of our ram. Beyond the first 64MB allocate additional * buffers to cover 1/10 of our ram over 64MB. When auto-sizing * the buffer cache we limit the eventual kva reservation to * maxbcache bytes. * * factor represents the 1/4 x ram conversion. */ if (nbuf == 0) { int factor = 4 * BKVASIZE / 1024; nbuf = 50; if (physmem_est > 4096) nbuf += min((physmem_est - 4096) / factor, 65536 / factor); if (physmem_est > 65536) nbuf += min((physmem_est - 65536) * 2 / (factor * 5), 32 * 1024 * 1024 / (factor * 5)); if (maxbcache && nbuf > maxbcache / BKVASIZE) nbuf = maxbcache / BKVASIZE; tuned_nbuf = 1; } else tuned_nbuf = 0; /* XXX Avoid unsigned long overflows later on with maxbufspace. */ maxbuf = (LONG_MAX / 3) / BKVASIZE; if (nbuf > maxbuf) { if (!tuned_nbuf) printf("Warning: nbufs lowered from %d to %ld\n", nbuf, maxbuf); nbuf = maxbuf; } /* * Ideal allocation size for the transient bio submap is 10% * of the maximal space buffer map. This roughly corresponds * to the amount of the buffer mapped for typical UFS load. * * Clip the buffer map to reserve space for the transient * BIOs, if its extent is bigger than 90% (80% on i386) of the * maximum buffer map extent on the platform. * * The fall-back to the maxbuf in case of maxbcache unset, * allows to not trim the buffer KVA for the architectures * with ample KVA space. */ if (bio_transient_maxcnt == 0 && unmapped_buf_allowed) { maxbuf_sz = maxbcache != 0 ? maxbcache : maxbuf * BKVASIZE; buf_sz = (long)nbuf * BKVASIZE; if (buf_sz < maxbuf_sz / TRANSIENT_DENOM * (TRANSIENT_DENOM - 1)) { /* * There is more KVA than memory. Do not * adjust buffer map size, and assign the rest * of maxbuf to transient map. */ biotmap_sz = maxbuf_sz - buf_sz; } else { /* * Buffer map spans all KVA we could afford on * this platform. Give 10% (20% on i386) of * the buffer map to the transient bio map. */ biotmap_sz = buf_sz / TRANSIENT_DENOM; buf_sz -= biotmap_sz; } if (biotmap_sz / INT_MAX > MAXPHYS) bio_transient_maxcnt = INT_MAX; else bio_transient_maxcnt = biotmap_sz / MAXPHYS; /* * Artificially limit to 1024 simultaneous in-flight I/Os * using the transient mapping. */ if (bio_transient_maxcnt > 1024) bio_transient_maxcnt = 1024; if (tuned_nbuf) nbuf = buf_sz / BKVASIZE; } if (nswbuf == 0) { nswbuf = min(nbuf / 4, 256); if (nswbuf < NSWBUF_MIN) nswbuf = NSWBUF_MIN; } /* * Reserve space for the buffer cache buffers */ buf = (void *)v; v = (caddr_t)(buf + nbuf); return(v); } /* Initialize the buffer subsystem. Called before use of any buffers. 
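The nbuf auto-tuning in kern_vfs_bio_buffer_alloc() above amounts to one BKVASIZE buffer per 64 KiB of the first 64 MiB of RAM (1/4 coverage) plus one per 160 KiB beyond that (1/10 coverage), with both terms capped. A worked userland sketch of that arithmetic, assuming the common 16 KiB BKVASIZE; names and constants carrying an _EX suffix are illustrative stand-ins.

#include <stdio.h>

#define BKVASIZE_EX 16384L
#define MIN_EX(a, b) ((a) < (b) ? (a) : (b))

static long
tune_nbuf(long physmem_kb)
{
	long factor = 4 * BKVASIZE_EX / 1024;	/* 64: one buf per 64 KiB => 1/4 */
	long nbuf = 50;

	if (physmem_kb > 4096)
		nbuf += MIN_EX((physmem_kb - 4096) / factor, 65536 / factor);
	if (physmem_kb > 65536)
		nbuf += MIN_EX((physmem_kb - 65536) * 2 / (factor * 5),
		    32 * 1024 * 1024 / (factor * 5));
	return (nbuf);
}

int
main(void)
{
	/* A 1 GiB machine works out to 50 + 1024 + 6144 = 7218 buffer headers. */
	printf("nbuf for 1 GiB: %ld\n", tune_nbuf(1024 * 1024));
	return (0);
}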
*/ void bufinit(void) { struct buf *bp; int i; KASSERT(maxbcachebuf >= MAXBSIZE, ("maxbcachebuf (%d) must be >= MAXBSIZE (%d)\n", maxbcachebuf, MAXBSIZE)); bq_init(&bqempty, QUEUE_EMPTY, -1, "bufq empty lock"); mtx_init(&rbreqlock, "runningbufspace lock", NULL, MTX_DEF); mtx_init(&bdlock, "buffer daemon lock", NULL, MTX_DEF); mtx_init(&bdirtylock, "dirty buf lock", NULL, MTX_DEF); unmapped_buf = (caddr_t)kva_alloc(MAXPHYS); /* finally, initialize each buffer header and stick on empty q */ for (i = 0; i < nbuf; i++) { bp = &buf[i]; bzero(bp, sizeof *bp); bp->b_flags = B_INVAL; bp->b_rcred = NOCRED; bp->b_wcred = NOCRED; bp->b_qindex = QUEUE_NONE; bp->b_domain = -1; bp->b_subqueue = mp_maxid + 1; bp->b_xflags = 0; bp->b_data = bp->b_kvabase = unmapped_buf; LIST_INIT(&bp->b_dep); BUF_LOCKINIT(bp); bq_insert(&bqempty, bp, false); } /* * maxbufspace is the absolute maximum amount of buffer space we are * allowed to reserve in KVM and in real terms. The absolute maximum * is nominally used by metadata. hibufspace is the nominal maximum * used by most other requests. The differential is required to * ensure that metadata deadlocks don't occur. * * maxbufspace is based on BKVASIZE. Allocating buffers larger then * this may result in KVM fragmentation which is not handled optimally * by the system. XXX This is less true with vmem. We could use * PAGE_SIZE. */ maxbufspace = (long)nbuf * BKVASIZE; hibufspace = lmax(3 * maxbufspace / 4, maxbufspace - maxbcachebuf * 10); lobufspace = (hibufspace / 20) * 19; /* 95% */ bufspacethresh = lobufspace + (hibufspace - lobufspace) / 2; /* * Note: The 16 MiB upper limit for hirunningspace was chosen * arbitrarily and may need further tuning. It corresponds to * 128 outstanding write IO requests (if IO size is 128 KiB), * which fits with many RAID controllers' tagged queuing limits. * The lower 1 MiB limit is the historical upper limit for * hirunningspace. */ hirunningspace = lmax(lmin(roundup(hibufspace / 64, maxbcachebuf), 16 * 1024 * 1024), 1024 * 1024); lorunningspace = roundup((hirunningspace * 2) / 3, maxbcachebuf); /* * Limit the amount of malloc memory since it is wired permanently into * the kernel space. Even though this is accounted for in the buffer * allocation, we don't want the malloced region to grow uncontrolled. * The malloc scheme improves memory utilization significantly on * average (small) directories. */ maxbufmallocspace = hibufspace / 20; /* * Reduce the chance of a deadlock occurring by limiting the number * of delayed-write dirty buffers we allow to stack up. */ hidirtybuffers = nbuf / 4 + 20; dirtybufthresh = hidirtybuffers * 9 / 10; /* * To support extreme low-memory systems, make sure hidirtybuffers * cannot eat up all available buffer space. This occurs when our * minimum cannot be met. We try to size hidirtybuffers to 3/4 our * buffer space assuming BKVASIZE'd buffers. */ while ((long)hidirtybuffers * BKVASIZE > 3 * hibufspace / 4) { hidirtybuffers >>= 1; } lodirtybuffers = hidirtybuffers / 2; /* * lofreebuffers should be sufficient to avoid stalling waiting on * buf headers under heavy utilization. The bufs in per-cpu caches * are counted as free but will be unavailable to threads executing * on other cpus. * * hifreebuffers is the free target for the bufspace daemon. This * should be set appropriately to limit work per-iteration. */ lofreebuffers = MIN((nbuf / 25) + (20 * mp_ncpus), 128 * mp_ncpus); hifreebuffers = (3 * lofreebuffers) / 2; numfreebuffers = nbuf; /* Setup the kva and free list allocators. 
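bufinit() above derives the bufspace watermarks from nbuf: hibufspace leaves roughly ten maxbcachebuf blocks of metadata headroom under maxbufspace (but never drops below 3/4 of it), lobufspace is 95% of hibufspace, and bufspacethresh sits halfway between the two. A small sketch of those relationships, using assumed stand-in constants (16 KiB BKVASIZE, 64 KiB maxbcachebuf) and the nbuf figure from the previous example.

#include <stdio.h>

#define BKVASIZE_EX 16384L
#define MAXBCACHEBUF_EX 65536L

int
main(void)
{
	long nbuf = 7218;	/* e.g. the 1 GiB tuning result above */
	long maxbufspace, hibufspace, lobufspace, bufspacethresh;

	maxbufspace = nbuf * BKVASIZE_EX;
	hibufspace = maxbufspace - MAXBCACHEBUF_EX * 10;
	if (hibufspace < 3 * maxbufspace / 4)
		hibufspace = 3 * maxbufspace / 4;
	lobufspace = (hibufspace / 20) * 19;	/* 95% */
	bufspacethresh = lobufspace + (hibufspace - lobufspace) / 2;
	printf("max %ld hi %ld lo %ld thresh %ld\n",
	    maxbufspace, hibufspace, lobufspace, bufspacethresh);
	return (0);
}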
*/ vmem_set_reclaim(buffer_arena, bufkva_reclaim); buf_zone = uma_zcache_create("buf free cache", sizeof(struct buf), NULL, NULL, NULL, NULL, buf_import, buf_release, NULL, 0); /* * Size the clean queue according to the amount of buffer space. * One queue per-256mb up to the max. More queues gives better * concurrency but less accurate LRU. */ buf_domains = MIN(howmany(maxbufspace, 256*1024*1024), BUF_DOMAINS); for (i = 0 ; i < buf_domains; i++) { struct bufdomain *bd; bd = &bdomain[i]; bd_init(bd); bd->bd_freebuffers = nbuf / buf_domains; bd->bd_hifreebuffers = hifreebuffers / buf_domains; bd->bd_lofreebuffers = lofreebuffers / buf_domains; bd->bd_bufspace = 0; bd->bd_maxbufspace = maxbufspace / buf_domains; bd->bd_hibufspace = hibufspace / buf_domains; bd->bd_lobufspace = lobufspace / buf_domains; bd->bd_bufspacethresh = bufspacethresh / buf_domains; bd->bd_numdirtybuffers = 0; bd->bd_hidirtybuffers = hidirtybuffers / buf_domains; bd->bd_lodirtybuffers = lodirtybuffers / buf_domains; bd->bd_dirtybufthresh = dirtybufthresh / buf_domains; /* Don't allow more than 2% of bufs in the per-cpu caches. */ bd->bd_lim = nbuf / buf_domains / 50 / mp_ncpus; } getnewbufcalls = counter_u64_alloc(M_WAITOK); getnewbufrestarts = counter_u64_alloc(M_WAITOK); mappingrestarts = counter_u64_alloc(M_WAITOK); numbufallocfails = counter_u64_alloc(M_WAITOK); notbufdflushes = counter_u64_alloc(M_WAITOK); buffreekvacnt = counter_u64_alloc(M_WAITOK); bufdefragcnt = counter_u64_alloc(M_WAITOK); bufkvaspace = counter_u64_alloc(M_WAITOK); } #ifdef INVARIANTS static inline void vfs_buf_check_mapped(struct buf *bp) { KASSERT(bp->b_kvabase != unmapped_buf, ("mapped buf: b_kvabase was not updated %p", bp)); KASSERT(bp->b_data != unmapped_buf, ("mapped buf: b_data was not updated %p", bp)); KASSERT(bp->b_data < unmapped_buf || bp->b_data >= unmapped_buf + MAXPHYS, ("b_data + b_offset unmapped %p", bp)); } static inline void vfs_buf_check_unmapped(struct buf *bp) { KASSERT(bp->b_data == unmapped_buf, ("unmapped buf: corrupted b_data %p", bp)); } #define BUF_CHECK_MAPPED(bp) vfs_buf_check_mapped(bp) #define BUF_CHECK_UNMAPPED(bp) vfs_buf_check_unmapped(bp) #else #define BUF_CHECK_MAPPED(bp) do {} while (0) #define BUF_CHECK_UNMAPPED(bp) do {} while (0) #endif static int isbufbusy(struct buf *bp) { if (((bp->b_flags & B_INVAL) == 0 && BUF_ISLOCKED(bp)) || ((bp->b_flags & (B_DELWRI | B_INVAL)) == B_DELWRI)) return (1); return (0); } /* * Shutdown the system cleanly to prepare for reboot, halt, or power off. */ void bufshutdown(int show_busybufs) { static int first_buf_printf = 1; struct buf *bp; int iter, nbusy, pbusy; #ifndef PREEMPTION int subiter; #endif /* * Sync filesystems for shutdown */ wdog_kern_pat(WD_LASTVAL); sys_sync(curthread, NULL); /* * With soft updates, some buffers that are * written will be remarked as dirty until other * buffers are written. */ for (iter = pbusy = 0; iter < 20; iter++) { nbusy = 0; for (bp = &buf[nbuf]; --bp >= buf; ) if (isbufbusy(bp)) nbusy++; if (nbusy == 0) { if (first_buf_printf) printf("All buffers synced."); break; } if (first_buf_printf) { printf("Syncing disks, buffers remaining... "); first_buf_printf = 0; } printf("%d ", nbusy); if (nbusy < pbusy) iter = 0; pbusy = nbusy; wdog_kern_pat(WD_LASTVAL); sys_sync(curthread, NULL); #ifdef PREEMPTION /* * Spin for a while to allow interrupt threads to run. */ DELAY(50000 * iter); #else /* * Context switch several times to allow interrupt * threads to run. 
*/ for (subiter = 0; subiter < 50 * iter; subiter++) { thread_lock(curthread); mi_switch(SW_VOL, NULL); thread_unlock(curthread); DELAY(1000); } #endif } printf("\n"); /* * Count only busy local buffers to prevent forcing * a fsck if we're just a client of a wedged NFS server */ nbusy = 0; for (bp = &buf[nbuf]; --bp >= buf; ) { if (isbufbusy(bp)) { #if 0 /* XXX: This is bogus. We should probably have a BO_REMOTE flag instead */ if (bp->b_dev == NULL) { TAILQ_REMOVE(&mountlist, bp->b_vp->v_mount, mnt_list); continue; } #endif nbusy++; if (show_busybufs > 0) { printf( "%d: buf:%p, vnode:%p, flags:%0x, blkno:%jd, lblkno:%jd, buflock:", nbusy, bp, bp->b_vp, bp->b_flags, (intmax_t)bp->b_blkno, (intmax_t)bp->b_lblkno); BUF_LOCKPRINTINFO(bp); if (show_busybufs > 1) vn_printf(bp->b_vp, "vnode content: "); } } } if (nbusy) { /* * Failed to sync all blocks. Indicate this and don't * unmount filesystems (thus forcing an fsck on reboot). */ printf("Giving up on %d buffers\n", nbusy); DELAY(5000000); /* 5 seconds */ } else { if (!first_buf_printf) printf("Final sync complete\n"); /* * Unmount filesystems */ if (panicstr == NULL) vfs_unmountall(); } swapoff_all(); DELAY(100000); /* wait for console output to finish */ } static void bpmap_qenter(struct buf *bp) { BUF_CHECK_MAPPED(bp); /* * bp->b_data is relative to bp->b_offset, but * bp->b_offset may be offset into the first page. */ bp->b_data = (caddr_t)trunc_page((vm_offset_t)bp->b_data); pmap_qenter((vm_offset_t)bp->b_data, bp->b_pages, bp->b_npages); bp->b_data = (caddr_t)((vm_offset_t)bp->b_data | (vm_offset_t)(bp->b_offset & PAGE_MASK)); } static inline struct bufdomain * bufdomain(struct buf *bp) { return (&bdomain[bp->b_domain]); } static struct bufqueue * bufqueue(struct buf *bp) { switch (bp->b_qindex) { case QUEUE_NONE: /* FALLTHROUGH */ case QUEUE_SENTINEL: return (NULL); case QUEUE_EMPTY: return (&bqempty); case QUEUE_DIRTY: return (&bufdomain(bp)->bd_dirtyq); case QUEUE_CLEAN: return (&bufdomain(bp)->bd_subq[bp->b_subqueue]); default: break; } panic("bufqueue(%p): Unhandled type %d\n", bp, bp->b_qindex); } /* * Return the locked bufqueue that bp is a member of. */ static struct bufqueue * bufqueue_acquire(struct buf *bp) { struct bufqueue *bq, *nbq; /* * bp can be pushed from a per-cpu queue to the * cleanq while we're waiting on the lock. Retry * if the queues don't match. */ bq = bufqueue(bp); BQ_LOCK(bq); for (;;) { nbq = bufqueue(bp); if (bq == nbq) break; BQ_UNLOCK(bq); BQ_LOCK(nbq); bq = nbq; } return (bq); } /* * binsfree: * * Insert the buffer into the appropriate free list. Requires a * locked buffer on entry and buffer is unlocked before return. */ static void binsfree(struct buf *bp, int qindex) { struct bufdomain *bd; struct bufqueue *bq; KASSERT(qindex == QUEUE_CLEAN || qindex == QUEUE_DIRTY, ("binsfree: Invalid qindex %d", qindex)); BUF_ASSERT_XLOCKED(bp); /* * Handle delayed bremfree() processing. */ if (bp->b_flags & B_REMFREE) { if (bp->b_qindex == qindex) { bp->b_flags |= B_REUSE; bp->b_flags &= ~B_REMFREE; BUF_UNLOCK(bp); return; } bq = bufqueue_acquire(bp); bq_remove(bq, bp); BQ_UNLOCK(bq); } bd = bufdomain(bp); if (qindex == QUEUE_CLEAN) { if (bd->bd_lim != 0) bq = &bd->bd_subq[PCPU_GET(cpuid)]; else bq = bd->bd_cleanq; } else bq = &bd->bd_dirtyq; bq_insert(bq, bp, true); } /* * buf_free: * * Free a buffer to the buf zone once it no longer has valid contents. 
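bufqueue_acquire() above has to cope with a buffer migrating from a per-CPU subqueue to the clean queue while the caller is blocked on a queue lock, so it locks the queue it last observed and re-checks until the two agree. A minimal pthread sketch of that lock-and-revalidate loop; the types are hypothetical, and it assumes i_queue is only changed while the relevant queue lock is held.

#include <pthread.h>

struct queue {
	pthread_mutex_t q_lock;
};

struct item {
	struct queue *i_queue;	/* may change while we sleep on a lock */
};

static struct queue *
item_queue_acquire(struct item *it)
{
	struct queue *q, *nq;

	q = it->i_queue;
	pthread_mutex_lock(&q->q_lock);
	for (;;) {
		nq = it->i_queue;
		if (q == nq)
			break;		/* still on the queue we locked */
		pthread_mutex_unlock(&q->q_lock);
		pthread_mutex_lock(&nq->q_lock);
		q = nq;
	}
	return (q);	/* returned with its lock held */
}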
*/ static void buf_free(struct buf *bp) { if (bp->b_flags & B_REMFREE) bremfreef(bp); if (bp->b_vflags & BV_BKGRDINPROG) panic("losing buffer 1"); if (bp->b_rcred != NOCRED) { crfree(bp->b_rcred); bp->b_rcred = NOCRED; } if (bp->b_wcred != NOCRED) { crfree(bp->b_wcred); bp->b_wcred = NOCRED; } if (!LIST_EMPTY(&bp->b_dep)) buf_deallocate(bp); bufkva_free(bp); atomic_add_int(&bufdomain(bp)->bd_freebuffers, 1); BUF_UNLOCK(bp); uma_zfree(buf_zone, bp); } /* * buf_import: * * Import bufs into the uma cache from the buf list. The system still * expects a static array of bufs and much of the synchronization * around bufs assumes type stable storage. As a result, UMA is used * only as a per-cpu cache of bufs still maintained on a global list. */ static int buf_import(void *arg, void **store, int cnt, int domain, int flags) { struct buf *bp; int i; BQ_LOCK(&bqempty); for (i = 0; i < cnt; i++) { bp = TAILQ_FIRST(&bqempty.bq_queue); if (bp == NULL) break; bq_remove(&bqempty, bp); store[i] = bp; } BQ_UNLOCK(&bqempty); return (i); } /* * buf_release: * * Release bufs from the uma cache back to the buffer queues. */ static void buf_release(void *arg, void **store, int cnt) { struct bufqueue *bq; struct buf *bp; int i; bq = &bqempty; BQ_LOCK(bq); for (i = 0; i < cnt; i++) { bp = store[i]; /* Inline bq_insert() to batch locking. */ TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist); bp->b_flags &= ~(B_AGE | B_REUSE); bq->bq_len++; bp->b_qindex = bq->bq_index; } BQ_UNLOCK(bq); } /* * buf_alloc: * * Allocate an empty buffer header. */ static struct buf * buf_alloc(struct bufdomain *bd) { struct buf *bp; int freebufs; /* * We can only run out of bufs in the buf zone if the average buf * is less than BKVASIZE. In this case the actual wait/block will * come from buf_reycle() failing to flush one of these small bufs. */ bp = NULL; freebufs = atomic_fetchadd_int(&bd->bd_freebuffers, -1); if (freebufs > 0) bp = uma_zalloc(buf_zone, M_NOWAIT); if (bp == NULL) { atomic_add_int(&bd->bd_freebuffers, 1); bufspace_daemon_wakeup(bd); counter_u64_add(numbufallocfails, 1); return (NULL); } /* * Wake-up the bufspace daemon on transition below threshold. */ if (freebufs == bd->bd_lofreebuffers) bufspace_daemon_wakeup(bd); if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0) panic("getnewbuf_empty: Locked buf %p on free queue.", bp); KASSERT(bp->b_vp == NULL, ("bp: %p still has vnode %p.", bp, bp->b_vp)); KASSERT((bp->b_flags & (B_DELWRI | B_NOREUSE)) == 0, ("invalid buffer %p flags %#x", bp, bp->b_flags)); KASSERT((bp->b_xflags & (BX_VNCLEAN|BX_VNDIRTY)) == 0, ("bp: %p still on a buffer list. xflags %X", bp, bp->b_xflags)); KASSERT(bp->b_npages == 0, ("bp: %p still has %d vm pages\n", bp, bp->b_npages)); KASSERT(bp->b_kvasize == 0, ("bp: %p still has kva\n", bp)); KASSERT(bp->b_bufsize == 0, ("bp: %p still has bufspace\n", bp)); bp->b_domain = BD_DOMAIN(bd); bp->b_flags = 0; bp->b_ioflags = 0; bp->b_xflags = 0; bp->b_vflags = 0; bp->b_vp = NULL; bp->b_blkno = bp->b_lblkno = 0; bp->b_offset = NOOFFSET; bp->b_iodone = 0; bp->b_error = 0; bp->b_resid = 0; bp->b_bcount = 0; bp->b_npages = 0; bp->b_dirtyoff = bp->b_dirtyend = 0; bp->b_bufobj = NULL; bp->b_data = bp->b_kvabase = unmapped_buf; bp->b_fsprivate1 = NULL; bp->b_fsprivate2 = NULL; bp->b_fsprivate3 = NULL; LIST_INIT(&bp->b_dep); return (bp); } /* * buf_recycle: * * Free a buffer from the given bufqueue. kva controls whether the * freed buf must own some kva resources. This is used for * defragmenting. 
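buf_import() and buf_release() above let UMA act purely as a per-CPU front cache over the static buf array: the lock protecting the global empty queue is taken once per batch instead of once per buffer. A userland sketch of the same batched import/release shape, with hypothetical names and a mutex-protected tail queue standing in for bqempty.

#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct obj {
	TAILQ_ENTRY(obj) link;
};
TAILQ_HEAD(objq, obj);

static struct objq freelist = TAILQ_HEAD_INITIALIZER(freelist);
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fill "store" with up to "cnt" objects from the global list. */
static int
cache_import(void **store, int cnt)
{
	struct obj *o;
	int i;

	pthread_mutex_lock(&freelist_lock);
	for (i = 0; i < cnt; i++) {
		if ((o = TAILQ_FIRST(&freelist)) == NULL)
			break;
		TAILQ_REMOVE(&freelist, o, link);
		store[i] = o;
	}
	pthread_mutex_unlock(&freelist_lock);
	return (i);
}

/* Return a batch of objects to the global list under one lock hold. */
static void
cache_release(void **store, int cnt)
{
	int i;

	pthread_mutex_lock(&freelist_lock);
	for (i = 0; i < cnt; i++)
		TAILQ_INSERT_TAIL(&freelist, (struct obj *)store[i], link);
	pthread_mutex_unlock(&freelist_lock);
}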
*/ static int buf_recycle(struct bufdomain *bd, bool kva) { struct bufqueue *bq; struct buf *bp, *nbp; if (kva) counter_u64_add(bufdefragcnt, 1); nbp = NULL; bq = bd->bd_cleanq; BQ_LOCK(bq); KASSERT(BQ_LOCKPTR(bq) == BD_LOCKPTR(bd), ("buf_recycle: Locks don't match")); nbp = TAILQ_FIRST(&bq->bq_queue); /* * Run scan, possibly freeing data and/or kva mappings on the fly * depending. */ while ((bp = nbp) != NULL) { /* * Calculate next bp (we can only use it if we do not * release the bqlock). */ nbp = TAILQ_NEXT(bp, b_freelist); /* * If we are defragging then we need a buffer with * some kva to reclaim. */ if (kva && bp->b_kvasize == 0) continue; if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0) continue; /* * Implement a second chance algorithm for frequently * accessed buffers. */ if ((bp->b_flags & B_REUSE) != 0) { TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist); TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist); bp->b_flags &= ~B_REUSE; BUF_UNLOCK(bp); continue; } /* * Skip buffers with background writes in progress. */ if ((bp->b_vflags & BV_BKGRDINPROG) != 0) { BUF_UNLOCK(bp); continue; } KASSERT(bp->b_qindex == QUEUE_CLEAN, ("buf_recycle: inconsistent queue %d bp %p", bp->b_qindex, bp)); KASSERT(bp->b_domain == BD_DOMAIN(bd), ("getnewbuf: queue domain %d doesn't match request %d", bp->b_domain, (int)BD_DOMAIN(bd))); /* * NOTE: nbp is now entirely invalid. We can only restart * the scan from this point on. */ bq_remove(bq, bp); BQ_UNLOCK(bq); /* * Requeue the background write buffer with error and * restart the scan. */ if ((bp->b_vflags & BV_BKGRDERR) != 0) { bqrelse(bp); BQ_LOCK(bq); nbp = TAILQ_FIRST(&bq->bq_queue); continue; } bp->b_flags |= B_INVAL; brelse(bp); return (0); } bd->bd_wanted = 1; BQ_UNLOCK(bq); return (ENOBUFS); } /* * bremfree: * * Mark the buffer for removal from the appropriate free list. * */ void bremfree(struct buf *bp) { CTR3(KTR_BUF, "bremfree(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT((bp->b_flags & B_REMFREE) == 0, ("bremfree: buffer %p already marked for delayed removal.", bp)); KASSERT(bp->b_qindex != QUEUE_NONE, ("bremfree: buffer %p not on a queue.", bp)); BUF_ASSERT_XLOCKED(bp); bp->b_flags |= B_REMFREE; } /* * bremfreef: * * Force an immediate removal from a free list. Used only in nfs when * it abuses the b_freelist pointer. */ void bremfreef(struct buf *bp) { struct bufqueue *bq; bq = bufqueue_acquire(bp); bq_remove(bq, bp); BQ_UNLOCK(bq); } static void bq_init(struct bufqueue *bq, int qindex, int subqueue, const char *lockname) { mtx_init(&bq->bq_lock, lockname, NULL, MTX_DEF); TAILQ_INIT(&bq->bq_queue); bq->bq_len = 0; bq->bq_index = qindex; bq->bq_subqueue = subqueue; } static void bd_init(struct bufdomain *bd) { int i; bd->bd_cleanq = &bd->bd_subq[mp_maxid + 1]; bq_init(bd->bd_cleanq, QUEUE_CLEAN, mp_maxid + 1, "bufq clean lock"); bq_init(&bd->bd_dirtyq, QUEUE_DIRTY, -1, "bufq dirty lock"); for (i = 0; i <= mp_maxid; i++) bq_init(&bd->bd_subq[i], QUEUE_CLEAN, i, "bufq clean subqueue lock"); mtx_init(&bd->bd_run_lock, "bufspace daemon run lock", NULL, MTX_DEF); } /* * bq_remove: * * Removes a buffer from the free list, must be called with the * correct qlock held. 
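buf_recycle() above implements a second-chance policy on the clean queue: a buffer that was marked B_REUSE after it was queued is rotated to the tail with the flag cleared rather than reclaimed, which cheaply approximates LRU. A minimal sketch of that scan over a generic tail queue; the item type is hypothetical, and the locking, background-write, and KVA checks of the real function are left out.

#include <sys/queue.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
	TAILQ_ENTRY(item) link;
	bool reused;		/* set when the item is touched while queued */
};
TAILQ_HEAD(itemq, item);

static struct item *
recycle_one(struct itemq *q)
{
	struct item *it, *next;

	for (it = TAILQ_FIRST(q); it != NULL; it = next) {
		next = TAILQ_NEXT(it, link);
		if (it->reused) {
			/* Second chance: rotate to the tail, clear the hint. */
			TAILQ_REMOVE(q, it, link);
			TAILQ_INSERT_TAIL(q, it, link);
			it->reused = false;
			continue;
		}
		TAILQ_REMOVE(q, it, link);
		return (it);		/* victim to reclaim */
	}
	return (NULL);			/* nothing reclaimable on this pass */
}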
*/ static void bq_remove(struct bufqueue *bq, struct buf *bp) { CTR3(KTR_BUF, "bq_remove(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(bp->b_qindex != QUEUE_NONE, ("bq_remove: buffer %p not on a queue.", bp)); KASSERT(bufqueue(bp) == bq, ("bq_remove: Remove buffer %p from wrong queue.", bp)); BQ_ASSERT_LOCKED(bq); if (bp->b_qindex != QUEUE_EMPTY) { BUF_ASSERT_XLOCKED(bp); } KASSERT(bq->bq_len >= 1, ("queue %d underflow", bp->b_qindex)); TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist); bq->bq_len--; bp->b_qindex = QUEUE_NONE; bp->b_flags &= ~(B_REMFREE | B_REUSE); } static void bd_flush(struct bufdomain *bd, struct bufqueue *bq) { struct buf *bp; BQ_ASSERT_LOCKED(bq); if (bq != bd->bd_cleanq) { BD_LOCK(bd); while ((bp = TAILQ_FIRST(&bq->bq_queue)) != NULL) { TAILQ_REMOVE(&bq->bq_queue, bp, b_freelist); TAILQ_INSERT_TAIL(&bd->bd_cleanq->bq_queue, bp, b_freelist); bp->b_subqueue = bd->bd_cleanq->bq_subqueue; } bd->bd_cleanq->bq_len += bq->bq_len; bq->bq_len = 0; } if (bd->bd_wanted) { bd->bd_wanted = 0; wakeup(&bd->bd_wanted); } if (bq != bd->bd_cleanq) BD_UNLOCK(bd); } static int bd_flushall(struct bufdomain *bd) { struct bufqueue *bq; int flushed; int i; if (bd->bd_lim == 0) return (0); flushed = 0; for (i = 0; i <= mp_maxid; i++) { bq = &bd->bd_subq[i]; if (bq->bq_len == 0) continue; BQ_LOCK(bq); bd_flush(bd, bq); BQ_UNLOCK(bq); flushed++; } return (flushed); } static void bq_insert(struct bufqueue *bq, struct buf *bp, bool unlock) { struct bufdomain *bd; if (bp->b_qindex != QUEUE_NONE) panic("bq_insert: free buffer %p onto another queue?", bp); bd = bufdomain(bp); if (bp->b_flags & B_AGE) { /* Place this buf directly on the real queue. */ if (bq->bq_index == QUEUE_CLEAN) bq = bd->bd_cleanq; BQ_LOCK(bq); TAILQ_INSERT_HEAD(&bq->bq_queue, bp, b_freelist); } else { BQ_LOCK(bq); TAILQ_INSERT_TAIL(&bq->bq_queue, bp, b_freelist); } bp->b_flags &= ~(B_AGE | B_REUSE); bq->bq_len++; bp->b_qindex = bq->bq_index; bp->b_subqueue = bq->bq_subqueue; /* * Unlock before we notify so that we don't wakeup a waiter that * fails a trylock on the buf and sleeps again. */ if (unlock) BUF_UNLOCK(bp); if (bp->b_qindex == QUEUE_CLEAN) { /* * Flush the per-cpu queue and notify any waiters. */ if (bd->bd_wanted || (bq != bd->bd_cleanq && bq->bq_len >= bd->bd_lim)) bd_flush(bd, bq); } BQ_UNLOCK(bq); } /* * bufkva_free: * * Free the kva allocation for a buffer. * */ static void bufkva_free(struct buf *bp) { #ifdef INVARIANTS if (bp->b_kvasize == 0) { KASSERT(bp->b_kvabase == unmapped_buf && bp->b_data == unmapped_buf, ("Leaked KVA space on %p", bp)); } else if (buf_mapped(bp)) BUF_CHECK_MAPPED(bp); else BUF_CHECK_UNMAPPED(bp); #endif if (bp->b_kvasize == 0) return; vmem_free(buffer_arena, (vm_offset_t)bp->b_kvabase, bp->b_kvasize); counter_u64_add(bufkvaspace, -bp->b_kvasize); counter_u64_add(buffreekvacnt, 1); bp->b_data = bp->b_kvabase = unmapped_buf; bp->b_kvasize = 0; } /* * bufkva_alloc: * * Allocate the buffer KVA and set b_kvasize and b_kvabase. */ static int bufkva_alloc(struct buf *bp, int maxsize, int gbflags) { vm_offset_t addr; int error; KASSERT((gbflags & GB_UNMAPPED) == 0 || (gbflags & GB_KVAALLOC) != 0, ("Invalid gbflags 0x%x in %s", gbflags, __func__)); bufkva_free(bp); addr = 0; error = vmem_alloc(buffer_arena, maxsize, M_BESTFIT | M_NOWAIT, &addr); if (error != 0) { /* * Buffer map is too fragmented. Request the caller * to defragment the map. 
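bq_insert() and bd_flush() above keep frees on a small per-CPU subqueue and only splice it into the shared clean queue once the subqueue exceeds bd_lim or a waiter exists, trading exact LRU ordering for far fewer acquisitions of the global lock. A rough userland sketch of that spill, with hypothetical names and plain mutexes in place of the kernel's padded mtx locks.

#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct node {
	TAILQ_ENTRY(node) link;
};
TAILQ_HEAD(nodeq, node);

struct subq {
	pthread_mutex_t lock;
	struct nodeq queue;
	int len;
};

static struct subq globalq = { PTHREAD_MUTEX_INITIALIZER,
    TAILQ_HEAD_INITIALIZER(globalq.queue), 0 };

#define LOCAL_LIMIT 8

/* Move everything from a locked local queue onto the global queue. */
static void
flush_local(struct subq *local)
{
	struct node *n;

	pthread_mutex_lock(&globalq.lock);
	while ((n = TAILQ_FIRST(&local->queue)) != NULL) {
		TAILQ_REMOVE(&local->queue, n, link);
		TAILQ_INSERT_TAIL(&globalq.queue, n, link);
	}
	globalq.len += local->len;
	local->len = 0;
	pthread_mutex_unlock(&globalq.lock);
}

/* Free a node to the local queue; spill to the global queue when it fills. */
static void
node_free(struct subq *local, struct node *n)
{
	pthread_mutex_lock(&local->lock);
	TAILQ_INSERT_TAIL(&local->queue, n, link);
	local->len++;
	if (local->len >= LOCAL_LIMIT)
		flush_local(local);
	pthread_mutex_unlock(&local->lock);
}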
*/ return (error); } bp->b_kvabase = (caddr_t)addr; bp->b_kvasize = maxsize; counter_u64_add(bufkvaspace, bp->b_kvasize); if ((gbflags & GB_UNMAPPED) != 0) { bp->b_data = unmapped_buf; BUF_CHECK_UNMAPPED(bp); } else { bp->b_data = bp->b_kvabase; BUF_CHECK_MAPPED(bp); } return (0); } /* * bufkva_reclaim: * * Reclaim buffer kva by freeing buffers holding kva. This is a vmem * callback that fires to avoid returning failure. */ static void bufkva_reclaim(vmem_t *vmem, int flags) { bool done; int q; int i; done = false; for (i = 0; i < 5; i++) { for (q = 0; q < buf_domains; q++) if (buf_recycle(&bdomain[q], true) != 0) done = true; if (done) break; } return; } /* * Attempt to initiate asynchronous I/O on read-ahead blocks. We must * clear BIO_ERROR and B_INVAL prior to initiating I/O . If B_CACHE is set, * the buffer is valid and we do not have to do anything. */ static void breada(struct vnode * vp, daddr_t * rablkno, int * rabsize, int cnt, struct ucred * cred, int flags, void (*ckhashfunc)(struct buf *)) { struct buf *rabp; + struct thread *td; int i; + td = curthread; + for (i = 0; i < cnt; i++, rablkno++, rabsize++) { if (inmem(vp, *rablkno)) continue; rabp = getblk(vp, *rablkno, *rabsize, 0, 0, 0); if ((rabp->b_flags & B_CACHE) != 0) { brelse(rabp); continue; } - if (!TD_IS_IDLETHREAD(curthread)) { #ifdef RACCT - if (racct_enable) { - PROC_LOCK(curproc); - racct_add_buf(curproc, rabp, 0); - PROC_UNLOCK(curproc); - } -#endif /* RACCT */ - curthread->td_ru.ru_inblock++; + if (racct_enable) { + PROC_LOCK(curproc); + racct_add_buf(curproc, rabp, 0); + PROC_UNLOCK(curproc); } +#endif /* RACCT */ + td->td_ru.ru_inblock++; rabp->b_flags |= B_ASYNC; rabp->b_flags &= ~B_INVAL; if ((flags & GB_CKHASH) != 0) { rabp->b_flags |= B_CKHASH; rabp->b_ckhashcalc = ckhashfunc; } rabp->b_ioflags &= ~BIO_ERROR; rabp->b_iocmd = BIO_READ; if (rabp->b_rcred == NOCRED && cred != NOCRED) rabp->b_rcred = crhold(cred); vfs_busy_pages(rabp, 0); BUF_KERNPROC(rabp); rabp->b_iooffset = dbtob(rabp->b_blkno); bstrategy(rabp); } } /* * Entry point for bread() and breadn() via #defines in sys/buf.h. * * Get a buffer with the specified data. Look in the cache first. We * must clear BIO_ERROR and B_INVAL prior to initiating I/O. If B_CACHE * is set, the buffer is valid and we do not have to do anything, see * getblk(). Also starts asynchronous I/O on read-ahead blocks. * * Always return a NULL buffer pointer (in bpp) when returning an error. */ int breadn_flags(struct vnode *vp, daddr_t blkno, int size, daddr_t *rablkno, int *rabsize, int cnt, struct ucred *cred, int flags, void (*ckhashfunc)(struct buf *), struct buf **bpp) { struct buf *bp; struct thread *td; int error, readwait, rv; CTR3(KTR_BUF, "breadn(%p, %jd, %d)", vp, blkno, size); td = curthread; /* * Can only return NULL if GB_LOCK_NOWAIT or GB_SPARSE flags * are specified. 
*/ error = getblkx(vp, blkno, size, 0, 0, flags, &bp); if (error != 0) { *bpp = NULL; return (error); } flags &= ~GB_NOSPARSE; *bpp = bp; /* * If not found in cache, do some I/O */ readwait = 0; if ((bp->b_flags & B_CACHE) == 0) { - if (!TD_IS_IDLETHREAD(td)) { #ifdef RACCT - if (racct_enable) { - PROC_LOCK(td->td_proc); - racct_add_buf(td->td_proc, bp, 0); - PROC_UNLOCK(td->td_proc); - } -#endif /* RACCT */ - td->td_ru.ru_inblock++; + if (racct_enable) { + PROC_LOCK(td->td_proc); + racct_add_buf(td->td_proc, bp, 0); + PROC_UNLOCK(td->td_proc); } +#endif /* RACCT */ + td->td_ru.ru_inblock++; bp->b_iocmd = BIO_READ; bp->b_flags &= ~B_INVAL; if ((flags & GB_CKHASH) != 0) { bp->b_flags |= B_CKHASH; bp->b_ckhashcalc = ckhashfunc; } bp->b_ioflags &= ~BIO_ERROR; if (bp->b_rcred == NOCRED && cred != NOCRED) bp->b_rcred = crhold(cred); vfs_busy_pages(bp, 0); bp->b_iooffset = dbtob(bp->b_blkno); bstrategy(bp); ++readwait; } /* * Attempt to initiate asynchronous I/O on read-ahead blocks. */ breada(vp, rablkno, rabsize, cnt, cred, flags, ckhashfunc); rv = 0; if (readwait) { rv = bufwait(bp); if (rv != 0) { brelse(bp); *bpp = NULL; } } return (rv); } /* * Write, release buffer on completion. (Done by iodone * if async). Do not bother writing anything if the buffer * is invalid. * * Note that we set B_CACHE here, indicating that buffer is * fully valid and thus cacheable. This is true even of NFS * now so we set it generally. This could be set either here * or in biodone() since the I/O is synchronous. We put it * here. */ int bufwrite(struct buf *bp) { int oldflags; struct vnode *vp; long space; int vp_md; CTR3(KTR_BUF, "bufwrite(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); if ((bp->b_bufobj->bo_flag & BO_DEAD) != 0) { bp->b_flags |= B_INVAL | B_RELBUF; bp->b_flags &= ~B_CACHE; brelse(bp); return (ENXIO); } if (bp->b_flags & B_INVAL) { brelse(bp); return (0); } if (bp->b_flags & B_BARRIER) atomic_add_long(&barrierwrites, 1); oldflags = bp->b_flags; BUF_ASSERT_HELD(bp); KASSERT(!(bp->b_vflags & BV_BKGRDINPROG), ("FFS background buffer should not get here %p", bp)); vp = bp->b_vp; if (vp) vp_md = vp->v_vflag & VV_MD; else vp_md = 0; /* * Mark the buffer clean. Increment the bufobj write count * before bundirty() call, to prevent other thread from seeing * empty dirty list and zero counter for writes in progress, * falsely indicating that the bufobj is clean. */ bufobj_wref(bp->b_bufobj); bundirty(bp); bp->b_flags &= ~B_DONE; bp->b_ioflags &= ~BIO_ERROR; bp->b_flags |= B_CACHE; bp->b_iocmd = BIO_WRITE; vfs_busy_pages(bp, 1); /* * Normal bwrites pipeline writes */ bp->b_runningbufspace = bp->b_bufsize; space = atomic_fetchadd_long(&runningbufspace, bp->b_runningbufspace); - if (!TD_IS_IDLETHREAD(curthread)) { #ifdef RACCT - if (racct_enable) { - PROC_LOCK(curproc); - racct_add_buf(curproc, bp, 1); - PROC_UNLOCK(curproc); - } -#endif /* RACCT */ - curthread->td_ru.ru_oublock++; + if (racct_enable) { + PROC_LOCK(curproc); + racct_add_buf(curproc, bp, 1); + PROC_UNLOCK(curproc); } +#endif /* RACCT */ + curthread->td_ru.ru_oublock++; if (oldflags & B_ASYNC) BUF_KERNPROC(bp); bp->b_iooffset = dbtob(bp->b_blkno); buf_track(bp, __func__); bstrategy(bp); if ((oldflags & B_ASYNC) == 0) { int rtval = bufwait(bp); brelse(bp); return (rtval); } else if (space > hirunningspace) { /* * don't allow the async write to saturate the I/O * system. We will not deadlock here because * we are blocking waiting for I/O that is already in-progress * to complete. 
We do not block here if it is the update * or syncer daemon trying to clean up as that can lead * to deadlock. */ if ((curthread->td_pflags & TDP_NORUNNINGBUF) == 0 && !vp_md) waitrunningbufspace(); } return (0); } void bufbdflush(struct bufobj *bo, struct buf *bp) { struct buf *nbp; if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) { (void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread); altbufferflushes++; } else if (bo->bo_dirty.bv_cnt > dirtybufthresh) { BO_LOCK(bo); /* * Try to find a buffer to flush. */ TAILQ_FOREACH(nbp, &bo->bo_dirty.bv_hd, b_bobufs) { if ((nbp->b_vflags & BV_BKGRDINPROG) || BUF_LOCK(nbp, LK_EXCLUSIVE | LK_NOWAIT, NULL)) continue; if (bp == nbp) panic("bdwrite: found ourselves"); BO_UNLOCK(bo); /* Don't countdeps with the bo lock held. */ if (buf_countdeps(nbp, 0)) { BO_LOCK(bo); BUF_UNLOCK(nbp); continue; } if (nbp->b_flags & B_CLUSTEROK) { vfs_bio_awrite(nbp); } else { bremfree(nbp); bawrite(nbp); } dirtybufferflushes++; break; } if (nbp == NULL) BO_UNLOCK(bo); } } /* * Delayed write. (Buffer is marked dirty). Do not bother writing * anything if the buffer is marked invalid. * * Note that since the buffer must be completely valid, we can safely * set B_CACHE. In fact, we have to set B_CACHE here rather then in * biodone() in order to prevent getblk from writing the buffer * out synchronously. */ void bdwrite(struct buf *bp) { struct thread *td = curthread; struct vnode *vp; struct bufobj *bo; CTR3(KTR_BUF, "bdwrite(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp)); KASSERT((bp->b_flags & B_BARRIER) == 0, ("Barrier request in delayed write %p", bp)); BUF_ASSERT_HELD(bp); if (bp->b_flags & B_INVAL) { brelse(bp); return; } /* * If we have too many dirty buffers, don't create any more. * If we are wildly over our limit, then force a complete * cleanup. Otherwise, just keep the situation from getting * out of control. Note that we have to avoid a recursive * disaster and not try to clean up after our own cleanup! */ vp = bp->b_vp; bo = bp->b_bufobj; if ((td->td_pflags & (TDP_COWINPROGRESS|TDP_INBDFLUSH)) == 0) { td->td_pflags |= TDP_INBDFLUSH; BO_BDFLUSH(bo, bp); td->td_pflags &= ~TDP_INBDFLUSH; } else recursiveflushes++; bdirty(bp); /* * Set B_CACHE, indicating that the buffer is fully valid. This is * true even of NFS now. */ bp->b_flags |= B_CACHE; /* * This bmap keeps the system from needing to do the bmap later, * perhaps when the system is attempting to do a sync. Since it * is likely that the indirect block -- or whatever other datastructure * that the filesystem needs is still in memory now, it is a good * thing to do this. Note also, that if the pageout daemon is * requesting a sync -- there might not be enough memory to do * the bmap then... So, this is important to do. */ if (vp->v_type != VCHR && bp->b_lblkno == bp->b_blkno) { VOP_BMAP(vp, bp->b_lblkno, NULL, &bp->b_blkno, NULL, NULL); } buf_track(bp, __func__); /* * Set the *dirty* buffer range based upon the VM system dirty * pages. * * Mark the buffer pages as clean. We need to do this here to * satisfy the vnode_pager and the pageout daemon, so that it * thinks that the pages have been "cleaned". Note that since * the pages are in a delayed write buffer -- the VFS layer * "will" see that the pages get written out on the next sync, * or perhaps the cluster will be completed. */ vfs_clean_pages_dirty_buf(bp); bqrelse(bp); /* * note: we cannot initiate I/O from a bdwrite even if we wanted to, * due to the softdep code. 
*/ } /* * bdirty: * * Turn buffer into delayed write request. We must clear BIO_READ and * B_RELBUF, and we must set B_DELWRI. We reassign the buffer to * itself to properly update it in the dirty/clean lists. We mark it * B_DONE to ensure that any asynchronization of the buffer properly * clears B_DONE ( else a panic will occur later ). * * bdirty() is kinda like bdwrite() - we have to clear B_INVAL which * might have been set pre-getblk(). Unlike bwrite/bdwrite, bdirty() * should only be called if the buffer is known-good. * * Since the buffer is not on a queue, we do not update the numfreebuffers * count. * * The buffer must be on QUEUE_NONE. */ void bdirty(struct buf *bp) { CTR3(KTR_BUF, "bdirty(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp)); KASSERT(bp->b_flags & B_REMFREE || bp->b_qindex == QUEUE_NONE, ("bdirty: buffer %p still on queue %d", bp, bp->b_qindex)); BUF_ASSERT_HELD(bp); bp->b_flags &= ~(B_RELBUF); bp->b_iocmd = BIO_WRITE; if ((bp->b_flags & B_DELWRI) == 0) { bp->b_flags |= /* XXX B_DONE | */ B_DELWRI; reassignbuf(bp); bdirtyadd(bp); } } /* * bundirty: * * Clear B_DELWRI for buffer. * * Since the buffer is not on a queue, we do not update the numfreebuffers * count. * * The buffer must be on QUEUE_NONE. */ void bundirty(struct buf *bp) { CTR3(KTR_BUF, "bundirty(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp)); KASSERT(bp->b_flags & B_REMFREE || bp->b_qindex == QUEUE_NONE, ("bundirty: buffer %p still on queue %d", bp, bp->b_qindex)); BUF_ASSERT_HELD(bp); if (bp->b_flags & B_DELWRI) { bp->b_flags &= ~B_DELWRI; reassignbuf(bp); bdirtysub(bp); } /* * Since it is now being written, we can clear its deferred write flag. */ bp->b_flags &= ~B_DEFERRED; } /* * bawrite: * * Asynchronous write. Start output on a buffer, but do not wait for * it to complete. The buffer is released when the output completes. * * bwrite() ( or the VOP routine anyway ) is responsible for handling * B_INVAL buffers. Not us. */ void bawrite(struct buf *bp) { bp->b_flags |= B_ASYNC; (void) bwrite(bp); } /* * babarrierwrite: * * Asynchronous barrier write. Start output on a buffer, but do not * wait for it to complete. Place a write barrier after this write so * that this buffer and all buffers written before it are committed to * the disk before any buffers written after this write are committed * to the disk. The buffer is released when the output completes. */ void babarrierwrite(struct buf *bp) { bp->b_flags |= B_ASYNC | B_BARRIER; (void) bwrite(bp); } /* * bbarrierwrite: * * Synchronous barrier write. Start output on a buffer and wait for * it to complete. Place a write barrier after this write so that * this buffer and all buffers written before it are committed to * the disk before any buffers written after this write are committed * to the disk. The buffer is released when the output completes. */ int bbarrierwrite(struct buf *bp) { bp->b_flags |= B_BARRIER; return (bwrite(bp)); } /* * bwillwrite: * * Called prior to the locking of any vnodes when we are expecting to * write. We do not want to starve the buffer cache with too many * dirty buffers so we block here. By blocking prior to the locking * of any vnodes we attempt to avoid the situation where a locked vnode * prevents the various system daemons from flushing related buffers. 
*/ void bwillwrite(void) { if (buf_dirty_count_severe()) { mtx_lock(&bdirtylock); while (buf_dirty_count_severe()) { bdirtywait = 1; msleep(&bdirtywait, &bdirtylock, (PRIBIO + 4), "flswai", 0); } mtx_unlock(&bdirtylock); } } /* * Return true if we have too many dirty buffers. */ int buf_dirty_count_severe(void) { return (!BIT_EMPTY(BUF_DOMAINS, &bdhidirty)); } /* * brelse: * * Release a busy buffer and, if requested, free its resources. The * buffer will be stashed in the appropriate bufqueue[] allowing it * to be accessed later as a cache entity or reused for other purposes. */ void brelse(struct buf *bp) { struct mount *v_mnt; int qindex; /* * Many functions erroneously call brelse with a NULL bp under rare * error conditions. Simply return when called with a NULL bp. */ if (bp == NULL) return; CTR3(KTR_BUF, "brelse(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(!(bp->b_flags & (B_CLUSTER|B_PAGING)), ("brelse: inappropriate B_PAGING or B_CLUSTER bp %p", bp)); KASSERT((bp->b_flags & B_VMIO) != 0 || (bp->b_flags & B_NOREUSE) == 0, ("brelse: non-VMIO buffer marked NOREUSE")); if (BUF_LOCKRECURSED(bp)) { /* * Do not process, in particular, do not handle the * B_INVAL/B_RELBUF and do not release to free list. */ BUF_UNLOCK(bp); return; } if (bp->b_flags & B_MANAGED) { bqrelse(bp); return; } if ((bp->b_vflags & (BV_BKGRDINPROG | BV_BKGRDERR)) == BV_BKGRDERR) { BO_LOCK(bp->b_bufobj); bp->b_vflags &= ~BV_BKGRDERR; BO_UNLOCK(bp->b_bufobj); bdirty(bp); } if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) && (bp->b_error != ENXIO || !LIST_EMPTY(&bp->b_dep)) && !(bp->b_flags & B_INVAL)) { /* * Failed write, redirty. All errors except ENXIO (which * means the device is gone) are treated as being * transient. * * XXX Treating EIO as transient is not correct; the * contract with the local storage device drivers is that * they will only return EIO once the I/O is no longer * retriable. Network I/O also respects this through the * guarantees of TCP and/or the internal retries of NFS. * ENOMEM might be transient, but we also have no way of * knowing when its ok to retry/reschedule. In general, * this entire case should be made obsolete through better * error handling/recovery and resource scheduling. * * Do this also for buffers that failed with ENXIO, but have * non-empty dependencies - the soft updates code might need * to access the buffer to untangle them. * * Must clear BIO_ERROR to prevent pages from being scrapped. */ bp->b_ioflags &= ~BIO_ERROR; bdirty(bp); } else if ((bp->b_flags & (B_NOCACHE | B_INVAL)) || (bp->b_ioflags & BIO_ERROR) || (bp->b_bufsize <= 0)) { /* * Either a failed read I/O, or we were asked to free or not * cache the buffer, or we failed to write to a device that's * no longer present. */ bp->b_flags |= B_INVAL; if (!LIST_EMPTY(&bp->b_dep)) buf_deallocate(bp); if (bp->b_flags & B_DELWRI) bdirtysub(bp); bp->b_flags &= ~(B_DELWRI | B_CACHE); if ((bp->b_flags & B_VMIO) == 0) { allocbuf(bp, 0); if (bp->b_vp) brelvp(bp); } } /* * We must clear B_RELBUF if B_DELWRI is set. If vfs_vmio_truncate() * is called with B_DELWRI set, the underlying pages may wind up * getting freed causing a previous write (bdwrite()) to get 'lost' * because pages associated with a B_DELWRI bp are marked clean. * * We still allow the B_INVAL case to call vfs_vmio_truncate(), even * if B_DELWRI is set. */ if (bp->b_flags & B_DELWRI) bp->b_flags &= ~B_RELBUF; /* * VMIO buffer rundown. It is not very necessary to keep a VMIO buffer * constituted, not even NFS buffers now. 
Two flags effect this. If * B_INVAL, the struct buf is invalidated but the VM object is kept * around ( i.e. so it is trivial to reconstitute the buffer later ). * * If BIO_ERROR or B_NOCACHE is set, pages in the VM object will be * invalidated. BIO_ERROR cannot be set for a failed write unless the * buffer is also B_INVAL because it hits the re-dirtying code above. * * Normally we can do this whether a buffer is B_DELWRI or not. If * the buffer is an NFS buffer, it is tracking piecemeal writes or * the commit state and we cannot afford to lose the buffer. If the * buffer has a background write in progress, we need to keep it * around to prevent it from being reconstituted and starting a second * background write. */ v_mnt = bp->b_vp != NULL ? bp->b_vp->v_mount : NULL; if ((bp->b_flags & B_VMIO) && (bp->b_flags & B_NOCACHE || (bp->b_ioflags & BIO_ERROR && bp->b_iocmd == BIO_READ)) && (v_mnt == NULL || (v_mnt->mnt_vfc->vfc_flags & VFCF_NETWORK) == 0 || vn_isdisk(bp->b_vp, NULL) || (bp->b_flags & B_DELWRI) == 0)) { vfs_vmio_invalidate(bp); allocbuf(bp, 0); } if ((bp->b_flags & (B_INVAL | B_RELBUF)) != 0 || (bp->b_flags & (B_DELWRI | B_NOREUSE)) == B_NOREUSE) { allocbuf(bp, 0); bp->b_flags &= ~B_NOREUSE; if (bp->b_vp != NULL) brelvp(bp); } /* * If the buffer has junk contents signal it and eventually * clean up B_DELWRI and diassociate the vnode so that gbincore() * doesn't find it. */ if (bp->b_bufsize == 0 || (bp->b_ioflags & BIO_ERROR) != 0 || (bp->b_flags & (B_INVAL | B_NOCACHE | B_RELBUF)) != 0) bp->b_flags |= B_INVAL; if (bp->b_flags & B_INVAL) { if (bp->b_flags & B_DELWRI) bundirty(bp); if (bp->b_vp) brelvp(bp); } buf_track(bp, __func__); /* buffers with no memory */ if (bp->b_bufsize == 0) { buf_free(bp); return; } /* buffers with junk contents */ if (bp->b_flags & (B_INVAL | B_NOCACHE | B_RELBUF) || (bp->b_ioflags & BIO_ERROR)) { bp->b_xflags &= ~(BX_BKGRDWRITE | BX_ALTDATA); if (bp->b_vflags & BV_BKGRDINPROG) panic("losing buffer 2"); qindex = QUEUE_CLEAN; bp->b_flags |= B_AGE; /* remaining buffers */ } else if (bp->b_flags & B_DELWRI) qindex = QUEUE_DIRTY; else qindex = QUEUE_CLEAN; if ((bp->b_flags & B_DELWRI) == 0 && (bp->b_xflags & BX_VNDIRTY)) panic("brelse: not dirty"); bp->b_flags &= ~(B_ASYNC | B_NOCACHE | B_RELBUF | B_DIRECT); /* binsfree unlocks bp. */ binsfree(bp, qindex); } /* * Release a buffer back to the appropriate queue but do not try to free * it. The buffer is expected to be used again soon. * * bqrelse() is used by bdwrite() to requeue a delayed write, and used by * biodone() to requeue an async I/O on completion. It is also used when * known good buffers need to be requeued but we think we may need the data * again soon. * * XXX we should be able to leave the B_RELBUF hint set on completion. 
*/ void bqrelse(struct buf *bp) { int qindex; CTR3(KTR_BUF, "bqrelse(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(!(bp->b_flags & (B_CLUSTER|B_PAGING)), ("bqrelse: inappropriate B_PAGING or B_CLUSTER bp %p", bp)); qindex = QUEUE_NONE; if (BUF_LOCKRECURSED(bp)) { /* do not release to free list */ BUF_UNLOCK(bp); return; } bp->b_flags &= ~(B_ASYNC | B_NOCACHE | B_AGE | B_RELBUF); if (bp->b_flags & B_MANAGED) { if (bp->b_flags & B_REMFREE) bremfreef(bp); goto out; } /* buffers with stale but valid contents */ if ((bp->b_flags & B_DELWRI) != 0 || (bp->b_vflags & (BV_BKGRDINPROG | BV_BKGRDERR)) == BV_BKGRDERR) { BO_LOCK(bp->b_bufobj); bp->b_vflags &= ~BV_BKGRDERR; BO_UNLOCK(bp->b_bufobj); qindex = QUEUE_DIRTY; } else { if ((bp->b_flags & B_DELWRI) == 0 && (bp->b_xflags & BX_VNDIRTY)) panic("bqrelse: not dirty"); if ((bp->b_flags & B_NOREUSE) != 0) { brelse(bp); return; } qindex = QUEUE_CLEAN; } buf_track(bp, __func__); /* binsfree unlocks bp. */ binsfree(bp, qindex); return; out: buf_track(bp, __func__); /* unlock */ BUF_UNLOCK(bp); } /* * Complete I/O to a VMIO backed page. Validate the pages as appropriate, * restore bogus pages. */ static void vfs_vmio_iodone(struct buf *bp) { vm_ooffset_t foff; vm_page_t m; vm_object_t obj; struct vnode *vp __unused; int i, iosize, resid; bool bogus; obj = bp->b_bufobj->bo_object; KASSERT(obj->paging_in_progress >= bp->b_npages, ("vfs_vmio_iodone: paging in progress(%d) < b_npages(%d)", obj->paging_in_progress, bp->b_npages)); vp = bp->b_vp; KASSERT(vp->v_holdcnt > 0, ("vfs_vmio_iodone: vnode %p has zero hold count", vp)); KASSERT(vp->v_object != NULL, ("vfs_vmio_iodone: vnode %p has no vm_object", vp)); foff = bp->b_offset; KASSERT(bp->b_offset != NOOFFSET, ("vfs_vmio_iodone: bp %p has no buffer offset", bp)); bogus = false; iosize = bp->b_bcount - bp->b_resid; VM_OBJECT_WLOCK(obj); for (i = 0; i < bp->b_npages; i++) { resid = ((foff + PAGE_SIZE) & ~(off_t)PAGE_MASK) - foff; if (resid > iosize) resid = iosize; /* * cleanup bogus pages, restoring the originals */ m = bp->b_pages[i]; if (m == bogus_page) { bogus = true; m = vm_page_lookup(obj, OFF_TO_IDX(foff)); if (m == NULL) panic("biodone: page disappeared!"); bp->b_pages[i] = m; } else if ((bp->b_iocmd == BIO_READ) && resid > 0) { /* * In the write case, the valid and clean bits are * already changed correctly ( see bdwrite() ), so we * only need to do this here in the read case. */ KASSERT((m->dirty & vm_page_bits(foff & PAGE_MASK, resid)) == 0, ("vfs_vmio_iodone: page %p " "has unexpected dirty bits", m)); vfs_page_set_valid(bp, foff, m); } KASSERT(OFF_TO_IDX(foff) == m->pindex, ("vfs_vmio_iodone: foff(%jd)/pindex(%ju) mismatch", (intmax_t)foff, (uintmax_t)m->pindex)); vm_page_sunbusy(m); foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK; iosize -= resid; } vm_object_pip_wakeupn(obj, bp->b_npages); VM_OBJECT_WUNLOCK(obj); if (bogus && buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); pmap_qenter(trunc_page((vm_offset_t)bp->b_data), bp->b_pages, bp->b_npages); } } /* * Unwire a page held by a buf and either free it or update the page queues to * reflect its recent use. */ static void vfs_vmio_unwire(struct buf *bp, vm_page_t m) { bool freed; vm_page_lock(m); if (vm_page_unwire_noq(m)) { if ((bp->b_flags & B_DIRECT) != 0) freed = vm_page_try_to_free(m); else freed = false; if (!freed) { /* * Use a racy check of the valid bits to determine * whether we can accelerate reclamation of the page. 
* The valid bits will be stable unless the page is * being mapped or is referenced by multiple buffers, * and in those cases we expect races to be rare. At * worst we will either accelerate reclamation of a * valid page and violate LRU, or unnecessarily defer * reclamation of an invalid page. * * The B_NOREUSE flag marks data that is not expected to * be reused, so accelerate reclamation in that case * too. Otherwise, maintain LRU. */ if (m->valid == 0 || (bp->b_flags & B_NOREUSE) != 0) vm_page_deactivate_noreuse(m); else if (vm_page_active(m)) vm_page_reference(m); else vm_page_deactivate(m); } } vm_page_unlock(m); } /* * Perform page invalidation when a buffer is released. The fully invalid * pages will be reclaimed later in vfs_vmio_truncate(). */ static void vfs_vmio_invalidate(struct buf *bp) { vm_object_t obj; vm_page_t m; int i, resid, poffset, presid; if (buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); pmap_qremove(trunc_page((vm_offset_t)bp->b_data), bp->b_npages); } else BUF_CHECK_UNMAPPED(bp); /* * Get the base offset and length of the buffer. Note that * in the VMIO case if the buffer block size is not * page-aligned then b_data pointer may not be page-aligned. * But our b_pages[] array *IS* page aligned. * * block sizes less then DEV_BSIZE (usually 512) are not * supported due to the page granularity bits (m->valid, * m->dirty, etc...). * * See man buf(9) for more information */ obj = bp->b_bufobj->bo_object; resid = bp->b_bufsize; poffset = bp->b_offset & PAGE_MASK; VM_OBJECT_WLOCK(obj); for (i = 0; i < bp->b_npages; i++) { m = bp->b_pages[i]; if (m == bogus_page) panic("vfs_vmio_invalidate: Unexpected bogus page."); bp->b_pages[i] = NULL; presid = resid > (PAGE_SIZE - poffset) ? (PAGE_SIZE - poffset) : resid; KASSERT(presid >= 0, ("brelse: extra page")); while (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(obj); vm_page_busy_sleep(m, "mbncsh", true); VM_OBJECT_WLOCK(obj); } if (pmap_page_wired_mappings(m) == 0) vm_page_set_invalid(m, poffset, presid); vfs_vmio_unwire(bp, m); resid -= presid; poffset = 0; } VM_OBJECT_WUNLOCK(obj); bp->b_npages = 0; } /* * Page-granular truncation of an existing VMIO buffer. */ static void vfs_vmio_truncate(struct buf *bp, int desiredpages) { vm_object_t obj; vm_page_t m; int i; if (bp->b_npages == desiredpages) return; if (buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); pmap_qremove((vm_offset_t)trunc_page((vm_offset_t)bp->b_data) + (desiredpages << PAGE_SHIFT), bp->b_npages - desiredpages); } else BUF_CHECK_UNMAPPED(bp); /* * The object lock is needed only if we will attempt to free pages. */ obj = (bp->b_flags & B_DIRECT) != 0 ? bp->b_bufobj->bo_object : NULL; if (obj != NULL) VM_OBJECT_WLOCK(obj); for (i = desiredpages; i < bp->b_npages; i++) { m = bp->b_pages[i]; KASSERT(m != bogus_page, ("allocbuf: bogus page found")); bp->b_pages[i] = NULL; vfs_vmio_unwire(bp, m); } if (obj != NULL) VM_OBJECT_WUNLOCK(obj); bp->b_npages = desiredpages; } /* * Byte granular extension of VMIO buffers. */ static void vfs_vmio_extend(struct buf *bp, int desiredpages, int size) { /* * We are growing the buffer, possibly in a * byte-granular fashion. */ vm_object_t obj; vm_offset_t toff; vm_offset_t tinc; vm_page_t m; /* * Step 1, bring in the VM pages from the object, allocating * them if necessary. We must clear B_CACHE if these pages * are not valid for the range covered by the buffer. 
*/ obj = bp->b_bufobj->bo_object; VM_OBJECT_WLOCK(obj); if (bp->b_npages < desiredpages) { /* * We must allocate system pages since blocking * here could interfere with paging I/O, no * matter which process we are. * * Only exclusive busy can be tested here. * Blocking on shared busy might lead to * deadlocks once allocbuf() is called after * pages are vfs_busy_pages(). */ (void)vm_page_grab_pages(obj, OFF_TO_IDX(bp->b_offset) + bp->b_npages, VM_ALLOC_SYSTEM | VM_ALLOC_IGN_SBUSY | VM_ALLOC_NOBUSY | VM_ALLOC_WIRED, &bp->b_pages[bp->b_npages], desiredpages - bp->b_npages); bp->b_npages = desiredpages; } /* * Step 2. We've loaded the pages into the buffer, * we have to figure out if we can still have B_CACHE * set. Note that B_CACHE is set according to the * byte-granular range ( bcount and size ), not the * aligned range ( newbsize ). * * The VM test is against m->valid, which is DEV_BSIZE * aligned. Needless to say, the validity of the data * needs to also be DEV_BSIZE aligned. Note that this * fails with NFS if the server or some other client * extends the file's EOF. If our buffer is resized, * B_CACHE may remain set! XXX */ toff = bp->b_bcount; tinc = PAGE_SIZE - ((bp->b_offset + toff) & PAGE_MASK); while ((bp->b_flags & B_CACHE) && toff < size) { vm_pindex_t pi; if (tinc > (size - toff)) tinc = size - toff; pi = ((bp->b_offset & PAGE_MASK) + toff) >> PAGE_SHIFT; m = bp->b_pages[pi]; vfs_buf_test_cache(bp, bp->b_offset, toff, tinc, m); toff += tinc; tinc = PAGE_SIZE; } VM_OBJECT_WUNLOCK(obj); /* * Step 3, fixup the KVA pmap. */ if (buf_mapped(bp)) bpmap_qenter(bp); else BUF_CHECK_UNMAPPED(bp); } /* * Check to see if a block at a particular lbn is available for a clustered * write. */ static int vfs_bio_clcheck(struct vnode *vp, int size, daddr_t lblkno, daddr_t blkno) { struct buf *bpa; int match; match = 0; /* If the buf isn't in core skip it */ if ((bpa = gbincore(&vp->v_bufobj, lblkno)) == NULL) return (0); /* If the buf is busy we don't want to wait for it */ if (BUF_LOCK(bpa, LK_EXCLUSIVE | LK_NOWAIT, NULL) != 0) return (0); /* Only cluster with valid clusterable delayed write buffers */ if ((bpa->b_flags & (B_DELWRI | B_CLUSTEROK | B_INVAL)) != (B_DELWRI | B_CLUSTEROK)) goto done; if (bpa->b_bufsize != size) goto done; /* * Check to see if it is in the expected place on disk and that the * block has been mapped. */ if ((bpa->b_blkno != bpa->b_lblkno) && (bpa->b_blkno == blkno)) match = 1; done: BUF_UNLOCK(bpa); return (match); } /* * vfs_bio_awrite: * * Implement clustered async writes for clearing out B_DELWRI buffers. * This is much better then the old way of writing only one buffer at * a time. Note that we may not be presented with the buffers in the * correct order, so we search for the cluster in both directions. */ int vfs_bio_awrite(struct buf *bp) { struct bufobj *bo; int i; int j; daddr_t lblkno = bp->b_lblkno; struct vnode *vp = bp->b_vp; int ncl; int nwritten; int size; int maxcl; int gbflags; bo = &vp->v_bufobj; gbflags = (bp->b_data == unmapped_buf) ? GB_UNMAPPED : 0; /* * right now we support clustered writing only to regular files. If * we find a clusterable block we could be in the middle of a cluster * rather then at the beginning. 
*/ if ((vp->v_type == VREG) && (vp->v_mount != 0) && /* Only on nodes that have the size info */ (bp->b_flags & (B_CLUSTEROK | B_INVAL)) == B_CLUSTEROK) { size = vp->v_mount->mnt_stat.f_iosize; maxcl = MAXPHYS / size; BO_RLOCK(bo); for (i = 1; i < maxcl; i++) if (vfs_bio_clcheck(vp, size, lblkno + i, bp->b_blkno + ((i * size) >> DEV_BSHIFT)) == 0) break; for (j = 1; i + j <= maxcl && j <= lblkno; j++) if (vfs_bio_clcheck(vp, size, lblkno - j, bp->b_blkno - ((j * size) >> DEV_BSHIFT)) == 0) break; BO_RUNLOCK(bo); --j; ncl = i + j; /* * this is a possible cluster write */ if (ncl != 1) { BUF_UNLOCK(bp); nwritten = cluster_wbuild(vp, size, lblkno - j, ncl, gbflags); return (nwritten); } } bremfree(bp); bp->b_flags |= B_ASYNC; /* * default (old) behavior, writing out only one block * * XXX returns b_bufsize instead of b_bcount for nwritten? */ nwritten = bp->b_bufsize; (void) bwrite(bp); return (nwritten); } /* * getnewbuf_kva: * * Allocate KVA for an empty buf header according to gbflags. */ static int getnewbuf_kva(struct buf *bp, int gbflags, int maxsize) { if ((gbflags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_UNMAPPED) { /* * In order to keep fragmentation sane we only allocate kva * in BKVASIZE chunks. XXX with vmem we can do page size. */ maxsize = (maxsize + BKVAMASK) & ~BKVAMASK; if (maxsize != bp->b_kvasize && bufkva_alloc(bp, maxsize, gbflags)) return (ENOSPC); } return (0); } /* * getnewbuf: * * Find and initialize a new buffer header, freeing up existing buffers * in the bufqueues as necessary. The new buffer is returned locked. * * We block if: * We have insufficient buffer headers * We have insufficient buffer space * buffer_arena is too fragmented ( space reservation fails ) * If we have to flush dirty buffers ( but we try to avoid this ) * * The caller is responsible for releasing the reserved bufspace after * allocbuf() is called. */ static struct buf * getnewbuf(struct vnode *vp, int slpflag, int slptimeo, int maxsize, int gbflags) { struct bufdomain *bd; struct buf *bp; bool metadata, reserved; bp = NULL; KASSERT((gbflags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_KVAALLOC, ("GB_KVAALLOC only makes sense with GB_UNMAPPED")); if (!unmapped_buf_allowed) gbflags &= ~(GB_UNMAPPED | GB_KVAALLOC); if (vp == NULL || (vp->v_vflag & (VV_MD | VV_SYSTEM)) != 0 || vp->v_type == VCHR) metadata = true; else metadata = false; if (vp == NULL) bd = &bdomain[0]; else bd = &bdomain[vp->v_bufobj.bo_domain]; counter_u64_add(getnewbufcalls, 1); reserved = false; do { if (reserved == false && bufspace_reserve(bd, maxsize, metadata) != 0) { counter_u64_add(getnewbufrestarts, 1); continue; } reserved = true; if ((bp = buf_alloc(bd)) == NULL) { counter_u64_add(getnewbufrestarts, 1); continue; } if (getnewbuf_kva(bp, gbflags, maxsize) == 0) return (bp); break; } while (buf_recycle(bd, false) == 0); if (reserved) bufspace_release(bd, maxsize); if (bp != NULL) { bp->b_flags |= B_INVAL; brelse(bp); } bufspace_wait(bd, vp, gbflags, slpflag, slptimeo); return (NULL); } /* * buf_daemon: * * buffer flushing daemon. Buffers are normally flushed by the * update daemon but if it cannot keep up this process starts to * take the load in an attempt to prevent getnewbuf() from blocking. 
*/ static struct kproc_desc buf_kp = { "bufdaemon", buf_daemon, &bufdaemonproc }; SYSINIT(bufdaemon, SI_SUB_KTHREAD_BUF, SI_ORDER_FIRST, kproc_start, &buf_kp); static int buf_flush(struct vnode *vp, struct bufdomain *bd, int target) { int flushed; flushed = flushbufqueues(vp, bd, target, 0); if (flushed == 0) { /* * Could not find any buffers without rollback * dependencies, so just write the first one * in the hopes of eventually making progress. */ if (vp != NULL && target > 2) target /= 2; flushbufqueues(vp, bd, target, 1); } return (flushed); } static void buf_daemon() { struct bufdomain *bd; int speedupreq; int lodirty; int i; /* * This process needs to be suspended prior to shutdown sync. */ EVENTHANDLER_REGISTER(shutdown_pre_sync, kthread_shutdown, curthread, SHUTDOWN_PRI_LAST + 100); /* * Start the buf clean daemons as children threads. */ for (i = 0 ; i < buf_domains; i++) { int error; error = kthread_add((void (*)(void *))bufspace_daemon, &bdomain[i], curproc, NULL, 0, 0, "bufspacedaemon-%d", i); if (error) panic("error %d spawning bufspace daemon", error); } /* * This process is allowed to take the buffer cache to the limit */ curthread->td_pflags |= TDP_NORUNNINGBUF | TDP_BUFNEED; mtx_lock(&bdlock); for (;;) { bd_request = 0; mtx_unlock(&bdlock); kthread_suspend_check(); /* * Save speedupreq for this pass and reset to capture new * requests. */ speedupreq = bd_speedupreq; bd_speedupreq = 0; /* * Flush each domain sequentially according to its level and * the speedup request. */ for (i = 0; i < buf_domains; i++) { bd = &bdomain[i]; if (speedupreq) lodirty = bd->bd_numdirtybuffers / 2; else lodirty = bd->bd_lodirtybuffers; while (bd->bd_numdirtybuffers > lodirty) { if (buf_flush(NULL, bd, bd->bd_numdirtybuffers - lodirty) == 0) break; kern_yield(PRI_USER); } } /* * Only clear bd_request if we have reached our low water * mark. The buf_daemon normally waits 1 second and * then incrementally flushes any dirty buffers that have * built up, within reason. * * If we were unable to hit our low water mark and couldn't * find any flushable buffers, we sleep for a short period * to avoid endless loops on unlockable buffers. */ mtx_lock(&bdlock); if (!BIT_EMPTY(BUF_DOMAINS, &bdlodirty)) { /* * We reached our low water mark, reset the * request and sleep until we are needed again. * The sleep is just so the suspend code works. */ bd_request = 0; /* * Do an extra wakeup in case dirty threshold * changed via sysctl and the explicit transition * out of shortfall was missed. */ bdirtywakeup(); if (runningbufspace <= lorunningspace) runningwakeup(); msleep(&bd_request, &bdlock, PVM, "psleep", hz); } else { /* * We couldn't find any flushable dirty buffers but * still have too many dirty buffers, we * have to sleep and try again. (rare) */ msleep(&bd_request, &bdlock, PVM, "qsleep", hz / 10); } } } /* * flushbufqueues: * * Try to flush a buffer in the dirty queue. We must be careful to * free up B_INVAL buffers instead of write them, which NFS is * particularly sensitive to. 
*/ static int flushwithdeps = 0; SYSCTL_INT(_vfs, OID_AUTO, flushwithdeps, CTLFLAG_RW, &flushwithdeps, 0, "Number of buffers flushed with dependecies that require rollbacks"); static int flushbufqueues(struct vnode *lvp, struct bufdomain *bd, int target, int flushdeps) { struct bufqueue *bq; struct buf *sentinel; struct vnode *vp; struct mount *mp; struct buf *bp; int hasdeps; int flushed; int error; bool unlock; flushed = 0; bq = &bd->bd_dirtyq; bp = NULL; sentinel = malloc(sizeof(struct buf), M_TEMP, M_WAITOK | M_ZERO); sentinel->b_qindex = QUEUE_SENTINEL; BQ_LOCK(bq); TAILQ_INSERT_HEAD(&bq->bq_queue, sentinel, b_freelist); BQ_UNLOCK(bq); while (flushed != target) { maybe_yield(); BQ_LOCK(bq); bp = TAILQ_NEXT(sentinel, b_freelist); if (bp != NULL) { TAILQ_REMOVE(&bq->bq_queue, sentinel, b_freelist); TAILQ_INSERT_AFTER(&bq->bq_queue, bp, sentinel, b_freelist); } else { BQ_UNLOCK(bq); break; } /* * Skip sentinels inserted by other invocations of the * flushbufqueues(), taking care to not reorder them. * * Only flush the buffers that belong to the * vnode locked by the curthread. */ if (bp->b_qindex == QUEUE_SENTINEL || (lvp != NULL && bp->b_vp != lvp)) { BQ_UNLOCK(bq); continue; } error = BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT, NULL); BQ_UNLOCK(bq); if (error != 0) continue; /* * BKGRDINPROG can only be set with the buf and bufobj * locks both held. We tolerate a race to clear it here. */ if ((bp->b_vflags & BV_BKGRDINPROG) != 0 || (bp->b_flags & B_DELWRI) == 0) { BUF_UNLOCK(bp); continue; } if (bp->b_flags & B_INVAL) { bremfreef(bp); brelse(bp); flushed++; continue; } if (!LIST_EMPTY(&bp->b_dep) && buf_countdeps(bp, 0)) { if (flushdeps == 0) { BUF_UNLOCK(bp); continue; } hasdeps = 1; } else hasdeps = 0; /* * We must hold the lock on a vnode before writing * one of its buffers. Otherwise we may confuse, or * in the case of a snapshot vnode, deadlock the * system. * * The lock order here is the reverse of the normal * of vnode followed by buf lock. This is ok because * the NOWAIT will prevent deadlock. */ vp = bp->b_vp; if (vn_start_write(vp, &mp, V_NOWAIT) != 0) { BUF_UNLOCK(bp); continue; } if (lvp == NULL) { unlock = true; error = vn_lock(vp, LK_EXCLUSIVE | LK_NOWAIT); } else { ASSERT_VOP_LOCKED(vp, "getbuf"); unlock = false; error = VOP_ISLOCKED(vp) == LK_EXCLUSIVE ? 0 : vn_lock(vp, LK_TRYUPGRADE); } if (error == 0) { CTR3(KTR_BUF, "flushbufqueue(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); if (curproc == bufdaemonproc) { vfs_bio_awrite(bp); } else { bremfree(bp); bwrite(bp); counter_u64_add(notbufdflushes, 1); } vn_finished_write(mp); if (unlock) VOP_UNLOCK(vp, 0); flushwithdeps += hasdeps; flushed++; /* * Sleeping on runningbufspace while holding * vnode lock leads to deadlock. */ if (curproc == bufdaemonproc && runningbufspace > hirunningspace) waitrunningbufspace(); continue; } vn_finished_write(mp); BUF_UNLOCK(bp); } BQ_LOCK(bq); TAILQ_REMOVE(&bq->bq_queue, sentinel, b_freelist); BQ_UNLOCK(bq); free(sentinel, M_TEMP); return (flushed); } /* * Check to see if a block is currently memory resident. */ struct buf * incore(struct bufobj *bo, daddr_t blkno) { struct buf *bp; BO_RLOCK(bo); bp = gbincore(bo, blkno); BO_RUNLOCK(bo); return (bp); } /* * Returns true if no I/O is needed to access the * associated VM object. This is like incore except * it also hunts around in the VM system for the data. 
*/ static int inmem(struct vnode * vp, daddr_t blkno) { vm_object_t obj; vm_offset_t toff, tinc, size; vm_page_t m; vm_ooffset_t off; ASSERT_VOP_LOCKED(vp, "inmem"); if (incore(&vp->v_bufobj, blkno)) return 1; if (vp->v_mount == NULL) return 0; obj = vp->v_object; if (obj == NULL) return (0); size = PAGE_SIZE; if (size > vp->v_mount->mnt_stat.f_iosize) size = vp->v_mount->mnt_stat.f_iosize; off = (vm_ooffset_t)blkno * (vm_ooffset_t)vp->v_mount->mnt_stat.f_iosize; VM_OBJECT_RLOCK(obj); for (toff = 0; toff < vp->v_mount->mnt_stat.f_iosize; toff += tinc) { m = vm_page_lookup(obj, OFF_TO_IDX(off + toff)); if (!m) goto notinmem; tinc = size; if (tinc > PAGE_SIZE - ((toff + off) & PAGE_MASK)) tinc = PAGE_SIZE - ((toff + off) & PAGE_MASK); if (vm_page_is_valid(m, (vm_offset_t) ((toff + off) & PAGE_MASK), tinc) == 0) goto notinmem; } VM_OBJECT_RUNLOCK(obj); return 1; notinmem: VM_OBJECT_RUNLOCK(obj); return (0); } /* * Set the dirty range for a buffer based on the status of the dirty * bits in the pages comprising the buffer. The range is limited * to the size of the buffer. * * Tell the VM system that the pages associated with this buffer * are clean. This is used for delayed writes where the data is * going to go to disk eventually without additional VM intevention. * * Note that while we only really need to clean through to b_bcount, we * just go ahead and clean through to b_bufsize. */ static void vfs_clean_pages_dirty_buf(struct buf *bp) { vm_ooffset_t foff, noff, eoff; vm_page_t m; int i; if ((bp->b_flags & B_VMIO) == 0 || bp->b_bufsize == 0) return; foff = bp->b_offset; KASSERT(bp->b_offset != NOOFFSET, ("vfs_clean_pages_dirty_buf: no buffer offset")); VM_OBJECT_WLOCK(bp->b_bufobj->bo_object); vfs_drain_busy_pages(bp); vfs_setdirty_locked_object(bp); for (i = 0; i < bp->b_npages; i++) { noff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK; eoff = noff; if (eoff > bp->b_offset + bp->b_bufsize) eoff = bp->b_offset + bp->b_bufsize; m = bp->b_pages[i]; vfs_page_set_validclean(bp, foff, m); /* vm_page_clear_dirty(m, foff & PAGE_MASK, eoff - foff); */ foff = noff; } VM_OBJECT_WUNLOCK(bp->b_bufobj->bo_object); } static void vfs_setdirty_locked_object(struct buf *bp) { vm_object_t object; int i; object = bp->b_bufobj->bo_object; VM_OBJECT_ASSERT_WLOCKED(object); /* * We qualify the scan for modified pages on whether the * object has been flushed yet. */ if ((object->flags & OBJ_MIGHTBEDIRTY) != 0) { vm_offset_t boffset; vm_offset_t eoffset; /* * test the pages to see if they have been modified directly * by users through the VM system. */ for (i = 0; i < bp->b_npages; i++) vm_page_test_dirty(bp->b_pages[i]); /* * Calculate the encompassing dirty range, boffset and eoffset, * (eoffset - boffset) bytes. */ for (i = 0; i < bp->b_npages; i++) { if (bp->b_pages[i]->dirty) break; } boffset = (i << PAGE_SHIFT) - (bp->b_offset & PAGE_MASK); for (i = bp->b_npages - 1; i >= 0; --i) { if (bp->b_pages[i]->dirty) { break; } } eoffset = ((i + 1) << PAGE_SHIFT) - (bp->b_offset & PAGE_MASK); /* * Fit it to the buffer. */ if (eoffset > bp->b_bcount) eoffset = bp->b_bcount; /* * If we have a good dirty range, merge with the existing * dirty range. */ if (boffset < eoffset) { if (bp->b_dirtyoff > boffset) bp->b_dirtyoff = boffset; if (bp->b_dirtyend < eoffset) bp->b_dirtyend = eoffset; } } } /* * Allocate the KVA mapping for an existing buffer. * If an unmapped buffer is provided but a mapped buffer is requested, take * also care to properly setup mappings between pages and KVA. 
*/ static void bp_unmapped_get_kva(struct buf *bp, daddr_t blkno, int size, int gbflags) { int bsize, maxsize, need_mapping, need_kva; off_t offset; need_mapping = bp->b_data == unmapped_buf && (gbflags & GB_UNMAPPED) == 0; need_kva = bp->b_kvabase == unmapped_buf && bp->b_data == unmapped_buf && (gbflags & GB_KVAALLOC) != 0; if (!need_mapping && !need_kva) return; BUF_CHECK_UNMAPPED(bp); if (need_mapping && bp->b_kvabase != unmapped_buf) { /* * Buffer is not mapped, but the KVA was already * reserved at the time of the instantiation. Use the * allocated space. */ goto has_addr; } /* * Calculate the amount of the address space we would reserve * if the buffer was mapped. */ bsize = vn_isdisk(bp->b_vp, NULL) ? DEV_BSIZE : bp->b_bufobj->bo_bsize; KASSERT(bsize != 0, ("bsize == 0, check bo->bo_bsize")); offset = blkno * bsize; maxsize = size + (offset & PAGE_MASK); maxsize = imax(maxsize, bsize); while (bufkva_alloc(bp, maxsize, gbflags) != 0) { if ((gbflags & GB_NOWAIT_BD) != 0) { /* * XXXKIB: defragmentation cannot * succeed, not sure what else to do. */ panic("GB_NOWAIT_BD and GB_UNMAPPED %p", bp); } counter_u64_add(mappingrestarts, 1); bufspace_wait(bufdomain(bp), bp->b_vp, gbflags, 0, 0); } has_addr: if (need_mapping) { /* b_offset is handled by bpmap_qenter. */ bp->b_data = bp->b_kvabase; BUF_CHECK_MAPPED(bp); bpmap_qenter(bp); } } struct buf * getblk(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo, int flags) { struct buf *bp; int error; error = getblkx(vp, blkno, size, slpflag, slptimeo, flags, &bp); if (error != 0) return (NULL); return (bp); } /* * getblkx: * * Get a block given a specified block and offset into a file/device. * The buffers B_DONE bit will be cleared on return, making it almost * ready for an I/O initiation. B_INVAL may or may not be set on * return. The caller should clear B_INVAL prior to initiating a * READ. * * For a non-VMIO buffer, B_CACHE is set to the opposite of B_INVAL for * an existing buffer. * * For a VMIO buffer, B_CACHE is modified according to the backing VM. * If getblk()ing a previously 0-sized invalid buffer, B_CACHE is set * and then cleared based on the backing VM. If the previous buffer is * non-0-sized but invalid, B_CACHE will be cleared. * * If getblk() must create a new buffer, the new buffer is returned with * both B_INVAL and B_CACHE clear unless it is a VMIO buffer, in which * case it is returned with B_INVAL clear and B_CACHE set based on the * backing VM. * * getblk() also forces a bwrite() for any B_DELWRI buffer whos * B_CACHE bit is clear. * * What this means, basically, is that the caller should use B_CACHE to * determine whether the buffer is fully valid or not and should clear * B_INVAL prior to issuing a read. If the caller intends to validate * the buffer by loading its data area with something, the caller needs * to clear B_INVAL. If the caller does this without issuing an I/O, * the caller should set B_CACHE ( as an optimization ), else the caller * should issue the I/O and biodone() will set B_CACHE if the I/O was * a write attempt or if it was a successful read. If the caller * intends to issue a READ, the caller must clear B_INVAL and BIO_ERROR * prior to issuing the READ. biodone() will *not* clear B_INVAL. 
*/ int getblkx(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo, int flags, struct buf **bpp) { struct buf *bp; struct bufobj *bo; daddr_t d_blkno; int bsize, error, maxsize, vmio; off_t offset; CTR3(KTR_BUF, "getblk(%p, %ld, %d)", vp, (long)blkno, size); KASSERT((flags & (GB_UNMAPPED | GB_KVAALLOC)) != GB_KVAALLOC, ("GB_KVAALLOC only makes sense with GB_UNMAPPED")); ASSERT_VOP_LOCKED(vp, "getblk"); if (size > maxbcachebuf) panic("getblk: size(%d) > maxbcachebuf(%d)\n", size, maxbcachebuf); if (!unmapped_buf_allowed) flags &= ~(GB_UNMAPPED | GB_KVAALLOC); bo = &vp->v_bufobj; d_blkno = blkno; loop: BO_RLOCK(bo); bp = gbincore(bo, blkno); if (bp != NULL) { int lockflags; /* * Buffer is in-core. If the buffer is not busy nor managed, * it must be on a queue. */ lockflags = LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK; if ((flags & GB_LOCK_NOWAIT) != 0) lockflags |= LK_NOWAIT; error = BUF_TIMELOCK(bp, lockflags, BO_LOCKPTR(bo), "getblk", slpflag, slptimeo); /* * If we slept and got the lock we have to restart in case * the buffer changed identities. */ if (error == ENOLCK) goto loop; /* We timed out or were interrupted. */ else if (error != 0) return (error); /* If recursed, assume caller knows the rules. */ else if (BUF_LOCKRECURSED(bp)) goto end; /* * The buffer is locked. B_CACHE is cleared if the buffer is * invalid. Otherwise, for a non-VMIO buffer, B_CACHE is set * and for a VMIO buffer B_CACHE is adjusted according to the * backing VM cache. */ if (bp->b_flags & B_INVAL) bp->b_flags &= ~B_CACHE; else if ((bp->b_flags & (B_VMIO | B_INVAL)) == 0) bp->b_flags |= B_CACHE; if (bp->b_flags & B_MANAGED) MPASS(bp->b_qindex == QUEUE_NONE); else bremfree(bp); /* * check for size inconsistencies for non-VMIO case. */ if (bp->b_bcount != size) { if ((bp->b_flags & B_VMIO) == 0 || (size > bp->b_kvasize)) { if (bp->b_flags & B_DELWRI) { bp->b_flags |= B_NOCACHE; bwrite(bp); } else { if (LIST_EMPTY(&bp->b_dep)) { bp->b_flags |= B_RELBUF; brelse(bp); } else { bp->b_flags |= B_NOCACHE; bwrite(bp); } } goto loop; } } /* * Handle the case of unmapped buffer which should * become mapped, or the buffer for which KVA * reservation is requested. */ bp_unmapped_get_kva(bp, blkno, size, flags); /* * If the size is inconsistent in the VMIO case, we can resize * the buffer. This might lead to B_CACHE getting set or * cleared. If the size has not changed, B_CACHE remains * unchanged from its previous state. */ allocbuf(bp, size); KASSERT(bp->b_offset != NOOFFSET, ("getblk: no buffer offset")); /* * A buffer with B_DELWRI set and B_CACHE clear must * be committed before we can return the buffer in * order to prevent the caller from issuing a read * ( due to B_CACHE not being set ) and overwriting * it. * * Most callers, including NFS and FFS, need this to * operate properly either because they assume they * can issue a read if B_CACHE is not set, or because * ( for example ) an uncached B_DELWRI might loop due * to softupdates re-dirtying the buffer. In the latter * case, B_CACHE is set after the first write completes, * preventing further loops. * NOTE! b*write() sets B_CACHE. If we cleared B_CACHE * above while extending the buffer, we cannot allow the * buffer to remain with B_CACHE set after the write * completes or it will represent a corrupt state. To * deal with this we set B_NOCACHE to scrap the buffer * after the write. 
* * We might be able to do something fancy, like setting * B_CACHE in bwrite() except if B_DELWRI is already set, * so the below call doesn't set B_CACHE, but that gets real * confusing. This is much easier. */ if ((bp->b_flags & (B_CACHE|B_DELWRI)) == B_DELWRI) { bp->b_flags |= B_NOCACHE; bwrite(bp); goto loop; } bp->b_flags &= ~B_DONE; } else { /* * Buffer is not in-core, create new buffer. The buffer * returned by getnewbuf() is locked. Note that the returned * buffer is also considered valid (not marked B_INVAL). */ BO_RUNLOCK(bo); /* * If the user does not want us to create the buffer, bail out * here. */ if (flags & GB_NOCREAT) return (EEXIST); - if (bdomain[bo->bo_domain].bd_freebuffers == 0 && - TD_IS_IDLETHREAD(curthread)) - return (EBUSY); bsize = vn_isdisk(vp, NULL) ? DEV_BSIZE : bo->bo_bsize; KASSERT(bsize != 0, ("bsize == 0, check bo->bo_bsize")); offset = blkno * bsize; vmio = vp->v_object != NULL; if (vmio) { maxsize = size + (offset & PAGE_MASK); } else { maxsize = size; /* Do not allow non-VMIO notmapped buffers. */ flags &= ~(GB_UNMAPPED | GB_KVAALLOC); } maxsize = imax(maxsize, bsize); if ((flags & GB_NOSPARSE) != 0 && vmio && !vn_isdisk(vp, NULL)) { error = VOP_BMAP(vp, blkno, NULL, &d_blkno, 0, 0); KASSERT(error != EOPNOTSUPP, ("GB_NOSPARSE from fs not supporting bmap, vp %p", vp)); if (error != 0) return (error); if (d_blkno == -1) return (EJUSTRETURN); } bp = getnewbuf(vp, slpflag, slptimeo, maxsize, flags); if (bp == NULL) { if (slpflag || slptimeo) return (ETIMEDOUT); /* * XXX This is here until the sleep path is diagnosed * enough to work under very low memory conditions. * * There's an issue on low memory, 4BSD+non-preempt * systems (eg MIPS routers with 32MB RAM) where buffer * exhaustion occurs without sleeping for buffer * reclaimation. This just sticks in a loop and * constantly attempts to allocate a buffer, which * hits exhaustion and tries to wakeup bufdaemon. * This never happens because we never yield. * * The real solution is to identify and fix these cases * so we aren't effectively busy-waiting in a loop * until the reclaimation path has cycles to run. */ kern_yield(PRI_USER); goto loop; } /* * This code is used to make sure that a buffer is not * created while the getnewbuf routine is blocked. * This can be a problem whether the vnode is locked or not. * If the buffer is created out from under us, we have to * throw away the one we just created. * * Note: this must occur before we associate the buffer * with the vp especially considering limitations in * the splay tree implementation when dealing with duplicate * lblkno's. */ BO_LOCK(bo); if (gbincore(bo, blkno)) { BO_UNLOCK(bo); bp->b_flags |= B_INVAL; bufspace_release(bufdomain(bp), maxsize); brelse(bp); goto loop; } /* * Insert the buffer into the hash, so that it can * be found by incore. */ bp->b_lblkno = blkno; bp->b_blkno = d_blkno; bp->b_offset = offset; bgetvp(vp, bp); BO_UNLOCK(bo); /* * set B_VMIO bit. allocbuf() the buffer bigger. Since the * buffer size starts out as 0, B_CACHE will be set by * allocbuf() for the VMIO case prior to it testing the * backing store for validity. */ if (vmio) { bp->b_flags |= B_VMIO; KASSERT(vp->v_object == bp->b_bufobj->bo_object, ("ARGH! different b_bufobj->bo_object %p %p %p\n", bp, vp->v_object, bp->b_bufobj->bo_object)); } else { bp->b_flags &= ~B_VMIO; KASSERT(bp->b_bufobj->bo_object == NULL, ("ARGH! 
has b_bufobj->bo_object %p %p\n", bp, bp->b_bufobj->bo_object)); BUF_CHECK_MAPPED(bp); } allocbuf(bp, size); bufspace_release(bufdomain(bp), maxsize); bp->b_flags &= ~B_DONE; } CTR4(KTR_BUF, "getblk(%p, %ld, %d) = %p", vp, (long)blkno, size, bp); BUF_ASSERT_HELD(bp); end: buf_track(bp, __func__); KASSERT(bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); *bpp = bp; return (0); } /* * Get an empty, disassociated buffer of given size. The buffer is initially * set to B_INVAL. */ struct buf * geteblk(int size, int flags) { struct buf *bp; int maxsize; maxsize = (size + BKVAMASK) & ~BKVAMASK; while ((bp = getnewbuf(NULL, 0, 0, maxsize, flags)) == NULL) { if ((flags & GB_NOWAIT_BD) && (curthread->td_pflags & TDP_BUFNEED) != 0) return (NULL); } allocbuf(bp, size); bufspace_release(bufdomain(bp), maxsize); bp->b_flags |= B_INVAL; /* b_dep cleared by getnewbuf() */ BUF_ASSERT_HELD(bp); return (bp); } /* * Truncate the backing store for a non-vmio buffer. */ static void vfs_nonvmio_truncate(struct buf *bp, int newbsize) { if (bp->b_flags & B_MALLOC) { /* * malloced buffers are not shrunk */ if (newbsize == 0) { bufmallocadjust(bp, 0); free(bp->b_data, M_BIOBUF); bp->b_data = bp->b_kvabase; bp->b_flags &= ~B_MALLOC; } return; } vm_hold_free_pages(bp, newbsize); bufspace_adjust(bp, newbsize); } /* * Extend the backing for a non-VMIO buffer. */ static void vfs_nonvmio_extend(struct buf *bp, int newbsize) { caddr_t origbuf; int origbufsize; /* * We only use malloced memory on the first allocation. * and revert to page-allocated memory when the buffer * grows. * * There is a potential smp race here that could lead * to bufmallocspace slightly passing the max. It * is probably extremely rare and not worth worrying * over. */ if (bp->b_bufsize == 0 && newbsize <= PAGE_SIZE/2 && bufmallocspace < maxbufmallocspace) { bp->b_data = malloc(newbsize, M_BIOBUF, M_WAITOK); bp->b_flags |= B_MALLOC; bufmallocadjust(bp, newbsize); return; } /* * If the buffer is growing on its other-than-first * allocation then we revert to the page-allocation * scheme. */ origbuf = NULL; origbufsize = 0; if (bp->b_flags & B_MALLOC) { origbuf = bp->b_data; origbufsize = bp->b_bufsize; bp->b_data = bp->b_kvabase; bufmallocadjust(bp, 0); bp->b_flags &= ~B_MALLOC; newbsize = round_page(newbsize); } vm_hold_load_pages(bp, (vm_offset_t) bp->b_data + bp->b_bufsize, (vm_offset_t) bp->b_data + newbsize); if (origbuf != NULL) { bcopy(origbuf, bp->b_data, origbufsize); free(origbuf, M_BIOBUF); } bufspace_adjust(bp, newbsize); } /* * This code constitutes the buffer memory from either anonymous system * memory (in the case of non-VMIO operations) or from an associated * VM object (in the case of VMIO operations). This code is able to * resize a buffer up or down. * * Note that this code is tricky, and has many complications to resolve * deadlock or inconsistent data situations. Tread lightly!!! * There are B_CACHE and B_DELWRI interactions that must be dealt with by * the caller. Calling this code willy nilly can result in the loss of data. * * allocbuf() only adjusts B_CACHE for VMIO buffers. getblk() deals with * B_CACHE for the non-VMIO case. 
*/ int allocbuf(struct buf *bp, int size) { int newbsize; BUF_ASSERT_HELD(bp); if (bp->b_bcount == size) return (1); if (bp->b_kvasize != 0 && bp->b_kvasize < size) panic("allocbuf: buffer too small"); newbsize = roundup2(size, DEV_BSIZE); if ((bp->b_flags & B_VMIO) == 0) { if ((bp->b_flags & B_MALLOC) == 0) newbsize = round_page(newbsize); /* * Just get anonymous memory from the kernel. Don't * mess with B_CACHE. */ if (newbsize < bp->b_bufsize) vfs_nonvmio_truncate(bp, newbsize); else if (newbsize > bp->b_bufsize) vfs_nonvmio_extend(bp, newbsize); } else { int desiredpages; desiredpages = (size == 0) ? 0 : num_pages((bp->b_offset & PAGE_MASK) + newbsize); if (bp->b_flags & B_MALLOC) panic("allocbuf: VMIO buffer can't be malloced"); /* * Set B_CACHE initially if buffer is 0 length or will become * 0-length. */ if (size == 0 || bp->b_bufsize == 0) bp->b_flags |= B_CACHE; if (newbsize < bp->b_bufsize) vfs_vmio_truncate(bp, desiredpages); /* XXX This looks as if it should be newbsize > b_bufsize */ else if (size > bp->b_bcount) vfs_vmio_extend(bp, desiredpages, size); bufspace_adjust(bp, newbsize); } bp->b_bcount = size; /* requested buffer size. */ return (1); } extern int inflight_transient_maps; static struct bio_queue nondump_bios; void biodone(struct bio *bp) { struct mtx *mtxp; void (*done)(struct bio *); vm_offset_t start, end; biotrack(bp, __func__); /* * Avoid completing I/O when dumping after a panic since that may * result in a deadlock in the filesystem or pager code. Note that * this doesn't affect dumps that were started manually since we aim * to keep the system usable after it has been resumed. */ if (__predict_false(dumping && SCHEDULER_STOPPED())) { TAILQ_INSERT_HEAD(&nondump_bios, bp, bio_queue); return; } if ((bp->bio_flags & BIO_TRANSIENT_MAPPING) != 0) { bp->bio_flags &= ~BIO_TRANSIENT_MAPPING; bp->bio_flags |= BIO_UNMAPPED; start = trunc_page((vm_offset_t)bp->bio_data); end = round_page((vm_offset_t)bp->bio_data + bp->bio_length); bp->bio_data = unmapped_buf; pmap_qremove(start, atop(end - start)); vmem_free(transient_arena, start, end - start); atomic_add_int(&inflight_transient_maps, -1); } done = bp->bio_done; if (done == NULL) { mtxp = mtx_pool_find(mtxpool_sleep, bp); mtx_lock(mtxp); bp->bio_flags |= BIO_DONE; wakeup(bp); mtx_unlock(mtxp); } else done(bp); } /* * Wait for a BIO to finish. */ int biowait(struct bio *bp, const char *wchan) { struct mtx *mtxp; mtxp = mtx_pool_find(mtxpool_sleep, bp); mtx_lock(mtxp); while ((bp->bio_flags & BIO_DONE) == 0) msleep(bp, mtxp, PRIBIO, wchan, 0); mtx_unlock(mtxp); if (bp->bio_error != 0) return (bp->bio_error); if (!(bp->bio_flags & BIO_ERROR)) return (0); return (EIO); } void biofinish(struct bio *bp, struct devstat *stat, int error) { if (error) { bp->bio_error = error; bp->bio_flags |= BIO_ERROR; } if (stat != NULL) devstat_end_transaction_bio(stat, bp); biodone(bp); } #if defined(BUF_TRACKING) || defined(FULL_BUF_TRACKING) void biotrack_buf(struct bio *bp, const char *location) { buf_track(bp->bio_track_bp, location); } #endif /* * bufwait: * * Wait for buffer I/O completion, returning error status. The buffer * is left locked and B_DONE on return. B_EINTR is converted into an EINTR * error and cleared. */ int bufwait(struct buf *bp) { if (bp->b_iocmd == BIO_READ) bwait(bp, PRIBIO, "biord"); else bwait(bp, PRIBIO, "biowr"); if (bp->b_flags & B_EINTR) { bp->b_flags &= ~B_EINTR; return (EINTR); } if (bp->b_ioflags & BIO_ERROR) { return (bp->b_error ? 
bp->b_error : EIO); } else { return (0); } } /* * bufdone: * * Finish I/O on a buffer, optionally calling a completion function. * This is usually called from an interrupt so process blocking is * not allowed. * * biodone is also responsible for setting B_CACHE in a B_VMIO bp. * In a non-VMIO bp, B_CACHE will be set on the next getblk() * assuming B_INVAL is clear. * * For the VMIO case, we set B_CACHE if the op was a read and no * read error occurred, or if the op was a write. B_CACHE is never * set if the buffer is invalid or otherwise uncacheable. * - * biodone does not mess with B_INVAL, allowing the I/O routine or the + * bufdone does not mess with B_INVAL, allowing the I/O routine or the * initiator to leave B_INVAL set to brelse the buffer out of existence * in the biodone routine. */ void bufdone(struct buf *bp) { struct bufobj *dropobj; void (*biodone)(struct buf *); buf_track(bp, __func__); CTR3(KTR_BUF, "bufdone(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); dropobj = NULL; KASSERT(!(bp->b_flags & B_DONE), ("biodone: bp %p already done", bp)); BUF_ASSERT_HELD(bp); runningbufwakeup(bp); if (bp->b_iocmd == BIO_WRITE) dropobj = bp->b_bufobj; /* call optional completion function if requested */ if (bp->b_iodone != NULL) { biodone = bp->b_iodone; bp->b_iodone = NULL; (*biodone) (bp); if (dropobj) bufobj_wdrop(dropobj); return; } if (bp->b_flags & B_VMIO) { /* * Set B_CACHE if the op was a normal read and no error * occurred. B_CACHE is set for writes in the b*write() * routines. */ if (bp->b_iocmd == BIO_READ && !(bp->b_flags & (B_INVAL|B_NOCACHE)) && !(bp->b_ioflags & BIO_ERROR)) bp->b_flags |= B_CACHE; vfs_vmio_iodone(bp); } if (!LIST_EMPTY(&bp->b_dep)) buf_complete(bp); if ((bp->b_flags & B_CKHASH) != 0) { KASSERT(bp->b_iocmd == BIO_READ, ("bufdone: b_iocmd %d not BIO_READ", bp->b_iocmd)); KASSERT(buf_mapped(bp), ("bufdone: bp %p not mapped", bp)); (*bp->b_ckhashcalc)(bp); } /* * For asynchronous completions, release the buffer now. The brelse * will do a wakeup there if necessary - so no need to do a wakeup * here in the async case. The sync case always needs to do a wakeup. */ if (bp->b_flags & B_ASYNC) { if ((bp->b_flags & (B_NOCACHE | B_INVAL | B_RELBUF)) || (bp->b_ioflags & BIO_ERROR)) brelse(bp); else bqrelse(bp); } else bdone(bp); if (dropobj) bufobj_wdrop(dropobj); } /* * This routine is called in lieu of iodone in the case of * incomplete I/O. This keeps the busy status for pages * consistent. */ void vfs_unbusy_pages(struct buf *bp) { int i; vm_object_t obj; vm_page_t m; runningbufwakeup(bp); if (!(bp->b_flags & B_VMIO)) return; obj = bp->b_bufobj->bo_object; VM_OBJECT_WLOCK(obj); for (i = 0; i < bp->b_npages; i++) { m = bp->b_pages[i]; if (m == bogus_page) { m = vm_page_lookup(obj, OFF_TO_IDX(bp->b_offset) + i); if (!m) panic("vfs_unbusy_pages: page missing\n"); bp->b_pages[i] = m; if (buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); pmap_qenter(trunc_page((vm_offset_t)bp->b_data), bp->b_pages, bp->b_npages); } else BUF_CHECK_UNMAPPED(bp); } vm_page_sunbusy(m); } vm_object_pip_wakeupn(obj, bp->b_npages); VM_OBJECT_WUNLOCK(obj); } /* * vfs_page_set_valid: * * Set the valid bits in a page based on the supplied offset. The * range is restricted to the buffer's size. * * This routine is typically called after a read completes. */ static void vfs_page_set_valid(struct buf *bp, vm_ooffset_t off, vm_page_t m) { vm_ooffset_t eoff; /* * Compute the end offset, eoff, such that [off, eoff) does not span a * page boundary and eoff is not greater than the end of the buffer. 
* The end of the buffer, in this case, is our file EOF, not the * allocation size of the buffer. */ eoff = (off + PAGE_SIZE) & ~(vm_ooffset_t)PAGE_MASK; if (eoff > bp->b_offset + bp->b_bcount) eoff = bp->b_offset + bp->b_bcount; /* * Set valid range. This is typically the entire buffer and thus the * entire page. */ if (eoff > off) vm_page_set_valid_range(m, off & PAGE_MASK, eoff - off); } /* * vfs_page_set_validclean: * * Set the valid bits and clear the dirty bits in a page based on the * supplied offset. The range is restricted to the buffer's size. */ static void vfs_page_set_validclean(struct buf *bp, vm_ooffset_t off, vm_page_t m) { vm_ooffset_t soff, eoff; /* * Start and end offsets in buffer. eoff - soff may not cross a * page boundary or cross the end of the buffer. The end of the * buffer, in this case, is our file EOF, not the allocation size * of the buffer. */ soff = off; eoff = (off + PAGE_SIZE) & ~(off_t)PAGE_MASK; if (eoff > bp->b_offset + bp->b_bcount) eoff = bp->b_offset + bp->b_bcount; /* * Set valid range. This is typically the entire buffer and thus the * entire page. */ if (eoff > soff) { vm_page_set_validclean( m, (vm_offset_t) (soff & PAGE_MASK), (vm_offset_t) (eoff - soff) ); } } /* * Ensure that all buffer pages are not exclusive busied. If any page is * exclusive busy, drain it. */ void vfs_drain_busy_pages(struct buf *bp) { vm_page_t m; int i, last_busied; VM_OBJECT_ASSERT_WLOCKED(bp->b_bufobj->bo_object); last_busied = 0; for (i = 0; i < bp->b_npages; i++) { m = bp->b_pages[i]; if (vm_page_xbusied(m)) { for (; last_busied < i; last_busied++) vm_page_sbusy(bp->b_pages[last_busied]); while (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(bp->b_bufobj->bo_object); vm_page_busy_sleep(m, "vbpage", true); VM_OBJECT_WLOCK(bp->b_bufobj->bo_object); } } } for (i = 0; i < last_busied; i++) vm_page_sunbusy(bp->b_pages[i]); } /* * This routine is called before a device strategy routine. * It is used to tell the VM system that paging I/O is in * progress, and treat the pages associated with the buffer * almost as being exclusive busy. Also the object paging_in_progress * flag is handled to make sure that the object doesn't become * inconsistent. * * Since I/O has not been initiated yet, certain buffer flags * such as BIO_ERROR or B_INVAL may be in an inconsistent state * and should be ignored. */ void vfs_busy_pages(struct buf *bp, int clear_modify) { vm_object_t obj; vm_ooffset_t foff; vm_page_t m; int i; bool bogus; if (!(bp->b_flags & B_VMIO)) return; obj = bp->b_bufobj->bo_object; foff = bp->b_offset; KASSERT(bp->b_offset != NOOFFSET, ("vfs_busy_pages: no buffer offset")); VM_OBJECT_WLOCK(obj); vfs_drain_busy_pages(bp); if (bp->b_bufsize != 0) vfs_setdirty_locked_object(bp); bogus = false; for (i = 0; i < bp->b_npages; i++) { m = bp->b_pages[i]; if ((bp->b_flags & B_CLUSTER) == 0) { vm_object_pip_add(obj, 1); vm_page_sbusy(m); } /* * When readying a buffer for a read ( i.e * clear_modify == 0 ), it is important to do * bogus_page replacement for valid pages in * partially instantiated buffers. Partially * instantiated buffers can, in turn, occur when * reconstituting a buffer from its VM backing store * base. We only have to do this if B_CACHE is * clear ( which causes the I/O to occur in the * first place ). The replacement prevents the read * I/O from overwriting potentially dirty VM-backed * pages. XXX bogus page replacement is, uh, bogus. * It may not work properly with small-block devices. * We need to find a better way. 
*/ if (clear_modify) { pmap_remove_write(m); vfs_page_set_validclean(bp, foff, m); } else if (m->valid == VM_PAGE_BITS_ALL && (bp->b_flags & B_CACHE) == 0) { bp->b_pages[i] = bogus_page; bogus = true; } foff = (foff + PAGE_SIZE) & ~(off_t)PAGE_MASK; } VM_OBJECT_WUNLOCK(obj); if (bogus && buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); pmap_qenter(trunc_page((vm_offset_t)bp->b_data), bp->b_pages, bp->b_npages); } } /* * vfs_bio_set_valid: * * Set the range within the buffer to valid. The range is * relative to the beginning of the buffer, b_offset. Note that * b_offset itself may be offset from the beginning of the first * page. */ void vfs_bio_set_valid(struct buf *bp, int base, int size) { int i, n; vm_page_t m; if (!(bp->b_flags & B_VMIO)) return; /* * Fixup base to be relative to beginning of first page. * Set initial n to be the maximum number of bytes in the * first page that can be validated. */ base += (bp->b_offset & PAGE_MASK); n = PAGE_SIZE - (base & PAGE_MASK); VM_OBJECT_WLOCK(bp->b_bufobj->bo_object); for (i = base / PAGE_SIZE; size > 0 && i < bp->b_npages; ++i) { m = bp->b_pages[i]; if (n > size) n = size; vm_page_set_valid_range(m, base & PAGE_MASK, n); base += n; size -= n; n = PAGE_SIZE; } VM_OBJECT_WUNLOCK(bp->b_bufobj->bo_object); } /* * vfs_bio_clrbuf: * * If the specified buffer is a non-VMIO buffer, clear the entire * buffer. If the specified buffer is a VMIO buffer, clear and * validate only the previously invalid portions of the buffer. * This routine essentially fakes an I/O, so we need to clear * BIO_ERROR and B_INVAL. * * Note that while we only theoretically need to clear through b_bcount, * we go ahead and clear through b_bufsize. */ void vfs_bio_clrbuf(struct buf *bp) { int i, j, mask, sa, ea, slide; if ((bp->b_flags & (B_VMIO | B_MALLOC)) != B_VMIO) { clrbuf(bp); return; } bp->b_flags &= ~B_INVAL; bp->b_ioflags &= ~BIO_ERROR; VM_OBJECT_WLOCK(bp->b_bufobj->bo_object); if ((bp->b_npages == 1) && (bp->b_bufsize < PAGE_SIZE) && (bp->b_offset & PAGE_MASK) == 0) { if (bp->b_pages[0] == bogus_page) goto unlock; mask = (1 << (bp->b_bufsize / DEV_BSIZE)) - 1; VM_OBJECT_ASSERT_WLOCKED(bp->b_pages[0]->object); if ((bp->b_pages[0]->valid & mask) == mask) goto unlock; if ((bp->b_pages[0]->valid & mask) == 0) { pmap_zero_page_area(bp->b_pages[0], 0, bp->b_bufsize); bp->b_pages[0]->valid |= mask; goto unlock; } } sa = bp->b_offset & PAGE_MASK; slide = 0; for (i = 0; i < bp->b_npages; i++, sa = 0) { slide = imin(slide + PAGE_SIZE, bp->b_offset + bp->b_bufsize); ea = slide & PAGE_MASK; if (ea == 0) ea = PAGE_SIZE; if (bp->b_pages[i] == bogus_page) continue; j = sa / DEV_BSIZE; mask = ((1 << ((ea - sa) / DEV_BSIZE)) - 1) << j; VM_OBJECT_ASSERT_WLOCKED(bp->b_pages[i]->object); if ((bp->b_pages[i]->valid & mask) == mask) continue; if ((bp->b_pages[i]->valid & mask) == 0) pmap_zero_page_area(bp->b_pages[i], sa, ea - sa); else { for (; sa < ea; sa += DEV_BSIZE, j++) { if ((bp->b_pages[i]->valid & (1 << j)) == 0) { pmap_zero_page_area(bp->b_pages[i], sa, DEV_BSIZE); } } } bp->b_pages[i]->valid |= mask; } unlock: VM_OBJECT_WUNLOCK(bp->b_bufobj->bo_object); bp->b_resid = 0; } void vfs_bio_bzero_buf(struct buf *bp, int base, int size) { vm_page_t m; int i, n; if (buf_mapped(bp)) { BUF_CHECK_MAPPED(bp); bzero(bp->b_data + base, size); } else { BUF_CHECK_UNMAPPED(bp); n = PAGE_SIZE - (base & PAGE_MASK); for (i = base / PAGE_SIZE; size > 0 && i < bp->b_npages; ++i) { m = bp->b_pages[i]; if (n > size) n = size; pmap_zero_page_area(m, base & PAGE_MASK, n); base += n; size -= n; n = PAGE_SIZE; } } } 
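/*
 * Illustrative sketch (not part of the original file): a minimal,
 * filesystem-style consumer of the buffer-cache interfaces documented
 * above.  The function name and the caller-supplied vnode "vp", logical
 * block "lbn" and block "size" are hypothetical, and error handling is
 * kept to the minimum needed to show the read/modify/delayed-write
 * pattern that bread(), getblk(), bdwrite() and brelse() implement.
 * Guarded by #if 0 because it is an example only.
 */
#if 0
static int
example_read_modify_block(struct vnode *vp, daddr_t lbn, int size)
{
	struct buf *bp;
	int error;

	/*
	 * bread() is the usual front end: it calls getblk() and, when
	 * B_CACHE is clear, clears B_INVAL and issues the read itself,
	 * waiting for completion.  On error the buffer has already been
	 * released, so only the error needs to be propagated.
	 */
	error = bread(vp, lbn, size, NOCRED, &bp);
	if (error != 0)
		return (error);

	/* ... modify bp->b_data here ... */

	/*
	 * bdwrite() marks the buffer dirty and requeues it via bqrelse();
	 * the update daemon or buf daemon writes it out later.  Use
	 * bwrite(bp) for a synchronous write instead, or brelse(bp) to
	 * release the buffer without dirtying it.
	 */
	bdwrite(bp);
	return (0);
}
#endif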
/* * Update buffer flags based on I/O request parameters, optionally releasing the * buffer. If it's VMIO or direct I/O, the buffer pages are released to the VM, * where they may be placed on a page queue (VMIO) or freed immediately (direct * I/O). Otherwise the buffer is released to the cache. */ static void b_io_dismiss(struct buf *bp, int ioflag, bool release) { KASSERT((ioflag & IO_NOREUSE) == 0 || (ioflag & IO_VMIO) != 0, ("buf %p non-VMIO noreuse", bp)); if ((ioflag & IO_DIRECT) != 0) bp->b_flags |= B_DIRECT; if ((ioflag & IO_EXT) != 0) bp->b_xflags |= BX_ALTDATA; if ((ioflag & (IO_VMIO | IO_DIRECT)) != 0 && LIST_EMPTY(&bp->b_dep)) { bp->b_flags |= B_RELBUF; if ((ioflag & IO_NOREUSE) != 0) bp->b_flags |= B_NOREUSE; if (release) brelse(bp); } else if (release) bqrelse(bp); } void vfs_bio_brelse(struct buf *bp, int ioflag) { b_io_dismiss(bp, ioflag, true); } void vfs_bio_set_flags(struct buf *bp, int ioflag) { b_io_dismiss(bp, ioflag, false); } /* * vm_hold_load_pages and vm_hold_free_pages get pages into * a buffers address space. The pages are anonymous and are * not associated with a file object. */ static void vm_hold_load_pages(struct buf *bp, vm_offset_t from, vm_offset_t to) { vm_offset_t pg; vm_page_t p; int index; BUF_CHECK_MAPPED(bp); to = round_page(to); from = round_page(from); index = (from - trunc_page((vm_offset_t)bp->b_data)) >> PAGE_SHIFT; for (pg = from; pg < to; pg += PAGE_SIZE, index++) { /* * note: must allocate system pages since blocking here * could interfere with paging I/O, no matter which * process we are. */ p = vm_page_alloc(NULL, 0, VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_COUNT((to - pg) >> PAGE_SHIFT) | VM_ALLOC_WAITOK); pmap_qenter(pg, &p, 1); bp->b_pages[index] = p; } bp->b_npages = index; } /* Return pages associated with this buf to the vm system */ static void vm_hold_free_pages(struct buf *bp, int newbsize) { vm_offset_t from; vm_page_t p; int index, newnpages; BUF_CHECK_MAPPED(bp); from = round_page((vm_offset_t)bp->b_data + newbsize); newnpages = (from - trunc_page((vm_offset_t)bp->b_data)) >> PAGE_SHIFT; if (bp->b_npages > newnpages) pmap_qremove(from, bp->b_npages - newnpages); for (index = newnpages; index < bp->b_npages; index++) { p = bp->b_pages[index]; bp->b_pages[index] = NULL; p->wire_count--; vm_page_free(p); } vm_wire_sub(bp->b_npages - newnpages); bp->b_npages = newnpages; } /* * Map an IO request into kernel virtual address space. * * All requests are (re)mapped into kernel VA space. * Notice that we use b_bufsize for the size of the buffer * to be mapped. b_bcount might be modified by the driver. * * Note that even if the caller determines that the address space should * be valid, a race or a smaller-file mapped into a larger space may * actually cause vmapbuf() to fail, so all callers of vmapbuf() MUST * check the return value. * * This function only works with pager buffers. 
*/ int vmapbuf(struct buf *bp, int mapbuf) { vm_prot_t prot; int pidx; if (bp->b_bufsize < 0) return (-1); prot = VM_PROT_READ; if (bp->b_iocmd == BIO_READ) prot |= VM_PROT_WRITE; /* Less backwards than it looks */ if ((pidx = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map, (vm_offset_t)bp->b_data, bp->b_bufsize, prot, bp->b_pages, btoc(MAXPHYS))) < 0) return (-1); bp->b_npages = pidx; bp->b_offset = ((vm_offset_t)bp->b_data) & PAGE_MASK; if (mapbuf || !unmapped_buf_allowed) { pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages, pidx); bp->b_data = bp->b_kvabase + bp->b_offset; } else bp->b_data = unmapped_buf; return(0); } /* * Free the io map PTEs associated with this IO operation. * We also invalidate the TLB entries and restore the original b_addr. * * This function only works with pager buffers. */ void vunmapbuf(struct buf *bp) { int npages; npages = bp->b_npages; if (buf_mapped(bp)) pmap_qremove(trunc_page((vm_offset_t)bp->b_data), npages); vm_page_unhold_pages(bp->b_pages, npages); bp->b_data = unmapped_buf; } void bdone(struct buf *bp) { struct mtx *mtxp; mtxp = mtx_pool_find(mtxpool_sleep, bp); mtx_lock(mtxp); bp->b_flags |= B_DONE; wakeup(bp); mtx_unlock(mtxp); } void bwait(struct buf *bp, u_char pri, const char *wchan) { struct mtx *mtxp; mtxp = mtx_pool_find(mtxpool_sleep, bp); mtx_lock(mtxp); while ((bp->b_flags & B_DONE) == 0) msleep(bp, mtxp, pri, wchan, 0); mtx_unlock(mtxp); } int bufsync(struct bufobj *bo, int waitfor) { return (VOP_FSYNC(bo2vnode(bo), waitfor, curthread)); } void bufstrategy(struct bufobj *bo, struct buf *bp) { int i __unused; struct vnode *vp; vp = bp->b_vp; KASSERT(vp == bo->bo_private, ("Inconsistent vnode bufstrategy")); KASSERT(vp->v_type != VCHR && vp->v_type != VBLK, ("Wrong vnode in bufstrategy(bp=%p, vp=%p)", bp, vp)); i = VOP_STRATEGY(vp, bp); KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp)); } /* * Initialize a struct bufobj before use. Memory is assumed zero filled. */ void bufobj_init(struct bufobj *bo, void *private) { static volatile int bufobj_cleanq; bo->bo_domain = atomic_fetchadd_int(&bufobj_cleanq, 1) % buf_domains; rw_init(BO_LOCKPTR(bo), "bufobj interlock"); bo->bo_private = private; TAILQ_INIT(&bo->bo_clean.bv_hd); TAILQ_INIT(&bo->bo_dirty.bv_hd); } void bufobj_wrefl(struct bufobj *bo) { KASSERT(bo != NULL, ("NULL bo in bufobj_wref")); ASSERT_BO_WLOCKED(bo); bo->bo_numoutput++; } void bufobj_wref(struct bufobj *bo) { KASSERT(bo != NULL, ("NULL bo in bufobj_wref")); BO_LOCK(bo); bo->bo_numoutput++; BO_UNLOCK(bo); } void bufobj_wdrop(struct bufobj *bo) { KASSERT(bo != NULL, ("NULL bo in bufobj_wdrop")); BO_LOCK(bo); KASSERT(bo->bo_numoutput > 0, ("bufobj_wdrop non-positive count")); if ((--bo->bo_numoutput == 0) && (bo->bo_flag & BO_WWAIT)) { bo->bo_flag &= ~BO_WWAIT; wakeup(&bo->bo_numoutput); } BO_UNLOCK(bo); } int bufobj_wwait(struct bufobj *bo, int slpflag, int timeo) { int error; KASSERT(bo != NULL, ("NULL bo in bufobj_wwait")); ASSERT_BO_WLOCKED(bo); error = 0; while (bo->bo_numoutput) { bo->bo_flag |= BO_WWAIT; error = msleep(&bo->bo_numoutput, BO_LOCKPTR(bo), slpflag | (PRIBIO + 1), "bo_wwait", timeo); if (error) break; } return (error); } /* * Set bio_data or bio_ma for struct bio from the struct buf. 
*/ void bdata2bio(struct buf *bp, struct bio *bip) { if (!buf_mapped(bp)) { KASSERT(unmapped_buf_allowed, ("unmapped")); bip->bio_ma = bp->b_pages; bip->bio_ma_n = bp->b_npages; bip->bio_data = unmapped_buf; bip->bio_ma_offset = (vm_offset_t)bp->b_offset & PAGE_MASK; bip->bio_flags |= BIO_UNMAPPED; KASSERT(round_page(bip->bio_ma_offset + bip->bio_length) / PAGE_SIZE == bp->b_npages, ("Buffer %p too short: %d %lld %d", bp, bip->bio_ma_offset, (long long)bip->bio_length, bip->bio_ma_n)); } else { bip->bio_data = bp->b_data; bip->bio_ma = NULL; } } /* * The MIPS pmap code currently doesn't handle aliased pages. * The VIPT caches may not handle page aliasing themselves, leading * to data corruption. * * As such, this code makes a system extremely unhappy if said * system doesn't support unaliasing the above situation in hardware. * Some "recent" systems (eg some mips24k/mips74k cores) don't enable * this feature at build time, so it has to be handled in software. * * Once the MIPS pmap/cache code grows to support this function on * earlier chips, it should be flipped back off. */ #ifdef __mips__ static int buf_pager_relbuf = 1; #else static int buf_pager_relbuf = 0; #endif SYSCTL_INT(_vfs, OID_AUTO, buf_pager_relbuf, CTLFLAG_RWTUN, &buf_pager_relbuf, 0, "Make buffer pager release buffers after reading"); /* * The buffer pager. It uses buffer reads to validate pages. * * In contrast to the generic local pager from vm/vnode_pager.c, this * pager correctly and easily handles volumes where the underlying * device block size is greater than the machine page size. The * buffer cache transparently extends the requested page run to be * aligned at the block boundary, and does the necessary bogus page * replacements in the addends to avoid obliterating already valid * pages. * * The only non-trivial issue is that the exclusive busy state for * pages, which is assumed by the vm_pager_getpages() interface, is * incompatible with the VMIO buffer cache's desire to share-busy the * pages. This function performs a trivial downgrade of the pages' * state before reading buffers, and a less trivial upgrade from the * shared-busy to excl-busy state after the read. */ int vfs_bio_getpages(struct vnode *vp, vm_page_t *ma, int count, int *rbehind, int *rahead, vbg_get_lblkno_t get_lblkno, vbg_get_blksize_t get_blksize) { vm_page_t m; vm_object_t object; struct buf *bp; struct mount *mp; daddr_t lbn, lbnp; vm_ooffset_t la, lb, poff, poffe; long bsize; int bo_bs, br_flags, error, i, pgsin, pgsin_a, pgsin_b; bool redo, lpart; object = vp->v_object; mp = vp->v_mount; error = 0; la = IDX_TO_OFF(ma[count - 1]->pindex); if (la >= object->un_pager.vnp.vnp_size) return (VM_PAGER_BAD); /* * Change the meaning of la from where the last requested page starts * to where it ends, because that's the end of the requested region * and the start of the potential read-ahead region. */ la += PAGE_SIZE; lpart = la > object->un_pager.vnp.vnp_size; bo_bs = get_blksize(vp, get_lblkno(vp, IDX_TO_OFF(ma[0]->pindex))); /* * Calculate read-ahead, behind and total pages. 
*/ pgsin = count; lb = IDX_TO_OFF(ma[0]->pindex); pgsin_b = OFF_TO_IDX(lb - rounddown2(lb, bo_bs)); pgsin += pgsin_b; if (rbehind != NULL) *rbehind = pgsin_b; pgsin_a = OFF_TO_IDX(roundup2(la, bo_bs) - la); if (la + IDX_TO_OFF(pgsin_a) >= object->un_pager.vnp.vnp_size) pgsin_a = OFF_TO_IDX(roundup2(object->un_pager.vnp.vnp_size, PAGE_SIZE) - la); pgsin += pgsin_a; if (rahead != NULL) *rahead = pgsin_a; VM_CNT_INC(v_vnodein); VM_CNT_ADD(v_vnodepgsin, pgsin); br_flags = (mp != NULL && (mp->mnt_kern_flag & MNTK_UNMAPPED_BUFS) != 0) ? GB_UNMAPPED : 0; VM_OBJECT_WLOCK(object); again: for (i = 0; i < count; i++) vm_page_busy_downgrade(ma[i]); VM_OBJECT_WUNLOCK(object); lbnp = -1; for (i = 0; i < count; i++) { m = ma[i]; /* * Pages are shared busy and the object lock is not * owned, which together allow for the pages' * invalidation. The racy test for validity avoids * useless creation of the buffer for the most typical * case when invalidation is not used in redo or for * parallel read. The shared->excl upgrade loop at * the end of the function catches the race in a * reliable way (protected by the object lock). */ if (m->valid == VM_PAGE_BITS_ALL) continue; poff = IDX_TO_OFF(m->pindex); poffe = MIN(poff + PAGE_SIZE, object->un_pager.vnp.vnp_size); for (; poff < poffe; poff += bsize) { lbn = get_lblkno(vp, poff); if (lbn == lbnp) goto next_page; lbnp = lbn; bsize = get_blksize(vp, lbn); error = bread_gb(vp, lbn, bsize, curthread->td_ucred, br_flags, &bp); if (error != 0) goto end_pages; if (LIST_EMPTY(&bp->b_dep)) { /* * Invalidation clears m->valid, but * may leave B_CACHE flag if the * buffer existed at the invalidation * time. In this case, recycle the * buffer to do real read on next * bread() after redo. * * Otherwise B_RELBUF is not strictly * necessary, enable to reduce buf * cache pressure. */ if (buf_pager_relbuf || m->valid != VM_PAGE_BITS_ALL) bp->b_flags |= B_RELBUF; bp->b_flags &= ~B_NOCACHE; brelse(bp); } else { bqrelse(bp); } } KASSERT(1 /* racy, enable for debugging */ || m->valid == VM_PAGE_BITS_ALL || i == count - 1, ("buf %d %p invalid", i, m)); if (i == count - 1 && lpart) { VM_OBJECT_WLOCK(object); if (m->valid != 0 && m->valid != VM_PAGE_BITS_ALL) vm_page_zero_invalid(m, TRUE); VM_OBJECT_WUNLOCK(object); } next_page:; } end_pages: VM_OBJECT_WLOCK(object); redo = false; for (i = 0; i < count; i++) { vm_page_sunbusy(ma[i]); ma[i] = vm_page_grab(object, ma[i]->pindex, VM_ALLOC_NORMAL); /* * Since the pages were only sbusy while neither the * buffer nor the object lock was held by us, or * reallocated while vm_page_grab() slept for busy * relinguish, they could have been invalidated. * Recheck the valid bits and re-read as needed. * * Note that the last page is made fully valid in the * read loop, and partial validity for the page at * index count - 1 could mean that the page was * invalidated or removed, so we must restart for * safety as well. */ if (ma[i]->valid != VM_PAGE_BITS_ALL) redo = true; } if (redo && error == 0) goto again; VM_OBJECT_WUNLOCK(object); return (error != 0 ? 
VM_PAGER_ERROR : VM_PAGER_OK); } #include "opt_ddb.h" #ifdef DDB #include /* DDB command to show buffer data */ DB_SHOW_COMMAND(buffer, db_show_buffer) { /* get args */ struct buf *bp = (struct buf *)addr; #ifdef FULL_BUF_TRACKING uint32_t i, j; #endif if (!have_addr) { db_printf("usage: show buffer \n"); return; } db_printf("buf at %p\n", bp); db_printf("b_flags = 0x%b, b_xflags=0x%b\n", (u_int)bp->b_flags, PRINT_BUF_FLAGS, (u_int)bp->b_xflags, PRINT_BUF_XFLAGS); db_printf("b_vflags=0x%b b_ioflags0x%b\n", (u_int)bp->b_vflags, PRINT_BUF_VFLAGS, (u_int)bp->b_ioflags, PRINT_BIO_FLAGS); db_printf( "b_error = %d, b_bufsize = %ld, b_bcount = %ld, b_resid = %ld\n" "b_bufobj = (%p), b_data = %p\n, b_blkno = %jd, b_lblkno = %jd, " "b_vp = %p, b_dep = %p\n", bp->b_error, bp->b_bufsize, bp->b_bcount, bp->b_resid, bp->b_bufobj, bp->b_data, (intmax_t)bp->b_blkno, (intmax_t)bp->b_lblkno, bp->b_vp, bp->b_dep.lh_first); db_printf("b_kvabase = %p, b_kvasize = %d\n", bp->b_kvabase, bp->b_kvasize); if (bp->b_npages) { int i; db_printf("b_npages = %d, pages(OBJ, IDX, PA): ", bp->b_npages); for (i = 0; i < bp->b_npages; i++) { vm_page_t m; m = bp->b_pages[i]; if (m != NULL) db_printf("(%p, 0x%lx, 0x%lx)", m->object, (u_long)m->pindex, (u_long)VM_PAGE_TO_PHYS(m)); else db_printf("( ??? )"); if ((i + 1) < bp->b_npages) db_printf(","); } db_printf("\n"); } BUF_LOCKPRINTINFO(bp); #if defined(FULL_BUF_TRACKING) db_printf("b_io_tracking: b_io_tcnt = %u\n", bp->b_io_tcnt); i = bp->b_io_tcnt % BUF_TRACKING_SIZE; for (j = 1; j <= BUF_TRACKING_SIZE; j++) { if (bp->b_io_tracking[BUF_TRACKING_ENTRY(i - j)] == NULL) continue; db_printf(" %2u: %s\n", j, bp->b_io_tracking[BUF_TRACKING_ENTRY(i - j)]); } #elif defined(BUF_TRACKING) db_printf("b_io_tracking: %s\n", bp->b_io_tracking); #endif db_printf(" "); } DB_SHOW_COMMAND(bufqueues, bufqueues) { struct bufdomain *bd; struct buf *bp; long total; int i, j, cnt; db_printf("bqempty: %d\n", bqempty.bq_len); for (i = 0; i < buf_domains; i++) { bd = &bdomain[i]; db_printf("Buf domain %d\n", i); db_printf("\tfreebufs\t%d\n", bd->bd_freebuffers); db_printf("\tlofreebufs\t%d\n", bd->bd_lofreebuffers); db_printf("\thifreebufs\t%d\n", bd->bd_hifreebuffers); db_printf("\n"); db_printf("\tbufspace\t%ld\n", bd->bd_bufspace); db_printf("\tmaxbufspace\t%ld\n", bd->bd_maxbufspace); db_printf("\thibufspace\t%ld\n", bd->bd_hibufspace); db_printf("\tlobufspace\t%ld\n", bd->bd_lobufspace); db_printf("\tbufspacethresh\t%ld\n", bd->bd_bufspacethresh); db_printf("\n"); db_printf("\tnumdirtybuffers\t%d\n", bd->bd_numdirtybuffers); db_printf("\tlodirtybuffers\t%d\n", bd->bd_lodirtybuffers); db_printf("\thidirtybuffers\t%d\n", bd->bd_hidirtybuffers); db_printf("\tdirtybufthresh\t%d\n", bd->bd_dirtybufthresh); db_printf("\n"); total = 0; TAILQ_FOREACH(bp, &bd->bd_cleanq->bq_queue, b_freelist) total += bp->b_bufsize; db_printf("\tcleanq count\t%d (%ld)\n", bd->bd_cleanq->bq_len, total); total = 0; TAILQ_FOREACH(bp, &bd->bd_dirtyq.bq_queue, b_freelist) total += bp->b_bufsize; db_printf("\tdirtyq count\t%d (%ld)\n", bd->bd_dirtyq.bq_len, total); db_printf("\twakeup\t\t%d\n", bd->bd_wanted); db_printf("\tlim\t\t%d\n", bd->bd_lim); db_printf("\tCPU "); for (j = 0; j <= mp_maxid; j++) db_printf("%d, ", bd->bd_subq[j].bq_len); db_printf("\n"); cnt = 0; total = 0; for (j = 0; j < nbuf; j++) if (buf[j].b_domain == i && BUF_ISLOCKED(&buf[j])) { cnt++; total += buf[j].b_bufsize; } db_printf("\tLocked buffers: %d space %ld\n", cnt, total); cnt = 0; total = 0; for (j = 0; j < nbuf; j++) if (buf[j].b_domain == i) { 
cnt++; total += buf[j].b_bufsize; } db_printf("\tTotal buffers: %d space %ld\n", cnt, total); } } DB_SHOW_COMMAND(lockedbufs, lockedbufs) { struct buf *bp; int i; for (i = 0; i < nbuf; i++) { bp = &buf[i]; if (BUF_ISLOCKED(bp)) { db_show_buffer((uintptr_t)bp, 1, 0, NULL); db_printf("\n"); if (db_pager_quit) break; } } } DB_SHOW_COMMAND(vnodebufs, db_show_vnodebufs) { struct vnode *vp; struct buf *bp; if (!have_addr) { db_printf("usage: show vnodebufs \n"); return; } vp = (struct vnode *)addr; db_printf("Clean buffers:\n"); TAILQ_FOREACH(bp, &vp->v_bufobj.bo_clean.bv_hd, b_bobufs) { db_show_buffer((uintptr_t)bp, 1, 0, NULL); db_printf("\n"); } db_printf("Dirty buffers:\n"); TAILQ_FOREACH(bp, &vp->v_bufobj.bo_dirty.bv_hd, b_bobufs) { db_show_buffer((uintptr_t)bp, 1, 0, NULL); db_printf("\n"); } } DB_COMMAND(countfreebufs, db_coundfreebufs) { struct buf *bp; int i, used = 0, nfree = 0; if (have_addr) { db_printf("usage: countfreebufs\n"); return; } for (i = 0; i < nbuf; i++) { bp = &buf[i]; if (bp->b_qindex == QUEUE_EMPTY) nfree++; else used++; } db_printf("Counted %d free, %d used (%d tot)\n", nfree, used, nfree + used); db_printf("numfreebuffers is %d\n", numfreebuffers); } #endif /* DDB */ Index: user/ngie/bug-237403/sys/modules/allwinner/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/Makefile (revision 346925) +++ user/ngie/bug-237403/sys/modules/allwinner/Makefile (revision 346926) @@ -1,7 +1,14 @@ # $FreeBSD$ # Build modules specific to Allwinner. SUBDIR = \ + aw_pwm \ + aw_rtc \ + aw_rsb \ + aw_sid \ aw_spi \ + aw_thermal \ + axp81x \ + if_awg .include Index: user/ngie/bug-237403/sys/modules/allwinner/aw_pwm/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/aw_pwm/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/aw_pwm/Makefile (revision 346926) @@ -0,0 +1,15 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= aw_pwm +SRCS= aw_pwm.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + pwm_if.h + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/aw_pwm/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/aw_rsb/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/aw_rsb/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/aw_rsb/Makefile (revision 346926) @@ -0,0 +1,15 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= aw_rsb +SRCS= aw_rsb.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + iicbus_if.h + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/aw_rsb/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/aw_rtc/Makefile =================================================================== --- 
user/ngie/bug-237403/sys/modules/allwinner/aw_rtc/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/aw_rtc/Makefile (revision 346926) @@ -0,0 +1,15 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= aw_rtc +SRCS= aw_rtc.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + spibus_if.h \ + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/aw_rtc/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/aw_sid/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/aw_sid/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/aw_sid/Makefile (revision 346926) @@ -0,0 +1,14 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= aw_sid +SRCS= aw_sid.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/aw_sid/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/aw_thermal/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/aw_thermal/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/aw_thermal/Makefile (revision 346926) @@ -0,0 +1,14 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= aw_thermal +SRCS= aw_thermal.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/aw_thermal/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/axp81x/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/axp81x/Makefile (nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/axp81x/Makefile (revision 346926) @@ -0,0 +1,15 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= axp81x +SRCS= axp81x.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + iicbus_if.h + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/axp81x/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/allwinner/if_awg/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/allwinner/if_awg/Makefile 
(nonexistent) +++ user/ngie/bug-237403/sys/modules/allwinner/if_awg/Makefile (revision 346926) @@ -0,0 +1,14 @@ +# $FreeBSD$ + +.PATH: ${SRCTOP}/sys/arm/allwinner + +KMOD= if_awg +SRCS= if_awg.c + +SRCS+= \ + bus_if.h \ + clknode_if.h \ + device_if.h \ + ofw_bus_if.h \ + +.include Property changes on: user/ngie/bug-237403/sys/modules/allwinner/if_awg/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/sys/modules/fusefs/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/fusefs/Makefile (revision 346925) +++ user/ngie/bug-237403/sys/modules/fusefs/Makefile (revision 346926) @@ -1,13 +1,29 @@ # $FreeBSD$ .PATH: ${SRCTOP}/sys/fs/fuse KMOD= fusefs SRCS= vnode_if.h \ fuse_node.c fuse_io.c fuse_device.c fuse_ipc.c fuse_file.c \ fuse_vfsops.c fuse_vnops.c fuse_internal.c fuse_main.c # Symlink for backwards compatibility with systems installed at 12.0 or older +.if ${MACHINE_CPUARCH} != "powerpc" SYMLINKS= ${KMOD}.ko ${KMODDIR}/fuse.ko +.else +# Some PPC systems use msdosfs for /boot, which can't handle links or symlinks +afterinstall: alias alias_debug +alias: .PHONY + ${INSTALL} -T release -o ${KMODOWN} -g ${KMODGRP} -m ${KMODMODE} \ + ${_INSTALLFLAGS} ${PROG} ${DESTDIR}${KMODDIR}/fuse.ko +.if defined(DEBUG_FLAGS) && !defined(INSTALL_NODEBUG) && "${MK_KERNEL_SYMBOLS}" != "no" +alias_debug: .PHONY + ${INSTALL} -T debug -o ${KMODOWN} -g ${KMODGRP} -m ${KMODMODE} \ + ${_INSTALLFLAGS} ${PROG}.debug \ + ${DESTDIR}${KERN_DEBUGDIR}${KMODDIR}/fuse.ko +.else +alias_debug: .PHONY +.endif +.endif .include Index: user/ngie/bug-237403/sys/modules/if_gre/Makefile =================================================================== --- user/ngie/bug-237403/sys/modules/if_gre/Makefile (revision 346925) +++ user/ngie/bug-237403/sys/modules/if_gre/Makefile (revision 346926) @@ -1,12 +1,12 @@ # $FreeBSD$ SYSDIR?=${SRCTOP}/sys .PATH: ${SYSDIR}/net ${SYSDIR}/netinet ${SYSDIR}/netinet6 .include "${SYSDIR}/conf/kern.opts.mk" KMOD= if_gre -SRCS= if_gre.c opt_inet.h opt_inet6.h +SRCS= if_gre.c opt_inet.h opt_inet6.h opt_rss.h SRCS.INET= ip_gre.c SRCS.INET6= ip6_gre.c .include Index: user/ngie/bug-237403/sys/net/if_gre.c =================================================================== --- user/ngie/bug-237403/sys/net/if_gre.c (revision 346925) +++ user/ngie/bug-237403/sys/net/if_gre.c (revision 346926) @@ -1,679 +1,820 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 1998 The NetBSD Foundation, Inc. * Copyright (c) 2014, 2018 Andrey V. Elsukov * All rights reserved. * * This code is derived from software contributed to The NetBSD Foundation * by Heiko W.Rupp * * IPv6-over-GRE contributed by Gert Doering * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. 
AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * $NetBSD: if_gre.c,v 1.49 2003/12/11 00:22:29 itojun Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" +#include "opt_rss.h" #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #ifdef INET #include #include #include +#ifdef RSS +#include #endif +#endif #ifdef INET6 #include #include #include +#ifdef RSS +#include #endif +#endif #include +#include #include #include #include #include #define GREMTU 1476 static const char grename[] = "gre"; MALLOC_DEFINE(M_GRE, grename, "Generic Routing Encapsulation"); static struct sx gre_ioctl_sx; SX_SYSINIT(gre_ioctl_sx, &gre_ioctl_sx, "gre_ioctl"); static int gre_clone_create(struct if_clone *, int, caddr_t); static void gre_clone_destroy(struct ifnet *); VNET_DEFINE_STATIC(struct if_clone *, gre_cloner); #define V_gre_cloner VNET(gre_cloner) static void gre_qflush(struct ifnet *); static int gre_transmit(struct ifnet *, struct mbuf *); static int gre_ioctl(struct ifnet *, u_long, caddr_t); static int gre_output(struct ifnet *, struct mbuf *, const struct sockaddr *, struct route *); static void gre_delete_tunnel(struct gre_softc *); SYSCTL_DECL(_net_link); static SYSCTL_NODE(_net_link, IFT_TUNNEL, gre, CTLFLAG_RW, 0, "Generic Routing Encapsulation"); #ifndef MAX_GRE_NEST /* * This macro controls the default upper limitation on nesting of gre tunnels. * Since, setting a large value to this macro with a careless configuration * may introduce system crash, we don't allow any nestings by default. * If you need to configure nested gre tunnels, you can define this macro * in your kernel configuration file. However, if you do so, please be * careful to configure the tunnels so that it won't make a loop. 
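 *
 * (Editor's note, not part of the original comment: with the default of 1 a
 * packet may be encapsulated by a single gre(4) interface; routing that
 * traffic through a second gre interface trips the nesting check and the
 * packet is dropped.  Raising the limit to 2 permits gre-over-gre, provided
 * the tunnels are laid out so the encapsulated traffic cannot loop back
 * into the interface that produced it.)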
*/ #define MAX_GRE_NEST 1 #endif VNET_DEFINE_STATIC(int, max_gre_nesting) = MAX_GRE_NEST; #define V_max_gre_nesting VNET(max_gre_nesting) SYSCTL_INT(_net_link_gre, OID_AUTO, max_nesting, CTLFLAG_RW | CTLFLAG_VNET, &VNET_NAME(max_gre_nesting), 0, "Max nested tunnels"); static void vnet_gre_init(const void *unused __unused) { V_gre_cloner = if_clone_simple(grename, gre_clone_create, gre_clone_destroy, 0); #ifdef INET in_gre_init(); #endif #ifdef INET6 in6_gre_init(); #endif } VNET_SYSINIT(vnet_gre_init, SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY, vnet_gre_init, NULL); static void vnet_gre_uninit(const void *unused __unused) { if_clone_detach(V_gre_cloner); #ifdef INET in_gre_uninit(); #endif #ifdef INET6 in6_gre_uninit(); #endif + /* XXX: epoch_call drain */ } VNET_SYSUNINIT(vnet_gre_uninit, SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY, vnet_gre_uninit, NULL); static int gre_clone_create(struct if_clone *ifc, int unit, caddr_t params) { struct gre_softc *sc; sc = malloc(sizeof(struct gre_softc), M_GRE, M_WAITOK | M_ZERO); sc->gre_fibnum = curthread->td_proc->p_fibnum; GRE2IFP(sc) = if_alloc(IFT_TUNNEL); GRE2IFP(sc)->if_softc = sc; if_initname(GRE2IFP(sc), grename, unit); GRE2IFP(sc)->if_mtu = GREMTU; GRE2IFP(sc)->if_flags = IFF_POINTOPOINT|IFF_MULTICAST; GRE2IFP(sc)->if_output = gre_output; GRE2IFP(sc)->if_ioctl = gre_ioctl; GRE2IFP(sc)->if_transmit = gre_transmit; GRE2IFP(sc)->if_qflush = gre_qflush; GRE2IFP(sc)->if_capabilities |= IFCAP_LINKSTATE; GRE2IFP(sc)->if_capenable |= IFCAP_LINKSTATE; if_attach(GRE2IFP(sc)); bpfattach(GRE2IFP(sc), DLT_NULL, sizeof(u_int32_t)); return (0); } static void gre_clone_destroy(struct ifnet *ifp) { struct gre_softc *sc; sx_xlock(&gre_ioctl_sx); sc = ifp->if_softc; gre_delete_tunnel(sc); bpfdetach(ifp); if_detach(ifp); ifp->if_softc = NULL; sx_xunlock(&gre_ioctl_sx); GRE_WAIT(); if_free(ifp); free(sc, M_GRE); } static int gre_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { struct ifreq *ifr = (struct ifreq *)data; struct gre_softc *sc; uint32_t opt; int error; switch (cmd) { case SIOCSIFMTU: /* XXX: */ if (ifr->ifr_mtu < 576) return (EINVAL); ifp->if_mtu = ifr->ifr_mtu; return (0); case SIOCSIFADDR: ifp->if_flags |= IFF_UP; case SIOCSIFFLAGS: case SIOCADDMULTI: case SIOCDELMULTI: return (0); case GRESADDRS: case GRESADDRD: case GREGADDRS: case GREGADDRD: case GRESPROTO: case GREGPROTO: return (EOPNOTSUPP); } sx_xlock(&gre_ioctl_sx); sc = ifp->if_softc; if (sc == NULL) { error = ENXIO; goto end; } error = 0; switch (cmd) { case SIOCDIFPHYADDR: if (sc->gre_family == 0) break; gre_delete_tunnel(sc); break; #ifdef INET case SIOCSIFPHYADDR: case SIOCGIFPSRCADDR: case SIOCGIFPDSTADDR: error = in_gre_ioctl(sc, cmd, data); break; #endif #ifdef INET6 case SIOCSIFPHYADDR_IN6: case SIOCGIFPSRCADDR_IN6: case SIOCGIFPDSTADDR_IN6: error = in6_gre_ioctl(sc, cmd, data); break; #endif case SIOCGTUNFIB: ifr->ifr_fib = sc->gre_fibnum; break; case SIOCSTUNFIB: if ((error = priv_check(curthread, PRIV_NET_GRE)) != 0) break; if (ifr->ifr_fib >= rt_numfibs) error = EINVAL; else sc->gre_fibnum = ifr->ifr_fib; break; case GRESKEY: case GRESOPTS: + case GRESPORT: if ((error = priv_check(curthread, PRIV_NET_GRE)) != 0) break; if ((error = copyin(ifr_data_get_ptr(ifr), &opt, sizeof(opt))) != 0) break; if (cmd == GRESKEY) { if (sc->gre_key == opt) break; } else if (cmd == GRESOPTS) { if (opt & ~GRE_OPTMASK) { error = EINVAL; break; } if (sc->gre_options == opt) break; + } else if (cmd == GRESPORT) { + if (opt != 0 && (opt < V_ipport_hifirstauto || + opt > V_ipport_hilastauto)) { + 
error = EINVAL; + break; + } + if (sc->gre_port == opt) + break; + if ((sc->gre_options & GRE_UDPENCAP) == 0) { + /* + * UDP encapsulation is not enabled, thus + * there is no need to reattach softc. + */ + sc->gre_port = opt; + break; + } } switch (sc->gre_family) { #ifdef INET case AF_INET: - in_gre_setopts(sc, cmd, opt); + error = in_gre_setopts(sc, cmd, opt); break; #endif #ifdef INET6 case AF_INET6: - in6_gre_setopts(sc, cmd, opt); + error = in6_gre_setopts(sc, cmd, opt); break; #endif default: + /* + * Tunnel is not yet configured. + * We can just change any parameters. + */ if (cmd == GRESKEY) sc->gre_key = opt; - else + if (cmd == GRESOPTS) sc->gre_options = opt; + if (cmd == GRESPORT) + sc->gre_port = opt; break; } /* * XXX: Do we need to initiate change of interface * state here? */ break; case GREGKEY: error = copyout(&sc->gre_key, ifr_data_get_ptr(ifr), sizeof(sc->gre_key)); break; case GREGOPTS: error = copyout(&sc->gre_options, ifr_data_get_ptr(ifr), sizeof(sc->gre_options)); break; + case GREGPORT: + error = copyout(&sc->gre_port, ifr_data_get_ptr(ifr), + sizeof(sc->gre_port)); + break; default: error = EINVAL; break; } if (error == 0 && sc->gre_family != 0) { if ( #ifdef INET cmd == SIOCSIFPHYADDR || #endif #ifdef INET6 cmd == SIOCSIFPHYADDR_IN6 || #endif 0) { if_link_state_change(ifp, LINK_STATE_UP); } } end: sx_xunlock(&gre_ioctl_sx); return (error); } static void gre_delete_tunnel(struct gre_softc *sc) { + struct gre_socket *gs; sx_assert(&gre_ioctl_sx, SA_XLOCKED); if (sc->gre_family != 0) { CK_LIST_REMOVE(sc, chain); CK_LIST_REMOVE(sc, srchash); GRE_WAIT(); free(sc->gre_hdr, M_GRE); sc->gre_family = 0; } + /* + * If this Tunnel was the last one that could use UDP socket, + * we should unlink socket from hash table and close it. 
+ */ + if ((gs = sc->gre_so) != NULL && CK_LIST_EMPTY(&gs->list)) { + CK_LIST_REMOVE(gs, chain); + soclose(gs->so); + epoch_call(net_epoch_preempt, &gs->epoch_ctx, gre_sofree); + sc->gre_so = NULL; + } GRE2IFP(sc)->if_drv_flags &= ~IFF_DRV_RUNNING; if_link_state_change(GRE2IFP(sc), LINK_STATE_DOWN); } struct gre_list * gre_hashinit(void) { struct gre_list *hash; int i; hash = malloc(sizeof(struct gre_list) * GRE_HASH_SIZE, M_GRE, M_WAITOK); for (i = 0; i < GRE_HASH_SIZE; i++) CK_LIST_INIT(&hash[i]); return (hash); } void gre_hashdestroy(struct gre_list *hash) { free(hash, M_GRE); } void -gre_updatehdr(struct gre_softc *sc, struct grehdr *gh) +gre_sofree(epoch_context_t ctx) { + struct gre_socket *gs; + + gs = __containerof(ctx, struct gre_socket, epoch_ctx); + free(gs, M_GRE); +} + +static __inline uint16_t +gre_cksum_add(uint16_t sum, uint16_t a) +{ + uint16_t res; + + res = sum + a; + return (res + (res < a)); +} + +void +gre_update_udphdr(struct gre_softc *sc, struct udphdr *udp, uint16_t csum) +{ + + sx_assert(&gre_ioctl_sx, SA_XLOCKED); + MPASS(sc->gre_options & GRE_UDPENCAP); + + udp->uh_dport = htons(GRE_UDPPORT); + udp->uh_sport = htons(sc->gre_port); + udp->uh_sum = csum; + udp->uh_ulen = 0; +} + +void +gre_update_hdr(struct gre_softc *sc, struct grehdr *gh) +{ uint32_t *opts; uint16_t flags; sx_assert(&gre_ioctl_sx, SA_XLOCKED); flags = 0; opts = gh->gre_opts; if (sc->gre_options & GRE_ENABLE_CSUM) { flags |= GRE_FLAGS_CP; sc->gre_hlen += 2 * sizeof(uint16_t); *opts++ = 0; } if (sc->gre_key != 0) { flags |= GRE_FLAGS_KP; sc->gre_hlen += sizeof(uint32_t); *opts++ = htonl(sc->gre_key); } if (sc->gre_options & GRE_ENABLE_SEQ) { flags |= GRE_FLAGS_SP; sc->gre_hlen += sizeof(uint32_t); *opts++ = 0; } else sc->gre_oseq = 0; gh->gre_flags = htons(flags); } int gre_input(struct mbuf *m, int off, int proto, void *arg) { struct gre_softc *sc = arg; struct grehdr *gh; struct ifnet *ifp; uint32_t *opts; #ifdef notyet uint32_t key; #endif uint16_t flags; int hlen, isr, af; ifp = GRE2IFP(sc); hlen = off + sizeof(struct grehdr) + 4 * sizeof(uint32_t); if (m->m_pkthdr.len < hlen) goto drop; if (m->m_len < hlen) { m = m_pullup(m, hlen); if (m == NULL) goto drop; } gh = (struct grehdr *)mtodo(m, off); flags = ntohs(gh->gre_flags); if (flags & ~GRE_FLAGS_MASK) goto drop; opts = gh->gre_opts; hlen = 2 * sizeof(uint16_t); if (flags & GRE_FLAGS_CP) { /* reserved1 field must be zero */ if (((uint16_t *)opts)[1] != 0) goto drop; if (in_cksum_skip(m, m->m_pkthdr.len, off) != 0) goto drop; hlen += 2 * sizeof(uint16_t); opts++; } if (flags & GRE_FLAGS_KP) { #ifdef notyet /* * XXX: The current implementation uses the key only for outgoing * packets. But we can check the key value here, or even in the * encapcheck function. */ key = ntohl(*opts); #endif hlen += sizeof(uint32_t); opts++; } #ifdef notyet } else key = 0; if (sc->gre_key != 0 && (key != sc->gre_key || key != 0)) goto drop; #endif if (flags & GRE_FLAGS_SP) { #ifdef notyet seq = ntohl(*opts); #endif hlen += sizeof(uint32_t); } switch (ntohs(gh->gre_proto)) { case ETHERTYPE_WCCP: /* * For WCCP skip an additional 4 bytes if after GRE header * doesn't follow an IP header. 
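 *
 * (Editor's note, not part of the original comment: the test below masks
 * the high nibble of the first payload byte, which is the IP version
 * field.  A value of 0x40 means a plain IPv4 header follows the GRE header
 * directly; anything else implies the 4-byte WCCPv2 redirect header is
 * present and has to be skipped as well.)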
*/ if (flags == 0 && (*(uint8_t *)gh->gre_opts & 0xF0) != 0x40) hlen += sizeof(uint32_t); /* FALLTHROUGH */ case ETHERTYPE_IP: isr = NETISR_IP; af = AF_INET; break; case ETHERTYPE_IPV6: isr = NETISR_IPV6; af = AF_INET6; break; default: goto drop; } m_adj(m, off + hlen); m_clrprotoflags(m); m->m_pkthdr.rcvif = ifp; M_SETFIB(m, ifp->if_fib); #ifdef MAC mac_ifnet_create_mbuf(ifp, m); #endif BPF_MTAP2(ifp, &af, sizeof(af), m); if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); if_inc_counter(ifp, IFCOUNTER_IBYTES, m->m_pkthdr.len); if ((ifp->if_flags & IFF_MONITOR) != 0) m_freem(m); else netisr_dispatch(isr, m); return (IPPROTO_DONE); drop: if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); m_freem(m); return (IPPROTO_DONE); } static int gre_output(struct ifnet *ifp, struct mbuf *m, const struct sockaddr *dst, struct route *ro) { uint32_t af; if (dst->sa_family == AF_UNSPEC) bcopy(dst->sa_data, &af, sizeof(af)); else af = dst->sa_family; /* * Now save the af in the inbound pkt csum data, this is a cheat since * we are using the inbound csum_data field to carry the af over to * the gre_transmit() routine, avoiding using yet another mtag. */ m->m_pkthdr.csum_data = af; return (ifp->if_transmit(ifp, m)); } static void gre_setseqn(struct grehdr *gh, uint32_t seq) { uint32_t *opts; uint16_t flags; opts = gh->gre_opts; flags = ntohs(gh->gre_flags); KASSERT((flags & GRE_FLAGS_SP) != 0, ("gre_setseqn called, but GRE_FLAGS_SP isn't set ")); if (flags & GRE_FLAGS_CP) opts++; if (flags & GRE_FLAGS_KP) opts++; *opts = htonl(seq); } +static uint32_t +gre_flowid(struct gre_softc *sc, struct mbuf *m, uint32_t af) +{ + uint32_t flowid; + + if ((sc->gre_options & GRE_UDPENCAP) == 0 || sc->gre_port != 0) + return (0); +#ifndef RSS + switch (af) { +#ifdef INET + case AF_INET: + flowid = mtod(m, struct ip *)->ip_src.s_addr ^ + mtod(m, struct ip *)->ip_dst.s_addr; + break; +#endif +#ifdef INET6 + case AF_INET6: + flowid = mtod(m, struct ip6_hdr *)->ip6_src.s6_addr32[3] ^ + mtod(m, struct ip6_hdr *)->ip6_dst.s6_addr32[3]; + break; +#endif + default: + flowid = 0; + } +#else /* RSS */ + switch (af) { +#ifdef INET + case AF_INET: + flowid = rss_hash_ip4_2tuple(mtod(m, struct ip *)->ip_src, + mtod(m, struct ip *)->ip_dst); + break; +#endif +#ifdef INET6 + case AF_INET6: + flowid = rss_hash_ip6_2tuple( + &mtod(m, struct ip6_hdr *)->ip6_src, + &mtod(m, struct ip6_hdr *)->ip6_dst); + break; +#endif + default: + flowid = 0; + } +#endif + return (flowid); +} + #define MTAG_GRE 1307983903 static int gre_transmit(struct ifnet *ifp, struct mbuf *m) { GRE_RLOCK_TRACKER; struct gre_softc *sc; struct grehdr *gh; - uint32_t af; + struct udphdr *uh; + uint32_t af, flowid; int error, len; uint16_t proto; len = 0; GRE_RLOCK(); #ifdef MAC error = mac_ifnet_check_transmit(ifp, m); if (error) { m_freem(m); goto drop; } #endif error = ENETDOWN; sc = ifp->if_softc; if ((ifp->if_flags & IFF_MONITOR) != 0 || (ifp->if_flags & IFF_UP) == 0 || (ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || sc->gre_family == 0 || (error = if_tunnel_check_nesting(ifp, m, MTAG_GRE, V_max_gre_nesting)) != 0) { m_freem(m); goto drop; } af = m->m_pkthdr.csum_data; BPF_MTAP2(ifp, &af, sizeof(af), m); m->m_flags &= ~(M_BCAST|M_MCAST); + flowid = gre_flowid(sc, m, af); M_SETFIB(m, sc->gre_fibnum); M_PREPEND(m, sc->gre_hlen, M_NOWAIT); if (m == NULL) { error = ENOBUFS; goto drop; } bcopy(sc->gre_hdr, mtod(m, void *), sc->gre_hlen); /* Determine GRE proto */ switch (af) { #ifdef INET case AF_INET: proto = htons(ETHERTYPE_IP); break; #endif #ifdef INET6 case AF_INET6: proto = 
htons(ETHERTYPE_IPV6); break; #endif default: m_freem(m); error = ENETDOWN; goto drop; } /* Determine offset of GRE header */ switch (sc->gre_family) { #ifdef INET case AF_INET: len = sizeof(struct ip); break; #endif #ifdef INET6 case AF_INET6: len = sizeof(struct ip6_hdr); break; #endif default: m_freem(m); error = ENETDOWN; goto drop; } + if (sc->gre_options & GRE_UDPENCAP) { + uh = (struct udphdr *)mtodo(m, len); + uh->uh_sport |= htons(V_ipport_hifirstauto) | + (flowid >> 16) | (flowid & 0xFFFF); + uh->uh_sport = htons(ntohs(uh->uh_sport) % + V_ipport_hilastauto); + uh->uh_ulen = htons(m->m_pkthdr.len - len); + uh->uh_sum = gre_cksum_add(uh->uh_sum, + htons(m->m_pkthdr.len - len + IPPROTO_UDP)); + m->m_pkthdr.csum_flags = sc->gre_csumflags; + m->m_pkthdr.csum_data = offsetof(struct udphdr, uh_sum); + len += sizeof(struct udphdr); + } gh = (struct grehdr *)mtodo(m, len); gh->gre_proto = proto; if (sc->gre_options & GRE_ENABLE_SEQ) gre_setseqn(gh, sc->gre_oseq++); if (sc->gre_options & GRE_ENABLE_CSUM) { *(uint16_t *)gh->gre_opts = in_cksum_skip(m, m->m_pkthdr.len, len); } len = m->m_pkthdr.len - len; switch (sc->gre_family) { #ifdef INET case AF_INET: error = in_gre_output(m, af, sc->gre_hlen); break; #endif #ifdef INET6 case AF_INET6: - error = in6_gre_output(m, af, sc->gre_hlen); + error = in6_gre_output(m, af, sc->gre_hlen, flowid); break; #endif default: m_freem(m); error = ENETDOWN; } drop: if (error) if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); else { if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); if_inc_counter(ifp, IFCOUNTER_OBYTES, len); } GRE_RUNLOCK(); return (error); } static void gre_qflush(struct ifnet *ifp __unused) { } static int gremodevent(module_t mod, int type, void *data) { switch (type) { case MOD_LOAD: case MOD_UNLOAD: break; default: return (EOPNOTSUPP); } return (0); } static moduledata_t gre_mod = { "if_gre", gremodevent, 0 }; DECLARE_MODULE(if_gre, gre_mod, SI_SUB_PSEUDO, SI_ORDER_ANY); MODULE_VERSION(if_gre, 1); Index: user/ngie/bug-237403/sys/net/if_gre.h =================================================================== --- user/ngie/bug-237403/sys/net/if_gre.h (revision 346925) +++ user/ngie/bug-237403/sys/net/if_gre.h (revision 346926) @@ -1,147 +1,185 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 1998 The NetBSD Foundation, Inc. * Copyright (c) 2014 Andrey V. Elsukov * All rights reserved * * This code is derived from software contributed to The NetBSD Foundation * by Heiko W.Rupp * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * $NetBSD: if_gre.h,v 1.13 2003/11/10 08:51:52 wiz Exp $ * $FreeBSD$ */ #ifndef _NET_IF_GRE_H_ #define _NET_IF_GRE_H_ #ifdef _KERNEL /* GRE header according to RFC 2784 and RFC 2890 */ struct grehdr { uint16_t gre_flags; /* GRE flags */ #define GRE_FLAGS_CP 0x8000 /* checksum present */ #define GRE_FLAGS_KP 0x2000 /* key present */ #define GRE_FLAGS_SP 0x1000 /* sequence present */ #define GRE_FLAGS_MASK (GRE_FLAGS_CP|GRE_FLAGS_KP|GRE_FLAGS_SP) uint16_t gre_proto; /* protocol type */ uint32_t gre_opts[0]; /* optional fields */ } __packed; #ifdef INET struct greip { struct ip gi_ip; struct grehdr gi_gre; } __packed; -#endif +struct greudp { + struct ip gi_ip; + struct udphdr gi_udp; + struct grehdr gi_gre; +} __packed; +#endif /* INET */ + #ifdef INET6 struct greip6 { struct ip6_hdr gi6_ip6; struct grehdr gi6_gre; } __packed; -#endif +struct greudp6 { + struct ip6_hdr gi6_ip6; + struct udphdr gi6_udp; + struct grehdr gi6_gre; +} __packed; +#endif /* INET6 */ + +CK_LIST_HEAD(gre_list, gre_softc); +CK_LIST_HEAD(gre_sockets, gre_socket); +struct gre_socket { + struct socket *so; + struct gre_list list; + CK_LIST_ENTRY(gre_socket) chain; + struct epoch_context epoch_ctx; +}; + struct gre_softc { struct ifnet *gre_ifp; int gre_family; /* AF of delivery header */ uint32_t gre_iseq; uint32_t gre_oseq; uint32_t gre_key; uint32_t gre_options; + uint32_t gre_csumflags; + uint32_t gre_port; u_int gre_fibnum; u_int gre_hlen; /* header size */ union { void *hdr; #ifdef INET - struct greip *gihdr; + struct greip *iphdr; + struct greudp *udphdr; #endif #ifdef INET6 - struct greip6 *gi6hdr; + struct greip6 *ip6hdr; + struct greudp6 *udp6hdr; #endif } gre_uhdr; + struct gre_socket *gre_so; CK_LIST_ENTRY(gre_softc) chain; CK_LIST_ENTRY(gre_softc) srchash; }; -CK_LIST_HEAD(gre_list, gre_softc); MALLOC_DECLARE(M_GRE); #ifndef GRE_HASH_SIZE #define GRE_HASH_SIZE (1 << 4) #endif #define GRE2IFP(sc) ((sc)->gre_ifp) #define GRE_RLOCK_TRACKER struct epoch_tracker gre_et #define GRE_RLOCK() epoch_enter_preempt(net_epoch_preempt, &gre_et) #define GRE_RUNLOCK() epoch_exit_preempt(net_epoch_preempt, &gre_et) #define GRE_WAIT() epoch_wait_preempt(net_epoch_preempt) #define gre_hdr gre_uhdr.hdr -#define gre_gihdr gre_uhdr.gihdr -#define gre_gi6hdr gre_uhdr.gi6hdr -#define gre_oip gre_gihdr->gi_ip -#define gre_oip6 gre_gi6hdr->gi6_ip6 +#define gre_iphdr gre_uhdr.iphdr +#define gre_ip6hdr gre_uhdr.ip6hdr +#define gre_udphdr gre_uhdr.udphdr +#define gre_udp6hdr gre_uhdr.udp6hdr +#define gre_oip gre_iphdr->gi_ip +#define gre_udp gre_udphdr->gi_udp +#define gre_oip6 gre_ip6hdr->gi6_ip6 +#define gre_udp6 gre_udp6hdr->gi6_udp + struct gre_list *gre_hashinit(void); void gre_hashdestroy(struct gre_list *); int gre_input(struct mbuf *, int, int, void *); -void gre_updatehdr(struct gre_softc *, struct grehdr *); +void gre_update_hdr(struct gre_softc *, struct grehdr *); +void gre_update_udphdr(struct gre_softc *, struct udphdr *, uint16_t); +void gre_sofree(epoch_context_t); void in_gre_init(void); void 
in_gre_uninit(void); -void in_gre_setopts(struct gre_softc *, u_long, uint32_t); +int in_gre_setopts(struct gre_softc *, u_long, uint32_t); int in_gre_ioctl(struct gre_softc *, u_long, caddr_t); int in_gre_output(struct mbuf *, int, int); void in6_gre_init(void); void in6_gre_uninit(void); -void in6_gre_setopts(struct gre_softc *, u_long, uint32_t); +int in6_gre_setopts(struct gre_softc *, u_long, uint32_t); int in6_gre_ioctl(struct gre_softc *, u_long, caddr_t); -int in6_gre_output(struct mbuf *, int, int); +int in6_gre_output(struct mbuf *, int, int, uint32_t); /* * CISCO uses special type for GRE tunnel created as part of WCCP * connection, while in fact those packets are just IPv4 encapsulated * into GRE. */ #define ETHERTYPE_WCCP 0x883E #endif /* _KERNEL */ #define GRESADDRS _IOW('i', 101, struct ifreq) #define GRESADDRD _IOW('i', 102, struct ifreq) #define GREGADDRS _IOWR('i', 103, struct ifreq) #define GREGADDRD _IOWR('i', 104, struct ifreq) #define GRESPROTO _IOW('i' , 105, struct ifreq) #define GREGPROTO _IOWR('i', 106, struct ifreq) #define GREGKEY _IOWR('i', 107, struct ifreq) #define GRESKEY _IOW('i', 108, struct ifreq) #define GREGOPTS _IOWR('i', 109, struct ifreq) #define GRESOPTS _IOW('i', 110, struct ifreq) +#define GREGPORT _IOWR('i', 111, struct ifreq) +#define GRESPORT _IOW('i', 112, struct ifreq) +/* GRE-in-UDP encapsulation destination port as defined in RFC8086 */ +#define GRE_UDPPORT 4754 + #define GRE_ENABLE_CSUM 0x0001 #define GRE_ENABLE_SEQ 0x0002 -#define GRE_OPTMASK (GRE_ENABLE_CSUM|GRE_ENABLE_SEQ) +#define GRE_UDPENCAP 0x0004 +#define GRE_OPTMASK (GRE_ENABLE_CSUM|GRE_ENABLE_SEQ|GRE_UDPENCAP) #endif /* _NET_IF_GRE_H_ */ Index: user/ngie/bug-237403/sys/net/if_tap.c =================================================================== --- user/ngie/bug-237403/sys/net/if_tap.c (revision 346925) +++ user/ngie/bug-237403/sys/net/if_tap.c (revision 346926) @@ -1,1127 +1,1145 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 1999-2000 by Maksim Yevmenkin * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * BASED ON: * ------------------------------------------------------------------------- * * Copyright (c) 1988, Julian Onions * Nottingham University 1987. 
*/ /* * $FreeBSD$ * $Id: if_tap.c,v 0.21 2000/07/23 21:46:02 max Exp $ */ #include "opt_inet.h" #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define CDEV_NAME "tap" #define TAPDEBUG if (tapdebug) printf static const char tapname[] = "tap"; static const char vmnetname[] = "vmnet"; #define TAPMAXUNIT 0x7fff #define VMNET_DEV_MASK CLONE_FLAG0 /* module */ static int tapmodevent(module_t, int, void *); /* device */ static void tapclone(void *, struct ucred *, char *, int, struct cdev **); static void tapcreate(struct cdev *); /* network interface */ static void tapifstart(struct ifnet *); static int tapifioctl(struct ifnet *, u_long, caddr_t); static void tapifinit(void *); static int tap_clone_create(struct if_clone *, int, caddr_t); static void tap_clone_destroy(struct ifnet *); static struct if_clone *tap_cloner; static int vmnet_clone_create(struct if_clone *, int, caddr_t); static void vmnet_clone_destroy(struct ifnet *); static struct if_clone *vmnet_cloner; /* character device */ static d_open_t tapopen; static d_close_t tapclose; static d_read_t tapread; static d_write_t tapwrite; static d_ioctl_t tapioctl; static d_poll_t tappoll; static d_kqfilter_t tapkqfilter; /* kqueue(2) */ static int tapkqread(struct knote *, long); static int tapkqwrite(struct knote *, long); static void tapkqdetach(struct knote *); static struct filterops tap_read_filterops = { .f_isfd = 1, .f_attach = NULL, .f_detach = tapkqdetach, .f_event = tapkqread, }; static struct filterops tap_write_filterops = { .f_isfd = 1, .f_attach = NULL, .f_detach = tapkqdetach, .f_event = tapkqwrite, }; static struct cdevsw tap_cdevsw = { .d_version = D_VERSION, .d_flags = D_NEEDMINOR, .d_open = tapopen, .d_close = tapclose, .d_read = tapread, .d_write = tapwrite, .d_ioctl = tapioctl, .d_poll = tappoll, .d_name = CDEV_NAME, .d_kqfilter = tapkqfilter, }; /* * All global variables in if_tap.c are locked with tapmtx, with the * exception of tapdebug, which is accessed unlocked; tapclones is * static at runtime. 
*/ static struct mtx tapmtx; static int tapdebug = 0; /* debug flag */ static int tapuopen = 0; /* allow user open() */ static int tapuponopen = 0; /* IFF_UP on open() */ static int tapdclone = 1; /* enable devfs cloning */ static SLIST_HEAD(, tap_softc) taphead; /* first device */ static struct clonedevs *tapclones; MALLOC_DECLARE(M_TAP); MALLOC_DEFINE(M_TAP, CDEV_NAME, "Ethernet tunnel interface"); SYSCTL_INT(_debug, OID_AUTO, if_tap_debug, CTLFLAG_RW, &tapdebug, 0, ""); +static struct sx tap_ioctl_sx; +SX_SYSINIT(tap_ioctl_sx, &tap_ioctl_sx, "tap_ioctl"); + SYSCTL_DECL(_net_link); static SYSCTL_NODE(_net_link, OID_AUTO, tap, CTLFLAG_RW, 0, "Ethernet tunnel software network interface"); SYSCTL_INT(_net_link_tap, OID_AUTO, user_open, CTLFLAG_RW, &tapuopen, 0, "Allow user to open /dev/tap (based on node permissions)"); SYSCTL_INT(_net_link_tap, OID_AUTO, up_on_open, CTLFLAG_RW, &tapuponopen, 0, "Bring interface up when /dev/tap is opened"); SYSCTL_INT(_net_link_tap, OID_AUTO, devfs_cloning, CTLFLAG_RWTUN, &tapdclone, 0, "Enable legacy devfs interface creation"); SYSCTL_INT(_net_link_tap, OID_AUTO, debug, CTLFLAG_RW, &tapdebug, 0, ""); DEV_MODULE(if_tap, tapmodevent, NULL); +MODULE_VERSION(if_tap, 1); static int tap_clone_create(struct if_clone *ifc, int unit, caddr_t params) { struct cdev *dev; int i; /* Find any existing device, or allocate new unit number. */ i = clone_create(&tapclones, &tap_cdevsw, &unit, &dev, 0); if (i) { dev = make_dev(&tap_cdevsw, unit, UID_ROOT, GID_WHEEL, 0600, "%s%d", tapname, unit); } tapcreate(dev); return (0); } /* vmnet devices are tap devices in disguise */ static int vmnet_clone_create(struct if_clone *ifc, int unit, caddr_t params) { struct cdev *dev; int i; /* Find any existing device, or allocate new unit number. 
*/ i = clone_create(&tapclones, &tap_cdevsw, &unit, &dev, VMNET_DEV_MASK); if (i) { dev = make_dev(&tap_cdevsw, unit | VMNET_DEV_MASK, UID_ROOT, GID_WHEEL, 0600, "%s%d", vmnetname, unit); } tapcreate(dev); return (0); } static void tap_destroy(struct tap_softc *tp) { struct ifnet *ifp = tp->tap_ifp; CURVNET_SET(ifp->if_vnet); + sx_xlock(&tap_ioctl_sx); + ifp->if_softc = NULL; + sx_xunlock(&tap_ioctl_sx); + destroy_dev(tp->tap_dev); seldrain(&tp->tap_rsel); knlist_clear(&tp->tap_rsel.si_note, 0); knlist_destroy(&tp->tap_rsel.si_note); ether_ifdetach(ifp); if_free(ifp); mtx_destroy(&tp->tap_mtx); free(tp, M_TAP); CURVNET_RESTORE(); } static void tap_clone_destroy(struct ifnet *ifp) { struct tap_softc *tp = ifp->if_softc; mtx_lock(&tapmtx); SLIST_REMOVE(&taphead, tp, tap_softc, tap_next); mtx_unlock(&tapmtx); tap_destroy(tp); } /* vmnet devices are tap devices in disguise */ static void vmnet_clone_destroy(struct ifnet *ifp) { tap_clone_destroy(ifp); } /* * tapmodevent * * module event handler */ static int tapmodevent(module_t mod, int type, void *data) { static eventhandler_tag eh_tag = NULL; struct tap_softc *tp = NULL; struct ifnet *ifp = NULL; switch (type) { case MOD_LOAD: /* intitialize device */ mtx_init(&tapmtx, "tapmtx", NULL, MTX_DEF); SLIST_INIT(&taphead); clone_setup(&tapclones); eh_tag = EVENTHANDLER_REGISTER(dev_clone, tapclone, 0, 1000); if (eh_tag == NULL) { clone_cleanup(&tapclones); mtx_destroy(&tapmtx); return (ENOMEM); } tap_cloner = if_clone_simple(tapname, tap_clone_create, tap_clone_destroy, 0); vmnet_cloner = if_clone_simple(vmnetname, vmnet_clone_create, vmnet_clone_destroy, 0); return (0); case MOD_UNLOAD: /* * The EBUSY algorithm here can't quite atomically * guarantee that this is race-free since we have to * release the tap mtx to deregister the clone handler. */ mtx_lock(&tapmtx); SLIST_FOREACH(tp, &taphead, tap_next) { mtx_lock(&tp->tap_mtx); if (tp->tap_flags & TAP_OPEN) { mtx_unlock(&tp->tap_mtx); mtx_unlock(&tapmtx); return (EBUSY); } mtx_unlock(&tp->tap_mtx); } mtx_unlock(&tapmtx); EVENTHANDLER_DEREGISTER(dev_clone, eh_tag); if_clone_detach(tap_cloner); if_clone_detach(vmnet_cloner); drain_dev_clone_events(); mtx_lock(&tapmtx); while ((tp = SLIST_FIRST(&taphead)) != NULL) { SLIST_REMOVE_HEAD(&taphead, tap_next); mtx_unlock(&tapmtx); ifp = tp->tap_ifp; TAPDEBUG("detaching %s\n", ifp->if_xname); tap_destroy(tp); mtx_lock(&tapmtx); } mtx_unlock(&tapmtx); clone_cleanup(&tapclones); mtx_destroy(&tapmtx); break; default: return (EOPNOTSUPP); } return (0); } /* tapmodevent */ /* * DEVFS handler * * We need to support two kind of devices - tap and vmnet */ static void tapclone(void *arg, struct ucred *cred, char *name, int namelen, struct cdev **dev) { char devname[SPECNAMELEN + 1]; int i, unit, append_unit; int extra; if (*dev != NULL) return; if (!tapdclone || (!tapuopen && priv_check_cred(cred, PRIV_NET_IFCREATE) != 0)) return; unit = 0; append_unit = 0; extra = 0; /* We're interested in only tap/vmnet devices. 
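 *
 * (Editor's illustration, not part of the original comment: dev_stdclone()
 * splits a request such as "tap5" into the base name and unit 5, whereas a
 * bare "tap" or "vmnet" leaves unit at -1 so that clone_create() below
 * allocates a free unit which is then appended to the device name.)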
*/ if (strcmp(name, tapname) == 0) { unit = -1; } else if (strcmp(name, vmnetname) == 0) { unit = -1; extra = VMNET_DEV_MASK; } else if (dev_stdclone(name, NULL, tapname, &unit) != 1) { if (dev_stdclone(name, NULL, vmnetname, &unit) != 1) { return; } else { extra = VMNET_DEV_MASK; } } if (unit == -1) append_unit = 1; CURVNET_SET(CRED_TO_VNET(cred)); /* find any existing device, or allocate new unit number */ i = clone_create(&tapclones, &tap_cdevsw, &unit, dev, extra); if (i) { if (append_unit) { /* * We were passed 'tun' or 'tap', with no unit specified * so we'll need to append it now. */ namelen = snprintf(devname, sizeof(devname), "%s%d", name, unit); name = devname; } *dev = make_dev_credf(MAKEDEV_REF, &tap_cdevsw, unit | extra, cred, UID_ROOT, GID_WHEEL, 0600, "%s", name); } if_clone_create(name, namelen, NULL); CURVNET_RESTORE(); } /* tapclone */ /* * tapcreate * * to create interface */ static void tapcreate(struct cdev *dev) { struct ifnet *ifp = NULL; struct tap_softc *tp = NULL; unsigned short macaddr_hi; uint32_t macaddr_mid; int unit; const char *name = NULL; u_char eaddr[6]; /* allocate driver storage and create device */ tp = malloc(sizeof(*tp), M_TAP, M_WAITOK | M_ZERO); mtx_init(&tp->tap_mtx, "tap_mtx", NULL, MTX_DEF); mtx_lock(&tapmtx); SLIST_INSERT_HEAD(&taphead, tp, tap_next); mtx_unlock(&tapmtx); unit = dev2unit(dev); /* select device: tap or vmnet */ if (unit & VMNET_DEV_MASK) { name = vmnetname; tp->tap_flags |= TAP_VMNET; } else name = tapname; unit &= TAPMAXUNIT; TAPDEBUG("tapcreate(%s%d). minor = %#x\n", name, unit, dev2unit(dev)); /* generate fake MAC address: 00 bd xx xx xx unit_no */ macaddr_hi = htons(0x00bd); macaddr_mid = (uint32_t) ticks; bcopy(&macaddr_hi, eaddr, sizeof(short)); bcopy(&macaddr_mid, &eaddr[2], sizeof(uint32_t)); eaddr[5] = (u_char)unit; /* fill the rest and attach interface */ ifp = tp->tap_ifp = if_alloc(IFT_ETHER); if (ifp == NULL) panic("%s%d: can not if_alloc()", name, unit); ifp->if_softc = tp; if_initname(ifp, name, unit); ifp->if_init = tapifinit; ifp->if_start = tapifstart; ifp->if_ioctl = tapifioctl; ifp->if_mtu = ETHERMTU; ifp->if_flags = (IFF_BROADCAST|IFF_SIMPLEX|IFF_MULTICAST); IFQ_SET_MAXLEN(&ifp->if_snd, ifqmaxlen); ifp->if_capabilities |= IFCAP_LINKSTATE; ifp->if_capenable |= IFCAP_LINKSTATE; dev->si_drv1 = tp; tp->tap_dev = dev; ether_ifattach(ifp, eaddr); mtx_lock(&tp->tap_mtx); tp->tap_flags |= TAP_INITED; mtx_unlock(&tp->tap_mtx); knlist_init_mtx(&tp->tap_rsel.si_note, &tp->tap_mtx); TAPDEBUG("interface %s is created. minor = %#x\n", ifp->if_xname, dev2unit(dev)); } /* tapcreate */ /* * tapopen * * to open tunnel. must be superuser */ static int tapopen(struct cdev *dev, int flag, int mode, struct thread *td) { struct tap_softc *tp = NULL; struct ifnet *ifp = NULL; int error; if (tapuopen == 0) { error = priv_check(td, PRIV_NET_TAP); if (error) return (error); } if ((dev2unit(dev) & CLONE_UNITMASK) > TAPMAXUNIT) return (ENXIO); tp = dev->si_drv1; mtx_lock(&tp->tap_mtx); if (tp->tap_flags & TAP_OPEN) { mtx_unlock(&tp->tap_mtx); return (EBUSY); } bcopy(IF_LLADDR(tp->tap_ifp), tp->ether_addr, sizeof(tp->ether_addr)); tp->tap_pid = td->td_proc->p_pid; tp->tap_flags |= TAP_OPEN; ifp = tp->tap_ifp; ifp->if_drv_flags |= IFF_DRV_RUNNING; ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; if (tapuponopen) ifp->if_flags |= IFF_UP; if_link_state_change(ifp, LINK_STATE_UP); mtx_unlock(&tp->tap_mtx); TAPDEBUG("%s is open. 
minor = %#x\n", ifp->if_xname, dev2unit(dev)); return (0); } /* tapopen */ /* * tapclose * * close the device - mark i/f down & delete routing info */ static int tapclose(struct cdev *dev, int foo, int bar, struct thread *td) { struct ifaddr *ifa; struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; /* junk all pending output */ mtx_lock(&tp->tap_mtx); CURVNET_SET(ifp->if_vnet); IF_DRAIN(&ifp->if_snd); /* * Do not bring the interface down, and do not anything with * interface, if we are in VMnet mode. Just close the device. */ if (((tp->tap_flags & TAP_VMNET) == 0) && (ifp->if_flags & (IFF_UP | IFF_LINK0)) == IFF_UP) { mtx_unlock(&tp->tap_mtx); if_down(ifp); mtx_lock(&tp->tap_mtx); if (ifp->if_drv_flags & IFF_DRV_RUNNING) { ifp->if_drv_flags &= ~IFF_DRV_RUNNING; mtx_unlock(&tp->tap_mtx); CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { rtinit(ifa, (int)RTM_DELETE, 0); } if_purgeaddrs(ifp); mtx_lock(&tp->tap_mtx); } } if_link_state_change(ifp, LINK_STATE_DOWN); CURVNET_RESTORE(); funsetown(&tp->tap_sigio); selwakeuppri(&tp->tap_rsel, PZERO+1); KNOTE_LOCKED(&tp->tap_rsel.si_note, 0); tp->tap_flags &= ~TAP_OPEN; tp->tap_pid = 0; mtx_unlock(&tp->tap_mtx); TAPDEBUG("%s is closed. minor = %#x\n", ifp->if_xname, dev2unit(dev)); return (0); } /* tapclose */ /* * tapifinit * * network interface initialization function */ static void tapifinit(void *xtp) { struct tap_softc *tp = (struct tap_softc *)xtp; struct ifnet *ifp = tp->tap_ifp; TAPDEBUG("initializing %s\n", ifp->if_xname); mtx_lock(&tp->tap_mtx); ifp->if_drv_flags |= IFF_DRV_RUNNING; ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; mtx_unlock(&tp->tap_mtx); /* attempt to start output */ tapifstart(ifp); } /* tapifinit */ /* * tapifioctl * * Process an ioctl request on network interface */ static int tapifioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { - struct tap_softc *tp = ifp->if_softc; + struct tap_softc *tp; struct ifreq *ifr = (struct ifreq *)data; struct ifstat *ifs = NULL; struct ifmediareq *ifmr = NULL; int dummy, error = 0; + sx_xlock(&tap_ioctl_sx); + tp = ifp->if_softc; + if (tp == NULL) { + error = ENXIO; + goto bad; + } switch (cmd) { case SIOCSIFFLAGS: /* XXX -- just like vmnet does */ case SIOCADDMULTI: case SIOCDELMULTI: break; case SIOCGIFMEDIA: ifmr = (struct ifmediareq *)data; dummy = ifmr->ifm_count; ifmr->ifm_count = 1; ifmr->ifm_status = IFM_AVALID; ifmr->ifm_active = IFM_ETHER; if (tp->tap_flags & TAP_OPEN) ifmr->ifm_status |= IFM_ACTIVE; ifmr->ifm_current = ifmr->ifm_active; if (dummy >= 1) { int media = IFM_ETHER; error = copyout(&media, ifmr->ifm_ulist, sizeof(int)); } break; case SIOCSIFMTU: ifp->if_mtu = ifr->ifr_mtu; break; case SIOCGIFSTATUS: ifs = (struct ifstat *)data; mtx_lock(&tp->tap_mtx); if (tp->tap_pid != 0) snprintf(ifs->ascii, sizeof(ifs->ascii), "\tOpened by PID %d\n", tp->tap_pid); else ifs->ascii[0] = '\0'; mtx_unlock(&tp->tap_mtx); break; default: error = ether_ioctl(ifp, cmd, data); break; } +bad: + sx_xunlock(&tap_ioctl_sx); return (error); } /* tapifioctl */ /* * tapifstart * * queue packets from higher level ready to put out */ static void tapifstart(struct ifnet *ifp) { struct tap_softc *tp = ifp->if_softc; TAPDEBUG("%s starting\n", ifp->if_xname); /* * do not junk pending output if we are in VMnet mode. * XXX: can this do any harm because of queue overflow? */ mtx_lock(&tp->tap_mtx); if (((tp->tap_flags & TAP_VMNET) == 0) && ((tp->tap_flags & TAP_READY) != TAP_READY)) { struct mbuf *m; /* Unlocked read. 
*/ TAPDEBUG("%s not ready, tap_flags = 0x%x\n", ifp->if_xname, tp->tap_flags); for (;;) { IF_DEQUEUE(&ifp->if_snd, m); if (m != NULL) { m_freem(m); if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); } else break; } mtx_unlock(&tp->tap_mtx); return; } ifp->if_drv_flags |= IFF_DRV_OACTIVE; if (!IFQ_IS_EMPTY(&ifp->if_snd)) { if (tp->tap_flags & TAP_RWAIT) { tp->tap_flags &= ~TAP_RWAIT; wakeup(tp); } if ((tp->tap_flags & TAP_ASYNC) && (tp->tap_sigio != NULL)) { mtx_unlock(&tp->tap_mtx); pgsigio(&tp->tap_sigio, SIGIO, 0); mtx_lock(&tp->tap_mtx); } selwakeuppri(&tp->tap_rsel, PZERO+1); KNOTE_LOCKED(&tp->tap_rsel.si_note, 0); if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); /* obytes are counted in ether_output */ } ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; mtx_unlock(&tp->tap_mtx); } /* tapifstart */ /* * tapioctl * * the cdevsw interface is now pretty minimal */ static int tapioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td) { struct ifreq ifr; struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; struct tapinfo *tapp = NULL; int f; int error; #if defined(COMPAT_FREEBSD6) || defined(COMPAT_FREEBSD5) || \ defined(COMPAT_FREEBSD4) int ival; #endif switch (cmd) { case TAPSIFINFO: tapp = (struct tapinfo *)data; if (ifp->if_type != tapp->type) return (EPROTOTYPE); mtx_lock(&tp->tap_mtx); if (ifp->if_mtu != tapp->mtu) { strlcpy(ifr.ifr_name, if_name(ifp), IFNAMSIZ); ifr.ifr_mtu = tapp->mtu; CURVNET_SET(ifp->if_vnet); error = ifhwioctl(SIOCSIFMTU, ifp, (caddr_t)&ifr, td); CURVNET_RESTORE(); if (error) { mtx_unlock(&tp->tap_mtx); return (error); } } ifp->if_baudrate = tapp->baudrate; mtx_unlock(&tp->tap_mtx); break; case TAPGIFINFO: tapp = (struct tapinfo *)data; mtx_lock(&tp->tap_mtx); tapp->mtu = ifp->if_mtu; tapp->type = ifp->if_type; tapp->baudrate = ifp->if_baudrate; mtx_unlock(&tp->tap_mtx); break; case TAPSDEBUG: tapdebug = *(int *)data; break; case TAPGDEBUG: *(int *)data = tapdebug; break; case TAPGIFNAME: { struct ifreq *ifr = (struct ifreq *) data; strlcpy(ifr->ifr_name, ifp->if_xname, IFNAMSIZ); } break; case FIONBIO: break; case FIOASYNC: mtx_lock(&tp->tap_mtx); if (*(int *)data) tp->tap_flags |= TAP_ASYNC; else tp->tap_flags &= ~TAP_ASYNC; mtx_unlock(&tp->tap_mtx); break; case FIONREAD: if (!IFQ_IS_EMPTY(&ifp->if_snd)) { struct mbuf *mb; IFQ_LOCK(&ifp->if_snd); IFQ_POLL_NOLOCK(&ifp->if_snd, mb); for (*(int *)data = 0; mb != NULL; mb = mb->m_next) *(int *)data += mb->m_len; IFQ_UNLOCK(&ifp->if_snd); } else *(int *)data = 0; break; case FIOSETOWN: return (fsetown(*(int *)data, &tp->tap_sigio)); case FIOGETOWN: *(int *)data = fgetown(&tp->tap_sigio); return (0); /* this is deprecated, FIOSETOWN should be used instead */ case TIOCSPGRP: return (fsetown(-(*(int *)data), &tp->tap_sigio)); /* this is deprecated, FIOGETOWN should be used instead */ case TIOCGPGRP: *(int *)data = -fgetown(&tp->tap_sigio); return (0); /* VMware/VMnet port ioctl's */ #if defined(COMPAT_FREEBSD6) || defined(COMPAT_FREEBSD5) || \ defined(COMPAT_FREEBSD4) case _IO('V', 0): ival = IOCPARM_IVAL(data); data = (caddr_t)&ival; /* FALLTHROUGH */ #endif case VMIO_SIOCSIFFLAGS: /* VMware/VMnet SIOCSIFFLAGS */ f = *(int *)data; f &= 0x0fff; f &= ~IFF_CANTCHANGE; f |= IFF_UP; mtx_lock(&tp->tap_mtx); ifp->if_flags = f | (ifp->if_flags & IFF_CANTCHANGE); mtx_unlock(&tp->tap_mtx); break; case SIOCGIFADDR: /* get MAC address of the remote side */ mtx_lock(&tp->tap_mtx); bcopy(tp->ether_addr, data, sizeof(tp->ether_addr)); mtx_unlock(&tp->tap_mtx); break; case SIOCSIFADDR: /* set MAC address of the remote 
side */ mtx_lock(&tp->tap_mtx); bcopy(data, tp->ether_addr, sizeof(tp->ether_addr)); mtx_unlock(&tp->tap_mtx); break; default: return (ENOTTY); } return (0); } /* tapioctl */ /* * tapread * * the cdevsw read interface - reads a packet at a time, or at * least as much of a packet as can be read */ static int tapread(struct cdev *dev, struct uio *uio, int flag) { struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; struct mbuf *m = NULL; int error = 0, len; TAPDEBUG("%s reading, minor = %#x\n", ifp->if_xname, dev2unit(dev)); mtx_lock(&tp->tap_mtx); if ((tp->tap_flags & TAP_READY) != TAP_READY) { mtx_unlock(&tp->tap_mtx); /* Unlocked read. */ TAPDEBUG("%s not ready. minor = %#x, tap_flags = 0x%x\n", ifp->if_xname, dev2unit(dev), tp->tap_flags); return (EHOSTDOWN); } tp->tap_flags &= ~TAP_RWAIT; /* sleep until we get a packet */ do { IF_DEQUEUE(&ifp->if_snd, m); if (m == NULL) { if (flag & O_NONBLOCK) { mtx_unlock(&tp->tap_mtx); return (EWOULDBLOCK); } tp->tap_flags |= TAP_RWAIT; error = mtx_sleep(tp, &tp->tap_mtx, PCATCH | (PZERO + 1), "taprd", 0); if (error) { mtx_unlock(&tp->tap_mtx); return (error); } } } while (m == NULL); mtx_unlock(&tp->tap_mtx); /* feed packet to bpf */ BPF_MTAP(ifp, m); /* xfer packet to user space */ while ((m != NULL) && (uio->uio_resid > 0) && (error == 0)) { len = min(uio->uio_resid, m->m_len); if (len == 0) break; error = uiomove(mtod(m, void *), len, uio); m = m_free(m); } if (m != NULL) { TAPDEBUG("%s dropping mbuf, minor = %#x\n", ifp->if_xname, dev2unit(dev)); m_freem(m); } return (error); } /* tapread */ /* * tapwrite * * the cdevsw write interface - an atomic write is a packet - or else! */ static int tapwrite(struct cdev *dev, struct uio *uio, int flag) { struct ether_header *eh; struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; struct mbuf *m; TAPDEBUG("%s writing, minor = %#x\n", ifp->if_xname, dev2unit(dev)); if (uio->uio_resid == 0) return (0); if ((uio->uio_resid < 0) || (uio->uio_resid > TAPMRU)) { TAPDEBUG("%s invalid packet len = %zd, minor = %#x\n", ifp->if_xname, uio->uio_resid, dev2unit(dev)); return (EIO); } if ((m = m_uiotombuf(uio, M_NOWAIT, 0, ETHER_ALIGN, M_PKTHDR)) == NULL) { if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); return (ENOBUFS); } m->m_pkthdr.rcvif = ifp; /* * Only pass a unicast frame to ether_input(), if it would actually * have been received by non-virtual hardware. */ if (m->m_len < sizeof(struct ether_header)) { m_freem(m); return (0); } eh = mtod(m, struct ether_header *); if (eh && (ifp->if_flags & IFF_PROMISC) == 0 && !ETHER_IS_MULTICAST(eh->ether_dhost) && bcmp(eh->ether_dhost, IF_LLADDR(ifp), ETHER_ADDR_LEN) != 0) { m_freem(m); return (0); } /* Pass packet up to parent. */ CURVNET_SET(ifp->if_vnet); (*ifp->if_input)(ifp, m); CURVNET_RESTORE(); if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); /* ibytes are counted in parent */ return (0); } /* tapwrite */ /* * tappoll * * the poll interface, this is only useful on reads * really. the write detect always returns true, write never blocks * anyway, it either accepts the packet or drops it */ static int tappoll(struct cdev *dev, int events, struct thread *td) { struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; int revents = 0; TAPDEBUG("%s polling, minor = %#x\n", ifp->if_xname, dev2unit(dev)); if (events & (POLLIN | POLLRDNORM)) { IFQ_LOCK(&ifp->if_snd); if (!IFQ_IS_EMPTY(&ifp->if_snd)) { TAPDEBUG("%s have data in queue. 
len = %d, " \ "minor = %#x\n", ifp->if_xname, ifp->if_snd.ifq_len, dev2unit(dev)); revents |= (events & (POLLIN | POLLRDNORM)); } else { TAPDEBUG("%s waiting for data, minor = %#x\n", ifp->if_xname, dev2unit(dev)); selrecord(td, &tp->tap_rsel); } IFQ_UNLOCK(&ifp->if_snd); } if (events & (POLLOUT | POLLWRNORM)) revents |= (events & (POLLOUT | POLLWRNORM)); return (revents); } /* tappoll */ /* * tap_kqfilter * * support for kevent() system call */ static int tapkqfilter(struct cdev *dev, struct knote *kn) { struct tap_softc *tp = dev->si_drv1; struct ifnet *ifp = tp->tap_ifp; switch (kn->kn_filter) { case EVFILT_READ: TAPDEBUG("%s kqfilter: EVFILT_READ, minor = %#x\n", ifp->if_xname, dev2unit(dev)); kn->kn_fop = &tap_read_filterops; break; case EVFILT_WRITE: TAPDEBUG("%s kqfilter: EVFILT_WRITE, minor = %#x\n", ifp->if_xname, dev2unit(dev)); kn->kn_fop = &tap_write_filterops; break; default: TAPDEBUG("%s kqfilter: invalid filter, minor = %#x\n", ifp->if_xname, dev2unit(dev)); return (EINVAL); /* NOT REACHED */ } kn->kn_hook = tp; knlist_add(&tp->tap_rsel.si_note, kn, 0); return (0); } /* tapkqfilter */ /* * tap_kqread * * Return true if there is data in the interface queue */ static int tapkqread(struct knote *kn, long hint) { int ret; struct tap_softc *tp = kn->kn_hook; struct cdev *dev = tp->tap_dev; struct ifnet *ifp = tp->tap_ifp; if ((kn->kn_data = ifp->if_snd.ifq_len) > 0) { TAPDEBUG("%s have data in queue. len = %d, minor = %#x\n", ifp->if_xname, ifp->if_snd.ifq_len, dev2unit(dev)); ret = 1; } else { TAPDEBUG("%s waiting for data, minor = %#x\n", ifp->if_xname, dev2unit(dev)); ret = 0; } return (ret); } /* tapkqread */ /* * tap_kqwrite * * Always can write. Return the MTU in kn->data */ static int tapkqwrite(struct knote *kn, long hint) { struct tap_softc *tp = kn->kn_hook; struct ifnet *ifp = tp->tap_ifp; kn->kn_data = ifp->if_mtu; return (1); } /* tapkqwrite */ static void tapkqdetach(struct knote *kn) { struct tap_softc *tp = kn->kn_hook; knlist_remove(&tp->tap_rsel.si_note, kn, 0); } /* tapkqdetach */ Index: user/ngie/bug-237403/sys/net/if_tun.c =================================================================== --- user/ngie/bug-237403/sys/net/if_tun.c (revision 346925) +++ user/ngie/bug-237403/sys/net/if_tun.c (revision 346926) @@ -1,1096 +1,1112 @@ /* $NetBSD: if_tun.c,v 1.14 1994/06/29 06:36:25 cgd Exp $ */ /*- * Copyright (c) 1988, Julian Onions * Nottingham University 1987. * * This source may be freely distributed, however I would be interested * in any changes that are made. * * This driver takes packets off the IP i/f and hands them up to a * user process to have its wicked way with. This driver has it's * roots in a similar driver written by Phil Cockcroft (formerly) at * UCL. This driver is based much more on read/write/poll mode of * operation though. * * $FreeBSD$ */ #include "opt_inet.h" #include "opt_inet6.h" #include +#include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET #include #endif #include #include #include #include #include /* * tun_list is protected by global tunmtx. Other mutable fields are * protected by tun->tun_mtx, or by their owning subsystem. tun_dev is * static for the duration of a tunnel interface. 
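In addition to tunmtx and tun_mtx, this revision introduces tun_ioctl_sx, an sx lock held across tunifioctl() and across the point in tun_destroy() where if_softc is cleared, so an interface ioctl already in flight cannot dereference a softc that is being torn down; once if_softc is NULL the ioctl fails with ENXIO. A new TUN_DYING flag marks a softc whose destruction has begun, and tunopen() refuses such a device with EBUSY.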
*/ struct tun_softc { TAILQ_ENTRY(tun_softc) tun_list; struct cdev *tun_dev; u_short tun_flags; /* misc flags */ #define TUN_OPEN 0x0001 #define TUN_INITED 0x0002 #define TUN_RCOLL 0x0004 #define TUN_IASET 0x0008 #define TUN_DSTADDR 0x0010 #define TUN_LMODE 0x0020 #define TUN_RWAIT 0x0040 #define TUN_ASYNC 0x0080 #define TUN_IFHEAD 0x0100 +#define TUN_DYING 0x0200 #define TUN_READY (TUN_OPEN | TUN_INITED) - /* - * XXXRW: tun_pid is used to exclusively lock /dev/tun. Is this - * actually needed? Can we just return EBUSY if already open? - * Problem is that this involved inherent races when a tun device - * is handed off from one process to another, as opposed to just - * being slightly stale informationally. - */ pid_t tun_pid; /* owning pid */ struct ifnet *tun_ifp; /* the interface */ struct sigio *tun_sigio; /* information for async I/O */ struct selinfo tun_rsel; /* read select */ struct mtx tun_mtx; /* protect mutable softc fields */ struct cv tun_cv; /* protect against ref'd dev destroy */ }; #define TUN2IFP(sc) ((sc)->tun_ifp) #define TUNDEBUG if (tundebug) if_printf /* * All mutable global variables in if_tun are locked using tunmtx, with * the exception of tundebug, which is used unlocked, and tunclones, * which is static after setup. */ static struct mtx tunmtx; static eventhandler_tag tag; static const char tunname[] = "tun"; static MALLOC_DEFINE(M_TUN, tunname, "Tunnel Interface"); static int tundebug = 0; static int tundclone = 1; static struct clonedevs *tunclones; static TAILQ_HEAD(,tun_softc) tunhead = TAILQ_HEAD_INITIALIZER(tunhead); SYSCTL_INT(_debug, OID_AUTO, if_tun_debug, CTLFLAG_RW, &tundebug, 0, ""); +static struct sx tun_ioctl_sx; +SX_SYSINIT(tun_ioctl_sx, &tun_ioctl_sx, "tun_ioctl"); + SYSCTL_DECL(_net_link); static SYSCTL_NODE(_net_link, OID_AUTO, tun, CTLFLAG_RW, 0, "IP tunnel software network interface."); SYSCTL_INT(_net_link_tun, OID_AUTO, devfs_cloning, CTLFLAG_RWTUN, &tundclone, 0, "Enable legacy devfs interface creation."); static void tunclone(void *arg, struct ucred *cred, char *name, int namelen, struct cdev **dev); static void tuncreate(const char *name, struct cdev *dev); static int tunifioctl(struct ifnet *, u_long, caddr_t); static void tuninit(struct ifnet *); static int tunmodevent(module_t, int, void *); static int tunoutput(struct ifnet *, struct mbuf *, const struct sockaddr *, struct route *ro); static void tunstart(struct ifnet *); static int tun_clone_match(struct if_clone *ifc, const char *name); static int tun_clone_create(struct if_clone *, char *, size_t, caddr_t); static int tun_clone_destroy(struct if_clone *, struct ifnet *); static struct unrhdr *tun_unrhdr; VNET_DEFINE_STATIC(struct if_clone *, tun_cloner); #define V_tun_cloner VNET(tun_cloner) static d_open_t tunopen; static d_close_t tunclose; static d_read_t tunread; static d_write_t tunwrite; static d_ioctl_t tunioctl; static d_poll_t tunpoll; static d_kqfilter_t tunkqfilter; static int tunkqread(struct knote *, long); static int tunkqwrite(struct knote *, long); static void tunkqdetach(struct knote *); static struct filterops tun_read_filterops = { .f_isfd = 1, .f_attach = NULL, .f_detach = tunkqdetach, .f_event = tunkqread, }; static struct filterops tun_write_filterops = { .f_isfd = 1, .f_attach = NULL, .f_detach = tunkqdetach, .f_event = tunkqwrite, }; static struct cdevsw tun_cdevsw = { .d_version = D_VERSION, .d_flags = D_NEEDMINOR, .d_open = tunopen, .d_close = tunclose, .d_read = tunread, .d_write = tunwrite, .d_ioctl = tunioctl, .d_poll = tunpoll, .d_kqfilter = 
tunkqfilter, .d_name = tunname, }; static int tun_clone_match(struct if_clone *ifc, const char *name) { if (strncmp(tunname, name, 3) == 0 && (name[3] == '\0' || isdigit(name[3]))) return (1); return (0); } static int tun_clone_create(struct if_clone *ifc, char *name, size_t len, caddr_t params) { struct cdev *dev; int err, unit, i; err = ifc_name2unit(name, &unit); if (err != 0) return (err); if (unit != -1) { /* If this unit number is still available that/s okay. */ if (alloc_unr_specific(tun_unrhdr, unit) == -1) return (EEXIST); } else { unit = alloc_unr(tun_unrhdr); } snprintf(name, IFNAMSIZ, "%s%d", tunname, unit); /* find any existing device, or allocate new unit number */ i = clone_create(&tunclones, &tun_cdevsw, &unit, &dev, 0); if (i) { /* No preexisting struct cdev *, create one */ dev = make_dev(&tun_cdevsw, unit, UID_UUCP, GID_DIALER, 0600, "%s%d", tunname, unit); } tuncreate(tunname, dev); return (0); } static void tunclone(void *arg, struct ucred *cred, char *name, int namelen, struct cdev **dev) { char devname[SPECNAMELEN + 1]; int u, i, append_unit; if (*dev != NULL) return; /* * If tun cloning is enabled, only the superuser can create an * interface. */ if (!tundclone || priv_check_cred(cred, PRIV_NET_IFCREATE) != 0) return; if (strcmp(name, tunname) == 0) { u = -1; } else if (dev_stdclone(name, NULL, tunname, &u) != 1) return; /* Don't recognise the name */ if (u != -1 && u > IF_MAXUNIT) return; /* Unit number too high */ if (u == -1) append_unit = 1; else append_unit = 0; CURVNET_SET(CRED_TO_VNET(cred)); /* find any existing device, or allocate new unit number */ i = clone_create(&tunclones, &tun_cdevsw, &u, dev, 0); if (i) { if (append_unit) { namelen = snprintf(devname, sizeof(devname), "%s%d", name, u); name = devname; } /* No preexisting struct cdev *, create one */ *dev = make_dev_credf(MAKEDEV_REF, &tun_cdevsw, u, cred, UID_UUCP, GID_DIALER, 0600, "%s", name); } if_clone_create(name, namelen, NULL); CURVNET_RESTORE(); } static void tun_destroy(struct tun_softc *tp) { struct cdev *dev; mtx_lock(&tp->tun_mtx); + tp->tun_flags |= TUN_DYING; if ((tp->tun_flags & TUN_OPEN) != 0) cv_wait_unlock(&tp->tun_cv, &tp->tun_mtx); else mtx_unlock(&tp->tun_mtx); CURVNET_SET(TUN2IFP(tp)->if_vnet); + sx_xlock(&tun_ioctl_sx); + TUN2IFP(tp)->if_softc = NULL; + sx_xunlock(&tun_ioctl_sx); + dev = tp->tun_dev; bpfdetach(TUN2IFP(tp)); if_detach(TUN2IFP(tp)); free_unr(tun_unrhdr, TUN2IFP(tp)->if_dunit); if_free(TUN2IFP(tp)); destroy_dev(dev); seldrain(&tp->tun_rsel); knlist_clear(&tp->tun_rsel.si_note, 0); knlist_destroy(&tp->tun_rsel.si_note); mtx_destroy(&tp->tun_mtx); cv_destroy(&tp->tun_cv); free(tp, M_TUN); CURVNET_RESTORE(); } static int tun_clone_destroy(struct if_clone *ifc, struct ifnet *ifp) { struct tun_softc *tp = ifp->if_softc; mtx_lock(&tunmtx); TAILQ_REMOVE(&tunhead, tp, tun_list); mtx_unlock(&tunmtx); tun_destroy(tp); return (0); } static void vnet_tun_init(const void *unused __unused) { V_tun_cloner = if_clone_advanced(tunname, 0, tun_clone_match, tun_clone_create, tun_clone_destroy); } VNET_SYSINIT(vnet_tun_init, SI_SUB_PROTO_IF, SI_ORDER_ANY, vnet_tun_init, NULL); static void vnet_tun_uninit(const void *unused __unused) { if_clone_detach(V_tun_cloner); } VNET_SYSUNINIT(vnet_tun_uninit, SI_SUB_PROTO_IF, SI_ORDER_ANY, vnet_tun_uninit, NULL); static void tun_uninit(const void *unused __unused) { struct tun_softc *tp; EVENTHANDLER_DEREGISTER(dev_clone, tag); drain_dev_clone_events(); mtx_lock(&tunmtx); while ((tp = TAILQ_FIRST(&tunhead)) != NULL) { TAILQ_REMOVE(&tunhead, tp, 
tun_list); mtx_unlock(&tunmtx); tun_destroy(tp); mtx_lock(&tunmtx); } mtx_unlock(&tunmtx); delete_unrhdr(tun_unrhdr); clone_cleanup(&tunclones); mtx_destroy(&tunmtx); } SYSUNINIT(tun_uninit, SI_SUB_PROTO_IF, SI_ORDER_ANY, tun_uninit, NULL); static int tunmodevent(module_t mod, int type, void *data) { switch (type) { case MOD_LOAD: mtx_init(&tunmtx, "tunmtx", NULL, MTX_DEF); clone_setup(&tunclones); tun_unrhdr = new_unrhdr(0, IF_MAXUNIT, &tunmtx); tag = EVENTHANDLER_REGISTER(dev_clone, tunclone, 0, 1000); if (tag == NULL) return (ENOMEM); break; case MOD_UNLOAD: /* See tun_uninit, so it's done after the vnet_sysuninit() */ break; default: return EOPNOTSUPP; } return 0; } static moduledata_t tun_mod = { "if_tun", tunmodevent, 0 }; DECLARE_MODULE(if_tun, tun_mod, SI_SUB_PSEUDO, SI_ORDER_ANY); MODULE_VERSION(if_tun, 1); static void tunstart(struct ifnet *ifp) { struct tun_softc *tp = ifp->if_softc; struct mbuf *m; TUNDEBUG(ifp,"%s starting\n", ifp->if_xname); if (ALTQ_IS_ENABLED(&ifp->if_snd)) { IFQ_LOCK(&ifp->if_snd); IFQ_POLL_NOLOCK(&ifp->if_snd, m); if (m == NULL) { IFQ_UNLOCK(&ifp->if_snd); return; } IFQ_UNLOCK(&ifp->if_snd); } mtx_lock(&tp->tun_mtx); if (tp->tun_flags & TUN_RWAIT) { tp->tun_flags &= ~TUN_RWAIT; wakeup(tp); } selwakeuppri(&tp->tun_rsel, PZERO + 1); KNOTE_LOCKED(&tp->tun_rsel.si_note, 0); if (tp->tun_flags & TUN_ASYNC && tp->tun_sigio) { mtx_unlock(&tp->tun_mtx); pgsigio(&tp->tun_sigio, SIGIO, 0); } else mtx_unlock(&tp->tun_mtx); } /* XXX: should return an error code so it can fail. */ static void tuncreate(const char *name, struct cdev *dev) { struct tun_softc *sc; struct ifnet *ifp; sc = malloc(sizeof(*sc), M_TUN, M_WAITOK | M_ZERO); mtx_init(&sc->tun_mtx, "tun_mtx", NULL, MTX_DEF); cv_init(&sc->tun_cv, "tun_condvar"); sc->tun_flags = TUN_INITED; sc->tun_dev = dev; mtx_lock(&tunmtx); TAILQ_INSERT_TAIL(&tunhead, sc, tun_list); mtx_unlock(&tunmtx); ifp = sc->tun_ifp = if_alloc(IFT_PPP); if (ifp == NULL) panic("%s%d: failed to if_alloc() interface.\n", name, dev2unit(dev)); if_initname(ifp, name, dev2unit(dev)); ifp->if_mtu = TUNMTU; ifp->if_ioctl = tunifioctl; ifp->if_output = tunoutput; ifp->if_start = tunstart; ifp->if_flags = IFF_POINTOPOINT | IFF_MULTICAST; ifp->if_softc = sc; IFQ_SET_MAXLEN(&ifp->if_snd, ifqmaxlen); ifp->if_snd.ifq_drv_maxlen = 0; IFQ_SET_READY(&ifp->if_snd); knlist_init_mtx(&sc->tun_rsel.si_note, &sc->tun_mtx); ifp->if_capabilities |= IFCAP_LINKSTATE; ifp->if_capenable |= IFCAP_LINKSTATE; if_attach(ifp); bpfattach(ifp, DLT_NULL, sizeof(u_int32_t)); dev->si_drv1 = sc; TUNDEBUG(ifp, "interface %s is created, minor = %#x\n", ifp->if_xname, dev2unit(dev)); } static int tunopen(struct cdev *dev, int flag, int mode, struct thread *td) { struct ifnet *ifp; struct tun_softc *tp; /* * XXXRW: Non-atomic test and set of dev->si_drv1 requires * synchronization. */ tp = dev->si_drv1; if (!tp) { tuncreate(tunname, dev); tp = dev->si_drv1; } - /* - * XXXRW: This use of tun_pid is subject to error due to the - * fact that a reference to the tunnel can live beyond the - * death of the process that created it. Can we replace this - * with a simple busy flag? 
- */ mtx_lock(&tp->tun_mtx); - if (tp->tun_pid != 0 && tp->tun_pid != td->td_proc->p_pid) { + if ((tp->tun_flags & (TUN_OPEN | TUN_DYING)) != 0) { mtx_unlock(&tp->tun_mtx); return (EBUSY); } - tp->tun_pid = td->td_proc->p_pid; + tp->tun_pid = td->td_proc->p_pid; tp->tun_flags |= TUN_OPEN; ifp = TUN2IFP(tp); if_link_state_change(ifp, LINK_STATE_UP); TUNDEBUG(ifp, "open\n"); mtx_unlock(&tp->tun_mtx); return (0); } /* * tunclose - close the device - mark i/f down & delete * routing info */ static int tunclose(struct cdev *dev, int foo, int bar, struct thread *td) { struct tun_softc *tp; struct ifnet *ifp; tp = dev->si_drv1; ifp = TUN2IFP(tp); mtx_lock(&tp->tun_mtx); + /* + * Simply close the device if this isn't the controlling process. This + * may happen if, for instance, the tunnel has been handed off to + * another process. The original controller should be able to close it + * without putting us into an inconsistent state. + */ + if (td->td_proc->p_pid != tp->tun_pid) { + mtx_unlock(&tp->tun_mtx); + return (0); + } /* * junk all pending output */ CURVNET_SET(ifp->if_vnet); IFQ_PURGE(&ifp->if_snd); if (ifp->if_flags & IFF_UP) { mtx_unlock(&tp->tun_mtx); if_down(ifp); mtx_lock(&tp->tun_mtx); } /* Delete all addresses and routes which reference this interface. */ if (ifp->if_drv_flags & IFF_DRV_RUNNING) { struct ifaddr *ifa; ifp->if_drv_flags &= ~IFF_DRV_RUNNING; mtx_unlock(&tp->tun_mtx); CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { /* deal w/IPv4 PtP destination; unlocked read */ if (ifa->ifa_addr->sa_family == AF_INET) { rtinit(ifa, (int)RTM_DELETE, tp->tun_flags & TUN_DSTADDR ? RTF_HOST : 0); } else { rtinit(ifa, (int)RTM_DELETE, 0); } } if_purgeaddrs(ifp); mtx_lock(&tp->tun_mtx); } if_link_state_change(ifp, LINK_STATE_DOWN); CURVNET_RESTORE(); funsetown(&tp->tun_sigio); selwakeuppri(&tp->tun_rsel, PZERO + 1); KNOTE_LOCKED(&tp->tun_rsel.si_note, 0); TUNDEBUG (ifp, "closed\n"); tp->tun_flags &= ~TUN_OPEN; tp->tun_pid = 0; cv_broadcast(&tp->tun_cv); mtx_unlock(&tp->tun_mtx); return (0); } static void tuninit(struct ifnet *ifp) { struct tun_softc *tp = ifp->if_softc; #ifdef INET struct ifaddr *ifa; #endif TUNDEBUG(ifp, "tuninit\n"); mtx_lock(&tp->tun_mtx); ifp->if_flags |= IFF_UP; ifp->if_drv_flags |= IFF_DRV_RUNNING; getmicrotime(&ifp->if_lastchange); #ifdef INET if_addr_rlock(ifp); CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family == AF_INET) { struct sockaddr_in *si; si = (struct sockaddr_in *)ifa->ifa_addr; if (si->sin_addr.s_addr) tp->tun_flags |= TUN_IASET; si = (struct sockaddr_in *)ifa->ifa_dstaddr; if (si && si->sin_addr.s_addr) tp->tun_flags |= TUN_DSTADDR; } } if_addr_runlock(ifp); #endif mtx_unlock(&tp->tun_mtx); } /* * Process an ioctl request. 
*/ static int tunifioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { struct ifreq *ifr = (struct ifreq *)data; - struct tun_softc *tp = ifp->if_softc; + struct tun_softc *tp; struct ifstat *ifs; int error = 0; + sx_xlock(&tun_ioctl_sx); + tp = ifp->if_softc; + if (tp == NULL) { + error = ENXIO; + goto bad; + } switch(cmd) { case SIOCGIFSTATUS: ifs = (struct ifstat *)data; mtx_lock(&tp->tun_mtx); if (tp->tun_pid) snprintf(ifs->ascii, sizeof(ifs->ascii), "\tOpened by PID %d\n", tp->tun_pid); else ifs->ascii[0] = '\0'; mtx_unlock(&tp->tun_mtx); break; case SIOCSIFADDR: tuninit(ifp); TUNDEBUG(ifp, "address set\n"); break; case SIOCSIFMTU: ifp->if_mtu = ifr->ifr_mtu; TUNDEBUG(ifp, "mtu set\n"); break; case SIOCSIFFLAGS: case SIOCADDMULTI: case SIOCDELMULTI: break; default: error = EINVAL; } +bad: + sx_xunlock(&tun_ioctl_sx); return (error); } /* * tunoutput - queue packets from higher level ready to put out. */ static int tunoutput(struct ifnet *ifp, struct mbuf *m0, const struct sockaddr *dst, struct route *ro) { struct tun_softc *tp = ifp->if_softc; u_short cached_tun_flags; int error; u_int32_t af; TUNDEBUG (ifp, "tunoutput\n"); #ifdef MAC error = mac_ifnet_check_transmit(ifp, m0); if (error) { m_freem(m0); return (error); } #endif /* Could be unlocked read? */ mtx_lock(&tp->tun_mtx); cached_tun_flags = tp->tun_flags; mtx_unlock(&tp->tun_mtx); if ((cached_tun_flags & TUN_READY) != TUN_READY) { TUNDEBUG (ifp, "not ready 0%o\n", tp->tun_flags); m_freem (m0); return (EHOSTDOWN); } if ((ifp->if_flags & IFF_UP) != IFF_UP) { m_freem (m0); return (EHOSTDOWN); } /* BPF writes need to be handled specially. */ if (dst->sa_family == AF_UNSPEC) bcopy(dst->sa_data, &af, sizeof(af)); else af = dst->sa_family; if (bpf_peers_present(ifp->if_bpf)) bpf_mtap2(ifp->if_bpf, &af, sizeof(af), m0); /* prepend sockaddr? this may abort if the mbuf allocation fails */ if (cached_tun_flags & TUN_LMODE) { /* allocate space for sockaddr */ M_PREPEND(m0, dst->sa_len, M_NOWAIT); /* if allocation failed drop packet */ if (m0 == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (ENOBUFS); } else { bcopy(dst, m0->m_data, dst->sa_len); } } if (cached_tun_flags & TUN_IFHEAD) { /* Prepend the address family */ M_PREPEND(m0, 4, M_NOWAIT); /* if allocation failed drop packet */ if (m0 == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (ENOBUFS); } else *(u_int32_t *)m0->m_data = htonl(af); } else { #ifdef INET if (af != AF_INET) #endif { m_freem(m0); return (EAFNOSUPPORT); } } error = (ifp->if_transmit)(ifp, m0); if (error) return (ENOBUFS); if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); return (0); } /* * the cdevsw interface is now pretty minimal. 
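These are the ioctls issued against the character device (dev->si_drv1 of /dev/tunN), as opposed to tunifioctl() above, which handles requests arriving through the network stack on the ifnet. TUNSIFINFO and TUNGIFINFO set and get the MTU, interface type and baudrate; TUNSLMODE and TUNSIFHEAD choose whether a sockaddr or a 4-byte address-family word is prepended to packets handed to the reader; TUNSIFPID records the caller as the controlling process in tun_pid.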
*/ static int tunioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td) { struct ifreq ifr; struct tun_softc *tp = dev->si_drv1; struct tuninfo *tunp; int error; switch (cmd) { case TUNSIFINFO: tunp = (struct tuninfo *)data; if (TUN2IFP(tp)->if_type != tunp->type) return (EPROTOTYPE); mtx_lock(&tp->tun_mtx); if (TUN2IFP(tp)->if_mtu != tunp->mtu) { strlcpy(ifr.ifr_name, if_name(TUN2IFP(tp)), IFNAMSIZ); ifr.ifr_mtu = tunp->mtu; CURVNET_SET(TUN2IFP(tp)->if_vnet); error = ifhwioctl(SIOCSIFMTU, TUN2IFP(tp), (caddr_t)&ifr, td); CURVNET_RESTORE(); if (error) { mtx_unlock(&tp->tun_mtx); return (error); } } TUN2IFP(tp)->if_baudrate = tunp->baudrate; mtx_unlock(&tp->tun_mtx); break; case TUNGIFINFO: tunp = (struct tuninfo *)data; mtx_lock(&tp->tun_mtx); tunp->mtu = TUN2IFP(tp)->if_mtu; tunp->type = TUN2IFP(tp)->if_type; tunp->baudrate = TUN2IFP(tp)->if_baudrate; mtx_unlock(&tp->tun_mtx); break; case TUNSDEBUG: tundebug = *(int *)data; break; case TUNGDEBUG: *(int *)data = tundebug; break; case TUNSLMODE: mtx_lock(&tp->tun_mtx); if (*(int *)data) { tp->tun_flags |= TUN_LMODE; tp->tun_flags &= ~TUN_IFHEAD; } else tp->tun_flags &= ~TUN_LMODE; mtx_unlock(&tp->tun_mtx); break; case TUNSIFHEAD: mtx_lock(&tp->tun_mtx); if (*(int *)data) { tp->tun_flags |= TUN_IFHEAD; tp->tun_flags &= ~TUN_LMODE; } else tp->tun_flags &= ~TUN_IFHEAD; mtx_unlock(&tp->tun_mtx); break; case TUNGIFHEAD: mtx_lock(&tp->tun_mtx); *(int *)data = (tp->tun_flags & TUN_IFHEAD) ? 1 : 0; mtx_unlock(&tp->tun_mtx); break; case TUNSIFMODE: /* deny this if UP */ if (TUN2IFP(tp)->if_flags & IFF_UP) return(EBUSY); switch (*(int *)data & ~IFF_MULTICAST) { case IFF_POINTOPOINT: case IFF_BROADCAST: mtx_lock(&tp->tun_mtx); TUN2IFP(tp)->if_flags &= ~(IFF_BROADCAST|IFF_POINTOPOINT|IFF_MULTICAST); TUN2IFP(tp)->if_flags |= *(int *)data; mtx_unlock(&tp->tun_mtx); break; default: return(EINVAL); } break; case TUNSIFPID: mtx_lock(&tp->tun_mtx); tp->tun_pid = curthread->td_proc->p_pid; mtx_unlock(&tp->tun_mtx); break; case FIONBIO: break; case FIOASYNC: mtx_lock(&tp->tun_mtx); if (*(int *)data) tp->tun_flags |= TUN_ASYNC; else tp->tun_flags &= ~TUN_ASYNC; mtx_unlock(&tp->tun_mtx); break; case FIONREAD: if (!IFQ_IS_EMPTY(&TUN2IFP(tp)->if_snd)) { struct mbuf *mb; IFQ_LOCK(&TUN2IFP(tp)->if_snd); IFQ_POLL_NOLOCK(&TUN2IFP(tp)->if_snd, mb); for (*(int *)data = 0; mb != NULL; mb = mb->m_next) *(int *)data += mb->m_len; IFQ_UNLOCK(&TUN2IFP(tp)->if_snd); } else *(int *)data = 0; break; case FIOSETOWN: return (fsetown(*(int *)data, &tp->tun_sigio)); case FIOGETOWN: *(int *)data = fgetown(&tp->tun_sigio); return (0); /* This is deprecated, FIOSETOWN should be used instead. */ case TIOCSPGRP: return (fsetown(-(*(int *)data), &tp->tun_sigio)); /* This is deprecated, FIOGETOWN should be used instead. */ case TIOCGPGRP: *(int *)data = -fgetown(&tp->tun_sigio); return (0); default: return (ENOTTY); } return (0); } /* * The cdevsw read interface - reads a packet at a time, or at * least as much of a packet as can be read. 
*/ static int tunread(struct cdev *dev, struct uio *uio, int flag) { struct tun_softc *tp = dev->si_drv1; struct ifnet *ifp = TUN2IFP(tp); struct mbuf *m; int error=0, len; TUNDEBUG (ifp, "read\n"); mtx_lock(&tp->tun_mtx); if ((tp->tun_flags & TUN_READY) != TUN_READY) { mtx_unlock(&tp->tun_mtx); TUNDEBUG (ifp, "not ready 0%o\n", tp->tun_flags); return (EHOSTDOWN); } tp->tun_flags &= ~TUN_RWAIT; do { IFQ_DEQUEUE(&ifp->if_snd, m); if (m == NULL) { if (flag & O_NONBLOCK) { mtx_unlock(&tp->tun_mtx); return (EWOULDBLOCK); } tp->tun_flags |= TUN_RWAIT; error = mtx_sleep(tp, &tp->tun_mtx, PCATCH | (PZERO + 1), "tunread", 0); if (error != 0) { mtx_unlock(&tp->tun_mtx); return (error); } } } while (m == NULL); mtx_unlock(&tp->tun_mtx); while (m && uio->uio_resid > 0 && error == 0) { len = min(uio->uio_resid, m->m_len); if (len != 0) error = uiomove(mtod(m, void *), len, uio); m = m_free(m); } if (m) { TUNDEBUG(ifp, "Dropping mbuf\n"); m_freem(m); } return (error); } /* * the cdevsw write interface - an atomic write is a packet - or else! */ static int tunwrite(struct cdev *dev, struct uio *uio, int flag) { struct tun_softc *tp = dev->si_drv1; struct ifnet *ifp = TUN2IFP(tp); struct mbuf *m; uint32_t family, mru; int isr; TUNDEBUG(ifp, "tunwrite\n"); if ((ifp->if_flags & IFF_UP) != IFF_UP) /* ignore silently */ return (0); if (uio->uio_resid == 0) return (0); mru = TUNMRU; if (tp->tun_flags & TUN_IFHEAD) mru += sizeof(family); if (uio->uio_resid < 0 || uio->uio_resid > mru) { TUNDEBUG(ifp, "len=%zd!\n", uio->uio_resid); return (EIO); } if ((m = m_uiotombuf(uio, M_NOWAIT, 0, 0, M_PKTHDR)) == NULL) { if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); return (ENOBUFS); } m->m_pkthdr.rcvif = ifp; #ifdef MAC mac_ifnet_create_mbuf(ifp, m); #endif /* Could be unlocked read? */ mtx_lock(&tp->tun_mtx); if (tp->tun_flags & TUN_IFHEAD) { mtx_unlock(&tp->tun_mtx); if (m->m_len < sizeof(family) && (m = m_pullup(m, sizeof(family))) == NULL) return (ENOBUFS); family = ntohl(*mtod(m, u_int32_t *)); m_adj(m, sizeof(family)); } else { mtx_unlock(&tp->tun_mtx); family = AF_INET; } BPF_MTAP2(ifp, &family, sizeof(family), m); switch (family) { #ifdef INET case AF_INET: isr = NETISR_IP; break; #endif #ifdef INET6 case AF_INET6: isr = NETISR_IPV6; break; #endif default: m_freem(m); return (EAFNOSUPPORT); } random_harvest_queue(m, sizeof(*m), RANDOM_NET_TUN); if_inc_counter(ifp, IFCOUNTER_IBYTES, m->m_pkthdr.len); if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); CURVNET_SET(ifp->if_vnet); M_SETFIB(m, ifp->if_fib); netisr_dispatch(isr, m); CURVNET_RESTORE(); return (0); } /* * tunpoll - the poll interface, this is only useful on reads * really. The write detect always returns true, write never blocks * anyway, it either accepts the packet or drops it. */ static int tunpoll(struct cdev *dev, int events, struct thread *td) { struct tun_softc *tp = dev->si_drv1; struct ifnet *ifp = TUN2IFP(tp); int revents = 0; struct mbuf *m; TUNDEBUG(ifp, "tunpoll\n"); if (events & (POLLIN | POLLRDNORM)) { IFQ_LOCK(&ifp->if_snd); IFQ_POLL_NOLOCK(&ifp->if_snd, m); if (m != NULL) { TUNDEBUG(ifp, "tunpoll q=%d\n", ifp->if_snd.ifq_len); revents |= events & (POLLIN | POLLRDNORM); } else { TUNDEBUG(ifp, "tunpoll waiting\n"); selrecord(td, &tp->tun_rsel); } IFQ_UNLOCK(&ifp->if_snd); } if (events & (POLLOUT | POLLWRNORM)) revents |= events & (POLLOUT | POLLWRNORM); return (revents); } /* * tunkqfilter - support for the kevent() system call. 
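For illustration only: a hypothetical userland consumer (the descriptor tunfd and the helper wait_for_packet() below are assumptions, not part of this driver) could use kevent(2) to sleep until the filter installed by tunkqfilter() reports queued data, roughly as sketched here:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <unistd.h>

	static int
	wait_for_packet(int tunfd)
	{
		struct kevent change, ev;
		int kq, n;

		kq = kqueue();
		if (kq == -1)
			return (-1);
		// Register interest in readability of the tun descriptor.
		EV_SET(&change, tunfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
		// One call both applies the change and blocks for an event.
		n = kevent(kq, &change, 1, &ev, 1, NULL);
		close(kq);
		return (n == 1 ? 0 : -1);
	}

A subsequent read(2) on tunfd then returns the queued packet.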
*/ static int tunkqfilter(struct cdev *dev, struct knote *kn) { struct tun_softc *tp = dev->si_drv1; struct ifnet *ifp = TUN2IFP(tp); switch(kn->kn_filter) { case EVFILT_READ: TUNDEBUG(ifp, "%s kqfilter: EVFILT_READ, minor = %#x\n", ifp->if_xname, dev2unit(dev)); kn->kn_fop = &tun_read_filterops; break; case EVFILT_WRITE: TUNDEBUG(ifp, "%s kqfilter: EVFILT_WRITE, minor = %#x\n", ifp->if_xname, dev2unit(dev)); kn->kn_fop = &tun_write_filterops; break; default: TUNDEBUG(ifp, "%s kqfilter: invalid filter, minor = %#x\n", ifp->if_xname, dev2unit(dev)); return(EINVAL); } kn->kn_hook = tp; knlist_add(&tp->tun_rsel.si_note, kn, 0); return (0); } /* * Return true of there is data in the interface queue. */ static int tunkqread(struct knote *kn, long hint) { int ret; struct tun_softc *tp = kn->kn_hook; struct cdev *dev = tp->tun_dev; struct ifnet *ifp = TUN2IFP(tp); if ((kn->kn_data = ifp->if_snd.ifq_len) > 0) { TUNDEBUG(ifp, "%s have data in the queue. Len = %d, minor = %#x\n", ifp->if_xname, ifp->if_snd.ifq_len, dev2unit(dev)); ret = 1; } else { TUNDEBUG(ifp, "%s waiting for data, minor = %#x\n", ifp->if_xname, dev2unit(dev)); ret = 0; } return (ret); } /* * Always can write, always return MTU in kn->data. */ static int tunkqwrite(struct knote *kn, long hint) { struct tun_softc *tp = kn->kn_hook; struct ifnet *ifp = TUN2IFP(tp); kn->kn_data = ifp->if_mtu; return (1); } static void tunkqdetach(struct knote *kn) { struct tun_softc *tp = kn->kn_hook; knlist_remove(&tp->tun_rsel.si_note, kn, 0); } Index: user/ngie/bug-237403/sys/net/iflib.c =================================================================== --- user/ngie/bug-237403/sys/net/iflib.c (revision 346925) +++ user/ngie/bug-237403/sys/net/iflib.c (revision 346926) @@ -1,6537 +1,6730 @@ /*- * Copyright (c) 2014-2018, Matthew Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Neither the name of Matthew Macy nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. 
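In iflib.c proper, this revision adds per-instance controls over how queues are spread across CPUs (ifc_sysctl_core_offset with CORE_OFFSET_UNSPECIFIED, ifc_sysctl_separate_txrx, and a shared cpu_offset list protected by cpu_offset_mtx), gives each receive queue a pfil(9) hook point, and retires the rx_mbuf_null debug counter.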
*/ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_acpi.h" #include "opt_sched.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "ifdi_if.h" #ifdef PCI_IOV #include #endif #include /* * enable accounting of every mbuf as it comes in to and goes out of * iflib's software descriptor references */ #define MEMORY_LOGGING 0 /* * Enable mbuf vectors for compressing long mbuf chains */ /* * NB: * - Prefetching in tx cleaning should perhaps be a tunable. The distance ahead * we prefetch needs to be determined by the time spent in m_free vis a vis * the cost of a prefetch. This will of course vary based on the workload: * - NFLX's m_free path is dominated by vm-based M_EXT manipulation which * is quite expensive, thus suggesting very little prefetch. * - small packet forwarding which is just returning a single mbuf to * UMA will typically be very fast vis a vis the cost of a memory * access. */ /* * File organization: * - private structures * - iflib private utility functions * - ifnet functions * - vlan registry and other exported functions * - iflib public core functions * * */ MALLOC_DEFINE(M_IFLIB, "iflib", "ifnet library"); struct iflib_txq; typedef struct iflib_txq *iflib_txq_t; struct iflib_rxq; typedef struct iflib_rxq *iflib_rxq_t; struct iflib_fl; typedef struct iflib_fl *iflib_fl_t; struct iflib_ctx; static void iru_init(if_rxd_update_t iru, iflib_rxq_t rxq, uint8_t flid); static void iflib_timer(void *arg); typedef struct iflib_filter_info { driver_filter_t *ifi_filter; void *ifi_filter_arg; struct grouptask *ifi_task; void *ifi_ctx; } *iflib_filter_info_t; struct iflib_ctx { KOBJ_FIELDS; /* * Pointer to hardware driver's softc */ void *ifc_softc; device_t ifc_dev; if_t ifc_ifp; cpuset_t ifc_cpus; if_shared_ctx_t ifc_sctx; struct if_softc_ctx ifc_softc_ctx; struct sx ifc_ctx_sx; struct mtx ifc_state_mtx; iflib_txq_t ifc_txqs; iflib_rxq_t ifc_rxqs; uint32_t ifc_if_flags; uint32_t ifc_flags; uint32_t ifc_max_fl_buf_size; uint32_t ifc_rx_mbuf_sz; int ifc_link_state; int ifc_link_irq; int ifc_watchdog_events; struct cdev *ifc_led_dev; struct resource *ifc_msix_mem; struct if_irq ifc_legacy_irq; struct grouptask ifc_admin_task; struct grouptask ifc_vflr_task; struct iflib_filter_info ifc_filter_info; struct ifmedia ifc_media; struct sysctl_oid *ifc_sysctl_node; uint16_t ifc_sysctl_ntxqs; uint16_t ifc_sysctl_nrxqs; uint16_t ifc_sysctl_qs_eq_override; uint16_t ifc_sysctl_rx_budget; uint16_t ifc_sysctl_tx_abdicate; + uint16_t ifc_sysctl_core_offset; +#define CORE_OFFSET_UNSPECIFIED 0xffff + uint8_t ifc_sysctl_separate_txrx; qidx_t ifc_sysctl_ntxds[8]; qidx_t ifc_sysctl_nrxds[8]; struct if_txrx ifc_txrx; #define isc_txd_encap ifc_txrx.ift_txd_encap #define isc_txd_flush ifc_txrx.ift_txd_flush #define isc_txd_credits_update ifc_txrx.ift_txd_credits_update #define isc_rxd_available ifc_txrx.ift_rxd_available #define isc_rxd_pkt_get ifc_txrx.ift_rxd_pkt_get #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_flush ifc_txrx.ift_rxd_flush #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_rxd_refill ifc_txrx.ift_rxd_refill #define isc_legacy_intr 
ifc_txrx.ift_legacy_intr eventhandler_tag ifc_vlan_attach_event; eventhandler_tag ifc_vlan_detach_event; struct ether_addr ifc_mac; char ifc_mtx_name[16]; }; void * iflib_get_softc(if_ctx_t ctx) { return (ctx->ifc_softc); } device_t iflib_get_dev(if_ctx_t ctx) { return (ctx->ifc_dev); } if_t iflib_get_ifp(if_ctx_t ctx) { return (ctx->ifc_ifp); } struct ifmedia * iflib_get_media(if_ctx_t ctx) { return (&ctx->ifc_media); } uint32_t iflib_get_flags(if_ctx_t ctx) { return (ctx->ifc_flags); } void iflib_set_mac(if_ctx_t ctx, uint8_t mac[ETHER_ADDR_LEN]) { bcopy(mac, ctx->ifc_mac.octet, ETHER_ADDR_LEN); } if_softc_ctx_t iflib_get_softc_ctx(if_ctx_t ctx) { return (&ctx->ifc_softc_ctx); } if_shared_ctx_t iflib_get_sctx(if_ctx_t ctx) { return (ctx->ifc_sctx); } #define IP_ALIGNED(m) ((((uintptr_t)(m)->m_data) & 0x3) == 0x2) #define CACHE_PTR_INCREMENT (CACHE_LINE_SIZE/sizeof(void*)) #define CACHE_PTR_NEXT(ptr) ((void *)(((uintptr_t)(ptr)+CACHE_LINE_SIZE-1) & (CACHE_LINE_SIZE-1))) #define LINK_ACTIVE(ctx) ((ctx)->ifc_link_state == LINK_STATE_UP) #define CTX_IS_VF(ctx) ((ctx)->ifc_sctx->isc_flags & IFLIB_IS_VF) typedef struct iflib_sw_rx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ caddr_t *ifsd_cl; /* direct cluster pointer for rx */ bus_addr_t *ifsd_ba; /* bus addr of cluster for rx */ } iflib_rxsd_array_t; typedef struct iflib_sw_tx_desc_array { bus_dmamap_t *ifsd_map; /* bus_dma maps for packet */ bus_dmamap_t *ifsd_tso_map; /* bus_dma maps for TSO packet */ struct mbuf **ifsd_m; /* pkthdr mbufs */ } if_txsd_vec_t; /* magic number that should be high enough for any hardware */ #define IFLIB_MAX_TX_SEGS 128 #define IFLIB_RX_COPY_THRESH 128 #define IFLIB_MAX_RX_REFRESH 32 /* The minimum descriptors per second before we start coalescing */ #define IFLIB_MIN_DESC_SEC 16384 #define IFLIB_DEFAULT_TX_UPDATE_FREQ 16 #define IFLIB_QUEUE_IDLE 0 #define IFLIB_QUEUE_HUNG 1 #define IFLIB_QUEUE_WORKING 2 /* maximum number of txqs that can share an rx interrupt */ #define IFLIB_MAX_TX_SHARED_INTR 4 /* this should really scale with ring size - this is a fairly arbitrary value */ #define TX_BATCH_SIZE 32 #define IFLIB_RESTART_BUDGET 8 #define CSUM_OFFLOAD (CSUM_IP_TSO|CSUM_IP6_TSO|CSUM_IP| \ CSUM_IP_UDP|CSUM_IP_TCP|CSUM_IP_SCTP| \ CSUM_IP6_UDP|CSUM_IP6_TCP|CSUM_IP6_SCTP) struct iflib_txq { qidx_t ift_in_use; qidx_t ift_cidx; qidx_t ift_cidx_processed; qidx_t ift_pidx; uint8_t ift_gen; uint8_t ift_br_offset; uint16_t ift_npending; uint16_t ift_db_pending; uint16_t ift_rs_pending; /* implicit pad */ uint8_t ift_txd_size[8]; uint64_t ift_processed; uint64_t ift_cleaned; uint64_t ift_cleaned_prev; #if MEMORY_LOGGING uint64_t ift_enqueued; uint64_t ift_dequeued; #endif uint64_t ift_no_tx_dma_setup; uint64_t ift_no_desc_avail; uint64_t ift_mbuf_defrag_failed; uint64_t ift_mbuf_defrag; uint64_t ift_map_failed; uint64_t ift_txd_encap_efbig; uint64_t ift_pullups; uint64_t ift_last_timer_tick; struct mtx ift_mtx; struct mtx ift_db_mtx; /* constant values */ if_ctx_t ift_ctx; struct ifmp_ring *ift_br; struct grouptask ift_task; qidx_t ift_size; uint16_t ift_id; struct callout ift_timer; if_txsd_vec_t ift_sds; uint8_t ift_qstatus; uint8_t ift_closed; uint8_t ift_update_freq; struct iflib_filter_info ift_filter_info; bus_dma_tag_t ift_buf_tag; bus_dma_tag_t ift_tso_buf_tag; iflib_dma_info_t ift_ifdi; #define MTX_NAME_LEN 16 char ift_mtx_name[MTX_NAME_LEN]; char ift_db_mtx_name[MTX_NAME_LEN]; bus_dma_segment_t ift_segs[IFLIB_MAX_TX_SEGS] __aligned(CACHE_LINE_SIZE); 
#ifdef IFLIB_DIAGNOSTICS uint64_t ift_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); struct iflib_fl { qidx_t ifl_cidx; qidx_t ifl_pidx; qidx_t ifl_credits; uint8_t ifl_gen; uint8_t ifl_rxd_size; #if MEMORY_LOGGING uint64_t ifl_m_enqueued; uint64_t ifl_m_dequeued; uint64_t ifl_cl_enqueued; uint64_t ifl_cl_dequeued; #endif /* implicit pad */ bitstr_t *ifl_rx_bitmap; qidx_t ifl_fragidx; /* constant */ qidx_t ifl_size; uint16_t ifl_buf_size; uint16_t ifl_cltype; uma_zone_t ifl_zone; iflib_rxsd_array_t ifl_sds; iflib_rxq_t ifl_rxq; uint8_t ifl_id; bus_dma_tag_t ifl_buf_tag; iflib_dma_info_t ifl_ifdi; uint64_t ifl_bus_addrs[IFLIB_MAX_RX_REFRESH] __aligned(CACHE_LINE_SIZE); caddr_t ifl_vm_addrs[IFLIB_MAX_RX_REFRESH]; qidx_t ifl_rxd_idxs[IFLIB_MAX_RX_REFRESH]; } __aligned(CACHE_LINE_SIZE); static inline qidx_t get_inuse(int size, qidx_t cidx, qidx_t pidx, uint8_t gen) { qidx_t used; if (pidx > cidx) used = pidx - cidx; else if (pidx < cidx) used = size - cidx + pidx; else if (gen == 0 && pidx == cidx) used = 0; else if (gen == 1 && pidx == cidx) used = size; else panic("bad state"); return (used); } #define TXQ_AVAIL(txq) (txq->ift_size - get_inuse(txq->ift_size, txq->ift_cidx, txq->ift_pidx, txq->ift_gen)) #define IDXDIFF(head, tail, wrap) \ ((head) >= (tail) ? (head) - (tail) : (wrap) - (tail) + (head)) struct iflib_rxq { /* If there is a separate completion queue - * these are the cq cidx and pidx. Otherwise * these are unused. */ qidx_t ifr_size; qidx_t ifr_cq_cidx; qidx_t ifr_cq_pidx; uint8_t ifr_cq_gen; uint8_t ifr_fl_offset; if_ctx_t ifr_ctx; iflib_fl_t ifr_fl; uint64_t ifr_rx_irq; + struct pfil_head *pfil; uint16_t ifr_id; uint8_t ifr_lro_enabled; uint8_t ifr_nfl; uint8_t ifr_ntxqirq; uint8_t ifr_txqid[IFLIB_MAX_TX_SHARED_INTR]; struct lro_ctrl ifr_lc; struct grouptask ifr_task; struct iflib_filter_info ifr_filter_info; iflib_dma_info_t ifr_ifdi; /* dynamically allocate if any drivers need a value substantially larger than this */ struct if_rxd_frag ifr_frags[IFLIB_MAX_RX_SEGS] __aligned(CACHE_LINE_SIZE); #ifdef IFLIB_DIAGNOSTICS uint64_t ifr_cpu_exec_count[256]; #endif } __aligned(CACHE_LINE_SIZE); typedef struct if_rxsd { caddr_t *ifsd_cl; - struct mbuf **ifsd_m; iflib_fl_t ifsd_fl; qidx_t ifsd_cidx; } *if_rxsd_t; /* multiple of word size */ #ifdef __LP64__ #define PKT_INFO_SIZE 6 #define RXD_INFO_SIZE 5 #define PKT_TYPE uint64_t #else #define PKT_INFO_SIZE 11 #define RXD_INFO_SIZE 8 #define PKT_TYPE uint32_t #endif #define PKT_LOOP_BOUND ((PKT_INFO_SIZE/3)*3) #define RXD_LOOP_BOUND ((RXD_INFO_SIZE/4)*4) typedef struct if_pkt_info_pad { PKT_TYPE pkt_val[PKT_INFO_SIZE]; } *if_pkt_info_pad_t; typedef struct if_rxd_info_pad { PKT_TYPE rxd_val[RXD_INFO_SIZE]; } *if_rxd_info_pad_t; CTASSERT(sizeof(struct if_pkt_info_pad) == sizeof(struct if_pkt_info)); CTASSERT(sizeof(struct if_rxd_info_pad) == sizeof(struct if_rxd_info)); static inline void pkt_info_zero(if_pkt_info_t pi) { if_pkt_info_pad_t pi_pad; pi_pad = (if_pkt_info_pad_t)pi; pi_pad->pkt_val[0] = 0; pi_pad->pkt_val[1] = 0; pi_pad->pkt_val[2] = 0; pi_pad->pkt_val[3] = 0; pi_pad->pkt_val[4] = 0; pi_pad->pkt_val[5] = 0; #ifndef __LP64__ pi_pad->pkt_val[6] = 0; pi_pad->pkt_val[7] = 0; pi_pad->pkt_val[8] = 0; pi_pad->pkt_val[9] = 0; pi_pad->pkt_val[10] = 0; #endif } static device_method_t iflib_pseudo_methods[] = { DEVMETHOD(device_attach, noop_attach), DEVMETHOD(device_detach, iflib_pseudo_detach), DEVMETHOD_END }; driver_t iflib_pseudodriver = { "iflib_pseudo", iflib_pseudo_methods, sizeof(struct iflib_ctx), }; static inline 
void rxd_info_zero(if_rxd_info_t ri) { if_rxd_info_pad_t ri_pad; int i; ri_pad = (if_rxd_info_pad_t)ri; for (i = 0; i < RXD_LOOP_BOUND; i += 4) { ri_pad->rxd_val[i] = 0; ri_pad->rxd_val[i+1] = 0; ri_pad->rxd_val[i+2] = 0; ri_pad->rxd_val[i+3] = 0; } #ifdef __LP64__ ri_pad->rxd_val[RXD_INFO_SIZE-1] = 0; #endif } /* * Only allow a single packet to take up most 1/nth of the tx ring */ #define MAX_SINGLE_PACKET_FRACTION 12 #define IF_BAD_DMA (bus_addr_t)-1 #define CTX_ACTIVE(ctx) ((if_getdrvflags((ctx)->ifc_ifp) & IFF_DRV_RUNNING)) #define CTX_LOCK_INIT(_sc) sx_init(&(_sc)->ifc_ctx_sx, "iflib ctx lock") #define CTX_LOCK(ctx) sx_xlock(&(ctx)->ifc_ctx_sx) #define CTX_UNLOCK(ctx) sx_xunlock(&(ctx)->ifc_ctx_sx) #define CTX_LOCK_DESTROY(ctx) sx_destroy(&(ctx)->ifc_ctx_sx) #define STATE_LOCK_INIT(_sc, _name) mtx_init(&(_sc)->ifc_state_mtx, _name, "iflib state lock", MTX_DEF) #define STATE_LOCK(ctx) mtx_lock(&(ctx)->ifc_state_mtx) #define STATE_UNLOCK(ctx) mtx_unlock(&(ctx)->ifc_state_mtx) #define STATE_LOCK_DESTROY(ctx) mtx_destroy(&(ctx)->ifc_state_mtx) #define CALLOUT_LOCK(txq) mtx_lock(&txq->ift_mtx) #define CALLOUT_UNLOCK(txq) mtx_unlock(&txq->ift_mtx) void iflib_set_detach(if_ctx_t ctx) { STATE_LOCK(ctx); ctx->ifc_flags |= IFC_IN_DETACH; STATE_UNLOCK(ctx); } /* Our boot-time initialization hook */ static int iflib_module_event_handler(module_t, int, void *); static moduledata_t iflib_moduledata = { "iflib", iflib_module_event_handler, NULL }; DECLARE_MODULE(iflib, iflib_moduledata, SI_SUB_INIT_IF, SI_ORDER_ANY); MODULE_VERSION(iflib, 1); MODULE_DEPEND(iflib, pci, 1, 1, 1); MODULE_DEPEND(iflib, ether, 1, 1, 1); TASKQGROUP_DEFINE(if_io_tqg, mp_ncpus, 1); TASKQGROUP_DEFINE(if_config_tqg, 1, 1); #ifndef IFLIB_DEBUG_COUNTERS #ifdef INVARIANTS #define IFLIB_DEBUG_COUNTERS 1 #else #define IFLIB_DEBUG_COUNTERS 0 #endif /* !INVARIANTS */ #endif static SYSCTL_NODE(_net, OID_AUTO, iflib, CTLFLAG_RD, 0, "iflib driver parameters"); /* * XXX need to ensure that this can't accidentally cause the head to be moved backwards */ static int iflib_min_tx_latency = 0; SYSCTL_INT(_net_iflib, OID_AUTO, min_tx_latency, CTLFLAG_RW, &iflib_min_tx_latency, 0, "minimize transmit latency at the possible expense of throughput"); static int iflib_no_tx_batch = 0; SYSCTL_INT(_net_iflib, OID_AUTO, no_tx_batch, CTLFLAG_RW, &iflib_no_tx_batch, 0, "minimize transmit latency at the possible expense of throughput"); #if IFLIB_DEBUG_COUNTERS static int iflib_tx_seen; static int iflib_tx_sent; static int iflib_tx_encap; static int iflib_rx_allocs; static int iflib_fl_refills; static int iflib_fl_refills_large; static int iflib_tx_frees; SYSCTL_INT(_net_iflib, OID_AUTO, tx_seen, CTLFLAG_RD, &iflib_tx_seen, 0, "# tx mbufs seen"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_sent, CTLFLAG_RD, &iflib_tx_sent, 0, "# tx mbufs sent"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_encap, CTLFLAG_RD, &iflib_tx_encap, 0, "# tx mbufs encapped"); SYSCTL_INT(_net_iflib, OID_AUTO, tx_frees, CTLFLAG_RD, &iflib_tx_frees, 0, "# tx frees"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_allocs, CTLFLAG_RD, &iflib_rx_allocs, 0, "# rx allocations"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills, CTLFLAG_RD, &iflib_fl_refills, 0, "# refills"); SYSCTL_INT(_net_iflib, OID_AUTO, fl_refills_large, CTLFLAG_RD, &iflib_fl_refills_large, 0, "# large refills"); static int iflib_txq_drain_flushing; static int iflib_txq_drain_oactive; static int iflib_txq_drain_notready; SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_flushing, CTLFLAG_RD, &iflib_txq_drain_flushing, 0, "# drain flushes"); 
SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_oactive, CTLFLAG_RD, &iflib_txq_drain_oactive, 0, "# drain oactives"); SYSCTL_INT(_net_iflib, OID_AUTO, txq_drain_notready, CTLFLAG_RD, &iflib_txq_drain_notready, 0, "# drain notready"); static int iflib_encap_load_mbuf_fail; static int iflib_encap_pad_mbuf_fail; static int iflib_encap_txq_avail_fail; static int iflib_encap_txd_encap_fail; SYSCTL_INT(_net_iflib, OID_AUTO, encap_load_mbuf_fail, CTLFLAG_RD, &iflib_encap_load_mbuf_fail, 0, "# busdma load failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_pad_mbuf_fail, CTLFLAG_RD, &iflib_encap_pad_mbuf_fail, 0, "# runt frame pad failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txq_avail_fail, CTLFLAG_RD, &iflib_encap_txq_avail_fail, 0, "# txq avail failures"); SYSCTL_INT(_net_iflib, OID_AUTO, encap_txd_encap_fail, CTLFLAG_RD, &iflib_encap_txd_encap_fail, 0, "# driver encap failures"); static int iflib_task_fn_rxs; static int iflib_rx_intr_enables; static int iflib_fast_intrs; static int iflib_rx_unavail; static int iflib_rx_ctx_inactive; static int iflib_rx_if_input; -static int iflib_rx_mbuf_null; static int iflib_rxd_flush; static int iflib_verbose_debug; SYSCTL_INT(_net_iflib, OID_AUTO, task_fn_rx, CTLFLAG_RD, &iflib_task_fn_rxs, 0, "# task_fn_rx calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_intr_enables, CTLFLAG_RD, &iflib_rx_intr_enables, 0, "# rx intr enables"); SYSCTL_INT(_net_iflib, OID_AUTO, fast_intrs, CTLFLAG_RD, &iflib_fast_intrs, 0, "# fast_intr calls"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_unavail, CTLFLAG_RD, &iflib_rx_unavail, 0, "# times rxeof called with no available data"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_ctx_inactive, CTLFLAG_RD, &iflib_rx_ctx_inactive, 0, "# times rxeof called with inactive context"); SYSCTL_INT(_net_iflib, OID_AUTO, rx_if_input, CTLFLAG_RD, &iflib_rx_if_input, 0, "# times rxeof called if_input"); -SYSCTL_INT(_net_iflib, OID_AUTO, rx_mbuf_null, CTLFLAG_RD, - &iflib_rx_mbuf_null, 0, "# times rxeof got null mbuf"); SYSCTL_INT(_net_iflib, OID_AUTO, rxd_flush, CTLFLAG_RD, &iflib_rxd_flush, 0, "# times rxd_flush called"); SYSCTL_INT(_net_iflib, OID_AUTO, verbose_debug, CTLFLAG_RW, &iflib_verbose_debug, 0, "enable verbose debugging"); #define DBG_COUNTER_INC(name) atomic_add_int(&(iflib_ ## name), 1) static void iflib_debug_reset(void) { iflib_tx_seen = iflib_tx_sent = iflib_tx_encap = iflib_rx_allocs = iflib_fl_refills = iflib_fl_refills_large = iflib_tx_frees = iflib_txq_drain_flushing = iflib_txq_drain_oactive = iflib_txq_drain_notready = iflib_encap_load_mbuf_fail = iflib_encap_pad_mbuf_fail = iflib_encap_txq_avail_fail = iflib_encap_txd_encap_fail = iflib_task_fn_rxs = iflib_rx_intr_enables = iflib_fast_intrs = iflib_rx_unavail = iflib_rx_ctx_inactive = iflib_rx_if_input = - iflib_rx_mbuf_null = iflib_rxd_flush = 0; + iflib_rxd_flush = 0; } #else #define DBG_COUNTER_INC(name) static void iflib_debug_reset(void) {} #endif #define IFLIB_DEBUG 0 static void iflib_tx_structures_free(if_ctx_t ctx); static void iflib_rx_structures_free(if_ctx_t ctx); static int iflib_queues_alloc(if_ctx_t ctx); static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq); static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, qidx_t cidx, qidx_t budget); static int iflib_qset_structures_setup(if_ctx_t ctx); static int iflib_msix_init(if_ctx_t ctx); static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filterarg, int *rid, const char *str); static void iflib_txq_check_drain(iflib_txq_t txq, int budget); static uint32_t iflib_txq_can_drain(struct ifmp_ring 
*); #ifdef ALTQ static void iflib_altq_if_start(if_t ifp); static int iflib_altq_if_transmit(if_t ifp, struct mbuf *m); #endif static int iflib_register(if_ctx_t); static void iflib_init_locked(if_ctx_t ctx); static void iflib_add_device_sysctl_pre(if_ctx_t ctx); static void iflib_add_device_sysctl_post(if_ctx_t ctx); static void iflib_ifmp_purge(iflib_txq_t txq); static void _iflib_pre_assert(if_softc_ctx_t scctx); static void iflib_if_init_locked(if_ctx_t ctx); static void iflib_free_intr_mem(if_ctx_t ctx); #ifndef __NO_STRICT_ALIGNMENT static struct mbuf * iflib_fixup_rx(struct mbuf *m); #endif +static SLIST_HEAD(cpu_offset_list, cpu_offset) cpu_offsets = + SLIST_HEAD_INITIALIZER(cpu_offsets); +struct cpu_offset { + SLIST_ENTRY(cpu_offset) entries; + cpuset_t set; + unsigned int refcount; + uint16_t offset; +}; +static struct mtx cpu_offset_mtx; +MTX_SYSINIT(iflib_cpu_offset, &cpu_offset_mtx, "iflib_cpu_offset lock", + MTX_DEF); + NETDUMP_DEFINE(iflib); #ifdef DEV_NETMAP #include #include #include MODULE_DEPEND(iflib, netmap, 1, 1, 1); static int netmap_fl_refill(iflib_rxq_t rxq, struct netmap_kring *kring, uint32_t nm_i, bool init); /* * device-specific sysctl variables: * * iflib_crcstrip: 0: keep CRC in rx frames (default), 1: strip it. * During regular operations the CRC is stripped, but on some * hardware reception of frames not multiple of 64 is slower, * so using crcstrip=0 helps in benchmarks. * * iflib_rx_miss, iflib_rx_miss_bufs: * count packets that might be missed due to lost interrupts. */ SYSCTL_DECL(_dev_netmap); /* * The xl driver by default strips CRCs and we do not override it. */ int iflib_crcstrip = 1; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_crcstrip, CTLFLAG_RW, &iflib_crcstrip, 1, "strip CRC on rx frames"); int iflib_rx_miss, iflib_rx_miss_bufs; SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss, CTLFLAG_RW, &iflib_rx_miss, 0, "potentially missed rx intr"); SYSCTL_INT(_dev_netmap, OID_AUTO, iflib_rx_miss_bufs, CTLFLAG_RW, &iflib_rx_miss_bufs, 0, "potentially missed rx intr bufs"); /* * Register/unregister. We are already under netmap lock. * Only called on the first register or the last unregister. */ static int iflib_netmap_register(struct netmap_adapter *na, int onoff) { struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; int status; CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); /* Tell the stack that the interface is no longer active */ ifp->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE); if (!CTX_IS_VF(ctx)) IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); /* enable or disable flags and callbacks in na and ifp */ if (onoff) { nm_set_native_flags(na); } else { nm_clear_native_flags(na); } iflib_stop(ctx); iflib_init_locked(ctx); IFDI_CRCSTRIP_SET(ctx, onoff, iflib_crcstrip); // XXX why twice ? status = ifp->if_drv_flags & IFF_DRV_RUNNING ? 
0 : 1; if (status) nm_clear_native_flags(na); CTX_UNLOCK(ctx); return (status); } static int netmap_fl_refill(iflib_rxq_t rxq, struct netmap_kring *kring, uint32_t nm_i, bool init) { struct netmap_adapter *na = kring->na; u_int const lim = kring->nkr_num_slots - 1; u_int head = kring->rhead; struct netmap_ring *ring = kring->ring; bus_dmamap_t *map; struct if_rxd_update iru; if_ctx_t ctx = rxq->ifr_ctx; iflib_fl_t fl = &rxq->ifr_fl[0]; uint32_t refill_pidx, nic_i; #if IFLIB_DEBUG_COUNTERS int rf_count = 0; #endif if (nm_i == head && __predict_true(!init)) return 0; iru_init(&iru, rxq, 0 /* flid */); map = fl->ifl_sds.ifsd_map; refill_pidx = netmap_idx_k2n(kring, nm_i); /* * IMPORTANT: we must leave one free slot in the ring, * so move head back by one unit */ head = nm_prev(head, lim); nic_i = UINT_MAX; DBG_COUNTER_INC(fl_refills); while (nm_i != head) { #if IFLIB_DEBUG_COUNTERS if (++rf_count == 9) DBG_COUNTER_INC(fl_refills_large); #endif for (int tmp_pidx = 0; tmp_pidx < IFLIB_MAX_RX_REFRESH && nm_i != head; tmp_pidx++) { struct netmap_slot *slot = &ring->slot[nm_i]; void *addr = PNMB(na, slot, &fl->ifl_bus_addrs[tmp_pidx]); uint32_t nic_i_dma = refill_pidx; nic_i = netmap_idx_k2n(kring, nm_i); MPASS(tmp_pidx < IFLIB_MAX_RX_REFRESH); if (addr == NETMAP_BUF_BASE(na)) /* bad buf */ return netmap_ring_reinit(kring); fl->ifl_vm_addrs[tmp_pidx] = addr; if (__predict_false(init)) { netmap_load_map(na, fl->ifl_buf_tag, map[nic_i], addr); } else if (slot->flags & NS_BUF_CHANGED) { /* buffer has changed, reload map */ netmap_reload_map(na, fl->ifl_buf_tag, map[nic_i], addr); } slot->flags &= ~NS_BUF_CHANGED; nm_i = nm_next(nm_i, lim); fl->ifl_rxd_idxs[tmp_pidx] = nic_i = nm_next(nic_i, lim); if (nm_i != head && tmp_pidx < IFLIB_MAX_RX_REFRESH-1) continue; iru.iru_pidx = refill_pidx; iru.iru_count = tmp_pidx+1; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); refill_pidx = nic_i; for (int n = 0; n < iru.iru_count; n++) { bus_dmamap_sync(fl->ifl_buf_tag, map[nic_i_dma], BUS_DMASYNC_PREREAD); /* XXX - change this to not use the netmap func*/ nic_i_dma = nm_next(nic_i_dma, lim); } } } kring->nr_hwcur = head; bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); if (__predict_true(nic_i != UINT_MAX)) { ctx->isc_rxd_flush(ctx->ifc_softc, rxq->ifr_id, fl->ifl_id, nic_i); DBG_COUNTER_INC(rxd_flush); } return (0); } /* * Reconcile kernel and user view of the transmit ring. * * All information is in the kring. * Userspace wants to send packets up to the one before kring->rhead, * kernel knows kring->nr_hwcur is the first unsent packet. * * Here we push packets out (as many as possible), and possibly * reclaim buffers from previously completed transmission. * * The caller (netmap) guarantees that there is only one instance * running at any time. Any interference with other driver * methods should be handled by the individual drivers. 
*/ static int iflib_netmap_txsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct ifnet *ifp = na->ifp; struct netmap_ring *ring = kring->ring; u_int nm_i; /* index into the netmap kring */ u_int nic_i; /* index into the NIC ring */ u_int n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; struct if_pkt_info pi; /* * interrupts on every tx packet are expensive so request * them every half ring, or where NS_REPORT is set */ u_int report_frequency = kring->nkr_num_slots >> 1; /* device-specific */ if_ctx_t ctx = ifp->if_softc; iflib_txq_t txq = &ctx->ifc_txqs[kring->ring_id]; bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); /* * First part: process new packets to send. * nm_i is the current index in the netmap kring, * nic_i is the corresponding index in the NIC ring. * * If we have packets to send (nm_i != head) * iterate over the netmap ring, fetch length and update * the corresponding slot in the NIC ring. Some drivers also * need to update the buffer's physical address in the NIC slot * even NS_BUF_CHANGED is not set (PNMB computes the addresses). * * The netmap_reload_map() calls is especially expensive, * even when (as in this case) the tag is 0, so do only * when the buffer has actually changed. * * If possible do not set the report/intr bit on all slots, * but only a few times per ring or when NS_REPORT is set. * * Finally, on 10G and faster drivers, it might be useful * to prefetch the next slot and txr entry. */ nm_i = kring->nr_hwcur; if (nm_i != head) { /* we have new packets to send */ pkt_info_zero(&pi); pi.ipi_segs = txq->ift_segs; pi.ipi_qsidx = kring->ring_id; nic_i = netmap_idx_k2n(kring, nm_i); __builtin_prefetch(&ring->slot[nm_i]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i]); __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i]); for (n = 0; nm_i != head; n++) { struct netmap_slot *slot = &ring->slot[nm_i]; u_int len = slot->len; uint64_t paddr; void *addr = PNMB(na, slot, &paddr); int flags = (slot->flags & NS_REPORT || nic_i == 0 || nic_i == report_frequency) ? IPI_TX_INTR : 0; /* device-specific */ pi.ipi_len = len; pi.ipi_segs[0].ds_addr = paddr; pi.ipi_segs[0].ds_len = len; pi.ipi_nsegs = 1; pi.ipi_ndescs = 0; pi.ipi_pidx = nic_i; pi.ipi_flags = flags; /* Fill the slot in the NIC ring. */ ctx->isc_txd_encap(ctx->ifc_softc, &pi); DBG_COUNTER_INC(tx_encap); /* prefetch for next round */ __builtin_prefetch(&ring->slot[nm_i + 1]); __builtin_prefetch(&txq->ift_sds.ifsd_m[nic_i + 1]); __builtin_prefetch(&txq->ift_sds.ifsd_map[nic_i + 1]); NM_CHECK_ADDR_LEN(na, addr, len); if (slot->flags & NS_BUF_CHANGED) { /* buffer has changed, reload map */ netmap_reload_map(na, txq->ift_buf_tag, txq->ift_sds.ifsd_map[nic_i], addr); } /* make sure changes to the buffer are synced */ bus_dmamap_sync(txq->ift_buf_tag, txq->ift_sds.ifsd_map[nic_i], BUS_DMASYNC_PREWRITE); slot->flags &= ~(NS_REPORT | NS_BUF_CHANGED); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } kring->nr_hwcur = nm_i; /* synchronize the NIC ring */ bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* (re)start the tx unit up to slot nic_i (excluded) */ ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, nic_i); } /* * Second part: reclaim buffers for completed transmissions. * * If there are unclaimed buffers, attempt to reclaim them. 
* If none are reclaimed, and TX IRQs are not in use, do an initial * minimal delay, then trigger the tx handler which will spin in the * group task queue. */ if (kring->nr_hwtail != nm_prev(kring->nr_hwcur, lim)) { if (iflib_tx_credits_update(ctx, txq)) { /* some tx completed, increment avail */ nic_i = txq->ift_cidx_processed; kring->nr_hwtail = nm_prev(netmap_idx_n2k(kring, nic_i), lim); } } if (!(ctx->ifc_flags & IFC_NETMAP_TX_IRQ)) if (kring->nr_hwtail != nm_prev(kring->nr_hwcur, lim)) { callout_reset_on(&txq->ift_timer, hz < 2000 ? 1 : hz / 1000, iflib_timer, txq, txq->ift_timer.c_cpu); } return (0); } /* * Reconcile kernel and user view of the receive ring. * Same as for the txsync, this routine must be efficient. * The caller guarantees a single invocations, but races against * the rest of the driver should be handled here. * * On call, kring->rhead is the first packet that userspace wants * to keep, and kring->rcur is the wakeup point. * The kernel has previously reported packets up to kring->rtail. * * If (flags & NAF_FORCE_READ) also check for incoming packets irrespective * of whether or not we received an interrupt. */ static int iflib_netmap_rxsync(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_ring *ring = kring->ring; iflib_fl_t fl; uint32_t nm_i; /* index into the netmap ring */ uint32_t nic_i; /* index into the NIC ring */ u_int i, n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; int force_update = (flags & NAF_FORCE_READ) || kring->nr_kflags & NKR_PENDINTR; struct if_rxd_info ri; struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; iflib_rxq_t rxq = &ctx->ifc_rxqs[kring->ring_id]; if (head > lim) return netmap_ring_reinit(kring); /* * XXX netmap_fl_refill() only ever (re)fills free list 0 so far. */ for (i = 0, fl = rxq->ifr_fl; i < rxq->ifr_nfl; i++, fl++) { bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); } /* * First part: import newly received packets. * * nm_i is the index of the next free slot in the netmap ring, * nic_i is the index of the next received packet in the NIC ring, * and they may differ in case if_init() has been called while * in netmap mode. For the receive ring we have * * nic_i = rxr->next_check; * nm_i = kring->nr_hwtail (previous) * and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size * * rxr->next_check is set to 0 on a ring reinit */ if (netmap_no_pendintr || force_update) { int crclen = iflib_crcstrip ? 0 : 4; int error, avail; for (i = 0; i < rxq->ifr_nfl; i++) { fl = &rxq->ifr_fl[i]; nic_i = fl->ifl_cidx; nm_i = netmap_idx_n2k(kring, nic_i); avail = ctx->isc_rxd_available(ctx->ifc_softc, rxq->ifr_id, nic_i, USHRT_MAX); for (n = 0; avail > 0; n++, avail--) { rxd_info_zero(&ri); ri.iri_frags = rxq->ifr_frags; ri.iri_qsidx = kring->ring_id; ri.iri_ifp = ctx->ifc_ifp; ri.iri_cidx = nic_i; error = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); ring->slot[nm_i].len = error ? 0 : ri.iri_len - crclen; ring->slot[nm_i].flags = 0; bus_dmamap_sync(fl->ifl_buf_tag, fl->ifl_sds.ifsd_map[nic_i], BUS_DMASYNC_POSTREAD); nm_i = nm_next(nm_i, lim); nic_i = nm_next(nic_i, lim); } if (n) { /* update the state variables */ if (netmap_no_pendintr && !force_update) { /* diagnostics */ iflib_rx_miss ++; iflib_rx_miss_bufs += n; } fl->ifl_cidx = nic_i; kring->nr_hwtail = nm_i; } kring->nr_kflags &= ~NKR_PENDINTR; } } /* * Second part: skip past packets that userspace has released. 
* (kring->nr_hwcur to head excluded), * and make the buffers available for reception. * As usual nm_i is the index in the netmap ring, * nic_i is the index in the NIC ring, and * nm_i == (nic_i + kring->nkr_hwofs) % ring_size */ /* XXX not sure how this will work with multiple free lists */ nm_i = kring->nr_hwcur; return (netmap_fl_refill(rxq, kring, nm_i, false)); } static void iflib_netmap_intr(struct netmap_adapter *na, int onoff) { struct ifnet *ifp = na->ifp; if_ctx_t ctx = ifp->if_softc; CTX_LOCK(ctx); if (onoff) { IFDI_INTR_ENABLE(ctx); } else { IFDI_INTR_DISABLE(ctx); } CTX_UNLOCK(ctx); } static int iflib_netmap_attach(if_ctx_t ctx) { struct netmap_adapter na; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; bzero(&na, sizeof(na)); na.ifp = ctx->ifc_ifp; na.na_flags = NAF_BDG_MAYSLEEP; MPASS(ctx->ifc_softc_ctx.isc_ntxqsets); MPASS(ctx->ifc_softc_ctx.isc_nrxqsets); na.num_tx_desc = scctx->isc_ntxd[0]; na.num_rx_desc = scctx->isc_nrxd[0]; na.nm_txsync = iflib_netmap_txsync; na.nm_rxsync = iflib_netmap_rxsync; na.nm_register = iflib_netmap_register; na.nm_intr = iflib_netmap_intr; na.num_tx_rings = ctx->ifc_softc_ctx.isc_ntxqsets; na.num_rx_rings = ctx->ifc_softc_ctx.isc_nrxqsets; return (netmap_attach(&na)); } static void iflib_netmap_txq_init(if_ctx_t ctx, iflib_txq_t txq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_slot *slot; slot = netmap_reset(na, NR_TX, txq->ift_id, 0); if (slot == NULL) return; for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxd[0]; i++) { /* * In netmap mode, set the map for the packet buffer. * NOTE: Some drivers (not this one) also need to set * the physical buffer address in the NIC ring. * netmap_idx_n2k() maps a nic index, i, into the corresponding * netmap slot index, si */ int si = netmap_idx_n2k(na->tx_rings[txq->ift_id], i); netmap_load_map(na, txq->ift_buf_tag, txq->ift_sds.ifsd_map[i], NMB(na, slot + si)); } } static void iflib_netmap_rxq_init(if_ctx_t ctx, iflib_rxq_t rxq) { struct netmap_adapter *na = NA(ctx->ifc_ifp); struct netmap_kring *kring = na->rx_rings[rxq->ifr_id]; struct netmap_slot *slot; uint32_t nm_i; slot = netmap_reset(na, NR_RX, rxq->ifr_id, 0); if (slot == NULL) return; nm_i = netmap_idx_n2k(kring, 0); netmap_fl_refill(rxq, kring, nm_i, true); } static void iflib_netmap_timer_adjust(if_ctx_t ctx, iflib_txq_t txq, uint32_t *reset_on) { struct netmap_kring *kring; uint16_t txqid; txqid = txq->ift_id; kring = NA(ctx->ifc_ifp)->tx_rings[txqid]; if (kring->nr_hwcur != nm_next(kring->nr_hwtail, kring->nkr_num_slots - 1)) { bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD); if (ctx->isc_txd_credits_update(ctx->ifc_softc, txqid, false)) netmap_tx_irq(ctx->ifc_ifp, txqid); if (!(ctx->ifc_flags & IFC_NETMAP_TX_IRQ)) { if (hz < 2000) *reset_on = 1; else *reset_on = hz / 1000; } } } #define iflib_netmap_detach(ifp) netmap_detach(ifp) #else #define iflib_netmap_txq_init(ctx, txq) #define iflib_netmap_rxq_init(ctx, rxq) #define iflib_netmap_detach(ifp) #define iflib_netmap_attach(ctx) (0) #define netmap_rx_irq(ifp, qid, budget) (0) #define netmap_tx_irq(ifp, qid) do {} while (0) #define iflib_netmap_timer_adjust(ctx, txq, reset_on) #endif #if defined(__i386__) || defined(__amd64__) static __inline void prefetch(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); } static __inline void prefetch2cachelines(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); #if (CACHE_LINE_SIZE < 128) __asm volatile("prefetcht0 %0" :: "m" (*(((unsigned long 
*)x)+CACHE_LINE_SIZE/(sizeof(unsigned long))))); #endif } #else #define prefetch(x) #define prefetch2cachelines(x) #endif static void iru_init(if_rxd_update_t iru, iflib_rxq_t rxq, uint8_t flid) { iflib_fl_t fl; fl = &rxq->ifr_fl[flid]; iru->iru_paddrs = fl->ifl_bus_addrs; iru->iru_vaddrs = &fl->ifl_vm_addrs[0]; iru->iru_idxs = fl->ifl_rxd_idxs; iru->iru_qsidx = rxq->ifr_id; iru->iru_buf_size = fl->ifl_buf_size; iru->iru_flidx = fl->ifl_id; } static void _iflib_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int err) { if (err) return; *(bus_addr_t *) arg = segs[0].ds_addr; } int iflib_dma_alloc_align(if_ctx_t ctx, int size, int align, iflib_dma_info_t dma, int mapflags) { int err; device_t dev = ctx->ifc_dev; err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ align, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ size, /* maxsize */ 1, /* nsegments */ size, /* maxsegsize */ BUS_DMA_ALLOCNOW, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &dma->idi_tag); if (err) { device_printf(dev, "%s: bus_dma_tag_create failed: %d\n", __func__, err); goto fail_0; } err = bus_dmamem_alloc(dma->idi_tag, (void**) &dma->idi_vaddr, BUS_DMA_NOWAIT | BUS_DMA_COHERENT | BUS_DMA_ZERO, &dma->idi_map); if (err) { device_printf(dev, "%s: bus_dmamem_alloc(%ju) failed: %d\n", __func__, (uintmax_t)size, err); goto fail_1; } dma->idi_paddr = IF_BAD_DMA; err = bus_dmamap_load(dma->idi_tag, dma->idi_map, dma->idi_vaddr, size, _iflib_dmamap_cb, &dma->idi_paddr, mapflags | BUS_DMA_NOWAIT); if (err || dma->idi_paddr == IF_BAD_DMA) { device_printf(dev, "%s: bus_dmamap_load failed: %d\n", __func__, err); goto fail_2; } dma->idi_size = size; return (0); fail_2: bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); fail_1: bus_dma_tag_destroy(dma->idi_tag); fail_0: dma->idi_tag = NULL; return (err); } int iflib_dma_alloc(if_ctx_t ctx, int size, iflib_dma_info_t dma, int mapflags) { if_shared_ctx_t sctx = ctx->ifc_sctx; KASSERT(sctx->isc_q_align != 0, ("alignment value not initialized")); return (iflib_dma_alloc_align(ctx, size, sctx->isc_q_align, dma, mapflags)); } int iflib_dma_alloc_multi(if_ctx_t ctx, int *sizes, iflib_dma_info_t *dmalist, int mapflags, int count) { int i, err; iflib_dma_info_t *dmaiter; dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) { if ((err = iflib_dma_alloc(ctx, sizes[i], *dmaiter, mapflags)) != 0) break; } if (err) iflib_dma_free_multi(dmalist, i); return (err); } void iflib_dma_free(iflib_dma_info_t dma) { if (dma->idi_tag == NULL) return; if (dma->idi_paddr != IF_BAD_DMA) { bus_dmamap_sync(dma->idi_tag, dma->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(dma->idi_tag, dma->idi_map); dma->idi_paddr = IF_BAD_DMA; } if (dma->idi_vaddr != NULL) { bus_dmamem_free(dma->idi_tag, dma->idi_vaddr, dma->idi_map); dma->idi_vaddr = NULL; } bus_dma_tag_destroy(dma->idi_tag); dma->idi_tag = NULL; } void iflib_dma_free_multi(iflib_dma_info_t *dmalist, int count) { int i; iflib_dma_info_t *dmaiter = dmalist; for (i = 0; i < count; i++, dmaiter++) iflib_dma_free(*dmaiter); } #ifdef EARLY_AP_STARTUP static const int iflib_started = 1; #else /* * We used to abuse the smp_started flag to decide if the queues have been * fully initialized (by late taskqgroup_adjust() calls in a SYSINIT()). * That gave bad races, since the SYSINIT() runs strictly after smp_started * is set. Run a SYSINIT() strictly after that to just set a usable * completion flag. 
*/ static int iflib_started; static void iflib_record_started(void *arg) { iflib_started = 1; } SYSINIT(iflib_record_started, SI_SUB_SMP + 1, SI_ORDER_FIRST, iflib_record_started, NULL); #endif static int iflib_fast_intr(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; int result; if (!iflib_started) return (FILTER_STRAY); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL) { result = info->ifi_filter(info->ifi_filter_arg); if ((result & FILTER_SCHEDULE_THREAD) == 0) return (result); } GROUPTASK_ENQUEUE(gtask); return (FILTER_HANDLED); } static int iflib_fast_intr_rxtx(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; if_ctx_t ctx; iflib_rxq_t rxq = (iflib_rxq_t)info->ifi_ctx; iflib_txq_t txq; void *sc; int i, cidx, result; qidx_t txqid; if (!iflib_started) return (FILTER_STRAY); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL) { result = info->ifi_filter(info->ifi_filter_arg); if ((result & FILTER_SCHEDULE_THREAD) == 0) return (result); } ctx = rxq->ifr_ctx; sc = ctx->ifc_softc; MPASS(rxq->ifr_ntxqirq); for (i = 0; i < rxq->ifr_ntxqirq; i++) { txqid = rxq->ifr_txqid[i]; txq = &ctx->ifc_txqs[txqid]; bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD); if (!ctx->isc_txd_credits_update(sc, txqid, false)) { IFDI_TX_QUEUE_INTR_ENABLE(ctx, txqid); continue; } GROUPTASK_ENQUEUE(&txq->ift_task); } if (ctx->ifc_sctx->isc_flags & IFLIB_HAS_RXCQ) cidx = rxq->ifr_cq_cidx; else cidx = rxq->ifr_fl[0].ifl_cidx; if (iflib_rxd_avail(ctx, rxq, cidx, 1)) GROUPTASK_ENQUEUE(gtask); else { IFDI_RX_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id); DBG_COUNTER_INC(rx_intr_enables); } return (FILTER_HANDLED); } static int iflib_fast_intr_ctx(void *arg) { iflib_filter_info_t info = arg; struct grouptask *gtask = info->ifi_task; int result; if (!iflib_started) return (FILTER_STRAY); DBG_COUNTER_INC(fast_intrs); if (info->ifi_filter != NULL) { result = info->ifi_filter(info->ifi_filter_arg); if ((result & FILTER_SCHEDULE_THREAD) == 0) return (result); } GROUPTASK_ENQUEUE(gtask); return (FILTER_HANDLED); } static int _iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, driver_intr_t handler, void *arg, const char *name) { int rc, flags; struct resource *res; void *tag = NULL; device_t dev = ctx->ifc_dev; flags = RF_ACTIVE; if (ctx->ifc_flags & IFC_LEGACY) flags |= RF_SHAREABLE; MPASS(rid < 512); irq->ii_rid = rid; res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &irq->ii_rid, flags); if (res == NULL) { device_printf(dev, "failed to allocate IRQ for rid %d, name %s.\n", rid, name); return (ENOMEM); } irq->ii_res = res; KASSERT(filter == NULL || handler == NULL, ("filter and handler can't both be non-NULL")); rc = bus_setup_intr(dev, res, INTR_MPSAFE | INTR_TYPE_NET, filter, handler, arg, &tag); if (rc != 0) { device_printf(dev, "failed to setup interrupt for rid %d, name %s: %d\n", rid, name ? name : "unknown", rc); return (rc); } else if (name) bus_describe_intr(dev, res, tag, "%s", name); irq->ii_tag = tag; return (0); } /********************************************************************* * * Allocate DMA resources for TX buffers as well as memory for the TX * mbuf map. TX DMA maps (non-TSO/TSO) and TX mbuf map are kept in a * iflib_sw_tx_desc_array structure, storing all the information that * is needed to transmit a packet on the wire. This is called only * once at attach, setup is done every reset. 
* **********************************************************************/ static int iflib_txsd_alloc(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; bus_size_t tsomaxsize; int err, nsegments, ntsosegments; bool tso; nsegments = scctx->isc_tx_nsegments; ntsosegments = scctx->isc_tx_tso_segments_max; tsomaxsize = scctx->isc_tx_tso_size_max; if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_VLAN_MTU) tsomaxsize += sizeof(struct ether_vlan_header); MPASS(scctx->isc_ntxd[0] > 0); MPASS(scctx->isc_ntxd[txq->ift_br_offset] > 0); MPASS(nsegments > 0); if (if_getcapabilities(ctx->ifc_ifp) & IFCAP_TSO) { MPASS(ntsosegments > 0); MPASS(sctx->isc_tso_maxsize >= tsomaxsize); } /* * Set up DMA tags for TX buffers. */ if ((err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_tx_maxsize, /* maxsize */ nsegments, /* nsegments */ sctx->isc_tx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_buf_tag))) { device_printf(dev,"Unable to allocate TX DMA tag: %d\n", err); device_printf(dev,"maxsize: %ju nsegments: %d maxsegsize: %ju\n", (uintmax_t)sctx->isc_tx_maxsize, nsegments, (uintmax_t)sctx->isc_tx_maxsegsize); goto fail; } tso = (if_getcapabilities(ctx->ifc_ifp) & IFCAP_TSO) != 0; if (tso && (err = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ tsomaxsize, /* maxsize */ ntsosegments, /* nsegments */ sctx->isc_tso_maxsegsize,/* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txq->ift_tso_buf_tag))) { device_printf(dev, "Unable to allocate TSO TX DMA tag: %d\n", err); goto fail; } /* Allocate memory for the TX mbuf map. */ if (!(txq->ift_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX mbuf map memory\n"); err = ENOMEM; goto fail; } /* * Create the DMA maps for TX buffers. 
*/ if ((txq->ift_sds.ifsd_map = (bus_dmamap_t *)malloc( sizeof(bus_dmamap_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO)) == NULL) { device_printf(dev, "Unable to allocate TX buffer DMA map memory\n"); err = ENOMEM; goto fail; } if (tso && (txq->ift_sds.ifsd_tso_map = (bus_dmamap_t *)malloc( sizeof(bus_dmamap_t) * scctx->isc_ntxd[txq->ift_br_offset], M_IFLIB, M_NOWAIT | M_ZERO)) == NULL) { device_printf(dev, "Unable to allocate TSO TX buffer map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_ntxd[txq->ift_br_offset]; i++) { err = bus_dmamap_create(txq->ift_buf_tag, 0, &txq->ift_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } if (!tso) continue; err = bus_dmamap_create(txq->ift_tso_buf_tag, 0, &txq->ift_sds.ifsd_tso_map[i]); if (err != 0) { device_printf(dev, "Unable to create TSO TX DMA map\n"); goto fail; } } return (0); fail: /* We free all, it handles case where we are in the middle */ iflib_tx_structures_free(ctx); return (err); } static void iflib_txsd_destroy(if_ctx_t ctx, iflib_txq_t txq, int i) { bus_dmamap_t map; map = NULL; if (txq->ift_sds.ifsd_map != NULL) map = txq->ift_sds.ifsd_map[i]; if (map != NULL) { bus_dmamap_sync(txq->ift_buf_tag, map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_buf_tag, map); bus_dmamap_destroy(txq->ift_buf_tag, map); txq->ift_sds.ifsd_map[i] = NULL; } map = NULL; if (txq->ift_sds.ifsd_tso_map != NULL) map = txq->ift_sds.ifsd_tso_map[i]; if (map != NULL) { bus_dmamap_sync(txq->ift_tso_buf_tag, map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_tso_buf_tag, map); bus_dmamap_destroy(txq->ift_tso_buf_tag, map); txq->ift_sds.ifsd_tso_map[i] = NULL; } } static void iflib_txq_destroy(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; for (int i = 0; i < txq->ift_size; i++) iflib_txsd_destroy(ctx, txq, i); if (txq->ift_sds.ifsd_map != NULL) { free(txq->ift_sds.ifsd_map, M_IFLIB); txq->ift_sds.ifsd_map = NULL; } if (txq->ift_sds.ifsd_tso_map != NULL) { free(txq->ift_sds.ifsd_tso_map, M_IFLIB); txq->ift_sds.ifsd_tso_map = NULL; } if (txq->ift_sds.ifsd_m != NULL) { free(txq->ift_sds.ifsd_m, M_IFLIB); txq->ift_sds.ifsd_m = NULL; } if (txq->ift_buf_tag != NULL) { bus_dma_tag_destroy(txq->ift_buf_tag); txq->ift_buf_tag = NULL; } if (txq->ift_tso_buf_tag != NULL) { bus_dma_tag_destroy(txq->ift_tso_buf_tag); txq->ift_tso_buf_tag = NULL; } } static void iflib_txsd_free(if_ctx_t ctx, iflib_txq_t txq, int i) { struct mbuf **mp; mp = &txq->ift_sds.ifsd_m[i]; if (*mp == NULL) return; if (txq->ift_sds.ifsd_map != NULL) { bus_dmamap_sync(txq->ift_buf_tag, txq->ift_sds.ifsd_map[i], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_buf_tag, txq->ift_sds.ifsd_map[i]); } if (txq->ift_sds.ifsd_tso_map != NULL) { bus_dmamap_sync(txq->ift_tso_buf_tag, txq->ift_sds.ifsd_tso_map[i], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_tso_buf_tag, txq->ift_sds.ifsd_tso_map[i]); } m_free(*mp); DBG_COUNTER_INC(tx_frees); *mp = NULL; } static int iflib_txq_setup(iflib_txq_t txq) { if_ctx_t ctx = txq->ift_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; iflib_dma_info_t di; int i; /* Set number of descriptors available */ txq->ift_qstatus = IFLIB_QUEUE_IDLE; /* XXX make configurable */ txq->ift_update_freq = IFLIB_DEFAULT_TX_UPDATE_FREQ; /* Reset indices */ txq->ift_cidx_processed = 0; txq->ift_pidx = txq->ift_cidx = txq->ift_npending = 0; txq->ift_size = scctx->isc_ntxd[txq->ift_br_offset]; for (i = 0, di = txq->ift_ifdi; i < sctx->isc_ntxqs; 
i++, di++) bzero((void *)di->idi_vaddr, di->idi_size); IFDI_TXQ_SETUP(ctx, txq->ift_id); for (i = 0, di = txq->ift_ifdi; i < sctx->isc_ntxqs; i++, di++) bus_dmamap_sync(di->idi_tag, di->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); return (0); } /********************************************************************* * * Allocate DMA resources for RX buffers as well as memory for the RX * mbuf map, direct RX cluster pointer map and RX cluster bus address * map. RX DMA map, RX mbuf map, direct RX cluster pointer map and * RX cluster map are kept in a iflib_sw_rx_desc_array structure. * Since we use use one entry in iflib_sw_rx_desc_array per received * packet, the maximum number of entries we'll need is equal to the * number of hardware receive descriptors that we've allocated. * **********************************************************************/ static int iflib_rxsd_alloc(iflib_rxq_t rxq) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; iflib_fl_t fl; int err; MPASS(scctx->isc_nrxd[0] > 0); MPASS(scctx->isc_nrxd[rxq->ifr_fl_offset] > 0); fl = rxq->ifr_fl; for (int i = 0; i < rxq->ifr_nfl; i++, fl++) { fl->ifl_size = scctx->isc_nrxd[rxq->ifr_fl_offset]; /* this isn't necessarily the same */ /* Set up DMA tag for RX buffers. */ err = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ sctx->isc_rx_maxsize, /* maxsize */ sctx->isc_rx_nsegments, /* nsegments */ sctx->isc_rx_maxsegsize, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &fl->ifl_buf_tag); if (err) { device_printf(dev, "Unable to allocate RX DMA tag: %d\n", err); goto fail; } /* Allocate memory for the RX mbuf map. */ if (!(fl->ifl_sds.ifsd_m = (struct mbuf **) malloc(sizeof(struct mbuf *) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX mbuf map memory\n"); err = ENOMEM; goto fail; } /* Allocate memory for the direct RX cluster pointer map. */ if (!(fl->ifl_sds.ifsd_cl = (caddr_t *) malloc(sizeof(caddr_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX cluster map memory\n"); err = ENOMEM; goto fail; } /* Allocate memory for the RX cluster bus address map. */ if (!(fl->ifl_sds.ifsd_ba = (bus_addr_t *) malloc(sizeof(bus_addr_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX bus address map memory\n"); err = ENOMEM; goto fail; } /* * Create the DMA maps for RX buffers. 
*/ if (!(fl->ifl_sds.ifsd_map = (bus_dmamap_t *) malloc(sizeof(bus_dmamap_t) * scctx->isc_nrxd[rxq->ifr_fl_offset], M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX buffer DMA map memory\n"); err = ENOMEM; goto fail; } for (int i = 0; i < scctx->isc_nrxd[rxq->ifr_fl_offset]; i++) { err = bus_dmamap_create(fl->ifl_buf_tag, 0, &fl->ifl_sds.ifsd_map[i]); if (err != 0) { device_printf(dev, "Unable to create RX buffer DMA map\n"); goto fail; } } } return (0); fail: iflib_rx_structures_free(ctx); return (err); } /* * Internal service routines */ struct rxq_refill_cb_arg { int error; bus_dma_segment_t seg; int nseg; }; static void _rxq_refill_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct rxq_refill_cb_arg *cb_arg = arg; cb_arg->error = error; cb_arg->seg = segs[0]; cb_arg->nseg = nseg; } /** * rxq_refill - refill an rxq free-buffer list * @ctx: the iflib context * @rxq: the free-list to refill * @n: the number of new buffers to allocate * * (Re)populate an rxq free-buffer list with up to @n new packet buffers. * The caller must assure that @n does not exceed the queue's capacity. */ static void _iflib_fl_refill(if_ctx_t ctx, iflib_fl_t fl, int count) { struct if_rxd_update iru; struct rxq_refill_cb_arg cb_arg; struct mbuf *m; caddr_t cl, *sd_cl; struct mbuf **sd_m; bus_dmamap_t *sd_map; bus_addr_t bus_addr, *sd_ba; int err, frag_idx, i, idx, n, pidx; qidx_t credits; sd_m = fl->ifl_sds.ifsd_m; sd_map = fl->ifl_sds.ifsd_map; sd_cl = fl->ifl_sds.ifsd_cl; sd_ba = fl->ifl_sds.ifsd_ba; pidx = fl->ifl_pidx; idx = pidx; frag_idx = fl->ifl_fragidx; credits = fl->ifl_credits; i = 0; n = count; MPASS(n > 0); MPASS(credits + n <= fl->ifl_size); if (pidx < fl->ifl_cidx) MPASS(pidx + n <= fl->ifl_cidx); if (pidx == fl->ifl_cidx && (credits < fl->ifl_size)) MPASS(fl->ifl_gen == 0); if (pidx > fl->ifl_cidx) MPASS(n <= fl->ifl_size - pidx + fl->ifl_cidx); DBG_COUNTER_INC(fl_refills); if (n > 8) DBG_COUNTER_INC(fl_refills_large); iru_init(&iru, fl->ifl_rxq, fl->ifl_id); while (n--) { /* * We allocate an uninitialized mbuf + cluster, mbuf is * initialized after rx. * * If the cluster is still set then we know a minimum sized packet was received */ bit_ffc_at(fl->ifl_rx_bitmap, frag_idx, fl->ifl_size, &frag_idx); if (frag_idx < 0) bit_ffc(fl->ifl_rx_bitmap, fl->ifl_size, &frag_idx); MPASS(frag_idx >= 0); if ((cl = sd_cl[frag_idx]) == NULL) { if ((cl = m_cljget(NULL, M_NOWAIT, fl->ifl_buf_size)) == NULL) break; cb_arg.error = 0; MPASS(sd_map != NULL); err = bus_dmamap_load(fl->ifl_buf_tag, sd_map[frag_idx], cl, fl->ifl_buf_size, _rxq_refill_cb, &cb_arg, BUS_DMA_NOWAIT); if (err != 0 || cb_arg.error) { /* * !zone_pack ? 
*/ if (fl->ifl_zone == zone_pack) uma_zfree(fl->ifl_zone, cl); break; } sd_ba[frag_idx] = bus_addr = cb_arg.seg.ds_addr; sd_cl[frag_idx] = cl; #if MEMORY_LOGGING fl->ifl_cl_enqueued++; #endif } else { bus_addr = sd_ba[frag_idx]; } bus_dmamap_sync(fl->ifl_buf_tag, sd_map[frag_idx], BUS_DMASYNC_PREREAD); - MPASS(sd_m[frag_idx] == NULL); - if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) { - break; + if (sd_m[frag_idx] == NULL) { + if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) { + break; + } + sd_m[frag_idx] = m; } - sd_m[frag_idx] = m; bit_set(fl->ifl_rx_bitmap, frag_idx); #if MEMORY_LOGGING fl->ifl_m_enqueued++; #endif DBG_COUNTER_INC(rx_allocs); fl->ifl_rxd_idxs[i] = frag_idx; fl->ifl_bus_addrs[i] = bus_addr; fl->ifl_vm_addrs[i] = cl; credits++; i++; MPASS(credits <= fl->ifl_size); if (++idx == fl->ifl_size) { fl->ifl_gen = 1; idx = 0; } if (n == 0 || i == IFLIB_MAX_RX_REFRESH) { iru.iru_pidx = pidx; iru.iru_count = i; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); i = 0; pidx = idx; fl->ifl_pidx = idx; fl->ifl_credits = credits; } } if (i) { iru.iru_pidx = pidx; iru.iru_count = i; ctx->isc_rxd_refill(ctx->ifc_softc, &iru); fl->ifl_pidx = idx; fl->ifl_credits = credits; } DBG_COUNTER_INC(rxd_flush); if (fl->ifl_pidx == 0) pidx = fl->ifl_size - 1; else pidx = fl->ifl_pidx - 1; bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_rxd_flush(ctx->ifc_softc, fl->ifl_rxq->ifr_id, fl->ifl_id, pidx); fl->ifl_fragidx = frag_idx; } static __inline void __iflib_fl_refill_lt(if_ctx_t ctx, iflib_fl_t fl, int max) { /* we avoid allowing pidx to catch up with cidx as it confuses ixl */ int32_t reclaimable = fl->ifl_size - fl->ifl_credits - 1; #ifdef INVARIANTS int32_t delta = fl->ifl_size - get_inuse(fl->ifl_size, fl->ifl_cidx, fl->ifl_pidx, fl->ifl_gen) - 1; #endif MPASS(fl->ifl_credits <= fl->ifl_size); MPASS(reclaimable == delta); if (reclaimable > 0) _iflib_fl_refill(ctx, fl, min(max, reclaimable)); } uint8_t iflib_in_detach(if_ctx_t ctx) { bool in_detach; STATE_LOCK(ctx); in_detach = !!(ctx->ifc_flags & IFC_IN_DETACH); STATE_UNLOCK(ctx); return (in_detach); } static void iflib_fl_bufs_free(iflib_fl_t fl) { iflib_dma_info_t idi = fl->ifl_ifdi; bus_dmamap_t sd_map; uint32_t i; for (i = 0; i < fl->ifl_size; i++) { struct mbuf **sd_m = &fl->ifl_sds.ifsd_m[i]; caddr_t *sd_cl = &fl->ifl_sds.ifsd_cl[i]; if (*sd_cl != NULL) { sd_map = fl->ifl_sds.ifsd_map[i]; bus_dmamap_sync(fl->ifl_buf_tag, sd_map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(fl->ifl_buf_tag, sd_map); if (*sd_cl != NULL) uma_zfree(fl->ifl_zone, *sd_cl); // XXX: Should this get moved out? if (iflib_in_detach(fl->ifl_rxq->ifr_ctx)) bus_dmamap_destroy(fl->ifl_buf_tag, sd_map); if (*sd_m != NULL) { m_init(*sd_m, M_NOWAIT, MT_DATA, 0); uma_zfree(zone_mbuf, *sd_m); } } else { MPASS(*sd_cl == NULL); MPASS(*sd_m == NULL); } #if MEMORY_LOGGING fl->ifl_m_dequeued++; fl->ifl_cl_dequeued++; #endif *sd_cl = NULL; *sd_m = NULL; } #ifdef INVARIANTS for (i = 0; i < fl->ifl_size; i++) { MPASS(fl->ifl_sds.ifsd_cl[i] == NULL); MPASS(fl->ifl_sds.ifsd_m[i] == NULL); } #endif /* * Reset free list values */ fl->ifl_credits = fl->ifl_cidx = fl->ifl_pidx = fl->ifl_gen = fl->ifl_fragidx = 0; bzero(idi->idi_vaddr, idi->idi_size); } /********************************************************************* * * Initialize a receive ring and its buffers. 
* **********************************************************************/ static int iflib_fl_setup(iflib_fl_t fl) { iflib_rxq_t rxq = fl->ifl_rxq; if_ctx_t ctx = rxq->ifr_ctx; bit_nclear(fl->ifl_rx_bitmap, 0, fl->ifl_size - 1); /* ** Free current RX buffer structs and their mbufs */ iflib_fl_bufs_free(fl); /* Now replenish the mbufs */ MPASS(fl->ifl_credits == 0); fl->ifl_buf_size = ctx->ifc_rx_mbuf_sz; if (fl->ifl_buf_size > ctx->ifc_max_fl_buf_size) ctx->ifc_max_fl_buf_size = fl->ifl_buf_size; fl->ifl_cltype = m_gettype(fl->ifl_buf_size); fl->ifl_zone = m_getzone(fl->ifl_buf_size); /* avoid pre-allocating zillions of clusters to an idle card * potentially speeding up attach */ _iflib_fl_refill(ctx, fl, min(128, fl->ifl_size)); MPASS(min(128, fl->ifl_size) == fl->ifl_credits); if (min(128, fl->ifl_size) != fl->ifl_credits) return (ENOBUFS); /* * handle failure */ MPASS(rxq != NULL); MPASS(fl->ifl_ifdi != NULL); bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); return (0); } /********************************************************************* * * Free receive ring data structures * **********************************************************************/ static void iflib_rx_sds_free(iflib_rxq_t rxq) { iflib_fl_t fl; int i, j; if (rxq->ifr_fl != NULL) { for (i = 0; i < rxq->ifr_nfl; i++) { fl = &rxq->ifr_fl[i]; if (fl->ifl_buf_tag != NULL) { if (fl->ifl_sds.ifsd_map != NULL) { for (j = 0; j < fl->ifl_size; j++) { if (fl->ifl_sds.ifsd_map[j] == NULL) continue; bus_dmamap_sync( fl->ifl_buf_tag, fl->ifl_sds.ifsd_map[j], BUS_DMASYNC_POSTREAD); bus_dmamap_unload( fl->ifl_buf_tag, fl->ifl_sds.ifsd_map[j]); } } bus_dma_tag_destroy(fl->ifl_buf_tag); fl->ifl_buf_tag = NULL; } free(fl->ifl_sds.ifsd_m, M_IFLIB); free(fl->ifl_sds.ifsd_cl, M_IFLIB); free(fl->ifl_sds.ifsd_ba, M_IFLIB); free(fl->ifl_sds.ifsd_map, M_IFLIB); fl->ifl_sds.ifsd_m = NULL; fl->ifl_sds.ifsd_cl = NULL; fl->ifl_sds.ifsd_ba = NULL; fl->ifl_sds.ifsd_map = NULL; } free(rxq->ifr_fl, M_IFLIB); rxq->ifr_fl = NULL; rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; } } /* * MI independent logic * */ static void iflib_timer(void *arg) { iflib_txq_t txq = arg; if_ctx_t ctx = txq->ift_ctx; if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; uint64_t this_tick = ticks; uint32_t reset_on = hz / 2; if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; /* ** Check on the state of the TX queue(s), this ** can be done without the lock because its RO ** and the HUNG state will be static if set. 
*/ if (this_tick - txq->ift_last_timer_tick >= hz / 2) { txq->ift_last_timer_tick = this_tick; IFDI_TIMER(ctx, txq->ift_id); if ((txq->ift_qstatus == IFLIB_QUEUE_HUNG) && ((txq->ift_cleaned_prev == txq->ift_cleaned) || (sctx->isc_pause_frames == 0))) goto hung; if (ifmp_ring_is_stalled(txq->ift_br)) txq->ift_qstatus = IFLIB_QUEUE_HUNG; txq->ift_cleaned_prev = txq->ift_cleaned; } #ifdef DEV_NETMAP if (if_getcapenable(ctx->ifc_ifp) & IFCAP_NETMAP) iflib_netmap_timer_adjust(ctx, txq, &reset_on); #endif /* handle any laggards */ if (txq->ift_db_pending) GROUPTASK_ENQUEUE(&txq->ift_task); sctx->isc_pause_frames = 0; if (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) callout_reset_on(&txq->ift_timer, reset_on, iflib_timer, txq, txq->ift_timer.c_cpu); return; hung: device_printf(ctx->ifc_dev, "TX(%d) desc avail = %d, pidx = %d\n", txq->ift_id, TXQ_AVAIL(txq), txq->ift_pidx); STATE_LOCK(ctx); if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); ctx->ifc_flags |= (IFC_DO_WATCHDOG|IFC_DO_RESET); iflib_admin_intr_deferred(ctx); STATE_UNLOCK(ctx); } static void iflib_calc_rx_mbuf_sz(if_ctx_t ctx) { if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; /* * XXX don't set the max_frame_size to larger * than the hardware can handle */ if (sctx->isc_max_frame_size <= MCLBYTES) ctx->ifc_rx_mbuf_sz = MCLBYTES; else ctx->ifc_rx_mbuf_sz = MJUMPAGESIZE; } uint32_t iflib_get_rx_mbuf_sz(if_ctx_t ctx) { return (ctx->ifc_rx_mbuf_sz); } static void iflib_init_locked(if_ctx_t ctx) { if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if_t ifp = ctx->ifc_ifp; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j, tx_ip_csum_flags, tx_ip6_csum_flags; if_setdrvflagbits(ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); tx_ip_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP | CSUM_TCP | CSUM_UDP | CSUM_SCTP); tx_ip6_csum_flags = scctx->isc_tx_csum_flags & (CSUM_IP6_TCP | CSUM_IP6_UDP | CSUM_IP6_SCTP); /* Set hardware offload abilities */ if_clearhwassist(ifp); if (if_getcapenable(ifp) & IFCAP_TXCSUM) if_sethwassistbits(ifp, tx_ip_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TXCSUM_IPV6) if_sethwassistbits(ifp, tx_ip6_csum_flags, 0); if (if_getcapenable(ifp) & IFCAP_TSO4) if_sethwassistbits(ifp, CSUM_IP_TSO, 0); if (if_getcapenable(ifp) & IFCAP_TSO6) if_sethwassistbits(ifp, CSUM_IP6_TSO, 0); for (i = 0, txq = ctx->ifc_txqs; i < sctx->isc_ntxqsets; i++, txq++) { CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); iflib_netmap_txq_init(ctx, txq); } /* * Calculate a suitable Rx mbuf size prior to calling IFDI_INIT, so * that drivers can use the value when setting up the hardware receive * buffers. 
*/ iflib_calc_rx_mbuf_sz(ctx); #ifdef INVARIANTS i = if_getdrvflags(ifp); #endif IFDI_INIT(ctx); MPASS(if_getdrvflags(ifp) == i); for (i = 0, rxq = ctx->ifc_rxqs; i < sctx->isc_nrxqsets; i++, rxq++) { /* XXX this should really be done on a per-queue basis */ if (if_getcapenable(ifp) & IFCAP_NETMAP) { MPASS(rxq->ifr_id == i); iflib_netmap_rxq_init(ctx, rxq); continue; } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { if (iflib_fl_setup(fl)) { device_printf(ctx->ifc_dev, "freelist setup failed - check cluster settings\n"); goto done; } } } done: if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_RUNNING, IFF_DRV_OACTIVE); IFDI_INTR_ENABLE(ctx); txq = ctx->ifc_txqs; for (i = 0; i < sctx->isc_ntxqsets; i++, txq++) callout_reset_on(&txq->ift_timer, hz/2, iflib_timer, txq, txq->ift_timer.c_cpu); } static int iflib_media_change(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); int err; CTX_LOCK(ctx); if ((err = IFDI_MEDIA_CHANGE(ctx)) == 0) iflib_init_locked(ctx); CTX_UNLOCK(ctx); return (err); } static void iflib_media_status(if_t ifp, struct ifmediareq *ifmr) { if_ctx_t ctx = if_getsoftc(ifp); CTX_LOCK(ctx); IFDI_UPDATE_ADMIN_STATUS(ctx); IFDI_MEDIA_STATUS(ctx, ifmr); CTX_UNLOCK(ctx); } void iflib_stop(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; iflib_dma_info_t di; iflib_fl_t fl; int i, j; /* Tell the stack that the interface is no longer active */ if_setdrvflagbits(ctx->ifc_ifp, IFF_DRV_OACTIVE, IFF_DRV_RUNNING); IFDI_INTR_DISABLE(ctx); DELAY(1000); IFDI_STOP(ctx); DELAY(1000); iflib_debug_reset(); /* Wait for current tx queue users to exit to disarm watchdog timer. */ for (i = 0; i < scctx->isc_ntxqsets; i++, txq++) { /* make sure all transmitters have completed before proceeding XXX */ CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); /* clean any enqueued buffers */ iflib_ifmp_purge(txq); /* Free any existing tx buffers. */ for (j = 0; j < txq->ift_size; j++) { iflib_txsd_free(ctx, txq, j); } txq->ift_processed = txq->ift_cleaned = txq->ift_cidx_processed = 0; txq->ift_in_use = txq->ift_gen = txq->ift_cidx = txq->ift_pidx = txq->ift_no_desc_avail = 0; txq->ift_closed = txq->ift_mbuf_defrag = txq->ift_mbuf_defrag_failed = 0; txq->ift_no_tx_dma_setup = txq->ift_txd_encap_efbig = txq->ift_map_failed = 0; txq->ift_pullups = 0; ifmp_ring_reset_stats(txq->ift_br); for (j = 0, di = txq->ift_ifdi; j < sctx->isc_ntxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); } for (i = 0; i < scctx->isc_nrxqsets; i++, rxq++) { /* make sure all transmitters have completed before proceeding XXX */ rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; for (j = 0, di = rxq->ifr_ifdi; j < sctx->isc_nrxqs; j++, di++) bzero((void *)di->idi_vaddr, di->idi_size); /* also resets the free lists pidx/cidx */ for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) iflib_fl_bufs_free(fl); } } static inline caddr_t calc_next_rxd(iflib_fl_t fl, int cidx) { qidx_t size; int nrxd; caddr_t start, end, cur, next; nrxd = fl->ifl_size; size = fl->ifl_rxd_size; start = fl->ifl_ifdi->idi_vaddr; if (__predict_false(size == 0)) return (start); cur = start + size*cidx; end = start + size*nrxd; next = CACHE_PTR_NEXT(cur); return (next < end ? 
next : start); } static inline void prefetch_pkts(iflib_fl_t fl, int cidx) { int nextptr; int nrxd = fl->ifl_size; caddr_t next_rxd; nextptr = (cidx + CACHE_PTR_INCREMENT) & (nrxd-1); prefetch(&fl->ifl_sds.ifsd_m[nextptr]); prefetch(&fl->ifl_sds.ifsd_cl[nextptr]); next_rxd = calc_next_rxd(fl, cidx); prefetch(next_rxd); prefetch(fl->ifl_sds.ifsd_m[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_m[(cidx + 4) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 1) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 2) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 3) & (nrxd-1)]); prefetch(fl->ifl_sds.ifsd_cl[(cidx + 4) & (nrxd-1)]); } -static void -rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, int unload, if_rxsd_t sd) +static struct mbuf * +rxd_frag_to_sd(iflib_rxq_t rxq, if_rxd_frag_t irf, bool unload, if_rxsd_t sd, + int *pf_rv, if_rxd_info_t ri) { - int flid, cidx; bus_dmamap_t map; iflib_fl_t fl; - int next; + caddr_t payload; + struct mbuf *m; + int flid, cidx, len, next; map = NULL; flid = irf->irf_flid; cidx = irf->irf_idx; fl = &rxq->ifr_fl[flid]; sd->ifsd_fl = fl; sd->ifsd_cidx = cidx; - sd->ifsd_m = &fl->ifl_sds.ifsd_m[cidx]; + m = fl->ifl_sds.ifsd_m[cidx]; sd->ifsd_cl = &fl->ifl_sds.ifsd_cl[cidx]; fl->ifl_credits--; #if MEMORY_LOGGING fl->ifl_m_dequeued++; #endif if (rxq->ifr_ctx->ifc_flags & IFC_PREFETCH) prefetch_pkts(fl, cidx); next = (cidx + CACHE_PTR_INCREMENT) & (fl->ifl_size-1); prefetch(&fl->ifl_sds.ifsd_map[next]); map = fl->ifl_sds.ifsd_map[cidx]; next = (cidx + CACHE_LINE_SIZE) & (fl->ifl_size-1); /* not valid assert if bxe really does SGE from non-contiguous elements */ MPASS(fl->ifl_cidx == cidx); bus_dmamap_sync(fl->ifl_buf_tag, map, BUS_DMASYNC_POSTREAD); + + if (rxq->pfil != NULL && PFIL_HOOKED_IN(rxq->pfil) && pf_rv != NULL) { + payload = *sd->ifsd_cl; + payload += ri->iri_pad; + len = ri->iri_len - ri->iri_pad; + *pf_rv = pfil_run_hooks(rxq->pfil, payload, ri->iri_ifp, + len | PFIL_MEMPTR | PFIL_IN, NULL); + switch (*pf_rv) { + case PFIL_DROPPED: + case PFIL_CONSUMED: + /* + * The filter ate it. Everything is recycled. + */ + m = NULL; + unload = 0; + break; + case PFIL_REALLOCED: + /* + * The filter copied it. Everything is recycled. 
+ */ + m = pfil_mem2mbuf(payload); + unload = 0; + break; + case PFIL_PASS: + /* + * Filter said it was OK, so receive like + * normal + */ + fl->ifl_sds.ifsd_m[cidx] = NULL; + break; + default: + MPASS(0); + } + } else { + fl->ifl_sds.ifsd_m[cidx] = NULL; + *pf_rv = PFIL_PASS; + } + if (unload) bus_dmamap_unload(fl->ifl_buf_tag, map); fl->ifl_cidx = (fl->ifl_cidx + 1) & (fl->ifl_size-1); if (__predict_false(fl->ifl_cidx == 0)) fl->ifl_gen = 0; bit_clear(fl->ifl_rx_bitmap, cidx); + return (m); } static struct mbuf * -assemble_segments(iflib_rxq_t rxq, if_rxd_info_t ri, if_rxsd_t sd) +assemble_segments(iflib_rxq_t rxq, if_rxd_info_t ri, if_rxsd_t sd, int *pf_rv) { - int i, padlen , flags; struct mbuf *m, *mh, *mt; caddr_t cl; + int *pf_rv_ptr, flags, i, padlen; + bool consumed; i = 0; mh = NULL; + consumed = false; + *pf_rv = PFIL_PASS; + pf_rv_ptr = pf_rv; do { - rxd_frag_to_sd(rxq, &ri->iri_frags[i], TRUE, sd); + m = rxd_frag_to_sd(rxq, &ri->iri_frags[i], !consumed, sd, + pf_rv_ptr, ri); MPASS(*sd->ifsd_cl != NULL); - MPASS(*sd->ifsd_m != NULL); - /* Don't include zero-length frags */ - if (ri->iri_frags[i].irf_len == 0) { + /* + * Exclude zero-length frags & frags from + * packets the filter has consumed or dropped + */ + if (ri->iri_frags[i].irf_len == 0 || consumed || + *pf_rv == PFIL_CONSUMED || *pf_rv == PFIL_DROPPED) { + if (mh == NULL) { + /* everything saved here */ + consumed = true; + pf_rv_ptr = NULL; + continue; + } /* XXX we can save the cluster here, but not the mbuf */ - m_init(*sd->ifsd_m, M_NOWAIT, MT_DATA, 0); - m_free(*sd->ifsd_m); - *sd->ifsd_m = NULL; + m_init(m, M_NOWAIT, MT_DATA, 0); + m_free(m); continue; } - m = *sd->ifsd_m; - *sd->ifsd_m = NULL; if (mh == NULL) { flags = M_PKTHDR|M_EXT; mh = mt = m; padlen = ri->iri_pad; } else { flags = M_EXT; mt->m_next = m; mt = m; /* assuming padding is only on the first fragment */ padlen = 0; } cl = *sd->ifsd_cl; *sd->ifsd_cl = NULL; /* Can these two be made one ? */ m_init(m, M_NOWAIT, MT_DATA, flags); m_cljset(m, cl, sd->ifsd_fl->ifl_cltype); /* * These must follow m_init and m_cljset */ m->m_data += padlen; ri->iri_len -= padlen; m->m_len = ri->iri_frags[i].irf_len; } while (++i < ri->iri_nfrags); return (mh); } /* * Process one software descriptor */ static struct mbuf * iflib_rxd_pkt_get(iflib_rxq_t rxq, if_rxd_info_t ri) { struct if_rxsd sd; struct mbuf *m; + int pf_rv; /* should I merge this back in now that the two paths are basically duplicated? 
*/ if (ri->iri_nfrags == 1 && ri->iri_frags[0].irf_len <= MIN(IFLIB_RX_COPY_THRESH, MHLEN)) { - rxd_frag_to_sd(rxq, &ri->iri_frags[0], FALSE, &sd); - m = *sd.ifsd_m; - *sd.ifsd_m = NULL; - m_init(m, M_NOWAIT, MT_DATA, M_PKTHDR); + m = rxd_frag_to_sd(rxq, &ri->iri_frags[0], false, &sd, + &pf_rv, ri); + if (pf_rv != PFIL_PASS && pf_rv != PFIL_REALLOCED) + return (m); + if (pf_rv == PFIL_PASS) { + m_init(m, M_NOWAIT, MT_DATA, M_PKTHDR); #ifndef __NO_STRICT_ALIGNMENT - if (!IP_ALIGNED(m)) - m->m_data += 2; + if (!IP_ALIGNED(m)) + m->m_data += 2; #endif - memcpy(m->m_data, *sd.ifsd_cl, ri->iri_len); - m->m_len = ri->iri_frags[0].irf_len; - } else { - m = assemble_segments(rxq, ri, &sd); + memcpy(m->m_data, *sd.ifsd_cl, ri->iri_len); + m->m_len = ri->iri_frags[0].irf_len; + } + } else { + m = assemble_segments(rxq, ri, &sd, &pf_rv); + if (pf_rv != PFIL_PASS && pf_rv != PFIL_REALLOCED) + return (m); } m->m_pkthdr.len = ri->iri_len; m->m_pkthdr.rcvif = ri->iri_ifp; m->m_flags |= ri->iri_flags; m->m_pkthdr.ether_vtag = ri->iri_vtag; m->m_pkthdr.flowid = ri->iri_flowid; M_HASHTYPE_SET(m, ri->iri_rsstype); m->m_pkthdr.csum_flags = ri->iri_csum_flags; m->m_pkthdr.csum_data = ri->iri_csum_data; return (m); } #if defined(INET6) || defined(INET) static void iflib_get_ip_forwarding(struct lro_ctrl *lc, bool *v4, bool *v6) { CURVNET_SET(lc->ifp->if_vnet); #if defined(INET6) *v6 = VNET(ip6_forwarding); #endif #if defined(INET) *v4 = VNET(ipforwarding); #endif CURVNET_RESTORE(); } /* * Returns true if it's possible this packet could be LROed. * if it returns false, it is guaranteed that tcp_lro_rx() * would not return zero. */ static bool iflib_check_lro_possible(struct mbuf *m, bool v4_forwarding, bool v6_forwarding) { struct ether_header *eh; uint16_t eh_type; eh = mtod(m, struct ether_header *); eh_type = ntohs(eh->ether_type); switch (eh_type) { #if defined(INET6) case ETHERTYPE_IPV6: return !v6_forwarding; #endif #if defined (INET) case ETHERTYPE_IP: return !v4_forwarding; #endif } return false; } #else static void iflib_get_ip_forwarding(struct lro_ctrl *lc __unused, bool *v4 __unused, bool *v6 __unused) { } #endif static bool iflib_rxeof(iflib_rxq_t rxq, qidx_t budget) { if_ctx_t ctx = rxq->ifr_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; int avail, i; qidx_t *cidxp; struct if_rxd_info ri; int err, budget_left, rx_bytes, rx_pkts; iflib_fl_t fl; struct ifnet *ifp; int lro_enabled; bool v4_forwarding, v6_forwarding, lro_possible; /* * XXX early demux data packets so that if_input processing only handles * acks in interrupt context */ struct mbuf *m, *mh, *mt, *mf; lro_possible = v4_forwarding = v6_forwarding = false; ifp = ctx->ifc_ifp; mh = mt = NULL; MPASS(budget > 0); rx_pkts = rx_bytes = 0; if (sctx->isc_flags & IFLIB_HAS_RXCQ) cidxp = &rxq->ifr_cq_cidx; else cidxp = &rxq->ifr_fl[0].ifl_cidx; if ((avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget)) == 0) { for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); DBG_COUNTER_INC(rx_unavail); return (false); } + /* pfil needs the vnet to be set */ + CURVNET_SET_QUIET(ifp->if_vnet); for (budget_left = budget; budget_left > 0 && avail > 0;) { if (__predict_false(!CTX_ACTIVE(ctx))) { DBG_COUNTER_INC(rx_ctx_inactive); break; } /* * Reset client set fields to their default values */ rxd_info_zero(&ri); ri.iri_qsidx = rxq->ifr_id; ri.iri_cidx = *cidxp; ri.iri_ifp = ifp; ri.iri_frags = rxq->ifr_frags; err = ctx->isc_rxd_pkt_get(ctx->ifc_softc, &ri); if (err) goto err; 
+ rx_pkts += 1; + rx_bytes += ri.iri_len; if (sctx->isc_flags & IFLIB_HAS_RXCQ) { *cidxp = ri.iri_cidx; /* Update our consumer index */ /* XXX NB: shurd - check if this is still safe */ while (rxq->ifr_cq_cidx >= scctx->isc_nrxd[0]) { rxq->ifr_cq_cidx -= scctx->isc_nrxd[0]; rxq->ifr_cq_gen = 0; } /* was this only a completion queue message? */ if (__predict_false(ri.iri_nfrags == 0)) continue; } MPASS(ri.iri_nfrags != 0); MPASS(ri.iri_len != 0); /* will advance the cidx on the corresponding free lists */ m = iflib_rxd_pkt_get(rxq, &ri); avail--; budget_left--; if (avail == 0 && budget_left) avail = iflib_rxd_avail(ctx, rxq, *cidxp, budget_left); - if (__predict_false(m == NULL)) { - DBG_COUNTER_INC(rx_mbuf_null); + if (__predict_false(m == NULL)) continue; - } + /* imm_pkt: -- cxgb */ if (mh == NULL) mh = mt = m; else { mt->m_nextpkt = m; mt = m; } } + CURVNET_RESTORE(); /* make sure that we can refill faster than drain */ for (i = 0, fl = &rxq->ifr_fl[0]; i < sctx->isc_nfl; i++, fl++) __iflib_fl_refill_lt(ctx, fl, budget + 8); lro_enabled = (if_getcapenable(ifp) & IFCAP_LRO); if (lro_enabled) iflib_get_ip_forwarding(&rxq->ifr_lc, &v4_forwarding, &v6_forwarding); mt = mf = NULL; while (mh != NULL) { m = mh; mh = mh->m_nextpkt; m->m_nextpkt = NULL; #ifndef __NO_STRICT_ALIGNMENT if (!IP_ALIGNED(m) && (m = iflib_fixup_rx(m)) == NULL) continue; #endif rx_bytes += m->m_pkthdr.len; rx_pkts++; #if defined(INET6) || defined(INET) if (lro_enabled) { if (!lro_possible) { lro_possible = iflib_check_lro_possible(m, v4_forwarding, v6_forwarding); if (lro_possible && mf != NULL) { ifp->if_input(ifp, mf); DBG_COUNTER_INC(rx_if_input); mt = mf = NULL; } } if ((m->m_pkthdr.csum_flags & (CSUM_L4_CALC|CSUM_L4_VALID)) == (CSUM_L4_CALC|CSUM_L4_VALID)) { if (lro_possible && tcp_lro_rx(&rxq->ifr_lc, m, 0) == 0) continue; } } #endif if (lro_possible) { ifp->if_input(ifp, m); DBG_COUNTER_INC(rx_if_input); continue; } if (mf == NULL) mf = m; if (mt != NULL) mt->m_nextpkt = m; mt = m; } if (mf != NULL) { ifp->if_input(ifp, mf); DBG_COUNTER_INC(rx_if_input); } if_inc_counter(ifp, IFCOUNTER_IBYTES, rx_bytes); if_inc_counter(ifp, IFCOUNTER_IPACKETS, rx_pkts); /* * Flush any outstanding LRO work */ #if defined(INET6) || defined(INET) tcp_lro_flush_all(&rxq->ifr_lc); #endif if (avail) return true; return (iflib_rxd_avail(ctx, rxq, *cidxp, 1)); err: STATE_LOCK(ctx); ctx->ifc_flags |= IFC_DO_RESET; iflib_admin_intr_deferred(ctx); STATE_UNLOCK(ctx); return (false); } #define TXD_NOTIFY_COUNT(txq) (((txq)->ift_size / (txq)->ift_update_freq)-1) static inline qidx_t txq_max_db_deferred(iflib_txq_t txq, qidx_t in_use) { qidx_t notify_count = TXD_NOTIFY_COUNT(txq); qidx_t minthresh = txq->ift_size / 8; if (in_use > 4*minthresh) return (notify_count); if (in_use > 2*minthresh) return (notify_count >> 1); if (in_use > minthresh) return (notify_count >> 3); return (0); } static inline qidx_t txq_max_rs_deferred(iflib_txq_t txq) { qidx_t notify_count = TXD_NOTIFY_COUNT(txq); qidx_t minthresh = txq->ift_size / 8; if (txq->ift_in_use > 4*minthresh) return (notify_count); if (txq->ift_in_use > 2*minthresh) return (notify_count >> 1); if (txq->ift_in_use > minthresh) return (notify_count >> 2); return (2); } #define M_CSUM_FLAGS(m) ((m)->m_pkthdr.csum_flags) #define M_HAS_VLANTAG(m) (m->m_flags & M_VLANTAG) #define TXQ_MAX_DB_DEFERRED(txq, in_use) txq_max_db_deferred((txq), (in_use)) #define TXQ_MAX_RS_DEFERRED(txq) txq_max_rs_deferred(txq) #define TXQ_MAX_DB_CONSUMED(size) (size >> 4) /* forward compatibility for cxgb */ #define 
FIRST_QSET(ctx) 0 #define NTXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_ntxqsets) #define NRXQSETS(ctx) ((ctx)->ifc_softc_ctx.isc_nrxqsets) #define QIDX(ctx, m) ((((m)->m_pkthdr.flowid & ctx->ifc_softc_ctx.isc_rss_table_mask) % NTXQSETS(ctx)) + FIRST_QSET(ctx)) #define DESC_RECLAIMABLE(q) ((int)((q)->ift_processed - (q)->ift_cleaned - (q)->ift_ctx->ifc_softc_ctx.isc_tx_nsegments)) /* XXX we should be setting this to something other than zero */ #define RECLAIM_THRESH(ctx) ((ctx)->ifc_sctx->isc_tx_reclaim_thresh) #define MAX_TX_DESC(ctx) max((ctx)->ifc_softc_ctx.isc_tx_tso_segments_max, \ (ctx)->ifc_softc_ctx.isc_tx_nsegments) static inline bool iflib_txd_db_check(if_ctx_t ctx, iflib_txq_t txq, int ring, qidx_t in_use) { qidx_t dbval, max; bool rang; rang = false; max = TXQ_MAX_DB_DEFERRED(txq, in_use); if (ring || txq->ift_db_pending >= max) { dbval = txq->ift_npending ? txq->ift_npending : txq->ift_pidx; bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); ctx->isc_txd_flush(ctx->ifc_softc, txq->ift_id, dbval); txq->ift_db_pending = txq->ift_npending = 0; rang = true; } return (rang); } #ifdef PKT_DEBUG static void print_pkt(if_pkt_info_t pi) { printf("pi len: %d qsidx: %d nsegs: %d ndescs: %d flags: %x pidx: %d\n", pi->ipi_len, pi->ipi_qsidx, pi->ipi_nsegs, pi->ipi_ndescs, pi->ipi_flags, pi->ipi_pidx); printf("pi new_pidx: %d csum_flags: %lx tso_segsz: %d mflags: %x vtag: %d\n", pi->ipi_new_pidx, pi->ipi_csum_flags, pi->ipi_tso_segsz, pi->ipi_mflags, pi->ipi_vtag); printf("pi etype: %d ehdrlen: %d ip_hlen: %d ipproto: %d\n", pi->ipi_etype, pi->ipi_ehdrlen, pi->ipi_ip_hlen, pi->ipi_ipproto); } #endif #define IS_TSO4(pi) ((pi)->ipi_csum_flags & CSUM_IP_TSO) #define IS_TX_OFFLOAD4(pi) ((pi)->ipi_csum_flags & (CSUM_IP_TCP | CSUM_IP_TSO)) #define IS_TSO6(pi) ((pi)->ipi_csum_flags & CSUM_IP6_TSO) #define IS_TX_OFFLOAD6(pi) ((pi)->ipi_csum_flags & (CSUM_IP6_TCP | CSUM_IP6_TSO)) static int iflib_parse_header(iflib_txq_t txq, if_pkt_info_t pi, struct mbuf **mp) { if_shared_ctx_t sctx = txq->ift_ctx->ifc_sctx; struct ether_vlan_header *eh; struct mbuf *m; m = *mp; if ((sctx->isc_flags & IFLIB_NEED_SCRATCH) && M_WRITABLE(m) == 0) { if ((m = m_dup(m, M_NOWAIT)) == NULL) { return (ENOMEM); } else { m_freem(*mp); DBG_COUNTER_INC(tx_frees); *mp = m; } } /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. 
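	 * For example, an 802.1Q-tagged IPv4 frame ends up with ipi_etype ==
	 * ETHERTYPE_IP and ipi_ehdrlen == ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN
	 * (14 + 4 = 18 bytes), while an untagged frame gets ipi_ehdrlen ==
	 * ETHER_HDR_LEN (14 bytes).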
*/ if (__predict_false(m->m_len < sizeof(*eh))) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, sizeof(*eh))) == NULL)) return (ENOMEM); } eh = mtod(m, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { pi->ipi_etype = ntohs(eh->evl_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { pi->ipi_etype = ntohs(eh->evl_encap_proto); pi->ipi_ehdrlen = ETHER_HDR_LEN; } switch (pi->ipi_etype) { #ifdef INET case ETHERTYPE_IP: { struct mbuf *n; struct ip *ip = NULL; struct tcphdr *th = NULL; int minthlen; minthlen = min(m->m_pkthdr.len, pi->ipi_ehdrlen + sizeof(*ip) + sizeof(*th)); if (__predict_false(m->m_len < minthlen)) { /* * if this code bloat is causing too much of a hit * move it to a separate function and mark it noinline */ if (m->m_len == pi->ipi_ehdrlen) { n = m->m_next; MPASS(n); if (n->m_len >= sizeof(*ip)) { ip = (struct ip *)n->m_data; if (n->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); } } else { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, minthlen)) == NULL)) return (ENOMEM); ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } } else { ip = (struct ip *)(m->m_data + pi->ipi_ehdrlen); if (m->m_len >= (ip->ip_hl << 2) + sizeof(*th)) th = (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2)); } pi->ipi_ip_hlen = ip->ip_hl << 2; pi->ipi_ipproto = ip->ip_p; pi->ipi_flags |= IPI_TX_IPV4; /* TCP checksum offload may require TCP header length */ if (IS_TX_OFFLOAD4(pi)) { if (__predict_true(pi->ipi_ipproto == IPPROTO_TCP)) { if (__predict_false(th == NULL)) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, (ip->ip_hl << 2) + sizeof(*th))) == NULL)) return (ENOMEM); th = (struct tcphdr *)((caddr_t)ip + pi->ipi_ip_hlen); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; pi->ipi_tcp_seq = th->th_seq; } if (IS_TSO4(pi)) { if (__predict_false(ip->ip_p != IPPROTO_TCP)) return (ENXIO); /* * TSO always requires hardware checksum offload. */ pi->ipi_csum_flags |= (CSUM_IP_TCP | CSUM_IP); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; if (sctx->isc_flags & IFLIB_TSO_INIT_IP) { ip->ip_sum = 0; ip->ip_len = htons(pi->ipi_ip_hlen + pi->ipi_tcp_hlen + pi->ipi_tso_segsz); } } } if ((sctx->isc_flags & IFLIB_NEED_ZERO_CSUM) && (pi->ipi_csum_flags & CSUM_IP)) ip->ip_sum = 0; break; } #endif #ifdef INET6 case ETHERTYPE_IPV6: { struct ip6_hdr *ip6 = (struct ip6_hdr *)(m->m_data + pi->ipi_ehdrlen); struct tcphdr *th; pi->ipi_ip_hlen = sizeof(struct ip6_hdr); if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr))) == NULL)) return (ENOMEM); } th = (struct tcphdr *)((caddr_t)ip6 + pi->ipi_ip_hlen); /* XXX-BZ this will go badly in case of ext hdrs. 
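	 * (th above is taken at a fixed sizeof(struct ip6_hdr) offset and
	 * ip6_nxt below would name the extension header rather than the final
	 * protocol, so both ipi_ipproto and the TCP fields would be wrong.)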
*/ pi->ipi_ipproto = ip6->ip6_nxt; pi->ipi_flags |= IPI_TX_IPV6; /* TCP checksum offload may require TCP header length */ if (IS_TX_OFFLOAD6(pi)) { if (pi->ipi_ipproto == IPPROTO_TCP) { if (__predict_false(m->m_len < pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) { txq->ift_pullups++; if (__predict_false((m = m_pullup(m, pi->ipi_ehdrlen + sizeof(struct ip6_hdr) + sizeof(struct tcphdr))) == NULL)) return (ENOMEM); } pi->ipi_tcp_hflags = th->th_flags; pi->ipi_tcp_hlen = th->th_off << 2; pi->ipi_tcp_seq = th->th_seq; } if (IS_TSO6(pi)) { if (__predict_false(ip6->ip6_nxt != IPPROTO_TCP)) return (ENXIO); /* * TSO always requires hardware checksum offload. */ pi->ipi_csum_flags |= CSUM_IP6_TCP; th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); pi->ipi_tso_segsz = m->m_pkthdr.tso_segsz; } } break; } #endif default: pi->ipi_csum_flags &= ~CSUM_OFFLOAD; pi->ipi_ip_hlen = 0; break; } *mp = m; return (0); } /* * If dodgy hardware rejects the scatter gather chain we've handed it * we'll need to remove the mbuf chain from ifsg_m[] before we can add the * m_defrag'd mbufs */ static __noinline struct mbuf * iflib_remove_mbuf(iflib_txq_t txq) { int ntxd, pidx; struct mbuf *m, **ifsd_m; ifsd_m = txq->ift_sds.ifsd_m; ntxd = txq->ift_size; pidx = txq->ift_pidx & (ntxd - 1); ifsd_m = txq->ift_sds.ifsd_m; m = ifsd_m[pidx]; ifsd_m[pidx] = NULL; bus_dmamap_unload(txq->ift_buf_tag, txq->ift_sds.ifsd_map[pidx]); if (txq->ift_sds.ifsd_tso_map != NULL) bus_dmamap_unload(txq->ift_tso_buf_tag, txq->ift_sds.ifsd_tso_map[pidx]); #if MEMORY_LOGGING txq->ift_dequeued++; #endif return (m); } static inline caddr_t calc_next_txd(iflib_txq_t txq, int cidx, uint8_t qid) { qidx_t size; int ntxd; caddr_t start, end, cur, next; ntxd = txq->ift_size; size = txq->ift_txd_size[qid]; start = txq->ift_ifdi[qid].idi_vaddr; if (__predict_false(size == 0)) return (start); cur = start + size*cidx; end = start + size*ntxd; next = CACHE_PTR_NEXT(cur); return (next < end ? next : start); } /* * Pad an mbuf to ensure a minimum ethernet frame size. 
* min_frame_size is the frame size (less CRC) to pad the mbuf to */ static __noinline int iflib_ether_pad(device_t dev, struct mbuf **m_head, uint16_t min_frame_size) { /* * 18 is enough bytes to pad an ARP packet to 46 bytes, and * and ARP message is the smallest common payload I can think of */ static char pad[18]; /* just zeros */ int n; struct mbuf *new_head; if (!M_WRITABLE(*m_head)) { new_head = m_dup(*m_head, M_NOWAIT); if (new_head == NULL) { m_freem(*m_head); device_printf(dev, "cannot pad short frame, m_dup() failed"); DBG_COUNTER_INC(encap_pad_mbuf_fail); DBG_COUNTER_INC(tx_frees); return ENOMEM; } m_freem(*m_head); *m_head = new_head; } for (n = min_frame_size - (*m_head)->m_pkthdr.len; n > 0; n -= sizeof(pad)) if (!m_append(*m_head, min(n, sizeof(pad)), pad)) break; if (n > 0) { m_freem(*m_head); device_printf(dev, "cannot pad short frame\n"); DBG_COUNTER_INC(encap_pad_mbuf_fail); DBG_COUNTER_INC(tx_frees); return (ENOBUFS); } return 0; } static int iflib_encap(iflib_txq_t txq, struct mbuf **m_headp) { if_ctx_t ctx; if_shared_ctx_t sctx; if_softc_ctx_t scctx; bus_dma_tag_t buf_tag; bus_dma_segment_t *segs; struct mbuf *m_head, **ifsd_m; void *next_txd; bus_dmamap_t map; struct if_pkt_info pi; int remap = 0; int err, nsegs, ndesc, max_segs, pidx, cidx, next, ntxd; ctx = txq->ift_ctx; sctx = ctx->ifc_sctx; scctx = &ctx->ifc_softc_ctx; segs = txq->ift_segs; ntxd = txq->ift_size; m_head = *m_headp; map = NULL; /* * If we're doing TSO the next descriptor to clean may be quite far ahead */ cidx = txq->ift_cidx; pidx = txq->ift_pidx; if (ctx->ifc_flags & IFC_PREFETCH) { next = (cidx + CACHE_PTR_INCREMENT) & (ntxd-1); if (!(ctx->ifc_flags & IFLIB_HAS_TXCQ)) { next_txd = calc_next_txd(txq, cidx, 0); prefetch(next_txd); } /* prefetch the next cache line of mbuf pointers and flags */ prefetch(&txq->ift_sds.ifsd_m[next]); prefetch(&txq->ift_sds.ifsd_map[next]); next = (cidx + CACHE_LINE_SIZE) & (ntxd-1); } map = txq->ift_sds.ifsd_map[pidx]; ifsd_m = txq->ift_sds.ifsd_m; if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { buf_tag = txq->ift_tso_buf_tag; max_segs = scctx->isc_tx_tso_segments_max; map = txq->ift_sds.ifsd_tso_map[pidx]; MPASS(buf_tag != NULL); MPASS(max_segs > 0); } else { buf_tag = txq->ift_buf_tag; max_segs = scctx->isc_tx_nsegments; map = txq->ift_sds.ifsd_map[pidx]; } if ((sctx->isc_flags & IFLIB_NEED_ETHER_PAD) && __predict_false(m_head->m_pkthdr.len < scctx->isc_min_frame_size)) { err = iflib_ether_pad(ctx->ifc_dev, m_headp, scctx->isc_min_frame_size); if (err) { DBG_COUNTER_INC(encap_txd_encap_fail); return err; } } m_head = *m_headp; pkt_info_zero(&pi); pi.ipi_mflags = (m_head->m_flags & (M_VLANTAG|M_BCAST|M_MCAST)); pi.ipi_pidx = pidx; pi.ipi_qsidx = txq->ift_id; pi.ipi_len = m_head->m_pkthdr.len; pi.ipi_csum_flags = m_head->m_pkthdr.csum_flags; pi.ipi_vtag = (m_head->m_flags & M_VLANTAG) ? 
m_head->m_pkthdr.ether_vtag : 0; /* deliberate bitwise OR to make one condition */ if (__predict_true((pi.ipi_csum_flags | pi.ipi_vtag))) { if (__predict_false((err = iflib_parse_header(txq, &pi, m_headp)) != 0)) { DBG_COUNTER_INC(encap_txd_encap_fail); return (err); } m_head = *m_headp; } retry: err = bus_dmamap_load_mbuf_sg(buf_tag, map, m_head, segs, &nsegs, BUS_DMA_NOWAIT); defrag: if (__predict_false(err)) { switch (err) { case EFBIG: /* try collapse once and defrag once */ if (remap == 0) { m_head = m_collapse(*m_headp, M_NOWAIT, max_segs); /* try defrag if collapsing fails */ if (m_head == NULL) remap++; } if (remap == 1) { txq->ift_mbuf_defrag++; m_head = m_defrag(*m_headp, M_NOWAIT); } /* * remap should never be >1 unless bus_dmamap_load_mbuf_sg * failed to map an mbuf that was run through m_defrag */ MPASS(remap <= 1); if (__predict_false(m_head == NULL || remap > 1)) goto defrag_failed; remap++; *m_headp = m_head; goto retry; break; case ENOMEM: txq->ift_no_tx_dma_setup++; break; default: txq->ift_no_tx_dma_setup++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; break; } txq->ift_map_failed++; DBG_COUNTER_INC(encap_load_mbuf_fail); DBG_COUNTER_INC(encap_txd_encap_fail); return (err); } ifsd_m[pidx] = m_head; /* * XXX assumes a 1 to 1 relationship between segments and * descriptors - this does not hold true on all drivers, e.g. * cxgb */ if (__predict_false(nsegs + 2 > TXQ_AVAIL(txq))) { txq->ift_no_desc_avail++; bus_dmamap_unload(buf_tag, map); DBG_COUNTER_INC(encap_txq_avail_fail); DBG_COUNTER_INC(encap_txd_encap_fail); if ((txq->ift_task.gt_task.ta_flags & TASK_ENQUEUED) == 0) GROUPTASK_ENQUEUE(&txq->ift_task); return (ENOBUFS); } /* * On Intel cards we can greatly reduce the number of TX interrupts * we see by only setting report status on every Nth descriptor. * However, this also means that the driver will need to keep track * of the descriptors that RS was set on to check them for the DD bit. */ txq->ift_rs_pending += nsegs + 1; if (txq->ift_rs_pending > TXQ_MAX_RS_DEFERRED(txq) || iflib_no_tx_batch || (TXQ_AVAIL(txq) - nsegs) <= MAX_TX_DESC(ctx) + 2) { pi.ipi_flags |= IPI_TX_INTR; txq->ift_rs_pending = 0; } pi.ipi_segs = segs; pi.ipi_nsegs = nsegs; MPASS(pidx >= 0 && pidx < txq->ift_size); #ifdef PKT_DEBUG print_pkt(&pi); #endif if ((err = ctx->isc_txd_encap(ctx->ifc_softc, &pi)) == 0) { bus_dmamap_sync(buf_tag, map, BUS_DMASYNC_PREWRITE); DBG_COUNTER_INC(tx_encap); MPASS(pi.ipi_new_pidx < txq->ift_size); ndesc = pi.ipi_new_pidx - pi.ipi_pidx; if (pi.ipi_new_pidx < pi.ipi_pidx) { ndesc += txq->ift_size; txq->ift_gen = 1; } /* * drivers can need as many as * two sentinels */ MPASS(ndesc <= pi.ipi_nsegs + 2); MPASS(pi.ipi_new_pidx != pidx); MPASS(ndesc > 0); txq->ift_in_use += ndesc; /* * We update the last software descriptor again here because there may * be a sentinel and/or there may be more mbufs than segments */ txq->ift_pidx = pi.ipi_new_pidx; txq->ift_npending += pi.ipi_ndescs; } else { *m_headp = m_head = iflib_remove_mbuf(txq); if (err == EFBIG) { txq->ift_txd_encap_efbig++; if (remap < 2) { remap = 1; goto defrag; } } goto defrag_failed; } /* * err can't possibly be non-zero here, so we don't neet to test it * to see if we need to DBG_COUNTER_INC(encap_txd_encap_fail). 
*/ return (err); defrag_failed: txq->ift_mbuf_defrag_failed++; txq->ift_map_failed++; m_freem(*m_headp); DBG_COUNTER_INC(tx_frees); *m_headp = NULL; DBG_COUNTER_INC(encap_txd_encap_fail); return (ENOMEM); } static void iflib_tx_desc_free(iflib_txq_t txq, int n) { uint32_t qsize, cidx, mask, gen; struct mbuf *m, **ifsd_m; bool do_prefetch; cidx = txq->ift_cidx; gen = txq->ift_gen; qsize = txq->ift_size; mask = qsize-1; ifsd_m = txq->ift_sds.ifsd_m; do_prefetch = (txq->ift_ctx->ifc_flags & IFC_PREFETCH); while (n-- > 0) { if (do_prefetch) { prefetch(ifsd_m[(cidx + 3) & mask]); prefetch(ifsd_m[(cidx + 4) & mask]); } if ((m = ifsd_m[cidx]) != NULL) { prefetch(&ifsd_m[(cidx + CACHE_PTR_INCREMENT) & mask]); if (m->m_pkthdr.csum_flags & CSUM_TSO) { bus_dmamap_sync(txq->ift_tso_buf_tag, txq->ift_sds.ifsd_tso_map[cidx], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_tso_buf_tag, txq->ift_sds.ifsd_tso_map[cidx]); } else { bus_dmamap_sync(txq->ift_buf_tag, txq->ift_sds.ifsd_map[cidx], BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txq->ift_buf_tag, txq->ift_sds.ifsd_map[cidx]); } /* XXX we don't support any drivers that batch packets yet */ MPASS(m->m_nextpkt == NULL); m_freem(m); ifsd_m[cidx] = NULL; #if MEMORY_LOGGING txq->ift_dequeued++; #endif DBG_COUNTER_INC(tx_frees); } if (__predict_false(++cidx == qsize)) { cidx = 0; gen = 0; } } txq->ift_cidx = cidx; txq->ift_gen = gen; } static __inline int iflib_completed_tx_reclaim(iflib_txq_t txq, int thresh) { int reclaim; if_ctx_t ctx = txq->ift_ctx; KASSERT(thresh >= 0, ("invalid threshold to reclaim")); MPASS(thresh /*+ MAX_TX_DESC(txq->ift_ctx) */ < txq->ift_size); /* * Need a rate-limiting check so that this isn't called every time */ iflib_tx_credits_update(ctx, txq); reclaim = DESC_RECLAIMABLE(txq); if (reclaim <= thresh /* + MAX_TX_DESC(txq->ift_ctx) */) { #ifdef INVARIANTS if (iflib_verbose_debug) { printf("%s processed=%ju cleaned=%ju tx_nsegments=%d reclaim=%d thresh=%d\n", __FUNCTION__, txq->ift_processed, txq->ift_cleaned, txq->ift_ctx->ifc_softc_ctx.isc_tx_nsegments, reclaim, thresh); } #endif return (0); } iflib_tx_desc_free(txq, reclaim); txq->ift_cleaned += reclaim; txq->ift_in_use -= reclaim; return (reclaim); } static struct mbuf ** _ring_peek_one(struct ifmp_ring *r, int cidx, int offset, int remaining) { int next, size; struct mbuf **items; size = r->size; next = (cidx + CACHE_PTR_INCREMENT) & (size-1); items = __DEVOLATILE(struct mbuf **, &r->items[0]); prefetch(items[(cidx + offset) & (size-1)]); if (remaining > 1) { prefetch2cachelines(&items[next]); prefetch2cachelines(items[(cidx + offset + 1) & (size-1)]); prefetch2cachelines(items[(cidx + offset + 2) & (size-1)]); prefetch2cachelines(items[(cidx + offset + 3) & (size-1)]); } return (__DEVOLATILE(struct mbuf **, &r->items[(cidx + offset) & (size-1)])); } static void iflib_txq_check_drain(iflib_txq_t txq, int budget) { ifmp_ring_check_drainage(txq->ift_br, budget); } static uint32_t iflib_txq_can_drain(struct ifmp_ring *r) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; if (TXQ_AVAIL(txq) > MAX_TX_DESC(ctx) + 2) return (1); bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD); return (ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, false)); } static uint32_t iflib_txq_drain(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { iflib_txq_t txq = r->cookie; if_ctx_t ctx = txq->ift_ctx; struct ifnet *ifp = ctx->ifc_ifp; struct mbuf **mp, *m; int i, count, consumed, pkt_sent, bytes_sent, mcast_sent, avail; int reclaimed, err, 
in_use_prev, desc_used; bool do_prefetch, ring, rang; if (__predict_false(!(if_getdrvflags(ifp) & IFF_DRV_RUNNING) || !LINK_ACTIVE(ctx))) { DBG_COUNTER_INC(txq_drain_notready); return (0); } reclaimed = iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); rang = iflib_txd_db_check(ctx, txq, reclaimed, txq->ift_in_use); avail = IDXDIFF(pidx, cidx, r->size); if (__predict_false(ctx->ifc_flags & IFC_QFLUSH)) { DBG_COUNTER_INC(txq_drain_flushing); for (i = 0; i < avail; i++) { if (__predict_true(r->items[(cidx + i) & (r->size-1)] != (void *)txq)) m_free(r->items[(cidx + i) & (r->size-1)]); r->items[(cidx + i) & (r->size-1)] = NULL; } return (avail); } if (__predict_false(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE)) { txq->ift_qstatus = IFLIB_QUEUE_IDLE; CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); DBG_COUNTER_INC(txq_drain_oactive); return (0); } if (reclaimed) txq->ift_qstatus = IFLIB_QUEUE_IDLE; consumed = mcast_sent = bytes_sent = pkt_sent = 0; count = MIN(avail, TX_BATCH_SIZE); #ifdef INVARIANTS if (iflib_verbose_debug) printf("%s avail=%d ifc_flags=%x txq_avail=%d ", __FUNCTION__, avail, ctx->ifc_flags, TXQ_AVAIL(txq)); #endif do_prefetch = (ctx->ifc_flags & IFC_PREFETCH); avail = TXQ_AVAIL(txq); err = 0; for (desc_used = i = 0; i < count && avail > MAX_TX_DESC(ctx) + 2; i++) { int rem = do_prefetch ? count - i : 0; mp = _ring_peek_one(r, cidx, i, rem); MPASS(mp != NULL && *mp != NULL); if (__predict_false(*mp == (struct mbuf *)txq)) { consumed++; reclaimed++; continue; } in_use_prev = txq->ift_in_use; err = iflib_encap(txq, mp); if (__predict_false(err)) { /* no room - bail out */ if (err == ENOBUFS) break; consumed++; /* we can't send this packet - skip it */ continue; } consumed++; pkt_sent++; m = *mp; DBG_COUNTER_INC(tx_sent); bytes_sent += m->m_pkthdr.len; mcast_sent += !!(m->m_flags & M_MCAST); avail = TXQ_AVAIL(txq); txq->ift_db_pending += (txq->ift_in_use - in_use_prev); desc_used += (txq->ift_in_use - in_use_prev); ETHER_BPF_MTAP(ifp, m); if (__predict_false(!(ifp->if_drv_flags & IFF_DRV_RUNNING))) break; rang = iflib_txd_db_check(ctx, txq, false, in_use_prev); } /* deliberate use of bitwise or to avoid gratuitous short-circuit */ ring = rang ? 
false : (iflib_min_tx_latency | err) || (TXQ_AVAIL(txq) < MAX_TX_DESC(ctx)); iflib_txd_db_check(ctx, txq, ring, txq->ift_in_use); if_inc_counter(ifp, IFCOUNTER_OBYTES, bytes_sent); if_inc_counter(ifp, IFCOUNTER_OPACKETS, pkt_sent); if (mcast_sent) if_inc_counter(ifp, IFCOUNTER_OMCASTS, mcast_sent); #ifdef INVARIANTS if (iflib_verbose_debug) printf("consumed=%d\n", consumed); #endif return (consumed); } static uint32_t iflib_txq_drain_always(struct ifmp_ring *r) { return (1); } static uint32_t iflib_txq_drain_free(struct ifmp_ring *r, uint32_t cidx, uint32_t pidx) { int i, avail; struct mbuf **mp; iflib_txq_t txq; txq = r->cookie; txq->ift_qstatus = IFLIB_QUEUE_IDLE; CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); avail = IDXDIFF(pidx, cidx, r->size); for (i = 0; i < avail; i++) { mp = _ring_peek_one(r, cidx, i, avail - i); if (__predict_false(*mp == (struct mbuf *)txq)) continue; m_freem(*mp); DBG_COUNTER_INC(tx_frees); } MPASS(ifmp_ring_is_stalled(r) == 0); return (avail); } static void iflib_ifmp_purge(iflib_txq_t txq) { struct ifmp_ring *r; r = txq->ift_br; r->drain = iflib_txq_drain_free; r->can_drain = iflib_txq_drain_always; ifmp_ring_check_drainage(r, r->size); r->drain = iflib_txq_drain; r->can_drain = iflib_txq_can_drain; } static void _task_fn_tx(void *context) { iflib_txq_t txq = context; if_ctx_t ctx = txq->ift_ctx; #if defined(ALTQ) || defined(DEV_NETMAP) if_t ifp = ctx->ifc_ifp; #endif int abdicate = ctx->ifc_sysctl_tx_abdicate; #ifdef IFLIB_DIAGNOSTICS txq->ift_cpu_exec_count[curcpu]++; #endif if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING)) return; #ifdef DEV_NETMAP if (if_getcapenable(ifp) & IFCAP_NETMAP) { bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD); if (ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, false)) netmap_tx_irq(ifp, txq->ift_id); IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id); return; } #endif #ifdef ALTQ if (ALTQ_IS_ENABLED(&ifp->if_snd)) iflib_altq_if_start(ifp); #endif if (txq->ift_db_pending) ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE, abdicate); else if (!abdicate) ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); /* * When abdicating, we always need to check drainage, not just when we don't enqueue */ if (abdicate) ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); if (ctx->ifc_flags & IFC_LEGACY) IFDI_INTR_ENABLE(ctx); else { #ifdef INVARIANTS int rc = #endif IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id); KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver")); } } static void _task_fn_rx(void *context) { iflib_rxq_t rxq = context; if_ctx_t ctx = rxq->ifr_ctx; bool more; uint16_t budget; #ifdef IFLIB_DIAGNOSTICS rxq->ifr_cpu_exec_count[curcpu]++; #endif DBG_COUNTER_INC(task_fn_rxs); if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING))) return; more = true; #ifdef DEV_NETMAP if (if_getcapenable(ctx->ifc_ifp) & IFCAP_NETMAP) { u_int work = 0; if (netmap_rx_irq(ctx->ifc_ifp, rxq->ifr_id, &work)) { more = false; } } #endif budget = ctx->ifc_sysctl_rx_budget; if (budget == 0) budget = 16; /* XXX */ if (more == false || (more = iflib_rxeof(rxq, budget)) == false) { if (ctx->ifc_flags & IFC_LEGACY) IFDI_INTR_ENABLE(ctx); else { #ifdef INVARIANTS int rc = #endif IFDI_RX_QUEUE_INTR_ENABLE(ctx, rxq->ifr_id); KASSERT(rc != ENOTSUP, ("MSI-X support requires queue_intr_enable, but not implemented in driver")); DBG_COUNTER_INC(rx_intr_enables); } } if (__predict_false(!(if_getdrvflags(ctx->ifc_ifp) & 
IFF_DRV_RUNNING))) return; if (more) GROUPTASK_ENQUEUE(&rxq->ifr_task); } static void _task_fn_admin(void *context) { if_ctx_t ctx = context; if_softc_ctx_t sctx = &ctx->ifc_softc_ctx; iflib_txq_t txq; int i; bool oactive, running, do_reset, do_watchdog, in_detach; uint32_t reset_on = hz / 2; STATE_LOCK(ctx); running = (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING); oactive = (if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_OACTIVE); do_reset = (ctx->ifc_flags & IFC_DO_RESET); do_watchdog = (ctx->ifc_flags & IFC_DO_WATCHDOG); in_detach = (ctx->ifc_flags & IFC_IN_DETACH); ctx->ifc_flags &= ~(IFC_DO_RESET|IFC_DO_WATCHDOG); STATE_UNLOCK(ctx); if ((!running && !oactive) && !(ctx->ifc_sctx->isc_flags & IFLIB_ADMIN_ALWAYS_RUN)) return; if (in_detach) return; CTX_LOCK(ctx); for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) { CALLOUT_LOCK(txq); callout_stop(&txq->ift_timer); CALLOUT_UNLOCK(txq); } if (do_watchdog) { ctx->ifc_watchdog_events++; IFDI_WATCHDOG_RESET(ctx); } IFDI_UPDATE_ADMIN_STATUS(ctx); for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) { #ifdef DEV_NETMAP reset_on = hz / 2; if (if_getcapenable(ctx->ifc_ifp) & IFCAP_NETMAP) iflib_netmap_timer_adjust(ctx, txq, &reset_on); #endif callout_reset_on(&txq->ift_timer, reset_on, iflib_timer, txq, txq->ift_timer.c_cpu); } IFDI_LINK_INTR_ENABLE(ctx); if (do_reset) iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); if (LINK_ACTIVE(ctx) == 0) return; for (txq = ctx->ifc_txqs, i = 0; i < sctx->isc_ntxqsets; i++, txq++) iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET); } static void _task_fn_iov(void *context) { if_ctx_t ctx = context; if (!(if_getdrvflags(ctx->ifc_ifp) & IFF_DRV_RUNNING) && !(ctx->ifc_sctx->isc_flags & IFLIB_ADMIN_ALWAYS_RUN)) return; CTX_LOCK(ctx); IFDI_VFLR_HANDLE(ctx); CTX_UNLOCK(ctx); } static int iflib_sysctl_int_delay(SYSCTL_HANDLER_ARGS) { int err; if_int_delay_info_t info; if_ctx_t ctx; info = (if_int_delay_info_t)arg1; ctx = info->iidi_ctx; info->iidi_req = req; info->iidi_oidp = oidp; CTX_LOCK(ctx); err = IFDI_SYSCTL_INT_DELAY(ctx, info); CTX_UNLOCK(ctx); return (err); } /********************************************************************* * * IFNET FUNCTIONS * **********************************************************************/ static void iflib_if_init_locked(if_ctx_t ctx) { iflib_stop(ctx); iflib_init_locked(ctx); } static void iflib_if_init(void *arg) { if_ctx_t ctx = arg; CTX_LOCK(ctx); iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static int iflib_if_transmit(if_t ifp, struct mbuf *m) { if_ctx_t ctx = if_getsoftc(ifp); iflib_txq_t txq; int err, qidx; int abdicate = ctx->ifc_sysctl_tx_abdicate; if (__predict_false((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || !LINK_ACTIVE(ctx))) { DBG_COUNTER_INC(tx_frees); m_freem(m); return (ENETDOWN); } MPASS(m->m_nextpkt == NULL); /* ALTQ-enabled interfaces always use queue 0. */ qidx = 0; if ((NTXQSETS(ctx) > 1) && M_HASHTYPE_GET(m) && !ALTQ_IS_ENABLED(&ifp->if_snd)) qidx = QIDX(ctx, m); /* * XXX calculate buf_ring based on flowid (divvy up bits?) 
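	 * As it stands, QIDX() just folds the flowid into the TX queue range:
	 * with the default 64-entry RSS table (mask 63) and, say, eight TX
	 * queue sets, a flowid of 42 selects txq (42 & 63) % 8 == 2.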
*/ txq = &ctx->ifc_txqs[qidx]; #ifdef DRIVER_BACKPRESSURE if (txq->ift_closed) { while (m != NULL) { next = m->m_nextpkt; m->m_nextpkt = NULL; m_freem(m); DBG_COUNTER_INC(tx_frees); m = next; } return (ENOBUFS); } #endif #ifdef notyet qidx = count = 0; mp = marr; next = m; do { count++; next = next->m_nextpkt; } while (next != NULL); if (count > nitems(marr)) if ((mp = malloc(count*sizeof(struct mbuf *), M_IFLIB, M_NOWAIT)) == NULL) { /* XXX check nextpkt */ m_freem(m); /* XXX simplify for now */ DBG_COUNTER_INC(tx_frees); return (ENOBUFS); } for (next = m, i = 0; next != NULL; i++) { mp[i] = next; next = next->m_nextpkt; mp[i]->m_nextpkt = NULL; } #endif DBG_COUNTER_INC(tx_seen); err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE, abdicate); if (abdicate) GROUPTASK_ENQUEUE(&txq->ift_task); if (err) { if (!abdicate) GROUPTASK_ENQUEUE(&txq->ift_task); /* support forthcoming later */ #ifdef DRIVER_BACKPRESSURE txq->ift_closed = TRUE; #endif ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); m_freem(m); DBG_COUNTER_INC(tx_frees); } return (err); } #ifdef ALTQ /* * The overall approach to integrating iflib with ALTQ is to continue to use * the iflib mp_ring machinery between the ALTQ queue(s) and the hardware * ring. Technically, when using ALTQ, queueing to an intermediate mp_ring * is redundant/unnecessary, but doing so minimizes the amount of * ALTQ-specific code required in iflib. It is assumed that the overhead of * redundantly queueing to an intermediate mp_ring is swamped by the * performance limitations inherent in using ALTQ. * * When ALTQ support is compiled in, all iflib drivers will use a transmit * routine, iflib_altq_if_transmit(), that checks if ALTQ is enabled for the * given interface. If ALTQ is enabled for an interface, then all * transmitted packets for that interface will be submitted to the ALTQ * subsystem via IFQ_ENQUEUE(). We don't use the legacy if_transmit() * implementation because it uses IFQ_HANDOFF(), which will duplicatively * update stats that the iflib machinery handles, and which is sensitve to * the disused IFF_DRV_OACTIVE flag. Additionally, iflib_altq_if_start() * will be installed as the start routine for use by ALTQ facilities that * need to trigger queue drains on a scheduled basis. * */ static void iflib_altq_if_start(if_t ifp) { struct ifaltq *ifq = &ifp->if_snd; struct mbuf *m; IFQ_LOCK(ifq); IFQ_DEQUEUE_NOLOCK(ifq, m); while (m != NULL) { iflib_if_transmit(ifp, m); IFQ_DEQUEUE_NOLOCK(ifq, m); } IFQ_UNLOCK(ifq); } static int iflib_altq_if_transmit(if_t ifp, struct mbuf *m) { int err; if (ALTQ_IS_ENABLED(&ifp->if_snd)) { IFQ_ENQUEUE(&ifp->if_snd, m, err); if (err == 0) iflib_altq_if_start(ifp); } else err = iflib_if_transmit(ifp, m); return (err); } #endif /* ALTQ */ static void iflib_if_qflush(if_t ifp) { if_ctx_t ctx = if_getsoftc(ifp); iflib_txq_t txq = ctx->ifc_txqs; int i; STATE_LOCK(ctx); ctx->ifc_flags |= IFC_QFLUSH; STATE_UNLOCK(ctx); for (i = 0; i < NTXQSETS(ctx); i++, txq++) while (!(ifmp_ring_is_idle(txq->ift_br) || ifmp_ring_is_stalled(txq->ift_br))) iflib_txq_check_drain(txq, 0); STATE_LOCK(ctx); ctx->ifc_flags &= ~IFC_QFLUSH; STATE_UNLOCK(ctx); /* * When ALTQ is enabled, this will also take care of purging the * ALTQ queue(s). 
*/ if_qflush(ifp); } #define IFCAP_FLAGS (IFCAP_HWCSUM_IPV6 | IFCAP_HWCSUM | IFCAP_LRO | \ IFCAP_TSO | IFCAP_VLAN_HWTAGGING | IFCAP_HWSTATS | \ IFCAP_VLAN_MTU | IFCAP_VLAN_HWFILTER | \ IFCAP_VLAN_HWTSO | IFCAP_VLAN_HWCSUM) static int iflib_if_ioctl(if_t ifp, u_long command, caddr_t data) { if_ctx_t ctx = if_getsoftc(ifp); struct ifreq *ifr = (struct ifreq *)data; #if defined(INET) || defined(INET6) struct ifaddr *ifa = (struct ifaddr *)data; #endif bool avoid_reset = FALSE; int err = 0, reinit = 0, bits; switch (command) { case SIOCSIFADDR: #ifdef INET if (ifa->ifa_addr->sa_family == AF_INET) avoid_reset = TRUE; #endif #ifdef INET6 if (ifa->ifa_addr->sa_family == AF_INET6) avoid_reset = TRUE; #endif /* ** Calling init results in link renegotiation, ** so we avoid doing it when possible. */ if (avoid_reset) { if_setflagbits(ifp, IFF_UP,0); if (!(if_getdrvflags(ifp) & IFF_DRV_RUNNING)) reinit = 1; #ifdef INET if (!(if_getflags(ifp) & IFF_NOARP)) arp_ifinit(ifp, ifa); #endif } else err = ether_ioctl(ifp, command, data); break; case SIOCSIFMTU: CTX_LOCK(ctx); if (ifr->ifr_mtu == if_getmtu(ifp)) { CTX_UNLOCK(ctx); break; } bits = if_getdrvflags(ifp); /* stop the driver and free any clusters before proceeding */ iflib_stop(ctx); if ((err = IFDI_MTU_SET(ctx, ifr->ifr_mtu)) == 0) { STATE_LOCK(ctx); if (ifr->ifr_mtu > ctx->ifc_max_fl_buf_size) ctx->ifc_flags |= IFC_MULTISEG; else ctx->ifc_flags &= ~IFC_MULTISEG; STATE_UNLOCK(ctx); err = if_setmtu(ifp, ifr->ifr_mtu); } iflib_init_locked(ctx); STATE_LOCK(ctx); if_setdrvflags(ifp, bits); STATE_UNLOCK(ctx); CTX_UNLOCK(ctx); break; case SIOCSIFFLAGS: CTX_LOCK(ctx); if (if_getflags(ifp) & IFF_UP) { if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { if ((if_getflags(ifp) ^ ctx->ifc_if_flags) & (IFF_PROMISC | IFF_ALLMULTI)) { err = IFDI_PROMISC_SET(ctx, if_getflags(ifp)); } } else reinit = 1; } else if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { iflib_stop(ctx); } ctx->ifc_if_flags = if_getflags(ifp); CTX_UNLOCK(ctx); break; case SIOCADDMULTI: case SIOCDELMULTI: if (if_getdrvflags(ifp) & IFF_DRV_RUNNING) { CTX_LOCK(ctx); IFDI_INTR_DISABLE(ctx); IFDI_MULTI_SET(ctx); IFDI_INTR_ENABLE(ctx); CTX_UNLOCK(ctx); } break; case SIOCSIFMEDIA: CTX_LOCK(ctx); IFDI_MEDIA_SET(ctx); CTX_UNLOCK(ctx); /* falls thru */ case SIOCGIFMEDIA: case SIOCGIFXMEDIA: err = ifmedia_ioctl(ifp, ifr, &ctx->ifc_media, command); break; case SIOCGI2C: { struct ifi2creq i2c; err = copyin(ifr_data_get_ptr(ifr), &i2c, sizeof(i2c)); if (err != 0) break; if (i2c.dev_addr != 0xA0 && i2c.dev_addr != 0xA2) { err = EINVAL; break; } if (i2c.len > sizeof(i2c.data)) { err = EINVAL; break; } if ((err = IFDI_I2C_REQ(ctx, &i2c)) == 0) err = copyout(&i2c, ifr_data_get_ptr(ifr), sizeof(i2c)); break; } case SIOCSIFCAP: { int mask, setmask, oldmask; oldmask = if_getcapenable(ifp); mask = ifr->ifr_reqcap ^ oldmask; mask &= ctx->ifc_softc_ctx.isc_capabilities; setmask = 0; #ifdef TCP_OFFLOAD setmask |= mask & (IFCAP_TOE4|IFCAP_TOE6); #endif setmask |= (mask & IFCAP_FLAGS); setmask |= (mask & IFCAP_WOL); /* * If any RX csum has changed, change all the ones that * are supported by the driver. 
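	 * (Most hardware cannot enable IPv4 and IPv6 receive checksum offload
	 * independently, so whichever of the two bits the driver supports is
	 * toggled along with the one that was requested.)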
*/ if (setmask & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) { setmask |= ctx->ifc_softc_ctx.isc_capabilities & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6); } /* * want to ensure that traffic has stopped before we change any of the flags */ if (setmask) { CTX_LOCK(ctx); bits = if_getdrvflags(ifp); if (bits & IFF_DRV_RUNNING && setmask & ~IFCAP_WOL) iflib_stop(ctx); STATE_LOCK(ctx); if_togglecapenable(ifp, setmask); STATE_UNLOCK(ctx); if (bits & IFF_DRV_RUNNING && setmask & ~IFCAP_WOL) iflib_init_locked(ctx); STATE_LOCK(ctx); if_setdrvflags(ifp, bits); STATE_UNLOCK(ctx); CTX_UNLOCK(ctx); } if_vlancap(ifp); break; } case SIOCGPRIVATE_0: case SIOCSDRVSPEC: case SIOCGDRVSPEC: CTX_LOCK(ctx); err = IFDI_PRIV_IOCTL(ctx, command, data); CTX_UNLOCK(ctx); break; default: err = ether_ioctl(ifp, command, data); break; } if (reinit) iflib_if_init(ctx); return (err); } static uint64_t iflib_if_get_counter(if_t ifp, ift_counter cnt) { if_ctx_t ctx = if_getsoftc(ifp); return (IFDI_GET_COUNTER(ctx, cnt)); } /********************************************************************* * * OTHER FUNCTIONS EXPORTED TO THE STACK * **********************************************************************/ static void iflib_vlan_register(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); IFDI_VLAN_REGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_vlan_unregister(void *arg, if_t ifp, uint16_t vtag) { if_ctx_t ctx = if_getsoftc(ifp); if ((void *)ctx != arg) return; if ((vtag == 0) || (vtag > 4095)) return; CTX_LOCK(ctx); IFDI_VLAN_UNREGISTER(ctx, vtag); /* Re-init to load the changes */ if (if_getcapenable(ifp) & IFCAP_VLAN_HWFILTER) iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); } static void iflib_led_func(void *arg, int onoff) { if_ctx_t ctx = arg; CTX_LOCK(ctx); IFDI_LED_FUNC(ctx, onoff); CTX_UNLOCK(ctx); } /********************************************************************* * * BUS FUNCTION DEFINITIONS * **********************************************************************/ int iflib_device_probe(device_t dev) { pci_vendor_info_t *ent; uint16_t pci_vendor_id, pci_device_id; uint16_t pci_subvendor_id, pci_subdevice_id; uint16_t pci_rev_id; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_vendor_id = pci_get_vendor(dev); pci_device_id = pci_get_device(dev); pci_subvendor_id = pci_get_subvendor(dev); pci_subdevice_id = pci_get_subdevice(dev); pci_rev_id = pci_get_revid(dev); if (sctx->isc_parse_devinfo != NULL) sctx->isc_parse_devinfo(&pci_device_id, &pci_subvendor_id, &pci_subdevice_id, &pci_rev_id); ent = sctx->isc_vendor_info; while (ent->pvi_vendor_id != 0) { if (pci_vendor_id != ent->pvi_vendor_id) { ent++; continue; } if ((pci_device_id == ent->pvi_device_id) && ((pci_subvendor_id == ent->pvi_subvendor_id) || (ent->pvi_subvendor_id == 0)) && ((pci_subdevice_id == ent->pvi_subdevice_id) || (ent->pvi_subdevice_id == 0)) && ((pci_rev_id == ent->pvi_rev_id) || (ent->pvi_rev_id == 0))) { device_set_desc_copy(dev, ent->pvi_name); /* this needs to be changed to zero if the bus probing code * ever stops re-probing on best match because the sctx * may have its values over written by register calls * in subsequent probes */ return (BUS_PROBE_DEFAULT); } ent++; } return (ENXIO); } static void iflib_reset_qvalues(if_ctx_t ctx) { if_softc_ctx_t scctx = 
&ctx->ifc_softc_ctx; if_shared_ctx_t sctx = ctx->ifc_sctx; device_t dev = ctx->ifc_dev; int i; scctx->isc_txrx_budget_bytes_max = IFLIB_MAX_TX_BYTES; scctx->isc_tx_qdepth = IFLIB_DEFAULT_TX_QDEPTH; /* * XXX sanity check that ntxd & nrxd are a power of 2 */ if (ctx->ifc_sysctl_ntxqs != 0) scctx->isc_ntxqsets = ctx->ifc_sysctl_ntxqs; if (ctx->ifc_sysctl_nrxqs != 0) scctx->isc_nrxqsets = ctx->ifc_sysctl_nrxqs; for (i = 0; i < sctx->isc_ntxqs; i++) { if (ctx->ifc_sysctl_ntxds[i] != 0) scctx->isc_ntxd[i] = ctx->ifc_sysctl_ntxds[i]; else scctx->isc_ntxd[i] = sctx->isc_ntxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (ctx->ifc_sysctl_nrxds[i] != 0) scctx->isc_nrxd[i] = ctx->ifc_sysctl_nrxds[i]; else scctx->isc_nrxd[i] = sctx->isc_nrxd_default[i]; } for (i = 0; i < sctx->isc_nrxqs; i++) { if (scctx->isc_nrxd[i] < sctx->isc_nrxd_min[i]) { device_printf(dev, "nrxd%d: %d less than nrxd_min %d - resetting to min\n", i, scctx->isc_nrxd[i], sctx->isc_nrxd_min[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_min[i]; } if (scctx->isc_nrxd[i] > sctx->isc_nrxd_max[i]) { device_printf(dev, "nrxd%d: %d greater than nrxd_max %d - resetting to max\n", i, scctx->isc_nrxd[i], sctx->isc_nrxd_max[i]); scctx->isc_nrxd[i] = sctx->isc_nrxd_max[i]; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (scctx->isc_ntxd[i] < sctx->isc_ntxd_min[i]) { device_printf(dev, "ntxd%d: %d less than ntxd_min %d - resetting to min\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_min[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_min[i]; } if (scctx->isc_ntxd[i] > sctx->isc_ntxd_max[i]) { device_printf(dev, "ntxd%d: %d greater than ntxd_max %d - resetting to max\n", i, scctx->isc_ntxd[i], sctx->isc_ntxd_max[i]); scctx->isc_ntxd[i] = sctx->isc_ntxd_max[i]; } } } +static void +iflib_add_pfil(if_ctx_t ctx) +{ + struct pfil_head *pfil; + struct pfil_head_args pa; + iflib_rxq_t rxq; + int i; + + pa.pa_version = PFIL_VERSION; + pa.pa_flags = PFIL_IN; + pa.pa_type = PFIL_TYPE_ETHERNET; + pa.pa_headname = ctx->ifc_ifp->if_xname; + pfil = pfil_head_register(&pa); + + for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); i++, rxq++) { + rxq->pfil = pfil; + } +} + +static void +iflib_rem_pfil(if_ctx_t ctx) +{ + struct pfil_head *pfil; + iflib_rxq_t rxq; + int i; + + rxq = ctx->ifc_rxqs; + pfil = rxq->pfil; + for (i = 0; i < NRXQSETS(ctx); i++, rxq++) { + rxq->pfil = NULL; + } + pfil_head_unregister(pfil); +} + +static uint16_t +get_ctx_core_offset(if_ctx_t ctx) +{ + if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; + struct cpu_offset *op; + uint16_t qc; + uint16_t ret = ctx->ifc_sysctl_core_offset; + + if (ret != CORE_OFFSET_UNSPECIFIED) + return (ret); + + if (ctx->ifc_sysctl_separate_txrx) + qc = scctx->isc_ntxqsets + scctx->isc_nrxqsets; + else + qc = max(scctx->isc_ntxqsets, scctx->isc_nrxqsets); + + mtx_lock(&cpu_offset_mtx); + SLIST_FOREACH(op, &cpu_offsets, entries) { + if (CPU_CMP(&ctx->ifc_cpus, &op->set) == 0) { + ret = op->offset; + op->offset += qc; + MPASS(op->refcount < UINT_MAX); + op->refcount++; + break; + } + } + if (ret == CORE_OFFSET_UNSPECIFIED) { + ret = 0; + op = malloc(sizeof(struct cpu_offset), M_IFLIB, + M_NOWAIT | M_ZERO); + if (op == NULL) { + device_printf(ctx->ifc_dev, + "allocation for cpu offset failed.\n"); + } else { + op->offset = qc; + op->refcount = 1; + CPU_COPY(&ctx->ifc_cpus, &op->set); + SLIST_INSERT_HEAD(&cpu_offsets, op, entries); + } + } + mtx_unlock(&cpu_offset_mtx); + + return (ret); +} + +static void +unref_ctx_core_offset(if_ctx_t ctx) +{ + struct cpu_offset *op, *top; + + mtx_lock(&cpu_offset_mtx); + 
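+	/*
+	 * cpu_offsets keeps one refcounted entry per distinct CPU set;
+	 * get_ctx_core_offset() above hands out successive offsets from it
+	 * so that interfaces sharing a CPU set start their queue-to-core
+	 * assignment at different cores.  Here we drop this context's
+	 * reference and free the entry once its last user detaches.
+	 */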
SLIST_FOREACH_SAFE(op, &cpu_offsets, entries, top) { + if (CPU_CMP(&ctx->ifc_cpus, &op->set) == 0) { + MPASS(op->refcount > 0); + op->refcount--; + if (op->refcount == 0) { + SLIST_REMOVE(&cpu_offsets, op, cpu_offset, entries); + free(op, M_IFLIB); + } + break; + } + } + mtx_unlock(&cpu_offset_mtx); +} + int iflib_device_register(device_t dev, void *sc, if_shared_ctx_t sctx, if_ctx_t *ctxp) { int err, rid, msix; if_ctx_t ctx; if_t ifp; if_softc_ctx_t scctx; int i; uint16_t main_txq; uint16_t main_rxq; ctx = malloc(sizeof(* ctx), M_IFLIB, M_WAITOK|M_ZERO); if (sc == NULL) { sc = malloc(sctx->isc_driver->size, M_IFLIB, M_WAITOK|M_ZERO); device_set_softc(dev, ctx); ctx->ifc_flags |= IFC_SC_ALLOCATED; } ctx->ifc_sctx = sctx; ctx->ifc_dev = dev; ctx->ifc_softc = sc; if ((err = iflib_register(ctx)) != 0) { device_printf(dev, "iflib_register failed %d\n", err); goto fail_ctx_free; } iflib_add_device_sysctl_pre(ctx); scctx = &ctx->ifc_softc_ctx; ifp = ctx->ifc_ifp; iflib_reset_qvalues(ctx); CTX_LOCK(ctx); if ((err = IFDI_ATTACH_PRE(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_PRE failed %d\n", err); goto fail_unlock; } _iflib_pre_assert(scctx); ctx->ifc_txrx = *scctx->isc_txrx; #ifdef INVARIANTS MPASS(scctx->isc_capabilities); if (scctx->isc_capabilities & IFCAP_TXCSUM) MPASS(scctx->isc_tx_csum_flags); #endif if_setcapabilities(ifp, scctx->isc_capabilities | IFCAP_HWSTATS); if_setcapenable(ifp, scctx->isc_capenable | IFCAP_HWSTATS); if (scctx->isc_ntxqsets == 0 || (scctx->isc_ntxqsets_max && scctx->isc_ntxqsets_max < scctx->isc_ntxqsets)) scctx->isc_ntxqsets = scctx->isc_ntxqsets_max; if (scctx->isc_nrxqsets == 0 || (scctx->isc_nrxqsets_max && scctx->isc_nrxqsets_max < scctx->isc_nrxqsets)) scctx->isc_nrxqsets = scctx->isc_nrxqsets_max; main_txq = (sctx->isc_flags & IFLIB_HAS_TXCQ) ? 1 : 0; main_rxq = (sctx->isc_flags & IFLIB_HAS_RXCQ) ? 1 : 0; /* XXX change for per-queue sizes */ device_printf(dev, "Using %d tx descriptors and %d rx descriptors\n", scctx->isc_ntxd[main_txq], scctx->isc_nrxd[main_rxq]); for (i = 0; i < sctx->isc_nrxqs; i++) { if (!powerof2(scctx->isc_nrxd[i])) { /* round down instead? */ device_printf(dev, "# rx descriptors must be a power of 2\n"); err = EINVAL; goto fail_iflib_detach; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (!powerof2(scctx->isc_ntxd[i])) { device_printf(dev, "# tx descriptors must be a power of 2"); err = EINVAL; goto fail_iflib_detach; } } if (scctx->isc_tx_nsegments > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_nsegments = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); if (scctx->isc_tx_tso_segments_max > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_tso_segments_max = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); /* TSO parameters - dig these out of the data sheet - simply correspond to tag setup */ if (if_getcapabilities(ifp) & IFCAP_TSO) { /* * The stack can't handle a TSO size larger than IP_MAXPACKET, * but some MACs do. */ if_sethwtsomax(ifp, min(scctx->isc_tx_tso_size_max, IP_MAXPACKET)); /* * Take maximum number of m_pullup(9)'s in iflib_parse_header() * into account. In the worst case, each of these calls will * add another mbuf and, thus, the requirement for another DMA * segment. So for best performance, it doesn't make sense to * advertize a maximum of TSO segments that typically will * require defragmentation in iflib_encap(). 
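	 * Hence the limit advertised just below is isc_tx_tso_segments_max
	 * minus three, reserving headroom for the extra mbufs (and therefore
	 * DMA segments) those m_pullup() calls can introduce.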
*/ if_sethwtsomaxsegcount(ifp, scctx->isc_tx_tso_segments_max - 3); if_sethwtsomaxsegsize(ifp, scctx->isc_tx_tso_segsize_max); } if (scctx->isc_rss_table_size == 0) scctx->isc_rss_table_size = 64; scctx->isc_rss_table_mask = scctx->isc_rss_table_size-1; GROUPTASK_INIT(&ctx->ifc_admin_task, 0, _task_fn_admin, ctx); /* XXX format name */ taskqgroup_attach(qgroup_if_config_tqg, &ctx->ifc_admin_task, ctx, NULL, NULL, "admin"); /* Set up cpu set. If it fails, use the set of all CPUs. */ if (bus_get_cpus(dev, INTR_CPUS, sizeof(ctx->ifc_cpus), &ctx->ifc_cpus) != 0) { device_printf(dev, "Unable to fetch CPU list\n"); CPU_COPY(&all_cpus, &ctx->ifc_cpus); } MPASS(CPU_COUNT(&ctx->ifc_cpus) > 0); /* ** Now set up MSI or MSI-X, should return us the number of supported ** vectors (will be 1 for a legacy interrupt and MSI). */ if (sctx->isc_flags & IFLIB_SKIP_MSIX) { msix = scctx->isc_vectors; } else if (scctx->isc_msix_bar != 0) /* * The simple fact that isc_msix_bar is not 0 does not mean we * we have a good value there that is known to work. */ msix = iflib_msix_init(ctx); else { scctx->isc_vectors = 1; scctx->isc_ntxqsets = 1; scctx->isc_nrxqsets = 1; scctx->isc_intr = IFLIB_INTR_LEGACY; msix = 0; } /* Get memory for the station queues */ if ((err = iflib_queues_alloc(ctx))) { device_printf(dev, "Unable to allocate queue memory\n"); goto fail_intr_free; } if ((err = iflib_qset_structures_setup(ctx))) goto fail_queues; /* + * Now that we know how many queues there are, get the core offset. + */ + ctx->ifc_sysctl_core_offset = get_ctx_core_offset(ctx); + + /* * Group taskqueues aren't properly set up until SMP is started, * so we disable interrupts until we can handle them post * SI_SUB_SMP. * * XXX: disabling interrupts doesn't actually work, at least for * the non-MSI case. When they occur before SI_SUB_SMP completes, * we do null handling and depend on this not causing too large an * interrupt storm. */ IFDI_INTR_DISABLE(ctx); if (msix > 1 && (err = IFDI_MSIX_INTR_ASSIGN(ctx, msix)) != 0) { device_printf(dev, "IFDI_MSIX_INTR_ASSIGN failed %d\n", err); goto fail_queues; } if (msix <= 1) { rid = 0; if (scctx->isc_intr == IFLIB_INTR_MSI) { MPASS(msix == 1); rid = 1; } if ((err = iflib_legacy_setup(ctx, ctx->isc_legacy_intr, ctx->ifc_softc, &rid, "irq0")) != 0) { device_printf(dev, "iflib_legacy_setup failed %d\n", err); goto fail_queues; } } ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac.octet); if ((err = IFDI_ATTACH_POST(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err); goto fail_detach; } /* * Tell the upper layer(s) if IFCAP_VLAN_MTU is supported. * This must appear after the call to ether_ifattach() because * ether_ifattach() sets if_hdrlen to the default value. 
*/ if (if_getcapabilities(ifp) & IFCAP_VLAN_MTU) if_setifheaderlen(ifp, sizeof(struct ether_vlan_header)); if ((err = iflib_netmap_attach(ctx))) { device_printf(ctx->ifc_dev, "netmap attach failed: %d\n", err); goto fail_detach; } *ctxp = ctx; NETDUMP_SET(ctx->ifc_ifp, iflib); if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter); iflib_add_device_sysctl_post(ctx); + iflib_add_pfil(ctx); ctx->ifc_flags |= IFC_INIT_DONE; CTX_UNLOCK(ctx); return (0); fail_detach: ether_ifdetach(ctx->ifc_ifp); fail_intr_free: iflib_free_intr_mem(ctx); fail_queues: iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); fail_iflib_detach: IFDI_DETACH(ctx); fail_unlock: CTX_UNLOCK(ctx); fail_ctx_free: if (ctx->ifc_flags & IFC_SC_ALLOCATED) free(ctx->ifc_softc, M_IFLIB); free(ctx, M_IFLIB); return (err); } int iflib_pseudo_register(device_t dev, if_shared_ctx_t sctx, if_ctx_t *ctxp, struct iflib_cloneattach_ctx *clctx) { int err; if_ctx_t ctx; if_t ifp; if_softc_ctx_t scctx; int i; void *sc; uint16_t main_txq; uint16_t main_rxq; ctx = malloc(sizeof(*ctx), M_IFLIB, M_WAITOK|M_ZERO); sc = malloc(sctx->isc_driver->size, M_IFLIB, M_WAITOK|M_ZERO); ctx->ifc_flags |= IFC_SC_ALLOCATED; if (sctx->isc_flags & (IFLIB_PSEUDO|IFLIB_VIRTUAL)) ctx->ifc_flags |= IFC_PSEUDO; ctx->ifc_sctx = sctx; ctx->ifc_softc = sc; ctx->ifc_dev = dev; if ((err = iflib_register(ctx)) != 0) { device_printf(dev, "%s: iflib_register failed %d\n", __func__, err); goto fail_ctx_free; } iflib_add_device_sysctl_pre(ctx); scctx = &ctx->ifc_softc_ctx; ifp = ctx->ifc_ifp; /* * XXX sanity check that ntxd & nrxd are a power of 2 */ iflib_reset_qvalues(ctx); CTX_LOCK(ctx); if ((err = IFDI_ATTACH_PRE(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_PRE failed %d\n", err); goto fail_unlock; } if (sctx->isc_flags & IFLIB_GEN_MAC) ether_gen_addr(ifp, &ctx->ifc_mac); if ((err = IFDI_CLONEATTACH(ctx, clctx->cc_ifc, clctx->cc_name, clctx->cc_params)) != 0) { device_printf(dev, "IFDI_CLONEATTACH failed %d\n", err); goto fail_ctx_free; } ifmedia_add(&ctx->ifc_media, IFM_ETHER | IFM_1000_T | IFM_FDX, 0, NULL); ifmedia_add(&ctx->ifc_media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(&ctx->ifc_media, IFM_ETHER | IFM_AUTO); #ifdef INVARIANTS MPASS(scctx->isc_capabilities); if (scctx->isc_capabilities & IFCAP_TXCSUM) MPASS(scctx->isc_tx_csum_flags); #endif if_setcapabilities(ifp, scctx->isc_capabilities | IFCAP_HWSTATS | IFCAP_LINKSTATE); if_setcapenable(ifp, scctx->isc_capenable | IFCAP_HWSTATS | IFCAP_LINKSTATE); ifp->if_flags |= IFF_NOGROUP; if (sctx->isc_flags & IFLIB_PSEUDO) { ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac.octet); if ((err = IFDI_ATTACH_POST(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err); goto fail_detach; } *ctxp = ctx; /* * Tell the upper layer(s) if IFCAP_VLAN_MTU is supported. * This must appear after the call to ether_ifattach() because * ether_ifattach() sets if_hdrlen to the default value. 
*/ if (if_getcapabilities(ifp) & IFCAP_VLAN_MTU) if_setifheaderlen(ifp, sizeof(struct ether_vlan_header)); if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter); iflib_add_device_sysctl_post(ctx); ctx->ifc_flags |= IFC_INIT_DONE; return (0); } _iflib_pre_assert(scctx); ctx->ifc_txrx = *scctx->isc_txrx; if (scctx->isc_ntxqsets == 0 || (scctx->isc_ntxqsets_max && scctx->isc_ntxqsets_max < scctx->isc_ntxqsets)) scctx->isc_ntxqsets = scctx->isc_ntxqsets_max; if (scctx->isc_nrxqsets == 0 || (scctx->isc_nrxqsets_max && scctx->isc_nrxqsets_max < scctx->isc_nrxqsets)) scctx->isc_nrxqsets = scctx->isc_nrxqsets_max; main_txq = (sctx->isc_flags & IFLIB_HAS_TXCQ) ? 1 : 0; main_rxq = (sctx->isc_flags & IFLIB_HAS_RXCQ) ? 1 : 0; /* XXX change for per-queue sizes */ device_printf(dev, "Using %d tx descriptors and %d rx descriptors\n", scctx->isc_ntxd[main_txq], scctx->isc_nrxd[main_rxq]); for (i = 0; i < sctx->isc_nrxqs; i++) { if (!powerof2(scctx->isc_nrxd[i])) { /* round down instead? */ device_printf(dev, "# rx descriptors must be a power of 2\n"); err = EINVAL; goto fail_iflib_detach; } } for (i = 0; i < sctx->isc_ntxqs; i++) { if (!powerof2(scctx->isc_ntxd[i])) { device_printf(dev, "# tx descriptors must be a power of 2"); err = EINVAL; goto fail_iflib_detach; } } if (scctx->isc_tx_nsegments > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_nsegments = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); if (scctx->isc_tx_tso_segments_max > scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION) scctx->isc_tx_tso_segments_max = max(1, scctx->isc_ntxd[main_txq] / MAX_SINGLE_PACKET_FRACTION); /* TSO parameters - dig these out of the data sheet - simply correspond to tag setup */ if (if_getcapabilities(ifp) & IFCAP_TSO) { /* * The stack can't handle a TSO size larger than IP_MAXPACKET, * but some MACs do. */ if_sethwtsomax(ifp, min(scctx->isc_tx_tso_size_max, IP_MAXPACKET)); /* * Take maximum number of m_pullup(9)'s in iflib_parse_header() * into account. In the worst case, each of these calls will * add another mbuf and, thus, the requirement for another DMA * segment. So for best performance, it doesn't make sense to * advertize a maximum of TSO segments that typically will * require defragmentation in iflib_encap(). */ if_sethwtsomaxsegcount(ifp, scctx->isc_tx_tso_segments_max - 3); if_sethwtsomaxsegsize(ifp, scctx->isc_tx_tso_segsize_max); } if (scctx->isc_rss_table_size == 0) scctx->isc_rss_table_size = 64; scctx->isc_rss_table_mask = scctx->isc_rss_table_size-1; GROUPTASK_INIT(&ctx->ifc_admin_task, 0, _task_fn_admin, ctx); /* XXX format name */ taskqgroup_attach(qgroup_if_config_tqg, &ctx->ifc_admin_task, ctx, NULL, NULL, "admin"); /* XXX --- can support > 1 -- but keep it simple for now */ scctx->isc_intr = IFLIB_INTR_LEGACY; /* Get memory for the station queues */ if ((err = iflib_queues_alloc(ctx))) { device_printf(dev, "Unable to allocate queue memory\n"); goto fail_iflib_detach; } if ((err = iflib_qset_structures_setup(ctx))) { device_printf(dev, "qset structure setup failed %d\n", err); goto fail_queues; } /* * XXX What if anything do we want to do about interrupts? */ ether_ifattach(ctx->ifc_ifp, ctx->ifc_mac.octet); if ((err = IFDI_ATTACH_POST(ctx)) != 0) { device_printf(dev, "IFDI_ATTACH_POST failed %d\n", err); goto fail_detach; } /* * Tell the upper layer(s) if IFCAP_VLAN_MTU is supported. * This must appear after the call to ether_ifattach() because * ether_ifattach() sets if_hdrlen to the default value. 
*/ if (if_getcapabilities(ifp) & IFCAP_VLAN_MTU) if_setifheaderlen(ifp, sizeof(struct ether_vlan_header)); /* XXX handle more than one queue */ for (i = 0; i < scctx->isc_nrxqsets; i++) IFDI_RX_CLSET(ctx, 0, i, ctx->ifc_rxqs[i].ifr_fl[0].ifl_sds.ifsd_cl); *ctxp = ctx; if_setgetcounterfn(ctx->ifc_ifp, iflib_if_get_counter); iflib_add_device_sysctl_post(ctx); ctx->ifc_flags |= IFC_INIT_DONE; CTX_UNLOCK(ctx); return (0); fail_detach: ether_ifdetach(ctx->ifc_ifp); fail_queues: iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); fail_iflib_detach: IFDI_DETACH(ctx); fail_unlock: CTX_UNLOCK(ctx); fail_ctx_free: free(ctx->ifc_softc, M_IFLIB); free(ctx, M_IFLIB); return (err); } int iflib_pseudo_deregister(if_ctx_t ctx) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq; iflib_rxq_t rxq; int i, j; struct taskqgroup *tqg; iflib_fl_t fl; /* Unregister VLAN events */ if (ctx->ifc_vlan_attach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_config, ctx->ifc_vlan_attach_event); if (ctx->ifc_vlan_detach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, ctx->ifc_vlan_detach_event); ether_ifdetach(ifp); /* ether_ifdetach calls if_qflush - lock must be destroy afterwards*/ CTX_LOCK_DESTROY(ctx); /* XXX drain any dependent tasks */ tqg = qgroup_if_io_tqg; for (txq = ctx->ifc_txqs, i = 0; i < NTXQSETS(ctx); i++, txq++) { callout_drain(&txq->ift_timer); if (txq->ift_task.gt_uniq != NULL) taskqgroup_detach(tqg, &txq->ift_task); } for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); i++, rxq++) { if (rxq->ifr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &rxq->ifr_task); for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) free(fl->ifl_rx_bitmap, M_IFLIB); } tqg = qgroup_if_config_tqg; if (ctx->ifc_admin_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_admin_task); if (ctx->ifc_vflr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_vflr_task); if_free(ifp); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); if (ctx->ifc_flags & IFC_SC_ALLOCATED) free(ctx->ifc_softc, M_IFLIB); free(ctx, M_IFLIB); return (0); } int iflib_device_attach(device_t dev) { if_ctx_t ctx; if_shared_ctx_t sctx; if ((sctx = DEVICE_REGISTER(dev)) == NULL || sctx->isc_magic != IFLIB_MAGIC) return (ENOTSUP); pci_enable_busmaster(dev); return (iflib_device_register(dev, NULL, sctx, &ctx)); } int iflib_device_deregister(if_ctx_t ctx) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq; iflib_rxq_t rxq; device_t dev = ctx->ifc_dev; int i, j; struct taskqgroup *tqg; iflib_fl_t fl; /* Make sure VLANS are not using driver */ if (if_vlantrunkinuse(ifp)) { device_printf(dev, "Vlan in use, detach first\n"); return (EBUSY); } #ifdef PCI_IOV if (!CTX_IS_VF(ctx) && pci_iov_detach(dev) != 0) { device_printf(dev, "SR-IOV in use; detach first.\n"); return (EBUSY); } #endif STATE_LOCK(ctx); ctx->ifc_flags |= IFC_IN_DETACH; STATE_UNLOCK(ctx); CTX_LOCK(ctx); iflib_stop(ctx); CTX_UNLOCK(ctx); /* Unregister VLAN events */ if (ctx->ifc_vlan_attach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_config, ctx->ifc_vlan_attach_event); if (ctx->ifc_vlan_detach_event != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, ctx->ifc_vlan_detach_event); iflib_netmap_detach(ifp); ether_ifdetach(ifp); + iflib_rem_pfil(ctx); if (ctx->ifc_led_dev != NULL) led_destroy(ctx->ifc_led_dev); /* XXX drain any dependent tasks */ tqg = qgroup_if_io_tqg; for (txq = ctx->ifc_txqs, i = 0; i < NTXQSETS(ctx); i++, txq++) { callout_drain(&txq->ift_timer); if (txq->ift_task.gt_uniq != NULL) taskqgroup_detach(tqg, &txq->ift_task); } for (i = 0, rxq = ctx->ifc_rxqs; i < NRXQSETS(ctx); 
i++, rxq++) { if (rxq->ifr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &rxq->ifr_task); for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) free(fl->ifl_rx_bitmap, M_IFLIB); } tqg = qgroup_if_config_tqg; if (ctx->ifc_admin_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_admin_task); if (ctx->ifc_vflr_task.gt_uniq != NULL) taskqgroup_detach(tqg, &ctx->ifc_vflr_task); CTX_LOCK(ctx); IFDI_DETACH(ctx); CTX_UNLOCK(ctx); /* ether_ifdetach calls if_qflush - lock must be destroy afterwards*/ CTX_LOCK_DESTROY(ctx); device_set_softc(ctx->ifc_dev, NULL); iflib_free_intr_mem(ctx); bus_generic_detach(dev); if_free(ifp); iflib_tx_structures_free(ctx); iflib_rx_structures_free(ctx); if (ctx->ifc_flags & IFC_SC_ALLOCATED) free(ctx->ifc_softc, M_IFLIB); + unref_ctx_core_offset(ctx); STATE_LOCK_DESTROY(ctx); free(ctx, M_IFLIB); return (0); } static void iflib_free_intr_mem(if_ctx_t ctx) { if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_MSIX) { iflib_irq_free(ctx, &ctx->ifc_legacy_irq); } if (ctx->ifc_softc_ctx.isc_intr != IFLIB_INTR_LEGACY) { pci_release_msi(ctx->ifc_dev); } if (ctx->ifc_msix_mem != NULL) { bus_release_resource(ctx->ifc_dev, SYS_RES_MEMORY, rman_get_rid(ctx->ifc_msix_mem), ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; } } int iflib_device_detach(device_t dev) { if_ctx_t ctx = device_get_softc(dev); return (iflib_device_deregister(ctx)); } int iflib_device_suspend(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SUSPEND(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_shutdown(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_SHUTDOWN(ctx); CTX_UNLOCK(ctx); return bus_generic_suspend(dev); } int iflib_device_resume(device_t dev) { if_ctx_t ctx = device_get_softc(dev); iflib_txq_t txq = ctx->ifc_txqs; CTX_LOCK(ctx); IFDI_RESUME(ctx); iflib_if_init_locked(ctx); CTX_UNLOCK(ctx); for (int i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET); return (bus_generic_resume(dev)); } int iflib_device_iov_init(device_t dev, uint16_t num_vfs, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_INIT(ctx, num_vfs, params); CTX_UNLOCK(ctx); return (error); } void iflib_device_iov_uninit(device_t dev) { if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); IFDI_IOV_UNINIT(ctx); CTX_UNLOCK(ctx); } int iflib_device_iov_add_vf(device_t dev, uint16_t vfnum, const nvlist_t *params) { int error; if_ctx_t ctx = device_get_softc(dev); CTX_LOCK(ctx); error = IFDI_IOV_VF_ADD(ctx, vfnum, params); CTX_UNLOCK(ctx); return (error); } /********************************************************************* * * MODULE FUNCTION DEFINITIONS * **********************************************************************/ /* * - Start a fast taskqueue thread for each core * - Start a taskqueue for control operations */ static int iflib_module_init(void) { return (0); } static int iflib_module_event_handler(module_t mod, int what, void *arg) { int err; switch (what) { case MOD_LOAD: if ((err = iflib_module_init()) != 0) return (err); break; case MOD_UNLOAD: return (EBUSY); default: return (EOPNOTSUPP); } return (0); } /********************************************************************* * * PUBLIC FUNCTION DEFINITIONS * ordered as in iflib.h * **********************************************************************/ static void _iflib_assert(if_shared_ctx_t sctx) { MPASS(sctx->isc_tx_maxsize); MPASS(sctx->isc_tx_maxsegsize); MPASS(sctx->isc_rx_maxsize); 
MPASS(sctx->isc_rx_nsegments); MPASS(sctx->isc_rx_maxsegsize); MPASS(sctx->isc_nrxd_min[0]); MPASS(sctx->isc_nrxd_max[0]); MPASS(sctx->isc_nrxd_default[0]); MPASS(sctx->isc_ntxd_min[0]); MPASS(sctx->isc_ntxd_max[0]); MPASS(sctx->isc_ntxd_default[0]); } static void _iflib_pre_assert(if_softc_ctx_t scctx) { MPASS(scctx->isc_txrx->ift_txd_encap); MPASS(scctx->isc_txrx->ift_txd_flush); MPASS(scctx->isc_txrx->ift_txd_credits_update); MPASS(scctx->isc_txrx->ift_rxd_available); MPASS(scctx->isc_txrx->ift_rxd_pkt_get); MPASS(scctx->isc_txrx->ift_rxd_refill); MPASS(scctx->isc_txrx->ift_rxd_flush); } static int iflib_register(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; driver_t *driver = sctx->isc_driver; device_t dev = ctx->ifc_dev; if_t ifp; _iflib_assert(sctx); CTX_LOCK_INIT(ctx); STATE_LOCK_INIT(ctx, device_get_nameunit(ctx->ifc_dev)); ifp = ctx->ifc_ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "can not allocate ifnet structure\n"); return (ENOMEM); } /* * Initialize our context's device specific methods */ kobj_init((kobj_t) ctx, (kobj_class_t) driver); kobj_class_compile((kobj_class_t) driver); driver->refs++; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); if_setsoftc(ifp, ctx); if_setdev(ifp, dev); if_setinitfn(ifp, iflib_if_init); if_setioctlfn(ifp, iflib_if_ioctl); #ifdef ALTQ if_setstartfn(ifp, iflib_altq_if_start); if_settransmitfn(ifp, iflib_altq_if_transmit); if_setsendqready(ifp); #else if_settransmitfn(ifp, iflib_if_transmit); #endif if_setqflushfn(ifp, iflib_if_qflush); if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST); ctx->ifc_vlan_attach_event = EVENTHANDLER_REGISTER(vlan_config, iflib_vlan_register, ctx, EVENTHANDLER_PRI_FIRST); ctx->ifc_vlan_detach_event = EVENTHANDLER_REGISTER(vlan_unconfig, iflib_vlan_unregister, ctx, EVENTHANDLER_PRI_FIRST); ifmedia_init(&ctx->ifc_media, IFM_IMASK, iflib_media_change, iflib_media_status); return (0); } static int iflib_queues_alloc(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = ctx->ifc_dev; int nrxqsets = scctx->isc_nrxqsets; int ntxqsets = scctx->isc_ntxqsets; iflib_txq_t txq; iflib_rxq_t rxq; iflib_fl_t fl = NULL; int i, j, cpu, err, txconf, rxconf; iflib_dma_info_t ifdip; uint32_t *rxqsizes = scctx->isc_rxqsizes; uint32_t *txqsizes = scctx->isc_txqsizes; uint8_t nrxqs = sctx->isc_nrxqs; uint8_t ntxqs = sctx->isc_ntxqs; int nfree_lists = sctx->isc_nfl ? 
sctx->isc_nfl : 1; caddr_t *vaddrs; uint64_t *paddrs; KASSERT(ntxqs > 0, ("number of queues per qset must be at least 1")); KASSERT(nrxqs > 0, ("number of queues per qset must be at least 1")); /* Allocate the TX ring struct memory */ if (!(ctx->ifc_txqs = (iflib_txq_t) malloc(sizeof(struct iflib_txq) * ntxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX ring memory\n"); err = ENOMEM; goto fail; } /* Now allocate the RX */ if (!(ctx->ifc_rxqs = (iflib_rxq_t) malloc(sizeof(struct iflib_rxq) * nrxqsets, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX ring memory\n"); err = ENOMEM; goto rx_fail; } txq = ctx->ifc_txqs; rxq = ctx->ifc_rxqs; /* * XXX handle allocation failure */ for (txconf = i = 0, cpu = CPU_FIRST(); i < ntxqsets; i++, txconf++, txq++, cpu = CPU_NEXT(cpu)) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * ntxqs, M_IFLIB, M_NOWAIT | M_ZERO)) == NULL) { device_printf(dev, "Unable to allocate TX DMA info memory\n"); err = ENOMEM; goto err_tx_desc; } txq->ift_ifdi = ifdip; for (j = 0; j < ntxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, txqsizes[j], ifdip, 0)) { device_printf(dev, "Unable to allocate TX descriptors\n"); err = ENOMEM; goto err_tx_desc; } txq->ift_txd_size[j] = scctx->isc_txd_size[j]; bzero((void *)ifdip->idi_vaddr, txqsizes[j]); } txq->ift_ctx = ctx; txq->ift_id = i; if (sctx->isc_flags & IFLIB_HAS_TXCQ) { txq->ift_br_offset = 1; } else { txq->ift_br_offset = 0; } /* XXX fix this */ txq->ift_timer.c_cpu = cpu; if (iflib_txsd_alloc(txq)) { device_printf(dev, "Critical Failure setting up TX buffers\n"); err = ENOMEM; goto err_tx_desc; } /* Initialize the TX lock */ snprintf(txq->ift_mtx_name, MTX_NAME_LEN, "%s:tx(%d):callout", device_get_nameunit(dev), txq->ift_id); mtx_init(&txq->ift_mtx, txq->ift_mtx_name, NULL, MTX_DEF); callout_init_mtx(&txq->ift_timer, &txq->ift_mtx, 0); snprintf(txq->ift_db_mtx_name, MTX_NAME_LEN, "%s:tx(%d):db", device_get_nameunit(dev), txq->ift_id); err = ifmp_ring_alloc(&txq->ift_br, 2048, txq, iflib_txq_drain, iflib_txq_can_drain, M_IFLIB, M_WAITOK); if (err) { /* XXX free any allocated rings */ device_printf(dev, "Unable to allocate buf_ring\n"); goto err_tx_desc; } } for (rxconf = i = 0; i < nrxqsets; i++, rxconf++, rxq++) { /* Set up some basics */ if ((ifdip = malloc(sizeof(struct iflib_dma_info) * nrxqs, M_IFLIB, M_NOWAIT | M_ZERO)) == NULL) { device_printf(dev, "Unable to allocate RX DMA info memory\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_ifdi = ifdip; /* XXX this needs to be changed if #rx queues != #tx queues */ rxq->ifr_ntxqirq = 1; rxq->ifr_txqid[0] = i; for (j = 0; j < nrxqs; j++, ifdip++) { if (iflib_dma_alloc(ctx, rxqsizes[j], ifdip, 0)) { device_printf(dev, "Unable to allocate RX descriptors\n"); err = ENOMEM; goto err_tx_desc; } bzero((void *)ifdip->idi_vaddr, rxqsizes[j]); } rxq->ifr_ctx = ctx; rxq->ifr_id = i; if (sctx->isc_flags & IFLIB_HAS_RXCQ) { rxq->ifr_fl_offset = 1; } else { rxq->ifr_fl_offset = 0; } rxq->ifr_nfl = nfree_lists; if (!(fl = (iflib_fl_t) malloc(sizeof(struct iflib_fl) * nfree_lists, M_IFLIB, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate free list memory\n"); err = ENOMEM; goto err_tx_desc; } rxq->ifr_fl = fl; for (j = 0; j < nfree_lists; j++) { fl[j].ifl_rxq = rxq; fl[j].ifl_id = j; fl[j].ifl_ifdi = &rxq->ifr_ifdi[j + rxq->ifr_fl_offset]; fl[j].ifl_rxd_size = scctx->isc_rxd_size[j]; } /* Allocate receive buffers for the ring */ if (iflib_rxsd_alloc(rxq)) { device_printf(dev, "Critical Failure 
setting up receive buffers\n"); err = ENOMEM; goto err_rx_desc; } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) fl->ifl_rx_bitmap = bit_alloc(fl->ifl_size, M_IFLIB, M_WAITOK); } /* TXQs */ vaddrs = malloc(sizeof(caddr_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*ntxqsets*ntxqs, M_IFLIB, M_WAITOK); for (i = 0; i < ntxqsets; i++) { iflib_dma_info_t di = ctx->ifc_txqs[i].ift_ifdi; for (j = 0; j < ntxqs; j++, di++) { vaddrs[i*ntxqs + j] = di->idi_vaddr; paddrs[i*ntxqs + j] = di->idi_paddr; } } if ((err = IFDI_TX_QUEUES_ALLOC(ctx, vaddrs, paddrs, ntxqs, ntxqsets)) != 0) { device_printf(ctx->ifc_dev, "Unable to allocate device TX queue\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); /* RXQs */ vaddrs = malloc(sizeof(caddr_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); paddrs = malloc(sizeof(uint64_t)*nrxqsets*nrxqs, M_IFLIB, M_WAITOK); for (i = 0; i < nrxqsets; i++) { iflib_dma_info_t di = ctx->ifc_rxqs[i].ifr_ifdi; for (j = 0; j < nrxqs; j++, di++) { vaddrs[i*nrxqs + j] = di->idi_vaddr; paddrs[i*nrxqs + j] = di->idi_paddr; } } if ((err = IFDI_RX_QUEUES_ALLOC(ctx, vaddrs, paddrs, nrxqs, nrxqsets)) != 0) { device_printf(ctx->ifc_dev, "Unable to allocate device RX queue\n"); iflib_tx_structures_free(ctx); free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); goto err_rx_desc; } free(vaddrs, M_IFLIB); free(paddrs, M_IFLIB); return (0); /* XXX handle allocation failure changes */ err_rx_desc: err_tx_desc: rx_fail: if (ctx->ifc_rxqs != NULL) free(ctx->ifc_rxqs, M_IFLIB); ctx->ifc_rxqs = NULL; if (ctx->ifc_txqs != NULL) free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; fail: return (err); } static int iflib_tx_structures_setup(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; int i; for (i = 0; i < NTXQSETS(ctx); i++, txq++) iflib_txq_setup(txq); return (0); } static void iflib_tx_structures_free(if_ctx_t ctx) { iflib_txq_t txq = ctx->ifc_txqs; if_shared_ctx_t sctx = ctx->ifc_sctx; int i, j; for (i = 0; i < NTXQSETS(ctx); i++, txq++) { iflib_txq_destroy(txq); for (j = 0; j < sctx->isc_ntxqs; j++) iflib_dma_free(&txq->ift_ifdi[j]); } free(ctx->ifc_txqs, M_IFLIB); ctx->ifc_txqs = NULL; IFDI_QUEUES_FREE(ctx); } /********************************************************************* * * Initialize all receive rings. * **********************************************************************/ static int iflib_rx_structures_setup(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; int q; #if defined(INET6) || defined(INET) int i, err; #endif for (q = 0; q < ctx->ifc_softc_ctx.isc_nrxqsets; q++, rxq++) { #if defined(INET6) || defined(INET) tcp_lro_free(&rxq->ifr_lc); if ((err = tcp_lro_init_args(&rxq->ifr_lc, ctx->ifc_ifp, TCP_LRO_ENTRIES, min(1024, ctx->ifc_softc_ctx.isc_nrxd[rxq->ifr_fl_offset]))) != 0) { device_printf(ctx->ifc_dev, "LRO Initialization failed!\n"); goto fail; } rxq->ifr_lro_enabled = TRUE; #endif IFDI_RXQ_SETUP(ctx, rxq->ifr_id); } return (0); #if defined(INET6) || defined(INET) fail: /* * Free RX software descriptors allocated so far, we will only handle * the rings that completed, the failing case will have * cleaned up for itself. 'q' failed, so its the terminus. */ rxq = ctx->ifc_rxqs; for (i = 0; i < q; ++i, rxq++) { iflib_rx_sds_free(rxq); rxq->ifr_cq_gen = rxq->ifr_cq_cidx = rxq->ifr_cq_pidx = 0; } return (err); #endif } /********************************************************************* * * Free all receive rings. 
* **********************************************************************/ static void iflib_rx_structures_free(if_ctx_t ctx) { iflib_rxq_t rxq = ctx->ifc_rxqs; for (int i = 0; i < ctx->ifc_softc_ctx.isc_nrxqsets; i++, rxq++) { iflib_rx_sds_free(rxq); } free(ctx->ifc_rxqs, M_IFLIB); ctx->ifc_rxqs = NULL; } static int iflib_qset_structures_setup(if_ctx_t ctx) { int err; /* * It is expected that the caller takes care of freeing queues if this * fails. */ if ((err = iflib_tx_structures_setup(ctx)) != 0) { device_printf(ctx->ifc_dev, "iflib_tx_structures_setup failed: %d\n", err); return (err); } if ((err = iflib_rx_structures_setup(ctx)) != 0) device_printf(ctx->ifc_dev, "iflib_rx_structures_setup failed: %d\n", err); return (err); } int iflib_irq_alloc(if_ctx_t ctx, if_irq_t irq, int rid, driver_filter_t filter, void *filter_arg, driver_intr_t handler, void *arg, const char *name) { return (_iflib_irq_alloc(ctx, irq, rid, filter, handler, arg, name)); } #ifdef SMP static int find_nth(if_ctx_t ctx, int qid) { cpuset_t cpus; int i, cpuid, eqid, count; CPU_COPY(&ctx->ifc_cpus, &cpus); count = CPU_COUNT(&cpus); eqid = qid % count; /* clear up to the qid'th bit */ for (i = 0; i < eqid; i++) { cpuid = CPU_FFS(&cpus); MPASS(cpuid != 0); CPU_CLR(cpuid-1, &cpus); } cpuid = CPU_FFS(&cpus); MPASS(cpuid != 0); return (cpuid-1); } #ifdef SCHED_ULE extern struct cpu_group *cpu_top; /* CPU topology */ static int find_child_with_core(int cpu, struct cpu_group *grp) { int i; if (grp->cg_children == 0) return -1; MPASS(grp->cg_child); for (i = 0; i < grp->cg_children; i++) { if (CPU_ISSET(cpu, &grp->cg_child[i].cg_mask)) return i; } return -1; } /* * Find the nth "close" core to the specified core * "close" is defined as the deepest level that shares * at least an L2 cache. With threads, this will be - * threads on the same core. If the sahred cache is L3 + * threads on the same core. If the shared cache is L3 * or higher, simply returns the same core. */ static int find_close_core(int cpu, int core_offset) { struct cpu_group *grp; int i; int fcpu; cpuset_t cs; grp = cpu_top; if (grp == NULL) return cpu; i = 0; while ((i = find_child_with_core(cpu, grp)) != -1) { /* If the child only has one cpu, don't descend */ if (grp->cg_child[i].cg_count <= 1) break; grp = &grp->cg_child[i]; } /* If they don't share at least an L2 cache, use the same CPU */ if (grp->cg_level > CG_SHARE_L2 || grp->cg_level == CG_SHARE_NONE) return cpu; /* Now pick one */ CPU_COPY(&grp->cg_mask, &cs); /* Add the selected CPU offset to core offset. 
*/ for (i = 0; (fcpu = CPU_FFS(&cs)) != 0; i++) { if (fcpu - 1 == cpu) break; CPU_CLR(fcpu - 1, &cs); } MPASS(fcpu); core_offset += i; CPU_COPY(&grp->cg_mask, &cs); for (i = core_offset % grp->cg_count; i > 0; i--) { MPASS(CPU_FFS(&cs)); CPU_CLR(CPU_FFS(&cs) - 1, &cs); } MPASS(CPU_FFS(&cs)); return CPU_FFS(&cs) - 1; } #else static int find_close_core(int cpu, int core_offset __unused) { return cpu; } #endif static int get_core_offset(if_ctx_t ctx, iflib_intr_type_t type, int qid) { switch (type) { case IFLIB_INTR_TX: /* TX queues get cores which share at least an L2 cache with the corresponding RX queue */ /* XXX handle multiple RX threads per core and more than two core per L2 group */ return qid / CPU_COUNT(&ctx->ifc_cpus) + 1; case IFLIB_INTR_RX: case IFLIB_INTR_RXTX: /* RX queues get the specified core */ return qid / CPU_COUNT(&ctx->ifc_cpus); default: return -1; } } #else #define get_core_offset(ctx, type, qid) CPU_FIRST() #define find_close_core(cpuid, tid) CPU_FIRST() #define find_nth(ctx, gid) CPU_FIRST() #endif /* Just to avoid copy/paste */ static inline int iflib_irq_set_affinity(if_ctx_t ctx, if_irq_t irq, iflib_intr_type_t type, int qid, struct grouptask *gtask, struct taskqgroup *tqg, void *uniq, const char *name) { device_t dev; - int err, cpuid, tid; + int co, cpuid, err, tid; dev = ctx->ifc_dev; - cpuid = find_nth(ctx, qid); + co = ctx->ifc_sysctl_core_offset; + if (ctx->ifc_sysctl_separate_txrx && type == IFLIB_INTR_TX) + co += ctx->ifc_softc_ctx.isc_nrxqsets; + cpuid = find_nth(ctx, qid + co); tid = get_core_offset(ctx, type, qid); MPASS(tid >= 0); cpuid = find_close_core(cpuid, tid); err = taskqgroup_attach_cpu(tqg, gtask, uniq, cpuid, dev, irq->ii_res, name); if (err) { device_printf(dev, "taskqgroup_attach_cpu failed %d\n", err); return (err); } #ifdef notyet if (cpuid > ctx->ifc_cpuid_highest) ctx->ifc_cpuid_highest = cpuid; #endif return 0; } int iflib_irq_alloc_generic(if_ctx_t ctx, if_irq_t irq, int rid, iflib_intr_type_t type, driver_filter_t *filter, void *filter_arg, int qid, const char *name) { device_t dev; struct grouptask *gtask; struct taskqgroup *tqg; iflib_filter_info_t info; gtask_fn_t *fn; int tqrid, err; driver_filter_t *intr_fast; void *q; info = &ctx->ifc_filter_info; tqrid = rid; switch (type) { /* XXX merge tx/rx for netmap? 
*/ case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; info = &ctx->ifc_txqs[qid].ift_filter_info; gtask = &ctx->ifc_txqs[qid].ift_task; tqg = qgroup_if_io_tqg; fn = _task_fn_tx; intr_fast = iflib_fast_intr; GROUPTASK_INIT(gtask, 0, fn, q); ctx->ifc_flags |= IFC_NETMAP_TX_IRQ; break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; info = &ctx->ifc_rxqs[qid].ifr_filter_info; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; intr_fast = iflib_fast_intr; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_RXTX: q = &ctx->ifc_rxqs[qid]; info = &ctx->ifc_rxqs[qid].ifr_filter_info; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; intr_fast = iflib_fast_intr_rxtx; GROUPTASK_INIT(gtask, 0, fn, q); break; case IFLIB_INTR_ADMIN: q = ctx; tqrid = -1; info = &ctx->ifc_filter_info; gtask = &ctx->ifc_admin_task; tqg = qgroup_if_config_tqg; fn = _task_fn_admin; intr_fast = iflib_fast_intr_ctx; break; default: panic("unknown net intr type"); } info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = q; dev = ctx->ifc_dev; err = _iflib_irq_alloc(ctx, irq, rid, intr_fast, NULL, info, name); if (err != 0) { device_printf(dev, "_iflib_irq_alloc failed %d\n", err); return (err); } if (type == IFLIB_INTR_ADMIN) return (0); if (tqrid != -1) { err = iflib_irq_set_affinity(ctx, irq, type, qid, gtask, tqg, q, name); if (err) return (err); } else { taskqgroup_attach(tqg, gtask, q, dev, irq->ii_res, name); } return (0); } void iflib_softirq_alloc_generic(if_ctx_t ctx, if_irq_t irq, iflib_intr_type_t type, void *arg, int qid, const char *name) { struct grouptask *gtask; struct taskqgroup *tqg; gtask_fn_t *fn; void *q; int err; switch (type) { case IFLIB_INTR_TX: q = &ctx->ifc_txqs[qid]; gtask = &ctx->ifc_txqs[qid].ift_task; tqg = qgroup_if_io_tqg; fn = _task_fn_tx; break; case IFLIB_INTR_RX: q = &ctx->ifc_rxqs[qid]; gtask = &ctx->ifc_rxqs[qid].ifr_task; tqg = qgroup_if_io_tqg; fn = _task_fn_rx; break; case IFLIB_INTR_IOV: q = ctx; gtask = &ctx->ifc_vflr_task; tqg = qgroup_if_config_tqg; fn = _task_fn_iov; break; default: panic("unknown net intr type"); } GROUPTASK_INIT(gtask, 0, fn, q); if (irq != NULL) { err = iflib_irq_set_affinity(ctx, irq, type, qid, gtask, tqg, q, name); if (err) taskqgroup_attach(tqg, gtask, q, ctx->ifc_dev, irq->ii_res, name); } else { taskqgroup_attach(tqg, gtask, q, NULL, NULL, name); } } void iflib_irq_free(if_ctx_t ctx, if_irq_t irq) { if (irq->ii_tag) bus_teardown_intr(ctx->ifc_dev, irq->ii_res, irq->ii_tag); if (irq->ii_res) bus_release_resource(ctx->ifc_dev, SYS_RES_IRQ, rman_get_rid(irq->ii_res), irq->ii_res); } static int iflib_legacy_setup(if_ctx_t ctx, driver_filter_t filter, void *filter_arg, int *rid, const char *name) { iflib_txq_t txq = ctx->ifc_txqs; iflib_rxq_t rxq = ctx->ifc_rxqs; if_irq_t irq = &ctx->ifc_legacy_irq; iflib_filter_info_t info; device_t dev; struct grouptask *gtask; struct resource *res; struct taskqgroup *tqg; gtask_fn_t *fn; int tqrid; void *q; int err; q = &ctx->ifc_rxqs[0]; info = &rxq[0].ifr_filter_info; gtask = &rxq[0].ifr_task; tqg = qgroup_if_io_tqg; tqrid = irq->ii_rid = *rid; fn = _task_fn_rx; ctx->ifc_flags |= IFC_LEGACY; info->ifi_filter = filter; info->ifi_filter_arg = filter_arg; info->ifi_task = gtask; info->ifi_ctx = ctx; dev = ctx->ifc_dev; /* We allocate a single interrupt resource */ if ((err = _iflib_irq_alloc(ctx, irq, tqrid, iflib_fast_intr_ctx, NULL, info, name)) != 0) return (err); GROUPTASK_INIT(gtask, 0, fn, q); res = irq->ii_res; 
taskqgroup_attach(tqg, gtask, q, dev, res, name); GROUPTASK_INIT(&txq->ift_task, 0, _task_fn_tx, txq); taskqgroup_attach(qgroup_if_io_tqg, &txq->ift_task, txq, dev, res, "tx"); return (0); } void iflib_led_create(if_ctx_t ctx) { ctx->ifc_led_dev = led_create(iflib_led_func, ctx, device_get_nameunit(ctx->ifc_dev)); } void iflib_tx_intr_deferred(if_ctx_t ctx, int txqid) { GROUPTASK_ENQUEUE(&ctx->ifc_txqs[txqid].ift_task); } void iflib_rx_intr_deferred(if_ctx_t ctx, int rxqid) { GROUPTASK_ENQUEUE(&ctx->ifc_rxqs[rxqid].ifr_task); } void iflib_admin_intr_deferred(if_ctx_t ctx) { #ifdef INVARIANTS struct grouptask *gtask; gtask = &ctx->ifc_admin_task; MPASS(gtask != NULL && gtask->gt_taskqueue != NULL); #endif GROUPTASK_ENQUEUE(&ctx->ifc_admin_task); } void iflib_iov_intr_deferred(if_ctx_t ctx) { GROUPTASK_ENQUEUE(&ctx->ifc_vflr_task); } void iflib_io_tqg_attach(struct grouptask *gt, void *uniq, int cpu, char *name) { taskqgroup_attach_cpu(qgroup_if_io_tqg, gt, uniq, cpu, NULL, NULL, name); } void iflib_config_gtask_init(void *ctx, struct grouptask *gtask, gtask_fn_t *fn, const char *name) { GROUPTASK_INIT(gtask, 0, fn, ctx); taskqgroup_attach(qgroup_if_config_tqg, gtask, gtask, NULL, NULL, name); } void iflib_config_gtask_deinit(struct grouptask *gtask) { taskqgroup_detach(qgroup_if_config_tqg, gtask); } void iflib_link_state_change(if_ctx_t ctx, int link_state, uint64_t baudrate) { if_t ifp = ctx->ifc_ifp; iflib_txq_t txq = ctx->ifc_txqs; if_setbaudrate(ifp, baudrate); if (baudrate >= IF_Gbps(10)) { STATE_LOCK(ctx); ctx->ifc_flags |= IFC_PREFETCH; STATE_UNLOCK(ctx); } /* If link down, disable watchdog */ if ((ctx->ifc_link_state == LINK_STATE_UP) && (link_state == LINK_STATE_DOWN)) { for (int i = 0; i < ctx->ifc_softc_ctx.isc_ntxqsets; i++, txq++) txq->ift_qstatus = IFLIB_QUEUE_IDLE; } ctx->ifc_link_state = link_state; if_link_state_change(ifp, link_state); } static int iflib_tx_credits_update(if_ctx_t ctx, iflib_txq_t txq) { int credits; #ifdef INVARIANTS int credits_pre = txq->ift_cidx_processed; #endif if (ctx->isc_txd_credits_update == NULL) return (0); bus_dmamap_sync(txq->ift_ifdi->idi_tag, txq->ift_ifdi->idi_map, BUS_DMASYNC_POSTREAD); if ((credits = ctx->isc_txd_credits_update(ctx->ifc_softc, txq->ift_id, true)) == 0) return (0); txq->ift_processed += credits; txq->ift_cidx_processed += credits; MPASS(credits_pre + credits == txq->ift_cidx_processed); if (txq->ift_cidx_processed >= txq->ift_size) txq->ift_cidx_processed -= txq->ift_size; return (credits); } static int iflib_rxd_avail(if_ctx_t ctx, iflib_rxq_t rxq, qidx_t cidx, qidx_t budget) { iflib_fl_t fl; u_int i; for (i = 0, fl = &rxq->ifr_fl[0]; i < rxq->ifr_nfl; i++, fl++) bus_dmamap_sync(fl->ifl_ifdi->idi_tag, fl->ifl_ifdi->idi_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); return (ctx->isc_rxd_available(ctx->ifc_softc, rxq->ifr_id, cidx, budget)); } void iflib_add_int_delay_sysctl(if_ctx_t ctx, const char *name, const char *description, if_int_delay_info_t info, int offset, int value) { info->iidi_ctx = ctx; info->iidi_offset = offset; info->iidi_value = value; SYSCTL_ADD_PROC(device_get_sysctl_ctx(ctx->ifc_dev), SYSCTL_CHILDREN(device_get_sysctl_tree(ctx->ifc_dev)), OID_AUTO, name, CTLTYPE_INT|CTLFLAG_RW, info, 0, iflib_sysctl_int_delay, "I", description); } struct sx * iflib_ctx_lock_get(if_ctx_t ctx) { return (&ctx->ifc_ctx_sx); } static int iflib_msix_init(if_ctx_t ctx) { device_t dev = ctx->ifc_dev; if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; int vectors, queues, rx_queues, 
tx_queues, queuemsgs, msgs; int iflib_num_tx_queues, iflib_num_rx_queues; int err, admincnt, bar; iflib_num_tx_queues = ctx->ifc_sysctl_ntxqs; iflib_num_rx_queues = ctx->ifc_sysctl_nrxqs; if (bootverbose) device_printf(dev, "msix_init qsets capped at %d\n", imax(scctx->isc_ntxqsets, scctx->isc_nrxqsets)); bar = ctx->ifc_softc_ctx.isc_msix_bar; admincnt = sctx->isc_admin_intrcnt; /* Override by tuneable */ if (scctx->isc_disable_msix) goto msi; /* First try MSI-X */ if ((msgs = pci_msix_count(dev)) == 0) { if (bootverbose) device_printf(dev, "MSI-X not supported or disabled\n"); goto msi; } /* * bar == -1 => "trust me I know what I'm doing" * Some drivers are for hardware that is so shoddily * documented that no one knows which bars are which * so the developer has to map all bars. This hack * allows shoddy garbage to use MSI-X in this framework. */ if (bar != -1) { ctx->ifc_msix_mem = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &bar, RF_ACTIVE); if (ctx->ifc_msix_mem == NULL) { device_printf(dev, "Unable to map MSI-X table\n"); goto msi; } } #if IFLIB_DEBUG /* use only 1 qset in debug mode */ queuemsgs = min(msgs - admincnt, 1); #else queuemsgs = msgs - admincnt; #endif #ifdef RSS queues = imin(queuemsgs, rss_getnumbuckets()); #else queues = queuemsgs; #endif queues = imin(CPU_COUNT(&ctx->ifc_cpus), queues); if (bootverbose) device_printf(dev, "intr CPUs: %d queue msgs: %d admincnt: %d\n", CPU_COUNT(&ctx->ifc_cpus), queuemsgs, admincnt); #ifdef RSS /* If we're doing RSS, clamp at the number of RSS buckets */ if (queues > rss_getnumbuckets()) queues = rss_getnumbuckets(); #endif if (iflib_num_rx_queues > 0 && iflib_num_rx_queues < queuemsgs - admincnt) rx_queues = iflib_num_rx_queues; else rx_queues = queues; if (rx_queues > scctx->isc_nrxqsets) rx_queues = scctx->isc_nrxqsets; /* * We want this to be all logical CPUs by default */ if (iflib_num_tx_queues > 0 && iflib_num_tx_queues < queues) tx_queues = iflib_num_tx_queues; else tx_queues = mp_ncpus; if (tx_queues > scctx->isc_ntxqsets) tx_queues = scctx->isc_ntxqsets; if (ctx->ifc_sysctl_qs_eq_override == 0) { #ifdef INVARIANTS if (tx_queues != rx_queues) device_printf(dev, "queue equality override not set, capping rx_queues at %d and tx_queues at %d\n", min(rx_queues, tx_queues), min(rx_queues, tx_queues)); #endif tx_queues = min(rx_queues, tx_queues); rx_queues = min(rx_queues, tx_queues); } device_printf(dev, "Using %d rx queues %d tx queues\n", rx_queues, tx_queues); vectors = rx_queues + admincnt; if ((err = pci_alloc_msix(dev, &vectors)) == 0) { device_printf(dev, "Using MSI-X interrupts with %d vectors\n", vectors); scctx->isc_vectors = vectors; scctx->isc_nrxqsets = rx_queues; scctx->isc_ntxqsets = tx_queues; scctx->isc_intr = IFLIB_INTR_MSIX; return (vectors); } else { device_printf(dev, "failed to allocate %d MSI-X vectors, err: %d - using MSI\n", vectors, err); bus_release_resource(dev, SYS_RES_MEMORY, bar, ctx->ifc_msix_mem); ctx->ifc_msix_mem = NULL; } msi: vectors = pci_msi_count(dev); scctx->isc_nrxqsets = 1; scctx->isc_ntxqsets = 1; scctx->isc_vectors = vectors; if (vectors == 1 && pci_alloc_msi(dev, &vectors) == 0) { device_printf(dev,"Using an MSI interrupt\n"); scctx->isc_intr = IFLIB_INTR_MSI; } else { scctx->isc_vectors = 1; device_printf(dev,"Using a Legacy interrupt\n"); scctx->isc_intr = IFLIB_INTR_LEGACY; } return (vectors); } static const char *ring_states[] = { "IDLE", "BUSY", "STALLED", "ABDICATED" }; static int mp_ring_state_handler(SYSCTL_HANDLER_ARGS) { int rc; uint16_t *state = ((uint16_t *)oidp->oid_arg1); 
struct sbuf *sb; const char *ring_state = "UNKNOWN"; /* XXX needed ? */ rc = sysctl_wire_old_buffer(req, 0); MPASS(rc == 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 80, req); MPASS(sb != NULL); if (sb == NULL) return (ENOMEM); if (state[3] <= 3) ring_state = ring_states[state[3]]; sbuf_printf(sb, "pidx_head: %04hd pidx_tail: %04hd cidx: %04hd state: %s", state[0], state[1], state[2], ring_state); rc = sbuf_finish(sb); sbuf_delete(sb); return(rc); } enum iflib_ndesc_handler { IFLIB_NTXD_HANDLER, IFLIB_NRXD_HANDLER, }; static int mp_ndesc_handler(SYSCTL_HANDLER_ARGS) { if_ctx_t ctx = (void *)arg1; enum iflib_ndesc_handler type = arg2; char buf[256] = {0}; qidx_t *ndesc; char *p, *next; int nqs, rc, i; MPASS(type == IFLIB_NTXD_HANDLER || type == IFLIB_NRXD_HANDLER); nqs = 8; switch(type) { case IFLIB_NTXD_HANDLER: ndesc = ctx->ifc_sysctl_ntxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_ntxqs; break; case IFLIB_NRXD_HANDLER: ndesc = ctx->ifc_sysctl_nrxds; if (ctx->ifc_sctx) nqs = ctx->ifc_sctx->isc_nrxqs; break; default: panic("unhandled type"); } if (nqs == 0) nqs = 8; for (i=0; i<8; i++) { if (i >= nqs) break; if (i) strcat(buf, ","); sprintf(strchr(buf, 0), "%d", ndesc[i]); } rc = sysctl_handle_string(oidp, buf, sizeof(buf), req); if (rc || req->newptr == NULL) return rc; for (i = 0, next = buf, p = strsep(&next, " ,"); i < 8 && p; i++, p = strsep(&next, " ,")) { ndesc[i] = strtoul(p, NULL, 10); } return(rc); } #define NAME_BUFLEN 32 static void iflib_add_device_sysctl_pre(if_ctx_t ctx) { device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child, *oid_list; struct sysctl_ctx_list *ctx_list; struct sysctl_oid *node; ctx_list = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); ctx->ifc_sysctl_node = node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, "iflib", CTLFLAG_RD, NULL, "IFLIB fields"); oid_list = SYSCTL_CHILDREN(node); SYSCTL_ADD_CONST_STRING(ctx_list, oid_list, OID_AUTO, "driver_version", CTLFLAG_RD, ctx->ifc_sctx->isc_driver_version, "driver version"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_ntxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_ntxqs, 0, "# of txqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_nrxqs", CTLFLAG_RWTUN, &ctx->ifc_sysctl_nrxqs, 0, "# of rxqs to use, 0 => use default #"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "override_qs_enable", CTLFLAG_RWTUN, &ctx->ifc_sysctl_qs_eq_override, 0, "permit #txq != #rxq"); SYSCTL_ADD_INT(ctx_list, oid_list, OID_AUTO, "disable_msix", CTLFLAG_RWTUN, &ctx->ifc_softc_ctx.isc_disable_msix, 0, "disable MSI-X (default 0)"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "rx_budget", CTLFLAG_RWTUN, &ctx->ifc_sysctl_rx_budget, 0, "set the rx budget"); SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "tx_abdicate", CTLFLAG_RWTUN, &ctx->ifc_sysctl_tx_abdicate, 0, "cause tx to abdicate instead of running to completion"); + ctx->ifc_sysctl_core_offset = CORE_OFFSET_UNSPECIFIED; + SYSCTL_ADD_U16(ctx_list, oid_list, OID_AUTO, "core_offset", + CTLFLAG_RDTUN, &ctx->ifc_sysctl_core_offset, 0, + "offset to start using cores at"); + SYSCTL_ADD_U8(ctx_list, oid_list, OID_AUTO, "separate_txrx", + CTLFLAG_RDTUN, &ctx->ifc_sysctl_separate_txrx, 0, + "use separate cores for TX and RX"); /* XXX change for per-queue sizes */ SYSCTL_ADD_PROC(ctx_list, oid_list, OID_AUTO, "override_ntxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NTXD_HANDLER, mp_ndesc_handler, "A", "list of # of tx descriptors to use, 0 = use default #"); SYSCTL_ADD_PROC(ctx_list, 
oid_list, OID_AUTO, "override_nrxds", CTLTYPE_STRING|CTLFLAG_RWTUN, ctx, IFLIB_NRXD_HANDLER, mp_ndesc_handler, "A", "list of # of rx descriptors to use, 0 = use default #"); } static void iflib_add_device_sysctl_post(if_ctx_t ctx) { if_shared_ctx_t sctx = ctx->ifc_sctx; if_softc_ctx_t scctx = &ctx->ifc_softc_ctx; device_t dev = iflib_get_dev(ctx); struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx_list; iflib_fl_t fl; iflib_txq_t txq; iflib_rxq_t rxq; int i, j; char namebuf[NAME_BUFLEN]; char *qfmt; struct sysctl_oid *queue_node, *fl_node, *node; struct sysctl_oid_list *queue_list, *fl_list; ctx_list = device_get_sysctl_ctx(dev); node = ctx->ifc_sysctl_node; child = SYSCTL_CHILDREN(node); if (scctx->isc_ntxqsets > 100) qfmt = "txq%03d"; else if (scctx->isc_ntxqsets > 10) qfmt = "txq%02d"; else qfmt = "txq%d"; for (i = 0, txq = ctx->ifc_txqs; i < scctx->isc_ntxqsets; i++, txq++) { snprintf(namebuf, NAME_BUFLEN, qfmt, i); queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); #if MEMORY_LOGGING SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_dequeued", CTLFLAG_RD, &txq->ift_dequeued, "total mbufs freed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_enqueued", CTLFLAG_RD, &txq->ift_enqueued, "total mbufs enqueued"); #endif SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag", CTLFLAG_RD, &txq->ift_mbuf_defrag, "# of times m_defrag was called"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "m_pullups", CTLFLAG_RD, &txq->ift_pullups, "# of times m_pullup was called"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "mbuf_defrag_failed", CTLFLAG_RD, &txq->ift_mbuf_defrag_failed, "# of times m_defrag failed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_desc_avail", CTLFLAG_RD, &txq->ift_no_desc_avail, "# of times no descriptors were available"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "tx_map_failed", CTLFLAG_RD, &txq->ift_map_failed, "# of times dma map failed"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txd_encap_efbig", CTLFLAG_RD, &txq->ift_txd_encap_efbig, "# of times txd_encap returned EFBIG"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "no_tx_dma_setup", CTLFLAG_RD, &txq->ift_no_tx_dma_setup, "# of times map failed for other than EFBIG"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_pidx", CTLFLAG_RD, &txq->ift_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx", CTLFLAG_RD, &txq->ift_cidx, 1, "Consumer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_cidx_processed", CTLFLAG_RD, &txq->ift_cidx_processed, 1, "Consumer Index seen by credit update"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "txq_in_use", CTLFLAG_RD, &txq->ift_in_use, 1, "descriptors in use"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_processed", CTLFLAG_RD, &txq->ift_processed, "descriptors procesed for clean"); SYSCTL_ADD_QUAD(ctx_list, queue_list, OID_AUTO, "txq_cleaned", CTLFLAG_RD, &txq->ift_cleaned, "total cleaned"); SYSCTL_ADD_PROC(ctx_list, queue_list, OID_AUTO, "ring_state", CTLTYPE_STRING | CTLFLAG_RD, __DEVOLATILE(uint64_t *, &txq->ift_br->state), 0, mp_ring_state_handler, "A", "soft ring state"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_enqueues", CTLFLAG_RD, &txq->ift_br->enqueues, "# of enqueues to the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_drops", CTLFLAG_RD, &txq->ift_br->drops, "# of drops in the mp_ring for this queue"); 
SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_starts", CTLFLAG_RD, &txq->ift_br->starts, "# of normal consumer starts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_stalls", CTLFLAG_RD, &txq->ift_br->stalls, "# of consumer stalls in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_restarts", CTLFLAG_RD, &txq->ift_br->restarts, "# of consumer restarts in the mp_ring for this queue"); SYSCTL_ADD_COUNTER_U64(ctx_list, queue_list, OID_AUTO, "r_abdications", CTLFLAG_RD, &txq->ift_br->abdications, "# of consumer abdications in the mp_ring for this queue"); } if (scctx->isc_nrxqsets > 100) qfmt = "rxq%03d"; else if (scctx->isc_nrxqsets > 10) qfmt = "rxq%02d"; else qfmt = "rxq%d"; for (i = 0, rxq = ctx->ifc_rxqs; i < scctx->isc_nrxqsets; i++, rxq++) { snprintf(namebuf, NAME_BUFLEN, qfmt, i); queue_node = SYSCTL_ADD_NODE(ctx_list, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); if (sctx->isc_flags & IFLIB_HAS_RXCQ) { SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_pidx", CTLFLAG_RD, &rxq->ifr_cq_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, queue_list, OID_AUTO, "rxq_cq_cidx", CTLFLAG_RD, &rxq->ifr_cq_cidx, 1, "Consumer Index"); } for (j = 0, fl = rxq->ifr_fl; j < rxq->ifr_nfl; j++, fl++) { snprintf(namebuf, NAME_BUFLEN, "rxq_fl%d", j); fl_node = SYSCTL_ADD_NODE(ctx_list, queue_list, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "freelist Name"); fl_list = SYSCTL_CHILDREN(fl_node); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "pidx", CTLFLAG_RD, &fl->ifl_pidx, 1, "Producer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "cidx", CTLFLAG_RD, &fl->ifl_cidx, 1, "Consumer Index"); SYSCTL_ADD_U16(ctx_list, fl_list, OID_AUTO, "credits", CTLFLAG_RD, &fl->ifl_credits, 1, "credits available"); #if MEMORY_LOGGING SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_enqueued", CTLFLAG_RD, &fl->ifl_m_enqueued, "mbufs allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_m_dequeued", CTLFLAG_RD, &fl->ifl_m_dequeued, "mbufs freed"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_enqueued", CTLFLAG_RD, &fl->ifl_cl_enqueued, "clusters allocated"); SYSCTL_ADD_QUAD(ctx_list, fl_list, OID_AUTO, "fl_cl_dequeued", CTLFLAG_RD, &fl->ifl_cl_dequeued, "clusters freed"); #endif } } } void iflib_request_reset(if_ctx_t ctx) { STATE_LOCK(ctx); ctx->ifc_flags |= IFC_DO_RESET; STATE_UNLOCK(ctx); } #ifndef __NO_STRICT_ALIGNMENT static struct mbuf * iflib_fixup_rx(struct mbuf *m) { struct mbuf *n; if (m->m_len <= (MCLBYTES - ETHER_HDR_LEN)) { bcopy(m->m_data, m->m_data + ETHER_HDR_LEN, m->m_len); m->m_data += ETHER_HDR_LEN; n = m; } else { MGETHDR(n, M_NOWAIT, MT_DATA); if (n == NULL) { m_freem(m); return (NULL); } bcopy(m->m_data, n->m_data, ETHER_HDR_LEN); m->m_data += ETHER_HDR_LEN; m->m_len -= ETHER_HDR_LEN; n->m_len = ETHER_HDR_LEN; M_MOVE_PKTHDR(n, m); n->m_next = m; } return (n); } #endif #ifdef NETDUMP static void iflib_netdump_init(struct ifnet *ifp, int *nrxr, int *ncl, int *clsize) { if_ctx_t ctx; ctx = if_getsoftc(ifp); CTX_LOCK(ctx); *nrxr = NRXQSETS(ctx); *ncl = ctx->ifc_rxqs[0].ifr_fl->ifl_size; *clsize = ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size; CTX_UNLOCK(ctx); } static void iflib_netdump_event(struct ifnet *ifp, enum netdump_ev event) { if_ctx_t ctx; if_softc_ctx_t scctx; iflib_fl_t fl; iflib_rxq_t rxq; int i, j; ctx = if_getsoftc(ifp); scctx = &ctx->ifc_softc_ctx; switch (event) { case NETDUMP_START: for (i = 0; i < scctx->isc_nrxqsets; i++) { rxq = 
&ctx->ifc_rxqs[i]; for (j = 0; j < rxq->ifr_nfl; j++) { fl = rxq->ifr_fl; fl->ifl_zone = m_getzone(fl->ifl_buf_size); } } iflib_no_tx_batch = 1; break; default: break; } } static int iflib_netdump_transmit(struct ifnet *ifp, struct mbuf *m) { if_ctx_t ctx; iflib_txq_t txq; int error; ctx = if_getsoftc(ifp); if ((if_getdrvflags(ifp) & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return (EBUSY); txq = &ctx->ifc_txqs[0]; error = iflib_encap(txq, &m); if (error == 0) (void)iflib_txd_db_check(ctx, txq, true, txq->ift_in_use); return (error); } static int iflib_netdump_poll(struct ifnet *ifp, int count) { if_ctx_t ctx; if_softc_ctx_t scctx; iflib_txq_t txq; int i; ctx = if_getsoftc(ifp); scctx = &ctx->ifc_softc_ctx; if ((if_getdrvflags(ifp) & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return (EBUSY); txq = &ctx->ifc_txqs[0]; (void)iflib_completed_tx_reclaim(txq, RECLAIM_THRESH(ctx)); for (i = 0; i < scctx->isc_nrxqsets; i++) (void)iflib_rxeof(&ctx->ifc_rxqs[i], 16 /* XXX */); return (0); } #endif /* NETDUMP */ Index: user/ngie/bug-237403/sys/netinet/in_mcast.c =================================================================== --- user/ngie/bug-237403/sys/netinet/in_mcast.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet/in_mcast.c (revision 346926) @@ -1,3151 +1,3162 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 2007-2009 Bruce Simpson. * Copyright (c) 2005 Robert N. M. Watson. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote * products derived from this software without specific prior written * permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * IPv4 multicast socket, group, and socket option processing module. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifndef KTR_IGMPV3 #define KTR_IGMPV3 KTR_INET #endif #ifndef __SOCKUNION_DECLARED union sockunion { struct sockaddr_storage ss; struct sockaddr sa; struct sockaddr_dl sdl; struct sockaddr_in sin; }; typedef union sockunion sockunion_t; #define __SOCKUNION_DECLARED #endif /* __SOCKUNION_DECLARED */ static MALLOC_DEFINE(M_INMFILTER, "in_mfilter", "IPv4 multicast PCB-layer source filter"); static MALLOC_DEFINE(M_IPMADDR, "in_multi", "IPv4 multicast group"); static MALLOC_DEFINE(M_IPMOPTS, "ip_moptions", "IPv4 multicast options"); static MALLOC_DEFINE(M_IPMSOURCE, "ip_msource", "IPv4 multicast IGMP-layer source filter"); /* * Locking: * - Lock order is: Giant, INP_WLOCK, IN_MULTI_LIST_LOCK, IGMP_LOCK, IF_ADDR_LOCK. * - The IF_ADDR_LOCK is implicitly taken by inm_lookup() earlier, however * it can be taken by code in net/if.c also. * - ip_moptions and in_mfilter are covered by the INP_WLOCK. * * struct in_multi is covered by IN_MULTI_LIST_LOCK. There isn't strictly * any need for in_multi itself to be virtualized -- it is bound to an ifp * anyway no matter what happens. */ struct mtx in_multi_list_mtx; MTX_SYSINIT(in_multi_mtx, &in_multi_list_mtx, "in_multi_list_mtx", MTX_DEF); struct mtx in_multi_free_mtx; MTX_SYSINIT(in_multi_free_mtx, &in_multi_free_mtx, "in_multi_free_mtx", MTX_DEF); struct sx in_multi_sx; SX_SYSINIT(in_multi_sx, &in_multi_sx, "in_multi_sx"); int ifma_restart; /* * Functions with non-static linkage defined in this file should be * declared in in_var.h: * imo_multi_filter() * in_addmulti() * in_delmulti() * in_joingroup() * in_joingroup_locked() * in_leavegroup() * in_leavegroup_locked() * and ip_var.h: * inp_freemoptions() * inp_getmoptions() * inp_setmoptions() * * XXX: Both carp and pf need to use the legacy (*,G) KPIs in_addmulti() * and in_delmulti(). 
*/ static void imf_commit(struct in_mfilter *); static int imf_get_source(struct in_mfilter *imf, const struct sockaddr_in *psin, struct in_msource **); static struct in_msource * imf_graft(struct in_mfilter *, const uint8_t, const struct sockaddr_in *); static void imf_leave(struct in_mfilter *); static int imf_prune(struct in_mfilter *, const struct sockaddr_in *); static void imf_purge(struct in_mfilter *); static void imf_rollback(struct in_mfilter *); static void imf_reap(struct in_mfilter *); static int imo_grow(struct ip_moptions *); static size_t imo_match_group(const struct ip_moptions *, const struct ifnet *, const struct sockaddr *); static struct in_msource * imo_match_source(const struct ip_moptions *, const size_t, const struct sockaddr *); static void ims_merge(struct ip_msource *ims, const struct in_msource *lims, const int rollback); static int in_getmulti(struct ifnet *, const struct in_addr *, struct in_multi **); static int inm_get_source(struct in_multi *inm, const in_addr_t haddr, const int noalloc, struct ip_msource **pims); #ifdef KTR static int inm_is_ifp_detached(const struct in_multi *); #endif static int inm_merge(struct in_multi *, /*const*/ struct in_mfilter *); static void inm_purge(struct in_multi *); static void inm_reap(struct in_multi *); static void inm_release(struct in_multi *); static struct ip_moptions * inp_findmoptions(struct inpcb *); static int inp_get_source_filters(struct inpcb *, struct sockopt *); static int inp_join_group(struct inpcb *, struct sockopt *); static int inp_leave_group(struct inpcb *, struct sockopt *); static struct ifnet * inp_lookup_mcast_ifp(const struct inpcb *, const struct sockaddr_in *, const struct in_addr); static int inp_block_unblock_source(struct inpcb *, struct sockopt *); static int inp_set_multicast_if(struct inpcb *, struct sockopt *); static int inp_set_source_filters(struct inpcb *, struct sockopt *); static int sysctl_ip_mcast_filters(SYSCTL_HANDLER_ARGS); static SYSCTL_NODE(_net_inet_ip, OID_AUTO, mcast, CTLFLAG_RW, 0, "IPv4 multicast"); static u_long in_mcast_maxgrpsrc = IP_MAX_GROUP_SRC_FILTER; SYSCTL_ULONG(_net_inet_ip_mcast, OID_AUTO, maxgrpsrc, CTLFLAG_RWTUN, &in_mcast_maxgrpsrc, 0, "Max source filters per group"); static u_long in_mcast_maxsocksrc = IP_MAX_SOCK_SRC_FILTER; SYSCTL_ULONG(_net_inet_ip_mcast, OID_AUTO, maxsocksrc, CTLFLAG_RWTUN, &in_mcast_maxsocksrc, 0, "Max source filters per socket"); int in_mcast_loop = IP_DEFAULT_MULTICAST_LOOP; SYSCTL_INT(_net_inet_ip_mcast, OID_AUTO, loop, CTLFLAG_RWTUN, &in_mcast_loop, 0, "Loopback multicast datagrams by default"); static SYSCTL_NODE(_net_inet_ip_mcast, OID_AUTO, filters, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_ip_mcast_filters, "Per-interface stack-wide source filters"); #ifdef KTR /* * Inline function which wraps assertions for a valid ifp. * The ifnet layer will set the ifma's ifp pointer to NULL if the ifp * is detached. */ static int __inline inm_is_ifp_detached(const struct in_multi *inm) { struct ifnet *ifp; KASSERT(inm->inm_ifma != NULL, ("%s: no ifma", __func__)); ifp = inm->inm_ifma->ifma_ifp; if (ifp != NULL) { /* * Sanity check that netinet's notion of ifp is the * same as net's. 
*/ KASSERT(inm->inm_ifp == ifp, ("%s: bad ifp", __func__)); } return (ifp == NULL); } #endif static struct grouptask free_gtask; static struct in_multi_head inm_free_list; static void inm_release_task(void *arg __unused); static void inm_init(void) { SLIST_INIT(&inm_free_list); taskqgroup_config_gtask_init(NULL, &free_gtask, inm_release_task, "inm release task"); } #ifdef EARLY_AP_STARTUP SYSINIT(inm_init, SI_SUB_SMP + 1, SI_ORDER_FIRST, inm_init, NULL); #else SYSINIT(inm_init, SI_SUB_ROOT_CONF - 1, SI_ORDER_FIRST, inm_init, NULL); #endif void inm_release_list_deferred(struct in_multi_head *inmh) { if (SLIST_EMPTY(inmh)) return; mtx_lock(&in_multi_free_mtx); SLIST_CONCAT(&inm_free_list, inmh, in_multi, inm_nrele); mtx_unlock(&in_multi_free_mtx); GROUPTASK_ENQUEUE(&free_gtask); } void inm_disconnect(struct in_multi *inm) { struct ifnet *ifp; struct ifmultiaddr *ifma, *ll_ifma; ifp = inm->inm_ifp; IF_ADDR_WLOCK_ASSERT(ifp); ifma = inm->inm_ifma; if_ref(ifp); if (ifma->ifma_flags & IFMA_F_ENQUEUED) { CK_STAILQ_REMOVE(&ifp->if_multiaddrs, ifma, ifmultiaddr, ifma_link); ifma->ifma_flags &= ~IFMA_F_ENQUEUED; } MCDPRINTF("removed ifma: %p from %s\n", ifma, ifp->if_xname); if ((ll_ifma = ifma->ifma_llifma) != NULL) { MPASS(ifma != ll_ifma); ifma->ifma_llifma = NULL; MPASS(ll_ifma->ifma_llifma == NULL); MPASS(ll_ifma->ifma_ifp == ifp); if (--ll_ifma->ifma_refcount == 0) { if (ll_ifma->ifma_flags & IFMA_F_ENQUEUED) { CK_STAILQ_REMOVE(&ifp->if_multiaddrs, ll_ifma, ifmultiaddr, ifma_link); ll_ifma->ifma_flags &= ~IFMA_F_ENQUEUED; } MCDPRINTF("removed ll_ifma: %p from %s\n", ll_ifma, ifp->if_xname); if_freemulti(ll_ifma); ifma_restart = true; } } } void inm_release_deferred(struct in_multi *inm) { struct in_multi_head tmp; IN_MULTI_LIST_LOCK_ASSERT(); MPASS(inm->inm_refcount > 0); if (--inm->inm_refcount == 0) { SLIST_INIT(&tmp); inm_disconnect(inm); inm->inm_ifma->ifma_protospec = NULL; SLIST_INSERT_HEAD(&tmp, inm, inm_nrele); inm_release_list_deferred(&tmp); } } static void inm_release_task(void *arg __unused) { struct in_multi_head inm_free_tmp; struct in_multi *inm, *tinm; SLIST_INIT(&inm_free_tmp); mtx_lock(&in_multi_free_mtx); SLIST_CONCAT(&inm_free_tmp, &inm_free_list, in_multi, inm_nrele); mtx_unlock(&in_multi_free_mtx); IN_MULTI_LOCK(); SLIST_FOREACH_SAFE(inm, &inm_free_tmp, inm_nrele, tinm) { SLIST_REMOVE_HEAD(&inm_free_tmp, inm_nrele); MPASS(inm); inm_release(inm); } IN_MULTI_UNLOCK(); } /* * Initialize an in_mfilter structure to a known state at t0, t1 * with an empty source filter list. */ static __inline void imf_init(struct in_mfilter *imf, const int st0, const int st1) { memset(imf, 0, sizeof(struct in_mfilter)); RB_INIT(&imf->imf_sources); imf->imf_st[0] = st0; imf->imf_st[1] = st1; } /* * Function for looking up an in_multi record for an IPv4 multicast address * on a given interface. ifp must be valid. If no record found, return NULL. * The IN_MULTI_LIST_LOCK and IF_ADDR_LOCK on ifp must be held. */ struct in_multi * inm_lookup_locked(struct ifnet *ifp, const struct in_addr ina) { struct ifmultiaddr *ifma; struct in_multi *inm; IN_MULTI_LIST_LOCK_ASSERT(); IF_ADDR_LOCK_ASSERT(ifp); inm = NULL; CK_STAILQ_FOREACH(ifma, &((ifp)->if_multiaddrs), ifma_link) { if (ifma->ifma_addr->sa_family != AF_INET || ifma->ifma_protospec == NULL) continue; inm = (struct in_multi *)ifma->ifma_protospec; if (inm->inm_addr.s_addr == ina.s_addr) break; inm = NULL; } return (inm); } /* * Wrapper for inm_lookup_locked(). * The IF_ADDR_LOCK will be taken on ifp and released on return. 
*/ struct in_multi * inm_lookup(struct ifnet *ifp, const struct in_addr ina) { struct epoch_tracker et; struct in_multi *inm; IN_MULTI_LIST_LOCK_ASSERT(); NET_EPOCH_ENTER(et); inm = inm_lookup_locked(ifp, ina); NET_EPOCH_EXIT(et); return (inm); } /* * Resize the ip_moptions vector to the next power-of-two minus 1. * May be called with locks held; do not sleep. */ static int imo_grow(struct ip_moptions *imo) { struct in_multi **nmships; struct in_multi **omships; struct in_mfilter *nmfilters; struct in_mfilter *omfilters; size_t idx; size_t newmax; size_t oldmax; nmships = NULL; nmfilters = NULL; omships = imo->imo_membership; omfilters = imo->imo_mfilters; oldmax = imo->imo_max_memberships; newmax = ((oldmax + 1) * 2) - 1; if (newmax <= IP_MAX_MEMBERSHIPS) { nmships = (struct in_multi **)realloc(omships, sizeof(struct in_multi *) * newmax, M_IPMOPTS, M_NOWAIT); nmfilters = (struct in_mfilter *)realloc(omfilters, sizeof(struct in_mfilter) * newmax, M_INMFILTER, M_NOWAIT); if (nmships != NULL && nmfilters != NULL) { /* Initialize newly allocated source filter heads. */ for (idx = oldmax; idx < newmax; idx++) { imf_init(&nmfilters[idx], MCAST_UNDEFINED, MCAST_EXCLUDE); } imo->imo_max_memberships = newmax; imo->imo_membership = nmships; imo->imo_mfilters = nmfilters; } } if (nmships == NULL || nmfilters == NULL) { if (nmships != NULL) free(nmships, M_IPMOPTS); if (nmfilters != NULL) free(nmfilters, M_INMFILTER); return (ETOOMANYREFS); } return (0); } /* * Find an IPv4 multicast group entry for this ip_moptions instance * which matches the specified group, and optionally an interface. * Return its index into the array, or -1 if not found. */ static size_t imo_match_group(const struct ip_moptions *imo, const struct ifnet *ifp, const struct sockaddr *group) { const struct sockaddr_in *gsin; struct in_multi **pinm; int idx; int nmships; gsin = (const struct sockaddr_in *)group; /* The imo_membership array may be lazy allocated. */ if (imo->imo_membership == NULL || imo->imo_num_memberships == 0) return (-1); nmships = imo->imo_num_memberships; pinm = &imo->imo_membership[0]; for (idx = 0; idx < nmships; idx++, pinm++) { if (*pinm == NULL) continue; if ((ifp == NULL || ((*pinm)->inm_ifp == ifp)) && in_hosteq((*pinm)->inm_addr, gsin->sin_addr)) { break; } } if (idx >= nmships) idx = -1; return (idx); } /* * Find an IPv4 multicast source entry for this imo which matches * the given group index for this socket, and source address. * * NOTE: This does not check if the entry is in-mode, merely if * it exists, which may not be the desired behaviour. */ static struct in_msource * imo_match_source(const struct ip_moptions *imo, const size_t gidx, const struct sockaddr *src) { struct ip_msource find; struct in_mfilter *imf; struct ip_msource *ims; const sockunion_t *psa; KASSERT(src->sa_family == AF_INET, ("%s: !AF_INET", __func__)); KASSERT(gidx != -1 && gidx < imo->imo_num_memberships, ("%s: invalid index %d\n", __func__, (int)gidx)); /* The imo_mfilters array may be lazy allocated. */ if (imo->imo_mfilters == NULL) return (NULL); imf = &imo->imo_mfilters[gidx]; /* Source trees are keyed in host byte order. */ psa = (const sockunion_t *)src; find.ims_haddr = ntohl(psa->sin.sin_addr.s_addr); ims = RB_FIND(ip_msource_tree, &imf->imf_sources, &find); return ((struct in_msource *)ims); } /* * Perform filtering for multicast datagrams on a socket by group and source. 
* * Returns 0 if a datagram should be allowed through, or various error codes * if the socket was not a member of the group, or the source was muted, etc. */ int imo_multi_filter(const struct ip_moptions *imo, const struct ifnet *ifp, const struct sockaddr *group, const struct sockaddr *src) { size_t gidx; struct in_msource *ims; int mode; KASSERT(ifp != NULL, ("%s: null ifp", __func__)); gidx = imo_match_group(imo, ifp, group); if (gidx == -1) return (MCAST_NOTGMEMBER); /* * Check if the source was included in an (S,G) join. * Allow reception on exclusive memberships by default, * reject reception on inclusive memberships by default. * Exclude source only if an in-mode exclude filter exists. * Include source only if an in-mode include filter exists. * NOTE: We are comparing group state here at IGMP t1 (now) * with socket-layer t0 (since last downcall). */ mode = imo->imo_mfilters[gidx].imf_st[1]; ims = imo_match_source(imo, gidx, src); if ((ims == NULL && mode == MCAST_INCLUDE) || (ims != NULL && ims->imsl_st[0] != mode)) return (MCAST_NOTSMEMBER); return (MCAST_PASS); } /* * Find and return a reference to an in_multi record for (ifp, group), * and bump its reference count. * If one does not exist, try to allocate it, and update link-layer multicast * filters on ifp to listen for group. * Assumes the IN_MULTI lock is held across the call. * Return 0 if successful, otherwise return an appropriate error code. */ static int in_getmulti(struct ifnet *ifp, const struct in_addr *group, struct in_multi **pinm) { struct sockaddr_in gsin; struct ifmultiaddr *ifma; struct in_ifinfo *ii; struct in_multi *inm; int error; IN_MULTI_LOCK_ASSERT(); ii = (struct in_ifinfo *)ifp->if_afdata[AF_INET]; IN_MULTI_LIST_LOCK(); inm = inm_lookup(ifp, *group); if (inm != NULL) { /* * If we already joined this group, just bump the * refcount and return it. */ KASSERT(inm->inm_refcount >= 1, ("%s: bad refcount %d", __func__, inm->inm_refcount)); inm_acquire_locked(inm); *pinm = inm; } IN_MULTI_LIST_UNLOCK(); if (inm != NULL) return (0); memset(&gsin, 0, sizeof(gsin)); gsin.sin_family = AF_INET; gsin.sin_len = sizeof(struct sockaddr_in); gsin.sin_addr = *group; /* * Check if a link-layer group is already associated * with this network-layer group on the given ifnet. */ error = if_addmulti(ifp, (struct sockaddr *)&gsin, &ifma); if (error != 0) return (error); /* XXX ifma_protospec must be covered by IF_ADDR_LOCK */ IN_MULTI_LIST_LOCK(); IF_ADDR_WLOCK(ifp); /* * If something other than netinet is occupying the link-layer * group, print a meaningful error message and back out of * the allocation. * Otherwise, bump the refcount on the existing network-layer * group association and return it. */ if (ifma->ifma_protospec != NULL) { inm = (struct in_multi *)ifma->ifma_protospec; #ifdef INVARIANTS KASSERT(ifma->ifma_addr != NULL, ("%s: no ifma_addr", __func__)); KASSERT(ifma->ifma_addr->sa_family == AF_INET, ("%s: ifma not AF_INET", __func__)); KASSERT(inm != NULL, ("%s: no ifma_protospec", __func__)); if (inm->inm_ifma != ifma || inm->inm_ifp != ifp || !in_hosteq(inm->inm_addr, *group)) { char addrbuf[INET_ADDRSTRLEN]; panic("%s: ifma %p is inconsistent with %p (%s)", __func__, ifma, inm, inet_ntoa_r(*group, addrbuf)); } #endif inm_acquire_locked(inm); *pinm = inm; goto out_locked; } IF_ADDR_WLOCK_ASSERT(ifp); /* * A new in_multi record is needed; allocate and initialize it. * We DO NOT perform an IGMP join as the in_ layer may need to * push an initial source list down to IGMP to support SSM. 
* * The initial source filter state is INCLUDE, {} as per the RFC. */ inm = malloc(sizeof(*inm), M_IPMADDR, M_NOWAIT | M_ZERO); if (inm == NULL) { IF_ADDR_WUNLOCK(ifp); IN_MULTI_LIST_UNLOCK(); if_delmulti_ifma(ifma); return (ENOMEM); } inm->inm_addr = *group; inm->inm_ifp = ifp; inm->inm_igi = ii->ii_igmp; inm->inm_ifma = ifma; inm->inm_refcount = 1; inm->inm_state = IGMP_NOT_MEMBER; mbufq_init(&inm->inm_scq, IGMP_MAX_STATE_CHANGES); inm->inm_st[0].iss_fmode = MCAST_UNDEFINED; inm->inm_st[1].iss_fmode = MCAST_UNDEFINED; RB_INIT(&inm->inm_srcs); ifma->ifma_protospec = inm; *pinm = inm; out_locked: IF_ADDR_WUNLOCK(ifp); IN_MULTI_LIST_UNLOCK(); return (0); } /* * Drop a reference to an in_multi record. * * If the refcount drops to 0, free the in_multi record and * delete the underlying link-layer membership. */ static void inm_release(struct in_multi *inm) { struct ifmultiaddr *ifma; struct ifnet *ifp; CTR2(KTR_IGMPV3, "%s: refcount is %d", __func__, inm->inm_refcount); MPASS(inm->inm_refcount == 0); CTR2(KTR_IGMPV3, "%s: freeing inm %p", __func__, inm); ifma = inm->inm_ifma; ifp = inm->inm_ifp; /* XXX this access is not covered by IF_ADDR_LOCK */ CTR2(KTR_IGMPV3, "%s: purging ifma %p", __func__, ifma); if (ifp != NULL) { CURVNET_SET(ifp->if_vnet); inm_purge(inm); free(inm, M_IPMADDR); if_delmulti_ifma_flags(ifma, 1); CURVNET_RESTORE(); if_rele(ifp); } else { inm_purge(inm); free(inm, M_IPMADDR); if_delmulti_ifma_flags(ifma, 1); } } /* * Clear recorded source entries for a group. * Used by the IGMP code. Caller must hold the IN_MULTI lock. * FIXME: Should reap. */ void inm_clear_recorded(struct in_multi *inm) { struct ip_msource *ims; IN_MULTI_LIST_LOCK_ASSERT(); RB_FOREACH(ims, ip_msource_tree, &inm->inm_srcs) { if (ims->ims_stp) { ims->ims_stp = 0; --inm->inm_st[1].iss_rec; } } KASSERT(inm->inm_st[1].iss_rec == 0, ("%s: iss_rec %d not 0", __func__, inm->inm_st[1].iss_rec)); } /* * Record a source as pending for a Source-Group IGMPv3 query. * This lives here as it modifies the shared tree. * * inm is the group descriptor. * naddr is the address of the source to record in network-byte order. * * If the net.inet.igmp.sgalloc sysctl is non-zero, we will * lazy-allocate a source node in response to an SG query. * Otherwise, no allocation is performed. This saves some memory * with the trade-off that the source will not be reported to the * router if joined in the window between the query response and * the group actually being joined on the local host. * * VIMAGE: XXX: Currently the igmp_sgalloc feature has been removed. * This turns off the allocation of a recorded source entry if * the group has not been joined. * * Return 0 if the source didn't exist or was already marked as recorded. * Return 1 if the source was marked as recorded by this function. * Return <0 if any error occurred (negated errno code). */ int inm_record_source(struct in_multi *inm, const in_addr_t naddr) { struct ip_msource find; struct ip_msource *ims, *nims; IN_MULTI_LIST_LOCK_ASSERT(); find.ims_haddr = ntohl(naddr); ims = RB_FIND(ip_msource_tree, &inm->inm_srcs, &find); if (ims && ims->ims_stp) return (0); if (ims == NULL) { if (inm->inm_nsrc == in_mcast_maxgrpsrc) return (-ENOSPC); nims = malloc(sizeof(struct ip_msource), M_IPMSOURCE, M_NOWAIT | M_ZERO); if (nims == NULL) return (-ENOMEM); nims->ims_haddr = find.ims_haddr; RB_INSERT(ip_msource_tree, &inm->inm_srcs, nims); ++inm->inm_nsrc; ims = nims; } /* * Mark the source as recorded and update the recorded * source count. 
*/ ++ims->ims_stp; ++inm->inm_st[1].iss_rec; return (1); } /* * Return a pointer to an in_msource owned by an in_mfilter, * given its source address. * Lazy-allocate if needed. If this is a new entry its filter state is * undefined at t0. * * imf is the filter set being modified. * haddr is the source address in *host* byte-order. * * SMPng: May be called with locks held; malloc must not block. */ static int imf_get_source(struct in_mfilter *imf, const struct sockaddr_in *psin, struct in_msource **plims) { struct ip_msource find; struct ip_msource *ims, *nims; struct in_msource *lims; int error; error = 0; ims = NULL; lims = NULL; /* key is host byte order */ find.ims_haddr = ntohl(psin->sin_addr.s_addr); ims = RB_FIND(ip_msource_tree, &imf->imf_sources, &find); lims = (struct in_msource *)ims; if (lims == NULL) { if (imf->imf_nsrc == in_mcast_maxsocksrc) return (ENOSPC); nims = malloc(sizeof(struct in_msource), M_INMFILTER, M_NOWAIT | M_ZERO); if (nims == NULL) return (ENOMEM); lims = (struct in_msource *)nims; lims->ims_haddr = find.ims_haddr; lims->imsl_st[0] = MCAST_UNDEFINED; RB_INSERT(ip_msource_tree, &imf->imf_sources, nims); ++imf->imf_nsrc; } *plims = lims; return (error); } /* * Graft a source entry into an existing socket-layer filter set, * maintaining any required invariants and checking allocations. * * The source is marked as being in the new filter mode at t1. * * Return the pointer to the new node, otherwise return NULL. */ static struct in_msource * imf_graft(struct in_mfilter *imf, const uint8_t st1, const struct sockaddr_in *psin) { struct ip_msource *nims; struct in_msource *lims; nims = malloc(sizeof(struct in_msource), M_INMFILTER, M_NOWAIT | M_ZERO); if (nims == NULL) return (NULL); lims = (struct in_msource *)nims; lims->ims_haddr = ntohl(psin->sin_addr.s_addr); lims->imsl_st[0] = MCAST_UNDEFINED; lims->imsl_st[1] = st1; RB_INSERT(ip_msource_tree, &imf->imf_sources, nims); ++imf->imf_nsrc; return (lims); } /* * Prune a source entry from an existing socket-layer filter set, * maintaining any required invariants and checking allocations. * * The source is marked as being left at t1, it is not freed. * * Return 0 if no error occurred, otherwise return an errno value. */ static int imf_prune(struct in_mfilter *imf, const struct sockaddr_in *psin) { struct ip_msource find; struct ip_msource *ims; struct in_msource *lims; /* key is host byte order */ find.ims_haddr = ntohl(psin->sin_addr.s_addr); ims = RB_FIND(ip_msource_tree, &imf->imf_sources, &find); if (ims == NULL) return (ENOENT); lims = (struct in_msource *)ims; lims->imsl_st[1] = MCAST_UNDEFINED; return (0); } /* * Revert socket-layer filter set deltas at t1 to t0 state. */ static void imf_rollback(struct in_mfilter *imf) { struct ip_msource *ims, *tims; struct in_msource *lims; RB_FOREACH_SAFE(ims, ip_msource_tree, &imf->imf_sources, tims) { lims = (struct in_msource *)ims; if (lims->imsl_st[0] == lims->imsl_st[1]) { /* no change at t1 */ continue; } else if (lims->imsl_st[0] != MCAST_UNDEFINED) { /* revert change to existing source at t1 */ lims->imsl_st[1] = lims->imsl_st[0]; } else { /* revert source added t1 */ CTR2(KTR_IGMPV3, "%s: free ims %p", __func__, ims); RB_REMOVE(ip_msource_tree, &imf->imf_sources, ims); free(ims, M_INMFILTER); imf->imf_nsrc--; } } imf->imf_st[1] = imf->imf_st[0]; } /* * Mark socket-layer filter set as INCLUDE {} at t1. 
*/ static void imf_leave(struct in_mfilter *imf) { struct ip_msource *ims; struct in_msource *lims; RB_FOREACH(ims, ip_msource_tree, &imf->imf_sources) { lims = (struct in_msource *)ims; lims->imsl_st[1] = MCAST_UNDEFINED; } imf->imf_st[1] = MCAST_INCLUDE; } /* * Mark socket-layer filter set deltas as committed. */ static void imf_commit(struct in_mfilter *imf) { struct ip_msource *ims; struct in_msource *lims; RB_FOREACH(ims, ip_msource_tree, &imf->imf_sources) { lims = (struct in_msource *)ims; lims->imsl_st[0] = lims->imsl_st[1]; } imf->imf_st[0] = imf->imf_st[1]; } /* * Reap unreferenced sources from socket-layer filter set. */ static void imf_reap(struct in_mfilter *imf) { struct ip_msource *ims, *tims; struct in_msource *lims; RB_FOREACH_SAFE(ims, ip_msource_tree, &imf->imf_sources, tims) { lims = (struct in_msource *)ims; if ((lims->imsl_st[0] == MCAST_UNDEFINED) && (lims->imsl_st[1] == MCAST_UNDEFINED)) { CTR2(KTR_IGMPV3, "%s: free lims %p", __func__, ims); RB_REMOVE(ip_msource_tree, &imf->imf_sources, ims); free(ims, M_INMFILTER); imf->imf_nsrc--; } } } /* * Purge socket-layer filter set. */ static void imf_purge(struct in_mfilter *imf) { struct ip_msource *ims, *tims; RB_FOREACH_SAFE(ims, ip_msource_tree, &imf->imf_sources, tims) { CTR2(KTR_IGMPV3, "%s: free ims %p", __func__, ims); RB_REMOVE(ip_msource_tree, &imf->imf_sources, ims); free(ims, M_INMFILTER); imf->imf_nsrc--; } imf->imf_st[0] = imf->imf_st[1] = MCAST_UNDEFINED; KASSERT(RB_EMPTY(&imf->imf_sources), ("%s: imf_sources not empty", __func__)); } /* * Look up a source filter entry for a multicast group. * * inm is the group descriptor to work with. * haddr is the host-byte-order IPv4 address to look up. * noalloc may be non-zero to suppress allocation of sources. * *pims will be set to the address of the retrieved or allocated source. * * SMPng: NOTE: may be called with locks held. * Return 0 if successful, otherwise return a non-zero error code. */ static int inm_get_source(struct in_multi *inm, const in_addr_t haddr, const int noalloc, struct ip_msource **pims) { struct ip_msource find; struct ip_msource *ims, *nims; find.ims_haddr = haddr; ims = RB_FIND(ip_msource_tree, &inm->inm_srcs, &find); if (ims == NULL && !noalloc) { if (inm->inm_nsrc == in_mcast_maxgrpsrc) return (ENOSPC); nims = malloc(sizeof(struct ip_msource), M_IPMSOURCE, M_NOWAIT | M_ZERO); if (nims == NULL) return (ENOMEM); nims->ims_haddr = haddr; RB_INSERT(ip_msource_tree, &inm->inm_srcs, nims); ++inm->inm_nsrc; ims = nims; #ifdef KTR CTR3(KTR_IGMPV3, "%s: allocated 0x%08x as %p", __func__, haddr, ims); #endif } *pims = ims; return (0); } /* * Merge socket-layer source into IGMP-layer source. * If rollback is non-zero, perform the inverse of the merge. */ static void ims_merge(struct ip_msource *ims, const struct in_msource *lims, const int rollback) { int n = rollback ? 
-1 : 1; if (lims->imsl_st[0] == MCAST_EXCLUDE) { CTR3(KTR_IGMPV3, "%s: t1 ex -= %d on 0x%08x", __func__, n, ims->ims_haddr); ims->ims_st[1].ex -= n; } else if (lims->imsl_st[0] == MCAST_INCLUDE) { CTR3(KTR_IGMPV3, "%s: t1 in -= %d on 0x%08x", __func__, n, ims->ims_haddr); ims->ims_st[1].in -= n; } if (lims->imsl_st[1] == MCAST_EXCLUDE) { CTR3(KTR_IGMPV3, "%s: t1 ex += %d on 0x%08x", __func__, n, ims->ims_haddr); ims->ims_st[1].ex += n; } else if (lims->imsl_st[1] == MCAST_INCLUDE) { CTR3(KTR_IGMPV3, "%s: t1 in += %d on 0x%08x", __func__, n, ims->ims_haddr); ims->ims_st[1].in += n; } } /* * Atomically update the global in_multi state, when a membership's * filter list is being updated in any way. * * imf is the per-inpcb-membership group filter pointer. * A fake imf may be passed for in-kernel consumers. * * XXX This is a candidate for a set-symmetric-difference style loop * which would eliminate the repeated lookup from root of ims nodes, * as they share the same key space. * * If any error occurred this function will back out of refcounts * and return a non-zero value. */ static int inm_merge(struct in_multi *inm, /*const*/ struct in_mfilter *imf) { struct ip_msource *ims, *nims; struct in_msource *lims; int schanged, error; int nsrc0, nsrc1; schanged = 0; error = 0; nsrc1 = nsrc0 = 0; IN_MULTI_LIST_LOCK_ASSERT(); /* * Update the source filters first, as this may fail. * Maintain count of in-mode filters at t0, t1. These are * used to work out if we transition into ASM mode or not. * Maintain a count of source filters whose state was * actually modified by this operation. */ RB_FOREACH(ims, ip_msource_tree, &imf->imf_sources) { lims = (struct in_msource *)ims; if (lims->imsl_st[0] == imf->imf_st[0]) nsrc0++; if (lims->imsl_st[1] == imf->imf_st[1]) nsrc1++; if (lims->imsl_st[0] == lims->imsl_st[1]) continue; error = inm_get_source(inm, lims->ims_haddr, 0, &nims); ++schanged; if (error) break; ims_merge(nims, lims, 0); } if (error) { struct ip_msource *bims; RB_FOREACH_REVERSE_FROM(ims, ip_msource_tree, nims) { lims = (struct in_msource *)ims; if (lims->imsl_st[0] == lims->imsl_st[1]) continue; (void)inm_get_source(inm, lims->ims_haddr, 1, &bims); if (bims == NULL) continue; ims_merge(bims, lims, 1); } goto out_reap; } CTR3(KTR_IGMPV3, "%s: imf filters in-mode: %d at t0, %d at t1", __func__, nsrc0, nsrc1); /* Handle transition between INCLUDE {n} and INCLUDE {} on socket. */ if (imf->imf_st[0] == imf->imf_st[1] && imf->imf_st[1] == MCAST_INCLUDE) { if (nsrc1 == 0) { CTR1(KTR_IGMPV3, "%s: --in on inm at t1", __func__); --inm->inm_st[1].iss_in; } } /* Handle filter mode transition on socket. */ if (imf->imf_st[0] != imf->imf_st[1]) { CTR3(KTR_IGMPV3, "%s: imf transition %d to %d", __func__, imf->imf_st[0], imf->imf_st[1]); if (imf->imf_st[0] == MCAST_EXCLUDE) { CTR1(KTR_IGMPV3, "%s: --ex on inm at t1", __func__); --inm->inm_st[1].iss_ex; } else if (imf->imf_st[0] == MCAST_INCLUDE) { CTR1(KTR_IGMPV3, "%s: --in on inm at t1", __func__); --inm->inm_st[1].iss_in; } if (imf->imf_st[1] == MCAST_EXCLUDE) { CTR1(KTR_IGMPV3, "%s: ex++ on inm at t1", __func__); inm->inm_st[1].iss_ex++; } else if (imf->imf_st[1] == MCAST_INCLUDE && nsrc1 > 0) { CTR1(KTR_IGMPV3, "%s: in++ on inm at t1", __func__); inm->inm_st[1].iss_in++; } } /* * Track inm filter state in terms of listener counts. * If there are any exclusive listeners, stack-wide * membership is exclusive. * Otherwise, if only inclusive listeners, stack-wide is inclusive. 
* If no listeners remain, state is undefined at t1, * and the IGMP lifecycle for this group should finish. */ if (inm->inm_st[1].iss_ex > 0) { CTR1(KTR_IGMPV3, "%s: transition to EX", __func__); inm->inm_st[1].iss_fmode = MCAST_EXCLUDE; } else if (inm->inm_st[1].iss_in > 0) { CTR1(KTR_IGMPV3, "%s: transition to IN", __func__); inm->inm_st[1].iss_fmode = MCAST_INCLUDE; } else { CTR1(KTR_IGMPV3, "%s: transition to UNDEF", __func__); inm->inm_st[1].iss_fmode = MCAST_UNDEFINED; } /* Decrement ASM listener count on transition out of ASM mode. */ if (imf->imf_st[0] == MCAST_EXCLUDE && nsrc0 == 0) { if ((imf->imf_st[1] != MCAST_EXCLUDE) || (imf->imf_st[1] == MCAST_EXCLUDE && nsrc1 > 0)) { CTR1(KTR_IGMPV3, "%s: --asm on inm at t1", __func__); --inm->inm_st[1].iss_asm; } } /* Increment ASM listener count on transition to ASM mode. */ if (imf->imf_st[1] == MCAST_EXCLUDE && nsrc1 == 0) { CTR1(KTR_IGMPV3, "%s: asm++ on inm at t1", __func__); inm->inm_st[1].iss_asm++; } CTR3(KTR_IGMPV3, "%s: merged imf %p to inm %p", __func__, imf, inm); inm_print(inm); out_reap: if (schanged > 0) { CTR1(KTR_IGMPV3, "%s: sources changed; reaping", __func__); inm_reap(inm); } return (error); } /* * Mark an in_multi's filter set deltas as committed. * Called by IGMP after a state change has been enqueued. */ void inm_commit(struct in_multi *inm) { struct ip_msource *ims; CTR2(KTR_IGMPV3, "%s: commit inm %p", __func__, inm); CTR1(KTR_IGMPV3, "%s: pre commit:", __func__); inm_print(inm); RB_FOREACH(ims, ip_msource_tree, &inm->inm_srcs) { ims->ims_st[0] = ims->ims_st[1]; } inm->inm_st[0] = inm->inm_st[1]; } /* * Reap unreferenced nodes from an in_multi's filter set. */ static void inm_reap(struct in_multi *inm) { struct ip_msource *ims, *tims; RB_FOREACH_SAFE(ims, ip_msource_tree, &inm->inm_srcs, tims) { if (ims->ims_st[0].ex > 0 || ims->ims_st[0].in > 0 || ims->ims_st[1].ex > 0 || ims->ims_st[1].in > 0 || ims->ims_stp != 0) continue; CTR2(KTR_IGMPV3, "%s: free ims %p", __func__, ims); RB_REMOVE(ip_msource_tree, &inm->inm_srcs, ims); free(ims, M_IPMSOURCE); inm->inm_nsrc--; } } /* * Purge all source nodes from an in_multi's filter set. */ static void inm_purge(struct in_multi *inm) { struct ip_msource *ims, *tims; RB_FOREACH_SAFE(ims, ip_msource_tree, &inm->inm_srcs, tims) { CTR2(KTR_IGMPV3, "%s: free ims %p", __func__, ims); RB_REMOVE(ip_msource_tree, &inm->inm_srcs, ims); free(ims, M_IPMSOURCE); inm->inm_nsrc--; } } /* * Join a multicast group; unlocked entry point. * * SMPng: XXX: in_joingroup() is called from in_control() when Giant * is not held. Fortunately, ifp is unlikely to have been detached * at this point, so we assume it's OK to recurse. */ int in_joingroup(struct ifnet *ifp, const struct in_addr *gina, /*const*/ struct in_mfilter *imf, struct in_multi **pinm) { int error; IN_MULTI_LOCK(); error = in_joingroup_locked(ifp, gina, imf, pinm); IN_MULTI_UNLOCK(); return (error); } /* * Join a multicast group; real entry point. * * Only preserves atomicity at inm level. * NOTE: imf argument cannot be const due to sys/tree.h limitations. * * If the IGMP downcall fails, the group is not joined, and an error * code is returned. 
*/ int in_joingroup_locked(struct ifnet *ifp, const struct in_addr *gina, /*const*/ struct in_mfilter *imf, struct in_multi **pinm) { struct in_mfilter timf; struct in_multi *inm; int error; IN_MULTI_LOCK_ASSERT(); IN_MULTI_LIST_UNLOCK_ASSERT(); CTR4(KTR_IGMPV3, "%s: join 0x%08x on %p(%s))", __func__, ntohl(gina->s_addr), ifp, ifp->if_xname); error = 0; inm = NULL; /* * If no imf was specified (i.e. kernel consumer), * fake one up and assume it is an ASM join. */ if (imf == NULL) { imf_init(&timf, MCAST_UNDEFINED, MCAST_EXCLUDE); imf = &timf; } error = in_getmulti(ifp, gina, &inm); if (error) { CTR1(KTR_IGMPV3, "%s: in_getmulti() failure", __func__); return (error); } IN_MULTI_LIST_LOCK(); CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); error = inm_merge(inm, imf); if (error) { CTR1(KTR_IGMPV3, "%s: failed to merge inm state", __func__); goto out_inm_release; } CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); error = igmp_change_state(inm); if (error) { CTR1(KTR_IGMPV3, "%s: failed to update source", __func__); goto out_inm_release; } out_inm_release: if (error) { CTR2(KTR_IGMPV3, "%s: dropping ref on %p", __func__, inm); inm_release_deferred(inm); } else { *pinm = inm; } IN_MULTI_LIST_UNLOCK(); return (error); } /* * Leave a multicast group; unlocked entry point. */ int in_leavegroup(struct in_multi *inm, /*const*/ struct in_mfilter *imf) { int error; IN_MULTI_LOCK(); error = in_leavegroup_locked(inm, imf); IN_MULTI_UNLOCK(); return (error); } /* * Leave a multicast group; real entry point. * All source filters will be expunged. * * Only preserves atomicity at inm level. * * Holding the write lock for the INP which contains imf * is highly advisable. We can't assert for it as imf does not * contain a back-pointer to the owning inp. * * Note: This is not the same as inm_release(*) as this function also * makes a state change downcall into IGMP. */ int in_leavegroup_locked(struct in_multi *inm, /*const*/ struct in_mfilter *imf) { struct in_mfilter timf; int error; error = 0; IN_MULTI_LOCK_ASSERT(); IN_MULTI_LIST_UNLOCK_ASSERT(); CTR5(KTR_IGMPV3, "%s: leave inm %p, 0x%08x/%s, imf %p", __func__, inm, ntohl(inm->inm_addr.s_addr), (inm_is_ifp_detached(inm) ? "null" : inm->inm_ifp->if_xname), imf); /* * If no imf was specified (i.e. kernel consumer), * fake one up and assume it is an ASM join. */ if (imf == NULL) { imf_init(&timf, MCAST_EXCLUDE, MCAST_UNDEFINED); imf = &timf; } /* * Begin state merge transaction at IGMP layer. * * As this particular invocation should not cause any memory * to be allocated, and there is no opportunity to roll back * the transaction, it MUST NOT fail. */ CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); IN_MULTI_LIST_LOCK(); error = inm_merge(inm, imf); KASSERT(error == 0, ("%s: failed to merge inm state", __func__)); CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); CURVNET_SET(inm->inm_ifp->if_vnet); error = igmp_change_state(inm); IF_ADDR_WLOCK(inm->inm_ifp); inm_release_deferred(inm); IF_ADDR_WUNLOCK(inm->inm_ifp); IN_MULTI_LIST_UNLOCK(); CURVNET_RESTORE(); if (error) CTR1(KTR_IGMPV3, "%s: failed igmp downcall", __func__); CTR2(KTR_IGMPV3, "%s: dropping ref on %p", __func__, inm); return (error); } /*#ifndef BURN_BRIDGES*/ /* * Join an IPv4 multicast group in (*,G) exclusive mode. * The group must be a 224.0.0.0/24 link-scope group. * This KPI is for legacy kernel consumers only. 
*/ struct in_multi * in_addmulti(struct in_addr *ap, struct ifnet *ifp) { struct in_multi *pinm; int error; #ifdef INVARIANTS char addrbuf[INET_ADDRSTRLEN]; #endif KASSERT(IN_LOCAL_GROUP(ntohl(ap->s_addr)), ("%s: %s not in 224.0.0.0/24", __func__, inet_ntoa_r(*ap, addrbuf))); error = in_joingroup(ifp, ap, NULL, &pinm); if (error != 0) pinm = NULL; return (pinm); } /* * Block or unblock an ASM multicast source on an inpcb. * This implements the delta-based API described in RFC 3678. * * The delta-based API applies only to exclusive-mode memberships. * An IGMP downcall will be performed. * * SMPng: NOTE: Must take Giant as a join may create a new ifma. * * Return 0 if successful, otherwise return an appropriate error code. */ static int inp_block_unblock_source(struct inpcb *inp, struct sockopt *sopt) { struct group_source_req gsr; struct rm_priotracker in_ifa_tracker; sockunion_t *gsa, *ssa; struct ifnet *ifp; struct in_mfilter *imf; struct ip_moptions *imo; struct in_msource *ims; struct in_multi *inm; size_t idx; uint16_t fmode; int error, doblock; ifp = NULL; error = 0; doblock = 0; memset(&gsr, 0, sizeof(struct group_source_req)); gsa = (sockunion_t *)&gsr.gsr_group; ssa = (sockunion_t *)&gsr.gsr_source; switch (sopt->sopt_name) { case IP_BLOCK_SOURCE: case IP_UNBLOCK_SOURCE: { struct ip_mreq_source mreqs; error = sooptcopyin(sopt, &mreqs, sizeof(struct ip_mreq_source), sizeof(struct ip_mreq_source)); if (error) return (error); gsa->sin.sin_family = AF_INET; gsa->sin.sin_len = sizeof(struct sockaddr_in); gsa->sin.sin_addr = mreqs.imr_multiaddr; ssa->sin.sin_family = AF_INET; ssa->sin.sin_len = sizeof(struct sockaddr_in); ssa->sin.sin_addr = mreqs.imr_sourceaddr; if (!in_nullhost(mreqs.imr_interface)) { IN_IFADDR_RLOCK(&in_ifa_tracker); INADDR_TO_IFP(mreqs.imr_interface, ifp); IN_IFADDR_RUNLOCK(&in_ifa_tracker); } if (sopt->sopt_name == IP_BLOCK_SOURCE) doblock = 1; CTR3(KTR_IGMPV3, "%s: imr_interface = 0x%08x, ifp = %p", __func__, ntohl(mreqs.imr_interface.s_addr), ifp); break; } case MCAST_BLOCK_SOURCE: case MCAST_UNBLOCK_SOURCE: error = sooptcopyin(sopt, &gsr, sizeof(struct group_source_req), sizeof(struct group_source_req)); if (error) return (error); if (gsa->sin.sin_family != AF_INET || gsa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); if (ssa->sin.sin_family != AF_INET || ssa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); if (gsr.gsr_interface == 0 || V_if_index < gsr.gsr_interface) return (EADDRNOTAVAIL); ifp = ifnet_byindex(gsr.gsr_interface); if (sopt->sopt_name == MCAST_BLOCK_SOURCE) doblock = 1; break; default: CTR2(KTR_IGMPV3, "%s: unknown sopt_name %d", __func__, sopt->sopt_name); return (EOPNOTSUPP); break; } if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); /* * Check if we are actually a member of this group. */ imo = inp_findmoptions(inp); idx = imo_match_group(imo, ifp, &gsa->sa); if (idx == -1 || imo->imo_mfilters == NULL) { error = EADDRNOTAVAIL; goto out_inp_locked; } KASSERT(imo->imo_mfilters != NULL, ("%s: imo_mfilters not allocated", __func__)); imf = &imo->imo_mfilters[idx]; inm = imo->imo_membership[idx]; /* * Attempting to use the delta-based API on an * non exclusive-mode membership is an error. */ fmode = imf->imf_st[0]; if (fmode != MCAST_EXCLUDE) { error = EINVAL; goto out_inp_locked; } /* * Deal with error cases up-front: * Asked to block, but already blocked; or * Asked to unblock, but nothing to unblock. * If adding a new block entry, allocate it. 
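For context, a minimal userland sketch of the delta-based API this handler implements (RFC 3678): the socket first joins the group any-source, then blocks one sender. The socket descriptor, interface address and the literal addresses used here are illustrative placeholders, not taken from this file.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/* Join 239.1.1.1 any-source, then block datagrams from 192.0.2.7. */
static int
block_one_source(int sd, const char *ifaddr)
{
        struct ip_mreq mreq;
        struct ip_mreq_source mreqs;

        memset(&mreq, 0, sizeof(mreq));
        inet_pton(AF_INET, "239.1.1.1", &mreq.imr_multiaddr);
        inet_pton(AF_INET, ifaddr, &mreq.imr_interface);
        if (setsockopt(sd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq,
            sizeof(mreq)) == -1)
                return (-1);

        memset(&mreqs, 0, sizeof(mreqs));
        mreqs.imr_multiaddr = mreq.imr_multiaddr;
        inet_pton(AF_INET, "192.0.2.7", &mreqs.imr_sourceaddr);
        mreqs.imr_interface = mreq.imr_interface;
        /* IP_UNBLOCK_SOURCE with the same arguments undoes the block. */
        return (setsockopt(sd, IPPROTO_IP, IP_BLOCK_SOURCE, &mreqs,
            sizeof(mreqs)));
}

As the handler requires, this only works on an exclusive-mode membership; attempting it on a source-specific (inclusive) membership returns EINVAL.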
*/ ims = imo_match_source(imo, idx, &ssa->sa); if ((ims != NULL && doblock) || (ims == NULL && !doblock)) { CTR3(KTR_IGMPV3, "%s: source 0x%08x %spresent", __func__, ntohl(ssa->sin.sin_addr.s_addr), doblock ? "" : "not "); error = EADDRNOTAVAIL; goto out_inp_locked; } INP_WLOCK_ASSERT(inp); /* * Begin state merge transaction at socket layer. */ if (doblock) { CTR2(KTR_IGMPV3, "%s: %s source", __func__, "block"); ims = imf_graft(imf, fmode, &ssa->sin); if (ims == NULL) error = ENOMEM; } else { CTR2(KTR_IGMPV3, "%s: %s source", __func__, "allow"); error = imf_prune(imf, &ssa->sin); } if (error) { CTR1(KTR_IGMPV3, "%s: merge imf state failed", __func__); goto out_imf_rollback; } /* * Begin state merge transaction at IGMP layer. */ IN_MULTI_LOCK(); CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); IN_MULTI_LIST_LOCK(); error = inm_merge(inm, imf); if (error) { CTR1(KTR_IGMPV3, "%s: failed to merge inm state", __func__); IN_MULTI_LIST_UNLOCK(); goto out_in_multi_locked; } CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); error = igmp_change_state(inm); IN_MULTI_LIST_UNLOCK(); if (error) CTR1(KTR_IGMPV3, "%s: failed igmp downcall", __func__); out_in_multi_locked: IN_MULTI_UNLOCK(); out_imf_rollback: if (error) imf_rollback(imf); else imf_commit(imf); imf_reap(imf); out_inp_locked: INP_WUNLOCK(inp); return (error); } /* * Given an inpcb, return its multicast options structure pointer. Accepts * an unlocked inpcb pointer, but will return it locked. May sleep. * * SMPng: NOTE: Potentially calls malloc(M_WAITOK) with Giant held. * SMPng: NOTE: Returns with the INP write lock held. */ static struct ip_moptions * inp_findmoptions(struct inpcb *inp) { struct ip_moptions *imo; struct in_multi **immp; struct in_mfilter *imfp; size_t idx; INP_WLOCK(inp); if (inp->inp_moptions != NULL) return (inp->inp_moptions); INP_WUNLOCK(inp); imo = malloc(sizeof(*imo), M_IPMOPTS, M_WAITOK); immp = malloc(sizeof(*immp) * IP_MIN_MEMBERSHIPS, M_IPMOPTS, M_WAITOK | M_ZERO); imfp = malloc(sizeof(struct in_mfilter) * IP_MIN_MEMBERSHIPS, M_INMFILTER, M_WAITOK); imo->imo_multicast_ifp = NULL; imo->imo_multicast_addr.s_addr = INADDR_ANY; imo->imo_multicast_vif = -1; imo->imo_multicast_ttl = IP_DEFAULT_MULTICAST_TTL; imo->imo_multicast_loop = in_mcast_loop; imo->imo_num_memberships = 0; imo->imo_max_memberships = IP_MIN_MEMBERSHIPS; imo->imo_membership = immp; /* Initialize per-group source filters. */ for (idx = 0; idx < IP_MIN_MEMBERSHIPS; idx++) imf_init(&imfp[idx], MCAST_UNDEFINED, MCAST_EXCLUDE); imo->imo_mfilters = imfp; INP_WLOCK(inp); if (inp->inp_moptions != NULL) { free(imfp, M_INMFILTER); free(immp, M_IPMOPTS); free(imo, M_IPMOPTS); return (inp->inp_moptions); } inp->inp_moptions = imo; return (imo); } static void inp_gcmoptions(struct ip_moptions *imo) { struct in_mfilter *imf; struct in_multi *inm; struct ifnet *ifp; size_t idx, nmships; nmships = imo->imo_num_memberships; for (idx = 0; idx < nmships; ++idx) { imf = imo->imo_mfilters ? &imo->imo_mfilters[idx] : NULL; if (imf) imf_leave(imf); inm = imo->imo_membership[idx]; ifp = inm->inm_ifp; if (ifp != NULL) { CURVNET_SET(ifp->if_vnet); (void)in_leavegroup(inm, imf); CURVNET_RESTORE(); } else { (void)in_leavegroup(inm, imf); } if (imf) imf_purge(imf); } if (imo->imo_mfilters) free(imo->imo_mfilters, M_INMFILTER); free(imo->imo_membership, M_IPMOPTS); free(imo, M_IPMOPTS); } /* * Discard the IP multicast options (and source filters). 
To minimize * the amount of work done while holding locks such as the INP's * pcbinfo lock (which is used in the receive path), the free * operation is deferred to the epoch callback task. */ void inp_freemoptions(struct ip_moptions *imo) { if (imo == NULL) return; inp_gcmoptions(imo); } /* * Atomically get source filters on a socket for an IPv4 multicast group. * Called with INP lock held; returns with lock released. */ static int inp_get_source_filters(struct inpcb *inp, struct sockopt *sopt) { struct __msfilterreq msfr; sockunion_t *gsa; struct ifnet *ifp; struct ip_moptions *imo; struct in_mfilter *imf; struct ip_msource *ims; struct in_msource *lims; struct sockaddr_in *psin; struct sockaddr_storage *ptss; struct sockaddr_storage *tss; int error; size_t idx, nsrcs, ncsrcs; INP_WLOCK_ASSERT(inp); imo = inp->inp_moptions; KASSERT(imo != NULL, ("%s: null ip_moptions", __func__)); INP_WUNLOCK(inp); error = sooptcopyin(sopt, &msfr, sizeof(struct __msfilterreq), sizeof(struct __msfilterreq)); if (error) return (error); if (msfr.msfr_ifindex == 0 || V_if_index < msfr.msfr_ifindex) return (EINVAL); ifp = ifnet_byindex(msfr.msfr_ifindex); if (ifp == NULL) return (EINVAL); INP_WLOCK(inp); /* * Lookup group on the socket. */ gsa = (sockunion_t *)&msfr.msfr_group; idx = imo_match_group(imo, ifp, &gsa->sa); if (idx == -1 || imo->imo_mfilters == NULL) { INP_WUNLOCK(inp); return (EADDRNOTAVAIL); } imf = &imo->imo_mfilters[idx]; /* * Ignore memberships which are in limbo. */ if (imf->imf_st[1] == MCAST_UNDEFINED) { INP_WUNLOCK(inp); return (EAGAIN); } msfr.msfr_fmode = imf->imf_st[1]; /* * If the user specified a buffer, copy out the source filter * entries to userland gracefully. * We only copy out the number of entries which userland * has asked for, but we always tell userland how big the * buffer really needs to be. */ if (msfr.msfr_nsrcs > in_mcast_maxsocksrc) msfr.msfr_nsrcs = in_mcast_maxsocksrc; tss = NULL; if (msfr.msfr_srcs != NULL && msfr.msfr_nsrcs > 0) { tss = malloc(sizeof(struct sockaddr_storage) * msfr.msfr_nsrcs, M_TEMP, M_NOWAIT | M_ZERO); if (tss == NULL) { INP_WUNLOCK(inp); return (ENOBUFS); } } /* * Count number of sources in-mode at t0. * If buffer space exists and remains, copy out source entries. */ nsrcs = msfr.msfr_nsrcs; ncsrcs = 0; ptss = tss; RB_FOREACH(ims, ip_msource_tree, &imf->imf_sources) { lims = (struct in_msource *)ims; if (lims->imsl_st[0] == MCAST_UNDEFINED || lims->imsl_st[0] != imf->imf_st[0]) continue; ++ncsrcs; if (tss != NULL && nsrcs > 0) { psin = (struct sockaddr_in *)ptss; psin->sin_family = AF_INET; psin->sin_len = sizeof(struct sockaddr_in); psin->sin_addr.s_addr = htonl(lims->ims_haddr); psin->sin_port = 0; ++ptss; --nsrcs; } } INP_WUNLOCK(inp); if (tss != NULL) { error = copyout(tss, msfr.msfr_srcs, sizeof(struct sockaddr_storage) * msfr.msfr_nsrcs); free(tss, M_TEMP); if (error) return (error); } msfr.msfr_nsrcs = ncsrcs; error = sooptcopyout(sopt, &msfr, sizeof(struct __msfilterreq)); return (error); } /* * Return the IP multicast options in response to user getsockopt(). */ int inp_getmoptions(struct inpcb *inp, struct sockopt *sopt) { struct rm_priotracker in_ifa_tracker; struct ip_mreqn mreqn; struct ip_moptions *imo; struct ifnet *ifp; struct in_ifaddr *ia; int error, optval; u_char coptval; INP_WLOCK(inp); imo = inp->inp_moptions; /* * If socket is neither of type SOCK_RAW or SOCK_DGRAM, * or is a divert socket, reject it. 
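A hedged sketch of how userland typically drives the IP_MSFILTER read path above, via the RFC 3678 getsourcefilter(3) wrapper: one call to learn the source count, then a second call with a buffer of that size, mirroring how the kernel reports the full count even when the caller's buffer is short. The interface name and group address are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <stdlib.h>
#include <string.h>

/* Report the filter mode and installed sources for one group. */
static int
dump_filter(int sd, const char *ifname)
{
        struct sockaddr_in grp;
        struct sockaddr_storage *slist;
        uint32_t fmode, nsrcs;

        memset(&grp, 0, sizeof(grp));
        grp.sin_family = AF_INET;
        grp.sin_len = sizeof(grp);
        inet_pton(AF_INET, "239.1.1.1", &grp.sin_addr);

        /* First pass: learn how many sources are installed. */
        nsrcs = 0;
        if (getsourcefilter(sd, if_nametoindex(ifname),
            (struct sockaddr *)&grp, sizeof(grp), &fmode, &nsrcs, NULL) == -1)
                return (-1);

        /* Second pass: fetch the sources themselves. */
        slist = calloc(nsrcs, sizeof(*slist));
        if (slist == NULL)
                return (-1);
        if (getsourcefilter(sd, if_nametoindex(ifname),
            (struct sockaddr *)&grp, sizeof(grp), &fmode, &nsrcs,
            slist) == -1) {
                free(slist);
                return (-1);
        }
        /* fmode is MCAST_INCLUDE or MCAST_EXCLUDE; slist holds the sources. */
        free(slist);
        return (0);
}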
*/ if (inp->inp_socket->so_proto->pr_protocol == IPPROTO_DIVERT || (inp->inp_socket->so_proto->pr_type != SOCK_RAW && inp->inp_socket->so_proto->pr_type != SOCK_DGRAM)) { INP_WUNLOCK(inp); return (EOPNOTSUPP); } error = 0; switch (sopt->sopt_name) { case IP_MULTICAST_VIF: if (imo != NULL) optval = imo->imo_multicast_vif; else optval = -1; INP_WUNLOCK(inp); error = sooptcopyout(sopt, &optval, sizeof(int)); break; case IP_MULTICAST_IF: memset(&mreqn, 0, sizeof(struct ip_mreqn)); if (imo != NULL) { ifp = imo->imo_multicast_ifp; if (!in_nullhost(imo->imo_multicast_addr)) { mreqn.imr_address = imo->imo_multicast_addr; } else if (ifp != NULL) { struct epoch_tracker et; mreqn.imr_ifindex = ifp->if_index; NET_EPOCH_ENTER(et); IFP_TO_IA(ifp, ia, &in_ifa_tracker); if (ia != NULL) mreqn.imr_address = IA_SIN(ia)->sin_addr; NET_EPOCH_EXIT(et); } } INP_WUNLOCK(inp); if (sopt->sopt_valsize == sizeof(struct ip_mreqn)) { error = sooptcopyout(sopt, &mreqn, sizeof(struct ip_mreqn)); } else { error = sooptcopyout(sopt, &mreqn.imr_address, sizeof(struct in_addr)); } break; case IP_MULTICAST_TTL: if (imo == NULL) optval = coptval = IP_DEFAULT_MULTICAST_TTL; else optval = coptval = imo->imo_multicast_ttl; INP_WUNLOCK(inp); if (sopt->sopt_valsize == sizeof(u_char)) error = sooptcopyout(sopt, &coptval, sizeof(u_char)); else error = sooptcopyout(sopt, &optval, sizeof(int)); break; case IP_MULTICAST_LOOP: if (imo == NULL) optval = coptval = IP_DEFAULT_MULTICAST_LOOP; else optval = coptval = imo->imo_multicast_loop; INP_WUNLOCK(inp); if (sopt->sopt_valsize == sizeof(u_char)) error = sooptcopyout(sopt, &coptval, sizeof(u_char)); else error = sooptcopyout(sopt, &optval, sizeof(int)); break; case IP_MSFILTER: if (imo == NULL) { error = EADDRNOTAVAIL; INP_WUNLOCK(inp); } else { error = inp_get_source_filters(inp, sopt); } break; default: INP_WUNLOCK(inp); error = ENOPROTOOPT; break; } INP_UNLOCK_ASSERT(inp); return (error); } /* * Look up the ifnet to use for a multicast group membership, * given the IPv4 address of an interface, and the IPv4 group address. * * This routine exists to support legacy multicast applications * which do not understand that multicast memberships are scoped to * specific physical links in the networking stack, or which need * to join link-scope groups before IPv4 addresses are configured. * * If inp is non-NULL, use this socket's current FIB number for any * required FIB lookup. * If ina is INADDR_ANY, look up the group address in the unicast FIB, * and use its ifp; usually, this points to the default next-hop. * * If the FIB lookup fails, attempt to use the first non-loopback * interface with multicast capability in the system as a * last resort. The legacy IPv4 ASM API requires that we do * this in order to allow groups to be joined when the routing * table has not yet been populated during boot. * * Returns NULL if no ifp could be found. * * FUTURE: Implement IPv4 source-address selection. */ static struct ifnet * inp_lookup_mcast_ifp(const struct inpcb *inp, const struct sockaddr_in *gsin, const struct in_addr ina) { struct rm_priotracker in_ifa_tracker; struct ifnet *ifp; struct nhop4_basic nh4; uint32_t fibnum; KASSERT(gsin->sin_family == AF_INET, ("%s: not AF_INET", __func__)); KASSERT(IN_MULTICAST(ntohl(gsin->sin_addr.s_addr)), ("%s: not multicast", __func__)); ifp = NULL; if (!in_nullhost(ina)) { IN_IFADDR_RLOCK(&in_ifa_tracker); INADDR_TO_IFP(ina, ifp); IN_IFADDR_RUNLOCK(&in_ifa_tracker); } else { fibnum = inp ? 
inp->inp_inc.inc_fibnum : 0; if (fib4_lookup_nh_basic(fibnum, gsin->sin_addr, 0, 0, &nh4)==0) ifp = nh4.nh_ifp; else { struct in_ifaddr *ia; struct ifnet *mifp; mifp = NULL; IN_IFADDR_RLOCK(&in_ifa_tracker); CK_STAILQ_FOREACH(ia, &V_in_ifaddrhead, ia_link) { mifp = ia->ia_ifp; if (!(mifp->if_flags & IFF_LOOPBACK) && (mifp->if_flags & IFF_MULTICAST)) { ifp = mifp; break; } } IN_IFADDR_RUNLOCK(&in_ifa_tracker); } } return (ifp); } /* * Join an IPv4 multicast group, possibly with a source. */ static int inp_join_group(struct inpcb *inp, struct sockopt *sopt) { struct group_source_req gsr; sockunion_t *gsa, *ssa; struct ifnet *ifp; struct in_mfilter *imf; struct ip_moptions *imo; struct in_multi *inm; struct in_msource *lims; size_t idx; int error, is_new; ifp = NULL; imf = NULL; lims = NULL; error = 0; is_new = 0; memset(&gsr, 0, sizeof(struct group_source_req)); gsa = (sockunion_t *)&gsr.gsr_group; gsa->ss.ss_family = AF_UNSPEC; ssa = (sockunion_t *)&gsr.gsr_source; ssa->ss.ss_family = AF_UNSPEC; switch (sopt->sopt_name) { case IP_ADD_MEMBERSHIP: { struct ip_mreqn mreqn; if (sopt->sopt_valsize == sizeof(struct ip_mreqn)) error = sooptcopyin(sopt, &mreqn, sizeof(struct ip_mreqn), sizeof(struct ip_mreqn)); else error = sooptcopyin(sopt, &mreqn, sizeof(struct ip_mreq), sizeof(struct ip_mreq)); if (error) return (error); gsa->sin.sin_family = AF_INET; gsa->sin.sin_len = sizeof(struct sockaddr_in); gsa->sin.sin_addr = mreqn.imr_multiaddr; if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); if (sopt->sopt_valsize == sizeof(struct ip_mreqn) && mreqn.imr_ifindex != 0) ifp = ifnet_byindex(mreqn.imr_ifindex); else ifp = inp_lookup_mcast_ifp(inp, &gsa->sin, mreqn.imr_address); break; } case IP_ADD_SOURCE_MEMBERSHIP: { struct ip_mreq_source mreqs; error = sooptcopyin(sopt, &mreqs, sizeof(struct ip_mreq_source), sizeof(struct ip_mreq_source)); if (error) return (error); gsa->sin.sin_family = ssa->sin.sin_family = AF_INET; gsa->sin.sin_len = ssa->sin.sin_len = sizeof(struct sockaddr_in); gsa->sin.sin_addr = mreqs.imr_multiaddr; if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); ssa->sin.sin_addr = mreqs.imr_sourceaddr; ifp = inp_lookup_mcast_ifp(inp, &gsa->sin, mreqs.imr_interface); CTR3(KTR_IGMPV3, "%s: imr_interface = 0x%08x, ifp = %p", __func__, ntohl(mreqs.imr_interface.s_addr), ifp); break; } case MCAST_JOIN_GROUP: case MCAST_JOIN_SOURCE_GROUP: if (sopt->sopt_name == MCAST_JOIN_GROUP) { error = sooptcopyin(sopt, &gsr, sizeof(struct group_req), sizeof(struct group_req)); } else if (sopt->sopt_name == MCAST_JOIN_SOURCE_GROUP) { error = sooptcopyin(sopt, &gsr, sizeof(struct group_source_req), sizeof(struct group_source_req)); } if (error) return (error); if (gsa->sin.sin_family != AF_INET || gsa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); /* * Overwrite the port field if present, as the sockaddr * being copied in may be matched with a binary comparison. 
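For reference, the protocol-independent join that reaches this code via MCAST_JOIN_GROUP looks roughly like the following from userland; the older IP_ADD_MEMBERSHIP path takes a struct ip_mreq or ip_mreqn instead. The interface and group here are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <string.h>

/* Any-source join of 239.1.1.1 on a named interface, RFC 3678 style. */
static int
join_asm(int sd, const char *ifname)
{
        struct group_req greq;
        struct sockaddr_in *sin;

        memset(&greq, 0, sizeof(greq));
        greq.gr_interface = if_nametoindex(ifname);
        sin = (struct sockaddr_in *)&greq.gr_group;
        sin->sin_family = AF_INET;
        sin->sin_len = sizeof(*sin);
        inet_pton(AF_INET, "239.1.1.1", &sin->sin_addr);
        return (setsockopt(sd, IPPROTO_IP, MCAST_JOIN_GROUP, &greq,
            sizeof(greq)));
}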
*/ gsa->sin.sin_port = 0; if (sopt->sopt_name == MCAST_JOIN_SOURCE_GROUP) { if (ssa->sin.sin_family != AF_INET || ssa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); ssa->sin.sin_port = 0; } if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); if (gsr.gsr_interface == 0 || V_if_index < gsr.gsr_interface) return (EADDRNOTAVAIL); ifp = ifnet_byindex(gsr.gsr_interface); break; default: CTR2(KTR_IGMPV3, "%s: unknown sopt_name %d", __func__, sopt->sopt_name); return (EOPNOTSUPP); break; } if (ifp == NULL || (ifp->if_flags & IFF_MULTICAST) == 0) return (EADDRNOTAVAIL); imo = inp_findmoptions(inp); idx = imo_match_group(imo, ifp, &gsa->sa); if (idx == -1) { is_new = 1; } else { inm = imo->imo_membership[idx]; imf = &imo->imo_mfilters[idx]; if (ssa->ss.ss_family != AF_UNSPEC) { /* * MCAST_JOIN_SOURCE_GROUP on an exclusive membership * is an error. On an existing inclusive membership, * it just adds the source to the filter list. */ if (imf->imf_st[1] != MCAST_INCLUDE) { error = EINVAL; goto out_inp_locked; } /* * Throw out duplicates. * * XXX FIXME: This makes a naive assumption that * even if entries exist for *ssa in this imf, * they will be rejected as dupes, even if they * are not valid in the current mode (in-mode). * * in_msource is transactioned just as for anything * else in SSM -- but note naive use of inm_graft() * below for allocating new filter entries. * * This is only an issue if someone mixes the * full-state SSM API with the delta-based API, * which is discouraged in the relevant RFCs. */ lims = imo_match_source(imo, idx, &ssa->sa); if (lims != NULL /*&& lims->imsl_st[1] == MCAST_INCLUDE*/) { error = EADDRNOTAVAIL; goto out_inp_locked; } } else { /* * MCAST_JOIN_GROUP on an existing exclusive * membership is an error; return EADDRINUSE * to preserve 4.4BSD API idempotence, and * avoid tedious detour to code below. * NOTE: This is bending RFC 3678 a bit. * * On an existing inclusive membership, this is also * an error; if you want to change filter mode, * you must use the userland API setsourcefilter(). * XXX We don't reject this for imf in UNDEFINED * state at t1, because allocation of a filter * is atomic with allocation of a membership. */ error = EINVAL; if (imf->imf_st[1] == MCAST_EXCLUDE) error = EADDRINUSE; goto out_inp_locked; } } /* * Begin state merge transaction at socket layer. */ INP_WLOCK_ASSERT(inp); if (is_new) { if (imo->imo_num_memberships == imo->imo_max_memberships) { error = imo_grow(imo); if (error) goto out_inp_locked; } /* * Allocate the new slot upfront so we can deal with * grafting the new source filter in same code path * as for join-source on existing membership. */ idx = imo->imo_num_memberships; imo->imo_membership[idx] = NULL; imo->imo_num_memberships++; KASSERT(imo->imo_mfilters != NULL, ("%s: imf_mfilters vector was not allocated", __func__)); imf = &imo->imo_mfilters[idx]; KASSERT(RB_EMPTY(&imf->imf_sources), ("%s: imf_sources not empty", __func__)); } /* * Graft new source into filter list for this inpcb's * membership of the group. The in_multi may not have * been allocated yet if this is a new membership, however, * the in_mfilter slot will be allocated and must be initialized. * * Note: Grafting of exclusive mode filters doesn't happen * in this path. * XXX: Should check for non-NULL lims (node exists but may * not be in-mode) for interop with full-state API. 
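The source-specific variant handled just below, MCAST_JOIN_SOURCE_GROUP, is driven from userland roughly as follows; addresses and the interface name are illustrative only.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <string.h>

/* Source-specific join: receive 232.1.1.1 only from sender 192.0.2.7. */
static int
join_ssm(int sd, const char *ifname)
{
        struct group_source_req gsreq;
        struct sockaddr_in *grp, *src;

        memset(&gsreq, 0, sizeof(gsreq));
        gsreq.gsr_interface = if_nametoindex(ifname);

        grp = (struct sockaddr_in *)&gsreq.gsr_group;
        grp->sin_family = AF_INET;
        grp->sin_len = sizeof(*grp);
        inet_pton(AF_INET, "232.1.1.1", &grp->sin_addr);

        src = (struct sockaddr_in *)&gsreq.gsr_source;
        src->sin_family = AF_INET;
        src->sin_len = sizeof(*src);
        inet_pton(AF_INET, "192.0.2.7", &src->sin_addr);

        /*
         * Repeating this call with other sources adds them to the
         * inclusive filter; the surrounding handler rejects mixing it
         * with an exclusive-mode join on the same group.
         */
        return (setsockopt(sd, IPPROTO_IP, MCAST_JOIN_SOURCE_GROUP, &gsreq,
            sizeof(gsreq)));
}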
*/ if (ssa->ss.ss_family != AF_UNSPEC) { /* Membership starts in IN mode */ if (is_new) { CTR1(KTR_IGMPV3, "%s: new join w/source", __func__); imf_init(imf, MCAST_UNDEFINED, MCAST_INCLUDE); } else { CTR2(KTR_IGMPV3, "%s: %s source", __func__, "allow"); } lims = imf_graft(imf, MCAST_INCLUDE, &ssa->sin); if (lims == NULL) { CTR1(KTR_IGMPV3, "%s: merge imf state failed", __func__); error = ENOMEM; goto out_imo_free; } } else { /* No address specified; Membership starts in EX mode */ if (is_new) { CTR1(KTR_IGMPV3, "%s: new join w/o source", __func__); imf_init(imf, MCAST_UNDEFINED, MCAST_EXCLUDE); } } /* * Begin state merge transaction at IGMP layer. */ in_pcbref(inp); INP_WUNLOCK(inp); IN_MULTI_LOCK(); if (is_new) { error = in_joingroup_locked(ifp, &gsa->sin.sin_addr, imf, &inm); if (error) { CTR1(KTR_IGMPV3, "%s: in_joingroup_locked failed", __func__); IN_MULTI_LIST_UNLOCK(); goto out_imo_free; } inm_acquire(inm); imo->imo_membership[idx] = inm; } else { CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); IN_MULTI_LIST_LOCK(); error = inm_merge(inm, imf); if (error) { CTR1(KTR_IGMPV3, "%s: failed to merge inm state", __func__); IN_MULTI_LIST_UNLOCK(); goto out_in_multi_locked; } CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); error = igmp_change_state(inm); IN_MULTI_LIST_UNLOCK(); if (error) { CTR1(KTR_IGMPV3, "%s: failed igmp downcall", __func__); goto out_in_multi_locked; } } out_in_multi_locked: IN_MULTI_UNLOCK(); INP_WLOCK(inp); if (in_pcbrele_wlocked(inp)) return (ENXIO); if (error) { imf_rollback(imf); if (is_new) imf_purge(imf); else imf_reap(imf); } else { imf_commit(imf); } out_imo_free: if (error && is_new) { inm = imo->imo_membership[idx]; if (inm != NULL) { IN_MULTI_LIST_LOCK(); inm_release_deferred(inm); IN_MULTI_LIST_UNLOCK(); } imo->imo_membership[idx] = NULL; --imo->imo_num_memberships; } out_inp_locked: INP_WUNLOCK(inp); return (error); } /* * Leave an IPv4 multicast group on an inpcb, possibly with a source. */ static int inp_leave_group(struct inpcb *inp, struct sockopt *sopt) { struct group_source_req gsr; struct ip_mreq_source mreqs; struct rm_priotracker in_ifa_tracker; sockunion_t *gsa, *ssa; struct ifnet *ifp; struct in_mfilter *imf; struct ip_moptions *imo; struct in_msource *ims; struct in_multi *inm; size_t idx; int error, is_final; ifp = NULL; error = 0; is_final = 1; memset(&gsr, 0, sizeof(struct group_source_req)); gsa = (sockunion_t *)&gsr.gsr_group; gsa->ss.ss_family = AF_UNSPEC; ssa = (sockunion_t *)&gsr.gsr_source; ssa->ss.ss_family = AF_UNSPEC; switch (sopt->sopt_name) { case IP_DROP_MEMBERSHIP: case IP_DROP_SOURCE_MEMBERSHIP: if (sopt->sopt_name == IP_DROP_MEMBERSHIP) { error = sooptcopyin(sopt, &mreqs, sizeof(struct ip_mreq), sizeof(struct ip_mreq)); /* * Swap interface and sourceaddr arguments, * as ip_mreq and ip_mreq_source are laid * out differently. */ mreqs.imr_interface = mreqs.imr_sourceaddr; mreqs.imr_sourceaddr.s_addr = INADDR_ANY; } else if (sopt->sopt_name == IP_DROP_SOURCE_MEMBERSHIP) { error = sooptcopyin(sopt, &mreqs, sizeof(struct ip_mreq_source), sizeof(struct ip_mreq_source)); } if (error) return (error); gsa->sin.sin_family = AF_INET; gsa->sin.sin_len = sizeof(struct sockaddr_in); gsa->sin.sin_addr = mreqs.imr_multiaddr; if (sopt->sopt_name == IP_DROP_SOURCE_MEMBERSHIP) { ssa->sin.sin_family = AF_INET; ssa->sin.sin_len = sizeof(struct sockaddr_in); ssa->sin.sin_addr = mreqs.imr_sourceaddr; } /* * Attempt to look up hinted ifp from interface address. 
* Fallthrough with null ifp iff lookup fails, to * preserve 4.4BSD mcast API idempotence. * XXX NOTE WELL: The RFC 3678 API is preferred because * using an IPv4 address as a key is racy. */ if (!in_nullhost(mreqs.imr_interface)) { IN_IFADDR_RLOCK(&in_ifa_tracker); INADDR_TO_IFP(mreqs.imr_interface, ifp); IN_IFADDR_RUNLOCK(&in_ifa_tracker); } CTR3(KTR_IGMPV3, "%s: imr_interface = 0x%08x, ifp = %p", __func__, ntohl(mreqs.imr_interface.s_addr), ifp); break; case MCAST_LEAVE_GROUP: case MCAST_LEAVE_SOURCE_GROUP: if (sopt->sopt_name == MCAST_LEAVE_GROUP) { error = sooptcopyin(sopt, &gsr, sizeof(struct group_req), sizeof(struct group_req)); } else if (sopt->sopt_name == MCAST_LEAVE_SOURCE_GROUP) { error = sooptcopyin(sopt, &gsr, sizeof(struct group_source_req), sizeof(struct group_source_req)); } if (error) return (error); if (gsa->sin.sin_family != AF_INET || gsa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); if (sopt->sopt_name == MCAST_LEAVE_SOURCE_GROUP) { if (ssa->sin.sin_family != AF_INET || ssa->sin.sin_len != sizeof(struct sockaddr_in)) return (EINVAL); } if (gsr.gsr_interface == 0 || V_if_index < gsr.gsr_interface) return (EADDRNOTAVAIL); ifp = ifnet_byindex(gsr.gsr_interface); if (ifp == NULL) return (EADDRNOTAVAIL); break; default: CTR2(KTR_IGMPV3, "%s: unknown sopt_name %d", __func__, sopt->sopt_name); return (EOPNOTSUPP); break; } if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); /* * Find the membership in the membership array. */ imo = inp_findmoptions(inp); idx = imo_match_group(imo, ifp, &gsa->sa); if (idx == -1) { error = EADDRNOTAVAIL; goto out_inp_locked; } inm = imo->imo_membership[idx]; imf = &imo->imo_mfilters[idx]; if (ssa->ss.ss_family != AF_UNSPEC) is_final = 0; /* * Begin state merge transaction at socket layer. */ INP_WLOCK_ASSERT(inp); /* * If we were instructed only to leave a given source, do so. * MCAST_LEAVE_SOURCE_GROUP is only valid for inclusive memberships. */ if (is_final) { imf_leave(imf); } else { if (imf->imf_st[0] == MCAST_EXCLUDE) { error = EADDRNOTAVAIL; goto out_inp_locked; } ims = imo_match_source(imo, idx, &ssa->sa); if (ims == NULL) { CTR3(KTR_IGMPV3, "%s: source 0x%08x %spresent", __func__, ntohl(ssa->sin.sin_addr.s_addr), "not "); error = EADDRNOTAVAIL; goto out_inp_locked; } CTR2(KTR_IGMPV3, "%s: %s source", __func__, "block"); error = imf_prune(imf, &ssa->sin); if (error) { CTR1(KTR_IGMPV3, "%s: merge imf state failed", __func__); goto out_inp_locked; } } /* * Begin state merge transaction at IGMP layer. */ in_pcbref(inp); INP_WUNLOCK(inp); IN_MULTI_LOCK(); if (is_final) { /* * Give up the multicast address record to which * the membership points. */ (void)in_leavegroup_locked(inm, imf); } else { CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); IN_MULTI_LIST_LOCK(); error = inm_merge(inm, imf); if (error) { CTR1(KTR_IGMPV3, "%s: failed to merge inm state", __func__); IN_MULTI_LIST_UNLOCK(); goto out_in_multi_locked; } CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); error = igmp_change_state(inm); IN_MULTI_LIST_UNLOCK(); if (error) { CTR1(KTR_IGMPV3, "%s: failed igmp downcall", __func__); } } out_in_multi_locked: IN_MULTI_UNLOCK(); INP_WLOCK(inp); if (in_pcbrele_wlocked(inp)) return (ENXIO); if (error) imf_rollback(imf); else imf_commit(imf); imf_reap(imf); if (is_final) { /* Remove the gap in the membership and filter array. 
*/ KASSERT(RB_EMPTY(&imf->imf_sources), ("%s: imf_sources not empty", __func__)); for (++idx; idx < imo->imo_num_memberships; ++idx) { imo->imo_membership[idx - 1] = imo->imo_membership[idx]; imo->imo_mfilters[idx - 1] = imo->imo_mfilters[idx]; } imf_init(&imo->imo_mfilters[idx - 1], MCAST_UNDEFINED, MCAST_EXCLUDE); imo->imo_num_memberships--; } out_inp_locked: INP_WUNLOCK(inp); return (error); } /* * Select the interface for transmitting IPv4 multicast datagrams. * * Either an instance of struct in_addr or an instance of struct ip_mreqn * may be passed to this socket option. An address of INADDR_ANY or an * interface index of 0 is used to remove a previous selection. * When no interface is selected, one is chosen for every send. */ static int inp_set_multicast_if(struct inpcb *inp, struct sockopt *sopt) { struct rm_priotracker in_ifa_tracker; struct in_addr addr; struct ip_mreqn mreqn; struct ifnet *ifp; struct ip_moptions *imo; int error; if (sopt->sopt_valsize == sizeof(struct ip_mreqn)) { /* * An interface index was specified using the * Linux-derived ip_mreqn structure. */ error = sooptcopyin(sopt, &mreqn, sizeof(struct ip_mreqn), sizeof(struct ip_mreqn)); if (error) return (error); if (mreqn.imr_ifindex < 0 || V_if_index < mreqn.imr_ifindex) return (EINVAL); if (mreqn.imr_ifindex == 0) { ifp = NULL; } else { ifp = ifnet_byindex(mreqn.imr_ifindex); if (ifp == NULL) return (EADDRNOTAVAIL); } } else { /* * An interface was specified by IPv4 address. * This is the traditional BSD usage. */ error = sooptcopyin(sopt, &addr, sizeof(struct in_addr), sizeof(struct in_addr)); if (error) return (error); if (in_nullhost(addr)) { ifp = NULL; } else { IN_IFADDR_RLOCK(&in_ifa_tracker); INADDR_TO_IFP(addr, ifp); IN_IFADDR_RUNLOCK(&in_ifa_tracker); if (ifp == NULL) return (EADDRNOTAVAIL); } CTR3(KTR_IGMPV3, "%s: ifp = %p, addr = 0x%08x", __func__, ifp, ntohl(addr.s_addr)); } /* Reject interfaces which do not support multicast. */ if (ifp != NULL && (ifp->if_flags & IFF_MULTICAST) == 0) return (EOPNOTSUPP); imo = inp_findmoptions(inp); imo->imo_multicast_ifp = ifp; imo->imo_multicast_addr.s_addr = INADDR_ANY; INP_WUNLOCK(inp); return (0); } /* * Atomically set source filters on a socket for an IPv4 multicast group. * * SMPng: NOTE: Potentially calls malloc(M_WAITOK) with Giant held. */ static int inp_set_source_filters(struct inpcb *inp, struct sockopt *sopt) { struct __msfilterreq msfr; sockunion_t *gsa; struct ifnet *ifp; struct in_mfilter *imf; struct ip_moptions *imo; struct in_multi *inm; size_t idx; int error; error = sooptcopyin(sopt, &msfr, sizeof(struct __msfilterreq), sizeof(struct __msfilterreq)); if (error) return (error); if (msfr.msfr_nsrcs > in_mcast_maxsocksrc) return (ENOBUFS); if ((msfr.msfr_fmode != MCAST_EXCLUDE && msfr.msfr_fmode != MCAST_INCLUDE)) return (EINVAL); if (msfr.msfr_group.ss_family != AF_INET || msfr.msfr_group.ss_len != sizeof(struct sockaddr_in)) return (EINVAL); gsa = (sockunion_t *)&msfr.msfr_group; if (!IN_MULTICAST(ntohl(gsa->sin.sin_addr.s_addr))) return (EINVAL); gsa->sin.sin_port = 0; /* ignore port */ if (msfr.msfr_ifindex == 0 || V_if_index < msfr.msfr_ifindex) return (EADDRNOTAVAIL); ifp = ifnet_byindex(msfr.msfr_ifindex); if (ifp == NULL) return (EADDRNOTAVAIL); /* * Take the INP write lock. * Check if this socket is a member of this group. 
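A sketch of the userland side of this full-state path, using the setsourcefilter(3) wrapper around IP_MSFILTER. It assumes the socket has already joined the group on the named interface (otherwise the handler returns EADDRNOTAVAIL); the interface name and blocked addresses are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <string.h>

/* Install an exclusive-mode filter that blocks two senders at once. */
static int
exclude_two_sources(int sd, const char *ifname)
{
        const char *blocked[2] = { "192.0.2.7", "192.0.2.8" };
        struct sockaddr_storage slist[2];
        struct sockaddr_in grp, *sin;
        int i;

        memset(&grp, 0, sizeof(grp));
        grp.sin_family = AF_INET;
        grp.sin_len = sizeof(grp);
        inet_pton(AF_INET, "239.1.1.1", &grp.sin_addr);

        memset(slist, 0, sizeof(slist));
        for (i = 0; i < 2; i++) {
                sin = (struct sockaddr_in *)&slist[i];
                sin->sin_family = AF_INET;
                sin->sin_len = sizeof(*sin);
                inet_pton(AF_INET, blocked[i], &sin->sin_addr);
        }
        /* The entire filter is replaced in one transaction, as above. */
        return (setsourcefilter(sd, if_nametoindex(ifname),
            (struct sockaddr *)&grp, sizeof(grp), MCAST_EXCLUDE, 2, slist));
}

Passing MCAST_INCLUDE instead installs an inclusive filter listing the only sources to accept.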
*/ imo = inp_findmoptions(inp); idx = imo_match_group(imo, ifp, &gsa->sa); if (idx == -1 || imo->imo_mfilters == NULL) { error = EADDRNOTAVAIL; goto out_inp_locked; } inm = imo->imo_membership[idx]; imf = &imo->imo_mfilters[idx]; /* * Begin state merge transaction at socket layer. */ INP_WLOCK_ASSERT(inp); imf->imf_st[1] = msfr.msfr_fmode; /* * Apply any new source filters, if present. * Make a copy of the user-space source vector so * that we may copy them with a single copyin. This * allows us to deal with page faults up-front. */ if (msfr.msfr_nsrcs > 0) { struct in_msource *lims; struct sockaddr_in *psin; struct sockaddr_storage *kss, *pkss; int i; INP_WUNLOCK(inp); CTR2(KTR_IGMPV3, "%s: loading %lu source list entries", __func__, (unsigned long)msfr.msfr_nsrcs); kss = malloc(sizeof(struct sockaddr_storage) * msfr.msfr_nsrcs, M_TEMP, M_WAITOK); error = copyin(msfr.msfr_srcs, kss, sizeof(struct sockaddr_storage) * msfr.msfr_nsrcs); if (error) { free(kss, M_TEMP); return (error); } INP_WLOCK(inp); /* * Mark all source filters as UNDEFINED at t1. * Restore new group filter mode, as imf_leave() * will set it to INCLUDE. */ imf_leave(imf); imf->imf_st[1] = msfr.msfr_fmode; /* * Update socket layer filters at t1, lazy-allocating * new entries. This saves a bunch of memory at the * cost of one RB_FIND() per source entry; duplicate * entries in the msfr_nsrcs vector are ignored. * If we encounter an error, rollback transaction. * * XXX This too could be replaced with a set-symmetric * difference like loop to avoid walking from root * every time, as the key space is common. */ for (i = 0, pkss = kss; i < msfr.msfr_nsrcs; i++, pkss++) { psin = (struct sockaddr_in *)pkss; if (psin->sin_family != AF_INET) { error = EAFNOSUPPORT; break; } if (psin->sin_len != sizeof(struct sockaddr_in)) { error = EINVAL; break; } error = imf_get_source(imf, psin, &lims); if (error) break; lims->imsl_st[1] = imf->imf_st[1]; } free(kss, M_TEMP); } if (error) goto out_imf_rollback; INP_WLOCK_ASSERT(inp); IN_MULTI_LOCK(); /* * Begin state merge transaction at IGMP layer. */ CTR1(KTR_IGMPV3, "%s: merge inm state", __func__); IN_MULTI_LIST_LOCK(); error = inm_merge(inm, imf); if (error) { CTR1(KTR_IGMPV3, "%s: failed to merge inm state", __func__); IN_MULTI_LIST_UNLOCK(); goto out_in_multi_locked; } CTR1(KTR_IGMPV3, "%s: doing igmp downcall", __func__); error = igmp_change_state(inm); IN_MULTI_LIST_UNLOCK(); if (error) CTR1(KTR_IGMPV3, "%s: failed igmp downcall", __func__); out_in_multi_locked: IN_MULTI_UNLOCK(); out_imf_rollback: if (error) imf_rollback(imf); else imf_commit(imf); imf_reap(imf); out_inp_locked: INP_WUNLOCK(inp); return (error); } /* * Set the IP multicast options in response to user setsockopt(). * * Many of the socket options handled in this function duplicate the * functionality of socket options in the regular unicast API. However, * it is not possible to merge the duplicate code, because the idempotence * of the IPv4 multicast part of the BSD Sockets API must be preserved; * the effects of these options must be treated as separate and distinct. * * SMPng: XXX: Unlocked read of inp_socket believed OK. * FUTURE: The IP_MULTICAST_VIF option may be eliminated if MROUTING * is refactored to no longer use vifs. */ int inp_setmoptions(struct inpcb *inp, struct sockopt *sopt) { struct ip_moptions *imo; int error; error = 0; /* * If socket is neither of type SOCK_RAW or SOCK_DGRAM, * or is a divert socket, reject it. 
*/ if (inp->inp_socket->so_proto->pr_protocol == IPPROTO_DIVERT || (inp->inp_socket->so_proto->pr_type != SOCK_RAW && inp->inp_socket->so_proto->pr_type != SOCK_DGRAM)) return (EOPNOTSUPP); switch (sopt->sopt_name) { case IP_MULTICAST_VIF: { int vifi; /* * Select a multicast VIF for transmission. * Only useful if multicast forwarding is active. */ if (legal_vif_num == NULL) { error = EOPNOTSUPP; break; } error = sooptcopyin(sopt, &vifi, sizeof(int), sizeof(int)); if (error) break; if (!legal_vif_num(vifi) && (vifi != -1)) { error = EINVAL; break; } imo = inp_findmoptions(inp); imo->imo_multicast_vif = vifi; INP_WUNLOCK(inp); break; } case IP_MULTICAST_IF: error = inp_set_multicast_if(inp, sopt); break; case IP_MULTICAST_TTL: { u_char ttl; /* * Set the IP time-to-live for outgoing multicast packets. * The original multicast API required a char argument, * which is inconsistent with the rest of the socket API. * We allow either a char or an int. */ if (sopt->sopt_valsize == sizeof(u_char)) { error = sooptcopyin(sopt, &ttl, sizeof(u_char), sizeof(u_char)); if (error) break; } else { u_int ittl; error = sooptcopyin(sopt, &ittl, sizeof(u_int), sizeof(u_int)); if (error) break; if (ittl > 255) { error = EINVAL; break; } ttl = (u_char)ittl; } imo = inp_findmoptions(inp); imo->imo_multicast_ttl = ttl; INP_WUNLOCK(inp); break; } case IP_MULTICAST_LOOP: { u_char loop; /* * Set the loopback flag for outgoing multicast packets. * Must be zero or one. The original multicast API required a * char argument, which is inconsistent with the rest * of the socket API. We allow either a char or an int. */ if (sopt->sopt_valsize == sizeof(u_char)) { error = sooptcopyin(sopt, &loop, sizeof(u_char), sizeof(u_char)); if (error) break; } else { u_int iloop; error = sooptcopyin(sopt, &iloop, sizeof(u_int), sizeof(u_int)); if (error) break; loop = (u_char)iloop; } imo = inp_findmoptions(inp); imo->imo_multicast_loop = !!loop; INP_WUNLOCK(inp); break; } case IP_ADD_MEMBERSHIP: case IP_ADD_SOURCE_MEMBERSHIP: case MCAST_JOIN_GROUP: case MCAST_JOIN_SOURCE_GROUP: error = inp_join_group(inp, sopt); break; case IP_DROP_MEMBERSHIP: case IP_DROP_SOURCE_MEMBERSHIP: case MCAST_LEAVE_GROUP: case MCAST_LEAVE_SOURCE_GROUP: error = inp_leave_group(inp, sopt); break; case IP_BLOCK_SOURCE: case IP_UNBLOCK_SOURCE: case MCAST_BLOCK_SOURCE: case MCAST_UNBLOCK_SOURCE: error = inp_block_unblock_source(inp, sopt); break; case IP_MSFILTER: error = inp_set_source_filters(inp, sopt); break; default: error = EOPNOTSUPP; break; } INP_UNLOCK_ASSERT(inp); return (error); } /* * Expose IGMP's multicast filter mode and source list(s) to userland, * keyed by (ifindex, group). * The filter mode is written out as a uint32_t, followed by * 0..n of struct in_addr. * For use by ifmcstat(8). * SMPng: NOTE: unlocked read of ifindex space. 
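For completeness, the non-membership options handled by this setsockopt path are typically used from userland as below; the egress address and TTL chosen are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

/*
 * Typical sender-side setup: pick an egress interface by address, set a
 * TTL of 32, and disable loopback of our own datagrams.
 */
static int
setup_sender(int sd, const char *ifaddr)
{
        struct in_addr ifa;
        u_char ttl = 32;
        u_char loop = 0;

        if (inet_pton(AF_INET, ifaddr, &ifa) != 1)
                return (-1);
        if (setsockopt(sd, IPPROTO_IP, IP_MULTICAST_IF, &ifa,
            sizeof(ifa)) == -1)
                return (-1);
        /* Either a u_char or an int is accepted here, as the handler notes. */
        if (setsockopt(sd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl,
            sizeof(ttl)) == -1)
                return (-1);
        return (setsockopt(sd, IPPROTO_IP, IP_MULTICAST_LOOP, &loop,
            sizeof(loop)));
}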
*/ static int sysctl_ip_mcast_filters(SYSCTL_HANDLER_ARGS) { struct in_addr src, group; struct epoch_tracker et; struct ifnet *ifp; struct ifmultiaddr *ifma; struct in_multi *inm; struct ip_msource *ims; int *name; int retval; u_int namelen; uint32_t fmode, ifindex; name = (int *)arg1; namelen = arg2; if (req->newptr != NULL) return (EPERM); if (namelen != 2) return (EINVAL); ifindex = name[0]; if (ifindex <= 0 || ifindex > V_if_index) { CTR2(KTR_IGMPV3, "%s: ifindex %u out of range", __func__, ifindex); return (ENOENT); } group.s_addr = name[1]; if (!IN_MULTICAST(ntohl(group.s_addr))) { CTR2(KTR_IGMPV3, "%s: group 0x%08x is not multicast", __func__, ntohl(group.s_addr)); return (EINVAL); } ifp = ifnet_byindex(ifindex); if (ifp == NULL) { CTR2(KTR_IGMPV3, "%s: no ifp for ifindex %u", __func__, ifindex); return (ENOENT); } retval = sysctl_wire_old_buffer(req, sizeof(uint32_t) + (in_mcast_maxgrpsrc * sizeof(struct in_addr))); if (retval) return (retval); IN_MULTI_LIST_LOCK(); NET_EPOCH_ENTER(et); CK_STAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_INET || ifma->ifma_protospec == NULL) continue; inm = (struct in_multi *)ifma->ifma_protospec; if (!in_hosteq(inm->inm_addr, group)) continue; fmode = inm->inm_st[1].iss_fmode; retval = SYSCTL_OUT(req, &fmode, sizeof(uint32_t)); if (retval != 0) break; RB_FOREACH(ims, ip_msource_tree, &inm->inm_srcs) { CTR2(KTR_IGMPV3, "%s: visit node 0x%08x", __func__, ims->ims_haddr); /* * Only copy-out sources which are in-mode. */ if (fmode != ims_get_mode(inm, ims, 1)) { CTR1(KTR_IGMPV3, "%s: skip non-in-mode", __func__); continue; } src.s_addr = htonl(ims->ims_haddr); retval = SYSCTL_OUT(req, &src, sizeof(struct in_addr)); if (retval != 0) break; } } NET_EPOCH_EXIT(et); IN_MULTI_LIST_UNLOCK(); return (retval); } #if defined(KTR) && (KTR_COMPILE & KTR_IGMPV3) -static const char *inm_modestrs[] = { "un", "in", "ex" }; +static const char *inm_modestrs[] = { + [MCAST_UNDEFINED] = "un", + [MCAST_INCLUDE] = "in", + [MCAST_EXCLUDE] = "ex", +}; +_Static_assert(MCAST_UNDEFINED == 0 && + MCAST_EXCLUDE + 1 == nitems(inm_modestrs), + "inm_modestrs: no longer matches #defines"); static const char * inm_mode_str(const int mode) { if (mode >= MCAST_UNDEFINED && mode <= MCAST_EXCLUDE) return (inm_modestrs[mode]); return ("??"); } static const char *inm_statestrs[] = { - "not-member", - "silent", - "idle", - "lazy", - "sleeping", - "awakening", - "query-pending", - "sg-query-pending", - "leaving" + [IGMP_NOT_MEMBER] = "not-member", + [IGMP_SILENT_MEMBER] = "silent", + [IGMP_REPORTING_MEMBER] = "reporting", + [IGMP_IDLE_MEMBER] = "idle", + [IGMP_LAZY_MEMBER] = "lazy", + [IGMP_SLEEPING_MEMBER] = "sleeping", + [IGMP_AWAKENING_MEMBER] = "awakening", + [IGMP_G_QUERY_PENDING_MEMBER] = "query-pending", + [IGMP_SG_QUERY_PENDING_MEMBER] = "sg-query-pending", + [IGMP_LEAVING_MEMBER] = "leaving", }; +_Static_assert(IGMP_NOT_MEMBER == 0 && + IGMP_LEAVING_MEMBER + 1 == nitems(inm_statestrs), + "inm_statetrs: no longer matches #defines"); static const char * inm_state_str(const int state) { if (state >= IGMP_NOT_MEMBER && state <= IGMP_LEAVING_MEMBER) return (inm_statestrs[state]); return ("??"); } /* * Dump an in_multi structure to the console. 
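The hunk above replaces positional string tables with designated initializers plus compile-time size checks. As a small self-contained illustration of that idiom (names here are invented for the example; nitems() comes from sys/param.h on FreeBSD):

#include <sys/param.h>          /* nitems() */

enum demo_state { DEMO_IDLE = 0, DEMO_RUNNING, DEMO_DONE };

static const char *demo_statestrs[] = {
        [DEMO_IDLE]    = "idle",
        [DEMO_RUNNING] = "running",
        [DEMO_DONE]    = "done",
};
/* Catches the table falling out of step with the last enum value. */
_Static_assert(DEMO_DONE + 1 == nitems(demo_statestrs),
    "demo_statestrs: no longer matches enum demo_state");

The designated form keeps each string tied to its constant even if the numbering of the constants changes, which is exactly the hazard the old positional arrays had.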
*/ void inm_print(const struct in_multi *inm) { int t; char addrbuf[INET_ADDRSTRLEN]; if ((ktr_mask & KTR_IGMPV3) == 0) return; printf("%s: --- begin inm %p ---\n", __func__, inm); printf("addr %s ifp %p(%s) ifma %p\n", inet_ntoa_r(inm->inm_addr, addrbuf), inm->inm_ifp, inm->inm_ifp->if_xname, inm->inm_ifma); printf("timer %u state %s refcount %u scq.len %u\n", inm->inm_timer, inm_state_str(inm->inm_state), inm->inm_refcount, inm->inm_scq.mq_len); printf("igi %p nsrc %lu sctimer %u scrv %u\n", inm->inm_igi, inm->inm_nsrc, inm->inm_sctimer, inm->inm_scrv); for (t = 0; t < 2; t++) { printf("t%d: fmode %s asm %u ex %u in %u rec %u\n", t, inm_mode_str(inm->inm_st[t].iss_fmode), inm->inm_st[t].iss_asm, inm->inm_st[t].iss_ex, inm->inm_st[t].iss_in, inm->inm_st[t].iss_rec); } printf("%s: --- end inm %p ---\n", __func__, inm); } #else /* !KTR || !(KTR_COMPILE & KTR_IGMPV3) */ void inm_print(const struct in_multi *inm) { } #endif /* KTR && (KTR_COMPILE & KTR_IGMPV3) */ RB_GENERATE(ip_msource_tree, ip_msource, ims_link, ip_msource_cmp); Index: user/ngie/bug-237403/sys/netinet/in_pcb.c =================================================================== --- user/ngie/bug-237403/sys/netinet/in_pcb.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet/in_pcb.c (revision 346926) @@ -1,3431 +1,3434 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1991, 1993, 1995 * The Regents of the University of California. * Copyright (c) 2007-2009 Robert N. M. Watson * Copyright (c) 2010-2011 Juniper Networks, Inc. * All rights reserved. * * Portions of this software were developed by Robert N. M. Watson under * contract to Juniper Networks, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)in_pcb.c 8.4 (Berkeley) 5/24/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_ipsec.h" #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ratelimit.h" #include "opt_pcbgroup.h" #include "opt_rss.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #endif #include #include #include #include #include #include #include #include #if defined(INET) || defined(INET6) #include #include #include #include #ifdef TCPHPTS #include #endif #include #include #endif #ifdef INET #include #endif #ifdef INET6 #include #include #include #include #endif /* INET6 */ #include #include #define INPCBLBGROUP_SIZMIN 8 #define INPCBLBGROUP_SIZMAX 256 static struct callout ipport_tick_callout; /* * These configure the range of local port addresses assigned to * "unspecified" outgoing connections/packets/whatever. */ VNET_DEFINE(int, ipport_lowfirstauto) = IPPORT_RESERVED - 1; /* 1023 */ VNET_DEFINE(int, ipport_lowlastauto) = IPPORT_RESERVEDSTART; /* 600 */ VNET_DEFINE(int, ipport_firstauto) = IPPORT_EPHEMERALFIRST; /* 10000 */ VNET_DEFINE(int, ipport_lastauto) = IPPORT_EPHEMERALLAST; /* 65535 */ VNET_DEFINE(int, ipport_hifirstauto) = IPPORT_HIFIRSTAUTO; /* 49152 */ VNET_DEFINE(int, ipport_hilastauto) = IPPORT_HILASTAUTO; /* 65535 */ /* * Reserved ports accessible only to root. There are significant * security considerations that must be accounted for when changing these, * but the security benefits can be great. Please be careful. */ VNET_DEFINE(int, ipport_reservedhigh) = IPPORT_RESERVED - 1; /* 1023 */ VNET_DEFINE(int, ipport_reservedlow); /* Variables dealing with random ephemeral port allocation. 
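 *
 * These are all virtualized per-VNET and, apart from the internal state
 * (stoprandom, tcpallocs, tcplastcount), are exported as sysctls under
 * net.inet.ip.portrange by the declarations further down:
 *
 *	net.inet.ip.portrange.randomized	enable random allocation
 *	net.inet.ip.portrange.randomcps		random allocations allowed
 *						before temporarily falling
 *						back to sequential allocation
 *	net.inet.ip.portrange.randomtime	seconds to stay sequential
 *						before randomizing again
 *
 * ipport_tick() uses randomcps and randomtime together with the
 * internal counters to toggle ipport_stoprandom.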
*/ VNET_DEFINE(int, ipport_randomized) = 1; /* user controlled via sysctl */ VNET_DEFINE(int, ipport_randomcps) = 10; /* user controlled via sysctl */ VNET_DEFINE(int, ipport_randomtime) = 45; /* user controlled via sysctl */ VNET_DEFINE(int, ipport_stoprandom); /* toggled by ipport_tick */ VNET_DEFINE(int, ipport_tcpallocs); VNET_DEFINE_STATIC(int, ipport_tcplastcount); #define V_ipport_tcplastcount VNET(ipport_tcplastcount) static void in_pcbremlists(struct inpcb *inp); #ifdef INET static struct inpcb *in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags, struct ifnet *ifp); #define RANGECHK(var, min, max) \ if ((var) < (min)) { (var) = (min); } \ else if ((var) > (max)) { (var) = (max); } static int sysctl_net_ipport_check(SYSCTL_HANDLER_ARGS) { int error; error = sysctl_handle_int(oidp, arg1, arg2, req); if (error == 0) { RANGECHK(V_ipport_lowfirstauto, 1, IPPORT_RESERVED - 1); RANGECHK(V_ipport_lowlastauto, 1, IPPORT_RESERVED - 1); RANGECHK(V_ipport_firstauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_lastauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_hifirstauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_hilastauto, IPPORT_RESERVED, IPPORT_MAX); } return (error); } #undef RANGECHK static SYSCTL_NODE(_net_inet_ip, IPPROTO_IP, portrange, CTLFLAG_RW, 0, "IP Ports"); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowfirst, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_lowfirstauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowlast, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_lowlastauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, first, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_firstauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, last, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_lastauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hifirst, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_hifirstauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hilast, CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW, &VNET_NAME(ipport_hilastauto), 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedhigh, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(ipport_reservedhigh), 0, ""); SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedlow, CTLFLAG_RW|CTLFLAG_SECURE, &VNET_NAME(ipport_reservedlow), 0, ""); SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomized, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(ipport_randomized), 0, "Enable random port allocation"); SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomcps, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(ipport_randomcps), 0, "Maximum number of random port " "allocations before switching to a sequental one"); SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomtime, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(ipport_randomtime), 0, "Minimum time to keep sequental port " "allocation before switching to a random one"); #endif /* INET */ /* * in_pcb.c: manage the Protocol Control Blocks. * * NOTE: It is assumed that most of these functions will be called with * the pcbinfo lock held, and often, the inpcb lock held, as these utility * functions often modify hash chains or addresses in pcbs. 
*/ static struct inpcblbgroup * in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, u_char vflag, uint16_t port, const union in_dependaddr *addr, int size) { struct inpcblbgroup *grp; size_t bytes; bytes = __offsetof(struct inpcblbgroup, il_inp[size]); grp = malloc(bytes, M_PCB, M_ZERO | M_NOWAIT); if (!grp) return (NULL); grp->il_vflag = vflag; grp->il_lport = port; grp->il_dependladdr = *addr; grp->il_inpsiz = size; CK_LIST_INSERT_HEAD(hdr, grp, il_list); return (grp); } static void in_pcblbgroup_free_deferred(epoch_context_t ctx) { struct inpcblbgroup *grp; grp = __containerof(ctx, struct inpcblbgroup, il_epoch_ctx); free(grp, M_PCB); } static void in_pcblbgroup_free(struct inpcblbgroup *grp) { CK_LIST_REMOVE(grp, il_list); epoch_call(net_epoch_preempt, &grp->il_epoch_ctx, in_pcblbgroup_free_deferred); } static struct inpcblbgroup * in_pcblbgroup_resize(struct inpcblbgrouphead *hdr, struct inpcblbgroup *old_grp, int size) { struct inpcblbgroup *grp; int i; grp = in_pcblbgroup_alloc(hdr, old_grp->il_vflag, old_grp->il_lport, &old_grp->il_dependladdr, size); if (grp == NULL) return (NULL); KASSERT(old_grp->il_inpcnt < grp->il_inpsiz, ("invalid new local group size %d and old local group count %d", grp->il_inpsiz, old_grp->il_inpcnt)); for (i = 0; i < old_grp->il_inpcnt; ++i) grp->il_inp[i] = old_grp->il_inp[i]; grp->il_inpcnt = old_grp->il_inpcnt; in_pcblbgroup_free(old_grp); return (grp); } /* * PCB at index 'i' is removed from the group. Pull up the ones below il_inp[i] * and shrink group if possible. */ static void in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp, int i) { struct inpcblbgroup *grp, *new_grp; grp = *grpp; for (; i + 1 < grp->il_inpcnt; ++i) grp->il_inp[i] = grp->il_inp[i + 1]; grp->il_inpcnt--; if (grp->il_inpsiz > INPCBLBGROUP_SIZMIN && grp->il_inpcnt <= grp->il_inpsiz / 4) { /* Shrink this group. */ new_grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz / 2); if (new_grp != NULL) *grpp = new_grp; } } /* * Add PCB to load balance group for SO_REUSEPORT_LB option. */ static int in_pcbinslbgrouphash(struct inpcb *inp) { const static struct timeval interval = { 60, 0 }; static struct timeval lastprint; struct inpcbinfo *pcbinfo; struct inpcblbgrouphead *hdr; struct inpcblbgroup *grp; uint32_t idx; pcbinfo = inp->inp_pcbinfo; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(pcbinfo); /* * Don't allow jailed socket to join local group. */ if (inp->inp_socket != NULL && jailed(inp->inp_socket->so_cred)) return (0); #ifdef INET6 /* * Don't allow IPv4 mapped INET6 wild socket. */ if ((inp->inp_vflag & INP_IPV4) && inp->inp_laddr.s_addr == INADDR_ANY && INP_CHECK_SOCKAF(inp->inp_socket, AF_INET6)) { return (0); } #endif idx = INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask); hdr = &pcbinfo->ipi_lbgrouphashbase[idx]; CK_LIST_FOREACH(grp, hdr, il_list) { if (grp->il_vflag == inp->inp_vflag && grp->il_lport == inp->inp_lport && memcmp(&grp->il_dependladdr, &inp->inp_inc.inc_ie.ie_dependladdr, sizeof(grp->il_dependladdr)) == 0) break; } if (grp == NULL) { /* Create new load balance group. */ grp = in_pcblbgroup_alloc(hdr, inp->inp_vflag, inp->inp_lport, &inp->inp_inc.inc_ie.ie_dependladdr, INPCBLBGROUP_SIZMIN); if (grp == NULL) return (ENOBUFS); } else if (grp->il_inpcnt == grp->il_inpsiz) { if (grp->il_inpsiz >= INPCBLBGROUP_SIZMAX) { if (ratecheck(&lastprint, &interval)) printf("lb group port %d, limit reached\n", ntohs(grp->il_lport)); return (0); } /* Expand this local group. 
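 *
 * Groups start at INPCBLBGROUP_SIZMIN (8) slots and double each time
 * they fill, so the sizes run 8, 16, 32, ... up to INPCBLBGROUP_SIZMAX
 * (256).  Once the maximum is reached, additional SO_REUSEPORT_LB
 * sockets still bind successfully (0 is returned above) but are not
 * entered into the group, and the rate-limited printf warns about it.
 * The inverse operation, halving the group once il_inpcnt falls to a
 * quarter of il_inpsiz, is handled by in_pcblbgroup_reorder().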
*/ grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz * 2); if (grp == NULL) return (ENOBUFS); } KASSERT(grp->il_inpcnt < grp->il_inpsiz, ("invalid local group size %d and count %d", grp->il_inpsiz, grp->il_inpcnt)); grp->il_inp[grp->il_inpcnt] = inp; grp->il_inpcnt++; return (0); } /* * Remove PCB from load balance group. */ static void in_pcbremlbgrouphash(struct inpcb *inp) { struct inpcbinfo *pcbinfo; struct inpcblbgrouphead *hdr; struct inpcblbgroup *grp; int i; pcbinfo = inp->inp_pcbinfo; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(pcbinfo); hdr = &pcbinfo->ipi_lbgrouphashbase[ INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask)]; CK_LIST_FOREACH(grp, hdr, il_list) { for (i = 0; i < grp->il_inpcnt; ++i) { if (grp->il_inp[i] != inp) continue; if (grp->il_inpcnt == 1) { /* We are the last, free this local group. */ in_pcblbgroup_free(grp); } else { /* Pull up inpcbs, shrink group if possible. */ in_pcblbgroup_reorder(hdr, &grp, i); } return; } } } /* * Different protocols initialize their inpcbs differently - giving * different name to the lock. But they all are disposed the same. */ static void inpcb_fini(void *mem, int size) { struct inpcb *inp = mem; INP_LOCK_DESTROY(inp); } /* * Initialize an inpcbinfo -- we should be able to reduce the number of * arguments in time. */ void in_pcbinfo_init(struct inpcbinfo *pcbinfo, const char *name, struct inpcbhead *listhead, int hash_nelements, int porthash_nelements, char *inpcbzone_name, uma_init inpcbzone_init, u_int hashfields) { porthash_nelements = imin(porthash_nelements, IPPORT_MAX + 1); INP_INFO_LOCK_INIT(pcbinfo, name); INP_HASH_LOCK_INIT(pcbinfo, "pcbinfohash"); /* XXXRW: argument? */ INP_LIST_LOCK_INIT(pcbinfo, "pcbinfolist"); #ifdef VIMAGE pcbinfo->ipi_vnet = curvnet; #endif pcbinfo->ipi_listhead = listhead; CK_LIST_INIT(pcbinfo->ipi_listhead); pcbinfo->ipi_count = 0; pcbinfo->ipi_hashbase = hashinit(hash_nelements, M_PCB, &pcbinfo->ipi_hashmask); pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB, &pcbinfo->ipi_porthashmask); pcbinfo->ipi_lbgrouphashbase = hashinit(porthash_nelements, M_PCB, &pcbinfo->ipi_lbgrouphashmask); #ifdef PCBGROUP in_pcbgroup_init(pcbinfo, hashfields, hash_nelements); #endif pcbinfo->ipi_zone = uma_zcreate(inpcbzone_name, sizeof(struct inpcb), NULL, NULL, inpcbzone_init, inpcb_fini, UMA_ALIGN_PTR, 0); uma_zone_set_max(pcbinfo->ipi_zone, maxsockets); uma_zone_set_warning(pcbinfo->ipi_zone, "kern.ipc.maxsockets limit reached"); } /* * Destroy an inpcbinfo. */ void in_pcbinfo_destroy(struct inpcbinfo *pcbinfo) { KASSERT(pcbinfo->ipi_count == 0, ("%s: ipi_count = %u", __func__, pcbinfo->ipi_count)); hashdestroy(pcbinfo->ipi_hashbase, M_PCB, pcbinfo->ipi_hashmask); hashdestroy(pcbinfo->ipi_porthashbase, M_PCB, pcbinfo->ipi_porthashmask); hashdestroy(pcbinfo->ipi_lbgrouphashbase, M_PCB, pcbinfo->ipi_lbgrouphashmask); #ifdef PCBGROUP in_pcbgroup_destroy(pcbinfo); #endif uma_zdestroy(pcbinfo->ipi_zone); INP_LIST_LOCK_DESTROY(pcbinfo); INP_HASH_LOCK_DESTROY(pcbinfo); INP_INFO_LOCK_DESTROY(pcbinfo); } /* * Allocate a PCB and associate it with the socket. * On success return with the PCB locked. 
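 *
 * A rough sketch of a typical caller, loosely modelled on the datagram
 * protocol attach routines (the exact locking discipline differs per
 * protocol and is not prescribed here):
 *
 *	INP_INFO_WLOCK(pcbinfo);
 *	error = in_pcballoc(so, pcbinfo);
 *	if (error == 0) {
 *		inp = sotoinpcb(so);
 *		... protocol specific initialization ...
 *		INP_WUNLOCK(inp);
 *	}
 *	INP_INFO_WUNLOCK(pcbinfo);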
*/ int in_pcballoc(struct socket *so, struct inpcbinfo *pcbinfo) { struct inpcb *inp; int error; #ifdef INVARIANTS if (pcbinfo == &V_tcbinfo) { INP_INFO_RLOCK_ASSERT(pcbinfo); } else { INP_INFO_WLOCK_ASSERT(pcbinfo); } #endif error = 0; inp = uma_zalloc(pcbinfo->ipi_zone, M_NOWAIT); if (inp == NULL) return (ENOBUFS); bzero(&inp->inp_start_zero, inp_zero_size); +#ifdef NUMA + inp->inp_numa_domain = M_NODOM; +#endif inp->inp_pcbinfo = pcbinfo; inp->inp_socket = so; inp->inp_cred = crhold(so->so_cred); inp->inp_inc.inc_fibnum = so->so_fibnum; #ifdef MAC error = mac_inpcb_init(inp, M_NOWAIT); if (error != 0) goto out; mac_inpcb_create(so, inp); #endif #if defined(IPSEC) || defined(IPSEC_SUPPORT) error = ipsec_init_pcbpolicy(inp); if (error != 0) { #ifdef MAC mac_inpcb_destroy(inp); #endif goto out; } #endif /*IPSEC*/ #ifdef INET6 if (INP_SOCKAF(so) == AF_INET6) { inp->inp_vflag |= INP_IPV6PROTO; if (V_ip6_v6only) inp->inp_flags |= IN6P_IPV6_V6ONLY; } #endif INP_WLOCK(inp); INP_LIST_WLOCK(pcbinfo); CK_LIST_INSERT_HEAD(pcbinfo->ipi_listhead, inp, inp_list); pcbinfo->ipi_count++; so->so_pcb = (caddr_t)inp; #ifdef INET6 if (V_ip6_auto_flowlabel) inp->inp_flags |= IN6P_AUTOFLOWLABEL; #endif inp->inp_gencnt = ++pcbinfo->ipi_gencnt; refcount_init(&inp->inp_refcount, 1); /* Reference from inpcbinfo */ /* * Routes in inpcb's can cache L2 as well; they are guaranteed * to be cleaned up. */ inp->inp_route.ro_flags = RT_LLE_CACHE; INP_LIST_WUNLOCK(pcbinfo); #if defined(IPSEC) || defined(IPSEC_SUPPORT) || defined(MAC) out: if (error != 0) { crfree(inp->inp_cred); uma_zfree(pcbinfo->ipi_zone, inp); } #endif return (error); } #ifdef INET int in_pcbbind(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { int anonport, error; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo); if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY) return (EINVAL); anonport = nam == NULL || ((struct sockaddr_in *)nam)->sin_port == 0; error = in_pcbbind_setup(inp, nam, &inp->inp_laddr.s_addr, &inp->inp_lport, cred); if (error) return (error); if (in_pcbinshash(inp) != 0) { inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; return (EAGAIN); } if (anonport) inp->inp_flags |= INP_ANONPORT; return (0); } #endif /* * Select a local port (number) to use. */ #if defined(INET) || defined(INET6) int in_pcb_lport(struct inpcb *inp, struct in_addr *laddrp, u_short *lportp, struct ucred *cred, int lookupflags) { struct inpcbinfo *pcbinfo; struct inpcb *tmpinp; unsigned short *lastport; int count, dorandom, error; u_short aux, first, last, lport; #ifdef INET struct in_addr laddr; #endif pcbinfo = inp->inp_pcbinfo; /* * Because no actual state changes occur here, a global write lock on * the pcbinfo isn't required. */ INP_LOCK_ASSERT(inp); INP_HASH_LOCK_ASSERT(pcbinfo); if (inp->inp_flags & INP_HIGHPORT) { first = V_ipport_hifirstauto; /* sysctl */ last = V_ipport_hilastauto; lastport = &pcbinfo->ipi_lasthi; } else if (inp->inp_flags & INP_LOWPORT) { error = priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT); if (error) return (error); first = V_ipport_lowfirstauto; /* 1023 */ last = V_ipport_lowlastauto; /* 600 */ lastport = &pcbinfo->ipi_lastlow; } else { first = V_ipport_firstauto; /* sysctl */ last = V_ipport_lastauto; lastport = &pcbinfo->ipi_lastport; } /* * For UDP(-Lite), use random port allocation as long as the user * allows it. For TCP (and as of yet unknown) connections, * use random port allocation only if the user allows it AND * ipport_tick() allows it. 
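 *
 * That is, randomization is used when V_ipport_randomized is set and
 * either it has not been suspended via V_ipport_stoprandom (toggled by
 * ipport_tick()) or the pcbinfo belongs to UDP/UDP-Lite.  When it is
 * used, the search below starts from a random offset,
 *
 *	*lastport = first + (arc4random() % (last - first));
 *
 * and then walks the range sequentially, wrapping back to 'first',
 * until an unused port is found or the whole range has been tried, in
 * which case EADDRNOTAVAIL is returned.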
*/ if (V_ipport_randomized && (!V_ipport_stoprandom || pcbinfo == &V_udbinfo || pcbinfo == &V_ulitecbinfo)) dorandom = 1; else dorandom = 0; /* * It makes no sense to do random port allocation if * we have the only port available. */ if (first == last) dorandom = 0; /* Make sure to not include UDP(-Lite) packets in the count. */ if (pcbinfo != &V_udbinfo || pcbinfo != &V_ulitecbinfo) V_ipport_tcpallocs++; /* * Instead of having two loops further down counting up or down * make sure that first is always <= last and go with only one * code path implementing all logic. */ if (first > last) { aux = first; first = last; last = aux; } #ifdef INET /* Make the compiler happy. */ laddr.s_addr = 0; if ((inp->inp_vflag & (INP_IPV4|INP_IPV6)) == INP_IPV4) { KASSERT(laddrp != NULL, ("%s: laddrp NULL for v4 inp %p", __func__, inp)); laddr = *laddrp; } #endif tmpinp = NULL; /* Make compiler happy. */ lport = *lportp; if (dorandom) *lastport = first + (arc4random() % (last - first)); count = last - first; do { if (count-- < 0) /* completely used? */ return (EADDRNOTAVAIL); ++*lastport; if (*lastport < first || *lastport > last) *lastport = first; lport = htons(*lastport); #ifdef INET6 if ((inp->inp_vflag & INP_IPV6) != 0) tmpinp = in6_pcblookup_local(pcbinfo, &inp->in6p_laddr, lport, lookupflags, cred); #endif #if defined(INET) && defined(INET6) else #endif #ifdef INET tmpinp = in_pcblookup_local(pcbinfo, laddr, lport, lookupflags, cred); #endif } while (tmpinp != NULL); #ifdef INET if ((inp->inp_vflag & (INP_IPV4|INP_IPV6)) == INP_IPV4) laddrp->s_addr = laddr.s_addr; #endif *lportp = lport; return (0); } /* * Return cached socket options. */ int inp_so_options(const struct inpcb *inp) { int so_options; so_options = 0; if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0) so_options |= SO_REUSEPORT_LB; if ((inp->inp_flags2 & INP_REUSEPORT) != 0) so_options |= SO_REUSEPORT; if ((inp->inp_flags2 & INP_REUSEADDR) != 0) so_options |= SO_REUSEADDR; return (so_options); } #endif /* INET || INET6 */ /* * Check if a new BINDMULTI socket is allowed to be created. * * ni points to the new inp. * oi points to the exisitng inp. * * This checks whether the existing inp also has BINDMULTI and * whether the credentials match. */ int in_pcbbind_check_bindmulti(const struct inpcb *ni, const struct inpcb *oi) { /* Check permissions match */ if ((ni->inp_flags2 & INP_BINDMULTI) && (ni->inp_cred->cr_uid != oi->inp_cred->cr_uid)) return (0); /* Check the existing inp has BINDMULTI set */ if ((ni->inp_flags2 & INP_BINDMULTI) && ((oi->inp_flags2 & INP_BINDMULTI) == 0)) return (0); /* * We're okay - either INP_BINDMULTI isn't set on ni, or * it is and it matches the checks. */ return (1); } #ifdef INET /* * Set up a bind operation on a PCB, performing port allocation * as required, but do not actually modify the PCB. Callers can * either complete the bind by setting inp_laddr/inp_lport and * calling in_pcbinshash(), or they can just use the resulting * port and address to authorise the sending of a once-off packet. * * On error, the values of *laddrp and *lportp are not changed. 
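 *
 * The first pattern is what in_pcbbind() above does, roughly:
 *
 *	error = in_pcbbind_setup(inp, nam, &inp->inp_laddr.s_addr,
 *	    &inp->inp_lport, cred);
 *	if (error == 0 && in_pcbinshash(inp) != 0) {
 *		inp->inp_laddr.s_addr = INADDR_ANY;
 *		inp->inp_lport = 0;
 *		error = EAGAIN;
 *	}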
*/ int in_pcbbind_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp, u_short *lportp, struct ucred *cred) { struct socket *so = inp->inp_socket; struct sockaddr_in *sin; struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct in_addr laddr; u_short lport = 0; int lookupflags = 0, reuseport = (so->so_options & SO_REUSEPORT); int error; /* * XXX: Maybe we could let SO_REUSEPORT_LB set SO_REUSEPORT bit here * so that we don't have to add to the (already messy) code below. */ int reuseport_lb = (so->so_options & SO_REUSEPORT_LB); /* * No state changes, so read locks are sufficient here. */ INP_LOCK_ASSERT(inp); INP_HASH_LOCK_ASSERT(pcbinfo); if (CK_STAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */ return (EADDRNOTAVAIL); laddr.s_addr = *laddrp; if (nam != NULL && laddr.s_addr != INADDR_ANY) return (EINVAL); if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT|SO_REUSEPORT_LB)) == 0) lookupflags = INPLOOKUP_WILDCARD; if (nam == NULL) { if ((error = prison_local_ip4(cred, &laddr)) != 0) return (error); } else { sin = (struct sockaddr_in *)nam; if (nam->sa_len != sizeof (*sin)) return (EINVAL); #ifdef notdef /* * We should check the family, but old programs * incorrectly fail to initialize it. */ if (sin->sin_family != AF_INET) return (EAFNOSUPPORT); #endif error = prison_local_ip4(cred, &sin->sin_addr); if (error) return (error); if (sin->sin_port != *lportp) { /* Don't allow the port to change. */ if (*lportp != 0) return (EINVAL); lport = sin->sin_port; } /* NB: lport is left as 0 if the port isn't being changed. */ if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) { /* * Treat SO_REUSEADDR as SO_REUSEPORT for multicast; * allow complete duplication of binding if * SO_REUSEPORT is set, or if SO_REUSEADDR is set * and a multicast address is bound on both * new and duplicated sockets. */ if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) != 0) reuseport = SO_REUSEADDR|SO_REUSEPORT; /* * XXX: How to deal with SO_REUSEPORT_LB here? * Treat same as SO_REUSEPORT for now. */ if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT_LB)) != 0) reuseport_lb = SO_REUSEADDR|SO_REUSEPORT_LB; } else if (sin->sin_addr.s_addr != INADDR_ANY) { sin->sin_port = 0; /* yech... */ bzero(&sin->sin_zero, sizeof(sin->sin_zero)); /* * Is the address a local IP address? * If INP_BINDANY is set, then the socket may be bound * to any endpoint address, local or not. */ if ((inp->inp_flags & INP_BINDANY) == 0 && ifa_ifwithaddr_check((struct sockaddr *)sin) == 0) return (EADDRNOTAVAIL); } laddr = sin->sin_addr; if (lport) { struct inpcb *t; struct tcptw *tw; /* GROSS */ if (ntohs(lport) <= V_ipport_reservedhigh && ntohs(lport) >= V_ipport_reservedlow && priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT)) return (EACCES); if (!IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) && priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT) != 0) { t = in_pcblookup_local(pcbinfo, sin->sin_addr, lport, INPLOOKUP_WILDCARD, cred); /* * XXX * This entire block sorely needs a rewrite. */ if (t && ((inp->inp_flags2 & INP_BINDMULTI) == 0) && ((t->inp_flags & INP_TIMEWAIT) == 0) && (so->so_type != SOCK_STREAM || ntohl(t->inp_faddr.s_addr) == INADDR_ANY) && (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || (t->inp_flags2 & INP_REUSEPORT) || (t->inp_flags2 & INP_REUSEPORT_LB) == 0) && (inp->inp_cred->cr_uid != t->inp_cred->cr_uid)) return (EADDRINUSE); /* * If the socket is a BINDMULTI socket, then * the credentials need to match and the * original socket also has to have been bound * with BINDMULTI. */ if (t && (! 
in_pcbbind_check_bindmulti(inp, t))) return (EADDRINUSE); } t = in_pcblookup_local(pcbinfo, sin->sin_addr, lport, lookupflags, cred); if (t && (t->inp_flags & INP_TIMEWAIT)) { /* * XXXRW: If an incpb has had its timewait * state recycled, we treat the address as * being in use (for now). This is better * than a panic, but not desirable. */ tw = intotw(t); if (tw == NULL || ((reuseport & tw->tw_so_options) == 0 && (reuseport_lb & tw->tw_so_options) == 0)) { return (EADDRINUSE); } } else if (t && ((inp->inp_flags2 & INP_BINDMULTI) == 0) && (reuseport & inp_so_options(t)) == 0 && (reuseport_lb & inp_so_options(t)) == 0) { #ifdef INET6 if (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || (inp->inp_vflag & INP_IPV6PROTO) == 0 || (t->inp_vflag & INP_IPV6PROTO) == 0) #endif return (EADDRINUSE); if (t && (! in_pcbbind_check_bindmulti(inp, t))) return (EADDRINUSE); } } } if (*lportp != 0) lport = *lportp; if (lport == 0) { error = in_pcb_lport(inp, &laddr, &lport, cred, lookupflags); if (error != 0) return (error); } *laddrp = laddr.s_addr; *lportp = lport; return (0); } /* * Connect from a socket to a specified address. * Both address and port must be specified in argument sin. * If don't have a local address for this socket yet, * then pick one. */ int in_pcbconnect_mbuf(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred, struct mbuf *m) { u_short lport, fport; in_addr_t laddr, faddr; int anonport, error; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo); lport = inp->inp_lport; laddr = inp->inp_laddr.s_addr; anonport = (lport == 0); error = in_pcbconnect_setup(inp, nam, &laddr, &lport, &faddr, &fport, NULL, cred); if (error) return (error); /* Do the initial binding of the local address if required. */ if (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0) { inp->inp_lport = lport; inp->inp_laddr.s_addr = laddr; if (in_pcbinshash(inp) != 0) { inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; return (EAGAIN); } } /* Commit the remaining changes. */ inp->inp_lport = lport; inp->inp_laddr.s_addr = laddr; inp->inp_faddr.s_addr = faddr; inp->inp_fport = fport; in_pcbrehash_mbuf(inp, m); if (anonport) inp->inp_flags |= INP_ANONPORT; return (0); } int in_pcbconnect(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { return (in_pcbconnect_mbuf(inp, nam, cred, NULL)); } /* * Do proper source address selection on an unbound socket in case * of connect. Take jails into account as well. */ int in_pcbladdr(struct inpcb *inp, struct in_addr *faddr, struct in_addr *laddr, struct ucred *cred) { struct ifaddr *ifa; struct sockaddr *sa; struct sockaddr_in *sin; struct route sro; struct epoch_tracker et; int error; KASSERT(laddr != NULL, ("%s: laddr NULL", __func__)); /* * Bypass source address selection and use the primary jail IP * if requested. */ if (cred != NULL && !prison_saddrsel_ip4(cred, laddr)) return (0); error = 0; bzero(&sro, sizeof(sro)); sin = (struct sockaddr_in *)&sro.ro_dst; sin->sin_family = AF_INET; sin->sin_len = sizeof(struct sockaddr_in); sin->sin_addr.s_addr = faddr->s_addr; /* * If route is known our src addr is taken from the i/f, * else punt. * * Find out route to destination. */ if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0) in_rtalloc_ign(&sro, 0, inp->inp_inc.inc_fibnum); /* * If we found a route, use the address corresponding to * the outgoing interface. 
* * Otherwise assume faddr is reachable on a directly connected * network and try to find a corresponding interface to take * the source address from. */ NET_EPOCH_ENTER(et); if (sro.ro_rt == NULL || sro.ro_rt->rt_ifp == NULL) { struct in_ifaddr *ia; struct ifnet *ifp; ia = ifatoia(ifa_ifwithdstaddr((struct sockaddr *)sin, inp->inp_socket->so_fibnum)); if (ia == NULL) { ia = ifatoia(ifa_ifwithnet((struct sockaddr *)sin, 0, inp->inp_socket->so_fibnum)); } if (ia == NULL) { error = ENETUNREACH; goto done; } if (cred == NULL || !prison_flag(cred, PR_IP4)) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } ifp = ia->ia_ifp; ia = NULL; CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } /* * If the outgoing interface on the route found is not * a loopback interface, use the address from that interface. * In case of jails do those three steps: * 1. check if the interface address belongs to the jail. If so use it. * 2. check if we have any address on the outgoing interface * belonging to this jail. If so use it. * 3. as a last resort return the 'default' jail address. */ if ((sro.ro_rt->rt_ifp->if_flags & IFF_LOOPBACK) == 0) { struct in_ifaddr *ia; struct ifnet *ifp; /* If not jailed, use the default returned. */ if (cred == NULL || !prison_flag(cred, PR_IP4)) { ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa; laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* Jailed. */ /* 1. Check if the iface address belongs to the jail. */ sin = (struct sockaddr_in *)sro.ro_rt->rt_ifa->ifa_addr; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa; laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* * 2. Check if we have any address on the outgoing interface * belonging to this jail. */ ia = NULL; ifp = sro.ro_rt->rt_ifp; CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } /* * The outgoing interface is marked with 'loopback net', so a route * to ourselves is here. * Try to find the interface of the destination address and then * take the address from there. That interface is not necessarily * a loopback interface. * In case of jails, check that it is an address of the jail * and if we cannot find, fall back to the 'default' jail address. 
*/ if ((sro.ro_rt->rt_ifp->if_flags & IFF_LOOPBACK) != 0) { struct sockaddr_in sain; struct in_ifaddr *ia; bzero(&sain, sizeof(struct sockaddr_in)); sain.sin_family = AF_INET; sain.sin_len = sizeof(struct sockaddr_in); sain.sin_addr.s_addr = faddr->s_addr; ia = ifatoia(ifa_ifwithdstaddr(sintosa(&sain), inp->inp_socket->so_fibnum)); if (ia == NULL) ia = ifatoia(ifa_ifwithnet(sintosa(&sain), 0, inp->inp_socket->so_fibnum)); if (ia == NULL) ia = ifatoia(ifa_ifwithaddr(sintosa(&sain))); if (cred == NULL || !prison_flag(cred, PR_IP4)) { if (ia == NULL) { error = ENETUNREACH; goto done; } laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* Jailed. */ if (ia != NULL) { struct ifnet *ifp; ifp = ia->ia_ifp; ia = NULL; CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } } /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } done: NET_EPOCH_EXIT(et); if (sro.ro_rt != NULL) RTFREE(sro.ro_rt); return (error); } /* * Set up for a connect from a socket to the specified address. * On entry, *laddrp and *lportp should contain the current local * address and port for the PCB; these are updated to the values * that should be placed in inp_laddr and inp_lport to complete * the connect. * * On success, *faddrp and *fportp will be set to the remote address * and port. These are not updated in the error case. * * If the operation fails because the connection already exists, * *oinpp will be set to the PCB of that connection so that the * caller can decide to override it. In all other cases, *oinpp * is set to NULL. */ int in_pcbconnect_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp, u_short *lportp, in_addr_t *faddrp, u_short *fportp, struct inpcb **oinpp, struct ucred *cred) { struct rm_priotracker in_ifa_tracker; struct sockaddr_in *sin = (struct sockaddr_in *)nam; struct in_ifaddr *ia; struct inpcb *oinp; struct in_addr laddr, faddr; u_short lport, fport; int error; /* * Because a global state change doesn't actually occur here, a read * lock is sufficient. */ INP_LOCK_ASSERT(inp); INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo); if (oinpp != NULL) *oinpp = NULL; if (nam->sa_len != sizeof (*sin)) return (EINVAL); if (sin->sin_family != AF_INET) return (EAFNOSUPPORT); if (sin->sin_port == 0) return (EADDRNOTAVAIL); laddr.s_addr = *laddrp; lport = *lportp; faddr = sin->sin_addr; fport = sin->sin_port; if (!CK_STAILQ_EMPTY(&V_in_ifaddrhead)) { /* * If the destination address is INADDR_ANY, * use the primary local address. * If the supplied address is INADDR_BROADCAST, * and the primary interface supports broadcast, * choose the broadcast address for that interface. 
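 *
 * In effect a connect() to 0.0.0.0 behaves like a connect() to the
 * first address on V_in_ifaddrhead (subject to the jail rewrite via
 * prison_get_ip4()), and a connect() to 255.255.255.255 is rewritten
 * to that interface's ia_broadaddr when IFF_BROADCAST is set on it.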
*/ if (faddr.s_addr == INADDR_ANY) { IN_IFADDR_RLOCK(&in_ifa_tracker); faddr = IA_SIN(CK_STAILQ_FIRST(&V_in_ifaddrhead))->sin_addr; IN_IFADDR_RUNLOCK(&in_ifa_tracker); if (cred != NULL && (error = prison_get_ip4(cred, &faddr)) != 0) return (error); } else if (faddr.s_addr == (u_long)INADDR_BROADCAST) { IN_IFADDR_RLOCK(&in_ifa_tracker); if (CK_STAILQ_FIRST(&V_in_ifaddrhead)->ia_ifp->if_flags & IFF_BROADCAST) faddr = satosin(&CK_STAILQ_FIRST( &V_in_ifaddrhead)->ia_broadaddr)->sin_addr; IN_IFADDR_RUNLOCK(&in_ifa_tracker); } } if (laddr.s_addr == INADDR_ANY) { error = in_pcbladdr(inp, &faddr, &laddr, cred); /* * If the destination address is multicast and an outgoing * interface has been set as a multicast option, prefer the * address of that interface as our source address. */ if (IN_MULTICAST(ntohl(faddr.s_addr)) && inp->inp_moptions != NULL) { struct ip_moptions *imo; struct ifnet *ifp; imo = inp->inp_moptions; if (imo->imo_multicast_ifp != NULL) { ifp = imo->imo_multicast_ifp; IN_IFADDR_RLOCK(&in_ifa_tracker); CK_STAILQ_FOREACH(ia, &V_in_ifaddrhead, ia_link) { if ((ia->ia_ifp == ifp) && (cred == NULL || prison_check_ip4(cred, &ia->ia_addr.sin_addr) == 0)) break; } if (ia == NULL) error = EADDRNOTAVAIL; else { laddr = ia->ia_addr.sin_addr; error = 0; } IN_IFADDR_RUNLOCK(&in_ifa_tracker); } } if (error) return (error); } oinp = in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr, fport, laddr, lport, 0, NULL); if (oinp != NULL) { if (oinpp != NULL) *oinpp = oinp; return (EADDRINUSE); } if (lport == 0) { error = in_pcbbind_setup(inp, NULL, &laddr.s_addr, &lport, cred); if (error) return (error); } *laddrp = laddr.s_addr; *lportp = lport; *faddrp = faddr.s_addr; *fportp = fport; return (0); } void in_pcbdisconnect(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo); inp->inp_faddr.s_addr = INADDR_ANY; inp->inp_fport = 0; in_pcbrehash(inp); } #endif /* INET */ /* * in_pcbdetach() is responsibe for disassociating a socket from an inpcb. * For most protocols, this will be invoked immediately prior to calling * in_pcbfree(). However, with TCP the inpcb may significantly outlive the * socket, in which case in_pcbfree() is deferred. */ void in_pcbdetach(struct inpcb *inp) { KASSERT(inp->inp_socket != NULL, ("%s: inp_socket == NULL", __func__)); #ifdef RATELIMIT if (inp->inp_snd_tag != NULL) in_pcbdetach_txrtlmt(inp); #endif inp->inp_socket->so_pcb = NULL; inp->inp_socket = NULL; } /* * in_pcbref() bumps the reference count on an inpcb in order to maintain * stability of an inpcb pointer despite the inpcb lock being released. This * is used in TCP when the inpcbinfo lock needs to be acquired or upgraded, * but where the inpcb lock may already held, or when acquiring a reference * via a pcbgroup. * * in_pcbref() should be used only to provide brief memory stability, and * must always be followed by a call to INP_WLOCK() and in_pcbrele() to * garbage collect the inpcb if it has been in_pcbfree()'d from another * context. Until in_pcbrele() has returned that the inpcb is still valid, * lock and rele are the *only* safe operations that may be performed on the * inpcb. * * While the inpcb will not be freed, releasing the inpcb lock means that the * connection's state may change, so the caller should be careful to * revalidate any cached state on reacquiring the lock. Drop the reference * using in_pcbrele(). 
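 *
 * A minimal sketch of the intended pattern (the TCP code follows this
 * general shape):
 *
 *	in_pcbref(inp);
 *	INP_WUNLOCK(inp);
 *	... acquire the pcbinfo lock, possibly sleeping ...
 *	INP_WLOCK(inp);
 *	if (in_pcbrele_wlocked(inp))
 *		return;
 *
 * A non-zero return from in_pcbrele_wlocked() means the inpcb has been
 * freed (or is being torn down) and its lock has already been dropped,
 * so it must not be touched again; a zero return means it is still
 * valid and remains locked, but cached state should be revalidated as
 * described above.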
*/ void in_pcbref(struct inpcb *inp) { KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__)); refcount_acquire(&inp->inp_refcount); } /* * Drop a refcount on an inpcb elevated using in_pcbref(); because a call to * in_pcbfree() may have been made between in_pcbref() and in_pcbrele(), we * return a flag indicating whether or not the inpcb remains valid. If it is * valid, we return with the inpcb lock held. * * Notice that, unlike in_pcbref(), the inpcb lock must be held to drop a * reference on an inpcb. Historically more work was done here (actually, in * in_pcbfree_internal()) but has been moved to in_pcbfree() to avoid the * need for the pcbinfo lock in in_pcbrele(). Deferring the free is entirely * about memory stability (and continued use of the write lock). */ int in_pcbrele_rlocked(struct inpcb *inp) { struct inpcbinfo *pcbinfo; KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__)); INP_RLOCK_ASSERT(inp); if (refcount_release(&inp->inp_refcount) == 0) { /* * If the inpcb has been freed, let the caller know, even if * this isn't the last reference. */ if (inp->inp_flags2 & INP_FREED) { INP_RUNLOCK(inp); return (1); } return (0); } KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__)); #ifdef TCPHPTS if (inp->inp_in_hpts || inp->inp_in_input) { struct tcp_hpts_entry *hpts; /* * We should not be on the hpts at * this point in any form. we must * get the lock to be sure. */ hpts = tcp_hpts_lock(inp); if (inp->inp_in_hpts) panic("Hpts:%p inp:%p at free still on hpts", hpts, inp); mtx_unlock(&hpts->p_mtx); hpts = tcp_input_lock(inp); if (inp->inp_in_input) panic("Hpts:%p inp:%p at free still on input hpts", hpts, inp); mtx_unlock(&hpts->p_mtx); } #endif INP_RUNLOCK(inp); pcbinfo = inp->inp_pcbinfo; uma_zfree(pcbinfo->ipi_zone, inp); return (1); } int in_pcbrele_wlocked(struct inpcb *inp) { struct inpcbinfo *pcbinfo; KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__)); INP_WLOCK_ASSERT(inp); if (refcount_release(&inp->inp_refcount) == 0) { /* * If the inpcb has been freed, let the caller know, even if * this isn't the last reference. */ if (inp->inp_flags2 & INP_FREED) { INP_WUNLOCK(inp); return (1); } return (0); } KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__)); #ifdef TCPHPTS if (inp->inp_in_hpts || inp->inp_in_input) { struct tcp_hpts_entry *hpts; /* * We should not be on the hpts at * this point in any form. we must * get the lock to be sure. */ hpts = tcp_hpts_lock(inp); if (inp->inp_in_hpts) panic("Hpts:%p inp:%p at free still on hpts", hpts, inp); mtx_unlock(&hpts->p_mtx); hpts = tcp_input_lock(inp); if (inp->inp_in_input) panic("Hpts:%p inp:%p at free still on input hpts", hpts, inp); mtx_unlock(&hpts->p_mtx); } #endif INP_WUNLOCK(inp); pcbinfo = inp->inp_pcbinfo; uma_zfree(pcbinfo->ipi_zone, inp); return (1); } /* * Temporary wrapper. 
*/ int in_pcbrele(struct inpcb *inp) { return (in_pcbrele_wlocked(inp)); } void in_pcblist_rele_rlocked(epoch_context_t ctx) { struct in_pcblist *il; struct inpcb *inp; struct inpcbinfo *pcbinfo; int i, n; il = __containerof(ctx, struct in_pcblist, il_epoch_ctx); pcbinfo = il->il_pcbinfo; n = il->il_count; INP_INFO_WLOCK(pcbinfo); for (i = 0; i < n; i++) { inp = il->il_inp_list[i]; INP_RLOCK(inp); if (!in_pcbrele_rlocked(inp)) INP_RUNLOCK(inp); } INP_INFO_WUNLOCK(pcbinfo); free(il, M_TEMP); } static void inpcbport_free(epoch_context_t ctx) { struct inpcbport *phd; phd = __containerof(ctx, struct inpcbport, phd_epoch_ctx); free(phd, M_PCB); } static void in_pcbfree_deferred(epoch_context_t ctx) { struct inpcb *inp; int released __unused; inp = __containerof(ctx, struct inpcb, inp_epoch_ctx); INP_WLOCK(inp); CURVNET_SET(inp->inp_vnet); #ifdef INET struct ip_moptions *imo = inp->inp_moptions; inp->inp_moptions = NULL; #endif /* XXXRW: Do as much as possible here. */ #if defined(IPSEC) || defined(IPSEC_SUPPORT) if (inp->inp_sp != NULL) ipsec_delete_pcbpolicy(inp); #endif #ifdef INET6 struct ip6_moptions *im6o = NULL; if (inp->inp_vflag & INP_IPV6PROTO) { ip6_freepcbopts(inp->in6p_outputopts); im6o = inp->in6p_moptions; inp->in6p_moptions = NULL; } #endif if (inp->inp_options) (void)m_free(inp->inp_options); inp->inp_vflag = 0; crfree(inp->inp_cred); #ifdef MAC mac_inpcb_destroy(inp); #endif released = in_pcbrele_wlocked(inp); MPASS(released); #ifdef INET6 ip6_freemoptions(im6o); #endif #ifdef INET inp_freemoptions(imo); #endif CURVNET_RESTORE(); } /* * Unconditionally schedule an inpcb to be freed by decrementing its * reference count, which should occur only after the inpcb has been detached * from its socket. If another thread holds a temporary reference (acquired * using in_pcbref()) then the free is deferred until that reference is * released using in_pcbrele(), but the inpcb is still unlocked. Almost all * work, including removal from global lists, is done in this context, where * the pcbinfo lock is held. */ void in_pcbfree(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__)); KASSERT((inp->inp_flags2 & INP_FREED) == 0, ("%s: called twice for pcb %p", __func__, inp)); if (inp->inp_flags2 & INP_FREED) { INP_WUNLOCK(inp); return; } #ifdef INVARIANTS if (pcbinfo == &V_tcbinfo) { INP_INFO_LOCK_ASSERT(pcbinfo); } else { INP_INFO_WLOCK_ASSERT(pcbinfo); } #endif INP_WLOCK_ASSERT(inp); INP_LIST_WLOCK(pcbinfo); in_pcbremlists(inp); INP_LIST_WUNLOCK(pcbinfo); RO_INVALIDATE_CACHE(&inp->inp_route); /* mark as destruction in progress */ inp->inp_flags2 |= INP_FREED; INP_WUNLOCK(inp); epoch_call(net_epoch_preempt, &inp->inp_epoch_ctx, in_pcbfree_deferred); } /* * in_pcbdrop() removes an inpcb from hashed lists, releasing its address and * port reservation, and preventing it from being returned by inpcb lookups. * * It is used by TCP to mark an inpcb as unused and avoid future packet * delivery or event notification when a socket remains open but TCP has * closed. This might occur as a result of a shutdown()-initiated TCP close * or a RST on the wire, and allows the port binding to be reused while still * maintaining the invariant that so_pcb always points to a valid inpcb until * in_pcbdetach(). * * XXXRW: Possibly in_pcbdrop() should also prevent future notifications by * in_pcbnotifyall() and in_pcbpurgeif0()? 
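 *
 * Code that re-locks an inpcb is therefore expected to test for this
 * state before acting on it; a sketch of the usual check (the TCP
 * user-request paths use this shape):
 *
 *	INP_WLOCK(inp);
 *	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
 *		INP_WUNLOCK(inp);
 *		return (ECONNRESET);
 *	}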
*/ void in_pcbdrop(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); #ifdef INVARIANTS if (inp->inp_socket != NULL && inp->inp_ppcb != NULL) MPASS(inp->inp_refcount > 1); #endif /* * XXXRW: Possibly we should protect the setting of INP_DROPPED with * the hash lock...? */ inp->inp_flags |= INP_DROPPED; if (inp->inp_flags & INP_INHASHLIST) { struct inpcbport *phd = inp->inp_phd; INP_HASH_WLOCK(inp->inp_pcbinfo); in_pcbremlbgrouphash(inp); CK_LIST_REMOVE(inp, inp_hash); CK_LIST_REMOVE(inp, inp_portlist); if (CK_LIST_FIRST(&phd->phd_pcblist) == NULL) { CK_LIST_REMOVE(phd, phd_hash); epoch_call(net_epoch_preempt, &phd->phd_epoch_ctx, inpcbport_free); } INP_HASH_WUNLOCK(inp->inp_pcbinfo); inp->inp_flags &= ~INP_INHASHLIST; #ifdef PCBGROUP in_pcbgroup_remove(inp); #endif } } #ifdef INET /* * Common routines to return the socket addresses associated with inpcbs. */ struct sockaddr * in_sockaddr(in_port_t port, struct in_addr *addr_p) { struct sockaddr_in *sin; sin = malloc(sizeof *sin, M_SONAME, M_WAITOK | M_ZERO); sin->sin_family = AF_INET; sin->sin_len = sizeof(*sin); sin->sin_addr = *addr_p; sin->sin_port = port; return (struct sockaddr *)sin; } int in_getsockaddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; struct in_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_getsockaddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_lport; addr = inp->inp_laddr; INP_RUNLOCK(inp); *nam = in_sockaddr(port, &addr); return 0; } int in_getpeeraddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; struct in_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_getpeeraddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_fport; addr = inp->inp_faddr; INP_RUNLOCK(inp); *nam = in_sockaddr(port, &addr); return 0; } void in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr faddr, int errno, struct inpcb *(*notify)(struct inpcb *, int)) { struct inpcb *inp, *inp_temp; INP_INFO_WLOCK(pcbinfo); CK_LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) { INP_WLOCK(inp); #ifdef INET6 if ((inp->inp_vflag & INP_IPV4) == 0) { INP_WUNLOCK(inp); continue; } #endif if (inp->inp_faddr.s_addr != faddr.s_addr || inp->inp_socket == NULL) { INP_WUNLOCK(inp); continue; } if ((*notify)(inp, errno)) INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(pcbinfo); } void in_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp) { struct inpcb *inp; struct ip_moptions *imo; int i, gap; INP_INFO_WLOCK(pcbinfo); CK_LIST_FOREACH(inp, pcbinfo->ipi_listhead, inp_list) { INP_WLOCK(inp); imo = inp->inp_moptions; if ((inp->inp_vflag & INP_IPV4) && imo != NULL) { /* * Unselect the outgoing interface if it is being * detached. */ if (imo->imo_multicast_ifp == ifp) imo->imo_multicast_ifp = NULL; /* * Drop multicast group membership if we joined * through the interface being detached. * * XXX This can all be deferred to an epoch_call */ for (i = 0, gap = 0; i < imo->imo_num_memberships; i++) { if (imo->imo_membership[i]->inm_ifp == ifp) { IN_MULTI_LOCK_ASSERT(); in_leavegroup_locked(imo->imo_membership[i], NULL); gap++; } else if (gap != 0) imo->imo_membership[i - gap] = imo->imo_membership[i]; } imo->imo_num_memberships -= gap; } INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(pcbinfo); } /* * Lookup a PCB based on the local address and port. Caller must hold the * hash lock. No inpcb locks or references are acquired. 
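 *
 * in_pcbbind_setup() above is a representative caller: with the hash
 * lock held it probes for conflicting bindings, e.g.
 *
 *	t = in_pcblookup_local(pcbinfo, sin->sin_addr, lport,
 *	    INPLOOKUP_WILDCARD, cred);
 *
 * and any inpcb returned may only be examined while that lock remains
 * held, since neither a reference nor an inpcb lock is taken on it.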
*/ #define INP_LOOKUP_MAPPED_PCB_COST 3 struct inpcb * in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr, u_short lport, int lookupflags, struct ucred *cred) { struct inpcb *inp; #ifdef INET6 int matchwild = 3 + INP_LOOKUP_MAPPED_PCB_COST; #else int matchwild = 3; #endif int wildcard; KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0, ("%s: invalid lookup flags %d", __func__, lookupflags)); INP_HASH_LOCK_ASSERT(pcbinfo); if ((lookupflags & INPLOOKUP_WILDCARD) == 0) { struct inpcbhead *head; /* * Look for an unconnected (wildcard foreign addr) PCB that * matches the local address and port we're looking for. */ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; CK_LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr == INADDR_ANY && inp->inp_laddr.s_addr == laddr.s_addr && inp->inp_lport == lport) { /* * Found? */ if (cred == NULL || prison_equal_ip4(cred->cr_prison, inp->inp_cred->cr_prison)) return (inp); } } /* * Not found. */ return (NULL); } else { struct inpcbporthead *porthash; struct inpcbport *phd; struct inpcb *match = NULL; /* * Best fit PCB lookup. * * First see if this local port is in use by looking on the * port hash list. */ porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport, pcbinfo->ipi_porthashmask)]; CK_LIST_FOREACH(phd, porthash, phd_hash) { if (phd->phd_port == lport) break; } if (phd != NULL) { /* * Port is in use by one or more PCBs. Look for best * fit. */ CK_LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) { wildcard = 0; if (cred != NULL && !prison_equal_ip4(inp->inp_cred->cr_prison, cred->cr_prison)) continue; #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; /* * We never select the PCB that has * INP_IPV6 flag and is bound to :: if * we have another PCB which is bound * to 0.0.0.0. If a PCB has the * INP_IPV6 flag, then we set its cost * higher than IPv4 only PCBs. * * Note that the case only happens * when a socket is bound to ::, under * the condition that the use of the * mapped address is allowed. */ if ((inp->inp_vflag & INP_IPV6) != 0) wildcard += INP_LOOKUP_MAPPED_PCB_COST; #endif if (inp->inp_faddr.s_addr != INADDR_ANY) wildcard++; if (inp->inp_laddr.s_addr != INADDR_ANY) { if (laddr.s_addr == INADDR_ANY) wildcard++; else if (inp->inp_laddr.s_addr != laddr.s_addr) continue; } else { if (laddr.s_addr != INADDR_ANY) wildcard++; } if (wildcard < matchwild) { match = inp; matchwild = wildcard; if (matchwild == 0) break; } } } return (match); } } #undef INP_LOOKUP_MAPPED_PCB_COST static struct inpcb * in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo, const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr, uint16_t fport, int lookupflags) { struct inpcb *local_wild; const struct inpcblbgrouphead *hdr; struct inpcblbgroup *grp; uint32_t idx; INP_HASH_LOCK_ASSERT(pcbinfo); hdr = &pcbinfo->ipi_lbgrouphashbase[ INP_PCBPORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)]; /* * Order of socket selection: * 1. non-wild. * 2. wild (if lookupflags contains INPLOOKUP_WILDCARD). 
* * NOTE: * - Load balanced group does not contain jailed sockets * - Load balanced group does not contain IPv4 mapped INET6 wild sockets */ local_wild = NULL; CK_LIST_FOREACH(grp, hdr, il_list) { #ifdef INET6 if (!(grp->il_vflag & INP_IPV4)) continue; #endif if (grp->il_lport != lport) continue; idx = INP_PCBLBGROUP_PKTHASH(faddr->s_addr, lport, fport) % grp->il_inpcnt; if (grp->il_laddr.s_addr == laddr->s_addr) return (grp->il_inp[idx]); if (grp->il_laddr.s_addr == INADDR_ANY && (lookupflags & INPLOOKUP_WILDCARD) != 0) local_wild = grp->il_inp[idx]; } return (local_wild); } #ifdef PCBGROUP /* * Lookup PCB in hash list, using pcbgroup tables. */ static struct inpcb * in_pcblookup_group(struct inpcbinfo *pcbinfo, struct inpcbgroup *pcbgroup, struct in_addr faddr, u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags, struct ifnet *ifp) { struct inpcbhead *head; struct inpcb *inp, *tmpinp; u_short fport = fport_arg, lport = lport_arg; bool locked; /* * First look for an exact match. */ tmpinp = NULL; INP_GROUP_LOCK(pcbgroup); head = &pcbgroup->ipg_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport, pcbgroup->ipg_hashmask)]; CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr == faddr.s_addr && inp->inp_laddr.s_addr == laddr.s_addr && inp->inp_fport == fport && inp->inp_lport == lport) { /* * XXX We should be able to directly return * the inp here, without any checks. * Well unless both bound with SO_REUSEPORT? */ if (prison_flag(inp->inp_cred, PR_IP4)) goto found; if (tmpinp == NULL) tmpinp = inp; } } if (tmpinp != NULL) { inp = tmpinp; goto found; } #ifdef RSS /* * For incoming connections, we may wish to do a wildcard * match for an RSS-local socket. */ if ((lookupflags & INPLOOKUP_WILDCARD) != 0) { struct inpcb *local_wild = NULL, *local_exact = NULL; #ifdef INET6 struct inpcb *local_wild_mapped = NULL; #endif struct inpcb *jail_wild = NULL; struct inpcbhead *head; int injail; /* * Order of socket selection - we always prefer jails. * 1. jailed, non-wild. * 2. jailed, wild. * 3. non-jailed, non-wild. * 4. non-jailed, wild. */ head = &pcbgroup->ipg_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbgroup->ipg_hashmask)]; CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr != INADDR_ANY || inp->inp_lport != lport) continue; injail = prison_flag(inp->inp_cred, PR_IP4); if (injail) { if (prison_check_ip4(inp->inp_cred, &laddr) != 0) continue; } else { if (local_exact != NULL) continue; } if (inp->inp_laddr.s_addr == laddr.s_addr) { if (injail) goto found; else local_exact = inp; } else if (inp->inp_laddr.s_addr == INADDR_ANY) { #ifdef INET6 /* XXX inp locking, NULL check */ if (inp->inp_vflag & INP_IPV6PROTO) local_wild_mapped = inp; else #endif if (injail) jail_wild = inp; else local_wild = inp; } } /* LIST_FOREACH */ inp = jail_wild; if (inp == NULL) inp = local_exact; if (inp == NULL) inp = local_wild; #ifdef INET6 if (inp == NULL) inp = local_wild_mapped; #endif if (inp != NULL) goto found; } #endif /* * Then look for a wildcard match, if requested. */ if ((lookupflags & INPLOOKUP_WILDCARD) != 0) { struct inpcb *local_wild = NULL, *local_exact = NULL; #ifdef INET6 struct inpcb *local_wild_mapped = NULL; #endif struct inpcb *jail_wild = NULL; struct inpcbhead *head; int injail; /* * Order of socket selection - we always prefer jails. * 1. jailed, non-wild. * 2. 
jailed, wild. * 3. non-jailed, non-wild. * 4. non-jailed, wild. */ head = &pcbinfo->ipi_wildbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_wildmask)]; CK_LIST_FOREACH(inp, head, inp_pcbgroup_wild) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr != INADDR_ANY || inp->inp_lport != lport) continue; injail = prison_flag(inp->inp_cred, PR_IP4); if (injail) { if (prison_check_ip4(inp->inp_cred, &laddr) != 0) continue; } else { if (local_exact != NULL) continue; } if (inp->inp_laddr.s_addr == laddr.s_addr) { if (injail) goto found; else local_exact = inp; } else if (inp->inp_laddr.s_addr == INADDR_ANY) { #ifdef INET6 /* XXX inp locking, NULL check */ if (inp->inp_vflag & INP_IPV6PROTO) local_wild_mapped = inp; else #endif if (injail) jail_wild = inp; else local_wild = inp; } } /* LIST_FOREACH */ inp = jail_wild; if (inp == NULL) inp = local_exact; if (inp == NULL) inp = local_wild; #ifdef INET6 if (inp == NULL) inp = local_wild_mapped; #endif if (inp != NULL) goto found; } /* if (lookupflags & INPLOOKUP_WILDCARD) */ INP_GROUP_UNLOCK(pcbgroup); return (NULL); found: if (lookupflags & INPLOOKUP_WLOCKPCB) locked = INP_TRY_WLOCK(inp); else if (lookupflags & INPLOOKUP_RLOCKPCB) locked = INP_TRY_RLOCK(inp); else panic("%s: locking bug", __func__); if (__predict_false(locked && (inp->inp_flags2 & INP_FREED))) { if (lookupflags & INPLOOKUP_WLOCKPCB) INP_WUNLOCK(inp); else INP_RUNLOCK(inp); return (NULL); } else if (!locked) in_pcbref(inp); INP_GROUP_UNLOCK(pcbgroup); if (!locked) { if (lookupflags & INPLOOKUP_WLOCKPCB) { INP_WLOCK(inp); if (in_pcbrele_wlocked(inp)) return (NULL); } else { INP_RLOCK(inp); if (in_pcbrele_rlocked(inp)) return (NULL); } } #ifdef INVARIANTS if (lookupflags & INPLOOKUP_WLOCKPCB) INP_WLOCK_ASSERT(inp); else INP_RLOCK_ASSERT(inp); #endif return (inp); } #endif /* PCBGROUP */ /* * Lookup PCB in hash list, using pcbinfo tables. This variation assumes * that the caller has locked the hash list, and will not perform any further * locking or reference operations on either the hash list or the connection. */ static struct inpcb * in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags, struct ifnet *ifp) { struct inpcbhead *head; struct inpcb *inp, *tmpinp; u_short fport = fport_arg, lport = lport_arg; #ifdef INVARIANTS KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0, ("%s: invalid lookup flags %d", __func__, lookupflags)); if (!mtx_owned(&pcbinfo->ipi_hash_lock)) MPASS(in_epoch_verbose(net_epoch_preempt, 1)); #endif /* * First look for an exact match. */ tmpinp = NULL; head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport, pcbinfo->ipi_hashmask)]; CK_LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr == faddr.s_addr && inp->inp_laddr.s_addr == laddr.s_addr && inp->inp_fport == fport && inp->inp_lport == lport) { /* * XXX We should be able to directly return * the inp here, without any checks. * Well unless both bound with SO_REUSEPORT? */ if (prison_flag(inp->inp_cred, PR_IP4)) return (inp); if (tmpinp == NULL) tmpinp = inp; } } if (tmpinp != NULL) return (tmpinp); /* * Then look in lb group (for wildcard match). 
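 *
 * The group lookup spreads flows across the member sockets by hashing
 * the foreign address and the ports, roughly
 *
 *	idx = INP_PCBLBGROUP_PKTHASH(faddr, lport, fport) % grp->il_inpcnt;
 *	inp = grp->il_inp[idx];
 *
 * so a given flow keeps mapping to the same SO_REUSEPORT_LB socket for
 * as long as the group membership stays unchanged.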
*/ if ((lookupflags & INPLOOKUP_WILDCARD) != 0) { inp = in_pcblookup_lbgroup(pcbinfo, &laddr, lport, &faddr, fport, lookupflags); if (inp != NULL) return (inp); } /* * Then look for a wildcard match, if requested. */ if ((lookupflags & INPLOOKUP_WILDCARD) != 0) { struct inpcb *local_wild = NULL, *local_exact = NULL; #ifdef INET6 struct inpcb *local_wild_mapped = NULL; #endif struct inpcb *jail_wild = NULL; int injail; /* * Order of socket selection - we always prefer jails. * 1. jailed, non-wild. * 2. jailed, wild. * 3. non-jailed, non-wild. * 4. non-jailed, wild. */ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; CK_LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr != INADDR_ANY || inp->inp_lport != lport) continue; injail = prison_flag(inp->inp_cred, PR_IP4); if (injail) { if (prison_check_ip4(inp->inp_cred, &laddr) != 0) continue; } else { if (local_exact != NULL) continue; } if (inp->inp_laddr.s_addr == laddr.s_addr) { if (injail) return (inp); else local_exact = inp; } else if (inp->inp_laddr.s_addr == INADDR_ANY) { #ifdef INET6 /* XXX inp locking, NULL check */ if (inp->inp_vflag & INP_IPV6PROTO) local_wild_mapped = inp; else #endif if (injail) jail_wild = inp; else local_wild = inp; } } /* LIST_FOREACH */ if (jail_wild != NULL) return (jail_wild); if (local_exact != NULL) return (local_exact); if (local_wild != NULL) return (local_wild); #ifdef INET6 if (local_wild_mapped != NULL) return (local_wild_mapped); #endif } /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */ return (NULL); } /* * Lookup PCB in hash list, using pcbinfo tables. This variation locks the * hash list lock, and will return the inpcb locked (i.e., requires * INPLOOKUP_LOCKPCB). */ static struct inpcb * in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport, struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp) { struct inpcb *inp; INP_HASH_RLOCK(pcbinfo); inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport, (lookupflags & ~(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)), ifp); if (inp != NULL) { if (lookupflags & INPLOOKUP_WLOCKPCB) { INP_WLOCK(inp); if (__predict_false(inp->inp_flags2 & INP_FREED)) { INP_WUNLOCK(inp); inp = NULL; } } else if (lookupflags & INPLOOKUP_RLOCKPCB) { INP_RLOCK(inp); if (__predict_false(inp->inp_flags2 & INP_FREED)) { INP_RUNLOCK(inp); inp = NULL; } } else panic("%s: locking bug", __func__); #ifdef INVARIANTS if (inp != NULL) { if (lookupflags & INPLOOKUP_WLOCKPCB) INP_WLOCK_ASSERT(inp); else INP_RLOCK_ASSERT(inp); } #endif } INP_HASH_RUNLOCK(pcbinfo); return (inp); } /* * Public inpcb lookup routines, accepting a 4-tuple, and optionally, an mbuf * from which a pre-calculated hash value may be extracted. * * Possibly more of this logic should be in in_pcbgroup.c. */ struct inpcb * in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport, struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp) { #if defined(PCBGROUP) && !defined(RSS) struct inpcbgroup *pcbgroup; #endif KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0, ("%s: invalid lookup flags %d", __func__, lookupflags)); KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0, ("%s: LOCKPCB not set", __func__)); /* * When not using RSS, use connection groups in preference to the * reservation table when looking up 4-tuples. 
When using RSS, just * use the reservation table, due to the cost of the Toeplitz hash * in software. * * XXXRW: This policy belongs in the pcbgroup code, as in principle * we could be doing RSS with a non-Toeplitz hash that is affordable * in software. */ #if defined(PCBGROUP) && !defined(RSS) if (in_pcbgroup_enabled(pcbinfo)) { pcbgroup = in_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr, fport); return (in_pcblookup_group(pcbinfo, pcbgroup, faddr, fport, laddr, lport, lookupflags, ifp)); } #endif return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport, lookupflags, ifp)); } struct inpcb * in_pcblookup_mbuf(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport, struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp, struct mbuf *m) { #ifdef PCBGROUP struct inpcbgroup *pcbgroup; #endif KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0, ("%s: invalid lookup flags %d", __func__, lookupflags)); KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0, ("%s: LOCKPCB not set", __func__)); #ifdef PCBGROUP /* * If we can use a hardware-generated hash to look up the connection * group, use that connection group to find the inpcb. Otherwise * fall back on a software hash -- or the reservation table if we're * using RSS. * * XXXRW: As above, that policy belongs in the pcbgroup code. */ if (in_pcbgroup_enabled(pcbinfo) && !(M_HASHTYPE_TEST(m, M_HASHTYPE_NONE))) { pcbgroup = in_pcbgroup_byhash(pcbinfo, M_HASHTYPE_GET(m), m->m_pkthdr.flowid); if (pcbgroup != NULL) return (in_pcblookup_group(pcbinfo, pcbgroup, faddr, fport, laddr, lport, lookupflags, ifp)); #ifndef RSS pcbgroup = in_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr, fport); return (in_pcblookup_group(pcbinfo, pcbgroup, faddr, fport, laddr, lport, lookupflags, ifp)); #endif } #endif return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport, lookupflags, ifp)); } #endif /* INET */ /* * Insert PCB onto various hash lists. */ static int in_pcbinshash_internal(struct inpcb *inp, int do_pcbgroup_update) { struct inpcbhead *pcbhash; struct inpcbporthead *pcbporthash; struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct inpcbport *phd; u_int32_t hashkey_faddr; int so_options; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(pcbinfo); KASSERT((inp->inp_flags & INP_INHASHLIST) == 0, ("in_pcbinshash: INP_INHASHLIST")); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) hashkey_faddr = INP6_PCBHASHKEY(&inp->in6p_faddr); else #endif hashkey_faddr = inp->inp_faddr.s_addr; pcbhash = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr, inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)]; pcbporthash = &pcbinfo->ipi_porthashbase[ INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)]; /* * Add entry to load balance group. * Only do this if SO_REUSEPORT_LB is set. */ so_options = inp_so_options(inp); if (so_options & SO_REUSEPORT_LB) { int ret = in_pcbinslbgrouphash(inp); if (ret) { /* pcb lb group malloc fail (ret=ENOBUFS). */ return (ret); } } /* * Go through port list and look for a head for this lport. */ CK_LIST_FOREACH(phd, pcbporthash, phd_hash) { if (phd->phd_port == inp->inp_lport) break; } /* * If none exists, malloc one and tack it on. 
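/*
 * User-space sketch of the find-or-create pattern used just below for the
 * per-port head (struct inpcbport): search the port-hash chain for an entry
 * matching the local port and allocate one if it is missing.  The toy
 * structure and plain malloc() are illustrative stand-ins.
 */
#include <stdint.h>
#include <stdlib.h>

struct toy_porthead {
	uint16_t port;			/* local port, network order */
	struct toy_porthead *next;	/* port-hash chain link */
};

static struct toy_porthead *
toy_port_find_or_create(struct toy_porthead **bucket, uint16_t port)
{
	struct toy_porthead *phd;

	for (phd = *bucket; phd != NULL; phd = phd->next)
		if (phd->port == port)
			return (phd);

	phd = malloc(sizeof(*phd));
	if (phd == NULL)
		return (NULL);		/* caller maps this to ENOBUFS */
	phd->port = port;
	phd->next = *bucket;		/* insert at head of the chain */
	*bucket = phd;
	return (phd);
}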
*/ if (phd == NULL) { phd = malloc(sizeof(struct inpcbport), M_PCB, M_NOWAIT); if (phd == NULL) { return (ENOBUFS); /* XXX */ } bzero(&phd->phd_epoch_ctx, sizeof(struct epoch_context)); phd->phd_port = inp->inp_lport; CK_LIST_INIT(&phd->phd_pcblist); CK_LIST_INSERT_HEAD(pcbporthash, phd, phd_hash); } inp->inp_phd = phd; CK_LIST_INSERT_HEAD(&phd->phd_pcblist, inp, inp_portlist); CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash); inp->inp_flags |= INP_INHASHLIST; #ifdef PCBGROUP if (do_pcbgroup_update) in_pcbgroup_update(inp); #endif return (0); } /* * For now, there are two public interfaces to insert an inpcb into the hash * lists -- one that does update pcbgroups, and one that doesn't. The latter * is used only in the TCP syncache, where in_pcbinshash is called before the * full 4-tuple is set for the inpcb, and we don't want to install in the * pcbgroup until later. * * XXXRW: This seems like a misfeature. in_pcbinshash should always update * connection groups, and partially initialised inpcbs should not be exposed * to either reservation hash tables or pcbgroups. */ int in_pcbinshash(struct inpcb *inp) { return (in_pcbinshash_internal(inp, 1)); } int in_pcbinshash_nopcbgroup(struct inpcb *inp) { return (in_pcbinshash_internal(inp, 0)); } /* * Move PCB to the proper hash bucket when { faddr, fport } have been * changed. NOTE: This does not handle the case of the lport changing (the * hashed port list would have to be updated as well), so the lport must * not change after in_pcbinshash() has been called. */ void in_pcbrehash_mbuf(struct inpcb *inp, struct mbuf *m) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct inpcbhead *head; u_int32_t hashkey_faddr; INP_WLOCK_ASSERT(inp); INP_HASH_WLOCK_ASSERT(pcbinfo); KASSERT(inp->inp_flags & INP_INHASHLIST, ("in_pcbrehash: !INP_INHASHLIST")); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) hashkey_faddr = INP6_PCBHASHKEY(&inp->in6p_faddr); else #endif hashkey_faddr = inp->inp_faddr.s_addr; head = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr, inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)]; CK_LIST_REMOVE(inp, inp_hash); CK_LIST_INSERT_HEAD(head, inp, inp_hash); #ifdef PCBGROUP if (m != NULL) in_pcbgroup_update_mbuf(inp, m); else in_pcbgroup_update(inp); #endif } void in_pcbrehash(struct inpcb *inp) { in_pcbrehash_mbuf(inp, NULL); } /* * Remove PCB from various lists. */ static void in_pcbremlists(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; #ifdef INVARIANTS if (pcbinfo == &V_tcbinfo) { INP_INFO_RLOCK_ASSERT(pcbinfo); } else { INP_INFO_WLOCK_ASSERT(pcbinfo); } #endif INP_WLOCK_ASSERT(inp); INP_LIST_WLOCK_ASSERT(pcbinfo); inp->inp_gencnt = ++pcbinfo->ipi_gencnt; if (inp->inp_flags & INP_INHASHLIST) { struct inpcbport *phd = inp->inp_phd; INP_HASH_WLOCK(pcbinfo); /* XXX: Only do if SO_REUSEPORT_LB set? */ in_pcbremlbgrouphash(inp); CK_LIST_REMOVE(inp, inp_hash); CK_LIST_REMOVE(inp, inp_portlist); if (CK_LIST_FIRST(&phd->phd_pcblist) == NULL) { CK_LIST_REMOVE(phd, phd_hash); epoch_call(net_epoch_preempt, &phd->phd_epoch_ctx, inpcbport_free); } INP_HASH_WUNLOCK(pcbinfo); inp->inp_flags &= ~INP_INHASHLIST; } CK_LIST_REMOVE(inp, inp_list); pcbinfo->ipi_count--; #ifdef PCBGROUP in_pcbgroup_remove(inp); #endif } /* * Check for alternatives when higher level complains * about service problems. For now, invalidate cached * routing information. If the route was created dynamically * (by a redirect), time to try a default gateway again. 
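/*
 * Sketch of the rehash step performed by in_pcbrehash_mbuf() above: when the
 * foreign address/port of an entry changes, it is unlinked from its current
 * chain and re-linked into the bucket selected by the new key.  Toy
 * doubly-linked chains keep the example short; bucket selection simply masks
 * the new key, as the kernel does with its hash mask.
 */
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
	uint32_t key;				/* derived from faddr/ports */
	struct toy_entry *next, **prevp;	/* chain linkage */
};

static void
toy_unlink(struct toy_entry *e)
{
	*e->prevp = e->next;
	if (e->next != NULL)
		e->next->prevp = e->prevp;
}

static void
toy_link_head(struct toy_entry **bucket, struct toy_entry *e)
{
	e->next = *bucket;
	if (e->next != NULL)
		e->next->prevp = &e->next;
	e->prevp = bucket;
	*bucket = e;
}

static void
toy_rehash(struct toy_entry **table, uint32_t mask, struct toy_entry *e,
    uint32_t newkey)
{
	toy_unlink(e);				/* remove from old bucket */
	e->key = newkey;
	toy_link_head(&table[newkey & mask], e);	/* insert into new one */
}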
*/ void in_losing(struct inpcb *inp) { RO_INVALIDATE_CACHE(&inp->inp_route); return; } /* * A set label operation has occurred at the socket layer, propagate the * label change into the in_pcb for the socket. */ void in_pcbsosetlabel(struct socket *so) { #ifdef MAC struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_pcbsosetlabel: so->so_pcb == NULL")); INP_WLOCK(inp); SOCK_LOCK(so); mac_inpcb_sosetlabel(so, inp); SOCK_UNLOCK(so); INP_WUNLOCK(inp); #endif } /* * ipport_tick runs once per second, determining if random port allocation * should be continued. If more than ipport_randomcps ports have been * allocated in the last second, then we return to sequential port * allocation. We return to random allocation only once we drop below * ipport_randomcps for at least ipport_randomtime seconds. */ static void ipport_tick(void *xtp) { VNET_ITERATOR_DECL(vnet_iter); VNET_LIST_RLOCK_NOSLEEP(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); /* XXX appease INVARIANTS here */ if (V_ipport_tcpallocs <= V_ipport_tcplastcount + V_ipport_randomcps) { if (V_ipport_stoprandom > 0) V_ipport_stoprandom--; } else V_ipport_stoprandom = V_ipport_randomtime; V_ipport_tcplastcount = V_ipport_tcpallocs; CURVNET_RESTORE(); } VNET_LIST_RUNLOCK_NOSLEEP(); callout_reset(&ipport_tick_callout, hz, ipport_tick, NULL); } static void ip_fini(void *xtp) { callout_stop(&ipport_tick_callout); } /* * The ipport_callout should start running at about the time we attach the * inet or inet6 domains. */ static void ipport_tick_init(const void *unused __unused) { /* Start ipport_tick. */ callout_init(&ipport_tick_callout, 1); callout_reset(&ipport_tick_callout, 1, ipport_tick, NULL); EVENTHANDLER_REGISTER(shutdown_pre_sync, ip_fini, NULL, SHUTDOWN_PRI_DEFAULT); } SYSINIT(ipport_tick_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_MIDDLE, ipport_tick_init, NULL); void inp_wlock(struct inpcb *inp) { INP_WLOCK(inp); } void inp_wunlock(struct inpcb *inp) { INP_WUNLOCK(inp); } void inp_rlock(struct inpcb *inp) { INP_RLOCK(inp); } void inp_runlock(struct inpcb *inp) { INP_RUNLOCK(inp); } #ifdef INVARIANT_SUPPORT void inp_lock_assert(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); } void inp_unlock_assert(struct inpcb *inp) { INP_UNLOCK_ASSERT(inp); } #endif void inp_apply_all(void (*func)(struct inpcb *, void *), void *arg) { struct inpcb *inp; INP_INFO_WLOCK(&V_tcbinfo); CK_LIST_FOREACH(inp, V_tcbinfo.ipi_listhead, inp_list) { INP_WLOCK(inp); func(inp, arg); INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(&V_tcbinfo); } struct socket * inp_inpcbtosocket(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); return (inp->inp_socket); } struct tcpcb * inp_inpcbtotcpcb(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); return ((struct tcpcb *)inp->inp_ppcb); } int inp_ip_tos_get(const struct inpcb *inp) { return (inp->inp_ip_tos); } void inp_ip_tos_set(struct inpcb *inp, int val) { inp->inp_ip_tos = val; } void inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp, uint32_t *faddr, uint16_t *fp) { INP_LOCK_ASSERT(inp); *laddr = inp->inp_laddr.s_addr; *faddr = inp->inp_faddr.s_addr; *lp = inp->inp_lport; *fp = inp->inp_fport; } struct inpcb * so_sotoinpcb(struct socket *so) { return (sotoinpcb(so)); } struct tcpcb * so_sototcpcb(struct socket *so) { return (sototcpcb(so)); } /* * Create an external-format (``xinpcb'') structure using the information in * the kernel-format in_pcb structure pointed to by inp. 
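/*
 * Stand-alone sketch of the once-per-second decision made by ipport_tick()
 * above.  The state names mirror the vnet variables; the structure and any
 * values passed in are examples only.
 */
#include <stdint.h>

struct toy_portrandom {
	uint32_t allocs;	/* total ephemeral ports handed out */
	uint32_t lastcount;	/* snapshot taken one second ago */
	uint32_t randomcps;	/* allowed allocations/second while random */
	uint32_t randomtime;	/* seconds to stay sequential after a burst */
	uint32_t stoprandom;	/* >0: use sequential allocation */
};

static void
toy_port_tick(struct toy_portrandom *s)
{
	if (s->allocs <= s->lastcount + s->randomcps) {
		/* Quiet second: count down toward random allocation again. */
		if (s->stoprandom > 0)
			s->stoprandom--;
	} else {
		/* Burst detected: stay sequential for randomtime seconds. */
		s->stoprandom = s->randomtime;
	}
	s->lastcount = s->allocs;
}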
This is done to * reduce the spew of irrelevant information over this interface, to isolate * user code from changes in the kernel structure, and potentially to provide * information-hiding if we decide that some of this information should be * hidden from users. */ void in_pcbtoxinpcb(const struct inpcb *inp, struct xinpcb *xi) { bzero(xi, sizeof(*xi)); xi->xi_len = sizeof(struct xinpcb); if (inp->inp_socket) sotoxsocket(inp->inp_socket, &xi->xi_socket); bcopy(&inp->inp_inc, &xi->inp_inc, sizeof(struct in_conninfo)); xi->inp_gencnt = inp->inp_gencnt; xi->inp_ppcb = (uintptr_t)inp->inp_ppcb; xi->inp_flow = inp->inp_flow; xi->inp_flowid = inp->inp_flowid; xi->inp_flowtype = inp->inp_flowtype; xi->inp_flags = inp->inp_flags; xi->inp_flags2 = inp->inp_flags2; xi->inp_rss_listen_bucket = inp->inp_rss_listen_bucket; xi->in6p_cksum = inp->in6p_cksum; xi->in6p_hops = inp->in6p_hops; xi->inp_ip_tos = inp->inp_ip_tos; xi->inp_vflag = inp->inp_vflag; xi->inp_ip_ttl = inp->inp_ip_ttl; xi->inp_ip_p = inp->inp_ip_p; xi->inp_ip_minttl = inp->inp_ip_minttl; } #ifdef DDB static void db_print_indent(int indent) { int i; for (i = 0; i < indent; i++) db_printf(" "); } static void db_print_inconninfo(struct in_conninfo *inc, const char *name, int indent) { char faddr_str[48], laddr_str[48]; db_print_indent(indent); db_printf("%s at %p\n", name, inc); indent += 2; #ifdef INET6 if (inc->inc_flags & INC_ISIPV6) { /* IPv6. */ ip6_sprintf(laddr_str, &inc->inc6_laddr); ip6_sprintf(faddr_str, &inc->inc6_faddr); } else #endif { /* IPv4. */ inet_ntoa_r(inc->inc_laddr, laddr_str); inet_ntoa_r(inc->inc_faddr, faddr_str); } db_print_indent(indent); db_printf("inc_laddr %s inc_lport %u\n", laddr_str, ntohs(inc->inc_lport)); db_print_indent(indent); db_printf("inc_faddr %s inc_fport %u\n", faddr_str, ntohs(inc->inc_fport)); } static void db_print_inpflags(int inp_flags) { int comma; comma = 0; if (inp_flags & INP_RECVOPTS) { db_printf("%sINP_RECVOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVRETOPTS) { db_printf("%sINP_RECVRETOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVDSTADDR) { db_printf("%sINP_RECVDSTADDR", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_ORIGDSTADDR) { db_printf("%sINP_ORIGDSTADDR", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_HDRINCL) { db_printf("%sINP_HDRINCL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_HIGHPORT) { db_printf("%sINP_HIGHPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_LOWPORT) { db_printf("%sINP_LOWPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_ANONPORT) { db_printf("%sINP_ANONPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVIF) { db_printf("%sINP_RECVIF", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_MTUDISC) { db_printf("%sINP_MTUDISC", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVTTL) { db_printf("%sINP_RECVTTL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_DONTFRAG) { db_printf("%sINP_DONTFRAG", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVTOS) { db_printf("%sINP_RECVTOS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_IPV6_V6ONLY) { db_printf("%sIN6P_IPV6_V6ONLY", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_PKTINFO) { db_printf("%sIN6P_PKTINFO", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_HOPLIMIT) { db_printf("%sIN6P_HOPLIMIT", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_HOPOPTS) { db_printf("%sIN6P_HOPOPTS", comma ? 
", " : ""); comma = 1; } if (inp_flags & IN6P_DSTOPTS) { db_printf("%sIN6P_DSTOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RTHDR) { db_printf("%sIN6P_RTHDR", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RTHDRDSTOPTS) { db_printf("%sIN6P_RTHDRDSTOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_TCLASS) { db_printf("%sIN6P_TCLASS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_AUTOFLOWLABEL) { db_printf("%sIN6P_AUTOFLOWLABEL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_TIMEWAIT) { db_printf("%sINP_TIMEWAIT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_ONESBCAST) { db_printf("%sINP_ONESBCAST", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_DROPPED) { db_printf("%sINP_DROPPED", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_SOCKREF) { db_printf("%sINP_SOCKREF", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RFC2292) { db_printf("%sIN6P_RFC2292", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_MTU) { db_printf("IN6P_MTU%s", comma ? ", " : ""); comma = 1; } } static void db_print_inpvflag(u_char inp_vflag) { int comma; comma = 0; if (inp_vflag & INP_IPV4) { db_printf("%sINP_IPV4", comma ? ", " : ""); comma = 1; } if (inp_vflag & INP_IPV6) { db_printf("%sINP_IPV6", comma ? ", " : ""); comma = 1; } if (inp_vflag & INP_IPV6PROTO) { db_printf("%sINP_IPV6PROTO", comma ? ", " : ""); comma = 1; } } static void db_print_inpcb(struct inpcb *inp, const char *name, int indent) { db_print_indent(indent); db_printf("%s at %p\n", name, inp); indent += 2; db_print_indent(indent); db_printf("inp_flow: 0x%x\n", inp->inp_flow); db_print_inconninfo(&inp->inp_inc, "inp_conninfo", indent); db_print_indent(indent); db_printf("inp_ppcb: %p inp_pcbinfo: %p inp_socket: %p\n", inp->inp_ppcb, inp->inp_pcbinfo, inp->inp_socket); db_print_indent(indent); db_printf("inp_label: %p inp_flags: 0x%x (", inp->inp_label, inp->inp_flags); db_print_inpflags(inp->inp_flags); db_printf(")\n"); db_print_indent(indent); db_printf("inp_sp: %p inp_vflag: 0x%x (", inp->inp_sp, inp->inp_vflag); db_print_inpvflag(inp->inp_vflag); db_printf(")\n"); db_print_indent(indent); db_printf("inp_ip_ttl: %d inp_ip_p: %d inp_ip_minttl: %d\n", inp->inp_ip_ttl, inp->inp_ip_p, inp->inp_ip_minttl); db_print_indent(indent); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) { db_printf("in6p_options: %p in6p_outputopts: %p " "in6p_moptions: %p\n", inp->in6p_options, inp->in6p_outputopts, inp->in6p_moptions); db_printf("in6p_icmp6filt: %p in6p_cksum %d " "in6p_hops %u\n", inp->in6p_icmp6filt, inp->in6p_cksum, inp->in6p_hops); } else #endif { db_printf("inp_ip_tos: %d inp_ip_options: %p " "inp_ip_moptions: %p\n", inp->inp_ip_tos, inp->inp_options, inp->inp_moptions); } db_print_indent(indent); db_printf("inp_phd: %p inp_gencnt: %ju\n", inp->inp_phd, (uintmax_t)inp->inp_gencnt); } DB_SHOW_COMMAND(inpcb, db_show_inpcb) { struct inpcb *inp; if (!have_addr) { db_printf("usage: show inpcb \n"); return; } inp = (struct inpcb *)addr; db_print_inpcb(inp, "inpcb", 0); } #endif /* DDB */ #ifdef RATELIMIT /* * Modify TX rate limit based on the existing "inp->inp_snd_tag", * if any. 
*/ int in_pcbmodify_txrtlmt(struct inpcb *inp, uint32_t max_pacing_rate) { union if_snd_tag_modify_params params = { .rate_limit.max_rate = max_pacing_rate, }; struct m_snd_tag *mst; struct ifnet *ifp; int error; mst = inp->inp_snd_tag; if (mst == NULL) return (EINVAL); ifp = mst->ifp; if (ifp == NULL) return (EINVAL); if (ifp->if_snd_tag_modify == NULL) { error = EOPNOTSUPP; } else { error = ifp->if_snd_tag_modify(mst, &params); } return (error); } /* * Query existing TX rate limit based on the existing * "inp->inp_snd_tag", if any. */ int in_pcbquery_txrtlmt(struct inpcb *inp, uint32_t *p_max_pacing_rate) { union if_snd_tag_query_params params = { }; struct m_snd_tag *mst; struct ifnet *ifp; int error; mst = inp->inp_snd_tag; if (mst == NULL) return (EINVAL); ifp = mst->ifp; if (ifp == NULL) return (EINVAL); if (ifp->if_snd_tag_query == NULL) { error = EOPNOTSUPP; } else { error = ifp->if_snd_tag_query(mst, &params); if (error == 0 && p_max_pacing_rate != NULL) *p_max_pacing_rate = params.rate_limit.max_rate; } return (error); } /* * Query existing TX queue level based on the existing * "inp->inp_snd_tag", if any. */ int in_pcbquery_txrlevel(struct inpcb *inp, uint32_t *p_txqueue_level) { union if_snd_tag_query_params params = { }; struct m_snd_tag *mst; struct ifnet *ifp; int error; mst = inp->inp_snd_tag; if (mst == NULL) return (EINVAL); ifp = mst->ifp; if (ifp == NULL) return (EINVAL); if (ifp->if_snd_tag_query == NULL) return (EOPNOTSUPP); error = ifp->if_snd_tag_query(mst, &params); if (error == 0 && p_txqueue_level != NULL) *p_txqueue_level = params.rate_limit.queue_level; return (error); } /* * Allocate a new TX rate limit send tag from the network interface * given by the "ifp" argument and save it in "inp->inp_snd_tag": */ int in_pcbattach_txrtlmt(struct inpcb *inp, struct ifnet *ifp, uint32_t flowtype, uint32_t flowid, uint32_t max_pacing_rate) { union if_snd_tag_alloc_params params = { .rate_limit.hdr.type = (max_pacing_rate == -1U) ? IF_SND_TAG_TYPE_UNLIMITED : IF_SND_TAG_TYPE_RATE_LIMIT, .rate_limit.hdr.flowid = flowid, .rate_limit.hdr.flowtype = flowtype, .rate_limit.max_rate = max_pacing_rate, }; int error; INP_WLOCK_ASSERT(inp); if (inp->inp_snd_tag != NULL) return (EINVAL); if (ifp->if_snd_tag_alloc == NULL) { error = EOPNOTSUPP; } else { error = ifp->if_snd_tag_alloc(ifp, &params, &inp->inp_snd_tag); /* * At success increment the refcount on * the send tag's network interface: */ if (error == 0) if_ref(inp->inp_snd_tag->ifp); } return (error); } /* * Free an existing TX rate limit tag based on the "inp->inp_snd_tag", * if any: */ void in_pcbdetach_txrtlmt(struct inpcb *inp) { struct m_snd_tag *mst; struct ifnet *ifp; INP_WLOCK_ASSERT(inp); mst = inp->inp_snd_tag; inp->inp_snd_tag = NULL; if (mst == NULL) return; ifp = mst->ifp; if (ifp == NULL) return; /* * If the device was detached while we still had reference(s) * on the ifp, we assume if_snd_tag_free() was replaced with * stubs. */ ifp->if_snd_tag_free(mst); /* release reference count on network interface */ if_rele(ifp); } /* * This function should be called when the INP_RATE_LIMIT_CHANGED flag * is set in the fast path and will attach/detach/modify the TX rate * limit send tag based on the socket's so_max_pacing_rate value.
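/*
 * The rate-limit helpers above treat the interface's send-tag methods as
 * optional: a NULL method pointer means the driver does not implement the
 * operation and EOPNOTSUPP is returned.  A hedged user-space sketch of that
 * convention, with an invented method table:
 */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_ifops {
	/* Optional: may be left NULL by drivers without rate limiting. */
	int	(*modify_rate)(void *tag, uint64_t max_rate);
};

static int
toy_modify_rate(const struct toy_ifops *ops, void *tag, uint64_t max_rate)
{
	if (ops->modify_rate == NULL)
		return (EOPNOTSUPP);	/* driver lacks the capability */
	return (ops->modify_rate(tag, max_rate));
}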
*/ void in_pcboutput_txrtlmt(struct inpcb *inp, struct ifnet *ifp, struct mbuf *mb) { struct socket *socket; uint32_t max_pacing_rate; bool did_upgrade; int error; if (inp == NULL) return; socket = inp->inp_socket; if (socket == NULL) return; if (!INP_WLOCKED(inp)) { /* * NOTE: If the write locking fails, we need to bail * out and use the non-ratelimited ring for the * transmit until there is a new chance to get the * write lock. */ if (!INP_TRY_UPGRADE(inp)) return; did_upgrade = 1; } else { did_upgrade = 0; } /* * NOTE: The so_max_pacing_rate value is read unlocked, * because atomic updates are not required since the variable * is checked at every mbuf we send. It is assumed that the * variable read itself will be atomic. */ max_pacing_rate = socket->so_max_pacing_rate; /* * NOTE: When attaching to a network interface a reference is * made to ensure the network interface doesn't go away until * all ratelimit connections are gone. The network interface * pointers compared below represent valid network interfaces, * except when comparing towards NULL. */ if (max_pacing_rate == 0 && inp->inp_snd_tag == NULL) { error = 0; } else if (!(ifp->if_capenable & IFCAP_TXRTLMT)) { if (inp->inp_snd_tag != NULL) in_pcbdetach_txrtlmt(inp); error = 0; } else if (inp->inp_snd_tag == NULL) { /* * In order to utilize packet pacing with RSS, we need * to wait until there is a valid RSS hash before we * can proceed: */ if (M_HASHTYPE_GET(mb) == M_HASHTYPE_NONE) { error = EAGAIN; } else { error = in_pcbattach_txrtlmt(inp, ifp, M_HASHTYPE_GET(mb), mb->m_pkthdr.flowid, max_pacing_rate); } } else { error = in_pcbmodify_txrtlmt(inp, max_pacing_rate); } if (error == 0 || error == EOPNOTSUPP) inp->inp_flags2 &= ~INP_RATE_LIMIT_CHANGED; if (did_upgrade) INP_DOWNGRADE(inp); } /* * Track route changes for TX rate limiting. */ void in_pcboutput_eagain(struct inpcb *inp) { bool did_upgrade; if (inp == NULL) return; if (inp->inp_snd_tag == NULL) return; if (!INP_WLOCKED(inp)) { /* * NOTE: If the write locking fails, we need to bail * out and use the non-ratelimited ring for the * transmit until there is a new chance to get the * write lock. */ if (!INP_TRY_UPGRADE(inp)) return; did_upgrade = 1; } else { did_upgrade = 0; } /* detach rate limiting */ in_pcbdetach_txrtlmt(inp); /* make sure new mbuf send tag allocation is made */ inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED; if (did_upgrade) INP_DOWNGRADE(inp); } #endif /* RATELIMIT */ Index: user/ngie/bug-237403/sys/netinet/in_pcb.h =================================================================== --- user/ngie/bug-237403/sys/netinet/in_pcb.h (revision 346925) +++ user/ngie/bug-237403/sys/netinet/in_pcb.h (revision 346926) @@ -1,894 +1,894 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1990, 1993 * The Regents of the University of California. * Copyright (c) 2010-2011 Juniper Networks, Inc. * All rights reserved. * * Portions of this software were developed by Robert N. M. Watson under * contract to Juniper Networks, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. 
Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)in_pcb.h 8.1 (Berkeley) 6/10/93 * $FreeBSD$ */ #ifndef _NETINET_IN_PCB_H_ #define _NETINET_IN_PCB_H_ #include #include #include #include #include #include #ifdef _KERNEL #include #include #include #include #include #include #endif #include #define in6pcb inpcb /* for KAME src sync over BSD*'s */ #define in6p_sp inp_sp /* for KAME src sync over BSD*'s */ /* * struct inpcb is the common protocol control block structure used in most * IP transport protocols. * * Pointers to local and foreign host table entries, local and foreign socket * numbers, and pointers up (to a socket structure) and down (to a * protocol-specific control block) are stored here. */ CK_LIST_HEAD(inpcbhead, inpcb); CK_LIST_HEAD(inpcbporthead, inpcbport); CK_LIST_HEAD(inpcblbgrouphead, inpcblbgroup); typedef uint64_t inp_gen_t; /* * PCB with AF_INET6 null bind'ed laddr can receive AF_INET input packet. * So, AF_INET6 null laddr is also used as AF_INET null laddr, by utilizing * the following structure. */ struct in_addr_4in6 { u_int32_t ia46_pad32[3]; struct in_addr ia46_addr4; }; union in_dependaddr { struct in_addr_4in6 id46_addr; struct in6_addr id6_addr; }; /* * NOTE: ipv6 addrs should be 64-bit aligned, per RFC 2553. in_conninfo has * some extra padding to accomplish this. * NOTE 2: tcp_syncache.c uses first 5 32-bit words, which identify fport, * lport, faddr to generate hash, so these fields shouldn't be moved. */ struct in_endpoints { u_int16_t ie_fport; /* foreign port */ u_int16_t ie_lport; /* local port */ /* protocol dependent part, local and foreign addr */ union in_dependaddr ie_dependfaddr; /* foreign host table entry */ union in_dependaddr ie_dependladdr; /* local host table entry */ #define ie_faddr ie_dependfaddr.id46_addr.ia46_addr4 #define ie_laddr ie_dependladdr.id46_addr.ia46_addr4 #define ie6_faddr ie_dependfaddr.id6_addr #define ie6_laddr ie_dependladdr.id6_addr u_int32_t ie6_zoneid; /* scope zone id */ }; /* * XXX The defines for inc_* are hacks and should be changed to direct * references. */ struct in_conninfo { u_int8_t inc_flags; u_int8_t inc_len; u_int16_t inc_fibnum; /* XXX was pad, 16 bits is plenty */ /* protocol dependent part */ struct in_endpoints inc_ie; }; /* * Flags for inc_flags. 
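/*
 * Sketch verifying the layout trick behind in_addr_4in6 above: three words
 * of padding push the 4-byte IPv4 address into the last four bytes of a
 * 16-byte slot, so it overlays the low-order words of the IPv6 address in
 * union in_dependaddr.  Plain fixed-width types stand in for the kernel
 * structures; the assertions are the point of the example.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_addr_4in6 {
	uint32_t pad32[3];	/* ia46_pad32 */
	uint32_t addr4;		/* ia46_addr4 (IPv4 address) */
};

union toy_dependaddr {
	struct toy_addr_4in6 id46;
	uint8_t id6[16];	/* IPv6 address */
};

static_assert(sizeof(union toy_dependaddr) == 16,
    "IPv4 form must occupy the full IPv6 slot");
static_assert(offsetof(struct toy_addr_4in6, addr4) == 12,
    "IPv4 address must land in the last four bytes");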
*/ #define INC_ISIPV6 0x01 #define INC_IPV6MINMTU 0x02 #define inc_fport inc_ie.ie_fport #define inc_lport inc_ie.ie_lport #define inc_faddr inc_ie.ie_faddr #define inc_laddr inc_ie.ie_laddr #define inc6_faddr inc_ie.ie6_faddr #define inc6_laddr inc_ie.ie6_laddr #define inc6_zoneid inc_ie.ie6_zoneid #if defined(_KERNEL) || defined(_WANT_INPCB) /* * struct inpcb captures the network layer state for TCP, UDP, and raw IPv4 and * IPv6 sockets. In the case of TCP and UDP, further per-connection state is * hung off of inp_ppcb most of the time. Almost all fields of struct inpcb * are static after creation or protected by a per-inpcb rwlock, inp_lock. A * few fields are protected by multiple locks as indicated in the locking notes * below. For these fields, all of the listed locks must be write-locked for * any modifications. However, these fields can be safely read while any one of * the listed locks are read-locked. This model can permit greater concurrency * for read operations. For example, connections can be looked up while only * holding a read lock on the global pcblist lock. This is important for * performance when attempting to find the connection for a packet given its IP * and port tuple. * * One noteworthy exception is that the global pcbinfo lock follows a different * set of rules in relation to the inp_list field. Rather than being * write-locked for modifications and read-locked for list iterations, it must * be read-locked during modifications and write-locked during list iterations. * This ensures that the relatively rare global list iterations safely walk a * stable snapshot of connections while allowing more common list modifications * to safely grab the pcblist lock just while adding or removing a connection * from the global list. * * Key: * (b) - Protected by the hpts lock. * (c) - Constant after initialization * (e) - Protected by the net_epoch_prempt epoch * (g) - Protected by the pcbgroup lock * (i) - Protected by the inpcb lock * (p) - Protected by the pcbinfo lock for the inpcb * (l) - Protected by the pcblist lock for the inpcb * (h) - Protected by the pcbhash lock for the inpcb * (s) - Protected by another subsystem's locks * (x) - Undefined locking * * Notes on the tcp_hpts: * * First Hpts lock order is * 1) INP_WLOCK() * 2) HPTS_LOCK() i.e. hpts->pmtx * * To insert a TCB on the hpts you *must* be holding the INP_WLOCK(). * You may check the inp->inp_in_hpts flag without the hpts lock. * The hpts is the only one that will clear this flag holding * only the hpts lock. This means that in your tcp_output() * routine when you test for the inp_in_hpts flag to be 1 * it may be transitioning to 0 (by the hpts). * That's ok since that will just mean an extra call to tcp_output * that most likely will find the call you executed * (when the mis-match occured) will have put the TCB back * on the hpts and it will return. If your * call did not add the inp back to the hpts then you will either * over-send or the cwnd will block you from sending more. * * Note you should also be holding the INP_WLOCK() when you * call the remove from the hpts as well. Though usually * you are either doing this from a timer, where you need and have * the INP_WLOCK() or from destroying your TCB where again * you should already have the INP_WLOCK(). * * The inp_hpts_cpu, inp_hpts_cpu_set, inp_input_cpu and * inp_input_cpu_set fields are controlled completely by * the hpts. Do not ever set these. 
The inp_hpts_cpu_set * and inp_input_cpu_set fields indicate if the hpts has * setup the respective cpu field. It is advised if this * field is 0, to enqueue the packet with the appropriate * hpts_immediate() call. If the _set field is 1, then * you may compare the inp_*_cpu field to the curcpu and * may want to again insert onto the hpts if these fields * are not equal (i.e. you are not on the expected CPU). * * A note on inp_hpts_calls and inp_input_calls, these * flags are set when the hpts calls either the output * or do_segment routines respectively. If the routine * being called wants to use this, then it needs to * clear the flag before returning. The hpts will not * clear the flag. The flags can be used to tell if * the hpts is the function calling the respective * routine. * * A few other notes: * * When a read lock is held, stability of the field is guaranteed; to write * to a field, a write lock must generally be held. * * netinet/netinet6-layer code should not assume that the inp_socket pointer * is safe to dereference without inp_lock being held, even for protocols * other than TCP (where the inpcb persists during TIMEWAIT even after the * socket has been freed), or there may be close(2)-related races. * * The inp_vflag field is overloaded, and would otherwise ideally be (c). * * TODO: Currently only the TCP stack is leveraging the global pcbinfo lock * read-lock usage during modification, this model can be applied to other * protocols (especially SCTP). */ struct icmp6_filter; struct inpcbpolicy; struct m_snd_tag; struct inpcb { /* Cache line #1 (amd64) */ CK_LIST_ENTRY(inpcb) inp_hash; /* [w](h/i) [r](e/i) hash list */ CK_LIST_ENTRY(inpcb) inp_pcbgrouphash; /* (g/i) hash list */ struct rwlock inp_lock; /* Cache line #2 (amd64) */ #define inp_start_zero inp_hpts #define inp_zero_size (sizeof(struct inpcb) - \ offsetof(struct inpcb, inp_start_zero)) TAILQ_ENTRY(inpcb) inp_hpts; /* pacing out queue next lock(b) */ uint32_t inp_hpts_request; /* Current hpts request, zero if * fits in the pacing window (i&b). */ /* * Note the next fields are protected by a * different lock (hpts-lock). This means that * they must correspond in size to the smallest * protectable bit field (uint8_t on x86, and * other platfomrs potentially uint32_t?). Also * since CPU switches can occur at different times the two * fields can *not* be collapsed into a signal bit field. 
*/ #if defined(__amd64__) || defined(__i386__) volatile uint8_t inp_in_hpts; /* on output hpts (lock b) */ volatile uint8_t inp_in_input; /* on input hpts (lock b) */ #else volatile uint32_t inp_in_hpts; /* on output hpts (lock b) */ volatile uint32_t inp_in_input; /* on input hpts (lock b) */ #endif volatile uint16_t inp_hpts_cpu; /* Lock (i) */ u_int inp_refcount; /* (i) refcount */ int inp_flags; /* (i) generic IP/datagram flags */ int inp_flags2; /* (i) generic IP/datagram flags #2*/ volatile uint16_t inp_input_cpu; /* Lock (i) */ volatile uint8_t inp_hpts_cpu_set :1, /* on output hpts (i) */ inp_input_cpu_set : 1, /* on input hpts (i) */ inp_hpts_calls :1, /* (i) from output hpts */ inp_input_calls :1, /* (i) from input hpts */ inp_spare_bits2 : 4; - uint8_t inp_spare_byte; /* Compiler hole */ + uint8_t inp_numa_domain; /* numa domain */ void *inp_ppcb; /* (i) pointer to per-protocol pcb */ struct socket *inp_socket; /* (i) back pointer to socket */ uint32_t inp_hptsslot; /* Hpts wheel slot this tcb is Lock(i&b) */ uint32_t inp_hpts_drop_reas; /* reason we are dropping the PCB (lock i&b) */ TAILQ_ENTRY(inpcb) inp_input; /* pacing in queue next lock(b) */ struct inpcbinfo *inp_pcbinfo; /* (c) PCB list info */ struct inpcbgroup *inp_pcbgroup; /* (g/i) PCB group list */ CK_LIST_ENTRY(inpcb) inp_pcbgroup_wild; /* (g/i/h) group wildcard entry */ struct ucred *inp_cred; /* (c) cache of socket cred */ u_int32_t inp_flow; /* (i) IPv6 flow information */ u_char inp_vflag; /* (i) IP version flag (v4/v6) */ u_char inp_ip_ttl; /* (i) time to live proto */ u_char inp_ip_p; /* (c) protocol proto */ u_char inp_ip_minttl; /* (i) minimum TTL or drop */ uint32_t inp_flowid; /* (x) flow id / queue id */ struct m_snd_tag *inp_snd_tag; /* (i) send tag for outgoing mbufs */ uint32_t inp_flowtype; /* (x) M_HASHTYPE value */ uint32_t inp_rss_listen_bucket; /* (x) overridden RSS listen bucket */ /* Local and foreign ports, local and foreign addr. */ struct in_conninfo inp_inc; /* (i) list for PCB's local port */ /* MAC and IPSEC policy information. */ struct label *inp_label; /* (i) MAC label */ struct inpcbpolicy *inp_sp; /* (s) for IPSEC */ /* Protocol-dependent part; options. */ struct { u_char inp_ip_tos; /* (i) type of service proto */ struct mbuf *inp_options; /* (i) IP options */ struct ip_moptions *inp_moptions; /* (i) mcast options */ }; struct { /* (i) IP options */ struct mbuf *in6p_options; /* (i) IP6 options for outgoing packets */ struct ip6_pktopts *in6p_outputopts; /* (i) IP multicast options */ struct ip6_moptions *in6p_moptions; /* (i) ICMPv6 code type filter */ struct icmp6_filter *in6p_icmp6filt; /* (i) IPV6_CHECKSUM setsockopt */ int in6p_cksum; short in6p_hops; }; CK_LIST_ENTRY(inpcb) inp_portlist; /* (i/h) */ struct inpcbport *inp_phd; /* (i/h) head of this list */ inp_gen_t inp_gencnt; /* (c) generation count */ void *spare_ptr; /* Spare pointer. 
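/*
 * The protocol-dependent option blocks above are anonymous struct members
 * (a C11 feature), so their fields are addressed directly as inp->inp_ip_tos,
 * inp->in6p_hops, and so on, without an intermediate member name.  A minimal
 * stand-alone illustration with made-up field names:
 */
#include <stdint.h>

struct toy_pcb_opts {
	struct {			/* IPv4-specific options */
		uint8_t	 ip_tos;
		void	*ip_options;
	};
	struct {			/* IPv6-specific options */
		void	*ip6_options;
		int16_t	 ip6_hops;
	};
};

static int16_t
toy_get_hops(struct toy_pcb_opts *o)
{
	o->ip_tos = 0;			/* members of both anonymous structs */
	return (o->ip6_hops);		/* are reachable without a name */
}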
*/ rt_gen_t inp_rt_cookie; /* generation for route entry */ union { /* cached L3 information */ struct route inp_route; struct route_in6 inp_route6; }; CK_LIST_ENTRY(inpcb) inp_list; /* (p/l) list for all PCBs for proto */ /* (e[r]) for list iteration */ /* (p[w]/l) for addition/removal */ struct epoch_context inp_epoch_ctx; }; #endif /* _KERNEL */ #define inp_fport inp_inc.inc_fport #define inp_lport inp_inc.inc_lport #define inp_faddr inp_inc.inc_faddr #define inp_laddr inp_inc.inc_laddr #define in6p_faddr inp_inc.inc6_faddr #define in6p_laddr inp_inc.inc6_laddr #define in6p_zoneid inp_inc.inc6_zoneid #define in6p_flowinfo inp_flow #define inp_vnet inp_pcbinfo->ipi_vnet /* * The range of the generation count, as used in this implementation, is 9e19. * We would have to create 300 billion connections per second for this number * to roll over in a year. This seems sufficiently unlikely that we simply * don't concern ourselves with that possibility. */ /* * Interface exported to userland by various protocols which use inpcbs. Hack * alert -- only define if struct xsocket is in scope. * Fields prefixed with "xi_" are unique to this structure, and the rest * match fields in the struct inpcb, to ease coding and porting. * * Legend: * (s) - used by userland utilities in src * (p) - used by utilities in ports * (3) - is known to be used by third party software not in ports * (n) - no known usage */ #ifdef _SYS_SOCKETVAR_H_ struct xinpcb { ksize_t xi_len; /* length of this structure */ struct xsocket xi_socket; /* (s,p) */ struct in_conninfo inp_inc; /* (s,p) */ uint64_t inp_gencnt; /* (s,p) */ kvaddr_t inp_ppcb; /* (s) netstat(1) */ int64_t inp_spare64[4]; uint32_t inp_flow; /* (s) */ uint32_t inp_flowid; /* (s) */ uint32_t inp_flowtype; /* (s) */ int32_t inp_flags; /* (s,p) */ int32_t inp_flags2; /* (s) */ int32_t inp_rss_listen_bucket; /* (n) */ int32_t in6p_cksum; /* (n) */ int32_t inp_spare32[4]; uint16_t in6p_hops; /* (n) */ uint8_t inp_ip_tos; /* (n) */ int8_t pad8; uint8_t inp_vflag; /* (s,p) */ uint8_t inp_ip_ttl; /* (n) */ uint8_t inp_ip_p; /* (n) */ uint8_t inp_ip_minttl; /* (n) */ int8_t inp_spare8[4]; } __aligned(8); struct xinpgen { ksize_t xig_len; /* length of this structure */ u_int xig_count; /* number of PCBs at this time */ uint32_t _xig_spare32; inp_gen_t xig_gen; /* generation count at this time */ so_gen_t xig_sogen; /* socket generation count this time */ uint64_t _xig_spare64[4]; } __aligned(8); #ifdef _KERNEL void in_pcbtoxinpcb(const struct inpcb *, struct xinpcb *); #endif #endif /* _SYS_SOCKETVAR_H_ */ struct inpcbport { struct epoch_context phd_epoch_ctx; CK_LIST_ENTRY(inpcbport) phd_hash; struct inpcbhead phd_pcblist; u_short phd_port; }; struct in_pcblist { int il_count; struct epoch_context il_epoch_ctx; struct inpcbinfo *il_pcbinfo; struct inpcb *il_inp_list[0]; }; /*- * Global data structure for each high-level protocol (UDP, TCP, ...) in both * IPv4 and IPv6. Holds inpcb lists and information for managing them. 
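/*
 * The export structures above carry their own length (xig_len/xi_len) so
 * that userland can skip records whose layout it does not fully understand.
 * A hedged sketch of walking such a buffer; the record layout here is
 * invented and only the "advance by the embedded length" idea is taken from
 * the structures above.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_record {
	uint64_t len;		/* total size of this record, like xi_len */
	/* ... protocol fields follow ... */
};

static size_t
toy_count_records(const char *buf, size_t buflen)
{
	size_t off = 0, n = 0;
	struct toy_record rec;

	while (off + sizeof(rec) <= buflen) {
		memcpy(&rec, buf + off, sizeof(rec));
		if (rec.len < sizeof(rec) || rec.len > buflen - off)
			break;		/* malformed or truncated record */
		n++;
		off += rec.len;		/* skip the whole record */
	}
	return (n);
}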
* * Each pcbinfo is protected by three locks: ipi_lock, ipi_hash_lock and * ipi_list_lock: * - ipi_lock covering the global pcb list stability during loop iteration, * - ipi_hash_lock covering the hashed lookup tables, * - ipi_list_lock covering mutable global fields (such as the global * pcb list) * * The lock order is: * * ipi_lock (before) * inpcb locks (before) * ipi_list locks (before) * {ipi_hash_lock, pcbgroup locks} * * Locking key: * * (c) Constant or nearly constant after initialisation * (e) - Protected by the net_epoch_prempt epoch * (g) Locked by ipi_lock * (l) Locked by ipi_list_lock * (h) Read using either net_epoch_preempt or inpcb lock; write requires both ipi_hash_lock and inpcb lock * (p) Protected by one or more pcbgroup locks * (x) Synchronisation properties poorly defined */ struct inpcbinfo { /* * Global lock protecting inpcb list modification */ struct mtx ipi_lock; /* * Global list of inpcbs on the protocol. */ struct inpcbhead *ipi_listhead; /* [r](e) [w](g/l) */ u_int ipi_count; /* (l) */ /* * Generation count -- incremented each time a connection is allocated * or freed. */ u_quad_t ipi_gencnt; /* (l) */ /* * Fields associated with port lookup and allocation. */ u_short ipi_lastport; /* (x) */ u_short ipi_lastlow; /* (x) */ u_short ipi_lasthi; /* (x) */ /* * UMA zone from which inpcbs are allocated for this protocol. */ struct uma_zone *ipi_zone; /* (c) */ /* * Connection groups associated with this protocol. These fields are * constant, but pcbgroup structures themselves are protected by * per-pcbgroup locks. */ struct inpcbgroup *ipi_pcbgroups; /* (c) */ u_int ipi_npcbgroups; /* (c) */ u_int ipi_hashfields; /* (c) */ /* * Global lock protecting modification non-pcbgroup hash lookup tables. */ struct mtx ipi_hash_lock; /* * Global hash of inpcbs, hashed by local and foreign addresses and * port numbers. */ struct inpcbhead *ipi_hashbase; /* (h) */ u_long ipi_hashmask; /* (h) */ /* * Global hash of inpcbs, hashed by only local port number. */ struct inpcbporthead *ipi_porthashbase; /* (h) */ u_long ipi_porthashmask; /* (h) */ /* * List of wildcard inpcbs for use with pcbgroups. In the past, was * per-pcbgroup but is now global. All pcbgroup locks must be held * to modify the list, so any is sufficient to read it. */ struct inpcbhead *ipi_wildbase; /* (p) */ u_long ipi_wildmask; /* (p) */ /* * Load balance groups used for the SO_REUSEPORT_LB option, * hashed by local port. */ struct inpcblbgrouphead *ipi_lbgrouphashbase; /* (h) */ u_long ipi_lbgrouphashmask; /* (h) */ /* * Pointer to network stack instance */ struct vnet *ipi_vnet; /* (c) */ /* * general use 2 */ void *ipi_pspare[2]; /* * Global lock protecting global inpcb list, inpcb count, etc. */ struct rwlock ipi_list_lock; }; #ifdef _KERNEL /* * Connection groups hold sets of connections that have similar CPU/thread * affinity. Each connection belongs to exactly one connection group. */ struct inpcbgroup { /* * Per-connection group hash of inpcbs, hashed by local and foreign * addresses and port numbers. */ struct inpcbhead *ipg_hashbase; /* (c) */ u_long ipg_hashmask; /* (c) */ /* * Notional affinity of this pcbgroup. */ u_int ipg_cpu; /* (p) */ /* * Per-connection group lock, not to be confused with ipi_lock. * Protects the hash table hung off the group, but also the global * wildcard list in inpcbinfo. */ struct mtx ipg_lock; } __aligned(CACHE_LINE_SIZE); /* * Load balance groups used for the SO_REUSEPORT_LB socket option. 
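/*
 * Each hash table in struct inpcbinfo above is paired with a mask
 * (ipi_hashmask, ipi_porthashmask, ...) rather than a size: the table has a
 * power-of-two number of buckets and the mask is nbuckets - 1, so bucket
 * selection is a single AND.  A user-space sketch of that convention, with
 * calloc() standing in for the kernel's hash table allocator:
 */
#include <stdint.h>
#include <stdlib.h>

struct toy_bucket;

static struct toy_bucket **
toy_hash_alloc(unsigned int nbuckets, uint32_t *maskp)
{
	struct toy_bucket **table;
	unsigned int size = 1;

	/* Round up to a power of two so that "& mask" is a valid modulus. */
	while (size < nbuckets)
		size <<= 1;

	table = calloc(size, sizeof(*table));
	if (table != NULL)
		*maskp = size - 1;
	return (table);
}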
Each group * (or unique address:port combination) can be re-used at most * INPCBLBGROUP_SIZMAX (256) times. The inpcbs are stored in il_inp which * is dynamically resized as processes bind/unbind to that specific group. */ struct inpcblbgroup { CK_LIST_ENTRY(inpcblbgroup) il_list; struct epoch_context il_epoch_ctx; uint16_t il_lport; /* (c) */ u_char il_vflag; /* (c) */ u_char il_pad; uint32_t il_pad2; union in_dependaddr il_dependladdr; /* (c) */ #define il_laddr il_dependladdr.id46_addr.ia46_addr4 #define il6_laddr il_dependladdr.id6_addr uint32_t il_inpsiz; /* max count in il_inp[] (h) */ uint32_t il_inpcnt; /* cur count in il_inp[] (h) */ struct inpcb *il_inp[]; /* (h) */ }; #define INP_LOCK_INIT(inp, d, t) \ rw_init_flags(&(inp)->inp_lock, (t), RW_RECURSE | RW_DUPOK) #define INP_LOCK_DESTROY(inp) rw_destroy(&(inp)->inp_lock) #define INP_RLOCK(inp) rw_rlock(&(inp)->inp_lock) #define INP_WLOCK(inp) rw_wlock(&(inp)->inp_lock) #define INP_TRY_RLOCK(inp) rw_try_rlock(&(inp)->inp_lock) #define INP_TRY_WLOCK(inp) rw_try_wlock(&(inp)->inp_lock) #define INP_RUNLOCK(inp) rw_runlock(&(inp)->inp_lock) #define INP_WUNLOCK(inp) rw_wunlock(&(inp)->inp_lock) #define INP_TRY_UPGRADE(inp) rw_try_upgrade(&(inp)->inp_lock) #define INP_DOWNGRADE(inp) rw_downgrade(&(inp)->inp_lock) #define INP_WLOCKED(inp) rw_wowned(&(inp)->inp_lock) #define INP_LOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_LOCKED) #define INP_RLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_RLOCKED) #define INP_WLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_WLOCKED) #define INP_UNLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_UNLOCKED) /* * These locking functions are for inpcb consumers outside of sys/netinet, * more specifically, they were added for the benefit of TOE drivers. The * macros are reserved for use by the stack. 
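/*
 * Sketch of how a load-balance group like struct inpcblbgroup above can
 * spread incoming connections across its members: hash the foreign address
 * and the ports, then index the il_inp-style array modulo the current member
 * count.  The hash expression is the INP_PCBLBGROUP_PKTHASH arithmetic from
 * this header; the surrounding structure and the claim that lookup picks
 * exactly this index are an illustration, not a copy of
 * in_pcblookup_lbgroup().
 */
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

struct toy_member;

struct toy_lbgroup {
	uint32_t inpcnt;		/* current members, like il_inpcnt */
	struct toy_member *inp[256];	/* bounded like INPCBLBGROUP_SIZMAX */
};

static struct toy_member *
toy_lb_pick(const struct toy_lbgroup *grp, uint32_t faddr, uint16_t lport,
    uint16_t fport)
{
	uint32_t hash;

	if (grp->inpcnt == 0)
		return (NULL);
	hash = faddr ^ (faddr >> 16) ^ ntohs(lport ^ fport);
	return (grp->inp[hash % grp->inpcnt]);
}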
*/ void inp_wlock(struct inpcb *); void inp_wunlock(struct inpcb *); void inp_rlock(struct inpcb *); void inp_runlock(struct inpcb *); #ifdef INVARIANT_SUPPORT void inp_lock_assert(struct inpcb *); void inp_unlock_assert(struct inpcb *); #else #define inp_lock_assert(inp) do {} while (0) #define inp_unlock_assert(inp) do {} while (0) #endif void inp_apply_all(void (*func)(struct inpcb *, void *), void *arg); int inp_ip_tos_get(const struct inpcb *inp); void inp_ip_tos_set(struct inpcb *inp, int val); struct socket * inp_inpcbtosocket(struct inpcb *inp); struct tcpcb * inp_inpcbtotcpcb(struct inpcb *inp); void inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp, uint32_t *faddr, uint16_t *fp); int inp_so_options(const struct inpcb *inp); #endif /* _KERNEL */ #define INP_INFO_LOCK_INIT(ipi, d) \ mtx_init(&(ipi)->ipi_lock, (d), NULL, MTX_DEF| MTX_RECURSE) #define INP_INFO_LOCK_DESTROY(ipi) mtx_destroy(&(ipi)->ipi_lock) #define INP_INFO_RLOCK_ET(ipi, et) NET_EPOCH_ENTER((et)) #define INP_INFO_WLOCK(ipi) mtx_lock(&(ipi)->ipi_lock) #define INP_INFO_TRY_WLOCK(ipi) mtx_trylock(&(ipi)->ipi_lock) #define INP_INFO_WLOCKED(ipi) mtx_owned(&(ipi)->ipi_lock) #define INP_INFO_RUNLOCK_ET(ipi, et) NET_EPOCH_EXIT((et)) #define INP_INFO_RUNLOCK_TP(ipi, tp) NET_EPOCH_EXIT(*(tp)->t_inpcb->inp_et) #define INP_INFO_WUNLOCK(ipi) mtx_unlock(&(ipi)->ipi_lock) #define INP_INFO_LOCK_ASSERT(ipi) MPASS(in_epoch(net_epoch_preempt) || mtx_owned(&(ipi)->ipi_lock)) #define INP_INFO_RLOCK_ASSERT(ipi) MPASS(in_epoch(net_epoch_preempt)) #define INP_INFO_WLOCK_ASSERT(ipi) mtx_assert(&(ipi)->ipi_lock, MA_OWNED) #define INP_INFO_WUNLOCK_ASSERT(ipi) \ mtx_assert(&(ipi)->ipi_lock, MA_NOTOWNED) #define INP_INFO_UNLOCK_ASSERT(ipi) MPASS(!in_epoch(net_epoch_preempt) && !mtx_owned(&(ipi)->ipi_lock)) #define INP_LIST_LOCK_INIT(ipi, d) \ rw_init_flags(&(ipi)->ipi_list_lock, (d), 0) #define INP_LIST_LOCK_DESTROY(ipi) rw_destroy(&(ipi)->ipi_list_lock) #define INP_LIST_RLOCK(ipi) rw_rlock(&(ipi)->ipi_list_lock) #define INP_LIST_WLOCK(ipi) rw_wlock(&(ipi)->ipi_list_lock) #define INP_LIST_TRY_RLOCK(ipi) rw_try_rlock(&(ipi)->ipi_list_lock) #define INP_LIST_TRY_WLOCK(ipi) rw_try_wlock(&(ipi)->ipi_list_lock) #define INP_LIST_TRY_UPGRADE(ipi) rw_try_upgrade(&(ipi)->ipi_list_lock) #define INP_LIST_RUNLOCK(ipi) rw_runlock(&(ipi)->ipi_list_lock) #define INP_LIST_WUNLOCK(ipi) rw_wunlock(&(ipi)->ipi_list_lock) #define INP_LIST_LOCK_ASSERT(ipi) \ rw_assert(&(ipi)->ipi_list_lock, RA_LOCKED) #define INP_LIST_RLOCK_ASSERT(ipi) \ rw_assert(&(ipi)->ipi_list_lock, RA_RLOCKED) #define INP_LIST_WLOCK_ASSERT(ipi) \ rw_assert(&(ipi)->ipi_list_lock, RA_WLOCKED) #define INP_LIST_UNLOCK_ASSERT(ipi) \ rw_assert(&(ipi)->ipi_list_lock, RA_UNLOCKED) #define INP_HASH_LOCK_INIT(ipi, d) mtx_init(&(ipi)->ipi_hash_lock, (d), NULL, MTX_DEF) #define INP_HASH_LOCK_DESTROY(ipi) mtx_destroy(&(ipi)->ipi_hash_lock) #define INP_HASH_RLOCK(ipi) struct epoch_tracker inp_hash_et; epoch_enter_preempt(net_epoch_preempt, &inp_hash_et) #define INP_HASH_RLOCK_ET(ipi, et) epoch_enter_preempt(net_epoch_preempt, &(et)) #define INP_HASH_WLOCK(ipi) mtx_lock(&(ipi)->ipi_hash_lock) #define INP_HASH_RUNLOCK(ipi) NET_EPOCH_EXIT(inp_hash_et) #define INP_HASH_RUNLOCK_ET(ipi, et) NET_EPOCH_EXIT((et)) #define INP_HASH_WUNLOCK(ipi) mtx_unlock(&(ipi)->ipi_hash_lock) #define INP_HASH_LOCK_ASSERT(ipi) MPASS(in_epoch(net_epoch_preempt) || mtx_owned(&(ipi)->ipi_hash_lock)) #define INP_HASH_WLOCK_ASSERT(ipi) mtx_assert(&(ipi)->ipi_hash_lock, MA_OWNED); #define INP_GROUP_LOCK_INIT(ipg, d) 
mtx_init(&(ipg)->ipg_lock, (d), NULL, \ MTX_DEF | MTX_DUPOK) #define INP_GROUP_LOCK_DESTROY(ipg) mtx_destroy(&(ipg)->ipg_lock) #define INP_GROUP_LOCK(ipg) mtx_lock(&(ipg)->ipg_lock) #define INP_GROUP_LOCK_ASSERT(ipg) mtx_assert(&(ipg)->ipg_lock, MA_OWNED) #define INP_GROUP_UNLOCK(ipg) mtx_unlock(&(ipg)->ipg_lock) #define INP_PCBHASH(faddr, lport, fport, mask) \ (((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask)) #define INP_PCBPORTHASH(lport, mask) \ (ntohs((lport)) & (mask)) #define INP_PCBLBGROUP_PKTHASH(faddr, lport, fport) \ ((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) #define INP6_PCBHASHKEY(faddr) ((faddr)->s6_addr32[3]) /* * Flags for inp_vflags -- historically version flags only */ #define INP_IPV4 0x1 #define INP_IPV6 0x2 #define INP_IPV6PROTO 0x4 /* opened under IPv6 protocol */ /* * Flags for inp_flags. */ #define INP_RECVOPTS 0x00000001 /* receive incoming IP options */ #define INP_RECVRETOPTS 0x00000002 /* receive IP options for reply */ #define INP_RECVDSTADDR 0x00000004 /* receive IP dst address */ #define INP_HDRINCL 0x00000008 /* user supplies entire IP header */ #define INP_HIGHPORT 0x00000010 /* user wants "high" port binding */ #define INP_LOWPORT 0x00000020 /* user wants "low" port binding */ #define INP_ANONPORT 0x00000040 /* port chosen for user */ #define INP_RECVIF 0x00000080 /* receive incoming interface */ #define INP_MTUDISC 0x00000100 /* user can do MTU discovery */ /* 0x000200 unused: was INP_FAITH */ #define INP_RECVTTL 0x00000400 /* receive incoming IP TTL */ #define INP_DONTFRAG 0x00000800 /* don't fragment packet */ #define INP_BINDANY 0x00001000 /* allow bind to any address */ #define INP_INHASHLIST 0x00002000 /* in_pcbinshash() has been called */ #define INP_RECVTOS 0x00004000 /* receive incoming IP TOS */ #define IN6P_IPV6_V6ONLY 0x00008000 /* restrict AF_INET6 socket for v6 */ #define IN6P_PKTINFO 0x00010000 /* receive IP6 dst and I/F */ #define IN6P_HOPLIMIT 0x00020000 /* receive hoplimit */ #define IN6P_HOPOPTS 0x00040000 /* receive hop-by-hop options */ #define IN6P_DSTOPTS 0x00080000 /* receive dst options after rthdr */ #define IN6P_RTHDR 0x00100000 /* receive routing header */ #define IN6P_RTHDRDSTOPTS 0x00200000 /* receive dstoptions before rthdr */ #define IN6P_TCLASS 0x00400000 /* receive traffic class value */ #define IN6P_AUTOFLOWLABEL 0x00800000 /* attach flowlabel automatically */ #define INP_TIMEWAIT 0x01000000 /* in TIMEWAIT, ppcb is tcptw */ #define INP_ONESBCAST 0x02000000 /* send all-ones broadcast */ #define INP_DROPPED 0x04000000 /* protocol drop flag */ #define INP_SOCKREF 0x08000000 /* strong socket reference */ #define INP_RESERVED_0 0x10000000 /* reserved field */ #define INP_RESERVED_1 0x20000000 /* reserved field */ #define IN6P_RFC2292 0x40000000 /* used RFC2292 API on the socket */ #define IN6P_MTU 0x80000000 /* receive path MTU */ #define INP_CONTROLOPTS (INP_RECVOPTS|INP_RECVRETOPTS|INP_RECVDSTADDR|\ INP_RECVIF|INP_RECVTTL|INP_RECVTOS|\ IN6P_PKTINFO|IN6P_HOPLIMIT|IN6P_HOPOPTS|\ IN6P_DSTOPTS|IN6P_RTHDR|IN6P_RTHDRDSTOPTS|\ IN6P_TCLASS|IN6P_AUTOFLOWLABEL|IN6P_RFC2292|\ IN6P_MTU) /* * Flags for inp_flags2. 
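/*
 * Worked example of the hash macros above, runnable in user space: feed a
 * sample connection through the INP_PCBHASH expression and print the
 * resulting bucket indices.  The addresses, ports, and the mask value
 * (511, i.e. 512 buckets) are made-up example inputs.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t faddr = inet_addr("203.0.113.9");	/* network order */
	uint16_t lport = htons(443), fport = htons(55555);
	uint32_t mask = 511;
	uint32_t bucket;

	/* Same expression as INP_PCBHASH(faddr, lport, fport, mask). */
	bucket = (faddr ^ (faddr >> 16) ^ ntohs(lport ^ fport)) & mask;
	printf("4-tuple bucket: %u\n", (unsigned)bucket);

	/* INP_PCBPORTHASH(lport, mask): local port only. */
	printf("port bucket: %u\n", (unsigned)(ntohs(lport) & mask));
	return (0);
}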
*/ #define INP_2UNUSED1 0x00000001 #define INP_2UNUSED2 0x00000002 #define INP_PCBGROUPWILD 0x00000004 /* in pcbgroup wildcard list */ #define INP_REUSEPORT 0x00000008 /* SO_REUSEPORT option is set */ #define INP_FREED 0x00000010 /* inp itself is not valid */ #define INP_REUSEADDR 0x00000020 /* SO_REUSEADDR option is set */ #define INP_BINDMULTI 0x00000040 /* IP_BINDMULTI option is set */ #define INP_RSS_BUCKET_SET 0x00000080 /* IP_RSS_LISTEN_BUCKET is set */ #define INP_RECVFLOWID 0x00000100 /* populate recv datagram with flow info */ #define INP_RECVRSSBUCKETID 0x00000200 /* populate recv datagram with bucket id */ #define INP_RATE_LIMIT_CHANGED 0x00000400 /* rate limit needs attention */ #define INP_ORIGDSTADDR 0x00000800 /* receive IP dst address/port */ #define INP_CANNOT_DO_ECN 0x00001000 /* The stack does not do ECN */ #define INP_REUSEPORT_LB 0x00002000 /* SO_REUSEPORT_LB option is set */ /* * Flags passed to in_pcblookup*() functions. */ #define INPLOOKUP_WILDCARD 0x00000001 /* Allow wildcard sockets. */ #define INPLOOKUP_RLOCKPCB 0x00000002 /* Return inpcb read-locked. */ #define INPLOOKUP_WLOCKPCB 0x00000004 /* Return inpcb write-locked. */ #define INPLOOKUP_MASK (INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \ INPLOOKUP_WLOCKPCB) #define sotoinpcb(so) ((struct inpcb *)(so)->so_pcb) #define sotoin6pcb(so) sotoinpcb(so) /* for KAME src sync over BSD*'s */ #define INP_SOCKAF(so) so->so_proto->pr_domain->dom_family #define INP_CHECK_SOCKAF(so, af) (INP_SOCKAF(so) == af) /* * Constants for pcbinfo.ipi_hashfields. */ #define IPI_HASHFIELDS_NONE 0 #define IPI_HASHFIELDS_2TUPLE 1 #define IPI_HASHFIELDS_4TUPLE 2 #ifdef _KERNEL VNET_DECLARE(int, ipport_reservedhigh); VNET_DECLARE(int, ipport_reservedlow); VNET_DECLARE(int, ipport_lowfirstauto); VNET_DECLARE(int, ipport_lowlastauto); VNET_DECLARE(int, ipport_firstauto); VNET_DECLARE(int, ipport_lastauto); VNET_DECLARE(int, ipport_hifirstauto); VNET_DECLARE(int, ipport_hilastauto); VNET_DECLARE(int, ipport_randomized); VNET_DECLARE(int, ipport_randomcps); VNET_DECLARE(int, ipport_randomtime); VNET_DECLARE(int, ipport_stoprandom); VNET_DECLARE(int, ipport_tcpallocs); #define V_ipport_reservedhigh VNET(ipport_reservedhigh) #define V_ipport_reservedlow VNET(ipport_reservedlow) #define V_ipport_lowfirstauto VNET(ipport_lowfirstauto) #define V_ipport_lowlastauto VNET(ipport_lowlastauto) #define V_ipport_firstauto VNET(ipport_firstauto) #define V_ipport_lastauto VNET(ipport_lastauto) #define V_ipport_hifirstauto VNET(ipport_hifirstauto) #define V_ipport_hilastauto VNET(ipport_hilastauto) #define V_ipport_randomized VNET(ipport_randomized) #define V_ipport_randomcps VNET(ipport_randomcps) #define V_ipport_randomtime VNET(ipport_randomtime) #define V_ipport_stoprandom VNET(ipport_stoprandom) #define V_ipport_tcpallocs VNET(ipport_tcpallocs) void in_pcbinfo_destroy(struct inpcbinfo *); void in_pcbinfo_init(struct inpcbinfo *, const char *, struct inpcbhead *, int, int, char *, uma_init, u_int); int in_pcbbind_check_bindmulti(const struct inpcb *ni, const struct inpcb *oi); struct inpcbgroup * in_pcbgroup_byhash(struct inpcbinfo *, u_int, uint32_t); struct inpcbgroup * in_pcbgroup_byinpcb(struct inpcb *); struct inpcbgroup * in_pcbgroup_bytuple(struct inpcbinfo *, struct in_addr, u_short, struct in_addr, u_short); void in_pcbgroup_destroy(struct inpcbinfo *); int in_pcbgroup_enabled(struct inpcbinfo *); void in_pcbgroup_init(struct inpcbinfo *, u_int, int); void in_pcbgroup_remove(struct inpcb *); void in_pcbgroup_update(struct inpcb *); void 
in_pcbgroup_update_mbuf(struct inpcb *, struct mbuf *); void in_pcbpurgeif0(struct inpcbinfo *, struct ifnet *); int in_pcballoc(struct socket *, struct inpcbinfo *); int in_pcbbind(struct inpcb *, struct sockaddr *, struct ucred *); int in_pcb_lport(struct inpcb *, struct in_addr *, u_short *, struct ucred *, int); int in_pcbbind_setup(struct inpcb *, struct sockaddr *, in_addr_t *, u_short *, struct ucred *); int in_pcbconnect(struct inpcb *, struct sockaddr *, struct ucred *); int in_pcbconnect_mbuf(struct inpcb *, struct sockaddr *, struct ucred *, struct mbuf *); int in_pcbconnect_setup(struct inpcb *, struct sockaddr *, in_addr_t *, u_short *, in_addr_t *, u_short *, struct inpcb **, struct ucred *); void in_pcbdetach(struct inpcb *); void in_pcbdisconnect(struct inpcb *); void in_pcbdrop(struct inpcb *); void in_pcbfree(struct inpcb *); int in_pcbinshash(struct inpcb *); int in_pcbinshash_nopcbgroup(struct inpcb *); int in_pcbladdr(struct inpcb *, struct in_addr *, struct in_addr *, struct ucred *); struct inpcb * in_pcblookup_local(struct inpcbinfo *, struct in_addr, u_short, int, struct ucred *); struct inpcb * in_pcblookup(struct inpcbinfo *, struct in_addr, u_int, struct in_addr, u_int, int, struct ifnet *); struct inpcb * in_pcblookup_mbuf(struct inpcbinfo *, struct in_addr, u_int, struct in_addr, u_int, int, struct ifnet *, struct mbuf *); void in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr, int, struct inpcb *(*)(struct inpcb *, int)); void in_pcbref(struct inpcb *); void in_pcbrehash(struct inpcb *); void in_pcbrehash_mbuf(struct inpcb *, struct mbuf *); int in_pcbrele(struct inpcb *); int in_pcbrele_rlocked(struct inpcb *); int in_pcbrele_wlocked(struct inpcb *); void in_pcblist_rele_rlocked(epoch_context_t ctx); void in_losing(struct inpcb *); void in_pcbsetsolabel(struct socket *so); int in_getpeeraddr(struct socket *so, struct sockaddr **nam); int in_getsockaddr(struct socket *so, struct sockaddr **nam); struct sockaddr * in_sockaddr(in_port_t port, struct in_addr *addr); void in_pcbsosetlabel(struct socket *so); #ifdef RATELIMIT int in_pcbattach_txrtlmt(struct inpcb *, struct ifnet *, uint32_t, uint32_t, uint32_t); void in_pcbdetach_txrtlmt(struct inpcb *); int in_pcbmodify_txrtlmt(struct inpcb *, uint32_t); int in_pcbquery_txrtlmt(struct inpcb *, uint32_t *); int in_pcbquery_txrlevel(struct inpcb *, uint32_t *); void in_pcboutput_txrtlmt(struct inpcb *, struct ifnet *, struct mbuf *); void in_pcboutput_eagain(struct inpcb *); #endif #endif /* _KERNEL */ #endif /* !_NETINET_IN_PCB_H_ */ Index: user/ngie/bug-237403/sys/netinet/ip_gre.c =================================================================== --- user/ngie/bug-237403/sys/netinet/ip_gre.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet/ip_gre.c (revision 346926) @@ -1,358 +1,595 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-NetBSD * * Copyright (c) 1998 The NetBSD Foundation, Inc. * Copyright (c) 2014, 2018 Andrey V. Elsukov * All rights reserved. * * This code is derived from software contributed to The NetBSD Foundation * by Heiko W.Rupp * * IPv6-over-GRE contributed by Gert Doering * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * $NetBSD: ip_gre.c,v 1.29 2003/09/05 23:02:43 itojun Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include +#include +#include #ifdef INET6 #include #endif #include +#include #define GRE_TTL 30 VNET_DEFINE(int, ip_gre_ttl) = GRE_TTL; #define V_ip_gre_ttl VNET(ip_gre_ttl) SYSCTL_INT(_net_inet_ip, OID_AUTO, grettl, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(ip_gre_ttl), 0, "Default TTL value for encapsulated packets"); +struct in_gre_socket { + struct gre_socket base; + in_addr_t addr; +}; +VNET_DEFINE_STATIC(struct gre_sockets *, ipv4_sockets) = NULL; VNET_DEFINE_STATIC(struct gre_list *, ipv4_hashtbl) = NULL; VNET_DEFINE_STATIC(struct gre_list *, ipv4_srchashtbl) = NULL; +#define V_ipv4_sockets VNET(ipv4_sockets) #define V_ipv4_hashtbl VNET(ipv4_hashtbl) #define V_ipv4_srchashtbl VNET(ipv4_srchashtbl) #define GRE_HASH(src, dst) (V_ipv4_hashtbl[\ in_gre_hashval((src), (dst)) & (GRE_HASH_SIZE - 1)]) #define GRE_SRCHASH(src) (V_ipv4_srchashtbl[\ fnv_32_buf(&(src), sizeof(src), FNV1_32_INIT) & (GRE_HASH_SIZE - 1)]) +#define GRE_SOCKHASH(src) (V_ipv4_sockets[\ + fnv_32_buf(&(src), sizeof(src), FNV1_32_INIT) & (GRE_HASH_SIZE - 1)]) #define GRE_HASH_SC(sc) GRE_HASH((sc)->gre_oip.ip_src.s_addr,\ (sc)->gre_oip.ip_dst.s_addr) static uint32_t in_gre_hashval(in_addr_t src, in_addr_t dst) { uint32_t ret; ret = fnv_32_buf(&src, sizeof(src), FNV1_32_INIT); return (fnv_32_buf(&dst, sizeof(dst), ret)); } +static struct gre_socket* +in_gre_lookup_socket(in_addr_t addr) +{ + struct gre_socket *gs; + struct in_gre_socket *s; + + CK_LIST_FOREACH(gs, &GRE_SOCKHASH(addr), chain) { + s = __containerof(gs, struct in_gre_socket, base); + if (s->addr == addr) + break; + } + return (gs); +} + static int -in_gre_checkdup(const struct gre_softc *sc, in_addr_t src, in_addr_t dst) +in_gre_checkdup(const struct gre_softc *sc, in_addr_t src, in_addr_t dst, + uint32_t opts) { + struct gre_list *head; struct gre_softc *tmp; + struct gre_socket *gs; if (sc->gre_family == AF_INET && sc->gre_oip.ip_src.s_addr == src && - sc->gre_oip.ip_dst.s_addr == dst) + sc->gre_oip.ip_dst.s_addr == dst && + (sc->gre_options & GRE_UDPENCAP) == (opts & GRE_UDPENCAP)) return (EEXIST); - CK_LIST_FOREACH(tmp, &GRE_HASH(src, dst), chain) { + if (opts & GRE_UDPENCAP) { + gs = in_gre_lookup_socket(src); + if (gs == NULL) + return (0); + head = &gs->list; + } else 
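The GRE_HASH()/GRE_SRCHASH() macros above reduce the outer tunnel addresses to a bucket index with chained 32-bit FNV-1. A standalone sketch of that computation follows; the mini fnv_32_buf() mirrors the inline in sys/sys/fnv_hash.h, and GRE_HASH_SIZE here is an assumed power-of-two table size chosen only for illustration:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <arpa/inet.h>

#define FNV1_32_INIT	0x811c9dc5u
#define FNV_32_PRIME	0x01000193u
#define GRE_HASH_SIZE	32		/* assumed table size for the sketch */

static uint32_t
fnv_32_buf(const void *buf, size_t len, uint32_t hval)
{
	const unsigned char *p = buf;

	while (len-- != 0) {
		hval *= FNV_32_PRIME;	/* FNV-1: multiply, then xor */
		hval ^= *p++;
	}
	return (hval);
}

int
main(void)
{
	uint32_t src = inet_addr("192.0.2.1");		/* outer ip_src */
	uint32_t dst = inet_addr("198.51.100.2");	/* outer ip_dst */
	uint32_t h;

	h = fnv_32_buf(&src, sizeof(src), FNV1_32_INIT);
	h = fnv_32_buf(&dst, sizeof(dst), h);
	printf("bucket = %u\n", h & (GRE_HASH_SIZE - 1));
	return (0);
}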
+ head = &GRE_HASH(src, dst); + + CK_LIST_FOREACH(tmp, head, chain) { if (tmp == sc) continue; if (tmp->gre_oip.ip_src.s_addr == src && tmp->gre_oip.ip_dst.s_addr == dst) return (EADDRNOTAVAIL); } return (0); } static int in_gre_lookup(const struct mbuf *m, int off, int proto, void **arg) { const struct ip *ip; struct gre_softc *sc; if (V_ipv4_hashtbl == NULL) return (0); MPASS(in_epoch(net_epoch_preempt)); ip = mtod(m, const struct ip *); CK_LIST_FOREACH(sc, &GRE_HASH(ip->ip_dst.s_addr, ip->ip_src.s_addr), chain) { /* * This is an inbound packet, its ip_dst is source address * in softc. */ if (sc->gre_oip.ip_src.s_addr == ip->ip_dst.s_addr && sc->gre_oip.ip_dst.s_addr == ip->ip_src.s_addr) { if ((GRE2IFP(sc)->if_flags & IFF_UP) == 0) return (0); *arg = sc; return (ENCAP_DRV_LOOKUP); } } return (0); } /* * Check that ingress address belongs to local host. */ static void in_gre_set_running(struct gre_softc *sc) { if (in_localip(sc->gre_oip.ip_src)) GRE2IFP(sc)->if_drv_flags |= IFF_DRV_RUNNING; else GRE2IFP(sc)->if_drv_flags &= ~IFF_DRV_RUNNING; } /* * ifaddr_event handler. * Clear IFF_DRV_RUNNING flag when ingress address disappears to prevent * source address spoofing. */ static void in_gre_srcaddr(void *arg __unused, const struct sockaddr *sa, int event __unused) { const struct sockaddr_in *sin; struct gre_softc *sc; /* Check that VNET is ready */ if (V_ipv4_hashtbl == NULL) return; MPASS(in_epoch(net_epoch_preempt)); sin = (const struct sockaddr_in *)sa; CK_LIST_FOREACH(sc, &GRE_SRCHASH(sin->sin_addr.s_addr), srchash) { if (sc->gre_oip.ip_src.s_addr != sin->sin_addr.s_addr) continue; in_gre_set_running(sc); } } static void +in_gre_udp_input(struct mbuf *m, int off, struct inpcb *inp, + const struct sockaddr *sa, void *ctx) +{ + struct epoch_tracker et; + struct gre_socket *gs; + struct gre_softc *sc; + in_addr_t dst; + + NET_EPOCH_ENTER(et); + /* + * udp_append() holds reference to inp, it is safe to check + * inp_flags2 without INP_RLOCK(). + * If socket was closed before we have entered NET_EPOCH section, + * INP_FREED flag should be set. Otherwise it should be safe to + * make access to ctx data, because gre_so will be freed by + * gre_sofree() via epoch_call(). + */ + if (__predict_false(inp->inp_flags2 & INP_FREED)) { + NET_EPOCH_EXIT(et); + m_freem(m); + return; + } + + gs = (struct gre_socket *)ctx; + dst = ((const struct sockaddr_in *)sa)->sin_addr.s_addr; + CK_LIST_FOREACH(sc, &gs->list, chain) { + if (sc->gre_oip.ip_dst.s_addr == dst) + break; + } + if (sc != NULL && (GRE2IFP(sc)->if_flags & IFF_UP) != 0){ + gre_input(m, off + sizeof(struct udphdr), IPPROTO_UDP, sc); + NET_EPOCH_EXIT(et); + return; + } + m_freem(m); + NET_EPOCH_EXIT(et); +} + +static int +in_gre_setup_socket(struct gre_softc *sc) +{ + struct sockopt sopt; + struct sockaddr_in sin; + struct in_gre_socket *s; + struct gre_socket *gs; + in_addr_t addr; + int error, value; + + /* + * NOTE: we are protected with gre_ioctl_sx lock. + * + * First check that socket is already configured. + * If so, check that source addres was not changed. + * If address is different, check that there are no other tunnels + * and close socket. 
+ */ + addr = sc->gre_oip.ip_src.s_addr; + gs = sc->gre_so; + if (gs != NULL) { + s = __containerof(gs, struct in_gre_socket, base); + if (s->addr != addr) { + if (CK_LIST_EMPTY(&gs->list)) { + CK_LIST_REMOVE(gs, chain); + soclose(gs->so); + epoch_call(net_epoch_preempt, &gs->epoch_ctx, + gre_sofree); + } + gs = sc->gre_so = NULL; + } + } + + if (gs == NULL) { + /* + * Check that socket for given address is already + * configured. + */ + gs = in_gre_lookup_socket(addr); + if (gs == NULL) { + s = malloc(sizeof(*s), M_GRE, M_WAITOK | M_ZERO); + s->addr = addr; + gs = &s->base; + + error = socreate(sc->gre_family, &gs->so, + SOCK_DGRAM, IPPROTO_UDP, curthread->td_ucred, + curthread); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot create socket: %d\n", error); + free(s, M_GRE); + return (error); + } + + error = udp_set_kernel_tunneling(gs->so, + in_gre_udp_input, NULL, gs); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot set UDP tunneling: %d\n", error); + goto fail; + } + + memset(&sopt, 0, sizeof(sopt)); + sopt.sopt_dir = SOPT_SET; + sopt.sopt_level = IPPROTO_IP; + sopt.sopt_name = IP_BINDANY; + sopt.sopt_val = &value; + sopt.sopt_valsize = sizeof(value); + value = 1; + error = sosetopt(gs->so, &sopt); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot set IP_BINDANY opt: %d\n", error); + goto fail; + } + + memset(&sin, 0, sizeof(sin)); + sin.sin_family = AF_INET; + sin.sin_len = sizeof(sin); + sin.sin_addr.s_addr = addr; + sin.sin_port = htons(GRE_UDPPORT); + error = sobind(gs->so, (struct sockaddr *)&sin, + curthread); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot bind socket: %d\n", error); + goto fail; + } + /* Add socket to the chain */ + CK_LIST_INSERT_HEAD(&GRE_SOCKHASH(addr), gs, chain); + } + } + + /* Add softc to the socket's list */ + CK_LIST_INSERT_HEAD(&gs->list, sc, chain); + sc->gre_so = gs; + return (0); +fail: + soclose(gs->so); + free(s, M_GRE); + return (error); +} + +static int in_gre_attach(struct gre_softc *sc) { + struct grehdr *gh; + int error; - sc->gre_hlen = sizeof(struct greip); + if (sc->gre_options & GRE_UDPENCAP) { + sc->gre_csumflags = CSUM_UDP; + sc->gre_hlen = sizeof(struct greudp); + sc->gre_oip.ip_p = IPPROTO_UDP; + gh = &sc->gre_udphdr->gi_gre; + gre_update_udphdr(sc, &sc->gre_udp, + in_pseudo(sc->gre_oip.ip_src.s_addr, + sc->gre_oip.ip_dst.s_addr, 0)); + } else { + sc->gre_hlen = sizeof(struct greip); + sc->gre_oip.ip_p = IPPROTO_GRE; + gh = &sc->gre_iphdr->gi_gre; + } sc->gre_oip.ip_v = IPVERSION; sc->gre_oip.ip_hl = sizeof(struct ip) >> 2; - sc->gre_oip.ip_p = IPPROTO_GRE; - gre_updatehdr(sc, &sc->gre_gihdr->gi_gre); - CK_LIST_INSERT_HEAD(&GRE_HASH_SC(sc), sc, chain); + gre_update_hdr(sc, gh); + + /* + * If we return error, this means that sc is not linked, + * and caller should reset gre_family and free(sc->gre_hdr). 
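To keep the two encapsulations straight, a hedged sketch of the outer framings selected by GRE_UDPENCAP; the struct below is only a guessed mirror of struct greudp (the real layout lives in net/if_gre.h), and 4754 is the IANA GRE-in-UDP port from RFC 8086 that GRE_UDPPORT is assumed to carry:

/*
 * plain GRE  (struct greip) :  [ outer IP, ip_p = IPPROTO_GRE | GRE | payload ]
 * GRE-in-UDP (struct greudp):  [ outer IP, ip_p = IPPROTO_UDP |
 *                                UDP, dport = GRE_UDPPORT | GRE | payload ]
 *
 * gre_update_udphdr() is handed in_pseudo(src, dst, 0), so the UDP
 * pseudo-header checksum is precomputed once; per-packet transmit can then
 * rely on CSUM_UDP delayed or offloaded checksumming.
 */
struct greudp_sketch {			/* hypothetical mirror of struct greudp */
	struct ip	gi_ip;		/* outer IPv4 header */
	struct udphdr	gi_udp;		/* dport = htons(GRE_UDPPORT) */
	struct grehdr	gi_gre;		/* GRE header proper */
} __packed;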
+ */ + if (sc->gre_options & GRE_UDPENCAP) { + error = in_gre_setup_socket(sc); + if (error != 0) + return (error); + } else + CK_LIST_INSERT_HEAD(&GRE_HASH_SC(sc), sc, chain); CK_LIST_INSERT_HEAD(&GRE_SRCHASH(sc->gre_oip.ip_src.s_addr), sc, srchash); + + /* Set IFF_DRV_RUNNING if interface is ready */ + in_gre_set_running(sc); + return (0); } -void +int in_gre_setopts(struct gre_softc *sc, u_long cmd, uint32_t value) { + int error; - MPASS(cmd == GRESKEY || cmd == GRESOPTS); - /* NOTE: we are protected with gre_ioctl_sx lock */ + MPASS(cmd == GRESKEY || cmd == GRESOPTS || cmd == GRESPORT); MPASS(sc->gre_family == AF_INET); + + /* + * If we are going to change encapsulation protocol, do check + * for duplicate tunnels. Return EEXIST here to do not confuse + * user. + */ + if (cmd == GRESOPTS && + (sc->gre_options & GRE_UDPENCAP) != (value & GRE_UDPENCAP) && + in_gre_checkdup(sc, sc->gre_oip.ip_src.s_addr, + sc->gre_oip.ip_dst.s_addr, value) == EADDRNOTAVAIL) + return (EEXIST); + CK_LIST_REMOVE(sc, chain); CK_LIST_REMOVE(sc, srchash); GRE_WAIT(); - if (cmd == GRESKEY) + switch (cmd) { + case GRESKEY: sc->gre_key = value; - else + break; + case GRESOPTS: sc->gre_options = value; - in_gre_attach(sc); + break; + case GRESPORT: + sc->gre_port = value; + break; + } + error = in_gre_attach(sc); + if (error != 0) { + sc->gre_family = 0; + free(sc->gre_hdr, M_GRE); + } + return (error); } int in_gre_ioctl(struct gre_softc *sc, u_long cmd, caddr_t data) { struct ifreq *ifr = (struct ifreq *)data; struct sockaddr_in *dst, *src; struct ip *ip; int error; /* NOTE: we are protected with gre_ioctl_sx lock */ error = EINVAL; switch (cmd) { case SIOCSIFPHYADDR: src = &((struct in_aliasreq *)data)->ifra_addr; dst = &((struct in_aliasreq *)data)->ifra_dstaddr; /* sanity checks */ if (src->sin_family != dst->sin_family || src->sin_family != AF_INET || src->sin_len != dst->sin_len || src->sin_len != sizeof(*src)) break; if (src->sin_addr.s_addr == INADDR_ANY || dst->sin_addr.s_addr == INADDR_ANY) { error = EADDRNOTAVAIL; break; } if (V_ipv4_hashtbl == NULL) { V_ipv4_hashtbl = gre_hashinit(); V_ipv4_srchashtbl = gre_hashinit(); + V_ipv4_sockets = (struct gre_sockets *)gre_hashinit(); } error = in_gre_checkdup(sc, src->sin_addr.s_addr, - dst->sin_addr.s_addr); + dst->sin_addr.s_addr, sc->gre_options); if (error == EADDRNOTAVAIL) break; if (error == EEXIST) { /* Addresses are the same. Just return. */ error = 0; break; } - ip = malloc(sizeof(struct greip) + 3 * sizeof(uint32_t), + ip = malloc(sizeof(struct greudp) + 3 * sizeof(uint32_t), M_GRE, M_WAITOK | M_ZERO); ip->ip_src.s_addr = src->sin_addr.s_addr; ip->ip_dst.s_addr = dst->sin_addr.s_addr; if (sc->gre_family != 0) { /* Detach existing tunnel first */ CK_LIST_REMOVE(sc, chain); CK_LIST_REMOVE(sc, srchash); GRE_WAIT(); free(sc->gre_hdr, M_GRE); /* XXX: should we notify about link state change? */ } sc->gre_family = AF_INET; sc->gre_hdr = ip; sc->gre_oseq = 0; sc->gre_iseq = UINT32_MAX; - in_gre_attach(sc); - in_gre_set_running(sc); + error = in_gre_attach(sc); + if (error != 0) { + sc->gre_family = 0; + free(sc->gre_hdr, M_GRE); + } break; case SIOCGIFPSRCADDR: case SIOCGIFPDSTADDR: if (sc->gre_family != AF_INET) { error = EADDRNOTAVAIL; break; } src = (struct sockaddr_in *)&ifr->ifr_addr; memset(src, 0, sizeof(*src)); src->sin_family = AF_INET; src->sin_len = sizeof(*src); src->sin_addr = (cmd == SIOCGIFPSRCADDR) ? 
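A hypothetical userland sketch of driving the new GRESPORT/GRESOPTS knobs, assuming they follow the same ifr_data-points-at-a-uint32_t convention the existing GRESKEY ioctl uses and that net/if_gre.h exposes the symbols to userland; names and values are illustrative only:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_gre.h>
#include <err.h>
#include <stdint.h>
#include <string.h>

static void
gre_set_u32(int s, const char *ifname, unsigned long cmd, uint32_t value)
{
	struct ifreq ifr;

	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
	ifr.ifr_data = (caddr_t)&value;	/* assumed GRESKEY-style convention */
	if (ioctl(s, cmd, &ifr) == -1)
		err(1, "ioctl(%#lx)", cmd);
}

/*
 * Usage sketch (s is an AF_INET datagram socket used only for the ioctls):
 *	gre_set_u32(s, "gre0", GRESOPTS, GRE_UDPENCAP);
 *	gre_set_u32(s, "gre0", GRESPORT, 58000);   -- value lands in sc->gre_port
 */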
sc->gre_oip.ip_src: sc->gre_oip.ip_dst; error = prison_if(curthread->td_ucred, (struct sockaddr *)src); if (error != 0) memset(src, 0, sizeof(*src)); break; } return (error); } int in_gre_output(struct mbuf *m, int af, int hlen) { struct greip *gi; gi = mtod(m, struct greip *); switch (af) { case AF_INET: /* * gre_transmit() has used M_PREPEND() that doesn't guarantee * m_data is contiguous more than hlen bytes. Use m_copydata() * here to avoid m_pullup(). */ m_copydata(m, hlen + offsetof(struct ip, ip_tos), sizeof(u_char), &gi->gi_ip.ip_tos); m_copydata(m, hlen + offsetof(struct ip, ip_id), sizeof(u_short), (caddr_t)&gi->gi_ip.ip_id); break; #ifdef INET6 case AF_INET6: gi->gi_ip.ip_tos = 0; /* XXX */ ip_fillid(&gi->gi_ip); break; #endif } gi->gi_ip.ip_ttl = V_ip_gre_ttl; gi->gi_ip.ip_len = htons(m->m_pkthdr.len); return (ip_output(m, NULL, NULL, IP_FORWARDING, NULL, NULL)); } static const struct srcaddrtab *ipv4_srcaddrtab = NULL; static const struct encaptab *ecookie = NULL; static const struct encap_config ipv4_encap_cfg = { .proto = IPPROTO_GRE, .min_length = sizeof(struct greip) + sizeof(struct ip), .exact_match = ENCAP_DRV_LOOKUP, .lookup = in_gre_lookup, .input = gre_input }; void in_gre_init(void) { if (!IS_DEFAULT_VNET(curvnet)) return; ipv4_srcaddrtab = ip_encap_register_srcaddr(in_gre_srcaddr, NULL, M_WAITOK); ecookie = ip_encap_attach(&ipv4_encap_cfg, NULL, M_WAITOK); } void in_gre_uninit(void) { if (IS_DEFAULT_VNET(curvnet)) { ip_encap_detach(ecookie); ip_encap_unregister_srcaddr(ipv4_srcaddrtab); } if (V_ipv4_hashtbl != NULL) { gre_hashdestroy(V_ipv4_hashtbl); V_ipv4_hashtbl = NULL; GRE_WAIT(); gre_hashdestroy(V_ipv4_srchashtbl); + gre_hashdestroy((struct gre_list *)V_ipv4_sockets); } } Index: user/ngie/bug-237403/sys/netinet/ip_output.c =================================================================== --- user/ngie/bug-237403/sys/netinet/ip_output.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet/ip_output.c (revision 346926) @@ -1,1469 +1,1472 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1988, 1990, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)ip_output.c 8.3 (Berkeley) 1/21/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_ratelimit.h" #include "opt_ipsec.h" #include "opt_mbuf_stress_test.h" #include "opt_mpath.h" #include "opt_route.h" #include "opt_sctp.h" #include "opt_rss.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef RADIX_MPATH #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef SCTP #include #include #endif #include #include #include #ifdef MBUF_STRESS_TEST static int mbuf_frag_size = 0; SYSCTL_INT(_net_inet_ip, OID_AUTO, mbuf_frag_size, CTLFLAG_RW, &mbuf_frag_size, 0, "Fragment outgoing mbufs to this size"); #endif static void ip_mloopback(struct ifnet *, const struct mbuf *, int); extern int in_mcast_loop; extern struct protosw inetsw[]; static inline int ip_output_pfil(struct mbuf **mp, struct ifnet *ifp, struct inpcb *inp, struct sockaddr_in *dst, int *fibnum, int *error) { struct m_tag *fwd_tag = NULL; struct mbuf *m; struct in_addr odst; struct ip *ip; m = *mp; ip = mtod(m, struct ip *); /* Run through list of hooks for output packets. */ odst.s_addr = ip->ip_dst.s_addr; switch (pfil_run_hooks(V_inet_pfil_head, mp, ifp, PFIL_OUT, inp)) { case PFIL_DROPPED: *error = EPERM; /* FALLTHROUGH */ case PFIL_CONSUMED: return 1; /* Finished */ case PFIL_PASS: *error = 0; } m = *mp; ip = mtod(m, struct ip *); /* See if destination IP address was changed by packet filter. */ if (odst.s_addr != ip->ip_dst.s_addr) { m->m_flags |= M_SKIP_FIREWALL; /* If destination is now ourself drop to ip_input(). */ if (in_localip(ip->ip_dst)) { m->m_flags |= M_FASTFWD_OURS; if (m->m_pkthdr.rcvif == NULL) m->m_pkthdr.rcvif = V_loif; if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA) { m->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED | CSUM_IP_VALID; #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP) m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID; #endif *error = netisr_queue(NETISR_IP, m); return 1; /* Finished */ } bzero(dst, sizeof(*dst)); dst->sin_family = AF_INET; dst->sin_len = sizeof(*dst); dst->sin_addr = ip->ip_dst; return -1; /* Reloop */ } /* See if fib was changed by packet filter. */ if ((*fibnum) != M_GETFIB(m)) { m->m_flags |= M_SKIP_FIREWALL; *fibnum = M_GETFIB(m); return -1; /* Reloop for FIB change */ } /* See if local, if yes, send it to netisr with IP_FASTFWD_OURS. 
*/ if (m->m_flags & M_FASTFWD_OURS) { if (m->m_pkthdr.rcvif == NULL) m->m_pkthdr.rcvif = V_loif; if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA) { m->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP) m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID; #endif m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED | CSUM_IP_VALID; *error = netisr_queue(NETISR_IP, m); return 1; /* Finished */ } /* Or forward to some other address? */ if ((m->m_flags & M_IP_NEXTHOP) && ((fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL)) != NULL)) { bcopy((fwd_tag+1), dst, sizeof(struct sockaddr_in)); m->m_flags |= M_SKIP_FIREWALL; m->m_flags &= ~M_IP_NEXTHOP; m_tag_delete(m, fwd_tag); return -1; /* Reloop for CHANGE of dst */ } return 0; } /* * IP output. The packet in mbuf chain m contains a skeletal IP * header (with len, off, ttl, proto, tos, src, dst). * The mbuf chain containing the packet will be freed. * The mbuf opt, if present, will not be freed. * If route ro is present and has ro_rt initialized, route lookup would be * skipped and ro->ro_rt would be used. If ro is present but ro->ro_rt is NULL, * then result of route lookup is stored in ro->ro_rt. * * In the IP forwarding case, the packet will arrive with options already * inserted, so must have a NULL opt pointer. */ int ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags, struct ip_moptions *imo, struct inpcb *inp) { struct rm_priotracker in_ifa_tracker; struct epoch_tracker et; struct ip *ip; struct ifnet *ifp = NULL; /* keep compiler happy */ struct mbuf *m0; int hlen = sizeof (struct ip); int mtu; int error = 0; struct sockaddr_in *dst; const struct sockaddr_in *gw; struct in_ifaddr *ia; int isbroadcast; uint16_t ip_len, ip_off; struct route iproute; struct rtentry *rte; /* cache for ro->ro_rt */ uint32_t fibnum; #if defined(IPSEC) || defined(IPSEC_SUPPORT) int no_route_but_check_spd = 0; #endif M_ASSERTPKTHDR(m); if (inp != NULL) { INP_LOCK_ASSERT(inp); M_SETFIB(m, inp->inp_inc.inc_fibnum); if ((flags & IP_NODEFAULTFLOWID) == 0) { m->m_pkthdr.flowid = inp->inp_flowid; M_HASHTYPE_SET(m, inp->inp_flowtype); } +#ifdef NUMA + m->m_pkthdr.numa_domain = inp->inp_numa_domain; +#endif } if (ro == NULL) { ro = &iproute; bzero(ro, sizeof (*ro)); } if (opt) { int len = 0; m = ip_insertoptions(m, opt, &len); if (len != 0) hlen = len; /* ip->ip_hl is updated above */ } ip = mtod(m, struct ip *); ip_len = ntohs(ip->ip_len); ip_off = ntohs(ip->ip_off); if ((flags & (IP_FORWARDING|IP_RAWOUTPUT)) == 0) { ip->ip_v = IPVERSION; ip->ip_hl = hlen >> 2; ip_fillid(ip); } else { /* Header already set, fetch hlen from there */ hlen = ip->ip_hl << 2; } if ((flags & IP_FORWARDING) == 0) IPSTAT_INC(ips_localout); /* * dst/gw handling: * * dst can be rewritten but always points to &ro->ro_dst. * gw is readonly but can point either to dst OR rt_gateway, * therefore we need restore gw if we're redoing lookup. */ gw = dst = (struct sockaddr_in *)&ro->ro_dst; fibnum = (inp != NULL) ? inp->inp_inc.inc_fibnum : M_GETFIB(m); rte = ro->ro_rt; if (rte == NULL) { bzero(dst, sizeof(*dst)); dst->sin_family = AF_INET; dst->sin_len = sizeof(*dst); dst->sin_addr = ip->ip_dst; } NET_EPOCH_ENTER(et); again: /* * Validate route against routing table additions; * a better/more specific route might have been added. */ if (inp) RT_VALIDATE(ro, &inp->inp_rt_cookie, fibnum); /* * If there is a cached route, * check that it is to the same destination * and is still up. If not, free it and try again. 
* The address family should also be checked in case of sharing the * cache with IPv6. * Also check whether routing cache needs invalidation. */ rte = ro->ro_rt; if (rte && ((rte->rt_flags & RTF_UP) == 0 || rte->rt_ifp == NULL || !RT_LINK_IS_UP(rte->rt_ifp) || dst->sin_family != AF_INET || dst->sin_addr.s_addr != ip->ip_dst.s_addr)) { RO_INVALIDATE_CACHE(ro); rte = NULL; } ia = NULL; /* * If routing to interface only, short circuit routing lookup. * The use of an all-ones broadcast address implies this; an * interface is specified by the broadcast address of an interface, * or the destination address of a ptp interface. */ if (flags & IP_SENDONES) { if ((ia = ifatoia(ifa_ifwithbroadaddr(sintosa(dst), M_GETFIB(m)))) == NULL && (ia = ifatoia(ifa_ifwithdstaddr(sintosa(dst), M_GETFIB(m)))) == NULL) { IPSTAT_INC(ips_noroute); error = ENETUNREACH; goto bad; } ip->ip_dst.s_addr = INADDR_BROADCAST; dst->sin_addr = ip->ip_dst; ifp = ia->ia_ifp; ip->ip_ttl = 1; isbroadcast = 1; } else if (flags & IP_ROUTETOIF) { if ((ia = ifatoia(ifa_ifwithdstaddr(sintosa(dst), M_GETFIB(m)))) == NULL && (ia = ifatoia(ifa_ifwithnet(sintosa(dst), 0, M_GETFIB(m)))) == NULL) { IPSTAT_INC(ips_noroute); error = ENETUNREACH; goto bad; } ifp = ia->ia_ifp; ip->ip_ttl = 1; isbroadcast = ifp->if_flags & IFF_BROADCAST ? in_ifaddr_broadcast(dst->sin_addr, ia) : 0; } else if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) && imo != NULL && imo->imo_multicast_ifp != NULL) { /* * Bypass the normal routing lookup for multicast * packets if the interface is specified. */ ifp = imo->imo_multicast_ifp; IFP_TO_IA(ifp, ia, &in_ifa_tracker); isbroadcast = 0; /* fool gcc */ } else { /* * We want to do any cloning requested by the link layer, * as this is probably required in all cases for correct * operation (as it is for ARP). */ if (rte == NULL) { #ifdef RADIX_MPATH rtalloc_mpath_fib(ro, ntohl(ip->ip_src.s_addr ^ ip->ip_dst.s_addr), fibnum); #else in_rtalloc_ign(ro, 0, fibnum); #endif rte = ro->ro_rt; } if (rte == NULL || (rte->rt_flags & RTF_UP) == 0 || rte->rt_ifp == NULL || !RT_LINK_IS_UP(rte->rt_ifp)) { #if defined(IPSEC) || defined(IPSEC_SUPPORT) /* * There is no route for this packet, but it is * possible that a matching SPD entry exists. */ no_route_but_check_spd = 1; mtu = 0; /* Silence GCC warning. */ goto sendit; #endif IPSTAT_INC(ips_noroute); error = EHOSTUNREACH; goto bad; } ia = ifatoia(rte->rt_ifa); ifp = rte->rt_ifp; counter_u64_add(rte->rt_pksent, 1); rt_update_ro_flags(ro); if (rte->rt_flags & RTF_GATEWAY) gw = (struct sockaddr_in *)rte->rt_gateway; if (rte->rt_flags & RTF_HOST) isbroadcast = (rte->rt_flags & RTF_BROADCAST); else if (ifp->if_flags & IFF_BROADCAST) isbroadcast = in_ifaddr_broadcast(gw->sin_addr, ia); else isbroadcast = 0; } /* * Calculate MTU. If we have a route that is up, use that, * otherwise use the interface's MTU. */ if (rte != NULL && (rte->rt_flags & (RTF_UP|RTF_HOST))) mtu = rte->rt_mtu; else mtu = ifp->if_mtu; /* Catch a possible divide by zero later. */ KASSERT(mtu > 0, ("%s: mtu %d <= 0, rte=%p (rt_flags=0x%08x) ifp=%p", __func__, mtu, rte, (rte != NULL) ? rte->rt_flags : 0, ifp)); if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr))) { m->m_flags |= M_MCAST; /* * IP destination address is multicast. Make sure "gw" * still points to the address in "ro". (It may have been * changed to point to a gateway address, above.) */ gw = dst; /* * See if the caller provided any multicast options */ if (imo != NULL) { ip->ip_ttl = imo->imo_multicast_ttl; if (imo->imo_multicast_vif != -1) ip->ip_src.s_addr = ip_mcast_src ? 
ip_mcast_src(imo->imo_multicast_vif) : INADDR_ANY; } else ip->ip_ttl = IP_DEFAULT_MULTICAST_TTL; /* * Confirm that the outgoing interface supports multicast. */ if ((imo == NULL) || (imo->imo_multicast_vif == -1)) { if ((ifp->if_flags & IFF_MULTICAST) == 0) { IPSTAT_INC(ips_noroute); error = ENETUNREACH; goto bad; } } /* * If source address not specified yet, use address * of outgoing interface. */ if (ip->ip_src.s_addr == INADDR_ANY) { /* Interface may have no addresses. */ if (ia != NULL) ip->ip_src = IA_SIN(ia)->sin_addr; } if ((imo == NULL && in_mcast_loop) || (imo && imo->imo_multicast_loop)) { /* * Loop back multicast datagram if not expressly * forbidden to do so, even if we are not a member * of the group; ip_input() will filter it later, * thus deferring a hash lookup and mutex acquisition * at the expense of a cheap copy using m_copym(). */ ip_mloopback(ifp, m, hlen); } else { /* * If we are acting as a multicast router, perform * multicast forwarding as if the packet had just * arrived on the interface to which we are about * to send. The multicast forwarding function * recursively calls this function, using the * IP_FORWARDING flag to prevent infinite recursion. * * Multicasts that are looped back by ip_mloopback(), * above, will be forwarded by the ip_input() routine, * if necessary. */ if (V_ip_mrouter && (flags & IP_FORWARDING) == 0) { /* * If rsvp daemon is not running, do not * set ip_moptions. This ensures that the packet * is multicast and not just sent down one link * as prescribed by rsvpd. */ if (!V_rsvp_on) imo = NULL; if (ip_mforward && ip_mforward(ip, ifp, m, imo) != 0) { m_freem(m); goto done; } } } /* * Multicasts with a time-to-live of zero may be looped- * back, above, but must not be transmitted on a network. * Also, multicasts addressed to the loopback interface * are not sent -- the above call to ip_mloopback() will * loop back a copy. ip_input() will drop the copy if * this host does not belong to the destination group on * the loopback interface. */ if (ip->ip_ttl == 0 || ifp->if_flags & IFF_LOOPBACK) { m_freem(m); goto done; } goto sendit; } /* * If the source address is not specified yet, use the address * of the outoing interface. */ if (ip->ip_src.s_addr == INADDR_ANY) { /* Interface may have no addresses. */ if (ia != NULL) { ip->ip_src = IA_SIN(ia)->sin_addr; } } /* * Look for broadcast address and * verify user is allowed to send * such a packet. */ if (isbroadcast) { if ((ifp->if_flags & IFF_BROADCAST) == 0) { error = EADDRNOTAVAIL; goto bad; } if ((flags & IP_ALLOWBROADCAST) == 0) { error = EACCES; goto bad; } /* don't allow broadcast messages to be fragmented */ if (ip_len > mtu) { error = EMSGSIZE; goto bad; } m->m_flags |= M_BCAST; } else { m->m_flags &= ~M_BCAST; } sendit: #if defined(IPSEC) || defined(IPSEC_SUPPORT) if (IPSEC_ENABLED(ipv4)) { if ((error = IPSEC_OUTPUT(ipv4, m, inp)) != 0) { if (error == EINPROGRESS) error = 0; goto done; } } /* * Check if there was a route for this packet; return error if not. */ if (no_route_but_check_spd) { IPSTAT_INC(ips_noroute); error = EHOSTUNREACH; goto bad; } /* Update variables that are affected by ipsec4_output(). */ ip = mtod(m, struct ip *); hlen = ip->ip_hl << 2; #endif /* IPSEC */ /* Jump over all PFIL processing if hooks are not active. 
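The ip_moptions fields consulted in the multicast branch above (imo_multicast_ifp, imo_multicast_ttl, imo_multicast_loop) are filled in from userland through setsockopt(2); a minimal self-contained sketch, with placeholder interface and group addresses:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <unistd.h>

int
main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	struct in_addr ifaddr;
	unsigned char ttl = 16;		/* overrides IP_DEFAULT_MULTICAST_TTL (1) */
	unsigned char loop = 0;		/* suppress the ip_mloopback() copy */

	if (s == -1)
		err(1, "socket");
	ifaddr.s_addr = inet_addr("192.0.2.10");	/* outgoing interface */
	if (setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, &ifaddr,
	    sizeof(ifaddr)) == -1 ||
	    setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl)) == -1 ||
	    setsockopt(s, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof(loop)) == -1)
		err(1, "setsockopt");
	/* sendto(s, ..., group:port) now uses the options set above. */
	close(s);
	return (0);
}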
*/ if (PFIL_HOOKED_OUT(V_inet_pfil_head)) { switch (ip_output_pfil(&m, ifp, inp, dst, &fibnum, &error)) { case 1: /* Finished */ goto done; case 0: /* Continue normally */ ip = mtod(m, struct ip *); break; case -1: /* Need to try again */ /* Reset everything for a new round */ RO_RTFREE(ro); ro->ro_prepend = NULL; rte = NULL; gw = dst; ip = mtod(m, struct ip *); goto again; } } /* IN_LOOPBACK must not appear on the wire - RFC1122. */ if (IN_LOOPBACK(ntohl(ip->ip_dst.s_addr)) || IN_LOOPBACK(ntohl(ip->ip_src.s_addr))) { if ((ifp->if_flags & IFF_LOOPBACK) == 0) { IPSTAT_INC(ips_badaddr); error = EADDRNOTAVAIL; goto bad; } } m->m_pkthdr.csum_flags |= CSUM_IP; if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) { in_delayed_cksum(m); m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; } #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) { sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2)); m->m_pkthdr.csum_flags &= ~CSUM_SCTP; } #endif /* * If small enough for interface, or the interface will take * care of the fragmentation for us, we can just send directly. */ if (ip_len <= mtu || (m->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0) { ip->ip_sum = 0; if (m->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) { ip->ip_sum = in_cksum(m, hlen); m->m_pkthdr.csum_flags &= ~CSUM_IP; } /* * Record statistics for this interface address. * With CSUM_TSO the byte/packet count will be slightly * incorrect because we count the IP+TCP headers only * once instead of for every generated packet. */ if (!(flags & IP_FORWARDING) && ia) { if (m->m_pkthdr.csum_flags & CSUM_TSO) counter_u64_add(ia->ia_ifa.ifa_opackets, m->m_pkthdr.len / m->m_pkthdr.tso_segsz); else counter_u64_add(ia->ia_ifa.ifa_opackets, 1); counter_u64_add(ia->ia_ifa.ifa_obytes, m->m_pkthdr.len); } #ifdef MBUF_STRESS_TEST if (mbuf_frag_size && m->m_pkthdr.len > mbuf_frag_size) m = m_fragment(m, M_NOWAIT, mbuf_frag_size); #endif /* * Reset layer specific mbuf flags * to avoid confusing lower layers. */ m_clrprotoflags(m); IP_PROBE(send, NULL, NULL, ip, ifp, ip, NULL); #ifdef RATELIMIT if (inp != NULL) { if (inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) in_pcboutput_txrtlmt(inp, ifp, m); /* stamp send tag on mbuf */ m->m_pkthdr.snd_tag = inp->inp_snd_tag; } else { m->m_pkthdr.snd_tag = NULL; } #endif error = (*ifp->if_output)(ifp, m, (const struct sockaddr *)gw, ro); #ifdef RATELIMIT /* check for route change */ if (error == EAGAIN) in_pcboutput_eagain(inp); #endif goto done; } /* Balk when DF bit is set or the interface didn't support TSO. */ if ((ip_off & IP_DF) || (m->m_pkthdr.csum_flags & CSUM_TSO)) { error = EMSGSIZE; IPSTAT_INC(ips_cantfrag); goto bad; } /* * Too large for interface; fragment if possible. If successful, * on return, m will point to a list of packets to be sent. */ error = ip_fragment(ip, &m, mtu, ifp->if_hwassist); if (error) goto bad; for (; m; m = m0) { m0 = m->m_nextpkt; m->m_nextpkt = 0; if (error == 0) { /* Record statistics for this interface address. */ if (ia != NULL) { counter_u64_add(ia->ia_ifa.ifa_opackets, 1); counter_u64_add(ia->ia_ifa.ifa_obytes, m->m_pkthdr.len); } /* * Reset layer specific mbuf flags * to avoid confusing upper layers. 
*/ m_clrprotoflags(m); IP_PROBE(send, NULL, NULL, mtod(m, struct ip *), ifp, mtod(m, struct ip *), NULL); #ifdef RATELIMIT if (inp != NULL) { if (inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) in_pcboutput_txrtlmt(inp, ifp, m); /* stamp send tag on mbuf */ m->m_pkthdr.snd_tag = inp->inp_snd_tag; } else { m->m_pkthdr.snd_tag = NULL; } #endif error = (*ifp->if_output)(ifp, m, (const struct sockaddr *)gw, ro); #ifdef RATELIMIT /* check for route change */ if (error == EAGAIN) in_pcboutput_eagain(inp); #endif } else m_freem(m); } if (error == 0) IPSTAT_INC(ips_fragmented); done: if (ro == &iproute) RO_RTFREE(ro); else if (rte == NULL) /* * If the caller supplied a route but somehow the reference * to it has been released need to prevent the caller * calling RTFREE on it again. */ ro->ro_rt = NULL; NET_EPOCH_EXIT(et); return (error); bad: m_freem(m); goto done; } /* * Create a chain of fragments which fit the given mtu. m_frag points to the * mbuf to be fragmented; on return it points to the chain with the fragments. * Return 0 if no error. If error, m_frag may contain a partially built * chain of fragments that should be freed by the caller. * * if_hwassist_flags is the hw offload capabilities (see if_data.ifi_hwassist) */ int ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu, u_long if_hwassist_flags) { int error = 0; int hlen = ip->ip_hl << 2; int len = (mtu - hlen) & ~7; /* size of payload in each fragment */ int off; struct mbuf *m0 = *m_frag; /* the original packet */ int firstlen; struct mbuf **mnext; int nfrags; uint16_t ip_len, ip_off; ip_len = ntohs(ip->ip_len); ip_off = ntohs(ip->ip_off); if (ip_off & IP_DF) { /* Fragmentation not allowed */ IPSTAT_INC(ips_cantfrag); return EMSGSIZE; } /* * Must be able to put at least 8 bytes per fragment. */ if (len < 8) return EMSGSIZE; /* * If the interface will not calculate checksums on * fragmented packets, then do it here. */ if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA) { in_delayed_cksum(m0); m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; } #ifdef SCTP if (m0->m_pkthdr.csum_flags & CSUM_SCTP) { sctp_delayed_cksum(m0, hlen); m0->m_pkthdr.csum_flags &= ~CSUM_SCTP; } #endif if (len > PAGE_SIZE) { /* * Fragment large datagrams such that each segment * contains a multiple of PAGE_SIZE amount of data, * plus headers. This enables a receiver to perform * page-flipping zero-copy optimizations. * * XXX When does this help given that sender and receiver * could have different page sizes, and also mtu could * be less than the receiver's page size ? */ int newlen; off = MIN(mtu, m0->m_pkthdr.len); /* * firstlen (off - hlen) must be aligned on an * 8-byte boundary */ if (off < hlen) goto smart_frag_failure; off = ((off - hlen) & ~7) + hlen; newlen = (~PAGE_MASK) & mtu; if ((newlen + sizeof (struct ip)) > mtu) { /* we failed, go back the default */ smart_frag_failure: newlen = len; off = hlen + len; } len = newlen; } else { off = hlen + len; } firstlen = off - hlen; mnext = &m0->m_nextpkt; /* pointer to next packet */ /* * Loop through length of segment after first fragment, * make new header and copy data of each part and link onto chain. * Here, m0 is the original packet, m is the fragment being created. * The fragments are linked off the m_nextpkt of the original * packet, which after processing serves as the first fragment. 
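A standalone worked example of the fragmentation arithmetic used by ip_fragment(): with an MTU of 1500 and a 20-byte header, len = (1500 - 20) & ~7 = 1480, so a 4000-byte datagram splits into payloads of 1480/1480/1020 at 8-byte-unit offsets 0, 185 and 370, with all but the last fragment carrying IP_MF. The numbers and the tiny program are illustrative only:

#include <stdio.h>

int
main(void)
{
	int mtu = 1500, hlen = 20, ip_len = 4000;
	int len = (mtu - hlen) & ~7;	/* payload bytes per fragment */
	int off;

	for (off = 0; off < ip_len - hlen; off += len) {
		int chunk = ip_len - hlen - off;

		if (chunk > len)
			chunk = len;
		printf("frag: ip_off=%d (8-byte units), payload=%d, MF=%d\n",
		    off >> 3, chunk, off + chunk < ip_len - hlen);
	}
	return (0);
}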
*/ for (nfrags = 1; off < ip_len; off += len, nfrags++) { struct ip *mhip; /* ip header on the fragment */ struct mbuf *m; int mhlen = sizeof (struct ip); m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) { error = ENOBUFS; IPSTAT_INC(ips_odropped); goto done; } /* * Make sure the complete packet header gets copied * from the originating mbuf to the newly created * mbuf. This also ensures that existing firewall * classification(s), VLAN tags and so on get copied * to the resulting fragmented packet(s): */ if (m_dup_pkthdr(m, m0, M_NOWAIT) == 0) { m_free(m); error = ENOBUFS; IPSTAT_INC(ips_odropped); goto done; } /* * In the first mbuf, leave room for the link header, then * copy the original IP header including options. The payload * goes into an additional mbuf chain returned by m_copym(). */ m->m_data += max_linkhdr; mhip = mtod(m, struct ip *); *mhip = *ip; if (hlen > sizeof (struct ip)) { mhlen = ip_optcopy(ip, mhip) + sizeof (struct ip); mhip->ip_v = IPVERSION; mhip->ip_hl = mhlen >> 2; } m->m_len = mhlen; /* XXX do we need to add ip_off below ? */ mhip->ip_off = ((off - hlen) >> 3) + ip_off; if (off + len >= ip_len) len = ip_len - off; else mhip->ip_off |= IP_MF; mhip->ip_len = htons((u_short)(len + mhlen)); m->m_next = m_copym(m0, off, len, M_NOWAIT); if (m->m_next == NULL) { /* copy failed */ m_free(m); error = ENOBUFS; /* ??? */ IPSTAT_INC(ips_odropped); goto done; } m->m_pkthdr.len = mhlen + len; #ifdef MAC mac_netinet_fragment(m0, m); #endif mhip->ip_off = htons(mhip->ip_off); mhip->ip_sum = 0; if (m->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) { mhip->ip_sum = in_cksum(m, mhlen); m->m_pkthdr.csum_flags &= ~CSUM_IP; } *mnext = m; mnext = &m->m_nextpkt; } IPSTAT_ADD(ips_ofragments, nfrags); /* * Update first fragment by trimming what's been copied out * and updating header. */ m_adj(m0, hlen + firstlen - ip_len); m0->m_pkthdr.len = hlen + firstlen; ip->ip_len = htons((u_short)m0->m_pkthdr.len); ip->ip_off = htons(ip_off | IP_MF); ip->ip_sum = 0; if (m0->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) { ip->ip_sum = in_cksum(m0, hlen); m0->m_pkthdr.csum_flags &= ~CSUM_IP; } done: *m_frag = m0; return error; } void in_delayed_cksum(struct mbuf *m) { struct ip *ip; struct udphdr *uh; uint16_t cklen, csum, offset; ip = mtod(m, struct ip *); offset = ip->ip_hl << 2 ; if (m->m_pkthdr.csum_flags & CSUM_UDP) { /* if udp header is not in the first mbuf copy udplen */ if (offset + sizeof(struct udphdr) > m->m_len) { m_copydata(m, offset + offsetof(struct udphdr, uh_ulen), sizeof(cklen), (caddr_t)&cklen); cklen = ntohs(cklen); } else { uh = (struct udphdr *)mtodo(m, offset); cklen = ntohs(uh->uh_ulen); } csum = in_cksum_skip(m, cklen + offset, offset); if (csum == 0) csum = 0xffff; } else { cklen = ntohs(ip->ip_len); csum = in_cksum_skip(m, cklen, offset); } offset += m->m_pkthdr.csum_data; /* checksum offset */ if (offset + sizeof(csum) > m->m_len) m_copyback(m, offset, sizeof(csum), (caddr_t)&csum); else *(u_short *)mtodo(m, offset) = csum; } /* * IP socket option processing. 
*/ int ip_ctloutput(struct socket *so, struct sockopt *sopt) { struct inpcb *inp = sotoinpcb(so); int error, optval; #ifdef RSS uint32_t rss_bucket; int retval; #endif error = optval = 0; if (sopt->sopt_level != IPPROTO_IP) { error = EINVAL; if (sopt->sopt_level == SOL_SOCKET && sopt->sopt_dir == SOPT_SET) { switch (sopt->sopt_name) { case SO_REUSEADDR: INP_WLOCK(inp); if ((so->so_options & SO_REUSEADDR) != 0) inp->inp_flags2 |= INP_REUSEADDR; else inp->inp_flags2 &= ~INP_REUSEADDR; INP_WUNLOCK(inp); error = 0; break; case SO_REUSEPORT: INP_WLOCK(inp); if ((so->so_options & SO_REUSEPORT) != 0) inp->inp_flags2 |= INP_REUSEPORT; else inp->inp_flags2 &= ~INP_REUSEPORT; INP_WUNLOCK(inp); error = 0; break; case SO_REUSEPORT_LB: INP_WLOCK(inp); if ((so->so_options & SO_REUSEPORT_LB) != 0) inp->inp_flags2 |= INP_REUSEPORT_LB; else inp->inp_flags2 &= ~INP_REUSEPORT_LB; INP_WUNLOCK(inp); error = 0; break; case SO_SETFIB: INP_WLOCK(inp); inp->inp_inc.inc_fibnum = so->so_fibnum; INP_WUNLOCK(inp); error = 0; break; case SO_MAX_PACING_RATE: #ifdef RATELIMIT INP_WLOCK(inp); inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED; INP_WUNLOCK(inp); error = 0; #else error = EOPNOTSUPP; #endif break; default: break; } } return (error); } switch (sopt->sopt_dir) { case SOPT_SET: switch (sopt->sopt_name) { case IP_OPTIONS: #ifdef notyet case IP_RETOPTS: #endif { struct mbuf *m; if (sopt->sopt_valsize > MLEN) { error = EMSGSIZE; break; } m = m_get(sopt->sopt_td ? M_WAITOK : M_NOWAIT, MT_DATA); if (m == NULL) { error = ENOBUFS; break; } m->m_len = sopt->sopt_valsize; error = sooptcopyin(sopt, mtod(m, char *), m->m_len, m->m_len); if (error) { m_free(m); break; } INP_WLOCK(inp); error = ip_pcbopts(inp, sopt->sopt_name, m); INP_WUNLOCK(inp); return (error); } case IP_BINDANY: if (sopt->sopt_td != NULL) { error = priv_check(sopt->sopt_td, PRIV_NETINET_BINDANY); if (error) break; } /* FALLTHROUGH */ case IP_BINDMULTI: #ifdef RSS case IP_RSS_LISTEN_BUCKET: #endif case IP_TOS: case IP_TTL: case IP_MINTTL: case IP_RECVOPTS: case IP_RECVRETOPTS: case IP_ORIGDSTADDR: case IP_RECVDSTADDR: case IP_RECVTTL: case IP_RECVIF: case IP_ONESBCAST: case IP_DONTFRAG: case IP_RECVTOS: case IP_RECVFLOWID: #ifdef RSS case IP_RECVRSSBUCKETID: #endif error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; switch (sopt->sopt_name) { case IP_TOS: inp->inp_ip_tos = optval; break; case IP_TTL: inp->inp_ip_ttl = optval; break; case IP_MINTTL: if (optval >= 0 && optval <= MAXTTL) inp->inp_ip_minttl = optval; else error = EINVAL; break; #define OPTSET(bit) do { \ INP_WLOCK(inp); \ if (optval) \ inp->inp_flags |= bit; \ else \ inp->inp_flags &= ~bit; \ INP_WUNLOCK(inp); \ } while (0) #define OPTSET2(bit, val) do { \ INP_WLOCK(inp); \ if (val) \ inp->inp_flags2 |= bit; \ else \ inp->inp_flags2 &= ~bit; \ INP_WUNLOCK(inp); \ } while (0) case IP_RECVOPTS: OPTSET(INP_RECVOPTS); break; case IP_RECVRETOPTS: OPTSET(INP_RECVRETOPTS); break; case IP_RECVDSTADDR: OPTSET(INP_RECVDSTADDR); break; case IP_ORIGDSTADDR: OPTSET2(INP_ORIGDSTADDR, optval); break; case IP_RECVTTL: OPTSET(INP_RECVTTL); break; case IP_RECVIF: OPTSET(INP_RECVIF); break; case IP_ONESBCAST: OPTSET(INP_ONESBCAST); break; case IP_DONTFRAG: OPTSET(INP_DONTFRAG); break; case IP_BINDANY: OPTSET(INP_BINDANY); break; case IP_RECVTOS: OPTSET(INP_RECVTOS); break; case IP_BINDMULTI: OPTSET2(INP_BINDMULTI, optval); break; case IP_RECVFLOWID: OPTSET2(INP_RECVFLOWID, optval); break; #ifdef RSS case IP_RSS_LISTEN_BUCKET: if ((optval >= 0) && (optval < rss_getnumbuckets())) { 
inp->inp_rss_listen_bucket = optval; OPTSET2(INP_RSS_BUCKET_SET, 1); } else { error = EINVAL; } break; case IP_RECVRSSBUCKETID: OPTSET2(INP_RECVRSSBUCKETID, optval); break; #endif } break; #undef OPTSET #undef OPTSET2 /* * Multicast socket options are processed by the in_mcast * module. */ case IP_MULTICAST_IF: case IP_MULTICAST_VIF: case IP_MULTICAST_TTL: case IP_MULTICAST_LOOP: case IP_ADD_MEMBERSHIP: case IP_DROP_MEMBERSHIP: case IP_ADD_SOURCE_MEMBERSHIP: case IP_DROP_SOURCE_MEMBERSHIP: case IP_BLOCK_SOURCE: case IP_UNBLOCK_SOURCE: case IP_MSFILTER: case MCAST_JOIN_GROUP: case MCAST_LEAVE_GROUP: case MCAST_JOIN_SOURCE_GROUP: case MCAST_LEAVE_SOURCE_GROUP: case MCAST_BLOCK_SOURCE: case MCAST_UNBLOCK_SOURCE: error = inp_setmoptions(inp, sopt); break; case IP_PORTRANGE: error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; INP_WLOCK(inp); switch (optval) { case IP_PORTRANGE_DEFAULT: inp->inp_flags &= ~(INP_LOWPORT); inp->inp_flags &= ~(INP_HIGHPORT); break; case IP_PORTRANGE_HIGH: inp->inp_flags &= ~(INP_LOWPORT); inp->inp_flags |= INP_HIGHPORT; break; case IP_PORTRANGE_LOW: inp->inp_flags &= ~(INP_HIGHPORT); inp->inp_flags |= INP_LOWPORT; break; default: error = EINVAL; break; } INP_WUNLOCK(inp); break; #if defined(IPSEC) || defined(IPSEC_SUPPORT) case IP_IPSEC_POLICY: if (IPSEC_ENABLED(ipv4)) { error = IPSEC_PCBCTL(ipv4, inp, sopt); break; } /* FALLTHROUGH */ #endif /* IPSEC */ default: error = ENOPROTOOPT; break; } break; case SOPT_GET: switch (sopt->sopt_name) { case IP_OPTIONS: case IP_RETOPTS: INP_RLOCK(inp); if (inp->inp_options) { struct mbuf *options; options = m_copym(inp->inp_options, 0, M_COPYALL, M_NOWAIT); INP_RUNLOCK(inp); if (options != NULL) { error = sooptcopyout(sopt, mtod(options, char *), options->m_len); m_freem(options); } else error = ENOMEM; } else { INP_RUNLOCK(inp); sopt->sopt_valsize = 0; } break; case IP_TOS: case IP_TTL: case IP_MINTTL: case IP_RECVOPTS: case IP_RECVRETOPTS: case IP_ORIGDSTADDR: case IP_RECVDSTADDR: case IP_RECVTTL: case IP_RECVIF: case IP_PORTRANGE: case IP_ONESBCAST: case IP_DONTFRAG: case IP_BINDANY: case IP_RECVTOS: case IP_BINDMULTI: case IP_FLOWID: case IP_FLOWTYPE: case IP_RECVFLOWID: #ifdef RSS case IP_RSSBUCKETID: case IP_RECVRSSBUCKETID: #endif switch (sopt->sopt_name) { case IP_TOS: optval = inp->inp_ip_tos; break; case IP_TTL: optval = inp->inp_ip_ttl; break; case IP_MINTTL: optval = inp->inp_ip_minttl; break; #define OPTBIT(bit) (inp->inp_flags & bit ? 1 : 0) #define OPTBIT2(bit) (inp->inp_flags2 & bit ? 
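The SOPT_SET/SOPT_GET cases above are what setsockopt(2)/getsockopt(2) at level IPPROTO_IP land in; a small self-contained sketch exercising a few of them (the option values are arbitrary):

#include <sys/socket.h>
#include <netinet/in.h>
#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	int tos = 0x10, ttl = 64, range = IP_PORTRANGE_HIGH, v;
	socklen_t len = sizeof(v);

	if (s == -1)
		err(1, "socket");
	if (setsockopt(s, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) == -1 ||
	    setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl)) == -1 ||
	    setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &range, sizeof(range)) == -1)
		err(1, "setsockopt");
	if (getsockopt(s, IPPROTO_IP, IP_TOS, &v, &len) == -1)
		err(1, "getsockopt");
	printf("IP_TOS readback: %#x\n", v);	/* stored in inp->inp_ip_tos */
	close(s);
	return (0);
}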
1 : 0) case IP_RECVOPTS: optval = OPTBIT(INP_RECVOPTS); break; case IP_RECVRETOPTS: optval = OPTBIT(INP_RECVRETOPTS); break; case IP_RECVDSTADDR: optval = OPTBIT(INP_RECVDSTADDR); break; case IP_ORIGDSTADDR: optval = OPTBIT2(INP_ORIGDSTADDR); break; case IP_RECVTTL: optval = OPTBIT(INP_RECVTTL); break; case IP_RECVIF: optval = OPTBIT(INP_RECVIF); break; case IP_PORTRANGE: if (inp->inp_flags & INP_HIGHPORT) optval = IP_PORTRANGE_HIGH; else if (inp->inp_flags & INP_LOWPORT) optval = IP_PORTRANGE_LOW; else optval = 0; break; case IP_ONESBCAST: optval = OPTBIT(INP_ONESBCAST); break; case IP_DONTFRAG: optval = OPTBIT(INP_DONTFRAG); break; case IP_BINDANY: optval = OPTBIT(INP_BINDANY); break; case IP_RECVTOS: optval = OPTBIT(INP_RECVTOS); break; case IP_FLOWID: optval = inp->inp_flowid; break; case IP_FLOWTYPE: optval = inp->inp_flowtype; break; case IP_RECVFLOWID: optval = OPTBIT2(INP_RECVFLOWID); break; #ifdef RSS case IP_RSSBUCKETID: retval = rss_hash2bucket(inp->inp_flowid, inp->inp_flowtype, &rss_bucket); if (retval == 0) optval = rss_bucket; else error = EINVAL; break; case IP_RECVRSSBUCKETID: optval = OPTBIT2(INP_RECVRSSBUCKETID); break; #endif case IP_BINDMULTI: optval = OPTBIT2(INP_BINDMULTI); break; } error = sooptcopyout(sopt, &optval, sizeof optval); break; /* * Multicast socket options are processed by the in_mcast * module. */ case IP_MULTICAST_IF: case IP_MULTICAST_VIF: case IP_MULTICAST_TTL: case IP_MULTICAST_LOOP: case IP_MSFILTER: error = inp_getmoptions(inp, sopt); break; #if defined(IPSEC) || defined(IPSEC_SUPPORT) case IP_IPSEC_POLICY: if (IPSEC_ENABLED(ipv4)) { error = IPSEC_PCBCTL(ipv4, inp, sopt); break; } /* FALLTHROUGH */ #endif /* IPSEC */ default: error = ENOPROTOOPT; break; } break; } return (error); } /* * Routine called from ip_output() to loop back a copy of an IP multicast * packet to the input queue of a specified interface. Note that this * calls the output routine of the loopback "driver", but with an interface * pointer that might NOT be a loopback interface -- evil, but easier than * replicating that code here. */ static void ip_mloopback(struct ifnet *ifp, const struct mbuf *m, int hlen) { struct ip *ip; struct mbuf *copym; /* * Make a deep copy of the packet because we're going to * modify the pack in order to generate checksums. */ copym = m_dup(m, M_NOWAIT); if (copym != NULL && (!M_WRITABLE(copym) || copym->m_len < hlen)) copym = m_pullup(copym, hlen); if (copym != NULL) { /* If needed, compute the checksum and mark it as valid. */ if (copym->m_pkthdr.csum_flags & CSUM_DELAY_DATA) { in_delayed_cksum(copym); copym->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA; copym->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR; copym->m_pkthdr.csum_data = 0xffff; } /* * We don't bother to fragment if the IP length is greater * than the interface's MTU. Can this possibly matter? */ ip = mtod(copym, struct ip *); ip->ip_sum = 0; ip->ip_sum = in_cksum(copym, hlen); if_simloop(ifp, copym, AF_INET, 0); } } Index: user/ngie/bug-237403/sys/netinet/tcp_syncache.c =================================================================== --- user/ngie/bug-237403/sys/netinet/tcp_syncache.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet/tcp_syncache.c (revision 346926) @@ -1,2297 +1,2300 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2001 McAfee, Inc. * Copyright (c) 2006,2013 Andre Oppermann, Internet Business Solutions AG * All rights reserved. 
* * This software was developed for the FreeBSD Project by Jonathan Lemon * and McAfee Research, the Security Research Division of McAfee, Inc. under * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. [2001 McAfee, Inc.] * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include "opt_pcbgroup.h" #include #include #include #include #include #include #include #include #include #include #include #include /* for proc0 declaration */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #include #include #endif #include #include #include #include #include #include #include #ifdef INET6 #include #endif #ifdef TCP_OFFLOAD #include #endif #include #include #include VNET_DEFINE_STATIC(int, tcp_syncookies) = 1; #define V_tcp_syncookies VNET(tcp_syncookies) SYSCTL_INT(_net_inet_tcp, OID_AUTO, syncookies, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_syncookies), 0, "Use TCP SYN cookies if the syncache overflows"); VNET_DEFINE_STATIC(int, tcp_syncookiesonly) = 0; #define V_tcp_syncookiesonly VNET(tcp_syncookiesonly) SYSCTL_INT(_net_inet_tcp, OID_AUTO, syncookies_only, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_syncookiesonly), 0, "Use only TCP SYN cookies"); VNET_DEFINE_STATIC(int, functions_inherit_listen_socket_stack) = 1; #define V_functions_inherit_listen_socket_stack \ VNET(functions_inherit_listen_socket_stack) SYSCTL_INT(_net_inet_tcp, OID_AUTO, functions_inherit_listen_socket_stack, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(functions_inherit_listen_socket_stack), 0, "Inherit listen socket's stack"); #ifdef TCP_OFFLOAD #define ADDED_BY_TOE(sc) ((sc)->sc_tod != NULL) #endif static void syncache_drop(struct syncache *, struct syncache_head *); static void syncache_free(struct syncache *); static void syncache_insert(struct syncache *, struct syncache_head *); static int syncache_respond(struct syncache *, struct syncache_head *, const struct mbuf *, int); static struct socket *syncache_socket(struct syncache *, struct socket *, struct mbuf *m); static void syncache_timeout(struct syncache *sc, struct 
syncache_head *sch, int docallout); static void syncache_timer(void *); static uint32_t syncookie_mac(struct in_conninfo *, tcp_seq, uint8_t, uint8_t *, uintptr_t); static tcp_seq syncookie_generate(struct syncache_head *, struct syncache *); static struct syncache *syncookie_lookup(struct in_conninfo *, struct syncache_head *, struct syncache *, struct tcphdr *, struct tcpopt *, struct socket *); static void syncookie_reseed(void *); #ifdef INVARIANTS static int syncookie_cmp(struct in_conninfo *inc, struct syncache_head *sch, struct syncache *sc, struct tcphdr *th, struct tcpopt *to, struct socket *lso); #endif /* * Transmit the SYN,ACK fewer times than TCP_MAXRXTSHIFT specifies. * 3 retransmits corresponds to a timeout with default values of * tcp_rexmit_initial * ( 1 + * tcp_backoff[1] + * tcp_backoff[2] + * tcp_backoff[3]) + 3 * tcp_rexmit_slop, * 1000 ms * (1 + 2 + 4 + 8) + 3 * 200 ms = 15600 ms, * the odds are that the user has given up attempting to connect by then. */ #define SYNCACHE_MAXREXMTS 3 /* Arbitrary values */ #define TCP_SYNCACHE_HASHSIZE 512 #define TCP_SYNCACHE_BUCKETLIMIT 30 VNET_DEFINE_STATIC(struct tcp_syncache, tcp_syncache); #define V_tcp_syncache VNET(tcp_syncache) static SYSCTL_NODE(_net_inet_tcp, OID_AUTO, syncache, CTLFLAG_RW, 0, "TCP SYN cache"); SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, bucketlimit, CTLFLAG_VNET | CTLFLAG_RDTUN, &VNET_NAME(tcp_syncache.bucket_limit), 0, "Per-bucket hash limit for syncache"); SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, cachelimit, CTLFLAG_VNET | CTLFLAG_RDTUN, &VNET_NAME(tcp_syncache.cache_limit), 0, "Overall entry limit for syncache"); SYSCTL_UMA_CUR(_net_inet_tcp_syncache, OID_AUTO, count, CTLFLAG_VNET, &VNET_NAME(tcp_syncache.zone), "Current number of entries in syncache"); SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, hashsize, CTLFLAG_VNET | CTLFLAG_RDTUN, &VNET_NAME(tcp_syncache.hashsize), 0, "Size of TCP syncache hashtable"); static int sysctl_net_inet_tcp_syncache_rexmtlimit_check(SYSCTL_HANDLER_ARGS) { int error; u_int new; new = V_tcp_syncache.rexmt_limit; error = sysctl_handle_int(oidp, &new, 0, req); if ((error == 0) && (req->newptr != NULL)) { if (new > TCP_MAXRXTSHIFT) error = EINVAL; else V_tcp_syncache.rexmt_limit = new; } return (error); } SYSCTL_PROC(_net_inet_tcp_syncache, OID_AUTO, rexmtlimit, CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, &VNET_NAME(tcp_syncache.rexmt_limit), 0, sysctl_net_inet_tcp_syncache_rexmtlimit_check, "UI", "Limit on SYN/ACK retransmissions"); VNET_DEFINE(int, tcp_sc_rst_sock_fail) = 1; SYSCTL_INT(_net_inet_tcp_syncache, OID_AUTO, rst_on_sock_fail, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_sc_rst_sock_fail), 0, "Send reset on socket allocation failure"); static MALLOC_DEFINE(M_SYNCACHE, "syncache", "TCP syncache"); #define SCH_LOCK(sch) mtx_lock(&(sch)->sch_mtx) #define SCH_UNLOCK(sch) mtx_unlock(&(sch)->sch_mtx) #define SCH_LOCK_ASSERT(sch) mtx_assert(&(sch)->sch_mtx, MA_OWNED) /* * Requires the syncache entry to be already removed from the bucket list. 
*/ static void syncache_free(struct syncache *sc) { if (sc->sc_ipopts) (void) m_free(sc->sc_ipopts); if (sc->sc_cred) crfree(sc->sc_cred); #ifdef MAC mac_syncache_destroy(&sc->sc_label); #endif uma_zfree(V_tcp_syncache.zone, sc); } void syncache_init(void) { int i; V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE; V_tcp_syncache.bucket_limit = TCP_SYNCACHE_BUCKETLIMIT; V_tcp_syncache.rexmt_limit = SYNCACHE_MAXREXMTS; V_tcp_syncache.hash_secret = arc4random(); TUNABLE_INT_FETCH("net.inet.tcp.syncache.hashsize", &V_tcp_syncache.hashsize); TUNABLE_INT_FETCH("net.inet.tcp.syncache.bucketlimit", &V_tcp_syncache.bucket_limit); if (!powerof2(V_tcp_syncache.hashsize) || V_tcp_syncache.hashsize == 0) { printf("WARNING: syncache hash size is not a power of 2.\n"); V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE; } V_tcp_syncache.hashmask = V_tcp_syncache.hashsize - 1; /* Set limits. */ V_tcp_syncache.cache_limit = V_tcp_syncache.hashsize * V_tcp_syncache.bucket_limit; TUNABLE_INT_FETCH("net.inet.tcp.syncache.cachelimit", &V_tcp_syncache.cache_limit); /* Allocate the hash table. */ V_tcp_syncache.hashbase = malloc(V_tcp_syncache.hashsize * sizeof(struct syncache_head), M_SYNCACHE, M_WAITOK | M_ZERO); #ifdef VIMAGE V_tcp_syncache.vnet = curvnet; #endif /* Initialize the hash buckets. */ for (i = 0; i < V_tcp_syncache.hashsize; i++) { TAILQ_INIT(&V_tcp_syncache.hashbase[i].sch_bucket); mtx_init(&V_tcp_syncache.hashbase[i].sch_mtx, "tcp_sc_head", NULL, MTX_DEF); callout_init_mtx(&V_tcp_syncache.hashbase[i].sch_timer, &V_tcp_syncache.hashbase[i].sch_mtx, 0); V_tcp_syncache.hashbase[i].sch_length = 0; V_tcp_syncache.hashbase[i].sch_sc = &V_tcp_syncache; V_tcp_syncache.hashbase[i].sch_last_overflow = -(SYNCOOKIE_LIFETIME + 1); } /* Create the syncache entry zone. */ V_tcp_syncache.zone = uma_zcreate("syncache", sizeof(struct syncache), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); V_tcp_syncache.cache_limit = uma_zone_set_max(V_tcp_syncache.zone, V_tcp_syncache.cache_limit); /* Start the SYN cookie reseeder callout. */ callout_init(&V_tcp_syncache.secret.reseed, 1); arc4rand(V_tcp_syncache.secret.key[0], SYNCOOKIE_SECRET_SIZE, 0); arc4rand(V_tcp_syncache.secret.key[1], SYNCOOKIE_SECRET_SIZE, 0); callout_reset(&V_tcp_syncache.secret.reseed, SYNCOOKIE_LIFETIME * hz, syncookie_reseed, &V_tcp_syncache); } #ifdef VIMAGE void syncache_destroy(void) { struct syncache_head *sch; struct syncache *sc, *nsc; int i; /* * Stop the re-seed timer before freeing resources. No need to * possibly schedule it another time. */ callout_drain(&V_tcp_syncache.secret.reseed); /* Cleanup hash buckets: stop timers, free entries, destroy locks. */ for (i = 0; i < V_tcp_syncache.hashsize; i++) { sch = &V_tcp_syncache.hashbase[i]; callout_drain(&sch->sch_timer); SCH_LOCK(sch); TAILQ_FOREACH_SAFE(sc, &sch->sch_bucket, sc_hash, nsc) syncache_drop(sc, sch); SCH_UNLOCK(sch); KASSERT(TAILQ_EMPTY(&sch->sch_bucket), ("%s: sch->sch_bucket not empty", __func__)); KASSERT(sch->sch_length == 0, ("%s: sch->sch_length %d not 0", __func__, sch->sch_length)); mtx_destroy(&sch->sch_mtx); } KASSERT(uma_zone_get_cur(V_tcp_syncache.zone) == 0, ("%s: cache_count not 0", __func__)); /* Free the allocated global resources. */ uma_zdestroy(V_tcp_syncache.zone); free(V_tcp_syncache.hashbase, M_SYNCACHE); } #endif /* * Inserts a syncache entry into the specified bucket row. * Locks and unlocks the syncache_head autonomously. 
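For scale, a short note on the defaults wired up in syncache_init() above; the arithmetic follows directly from the constants shown, and the rounding remark is an assumption about uma_zone_set_max() behavior:

/*
 * TCP_SYNCACHE_HASHSIZE (512) * TCP_SYNCACHE_BUCKETLIMIT (30) gives a
 * default cache_limit of 15360 entries; the value assigned back from
 * uma_zone_set_max() means the effective limit may be rounded up to a
 * slab multiple.  All three values can be overridden at boot through the
 * net.inet.tcp.syncache.{hashsize,bucketlimit,cachelimit} tunables fetched
 * with TUNABLE_INT_FETCH() before the hash table is allocated.
 */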
*/ static void syncache_insert(struct syncache *sc, struct syncache_head *sch) { struct syncache *sc2; SCH_LOCK(sch); /* * Make sure that we don't overflow the per-bucket limit. * If the bucket is full, toss the oldest element. */ if (sch->sch_length >= V_tcp_syncache.bucket_limit) { KASSERT(!TAILQ_EMPTY(&sch->sch_bucket), ("sch->sch_length incorrect")); sc2 = TAILQ_LAST(&sch->sch_bucket, sch_head); sch->sch_last_overflow = time_uptime; syncache_drop(sc2, sch); TCPSTAT_INC(tcps_sc_bucketoverflow); } /* Put it into the bucket. */ TAILQ_INSERT_HEAD(&sch->sch_bucket, sc, sc_hash); sch->sch_length++; #ifdef TCP_OFFLOAD if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; tod->tod_syncache_added(tod, sc->sc_todctx); } #endif /* Reinitialize the bucket row's timer. */ if (sch->sch_length == 1) sch->sch_nextc = ticks + INT_MAX; syncache_timeout(sc, sch, 1); SCH_UNLOCK(sch); TCPSTATES_INC(TCPS_SYN_RECEIVED); TCPSTAT_INC(tcps_sc_added); } /* * Remove and free entry from syncache bucket row. * Expects locked syncache head. */ static void syncache_drop(struct syncache *sc, struct syncache_head *sch) { SCH_LOCK_ASSERT(sch); TCPSTATES_DEC(TCPS_SYN_RECEIVED); TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash); sch->sch_length--; #ifdef TCP_OFFLOAD if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; tod->tod_syncache_removed(tod, sc->sc_todctx); } #endif syncache_free(sc); } /* * Engage/reengage time on bucket row. */ static void syncache_timeout(struct syncache *sc, struct syncache_head *sch, int docallout) { int rexmt; if (sc->sc_rxmits == 0) rexmt = tcp_rexmit_initial; else TCPT_RANGESET(rexmt, tcp_rexmit_initial * tcp_backoff[sc->sc_rxmits], tcp_rexmit_min, TCPTV_REXMTMAX); sc->sc_rxttime = ticks + rexmt; sc->sc_rxmits++; if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) { sch->sch_nextc = sc->sc_rxttime; if (docallout) callout_reset(&sch->sch_timer, sch->sch_nextc - ticks, syncache_timer, (void *)sch); } } /* * Walk the timer queues, looking for SYN,ACKs that need to be retransmitted. * If we have retransmitted an entry the maximum number of times, expire it. * One separate timer for each bucket row. */ static void syncache_timer(void *xsch) { struct syncache_head *sch = (struct syncache_head *)xsch; struct syncache *sc, *nsc; int tick = ticks; char *s; CURVNET_SET(sch->sch_sc->vnet); /* NB: syncache_head has already been locked by the callout. */ SCH_LOCK_ASSERT(sch); /* * In the following cycle we may remove some entries and/or * advance some timeouts, so re-initialize the bucket timer. */ sch->sch_nextc = tick + INT_MAX; TAILQ_FOREACH_SAFE(sc, &sch->sch_bucket, sc_hash, nsc) { /* * We do not check if the listen socket still exists * and accept the case where the listen socket may be * gone by the time we resend the SYN/ACK. We do * not expect this to happens often. If it does, * then the RST will be sent by the time the remote * host does the SYN/ACK->ACK. 
*/ if (TSTMP_GT(sc->sc_rxttime, tick)) { if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) sch->sch_nextc = sc->sc_rxttime; continue; } if (sc->sc_rxmits > V_tcp_syncache.rexmt_limit) { if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Retransmits exhausted, " "giving up and removing syncache entry\n", s, __func__); free(s, M_TCPLOG); } syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_stale); continue; } if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Response timeout, " "retransmitting (%u) SYN|ACK\n", s, __func__, sc->sc_rxmits); free(s, M_TCPLOG); } syncache_respond(sc, sch, NULL, TH_SYN|TH_ACK); TCPSTAT_INC(tcps_sc_retransmitted); syncache_timeout(sc, sch, 0); } if (!TAILQ_EMPTY(&(sch)->sch_bucket)) callout_reset(&(sch)->sch_timer, (sch)->sch_nextc - tick, syncache_timer, (void *)(sch)); CURVNET_RESTORE(); } /* * Find an entry in the syncache. * Returns always with locked syncache_head plus a matching entry or NULL. */ static struct syncache * syncache_lookup(struct in_conninfo *inc, struct syncache_head **schp) { struct syncache *sc; struct syncache_head *sch; uint32_t hash; /* * The hash is built on foreign port + local port + foreign address. * We rely on the fact that struct in_conninfo starts with 16 bits * of foreign port, then 16 bits of local port then followed by 128 * bits of foreign address. In case of IPv4 address, the first 3 * 32-bit words of the address always are zeroes. */ hash = jenkins_hash32((uint32_t *)&inc->inc_ie, 5, V_tcp_syncache.hash_secret) & V_tcp_syncache.hashmask; sch = &V_tcp_syncache.hashbase[hash]; *schp = sch; SCH_LOCK(sch); /* Circle through bucket row to find matching entry. */ TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) if (bcmp(&inc->inc_ie, &sc->sc_inc.inc_ie, sizeof(struct in_endpoints)) == 0) break; return (sc); /* Always returns with locked sch. */ } /* * This function is called when we get a RST for a * non-existent connection, so that we can see if the * connection is in the syn cache. If it is, zap it. * If required send a challenge ACK. */ void syncache_chkrst(struct in_conninfo *inc, struct tcphdr *th, struct mbuf *m) { struct syncache *sc; struct syncache_head *sch; char *s = NULL; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); /* * Any RST to our SYN|ACK must not carry ACK, SYN or FIN flags. * See RFC 793 page 65, section SEGMENT ARRIVES. */ if (th->th_flags & (TH_ACK|TH_SYN|TH_FIN)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious RST with ACK, SYN or " "FIN flag set, segment ignored\n", s, __func__); TCPSTAT_INC(tcps_badrst); goto done; } /* * No corresponding connection was found in syncache. * If syncookies are enabled and possibly exclusively * used, or we are under memory pressure, a valid RST * may not find a syncache entry. In that case we're * done and no SYN|ACK retransmissions will happen. * Otherwise the RST was misdirected or spoofed. */ if (sc == NULL) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious RST without matching " "syncache entry (possibly syncookie only), " "segment ignored\n", s, __func__); TCPSTAT_INC(tcps_badrst); goto done; } /* * If the RST bit is set, check the sequence number to see * if this is a valid reset segment. * * RFC 793 page 37: * In all states except SYN-SENT, all reset (RST) segments * are validated by checking their SEQ-fields. A reset is * valid if its sequence number is in the window. 
* * RFC 793 page 69: * There are four cases for the acceptability test for an incoming * segment: * * Segment Receive Test * Length Window * ------- ------- ------------------------------------------- * 0 0 SEG.SEQ = RCV.NXT * 0 >0 RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND * >0 0 not acceptable * >0 >0 RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND * or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND * * Note that when receiving a SYN segment in the LISTEN state, * IRS is set to SEG.SEQ and RCV.NXT is set to SEG.SEQ+1, as * described in RFC 793, page 66. */ if ((SEQ_GEQ(th->th_seq, sc->sc_irs + 1) && SEQ_LT(th->th_seq, sc->sc_irs + 1 + sc->sc_wnd)) || (sc->sc_wnd == 0 && th->th_seq == sc->sc_irs + 1)) { if (V_tcp_insecure_rst || th->th_seq == sc->sc_irs + 1) { syncache_drop(sc, sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Our SYN|ACK was rejected, " "connection attempt aborted by remote " "endpoint\n", s, __func__); TCPSTAT_INC(tcps_sc_reset); } else { TCPSTAT_INC(tcps_badrst); /* Send challenge ACK. */ if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: RST with invalid " " SEQ %u != NXT %u (+WND %u), " "sending challenge ACK\n", s, __func__, th->th_seq, sc->sc_irs + 1, sc->sc_wnd); syncache_respond(sc, sch, m, TH_ACK); } } else { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: RST with invalid SEQ %u != " "NXT %u (+WND %u), segment ignored\n", s, __func__, th->th_seq, sc->sc_irs + 1, sc->sc_wnd); TCPSTAT_INC(tcps_badrst); } done: if (s != NULL) free(s, M_TCPLOG); SCH_UNLOCK(sch); } void syncache_badack(struct in_conninfo *inc) { struct syncache *sc; struct syncache_head *sch; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); if (sc != NULL) { syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_badack); } SCH_UNLOCK(sch); } void syncache_unreach(struct in_conninfo *inc, tcp_seq th_seq) { struct syncache *sc; struct syncache_head *sch; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); if (sc == NULL) goto done; /* If the sequence number != sc_iss, then it's a bogus ICMP msg */ if (ntohl(th_seq) != sc->sc_iss) goto done; /* * If we've rertransmitted 3 times and this is our second error, * we remove the entry. Otherwise, we allow it to continue on. * This prevents us from incorrectly nuking an entry during a * spurious network outage. * * See tcp_notify(). */ if ((sc->sc_flags & SCF_UNREACH) == 0 || sc->sc_rxmits < 3 + 1) { sc->sc_flags |= SCF_UNREACH; goto done; } syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_unreach); done: SCH_UNLOCK(sch); } /* * Build a new TCP socket structure from a syncache entry. * * On success return the newly created socket with its underlying inp locked. */ static struct socket * syncache_socket(struct syncache *sc, struct socket *lso, struct mbuf *m) { struct tcp_function_block *blk; struct inpcb *inp = NULL; struct socket *so; struct tcpcb *tp; int error; char *s; INP_INFO_RLOCK_ASSERT(&V_tcbinfo); /* * Ok, create the full blown connection, and set things up * as they would have been set up if we had created the * connection when the SYN arrived. If we can't create * the connection, abort it. */ so = sonewconn(lso, 0); if (so == NULL) { /* * Drop the connection; we will either send a RST or * have the peer retransmit its SYN again after its * RTO and try again. 
*/ TCPSTAT_INC(tcps_listendrop); if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Socket create failed " "due to limits or memory shortage\n", s, __func__); free(s, M_TCPLOG); } goto abort2; } #ifdef MAC mac_socketpeer_set_from_mbuf(m, so); #endif inp = sotoinpcb(so); inp->inp_inc.inc_fibnum = so->so_fibnum; INP_WLOCK(inp); /* * Exclusive pcbinfo lock is not required in syncache socket case even * if two inpcb locks can be acquired simultaneously: * - the inpcb in LISTEN state, * - the newly created inp. * * In this case, an inp cannot be at same time in LISTEN state and * just created by an accept() call. */ INP_HASH_WLOCK(&V_tcbinfo); /* Insert new socket into PCB hash list. */ inp->inp_inc.inc_flags = sc->sc_inc.inc_flags; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { inp->inp_vflag &= ~INP_IPV4; inp->inp_vflag |= INP_IPV6; inp->in6p_laddr = sc->sc_inc.inc6_laddr; } else { inp->inp_vflag &= ~INP_IPV6; inp->inp_vflag |= INP_IPV4; #endif inp->inp_laddr = sc->sc_inc.inc_laddr; #ifdef INET6 } #endif /* * If there's an mbuf and it has a flowid, then let's initialise the * inp with that particular flowid. */ if (m != NULL && M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { inp->inp_flowid = m->m_pkthdr.flowid; inp->inp_flowtype = M_HASHTYPE_GET(m); +#ifdef NUMA + inp->inp_numa_domain = m->m_pkthdr.numa_domain; +#endif } /* * Install in the reservation hash table for now, but don't yet * install a connection group since the full 4-tuple isn't yet * configured. */ inp->inp_lport = sc->sc_inc.inc_lport; if ((error = in_pcbinshash_nopcbgroup(inp)) != 0) { /* * Undo the assignments above if we failed to * put the PCB on the hash lists. */ #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) inp->in6p_laddr = in6addr_any; else #endif inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: in_pcbinshash failed " "with error %i\n", s, __func__, error); free(s, M_TCPLOG); } INP_HASH_WUNLOCK(&V_tcbinfo); goto abort; } #ifdef INET6 if (inp->inp_vflag & INP_IPV6PROTO) { struct inpcb *oinp = sotoinpcb(lso); /* * Inherit socket options from the listening socket. * Note that in6p_inputopts are not (and should not be) * copied, since it stores previously received options and is * used to detect if each new option is different than the * previous one and hence should be passed to a user. * If we copied in6p_inputopts, a user would not be able to * receive options just after calling the accept system call. */ inp->inp_flags |= oinp->inp_flags & INP_CONTROLOPTS; if (oinp->in6p_outputopts) inp->in6p_outputopts = ip6_copypktopts(oinp->in6p_outputopts, M_NOWAIT); } if (sc->sc_inc.inc_flags & INC_ISIPV6) { struct in6_addr laddr6; struct sockaddr_in6 sin6; sin6.sin6_family = AF_INET6; sin6.sin6_len = sizeof(sin6); sin6.sin6_addr = sc->sc_inc.inc6_faddr; sin6.sin6_port = sc->sc_inc.inc_fport; sin6.sin6_flowinfo = sin6.sin6_scope_id = 0; laddr6 = inp->in6p_laddr; if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) inp->in6p_laddr = sc->sc_inc.inc6_laddr; if ((error = in6_pcbconnect_mbuf(inp, (struct sockaddr *)&sin6, thread0.td_ucred, m)) != 0) { inp->in6p_laddr = laddr6; if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: in6_pcbconnect failed " "with error %i\n", s, __func__, error); free(s, M_TCPLOG); } INP_HASH_WUNLOCK(&V_tcbinfo); goto abort; } /* Override flowlabel from in6_pcbconnect. 
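 * with the label that was used for our SYN|ACK (saved in sc_flowlabel).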
*/ inp->inp_flow &= ~IPV6_FLOWLABEL_MASK; inp->inp_flow |= sc->sc_flowlabel; } #endif /* INET6 */ #if defined(INET) && defined(INET6) else #endif #ifdef INET { struct in_addr laddr; struct sockaddr_in sin; inp->inp_options = (m) ? ip_srcroute(m) : NULL; if (inp->inp_options == NULL) { inp->inp_options = sc->sc_ipopts; sc->sc_ipopts = NULL; } sin.sin_family = AF_INET; sin.sin_len = sizeof(sin); sin.sin_addr = sc->sc_inc.inc_faddr; sin.sin_port = sc->sc_inc.inc_fport; bzero((caddr_t)sin.sin_zero, sizeof(sin.sin_zero)); laddr = inp->inp_laddr; if (inp->inp_laddr.s_addr == INADDR_ANY) inp->inp_laddr = sc->sc_inc.inc_laddr; if ((error = in_pcbconnect_mbuf(inp, (struct sockaddr *)&sin, thread0.td_ucred, m)) != 0) { inp->inp_laddr = laddr; if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: in_pcbconnect failed " "with error %i\n", s, __func__, error); free(s, M_TCPLOG); } INP_HASH_WUNLOCK(&V_tcbinfo); goto abort; } } #endif /* INET */ #if defined(IPSEC) || defined(IPSEC_SUPPORT) /* Copy old policy into new socket's. */ if (ipsec_copy_pcbpolicy(sotoinpcb(lso), inp) != 0) printf("syncache_socket: could not copy policy\n"); #endif INP_HASH_WUNLOCK(&V_tcbinfo); tp = intotcpcb(inp); tcp_state_change(tp, TCPS_SYN_RECEIVED); tp->iss = sc->sc_iss; tp->irs = sc->sc_irs; tcp_rcvseqinit(tp); tcp_sendseqinit(tp); blk = sototcpcb(lso)->t_fb; if (V_functions_inherit_listen_socket_stack && blk != tp->t_fb) { /* * Our parents t_fb was not the default, * we need to release our ref on tp->t_fb and * pickup one on the new entry. */ struct tcp_function_block *rblk; rblk = find_and_ref_tcp_fb(blk); KASSERT(rblk != NULL, ("cannot find blk %p out of syncache?", blk)); if (tp->t_fb->tfb_tcp_fb_fini) (*tp->t_fb->tfb_tcp_fb_fini)(tp, 0); refcount_release(&tp->t_fb->tfb_refcnt); tp->t_fb = rblk; /* * XXXrrs this is quite dangerous, it is possible * for the new function to fail to init. We also * are not asking if the handoff_is_ok though at * the very start thats probalbly ok. */ if (tp->t_fb->tfb_tcp_fb_init) { (*tp->t_fb->tfb_tcp_fb_init)(tp); } } tp->snd_wl1 = sc->sc_irs; tp->snd_max = tp->iss + 1; tp->snd_nxt = tp->iss + 1; tp->rcv_up = sc->sc_irs + 1; tp->rcv_wnd = sc->sc_wnd; tp->rcv_adv += tp->rcv_wnd; tp->last_ack_sent = tp->rcv_nxt; tp->t_flags = sototcpcb(lso)->t_flags & (TF_NOPUSH|TF_NODELAY); if (sc->sc_flags & SCF_NOOPT) tp->t_flags |= TF_NOOPT; else { if (sc->sc_flags & SCF_WINSCALE) { tp->t_flags |= TF_REQ_SCALE|TF_RCVD_SCALE; tp->snd_scale = sc->sc_requested_s_scale; tp->request_r_scale = sc->sc_requested_r_scale; } if (sc->sc_flags & SCF_TIMESTAMP) { tp->t_flags |= TF_REQ_TSTMP|TF_RCVD_TSTMP; tp->ts_recent = sc->sc_tsreflect; tp->ts_recent_age = tcp_ts_getticks(); tp->ts_offset = sc->sc_tsoff; } #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (sc->sc_flags & SCF_SIGNATURE) tp->t_flags |= TF_SIGNATURE; #endif if (sc->sc_flags & SCF_SACK) tp->t_flags |= TF_SACK_PERMIT; } if (sc->sc_flags & SCF_ECN) tp->t_flags |= TF_ECN_PERMIT; /* * Set up MSS and get cached values from tcp_hostcache. * This might overwrite some of the defaults we just set. */ tcp_mss(tp, sc->sc_peer_mss); /* * If the SYN,ACK was retransmitted, indicate that CWND to be * limited to one segment in cc_conn_init(). * NB: sc_rxmits counts all SYN,ACK transmits, not just retransmits. */ if (sc->sc_rxmits > 1) tp->snd_cwnd = 1; #ifdef TCP_OFFLOAD /* * Allow a TOE driver to install its hooks. 
Note that we hold the * pcbinfo lock too and that prevents tcp_usr_accept from accepting a * new connection before the TOE driver has done its thing. */ if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; tod->tod_offload_socket(tod, sc->sc_todctx, so); } #endif /* * Copy and activate timers. */ tp->t_keepinit = sototcpcb(lso)->t_keepinit; tp->t_keepidle = sototcpcb(lso)->t_keepidle; tp->t_keepintvl = sototcpcb(lso)->t_keepintvl; tp->t_keepcnt = sototcpcb(lso)->t_keepcnt; tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp)); TCPSTAT_INC(tcps_accepts); return (so); abort: INP_WUNLOCK(inp); abort2: if (so != NULL) soabort(so); return (NULL); } /* * This function gets called when we receive an ACK for a * socket in the LISTEN state. We look up the connection * in the syncache, and if its there, we pull it out of * the cache and turn it into a full-blown connection in * the SYN-RECEIVED state. * * On syncache_socket() success the newly created socket * has its underlying inp locked. */ int syncache_expand(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct socket **lsop, struct mbuf *m) { struct syncache *sc; struct syncache_head *sch; struct syncache scs; char *s; /* * Global TCP locks are held because we manipulate the PCB lists * and create a new socket. */ INP_INFO_RLOCK_ASSERT(&V_tcbinfo); KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK, ("%s: can handle only ACK", __func__)); sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); #ifdef INVARIANTS /* * Test code for syncookies comparing the syncache stored * values with the reconstructed values from the cookie. */ if (sc != NULL) syncookie_cmp(inc, sch, sc, th, to, *lsop); #endif if (sc == NULL) { /* * There is no syncache entry, so see if this ACK is * a returning syncookie. To do this, first: * A. Check if syncookies are used in case of syncache * overflows * B. See if this socket has had a syncache entry dropped in * the recent past. We don't want to accept a bogus * syncookie if we've never received a SYN or accept it * twice. * C. check that the syncookie is valid. If it is, then * cobble up a fake syncache entry, and return. */ if (!V_tcp_syncookies) { SCH_UNLOCK(sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious ACK, " "segment rejected (syncookies disabled)\n", s, __func__); goto failed; } if (!V_tcp_syncookiesonly && sch->sch_last_overflow < time_uptime - SYNCOOKIE_LIFETIME) { SCH_UNLOCK(sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious ACK, " "segment rejected (no syncache entry)\n", s, __func__); goto failed; } bzero(&scs, sizeof(scs)); sc = syncookie_lookup(inc, sch, &scs, th, to, *lsop); SCH_UNLOCK(sch); if (sc == NULL) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Segment failed " "SYNCOOKIE authentication, segment rejected " "(probably spoofed)\n", s, __func__); goto failed; } #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* If received ACK has MD5 signature, check it. */ if ((to->to_flags & TOF_SIGNATURE) != 0 && (!TCPMD5_ENABLED() || TCPMD5_INPUT(m, th, to->to_signature) != 0)) { /* Drop the ACK. 
*/ if ((s = tcp_log_addrs(inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Segment rejected, " "MD5 signature doesn't match.\n", s, __func__); free(s, M_TCPLOG); } TCPSTAT_INC(tcps_sig_err_sigopt); return (-1); /* Do not send RST */ } #endif /* TCP_SIGNATURE */ } else { #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* * If listening socket requested TCP digests, check that * received ACK has signature and it is correct. * If not, drop the ACK and leave sc entry in th cache, * because SYN was received with correct signature. */ if (sc->sc_flags & SCF_SIGNATURE) { if ((to->to_flags & TOF_SIGNATURE) == 0) { /* No signature */ TCPSTAT_INC(tcps_sig_err_nosigopt); SCH_UNLOCK(sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Segment " "rejected, MD5 signature wasn't " "provided.\n", s, __func__); free(s, M_TCPLOG); } return (-1); /* Do not send RST */ } if (!TCPMD5_ENABLED() || TCPMD5_INPUT(m, th, to->to_signature) != 0) { /* Doesn't match or no SA */ SCH_UNLOCK(sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Segment " "rejected, MD5 signature doesn't " "match.\n", s, __func__); free(s, M_TCPLOG); } return (-1); /* Do not send RST */ } } #endif /* TCP_SIGNATURE */ /* * Pull out the entry to unlock the bucket row. * * NOTE: We must decrease TCPS_SYN_RECEIVED count here, not * tcp_state_change(). The tcpcb is not existent at this * moment. A new one will be allocated via syncache_socket-> * sonewconn->tcp_usr_attach in TCPS_CLOSED state, then * syncache_socket() will change it to TCPS_SYN_RECEIVED. */ TCPSTATES_DEC(TCPS_SYN_RECEIVED); TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash); sch->sch_length--; #ifdef TCP_OFFLOAD if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; tod->tod_syncache_removed(tod, sc->sc_todctx); } #endif SCH_UNLOCK(sch); } /* * Segment validation: * ACK must match our initial sequence number + 1 (the SYN|ACK). */ if (th->th_ack != sc->sc_iss + 1) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: ACK %u != ISS+1 %u, segment " "rejected\n", s, __func__, th->th_ack, sc->sc_iss); goto failed; } /* * The SEQ must fall in the window starting at the received * initial receive sequence number + 1 (the SYN). */ if (SEQ_LEQ(th->th_seq, sc->sc_irs) || SEQ_GT(th->th_seq, sc->sc_irs + sc->sc_wnd)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: SEQ %u != IRS+1 %u, segment " "rejected\n", s, __func__, th->th_seq, sc->sc_irs); goto failed; } /* * If timestamps were not negotiated during SYN/ACK they * must not appear on any segment during this session. */ if (!(sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Timestamp not expected, " "segment rejected\n", s, __func__); goto failed; } /* * If timestamps were negotiated during SYN/ACK they should * appear on every segment during this session. * XXXAO: This is only informal as there have been unverified * reports of non-compliants stacks. */ if ((sc->sc_flags & SCF_TIMESTAMP) && !(to->to_flags & TOF_TS)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Timestamp missing, " "no action\n", s, __func__); free(s, M_TCPLOG); s = NULL; } } *lsop = syncache_socket(sc, *lsop, m); if (*lsop == NULL) TCPSTAT_INC(tcps_sc_aborted); else TCPSTAT_INC(tcps_sc_completed); /* how do we find the inp for the new socket? 
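 * (syncache_socket() returned it in *lsop with the new socket's inpcb
 * locked.)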
*/ if (sc != &scs) syncache_free(sc); return (1); failed: if (sc != NULL && sc != &scs) syncache_free(sc); if (s != NULL) free(s, M_TCPLOG); *lsop = NULL; return (0); } static void syncache_tfo_expand(struct syncache *sc, struct socket **lsop, struct mbuf *m, uint64_t response_cookie) { struct inpcb *inp; struct tcpcb *tp; unsigned int *pending_counter; /* * Global TCP locks are held because we manipulate the PCB lists * and create a new socket. */ INP_INFO_RLOCK_ASSERT(&V_tcbinfo); pending_counter = intotcpcb(sotoinpcb(*lsop))->t_tfo_pending; *lsop = syncache_socket(sc, *lsop, m); if (*lsop == NULL) { TCPSTAT_INC(tcps_sc_aborted); atomic_subtract_int(pending_counter, 1); } else { soisconnected(*lsop); inp = sotoinpcb(*lsop); tp = intotcpcb(inp); tp->t_flags |= TF_FASTOPEN; tp->t_tfo_cookie.server = response_cookie; tp->snd_max = tp->iss; tp->snd_nxt = tp->iss; tp->t_tfo_pending = pending_counter; TCPSTAT_INC(tcps_sc_completed); } } /* * Given a LISTEN socket and an inbound SYN request, add * this to the syn cache, and send back a segment: * * to the source. * * IMPORTANT NOTE: We do _NOT_ ACK data that might accompany the SYN. * Doing so would require that we hold onto the data and deliver it * to the application. However, if we are the target of a SYN-flood * DoS attack, an attacker could send data which would eventually * consume all available buffer space if it were ACKed. By not ACKing * the data, we avoid this DoS scenario. * * The exception to the above is when a SYN with a valid TCP Fast Open (TFO) * cookie is processed and a new socket is created. In this case, any data * accompanying the SYN will be queued to the socket by tcp_input() and will * be ACKed either when the application sends response data or the delayed * ACK timer expires, whichever comes first. */ int syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct inpcb *inp, struct socket **lsop, struct mbuf *m, void *tod, void *todctx) { struct tcpcb *tp; struct socket *so; struct syncache *sc = NULL; struct syncache_head *sch; struct mbuf *ipopts = NULL; u_int ltflags; int win, ip_ttl, ip_tos; char *s; int rv = 0; #ifdef INET6 int autoflowlabel = 0; #endif #ifdef MAC struct label *maclabel; #endif struct syncache scs; struct ucred *cred; uint64_t tfo_response_cookie; unsigned int *tfo_pending = NULL; int tfo_cookie_valid = 0; int tfo_response_cookie_valid = 0; INP_WLOCK_ASSERT(inp); /* listen socket */ KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN, ("%s: unexpected tcp flags", __func__)); /* * Combine all so/tp operations very early to drop the INP lock as * soon as possible. */ so = *lsop; KASSERT(SOLISTENING(so), ("%s: %p not listening", __func__, so)); tp = sototcpcb(so); cred = crhold(so->so_cred); #ifdef INET6 if ((inc->inc_flags & INC_ISIPV6) && (inp->inp_flags & IN6P_AUTOFLOWLABEL)) autoflowlabel = 1; #endif ip_ttl = inp->inp_ip_ttl; ip_tos = inp->inp_ip_tos; win = so->sol_sbrcv_hiwat; ltflags = (tp->t_flags & (TF_NOOPT | TF_SIGNATURE)); if (V_tcp_fastopen_server_enable && IS_FASTOPEN(tp->t_flags) && (tp->t_tfo_pending != NULL) && (to->to_flags & TOF_FASTOPEN)) { /* * Limit the number of pending TFO connections to * approximately half of the queue limit. This prevents TFO * SYN floods from starving the service by filling the * listen queue with bogus TFO connections. 
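 * E.g. with a sol_qlimit of 128, at most about 64 TFO connections may be
 * pending at once; beyond that the SYN is still processed, but via the
 * regular three-way handshake rather than the TFO fast path.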
*/ if (atomic_fetchadd_int(tp->t_tfo_pending, 1) <= (so->sol_qlimit / 2)) { int result; result = tcp_fastopen_check_cookie(inc, to->to_tfo_cookie, to->to_tfo_len, &tfo_response_cookie); tfo_cookie_valid = (result > 0); tfo_response_cookie_valid = (result >= 0); } /* * Remember the TFO pending counter as it will have to be * decremented below if we don't make it to syncache_tfo_expand(). */ tfo_pending = tp->t_tfo_pending; } /* By the time we drop the lock these should no longer be used. */ so = NULL; tp = NULL; #ifdef MAC if (mac_syncache_init(&maclabel) != 0) { INP_WUNLOCK(inp); goto done; } else mac_syncache_create(maclabel, inp); #endif if (!tfo_cookie_valid) INP_WUNLOCK(inp); /* * Remember the IP options, if any. */ #ifdef INET6 if (!(inc->inc_flags & INC_ISIPV6)) #endif #ifdef INET ipopts = (m) ? ip_srcroute(m) : NULL; #else ipopts = NULL; #endif #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* * If listening socket requested TCP digests, check that received * SYN has signature and it is correct. If signature doesn't match * or TCP_SIGNATURE support isn't enabled, drop the packet. */ if (ltflags & TF_SIGNATURE) { if ((to->to_flags & TOF_SIGNATURE) == 0) { TCPSTAT_INC(tcps_sig_err_nosigopt); goto done; } if (!TCPMD5_ENABLED() || TCPMD5_INPUT(m, th, to->to_signature) != 0) goto done; } #endif /* TCP_SIGNATURE */ /* * See if we already have an entry for this connection. * If we do, resend the SYN,ACK, and reset the retransmit timer. * * XXX: should the syncache be re-initialized with the contents * of the new SYN here (which may have different options?) * * XXX: We do not check the sequence number to see if this is a * real retransmit or a new connection attempt. The question is * how to handle such a case; either ignore it as spoofed, or * drop the current entry and create a new one? */ sc = syncache_lookup(inc, &sch); /* returns locked entry */ SCH_LOCK_ASSERT(sch); if (sc != NULL) { if (tfo_cookie_valid) INP_WUNLOCK(inp); TCPSTAT_INC(tcps_sc_dupsyn); if (ipopts) { /* * If we were remembering a previous source route, * forget it and use the new one we've been given. */ if (sc->sc_ipopts) (void) m_free(sc->sc_ipopts); sc->sc_ipopts = ipopts; } /* * Update timestamp if present. */ if ((sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS)) sc->sc_tsreflect = to->to_tsval; else sc->sc_flags &= ~SCF_TIMESTAMP; #ifdef MAC /* * Since we have already unconditionally allocated label * storage, free it up. The syncache entry will already * have an initialized label we can use. */ mac_syncache_destroy(&maclabel); #endif TCP_PROBE5(receive, NULL, NULL, m, NULL, th); /* Retransmit SYN|ACK and reset retransmit count. */ if ((s = tcp_log_addrs(&sc->sc_inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Received duplicate SYN, " "resetting timer and retransmitting SYN|ACK\n", s, __func__); free(s, M_TCPLOG); } if (syncache_respond(sc, sch, m, TH_SYN|TH_ACK) == 0) { sc->sc_rxmits = 0; syncache_timeout(sc, sch, 1); TCPSTAT_INC(tcps_sndacks); TCPSTAT_INC(tcps_sndtotal); } SCH_UNLOCK(sch); goto donenoprobe; } if (tfo_cookie_valid) { bzero(&scs, sizeof(scs)); sc = &scs; goto skip_alloc; } sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO); if (sc == NULL) { /* * The zone allocator couldn't provide more entries. * Treat this as if the cache was full; drop the oldest * entry and insert the new one. 
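 * If the retried allocation still fails and syncookies are enabled, fall
 * back to the on-stack entry (scs), which lives only long enough to send
 * the SYN|ACK.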
*/ TCPSTAT_INC(tcps_sc_zonefail); if ((sc = TAILQ_LAST(&sch->sch_bucket, sch_head)) != NULL) { sch->sch_last_overflow = time_uptime; syncache_drop(sc, sch); } sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO); if (sc == NULL) { if (V_tcp_syncookies) { bzero(&scs, sizeof(scs)); sc = &scs; } else { SCH_UNLOCK(sch); if (ipopts) (void) m_free(ipopts); goto done; } } } skip_alloc: if (!tfo_cookie_valid && tfo_response_cookie_valid) sc->sc_tfo_cookie = &tfo_response_cookie; /* * Fill in the syncache values. */ #ifdef MAC sc->sc_label = maclabel; #endif sc->sc_cred = cred; cred = NULL; sc->sc_ipopts = ipopts; bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo)); #ifdef INET6 if (!(inc->inc_flags & INC_ISIPV6)) #endif { sc->sc_ip_tos = ip_tos; sc->sc_ip_ttl = ip_ttl; } #ifdef TCP_OFFLOAD sc->sc_tod = tod; sc->sc_todctx = todctx; #endif sc->sc_irs = th->th_seq; sc->sc_iss = arc4random(); sc->sc_flags = 0; sc->sc_flowlabel = 0; /* * Initial receive window: clip sbspace to [0 .. TCP_MAXWIN]. * win was derived from socket earlier in the function. */ win = imax(win, 0); win = imin(win, TCP_MAXWIN); sc->sc_wnd = win; if (V_tcp_do_rfc1323) { /* * A timestamp received in a SYN makes * it ok to send timestamp requests and replies. */ if (to->to_flags & TOF_TS) { sc->sc_tsreflect = to->to_tsval; sc->sc_flags |= SCF_TIMESTAMP; sc->sc_tsoff = tcp_new_ts_offset(inc); } if (to->to_flags & TOF_SCALE) { int wscale = 0; /* * Pick the smallest possible scaling factor that * will still allow us to scale up to sb_max, aka * kern.ipc.maxsockbuf. * * We do this because there are broken firewalls that * will corrupt the window scale option, leading to * the other endpoint believing that our advertised * window is unscaled. At scale factors larger than * 5 the unscaled window will drop below 1500 bytes, * leading to serious problems when traversing these * broken firewalls. * * With the default maxsockbuf of 256K, a scale factor * of 3 will be chosen by this algorithm. Those who * choose a larger maxsockbuf should watch out * for the compatibility problems mentioned above. * * RFC1323: The Window field in a SYN (i.e., a * or ) segment itself is never scaled. */ while (wscale < TCP_MAX_WINSHIFT && (TCP_MAXWIN << wscale) < sb_max) wscale++; sc->sc_requested_r_scale = wscale; sc->sc_requested_s_scale = to->to_wscale; sc->sc_flags |= SCF_WINSCALE; } } #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) /* * If listening socket requested TCP digests, flag this in the * syncache so that syncache_respond() will do the right thing * with the SYN+ACK. */ if (ltflags & TF_SIGNATURE) sc->sc_flags |= SCF_SIGNATURE; #endif /* TCP_SIGNATURE */ if (to->to_flags & TOF_SACKPERM) sc->sc_flags |= SCF_SACK; if (to->to_flags & TOF_MSS) sc->sc_peer_mss = to->to_mss; /* peer mss may be zero */ if (ltflags & TF_NOOPT) sc->sc_flags |= SCF_NOOPT; if ((th->th_flags & (TH_ECE|TH_CWR)) && V_tcp_do_ecn) sc->sc_flags |= SCF_ECN; if (V_tcp_syncookies) sc->sc_iss = syncookie_generate(sch, sc); #ifdef INET6 if (autoflowlabel) { if (V_tcp_syncookies) sc->sc_flowlabel = sc->sc_iss; else sc->sc_flowlabel = ip6_randomflowlabel(); sc->sc_flowlabel = htonl(sc->sc_flowlabel) & IPV6_FLOWLABEL_MASK; } #endif SCH_UNLOCK(sch); if (tfo_cookie_valid) { syncache_tfo_expand(sc, lsop, m, tfo_response_cookie); /* INP_WUNLOCK(inp) will be performed by the caller */ rv = 1; goto tfo_expanded; } TCP_PROBE5(receive, NULL, NULL, m, NULL, th); /* * Do a standard 3-way handshake. 
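 * Send the SYN|ACK and, unless we are in syncookie-only mode or used the
 * on-stack entry, insert the entry so that syncache_expand() can complete
 * the handshake when the ACK returns.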
*/ if (syncache_respond(sc, sch, m, TH_SYN|TH_ACK) == 0) { if (V_tcp_syncookies && V_tcp_syncookiesonly && sc != &scs) syncache_free(sc); else if (sc != &scs) syncache_insert(sc, sch); /* locks and unlocks sch */ TCPSTAT_INC(tcps_sndacks); TCPSTAT_INC(tcps_sndtotal); } else { if (sc != &scs) syncache_free(sc); TCPSTAT_INC(tcps_sc_dropped); } goto donenoprobe; done: TCP_PROBE5(receive, NULL, NULL, m, NULL, th); donenoprobe: if (m) { *lsop = NULL; m_freem(m); } /* * If tfo_pending is not NULL here, then a TFO SYN that did not * result in a new socket was processed and the associated pending * counter has not yet been decremented. All such TFO processing paths * transit this point. */ if (tfo_pending != NULL) tcp_fastopen_decrement_counter(tfo_pending); tfo_expanded: if (cred != NULL) crfree(cred); #ifdef MAC if (sc == &scs) mac_syncache_destroy(&maclabel); #endif return (rv); } /* * Send SYN|ACK or ACK to the peer. Either in response to a peer's segment, * i.e. m0 != NULL, or upon 3WHS ACK timeout, i.e. m0 == NULL. */ static int syncache_respond(struct syncache *sc, struct syncache_head *sch, const struct mbuf *m0, int flags) { struct ip *ip = NULL; struct mbuf *m; struct tcphdr *th = NULL; int optlen, error = 0; /* Make compiler happy */ u_int16_t hlen, tlen, mssopt; struct tcpopt to; #ifdef INET6 struct ip6_hdr *ip6 = NULL; #endif hlen = #ifdef INET6 (sc->sc_inc.inc_flags & INC_ISIPV6) ? sizeof(struct ip6_hdr) : #endif sizeof(struct ip); tlen = hlen + sizeof(struct tcphdr); /* Determine MSS we advertize to other end of connection. */ mssopt = max(tcp_mssopt(&sc->sc_inc), V_tcp_minmss); /* XXX: Assume that the entire packet will fit in a header mbuf. */ KASSERT(max_linkhdr + tlen + TCP_MAXOLEN <= MHLEN, ("syncache: mbuf too small")); /* Create the IP+TCP header from scratch. */ m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) return (ENOBUFS); #ifdef MAC mac_syncache_create_mbuf(sc->sc_label, m); #endif m->m_data += max_linkhdr; m->m_len = tlen; m->m_pkthdr.len = tlen; m->m_pkthdr.rcvif = NULL; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { ip6 = mtod(m, struct ip6_hdr *); ip6->ip6_vfc = IPV6_VERSION; ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_src = sc->sc_inc.inc6_laddr; ip6->ip6_dst = sc->sc_inc.inc6_faddr; ip6->ip6_plen = htons(tlen - hlen); /* ip6_hlim is set after checksum */ ip6->ip6_flow &= ~IPV6_FLOWLABEL_MASK; ip6->ip6_flow |= sc->sc_flowlabel; th = (struct tcphdr *)(ip6 + 1); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET { ip = mtod(m, struct ip *); ip->ip_v = IPVERSION; ip->ip_hl = sizeof(struct ip) >> 2; ip->ip_len = htons(tlen); ip->ip_id = 0; ip->ip_off = 0; ip->ip_sum = 0; ip->ip_p = IPPROTO_TCP; ip->ip_src = sc->sc_inc.inc_laddr; ip->ip_dst = sc->sc_inc.inc_faddr; ip->ip_ttl = sc->sc_ip_ttl; ip->ip_tos = sc->sc_ip_tos; /* * See if we should do MTU discovery. 
Route lookups are * expensive, so we will only unset the DF bit if: * * 1) path_mtu_discovery is disabled * 2) the SCF_UNREACH flag has been set */ if (V_path_mtu_discovery && ((sc->sc_flags & SCF_UNREACH) == 0)) ip->ip_off |= htons(IP_DF); th = (struct tcphdr *)(ip + 1); } #endif /* INET */ th->th_sport = sc->sc_inc.inc_lport; th->th_dport = sc->sc_inc.inc_fport; if (flags & TH_SYN) th->th_seq = htonl(sc->sc_iss); else th->th_seq = htonl(sc->sc_iss + 1); th->th_ack = htonl(sc->sc_irs + 1); th->th_off = sizeof(struct tcphdr) >> 2; th->th_x2 = 0; th->th_flags = flags; th->th_win = htons(sc->sc_wnd); th->th_urp = 0; if ((flags & TH_SYN) && (sc->sc_flags & SCF_ECN)) { th->th_flags |= TH_ECE; TCPSTAT_INC(tcps_ecn_shs); } /* Tack on the TCP options. */ if ((sc->sc_flags & SCF_NOOPT) == 0) { to.to_flags = 0; if (flags & TH_SYN) { to.to_mss = mssopt; to.to_flags = TOF_MSS; if (sc->sc_flags & SCF_WINSCALE) { to.to_wscale = sc->sc_requested_r_scale; to.to_flags |= TOF_SCALE; } if (sc->sc_flags & SCF_SACK) to.to_flags |= TOF_SACKPERM; #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (sc->sc_flags & SCF_SIGNATURE) to.to_flags |= TOF_SIGNATURE; #endif if (sc->sc_tfo_cookie) { to.to_flags |= TOF_FASTOPEN; to.to_tfo_len = TCP_FASTOPEN_COOKIE_LEN; to.to_tfo_cookie = sc->sc_tfo_cookie; /* don't send cookie again when retransmitting response */ sc->sc_tfo_cookie = NULL; } } if (sc->sc_flags & SCF_TIMESTAMP) { to.to_tsval = sc->sc_tsoff + tcp_ts_getticks(); to.to_tsecr = sc->sc_tsreflect; to.to_flags |= TOF_TS; } optlen = tcp_addoptions(&to, (u_char *)(th + 1)); /* Adjust headers by option size. */ th->th_off = (sizeof(struct tcphdr) + optlen) >> 2; m->m_len += optlen; m->m_pkthdr.len += optlen; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) ip6->ip6_plen = htons(ntohs(ip6->ip6_plen) + optlen); else #endif ip->ip_len = htons(ntohs(ip->ip_len) + optlen); #if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE) if (sc->sc_flags & SCF_SIGNATURE) { KASSERT(to.to_flags & TOF_SIGNATURE, ("tcp_addoptions() didn't set tcp_signature")); /* NOTE: to.to_signature is inside of mbuf */ if (!TCPMD5_ENABLED() || TCPMD5_OUTPUT(m, th, to.to_signature) != 0) { m_freem(m); return (EACCES); } } #endif } else optlen = 0; M_SETFIB(m, sc->sc_inc.inc_fibnum); m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum); /* * If we have peer's SYN and it has a flowid, then let's assign it to * our SYN|ACK. ip6_output() and ip_output() will not assign flowid * to SYN|ACK due to lack of inp here. 
*/ if (m0 != NULL && M_HASHTYPE_GET(m0) != M_HASHTYPE_NONE) { m->m_pkthdr.flowid = m0->m_pkthdr.flowid; M_HASHTYPE_SET(m, M_HASHTYPE_GET(m0)); } #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { m->m_pkthdr.csum_flags = CSUM_TCP_IPV6; th->th_sum = in6_cksum_pseudo(ip6, tlen + optlen - hlen, IPPROTO_TCP, 0); ip6->ip6_hlim = in6_selecthlim(NULL, NULL); #ifdef TCP_OFFLOAD if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; error = tod->tod_syncache_respond(tod, sc->sc_todctx, m); return (error); } #endif TCP_PROBE5(send, NULL, NULL, ip6, NULL, th); error = ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET { m->m_pkthdr.csum_flags = CSUM_TCP; th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(tlen + optlen - hlen + IPPROTO_TCP)); #ifdef TCP_OFFLOAD if (ADDED_BY_TOE(sc)) { struct toedev *tod = sc->sc_tod; error = tod->tod_syncache_respond(tod, sc->sc_todctx, m); return (error); } #endif TCP_PROBE5(send, NULL, NULL, ip, NULL, th); error = ip_output(m, sc->sc_ipopts, NULL, 0, NULL, NULL); } #endif return (error); } /* * The purpose of syncookies is to handle spoofed SYN flooding DoS attacks * that exceed the capacity of the syncache by avoiding the storage of any * of the SYNs we receive. Syncookies defend against blind SYN flooding * attacks where the attacker does not have access to our responses. * * Syncookies encode and include all necessary information about the * connection setup within the SYN|ACK that we send back. That way we * can avoid keeping any local state until the ACK to our SYN|ACK returns * (if ever). Normally the syncache and syncookies are running in parallel * with the latter taking over when the former is exhausted. When matching * syncache entry is found the syncookie is ignored. * * The only reliable information persisting the 3WHS is our initial sequence * number ISS of 32 bits. Syncookies embed a cryptographically sufficient * strong hash (MAC) value and a few bits of TCP SYN options in the ISS * of our SYN|ACK. The MAC can be recomputed when the ACK to our SYN|ACK * returns and signifies a legitimate connection if it matches the ACK. * * The available space of 32 bits to store the hash and to encode the SYN * option information is very tight and we should have at least 24 bits for * the MAC to keep the number of guesses by blind spoofing reasonably high. * * SYN option information we have to encode to fully restore a connection: * MSS: is imporant to chose an optimal segment size to avoid IP level * fragmentation along the path. The common MSS values can be encoded * in a 3-bit table. Uncommon values are captured by the next lower value * in the table leading to a slight increase in packetization overhead. * WSCALE: is necessary to allow large windows to be used for high delay- * bandwidth product links. Not scaling the window when it was initially * negotiated is bad for performance as lack of scaling further decreases * the apparent available send window. We only need to encode the WSCALE * we received from the remote end. Our end can be recalculated at any * time. The common WSCALE values can be encoded in a 3-bit table. * Uncommon values are captured by the next lower value in the table * making us under-estimate the available window size halving our * theoretically possible maximum throughput for that connection. * SACK: Greatly assists in packet loss recovery and requires 1 bit. 
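 * Together these option bits consume 3 (MSS index) + 3 (WSCALE index) +
 * 1 (SACK) + 1 (odd/even secret) = 8 bits, leaving exactly the remaining
 * 24 bits of the ISS for the truncated MAC.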
* TIMESTAMP and SIGNATURE is not encoded because they are permanent options * that are included in all segments on a connection. We enable them when * the ACK has them. * * Security of syncookies and attack vectors: * * The MAC is computed over (faddr||laddr||fport||lport||irs||flags||secmod) * together with the gloabl secret to make it unique per connection attempt. * Thus any change of any of those parameters results in a different MAC output * in an unpredictable way unless a collision is encountered. 24 bits of the * MAC are embedded into the ISS. * * To prevent replay attacks two rotating global secrets are updated with a * new random value every 15 seconds. The life-time of a syncookie is thus * 15-30 seconds. * * Vector 1: Attacking the secret. This requires finding a weakness in the * MAC itself or the way it is used here. The attacker can do a chosen plain * text attack by varying and testing the all parameters under his control. * The strength depends on the size and randomness of the secret, and the * cryptographic security of the MAC function. Due to the constant updating * of the secret the attacker has at most 29.999 seconds to find the secret * and launch spoofed connections. After that he has to start all over again. * * Vector 2: Collision attack on the MAC of a single ACK. With a 24 bit MAC * size an average of 4,823 attempts are required for a 50% chance of success * to spoof a single syncookie (birthday collision paradox). However the * attacker is blind and doesn't know if one of his attempts succeeded unless * he has a side channel to interfere success from. A single connection setup * success average of 90% requires 8,790 packets, 99.99% requires 17,578 packets. * This many attempts are required for each one blind spoofed connection. For * every additional spoofed connection he has to launch another N attempts. * Thus for a sustained rate 100 spoofed connections per second approximately * 1,800,000 packets per second would have to be sent. * * NB: The MAC function should be fast so that it doesn't become a CPU * exhaustion attack vector itself. * * References: * RFC4987 TCP SYN Flooding Attacks and Common Mitigations * SYN cookies were first proposed by cryptographer Dan J. Bernstein in 1996 * http://cr.yp.to/syncookies.html (overview) * http://cr.yp.to/syncookies/archive (details) * * * Schematic construction of a syncookie enabled Initial Sequence Number: * 0 1 2 3 * 12345678901234567890123456789012 * |xxxxxxxxxxxxxxxxxxxxxxxxWWWMMMSP| * * x 24 MAC (truncated) * W 3 Send Window Scale index * M 3 MSS index * S 1 SACK permitted * P 1 Odd/even secret */ /* * Distribution and probability of certain MSS values. Those in between are * rounded down to the next lower one. * [An Analysis of TCP Maximum Segment Sizes, S. Alcock and R. Nelson, 2011] * .2% .3% 5% 7% 7% 20% 15% 45% */ static int tcp_sc_msstab[] = { 216, 536, 1200, 1360, 1400, 1440, 1452, 1460 }; /* * Distribution and probability of certain WSCALE values. We have to map the * (send) window scale (shift) option with a range of 0-14 from 4 bits into 3 * bits based on prevalence of certain values. Where we don't have an exact * match for are rounded down to the next lower one letting us under-estimate * the true available window. At the moment this would happen only for the * very uncommon values 3, 5 and those above 8 (more than 16MB socket buffer * and window size). The absence of the WSCALE option (no scaling in either * direction) is encoded with index zero. 
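 * For example, a received WSCALE of 5 is encoded as index 4 (value 4) and
 * any value of 9 or above as index 7 (value 8), slightly under-estimating
 * the window those peers actually offer.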
* [WSCALE values histograms, Allman, 2012] * X 10 10 35 5 6 14 10% by host * X 11 4 5 5 18 49 3% by connections */ static int tcp_sc_wstab[] = { 0, 0, 1, 2, 4, 6, 7, 8 }; /* * Compute the MAC for the SYN cookie. SIPHASH-2-4 is chosen for its speed * and good cryptographic properties. */ static uint32_t syncookie_mac(struct in_conninfo *inc, tcp_seq irs, uint8_t flags, uint8_t *secbits, uintptr_t secmod) { SIPHASH_CTX ctx; uint32_t siphash[2]; SipHash24_Init(&ctx); SipHash_SetKey(&ctx, secbits); switch (inc->inc_flags & INC_ISIPV6) { #ifdef INET case 0: SipHash_Update(&ctx, &inc->inc_faddr, sizeof(inc->inc_faddr)); SipHash_Update(&ctx, &inc->inc_laddr, sizeof(inc->inc_laddr)); break; #endif #ifdef INET6 case INC_ISIPV6: SipHash_Update(&ctx, &inc->inc6_faddr, sizeof(inc->inc6_faddr)); SipHash_Update(&ctx, &inc->inc6_laddr, sizeof(inc->inc6_laddr)); break; #endif } SipHash_Update(&ctx, &inc->inc_fport, sizeof(inc->inc_fport)); SipHash_Update(&ctx, &inc->inc_lport, sizeof(inc->inc_lport)); SipHash_Update(&ctx, &irs, sizeof(irs)); SipHash_Update(&ctx, &flags, sizeof(flags)); SipHash_Update(&ctx, &secmod, sizeof(secmod)); SipHash_Final((u_int8_t *)&siphash, &ctx); return (siphash[0] ^ siphash[1]); } static tcp_seq syncookie_generate(struct syncache_head *sch, struct syncache *sc) { u_int i, secbit, wscale; uint32_t iss, hash; uint8_t *secbits; union syncookie cookie; SCH_LOCK_ASSERT(sch); cookie.cookie = 0; /* Map our computed MSS into the 3-bit index. */ for (i = nitems(tcp_sc_msstab) - 1; tcp_sc_msstab[i] > sc->sc_peer_mss && i > 0; i--) ; cookie.flags.mss_idx = i; /* * Map the send window scale into the 3-bit index but only if * the wscale option was received. */ if (sc->sc_flags & SCF_WINSCALE) { wscale = sc->sc_requested_s_scale; for (i = nitems(tcp_sc_wstab) - 1; tcp_sc_wstab[i] > wscale && i > 0; i--) ; cookie.flags.wscale_idx = i; } /* Can we do SACK? */ if (sc->sc_flags & SCF_SACK) cookie.flags.sack_ok = 1; /* Which of the two secrets to use. */ secbit = sch->sch_sc->secret.oddeven & 0x1; cookie.flags.odd_even = secbit; secbits = sch->sch_sc->secret.key[secbit]; hash = syncookie_mac(&sc->sc_inc, sc->sc_irs, cookie.cookie, secbits, (uintptr_t)sch); /* * Put the flags into the hash and XOR them to get better ISS number * variance. This doesn't enhance the cryptographic strength and is * done to prevent the 8 cookie bits from showing up directly on the * wire. */ iss = hash & ~0xff; iss |= cookie.cookie ^ (hash >> 24); TCPSTAT_INC(tcps_sc_sendcookie); return (iss); } static struct syncache * syncookie_lookup(struct in_conninfo *inc, struct syncache_head *sch, struct syncache *sc, struct tcphdr *th, struct tcpopt *to, struct socket *lso) { uint32_t hash; uint8_t *secbits; tcp_seq ack, seq; int wnd, wscale = 0; union syncookie cookie; SCH_LOCK_ASSERT(sch); /* * Pull information out of SYN-ACK/ACK and revert sequence number * advances. */ ack = th->th_ack - 1; seq = th->th_seq - 1; /* * Unpack the flags containing enough information to restore the * connection. */ cookie.cookie = (ack & 0xff) ^ (ack >> 24); /* Which of the two secrets to use. */ secbits = sch->sch_sc->secret.key[cookie.flags.odd_even]; hash = syncookie_mac(inc, seq, cookie.cookie, secbits, (uintptr_t)sch); /* The recomputed hash matches the ACK if this was a genuine cookie. */ if ((ack & ~0xff) != (hash & ~0xff)) return (NULL); /* Fill in the syncache values. 
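 * Everything below is reconstructed from the 8 cookie bits, the segment
 * itself and the listening socket, since no per-connection state was kept.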
*/ sc->sc_flags = 0; bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo)); sc->sc_ipopts = NULL; sc->sc_irs = seq; sc->sc_iss = ack; switch (inc->inc_flags & INC_ISIPV6) { #ifdef INET case 0: sc->sc_ip_ttl = sotoinpcb(lso)->inp_ip_ttl; sc->sc_ip_tos = sotoinpcb(lso)->inp_ip_tos; break; #endif #ifdef INET6 case INC_ISIPV6: if (sotoinpcb(lso)->inp_flags & IN6P_AUTOFLOWLABEL) sc->sc_flowlabel = sc->sc_iss & IPV6_FLOWLABEL_MASK; break; #endif } sc->sc_peer_mss = tcp_sc_msstab[cookie.flags.mss_idx]; /* We can simply recompute receive window scale we sent earlier. */ while (wscale < TCP_MAX_WINSHIFT && (TCP_MAXWIN << wscale) < sb_max) wscale++; /* Only use wscale if it was enabled in the orignal SYN. */ if (cookie.flags.wscale_idx > 0) { sc->sc_requested_r_scale = wscale; sc->sc_requested_s_scale = tcp_sc_wstab[cookie.flags.wscale_idx]; sc->sc_flags |= SCF_WINSCALE; } wnd = lso->sol_sbrcv_hiwat; wnd = imax(wnd, 0); wnd = imin(wnd, TCP_MAXWIN); sc->sc_wnd = wnd; if (cookie.flags.sack_ok) sc->sc_flags |= SCF_SACK; if (to->to_flags & TOF_TS) { sc->sc_flags |= SCF_TIMESTAMP; sc->sc_tsreflect = to->to_tsval; sc->sc_tsoff = tcp_new_ts_offset(inc); } if (to->to_flags & TOF_SIGNATURE) sc->sc_flags |= SCF_SIGNATURE; sc->sc_rxmits = 0; TCPSTAT_INC(tcps_sc_recvcookie); return (sc); } #ifdef INVARIANTS static int syncookie_cmp(struct in_conninfo *inc, struct syncache_head *sch, struct syncache *sc, struct tcphdr *th, struct tcpopt *to, struct socket *lso) { struct syncache scs, *scx; char *s; bzero(&scs, sizeof(scs)); scx = syncookie_lookup(inc, sch, &scs, th, to, lso); if ((s = tcp_log_addrs(inc, th, NULL, NULL)) == NULL) return (0); if (scx != NULL) { if (sc->sc_peer_mss != scx->sc_peer_mss) log(LOG_DEBUG, "%s; %s: mss different %i vs %i\n", s, __func__, sc->sc_peer_mss, scx->sc_peer_mss); if (sc->sc_requested_r_scale != scx->sc_requested_r_scale) log(LOG_DEBUG, "%s; %s: rwscale different %i vs %i\n", s, __func__, sc->sc_requested_r_scale, scx->sc_requested_r_scale); if (sc->sc_requested_s_scale != scx->sc_requested_s_scale) log(LOG_DEBUG, "%s; %s: swscale different %i vs %i\n", s, __func__, sc->sc_requested_s_scale, scx->sc_requested_s_scale); if ((sc->sc_flags & SCF_SACK) != (scx->sc_flags & SCF_SACK)) log(LOG_DEBUG, "%s; %s: SACK different\n", s, __func__); } if (s != NULL) free(s, M_TCPLOG); return (0); } #endif /* INVARIANTS */ static void syncookie_reseed(void *arg) { struct tcp_syncache *sc = arg; uint8_t *secbits; int secbit; /* * Reseeding the secret doesn't have to be protected by a lock. * It only must be ensured that the new random values are visible * to all CPUs in a SMP environment. The atomic with release * semantics ensures that. */ secbit = (sc->secret.oddeven & 0x1) ? 0 : 1; secbits = sc->secret.key[secbit]; arc4rand(secbits, SYNCOOKIE_SECRET_SIZE, 0); atomic_add_rel_int(&sc->secret.oddeven, 1); /* Reschedule ourself. */ callout_schedule(&sc->secret.reseed, SYNCOOKIE_LIFETIME * hz); } /* * Exports the syncache entries to userland so that netstat can display * them alongside the other sockets. This function is intended to be * called only from tcp_pcblist. * * Due to concurrency on an active system, the number of pcbs exported * may have no relation to max_pcbs. max_pcbs merely indicates the * amount of space the caller allocated for this function to use. 
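 * Entries that the requesting credential may not see (cr_cansee()) are
 * skipped and do not count against max_pcbs.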
*/ int syncache_pcblist(struct sysctl_req *req, int max_pcbs, int *pcbs_exported) { struct xtcpcb xt; struct syncache *sc; struct syncache_head *sch; int count, error, i; for (count = 0, error = 0, i = 0; i < V_tcp_syncache.hashsize; i++) { sch = &V_tcp_syncache.hashbase[i]; SCH_LOCK(sch); TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) { if (count >= max_pcbs) { SCH_UNLOCK(sch); goto exit; } if (cr_cansee(req->td->td_ucred, sc->sc_cred) != 0) continue; bzero(&xt, sizeof(xt)); xt.xt_len = sizeof(xt); if (sc->sc_inc.inc_flags & INC_ISIPV6) xt.xt_inp.inp_vflag = INP_IPV6; else xt.xt_inp.inp_vflag = INP_IPV4; bcopy(&sc->sc_inc, &xt.xt_inp.inp_inc, sizeof (struct in_conninfo)); xt.t_state = TCPS_SYN_RECEIVED; xt.xt_inp.xi_socket.xso_protocol = IPPROTO_TCP; xt.xt_inp.xi_socket.xso_len = sizeof (struct xsocket); xt.xt_inp.xi_socket.so_type = SOCK_STREAM; xt.xt_inp.xi_socket.so_state = SS_ISCONNECTING; error = SYSCTL_OUT(req, &xt, sizeof xt); if (error) { SCH_UNLOCK(sch); goto exit; } count++; } SCH_UNLOCK(sch); } exit: *pcbs_exported = count; return error; } Index: user/ngie/bug-237403/sys/netinet6/ip6_gre.c =================================================================== --- user/ngie/bug-237403/sys/netinet6/ip6_gre.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet6/ip6_gre.c (revision 346926) @@ -1,346 +1,595 @@ /*- * Copyright (c) 2014, 2018 Andrey V. Elsukov * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET #include #include #endif +#include #include +#include #include +#include +#include #include #include #include #include VNET_DEFINE(int, ip6_gre_hlim) = IPV6_DEFHLIM; #define V_ip6_gre_hlim VNET(ip6_gre_hlim) SYSCTL_DECL(_net_inet6_ip6); SYSCTL_INT(_net_inet6_ip6, OID_AUTO, grehlim, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(ip6_gre_hlim), 0, "Default hop limit for encapsulated packets"); +struct in6_gre_socket { + struct gre_socket base; + struct in6_addr addr; /* scope zone id is embedded */ +}; +VNET_DEFINE_STATIC(struct gre_sockets *, ipv6_sockets) = NULL; VNET_DEFINE_STATIC(struct gre_list *, ipv6_hashtbl) = NULL; VNET_DEFINE_STATIC(struct gre_list *, ipv6_srchashtbl) = NULL; +#define V_ipv6_sockets VNET(ipv6_sockets) #define V_ipv6_hashtbl VNET(ipv6_hashtbl) #define V_ipv6_srchashtbl VNET(ipv6_srchashtbl) #define GRE_HASH(src, dst) (V_ipv6_hashtbl[\ in6_gre_hashval((src), (dst)) & (GRE_HASH_SIZE - 1)]) #define GRE_SRCHASH(src) (V_ipv6_srchashtbl[\ fnv_32_buf((src), sizeof(*src), FNV1_32_INIT) & (GRE_HASH_SIZE - 1)]) +#define GRE_SOCKHASH(src) (V_ipv6_sockets[\ + fnv_32_buf((src), sizeof(*src), FNV1_32_INIT) & (GRE_HASH_SIZE - 1)]) #define GRE_HASH_SC(sc) GRE_HASH(&(sc)->gre_oip6.ip6_src,\ &(sc)->gre_oip6.ip6_dst) static uint32_t in6_gre_hashval(const struct in6_addr *src, const struct in6_addr *dst) { uint32_t ret; ret = fnv_32_buf(src, sizeof(*src), FNV1_32_INIT); return (fnv_32_buf(dst, sizeof(*dst), ret)); } +static struct gre_socket* +in6_gre_lookup_socket(const struct in6_addr *addr) +{ + struct gre_socket *gs; + struct in6_gre_socket *s; + + CK_LIST_FOREACH(gs, &GRE_SOCKHASH(addr), chain) { + s = __containerof(gs, struct in6_gre_socket, base); + if (IN6_ARE_ADDR_EQUAL(&s->addr, addr)) + break; + } + return (gs); +} + static int in6_gre_checkdup(const struct gre_softc *sc, const struct in6_addr *src, - const struct in6_addr *dst) + const struct in6_addr *dst, uint32_t opts) { + struct gre_list *head; struct gre_softc *tmp; + struct gre_socket *gs; if (sc->gre_family == AF_INET6 && IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_src, src) && - IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_dst, dst)) + IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_dst, dst) && + (sc->gre_options & GRE_UDPENCAP) == (opts & GRE_UDPENCAP)) return (EEXIST); - CK_LIST_FOREACH(tmp, &GRE_HASH(src, dst), chain) { + if (opts & GRE_UDPENCAP) { + gs = in6_gre_lookup_socket(src); + if (gs == NULL) + return (0); + head = &gs->list; + } else + head = &GRE_HASH(src, dst); + + CK_LIST_FOREACH(tmp, head, chain) { if (tmp == sc) continue; if (IN6_ARE_ADDR_EQUAL(&tmp->gre_oip6.ip6_src, src) && IN6_ARE_ADDR_EQUAL(&tmp->gre_oip6.ip6_dst, dst)) return (EADDRNOTAVAIL); } return (0); } static int in6_gre_lookup(const struct mbuf *m, int off, int proto, void **arg) { const struct ip6_hdr *ip6; struct gre_softc *sc; if (V_ipv6_hashtbl == NULL) return (0); MPASS(in_epoch(net_epoch_preempt)); ip6 = mtod(m, const struct ip6_hdr *); CK_LIST_FOREACH(sc, &GRE_HASH(&ip6->ip6_dst, &ip6->ip6_src), chain) { /* * This is an inbound packet, its ip6_dst is source address * in softc. 
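 * So the hash bucket and the comparison below use (ip6_dst, ip6_src),
 * i.e. the tunnel's local/remote address pair in that order.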
*/ if (IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_src, &ip6->ip6_dst) && IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_dst, &ip6->ip6_src)) { if ((GRE2IFP(sc)->if_flags & IFF_UP) == 0) return (0); *arg = sc; return (ENCAP_DRV_LOOKUP); } } return (0); } /* * Check that ingress address belongs to local host. */ static void in6_gre_set_running(struct gre_softc *sc) { if (in6_localip(&sc->gre_oip6.ip6_src)) GRE2IFP(sc)->if_drv_flags |= IFF_DRV_RUNNING; else GRE2IFP(sc)->if_drv_flags &= ~IFF_DRV_RUNNING; } /* * ifaddr_event handler. * Clear IFF_DRV_RUNNING flag when ingress address disappears to prevent * source address spoofing. */ static void in6_gre_srcaddr(void *arg __unused, const struct sockaddr *sa, int event __unused) { const struct sockaddr_in6 *sin; struct gre_softc *sc; /* Check that VNET is ready */ if (V_ipv6_hashtbl == NULL) return; MPASS(in_epoch(net_epoch_preempt)); sin = (const struct sockaddr_in6 *)sa; CK_LIST_FOREACH(sc, &GRE_SRCHASH(&sin->sin6_addr), srchash) { if (IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_src, &sin->sin6_addr) == 0) continue; in6_gre_set_running(sc); } } static void +in6_gre_udp_input(struct mbuf *m, int off, struct inpcb *inp, + const struct sockaddr *sa, void *ctx) +{ + struct epoch_tracker et; + struct gre_socket *gs; + struct gre_softc *sc; + struct sockaddr_in6 dst; + + NET_EPOCH_ENTER(et); + /* + * udp_append() holds reference to inp, it is safe to check + * inp_flags2 without INP_RLOCK(). + * If socket was closed before we have entered NET_EPOCH section, + * INP_FREED flag should be set. Otherwise it should be safe to + * make access to ctx data, because gre_so will be freed by + * gre_sofree() via epoch_call(). + */ + if (__predict_false(inp->inp_flags2 & INP_FREED)) { + NET_EPOCH_EXIT(et); + m_freem(m); + return; + } + + gs = (struct gre_socket *)ctx; + dst = *(const struct sockaddr_in6 *)sa; + if (sa6_embedscope(&dst, 0)) { + NET_EPOCH_EXIT(et); + m_freem(m); + return; + } + CK_LIST_FOREACH(sc, &gs->list, chain) { + if (IN6_ARE_ADDR_EQUAL(&sc->gre_oip6.ip6_dst, &dst.sin6_addr)) + break; + } + if (sc != NULL && (GRE2IFP(sc)->if_flags & IFF_UP) != 0){ + gre_input(m, off + sizeof(struct udphdr), IPPROTO_UDP, sc); + NET_EPOCH_EXIT(et); + return; + } + m_freem(m); + NET_EPOCH_EXIT(et); +} + +static int +in6_gre_setup_socket(struct gre_softc *sc) +{ + struct sockopt sopt; + struct sockaddr_in6 sin6; + struct in6_gre_socket *s; + struct gre_socket *gs; + int error, value; + + /* + * NOTE: we are protected with gre_ioctl_sx lock. + * + * First check that socket is already configured. + * If so, check that source addres was not changed. + * If address is different, check that there are no other tunnels + * and close socket. + */ + gs = sc->gre_so; + if (gs != NULL) { + s = __containerof(gs, struct in6_gre_socket, base); + if (!IN6_ARE_ADDR_EQUAL(&s->addr, &sc->gre_oip6.ip6_src)) { + if (CK_LIST_EMPTY(&gs->list)) { + CK_LIST_REMOVE(gs, chain); + soclose(gs->so); + epoch_call(net_epoch_preempt, &gs->epoch_ctx, + gre_sofree); + } + gs = sc->gre_so = NULL; + } + } + + if (gs == NULL) { + /* + * Check that socket for given address is already + * configured. 
+ */ + gs = in6_gre_lookup_socket(&sc->gre_oip6.ip6_src); + if (gs == NULL) { + s = malloc(sizeof(*s), M_GRE, M_WAITOK | M_ZERO); + s->addr = sc->gre_oip6.ip6_src; + gs = &s->base; + + error = socreate(sc->gre_family, &gs->so, + SOCK_DGRAM, IPPROTO_UDP, curthread->td_ucred, + curthread); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot create socket: %d\n", error); + free(s, M_GRE); + return (error); + } + + error = udp_set_kernel_tunneling(gs->so, + in6_gre_udp_input, NULL, gs); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot set UDP tunneling: %d\n", error); + goto fail; + } + + memset(&sopt, 0, sizeof(sopt)); + sopt.sopt_dir = SOPT_SET; + sopt.sopt_level = IPPROTO_IPV6; + sopt.sopt_name = IPV6_BINDANY; + sopt.sopt_val = &value; + sopt.sopt_valsize = sizeof(value); + value = 1; + error = sosetopt(gs->so, &sopt); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot set IPV6_BINDANY opt: %d\n", + error); + goto fail; + } + + memset(&sin6, 0, sizeof(sin6)); + sin6.sin6_family = AF_INET6; + sin6.sin6_len = sizeof(sin6); + sin6.sin6_addr = sc->gre_oip6.ip6_src; + sin6.sin6_port = htons(GRE_UDPPORT); + error = sa6_recoverscope(&sin6); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot determine scope zone id: %d\n", + error); + goto fail; + } + error = sobind(gs->so, (struct sockaddr *)&sin6, + curthread); + if (error != 0) { + if_printf(GRE2IFP(sc), + "cannot bind socket: %d\n", error); + goto fail; + } + /* Add socket to the chain */ + CK_LIST_INSERT_HEAD( + &GRE_SOCKHASH(&sc->gre_oip6.ip6_src), gs, chain); + } + } + + /* Add softc to the socket's list */ + CK_LIST_INSERT_HEAD(&gs->list, sc, chain); + sc->gre_so = gs; + return (0); +fail: + soclose(gs->so); + free(s, M_GRE); + return (error); +} + +static int in6_gre_attach(struct gre_softc *sc) { + struct grehdr *gh; + int error; - sc->gre_hlen = sizeof(struct greip6); + if (sc->gre_options & GRE_UDPENCAP) { + sc->gre_csumflags = CSUM_UDP_IPV6; + sc->gre_hlen = sizeof(struct greudp6); + sc->gre_oip6.ip6_nxt = IPPROTO_UDP; + gh = &sc->gre_udp6hdr->gi6_gre; + gre_update_udphdr(sc, &sc->gre_udp6, + in6_cksum_pseudo(&sc->gre_oip6, 0, 0, 0)); + } else { + sc->gre_hlen = sizeof(struct greip6); + sc->gre_oip6.ip6_nxt = IPPROTO_GRE; + gh = &sc->gre_ip6hdr->gi6_gre; + } sc->gre_oip6.ip6_vfc = IPV6_VERSION; - sc->gre_oip6.ip6_nxt = IPPROTO_GRE; - gre_updatehdr(sc, &sc->gre_gi6hdr->gi6_gre); - CK_LIST_INSERT_HEAD(&GRE_HASH_SC(sc), sc, chain); + gre_update_hdr(sc, gh); + + /* + * If we return error, this means that sc is not linked, + * and caller should reset gre_family and free(sc->gre_hdr). + */ + if (sc->gre_options & GRE_UDPENCAP) { + error = in6_gre_setup_socket(sc); + if (error != 0) + return (error); + } else + CK_LIST_INSERT_HEAD(&GRE_HASH_SC(sc), sc, chain); CK_LIST_INSERT_HEAD(&GRE_SRCHASH(&sc->gre_oip6.ip6_src), sc, srchash); + + /* Set IFF_DRV_RUNNING if interface is ready */ + in6_gre_set_running(sc); + return (0); } -void +int in6_gre_setopts(struct gre_softc *sc, u_long cmd, uint32_t value) { + int error; - MPASS(cmd == GRESKEY || cmd == GRESOPTS); - /* NOTE: we are protected with gre_ioctl_sx lock */ + MPASS(cmd == GRESKEY || cmd == GRESOPTS || cmd == GRESPORT); MPASS(sc->gre_family == AF_INET6); + + /* + * If we are going to change encapsulation protocol, do check + * for duplicate tunnels. Return EEXIST here to do not confuse + * user. 
+ */ + if (cmd == GRESOPTS && + (sc->gre_options & GRE_UDPENCAP) != (value & GRE_UDPENCAP) && + in6_gre_checkdup(sc, &sc->gre_oip6.ip6_src, + &sc->gre_oip6.ip6_dst, value) == EADDRNOTAVAIL) + return (EEXIST); + CK_LIST_REMOVE(sc, chain); CK_LIST_REMOVE(sc, srchash); GRE_WAIT(); - if (cmd == GRESKEY) + switch (cmd) { + case GRESKEY: sc->gre_key = value; - else + break; + case GRESOPTS: sc->gre_options = value; - in6_gre_attach(sc); + break; + case GRESPORT: + sc->gre_port = value; + break; + } + error = in6_gre_attach(sc); + if (error != 0) { + sc->gre_family = 0; + free(sc->gre_hdr, M_GRE); + } + return (error); } int in6_gre_ioctl(struct gre_softc *sc, u_long cmd, caddr_t data) { struct in6_ifreq *ifr = (struct in6_ifreq *)data; struct sockaddr_in6 *dst, *src; struct ip6_hdr *ip6; int error; /* NOTE: we are protected with gre_ioctl_sx lock */ error = EINVAL; switch (cmd) { case SIOCSIFPHYADDR_IN6: src = &((struct in6_aliasreq *)data)->ifra_addr; dst = &((struct in6_aliasreq *)data)->ifra_dstaddr; /* sanity checks */ if (src->sin6_family != dst->sin6_family || src->sin6_family != AF_INET6 || src->sin6_len != dst->sin6_len || src->sin6_len != sizeof(*src)) break; if (IN6_IS_ADDR_UNSPECIFIED(&src->sin6_addr) || IN6_IS_ADDR_UNSPECIFIED(&dst->sin6_addr)) { error = EADDRNOTAVAIL; break; } /* * Check validity of the scope zone ID of the * addresses, and convert it into the kernel * internal form if necessary. */ if ((error = sa6_embedscope(src, 0)) != 0 || (error = sa6_embedscope(dst, 0)) != 0) break; if (V_ipv6_hashtbl == NULL) { V_ipv6_hashtbl = gre_hashinit(); V_ipv6_srchashtbl = gre_hashinit(); + V_ipv6_sockets = (struct gre_sockets *)gre_hashinit(); } error = in6_gre_checkdup(sc, &src->sin6_addr, - &dst->sin6_addr); + &dst->sin6_addr, sc->gre_options); if (error == EADDRNOTAVAIL) break; if (error == EEXIST) { /* Addresses are the same. Just return. */ error = 0; break; } - ip6 = malloc(sizeof(struct greip6) + 3 * sizeof(uint32_t), + ip6 = malloc(sizeof(struct greudp6) + 3 * sizeof(uint32_t), M_GRE, M_WAITOK | M_ZERO); ip6->ip6_src = src->sin6_addr; ip6->ip6_dst = dst->sin6_addr; if (sc->gre_family != 0) { /* Detach existing tunnel first */ CK_LIST_REMOVE(sc, chain); CK_LIST_REMOVE(sc, srchash); GRE_WAIT(); free(sc->gre_hdr, M_GRE); /* XXX: should we notify about link state change? */ } sc->gre_family = AF_INET6; sc->gre_hdr = ip6; sc->gre_oseq = 0; sc->gre_iseq = UINT32_MAX; - in6_gre_attach(sc); - in6_gre_set_running(sc); + error = in6_gre_attach(sc); + if (error != 0) { + sc->gre_family = 0; + free(sc->gre_hdr, M_GRE); + } break; case SIOCGIFPSRCADDR_IN6: case SIOCGIFPDSTADDR_IN6: if (sc->gre_family != AF_INET6) { error = EADDRNOTAVAIL; break; } src = (struct sockaddr_in6 *)&ifr->ifr_addr; memset(src, 0, sizeof(*src)); src->sin6_family = AF_INET6; src->sin6_len = sizeof(*src); src->sin6_addr = (cmd == SIOCGIFPSRCADDR_IN6) ? 
sc->gre_oip6.ip6_src: sc->gre_oip6.ip6_dst; error = prison_if(curthread->td_ucred, (struct sockaddr *)src); if (error == 0) error = sa6_recoverscope(src); if (error != 0) memset(src, 0, sizeof(*src)); break; } return (error); } int -in6_gre_output(struct mbuf *m, int af __unused, int hlen __unused) +in6_gre_output(struct mbuf *m, int af __unused, int hlen __unused, + uint32_t flowid) { struct greip6 *gi6; gi6 = mtod(m, struct greip6 *); gi6->gi6_ip6.ip6_hlim = V_ip6_gre_hlim; + gi6->gi6_ip6.ip6_flow |= flowid & IPV6_FLOWLABEL_MASK; return (ip6_output(m, NULL, NULL, IPV6_MINMTU, NULL, NULL, NULL)); } static const struct srcaddrtab *ipv6_srcaddrtab = NULL; static const struct encaptab *ecookie = NULL; static const struct encap_config ipv6_encap_cfg = { .proto = IPPROTO_GRE, .min_length = sizeof(struct greip6) + #ifdef INET sizeof(struct ip), #else sizeof(struct ip6_hdr), #endif .exact_match = ENCAP_DRV_LOOKUP, .lookup = in6_gre_lookup, .input = gre_input }; void in6_gre_init(void) { if (!IS_DEFAULT_VNET(curvnet)) return; ipv6_srcaddrtab = ip6_encap_register_srcaddr(in6_gre_srcaddr, NULL, M_WAITOK); ecookie = ip6_encap_attach(&ipv6_encap_cfg, NULL, M_WAITOK); } void in6_gre_uninit(void) { if (IS_DEFAULT_VNET(curvnet)) { ip6_encap_detach(ecookie); ip6_encap_unregister_srcaddr(ipv6_srcaddrtab); } if (V_ipv6_hashtbl != NULL) { gre_hashdestroy(V_ipv6_hashtbl); V_ipv6_hashtbl = NULL; GRE_WAIT(); gre_hashdestroy(V_ipv6_srchashtbl); + gre_hashdestroy((struct gre_list *)V_ipv6_sockets); } } Index: user/ngie/bug-237403/sys/netinet6/ip6_output.c =================================================================== --- user/ngie/bug-237403/sys/netinet6/ip6_output.c (revision 346925) +++ user/ngie/bug-237403/sys/netinet6/ip6_output.c (revision 346926) @@ -1,3146 +1,3149 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the project nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $KAME: ip6_output.c,v 1.279 2002/01/26 06:12:30 jinmei Exp $ */ /*- * Copyright (c) 1982, 1986, 1988, 1990, 1993 * The Regents of the University of California. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)ip_output.c 8.3 (Berkeley) 1/21/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ratelimit.h" #include "opt_ipsec.h" #include "opt_sctp.h" #include "opt_route.h" #include "opt_rss.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef SCTP #include #include #endif #include #include extern int in6_mcast_loop; struct ip6_exthdrs { struct mbuf *ip6e_ip6; struct mbuf *ip6e_hbh; struct mbuf *ip6e_dest1; struct mbuf *ip6e_rthdr; struct mbuf *ip6e_dest2; }; static MALLOC_DEFINE(M_IP6OPT, "ip6opt", "IPv6 options"); static int ip6_pcbopt(int, u_char *, int, struct ip6_pktopts **, struct ucred *, int); static int ip6_pcbopts(struct ip6_pktopts **, struct mbuf *, struct socket *, struct sockopt *); static int ip6_getpcbopt(struct inpcb *, int, struct sockopt *); static int ip6_setpktopt(int, u_char *, int, struct ip6_pktopts *, struct ucred *, int, int, int); static int ip6_copyexthdr(struct mbuf **, caddr_t, int); static int ip6_insertfraghdr(struct mbuf *, struct mbuf *, int, struct ip6_frag **); static int ip6_insert_jumboopt(struct ip6_exthdrs *, u_int32_t); static int ip6_splithdr(struct mbuf *, struct ip6_exthdrs *); static int ip6_getpmtu(struct route_in6 *, int, struct ifnet *, const struct in6_addr *, u_long *, int *, u_int, u_int); static int ip6_calcmtu(struct ifnet *, const struct in6_addr *, u_long, u_long *, int *, u_int); static int ip6_getpmtu_ctl(u_int, const struct in6_addr *, u_long *); static int copypktopts(struct ip6_pktopts *, struct ip6_pktopts *, int); /* * Make an extension header from option data. hp is the source, and * mp is the destination. 
*/ #define MAKE_EXTHDR(hp, mp) \ do { \ if (hp) { \ struct ip6_ext *eh = (struct ip6_ext *)(hp); \ error = ip6_copyexthdr((mp), (caddr_t)(hp), \ ((eh)->ip6e_len + 1) << 3); \ if (error) \ goto freehdrs; \ } \ } while (/*CONSTCOND*/ 0) /* * Form a chain of extension headers. * m is the extension header mbuf * mp is the previous mbuf in the chain * p is the next header * i is the type of option. */ #define MAKE_CHAIN(m, mp, p, i)\ do {\ if (m) {\ if (!hdrsplit) \ panic("assumption failed: hdr not split"); \ *mtod((m), u_char *) = *(p);\ *(p) = (i);\ p = mtod((m), u_char *);\ (m)->m_next = (mp)->m_next;\ (mp)->m_next = (m);\ (mp) = (m);\ }\ } while (/*CONSTCOND*/ 0) void in6_delayed_cksum(struct mbuf *m, uint32_t plen, u_short offset) { u_short csum; csum = in_cksum_skip(m, offset + plen, offset); if (m->m_pkthdr.csum_flags & CSUM_UDP_IPV6 && csum == 0) csum = 0xffff; offset += m->m_pkthdr.csum_data; /* checksum offset */ if (offset + sizeof(csum) > m->m_len) m_copyback(m, offset, sizeof(csum), (caddr_t)&csum); else *(u_short *)mtodo(m, offset) = csum; } int ip6_fragment(struct ifnet *ifp, struct mbuf *m0, int hlen, u_char nextproto, int fraglen , uint32_t id) { struct mbuf *m, **mnext, *m_frgpart; struct ip6_hdr *ip6, *mhip6; struct ip6_frag *ip6f; int off; int error; int tlen = m0->m_pkthdr.len; KASSERT((fraglen % 8 == 0), ("Fragment length must be a multiple of 8")); m = m0; ip6 = mtod(m, struct ip6_hdr *); mnext = &m->m_nextpkt; for (off = hlen; off < tlen; off += fraglen) { m = m_gethdr(M_NOWAIT, MT_DATA); if (!m) { IP6STAT_INC(ip6s_odropped); return (ENOBUFS); } m->m_flags = m0->m_flags & M_COPYFLAGS; *mnext = m; mnext = &m->m_nextpkt; m->m_data += max_linkhdr; mhip6 = mtod(m, struct ip6_hdr *); *mhip6 = *ip6; m->m_len = sizeof(*mhip6); error = ip6_insertfraghdr(m0, m, hlen, &ip6f); if (error) { IP6STAT_INC(ip6s_odropped); return (error); } ip6f->ip6f_offlg = htons((u_short)((off - hlen) & ~7)); if (off + fraglen >= tlen) fraglen = tlen - off; else ip6f->ip6f_offlg |= IP6F_MORE_FRAG; mhip6->ip6_plen = htons((u_short)(fraglen + hlen + sizeof(*ip6f) - sizeof(struct ip6_hdr))); if ((m_frgpart = m_copym(m0, off, fraglen, M_NOWAIT)) == NULL) { IP6STAT_INC(ip6s_odropped); return (ENOBUFS); } m_cat(m, m_frgpart); m->m_pkthdr.len = fraglen + hlen + sizeof(*ip6f); m->m_pkthdr.fibnum = m0->m_pkthdr.fibnum; m->m_pkthdr.rcvif = NULL; ip6f->ip6f_reserved = 0; ip6f->ip6f_ident = id; ip6f->ip6f_nxt = nextproto; IP6STAT_INC(ip6s_ofragments); in6_ifstat_inc(ifp, ifs6_out_fragcreat); } return (0); } /* * IP6 output. The packet in mbuf chain m contains a skeletal IP6 * header (with pri, len, nxt, hlim, src, dst). * This function may modify ver and hlim only. * The mbuf chain containing the packet will be freed. * The mbuf opt, if present, will not be freed. * If route_in6 ro is present and has ro_rt initialized, route lookup would be * skipped and ro->ro_rt would be used. If ro is present but ro->ro_rt is NULL, * then result of route lookup is stored in ro->ro_rt. * * type of "mtu": rt_mtu is u_long, ifnet.ifr_mtu is int, and * nd_ifinfo.linkmtu is u_int32_t. so we use u_long to hold largest one, * which is rt_mtu. * * ifpp - XXX: just for statistics */ /* * XXX TODO: no flowid is assigned for outbound flows? 
*/ int ip6_output(struct mbuf *m0, struct ip6_pktopts *opt, struct route_in6 *ro, int flags, struct ip6_moptions *im6o, struct ifnet **ifpp, struct inpcb *inp) { struct ip6_hdr *ip6; struct ifnet *ifp, *origifp; struct mbuf *m = m0; struct mbuf *mprev = NULL; int hlen, tlen, len; struct route_in6 ip6route; struct rtentry *rt = NULL; struct sockaddr_in6 *dst, src_sa, dst_sa; struct in6_addr odst; int error = 0; struct in6_ifaddr *ia = NULL; u_long mtu; int alwaysfrag, dontfrag; u_int32_t optlen = 0, plen = 0, unfragpartlen = 0; struct ip6_exthdrs exthdrs; struct in6_addr src0, dst0; u_int32_t zone; struct route_in6 *ro_pmtu = NULL; int hdrsplit = 0; int sw_csum, tso; int needfiblookup; uint32_t fibnum; struct m_tag *fwd_tag = NULL; uint32_t id; if (inp != NULL) { INP_LOCK_ASSERT(inp); M_SETFIB(m, inp->inp_inc.inc_fibnum); if ((flags & IP_NODEFAULTFLOWID) == 0) { /* unconditionally set flowid */ m->m_pkthdr.flowid = inp->inp_flowid; M_HASHTYPE_SET(m, inp->inp_flowtype); } +#ifdef NUMA + m->m_pkthdr.numa_domain = inp->inp_numa_domain; +#endif } #if defined(IPSEC) || defined(IPSEC_SUPPORT) /* * IPSec checking which handles several cases. * FAST IPSEC: We re-injected the packet. * XXX: need scope argument. */ if (IPSEC_ENABLED(ipv6)) { if ((error = IPSEC_OUTPUT(ipv6, m, inp)) != 0) { if (error == EINPROGRESS) error = 0; goto done; } } #endif /* IPSEC */ bzero(&exthdrs, sizeof(exthdrs)); if (opt) { /* Hop-by-Hop options header */ MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh); /* Destination options header(1st part) */ if (opt->ip6po_rthdr) { /* * Destination options header(1st part) * This only makes sense with a routing header. * See Section 9.2 of RFC 3542. * Disabling this part just for MIP6 convenience is * a bad idea. We need to think carefully about a * way to make the advanced API coexist with MIP6 * options, which might automatically be inserted in * the kernel. */ MAKE_EXTHDR(opt->ip6po_dest1, &exthdrs.ip6e_dest1); } /* Routing header */ MAKE_EXTHDR(opt->ip6po_rthdr, &exthdrs.ip6e_rthdr); /* Destination options header(2nd part) */ MAKE_EXTHDR(opt->ip6po_dest2, &exthdrs.ip6e_dest2); } /* * Calculate the total length of the extension header chain. * Keep the length of the unfragmentable part for fragmentation. */ optlen = 0; if (exthdrs.ip6e_hbh) optlen += exthdrs.ip6e_hbh->m_len; if (exthdrs.ip6e_dest1) optlen += exthdrs.ip6e_dest1->m_len; if (exthdrs.ip6e_rthdr) optlen += exthdrs.ip6e_rthdr->m_len; unfragpartlen = optlen + sizeof(struct ip6_hdr); /* NOTE: we don't add AH/ESP length here (done in ip6_ipsec_output) */ if (exthdrs.ip6e_dest2) optlen += exthdrs.ip6e_dest2->m_len; /* * If there is at least one extension header, * separate IP6 header from the payload. */ if (optlen && !hdrsplit) { if ((error = ip6_splithdr(m, &exthdrs)) != 0) { m = NULL; goto freehdrs; } m = exthdrs.ip6e_ip6; hdrsplit++; } ip6 = mtod(m, struct ip6_hdr *); /* adjust mbuf packet header length */ m->m_pkthdr.len += optlen; plen = m->m_pkthdr.len - sizeof(*ip6); /* If this is a jumbo payload, insert a jumbo payload option. */ if (plen > IPV6_MAXPACKET) { if (!hdrsplit) { if ((error = ip6_splithdr(m, &exthdrs)) != 0) { m = NULL; goto freehdrs; } m = exthdrs.ip6e_ip6; hdrsplit++; } /* adjust pointer */ ip6 = mtod(m, struct ip6_hdr *); if ((error = ip6_insert_jumboopt(&exthdrs, plen)) != 0) goto freehdrs; ip6->ip6_plen = 0; } else ip6->ip6_plen = htons(plen); /* * Concatenate headers and fill in next header fields. * Here we have, on "m" * IPv6 payload * and we insert headers accordingly. 
Finally, we should be getting: * IPv6 hbh dest1 rthdr ah* [esp* dest2 payload] * * during the header composing process, "m" points to IPv6 header. * "mprev" points to an extension header prior to esp. */ u_char *nexthdrp = &ip6->ip6_nxt; mprev = m; /* * we treat dest2 specially. this makes IPsec processing * much easier. the goal here is to make mprev point the * mbuf prior to dest2. * * result: IPv6 dest2 payload * m and mprev will point to IPv6 header. */ if (exthdrs.ip6e_dest2) { if (!hdrsplit) panic("assumption failed: hdr not split"); exthdrs.ip6e_dest2->m_next = m->m_next; m->m_next = exthdrs.ip6e_dest2; *mtod(exthdrs.ip6e_dest2, u_char *) = ip6->ip6_nxt; ip6->ip6_nxt = IPPROTO_DSTOPTS; } /* * result: IPv6 hbh dest1 rthdr dest2 payload * m will point to IPv6 header. mprev will point to the * extension header prior to dest2 (rthdr in the above case). */ MAKE_CHAIN(exthdrs.ip6e_hbh, mprev, nexthdrp, IPPROTO_HOPOPTS); MAKE_CHAIN(exthdrs.ip6e_dest1, mprev, nexthdrp, IPPROTO_DSTOPTS); MAKE_CHAIN(exthdrs.ip6e_rthdr, mprev, nexthdrp, IPPROTO_ROUTING); /* * If there is a routing header, discard the packet. */ if (exthdrs.ip6e_rthdr) { error = EINVAL; goto bad; } /* Source address validation */ if (IN6_IS_ADDR_UNSPECIFIED(&ip6->ip6_src) && (flags & IPV6_UNSPECSRC) == 0) { error = EOPNOTSUPP; IP6STAT_INC(ip6s_badscope); goto bad; } if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_src)) { error = EOPNOTSUPP; IP6STAT_INC(ip6s_badscope); goto bad; } IP6STAT_INC(ip6s_localout); /* * Route packet. */ if (ro == NULL) { ro = &ip6route; bzero((caddr_t)ro, sizeof(*ro)); } ro_pmtu = ro; if (opt && opt->ip6po_rthdr) ro = &opt->ip6po_route; dst = (struct sockaddr_in6 *)&ro->ro_dst; fibnum = (inp != NULL) ? inp->inp_inc.inc_fibnum : M_GETFIB(m); again: /* * if specified, try to fill in the traffic class field. * do not override if a non-zero value is already set. * we check the diffserv field and the ecn field separately. */ if (opt && opt->ip6po_tclass >= 0) { int mask = 0; if ((ip6->ip6_flow & htonl(0xfc << 20)) == 0) mask |= 0xfc; if ((ip6->ip6_flow & htonl(0x03 << 20)) == 0) mask |= 0x03; if (mask != 0) ip6->ip6_flow |= htonl((opt->ip6po_tclass & mask) << 20); } /* fill in or override the hop limit field, if necessary. */ if (opt && opt->ip6po_hlim != -1) ip6->ip6_hlim = opt->ip6po_hlim & 0xff; else if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) { if (im6o != NULL) ip6->ip6_hlim = im6o->im6o_multicast_hlim; else ip6->ip6_hlim = V_ip6_defmcasthlim; } /* * Validate route against routing table additions; * a better/more specific route might have been added. * Make sure address family is set in route. */ if (inp) { ro->ro_dst.sin6_family = AF_INET6; RT_VALIDATE((struct route *)ro, &inp->inp_rt_cookie, fibnum); } if (ro->ro_rt && fwd_tag == NULL && (ro->ro_rt->rt_flags & RTF_UP) && ro->ro_dst.sin6_family == AF_INET6 && IN6_ARE_ADDR_EQUAL(&ro->ro_dst.sin6_addr, &ip6->ip6_dst)) { rt = ro->ro_rt; ifp = ro->ro_rt->rt_ifp; } else { if (ro->ro_lle) LLE_FREE(ro->ro_lle); /* zeros ro_lle */ ro->ro_lle = NULL; if (fwd_tag == NULL) { bzero(&dst_sa, sizeof(dst_sa)); dst_sa.sin6_family = AF_INET6; dst_sa.sin6_len = sizeof(dst_sa); dst_sa.sin6_addr = ip6->ip6_dst; } error = in6_selectroute_fib(&dst_sa, opt, im6o, ro, &ifp, &rt, fibnum); if (error != 0) { if (ifp != NULL) in6_ifstat_inc(ifp, ifs6_out_discard); goto bad; } } if (rt == NULL) { /* * If in6_selectroute() does not return a route entry, * dst may not have been updated. */ *dst = dst_sa; /* XXX */ } /* * then rt (for unicast) and ifp must be non-NULL valid values. 
*/ if ((flags & IPV6_FORWARDING) == 0) { /* XXX: the FORWARDING flag can be set for mrouting. */ in6_ifstat_inc(ifp, ifs6_out_request); } if (rt != NULL) { ia = (struct in6_ifaddr *)(rt->rt_ifa); counter_u64_add(rt->rt_pksent, 1); } /* Setup data structures for scope ID checks. */ src0 = ip6->ip6_src; bzero(&src_sa, sizeof(src_sa)); src_sa.sin6_family = AF_INET6; src_sa.sin6_len = sizeof(src_sa); src_sa.sin6_addr = ip6->ip6_src; dst0 = ip6->ip6_dst; /* re-initialize to be sure */ bzero(&dst_sa, sizeof(dst_sa)); dst_sa.sin6_family = AF_INET6; dst_sa.sin6_len = sizeof(dst_sa); dst_sa.sin6_addr = ip6->ip6_dst; /* Check for valid scope ID. */ if (in6_setscope(&src0, ifp, &zone) == 0 && sa6_recoverscope(&src_sa) == 0 && zone == src_sa.sin6_scope_id && in6_setscope(&dst0, ifp, &zone) == 0 && sa6_recoverscope(&dst_sa) == 0 && zone == dst_sa.sin6_scope_id) { /* * The outgoing interface is in the zone of the source * and destination addresses. * * Because the loopback interface cannot receive * packets with a different scope ID than its own, * there is a trick is to pretend the outgoing packet * was received by the real network interface, by * setting "origifp" different from "ifp". This is * only allowed when "ifp" is a loopback network * interface. Refer to code in nd6_output_ifp() for * more details. */ origifp = ifp; /* * We should use ia_ifp to support the case of sending * packets to an address of our own. */ if (ia != NULL && ia->ia_ifp) ifp = ia->ia_ifp; } else if ((ifp->if_flags & IFF_LOOPBACK) == 0 || sa6_recoverscope(&src_sa) != 0 || sa6_recoverscope(&dst_sa) != 0 || dst_sa.sin6_scope_id == 0 || (src_sa.sin6_scope_id != 0 && src_sa.sin6_scope_id != dst_sa.sin6_scope_id) || (origifp = ifnet_byindex(dst_sa.sin6_scope_id)) == NULL) { /* * If the destination network interface is not a * loopback interface, or the destination network * address has no scope ID, or the source address has * a scope ID set which is different from the * destination address one, or there is no network * interface representing this scope ID, the address * pair is considered invalid. */ IP6STAT_INC(ip6s_badscope); in6_ifstat_inc(ifp, ifs6_out_discard); if (error == 0) error = EHOSTUNREACH; /* XXX */ goto bad; } /* All scope ID checks are successful. */ if (rt && !IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) { if (opt && opt->ip6po_nextroute.ro_rt) { /* * The nexthop is explicitly specified by the * application. We assume the next hop is an IPv6 * address. */ dst = (struct sockaddr_in6 *)opt->ip6po_nexthop; } else if ((rt->rt_flags & RTF_GATEWAY)) dst = (struct sockaddr_in6 *)rt->rt_gateway; } if (!IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) { m->m_flags &= ~(M_BCAST | M_MCAST); /* just in case */ } else { m->m_flags = (m->m_flags & ~M_BCAST) | M_MCAST; in6_ifstat_inc(ifp, ifs6_out_mcast); /* * Confirm that the outgoing interface supports multicast. */ if (!(ifp->if_flags & IFF_MULTICAST)) { IP6STAT_INC(ip6s_noroute); in6_ifstat_inc(ifp, ifs6_out_discard); error = ENETUNREACH; goto bad; } if ((im6o == NULL && in6_mcast_loop) || (im6o && im6o->im6o_multicast_loop)) { /* * Loop back multicast datagram if not expressly * forbidden to do so, even if we have not joined * the address; protocols will filter it later, * thus deferring a hash lookup and lock acquisition * at the expense of an m_copym(). */ ip6_mloopback(ifp, m); } else { /* * If we are acting as a multicast router, perform * multicast forwarding as if the packet had just * arrived on the interface to which we are about * to send. 
The multicast forwarding function * recursively calls this function, using the * IPV6_FORWARDING flag to prevent infinite recursion. * * Multicasts that are looped back by ip6_mloopback(), * above, will be forwarded by the ip6_input() routine, * if necessary. */ if (V_ip6_mrouter && (flags & IPV6_FORWARDING) == 0) { /* * XXX: ip6_mforward expects that rcvif is NULL * when it is called from the originating path. * However, it may not always be the case. */ m->m_pkthdr.rcvif = NULL; if (ip6_mforward(ip6, ifp, m) != 0) { m_freem(m); goto done; } } } /* * Multicasts with a hoplimit of zero may be looped back, * above, but must not be transmitted on a network. * Also, multicasts addressed to the loopback interface * are not sent -- the above call to ip6_mloopback() will * loop back a copy if this host actually belongs to the * destination group on the loopback interface. */ if (ip6->ip6_hlim == 0 || (ifp->if_flags & IFF_LOOPBACK) || IN6_IS_ADDR_MC_INTFACELOCAL(&ip6->ip6_dst)) { m_freem(m); goto done; } } /* * Fill the outgoing inteface to tell the upper layer * to increment per-interface statistics. */ if (ifpp) *ifpp = ifp; /* Determine path MTU. */ if ((error = ip6_getpmtu(ro_pmtu, ro != ro_pmtu, ifp, &ip6->ip6_dst, &mtu, &alwaysfrag, fibnum, *nexthdrp)) != 0) goto bad; /* * The caller of this function may specify to use the minimum MTU * in some cases. * An advanced API option (IPV6_USE_MIN_MTU) can also override MTU * setting. The logic is a bit complicated; by default, unicast * packets will follow path MTU while multicast packets will be sent at * the minimum MTU. If IP6PO_MINMTU_ALL is specified, all packets * including unicast ones will be sent at the minimum MTU. Multicast * packets will always be sent at the minimum MTU unless * IP6PO_MINMTU_DISABLE is explicitly specified. * See RFC 3542 for more details. */ if (mtu > IPV6_MMTU) { if ((flags & IPV6_MINMTU)) mtu = IPV6_MMTU; else if (opt && opt->ip6po_minmtu == IP6PO_MINMTU_ALL) mtu = IPV6_MMTU; else if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst) && (opt == NULL || opt->ip6po_minmtu != IP6PO_MINMTU_DISABLE)) { mtu = IPV6_MMTU; } } /* * clear embedded scope identifiers if necessary. * in6_clearscope will touch the addresses only when necessary. */ in6_clearscope(&ip6->ip6_src); in6_clearscope(&ip6->ip6_dst); /* * If the outgoing packet contains a hop-by-hop options header, * it must be examined and processed even by the source node. * (RFC 2460, section 4.) */ if (exthdrs.ip6e_hbh) { struct ip6_hbh *hbh = mtod(exthdrs.ip6e_hbh, struct ip6_hbh *); u_int32_t dummy; /* XXX unused */ u_int32_t plen = 0; /* XXX: ip6_process will check the value */ #ifdef DIAGNOSTIC if ((hbh->ip6h_len + 1) << 3 > exthdrs.ip6e_hbh->m_len) panic("ip6e_hbh is not contiguous"); #endif /* * XXX: if we have to send an ICMPv6 error to the sender, * we need the M_LOOP flag since icmp6_error() expects * the IPv6 and the hop-by-hop options header are * contiguous unless the flag is set. */ m->m_flags |= M_LOOP; m->m_pkthdr.rcvif = ifp; if (ip6_process_hopopts(m, (u_int8_t *)(hbh + 1), ((hbh->ip6h_len + 1) << 3) - sizeof(struct ip6_hbh), &dummy, &plen) < 0) { /* m was already freed at this point */ error = EINVAL;/* better error? */ goto done; } m->m_flags &= ~M_LOOP; /* XXX */ m->m_pkthdr.rcvif = NULL; } /* Jump over all PFIL processing if hooks are not active. */ if (!PFIL_HOOKED_OUT(V_inet6_pfil_head)) goto passout; odst = ip6->ip6_dst; /* Run through list of hooks for output packets. 
*/ switch (pfil_run_hooks(V_inet6_pfil_head, &m, ifp, PFIL_OUT, inp)) { case PFIL_PASS: ip6 = mtod(m, struct ip6_hdr *); break; case PFIL_DROPPED: error = EPERM; /* FALLTHROUGH */ case PFIL_CONSUMED: goto done; } needfiblookup = 0; /* See if destination IP address was changed by packet filter. */ if (!IN6_ARE_ADDR_EQUAL(&odst, &ip6->ip6_dst)) { m->m_flags |= M_SKIP_FIREWALL; /* If destination is now ourself drop to ip6_input(). */ if (in6_localip(&ip6->ip6_dst)) { m->m_flags |= M_FASTFWD_OURS; if (m->m_pkthdr.rcvif == NULL) m->m_pkthdr.rcvif = V_loif; if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) { m->m_pkthdr.csum_flags |= CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6) m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID; #endif error = netisr_queue(NETISR_IPV6, m); goto done; } else { RO_INVALIDATE_CACHE(ro); needfiblookup = 1; /* Redo the routing table lookup. */ } } /* See if fib was changed by packet filter. */ if (fibnum != M_GETFIB(m)) { m->m_flags |= M_SKIP_FIREWALL; fibnum = M_GETFIB(m); RO_INVALIDATE_CACHE(ro); needfiblookup = 1; } if (needfiblookup) goto again; /* See if local, if yes, send it to netisr. */ if (m->m_flags & M_FASTFWD_OURS) { if (m->m_pkthdr.rcvif == NULL) m->m_pkthdr.rcvif = V_loif; if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) { m->m_pkthdr.csum_flags |= CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR; m->m_pkthdr.csum_data = 0xffff; } #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6) m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID; #endif error = netisr_queue(NETISR_IPV6, m); goto done; } /* Or forward to some other address? */ if ((m->m_flags & M_IP6_NEXTHOP) && (fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL)) != NULL) { dst = (struct sockaddr_in6 *)&ro->ro_dst; bcopy((fwd_tag+1), &dst_sa, sizeof(struct sockaddr_in6)); m->m_flags |= M_SKIP_FIREWALL; m->m_flags &= ~M_IP6_NEXTHOP; m_tag_delete(m, fwd_tag); goto again; } passout: /* * Send the packet to the outgoing interface. * If necessary, do IPv6 fragmentation before sending. * * the logic here is rather complex: * 1: normal case (dontfrag == 0, alwaysfrag == 0) * 1-a: send as is if tlen <= path mtu * 1-b: fragment if tlen > path mtu * * 2: if user asks us not to fragment (dontfrag == 1) * 2-a: send as is if tlen <= interface mtu * 2-b: error if tlen > interface mtu * * 3: if we always need to attach fragment header (alwaysfrag == 1) * always fragment * * 4: if dontfrag == 1 && alwaysfrag == 1 * error, as we cannot handle this conflicting request */ sw_csum = m->m_pkthdr.csum_flags; if (!hdrsplit) { tso = ((sw_csum & ifp->if_hwassist & CSUM_TSO) != 0) ? 1 : 0; sw_csum &= ~ifp->if_hwassist; } else tso = 0; /* * If we added extension headers, we will not do TSO and calculate the * checksums ourselves for now. * XXX-BZ Need a framework to know when the NIC can handle it, even * with ext. hdrs. 
*/ if (sw_csum & CSUM_DELAY_DATA_IPV6) { sw_csum &= ~CSUM_DELAY_DATA_IPV6; in6_delayed_cksum(m, plen, sizeof(struct ip6_hdr)); } #ifdef SCTP if (sw_csum & CSUM_SCTP_IPV6) { sw_csum &= ~CSUM_SCTP_IPV6; sctp_delayed_cksum(m, sizeof(struct ip6_hdr)); } #endif m->m_pkthdr.csum_flags &= ifp->if_hwassist; tlen = m->m_pkthdr.len; if ((opt && (opt->ip6po_flags & IP6PO_DONTFRAG)) || tso) dontfrag = 1; else dontfrag = 0; if (dontfrag && alwaysfrag) { /* case 4 */ /* conflicting request - can't transmit */ error = EMSGSIZE; goto bad; } if (dontfrag && tlen > IN6_LINKMTU(ifp) && !tso) { /* case 2-b */ /* * Even if the DONTFRAG option is specified, we cannot send the * packet when the data length is larger than the MTU of the * outgoing interface. * Notify the error by sending IPV6_PATHMTU ancillary data if * application wanted to know the MTU value. Also return an * error code (this is not described in the API spec). */ if (inp != NULL) ip6_notify_pmtu(inp, &dst_sa, (u_int32_t)mtu); error = EMSGSIZE; goto bad; } /* * transmit packet without fragmentation */ if (dontfrag || (!alwaysfrag && tlen <= mtu)) { /* case 1-a and 2-a */ struct in6_ifaddr *ia6; ip6 = mtod(m, struct ip6_hdr *); ia6 = in6_ifawithifp(ifp, &ip6->ip6_src); if (ia6) { /* Record statistics for this interface address. */ counter_u64_add(ia6->ia_ifa.ifa_opackets, 1); counter_u64_add(ia6->ia_ifa.ifa_obytes, m->m_pkthdr.len); ifa_free(&ia6->ia_ifa); } #ifdef RATELIMIT if (inp != NULL) { if (inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) in_pcboutput_txrtlmt(inp, ifp, m); /* stamp send tag on mbuf */ m->m_pkthdr.snd_tag = inp->inp_snd_tag; } else { m->m_pkthdr.snd_tag = NULL; } #endif error = nd6_output_ifp(ifp, origifp, m, dst, (struct route *)ro); #ifdef RATELIMIT /* check for route change */ if (error == EAGAIN) in_pcboutput_eagain(inp); #endif goto done; } /* * try to fragment the packet. case 1-b and 3 */ if (mtu < IPV6_MMTU) { /* path MTU cannot be less than IPV6_MMTU */ error = EMSGSIZE; in6_ifstat_inc(ifp, ifs6_out_fragfail); goto bad; } else if (ip6->ip6_plen == 0) { /* jumbo payload cannot be fragmented */ error = EMSGSIZE; in6_ifstat_inc(ifp, ifs6_out_fragfail); goto bad; } else { u_char nextproto; /* * Too large for the destination or interface; * fragment if possible. * Must be able to put at least 8 bytes per fragment. */ hlen = unfragpartlen; if (mtu > IPV6_MAXPACKET) mtu = IPV6_MAXPACKET; len = (mtu - hlen - sizeof(struct ip6_frag)) & ~7; if (len < 8) { error = EMSGSIZE; in6_ifstat_inc(ifp, ifs6_out_fragfail); goto bad; } /* * If the interface will not calculate checksums on * fragmented packets, then do it here. * XXX-BZ handle the hw offloading case. Need flags. */ if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) { in6_delayed_cksum(m, plen, hlen); m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA_IPV6; } #ifdef SCTP if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6) { sctp_delayed_cksum(m, hlen); m->m_pkthdr.csum_flags &= ~CSUM_SCTP_IPV6; } #endif /* * Change the next header field of the last header in the * unfragmentable part. 
*/ if (exthdrs.ip6e_rthdr) { nextproto = *mtod(exthdrs.ip6e_rthdr, u_char *); *mtod(exthdrs.ip6e_rthdr, u_char *) = IPPROTO_FRAGMENT; } else if (exthdrs.ip6e_dest1) { nextproto = *mtod(exthdrs.ip6e_dest1, u_char *); *mtod(exthdrs.ip6e_dest1, u_char *) = IPPROTO_FRAGMENT; } else if (exthdrs.ip6e_hbh) { nextproto = *mtod(exthdrs.ip6e_hbh, u_char *); *mtod(exthdrs.ip6e_hbh, u_char *) = IPPROTO_FRAGMENT; } else { nextproto = ip6->ip6_nxt; ip6->ip6_nxt = IPPROTO_FRAGMENT; } /* * Loop through length of segment after first fragment, * make new header and copy data of each part and link onto * chain. */ m0 = m; id = htonl(ip6_randomid()); if ((error = ip6_fragment(ifp, m, hlen, nextproto, len, id))) goto sendorfree; in6_ifstat_inc(ifp, ifs6_out_fragok); } /* * Remove leading garbages. */ sendorfree: m = m0->m_nextpkt; m0->m_nextpkt = 0; m_freem(m0); for (; m; m = m0) { m0 = m->m_nextpkt; m->m_nextpkt = 0; if (error == 0) { /* Record statistics for this interface address. */ if (ia) { counter_u64_add(ia->ia_ifa.ifa_opackets, 1); counter_u64_add(ia->ia_ifa.ifa_obytes, m->m_pkthdr.len); } #ifdef RATELIMIT if (inp != NULL) { if (inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) in_pcboutput_txrtlmt(inp, ifp, m); /* stamp send tag on mbuf */ m->m_pkthdr.snd_tag = inp->inp_snd_tag; } else { m->m_pkthdr.snd_tag = NULL; } #endif error = nd6_output_ifp(ifp, origifp, m, dst, (struct route *)ro); #ifdef RATELIMIT /* check for route change */ if (error == EAGAIN) in_pcboutput_eagain(inp); #endif } else m_freem(m); } if (error == 0) IP6STAT_INC(ip6s_fragmented); done: if (ro == &ip6route) RO_RTFREE(ro); return (error); freehdrs: m_freem(exthdrs.ip6e_hbh); /* m_freem will check if mbuf is 0 */ m_freem(exthdrs.ip6e_dest1); m_freem(exthdrs.ip6e_rthdr); m_freem(exthdrs.ip6e_dest2); /* FALLTHROUGH */ bad: if (m) m_freem(m); goto done; } static int ip6_copyexthdr(struct mbuf **mp, caddr_t hdr, int hlen) { struct mbuf *m; if (hlen > MCLBYTES) return (ENOBUFS); /* XXX */ if (hlen > MLEN) m = m_getcl(M_NOWAIT, MT_DATA, 0); else m = m_get(M_NOWAIT, MT_DATA); if (m == NULL) return (ENOBUFS); m->m_len = hlen; if (hdr) bcopy(hdr, mtod(m, caddr_t), hlen); *mp = m; return (0); } /* * Insert jumbo payload option. */ static int ip6_insert_jumboopt(struct ip6_exthdrs *exthdrs, u_int32_t plen) { struct mbuf *mopt; u_char *optbuf; u_int32_t v; #define JUMBOOPTLEN 8 /* length of jumbo payload option and padding */ /* * If there is no hop-by-hop options header, allocate new one. * If there is one but it doesn't have enough space to store the * jumbo payload option, allocate a cluster to store the whole options. * Otherwise, use it to store the options. */ if (exthdrs->ip6e_hbh == NULL) { mopt = m_get(M_NOWAIT, MT_DATA); if (mopt == NULL) return (ENOBUFS); mopt->m_len = JUMBOOPTLEN; optbuf = mtod(mopt, u_char *); optbuf[1] = 0; /* = ((JUMBOOPTLEN) >> 3) - 1 */ exthdrs->ip6e_hbh = mopt; } else { struct ip6_hbh *hbh; mopt = exthdrs->ip6e_hbh; if (M_TRAILINGSPACE(mopt) < JUMBOOPTLEN) { /* * XXX assumption: * - exthdrs->ip6e_hbh is not referenced from places * other than exthdrs. * - exthdrs->ip6e_hbh is not an mbuf chain. */ int oldoptlen = mopt->m_len; struct mbuf *n; /* * XXX: give up if the whole (new) hbh header does * not fit even in an mbuf cluster. */ if (oldoptlen + JUMBOOPTLEN > MCLBYTES) return (ENOBUFS); /* * As a consequence, we must always prepare a cluster * at this point. 
*/ n = m_getcl(M_NOWAIT, MT_DATA, 0); if (n == NULL) return (ENOBUFS); n->m_len = oldoptlen + JUMBOOPTLEN; bcopy(mtod(mopt, caddr_t), mtod(n, caddr_t), oldoptlen); optbuf = mtod(n, caddr_t) + oldoptlen; m_freem(mopt); mopt = exthdrs->ip6e_hbh = n; } else { optbuf = mtod(mopt, u_char *) + mopt->m_len; mopt->m_len += JUMBOOPTLEN; } optbuf[0] = IP6OPT_PADN; optbuf[1] = 1; /* * Adjust the header length according to the pad and * the jumbo payload option. */ hbh = mtod(mopt, struct ip6_hbh *); hbh->ip6h_len += (JUMBOOPTLEN >> 3); } /* fill in the option. */ optbuf[2] = IP6OPT_JUMBO; optbuf[3] = 4; v = (u_int32_t)htonl(plen + JUMBOOPTLEN); bcopy(&v, &optbuf[4], sizeof(u_int32_t)); /* finally, adjust the packet header length */ exthdrs->ip6e_ip6->m_pkthdr.len += JUMBOOPTLEN; return (0); #undef JUMBOOPTLEN } /* * Insert fragment header and copy unfragmentable header portions. */ static int ip6_insertfraghdr(struct mbuf *m0, struct mbuf *m, int hlen, struct ip6_frag **frghdrp) { struct mbuf *n, *mlast; if (hlen > sizeof(struct ip6_hdr)) { n = m_copym(m0, sizeof(struct ip6_hdr), hlen - sizeof(struct ip6_hdr), M_NOWAIT); if (n == NULL) return (ENOBUFS); m->m_next = n; } else n = m; /* Search for the last mbuf of unfragmentable part. */ for (mlast = n; mlast->m_next; mlast = mlast->m_next) ; if (M_WRITABLE(mlast) && M_TRAILINGSPACE(mlast) >= sizeof(struct ip6_frag)) { /* use the trailing space of the last mbuf for the fragment hdr */ *frghdrp = (struct ip6_frag *)(mtod(mlast, caddr_t) + mlast->m_len); mlast->m_len += sizeof(struct ip6_frag); m->m_pkthdr.len += sizeof(struct ip6_frag); } else { /* allocate a new mbuf for the fragment header */ struct mbuf *mfrg; mfrg = m_get(M_NOWAIT, MT_DATA); if (mfrg == NULL) return (ENOBUFS); mfrg->m_len = sizeof(struct ip6_frag); *frghdrp = mtod(mfrg, struct ip6_frag *); mlast->m_next = mfrg; } return (0); } /* * Calculates IPv6 path mtu for destination @dst. * Resulting MTU is stored in @mtup. * * Returns 0 on success. */ static int ip6_getpmtu_ctl(u_int fibnum, const struct in6_addr *dst, u_long *mtup) { struct nhop6_extended nh6; struct in6_addr kdst; uint32_t scopeid; struct ifnet *ifp; u_long mtu; int error; in6_splitscope(dst, &kdst, &scopeid); if (fib6_lookup_nh_ext(fibnum, &kdst, scopeid, NHR_REF, 0, &nh6) != 0) return (EHOSTUNREACH); ifp = nh6.nh_ifp; mtu = nh6.nh_mtu; error = ip6_calcmtu(ifp, dst, mtu, mtup, NULL, 0); fib6_free_nh_ext(fibnum, &nh6); return (error); } /* * Calculates IPv6 path MTU for @dst based on transmit @ifp, * and cached data in @ro_pmtu. * MTU from (successful) route lookup is saved (along with dst) * inside @ro_pmtu to avoid subsequent route lookups after packet * filter processing. * * Stores mtu and always-frag value into @mtup and @alwaysfragp. * Returns 0 on success. */ static int ip6_getpmtu(struct route_in6 *ro_pmtu, int do_lookup, struct ifnet *ifp, const struct in6_addr *dst, u_long *mtup, int *alwaysfragp, u_int fibnum, u_int proto) { struct nhop6_basic nh6; struct in6_addr kdst; uint32_t scopeid; struct sockaddr_in6 *sa6_dst; u_long mtu; mtu = 0; if (do_lookup) { /* * Here ro_pmtu has final destination address, while * ro might represent immediate destination. * Use ro_pmtu destination since mtu might differ. 
*/ sa6_dst = (struct sockaddr_in6 *)&ro_pmtu->ro_dst; if (!IN6_ARE_ADDR_EQUAL(&sa6_dst->sin6_addr, dst)) ro_pmtu->ro_mtu = 0; if (ro_pmtu->ro_mtu == 0) { bzero(sa6_dst, sizeof(*sa6_dst)); sa6_dst->sin6_family = AF_INET6; sa6_dst->sin6_len = sizeof(struct sockaddr_in6); sa6_dst->sin6_addr = *dst; in6_splitscope(dst, &kdst, &scopeid); if (fib6_lookup_nh_basic(fibnum, &kdst, scopeid, 0, 0, &nh6) == 0) ro_pmtu->ro_mtu = nh6.nh_mtu; } mtu = ro_pmtu->ro_mtu; } if (ro_pmtu->ro_rt) mtu = ro_pmtu->ro_rt->rt_mtu; return (ip6_calcmtu(ifp, dst, mtu, mtup, alwaysfragp, proto)); } /* * Calculate MTU based on transmit @ifp, route mtu @rt_mtu and * hostcache data for @dst. * Stores mtu and always-frag value into @mtup and @alwaysfragp. * * Returns 0 on success. */ static int ip6_calcmtu(struct ifnet *ifp, const struct in6_addr *dst, u_long rt_mtu, u_long *mtup, int *alwaysfragp, u_int proto) { u_long mtu = 0; int alwaysfrag = 0; int error = 0; if (rt_mtu > 0) { u_int32_t ifmtu; struct in_conninfo inc; bzero(&inc, sizeof(inc)); inc.inc_flags |= INC_ISIPV6; inc.inc6_faddr = *dst; ifmtu = IN6_LINKMTU(ifp); /* TCP is known to react to pmtu changes so skip hc */ if (proto != IPPROTO_TCP) mtu = tcp_hc_getmtu(&inc); if (mtu) mtu = min(mtu, rt_mtu); else mtu = rt_mtu; if (mtu == 0) mtu = ifmtu; else if (mtu < IPV6_MMTU) { /* * RFC2460 section 5, last paragraph: * if we record ICMPv6 too big message with * mtu < IPV6_MMTU, transmit packets sized IPV6_MMTU * or smaller, with framgent header attached. * (fragment header is needed regardless from the * packet size, for translators to identify packets) */ alwaysfrag = 1; mtu = IPV6_MMTU; } } else if (ifp) { mtu = IN6_LINKMTU(ifp); } else error = EHOSTUNREACH; /* XXX */ *mtup = mtu; if (alwaysfragp) *alwaysfragp = alwaysfrag; return (error); } /* * IP6 socket option processing. */ int ip6_ctloutput(struct socket *so, struct sockopt *sopt) { int optdatalen, uproto; void *optdata; struct inpcb *in6p = sotoinpcb(so); int error, optval; int level, op, optname; int optlen; struct thread *td; #ifdef RSS uint32_t rss_bucket; int retval; #endif /* * Don't use more than a quarter of mbuf clusters. N.B.: * nmbclusters is an int, but nmbclusters * MCLBYTES may overflow * on LP64 architectures, so cast to u_long to avoid undefined * behavior. ILP32 architectures cannot have nmbclusters * large enough to overflow for other reasons. 
*/ #define IPV6_PKTOPTIONS_MBUF_LIMIT ((u_long)nmbclusters * MCLBYTES / 4) level = sopt->sopt_level; op = sopt->sopt_dir; optname = sopt->sopt_name; optlen = sopt->sopt_valsize; td = sopt->sopt_td; error = 0; optval = 0; uproto = (int)so->so_proto->pr_protocol; if (level != IPPROTO_IPV6) { error = EINVAL; if (sopt->sopt_level == SOL_SOCKET && sopt->sopt_dir == SOPT_SET) { switch (sopt->sopt_name) { case SO_REUSEADDR: INP_WLOCK(in6p); if ((so->so_options & SO_REUSEADDR) != 0) in6p->inp_flags2 |= INP_REUSEADDR; else in6p->inp_flags2 &= ~INP_REUSEADDR; INP_WUNLOCK(in6p); error = 0; break; case SO_REUSEPORT: INP_WLOCK(in6p); if ((so->so_options & SO_REUSEPORT) != 0) in6p->inp_flags2 |= INP_REUSEPORT; else in6p->inp_flags2 &= ~INP_REUSEPORT; INP_WUNLOCK(in6p); error = 0; break; case SO_REUSEPORT_LB: INP_WLOCK(in6p); if ((so->so_options & SO_REUSEPORT_LB) != 0) in6p->inp_flags2 |= INP_REUSEPORT_LB; else in6p->inp_flags2 &= ~INP_REUSEPORT_LB; INP_WUNLOCK(in6p); error = 0; break; case SO_SETFIB: INP_WLOCK(in6p); in6p->inp_inc.inc_fibnum = so->so_fibnum; INP_WUNLOCK(in6p); error = 0; break; case SO_MAX_PACING_RATE: #ifdef RATELIMIT INP_WLOCK(in6p); in6p->inp_flags2 |= INP_RATE_LIMIT_CHANGED; INP_WUNLOCK(in6p); error = 0; #else error = EOPNOTSUPP; #endif break; default: break; } } } else { /* level == IPPROTO_IPV6 */ switch (op) { case SOPT_SET: switch (optname) { case IPV6_2292PKTOPTIONS: #ifdef IPV6_PKTOPTIONS case IPV6_PKTOPTIONS: #endif { struct mbuf *m; if (optlen > IPV6_PKTOPTIONS_MBUF_LIMIT) { printf("ip6_ctloutput: mbuf limit hit\n"); error = ENOBUFS; break; } error = soopt_getm(sopt, &m); /* XXX */ if (error != 0) break; error = soopt_mcopyin(sopt, m); /* XXX */ if (error != 0) break; error = ip6_pcbopts(&in6p->in6p_outputopts, m, so, sopt); m_freem(m); /* XXX */ break; } /* * Use of some Hop-by-Hop options or some * Destination options, might require special * privilege. That is, normal applications * (without special privilege) might be forbidden * from setting certain options in outgoing packets, * and might never see certain options in received * packets. [RFC 2292 Section 6] * KAME specific note: * KAME prevents non-privileged users from sending or * receiving ANY hbh/dst options in order to avoid * overhead of parsing options in the kernel. 
*/ case IPV6_RECVHOPOPTS: case IPV6_RECVDSTOPTS: case IPV6_RECVRTHDRDSTOPTS: if (td != NULL) { error = priv_check(td, PRIV_NETINET_SETHDROPTS); if (error) break; } /* FALLTHROUGH */ case IPV6_UNICAST_HOPS: case IPV6_HOPLIMIT: case IPV6_RECVPKTINFO: case IPV6_RECVHOPLIMIT: case IPV6_RECVRTHDR: case IPV6_RECVPATHMTU: case IPV6_RECVTCLASS: case IPV6_RECVFLOWID: #ifdef RSS case IPV6_RECVRSSBUCKETID: #endif case IPV6_V6ONLY: case IPV6_AUTOFLOWLABEL: case IPV6_ORIGDSTADDR: case IPV6_BINDANY: case IPV6_BINDMULTI: #ifdef RSS case IPV6_RSS_LISTEN_BUCKET: #endif if (optname == IPV6_BINDANY && td != NULL) { error = priv_check(td, PRIV_NETINET_BINDANY); if (error) break; } if (optlen != sizeof(int)) { error = EINVAL; break; } error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; switch (optname) { case IPV6_UNICAST_HOPS: if (optval < -1 || optval >= 256) error = EINVAL; else { /* -1 = kernel default */ in6p->in6p_hops = optval; if ((in6p->inp_vflag & INP_IPV4) != 0) in6p->inp_ip_ttl = optval; } break; #define OPTSET(bit) \ do { \ INP_WLOCK(in6p); \ if (optval) \ in6p->inp_flags |= (bit); \ else \ in6p->inp_flags &= ~(bit); \ INP_WUNLOCK(in6p); \ } while (/*CONSTCOND*/ 0) #define OPTSET2292(bit) \ do { \ INP_WLOCK(in6p); \ in6p->inp_flags |= IN6P_RFC2292; \ if (optval) \ in6p->inp_flags |= (bit); \ else \ in6p->inp_flags &= ~(bit); \ INP_WUNLOCK(in6p); \ } while (/*CONSTCOND*/ 0) #define OPTBIT(bit) (in6p->inp_flags & (bit) ? 1 : 0) #define OPTSET2_N(bit, val) do { \ if (val) \ in6p->inp_flags2 |= bit; \ else \ in6p->inp_flags2 &= ~bit; \ } while (0) #define OPTSET2(bit, val) do { \ INP_WLOCK(in6p); \ OPTSET2_N(bit, val); \ INP_WUNLOCK(in6p); \ } while (0) #define OPTBIT2(bit) (in6p->inp_flags2 & (bit) ? 1 : 0) #define OPTSET2292_EXCLUSIVE(bit) \ do { \ INP_WLOCK(in6p); \ if (OPTBIT(IN6P_RFC2292)) { \ error = EINVAL; \ } else { \ if (optval) \ in6p->inp_flags |= (bit); \ else \ in6p->inp_flags &= ~(bit); \ } \ INP_WUNLOCK(in6p); \ } while (/*CONSTCOND*/ 0) case IPV6_RECVPKTINFO: OPTSET2292_EXCLUSIVE(IN6P_PKTINFO); break; case IPV6_HOPLIMIT: { struct ip6_pktopts **optp; /* cannot mix with RFC2292 */ if (OPTBIT(IN6P_RFC2292)) { error = EINVAL; break; } INP_WLOCK(in6p); if (in6p->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { INP_WUNLOCK(in6p); return (ECONNRESET); } optp = &in6p->in6p_outputopts; error = ip6_pcbopt(IPV6_HOPLIMIT, (u_char *)&optval, sizeof(optval), optp, (td != NULL) ? td->td_ucred : NULL, uproto); INP_WUNLOCK(in6p); break; } case IPV6_RECVHOPLIMIT: OPTSET2292_EXCLUSIVE(IN6P_HOPLIMIT); break; case IPV6_RECVHOPOPTS: OPTSET2292_EXCLUSIVE(IN6P_HOPOPTS); break; case IPV6_RECVDSTOPTS: OPTSET2292_EXCLUSIVE(IN6P_DSTOPTS); break; case IPV6_RECVRTHDRDSTOPTS: OPTSET2292_EXCLUSIVE(IN6P_RTHDRDSTOPTS); break; case IPV6_RECVRTHDR: OPTSET2292_EXCLUSIVE(IN6P_RTHDR); break; case IPV6_RECVPATHMTU: /* * We ignore this option for TCP * sockets. * (RFC3542 leaves this case * unspecified.) */ if (uproto != IPPROTO_TCP) OPTSET(IN6P_MTU); break; case IPV6_RECVFLOWID: OPTSET2(INP_RECVFLOWID, optval); break; #ifdef RSS case IPV6_RECVRSSBUCKETID: OPTSET2(INP_RECVRSSBUCKETID, optval); break; #endif case IPV6_V6ONLY: /* * make setsockopt(IPV6_V6ONLY) * available only prior to bind(2). * see ipng mailing list, Jun 22 2001. 
*/ if (in6p->inp_lport || !IN6_IS_ADDR_UNSPECIFIED(&in6p->in6p_laddr)) { error = EINVAL; break; } OPTSET(IN6P_IPV6_V6ONLY); if (optval) in6p->inp_vflag &= ~INP_IPV4; else in6p->inp_vflag |= INP_IPV4; break; case IPV6_RECVTCLASS: /* cannot mix with RFC2292 XXX */ OPTSET2292_EXCLUSIVE(IN6P_TCLASS); break; case IPV6_AUTOFLOWLABEL: OPTSET(IN6P_AUTOFLOWLABEL); break; case IPV6_ORIGDSTADDR: OPTSET2(INP_ORIGDSTADDR, optval); break; case IPV6_BINDANY: OPTSET(INP_BINDANY); break; case IPV6_BINDMULTI: OPTSET2(INP_BINDMULTI, optval); break; #ifdef RSS case IPV6_RSS_LISTEN_BUCKET: if ((optval >= 0) && (optval < rss_getnumbuckets())) { INP_WLOCK(in6p); in6p->inp_rss_listen_bucket = optval; OPTSET2_N(INP_RSS_BUCKET_SET, 1); INP_WUNLOCK(in6p); } else { error = EINVAL; } break; #endif } break; case IPV6_TCLASS: case IPV6_DONTFRAG: case IPV6_USE_MIN_MTU: case IPV6_PREFER_TEMPADDR: if (optlen != sizeof(optval)) { error = EINVAL; break; } error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; { struct ip6_pktopts **optp; INP_WLOCK(in6p); if (in6p->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { INP_WUNLOCK(in6p); return (ECONNRESET); } optp = &in6p->in6p_outputopts; error = ip6_pcbopt(optname, (u_char *)&optval, sizeof(optval), optp, (td != NULL) ? td->td_ucred : NULL, uproto); INP_WUNLOCK(in6p); break; } case IPV6_2292PKTINFO: case IPV6_2292HOPLIMIT: case IPV6_2292HOPOPTS: case IPV6_2292DSTOPTS: case IPV6_2292RTHDR: /* RFC 2292 */ if (optlen != sizeof(int)) { error = EINVAL; break; } error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; switch (optname) { case IPV6_2292PKTINFO: OPTSET2292(IN6P_PKTINFO); break; case IPV6_2292HOPLIMIT: OPTSET2292(IN6P_HOPLIMIT); break; case IPV6_2292HOPOPTS: /* * Check super-user privilege. * See comments for IPV6_RECVHOPOPTS. */ if (td != NULL) { error = priv_check(td, PRIV_NETINET_SETHDROPTS); if (error) return (error); } OPTSET2292(IN6P_HOPOPTS); break; case IPV6_2292DSTOPTS: if (td != NULL) { error = priv_check(td, PRIV_NETINET_SETHDROPTS); if (error) return (error); } OPTSET2292(IN6P_DSTOPTS|IN6P_RTHDRDSTOPTS); /* XXX */ break; case IPV6_2292RTHDR: OPTSET2292(IN6P_RTHDR); break; } break; case IPV6_PKTINFO: case IPV6_HOPOPTS: case IPV6_RTHDR: case IPV6_DSTOPTS: case IPV6_RTHDRDSTOPTS: case IPV6_NEXTHOP: { /* new advanced API (RFC3542) */ u_char *optbuf; u_char optbuf_storage[MCLBYTES]; int optlen; struct ip6_pktopts **optp; /* cannot mix with RFC2292 */ if (OPTBIT(IN6P_RFC2292)) { error = EINVAL; break; } /* * We only ensure valsize is not too large * here. Further validation will be done * later. */ error = sooptcopyin(sopt, optbuf_storage, sizeof(optbuf_storage), 0); if (error) break; optlen = sopt->sopt_valsize; optbuf = optbuf_storage; INP_WLOCK(in6p); if (in6p->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { INP_WUNLOCK(in6p); return (ECONNRESET); } optp = &in6p->in6p_outputopts; error = ip6_pcbopt(optname, optbuf, optlen, optp, (td != NULL) ? 
td->td_ucred : NULL, uproto); INP_WUNLOCK(in6p); break; } #undef OPTSET case IPV6_MULTICAST_IF: case IPV6_MULTICAST_HOPS: case IPV6_MULTICAST_LOOP: case IPV6_JOIN_GROUP: case IPV6_LEAVE_GROUP: case IPV6_MSFILTER: case MCAST_BLOCK_SOURCE: case MCAST_UNBLOCK_SOURCE: case MCAST_JOIN_GROUP: case MCAST_LEAVE_GROUP: case MCAST_JOIN_SOURCE_GROUP: case MCAST_LEAVE_SOURCE_GROUP: error = ip6_setmoptions(in6p, sopt); break; case IPV6_PORTRANGE: error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval); if (error) break; INP_WLOCK(in6p); switch (optval) { case IPV6_PORTRANGE_DEFAULT: in6p->inp_flags &= ~(INP_LOWPORT); in6p->inp_flags &= ~(INP_HIGHPORT); break; case IPV6_PORTRANGE_HIGH: in6p->inp_flags &= ~(INP_LOWPORT); in6p->inp_flags |= INP_HIGHPORT; break; case IPV6_PORTRANGE_LOW: in6p->inp_flags &= ~(INP_HIGHPORT); in6p->inp_flags |= INP_LOWPORT; break; default: error = EINVAL; break; } INP_WUNLOCK(in6p); break; #if defined(IPSEC) || defined(IPSEC_SUPPORT) case IPV6_IPSEC_POLICY: if (IPSEC_ENABLED(ipv6)) { error = IPSEC_PCBCTL(ipv6, in6p, sopt); break; } /* FALLTHROUGH */ #endif /* IPSEC */ default: error = ENOPROTOOPT; break; } break; case SOPT_GET: switch (optname) { case IPV6_2292PKTOPTIONS: #ifdef IPV6_PKTOPTIONS case IPV6_PKTOPTIONS: #endif /* * RFC3542 (effectively) deprecated the * semantics of the 2292-style pktoptions. * Since it was not reliable in nature (i.e., * applications had to expect the lack of some * information after all), it would make sense * to simplify this part by always returning * empty data. */ sopt->sopt_valsize = 0; break; case IPV6_RECVHOPOPTS: case IPV6_RECVDSTOPTS: case IPV6_RECVRTHDRDSTOPTS: case IPV6_UNICAST_HOPS: case IPV6_RECVPKTINFO: case IPV6_RECVHOPLIMIT: case IPV6_RECVRTHDR: case IPV6_RECVPATHMTU: case IPV6_V6ONLY: case IPV6_PORTRANGE: case IPV6_RECVTCLASS: case IPV6_AUTOFLOWLABEL: case IPV6_BINDANY: case IPV6_FLOWID: case IPV6_FLOWTYPE: case IPV6_RECVFLOWID: #ifdef RSS case IPV6_RSSBUCKETID: case IPV6_RECVRSSBUCKETID: #endif case IPV6_BINDMULTI: switch (optname) { case IPV6_RECVHOPOPTS: optval = OPTBIT(IN6P_HOPOPTS); break; case IPV6_RECVDSTOPTS: optval = OPTBIT(IN6P_DSTOPTS); break; case IPV6_RECVRTHDRDSTOPTS: optval = OPTBIT(IN6P_RTHDRDSTOPTS); break; case IPV6_UNICAST_HOPS: optval = in6p->in6p_hops; break; case IPV6_RECVPKTINFO: optval = OPTBIT(IN6P_PKTINFO); break; case IPV6_RECVHOPLIMIT: optval = OPTBIT(IN6P_HOPLIMIT); break; case IPV6_RECVRTHDR: optval = OPTBIT(IN6P_RTHDR); break; case IPV6_RECVPATHMTU: optval = OPTBIT(IN6P_MTU); break; case IPV6_V6ONLY: optval = OPTBIT(IN6P_IPV6_V6ONLY); break; case IPV6_PORTRANGE: { int flags; flags = in6p->inp_flags; if (flags & INP_HIGHPORT) optval = IPV6_PORTRANGE_HIGH; else if (flags & INP_LOWPORT) optval = IPV6_PORTRANGE_LOW; else optval = 0; break; } case IPV6_RECVTCLASS: optval = OPTBIT(IN6P_TCLASS); break; case IPV6_AUTOFLOWLABEL: optval = OPTBIT(IN6P_AUTOFLOWLABEL); break; case IPV6_ORIGDSTADDR: optval = OPTBIT2(INP_ORIGDSTADDR); break; case IPV6_BINDANY: optval = OPTBIT(INP_BINDANY); break; case IPV6_FLOWID: optval = in6p->inp_flowid; break; case IPV6_FLOWTYPE: optval = in6p->inp_flowtype; break; case IPV6_RECVFLOWID: optval = OPTBIT2(INP_RECVFLOWID); break; #ifdef RSS case IPV6_RSSBUCKETID: retval = rss_hash2bucket(in6p->inp_flowid, in6p->inp_flowtype, &rss_bucket); if (retval == 0) optval = rss_bucket; else error = EINVAL; break; case IPV6_RECVRSSBUCKETID: optval = OPTBIT2(INP_RECVRSSBUCKETID); break; #endif case IPV6_BINDMULTI: optval = OPTBIT2(INP_BINDMULTI); break; } if (error) break; error = 
sooptcopyout(sopt, &optval, sizeof optval); break; case IPV6_PATHMTU: { u_long pmtu = 0; struct ip6_mtuinfo mtuinfo; struct in6_addr addr; if (!(so->so_state & SS_ISCONNECTED)) return (ENOTCONN); /* * XXX: we dot not consider the case of source * routing, or optional information to specify * the outgoing interface. * Copy faddr out of in6p to avoid holding lock * on inp during route lookup. */ INP_RLOCK(in6p); bcopy(&in6p->in6p_faddr, &addr, sizeof(addr)); INP_RUNLOCK(in6p); error = ip6_getpmtu_ctl(so->so_fibnum, &addr, &pmtu); if (error) break; if (pmtu > IPV6_MAXPACKET) pmtu = IPV6_MAXPACKET; bzero(&mtuinfo, sizeof(mtuinfo)); mtuinfo.ip6m_mtu = (u_int32_t)pmtu; optdata = (void *)&mtuinfo; optdatalen = sizeof(mtuinfo); error = sooptcopyout(sopt, optdata, optdatalen); break; } case IPV6_2292PKTINFO: case IPV6_2292HOPLIMIT: case IPV6_2292HOPOPTS: case IPV6_2292RTHDR: case IPV6_2292DSTOPTS: switch (optname) { case IPV6_2292PKTINFO: optval = OPTBIT(IN6P_PKTINFO); break; case IPV6_2292HOPLIMIT: optval = OPTBIT(IN6P_HOPLIMIT); break; case IPV6_2292HOPOPTS: optval = OPTBIT(IN6P_HOPOPTS); break; case IPV6_2292RTHDR: optval = OPTBIT(IN6P_RTHDR); break; case IPV6_2292DSTOPTS: optval = OPTBIT(IN6P_DSTOPTS|IN6P_RTHDRDSTOPTS); break; } error = sooptcopyout(sopt, &optval, sizeof optval); break; case IPV6_PKTINFO: case IPV6_HOPOPTS: case IPV6_RTHDR: case IPV6_DSTOPTS: case IPV6_RTHDRDSTOPTS: case IPV6_NEXTHOP: case IPV6_TCLASS: case IPV6_DONTFRAG: case IPV6_USE_MIN_MTU: case IPV6_PREFER_TEMPADDR: error = ip6_getpcbopt(in6p, optname, sopt); break; case IPV6_MULTICAST_IF: case IPV6_MULTICAST_HOPS: case IPV6_MULTICAST_LOOP: case IPV6_MSFILTER: error = ip6_getmoptions(in6p, sopt); break; #if defined(IPSEC) || defined(IPSEC_SUPPORT) case IPV6_IPSEC_POLICY: if (IPSEC_ENABLED(ipv6)) { error = IPSEC_PCBCTL(ipv6, in6p, sopt); break; } /* FALLTHROUGH */ #endif /* IPSEC */ default: error = ENOPROTOOPT; break; } break; } } return (error); } int ip6_raw_ctloutput(struct socket *so, struct sockopt *sopt) { int error = 0, optval, optlen; const int icmp6off = offsetof(struct icmp6_hdr, icmp6_cksum); struct inpcb *in6p = sotoinpcb(so); int level, op, optname; level = sopt->sopt_level; op = sopt->sopt_dir; optname = sopt->sopt_name; optlen = sopt->sopt_valsize; if (level != IPPROTO_IPV6) { return (EINVAL); } switch (optname) { case IPV6_CHECKSUM: /* * For ICMPv6 sockets, no modification allowed for checksum * offset, permit "no change" values to help existing apps. * * RFC3542 says: "An attempt to set IPV6_CHECKSUM * for an ICMPv6 socket will fail." * The current behavior does not meet RFC3542. */ switch (op) { case SOPT_SET: if (optlen != sizeof(int)) { error = EINVAL; break; } error = sooptcopyin(sopt, &optval, sizeof(optval), sizeof(optval)); if (error) break; if (optval < -1 || (optval % 2) != 0) { /* * The API assumes non-negative even offset * values or -1 as a special value. */ error = EINVAL; } else if (so->so_proto->pr_protocol == IPPROTO_ICMPV6) { if (optval != icmp6off) error = EINVAL; } else in6p->in6p_cksum = optval; break; case SOPT_GET: if (so->so_proto->pr_protocol == IPPROTO_ICMPV6) optval = icmp6off; else optval = in6p->in6p_cksum; error = sooptcopyout(sopt, &optval, sizeof(optval)); break; default: error = EINVAL; break; } break; default: error = ENOPROTOOPT; break; } return (error); } /* * Set up IP6 options in pcb for insertion in output packets or * specifying behavior of outgoing packets. 
*/ static int ip6_pcbopts(struct ip6_pktopts **pktopt, struct mbuf *m, struct socket *so, struct sockopt *sopt) { struct ip6_pktopts *opt = *pktopt; int error = 0; struct thread *td = sopt->sopt_td; /* turn off any old options. */ if (opt) { #ifdef DIAGNOSTIC if (opt->ip6po_pktinfo || opt->ip6po_nexthop || opt->ip6po_hbh || opt->ip6po_dest1 || opt->ip6po_dest2 || opt->ip6po_rhinfo.ip6po_rhi_rthdr) printf("ip6_pcbopts: all specified options are cleared.\n"); #endif ip6_clearpktopts(opt, -1); } else opt = malloc(sizeof(*opt), M_IP6OPT, M_WAITOK); *pktopt = NULL; if (!m || m->m_len == 0) { /* * Only turning off any previous options, regardless of * whether the opt is just created or given. */ free(opt, M_IP6OPT); return (0); } /* set options specified by user. */ if ((error = ip6_setpktopts(m, opt, NULL, (td != NULL) ? td->td_ucred : NULL, so->so_proto->pr_protocol)) != 0) { ip6_clearpktopts(opt, -1); /* XXX: discard all options */ free(opt, M_IP6OPT); return (error); } *pktopt = opt; return (0); } /* * initialize ip6_pktopts. beware that there are non-zero default values in * the struct. */ void ip6_initpktopts(struct ip6_pktopts *opt) { bzero(opt, sizeof(*opt)); opt->ip6po_hlim = -1; /* -1 means default hop limit */ opt->ip6po_tclass = -1; /* -1 means default traffic class */ opt->ip6po_minmtu = IP6PO_MINMTU_MCASTONLY; opt->ip6po_prefer_tempaddr = IP6PO_TEMPADDR_SYSTEM; } static int ip6_pcbopt(int optname, u_char *buf, int len, struct ip6_pktopts **pktopt, struct ucred *cred, int uproto) { struct ip6_pktopts *opt; if (*pktopt == NULL) { *pktopt = malloc(sizeof(struct ip6_pktopts), M_IP6OPT, M_NOWAIT); if (*pktopt == NULL) return (ENOBUFS); ip6_initpktopts(*pktopt); } opt = *pktopt; return (ip6_setpktopt(optname, buf, len, opt, cred, 1, 0, uproto)); } #define GET_PKTOPT_VAR(field, lenexpr) do { \ if (pktopt && pktopt->field) { \ INP_RUNLOCK(in6p); \ optdata = malloc(sopt->sopt_valsize, M_TEMP, M_WAITOK); \ malloc_optdata = true; \ INP_RLOCK(in6p); \ if (in6p->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { \ INP_RUNLOCK(in6p); \ free(optdata, M_TEMP); \ return (ECONNRESET); \ } \ pktopt = in6p->in6p_outputopts; \ if (pktopt && pktopt->field) { \ optdatalen = min(lenexpr, sopt->sopt_valsize); \ bcopy(&pktopt->field, optdata, optdatalen); \ } else { \ free(optdata, M_TEMP); \ optdata = NULL; \ malloc_optdata = false; \ } \ } \ } while(0) #define GET_PKTOPT_EXT_HDR(field) GET_PKTOPT_VAR(field, \ (((struct ip6_ext *)pktopt->field)->ip6e_len + 1) << 3) #define GET_PKTOPT_SOCKADDR(field) GET_PKTOPT_VAR(field, \ pktopt->field->sa_len) static int ip6_getpcbopt(struct inpcb *in6p, int optname, struct sockopt *sopt) { void *optdata = NULL; bool malloc_optdata = false; int optdatalen = 0; int error = 0; struct in6_pktinfo null_pktinfo; int deftclass = 0, on; int defminmtu = IP6PO_MINMTU_MCASTONLY; int defpreftemp = IP6PO_TEMPADDR_SYSTEM; struct ip6_pktopts *pktopt; INP_RLOCK(in6p); pktopt = in6p->in6p_outputopts; switch (optname) { case IPV6_PKTINFO: optdata = (void *)&null_pktinfo; if (pktopt && pktopt->ip6po_pktinfo) { bcopy(pktopt->ip6po_pktinfo, &null_pktinfo, sizeof(null_pktinfo)); in6_clearscope(&null_pktinfo.ipi6_addr); } else { /* XXX: we don't have to do this every time... 
*/ bzero(&null_pktinfo, sizeof(null_pktinfo)); } optdatalen = sizeof(struct in6_pktinfo); break; case IPV6_TCLASS: if (pktopt && pktopt->ip6po_tclass >= 0) deftclass = pktopt->ip6po_tclass; optdata = (void *)&deftclass; optdatalen = sizeof(int); break; case IPV6_HOPOPTS: GET_PKTOPT_EXT_HDR(ip6po_hbh); break; case IPV6_RTHDR: GET_PKTOPT_EXT_HDR(ip6po_rthdr); break; case IPV6_RTHDRDSTOPTS: GET_PKTOPT_EXT_HDR(ip6po_dest1); break; case IPV6_DSTOPTS: GET_PKTOPT_EXT_HDR(ip6po_dest2); break; case IPV6_NEXTHOP: GET_PKTOPT_SOCKADDR(ip6po_nexthop); break; case IPV6_USE_MIN_MTU: if (pktopt) defminmtu = pktopt->ip6po_minmtu; optdata = (void *)&defminmtu; optdatalen = sizeof(int); break; case IPV6_DONTFRAG: if (pktopt && ((pktopt->ip6po_flags) & IP6PO_DONTFRAG)) on = 1; else on = 0; optdata = (void *)&on; optdatalen = sizeof(on); break; case IPV6_PREFER_TEMPADDR: if (pktopt) defpreftemp = pktopt->ip6po_prefer_tempaddr; optdata = (void *)&defpreftemp; optdatalen = sizeof(int); break; default: /* should not happen */ #ifdef DIAGNOSTIC panic("ip6_getpcbopt: unexpected option\n"); #endif INP_RUNLOCK(in6p); return (ENOPROTOOPT); } INP_RUNLOCK(in6p); error = sooptcopyout(sopt, optdata, optdatalen); if (malloc_optdata) free(optdata, M_TEMP); return (error); } void ip6_clearpktopts(struct ip6_pktopts *pktopt, int optname) { if (pktopt == NULL) return; if (optname == -1 || optname == IPV6_PKTINFO) { if (pktopt->ip6po_pktinfo) free(pktopt->ip6po_pktinfo, M_IP6OPT); pktopt->ip6po_pktinfo = NULL; } if (optname == -1 || optname == IPV6_HOPLIMIT) pktopt->ip6po_hlim = -1; if (optname == -1 || optname == IPV6_TCLASS) pktopt->ip6po_tclass = -1; if (optname == -1 || optname == IPV6_NEXTHOP) { if (pktopt->ip6po_nextroute.ro_rt) { RTFREE(pktopt->ip6po_nextroute.ro_rt); pktopt->ip6po_nextroute.ro_rt = NULL; } if (pktopt->ip6po_nexthop) free(pktopt->ip6po_nexthop, M_IP6OPT); pktopt->ip6po_nexthop = NULL; } if (optname == -1 || optname == IPV6_HOPOPTS) { if (pktopt->ip6po_hbh) free(pktopt->ip6po_hbh, M_IP6OPT); pktopt->ip6po_hbh = NULL; } if (optname == -1 || optname == IPV6_RTHDRDSTOPTS) { if (pktopt->ip6po_dest1) free(pktopt->ip6po_dest1, M_IP6OPT); pktopt->ip6po_dest1 = NULL; } if (optname == -1 || optname == IPV6_RTHDR) { if (pktopt->ip6po_rhinfo.ip6po_rhi_rthdr) free(pktopt->ip6po_rhinfo.ip6po_rhi_rthdr, M_IP6OPT); pktopt->ip6po_rhinfo.ip6po_rhi_rthdr = NULL; if (pktopt->ip6po_route.ro_rt) { RTFREE(pktopt->ip6po_route.ro_rt); pktopt->ip6po_route.ro_rt = NULL; } } if (optname == -1 || optname == IPV6_DSTOPTS) { if (pktopt->ip6po_dest2) free(pktopt->ip6po_dest2, M_IP6OPT); pktopt->ip6po_dest2 = NULL; } } #define PKTOPT_EXTHDRCPY(type) \ do {\ if (src->type) {\ int hlen = (((struct ip6_ext *)src->type)->ip6e_len + 1) << 3;\ dst->type = malloc(hlen, M_IP6OPT, canwait);\ if (dst->type == NULL)\ goto bad;\ bcopy(src->type, dst->type, hlen);\ }\ } while (/*CONSTCOND*/ 0) static int copypktopts(struct ip6_pktopts *dst, struct ip6_pktopts *src, int canwait) { if (dst == NULL || src == NULL) { printf("ip6_clearpktopts: invalid argument\n"); return (EINVAL); } dst->ip6po_hlim = src->ip6po_hlim; dst->ip6po_tclass = src->ip6po_tclass; dst->ip6po_flags = src->ip6po_flags; dst->ip6po_minmtu = src->ip6po_minmtu; dst->ip6po_prefer_tempaddr = src->ip6po_prefer_tempaddr; if (src->ip6po_pktinfo) { dst->ip6po_pktinfo = malloc(sizeof(*dst->ip6po_pktinfo), M_IP6OPT, canwait); if (dst->ip6po_pktinfo == NULL) goto bad; *dst->ip6po_pktinfo = *src->ip6po_pktinfo; } if (src->ip6po_nexthop) { dst->ip6po_nexthop = malloc(src->ip6po_nexthop->sa_len, 
M_IP6OPT, canwait); if (dst->ip6po_nexthop == NULL) goto bad; bcopy(src->ip6po_nexthop, dst->ip6po_nexthop, src->ip6po_nexthop->sa_len); } PKTOPT_EXTHDRCPY(ip6po_hbh); PKTOPT_EXTHDRCPY(ip6po_dest1); PKTOPT_EXTHDRCPY(ip6po_dest2); PKTOPT_EXTHDRCPY(ip6po_rthdr); /* not copy the cached route */ return (0); bad: ip6_clearpktopts(dst, -1); return (ENOBUFS); } #undef PKTOPT_EXTHDRCPY struct ip6_pktopts * ip6_copypktopts(struct ip6_pktopts *src, int canwait) { int error; struct ip6_pktopts *dst; dst = malloc(sizeof(*dst), M_IP6OPT, canwait); if (dst == NULL) return (NULL); ip6_initpktopts(dst); if ((error = copypktopts(dst, src, canwait)) != 0) { free(dst, M_IP6OPT); return (NULL); } return (dst); } void ip6_freepcbopts(struct ip6_pktopts *pktopt) { if (pktopt == NULL) return; ip6_clearpktopts(pktopt, -1); free(pktopt, M_IP6OPT); } /* * Set IPv6 outgoing packet options based on advanced API. */ int ip6_setpktopts(struct mbuf *control, struct ip6_pktopts *opt, struct ip6_pktopts *stickyopt, struct ucred *cred, int uproto) { struct cmsghdr *cm = NULL; if (control == NULL || opt == NULL) return (EINVAL); ip6_initpktopts(opt); if (stickyopt) { int error; /* * If stickyopt is provided, make a local copy of the options * for this particular packet, then override them by ancillary * objects. * XXX: copypktopts() does not copy the cached route to a next * hop (if any). This is not very good in terms of efficiency, * but we can allow this since this option should be rarely * used. */ if ((error = copypktopts(opt, stickyopt, M_NOWAIT)) != 0) return (error); } /* * XXX: Currently, we assume all the optional information is stored * in a single mbuf. */ if (control->m_next) return (EINVAL); for (; control->m_len > 0; control->m_data += CMSG_ALIGN(cm->cmsg_len), control->m_len -= CMSG_ALIGN(cm->cmsg_len)) { int error; if (control->m_len < CMSG_LEN(0)) return (EINVAL); cm = mtod(control, struct cmsghdr *); if (cm->cmsg_len == 0 || cm->cmsg_len > control->m_len) return (EINVAL); if (cm->cmsg_level != IPPROTO_IPV6) continue; error = ip6_setpktopt(cm->cmsg_type, CMSG_DATA(cm), cm->cmsg_len - CMSG_LEN(0), opt, cred, 0, 1, uproto); if (error) return (error); } return (0); } /* * Set a particular packet option, as a sticky option or an ancillary data * item. "len" can be 0 only when it's a sticky option. * We have 4 cases of combination of "sticky" and "cmsg": * "sticky=0, cmsg=0": impossible * "sticky=0, cmsg=1": RFC2292 or RFC3542 ancillary data * "sticky=1, cmsg=0": RFC3542 socket option * "sticky=1, cmsg=1": RFC2292 socket option */ static int ip6_setpktopt(int optname, u_char *buf, int len, struct ip6_pktopts *opt, struct ucred *cred, int sticky, int cmsg, int uproto) { int minmtupolicy, preftemp; int error; if (!sticky && !cmsg) { #ifdef DIAGNOSTIC printf("ip6_setpktopt: impossible case\n"); #endif return (EINVAL); } /* * IPV6_2292xxx is for backward compatibility to RFC2292, and should * not be specified in the context of RFC3542. Conversely, * RFC3542 types should not be specified in the context of RFC2292. 
*/ if (!cmsg) { switch (optname) { case IPV6_2292PKTINFO: case IPV6_2292HOPLIMIT: case IPV6_2292NEXTHOP: case IPV6_2292HOPOPTS: case IPV6_2292DSTOPTS: case IPV6_2292RTHDR: case IPV6_2292PKTOPTIONS: return (ENOPROTOOPT); } } if (sticky && cmsg) { switch (optname) { case IPV6_PKTINFO: case IPV6_HOPLIMIT: case IPV6_NEXTHOP: case IPV6_HOPOPTS: case IPV6_DSTOPTS: case IPV6_RTHDRDSTOPTS: case IPV6_RTHDR: case IPV6_USE_MIN_MTU: case IPV6_DONTFRAG: case IPV6_TCLASS: case IPV6_PREFER_TEMPADDR: /* XXX: not an RFC3542 option */ return (ENOPROTOOPT); } } switch (optname) { case IPV6_2292PKTINFO: case IPV6_PKTINFO: { struct ifnet *ifp = NULL; struct in6_pktinfo *pktinfo; if (len != sizeof(struct in6_pktinfo)) return (EINVAL); pktinfo = (struct in6_pktinfo *)buf; /* * An application can clear any sticky IPV6_PKTINFO option by * doing a "regular" setsockopt with ipi6_addr being * in6addr_any and ipi6_ifindex being zero. * [RFC 3542, Section 6] */ if (optname == IPV6_PKTINFO && opt->ip6po_pktinfo && pktinfo->ipi6_ifindex == 0 && IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) { ip6_clearpktopts(opt, optname); break; } if (uproto == IPPROTO_TCP && optname == IPV6_PKTINFO && sticky && !IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) { return (EINVAL); } if (IN6_IS_ADDR_MULTICAST(&pktinfo->ipi6_addr)) return (EINVAL); /* validate the interface index if specified. */ if (pktinfo->ipi6_ifindex > V_if_index) return (ENXIO); if (pktinfo->ipi6_ifindex) { ifp = ifnet_byindex(pktinfo->ipi6_ifindex); if (ifp == NULL) return (ENXIO); } if (ifp != NULL && (ifp->if_afdata[AF_INET6] == NULL || (ND_IFINFO(ifp)->flags & ND6_IFF_IFDISABLED) != 0)) return (ENETDOWN); if (ifp != NULL && !IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) { struct in6_ifaddr *ia; in6_setscope(&pktinfo->ipi6_addr, ifp, NULL); ia = in6ifa_ifpwithaddr(ifp, &pktinfo->ipi6_addr); if (ia == NULL) return (EADDRNOTAVAIL); ifa_free(&ia->ia_ifa); } /* * We store the address anyway, and let in6_selectsrc() * validate the specified address. This is because ipi6_addr * may not have enough information about its scope zone, and * we may need additional information (such as outgoing * interface or the scope zone of a destination address) to * disambiguate the scope. * XXX: the delay of the validation may confuse the * application when it is used as a sticky option. */ if (opt->ip6po_pktinfo == NULL) { opt->ip6po_pktinfo = malloc(sizeof(*pktinfo), M_IP6OPT, M_NOWAIT); if (opt->ip6po_pktinfo == NULL) return (ENOBUFS); } bcopy(pktinfo, opt->ip6po_pktinfo, sizeof(*pktinfo)); break; } case IPV6_2292HOPLIMIT: case IPV6_HOPLIMIT: { int *hlimp; /* * RFC 3542 deprecated the usage of sticky IPV6_HOPLIMIT * to simplify the ordering among hoplimit options. 
*/ if (optname == IPV6_HOPLIMIT && sticky) return (ENOPROTOOPT); if (len != sizeof(int)) return (EINVAL); hlimp = (int *)buf; if (*hlimp < -1 || *hlimp > 255) return (EINVAL); opt->ip6po_hlim = *hlimp; break; } case IPV6_TCLASS: { int tclass; if (len != sizeof(int)) return (EINVAL); tclass = *(int *)buf; if (tclass < -1 || tclass > 255) return (EINVAL); opt->ip6po_tclass = tclass; break; } case IPV6_2292NEXTHOP: case IPV6_NEXTHOP: if (cred != NULL) { error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS); if (error) return (error); } if (len == 0) { /* just remove the option */ ip6_clearpktopts(opt, IPV6_NEXTHOP); break; } /* check if cmsg_len is large enough for sa_len */ if (len < sizeof(struct sockaddr) || len < *buf) return (EINVAL); switch (((struct sockaddr *)buf)->sa_family) { case AF_INET6: { struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)buf; int error; if (sa6->sin6_len != sizeof(struct sockaddr_in6)) return (EINVAL); if (IN6_IS_ADDR_UNSPECIFIED(&sa6->sin6_addr) || IN6_IS_ADDR_MULTICAST(&sa6->sin6_addr)) { return (EINVAL); } if ((error = sa6_embedscope(sa6, V_ip6_use_defzone)) != 0) { return (error); } break; } case AF_LINK: /* should eventually be supported */ default: return (EAFNOSUPPORT); } /* turn off the previous option, then set the new option. */ ip6_clearpktopts(opt, IPV6_NEXTHOP); opt->ip6po_nexthop = malloc(*buf, M_IP6OPT, M_NOWAIT); if (opt->ip6po_nexthop == NULL) return (ENOBUFS); bcopy(buf, opt->ip6po_nexthop, *buf); break; case IPV6_2292HOPOPTS: case IPV6_HOPOPTS: { struct ip6_hbh *hbh; int hbhlen; /* * XXX: We don't allow a non-privileged user to set ANY HbH * options, since per-option restriction has too much * overhead. */ if (cred != NULL) { error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS); if (error) return (error); } if (len == 0) { ip6_clearpktopts(opt, IPV6_HOPOPTS); break; /* just remove the option */ } /* message length validation */ if (len < sizeof(struct ip6_hbh)) return (EINVAL); hbh = (struct ip6_hbh *)buf; hbhlen = (hbh->ip6h_len + 1) << 3; if (len != hbhlen) return (EINVAL); /* turn off the previous option, then set the new option. */ ip6_clearpktopts(opt, IPV6_HOPOPTS); opt->ip6po_hbh = malloc(hbhlen, M_IP6OPT, M_NOWAIT); if (opt->ip6po_hbh == NULL) return (ENOBUFS); bcopy(hbh, opt->ip6po_hbh, hbhlen); break; } case IPV6_2292DSTOPTS: case IPV6_DSTOPTS: case IPV6_RTHDRDSTOPTS: { struct ip6_dest *dest, **newdest = NULL; int destlen; if (cred != NULL) { /* XXX: see the comment for IPV6_HOPOPTS */ error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS); if (error) return (error); } if (len == 0) { ip6_clearpktopts(opt, optname); break; /* just remove the option */ } /* message length validation */ if (len < sizeof(struct ip6_dest)) return (EINVAL); dest = (struct ip6_dest *)buf; destlen = (dest->ip6d_len + 1) << 3; if (len != destlen) return (EINVAL); /* * Determine the position that the destination options header * should be inserted; before or after the routing header. */ switch (optname) { case IPV6_2292DSTOPTS: /* * The old advacned API is ambiguous on this point. * Our approach is to determine the position based * according to the existence of a routing header. * Note, however, that this depends on the order of the * extension headers in the ancillary data; the 1st * part of the destination options header must appear * before the routing header in the ancillary data, * too. * RFC3542 solved the ambiguity by introducing * separate ancillary data or option types. 
*/ if (opt->ip6po_rthdr == NULL) newdest = &opt->ip6po_dest1; else newdest = &opt->ip6po_dest2; break; case IPV6_RTHDRDSTOPTS: newdest = &opt->ip6po_dest1; break; case IPV6_DSTOPTS: newdest = &opt->ip6po_dest2; break; } /* turn off the previous option, then set the new option. */ ip6_clearpktopts(opt, optname); *newdest = malloc(destlen, M_IP6OPT, M_NOWAIT); if (*newdest == NULL) return (ENOBUFS); bcopy(dest, *newdest, destlen); break; } case IPV6_2292RTHDR: case IPV6_RTHDR: { struct ip6_rthdr *rth; int rthlen; if (len == 0) { ip6_clearpktopts(opt, IPV6_RTHDR); break; /* just remove the option */ } /* message length validation */ if (len < sizeof(struct ip6_rthdr)) return (EINVAL); rth = (struct ip6_rthdr *)buf; rthlen = (rth->ip6r_len + 1) << 3; if (len != rthlen) return (EINVAL); switch (rth->ip6r_type) { case IPV6_RTHDR_TYPE_0: if (rth->ip6r_len == 0) /* must contain one addr */ return (EINVAL); if (rth->ip6r_len % 2) /* length must be even */ return (EINVAL); if (rth->ip6r_len / 2 != rth->ip6r_segleft) return (EINVAL); break; default: return (EINVAL); /* not supported */ } /* turn off the previous option */ ip6_clearpktopts(opt, IPV6_RTHDR); opt->ip6po_rthdr = malloc(rthlen, M_IP6OPT, M_NOWAIT); if (opt->ip6po_rthdr == NULL) return (ENOBUFS); bcopy(rth, opt->ip6po_rthdr, rthlen); break; } case IPV6_USE_MIN_MTU: if (len != sizeof(int)) return (EINVAL); minmtupolicy = *(int *)buf; if (minmtupolicy != IP6PO_MINMTU_MCASTONLY && minmtupolicy != IP6PO_MINMTU_DISABLE && minmtupolicy != IP6PO_MINMTU_ALL) { return (EINVAL); } opt->ip6po_minmtu = minmtupolicy; break; case IPV6_DONTFRAG: if (len != sizeof(int)) return (EINVAL); if (uproto == IPPROTO_TCP || *(int *)buf == 0) { /* * we ignore this option for TCP sockets. * (RFC3542 leaves this case unspecified.) */ opt->ip6po_flags &= ~IP6PO_DONTFRAG; } else opt->ip6po_flags |= IP6PO_DONTFRAG; break; case IPV6_PREFER_TEMPADDR: if (len != sizeof(int)) return (EINVAL); preftemp = *(int *)buf; if (preftemp != IP6PO_TEMPADDR_SYSTEM && preftemp != IP6PO_TEMPADDR_NOTPREFER && preftemp != IP6PO_TEMPADDR_PREFER) { return (EINVAL); } opt->ip6po_prefer_tempaddr = preftemp; break; default: return (ENOPROTOOPT); } /* end of switch */ return (0); } /* * Routine called from ip6_output() to loop back a copy of an IP6 multicast * packet to the input queue of a specified interface. Note that this * calls the output routine of the loopback "driver", but with an interface * pointer that might NOT be &loif -- easier than replicating that code here. */ void ip6_mloopback(struct ifnet *ifp, struct mbuf *m) { struct mbuf *copym; struct ip6_hdr *ip6; copym = m_copym(m, 0, M_COPYALL, M_NOWAIT); if (copym == NULL) return; /* * Make sure to deep-copy IPv6 header portion in case the data * is in an mbuf cluster, so that we can safely override the IPv6 * header portion later. */ if (!M_WRITABLE(copym) || copym->m_len < sizeof(struct ip6_hdr)) { copym = m_pullup(copym, sizeof(struct ip6_hdr)); if (copym == NULL) return; } ip6 = mtod(copym, struct ip6_hdr *); /* * clear embedded scope identifiers if necessary. * in6_clearscope will touch the addresses only when necessary. */ in6_clearscope(&ip6->ip6_src); in6_clearscope(&ip6->ip6_dst); if (copym->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) { copym->m_pkthdr.csum_flags |= CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR; copym->m_pkthdr.csum_data = 0xffff; } if_simloop(ifp, copym, AF_INET6, 0); } /* * Chop IPv6 header off from the payload. 
*/ static int ip6_splithdr(struct mbuf *m, struct ip6_exthdrs *exthdrs) { struct mbuf *mh; struct ip6_hdr *ip6; ip6 = mtod(m, struct ip6_hdr *); if (m->m_len > sizeof(*ip6)) { mh = m_gethdr(M_NOWAIT, MT_DATA); if (mh == NULL) { m_freem(m); return ENOBUFS; } m_move_pkthdr(mh, m); M_ALIGN(mh, sizeof(*ip6)); m->m_len -= sizeof(*ip6); m->m_data += sizeof(*ip6); mh->m_next = m; m = mh; m->m_len = sizeof(*ip6); bcopy((caddr_t)ip6, mtod(m, caddr_t), sizeof(*ip6)); } exthdrs->ip6e_ip6 = m; return 0; } /* * Compute IPv6 extension header length. */ int ip6_optlen(struct inpcb *in6p) { int len; if (!in6p->in6p_outputopts) return 0; len = 0; #define elen(x) \ (((struct ip6_ext *)(x)) ? (((struct ip6_ext *)(x))->ip6e_len + 1) << 3 : 0) len += elen(in6p->in6p_outputopts->ip6po_hbh); if (in6p->in6p_outputopts->ip6po_rthdr) /* dest1 is valid with rthdr only */ len += elen(in6p->in6p_outputopts->ip6po_dest1); len += elen(in6p->in6p_outputopts->ip6po_rthdr); len += elen(in6p->in6p_outputopts->ip6po_dest2); return len; #undef elen } Index: user/ngie/bug-237403/sys/netpfil/ipfw/ip_fw2.c =================================================================== --- user/ngie/bug-237403/sys/netpfil/ipfw/ip_fw2.c (revision 346925) +++ user/ngie/bug-237403/sys/netpfil/ipfw/ip_fw2.c (revision 346926) @@ -1,3489 +1,3491 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2002-2009 Luigi Rizzo, Universita` di Pisa * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); /* * The FreeBSD IP packet firewall, main file */ #include "opt_ipfw.h" #include "opt_ipdivert.h" #include "opt_inet.h" #ifndef INET #error "IPFIREWALL requires INET" #endif /* INET */ #include "opt_inet6.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* for ETHERTYPE_IP */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #include #endif #include /* for struct grehdr */ #include #include /* XXX for in_cksum */ #ifdef MAC #include #endif /* * static variables followed by global ones. * All ipfw global variables are here. */ VNET_DEFINE_STATIC(int, fw_deny_unknown_exthdrs); #define V_fw_deny_unknown_exthdrs VNET(fw_deny_unknown_exthdrs) VNET_DEFINE_STATIC(int, fw_permit_single_frag6) = 1; #define V_fw_permit_single_frag6 VNET(fw_permit_single_frag6) #ifdef IPFIREWALL_DEFAULT_TO_ACCEPT static int default_to_accept = 1; #else static int default_to_accept; #endif VNET_DEFINE(int, autoinc_step); VNET_DEFINE(int, fw_one_pass) = 1; VNET_DEFINE(unsigned int, fw_tables_max); VNET_DEFINE(unsigned int, fw_tables_sets) = 0; /* Don't use set-aware tables */ /* Use 128 tables by default */ static unsigned int default_fw_tables = IPFW_TABLES_DEFAULT; #ifndef LINEAR_SKIPTO static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards); #define JUMP(ch, f, num, targ, back) jump_fast(ch, f, num, targ, back) #else static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards); #define JUMP(ch, f, num, targ, back) jump_linear(ch, f, num, targ, back) #endif /* * Each rule belongs to one of 32 different sets (0..31). * The variable set_disable contains one bit per set. * If the bit is set, all rules in the corresponding set * are disabled. Set RESVD_SET(31) is reserved for the default rule * and rules that are not deleted by the flush command, * and CANNOT be disabled. * Rules in set RESVD_SET can only be deleted individually. */ VNET_DEFINE(u_int32_t, set_disable); #define V_set_disable VNET(set_disable) VNET_DEFINE(int, fw_verbose); /* counter for ipfw_log(NULL...) 
*/ VNET_DEFINE(u_int64_t, norule_counter); VNET_DEFINE(int, verbose_limit); /* layer3_chain contains the list of rules for layer 3 */ VNET_DEFINE(struct ip_fw_chain, layer3_chain); /* ipfw_vnet_ready controls when we are open for business */ VNET_DEFINE(int, ipfw_vnet_ready) = 0; VNET_DEFINE(int, ipfw_nat_ready) = 0; ipfw_nat_t *ipfw_nat_ptr = NULL; struct cfg_nat *(*lookup_nat_ptr)(struct nat_list *, int); ipfw_nat_cfg_t *ipfw_nat_cfg_ptr; ipfw_nat_cfg_t *ipfw_nat_del_ptr; ipfw_nat_cfg_t *ipfw_nat_get_cfg_ptr; ipfw_nat_cfg_t *ipfw_nat_get_log_ptr; #ifdef SYSCTL_NODE uint32_t dummy_def = IPFW_DEFAULT_RULE; static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS); static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS); SYSBEGIN(f3) SYSCTL_NODE(_net_inet_ip, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, one_pass, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_one_pass), 0, "Only do a single pass through ipfw when using dummynet(4)"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, autoinc_step, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(autoinc_step), 0, "Rule number auto-increment step"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE3, &VNET_NAME(fw_verbose), 0, "Log matches to ipfw rules"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, verbose_limit, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(verbose_limit), 0, "Set upper limit of matches of ipfw rules logged"); SYSCTL_UINT(_net_inet_ip_fw, OID_AUTO, default_rule, CTLFLAG_RD, &dummy_def, 0, "The default/max possible rule number."); SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_max, CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, 0, 0, sysctl_ipfw_table_num, "IU", "Maximum number of concurrently used tables"); SYSCTL_PROC(_net_inet_ip_fw, OID_AUTO, tables_sets, CTLFLAG_VNET | CTLTYPE_UINT | CTLFLAG_RW, 0, 0, sysctl_ipfw_tables_sets, "IU", "Use per-set namespace for tables"); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, default_to_accept, CTLFLAG_RDTUN, &default_to_accept, 0, "Make the default rule accept all packets."); TUNABLE_INT("net.inet.ip.fw.tables_max", (int *)&default_fw_tables); SYSCTL_INT(_net_inet_ip_fw, OID_AUTO, static_count, CTLFLAG_VNET | CTLFLAG_RD, &VNET_NAME(layer3_chain.n_rules), 0, "Number of static rules"); #ifdef INET6 SYSCTL_DECL(_net_inet6_ip6); SYSCTL_NODE(_net_inet6_ip6, OID_AUTO, fw, CTLFLAG_RW, 0, "Firewall"); SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, deny_unknown_exthdrs, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_deny_unknown_exthdrs), 0, "Deny packets with unknown IPv6 Extension Headers"); SYSCTL_INT(_net_inet6_ip6_fw, OID_AUTO, permit_single_frag6, CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE, &VNET_NAME(fw_permit_single_frag6), 0, "Permit single packet IPv6 fragments"); #endif /* INET6 */ SYSEND #endif /* SYSCTL_NODE */ /* * Some macros used in the various matching options. * L3HDR maps an ipv4 pointer into a layer3 header pointer of type T * Other macros just cast void * into the appropriate type */ #define L3HDR(T, ip) ((T *)((u_int32_t *)(ip) + (ip)->ip_hl)) #define TCP(p) ((struct tcphdr *)(p)) #define SCTP(p) ((struct sctphdr *)(p)) #define UDP(p) ((struct udphdr *)(p)) #define ICMP(p) ((struct icmphdr *)(p)) #define ICMP6(p) ((struct icmp6_hdr *)(p)) static __inline int icmptype_match(struct icmphdr *icmp, ipfw_insn_u32 *cmd) { int type = icmp->icmp_type; return (type <= ICMP_MAXTYPE && (cmd->d[0] & (1<<type)) ); } #define TT ( (1 << ICMP_ECHO) | (1 << ICMP_ROUTERSOLICIT) | \ (1 << ICMP_TSTAMP) | (1 << ICMP_IREQ) | (1 << ICMP_MASKREQ) ) static int is_icmp_query(struct icmphdr *icmp) { int type = icmp->icmp_type; return (type <= ICMP_MAXTYPE && (TT & (1<<type)) ); } #undef TT /* * The following checks use two arrays of 8 or 16 bits to store the * bits that we want set or clear, respectively. They are in the * low and high half of cmd->arg1 or cmd->d[0]. * * We scan options and store the bits we find set.
We succeed if * * (want_set & ~bits) == 0 && (want_clear & ~bits) == want_clear * * The code is sometimes optimized not to store additional variables. */ static int flags_match(ipfw_insn *cmd, u_int8_t bits) { u_char want_clear; bits = ~bits; if ( ((cmd->arg1 & 0xff) & bits) != 0) return 0; /* some bits we want set were clear */ want_clear = (cmd->arg1 >> 8) & 0xff; if ( (want_clear & bits) != want_clear) return 0; /* some bits we want clear were set */ return 1; } static int ipopts_match(struct ip *ip, ipfw_insn *cmd) { int optlen, bits = 0; u_char *cp = (u_char *)(ip + 1); int x = (ip->ip_hl << 2) - sizeof (struct ip); for (; x > 0; x -= optlen, cp += optlen) { int opt = cp[IPOPT_OPTVAL]; if (opt == IPOPT_EOL) break; if (opt == IPOPT_NOP) optlen = 1; else { optlen = cp[IPOPT_OLEN]; if (optlen <= 0 || optlen > x) return 0; /* invalid or truncated */ } switch (opt) { default: break; case IPOPT_LSRR: bits |= IP_FW_IPOPT_LSRR; break; case IPOPT_SSRR: bits |= IP_FW_IPOPT_SSRR; break; case IPOPT_RR: bits |= IP_FW_IPOPT_RR; break; case IPOPT_TS: bits |= IP_FW_IPOPT_TS; break; } } return (flags_match(cmd, bits)); } static int tcpopts_match(struct tcphdr *tcp, ipfw_insn *cmd) { int optlen, bits = 0; u_char *cp = (u_char *)(tcp + 1); int x = (tcp->th_off << 2) - sizeof(struct tcphdr); for (; x > 0; x -= optlen, cp += optlen) { int opt = cp[0]; if (opt == TCPOPT_EOL) break; if (opt == TCPOPT_NOP) optlen = 1; else { optlen = cp[1]; if (optlen <= 0) break; } switch (opt) { default: break; case TCPOPT_MAXSEG: bits |= IP_FW_TCPOPT_MSS; break; case TCPOPT_WINDOW: bits |= IP_FW_TCPOPT_WINDOW; break; case TCPOPT_SACK_PERMITTED: case TCPOPT_SACK: bits |= IP_FW_TCPOPT_SACK; break; case TCPOPT_TIMESTAMP: bits |= IP_FW_TCPOPT_TS; break; } } return (flags_match(cmd, bits)); } static int iface_match(struct ifnet *ifp, ipfw_insn_if *cmd, struct ip_fw_chain *chain, uint32_t *tablearg) { if (ifp == NULL) /* no iface with this packet, match fails */ return (0); /* Check by name or by IP address */ if (cmd->name[0] != '\0') { /* match by name */ if (cmd->name[0] == '\1') /* use tablearg to match */ return ipfw_lookup_table(chain, cmd->p.kidx, 0, &ifp->if_index, tablearg); /* Check name */ if (cmd->p.glob) { if (fnmatch(cmd->name, ifp->if_xname, 0) == 0) return(1); } else { if (strncmp(ifp->if_xname, cmd->name, IFNAMSIZ) == 0) return(1); } } else { #if !defined(USERSPACE) && defined(__FreeBSD__) /* and OSX too ? */ struct ifaddr *ia; if_addr_rlock(ifp); CK_STAILQ_FOREACH(ia, &ifp->if_addrhead, ifa_link) { if (ia->ifa_addr->sa_family != AF_INET) continue; if (cmd->p.ip.s_addr == ((struct sockaddr_in *) (ia->ifa_addr))->sin_addr.s_addr) { if_addr_runlock(ifp); return(1); /* match */ } } if_addr_runlock(ifp); #endif /* __FreeBSD__ */ } return(0); /* no match, fail ... */ } /* * The verify_path function checks if a route to the src exists and * if it is reachable via ifp (when provided). * * The 'verrevpath' option checks that the interface that an IP packet * arrives on is the same interface that traffic destined for the * packet's source address would be routed out of. * The 'versrcreach' option just checks that the source address is * reachable via any route (except default) in the routing table. * These two are a measure to block forged packets. This is also * commonly known as "anti-spoofing" or Unicast Reverse Path * Forwarding (Unicast RFP) in Cisco-ese. 
The name of the knobs * is purposely reminiscent of the Cisco IOS command, * * ip verify unicast reverse-path * ip verify unicast source reachable-via any * * which implements the same functionality. But note that the syntax * is misleading, and the check may be performed on all IP packets * whether unicast, multicast, or broadcast. */ static int verify_path(struct in_addr src, struct ifnet *ifp, u_int fib) { #if defined(USERSPACE) || !defined(__FreeBSD__) return 0; #else struct nhop4_basic nh4; if (fib4_lookup_nh_basic(fib, src, NHR_IFAIF, 0, &nh4) != 0) return (0); /* * If ifp is provided, check for equality with rtentry. * We should use rt->rt_ifa->ifa_ifp, instead of rt->rt_ifp, * in order to pass packets injected back by if_simloop(): * routing entry (via lo0) for our own address * may exist, so we need to handle routing assymetry. */ if (ifp != NULL && ifp != nh4.nh_ifp) return (0); /* if no ifp provided, check if rtentry is not default route */ if (ifp == NULL && (nh4.nh_flags & NHF_DEFAULT) != 0) return (0); /* or if this is a blackhole/reject route */ if (ifp == NULL && (nh4.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0) return (0); /* found valid route */ return 1; #endif /* __FreeBSD__ */ } /* * Generate an SCTP packet containing an ABORT chunk. The verification tag * is given by vtag. The T-bit is set in the ABORT chunk if and only if * reflected is not 0. */ static struct mbuf * ipfw_send_abort(struct mbuf *replyto, struct ipfw_flow_id *id, u_int32_t vtag, int reflected) { struct mbuf *m; struct ip *ip; #ifdef INET6 struct ip6_hdr *ip6; #endif struct sctphdr *sctp; struct sctp_chunkhdr *chunk; u_int16_t hlen, plen, tlen; MGETHDR(m, M_NOWAIT, MT_DATA); if (m == NULL) return (NULL); M_SETFIB(m, id->fib); #ifdef MAC if (replyto != NULL) mac_netinet_firewall_reply(replyto, m); else mac_netinet_firewall_send(m); #else (void)replyto; /* don't warn about unused arg */ #endif switch (id->addr_type) { case 4: hlen = sizeof(struct ip); break; #ifdef INET6 case 6: hlen = sizeof(struct ip6_hdr); break; #endif default: /* XXX: log me?!? */ FREE_PKT(m); return (NULL); } plen = sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr); tlen = hlen + plen; m->m_data += max_linkhdr; m->m_flags |= M_SKIP_FIREWALL; m->m_pkthdr.len = m->m_len = tlen; m->m_pkthdr.rcvif = NULL; bzero(m->m_data, tlen); switch (id->addr_type) { case 4: ip = mtod(m, struct ip *); ip->ip_v = 4; ip->ip_hl = sizeof(struct ip) >> 2; ip->ip_tos = IPTOS_LOWDELAY; ip->ip_len = htons(tlen); ip->ip_id = htons(0); ip->ip_off = htons(0); ip->ip_ttl = V_ip_defttl; ip->ip_p = IPPROTO_SCTP; ip->ip_sum = 0; ip->ip_src.s_addr = htonl(id->dst_ip); ip->ip_dst.s_addr = htonl(id->src_ip); sctp = (struct sctphdr *)(ip + 1); break; #ifdef INET6 case 6: ip6 = mtod(m, struct ip6_hdr *); ip6->ip6_vfc = IPV6_VERSION; ip6->ip6_plen = htons(plen); ip6->ip6_nxt = IPPROTO_SCTP; ip6->ip6_hlim = IPV6_DEFHLIM; ip6->ip6_src = id->dst_ip6; ip6->ip6_dst = id->src_ip6; sctp = (struct sctphdr *)(ip6 + 1); break; #endif } sctp->src_port = htons(id->dst_port); sctp->dest_port = htons(id->src_port); sctp->v_tag = htonl(vtag); sctp->checksum = htonl(0); chunk = (struct sctp_chunkhdr *)(sctp + 1); chunk->chunk_type = SCTP_ABORT_ASSOCIATION; chunk->chunk_flags = 0; if (reflected != 0) { chunk->chunk_flags |= SCTP_HAD_NO_TCB; } chunk->chunk_length = htons(sizeof(struct sctp_chunkhdr)); sctp->checksum = sctp_calculate_cksum(m, hlen); return (m); } /* * Generate a TCP packet, containing either a RST or a keepalive. 
* When flags & TH_RST, we are sending a RST packet, because of a * "reset" action matched the packet. * Otherwise we are sending a keepalive, and flags & TH_ * The 'replyto' mbuf is the mbuf being replied to, if any, and is required * so that MAC can label the reply appropriately. */ struct mbuf * ipfw_send_pkt(struct mbuf *replyto, struct ipfw_flow_id *id, u_int32_t seq, u_int32_t ack, int flags) { struct mbuf *m = NULL; /* stupid compiler */ struct ip *h = NULL; /* stupid compiler */ #ifdef INET6 struct ip6_hdr *h6 = NULL; #endif struct tcphdr *th = NULL; int len, dir; MGETHDR(m, M_NOWAIT, MT_DATA); if (m == NULL) return (NULL); M_SETFIB(m, id->fib); #ifdef MAC if (replyto != NULL) mac_netinet_firewall_reply(replyto, m); else mac_netinet_firewall_send(m); #else (void)replyto; /* don't warn about unused arg */ #endif switch (id->addr_type) { case 4: len = sizeof(struct ip) + sizeof(struct tcphdr); break; #ifdef INET6 case 6: len = sizeof(struct ip6_hdr) + sizeof(struct tcphdr); break; #endif default: /* XXX: log me?!? */ FREE_PKT(m); return (NULL); } dir = ((flags & (TH_SYN | TH_RST)) == TH_SYN); m->m_data += max_linkhdr; m->m_flags |= M_SKIP_FIREWALL; m->m_pkthdr.len = m->m_len = len; m->m_pkthdr.rcvif = NULL; bzero(m->m_data, len); switch (id->addr_type) { case 4: h = mtod(m, struct ip *); /* prepare for checksum */ h->ip_p = IPPROTO_TCP; h->ip_len = htons(sizeof(struct tcphdr)); if (dir) { h->ip_src.s_addr = htonl(id->src_ip); h->ip_dst.s_addr = htonl(id->dst_ip); } else { h->ip_src.s_addr = htonl(id->dst_ip); h->ip_dst.s_addr = htonl(id->src_ip); } th = (struct tcphdr *)(h + 1); break; #ifdef INET6 case 6: h6 = mtod(m, struct ip6_hdr *); /* prepare for checksum */ h6->ip6_nxt = IPPROTO_TCP; h6->ip6_plen = htons(sizeof(struct tcphdr)); if (dir) { h6->ip6_src = id->src_ip6; h6->ip6_dst = id->dst_ip6; } else { h6->ip6_src = id->dst_ip6; h6->ip6_dst = id->src_ip6; } th = (struct tcphdr *)(h6 + 1); break; #endif } if (dir) { th->th_sport = htons(id->src_port); th->th_dport = htons(id->dst_port); } else { th->th_sport = htons(id->dst_port); th->th_dport = htons(id->src_port); } th->th_off = sizeof(struct tcphdr) >> 2; if (flags & TH_RST) { if (flags & TH_ACK) { th->th_seq = htonl(ack); th->th_flags = TH_RST; } else { if (flags & TH_SYN) seq++; th->th_ack = htonl(seq); th->th_flags = TH_RST | TH_ACK; } } else { /* * Keepalive - use caller provided sequence numbers */ th->th_seq = htonl(seq); th->th_ack = htonl(ack); th->th_flags = TH_ACK; } switch (id->addr_type) { case 4: th->th_sum = in_cksum(m, len); /* finish the ip header */ h->ip_v = 4; h->ip_hl = sizeof(*h) >> 2; h->ip_tos = IPTOS_LOWDELAY; h->ip_off = htons(0); h->ip_len = htons(len); h->ip_ttl = V_ip_defttl; h->ip_sum = 0; break; #ifdef INET6 case 6: th->th_sum = in6_cksum(m, IPPROTO_TCP, sizeof(*h6), sizeof(struct tcphdr)); /* finish the ip6 header */ h6->ip6_vfc |= IPV6_VERSION; h6->ip6_hlim = IPV6_DEFHLIM; break; #endif } return (m); } #ifdef INET6 /* * ipv6 specific rules here... 
*/ static __inline int icmp6type_match (int type, ipfw_insn_u32 *cmd) { return (type <= ICMP6_MAXTYPE && (cmd->d[type/32] & (1<<(type%32)) ) ); } static int flow6id_match( int curr_flow, ipfw_insn_u32 *cmd ) { int i; for (i=0; i <= cmd->o.arg1; ++i ) if (curr_flow == cmd->d[i] ) return 1; return 0; } /* support for IP6_*_ME opcodes */ static const struct in6_addr lla_mask = {{{ 0xff, 0xff, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }}}; static int ipfw_localip6(struct in6_addr *in6) { struct rm_priotracker in6_ifa_tracker; struct in6_ifaddr *ia; if (IN6_IS_ADDR_MULTICAST(in6)) return (0); if (!IN6_IS_ADDR_LINKLOCAL(in6)) return (in6_localip(in6)); IN6_IFADDR_RLOCK(&in6_ifa_tracker); CK_STAILQ_FOREACH(ia, &V_in6_ifaddrhead, ia_link) { if (!IN6_IS_ADDR_LINKLOCAL(&ia->ia_addr.sin6_addr)) continue; if (IN6_ARE_MASKED_ADDR_EQUAL(&ia->ia_addr.sin6_addr, in6, &lla_mask)) { IN6_IFADDR_RUNLOCK(&in6_ifa_tracker); return (1); } } IN6_IFADDR_RUNLOCK(&in6_ifa_tracker); return (0); } static int verify_path6(struct in6_addr *src, struct ifnet *ifp, u_int fib) { struct nhop6_basic nh6; if (IN6_IS_SCOPE_LINKLOCAL(src)) return (1); if (fib6_lookup_nh_basic(fib, src, 0, NHR_IFAIF, 0, &nh6) != 0) return (0); /* If ifp is provided, check for equality with route table. */ if (ifp != NULL && ifp != nh6.nh_ifp) return (0); /* if no ifp provided, check if rtentry is not default route */ if (ifp == NULL && (nh6.nh_flags & NHF_DEFAULT) != 0) return (0); /* or if this is a blackhole/reject route */ if (ifp == NULL && (nh6.nh_flags & (NHF_REJECT|NHF_BLACKHOLE)) != 0) return (0); /* found valid route */ return 1; } static int is_icmp6_query(int icmp6_type) { if ((icmp6_type <= ICMP6_MAXTYPE) && (icmp6_type == ICMP6_ECHO_REQUEST || icmp6_type == ICMP6_MEMBERSHIP_QUERY || icmp6_type == ICMP6_WRUREQUEST || icmp6_type == ICMP6_FQDN_QUERY || icmp6_type == ICMP6_NI_QUERY)) return (1); return (0); } static int map_icmp_unreach(int code) { /* RFC 7915 p4.2 */ switch (code) { case ICMP_UNREACH_NET: case ICMP_UNREACH_HOST: case ICMP_UNREACH_SRCFAIL: case ICMP_UNREACH_NET_UNKNOWN: case ICMP_UNREACH_HOST_UNKNOWN: case ICMP_UNREACH_TOSNET: case ICMP_UNREACH_TOSHOST: return (ICMP6_DST_UNREACH_NOROUTE); case ICMP_UNREACH_PORT: return (ICMP6_DST_UNREACH_NOPORT); default: /* * Map the rest of codes into admit prohibited. * XXX: unreach proto should be mapped into ICMPv6 * parameter problem, but we use only unreach type. 
*/ return (ICMP6_DST_UNREACH_ADMIN); } } static void send_reject6(struct ip_fw_args *args, int code, u_int hlen, struct ip6_hdr *ip6) { struct mbuf *m; m = args->m; if (code == ICMP6_UNREACH_RST && args->f_id.proto == IPPROTO_TCP) { struct tcphdr *tcp; tcp = (struct tcphdr *)((char *)ip6 + hlen); if ((tcp->th_flags & TH_RST) == 0) { struct mbuf *m0; m0 = ipfw_send_pkt(args->m, &(args->f_id), ntohl(tcp->th_seq), ntohl(tcp->th_ack), tcp->th_flags | TH_RST); if (m0 != NULL) ip6_output(m0, NULL, NULL, 0, NULL, NULL, NULL); } FREE_PKT(m); } else if (code == ICMP6_UNREACH_ABORT && args->f_id.proto == IPPROTO_SCTP) { struct mbuf *m0; struct sctphdr *sctp; u_int32_t v_tag; int reflected; sctp = (struct sctphdr *)((char *)ip6 + hlen); reflected = 1; v_tag = ntohl(sctp->v_tag); /* Investigate the first chunk header if available */ if (m->m_len >= hlen + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr)) { struct sctp_chunkhdr *chunk; chunk = (struct sctp_chunkhdr *)(sctp + 1); switch (chunk->chunk_type) { case SCTP_INITIATION: /* * Packets containing an INIT chunk MUST have * a zero v-tag. */ if (v_tag != 0) { v_tag = 0; break; } /* INIT chunk MUST NOT be bundled */ if (m->m_pkthdr.len > hlen + sizeof(struct sctphdr) + ntohs(chunk->chunk_length) + 3) { break; } /* Use the initiate tag if available */ if ((m->m_len >= hlen + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd))) { struct sctp_init *init; init = (struct sctp_init *)(chunk + 1); v_tag = ntohl(init->initiate_tag); reflected = 0; } break; case SCTP_ABORT_ASSOCIATION: /* * If the packet contains an ABORT chunk, don't * reply. * XXX: We should search through all chunks, * but don't do to avoid attacks. */ v_tag = 0; break; } } if (v_tag == 0) { m0 = NULL; } else { m0 = ipfw_send_abort(args->m, &(args->f_id), v_tag, reflected); } if (m0 != NULL) ip6_output(m0, NULL, NULL, 0, NULL, NULL, NULL); FREE_PKT(m); } else if (code != ICMP6_UNREACH_RST && code != ICMP6_UNREACH_ABORT) { /* Send an ICMPv6 unreach. */ #if 0 /* * Unlike above, the mbufs need to line up with the ip6 hdr, * as the contents are read. We need to m_adj() the * needed amount. * The mbuf will however be thrown away so we can adjust it. * Remember we did an m_pullup on it already so we * can make some assumptions about contiguousness. */ if (args->L3offset) m_adj(m, args->L3offset); #endif icmp6_error(m, ICMP6_DST_UNREACH, code, 0); } else FREE_PKT(m); args->m = NULL; } #endif /* INET6 */ /* * sends a reject message, consuming the mbuf passed as an argument. */ static void send_reject(struct ip_fw_args *args, int code, int iplen, struct ip *ip) { #if 0 /* XXX When ip is not guaranteed to be at mtod() we will * need to account for this */ * The mbuf will however be thrown away so we can adjust it. * Remember we did an m_pullup on it already so we * can make some assumptions about contiguousness. 
*/ if (args->L3offset) m_adj(m, args->L3offset); #endif if (code != ICMP_REJECT_RST && code != ICMP_REJECT_ABORT) { /* Send an ICMP unreach */ icmp_error(args->m, ICMP_UNREACH, code, 0L, 0); } else if (code == ICMP_REJECT_RST && args->f_id.proto == IPPROTO_TCP) { struct tcphdr *const tcp = L3HDR(struct tcphdr, mtod(args->m, struct ip *)); if ( (tcp->th_flags & TH_RST) == 0) { struct mbuf *m; m = ipfw_send_pkt(args->m, &(args->f_id), ntohl(tcp->th_seq), ntohl(tcp->th_ack), tcp->th_flags | TH_RST); if (m != NULL) ip_output(m, NULL, NULL, 0, NULL, NULL); } FREE_PKT(args->m); } else if (code == ICMP_REJECT_ABORT && args->f_id.proto == IPPROTO_SCTP) { struct mbuf *m; struct sctphdr *sctp; struct sctp_chunkhdr *chunk; struct sctp_init *init; u_int32_t v_tag; int reflected; sctp = L3HDR(struct sctphdr, mtod(args->m, struct ip *)); reflected = 1; v_tag = ntohl(sctp->v_tag); if (iplen >= (ip->ip_hl << 2) + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr)) { /* Look at the first chunk header if available */ chunk = (struct sctp_chunkhdr *)(sctp + 1); switch (chunk->chunk_type) { case SCTP_INITIATION: /* * Packets containing an INIT chunk MUST have * a zero v-tag. */ if (v_tag != 0) { v_tag = 0; break; } /* INIT chunk MUST NOT be bundled */ if (iplen > (ip->ip_hl << 2) + sizeof(struct sctphdr) + ntohs(chunk->chunk_length) + 3) { break; } /* Use the initiate tag if available */ if ((iplen >= (ip->ip_hl << 2) + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd))) { init = (struct sctp_init *)(chunk + 1); v_tag = ntohl(init->initiate_tag); reflected = 0; } break; case SCTP_ABORT_ASSOCIATION: /* * If the packet contains an ABORT chunk, don't * reply. * XXX: We should search through all chunks, * but don't do to avoid attacks. */ v_tag = 0; break; } } if (v_tag == 0) { m = NULL; } else { m = ipfw_send_abort(args->m, &(args->f_id), v_tag, reflected); } if (m != NULL) ip_output(m, NULL, NULL, 0, NULL, NULL); FREE_PKT(args->m); } else FREE_PKT(args->m); args->m = NULL; } /* * Support for uid/gid/jail lookup. These tests are expensive * (because we may need to look into the list of active sockets) * so we cache the results. ugid_lookupp is 0 if we have not * yet done a lookup, 1 if we succeeded, and -1 if we tried * and failed. The function always returns the match value. * We could actually spare the variable and use *uc, setting * it to '(void *)check_uidgid if we have no info, NULL if * we tried and failed, or any other value if successful. */ static int check_uidgid(ipfw_insn_u32 *insn, struct ip_fw_args *args, int *ugid_lookupp, struct ucred **uc) { #if defined(USERSPACE) return 0; // not supported in userspace #else #ifndef __FreeBSD__ /* XXX */ return cred_check(insn, proto, oif, dst_ip, dst_port, src_ip, src_port, (struct bsd_ucred *)uc, ugid_lookupp, ((struct mbuf *)inp)->m_skb); #else /* FreeBSD */ struct in_addr src_ip, dst_ip; struct inpcbinfo *pi; struct ipfw_flow_id *id; struct inpcb *pcb, *inp; int lookupflags; int match; id = &args->f_id; inp = args->inp; /* * Check to see if the UDP or TCP stack supplied us with * the PCB. If so, rather then holding a lock and looking * up the PCB, we can use the one that was supplied. */ if (inp && *ugid_lookupp == 0) { INP_LOCK_ASSERT(inp); if (inp->inp_socket != NULL) { *uc = crhold(inp->inp_cred); *ugid_lookupp = 1; } else *ugid_lookupp = -1; } /* * If we have already been here and the packet has no * PCB entry associated with it, then we can safely * assume that this is a no match. 
*/ if (*ugid_lookupp == -1) return (0); if (id->proto == IPPROTO_TCP) { lookupflags = 0; pi = &V_tcbinfo; } else if (id->proto == IPPROTO_UDP) { lookupflags = INPLOOKUP_WILDCARD; pi = &V_udbinfo; } else if (id->proto == IPPROTO_UDPLITE) { lookupflags = INPLOOKUP_WILDCARD; pi = &V_ulitecbinfo; } else return 0; lookupflags |= INPLOOKUP_RLOCKPCB; match = 0; if (*ugid_lookupp == 0) { if (id->addr_type == 6) { #ifdef INET6 if (args->flags & IPFW_ARGS_IN) pcb = in6_pcblookup_mbuf(pi, &id->src_ip6, htons(id->src_port), &id->dst_ip6, htons(id->dst_port), lookupflags, NULL, args->m); else pcb = in6_pcblookup_mbuf(pi, &id->dst_ip6, htons(id->dst_port), &id->src_ip6, htons(id->src_port), lookupflags, args->ifp, args->m); #else *ugid_lookupp = -1; return (0); #endif } else { src_ip.s_addr = htonl(id->src_ip); dst_ip.s_addr = htonl(id->dst_ip); if (args->flags & IPFW_ARGS_IN) pcb = in_pcblookup_mbuf(pi, src_ip, htons(id->src_port), dst_ip, htons(id->dst_port), lookupflags, NULL, args->m); else pcb = in_pcblookup_mbuf(pi, dst_ip, htons(id->dst_port), src_ip, htons(id->src_port), lookupflags, args->ifp, args->m); } if (pcb != NULL) { INP_RLOCK_ASSERT(pcb); *uc = crhold(pcb->inp_cred); *ugid_lookupp = 1; INP_RUNLOCK(pcb); } if (*ugid_lookupp == 0) { /* * We tried and failed, set the variable to -1 * so we will not try again on this packet. */ *ugid_lookupp = -1; return (0); } } if (insn->o.opcode == O_UID) match = ((*uc)->cr_uid == (uid_t)insn->d[0]); else if (insn->o.opcode == O_GID) match = groupmember((gid_t)insn->d[0], *uc); else if (insn->o.opcode == O_JAIL) match = ((*uc)->cr_prison->pr_id == (int)insn->d[0]); return (match); #endif /* __FreeBSD__ */ #endif /* not supported in userspace */ } /* * Helper function to set args with info on the rule after the matching * one. slot is precise, whereas we guess rule_id as they are * assigned sequentially. */ static inline void set_match(struct ip_fw_args *args, int slot, struct ip_fw_chain *chain) { args->rule.chain_id = chain->id; args->rule.slot = slot + 1; /* we use 0 as a marker */ args->rule.rule_id = 1 + chain->map[slot]->id; args->rule.rulenum = chain->map[slot]->rulenum; args->flags |= IPFW_ARGS_REF; } #ifndef LINEAR_SKIPTO /* * Helper function to enable cached rule lookups using * cached_id and cached_pos fields in ipfw rule. */ static int jump_fast(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards) { int f_pos; /* If possible use cached f_pos (in f->cached_pos), * whose version is written in f->cached_id * (horrible hacks to avoid changing the ABI). */ if (num != IP_FW_TARG && f->cached_id == chain->id) f_pos = f->cached_pos; else { int i = IP_FW_ARG_TABLEARG(chain, num, skipto); /* make sure we do not jump backward */ if (jump_backwards == 0 && i <= f->rulenum) i = f->rulenum + 1; if (chain->idxmap != NULL) f_pos = chain->idxmap[i]; else f_pos = ipfw_find_rule(chain, i, 0); /* update the cache */ if (num != IP_FW_TARG) { f->cached_id = chain->id; f->cached_pos = f_pos; } } return (f_pos); } #else /* * Helper function to enable real fast rule lookups. */ static int jump_linear(struct ip_fw_chain *chain, struct ip_fw *f, int num, int tablearg, int jump_backwards) { int f_pos; num = IP_FW_ARG_TABLEARG(chain, num, skipto); /* make sure we do not jump backward */ if (jump_backwards == 0 && num <= f->rulenum) num = f->rulenum + 1; f_pos = chain->idxmap[num]; return (f_pos); } #endif #define TARG(k, f) IP_FW_ARG_TABLEARG(chain, k, f) /* * The main check routine for the firewall. 
* * All arguments are in args so we can modify them and return them * back to the caller. * * Parameters: * * args->m (in/out) The packet; we set to NULL when/if we nuke it. * Starts with the IP header. * args->L3offset Number of bytes bypassed if we came from L2. * e.g. often sizeof(eh) ** NOTYET ** * args->ifp Incoming or outgoing interface. * args->divert_rule (in/out) * Skip up to the first rule past this rule number; * upon return, non-zero port number for divert or tee. * * args->rule Pointer to the last matching rule (in/out) * args->next_hop Socket we are forwarding to (out). * args->next_hop6 IPv6 next hop we are forwarding to (out). * args->f_id Addresses grabbed from the packet (out) * args->rule.info a cookie depending on rule action * * Return value: * * IP_FW_PASS the packet must be accepted * IP_FW_DENY the packet must be dropped * IP_FW_DIVERT divert packet, port in m_tag * IP_FW_TEE tee packet, port in m_tag * IP_FW_DUMMYNET to dummynet, pipe in args->cookie * IP_FW_NETGRAPH into netgraph, cookie args->cookie * args->rule contains the matching rule, * args->rule.info has additional information. * */ int ipfw_chk(struct ip_fw_args *args) { /* * Local variables holding state while processing a packet: * * IMPORTANT NOTE: to speed up the processing of rules, there * are some assumption on the values of the variables, which * are documented here. Should you change them, please check * the implementation of the various instructions to make sure * that they still work. * * m | args->m Pointer to the mbuf, as received from the caller. * It may change if ipfw_chk() does an m_pullup, or if it * consumes the packet because it calls send_reject(). * XXX This has to change, so that ipfw_chk() never modifies * or consumes the buffer. * OR * args->mem Pointer to contigous memory chunk. * ip Is the beginning of the ip(4 or 6) header. * eh Ethernet header in case if input is Layer2. */ struct mbuf *m; struct ip *ip; struct ether_header *eh; /* * For rules which contain uid/gid or jail constraints, cache * a copy of the users credentials after the pcb lookup has been * executed. This will speed up the processing of rules with * these types of constraints, as well as decrease contention * on pcb related locks. */ #ifndef __FreeBSD__ struct bsd_ucred ucred_cache; #else struct ucred *ucred_cache = NULL; #endif int ucred_lookup = 0; int f_pos = 0; /* index of current rule in the array */ int retval = 0; struct ifnet *oif, *iif; /* * hlen The length of the IP header. */ u_int hlen = 0; /* hlen >0 means we have an IP pkt */ /* * offset The offset of a fragment. offset != 0 means that * we have a fragment at this offset of an IPv4 packet. * offset == 0 means that (if this is an IPv4 packet) * this is the first or only fragment. * For IPv6 offset|ip6f_mf == 0 means there is no Fragment Header * or there is a single packet fragment (fragment header added * without needed). We will treat a single packet fragment as if * there was no fragment header (or log/block depending on the * V_fw_permit_single_frag6 sysctl setting). */ u_short offset = 0; u_short ip6f_mf = 0; /* * Local copies of addresses. They are only valid if we have * an IP packet. * * proto The protocol. Set to 0 for non-ip packets, * or to the protocol read from the packet otherwise. * proto != 0 means that we have an IPv4 packet. * * src_port, dst_port port numbers, in HOST format. Only * valid for TCP and UDP packets. * * src_ip, dst_ip ip addresses, in NETWORK format. * Only valid for IPv4 packets. 
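 *
 * (Consequence of the mixed formats for the opcode handlers below,
 * as a rough guide: an address test such as O_IP_SRC compares
 *	((ipfw_insn_ip *)cmd)->addr.s_addr == src_ip.s_addr
 * with no byte swapping, while port-range tests compare the
 * host-order src_port/dst_port against host-order bounds.)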
*/ uint8_t proto; uint16_t src_port, dst_port; /* NOTE: host format */ struct in_addr src_ip, dst_ip; /* NOTE: network format */ int iplen = 0; int pktlen; struct ipfw_dyn_info dyn_info; struct ip_fw *q = NULL; struct ip_fw_chain *chain = &V_layer3_chain; /* * We store in ulp a pointer to the upper layer protocol header. * In the ipv4 case this is easy to determine from the header, * but for ipv6 we might have some additional headers in the middle. * ulp is NULL if not found. */ void *ulp = NULL; /* upper layer protocol pointer. */ /* XXX ipv6 variables */ int is_ipv6 = 0; uint8_t icmp6_type = 0; uint16_t ext_hd = 0; /* bits vector for extension header filtering */ /* end of ipv6 variables */ int is_ipv4 = 0; int done = 0; /* flag to exit the outer loop */ IPFW_RLOCK_TRACKER; bool mem; if ((mem = (args->flags & IPFW_ARGS_LENMASK))) { if (args->flags & IPFW_ARGS_ETHER) { eh = (struct ether_header *)args->mem; if (eh->ether_type == htons(ETHERTYPE_VLAN)) ip = (struct ip *) ((struct ether_vlan_header *)eh + 1); else ip = (struct ip *)(eh + 1); } else { eh = NULL; ip = (struct ip *)args->mem; } pktlen = IPFW_ARGS_LENGTH(args->flags); args->f_id.fib = args->ifp->if_fib; /* best guess */ } else { m = args->m; if (m->m_flags & M_SKIP_FIREWALL || (! V_ipfw_vnet_ready)) return (IP_FW_PASS); /* accept */ if (args->flags & IPFW_ARGS_ETHER) { /* We need some amount of data to be contiguous. */ if (m->m_len < min(m->m_pkthdr.len, max_protohdr) && (args->m = m = m_pullup(m, min(m->m_pkthdr.len, max_protohdr))) == NULL) goto pullup_failed; eh = mtod(m, struct ether_header *); ip = (struct ip *)(eh + 1); } else { eh = NULL; ip = mtod(m, struct ip *); } pktlen = m->m_pkthdr.len; args->f_id.fib = M_GETFIB(m); /* mbuf not altered */ } dst_ip.s_addr = 0; /* make sure it is initialized */ src_ip.s_addr = 0; /* make sure it is initialized */ src_port = dst_port = 0; DYN_INFO_INIT(&dyn_info); /* * PULLUP_TO(len, p, T) makes sure that len + sizeof(T) is contiguous, * then it sets p to point at the offset "len" in the mbuf. WARNING: the * pointer might become stale after other pullups (but we never use it * this way). */ #define PULLUP_TO(_len, p, T) PULLUP_LEN(_len, p, sizeof(T)) #define EHLEN (eh != NULL ? ((char *)ip - (char *)eh) : 0) #define PULLUP_LEN(_len, p, T) \ do { \ int x = (_len) + T + EHLEN; \ if (mem) { \ MPASS(pktlen >= x); \ p = (char *)args->mem + (_len) + EHLEN; \ } else { \ if (__predict_false((m)->m_len < x)) { \ args->m = m = m_pullup(m, x); \ if (m == NULL) \ goto pullup_failed; \ } \ p = mtod(m, char *) + (_len) + EHLEN; \ } \ } while (0) /* * In case pointers got stale after pullups, update them. */ #define UPDATE_POINTERS() \ do { \ if (!mem) { \ if (eh != NULL) { \ eh = mtod(m, struct ether_header *); \ ip = (struct ip *)(eh + 1); \ } else \ ip = mtod(m, struct ip *); \ args->m = m; \ } \ } while (0) /* Identify IP packets and fill up variables. 
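 *
 * A sketch of how the macros defined above are used by the protocol
 * walk below (these are the same calls the real code makes):
 *
 *	PULLUP_TO(hlen, ulp, struct tcphdr);	// make the header contiguous
 *	dst_port = TCP(ulp)->th_dport;		// ulp now points at it
 *	...
 *	UPDATE_POINTERS();		// m_pullup() may have moved the data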
*/ if (pktlen >= sizeof(struct ip6_hdr) && (eh == NULL || eh->ether_type == htons(ETHERTYPE_IPV6)) && ip->ip_v == 6) { struct ip6_hdr *ip6 = (struct ip6_hdr *)ip; is_ipv6 = 1; args->flags |= IPFW_ARGS_IP6; hlen = sizeof(struct ip6_hdr); proto = ip6->ip6_nxt; /* Search extension headers to find upper layer protocols */ while (ulp == NULL && offset == 0) { switch (proto) { case IPPROTO_ICMPV6: PULLUP_TO(hlen, ulp, struct icmp6_hdr); icmp6_type = ICMP6(ulp)->icmp6_type; break; case IPPROTO_TCP: PULLUP_TO(hlen, ulp, struct tcphdr); dst_port = TCP(ulp)->th_dport; src_port = TCP(ulp)->th_sport; /* save flags for dynamic rules */ args->f_id._flags = TCP(ulp)->th_flags; break; case IPPROTO_SCTP: if (pktlen >= hlen + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd)) PULLUP_LEN(hlen, ulp, sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd)); else if (pktlen >= hlen + sizeof(struct sctphdr)) PULLUP_LEN(hlen, ulp, pktlen - hlen); else PULLUP_LEN(hlen, ulp, sizeof(struct sctphdr)); src_port = SCTP(ulp)->src_port; dst_port = SCTP(ulp)->dest_port; break; case IPPROTO_UDP: case IPPROTO_UDPLITE: PULLUP_TO(hlen, ulp, struct udphdr); dst_port = UDP(ulp)->uh_dport; src_port = UDP(ulp)->uh_sport; break; case IPPROTO_HOPOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); ext_hd |= EXT_HOPOPTS; hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_ROUTING: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_rthdr); switch (((struct ip6_rthdr *)ulp)->ip6r_type) { case 0: ext_hd |= EXT_RTHDR0; break; case 2: ext_hd |= EXT_RTHDR2; break; default: if (V_fw_verbose) printf("IPFW2: IPV6 - Unknown " "Routing Header type(%d)\n", ((struct ip6_rthdr *) ulp)->ip6r_type); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); break; } ext_hd |= EXT_ROUTING; hlen += (((struct ip6_rthdr *)ulp)->ip6r_len + 1) << 3; proto = ((struct ip6_rthdr *)ulp)->ip6r_nxt; ulp = NULL; break; case IPPROTO_FRAGMENT: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_frag); ext_hd |= EXT_FRAGMENT; hlen += sizeof (struct ip6_frag); proto = ((struct ip6_frag *)ulp)->ip6f_nxt; offset = ((struct ip6_frag *)ulp)->ip6f_offlg & IP6F_OFF_MASK; ip6f_mf = ((struct ip6_frag *)ulp)->ip6f_offlg & IP6F_MORE_FRAG; if (V_fw_permit_single_frag6 == 0 && offset == 0 && ip6f_mf == 0) { if (V_fw_verbose) printf("IPFW2: IPV6 - Invalid " "Fragment Header\n"); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); break; } args->f_id.extra = ntohl(((struct ip6_frag *)ulp)->ip6f_ident); ulp = NULL; break; case IPPROTO_DSTOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); ext_hd |= EXT_DSTOPTS; hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_AH: /* RFC 2402 */ PULLUP_TO(hlen, ulp, struct ip6_ext); ext_hd |= EXT_AH; hlen += (((struct ip6_ext *)ulp)->ip6e_len + 2) << 2; proto = ((struct ip6_ext *)ulp)->ip6e_nxt; ulp = NULL; break; case IPPROTO_ESP: /* RFC 2406 */ PULLUP_TO(hlen, ulp, uint32_t); /* SPI, Seq# */ /* Anything past Seq# is variable length and * data past this ext. header is encrypted. */ ext_hd |= EXT_ESP; break; case IPPROTO_NONE: /* RFC 2460 */ /* * Packet ends here, and IPv6 header has * already been pulled up. If ip6e_len!=0 * then octets must be ignored. */ ulp = ip; /* non-NULL to get out of loop. */ break; case IPPROTO_OSPFIGP: /* XXX OSPF header check? 
*/ PULLUP_TO(hlen, ulp, struct ip6_ext); break; case IPPROTO_PIM: /* XXX PIM header check? */ PULLUP_TO(hlen, ulp, struct pim); break; case IPPROTO_GRE: /* RFC 1701 */ /* XXX GRE header check? */ PULLUP_TO(hlen, ulp, struct grehdr); break; case IPPROTO_CARP: PULLUP_TO(hlen, ulp, offsetof( struct carp_header, carp_counter)); if (CARP_ADVERTISEMENT != ((struct carp_header *)ulp)->carp_type) return (IP_FW_DENY); break; case IPPROTO_IPV6: /* RFC 2893 */ PULLUP_TO(hlen, ulp, struct ip6_hdr); break; case IPPROTO_IPV4: /* RFC 2893 */ PULLUP_TO(hlen, ulp, struct ip); break; default: if (V_fw_verbose) printf("IPFW2: IPV6 - Unknown " "Extension Header(%d), ext_hd=%x\n", proto, ext_hd); if (V_fw_deny_unknown_exthdrs) return (IP_FW_DENY); PULLUP_TO(hlen, ulp, struct ip6_ext); break; } /*switch */ } UPDATE_POINTERS(); ip6 = (struct ip6_hdr *)ip; args->f_id.addr_type = 6; args->f_id.src_ip6 = ip6->ip6_src; args->f_id.dst_ip6 = ip6->ip6_dst; args->f_id.flow_id6 = ntohl(ip6->ip6_flow); iplen = ntohs(ip6->ip6_plen) + sizeof(*ip6); } else if (pktlen >= sizeof(struct ip) && (eh == NULL || eh->ether_type == htons(ETHERTYPE_IP)) && ip->ip_v == 4) { is_ipv4 = 1; args->flags |= IPFW_ARGS_IP4; hlen = ip->ip_hl << 2; /* * Collect parameters into local variables for faster * matching. */ proto = ip->ip_p; src_ip = ip->ip_src; dst_ip = ip->ip_dst; offset = ntohs(ip->ip_off) & IP_OFFMASK; iplen = ntohs(ip->ip_len); if (offset == 0) { switch (proto) { case IPPROTO_TCP: PULLUP_TO(hlen, ulp, struct tcphdr); dst_port = TCP(ulp)->th_dport; src_port = TCP(ulp)->th_sport; /* save flags for dynamic rules */ args->f_id._flags = TCP(ulp)->th_flags; break; case IPPROTO_SCTP: if (pktlen >= hlen + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd)) PULLUP_LEN(hlen, ulp, sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) + offsetof(struct sctp_init, a_rwnd)); else if (pktlen >= hlen + sizeof(struct sctphdr)) PULLUP_LEN(hlen, ulp, pktlen - hlen); else PULLUP_LEN(hlen, ulp, sizeof(struct sctphdr)); src_port = SCTP(ulp)->src_port; dst_port = SCTP(ulp)->dest_port; break; case IPPROTO_UDP: case IPPROTO_UDPLITE: PULLUP_TO(hlen, ulp, struct udphdr); dst_port = UDP(ulp)->uh_dport; src_port = UDP(ulp)->uh_sport; break; case IPPROTO_ICMP: PULLUP_TO(hlen, ulp, struct icmphdr); //args->f_id.flags = ICMP(ulp)->icmp_type; break; default: break; } } UPDATE_POINTERS(); args->f_id.addr_type = 4; args->f_id.src_ip = ntohl(src_ip.s_addr); args->f_id.dst_ip = ntohl(dst_ip.s_addr); } else { proto = 0; dst_ip.s_addr = src_ip.s_addr = 0; args->f_id.addr_type = 1; /* XXX */ } #undef PULLUP_TO pktlen = iplen < pktlen ? iplen: pktlen; /* Properly initialize the rest of f_id */ args->f_id.proto = proto; args->f_id.src_port = src_port = ntohs(src_port); args->f_id.dst_port = dst_port = ntohs(dst_port); IPFW_PF_RLOCK(chain); if (! V_ipfw_vnet_ready) { /* shutting down, leave NOW. */ IPFW_PF_RUNLOCK(chain); return (IP_FW_PASS); /* accept */ } if (args->flags & IPFW_ARGS_REF) { /* * Packet has already been tagged as a result of a previous * match on rule args->rule aka args->rule_id (PIPE, QUEUE, * REASS, NETGRAPH, DIVERT/TEE...) * Validate the slot and continue from the next one * if still present, otherwise do a lookup. */ f_pos = (args->rule.chain_id == chain->id) ? args->rule.slot : ipfw_find_rule(chain, args->rule.rulenum, args->rule.rule_id); } else { f_pos = 0; } if (args->flags & IPFW_ARGS_IN) { iif = args->ifp; oif = NULL; } else { MPASS(args->flags & IPFW_ARGS_OUT); iif = mem ? 
NULL : m->m_pkthdr.rcvif; oif = args->ifp; } /* * Now scan the rules, and parse microinstructions for each rule. * We have two nested loops and an inner switch. Sometimes we * need to break out of one or both loops, or re-enter one of * the loops with updated variables. Loop variables are: * * f_pos (outer loop) points to the current rule. * On output it points to the matching rule. * done (outer loop) is used as a flag to break the loop. * l (inner loop) residual length of current rule. * cmd points to the current microinstruction. * * We break the inner loop by setting l=0 and possibly * cmdlen=0 if we don't want to advance cmd. * We break the outer loop by setting done=1 * We can restart the inner loop by setting l>0 and f_pos, f, cmd * as needed. */ for (; f_pos < chain->n_rules; f_pos++) { ipfw_insn *cmd; uint32_t tablearg = 0; int l, cmdlen, skip_or; /* skip rest of OR block */ struct ip_fw *f; f = chain->map[f_pos]; if (V_set_disable & (1 << f->set) ) continue; skip_or = 0; for (l = f->cmd_len, cmd = f->cmd ; l > 0 ; l -= cmdlen, cmd += cmdlen) { int match; /* * check_body is a jump target used when we find a * CHECK_STATE, and need to jump to the body of * the target rule. */ /* check_body: */ cmdlen = F_LEN(cmd); /* * An OR block (insn_1 || .. || insn_n) has the * F_OR bit set in all but the last instruction. * The first match will set "skip_or", and cause * the following instructions to be skipped until * past the one with the F_OR bit clear. */ if (skip_or) { /* skip this instruction */ if ((cmd->len & F_OR) == 0) skip_or = 0; /* next one is good */ continue; } match = 0; /* set to 1 if we succeed */ switch (cmd->opcode) { /* * The first set of opcodes compares the packet's * fields with some pattern, setting 'match' if a * match is found. At the end of the loop there is * logic to deal with F_NOT and F_OR flags associated * with the opcode. */ case O_NOP: match = 1; break; case O_FORWARD_MAC: printf("ipfw: opcode %d unimplemented\n", cmd->opcode); break; case O_GID: case O_UID: case O_JAIL: /* * We only check offset == 0 && proto != 0, * as this ensures that we have a * packet with the ports info. 
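 *
 * (e.g. an illustrative rule such as
 *	deny tcp from any to any 25 uid games
 * ends up here; the credential lookup and the actual uid/gid/jail
 * comparison are done by check_uidgid() above.)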
*/ if (offset != 0) break; if (proto == IPPROTO_TCP || proto == IPPROTO_UDP || proto == IPPROTO_UDPLITE) match = check_uidgid( (ipfw_insn_u32 *)cmd, args, &ucred_lookup, #ifdef __FreeBSD__ &ucred_cache); #else (void *)&ucred_cache); #endif break; case O_RECV: match = iface_match(iif, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_XMIT: match = iface_match(oif, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_VIA: match = iface_match(args->ifp, (ipfw_insn_if *)cmd, chain, &tablearg); break; case O_MACADDR2: if (args->flags & IPFW_ARGS_ETHER) { u_int32_t *want = (u_int32_t *) ((ipfw_insn_mac *)cmd)->addr; u_int32_t *mask = (u_int32_t *) ((ipfw_insn_mac *)cmd)->mask; u_int32_t *hdr = (u_int32_t *)eh; match = ( want[0] == (hdr[0] & mask[0]) && want[1] == (hdr[1] & mask[1]) && want[2] == (hdr[2] & mask[2]) ); } break; case O_MAC_TYPE: if (args->flags & IPFW_ARGS_ETHER) { u_int16_t *p = ((ipfw_insn_u16 *)cmd)->ports; int i; for (i = cmdlen - 1; !match && i>0; i--, p += 2) match = (ntohs(eh->ether_type) >= p[0] && ntohs(eh->ether_type) <= p[1]); } break; case O_FRAG: match = (offset != 0); break; case O_IN: /* "out" is "not in" */ match = (oif == NULL); break; case O_LAYER2: match = (args->flags & IPFW_ARGS_ETHER); break; case O_DIVERTED: if ((args->flags & IPFW_ARGS_REF) == 0) break; /* * For diverted packets, args->rule.info * contains the divert port (in host format) * reason and direction. */ match = ((args->rule.info & IPFW_IS_MASK) == IPFW_IS_DIVERT) && ( ((args->rule.info & IPFW_INFO_IN) ? 1: 2) & cmd->arg1); break; case O_PROTO: /* * We do not allow an arg of 0 so the * check of "proto" only suffices. */ match = (proto == cmd->arg1); break; case O_IP_SRC: match = is_ipv4 && (((ipfw_insn_ip *)cmd)->addr.s_addr == src_ip.s_addr); break; case O_IP_DST_LOOKUP: { void *pkey; uint32_t vidx, key; uint16_t keylen; if (cmdlen > F_INSN_SIZE(ipfw_insn_u32)) { /* Determine lookup key type */ vidx = ((ipfw_insn_u32 *)cmd)->d[1]; if (vidx != 4 /* uid */ && vidx != 5 /* jail */ && is_ipv6 == 0 && is_ipv4 == 0) break; /* Determine key length */ if (vidx == 0 /* dst-ip */ || vidx == 1 /* src-ip */) keylen = is_ipv6 ? sizeof(struct in6_addr): sizeof(in_addr_t); else { keylen = sizeof(key); pkey = &key; } if (vidx == 0 /* dst-ip */) pkey = is_ipv4 ? (void *)&dst_ip: (void *)&args->f_id.dst_ip6; else if (vidx == 1 /* src-ip */) pkey = is_ipv4 ? 
(void *)&src_ip: (void *)&args->f_id.src_ip6; else if (vidx == 6 /* dscp */) { if (is_ipv4) key = ip->ip_tos >> 2; else { key = args->f_id.flow_id6; key = (key & 0x0f) << 2 | (key & 0xf000) >> 14; } key &= 0x3f; } else if (vidx == 2 /* dst-port */ || vidx == 3 /* src-port */) { /* Skip fragments */ if (offset != 0) break; /* Skip proto without ports */ if (proto != IPPROTO_TCP && proto != IPPROTO_UDP && proto != IPPROTO_UDPLITE && proto != IPPROTO_SCTP) break; if (vidx == 2 /* dst-port */) key = dst_port; else key = src_port; } #ifndef USERSPACE else if (vidx == 4 /* uid */ || vidx == 5 /* jail */) { check_uidgid( (ipfw_insn_u32 *)cmd, args, &ucred_lookup, #ifdef __FreeBSD__ &ucred_cache); if (vidx == 4 /* uid */) key = ucred_cache->cr_uid; else if (vidx == 5 /* jail */) key = ucred_cache->cr_prison->pr_id; #else /* !__FreeBSD__ */ (void *)&ucred_cache); if (vidx == 4 /* uid */) key = ucred_cache.uid; else if (vidx == 5 /* jail */) key = ucred_cache.xid; #endif /* !__FreeBSD__ */ } #endif /* !USERSPACE */ else break; match = ipfw_lookup_table(chain, cmd->arg1, keylen, pkey, &vidx); if (!match) break; tablearg = vidx; break; } /* cmdlen =< F_INSN_SIZE(ipfw_insn_u32) */ /* FALLTHROUGH */ } case O_IP_SRC_LOOKUP: { void *pkey; uint32_t vidx; uint16_t keylen; if (is_ipv4) { keylen = sizeof(in_addr_t); if (cmd->opcode == O_IP_DST_LOOKUP) pkey = &dst_ip; else pkey = &src_ip; } else if (is_ipv6) { keylen = sizeof(struct in6_addr); if (cmd->opcode == O_IP_DST_LOOKUP) pkey = &args->f_id.dst_ip6; else pkey = &args->f_id.src_ip6; } else break; match = ipfw_lookup_table(chain, cmd->arg1, keylen, pkey, &vidx); if (!match) break; if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) { match = ((ipfw_insn_u32 *)cmd)->d[0] == TARG_VAL(chain, vidx, tag); if (!match) break; } tablearg = vidx; break; } case O_IP_FLOW_LOOKUP: { uint32_t v = 0; match = ipfw_lookup_table(chain, cmd->arg1, 0, &args->f_id, &v); if (cmdlen == F_INSN_SIZE(ipfw_insn_u32)) match = ((ipfw_insn_u32 *)cmd)->d[0] == TARG_VAL(chain, v, tag); if (match) tablearg = v; } break; case O_IP_SRC_MASK: case O_IP_DST_MASK: if (is_ipv4) { uint32_t a = (cmd->opcode == O_IP_DST_MASK) ? dst_ip.s_addr : src_ip.s_addr; uint32_t *p = ((ipfw_insn_u32 *)cmd)->d; int i = cmdlen-1; for (; !match && i>0; i-= 2, p+= 2) match = (p[0] == (a & p[1])); } break; case O_IP_SRC_ME: if (is_ipv4) { match = in_localip(src_ip); break; } #ifdef INET6 /* FALLTHROUGH */ case O_IP6_SRC_ME: match = is_ipv6 && ipfw_localip6(&args->f_id.src_ip6); #endif break; case O_IP_DST_SET: case O_IP_SRC_SET: if (is_ipv4) { u_int32_t *d = (u_int32_t *)(cmd+1); u_int32_t addr = cmd->opcode == O_IP_DST_SET ? args->f_id.dst_ip : args->f_id.src_ip; if (addr < d[0]) break; addr -= d[0]; /* subtract base */ match = (addr < cmd->arg1) && ( d[ 1 + (addr>>5)] & (1<<(addr & 0x1f)) ); } break; case O_IP_DST: match = is_ipv4 && (((ipfw_insn_ip *)cmd)->addr.s_addr == dst_ip.s_addr); break; case O_IP_DST_ME: if (is_ipv4) { match = in_localip(dst_ip); break; } #ifdef INET6 /* FALLTHROUGH */ case O_IP6_DST_ME: match = is_ipv6 && ipfw_localip6(&args->f_id.dst_ip6); #endif break; case O_IP_SRCPORT: case O_IP_DSTPORT: /* * offset == 0 && proto != 0 is enough * to guarantee that we have a * packet with port info. */ if ((proto == IPPROTO_UDP || proto == IPPROTO_UDPLITE || proto == IPPROTO_TCP || proto == IPPROTO_SCTP) && offset == 0) { u_int16_t x = (cmd->opcode == O_IP_SRCPORT) ? 
src_port : dst_port ; u_int16_t *p = ((ipfw_insn_u16 *)cmd)->ports; int i; for (i = cmdlen - 1; !match && i>0; i--, p += 2) match = (x>=p[0] && x<=p[1]); } break; case O_ICMPTYPE: match = (offset == 0 && proto==IPPROTO_ICMP && icmptype_match(ICMP(ulp), (ipfw_insn_u32 *)cmd) ); break; #ifdef INET6 case O_ICMP6TYPE: match = is_ipv6 && offset == 0 && proto==IPPROTO_ICMPV6 && icmp6type_match( ICMP6(ulp)->icmp6_type, (ipfw_insn_u32 *)cmd); break; #endif /* INET6 */ case O_IPOPT: match = (is_ipv4 && ipopts_match(ip, cmd) ); break; case O_IPVER: match = (is_ipv4 && cmd->arg1 == ip->ip_v); break; case O_IPID: - case O_IPLEN: case O_IPTTL: - if (is_ipv4) { /* only for IP packets */ + if (!is_ipv4) + break; + case O_IPLEN: + { /* only for IP packets */ uint16_t x; uint16_t *p; int i; if (cmd->opcode == O_IPLEN) x = iplen; else if (cmd->opcode == O_IPTTL) x = ip->ip_ttl; else /* must be IPID */ x = ntohs(ip->ip_id); if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* otherwise we have ranges */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i>0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_IPPRECEDENCE: match = (is_ipv4 && (cmd->arg1 == (ip->ip_tos & 0xe0)) ); break; case O_IPTOS: match = (is_ipv4 && flags_match(cmd, ip->ip_tos)); break; case O_DSCP: { uint32_t *p; uint16_t x; p = ((ipfw_insn_u32 *)cmd)->d; if (is_ipv4) x = ip->ip_tos >> 2; else if (is_ipv6) { uint8_t *v; v = &((struct ip6_hdr *)ip)->ip6_vfc; x = (*v & 0x0F) << 2; v++; x |= *v >> 6; } else break; /* DSCP bitmask is stored as low_u32 high_u32 */ if (x >= 32) match = *(p + 1) & (1 << (x - 32)); else match = *p & (1 << x); } break; case O_TCPDATALEN: if (proto == IPPROTO_TCP && offset == 0) { struct tcphdr *tcp; uint16_t x; uint16_t *p; int i; #ifdef INET6 if (is_ipv6) { struct ip6_hdr *ip6; ip6 = (struct ip6_hdr *)ip; if (ip6->ip6_plen == 0) { /* * Jumbo payload is not * supported by this * opcode. */ break; } x = iplen - hlen; } else #endif /* INET6 */ x = iplen - (ip->ip_hl << 2); tcp = TCP(ulp); x -= tcp->th_off << 2; if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* otherwise we have ranges */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i>0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_TCPFLAGS: match = (proto == IPPROTO_TCP && offset == 0 && flags_match(cmd, TCP(ulp)->th_flags)); break; case O_TCPOPTS: if (proto == IPPROTO_TCP && offset == 0 && ulp){ PULLUP_LEN(hlen, ulp, (TCP(ulp)->th_off << 2)); match = tcpopts_match(TCP(ulp), cmd); } break; case O_TCPSEQ: match = (proto == IPPROTO_TCP && offset == 0 && ((ipfw_insn_u32 *)cmd)->d[0] == TCP(ulp)->th_seq); break; case O_TCPACK: match = (proto == IPPROTO_TCP && offset == 0 && ((ipfw_insn_u32 *)cmd)->d[0] == TCP(ulp)->th_ack); break; case O_TCPWIN: if (proto == IPPROTO_TCP && offset == 0) { uint16_t x; uint16_t *p; int i; x = ntohs(TCP(ulp)->th_win); if (cmdlen == 1) { match = (cmd->arg1 == x); break; } /* Otherwise we have ranges. */ p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for (; !match && i > 0; i--, p += 2) match = (x >= p[0] && x <= p[1]); } break; case O_ESTAB: /* reject packets which have SYN only */ /* XXX should i also check for TH_ACK ? */ match = (proto == IPPROTO_TCP && offset == 0 && (TCP(ulp)->th_flags & (TH_RST | TH_ACK | TH_SYN)) != TH_SYN); break; case O_ALTQ: { struct pf_mtag *at; struct m_tag *mtag; ipfw_insn_altq *altq = (ipfw_insn_altq *)cmd; /* * ALTQ uses mbuf tags from another * packet filtering system - pf(4). 
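 * (In rule terms this is the "altq" option, e.g. an illustrative
 *	allow ip from any to any altq slow
 * where the queue name is resolved to a numeric qid when the rule
 * is installed; that qid is what lands in at->qid below.)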
* We allocate a tag in its format * and fill it in, pretending to be pf(4). */ match = 1; at = pf_find_mtag(m); if (at != NULL && at->qid != 0) break; mtag = m_tag_get(PACKET_TAG_PF, sizeof(struct pf_mtag), M_NOWAIT | M_ZERO); if (mtag == NULL) { /* * Let the packet fall back to the * default ALTQ. */ break; } m_tag_prepend(m, mtag); at = (struct pf_mtag *)(mtag + 1); at->qid = altq->qid; at->hdr = ip; break; } case O_LOG: ipfw_log(chain, f, hlen, args, offset | ip6f_mf, tablearg, ip); match = 1; break; case O_PROB: match = (random()<((ipfw_insn_u32 *)cmd)->d[0]); break; case O_VERREVPATH: /* Outgoing packets automatically pass/match */ match = (args->flags & IPFW_ARGS_OUT || ( #ifdef INET6 is_ipv6 ? verify_path6(&(args->f_id.src_ip6), iif, args->f_id.fib) : #endif verify_path(src_ip, iif, args->f_id.fib))); break; case O_VERSRCREACH: /* Outgoing packets automatically pass/match */ match = (hlen > 0 && ((oif != NULL) || ( #ifdef INET6 is_ipv6 ? verify_path6(&(args->f_id.src_ip6), NULL, args->f_id.fib) : #endif verify_path(src_ip, NULL, args->f_id.fib)))); break; case O_ANTISPOOF: /* Outgoing packets automatically pass/match */ if (oif == NULL && hlen > 0 && ( (is_ipv4 && in_localaddr(src_ip)) #ifdef INET6 || (is_ipv6 && in6_localaddr(&(args->f_id.src_ip6))) #endif )) match = #ifdef INET6 is_ipv6 ? verify_path6( &(args->f_id.src_ip6), iif, args->f_id.fib) : #endif verify_path(src_ip, iif, args->f_id.fib); else match = 1; break; case O_IPSEC: match = (m_tag_find(m, PACKET_TAG_IPSEC_IN_DONE, NULL) != NULL); /* otherwise no match */ break; #ifdef INET6 case O_IP6_SRC: match = is_ipv6 && IN6_ARE_ADDR_EQUAL(&args->f_id.src_ip6, &((ipfw_insn_ip6 *)cmd)->addr6); break; case O_IP6_DST: match = is_ipv6 && IN6_ARE_ADDR_EQUAL(&args->f_id.dst_ip6, &((ipfw_insn_ip6 *)cmd)->addr6); break; case O_IP6_SRC_MASK: case O_IP6_DST_MASK: if (is_ipv6) { int i = cmdlen - 1; struct in6_addr p; struct in6_addr *d = &((ipfw_insn_ip6 *)cmd)->addr6; for (; !match && i > 0; d += 2, i -= F_INSN_SIZE(struct in6_addr) * 2) { p = (cmd->opcode == O_IP6_SRC_MASK) ? args->f_id.src_ip6: args->f_id.dst_ip6; APPLY_MASK(&p, &d[1]); match = IN6_ARE_ADDR_EQUAL(&d[0], &p); } } break; case O_FLOW6ID: match = is_ipv6 && flow6id_match(args->f_id.flow_id6, (ipfw_insn_u32 *) cmd); break; case O_EXT_HDR: match = is_ipv6 && (ext_hd & ((ipfw_insn *) cmd)->arg1); break; case O_IP6: match = is_ipv6; break; #endif case O_IP4: match = is_ipv4; break; case O_TAG: { struct m_tag *mtag; uint32_t tag = TARG(cmd->arg1, tag); /* Packet is already tagged with this tag? */ mtag = m_tag_locate(m, MTAG_IPFW, tag, NULL); /* We have `untag' action when F_NOT flag is * present. And we must remove this mtag from * mbuf and reset `match' to zero (`match' will * be inversed later). * Otherwise we should allocate new mtag and * push it into mbuf. */ if (cmd->len & F_NOT) { /* `untag' action */ if (mtag != NULL) m_tag_delete(m, mtag); match = 0; } else { if (mtag == NULL) { mtag = m_tag_alloc( MTAG_IPFW, tag, 0, M_NOWAIT); if (mtag != NULL) m_tag_prepend(m, mtag); } match = 1; } break; } case O_FIB: /* try match the specified fib */ if (args->f_id.fib == cmd->arg1) match = 1; break; case O_SOCKARG: { #ifndef USERSPACE /* not supported in userspace */ struct inpcb *inp = args->inp; struct inpcbinfo *pi; if (is_ipv6) /* XXX can we remove this ? 
*/ break; if (proto == IPPROTO_TCP) pi = &V_tcbinfo; else if (proto == IPPROTO_UDP) pi = &V_udbinfo; else if (proto == IPPROTO_UDPLITE) pi = &V_ulitecbinfo; else break; /* * XXXRW: so_user_cookie should almost * certainly be inp_user_cookie? */ /* For incoming packet, lookup up the inpcb using the src/dest ip/port tuple */ if (inp == NULL) { inp = in_pcblookup(pi, src_ip, htons(src_port), dst_ip, htons(dst_port), INPLOOKUP_RLOCKPCB, NULL); if (inp != NULL) { tablearg = inp->inp_socket->so_user_cookie; if (tablearg) match = 1; INP_RUNLOCK(inp); } } else { if (inp->inp_socket) { tablearg = inp->inp_socket->so_user_cookie; if (tablearg) match = 1; } } #endif /* !USERSPACE */ break; } case O_TAGGED: { struct m_tag *mtag; uint32_t tag = TARG(cmd->arg1, tag); if (cmdlen == 1) { match = m_tag_locate(m, MTAG_IPFW, tag, NULL) != NULL; break; } /* we have ranges */ for (mtag = m_tag_first(m); mtag != NULL && !match; mtag = m_tag_next(m, mtag)) { uint16_t *p; int i; if (mtag->m_tag_cookie != MTAG_IPFW) continue; p = ((ipfw_insn_u16 *)cmd)->ports; i = cmdlen - 1; for(; !match && i > 0; i--, p += 2) match = mtag->m_tag_id >= p[0] && mtag->m_tag_id <= p[1]; } break; } /* * The second set of opcodes represents 'actions', * i.e. the terminal part of a rule once the packet * matches all previous patterns. * Typically there is only one action for each rule, * and the opcode is stored at the end of the rule * (but there are exceptions -- see below). * * In general, here we set retval and terminate the * outer loop (would be a 'break 3' in some language, * but we need to set l=0, done=1) * * Exceptions: * O_COUNT and O_SKIPTO actions: * instead of terminating, we jump to the next rule * (setting l=0), or to the SKIPTO target (setting * f/f_len, cmd and l as needed), respectively. * * O_TAG, O_LOG and O_ALTQ action parameters: * perform some action and set match = 1; * * O_LIMIT and O_KEEP_STATE: these opcodes are * not real 'actions', and are stored right * before the 'action' part of the rule (one * exception is O_SKIP_ACTION which could be * between these opcodes and 'action' one). * These opcodes try to install an entry in the * state tables; if successful, we continue with * the next opcode (match=1; break;), otherwise * the packet must be dropped (set retval, * break loops with l=0, done=1) * * O_PROBE_STATE and O_CHECK_STATE: these opcodes * cause a lookup of the state table, and a jump * to the 'action' part of the parent rule * if an entry is found, or * (CHECK_STATE only) a jump to the next rule if * the entry is not found. * The result of the lookup is cached so that * further instances of these opcodes become NOPs. * The jump to the next rule is done by setting * l=0, cmdlen=0. * * O_SKIP_ACTION: this opcode is not a real 'action' * either, and is stored right before the 'action' * part of the rule, right after the O_KEEP_STATE * opcode. It causes match failure so the real * 'action' could be executed only if the rule * is checked via dynamic rule from the state * table, as in such case execution starts * from the true 'action' opcode directly. * */ case O_LIMIT: case O_KEEP_STATE: if (ipfw_dyn_install_state(chain, f, (ipfw_insn_limit *)cmd, args, ulp, pktlen, &dyn_info, tablearg)) { /* error or limit violation */ retval = IP_FW_DENY; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ } match = 1; break; case O_PROBE_STATE: case O_CHECK_STATE: /* * dynamic rules are checked at the first * keep-state or check-state occurrence, * with the result being stored in dyn_info. 
* The compiler introduces a PROBE_STATE * instruction for us when we have a * KEEP_STATE (because PROBE_STATE needs * to be run first). */ if (DYN_LOOKUP_NEEDED(&dyn_info, cmd) && (q = ipfw_dyn_lookup_state(args, ulp, pktlen, cmd, &dyn_info)) != NULL) { /* * Found dynamic entry, jump to the * 'action' part of the parent rule * by setting f, cmd, l and clearing * cmdlen. */ f = q; f_pos = dyn_info.f_pos; cmd = ACTION_PTR(f); l = f->cmd_len - f->act_ofs; cmdlen = 0; match = 1; break; } /* * Dynamic entry not found. If CHECK_STATE, * skip to next rule, if PROBE_STATE just * ignore and continue with next opcode. */ if (cmd->opcode == O_CHECK_STATE) l = 0; /* exit inner loop */ match = 1; break; case O_SKIP_ACTION: match = 0; /* skip to the next rule */ l = 0; /* exit inner loop */ break; case O_ACCEPT: retval = 0; /* accept */ l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_PIPE: case O_QUEUE: set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, pipe); if (cmd->opcode == O_PIPE) args->rule.info |= IPFW_IS_PIPE; if (V_fw_one_pass) args->rule.info |= IPFW_ONEPASS; retval = IP_FW_DUMMYNET; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_DIVERT: case O_TEE: if (args->flags & IPFW_ARGS_ETHER) break; /* not on layer 2 */ /* otherwise this is terminal */ l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ retval = (cmd->opcode == O_DIVERT) ? IP_FW_DIVERT : IP_FW_TEE; set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, divert); break; case O_COUNT: IPFW_INC_RULE_COUNTER(f, pktlen); l = 0; /* exit inner loop */ break; case O_SKIPTO: IPFW_INC_RULE_COUNTER(f, pktlen); f_pos = JUMP(chain, f, cmd->arg1, tablearg, 0); /* * Skip disabled rules, and re-enter * the inner loop with the correct * f_pos, f, l and cmd. * Also clear cmdlen and skip_or */ for (; f_pos < chain->n_rules - 1 && (V_set_disable & (1 << chain->map[f_pos]->set)); f_pos++) ; /* Re-enter the inner loop at the skipto rule. */ f = chain->map[f_pos]; l = f->cmd_len; cmd = f->cmd; match = 1; cmdlen = 0; skip_or = 0; continue; break; /* not reached */ case O_CALLRETURN: { /* * Implementation of `subroutine' call/return, * in the stack carried in an mbuf tag. This * is different from `skipto' in that any call * address is possible (`skipto' must prevent * backward jumps to avoid endless loops). * We have `return' action when F_NOT flag is * present. The `m_tag_id' field is used as * stack pointer. */ struct m_tag *mtag; uint16_t jmpto, *stack; #define IS_CALL ((cmd->len & F_NOT) == 0) #define IS_RETURN ((cmd->len & F_NOT) != 0) /* * Hand-rolled version of m_tag_locate() with * wildcard `type'. * If not already tagged, allocate new tag. */ mtag = m_tag_first(m); while (mtag != NULL) { if (mtag->m_tag_cookie == MTAG_IPFW_CALL) break; mtag = m_tag_next(m, mtag); } if (mtag == NULL && IS_CALL) { mtag = m_tag_alloc(MTAG_IPFW_CALL, 0, IPFW_CALLSTACK_SIZE * sizeof(uint16_t), M_NOWAIT); if (mtag != NULL) m_tag_prepend(m, mtag); } /* * On error both `call' and `return' just * continue with next rule. */ if (IS_RETURN && (mtag == NULL || mtag->m_tag_id == 0)) { l = 0; /* exit inner loop */ break; } if (IS_CALL && (mtag == NULL || mtag->m_tag_id >= IPFW_CALLSTACK_SIZE)) { printf("ipfw: call stack error, " "go to next rule\n"); l = 0; /* exit inner loop */ break; } IPFW_INC_RULE_COUNTER(f, pktlen); stack = (uint16_t *)(mtag + 1); /* * The `call' action may use cached f_pos * (in f->next_rule), whose version is written * in f->next_rule. 
* The `return' action, however, doesn't have * fixed jump address in cmd->arg1 and can't use * cache. */ if (IS_CALL) { stack[mtag->m_tag_id] = f->rulenum; mtag->m_tag_id++; f_pos = JUMP(chain, f, cmd->arg1, tablearg, 1); } else { /* `return' action */ mtag->m_tag_id--; jmpto = stack[mtag->m_tag_id] + 1; f_pos = ipfw_find_rule(chain, jmpto, 0); } /* * Skip disabled rules, and re-enter * the inner loop with the correct * f_pos, f, l and cmd. * Also clear cmdlen and skip_or */ for (; f_pos < chain->n_rules - 1 && (V_set_disable & (1 << chain->map[f_pos]->set)); f_pos++) ; /* Re-enter the inner loop at the dest rule. */ f = chain->map[f_pos]; l = f->cmd_len; cmd = f->cmd; cmdlen = 0; skip_or = 0; continue; break; /* NOTREACHED */ } #undef IS_CALL #undef IS_RETURN case O_REJECT: /* * Drop the packet and send a reject notice * if the packet is not ICMP (or is an ICMP * query), and it is not multicast/broadcast. */ if (hlen > 0 && is_ipv4 && offset == 0 && (proto != IPPROTO_ICMP || is_icmp_query(ICMP(ulp))) && !(m->m_flags & (M_BCAST|M_MCAST)) && !IN_MULTICAST(ntohl(dst_ip.s_addr))) { send_reject(args, cmd->arg1, iplen, ip); m = args->m; } /* FALLTHROUGH */ #ifdef INET6 case O_UNREACH6: if (hlen > 0 && is_ipv6 && ((offset & IP6F_OFF_MASK) == 0) && (proto != IPPROTO_ICMPV6 || (is_icmp6_query(icmp6_type) == 1)) && !(m->m_flags & (M_BCAST|M_MCAST)) && !IN6_IS_ADDR_MULTICAST( &args->f_id.dst_ip6)) { send_reject6(args, cmd->opcode == O_REJECT ? map_icmp_unreach(cmd->arg1): cmd->arg1, hlen, (struct ip6_hdr *)ip); m = args->m; } /* FALLTHROUGH */ #endif case O_DENY: retval = IP_FW_DENY; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_FORWARD_IP: if (args->flags & IPFW_ARGS_ETHER) break; /* not valid on layer2 pkts */ if (q != f || dyn_info.direction == MATCH_FORWARD) { struct sockaddr_in *sa; sa = &(((ipfw_insn_sa *)cmd)->sa); if (sa->sin_addr.s_addr == INADDR_ANY) { #ifdef INET6 /* * We use O_FORWARD_IP opcode for * fwd rule with tablearg, but tables * now support IPv6 addresses. And * when we are inspecting IPv6 packet, * we can use nh6 field from * table_value as next_hop6 address. */ if (is_ipv6) { struct ip_fw_nh6 *nh6; args->flags |= IPFW_ARGS_NH6; nh6 = &args->hopstore6; nh6->sin6_addr = TARG_VAL( chain, tablearg, nh6); nh6->sin6_port = sa->sin_port; nh6->sin6_scope_id = TARG_VAL( chain, tablearg, zoneid); } else #endif { args->flags |= IPFW_ARGS_NH4; args->hopstore.sin_port = sa->sin_port; sa = &args->hopstore; sa->sin_family = AF_INET; sa->sin_len = sizeof(*sa); sa->sin_addr.s_addr = htonl( TARG_VAL(chain, tablearg, nh4)); } } else { args->flags |= IPFW_ARGS_NH4PTR; args->next_hop = sa; } } retval = IP_FW_PASS; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; #ifdef INET6 case O_FORWARD_IP6: if (args->flags & IPFW_ARGS_ETHER) break; /* not valid on layer2 pkts */ if (q != f || dyn_info.direction == MATCH_FORWARD) { struct sockaddr_in6 *sin6; sin6 = &(((ipfw_insn_sa6 *)cmd)->sa); args->flags |= IPFW_ARGS_NH6PTR; args->next_hop6 = sin6; } retval = IP_FW_PASS; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; #endif case O_NETGRAPH: case O_NGTEE: set_match(args, f_pos, chain); args->rule.info = TARG(cmd->arg1, netgraph); if (V_fw_one_pass) args->rule.info |= IPFW_ONEPASS; retval = (cmd->opcode == O_NETGRAPH) ? 
IP_FW_NETGRAPH : IP_FW_NGTEE; l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ break; case O_SETFIB: { uint32_t fib; IPFW_INC_RULE_COUNTER(f, pktlen); fib = TARG(cmd->arg1, fib) & 0x7FFF; if (fib >= rt_numfibs) fib = 0; M_SETFIB(m, fib); args->f_id.fib = fib; /* XXX */ l = 0; /* exit inner loop */ break; } case O_SETDSCP: { uint16_t code; code = TARG(cmd->arg1, dscp) & 0x3F; l = 0; /* exit inner loop */ if (is_ipv4) { uint16_t old; old = *(uint16_t *)ip; ip->ip_tos = (code << 2) | (ip->ip_tos & 0x03); ip->ip_sum = cksum_adjust(ip->ip_sum, old, *(uint16_t *)ip); } else if (is_ipv6) { uint8_t *v; v = &((struct ip6_hdr *)ip)->ip6_vfc; *v = (*v & 0xF0) | (code >> 2); v++; *v = (*v & 0x3F) | ((code & 0x03) << 6); } else break; IPFW_INC_RULE_COUNTER(f, pktlen); break; } case O_NAT: l = 0; /* exit inner loop */ done = 1; /* exit outer loop */ /* * Ensure that we do not invoke NAT handler for * non IPv4 packets. Libalias expects only IPv4. */ if (!is_ipv4 || !IPFW_NAT_LOADED) { retval = IP_FW_DENY; break; } struct cfg_nat *t; int nat_id; args->rule.info = 0; set_match(args, f_pos, chain); /* Check if this is 'global' nat rule */ if (cmd->arg1 == IP_FW_NAT44_GLOBAL) { retval = ipfw_nat_ptr(args, NULL, m); break; } t = ((ipfw_insn_nat *)cmd)->nat; if (t == NULL) { nat_id = TARG(cmd->arg1, nat); t = (*lookup_nat_ptr)(&chain->nat, nat_id); if (t == NULL) { retval = IP_FW_DENY; break; } if (cmd->arg1 != IP_FW_TARG) ((ipfw_insn_nat *)cmd)->nat = t; } retval = ipfw_nat_ptr(args, t, m); break; case O_REASS: { int ip_off; l = 0; /* in any case exit inner loop */ if (is_ipv6) /* IPv6 is not supported yet */ break; IPFW_INC_RULE_COUNTER(f, pktlen); ip_off = ntohs(ip->ip_off); /* if not fragmented, go to next rule */ if ((ip_off & (IP_MF | IP_OFFMASK)) == 0) break; args->m = m = ip_reass(m); /* * do IP header checksum fixup. */ if (m == NULL) { /* fragment got swallowed */ retval = IP_FW_DENY; } else { /* good, packet complete */ int hlen; ip = mtod(m, struct ip *); hlen = ip->ip_hl << 2; ip->ip_sum = 0; if (hlen == sizeof(struct ip)) ip->ip_sum = in_cksum_hdr(ip); else ip->ip_sum = in_cksum(m, hlen); retval = IP_FW_REASS; args->rule.info = 0; set_match(args, f_pos, chain); } done = 1; /* exit outer loop */ break; } case O_EXTERNAL_ACTION: l = 0; /* in any case exit inner loop */ retval = ipfw_run_eaction(chain, args, cmd, &done); /* * If both @retval and @done are zero, * consider this as rule matching and * update counters. */ if (retval == 0 && done == 0) { IPFW_INC_RULE_COUNTER(f, pktlen); /* * Reset the result of the last * dynamic state lookup. * External action can change * @args content, and it may be * used for new state lookup later. */ DYN_INFO_INIT(&dyn_info); } break; default: panic("-- unknown opcode %d\n", cmd->opcode); } /* end of switch() on opcodes */ /* * if we get here with l=0, then match is irrelevant. 
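 *
 * Worked example of the OR-block logic below: for a rule body
 *	{ insn_a or insn_b } insn_c
 * insn_a carries F_OR and insn_b (the last of the block) does not.
 * If insn_a matches, skip_or is set, insn_b is skipped, and
 * evaluation resumes at insn_c; F_NOT only inverts the result of
 * the single instruction it is attached to.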
*/ if (cmd->len & F_NOT) match = !match; if (match) { if (cmd->len & F_OR) skip_or = 1; } else { if (!(cmd->len & F_OR)) /* not an OR block, */ break; /* try next rule */ } } /* end of inner loop, scan opcodes */ #undef PULLUP_LEN if (done) break; /* next_rule:; */ /* try next rule */ } /* end of outer for, scan rules */ if (done) { struct ip_fw *rule = chain->map[f_pos]; /* Update statistics */ IPFW_INC_RULE_COUNTER(rule, pktlen); } else { retval = IP_FW_DENY; printf("ipfw: ouch!, skip past end of rules, denying packet\n"); } IPFW_PF_RUNLOCK(chain); #ifdef __FreeBSD__ if (ucred_cache != NULL) crfree(ucred_cache); #endif return (retval); pullup_failed: if (V_fw_verbose) printf("ipfw: pullup failed\n"); return (IP_FW_DENY); } /* * Set maximum number of tables that can be used in given VNET ipfw instance. */ #ifdef SYSCTL_NODE static int sysctl_ipfw_table_num(SYSCTL_HANDLER_ARGS) { int error; unsigned int ntables; ntables = V_fw_tables_max; error = sysctl_handle_int(oidp, &ntables, 0, req); /* Read operation or some error */ if ((error != 0) || (req->newptr == NULL)) return (error); return (ipfw_resize_tables(&V_layer3_chain, ntables)); } /* * Switches table namespace between global and per-set. */ static int sysctl_ipfw_tables_sets(SYSCTL_HANDLER_ARGS) { int error; unsigned int sets; sets = V_fw_tables_sets; error = sysctl_handle_int(oidp, &sets, 0, req); /* Read operation or some error */ if ((error != 0) || (req->newptr == NULL)) return (error); return (ipfw_switch_tables_namespace(&V_layer3_chain, sets)); } #endif /* * Module and VNET glue */ /* * Stuff that must be initialised only on boot or module load */ static int ipfw_init(void) { int error = 0; /* * Only print out this stuff the first time around, * when called from the sysinit code. */ printf("ipfw2 " #ifdef INET6 "(+ipv6) " #endif "initialized, divert %s, nat %s, " "default to %s, logging ", #ifdef IPDIVERT "enabled", #else "loadable", #endif #ifdef IPFIREWALL_NAT "enabled", #else "loadable", #endif default_to_accept ? "accept" : "deny"); /* * Note: V_xxx variables can be accessed here but the vnet specific * initializer may not have been called yet for the VIMAGE case. * Tuneables will have been processed. We will print out values for * the default vnet. * XXX This should all be rationalized AFTER 8.0 */ if (V_fw_verbose == 0) printf("disabled\n"); else if (V_verbose_limit == 0) printf("unlimited\n"); else printf("limited to %d packets/entry by default\n", V_verbose_limit); /* Check user-supplied table count for validness */ if (default_fw_tables > IPFW_TABLES_MAX) default_fw_tables = IPFW_TABLES_MAX; ipfw_init_sopt_handler(); ipfw_init_obj_rewriter(); ipfw_iface_init(); return (error); } /* * Called for the removal of the last instance only on module unload. */ static void ipfw_destroy(void) { ipfw_iface_destroy(); ipfw_destroy_sopt_handler(); ipfw_destroy_obj_rewriter(); printf("IP firewall unloaded\n"); } /* * Stuff that must be initialized for every instance * (including the first of course). */ static int vnet_ipfw_init(const void *unused) { int error, first; struct ip_fw *rule = NULL; struct ip_fw_chain *chain; chain = &V_layer3_chain; first = IS_DEFAULT_VNET(curvnet) ? 
1 : 0; /* First set up some values that are compile time options */ V_autoinc_step = 100; /* bounded to 1..1000 in add_rule() */ V_fw_deny_unknown_exthdrs = 1; #ifdef IPFIREWALL_VERBOSE V_fw_verbose = 1; #endif #ifdef IPFIREWALL_VERBOSE_LIMIT V_verbose_limit = IPFIREWALL_VERBOSE_LIMIT; #endif #ifdef IPFIREWALL_NAT LIST_INIT(&chain->nat); #endif /* Init shared services hash table */ ipfw_init_srv(chain); ipfw_init_counters(); /* Set initial number of tables */ V_fw_tables_max = default_fw_tables; error = ipfw_init_tables(chain, first); if (error) { printf("ipfw2: setting up tables failed\n"); free(chain->map, M_IPFW); free(rule, M_IPFW); return (ENOSPC); } IPFW_LOCK_INIT(chain); /* fill and insert the default rule */ rule = ipfw_alloc_rule(chain, sizeof(struct ip_fw)); rule->cmd_len = 1; rule->cmd[0].len = 1; rule->cmd[0].opcode = default_to_accept ? O_ACCEPT : O_DENY; chain->default_rule = rule; ipfw_add_protected_rule(chain, rule, 0); ipfw_dyn_init(chain); ipfw_eaction_init(chain, first); #ifdef LINEAR_SKIPTO ipfw_init_skipto_cache(chain); #endif ipfw_bpf_init(first); /* First set up some values that are compile time options */ V_ipfw_vnet_ready = 1; /* Open for business */ /* * Hook the sockopt handler and pfil hooks for ipv4 and ipv6. * Even if the latter two fail we still keep the module alive * because the sockopt and layer2 paths are still useful. * ipfw[6]_hook return 0 on success, ENOENT on failure, * so we can ignore the exact return value and just set a flag. * * Note that V_fw[6]_enable are manipulated by a SYSCTL_PROC so * changes in the underlying (per-vnet) variables trigger * immediate hook()/unhook() calls. * In layer2 we have the same behaviour, except that V_ether_ipfw * is checked on each packet because there are no pfil hooks. */ V_ip_fw_ctl_ptr = ipfw_ctl3; error = ipfw_attach_hooks(); return (error); } /* * Called for the removal of each instance. */ static int vnet_ipfw_uninit(const void *unused) { struct ip_fw *reap; struct ip_fw_chain *chain = &V_layer3_chain; int i, last; V_ipfw_vnet_ready = 0; /* tell new callers to go away */ /* * disconnect from ipv4, ipv6, layer2 and sockopt. * Then grab, release and grab again the WLOCK so we make * sure the update is propagated and nobody will be in. */ ipfw_detach_hooks(); V_ip_fw_ctl_ptr = NULL; last = IS_DEFAULT_VNET(curvnet) ? 1 : 0; IPFW_UH_WLOCK(chain); IPFW_UH_WUNLOCK(chain); ipfw_dyn_uninit(0); /* run the callout_drain */ IPFW_UH_WLOCK(chain); reap = NULL; IPFW_WLOCK(chain); for (i = 0; i < chain->n_rules; i++) ipfw_reap_add(chain, &reap, chain->map[i]); free(chain->map, M_IPFW); #ifdef LINEAR_SKIPTO ipfw_destroy_skipto_cache(chain); #endif IPFW_WUNLOCK(chain); IPFW_UH_WUNLOCK(chain); ipfw_destroy_tables(chain, last); ipfw_eaction_uninit(chain, last); if (reap != NULL) ipfw_reap_rules(reap); vnet_ipfw_iface_destroy(chain); ipfw_destroy_srv(chain); IPFW_LOCK_DESTROY(chain); ipfw_dyn_uninit(1); /* free the remaining parts */ ipfw_destroy_counters(); ipfw_bpf_uninit(last); return (0); } /* * Module event handler. * In general we have the choice of handling most of these events by the * event handler or by the (VNET_)SYS(UN)INIT handlers. I have chosen to * use the SYSINIT handlers as they are more capable of expressing the * flow of control during module and vnet operations, so this is just * a skeleton. Note there is no SYSINIT equivalent of the module * SHUTDOWN handler, but we don't have anything to do in that case anyhow. 
*/ static int ipfw_modevent(module_t mod, int type, void *unused) { int err = 0; switch (type) { case MOD_LOAD: /* Called once at module load or * system boot if compiled in. */ break; case MOD_QUIESCE: /* Called before unload. May veto unloading. */ break; case MOD_UNLOAD: /* Called during unload. */ break; case MOD_SHUTDOWN: /* Called during system shutdown. */ break; default: err = EOPNOTSUPP; break; } return err; } static moduledata_t ipfwmod = { "ipfw", ipfw_modevent, 0 }; /* Define startup order. */ #define IPFW_SI_SUB_FIREWALL SI_SUB_PROTO_FIREWALL #define IPFW_MODEVENT_ORDER (SI_ORDER_ANY - 255) /* On boot slot in here. */ #define IPFW_MODULE_ORDER (IPFW_MODEVENT_ORDER + 1) /* A little later. */ #define IPFW_VNET_ORDER (IPFW_MODEVENT_ORDER + 2) /* Later still. */ DECLARE_MODULE(ipfw, ipfwmod, IPFW_SI_SUB_FIREWALL, IPFW_MODEVENT_ORDER); FEATURE(ipfw_ctl3, "ipfw new sockopt calls"); MODULE_VERSION(ipfw, 3); /* should declare some dependencies here */ /* * Starting up. Done in order after ipfwmod() has been called. * VNET_SYSINIT is also called for each existing vnet and each new vnet. */ SYSINIT(ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, ipfw_init, NULL); VNET_SYSINIT(vnet_ipfw_init, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, vnet_ipfw_init, NULL); /* * Closing up shop. These are done in REVERSE ORDER, but still * after ipfwmod() has been called. Not called on reboot. * VNET_SYSUNINIT is also called for each exiting vnet as it exits. * or when the module is unloaded. */ SYSUNINIT(ipfw_destroy, IPFW_SI_SUB_FIREWALL, IPFW_MODULE_ORDER, ipfw_destroy, NULL); VNET_SYSUNINIT(vnet_ipfw_uninit, IPFW_SI_SUB_FIREWALL, IPFW_VNET_ORDER, vnet_ipfw_uninit, NULL); /* end of file */ Index: user/ngie/bug-237403/sys/opencrypto/cbc_mac.c =================================================================== --- user/ngie/bug-237403/sys/opencrypto/cbc_mac.c (revision 346925) +++ user/ngie/bug-237403/sys/opencrypto/cbc_mac.c (revision 346926) @@ -1,268 +1,265 @@ /* * Copyright (c) 2018-2019 iXsystems Inc. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include /* * Given two CCM_CBC_BLOCK_LEN blocks, xor * them into dst, and then encrypt dst. 
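 *
 * This is one step of the CCM CBC-MAC chain (RFC 3610):
 *
 *	X_1     = E_K(B_0)
 *	X_(i+1) = E_K(X_i ^ B_i)	for i = 1, ..., n
 *
 * where dst holds the running X_i and src supplies the next block B_i.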
*/ static void xor_and_encrypt(struct aes_cbc_mac_ctx *ctx, const uint8_t *src, uint8_t *dst) { const uint64_t *b1; uint64_t *b2; uint64_t temp_block[CCM_CBC_BLOCK_LEN/sizeof(uint64_t)]; b1 = (const uint64_t*)src; b2 = (uint64_t*)dst; for (size_t count = 0; count < CCM_CBC_BLOCK_LEN/sizeof(uint64_t); count++) { temp_block[count] = b1[count] ^ b2[count]; } rijndaelEncrypt(ctx->keysched, ctx->rounds, (void*)temp_block, dst); } void AES_CBC_MAC_Init(struct aes_cbc_mac_ctx *ctx) { bzero(ctx, sizeof(*ctx)); } void AES_CBC_MAC_Setkey(struct aes_cbc_mac_ctx *ctx, const uint8_t *key, uint16_t klen) { ctx->rounds = rijndaelKeySetupEnc(ctx->keysched, key, klen * 8); } /* * This is called to set the nonce, aka IV. * Before this call, the authDataLength and cryptDataLength fields * MUST have been set. Sadly, there's no way to return an error. * * The CBC-MAC algorithm requires that the first block contain the * nonce, as well as information about the sizes and lengths involved. */ void AES_CBC_MAC_Reinit(struct aes_cbc_mac_ctx *ctx, const uint8_t *nonce, uint16_t nonceLen) { uint8_t b0[CCM_CBC_BLOCK_LEN]; uint8_t *bp = b0, flags = 0; uint8_t L = 0; uint64_t dataLength = ctx->cryptDataLength; - - KASSERT(ctx->authDataLength != 0 || ctx->cryptDataLength != 0, - ("Auth Data and Data lengths cannot both be 0")); KASSERT(nonceLen >= 7 && nonceLen <= 13, ("nonceLen must be between 7 and 13 bytes")); ctx->nonce = nonce; ctx->nonceLength = nonceLen; ctx->authDataCount = 0; ctx->blockIndex = 0; explicit_bzero(ctx->staging_block, sizeof(ctx->staging_block)); /* * Need to determine the L field value. This is the number of * bytes needed to specify the length of the message; the length * is whatever is left in the 16 bytes after specifying flags and * the nonce. */ L = 15 - nonceLen; flags = ((ctx->authDataLength > 0) << 6) + (((AES_CBC_MAC_HASH_LEN - 2) / 2) << 3) + L - 1; /* * Now we need to set up the first block, which has flags, nonce, * and the message length. */ b0[0] = flags; bcopy(nonce, b0 + 1, nonceLen); bp = b0 + 1 + nonceLen; /* Need to copy L' [aka L-1] bytes of cryptDataLength */ for (uint8_t *dst = b0 + sizeof(b0) - 1; dst >= bp; dst--) { *dst = dataLength; dataLength >>= 8; } /* Now need to encrypt b0 */ rijndaelEncrypt(ctx->keysched, ctx->rounds, b0, ctx->block); /* If there is auth data, we need to set up the staging block */ if (ctx->authDataLength) { size_t addLength; if (ctx->authDataLength < ((1<<16) - (1<<8))) { uint16_t sizeVal = htobe16(ctx->authDataLength); bcopy(&sizeVal, ctx->staging_block, sizeof(sizeVal)); addLength = sizeof(sizeVal); } else if (ctx->authDataLength < (1ULL<<32)) { uint32_t sizeVal = htobe32(ctx->authDataLength); ctx->staging_block[0] = 0xff; ctx->staging_block[1] = 0xfe; bcopy(&sizeVal, ctx->staging_block+2, sizeof(sizeVal)); addLength = 2 + sizeof(sizeVal); } else { uint64_t sizeVal = htobe64(ctx->authDataLength); ctx->staging_block[0] = 0xff; ctx->staging_block[1] = 0xff; bcopy(&sizeVal, ctx->staging_block+2, sizeof(sizeVal)); addLength = 2 + sizeof(sizeVal); } ctx->blockIndex = addLength; /* * The length descriptor goes into the AAD buffer, so we * need to account for it. */ ctx->authDataLength += addLength; ctx->authDataCount = addLength; } } int AES_CBC_MAC_Update(struct aes_cbc_mac_ctx *ctx, const uint8_t *data, uint16_t length) { size_t copy_amt; /* * This will be called in one of two phases: * (1) Applying authentication data, or * (2) Applying the payload data. 
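 *
 * A sketch of the call sequence a consumer (e.g. the OCF glue) is
 * expected to follow; the lengths and the 12-byte nonce are
 * illustrative only:
 *
 *	AES_CBC_MAC_Setkey(&ctx, key, 16);
 *	ctx.authDataLength = aad_len;
 *	ctx.cryptDataLength = payload_len;	// both set before Reinit
 *	AES_CBC_MAC_Reinit(&ctx, nonce, 12);
 *	AES_CBC_MAC_Update(&ctx, aad, aad_len);		// phase (1)
 *	AES_CBC_MAC_Update(&ctx, payload, payload_len);	// phase (2)
 *	AES_CBC_MAC_Final(tag, &ctx);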
* * Because CBC-MAC puts the authentication data size before the * data, subsequent calls won't be block-size-aligned. Which * complicates things a fair bit. * * The payload data doesn't have that problem. */ if (ctx->authDataCount < ctx->authDataLength) { /* * We need to process data as authentication data. * Since we may be out of sync, we may also need * to pad out the staging block. */ const uint8_t *ptr = data; while (length > 0) { copy_amt = MIN(length, sizeof(ctx->staging_block) - ctx->blockIndex); bcopy(ptr, ctx->staging_block + ctx->blockIndex, copy_amt); ptr += copy_amt; length -= copy_amt; ctx->authDataCount += copy_amt; ctx->blockIndex += copy_amt; ctx->blockIndex %= sizeof(ctx->staging_block); if (ctx->blockIndex == 0 || ctx->authDataCount == ctx->authDataLength) { /* * We're done with this block, so we * xor staging_block with block, and then * encrypt it. */ xor_and_encrypt(ctx, ctx->staging_block, ctx->block); bzero(ctx->staging_block, sizeof(ctx->staging_block)); ctx->blockIndex = 0; if (ctx->authDataCount >= ctx->authDataLength) break; } } /* * We'd like to be able to check length == 0 and return * here, but the way OCF calls us, length is always * blksize (16, in this case). So we have to count on * the fact that OCF calls us separately for the AAD and * for the real data. */ return (0); } /* * If we're here, then we're encoding payload data. * This is marginally easier, except that _Update can * be called with non-aligned update lengths. As a result, * we still need to use the staging block. */ KASSERT((length + ctx->cryptDataCount) <= ctx->cryptDataLength, ("More encryption data than allowed")); while (length) { uint8_t *ptr; copy_amt = MIN(sizeof(ctx->staging_block) - ctx->blockIndex, length); ptr = ctx->staging_block + ctx->blockIndex; bcopy(data, ptr, copy_amt); data += copy_amt; ctx->blockIndex += copy_amt; ctx->cryptDataCount += copy_amt; length -= copy_amt; if (ctx->blockIndex == sizeof(ctx->staging_block)) { /* We've got a full block */ xor_and_encrypt(ctx, ctx->staging_block, ctx->block); ctx->blockIndex = 0; bzero(ctx->staging_block, sizeof(ctx->staging_block)); } } return (0); } void AES_CBC_MAC_Final(uint8_t *buf, struct aes_cbc_mac_ctx *ctx) { uint8_t s0[CCM_CBC_BLOCK_LEN]; /* * We first need to check to see if we've got any data * left over to encrypt. */ if (ctx->blockIndex != 0) { xor_and_encrypt(ctx, ctx->staging_block, ctx->block); ctx->cryptDataCount += ctx->blockIndex; ctx->blockIndex = 0; explicit_bzero(ctx->staging_block, sizeof(ctx->staging_block)); } bzero(s0, sizeof(s0)); s0[0] = (15 - ctx->nonceLength) - 1; bcopy(ctx->nonce, s0 + 1, ctx->nonceLength); rijndaelEncrypt(ctx->keysched, ctx->rounds, s0, s0); for (size_t indx = 0; indx < AES_CBC_MAC_HASH_LEN; indx++) buf[indx] = ctx->block[indx] ^ s0[indx]; explicit_bzero(s0, sizeof(s0)); } Index: user/ngie/bug-237403/sys/powerpc/aim/aim_machdep.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/aim/aim_machdep.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/aim/aim_machdep.c (revision 346926) @@ -1,694 +1,695 @@ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 2001 Benno Rice * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * $NetBSD: machdep.c,v 1.74.2.1 2000/11/01 16:13:48 tv Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_kstack_pages.h" #include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifndef __powerpc64__ #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef __powerpc64__ #include "mmu_oea64.h" #endif #ifndef __powerpc64__ struct bat battable[16]; #endif #ifndef __powerpc64__ /* Bits for running on 64-bit systems in 32-bit mode. 
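 * (aim_cpu_init() below probes for a 64-bit CPU with testppc64, copies the restorebridge snippet in front of the non-generic trap handlers, and patches the two rfi instances to rfid when PPC_FEATURE_64 is set.)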
*/ extern void *testppc64, *testppc64size; extern void *restorebridge, *restorebridgesize; extern void *rfid_patch, *rfi_patch1, *rfi_patch2; extern void *trapcode64; extern Elf_Addr _GLOBAL_OFFSET_TABLE_[]; #endif extern void *rstcode, *rstcodeend; extern void *trapcode, *trapcodeend; extern void *hypertrapcode, *hypertrapcodeend; extern void *generictrap, *generictrap64; extern void *alitrap, *aliend; extern void *dsitrap, *dsiend; extern void *decrint, *decrsize; extern void *extint, *extsize; extern void *dblow, *dbend; extern void *imisstrap, *imisssize; extern void *dlmisstrap, *dlmisssize; extern void *dsmisstrap, *dsmisssize; extern void *ap_pcpu; extern void __restartkernel(vm_offset_t, vm_offset_t, vm_offset_t, void *, uint32_t, register_t offset, register_t msr); void aim_early_init(vm_offset_t fdt, vm_offset_t toc, vm_offset_t ofentry, void *mdp, uint32_t mdp_cookie); void aim_cpu_init(vm_offset_t toc); void aim_early_init(vm_offset_t fdt, vm_offset_t toc, vm_offset_t ofentry, void *mdp, uint32_t mdp_cookie) { register_t scratch; /* * If running from an FDT, make sure we are in real mode to avoid * tromping on firmware page tables. Everything in the kernel assumes * 1:1 mappings out of firmware, so this won't break anything not * already broken. This doesn't work if there is live OF, since OF * may internally use non-1:1 mappings. */ if (ofentry == 0) mtmsr(mfmsr() & ~(PSL_IR | PSL_DR)); #ifdef __powerpc64__ /* * If in real mode, relocate to high memory so that the kernel * can execute from the direct map. */ if (!(mfmsr() & PSL_DR) && (vm_offset_t)&aim_early_init < DMAP_BASE_ADDRESS) __restartkernel(fdt, 0, ofentry, mdp, mdp_cookie, DMAP_BASE_ADDRESS, mfmsr()); #endif /* Various very early CPU fix ups */ switch (mfpvr() >> 16) { /* * PowerPC 970 CPUs have a misfeature requested by Apple that * makes them pretend they have a 32-byte cacheline. Turn this * off before we measure the cacheline size. */ case IBM970: case IBM970FX: case IBM970MP: case IBM970GX: scratch = mfspr(SPR_HID5); scratch &= ~HID5_970_DCBZ_SIZE_HI; mtspr(SPR_HID5, scratch); break; #ifdef __powerpc64__ case IBMPOWER7: case IBMPOWER7PLUS: case IBMPOWER8: case IBMPOWER8E: + case IBMPOWER8NVL: case IBMPOWER9: /* XXX: get from ibm,slb-size in device tree */ n_slbs = 32; break; #endif } } void aim_cpu_init(vm_offset_t toc) { size_t trap_offset, trapsize; vm_offset_t trap; register_t msr; uint8_t *cache_check; int cacheline_warn; #ifndef __powerpc64__ register_t scratch; int ppc64; #endif trap_offset = 0; cacheline_warn = 0; /* General setup for AIM CPUs */ psl_kernset = PSL_EE | PSL_ME | PSL_IR | PSL_DR | PSL_RI; #ifdef __powerpc64__ psl_kernset |= PSL_SF; if (mfmsr() & PSL_HV) psl_kernset |= PSL_HV; #endif psl_userset = psl_kernset | PSL_PR; #ifdef __powerpc64__ psl_userset32 = psl_userset & ~PSL_SF; #endif /* Bits that users aren't allowed to change */ psl_userstatic = ~(PSL_VEC | PSL_FP | PSL_FE0 | PSL_FE1); /* * Mask bits from the SRR1 that aren't really the MSR: * Bits 1-4, 10-15 (ppc32), 33-36, 42-47 (ppc64) */ psl_userstatic &= ~0x783f0000UL; /* * Initialize the interrupt tables and figure out our cache line * size and whether or not we need the 64-bit bridge code. */ /* * Disable translation in case the vector area hasn't been * mapped (G5). Note that no OFW calls can be made until * translation is re-enabled. */ msr = mfmsr(); mtmsr((msr & ~(PSL_IR | PSL_DR)) | PSL_RI); /* * Measure the cacheline size using dcbz * * Use EXC_PGM as a playground. 
We are about to overwrite it * anyway, we know it exists, and we know it is cache-aligned. */ cache_check = (void *)EXC_PGM; for (cacheline_size = 0; cacheline_size < 0x100; cacheline_size++) cache_check[cacheline_size] = 0xff; __asm __volatile("dcbz 0,%0":: "r" (cache_check) : "memory"); /* Find the first byte dcbz did not zero to get the cache line size */ for (cacheline_size = 0; cacheline_size < 0x100 && cache_check[cacheline_size] == 0; cacheline_size++); /* Work around psim bug */ if (cacheline_size == 0) { cacheline_warn = 1; cacheline_size = 32; } #ifndef __powerpc64__ /* * Figure out whether we need to use the 64 bit PMAP. This works by * executing an instruction that is only legal on 64-bit PPC (mtmsrd), * and setting ppc64 = 0 if that causes a trap. */ ppc64 = 1; bcopy(&testppc64, (void *)EXC_PGM, (size_t)&testppc64size); __syncicache((void *)EXC_PGM, (size_t)&testppc64size); __asm __volatile("\ mfmsr %0; \ mtsprg2 %1; \ \ mtmsrd %0; \ mfsprg2 %1;" : "=r"(scratch), "=r"(ppc64)); if (ppc64) cpu_features |= PPC_FEATURE_64; /* * Now copy restorebridge into all the handlers, if necessary, * and set up the trap tables. */ if (cpu_features & PPC_FEATURE_64) { /* Patch the two instances of rfi -> rfid */ bcopy(&rfid_patch,&rfi_patch1,4); #ifdef KDB /* rfi_patch2 is at the end of dbleave */ bcopy(&rfid_patch,&rfi_patch2,4); #endif } #else /* powerpc64 */ cpu_features |= PPC_FEATURE_64; #endif trapsize = (size_t)&trapcodeend - (size_t)&trapcode; /* * Copy generic handler into every possible trap. Special cases will get * different ones in a minute. */ for (trap = EXC_RST; trap < EXC_LAST; trap += 0x20) bcopy(&trapcode, (void *)trap, trapsize); #ifndef __powerpc64__ if (cpu_features & PPC_FEATURE_64) { /* * Copy a code snippet to restore 32-bit bridge mode * to the top of every non-generic trap handler */ trap_offset += (size_t)&restorebridgesize; bcopy(&restorebridge, (void *)EXC_RST, trap_offset); bcopy(&restorebridge, (void *)EXC_DSI, trap_offset); bcopy(&restorebridge, (void *)EXC_ALI, trap_offset); bcopy(&restorebridge, (void *)EXC_PGM, trap_offset); bcopy(&restorebridge, (void *)EXC_MCHK, trap_offset); bcopy(&restorebridge, (void *)EXC_TRC, trap_offset); bcopy(&restorebridge, (void *)EXC_BPT, trap_offset); } #else trapsize = (size_t)&hypertrapcodeend - (size_t)&hypertrapcode; bcopy(&hypertrapcode, (void *)(EXC_HEA + trap_offset), trapsize); bcopy(&hypertrapcode, (void *)(EXC_HMI + trap_offset), trapsize); bcopy(&hypertrapcode, (void *)(EXC_HVI + trap_offset), trapsize); bcopy(&hypertrapcode, (void *)(EXC_SOFT_PATCH + trap_offset), trapsize); #endif bcopy(&rstcode, (void *)(EXC_RST + trap_offset), (size_t)&rstcodeend - (size_t)&rstcode); #ifdef KDB bcopy(&dblow, (void *)(EXC_MCHK + trap_offset), (size_t)&dbend - (size_t)&dblow); bcopy(&dblow, (void *)(EXC_PGM + trap_offset), (size_t)&dbend - (size_t)&dblow); bcopy(&dblow, (void *)(EXC_TRC + trap_offset), (size_t)&dbend - (size_t)&dblow); bcopy(&dblow, (void *)(EXC_BPT + trap_offset), (size_t)&dbend - (size_t)&dblow); #endif bcopy(&alitrap, (void *)(EXC_ALI + trap_offset), (size_t)&aliend - (size_t)&alitrap); bcopy(&dsitrap, (void *)(EXC_DSI + trap_offset), (size_t)&dsiend - (size_t)&dsitrap); #ifdef __powerpc64__ /* Set TOC base so that the interrupt code can get at it */ *((void **)TRAP_GENTRAP) = &generictrap; *((register_t *)TRAP_TOCBASE) = toc; #else /* Set branch address for trap code */ if (cpu_features & PPC_FEATURE_64) *((void **)TRAP_GENTRAP) = &generictrap64; else *((void **)TRAP_GENTRAP) = &generictrap; *((void 
**)TRAP_TOCBASE) = _GLOBAL_OFFSET_TABLE_; /* G2-specific TLB miss helper handlers */ bcopy(&imisstrap, (void *)EXC_IMISS, (size_t)&imisssize); bcopy(&dlmisstrap, (void *)EXC_DLMISS, (size_t)&dlmisssize); bcopy(&dsmisstrap, (void *)EXC_DSMISS, (size_t)&dsmisssize); #endif __syncicache(EXC_RSVD, EXC_LAST - EXC_RSVD); /* * Restore MSR */ mtmsr(msr); /* Warn if cacheline size was not determined */ if (cacheline_warn == 1) { printf("WARNING: cacheline size undetermined, setting to 32\n"); } /* * Initialise virtual memory. Use BUS_PROBE_GENERIC priority * in case the platform module had a better idea of what we * should do. */ if (cpu_features & PPC_FEATURE_64) pmap_mmu_install(MMU_TYPE_G5, BUS_PROBE_GENERIC); else pmap_mmu_install(MMU_TYPE_OEA, BUS_PROBE_GENERIC); } /* * Shutdown the CPU as much as possible. */ void cpu_halt(void) { OF_exit(); } int ptrace_single_step(struct thread *td) { struct trapframe *tf; tf = td->td_frame; tf->srr1 |= PSL_SE; return (0); } int ptrace_clear_single_step(struct thread *td) { struct trapframe *tf; tf = td->td_frame; tf->srr1 &= ~PSL_SE; return (0); } void kdb_cpu_clear_singlestep(void) { kdb_frame->srr1 &= ~PSL_SE; } void kdb_cpu_set_singlestep(void) { kdb_frame->srr1 |= PSL_SE; } /* * Initialise a struct pcpu. */ void cpu_pcpu_init(struct pcpu *pcpu, int cpuid, size_t sz) { #ifdef __powerpc64__ /* Copy the SLB contents from the current CPU */ memcpy(pcpu->pc_aim.slb, PCPU_GET(aim.slb), sizeof(pcpu->pc_aim.slb)); #endif } #ifndef __powerpc64__ uint64_t va_to_vsid(pmap_t pm, vm_offset_t va) { return ((pm->pm_sr[(uintptr_t)va >> ADDR_SR_SHFT]) & SR_VSID_MASK); } #endif /* * These functions need to provide addresses that both (a) work in real mode * (or whatever mode/circumstances the kernel is in, in early boot (now)) and * (b) can still, in principle, work once the kernel is going. Because these * rely on existing mappings/real mode, unmap is a no-op. */ vm_offset_t pmap_early_io_map(vm_paddr_t pa, vm_size_t size) { KASSERT(!pmap_bootstrapped, ("Not available after PMAP started!")); /* * If we have the MMU up in early boot, assume it is 1:1. Otherwise, * try to get the address in a memory region compatible with the * direct map for efficiency later. */ if (mfmsr() & PSL_DR) return (pa); else return (DMAP_BASE_ADDRESS + pa); } void pmap_early_io_unmap(vm_offset_t va, vm_size_t size) { KASSERT(!pmap_bootstrapped, ("Not available after PMAP started!")); } /* From p3-53 of the MPC7450 RISC Microprocessor Family Reference Manual */ void flush_disable_caches(void) { register_t msr; register_t msscr0; register_t cache_reg; volatile uint32_t *memp; uint32_t temp; int i; int x; msr = mfmsr(); powerpc_sync(); mtmsr(msr & ~(PSL_EE | PSL_DR)); msscr0 = mfspr(SPR_MSSCR0); msscr0 &= ~MSSCR0_L2PFE; mtspr(SPR_MSSCR0, msscr0); powerpc_sync(); isync(); __asm__ __volatile__("dssall; sync"); powerpc_sync(); isync(); __asm__ __volatile__("dcbf 0,%0" :: "r"(0)); __asm__ __volatile__("dcbf 0,%0" :: "r"(0)); __asm__ __volatile__("dcbf 0,%0" :: "r"(0)); /* Lock the L1 Data cache. */ mtspr(SPR_LDSTCR, mfspr(SPR_LDSTCR) | 0xFF); powerpc_sync(); isync(); mtspr(SPR_LDSTCR, 0); /* * Perform this in two stages: Flush the cache starting in RAM, then do it * from ROM.
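 * (The ROM pass appears to walk the L1 one way at a time: each LDSTCR value below, 0xfe, 0xfd, ... 0x7f, leaves exactly one way unlocked while 128 32-byte lines are read and flushed.)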
*/ memp = (volatile uint32_t *)0x00000000; for (i = 0; i < 128 * 1024; i++) { temp = *memp; __asm__ __volatile__("dcbf 0,%0" :: "r"(memp)); memp += 32/sizeof(*memp); } memp = (volatile uint32_t *)0xfff00000; x = 0xfe; for (; x != 0xff;) { mtspr(SPR_LDSTCR, x); for (i = 0; i < 128; i++) { temp = *memp; __asm__ __volatile__("dcbf 0,%0" :: "r"(memp)); memp += 32/sizeof(*memp); } x = ((x << 1) | 1) & 0xff; } mtspr(SPR_LDSTCR, 0); cache_reg = mfspr(SPR_L2CR); if (cache_reg & L2CR_L2E) { cache_reg &= ~(L2CR_L2IO_7450 | L2CR_L2DO_7450); mtspr(SPR_L2CR, cache_reg); powerpc_sync(); mtspr(SPR_L2CR, cache_reg | L2CR_L2HWF); while (mfspr(SPR_L2CR) & L2CR_L2HWF) ; /* Busy wait for cache to flush */ powerpc_sync(); cache_reg &= ~L2CR_L2E; mtspr(SPR_L2CR, cache_reg); powerpc_sync(); mtspr(SPR_L2CR, cache_reg | L2CR_L2I); powerpc_sync(); while (mfspr(SPR_L2CR) & L2CR_L2I) ; /* Busy wait for L2 cache invalidate */ powerpc_sync(); } cache_reg = mfspr(SPR_L3CR); if (cache_reg & L3CR_L3E) { cache_reg &= ~(L3CR_L3IO | L3CR_L3DO); mtspr(SPR_L3CR, cache_reg); powerpc_sync(); mtspr(SPR_L3CR, cache_reg | L3CR_L3HWF); while (mfspr(SPR_L3CR) & L3CR_L3HWF) ; /* Busy wait for cache to flush */ powerpc_sync(); cache_reg &= ~L3CR_L3E; mtspr(SPR_L3CR, cache_reg); powerpc_sync(); mtspr(SPR_L3CR, cache_reg | L3CR_L3I); powerpc_sync(); while (mfspr(SPR_L3CR) & L3CR_L3I) ; /* Busy wait for L3 cache invalidate */ powerpc_sync(); } mtspr(SPR_HID0, mfspr(SPR_HID0) & ~HID0_DCE); powerpc_sync(); isync(); mtmsr(msr); } void cpu_sleep() { static u_quad_t timebase = 0; static register_t sprgs[4]; static register_t srrs[2]; jmp_buf resetjb; struct thread *fputd; struct thread *vectd; register_t hid0; register_t msr; register_t saved_msr; ap_pcpu = pcpup; PCPU_SET(restore, &resetjb); saved_msr = mfmsr(); fputd = PCPU_GET(fputhread); vectd = PCPU_GET(vecthread); if (fputd != NULL) save_fpu(fputd); if (vectd != NULL) save_vec(vectd); if (setjmp(resetjb) == 0) { sprgs[0] = mfspr(SPR_SPRG0); sprgs[1] = mfspr(SPR_SPRG1); sprgs[2] = mfspr(SPR_SPRG2); sprgs[3] = mfspr(SPR_SPRG3); srrs[0] = mfspr(SPR_SRR0); srrs[1] = mfspr(SPR_SRR1); timebase = mftb(); powerpc_sync(); flush_disable_caches(); hid0 = mfspr(SPR_HID0); hid0 = (hid0 & ~(HID0_DOZE | HID0_NAP)) | HID0_SLEEP; powerpc_sync(); isync(); msr = mfmsr() | PSL_POW; mtspr(SPR_HID0, hid0); powerpc_sync(); while (1) mtmsr(msr); } platform_smp_timebase_sync(timebase, 0); PCPU_SET(curthread, curthread); PCPU_SET(curpcb, curthread->td_pcb); pmap_activate(curthread); powerpc_sync(); mtspr(SPR_SPRG0, sprgs[0]); mtspr(SPR_SPRG1, sprgs[1]); mtspr(SPR_SPRG2, sprgs[2]); mtspr(SPR_SPRG3, sprgs[3]); mtspr(SPR_SRR0, srrs[0]); mtspr(SPR_SRR1, srrs[1]); mtmsr(saved_msr); if (fputd == curthread) enable_fpu(curthread); if (vectd == curthread) enable_vec(curthread); powerpc_sync(); } Index: user/ngie/bug-237403/sys/powerpc/aim/mp_cpudep.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/aim/mp_cpudep.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/aim/mp_cpudep.c (revision 346926) @@ -1,420 +1,428 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2008 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include void *ap_pcpu; static register_t bsp_state[8] __aligned(8); static void cpudep_save_config(void *dummy); SYSINIT(cpu_save_config, SI_SUB_CPU, SI_ORDER_ANY, cpudep_save_config, NULL); void cpudep_ap_early_bootstrap(void) { #ifndef __powerpc64__ register_t reg; #endif switch (mfpvr() >> 16) { case IBM970: case IBM970FX: case IBM970MP: /* Restore HID4 and HID5, which are necessary for the MMU */ #ifdef __powerpc64__ mtspr(SPR_HID4, bsp_state[2]); powerpc_sync(); isync(); mtspr(SPR_HID5, bsp_state[3]); powerpc_sync(); isync(); #else __asm __volatile("ld %0, 16(%2); sync; isync; \ mtspr %1, %0; sync; isync;" : "=r"(reg) : "K"(SPR_HID4), "b"(bsp_state)); __asm __volatile("ld %0, 24(%2); sync; isync; \ mtspr %1, %0; sync; isync;" : "=r"(reg) : "K"(SPR_HID5), "b"(bsp_state)); #endif powerpc_sync(); break; case IBMPOWER8: case IBMPOWER8E: + case IBMPOWER8NVL: case IBMPOWER9: #ifdef __powerpc64__ if (mfmsr() & PSL_HV) { isync(); /* * Direct interrupts to SRR instead of HSRR and * reset LPCR otherwise */ mtspr(SPR_LPID, 0); isync(); mtspr(SPR_LPCR, lpcr); isync(); + + /* + * Nuke FSCR, to be managed on a per-process basis + * later. + */ + mtspr(SPR_FSCR, 0); } #endif break; } __asm __volatile("mtsprg 0, %0" :: "r"(ap_pcpu)); powerpc_sync(); } uintptr_t cpudep_ap_bootstrap(void) { register_t msr, sp; msr = psl_kernset & ~PSL_EE; mtmsr(msr); pcpup->pc_curthread = pcpup->pc_idlethread; #ifdef __powerpc64__ __asm __volatile("mr 13,%0" :: "r"(pcpup->pc_curthread)); #else __asm __volatile("mr 2,%0" :: "r"(pcpup->pc_curthread)); #endif pcpup->pc_curpcb = pcpup->pc_curthread->td_pcb; sp = pcpup->pc_curpcb->pcb_sp; return (sp); } static register_t mpc74xx_l2_enable(register_t l2cr_config) { register_t ccr, bit; uint16_t vers; vers = mfpvr() >> 16; switch (vers) { case MPC7400: case MPC7410: bit = L2CR_L2IP; break; default: bit = L2CR_L2I; break; } ccr = mfspr(SPR_L2CR); if (ccr & L2CR_L2E) return (ccr); /* Configure L2 cache. */ ccr = l2cr_config & ~L2CR_L2E; mtspr(SPR_L2CR, ccr | L2CR_L2I); do { ccr = mfspr(SPR_L2CR); } while (ccr & bit); powerpc_sync(); mtspr(SPR_L2CR, l2cr_config); powerpc_sync(); return (l2cr_config); } static register_t mpc745x_l3_enable(register_t l3cr_config) { register_t ccr; ccr = mfspr(SPR_L3CR); if (ccr & L3CR_L3E) return (ccr); /* Configure L3 cache. 
*/ ccr = l3cr_config & ~(L3CR_L3E | L3CR_L3I | L3CR_L3PE | L3CR_L3CLKEN); mtspr(SPR_L3CR, ccr); ccr |= 0x4000000; /* Magic, but documented. */ mtspr(SPR_L3CR, ccr); ccr |= L3CR_L3CLKEN; mtspr(SPR_L3CR, ccr); mtspr(SPR_L3CR, ccr | L3CR_L3I); while (mfspr(SPR_L3CR) & L3CR_L3I) ; mtspr(SPR_L3CR, ccr & ~L3CR_L3CLKEN); powerpc_sync(); DELAY(100); mtspr(SPR_L3CR, ccr); powerpc_sync(); DELAY(100); ccr |= L3CR_L3E; mtspr(SPR_L3CR, ccr); powerpc_sync(); return(ccr); } static register_t mpc74xx_l1d_enable(void) { register_t hid; hid = mfspr(SPR_HID0); if (hid & HID0_DCE) return (hid); /* Enable L1 D-cache */ hid |= HID0_DCE; powerpc_sync(); mtspr(SPR_HID0, hid | HID0_DCFI); powerpc_sync(); return (hid); } static register_t mpc74xx_l1i_enable(void) { register_t hid; hid = mfspr(SPR_HID0); if (hid & HID0_ICE) return (hid); /* Enable L1 I-cache */ hid |= HID0_ICE; isync(); mtspr(SPR_HID0, hid | HID0_ICFI); isync(); return (hid); } static void cpudep_save_config(void *dummy) { uint16_t vers; vers = mfpvr() >> 16; switch(vers) { case IBM970: case IBM970FX: case IBM970MP: #ifdef __powerpc64__ bsp_state[0] = mfspr(SPR_HID0); bsp_state[1] = mfspr(SPR_HID1); bsp_state[2] = mfspr(SPR_HID4); bsp_state[3] = mfspr(SPR_HID5); #else __asm __volatile ("mfspr %0,%2; mr %1,%0; srdi %0,%0,32" : "=r" (bsp_state[0]),"=r" (bsp_state[1]) : "K" (SPR_HID0)); __asm __volatile ("mfspr %0,%2; mr %1,%0; srdi %0,%0,32" : "=r" (bsp_state[2]),"=r" (bsp_state[3]) : "K" (SPR_HID1)); __asm __volatile ("mfspr %0,%2; mr %1,%0; srdi %0,%0,32" : "=r" (bsp_state[4]),"=r" (bsp_state[5]) : "K" (SPR_HID4)); __asm __volatile ("mfspr %0,%2; mr %1,%0; srdi %0,%0,32" : "=r" (bsp_state[6]),"=r" (bsp_state[7]) : "K" (SPR_HID5)); #endif powerpc_sync(); break; case IBMCELLBE: #ifdef NOTYET /* Causes problems if in instruction stream on 970 */ if (mfmsr() & PSL_HV) { bsp_state[0] = mfspr(SPR_HID0); bsp_state[1] = mfspr(SPR_HID1); bsp_state[2] = mfspr(SPR_HID4); bsp_state[3] = mfspr(SPR_HID6); bsp_state[4] = mfspr(SPR_CELL_TSCR); } #endif bsp_state[5] = mfspr(SPR_CELL_TSRL); break; case MPC7450: case MPC7455: case MPC7457: /* Only MPC745x CPUs have an L3 cache. */ bsp_state[3] = mfspr(SPR_L3CR); /* Fallthrough */ case MPC7400: case MPC7410: case MPC7447A: case MPC7448: bsp_state[2] = mfspr(SPR_L2CR); bsp_state[1] = mfspr(SPR_HID1); bsp_state[0] = mfspr(SPR_HID0); break; } } void cpudep_ap_setup() { register_t reg; uint16_t vers; vers = mfpvr() >> 16; /* The following is needed for restoring from sleep. */ platform_smp_timebase_sync(0, 1); switch(vers) { case IBM970: case IBM970FX: case IBM970MP: /* Set HIOR to 0 */ __asm __volatile("mtspr 311,%0" :: "r"(0)); powerpc_sync(); /* * The 970 has strange rules about how to update HID registers. 
* See Table 2-3, 970MP manual * * Note: HID4 and HID5 restored already in * cpudep_ap_early_bootstrap() */ __asm __volatile("mtasr %0; sync" :: "r"(0)); #ifdef __powerpc64__ __asm __volatile(" \ sync; isync; \ mtspr %1, %0; \ mfspr %0, %1; mfspr %0, %1; mfspr %0, %1; \ mfspr %0, %1; mfspr %0, %1; mfspr %0, %1; \ sync; isync" :: "r"(bsp_state[0]), "K"(SPR_HID0)); __asm __volatile("sync; isync; \ mtspr %1, %0; mtspr %1, %0; sync; isync" :: "r"(bsp_state[1]), "K"(SPR_HID1)); #else __asm __volatile(" \ ld %0,0(%2); \ sync; isync; \ mtspr %1, %0; \ mfspr %0, %1; mfspr %0, %1; mfspr %0, %1; \ mfspr %0, %1; mfspr %0, %1; mfspr %0, %1; \ sync; isync" : "=r"(reg) : "K"(SPR_HID0), "b"(bsp_state)); __asm __volatile("ld %0, 8(%2); sync; isync; \ mtspr %1, %0; mtspr %1, %0; sync; isync" : "=r"(reg) : "K"(SPR_HID1), "b"(bsp_state)); #endif powerpc_sync(); break; case IBMCELLBE: #ifdef NOTYET /* Causes problems if in instruction stream on 970 */ if (mfmsr() & PSL_HV) { mtspr(SPR_HID0, bsp_state[0]); mtspr(SPR_HID1, bsp_state[1]); mtspr(SPR_HID4, bsp_state[2]); mtspr(SPR_HID6, bsp_state[3]); mtspr(SPR_CELL_TSCR, bsp_state[4]); } #endif mtspr(SPR_CELL_TSRL, bsp_state[5]); break; case MPC7400: case MPC7410: case MPC7447A: case MPC7448: case MPC7450: case MPC7455: case MPC7457: /* XXX: Program the CPU ID into PIR */ __asm __volatile("mtspr 1023,%0" :: "r"(PCPU_GET(cpuid))); powerpc_sync(); isync(); mtspr(SPR_HID0, bsp_state[0]); isync(); mtspr(SPR_HID1, bsp_state[1]); isync(); /* Now enable the L3 cache. */ switch (vers) { case MPC7450: case MPC7455: case MPC7457: /* Only MPC745x CPUs have an L3 cache. */ reg = mpc745x_l3_enable(bsp_state[3]); default: break; } reg = mpc74xx_l2_enable(bsp_state[2]); reg = mpc74xx_l1d_enable(); reg = mpc74xx_l1i_enable(); break; case IBMPOWER7: case IBMPOWER7PLUS: case IBMPOWER8: case IBMPOWER8E: + case IBMPOWER8NVL: case IBMPOWER9: #ifdef __powerpc64__ if (mfmsr() & PSL_HV) { mtspr(SPR_LPCR, mfspr(SPR_LPCR) | lpcr | LPCR_PECE_WAKESET); isync(); } #endif break; default: #ifdef __powerpc64__ if (!(mfmsr() & PSL_HV)) /* Rely on HV to have set things up */ break; #endif printf("WARNING: Unknown CPU type. Cache performance may be " "suboptimal.\n"); break; } } Index: user/ngie/bug-237403/sys/powerpc/conf/GENERIC64 =================================================================== --- user/ngie/bug-237403/sys/powerpc/conf/GENERIC64 (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/conf/GENERIC64 (revision 346926) @@ -1,255 +1,256 @@ # # GENERIC -- Generic kernel configuration file for FreeBSD/powerpc # # For more information on this file, please read the handbook section on # Kernel Configuration Files: # # https://www.FreeBSD.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig-config.html # # The handbook is also available locally in /usr/share/doc/handbook # if you've installed the doc distribution, otherwise always see the # FreeBSD World Wide Web server (https://www.FreeBSD.org/) for the # latest information. # # An exhaustive list of options and more detailed explanations of the # device lines is also present in the ../../conf/NOTES and NOTES files. # If you are in doubt as to the purpose or necessity of a line, check first # in NOTES. # # $FreeBSD$ cpu AIM ident GENERIC machine powerpc powerpc64 makeoptions DEBUG=-g #Build kernel with gdb(1) debug symbols makeoptions WITH_CTF=1 # Platform support options POWERMAC #NewWorld Apple PowerMacs options PS3 #Sony Playstation 3 options MAMBO #IBM Mambo Full System Simulator options PSERIES #PAPR-compliant systems (e.g.
IBM p) options POWERNV #Non-virtualized OpenPOWER systems options FDT #Flattened Device Tree options SCHED_ULE #ULE scheduler options NUMA #Non-Uniform Memory Architecture support options PREEMPTION #Enable kernel thread preemption options VIMAGE # Subsystem virtualization, e.g. VNET options INET #InterNETworking options INET6 #IPv6 communications protocols options IPSEC # IP (v4/v6) security options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5 options TCP_OFFLOAD # TCP offload options TCP_BLACKBOX # Enhanced TCP event logging options TCP_HHOOK # hhook(9) framework for TCP options TCP_RFC7413 # TCP Fast Open options SCTP #Stream Control Transmission Protocol options FFS #Berkeley Fast Filesystem options SOFTUPDATES #Enable FFS soft updates support options UFS_ACL #Support for access control lists options UFS_DIRHASH #Improve performance on big directories options UFS_GJOURNAL #Enable gjournal-based UFS journaling options QUOTA #Enable disk quotas for UFS options MD_ROOT #MD is a potential root device options NFSCL #Network Filesystem Client options NFSD #Network Filesystem Server options NFSLOCKD #Network Lock Manager options NFS_ROOT #NFS usable as root device options MSDOSFS #MSDOS Filesystem options CD9660 #ISO 9660 Filesystem options PROCFS #Process filesystem (requires PSEUDOFS) options PSEUDOFS #Pseudo-filesystem framework options GEOM_PART_APM #Apple Partition Maps. options GEOM_PART_GPT #GUID Partition Tables. options GEOM_LABEL #Provides labelization options COMPAT_FREEBSD32 #Compatible with FreeBSD/powerpc binaries options COMPAT_FREEBSD5 #Compatible with FreeBSD5 options COMPAT_FREEBSD6 #Compatible with FreeBSD6 options COMPAT_FREEBSD7 #Compatible with FreeBSD7 options COMPAT_FREEBSD9 # Compatible with FreeBSD9 options COMPAT_FREEBSD10 # Compatible with FreeBSD10 options COMPAT_FREEBSD11 # Compatible with FreeBSD11 options SCSI_DELAY=5000 #Delay (in ms) before probing SCSI options KTRACE #ktrace(1) syscall trace support options STACK #stack(9) support options SYSVSHM #SYSV-style shared memory options SYSVMSG #SYSV-style message queues options SYSVSEM #SYSV-style semaphores options _KPOSIX_PRIORITY_SCHEDULING #Posix P1003_1B real-time extensions options PRINTF_BUFR_SIZE=128 # Prevent printf output being interspersed. options HWPMC_HOOKS # Necessary kernel hooks for hwpmc(4) options AUDIT # Security event auditing options CAPABILITY_MODE # Capsicum capability mode options CAPABILITIES # Capsicum capabilities options MAC # TrustedBSD MAC Framework options KDTRACE_HOOKS # Kernel DTrace hooks options DDB_CTF # Kernel ELF linker loads CTF data options INCLUDE_CONFIG_FILE # Include this file in kernel options RACCT # Resource accounting framework options RACCT_DEFAULT_TO_DISABLED # Set kern.racct.enable=0 by default options RCTL # Resource limits # Debugging support. Always need this: options KDB # Enable kernel debugger support. options KDB_TRACE # Print a stack trace for a panic. # For full debugger support use (turn off in stable branch): options DDB #Support DDB #options DEADLKRES #Enable the deadlock resolver options INVARIANTS #Enable calls of extra sanity checking options INVARIANT_SUPPORT #Extra sanity checks of internal structures, required by INVARIANTS options WITNESS #Enable checks to detect deadlocks and cycles options WITNESS_SKIPSPIN #Don't run witness on spinlocks for speed options MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones options VERBOSE_SYSINIT=0 # Support debug.verbose_sysinit, off by default # Kernel dump features. 
options EKCD # Support for encrypted kernel dumps options GZIO # gzip-compressed kernel and user dumps options ZSTDIO # zstd-compressed kernel and user dumps options NETDUMP # netdump(4) client support # Make an SMP-capable kernel by default options SMP # Symmetric MultiProcessor Kernel # CPU frequency control device cpufreq # Standard busses device pci options PCI_HP # PCI-Express native HotPlug device agp # ATA controllers device ahci # AHCI-compatible SATA controllers device ata # Legacy ATA/SATA controllers device mvs # Marvell 88SX50XX/88SX60XX/88SX70XX/SoC SATA device siis # SiliconImage SiI3124/SiI3132/SiI3531 SATA # NVM Express (NVMe) support device nvme # base NVMe driver options NVME_USE_NVD=0 # prefer the cam(4) based nda(4) driver device nvd # expose NVMe namespaces as disks, depends on nvme # SCSI Controllers device ahc # AHA2940 and onboard AIC7xxx devices options AHC_ALLOW_MEMIO # Attempt to use memory mapped I/O device isp # Qlogic family device ispfw # Firmware module for Qlogic host adapters device mpt # LSI-Logic MPT-Fusion device mps # LSI-Logic MPT-Fusion 2 device sym # NCR/Symbios/LSI Logic 53C8XX/53C1010/53C1510D # ATA/SCSI peripherals device scbus # SCSI bus (required for ATA/SCSI) device ch # SCSI media changers device da # Direct Access (disks) device sa # Sequential Access (tape etc) device cd # CD device pass # Passthrough device (direct ATA/SCSI access) device ses # Enclosure Service (SES and SAF-TE) # vt is the default console driver, resembling an SCO console device vt # Core console driver device kbdmux # Serial (COM) ports device scc device uart device uart_z8530 device iflib # Ethernet hardware device em # Intel PRO/1000 Gigabit Ethernet Family device ix # Intel PRO/10GbE PCIE PF Ethernet Family device ixv # Intel PRO/10GbE PCIE VF Ethernet Family device glc # Sony Playstation 3 Ethernet device llan # IBM pSeries Virtual Ethernet device cxgbe # Chelsio 10/25G NIC # PCI Ethernet NICs that use the common MII bus controller code. device miibus # MII bus support device bge # Broadcom BCM570xx Gigabit Ethernet device gem # Sun GEM/Sun ERI/Apple GMAC device dc # DEC/Intel 21143 and various workalikes device fxp # Intel EtherExpress PRO/100B (82557, 82558) device re # RealTek 8139C+/8169/8169S/8110S device rl # RealTek 8129/8139 # Pseudo devices. device crypto # core crypto support device loop # Network loopback device random # Entropy device device ether # Ethernet support device vlan # 802.1Q VLAN support device tun # Packet tunnel. device md # Memory "disks" device ofwd # Open Firmware disks device gif # IPv6 and IPv4 tunneling device firmware # firmware assist module # The `bpf' device enables the Berkeley Packet Filter. # Be aware of the administrative consequences of enabling this! # Note that 'bpf' is required for DHCP. 
device bpf #Berkeley packet filter # USB support options USB_DEBUG # enable debug msgs device uhci # UHCI PCI->USB interface device ohci # OHCI PCI->USB interface device ehci # EHCI PCI->USB interface device xhci # XHCI PCI->USB interface device usb # USB Bus (required) device uhid # "Human Interface Devices" device ukbd # Keyboard options KBD_INSTALL_CDEV # install a CDEV entry in /dev device umass # Disks/Mass storage - Requires scbus and da0 device ums # Mouse # USB Ethernet device aue # ADMtek USB Ethernet device axe # ASIX Electronics USB Ethernet device cdce # Generic USB over Ethernet device cue # CATC USB Ethernet device kue # Kawasaki LSI USB Ethernet # Wireless NIC cards options IEEE80211_SUPPORT_MESH # FireWire support device firewire # FireWire bus code device sbp # SCSI over FireWire (Requires scbus and da) device fwe # Ethernet over FireWire (non-standard!) # Misc device iicbus # I2C bus code device iic device kiic # Keywest I2C device ad7417 # PowerMac7,2 temperature sensor device ds1631 # PowerMac11,2 temperature sensor device ds1775 # PowerMac7,2 temperature sensor device fcu # Apple Fan Control Unit device max6690 # PowerMac7,2 temperature sensor device powermac_nvram # Open Firmware configuration NVRAM device smu # Apple System Management Unit device atibl # ATI-based backlight driver for PowerBooks/iBooks device nvbl # nVidia-based backlight driver for PowerBooks/iBooks +device opalflash # PowerNV embedded flash memory # ADB support device adb device pmu # Sound support device sound # Generic sound driver (required) device snd_ai2s # Apple I2S audio device snd_uaudio # USB Audio # Netmap provides direct access to TX/RX rings on supported NICs device netmap # netmap(4) support # evdev interface options EVDEV_SUPPORT # evdev support in legacy drivers device evdev # input event device support device uinput # install /dev/uinput cdev Index: user/ngie/bug-237403/sys/powerpc/include/cpu.h =================================================================== --- user/ngie/bug-237403/sys/powerpc/include/cpu.h (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/include/cpu.h (revision 346926) @@ -1,149 +1,150 @@ /*- * SPDX-License-Identifier: BSD-4-Clause * * Copyright (C) 1995-1997 Wolfgang Solfrank. * Copyright (C) 1995-1997 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $NetBSD: cpu.h,v 1.11 2000/05/26 21:19:53 thorpej Exp $ * $FreeBSD$ */ #ifndef _MACHINE_CPU_H_ #define _MACHINE_CPU_H_ #include #include #include /* * CPU Feature Attributes * * These are defined in the PowerPC ELF ABI for the AT_HWCAP vector, * and are exported to userland via the machdep.cpu_features * sysctl. */ extern u_long cpu_features; extern u_long cpu_features2; #define PPC_FEATURE_32 0x80000000 /* Always true */ #define PPC_FEATURE_64 0x40000000 /* Defined on a 64-bit CPU */ #define PPC_FEATURE_601_INSTR 0x20000000 /* Defined on a 64-bit CPU */ #define PPC_FEATURE_HAS_ALTIVEC 0x10000000 #define PPC_FEATURE_HAS_FPU 0x08000000 #define PPC_FEATURE_HAS_MMU 0x04000000 #define PPC_FEATURE_UNIFIED_CACHE 0x01000000 #define PPC_FEATURE_HAS_SPE 0x00800000 #define PPC_FEATURE_HAS_EFP_SINGLE 0x00400000 #define PPC_FEATURE_HAS_EFP_DOUBLE 0x00200000 #define PPC_FEATURE_NO_TB 0x00100000 #define PPC_FEATURE_POWER4 0x00080000 #define PPC_FEATURE_POWER5 0x00040000 #define PPC_FEATURE_POWER5_PLUS 0x00020000 #define PPC_FEATURE_CELL 0x00010000 #define PPC_FEATURE_BOOKE 0x00008000 #define PPC_FEATURE_SMT 0x00004000 #define PPC_FEATURE_ICACHE_SNOOP 0x00002000 #define PPC_FEATURE_ARCH_2_05 0x00001000 #define PPC_FEATURE_HAS_DFP 0x00000400 #define PPC_FEATURE_POWER6_EXT 0x00000200 #define PPC_FEATURE_ARCH_2_06 0x00000100 #define PPC_FEATURE_HAS_VSX 0x00000080 #define PPC_FEATURE_TRUE_LE 0x00000002 #define PPC_FEATURE_PPC_LE 0x00000001 #define PPC_FEATURE2_ARCH_2_07 0x80000000 #define PPC_FEATURE2_HTM 0x40000000 #define PPC_FEATURE2_DSCR 0x20000000 +#define PPC_FEATURE2_EBB 0x10000000 #define PPC_FEATURE2_ISEL 0x08000000 #define PPC_FEATURE2_TAR 0x04000000 #define PPC_FEATURE2_HAS_VEC_CRYPTO 0x02000000 #define PPC_FEATURE2_HTM_NOSC 0x01000000 #define PPC_FEATURE2_ARCH_3_00 0x00800000 #define PPC_FEATURE2_HAS_IEEE128 0x00400000 #define PPC_FEATURE2_DARN 0x00200000 #define PPC_FEATURE2_SCV 0x00100000 -#define PPC_FEATURE2_HTM_NOSUSPEND 0x01000000 +#define PPC_FEATURE2_HTM_NOSUSPEND 0x00080000 #define PPC_FEATURE_BITMASK \ "\20" \ "\040PPC32\037PPC64\036PPC601\035ALTIVEC\034FPU\033MMU\031UNIFIEDCACHE" \ "\030SPE\027SPESFP\026DPESFP\025NOTB\024POWER4\023POWER5\022P5PLUS\021CELL"\ "\020BOOKE\017SMT\016ISNOOP\015ARCH205\013DFP\011ARCH206\010VSX"\ "\002TRUELE\001PPCLE" #define PPC_FEATURE2_BITMASK \ "\20" \ "\040ARCH207\037HTM\036DSCR\034ISEL\033TAR\032VCRYPTO\031HTMNOSC" \ "\030ARCH300\027IEEE128\026DARN\025SCV\024HTMNOSUSP" #define TRAPF_USERMODE(frame) (((frame)->srr1 & PSL_PR) != 0) #define TRAPF_PC(frame) ((frame)->srr0) /* * CTL_MACHDEP definitions. 
*/ #define CPU_CACHELINE 1 static __inline u_int64_t get_cyclecount(void) { u_int32_t _upper, _lower; u_int64_t _time; __asm __volatile( "mftb %0\n" "mftbu %1" : "=r" (_lower), "=r" (_upper)); _time = (u_int64_t)_upper; _time = (_time << 32) + _lower; return (_time); } #define cpu_getstack(td) ((td)->td_frame->fixreg[1]) #define cpu_spinwait() __asm __volatile("or 27,27,27") /* yield */ #define cpu_lock_delay() DELAY(1) extern char btext[]; extern char etext[]; #ifdef __powerpc64__ extern void enter_idle_powerx(void); extern uint64_t can_wakeup; extern register_t lpcr; #endif void cpu_halt(void); void cpu_reset(void); void cpu_sleep(void); void flush_disable_caches(void); void fork_trampoline(void); void swi_vm(void *); #endif /* _MACHINE_CPU_H_ */ Index: user/ngie/bug-237403/sys/powerpc/include/pcb.h =================================================================== --- user/ngie/bug-237403/sys/powerpc/include/pcb.h (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/include/pcb.h (revision 346926) @@ -1,111 +1,125 @@ /*- * SPDX-License-Identifier: BSD-4-Clause * * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * $NetBSD: pcb.h,v 1.4 2000/06/04 11:57:17 tsubai Exp $ * $FreeBSD$ */ #ifndef _MACHINE_PCB_H_ #define _MACHINE_PCB_H_ #include #ifndef _STANDALONE struct pcb { register_t pcb_context[20]; /* non-volatile r14-r31 */ register_t pcb_cr; /* Condition register */ register_t pcb_sp; /* stack pointer */ register_t pcb_toc; /* toc pointer */ register_t pcb_lr; /* link register */ register_t pcb_dscr; /* dscr value */ + register_t pcb_fscr; + register_t pcb_tar; struct pmap *pcb_pm; /* pmap of our vmspace */ jmp_buf *pcb_onfault; /* For use during copyin/copyout */ int pcb_flags; #define PCB_FPU 0x1 /* Process uses FPU */ #define PCB_FPREGS 0x2 /* Process had FPU registers initialized */ #define PCB_VEC 0x4 /* Process had Altivec initialized */ #define PCB_VSX 0x8 /* Process had VSX initialized */ #define PCB_CDSCR 0x10 /* Process had Custom DSCR initialized */ #define PCB_HTM 0x20 /* Process had HTM initialized */ +#define PCB_CFSCR 0x40 /* Process had FSCR updated */ struct fpu { union { double fpr; uint32_t vsr[4]; } fpr[32]; double fpscr; /* FPSCR stored as double for easier access */ } pcb_fpu; /* Floating point processor */ unsigned int pcb_fpcpu; /* which CPU had our FPU stuff. */ struct vec { uint32_t vr[32][4]; uint32_t spare[2]; uint32_t vrsave; uint32_t vscr; /* aligned at vector element 3 */ } pcb_vec __aligned(16); /* Vector processor */ unsigned int pcb_veccpu; /* which CPU had our vector stuff. */ struct htm { uint64_t tfhar; uint64_t texasr; uint64_t tfiar; } pcb_htm; + + struct ebb { + uint64_t ebbhr; + uint64_t ebbrr; + uint64_t bescr; + } pcb_ebb; + + struct lmon { + uint64_t lmrr; + uint64_t lmser; + } pcb_lm; union { struct { vm_offset_t usr_segm; /* Base address */ register_t usr_vsid; /* USER_SR segment */ } aim; struct { register_t dbcr0; } booke; } pcb_cpu; vm_offset_t pcb_lastill; /* Last illegal instruction */ }; #endif #ifdef _KERNEL struct trapframe; #ifndef curpcb extern struct pcb *curpcb; #endif extern struct pmap *curpm; extern struct proc *fpuproc; void makectx(struct trapframe *, struct pcb *); void savectx(struct pcb *) __returns_twice; #endif #endif /* _MACHINE_PCB_H_ */ Index: user/ngie/bug-237403/sys/powerpc/include/spr.h =================================================================== --- user/ngie/bug-237403/sys/powerpc/include/spr.h (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/include/spr.h (revision 346926) @@ -1,891 +1,912 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2001 The NetBSD Foundation, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR * PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * $NetBSD: spr.h,v 1.25 2002/08/14 15:38:40 matt Exp $ * $FreeBSD$ */ #ifndef _POWERPC_SPR_H_ #define _POWERPC_SPR_H_ #ifndef _LOCORE #define mtspr(reg, val) \ __asm __volatile("mtspr %0,%1" : : "K"(reg), "r"(val)) #define mfspr(reg) \ ( { register_t val; \ __asm __volatile("mfspr %0,%1" : "=r"(val) : "K"(reg)); \ val; } ) #ifndef __powerpc64__ /* The following routines allow manipulation of the full 64-bit width * of SPRs on 64 bit CPUs in bridge mode */ #define mtspr64(reg,valhi,vallo,scratch) \ __asm __volatile(" \ mfmsr %0; \ insrdi %0,%5,1,0; \ mtmsrd %0; \ isync; \ \ sld %1,%1,%4; \ or %1,%1,%2; \ mtspr %3,%1; \ srd %1,%1,%4; \ \ clrldi %0,%0,1; \ mtmsrd %0; \ isync;" \ : "=r"(scratch), "=r"(valhi) : "r"(vallo), "K"(reg), "r"(32), "r"(1)) #define mfspr64upper(reg,scratch) \ ( { register_t val; \ __asm __volatile(" \ mfmsr %0; \ insrdi %0,%4,1,0; \ mtmsrd %0; \ isync; \ \ mfspr %1,%2; \ srd %1,%1,%3; \ \ clrldi %0,%0,1; \ mtmsrd %0; \ isync;" \ : "=r"(scratch), "=r"(val) : "K"(reg), "r"(32), "r"(1)); \ val; } ) #endif #endif /* _LOCORE */ /* * Special Purpose Register declarations. * * The first column in the comments indicates which PowerPC * architectures the SPR is valid on - 4 for 4xx series, * 6 for 6xx/7xx series and 8 for 8xx and 8xxx series. */ #define SPR_MQ 0x000 /* .6. 601 MQ register */ #define SPR_XER 0x001 /* 468 Fixed Point Exception Register */ +#define SPR_DSCR 0x003 /* .6. Data Stream Control Register (Unprivileged) */ #define SPR_RTCU_R 0x004 /* .6. 601 RTC Upper - Read */ #define SPR_RTCL_R 0x005 /* .6. 601 RTC Lower - Read */ #define SPR_LR 0x008 /* 468 Link Register */ #define SPR_CTR 0x009 /* 468 Count Register */ -#define SPR_DSCR 0x011 /* Data Stream Control Register */ +#define SPR_DSCRP 0x011 /* Data Stream Control Register (Privileged) */ #define SPR_DSISR 0x012 /* .68 DSI exception source */ #define DSISR_DIRECT 0x80000000 /* Direct-store error exception */ #define DSISR_NOTFOUND 0x40000000 /* Translation not found */ #define DSISR_PROTECT 0x08000000 /* Memory access not permitted */ #define DSISR_INVRX 0x04000000 /* Reserve-indexed insn direct-store access */ #define DSISR_STORE 0x02000000 /* Store operation */ #define DSISR_DABR 0x00400000 /* DABR match */ #define DSISR_SEGMENT 0x00200000 /* XXX; not in 6xx PEM */ #define DSISR_EAR 0x00100000 /* eciwx/ecowx && EAR[E] == 0 */ #define SPR_DAR 0x013 /* .68 Data Address Register */ #define SPR_RTCU_W 0x014 /* .6. 601 RTC Upper - Write */ #define SPR_RTCL_W 0x015 /* .6. 
601 RTC Lower - Write */ #define SPR_DEC 0x016 /* .68 DECrementer register */ #define SPR_SDR1 0x019 /* .68 Page table base address register */ #define SPR_SRR0 0x01a /* 468 Save/Restore Register 0 */ #define SPR_SRR1 0x01b /* 468 Save/Restore Register 1 */ #define SRR1_ISI_PFAULT 0x40000000 /* ISI page not found */ #define SRR1_ISI_NOEXECUTE 0x10000000 /* Memory marked no-execute */ #define SRR1_ISI_PP 0x08000000 /* PP bits forbid access */ #define SPR_DECAR 0x036 /* ..8 Decrementer auto reload */ #define SPR_EIE 0x050 /* ..8 Exception Interrupt ??? */ #define SPR_EID 0x051 /* ..8 Exception Interrupt ??? */ #define SPR_NRI 0x052 /* ..8 Exception Interrupt ??? */ #define SPR_FSCR 0x099 /* Facility Status and Control Register */ -#define FSCR_IC_MASK 0xFF00000000000000ULL /* FSCR[0:7] is Interrupt Cause */ -#define FSCR_IC_FP 0x0000000000000000ULL /* FP unavailable */ -#define FSCR_IC_VSX 0x0100000000000000ULL /* VSX unavailable */ -#define FSCR_IC_DSCR 0x0200000000000000ULL /* Access to the DSCR at SPRs 3 or 17 */ -#define FSCR_IC_PM 0x0300000000000000ULL /* Read or write access of a Performance Monitor SPR in group A */ -#define FSCR_IC_BHRB 0x0400000000000000ULL /* Execution of a BHRB Instruction */ -#define FSCR_IC_HTM 0x0500000000000000ULL /* Access to a Transactional Memory */ +#define FSCR_IC_MASK 0xFF00000000000000ULL /* FSCR[0:7] is Interrupt Cause */ +#define FSCR_IC_FP 0x0000000000000000ULL /* FP unavailable */ +#define FSCR_IC_VSX 0x0100000000000000ULL /* VSX unavailable */ +#define FSCR_IC_DSCR 0x0200000000000000ULL /* Access to the DSCR at SPRs 3 or 17 */ +#define FSCR_IC_PM 0x0300000000000000ULL /* Read or write access of a Performance Monitor SPR in group A */ +#define FSCR_IC_BHRB 0x0400000000000000ULL /* Execution of a BHRB Instruction */ +#define FSCR_IC_HTM 0x0500000000000000ULL /* Access to a Transactional Memory */ /* Reserved 0x0600000000000000ULL */ -#define FSCR_IC_EBB 0x0700000000000000ULL /* Access to Event-Based Branch */ -#define FSCR_IC_TAR 0x0800000000000000ULL /* Access to Target Address Register */ -#define FSCR_IC_STOP 0x0900000000000000ULL /* Access to the 'stop' instruction in privileged non-hypervisor state */ -#define FSCR_IC_MSG 0x0A00000000000000ULL /* Access to 'msgsndp' or 'msgclrp' instructions */ -#define FSCR_IC_SCV 0x0C00000000000000ULL /* Execution of a 'scv' instruction */ +#define FSCR_IC_EBB 0x0700000000000000ULL /* Access to Event-Based Branch */ +#define FSCR_IC_TAR 0x0800000000000000ULL /* Access to Target Address Register */ +#define FSCR_IC_STOP 0x0900000000000000ULL /* Access to the 'stop' instruction in privileged non-hypervisor state */ +#define FSCR_IC_MSG 0x0A00000000000000ULL /* Access to 'msgsndp' or 'msgclrp' instructions */ +#define FSCR_IC_LM 0x0A00000000000000ULL /* Access to load monitored facility */ +#define FSCR_IC_SCV 0x0C00000000000000ULL /* Execution of a 'scv' instruction */ +#define FSCR_SCV 0x0000000000001000 /* scv instruction available */ +#define FSCR_LM 0x0000000000000800 /* Load monitored facilities available */ +#define FSCR_MSGP 0x0000000000000400 /* msgsndp and SPRs available */ +#define FSCR_TAR 0x0000000000000100 /* TAR register available */ +#define FSCR_EBB 0x0000000000000080 /* Event-based branch available */ +#define FSCR_DSCR 0x0000000000000004 /* DSCR available in PR state */ +#define SPR_DPDES 0x0b0 /* .6. Directed Privileged Doorbell Exception State Register */ #define SPR_USPRG0 0x100 /* 4.. User SPR General 0 */ #define SPR_VRSAVE 0x100 /* .6. 
AltiVec VRSAVE */ #define SPR_SPRG0 0x110 /* 468 SPR General 0 */ #define SPR_SPRG1 0x111 /* 468 SPR General 1 */ #define SPR_SPRG2 0x112 /* 468 SPR General 2 */ #define SPR_SPRG3 0x113 /* 468 SPR General 3 */ #define SPR_SPRG4 0x114 /* 4.. SPR General 4 */ #define SPR_SPRG5 0x115 /* 4.. SPR General 5 */ #define SPR_SPRG6 0x116 /* 4.. SPR General 6 */ #define SPR_SPRG7 0x117 /* 4.. SPR General 7 */ #define SPR_SCOMC 0x114 /* ... SCOM Address Register (970) */ #define SPR_SCOMD 0x115 /* ... SCOM Data Register (970) */ #define SPR_ASR 0x118 /* ... Address Space Register (PPC64) */ #define SPR_EAR 0x11a /* .68 External Access Register */ #define SPR_PVR 0x11f /* 468 Processor Version Register */ #define MPC601 0x0001 #define MPC603 0x0003 #define MPC604 0x0004 #define MPC602 0x0005 #define MPC603e 0x0006 #define MPC603ev 0x0007 #define MPC750 0x0008 #define MPC750CL 0x7000 /* Nintendo Wii's Broadway */ #define MPC604ev 0x0009 #define MPC7400 0x000c #define MPC620 0x0014 #define IBM403 0x0020 #define IBM401A1 0x0021 #define IBM401B2 0x0022 #define IBM401C2 0x0023 #define IBM401D2 0x0024 #define IBM401E2 0x0025 #define IBM401F2 0x0026 #define IBM401G2 0x0027 #define IBMRS64II 0x0033 #define IBMRS64III 0x0034 #define IBMPOWER4 0x0035 #define IBMRS64III_2 0x0036 #define IBMRS64IV 0x0037 #define IBMPOWER4PLUS 0x0038 #define IBM970 0x0039 #define IBMPOWER5 0x003a #define IBMPOWER5PLUS 0x003b #define IBM970FX 0x003c #define IBMPOWER6 0x003e #define IBMPOWER7 0x003f #define IBMPOWER3 0x0040 #define IBMPOWER3PLUS 0x0041 #define IBM970MP 0x0044 #define IBM970GX 0x0045 #define IBMPOWERPCA2 0x0049 #define IBMPOWER7PLUS 0x004a #define IBMPOWER8E 0x004b +#define IBMPOWER8NVL 0x004c #define IBMPOWER8 0x004d #define IBMPOWER9 0x004e #define MPC860 0x0050 #define IBMCELLBE 0x0070 #define MPC8240 0x0081 #define PA6T 0x0090 #define IBM405GP 0x4011 #define IBM405L 0x4161 #define IBM750FX 0x7000 #define MPC745X_P(v) ((v & 0xFFF8) == 0x8000) #define MPC7450 0x8000 #define MPC7455 0x8001 #define MPC7457 0x8002 #define MPC7447A 0x8003 #define MPC7448 0x8004 #define MPC7410 0x800c #define MPC8245 0x8081 #define FSL_E500v1 0x8020 #define FSL_E500v2 0x8021 #define FSL_E500mc 0x8023 #define FSL_E5500 0x8024 #define FSL_E6500 0x8040 #define FSL_E300C1 0x8083 #define FSL_E300C2 0x8084 #define FSL_E300C3 0x8085 #define FSL_E300C4 0x8086 #define LPCR_PECE_WAKESET (LPCR_PECE_EXT | LPCR_PECE_DECR | LPCR_PECE_ME) #define SPR_EPCR 0x133 #define EPCR_EXTGS 0x80000000 #define EPCR_DTLBGS 0x40000000 #define EPCR_ITLBGS 0x20000000 #define EPCR_DSIGS 0x10000000 #define EPCR_ISIGS 0x08000000 #define EPCR_DUVGS 0x04000000 #define EPCR_ICM 0x02000000 #define EPCR_GICMGS 0x01000000 #define EPCR_DGTMI 0x00800000 #define EPCR_DMIUH 0x00400000 #define EPCR_PMGS 0x00200000 #define SPR_HSRR0 0x13a #define SPR_HSRR1 0x13b #define SPR_LPCR 0x13e /* Logical Partitioning Control */ #define LPCR_LPES 0x008 /* Bit 60 */ #define LPCR_HVICE 0x002 /* Hypervisor Virtualization Interrupt (Arch 3.0) */ #define LPCR_PECE_DRBL (1ULL << 16) /* Directed Privileged Doorbell */ #define LPCR_PECE_HDRBL (1ULL << 15) /* Directed Hypervisor Doorbell */ #define LPCR_PECE_EXT (1ULL << 14) /* External exceptions */ #define LPCR_PECE_DECR (1ULL << 13) /* Decrementer exceptions */ #define LPCR_PECE_ME (1ULL << 12) /* Machine Check and Hypervisor */ /* Maintenance exceptions */ #define SPR_LPID 0x13f /* Logical Partitioning Control */ #define SPR_HMER 0x150 /* Hypervisor Maintenance Exception Register */ #define SPR_HMEER 0x151 /* Hypervisor Maintenance Exception 
Enable Register */ +#define SPR_TIR 0x1be /* .6. Thread Identification Register */ #define SPR_PTCR 0x1d0 /* Partition Table Control Register */ #define SPR_SPEFSCR 0x200 /* ..8 Signal Processing Engine FSCR. */ #define SPEFSCR_SOVH 0x80000000 #define SPEFSCR_OVH 0x40000000 #define SPEFSCR_FGH 0x20000000 #define SPEFSCR_FXH 0x10000000 #define SPEFSCR_FINVH 0x08000000 #define SPEFSCR_FDBZH 0x04000000 #define SPEFSCR_FUNFH 0x02000000 #define SPEFSCR_FOVFH 0x01000000 #define SPEFSCR_FINXS 0x00200000 #define SPEFSCR_FINVS 0x00100000 #define SPEFSCR_FDBZS 0x00080000 #define SPEFSCR_FUNFS 0x00040000 #define SPEFSCR_FOVFS 0x00020000 #define SPEFSCR_SOV 0x00008000 #define SPEFSCR_OV 0x00004000 #define SPEFSCR_FG 0x00002000 #define SPEFSCR_FX 0x00001000 #define SPEFSCR_FINV 0x00000800 #define SPEFSCR_FDBZ 0x00000400 #define SPEFSCR_FUNF 0x00000200 #define SPEFSCR_FOVF 0x00000100 #define SPEFSCR_FINXE 0x00000040 #define SPEFSCR_FINVE 0x00000020 #define SPEFSCR_FDBZE 0x00000010 #define SPEFSCR_FUNFE 0x00000008 #define SPEFSCR_FOVFE 0x00000004 #define SPEFSCR_FRMC_M 0x00000003 #define SPR_IBAT0U 0x210 /* .6. Instruction BAT Reg 0 Upper */ #define SPR_IBAT0L 0x211 /* .6. Instruction BAT Reg 0 Lower */ #define SPR_IBAT1U 0x212 /* .6. Instruction BAT Reg 1 Upper */ #define SPR_IBAT1L 0x213 /* .6. Instruction BAT Reg 1 Lower */ #define SPR_IBAT2U 0x214 /* .6. Instruction BAT Reg 2 Upper */ #define SPR_IBAT2L 0x215 /* .6. Instruction BAT Reg 2 Lower */ #define SPR_IBAT3U 0x216 /* .6. Instruction BAT Reg 3 Upper */ #define SPR_IBAT3L 0x217 /* .6. Instruction BAT Reg 3 Lower */ #define SPR_DBAT0U 0x218 /* .6. Data BAT Reg 0 Upper */ #define SPR_DBAT0L 0x219 /* .6. Data BAT Reg 0 Lower */ #define SPR_DBAT1U 0x21a /* .6. Data BAT Reg 1 Upper */ #define SPR_DBAT1L 0x21b /* .6. Data BAT Reg 1 Lower */ #define SPR_DBAT2U 0x21c /* .6. Data BAT Reg 2 Upper */ #define SPR_DBAT2L 0x21d /* .6. Data BAT Reg 2 Lower */ #define SPR_DBAT3U 0x21e /* .6. Data BAT Reg 3 Upper */ #define SPR_DBAT3L 0x21f /* .6. Data BAT Reg 3 Lower */ #define SPR_IC_CST 0x230 /* ..8 Instruction Cache CSR */ #define IC_CST_IEN 0x80000000 /* I cache is ENabled (RO) */ #define IC_CST_CMD_INVALL 0x0c000000 /* I cache invalidate all */ #define IC_CST_CMD_UNLOCKALL 0x0a000000 /* I cache unlock all */ #define IC_CST_CMD_UNLOCK 0x08000000 /* I cache unlock block */ #define IC_CST_CMD_LOADLOCK 0x06000000 /* I cache load & lock block */ #define IC_CST_CMD_DISABLE 0x04000000 /* I cache disable */ #define IC_CST_CMD_ENABLE 0x02000000 /* I cache enable */ #define IC_CST_CCER1 0x00200000 /* I cache error type 1 (RO) */ #define IC_CST_CCER2 0x00100000 /* I cache error type 2 (RO) */ #define IC_CST_CCER3 0x00080000 /* I cache error type 3 (RO) */ #define SPR_IBAT4U 0x230 /* .6. Instruction BAT Reg 4 Upper */ #define SPR_IC_ADR 0x231 /* ..8 Instruction Cache Address */ #define SPR_IBAT4L 0x231 /* .6. Instruction BAT Reg 4 Lower */ #define SPR_IC_DAT 0x232 /* ..8 Instruction Cache Data */ #define SPR_IBAT5U 0x232 /* .6. Instruction BAT Reg 5 Upper */ #define SPR_IBAT5L 0x233 /* .6. Instruction BAT Reg 5 Lower */ #define SPR_IBAT6U 0x234 /* .6. Instruction BAT Reg 6 Upper */ #define SPR_IBAT6L 0x235 /* .6. Instruction BAT Reg 6 Lower */ #define SPR_IBAT7U 0x236 /* .6. Instruction BAT Reg 7 Upper */ #define SPR_IBAT7L 0x237 /* .6. 
Instruction BAT Reg 7 Lower */ #define SPR_DC_CST 0x230 /* ..8 Data Cache CSR */ #define DC_CST_DEN 0x80000000 /* D cache ENabled (RO) */ #define DC_CST_DFWT 0x40000000 /* D cache Force Write-Thru (RO) */ #define DC_CST_LES 0x20000000 /* D cache Little Endian Swap (RO) */ #define DC_CST_CMD_FLUSH 0x0e000000 /* D cache invalidate all */ #define DC_CST_CMD_INVALL 0x0c000000 /* D cache invalidate all */ #define DC_CST_CMD_UNLOCKALL 0x0a000000 /* D cache unlock all */ #define DC_CST_CMD_UNLOCK 0x08000000 /* D cache unlock block */ #define DC_CST_CMD_CLRLESWAP 0x07000000 /* D cache clr little-endian swap */ #define DC_CST_CMD_LOADLOCK 0x06000000 /* D cache load & lock block */ #define DC_CST_CMD_SETLESWAP 0x05000000 /* D cache set little-endian swap */ #define DC_CST_CMD_DISABLE 0x04000000 /* D cache disable */ #define DC_CST_CMD_CLRFWT 0x03000000 /* D cache clear forced write-thru */ #define DC_CST_CMD_ENABLE 0x02000000 /* D cache enable */ #define DC_CST_CMD_SETFWT 0x01000000 /* D cache set forced write-thru */ #define DC_CST_CCER1 0x00200000 /* D cache error type 1 (RO) */ #define DC_CST_CCER2 0x00100000 /* D cache error type 2 (RO) */ #define DC_CST_CCER3 0x00080000 /* D cache error type 3 (RO) */ #define SPR_DBAT4U 0x238 /* .6. Data BAT Reg 4 Upper */ #define SPR_DC_ADR 0x231 /* ..8 Data Cache Address */ #define SPR_DBAT4L 0x239 /* .6. Data BAT Reg 4 Lower */ #define SPR_DC_DAT 0x232 /* ..8 Data Cache Data */ #define SPR_DBAT5U 0x23a /* .6. Data BAT Reg 5 Upper */ #define SPR_DBAT5L 0x23b /* .6. Data BAT Reg 5 Lower */ #define SPR_DBAT6U 0x23c /* .6. Data BAT Reg 6 Upper */ #define SPR_DBAT6L 0x23d /* .6. Data BAT Reg 6 Lower */ #define SPR_DBAT7U 0x23e /* .6. Data BAT Reg 7 Upper */ #define SPR_DBAT7L 0x23f /* .6. Data BAT Reg 7 Lower */ #define SPR_SPRG8 0x25c /* ..8 SPR General 8 */ #define SPR_MI_CTR 0x310 /* ..8 IMMU control */ #define Mx_CTR_GPM 0x80000000 /* Group Protection Mode */ #define Mx_CTR_PPM 0x40000000 /* Page Protection Mode */ #define Mx_CTR_CIDEF 0x20000000 /* Cache-Inhibit DEFault */ #define MD_CTR_WTDEF 0x20000000 /* Write-Through DEFault */ #define Mx_CTR_RSV4 0x08000000 /* Reserve 4 TLB entries */ #define MD_CTR_TWAM 0x04000000 /* TableWalk Assist Mode */ #define Mx_CTR_PPCS 0x02000000 /* Priv/user state compare mode */ #define Mx_CTR_TLB_INDX 0x000001f0 /* TLB index mask */ #define Mx_CTR_TLB_INDX_BITPOS 8 /* TLB index shift */ #define SPR_MI_AP 0x312 /* ..8 IMMU access protection */ #define Mx_GP_SUPER(n) (0 << (2*(15-(n)))) /* access is supervisor */ #define Mx_GP_PAGE (1 << (2*(15-(n)))) /* access is page protect */ #define Mx_GP_SWAPPED (2 << (2*(15-(n)))) /* access is swapped */ #define Mx_GP_USER (3 << (2*(15-(n)))) /* access is user */ #define SPR_MI_EPN 0x313 /* ..8 IMMU effective number */ #define Mx_EPN_EPN 0xfffff000 /* Effective Page Number mask */ #define Mx_EPN_EV 0x00000020 /* Entry Valid */ #define Mx_EPN_ASID 0x0000000f /* Address Space ID */ #define SPR_MI_TWC 0x315 /* ..8 IMMU tablewalk control */ #define MD_TWC_L2TB 0xfffff000 /* Level-2 Tablewalk Base */ #define Mx_TWC_APG 0x000001e0 /* Access Protection Group */ #define Mx_TWC_G 0x00000010 /* Guarded memory */ #define Mx_TWC_PS 0x0000000c /* Page Size (L1) */ #define MD_TWC_WT 0x00000002 /* Write-Through */ #define Mx_TWC_V 0x00000001 /* Entry Valid */ #define SPR_MI_RPN 0x316 /* ..8 IMMU real (phys) page number */ #define Mx_RPN_RPN 0xfffff000 /* Real Page Number */ #define Mx_RPN_PP 0x00000ff0 /* Page Protection */ #define Mx_RPN_SPS 0x00000008 /* Small Page Size */ #define Mx_RPN_SH 
0x00000004 /* SHared page */ #define Mx_RPN_CI 0x00000002 /* Cache Inhibit */ #define Mx_RPN_V 0x00000001 /* Valid */ #define SPR_MD_CTR 0x318 /* ..8 DMMU control */ #define SPR_M_CASID 0x319 /* ..8 CASID */ #define M_CASID 0x0000000f /* Current AS Id */ #define SPR_MD_AP 0x31a /* ..8 DMMU access protection */ #define SPR_MD_EPN 0x31b /* ..8 DMMU effective number */ #define SPR_970MMCR0 0x31b /* ... Monitor Mode Control Register 0 (PPC 970) */ #define SPR_970MMCR0_PMC1SEL(x) ((x) << 8) /* PMC1 selector (970) */ #define SPR_970MMCR0_PMC2SEL(x) ((x) << 1) /* PMC2 selector (970) */ #define SPR_970MMCR1 0x31e /* ... Monitor Mode Control Register 1 (PPC 970) */ #define SPR_970MMCR1_PMC3SEL(x) (((x) & 0x1f) << 27) /* PMC 3 selector */ #define SPR_970MMCR1_PMC4SEL(x) (((x) & 0x1f) << 22) /* PMC 4 selector */ #define SPR_970MMCR1_PMC5SEL(x) (((x) & 0x1f) << 17) /* PMC 5 selector */ #define SPR_970MMCR1_PMC6SEL(x) (((x) & 0x1f) << 12) /* PMC 6 selector */ #define SPR_970MMCR1_PMC7SEL(x) (((x) & 0x1f) << 7) /* PMC 7 selector */ #define SPR_970MMCR1_PMC8SEL(x) (((x) & 0x1f) << 2) /* PMC 8 selector */ #define SPR_970MMCRA 0x312 /* ... Monitor Mode Control Register 2 (PPC 970) */ #define SPR_970PMC1 0x313 /* ... PMC 1 */ #define SPR_970PMC2 0x314 /* ... PMC 2 */ #define SPR_970PMC3 0x315 /* ... PMC 3 */ #define SPR_970PMC4 0x316 /* ... PMC 4 */ #define SPR_970PMC5 0x317 /* ... PMC 5 */ #define SPR_970PMC6 0x318 /* ... PMC 6 */ #define SPR_970PMC7 0x319 /* ... PMC 7 */ #define SPR_970PMC8 0x31a /* ... PMC 8 */ #define SPR_M_TWB 0x31c /* ..8 MMU tablewalk base */ #define M_TWB_L1TB 0xfffff000 /* level-1 translation base */ #define M_TWB_L1INDX 0x00000ffc /* level-1 index */ #define SPR_MD_TWC 0x31d /* ..8 DMMU tablewalk control */ #define SPR_MD_RPN 0x31e /* ..8 DMMU real (phys) page number */ #define SPR_MD_TW 0x31f /* ..8 MMU tablewalk scratch */ +#define SPR_BESCRS 0x320 /* .6. Branch Event Status and Control Set Register */ +#define SPR_BESCRSU 0x321 /* .6. Branch Event Status and Control Set Register (upper 32-bit) */ +#define SPR_BESCRR 0x322 /* .6. Branch Event Status and Control Reset Register */ +#define SPR_BESCRRU 0x323 /* .6. Branch Event Status and Control Register (upper 32-bit) */ +#define SPR_EBBHR 0x324 /* .6. Event-based Branch Handler Register */ +#define SPR_EBBRR 0x325 /* .6. Event-based Branch Return Register */ +#define SPR_BESCR 0x326 /* .6. Branch Event Status and Control Register */ +#define SPR_LMRR 0x32d /* .6. Load Monitored Region Register */ +#define SPR_LMSER 0x32e /* .6. Load Monitored Section Enable Register */ +#define SPR_TAR 0x32f /* .6. 
Branch Target Address Register */ #define SPR_MI_CAM 0x330 /* ..8 IMMU CAM entry read */ #define SPR_MI_RAM0 0x331 /* ..8 IMMU RAM entry read reg 0 */ #define SPR_MI_RAM1 0x332 /* ..8 IMMU RAM entry read reg 1 */ #define SPR_MD_CAM 0x338 /* ..8 IMMU CAM entry read */ #define SPR_MD_RAM0 0x339 /* ..8 IMMU RAM entry read reg 0 */ #define SPR_MD_RAM1 0x33a /* ..8 IMMU RAM entry read reg 1 */ #define SPR_PSSCR 0x357 /* Processor Stop Status and Control Register (ISA 3.0) */ #define PSSCR_PLS_S 60 #define PSSCR_PLS_M (0xf << PSSCR_PLS_S) #define PSSCR_SD (1 << 22) #define PSSCR_ESL (1 << 21) #define PSSCR_EC (1 << 20) #define PSSCR_PSLL_S 16 #define PSSCR_PSLL_M (0xf << PSSCR_PSLL_S) #define PSSCR_TR_S 8 #define PSSCR_TR_M (0x3 << PSSCR_TR_S) #define PSSCR_MTL_S 4 #define PSSCR_MTL_M (0xf << PSSCR_MTL_S) #define PSSCR_RL_S 0 #define PSSCR_RL_M (0xf << PSSCR_RL_S) #define SPR_PMCR 0x374 /* Processor Management Control Register */ #define SPR_UMMCR2 0x3a0 /* .6. User Monitor Mode Control Register 2 */ #define SPR_UMMCR0 0x3a8 /* .6. User Monitor Mode Control Register 0 */ #define SPR_USIA 0x3ab /* .6. User Sampled Instruction Address */ #define SPR_UMMCR1 0x3ac /* .6. User Monitor Mode Control Register 1 */ #define SPR_ZPR 0x3b0 /* 4.. Zone Protection Register */ #define SPR_MMCR2 0x3b0 /* .6. Monitor Mode Control Register 2 */ #define SPR_MMCR2_THRESHMULT_32 0x80000000 /* Multiply MMCR0 threshold by 32 */ #define SPR_MMCR2_THRESHMULT_2 0x00000000 /* Multiply MMCR0 threshold by 2 */ #define SPR_PID 0x3b1 /* 4.. Process ID */ #define SPR_PMC5 0x3b1 /* .6. Performance Counter Register 5 */ #define SPR_PMC6 0x3b2 /* .6. Performance Counter Register 6 */ #define SPR_CCR0 0x3b3 /* 4.. Core Configuration Register 0 */ #define SPR_IAC3 0x3b4 /* 4.. Instruction Address Compare 3 */ #define SPR_IAC4 0x3b5 /* 4.. Instruction Address Compare 4 */ #define SPR_DVC1 0x3b6 /* 4.. Data Value Compare 1 */ #define SPR_DVC2 0x3b7 /* 4.. Data Value Compare 2 */ #define SPR_MMCR0 0x3b8 /* .6. Monitor Mode Control Register 0 */ #define SPR_MMCR0_FC 0x80000000 /* Freeze counters */ #define SPR_MMCR0_FCS 0x40000000 /* Freeze counters in supervisor mode */ #define SPR_MMCR0_FCP 0x20000000 /* Freeze counters in user mode */ #define SPR_MMCR0_FCM1 0x10000000 /* Freeze counters when mark=1 */ #define SPR_MMCR0_FCM0 0x08000000 /* Freeze counters when mark=0 */ #define SPR_MMCR0_PMXE 0x04000000 /* Enable PM interrupt */ #define SPR_MMCR0_FCECE 0x02000000 /* Freeze counters after event */ #define SPR_MMCR0_TBSEL_15 0x01800000 /* Count bit 15 of TBL */ #define SPR_MMCR0_TBSEL_19 0x01000000 /* Count bit 19 of TBL */ #define SPR_MMCR0_TBSEL_23 0x00800000 /* Count bit 23 of TBL */ #define SPR_MMCR0_TBSEL_31 0x00000000 /* Count bit 31 of TBL */ #define SPR_MMCR0_TBEE 0x00400000 /* Time-base event enable */ #define SPR_MMCRO_THRESHOLD(x) ((x) << 16) /* Threshold value */ #define SPR_MMCR0_PMC1CE 0x00008000 /* PMC1 condition enable */ #define SPR_MMCR0_PMCNCE 0x00004000 /* PMCn condition enable */ #define SPR_MMCR0_TRIGGER 0x00002000 /* Trigger */ #define SPR_MMCR0_PMC1SEL(x) (((x) & 0x3f) << 6) /* PMC1 selector */ #define SPR_MMCR0_PMC2SEL(x) (((x) & 0x3f) << 0) /* PMC2 selector */ #define SPR_SGR 0x3b9 /* 4.. Storage Guarded Register */ #define SPR_PMC1 0x3b9 /* .6. Performance Counter Register 1 */ #define SPR_DCWR 0x3ba /* 4.. Data Cache Write-through Register */ #define SPR_PMC2 0x3ba /* .6. Performance Counter Register 2 */ #define SPR_SLER 0x3bb /* 4.. Storage Little Endian Register */ #define SPR_SIA 0x3bb /* .6. 
Sampled Instruction Address */ #define SPR_MMCR1 0x3bc /* .6. Monitor Mode Control Register 2 */ #define SPR_MMCR1_PMC3SEL(x) (((x) & 0x1f) << 27) /* PMC 3 selector */ #define SPR_MMCR1_PMC4SEL(x) (((x) & 0x1f) << 22) /* PMC 4 selector */ #define SPR_MMCR1_PMC5SEL(x) (((x) & 0x1f) << 17) /* PMC 5 selector */ #define SPR_MMCR1_PMC6SEL(x) (((x) & 0x3f) << 11) /* PMC 6 selector */ #define SPR_SU0R 0x3bc /* 4.. Storage User-defined 0 Register */ #define SPR_PMC3 0x3bd /* .6. Performance Counter Register 3 */ #define SPR_PMC4 0x3be /* .6. Performance Counter Register 4 */ #define SPR_DMISS 0x3d0 /* .68 Data TLB Miss Address Register */ #define SPR_DCMP 0x3d1 /* .68 Data TLB Compare Register */ #define SPR_HASH1 0x3d2 /* .68 Primary Hash Address Register */ #define SPR_ICDBDR 0x3d3 /* 4.. Instruction Cache Debug Data Register */ #define SPR_HASH2 0x3d3 /* .68 Secondary Hash Address Register */ #define SPR_IMISS 0x3d4 /* .68 Instruction TLB Miss Address Register */ #define SPR_TLBMISS 0x3d4 /* .6. TLB Miss Address Register */ #define SPR_DEAR 0x3d5 /* 4.. Data Error Address Register */ #define SPR_ICMP 0x3d5 /* .68 Instruction TLB Compare Register */ #define SPR_PTEHI 0x3d5 /* .6. Instruction TLB Compare Register */ #define SPR_EVPR 0x3d6 /* 4.. Exception Vector Prefix Register */ #define SPR_RPA 0x3d6 /* .68 Required Physical Address Register */ #define SPR_PTELO 0x3d6 /* .6. Required Physical Address Register */ #define SPR_TSR 0x150 /* ..8 Timer Status Register */ #define SPR_TCR 0x154 /* ..8 Timer Control Register */ #define TSR_ENW 0x80000000 /* Enable Next Watchdog */ #define TSR_WIS 0x40000000 /* Watchdog Interrupt Status */ #define TSR_WRS_MASK 0x30000000 /* Watchdog Reset Status */ #define TSR_WRS_NONE 0x00000000 /* No watchdog reset has occurred */ #define TSR_WRS_CORE 0x10000000 /* Core reset was forced by the watchdog */ #define TSR_WRS_CHIP 0x20000000 /* Chip reset was forced by the watchdog */ #define TSR_WRS_SYSTEM 0x30000000 /* System reset was forced by the watchdog */ #define TSR_PIS 0x08000000 /* PIT Interrupt Status */ #define TSR_DIS 0x08000000 /* Decrementer Interrupt Status */ #define TSR_FIS 0x04000000 /* FIT Interrupt Status */ #define TCR_WP_MASK 0xc0000000 /* Watchdog Period mask */ #define TCR_WP_2_17 0x00000000 /* 2**17 clocks */ #define TCR_WP_2_21 0x40000000 /* 2**21 clocks */ #define TCR_WP_2_25 0x80000000 /* 2**25 clocks */ #define TCR_WP_2_29 0xc0000000 /* 2**29 clocks */ #define TCR_WRC_MASK 0x30000000 /* Watchdog Reset Control mask */ #define TCR_WRC_NONE 0x00000000 /* No watchdog reset */ #define TCR_WRC_CORE 0x10000000 /* Core reset */ #define TCR_WRC_CHIP 0x20000000 /* Chip reset */ #define TCR_WRC_SYSTEM 0x30000000 /* System reset */ #define TCR_WIE 0x08000000 /* Watchdog Interrupt Enable */ #define TCR_PIE 0x04000000 /* PIT Interrupt Enable */ #define TCR_DIE 0x04000000 /* Pecrementer Interrupt Enable */ #define TCR_FP_MASK 0x03000000 /* FIT Period */ #define TCR_FP_2_9 0x00000000 /* 2**9 clocks */ #define TCR_FP_2_13 0x01000000 /* 2**13 clocks */ #define TCR_FP_2_17 0x02000000 /* 2**17 clocks */ #define TCR_FP_2_21 0x03000000 /* 2**21 clocks */ #define TCR_FIE 0x00800000 /* FIT Interrupt Enable */ #define TCR_ARE 0x00400000 /* Auto Reload Enable */ #define SPR_PIT 0x3db /* 4.. Programmable Interval Timer */ #define SPR_SRR2 0x3de /* 4.. Save/Restore Register 2 */ #define SPR_SRR3 0x3df /* 4.. 
Save/Restore Register 3 */ #define SPR_HID0 0x3f0 /* ..8 Hardware Implementation Register 0 */ #define SPR_HID1 0x3f1 /* ..8 Hardware Implementation Register 1 */ #define SPR_HID2 0x3f3 /* ..8 Hardware Implementation Register 2 */ #define SPR_HID4 0x3f4 /* ..8 Hardware Implementation Register 4 */ #define SPR_HID5 0x3f6 /* ..8 Hardware Implementation Register 5 */ #define SPR_HID6 0x3f9 /* ..8 Hardware Implementation Register 6 */ #define SPR_CELL_TSRL 0x380 /* ... Cell BE Thread Status Register */ #define SPR_CELL_TSCR 0x399 /* ... Cell BE Thread Switch Register */ #if defined(AIM) #define SPR_DBSR 0x3f0 /* 4.. Debug Status Register */ #define DBSR_IC 0x80000000 /* Instruction completion debug event */ #define DBSR_BT 0x40000000 /* Branch Taken debug event */ #define DBSR_EDE 0x20000000 /* Exception debug event */ #define DBSR_TIE 0x10000000 /* Trap Instruction debug event */ #define DBSR_UDE 0x08000000 /* Unconditional debug event */ #define DBSR_IA1 0x04000000 /* IAC1 debug event */ #define DBSR_IA2 0x02000000 /* IAC2 debug event */ #define DBSR_DR1 0x01000000 /* DAC1 Read debug event */ #define DBSR_DW1 0x00800000 /* DAC1 Write debug event */ #define DBSR_DR2 0x00400000 /* DAC2 Read debug event */ #define DBSR_DW2 0x00200000 /* DAC2 Write debug event */ #define DBSR_IDE 0x00100000 /* Imprecise debug event */ #define DBSR_IA3 0x00080000 /* IAC3 debug event */ #define DBSR_IA4 0x00040000 /* IAC4 debug event */ #define DBSR_MRR 0x00000300 /* Most recent reset */ #define SPR_DBCR0 0x3f2 /* 4.. Debug Control Register 0 */ #define SPR_DBCR1 0x3bd /* 4.. Debug Control Register 1 */ #define SPR_IAC1 0x3f4 /* 4.. Instruction Address Compare 1 */ #define SPR_IAC2 0x3f5 /* 4.. Instruction Address Compare 2 */ #define SPR_DAC1 0x3f6 /* 4.. Data Address Compare 1 */ #define SPR_DAC2 0x3f7 /* 4.. Data Address Compare 2 */ #define SPR_PIR 0x3ff /* .6. Processor Identification Register */ #elif defined(BOOKE) #define SPR_PIR 0x11e /* ..8 Processor Identification Register */ #define SPR_DBSR 0x130 /* ..8 Debug Status Register */ #define DBSR_IDE 0x80000000 /* Imprecise debug event. */ #define DBSR_UDE 0x40000000 /* Unconditional debug event. */ #define DBSR_MRR 0x30000000 /* Most recent Reset (mask). */ #define DBSR_ICMP 0x08000000 /* Instr. complete debug event. */ #define DBSR_BRT 0x04000000 /* Branch taken debug event. */ #define DBSR_IRPT 0x02000000 /* Interrupt taken debug event. */ #define DBSR_TRAP 0x01000000 /* Trap instr. debug event. */ #define DBSR_IAC1 0x00800000 /* Instr. address compare #1. */ #define DBSR_IAC2 0x00400000 /* Instr. address compare #2. */ #define DBSR_IAC3 0x00200000 /* Instr. address compare #3. */ #define DBSR_IAC4 0x00100000 /* Instr. address compare #4. */ #define DBSR_DAC1R 0x00080000 /* Data addr. read compare #1. */ #define DBSR_DAC1W 0x00040000 /* Data addr. write compare #1. */ #define DBSR_DAC2R 0x00020000 /* Data addr. read compare #2. */ #define DBSR_DAC2W 0x00010000 /* Data addr. write compare #2. */ #define DBSR_RET 0x00008000 /* Return debug event. 
*/ #define SPR_DBCR0 0x134 /* ..8 Debug Control Register 0 */ #define SPR_DBCR1 0x135 /* ..8 Debug Control Register 1 */ #define SPR_IAC1 0x138 /* ..8 Instruction Address Compare 1 */ #define SPR_IAC2 0x139 /* ..8 Instruction Address Compare 2 */ #define SPR_DAC1 0x13c /* ..8 Data Address Compare 1 */ #define SPR_DAC2 0x13d /* ..8 Data Address Compare 2 */ #endif #define DBCR0_EDM 0x80000000 /* External Debug Mode */ #define DBCR0_IDM 0x40000000 /* Internal Debug Mode */ #define DBCR0_RST_MASK 0x30000000 /* ReSeT */ #define DBCR0_RST_NONE 0x00000000 /* No action */ #define DBCR0_RST_CORE 0x10000000 /* Core reset */ #define DBCR0_RST_CHIP 0x20000000 /* Chip reset */ #define DBCR0_RST_SYSTEM 0x30000000 /* System reset */ #define DBCR0_IC 0x08000000 /* Instruction Completion debug event */ #define DBCR0_BT 0x04000000 /* Branch Taken debug event */ #define DBCR0_EDE 0x02000000 /* Exception Debug Event */ #define DBCR0_TDE 0x01000000 /* Trap Debug Event */ #define DBCR0_IA1 0x00800000 /* IAC (Instruction Address Compare) 1 debug event */ #define DBCR0_IA2 0x00400000 /* IAC 2 debug event */ #define DBCR0_IA12 0x00200000 /* Instruction Address Range Compare 1-2 */ #define DBCR0_IA12X 0x00100000 /* IA12 eXclusive */ #define DBCR0_IA3 0x00080000 /* IAC 3 debug event */ #define DBCR0_IA4 0x00040000 /* IAC 4 debug event */ #define DBCR0_IA34 0x00020000 /* Instruction Address Range Compare 3-4 */ #define DBCR0_IA34X 0x00010000 /* IA34 eXclusive */ #define DBCR0_IA12T 0x00008000 /* Instruction Address Range Compare 1-2 range Toggle */ #define DBCR0_IA34T 0x00004000 /* Instruction Address Range Compare 3-4 range Toggle */ #define DBCR0_FT 0x00000001 /* Freeze Timers on debug event */ #define SPR_IABR 0x3f2 /* ..8 Instruction Address Breakpoint Register 0 */ #define SPR_DABR 0x3f5 /* .6. Data Address Breakpoint Register */ #define SPR_MSSCR0 0x3f6 /* .6. Memory SubSystem Control Register */ #define MSSCR0_SHDEN 0x80000000 /* 0: Shared-state enable */ #define MSSCR0_SHDPEN3 0x40000000 /* 1: ~SHD[01] signal enable in MEI mode */ #define MSSCR0_L1INTVEN 0x38000000 /* 2-4: L1 data cache ~HIT intervention enable */ #define MSSCR0_L2INTVEN 0x07000000 /* 5-7: L2 data cache ~HIT intervention enable*/ #define MSSCR0_DL1HWF 0x00800000 /* 8: L1 data cache hardware flush */ #define MSSCR0_MBO 0x00400000 /* 9: must be one */ #define MSSCR0_EMODE 0x00200000 /* 10: MPX bus mode (read-only) */ #define MSSCR0_ABD 0x00100000 /* 11: address bus driven (read-only) */ #define MSSCR0_MBZ 0x000fffff /* 12-31: must be zero */ #define MSSCR0_L2PFE 0x00000003 /* 30-31: L2 prefetch enable */ #define SPR_MSSSR0 0x3f7 /* .6. Memory Subsystem Status Register (MPC745x) */ #define MSSSR0_L2TAG 0x00040000 /* 13: L2 tag parity error */ #define MSSSR0_L2DAT 0x00020000 /* 14: L2 data parity error */ #define MSSSR0_L3TAG 0x00010000 /* 15: L3 tag parity error */ #define MSSSR0_L3DAT 0x00008000 /* 16: L3 data parity error */ #define MSSSR0_APE 0x00004000 /* 17: Address parity error */ #define MSSSR0_DPE 0x00002000 /* 18: Data parity error */ #define MSSSR0_TEA 0x00001000 /* 19: Bus transfer error acknowledge */ #define SPR_LDSTCR 0x3f8 /* .6. Load/Store Control Register */ #define SPR_L2PM 0x3f8 /* .6. L2 Private Memory Control Register */ #define SPR_L2CR 0x3f9 /* .6. 
L2 Control Register */ #define L2CR_L2E 0x80000000 /* 0: L2 enable */ #define L2CR_L2PE 0x40000000 /* 1: L2 data parity enable */ #define L2CR_L2SIZ 0x30000000 /* 2-3: L2 size */ #define L2SIZ_2M 0x00000000 #define L2SIZ_256K 0x10000000 #define L2SIZ_512K 0x20000000 #define L2SIZ_1M 0x30000000 #define L2CR_L2CLK 0x0e000000 /* 4-6: L2 clock ratio */ #define L2CLK_DIS 0x00000000 /* disable L2 clock */ #define L2CLK_10 0x02000000 /* core clock / 1 */ #define L2CLK_15 0x04000000 /* / 1.5 */ #define L2CLK_20 0x08000000 /* / 2 */ #define L2CLK_25 0x0a000000 /* / 2.5 */ #define L2CLK_30 0x0c000000 /* / 3 */ #define L2CR_L2RAM 0x01800000 /* 7-8: L2 RAM type */ #define L2RAM_FLOWTHRU_BURST 0x00000000 #define L2RAM_PIPELINE_BURST 0x01000000 #define L2RAM_PIPELINE_LATE 0x01800000 #define L2CR_L2DO 0x00400000 /* 9: L2 data-only. Setting this bit disables instruction caching. */ #define L2CR_L2I 0x00200000 /* 10: L2 global invalidate. */ #define L2CR_L2IO_7450 0x00010000 /* 11: L2 instruction-only (MPC745x). */ #define L2CR_L2CTL 0x00100000 /* 11: L2 RAM control (ZZ enable). Enables automatic operation of the L2ZZ (low-power mode) signal. */ #define L2CR_L2WT 0x00080000 /* 12: L2 write-through. */ #define L2CR_L2TS 0x00040000 /* 13: L2 test support. */ #define L2CR_L2OH 0x00030000 /* 14-15: L2 output hold. */ #define L2CR_L2DO_7450 0x00010000 /* 15: L2 data-only (MPC745x). */ #define L2CR_L2SL 0x00008000 /* 16: L2 DLL slow. */ #define L2CR_L2DF 0x00004000 /* 17: L2 differential clock. */ #define L2CR_L2BYP 0x00002000 /* 18: L2 DLL bypass. */ #define L2CR_L2FA 0x00001000 /* 19: L2 flush assist (for software flush). */ #define L2CR_L2HWF 0x00000800 /* 20: L2 hardware flush. */ #define L2CR_L2IO 0x00000400 /* 21: L2 instruction-only. */ #define L2CR_L2CLKSTP 0x00000200 /* 22: L2 clock stop. */ #define L2CR_L2DRO 0x00000100 /* 23: L2DLL rollover checkstop enable. */ #define L2CR_L2IP 0x00000001 /* 31: L2 global invalidate in */ /* progress (read only). */ #define SPR_L3CR 0x3fa /* .6. L3 Control Register */ #define L3CR_L3E 0x80000000 /* 0: L3 enable */ #define L3CR_L3PE 0x40000000 /* 1: L3 data parity enable */ #define L3CR_L3APE 0x20000000 #define L3CR_L3SIZ 0x10000000 /* 3: L3 size (0=1MB, 1=2MB) */ #define L3CR_L3CLKEN 0x08000000 /* 4: Enables L3_CLK[0:1] */ #define L3CR_L3CLK 0x03800000 #define L3CR_L3IO 0x00400000 #define L3CR_L3CLKEXT 0x00200000 #define L3CR_L3CKSPEXT 0x00100000 #define L3CR_L3OH1 0x00080000 #define L3CR_L3SPO 0x00040000 #define L3CR_L3CKSP 0x00030000 #define L3CR_L3PSP 0x0000e000 #define L3CR_L3REP 0x00001000 #define L3CR_L3HWF 0x00000800 #define L3CR_L3I 0x00000400 /* 21: L3 global invalidate */ #define L3CR_L3RT 0x00000300 #define L3CR_L3NIRCA 0x00000080 #define L3CR_L3DO 0x00000040 #define L3CR_PMEN 0x00000004 #define L3CR_PMSIZ 0x00000003 #define SPR_DCCR 0x3fa /* 4.. Data Cache Cachability Register */ #define SPR_ICCR 0x3fb /* 4.. Instruction Cache Cachability Register */ #define SPR_THRM1 0x3fc /* .6. Thermal Management Register */ #define SPR_THRM2 0x3fd /* .6. Thermal Management Register */ #define SPR_THRM_TIN 0x80000000 /* Thermal interrupt bit (RO) */ #define SPR_THRM_TIV 0x40000000 /* Thermal interrupt valid (RO) */ #define SPR_THRM_THRESHOLD(x) ((x) << 23) /* Thermal sensor threshold */ #define SPR_THRM_TID 0x00000004 /* Thermal interrupt direction */ #define SPR_THRM_TIE 0x00000002 /* Thermal interrupt enable */ #define SPR_THRM_VALID 0x00000001 /* Valid bit */ #define SPR_THRM3 0x3fe /* .6. 
Thermal Management Register */ #define SPR_THRM_TIMER(x) ((x) << 1) /* Sampling interval timer */ #define SPR_THRM_ENABLE 0x00000001 /* TAU Enable */ #define SPR_FPECR 0x3fe /* .6. Floating-Point Exception Cause Register */ /* Time Base Register declarations */ #define TBR_TBL 0x10c /* 468 Time Base Lower - read */ #define TBR_TBU 0x10d /* 468 Time Base Upper - read */ #define TBR_TBWL 0x11c /* 468 Time Base Lower - supervisor, write */ #define TBR_TBWU 0x11d /* 468 Time Base Upper - supervisor, write */ /* Performance counter declarations */ #define PMC_OVERFLOW 0x80000000 /* Counter has overflowed */ /* The first five countable [non-]events are common to many PMC's */ #define PMCN_NONE 0 /* Count nothing */ #define PMCN_CYCLES 1 /* Processor cycles */ #define PMCN_ICOMP 2 /* Instructions completed */ #define PMCN_TBLTRANS 3 /* TBL bit transitions */ #define PCMN_IDISPATCH 4 /* Instructions dispatched */ /* Similar things for the 970 PMC direct counters */ #define PMC970N_NONE 0x8 /* Count nothing */ #define PMC970N_CYCLES 0xf /* Processor cycles */ #define PMC970N_ICOMP 0x9 /* Instructions completed */ #if defined(BOOKE) #define SPR_MCARU 0x239 /* ..8 Machine Check Address register upper bits */ #define SPR_MCSR 0x23c /* ..8 Machine Check Syndrome register */ #define SPR_MCAR 0x23d /* ..8 Machine Check Address register */ #define SPR_ESR 0x003e /* ..8 Exception Syndrome Register */ #define ESR_PIL 0x08000000 /* Program interrupt - illegal */ #define ESR_PPR 0x04000000 /* Program interrupt - privileged */ #define ESR_PTR 0x02000000 /* Program interrupt - trap */ #define ESR_ST 0x00800000 /* Store operation */ #define ESR_DLK 0x00200000 /* Data storage, D cache locking */ #define ESR_ILK 0x00100000 /* Data storage, I cache locking */ #define ESR_BO 0x00020000 /* Data/instruction storage, byte ordering */ #define ESR_SPE 0x00000080 /* SPE exception bit */ #define SPR_CSRR0 0x03a /* ..8 58 Critical SRR0 */ #define SPR_CSRR1 0x03b /* ..8 59 Critical SRR1 */ #define SPR_MCSRR0 0x23a /* ..8 570 Machine check SRR0 */ #define SPR_MCSRR1 0x23b /* ..8 571 Machine check SRR1 */ #define SPR_DSRR0 0x23e /* ..8 574 Debug SRR0 */ #define SPR_DSRR1 0x23f /* ..8 575 Debug SRR1 */ #define SPR_MMUCR 0x3b2 /* 4.. 
MMU Control Register */ #define MMUCR_SWOA (0x80000000 >> 7) #define MMUCR_U1TE (0x80000000 >> 9) #define MMUCR_U2SWOAE (0x80000000 >> 10) #define MMUCR_DULXE (0x80000000 >> 12) #define MMUCR_IULXE (0x80000000 >> 13) #define MMUCR_STS (0x80000000 >> 15) #define MMUCR_STID_MASK (0xFF000000 >> 24) #define SPR_MMUCSR0 0x3f4 /* ..8 1012 MMU Control and Status Register 0 */ #define MMUCSR0_L2TLB0_FI 0x04 /* TLB0 flash invalidate */ #define MMUCSR0_L2TLB1_FI 0x02 /* TLB1 flash invalidate */ #define SPR_SVR 0x3ff /* ..8 1023 System Version Register */ #define SVR_MPC8533 0x8034 #define SVR_MPC8533E 0x803c #define SVR_MPC8541 0x8072 #define SVR_MPC8541E 0x807a #define SVR_MPC8548 0x8031 #define SVR_MPC8548E 0x8039 #define SVR_MPC8555 0x8071 #define SVR_MPC8555E 0x8079 #define SVR_MPC8572 0x80e0 #define SVR_MPC8572E 0x80e8 #define SVR_P1011 0x80e5 #define SVR_P1011E 0x80ed #define SVR_P1013 0x80e7 #define SVR_P1013E 0x80ef #define SVR_P1020 0x80e4 #define SVR_P1020E 0x80ec #define SVR_P1022 0x80e6 #define SVR_P1022E 0x80ee #define SVR_P2010 0x80e3 #define SVR_P2010E 0x80eb #define SVR_P2020 0x80e2 #define SVR_P2020E 0x80ea #define SVR_P2041 0x8210 #define SVR_P2041E 0x8218 #define SVR_P3041 0x8211 #define SVR_P3041E 0x8219 #define SVR_P4040 0x8200 #define SVR_P4040E 0x8208 #define SVR_P4080 0x8201 #define SVR_P4080E 0x8209 #define SVR_P5010 0x8221 #define SVR_P5010E 0x8229 #define SVR_P5020 0x8220 #define SVR_P5020E 0x8228 #define SVR_P5021 0x8205 #define SVR_P5021E 0x820d #define SVR_P5040 0x8204 #define SVR_P5040E 0x820c #define SVR_VER(svr) (((svr) >> 16) & 0xffff) #define SPR_PID0 0x030 /* ..8 Process ID Register 0 */ #define SPR_PID1 0x279 /* ..8 Process ID Register 1 */ #define SPR_PID2 0x27a /* ..8 Process ID Register 2 */ #define SPR_TLB0CFG 0x2B0 /* ..8 TLB 0 Config Register */ #define SPR_TLB1CFG 0x2B1 /* ..8 TLB 1 Config Register */ #define TLBCFG_ASSOC_MASK 0xff000000 /* Associativity of TLB */ #define TLBCFG_ASSOC_SHIFT 24 #define TLBCFG_NENTRY_MASK 0x00000fff /* Number of entries in TLB */ #define SPR_IVPR 0x03f /* ..8 Interrupt Vector Prefix Register */ #define SPR_IVOR0 0x190 /* ..8 Critical input */ #define SPR_IVOR1 0x191 /* ..8 Machine check */ #define SPR_IVOR2 0x192 #define SPR_IVOR3 0x193 #define SPR_IVOR4 0x194 #define SPR_IVOR5 0x195 #define SPR_IVOR6 0x196 #define SPR_IVOR7 0x197 #define SPR_IVOR8 0x198 #define SPR_IVOR9 0x199 #define SPR_IVOR10 0x19a #define SPR_IVOR11 0x19b #define SPR_IVOR12 0x19c #define SPR_IVOR13 0x19d #define SPR_IVOR14 0x19e #define SPR_IVOR15 0x19f #define SPR_IVOR32 0x210 #define SPR_IVOR33 0x211 #define SPR_IVOR34 0x212 #define SPR_IVOR35 0x213 #define SPR_MAS0 0x270 /* ..8 MMU Assist Register 0 Book-E/e500 */ #define SPR_MAS1 0x271 /* ..8 MMU Assist Register 1 Book-E/e500 */ #define SPR_MAS2 0x272 /* ..8 MMU Assist Register 2 Book-E/e500 */ #define SPR_MAS3 0x273 /* ..8 MMU Assist Register 3 Book-E/e500 */ #define SPR_MAS4 0x274 /* ..8 MMU Assist Register 4 Book-E/e500 */ #define SPR_MAS5 0x275 /* ..8 MMU Assist Register 5 Book-E */ #define SPR_MAS6 0x276 /* ..8 MMU Assist Register 6 Book-E/e500 */ #define SPR_MAS7 0x3B0 /* ..8 MMU Assist Register 7 Book-E/e500 */ #define SPR_MAS8 0x155 /* ..8 MMU Assist Register 8 Book-E/e500 */ #define SPR_L1CFG0 0x203 /* ..8 L1 cache configuration register 0 */ #define SPR_L1CFG1 0x204 /* ..8 L1 cache configuration register 1 */ #define SPR_CCR1 0x378 #define CCR1_L2COBE 0x00000040 #define DCR_L2DCDCRAI 0x0000 /* L2 D-Cache DCR Address Pointer */ #define DCR_L2DCDCRDI 0x0001 /* L2 D-Cache DCR Data Indirect */ 
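A minimal sketch of how the Book-E TLB-geometry and SVR macros above are typically consumed, assuming the mfspr() accessor defined earlier in this header and the kernel printf(9); the helper name is hypothetical:

static void
booke_print_tlb1_geometry(void)
{
	register_t cfg, svr;
	u_int assoc, nentries;

	/* TLB1CFG packs associativity in the top byte, entry count in bits 0-11. */
	cfg = mfspr(SPR_TLB1CFG);
	assoc = (cfg & TLBCFG_ASSOC_MASK) >> TLBCFG_ASSOC_SHIFT;
	nentries = cfg & TLBCFG_NENTRY_MASK;

	/* SVR_VER() yields the 16-bit version field compared against the SVR_* constants. */
	svr = mfspr(SPR_SVR);

	printf("TLB1: %u entries, %u-way set associative, SoC version 0x%04x\n",
	    nentries, assoc, (u_int)SVR_VER(svr));
}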
#define DCR_L2CR0 0x00 /* L2 Cache Configuration Register 0 */ #define L2CR0_AS 0x30000000 #define SPR_L1CSR0 0x3F2 /* ..8 L1 Cache Control and Status Register 0 */ #define L1CSR0_DCPE 0x00010000 /* Data Cache Parity Enable */ #define L1CSR0_DCLFR 0x00000100 /* Data Cache Lock Bits Flash Reset */ #define L1CSR0_DCFI 0x00000002 /* Data Cache Flash Invalidate */ #define L1CSR0_DCE 0x00000001 /* Data Cache Enable */ #define SPR_L1CSR1 0x3F3 /* ..8 L1 Cache Control and Status Register 1 */ #define L1CSR1_ICPE 0x00010000 /* Instruction Cache Parity Enable */ #define L1CSR1_ICUL 0x00000400 /* Instr Cache Unable to Lock */ #define L1CSR1_ICLFR 0x00000100 /* Instruction Cache Lock Bits Flash Reset */ #define L1CSR1_ICFI 0x00000002 /* Instruction Cache Flash Invalidate */ #define L1CSR1_ICE 0x00000001 /* Instruction Cache Enable */ #define SPR_L2CSR0 0x3F9 /* ..8 L2 Cache Control and Status Register 0 */ #define L2CSR0_L2E 0x80000000 /* L2 Cache Enable */ #define L2CSR0_L2PE 0x40000000 /* L2 Cache Parity Enable */ #define L2CSR0_L2FI 0x00200000 /* L2 Cache Flash Invalidate */ #define L2CSR0_L2LFC 0x00000400 /* L2 Cache Lock Flags Clear */ #define SPR_BUCSR 0x3F5 /* ..8 Branch Unit Control and Status Register */ #define BUCSR_BPEN 0x00000001 /* Branch Prediction Enable */ #define BUCSR_BBFI 0x00000200 /* Branch Buffer Flash Invalidate */ #endif /* BOOKE */ #endif /* !_POWERPC_SPR_H_ */ Index: user/ngie/bug-237403/sys/powerpc/powernv/opal_dev.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/powernv/opal_dev.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powernv/opal_dev.c (revision 346926) @@ -1,423 +1,424 @@ /*- * Copyright (c) 2015 Nathan Whitehorn * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "clock_if.h" #include "opal.h" static int opaldev_probe(device_t); static int opaldev_attach(device_t); /* clock interface */ static int opal_gettime(device_t dev, struct timespec *ts); static int opal_settime(device_t dev, struct timespec *ts); /* ofw bus interface */ static const struct ofw_bus_devinfo *opaldev_get_devinfo(device_t dev, device_t child); static void opal_shutdown(void *arg, int howto); static void opal_handle_shutdown_message(void *unused, struct opal_msg *msg); static void opal_intr(void *); static device_method_t opaldev_methods[] = { /* Device interface */ DEVMETHOD(device_probe, opaldev_probe), DEVMETHOD(device_attach, opaldev_attach), /* clock interface */ DEVMETHOD(clock_gettime, opal_gettime), DEVMETHOD(clock_settime, opal_settime), /* Bus interface */ DEVMETHOD(bus_child_pnpinfo_str, ofw_bus_gen_child_pnpinfo_str), /* ofw_bus interface */ DEVMETHOD(ofw_bus_get_devinfo, opaldev_get_devinfo), DEVMETHOD(ofw_bus_get_compat, ofw_bus_gen_get_compat), DEVMETHOD(ofw_bus_get_model, ofw_bus_gen_get_model), DEVMETHOD(ofw_bus_get_name, ofw_bus_gen_get_name), DEVMETHOD(ofw_bus_get_node, ofw_bus_gen_get_node), DEVMETHOD(ofw_bus_get_type, ofw_bus_gen_get_type), DEVMETHOD_END }; static driver_t opaldev_driver = { "opal", opaldev_methods, 0 }; static devclass_t opaldev_devclass; -DRIVER_MODULE(opaldev, ofwbus, opaldev_driver, opaldev_devclass, 0, 0); +EARLY_DRIVER_MODULE(opaldev, ofwbus, opaldev_driver, opaldev_devclass, 0, 0, + BUS_PASS_BUS); static void opal_heartbeat(void); static void opal_handle_messages(void); static struct proc *opal_hb_proc; static struct kproc_desc opal_heartbeat_kp = { "opal_heartbeat", opal_heartbeat, &opal_hb_proc }; SYSINIT(opal_heartbeat_setup, SI_SUB_KTHREAD_IDLE, SI_ORDER_ANY, kproc_start, &opal_heartbeat_kp); static int opal_heartbeat_ms; EVENTHANDLER_LIST_DEFINE(OPAL_ASYNC_COMP); EVENTHANDLER_LIST_DEFINE(OPAL_EPOW); EVENTHANDLER_LIST_DEFINE(OPAL_SHUTDOWN); EVENTHANDLER_LIST_DEFINE(OPAL_HMI_EVT); EVENTHANDLER_LIST_DEFINE(OPAL_DPO); EVENTHANDLER_LIST_DEFINE(OPAL_OCC); #define OPAL_SOFT_OFF 0 #define OPAL_SOFT_REBOOT 1 static void opal_heartbeat(void) { uint64_t events; if (opal_heartbeat_ms == 0) kproc_exit(0); while (1) { events = 0; /* Turn the OPAL state crank */ opal_call(OPAL_POLL_EVENTS, vtophys(&events)); if (events & OPAL_EVENT_MSG_PENDING) opal_handle_messages(); tsleep(opal_hb_proc, 0, "opal", MSEC_2_TICKS(opal_heartbeat_ms)); } } static int opaldev_probe(device_t dev) { phandle_t iparent; pcell_t *irqs; int i, n_irqs; if (!ofw_bus_is_compatible(dev, "ibm,opal-v3")) return (ENXIO); if (opal_check() != 0) return (ENXIO); device_set_desc(dev, "OPAL Abstraction Firmware"); /* Manually add IRQs before attaching */ if (OF_hasprop(ofw_bus_get_node(dev), "opal-interrupts")) { iparent = OF_finddevice("/interrupt-controller@0"); iparent = OF_xref_from_node(iparent); n_irqs = OF_getproplen(ofw_bus_get_node(dev), "opal-interrupts") / sizeof(*irqs); irqs = malloc(n_irqs * sizeof(*irqs), M_DEVBUF, M_WAITOK); OF_getencprop(ofw_bus_get_node(dev), "opal-interrupts", irqs, n_irqs * sizeof(*irqs)); for (i = 0; i < n_irqs; i++) bus_set_resource(dev, SYS_RES_IRQ, i, ofw_bus_map_intr(dev, iparent, 1, &irqs[i]), 1); free(irqs, M_DEVBUF); } return (BUS_PROBE_SPECIFIC); } static int opaldev_attach(device_t dev) { phandle_t child; device_t cdev; uint64_t junk; int i, 
rv; uint32_t async_count; struct ofw_bus_devinfo *dinfo; struct resource *irq; /* Test for RTC support and register clock if it works */ rv = opal_call(OPAL_RTC_READ, vtophys(&junk), vtophys(&junk)); do { rv = opal_call(OPAL_RTC_READ, vtophys(&junk), vtophys(&junk)); if (rv == OPAL_BUSY_EVENT) rv = opal_call(OPAL_POLL_EVENTS, 0); } while (rv == OPAL_BUSY_EVENT); if (rv == OPAL_SUCCESS) clock_register(dev, 2000); EVENTHANDLER_REGISTER(OPAL_SHUTDOWN, opal_handle_shutdown_message, NULL, EVENTHANDLER_PRI_ANY); EVENTHANDLER_REGISTER(shutdown_final, opal_shutdown, NULL, SHUTDOWN_PRI_LAST); OF_getencprop(ofw_bus_get_node(dev), "ibm,heartbeat-ms", &opal_heartbeat_ms, sizeof(opal_heartbeat_ms)); /* Bind to interrupts */ for (i = 0; (irq = bus_alloc_resource_any(dev, SYS_RES_IRQ, &i, RF_ACTIVE)) != NULL; i++) bus_setup_intr(dev, irq, INTR_TYPE_TTY | INTR_MPSAFE | INTR_ENTROPY, NULL, opal_intr, (void *)rman_get_start(irq), NULL); OF_getencprop(ofw_bus_get_node(dev), "opal-msg-async-num", &async_count, sizeof(async_count)); opal_init_async_tokens(async_count); for (child = OF_child(ofw_bus_get_node(dev)); child != 0; child = OF_peer(child)) { dinfo = malloc(sizeof(*dinfo), M_DEVBUF, M_WAITOK | M_ZERO); if (ofw_bus_gen_setup_devinfo(dinfo, child) != 0) { free(dinfo, M_DEVBUF); continue; } cdev = device_add_child(dev, NULL, -1); if (cdev == NULL) { device_printf(dev, "<%s>: device_add_child failed\n", dinfo->obd_name); ofw_bus_gen_destroy_devinfo(dinfo); free(dinfo, M_DEVBUF); continue; } device_set_ivars(cdev, dinfo); } return (bus_generic_attach(dev)); } static int bcd2bin32(int bcd) { int out = 0; out += bcd2bin(bcd & 0xff); out += 100*bcd2bin((bcd & 0x0000ff00) >> 8); out += 10000*bcd2bin((bcd & 0x00ff0000) >> 16); out += 1000000*bcd2bin((bcd & 0xffff0000) >> 24); return (out); } static int bin2bcd32(int bin) { int out = 0; int tmp; tmp = bin % 100; out += bin2bcd(tmp) * 1; bin = bin / 100; tmp = bin % 100; out += bin2bcd(tmp) * 100; bin = bin / 100; tmp = bin % 100; out += bin2bcd(tmp) * 10000; return (out); } static int opal_gettime(device_t dev, struct timespec *ts) { int rv; struct clocktime ct; uint32_t ymd; uint64_t hmsm; rv = opal_call(OPAL_RTC_READ, vtophys(&ymd), vtophys(&hmsm)); while (rv == OPAL_BUSY_EVENT) { opal_call(OPAL_POLL_EVENTS, 0); pause("opalrtc", 1); rv = opal_call(OPAL_RTC_READ, vtophys(&ymd), vtophys(&hmsm)); } if (rv != OPAL_SUCCESS) return (ENXIO); hmsm = be64toh(hmsm); ymd = be32toh(ymd); ct.nsec = bcd2bin32((hmsm & 0x000000ffffff0000) >> 16) * 1000; ct.sec = bcd2bin((hmsm & 0x0000ff0000000000) >> 40); ct.min = bcd2bin((hmsm & 0x00ff000000000000) >> 48); ct.hour = bcd2bin((hmsm & 0xff00000000000000) >> 56); ct.day = bcd2bin((ymd & 0x000000ff) >> 0); ct.mon = bcd2bin((ymd & 0x0000ff00) >> 8); ct.year = bcd2bin32((ymd & 0xffff0000) >> 16); return (clock_ct_to_ts(&ct, ts)); } static int opal_settime(device_t dev, struct timespec *ts) { int rv; struct clocktime ct; uint32_t ymd = 0; uint64_t hmsm = 0; clock_ts_to_ct(ts, &ct); ymd |= (uint32_t)bin2bcd(ct.day); ymd |= ((uint32_t)bin2bcd(ct.mon) << 8); ymd |= ((uint32_t)bin2bcd32(ct.year) << 16); hmsm |= ((uint64_t)bin2bcd32(ct.nsec/1000) << 16); hmsm |= ((uint64_t)bin2bcd(ct.sec) << 40); hmsm |= ((uint64_t)bin2bcd(ct.min) << 48); hmsm |= ((uint64_t)bin2bcd(ct.hour) << 56); hmsm = htobe64(hmsm); ymd = htobe32(ymd); do { rv = opal_call(OPAL_RTC_WRITE, vtophys(&ymd), vtophys(&hmsm)); if (rv == OPAL_BUSY_EVENT) { rv = opal_call(OPAL_POLL_EVENTS, 0); pause("opalrtc", 1); } } while (rv == OPAL_BUSY_EVENT); if (rv != OPAL_SUCCESS) 
return (ENXIO); return (0); } static const struct ofw_bus_devinfo * opaldev_get_devinfo(device_t dev, device_t child) { return (device_get_ivars(child)); } static void opal_shutdown(void *arg, int howto) { if (howto & RB_HALT) opal_call(OPAL_CEC_POWER_DOWN, 0 /* Normal power off */); else opal_call(OPAL_CEC_REBOOT); opal_call(OPAL_RETURN_CPU); } static void opal_handle_shutdown_message(void *unused, struct opal_msg *msg) { int howto; switch (be64toh(msg->params[0])) { case OPAL_SOFT_OFF: howto = RB_POWEROFF; break; case OPAL_SOFT_REBOOT: howto = RB_REROOT; break; } shutdown_nice(howto); } static void opal_handle_messages(void) { static struct opal_msg msg; uint64_t rv; uint32_t type; rv = opal_call(OPAL_GET_MSG, vtophys(&msg), sizeof(msg)); if (rv != OPAL_SUCCESS) return; type = be32toh(msg.msg_type); switch (type) { case OPAL_MSG_ASYNC_COMP: EVENTHANDLER_DIRECT_INVOKE(OPAL_ASYNC_COMP, &msg); break; case OPAL_MSG_EPOW: EVENTHANDLER_DIRECT_INVOKE(OPAL_EPOW, &msg); break; case OPAL_MSG_SHUTDOWN: EVENTHANDLER_DIRECT_INVOKE(OPAL_SHUTDOWN, &msg); break; case OPAL_MSG_HMI_EVT: EVENTHANDLER_DIRECT_INVOKE(OPAL_HMI_EVT, &msg); break; case OPAL_MSG_DPO: EVENTHANDLER_DIRECT_INVOKE(OPAL_DPO, &msg); break; case OPAL_MSG_OCC: EVENTHANDLER_DIRECT_INVOKE(OPAL_OCC, &msg); break; default: printf("Unknown OPAL message type %d\n", type); } } static void opal_intr(void *xintr) { uint64_t events = 0; opal_call(OPAL_HANDLE_INTERRUPT, (uint32_t)(uint64_t)xintr, vtophys(&events)); /* Wake up the heartbeat, if it's been setup. */ if (events != 0 && opal_hb_proc != NULL) wakeup(opal_hb_proc); } Index: user/ngie/bug-237403/sys/powerpc/powerpc/cpu.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/cpu.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/cpu.c (revision 346926) @@ -1,836 +1,848 @@ /*- * SPDX-License-Identifier: BSD-4-Clause AND BSD-2-Clause-FreeBSD * * Copyright (c) 2001 Matt Thomas. * Copyright (c) 2001 Tsubai Masanari. * Copyright (c) 1998, 1999, 2001 Internet Research Institute, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by * Internet Research Institute, Inc. * 4. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 2003 Benno Rice. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
* * from $NetBSD: cpu_subr.c,v 1.1 2003/02/03 17:10:09 matt Exp $ * $FreeBSD$ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static void cpu_6xx_setup(int cpuid, uint16_t vers); static void cpu_970_setup(int cpuid, uint16_t vers); static void cpu_booke_setup(int cpuid, uint16_t vers); static void cpu_powerx_setup(int cpuid, uint16_t vers); int powerpc_pow_enabled; void (*cpu_idle_hook)(sbintime_t) = NULL; static void cpu_idle_60x(sbintime_t); static void cpu_idle_booke(sbintime_t); #ifdef BOOKE_E500 static void cpu_idle_e500mc(sbintime_t sbt); #endif #if defined(__powerpc64__) && defined(AIM) static void cpu_idle_powerx(sbintime_t); static void cpu_idle_power9(sbintime_t); #endif struct cputab { const char *name; uint16_t version; uint16_t revfmt; int features; /* Do not include PPC_FEATURE_32 or * PPC_FEATURE_HAS_MMU */ int features2; void (*cpu_setup)(int cpuid, uint16_t vers); }; #define REVFMT_MAJMIN 1 /* %u.%u */ #define REVFMT_HEX 2 /* 0x%04x */ #define REVFMT_DEC 3 /* %u */ static const struct cputab models[] = { { "Motorola PowerPC 601", MPC601, REVFMT_DEC, PPC_FEATURE_HAS_FPU | PPC_FEATURE_UNIFIED_CACHE, 0, cpu_6xx_setup }, { "Motorola PowerPC 602", MPC602, REVFMT_DEC, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 603", MPC603, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 603e", MPC603e, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 603ev", MPC603ev, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 604", MPC604, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 604ev", MPC604ev, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 620", MPC620, REVFMT_HEX, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU, 0, NULL }, { "Motorola PowerPC 750", MPC750, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "IBM PowerPC 750FX", IBM750FX, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "IBM PowerPC 970", IBM970, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_970_setup }, { "IBM PowerPC 970FX", IBM970FX, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_970_setup }, { "IBM PowerPC 970GX", IBM970GX, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_970_setup }, { "IBM PowerPC 970MP", IBM970MP, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_970_setup }, { "IBM POWER4", IBMPOWER4, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU | PPC_FEATURE_POWER4, 0, NULL }, { "IBM POWER4+", IBMPOWER4PLUS, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU | PPC_FEATURE_POWER4, 0, NULL }, { "IBM POWER5", IBMPOWER5, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU | PPC_FEATURE_POWER4 | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP, 0, NULL }, { "IBM POWER5+", IBMPOWER5PLUS, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU | PPC_FEATURE_POWER5_PLUS | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP, 0, NULL }, { "IBM POWER6", IBMPOWER6, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_TRUE_LE, 0, NULL }, { "IBM POWER7", IBMPOWER7, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06 | 
PPC_FEATURE_HAS_VSX | PPC_FEATURE_TRUE_LE, PPC_FEATURE2_DSCR, NULL }, { "IBM POWER7+", IBMPOWER7PLUS, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06 | PPC_FEATURE_HAS_VSX, PPC_FEATURE2_DSCR, NULL }, { "IBM POWER8E", IBMPOWER8E, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06 | PPC_FEATURE_HAS_VSX | PPC_FEATURE_TRUE_LE, PPC_FEATURE2_ARCH_2_07 | PPC_FEATURE2_HTM | PPC_FEATURE2_DSCR | PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | PPC_FEATURE2_HAS_VEC_CRYPTO | PPC_FEATURE2_HTM_NOSC, cpu_powerx_setup }, + { "IBM POWER8NVL", IBMPOWER8NVL, REVFMT_MAJMIN, + PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | + PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | PPC_FEATURE_ARCH_2_05 | + PPC_FEATURE_ARCH_2_06 | PPC_FEATURE_HAS_VSX | PPC_FEATURE_TRUE_LE, + PPC_FEATURE2_ARCH_2_07 | PPC_FEATURE2_HTM | PPC_FEATURE2_DSCR | + PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | PPC_FEATURE2_HAS_VEC_CRYPTO | + PPC_FEATURE2_HTM_NOSC, cpu_powerx_setup }, { "IBM POWER8", IBMPOWER8, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06 | PPC_FEATURE_HAS_VSX | PPC_FEATURE_TRUE_LE, PPC_FEATURE2_ARCH_2_07 | PPC_FEATURE2_HTM | PPC_FEATURE2_DSCR | PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | PPC_FEATURE2_HAS_VEC_CRYPTO | PPC_FEATURE2_HTM_NOSC, cpu_powerx_setup }, { "IBM POWER9", IBMPOWER9, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06 | PPC_FEATURE_HAS_VSX | PPC_FEATURE_TRUE_LE, PPC_FEATURE2_ARCH_2_07 | PPC_FEATURE2_HTM | PPC_FEATURE2_DSCR | - PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | PPC_FEATURE2_HAS_VEC_CRYPTO | + PPC_FEATURE2_EBB | PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | + PPC_FEATURE2_HAS_VEC_CRYPTO | PPC_FEATURE2_HTM_NOSC | PPC_FEATURE2_ARCH_3_00 | PPC_FEATURE2_HAS_IEEE128 | PPC_FEATURE2_DARN, cpu_powerx_setup }, { "Motorola PowerPC 7400", MPC7400, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7410", MPC7410, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7450", MPC7450, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7455", MPC7455, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7457", MPC7457, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7447A", MPC7447A, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 7448", MPC7448, REVFMT_MAJMIN, PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 8240", MPC8240, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Motorola PowerPC 8245", MPC8245, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU, 0, cpu_6xx_setup }, { "Freescale e500v1 core", FSL_E500v1, REVFMT_MAJMIN, PPC_FEATURE_HAS_SPE | PPC_FEATURE_HAS_EFP_SINGLE | PPC_FEATURE_BOOKE, PPC_FEATURE2_ISEL, cpu_booke_setup }, { "Freescale e500v2 core", FSL_E500v2, REVFMT_MAJMIN, PPC_FEATURE_HAS_SPE | PPC_FEATURE_BOOKE | PPC_FEATURE_HAS_EFP_SINGLE | PPC_FEATURE_HAS_EFP_DOUBLE, PPC_FEATURE2_ISEL, cpu_booke_setup }, { "Freescale e500mc core", 
FSL_E500mc, REVFMT_MAJMIN, PPC_FEATURE_HAS_FPU | PPC_FEATURE_BOOKE | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06, PPC_FEATURE2_ISEL, cpu_booke_setup }, { "Freescale e5500 core", FSL_E5500, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_FPU | PPC_FEATURE_BOOKE | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06, PPC_FEATURE2_ISEL, cpu_booke_setup }, { "Freescale e6500 core", FSL_E6500, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_BOOKE | PPC_FEATURE_ARCH_2_05 | PPC_FEATURE_ARCH_2_06, PPC_FEATURE2_ISEL, cpu_booke_setup }, { "IBM Cell Broadband Engine", IBMCELLBE, REVFMT_MAJMIN, PPC_FEATURE_64 | PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_FPU | PPC_FEATURE_CELL | PPC_FEATURE_SMT, 0, NULL}, { "Unknown PowerPC CPU", 0, REVFMT_HEX, 0, 0, NULL }, }; static void cpu_6xx_print_cacheinfo(u_int, uint16_t); static int cpu_feature_bit(SYSCTL_HANDLER_ARGS); static char model[64]; SYSCTL_STRING(_hw, HW_MODEL, model, CTLFLAG_RD, model, 0, ""); static const struct cputab *cput; u_long cpu_features = PPC_FEATURE_32 | PPC_FEATURE_HAS_MMU; u_long cpu_features2 = 0; SYSCTL_OPAQUE(_hw, OID_AUTO, cpu_features, CTLFLAG_RD, &cpu_features, sizeof(cpu_features), "LX", "PowerPC CPU features"); SYSCTL_OPAQUE(_hw, OID_AUTO, cpu_features2, CTLFLAG_RD, &cpu_features2, sizeof(cpu_features2), "LX", "PowerPC CPU features 2"); #ifdef __powerpc64__ register_t lpcr = LPCR_LPES; #endif /* Provide some user-friendly aliases for bits in cpu_features */ SYSCTL_PROC(_hw, OID_AUTO, floatingpoint, CTLTYPE_INT | CTLFLAG_RD, 0, PPC_FEATURE_HAS_FPU, cpu_feature_bit, "I", "Floating point instructions executed in hardware"); SYSCTL_PROC(_hw, OID_AUTO, altivec, CTLTYPE_INT | CTLFLAG_RD, 0, PPC_FEATURE_HAS_ALTIVEC, cpu_feature_bit, "I", "CPU supports Altivec"); /* * Phase 1 (early) CPU setup. Setup the cpu_features/cpu_features2 variables, * so they can be used during platform and MMU bringup. */ void cpu_feature_setup() { u_int pvr; uint16_t vers; const struct cputab *cp; pvr = mfpvr(); vers = pvr >> 16; for (cp = models; cp->version != 0; cp++) { if (cp->version == vers) break; } cput = cp; cpu_features |= cp->features; cpu_features2 |= cp->features2; } void cpu_setup(u_int cpuid) { uint64_t cps; const char *name; u_int maj, min, pvr; uint16_t rev, revfmt, vers; pvr = mfpvr(); vers = pvr >> 16; rev = pvr; switch (vers) { case MPC7410: min = (pvr >> 0) & 0xff; maj = min <= 4 ? 1 : 2; break; case FSL_E500v1: case FSL_E500v2: case FSL_E500mc: case FSL_E5500: maj = (pvr >> 4) & 0xf; min = (pvr >> 0) & 0xf; break; default: maj = (pvr >> 8) & 0xf; min = (pvr >> 0) & 0xf; } revfmt = cput->revfmt; name = cput->name; if (rev == MPC750 && pvr == 15) { name = "Motorola MPC755"; revfmt = REVFMT_HEX; } strncpy(model, name, sizeof(model) - 1); printf("cpu%d: %s revision ", cpuid, name); switch (revfmt) { case REVFMT_MAJMIN: printf("%u.%u", maj, min); break; case REVFMT_HEX: printf("0x%04x", rev); break; case REVFMT_DEC: printf("%u", rev); break; } if (cpu_est_clockrate(0, &cps) == 0) printf(", %jd.%02jd MHz", cps / 1000000, (cps / 10000) % 100); printf("\n"); printf("cpu%d: Features %b\n", cpuid, (int)cpu_features, PPC_FEATURE_BITMASK); if (cpu_features2 != 0) printf("cpu%d: Features2 %b\n", cpuid, (int)cpu_features2, PPC_FEATURE2_BITMASK); /* * Configure CPU */ if (cput->cpu_setup != NULL) cput->cpu_setup(cpuid, vers); } /* Get current clock frequency for the given cpu id. 
*/ int cpu_est_clockrate(int cpu_id, uint64_t *cps) { uint16_t vers; register_t msr; phandle_t cpu, dev, root; int res = 0; char buf[8]; vers = mfpvr() >> 16; msr = mfmsr(); mtmsr(msr & ~PSL_EE); switch (vers) { case MPC7450: case MPC7455: case MPC7457: case MPC750: case IBM750FX: case MPC7400: case MPC7410: case MPC7447A: case MPC7448: mtspr(SPR_MMCR0, SPR_MMCR0_FC); mtspr(SPR_PMC1, 0); mtspr(SPR_MMCR0, SPR_MMCR0_PMC1SEL(PMCN_CYCLES)); DELAY(1000); *cps = (mfspr(SPR_PMC1) * 1000) + 4999; mtspr(SPR_MMCR0, SPR_MMCR0_FC); mtmsr(msr); return (0); case IBM970: case IBM970FX: case IBM970MP: isync(); mtspr(SPR_970MMCR0, SPR_MMCR0_FC); isync(); mtspr(SPR_970MMCR1, 0); mtspr(SPR_970MMCRA, 0); mtspr(SPR_970PMC1, 0); mtspr(SPR_970MMCR0, SPR_970MMCR0_PMC1SEL(PMC970N_CYCLES)); isync(); DELAY(1000); powerpc_sync(); mtspr(SPR_970MMCR0, SPR_MMCR0_FC); *cps = (mfspr(SPR_970PMC1) * 1000) + 4999; mtmsr(msr); return (0); default: root = OF_peer(0); if (root == 0) return (ENXIO); dev = OF_child(root); while (dev != 0) { res = OF_getprop(dev, "name", buf, sizeof(buf)); if (res > 0 && strcmp(buf, "cpus") == 0) break; dev = OF_peer(dev); } cpu = OF_child(dev); while (cpu != 0) { res = OF_getprop(cpu, "device_type", buf, sizeof(buf)); if (res > 0 && strcmp(buf, "cpu") == 0) break; cpu = OF_peer(cpu); } if (cpu == 0) return (ENOENT); if (OF_getprop(cpu, "ibm,extended-clock-frequency", cps, sizeof(*cps)) >= 0) { return (0); } else if (OF_getprop(cpu, "clock-frequency", cps, sizeof(cell_t)) >= 0) { *cps >>= 32; return (0); } else { return (ENOENT); } } } void cpu_6xx_setup(int cpuid, uint16_t vers) { register_t hid0, pvr; const char *bitmask; hid0 = mfspr(SPR_HID0); pvr = mfpvr(); /* * Configure power-saving mode. */ switch (vers) { case MPC603: case MPC603e: case MPC603ev: case MPC604ev: case MPC750: case IBM750FX: case MPC7400: case MPC7410: case MPC8240: case MPC8245: /* Select DOZE mode. */ hid0 &= ~(HID0_DOZE | HID0_NAP | HID0_SLEEP); hid0 |= HID0_DOZE | HID0_DPM; powerpc_pow_enabled = 1; break; case MPC7448: case MPC7447A: case MPC7457: case MPC7455: case MPC7450: /* Enable the 7450 branch caches */ hid0 |= HID0_SGE | HID0_BTIC; hid0 |= HID0_LRSTK | HID0_FOLD | HID0_BHT; /* Disable BTIC on 7450 Rev 2.0 or earlier and on 7457 */ if (((pvr >> 16) == MPC7450 && (pvr & 0xFFFF) <= 0x0200) || (pvr >> 16) == MPC7457) hid0 &= ~HID0_BTIC; /* Select NAP mode. */ hid0 &= ~(HID0_DOZE | HID0_NAP | HID0_SLEEP); hid0 |= HID0_NAP | HID0_DPM; powerpc_pow_enabled = 1; break; default: /* No power-saving mode is available. */ ; } switch (vers) { case IBM750FX: case MPC750: hid0 &= ~HID0_DBP; /* XXX correct? */ hid0 |= HID0_EMCP | HID0_BTIC | HID0_SGE | HID0_BHT; break; case MPC7400: case MPC7410: hid0 &= ~HID0_SPD; hid0 |= HID0_EMCP | HID0_BTIC | HID0_SGE | HID0_BHT; hid0 |= HID0_EIEC; break; } mtspr(SPR_HID0, hid0); if (bootverbose) cpu_6xx_print_cacheinfo(cpuid, vers); switch (vers) { case MPC7447A: case MPC7448: case MPC7450: case MPC7455: case MPC7457: bitmask = HID0_7450_BITMASK; break; default: bitmask = HID0_BITMASK; break; } printf("cpu%d: HID0 %b\n", cpuid, (int)hid0, bitmask); if (cpu_idle_hook == NULL) cpu_idle_hook = cpu_idle_60x; } static void cpu_6xx_print_cacheinfo(u_int cpuid, uint16_t vers) { register_t hid; hid = mfspr(SPR_HID0); printf("cpu%u: ", cpuid); printf("L1 I-cache %sabled, ", (hid & HID0_ICE) ? "en" : "dis"); printf("L1 D-cache %sabled\n", (hid & HID0_DCE) ? 
"en" : "dis"); printf("cpu%u: ", cpuid); if (mfspr(SPR_L2CR) & L2CR_L2E) { switch (vers) { case MPC7450: case MPC7455: case MPC7457: printf("256KB L2 cache, "); if (mfspr(SPR_L3CR) & L3CR_L3E) printf("%cMB L3 backside cache", mfspr(SPR_L3CR) & L3CR_L3SIZ ? '2' : '1'); else printf("L3 cache disabled"); printf("\n"); break; case IBM750FX: printf("512KB L2 cache\n"); break; default: switch (mfspr(SPR_L2CR) & L2CR_L2SIZ) { case L2SIZ_256K: printf("256KB "); break; case L2SIZ_512K: printf("512KB "); break; case L2SIZ_1M: printf("1MB "); break; } printf("write-%s", (mfspr(SPR_L2CR) & L2CR_L2WT) ? "through" : "back"); if (mfspr(SPR_L2CR) & L2CR_L2PE) printf(", with parity"); printf(" backside cache\n"); break; } } else printf("L2 cache disabled\n"); } static void cpu_booke_setup(int cpuid, uint16_t vers) { #ifdef BOOKE_E500 register_t hid0; const char *bitmask; hid0 = mfspr(SPR_HID0); switch (vers) { case FSL_E500mc: bitmask = HID0_E500MC_BITMASK; cpu_idle_hook = cpu_idle_e500mc; break; case FSL_E5500: case FSL_E6500: bitmask = HID0_E5500_BITMASK; cpu_idle_hook = cpu_idle_e500mc; break; case FSL_E500v1: case FSL_E500v2: /* Only e500v1/v2 support HID0 power management setup. */ /* Program power-management mode. */ hid0 &= ~(HID0_DOZE | HID0_NAP | HID0_SLEEP); hid0 |= HID0_DOZE; mtspr(SPR_HID0, hid0); default: bitmask = HID0_E500_BITMASK; break; } printf("cpu%d: HID0 %b\n", cpuid, (int)hid0, bitmask); #endif if (cpu_idle_hook == NULL) cpu_idle_hook = cpu_idle_booke; } static void cpu_970_setup(int cpuid, uint16_t vers) { #ifdef AIM uint32_t hid0_hi, hid0_lo; __asm __volatile ("mfspr %0,%2; clrldi %1,%0,32; srdi %0,%0,32;" : "=r" (hid0_hi), "=r" (hid0_lo) : "K" (SPR_HID0)); /* Configure power-saving mode */ switch (vers) { case IBM970MP: hid0_hi |= (HID0_DEEPNAP | HID0_NAP | HID0_DPM); hid0_hi &= ~HID0_DOZE; break; default: hid0_hi |= (HID0_NAP | HID0_DPM); hid0_hi &= ~(HID0_DOZE | HID0_DEEPNAP); break; } powerpc_pow_enabled = 1; __asm __volatile (" \ sync; isync; \ sldi %0,%0,32; or %0,%0,%1; \ mtspr %2, %0; \ mfspr %0, %2; mfspr %0, %2; mfspr %0, %2; \ mfspr %0, %2; mfspr %0, %2; mfspr %0, %2; \ sync; isync" :: "r" (hid0_hi), "r"(hid0_lo), "K" (SPR_HID0)); __asm __volatile ("mfspr %0,%1; srdi %0,%0,32;" : "=r" (hid0_hi) : "K" (SPR_HID0)); printf("cpu%d: HID0 %b\n", cpuid, (int)(hid0_hi), HID0_970_BITMASK); #endif cpu_idle_hook = cpu_idle_60x; } static void cpu_powerx_setup(int cpuid, uint16_t vers) { #if defined(__powerpc64__) && defined(AIM) if ((mfmsr() & PSL_HV) == 0) return; + /* Nuke the FSCR, to disable all facilities. */ + mtspr(SPR_FSCR, 0); + /* Configure power-saving */ switch (vers) { case IBMPOWER8: case IBMPOWER8E: + case IBMPOWER8NVL: cpu_idle_hook = cpu_idle_powerx; mtspr(SPR_LPCR, mfspr(SPR_LPCR) | LPCR_PECE_WAKESET); isync(); break; case IBMPOWER9: cpu_idle_hook = cpu_idle_power9; mtspr(SPR_LPCR, mfspr(SPR_LPCR) | LPCR_PECE_WAKESET); isync(); break; default: return; } #endif } static int cpu_feature_bit(SYSCTL_HANDLER_ARGS) { int result; result = (cpu_features & arg2) ? 
1 : 0; return (sysctl_handle_int(oidp, &result, 0, req)); } void cpu_idle(int busy) { sbintime_t sbt = -1; #ifdef INVARIANTS if ((mfmsr() & PSL_EE) != PSL_EE) { struct thread *td = curthread; printf("td msr %#lx\n", (u_long)td->td_md.md_saved_msr); panic("ints disabled in idleproc!"); } #endif CTR2(KTR_SPARE2, "cpu_idle(%d) at %d", busy, curcpu); if (cpu_idle_hook != NULL) { if (!busy) { critical_enter(); sbt = cpu_idleclock(); } cpu_idle_hook(sbt); if (!busy) { cpu_activeclock(); critical_exit(); } } CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done", busy, curcpu); } static void cpu_idle_60x(sbintime_t sbt) { register_t msr; uint16_t vers; if (!powerpc_pow_enabled) return; msr = mfmsr(); vers = mfpvr() >> 16; #ifdef AIM switch (vers) { case IBM970: case IBM970FX: case IBM970MP: case MPC7447A: case MPC7448: case MPC7450: case MPC7455: case MPC7457: __asm __volatile("\ dssall; sync; mtmsr %0; isync" :: "r"(msr | PSL_POW)); break; default: powerpc_sync(); mtmsr(msr | PSL_POW); break; } #endif } #ifdef BOOKE_E500 static void cpu_idle_e500mc(sbintime_t sbt) { /* * Base binutils doesn't know what the 'wait' instruction is, so * use the opcode encoding here. */ __asm __volatile(".long 0x7c00007c"); } #endif static void cpu_idle_booke(sbintime_t sbt) { register_t msr; msr = mfmsr(); #ifdef BOOKE_E500 powerpc_sync(); mtmsr(msr | PSL_WE); #endif } #if defined(__powerpc64__) && defined(AIM) static void cpu_idle_powerx(sbintime_t sbt) { /* Sleeping when running on one cpu gives no advantages - avoid it */ if (smp_started == 0) return; spinlock_enter(); if (sched_runnable()) { spinlock_exit(); return; } if (can_wakeup == 0) can_wakeup = 1; mb(); enter_idle_powerx(); spinlock_exit(); } static void cpu_idle_power9(sbintime_t sbt) { register_t msr; msr = mfmsr(); /* Suspend external interrupts until stop instruction completes. */ mtmsr(msr & ~PSL_EE); /* Set the stop state to lowest latency, wake up to next instruction */ /* Set maximum transition level to 2, for deepest lossless sleep. */ mtspr(SPR_PSSCR, (2 << PSSCR_MTL_S) | (0 << PSSCR_RL_S)); /* "stop" instruction (PowerISA 3.0) */ __asm __volatile (".long 0x4c0002e4"); /* * Re-enable external interrupts to capture the interrupt that caused * the wake up. */ mtmsr(msr); } #endif int cpu_idle_wakeup(int cpu) { return (0); } Index: user/ngie/bug-237403/sys/powerpc/powerpc/exec_machdep.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/exec_machdep.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/exec_machdep.c (revision 346926) @@ -1,1135 +1,1165 @@ /*- * SPDX-License-Identifier: BSD-4-Clause AND BSD-2-Clause-FreeBSD * * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. 
The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 2001 Benno Rice * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
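An aside on the hw.* handles visible in the cpu.c context above: hw.cpu_features and hw.cpu_features2 are exported as opaque u_long values, and hw.floatingpoint / hw.altivec are per-bit aliases served by cpu_feature_bit(). A minimal userland reader, assuming nothing beyond those sysctl names and ordinary sysctlbyname(3), might look like this sketch:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        unsigned long features;
        int fpu, altivec;
        size_t len;

        len = sizeof(features);
        if (sysctlbyname("hw.cpu_features", &features, &len, NULL, 0) == 0)
            printf("hw.cpu_features = %#lx\n", features);

        len = sizeof(fpu);
        if (sysctlbyname("hw.floatingpoint", &fpu, &len, NULL, 0) == 0)
            printf("hardware FPU: %s\n", fpu ? "yes" : "no");

        len = sizeof(altivec);
        if (sysctlbyname("hw.altivec", &altivec, &len, NULL, 0) == 0)
            printf("AltiVec: %s\n", altivec ? "yes" : "no");

        return (0);
    }

On a kernel that lacks any of these nodes the corresponding sysctlbyname() call simply fails and that entry is skipped.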
* $NetBSD: machdep.c,v 1.74.2.1 2000/11/01 16:13:48 tv Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_fpu_emu.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef FPU_EMU #include #endif #ifdef COMPAT_FREEBSD32 #include #include #include typedef struct __ucontext32 { sigset_t uc_sigmask; mcontext32_t uc_mcontext; uint32_t uc_link; struct sigaltstack32 uc_stack; uint32_t uc_flags; uint32_t __spare__[4]; } ucontext32_t; struct sigframe32 { ucontext32_t sf_uc; struct siginfo32 sf_si; }; static int grab_mcontext32(struct thread *td, mcontext32_t *, int flags); #endif static int grab_mcontext(struct thread *, mcontext_t *, int); +static void cleanup_power_extras(struct thread *); + #ifdef __powerpc64__ extern struct sysentvec elf64_freebsd_sysvec_v2; #endif void sendsig(sig_t catcher, ksiginfo_t *ksi, sigset_t *mask) { struct trapframe *tf; struct sigacts *psp; struct sigframe sf; struct thread *td; struct proc *p; #ifdef COMPAT_FREEBSD32 struct siginfo32 siginfo32; struct sigframe32 sf32; #endif size_t sfpsize; caddr_t sfp, usfp; int oonstack, rndfsize; int sig; int code; td = curthread; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); psp = p->p_sigacts; mtx_assert(&psp->ps_mtx, MA_OWNED); tf = td->td_frame; oonstack = sigonstack(tf->fixreg[1]); /* * Fill siginfo structure. */ ksi->ksi_info.si_signo = ksi->ksi_signo; ksi->ksi_info.si_addr = (void *)((tf->exc == EXC_DSI || tf->exc == EXC_DSE) ? tf->dar : tf->srr0); #ifdef COMPAT_FREEBSD32 if (SV_PROC_FLAG(p, SV_ILP32)) { siginfo_to_siginfo32(&ksi->ksi_info, &siginfo32); sig = siginfo32.si_signo; code = siginfo32.si_code; sfp = (caddr_t)&sf32; sfpsize = sizeof(sf32); rndfsize = roundup(sizeof(sf32), 16); /* * Save user context */ memset(&sf32, 0, sizeof(sf32)); grab_mcontext32(td, &sf32.sf_uc.uc_mcontext, 0); sf32.sf_uc.uc_sigmask = *mask; sf32.sf_uc.uc_stack.ss_sp = (uintptr_t)td->td_sigstk.ss_sp; sf32.sf_uc.uc_stack.ss_size = (uint32_t)td->td_sigstk.ss_size; sf32.sf_uc.uc_stack.ss_flags = (td->td_pflags & TDP_ALTSTACK) ? ((oonstack) ? SS_ONSTACK : 0) : SS_DISABLE; sf32.sf_uc.uc_mcontext.mc_onstack = (oonstack) ? 1 : 0; } else { #endif sig = ksi->ksi_signo; code = ksi->ksi_code; sfp = (caddr_t)&sf; sfpsize = sizeof(sf); #ifdef __powerpc64__ /* * 64-bit PPC defines a 288 byte scratch region * below the stack. */ rndfsize = 288 + roundup(sizeof(sf), 48); #else rndfsize = roundup(sizeof(sf), 16); #endif /* * Save user context */ memset(&sf, 0, sizeof(sf)); grab_mcontext(td, &sf.sf_uc.uc_mcontext, 0); sf.sf_uc.uc_sigmask = *mask; sf.sf_uc.uc_stack = td->td_sigstk; sf.sf_uc.uc_stack.ss_flags = (td->td_pflags & TDP_ALTSTACK) ? ((oonstack) ? SS_ONSTACK : 0) : SS_DISABLE; sf.sf_uc.uc_mcontext.mc_onstack = (oonstack) ? 1 : 0; #ifdef COMPAT_FREEBSD32 } #endif CTR4(KTR_SIG, "sendsig: td=%p (%s) catcher=%p sig=%d", td, p->p_comm, catcher, sig); /* * Allocate and validate space for the signal handler context. */ if ((td->td_pflags & TDP_ALTSTACK) != 0 && !oonstack && SIGISMEMBER(psp->ps_sigonstack, sig)) { usfp = (void *)(((uintptr_t)td->td_sigstk.ss_sp + td->td_sigstk.ss_size - rndfsize) & ~0xFul); } else { usfp = (void *)((tf->fixreg[1] - rndfsize) & ~0xFul); } /* * Save the floating-point state, if necessary, then copy it. */ /* XXX */ /* * Set up the registers to return to sigcode. 
* * r1/sp - sigframe ptr * lr - sig function, dispatched to by blrl in trampoline * r3 - sig number * r4 - SIGINFO ? &siginfo : exception code * r5 - user context * srr0 - trampoline function addr */ tf->lr = (register_t)catcher; tf->fixreg[1] = (register_t)usfp; tf->fixreg[FIRSTARG] = sig; #ifdef COMPAT_FREEBSD32 tf->fixreg[FIRSTARG+2] = (register_t)usfp + ((SV_PROC_FLAG(p, SV_ILP32)) ? offsetof(struct sigframe32, sf_uc) : offsetof(struct sigframe, sf_uc)); #else tf->fixreg[FIRSTARG+2] = (register_t)usfp + offsetof(struct sigframe, sf_uc); #endif if (SIGISMEMBER(psp->ps_siginfo, sig)) { /* * Signal handler installed with SA_SIGINFO. */ #ifdef COMPAT_FREEBSD32 if (SV_PROC_FLAG(p, SV_ILP32)) { sf32.sf_si = siginfo32; tf->fixreg[FIRSTARG+1] = (register_t)usfp + offsetof(struct sigframe32, sf_si); sf32.sf_si = siginfo32; } else { #endif tf->fixreg[FIRSTARG+1] = (register_t)usfp + offsetof(struct sigframe, sf_si); sf.sf_si = ksi->ksi_info; #ifdef COMPAT_FREEBSD32 } #endif } else { /* Old FreeBSD-style arguments. */ tf->fixreg[FIRSTARG+1] = code; tf->fixreg[FIRSTARG+3] = (tf->exc == EXC_DSI) ? tf->dar : tf->srr0; } mtx_unlock(&psp->ps_mtx); PROC_UNLOCK(p); tf->srr0 = (register_t)p->p_sysent->sv_sigcode_base; /* * copy the frame out to userland. */ if (copyout(sfp, usfp, sfpsize) != 0) { /* * Process has trashed its stack. Kill it. */ CTR2(KTR_SIG, "sendsig: sigexit td=%p sfp=%p", td, sfp); PROC_LOCK(p); sigexit(td, SIGILL); } CTR3(KTR_SIG, "sendsig: return td=%p pc=%#x sp=%#x", td, tf->srr0, tf->fixreg[1]); PROC_LOCK(p); mtx_lock(&psp->ps_mtx); } int sys_sigreturn(struct thread *td, struct sigreturn_args *uap) { ucontext_t uc; int error; CTR2(KTR_SIG, "sigreturn: td=%p ucp=%p", td, uap->sigcntxp); if (copyin(uap->sigcntxp, &uc, sizeof(uc)) != 0) { CTR1(KTR_SIG, "sigreturn: efault td=%p", td); return (EFAULT); } error = set_mcontext(td, &uc.uc_mcontext); if (error != 0) return (error); kern_sigprocmask(td, SIG_SETMASK, &uc.uc_sigmask, NULL, 0); CTR3(KTR_SIG, "sigreturn: return td=%p pc=%#x sp=%#x", td, uc.uc_mcontext.mc_srr0, uc.uc_mcontext.mc_gpr[1]); return (EJUSTRETURN); } #ifdef COMPAT_FREEBSD4 int freebsd4_sigreturn(struct thread *td, struct freebsd4_sigreturn_args *uap) { return sys_sigreturn(td, (struct sigreturn_args *)uap); } #endif /* * Construct a PCB from a trapframe. This is called from kdb_trap() where * we want to start a backtrace from the function that caused us to enter * the debugger. We have the context in the trapframe, but base the trace * on the PCB. The PCB doesn't have to be perfect, as long as it contains * enough for a backtrace. */ void makectx(struct trapframe *tf, struct pcb *pcb) { pcb->pcb_lr = tf->srr0; pcb->pcb_sp = tf->fixreg[1]; } /* * get_mcontext/sendsig helper routine that doesn't touch the * proc lock */ static int grab_mcontext(struct thread *td, mcontext_t *mcp, int flags) { struct pcb *pcb; int i; pcb = td->td_pcb; memset(mcp, 0, sizeof(mcontext_t)); mcp->mc_vers = _MC_VERSION; mcp->mc_flags = 0; memcpy(&mcp->mc_frame, td->td_frame, sizeof(struct trapframe)); if (flags & GET_MC_CLEAR_RET) { mcp->mc_gpr[3] = 0; mcp->mc_gpr[4] = 0; } /* * This assumes that floating-point context is *not* lazy, * so if the thread has used FP there would have been a * FP-unavailable exception that would have set things up * correctly. 
*/ if (pcb->pcb_flags & PCB_FPREGS) { if (pcb->pcb_flags & PCB_FPU) { KASSERT(td == curthread, ("get_mcontext: fp save not curthread")); critical_enter(); save_fpu(td); critical_exit(); } mcp->mc_flags |= _MC_FP_VALID; memcpy(&mcp->mc_fpscr, &pcb->pcb_fpu.fpscr, sizeof(double)); for (i = 0; i < 32; i++) memcpy(&mcp->mc_fpreg[i], &pcb->pcb_fpu.fpr[i].fpr, sizeof(double)); } if (pcb->pcb_flags & PCB_VSX) { for (i = 0; i < 32; i++) memcpy(&mcp->mc_vsxfpreg[i], &pcb->pcb_fpu.fpr[i].vsr[2], sizeof(double)); } /* * Repeat for Altivec context */ if (pcb->pcb_flags & PCB_VEC) { KASSERT(td == curthread, ("get_mcontext: fp save not curthread")); critical_enter(); save_vec(td); critical_exit(); mcp->mc_flags |= _MC_AV_VALID; mcp->mc_vscr = pcb->pcb_vec.vscr; mcp->mc_vrsave = pcb->pcb_vec.vrsave; memcpy(mcp->mc_avec, pcb->pcb_vec.vr, sizeof(mcp->mc_avec)); } mcp->mc_len = sizeof(*mcp); return (0); } int get_mcontext(struct thread *td, mcontext_t *mcp, int flags) { int error; error = grab_mcontext(td, mcp, flags); if (error == 0) { PROC_LOCK(curthread->td_proc); mcp->mc_onstack = sigonstack(td->td_frame->fixreg[1]); PROC_UNLOCK(curthread->td_proc); } return (error); } int set_mcontext(struct thread *td, mcontext_t *mcp) { struct pcb *pcb; struct trapframe *tf; register_t tls; int i; pcb = td->td_pcb; tf = td->td_frame; if (mcp->mc_vers != _MC_VERSION || mcp->mc_len != sizeof(*mcp)) return (EINVAL); /* * Don't let the user set privileged MSR bits */ if ((mcp->mc_srr1 & psl_userstatic) != (tf->srr1 & psl_userstatic)) { return (EINVAL); } /* Copy trapframe, preserving TLS pointer across context change */ if (SV_PROC_FLAG(td->td_proc, SV_LP64)) tls = tf->fixreg[13]; else tls = tf->fixreg[2]; memcpy(tf, mcp->mc_frame, sizeof(mcp->mc_frame)); if (SV_PROC_FLAG(td->td_proc, SV_LP64)) tf->fixreg[13] = tls; else tf->fixreg[2] = tls; /* Disable FPU */ tf->srr1 &= ~PSL_FP; pcb->pcb_flags &= ~PCB_FPU; if (mcp->mc_flags & _MC_FP_VALID) { /* enable_fpu() will happen lazily on a fault */ pcb->pcb_flags |= PCB_FPREGS; memcpy(&pcb->pcb_fpu.fpscr, &mcp->mc_fpscr, sizeof(double)); bzero(pcb->pcb_fpu.fpr, sizeof(pcb->pcb_fpu.fpr)); for (i = 0; i < 32; i++) { memcpy(&pcb->pcb_fpu.fpr[i].fpr, &mcp->mc_fpreg[i], sizeof(double)); memcpy(&pcb->pcb_fpu.fpr[i].vsr[2], &mcp->mc_vsxfpreg[i], sizeof(double)); } } if (mcp->mc_flags & _MC_AV_VALID) { if ((pcb->pcb_flags & PCB_VEC) != PCB_VEC) { critical_enter(); enable_vec(td); critical_exit(); } pcb->pcb_vec.vscr = mcp->mc_vscr; pcb->pcb_vec.vrsave = mcp->mc_vrsave; memcpy(pcb->pcb_vec.vr, mcp->mc_avec, sizeof(mcp->mc_avec)); } return (0); } /* + * Clean up extra POWER state. Some per-process registers and states are not + * managed by the MSR, so must be cleaned up explicitly on thread exit. + * + * Currently this includes: + * DSCR -- Data stream control register (PowerISA 2.06+) + * FSCR -- Facility Status and Control Register (PowerISA 2.07+) + */ +static void +cleanup_power_extras(struct thread *td) +{ + uint32_t pcb_flags; + + if (td != curthread) + return; + + pcb_flags = td->td_pcb->pcb_flags; + /* Clean up registers not managed by MSR. */ + if (pcb_flags & PCB_CFSCR) + mtspr(SPR_FSCR, 0); + if (pcb_flags & PCB_CDSCR) + mtspr(SPR_DSCRP, 0); +} + +/* * Set set up registers on exec. 
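The cleanup_power_extras() helper added above only resets the SPRs a thread has actually dirtied, as recorded in pcb_flags. A stand-alone sketch of that gating, with made-up flag values and a printf stub in place of mtspr (the real PCB_CDSCR/PCB_CFSCR definitions live in the pcb header), is:

    #include <stdio.h>

    #define PCB_CDSCR  0x40    /* placeholder value, for illustration only */
    #define PCB_CFSCR  0x80    /* placeholder value, for illustration only */

    /* Stub standing in for the kernel's mtspr(). */
    static void
    mtspr_stub(const char *spr, unsigned long val)
    {
        printf("mtspr %s <- %#lx\n", spr, val);
    }

    /*
     * Mirrors the shape of cleanup_power_extras(): only registers the
     * thread actually touched (per pcb_flags) are reset to zero.
     */
    static void
    cleanup_extras(unsigned int pcb_flags)
    {
        if (pcb_flags & PCB_CFSCR)
            mtspr_stub("FSCR", 0);
        if (pcb_flags & PCB_CDSCR)
            mtspr_stub("DSCRP", 0);
    }

    int
    main(void)
    {
        cleanup_extras(PCB_CDSCR);              /* only DSCR is reset */
        cleanup_extras(PCB_CDSCR | PCB_CFSCR);  /* both are reset */
        return (0);
    }

The flags keep exec and thread exit cheap for threads that never touch DSCR or any FSCR-controlled facility.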
*/ void exec_setregs(struct thread *td, struct image_params *imgp, u_long stack) { struct trapframe *tf; register_t argc; tf = trapframe(td); bzero(tf, sizeof *tf); #ifdef __powerpc64__ tf->fixreg[1] = -roundup(-stack + 48, 16); #else tf->fixreg[1] = -roundup(-stack + 8, 16); #endif /* * Set up arguments for _start(): * _start(argc, argv, envp, obj, cleanup, ps_strings); * * Notes: * - obj and cleanup are the auxilliary and termination * vectors. They are fixed up by ld.elf_so. * - ps_strings is a NetBSD extention, and will be * ignored by executables which are strictly * compliant with the SVR4 ABI. */ /* Collect argc from the user stack */ argc = fuword((void *)stack); tf->fixreg[3] = argc; tf->fixreg[4] = stack + sizeof(register_t); tf->fixreg[5] = stack + (2 + argc)*sizeof(register_t); tf->fixreg[6] = 0; /* auxillary vector */ tf->fixreg[7] = 0; /* termination vector */ tf->fixreg[8] = (register_t)imgp->ps_strings; /* NetBSD extension */ tf->srr0 = imgp->entry_addr; #ifdef __powerpc64__ tf->fixreg[12] = imgp->entry_addr; #endif tf->srr1 = psl_userset | PSL_FE_DFLT; + cleanup_power_extras(td); td->td_pcb->pcb_flags = 0; } #ifdef COMPAT_FREEBSD32 void ppc32_setregs(struct thread *td, struct image_params *imgp, u_long stack) { struct trapframe *tf; uint32_t argc; tf = trapframe(td); bzero(tf, sizeof *tf); tf->fixreg[1] = -roundup(-stack + 8, 16); argc = fuword32((void *)stack); tf->fixreg[3] = argc; tf->fixreg[4] = stack + sizeof(uint32_t); tf->fixreg[5] = stack + (2 + argc)*sizeof(uint32_t); tf->fixreg[6] = 0; /* auxillary vector */ tf->fixreg[7] = 0; /* termination vector */ tf->fixreg[8] = (register_t)imgp->ps_strings; /* NetBSD extension */ tf->srr0 = imgp->entry_addr; tf->srr1 = psl_userset32 | PSL_FE_DFLT; + cleanup_power_extras(td); td->td_pcb->pcb_flags = 0; } #endif int fill_regs(struct thread *td, struct reg *regs) { struct trapframe *tf; tf = td->td_frame; memcpy(regs, tf, sizeof(struct reg)); return (0); } int fill_dbregs(struct thread *td, struct dbreg *dbregs) { /* No debug registers on PowerPC */ return (ENOSYS); } int fill_fpregs(struct thread *td, struct fpreg *fpregs) { struct pcb *pcb; int i; pcb = td->td_pcb; if ((pcb->pcb_flags & PCB_FPREGS) == 0) memset(fpregs, 0, sizeof(struct fpreg)); else { memcpy(&fpregs->fpscr, &pcb->pcb_fpu.fpscr, sizeof(double)); for (i = 0; i < 32; i++) memcpy(&fpregs->fpreg[i], &pcb->pcb_fpu.fpr[i].fpr, sizeof(double)); } return (0); } int set_regs(struct thread *td, struct reg *regs) { struct trapframe *tf; tf = td->td_frame; memcpy(tf, regs, sizeof(struct reg)); return (0); } int set_dbregs(struct thread *td, struct dbreg *dbregs) { /* No debug registers on PowerPC */ return (ENOSYS); } int set_fpregs(struct thread *td, struct fpreg *fpregs) { struct pcb *pcb; int i; pcb = td->td_pcb; pcb->pcb_flags |= PCB_FPREGS; memcpy(&pcb->pcb_fpu.fpscr, &fpregs->fpscr, sizeof(double)); for (i = 0; i < 32; i++) { memcpy(&pcb->pcb_fpu.fpr[i].fpr, &fpregs->fpreg[i], sizeof(double)); } return (0); } #ifdef COMPAT_FREEBSD32 int set_regs32(struct thread *td, struct reg32 *regs) { struct trapframe *tf; int i; tf = td->td_frame; for (i = 0; i < 32; i++) tf->fixreg[i] = regs->fixreg[i]; tf->lr = regs->lr; tf->cr = regs->cr; tf->xer = regs->xer; tf->ctr = regs->ctr; tf->srr0 = regs->pc; return (0); } int fill_regs32(struct thread *td, struct reg32 *regs) { struct trapframe *tf; int i; tf = td->td_frame; for (i = 0; i < 32; i++) regs->fixreg[i] = tf->fixreg[i]; regs->lr = tf->lr; regs->cr = tf->cr; regs->xer = tf->xer; regs->ctr = tf->ctr; regs->pc = tf->srr0; 
return (0); } static int grab_mcontext32(struct thread *td, mcontext32_t *mcp, int flags) { mcontext_t mcp64; int i, error; error = grab_mcontext(td, &mcp64, flags); if (error != 0) return (error); mcp->mc_vers = mcp64.mc_vers; mcp->mc_flags = mcp64.mc_flags; mcp->mc_onstack = mcp64.mc_onstack; mcp->mc_len = mcp64.mc_len; memcpy(mcp->mc_avec,mcp64.mc_avec,sizeof(mcp64.mc_avec)); memcpy(mcp->mc_av,mcp64.mc_av,sizeof(mcp64.mc_av)); for (i = 0; i < 42; i++) mcp->mc_frame[i] = mcp64.mc_frame[i]; memcpy(mcp->mc_fpreg,mcp64.mc_fpreg,sizeof(mcp64.mc_fpreg)); memcpy(mcp->mc_vsxfpreg,mcp64.mc_vsxfpreg,sizeof(mcp64.mc_vsxfpreg)); return (0); } static int get_mcontext32(struct thread *td, mcontext32_t *mcp, int flags) { int error; error = grab_mcontext32(td, mcp, flags); if (error == 0) { PROC_LOCK(curthread->td_proc); mcp->mc_onstack = sigonstack(td->td_frame->fixreg[1]); PROC_UNLOCK(curthread->td_proc); } return (error); } static int set_mcontext32(struct thread *td, mcontext32_t *mcp) { mcontext_t mcp64; int i, error; mcp64.mc_vers = mcp->mc_vers; mcp64.mc_flags = mcp->mc_flags; mcp64.mc_onstack = mcp->mc_onstack; mcp64.mc_len = mcp->mc_len; memcpy(mcp64.mc_avec,mcp->mc_avec,sizeof(mcp64.mc_avec)); memcpy(mcp64.mc_av,mcp->mc_av,sizeof(mcp64.mc_av)); for (i = 0; i < 42; i++) mcp64.mc_frame[i] = mcp->mc_frame[i]; mcp64.mc_srr1 |= (td->td_frame->srr1 & 0xFFFFFFFF00000000ULL); memcpy(mcp64.mc_fpreg,mcp->mc_fpreg,sizeof(mcp64.mc_fpreg)); memcpy(mcp64.mc_vsxfpreg,mcp->mc_vsxfpreg,sizeof(mcp64.mc_vsxfpreg)); error = set_mcontext(td, &mcp64); return (error); } #endif #ifdef COMPAT_FREEBSD32 int freebsd32_sigreturn(struct thread *td, struct freebsd32_sigreturn_args *uap) { ucontext32_t uc; int error; CTR2(KTR_SIG, "sigreturn: td=%p ucp=%p", td, uap->sigcntxp); if (copyin(uap->sigcntxp, &uc, sizeof(uc)) != 0) { CTR1(KTR_SIG, "sigreturn: efault td=%p", td); return (EFAULT); } error = set_mcontext32(td, &uc.uc_mcontext); if (error != 0) return (error); kern_sigprocmask(td, SIG_SETMASK, &uc.uc_sigmask, NULL, 0); CTR3(KTR_SIG, "sigreturn: return td=%p pc=%#x sp=%#x", td, uc.uc_mcontext.mc_srr0, uc.uc_mcontext.mc_gpr[1]); return (EJUSTRETURN); } /* * The first two fields of a ucontext_t are the signal mask and the machine * context. The next field is uc_link; we want to avoid destroying the link * when copying out contexts. */ #define UC32_COPY_SIZE offsetof(ucontext32_t, uc_link) int freebsd32_getcontext(struct thread *td, struct freebsd32_getcontext_args *uap) { ucontext32_t uc; int ret; if (uap->ucp == NULL) ret = EINVAL; else { bzero(&uc, sizeof(uc)); get_mcontext32(td, &uc.uc_mcontext, GET_MC_CLEAR_RET); PROC_LOCK(td->td_proc); uc.uc_sigmask = td->td_sigmask; PROC_UNLOCK(td->td_proc); ret = copyout(&uc, uap->ucp, UC32_COPY_SIZE); } return (ret); } int freebsd32_setcontext(struct thread *td, struct freebsd32_setcontext_args *uap) { ucontext32_t uc; int ret; if (uap->ucp == NULL) ret = EINVAL; else { ret = copyin(uap->ucp, &uc, UC32_COPY_SIZE); if (ret == 0) { ret = set_mcontext32(td, &uc.uc_mcontext); if (ret == 0) { kern_sigprocmask(td, SIG_SETMASK, &uc.uc_sigmask, NULL, 0); } } } return (ret == 0 ? 
EJUSTRETURN : ret); } int freebsd32_swapcontext(struct thread *td, struct freebsd32_swapcontext_args *uap) { ucontext32_t uc; int ret; if (uap->oucp == NULL || uap->ucp == NULL) ret = EINVAL; else { bzero(&uc, sizeof(uc)); get_mcontext32(td, &uc.uc_mcontext, GET_MC_CLEAR_RET); PROC_LOCK(td->td_proc); uc.uc_sigmask = td->td_sigmask; PROC_UNLOCK(td->td_proc); ret = copyout(&uc, uap->oucp, UC32_COPY_SIZE); if (ret == 0) { ret = copyin(uap->ucp, &uc, UC32_COPY_SIZE); if (ret == 0) { ret = set_mcontext32(td, &uc.uc_mcontext); if (ret == 0) { kern_sigprocmask(td, SIG_SETMASK, &uc.uc_sigmask, NULL, 0); } } } } return (ret == 0 ? EJUSTRETURN : ret); } #endif void cpu_set_syscall_retval(struct thread *td, int error) { struct proc *p; struct trapframe *tf; int fixup; if (error == EJUSTRETURN) return; p = td->td_proc; tf = td->td_frame; if (tf->fixreg[0] == SYS___syscall && (SV_PROC_FLAG(p, SV_ILP32))) { int code = tf->fixreg[FIRSTARG + 1]; fixup = ( #if defined(COMPAT_FREEBSD6) && defined(SYS_freebsd6_lseek) code != SYS_freebsd6_lseek && #endif code != SYS_lseek) ? 1 : 0; } else fixup = 0; switch (error) { case 0: if (fixup) { /* * 64-bit return, 32-bit syscall. Fixup byte order */ tf->fixreg[FIRSTARG] = 0; tf->fixreg[FIRSTARG + 1] = td->td_retval[0]; } else { tf->fixreg[FIRSTARG] = td->td_retval[0]; tf->fixreg[FIRSTARG + 1] = td->td_retval[1]; } tf->cr &= ~0x10000000; /* Unset summary overflow */ break; case ERESTART: /* * Set user's pc back to redo the system call. */ tf->srr0 -= 4; break; default: tf->fixreg[FIRSTARG] = SV_ABI_ERRNO(p, error); tf->cr |= 0x10000000; /* Set summary overflow */ break; } } /* * Threading functions */ void cpu_thread_exit(struct thread *td) { + cleanup_power_extras(td); } void cpu_thread_clean(struct thread *td) { } void cpu_thread_alloc(struct thread *td) { struct pcb *pcb; pcb = (struct pcb *)((td->td_kstack + td->td_kstack_pages * PAGE_SIZE - sizeof(struct pcb)) & ~0x2fUL); td->td_pcb = pcb; td->td_frame = (struct trapframe *)pcb - 1; } void cpu_thread_free(struct thread *td) { } int cpu_set_user_tls(struct thread *td, void *tls_base) { if (SV_PROC_FLAG(td->td_proc, SV_LP64)) td->td_frame->fixreg[13] = (register_t)tls_base + 0x7010; else td->td_frame->fixreg[2] = (register_t)tls_base + 0x7008; return (0); } void cpu_copy_thread(struct thread *td, struct thread *td0) { struct pcb *pcb2; struct trapframe *tf; struct callframe *cf; pcb2 = td->td_pcb; /* Copy the upcall pcb */ bcopy(td0->td_pcb, pcb2, sizeof(*pcb2)); /* Create a stack for the new thread */ tf = td->td_frame; bcopy(td0->td_frame, tf, sizeof(struct trapframe)); tf->fixreg[FIRSTARG] = 0; tf->fixreg[FIRSTARG + 1] = 0; tf->cr &= ~0x10000000; /* Set registers for trampoline to user mode. */ cf = (struct callframe *)tf - 1; memset(cf, 0, sizeof(struct callframe)); cf->cf_func = (register_t)fork_return; cf->cf_arg0 = (register_t)td; cf->cf_arg1 = (register_t)tf; pcb2->pcb_sp = (register_t)cf; #if defined(__powerpc64__) && (!defined(_CALL_ELF) || _CALL_ELF == 1) pcb2->pcb_lr = ((register_t *)fork_trampoline)[0]; pcb2->pcb_toc = ((register_t *)fork_trampoline)[1]; #else pcb2->pcb_lr = (register_t)fork_trampoline; pcb2->pcb_context[0] = pcb2->pcb_lr; #endif pcb2->pcb_cpu.aim.usr_vsid = 0; #ifdef __SPE__ pcb2->pcb_vec.vscr = SPEFSCR_FINVE | SPEFSCR_FDBZE | SPEFSCR_FUNFE | SPEFSCR_FOVFE; #endif /* Setup to release spin count in fork_exit(). 
*/ td->td_md.md_spinlock_count = 1; td->td_md.md_saved_msr = psl_kernset; } void cpu_set_upcall(struct thread *td, void (*entry)(void *), void *arg, stack_t *stack) { struct trapframe *tf; uintptr_t sp; tf = td->td_frame; /* align stack and alloc space for frame ptr and saved LR */ #ifdef __powerpc64__ sp = ((uintptr_t)stack->ss_sp + stack->ss_size - 48) & ~0x1f; #else sp = ((uintptr_t)stack->ss_sp + stack->ss_size - 8) & ~0x1f; #endif bzero(tf, sizeof(struct trapframe)); tf->fixreg[1] = (register_t)sp; tf->fixreg[3] = (register_t)arg; if (SV_PROC_FLAG(td->td_proc, SV_ILP32)) { tf->srr0 = (register_t)entry; #ifdef __powerpc64__ tf->srr1 = psl_userset32 | PSL_FE_DFLT; #else tf->srr1 = psl_userset | PSL_FE_DFLT; #endif } else { #ifdef __powerpc64__ if (td->td_proc->p_sysent == &elf64_freebsd_sysvec_v2) { tf->srr0 = (register_t)entry; /* ELFv2 ABI requires that the global entry point be in r12. */ tf->fixreg[12] = (register_t)entry; } else { register_t entry_desc[3]; (void)copyin((void *)entry, entry_desc, sizeof(entry_desc)); tf->srr0 = entry_desc[0]; tf->fixreg[2] = entry_desc[1]; tf->fixreg[11] = entry_desc[2]; } tf->srr1 = psl_userset | PSL_FE_DFLT; #endif } td->td_pcb->pcb_flags = 0; #ifdef __SPE__ td->td_pcb->pcb_vec.vscr = SPEFSCR_FINVE | SPEFSCR_FDBZE | SPEFSCR_FUNFE | SPEFSCR_FOVFE; #endif td->td_retval[0] = (register_t)entry; td->td_retval[1] = 0; } static int emulate_mfspr(int spr, int reg, struct trapframe *frame){ struct thread *td; td = curthread; - if (spr == SPR_DSCR) { + if (spr == SPR_DSCR || spr == SPR_DSCRP) { // If DSCR was never set, get the default DSCR if ((td->td_pcb->pcb_flags & PCB_CDSCR) == 0) - td->td_pcb->pcb_dscr = mfspr(SPR_DSCR); + td->td_pcb->pcb_dscr = mfspr(SPR_DSCRP); frame->fixreg[reg] = td->td_pcb->pcb_dscr; frame->srr0 += 4; return 0; } else return SIGILL; } static int emulate_mtspr(int spr, int reg, struct trapframe *frame){ struct thread *td; td = curthread; - if (spr == SPR_DSCR) { + if (spr == SPR_DSCR || spr == SPR_DSCRP) { td->td_pcb->pcb_flags |= PCB_CDSCR; td->td_pcb->pcb_dscr = frame->fixreg[reg]; + mtspr(SPR_DSCRP, frame->fixreg[reg]); frame->srr0 += 4; return 0; } else return SIGILL; } #define XFX 0xFC0007FF int ppc_instr_emulate(struct trapframe *frame, struct thread *td) { struct pcb *pcb; uint32_t instr; int reg, sig; int rs, spr; instr = fuword32((void *)frame->srr0); sig = SIGILL; if ((instr & 0xfc1fffff) == 0x7c1f42a6) { /* mfpvr */ reg = (instr & ~0xfc1fffff) >> 21; frame->fixreg[reg] = mfpvr(); frame->srr0 += 4; return (0); } else if ((instr & XFX) == 0x7c0002a6) { /* mfspr */ rs = (instr & 0x3e00000) >> 21; spr = (instr & 0x1ff800) >> 16; return emulate_mfspr(spr, rs, frame); } else if ((instr & XFX) == 0x7c0003a6) { /* mtspr */ rs = (instr & 0x3e00000) >> 21; spr = (instr & 0x1ff800) >> 16; return emulate_mtspr(spr, rs, frame); } else if ((instr & 0xfc000ffe) == 0x7c0004ac) { /* various sync */ powerpc_sync(); /* Do a heavy-weight sync */ frame->srr0 += 4; return (0); } pcb = td->td_pcb; #ifdef FPU_EMU if (!(pcb->pcb_flags & PCB_FPREGS)) { bzero(&pcb->pcb_fpu, sizeof(pcb->pcb_fpu)); pcb->pcb_flags |= PCB_FPREGS; } else if (pcb->pcb_flags & PCB_FPU) save_fpu(td); sig = fpu_emulate(frame, &pcb->pcb_fpu); if ((sig == 0 || sig == SIGFPE) && pcb->pcb_flags & PCB_FPU) enable_fpu(td); #endif if (sig == SIGILL) { if (pcb->pcb_lastill != frame->srr0) { /* Allow a second chance, in case of cache sync issues. 
*/ sig = 0; pmap_sync_icache(PCPU_GET(curpmap), frame->srr0, 4); pcb->pcb_lastill = frame->srr0; } } return (sig); } Index: user/ngie/bug-237403/sys/powerpc/powerpc/genassym.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/genassym.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/genassym.c (revision 346926) @@ -1,271 +1,281 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1990 The Regents of the University of California. * All rights reserved. * * This code is derived from software contributed to Berkeley by * William Jolitz. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
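Returning to the mfspr/mtspr emulation in exec_machdep.c above: ppc_instr_emulate() matches the XFX form and then pulls the RS/RT and SPR fields out of the instruction word with fixed masks. The stand-alone program below reuses those masks on a hand-assembled "mfspr r5, 3" (user DSCR) purely to show how the fields fall out; note the mask keeps only the low half of the split SPR field, which is sufficient for small SPR numbers such as DSCR (3):

    #include <stdio.h>
    #include <stdint.h>

    /* Field masks copied from ppc_instr_emulate()/emulate_mfspr() above. */
    #define XFX       0xFC0007FF
    #define MFSPR_XFX 0x7C0002A6
    #define MTSPR_XFX 0x7C0003A6

    int
    main(void)
    {
        /* "mfspr r5, 3" (user DSCR); compare 0x7c0802a6, which is "mflr r0". */
        uint32_t instr = 0x7CA302A6;
        int rs, spr;

        if ((instr & XFX) == MFSPR_XFX)
            printf("mfspr ");
        else if ((instr & XFX) == MTSPR_XFX)
            printf("mtspr ");

        rs  = (instr & 0x3e00000) >> 21;    /* RT/RS register field */
        spr = (instr & 0x1ff800) >> 16;     /* low half of the split SPR field */
        printf("r%d, spr %d\n", rs, spr);
        return (0);
    }

Running it prints "mfspr r5, spr 3", matching the arguments the kernel would hand to emulate_mfspr().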
* * from: @(#)genassym.c 5.11 (Berkeley) 5/10/91 * $FreeBSD$ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include ASSYM(PC_CURTHREAD, offsetof(struct pcpu, pc_curthread)); ASSYM(PC_CURPCB, offsetof(struct pcpu, pc_curpcb)); ASSYM(PC_CURPMAP, offsetof(struct pcpu, pc_curpmap)); ASSYM(PC_TEMPSAVE, offsetof(struct pcpu, pc_tempsave)); ASSYM(PC_DISISAVE, offsetof(struct pcpu, pc_disisave)); ASSYM(PC_DBSAVE, offsetof(struct pcpu, pc_dbsave)); ASSYM(PC_RESTORE, offsetof(struct pcpu, pc_restore)); #if defined(BOOKE) ASSYM(PC_BOOKE_CRITSAVE, offsetof(struct pcpu, pc_booke.critsave)); ASSYM(PC_BOOKE_MCHKSAVE, offsetof(struct pcpu, pc_booke.mchksave)); ASSYM(PC_BOOKE_TLBSAVE, offsetof(struct pcpu, pc_booke.tlbsave)); ASSYM(PC_BOOKE_TLB_LEVEL, offsetof(struct pcpu, pc_booke.tlb_level)); ASSYM(PC_BOOKE_TLB_LOCK, offsetof(struct pcpu, pc_booke.tlb_lock)); #endif ASSYM(CPUSAVE_R27, CPUSAVE_R27*sizeof(register_t)); ASSYM(CPUSAVE_R28, CPUSAVE_R28*sizeof(register_t)); ASSYM(CPUSAVE_R29, CPUSAVE_R29*sizeof(register_t)); ASSYM(CPUSAVE_R30, CPUSAVE_R30*sizeof(register_t)); ASSYM(CPUSAVE_R31, CPUSAVE_R31*sizeof(register_t)); ASSYM(CPUSAVE_SRR0, CPUSAVE_SRR0*sizeof(register_t)); ASSYM(CPUSAVE_SRR1, CPUSAVE_SRR1*sizeof(register_t)); ASSYM(CPUSAVE_AIM_DAR, CPUSAVE_AIM_DAR*sizeof(register_t)); ASSYM(CPUSAVE_AIM_DSISR, CPUSAVE_AIM_DSISR*sizeof(register_t)); ASSYM(CPUSAVE_BOOKE_DEAR, CPUSAVE_BOOKE_DEAR*sizeof(register_t)); ASSYM(CPUSAVE_BOOKE_ESR, CPUSAVE_BOOKE_ESR*sizeof(register_t)); ASSYM(BOOKE_CRITSAVE_SRR0, BOOKE_CRITSAVE_SRR0*sizeof(register_t)); ASSYM(BOOKE_CRITSAVE_SRR1, BOOKE_CRITSAVE_SRR1*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_LR, TLBSAVE_BOOKE_LR*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_CR, TLBSAVE_BOOKE_CR*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_SRR0, TLBSAVE_BOOKE_SRR0*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_SRR1, TLBSAVE_BOOKE_SRR1*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R20, TLBSAVE_BOOKE_R20*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R21, TLBSAVE_BOOKE_R21*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R22, TLBSAVE_BOOKE_R22*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R23, TLBSAVE_BOOKE_R23*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R24, TLBSAVE_BOOKE_R24*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R25, TLBSAVE_BOOKE_R25*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R26, TLBSAVE_BOOKE_R26*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R27, TLBSAVE_BOOKE_R27*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R28, TLBSAVE_BOOKE_R28*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R29, TLBSAVE_BOOKE_R29*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R30, TLBSAVE_BOOKE_R30*sizeof(register_t)); ASSYM(TLBSAVE_BOOKE_R31, TLBSAVE_BOOKE_R31*sizeof(register_t)); ASSYM(MTX_LOCK, offsetof(struct mtx, mtx_lock)); #if defined(AIM) ASSYM(USER_ADDR, USER_ADDR); #ifdef __powerpc64__ ASSYM(PC_KERNSLB, offsetof(struct pcpu, pc_aim.slb)); ASSYM(PC_USERSLB, offsetof(struct pcpu, pc_aim.userslb)); ASSYM(PC_SLBSAVE, offsetof(struct pcpu, pc_aim.slbsave)); ASSYM(PC_SLBSTACK, offsetof(struct pcpu, pc_aim.slbstack)); ASSYM(USER_SLB_SLOT, USER_SLB_SLOT); ASSYM(USER_SLB_SLBE, USER_SLB_SLBE); ASSYM(SEGMENT_MASK, SEGMENT_MASK); #else ASSYM(PM_SR, offsetof(struct pmap, pm_sr)); ASSYM(USER_SR, USER_SR); #endif #elif defined(BOOKE) #ifdef __powerpc64__ ASSYM(PM_PP2D, offsetof(struct pmap, pm_pp2d)); #else ASSYM(PM_PDIR, offsetof(struct pmap, pm_pdir)); #endif /* * With pte_t being a bitfield struct, these fields cannot be addressed via * offsetof(). 
*/ ASSYM(PTE_RPN, 0); ASSYM(PTE_FLAGS, sizeof(uint32_t)); #if defined(BOOKE_E500) ASSYM(TLB_ENTRY_SIZE, sizeof(struct tlb_entry)); #endif #endif #ifdef __powerpc64__ ASSYM(FSP, 48); #else ASSYM(FSP, 8); #endif ASSYM(FRAMELEN, FRAMELEN); ASSYM(FRAME_0, offsetof(struct trapframe, fixreg[0])); ASSYM(FRAME_1, offsetof(struct trapframe, fixreg[1])); ASSYM(FRAME_2, offsetof(struct trapframe, fixreg[2])); ASSYM(FRAME_3, offsetof(struct trapframe, fixreg[3])); ASSYM(FRAME_4, offsetof(struct trapframe, fixreg[4])); ASSYM(FRAME_5, offsetof(struct trapframe, fixreg[5])); ASSYM(FRAME_6, offsetof(struct trapframe, fixreg[6])); ASSYM(FRAME_7, offsetof(struct trapframe, fixreg[7])); ASSYM(FRAME_8, offsetof(struct trapframe, fixreg[8])); ASSYM(FRAME_9, offsetof(struct trapframe, fixreg[9])); ASSYM(FRAME_10, offsetof(struct trapframe, fixreg[10])); ASSYM(FRAME_11, offsetof(struct trapframe, fixreg[11])); ASSYM(FRAME_12, offsetof(struct trapframe, fixreg[12])); ASSYM(FRAME_13, offsetof(struct trapframe, fixreg[13])); ASSYM(FRAME_14, offsetof(struct trapframe, fixreg[14])); ASSYM(FRAME_15, offsetof(struct trapframe, fixreg[15])); ASSYM(FRAME_16, offsetof(struct trapframe, fixreg[16])); ASSYM(FRAME_17, offsetof(struct trapframe, fixreg[17])); ASSYM(FRAME_18, offsetof(struct trapframe, fixreg[18])); ASSYM(FRAME_19, offsetof(struct trapframe, fixreg[19])); ASSYM(FRAME_20, offsetof(struct trapframe, fixreg[20])); ASSYM(FRAME_21, offsetof(struct trapframe, fixreg[21])); ASSYM(FRAME_22, offsetof(struct trapframe, fixreg[22])); ASSYM(FRAME_23, offsetof(struct trapframe, fixreg[23])); ASSYM(FRAME_24, offsetof(struct trapframe, fixreg[24])); ASSYM(FRAME_25, offsetof(struct trapframe, fixreg[25])); ASSYM(FRAME_26, offsetof(struct trapframe, fixreg[26])); ASSYM(FRAME_27, offsetof(struct trapframe, fixreg[27])); ASSYM(FRAME_28, offsetof(struct trapframe, fixreg[28])); ASSYM(FRAME_29, offsetof(struct trapframe, fixreg[29])); ASSYM(FRAME_30, offsetof(struct trapframe, fixreg[30])); ASSYM(FRAME_31, offsetof(struct trapframe, fixreg[31])); ASSYM(FRAME_LR, offsetof(struct trapframe, lr)); ASSYM(FRAME_CR, offsetof(struct trapframe, cr)); ASSYM(FRAME_CTR, offsetof(struct trapframe, ctr)); ASSYM(FRAME_XER, offsetof(struct trapframe, xer)); ASSYM(FRAME_SRR0, offsetof(struct trapframe, srr0)); ASSYM(FRAME_SRR1, offsetof(struct trapframe, srr1)); ASSYM(FRAME_EXC, offsetof(struct trapframe, exc)); ASSYM(FRAME_AIM_DAR, offsetof(struct trapframe, dar)); ASSYM(FRAME_AIM_DSISR, offsetof(struct trapframe, cpu.aim.dsisr)); ASSYM(FRAME_BOOKE_DEAR, offsetof(struct trapframe, dar)); ASSYM(FRAME_BOOKE_ESR, offsetof(struct trapframe, cpu.booke.esr)); ASSYM(FRAME_BOOKE_DBCR0, offsetof(struct trapframe, cpu.booke.dbcr0)); ASSYM(CF_FUNC, offsetof(struct callframe, cf_func)); ASSYM(CF_ARG0, offsetof(struct callframe, cf_arg0)); ASSYM(CF_ARG1, offsetof(struct callframe, cf_arg1)); ASSYM(CF_SIZE, sizeof(struct callframe)); ASSYM(PCB_CONTEXT, offsetof(struct pcb, pcb_context)); ASSYM(PCB_CR, offsetof(struct pcb, pcb_cr)); ASSYM(PCB_DSCR, offsetof(struct pcb, pcb_dscr)); +ASSYM(PCB_FSCR, offsetof(struct pcb, pcb_fscr)); +ASSYM(PCB_TAR, offsetof(struct pcb, pcb_tar)); ASSYM(PCB_SP, offsetof(struct pcb, pcb_sp)); ASSYM(PCB_TOC, offsetof(struct pcb, pcb_toc)); ASSYM(PCB_LR, offsetof(struct pcb, pcb_lr)); ASSYM(PCB_ONFAULT, offsetof(struct pcb, pcb_onfault)); ASSYM(PCB_FLAGS, offsetof(struct pcb, pcb_flags)); ASSYM(PCB_FPU, PCB_FPU); ASSYM(PCB_VEC, PCB_VEC); ASSYM(PCB_CDSCR, PCB_CDSCR); +ASSYM(PCB_CFSCR, PCB_CFSCR); ASSYM(PCB_AIM_USR_VSID, 
offsetof(struct pcb, pcb_cpu.aim.usr_vsid)); ASSYM(PCB_BOOKE_DBCR0, offsetof(struct pcb, pcb_cpu.booke.dbcr0)); ASSYM(PCB_VSCR, offsetof(struct pcb, pcb_vec.vscr)); + +ASSYM(PCB_EBB_EBBHR, offsetof(struct pcb, pcb_ebb.ebbhr)); +ASSYM(PCB_EBB_EBBRR, offsetof(struct pcb, pcb_ebb.ebbrr)); +ASSYM(PCB_EBB_BESCR, offsetof(struct pcb, pcb_ebb.bescr)); + +ASSYM(PCB_LMON_LMRR, offsetof(struct pcb, pcb_lm.lmrr)); +ASSYM(PCB_LMON_LMSER, offsetof(struct pcb, pcb_lm.lmser)); ASSYM(TD_LOCK, offsetof(struct thread, td_lock)); ASSYM(TD_PROC, offsetof(struct thread, td_proc)); ASSYM(TD_PCB, offsetof(struct thread, td_pcb)); ASSYM(P_VMSPACE, offsetof(struct proc, p_vmspace)); ASSYM(VM_PMAP, offsetof(struct vmspace, vm_pmap)); ASSYM(TD_FLAGS, offsetof(struct thread, td_flags)); ASSYM(TDF_ASTPENDING, TDF_ASTPENDING); ASSYM(TDF_NEEDRESCHED, TDF_NEEDRESCHED); ASSYM(SF_UC, offsetof(struct sigframe, sf_uc)); ASSYM(DMAP_BASE_ADDRESS, DMAP_BASE_ADDRESS); ASSYM(MAXCOMLEN, MAXCOMLEN); #ifdef __powerpc64__ ASSYM(PSL_CM, PSL_CM); #endif ASSYM(PSL_GS, PSL_GS); ASSYM(PSL_DE, PSL_DE); ASSYM(PSL_DS, PSL_DS); ASSYM(PSL_IS, PSL_IS); ASSYM(PSL_CE, PSL_CE); ASSYM(PSL_UCLE, PSL_UCLE); ASSYM(PSL_WE, PSL_WE); ASSYM(PSL_UBLE, PSL_UBLE); #if defined(AIM) && defined(__powerpc64__) ASSYM(PSL_SF, PSL_SF); ASSYM(PSL_HV, PSL_HV); #endif ASSYM(PSL_POW, PSL_POW); ASSYM(PSL_ILE, PSL_ILE); ASSYM(PSL_LE, PSL_LE); ASSYM(PSL_SE, PSL_SE); ASSYM(PSL_RI, PSL_RI); ASSYM(PSL_DR, PSL_DR); ASSYM(PSL_IP, PSL_IP); ASSYM(PSL_IR, PSL_IR); ASSYM(PSL_FE_DIS, PSL_FE_DIS); ASSYM(PSL_FE_NONREC, PSL_FE_NONREC); ASSYM(PSL_FE_PREC, PSL_FE_PREC); ASSYM(PSL_FE_REC, PSL_FE_REC); ASSYM(PSL_VEC, PSL_VEC); ASSYM(PSL_BE, PSL_BE); ASSYM(PSL_EE, PSL_EE); ASSYM(PSL_FE0, PSL_FE0); ASSYM(PSL_FE1, PSL_FE1); ASSYM(PSL_FP, PSL_FP); ASSYM(PSL_ME, PSL_ME); ASSYM(PSL_PR, PSL_PR); ASSYM(PSL_PMM, PSL_PMM); Index: user/ngie/bug-237403/sys/powerpc/powerpc/swtch32.S =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/swtch32.S (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/swtch32.S (revision 346926) @@ -1,220 +1,218 @@ /* $FreeBSD$ */ /* $NetBSD: locore.S,v 1.24 2000/05/31 05:09:17 thorpej Exp $ */ /*- * Copyright (C) 2001 Benno Rice * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. 
* Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "assym.inc" #include "opt_sched.h" #include #include #include #include #include /* * void cpu_throw(struct thread *old, struct thread *new) */ ENTRY(cpu_throw) mr %r2, %r4 li %r14,0 /* Tell cpu_switchin not to release a thread */ b cpu_switchin /* * void cpu_switch(struct thread *old, * struct thread *new, * struct mutex *mtx); * * Switch to a new thread saving the current state in the old thread. */ ENTRY(cpu_switch) lwz %r6,TD_PCB(%r3) /* Get the old thread's PCB ptr */ stmw %r12,PCB_CONTEXT(%r6) /* Save the non-volatile GP regs. These can now be used for scratch */ mfcr %r16 /* Save the condition register */ stw %r16,PCB_CR(%r6) mflr %r16 /* Save the link register */ stw %r16,PCB_LR(%r6) stw %r1,PCB_SP(%r6) /* Save the stack pointer */ mr %r14,%r3 /* Copy the old thread ptr... */ mr %r2,%r4 /* and the new thread ptr in curthread */ mr %r16,%r5 /* and the new lock */ mr %r17,%r6 /* and the PCB */ - lwz %r7,PCB_FLAGS(%r17) + lwz %r18,PCB_FLAGS(%r17) /* Save FPU context if needed */ - andi. %r7, %r7, PCB_FPU + andi. %r7, %r18, PCB_FPU beq .L1 bl save_fpu .L1: mr %r3,%r14 /* restore old thread ptr */ - lwz %r7,PCB_FLAGS(%r17) /* Save Altivec context if needed */ - andi. %r7, %r7, PCB_VEC + andi. 
%r7, %r18, PCB_VEC beq .L2 bl save_vec .L2: #if defined(__SPE__) mfspr %r3,SPR_SPEFSCR stw %r3,PCB_VSCR(%r17) #endif mr %r3,%r14 /* restore old thread ptr */ bl pmap_deactivate /* Deactivate the current pmap */ sync /* Make sure all of that finished */ cpu_switchin: #if defined(SMP) && defined(SCHED_ULE) /* Wait for the new thread to become unblocked */ bl _GLOBAL_OFFSET_TABLE_@local-4 mflr %r6 lwz %r6,blocked_lock@got(%r6) blocked_loop: lwz %r7,TD_LOCK(%r2) cmpw %r6,%r7 beq- blocked_loop isync #endif lwz %r17,TD_PCB(%r2) /* Get new current PCB */ lwz %r1,PCB_SP(%r17) /* Load new stack pointer */ /* Release old thread now that we have a stack pointer set up */ cmpwi %r14,0 beq- 1f stw %r16,TD_LOCK(%r14) /* ULE: update old thread's lock */ 1: mfsprg %r7,0 /* Get the pcpu pointer */ stw %r2,PC_CURTHREAD(%r7) /* Store new current thread */ lwz %r17,TD_PCB(%r2) /* Store new current PCB */ stw %r17,PC_CURPCB(%r7) mr %r3,%r2 /* Get new thread ptr */ bl pmap_activate /* Activate the new address space */ - lwz %r6, PCB_FLAGS(%r17) + lwz %r19, PCB_FLAGS(%r17) /* Restore FPU context if needed */ - andi. %r6, %r6, PCB_FPU + andi. %r6, %r19, PCB_FPU beq .L3 mr %r3,%r2 /* Pass curthread to enable_fpu */ bl enable_fpu .L3: - lwz %r6, PCB_FLAGS(%r17) /* Restore Altivec context if needed */ - andi. %r6, %r6, PCB_VEC + andi. %r6, %r19, PCB_VEC beq .L4 mr %r3,%r2 /* Pass curthread to enable_vec */ bl enable_vec .L4: #if defined(__SPE__) lwz %r3,PCB_VSCR(%r17) mtspr SPR_SPEFSCR,%r3 #endif /* thread to restore is in r3 */ mr %r3,%r17 /* Recover PCB ptr */ lmw %r12,PCB_CONTEXT(%r3) /* Load the non-volatile GP regs */ lwz %r5,PCB_CR(%r3) /* Load the condition register */ mtcr %r5 lwz %r5,PCB_LR(%r3) /* Load the link register */ mtlr %r5 lwz %r1,PCB_SP(%r3) /* Load the stack pointer */ /* * Perform a dummy stwcx. to clear any reservations we may have * inherited from the previous thread. It doesn't matter if the * stwcx succeeds or not. pcb_context[0] can be clobbered. */ stwcx. %r1, 0, %r3 blr /* * savectx(pcb) * Update pcb, saving current processor state */ ENTRY(savectx) stmw %r12,PCB_CONTEXT(%r3) /* Save the non-volatile GP regs */ mfcr %r4 /* Save the condition register */ stw %r4,PCB_CR(%r3) stw %r1,PCB_SP(%r3) /* Save the stack pointer */ mflr %r4 /* Save the link register */ stw %r4,PCB_LR(%r3) blr /* * fork_trampoline() * Set up the return from cpu_fork() */ ENTRY(fork_trampoline) lwz %r3,CF_FUNC(%r1) lwz %r4,CF_ARG0(%r1) lwz %r5,CF_ARG1(%r1) bl fork_exit addi %r1,%r1,CF_SIZE-FSP /* Allow 8 bytes in front of trapframe to simulate FRAME_SETUP does when allocating space for a frame pointer/saved LR */ #ifdef __SPE__ li %r3,SPEFSCR_FINVE|SPEFSCR_FDBZE|SPEFSCR_FUNFE|SPEFSCR_FOVFE mtspr SPR_SPEFSCR, %r3 #endif b trapexit Index: user/ngie/bug-237403/sys/powerpc/powerpc/swtch64.S =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/swtch64.S (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/swtch64.S (revision 346926) @@ -1,304 +1,359 @@ /* $FreeBSD$ */ /* $NetBSD: locore.S,v 1.24 2000/05/31 05:09:17 thorpej Exp $ */ /*- * Copyright (C) 2001 Benno Rice * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY Benno Rice ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "assym.inc" #include "opt_sched.h" #include #include #include #include #include #ifdef _CALL_ELF .abiversion _CALL_ELF #endif TOC_ENTRY(blocked_lock) /* * void cpu_throw(struct thread *old, struct thread *new) */ ENTRY(cpu_throw) mr %r13, %r4 li %r14,0 /* Tell cpu_switchin not to release a thread */ b cpu_switchin /* * void cpu_switch(struct thread *old, * struct thread *new, * struct mutex *mtx); * * Switch to a new thread saving the current state in the old thread. + * + * Internally clobbers (not visible outside of this file): + * r18 - old thread pcb_flags + * r19 - new thread pcb_flags */ ENTRY(cpu_switch) ld %r6,TD_PCB(%r3) /* Get the old thread's PCB ptr */ std %r12,PCB_CONTEXT(%r6) /* Save the non-volatile GP regs. 
These can now be used for scratch */ std %r14,PCB_CONTEXT+2*8(%r6) std %r15,PCB_CONTEXT+3*8(%r6) std %r16,PCB_CONTEXT+4*8(%r6) std %r17,PCB_CONTEXT+5*8(%r6) std %r18,PCB_CONTEXT+6*8(%r6) std %r19,PCB_CONTEXT+7*8(%r6) std %r20,PCB_CONTEXT+8*8(%r6) std %r21,PCB_CONTEXT+9*8(%r6) std %r22,PCB_CONTEXT+10*8(%r6) std %r23,PCB_CONTEXT+11*8(%r6) std %r24,PCB_CONTEXT+12*8(%r6) std %r25,PCB_CONTEXT+13*8(%r6) std %r26,PCB_CONTEXT+14*8(%r6) std %r27,PCB_CONTEXT+15*8(%r6) std %r28,PCB_CONTEXT+16*8(%r6) std %r29,PCB_CONTEXT+17*8(%r6) std %r30,PCB_CONTEXT+18*8(%r6) std %r31,PCB_CONTEXT+19*8(%r6) mfcr %r16 /* Save the condition register */ std %r16,PCB_CR(%r6) mflr %r16 /* Save the link register */ std %r16,PCB_LR(%r6) std %r1,PCB_SP(%r6) /* Save the stack pointer */ std %r2,PCB_TOC(%r6) /* Save the TOC pointer */ mr %r14,%r3 /* Copy the old thread ptr... */ mr %r13,%r4 /* and the new thread ptr in curthread*/ mr %r16,%r5 /* and the new lock */ mr %r17,%r6 /* and the PCB */ stdu %r1,-48(%r1) - lwz %r7, PCB_FLAGS(%r17) - andi. %r7, %r7, PCB_CDSCR + lwz %r18, PCB_FLAGS(%r17) + andi. %r7, %r18, PCB_CFSCR + beq 1f + mfspr %r6, SPR_FSCR + std %r6, PCB_FSCR(%r17) +save_ebb: + andi. %r0, %r6, FSCR_EBB + beq save_lm + mfspr %r7, SPR_EBBHR + std %r7, PCB_EBB_EBBHR(%r17) + mfspr %r7, SPR_EBBRR + std %r7, PCB_EBB_EBBRR(%r17) + mfspr %r7, SPR_BESCR + std %r7, PCB_EBB_BESCR(%r17) +save_lm: + andi. %r0, %r6, FSCR_LM + beq save_tar + mfspr %r7, SPR_LMRR + std %r7, PCB_LMON_LMRR(%r17) + mfspr %r7, SPR_LMSER + std %r7, PCB_LMON_LMSER(%r17) +save_tar: + andi. %r0, %r6, FSCR_TAR + beq 1f + mfspr %r7, SPR_TAR + std %r7, PCB_TAR(%r17) +1: + andi. %r7, %r18, PCB_CDSCR beq .L0 - /* Custom DSCR was set. Reseting it to enter kernel */ - li %r7, 0x0 - mtspr SPR_DSCR, %r7 + mfspr %r6, SPR_DSCRP + std %r6, PCB_DSCR(%r17) .L0: - lwz %r7,PCB_FLAGS(%r17) /* Save FPU context if needed */ - andi. %r7, %r7, PCB_FPU + andi. %r7, %r18, PCB_FPU beq .L1 bl save_fpu nop .L1: mr %r3,%r14 /* restore old thread ptr */ - lwz %r7,PCB_FLAGS(%r17) /* Save Altivec context if needed */ - andi. %r7, %r7, PCB_VEC + andi. %r7, %r18, PCB_VEC beq .L2 bl save_vec nop .L2: mr %r3,%r14 /* restore old thread ptr */ bl pmap_deactivate /* Deactivate the current pmap */ nop sync /* Make sure all of that finished */ cpu_switchin: #if defined(SMP) && defined(SCHED_ULE) /* Wait for the new thread to become unblocked */ addis %r6,%r2,TOC_REF(blocked_lock)@ha ld %r6,TOC_REF(blocked_lock)@l(%r6) blocked_loop: ld %r7,TD_LOCK(%r13) cmpd %r6,%r7 beq- blocked_loop isync #endif ld %r17,TD_PCB(%r13) /* Get new PCB */ ld %r1,PCB_SP(%r17) /* Load the stack pointer */ addi %r1,%r1,-48 /* Remember about cpu_switch stack frame */ /* Release old thread now that we have a stack pointer set up */ cmpdi %r14,0 beq- 1f std %r16,TD_LOCK(%r14) /* ULE: update old thread's lock */ 1: mfsprg %r7,0 /* Get the pcpu pointer */ std %r13,PC_CURTHREAD(%r7) /* Store new current thread */ ld %r17,TD_PCB(%r13) /* Store new current PCB */ std %r17,PC_CURPCB(%r7) mr %r3,%r13 /* Get new thread ptr */ bl pmap_activate /* Activate the new address space */ nop - lwz %r6, PCB_FLAGS(%r17) + lwz %r19, PCB_FLAGS(%r17) /* Restore FPU context if needed */ - andi. %r6, %r6, PCB_FPU + andi. %r6, %r19, PCB_FPU beq .L3 mr %r3,%r13 /* Pass curthread to enable_fpu */ bl enable_fpu nop .L3: - lwz %r6, PCB_FLAGS(%r17) /* Restore Altivec context if needed */ - andi. %r6, %r6, PCB_VEC + andi. 
%r6, %r19, PCB_VEC beq .L31 mr %r3,%r13 /* Pass curthread to enable_vec */ bl enable_vec nop .L31: - lwz %r6, PCB_FLAGS(%r17) - /* Restore Custom DSCR if needed */ - andi. %r6, %r6, PCB_CDSCR + /* Load custom DSCR on PowerISA 2.06+ CPUs. */ + /* Load changed FSCR on PowerISA 2.07+ CPUs. */ + or %r18,%r18,%r19 + /* Restore Custom DSCR if needed (zeroes if in old but not new) */ + andi. %r6, %r18, PCB_CDSCR + beq .L32 + ld %r7, PCB_DSCR(%r17) /* Load the DSCR register*/ + mtspr SPR_DSCRP, %r7 +.L32: + /* Restore FSCR if needed (zeroes if in old but not new) */ + andi. %r6, %r18, PCB_CFSCR beq .L4 - ld %r6, PCB_DSCR(%r17) /* Load the DSCR register*/ - mtspr SPR_DSCR, %r6 + ld %r7, PCB_FSCR(%r17) /* Load the FSCR register*/ + mtspr SPR_FSCR, %r7 +restore_ebb: + andi. %r0, %r7, FSCR_EBB + beq restore_lm + ld %r6, PCB_EBB_EBBHR(%r17) + mtspr SPR_EBBHR, %r6 + ld %r6, PCB_EBB_EBBRR(%r17) + mtspr SPR_EBBRR, %r6 + ld %r6, PCB_EBB_BESCR(%r17) + mtspr SPR_BESCR, %r6 +restore_lm: + andi. %r0, %r7, FSCR_LM + beq restore_tar + ld %r6, PCB_LMON_LMRR(%r17) + mtspr SPR_LMRR, %r6 + ld %r6, PCB_LMON_LMSER(%r17) + mtspr SPR_LMSER, %r6 +restore_tar: + andi. %r0, %r7, FSCR_TAR + beq .L4 + ld %r6, PCB_TAR(%r17) + mtspr SPR_TAR, %r6 /* thread to restore is in r3 */ .L4: addi %r1,%r1,48 mr %r3,%r17 /* Recover PCB ptr */ ld %r12,PCB_CONTEXT(%r3) /* Load the non-volatile GP regs. */ ld %r14,PCB_CONTEXT+2*8(%r3) ld %r15,PCB_CONTEXT+3*8(%r3) ld %r16,PCB_CONTEXT+4*8(%r3) ld %r17,PCB_CONTEXT+5*8(%r3) ld %r18,PCB_CONTEXT+6*8(%r3) ld %r19,PCB_CONTEXT+7*8(%r3) ld %r20,PCB_CONTEXT+8*8(%r3) ld %r21,PCB_CONTEXT+9*8(%r3) ld %r22,PCB_CONTEXT+10*8(%r3) ld %r23,PCB_CONTEXT+11*8(%r3) ld %r24,PCB_CONTEXT+12*8(%r3) ld %r25,PCB_CONTEXT+13*8(%r3) ld %r26,PCB_CONTEXT+14*8(%r3) ld %r27,PCB_CONTEXT+15*8(%r3) ld %r28,PCB_CONTEXT+16*8(%r3) ld %r29,PCB_CONTEXT+17*8(%r3) ld %r30,PCB_CONTEXT+18*8(%r3) ld %r31,PCB_CONTEXT+19*8(%r3) ld %r5,PCB_CR(%r3) /* Load the condition register */ mtcr %r5 ld %r5,PCB_LR(%r3) /* Load the link register */ mtlr %r5 ld %r1,PCB_SP(%r3) /* Load the stack pointer */ ld %r2,PCB_TOC(%r3) /* Load the TOC pointer */ /* * Perform a dummy stdcx. to clear any reservations we may have * inherited from the previous thread. It doesn't matter if the * stdcx succeeds or not. pcb_context[0] can be clobbered. */ stdcx. %r1, 0, %r3 blr /* * savectx(pcb) * Update pcb, saving current processor state */ ENTRY(savectx) std %r12,PCB_CONTEXT(%r3) /* Save the non-volatile GP regs. 
*/ std %r13,PCB_CONTEXT+1*8(%r3) std %r14,PCB_CONTEXT+2*8(%r3) std %r15,PCB_CONTEXT+3*8(%r3) std %r16,PCB_CONTEXT+4*8(%r3) std %r17,PCB_CONTEXT+5*8(%r3) std %r18,PCB_CONTEXT+6*8(%r3) std %r19,PCB_CONTEXT+7*8(%r3) std %r20,PCB_CONTEXT+8*8(%r3) std %r21,PCB_CONTEXT+9*8(%r3) std %r22,PCB_CONTEXT+10*8(%r3) std %r23,PCB_CONTEXT+11*8(%r3) std %r24,PCB_CONTEXT+12*8(%r3) std %r25,PCB_CONTEXT+13*8(%r3) std %r26,PCB_CONTEXT+14*8(%r3) std %r27,PCB_CONTEXT+15*8(%r3) std %r28,PCB_CONTEXT+16*8(%r3) std %r29,PCB_CONTEXT+17*8(%r3) std %r30,PCB_CONTEXT+18*8(%r3) std %r31,PCB_CONTEXT+19*8(%r3) mfcr %r4 /* Save the condition register */ std %r4,PCB_CR(%r3) std %r1,PCB_SP(%r3) /* Save the stack pointer */ std %r2,PCB_TOC(%r3) /* Save the TOC pointer */ mflr %r4 /* Save the link register */ std %r4,PCB_LR(%r3) blr /* * fork_trampoline() * Set up the return from cpu_fork() */ ENTRY_NOPROF(fork_trampoline) ld %r3,CF_FUNC(%r1) ld %r4,CF_ARG0(%r1) ld %r5,CF_ARG1(%r1) stdu %r1,-48(%r1) bl fork_exit nop addi %r1,%r1,48+CF_SIZE-FSP /* Allow 8 bytes in front of trapframe to simulate FRAME_SETUP does when allocating space for a frame pointer/saved LR */ bl trapexit nop Index: user/ngie/bug-237403/sys/powerpc/powerpc/trap.c =================================================================== --- user/ngie/bug-237403/sys/powerpc/powerpc/trap.c (revision 346925) +++ user/ngie/bug-237403/sys/powerpc/powerpc/trap.c (revision 346926) @@ -1,992 +1,1022 @@ /*- * Copyright (C) 1995, 1996 Wolfgang Solfrank. * Copyright (C) 1995, 1996 TooLs GmbH. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by TooLs GmbH. * 4. The name of TooLs GmbH may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY TOOLS GMBH ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL TOOLS GMBH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
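The cpu_switch()/cpu_switchin changes above save and restore the PowerISA facility state (FSCR plus the EBB, LM and TAR registers) only for threads that have PCB_CFSCR set, and the custom DSCR only for PCB_CDSCR. As a reading aid, here is a rough C rendering of the save-side decision logic; it is an illustrative sketch only (the real implementation is the assembly above), it assumes the kernel's SPR_*/FSCR_*/PCB_* definitions used in the diff, and the struct below is a simplified stand-in for the machine-dependent pcb:

/*
 * Illustrative sketch only: mirrors the FSCR-gated save path added to
 * cpu_switch().  'struct pcb_sketch' is a stand-in, not the kernel's pcb.
 */
struct pcb_sketch {
	int		pcb_flags;		/* PCB_CFSCR, PCB_CDSCR, ... */
	uint64_t	pcb_fscr, pcb_dscr, pcb_tar;
	uint64_t	pcb_ebbhr, pcb_ebbrr, pcb_bescr;
	uint64_t	pcb_lmrr, pcb_lmser;
};

static void
save_facility_state(struct pcb_sketch *pcb)
{
	uint64_t fscr;

	if (pcb->pcb_flags & PCB_CFSCR) {
		fscr = mfspr(SPR_FSCR);		/* which facilities are live */
		pcb->pcb_fscr = fscr;
		if (fscr & FSCR_EBB) {		/* event-based branch state */
			pcb->pcb_ebbhr = mfspr(SPR_EBBHR);
			pcb->pcb_ebbrr = mfspr(SPR_EBBRR);
			pcb->pcb_bescr = mfspr(SPR_BESCR);
		}
		if (fscr & FSCR_LM) {		/* load-monitored state */
			pcb->pcb_lmrr = mfspr(SPR_LMRR);
			pcb->pcb_lmser = mfspr(SPR_LMSER);
		}
		if (fscr & FSCR_TAR)		/* target address register */
			pcb->pcb_tar = mfspr(SPR_TAR);
	}
	if (pcb->pcb_flags & PCB_CDSCR)		/* custom DSCR */
		pcb->pcb_dscr = mfspr(SPR_DSCRP);
	/* The restore side in cpu_switchin mirrors this with mtspr(). */
}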
* * $NetBSD: trap.c,v 1.58 2002/03/04 04:07:35 dbj Exp $ */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* Below matches setjmp.S */ #define FAULTBUF_LR 21 #define FAULTBUF_R1 1 #define FAULTBUF_R2 2 #define FAULTBUF_CR 22 #define FAULTBUF_R14 3 #define MOREARGS(sp) ((caddr_t)((uintptr_t)(sp) + \ sizeof(struct callframe) - 3*sizeof(register_t))) /* more args go here */ static void trap_fatal(struct trapframe *frame); static void printtrap(u_int vector, struct trapframe *frame, int isfatal, int user); static int trap_pfault(struct trapframe *frame, int user); static int fix_unaligned(struct thread *td, struct trapframe *frame); static int handle_onfault(struct trapframe *frame); static void syscall(struct trapframe *frame); #if defined(__powerpc64__) && defined(AIM) void handle_kernel_slb_spill(int, register_t, register_t); static int handle_user_slb_spill(pmap_t pm, vm_offset_t addr); extern int n_slbs; static void normalize_inputs(void); #endif extern vm_offset_t __startkernel; #ifdef KDB int db_trap_glue(struct trapframe *); /* Called from trap_subr.S */ #endif struct powerpc_exception { u_int vector; char *name; }; #ifdef KDTRACE_HOOKS #include int (*dtrace_invop_jump_addr)(struct trapframe *); #endif static struct powerpc_exception powerpc_exceptions[] = { { EXC_CRIT, "critical input" }, { EXC_RST, "system reset" }, { EXC_MCHK, "machine check" }, { EXC_DSI, "data storage interrupt" }, { EXC_DSE, "data segment exception" }, { EXC_ISI, "instruction storage interrupt" }, { EXC_ISE, "instruction segment exception" }, { EXC_EXI, "external interrupt" }, { EXC_ALI, "alignment" }, { EXC_PGM, "program" }, { EXC_HEA, "hypervisor emulation assistance" }, { EXC_FPU, "floating-point unavailable" }, { EXC_APU, "auxiliary proc unavailable" }, { EXC_DECR, "decrementer" }, { EXC_FIT, "fixed-interval timer" }, { EXC_WDOG, "watchdog timer" }, { EXC_SC, "system call" }, { EXC_TRC, "trace" }, { EXC_FPA, "floating-point assist" }, { EXC_DEBUG, "debug" }, { EXC_PERF, "performance monitoring" }, { EXC_VEC, "altivec unavailable" }, { EXC_VSX, "vsx unavailable" }, { EXC_FAC, "facility unavailable" }, { EXC_ITMISS, "instruction tlb miss" }, { EXC_DLMISS, "data load tlb miss" }, { EXC_DSMISS, "data store tlb miss" }, { EXC_BPT, "instruction breakpoint" }, { EXC_SMI, "system management" }, { EXC_VECAST_G4, "altivec assist" }, { EXC_THRM, "thermal management" }, { EXC_RUNMODETRC, "run mode/trace" }, { EXC_SOFT_PATCH, "soft patch exception" }, { EXC_LAST, NULL } }; #define ESR_BITMASK \ "\20" \ "\040b0\037b1\036b2\035b3\034PIL\033PRR\032PTR\031FP" \ "\030ST\027b9\026DLK\025ILK\024b12\023b13\022BO\021PIE" \ "\020b16\017b17\016b18\015b19\014b20\013b21\012b22\011b23" \ "\010SPE\007EPID\006b26\005b27\004b28\003b29\002b30\001b31" #define MCSR_BITMASK \ "\20" \ "\040MCP\037ICERR\036DCERR\035TLBPERR\034L2MMU_MHIT\033b5\032b6\031b7" \ "\030b8\027b9\026b10\025NMI\024MAV\023MEA\022b14\021IF" \ "\020LD\017ST\016LDG\015b19\014b20\013b21\012b22\011b23" \ "\010b24\007b25\006b26\005b27\004b28\003b29\002TLBSYNC\001BSL2_ERR" #define MSSSR_BITMASK \ "\20" \ "\040b0\037b1\036b2\035b3\034b4\033b5\032b6\031b7" \ "\030b8\027b9\026b10\025b11\024b12\023L2TAG\022L2DAT\021L3TAG" \ "\020L3DAT\017APE\016DPE\015TEA\014b20\013b21\012b22\011b23" \ 
"\010b24\007b25\006b26\005b27\004b28\003b29\002b30\001b31" static const char * trapname(u_int vector) { struct powerpc_exception *pe; for (pe = powerpc_exceptions; pe->vector != EXC_LAST; pe++) { if (pe->vector == vector) return (pe->name); } return ("unknown"); } static inline bool frame_is_trap_inst(struct trapframe *frame) { #ifdef AIM return (frame->exc == EXC_PGM && frame->srr1 & EXC_PGM_TRAP); #else return ((frame->cpu.booke.esr & ESR_PTR) != 0); #endif } void trap(struct trapframe *frame) { struct thread *td; struct proc *p; #ifdef KDTRACE_HOOKS uint32_t inst; #endif int sig, type, user; u_int ucode; ksiginfo_t ksi; register_t fscr; VM_CNT_INC(v_trap); #ifdef KDB if (kdb_active) { kdb_reenter(); return; } #endif td = curthread; p = td->td_proc; type = ucode = frame->exc; sig = 0; user = frame->srr1 & PSL_PR; CTR3(KTR_TRAP, "trap: %s type=%s (%s)", td->td_name, trapname(type), user ? "user" : "kernel"); #ifdef KDTRACE_HOOKS /* * A trap can occur while DTrace executes a probe. Before * executing the probe, DTrace blocks re-scheduling and sets * a flag in its per-cpu flags to indicate that it doesn't * want to fault. On returning from the probe, the no-fault * flag is cleared and finally re-scheduling is enabled. * * If the DTrace kernel module has registered a trap handler, * call it and if it returns non-zero, assume that it has * handled the trap and modified the trap frame so that this * function can return normally. */ if (dtrace_trap_func != NULL && (*dtrace_trap_func)(frame, type) != 0) return; #endif if (user) { td->td_pticks = 0; td->td_frame = frame; if (td->td_cowgen != p->p_cowgen) thread_cow_update(td); /* User Mode Traps */ switch (type) { case EXC_RUNMODETRC: case EXC_TRC: frame->srr1 &= ~PSL_SE; sig = SIGTRAP; ucode = TRAP_TRACE; break; #if defined(__powerpc64__) && defined(AIM) case EXC_ISE: case EXC_DSE: if (handle_user_slb_spill(&p->p_vmspace->vm_pmap, (type == EXC_ISE) ? 
frame->srr0 : frame->dar) != 0){ sig = SIGSEGV; ucode = SEGV_MAPERR; } break; #endif case EXC_DSI: case EXC_ISI: sig = trap_pfault(frame, 1); if (sig == SIGSEGV) ucode = SEGV_MAPERR; break; case EXC_SC: syscall(frame); break; case EXC_FPU: KASSERT((td->td_pcb->pcb_flags & PCB_FPU) != PCB_FPU, ("FPU already enabled for thread")); enable_fpu(td); break; case EXC_VEC: KASSERT((td->td_pcb->pcb_flags & PCB_VEC) != PCB_VEC, ("Altivec already enabled for thread")); enable_vec(td); break; case EXC_VSX: KASSERT((td->td_pcb->pcb_flags & PCB_VSX) != PCB_VSX, ("VSX already enabled for thread")); if (!(td->td_pcb->pcb_flags & PCB_VEC)) enable_vec(td); if (!(td->td_pcb->pcb_flags & PCB_FPU)) save_fpu(td); td->td_pcb->pcb_flags |= PCB_VSX; enable_fpu(td); break; case EXC_FAC: fscr = mfspr(SPR_FSCR); - if ((fscr & FSCR_IC_MASK) == FSCR_IC_HTM) { - CTR0(KTR_TRAP, "Hardware Transactional Memory subsystem disabled"); + switch (fscr & FSCR_IC_MASK) { + case FSCR_IC_HTM: + CTR0(KTR_TRAP, + "Hardware Transactional Memory subsystem disabled"); + sig = SIGILL; + ucode = ILL_ILLOPC; + break; + case FSCR_IC_DSCR: + td->td_pcb->pcb_flags |= PCB_CFSCR | PCB_CDSCR; + fscr |= FSCR_DSCR; + mtspr(SPR_DSCR, 0); + break; + case FSCR_IC_EBB: + td->td_pcb->pcb_flags |= PCB_CFSCR; + fscr |= FSCR_EBB; + mtspr(SPR_EBBHR, 0); + mtspr(SPR_EBBRR, 0); + mtspr(SPR_BESCR, 0); + break; + case FSCR_IC_TAR: + td->td_pcb->pcb_flags |= PCB_CFSCR; + fscr |= FSCR_TAR; + mtspr(SPR_TAR, 0); + break; + case FSCR_IC_LM: + td->td_pcb->pcb_flags |= PCB_CFSCR; + fscr |= FSCR_LM; + mtspr(SPR_LMRR, 0); + mtspr(SPR_LMSER, 0); + break; + default: + sig = SIGILL; + ucode = ILL_ILLOPC; } - sig = SIGILL; - ucode = ILL_ILLOPC; + mtspr(SPR_FSCR, fscr & ~FSCR_IC_MASK); break; case EXC_HEA: sig = SIGILL; ucode = ILL_ILLOPC; break; case EXC_VECAST_E: case EXC_VECAST_G4: case EXC_VECAST_G5: /* * We get a VPU assist exception for IEEE mode * vector operations on denormalized floats. * Emulating this is a giant pain, so for now, * just switch off IEEE mode and treat them as * zero. */ save_vec(td); td->td_pcb->pcb_vec.vscr |= ALTIVEC_VSCR_NJ; enable_vec(td); break; case EXC_ALI: if (fix_unaligned(td, frame) != 0) { sig = SIGBUS; ucode = BUS_ADRALN; } else frame->srr0 += 4; break; case EXC_DEBUG: /* Single stepping */ mtspr(SPR_DBSR, mfspr(SPR_DBSR)); frame->srr1 &= ~PSL_DE; frame->cpu.booke.dbcr0 &= ~(DBCR0_IDM | DBCR0_IC); sig = SIGTRAP; ucode = TRAP_TRACE; break; case EXC_PGM: /* Identify the trap reason */ if (frame_is_trap_inst(frame)) { #ifdef KDTRACE_HOOKS inst = fuword32((const void *)frame->srr0); if (inst == 0x0FFFDDDD && dtrace_pid_probe_ptr != NULL) { (*dtrace_pid_probe_ptr)(frame); break; } #endif sig = SIGTRAP; ucode = TRAP_BRKPT; } else { sig = ppc_instr_emulate(frame, td); if (sig == SIGILL) { if (frame->srr1 & EXC_PGM_PRIV) ucode = ILL_PRVOPC; else if (frame->srr1 & EXC_PGM_ILLEGAL) ucode = ILL_ILLOPC; } else if (sig == SIGFPE) ucode = FPE_FLTINV; /* Punt for now, invalid operation. */ } break; case EXC_MCHK: /* * Note that this may not be recoverable for the user * process, depending on the type of machine check, * but it at least prevents the kernel from dying. */ sig = SIGBUS; ucode = BUS_OBJERR; break; #if defined(__powerpc64__) && defined(AIM) case EXC_SOFT_PATCH: /* * Point to the instruction that generated the exception to execute it again, * and normalize the register values. 
*/ frame->srr0 -= 4; normalize_inputs(); break; #endif default: trap_fatal(frame); } } else { /* Kernel Mode Traps */ KASSERT(cold || td->td_ucred != NULL, ("kernel trap doesn't have ucred")); switch (type) { case EXC_PGM: #ifdef KDTRACE_HOOKS if (frame_is_trap_inst(frame)) { if (*(uint32_t *)frame->srr0 == EXC_DTRACE) { if (dtrace_invop_jump_addr != NULL) { dtrace_invop_jump_addr(frame); return; } } } #endif #ifdef KDB if (db_trap_glue(frame)) return; #endif break; #if defined(__powerpc64__) && defined(AIM) case EXC_DSE: if (td->td_pcb->pcb_cpu.aim.usr_vsid != 0 && (frame->dar & SEGMENT_MASK) == USER_ADDR) { __asm __volatile ("slbmte %0, %1" :: "r"(td->td_pcb->pcb_cpu.aim.usr_vsid), "r"(USER_SLB_SLBE)); return; } break; #endif case EXC_DSI: if (trap_pfault(frame, 0) == 0) return; break; case EXC_MCHK: if (handle_onfault(frame)) return; break; default: break; } trap_fatal(frame); } if (sig != 0) { if (p->p_sysent->sv_transtrap != NULL) sig = (p->p_sysent->sv_transtrap)(sig, type); ksiginfo_init_trap(&ksi); ksi.ksi_signo = sig; ksi.ksi_code = (int) ucode; /* XXX, not POSIX */ ksi.ksi_addr = (void *)frame->srr0; ksi.ksi_trapno = type; trapsignal(td, &ksi); } userret(td, frame); } static void trap_fatal(struct trapframe *frame) { #ifdef KDB bool handled; #endif printtrap(frame->exc, frame, 1, (frame->srr1 & PSL_PR)); #ifdef KDB if (debugger_on_trap) { kdb_why = KDB_WHY_TRAP; handled = kdb_trap(frame->exc, 0, frame); kdb_why = KDB_WHY_UNSET; if (handled) return; } #endif panic("%s trap", trapname(frame->exc)); } static void cpu_printtrap(u_int vector, struct trapframe *frame, int isfatal, int user) { #ifdef AIM uint16_t ver; switch (vector) { case EXC_DSE: case EXC_DSI: case EXC_DTMISS: printf(" dsisr = 0x%lx\n", (u_long)frame->cpu.aim.dsisr); break; case EXC_MCHK: ver = mfpvr() >> 16; if (MPC745X_P(ver)) printf(" msssr0 = 0x%b\n", (int)mfspr(SPR_MSSSR0), MSSSR_BITMASK); break; } #elif defined(BOOKE) vm_paddr_t pa; switch (vector) { case EXC_MCHK: pa = mfspr(SPR_MCARU); pa = (pa << 32) | (u_register_t)mfspr(SPR_MCAR); printf(" mcsr = 0x%b\n", (int)mfspr(SPR_MCSR), MCSR_BITMASK); printf(" mcar = 0x%jx\n", (uintmax_t)pa); } printf(" esr = 0x%b\n", (int)frame->cpu.booke.esr, ESR_BITMASK); #endif } static void printtrap(u_int vector, struct trapframe *frame, int isfatal, int user) { printf("\n"); printf("%s %s trap:\n", isfatal ? "fatal" : "handled", user ? "user" : "kernel"); printf("\n"); printf(" exception = 0x%x (%s)\n", vector, trapname(vector)); switch (vector) { case EXC_DSE: case EXC_DSI: case EXC_DTMISS: case EXC_ALI: printf(" virtual address = 0x%" PRIxPTR "\n", frame->dar); break; case EXC_ISE: case EXC_ISI: case EXC_ITMISS: printf(" virtual address = 0x%" PRIxPTR "\n", frame->srr0); break; case EXC_MCHK: break; } cpu_printtrap(vector, frame, isfatal, user); printf(" srr0 = 0x%" PRIxPTR " (0x%" PRIxPTR ")\n", frame->srr0, frame->srr0 - (register_t)(__startkernel - KERNBASE)); printf(" srr1 = 0x%lx\n", (u_long)frame->srr1); printf(" current msr = 0x%" PRIxPTR "\n", mfmsr()); printf(" lr = 0x%" PRIxPTR " (0x%" PRIxPTR ")\n", frame->lr, frame->lr - (register_t)(__startkernel - KERNBASE)); printf(" frame = %p\n", frame); printf(" curthread = %p\n", curthread); if (curthread != NULL) printf(" pid = %d, comm = %s\n", curthread->td_proc->p_pid, curthread->td_name); printf("\n"); } /* * Handles a fatal fault when we have onfault state to recover. Returns * non-zero if there was onfault recovery state available. 
*/ static int handle_onfault(struct trapframe *frame) { struct thread *td; jmp_buf *fb; td = curthread; fb = td->td_pcb->pcb_onfault; if (fb != NULL) { frame->srr0 = (*fb)->_jb[FAULTBUF_LR]; frame->fixreg[1] = (*fb)->_jb[FAULTBUF_R1]; frame->fixreg[2] = (*fb)->_jb[FAULTBUF_R2]; frame->fixreg[3] = 1; frame->cr = (*fb)->_jb[FAULTBUF_CR]; bcopy(&(*fb)->_jb[FAULTBUF_R14], &frame->fixreg[14], 18 * sizeof(register_t)); td->td_pcb->pcb_onfault = NULL; /* Returns twice, not thrice */ return (1); } return (0); } int cpu_fetch_syscall_args(struct thread *td) { struct proc *p; struct trapframe *frame; struct syscall_args *sa; caddr_t params; size_t argsz; int error, n, i; p = td->td_proc; frame = td->td_frame; sa = &td->td_sa; sa->code = frame->fixreg[0]; params = (caddr_t)(frame->fixreg + FIRSTARG); n = NARGREG; if (sa->code == SYS_syscall) { /* * code is first argument, * followed by actual args. */ sa->code = *(register_t *) params; params += sizeof(register_t); n -= 1; } else if (sa->code == SYS___syscall) { /* * Like syscall, but code is a quad, * so as to maintain quad alignment * for the rest of the args. */ if (SV_PROC_FLAG(p, SV_ILP32)) { params += sizeof(register_t); sa->code = *(register_t *) params; params += sizeof(register_t); n -= 2; } else { sa->code = *(register_t *) params; params += sizeof(register_t); n -= 1; } } if (sa->code >= p->p_sysent->sv_size) sa->callp = &p->p_sysent->sv_table[0]; else sa->callp = &p->p_sysent->sv_table[sa->code]; sa->narg = sa->callp->sy_narg; if (SV_PROC_FLAG(p, SV_ILP32)) { argsz = sizeof(uint32_t); for (i = 0; i < n; i++) sa->args[i] = ((u_register_t *)(params))[i] & 0xffffffff; } else { argsz = sizeof(uint64_t); for (i = 0; i < n; i++) sa->args[i] = ((u_register_t *)(params))[i]; } if (sa->narg > n) error = copyin(MOREARGS(frame->fixreg[1]), sa->args + n, (sa->narg - n) * argsz); else error = 0; #ifdef __powerpc64__ if (SV_PROC_FLAG(p, SV_ILP32) && sa->narg > n) { /* Expand the size of arguments copied from the stack */ for (i = sa->narg; i >= n; i--) sa->args[i] = ((uint32_t *)(&sa->args[n]))[i-n]; } #endif if (error == 0) { td->td_retval[0] = 0; td->td_retval[1] = frame->fixreg[FIRSTARG + 1]; } return (error); } #include "../../kern/subr_syscall.c" void syscall(struct trapframe *frame) { struct thread *td; int error; td = curthread; td->td_frame = frame; #if defined(__powerpc64__) && defined(AIM) /* * Speculatively restore last user SLB segment, which we know is * invalid already, since we are likely to do copyin()/copyout(). */ if (td->td_pcb->pcb_cpu.aim.usr_vsid != 0) __asm __volatile ("slbmte %0, %1; isync" :: "r"(td->td_pcb->pcb_cpu.aim.usr_vsid), "r"(USER_SLB_SLBE)); #endif error = syscallenter(td); syscallret(td, error); } #if defined(__powerpc64__) && defined(AIM) /* Handle kernel SLB faults -- runs in real mode, all seat belts off */ void handle_kernel_slb_spill(int type, register_t dar, register_t srr0) { struct slb *slbcache; uint64_t slbe, slbv; uint64_t esid, addr; int i; addr = (type == EXC_ISE) ? 
srr0 : dar; slbcache = PCPU_GET(aim.slb); esid = (uintptr_t)addr >> ADDR_SR_SHFT; slbe = (esid << SLBE_ESID_SHIFT) | SLBE_VALID; /* See if the hardware flushed this somehow (can happen in LPARs) */ for (i = 0; i < n_slbs; i++) if (slbcache[i].slbe == (slbe | (uint64_t)i)) return; /* Not in the map, needs to actually be added */ slbv = kernel_va_to_slbv(addr); if (slbcache[USER_SLB_SLOT].slbe == 0) { for (i = 0; i < n_slbs; i++) { if (i == USER_SLB_SLOT) continue; if (!(slbcache[i].slbe & SLBE_VALID)) goto fillkernslb; } if (i == n_slbs) slbcache[USER_SLB_SLOT].slbe = 1; } /* Sacrifice a random SLB entry that is not the user entry */ i = mftb() % n_slbs; if (i == USER_SLB_SLOT) i = (i+1) % n_slbs; fillkernslb: /* Write new entry */ slbcache[i].slbv = slbv; slbcache[i].slbe = slbe | (uint64_t)i; /* Trap handler will restore from cache on exit */ } static int handle_user_slb_spill(pmap_t pm, vm_offset_t addr) { struct slb *user_entry; uint64_t esid; int i; if (pm->pm_slb == NULL) return (-1); esid = (uintptr_t)addr >> ADDR_SR_SHFT; PMAP_LOCK(pm); user_entry = user_va_to_slb_entry(pm, addr); if (user_entry == NULL) { /* allocate_vsid auto-spills it */ (void)allocate_user_vsid(pm, esid, 0); } else { /* * Check that another CPU has not already mapped this. * XXX: Per-thread SLB caches would be better. */ for (i = 0; i < pm->pm_slb_len; i++) if (pm->pm_slb[i] == user_entry) break; if (i == pm->pm_slb_len) slb_insert_user(pm, user_entry); } PMAP_UNLOCK(pm); return (0); } #endif static int trap_pfault(struct trapframe *frame, int user) { vm_offset_t eva, va; struct thread *td; struct proc *p; vm_map_t map; vm_prot_t ftype; int rv, is_user; td = curthread; p = td->td_proc; if (frame->exc == EXC_ISI) { eva = frame->srr0; ftype = VM_PROT_EXECUTE; if (frame->srr1 & SRR1_ISI_PFAULT) ftype |= VM_PROT_READ; } else { eva = frame->dar; #ifdef BOOKE if (frame->cpu.booke.esr & ESR_ST) #else if (frame->cpu.aim.dsisr & DSISR_STORE) #endif ftype = VM_PROT_WRITE; else ftype = VM_PROT_READ; } if (user) { KASSERT(p->p_vmspace != NULL, ("trap_pfault: vmspace NULL")); map = &p->p_vmspace->vm_map; } else { rv = pmap_decode_kernel_ptr(eva, &is_user, &eva); if (rv != 0) return (SIGSEGV); if (is_user) map = &p->p_vmspace->vm_map; else map = kernel_map; } va = trunc_page(eva); /* Fault in the page. */ rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL); /* * XXXDTRACE: add dtrace_doubletrap_func here? */ if (rv == KERN_SUCCESS) return (0); if (!user && handle_onfault(frame)) return (0); return (SIGSEGV); } /* * For now, this only deals with the particular unaligned access case * that gcc tends to generate. Eventually it should handle all of the * possibilities that can happen on a 32-bit PowerPC in big-endian mode. */ static int fix_unaligned(struct thread *td, struct trapframe *frame) { struct thread *fputhread; #ifdef __SPE__ uint32_t inst; #endif int indicator, reg; double *fpr; #ifdef __SPE__ indicator = (frame->cpu.booke.esr & (ESR_ST|ESR_SPE)); if (indicator & ESR_SPE) { if (copyin((void *)frame->srr0, &inst, sizeof(inst)) != 0) return (-1); reg = EXC_ALI_SPE_REG(inst); fpr = (double *)td->td_pcb->pcb_vec.vr[reg]; fputhread = PCPU_GET(vecthread); /* Juggle the SPE to ensure that we've initialized * the registers, and that their current state is in * the PCB. 
*/ if (fputhread != td) { if (fputhread) save_vec(fputhread); enable_vec(td); } save_vec(td); if (!(indicator & ESR_ST)) { if (copyin((void *)frame->dar, fpr, sizeof(double)) != 0) return (-1); frame->fixreg[reg] = td->td_pcb->pcb_vec.vr[reg][1]; enable_vec(td); } else { td->td_pcb->pcb_vec.vr[reg][1] = frame->fixreg[reg]; if (copyout(fpr, (void *)frame->dar, sizeof(double)) != 0) return (-1); } return (0); } #else indicator = EXC_ALI_OPCODE_INDICATOR(frame->cpu.aim.dsisr); switch (indicator) { case EXC_ALI_LFD: case EXC_ALI_STFD: reg = EXC_ALI_RST(frame->cpu.aim.dsisr); fpr = &td->td_pcb->pcb_fpu.fpr[reg].fpr; fputhread = PCPU_GET(fputhread); /* Juggle the FPU to ensure that we've initialized * the FPRs, and that their current state is in * the PCB. */ if (fputhread != td) { if (fputhread) save_fpu(fputhread); enable_fpu(td); } save_fpu(td); if (indicator == EXC_ALI_LFD) { if (copyin((void *)frame->dar, fpr, sizeof(double)) != 0) return (-1); enable_fpu(td); } else { if (copyout(fpr, (void *)frame->dar, sizeof(double)) != 0) return (-1); } return (0); break; } #endif return (-1); } #if defined(__powerpc64__) && defined(AIM) #define MSKNSHL(x, m, n) "(((" #x ") & " #m ") << " #n ")" #define MSKNSHR(x, m, n) "(((" #x ") & " #m ") >> " #n ")" /* xvcpsgndp instruction, built in opcode format. * This can be changed to use mnemonic after a toolchain update. */ #define XVCPSGNDP(xt, xa, xb) \ __asm __volatile(".long (" \ MSKNSHL(60, 0x3f, 26) " | " \ MSKNSHL(xt, 0x1f, 21) " | " \ MSKNSHL(xa, 0x1f, 16) " | " \ MSKNSHL(xb, 0x1f, 11) " | " \ MSKNSHL(240, 0xff, 3) " | " \ MSKNSHR(xa, 0x20, 3) " | " \ MSKNSHR(xa, 0x20, 4) " | " \ MSKNSHR(xa, 0x20, 5) ")") /* Macros to normalize 1 or 10 VSX registers */ #define NORM(x) XVCPSGNDP(x, x, x) #define NORM10(x) \ NORM(x ## 0); NORM(x ## 1); NORM(x ## 2); NORM(x ## 3); NORM(x ## 4); \ NORM(x ## 5); NORM(x ## 6); NORM(x ## 7); NORM(x ## 8); NORM(x ## 9) static void normalize_inputs(void) { unsigned long msr; /* enable VSX */ msr = mfmsr(); mtmsr(msr | PSL_VSX); NORM(0); NORM(1); NORM(2); NORM(3); NORM(4); NORM(5); NORM(6); NORM(7); NORM(8); NORM(9); NORM10(1); NORM10(2); NORM10(3); NORM10(4); NORM10(5); NORM(60); NORM(61); NORM(62); NORM(63); /* restore MSR */ mtmsr(msr); } #endif #ifdef KDB int db_trap_glue(struct trapframe *frame) { if (!(frame->srr1 & PSL_PR) && (frame->exc == EXC_TRC || frame->exc == EXC_RUNMODETRC || frame_is_trap_inst(frame) || frame->exc == EXC_BPT || frame->exc == EXC_DEBUG || frame->exc == EXC_DSI)) { int type = frame->exc; /* Ignore DTrace traps. */ if (*(uint32_t *)frame->srr0 == EXC_DTRACE) return (0); if (frame_is_trap_inst(frame)) { type = T_BREAKPOINT; } return (kdb_trap(type, 0, frame)); } return (0); } #endif Index: user/ngie/bug-237403/sys/riscv/riscv/plic.c =================================================================== --- user/ngie/bug-237403/sys/riscv/riscv/plic.c (revision 346925) +++ user/ngie/bug-237403/sys/riscv/riscv/plic.c (revision 346926) @@ -1,272 +1,298 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2018 Ruslan Bukin * All rights reserved. * * This software was developed by SRI International and the University of * Cambridge Computer Laboratory (Department of Computer Science and * Technology) under DARPA contract HR0011-18-C-0016 ("ECATS"), as part of the * DARPA SSITH research programme. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include "pic_if.h" #define PLIC_MAX_IRQS 2048 #define PLIC_PRIORITY(n) (0x000000 + (n) * 0x4) #define PLIC_ENABLE(n, h) (0x002000 + (h) * 0x80 + 4 * ((n) / 32)) #define PLIC_THRESHOLD(h) (0x200000 + (h) * 0x1000 + 0x0) #define PLIC_CLAIM(h) (0x200000 + (h) * 0x1000 + 0x4) struct plic_irqsrc { struct intr_irqsrc isrc; u_int irq; }; struct plic_softc { device_t dev; struct resource * intc_res; struct plic_irqsrc isrcs[PLIC_MAX_IRQS]; int ndev; }; #define RD4(sc, reg) \ bus_read_4(sc->intc_res, (reg)) #define WR4(sc, reg, val) \ bus_write_4(sc->intc_res, (reg), (val)) static inline void plic_irq_dispatch(struct plic_softc *sc, u_int irq, struct trapframe *tf) { struct plic_irqsrc *src; src = &sc->isrcs[irq]; if (intr_isrc_dispatch(&src->isrc, tf) != 0) device_printf(sc->dev, "Stray irq %u detected\n", irq); } static int plic_intr(void *arg) { struct plic_softc *sc; struct trapframe *tf; uint32_t pending; uint32_t cpu; sc = arg; cpu = PCPU_GET(cpuid); pending = RD4(sc, PLIC_CLAIM(cpu)); if (pending) { tf = curthread->td_intr_frame; plic_irq_dispatch(sc, pending, tf); WR4(sc, PLIC_CLAIM(cpu), pending); } return (FILTER_HANDLED); } static void plic_disable_intr(device_t dev, struct intr_irqsrc *isrc) { struct plic_softc *sc; struct plic_irqsrc *src; uint32_t reg; uint32_t cpu; sc = device_get_softc(dev); src = (struct plic_irqsrc *)isrc; cpu = PCPU_GET(cpuid); reg = RD4(sc, PLIC_ENABLE(src->irq, cpu)); reg &= ~(1 << (src->irq % 32)); WR4(sc, PLIC_ENABLE(src->irq, cpu), reg); } static void plic_enable_intr(device_t dev, struct intr_irqsrc *isrc) { struct plic_softc *sc; struct plic_irqsrc *src; uint32_t reg; uint32_t cpu; sc = device_get_softc(dev); src = (struct plic_irqsrc *)isrc; WR4(sc, PLIC_PRIORITY(src->irq), 1); cpu = PCPU_GET(cpuid); reg = RD4(sc, PLIC_ENABLE(src->irq, cpu)); reg |= (1 << (src->irq % 32)); WR4(sc, PLIC_ENABLE(src->irq, cpu), reg); } static int plic_map_intr(device_t dev, struct intr_map_data *data, struct intr_irqsrc **isrcp) { struct intr_map_data_fdt *daf; struct plic_softc *sc; sc = device_get_softc(dev); if (data->type != INTR_MAP_DATA_FDT) return (ENOTSUP); daf = (struct intr_map_data_fdt *)data; if (daf->ncells != 1 || daf->cells[0] > sc->ndev) return (EINVAL); *isrcp = &sc->isrcs[daf->cells[0]].isrc; return (0); } static int 
plic_probe(device_t dev) { if (!ofw_bus_status_okay(dev)) return (ENXIO); if (!ofw_bus_is_compatible(dev, "riscv,plic0")) return (ENXIO); device_set_desc(dev, "RISC-V PLIC"); return (BUS_PROBE_DEFAULT); } static int plic_attach(device_t dev) { struct plic_irqsrc *isrcs; struct plic_softc *sc; struct intr_pic *pic; uint32_t irq; const char *name; phandle_t node; phandle_t xref; uint32_t cpu; int error; int rid; sc = device_get_softc(dev); sc->dev = dev; node = ofw_bus_get_node(dev); if ((OF_getencprop(node, "riscv,ndev", &sc->ndev, sizeof(sc->ndev))) < 0) { device_printf(dev, "Error: could not get number of devices\n"); return (ENXIO); } if (sc->ndev >= PLIC_MAX_IRQS) { device_printf(dev, "Error: invalid ndev (%d)\n", sc->ndev); return (ENXIO); } /* Request memory resources */ rid = 0; sc->intc_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (sc->intc_res == NULL) { device_printf(dev, "Error: could not allocate memory resources\n"); return (ENXIO); } isrcs = sc->isrcs; name = device_get_nameunit(sc->dev); cpu = PCPU_GET(cpuid); for (irq = 1; irq <= sc->ndev; irq++) { isrcs[irq].irq = irq; error = intr_isrc_register(&isrcs[irq].isrc, sc->dev, 0, "%s,%u", name, irq); if (error != 0) return (error); WR4(sc, PLIC_PRIORITY(irq), 0); WR4(sc, PLIC_ENABLE(irq, cpu), 0); } WR4(sc, PLIC_THRESHOLD(cpu), 0); xref = OF_xref_from_node(node); pic = intr_pic_register(sc->dev, xref); if (pic == NULL) return (ENXIO); csr_set(sie, SIE_SEIE); return (intr_pic_claim_root(sc->dev, xref, plic_intr, sc, 0)); } +static void +plic_pre_ithread(device_t dev, struct intr_irqsrc *isrc) +{ + struct plic_softc *sc; + struct plic_irqsrc *src; + + sc = device_get_softc(dev); + src = (struct plic_irqsrc *)isrc; + + WR4(sc, PLIC_PRIORITY(src->irq), 0); +} + +static void +plic_post_ithread(device_t dev, struct intr_irqsrc *isrc) +{ + struct plic_softc *sc; + struct plic_irqsrc *src; + + sc = device_get_softc(dev); + src = (struct plic_irqsrc *)isrc; + + WR4(sc, PLIC_PRIORITY(src->irq), 1); +} + static device_method_t plic_methods[] = { DEVMETHOD(device_probe, plic_probe), DEVMETHOD(device_attach, plic_attach), DEVMETHOD(pic_disable_intr, plic_disable_intr), DEVMETHOD(pic_enable_intr, plic_enable_intr), DEVMETHOD(pic_map_intr, plic_map_intr), + DEVMETHOD(pic_pre_ithread, plic_pre_ithread), + DEVMETHOD(pic_post_ithread, plic_post_ithread), DEVMETHOD_END }; static driver_t plic_driver = { "plic", plic_methods, sizeof(struct plic_softc), }; static devclass_t plic_devclass; EARLY_DRIVER_MODULE(plic, simplebus, plic_driver, plic_devclass, 0, 0, BUS_PASS_INTERRUPT + BUS_PASS_ORDER_MIDDLE); Index: user/ngie/bug-237403/sys/sys/param.h =================================================================== --- user/ngie/bug-237403/sys/sys/param.h (revision 346925) +++ user/ngie/bug-237403/sys/sys/param.h (revision 346926) @@ -1,367 +1,367 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1982, 1986, 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)param.h 8.3 (Berkeley) 4/4/95 * $FreeBSD$ */ #ifndef _SYS_PARAM_H_ #define _SYS_PARAM_H_ #include #define BSD 199506 /* System version (year & month). */ #define BSD4_3 1 #define BSD4_4 1 /* * __FreeBSD_version numbers are documented in the Porter's Handbook. * If you bump the version for any reason, you should update the documentation * there. * Currently this lives here in the doc/ repository: * * head/en_US.ISO8859-1/books/porters-handbook/versions/chapter.xml * * scheme is: Rxx * 'R' is in the range 0 to 4 if this is a release branch or * X.0-CURRENT before releng/X.0 is created, otherwise 'R' is * in the range 5 to 9. */ #undef __FreeBSD_version -#define __FreeBSD_version 1300020 /* Master, propagated to newvers */ +#define __FreeBSD_version 1300021 /* Master, propagated to newvers */ /* * __FreeBSD_kernel__ indicates that this system uses the kernel of FreeBSD, * which by definition is always true on FreeBSD. This macro is also defined * on other systems that use the kernel of FreeBSD, such as GNU/kFreeBSD. * * It is tempting to use this macro in userland code when we want to enable * kernel-specific routines, and in fact it's fine to do this in code that * is part of FreeBSD itself. However, be aware that as presence of this * macro is still not widespread (e.g. older FreeBSD versions, 3rd party * compilers, etc), it is STRONGLY DISCOURAGED to check for this macro in * external applications without also checking for __FreeBSD__ as an * alternative. */ #undef __FreeBSD_kernel__ #define __FreeBSD_kernel__ #if defined(_KERNEL) || defined(IN_RTLD) #define P_OSREL_SIGWAIT 700000 #define P_OSREL_SIGSEGV 700004 #define P_OSREL_MAP_ANON 800104 #define P_OSREL_MAP_FSTRICT 1100036 #define P_OSREL_SHUTDOWN_ENOTCONN 1100077 #define P_OSREL_MAP_GUARD 1200035 #define P_OSREL_WRFSBASE 1200041 #define P_OSREL_CK_CYLGRP 1200046 #define P_OSREL_VMTOTAL64 1200054 #define P_OSREL_CK_SUPERBLOCK 1300000 #define P_OSREL_CK_INODE 1300005 #define P_OSREL_MAJOR(x) ((x) / 100000) #endif #ifndef LOCORE #include #endif /* * Machine-independent constants (some used in following include files). * Redefined constants are from POSIX 1003.1 limits file. 
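The only substantive change to param.h in this revision is the __FreeBSD_version bump to 1300021. For context, the macro is normally consumed by ports and out-of-tree code as a compile-time gate; a hypothetical consumer-side guard looks like the sketch below (only the version value itself is taken from this diff):

#include <sys/param.h>	/* provides __FreeBSD_version on FreeBSD */

#if defined(__FreeBSD_version) && __FreeBSD_version >= 1300021
	/* Build against interfaces available as of this bump. */
#else
	/* Fall back to the older interfaces. */
#endif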
* * MAXCOMLEN should be >= sizeof(ac_comm) (see ) */ #include #define MAXCOMLEN 19 /* max command name remembered */ #define MAXINTERP PATH_MAX /* max interpreter file name length */ #define MAXLOGNAME 33 /* max login name length (incl. NUL) */ #define MAXUPRC CHILD_MAX /* max simultaneous processes */ #define NCARGS ARG_MAX /* max bytes for an exec function */ #define NGROUPS (NGROUPS_MAX+1) /* max number groups */ #define NOFILE OPEN_MAX /* max open files per process */ #define NOGROUP 65535 /* marker for empty group set member */ #define MAXHOSTNAMELEN 256 /* max hostname size */ #define SPECNAMELEN 255 /* max length of devicename */ /* More types and definitions used throughout the kernel. */ #ifdef _KERNEL #include #include #ifndef LOCORE #include #include #endif #ifndef FALSE #define FALSE 0 #endif #ifndef TRUE #define TRUE 1 #endif #endif #ifndef _KERNEL /* Signals. */ #include #endif /* Machine type dependent parameters. */ #include #ifndef _KERNEL #include #endif #ifndef DEV_BSHIFT #define DEV_BSHIFT 9 /* log2(DEV_BSIZE) */ #endif #define DEV_BSIZE (1<>PAGE_SHIFT) #endif /* * btodb() is messy and perhaps slow because `bytes' may be an off_t. We * want to shift an unsigned type to avoid sign extension and we don't * want to widen `bytes' unnecessarily. Assume that the result fits in * a daddr_t. */ #ifndef btodb #define btodb(bytes) /* calculates (bytes / DEV_BSIZE) */ \ (sizeof (bytes) > sizeof(long) \ ? (daddr_t)((unsigned long long)(bytes) >> DEV_BSHIFT) \ : (daddr_t)((unsigned long)(bytes) >> DEV_BSHIFT)) #endif #ifndef dbtob #define dbtob(db) /* calculates (db * DEV_BSIZE) */ \ ((off_t)(db) << DEV_BSHIFT) #endif #define PRIMASK 0x0ff #define PCATCH 0x100 /* OR'd with pri for tsleep to check signals */ #define PDROP 0x200 /* OR'd with pri to stop re-entry of interlock mutex */ #define NZERO 0 /* default "nice" */ #define NBBY 8 /* number of bits in a byte */ #define NBPW sizeof(int) /* number of bytes per word (integer) */ #define CMASK 022 /* default file mask: S_IWGRP|S_IWOTH */ #define NODEV (dev_t)(-1) /* non-existent device */ /* * File system parameters and macros. * * MAXBSIZE - Filesystems are made out of blocks of at most MAXBSIZE bytes * per block. MAXBSIZE may be made larger without effecting * any existing filesystems as long as it does not exceed MAXPHYS, * and may be made smaller at the risk of not being able to use * filesystems which require a block size exceeding MAXBSIZE. * * MAXBCACHEBUF - Maximum size of a buffer in the buffer cache. This must * be >= MAXBSIZE and can be set differently for different * architectures by defining it in . * Making this larger allows NFS to do larger reads/writes. * * BKVASIZE - Nominal buffer space per buffer, in bytes. BKVASIZE is the * minimum KVM memory reservation the kernel is willing to make. * Filesystems can of course request smaller chunks. Actual * backing memory uses a chunk size of a page (PAGE_SIZE). * The default value here can be overridden on a per-architecture * basis by defining it in . * * If you make BKVASIZE too small you risk seriously fragmenting * the buffer KVM map which may slow things down a bit. If you * make it too big the kernel will not be able to optimally use * the KVM memory reserved for the buffer cache and will wind * up with too-few buffers. * * The default is 16384, roughly 2x the block size used by a * normal UFS filesystem. 
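As a quick sanity check of the byte/disk-block conversions defined above: with the default DEV_BSHIFT of 9 (DEV_BSIZE = 512), btodb() is a right shift by 9 and dbtob() a left shift by 9. The standalone sketch below just evaluates those shifts with plain integer types; it does not use the kernel's daddr_t/off_t machinery:

#include <stdio.h>

int
main(void)
{
	unsigned long bytes = 8192;		/* 16 disk blocks of 512 bytes */

	/* btodb(8192) with DEV_BSHIFT == 9: 8192 >> 9 == 16 */
	printf("btodb(%lu) = %lu\n", bytes, bytes >> 9);
	/* dbtob(16): 16 << 9 == 8192 */
	printf("dbtob(16) = %lu\n", 16UL << 9);
	return (0);
}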
*/ #define MAXBSIZE 65536 /* must be power of 2 */ #ifndef MAXBCACHEBUF #define MAXBCACHEBUF MAXBSIZE /* must be a power of 2 >= MAXBSIZE */ #endif #ifndef BKVASIZE #define BKVASIZE 16384 /* must be power of 2 */ #endif #define BKVAMASK (BKVASIZE-1) /* * MAXPATHLEN defines the longest permissible path length after expanding * symbolic links. It is used to allocate a temporary buffer from the buffer * pool in which to do the name expansion, hence should be a power of two, * and must be less than or equal to MAXBSIZE. MAXSYMLINKS defines the * maximum number of symbolic links that may be expanded in a path name. * It should be set high enough to allow all legitimate uses, but halt * infinite loops reasonably quickly. */ #define MAXPATHLEN PATH_MAX #define MAXSYMLINKS 32 /* Bit map related macros. */ #define setbit(a,i) (((unsigned char *)(a))[(i)/NBBY] |= 1<<((i)%NBBY)) #define clrbit(a,i) (((unsigned char *)(a))[(i)/NBBY] &= ~(1<<((i)%NBBY))) #define isset(a,i) \ (((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY))) #define isclr(a,i) \ ((((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY))) == 0) /* Macros for counting and rounding. */ #ifndef howmany #define howmany(x, y) (((x)+((y)-1))/(y)) #endif #define nitems(x) (sizeof((x)) / sizeof((x)[0])) #define rounddown(x, y) (((x)/(y))*(y)) #define rounddown2(x, y) ((x)&(~((y)-1))) /* if y is power of two */ #define roundup(x, y) ((((x)+((y)-1))/(y))*(y)) /* to any y */ #define roundup2(x, y) (((x)+((y)-1))&(~((y)-1))) /* if y is powers of two */ #define powerof2(x) ((((x)-1)&(x))==0) /* Macros for min/max. */ #define MIN(a,b) (((a)<(b))?(a):(b)) #define MAX(a,b) (((a)>(b))?(a):(b)) #ifdef _KERNEL /* * Basic byte order function prototypes for non-inline functions. */ #ifndef LOCORE #ifndef _BYTEORDER_PROTOTYPED #define _BYTEORDER_PROTOTYPED __BEGIN_DECLS __uint32_t htonl(__uint32_t); __uint16_t htons(__uint16_t); __uint32_t ntohl(__uint32_t); __uint16_t ntohs(__uint16_t); __END_DECLS #endif #endif #ifndef _BYTEORDER_FUNC_DEFINED #define _BYTEORDER_FUNC_DEFINED #define htonl(x) __htonl(x) #define htons(x) __htons(x) #define ntohl(x) __ntohl(x) #define ntohs(x) __ntohs(x) #endif /* !_BYTEORDER_FUNC_DEFINED */ #endif /* _KERNEL */ /* * Scale factor for scaled integers used to count %cpu time and load avgs. * * The number of CPU `tick's that map to a unique `%age' can be expressed * by the formula (1 / (2 ^ (FSHIFT - 11))). The maximum load average that * can be calculated (assuming 32 bits) can be closely approximated using * the formula (2 ^ (2 * (16 - FSHIFT))) for (FSHIFT < 15). * * For the scheduler to maintain a 1:1 mapping of CPU `tick' to `%age', * FSHIFT must be at least 11; this gives us a maximum load avg of ~1024. */ #define FSHIFT 11 /* bits to right of fixed binary point */ #define FSCALE (1<> (PAGE_SHIFT - DEV_BSHIFT)) #define ctodb(db) /* calculates pages to devblks */ \ ((db) << (PAGE_SHIFT - DEV_BSHIFT)) /* * Old spelling of __containerof(). */ #define member2struct(s, m, x) \ ((struct s *)(void *)((char *)(x) - offsetof(struct s, m))) /* * Access a variable length array that has been declared as a fixed * length array. 
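The counting/rounding macros and the FSHIFT fixed-point scale documented above can be checked with ordinary arithmetic. The standalone sketch below re-declares the macros locally so it compiles on its own; FSCALE is assumed here to be (1 << FSHIFT), consistent with the load-average formulas in the comment:

#include <assert.h>
#include <stdio.h>

/* Local copies of the macros above, so this sketch is self-contained. */
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))
#define roundup2(x, y)	(((x) + ((y) - 1)) & (~((y) - 1)))	/* y: power of 2 */
#define powerof2(x)	((((x) - 1) & (x)) == 0)
#define FSHIFT		11
#define FSCALE		(1 << FSHIFT)		/* assumed: 2048 */

int
main(void)
{
	assert(howmany(1000, 512) == 2);	/* 1000 bytes need two 512-byte blocks */
	assert(roundup2(1000, 512) == 1024);	/* next multiple of a power of two */
	assert(powerof2(4096));

	/* Fixed-point load average: 0.50 is stored as FSCALE / 2 == 1024. */
	printf("0.50 * FSCALE = %d\n", FSCALE / 2);
	return (0);
}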
*/ #define __PAST_END(array, offset) (((__typeof__(*(array)) *)(array))[offset]) #endif /* _SYS_PARAM_H_ */ Index: user/ngie/bug-237403/sys/sys/proc.h =================================================================== --- user/ngie/bug-237403/sys/sys/proc.h (revision 346925) +++ user/ngie/bug-237403/sys/sys/proc.h (revision 346926) @@ -1,1192 +1,1191 @@ /*- * SPDX-License-Identifier: BSD-3-Clause * * Copyright (c) 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)proc.h 8.15 (Berkeley) 5/19/95 * $FreeBSD$ */ #ifndef _SYS_PROC_H_ #define _SYS_PROC_H_ #include /* For struct callout. */ #include /* For struct klist. */ #include #ifndef _KERNEL #include #endif #include #include #include #include #include #include #include /* XXX. */ #include #include #include #include #include #ifndef _KERNEL #include /* For structs itimerval, timeval. */ #else #include #include #endif #include #include #include #include #include /* Machine-dependent proc substruct. */ #ifdef _KERNEL #include #endif /* * One structure allocated per session. * * List of locks * (m) locked by s_mtx mtx * (e) locked by proctree_lock sx * (c) const until freeing */ struct session { u_int s_count; /* Ref cnt; pgrps in session - atomic. */ struct proc *s_leader; /* (m + e) Session leader. */ struct vnode *s_ttyvp; /* (m) Vnode of controlling tty. */ struct cdev_priv *s_ttydp; /* (m) Device of controlling tty. */ struct tty *s_ttyp; /* (e) Controlling tty. */ pid_t s_sid; /* (c) Session ID. */ /* (m) Setlogin() name: */ char s_login[roundup(MAXLOGNAME, sizeof(long))]; struct mtx s_mtx; /* Mutex to protect members. */ }; /* * One structure allocated per process group. 
* * List of locks * (m) locked by pg_mtx mtx * (e) locked by proctree_lock sx * (c) const until freeing */ struct pgrp { LIST_ENTRY(pgrp) pg_hash; /* (e) Hash chain. */ LIST_HEAD(, proc) pg_members; /* (m + e) Pointer to pgrp members. */ struct session *pg_session; /* (c) Pointer to session. */ struct sigiolst pg_sigiolst; /* (m) List of sigio sources. */ pid_t pg_id; /* (c) Process group id. */ int pg_jobc; /* (m) Job control process count. */ struct mtx pg_mtx; /* Mutex to protect members */ }; /* * pargs, used to hold a copy of the command line, if it had a sane length. */ struct pargs { u_int ar_ref; /* Reference count. */ u_int ar_length; /* Length. */ u_char ar_args[1]; /* Arguments. */ }; /*- * Description of a process. * * This structure contains the information needed to manage a thread of * control, known in UN*X as a process; it has references to substructures * containing descriptions of things that the process uses, but may share * with related processes. The process structure and the substructures * are always addressable except for those marked "(CPU)" below, * which might be addressable only on a processor on which the process * is running. * * Below is a key of locks used to protect each member of struct proc. The * lock is indicated by a reference to a specific character in parens in the * associated comment. * * - not yet protected * a - only touched by curproc or parent during fork/wait * b - created at fork, never changes * (exception aiods switch vmspaces, but they are also * marked 'P_SYSTEM' so hopefully it will be left alone) * c - locked by proc mtx * d - locked by allproc_lock lock * e - locked by proctree_lock lock * f - session mtx * g - process group mtx * h - callout_lock mtx * i - by curproc or the master session mtx * j - locked by proc slock * k - only accessed by curthread * k*- only accessed by curthread and from an interrupt * kx- only accessed by curthread and by debugger * l - the attaching proc or attaching proc parent * m - Giant * n - not locked, lazy * o - ktrace lock * q - td_contested lock * r - p_peers lock * s - see sleepq_switch(), sleeping_on_old_rtc(), and sleep(9) * t - thread lock * u - process stat lock * w - process timer lock * x - created at fork, only changes during single threading in exec * y - created at first aio, doesn't change until exit or exec at which * point we are single-threaded and only curthread changes it * z - zombie threads lock * * If the locking key specifies two identifiers (for example, p_pptr) then * either lock is sufficient for read access, but both locks must be held * for write access. */ struct cpuset; struct filecaps; struct filemon; struct kaioinfo; struct kaudit_record; struct kcov_info; struct kdtrace_proc; struct kdtrace_thread; struct mqueue_notifier; struct nlminfo; struct p_sched; struct proc; struct procdesc; struct racct; struct sbuf; struct sleepqueue; struct socket; struct syscall_args; struct td_sched; struct thread; struct trapframe; struct turnstile; struct vm_map; struct vm_map_entry; struct epoch_tracker; /* * XXX: Does this belong in resource.h or resourcevar.h instead? * Resource usage extension. The times in rusage structs in the kernel are * never up to date. The actual times are kept as runtimes and tick counts * (with control info in the "previous" times), and are converted when * userland asks for rusage info. Backwards compatibility prevents putting * this directly in the user-visible rusage struct. * * Locking for p_rux: (cu) means (u) for p_rux and (c) for p_crux. 
* Locking for td_rux: (t) for all fields. */ struct rusage_ext { uint64_t rux_runtime; /* (cu) Real time. */ uint64_t rux_uticks; /* (cu) Statclock hits in user mode. */ uint64_t rux_sticks; /* (cu) Statclock hits in sys mode. */ uint64_t rux_iticks; /* (cu) Statclock hits in intr mode. */ uint64_t rux_uu; /* (c) Previous user time in usec. */ uint64_t rux_su; /* (c) Previous sys time in usec. */ uint64_t rux_tu; /* (c) Previous total time in usec. */ }; /* * Kernel runnable context (thread). * This is what is put to sleep and reactivated. * Thread context. Processes may have multiple threads. */ struct thread { struct mtx *volatile td_lock; /* replaces sched lock */ struct proc *td_proc; /* (*) Associated process. */ TAILQ_ENTRY(thread) td_plist; /* (*) All threads in this proc. */ TAILQ_ENTRY(thread) td_runq; /* (t) Run queue. */ TAILQ_ENTRY(thread) td_slpq; /* (t) Sleep queue. */ TAILQ_ENTRY(thread) td_lockq; /* (t) Lock queue. */ LIST_ENTRY(thread) td_hash; /* (d) Hash chain. */ struct cpuset *td_cpuset; /* (t) CPU affinity mask. */ struct domainset_ref td_domain; /* (a) NUMA policy */ struct seltd *td_sel; /* Select queue/channel. */ struct sleepqueue *td_sleepqueue; /* (k) Associated sleep queue. */ struct turnstile *td_turnstile; /* (k) Associated turnstile. */ struct rl_q_entry *td_rlqe; /* (k) Associated range lock entry. */ struct umtx_q *td_umtxq; /* (c?) Link for when we're blocked. */ lwpid_t td_tid; /* (b) Thread ID. */ sigqueue_t td_sigqueue; /* (c) Sigs arrived, not delivered. */ #define td_siglist td_sigqueue.sq_signals u_char td_lend_user_pri; /* (t) Lend user pri. */ /* Cleared during fork1() */ #define td_startzero td_epochnest u_char td_epochnest; /* (k) Epoch nest counter. */ int td_flags; /* (t) TDF_* flags. */ int td_inhibitors; /* (t) Why can not run. */ int td_pflags; /* (k) Private thread (TDP_*) flags. */ int td_dupfd; /* (k) Ret value from fdopen. XXX */ int td_sqqueue; /* (t) Sleepqueue queue blocked on. */ void *td_wchan; /* (t) Sleep address. */ const char *td_wmesg; /* (t) Reason for sleep. */ volatile u_char td_owepreempt; /* (k*) Preempt on last critical_exit */ u_char td_tsqueue; /* (t) Turnstile queue blocked on. */ short td_locks; /* (k) Debug: count of non-spin locks */ short td_rw_rlocks; /* (k) Count of rwlock read locks. */ short td_sx_slocks; /* (k) Count of sx shared locks. */ short td_lk_slocks; /* (k) Count of lockmgr shared locks. */ short td_stopsched; /* (k) Scheduler stopped. */ struct turnstile *td_blocked; /* (t) Lock thread is blocked on. */ const char *td_lockname; /* (t) Name of lock blocked on. */ LIST_HEAD(, turnstile) td_contested; /* (q) Contested locks. */ struct lock_list_entry *td_sleeplocks; /* (k) Held sleep locks. */ int td_intr_nesting_level; /* (k) Interrupt recursion. */ int td_pinned; /* (k) Temporary cpu pin count. */ struct ucred *td_ucred; /* (k) Reference to credentials. */ struct plimit *td_limit; /* (k) Resource limits. */ int td_slptick; /* (t) Time at sleep. */ int td_blktick; /* (t) Time spent blocked. */ int td_swvoltick; /* (t) Time at last SW_VOL switch. */ int td_swinvoltick; /* (t) Time at last SW_INVOL switch. */ u_int td_cow; /* (*) Number of copy-on-write faults */ struct rusage td_ru; /* (t) rusage information. */ struct rusage_ext td_rux; /* (t) Internal rusage information. */ uint64_t td_incruntime; /* (t) Cpu ticks to transfer to proc. */ uint64_t td_runtime; /* (t) How many cpu ticks we've run. 
*/ u_int td_pticks; /* (t) Statclock hits for profiling */ u_int td_sticks; /* (t) Statclock hits in system mode. */ u_int td_iticks; /* (t) Statclock hits in intr mode. */ u_int td_uticks; /* (t) Statclock hits in user mode. */ int td_intrval; /* (t) Return value for sleepq. */ sigset_t td_oldsigmask; /* (k) Saved mask from pre sigpause. */ volatile u_int td_generation; /* (k) For detection of preemption */ stack_t td_sigstk; /* (k) Stack ptr and on-stack flag. */ int td_xsig; /* (c) Signal for ptrace */ u_long td_profil_addr; /* (k) Temporary addr until AST. */ u_int td_profil_ticks; /* (k) Temporary ticks until AST. */ char td_name[MAXCOMLEN + 1]; /* (*) Thread name. */ struct file *td_fpop; /* (k) file referencing cdev under op */ int td_dbgflags; /* (c) Userland debugger flags */ siginfo_t td_si; /* (c) For debugger or core file */ int td_ng_outbound; /* (k) Thread entered ng from above. */ struct osd td_osd; /* (k) Object specific data. */ struct vm_map_entry *td_map_def_user; /* (k) Deferred entries. */ pid_t td_dbg_forked; /* (c) Child pid for debugger. */ u_int td_vp_reserv; /* (k) Count of reserved vnodes. */ int td_no_sleeping; /* (k) Sleeping disabled count. */ void *td_su; /* (k) FFS SU private */ sbintime_t td_sleeptimo; /* (t) Sleep timeout. */ int td_rtcgen; /* (s) rtc_generation of abs. sleep */ size_t td_vslock_sz; /* (k) amount of vslock-ed space */ struct kcov_info *td_kcov_info; /* (*) Kernel code coverage data */ #define td_endzero td_sigmask /* Copied during fork1() or create_thread(). */ #define td_startcopy td_endzero sigset_t td_sigmask; /* (c) Current signal mask. */ u_char td_rqindex; /* (t) Run queue index. */ u_char td_base_pri; /* (t) Thread base kernel priority. */ u_char td_priority; /* (t) Thread active priority. */ u_char td_pri_class; /* (t) Scheduling class. */ u_char td_user_pri; /* (t) User pri from estcpu and nice. */ u_char td_base_user_pri; /* (t) Base user pri */ u_char td_pre_epoch_prio; /* (k) User pri on entry to epoch */ uintptr_t td_rb_list; /* (k) Robust list head. */ uintptr_t td_rbp_list; /* (k) Robust priv list head. */ uintptr_t td_rb_inact; /* (k) Current in-action mutex loc. */ struct syscall_args td_sa; /* (kx) Syscall parameters. Copied on fork for child tracing. */ #define td_endcopy td_pcb /* * Fields that must be manually set in fork1() or create_thread() * or already have been set in the allocator, constructor, etc. */ struct pcb *td_pcb; /* (k) Kernel VA of pcb and kstack. */ enum td_states { TDS_INACTIVE = 0x0, TDS_INHIBITED, TDS_CAN_RUN, TDS_RUNQ, TDS_RUNNING } td_state; /* (t) thread state */ union { register_t tdu_retval[2]; off_t tdu_off; } td_uretoff; /* (k) Syscall aux returns. */ #define td_retval td_uretoff.tdu_retval u_int td_cowgen; /* (k) Generation of COW pointers. */ /* LP64 hole */ struct callout td_slpcallout; /* (h) Callout for sleep. */ struct trapframe *td_frame; /* (k) */ struct vm_object *td_kstack_obj;/* (a) Kstack object. */ vm_offset_t td_kstack; /* (a) Kernel VA of kstack. */ int td_kstack_pages; /* (a) Size of the kstack. */ volatile u_int td_critnest; /* (k*) Critical section nest level. */ struct mdthread td_md; /* (k) Any machine-dependent fields. */ struct kaudit_record *td_ar; /* (k) Active audit record, if any. */ struct lpohead td_lprof[2]; /* (a) lock profiling objects. */ struct kdtrace_thread *td_dtrace; /* (*) DTrace-specific data. */ int td_errno; /* Error returned by last syscall. */ /* LP64 hole */ struct vnet *td_vnet; /* (k) Effective vnet. 
*/ const char *td_vnet_lpush; /* (k) Debugging vnet push / pop. */ struct trapframe *td_intr_frame;/* (k) Frame of the current irq */ struct proc *td_rfppwait_p; /* (k) The vforked child */ struct vm_page **td_ma; /* (k) uio pages held */ int td_ma_cnt; /* (k) size of *td_ma */ /* LP64 hole */ void *td_emuldata; /* Emulator state data */ int td_lastcpu; /* (t) Last cpu we were on. */ int td_oncpu; /* (t) Which cpu we are on. */ void *td_lkpi_task; /* LinuxKPI task struct pointer */ struct epoch_tracker *td_et; /* (k) compat KPI spare tracker */ int td_pmcpend; }; struct thread0_storage { struct thread t0st_thread; uint64_t t0st_sched[10]; }; struct mtx *thread_lock_block(struct thread *); void thread_lock_unblock(struct thread *, struct mtx *); void thread_lock_set(struct thread *, struct mtx *); #define THREAD_LOCK_ASSERT(td, type) \ do { \ struct mtx *__m = (td)->td_lock; \ if (__m != &blocked_lock) \ mtx_assert(__m, (type)); \ } while (0) #ifdef INVARIANTS #define THREAD_LOCKPTR_ASSERT(td, lock) \ do { \ struct mtx *__m = (td)->td_lock; \ KASSERT((__m == &blocked_lock || __m == (lock)), \ ("Thread %p lock %p does not match %p", td, __m, (lock))); \ } while (0) #define TD_LOCKS_INC(td) ((td)->td_locks++) #define TD_LOCKS_DEC(td) do { \ KASSERT(SCHEDULER_STOPPED_TD(td) || (td)->td_locks > 0, \ ("thread %p owns no locks", (td))); \ (td)->td_locks--; \ } while (0) #else #define THREAD_LOCKPTR_ASSERT(td, lock) #define TD_LOCKS_INC(td) #define TD_LOCKS_DEC(td) #endif /* * Flags kept in td_flags: * To change these you MUST have the scheduler lock. */ #define TDF_BORROWING 0x00000001 /* Thread is borrowing pri from another. */ #define TDF_INPANIC 0x00000002 /* Caused a panic, let it drive crashdump. */ #define TDF_INMEM 0x00000004 /* Thread's stack is in memory. */ #define TDF_SINTR 0x00000008 /* Sleep is interruptible. */ #define TDF_TIMEOUT 0x00000010 /* Timing out during sleep. */ #define TDF_IDLETD 0x00000020 /* This is a per-CPU idle thread. */ #define TDF_CANSWAP 0x00000040 /* Thread can be swapped. */ #define TDF_SLEEPABORT 0x00000080 /* sleepq_abort was called. */ #define TDF_KTH_SUSP 0x00000100 /* kthread is suspended */ #define TDF_ALLPROCSUSP 0x00000200 /* suspended by SINGLE_ALLPROC */ #define TDF_BOUNDARY 0x00000400 /* Thread suspended at user boundary */ #define TDF_ASTPENDING 0x00000800 /* Thread has some asynchronous events. */ #define TDF_UNUSED12 0x00001000 /* --available-- */ #define TDF_SBDRY 0x00002000 /* Stop only on usermode boundary. */ #define TDF_UPIBLOCKED 0x00004000 /* Thread blocked on user PI mutex. */ #define TDF_NEEDSUSPCHK 0x00008000 /* Thread may need to suspend. */ #define TDF_NEEDRESCHED 0x00010000 /* Thread needs to yield. */ #define TDF_NEEDSIGCHK 0x00020000 /* Thread may need signal delivery. */ #define TDF_NOLOAD 0x00040000 /* Ignore during load avg calculations. */ #define TDF_SERESTART 0x00080000 /* ERESTART on stop attempts. */ #define TDF_THRWAKEUP 0x00100000 /* Libthr thread must not suspend itself. */ #define TDF_SEINTR 0x00200000 /* EINTR on stop attempts. */ #define TDF_SWAPINREQ 0x00400000 /* Swapin request due to wakeup. */ #define TDF_UNUSED23 0x00800000 /* --available-- */ #define TDF_SCHED0 0x01000000 /* Reserved for scheduler private use */ #define TDF_SCHED1 0x02000000 /* Reserved for scheduler private use */ #define TDF_SCHED2 0x04000000 /* Reserved for scheduler private use */ #define TDF_SCHED3 0x08000000 /* Reserved for scheduler private use */ #define TDF_ALRMPEND 0x10000000 /* Pending SIGVTALRM needs to be posted. 
*/ #define TDF_PROFPEND 0x20000000 /* Pending SIGPROF needs to be posted. */ #define TDF_MACPEND 0x40000000 /* AST-based MAC event pending. */ /* Userland debug flags */ #define TDB_SUSPEND 0x00000001 /* Thread is suspended by debugger */ #define TDB_XSIG 0x00000002 /* Thread is exchanging signal under trace */ #define TDB_USERWR 0x00000004 /* Debugger modified memory or registers */ #define TDB_SCE 0x00000008 /* Thread performs syscall enter */ #define TDB_SCX 0x00000010 /* Thread performs syscall exit */ #define TDB_EXEC 0x00000020 /* TDB_SCX from exec(2) family */ #define TDB_FORK 0x00000040 /* TDB_SCX from fork(2) that created new process */ #define TDB_STOPATFORK 0x00000080 /* Stop at the return from fork (child only) */ #define TDB_CHILD 0x00000100 /* New child indicator for ptrace() */ #define TDB_BORN 0x00000200 /* New LWP indicator for ptrace() */ #define TDB_EXIT 0x00000400 /* Exiting LWP indicator for ptrace() */ #define TDB_VFORK 0x00000800 /* vfork indicator for ptrace() */ #define TDB_FSTP 0x00001000 /* The thread is PT_ATTACH leader */ #define TDB_STEP 0x00002000 /* (x86) PSL_T set for PT_STEP */ /* * "Private" flags kept in td_pflags: * These are only written by curthread and thus need no locking. */ #define TDP_OLDMASK 0x00000001 /* Need to restore mask after suspend. */ #define TDP_INKTR 0x00000002 /* Thread is currently in KTR code. */ #define TDP_INKTRACE 0x00000004 /* Thread is currently in KTRACE code. */ #define TDP_BUFNEED 0x00000008 /* Do not recurse into the buf flush */ #define TDP_COWINPROGRESS 0x00000010 /* Snapshot copy-on-write in progress. */ #define TDP_ALTSTACK 0x00000020 /* Have alternate signal stack. */ #define TDP_DEADLKTREAT 0x00000040 /* Lock acquisition - deadlock treatment. */ #define TDP_NOFAULTING 0x00000080 /* Do not handle page faults. */ #define TDP_UNUSED9 0x00000100 /* --available-- */ #define TDP_OWEUPC 0x00000200 /* Call addupc() at next AST. */ #define TDP_ITHREAD 0x00000400 /* Thread is an interrupt thread. */ #define TDP_SYNCIO 0x00000800 /* Local override, disable async i/o. */ #define TDP_SCHED1 0x00001000 /* Reserved for scheduler private use */ #define TDP_SCHED2 0x00002000 /* Reserved for scheduler private use */ #define TDP_SCHED3 0x00004000 /* Reserved for scheduler private use */ #define TDP_SCHED4 0x00008000 /* Reserved for scheduler private use */ #define TDP_GEOM 0x00010000 /* Settle GEOM before finishing syscall */ #define TDP_SOFTDEP 0x00020000 /* Stuck processing softdep worklist */ #define TDP_NORUNNINGBUF 0x00040000 /* Ignore runningbufspace check */ #define TDP_WAKEUP 0x00080000 /* Don't sleep in umtx cond_wait */ #define TDP_INBDFLUSH 0x00100000 /* Already in BO_BDFLUSH, do not recurse */ #define TDP_KTHREAD 0x00200000 /* This is an official kernel thread */ #define TDP_CALLCHAIN 0x00400000 /* Capture thread's callchain */ #define TDP_IGNSUSP 0x00800000 /* Permission to ignore the MNTK_SUSPEND* */ #define TDP_AUDITREC 0x01000000 /* Audit record pending on thread */ #define TDP_RFPPWAIT 0x02000000 /* Handle RFPPWAIT on syscall exit */ #define TDP_RESETSPUR 0x04000000 /* Reset spurious page fault history. */ #define TDP_NERRNO 0x08000000 /* Last errno is already in td_errno */ #define TDP_UIOHELD 0x10000000 /* Current uio has pages held in td_ma */ #define TDP_FORKING 0x20000000 /* Thread is being created through fork() */ #define TDP_EXECVMSPC 0x40000000 /* Execve destroyed old vmspace */ /* * Reasons that the current thread can not be run yet. * More than one may apply. 
*/ #define TDI_SUSPENDED 0x0001 /* On suspension queue. */ #define TDI_SLEEPING 0x0002 /* Actually asleep! (tricky). */ #define TDI_SWAPPED 0x0004 /* Stack not in mem. Bad juju if run. */ #define TDI_LOCK 0x0008 /* Stopped on a lock. */ #define TDI_IWAIT 0x0010 /* Awaiting interrupt. */ #define TD_IS_SLEEPING(td) ((td)->td_inhibitors & TDI_SLEEPING) #define TD_ON_SLEEPQ(td) ((td)->td_wchan != NULL) #define TD_IS_SUSPENDED(td) ((td)->td_inhibitors & TDI_SUSPENDED) #define TD_IS_SWAPPED(td) ((td)->td_inhibitors & TDI_SWAPPED) #define TD_ON_LOCK(td) ((td)->td_inhibitors & TDI_LOCK) #define TD_AWAITING_INTR(td) ((td)->td_inhibitors & TDI_IWAIT) #define TD_IS_RUNNING(td) ((td)->td_state == TDS_RUNNING) #define TD_ON_RUNQ(td) ((td)->td_state == TDS_RUNQ) #define TD_CAN_RUN(td) ((td)->td_state == TDS_CAN_RUN) #define TD_IS_INHIBITED(td) ((td)->td_state == TDS_INHIBITED) #define TD_ON_UPILOCK(td) ((td)->td_flags & TDF_UPIBLOCKED) #define TD_IS_IDLETHREAD(td) ((td)->td_flags & TDF_IDLETD) #define KTDSTATE(td) \ (((td)->td_inhibitors & TDI_SLEEPING) != 0 ? "sleep" : \ ((td)->td_inhibitors & TDI_SUSPENDED) != 0 ? "suspended" : \ ((td)->td_inhibitors & TDI_SWAPPED) != 0 ? "swapped" : \ ((td)->td_inhibitors & TDI_LOCK) != 0 ? "blocked" : \ ((td)->td_inhibitors & TDI_IWAIT) != 0 ? "iwait" : "yielding") #define TD_SET_INHIB(td, inhib) do { \ (td)->td_state = TDS_INHIBITED; \ (td)->td_inhibitors |= (inhib); \ } while (0) #define TD_CLR_INHIB(td, inhib) do { \ if (((td)->td_inhibitors & (inhib)) && \ (((td)->td_inhibitors &= ~(inhib)) == 0)) \ (td)->td_state = TDS_CAN_RUN; \ } while (0) #define TD_SET_SLEEPING(td) TD_SET_INHIB((td), TDI_SLEEPING) #define TD_SET_SWAPPED(td) TD_SET_INHIB((td), TDI_SWAPPED) #define TD_SET_LOCK(td) TD_SET_INHIB((td), TDI_LOCK) #define TD_SET_SUSPENDED(td) TD_SET_INHIB((td), TDI_SUSPENDED) #define TD_SET_IWAIT(td) TD_SET_INHIB((td), TDI_IWAIT) #define TD_SET_EXITING(td) TD_SET_INHIB((td), TDI_EXITING) #define TD_CLR_SLEEPING(td) TD_CLR_INHIB((td), TDI_SLEEPING) #define TD_CLR_SWAPPED(td) TD_CLR_INHIB((td), TDI_SWAPPED) #define TD_CLR_LOCK(td) TD_CLR_INHIB((td), TDI_LOCK) #define TD_CLR_SUSPENDED(td) TD_CLR_INHIB((td), TDI_SUSPENDED) #define TD_CLR_IWAIT(td) TD_CLR_INHIB((td), TDI_IWAIT) #define TD_SET_RUNNING(td) (td)->td_state = TDS_RUNNING #define TD_SET_RUNQ(td) (td)->td_state = TDS_RUNQ #define TD_SET_CAN_RUN(td) (td)->td_state = TDS_CAN_RUN #define TD_SBDRY_INTR(td) \ (((td)->td_flags & (TDF_SEINTR | TDF_SERESTART)) != 0) #define TD_SBDRY_ERRNO(td) \ (((td)->td_flags & TDF_SEINTR) != 0 ? EINTR : ERESTART) /* * Process structure. */ struct proc { LIST_ENTRY(proc) p_list; /* (d) List of all processes. */ TAILQ_HEAD(, thread) p_threads; /* (c) all threads. */ struct mtx p_slock; /* process spin lock */ struct ucred *p_ucred; /* (c) Process owner's identity. */ struct filedesc *p_fd; /* (b) Open files. */ struct filedesc_to_leader *p_fdtol; /* (b) Tracking node */ struct pstats *p_stats; /* (b) Accounting/statistics (CPU). */ struct plimit *p_limit; /* (c) Resource limits. */ struct callout p_limco; /* (c) Limit callout handle */ struct sigacts *p_sigacts; /* (x) Signal actions, state (CPU). */ int p_flag; /* (c) P_* flags. */ int p_flag2; /* (c) P2_* flags. */ enum p_states { PRS_NEW = 0, /* In creation */ PRS_NORMAL, /* threads can be run. */ PRS_ZOMBIE } p_state; /* (j/c) Process status. */ pid_t p_pid; /* (b) Process identifier. */ LIST_ENTRY(proc) p_hash; /* (d) Hash chain. */ LIST_ENTRY(proc) p_pglist; /* (g + e) List of processes in pgrp. 
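
The TD_SET_INHIB()/TD_CLR_INHIB() macros above keep td_state and td_inhibitors consistent: setting any inhibitor bit forces TDS_INHIBITED, and the state only returns to TDS_CAN_RUN when the last inhibitor is cleared. A minimal userland model of that rule (a sketch; the FAKE_* names and fake_td type are invented for illustration, this is not kernel code):

    #include <assert.h>
    #include <stdio.h>

    enum state { CAN_RUN, INHIBITED };

    struct fake_td {
            enum state      state;
            int             inhibitors;     /* TDI_*-style bit mask */
    };

    #define FAKE_TDI_SUSPENDED      0x0001
    #define FAKE_TDI_SLEEPING       0x0002

    /* Mirrors TD_SET_INHIB(): any inhibitor forces the inhibited state. */
    static void
    set_inhib(struct fake_td *td, int inhib)
    {
            td->state = INHIBITED;
            td->inhibitors |= inhib;
    }

    /* Mirrors TD_CLR_INHIB(): only clearing the last bit re-enables running. */
    static void
    clr_inhib(struct fake_td *td, int inhib)
    {
            if ((td->inhibitors & inhib) != 0 &&
                (td->inhibitors &= ~inhib) == 0)
                    td->state = CAN_RUN;
    }

    int
    main(void)
    {
            struct fake_td td = { CAN_RUN, 0 };

            set_inhib(&td, FAKE_TDI_SLEEPING);
            set_inhib(&td, FAKE_TDI_SUSPENDED);
            clr_inhib(&td, FAKE_TDI_SLEEPING);
            assert(td.state == INHIBITED);  /* still suspended */
            clr_inhib(&td, FAKE_TDI_SUSPENDED);
            assert(td.state == CAN_RUN);    /* last inhibitor cleared */
            printf("ok\n");
            return (0);
    }
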
*/ struct proc *p_pptr; /* (c + e) Pointer to parent process. */ LIST_ENTRY(proc) p_sibling; /* (e) List of sibling processes. */ LIST_HEAD(, proc) p_children; /* (e) Pointer to list of children. */ struct proc *p_reaper; /* (e) My reaper. */ LIST_HEAD(, proc) p_reaplist; /* (e) List of my descendants (if I am reaper). */ LIST_ENTRY(proc) p_reapsibling; /* (e) List of siblings - descendants of the same reaper. */ struct mtx p_mtx; /* (n) Lock for this struct. */ struct mtx p_statmtx; /* Lock for the stats */ struct mtx p_itimmtx; /* Lock for the virt/prof timers */ struct mtx p_profmtx; /* Lock for the profiling */ struct ksiginfo *p_ksi; /* Locked by parent proc lock */ sigqueue_t p_sigqueue; /* (c) Sigs not delivered to a td. */ #define p_siglist p_sigqueue.sq_signals pid_t p_oppid; /* (c + e) Real parent pid. */ /* The following fields are all zeroed upon creation in fork. */ #define p_startzero p_vmspace struct vmspace *p_vmspace; /* (b) Address space. */ u_int p_swtick; /* (c) Tick when swapped in or out. */ u_int p_cowgen; /* (c) Generation of COW pointers. */ struct itimerval p_realtimer; /* (c) Alarm timer. */ struct rusage p_ru; /* (a) Exit information. */ struct rusage_ext p_rux; /* (cu) Internal resource usage. */ struct rusage_ext p_crux; /* (c) Internal child resource usage. */ int p_profthreads; /* (c) Num threads in addupc_task. */ volatile int p_exitthreads; /* (j) Number of threads exiting */ int p_traceflag; /* (o) Kernel trace points. */ struct vnode *p_tracevp; /* (c + o) Trace to vnode. */ struct ucred *p_tracecred; /* (o) Credentials to trace with. */ struct vnode *p_textvp; /* (b) Vnode of executable. */ u_int p_lock; /* (c) Proclock (prevent swap) count. */ struct sigiolst p_sigiolst; /* (c) List of sigio sources. */ int p_sigparent; /* (c) Signal to parent on exit. */ int p_sig; /* (n) For core dump/debugger XXX. */ - u_long p_code; /* (n) For core dump/debugger XXX. */ u_int p_stops; /* (c) Stop event bitmask. */ u_int p_stype; /* (c) Stop event type. */ char p_step; /* (c) Process is stopped. */ u_char p_pfsflags; /* (c) Procfs flags. */ u_int p_ptevents; /* (c + e) ptrace() event mask. */ struct nlminfo *p_nlminfo; /* (?) Only used by/for lockd. */ struct kaioinfo *p_aioinfo; /* (y) ASYNC I/O info. */ struct thread *p_singlethread;/* (c + j) If single threading this is it */ int p_suspcount; /* (j) Num threads in suspended mode. */ struct thread *p_xthread; /* (c) Trap thread */ int p_boundary_count;/* (j) Num threads at user boundary */ int p_pendingcnt; /* how many signals are pending */ struct itimers *p_itimers; /* (c) POSIX interval timers. */ struct procdesc *p_procdesc; /* (e) Process descriptor, if any. */ u_int p_treeflag; /* (e) P_TREE flags */ int p_pendingexits; /* (c) Count of pending thread exits. */ struct filemon *p_filemon; /* (c) filemon-specific data. */ int p_pdeathsig; /* (c) Signal from parent on exit. */ /* End area that is zeroed on creation. */ #define p_endzero p_magic /* The following fields are all copied upon creation in fork. */ #define p_startcopy p_endzero u_int p_magic; /* (b) Magic number. */ int p_osrel; /* (x) osreldate for the binary (from ELF note, if any) */ uint32_t p_fctl0; /* (x) ABI feature control, ELF note */ char p_comm[MAXCOMLEN + 1]; /* (x) Process name. */ struct sysentvec *p_sysent; /* (b) Syscall dispatch info. */ struct pargs *p_args; /* (c) Process arguments. */ rlim_t p_cpulimit; /* (c) Current CPU limit in seconds. */ signed char p_nice; /* (c) Process "nice" value. 
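
The p_startzero/p_endzero and p_startcopy/p_endcopy markers above delimit the regions of struct proc that fork zeroes or copies wholesale; fork1() effectively clears everything from p_startzero up to (but not including) p_endzero. A small standalone model of that marker-member idiom (struct demo and its fields are made up for illustration):

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    struct demo {
            int     kept;           /* outside the zeroed range */
    #define d_startzero     a
            int     a;
            long    b;
    #define d_endzero       c
            int     c;              /* first member past the range */
    };

    int
    main(void)
    {
            struct demo d = { 1, 2, 3, 4 };

            /* Zero from d_startzero up to (not including) d_endzero. */
            memset(&d.d_startzero, 0,
                offsetof(struct demo, d_endzero) -
                offsetof(struct demo, d_startzero));
            assert(d.kept == 1 && d.a == 0 && d.b == 0 && d.c == 4);
            return (0);
    }
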
*/ int p_fibnum; /* in this routing domain XXX MRT */ pid_t p_reapsubtree; /* (e) Pid of the direct child of the reaper which spawned our subtree. */ uint16_t p_elf_machine; /* (x) ELF machine type */ uint64_t p_elf_flags; /* (x) ELF flags */ /* End area that is copied on creation. */ #define p_endcopy p_xexit u_int p_xexit; /* (c) Exit code. */ u_int p_xsig; /* (c) Stop/kill sig. */ struct pgrp *p_pgrp; /* (c + e) Pointer to process group. */ struct knlist *p_klist; /* (c) Knotes attached to this proc. */ int p_numthreads; /* (c) Number of threads. */ struct mdproc p_md; /* Any machine-dependent fields. */ struct callout p_itcallout; /* (h + c) Interval timer callout. */ u_short p_acflag; /* (c) Accounting flags. */ struct proc *p_peers; /* (r) */ struct proc *p_leader; /* (b) */ void *p_emuldata; /* (c) Emulator state data. */ struct label *p_label; /* (*) Proc (not subject) MAC label. */ STAILQ_HEAD(, ktr_request) p_ktr; /* (o) KTR event queue. */ LIST_HEAD(, mqueue_notifier) p_mqnotifier; /* (c) mqueue notifiers.*/ struct kdtrace_proc *p_dtrace; /* (*) DTrace-specific data. */ struct cv p_pwait; /* (*) wait cv for exit/exec. */ uint64_t p_prev_runtime; /* (c) Resource usage accounting. */ struct racct *p_racct; /* (b) Resource accounting. */ int p_throttled; /* (c) Flag for racct pcpu throttling */ /* * An orphan is the child that has been re-parented to the * debugger as a result of attaching to it. Need to keep * track of them for parent to be able to collect the exit * status of what used to be children. */ LIST_ENTRY(proc) p_orphan; /* (e) List of orphan processes. */ LIST_HEAD(, proc) p_orphans; /* (e) Pointer to list of orphans. */ }; #define p_session p_pgrp->pg_session #define p_pgid p_pgrp->pg_id #define NOCPU (-1) /* For when we aren't on a CPU. */ #define NOCPU_OLD (255) #define MAXCPU_OLD (254) #define PROC_SLOCK(p) mtx_lock_spin(&(p)->p_slock) #define PROC_SUNLOCK(p) mtx_unlock_spin(&(p)->p_slock) #define PROC_SLOCK_ASSERT(p, type) mtx_assert(&(p)->p_slock, (type)) #define PROC_STATLOCK(p) mtx_lock_spin(&(p)->p_statmtx) #define PROC_STATUNLOCK(p) mtx_unlock_spin(&(p)->p_statmtx) #define PROC_STATLOCK_ASSERT(p, type) mtx_assert(&(p)->p_statmtx, (type)) #define PROC_ITIMLOCK(p) mtx_lock_spin(&(p)->p_itimmtx) #define PROC_ITIMUNLOCK(p) mtx_unlock_spin(&(p)->p_itimmtx) #define PROC_ITIMLOCK_ASSERT(p, type) mtx_assert(&(p)->p_itimmtx, (type)) #define PROC_PROFLOCK(p) mtx_lock_spin(&(p)->p_profmtx) #define PROC_PROFUNLOCK(p) mtx_unlock_spin(&(p)->p_profmtx) #define PROC_PROFLOCK_ASSERT(p, type) mtx_assert(&(p)->p_profmtx, (type)) /* These flags are kept in p_flag. */ #define P_ADVLOCK 0x00001 /* Process may hold a POSIX advisory lock. */ #define P_CONTROLT 0x00002 /* Has a controlling terminal. */ #define P_KPROC 0x00004 /* Kernel process. */ #define P_UNUSED3 0x00008 /* --available-- */ #define P_PPWAIT 0x00010 /* Parent is waiting for child to exec/exit. */ #define P_PROFIL 0x00020 /* Has started profiling. */ #define P_STOPPROF 0x00040 /* Has thread requesting to stop profiling. */ #define P_HADTHREADS 0x00080 /* Has had threads (no cleanup shortcuts) */ #define P_SUGID 0x00100 /* Had set id privileges since last exec. */ #define P_SYSTEM 0x00200 /* System proc: no sigs, stats or swapping. */ #define P_SINGLE_EXIT 0x00400 /* Threads suspending should exit, not wait. */ #define P_TRACED 0x00800 /* Debugged process being traced. */ #define P_WAITED 0x01000 /* Someone is waiting for us. */ #define P_WEXIT 0x02000 /* Working on exiting. 
*/ #define P_EXEC 0x04000 /* Process called exec. */ #define P_WKILLED 0x08000 /* Killed, go to kernel/user boundary ASAP. */ #define P_CONTINUED 0x10000 /* Proc has continued from a stopped state. */ #define P_STOPPED_SIG 0x20000 /* Stopped due to SIGSTOP/SIGTSTP. */ #define P_STOPPED_TRACE 0x40000 /* Stopped because of tracing. */ #define P_STOPPED_SINGLE 0x80000 /* Only 1 thread can continue (not to user). */ #define P_PROTECTED 0x100000 /* Do not kill on memory overcommit. */ #define P_SIGEVENT 0x200000 /* Process pending signals changed. */ #define P_SINGLE_BOUNDARY 0x400000 /* Threads should suspend at user boundary. */ #define P_HWPMC 0x800000 /* Process is using HWPMCs */ #define P_JAILED 0x1000000 /* Process is in jail. */ #define P_TOTAL_STOP 0x2000000 /* Stopped in stop_all_proc. */ #define P_INEXEC 0x4000000 /* Process is in execve(). */ #define P_STATCHILD 0x8000000 /* Child process stopped or exited. */ #define P_INMEM 0x10000000 /* Loaded into memory. */ #define P_SWAPPINGOUT 0x20000000 /* Process is being swapped out. */ #define P_SWAPPINGIN 0x40000000 /* Process is being swapped in. */ #define P_PPTRACE 0x80000000 /* PT_TRACEME by vforked child. */ #define P_STOPPED (P_STOPPED_SIG|P_STOPPED_SINGLE|P_STOPPED_TRACE) #define P_SHOULDSTOP(p) ((p)->p_flag & P_STOPPED) #define P_KILLED(p) ((p)->p_flag & P_WKILLED) /* These flags are kept in p_flag2. */ #define P2_INHERIT_PROTECTED 0x00000001 /* New children get P_PROTECTED. */ #define P2_NOTRACE 0x00000002 /* No ptrace(2) attach or coredumps. */ #define P2_NOTRACE_EXEC 0x00000004 /* Keep P2_NOPTRACE on exec(2). */ #define P2_AST_SU 0x00000008 /* Handles SU ast for kthreads. */ #define P2_PTRACE_FSTP 0x00000010 /* SIGSTOP from PT_ATTACH not yet handled. */ #define P2_TRAPCAP 0x00000020 /* SIGTRAP on ENOTCAPABLE */ #define P2_ASLR_ENABLE 0x00000040 /* Force enable ASLR. */ #define P2_ASLR_DISABLE 0x00000080 /* Force disable ASLR. */ #define P2_ASLR_IGNSTART 0x00000100 /* Enable ASLR to consume sbrk area. */ /* Flags protected by proctree_lock, kept in p_treeflags. */ #define P_TREE_ORPHANED 0x00000001 /* Reparented, on orphan list */ #define P_TREE_FIRST_ORPHAN 0x00000002 /* First element of orphan list */ #define P_TREE_REAPER 0x00000004 /* Reaper of subtree */ /* * These were process status values (p_stat), now they are only used in * legacy conversion code. */ #define SIDL 1 /* Process being created by fork. */ #define SRUN 2 /* Currently runnable. */ #define SSLEEP 3 /* Sleeping on an address. */ #define SSTOP 4 /* Process debugging or suspension. */ #define SZOMB 5 /* Awaiting collection by parent. */ #define SWAIT 6 /* Waiting for interrupt. */ #define SLOCK 7 /* Blocked on a lock. */ #define P_MAGIC 0xbeefface #ifdef _KERNEL /* Types and flags for mi_switch(). */ #define SW_TYPE_MASK 0xff /* First 8 bits are switch type */ #define SWT_NONE 0 /* Unspecified switch. */ #define SWT_PREEMPT 1 /* Switching due to preemption. */ #define SWT_OWEPREEMPT 2 /* Switching due to owepreempt. */ #define SWT_TURNSTILE 3 /* Turnstile contention. */ #define SWT_SLEEPQ 4 /* Sleepq wait. */ #define SWT_SLEEPQTIMO 5 /* Sleepq timeout wait. */ #define SWT_RELINQUISH 6 /* yield call. */ #define SWT_NEEDRESCHED 7 /* NEEDRESCHED was set. */ #define SWT_IDLE 8 /* Switching from the idle thread. */ #define SWT_IWAIT 9 /* Waiting for interrupts. */ #define SWT_SUSPEND 10 /* Thread suspended. */ #define SWT_REMOTEPREEMPT 11 /* Remote processor preempted. */ #define SWT_REMOTEWAKEIDLE 12 /* Remote processor preempted idle. 
*/ #define SWT_COUNT 13 /* Number of switch types. */ /* Flags */ #define SW_VOL 0x0100 /* Voluntary switch. */ #define SW_INVOL 0x0200 /* Involuntary switch. */ #define SW_PREEMPT 0x0400 /* The invol switch is a preemption */ /* How values for thread_single(). */ #define SINGLE_NO_EXIT 0 #define SINGLE_EXIT 1 #define SINGLE_BOUNDARY 2 #define SINGLE_ALLPROC 3 #ifdef MALLOC_DECLARE MALLOC_DECLARE(M_PARGS); MALLOC_DECLARE(M_PGRP); MALLOC_DECLARE(M_SESSION); MALLOC_DECLARE(M_SUBPROC); #endif #define FOREACH_PROC_IN_SYSTEM(p) \ LIST_FOREACH((p), &allproc, p_list) #define FOREACH_THREAD_IN_PROC(p, td) \ TAILQ_FOREACH((td), &(p)->p_threads, td_plist) #define FIRST_THREAD_IN_PROC(p) TAILQ_FIRST(&(p)->p_threads) /* * We use process IDs <= pid_max <= PID_MAX; PID_MAX + 1 must also fit * in a pid_t, as it is used to represent "no process group". */ #define PID_MAX 99999 #define NO_PID 100000 extern pid_t pid_max; #define SESS_LEADER(p) ((p)->p_session->s_leader == (p)) #define STOPEVENT(p, e, v) do { \ WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, \ "checking stopevent %d", (e)); \ if ((p)->p_stops & (e)) { \ PROC_LOCK(p); \ stopevent((p), (e), (v)); \ PROC_UNLOCK(p); \ } \ } while (0) #define _STOPEVENT(p, e, v) do { \ PROC_LOCK_ASSERT(p, MA_OWNED); \ WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, &p->p_mtx.lock_object, \ "checking stopevent %d", (e)); \ if ((p)->p_stops & (e)) \ stopevent((p), (e), (v)); \ } while (0) /* Lock and unlock a process. */ #define PROC_LOCK(p) mtx_lock(&(p)->p_mtx) #define PROC_TRYLOCK(p) mtx_trylock(&(p)->p_mtx) #define PROC_UNLOCK(p) mtx_unlock(&(p)->p_mtx) #define PROC_LOCKED(p) mtx_owned(&(p)->p_mtx) #define PROC_LOCK_ASSERT(p, type) mtx_assert(&(p)->p_mtx, (type)) /* Lock and unlock a process group. */ #define PGRP_LOCK(pg) mtx_lock(&(pg)->pg_mtx) #define PGRP_UNLOCK(pg) mtx_unlock(&(pg)->pg_mtx) #define PGRP_LOCKED(pg) mtx_owned(&(pg)->pg_mtx) #define PGRP_LOCK_ASSERT(pg, type) mtx_assert(&(pg)->pg_mtx, (type)) #define PGRP_LOCK_PGSIGNAL(pg) do { \ if ((pg) != NULL) \ PGRP_LOCK(pg); \ } while (0) #define PGRP_UNLOCK_PGSIGNAL(pg) do { \ if ((pg) != NULL) \ PGRP_UNLOCK(pg); \ } while (0) /* Lock and unlock a session. */ #define SESS_LOCK(s) mtx_lock(&(s)->s_mtx) #define SESS_UNLOCK(s) mtx_unlock(&(s)->s_mtx) #define SESS_LOCKED(s) mtx_owned(&(s)->s_mtx) #define SESS_LOCK_ASSERT(s, type) mtx_assert(&(s)->s_mtx, (type)) /* * Non-zero p_lock ensures that: * - exit1() is not performed until p_lock reaches zero; * - the process' threads stack are not swapped out if they are currently * not (P_INMEM). * * PHOLD() asserts that the process (except the current process) is * not exiting, increments p_lock and swaps threads stacks into memory, * if needed. * _PHOLD() is same as PHOLD(), it takes the process locked. * _PHOLD_LITE() also takes the process locked, but comparing with * _PHOLD(), it only guarantees that exit1() is not executed, * faultin() is not called. 
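
FOREACH_THREAD_IN_PROC() and the PROC_LOCK() macros above are normally used together, since the p_threads list is protected by the process lock; per-thread fields marked (t) additionally want the thread lock. An illustrative fragment, not a standalone program (count_sleeping is a made-up example function, and thread_lock()/thread_unlock() come from the mutex code, not from this header):

    /*
     * Count this process's threads that are currently asleep.
     * Illustrative only; assumes the caller holds a valid reference on p.
     */
    static int
    count_sleeping(struct proc *p)
    {
            struct thread *td;
            int n;

            n = 0;
            PROC_LOCK(p);
            FOREACH_THREAD_IN_PROC(p, td) {
                    thread_lock(td);
                    if (TD_IS_SLEEPING(td))
                            n++;
                    thread_unlock(td);
            }
            PROC_UNLOCK(p);
            return (n);
    }
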
*/ #define PHOLD(p) do { \ PROC_LOCK(p); \ _PHOLD(p); \ PROC_UNLOCK(p); \ } while (0) #define _PHOLD(p) do { \ PROC_LOCK_ASSERT((p), MA_OWNED); \ KASSERT(!((p)->p_flag & P_WEXIT) || (p) == curproc, \ ("PHOLD of exiting process %p", p)); \ (p)->p_lock++; \ if (((p)->p_flag & P_INMEM) == 0) \ faultin((p)); \ } while (0) #define _PHOLD_LITE(p) do { \ PROC_LOCK_ASSERT((p), MA_OWNED); \ KASSERT(!((p)->p_flag & P_WEXIT) || (p) == curproc, \ ("PHOLD of exiting process %p", p)); \ (p)->p_lock++; \ } while (0) #define PROC_ASSERT_HELD(p) do { \ KASSERT((p)->p_lock > 0, ("process %p not held", p)); \ } while (0) #define PRELE(p) do { \ PROC_LOCK((p)); \ _PRELE((p)); \ PROC_UNLOCK((p)); \ } while (0) #define _PRELE(p) do { \ PROC_LOCK_ASSERT((p), MA_OWNED); \ PROC_ASSERT_HELD(p); \ (--(p)->p_lock); \ if (((p)->p_flag & P_WEXIT) && (p)->p_lock == 0) \ wakeup(&(p)->p_lock); \ } while (0) #define PROC_ASSERT_NOT_HELD(p) do { \ KASSERT((p)->p_lock == 0, ("process %p held", p)); \ } while (0) #define PROC_UPDATE_COW(p) do { \ PROC_LOCK_ASSERT((p), MA_OWNED); \ (p)->p_cowgen++; \ } while (0) /* Check whether a thread is safe to be swapped out. */ #define thread_safetoswapout(td) ((td)->td_flags & TDF_CANSWAP) /* Control whether or not it is safe for curthread to sleep. */ #define THREAD_NO_SLEEPING() ((curthread)->td_no_sleeping++) #define THREAD_SLEEPING_OK() ((curthread)->td_no_sleeping--) #define THREAD_CAN_SLEEP() ((curthread)->td_no_sleeping == 0) #define PIDHASH(pid) (&pidhashtbl[(pid) & pidhash]) #define PIDHASHLOCK(pid) (&pidhashtbl_lock[((pid) & pidhashlock)]) extern LIST_HEAD(pidhashhead, proc) *pidhashtbl; extern struct sx *pidhashtbl_lock; extern u_long pidhash; extern u_long pidhashlock; #define TIDHASH(tid) (&tidhashtbl[(tid) & tidhash]) extern LIST_HEAD(tidhashhead, thread) *tidhashtbl; extern u_long tidhash; extern struct rwlock tidhash_lock; #define PGRPHASH(pgid) (&pgrphashtbl[(pgid) & pgrphash]) extern LIST_HEAD(pgrphashhead, pgrp) *pgrphashtbl; extern u_long pgrphash; extern struct sx allproc_lock; extern int allproc_gen; extern struct sx zombproc_lock; extern struct sx proctree_lock; extern struct mtx ppeers_lock; extern struct mtx procid_lock; extern struct proc proc0; /* Process slot for swapper. */ extern struct thread0_storage thread0_st; /* Primary thread in proc0. */ #define thread0 (thread0_st.t0st_thread) extern struct vmspace vmspace0; /* VM space for proc0. */ extern int hogticks; /* Limit on kernel cpu hogs. */ extern int lastpid; extern int nprocs, maxproc; /* Current and max number of procs. */ extern int maxprocperuid; /* Max procs per uid. */ extern u_long ps_arg_cache_limit; LIST_HEAD(proclist, proc); TAILQ_HEAD(procqueue, proc); TAILQ_HEAD(threadqueue, thread); extern struct proclist allproc; /* List of all processes. */ extern struct proclist zombproc; /* List of zombie processes. */ extern struct proc *initproc, *pageproc; /* Process slots for init, pager. */ extern struct uma_zone *proc_zone; struct proc *pfind(pid_t); /* Find process by id. */ struct proc *pfind_any(pid_t); /* Find (zombie) process by id. */ struct proc *pfind_any_locked(pid_t pid); /* Find process by id, locked. */ struct pgrp *pgfind(pid_t); /* Find process group by id. */ struct proc *zpfind(pid_t); /* Find zombie process by id. */ void pidhash_slockall(void); /* Shared lock all pid hash lists. */ void pidhash_sunlockall(void); /* Shared unlock all pid hash lists. 
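
As the comment above describes, a non-zero p_lock keeps exit1() from completing and keeps the threads' stacks resident, which lets a caller drop the process lock and still use the process safely. A sketch of the usual pattern (illustrative fragment; pid is assumed to be the caller's argument, and pfind() is taken to return the process with its lock held):

    struct proc *p;

    p = pfind(pid);                 /* returned with PROC_LOCK held */
    if (p == NULL)
            return (ESRCH);
    _PHOLD(p);                      /* prevent exit1(), fault stacks in */
    PROC_UNLOCK(p);

    /* ... possibly sleeping work that must not race with process exit ... */

    PRELE(p);                       /* relock, drop the hold, unlock */
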
*/ struct fork_req { int fr_flags; int fr_pages; int *fr_pidp; struct proc **fr_procp; int *fr_pd_fd; int fr_pd_flags; struct filecaps *fr_pd_fcaps; }; /* * pget() flags. */ #define PGET_HOLD 0x00001 /* Hold the process. */ #define PGET_CANSEE 0x00002 /* Check against p_cansee(). */ #define PGET_CANDEBUG 0x00004 /* Check against p_candebug(). */ #define PGET_ISCURRENT 0x00008 /* Check that the found process is current. */ #define PGET_NOTWEXIT 0x00010 /* Check that the process is not in P_WEXIT. */ #define PGET_NOTINEXEC 0x00020 /* Check that the process is not in P_INEXEC. */ #define PGET_NOTID 0x00040 /* Do not assume tid if pid > PID_MAX. */ #define PGET_WANTREAD (PGET_HOLD | PGET_CANDEBUG | PGET_NOTWEXIT) int pget(pid_t pid, int flags, struct proc **pp); void ast(struct trapframe *framep); struct thread *choosethread(void); int cr_cansee(struct ucred *u1, struct ucred *u2); int cr_canseesocket(struct ucred *cred, struct socket *so); int cr_canseeothergids(struct ucred *u1, struct ucred *u2); int cr_canseeotheruids(struct ucred *u1, struct ucred *u2); int cr_canseejailproc(struct ucred *u1, struct ucred *u2); int cr_cansignal(struct ucred *cred, struct proc *proc, int signum); int enterpgrp(struct proc *p, pid_t pgid, struct pgrp *pgrp, struct session *sess); int enterthispgrp(struct proc *p, struct pgrp *pgrp); void faultin(struct proc *p); void fixjobc(struct proc *p, struct pgrp *pgrp, int entering); int fork1(struct thread *, struct fork_req *); void fork_rfppwait(struct thread *); void fork_exit(void (*)(void *, struct trapframe *), void *, struct trapframe *); void fork_return(struct thread *, struct trapframe *); int inferior(struct proc *p); void kern_proc_vmmap_resident(struct vm_map *map, struct vm_map_entry *entry, int *resident_count, bool *super); void kern_yield(int); void kick_proc0(void); void killjobc(void); int leavepgrp(struct proc *p); int maybe_preempt(struct thread *td); void maybe_yield(void); void mi_switch(int flags, struct thread *newtd); int p_candebug(struct thread *td, struct proc *p); int p_cansee(struct thread *td, struct proc *p); int p_cansched(struct thread *td, struct proc *p); int p_cansignal(struct thread *td, struct proc *p, int signum); int p_canwait(struct thread *td, struct proc *p); struct pargs *pargs_alloc(int len); void pargs_drop(struct pargs *pa); void pargs_hold(struct pargs *pa); int proc_getargv(struct thread *td, struct proc *p, struct sbuf *sb); int proc_getauxv(struct thread *td, struct proc *p, struct sbuf *sb); int proc_getenvv(struct thread *td, struct proc *p, struct sbuf *sb); void procinit(void); int proc_iterate(int (*cb)(struct proc *, void *), void *cbarg); void proc_linkup0(struct proc *p, struct thread *td); void proc_linkup(struct proc *p, struct thread *td); struct proc *proc_realparent(struct proc *child); void proc_reap(struct thread *td, struct proc *p, int *status, int options); void proc_reparent(struct proc *child, struct proc *newparent, bool set_oppid); void proc_set_traced(struct proc *p, bool stop); void proc_wkilled(struct proc *p); struct pstats *pstats_alloc(void); void pstats_fork(struct pstats *src, struct pstats *dst); void pstats_free(struct pstats *ps); void reaper_abandon_children(struct proc *p, bool exiting); int securelevel_ge(struct ucred *cr, int level); int securelevel_gt(struct ucred *cr, int level); void sess_hold(struct session *); void sess_release(struct session *); int setrunnable(struct thread *); void setsugid(struct proc *p); int should_yield(void); int sigonstack(size_t sp); void 
stopevent(struct proc *, u_int, u_int); struct thread *tdfind(lwpid_t, pid_t); void threadinit(void); void tidhash_add(struct thread *); void tidhash_remove(struct thread *); void cpu_idle(int); int cpu_idle_wakeup(int); extern void (*cpu_idle_hook)(sbintime_t); /* Hook to machdep CPU idler. */ void cpu_switch(struct thread *, struct thread *, struct mtx *); void cpu_throw(struct thread *, struct thread *) __dead2; void unsleep(struct thread *); void userret(struct thread *, struct trapframe *); void cpu_exit(struct thread *); void exit1(struct thread *, int, int) __dead2; void cpu_copy_thread(struct thread *td, struct thread *td0); bool cpu_exec_vmspace_reuse(struct proc *p, struct vm_map *map); int cpu_fetch_syscall_args(struct thread *td); void cpu_fork(struct thread *, struct proc *, struct thread *, int); void cpu_fork_kthread_handler(struct thread *, void (*)(void *), void *); int cpu_procctl(struct thread *td, int idtype, id_t id, int com, void *data); void cpu_set_syscall_retval(struct thread *, int); void cpu_set_upcall(struct thread *, void (*)(void *), void *, stack_t *); int cpu_set_user_tls(struct thread *, void *tls_base); void cpu_thread_alloc(struct thread *); void cpu_thread_clean(struct thread *); void cpu_thread_exit(struct thread *); void cpu_thread_free(struct thread *); void cpu_thread_swapin(struct thread *); void cpu_thread_swapout(struct thread *); struct thread *thread_alloc(int pages); int thread_alloc_stack(struct thread *, int pages); void thread_cow_get_proc(struct thread *newtd, struct proc *p); void thread_cow_get(struct thread *newtd, struct thread *td); void thread_cow_free(struct thread *td); void thread_cow_update(struct thread *td); int thread_create(struct thread *td, struct rtprio *rtp, int (*initialize_thread)(struct thread *, void *), void *thunk); void thread_exit(void) __dead2; void thread_free(struct thread *td); void thread_link(struct thread *td, struct proc *p); void thread_reap(void); int thread_single(struct proc *p, int how); void thread_single_end(struct proc *p, int how); void thread_stash(struct thread *td); void thread_stopped(struct proc *p); void childproc_stopped(struct proc *child, int reason); void childproc_continued(struct proc *child); void childproc_exited(struct proc *child); int thread_suspend_check(int how); bool thread_suspend_check_needed(void); void thread_suspend_switch(struct thread *, struct proc *p); void thread_suspend_one(struct thread *td); void thread_unlink(struct thread *td); void thread_unsuspend(struct proc *p); void thread_wait(struct proc *p); struct thread *thread_find(struct proc *p, lwpid_t tid); void stop_all_proc(void); void resume_all_proc(void); static __inline int curthread_pflags_set(int flags) { struct thread *td; int save; td = curthread; save = ~flags | (td->td_pflags & flags); td->td_pflags |= flags; return (save); } static __inline void curthread_pflags_restore(int save) { curthread->td_pflags &= save; } static __inline __pure2 struct td_sched * td_get_sched(struct thread *td) { return ((struct td_sched *)&td[1]); } extern void (*softdep_ast_cleanup)(struct thread *); static __inline void td_softdep_cleanup(struct thread *td) { if (td->td_su != NULL && softdep_ast_cleanup != NULL) softdep_ast_cleanup(td); } #define PROC_ID_PID 0 #define PROC_ID_GROUP 1 #define PROC_ID_SESSION 2 #define PROC_ID_REAP 3 void proc_id_set(int type, pid_t id); void proc_id_set_cond(int type, pid_t id); void proc_id_clear(int type, pid_t id); #endif /* _KERNEL */ #endif /* !_SYS_PROC_H_ */ Index: 
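
curthread_pflags_set() returns a mask that lets curthread_pflags_restore() clear only the TDP_* bits this caller actually turned on, so nested users of the same private flag do not clear each other's state. A standalone model of the save/restore arithmetic (plain ints and invented FLAG_* names, no kernel types):

    #include <assert.h>

    #define FLAG_A  0x1
    #define FLAG_B  0x2

    /* Same arithmetic as curthread_pflags_set()/curthread_pflags_restore(). */
    static int
    pflags_set(int *pflags, int flags)
    {
            int save;

            save = ~flags | (*pflags & flags);
            *pflags |= flags;
            return (save);
    }

    static void
    pflags_restore(int *pflags, int save)
    {
            *pflags &= save;
    }

    int
    main(void)
    {
            int pflags = FLAG_A;    /* FLAG_A already set by an outer caller */
            int save;

            save = pflags_set(&pflags, FLAG_A | FLAG_B);
            assert(pflags == (FLAG_A | FLAG_B));
            pflags_restore(&pflags, save);
            /* FLAG_B is cleared, but the pre-existing FLAG_A survives. */
            assert(pflags == FLAG_A);
            return (0);
    }
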
user/ngie/bug-237403/sys/x86/x86/busdma_bounce.c =================================================================== --- user/ngie/bug-237403/sys/x86/x86/busdma_bounce.c (revision 346925) +++ user/ngie/bug-237403/sys/x86/x86/busdma_bounce.c (revision 346926) @@ -1,1321 +1,1319 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 1997, 1998 Justin T. Gibbs. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions, and the following disclaimer, * without modification, immediately at the beginning of the file. * 2. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef __i386__ #define MAX_BPAGES (Maxmem > atop(0x100000000ULL) ? 
8192 : 512) #else #define MAX_BPAGES 8192 #endif enum { BUS_DMA_COULD_BOUNCE = 0x01, BUS_DMA_MIN_ALLOC_COMP = 0x02, BUS_DMA_KMEM_ALLOC = 0x04, }; struct bounce_zone; struct bus_dma_tag { struct bus_dma_tag_common common; int map_count; int bounce_flags; bus_dma_segment_t *segments; struct bounce_zone *bounce_zone; }; struct bounce_page { vm_offset_t vaddr; /* kva of bounce buffer */ bus_addr_t busaddr; /* Physical address */ vm_offset_t datavaddr; /* kva of client data */ vm_offset_t dataoffs; /* page offset of client data */ vm_page_t datapage[2]; /* physical page(s) of client data */ bus_size_t datacount; /* client data count */ STAILQ_ENTRY(bounce_page) links; }; int busdma_swi_pending; struct bounce_zone { STAILQ_ENTRY(bounce_zone) links; STAILQ_HEAD(bp_list, bounce_page) bounce_page_list; int total_bpages; int free_bpages; int reserved_bpages; int active_bpages; int total_bounced; int total_deferred; int map_count; int domain; bus_size_t alignment; bus_addr_t lowaddr; char zoneid[8]; char lowaddrid[20]; struct sysctl_ctx_list sysctl_tree; struct sysctl_oid *sysctl_tree_top; }; static struct mtx bounce_lock; static int total_bpages; static int busdma_zonecount; static STAILQ_HEAD(, bounce_zone) bounce_zone_list; static SYSCTL_NODE(_hw, OID_AUTO, busdma, CTLFLAG_RD, 0, "Busdma parameters"); SYSCTL_INT(_hw_busdma, OID_AUTO, total_bpages, CTLFLAG_RD, &total_bpages, 0, "Total bounce pages"); struct bus_dmamap { struct bp_list bpages; int pagesneeded; int pagesreserved; bus_dma_tag_t dmat; struct memdesc mem; bus_dmamap_callback_t *callback; void *callback_arg; STAILQ_ENTRY(bus_dmamap) links; }; static STAILQ_HEAD(, bus_dmamap) bounce_map_waitinglist; static STAILQ_HEAD(, bus_dmamap) bounce_map_callbacklist; static struct bus_dmamap nobounce_dmamap; static void init_bounce_pages(void *dummy); static int alloc_bounce_zone(bus_dma_tag_t dmat); static int alloc_bounce_pages(bus_dma_tag_t dmat, u_int numpages); static int reserve_bounce_pages(bus_dma_tag_t dmat, bus_dmamap_t map, int commit); static bus_addr_t add_bounce_page(bus_dma_tag_t dmat, bus_dmamap_t map, vm_offset_t vaddr, vm_paddr_t addr1, vm_paddr_t addr2, bus_size_t size); static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage); static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, void *buf, bus_size_t buflen, int flags); static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf, bus_size_t buflen, int flags); static int _bus_dmamap_reserve_pages(bus_dma_tag_t dmat, bus_dmamap_t map, int flags); static int bounce_bus_dma_zone_setup(bus_dma_tag_t dmat) { struct bounce_zone *bz; int error; /* Must bounce */ if ((error = alloc_bounce_zone(dmat)) != 0) return (error); bz = dmat->bounce_zone; if (ptoa(bz->total_bpages) < dmat->common.maxsize) { int pages; pages = atop(dmat->common.maxsize) - bz->total_bpages; /* Add pages to our bounce pool */ if (alloc_bounce_pages(dmat, pages) < pages) return (ENOMEM); } /* Performed initial allocation */ dmat->bounce_flags |= BUS_DMA_MIN_ALLOC_COMP; return (0); } /* * Allocate a device specific dma_tag. */ static int bounce_bus_dma_tag_create(bus_dma_tag_t parent, bus_size_t alignment, bus_addr_t boundary, bus_addr_t lowaddr, bus_addr_t highaddr, bus_dma_filter_t *filter, void *filterarg, bus_size_t maxsize, int nsegments, bus_size_t maxsegsz, int flags, bus_dma_lock_t *lockfunc, void *lockfuncarg, bus_dma_tag_t *dmat) { bus_dma_tag_t newtag; int error; *dmat = NULL; error = common_bus_dma_tag_create(parent != NULL ? 
&parent->common : NULL, alignment, boundary, lowaddr, highaddr, filter, filterarg, maxsize, nsegments, maxsegsz, flags, lockfunc, lockfuncarg, sizeof (struct bus_dma_tag), (void **)&newtag); if (error != 0) return (error); newtag->common.impl = &bus_dma_bounce_impl; newtag->map_count = 0; newtag->segments = NULL; if (parent != NULL && (newtag->common.filter != NULL || (parent->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0)) newtag->bounce_flags |= BUS_DMA_COULD_BOUNCE; if (newtag->common.lowaddr < ptoa((vm_paddr_t)Maxmem) || newtag->common.alignment > 1) newtag->bounce_flags |= BUS_DMA_COULD_BOUNCE; if ((newtag->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0 && (flags & BUS_DMA_ALLOCNOW) != 0) error = bounce_bus_dma_zone_setup(newtag); else error = 0; if (error != 0) free(newtag, M_DEVBUF); else *dmat = newtag; CTR4(KTR_BUSDMA, "%s returned tag %p tag flags 0x%x error %d", __func__, newtag, (newtag != NULL ? newtag->common.flags : 0), error); return (error); } /* * Update the domain for the tag. We may need to reallocate the zone and * bounce pages. */ static int bounce_bus_dma_tag_set_domain(bus_dma_tag_t dmat) { KASSERT(dmat->map_count == 0, ("bounce_bus_dma_tag_set_domain: Domain set after use.\n")); if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) == 0 || dmat->bounce_zone == NULL) return (0); dmat->bounce_flags &= ~BUS_DMA_MIN_ALLOC_COMP; return (bounce_bus_dma_zone_setup(dmat)); } static int bounce_bus_dma_tag_destroy(bus_dma_tag_t dmat) { bus_dma_tag_t dmat_copy, parent; int error; error = 0; dmat_copy = dmat; if (dmat != NULL) { if (dmat->map_count != 0) { error = EBUSY; goto out; } while (dmat != NULL) { parent = (bus_dma_tag_t)dmat->common.parent; atomic_subtract_int(&dmat->common.ref_count, 1); if (dmat->common.ref_count == 0) { if (dmat->segments != NULL) free_domain(dmat->segments, M_DEVBUF); free(dmat, M_DEVBUF); /* * Last reference count, so * release our reference * count on our parent. */ dmat = parent; } else dmat = NULL; } } out: CTR3(KTR_BUSDMA, "%s tag %p error %d", __func__, dmat_copy, error); return (error); } /* * Allocate a handle for mapping from kva/uva/physical * address space into bus device space. */ static int bounce_bus_dmamap_create(bus_dma_tag_t dmat, int flags, bus_dmamap_t *mapp) { struct bounce_zone *bz; int error, maxpages, pages; - WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "%s", __func__); - error = 0; if (dmat->segments == NULL) { dmat->segments = (bus_dma_segment_t *)malloc_domainset( sizeof(bus_dma_segment_t) * dmat->common.nsegments, M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), M_NOWAIT); if (dmat->segments == NULL) { CTR3(KTR_BUSDMA, "%s: tag %p error %d", __func__, dmat, ENOMEM); return (ENOMEM); } } /* * Bouncing might be required if the driver asks for an active * exclusion region, a data alignment that is stricter than 1, and/or * an active address boundary. */ if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { /* Must bounce */ if (dmat->bounce_zone == NULL) { if ((error = alloc_bounce_zone(dmat)) != 0) return (error); } bz = dmat->bounce_zone; *mapp = (bus_dmamap_t)malloc_domainset(sizeof(**mapp), M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), M_NOWAIT | M_ZERO); if (*mapp == NULL) { CTR3(KTR_BUSDMA, "%s: tag %p error %d", __func__, dmat, ENOMEM); return (ENOMEM); } /* Initialize the new map */ STAILQ_INIT(&((*mapp)->bpages)); /* * Attempt to add pages to our pool on a per-instance * basis up to a sane limit. 
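
Per the tag-creation logic and the comment above, BUS_DMA_COULD_BOUNCE is armed when a tag carries an exclusion window (lowaddr below the top of RAM), an alignment stricter than 1, or an inherited bounce requirement. A hedged driver-side sketch using the public bus_dma_tag_create(9) wrapper (dev is the driver's device_t; the specific argument values are invented for illustration):

    bus_dma_tag_t tag;
    int error;

    /*
     * A device that can only address the low 4GB: on a machine with more
     * RAM than that, lowaddr < Maxmem makes this tag a bouncing tag.
     */
    error = bus_dma_tag_create(
        bus_get_dma_tag(dev),       /* parent */
        4, 0,                       /* alignment, boundary */
        BUS_SPACE_MAXADDR_32BIT,    /* lowaddr */
        BUS_SPACE_MAXADDR,          /* highaddr */
        NULL, NULL,                 /* filter, filterarg */
        MAXPHYS, 1, MAXPHYS,        /* maxsize, nsegments, maxsegsz */
        0,                          /* flags */
        NULL, NULL,                 /* lockfunc, lockfuncarg */
        &tag);
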
*/ if (dmat->common.alignment > 1) maxpages = MAX_BPAGES; else maxpages = MIN(MAX_BPAGES, Maxmem - atop(dmat->common.lowaddr)); if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 || (bz->map_count > 0 && bz->total_bpages < maxpages)) { pages = MAX(atop(dmat->common.maxsize), 1); pages = MIN(maxpages - bz->total_bpages, pages); pages = MAX(pages, 1); if (alloc_bounce_pages(dmat, pages) < pages) error = ENOMEM; if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0) { if (error == 0) { dmat->bounce_flags |= BUS_DMA_MIN_ALLOC_COMP; } } else error = 0; } bz->map_count++; } else { *mapp = NULL; } if (error == 0) dmat->map_count++; CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x error %d", __func__, dmat, dmat->common.flags, error); return (error); } /* * Destroy a handle for mapping from kva/uva/physical * address space into bus device space. */ static int bounce_bus_dmamap_destroy(bus_dma_tag_t dmat, bus_dmamap_t map) { if (map != NULL && map != &nobounce_dmamap) { if (STAILQ_FIRST(&map->bpages) != NULL) { CTR3(KTR_BUSDMA, "%s: tag %p error %d", __func__, dmat, EBUSY); return (EBUSY); } if (dmat->bounce_zone) dmat->bounce_zone->map_count--; free_domain(map, M_DEVBUF); } dmat->map_count--; CTR2(KTR_BUSDMA, "%s: tag %p error 0", __func__, dmat); return (0); } /* * Allocate a piece of memory that can be efficiently mapped into * bus device space based on the constraints lited in the dma tag. * A dmamap to for use with dmamap_load is also allocated. */ static int bounce_bus_dmamem_alloc(bus_dma_tag_t dmat, void** vaddr, int flags, bus_dmamap_t *mapp) { vm_memattr_t attr; int mflags; WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "%s", __func__); if (flags & BUS_DMA_NOWAIT) mflags = M_NOWAIT; else mflags = M_WAITOK; /* If we succeed, no mapping/bouncing will be required */ *mapp = NULL; if (dmat->segments == NULL) { dmat->segments = (bus_dma_segment_t *)malloc_domainset( sizeof(bus_dma_segment_t) * dmat->common.nsegments, M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), mflags); if (dmat->segments == NULL) { CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x error %d", __func__, dmat, dmat->common.flags, ENOMEM); return (ENOMEM); } } if (flags & BUS_DMA_ZERO) mflags |= M_ZERO; if (flags & BUS_DMA_NOCACHE) attr = VM_MEMATTR_UNCACHEABLE; else attr = VM_MEMATTR_DEFAULT; /* * Allocate the buffer from the malloc(9) allocator if... * - It's small enough to fit into a single power of two sized bucket. * - The alignment is less than or equal to the maximum size * - The low address requirement is fulfilled. * else allocate non-contiguous pages if... * - The page count that could get allocated doesn't exceed * nsegments also when the maximum segment size is less * than PAGE_SIZE. * - The alignment constraint isn't larger than a page boundary. * - There are no boundary-crossing constraints. * else allocate a block of contiguous pages because one or more of the * constraints is something that only the contig allocator can fulfill. * * NOTE: The (dmat->common.alignment <= dmat->maxsize) check * below is just a quick hack. The exact alignment guarantees * of malloc(9) need to be nailed down, and the code below * should be rewritten to take that into account. * * In the meantime warn the user if malloc gets it wrong. 
*/ if (dmat->common.maxsize <= PAGE_SIZE && dmat->common.alignment <= dmat->common.maxsize && dmat->common.lowaddr >= ptoa((vm_paddr_t)Maxmem) && attr == VM_MEMATTR_DEFAULT) { *vaddr = malloc_domainset(dmat->common.maxsize, M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), mflags); } else if (dmat->common.nsegments >= howmany(dmat->common.maxsize, MIN(dmat->common.maxsegsz, PAGE_SIZE)) && dmat->common.alignment <= PAGE_SIZE && (dmat->common.boundary % PAGE_SIZE) == 0) { /* Page-based multi-segment allocations allowed */ *vaddr = (void *)kmem_alloc_attr_domainset( DOMAINSET_PREF(dmat->common.domain), dmat->common.maxsize, mflags, 0ul, dmat->common.lowaddr, attr); dmat->bounce_flags |= BUS_DMA_KMEM_ALLOC; } else { *vaddr = (void *)kmem_alloc_contig_domainset( DOMAINSET_PREF(dmat->common.domain), dmat->common.maxsize, mflags, 0ul, dmat->common.lowaddr, dmat->common.alignment != 0 ? dmat->common.alignment : 1ul, dmat->common.boundary, attr); dmat->bounce_flags |= BUS_DMA_KMEM_ALLOC; } if (*vaddr == NULL) { CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x error %d", __func__, dmat, dmat->common.flags, ENOMEM); return (ENOMEM); } else if (vtophys(*vaddr) & (dmat->common.alignment - 1)) { printf("bus_dmamem_alloc failed to align memory properly.\n"); } CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x error %d", __func__, dmat, dmat->common.flags, 0); return (0); } /* * Free a piece of memory and it's allociated dmamap, that was allocated * via bus_dmamem_alloc. Make the same choice for free/contigfree. */ static void bounce_bus_dmamem_free(bus_dma_tag_t dmat, void *vaddr, bus_dmamap_t map) { /* * dmamem does not need to be bounced, so the map should be * NULL and the BUS_DMA_KMEM_ALLOC flag cleared if malloc() * was used and set if kmem_alloc_contig() was used. */ if (map != NULL) panic("bus_dmamem_free: Invalid map freed\n"); if ((dmat->bounce_flags & BUS_DMA_KMEM_ALLOC) == 0) free_domain(vaddr, M_DEVBUF); else kmem_free((vm_offset_t)vaddr, dmat->common.maxsize); CTR3(KTR_BUSDMA, "%s: tag %p flags 0x%x", __func__, dmat, dmat->bounce_flags); } static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf, bus_size_t buflen, int flags) { vm_paddr_t curaddr; bus_size_t sgsize; if (map != &nobounce_dmamap && map->pagesneeded == 0) { /* * Count the number of bounce pages * needed in order to complete this transfer */ curaddr = buf; while (buflen != 0) { sgsize = MIN(buflen, dmat->common.maxsegsz); if (bus_dma_run_filter(&dmat->common, curaddr)) { sgsize = MIN(sgsize, PAGE_SIZE - (curaddr & PAGE_MASK)); map->pagesneeded++; } curaddr += sgsize; buflen -= sgsize; } CTR1(KTR_BUSDMA, "pagesneeded= %d\n", map->pagesneeded); } } static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map, pmap_t pmap, void *buf, bus_size_t buflen, int flags) { vm_offset_t vaddr; vm_offset_t vendaddr; vm_paddr_t paddr; bus_size_t sg_len; if (map != &nobounce_dmamap && map->pagesneeded == 0) { CTR4(KTR_BUSDMA, "lowaddr= %d Maxmem= %d, boundary= %d, " "alignment= %d", dmat->common.lowaddr, ptoa((vm_paddr_t)Maxmem), dmat->common.boundary, dmat->common.alignment); CTR3(KTR_BUSDMA, "map= %p, nobouncemap= %p, pagesneeded= %d", map, &nobounce_dmamap, map->pagesneeded); /* * Count the number of bounce pages * needed in order to complete this transfer */ vaddr = (vm_offset_t)buf; vendaddr = (vm_offset_t)buf + buflen; while (vaddr < vendaddr) { sg_len = PAGE_SIZE - ((vm_offset_t)vaddr & PAGE_MASK); if (pmap == kernel_pmap) paddr = pmap_kextract(vaddr); else paddr = pmap_extract(pmap, vaddr); if 
(bus_dma_run_filter(&dmat->common, paddr) != 0) { sg_len = roundup2(sg_len, dmat->common.alignment); map->pagesneeded++; } vaddr += sg_len; } CTR1(KTR_BUSDMA, "pagesneeded= %d\n", map->pagesneeded); } } static void _bus_dmamap_count_ma(bus_dma_tag_t dmat, bus_dmamap_t map, struct vm_page **ma, int ma_offs, bus_size_t buflen, int flags) { bus_size_t sg_len, max_sgsize; int page_index; vm_paddr_t paddr; if (map != &nobounce_dmamap && map->pagesneeded == 0) { CTR4(KTR_BUSDMA, "lowaddr= %d Maxmem= %d, boundary= %d, " "alignment= %d", dmat->common.lowaddr, ptoa((vm_paddr_t)Maxmem), dmat->common.boundary, dmat->common.alignment); CTR3(KTR_BUSDMA, "map= %p, nobouncemap= %p, pagesneeded= %d", map, &nobounce_dmamap, map->pagesneeded); /* * Count the number of bounce pages * needed in order to complete this transfer */ page_index = 0; while (buflen > 0) { paddr = VM_PAGE_TO_PHYS(ma[page_index]) + ma_offs; sg_len = PAGE_SIZE - ma_offs; max_sgsize = MIN(buflen, dmat->common.maxsegsz); sg_len = MIN(sg_len, max_sgsize); if (bus_dma_run_filter(&dmat->common, paddr) != 0) { sg_len = roundup2(sg_len, dmat->common.alignment); sg_len = MIN(sg_len, max_sgsize); KASSERT((sg_len & (dmat->common.alignment - 1)) == 0, ("Segment size is not aligned")); map->pagesneeded++; } if (((ma_offs + sg_len) & ~PAGE_MASK) != 0) page_index++; ma_offs = (ma_offs + sg_len) & PAGE_MASK; KASSERT(buflen >= sg_len, ("Segment length overruns original buffer")); buflen -= sg_len; } CTR1(KTR_BUSDMA, "pagesneeded= %d\n", map->pagesneeded); } } static int _bus_dmamap_reserve_pages(bus_dma_tag_t dmat, bus_dmamap_t map, int flags) { /* Reserve Necessary Bounce Pages */ mtx_lock(&bounce_lock); if (flags & BUS_DMA_NOWAIT) { if (reserve_bounce_pages(dmat, map, 0) != 0) { mtx_unlock(&bounce_lock); return (ENOMEM); } } else { if (reserve_bounce_pages(dmat, map, 1) != 0) { /* Queue us for resources */ STAILQ_INSERT_TAIL(&bounce_map_waitinglist, map, links); mtx_unlock(&bounce_lock); return (EINPROGRESS); } } mtx_unlock(&bounce_lock); return (0); } /* * Add a single contiguous physical range to the segment list. */ static int _bus_dmamap_addseg(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t curaddr, bus_size_t sgsize, bus_dma_segment_t *segs, int *segp) { bus_addr_t baddr, bmask; int seg; KASSERT(curaddr <= BUS_SPACE_MAXADDR, ("ds_addr %#jx > BUS_SPACE_MAXADDR %#jx; dmat %p fl %#x low %#jx " "hi %#jx", (uintmax_t)curaddr, (uintmax_t)BUS_SPACE_MAXADDR, dmat, dmat->bounce_flags, (uintmax_t)dmat->common.lowaddr, (uintmax_t)dmat->common.highaddr)); /* * Make sure we don't cross any boundaries. */ bmask = ~(dmat->common.boundary - 1); if (dmat->common.boundary > 0) { baddr = (curaddr + dmat->common.boundary) & bmask; if (sgsize > (baddr - curaddr)) sgsize = (baddr - curaddr); } /* * Insert chunk into a segment, coalescing with * previous segment if possible. */ seg = *segp; if (seg == -1) { seg = 0; segs[seg].ds_addr = curaddr; segs[seg].ds_len = sgsize; } else { if (curaddr == segs[seg].ds_addr + segs[seg].ds_len && (segs[seg].ds_len + sgsize) <= dmat->common.maxsegsz && (dmat->common.boundary == 0 || (segs[seg].ds_addr & bmask) == (curaddr & bmask))) segs[seg].ds_len += sgsize; else { if (++seg >= dmat->common.nsegments) return (0); segs[seg].ds_addr = curaddr; segs[seg].ds_len = sgsize; } } *segp = seg; return (sgsize); } /* * Utility function to load a physical buffer. segp contains * the starting segment on entrace, and the ending segment on exit. 
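
_bus_dmamap_addseg() above clips each chunk so it never crosses the tag's boundary, then merges it with the previous segment when the addresses are contiguous, the combined length still fits maxsegsz, and both ends lie in the same boundary window. A standalone model of just the boundary clipping (the function name and the test addresses are arbitrary):

    #include <assert.h>
    #include <stdint.h>

    /*
     * Clip a segment starting at curaddr so it does not cross a power-of-two
     * boundary; mirrors the bmask/baddr logic in _bus_dmamap_addseg().
     */
    static uint64_t
    clip_to_boundary(uint64_t curaddr, uint64_t sgsize, uint64_t boundary)
    {
            uint64_t baddr, bmask;

            if (boundary == 0)
                    return (sgsize);
            bmask = ~(boundary - 1);
            baddr = (curaddr + boundary) & bmask;   /* next boundary above */
            if (sgsize > baddr - curaddr)
                    sgsize = baddr - curaddr;
            return (sgsize);
    }

    int
    main(void)
    {
            /* 0x1f00 + 0x400 would cross the 4KB line at 0x2000: clip to 0x100. */
            assert(clip_to_boundary(0x1f00, 0x400, 0x1000) == 0x100);
            /* Fully inside one window: unchanged. */
            assert(clip_to_boundary(0x1000, 0x400, 0x1000) == 0x400);
            return (0);
    }
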
*/ static int bounce_bus_dmamap_load_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf, bus_size_t buflen, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize; vm_paddr_t curaddr; int error; if (map == NULL) map = &nobounce_dmamap; if (segs == NULL) segs = dmat->segments; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { _bus_dmamap_count_phys(dmat, map, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) return (error); } } while (buflen > 0) { curaddr = buf; sgsize = MIN(buflen, dmat->common.maxsegsz); if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0 && map->pagesneeded != 0 && bus_dma_run_filter(&dmat->common, curaddr)) { sgsize = MIN(sgsize, PAGE_SIZE - (curaddr & PAGE_MASK)); curaddr = add_bounce_page(dmat, map, 0, curaddr, 0, sgsize); } sgsize = _bus_dmamap_addseg(dmat, map, curaddr, sgsize, segs, segp); if (sgsize == 0) break; buf += sgsize; buflen -= sgsize; } /* * Did we fit? */ return (buflen != 0 ? EFBIG : 0); /* XXX better return value here? */ } /* * Utility function to load a linear buffer. segp contains * the starting segment on entrace, and the ending segment on exit. */ static int bounce_bus_dmamap_load_buffer(bus_dma_tag_t dmat, bus_dmamap_t map, void *buf, bus_size_t buflen, pmap_t pmap, int flags, bus_dma_segment_t *segs, int *segp) { bus_size_t sgsize, max_sgsize; vm_paddr_t curaddr; vm_offset_t kvaddr, vaddr; int error; if (map == NULL) map = &nobounce_dmamap; if (segs == NULL) segs = dmat->segments; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { _bus_dmamap_count_pages(dmat, map, pmap, buf, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) return (error); } } vaddr = (vm_offset_t)buf; while (buflen > 0) { /* * Get the physical address for this segment. */ if (pmap == kernel_pmap) { curaddr = pmap_kextract(vaddr); kvaddr = vaddr; } else { curaddr = pmap_extract(pmap, vaddr); kvaddr = 0; } /* * Compute the segment size, and adjust counts. */ max_sgsize = MIN(buflen, dmat->common.maxsegsz); sgsize = PAGE_SIZE - (curaddr & PAGE_MASK); if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0 && map->pagesneeded != 0 && bus_dma_run_filter(&dmat->common, curaddr)) { sgsize = roundup2(sgsize, dmat->common.alignment); sgsize = MIN(sgsize, max_sgsize); curaddr = add_bounce_page(dmat, map, kvaddr, curaddr, 0, sgsize); } else { sgsize = MIN(sgsize, max_sgsize); } sgsize = _bus_dmamap_addseg(dmat, map, curaddr, sgsize, segs, segp); if (sgsize == 0) break; vaddr += sgsize; buflen -= sgsize; } /* * Did we fit? */ return (buflen != 0 ? EFBIG : 0); /* XXX better return value here? */ } static int bounce_bus_dmamap_load_ma(bus_dma_tag_t dmat, bus_dmamap_t map, struct vm_page **ma, bus_size_t buflen, int ma_offs, int flags, bus_dma_segment_t *segs, int *segp) { vm_paddr_t paddr, next_paddr; int error, page_index; bus_size_t sgsize, max_sgsize; if (dmat->common.flags & BUS_DMA_KEEP_PG_OFFSET) { /* * If we have to keep the offset of each page this function * is not suitable, switch back to bus_dmamap_load_ma_triv * which is going to do the right thing in this case. 
*/ error = bus_dmamap_load_ma_triv(dmat, map, ma, buflen, ma_offs, flags, segs, segp); return (error); } if (map == NULL) map = &nobounce_dmamap; if (segs == NULL) segs = dmat->segments; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0) { _bus_dmamap_count_ma(dmat, map, ma, ma_offs, buflen, flags); if (map->pagesneeded != 0) { error = _bus_dmamap_reserve_pages(dmat, map, flags); if (error) return (error); } } page_index = 0; while (buflen > 0) { /* * Compute the segment size, and adjust counts. */ paddr = VM_PAGE_TO_PHYS(ma[page_index]) + ma_offs; max_sgsize = MIN(buflen, dmat->common.maxsegsz); sgsize = PAGE_SIZE - ma_offs; if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) != 0 && map->pagesneeded != 0 && bus_dma_run_filter(&dmat->common, paddr)) { sgsize = roundup2(sgsize, dmat->common.alignment); sgsize = MIN(sgsize, max_sgsize); KASSERT((sgsize & (dmat->common.alignment - 1)) == 0, ("Segment size is not aligned")); /* * Check if two pages of the user provided buffer * are used. */ if ((ma_offs + sgsize) > PAGE_SIZE) next_paddr = VM_PAGE_TO_PHYS(ma[page_index + 1]); else next_paddr = 0; paddr = add_bounce_page(dmat, map, 0, paddr, next_paddr, sgsize); } else { sgsize = MIN(sgsize, max_sgsize); } sgsize = _bus_dmamap_addseg(dmat, map, paddr, sgsize, segs, segp); if (sgsize == 0) break; KASSERT(buflen >= sgsize, ("Segment length overruns original buffer")); buflen -= sgsize; if (((ma_offs + sgsize) & ~PAGE_MASK) != 0) page_index++; ma_offs = (ma_offs + sgsize) & PAGE_MASK; } /* * Did we fit? */ return (buflen != 0 ? EFBIG : 0); /* XXX better return value here? */ } static void bounce_bus_dmamap_waitok(bus_dma_tag_t dmat, bus_dmamap_t map, struct memdesc *mem, bus_dmamap_callback_t *callback, void *callback_arg) { if (map == NULL) return; map->mem = *mem; map->dmat = dmat; map->callback = callback; map->callback_arg = callback_arg; } static bus_dma_segment_t * bounce_bus_dmamap_complete(bus_dma_tag_t dmat, bus_dmamap_t map, bus_dma_segment_t *segs, int nsegs, int error) { if (segs == NULL) segs = dmat->segments; return (segs); } /* * Release the mapping held by map. */ static void bounce_bus_dmamap_unload(bus_dma_tag_t dmat, bus_dmamap_t map) { struct bounce_page *bpage; if (map == NULL) return; while ((bpage = STAILQ_FIRST(&map->bpages)) != NULL) { STAILQ_REMOVE_HEAD(&map->bpages, links); free_bounce_page(dmat, bpage); } } static void bounce_bus_dmamap_sync(bus_dma_tag_t dmat, bus_dmamap_t map, bus_dmasync_op_t op) { struct bounce_page *bpage; vm_offset_t datavaddr, tempvaddr; bus_size_t datacount1, datacount2; if (map == NULL || (bpage = STAILQ_FIRST(&map->bpages)) == NULL) return; /* * Handle data bouncing. We might also want to add support for * invalidating the caches on broken hardware. */ CTR4(KTR_BUSDMA, "%s: tag %p tag flags 0x%x op 0x%x " "performing bounce", __func__, dmat, dmat->common.flags, op); if ((op & BUS_DMASYNC_PREWRITE) != 0) { while (bpage != NULL) { tempvaddr = 0; datavaddr = bpage->datavaddr; datacount1 = bpage->datacount; if (datavaddr == 0) { tempvaddr = pmap_quick_enter_page(bpage->datapage[0]); datavaddr = tempvaddr | bpage->dataoffs; datacount1 = min(PAGE_SIZE - bpage->dataoffs, datacount1); } bcopy((void *)datavaddr, (void *)bpage->vaddr, datacount1); if (tempvaddr != 0) pmap_quick_remove_page(tempvaddr); if (bpage->datapage[1] == 0) { KASSERT(datacount1 == bpage->datacount, ("Mismatch between data size and provided memory space")); goto next_w; } /* * We are dealing with an unmapped buffer that expands * over two pages. 
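The PREWRITE/POSTREAD copy loops above only run when the driver brackets its transfers with bus_dmamap_sync(); a hedged sketch of that pairing follows (example_softc and example_start_hw are hypothetical stand-ins for a real device):

#include <sys/param.h>
#include <sys/bus.h>
#include <machine/bus.h>

struct example_softc;				/* hypothetical driver softc */
void example_start_hw(struct example_softc *);	/* hypothetical doorbell write */

static void
example_roundtrip(bus_dma_tag_t tag, bus_dmamap_t map, struct example_softc *sc)
{
	/* CPU filled the buffer; copy into bounce pages before the device reads. */
	bus_dmamap_sync(tag, map, BUS_DMASYNC_PREWRITE);
	example_start_hw(sc);

	/* Later (e.g. in the interrupt handler), after the device wrote data: */
	bus_dmamap_sync(tag, map, BUS_DMASYNC_POSTREAD);
	/* Only now is the CPU guaranteed to see what the device produced. */
}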
*/ datavaddr = pmap_quick_enter_page(bpage->datapage[1]); datacount2 = bpage->datacount - datacount1; bcopy((void *)datavaddr, (void *)(bpage->vaddr + datacount1), datacount2); pmap_quick_remove_page(datavaddr); next_w: bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; } if ((op & BUS_DMASYNC_POSTREAD) != 0) { while (bpage != NULL) { tempvaddr = 0; datavaddr = bpage->datavaddr; datacount1 = bpage->datacount; if (datavaddr == 0) { tempvaddr = pmap_quick_enter_page(bpage->datapage[0]); datavaddr = tempvaddr | bpage->dataoffs; datacount1 = min(PAGE_SIZE - bpage->dataoffs, datacount1); } bcopy((void *)bpage->vaddr, (void *)datavaddr, datacount1); if (tempvaddr != 0) pmap_quick_remove_page(tempvaddr); if (bpage->datapage[1] == 0) { KASSERT(datacount1 == bpage->datacount, ("Mismatch between data size and provided memory space")); goto next_r; } /* * We are dealing with an unmapped buffer that expands * over two pages. */ datavaddr = pmap_quick_enter_page(bpage->datapage[1]); datacount2 = bpage->datacount - datacount1; bcopy((void *)(bpage->vaddr + datacount1), (void *)datavaddr, datacount2); pmap_quick_remove_page(datavaddr); next_r: bpage = STAILQ_NEXT(bpage, links); } dmat->bounce_zone->total_bounced++; } } static void init_bounce_pages(void *dummy __unused) { total_bpages = 0; STAILQ_INIT(&bounce_zone_list); STAILQ_INIT(&bounce_map_waitinglist); STAILQ_INIT(&bounce_map_callbacklist); mtx_init(&bounce_lock, "bounce pages lock", NULL, MTX_DEF); } SYSINIT(bpages, SI_SUB_LOCK, SI_ORDER_ANY, init_bounce_pages, NULL); static struct sysctl_ctx_list * busdma_sysctl_tree(struct bounce_zone *bz) { return (&bz->sysctl_tree); } static struct sysctl_oid * busdma_sysctl_tree_top(struct bounce_zone *bz) { return (bz->sysctl_tree_top); } static int alloc_bounce_zone(bus_dma_tag_t dmat) { struct bounce_zone *bz; /* Check to see if we already have a suitable zone */ STAILQ_FOREACH(bz, &bounce_zone_list, links) { if (dmat->common.alignment <= bz->alignment && dmat->common.lowaddr >= bz->lowaddr && dmat->common.domain == bz->domain) { dmat->bounce_zone = bz; return (0); } } if ((bz = (struct bounce_zone *)malloc(sizeof(*bz), M_DEVBUF, M_NOWAIT | M_ZERO)) == NULL) return (ENOMEM); STAILQ_INIT(&bz->bounce_page_list); bz->free_bpages = 0; bz->reserved_bpages = 0; bz->active_bpages = 0; bz->lowaddr = dmat->common.lowaddr; bz->alignment = MAX(dmat->common.alignment, PAGE_SIZE); bz->map_count = 0; bz->domain = dmat->common.domain; snprintf(bz->zoneid, 8, "zone%d", busdma_zonecount); busdma_zonecount++; snprintf(bz->lowaddrid, 18, "%#jx", (uintmax_t)bz->lowaddr); STAILQ_INSERT_TAIL(&bounce_zone_list, bz, links); dmat->bounce_zone = bz; sysctl_ctx_init(&bz->sysctl_tree); bz->sysctl_tree_top = SYSCTL_ADD_NODE(&bz->sysctl_tree, SYSCTL_STATIC_CHILDREN(_hw_busdma), OID_AUTO, bz->zoneid, CTLFLAG_RD, 0, ""); if (bz->sysctl_tree_top == NULL) { sysctl_ctx_free(&bz->sysctl_tree); return (0); /* XXX error code? 
*/ } SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "total_bpages", CTLFLAG_RD, &bz->total_bpages, 0, "Total bounce pages"); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "free_bpages", CTLFLAG_RD, &bz->free_bpages, 0, "Free bounce pages"); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "reserved_bpages", CTLFLAG_RD, &bz->reserved_bpages, 0, "Reserved bounce pages"); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "active_bpages", CTLFLAG_RD, &bz->active_bpages, 0, "Active bounce pages"); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "total_bounced", CTLFLAG_RD, &bz->total_bounced, 0, "Total bounce requests"); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "total_deferred", CTLFLAG_RD, &bz->total_deferred, 0, "Total bounce requests that were deferred"); SYSCTL_ADD_STRING(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "lowaddr", CTLFLAG_RD, bz->lowaddrid, 0, ""); SYSCTL_ADD_UAUTO(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "alignment", CTLFLAG_RD, &bz->alignment, ""); SYSCTL_ADD_INT(busdma_sysctl_tree(bz), SYSCTL_CHILDREN(busdma_sysctl_tree_top(bz)), OID_AUTO, "domain", CTLFLAG_RD, &bz->domain, 0, "memory domain"); return (0); } static int alloc_bounce_pages(bus_dma_tag_t dmat, u_int numpages) { struct bounce_zone *bz; int count; bz = dmat->bounce_zone; count = 0; while (numpages > 0) { struct bounce_page *bpage; bpage = malloc_domainset(sizeof(*bpage), M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), M_NOWAIT | M_ZERO); if (bpage == NULL) break; bpage->vaddr = (vm_offset_t)contigmalloc_domainset(PAGE_SIZE, M_DEVBUF, DOMAINSET_PREF(dmat->common.domain), M_NOWAIT, 0ul, bz->lowaddr, PAGE_SIZE, 0); if (bpage->vaddr == 0) { free_domain(bpage, M_DEVBUF); break; } bpage->busaddr = pmap_kextract(bpage->vaddr); mtx_lock(&bounce_lock); STAILQ_INSERT_TAIL(&bz->bounce_page_list, bpage, links); total_bpages++; bz->total_bpages++; bz->free_bpages++; mtx_unlock(&bounce_lock); count++; numpages--; } return (count); } static int reserve_bounce_pages(bus_dma_tag_t dmat, bus_dmamap_t map, int commit) { struct bounce_zone *bz; int pages; mtx_assert(&bounce_lock, MA_OWNED); bz = dmat->bounce_zone; pages = MIN(bz->free_bpages, map->pagesneeded - map->pagesreserved); if (commit == 0 && map->pagesneeded > (map->pagesreserved + pages)) return (map->pagesneeded - (map->pagesreserved + pages)); bz->free_bpages -= pages; bz->reserved_bpages += pages; map->pagesreserved += pages; pages = map->pagesneeded - map->pagesreserved; return (pages); } static bus_addr_t add_bounce_page(bus_dma_tag_t dmat, bus_dmamap_t map, vm_offset_t vaddr, vm_paddr_t addr1, vm_paddr_t addr2, bus_size_t size) { struct bounce_zone *bz; struct bounce_page *bpage; KASSERT(dmat->bounce_zone != NULL, ("no bounce zone in dma tag")); KASSERT(map != NULL && map != &nobounce_dmamap, ("add_bounce_page: bad map %p", map)); bz = dmat->bounce_zone; if (map->pagesneeded == 0) panic("add_bounce_page: map doesn't need any pages"); map->pagesneeded--; if (map->pagesreserved == 0) panic("add_bounce_page: map doesn't need any pages"); map->pagesreserved--; mtx_lock(&bounce_lock); bpage = STAILQ_FIRST(&bz->bounce_page_list); if (bpage == NULL) panic("add_bounce_page: free page list is empty"); 
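The page accounting in reserve_bounce_pages() above is easy to trace with concrete numbers; this standalone illustration (values made up) shows the BUS_DMA_NOWAIT case reporting a shortfall instead of committing a partial reservation:

#include <stdio.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	int free_bpages = 2;	/* pages left in the zone (made up) */
	int pagesneeded = 4;	/* what this map needs (made up) */
	int pagesreserved = 0;
	int commit = 0;		/* 0 == BUS_DMA_NOWAIT caller */
	int pages;

	pages = MIN(free_bpages, pagesneeded - pagesreserved);	/* 2 */
	if (commit == 0 && pagesneeded > pagesreserved + pages) {
		/* Nothing is reserved; the caller sees an ENOMEM-style failure. */
		printf("short by %d pages\n",
		    pagesneeded - (pagesreserved + pages));
		return (0);
	}
	free_bpages -= pages;
	pagesreserved += pages;
	printf("reserved %d, still waiting for %d\n", pagesreserved,
	    pagesneeded - pagesreserved);
	return (0);
}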
STAILQ_REMOVE_HEAD(&bz->bounce_page_list, links); bz->reserved_bpages--; bz->active_bpages++; mtx_unlock(&bounce_lock); if (dmat->common.flags & BUS_DMA_KEEP_PG_OFFSET) { /* Page offset needs to be preserved. */ bpage->vaddr |= addr1 & PAGE_MASK; bpage->busaddr |= addr1 & PAGE_MASK; KASSERT(addr2 == 0, ("Trying to bounce multiple pages with BUS_DMA_KEEP_PG_OFFSET")); } bpage->datavaddr = vaddr; bpage->datapage[0] = PHYS_TO_VM_PAGE(addr1); KASSERT((addr2 & PAGE_MASK) == 0, ("Second page is not aligned")); bpage->datapage[1] = PHYS_TO_VM_PAGE(addr2); bpage->dataoffs = addr1 & PAGE_MASK; bpage->datacount = size; STAILQ_INSERT_TAIL(&(map->bpages), bpage, links); return (bpage->busaddr); } static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage) { struct bus_dmamap *map; struct bounce_zone *bz; bz = dmat->bounce_zone; bpage->datavaddr = 0; bpage->datacount = 0; if (dmat->common.flags & BUS_DMA_KEEP_PG_OFFSET) { /* * Reset the bounce page to start at offset 0. Other uses * of this bounce page may need to store a full page of * data and/or assume it starts on a page boundary. */ bpage->vaddr &= ~PAGE_MASK; bpage->busaddr &= ~PAGE_MASK; } mtx_lock(&bounce_lock); STAILQ_INSERT_HEAD(&bz->bounce_page_list, bpage, links); bz->free_bpages++; bz->active_bpages--; if ((map = STAILQ_FIRST(&bounce_map_waitinglist)) != NULL) { if (reserve_bounce_pages(map->dmat, map, 1) == 0) { STAILQ_REMOVE_HEAD(&bounce_map_waitinglist, links); STAILQ_INSERT_TAIL(&bounce_map_callbacklist, map, links); busdma_swi_pending = 1; bz->total_deferred++; swi_sched(vm_ih, 0); } } mtx_unlock(&bounce_lock); } void busdma_swi(void) { bus_dma_tag_t dmat; struct bus_dmamap *map; mtx_lock(&bounce_lock); while ((map = STAILQ_FIRST(&bounce_map_callbacklist)) != NULL) { STAILQ_REMOVE_HEAD(&bounce_map_callbacklist, links); mtx_unlock(&bounce_lock); dmat = map->dmat; (dmat->common.lockfunc)(dmat->common.lockfuncarg, BUS_DMA_LOCK); bus_dmamap_load_mem(map->dmat, map, &map->mem, map->callback, map->callback_arg, BUS_DMA_WAITOK); (dmat->common.lockfunc)(dmat->common.lockfuncarg, BUS_DMA_UNLOCK); mtx_lock(&bounce_lock); } mtx_unlock(&bounce_lock); } struct bus_dma_impl bus_dma_bounce_impl = { .tag_create = bounce_bus_dma_tag_create, .tag_destroy = bounce_bus_dma_tag_destroy, .tag_set_domain = bounce_bus_dma_tag_set_domain, .map_create = bounce_bus_dmamap_create, .map_destroy = bounce_bus_dmamap_destroy, .mem_alloc = bounce_bus_dmamem_alloc, .mem_free = bounce_bus_dmamem_free, .load_phys = bounce_bus_dmamap_load_phys, .load_buffer = bounce_bus_dmamap_load_buffer, .load_ma = bounce_bus_dmamap_load_ma, .map_waitok = bounce_bus_dmamap_waitok, .map_complete = bounce_bus_dmamap_complete, .map_unload = bounce_bus_dmamap_unload, .map_sync = bounce_bus_dmamap_sync, }; Index: user/ngie/bug-237403/sys/x86/x86/mp_x86.c =================================================================== --- user/ngie/bug-237403/sys/x86/x86/mp_x86.c (revision 346925) +++ user/ngie/bug-237403/sys/x86/x86/mp_x86.c (revision 346926) @@ -1,1790 +1,1799 @@ /*- * Copyright (c) 1996, by Steve Passe * Copyright (c) 2003, by Peter Wemm * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
The name of the developer may NOT be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #ifdef __i386__ #include "opt_apic.h" #endif #include "opt_cpu.h" #include "opt_kstack_pages.h" #include "opt_pmap.h" #include "opt_sched.h" #include "opt_smp.h" #include #include #include #include /* cngetc() */ #include #ifdef GPROF #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_CPUS, "cpus", "CPU items"); /* lock region used by kernel profiling */ int mcount_lock; int mp_naps; /* # of Applications processors */ int boot_cpu_id = -1; /* designated BSP */ /* AP uses this during bootstrap. Do not staticize. */ char *bootSTK; int bootAP; /* Free these after use */ void *bootstacks[MAXCPU]; void *dpcpu; struct pcb stoppcbs[MAXCPU]; struct susppcb **susppcbs; #ifdef COUNT_IPIS /* Interrupt counts. */ static u_long *ipi_preempt_counts[MAXCPU]; static u_long *ipi_ast_counts[MAXCPU]; u_long *ipi_invltlb_counts[MAXCPU]; u_long *ipi_invlrng_counts[MAXCPU]; u_long *ipi_invlpg_counts[MAXCPU]; u_long *ipi_invlcache_counts[MAXCPU]; u_long *ipi_rendezvous_counts[MAXCPU]; static u_long *ipi_hardclock_counts[MAXCPU]; #endif /* Default cpu_ops implementation. */ struct cpu_ops cpu_ops; /* * Local data and functions. */ static volatile cpuset_t ipi_stop_nmi_pending; volatile cpuset_t resuming_cpus; volatile cpuset_t toresume_cpus; /* used to hold the AP's until we are ready to release them */ struct mtx ap_boot_mtx; /* Set to 1 once we're ready to let the APs out of the pen. */ volatile int aps_ready = 0; /* * Store data from cpu_add() until later in the boot when we actually setup * the APs. 
*/ struct cpu_info *cpu_info; int *apic_cpuids; int cpu_apic_ids[MAXCPU]; _Static_assert(MAXCPU <= MAX_APIC_ID, "MAXCPU cannot be larger that MAX_APIC_ID"); _Static_assert(xAPIC_MAX_APIC_ID <= MAX_APIC_ID, "xAPIC_MAX_APIC_ID cannot be larger that MAX_APIC_ID"); /* Holds pending bitmap based IPIs per CPU */ volatile u_int cpu_ipi_pending[MAXCPU]; static void release_aps(void *dummy); static void cpustop_handler_post(u_int cpu); static int hyperthreading_allowed = 1; SYSCTL_INT(_machdep, OID_AUTO, hyperthreading_allowed, CTLFLAG_RDTUN, &hyperthreading_allowed, 0, "Use Intel HTT logical CPUs"); static struct topo_node topo_root; static int pkg_id_shift; static int node_id_shift; static int core_id_shift; static int disabled_cpus; struct cache_info { int id_shift; int present; } static caches[MAX_CACHE_LEVELS]; unsigned int boot_address; #define MiB(v) (v ## ULL << 20) void mem_range_AP_init(void) { if (mem_range_softc.mr_op && mem_range_softc.mr_op->initAP) mem_range_softc.mr_op->initAP(&mem_range_softc); } /* * Round up to the next power of two, if necessary, and then * take log2. * Returns -1 if argument is zero. */ static __inline int mask_width(u_int x) { return (fls(x << (1 - powerof2(x))) - 1); } /* * Add a cache level to the cache topology description. */ static int add_deterministic_cache(int type, int level, int share_count) { if (type == 0) return (0); if (type > 3) { printf("unexpected cache type %d\n", type); return (1); } if (type == 2) /* ignore instruction cache */ return (1); if (level == 0 || level > MAX_CACHE_LEVELS) { printf("unexpected cache level %d\n", type); return (1); } if (caches[level - 1].present) { printf("WARNING: multiple entries for L%u data cache\n", level); printf("%u => %u\n", caches[level - 1].id_shift, mask_width(share_count)); } caches[level - 1].id_shift = mask_width(share_count); caches[level - 1].present = 1; if (caches[level - 1].id_shift > pkg_id_shift) { printf("WARNING: L%u data cache covers more " "APIC IDs than a package (%u > %u)\n", level, caches[level - 1].id_shift, pkg_id_shift); caches[level - 1].id_shift = pkg_id_shift; } if (caches[level - 1].id_shift < core_id_shift) { printf("WARNING: L%u data cache covers fewer " "APIC IDs than a core (%u < %u)\n", level, caches[level - 1].id_shift, core_id_shift); caches[level - 1].id_shift = core_id_shift; } return (1); } /* * Determine topology of processing units and caches for AMD CPUs. * See: * - AMD CPUID Specification (Publication # 25481) * - BKDG for AMD NPT Family 0Fh Processors (Publication # 32559) * - BKDG For AMD Family 10h Processors (Publication # 31116) * - BKDG For AMD Family 15h Models 00h-0Fh Processors (Publication # 42301) * - BKDG For AMD Family 16h Models 00h-0Fh Processors (Publication # 48751) * - PPR For AMD Family 17h Models 00h-0Fh Processors (Publication # 54945) */ static void topo_probe_amd(void) { u_int p[4]; uint64_t v; int level; int nodes_per_socket; int share_count; int type; int i; /* No multi-core capability. */ if ((amd_feature2 & AMDID2_CMP) == 0) return; /* For families 10h and newer. */ pkg_id_shift = (cpu_procinfo2 & AMDID_COREID_SIZE) >> AMDID_COREID_SIZE_SHIFT; /* For 0Fh family. */ if (pkg_id_shift == 0) pkg_id_shift = mask_width((cpu_procinfo2 & AMDID_CMP_CORES) + 1); /* * Families prior to 16h define the following value as * cores per compute unit and we don't really care about the AMD * compute units at the moment. Perhaps we should treat them as * cores and cores within the compute units as hardware threads, * but that's up for debate. 
* Later families define the value as threads per compute unit, * so we are following AMD's nomenclature here. */ if ((amd_feature2 & AMDID2_TOPOLOGY) != 0 && CPUID_TO_FAMILY(cpu_id) >= 0x16) { cpuid_count(0x8000001e, 0, p); share_count = ((p[1] >> 8) & 0xff) + 1; core_id_shift = mask_width(share_count); /* * For Zen (17h), gather Nodes per Processor. Each node is a * Zeppelin die; TR and EPYC CPUs will have multiple dies per * package. Communication latency between dies is higher than * within them. */ nodes_per_socket = ((p[2] >> 8) & 0x7) + 1; node_id_shift = pkg_id_shift - mask_width(nodes_per_socket); } if ((amd_feature2 & AMDID2_TOPOLOGY) != 0) { for (i = 0; ; i++) { cpuid_count(0x8000001d, i, p); type = p[0] & 0x1f; level = (p[0] >> 5) & 0x7; share_count = 1 + ((p[0] >> 14) & 0xfff); if (!add_deterministic_cache(type, level, share_count)) break; } } else { if (cpu_exthigh >= 0x80000005) { cpuid_count(0x80000005, 0, p); if (((p[2] >> 24) & 0xff) != 0) { caches[0].id_shift = 0; caches[0].present = 1; } } if (cpu_exthigh >= 0x80000006) { cpuid_count(0x80000006, 0, p); if (((p[2] >> 16) & 0xffff) != 0) { caches[1].id_shift = 0; caches[1].present = 1; } if (((p[3] >> 18) & 0x3fff) != 0) { nodes_per_socket = 1; if ((amd_feature2 & AMDID2_NODE_ID) != 0) { /* * Handle multi-node processors that * have multiple chips, each with its * own L3 cache, on the same die. */ v = rdmsr(0xc001100c); nodes_per_socket = 1 + ((v >> 3) & 0x7); } caches[2].id_shift = pkg_id_shift - mask_width(nodes_per_socket); caches[2].present = 1; } } } } /* * Determine topology of processing units for Intel CPUs * using CPUID Leaf 1 and Leaf 4, if supported. * See: * - Intel 64 Architecture Processor Topology Enumeration * - Intel 64 and IA-32 ArchitecturesSoftware Developer’s Manual, * Volume 3A: System Programming Guide, PROGRAMMING CONSIDERATIONS * FOR HARDWARE MULTI-THREADING CAPABLE PROCESSORS */ static void topo_probe_intel_0x4(void) { u_int p[4]; int max_cores; int max_logical; /* Both zero and one here mean one logical processor per package. */ max_logical = (cpu_feature & CPUID_HTT) != 0 ? (cpu_procinfo & CPUID_HTT_CORES) >> 16 : 1; if (max_logical <= 1) return; if (cpu_high >= 0x4) { cpuid_count(0x04, 0, p); max_cores = ((p[0] >> 26) & 0x3f) + 1; } else max_cores = 1; core_id_shift = mask_width(max_logical/max_cores); KASSERT(core_id_shift >= 0, ("intel topo: max_cores > max_logical\n")); pkg_id_shift = core_id_shift + mask_width(max_cores); } /* * Determine topology of processing units for Intel CPUs * using CPUID Leaf 11, if supported. * See: * - Intel 64 Architecture Processor Topology Enumeration * - Intel 64 and IA-32 ArchitecturesSoftware Developer’s Manual, * Volume 3A: System Programming Guide, PROGRAMMING CONSIDERATIONS * FOR HARDWARE MULTI-THREADING CAPABLE PROCESSORS */ static void topo_probe_intel_0xb(void) { u_int p[4]; int bits; int type; int i; /* Fall back if CPU leaf 11 doesn't really exist. */ cpuid_count(0x0b, 0, p); if (p[1] == 0) { topo_probe_intel_0x4(); return; } /* We only support three levels for now. 
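mask_width() and the leaf-4 fallback above reduce to a couple of shifts; this userland illustration (counts made up, fls()/powerof2() behaving as in the kernel headers) shows the resulting ID shifts for a hypothetical 8-core, 16-thread package:

#include <stdio.h>
#include <strings.h>	/* fls() */

#define	powerof2(x)	((((x) - 1) & (x)) == 0)

static int
mask_width(unsigned int x)	/* ceil(log2(x)); -1 for x == 0 */
{
	return (fls(x << (1 - powerof2(x))) - 1);
}

int
main(void)
{
	int max_logical = 16;	/* CPUID.1:EBX[23:16] (made up) */
	int max_cores = 8;	/* CPUID.4:EAX[31:26] + 1 (made up) */
	int core_id_shift, pkg_id_shift;

	core_id_shift = mask_width(max_logical / max_cores);	/* 1 */
	pkg_id_shift = core_id_shift + mask_width(max_cores);	/* 1 + 3 = 4 */
	printf("core_id_shift=%d pkg_id_shift=%d mask_width(6)=%d\n",
	    core_id_shift, pkg_id_shift, mask_width(6));	/* 1 4 3 */
	return (0);
}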
*/ for (i = 0; ; i++) { cpuid_count(0x0b, i, p); bits = p[0] & 0x1f; type = (p[2] >> 8) & 0xff; if (type == 0) break; /* TODO: check for duplicate (re-)assignment */ if (type == CPUID_TYPE_SMT) core_id_shift = bits; else if (type == CPUID_TYPE_CORE) pkg_id_shift = bits; else printf("unknown CPU level type %d\n", type); } if (pkg_id_shift < core_id_shift) { printf("WARNING: core covers more APIC IDs than a package\n"); core_id_shift = pkg_id_shift; } } /* * Determine topology of caches for Intel CPUs. * See: * - Intel 64 Architecture Processor Topology Enumeration * - Intel 64 and IA-32 Architectures Software Developer’s Manual * Volume 2A: Instruction Set Reference, A-M, * CPUID instruction */ static void topo_probe_intel_caches(void) { u_int p[4]; int level; int share_count; int type; int i; if (cpu_high < 0x4) { /* * Available cache level and sizes can be determined * via CPUID leaf 2, but that requires a huge table of hardcoded * values, so for now just assume L1 and L2 caches potentially * shared only by HTT processing units, if HTT is present. */ caches[0].id_shift = pkg_id_shift; caches[0].present = 1; caches[1].id_shift = pkg_id_shift; caches[1].present = 1; return; } for (i = 0; ; i++) { cpuid_count(0x4, i, p); type = p[0] & 0x1f; level = (p[0] >> 5) & 0x7; share_count = 1 + ((p[0] >> 14) & 0xfff); if (!add_deterministic_cache(type, level, share_count)) break; } } /* * Determine topology of processing units and caches for Intel CPUs. * See: * - Intel 64 Architecture Processor Topology Enumeration */ static void topo_probe_intel(void) { /* * Note that 0x1 <= cpu_high < 4 case should be * compatible with topo_probe_intel_0x4() logic when * CPUID.1:EBX[23:16] > 0 (cpu_cores will be 1) * or it should trigger the fallback otherwise. */ if (cpu_high >= 0xb) topo_probe_intel_0xb(); else if (cpu_high >= 0x1) topo_probe_intel_0x4(); topo_probe_intel_caches(); } /* * Topology information is queried only on BSP, on which this * code runs and for which it can query CPUID information. * Then topology is extrapolated on all packages using an * assumption that APIC ID to hardware component ID mapping is * homogenious. * That doesn't necesserily imply that the topology is uniform. */ void topo_probe(void) { static int cpu_topo_probed = 0; struct x86_topo_layer { int type; int subtype; int id_shift; } topo_layers[MAX_CACHE_LEVELS + 4]; struct topo_node *parent; struct topo_node *node; int layer; int nlayers; int node_id; int i; if (cpu_topo_probed) return; CPU_ZERO(&logical_cpus_mask); if (mp_ncpus <= 1) ; /* nothing */ else if (cpu_vendor_id == CPU_VENDOR_AMD) topo_probe_amd(); else if (cpu_vendor_id == CPU_VENDOR_INTEL) topo_probe_intel(); KASSERT(pkg_id_shift >= core_id_shift, ("bug in APIC topology discovery")); nlayers = 0; bzero(topo_layers, sizeof(topo_layers)); topo_layers[nlayers].type = TOPO_TYPE_PKG; topo_layers[nlayers].id_shift = pkg_id_shift; if (bootverbose) printf("Package ID shift: %u\n", topo_layers[nlayers].id_shift); nlayers++; if (pkg_id_shift > node_id_shift && node_id_shift != 0) { topo_layers[nlayers].type = TOPO_TYPE_GROUP; topo_layers[nlayers].id_shift = node_id_shift; if (bootverbose) printf("Node ID shift: %u\n", topo_layers[nlayers].id_shift); nlayers++; } /* * Consider all caches to be within a package/chip * and "in front" of all sub-components like * cores and hardware threads. 
*/ for (i = MAX_CACHE_LEVELS - 1; i >= 0; --i) { if (caches[i].present) { if (node_id_shift != 0) KASSERT(caches[i].id_shift <= node_id_shift, ("bug in APIC topology discovery")); KASSERT(caches[i].id_shift <= pkg_id_shift, ("bug in APIC topology discovery")); KASSERT(caches[i].id_shift >= core_id_shift, ("bug in APIC topology discovery")); topo_layers[nlayers].type = TOPO_TYPE_CACHE; topo_layers[nlayers].subtype = i + 1; topo_layers[nlayers].id_shift = caches[i].id_shift; if (bootverbose) printf("L%u cache ID shift: %u\n", topo_layers[nlayers].subtype, topo_layers[nlayers].id_shift); nlayers++; } } if (pkg_id_shift > core_id_shift) { topo_layers[nlayers].type = TOPO_TYPE_CORE; topo_layers[nlayers].id_shift = core_id_shift; if (bootverbose) printf("Core ID shift: %u\n", topo_layers[nlayers].id_shift); nlayers++; } topo_layers[nlayers].type = TOPO_TYPE_PU; topo_layers[nlayers].id_shift = 0; nlayers++; topo_init_root(&topo_root); for (i = 0; i <= max_apic_id; ++i) { if (!cpu_info[i].cpu_present) continue; parent = &topo_root; for (layer = 0; layer < nlayers; ++layer) { node_id = i >> topo_layers[layer].id_shift; parent = topo_add_node_by_hwid(parent, node_id, topo_layers[layer].type, topo_layers[layer].subtype); } } parent = &topo_root; for (layer = 0; layer < nlayers; ++layer) { node_id = boot_cpu_id >> topo_layers[layer].id_shift; node = topo_find_node_by_hwid(parent, node_id, topo_layers[layer].type, topo_layers[layer].subtype); topo_promote_child(node); parent = node; } cpu_topo_probed = 1; } /* * Assign logical CPU IDs to local APICs. */ void assign_cpu_ids(void) { struct topo_node *node; u_int smt_mask; int nhyper; smt_mask = (1u << core_id_shift) - 1; /* * Assign CPU IDs to local APIC IDs and disable any CPUs * beyond MAXCPU. CPU 0 is always assigned to the BSP. */ mp_ncpus = 0; nhyper = 0; TOPO_FOREACH(node, &topo_root) { if (node->type != TOPO_TYPE_PU) continue; if ((node->hwid & smt_mask) != (boot_cpu_id & smt_mask)) cpu_info[node->hwid].cpu_hyperthread = 1; if (resource_disabled("lapic", node->hwid)) { if (node->hwid != boot_cpu_id) cpu_info[node->hwid].cpu_disabled = 1; else printf("Cannot disable BSP, APIC ID = %d\n", node->hwid); } if (!hyperthreading_allowed && cpu_info[node->hwid].cpu_hyperthread) cpu_info[node->hwid].cpu_disabled = 1; if (mp_ncpus >= MAXCPU) cpu_info[node->hwid].cpu_disabled = 1; if (cpu_info[node->hwid].cpu_disabled) { disabled_cpus++; continue; } if (cpu_info[node->hwid].cpu_hyperthread) nhyper++; cpu_apic_ids[mp_ncpus] = node->hwid; apic_cpuids[node->hwid] = mp_ncpus; topo_set_pu_id(node, mp_ncpus); mp_ncpus++; } KASSERT(mp_maxid >= mp_ncpus - 1, ("%s: counters out of sync: max %d, count %d", __func__, mp_maxid, mp_ncpus)); mp_ncores = mp_ncpus - nhyper; smp_threads_per_core = mp_ncpus / mp_ncores; } /* * Print various information about the SMP system hardware and setup. 
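Once topo_probe() has settled the shift values above, an APIC ID is just a bit field; this illustration decomposes one hypothetical ID (the shifts correspond to a made-up 2-thread-per-core, 4-core-per-package layout):

#include <stdio.h>

int
main(void)
{
	unsigned int apic_id = 0xb;	/* example APIC ID: 0b1011 */
	int core_id_shift = 1;		/* 2 hardware threads per core (made up) */
	int pkg_id_shift = 3;		/* 8 APIC IDs per package (made up) */

	printf("package %u, core %u, thread %u\n",
	    apic_id >> pkg_id_shift,				/* 1 */
	    (apic_id >> core_id_shift) &
	    ((1u << (pkg_id_shift - core_id_shift)) - 1),	/* 1 */
	    apic_id & ((1u << core_id_shift) - 1));		/* 1 */
	return (0);
}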
*/ void cpu_mp_announce(void) { struct topo_node *node; const char *hyperthread; struct topo_analysis topology; printf("FreeBSD/SMP: "); if (topo_analyze(&topo_root, 1, &topology)) { printf("%d package(s)", topology.entities[TOPO_LEVEL_PKG]); if (topology.entities[TOPO_LEVEL_GROUP] > 1) printf(" x %d groups", topology.entities[TOPO_LEVEL_GROUP]); if (topology.entities[TOPO_LEVEL_CACHEGROUP] > 1) printf(" x %d cache groups", topology.entities[TOPO_LEVEL_CACHEGROUP]); if (topology.entities[TOPO_LEVEL_CORE] > 0) printf(" x %d core(s)", topology.entities[TOPO_LEVEL_CORE]); if (topology.entities[TOPO_LEVEL_THREAD] > 1) printf(" x %d hardware threads", topology.entities[TOPO_LEVEL_THREAD]); } else { printf("Non-uniform topology"); } printf("\n"); if (disabled_cpus) { printf("FreeBSD/SMP Online: "); if (topo_analyze(&topo_root, 0, &topology)) { printf("%d package(s)", topology.entities[TOPO_LEVEL_PKG]); if (topology.entities[TOPO_LEVEL_GROUP] > 1) printf(" x %d groups", topology.entities[TOPO_LEVEL_GROUP]); if (topology.entities[TOPO_LEVEL_CACHEGROUP] > 1) printf(" x %d cache groups", topology.entities[TOPO_LEVEL_CACHEGROUP]); if (topology.entities[TOPO_LEVEL_CORE] > 0) printf(" x %d core(s)", topology.entities[TOPO_LEVEL_CORE]); if (topology.entities[TOPO_LEVEL_THREAD] > 1) printf(" x %d hardware threads", topology.entities[TOPO_LEVEL_THREAD]); } else { printf("Non-uniform topology"); } printf("\n"); } if (!bootverbose) return; TOPO_FOREACH(node, &topo_root) { switch (node->type) { case TOPO_TYPE_PKG: printf("Package HW ID = %u\n", node->hwid); break; case TOPO_TYPE_CORE: printf("\tCore HW ID = %u\n", node->hwid); break; case TOPO_TYPE_PU: if (cpu_info[node->hwid].cpu_hyperthread) hyperthread = "/HT"; else hyperthread = ""; if (node->subtype == 0) printf("\t\tCPU (AP%s): APIC ID: %u" "(disabled)\n", hyperthread, node->hwid); else if (node->id == 0) printf("\t\tCPU0 (BSP): APIC ID: %u\n", node->hwid); else printf("\t\tCPU%u (AP%s): APIC ID: %u\n", node->id, hyperthread, node->hwid); break; default: /* ignored */ break; } } } /* * Add a scheduling group, a group of logical processors sharing * a particular cache (and, thus having an affinity), to the scheduling * topology. * This function recursively works on lower level caches. */ static void x86topo_add_sched_group(struct topo_node *root, struct cpu_group *cg_root) { struct topo_node *node; int nchildren; int ncores; int i; KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE || root->type == TOPO_TYPE_GROUP, ("x86topo_add_sched_group: bad type: %u", root->type)); CPU_COPY(&root->cpuset, &cg_root->cg_mask); cg_root->cg_count = root->cpu_count; if (root->type == TOPO_TYPE_SYSTEM) cg_root->cg_level = CG_SHARE_NONE; else cg_root->cg_level = root->subtype; /* * Check how many core nodes we have under the given root node. * If we have multiple logical processors, but not multiple * cores, then those processors must be hardware threads. */ ncores = 0; node = root; while (node != NULL) { if (node->type != TOPO_TYPE_CORE) { node = topo_next_node(root, node); continue; } ncores++; node = topo_next_nonchild_node(root, node); } if (cg_root->cg_level != CG_SHARE_NONE && root->cpu_count > 1 && ncores < 2) cg_root->cg_flags = CG_FLAG_SMT; /* * Find out how many cache nodes we have under the given root node. * We ignore cache nodes that cover all the same processors as the * root node. Also, we do not descend below found cache nodes. * That is, we count top-level "non-redundant" caches under the root * node. 
*/ nchildren = 0; node = root; while (node != NULL) { if ((node->type != TOPO_TYPE_GROUP && node->type != TOPO_TYPE_CACHE) || (root->type != TOPO_TYPE_SYSTEM && CPU_CMP(&node->cpuset, &root->cpuset) == 0)) { node = topo_next_node(root, node); continue; } nchildren++; node = topo_next_nonchild_node(root, node); } cg_root->cg_child = smp_topo_alloc(nchildren); cg_root->cg_children = nchildren; /* * Now find again the same cache nodes as above and recursively * build scheduling topologies for them. */ node = root; i = 0; while (node != NULL) { if ((node->type != TOPO_TYPE_GROUP && node->type != TOPO_TYPE_CACHE) || (root->type != TOPO_TYPE_SYSTEM && CPU_CMP(&node->cpuset, &root->cpuset) == 0)) { node = topo_next_node(root, node); continue; } cg_root->cg_child[i].cg_parent = cg_root; x86topo_add_sched_group(node, &cg_root->cg_child[i]); i++; node = topo_next_nonchild_node(root, node); } } /* * Build the MI scheduling topology from the discovered hardware topology. */ struct cpu_group * cpu_topo(void) { struct cpu_group *cg_root; if (mp_ncpus <= 1) return (smp_topo_none()); cg_root = smp_topo_alloc(1); x86topo_add_sched_group(&topo_root, cg_root); return (cg_root); } static void cpu_alloc(void *dummy __unused) { /* * Dynamically allocate the arrays that depend on the * maximum APIC ID. */ cpu_info = malloc(sizeof(*cpu_info) * (max_apic_id + 1), M_CPUS, M_WAITOK | M_ZERO); apic_cpuids = malloc(sizeof(*apic_cpuids) * (max_apic_id + 1), M_CPUS, M_WAITOK | M_ZERO); } SYSINIT(cpu_alloc, SI_SUB_CPU, SI_ORDER_FIRST, cpu_alloc, NULL); /* * Add a logical CPU to the topology. */ void cpu_add(u_int apic_id, char boot_cpu) { if (apic_id > max_apic_id) { panic("SMP: APIC ID %d too high", apic_id); return; } KASSERT(cpu_info[apic_id].cpu_present == 0, ("CPU %u added twice", apic_id)); cpu_info[apic_id].cpu_present = 1; if (boot_cpu) { KASSERT(boot_cpu_id == -1, ("CPU %u claims to be BSP, but CPU %u already is", apic_id, boot_cpu_id)); boot_cpu_id = apic_id; cpu_info[apic_id].cpu_bsp = 1; } if (bootverbose) printf("SMP: Added CPU %u (%s)\n", apic_id, boot_cpu ? "BSP" : "AP"); } void cpu_mp_setmaxid(void) { /* * mp_ncpus and mp_maxid should be already set by calls to cpu_add(). * If there were no calls to cpu_add() assume this is a UP system. */ if (mp_ncpus == 0) mp_ncpus = 1; } int cpu_mp_probe(void) { /* * Always record BSP in CPU map so that the mbuf init code works * correctly. */ CPU_SETOF(0, &all_cpus); return (mp_ncpus > 1); } /* Allocate memory for the AP trampoline. */ void alloc_ap_trampoline(vm_paddr_t *physmap, unsigned int *physmap_idx) { unsigned int i; bool allocated; allocated = false; for (i = *physmap_idx; i <= *physmap_idx; i -= 2) { /* * Find a memory region big enough and below the 1MB boundary * for the trampoline code. * NB: needs to be page aligned. */ if (physmap[i] >= MiB(1) || (trunc_page(physmap[i + 1]) - round_page(physmap[i])) < round_page(bootMP_size)) continue; allocated = true; /* * Try to steal from the end of the region to mimic previous * behaviour, else fallback to steal from the start. 
*/ if (physmap[i + 1] < MiB(1)) { boot_address = trunc_page(physmap[i + 1]); if ((physmap[i + 1] - boot_address) < bootMP_size) boot_address -= round_page(bootMP_size); physmap[i + 1] = boot_address; } else { boot_address = round_page(physmap[i]); physmap[i] = boot_address + round_page(bootMP_size); } if (physmap[i] == physmap[i + 1] && *physmap_idx != 0) { memmove(&physmap[i], &physmap[i + 2], sizeof(*physmap) * (*physmap_idx - i + 2)); *physmap_idx -= 2; } break; } if (!allocated) { boot_address = basemem * 1024 - bootMP_size; if (bootverbose) printf( "Cannot find enough space for the boot trampoline, placing it at %#x", boot_address); } } /* * AP CPU's call this to initialize themselves. */ void init_secondary_tail(void) { u_int cpuid; pmap_activate_boot(vmspace_pmap(proc0.p_vmspace)); /* * On real hardware, switch to x2apic mode if possible. Do it * after aps_ready was signalled, to avoid manipulating the * mode while BSP might still want to send some IPI to us * (second startup IPI is ignored on modern hardware etc). */ lapic_xapic_mode(); /* Initialize the PAT MSR. */ pmap_init_pat(); /* set up CPU registers and state */ cpu_setregs(); /* set up SSE/NX */ initializecpu(); /* set up FPU state on the AP */ #ifdef __amd64__ fpuinit(); #else npxinit(false); #endif if (cpu_ops.cpu_init) cpu_ops.cpu_init(); /* A quick check from sanity claus */ cpuid = PCPU_GET(cpuid); if (PCPU_GET(apic_id) != lapic_id()) { printf("SMP: cpuid = %d\n", cpuid); printf("SMP: actual apic_id = %d\n", lapic_id()); printf("SMP: correct apic_id = %d\n", PCPU_GET(apic_id)); panic("cpuid mismatch! boom!!"); } /* Initialize curthread. */ KASSERT(PCPU_GET(idlethread) != NULL, ("no idle thread")); PCPU_SET(curthread, PCPU_GET(idlethread)); mtx_lock_spin(&ap_boot_mtx); mca_init(); /* Init local apic for irq's */ lapic_setup(1); /* Set memory range attributes for this CPU to match the BSP */ mem_range_AP_init(); smp_cpus++; CTR1(KTR_SMP, "SMP: AP CPU #%d Launched", cpuid); if (bootverbose) printf("SMP: AP CPU #%d Launched!\n", cpuid); else printf("%s%d%s", smp_cpus == 2 ? "Launching APs: " : "", cpuid, smp_cpus == mp_ncpus ? "\n" : " "); /* Determine if we are a logical CPU. */ if (cpu_info[PCPU_GET(apic_id)].cpu_hyperthread) CPU_SET(cpuid, &logical_cpus_mask); if (bootverbose) lapic_dump("AP"); if (smp_cpus == mp_ncpus) { /* enable IPI's, tlb shootdown, freezes etc */ atomic_store_rel_int(&smp_started, 1); } #ifdef __amd64__ /* * Enable global pages TLB extension * This also implicitly flushes the TLB */ load_cr4(rcr4() | CR4_PGE); if (pmap_pcid_enabled) load_cr4(rcr4() | CR4_PCIDE); load_ds(_udatasel); load_es(_udatasel); load_fs(_ufssel); #endif mtx_unlock_spin(&ap_boot_mtx); /* Wait until all the AP's are up. */ while (atomic_load_acq_int(&smp_started) == 0) ia32_pause(); #ifndef EARLY_AP_STARTUP /* Start per-CPU event timers. */ cpu_initclocks_ap(); #endif sched_throw(NULL); panic("scheduler returned us to %s", __func__); /* NOTREACHED */ } static void smp_after_idle_runnable(void *arg __unused) { struct thread *idle_td; int cpu; for (cpu = 1; cpu < mp_ncpus; cpu++) { idle_td = pcpu_find(cpu)->pc_idlethread; while (atomic_load_int(&idle_td->td_lastcpu) == NOCPU && atomic_load_int(&idle_td->td_oncpu) == NOCPU) cpu_spinwait(); kmem_free((vm_offset_t)bootstacks[cpu], kstack_pages * PAGE_SIZE); } } SYSINIT(smp_after_idle_runnable, SI_SUB_SMP, SI_ORDER_ANY, smp_after_idle_runnable, NULL); /* * We tell the I/O APIC code about all the CPUs we want to receive * interrupts. 
If we don't want certain CPUs to receive IRQs we * can simply not tell the I/O APIC code about them in this function. * We also do not tell it about the BSP since it tells itself about * the BSP internally to work with UP kernels and on UP machines. */ void set_interrupt_apic_ids(void) { u_int i, apic_id; for (i = 0; i < MAXCPU; i++) { apic_id = cpu_apic_ids[i]; if (apic_id == -1) continue; if (cpu_info[apic_id].cpu_bsp) continue; if (cpu_info[apic_id].cpu_disabled) continue; /* Don't let hyperthreads service interrupts. */ if (cpu_info[apic_id].cpu_hyperthread) continue; intr_add_cpu(i); } } #ifdef COUNT_XINVLTLB_HITS u_int xhits_gbl[MAXCPU]; u_int xhits_pg[MAXCPU]; u_int xhits_rng[MAXCPU]; static SYSCTL_NODE(_debug, OID_AUTO, xhits, CTLFLAG_RW, 0, ""); SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, global, CTLFLAG_RW, &xhits_gbl, sizeof(xhits_gbl), "IU", ""); SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, page, CTLFLAG_RW, &xhits_pg, sizeof(xhits_pg), "IU", ""); SYSCTL_OPAQUE(_debug_xhits, OID_AUTO, range, CTLFLAG_RW, &xhits_rng, sizeof(xhits_rng), "IU", ""); u_int ipi_global; u_int ipi_page; u_int ipi_range; u_int ipi_range_size; SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_global, CTLFLAG_RW, &ipi_global, 0, ""); SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_page, CTLFLAG_RW, &ipi_page, 0, ""); SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_range, CTLFLAG_RW, &ipi_range, 0, ""); SYSCTL_INT(_debug_xhits, OID_AUTO, ipi_range_size, CTLFLAG_RW, &ipi_range_size, 0, ""); #endif /* COUNT_XINVLTLB_HITS */ /* * Init and startup IPI. */ void ipi_startup(int apic_id, int vector) { /* * This attempts to follow the algorithm described in the * Intel Multiprocessor Specification v1.4 in section B.4. * For each IPI, we allow the local APIC ~20us to deliver the * IPI. If that times out, we panic. */ /* * first we do an INIT IPI: this INIT IPI might be run, resetting * and running the target CPU. OR this INIT IPI might be latched (P5 * bug), CPU waiting for STARTUP IPI. OR this INIT IPI might be * ignored. */ lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_LEVEL | APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, apic_id); lapic_ipi_wait(100); /* Explicitly deassert the INIT IPI. */ lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_LEVEL | APIC_LEVEL_DEASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_INIT, apic_id); DELAY(10000); /* wait ~10mS */ /* * next we do a STARTUP IPI: the previous INIT IPI might still be * latched, (P5 bug) this 1st STARTUP would then terminate * immediately, and the previously started INIT IPI would continue. OR * the previous INIT IPI has already run. and this STARTUP IPI will * run. OR the previous INIT IPI was ignored. and this STARTUP IPI * will run. */ lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE | APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP | vector, apic_id); if (!lapic_ipi_wait(100)) panic("Failed to deliver first STARTUP IPI to APIC %d", apic_id); DELAY(200); /* wait ~200uS */ /* * finally we do a 2nd STARTUP IPI: this 2nd STARTUP IPI should run IF * the previous STARTUP IPI was cancelled by a latched INIT IPI. OR * this STARTUP IPI will be ignored, as only ONE STARTUP IPI is * recognized after hardware RESET or INIT IPI. */ lapic_ipi_raw(APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE | APIC_LEVEL_ASSERT | APIC_DESTMODE_PHY | APIC_DELMODE_STARTUP | vector, apic_id); if (!lapic_ipi_wait(100)) panic("Failed to deliver second STARTUP IPI to APIC %d", apic_id); DELAY(200); /* wait ~200uS */ } /* * Send an IPI to specified CPU handling the bitmap logic. 
*/ void ipi_send_cpu(int cpu, u_int ipi) { u_int bitmap, old_pending, new_pending; KASSERT(cpu_apic_ids[cpu] != -1, ("IPI to non-existent CPU %d", cpu)); if (IPI_IS_BITMAPED(ipi)) { bitmap = 1 << ipi; ipi = IPI_BITMAP_VECTOR; do { old_pending = cpu_ipi_pending[cpu]; new_pending = old_pending | bitmap; } while (!atomic_cmpset_int(&cpu_ipi_pending[cpu], old_pending, new_pending)); if (old_pending) return; } lapic_ipi_vectored(ipi, cpu_apic_ids[cpu]); } void ipi_bitmap_handler(struct trapframe frame) { struct trapframe *oldframe; struct thread *td; int cpu = PCPU_GET(cpuid); u_int ipi_bitmap; critical_enter(); td = curthread; td->td_intr_nesting_level++; oldframe = td->td_intr_frame; td->td_intr_frame = &frame; ipi_bitmap = atomic_readandclear_int(&cpu_ipi_pending[cpu]); if (ipi_bitmap & (1 << IPI_PREEMPT)) { #ifdef COUNT_IPIS (*ipi_preempt_counts[cpu])++; #endif sched_preempt(td); } if (ipi_bitmap & (1 << IPI_AST)) { #ifdef COUNT_IPIS (*ipi_ast_counts[cpu])++; #endif /* Nothing to do for AST */ } if (ipi_bitmap & (1 << IPI_HARDCLOCK)) { #ifdef COUNT_IPIS (*ipi_hardclock_counts[cpu])++; #endif hardclockintr(); } td->td_intr_frame = oldframe; td->td_intr_nesting_level--; critical_exit(); } /* * send an IPI to a set of cpus. */ void ipi_selected(cpuset_t cpus, u_int ipi) { int cpu; /* * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit * of help in order to understand what is the source. * Set the mask of receiving CPUs for this purpose. */ if (ipi == IPI_STOP_HARD) CPU_OR_ATOMIC(&ipi_stop_nmi_pending, &cpus); while ((cpu = CPU_FFS(&cpus)) != 0) { cpu--; CPU_CLR(cpu, &cpus); CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi); ipi_send_cpu(cpu, ipi); } } /* * send an IPI to a specific CPU. */ void ipi_cpu(int cpu, u_int ipi) { /* * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit * of help in order to understand what is the source. * Set the mask of receiving CPUs for this purpose. */ if (ipi == IPI_STOP_HARD) CPU_SET_ATOMIC(cpu, &ipi_stop_nmi_pending); CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, ipi); ipi_send_cpu(cpu, ipi); } /* * send an IPI to all CPUs EXCEPT myself */ void ipi_all_but_self(u_int ipi) { cpuset_t other_cpus; other_cpus = all_cpus; CPU_CLR(PCPU_GET(cpuid), &other_cpus); if (IPI_IS_BITMAPED(ipi)) { ipi_selected(other_cpus, ipi); return; } /* * IPI_STOP_HARD maps to a NMI and the trap handler needs a bit * of help in order to understand what is the source. * Set the mask of receiving CPUs for this purpose. */ if (ipi == IPI_STOP_HARD) CPU_OR_ATOMIC(&ipi_stop_nmi_pending, &other_cpus); CTR2(KTR_SMP, "%s: ipi: %x", __func__, ipi); lapic_ipi_vectored(ipi, APIC_IPI_DEST_OTHERS); } int ipi_nmi_handler(void) { u_int cpuid; /* * As long as there is not a simple way to know about a NMI's * source, if the bitmask for the current CPU is present in * the global pending bitword an IPI_STOP_HARD has been issued * and should be handled. 
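ipi_send_cpu() above merges bitmapped IPIs so that only the 0 -> non-zero transition costs a real vectored IPI; the same pattern redone with C11 atomics as a userland illustration (the kernel uses an atomic_cmpset_int() loop instead of fetch_or):

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int pending;	/* stands in for cpu_ipi_pending[cpu] */

static void
post_soft_ipi(unsigned int bit)
{
	unsigned int old;

	old = atomic_fetch_or(&pending, 1u << bit);
	if (old == 0)
		printf("bit %u: first one pending, send a hardware IPI\n", bit);
	else
		printf("bit %u: IPI already in flight, bit merged\n", bit);
}

int
main(void)
{
	post_soft_ipi(1);	/* sends the vector */
	post_soft_ipi(3);	/* piggybacks on the pending one */
	return (0);
}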
*/ cpuid = PCPU_GET(cpuid); if (!CPU_ISSET(cpuid, &ipi_stop_nmi_pending)) return (1); CPU_CLR_ATOMIC(cpuid, &ipi_stop_nmi_pending); cpustop_handler(); return (0); } int nmi_kdb_lock; void nmi_call_kdb_smp(u_int type, struct trapframe *frame) { int cpu; bool call_post; cpu = PCPU_GET(cpuid); if (atomic_cmpset_acq_int(&nmi_kdb_lock, 0, 1)) { nmi_call_kdb(cpu, type, frame); call_post = false; } else { savectx(&stoppcbs[cpu]); CPU_SET_ATOMIC(cpu, &stopped_cpus); while (!atomic_cmpset_acq_int(&nmi_kdb_lock, 0, 1)) ia32_pause(); call_post = true; } atomic_store_rel_int(&nmi_kdb_lock, 0); if (call_post) cpustop_handler_post(cpu); } /* * Handle an IPI_STOP by saving our current context and spinning until we * are resumed. */ void cpustop_handler(void) { u_int cpu; cpu = PCPU_GET(cpuid); savectx(&stoppcbs[cpu]); /* Indicate that we are stopped */ CPU_SET_ATOMIC(cpu, &stopped_cpus); /* Wait for restart */ - while (!CPU_ISSET(cpu, &started_cpus)) - ia32_pause(); + while (!CPU_ISSET(cpu, &started_cpus)) { + ia32_pause(); + + /* + * Halt non-BSP CPUs on panic -- we're never going to need them + * again, and might as well save power / release resources + * (e.g., overprovisioned VM infrastructure). + */ + while (__predict_false(!IS_BSP() && panicstr != NULL)) + halt(); + } cpustop_handler_post(cpu); } static void cpustop_handler_post(u_int cpu) { CPU_CLR_ATOMIC(cpu, &started_cpus); CPU_CLR_ATOMIC(cpu, &stopped_cpus); /* * We don't broadcast TLB invalidations to other CPUs when they are * stopped. Hence, we clear the TLB before resuming. */ invltlb_glob(); #if defined(__amd64__) && defined(DDB) amd64_db_resume_dbreg(); #endif if (cpu == 0 && cpustop_restartfunc != NULL) { cpustop_restartfunc(); cpustop_restartfunc = NULL; } } /* * Handle an IPI_SUSPEND by saving our current context and spinning until we * are resumed. */ void cpususpend_handler(void) { u_int cpu; mtx_assert(&smp_ipi_mtx, MA_NOTOWNED); cpu = PCPU_GET(cpuid); if (savectx(&susppcbs[cpu]->sp_pcb)) { #ifdef __amd64__ fpususpend(susppcbs[cpu]->sp_fpususpend); #else npxsuspend(susppcbs[cpu]->sp_fpususpend); #endif /* * suspended_cpus is cleared shortly after each AP is restarted * by a Startup IPI, so that the BSP can proceed to restarting * the next AP. * * resuming_cpus gets cleared when the AP completes * initialization after having been released by the BSP. * resuming_cpus is probably not the best name for the * variable, because it is actually a set of processors that * haven't resumed yet and haven't necessarily started resuming. * * Note that suspended_cpus is meaningful only for ACPI suspend * as it's not really used for Xen suspend since the APs are * automatically restored to the running state and the correct * context. For the same reason resumectx is never called in * that case. */ CPU_SET_ATOMIC(cpu, &suspended_cpus); CPU_SET_ATOMIC(cpu, &resuming_cpus); /* * Invalidate the cache after setting the global status bits. * The last AP to set its bit may end up being an Owner of the * corresponding cache line in MOESI protocol. The AP may be * stopped before the cache line is written to the main memory. */ wbinvd(); } else { #ifdef __amd64__ fpuresume(susppcbs[cpu]->sp_fpususpend); #else npxresume(susppcbs[cpu]->sp_fpususpend); #endif pmap_init_pat(); initializecpu(); PCPU_SET(switchtime, 0); PCPU_SET(switchticks, ticks); /* Indicate that we have restarted and restored the context. 
*/ CPU_CLR_ATOMIC(cpu, &suspended_cpus); } /* Wait for resume directive */ while (!CPU_ISSET(cpu, &toresume_cpus)) ia32_pause(); /* Re-apply microcode updates. */ ucode_reload(); #ifdef __i386__ /* Finish removing the identity mapping of low memory for this AP. */ invltlb_glob(); #endif if (cpu_ops.cpu_resume) cpu_ops.cpu_resume(); #ifdef __amd64__ if (vmm_resume_p) vmm_resume_p(); #endif /* Resume MCA and local APIC */ lapic_xapic_mode(); mca_resume(); lapic_setup(0); /* Indicate that we are resumed */ CPU_CLR_ATOMIC(cpu, &resuming_cpus); CPU_CLR_ATOMIC(cpu, &suspended_cpus); CPU_CLR_ATOMIC(cpu, &toresume_cpus); } void invlcache_handler(void) { uint32_t generation; #ifdef COUNT_IPIS (*ipi_invlcache_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ /* * Reading the generation here allows greater parallelism * since wbinvd is a serializing instruction. Without the * temporary, we'd wait for wbinvd to complete, then the read * would execute, then the dependent write, which must then * complete before return from interrupt. */ generation = smp_tlb_generation; wbinvd(); PCPU_SET(smp_tlb_done, generation); } /* * This is called once the rest of the system is up and running and we're * ready to let the AP's out of the pen. */ static void release_aps(void *dummy __unused) { if (mp_ncpus == 1) return; atomic_store_rel_int(&aps_ready, 1); while (smp_started == 0) ia32_pause(); } SYSINIT(start_aps, SI_SUB_SMP, SI_ORDER_FIRST, release_aps, NULL); #ifdef COUNT_IPIS /* * Setup interrupt counters for IPI handlers. */ static void mp_ipi_intrcnt(void *dummy) { char buf[64]; int i; CPU_FOREACH(i) { snprintf(buf, sizeof(buf), "cpu%d:invltlb", i); intrcnt_add(buf, &ipi_invltlb_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:invlrng", i); intrcnt_add(buf, &ipi_invlrng_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:invlpg", i); intrcnt_add(buf, &ipi_invlpg_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:invlcache", i); intrcnt_add(buf, &ipi_invlcache_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:preempt", i); intrcnt_add(buf, &ipi_preempt_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:ast", i); intrcnt_add(buf, &ipi_ast_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:rendezvous", i); intrcnt_add(buf, &ipi_rendezvous_counts[i]); snprintf(buf, sizeof(buf), "cpu%d:hardclock", i); intrcnt_add(buf, &ipi_hardclock_counts[i]); } } SYSINIT(mp_ipi_intrcnt, SI_SUB_INTR, SI_ORDER_MIDDLE, mp_ipi_intrcnt, NULL); #endif /* * Flush the TLB on other CPU's */ /* Variables needed for SMP tlb shootdown. */ vm_offset_t smp_tlb_addr1, smp_tlb_addr2; pmap_t smp_tlb_pmap; volatile uint32_t smp_tlb_generation; #ifdef __amd64__ #define read_eflags() read_rflags() #endif static void smp_targeted_tlb_shootdown(cpuset_t mask, u_int vector, pmap_t pmap, vm_offset_t addr1, vm_offset_t addr2) { cpuset_t other_cpus; volatile uint32_t *p_cpudone; uint32_t generation; int cpu; /* It is not necessary to signal other CPUs while in the debugger. */ if (kdb_active || panicstr != NULL) return; /* * Check for other cpus. Return if none. 
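The shootdown handlers above read smp_tlb_generation before the serializing instruction and only then acknowledge; reduced to two userland threads, the handshake used by smp_targeted_tlb_shootdown() and the invltlb/invlcache handlers looks roughly like this (names hypothetical, pthreads standing in for the initiating and responding CPUs):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int tlb_generation;	/* bumped by the initiator */
static _Atomic unsigned int tlb_done;		/* per-CPU acknowledgement */

static void *
responder(void *arg)
{
	unsigned int gen;

	/* Read the generation first, then do the serializing work. */
	gen = atomic_load(&tlb_generation);
	/* ... invltlb()/wbinvd() would run here in the kernel ... */
	atomic_store(&tlb_done, gen);
	return (NULL);
}

int
main(void)
{
	pthread_t td;
	unsigned int gen;

	gen = atomic_fetch_add(&tlb_generation, 1) + 1;
	pthread_create(&td, NULL, responder, NULL);
	while (atomic_load(&tlb_done) != gen)
		;	/* initiator spins, like the ia32_pause() loop */
	pthread_join(td, NULL);
	printf("generation %u acknowledged\n", gen);
	return (0);
}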
*/ if (CPU_ISFULLSET(&mask)) { if (mp_ncpus <= 1) return; } else { CPU_CLR(PCPU_GET(cpuid), &mask); if (CPU_EMPTY(&mask)) return; } if (!(read_eflags() & PSL_I)) panic("%s: interrupts disabled", __func__); mtx_lock_spin(&smp_ipi_mtx); smp_tlb_addr1 = addr1; smp_tlb_addr2 = addr2; smp_tlb_pmap = pmap; generation = ++smp_tlb_generation; if (CPU_ISFULLSET(&mask)) { ipi_all_but_self(vector); other_cpus = all_cpus; CPU_CLR(PCPU_GET(cpuid), &other_cpus); } else { other_cpus = mask; while ((cpu = CPU_FFS(&mask)) != 0) { cpu--; CPU_CLR(cpu, &mask); CTR3(KTR_SMP, "%s: cpu: %d ipi: %x", __func__, cpu, vector); ipi_send_cpu(cpu, vector); } } while ((cpu = CPU_FFS(&other_cpus)) != 0) { cpu--; CPU_CLR(cpu, &other_cpus); p_cpudone = &cpuid_to_pcpu[cpu]->pc_smp_tlb_done; while (*p_cpudone != generation) ia32_pause(); } mtx_unlock_spin(&smp_ipi_mtx); } void smp_masked_invltlb(cpuset_t mask, pmap_t pmap) { if (smp_started) { smp_targeted_tlb_shootdown(mask, IPI_INVLTLB, pmap, 0, 0); #ifdef COUNT_XINVLTLB_HITS ipi_global++; #endif } } void smp_masked_invlpg(cpuset_t mask, vm_offset_t addr, pmap_t pmap) { if (smp_started) { smp_targeted_tlb_shootdown(mask, IPI_INVLPG, pmap, addr, 0); #ifdef COUNT_XINVLTLB_HITS ipi_page++; #endif } } void smp_masked_invlpg_range(cpuset_t mask, vm_offset_t addr1, vm_offset_t addr2, pmap_t pmap) { if (smp_started) { smp_targeted_tlb_shootdown(mask, IPI_INVLRNG, pmap, addr1, addr2); #ifdef COUNT_XINVLTLB_HITS ipi_range++; ipi_range_size += (addr2 - addr1) / PAGE_SIZE; #endif } } void smp_cache_flush(void) { if (smp_started) { smp_targeted_tlb_shootdown(all_cpus, IPI_INVLCACHE, NULL, 0, 0); } } /* * Handlers for TLB related IPIs */ void invltlb_handler(void) { uint32_t generation; #ifdef COUNT_XINVLTLB_HITS xhits_gbl[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invltlb_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ /* * Reading the generation here allows greater parallelism * since invalidating the TLB is a serializing operation. */ generation = smp_tlb_generation; if (smp_tlb_pmap == kernel_pmap) invltlb_glob(); #ifdef __amd64__ else invltlb(); #endif PCPU_SET(smp_tlb_done, generation); } void invlpg_handler(void) { uint32_t generation; #ifdef COUNT_XINVLTLB_HITS xhits_pg[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invlpg_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ generation = smp_tlb_generation; /* Overlap with serialization */ #ifdef __i386__ if (smp_tlb_pmap == kernel_pmap) #endif invlpg(smp_tlb_addr1); PCPU_SET(smp_tlb_done, generation); } void invlrng_handler(void) { vm_offset_t addr, addr2; uint32_t generation; #ifdef COUNT_XINVLTLB_HITS xhits_rng[PCPU_GET(cpuid)]++; #endif /* COUNT_XINVLTLB_HITS */ #ifdef COUNT_IPIS (*ipi_invlrng_counts[PCPU_GET(cpuid)])++; #endif /* COUNT_IPIS */ addr = smp_tlb_addr1; addr2 = smp_tlb_addr2; generation = smp_tlb_generation; /* Overlap with serialization */ #ifdef __i386__ if (smp_tlb_pmap == kernel_pmap) #endif do { invlpg(addr); addr += PAGE_SIZE; } while (addr < addr2); PCPU_SET(smp_tlb_done, generation); } Index: user/ngie/bug-237403/tests/sys/opencrypto/cryptodev.py =================================================================== --- user/ngie/bug-237403/tests/sys/opencrypto/cryptodev.py (revision 346925) +++ user/ngie/bug-237403/tests/sys/opencrypto/cryptodev.py (revision 346926) @@ -1,695 +1,695 @@ #!/usr/bin/env python # # Copyright (c) 2014 The FreeBSD Foundation # Copyright 2014 John-Mark Gurney # All rights reserved. 
# Copyright 2019 Enji Cooper # # This software was developed by John-Mark Gurney under # the sponsorship from the FreeBSD Foundation. # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE # FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. # # $FreeBSD$ # from __future__ import print_function import array from fcntl import ioctl import os import random import signal from struct import pack as _pack import sys import time import dpkt from cryptodevh import * __all__ = [ 'Crypto', 'MismatchError', ] class FindOp(dpkt.Packet): __byte_order__ = '@' __hdr__ = ( ('crid', 'i', 0), ('name', '32s', 0), ) class SessionOp(dpkt.Packet): __byte_order__ = '@' __hdr__ = ( ('cipher', 'I', 0), ('mac', 'I', 0), ('keylen', 'I', 0), ('key', 'P', 0), ('mackeylen', 'i', 0), ('mackey', 'P', 0), ('ses', 'I', 0), ) class SessionOp2(dpkt.Packet): __byte_order__ = '@' __hdr__ = ( ('cipher', 'I', 0), ('mac', 'I', 0), ('keylen', 'I', 0), ('key', 'P', 0), ('mackeylen', 'i', 0), ('mackey', 'P', 0), ('ses', 'I', 0), ('crid', 'i', 0), ('pad0', 'i', 0), ('pad1', 'i', 0), ('pad2', 'i', 0), ('pad3', 'i', 0), ) class CryptOp(dpkt.Packet): __byte_order__ = '@' __hdr__ = ( ('ses', 'I', 0), ('op', 'H', 0), ('flags', 'H', 0), ('len', 'I', 0), ('src', 'P', 0), ('dst', 'P', 0), ('mac', 'P', 0), ('iv', 'P', 0), ) class CryptAEAD(dpkt.Packet): __byte_order__ = '@' __hdr__ = ( ('ses', 'I', 0), ('op', 'H', 0), ('flags', 'H', 0), ('len', 'I', 0), ('aadlen', 'I', 0), ('ivlen', 'I', 0), ('src', 'P', 0), ('dst', 'P', 0), ('aad', 'P', 0), ('tag', 'P', 0), ('iv', 'P', 0), ) # h2py.py can't handle multiarg macros CRIOGET = 3221513060 CIOCGSESSION = 3224396645 CIOCGSESSION2 = 3225445226 CIOCFSESSION = 2147771238 CIOCCRYPT = 3224396647 CIOCKEY = 3230688104 CIOCASYMFEAT = 1074029417 CIOCKEY2 = 3230688107 CIOCFINDDEV = 3223610220 CIOCCRYPTAEAD = 3225445229 def _getdev(): buf = array.array('I', [0]) fd = os.open('/dev/crypto', os.O_RDWR) try: ioctl(fd, CRIOGET, buf, 1) finally: os.close(fd) return buf[0] _cryptodev = _getdev() def _findop(crid, name): fop = FindOp() fop.crid = crid fop.name = name.encode("ascii") s = array.array('B', fop.pack_hdr()) ioctl(_cryptodev, CIOCFINDDEV, s, 1) fop.unpack(s) try: idx = fop.name.index(b'\x00') name = fop.name[:idx] except ValueError: name = fop.name return fop.crid, name class Crypto: @staticmethod def findcrid(name): return _findop(-1, name)[0] @staticmethod def getcridname(crid): 
return _findop(crid, '')[1] def __init__(self, cipher=0, key=None, mac=0, mackey=None, crid=CRYPTOCAP_F_SOFTWARE | CRYPTOCAP_F_HARDWARE, maclen=None): self._ses = None self._maclen = maclen ses = SessionOp2() ses.cipher = cipher ses.mac = mac if key is not None: ses.keylen = len(key) k = array.array('B', key) ses.key = k.buffer_info()[0] else: self.key = None if mackey is not None: ses.mackeylen = len(mackey) mk = array.array('B', mackey) ses.mackey = mk.buffer_info()[0] if not cipher and not mac: raise ValueError('one of cipher or mac MUST be specified.') ses.crid = crid #print(ses) s = array.array('B', ses.pack_hdr()) #print(s) ioctl(_cryptodev, CIOCGSESSION2, s, 1) ses.unpack(s) self._ses = ses.ses def __del__(self): if self._ses is None: return try: ioctl(_cryptodev, CIOCFSESSION, _pack('I', self._ses)) except TypeError: pass self._ses = None @staticmethod def _to_bytes(val): if sys.version_info[0] >= 3: if isinstance(val, str): return val.encode("ascii") return val def _doop(self, op, src, iv): cop = CryptOp() cop.ses = self._ses cop.op = op cop.flags = 0 cop.len = len(src) s = array.array('B', src) cop.src = cop.dst = s.buffer_info()[0] if self._maclen is not None: m = array.array('B', [0] * self._maclen) cop.mac = m.buffer_info()[0] ivbuf = array.array('B', self._to_bytes(iv)) cop.iv = ivbuf.buffer_info()[0] #print('cop:', cop) ioctl(_cryptodev, CIOCCRYPT, str(cop)) s = s.tostring() if self._maclen is not None: return s, m.tostring() return s def _doaead(self, op, src, aad, iv, tag=None): caead = CryptAEAD() caead.ses = self._ses caead.op = op caead.flags = CRD_F_IV_EXPLICIT caead.flags = 0 caead.len = len(src) src = self._to_bytes(src) s = array.array("B", src) caead.src = caead.dst = s.buffer_info()[0] caead.aadlen = len(aad) aad = self._to_bytes(aad) saad = array.array('B', aad) caead.aad = saad.buffer_info()[0] if self._maclen is None: raise ValueError('must have a tag length') if tag is None: tag = array.array('B', [0] * self._maclen) else: assert len(tag) == self._maclen, \ '%d != %d' % (len(tag), self._maclen) tag = self._to_bytes(tag) tag = array.array('B', tag) caead.tag = tag.buffer_info()[0] ivbuf = array.array('B', iv) caead.ivlen = len(iv) caead.iv = ivbuf.buffer_info()[0] ioctl(_cryptodev, CIOCCRYPTAEAD, self._to_bytes(str(caead))) s = s.tostring() return s, tag.tostring() def perftest(self, op, size, timeo=3): - inp = array.array('B', (random.randint(0, 255) for x in xrange(size))) + inp = array.array('B', (random.randint(0, 255) for x in range(size))) out = array.array('B', self._to_bytes(inp)) # prep ioctl cop = CryptOp() cop.ses = self._ses cop.op = op cop.flags = 0 cop.len = len(inp) s = array.array('B', self._to_bytes(inp)) cop.src = s.buffer_info()[0] cop.dst = out.buffer_info()[0] if self._maclen is not None: m = array.array('B', [0] * self._maclen) cop.mac = m.buffer_info()[0] - ivbuf = array.array('B', (random.randint(0, 255) for x in xrange(16))) + ivbuf = array.array('B', (random.randint(0, 255) for x in range(16))) cop.iv = ivbuf.buffer_info()[0] exit = [ False ] def alarmhandle(a, b, exit=exit): exit[0] = True oldalarm = signal.signal(signal.SIGALRM, alarmhandle) signal.alarm(timeo) start = time.time() reps = 0 while not exit[0]: ioctl(_cryptodev, CIOCCRYPT, self._to_bytes(str(cop))) reps += 1 end = time.time() signal.signal(signal.SIGALRM, oldalarm) print('time:', end - start) print('perf MB/sec:', (reps * size) / (end - start) / 1024 / 1024) def encrypt(self, data, iv, aad=None): if aad is None: return self._doop(COP_ENCRYPT, data, iv) else: return 
self._doaead(COP_ENCRYPT, data, aad, iv) def decrypt(self, data, iv, aad=None, tag=None): if aad is None: return self._doop(COP_DECRYPT, data, iv) else: return self._doaead(COP_DECRYPT, data, aad, iv, tag=tag) class MismatchError(Exception): pass class KATParser: def __init__(self, fname, fields): self.fields = set(fields) self._pending = None self.fname = fname self.fp = None def __enter__(self): self.fp = open(self.fname) return self def __exit__(self, exc_type, exc_value, exc_tb): if self.fp is not None: self.fp.close() def __iter__(self): return self def __next__(self): while True: didread = False if self._pending is not None: i = self._pending self._pending = None else: i = self.fp.readline() didread = True if didread and not i: raise StopIteration if not i.startswith('#') and i.strip(): break if i[0] == '[': yield i[1:].split(']', 1)[0], self.fielditer() else: raise ValueError('unknown line: %r' % repr(i)) def eatblanks(self): while True: line = self.fp.readline() if line == '': break line = line.strip() if line: break return line def fielditer(self): while True: values = {} line = self.eatblanks() if not line or line[0] == '[': self._pending = line return while True: try: f, v = line.split(' =') except: if line == 'FAIL': f, v = 'FAIL', '' else: print('line:', repr(line)) raise v = v.strip() if f in values: raise ValueError('already present: %r' % repr(f)) values[f] = v line = self.fp.readline().strip() if not line: break # we should have everything remain = self.fields.copy() - set(values.keys()) # XXX - special case GCM decrypt if remain and not ('FAIL' in values and 'PT' in remain): raise ValueError('not all fields found: %r' % repr(remain)) yield values # The CCM files use a bit of a different syntax that doesn't quite fit # the generic KATParser. In particular, some keys are set globally at # the start of the file, and some are set globally at the start of a # section. 
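For orientation, here is a hedged sketch of the layout that KATCCMParser (defined next) expects from a NIST CCM response file. The file name and values are illustrative only, but the field names match the ones runCCMEncrypt() and runCCMDecrypt() read in cryptotest.py below:

    # Illustrative .rsp layout (not a real vector file):
    #   Key = 404142...                                   global, applies to every vector
    #   [Alen = 0, Plen = 24, Nlen = 13, Tlen = 16]       section header, merged per section
    #   Count = 0
    #   Nonce = ...
    #   Adata = ...
    #   Payload = ...
    #   CT = ...
    for vector in KATCCMParser('VNT256.rsp'):            # hypothetical file name
        # Each yielded dict merges the global, section and per-Count values.
        key, nonce, ct = vector['Key'], vector['Nonce'], vector['CT']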
class KATCCMParser: def __init__(self, fname): self.fp = open(fname) self._pending = None self.read_globals() def read_globals(self): self.global_values = {} while True: line = self.fp.readline() if not line: return if line[0] == '#' or not line.strip(): continue if line[0] == '[': self._pending = line return try: f, v = line.split(' =') except: print('line:', repr(line)) raise v = v.strip() if f in self.global_values: raise ValueError('already present: %r' % repr(f)) self.global_values[f] = v def read_section_values(self, kwpairs): self.section_values = self.global_values.copy() for pair in kwpairs.split(', '): f, v = pair.split(' = ') if f in self.section_values: raise ValueError('already present: %r' % repr(f)) self.section_values[f] = v while True: line = self.fp.readline() if not line: return if line[0] == '#' or not line.strip(): continue if line[0] == '[': self._pending = line return try: f, v = line.split(' =') except: print('line:', repr(line)) raise if f == 'Count': self._pending = line return v = v.strip() if f in self.section_values: raise ValueError('already present: %r' % repr(f)) self.section_values[f] = v def __iter__(self): while True: if self._pending: line = self._pending self._pending = None else: line = self.fp.readline() if not line: return if (line and line[0] == '#') or not line.strip(): continue if line[0] == '[': section = line[1:].split(']', 1)[0] self.read_section_values(section) continue values = self.section_values.copy() while True: try: f, v = line.split(' =') except: print('line:', repr(line)) raise v = v.strip() if f in values: raise ValueError('already present: %r' % repr(f)) values[f] = v line = self.fp.readline().strip() if not line: break yield values def _spdechex(s): return binascii.hexlify(''.join(s.split())) if __name__ == '__main__': if True: try: crid = Crypto.findcrid('aesni0') print('aesni:', crid) except IOError: print('aesni0 not found') - for i in xrange(10): + for i in range(10): try: name = Crypto.getcridname(i) print('%2d: %r' % (i, repr(name))) except IOError: pass elif False: kp = KATParser('/usr/home/jmg/aesni.testing/format tweak value input - data unit seq no/XTSGenAES128.rsp', [ 'COUNT', 'DataUnitLen', 'Key', 'DataUnitSeqNumber', 'PT', 'CT' ]) for mode, ni in kp: print(i, ni) for j in ni: print(j) elif False: key = _spdechex('c939cc13397c1d37de6ae0e1cb7c423c') iv = _spdechex('00000000000000000000000000000001') pt = _spdechex('ab3cabed693a32946055524052afe3c9cb49664f09fc8b7da824d924006b7496353b8c1657c5dec564d8f38d7432e1de35aae9d95590e66278d4acce883e51abaf94977fcd3679660109a92bf7b2973ccd547f065ec6cee4cb4a72a5e9f45e615d920d76cb34cba482467b3e21422a7242e7d931330c0fbf465c3a3a46fae943029fd899626dda542750a1eee253df323c6ef1573f1c8c156613e2ea0a6cdbf2ae9701020be2d6a83ecb7f3f9d8e') #pt = _spdechex('00000000000000000000000000000000') ct = _spdechex('f42c33853ecc5ce2949865fdb83de3bff1089e9360c94f830baebfaff72836ab5236f77212f1e7396c8c54ac73d81986375a6e9e299cfeca5ba051ed25e8d1affa5beaf6c1d2b45e90802408f2ced21663497e906de5f29341e5e52ddfea5363d628b3eb7806835e17bae051b3a6da3f8e2941fe44384eac17a9d298d2c331ca8320c775b5d53263a5e905059d891b21dede2d8110fd427c7bd5a9a274ddb47b1945ee79522203b6e297d0e399ef') c = Crypto(CRYPTO_AES_ICM, key) enc = c.encrypt(pt, iv) print('enc:', binascii.unhexlify(enc)) print(' ct:', binascii.unhexlify(ct)) assert ct == enc dec = c.decrypt(ct, iv) print('dec:', binascii.unhexlify(dec)) print(' pt:', binascii.unhexlify(pt)) assert pt == dec elif False: key = _spdechex('c939cc13397c1d37de6ae0e1cb7c423c') iv = 
_spdechex('00000000000000000000000000000001') pt = _spdechex('ab3cabed693a32946055524052afe3c9cb49664f09fc8b7da824d924006b7496353b8c1657c5dec564d8f38d7432e1de35aae9d95590e66278d4acce883e51abaf94977fcd3679660109a92bf7b2973ccd547f065ec6cee4cb4a72a5e9f45e615d920d76cb34cba482467b3e21422a7242e7d931330c0fbf465c3a3a46fae943029fd899626dda542750a1eee253df323c6ef1573f1c8c156613e2ea0a6cdbf2ae9701020be2d6a83ecb7f3f9d8e0a3f') #pt = _spdechex('00000000000000000000000000000000') ct = _spdechex('f42c33853ecc5ce2949865fdb83de3bff1089e9360c94f830baebfaff72836ab5236f77212f1e7396c8c54ac73d81986375a6e9e299cfeca5ba051ed25e8d1affa5beaf6c1d2b45e90802408f2ced21663497e906de5f29341e5e52ddfea5363d628b3eb7806835e17bae051b3a6da3f8e2941fe44384eac17a9d298d2c331ca8320c775b5d53263a5e905059d891b21dede2d8110fd427c7bd5a9a274ddb47b1945ee79522203b6e297d0e399ef3768') c = Crypto(CRYPTO_AES_ICM, key) enc = c.encrypt(pt, iv) print('enc:', binascii.hexlify(enc)) print(' ct:', binascii.hexlify(ct)) assert ct == enc dec = c.decrypt(ct, iv) print('dec:', binascii.hexlify(dec)) print(' pt:', binascii.hexlify(pt)) assert pt == dec elif False: key = _spdechex('c939cc13397c1d37de6ae0e1cb7c423c') iv = _spdechex('6eba2716ec0bd6fa5cdef5e6d3a795bc') pt = _spdechex('ab3cabed693a32946055524052afe3c9cb49664f09fc8b7da824d924006b7496353b8c1657c5dec564d8f38d7432e1de35aae9d95590e66278d4acce883e51abaf94977fcd3679660109a92bf7b2973ccd547f065ec6cee4cb4a72a5e9f45e615d920d76cb34cba482467b3e21422a7242e7d931330c0fbf465c3a3a46fae943029fd899626dda542750a1eee253df323c6ef1573f1c8c156613e2ea0a6cdbf2ae9701020be2d6a83ecb7f3f9d8e0a3f') ct = _spdechex('f1f81f12e72e992dbdc304032705dc75dc3e4180eff8ee4819906af6aee876d5b00b7c36d282a445ce3620327be481e8e53a8e5a8e5ca9abfeb2281be88d12ffa8f46d958d8224738c1f7eea48bda03edbf9adeb900985f4fa25648b406d13a886c25e70cfdecdde0ad0f2991420eb48a61c64fd797237cf2798c2675b9bb744360b0a3f329ac53bbceb4e3e7456e6514f1a9d2f06c236c31d0f080b79c15dce1096357416602520daa098b17d1af427') c = Crypto(CRYPTO_AES_CBC, key) enc = c.encrypt(pt, iv) print('enc:', binascii.hexlify(enc)) print(' ct:', binascii.hexlify(ct)) assert ct == enc dec = c.decrypt(ct, iv) print('dec:', binascii.hexlify(dec)) print(' pt:', binascii.hexlify(pt)) assert pt == dec elif False: key = _spdechex('c939cc13397c1d37de6ae0e1cb7c423c') iv = _spdechex('b3d8cc017cbb89b39e0f67e2') pt = _spdechex('c3b3c41f113a31b73d9a5cd4321030') aad = _spdechex('24825602bd12a984e0092d3e448eda5f') ct = _spdechex('93fe7d9e9bfd10348a5606e5cafa7354') ct = _spdechex('93fe7d9e9bfd10348a5606e5cafa73') tag = _spdechex('0032a1dc85f1c9786925a2e71d8272dd') tag = _spdechex('8d11a0929cb3fbe1fef01a4a38d5f8ea') c = Crypto(CRYPTO_AES_NIST_GCM_16, key, mac=CRYPTO_AES_128_NIST_GMAC, mackey=key) enc, enctag = c.encrypt(pt, iv, aad=aad) print('enc:', binascii.hexlify(enc)) print(' ct:', binascii.hexlify(ct)) assert enc == ct print('etg:', binascii.hexlify(enctag)) print('tag:', binascii.hexlify(tag)) assert enctag == tag # Make sure we get EBADMSG #enctag = enctag[:-1] + 'a' dec, dectag = c.decrypt(ct, iv, aad=aad, tag=enctag) print('dec:', binascii.hexlify(dec)) print(' pt:', binascii.hexlify(pt)) assert dec == pt print('dtg:', binascii.hexlify(dectag)) print('tag:', binascii.hexlify(tag)) assert dectag == tag elif False: key = _spdechex('c939cc13397c1d37de6ae0e1cb7c423c') iv = _spdechex('b3d8cc017cbb89b39e0f67e2') key = key + iv[:4] iv = iv[4:] pt = _spdechex('c3b3c41f113a31b73d9a5cd432103069') aad = _spdechex('24825602bd12a984e0092d3e448eda5f') ct = _spdechex('93fe7d9e9bfd10348a5606e5cafa7354') tag = 
_spdechex('0032a1dc85f1c9786925a2e71d8272dd') c = Crypto(CRYPTO_AES_GCM_16, key, mac=CRYPTO_AES_128_GMAC, mackey=key) enc, enctag = c.encrypt(pt, iv, aad=aad) print('enc:', binascii.hexlify(enc)) print(' ct:', binascii.hexlify(ct)) assert enc == ct print('etg:', binascii.hexlify(enctag)) print('tag:', binascii.hexlify(tag)) assert enctag == tag elif False: - for i in xrange(100000): + for i in range(100000): c = Crypto(CRYPTO_AES_XTS, binascii.unhexlify('1bbfeadf539daedcae33ced497343f3ca1f2474ad932b903997d44707db41382')) data = binascii.unhexlify('52a42bca4e9425a25bbc8c8bf6129dec') ct = binascii.unhexlify('517e602becd066b65fa4f4f56ddfe240') iv = _pack('QQ', 71, 0) enc = c.encrypt(data, iv) assert enc == ct elif True: c = Crypto(CRYPTO_AES_XTS, binascii.unhexlify('1bbfeadf539daedcae33ced497343f3ca1f2474ad932b903997d44707db41382')) data = binascii.unhexlify('52a42bca4e9425a25bbc8c8bf6129dec') ct = binascii.unhexlify('517e602becd066b65fa4f4f56ddfe240') iv = _pack('QQ', 71, 0) enc = c.encrypt(data, iv) assert enc == ct dec = c.decrypt(enc, iv) assert dec == data #c.perftest(COP_ENCRYPT, 192*1024, reps=30000) else: key = binascii.unhexlify('1bbfeadf539daedcae33ced497343f3ca1f2474ad932b903997d44707db41382') print('XTS %d testing:' % (len(key) * 8)) c = Crypto(CRYPTO_AES_XTS, key) for i in [ 8192, 192*1024]: print('block size: %d' % i) c.perftest(COP_ENCRYPT, i) c.perftest(COP_DECRYPT, i) Index: user/ngie/bug-237403/tests/sys/opencrypto/cryptotest.py =================================================================== --- user/ngie/bug-237403/tests/sys/opencrypto/cryptotest.py (revision 346925) +++ user/ngie/bug-237403/tests/sys/opencrypto/cryptotest.py (revision 346926) @@ -1,495 +1,495 @@ #!/usr/local/bin/python2 # # Copyright (c) 2014 The FreeBSD Foundation # All rights reserved. # # This software was developed by John-Mark Gurney under # the sponsorship from the FreeBSD Foundation. # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE # FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
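Before the test driver itself, a minimal usage sketch of the Crypto wrapper defined in cryptodev.py above. The key, IV and plaintext are illustrative, and running it assumes /dev/crypto exists (the cryptodev(4) module is loaded) and is accessible, since the module opens it at import time:

    import binascii
    import cryptodev

    key = binascii.unhexlify('000102030405060708090a0b0c0d0e0f')  # illustrative 128-bit key
    iv  = binascii.unhexlify('00112233445566778899aabbccddeeff')  # illustrative 16-byte IV
    pt  = b'0123456789abcdef'                                     # exactly one AES block

    c = cryptodev.Crypto(cryptodev.CRYPTO_AES_CBC, key)  # crid defaults to software or hardware
    ct = c.encrypt(pt, iv)
    assert c.decrypt(ct, iv) == pt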
# # $FreeBSD$ # from __future__ import print_function import binascii import errno import cryptodev import itertools import os import struct import unittest from cryptodev import * from glob import iglob katdir = '/usr/local/share/nist-kat' def katg(base, glob): assert os.path.exists(katdir), "Please 'pkg install nist-kat'" if not os.path.exists(os.path.join(katdir, base)): raise unittest.SkipTest("Missing %s test vectors" % (base)) return iglob(os.path.join(katdir, base, glob)) aesmodules = [ 'cryptosoft0', 'aesni0', 'ccr0', 'ccp0' ] desmodules = [ 'cryptosoft0', ] shamodules = [ 'cryptosoft0', 'aesni0', 'ccr0', 'ccp0' ] def GenTestCase(cname): try: crid = cryptodev.Crypto.findcrid(cname) except IOError: return None class GendCryptoTestCase(unittest.TestCase): ############### ##### AES ##### ############### @unittest.skipIf(cname not in aesmodules, 'skipping AES-XTS on %s' % (cname)) def test_xts(self): for i in katg('XTSTestVectors/format tweak value input - data unit seq no', '*.rsp'): self.runXTS(i, cryptodev.CRYPTO_AES_XTS) @unittest.skipIf(cname not in aesmodules, 'skipping AES-CBC on %s' % (cname)) def test_cbc(self): for i in katg('KAT_AES', 'CBC[GKV]*.rsp'): self.runCBC(i) @unittest.skipIf(cname not in aesmodules, 'skipping AES-CCM on %s' % (cname)) def test_ccm(self): for i in katg('ccmtestvectors', 'V*.rsp'): self.runCCMEncrypt(i) for i in katg('ccmtestvectors', 'D*.rsp'): self.runCCMDecrypt(i) @unittest.skipIf(cname not in aesmodules, 'skipping AES-GCM on %s' % (cname)) def test_gcm(self): for i in katg('gcmtestvectors', 'gcmEncrypt*'): self.runGCM(i, 'ENCRYPT') for i in katg('gcmtestvectors', 'gcmDecrypt*'): self.runGCM(i, 'DECRYPT') _gmacsizes = { 32: cryptodev.CRYPTO_AES_256_NIST_GMAC, 24: cryptodev.CRYPTO_AES_192_NIST_GMAC, 16: cryptodev.CRYPTO_AES_128_NIST_GMAC, } def runGCM(self, fname, mode): curfun = None if mode == 'ENCRYPT': swapptct = False curfun = Crypto.encrypt elif mode == 'DECRYPT': swapptct = True curfun = Crypto.decrypt else: raise RuntimeError('unknown mode: %r' % repr(mode)) columns = [ 'Count', 'Key', 'IV', 'CT', 'AAD', 'Tag', 'PT', ] with cryptodev.KATParser(fname, columns) as parser: self.runGCMWithParser(parser, mode) def runGCMWithParser(self, parser, mode): for _, lines in next(parser): for data in lines: curcnt = int(data['Count']) cipherkey = binascii.unhexlify(data['Key']) iv = binascii.unhexlify(data['IV']) aad = binascii.unhexlify(data['AAD']) tag = binascii.unhexlify(data['Tag']) if 'FAIL' not in data: pt = binascii.unhexlify(data['PT']) ct = binascii.unhexlify(data['CT']) if len(iv) != 12: # XXX - isn't supported continue try: c = Crypto(cryptodev.CRYPTO_AES_NIST_GCM_16, cipherkey, mac=self._gmacsizes[len(cipherkey)], mackey=cipherkey, crid=crid, maclen=16) except EnvironmentError as e: # Can't test algorithms the driver does not support. if e.errno != errno.EOPNOTSUPP: raise continue if mode == 'ENCRYPT': try: rct, rtag = c.encrypt(pt, iv, aad) except EnvironmentError as e: # Can't test inputs the driver does not support. if e.errno != errno.EINVAL: raise continue rtag = rtag[:len(tag)] data['rct'] = binascii.hexlify(rct) data['rtag'] = binascii.hexlify(rtag) self.assertEqual(rct, ct, repr(data)) self.assertEqual(rtag, tag, repr(data)) else: if len(tag) != 16: continue args = (ct, iv, aad, tag) if 'FAIL' in data: self.assertRaises(IOError, c.decrypt, *args) else: try: rpt, rtag = c.decrypt(*args) except EnvironmentError as e: # Can't test inputs the driver does not support. 
if e.errno != errno.EINVAL: raise continue data['rpt'] = binascii.unhexlify(rpt) data['rtag'] = binascii.unhexlify(rtag) self.assertEqual(rpt, pt, repr(data)) def runCBC(self, fname): columns = [ 'COUNT', 'KEY', 'IV', 'PLAINTEXT', 'CIPHERTEXT', ] with cryptodev.KATParser(fname, columns) as parser: self.runCBCWithParser(parser) def runCBCWithParser(self, parser): curfun = None for mode, lines in next(parser): if mode == 'ENCRYPT': swapptct = False curfun = Crypto.encrypt elif mode == 'DECRYPT': swapptct = True curfun = Crypto.decrypt else: raise RuntimeError('unknown mode: %r' % repr(mode)) for data in lines: curcnt = int(data['COUNT']) cipherkey = binascii.unhexlify(data['KEY']) iv = binascii.unhexlify(data['IV']) pt = binascii.unhexlify(data['PLAINTEXT']) ct = binascii.unhexlify(data['CIPHERTEXT']) if swapptct: pt, ct = ct, pt # run the fun c = Crypto(cryptodev.CRYPTO_AES_CBC, cipherkey, crid=crid) r = curfun(c, pt, iv) self.assertEqual(r, ct) def runXTS(self, fname, meth): columns = [ 'COUNT', 'DataUnitLen', 'Key', 'DataUnitSeqNumber', 'PT', 'CT'] with cryptodev.KATParser(fname, columns) as parser: self.runXTSWithParser(parser, meth) def runXTSWithParser(self, parser, meth): curfun = None for mode, lines in next(parser): if mode == 'ENCRYPT': swapptct = False curfun = Crypto.encrypt elif mode == 'DECRYPT': swapptct = True curfun = Crypto.decrypt else: raise RuntimeError('unknown mode: %r' % repr(mode)) for data in lines: curcnt = int(data['COUNT']) nbits = int(data['DataUnitLen']) cipherkey = binascii.unhexlify(data['Key']) iv = struct.pack('QQ', int(data['DataUnitSeqNumber']), 0) pt = binascii.unhexlify(data['PT']) ct = binascii.unhexlify(data['CT']) if nbits % 128 != 0: # XXX - mark as skipped continue if swapptct: pt, ct = ct, pt # run the fun try: c = Crypto(meth, cipherkey, crid=crid) r = curfun(c, pt, iv) except EnvironmentError as e: # Can't test hashes the driver does not support. if e.errno != errno.EOPNOTSUPP: raise continue self.assertEqual(r, ct) def runCCMEncrypt(self, fname): for data in cryptodev.KATCCMParser(fname): Nlen = int(data['Nlen']) if Nlen != 12: # OCF only supports 12 byte IVs continue key = data['Key'].decode('hex') nonce = data['Nonce'].decode('hex') Alen = int(data['Alen']) if Alen != 0: aad = data['Adata'].decode('hex') else: aad = None payload = data['Payload'].decode('hex') ct = data['CT'].decode('hex') try: c = Crypto(crid=crid, cipher=cryptodev.CRYPTO_AES_CCM_16, key=key, mac=cryptodev.CRYPTO_AES_CCM_CBC_MAC, mackey=key, maclen=16) r, tag = Crypto.encrypt(c, payload, nonce, aad) except EnvironmentError as e: if e.errno != errno.EOPNOTSUPP: raise continue out = r + tag self.assertEqual(out, ct, "Count " + data['Count'] + " Actual: " + \ repr(out.encode("hex")) + " Expected: " + \ repr(data) + " on " + cname) def runCCMDecrypt(self, fname): # XXX: Note that all of the current CCM # decryption test vectors use IV and tag sizes # that aren't supported by OCF none of the # tests are actually ran. 
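To make the preceding note concrete, a small hedged sketch of the size gate the loop below applies; the Nlen and Tlen values are only illustrative of the current NIST decrypt vectors:

    # OCF handles only 12-byte nonces and 16-byte tags, so a vector with any
    # other Nlen/Tlen combination is silently skipped rather than run.
    Nlen, Tlen = 13, 8              # illustrative sizes from a typical decrypt vector
    runnable = (Nlen == 12 and Tlen == 16)
    assert not runnable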
for data in cryptodev.KATCCMParser(fname): Nlen = int(data['Nlen']) if Nlen != 12: # OCF only supports 12 byte IVs continue Tlen = int(data['Tlen']) if Tlen != 16: # OCF only supports 16 byte tags continue key = data['Key'].decode('hex') nonce = data['Nonce'].decode('hex') Alen = int(data['Alen']) if Alen != 0: aad = data['Adata'].decode('hex') else: aad = None ct = data['CT'].decode('hex') tag = ct[-16:] ct = ct[:-16] try: c = Crypto(crid=crid, cipher=cryptodev.CRYPTO_AES_CCM_16, key=key, mac=cryptodev.CRYPTO_AES_CCM_CBC_MAC, mackey=key, maclen=16) except EnvironmentError as e: if e.errno != errno.EOPNOTSUPP: raise continue if data['Result'] == 'Fail': self.assertRaises(IOError, c.decrypt, payload, nonce, aad, tag) else: r = Crypto.decrypt(c, payload, nonce, aad, tag) payload = data['Payload'].decode('hex') - Plen = int(data('Plen')) + plen = int(data('Plen')) payload = payload[:plen] self.assertEqual(r, payload, "Count " + data['Count'] + \ " Actual: " + repr(r.encode("hex")) + \ " Expected: " + repr(data) + \ " on " + cname) ############### ##### DES ##### ############### @unittest.skipIf(cname not in desmodules, 'skipping DES on %s' % (cname)) def test_tdes(self): for i in katg('KAT_TDES', 'TCBC[a-z]*.rsp'): self.runTDES(i) def runTDES(self, fname): columns = [ 'COUNT', 'KEYs', 'IV', 'PLAINTEXT', 'CIPHERTEXT', ] with cryptodev.KATParser(fname, columns) as parser: self.runTDESWithParser(parser) def runTDESWithParser(self, parser): curfun = None for mode, lines in next(parser): if mode == 'ENCRYPT': swapptct = False curfun = Crypto.encrypt elif mode == 'DECRYPT': swapptct = True curfun = Crypto.decrypt else: raise RuntimeError('unknown mode: %r' % repr(mode)) for data in lines: curcnt = int(data['COUNT']) key = data['KEYs'] * 3 cipherkey = binascii.unhexlify(key) iv = binascii.unhexlify(data['IV']) pt = binascii.unhexlify(data['PLAINTEXT']) ct = binascii.unhexlify(data['CIPHERTEXT']) if swapptct: pt, ct = ct, pt # run the fun c = Crypto(cryptodev.CRYPTO_3DES_CBC, cipherkey, crid=crid) r = curfun(c, pt, iv) self.assertEqual(r, ct) ############### ##### SHA ##### ############### @unittest.skipIf(cname not in shamodules, 'skipping SHA on %s' % str(cname)) def test_sha(self): for i in katg('shabytetestvectors', 'SHA*Msg.rsp'): self.runSHA(i) def runSHA(self, fname): # Skip SHA512_(224|256) tests if fname.find('SHA512_') != -1: return for hashlength, lines in cryptodev.KATParser(fname, [ 'Len', 'Msg', 'MD' ]): # E.g., hashlength will be "L=20" (bytes) hashlen = int(hashlength.split("=")[1]) if hashlen == 20: alg = cryptodev.CRYPTO_SHA1 elif hashlen == 28: alg = cryptodev.CRYPTO_SHA2_224 elif hashlen == 32: alg = cryptodev.CRYPTO_SHA2_256 elif hashlen == 48: alg = cryptodev.CRYPTO_SHA2_384 elif hashlen == 64: alg = cryptodev.CRYPTO_SHA2_512 else: # Skip unsupported hashes # Slurp remaining input in section for data in lines: continue continue for data in lines: msg = data['Msg'].decode('hex') - msg = msg[:int(data['Len'])] + msg = msg[:int(data['Len'])] md = data['MD'].decode('hex') try: c = Crypto(mac=alg, crid=crid, maclen=hashlen) except EnvironmentError as e: # Can't test hashes the driver does not support. 
if e.errno != errno.EOPNOTSUPP: raise continue _, r = c.encrypt(msg, iv="") self.assertEqual(r, md, "Actual: " + \ repr(r.encode("hex")) + " Expected: " + repr(data) + " on " + cname) @unittest.skipIf(cname not in shamodules, 'skipping SHA-HMAC on %s' % str(cname)) def test_sha1hmac(self): for i in katg('hmactestvectors', 'HMAC.rsp'): self.runSHA1HMAC(i) def runSHA1HMAC(self, fname): columns = [ 'Count', 'Klen', 'Tlen', 'Key', 'Msg', 'Mac' ] with cryptodev.KATParser(fname, columns) as parser: self.runSHA1HMACWithParser(parser) def runSHA1HMACWithParser(self, parser): for hashlength, lines in next(parser): # E.g., hashlength will be "L=20" (bytes) hashlen = int(hashlength.split("=")[1]) blocksize = None if hashlen == 20: alg = cryptodev.CRYPTO_SHA1_HMAC blocksize = 64 elif hashlen == 28: alg = cryptodev.CRYPTO_SHA2_224_HMAC blocksize = 64 elif hashlen == 32: alg = cryptodev.CRYPTO_SHA2_256_HMAC blocksize = 64 elif hashlen == 48: alg = cryptodev.CRYPTO_SHA2_384_HMAC blocksize = 128 elif hashlen == 64: alg = cryptodev.CRYPTO_SHA2_512_HMAC blocksize = 128 else: # Skip unsupported hashes # Slurp remaining input in section for data in lines: continue continue for data in lines: key = binascii.unhexlify(data['Key']) msg = binascii.unhexlify(data['Msg']) mac = binascii.unhexlify(data['Mac']) tlen = int(data['Tlen']) if len(key) > blocksize: continue try: c = Crypto(mac=alg, mackey=key, crid=crid, maclen=hashlen) except EnvironmentError as e: # Can't test hashes the driver does not support. if e.errno != errno.EOPNOTSUPP: raise continue _, r = c.encrypt(msg, iv="") self.assertEqual(r[:tlen], mac, "Actual: " + \ repr(r.encode("hex")) + " Expected: " + repr(data)) return GendCryptoTestCase cryptosoft = GenTestCase('cryptosoft0') aesni = GenTestCase('aesni0') ccr = GenTestCase('ccr0') ccp = GenTestCase('ccp0') if __name__ == '__main__': unittest.main() Index: user/ngie/bug-237403/tools/boot/ci-qemu-test.sh =================================================================== --- user/ngie/bug-237403/tools/boot/ci-qemu-test.sh (revision 346925) +++ user/ngie/bug-237403/tools/boot/ci-qemu-test.sh (revision 346926) @@ -1,109 +1,110 @@ #!/bin/sh # Install loader, kernel, and enough of userland to boot in QEMU and echo # "Hello world." from init, as a very quick smoke test for CI. Uses QEMU's # virtual FAT filesystem to avoid the need to create a disk image. While # designed for CI automated testing, this script can also be run by hand as # a quick smoke-test. The rootgen.sh and related scripts generate much more # extensive tests for many combinations of boot env (ufs, zfs, geli, etc). # # $FreeBSD$ set -e die() { echo "$*" 1>&2 exit 1 } tempdir_cleanup() { trap - EXIT SIGINT SIGHUP SIGTERM SIGQUIT rm -rf ${ROOTDIR} } tempdir_setup() { # Create minimal directory structure and populate it. # Caller must cd ${SRCTOP} before calling this function. for dir in dev bin efi/boot etc lib libexec sbin usr/lib usr/libexec; do mkdir -p ${ROOTDIR}/${dir} done # Install kernel, loader and minimal userland. make -DNO_ROOT DESTDIR=${ROOTDIR} \ MODULES_OVERRIDE= \ WITHOUT_DEBUG_FILES=yes \ WITHOUT_KERNEL_SYMBOLS=yes \ installkernel for dir in stand \ lib/libc lib/libedit lib/ncurses \ libexec/rtld-elf \ bin/sh sbin/init sbin/shutdown; do make -DNO_ROOT DESTDIR=${ROOTDIR} INSTALL="install -U" \ WITHOUT_DEBUG_FILES= \ WITHOUT_MAN= \ WITHOUT_PROFILE= \ WITHOUT_TESTS= \ WITHOUT_TOOLCHAIN= \ -C ${dir} install done # Put loader in standard EFI location. 
mv ${ROOTDIR}/boot/loader.efi ${ROOTDIR}/efi/boot/BOOTx64.EFI # Configuration files. cat > ${ROOTDIR}/boot/loader.conf < ${ROOTDIR}/etc/rc <&1 | tee ${BOOTLOG} # Check whether we succesfully booted... if grep -q 'Hello world.' ${BOOTLOG}; then echo "OK" else die "Did not boot successfully, see ${BOOTLOG}" fi Index: user/ngie/bug-237403/tools/boot/install-boot.sh =================================================================== --- user/ngie/bug-237403/tools/boot/install-boot.sh (revision 346925) +++ user/ngie/bug-237403/tools/boot/install-boot.sh (revision 346926) @@ -1,429 +1,442 @@ #!/bin/sh # $FreeBSD$ # # Installs/updates the necessary boot blocks for the desired boot environment # # Lightly tested.. Intended to be installed, but until it matures, it will just # be a boot tool for regression testing. # insert code here to guess what you have -- yikes! # Minimum size of FAT filesystems, in KB. fat32min=33292 fat16min=2100 die() { echo $* exit 1 } doit() { echo $* eval $* } find-part() { dev=$1 part=$2 gpart show $dev | tail +2 | awk '$4 == "'$part'" { print $3; }' } get_uefi_bootname() { case ${TARGET:-$(uname -m)} in amd64) echo bootx64 ;; arm64) echo bootaa64 ;; i386) echo bootia32 ;; arm) echo bootarm ;; *) die "machine type $(uname -m) doesn't support UEFI" ;; esac } make_esp_file() { local file sizekb loader device mntpt fatbits efibootname file=$1 sizekb=$2 loader=$3 if [ "$sizekb" -ge "$fat32min" ]; then fatbits=32 elif [ "$sizekb" -ge "$fat16min" ]; then fatbits=16 else fatbits=12 fi dd if=/dev/zero of="${file}" bs=1k count="${sizekb}" device=$(mdconfig -a -t vnode -f "${file}") newfs_msdos -F "${fatbits}" -c 1 -L EFISYS "/dev/${device}" > /dev/null 2>&1 mntpt=$(mktemp -d /tmp/stand-test.XXXXXX) mount -t msdosfs "/dev/${device}" "${mntpt}" mkdir -p "${mntpt}/EFI/BOOT" efibootname=$(get_uefi_bootname) cp "${loader}" "${mntpt}/EFI/BOOT/${efibootname}.efi" umount "${mntpt}" rmdir "${mntpt}" mdconfig -d -u "${device}" } make_esp_device() { local dev file mntpt fstype efibootname kbfree loadersize efibootfile local isboot1 existingbootentryloaderfile bootorder bootentry # ESP device node dev=$1 file=$2 mntpt=$(mktemp -d /tmp/stand-test.XXXXXX) # See if we're using an existing (formatted) ESP fstype=$(fstyp "${dev}") if [ "${fstype}" != "msdosfs" ]; then newfs_msdos -F 32 -c 1 -L EFISYS "${dev}" > /dev/null 2>&1 fi mount -t msdosfs "${dev}" "${mntpt}" if [ $? -ne 0 ]; then die "Failed to mount ${dev} as an msdosfs filesystem" fi echo "Mounted ESP ${dev} on ${mntpt}" efibootname=$(get_uefi_bootname) kbfree=$(df -k "${mntpt}" | tail -1 | cut -w -f 4) loadersize=$(stat -f %z "${file}") loadersize=$((loadersize / 1024)) # Check if /EFI/BOOT/BOOTxx.EFI is the FreeBSD boot1.efi # If it is, remove it to avoid leaving stale files around efibootfile="${mntpt}/EFI/BOOT/${efibootname}.efi" if [ -f "${efibootfile}" ]; then isboot1=$(strings "${efibootfile}" | grep "FreeBSD EFI boot block") if [ -n "${isboot1}" ] && [ "$kbfree" -lt "${loadersize}" ]; then echo "Only ${kbfree}KB space remaining: removing old FreeBSD boot1.efi file /EFI/BOOT/${efibootname}.efi" rm "${efibootfile}" rmdir "${mntpt}/EFI/BOOT" else echo "${kbfree}KB space remaining on ESP: renaming old boot1.efi file /EFI/BOOT/${efibootname}.efi /EFI/BOOT/${efibootname}-old.efi" mv "${efibootfile}" "${mntpt}/EFI/BOOT/${efibootname}-old.efi" fi fi if [ ! 
-f "${mntpt}/EFI/freebsd/loader.efi" ] && [ "$kbfree" -lt "$loadersize" ]; then umount "${mntpt}" rmdir "${mntpt}" echo "Failed to update the EFI System Partition ${dev}" echo "Insufficient space remaining for ${file}" echo "Run e.g \"mount -t msdosfs ${dev} /mnt\" to inspect it for files that can be removed." die fi mkdir -p "${mntpt}/EFI/freebsd" # Keep a copy of the existing loader.efi in case there's a problem with the new one if [ -f "${mntpt}/EFI/freebsd/loader.efi" ] && [ "$kbfree" -gt "$((loadersize * 2))" ]; then cp "${mntpt}/EFI/freebsd/loader.efi" "${mntpt}/EFI/freebsd/loader-old.efi" fi echo "Copying loader to /EFI/freebsd on ESP" cp "${file}" "${mntpt}/EFI/freebsd/loader.efi" - existingbootentryloaderfile=$(efibootmgr -v | grep "${mntpt}//EFI/freebsd/loader.efi") + if [ -n "${updatesystem}" ]; then + existingbootentryloaderfile=$(efibootmgr -v | grep "${mntpt}//EFI/freebsd/loader.efi") - if [ -z "$existingbootentryloaderfile" ]; then - # Try again without the double forward-slash in the path - existingbootentryloaderfile=$(efibootmgr -v | grep "${mntpt}/EFI/freebsd/loader.efi") - fi - - if [ -z "$existingbootentryloaderfile" ]; then - echo "Creating UEFI boot entry for FreeBSD" - efibootmgr --create --label FreeBSD --loader "${mntpt}/EFI/freebsd/loader.efi" > /dev/null - if [ $? -ne 0 ]; then - die "Failed to create new boot entry" + if [ -z "$existingbootentryloaderfile" ]; then + # Try again without the double forward-slash in the path + existingbootentryloaderfile=$(efibootmgr -v | grep "${mntpt}/EFI/freebsd/loader.efi") fi - # When creating new entries, efibootmgr doesn't mark them active, so we need to - # do so. It doesn't make it easy to find which entry it just added, so rely on - # the fact that it places the new entry first in BootOrder. - bootorder=$(efivar --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder --print --no-name --hex | head -1) - bootentry=$(echo "${bootorder}" | cut -w -f 3)$(echo "${bootorder}" | cut -w -f 2) - echo "Marking UEFI boot entry ${bootentry} active" - efibootmgr --activate "${bootentry}" > /dev/null + if [ -z "$existingbootentryloaderfile" ]; then + echo "Creating UEFI boot entry for FreeBSD" + efibootmgr --create --label FreeBSD --loader "${mntpt}/EFI/freebsd/loader.efi" > /dev/null + if [ $? -ne 0 ]; then + die "Failed to create new boot entry" + fi + + # When creating new entries, efibootmgr doesn't mark them active, so we need to + # do so. It doesn't make it easy to find which entry it just added, so rely on + # the fact that it places the new entry first in BootOrder. + bootorder=$(efivar --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder --print --no-name --hex | head -1) + bootentry=$(echo "${bootorder}" | cut -w -f 3)$(echo "${bootorder}" | cut -w -f 2) + echo "Marking UEFI boot entry ${bootentry} active" + efibootmgr --activate "${bootentry}" > /dev/null + else + echo "Existing UEFI FreeBSD boot entry found: not creating a new one" + fi else - echo "Existing UEFI FreeBSD boot entry found: not creating a new one" + # Configure for booting from removable media + if [ ! 
-d "${mntpt}/EFI/BOOT" ]; then + mkdir -p "${mntpt}/EFI/BOOT" + fi + cp "${file}" "${mntpt}/EFI/BOOT/${efibootname}.efi" fi umount "${mntpt}" rmdir "${mntpt}" echo "Finished updating ESP" } make_esp() { local file loaderfile file=$1 loaderfile=$2 if [ -f "$file" ]; then make_esp_file ${file} ${fat32min} ${loaderfile} else make_esp_device ${file} ${loaderfile} fi } make_esp_mbr() { dev=$1 dst=$2 s=$(find-part $dev "!239") if [ -z "$s" ] ; then s=$(find-part $dev "efi") if [ -z "$s" ] ; then die "No ESP slice found" fi fi make_esp /dev/${dev}s${s} ${dst}/boot/loader.efi } make_esp_gpt() { dev=$1 dst=$2 idx=$(find-part $dev "efi") if [ -z "$idx" ] ; then die "No ESP partition found" fi make_esp /dev/${dev}p${idx} ${dst}/boot/loader.efi } boot_nogeli_gpt_ufs_legacy() { dev=$1 dst=$2 idx=$(find-part $dev "freebsd-boot") if [ -z "$idx" ] ; then die "No freebsd-boot partition found" fi doit gpart bootcode -b ${gpt0} -p ${gpt2} -i $idx $dev } boot_nogeli_gpt_ufs_uefi() { make_esp_gpt $1 $2 } boot_nogeli_gpt_ufs_both() { boot_nogeli_gpt_ufs_legacy $1 $2 $3 boot_nogeli_gpt_ufs_uefi $1 $2 $3 } boot_nogeli_gpt_zfs_legacy() { dev=$1 dst=$2 idx=$(find-part $dev "freebsd-boot") if [ -z "$idx" ] ; then die "No freebsd-boot partition found" fi doit gpart bootcode -b ${gpt0} -p ${gptzfs2} -i $idx $dev } boot_nogeli_gpt_zfs_uefi() { make_esp_gpt $1 $2 } boot_nogeli_gpt_zfs_both() { boot_nogeli_gpt_zfs_legacy $1 $2 $3 boot_nogeli_gpt_zfs_uefi $1 $2 $3 } boot_nogeli_mbr_ufs_legacy() { dev=$1 dst=$2 doit gpart bootcode -b ${mbr0} ${dev} s=$(find-part $dev "freebsd") if [ -z "$s" ] ; then die "No freebsd slice found" fi doit gpart bootcode -p ${mbr2} ${dev}s${s} } boot_nogeli_mbr_ufs_uefi() { make_esp_mbr $1 $2 } boot_nogeli_mbr_ufs_both() { boot_nogeli_mbr_ufs_legacy $1 $2 $3 boot_nogeli_mbr_ufs_uefi $1 $2 $3 } boot_nogeli_mbr_zfs_legacy() { dev=$1 dst=$2 # search to find the BSD slice s=$(find-part $dev "freebsd") if [ -z "$s" ] ; then die "No BSD slice found" fi idx=$(find-part ${dev}s${s} "freebsd-zfs") if [ -z "$idx" ] ; then die "No freebsd-zfs slice found" fi # search to find the freebsd-zfs partition within the slice # Or just assume it is 'a' because it has to be since it fails otherwise doit gpart bootcode -b ${dst}/boot/mbr ${dev} dd if=${dst}/boot/zfsboot of=/tmp/zfsboot1 count=1 doit gpart bootcode -b /tmp/zfsboot1 ${dev}s${s} # Put boot1 into the start of part sysctl kern.geom.debugflags=0x10 # Put boot2 into ZFS boot slot doit dd if=${dst}/boot/zfsboot of=/dev/${dev}s${s}a skip=1 seek=1024 sysctl kern.geom.debugflags=0x0 } boot_nogeli_mbr_zfs_uefi() { make_esp_mbr $1 $2 } boot_nogeli_mbr_zfs_both() { boot_nogeli_mbr_zfs_legacy $1 $2 $3 boot_nogeli_mbr_zfs_uefi $1 $2 $3 } boot_geli_gpt_ufs_legacy() { boot_nogeli_gpt_ufs_legacy $1 $2 $3 } boot_geli_gpt_ufs_uefi() { boot_nogeli_gpt_ufs_uefi $1 $2 $3 } boot_geli_gpt_ufs_both() { boot_nogeli_gpt_ufs_both $1 $2 $3 } boot_geli_gpt_zfs_legacy() { boot_nogeli_gpt_zfs_legacy $1 $2 $3 } boot_geli_gpt_zfs_uefi() { boot_nogeli_gpt_zfs_uefi $1 $2 $3 } boot_geli_gpt_zfs_both() { boot_nogeli_gpt_zfs_both $1 $2 $3 } # GELI+MBR is not a valid configuration boot_geli_mbr_ufs_legacy() { exit 1 } boot_geli_mbr_ufs_uefi() { exit 1 } boot_geli_mbr_ufs_both() { exit 1 } boot_geli_mbr_zfs_legacy() { exit 1 } boot_geli_mbr_zfs_uefi() { exit 1 } boot_geli_mbr_zfs_both() { exit 1 } boot_nogeli_vtoc8_ufs_ofw() { dev=$1 dst=$2 # For non-native builds, ensure that geom_part(4) supports VTOC8. 
kldload geom_part_vtoc8.ko doit gpart bootcode -p ${vtoc8} ${dev} } usage() { printf 'Usage: %s -b bios [-d destdir] -f fs [-g geli] [-h] [-o optargs] -s scheme \n' "$0" printf 'Options:\n' printf ' bootdev device to install the boot code on\n' printf ' -b bios bios type: legacy, uefi or both\n' printf ' -d destdir destination filesystem root\n' printf ' -f fs filesystem type: ufs or zfs\n' printf ' -g geli yes or no\n' printf ' -h this help/usage text\n' + printf ' -u Run commands such as efibootmgr to update the\n' + printf ' currently running system\n' printf ' -o optargs optional arguments\n' printf ' -s scheme mbr or gpt\n' exit 0 } srcroot=/ # Note: we really don't support geli boot in this script yet. geli=nogeli -while getopts "b:d:f:g:ho:s:" opt; do +while getopts "b:d:f:g:ho:s:u" opt; do case "$opt" in b) bios=${OPTARG} ;; d) srcroot=${OPTARG} ;; f) fs=${OPTARG} ;; g) case ${OPTARG} in [Yy][Ee][Ss]|geli) geli=geli ;; *) geli=nogeli ;; esac + ;; + u) + updatesystem=1 ;; o) opts=${OPTARG} ;; s) scheme=${OPTARG} ;; ?|h) usage ;; esac done if [ -n "${scheme}" ] && [ -n "${fs}" ] && [ -n "${bios}" ]; then shift $((OPTIND-1)) dev=$1 fi # For gpt, we need to install pmbr as the primary boot loader # it knows about gpt0=${srcroot}/boot/pmbr gpt2=${srcroot}/boot/gptboot gptzfs2=${srcroot}/boot/gptzfsboot # For MBR, we have lots of choices, but select mbr, boot0 has issues with UEFI mbr0=${srcroot}/boot/mbr mbr2=${srcroot}/boot/boot # VTOC8 vtoc8=${srcroot}/boot/boot1 # sanity check here # Check if we've been given arguments. If not, this script is probably being # sourced, so we shouldn't run anything. if [ -n "${dev}" ]; then eval boot_${geli}_${scheme}_${fs}_${bios} $dev $srcroot $opts || echo "Unsupported boot env: ${geli}-${scheme}-${fs}-${bios}" fi Index: user/ngie/bug-237403/tools/boot/rootgen.sh =================================================================== --- user/ngie/bug-237403/tools/boot/rootgen.sh (revision 346925) +++ user/ngie/bug-237403/tools/boot/rootgen.sh (revision 346926) @@ -1,867 +1,892 @@ #!/bin/sh # $FreeBSD$ passphrase=passphrase iterations=50000 # The smallest FAT32 filesystem is 33292 KB espsize=33292 # # Builds all the bat-shit crazy combinations we support booting from, # at least for amd64. It assume you have a ~sane kernel in /boot/kernel # and copies that into the ~150MB root images we create (we create the du # size of the kernel + 20MB # # Sad panda sez: this runs as root, but could be userland if someone # creates userland geli and zfs tools. # # This assumes an external program install-boot.sh which will install # the appropriate boot files in the appropriate locations. # # These images assume ada0 will be the root image. We should likely # use labels, but we don't. # # Assumes you've already rebuilt... maybe bad? Also maybe bad: the env # vars should likely be conditionally set to allow better automation. # +. $(dirname $0)/install-boot.sh + cpsys() { src=$1 dst=$2 # Copy kernel + boot loader (cd $src ; tar cf - .) 
| (cd $dst; tar xf -) } mk_nogeli_gpt_ufs_legacy() { src=$1 img=$2 cat > ${src}/etc/fstab < ${src}/etc/fstab < ${src}/etc/fstab <> ${mntpt}/boot/loader.conf <> ${mntpt}/boot/loader.conf <> ${mntpt}/boot/loader.conf < ${src}/etc/fstab < ${src}/etc/fstab < ${src}/etc/fstab <> ${mntpt}/boot/loader.conf <> ${mntpt}/boot/loader.conf <> ${mntpt}/boot/loader.conf < ${mntpt}/boot/loader.conf < ${mntpt}/etc/fstab < ${mntpt}/boot/loader.conf < ${mntpt}/etc/fstab < ${mntpt}/boot/loader.conf < ${mntpt}/etc/fstab <> ${mntpt}/boot/loader.conf <> ${mntpt}/boot/loader.conf < ${mntpt}/boot/loader.conf < ${src}/etc/fstab < $sh chmod 755 $sh # https://wiki.freebsd.org/arm64/QEMU also has # -device virtio-net-device,netdev=net0 # -netdev user,id=net0 } # Amd64 qemu qemu_amd64_legacy() { img=$1 sh=$2 echo "qemu-system-x86_64 -m 256m --drive file=${img},format=raw ${qser}" > $sh chmod 755 $sh } qemu_amd64_uefi() { img=$1 sh=$2 echo "qemu-system-x86_64 -m 256m -bios ~/bios/OVMF-X64.fd --drive file=${img},format=raw ${qser}" > $sh chmod 755 $sh } qemu_amd64_both() { img=$1 sh=$2 echo "qemu-system-x86_64 -m 256m --drive file=${img},format=raw ${qser}" > $sh echo "qemu-system-x86_64 -m 256m -bios ~/bios/OVMF-X64.fd --drive file=${img},format=raw ${qser}" >> $sh chmod 755 $sh } # arm # nothing listed? # i386 qemu_i386_legacy() { img=$1 sh=$2 echo "qemu-system-i386 --drive file=${img},format=raw ${qser}" > $sh chmod 755 $sh } # Not yet supported qemu_i386_uefi() { img=$1 sh=$2 echo "qemu-system-i386 -bios ~/bios/OVMF-X32.fd --drive file=${img},format=raw ${qser}" > $sh chmod 755 $sh } # Needs UEFI to be supported qemu_i386_both() { img=$1 sh=$2 echo "qemu-system-i386 --drive file=${img},format=raw ${qser}" > $sh echo "qemu-system-i386 -bios ~/bios/OVMF-X32.fd --drive file=${img},format=raw ${qser}" >> $sh chmod 755 $sh } make_one_image() { local arch=${1?} local geli=${2?} local scheme=${3?} local fs=${4?} local bios=${5?} # Create sparse file and mount newly created filesystem(s) on it img=${IMGDIR}/${arch}-${geli}-${scheme}-${fs}-${bios}.img sh=${IMGDIR}/${arch}-${geli}-${scheme}-${fs}-${bios}.sh echo "vvvvvvvvvvvvvv Creating $img vvvvvvvvvvvvvvv" rm -f ${img}* eval mk_${geli}_${scheme}_${fs}_${bios} ${DESTDIR} ${img} ${MNTPT} ${geli} ${scheme} ${fs} ${bios} eval qemu_${arch}_${bios} ${img} ${sh} [ -n "${SUDO_USER}" ] && chown ${SUDO_USER} ${img}* echo "^^^^^^^^^^^^^^ Created $img ^^^^^^^^^^^^^^^" } # mips # qemu-system-mips -kernel /path/to/rootfs/boot/kernel/kernel -nographic -hda /path/to/disk.img -m 2048 # Powerpc -- doesn't work but maybe it would enough for testing -- needs details # powerpc64 # qemu-system-ppc64 -drive file=/path/to/disk.img,format=raw # sparc64 # qemu-system-sparc64 -drive file=/path/to/disk.img,format=raw # Misc variables SRCTOP=$(make -v SRCTOP) cd ${SRCTOP}/stand OBJDIR=$(make -v .OBJDIR) IMGDIR=${OBJDIR}/boot-images mkdir -p ${IMGDIR} MNTPT=$(mktemp -d /tmp/stand-test.XXXXXX) # Setup the installed tree... 
DESTDIR=${OBJDIR}/boot-tree rm -rf ${DESTDIR} mkdir -p ${DESTDIR}/boot/defaults mkdir -p ${DESTDIR}/boot/kernel cp /boot/kernel/kernel ${DESTDIR}/boot/kernel echo -h -D -S115200 > ${DESTDIR}/boot.config cat > ${DESTDIR}/boot/loader.conf < ${DESTDIR}/etc/rc < #include #ifdef _UWIN # include # include # include # include #endif +#include #include #include #ifndef MAP_FILE # define MAP_FILE 0 #endif #include #include #include #include #include #include #include #include #define NUMPRINTCOLUMNS 32 /* # columns of data to print on each line */ /* * A log entry is an operation and a bunch of arguments. */ struct log_entry { int operation; int args[3]; }; #define LOGSIZE 1000 struct log_entry oplog[LOGSIZE]; /* the log */ int logptr = 0; /* current position in log */ int logcount = 0; /* total ops */ /* * Define operations */ #define OP_READ 1 #define OP_WRITE 2 #define OP_TRUNCATE 3 #define OP_CLOSEOPEN 4 #define OP_MAPREAD 5 #define OP_MAPWRITE 6 #define OP_SKIPPED 7 #define OP_INVALIDATE 8 int page_size; int page_mask; char *original_buf; /* a pointer to the original data */ char *good_buf; /* a pointer to the correct data */ char *temp_buf; /* a pointer to the current data */ char *fname; /* name of our test file */ int fd; /* fd for our test file */ off_t file_size = 0; off_t biggest = 0; char state[256]; unsigned long testcalls = 0; /* calls to function "test" */ unsigned long simulatedopcount = 0; /* -b flag */ int closeprob = 0; /* -c flag */ int invlprob = 0; /* -i flag */ int debug = 0; /* -d flag */ unsigned long debugstart = 0; /* -D flag */ unsigned long maxfilelen = 256 * 1024; /* -l flag */ int sizechecks = 1; /* -n flag disables them */ int maxoplen = 64 * 1024; /* -o flag */ int quiet = 0; /* -q flag */ unsigned long progressinterval = 0; /* -p flag */ int readbdy = 1; /* -r flag */ int style = 0; /* -s flag */ int truncbdy = 1; /* -t flag */ int writebdy = 1; /* -w flag */ long monitorstart = -1; /* -m flag */ long monitorend = -1; /* -m flag */ int lite = 0; /* -L flag */ long numops = -1; /* -N flag */ int randomoplen = 1; /* -O flag disables it */ int seed = 1; /* -S flag */ int mapped_writes = 1; /* -W flag disables */ int mapped_reads = 1; /* -R flag disables it */ int mapped_msync = 1; /* -U flag disables */ int fsxgoodfd = 0; FILE * fsxlogf = NULL; int badoff = -1; int closeopen = 0; int invl = 0; void vwarnc(code, fmt, ap) int code; const char *fmt; va_list ap; { fprintf(stderr, "fsx: "); if (fmt != NULL) { vfprintf(stderr, fmt, ap); fprintf(stderr, ": "); } fprintf(stderr, "%s\n", strerror(code)); } void warn(const char * fmt, ...) { va_list ap; va_start(ap, fmt); vwarnc(errno, fmt, ap); va_end(ap); } void prt(char *fmt, ...) { va_list args; va_start(args, fmt); vfprintf(stdout, fmt, args); va_end(args); if (fsxlogf) { va_start(args, fmt); vfprintf(fsxlogf, fmt, args); va_end(args); } } void prterr(char *prefix) { prt("%s%s%s\n", prefix, prefix ? 
": " : "", strerror(errno)); } void do_log4(int operation, int arg0, int arg1, int arg2) { struct log_entry *le; le = &oplog[logptr]; le->operation = operation; le->args[0] = arg0; le->args[1] = arg1; le->args[2] = arg2; logptr++; logcount++; if (logptr >= LOGSIZE) logptr = 0; } void log4(int operation, int arg0, int arg1, int arg2) { do_log4(operation, arg0, arg1, arg2); if (closeopen) do_log4(OP_CLOSEOPEN, 0, 0, 0); if (invl) do_log4(OP_INVALIDATE, 0, 0, 0); } void logdump(void) { struct log_entry *lp; int i, count, down, opnum; prt("LOG DUMP (%d total operations):\n", logcount); if (logcount < LOGSIZE) { i = 0; count = logcount; } else { i = logptr; count = LOGSIZE; } opnum = i + 1 + (logcount/LOGSIZE)*LOGSIZE; for ( ; count > 0; count--) { lp = &oplog[i]; if (lp->operation == OP_CLOSEOPEN || lp->operation == OP_INVALIDATE) { switch (lp->operation) { case OP_CLOSEOPEN: prt("\t\tCLOSE/OPEN\n"); break; case OP_INVALIDATE: prt("\t\tMS_INVALIDATE\n"); break; } i++; if (i == LOGSIZE) i = 0; continue; } prt("%d(%d mod 256): ", opnum, opnum%256); switch (lp->operation) { case OP_MAPREAD: prt("MAPREAD\t0x%x thru 0x%x\t(0x%x bytes)", lp->args[0], lp->args[0] + lp->args[1] - 1, lp->args[1]); if (badoff >= lp->args[0] && badoff < lp->args[0] + lp->args[1]) prt("\t***RRRR***"); break; case OP_MAPWRITE: prt("MAPWRITE 0x%x thru 0x%x\t(0x%x bytes)", lp->args[0], lp->args[0] + lp->args[1] - 1, lp->args[1]); if (badoff >= lp->args[0] && badoff < lp->args[0] + lp->args[1]) prt("\t******WWWW"); break; case OP_READ: prt("READ\t0x%x thru 0x%x\t(0x%x bytes)", lp->args[0], lp->args[0] + lp->args[1] - 1, lp->args[1]); if (badoff >= lp->args[0] && badoff < lp->args[0] + lp->args[1]) prt("\t***RRRR***"); break; case OP_WRITE: - prt("WRITE\t0x%x thru 0x%x\t(0x%x bytes)", - lp->args[0], lp->args[0] + lp->args[1] - 1, - lp->args[1]); - if (lp->args[0] > lp->args[2]) - prt(" HOLE"); - else if (lp->args[0] + lp->args[1] > lp->args[2]) - prt(" EXTEND"); - if ((badoff >= lp->args[0] || badoff >=lp->args[2]) && - badoff < lp->args[0] + lp->args[1]) - prt("\t***WWWW"); + { + int offset = lp->args[0]; + int len = lp->args[1]; + int oldlen = lp->args[2]; + + prt("WRITE\t0x%x thru 0x%x\t(0x%x bytes)", + offset, offset + len - 1, + len); + if (offset > oldlen) + prt(" HOLE"); + else if (offset + len > oldlen) + prt(" EXTEND"); + if ((badoff >= offset || badoff >=oldlen) && + badoff < offset + len) + prt("\t***WWWW"); + } break; case OP_TRUNCATE: down = lp->args[0] < lp->args[1]; prt("TRUNCATE %s\tfrom 0x%x to 0x%x", down ? "DOWN" : "UP", lp->args[1], lp->args[0]); if (badoff >= lp->args[!down] && badoff < lp->args[!!down]) prt("\t******WWWW"); break; case OP_SKIPPED: prt("SKIPPED (no operation)"); break; default: prt("BOGUS LOG ENTRY (operation code = %d)!", lp->operation); } prt("\n"); opnum++; i++; if (i == LOGSIZE) i = 0; } } void save_buffer(char *buffer, off_t bufferlength, int fd) { off_t ret; ssize_t byteswritten; if (fd <= 0 || bufferlength == 0) return; if (bufferlength > SSIZE_MAX) { prt("fsx flaw: overflow in save_buffer\n"); exit(67); } if (lite) { off_t size_by_seek = lseek(fd, (off_t)0, SEEK_END); if (size_by_seek == (off_t)-1) prterr("save_buffer: lseek eof"); else if (bufferlength > size_by_seek) { warn("save_buffer: .fsxgood file too short... 
will save 0x%llx bytes instead of 0x%llx\n", (unsigned long long)size_by_seek, (unsigned long long)bufferlength); bufferlength = size_by_seek; } } ret = lseek(fd, (off_t)0, SEEK_SET); if (ret == (off_t)-1) prterr("save_buffer: lseek 0"); byteswritten = write(fd, buffer, (size_t)bufferlength); if (byteswritten != bufferlength) { if (byteswritten == -1) prterr("save_buffer write"); else warn("save_buffer: short write, 0x%x bytes instead of 0x%llx\n", (unsigned)byteswritten, (unsigned long long)bufferlength); } } void report_failure(int status) { logdump(); if (fsxgoodfd) { if (good_buf) { save_buffer(good_buf, file_size, fsxgoodfd); prt("Correct content saved for comparison\n"); prt("(maybe hexdump \"%s\" vs \"%s.fsxgood\")\n", fname, fname); } close(fsxgoodfd); } exit(status); } #define short_at(cp) ((unsigned short)((*((unsigned char *)(cp)) << 8) | \ *(((unsigned char *)(cp)) + 1))) void check_buffers(unsigned offset, unsigned size) { unsigned char c, t; unsigned i = 0; unsigned n = 0; unsigned op = 0; unsigned bad = 0; if (memcmp(good_buf + offset, temp_buf, size) != 0) { prt("READ BAD DATA: offset = 0x%x, size = 0x%x\n", offset, size); prt("OFFSET\tGOOD\tBAD\tRANGE\n"); while (size > 0) { c = good_buf[offset]; t = temp_buf[i]; if (c != t) { if (n == 0) { bad = short_at(&temp_buf[i]); prt("0x%5x\t0x%04x\t0x%04x", offset, short_at(&good_buf[offset]), bad); op = temp_buf[offset & 1 ? i+1 : i]; } n++; badoff = offset; } offset++; i++; size--; } if (n) { prt("\t0x%5x\n", n); if (bad) prt("operation# (mod 256) for the bad data may be %u\n", ((unsigned)op & 0xff)); else prt("operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops\n"); } else prt("????????????????\n"); report_failure(110); } } void check_size(void) { struct stat statbuf; off_t size_by_seek; if (fstat(fd, &statbuf)) { prterr("check_size: fstat"); statbuf.st_size = -1; } size_by_seek = lseek(fd, (off_t)0, SEEK_END); if (file_size != statbuf.st_size || file_size != size_by_seek) { prt("Size error: expected 0x%llx stat 0x%llx seek 0x%llx\n", (unsigned long long)file_size, (unsigned long long)statbuf.st_size, (unsigned long long)size_by_seek); report_failure(120); } } void check_trunc_hack(void) { struct stat statbuf; ftruncate(fd, (off_t)0); ftruncate(fd, (off_t)100000); fstat(fd, &statbuf); if (statbuf.st_size != (off_t)100000) { prt("no extend on truncate! 
not posix!\n"); exit(130); } ftruncate(fd, (off_t)0); } void doread(unsigned offset, unsigned size) { off_t ret; unsigned iret; offset -= offset % readbdy; if (size == 0) { if (!quiet && testcalls > simulatedopcount) prt("skipping zero size read\n"); log4(OP_SKIPPED, OP_READ, offset, size); return; } if (size + offset > file_size) { if (!quiet && testcalls > simulatedopcount) prt("skipping seek/read past end of file\n"); log4(OP_SKIPPED, OP_READ, offset, size); return; } log4(OP_READ, offset, size, 0); if (testcalls <= simulatedopcount) return; if (!quiet && ((progressinterval && testcalls % progressinterval == 0) || (debug && (monitorstart == -1 || (offset + size > monitorstart && (monitorend == -1 || offset <= monitorend)))))) prt("%lu read\t0x%x thru\t0x%x\t(0x%x bytes)\n", testcalls, offset, offset + size - 1, size); ret = lseek(fd, (off_t)offset, SEEK_SET); if (ret == (off_t)-1) { prterr("doread: lseek"); report_failure(140); } iret = read(fd, temp_buf, size); if (iret != size) { if (iret == -1) prterr("doread: read"); else prt("short read: 0x%x bytes instead of 0x%x\n", iret, size); report_failure(141); } check_buffers(offset, size); } void check_eofpage(char *s, unsigned offset, char *p, int size) { uintptr_t last_page, should_be_zero; if (offset + size <= (file_size & ~page_mask)) return; /* * we landed in the last page of the file * test to make sure the VM system provided 0's * beyond the true end of the file mapping * (as required by mmap def in 1996 posix 1003.1) */ last_page = ((uintptr_t)p + (offset & page_mask) + size) & ~page_mask; for (should_be_zero = last_page + (file_size & page_mask); should_be_zero < last_page + page_size; should_be_zero++) if (*(char *)should_be_zero) { prt("Mapped %s: non-zero data past EOF (0x%llx) page offset 0x%x is 0x%04x\n", s, file_size - 1, should_be_zero & page_mask, short_at(should_be_zero)); report_failure(205); } } void domapread(unsigned offset, unsigned size) { unsigned pg_offset; unsigned map_size; char *p; offset -= offset % readbdy; if (size == 0) { if (!quiet && testcalls > simulatedopcount) prt("skipping zero size read\n"); log4(OP_SKIPPED, OP_MAPREAD, offset, size); return; } if (size + offset > file_size) { if (!quiet && testcalls > simulatedopcount) prt("skipping seek/read past end of file\n"); log4(OP_SKIPPED, OP_MAPREAD, offset, size); return; } log4(OP_MAPREAD, offset, size, 0); if (testcalls <= simulatedopcount) return; if (!quiet && ((progressinterval && testcalls % progressinterval == 0) || (debug && (monitorstart == -1 || (offset + size > monitorstart && (monitorend == -1 || offset <= monitorend)))))) prt("%lu mapread\t0x%x thru\t0x%x\t(0x%x bytes)\n", testcalls, offset, offset + size - 1, size); pg_offset = offset & page_mask; map_size = pg_offset + size; if ((p = (char *)mmap(0, map_size, PROT_READ, MAP_FILE | MAP_SHARED, fd, (off_t)(offset - pg_offset))) == (char *)-1) { prterr("domapread: mmap"); report_failure(190); } memcpy(temp_buf, p + pg_offset, size); check_eofpage("Read", offset, p, size); if (munmap(p, map_size) != 0) { prterr("domapread: munmap"); report_failure(191); } check_buffers(offset, size); } void gendata(char *original_buf, char *good_buf, unsigned offset, unsigned size) { while (size--) { good_buf[offset] = testcalls % 256; if (offset % 2) good_buf[offset] += original_buf[offset]; offset++; } } void dowrite(unsigned offset, unsigned size) { off_t ret; unsigned iret; offset -= offset % writebdy; if (size == 0) { if (!quiet && testcalls > simulatedopcount) prt("skipping zero size write\n"); 
log4(OP_SKIPPED, OP_WRITE, offset, size); return; } log4(OP_WRITE, offset, size, file_size); gendata(original_buf, good_buf, offset, size); if (file_size < offset + size) { if (file_size < offset) memset(good_buf + file_size, '\0', offset - file_size); file_size = offset + size; if (lite) { warn("Lite file size bug in fsx!"); report_failure(149); } } if (testcalls <= simulatedopcount) return; if (!quiet && ((progressinterval && testcalls % progressinterval == 0) || (debug && (monitorstart == -1 || (offset + size > monitorstart && (monitorend == -1 || offset <= monitorend)))))) prt("%lu write\t0x%x thru\t0x%x\t(0x%x bytes)\n", testcalls, offset, offset + size - 1, size); ret = lseek(fd, (off_t)offset, SEEK_SET); if (ret == (off_t)-1) { prterr("dowrite: lseek"); report_failure(150); } iret = write(fd, good_buf + offset, size); if (iret != size) { if (iret == -1) prterr("dowrite: write"); else prt("short write: 0x%x bytes instead of 0x%x\n", iret, size); report_failure(151); } } void domapwrite(unsigned offset, unsigned size) { unsigned pg_offset; unsigned map_size; off_t cur_filesize; char *p; offset -= offset % writebdy; if (size == 0) { if (!quiet && testcalls > simulatedopcount) prt("skipping zero size write\n"); log4(OP_SKIPPED, OP_MAPWRITE, offset, size); return; } cur_filesize = file_size; log4(OP_MAPWRITE, offset, size, 0); gendata(original_buf, good_buf, offset, size); if (file_size < offset + size) { if (file_size < offset) memset(good_buf + file_size, '\0', offset - file_size); file_size = offset + size; if (lite) { warn("Lite file size bug in fsx!"); report_failure(200); } } if (testcalls <= simulatedopcount) return; if (!quiet && ((progressinterval && testcalls % progressinterval == 0) || (debug && (monitorstart == -1 || (offset + size > monitorstart && (monitorend == -1 || offset <= monitorend)))))) prt("%lu mapwrite\t0x%x thru\t0x%x\t(0x%x bytes)\n", testcalls, offset, offset + size - 1, size); if (file_size > cur_filesize) { if (ftruncate(fd, file_size) == -1) { prterr("domapwrite: ftruncate"); exit(201); } } pg_offset = offset & page_mask; map_size = pg_offset + size; if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, (off_t)(offset - pg_offset))) == MAP_FAILED) { prterr("domapwrite: mmap"); report_failure(202); } memcpy(p + pg_offset, good_buf + offset, size); if (mapped_msync && msync(p, map_size, MS_SYNC) != 0) { prterr("domapwrite: msync"); report_failure(203); } check_eofpage("Write", offset, p, size); if (munmap(p, map_size) != 0) { prterr("domapwrite: munmap"); report_failure(204); } } void dotruncate(unsigned size) { int oldsize = file_size; size -= size % truncbdy; if (size > biggest) { biggest = size; if (!quiet && testcalls > simulatedopcount) prt("truncating to largest ever: 0x%x\n", size); } log4(OP_TRUNCATE, size, (unsigned)file_size, 0); if (size > file_size) memset(good_buf + file_size, '\0', size - file_size); file_size = size; if (testcalls <= simulatedopcount) return; if ((progressinterval && testcalls % progressinterval == 0) || (debug && (monitorstart == -1 || monitorend == -1 || size <= monitorend))) prt("%lu trunc\tfrom 0x%x to 0x%x\n", testcalls, oldsize, size); if (ftruncate(fd, (off_t)size) == -1) { prt("ftruncate1: %x\n", size); prterr("dotruncate: ftruncate"); report_failure(160); } } void writefileimage() { ssize_t iret; if (lseek(fd, (off_t)0, SEEK_SET) == (off_t)-1) { prterr("writefileimage: lseek"); report_failure(171); } iret = write(fd, good_buf, file_size); if ((off_t)iret != file_size) { if (iret == -1) 
prterr("writefileimage: write"); else prt("short write: 0x%x bytes instead of 0x%llx\n", iret, (unsigned long long)file_size); report_failure(172); } if (lite ? 0 : ftruncate(fd, file_size) == -1) { prt("ftruncate2: %llx\n", (unsigned long long)file_size); prterr("writefileimage: ftruncate"); report_failure(173); } } void docloseopen(void) { if (testcalls <= simulatedopcount) return; if (debug) prt("%lu close/open\n", testcalls); if (close(fd)) { prterr("docloseopen: close"); report_failure(180); } fd = open(fname, O_RDWR, 0); if (fd < 0) { prterr("docloseopen: open"); report_failure(181); } } void doinvl(void) { char *p; if (file_size == 0) return; if (testcalls <= simulatedopcount) return; if (debug) prt("%lu msync(MS_INVALIDATE)\n", testcalls); if ((p = (char *)mmap(0, file_size, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 0)) == MAP_FAILED) { prterr("doinvl: mmap"); report_failure(205); } if (msync(p, 0, MS_SYNC | MS_INVALIDATE) != 0) { prterr("doinvl: msync"); report_failure(206); } if (munmap(p, file_size) != 0) { prterr("doinvl: munmap"); report_failure(207); } } void test(void) { unsigned long offset; unsigned long size = maxoplen; unsigned long rv = random(); unsigned long op = rv % (3 + !lite + mapped_writes); /* turn off the map read if necessary */ if (op == 2 && !mapped_reads) op = 0; if (simulatedopcount > 0 && testcalls == simulatedopcount) writefileimage(); testcalls++; if (closeprob) closeopen = (rv >> 3) < (1 << 28) / closeprob; if (invlprob) invl = (rv >> 3) < (1 << 28) / invlprob; if (debugstart > 0 && testcalls >= debugstart) debug = 1; if (!quiet && testcalls < simulatedopcount && testcalls % 100000 == 0) prt("%lu...\n", testcalls); /* * READ: op = 0 * WRITE: op = 1 * MAPREAD: op = 2 * TRUNCATE: op = 3 * MAPWRITE: op = 3 or 4 */ if (lite ? 0 : op == 3 && style == 0) /* vanilla truncate? */ dotruncate(random() % maxfilelen); else { if (randomoplen) size = random() % (maxoplen+1); if (lite ? 0 : op == 3) dotruncate(size); else { offset = random(); if (op == 1 || op == (lite ? 
3 : 4)) { offset %= maxfilelen; if (offset + size > maxfilelen) size = maxfilelen - offset; if (op != 1) domapwrite(offset, size); else dowrite(offset, size); } else { if (file_size) offset %= file_size; else offset = 0; if (offset + size > file_size) size = file_size - offset; if (op != 0) domapread(offset, size); else doread(offset, size); } } } if (sizechecks && testcalls > simulatedopcount) check_size(); if (invl) doinvl(); if (closeopen) docloseopen(); } void cleanup(sig) int sig; { if (sig) prt("signal %d\n", sig); prt("testcalls = %lu\n", testcalls); exit(sig); } void usage(void) { fprintf(stdout, "usage: %s", "fsx [-dnqLOW] [-b opnum] [-c Prob] [-l flen] [-m start:end] [-o oplen] [-p progressinterval] [-r readbdy] [-s style] [-t truncbdy] [-w writebdy] [-D startingop] [-N numops] [-P dirpath] [-S seed] fname\n\ -b opnum: beginning operation number (default 1)\n\ -c P: 1 in P chance of file close+open at each op (default infinity)\n\ -d: debug output for all operations\n\ -i P: 1 in P chance of calling msync(MS_INVALIDATE) (default infinity)\n\ -l flen: the upper bound on file size (default 262144)\n\ -m startop:endop: monitor (print debug output) specified byte range (default 0:infinity)\n\ -n: no verifications of file size\n\ -o oplen: the upper bound on operation size (default 65536)\n\ -p progressinterval: debug output at specified operation interval\n\ -q: quieter operation\n\ -r readbdy: 4096 would make reads page aligned (default 1)\n\ -s style: 1 gives smaller truncates (default 0)\n\ -t truncbdy: 4096 would make truncates page aligned (default 1)\n\ -w writebdy: 4096 would make writes page aligned (default 1)\n\ -D startingop: debug output starting at specified operation\n\ -L: fsxLite - no file creations & no file size changes\n\ -N numops: total # operations to do (default infinity)\n\ -O: use oplen (see -o flag) for every op (default random)\n\ -P dirpath: save .fsxlog and .fsxgood files in dirpath (default ./)\n\ -S seed: for random # generator (default 1) 0 gets timestamp\n\ -W: mapped write operations DISabled\n\ -R: mapped read operations DISabled)\n\ -U: msync after mapped write operations DISabled\n\ fname: this filename is REQUIRED (no default)\n"); exit(90); } int getnum(char *s, char **e) { int ret = -1; *e = (char *) 0; ret = strtol(s, e, 0); if (*e) switch (**e) { case 'b': case 'B': ret *= 512; *e = *e + 1; break; case 'k': case 'K': ret *= 1024; *e = *e + 1; break; case 'm': case 'M': ret *= 1024*1024; *e = *e + 1; break; case 'w': case 'W': ret *= 4; *e = *e + 1; break; } return (ret); } int main(int argc, char **argv) { int i, ch; char *endp; char goodfile[1024]; char logfile[1024]; + struct timespec now; goodfile[0] = 0; logfile[0] = 0; page_size = getpagesize(); page_mask = page_size - 1; setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */ while ((ch = getopt(argc, argv, "b:c:di:l:m:no:p:qr:s:t:w:D:LN:OP:RS:UW")) != -1) switch (ch) { case 'b': simulatedopcount = getnum(optarg, &endp); if (!quiet) fprintf(stdout, "Will begin at operation %ld\n", simulatedopcount); if (simulatedopcount == 0) usage(); simulatedopcount -= 1; break; case 'c': closeprob = getnum(optarg, &endp); if (!quiet) fprintf(stdout, "Chance of close/open is 1 in %d\n", closeprob); if (closeprob <= 0) usage(); break; case 'd': debug = 1; break; case 'i': invlprob = getnum(optarg, &endp); if (!quiet) fprintf(stdout, "Chance of MS_INVALIDATE is 1 in %d\n", invlprob); if (invlprob <= 0) usage(); break; case 'l': maxfilelen = getnum(optarg, &endp); if (maxfilelen <= 0) usage(); 
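getnum() above accepts dd(1)-style size suffixes, so the boundary and length options can be given in blocks, kilobytes, megabytes, or 4-byte words; for instance, "-l 256k" sets the file-length limit to 262144. Hypothetical calls, shown only to illustrate the multipliers:

	char *ep;

	getnum("256k", &ep);	/* 256 * 1024      == 262144  */
	getnum("2b", &ep);	/* 2 * 512         == 1024    */
	getnum("1m", &ep);	/* 1 * 1024 * 1024 == 1048576 */
	getnum("16w", &ep);	/* 16 * 4          == 64      */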
break; case 'm': monitorstart = getnum(optarg, &endp); if (monitorstart < 0) usage(); if (!endp || *endp++ != ':') usage(); monitorend = getnum(endp, &endp); if (monitorend < 0) usage(); if (monitorend == 0) monitorend = -1; /* aka infinity */ debug = 1; case 'n': sizechecks = 0; break; case 'o': maxoplen = getnum(optarg, &endp); if (maxoplen <= 0) usage(); break; case 'p': progressinterval = getnum(optarg, &endp); if (progressinterval < 0) usage(); break; case 'q': quiet = 1; break; case 'r': readbdy = getnum(optarg, &endp); if (readbdy <= 0) usage(); break; case 's': style = getnum(optarg, &endp); if (style < 0 || style > 1) usage(); break; case 't': truncbdy = getnum(optarg, &endp); if (truncbdy <= 0) usage(); break; case 'w': writebdy = getnum(optarg, &endp); if (writebdy <= 0) usage(); break; case 'D': debugstart = getnum(optarg, &endp); if (debugstart < 1) usage(); break; case 'L': lite = 1; break; case 'N': numops = getnum(optarg, &endp); if (numops < 0) usage(); break; case 'O': randomoplen = 0; break; case 'P': strncpy(goodfile, optarg, sizeof(goodfile)); strcat(goodfile, "/"); strncpy(logfile, optarg, sizeof(logfile)); strcat(logfile, "/"); break; case 'R': mapped_reads = 0; break; case 'S': seed = getnum(optarg, &endp); - if (seed == 0) - seed = time(0) % 10000; + if (seed == 0) { + if (clock_gettime(CLOCK_REALTIME, &now) != 0) + err(1, "clock_gettime"); + seed = now.tv_nsec % 10000; + } if (!quiet) fprintf(stdout, "Seed set to %d\n", seed); if (seed < 0) usage(); break; case 'W': mapped_writes = 0; if (!quiet) fprintf(stdout, "mapped writes DISABLED\n"); break; case 'U': mapped_msync = 0; if (!quiet) fprintf(stdout, "mapped msync DISABLED\n"); break; default: usage(); /* NOTREACHED */ } argc -= optind; argv += optind; if (argc != 1) usage(); fname = argv[0]; signal(SIGHUP, cleanup); signal(SIGINT, cleanup); signal(SIGPIPE, cleanup); signal(SIGALRM, cleanup); signal(SIGTERM, cleanup); signal(SIGXCPU, cleanup); signal(SIGXFSZ, cleanup); signal(SIGVTALRM, cleanup); signal(SIGUSR1, cleanup); signal(SIGUSR2, cleanup); initstate(seed, state, 256); setstate(state); fd = open(fname, O_RDWR|(lite ? 
0 : O_CREAT|O_TRUNC), 0666); if (fd < 0) { prterr(fname); exit(91); } strncat(goodfile, fname, 256); strcat (goodfile, ".fsxgood"); fsxgoodfd = open(goodfile, O_RDWR|O_CREAT|O_TRUNC, 0666); if (fsxgoodfd < 0) { prterr(goodfile); exit(92); } strncat(logfile, fname, 256); strcat (logfile, ".fsxlog"); fsxlogf = fopen(logfile, "w"); if (fsxlogf == NULL) { prterr(logfile); exit(93); } if (lite) { off_t ret; file_size = maxfilelen = lseek(fd, (off_t)0, SEEK_END); if (file_size == (off_t)-1) { prterr(fname); warn("main: lseek eof"); exit(94); } ret = lseek(fd, (off_t)0, SEEK_SET); if (ret == (off_t)-1) { prterr(fname); warn("main: lseek 0"); exit(95); } } original_buf = (char *) malloc(maxfilelen); for (i = 0; i < maxfilelen; i++) original_buf[i] = random() % 256; good_buf = (char *) malloc(maxfilelen); memset(good_buf, '\0', maxfilelen); temp_buf = (char *) malloc(maxoplen); memset(temp_buf, '\0', maxoplen); if (lite) { /* zero entire existing file */ ssize_t written; written = write(fd, good_buf, (size_t)maxfilelen); if (written != maxfilelen) { if (written == -1) { prterr(fname); warn("main: error on write"); } else - warn("main: short write, 0x%x bytes instead of 0x%x\n", + warn("main: short write, 0x%x bytes instead of 0x%lx\n", (unsigned)written, maxfilelen); exit(98); } } else check_trunc_hack(); while (numops == -1 || numops--) test(); if (close(fd)) { prterr("close"); report_failure(99); } prt("All operations completed A-OK!\n"); exit(0); return 0; } Index: user/ngie/bug-237403/usr.sbin/bhyve/acpi.c =================================================================== --- user/ngie/bug-237403/usr.sbin/bhyve/acpi.c (revision 346925) +++ user/ngie/bug-237403/usr.sbin/bhyve/acpi.c (revision 346926) @@ -1,984 +1,999 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2012 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * bhyve ACPI table generator. * * Create the minimal set of ACPI tables required to boot FreeBSD (and * hopefully other o/s's) by writing out ASL template files for each of * the tables and the compiling them to AML with the Intel iasl compiler. * The AML files are then read into guest memory. * * The tables are placed in the guest's ROM area just below 1MB physical, * above the MPTable. 
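 *
 * As a worked illustration of the size-driven placement noted below
 * (assuming a hypothetical VM_MAXCPU of 32; the real value comes from
 * the vmm headers), the MADT_SIZE and *_OFFSET macros defined later in
 * this file would give:
 *	MADT_SIZE   = 44 + 32*8 + 12 + 2*10 + 6 = 338 (0x152)
 *	FADT_OFFSET = 0x100 + 0x152 = 0x252   (guest-physical 0xf2652)
 *	HPET_OFFSET = 0x252 + 0x140 = 0x392
 *	MCFG_OFFSET = 0x392 + 0x40  = 0x3d2
 *	FACS_OFFSET = 0x3d2 + 0x40  = 0x412
 *	DSDT_OFFSET = 0x412 + 0x40  = 0x452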
* - * Layout + * Layout (No longer correct at FADT and beyond due to properly + * calculating the size of the MADT to allow for changes to + * VM_MAXCPU above 21 which overflows this layout.) * ------ * RSDP -> 0xf2400 (36 bytes fixed) * RSDT -> 0xf2440 (36 bytes + 4*7 table addrs, 4 used) * XSDT -> 0xf2480 (36 bytes + 8*7 table addrs, 4 used) * MADT -> 0xf2500 (depends on #CPUs) * FADT -> 0xf2600 (268 bytes) * HPET -> 0xf2740 (56 bytes) * MCFG -> 0xf2780 (60 bytes) * FACS -> 0xf27C0 (64 bytes) * DSDT -> 0xf2800 (variable - can go up to 0x100000) */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include "bhyverun.h" #include "acpi.h" #include "pci_emul.h" /* - * Define the base address of the ACPI tables, and the offsets to - * the individual tables + * Define the base address of the ACPI tables, the sizes of some tables, + * and the offsets to the individual tables, */ #define BHYVE_ACPI_BASE 0xf2400 #define RSDT_OFFSET 0x040 #define XSDT_OFFSET 0x080 #define MADT_OFFSET 0x100 -#define FADT_OFFSET 0x200 -#define HPET_OFFSET 0x340 -#define MCFG_OFFSET 0x380 -#define FACS_OFFSET 0x3C0 -#define DSDT_OFFSET 0x400 +/* + * The MADT consists of: + * 44 Fixed Header + * 8 * maxcpu Processor Local APIC entries + * 12 I/O APIC entry + * 2 * 10 Interrupt Source Override entires + * 6 Local APIC NMI entry + */ +#define MADT_SIZE (44 + VM_MAXCPU*8 + 12 + 2*10 + 6) +#define FADT_OFFSET (MADT_OFFSET + MADT_SIZE) +#define FADT_SIZE 0x140 +#define HPET_OFFSET (FADT_OFFSET + FADT_SIZE) +#define HPET_SIZE 0x40 +#define MCFG_OFFSET (HPET_OFFSET + HPET_SIZE) +#define MCFG_SIZE 0x40 +#define FACS_OFFSET (MCFG_OFFSET + MCFG_SIZE) +#define FACS_SIZE 0x40 +#define DSDT_OFFSET (FACS_OFFSET + FACS_SIZE) #define BHYVE_ASL_TEMPLATE "bhyve.XXXXXXX" #define BHYVE_ASL_SUFFIX ".aml" #define BHYVE_ASL_COMPILER "/usr/sbin/iasl" static int basl_keep_temps; static int basl_verbose_iasl; static int basl_ncpu; static uint32_t basl_acpi_base = BHYVE_ACPI_BASE; static uint32_t hpet_capabilities; /* * Contains the full pathname of the template to be passed * to mkstemp/mktemps(3) */ static char basl_template[MAXPATHLEN]; static char basl_stemplate[MAXPATHLEN]; /* * State for dsdt_line(), dsdt_indent(), and dsdt_unindent(). */ static FILE *dsdt_fp; static int dsdt_indent_level; static int dsdt_error; struct basl_fio { int fd; FILE *fp; char f_name[MAXPATHLEN]; }; #define EFPRINTF(...) 
\ if (fprintf(__VA_ARGS__) < 0) goto err_exit; #define EFFLUSH(x) \ if (fflush(x) != 0) goto err_exit; static int basl_fwrite_rsdp(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve RSDP template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0008]\t\tSignature : \"RSD PTR \"\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 43\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0001]\t\tRevision : 02\n"); EFPRINTF(fp, "[0004]\t\tRSDT Address : %08X\n", basl_acpi_base + RSDT_OFFSET); EFPRINTF(fp, "[0004]\t\tLength : 00000024\n"); EFPRINTF(fp, "[0008]\t\tXSDT Address : 00000000%08X\n", basl_acpi_base + XSDT_OFFSET); EFPRINTF(fp, "[0001]\t\tExtended Checksum : 00\n"); EFPRINTF(fp, "[0003]\t\tReserved : 000000\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_rsdt(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve RSDT template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"RSDT\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 00000000\n"); EFPRINTF(fp, "[0001]\t\tRevision : 01\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVRSDT \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "\n"); /* Add in pointers to the MADT, FADT and HPET */ EFPRINTF(fp, "[0004]\t\tACPI Table Address 0 : %08X\n", basl_acpi_base + MADT_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 1 : %08X\n", basl_acpi_base + FADT_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 2 : %08X\n", basl_acpi_base + HPET_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 3 : %08X\n", basl_acpi_base + MCFG_OFFSET); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_xsdt(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve XSDT template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"XSDT\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 00000000\n"); EFPRINTF(fp, "[0001]\t\tRevision : 01\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVXSDT \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "\n"); /* Add in pointers to the MADT, FADT and HPET */ EFPRINTF(fp, "[0004]\t\tACPI Table Address 0 : 00000000%08X\n", basl_acpi_base + MADT_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 1 : 00000000%08X\n", basl_acpi_base + FADT_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 2 : 00000000%08X\n", basl_acpi_base + HPET_OFFSET); EFPRINTF(fp, "[0004]\t\tACPI Table Address 3 : 00000000%08X\n", basl_acpi_base + MCFG_OFFSET); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_madt(FILE *fp) { int i; EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve MADT template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"APIC\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 00000000\n"); EFPRINTF(fp, "[0001]\t\tRevision : 01\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVMADT \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in 
the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0004]\t\tLocal Apic Address : FEE00000\n"); EFPRINTF(fp, "[0004]\t\tFlags (decoded below) : 00000001\n"); EFPRINTF(fp, "\t\t\tPC-AT Compatibility : 1\n"); EFPRINTF(fp, "\n"); /* Add a Processor Local APIC entry for each CPU */ for (i = 0; i < basl_ncpu; i++) { EFPRINTF(fp, "[0001]\t\tSubtable Type : 00\n"); EFPRINTF(fp, "[0001]\t\tLength : 08\n"); /* iasl expects hex values for the proc and apic id's */ EFPRINTF(fp, "[0001]\t\tProcessor ID : %02x\n", i); EFPRINTF(fp, "[0001]\t\tLocal Apic ID : %02x\n", i); EFPRINTF(fp, "[0004]\t\tFlags (decoded below) : 00000001\n"); EFPRINTF(fp, "\t\t\tProcessor Enabled : 1\n"); EFPRINTF(fp, "\t\t\tRuntime Online Capable : 0\n"); EFPRINTF(fp, "\n"); } /* Always a single IOAPIC entry, with ID 0 */ EFPRINTF(fp, "[0001]\t\tSubtable Type : 01\n"); EFPRINTF(fp, "[0001]\t\tLength : 0C\n"); /* iasl expects a hex value for the i/o apic id */ EFPRINTF(fp, "[0001]\t\tI/O Apic ID : %02x\n", 0); EFPRINTF(fp, "[0001]\t\tReserved : 00\n"); EFPRINTF(fp, "[0004]\t\tAddress : fec00000\n"); EFPRINTF(fp, "[0004]\t\tInterrupt : 00000000\n"); EFPRINTF(fp, "\n"); /* Legacy IRQ0 is connected to pin 2 of the IOAPIC */ EFPRINTF(fp, "[0001]\t\tSubtable Type : 02\n"); EFPRINTF(fp, "[0001]\t\tLength : 0A\n"); EFPRINTF(fp, "[0001]\t\tBus : 00\n"); EFPRINTF(fp, "[0001]\t\tSource : 00\n"); EFPRINTF(fp, "[0004]\t\tInterrupt : 00000002\n"); EFPRINTF(fp, "[0002]\t\tFlags (decoded below) : 0005\n"); EFPRINTF(fp, "\t\t\tPolarity : 1\n"); EFPRINTF(fp, "\t\t\tTrigger Mode : 1\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0001]\t\tSubtable Type : 02\n"); EFPRINTF(fp, "[0001]\t\tLength : 0A\n"); EFPRINTF(fp, "[0001]\t\tBus : 00\n"); EFPRINTF(fp, "[0001]\t\tSource : %02X\n", SCI_INT); EFPRINTF(fp, "[0004]\t\tInterrupt : %08X\n", SCI_INT); EFPRINTF(fp, "[0002]\t\tFlags (decoded below) : 0000\n"); EFPRINTF(fp, "\t\t\tPolarity : 3\n"); EFPRINTF(fp, "\t\t\tTrigger Mode : 3\n"); EFPRINTF(fp, "\n"); /* Local APIC NMI is connected to LINT 1 on all CPUs */ EFPRINTF(fp, "[0001]\t\tSubtable Type : 04\n"); EFPRINTF(fp, "[0001]\t\tLength : 06\n"); EFPRINTF(fp, "[0001]\t\tProcessorId : FF\n"); EFPRINTF(fp, "[0002]\t\tFlags (decoded below) : 0005\n"); EFPRINTF(fp, "\t\t\tPolarity : 1\n"); EFPRINTF(fp, "\t\t\tTrigger Mode : 1\n"); EFPRINTF(fp, "[0001]\t\tInterrupt : 01\n"); EFPRINTF(fp, "\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_fadt(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve FADT template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"FACP\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 0000010C\n"); EFPRINTF(fp, "[0001]\t\tRevision : 05\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVFACP \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0004]\t\tFACS Address : %08X\n", basl_acpi_base + FACS_OFFSET); EFPRINTF(fp, "[0004]\t\tDSDT Address : %08X\n", basl_acpi_base + DSDT_OFFSET); EFPRINTF(fp, "[0001]\t\tModel : 01\n"); EFPRINTF(fp, "[0001]\t\tPM Profile : 00 [Unspecified]\n"); EFPRINTF(fp, "[0002]\t\tSCI Interrupt : %04X\n", SCI_INT); EFPRINTF(fp, 
"[0004]\t\tSMI Command Port : %08X\n", SMI_CMD); EFPRINTF(fp, "[0001]\t\tACPI Enable Value : %02X\n", BHYVE_ACPI_ENABLE); EFPRINTF(fp, "[0001]\t\tACPI Disable Value : %02X\n", BHYVE_ACPI_DISABLE); EFPRINTF(fp, "[0001]\t\tS4BIOS Command : 00\n"); EFPRINTF(fp, "[0001]\t\tP-State Control : 00\n"); EFPRINTF(fp, "[0004]\t\tPM1A Event Block Address : %08X\n", PM1A_EVT_ADDR); EFPRINTF(fp, "[0004]\t\tPM1B Event Block Address : 00000000\n"); EFPRINTF(fp, "[0004]\t\tPM1A Control Block Address : %08X\n", PM1A_CNT_ADDR); EFPRINTF(fp, "[0004]\t\tPM1B Control Block Address : 00000000\n"); EFPRINTF(fp, "[0004]\t\tPM2 Control Block Address : 00000000\n"); EFPRINTF(fp, "[0004]\t\tPM Timer Block Address : %08X\n", IO_PMTMR); EFPRINTF(fp, "[0004]\t\tGPE0 Block Address : 00000000\n"); EFPRINTF(fp, "[0004]\t\tGPE1 Block Address : 00000000\n"); EFPRINTF(fp, "[0001]\t\tPM1 Event Block Length : 04\n"); EFPRINTF(fp, "[0001]\t\tPM1 Control Block Length : 02\n"); EFPRINTF(fp, "[0001]\t\tPM2 Control Block Length : 00\n"); EFPRINTF(fp, "[0001]\t\tPM Timer Block Length : 04\n"); EFPRINTF(fp, "[0001]\t\tGPE0 Block Length : 00\n"); EFPRINTF(fp, "[0001]\t\tGPE1 Block Length : 00\n"); EFPRINTF(fp, "[0001]\t\tGPE1 Base Offset : 00\n"); EFPRINTF(fp, "[0001]\t\t_CST Support : 00\n"); EFPRINTF(fp, "[0002]\t\tC2 Latency : 0000\n"); EFPRINTF(fp, "[0002]\t\tC3 Latency : 0000\n"); EFPRINTF(fp, "[0002]\t\tCPU Cache Size : 0000\n"); EFPRINTF(fp, "[0002]\t\tCache Flush Stride : 0000\n"); EFPRINTF(fp, "[0001]\t\tDuty Cycle Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tDuty Cycle Width : 00\n"); EFPRINTF(fp, "[0001]\t\tRTC Day Alarm Index : 00\n"); EFPRINTF(fp, "[0001]\t\tRTC Month Alarm Index : 00\n"); EFPRINTF(fp, "[0001]\t\tRTC Century Index : 32\n"); EFPRINTF(fp, "[0002]\t\tBoot Flags (decoded below) : 0000\n"); EFPRINTF(fp, "\t\t\tLegacy Devices Supported (V2) : 0\n"); EFPRINTF(fp, "\t\t\t8042 Present on ports 60/64 (V2) : 0\n"); EFPRINTF(fp, "\t\t\tVGA Not Present (V4) : 1\n"); EFPRINTF(fp, "\t\t\tMSI Not Supported (V4) : 0\n"); EFPRINTF(fp, "\t\t\tPCIe ASPM Not Supported (V4) : 1\n"); EFPRINTF(fp, "\t\t\tCMOS RTC Not Present (V5) : 0\n"); EFPRINTF(fp, "[0001]\t\tReserved : 00\n"); EFPRINTF(fp, "[0004]\t\tFlags (decoded below) : 00000000\n"); EFPRINTF(fp, "\t\t\tWBINVD instruction is operational (V1) : 1\n"); EFPRINTF(fp, "\t\t\tWBINVD flushes all caches (V1) : 0\n"); EFPRINTF(fp, "\t\t\tAll CPUs support C1 (V1) : 1\n"); EFPRINTF(fp, "\t\t\tC2 works on MP system (V1) : 0\n"); EFPRINTF(fp, "\t\t\tControl Method Power Button (V1) : 0\n"); EFPRINTF(fp, "\t\t\tControl Method Sleep Button (V1) : 1\n"); EFPRINTF(fp, "\t\t\tRTC wake not in fixed reg space (V1) : 0\n"); EFPRINTF(fp, "\t\t\tRTC can wake system from S4 (V1) : 0\n"); EFPRINTF(fp, "\t\t\t32-bit PM Timer (V1) : 1\n"); EFPRINTF(fp, "\t\t\tDocking Supported (V1) : 0\n"); EFPRINTF(fp, "\t\t\tReset Register Supported (V2) : 1\n"); EFPRINTF(fp, "\t\t\tSealed Case (V3) : 0\n"); EFPRINTF(fp, "\t\t\tHeadless - No Video (V3) : 1\n"); EFPRINTF(fp, "\t\t\tUse native instr after SLP_TYPx (V3) : 0\n"); EFPRINTF(fp, "\t\t\tPCIEXP_WAK Bits Supported (V4) : 0\n"); EFPRINTF(fp, "\t\t\tUse Platform Timer (V4) : 0\n"); EFPRINTF(fp, "\t\t\tRTC_STS valid on S4 wake (V4) : 0\n"); EFPRINTF(fp, "\t\t\tRemote Power-on capable (V4) : 0\n"); EFPRINTF(fp, "\t\t\tUse APIC Cluster Model (V4) : 0\n"); EFPRINTF(fp, "\t\t\tUse APIC Physical Destination Mode (V4) : 1\n"); EFPRINTF(fp, "\t\t\tHardware Reduced (V5) : 0\n"); EFPRINTF(fp, "\t\t\tLow Power S0 Idle (V5) : 0\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, 
"[0012]\t\tReset Register : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 08\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 01 [Byte Access:8]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000CF9\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0001]\t\tValue to cause reset : 06\n"); EFPRINTF(fp, "[0002]\t\tARM Flags (decoded below): 0000\n"); EFPRINTF(fp, "\t\t\tPSCI Compliant : 0\n"); EFPRINTF(fp, "\t\t\tMust use HVC for PSCI : 0\n"); EFPRINTF(fp, "[0001]\t\tFADT Minor Revision : 01\n"); EFPRINTF(fp, "[0008]\t\tFACS Address : 00000000%08X\n", basl_acpi_base + FACS_OFFSET); EFPRINTF(fp, "[0008]\t\tDSDT Address : 00000000%08X\n", basl_acpi_base + DSDT_OFFSET); EFPRINTF(fp, "[0012]\t\tPM1A Event Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 20\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 02 [Word Access:16]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 00000000%08X\n", PM1A_EVT_ADDR); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tPM1B Event Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 00\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 00 [Undefined/Legacy]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tPM1A Control Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 10\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 02 [Word Access:16]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 00000000%08X\n", PM1A_CNT_ADDR); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tPM1B Control Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 00\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 00 [Undefined/Legacy]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tPM2 Control Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 08\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 00 [Undefined/Legacy]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); /* Valid for bhyve */ EFPRINTF(fp, "[0012]\t\tPM Timer Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 20\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 03 [DWord Access:32]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 00000000%08X\n", IO_PMTMR); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tGPE0 Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 00\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 01 [Byte Access:8]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tGPE1 Block : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, 
"[0001]\t\tBit Width : 00\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 00 [Undefined/Legacy]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tSleep Control Register : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 08\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 01 [Byte Access:8]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0012]\t\tSleep Status Register : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 01 [SystemIO]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 08\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 01 [Byte Access:8]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 0000000000000000\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_hpet(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve HPET template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"HPET\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 00000000\n"); EFPRINTF(fp, "[0001]\t\tRevision : 01\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVHPET \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0004]\t\tTimer Block ID : %08X\n", hpet_capabilities); EFPRINTF(fp, "[0012]\t\tTimer Block Register : [Generic Address Structure]\n"); EFPRINTF(fp, "[0001]\t\tSpace ID : 00 [SystemMemory]\n"); EFPRINTF(fp, "[0001]\t\tBit Width : 00\n"); EFPRINTF(fp, "[0001]\t\tBit Offset : 00\n"); EFPRINTF(fp, "[0001]\t\tEncoded Access Width : 00 [Undefined/Legacy]\n"); EFPRINTF(fp, "[0008]\t\tAddress : 00000000FED00000\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0001]\t\tHPET Number : 00\n"); EFPRINTF(fp, "[0002]\t\tMinimum Clock Ticks : 0000\n"); EFPRINTF(fp, "[0004]\t\tFlags (decoded below) : 00000001\n"); EFPRINTF(fp, "\t\t\t4K Page Protect : 1\n"); EFPRINTF(fp, "\t\t\t64K Page Protect : 0\n"); EFPRINTF(fp, "\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_mcfg(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve MCFG template\n"); EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"MCFG\"\n"); EFPRINTF(fp, "[0004]\t\tTable Length : 00000000\n"); EFPRINTF(fp, "[0001]\t\tRevision : 01\n"); EFPRINTF(fp, "[0001]\t\tChecksum : 00\n"); EFPRINTF(fp, "[0006]\t\tOem ID : \"BHYVE \"\n"); EFPRINTF(fp, "[0008]\t\tOem Table ID : \"BVMCFG \"\n"); EFPRINTF(fp, "[0004]\t\tOem Revision : 00000001\n"); /* iasl will fill in the compiler ID/revision fields */ EFPRINTF(fp, "[0004]\t\tAsl Compiler ID : \"xxxx\"\n"); EFPRINTF(fp, "[0004]\t\tAsl Compiler Revision : 00000000\n"); EFPRINTF(fp, "[0008]\t\tReserved : 0\n"); EFPRINTF(fp, "\n"); EFPRINTF(fp, "[0008]\t\tBase Address : %016lX\n", pci_ecfg_base()); EFPRINTF(fp, "[0002]\t\tSegment Group: 0000\n"); EFPRINTF(fp, "[0001]\t\tStart Bus: 00\n"); EFPRINTF(fp, "[0001]\t\tEnd Bus: FF\n"); EFPRINTF(fp, "[0004]\t\tReserved : 0\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_fwrite_facs(FILE *fp) { EFPRINTF(fp, "/*\n"); EFPRINTF(fp, " * bhyve FACS template\n"); 
EFPRINTF(fp, " */\n"); EFPRINTF(fp, "[0004]\t\tSignature : \"FACS\"\n"); EFPRINTF(fp, "[0004]\t\tLength : 00000040\n"); EFPRINTF(fp, "[0004]\t\tHardware Signature : 00000000\n"); EFPRINTF(fp, "[0004]\t\t32 Firmware Waking Vector : 00000000\n"); EFPRINTF(fp, "[0004]\t\tGlobal Lock : 00000000\n"); EFPRINTF(fp, "[0004]\t\tFlags (decoded below) : 00000000\n"); EFPRINTF(fp, "\t\t\tS4BIOS Support Present : 0\n"); EFPRINTF(fp, "\t\t\t64-bit Wake Supported (V2) : 0\n"); EFPRINTF(fp, "[0008]\t\t64 Firmware Waking Vector : 0000000000000000\n"); EFPRINTF(fp, "[0001]\t\tVersion : 02\n"); EFPRINTF(fp, "[0003]\t\tReserved : 000000\n"); EFPRINTF(fp, "[0004]\t\tOspmFlags (decoded below) : 00000000\n"); EFPRINTF(fp, "\t\t\t64-bit Wake Env Required (V2) : 0\n"); EFFLUSH(fp); return (0); err_exit: return (errno); } /* * Helper routines for writing to the DSDT from other modules. */ void dsdt_line(const char *fmt, ...) { va_list ap; if (dsdt_error != 0) return; if (strcmp(fmt, "") != 0) { if (dsdt_indent_level != 0) EFPRINTF(dsdt_fp, "%*c", dsdt_indent_level * 2, ' '); va_start(ap, fmt); if (vfprintf(dsdt_fp, fmt, ap) < 0) { va_end(ap); goto err_exit; } va_end(ap); } EFPRINTF(dsdt_fp, "\n"); return; err_exit: dsdt_error = errno; } void dsdt_indent(int levels) { dsdt_indent_level += levels; assert(dsdt_indent_level >= 0); } void dsdt_unindent(int levels) { assert(dsdt_indent_level >= levels); dsdt_indent_level -= levels; } void dsdt_fixed_ioport(uint16_t iobase, uint16_t length) { dsdt_line("IO (Decode16,"); dsdt_line(" 0x%04X, // Range Minimum", iobase); dsdt_line(" 0x%04X, // Range Maximum", iobase); dsdt_line(" 0x01, // Alignment"); dsdt_line(" 0x%02X, // Length", length); dsdt_line(" )"); } void dsdt_fixed_irq(uint8_t irq) { dsdt_line("IRQNoFlags ()"); dsdt_line(" {%d}", irq); } void dsdt_fixed_mem32(uint32_t base, uint32_t length) { dsdt_line("Memory32Fixed (ReadWrite,"); dsdt_line(" 0x%08X, // Address Base", base); dsdt_line(" 0x%08X, // Address Length", length); dsdt_line(" )"); } static int basl_fwrite_dsdt(FILE *fp) { dsdt_fp = fp; dsdt_error = 0; dsdt_indent_level = 0; dsdt_line("/*"); dsdt_line(" * bhyve DSDT template"); dsdt_line(" */"); dsdt_line("DefinitionBlock (\"bhyve_dsdt.aml\", \"DSDT\", 2," "\"BHYVE \", \"BVDSDT \", 0x00000001)"); dsdt_line("{"); dsdt_line(" Name (_S5, Package ()"); dsdt_line(" {"); dsdt_line(" 0x05,"); dsdt_line(" Zero,"); dsdt_line(" })"); pci_write_dsdt(); dsdt_line(""); dsdt_line(" Scope (_SB.PC00)"); dsdt_line(" {"); dsdt_line(" Device (HPET)"); dsdt_line(" {"); dsdt_line(" Name (_HID, EISAID(\"PNP0103\"))"); dsdt_line(" Name (_UID, 0)"); dsdt_line(" Name (_CRS, ResourceTemplate ()"); dsdt_line(" {"); dsdt_indent(4); dsdt_fixed_mem32(0xFED00000, 0x400); dsdt_unindent(4); dsdt_line(" })"); dsdt_line(" }"); dsdt_line(" }"); dsdt_line("}"); if (dsdt_error != 0) return (dsdt_error); EFFLUSH(fp); return (0); err_exit: return (errno); } static int basl_open(struct basl_fio *bf, int suffix) { int err; err = 0; if (suffix) { strlcpy(bf->f_name, basl_stemplate, MAXPATHLEN); bf->fd = mkstemps(bf->f_name, strlen(BHYVE_ASL_SUFFIX)); } else { strlcpy(bf->f_name, basl_template, MAXPATHLEN); bf->fd = mkstemp(bf->f_name); } if (bf->fd > 0) { bf->fp = fdopen(bf->fd, "w+"); if (bf->fp == NULL) { unlink(bf->f_name); close(bf->fd); } } else { err = 1; } return (err); } static void basl_close(struct basl_fio *bf) { if (!basl_keep_temps) unlink(bf->f_name); fclose(bf->fp); } static int basl_start(struct basl_fio *in, struct basl_fio *out) { int err; err = basl_open(in, 0); if (!err) { err = 
basl_open(out, 1); if (err) { basl_close(in); } } return (err); } static void basl_end(struct basl_fio *in, struct basl_fio *out) { basl_close(in); basl_close(out); } static int basl_load(struct vmctx *ctx, int fd, uint64_t off) { struct stat sb; void *gaddr; if (fstat(fd, &sb) < 0) return (errno); gaddr = paddr_guest2host(ctx, basl_acpi_base + off, sb.st_size); if (gaddr == NULL) return (EFAULT); if (read(fd, gaddr, sb.st_size) < 0) return (errno); return (0); } static int basl_compile(struct vmctx *ctx, int (*fwrite_section)(FILE *), uint64_t offset) { struct basl_fio io[2]; static char iaslbuf[3*MAXPATHLEN + 10]; char *fmt; int err; err = basl_start(&io[0], &io[1]); if (!err) { err = (*fwrite_section)(io[0].fp); if (!err) { /* * iasl sends the results of the compilation to * stdout. Shut this down by using the shell to * redirect stdout to /dev/null, unless the user * has requested verbose output for debugging * purposes */ fmt = basl_verbose_iasl ? "%s -p %s %s" : "/bin/sh -c \"%s -p %s %s\" 1> /dev/null"; snprintf(iaslbuf, sizeof(iaslbuf), fmt, BHYVE_ASL_COMPILER, io[1].f_name, io[0].f_name); err = system(iaslbuf); if (!err) { /* * Copy the aml output file into guest * memory at the specified location */ err = basl_load(ctx, io[1].fd, offset); } } basl_end(&io[0], &io[1]); } return (err); } static int basl_make_templates(void) { const char *tmpdir; int err; int len; err = 0; /* * */ if ((tmpdir = getenv("BHYVE_TMPDIR")) == NULL || *tmpdir == '\0' || (tmpdir = getenv("TMPDIR")) == NULL || *tmpdir == '\0') { tmpdir = _PATH_TMP; } len = strlen(tmpdir); if ((len + sizeof(BHYVE_ASL_TEMPLATE) + 1) < MAXPATHLEN) { strcpy(basl_template, tmpdir); while (len > 0 && basl_template[len - 1] == '/') len--; basl_template[len] = '/'; strcpy(&basl_template[len + 1], BHYVE_ASL_TEMPLATE); } else err = E2BIG; if (!err) { /* * len has been intialized (and maybe adjusted) above */ if ((len + sizeof(BHYVE_ASL_TEMPLATE) + 1 + sizeof(BHYVE_ASL_SUFFIX)) < MAXPATHLEN) { strcpy(basl_stemplate, tmpdir); basl_stemplate[len] = '/'; strcpy(&basl_stemplate[len + 1], BHYVE_ASL_TEMPLATE); len = strlen(basl_stemplate); strcpy(&basl_stemplate[len], BHYVE_ASL_SUFFIX); } else err = E2BIG; } return (err); } static struct { int (*wsect)(FILE *fp); uint64_t offset; } basl_ftables[] = { { basl_fwrite_rsdp, 0}, { basl_fwrite_rsdt, RSDT_OFFSET }, { basl_fwrite_xsdt, XSDT_OFFSET }, { basl_fwrite_madt, MADT_OFFSET }, { basl_fwrite_fadt, FADT_OFFSET }, { basl_fwrite_hpet, HPET_OFFSET }, { basl_fwrite_mcfg, MCFG_OFFSET }, { basl_fwrite_facs, FACS_OFFSET }, { basl_fwrite_dsdt, DSDT_OFFSET }, { NULL } }; int acpi_build(struct vmctx *ctx, int ncpu) { int err; int i; basl_ncpu = ncpu; err = vm_get_hpet_capabilities(ctx, &hpet_capabilities); if (err != 0) return (err); /* * For debug, allow the user to have iasl compiler output sent * to stdout rather than /dev/null */ if (getenv("BHYVE_ACPI_VERBOSE_IASL")) basl_verbose_iasl = 1; /* * Allow the user to keep the generated ASL files for debugging * instead of deleting them following use */ if (getenv("BHYVE_ACPI_KEEPTMPS")) basl_keep_temps = 1; i = 0; err = basl_make_templates(); /* * Run through all the ASL files, compiling them and * copying them into guest memory */ while (!err && basl_ftables[i].wsect != NULL) { err = basl_compile(ctx, basl_ftables[i].wsect, basl_ftables[i].offset); i++; } return (err); } Index: user/ngie/bug-237403/usr.sbin/bhyve/bhyverun.h =================================================================== --- user/ngie/bug-237403/usr.sbin/bhyve/bhyverun.h (revision 
346925) +++ user/ngie/bug-237403/usr.sbin/bhyve/bhyverun.h (revision 346926) @@ -1,51 +1,52 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2011 NetApp, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _FBSDRUN_H_ #define _FBSDRUN_H_ #define VMEXIT_CONTINUE (0) #define VMEXIT_ABORT (-1) struct vmctx; extern int guest_ncpus; +extern uint16_t cores, sockets, threads; extern char *guest_uuid_str; extern char *vmname; void *paddr_guest2host(struct vmctx *ctx, uintptr_t addr, size_t len); void fbsdrun_set_capabilities(struct vmctx *ctx, int cpu); void fbsdrun_addcpu(struct vmctx *ctx, int fromcpu, int newcpu, uint64_t rip); int fbsdrun_muxed(void); int fbsdrun_vmexit_on_hlt(void); int fbsdrun_vmexit_on_pause(void); int fbsdrun_disable_x2apic(void); int fbsdrun_virtio_msix(void); #endif Index: user/ngie/bug-237403/usr.sbin/bhyve/smbiostbl.c =================================================================== --- user/ngie/bug-237403/usr.sbin/bhyve/smbiostbl.c (revision 346925) +++ user/ngie/bug-237403/usr.sbin/bhyve/smbiostbl.c (revision 346926) @@ -1,829 +1,839 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (c) 2014 Tycho Nightingale * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include "bhyverun.h" #include "smbiostbl.h" #define MB (1024*1024) #define GB (1024ULL*1024*1024) #define SMBIOS_BASE 0xF1000 /* BHYVE_ACPI_BASE - SMBIOS_BASE) */ #define SMBIOS_MAX_LENGTH (0xF2400 - 0xF1000) #define SMBIOS_TYPE_BIOS 0 #define SMBIOS_TYPE_SYSTEM 1 #define SMBIOS_TYPE_CHASSIS 3 #define SMBIOS_TYPE_PROCESSOR 4 #define SMBIOS_TYPE_MEMARRAY 16 #define SMBIOS_TYPE_MEMDEVICE 17 #define SMBIOS_TYPE_MEMARRAYMAP 19 #define SMBIOS_TYPE_BOOT 32 #define SMBIOS_TYPE_EOT 127 struct smbios_structure { uint8_t type; uint8_t length; uint16_t handle; } __packed; typedef int (*initializer_func_t)(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); struct smbios_template_entry { struct smbios_structure *entry; const char **strings; initializer_func_t initializer; }; /* * SMBIOS Structure Table Entry Point */ #define SMBIOS_ENTRY_EANCHOR "_SM_" #define SMBIOS_ENTRY_EANCHORLEN 4 #define SMBIOS_ENTRY_IANCHOR "_DMI_" #define SMBIOS_ENTRY_IANCHORLEN 5 struct smbios_entry_point { char eanchor[4]; /* anchor tag */ uint8_t echecksum; /* checksum of entry point structure */ uint8_t eplen; /* length in bytes of entry point */ uint8_t major; /* major version of the SMBIOS spec */ uint8_t minor; /* minor version of the SMBIOS spec */ uint16_t maxssize; /* maximum size in bytes of a struct */ uint8_t revision; /* entry point structure revision */ uint8_t format[5]; /* entry point rev-specific data */ char ianchor[5]; /* intermediate anchor tag */ uint8_t ichecksum; /* intermediate checksum */ uint16_t stlen; /* len in bytes of structure table */ uint32_t staddr; /* physical addr of structure table */ uint16_t stnum; /* number of structure table entries */ uint8_t bcdrev; /* BCD value representing DMI ver */ } __packed; /* * BIOS Information */ #define SMBIOS_FL_ISA 0x00000010 /* ISA is supported */ #define SMBIOS_FL_PCI 0x00000080 /* PCI is supported */ #define SMBIOS_FL_SHADOW 0x00001000 /* BIOS shadowing is allowed */ #define SMBIOS_FL_CDBOOT 0x00008000 /* Boot from CD is supported */ #define SMBIOS_FL_SELBOOT 0x00010000 /* Selectable Boot supported */ #define SMBIOS_FL_EDD 0x00080000 /* EDD Spec is supported */ #define SMBIOS_XB1_FL_ACPI 0x00000001 /* ACPI is supported */ #define SMBIOS_XB2_FL_BBS 0x00000001 /* BIOS Boot Specification */ #define SMBIOS_XB2_FL_VM 0x00000010 /* Virtual Machine */ struct smbios_table_type0 { struct smbios_structure header; uint8_t vendor; /* vendor string */ uint8_t version; /* version string */ uint16_t segment; /* address segment location */ uint8_t rel_date; /* release date */ uint8_t size; /* rom size */ uint64_t cflags; /* characteristics */ uint8_t xc_bytes[2]; /* characteristics ext bytes */ uint8_t sb_major_rel; /* system bios version */ uint8_t sb_minor_rele; uint8_t ecfw_major_rel; /* embedded ctrl fw version */ uint8_t ecfw_minor_rel; } __packed; /* 
* System Information */ #define SMBIOS_WAKEUP_SWITCH 0x06 /* power switch */ struct smbios_table_type1 { struct smbios_structure header; uint8_t manufacturer; /* manufacturer string */ uint8_t product; /* product name string */ uint8_t version; /* version string */ uint8_t serial; /* serial number string */ uint8_t uuid[16]; /* uuid byte array */ uint8_t wakeup; /* wake-up event */ uint8_t sku; /* sku number string */ uint8_t family; /* family name string */ } __packed; /* * System Enclosure or Chassis */ #define SMBIOS_CHT_UNKNOWN 0x02 /* unknown */ #define SMBIOS_CHST_SAFE 0x03 /* safe */ #define SMBIOS_CHSC_NONE 0x03 /* none */ struct smbios_table_type3 { struct smbios_structure header; uint8_t manufacturer; /* manufacturer string */ uint8_t type; /* type */ uint8_t version; /* version string */ uint8_t serial; /* serial number string */ uint8_t asset; /* asset tag string */ uint8_t bustate; /* boot-up state */ uint8_t psstate; /* power supply state */ uint8_t tstate; /* thermal state */ uint8_t security; /* security status */ uint8_t uheight; /* height in 'u's */ uint8_t cords; /* number of power cords */ uint8_t elems; /* number of element records */ uint8_t elemlen; /* length of records */ uint8_t sku; /* sku number string */ } __packed; /* * Processor Information */ #define SMBIOS_PRT_CENTRAL 0x03 /* central processor */ #define SMBIOS_PRF_OTHER 0x01 /* other */ #define SMBIOS_PRS_PRESENT 0x40 /* socket is populated */ #define SMBIOS_PRS_ENABLED 0x1 /* enabled */ #define SMBIOS_PRU_NONE 0x06 /* none */ #define SMBIOS_PFL_64B 0x04 /* 64-bit capable */ struct smbios_table_type4 { struct smbios_structure header; uint8_t socket; /* socket designation string */ uint8_t type; /* processor type */ uint8_t family; /* processor family */ uint8_t manufacturer; /* manufacturer string */ uint64_t cpuid; /* processor cpuid */ uint8_t version; /* version string */ uint8_t voltage; /* voltage */ uint16_t clkspeed; /* ext clock speed in mhz */ uint16_t maxspeed; /* maximum speed in mhz */ uint16_t curspeed; /* current speed in mhz */ uint8_t status; /* status */ uint8_t upgrade; /* upgrade */ uint16_t l1handle; /* l1 cache handle */ uint16_t l2handle; /* l2 cache handle */ uint16_t l3handle; /* l3 cache handle */ uint8_t serial; /* serial number string */ uint8_t asset; /* asset tag string */ uint8_t part; /* part number string */ uint8_t cores; /* cores per socket */ uint8_t ecores; /* enabled cores */ uint8_t threads; /* threads per socket */ uint16_t cflags; /* processor characteristics */ uint16_t family2; /* processor family 2 */ } __packed; /* * Physical Memory Array */ #define SMBIOS_MAL_SYSMB 0x03 /* system board or motherboard */ #define SMBIOS_MAU_SYSTEM 0x03 /* system memory */ #define SMBIOS_MAE_NONE 0x03 /* none */ struct smbios_table_type16 { struct smbios_structure header; uint8_t location; /* physical device location */ uint8_t use; /* device functional purpose */ uint8_t ecc; /* err detect/correct method */ uint32_t size; /* max mem capacity in kb */ uint16_t errhand; /* handle of error (if any) */ uint16_t ndevs; /* num of slots or sockets */ uint64_t xsize; /* max mem capacity in bytes */ } __packed; /* * Memory Device */ #define SMBIOS_MDFF_UNKNOWN 0x02 /* unknown */ #define SMBIOS_MDT_UNKNOWN 0x02 /* unknown */ #define SMBIOS_MDF_UNKNOWN 0x0004 /* unknown */ struct smbios_table_type17 { struct smbios_structure header; uint16_t arrayhand; /* handle of physl mem array */ uint16_t errhand; /* handle of mem error data */ uint16_t twidth; /* total width in bits */ uint16_t dwidth; 
/* data width in bits */ uint16_t size; /* size in bytes */ uint8_t form; /* form factor */ uint8_t set; /* set */ uint8_t dloc; /* device locator string */ uint8_t bloc; /* phys bank locator string */ uint8_t type; /* memory type */ uint16_t flags; /* memory characteristics */ uint16_t maxspeed; /* maximum speed in mhz */ uint8_t manufacturer; /* manufacturer string */ uint8_t serial; /* serial number string */ uint8_t asset; /* asset tag string */ uint8_t part; /* part number string */ uint8_t attributes; /* attributes */ uint32_t xsize; /* extended size in mbs */ uint16_t curspeed; /* current speed in mhz */ uint16_t minvoltage; /* minimum voltage */ uint16_t maxvoltage; /* maximum voltage */ uint16_t curvoltage; /* configured voltage */ } __packed; /* * Memory Array Mapped Address */ struct smbios_table_type19 { struct smbios_structure header; uint32_t saddr; /* start phys addr in kb */ uint32_t eaddr; /* end phys addr in kb */ uint16_t arrayhand; /* physical mem array handle */ uint8_t width; /* num of dev in row */ uint64_t xsaddr; /* start phys addr in bytes */ uint64_t xeaddr; /* end phys addr in bytes */ } __packed; /* * System Boot Information */ #define SMBIOS_BOOT_NORMAL 0 /* no errors detected */ struct smbios_table_type32 { struct smbios_structure header; uint8_t reserved[6]; uint8_t status; /* boot status */ } __packed; /* * End-of-Table */ struct smbios_table_type127 { struct smbios_structure header; } __packed; struct smbios_table_type0 smbios_type0_template = { { SMBIOS_TYPE_BIOS, sizeof (struct smbios_table_type0), 0 }, 1, /* bios vendor string */ 2, /* bios version string */ 0xF000, /* bios address segment location */ 3, /* bios release date */ 0x0, /* bios size (64k * (n + 1) is the size in bytes) */ SMBIOS_FL_ISA | SMBIOS_FL_PCI | SMBIOS_FL_SHADOW | SMBIOS_FL_CDBOOT | SMBIOS_FL_EDD, { SMBIOS_XB1_FL_ACPI, SMBIOS_XB2_FL_BBS | SMBIOS_XB2_FL_VM }, 0x0, /* bios major release */ 0x0, /* bios minor release */ 0xff, /* embedded controller firmware major release */ 0xff /* embedded controller firmware minor release */ }; const char *smbios_type0_strings[] = { "BHYVE", /* vendor string */ "1.00", /* bios version string */ "03/14/2014", /* bios release date string */ NULL }; struct smbios_table_type1 smbios_type1_template = { { SMBIOS_TYPE_SYSTEM, sizeof (struct smbios_table_type1), 0 }, 1, /* manufacturer string */ 2, /* product string */ 3, /* version string */ 4, /* serial number string */ { 0 }, SMBIOS_WAKEUP_SWITCH, 5, /* sku string */ 6 /* family string */ }; static int smbios_type1_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); const char *smbios_type1_strings[] = { " ", /* manufacturer string */ "BHYVE", /* product name string */ "1.0", /* version string */ "None", /* serial number string */ "None", /* sku string */ " ", /* family name string */ NULL }; struct smbios_table_type3 smbios_type3_template = { { SMBIOS_TYPE_CHASSIS, sizeof (struct smbios_table_type3), 0 }, 1, /* manufacturer string */ SMBIOS_CHT_UNKNOWN, 2, /* version string */ 3, /* serial number string */ 4, /* asset tag string */ SMBIOS_CHST_SAFE, SMBIOS_CHST_SAFE, SMBIOS_CHST_SAFE, SMBIOS_CHSC_NONE, 0, /* height in 'u's (0=enclosure height unspecified) */ 0, /* number of power cords (0=number unspecified) */ 0, /* number of contained element records */ 0, /* length of records */ 5 /* sku number string */ }; const char *smbios_type3_strings[] = { " ", /* manufacturer string */ "1.0", /* version string */ 
"None", /* serial number string */ "None", /* asset tag string */ "None", /* sku number string */ NULL }; struct smbios_table_type4 smbios_type4_template = { { SMBIOS_TYPE_PROCESSOR, sizeof (struct smbios_table_type4), 0 }, 1, /* socket designation string */ SMBIOS_PRT_CENTRAL, SMBIOS_PRF_OTHER, 2, /* manufacturer string */ 0, /* cpuid */ 3, /* version string */ 0, /* voltage */ 0, /* external clock frequency in mhz (0=unknown) */ 0, /* maximum frequency in mhz (0=unknown) */ 0, /* current frequency in mhz (0=unknown) */ SMBIOS_PRS_PRESENT | SMBIOS_PRS_ENABLED, SMBIOS_PRU_NONE, -1, /* l1 cache handle */ -1, /* l2 cache handle */ -1, /* l3 cache handle */ 4, /* serial number string */ 5, /* asset tag string */ 6, /* part number string */ 0, /* cores per socket (0=unknown) */ 0, /* enabled cores per socket (0=unknown) */ 0, /* threads per socket (0=unknown) */ SMBIOS_PFL_64B, SMBIOS_PRF_OTHER }; const char *smbios_type4_strings[] = { " ", /* socket designation string */ " ", /* manufacturer string */ " ", /* version string */ "None", /* serial number string */ "None", /* asset tag string */ "None", /* part number string */ NULL }; static int smbios_type4_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); struct smbios_table_type16 smbios_type16_template = { { SMBIOS_TYPE_MEMARRAY, sizeof (struct smbios_table_type16), 0 }, SMBIOS_MAL_SYSMB, SMBIOS_MAU_SYSTEM, SMBIOS_MAE_NONE, 0x80000000, /* max mem capacity in kb (0x80000000=use extended) */ -1, /* handle of error (if any) */ 0, /* number of slots or sockets (TBD) */ 0 /* extended maximum memory capacity in bytes (TBD) */ }; static int smbios_type16_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); struct smbios_table_type17 smbios_type17_template = { { SMBIOS_TYPE_MEMDEVICE, sizeof (struct smbios_table_type17), 0 }, -1, /* handle of physical memory array */ -1, /* handle of memory error data */ 64, /* total width in bits including ecc */ 64, /* data width in bits */ 0x7fff, /* size in bytes (0x7fff=use extended)*/ SMBIOS_MDFF_UNKNOWN, 0, /* set (0x00=none, 0xff=unknown) */ 1, /* device locator string */ 2, /* physical bank locator string */ SMBIOS_MDT_UNKNOWN, SMBIOS_MDF_UNKNOWN, 0, /* maximum memory speed in mhz (0=unknown) */ 3, /* manufacturer string */ 4, /* serial number string */ 5, /* asset tag string */ 6, /* part number string */ 0, /* attributes (0=unknown rank information) */ 0, /* extended size in mb (TBD) */ 0, /* current speed in mhz (0=unknown) */ 0, /* minimum voltage in mv (0=unknown) */ 0, /* maximum voltage in mv (0=unknown) */ 0 /* configured voltage in mv (0=unknown) */ }; const char *smbios_type17_strings[] = { " ", /* device locator string */ " ", /* physical bank locator string */ " ", /* manufacturer string */ "None", /* serial number string */ "None", /* asset tag string */ "None", /* part number string */ NULL }; static int smbios_type17_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); struct smbios_table_type19 smbios_type19_template = { { SMBIOS_TYPE_MEMARRAYMAP, sizeof (struct smbios_table_type19), 0 }, 0xffffffff, /* starting phys addr in kb (0xffffffff=use ext) */ 0xffffffff, /* ending phys addr in kb (0xffffffff=use ext) */ -1, /* physical memory array handle */ 1, /* number of devices that form a row */ 0, /* extended starting 
phys addr in bytes (TDB) */ 0 /* extended ending phys addr in bytes (TDB) */ }; static int smbios_type19_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); struct smbios_table_type32 smbios_type32_template = { { SMBIOS_TYPE_BOOT, sizeof (struct smbios_table_type32), 0 }, { 0, 0, 0, 0, 0, 0 }, SMBIOS_BOOT_NORMAL }; struct smbios_table_type127 smbios_type127_template = { { SMBIOS_TYPE_EOT, sizeof (struct smbios_table_type127), 0 } }; static int smbios_generic_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size); static struct smbios_template_entry smbios_template[] = { { (struct smbios_structure *)&smbios_type0_template, smbios_type0_strings, smbios_generic_initializer }, { (struct smbios_structure *)&smbios_type1_template, smbios_type1_strings, smbios_type1_initializer }, { (struct smbios_structure *)&smbios_type3_template, smbios_type3_strings, smbios_generic_initializer }, { (struct smbios_structure *)&smbios_type4_template, smbios_type4_strings, smbios_type4_initializer }, { (struct smbios_structure *)&smbios_type16_template, NULL, smbios_type16_initializer }, { (struct smbios_structure *)&smbios_type17_template, smbios_type17_strings, smbios_type17_initializer }, { (struct smbios_structure *)&smbios_type19_template, NULL, smbios_type19_initializer }, { (struct smbios_structure *)&smbios_type32_template, NULL, smbios_generic_initializer }, { (struct smbios_structure *)&smbios_type127_template, NULL, smbios_generic_initializer }, { NULL,NULL, NULL } }; static uint64_t guest_lomem, guest_himem; static uint16_t type16_handle; static int smbios_generic_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { struct smbios_structure *entry; memcpy(curaddr, template_entry, template_entry->length); entry = (struct smbios_structure *)curaddr; entry->handle = *n + 1; curaddr += entry->length; if (template_strings != NULL) { int i; for (i = 0; template_strings[i] != NULL; i++) { const char *string; int len; string = template_strings[i]; len = strlen(string) + 1; memcpy(curaddr, string, len); curaddr += len; } *curaddr = '\0'; curaddr++; } else { /* Minimum string section is double nul */ *curaddr = '\0'; curaddr++; *curaddr = '\0'; curaddr++; } (*n)++; *endaddr = curaddr; return (0); } static int smbios_type1_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { struct smbios_table_type1 *type1; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type1 = (struct smbios_table_type1 *)curaddr; if (guest_uuid_str != NULL) { uuid_t uuid; uint32_t status; uuid_from_string(guest_uuid_str, &uuid, &status); if (status != uuid_s_ok) return (-1); uuid_enc_le(&type1->uuid, &uuid); } else { MD5_CTX mdctx; u_char digest[16]; char hostname[MAXHOSTNAMELEN]; /* * Universally unique and yet reproducible are an * oxymoron, however reproducible is desirable in * this case. */ if (gethostname(hostname, sizeof(hostname))) return (-1); MD5Init(&mdctx); MD5Update(&mdctx, vmname, strlen(vmname)); MD5Update(&mdctx, hostname, sizeof(hostname)); MD5Final(digest, &mdctx); /* * Set the variant and version number. 
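 * (Illustrative note: the masking below mirrors an RFC 4122 name-based,
 * version-3 UUID: the top nibble of byte 6 is forced to 0x3 (the version
 * field) and the top two bits of byte 8 to binary 10 (the variant), with
 * the MD5 of vmname plus hostname supplying the remaining bits.)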
*/ digest[6] &= 0x0F; digest[6] |= 0x30; /* version 3 */ digest[8] &= 0x3F; digest[8] |= 0x80; memcpy(&type1->uuid, digest, sizeof (digest)); } return (0); } static int smbios_type4_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { int i; - for (i = 0; i < guest_ncpus; i++) { + for (i = 0; i < sockets; i++) { struct smbios_table_type4 *type4; char *p; int nstrings, len; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type4 = (struct smbios_table_type4 *)curaddr; p = curaddr + sizeof (struct smbios_table_type4); nstrings = 0; while (p < *endaddr - 1) { if (*p++ == '\0') nstrings++; } len = sprintf(*endaddr - 1, "CPU #%d", i) + 1; *endaddr += len - 1; *(*endaddr) = '\0'; (*endaddr)++; type4->socket = nstrings + 1; + /* Revise cores and threads after update to smbios 3.0 */ + if (cores > 254) + type4->cores = 0; + else + type4->cores = cores; + /* This threads is total threads in a socket */ + if ((cores * threads) > 254) + type4->threads = 0; + else + type4->threads = (cores * threads); curaddr = *endaddr; } return (0); } static int smbios_type16_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { struct smbios_table_type16 *type16; type16_handle = *n; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type16 = (struct smbios_table_type16 *)curaddr; type16->xsize = guest_lomem + guest_himem; type16->ndevs = guest_himem > 0 ? 2 : 1; return (0); } static int smbios_type17_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { struct smbios_table_type17 *type17; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type17 = (struct smbios_table_type17 *)curaddr; type17->arrayhand = type16_handle; type17->xsize = guest_lomem; if (guest_himem > 0) { curaddr = *endaddr; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type17 = (struct smbios_table_type17 *)curaddr; type17->arrayhand = type16_handle; type17->xsize = guest_himem; } return (0); } static int smbios_type19_initializer(struct smbios_structure *template_entry, const char **template_strings, char *curaddr, char **endaddr, uint16_t *n, uint16_t *size) { struct smbios_table_type19 *type19; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type19 = (struct smbios_table_type19 *)curaddr; type19->arrayhand = type16_handle; type19->xsaddr = 0; type19->xeaddr = guest_lomem; if (guest_himem > 0) { curaddr = *endaddr; smbios_generic_initializer(template_entry, template_strings, curaddr, endaddr, n, size); type19 = (struct smbios_table_type19 *)curaddr; type19->arrayhand = type16_handle; type19->xsaddr = 4*GB; type19->xeaddr = guest_himem; } return (0); } static void smbios_ep_initializer(struct smbios_entry_point *smbios_ep, uint32_t staddr) { memset(smbios_ep, 0, sizeof(*smbios_ep)); memcpy(smbios_ep->eanchor, SMBIOS_ENTRY_EANCHOR, SMBIOS_ENTRY_EANCHORLEN); smbios_ep->eplen = 0x1F; assert(sizeof (struct smbios_entry_point) == smbios_ep->eplen); smbios_ep->major = 2; smbios_ep->minor = 6; smbios_ep->revision = 0; memcpy(smbios_ep->ianchor, SMBIOS_ENTRY_IANCHOR, SMBIOS_ENTRY_IANCHORLEN); smbios_ep->staddr = staddr; smbios_ep->bcdrev = 0x24; } static void smbios_ep_finalizer(struct 
smbios_entry_point *smbios_ep, uint16_t len, uint16_t num, uint16_t maxssize) { uint8_t checksum; int i; smbios_ep->maxssize = maxssize; smbios_ep->stlen = len; smbios_ep->stnum = num; checksum = 0; for (i = 0x10; i < 0x1f; i++) { checksum -= ((uint8_t *)smbios_ep)[i]; } smbios_ep->ichecksum = checksum; checksum = 0; for (i = 0; i < 0x1f; i++) { checksum -= ((uint8_t *)smbios_ep)[i]; } smbios_ep->echecksum = checksum; } int smbios_build(struct vmctx *ctx) { struct smbios_entry_point *smbios_ep; uint16_t n; uint16_t maxssize; char *curaddr, *startaddr, *ststartaddr; int i; int err; guest_lomem = vm_get_lowmem_size(ctx); guest_himem = vm_get_highmem_size(ctx); startaddr = paddr_guest2host(ctx, SMBIOS_BASE, SMBIOS_MAX_LENGTH); if (startaddr == NULL) { fprintf(stderr, "smbios table requires mapped mem\n"); return (ENOMEM); } curaddr = startaddr; smbios_ep = (struct smbios_entry_point *)curaddr; smbios_ep_initializer(smbios_ep, SMBIOS_BASE + sizeof(struct smbios_entry_point)); curaddr += sizeof(struct smbios_entry_point); ststartaddr = curaddr; n = 0; maxssize = 0; for (i = 0; smbios_template[i].entry != NULL; i++) { struct smbios_structure *entry; const char **strings; initializer_func_t initializer; char *endaddr; uint16_t size; entry = smbios_template[i].entry; strings = smbios_template[i].strings; initializer = smbios_template[i].initializer; err = (*initializer)(entry, strings, curaddr, &endaddr, &n, &size); if (err != 0) return (err); if (size > maxssize) maxssize = size; curaddr = endaddr; } assert(curaddr - startaddr < SMBIOS_MAX_LENGTH); smbios_ep_finalizer(smbios_ep, curaddr - ststartaddr, n, maxssize); return (0); } Index: user/ngie/bug-237403/usr.sbin/bsdinstall/scripts/netconfig_ipv4 =================================================================== --- user/ngie/bug-237403/usr.sbin/bsdinstall/scripts/netconfig_ipv4 (revision 346925) +++ user/ngie/bug-237403/usr.sbin/bsdinstall/scripts/netconfig_ipv4 (revision 346926) @@ -1,102 +1,103 @@ #!/bin/sh #- # Copyright (c) 2011 Nathan Whitehorn # Copyright (c) 2013-2015 Devin Teske # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # 1. Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # 2. Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # # THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE # FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL # DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS # OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) # HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY # OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. # # $FreeBSD$ # ############################################################ INCLUDES BSDCFG_SHARE="/usr/share/bsdconfig" . $BSDCFG_SHARE/common.subr || exit 1 f_dprintf "%s: loading includes..." 
"$0" f_include $BSDCFG_SHARE/dialog.subr ############################################################ MAIN INTERFACE=$1 IFCONFIG_PREFIX="$2" test -z "$IFCONFIG_PREFIX" || IFCONFIG_PREFIX="$2 " case "${INTERFACE}" in "") dialog --backtitle 'FreeBSD Installer' --title 'Network Configuration' \ --msgbox 'No interface specified for IPv4 configuration.' 0 0 exit 1 ;; esac dialog --backtitle 'FreeBSD Installer' --title 'Network Configuration' --yesno 'Would you like to use DHCP to configure this interface?' 0 0 if [ $? -eq $DIALOG_OK ]; then if [ ! -z $BSDINSTALL_CONFIGCURRENT ]; then + ifconfig $INTERFACE up dialog --backtitle 'FreeBSD Installer' --infobox "Acquiring DHCP lease..." 0 0 err=$( dhclient $INTERFACE 2>&1 ) if [ $? -ne 0 ]; then f_dprintf "%s" "$err" dialog --backtitle 'FreeBSD Installer' --msgbox "DHCP lease acquisition failed." 0 0 exec $0 ${INTERFACE} "${IFCONFIG_PREFIX}" fi fi echo ifconfig_$INTERFACE=\"${IFCONFIG_PREFIX}DHCP\" >> $BSDINSTALL_TMPETC/._rc.conf.net exit 0 fi IP_ADDRESS=`ifconfig $INTERFACE inet | awk '/inet/ {printf("%s\n", $2); }'` NETMASK=`ifconfig $INTERFACE inet | awk '/inet/ {printf("%s\n", $4); }'` ROUTER=`netstat -rn -f inet | awk '/default/ {printf("%s\n", $2);}'` exec 3>&1 IF_CONFIG=$(dialog --backtitle 'FreeBSD Installer' --title 'Network Configuration' --form 'Static Network Interface Configuration' 0 0 0 \ 'IP Address' 1 0 "$IP_ADDRESS" 1 20 16 0 \ 'Subnet Mask' 2 0 "$NETMASK" 2 20 16 0 \ 'Default Router' 3 0 "$ROUTER" 3 20 16 0 \ 2>&1 1>&3) if [ $? -eq $DIALOG_CANCEL ]; then exit 1; fi exec 3>&- echo $INTERFACE $IF_CONFIG | awk -v prefix="$IFCONFIG_PREFIX" '{ printf("ifconfig_%s=\"%s\inet %s netmask %s\"\n", $1, prefix, $2, $3); printf("defaultrouter=\"%s\"\n", $4); }' >> $BSDINSTALL_TMPETC/._rc.conf.net retval=$? if [ "$BSDINSTALL_CONFIGCURRENT" ]; then . $BSDINSTALL_TMPETC/._rc.conf.net if [ -n "$2" ]; then ifconfig $INTERFACE `eval echo \\\$ifconfig_$INTERFACE | sed "s|$2||"` else ifconfig $INTERFACE `eval echo \\\$ifconfig_$INTERFACE` fi if [ "$defaultrouter" ]; then route delete -inet default route add -inet default $defaultrouter retval=$? fi fi exit $retval ################################################################################ # END ################################################################################ Index: user/ngie/bug-237403/usr.sbin/kldxref/ef_mips.c =================================================================== --- user/ngie/bug-237403/usr.sbin/kldxref/ef_mips.c (nonexistent) +++ user/ngie/bug-237403/usr.sbin/kldxref/ef_mips.c (revision 346926) @@ -0,0 +1,99 @@ +/*- + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD + * + * Copyright (c) 2019 John Baldwin + * + * This software was developed by SRI International and the University of + * Cambridge Computer Laboratory (Department of Computer Science and + * Technology) under DARPA contract HR0011-18-C-0016 ("ECATS"), as part of the + * DARPA SSITH research programme. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD$ + */ + +#include +#include + +#include +#include + +#include "ef.h" + +/* + * Apply relocations to the values we got from the file. `relbase' is the + * target relocation address of the section, and `dataoff' is the target + * relocation address of the data in `dest'. + */ +int +ef_reloc(struct elf_file *ef, const void *reldata, int reltype, Elf_Off relbase, + Elf_Off dataoff, size_t len, void *dest) +{ + Elf_Addr *where, val; + const Elf_Rel *rel; + const Elf_Rela *rela; + Elf_Addr addend, addr; + Elf_Size rtype, symidx; + + switch (reltype) { + case EF_RELOC_REL: + rel = (const Elf_Rel *)reldata; + where = (Elf_Addr *)((char *)dest + relbase + rel->r_offset - + dataoff); + addend = 0; + rtype = ELF_R_TYPE(rel->r_info); + symidx = ELF_R_SYM(rel->r_info); + break; + case EF_RELOC_RELA: + rela = (const Elf_Rela *)reldata; + where = (Elf_Addr *)((char *)dest + relbase + rela->r_offset - + dataoff); + addend = rela->r_addend; + rtype = ELF_R_TYPE(rela->r_info); + symidx = ELF_R_SYM(rela->r_info); + break; + default: + return (EINVAL); + } + + if ((char *)where < (char *)dest || (char *)where >= (char *)dest + len) + return (0); + + if (reltype == EF_RELOC_REL) + addend = *where; + + switch (rtype) { +#ifdef __LP64__ + case R_MIPS_64: /* S + A */ +#else + case R_MIPS_32: /* S + A */ +#endif + addr = EF_SYMADDR(ef, symidx); + val = addr + addend; + *where = val; + break; + default: + warnx("unhandled relocation type %d", (int)rtype); + } + return (0); +} Property changes on: user/ngie/bug-237403/usr.sbin/kldxref/ef_mips.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: user/ngie/bug-237403/usr.sbin/nfsdumpstate/nfsdumpstate.c =================================================================== --- user/ngie/bug-237403/usr.sbin/nfsdumpstate/nfsdumpstate.c (revision 346925) +++ user/ngie/bug-237403/usr.sbin/nfsdumpstate/nfsdumpstate.c (revision 346926) @@ -1,295 +1,315 @@ /*- * Copyright (c) 2009 Rick Macklem, University of Guelph * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #define DUMPSIZE 10000 static void dump_lockstate(char *); static void dump_openstate(void); static void usage(void); static char *open_flags(uint32_t); static char *deleg_flags(uint32_t); static char *lock_flags(uint32_t); static char *client_flags(uint32_t); static struct nfsd_dumpclients dp[DUMPSIZE]; static struct nfsd_dumplocks lp[DUMPSIZE]; static char flag_string[20]; int main(int argc, char **argv) { int ch, openstate; char *lockfile; if (modfind("nfsd") < 0) errx(1, "nfsd not loaded - self terminating"); openstate = 0; lockfile = NULL; while ((ch = getopt(argc, argv, "ol:")) != -1) switch (ch) { case 'o': openstate = 1; break; case 'l': lockfile = optarg; break; default: usage(); } argc -= optind; argv += optind; if (openstate == 0 && lockfile == NULL) openstate = 1; else if (openstate != 0 && lockfile != NULL) errx(1, "-o and -l cannot both be specified"); /* * For -o, dump all open/lock state. * For -l, dump lock state for that file. */ if (openstate != 0) dump_openstate(); else dump_lockstate(lockfile); exit(0); } static void usage(void) { errx(1, "usage: nfsdumpstate [-o] [-l]"); } /* * Dump all open/lock state. */ static void dump_openstate(void) { struct nfsd_dumplist dumplist; int cnt, i; +#ifdef INET6 char nbuf[INET6_ADDRSTRLEN]; +#endif dumplist.ndl_size = DUMPSIZE; dumplist.ndl_list = (void *)dp; if (nfssvc(NFSSVC_DUMPCLIENTS, &dumplist) < 0) errx(1, "Can't perform dump clients syscall"); printf("%-13s %9.9s %9.9s %9.9s %9.9s %9.9s %9.9s %-45s %s\n", "Flags", "OpenOwner", "Open", "LockOwner", "Lock", "Deleg", "OldDeleg", "Clientaddr", "ClientID"); /* * Loop through results, printing them out. */ cnt = 0; while (dp[cnt].ndcl_clid.nclid_idlen > 0 && cnt < DUMPSIZE) { printf("%-13s ", client_flags(dp[cnt].ndcl_flags)); printf("%9d %9d %9d %9d %9d %9d ", dp[cnt].ndcl_nopenowners, dp[cnt].ndcl_nopens, dp[cnt].ndcl_nlockowners, dp[cnt].ndcl_nlocks, dp[cnt].ndcl_ndelegs, dp[cnt].ndcl_nolddelegs); switch (dp[cnt].ndcl_addrfam) { #ifdef INET case AF_INET: printf("%-45s ", inet_ntoa(dp[cnt].ndcl_cbaddr.sin_addr)); break; #endif #ifdef INET6 case AF_INET6: if (inet_ntop(AF_INET6, &dp[cnt].ndcl_cbaddr.sin6_addr, nbuf, sizeof(nbuf)) != NULL) printf("%-45s ", nbuf); else printf("%-45s ", " "); break; #endif } for (i = 0; i < dp[cnt].ndcl_clid.nclid_idlen; i++) printf("%02x", dp[cnt].ndcl_clid.nclid_id[i]); printf("\n"); cnt++; } } /* * Dump the lock state for a file. 
*/ static void dump_lockstate(char *fname) { struct nfsd_dumplocklist dumplocklist; int cnt, i; +#ifdef INET6 + char nbuf[INET6_ADDRSTRLEN]; +#endif dumplocklist.ndllck_size = DUMPSIZE; dumplocklist.ndllck_list = (void *)lp; dumplocklist.ndllck_fname = fname; if (nfssvc(NFSSVC_DUMPLOCKS, &dumplocklist) < 0) errx(1, "Can't dump locks for %s\n", fname); - printf("%-11s %-36s %-15s %s\n", + printf("%-11s %-36s %-45s %s\n", "Open/Lock", " Stateid or Lock Range", "Clientaddr", "Owner and ClientID"); /* * Loop through results, printing them out. */ cnt = 0; while (lp[cnt].ndlck_clid.nclid_idlen > 0 && cnt < DUMPSIZE) { if (lp[cnt].ndlck_flags & NFSLCK_OPEN) printf("%-11s %9d %08x %08x %08x ", open_flags(lp[cnt].ndlck_flags), lp[cnt].ndlck_stateid.seqid, lp[cnt].ndlck_stateid.other[0], lp[cnt].ndlck_stateid.other[1], lp[cnt].ndlck_stateid.other[2]); else if (lp[cnt].ndlck_flags & (NFSLCK_DELEGREAD | NFSLCK_DELEGWRITE)) printf("%-11s %9d %08x %08x %08x ", deleg_flags(lp[cnt].ndlck_flags), lp[cnt].ndlck_stateid.seqid, lp[cnt].ndlck_stateid.other[0], lp[cnt].ndlck_stateid.other[1], lp[cnt].ndlck_stateid.other[2]); else printf("%-11s %17jd %17jd ", lock_flags(lp[cnt].ndlck_flags), lp[cnt].ndlck_first, lp[cnt].ndlck_end); - if (lp[cnt].ndlck_addrfam == AF_INET) - printf("%-15s ", + switch (lp[cnt].ndlck_addrfam) { +#ifdef INET + case AF_INET: + printf("%-45s ", inet_ntoa(lp[cnt].ndlck_cbaddr.sin_addr)); - else - printf("%-15s ", " "); + break; +#endif +#ifdef INET6 + case AF_INET6: + if (inet_ntop(AF_INET6, &lp[cnt].ndlck_cbaddr.sin6_addr, + nbuf, sizeof(nbuf)) != NULL) + printf("%-45s ", nbuf); + else + printf("%-45s ", " "); + break; +#endif + default: + printf("%-45s ", " "); + break; + } for (i = 0; i < lp[cnt].ndlck_owner.nclid_idlen; i++) printf("%02x", lp[cnt].ndlck_owner.nclid_id[i]); printf(" "); for (i = 0; i < lp[cnt].ndlck_clid.nclid_idlen; i++) printf("%02x", lp[cnt].ndlck_clid.nclid_id[i]); printf("\n"); cnt++; } } /* * Parse the Open/Lock flag bits and create a string to be printed. 
*/ static char * open_flags(uint32_t flags) { int i, j; strlcpy(flag_string, "Open ", sizeof (flag_string)); i = 5; if (flags & NFSLCK_READACCESS) flag_string[i++] = 'R'; if (flags & NFSLCK_WRITEACCESS) flag_string[i++] = 'W'; flag_string[i++] = ' '; flag_string[i++] = 'D'; flag_string[i] = 'N'; j = i; if (flags & NFSLCK_READDENY) flag_string[i++] = 'R'; if (flags & NFSLCK_WRITEDENY) flag_string[i++] = 'W'; if (i == j) i++; flag_string[i] = '\0'; return (flag_string); } static char * deleg_flags(uint32_t flags) { if (flags & NFSLCK_DELEGREAD) strlcpy(flag_string, "Deleg R", sizeof (flag_string)); else strlcpy(flag_string, "Deleg W", sizeof (flag_string)); return (flag_string); } static char * lock_flags(uint32_t flags) { if (flags & NFSLCK_READ) strlcpy(flag_string, "Lock R", sizeof (flag_string)); else strlcpy(flag_string, "Lock W", sizeof (flag_string)); return (flag_string); } static char * client_flags(uint32_t flags) { flag_string[0] = '\0'; if (flags & LCL_NEEDSCONFIRM) strlcat(flag_string, "NC ", sizeof (flag_string)); if (flags & LCL_CALLBACKSON) strlcat(flag_string, "CB ", sizeof (flag_string)); if (flags & LCL_GSS) strlcat(flag_string, "GSS ", sizeof (flag_string)); if (flags & LCL_ADMINREVOKED) strlcat(flag_string, "REV", sizeof (flag_string)); return (flag_string); } Index: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf =================================================================== --- user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf (revision 346925) +++ user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf (nonexistent) @@ -1,16 +0,0 @@ -# $FreeBSD$ -# -# To disable this repository, instead of modifying or removing this file, -# create a /usr/local/etc/pkg/repos/FreeBSD.conf file: -# -# mkdir -p /usr/local/etc/pkg/repos -# echo "FreeBSD: { enabled: no }" > /usr/local/etc/pkg/repos/FreeBSD.conf -# - -FreeBSD: { - url: "pkg+http://pkg.FreeBSD.org/${ABI}/latest", - mirror_type: "srv", - signature_type: "fingerprints", - fingerprints: "/usr/share/keys/pkg", - enabled: yes -} Property changes on: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf ___________________________________________________________________ Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Index: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.latest =================================================================== --- user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.latest (nonexistent) +++ user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.latest (revision 346926) @@ -0,0 +1,16 @@ +# $FreeBSD$ +# +# To disable this repository, instead of modifying or removing this file, +# create a /usr/local/etc/pkg/repos/FreeBSD.conf file: +# +# mkdir -p /usr/local/etc/pkg/repos +# echo "FreeBSD: { enabled: no }" > /usr/local/etc/pkg/repos/FreeBSD.conf +# + +FreeBSD: { + url: "pkg+http://pkg.FreeBSD.org/${ABI}/latest", + mirror_type: "srv", + signature_type: "fingerprints", + fingerprints: "/usr/share/keys/pkg", + enabled: yes +} Property changes on: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.latest ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Index: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.quarterly =================================================================== --- user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.quarterly (nonexistent) +++ user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.quarterly (revision 346926) @@ -0,0 +1,16 @@ +# $FreeBSD$ +# +# To disable this repository, instead of 
modifying or removing this file, +# create a /usr/local/etc/pkg/repos/FreeBSD.conf file: +# +# mkdir -p /usr/local/etc/pkg/repos +# echo "FreeBSD: { enabled: no }" > /usr/local/etc/pkg/repos/FreeBSD.conf +# + +FreeBSD: { + url: "pkg+http://pkg.FreeBSD.org/${ABI}/quarterly", + mirror_type: "srv", + signature_type: "fingerprints", + fingerprints: "/usr/share/keys/pkg", + enabled: yes +} Property changes on: user/ngie/bug-237403/usr.sbin/pkg/FreeBSD.conf.quarterly ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Index: user/ngie/bug-237403/usr.sbin/pkg/Makefile =================================================================== --- user/ngie/bug-237403/usr.sbin/pkg/Makefile (revision 346925) +++ user/ngie/bug-237403/usr.sbin/pkg/Makefile (revision 346926) @@ -1,14 +1,16 @@ # $FreeBSD$ -CONFS= FreeBSD.conf +PKGCONFBRANCH?= latest +CONFS= FreeBSD.conf.${PKGCONFBRANCH} +CONFSNAME= FreeBSD.conf CONFSDIR= /etc/pkg CONFSMODE= 644 PROG= pkg SRCS= pkg.c dns_utils.c config.c MAN= pkg.7 CFLAGS+=-I${SRCTOP}/contrib/libucl/include .PATH: ${SRCTOP}/contrib/libucl/include LIBADD= archive fetch ucl sbuf crypto ssl .include Index: user/ngie/bug-237403 =================================================================== --- user/ngie/bug-237403 (revision 346925) +++ user/ngie/bug-237403 (revision 346926) Property changes on: user/ngie/bug-237403 ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head:r346621-346925
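
Editorial note (not part of the patch): the bhyve smbios.c hunk earlier in this diff emits one SMBIOS type 4 structure per socket and clamps the 8-bit core/thread count fields to 0 ("unknown") when the real values will not fit, since the wider 16-bit counts only exist from SMBIOS 3.0 onward. The following is a minimal standalone sketch of that clamping logic only; the function and variable names here are hypothetical illustrations, not identifiers from the tree.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the per-socket core/thread clamping performed by the
     * type 4 initializer in the hunk above.  SMBIOS 2.x provides only
     * 8-bit core and thread counts, so values that cannot be
     * represented are reported as 0 (unknown).
     */
    static void
    fill_type4_counts(unsigned cores, unsigned threads,
        uint8_t *core_count, uint8_t *thread_count)
    {
            /* Cores per socket; 0 when the value does not fit. */
            *core_count = (cores > 254) ? 0 : (uint8_t)cores;

            /* Thread count is the total threads in the socket. */
            *thread_count = (cores * threads > 254) ?
                0 : (uint8_t)(cores * threads);
    }

    int
    main(void)
    {
            uint8_t cc, tc;

            /* Fits: 64 cores, 128 threads per socket. */
            fill_type4_counts(64, 2, &cc, &tc);
            printf("cores=%u threads=%u\n", cc, tc);

            /* Does not fit in 8 bits: reported as unknown (0). */
            fill_type4_counts(256, 2, &cc, &tc);
            printf("cores=%u threads=%u\n", cc, tc);
            return (0);
    }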