diff --git a/lib/libpmc/libpmc_json.cc b/lib/libpmc/libpmc_json.cc
index 2e9857ca98a8..76c5a02732ca 100644
--- a/lib/libpmc/libpmc_json.cc
+++ b/lib/libpmc/libpmc_json.cc
@@ -1,395 +1,398 @@
 /*-
  * SPDX-License-Identifier: BSD-2-Clause
  *
  * Copyright (c) 2018, Matthew Macy
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  *
  */

 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include

 using std::string;

 static const char *typenames[] = {
 	"",
 	"{\"type\": \"closelog\"}\n",
 	"{\"type\": \"dropnotify\"}\n",
 	"{\"type\": \"initialize\"",
 	"",
 	"{\"type\": \"pmcallocate\"",
 	"{\"type\": \"pmcattach\"",
 	"{\"type\": \"pmcdetach\"",
 	"{\"type\": \"proccsw\"",
 	"{\"type\": \"procexec\"",
 	"{\"type\": \"procexit\"",
 	"{\"type\": \"procfork\"",
 	"{\"type\": \"sysexit\"",
 	"{\"type\": \"userdata\"",
 	"{\"type\": \"map_in\"",
 	"{\"type\": \"map_out\"",
 	"{\"type\": \"callchain\"",
 	"{\"type\": \"pmcallocatedyn\"",
 	"{\"type\": \"thr_create\"",
 	"{\"type\": \"thr_exit\"",
 	"{\"type\": \"proc_create\"",
 };

 static string
 startentry(struct pmclog_ev *ev)
 {
 	char eventbuf[128];

 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"tsc\": \"%jd\"",
 	    typenames[ev->pl_type], (uintmax_t)ev->pl_ts.tv_sec);
 	return (string(eventbuf));
 }

 static string
 initialize_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[256];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"version\": \"0x%08x\", \"arch\": \"0x%08x\", \"cpuid\": \"%s\", "
 	    "\"tsc_freq\": \"%jd\", \"sec\": \"%jd\", \"nsec\": \"%jd\"}\n",
 	    startent.c_str(), ev->pl_u.pl_i.pl_version, ev->pl_u.pl_i.pl_arch,
 	    ev->pl_u.pl_i.pl_cpuid, (uintmax_t)ev->pl_u.pl_i.pl_tsc_freq,
 	    (uintmax_t)ev->pl_u.pl_i.pl_ts.tv_sec,
 	    (uintmax_t)ev->pl_u.pl_i.pl_ts.tv_nsec);
 	return string(eventbuf);
 }

 static string
 pmcallocate_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[256];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pmcid\": \"0x%08x\", \"event\": \"0x%08x\", \"flags\": \"0x%08x\", "
 	    "\"rate\": \"%jd\"}\n",
 	    startent.c_str(), ev->pl_u.pl_a.pl_pmcid, ev->pl_u.pl_a.pl_event,
 	    ev->pl_u.pl_a.pl_flags, (intmax_t)ev->pl_u.pl_a.pl_rate);
 	return string(eventbuf);
 }

 static string
 pmcattach_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string
startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\", \"pathname\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_t.pl_pmcid, ev->pl_u.pl_t.pl_pid,
 	    ev->pl_u.pl_t.pl_pathname);
 	return string(eventbuf);
 }

 static string
 pmcdetach_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\"}\n",
 	    startent.c_str(), ev->pl_u.pl_d.pl_pmcid, ev->pl_u.pl_d.pl_pid);
 	return string(eventbuf);
 }

 static string
 proccsw_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\" "
 	    "\"tid\": \"%d\", \"value\": \"0x%016jx\"}\n",
 	    startent.c_str(), ev->pl_u.pl_c.pl_pmcid, ev->pl_u.pl_c.pl_pid,
 	    ev->pl_u.pl_c.pl_tid, (uintmax_t)ev->pl_u.pl_c.pl_value);
 	return string(eventbuf);
 }

 static string
 procexec_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\", "
-	    "\"start\": \"0x%016jx\", \"pathname\": \"%s\"}\n",
+	    "\"base\": \"0x%016jx\", \"dyn\": \"0x%016jx\", "
+	    "\"pathname\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_x.pl_pmcid, ev->pl_u.pl_x.pl_pid,
-	    (uintmax_t)ev->pl_u.pl_x.pl_entryaddr, ev->pl_u.pl_x.pl_pathname);
+	    (uintmax_t)ev->pl_u.pl_x.pl_baseaddr,
+	    (uintmax_t)ev->pl_u.pl_x.pl_dynaddr,
+	    ev->pl_u.pl_x.pl_pathname);
 	return string(eventbuf);
 }

 static string
 procexit_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\", "
 	    "\"value\": \"0x%016jx\"}\n",
 	    startent.c_str(), ev->pl_u.pl_e.pl_pmcid, ev->pl_u.pl_e.pl_pid,
 	    (uintmax_t)ev->pl_u.pl_e.pl_value);
 	return string(eventbuf);
 }

 static string
 procfork_to_json(struct pmclog_ev *ev)
 {
 	char
eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"oldpid\": \"%d\", \"newpid\": \"%d\"}\n",
 	    startent.c_str(), ev->pl_u.pl_f.pl_oldpid, ev->pl_u.pl_f.pl_newpid);
 	return string(eventbuf);
 }

 static string
 sysexit_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pid\": \"%d\"}\n",
 	    startent.c_str(), ev->pl_u.pl_se.pl_pid);
 	return string(eventbuf);
 }

 static string
 userdata_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[128];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"userdata\": \"0x%08x\"}\n",
 	    startent.c_str(), ev->pl_u.pl_u.pl_userdata);
 	return string(eventbuf);
 }

 static string
 map_in_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pid\": \"%d\", "
 	    "\"start\": \"0x%016jx\", \"pathname\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_mi.pl_pid,
 	    (uintmax_t)ev->pl_u.pl_mi.pl_start, ev->pl_u.pl_mi.pl_pathname);
 	return string(eventbuf);
 }

 static string
 map_out_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[256];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"pid\": \"%d\", "
 	    "\"start\": \"0x%016jx\", \"end\": \"0x%016jx\"}\n",
 	    startent.c_str(), ev->pl_u.pl_mi.pl_pid,
 	    (uintmax_t)ev->pl_u.pl_mi.pl_start,
 	    (uintmax_t)ev->pl_u.pl_mo.pl_end);
 	return string(eventbuf);
 }

 static string
 callchain_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[1024];
 	string result;
 	uint32_t i;
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pmcid\": \"0x%08x\", \"pid\": \"%d\", \"tid\": \"%d\", "
 	    "\"cpuflags\": \"0x%08x\", \"cpuflags2\": \"0x%08x\", \"pc\": [ ",
 	    startent.c_str(), ev->pl_u.pl_cc.pl_pmcid, ev->pl_u.pl_cc.pl_pid,
 	    ev->pl_u.pl_cc.pl_tid, ev->pl_u.pl_cc.pl_cpuflags,
 	    ev->pl_u.pl_cc.pl_cpuflags2);
 	result = string(eventbuf);
 	for (i = 0; i <
ev->pl_u.pl_cc.pl_npc - 1; i++) {
 		snprintf(eventbuf, sizeof(eventbuf), "\"0x%016jx\", ",
 		    (uintmax_t)ev->pl_u.pl_cc.pl_pc[i]);
 		result += string(eventbuf);
 	}
 	snprintf(eventbuf, sizeof(eventbuf), "\"0x%016jx\"]}\n",
 	    (uintmax_t)ev->pl_u.pl_cc.pl_pc[i]);
 	result += string(eventbuf);
 	return (result);
 }

 static string
 pmcallocatedyn_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pmcid\": \"0x%08x\", \"event\": \"%d\", \"flags\": \"0x%08x\", \"evname\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_ad.pl_pmcid, ev->pl_u.pl_ad.pl_event,
 	    ev->pl_u.pl_ad.pl_flags, ev->pl_u.pl_ad.pl_evname);
 	return string(eventbuf);
 }

 static string
 proccreate_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"pid\": \"%d\", \"flags\": \"0x%08x\", \"pcomm\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_pc.pl_pid, ev->pl_u.pl_pc.pl_flags,
 	    ev->pl_u.pl_pc.pl_pcomm);
 	return string(eventbuf);
 }

 static string
 threadcreate_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[2048];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf),
 	    "%s, \"tid\": \"%d\", \"pid\": \"%d\", \"flags\": \"0x%08x\", \"tdname\": \"%s\"}\n",
 	    startent.c_str(), ev->pl_u.pl_tc.pl_tid, ev->pl_u.pl_tc.pl_pid,
 	    ev->pl_u.pl_tc.pl_flags, ev->pl_u.pl_tc.pl_tdname);
 	return string(eventbuf);
 }

 static string
 threadexit_to_json(struct pmclog_ev *ev)
 {
 	char eventbuf[256];
 	string startent;

 	startent = startentry(ev);
 	snprintf(eventbuf, sizeof(eventbuf), "%s, \"tid\": \"%d\"}\n",
 	    startent.c_str(), ev->pl_u.pl_te.pl_tid);
 	return string(eventbuf);
 }

 static string
 stub_to_json(struct pmclog_ev *ev)
 {
 	string startent;

 	startent = startentry(ev);
 	startent += string("}\n");
 	return startent;
 }

 typedef string (*jconv) (struct pmclog_ev*);

 static jconv jsonconvert[] = {
 	NULL,
 	stub_to_json,
 	stub_to_json,
 	initialize_to_json,
 	NULL,
 	pmcallocate_to_json,
 	pmcattach_to_json,
 	pmcdetach_to_json,
 	proccsw_to_json,
 	procexec_to_json,
 	procexit_to_json,
 	procfork_to_json,
 	sysexit_to_json,
 	userdata_to_json,
 	map_in_to_json,
 	map_out_to_json,
 	callchain_to_json,
 	pmcallocatedyn_to_json,
 	threadcreate_to_json,
 	threadexit_to_json,
 	proccreate_to_json,
 };

 string
 event_to_json(struct pmclog_ev *ev){

 	switch (ev->pl_type) {
 	case PMCLOG_TYPE_DROPNOTIFY:
 	case PMCLOG_TYPE_CLOSELOG:
 	case PMCLOG_TYPE_INITIALIZE:
 	case PMCLOG_TYPE_PMCALLOCATE:
 	case PMCLOG_TYPE_PMCATTACH:
 	case PMCLOG_TYPE_PMCDETACH:
 	case PMCLOG_TYPE_PROCCSW:
 	case PMCLOG_TYPE_PROCEXEC:
 	case PMCLOG_TYPE_PROCEXIT:
 	case PMCLOG_TYPE_PROCFORK:
 	case PMCLOG_TYPE_SYSEXIT:
 	case PMCLOG_TYPE_USERDATA:
 	case PMCLOG_TYPE_MAP_IN:
 	case PMCLOG_TYPE_MAP_OUT:
 	case PMCLOG_TYPE_CALLCHAIN:
 	case PMCLOG_TYPE_PMCALLOCATEDYN:
 	case PMCLOG_TYPE_THR_CREATE:
 	case PMCLOG_TYPE_THR_EXIT:
 	case PMCLOG_TYPE_PROC_CREATE:
 		return jsonconvert[ev->pl_type](ev);
 	default:
 		errx(EX_USAGE, "ERROR: unrecognized event type: %d\n", ev->pl_type);
 	}
 }
diff --git a/lib/libpmc/pmclog.c b/lib/libpmc/pmclog.c
index babcdc3c8d0d..0db91cf51bc2 100644
--- a/lib/libpmc/pmclog.c
+++ b/lib/libpmc/pmclog.c
@@ -1,599 +1,600 @@
 /*-
  * SPDX-License-Identifier: BSD-2-Clause
  *
  * Copyright (c) 2005-2007 Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */

 #include
 __FBSDID("$FreeBSD$");

 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include
 #include "libpmcinternal.h"

 #define PMCLOG_BUFFER_SIZE 512*1024

 /*
  * API NOTES
  *
  * The pmclog(3) API is oriented towards parsing an event stream in
  * "realtime", i.e., from an data source that may or may not preserve
  * record boundaries -- for example when the data source is elsewhere
  * on a network. The API allows data to be fed into the parser zero
  * or more bytes at a time.
  *
  * The state for a log file parser is maintained in a 'struct
  * pmclog_parse_state'. Parser invocations are done by calling
  * 'pmclog_read()'; this function will inform the caller when a
  * complete event is parsed.
  *
  * The parser first assembles a complete log file event in an internal
  * work area (see "ps_saved" below). Once a complete log file event
  * is read, the parser then parses it and converts it to an event
  * descriptor usable by the client. We could possibly avoid this two
  * step process by directly parsing the input log to set fields in the
  * event record.
However the parser's state machine would get
  * insanely complicated, and this code is unlikely to be used in
  * performance critical paths.
  */

 #define PMCLOG_HEADER_FROM_SAVED_STATE(PS)				\
 	(* ((uint32_t *) &(PS)->ps_saved))

 #define PMCLOG_INITIALIZE_READER(LE,A) LE = (uint32_t *) &(A)
 #define PMCLOG_READ32(LE,V) do {					\
 		(V) = *(LE)++;						\
 	} while (0)
 #define PMCLOG_READ64(LE,V) do {					\
 		uint64_t _v;						\
 		_v = (uint64_t) *(LE)++;				\
 		_v |= ((uint64_t) *(LE)++) << 32;			\
 		(V) = _v;						\
 	} while (0)

 #define PMCLOG_READSTRING(LE,DST,LEN) strlcpy((DST), (char *) (LE), (LEN))

 /*
  * Assemble a log record from '*len' octets starting from address '*data'.
  * Update 'data' and 'len' to reflect the number of bytes consumed.
  *
  * '*data' is potentially an unaligned address and '*len' octets may
  * not be enough to complete a event record.
  */

 static enum pmclog_parser_state
 pmclog_get_record(struct pmclog_parse_state *ps, char **data, ssize_t *len)
 {
 	int avail, copylen, recordsize, used;
 	uint32_t h;
 	const int HEADERSIZE = sizeof(uint32_t);
 	char *src, *dst;

 	if ((avail = *len) <= 0)
 		return (ps->ps_state = PL_STATE_ERROR);

 	src = *data;
 	used = 0;

 	if (ps->ps_state == PL_STATE_NEW_RECORD)
 		ps->ps_svcount = 0;

 	dst = (char *) &ps->ps_saved + ps->ps_svcount;

 	switch (ps->ps_state) {
 	case PL_STATE_NEW_RECORD:

 		/*
 		 * Transitions:
 		 *
 		 * Case A: avail < headersize
 		 *	-> 'expecting header'
 		 *
 		 * Case B: avail >= headersize
 		 *    B.1: avail < recordsize
 		 *	   -> 'partial record'
 		 *    B.2: avail >= recordsize
 		 *	   -> 'new record'
 		 */

 		copylen = avail < HEADERSIZE ?
avail : HEADERSIZE;
 		bcopy(src, dst, copylen);
 		ps->ps_svcount = used = copylen;

 		if (copylen < HEADERSIZE) {
 			ps->ps_state = PL_STATE_EXPECTING_HEADER;
 			goto done;
 		}

 		src += copylen;
 		dst += copylen;

 		h = PMCLOG_HEADER_FROM_SAVED_STATE(ps);
 		recordsize = PMCLOG_HEADER_TO_LENGTH(h);

 		if (recordsize <= 0)
 			goto error;

 		if (recordsize <= avail) { /* full record available */
 			bcopy(src, dst, recordsize - copylen);
 			ps->ps_svcount = used = recordsize;
 			goto done;
 		}

 		/* header + a partial record is available */
 		bcopy(src, dst, avail - copylen);
 		ps->ps_svcount = used = avail;
 		ps->ps_state = PL_STATE_PARTIAL_RECORD;

 		break;

 	case PL_STATE_EXPECTING_HEADER:

 		/*
 		 * Transitions:
 		 *
 		 * Case C: avail+saved < headersize
 		 *	-> 'expecting header'
 		 *
 		 * Case D: avail+saved >= headersize
 		 *    D.1: avail+saved < recordsize
 		 *	-> 'partial record'
 		 *    D.2: avail+saved >= recordsize
 		 *	-> 'new record'
 		 *    (see PARTIAL_RECORD handling below)
 		 */

 		if (avail + ps->ps_svcount < HEADERSIZE) {
 			bcopy(src, dst, avail);
 			ps->ps_svcount += avail;
 			used = avail;
 			break;
 		}

 		used = copylen = HEADERSIZE - ps->ps_svcount;
 		bcopy(src, dst, copylen);
 		src += copylen;
 		dst += copylen;
 		avail -= copylen;
 		ps->ps_svcount += copylen;

 		/*FALLTHROUGH*/

 	case PL_STATE_PARTIAL_RECORD:

 		/*
 		 * Transitions:
 		 *
 		 * Case E: avail+saved < recordsize
 		 *	-> 'partial record'
 		 *
 		 * Case F: avail+saved >= recordsize
 		 *	-> 'new record'
 		 */

 		h = PMCLOG_HEADER_FROM_SAVED_STATE(ps);
 		recordsize = PMCLOG_HEADER_TO_LENGTH(h);

 		if (recordsize <= 0)
 			goto error;

 		if (avail + ps->ps_svcount < recordsize) {
 			copylen = avail;
 			ps->ps_state = PL_STATE_PARTIAL_RECORD;
 		} else {
 			copylen = recordsize - ps->ps_svcount;
 			ps->ps_state = PL_STATE_NEW_RECORD;
 		}

 		bcopy(src, dst, copylen);
 		ps->ps_svcount += copylen;
 		used += copylen;
 		break;

 	default:
 		goto error;
 	}

  done:
 	*data += used;
 	*len -= used;
 	return ps->ps_state;

  error:
 	ps->ps_state = PL_STATE_ERROR;
 	return ps->ps_state;
 }

 /*
  * Get an event from the stream pointed to by '*data'. '*len'
  * indicates the number of bytes available to parse.
Arguments
  * '*data' and '*len' are updated to indicate the number of bytes
  * consumed.
  */

 static int
 pmclog_get_event(void *cookie, char **data, ssize_t *len,
     struct pmclog_ev *ev)
 {
 	int evlen, pathlen;
 	uint32_t h, *le, npc, noop;
 	enum pmclog_parser_state e;
 	struct pmclog_parse_state *ps;
 	struct pmclog_header *ph;

 	ps = (struct pmclog_parse_state *) cookie;

 	assert(ps->ps_state != PL_STATE_ERROR);

 	if ((e = pmclog_get_record(ps,data,len)) == PL_STATE_ERROR) {
 		ev->pl_state = PMCLOG_ERROR;
 		printf("state error\n");
 		return -1;
 	}

 	if (e != PL_STATE_NEW_RECORD) {
 		ev->pl_state = PMCLOG_REQUIRE_DATA;
 		return -1;
 	}

 	PMCLOG_INITIALIZE_READER(le, ps->ps_saved);
 	ev->pl_data = le;
 	ph = (struct pmclog_header *)(uintptr_t)le;

 	h = ph->pl_header;
 	if (!PMCLOG_HEADER_CHECK_MAGIC(h)) {
 		printf("bad magic\n");
 		ps->ps_state = PL_STATE_ERROR;
 		ev->pl_state = PMCLOG_ERROR;
 		return -1;
 	}

 	/* copy out the time stamp */
 	ev->pl_ts.tv_sec = ph->pl_tsc;
 	le += sizeof(*ph)/4;

 	evlen = PMCLOG_HEADER_TO_LENGTH(h);

 #define PMCLOG_GET_PATHLEN(P,E,TYPE) do {				\
 		(P) = (E) - offsetof(struct TYPE, pl_pathname);		\
 		if ((P) > PATH_MAX || (P) < 0)				\
 			goto error;					\
 	} while (0)

 #define PMCLOG_GET_CALLCHAIN_SIZE(SZ,E) do {				\
 		(SZ) = ((E) - offsetof(struct pmclog_callchain, pl_pc))	\
 			/ sizeof(uintfptr_t);				\
 	} while (0);

 	switch (ev->pl_type = PMCLOG_HEADER_TO_TYPE(h)) {
 	case PMCLOG_TYPE_CALLCHAIN:
 		PMCLOG_READ32(le,ev->pl_u.pl_cc.pl_pid);
 		PMCLOG_READ32(le,ev->pl_u.pl_cc.pl_tid);
 		PMCLOG_READ32(le,ev->pl_u.pl_cc.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_cc.pl_cpuflags);
 		PMCLOG_GET_CALLCHAIN_SIZE(ev->pl_u.pl_cc.pl_npc,evlen);
 		for (npc = 0; npc < ev->pl_u.pl_cc.pl_npc; npc++)
 			PMCLOG_READADDR(le,ev->pl_u.pl_cc.pl_pc[npc]);
 		for (;npc < PMC_CALLCHAIN_DEPTH_MAX; npc++)
 			ev->pl_u.pl_cc.pl_pc[npc] = (uintfptr_t) 0;
 		break;
 	case PMCLOG_TYPE_CLOSELOG:
 		ev->pl_state = PMCLOG_EOF;
 		return (-1);
 	case PMCLOG_TYPE_DROPNOTIFY:
 		/* nothing to do */
 		break;
 	case PMCLOG_TYPE_INITIALIZE:
 		PMCLOG_READ32(le,ev->pl_u.pl_i.pl_version);
 		PMCLOG_READ32(le,ev->pl_u.pl_i.pl_arch);
 		PMCLOG_READ64(le,ev->pl_u.pl_i.pl_tsc_freq);
 		memcpy(&ev->pl_u.pl_i.pl_ts, le, sizeof(struct timespec));
 		le += sizeof(struct timespec)/4;
 		PMCLOG_READSTRING(le, ev->pl_u.pl_i.pl_cpuid, PMC_CPUID_LEN);
 		memcpy(ev->pl_u.pl_i.pl_cpuid, le, PMC_CPUID_LEN);
 		ps->ps_cpuid = strdup(ev->pl_u.pl_i.pl_cpuid);
 		ps->ps_version = ev->pl_u.pl_i.pl_version;
 		ps->ps_arch = ev->pl_u.pl_i.pl_arch;
 		ps->ps_initialized = 1;
 		break;
 	case PMCLOG_TYPE_MAP_IN:
 		PMCLOG_GET_PATHLEN(pathlen,evlen,pmclog_map_in);
 		PMCLOG_READ32(le,ev->pl_u.pl_mi.pl_pid);
 		PMCLOG_READ32(le,noop);
 		PMCLOG_READADDR(le,ev->pl_u.pl_mi.pl_start);
 		PMCLOG_READSTRING(le, ev->pl_u.pl_mi.pl_pathname, pathlen);
 		break;
 	case PMCLOG_TYPE_MAP_OUT:
 		PMCLOG_READ32(le,ev->pl_u.pl_mo.pl_pid);
 		PMCLOG_READ32(le,noop);
 		PMCLOG_READADDR(le,ev->pl_u.pl_mo.pl_start);
 		PMCLOG_READADDR(le,ev->pl_u.pl_mo.pl_end);
 		break;
 	case PMCLOG_TYPE_PMCALLOCATE:
 		PMCLOG_READ32(le,ev->pl_u.pl_a.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_a.pl_event);
 		PMCLOG_READ32(le,ev->pl_u.pl_a.pl_flags);
 		PMCLOG_READ32(le,noop);
 		PMCLOG_READ64(le,ev->pl_u.pl_a.pl_rate);
 		ev->pl_u.pl_a.pl_evname = pmc_pmu_event_get_by_idx(ps->ps_cpuid,
 		    ev->pl_u.pl_a.pl_event);
 		if (ev->pl_u.pl_a.pl_evname != NULL)
 			break;
 		else if ((ev->pl_u.pl_a.pl_evname =
 		    _pmc_name_of_event(ev->pl_u.pl_a.pl_event, ps->ps_arch))
 		    == NULL) {
 			printf("unknown event\n");
 			goto error;
 		}
 		break;
 	case PMCLOG_TYPE_PMCALLOCATEDYN:
 		PMCLOG_READ32(le,ev->pl_u.pl_ad.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_ad.pl_event);
 		PMCLOG_READ32(le,ev->pl_u.pl_ad.pl_flags);
 		PMCLOG_READ32(le,noop);
 		PMCLOG_READSTRING(le,ev->pl_u.pl_ad.pl_evname,PMC_NAME_MAX);
 		break;
 	case PMCLOG_TYPE_PMCATTACH:
 		PMCLOG_GET_PATHLEN(pathlen,evlen,pmclog_pmcattach);
 		PMCLOG_READ32(le,ev->pl_u.pl_t.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_t.pl_pid);
 		PMCLOG_READSTRING(le,ev->pl_u.pl_t.pl_pathname,pathlen);
 		break;
 	case PMCLOG_TYPE_PMCDETACH:
 		PMCLOG_READ32(le,ev->pl_u.pl_d.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_d.pl_pid);
 		break;
 	case
PMCLOG_TYPE_PROCCSW:
 		PMCLOG_READ64(le,ev->pl_u.pl_c.pl_value);
 		PMCLOG_READ32(le,ev->pl_u.pl_c.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_c.pl_pid);
 		PMCLOG_READ32(le,ev->pl_u.pl_c.pl_tid);
 		break;
 	case PMCLOG_TYPE_PROCEXEC:
 		PMCLOG_GET_PATHLEN(pathlen,evlen,pmclog_procexec);
 		PMCLOG_READ32(le,ev->pl_u.pl_x.pl_pid);
 		PMCLOG_READ32(le,ev->pl_u.pl_x.pl_pmcid);
-		PMCLOG_READADDR(le,ev->pl_u.pl_x.pl_entryaddr);
+		PMCLOG_READADDR(le,ev->pl_u.pl_x.pl_baseaddr);
+		PMCLOG_READADDR(le,ev->pl_u.pl_x.pl_dynaddr);
 		PMCLOG_READSTRING(le,ev->pl_u.pl_x.pl_pathname,pathlen);
 		break;
 	case PMCLOG_TYPE_PROCEXIT:
 		PMCLOG_READ32(le,ev->pl_u.pl_e.pl_pmcid);
 		PMCLOG_READ32(le,ev->pl_u.pl_e.pl_pid);
 		PMCLOG_READ64(le,ev->pl_u.pl_e.pl_value);
 		break;
 	case PMCLOG_TYPE_PROCFORK:
 		PMCLOG_READ32(le,ev->pl_u.pl_f.pl_oldpid);
 		PMCLOG_READ32(le,ev->pl_u.pl_f.pl_newpid);
 		break;
 	case PMCLOG_TYPE_SYSEXIT:
 		PMCLOG_READ32(le,ev->pl_u.pl_se.pl_pid);
 		break;
 	case PMCLOG_TYPE_USERDATA:
 		PMCLOG_READ32(le,ev->pl_u.pl_u.pl_userdata);
 		break;
 	case PMCLOG_TYPE_THR_CREATE:
 		PMCLOG_READ32(le,ev->pl_u.pl_tc.pl_tid);
 		PMCLOG_READ32(le,ev->pl_u.pl_tc.pl_pid);
 		PMCLOG_READ32(le,ev->pl_u.pl_tc.pl_flags);
 		PMCLOG_READ32(le,noop);
 		memcpy(ev->pl_u.pl_tc.pl_tdname, le, MAXCOMLEN+1);
 		break;
 	case PMCLOG_TYPE_THR_EXIT:
 		PMCLOG_READ32(le,ev->pl_u.pl_te.pl_tid);
 		break;
 	case PMCLOG_TYPE_PROC_CREATE:
 		PMCLOG_READ32(le,ev->pl_u.pl_pc.pl_pid);
 		PMCLOG_READ32(le,ev->pl_u.pl_pc.pl_flags);
 		memcpy(ev->pl_u.pl_pc.pl_pcomm, le, MAXCOMLEN+1);
 		break;
 	default:	/* unknown record type */
 		ps->ps_state = PL_STATE_ERROR;
 		ev->pl_state = PMCLOG_ERROR;
 		return (-1);
 	}

 	ev->pl_offset = (ps->ps_offset += evlen);
 	ev->pl_count = (ps->ps_count += 1);
 	ev->pl_len = evlen;
 	ev->pl_state = PMCLOG_OK;
 	return 0;

  error:
 	ev->pl_state = PMCLOG_ERROR;
 	ps->ps_state = PL_STATE_ERROR;
 	return -1;
 }

 /*
  * Extract and return the next event from the byte stream.
  *
  * Returns 0 and sets the event's state to PMCLOG_OK in case an event
  * was successfully parsed.
Otherwise this function returns -1 and
  * sets the event's state to one of PMCLOG_REQUIRE_DATA (if more data
  * is needed) or PMCLOG_EOF (if an EOF was seen) or PMCLOG_ERROR if
  * a parse error was encountered.
  */

 int
 pmclog_read(void *cookie, struct pmclog_ev *ev)
 {
 	int retval;
 	ssize_t nread;
 	struct pmclog_parse_state *ps;

 	ps = (struct pmclog_parse_state *) cookie;

 	if (ps->ps_state == PL_STATE_ERROR) {
 		ev->pl_state = PMCLOG_ERROR;
 		return -1;
 	}

 	/*
 	 * If there isn't enough data left for a new event try and get
 	 * more data.
 	 */
 	if (ps->ps_len == 0) {
 		ev->pl_state = PMCLOG_REQUIRE_DATA;

 		/*
 		 * If we have a valid file descriptor to read from, attempt
 		 * to read from that. This read may return with an error,
 		 * (which may be EAGAIN or other recoverable error), or
 		 * can return EOF.
 		 */
 		if (ps->ps_fd != PMCLOG_FD_NONE) {
 		refill:
 			nread = read(ps->ps_fd, ps->ps_buffer,
 			    PMCLOG_BUFFER_SIZE);

 			if (nread <= 0) {
 				if (nread == 0)
 					ev->pl_state = PMCLOG_EOF;
 				else if (errno != EAGAIN) /* not restartable */
 					ev->pl_state = PMCLOG_ERROR;
 				return -1;
 			}

 			ps->ps_len = nread;
 			ps->ps_data = ps->ps_buffer;
 		} else {
 			return -1;
 		}
 	}

 	assert(ps->ps_len > 0);

 	/* Retrieve one event from the byte stream. */
 	retval = pmclog_get_event(ps, &ps->ps_data, &ps->ps_len, ev);

 	/*
 	 * If we need more data and we have a configured fd, try read
 	 * from it.
 	 */
 	if (retval < 0 && ev->pl_state == PMCLOG_REQUIRE_DATA &&
 	    ps->ps_fd != -1) {
 		assert(ps->ps_len == 0);
 		goto refill;
 	}

 	return retval;
 }

 /*
  * Feed data to a memory based parser.
  *
  * The memory area pointed to by 'data' needs to be valid till the
  * next error return from pmclog_next_event().
  */

 int
 pmclog_feed(void *cookie, char *data, int len)
 {
 	struct pmclog_parse_state *ps;

 	ps = (struct pmclog_parse_state *) cookie;

 	if (len < 0 ||		/* invalid length */
 	    ps->ps_buffer ||	/* called for a file parser */
 	    ps->ps_len != 0)	/* unnecessary call */
 		return -1;

 	ps->ps_data = data;
 	ps->ps_len = len;

 	return 0;
 }

 /*
  * Allocate and initialize parser state.
  */

 void *
 pmclog_open(int fd)
 {
 	struct pmclog_parse_state *ps;

 	if ((ps = (struct pmclog_parse_state *) malloc(sizeof(*ps))) == NULL)
 		return NULL;

 	ps->ps_state = PL_STATE_NEW_RECORD;
 	ps->ps_arch = -1;
 	ps->ps_initialized = 0;
 	ps->ps_count = 0;
 	ps->ps_offset = (off_t) 0;
 	bzero(&ps->ps_saved, sizeof(ps->ps_saved));
 	ps->ps_cpuid = NULL;
 	ps->ps_svcount = 0;
 	ps->ps_fd = fd;
 	ps->ps_data = NULL;
 	ps->ps_buffer = NULL;
 	ps->ps_len = 0;

 	/* allocate space for a work area */
 	if (ps->ps_fd != PMCLOG_FD_NONE) {
 		if ((ps->ps_buffer = malloc(PMCLOG_BUFFER_SIZE)) == NULL) {
 			free(ps);
 			return NULL;
 		}
 	}

 	return ps;
 }

 /*
  * Free up parser state.
  */

 void
 pmclog_close(void *cookie)
 {
 	struct pmclog_parse_state *ps;

 	ps = (struct pmclog_parse_state *) cookie;

 	if (ps->ps_buffer)
 		free(ps->ps_buffer);

 	free(ps);
 }
diff --git a/lib/libpmc/pmclog.h b/lib/libpmc/pmclog.h
index c81246b168eb..c2973e9a365a 100644
--- a/lib/libpmc/pmclog.h
+++ b/lib/libpmc/pmclog.h
@@ -1,233 +1,234 @@
 /*-
  * SPDX-License-Identifier: BSD-2-Clause
  *
  * Copyright (c) 2005-2007 Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
  * $FreeBSD$
  */

 #ifndef _PMCLOG_H_
 #define _PMCLOG_H_

 #include
 #include

 enum pmclog_state {
 	PMCLOG_OK,
 	PMCLOG_EOF,
 	PMCLOG_REQUIRE_DATA,
 	PMCLOG_ERROR
 };

 struct pmclog_ev_callchain {
 	uint32_t	pl_pid;
 	uint32_t	pl_tid;
 	uint32_t	pl_pmcid;
 	uint32_t	pl_cpuflags;
 	uint32_t	pl_cpuflags2;
 	uint32_t	pl_npc;
 	uintfptr_t	pl_pc[PMC_CALLCHAIN_DEPTH_MAX];
 };

 struct pmclog_ev_dropnotify {
 };

 struct pmclog_ev_closelog {
 };

 struct pmclog_ev_initialize {
 	uint32_t	pl_version;
 	uint32_t	pl_arch;
 	uint64_t	pl_tsc_freq;
 	struct timespec	pl_ts;
 	char		pl_cpuid[PATH_MAX];
 };

 struct pmclog_ev_map_in {
 	pid_t		pl_pid;
 	uintfptr_t	pl_start;
 	char		pl_pathname[PATH_MAX];
 };

 struct pmclog_ev_map_out {
 	pid_t		pl_pid;
 	uintfptr_t	pl_start;
 	uintfptr_t	pl_end;
 };

 struct pmclog_ev_pcsample {
 	uintfptr_t	pl_pc;
 	pid_t		pl_pid;
 	pid_t		pl_tid;
 	pmc_id_t	pl_pmcid;
 	uint32_t	pl_flags;
 	uint32_t	pl_usermode;
 };

 struct pmclog_ev_pmcallocate {
 	const char *	pl_evname;
 	uint64_t	pl_rate;
 	uint32_t	pl_event;
 	uint32_t	pl_flags;
 	pmc_id_t	pl_pmcid;
 };

 struct pmclog_ev_pmcallocatedyn {
 	char		pl_evname[PMC_NAME_MAX];
 	uint32_t	pl_event;
 	uint32_t	pl_flags;
 	pmc_id_t	pl_pmcid;
 };

 struct pmclog_ev_pmcattach {
 	pmc_id_t	pl_pmcid;
 	pid_t		pl_pid;
 	char		pl_pathname[PATH_MAX];
 };

 struct pmclog_ev_pmcdetach {
 	pmc_id_t	pl_pmcid;
 	pid_t		pl_pid;
 };

 struct pmclog_ev_proccsw {
 	pid_t		pl_pid;
 	pid_t		pl_tid;
 	pmc_id_t	pl_pmcid;
 	pmc_value_t	pl_value;
 };

 struct pmclog_ev_proccreate {
 	pid_t		pl_pid;
 	uint32_t	pl_flags;
 	char		pl_pcomm[MAXCOMLEN+1];
 };

 struct
pmclog_ev_procexec {
 	pid_t		pl_pid;
 	pmc_id_t	pl_pmcid;
-	uintfptr_t	pl_entryaddr;
+	uintptr_t	pl_baseaddr;
+	uintptr_t	pl_dynaddr;
 	char		pl_pathname[PATH_MAX];
 };

 struct pmclog_ev_procexit {
 	uint32_t	pl_pid;
 	pmc_id_t	pl_pmcid;
 	pmc_value_t	pl_value;
 };

 struct pmclog_ev_procfork {
 	pid_t		pl_oldpid;
 	pid_t		pl_newpid;
 };

 struct pmclog_ev_sysexit {
 	pid_t		pl_pid;
 };

 struct pmclog_ev_threadcreate {
 	pid_t		pl_tid;
 	pid_t		pl_pid;
 	uint32_t	pl_flags;
 	char		pl_tdname[MAXCOMLEN+1];
 };

 struct pmclog_ev_threadexit {
 	pid_t		pl_tid;
 };

 struct pmclog_ev_userdata {
 	uint32_t	pl_userdata;
 };

 struct pmclog_ev {
 	enum pmclog_state pl_state;	/* state after 'get_event()' */
 	off_t		pl_offset;	/* byte offset in stream */
 	size_t		pl_count;	/* count of records so far */
 	struct timespec	pl_ts;		/* log entry timestamp */
 	enum pmclog_type pl_type;	/* type of log entry */
 	void		*pl_data;
 	int		pl_len;
 	union {				/* log entry data */
 		struct pmclog_ev_callchain	pl_cc;
 		struct pmclog_ev_closelog	pl_cl;
 		struct pmclog_ev_dropnotify	pl_dn;
 		struct pmclog_ev_initialize	pl_i;
 		struct pmclog_ev_map_in		pl_mi;
 		struct pmclog_ev_map_out	pl_mo;
 		struct pmclog_ev_pmcallocate	pl_a;
 		struct pmclog_ev_pmcallocatedyn	pl_ad;
 		struct pmclog_ev_pmcattach	pl_t;
 		struct pmclog_ev_pmcdetach	pl_d;
 		struct pmclog_ev_proccsw	pl_c;
 		struct pmclog_ev_proccreate	pl_pc;
 		struct pmclog_ev_procexec	pl_x;
 		struct pmclog_ev_procexit	pl_e;
 		struct pmclog_ev_procfork	pl_f;
 		struct pmclog_ev_sysexit	pl_se;
 		struct pmclog_ev_threadcreate	pl_tc;
 		struct pmclog_ev_threadexit	pl_te;
 		struct pmclog_ev_userdata	pl_u;
 	} pl_u;
 };

 enum pmclog_parser_state {
 	PL_STATE_NEW_RECORD,		/* in-between records */
 	PL_STATE_EXPECTING_HEADER,	/* header being read */
 	PL_STATE_PARTIAL_RECORD,	/* header present but not the record */
 	PL_STATE_ERROR			/* parsing error encountered */
 };

 struct pmclog_parse_state {
 	enum pmclog_parser_state ps_state;
 	enum pmc_cputype	ps_arch;	/* log file architecture */
 	uint32_t		ps_version;	/* hwpmc version */
 	int			ps_initialized;	/* whether initialized */
 	int
ps_count;	/* count of records processed */
 	off_t			ps_offset;	/* stream byte offset */
 	union pmclog_entry	ps_saved;	/* saved partial log entry */
 	int			ps_svcount;	/* #bytes saved */
 	int			ps_fd;		/* active fd or -1 */
 	char			*ps_buffer;	/* scratch buffer if fd != -1 */
 	char			*ps_data;	/* current parse pointer */
 	char			*ps_cpuid;	/* log cpuid */
 	size_t			ps_len;		/* length of buffered data */
 };

 #define PMCLOG_FD_NONE (-1)

 __BEGIN_DECLS
 void	*pmclog_open(int _fd);
 int	pmclog_feed(void *_cookie, char *_data, int _len);
 int	pmclog_read(void *_cookie, struct pmclog_ev *_ev);
 void	pmclog_close(void *_cookie);
 __END_DECLS

 #endif
diff --git a/lib/libpmcstat/libpmcstat.h b/lib/libpmcstat/libpmcstat.h
index 07d82d4d0e57..87bd3a185f40 100644
--- a/lib/libpmcstat/libpmcstat.h
+++ b/lib/libpmcstat/libpmcstat.h
@@ -1,387 +1,387 @@
 /*-
  * Copyright (c) 2005-2007, Joseph Koshy
  * Copyright (c) 2007 The FreeBSD Foundation
  * All rights reserved.
  *
  * Portions of this software were developed by A. Joseph Koshy under
  * sponsorship from the FreeBSD Foundation and Google, Inc.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
  * 1. Redistributions of source code must retain the above copyright
  *    notice, this list of conditions and the following disclaimer.
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
  *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LIBPMCSTAT_H_ #define _LIBPMCSTAT_H_ #include #include #include #include #define PMCSTAT_ALLOCATE 1 #define NSOCKPAIRFD 2 #define PARENTSOCKET 0 #define CHILDSOCKET 1 #define PMCSTAT_OPEN_FOR_READ 0 #define PMCSTAT_OPEN_FOR_WRITE 1 #define READPIPEFD 0 #define WRITEPIPEFD 1 #define NPIPEFD 2 #define PMCSTAT_NHASH 256 #define PMCSTAT_HASH_MASK 0xFF #define DEFAULT_SAMPLE_COUNT 65536 typedef const void *pmcstat_interned_string; struct pmc_plugins; enum pmcstat_state { PMCSTAT_FINISHED = 0, PMCSTAT_EXITING = 1, PMCSTAT_RUNNING = 2 }; struct pmcstat_ev { STAILQ_ENTRY(pmcstat_ev) ev_next; int ev_count; /* associated count if in sampling mode */ uint32_t ev_cpu; /* cpus for this event */ int ev_cumulative; /* show cumulative counts */ int ev_flags; /* PMC_F_* */ int ev_fieldskip; /* #leading spaces */ int ev_fieldwidth; /* print width */ enum pmc_mode ev_mode; /* desired mode */ char *ev_name; /* (derived) event name */ pmc_id_t ev_pmcid; /* allocated ID */ pmc_value_t ev_saved; /* for incremental counts */ char *ev_spec; /* event specification */ }; struct pmcstat_target { SLIST_ENTRY(pmcstat_target) pt_next; pid_t pt_pid; }; struct pmcstat_args { int pa_flags; /* argument flags */ #define FLAG_HAS_TARGET 0x00000001 /* process target */ #define FLAG_HAS_WAIT_INTERVAL 0x00000002 /* -w secs */ #define FLAG_HAS_OUTPUT_LOGFILE 0x00000004 /* -O file or pipe */ #define FLAG_HAS_COMMANDLINE 0x00000008 /* command */ #define 
FLAG_HAS_SAMPLING_PMCS 0x00000010 /* -S or -P */ #define FLAG_HAS_COUNTING_PMCS 0x00000020 /* -s or -p */ #define FLAG_HAS_PROCESS_PMCS 0x00000040 /* -P or -p */ #define FLAG_HAS_SYSTEM_PMCS 0x00000080 /* -S or -s */ #define FLAG_HAS_PIPE 0x00000100 /* implicit log */ #define FLAG_READ_LOGFILE 0x00000200 /* -R file */ #define FLAG_DO_GPROF 0x00000400 /* -g */ #define FLAG_HAS_SAMPLESDIR 0x00000800 /* -D dir */ /* was FLAG_HAS_KERNELPATH 0x00001000 */ #define FLAG_DO_PRINT 0x00002000 /* -o */ #define FLAG_DO_CALLGRAPHS 0x00004000 /* -G or -F */ #define FLAG_DO_ANNOTATE 0x00008000 /* -m */ #define FLAG_DO_TOP 0x00010000 /* -T */ #define FLAG_DO_ANALYSIS 0x00020000 /* -g or -G or -m or -T */ #define FLAGS_HAS_CPUMASK 0x00040000 /* -c */ #define FLAG_HAS_DURATION 0x00080000 /* -l secs */ #define FLAG_DO_WIDE_GPROF_HC 0x00100000 /* -e */ #define FLAG_SKIP_TOP_FN_RES 0x00200000 /* -A */ #define FLAG_FILTER_THREAD_ID 0x00400000 /* -L */ #define FLAG_SHOW_OFFSET 0x00800000 /* -I */ int pa_required; /* required features */ int pa_pplugin; /* pre-processing plugin */ int pa_plugin; /* analysis plugin */ int pa_verbosity; /* verbosity level */ FILE *pa_printfile; /* where to send printed output */ int pa_logfd; /* output log file */ char *pa_inputpath; /* path to input log */ char *pa_outputpath; /* path to output log */ void *pa_logparser; /* log file parser */ const char *pa_fsroot; /* FS root where executables reside */ const char *pa_samplesdir; /* directory for profile files */ const char *pa_mapfilename;/* mapfile name */ FILE *pa_graphfile; /* where to send the callgraph */ int pa_graphdepth; /* print depth for callgraphs */ double pa_interval; /* printing interval in seconds */ cpuset_t pa_cpumask; /* filter for CPUs analysed */ int pa_ctdumpinstr; /* dump instructions with calltree */ int pa_topmode; /* delta or accumulative */ int pa_toptty; /* output to tty or file */ int pa_topcolor; /* terminal support color */ int pa_mergepmc; /* merge PMC with same name */ 
double pa_duration; /* time duration */ uint32_t pa_tid; int pa_argc; char **pa_argv; STAILQ_HEAD(, pmcstat_ev) pa_events; SLIST_HEAD(, pmcstat_target) pa_targets; }; /* * Each function symbol tracked by pmcstat(8). */ struct pmcstat_symbol { pmcstat_interned_string ps_name; uint64_t ps_start; uint64_t ps_end; }; /* * A 'pmcstat_image' structure describes an executable program on * disk. 'pi_execpath' is a cookie representing the pathname of * the executable. 'pi_start' and 'pi_end' are the least and greatest * virtual addresses for the text segments in the executable. * 'pi_gmonlist' contains a linked list of gmon.out files associated * with this image. */ enum pmcstat_image_type { PMCSTAT_IMAGE_UNKNOWN = 0, /* never looked at the image */ PMCSTAT_IMAGE_INDETERMINABLE, /* can't tell what the image is */ PMCSTAT_IMAGE_ELF32, /* ELF 32 bit object */ PMCSTAT_IMAGE_ELF64, /* ELF 64 bit object */ PMCSTAT_IMAGE_AOUT /* AOUT object */ }; struct pmcstat_image { LIST_ENTRY(pmcstat_image) pi_next; /* hash link */ pmcstat_interned_string pi_execpath; /* cookie */ pmcstat_interned_string pi_samplename; /* sample path name */ pmcstat_interned_string pi_fullpath; /* path to FS object */ pmcstat_interned_string pi_name; /* display name */ enum pmcstat_image_type pi_type; /* executable type */ /* * Executables have pi_start and pi_end; these are zero * for shared libraries. */ uintfptr_t pi_start; /* start address (inclusive) */ uintfptr_t pi_end; /* end address (exclusive) */ uintfptr_t pi_entry; /* entry address */ uintfptr_t pi_vaddr; /* virtual address where loaded */ int pi_isdynamic; /* whether a dynamic object */ int pi_iskernelmodule; pmcstat_interned_string pi_dynlinkerpath; /* path in .interp */ /* All symbols associated with this object. */ struct pmcstat_symbol *pi_symbols; size_t pi_symcount; /* Handle to addr2line for this image. */ FILE *pi_addr2line; /* * Plugins private data */ /* gprof: * An image can be associated with one or more gmon.out files; * one per PMC. 
*/ LIST_HEAD(,pmcstat_gmonfile) pi_gmlist; }; extern LIST_HEAD(pmcstat_image_hash_list, pmcstat_image) pmcstat_image_hash[PMCSTAT_NHASH]; /* * A simple implementation of interned strings. Each interned string * is assigned a unique address, so that subsequent string compares * can be done by a simple pointer comparison instead of using * strcmp(). This speeds up hash table lookups and saves memory if * duplicate strings are the norm. */ struct pmcstat_string { LIST_ENTRY(pmcstat_string) ps_next; /* hash link */ int ps_len; int ps_hash; char *ps_string; }; /* * A 'pmcstat_pcmap' structure maps a virtual address range to an * underlying 'pmcstat_image' descriptor. */ struct pmcstat_pcmap { TAILQ_ENTRY(pmcstat_pcmap) ppm_next; uintfptr_t ppm_lowpc; uintfptr_t ppm_highpc; struct pmcstat_image *ppm_image; }; /* * A 'pmcstat_process' structure models processes. Each process is * associated with a set of pmcstat_pcmap structures that map * addresses inside it to executable objects. This set is implemented * as a list, kept sorted in ascending order of mapped addresses. * * 'pp_pid' holds the pid of the process. When a process exits, the * 'pp_isactive' field is set to zero, but the process structure is * not immediately reclaimed because there may still be samples in the * log for this process. */ struct pmcstat_process { LIST_ENTRY(pmcstat_process) pp_next; /* hash-next */ pid_t pp_pid; /* associated pid */ int pp_isactive; /* whether active */ uintfptr_t pp_entryaddr; /* entry address */ TAILQ_HEAD(,pmcstat_pcmap) pp_map; /* address range map */ }; extern LIST_HEAD(pmcstat_process_hash_list, pmcstat_process) pmcstat_process_hash[PMCSTAT_NHASH]; /* * 'pmcstat_pmcrecord' is a mapping from PMC ids to human-readable * names. 
*/ struct pmcstat_pmcrecord { LIST_ENTRY(pmcstat_pmcrecord) pr_next; pmc_id_t pr_pmcid; int pr_pmcin; pmcstat_interned_string pr_pmcname; int pr_samples; int pr_dubious_frames; struct pmcstat_pmcrecord *pr_merge; }; extern LIST_HEAD(pmcstat_pmcs, pmcstat_pmcrecord) pmcstat_pmcs; /* PMC list */ struct pmc_plugins { const char *pl_name; /* configure */ int (*pl_configure)(char *opt); /* init and shutdown */ int (*pl_init)(void); void (*pl_shutdown)(FILE *mf); /* sample processing */ void (*pl_process)(struct pmcstat_process *pp, struct pmcstat_pmcrecord *pmcr, uint32_t nsamples, uintfptr_t *cc, int usermode, uint32_t cpu); /* image */ void (*pl_initimage)(struct pmcstat_image *pi); void (*pl_shutdownimage)(struct pmcstat_image *pi); /* pmc */ void (*pl_newpmc)(pmcstat_interned_string ps, struct pmcstat_pmcrecord *pr); /* top display */ void (*pl_topdisplay)(void); /* top keypress */ int (*pl_topkeypress)(int c, void *w); }; /* * Misc. statistics */ struct pmcstat_stats { int ps_exec_aout; /* # a.out executables seen */ int ps_exec_elf; /* # elf executables seen */ int ps_exec_errors; /* # errors processing executables */ int ps_exec_indeterminable; /* # unknown executables seen */ int ps_samples_total; /* total number of samples processed */ int ps_samples_skipped; /* #samples filtered out for any reason */ int ps_samples_unknown_offset; /* #samples of rank 0 not in a map */ int ps_samples_indeterminable; /* #samples in indeterminable images */ int ps_samples_unknown_function;/* #samples with unknown function at offset */ int ps_callchain_dubious_frames;/* #dubious frame pointers seen */ }; __BEGIN_DECLS int pmcstat_symbol_compare(const void *a, const void *b); struct pmcstat_symbol *pmcstat_symbol_search(struct pmcstat_image *image, uintfptr_t addr); void pmcstat_image_add_symbols(struct pmcstat_image *image, Elf *e, Elf_Scn *scn, GElf_Shdr *sh); const char *pmcstat_string_unintern(pmcstat_interned_string _is); pmcstat_interned_string pmcstat_string_intern(const 
char *_s);
 int pmcstat_string_compute_hash(const char *s);
 pmcstat_interned_string pmcstat_string_lookup(const char *_s);
 void pmcstat_image_get_elf_params(struct pmcstat_image *image,
     struct pmcstat_args *args);
 struct pmcstat_image *
 pmcstat_image_from_path(pmcstat_interned_string internedpath,
     int iskernelmodule, struct pmcstat_args *args,
     struct pmc_plugins *plugins);
 int pmcstat_string_lookup_hash(pmcstat_interned_string _is);
 void pmcstat_process_elf_exec(struct pmcstat_process *_pp,
-    struct pmcstat_image *_image, uintfptr_t _entryaddr,
+    struct pmcstat_image *_image, uintptr_t _baseaddr, uintptr_t _dynaddr,
     struct pmcstat_args *args, struct pmc_plugins *plugins,
     struct pmcstat_stats *pmcstat_stats);
 void pmcstat_image_link(struct pmcstat_process *_pp,
     struct pmcstat_image *_i, uintfptr_t _lpc);
 void pmcstat_process_aout_exec(struct pmcstat_process *_pp,
-    struct pmcstat_image *_image, uintfptr_t _entryaddr);
+    struct pmcstat_image *_image, uintptr_t _baseaddr);
 void pmcstat_process_exec(struct pmcstat_process *_pp,
-    pmcstat_interned_string _path, uintfptr_t _entryaddr,
+    pmcstat_interned_string _path, uintptr_t _baseaddr, uintptr_t _dynaddr,
     struct pmcstat_args *args, struct pmc_plugins *plugins,
     struct pmcstat_stats *pmcstat_stats);
 void pmcstat_image_determine_type(struct pmcstat_image *_image,
     struct pmcstat_args *args);
 void pmcstat_image_get_aout_params(struct pmcstat_image *_image,
     struct pmcstat_args *args);
 struct pmcstat_pcmap *pmcstat_process_find_map(struct pmcstat_process *_p,
     uintfptr_t _pc);
 void pmcstat_initialize_logging(struct pmcstat_process **pmcstat_kernproc,
     struct pmcstat_args *args, struct pmc_plugins *plugins,
     int *pmcstat_npmcs, int *pmcstat_mergepmc);
 void pmcstat_shutdown_logging(struct pmcstat_args *args,
     struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats);
 struct pmcstat_process *pmcstat_process_lookup(pid_t _pid, int _allocate);
 void pmcstat_clone_event_descriptor(struct pmcstat_ev *ev, const cpuset_t *cpumask,
struct pmcstat_args *args); void pmcstat_create_process(int *pmcstat_sockpair, struct pmcstat_args *args, int pmcstat_kq); void pmcstat_start_process(int *pmcstat_sockpair); void pmcstat_attach_pmcs(struct pmcstat_args *args); struct pmcstat_symbol *pmcstat_symbol_search_by_name(struct pmcstat_process *pp, const char *pi_name, const char *name, uintptr_t *, uintptr_t *); void pmcstat_string_initialize(void); void pmcstat_string_shutdown(void); int pmcstat_analyze_log(struct pmcstat_args *args, struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats, struct pmcstat_process *pmcstat_kernproc, int pmcstat_mergepmc, int *pmcstat_npmcs, int *ps_samples_period); int pmcstat_open_log(const char *_p, int _mode); int pmcstat_close_log(struct pmcstat_args *args); __END_DECLS #endif /* !_LIBPMCSTAT_H_ */ diff --git a/lib/libpmcstat/libpmcstat_logging.c b/lib/libpmcstat/libpmcstat_logging.c index 42054e636b4b..b41e93b4f729 100644 --- a/lib/libpmcstat/libpmcstat_logging.c +++ b/lib/libpmcstat/libpmcstat_logging.c @@ -1,684 +1,684 @@ /*- * Copyright (c) 2003-2008 Joseph Koshy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "libpmcstat.h" /* * Get PMC record by id, apply merge policy. */ static struct pmcstat_pmcrecord * pmcstat_lookup_pmcid(pmc_id_t pmcid, int pmcstat_mergepmc) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) { if (pr->pr_pmcid == pmcid) { if (pmcstat_mergepmc) return pr->pr_merge; return pr; } } return NULL; } /* * Add a {pmcid,name} mapping. */ static void pmcstat_pmcid_add(pmc_id_t pmcid, pmcstat_interned_string ps, struct pmcstat_args *args, struct pmc_plugins *plugins, int *pmcstat_npmcs) { struct pmcstat_pmcrecord *pr, *prm; /* Replace an existing name for the PMC. */ prm = NULL; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcid == pmcid) { pr->pr_pmcname = ps; return; } else if (pr->pr_pmcname == ps) prm = pr; /* * Otherwise, allocate a new descriptor and call the * plugins hook. */ if ((pr = malloc(sizeof(*pr))) == NULL) err(EX_OSERR, "ERROR: Cannot allocate pmc record"); pr->pr_pmcid = pmcid; pr->pr_pmcname = ps; pr->pr_pmcin = (*pmcstat_npmcs)++; pr->pr_samples = 0; pr->pr_dubious_frames = 0; pr->pr_merge = prm == NULL ? 
pr : prm; LIST_INSERT_HEAD(&pmcstat_pmcs, pr, pr_next); if (plugins[args->pa_pplugin].pl_newpmc != NULL) plugins[args->pa_pplugin].pl_newpmc(ps, pr); if (plugins[args->pa_plugin].pl_newpmc != NULL) plugins[args->pa_plugin].pl_newpmc(ps, pr); } /* * Unmap images in the range [start..end) associated with process * 'pp'. */ static void pmcstat_image_unmap(struct pmcstat_process *pp, uintfptr_t start, uintfptr_t end) { struct pmcstat_pcmap *pcm, *pcmtmp, *pcmnew; assert(pp != NULL); assert(start < end); /* * Cases: * - we could have the range completely in the middle of an * existing pcmap; in this case we have to split the pcmap * structure into two (i.e., generate a 'hole'). * - we could have the range covering multiple pcmaps; these * will have to be removed. * - we could have either 'start' or 'end' falling in the * middle of a pcmap; in this case shorten the entry. */ TAILQ_FOREACH_SAFE(pcm, &pp->pp_map, ppm_next, pcmtmp) { assert(pcm->ppm_lowpc < pcm->ppm_highpc); if (pcm->ppm_highpc <= start) continue; if (pcm->ppm_lowpc >= end) return; if (pcm->ppm_lowpc >= start && pcm->ppm_highpc <= end) { /* * The current pcmap is completely inside the * unmapped range: remove it entirely. */ TAILQ_REMOVE(&pp->pp_map, pcm, ppm_next); free(pcm); } else if (pcm->ppm_lowpc < start && pcm->ppm_highpc > end) { /* * Split this pcmap into two; curtail the * current map to end at [start-1], and start * the new one at [end]. */ if ((pcmnew = malloc(sizeof(*pcmnew))) == NULL) err(EX_OSERR, "ERROR: Cannot split a map entry"); pcmnew->ppm_image = pcm->ppm_image; pcmnew->ppm_lowpc = end; pcmnew->ppm_highpc = pcm->ppm_highpc; pcm->ppm_highpc = start; TAILQ_INSERT_AFTER(&pp->pp_map, pcm, pcmnew, ppm_next); return; } else if (pcm->ppm_lowpc < start && pcm->ppm_highpc <= end) pcm->ppm_highpc = start; else if (pcm->ppm_lowpc >= start && pcm->ppm_highpc > end) pcm->ppm_lowpc = end; else assert(0); } } /* * Convert a hwpmc(4) log to profile information. 
A system-wide * callgraph is generated if FLAG_DO_CALLGRAPHS is set. gmon.out * files usable by gprof(1) are created if FLAG_DO_GPROF is set. */ int pmcstat_analyze_log(struct pmcstat_args *args, struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats, struct pmcstat_process *pmcstat_kernproc, int pmcstat_mergepmc, int *pmcstat_npmcs, int *ps_samples_period) { uint32_t cpu, cpuflags; pid_t pid; struct pmcstat_image *image; struct pmcstat_process *pp, *ppnew; struct pmcstat_pcmap *ppm, *ppmtmp; struct pmclog_ev ev; struct pmcstat_pmcrecord *pmcr; pmcstat_interned_string image_path; assert(args->pa_flags & FLAG_DO_ANALYSIS); if (elf_version(EV_CURRENT) == EV_NONE) err(EX_UNAVAILABLE, "Elf library initialization failed"); while (pmclog_read(args->pa_logparser, &ev) == 0) { assert(ev.pl_state == PMCLOG_OK); switch (ev.pl_type) { case PMCLOG_TYPE_INITIALIZE: if ((ev.pl_u.pl_i.pl_version & 0xFF000000) != PMC_VERSION_MAJOR << 24 && args->pa_verbosity > 0) warnx( "WARNING: Log version 0x%x does not match compiled version 0x%x.", ev.pl_u.pl_i.pl_version, PMC_VERSION_MAJOR); break; case PMCLOG_TYPE_MAP_IN: /* * Introduce an address range mapping for a * userland process or the kernel (pid == -1). * * We always allocate a process descriptor so * that subsequent samples seen for this * address range are mapped to the current * object being mapped in. */ pid = ev.pl_u.pl_mi.pl_pid; if (pid == -1) pp = pmcstat_kernproc; else pp = pmcstat_process_lookup(pid, PMCSTAT_ALLOCATE); assert(pp != NULL); image_path = pmcstat_string_intern(ev.pl_u.pl_mi. pl_pathname); image = pmcstat_image_from_path(image_path, pid == -1, args, plugins); if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_determine_type(image, args); if (image->pi_type != PMCSTAT_IMAGE_INDETERMINABLE) pmcstat_image_link(pp, image, ev.pl_u.pl_mi.pl_start); break; case PMCLOG_TYPE_MAP_OUT: /* * Remove an address map. 
*/ pid = ev.pl_u.pl_mo.pl_pid; if (pid == -1) pp = pmcstat_kernproc; else pp = pmcstat_process_lookup(pid, 0); if (pp == NULL) /* unknown process */ break; pmcstat_image_unmap(pp, ev.pl_u.pl_mo.pl_start, ev.pl_u.pl_mo.pl_end); break; case PMCLOG_TYPE_CALLCHAIN: pmcstat_stats->ps_samples_total++; *ps_samples_period += 1; cpuflags = ev.pl_u.pl_cc.pl_cpuflags; cpu = PMC_CALLCHAIN_CPUFLAGS_TO_CPU(cpuflags); if ((args->pa_flags & FLAG_FILTER_THREAD_ID) && args->pa_tid != ev.pl_u.pl_cc.pl_tid) { pmcstat_stats->ps_samples_skipped++; break; } /* Filter on the CPU id. */ if (!CPU_ISSET(cpu, &(args->pa_cpumask))) { pmcstat_stats->ps_samples_skipped++; break; } pp = pmcstat_process_lookup(ev.pl_u.pl_cc.pl_pid, PMCSTAT_ALLOCATE); /* Get PMC record. */ pmcr = pmcstat_lookup_pmcid(ev.pl_u.pl_cc.pl_pmcid, pmcstat_mergepmc); assert(pmcr != NULL); pmcr->pr_samples++; /* * Call the plugins processing */ if (plugins[args->pa_pplugin].pl_process != NULL) plugins[args->pa_pplugin].pl_process( pp, pmcr, ev.pl_u.pl_cc.pl_npc, ev.pl_u.pl_cc.pl_pc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags), cpu); plugins[args->pa_plugin].pl_process( pp, pmcr, ev.pl_u.pl_cc.pl_npc, ev.pl_u.pl_cc.pl_pc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(cpuflags), cpu); break; case PMCLOG_TYPE_PMCALLOCATE: /* * Record the association pmc id between this * PMC and its name. */ pmcstat_pmcid_add(ev.pl_u.pl_a.pl_pmcid, pmcstat_string_intern(ev.pl_u.pl_a.pl_evname), args, plugins, pmcstat_npmcs); break; case PMCLOG_TYPE_PMCALLOCATEDYN: /* * Record the association pmc id between this * PMC and its name. */ pmcstat_pmcid_add(ev.pl_u.pl_ad.pl_pmcid, pmcstat_string_intern(ev.pl_u.pl_ad.pl_evname), args, plugins, pmcstat_npmcs); break; case PMCLOG_TYPE_PROCEXEC: /* * Change the executable image associated with * a process. 
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_x.pl_pid,
 			    PMCSTAT_ALLOCATE);
 			/* delete the current process map */
 			TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) {
 				TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next);
 				free(ppm);
 			}
 			/*
 			 * Associate this process image.
 			 */
 			image_path = pmcstat_string_intern(
 				ev.pl_u.pl_x.pl_pathname);
 			assert(image_path != NULL);
 			pmcstat_process_exec(pp, image_path,
-			    ev.pl_u.pl_x.pl_entryaddr, args,
-			    plugins, pmcstat_stats);
+			    ev.pl_u.pl_x.pl_baseaddr, ev.pl_u.pl_x.pl_dynaddr,
+			    args, plugins, pmcstat_stats);
 			break;
 		case PMCLOG_TYPE_PROCEXIT:
 			/*
 			 * Due to the way the log is generated, the
 			 * last few samples corresponding to a process
 			 * may appear in the log after the process
 			 * exit event is recorded. Thus we keep the
 			 * process' descriptor and associated data
 			 * structures around, but mark the process as
 			 * having exited.
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_e.pl_pid, 0);
 			if (pp == NULL)
 				break;
 			pp->pp_isactive = 0; /* mark as a zombie */
 			break;
 		case PMCLOG_TYPE_SYSEXIT:
 			pp = pmcstat_process_lookup(ev.pl_u.pl_se.pl_pid, 0);
 			if (pp == NULL)
 				break;
 			pp->pp_isactive = 0; /* make a zombie */
 			break;
 		case PMCLOG_TYPE_PROCFORK:
 			/*
 			 * Allocate a process descriptor for the new
 			 * (child) process.
 			 */
 			ppnew = pmcstat_process_lookup(ev.pl_u.pl_f.pl_newpid,
 			    PMCSTAT_ALLOCATE);
 			/*
 			 * If we had been tracking the parent, clone
 			 * its address maps.
 			 */
 			pp = pmcstat_process_lookup(ev.pl_u.pl_f.pl_oldpid, 0);
 			if (pp == NULL)
 				break;
 			TAILQ_FOREACH(ppm, &pp->pp_map, ppm_next)
 				pmcstat_image_link(ppnew, ppm->ppm_image,
 				    ppm->ppm_lowpc);
 			break;
 		default: /* other types of entries are not relevant */
 			break;
 		}
 	}
 	if (ev.pl_state == PMCLOG_EOF)
 		return (PMCSTAT_FINISHED);
 	else if (ev.pl_state == PMCLOG_REQUIRE_DATA)
 		return (PMCSTAT_RUNNING);
 	err(EX_DATAERR,
 	    "ERROR: event parsing failed state: %d type: %d (record %jd, offset 0x%jx)",
 	    ev.pl_state, ev.pl_type, (uintmax_t) ev.pl_count + 1, ev.pl_offset);
 }
 /*
  * Open a log file, for reading or writing.
* * The function returns the fd of a successfully opened log or -1 in * case of failure. */ int pmcstat_open_log(const char *path, int mode) { int error, fd, cfd; size_t hlen; const char *p, *errstr; struct addrinfo hints, *res, *res0; char hostname[MAXHOSTNAMELEN]; errstr = NULL; fd = -1; /* * If 'path' is "-" then open one of stdin or stdout depending * on the value of 'mode'. * * If 'path' contains a ':' and does not start with a '/' or '.', * and is being opened for writing, treat it as a "host:port" * specification and open a network socket. * * Otherwise, treat 'path' as a file name and open that. */ if (path[0] == '-' && path[1] == '\0') fd = (mode == PMCSTAT_OPEN_FOR_READ) ? 0 : 1; else if (path[0] != '/' && path[0] != '.' && strchr(path, ':') != NULL) { p = strrchr(path, ':'); hlen = p - path; if (p == path || hlen >= sizeof(hostname)) { errstr = strerror(EINVAL); goto done; } assert(hlen < sizeof(hostname)); (void) strncpy(hostname, path, hlen); hostname[hlen] = '\0'; (void) memset(&hints, 0, sizeof(hints)); hints.ai_family = AF_UNSPEC; hints.ai_socktype = SOCK_STREAM; if ((error = getaddrinfo(hostname, p+1, &hints, &res0)) != 0) { errstr = gai_strerror(error); goto done; } fd = -1; for (res = res0; res; res = res->ai_next) { if ((fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol)) < 0) { errstr = strerror(errno); continue; } if (mode == PMCSTAT_OPEN_FOR_READ) { if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) { errstr = strerror(errno); (void) close(fd); fd = -1; continue; } listen(fd, 1); cfd = accept(fd, NULL, NULL); (void) close(fd); if (cfd < 0) { errstr = strerror(errno); fd = -1; break; } fd = cfd; } else { if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) { errstr = strerror(errno); (void) close(fd); fd = -1; continue; } } errstr = NULL; break; } freeaddrinfo(res0); } else if ((fd = open(path, mode == PMCSTAT_OPEN_FOR_READ ? 
O_RDONLY : (O_WRONLY|O_CREAT|O_TRUNC), S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH)) < 0) errstr = strerror(errno); done: if (errstr) errx(EX_OSERR, "ERROR: Cannot open \"%s\" for %s: %s.", path, (mode == PMCSTAT_OPEN_FOR_READ ? "reading" : "writing"), errstr); return (fd); } /* * Close a logfile, after first flushing all in-module queued data. */ int pmcstat_close_log(struct pmcstat_args *args) { /* If a local logfile is configured ask the kernel to stop * and flush data. Kernel will close the file when data is flushed * so keep the status to EXITING. */ if (args->pa_logfd != -1) { if (pmc_close_logfile() < 0) err(EX_OSERR, "ERROR: logging failed"); } return (args->pa_flags & FLAG_HAS_PIPE ? PMCSTAT_EXITING : PMCSTAT_FINISHED); } /* * Initialize module. */ void pmcstat_initialize_logging(struct pmcstat_process **pmcstat_kernproc, struct pmcstat_args *args, struct pmc_plugins *plugins, int *pmcstat_npmcs, int *pmcstat_mergepmc) { struct pmcstat_process *pmcstat_kp; int i; /* use a convenient format for 'ldd' output */ if (setenv("LD_TRACE_LOADED_OBJECTS_FMT1","%o \"%p\" %x\n",1) != 0) err(EX_OSERR, "ERROR: Cannot setenv"); /* Initialize hash tables */ pmcstat_string_initialize(); for (i = 0; i < PMCSTAT_NHASH; i++) { LIST_INIT(&pmcstat_image_hash[i]); LIST_INIT(&pmcstat_process_hash[i]); } /* * Create a fake 'process' entry for the kernel with pid -1. * hwpmc(4) will subsequently inform us about where the kernel * and any loaded kernel modules are mapped. */ if ((pmcstat_kp = pmcstat_process_lookup((pid_t) -1, PMCSTAT_ALLOCATE)) == NULL) err(EX_OSERR, "ERROR: Cannot initialize logging"); *pmcstat_kernproc = pmcstat_kp; /* PMC count. */ *pmcstat_npmcs = 0; /* Merge PMC with same name. */ *pmcstat_mergepmc = args->pa_mergepmc; /* * Initialize plugins */ if (plugins[args->pa_pplugin].pl_init != NULL) plugins[args->pa_pplugin].pl_init(); if (plugins[args->pa_plugin].pl_init != NULL) plugins[args->pa_plugin].pl_init(); } /* * Shutdown module. 
*/ void pmcstat_shutdown_logging(struct pmcstat_args *args, struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats) { struct pmcstat_image *pi, *pitmp; struct pmcstat_process *pp, *pptmp; struct pmcstat_pcmap *ppm, *ppmtmp; FILE *mf; int i; /* determine where to send the map file */ mf = NULL; if (args->pa_mapfilename != NULL) mf = (strcmp(args->pa_mapfilename, "-") == 0) ? args->pa_printfile : fopen(args->pa_mapfilename, "w"); if (mf == NULL && args->pa_flags & FLAG_DO_GPROF && args->pa_verbosity >= 2) mf = args->pa_printfile; if (mf) (void) fprintf(mf, "MAP:\n"); /* * Shutdown the plugins */ if (plugins[args->pa_plugin].pl_shutdown != NULL) plugins[args->pa_plugin].pl_shutdown(mf); if (plugins[args->pa_pplugin].pl_shutdown != NULL) plugins[args->pa_pplugin].pl_shutdown(mf); for (i = 0; i < PMCSTAT_NHASH; i++) { LIST_FOREACH_SAFE(pi, &pmcstat_image_hash[i], pi_next, pitmp) { if (plugins[args->pa_plugin].pl_shutdownimage != NULL) plugins[args->pa_plugin].pl_shutdownimage(pi); if (plugins[args->pa_pplugin].pl_shutdownimage != NULL) plugins[args->pa_pplugin].pl_shutdownimage(pi); free(pi->pi_symbols); if (pi->pi_addr2line != NULL) pclose(pi->pi_addr2line); LIST_REMOVE(pi, pi_next); free(pi); } LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[i], pp_next, pptmp) { TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) { TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next); free(ppm); } LIST_REMOVE(pp, pp_next); free(pp); } } pmcstat_string_shutdown(); /* * Print errors unless -q was specified. Print all statistics * if verbosity > 1. 
*/ #define PRINT(N,V) do { \ if (pmcstat_stats->ps_##V || args->pa_verbosity >= 2) \ (void) fprintf(args->pa_printfile, " %-40s %d\n",\ N, pmcstat_stats->ps_##V); \ } while (0) if (args->pa_verbosity >= 1 && (args->pa_flags & FLAG_DO_ANALYSIS)) { (void) fprintf(args->pa_printfile, "CONVERSION STATISTICS:\n"); PRINT("#exec/a.out", exec_aout); PRINT("#exec/elf", exec_elf); PRINT("#exec/unknown", exec_indeterminable); PRINT("#exec handling errors", exec_errors); PRINT("#samples/total", samples_total); PRINT("#samples/unclaimed", samples_unknown_offset); PRINT("#samples/unknown-object", samples_indeterminable); PRINT("#samples/unknown-function", samples_unknown_function); PRINT("#callchain/dubious-frames", callchain_dubious_frames); } if (mf) (void) fclose(mf); } diff --git a/lib/libpmcstat/libpmcstat_process.c b/lib/libpmcstat/libpmcstat_process.c index 147eff9ab23e..4d710eac2f58 100644 --- a/lib/libpmcstat/libpmcstat_process.c +++ b/lib/libpmcstat/libpmcstat_process.c @@ -1,363 +1,363 @@ /*- * Copyright (c) 2003-2008 Joseph Koshy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "libpmcstat.h" /* * Associate an AOUT image with a process. */ void pmcstat_process_aout_exec(struct pmcstat_process *pp, - struct pmcstat_image *image, uintfptr_t entryaddr) + struct pmcstat_image *image, uintptr_t baseaddr) { (void) pp; (void) image; - (void) entryaddr; + (void) baseaddr; /* TODO Implement a.out handling */ } /* * Associate an ELF image with a process. */ void pmcstat_process_elf_exec(struct pmcstat_process *pp, - struct pmcstat_image *image, uintfptr_t entryaddr, + struct pmcstat_image *image, uintptr_t baseaddr, uintptr_t dynaddr, struct pmcstat_args *args, struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats) { - uintmax_t libstart; struct pmcstat_image *rtldimage; assert(image->pi_type == PMCSTAT_IMAGE_ELF32 || image->pi_type == PMCSTAT_IMAGE_ELF64); - /* Create a map entry for the base executable. */ - pmcstat_image_link(pp, image, image->pi_vaddr); + /* + * The exact address where the executable gets mapped in will vary for + * PIEs. The dynamic address recorded at process exec time corresponds + * to the address where the executable's file object had been mapped to. 
+ */ + pmcstat_image_link(pp, image, image->pi_vaddr + dynaddr); /* * For dynamically linked executables we need to determine * where the dynamic linker was mapped to for this process, * Subsequent executable objects that are mapped in by the * dynamic linker will be tracked by log events of type * PMCLOG_TYPE_MAP_IN. */ if (image->pi_isdynamic) { /* * The runtime loader gets loaded just after the maximum * possible heap address. Like so: * * [ TEXT DATA BSS HEAP -->*RTLD SHLIBS <--STACK] * ^ ^ * 0 VM_MAXUSER_ADDRESS - * * The exact address where the loader gets mapped in * will vary according to the size of the executable * and the limits on the size of the process'es data - * segment at the time of exec(). The entry address + * segment at the time of exec(). The base address * recorded at process exec time corresponds to the - * 'start' address inside the dynamic linker. From - * this we can figure out the address where the - * runtime loader's file object had been mapped to. + * address where the runtime loader's file object had + * been mapped to. */ rtldimage = pmcstat_image_from_path(image->pi_dynlinkerpath, 0, args, plugins); if (rtldimage == NULL) { warnx("WARNING: Cannot find image for \"%s\".", pmcstat_string_unintern(image->pi_dynlinkerpath)); pmcstat_stats->ps_exec_errors++; return; } if (rtldimage->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_get_elf_params(rtldimage, args); if (rtldimage->pi_type != PMCSTAT_IMAGE_ELF32 && rtldimage->pi_type != PMCSTAT_IMAGE_ELF64) { warnx("WARNING: rtld not an ELF object \"%s\".", pmcstat_string_unintern(image->pi_dynlinkerpath)); return; } - libstart = entryaddr - rtldimage->pi_entry; - pmcstat_image_link(pp, rtldimage, libstart); + pmcstat_image_link(pp, rtldimage, baseaddr); } } /* * Associate an image and a process. 
*/ void pmcstat_process_exec(struct pmcstat_process *pp, - pmcstat_interned_string path, uintfptr_t entryaddr, + pmcstat_interned_string path, uintptr_t baseaddr, uintptr_t dynaddr, struct pmcstat_args *args, struct pmc_plugins *plugins, struct pmcstat_stats *pmcstat_stats) { struct pmcstat_image *image; if ((image = pmcstat_image_from_path(path, 0, args, plugins)) == NULL) { pmcstat_stats->ps_exec_errors++; return; } if (image->pi_type == PMCSTAT_IMAGE_UNKNOWN) pmcstat_image_determine_type(image, args); assert(image->pi_type != PMCSTAT_IMAGE_UNKNOWN); switch (image->pi_type) { case PMCSTAT_IMAGE_ELF32: case PMCSTAT_IMAGE_ELF64: pmcstat_stats->ps_exec_elf++; - pmcstat_process_elf_exec(pp, image, entryaddr, + pmcstat_process_elf_exec(pp, image, baseaddr, dynaddr, args, plugins, pmcstat_stats); break; case PMCSTAT_IMAGE_AOUT: pmcstat_stats->ps_exec_aout++; - pmcstat_process_aout_exec(pp, image, entryaddr); + pmcstat_process_aout_exec(pp, image, baseaddr); break; case PMCSTAT_IMAGE_INDETERMINABLE: pmcstat_stats->ps_exec_indeterminable++; break; default: err(EX_SOFTWARE, "ERROR: Unsupported executable type for \"%s\"", pmcstat_string_unintern(path)); } } /* * Find the map entry associated with process 'p' at PC value 'pc'. */ struct pmcstat_pcmap * pmcstat_process_find_map(struct pmcstat_process *p, uintfptr_t pc) { struct pmcstat_pcmap *ppm; TAILQ_FOREACH(ppm, &p->pp_map, ppm_next) { if (pc >= ppm->ppm_lowpc && pc < ppm->ppm_highpc) return (ppm); if (pc < ppm->ppm_lowpc) return (NULL); } return (NULL); } /* * Find the process descriptor corresponding to a PID. If 'allocate' * is zero, we return a NULL if a pid descriptor could not be found or * a process descriptor process. If 'allocate' is non-zero, then we * will attempt to allocate a fresh process descriptor. Zombie * process descriptors are only removed if a fresh allocation for the * same PID is requested. 
*/ struct pmcstat_process * pmcstat_process_lookup(pid_t pid, int allocate) { uint32_t hash; struct pmcstat_pcmap *ppm, *ppmtmp; struct pmcstat_process *pp, *pptmp; hash = (uint32_t) pid & PMCSTAT_HASH_MASK; /* simplicity wins */ LIST_FOREACH_SAFE(pp, &pmcstat_process_hash[hash], pp_next, pptmp) if (pp->pp_pid == pid) { /* Found a descriptor, check and process zombies */ if (allocate && pp->pp_isactive == 0) { /* remove maps */ TAILQ_FOREACH_SAFE(ppm, &pp->pp_map, ppm_next, ppmtmp) { TAILQ_REMOVE(&pp->pp_map, ppm, ppm_next); free(ppm); } /* remove process entry */ LIST_REMOVE(pp, pp_next); free(pp); break; } return (pp); } if (!allocate) return (NULL); if ((pp = malloc(sizeof(*pp))) == NULL) err(EX_OSERR, "ERROR: Cannot allocate pid descriptor"); pp->pp_pid = pid; pp->pp_isactive = 1; TAILQ_INIT(&pp->pp_map); LIST_INSERT_HEAD(&pmcstat_process_hash[hash], pp, pp_next); return (pp); } void pmcstat_create_process(int *pmcstat_sockpair, struct pmcstat_args *args, int pmcstat_kq) { char token; pid_t pid; struct kevent kev; struct pmcstat_target *pt; if (socketpair(AF_UNIX, SOCK_STREAM, 0, pmcstat_sockpair) < 0) err(EX_OSERR, "ERROR: cannot create socket pair"); switch (pid = fork()) { case -1: err(EX_OSERR, "ERROR: cannot fork"); /*NOTREACHED*/ case 0: /* child */ (void) close(pmcstat_sockpair[PARENTSOCKET]); /* Write a token to tell our parent we've started executing. */ if (write(pmcstat_sockpair[CHILDSOCKET], "+", 1) != 1) err(EX_OSERR, "ERROR (child): cannot write token"); /* Wait for our parent to signal us to start. 
*/ if (read(pmcstat_sockpair[CHILDSOCKET], &token, 1) < 0) err(EX_OSERR, "ERROR (child): cannot read token"); (void) close(pmcstat_sockpair[CHILDSOCKET]); /* exec() the program requested */ execvp(*args->pa_argv, args->pa_argv); /* and if that fails, notify the parent */ kill(getppid(), SIGCHLD); err(EX_OSERR, "ERROR: execvp \"%s\" failed", *args->pa_argv); /*NOTREACHED*/ default: /* parent */ (void) close(pmcstat_sockpair[CHILDSOCKET]); break; } /* Ask to be notified via a kevent when the target process exits. */ EV_SET(&kev, pid, EVFILT_PROC, EV_ADD | EV_ONESHOT, NOTE_EXIT, 0, NULL); if (kevent(pmcstat_kq, &kev, 1, NULL, 0, NULL) < 0) err(EX_OSERR, "ERROR: cannot monitor child process %d", pid); if ((pt = malloc(sizeof(*pt))) == NULL) errx(EX_SOFTWARE, "ERROR: Out of memory."); pt->pt_pid = pid; SLIST_INSERT_HEAD(&args->pa_targets, pt, pt_next); /* Wait for the child to signal that its ready to go. */ if (read(pmcstat_sockpair[PARENTSOCKET], &token, 1) < 0) err(EX_OSERR, "ERROR (parent): cannot read token"); return; } /* * Do process profiling * * If a pid was specified, attach each allocated PMC to the target * process. Otherwise, fork a child and attach the PMCs to the child, * and have the child exec() the target program. */ void pmcstat_start_process(int *pmcstat_sockpair) { /* Signal the child to proceed. */ if (write(pmcstat_sockpair[PARENTSOCKET], "!", 1) != 1) err(EX_OSERR, "ERROR (parent): write of token failed"); (void) close(pmcstat_sockpair[PARENTSOCKET]); } void pmcstat_attach_pmcs(struct pmcstat_args *args) { struct pmcstat_ev *ev; struct pmcstat_target *pt; int count; /* Attach all process PMCs to target processes. 
*/ count = 0; STAILQ_FOREACH(ev, &args->pa_events, ev_next) { if (PMC_IS_SYSTEM_MODE(ev->ev_mode)) continue; SLIST_FOREACH(pt, &args->pa_targets, pt_next) { if (pmc_attach(ev->ev_pmcid, pt->pt_pid) == 0) count++; else if (errno != ESRCH) err(EX_OSERR, "ERROR: cannot attach pmc \"%s\" to process %d", ev->ev_name, (int)pt->pt_pid); } } if (count == 0) errx(EX_DATAERR, "ERROR: No processes were attached to."); } diff --git a/sys/dev/hwpmc/hwpmc_logging.c b/sys/dev/hwpmc/hwpmc_logging.c index 02f2c1c383e2..00ecd361a866 100644 --- a/sys/dev/hwpmc/hwpmc_logging.c +++ b/sys/dev/hwpmc/hwpmc_logging.c @@ -1,1294 +1,1295 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2005-2007 Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * Copyright (c) 2018 Matthew Macy * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ /* * Logging code for hwpmc(4) */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(__i386__) || defined(__amd64__) #include #endif #define curdomain PCPU_GET(domain) /* * Sysctl tunables */ SYSCTL_DECL(_kern_hwpmc); /* * kern.hwpmc.logbuffersize -- size of the per-cpu owner buffers. */ static int pmclog_buffer_size = PMC_LOG_BUFFER_SIZE; SYSCTL_INT(_kern_hwpmc, OID_AUTO, logbuffersize, CTLFLAG_RDTUN, &pmclog_buffer_size, 0, "size of log buffers in kilobytes"); /* * kern.hwpmc.nbuffers_pcpu -- number of global log buffers */ static int pmc_nlogbuffers_pcpu = PMC_NLOGBUFFERS_PCPU; SYSCTL_INT(_kern_hwpmc, OID_AUTO, nbuffers_pcpu, CTLFLAG_RDTUN, &pmc_nlogbuffers_pcpu, 0, "number of log buffers per cpu"); /* * Global log buffer list and associated spin lock. */ static struct mtx pmc_kthread_mtx; /* sleep lock */ #define PMCLOG_INIT_BUFFER_DESCRIPTOR(D, buf, domain) do { \ (D)->plb_fence = ((char *) (buf)) + 1024*pmclog_buffer_size; \ (D)->plb_base = (D)->plb_ptr = ((char *) (buf)); \ (D)->plb_domain = domain; \ } while (0) #define PMCLOG_RESET_BUFFER_DESCRIPTOR(D) do { \ (D)->plb_ptr = (D)->plb_base; \ } while (0) /* * Log file record constructors. 
*/ #define _PMCLOG_TO_HEADER(T,L) \ ((PMCLOG_HEADER_MAGIC << 24) | (T << 16) | ((L) & 0xFFFF)) /* reserve LEN bytes of space and initialize the entry header */ #define _PMCLOG_RESERVE_SAFE(PO,TYPE,LEN,ACTION, TSC) do { \ uint32_t *_le; \ int _len = roundup((LEN), sizeof(uint32_t)); \ struct pmclog_header *ph; \ if ((_le = pmclog_reserve((PO), _len)) == NULL) { \ ACTION; \ } \ ph = (struct pmclog_header *)_le; \ ph->pl_header =_PMCLOG_TO_HEADER(TYPE,_len); \ ph->pl_tsc = (TSC); \ _le += sizeof(*ph)/4 /* skip over timestamp */ /* reserve LEN bytes of space and initialize the entry header */ #define _PMCLOG_RESERVE(PO,TYPE,LEN,ACTION) do { \ uint32_t *_le; \ int _len = roundup((LEN), sizeof(uint32_t)); \ uint64_t tsc; \ struct pmclog_header *ph; \ tsc = pmc_rdtsc(); \ spinlock_enter(); \ if ((_le = pmclog_reserve((PO), _len)) == NULL) { \ spinlock_exit(); \ ACTION; \ } \ ph = (struct pmclog_header *)_le; \ ph->pl_header =_PMCLOG_TO_HEADER(TYPE,_len); \ ph->pl_tsc = tsc; \ _le += sizeof(*ph)/4 /* skip over timestamp */ #define PMCLOG_RESERVE_SAFE(P,T,L,TSC) _PMCLOG_RESERVE_SAFE(P,T,L,return,TSC) #define PMCLOG_RESERVE(P,T,L) _PMCLOG_RESERVE(P,T,L,return) #define PMCLOG_RESERVE_WITH_ERROR(P,T,L) _PMCLOG_RESERVE(P,T,L, \ error=ENOMEM;goto error) #define PMCLOG_EMIT32(V) do { *_le++ = (V); } while (0) #define PMCLOG_EMIT64(V) do { \ *_le++ = (uint32_t) ((V) & 0xFFFFFFFF); \ *_le++ = (uint32_t) (((V) >> 32) & 0xFFFFFFFF); \ } while (0) /* Emit a string. 
Caution: does NOT update _le, so needs to be last */ #define PMCLOG_EMITSTRING(S,L) do { bcopy((S), _le, (L)); } while (0) #define PMCLOG_EMITNULLSTRING(L) do { bzero(_le, (L)); } while (0) #define PMCLOG_DESPATCH_SAFE(PO) \ pmclog_release((PO)); \ } while (0) #define PMCLOG_DESPATCH_SCHED_LOCK(PO) \ pmclog_release_flags((PO), 0); \ } while (0) #define PMCLOG_DESPATCH(PO) \ pmclog_release((PO)); \ spinlock_exit(); \ } while (0) #define PMCLOG_DESPATCH_SYNC(PO) \ pmclog_schedule_io((PO), 1); \ spinlock_exit(); \ } while (0) #define TSDELTA 4 /* * Assertions about the log file format. */ CTASSERT(sizeof(struct pmclog_callchain) == 7*4 + TSDELTA + PMC_CALLCHAIN_DEPTH_MAX*sizeof(uintfptr_t)); CTASSERT(sizeof(struct pmclog_closelog) == 3*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_dropnotify) == 3*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_map_in) == PATH_MAX + TSDELTA + 5*4 + sizeof(uintfptr_t)); CTASSERT(offsetof(struct pmclog_map_in,pl_pathname) == 5*4 + TSDELTA + sizeof(uintfptr_t)); CTASSERT(sizeof(struct pmclog_map_out) == 5*4 + 2*sizeof(uintfptr_t) + TSDELTA); CTASSERT(sizeof(struct pmclog_pmcallocate) == 9*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_pmcattach) == 5*4 + PATH_MAX + TSDELTA); CTASSERT(offsetof(struct pmclog_pmcattach,pl_pathname) == 5*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_pmcdetach) == 5*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_proccsw) == 7*4 + 8 + TSDELTA); CTASSERT(sizeof(struct pmclog_procexec) == 5*4 + PATH_MAX + - sizeof(uintfptr_t) + TSDELTA); + 2*sizeof(uintptr_t) + TSDELTA); CTASSERT(offsetof(struct pmclog_procexec,pl_pathname) == 5*4 + TSDELTA + - sizeof(uintfptr_t)); + 2*sizeof(uintptr_t)); CTASSERT(sizeof(struct pmclog_procexit) == 5*4 + 8 + TSDELTA); CTASSERT(sizeof(struct pmclog_procfork) == 5*4 + TSDELTA); CTASSERT(sizeof(struct pmclog_sysexit) == 6*4); CTASSERT(sizeof(struct pmclog_userdata) == 6*4); /* * Log buffer structure */ struct pmclog_buffer { TAILQ_ENTRY(pmclog_buffer) plb_next; char *plb_base; char *plb_ptr; char 
*plb_fence; uint16_t plb_domain; } __aligned(CACHE_LINE_SIZE); /* * Prototypes */ static int pmclog_get_buffer(struct pmc_owner *po); static void pmclog_loop(void *arg); static void pmclog_release(struct pmc_owner *po); static uint32_t *pmclog_reserve(struct pmc_owner *po, int length); static void pmclog_schedule_io(struct pmc_owner *po, int wakeup); static void pmclog_schedule_all(struct pmc_owner *po); static void pmclog_stop_kthread(struct pmc_owner *po); /* * Helper functions */ static inline void pmc_plb_rele_unlocked(struct pmclog_buffer *plb) { TAILQ_INSERT_HEAD(&pmc_dom_hdrs[plb->plb_domain]->pdbh_head, plb, plb_next); } static inline void pmc_plb_rele(struct pmclog_buffer *plb) { mtx_lock_spin(&pmc_dom_hdrs[plb->plb_domain]->pdbh_mtx); pmc_plb_rele_unlocked(plb); mtx_unlock_spin(&pmc_dom_hdrs[plb->plb_domain]->pdbh_mtx); } /* * Get a log buffer */ static int pmclog_get_buffer(struct pmc_owner *po) { struct pmclog_buffer *plb; int domain; KASSERT(po->po_curbuf[curcpu] == NULL, ("[pmclog,%d] po=%p current buffer still valid", __LINE__, po)); domain = curdomain; MPASS(pmc_dom_hdrs[domain]); mtx_lock_spin(&pmc_dom_hdrs[domain]->pdbh_mtx); if ((plb = TAILQ_FIRST(&pmc_dom_hdrs[domain]->pdbh_head)) != NULL) TAILQ_REMOVE(&pmc_dom_hdrs[domain]->pdbh_head, plb, plb_next); mtx_unlock_spin(&pmc_dom_hdrs[domain]->pdbh_mtx); PMCDBG2(LOG,GTB,1, "po=%p plb=%p", po, plb); #ifdef HWPMC_DEBUG if (plb) KASSERT(plb->plb_ptr == plb->plb_base && plb->plb_base < plb->plb_fence, ("[pmclog,%d] po=%p buffer invariants: ptr=%p " "base=%p fence=%p", __LINE__, po, plb->plb_ptr, plb->plb_base, plb->plb_fence)); #endif po->po_curbuf[curcpu] = plb; /* update stats */ counter_u64_add(pmc_stats.pm_buffer_requests, 1); if (plb == NULL) counter_u64_add(pmc_stats.pm_buffer_requests_failed, 1); return (plb ? 
0 : ENOMEM); } struct pmclog_proc_init_args { struct proc *kthr; struct pmc_owner *po; bool exit; bool acted; }; int pmclog_proc_create(struct thread *td, void **handlep) { struct pmclog_proc_init_args *ia; int error; ia = malloc(sizeof(*ia), M_TEMP, M_WAITOK | M_ZERO); error = kproc_create(pmclog_loop, ia, &ia->kthr, RFHIGHPID, 0, "hwpmc: proc(%d)", td->td_proc->p_pid); if (error == 0) *handlep = ia; return (error); } void pmclog_proc_ignite(void *handle, struct pmc_owner *po) { struct pmclog_proc_init_args *ia; ia = handle; mtx_lock(&pmc_kthread_mtx); MPASS(!ia->acted); MPASS(ia->po == NULL); MPASS(!ia->exit); MPASS(ia->kthr != NULL); if (po == NULL) { ia->exit = true; } else { ia->po = po; KASSERT(po->po_kthread == NULL, ("[pmclog,%d] po=%p kthread (%p) already present", __LINE__, po, po->po_kthread)); po->po_kthread = ia->kthr; } wakeup(ia); while (!ia->acted) msleep(ia, &pmc_kthread_mtx, PWAIT, "pmclogw", 0); mtx_unlock(&pmc_kthread_mtx); free(ia, M_TEMP); } /* * Log handler loop. * * This function is executed by each pmc owner's helper thread. 
*/ static void pmclog_loop(void *arg) { struct pmclog_proc_init_args *ia; struct pmc_owner *po; struct pmclog_buffer *lb; struct proc *p; struct ucred *ownercred; struct ucred *mycred; struct thread *td; sigset_t unb; struct uio auio; struct iovec aiov; size_t nbytes; int error; td = curthread; SIGEMPTYSET(unb); SIGADDSET(unb, SIGHUP); (void)kern_sigprocmask(td, SIG_UNBLOCK, &unb, NULL, 0); ia = arg; MPASS(ia->kthr == curproc); MPASS(!ia->acted); mtx_lock(&pmc_kthread_mtx); while (ia->po == NULL && !ia->exit) msleep(ia, &pmc_kthread_mtx, PWAIT, "pmclogi", 0); if (ia->exit) { ia->acted = true; wakeup(ia); mtx_unlock(&pmc_kthread_mtx); kproc_exit(0); } MPASS(ia->po != NULL); po = ia->po; ia->acted = true; wakeup(ia); mtx_unlock(&pmc_kthread_mtx); ia = NULL; p = po->po_owner; mycred = td->td_ucred; PROC_LOCK(p); ownercred = crhold(p->p_ucred); PROC_UNLOCK(p); PMCDBG2(LOG,INI,1, "po=%p kt=%p", po, po->po_kthread); KASSERT(po->po_kthread == curthread->td_proc, ("[pmclog,%d] proc mismatch po=%p po/kt=%p curproc=%p", __LINE__, po, po->po_kthread, curthread->td_proc)); lb = NULL; /* * Loop waiting for I/O requests to be added to the owner * struct's queue. The loop is exited when the log file * is deconfigured. */ mtx_lock(&pmc_kthread_mtx); for (;;) { /* check if we've been asked to exit */ if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) break; if (lb == NULL) { /* look for a fresh buffer to write */ mtx_lock_spin(&po->po_mtx); if ((lb = TAILQ_FIRST(&po->po_logbuffers)) == NULL) { mtx_unlock_spin(&po->po_mtx); /* No more buffers and shutdown required. 
*/ if (po->po_flags & PMC_PO_SHUTDOWN) break; (void) msleep(po, &pmc_kthread_mtx, PWAIT, "pmcloop", 250); continue; } TAILQ_REMOVE(&po->po_logbuffers, lb, plb_next); mtx_unlock_spin(&po->po_mtx); } mtx_unlock(&pmc_kthread_mtx); /* process the request */ PMCDBG3(LOG,WRI,2, "po=%p base=%p ptr=%p", po, lb->plb_base, lb->plb_ptr); /* change our thread's credentials before issuing the I/O */ aiov.iov_base = lb->plb_base; aiov.iov_len = nbytes = lb->plb_ptr - lb->plb_base; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_offset = -1; auio.uio_resid = nbytes; auio.uio_rw = UIO_WRITE; auio.uio_segflg = UIO_SYSSPACE; auio.uio_td = td; /* switch thread credentials -- see kern_ktrace.c */ td->td_ucred = ownercred; error = fo_write(po->po_file, &auio, ownercred, 0, td); td->td_ucred = mycred; if (error) { /* XXX some errors are recoverable */ /* send a SIGIO to the owner and exit */ PROC_LOCK(p); kern_psignal(p, SIGIO); PROC_UNLOCK(p); mtx_lock(&pmc_kthread_mtx); po->po_error = error; /* save for flush log */ PMCDBG2(LOG,WRI,2, "po=%p error=%d", po, error); break; } mtx_lock(&pmc_kthread_mtx); /* put the used buffer back into the global pool */ PMCLOG_RESET_BUFFER_DESCRIPTOR(lb); pmc_plb_rele(lb); lb = NULL; } wakeup_one(po->po_kthread); po->po_kthread = NULL; mtx_unlock(&pmc_kthread_mtx); /* return the current I/O buffer to the global pool */ if (lb) { PMCLOG_RESET_BUFFER_DESCRIPTOR(lb); pmc_plb_rele(lb); } /* * Exit this thread, signalling the waiter */ crfree(ownercred); kproc_exit(0); } /* * Release and log entry and schedule an I/O if needed. 
*/ static void pmclog_release_flags(struct pmc_owner *po, int wakeup) { struct pmclog_buffer *plb; plb = po->po_curbuf[curcpu]; KASSERT(plb->plb_ptr >= plb->plb_base, ("[pmclog,%d] buffer invariants po=%p ptr=%p base=%p", __LINE__, po, plb->plb_ptr, plb->plb_base)); KASSERT(plb->plb_ptr <= plb->plb_fence, ("[pmclog,%d] buffer invariants po=%p ptr=%p fenc=%p", __LINE__, po, plb->plb_ptr, plb->plb_fence)); /* schedule an I/O if we've filled a buffer */ if (plb->plb_ptr >= plb->plb_fence) pmclog_schedule_io(po, wakeup); PMCDBG1(LOG,REL,1, "po=%p", po); } static void pmclog_release(struct pmc_owner *po) { pmclog_release_flags(po, 1); } /* * Attempt to reserve 'length' bytes of space in an owner's log * buffer. The function returns a pointer to 'length' bytes of space * if there was enough space or returns NULL if no space was * available. Non-null returns do so with the po mutex locked. The * caller must invoke pmclog_release() on the pmc owner structure * when done. */ static uint32_t * pmclog_reserve(struct pmc_owner *po, int length) { uintptr_t newptr, oldptr __diagused; struct pmclog_buffer *plb, **pplb; PMCDBG2(LOG,ALL,1, "po=%p len=%d", po, length); KASSERT(length % sizeof(uint32_t) == 0, ("[pmclog,%d] length not a multiple of word size", __LINE__)); /* No more data when shutdown in progress. 
*/ if (po->po_flags & PMC_PO_SHUTDOWN) return (NULL); pplb = &po->po_curbuf[curcpu]; if (*pplb == NULL && pmclog_get_buffer(po) != 0) goto fail; KASSERT(*pplb != NULL, ("[pmclog,%d] po=%p no current buffer", __LINE__, po)); plb = *pplb; KASSERT(plb->plb_ptr >= plb->plb_base && plb->plb_ptr <= plb->plb_fence, ("[pmclog,%d] po=%p buffer invariants: ptr=%p base=%p fence=%p", __LINE__, po, plb->plb_ptr, plb->plb_base, plb->plb_fence)); oldptr = (uintptr_t) plb->plb_ptr; newptr = oldptr + length; KASSERT(oldptr != (uintptr_t) NULL, ("[pmclog,%d] po=%p Null log buffer pointer", __LINE__, po)); /* * If we have space in the current buffer, return a pointer to * available space with the PO structure locked. */ if (newptr <= (uintptr_t) plb->plb_fence) { plb->plb_ptr = (char *) newptr; goto done; } /* * Otherwise, schedule the current buffer for output and get a * fresh buffer. */ pmclog_schedule_io(po, 0); if (pmclog_get_buffer(po) != 0) goto fail; plb = *pplb; KASSERT(plb != NULL, ("[pmclog,%d] po=%p no current buffer", __LINE__, po)); KASSERT(plb->plb_ptr != NULL, ("[pmclog,%d] null return from pmc_get_log_buffer", __LINE__)); KASSERT(plb->plb_ptr == plb->plb_base && plb->plb_ptr <= plb->plb_fence, ("[pmclog,%d] po=%p buffer invariants: ptr=%p base=%p fence=%p", __LINE__, po, plb->plb_ptr, plb->plb_base, plb->plb_fence)); oldptr = (uintptr_t) plb->plb_ptr; done: return ((uint32_t *) oldptr); fail: return (NULL); } /* * Schedule an I/O. * * Transfer the current buffer to the helper kthread. 
*/ static void pmclog_schedule_io(struct pmc_owner *po, int wakeup) { struct pmclog_buffer *plb; plb = po->po_curbuf[curcpu]; po->po_curbuf[curcpu] = NULL; KASSERT(plb != NULL, ("[pmclog,%d] schedule_io with null buffer po=%p", __LINE__, po)); KASSERT(plb->plb_ptr >= plb->plb_base, ("[pmclog,%d] buffer invariants po=%p ptr=%p base=%p", __LINE__, po, plb->plb_ptr, plb->plb_base)); KASSERT(plb->plb_ptr <= plb->plb_fence, ("[pmclog,%d] buffer invariants po=%p ptr=%p fenc=%p", __LINE__, po, plb->plb_ptr, plb->plb_fence)); PMCDBG1(LOG,SIO, 1, "po=%p", po); /* * Add the current buffer to the tail of the buffer list and * wakeup the helper. */ mtx_lock_spin(&po->po_mtx); TAILQ_INSERT_TAIL(&po->po_logbuffers, plb, plb_next); mtx_unlock_spin(&po->po_mtx); if (wakeup) wakeup_one(po); } /* * Stop the helper kthread. */ static void pmclog_stop_kthread(struct pmc_owner *po) { mtx_lock(&pmc_kthread_mtx); po->po_flags &= ~PMC_PO_OWNS_LOGFILE; if (po->po_kthread != NULL) { PROC_LOCK(po->po_kthread); kern_psignal(po->po_kthread, SIGHUP); PROC_UNLOCK(po->po_kthread); } wakeup_one(po); while (po->po_kthread) msleep(po->po_kthread, &pmc_kthread_mtx, PPAUSE, "pmckstp", 0); mtx_unlock(&pmc_kthread_mtx); } /* * Public functions */ /* * Configure a log file for pmc owner 'po'. * * Parameter 'logfd' is a file handle referencing an open file in the * owner process. This file needs to have been opened for writing. 
*/ int pmclog_configure_log(struct pmc_mdep *md, struct pmc_owner *po, int logfd) { struct proc *p; struct timespec ts; int error; sx_assert(&pmc_sx, SA_XLOCKED); PMCDBG2(LOG,CFG,1, "config po=%p logfd=%d", po, logfd); p = po->po_owner; /* return EBUSY if a log file was already present */ if (po->po_flags & PMC_PO_OWNS_LOGFILE) return (EBUSY); KASSERT(po->po_file == NULL, ("[pmclog,%d] po=%p file (%p) already present", __LINE__, po, po->po_file)); /* get a reference to the file state */ error = fget_write(curthread, logfd, &cap_write_rights, &po->po_file); if (error) goto error; /* mark process as owning a log file */ po->po_flags |= PMC_PO_OWNS_LOGFILE; /* mark process as using HWPMCs */ PROC_LOCK(p); p->p_flag |= P_HWPMC; PROC_UNLOCK(p); nanotime(&ts); /* create a log initialization entry */ PMCLOG_RESERVE_WITH_ERROR(po, PMCLOG_TYPE_INITIALIZE, sizeof(struct pmclog_initialize)); PMCLOG_EMIT32(PMC_VERSION); PMCLOG_EMIT32(md->pmd_cputype); #if defined(__i386__) || defined(__amd64__) PMCLOG_EMIT64(tsc_freq); #else /* other architectures will need to fill this in */ PMCLOG_EMIT32(0); PMCLOG_EMIT32(0); #endif memcpy(_le, &ts, sizeof(ts)); _le += sizeof(ts)/4; PMCLOG_EMITSTRING(pmc_cpuid, PMC_CPUID_LEN); PMCLOG_DESPATCH_SYNC(po); return (0); error: KASSERT(po->po_kthread == NULL, ("[pmclog,%d] po=%p kthread not " "stopped", __LINE__, po)); if (po->po_file) (void) fdrop(po->po_file, curthread); po->po_file = NULL; /* clear file and error state */ po->po_error = 0; po->po_flags &= ~PMC_PO_OWNS_LOGFILE; return (error); } /* * De-configure a log file. This will throw away any buffers queued * for this owner process. 
*/ int pmclog_deconfigure_log(struct pmc_owner *po) { int error; struct pmclog_buffer *lb; struct pmc_binding pb; PMCDBG1(LOG,CFG,1, "de-config po=%p", po); if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) return (EINVAL); KASSERT(po->po_sscount == 0, ("[pmclog,%d] po=%p still owning SS PMCs", __LINE__, po)); KASSERT(po->po_file != NULL, ("[pmclog,%d] po=%p no log file", __LINE__, po)); /* stop the kthread, this will reset the 'OWNS_LOGFILE' flag */ pmclog_stop_kthread(po); KASSERT(po->po_kthread == NULL, ("[pmclog,%d] po=%p kthread not stopped", __LINE__, po)); /* return all queued log buffers to the global pool */ while ((lb = TAILQ_FIRST(&po->po_logbuffers)) != NULL) { TAILQ_REMOVE(&po->po_logbuffers, lb, plb_next); PMCLOG_RESET_BUFFER_DESCRIPTOR(lb); pmc_plb_rele(lb); } pmc_save_cpu_binding(&pb); for (int i = 0; i < mp_ncpus; i++) { pmc_select_cpu(i); /* return the 'current' buffer to the global pool */ if ((lb = po->po_curbuf[curcpu]) != NULL) { PMCLOG_RESET_BUFFER_DESCRIPTOR(lb); pmc_plb_rele(lb); } } pmc_restore_cpu_binding(&pb); /* drop a reference to the fd */ if (po->po_file != NULL) { error = fdrop(po->po_file, curthread); po->po_file = NULL; } else error = 0; po->po_error = 0; return (error); } /* * Flush a process' log buffer. */ int pmclog_flush(struct pmc_owner *po, int force) { int error; PMCDBG1(LOG,FLS,1, "po=%p", po); /* * If there is a pending error recorded by the logger thread, * return that. */ if (po->po_error) return (po->po_error); error = 0; /* * Check that we do have an active log file. 
*/ mtx_lock(&pmc_kthread_mtx); if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) { error = EINVAL; goto error; } pmclog_schedule_all(po); error: mtx_unlock(&pmc_kthread_mtx); return (error); } static void pmclog_schedule_one_cond(struct pmc_owner *po) { struct pmclog_buffer *plb; int cpu; spinlock_enter(); cpu = curcpu; /* tell hardclock not to run again */ if (PMC_CPU_HAS_SAMPLES(cpu)) PMC_CALL_HOOK_UNLOCKED(curthread, PMC_FN_DO_SAMPLES, NULL); plb = po->po_curbuf[cpu]; if (plb && plb->plb_ptr != plb->plb_base) pmclog_schedule_io(po, 1); spinlock_exit(); } static void pmclog_schedule_all(struct pmc_owner *po) { struct pmc_binding pb; /* * Schedule the current buffer if any and not empty. */ pmc_save_cpu_binding(&pb); for (int i = 0; i < mp_ncpus; i++) { pmc_select_cpu(i); pmclog_schedule_one_cond(po); } pmc_restore_cpu_binding(&pb); } int pmclog_close(struct pmc_owner *po) { PMCDBG1(LOG,CLO,1, "po=%p", po); pmclog_process_closelog(po); mtx_lock(&pmc_kthread_mtx); /* * Initiate shutdown: no new data queued, * thread will close file on last block. */ po->po_flags |= PMC_PO_SHUTDOWN; /* give time for all to see */ DELAY(50); /* * Schedule the current buffer. 
*/ pmclog_schedule_all(po); wakeup_one(po); mtx_unlock(&pmc_kthread_mtx); return (0); } void pmclog_process_callchain(struct pmc *pm, struct pmc_sample *ps) { int n, recordlen; uint32_t flags; struct pmc_owner *po; PMCDBG3(LOG,SAM,1,"pm=%p pid=%d n=%d", pm, ps->ps_pid, ps->ps_nsamples); recordlen = offsetof(struct pmclog_callchain, pl_pc) + ps->ps_nsamples * sizeof(uintfptr_t); po = pm->pm_owner; flags = PMC_CALLCHAIN_TO_CPUFLAGS(ps->ps_cpu,ps->ps_flags); PMCLOG_RESERVE_SAFE(po, PMCLOG_TYPE_CALLCHAIN, recordlen, ps->ps_tsc); PMCLOG_EMIT32(ps->ps_pid); PMCLOG_EMIT32(ps->ps_tid); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(flags); for (n = 0; n < ps->ps_nsamples; n++) PMCLOG_EMITADDR(ps->ps_pc[n]); PMCLOG_DESPATCH_SAFE(po); } void pmclog_process_closelog(struct pmc_owner *po) { PMCLOG_RESERVE(po, PMCLOG_TYPE_CLOSELOG, sizeof(struct pmclog_closelog)); PMCLOG_DESPATCH_SYNC(po); } void pmclog_process_dropnotify(struct pmc_owner *po) { PMCLOG_RESERVE(po, PMCLOG_TYPE_DROPNOTIFY, sizeof(struct pmclog_dropnotify)); PMCLOG_DESPATCH(po); } void pmclog_process_map_in(struct pmc_owner *po, pid_t pid, uintfptr_t start, const char *path) { int pathlen, recordlen; KASSERT(path != NULL, ("[pmclog,%d] map-in, null path", __LINE__)); pathlen = strlen(path) + 1; /* #bytes for path name */ recordlen = offsetof(struct pmclog_map_in, pl_pathname) + pathlen; PMCLOG_RESERVE(po, PMCLOG_TYPE_MAP_IN, recordlen); PMCLOG_EMIT32(pid); PMCLOG_EMIT32(0); PMCLOG_EMITADDR(start); PMCLOG_EMITSTRING(path,pathlen); PMCLOG_DESPATCH_SYNC(po); } void pmclog_process_map_out(struct pmc_owner *po, pid_t pid, uintfptr_t start, uintfptr_t end) { KASSERT(start <= end, ("[pmclog,%d] start > end", __LINE__)); PMCLOG_RESERVE(po, PMCLOG_TYPE_MAP_OUT, sizeof(struct pmclog_map_out)); PMCLOG_EMIT32(pid); PMCLOG_EMIT32(0); PMCLOG_EMITADDR(start); PMCLOG_EMITADDR(end); PMCLOG_DESPATCH(po); } void pmclog_process_pmcallocate(struct pmc *pm) { struct pmc_owner *po; struct pmc_soft *ps; po = pm->pm_owner; PMCDBG1(LOG,ALL,1, 
"pm=%p", pm); if (PMC_TO_CLASS(pm) == PMC_CLASS_SOFT) { PMCLOG_RESERVE(po, PMCLOG_TYPE_PMCALLOCATEDYN, sizeof(struct pmclog_pmcallocatedyn)); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pm->pm_event); PMCLOG_EMIT32(pm->pm_flags); PMCLOG_EMIT32(0); PMCLOG_EMIT64(pm->pm_sc.pm_reloadcount); ps = pmc_soft_ev_acquire(pm->pm_event); if (ps != NULL) PMCLOG_EMITSTRING(ps->ps_ev.pm_ev_name,PMC_NAME_MAX); else PMCLOG_EMITNULLSTRING(PMC_NAME_MAX); pmc_soft_ev_release(ps); PMCLOG_DESPATCH_SYNC(po); } else { PMCLOG_RESERVE(po, PMCLOG_TYPE_PMCALLOCATE, sizeof(struct pmclog_pmcallocate)); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pm->pm_event); PMCLOG_EMIT32(pm->pm_flags); PMCLOG_EMIT32(0); PMCLOG_EMIT64(pm->pm_sc.pm_reloadcount); PMCLOG_DESPATCH_SYNC(po); } } void pmclog_process_pmcattach(struct pmc *pm, pid_t pid, char *path) { int pathlen, recordlen; struct pmc_owner *po; PMCDBG2(LOG,ATT,1,"pm=%p pid=%d", pm, pid); po = pm->pm_owner; pathlen = strlen(path) + 1; /* #bytes for the string */ recordlen = offsetof(struct pmclog_pmcattach, pl_pathname) + pathlen; PMCLOG_RESERVE(po, PMCLOG_TYPE_PMCATTACH, recordlen); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pid); PMCLOG_EMITSTRING(path, pathlen); PMCLOG_DESPATCH_SYNC(po); } void pmclog_process_pmcdetach(struct pmc *pm, pid_t pid) { struct pmc_owner *po; PMCDBG2(LOG,ATT,1,"!pm=%p pid=%d", pm, pid); po = pm->pm_owner; PMCLOG_RESERVE(po, PMCLOG_TYPE_PMCDETACH, sizeof(struct pmclog_pmcdetach)); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pid); PMCLOG_DESPATCH_SYNC(po); } void pmclog_process_proccreate(struct pmc_owner *po, struct proc *p, int sync) { if (sync) { PMCLOG_RESERVE(po, PMCLOG_TYPE_PROC_CREATE, sizeof(struct pmclog_proccreate)); PMCLOG_EMIT32(p->p_pid); PMCLOG_EMIT32(p->p_flag); PMCLOG_EMITSTRING(p->p_comm, MAXCOMLEN+1); PMCLOG_DESPATCH_SYNC(po); } else { PMCLOG_RESERVE(po, PMCLOG_TYPE_PROC_CREATE, sizeof(struct pmclog_proccreate)); PMCLOG_EMIT32(p->p_pid); PMCLOG_EMIT32(p->p_flag); PMCLOG_EMITSTRING(p->p_comm, MAXCOMLEN+1); 
PMCLOG_DESPATCH(po); } } /* * Log a context switch event to the log file. */ void pmclog_process_proccsw(struct pmc *pm, struct pmc_process *pp, pmc_value_t v, struct thread *td) { struct pmc_owner *po; KASSERT(pm->pm_flags & PMC_F_LOG_PROCCSW, ("[pmclog,%d] log-process-csw called gratuitously", __LINE__)); PMCDBG3(LOG,SWO,1,"pm=%p pid=%d v=%jx", pm, pp->pp_proc->p_pid, v); po = pm->pm_owner; PMCLOG_RESERVE_SAFE(po, PMCLOG_TYPE_PROCCSW, sizeof(struct pmclog_proccsw), pmc_rdtsc()); PMCLOG_EMIT64(v); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pp->pp_proc->p_pid); PMCLOG_EMIT32(td->td_tid); PMCLOG_EMIT32(0); PMCLOG_DESPATCH_SCHED_LOCK(po); } void pmclog_process_procexec(struct pmc_owner *po, pmc_id_t pmid, pid_t pid, - uintfptr_t startaddr, char *path) + uintptr_t baseaddr, uintptr_t dynaddr, char *path) { int pathlen, recordlen; PMCDBG3(LOG,EXC,1,"po=%p pid=%d path=\"%s\"", po, pid, path); pathlen = strlen(path) + 1; /* #bytes for the path */ recordlen = offsetof(struct pmclog_procexec, pl_pathname) + pathlen; PMCLOG_RESERVE(po, PMCLOG_TYPE_PROCEXEC, recordlen); PMCLOG_EMIT32(pid); PMCLOG_EMIT32(pmid); - PMCLOG_EMITADDR(startaddr); + PMCLOG_EMITADDR(baseaddr); + PMCLOG_EMITADDR(dynaddr); PMCLOG_EMITSTRING(path,pathlen); PMCLOG_DESPATCH_SYNC(po); } /* * Log a process exit event (and accumulated pmc value) to the log file. */ void pmclog_process_procexit(struct pmc *pm, struct pmc_process *pp) { int ri; struct pmc_owner *po; ri = PMC_TO_ROWINDEX(pm); PMCDBG3(LOG,EXT,1,"pm=%p pid=%d v=%jx", pm, pp->pp_proc->p_pid, pp->pp_pmcs[ri].pp_pmcval); po = pm->pm_owner; PMCLOG_RESERVE(po, PMCLOG_TYPE_PROCEXIT, sizeof(struct pmclog_procexit)); PMCLOG_EMIT32(pm->pm_id); PMCLOG_EMIT32(pp->pp_proc->p_pid); PMCLOG_EMIT64(pp->pp_pmcs[ri].pp_pmcval); PMCLOG_DESPATCH(po); } /* * Log a fork event. 
*/ void pmclog_process_procfork(struct pmc_owner *po, pid_t oldpid, pid_t newpid) { PMCLOG_RESERVE(po, PMCLOG_TYPE_PROCFORK, sizeof(struct pmclog_procfork)); PMCLOG_EMIT32(oldpid); PMCLOG_EMIT32(newpid); PMCLOG_DESPATCH(po); } /* * Log a process exit event of the form suitable for system-wide PMCs. */ void pmclog_process_sysexit(struct pmc_owner *po, pid_t pid) { PMCLOG_RESERVE(po, PMCLOG_TYPE_SYSEXIT, sizeof(struct pmclog_sysexit)); PMCLOG_EMIT32(pid); PMCLOG_DESPATCH(po); } void pmclog_process_threadcreate(struct pmc_owner *po, struct thread *td, int sync) { struct proc *p; p = td->td_proc; if (sync) { PMCLOG_RESERVE(po, PMCLOG_TYPE_THR_CREATE, sizeof(struct pmclog_threadcreate)); PMCLOG_EMIT32(td->td_tid); PMCLOG_EMIT32(p->p_pid); PMCLOG_EMIT32(p->p_flag); PMCLOG_EMIT32(0); PMCLOG_EMITSTRING(td->td_name, MAXCOMLEN+1); PMCLOG_DESPATCH_SYNC(po); } else { PMCLOG_RESERVE(po, PMCLOG_TYPE_THR_CREATE, sizeof(struct pmclog_threadcreate)); PMCLOG_EMIT32(td->td_tid); PMCLOG_EMIT32(p->p_pid); PMCLOG_EMIT32(p->p_flag); PMCLOG_EMIT32(0); PMCLOG_EMITSTRING(td->td_name, MAXCOMLEN+1); PMCLOG_DESPATCH(po); } } void pmclog_process_threadexit(struct pmc_owner *po, struct thread *td) { PMCLOG_RESERVE(po, PMCLOG_TYPE_THR_EXIT, sizeof(struct pmclog_threadexit)); PMCLOG_EMIT32(td->td_tid); PMCLOG_DESPATCH(po); } /* * Write a user log entry. */ int pmclog_process_userlog(struct pmc_owner *po, struct pmc_op_writelog *wl) { int error; PMCDBG2(LOG,WRI,1, "writelog po=%p ud=0x%x", po, wl->pm_userdata); error = 0; PMCLOG_RESERVE_WITH_ERROR(po, PMCLOG_TYPE_USERDATA, sizeof(struct pmclog_userdata)); PMCLOG_EMIT32(wl->pm_userdata); PMCLOG_DESPATCH(po); error: return (error); } /* * Initialization. * * Create a pool of log buffers and initialize mutexes. 
*/ void pmclog_initialize(void) { struct pmclog_buffer *plb; int domain, ncpus, total; if (pmclog_buffer_size <= 0 || pmclog_buffer_size > 16*1024) { (void) printf("hwpmc: tunable logbuffersize=%d must be " "greater than zero and less than or equal to 16MB.\n", pmclog_buffer_size); pmclog_buffer_size = PMC_LOG_BUFFER_SIZE; } if (pmc_nlogbuffers_pcpu <= 0) { (void) printf("hwpmc: tunable nlogbuffers=%d must be greater " "than zero.\n", pmc_nlogbuffers_pcpu); pmc_nlogbuffers_pcpu = PMC_NLOGBUFFERS_PCPU; } if (pmc_nlogbuffers_pcpu*pmclog_buffer_size > 32*1024) { (void) printf("hwpmc: memory allocated pcpu must be less than 32MB (is %dK).\n", pmc_nlogbuffers_pcpu*pmclog_buffer_size); pmc_nlogbuffers_pcpu = PMC_NLOGBUFFERS_PCPU; pmclog_buffer_size = PMC_LOG_BUFFER_SIZE; } for (domain = 0; domain < vm_ndomains; domain++) { ncpus = pmc_dom_hdrs[domain]->pdbh_ncpus; total = ncpus * pmc_nlogbuffers_pcpu; plb = malloc_domainset(sizeof(struct pmclog_buffer) * total, M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); pmc_dom_hdrs[domain]->pdbh_plbs = plb; for (; total > 0; total--, plb++) { void *buf; buf = malloc_domainset(1024 * pmclog_buffer_size, M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); PMCLOG_INIT_BUFFER_DESCRIPTOR(plb, buf, domain); pmc_plb_rele_unlocked(plb); } } mtx_init(&pmc_kthread_mtx, "pmc-kthread", "pmc-sleep", MTX_DEF); } /* * Shutdown logging. * * Destroy mutexes and release memory back to the free pool. 
*/ void pmclog_shutdown(void) { struct pmclog_buffer *plb; int domain; mtx_destroy(&pmc_kthread_mtx); for (domain = 0; domain < vm_ndomains; domain++) { while ((plb = TAILQ_FIRST(&pmc_dom_hdrs[domain]->pdbh_head)) != NULL) { TAILQ_REMOVE(&pmc_dom_hdrs[domain]->pdbh_head, plb, plb_next); free(plb->plb_base, M_PMC); } free(pmc_dom_hdrs[domain]->pdbh_plbs, M_PMC); } } diff --git a/sys/dev/hwpmc/hwpmc_mod.c b/sys/dev/hwpmc/hwpmc_mod.c index 830e73941fb6..b3cf309fb74e 100644 --- a/sys/dev/hwpmc/hwpmc_mod.c +++ b/sys/dev/hwpmc/hwpmc_mod.c @@ -1,5985 +1,5986 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2003-2008 Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * Copyright (c) 2018 Matthew Macy * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* needs to be after */ #include #include #include #include #include #include #include #include "hwpmc_soft.h" #define PMC_EPOCH_ENTER() struct epoch_tracker pmc_et; epoch_enter_preempt(global_epoch_preempt, &pmc_et) #define PMC_EPOCH_EXIT() epoch_exit_preempt(global_epoch_preempt, &pmc_et) /* * Types */ enum pmc_flags { PMC_FLAG_NONE = 0x00, /* do nothing */ PMC_FLAG_REMOVE = 0x01, /* atomically remove entry from hash */ PMC_FLAG_ALLOCATE = 0x02, /* add entry to hash if not found */ PMC_FLAG_NOWAIT = 0x04, /* do not wait for mallocs */ }; /* * The offset in sysent where the syscall is allocated. 
*/ static int pmc_syscall_num = NO_SYSCALL; struct pmc_cpu **pmc_pcpu; /* per-cpu state */ pmc_value_t *pmc_pcpu_saved; /* saved PMC values: CSW handling */ #define PMC_PCPU_SAVED(C,R) pmc_pcpu_saved[(R) + md->pmd_npmc*(C)] struct mtx_pool *pmc_mtxpool; static int *pmc_pmcdisp; /* PMC row dispositions */ #define PMC_ROW_DISP_IS_FREE(R) (pmc_pmcdisp[(R)] == 0) #define PMC_ROW_DISP_IS_THREAD(R) (pmc_pmcdisp[(R)] > 0) #define PMC_ROW_DISP_IS_STANDALONE(R) (pmc_pmcdisp[(R)] < 0) #define PMC_MARK_ROW_FREE(R) do { \ pmc_pmcdisp[(R)] = 0; \ } while (0) #define PMC_MARK_ROW_STANDALONE(R) do { \ KASSERT(pmc_pmcdisp[(R)] <= 0, ("[pmc,%d] row disposition error", \ __LINE__)); \ atomic_add_int(&pmc_pmcdisp[(R)], -1); \ KASSERT(pmc_pmcdisp[(R)] >= (-pmc_cpu_max_active()), \ ("[pmc,%d] row disposition error", __LINE__)); \ } while (0) #define PMC_UNMARK_ROW_STANDALONE(R) do { \ atomic_add_int(&pmc_pmcdisp[(R)], 1); \ KASSERT(pmc_pmcdisp[(R)] <= 0, ("[pmc,%d] row disposition error", \ __LINE__)); \ } while (0) #define PMC_MARK_ROW_THREAD(R) do { \ KASSERT(pmc_pmcdisp[(R)] >= 0, ("[pmc,%d] row disposition error", \ __LINE__)); \ atomic_add_int(&pmc_pmcdisp[(R)], 1); \ } while (0) #define PMC_UNMARK_ROW_THREAD(R) do { \ atomic_add_int(&pmc_pmcdisp[(R)], -1); \ KASSERT(pmc_pmcdisp[(R)] >= 0, ("[pmc,%d] row disposition error", \ __LINE__)); \ } while (0) /* various event handlers */ static eventhandler_tag pmc_exit_tag, pmc_fork_tag, pmc_kld_load_tag, pmc_kld_unload_tag; /* Module statistics */ struct pmc_driverstats pmc_stats; /* Machine/processor dependent operations */ static struct pmc_mdep *md; /* * Hash tables mapping owner processes and target threads to PMCs. */ struct mtx pmc_processhash_mtx; /* spin mutex */ static u_long pmc_processhashmask; static LIST_HEAD(pmc_processhash, pmc_process) *pmc_processhash; /* * Hash table of PMC owner descriptors. This table is protected by * the shared PMC "sx" lock. 
*/ static u_long pmc_ownerhashmask; static LIST_HEAD(pmc_ownerhash, pmc_owner) *pmc_ownerhash; /* * List of PMC owners with system-wide sampling PMCs. */ static CK_LIST_HEAD(, pmc_owner) pmc_ss_owners; /* * List of free thread entries. This is protected by the spin * mutex. */ static struct mtx pmc_threadfreelist_mtx; /* spin mutex */ static LIST_HEAD(, pmc_thread) pmc_threadfreelist; static int pmc_threadfreelist_entries=0; #define THREADENTRY_SIZE \ (sizeof(struct pmc_thread) + (md->pmd_npmc * sizeof(struct pmc_threadpmcstate))) /* * Task to free thread descriptors */ static struct task free_task; /* * A map of row indices to classdep structures. */ static struct pmc_classdep **pmc_rowindex_to_classdep; /* * Prototypes */ #ifdef HWPMC_DEBUG static int pmc_debugflags_sysctl_handler(SYSCTL_HANDLER_ARGS); static int pmc_debugflags_parse(char *newstr, char *fence); #endif static int load(struct module *module, int cmd, void *arg); static int pmc_add_sample(ring_type_t ring, struct pmc *pm, struct trapframe *tf); static void pmc_add_thread_descriptors_from_proc(struct proc *p, struct pmc_process *pp); static int pmc_attach_process(struct proc *p, struct pmc *pm); static struct pmc *pmc_allocate_pmc_descriptor(void); static struct pmc_owner *pmc_allocate_owner_descriptor(struct proc *p); static int pmc_attach_one_process(struct proc *p, struct pmc *pm); static int pmc_can_allocate_rowindex(struct proc *p, unsigned int ri, int cpu); static int pmc_can_attach(struct pmc *pm, struct proc *p); static void pmc_capture_user_callchain(int cpu, int soft, struct trapframe *tf); static void pmc_cleanup(void); static int pmc_detach_process(struct proc *p, struct pmc *pm); static int pmc_detach_one_process(struct proc *p, struct pmc *pm, int flags); static void pmc_destroy_owner_descriptor(struct pmc_owner *po); static void pmc_destroy_pmc_descriptor(struct pmc *pm); static void pmc_destroy_process_descriptor(struct pmc_process *pp); static struct pmc_owner 
*pmc_find_owner_descriptor(struct proc *p); static int pmc_find_pmc(pmc_id_t pmcid, struct pmc **pm); static struct pmc *pmc_find_pmc_descriptor_in_process(struct pmc_owner *po, pmc_id_t pmc); static struct pmc_process *pmc_find_process_descriptor(struct proc *p, uint32_t mode); static struct pmc_thread *pmc_find_thread_descriptor(struct pmc_process *pp, struct thread *td, uint32_t mode); static void pmc_force_context_switch(void); static void pmc_link_target_process(struct pmc *pm, struct pmc_process *pp); static void pmc_log_all_process_mappings(struct pmc_owner *po); static void pmc_log_kernel_mappings(struct pmc *pm); static void pmc_log_process_mappings(struct pmc_owner *po, struct proc *p); static void pmc_maybe_remove_owner(struct pmc_owner *po); static void pmc_process_csw_in(struct thread *td); static void pmc_process_csw_out(struct thread *td); static void pmc_process_exit(void *arg, struct proc *p); static void pmc_process_fork(void *arg, struct proc *p1, struct proc *p2, int n); static void pmc_process_samples(int cpu, ring_type_t soft); static void pmc_release_pmc_descriptor(struct pmc *pmc); static void pmc_process_thread_add(struct thread *td); static void pmc_process_thread_delete(struct thread *td); static void pmc_process_thread_userret(struct thread *td); static void pmc_remove_owner(struct pmc_owner *po); static void pmc_remove_process_descriptor(struct pmc_process *pp); static int pmc_start(struct pmc *pm); static int pmc_stop(struct pmc *pm); static int pmc_syscall_handler(struct thread *td, void *syscall_args); static struct pmc_thread *pmc_thread_descriptor_pool_alloc(void); static void pmc_thread_descriptor_pool_drain(void); static void pmc_thread_descriptor_pool_free(struct pmc_thread *pt); static void pmc_unlink_target_process(struct pmc *pmc, struct pmc_process *pp); static int generic_switch_in(struct pmc_cpu *pc, struct pmc_process *pp); static int generic_switch_out(struct pmc_cpu *pc, struct pmc_process *pp); static struct pmc_mdep 
*pmc_generic_cpu_initialize(void); static void pmc_generic_cpu_finalize(struct pmc_mdep *md); static void pmc_post_callchain_callback(void); static void pmc_process_threadcreate(struct thread *td); static void pmc_process_threadexit(struct thread *td); static void pmc_process_proccreate(struct proc *p); static void pmc_process_allproc(struct pmc *pm); /* * Kernel tunables and sysctl(8) interface. */ SYSCTL_DECL(_kern_hwpmc); SYSCTL_NODE(_kern_hwpmc, OID_AUTO, stats, CTLFLAG_RW | CTLFLAG_MPSAFE, 0, "HWPMC stats"); /* Stats. */ SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, intr_ignored, CTLFLAG_RW, &pmc_stats.pm_intr_ignored, "# of interrupts ignored"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, intr_processed, CTLFLAG_RW, &pmc_stats.pm_intr_processed, "# of interrupts processed"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, intr_bufferfull, CTLFLAG_RW, &pmc_stats.pm_intr_bufferfull, "# of interrupts where buffer was full"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, syscalls, CTLFLAG_RW, &pmc_stats.pm_syscalls, "# of syscalls"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, syscall_errors, CTLFLAG_RW, &pmc_stats.pm_syscall_errors, "# of syscall_errors"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, buffer_requests, CTLFLAG_RW, &pmc_stats.pm_buffer_requests, "# of buffer requests"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, buffer_requests_failed, CTLFLAG_RW, &pmc_stats.pm_buffer_requests_failed, "# of buffer requests which failed"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, log_sweeps, CTLFLAG_RW, &pmc_stats.pm_log_sweeps, "# of times samples were processed"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, merges, CTLFLAG_RW, &pmc_stats.pm_merges, "# of times kernel stack was found for user trace"); SYSCTL_COUNTER_U64(_kern_hwpmc_stats, OID_AUTO, overwrites, CTLFLAG_RW, &pmc_stats.pm_overwrites, "# of times a sample was overwritten before being logged"); static int pmc_callchaindepth = PMC_CALLCHAIN_DEPTH; SYSCTL_INT(_kern_hwpmc, 
OID_AUTO, callchaindepth, CTLFLAG_RDTUN, &pmc_callchaindepth, 0, "depth of call chain records"); char pmc_cpuid[PMC_CPUID_LEN]; SYSCTL_STRING(_kern_hwpmc, OID_AUTO, cpuid, CTLFLAG_RD, pmc_cpuid, 0, "cpu version string"); #ifdef HWPMC_DEBUG struct pmc_debugflags pmc_debugflags = PMC_DEBUG_DEFAULT_FLAGS; char pmc_debugstr[PMC_DEBUG_STRSIZE]; TUNABLE_STR(PMC_SYSCTL_NAME_PREFIX "debugflags", pmc_debugstr, sizeof(pmc_debugstr)); SYSCTL_PROC(_kern_hwpmc, OID_AUTO, debugflags, CTLTYPE_STRING | CTLFLAG_RWTUN | CTLFLAG_NOFETCH | CTLFLAG_MPSAFE, 0, 0, pmc_debugflags_sysctl_handler, "A", "debug flags"); #endif /* * kern.hwpmc.hashsize -- determines the number of rows in the * hash table used to look up threads */ static int pmc_hashsize = PMC_HASH_SIZE; SYSCTL_INT(_kern_hwpmc, OID_AUTO, hashsize, CTLFLAG_RDTUN, &pmc_hashsize, 0, "rows in hash tables"); /* * kern.hwpmc.nsamples --- number of PC samples/callchain stacks per CPU */ static int pmc_nsamples = PMC_NSAMPLES; SYSCTL_INT(_kern_hwpmc, OID_AUTO, nsamples, CTLFLAG_RDTUN, &pmc_nsamples, 0, "number of PC samples per CPU"); static uint64_t pmc_sample_mask = PMC_NSAMPLES-1; /* * kern.hwpmc.mtxpoolsize -- number of mutexes in the mutex pool. 
*/ static int pmc_mtxpool_size = PMC_MTXPOOL_SIZE; SYSCTL_INT(_kern_hwpmc, OID_AUTO, mtxpoolsize, CTLFLAG_RDTUN, &pmc_mtxpool_size, 0, "size of spin mutex pool"); /* * kern.hwpmc.threadfreelist_entries -- number of free entries */ SYSCTL_INT(_kern_hwpmc, OID_AUTO, threadfreelist_entries, CTLFLAG_RD, &pmc_threadfreelist_entries, 0, "number of available thread entries"); /* * kern.hwpmc.threadfreelist_max -- maximum number of free entries */ static int pmc_threadfreelist_max = PMC_THREADLIST_MAX; SYSCTL_INT(_kern_hwpmc, OID_AUTO, threadfreelist_max, CTLFLAG_RW, &pmc_threadfreelist_max, 0, "maximum number of available thread entries before freeing some"); /* * kern.hwpmc.mincount -- minimum sample count */ static u_int pmc_mincount = 1000; SYSCTL_INT(_kern_hwpmc, OID_AUTO, mincount, CTLFLAG_RWTUN, &pmc_mincount, 0, "minimum count for sampling counters"); /* * security.bsd.unprivileged_syspmcs -- allow non-root processes to * allocate system-wide PMCs. * * Allowing unprivileged processes to allocate system PMCs is convenient * if system-wide measurements need to be taken concurrently with other * per-process measurements. This feature is turned off by default. */ static int pmc_unprivileged_syspmcs = 0; SYSCTL_INT(_security_bsd, OID_AUTO, unprivileged_syspmcs, CTLFLAG_RWTUN, &pmc_unprivileged_syspmcs, 0, "allow unprivileged process to allocate system PMCs"); /* * Hash function. Discard the lower 2 bits of the pointer since * these are always zero for our uses. The hash multiplier is * round((2^LONG_BIT) * ((sqrt(5)-1)/2)). 
*/ #if LONG_BIT == 64 #define _PMC_HM 11400714819323198486u #elif LONG_BIT == 32 #define _PMC_HM 2654435769u #else #error Must know the size of 'long' to compile #endif #define PMC_HASH_PTR(P,M) ((((unsigned long) (P) >> 2) * _PMC_HM) & (M)) /* * Syscall structures */ /* The `sysent' for the new syscall */ static struct sysent pmc_sysent = { .sy_narg = 2, .sy_call = pmc_syscall_handler, }; static struct syscall_module_data pmc_syscall_mod = { .chainevh = load, .chainarg = NULL, .offset = &pmc_syscall_num, .new_sysent = &pmc_sysent, .old_sysent = { .sy_narg = 0, .sy_call = NULL }, .flags = SY_THR_STATIC_KLD, }; static moduledata_t pmc_mod = { .name = PMC_MODULE_NAME, .evhand = syscall_module_handler, .priv = &pmc_syscall_mod, }; #ifdef EARLY_AP_STARTUP DECLARE_MODULE(pmc, pmc_mod, SI_SUB_SYSCALLS, SI_ORDER_ANY); #else DECLARE_MODULE(pmc, pmc_mod, SI_SUB_SMP, SI_ORDER_ANY); #endif MODULE_VERSION(pmc, PMC_VERSION); #ifdef HWPMC_DEBUG enum pmc_dbgparse_state { PMCDS_WS, /* in whitespace */ PMCDS_MAJOR, /* seen a major keyword */ PMCDS_MINOR }; static int pmc_debugflags_parse(char *newstr, char *fence) { char c, *p, *q; struct pmc_debugflags *tmpflags; int error, found, *newbits, tmp; size_t kwlen; tmpflags = malloc(sizeof(*tmpflags), M_PMC, M_WAITOK|M_ZERO); p = newstr; error = 0; for (; p < fence && (c = *p); p++) { /* skip white space */ if (c == ' ' || c == '\t') continue; /* look for a keyword followed by "=" */ for (q = p; p < fence && (c = *p) && c != '='; p++) ; if (c != '=') { error = EINVAL; goto done; } kwlen = p - q; newbits = NULL; /* lookup flag group name */ #define DBG_SET_FLAG_MAJ(S,F) \ if (kwlen == sizeof(S)-1 && strncmp(q, S, kwlen) == 0) \ newbits = &tmpflags->pdb_ ## F; DBG_SET_FLAG_MAJ("cpu", CPU); DBG_SET_FLAG_MAJ("csw", CSW); DBG_SET_FLAG_MAJ("logging", LOG); DBG_SET_FLAG_MAJ("module", MOD); DBG_SET_FLAG_MAJ("md", MDP); DBG_SET_FLAG_MAJ("owner", OWN); DBG_SET_FLAG_MAJ("pmc", PMC); DBG_SET_FLAG_MAJ("process", PRC); DBG_SET_FLAG_MAJ("sampling", 
SAM); if (newbits == NULL) { error = EINVAL; goto done; } p++; /* skip the '=' */ /* Now parse the individual flags */ tmp = 0; newflag: for (q = p; p < fence && (c = *p); p++) if (c == ' ' || c == '\t' || c == ',') break; /* p == fence or c == ws or c == "," or c == 0 */ if ((kwlen = p - q) == 0) { *newbits = tmp; continue; } found = 0; #define DBG_SET_FLAG_MIN(S,F) \ if (kwlen == sizeof(S)-1 && strncmp(q, S, kwlen) == 0) \ tmp |= found = (1 << PMC_DEBUG_MIN_ ## F) /* a '*' denotes all possible flags in the group */ if (kwlen == 1 && *q == '*') tmp = found = ~0; /* look for individual flag names */ DBG_SET_FLAG_MIN("allocaterow", ALR); DBG_SET_FLAG_MIN("allocate", ALL); DBG_SET_FLAG_MIN("attach", ATT); DBG_SET_FLAG_MIN("bind", BND); DBG_SET_FLAG_MIN("config", CFG); DBG_SET_FLAG_MIN("exec", EXC); DBG_SET_FLAG_MIN("exit", EXT); DBG_SET_FLAG_MIN("find", FND); DBG_SET_FLAG_MIN("flush", FLS); DBG_SET_FLAG_MIN("fork", FRK); DBG_SET_FLAG_MIN("getbuf", GTB); DBG_SET_FLAG_MIN("hook", PMH); DBG_SET_FLAG_MIN("init", INI); DBG_SET_FLAG_MIN("intr", INT); DBG_SET_FLAG_MIN("linktarget", TLK); DBG_SET_FLAG_MIN("mayberemove", OMR); DBG_SET_FLAG_MIN("ops", OPS); DBG_SET_FLAG_MIN("read", REA); DBG_SET_FLAG_MIN("register", REG); DBG_SET_FLAG_MIN("release", REL); DBG_SET_FLAG_MIN("remove", ORM); DBG_SET_FLAG_MIN("sample", SAM); DBG_SET_FLAG_MIN("scheduleio", SIO); DBG_SET_FLAG_MIN("select", SEL); DBG_SET_FLAG_MIN("signal", SIG); DBG_SET_FLAG_MIN("swi", SWI); DBG_SET_FLAG_MIN("swo", SWO); DBG_SET_FLAG_MIN("start", STA); DBG_SET_FLAG_MIN("stop", STO); DBG_SET_FLAG_MIN("syscall", PMS); DBG_SET_FLAG_MIN("unlinktarget", TUL); DBG_SET_FLAG_MIN("write", WRI); if (found == 0) { /* unrecognized flag name */ error = EINVAL; goto done; } if (c == 0 || c == ' ' || c == '\t') { /* end of flag group */ *newbits = tmp; continue; } p++; goto newflag; } /* save the new flag set */ bcopy(tmpflags, &pmc_debugflags, sizeof(pmc_debugflags)); done: free(tmpflags, M_PMC); return error; } static int 
pmc_debugflags_sysctl_handler(SYSCTL_HANDLER_ARGS) { char *fence, *newstr; int error; unsigned int n; (void) arg1; (void) arg2; /* unused parameters */ n = sizeof(pmc_debugstr); newstr = malloc(n, M_PMC, M_WAITOK|M_ZERO); (void) strlcpy(newstr, pmc_debugstr, n); error = sysctl_handle_string(oidp, newstr, n, req); /* if there is a new string, parse and copy it */ if (error == 0 && req->newptr != NULL) { fence = newstr + (n < req->newlen ? n : req->newlen + 1); if ((error = pmc_debugflags_parse(newstr, fence)) == 0) (void) strlcpy(pmc_debugstr, newstr, sizeof(pmc_debugstr)); } free(newstr, M_PMC); return error; } #endif /* * Map a row index to a classdep structure and return the adjusted row * index for the PMC class index. */ static struct pmc_classdep * pmc_ri_to_classdep(struct pmc_mdep *md, int ri, int *adjri) { struct pmc_classdep *pcd; (void) md; KASSERT(ri >= 0 && ri < md->pmd_npmc, ("[pmc,%d] illegal row-index %d", __LINE__, ri)); pcd = pmc_rowindex_to_classdep[ri]; KASSERT(pcd != NULL, ("[pmc,%d] ri %d null pcd", __LINE__, ri)); *adjri = ri - pcd->pcd_ri; KASSERT(*adjri >= 0 && *adjri < pcd->pcd_num, ("[pmc,%d] adjusted row-index %d", __LINE__, *adjri)); return (pcd); } /* * Concurrency Control * * The driver manages the following data structures: * * - target process descriptors, one per target process * - owner process descriptors (and attached lists), one per owner process * - lookup hash tables for owner and target processes * - PMC descriptors (and attached lists) * - per-cpu hardware state * - the 'hook' variable through which the kernel calls into * this module * - the machine hardware state (managed by the MD layer) * * These data structures are accessed from: * * - thread context-switch code * - interrupt handlers (possibly on multiple cpus) * - kernel threads on multiple cpus running on behalf of user * processes doing system calls * - this driver's private kernel threads * * = Locks and Locking strategy = * * The driver uses four locking 
strategies for its operation: * * - The global SX lock "pmc_sx" is used to protect internal * data structures. * * Calls into the module by syscall() start with this lock being * held in exclusive mode. Depending on the requested operation, * the lock may be downgraded to 'shared' mode to allow more * concurrent readers into the module. Calls into the module from * other parts of the kernel acquire the lock in shared mode. * * This SX lock is held in exclusive mode for any operations that * modify the linkages between the driver's internal data structures. * * The 'pmc_hook' function pointer is also protected by this lock. * It is only examined with the sx lock held in exclusive mode. The * kernel module is allowed to be unloaded only with the sx lock held * in exclusive mode. In normal syscall handling, after acquiring the * pmc_sx lock we first check that 'pmc_hook' is non-null before * proceeding. This prevents races between the thread unloading the module * and other threads seeking to use the module. * * - Lookups of target process structures and owner process structures * cannot use the global "pmc_sx" SX lock because these lookups need * to happen during context switches and in other critical sections * where sleeping is not allowed. We protect these lookup tables * with their own private spin-mutexes, "pmc_processhash_mtx" and * "pmc_ownerhash_mtx". * * - Interrupt handlers work in a lock free manner. At interrupt * time, handlers look at the PMC pointer (phw->phw_pmc) configured * when the PMC was started. If this pointer is NULL, the interrupt * is ignored after updating driver statistics. We ensure that this * pointer is set (using an atomic operation if necessary) before the * PMC hardware is started. Conversely, this pointer is unset atomically * only after the PMC hardware is stopped. * * We ensure that everything needed for the operation of an * interrupt handler is available without it needing to acquire any * locks. 
We also ensure that a PMC's software state is destroyed only * after the PMC is taken off hardware (on all CPUs). * * - Context-switch handling with process-private PMCs needs more * care. * * A given process may be the target of multiple PMCs. For example, * PMCATTACH and PMCDETACH may be requested by a process on one CPU * while the target process is running on another. A PMC could also * be getting released because its owner is exiting. We tackle * these situations in the following manner: * * - each target process structure 'pmc_process' has an array * of 'struct pmc *' pointers, one for each hardware PMC. * * - At context switch IN time, each "target" PMC in RUNNING state * gets started on hardware and a pointer to each PMC is copied into * the per-cpu phw array. The 'runcount' for the PMC is * incremented. * * - At context switch OUT time, all process-virtual PMCs are stopped * on hardware. The saved value is added to the PMCs value field * only if the PMC is in a non-deleted state (the PMCs state could * have changed during the current time slice). * * Note that since in-between a switch IN on a processor and a switch * OUT, the PMC could have been released on another CPU. Therefore * context switch OUT always looks at the hardware state to turn * OFF PMCs and will update a PMC's saved value only if reachable * from the target process record. * * - OP PMCRELEASE could be called on a PMC at any time (the PMC could * be attached to many processes at the time of the call and could * be active on multiple CPUs). * * We prevent further scheduling of the PMC by marking it as in * state 'DELETED'. If the runcount of the PMC is non-zero then * this PMC is currently running on a CPU somewhere. The thread * doing the PMCRELEASE operation waits by repeatedly doing a * pause() till the runcount comes to zero. * * The contents of a PMC descriptor (struct pmc) are protected using * a spin-mutex. In order to save space, we use a mutex pool. 
* * In terms of lock types used by witness(4), we use: * - Type "pmc-sx", used by the global SX lock. * - Type "pmc-sleep", for sleep mutexes used by logger threads. * - Type "pmc-per-proc", for protecting PMC owner descriptors. * - Type "pmc-leaf", used for all other spin mutexes. */ /* * save the cpu binding of the current kthread */ void pmc_save_cpu_binding(struct pmc_binding *pb) { PMCDBG0(CPU,BND,2, "save-cpu"); thread_lock(curthread); pb->pb_bound = sched_is_bound(curthread); pb->pb_cpu = curthread->td_oncpu; pb->pb_priority = curthread->td_priority; thread_unlock(curthread); PMCDBG1(CPU,BND,2, "save-cpu cpu=%d", pb->pb_cpu); } /* * restore the cpu binding of the current thread */ void pmc_restore_cpu_binding(struct pmc_binding *pb) { PMCDBG2(CPU,BND,2, "restore-cpu curcpu=%d restore=%d", curthread->td_oncpu, pb->pb_cpu); thread_lock(curthread); sched_bind(curthread, pb->pb_cpu); if (!pb->pb_bound) sched_unbind(curthread); sched_prio(curthread, pb->pb_priority); thread_unlock(curthread); PMCDBG0(CPU,BND,2, "restore-cpu done"); } /* * move execution over the specified cpu and bind it there. */ void pmc_select_cpu(int cpu) { KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[pmc,%d] bad cpu number %d", __LINE__, cpu)); /* Never move to an inactive CPU. */ KASSERT(pmc_cpu_is_active(cpu), ("[pmc,%d] selecting inactive " "CPU %d", __LINE__, cpu)); PMCDBG1(CPU,SEL,2, "select-cpu cpu=%d", cpu); thread_lock(curthread); sched_prio(curthread, PRI_MIN); sched_bind(curthread, cpu); thread_unlock(curthread); KASSERT(curthread->td_oncpu == cpu, ("[pmc,%d] CPU not bound [cpu=%d, curr=%d]", __LINE__, cpu, curthread->td_oncpu)); PMCDBG1(CPU,SEL,2, "select-cpu cpu=%d ok", cpu); } /* * Force a context switch. * * We do this by pause'ing for 1 tick -- invoking mi_switch() is not * guaranteed to force a context switch. 
*/ static void pmc_force_context_switch(void) { pause("pmcctx", 1); } uint64_t pmc_rdtsc(void) { #if defined(__i386__) || defined(__amd64__) if (__predict_true(amd_feature & AMDID_RDTSCP)) return rdtscp(); else return rdtsc(); #else return get_cyclecount(); #endif } /* * Get the file name for an executable. This is a simple wrapper * around vn_fullpath(9). */ static void pmc_getfilename(struct vnode *v, char **fullpath, char **freepath) { *fullpath = "unknown"; *freepath = NULL; vn_fullpath(v, fullpath, freepath); } /* * remove a process owning PMCs */ void pmc_remove_owner(struct pmc_owner *po) { struct pmc *pm, *tmp; sx_assert(&pmc_sx, SX_XLOCKED); PMCDBG1(OWN,ORM,1, "remove-owner po=%p", po); /* Remove descriptor from the owner hash table */ LIST_REMOVE(po, po_next); /* release all owned PMC descriptors */ LIST_FOREACH_SAFE(pm, &po->po_pmcs, pm_next, tmp) { PMCDBG1(OWN,ORM,2, "pmc=%p", pm); KASSERT(pm->pm_owner == po, ("[pmc,%d] owner %p != po %p", __LINE__, pm->pm_owner, po)); pmc_release_pmc_descriptor(pm); /* will unlink from the list */ pmc_destroy_pmc_descriptor(pm); } KASSERT(po->po_sscount == 0, ("[pmc,%d] SS count not zero", __LINE__)); KASSERT(LIST_EMPTY(&po->po_pmcs), ("[pmc,%d] PMC list not empty", __LINE__)); /* de-configure the log file if present */ if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_deconfigure_log(po); } /* * remove an owner process record if all conditions are met. */ static void pmc_maybe_remove_owner(struct pmc_owner *po) { PMCDBG1(OWN,OMR,1, "maybe-remove-owner po=%p", po); /* * Remove owner record if * - this process does not own any PMCs * - this process has not allocated a system-wide sampling buffer */ if (LIST_EMPTY(&po->po_pmcs) && ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0)) { pmc_remove_owner(po); pmc_destroy_owner_descriptor(po); } } /* * Add an association between a target process and a PMC. 
*/ static void pmc_link_target_process(struct pmc *pm, struct pmc_process *pp) { int ri; struct pmc_target *pt; #ifdef INVARIANTS struct pmc_thread *pt_td; #endif sx_assert(&pmc_sx, SX_XLOCKED); KASSERT(pm != NULL && pp != NULL, ("[pmc,%d] Null pm %p or pp %p", __LINE__, pm, pp)); KASSERT(PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)), ("[pmc,%d] Attaching a non-process-virtual pmc=%p to pid=%d", __LINE__, pm, pp->pp_proc->p_pid)); KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= ((int) md->pmd_npmc - 1), ("[pmc,%d] Illegal reference count %d for process record %p", __LINE__, pp->pp_refcnt, (void *) pp)); ri = PMC_TO_ROWINDEX(pm); PMCDBG3(PRC,TLK,1, "link-target pmc=%p ri=%d pmc-process=%p", pm, ri, pp); #ifdef HWPMC_DEBUG LIST_FOREACH(pt, &pm->pm_targets, pt_next) if (pt->pt_process == pp) KASSERT(0, ("[pmc,%d] pp %p already in pmc %p targets", __LINE__, pp, pm)); #endif pt = malloc(sizeof(struct pmc_target), M_PMC, M_WAITOK|M_ZERO); pt->pt_process = pp; LIST_INSERT_HEAD(&pm->pm_targets, pt, pt_next); atomic_store_rel_ptr((uintptr_t *)&pp->pp_pmcs[ri].pp_pmc, (uintptr_t)pm); if (pm->pm_owner->po_owner == pp->pp_proc) pm->pm_flags |= PMC_F_ATTACHED_TO_OWNER; /* * Initialize the per-process values at this row index. */ pp->pp_pmcs[ri].pp_pmcval = PMC_TO_MODE(pm) == PMC_MODE_TS ? pm->pm_sc.pm_reloadcount : 0; pp->pp_refcnt++; #ifdef INVARIANTS /* Confirm that the per-thread values at this row index are cleared. */ if (PMC_TO_MODE(pm) == PMC_MODE_TS) { mtx_lock_spin(pp->pp_tdslock); LIST_FOREACH(pt_td, &pp->pp_tds, pt_next) { KASSERT(pt_td->pt_pmcs[ri].pt_pmcval == (pmc_value_t) 0, ("[pmc,%d] pt_pmcval not cleared for pid=%d at " "ri=%d", __LINE__, pp->pp_proc->p_pid, ri)); } mtx_unlock_spin(pp->pp_tdslock); } #endif } /* * Removes the association between a target process and a PMC. 
*/ static void pmc_unlink_target_process(struct pmc *pm, struct pmc_process *pp) { int ri; struct proc *p; struct pmc_target *ptgt; struct pmc_thread *pt; sx_assert(&pmc_sx, SX_XLOCKED); KASSERT(pm != NULL && pp != NULL, ("[pmc,%d] Null pm %p or pp %p", __LINE__, pm, pp)); KASSERT(pp->pp_refcnt >= 1 && pp->pp_refcnt <= (int) md->pmd_npmc, ("[pmc,%d] Illegal ref count %d on process record %p", __LINE__, pp->pp_refcnt, (void *) pp)); ri = PMC_TO_ROWINDEX(pm); PMCDBG3(PRC,TUL,1, "unlink-target pmc=%p ri=%d pmc-process=%p", pm, ri, pp); KASSERT(pp->pp_pmcs[ri].pp_pmc == pm, ("[pmc,%d] PMC ri %d mismatch pmc %p pp->[ri] %p", __LINE__, ri, pm, pp->pp_pmcs[ri].pp_pmc)); pp->pp_pmcs[ri].pp_pmc = NULL; pp->pp_pmcs[ri].pp_pmcval = (pmc_value_t) 0; /* Clear the per-thread values at this row index. */ if (PMC_TO_MODE(pm) == PMC_MODE_TS) { mtx_lock_spin(pp->pp_tdslock); LIST_FOREACH(pt, &pp->pp_tds, pt_next) pt->pt_pmcs[ri].pt_pmcval = (pmc_value_t) 0; mtx_unlock_spin(pp->pp_tdslock); } /* Remove owner-specific flags */ if (pm->pm_owner->po_owner == pp->pp_proc) { pp->pp_flags &= ~PMC_PP_ENABLE_MSR_ACCESS; pm->pm_flags &= ~PMC_F_ATTACHED_TO_OWNER; } pp->pp_refcnt--; /* Remove the target process from the PMC structure */ LIST_FOREACH(ptgt, &pm->pm_targets, pt_next) if (ptgt->pt_process == pp) break; KASSERT(ptgt != NULL, ("[pmc,%d] process %p (pp: %p) not found " "in pmc %p", __LINE__, pp->pp_proc, pp, pm)); LIST_REMOVE(ptgt, pt_next); free(ptgt, M_PMC); /* if the PMC now lacks targets, send the owner a SIGIO */ if (LIST_EMPTY(&pm->pm_targets)) { p = pm->pm_owner->po_owner; PROC_LOCK(p); kern_psignal(p, SIGIO); PROC_UNLOCK(p); PMCDBG2(PRC,SIG,2, "signalling proc=%p signal=%d", p, SIGIO); } } /* * Check if PMC 'pm' may be attached to target process 't'. */ static int pmc_can_attach(struct pmc *pm, struct proc *t) { struct proc *o; /* pmc owner */ struct ucred *oc, *tc; /* owner, target credentials */ int decline_attach, i; /* * A PMC's owner can always attach that PMC to itself. 
*/ if ((o = pm->pm_owner->po_owner) == t) return 0; PROC_LOCK(o); oc = o->p_ucred; crhold(oc); PROC_UNLOCK(o); PROC_LOCK(t); tc = t->p_ucred; crhold(tc); PROC_UNLOCK(t); /* * The effective uid of the PMC owner should match at least one * of the {effective,real,saved} uids of the target process. */ decline_attach = oc->cr_uid != tc->cr_uid && oc->cr_uid != tc->cr_svuid && oc->cr_uid != tc->cr_ruid; /* * Every one of the target's group ids must be in the owner's * group list. */ for (i = 0; !decline_attach && i < tc->cr_ngroups; i++) decline_attach = !groupmember(tc->cr_groups[i], oc); /* check the real and saved gids too */ if (decline_attach == 0) decline_attach = !groupmember(tc->cr_rgid, oc) || !groupmember(tc->cr_svgid, oc); crfree(tc); crfree(oc); return !decline_attach; } /* * Attach a process to a PMC. */ static int pmc_attach_one_process(struct proc *p, struct pmc *pm) { int ri, error; char *fullpath, *freepath; struct pmc_process *pp; sx_assert(&pmc_sx, SX_XLOCKED); PMCDBG5(PRC,ATT,2, "attach-one pm=%p ri=%d proc=%p (%d, %s)", pm, PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm); /* * Locate the process descriptor corresponding to process 'p', * allocating space as needed. * * Verify that rowindex 'pm_rowindex' is free in the process * descriptor. * * If not, allocate space for a descriptor and link the * process descriptor and PMC.
*/ ri = PMC_TO_ROWINDEX(pm); /* mark process as using HWPMCs */ PROC_LOCK(p); p->p_flag |= P_HWPMC; PROC_UNLOCK(p); if ((pp = pmc_find_process_descriptor(p, PMC_FLAG_ALLOCATE)) == NULL) { error = ENOMEM; goto fail; } if (pp->pp_pmcs[ri].pp_pmc == pm) {/* already present at slot [ri] */ error = EEXIST; goto fail; } if (pp->pp_pmcs[ri].pp_pmc != NULL) { error = EBUSY; goto fail; } pmc_link_target_process(pm, pp); if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)) && (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) == 0) pm->pm_flags |= PMC_F_NEEDS_LOGFILE; pm->pm_flags |= PMC_F_ATTACH_DONE; /* mark as attached */ /* issue an attach event to a configured log file */ if (pm->pm_owner->po_flags & PMC_PO_OWNS_LOGFILE) { if (p->p_flag & P_KPROC) { fullpath = kernelname; freepath = NULL; } else { pmc_getfilename(p->p_textvp, &fullpath, &freepath); pmclog_process_pmcattach(pm, p->p_pid, fullpath); } free(freepath, M_TEMP); if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) pmc_log_process_mappings(pm->pm_owner, p); } return (0); fail: PROC_LOCK(p); p->p_flag &= ~P_HWPMC; PROC_UNLOCK(p); return (error); } /* * Attach a process and optionally its children */ static int pmc_attach_process(struct proc *p, struct pmc *pm) { int error; struct proc *top; sx_assert(&pmc_sx, SX_XLOCKED); PMCDBG5(PRC,ATT,1, "attach pm=%p ri=%d proc=%p (%d, %s)", pm, PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm); /* * If this PMC successfully allowed a GETMSR operation * in the past, disallow further ATTACHes. */ if ((pm->pm_flags & PMC_PP_ENABLE_MSR_ACCESS) != 0) return EPERM; if ((pm->pm_flags & PMC_F_DESCENDANTS) == 0) return pmc_attach_one_process(p, pm); /* * Traverse all child processes, attaching them to * this PMC. 
*/ sx_slock(&proctree_lock); top = p; for (;;) { if ((error = pmc_attach_one_process(p, pm)) != 0) break; if (!LIST_EMPTY(&p->p_children)) p = LIST_FIRST(&p->p_children); else for (;;) { if (p == top) goto done; if (LIST_NEXT(p, p_sibling)) { p = LIST_NEXT(p, p_sibling); break; } p = p->p_pptr; } } if (error) (void) pmc_detach_process(top, pm); done: sx_sunlock(&proctree_lock); return error; } /* * Detach a process from a PMC. If there are no other PMCs tracking * this process, remove the process structure from its hash table. If * 'flags' contains PMC_FLAG_REMOVE, then free the process structure. */ static int pmc_detach_one_process(struct proc *p, struct pmc *pm, int flags) { int ri; struct pmc_process *pp; sx_assert(&pmc_sx, SX_XLOCKED); KASSERT(pm != NULL, ("[pmc,%d] null pm pointer", __LINE__)); ri = PMC_TO_ROWINDEX(pm); PMCDBG6(PRC,ATT,2, "detach-one pm=%p ri=%d proc=%p (%d, %s) flags=0x%x", pm, ri, p, p->p_pid, p->p_comm, flags); if ((pp = pmc_find_process_descriptor(p, 0)) == NULL) return ESRCH; if (pp->pp_pmcs[ri].pp_pmc != pm) return EINVAL; pmc_unlink_target_process(pm, pp); /* Issue a detach entry if a log file is configured */ if (pm->pm_owner->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_pmcdetach(pm, p->p_pid); /* * If there are no PMCs targeting this process, we remove its * descriptor from the target hash table and unset the P_HWPMC * flag in the struct proc. */ KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= (int) md->pmd_npmc, ("[pmc,%d] Illegal refcnt %d for process struct %p", __LINE__, pp->pp_refcnt, pp)); if (pp->pp_refcnt != 0) /* still a target of some PMC */ return 0; pmc_remove_process_descriptor(pp); if (flags & PMC_FLAG_REMOVE) pmc_destroy_process_descriptor(pp); PROC_LOCK(p); p->p_flag &= ~P_HWPMC; PROC_UNLOCK(p); return 0; } /* * Detach a process and optionally its descendants from a PMC. 
*/ static int pmc_detach_process(struct proc *p, struct pmc *pm) { struct proc *top; sx_assert(&pmc_sx, SX_XLOCKED); PMCDBG5(PRC,ATT,1, "detach pm=%p ri=%d proc=%p (%d, %s)", pm, PMC_TO_ROWINDEX(pm), p, p->p_pid, p->p_comm); if ((pm->pm_flags & PMC_F_DESCENDANTS) == 0) return pmc_detach_one_process(p, pm, PMC_FLAG_REMOVE); /* * Traverse all children, detaching them from this PMC. We * ignore errors since we could be detaching a PMC from a * partially attached proc tree. */ sx_slock(&proctree_lock); top = p; for (;;) { (void) pmc_detach_one_process(p, pm, PMC_FLAG_REMOVE); if (!LIST_EMPTY(&p->p_children)) p = LIST_FIRST(&p->p_children); else for (;;) { if (p == top) goto done; if (LIST_NEXT(p, p_sibling)) { p = LIST_NEXT(p, p_sibling); break; } p = p->p_pptr; } } done: sx_sunlock(&proctree_lock); if (LIST_EMPTY(&pm->pm_targets)) pm->pm_flags &= ~PMC_F_ATTACH_DONE; return 0; } /* * Thread context switch IN */ static void pmc_process_csw_in(struct thread *td) { int cpu; unsigned int adjri, ri; struct pmc *pm; struct proc *p; struct pmc_cpu *pc; struct pmc_hw *phw __diagused; pmc_value_t newvalue; struct pmc_process *pp; struct pmc_thread *pt; struct pmc_classdep *pcd; p = td->td_proc; pt = NULL; if ((pp = pmc_find_process_descriptor(p, PMC_FLAG_NONE)) == NULL) return; KASSERT(pp->pp_proc == td->td_proc, ("[pmc,%d] not my thread state", __LINE__)); critical_enter(); /* no preemption from this point */ cpu = PCPU_GET(cpuid); /* td->td_oncpu is invalid */ PMCDBG5(CSW,SWI,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, p->p_pid, p->p_comm, pp); KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[pmc,%d] weird CPU id %d", __LINE__, cpu)); pc = pmc_pcpu[cpu]; for (ri = 0; ri < md->pmd_npmc; ri++) { if ((pm = pp->pp_pmcs[ri].pp_pmc) == NULL) continue; KASSERT(PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)), ("[pmc,%d] Target PMC in non-virtual mode (%d)", __LINE__, PMC_TO_MODE(pm))); KASSERT(PMC_TO_ROWINDEX(pm) == ri, ("[pmc,%d] Row index mismatch pmc %d != ri %d", __LINE__, PMC_TO_ROWINDEX(pm), 
ri)); /* * Only PMCs that are marked as 'RUNNING' need * be placed on hardware. */ if (pm->pm_state != PMC_STATE_RUNNING) continue; KASSERT(counter_u64_fetch(pm->pm_runcount) >= 0, ("[pmc,%d] pm=%p runcount %ld", __LINE__, (void *) pm, (unsigned long)counter_u64_fetch(pm->pm_runcount))); /* increment PMC runcount */ counter_u64_add(pm->pm_runcount, 1); /* configure the HWPMC we are going to use. */ pcd = pmc_ri_to_classdep(md, ri, &adjri); pcd->pcd_config_pmc(cpu, adjri, pm); phw = pc->pc_hwpmcs[ri]; KASSERT(phw != NULL, ("[pmc,%d] null hw pointer", __LINE__)); KASSERT(phw->phw_pmc == pm, ("[pmc,%d] hw->pmc %p != pmc %p", __LINE__, phw->phw_pmc, pm)); /* * Write out saved value and start the PMC. * * Sampling PMCs use a per-thread value, while * counting mode PMCs use a per-pmc value that is * inherited across descendants. */ if (PMC_TO_MODE(pm) == PMC_MODE_TS) { if (pt == NULL) pt = pmc_find_thread_descriptor(pp, td, PMC_FLAG_NONE); KASSERT(pt != NULL, ("[pmc,%d] No thread found for td=%p", __LINE__, td)); mtx_pool_lock_spin(pmc_mtxpool, pm); /* * If we have a thread descriptor, use the per-thread * counter in the descriptor. If not, we will use * a per-process counter. * * TODO: Remove the per-process "safety net" once * we have thoroughly tested that we don't hit the * above assert. */ if (pt != NULL) { if (pt->pt_pmcs[ri].pt_pmcval > 0) newvalue = pt->pt_pmcs[ri].pt_pmcval; else newvalue = pm->pm_sc.pm_reloadcount; } else { /* * Use the saved value calculated after the most * recent time a thread using the shared counter * switched out. Reset the saved count in case * another thread from this process switches in * before any threads switch out. 
*/ newvalue = pp->pp_pmcs[ri].pp_pmcval; pp->pp_pmcs[ri].pp_pmcval = pm->pm_sc.pm_reloadcount; } mtx_pool_unlock_spin(pmc_mtxpool, pm); KASSERT(newvalue > 0 && newvalue <= pm->pm_sc.pm_reloadcount, ("[pmc,%d] pmcval outside of expected range cpu=%d " "ri=%d pmcval=%jx pm_reloadcount=%jx", __LINE__, cpu, ri, newvalue, pm->pm_sc.pm_reloadcount)); } else { KASSERT(PMC_TO_MODE(pm) == PMC_MODE_TC, ("[pmc,%d] illegal mode=%d", __LINE__, PMC_TO_MODE(pm))); mtx_pool_lock_spin(pmc_mtxpool, pm); newvalue = PMC_PCPU_SAVED(cpu, ri) = pm->pm_gv.pm_savedvalue; mtx_pool_unlock_spin(pmc_mtxpool, pm); } PMCDBG3(CSW,SWI,1,"cpu=%d ri=%d new=%jd", cpu, ri, newvalue); pcd->pcd_write_pmc(cpu, adjri, pm, newvalue); /* If a sampling mode PMC, reset stalled state. */ if (PMC_TO_MODE(pm) == PMC_MODE_TS) pm->pm_pcpu_state[cpu].pps_stalled = 0; /* Indicate that we desire this to run. */ pm->pm_pcpu_state[cpu].pps_cpustate = 1; /* Start the PMC. */ pcd->pcd_start_pmc(cpu, adjri, pm); } /* * perform any other architecture/cpu dependent thread * switch-in actions. */ (void) (*md->pmd_switch_in)(pc, pp); critical_exit(); } /* * Thread context switch OUT. */ static void pmc_process_csw_out(struct thread *td) { int cpu; int64_t tmp; struct pmc *pm; struct proc *p; enum pmc_mode mode; struct pmc_cpu *pc; pmc_value_t newvalue; unsigned int adjri, ri; struct pmc_process *pp; struct pmc_thread *pt = NULL; struct pmc_classdep *pcd; /* * Locate our process descriptor; this may be NULL if * this process is exiting and we have already removed * the process from the target process table. * * Note that due to kernel preemption, multiple * context switches may happen while the process is * exiting. * * Note also that if the target process cannot be * found we still need to deconfigure any PMCs that * are currently running on hardware. 
*/ p = td->td_proc; pp = pmc_find_process_descriptor(p, PMC_FLAG_NONE); /* * save PMCs */ critical_enter(); cpu = PCPU_GET(cpuid); /* td->td_oncpu is invalid */ PMCDBG5(CSW,SWO,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, p->p_pid, p->p_comm, pp); KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[pmc,%d] weird CPU id %d", __LINE__, cpu)); pc = pmc_pcpu[cpu]; /* * When a PMC gets unlinked from a target process, it will * be removed from the target's pp_pmcs[] array. * * However, on an MP system, the target could have been * executing on another CPU at the time of the unlink. * So, at context switch OUT time, we need to look at * the hardware to determine if a PMC is scheduled on * it. */ for (ri = 0; ri < md->pmd_npmc; ri++) { pcd = pmc_ri_to_classdep(md, ri, &adjri); pm = NULL; (void) (*pcd->pcd_get_config)(cpu, adjri, &pm); if (pm == NULL) /* nothing at this row index */ continue; mode = PMC_TO_MODE(pm); if (!PMC_IS_VIRTUAL_MODE(mode)) continue; /* not a process virtual PMC */ KASSERT(PMC_TO_ROWINDEX(pm) == ri, ("[pmc,%d] ri mismatch pmc(%d) ri(%d)", __LINE__, PMC_TO_ROWINDEX(pm), ri)); /* * Change desired state, and then stop if not stalled. * This two-step dance should avoid race conditions where * an interrupt re-enables the PMC after this code has * already checked the pps_stalled flag. */ pm->pm_pcpu_state[cpu].pps_cpustate = 0; if (pm->pm_pcpu_state[cpu].pps_stalled == 0) pcd->pcd_stop_pmc(cpu, adjri, pm); KASSERT(counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] pm=%p runcount %ld", __LINE__, (void *) pm, (unsigned long)counter_u64_fetch(pm->pm_runcount))); /* reduce this PMC's runcount */ counter_u64_add(pm->pm_runcount, -1); /* * If this PMC is associated with this process, * save the reading.
*/ if (pm->pm_state != PMC_STATE_DELETED && pp != NULL && pp->pp_pmcs[ri].pp_pmc != NULL) { KASSERT(pm == pp->pp_pmcs[ri].pp_pmc, ("[pmc,%d] pm %p != pp_pmcs[%d] %p", __LINE__, pm, ri, pp->pp_pmcs[ri].pp_pmc)); KASSERT(pp->pp_refcnt > 0, ("[pmc,%d] pp refcnt = %d", __LINE__, pp->pp_refcnt)); pcd->pcd_read_pmc(cpu, adjri, pm, &newvalue); if (mode == PMC_MODE_TS) { PMCDBG3(CSW,SWO,1,"cpu=%d ri=%d val=%jd (samp)", cpu, ri, newvalue); if (pt == NULL) pt = pmc_find_thread_descriptor(pp, td, PMC_FLAG_NONE); KASSERT(pt != NULL, ("[pmc,%d] No thread found for td=%p", __LINE__, td)); mtx_pool_lock_spin(pmc_mtxpool, pm); /* * If we have a thread descriptor, save the * per-thread counter in the descriptor. If not, * we will update the per-process counter. * * TODO: Remove the per-process "safety net" * once we have thoroughly tested that we * don't hit the above assert. */ if (pt != NULL) pt->pt_pmcs[ri].pt_pmcval = newvalue; else { /* * For sampling process-virtual PMCs, * newvalue is the number of events to * be seen until the next sampling * interrupt. We can just add the events * left from this invocation to the * counter, then adjust in case we * overflow our range. * * (Recall that we reload the counter * every time we use it.) */ pp->pp_pmcs[ri].pp_pmcval += newvalue; if (pp->pp_pmcs[ri].pp_pmcval > pm->pm_sc.pm_reloadcount) pp->pp_pmcs[ri].pp_pmcval -= pm->pm_sc.pm_reloadcount; } mtx_pool_unlock_spin(pmc_mtxpool, pm); } else { tmp = newvalue - PMC_PCPU_SAVED(cpu,ri); PMCDBG3(CSW,SWO,1,"cpu=%d ri=%d tmp=%jd (count)", cpu, ri, tmp); /* * For counting process-virtual PMCs, * we expect the count to be * increasing monotonically, modulo a 64 * bit wraparound. 
*/ KASSERT(tmp >= 0, ("[pmc,%d] negative increment cpu=%d " "ri=%d newvalue=%jx saved=%jx " "incr=%jx", __LINE__, cpu, ri, newvalue, PMC_PCPU_SAVED(cpu,ri), tmp)); mtx_pool_lock_spin(pmc_mtxpool, pm); pm->pm_gv.pm_savedvalue += tmp; pp->pp_pmcs[ri].pp_pmcval += tmp; mtx_pool_unlock_spin(pmc_mtxpool, pm); if (pm->pm_flags & PMC_F_LOG_PROCCSW) pmclog_process_proccsw(pm, pp, tmp, td); } } /* mark hardware as free */ pcd->pcd_config_pmc(cpu, adjri, NULL); } /* * perform any other architecture/cpu dependent thread * switch out functions. */ (void) (*md->pmd_switch_out)(pc, pp); critical_exit(); } /* * A new thread for a process. */ static void pmc_process_thread_add(struct thread *td) { struct pmc_process *pmc; pmc = pmc_find_process_descriptor(td->td_proc, PMC_FLAG_NONE); if (pmc != NULL) pmc_find_thread_descriptor(pmc, td, PMC_FLAG_ALLOCATE); } /* * A thread delete for a process. */ static void pmc_process_thread_delete(struct thread *td) { struct pmc_process *pmc; pmc = pmc_find_process_descriptor(td->td_proc, PMC_FLAG_NONE); if (pmc != NULL) pmc_thread_descriptor_pool_free(pmc_find_thread_descriptor(pmc, td, PMC_FLAG_REMOVE)); } /* * A userret() call for a thread. */ static void pmc_process_thread_userret(struct thread *td) { sched_pin(); pmc_capture_user_callchain(curcpu, PMC_UR, td->td_frame); sched_unpin(); } /* * A mapping change for a process. */ static void pmc_process_mmap(struct thread *td, struct pmckern_map_in *pkm) { int ri; pid_t pid; char *fullpath, *freepath; const struct pmc *pm; struct pmc_owner *po; const struct pmc_process *pp; freepath = fullpath = NULL; MPASS(!in_epoch(global_epoch_preempt)); pmc_getfilename((struct vnode *) pkm->pm_file, &fullpath, &freepath); pid = td->td_proc->p_pid; PMC_EPOCH_ENTER(); /* Inform owners of all system-wide sampling PMCs. 
*/ CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_map_in(po, pid, pkm->pm_address, fullpath); if ((pp = pmc_find_process_descriptor(td->td_proc, 0)) == NULL) goto done; /* * Inform sampling PMC owners tracking this process. */ for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL && PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) pmclog_process_map_in(pm->pm_owner, pid, pkm->pm_address, fullpath); done: if (freepath) free(freepath, M_TEMP); PMC_EPOCH_EXIT(); } /* * Log an munmap request. */ static void pmc_process_munmap(struct thread *td, struct pmckern_map_out *pkm) { int ri; pid_t pid; struct pmc_owner *po; const struct pmc *pm; const struct pmc_process *pp; pid = td->td_proc->p_pid; PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_map_out(po, pid, pkm->pm_address, pkm->pm_address + pkm->pm_size); PMC_EPOCH_EXIT(); if ((pp = pmc_find_process_descriptor(td->td_proc, 0)) == NULL) return; for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL && PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) pmclog_process_map_out(pm->pm_owner, pid, pkm->pm_address, pkm->pm_address + pkm->pm_size); } /* * Log mapping information about the kernel. */ static void pmc_log_kernel_mappings(struct pmc *pm) { struct pmc_owner *po; struct pmckern_map_in *km, *kmbase; MPASS(in_epoch(global_epoch_preempt) || sx_xlocked(&pmc_sx)); KASSERT(PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)), ("[pmc,%d] non-sampling PMC (%p) desires mapping information", __LINE__, (void *) pm)); po = pm->pm_owner; if (po->po_flags & PMC_PO_INITIAL_MAPPINGS_DONE) return; if (PMC_TO_MODE(pm) == PMC_MODE_SS) pmc_process_allproc(pm); /* * Log the current set of kernel modules. 
*/ kmbase = linker_hwpmc_list_objects(); for (km = kmbase; km->pm_file != NULL; km++) { PMCDBG2(LOG,REG,1,"%s %p", (char *) km->pm_file, (void *) km->pm_address); pmclog_process_map_in(po, (pid_t) -1, km->pm_address, km->pm_file); } free(kmbase, M_LINKER); po->po_flags |= PMC_PO_INITIAL_MAPPINGS_DONE; } /* * Log the mappings for a single process. */ static void pmc_log_process_mappings(struct pmc_owner *po, struct proc *p) { vm_map_t map; struct vnode *vp; struct vmspace *vm; vm_map_entry_t entry; vm_offset_t last_end; u_int last_timestamp; struct vnode *last_vp; vm_offset_t start_addr; vm_object_t obj, lobj, tobj; char *fullpath, *freepath; last_vp = NULL; last_end = (vm_offset_t) 0; fullpath = freepath = NULL; if ((vm = vmspace_acquire_ref(p)) == NULL) return; map = &vm->vm_map; vm_map_lock_read(map); VM_MAP_ENTRY_FOREACH(entry, map) { if (entry == NULL) { PMCDBG2(LOG,OPS,2, "hwpmc: vm_map entry unexpectedly " "NULL! pid=%d vm_map=%p\n", p->p_pid, map); break; } /* * We only care about executable map entries. */ if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) || !(entry->protection & VM_PROT_EXECUTE) || (entry->object.vm_object == NULL)) { continue; } obj = entry->object.vm_object; VM_OBJECT_RLOCK(obj); /* * Walk the backing_object list to find the base * (non-shadowed) vm_object. */ for (lobj = tobj = obj; tobj != NULL; tobj = tobj->backing_object) { if (tobj != obj) VM_OBJECT_RLOCK(tobj); if (lobj != obj) VM_OBJECT_RUNLOCK(lobj); lobj = tobj; } /* * At this point lobj is the base vm_object and it is locked. */ if (lobj == NULL) { PMCDBG3(LOG,OPS,2, "hwpmc: lobj unexpectedly NULL! pid=%d " "vm_map=%p vm_obj=%p\n", p->p_pid, map, obj); VM_OBJECT_RUNLOCK(obj); continue; } vp = vm_object_vnode(lobj); if (vp == NULL) { if (lobj != obj) VM_OBJECT_RUNLOCK(lobj); VM_OBJECT_RUNLOCK(obj); continue; } /* * Skip contiguous regions that point to the same * vnode, so we don't emit redundant MAP-IN * directives. 
*/ if (entry->start == last_end && vp == last_vp) { last_end = entry->end; if (lobj != obj) VM_OBJECT_RUNLOCK(lobj); VM_OBJECT_RUNLOCK(obj); continue; } /* * We don't want to keep the proc's vm_map or this * vm_object locked while we walk the pathname, since * vn_fullpath() can sleep. However, if we drop the * lock, it's possible for concurrent activity to * modify the vm_map list. To protect against this, * we save the vm_map timestamp before we release the * lock, and check it after we reacquire the lock * below. */ start_addr = entry->start; last_end = entry->end; last_timestamp = map->timestamp; vm_map_unlock_read(map); vref(vp); if (lobj != obj) VM_OBJECT_RUNLOCK(lobj); VM_OBJECT_RUNLOCK(obj); freepath = NULL; pmc_getfilename(vp, &fullpath, &freepath); last_vp = vp; vrele(vp); vp = NULL; pmclog_process_map_in(po, p->p_pid, start_addr, fullpath); if (freepath) free(freepath, M_TEMP); vm_map_lock_read(map); /* * If our saved timestamp doesn't match, this means * that the vm_map was modified out from under us and * we can't trust our current "entry" pointer. Do a * new lookup for this entry. If there is no entry * for this address range, vm_map_lookup_entry() will * return the previous one, so we always want to go to * the next entry on the next loop iteration. * * There is an edge condition here that can occur if * there is no entry at or before this address. In * this situation, vm_map_lookup_entry returns * &map->header, which would cause our loop to abort * without processing the rest of the map. However, * in practice this will never happen for process * vm_map. This is because the executable's text * segment is the first mapping in the proc's address * space, and this mapping is never removed until the * process exits, so there will always be a non-header * entry at or before the requested address for * vm_map_lookup_entry to return. 
*/ if (map->timestamp != last_timestamp) vm_map_lookup_entry(map, last_end - 1, &entry); } vm_map_unlock_read(map); vmspace_free(vm); return; } /* * Log mappings for all processes in the system. */ static void pmc_log_all_process_mappings(struct pmc_owner *po) { struct proc *p, *top; sx_assert(&pmc_sx, SX_XLOCKED); if ((p = pfind(1)) == NULL) panic("[pmc,%d] Cannot find init", __LINE__); PROC_UNLOCK(p); sx_slock(&proctree_lock); top = p; for (;;) { pmc_log_process_mappings(po, p); if (!LIST_EMPTY(&p->p_children)) p = LIST_FIRST(&p->p_children); else for (;;) { if (p == top) goto done; if (LIST_NEXT(p, p_sibling)) { p = LIST_NEXT(p, p_sibling); break; } p = p->p_pptr; } } done: sx_sunlock(&proctree_lock); } /* * The 'hook' invoked from the kernel proper */ #ifdef HWPMC_DEBUG const char *pmc_hooknames[] = { /* these strings correspond to PMC_FN_* in */ "", "EXEC", "CSW-IN", "CSW-OUT", "SAMPLE", "UNUSED1", "UNUSED2", "MMAP", "MUNMAP", "CALLCHAIN-NMI", "CALLCHAIN-SOFT", "SOFTSAMPLING", "THR-CREATE", "THR-EXIT", "THR-USERRET", "THR-CREATE-LOG", "THR-EXIT-LOG", "PROC-CREATE-LOG" }; #endif static int pmc_hook_handler(struct thread *td, int function, void *arg) { int cpu; PMCDBG4(MOD,PMH,1, "hook td=%p func=%d \"%s\" arg=%p", td, function, pmc_hooknames[function], arg); switch (function) { /* * Process exec() */ case PMC_FN_PROCESS_EXEC: { char *fullpath, *freepath; unsigned int ri; int is_using_hwpmcs; struct pmc *pm; struct proc *p; struct pmc_owner *po; struct pmc_process *pp; struct pmckern_procexec *pk; sx_assert(&pmc_sx, SX_XLOCKED); p = td->td_proc; pmc_getfilename(p->p_textvp, &fullpath, &freepath); pk = (struct pmckern_procexec *) arg; PMC_EPOCH_ENTER(); /* Inform owners of SS mode PMCs of the exec event. 
*/ CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_procexec(po, PMC_ID_INVALID, - p->p_pid, pk->pm_entryaddr, fullpath); + p->p_pid, pk->pm_baseaddr, pk->pm_dynaddr, + fullpath); PMC_EPOCH_EXIT(); PROC_LOCK(p); is_using_hwpmcs = p->p_flag & P_HWPMC; PROC_UNLOCK(p); if (!is_using_hwpmcs) { if (freepath) free(freepath, M_TEMP); break; } /* * PMCs are not inherited across an exec(): remove any * PMCs that this process is the owner of. */ if ((po = pmc_find_owner_descriptor(p)) != NULL) { pmc_remove_owner(po); pmc_destroy_owner_descriptor(po); } /* * If the process being exec'ed is not the target of any * PMC, we are done. */ if ((pp = pmc_find_process_descriptor(p, 0)) == NULL) { if (freepath) free(freepath, M_TEMP); break; } /* * Log the exec event to all monitoring owners. Skip * owners who have already received the event because * they had system sampling PMCs active. */ for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL) { po = pm->pm_owner; if (po->po_sscount == 0 && po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_procexec(po, pm->pm_id, - p->p_pid, pk->pm_entryaddr, - fullpath); + p->p_pid, pk->pm_baseaddr, + pk->pm_dynaddr, fullpath); } if (freepath) free(freepath, M_TEMP); PMCDBG4(PRC,EXC,1, "exec proc=%p (%d, %s) cred-changed=%d", p, p->p_pid, p->p_comm, pk->pm_credentialschanged); if (pk->pm_credentialschanged == 0) /* no change */ break; /* * If the newly exec()'ed process has a different credential * than before, allow it to be the target of a PMC only if * the PMC's owner has sufficient privilege. 
*/ for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL) if (pmc_can_attach(pm, td->td_proc) != 0) pmc_detach_one_process(td->td_proc, pm, PMC_FLAG_NONE); KASSERT(pp->pp_refcnt >= 0 && pp->pp_refcnt <= (int) md->pmd_npmc, ("[pmc,%d] Illegal ref count %d on pp %p", __LINE__, pp->pp_refcnt, pp)); /* * If this process is no longer the target of any * PMCs, we can remove the process entry and free * up space. */ if (pp->pp_refcnt == 0) { pmc_remove_process_descriptor(pp); pmc_destroy_process_descriptor(pp); break; } } break; case PMC_FN_CSW_IN: pmc_process_csw_in(td); break; case PMC_FN_CSW_OUT: pmc_process_csw_out(td); break; /* * Process accumulated PC samples. * * This function is expected to be called by hardclock() for * each CPU that has accumulated PC samples. * * This function is to be executed on the CPU whose samples * are being processed. */ case PMC_FN_DO_SAMPLES: /* * Clear the per-CPU "pmc_sampled" flag before * doing the rest of the processing. If the NMI handler * gets invoked after the "DPCPU_SET()" call * below but before "pmc_process_samples()" gets * around to processing the interrupt, then we will * come back here at the next hardclock() tick (and * may find nothing to do if "pmc_process_samples()" * had already processed the interrupt). We don't * lose the interrupt sample. */ DPCPU_SET(pmc_sampled, 0); cpu = PCPU_GET(cpuid); pmc_process_samples(cpu, PMC_HR); pmc_process_samples(cpu, PMC_SR); pmc_process_samples(cpu, PMC_UR); break; case PMC_FN_MMAP: pmc_process_mmap(td, (struct pmckern_map_in *) arg); break; case PMC_FN_MUNMAP: MPASS(in_epoch(global_epoch_preempt) || sx_xlocked(&pmc_sx)); pmc_process_munmap(td, (struct pmckern_map_out *) arg); break; case PMC_FN_PROC_CREATE_LOG: pmc_process_proccreate((struct proc *)arg); break; case PMC_FN_USER_CALLCHAIN: /* * Record a call chain.
*/ KASSERT(td == curthread, ("[pmc,%d] td != curthread", __LINE__)); pmc_capture_user_callchain(PCPU_GET(cpuid), PMC_HR, (struct trapframe *) arg); KASSERT(td->td_pinned == 1, ("[pmc,%d] invalid td_pinned value", __LINE__)); sched_unpin(); /* Can migrate safely now. */ td->td_pflags &= ~TDP_CALLCHAIN; break; case PMC_FN_USER_CALLCHAIN_SOFT: /* * Record a call chain. */ KASSERT(td == curthread, ("[pmc,%d] td != curthread", __LINE__)); cpu = PCPU_GET(cpuid); pmc_capture_user_callchain(cpu, PMC_SR, (struct trapframe *) arg); KASSERT(td->td_pinned == 1, ("[pmc,%d] invalid td_pinned value", __LINE__)); sched_unpin(); /* Can migrate safely now. */ td->td_pflags &= ~TDP_CALLCHAIN; break; case PMC_FN_SOFT_SAMPLING: /* * Call soft PMC sampling intr. */ pmc_soft_intr((struct pmckern_soft *) arg); break; case PMC_FN_THR_CREATE: pmc_process_thread_add(td); pmc_process_threadcreate(td); break; case PMC_FN_THR_CREATE_LOG: pmc_process_threadcreate(td); break; case PMC_FN_THR_EXIT: KASSERT(td == curthread, ("[pmc,%d] td != curthread", __LINE__)); pmc_process_thread_delete(td); pmc_process_threadexit(td); break; case PMC_FN_THR_EXIT_LOG: pmc_process_threadexit(td); break; case PMC_FN_THR_USERRET: KASSERT(td == curthread, ("[pmc,%d] td != curthread", __LINE__)); pmc_process_thread_userret(td); break; default: #ifdef HWPMC_DEBUG KASSERT(0, ("[pmc,%d] unknown hook %d\n", __LINE__, function)); #endif break; } return 0; } /* * allocate a 'struct pmc_owner' descriptor in the owner hash table. 
*/ static struct pmc_owner * pmc_allocate_owner_descriptor(struct proc *p) { uint32_t hindex; struct pmc_owner *po; struct pmc_ownerhash *poh; hindex = PMC_HASH_PTR(p, pmc_ownerhashmask); poh = &pmc_ownerhash[hindex]; /* allocate space for N pointers and one descriptor struct */ po = malloc(sizeof(struct pmc_owner), M_PMC, M_WAITOK|M_ZERO); po->po_owner = p; LIST_INSERT_HEAD(poh, po, po_next); /* insert into hash table */ TAILQ_INIT(&po->po_logbuffers); mtx_init(&po->po_mtx, "pmc-owner-mtx", "pmc-per-proc", MTX_SPIN); PMCDBG4(OWN,ALL,1, "allocate-owner proc=%p (%d, %s) pmc-owner=%p", p, p->p_pid, p->p_comm, po); return po; } static void pmc_destroy_owner_descriptor(struct pmc_owner *po) { PMCDBG4(OWN,REL,1, "destroy-owner po=%p proc=%p (%d, %s)", po, po->po_owner, po->po_owner->p_pid, po->po_owner->p_comm); mtx_destroy(&po->po_mtx); free(po, M_PMC); } /* * Allocate a thread descriptor from the free pool. * * NOTE: This *can* return NULL. */ static struct pmc_thread * pmc_thread_descriptor_pool_alloc(void) { struct pmc_thread *pt; mtx_lock_spin(&pmc_threadfreelist_mtx); if ((pt = LIST_FIRST(&pmc_threadfreelist)) != NULL) { LIST_REMOVE(pt, pt_next); pmc_threadfreelist_entries--; } mtx_unlock_spin(&pmc_threadfreelist_mtx); return (pt); } /* * Add a thread descriptor to the free pool. We use this instead of free() * to maintain a cache of free entries. Additionally, we can safely call * this function when we cannot call free(), such as in a critical section. * */ static void pmc_thread_descriptor_pool_free(struct pmc_thread *pt) { if (pt == NULL) return; memset(pt, 0, THREADENTRY_SIZE); mtx_lock_spin(&pmc_threadfreelist_mtx); LIST_INSERT_HEAD(&pmc_threadfreelist, pt, pt_next); pmc_threadfreelist_entries++; if (pmc_threadfreelist_entries > pmc_threadfreelist_max) taskqueue_enqueue(taskqueue_fast, &free_task); mtx_unlock_spin(&pmc_threadfreelist_mtx); } /* * An asynchronous task to manage the free list. 
*/ static void pmc_thread_descriptor_pool_free_task(void *arg __unused, int pending __unused) { struct pmc_thread *pt; LIST_HEAD(, pmc_thread) tmplist; int delta; LIST_INIT(&tmplist); /* Determine what changes, if any, we need to make. */ mtx_lock_spin(&pmc_threadfreelist_mtx); delta = pmc_threadfreelist_entries - pmc_threadfreelist_max; while (delta > 0 && (pt = LIST_FIRST(&pmc_threadfreelist)) != NULL) { delta--; pmc_threadfreelist_entries--; LIST_REMOVE(pt, pt_next); LIST_INSERT_HEAD(&tmplist, pt, pt_next); } mtx_unlock_spin(&pmc_threadfreelist_mtx); /* If there are entries to free, free them. */ while (!LIST_EMPTY(&tmplist)) { pt = LIST_FIRST(&tmplist); LIST_REMOVE(pt, pt_next); free(pt, M_PMC); } } /* * Drain the thread free pool, freeing all allocations. */ static void pmc_thread_descriptor_pool_drain(void) { struct pmc_thread *pt, *next; LIST_FOREACH_SAFE(pt, &pmc_threadfreelist, pt_next, next) { LIST_REMOVE(pt, pt_next); free(pt, M_PMC); } } /* * find the descriptor corresponding to thread 'td', adding or removing it * as specified by 'mode'. * * Note that this supports additional mode flags in addition to those * supported by pmc_find_process_descriptor(): * PMC_FLAG_NOWAIT: Causes the function to not wait for mallocs. * This makes it safe to call while holding certain other locks. */ static struct pmc_thread * pmc_find_thread_descriptor(struct pmc_process *pp, struct thread *td, uint32_t mode) { struct pmc_thread *pt = NULL, *ptnew = NULL; int wait_flag; KASSERT(td != NULL, ("[pmc,%d] called to add NULL td", __LINE__)); /* * Pre-allocate memory in the PMC_FLAG_ALLOCATE case prior to * acquiring the lock. 
*/ if (mode & PMC_FLAG_ALLOCATE) { if ((ptnew = pmc_thread_descriptor_pool_alloc()) == NULL) { wait_flag = M_WAITOK; if ((mode & PMC_FLAG_NOWAIT) || in_epoch(global_epoch_preempt)) wait_flag = M_NOWAIT; ptnew = malloc(THREADENTRY_SIZE, M_PMC, wait_flag|M_ZERO); } } mtx_lock_spin(pp->pp_tdslock); LIST_FOREACH(pt, &pp->pp_tds, pt_next) if (pt->pt_td == td) break; if ((mode & PMC_FLAG_REMOVE) && pt != NULL) LIST_REMOVE(pt, pt_next); if ((mode & PMC_FLAG_ALLOCATE) && pt == NULL && ptnew != NULL) { pt = ptnew; ptnew = NULL; pt->pt_td = td; LIST_INSERT_HEAD(&pp->pp_tds, pt, pt_next); } mtx_unlock_spin(pp->pp_tdslock); if (ptnew != NULL) { free(ptnew, M_PMC); } return pt; } /* * Try to add thread descriptors for each thread in a process. */ static void pmc_add_thread_descriptors_from_proc(struct proc *p, struct pmc_process *pp) { struct thread *curtd; struct pmc_thread **tdlist; int i, tdcnt, tdlistsz; KASSERT(!PROC_LOCKED(p), ("[pmc,%d] proc unexpectedly locked", __LINE__)); tdcnt = 32; restart: tdlistsz = roundup2(tdcnt, 32); tdcnt = 0; tdlist = malloc(sizeof(struct pmc_thread*) * tdlistsz, M_TEMP, M_WAITOK); PROC_LOCK(p); FOREACH_THREAD_IN_PROC(p, curtd) tdcnt++; if (tdcnt >= tdlistsz) { PROC_UNLOCK(p); free(tdlist, M_TEMP); goto restart; } /* * Try to add each thread to the list without sleeping. If unable, * add to a queue to retry after dropping the process lock. */ tdcnt = 0; FOREACH_THREAD_IN_PROC(p, curtd) { tdlist[tdcnt] = pmc_find_thread_descriptor(pp, curtd, PMC_FLAG_ALLOCATE|PMC_FLAG_NOWAIT); if (tdlist[tdcnt] == NULL) { PROC_UNLOCK(p); for (i = 0; i <= tdcnt; i++) pmc_thread_descriptor_pool_free(tdlist[i]); free(tdlist, M_TEMP); goto restart; } tdcnt++; } PROC_UNLOCK(p); free(tdlist, M_TEMP); } /* * find the descriptor corresponding to process 'p', adding or removing it * as specified by 'mode'. 
*/ static struct pmc_process * pmc_find_process_descriptor(struct proc *p, uint32_t mode) { uint32_t hindex; struct pmc_process *pp, *ppnew; struct pmc_processhash *pph; hindex = PMC_HASH_PTR(p, pmc_processhashmask); pph = &pmc_processhash[hindex]; ppnew = NULL; /* * Pre-allocate memory in the PMC_FLAG_ALLOCATE case since we * cannot call malloc(9) once we hold a spin lock. */ if (mode & PMC_FLAG_ALLOCATE) ppnew = malloc(sizeof(struct pmc_process) + md->pmd_npmc * sizeof(struct pmc_targetstate), M_PMC, M_WAITOK|M_ZERO); mtx_lock_spin(&pmc_processhash_mtx); LIST_FOREACH(pp, pph, pp_next) if (pp->pp_proc == p) break; if ((mode & PMC_FLAG_REMOVE) && pp != NULL) LIST_REMOVE(pp, pp_next); if ((mode & PMC_FLAG_ALLOCATE) && pp == NULL && ppnew != NULL) { ppnew->pp_proc = p; LIST_INIT(&ppnew->pp_tds); ppnew->pp_tdslock = mtx_pool_find(pmc_mtxpool, ppnew); LIST_INSERT_HEAD(pph, ppnew, pp_next); mtx_unlock_spin(&pmc_processhash_mtx); pp = ppnew; ppnew = NULL; /* Add thread descriptors for this process' current threads. */ pmc_add_thread_descriptors_from_proc(p, pp); } else mtx_unlock_spin(&pmc_processhash_mtx); if (ppnew != NULL) free(ppnew, M_PMC); return pp; } /* * remove a process descriptor from the process hash table. */ static void pmc_remove_process_descriptor(struct pmc_process *pp) { KASSERT(pp->pp_refcnt == 0, ("[pmc,%d] Removing process descriptor %p with count %d", __LINE__, pp, pp->pp_refcnt)); mtx_lock_spin(&pmc_processhash_mtx); LIST_REMOVE(pp, pp_next); mtx_unlock_spin(&pmc_processhash_mtx); } /* * destroy a process descriptor. 
*/ static void pmc_destroy_process_descriptor(struct pmc_process *pp) { struct pmc_thread *pmc_td; while ((pmc_td = LIST_FIRST(&pp->pp_tds)) != NULL) { LIST_REMOVE(pmc_td, pt_next); pmc_thread_descriptor_pool_free(pmc_td); } free(pp, M_PMC); } /* * find an owner descriptor corresponding to proc 'p' */ static struct pmc_owner * pmc_find_owner_descriptor(struct proc *p) { uint32_t hindex; struct pmc_owner *po; struct pmc_ownerhash *poh; hindex = PMC_HASH_PTR(p, pmc_ownerhashmask); poh = &pmc_ownerhash[hindex]; po = NULL; LIST_FOREACH(po, poh, po_next) if (po->po_owner == p) break; PMCDBG5(OWN,FND,1, "find-owner proc=%p (%d, %s) hindex=0x%x -> " "pmc-owner=%p", p, p->p_pid, p->p_comm, hindex, po); return po; } /* * pmc_allocate_pmc_descriptor * * Allocate a pmc descriptor and initialize its * fields. */ static struct pmc * pmc_allocate_pmc_descriptor(void) { struct pmc *pmc; pmc = malloc(sizeof(struct pmc), M_PMC, M_WAITOK|M_ZERO); pmc->pm_runcount = counter_u64_alloc(M_WAITOK); pmc->pm_pcpu_state = malloc(sizeof(struct pmc_pcpu_state)*mp_ncpus, M_PMC, M_WAITOK|M_ZERO); PMCDBG1(PMC,ALL,1, "allocate-pmc -> pmc=%p", pmc); return pmc; } /* * Destroy a pmc descriptor. 
*/ static void pmc_destroy_pmc_descriptor(struct pmc *pm) { KASSERT(pm->pm_state == PMC_STATE_DELETED || pm->pm_state == PMC_STATE_FREE, ("[pmc,%d] destroying non-deleted PMC", __LINE__)); KASSERT(LIST_EMPTY(&pm->pm_targets), ("[pmc,%d] destroying pmc with targets", __LINE__)); KASSERT(pm->pm_owner == NULL, ("[pmc,%d] destroying pmc attached to an owner", __LINE__)); KASSERT(counter_u64_fetch(pm->pm_runcount) == 0, ("[pmc,%d] pmc has non-zero run count %ld", __LINE__, (unsigned long)counter_u64_fetch(pm->pm_runcount))); counter_u64_free(pm->pm_runcount); free(pm->pm_pcpu_state, M_PMC); free(pm, M_PMC); } static void pmc_wait_for_pmc_idle(struct pmc *pm) { #ifdef INVARIANTS volatile int maxloop; maxloop = 100 * pmc_cpu_max(); #endif /* * Loop (with a forced context switch) till the PMC's runcount * comes down to zero. */ pmclog_flush(pm->pm_owner, 1); while (counter_u64_fetch(pm->pm_runcount) > 0) { pmclog_flush(pm->pm_owner, 1); #ifdef INVARIANTS maxloop--; KASSERT(maxloop > 0, ("[pmc,%d] (ri%d, rc%ld) waiting too long for " "pmc to be free", __LINE__, PMC_TO_ROWINDEX(pm), (unsigned long)counter_u64_fetch(pm->pm_runcount))); #endif pmc_force_context_switch(); } } /* * This function does the following things: * * - detaches the PMC from hardware * - unlinks all target threads that were attached to it * - removes the PMC from its owner's list * - destroys the PMC private mutex * * Once this function completes, the given pmc pointer can be freed by * calling pmc_destroy_pmc_descriptor(). 
*/ static void pmc_release_pmc_descriptor(struct pmc *pm) { enum pmc_mode mode; struct pmc_hw *phw __diagused; u_int adjri, ri, cpu; struct pmc_owner *po; struct pmc_binding pb; struct pmc_process *pp; struct pmc_classdep *pcd; struct pmc_target *ptgt, *tmp; sx_assert(&pmc_sx, SX_XLOCKED); KASSERT(pm, ("[pmc,%d] null pmc", __LINE__)); ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); mode = PMC_TO_MODE(pm); PMCDBG3(PMC,REL,1, "release-pmc pmc=%p ri=%d mode=%d", pm, ri, mode); /* * First, we take the PMC off hardware. */ cpu = 0; if (PMC_IS_SYSTEM_MODE(mode)) { /* * A system mode PMC runs on a specific CPU. Switch * to this CPU and turn hardware off. */ pmc_save_cpu_binding(&pb); cpu = PMC_TO_CPU(pm); pmc_select_cpu(cpu); /* switch off non-stalled CPUs */ pm->pm_pcpu_state[cpu].pps_cpustate = 0; if (pm->pm_state == PMC_STATE_RUNNING && pm->pm_pcpu_state[cpu].pps_stalled == 0) { phw = pmc_pcpu[cpu]->pc_hwpmcs[ri]; KASSERT(phw->phw_pmc == pm, ("[pmc, %d] pmc ptr ri(%d) hw(%p) pm(%p)", __LINE__, ri, phw->phw_pmc, pm)); PMCDBG2(PMC,REL,2, "stopping cpu=%d ri=%d", cpu, ri); critical_enter(); pcd->pcd_stop_pmc(cpu, adjri, pm); critical_exit(); } PMCDBG2(PMC,REL,2, "decfg cpu=%d ri=%d", cpu, ri); critical_enter(); pcd->pcd_config_pmc(cpu, adjri, NULL); critical_exit(); /* adjust the global and process count of SS mode PMCs */ if (mode == PMC_MODE_SS && pm->pm_state == PMC_STATE_RUNNING) { po = pm->pm_owner; po->po_sscount--; if (po->po_sscount == 0) { atomic_subtract_rel_int(&pmc_ss_count, 1); CK_LIST_REMOVE(po, po_ssnext); epoch_wait_preempt(global_epoch_preempt); } } pm->pm_state = PMC_STATE_DELETED; pmc_restore_cpu_binding(&pb); /* * We could have references to this PMC structure in * the per-cpu sample queues. Wait for the queue to * drain. */ pmc_wait_for_pmc_idle(pm); } else if (PMC_IS_VIRTUAL_MODE(mode)) { /* * A virtual PMC could be running on multiple CPUs at * a given instant. 
* * By marking its state as DELETED, we ensure that * this PMC is never further scheduled on hardware. * * Then we wait till all CPUs are done with this PMC. */ pm->pm_state = PMC_STATE_DELETED; /* Wait for the PMCs runcount to come to zero. */ pmc_wait_for_pmc_idle(pm); /* * At this point the PMC is off all CPUs and cannot be * freshly scheduled onto a CPU. It is now safe to * unlink all targets from this PMC. If a * process-record's refcount falls to zero, we remove * it from the hash table. The module-wide SX lock * protects us from races. */ LIST_FOREACH_SAFE(ptgt, &pm->pm_targets, pt_next, tmp) { pp = ptgt->pt_process; pmc_unlink_target_process(pm, pp); /* frees 'ptgt' */ PMCDBG1(PMC,REL,3, "pp->refcnt=%d", pp->pp_refcnt); /* * If the target process record shows that no * PMCs are attached to it, reclaim its space. */ if (pp->pp_refcnt == 0) { pmc_remove_process_descriptor(pp); pmc_destroy_process_descriptor(pp); } } cpu = curthread->td_oncpu; /* setup cpu for pmd_release() */ } /* * Release any MD resources */ (void) pcd->pcd_release_pmc(cpu, adjri, pm); /* * Update row disposition */ if (PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pm))) PMC_UNMARK_ROW_STANDALONE(ri); else PMC_UNMARK_ROW_THREAD(ri); /* unlink from the owner's list */ if (pm->pm_owner) { LIST_REMOVE(pm, pm_next); pm->pm_owner = NULL; } } /* * Register an owner and a pmc. 
*/ static int pmc_register_owner(struct proc *p, struct pmc *pmc) { struct pmc_owner *po; sx_assert(&pmc_sx, SX_XLOCKED); if ((po = pmc_find_owner_descriptor(p)) == NULL) if ((po = pmc_allocate_owner_descriptor(p)) == NULL) return ENOMEM; KASSERT(pmc->pm_owner == NULL, ("[pmc,%d] attempting to own an initialized PMC", __LINE__)); pmc->pm_owner = po; LIST_INSERT_HEAD(&po->po_pmcs, pmc, pm_next); PROC_LOCK(p); p->p_flag |= P_HWPMC; PROC_UNLOCK(p); if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_pmcallocate(pmc); PMCDBG2(PMC,REG,1, "register-owner pmc-owner=%p pmc=%p", po, pmc); return 0; } /* * Return the current row disposition: * == 0 => FREE * > 0 => PROCESS MODE * < 0 => SYSTEM MODE */ int pmc_getrowdisp(int ri) { return pmc_pmcdisp[ri]; } /* * Check if a PMC at row index 'ri' can be allocated to the current * process. * * Allocation can fail if: * - the current process is already being profiled by a PMC at index 'ri', * attached to it via OP_PMCATTACH. * - the current process has already allocated a PMC at index 'ri' * via OP_ALLOCATE. */ static int pmc_can_allocate_rowindex(struct proc *p, unsigned int ri, int cpu) { enum pmc_mode mode; struct pmc *pm; struct pmc_owner *po; struct pmc_process *pp; PMCDBG5(PMC,ALR,1, "can-allocate-rowindex proc=%p (%d, %s) ri=%d " "cpu=%d", p, p->p_pid, p->p_comm, ri, cpu); /* * We shouldn't have already allocated a process-mode PMC at * row index 'ri'. * * We shouldn't have allocated a system-wide PMC on the same * CPU and same RI. */ if ((po = pmc_find_owner_descriptor(p)) != NULL) LIST_FOREACH(pm, &po->po_pmcs, pm_next) { if (PMC_TO_ROWINDEX(pm) == ri) { mode = PMC_TO_MODE(pm); if (PMC_IS_VIRTUAL_MODE(mode)) return EEXIST; if (PMC_IS_SYSTEM_MODE(mode) && (int) PMC_TO_CPU(pm) == cpu) return EEXIST; } } /* * We also shouldn't be the target of any PMC at this index * since otherwise a PMC_ATTACH to ourselves will fail. 
*/ if ((pp = pmc_find_process_descriptor(p, 0)) != NULL) if (pp->pp_pmcs[ri].pp_pmc) return EEXIST; PMCDBG4(PMC,ALR,2, "can-allocate-rowindex proc=%p (%d, %s) ri=%d ok", p, p->p_pid, p->p_comm, ri); return 0; } /* * Check if a given PMC at row index 'ri' can be currently used in * mode 'mode'. */ static int pmc_can_allocate_row(int ri, enum pmc_mode mode) { enum pmc_disp disp; sx_assert(&pmc_sx, SX_XLOCKED); PMCDBG2(PMC,ALR,1, "can-allocate-row ri=%d mode=%d", ri, mode); if (PMC_IS_SYSTEM_MODE(mode)) disp = PMC_DISP_STANDALONE; else disp = PMC_DISP_THREAD; /* * check disposition for PMC row 'ri': * * Expected disposition Row-disposition Result * * STANDALONE STANDALONE or FREE proceed * STANDALONE THREAD fail * THREAD THREAD or FREE proceed * THREAD STANDALONE fail */ if (!PMC_ROW_DISP_IS_FREE(ri) && !(disp == PMC_DISP_THREAD && PMC_ROW_DISP_IS_THREAD(ri)) && !(disp == PMC_DISP_STANDALONE && PMC_ROW_DISP_IS_STANDALONE(ri))) return EBUSY; /* * All OK */ PMCDBG2(PMC,ALR,2, "can-allocate-row ri=%d mode=%d ok", ri, mode); return 0; } /* * Find a PMC descriptor with user handle 'pmcid' for thread 'td'. */ static struct pmc * pmc_find_pmc_descriptor_in_process(struct pmc_owner *po, pmc_id_t pmcid) { struct pmc *pm; KASSERT(PMC_ID_TO_ROWINDEX(pmcid) < md->pmd_npmc, ("[pmc,%d] Illegal pmc index %d (max %d)", __LINE__, PMC_ID_TO_ROWINDEX(pmcid), md->pmd_npmc)); LIST_FOREACH(pm, &po->po_pmcs, pm_next) if (pm->pm_id == pmcid) return pm; return NULL; } static int pmc_find_pmc(pmc_id_t pmcid, struct pmc **pmc) { struct pmc *pm, *opm; struct pmc_owner *po; struct pmc_process *pp; PMCDBG1(PMC,FND,1, "find-pmc id=%d", pmcid); if (PMC_ID_TO_ROWINDEX(pmcid) >= md->pmd_npmc) return (EINVAL); if ((po = pmc_find_owner_descriptor(curthread->td_proc)) == NULL) { /* * In case of PMC_F_DESCENDANTS child processes we will not find * the current process in the owners hash list. Find the owner * process first and from there lookup the po. 
*/ if ((pp = pmc_find_process_descriptor(curthread->td_proc, PMC_FLAG_NONE)) == NULL) { return ESRCH; } else { opm = pp->pp_pmcs[PMC_ID_TO_ROWINDEX(pmcid)].pp_pmc; if (opm == NULL) return ESRCH; if ((opm->pm_flags & (PMC_F_ATTACHED_TO_OWNER| PMC_F_DESCENDANTS)) != (PMC_F_ATTACHED_TO_OWNER| PMC_F_DESCENDANTS)) return ESRCH; po = opm->pm_owner; } } if ((pm = pmc_find_pmc_descriptor_in_process(po, pmcid)) == NULL) return EINVAL; PMCDBG2(PMC,FND,2, "find-pmc id=%d -> pmc=%p", pmcid, pm); *pmc = pm; return 0; } /* * Start a PMC. */ static int pmc_start(struct pmc *pm) { enum pmc_mode mode; struct pmc_owner *po; struct pmc_binding pb; struct pmc_classdep *pcd; int adjri, error, cpu, ri; KASSERT(pm != NULL, ("[pmc,%d] null pm", __LINE__)); mode = PMC_TO_MODE(pm); ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); error = 0; PMCDBG3(PMC,OPS,1, "start pmc=%p mode=%d ri=%d", pm, mode, ri); po = pm->pm_owner; /* * Disallow PMCSTART if a logfile is required but has not been * configured yet. */ if ((pm->pm_flags & PMC_F_NEEDS_LOGFILE) && (po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) return (EDOOFUS); /* programming error */ /* * If this is a sampling mode PMC, log mapping information for * the kernel modules that are currently loaded. */ if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) pmc_log_kernel_mappings(pm); if (PMC_IS_VIRTUAL_MODE(mode)) { /* * If a PMCATTACH has never been done on this PMC, * attach it to its owner process. */ if (LIST_EMPTY(&pm->pm_targets)) error = (pm->pm_flags & PMC_F_ATTACH_DONE) ? ESRCH : pmc_attach_process(po->po_owner, pm); /* * If the PMC is attached to its owner, then force a context * switch to ensure that the MD state gets set correctly. */ if (error == 0) { pm->pm_state = PMC_STATE_RUNNING; if (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) pmc_force_context_switch(); } return (error); } /* * A system-wide PMC. * * Add the owner to the global list if this is a system-wide * sampling PMC. 
*/ if (mode == PMC_MODE_SS) { /* * Log mapping information for all existing processes in the * system. Subsequent mappings are logged as they happen; * see pmc_process_mmap(). */ if (po->po_logprocmaps == 0) { pmc_log_all_process_mappings(po); po->po_logprocmaps = 1; } po->po_sscount++; if (po->po_sscount == 1) { atomic_add_rel_int(&pmc_ss_count, 1); CK_LIST_INSERT_HEAD(&pmc_ss_owners, po, po_ssnext); PMCDBG1(PMC,OPS,1, "po=%p in global list", po); } } /* * Move to the CPU associated with this * PMC, and start the hardware. */ pmc_save_cpu_binding(&pb); cpu = PMC_TO_CPU(pm); if (!pmc_cpu_is_active(cpu)) return (ENXIO); pmc_select_cpu(cpu); /* * global PMCs are configured at allocation time * so write out the initial value and start the PMC. */ pm->pm_state = PMC_STATE_RUNNING; critical_enter(); if ((error = pcd->pcd_write_pmc(cpu, adjri, pm, PMC_IS_SAMPLING_MODE(mode) ? pm->pm_sc.pm_reloadcount : pm->pm_sc.pm_initial)) == 0) { /* If a sampling mode PMC, reset stalled state. */ if (PMC_IS_SAMPLING_MODE(mode)) pm->pm_pcpu_state[cpu].pps_stalled = 0; /* Indicate that we desire this to run. Start it. */ pm->pm_pcpu_state[cpu].pps_cpustate = 1; error = pcd->pcd_start_pmc(cpu, adjri, pm); } critical_exit(); pmc_restore_cpu_binding(&pb); return (error); } /* * Stop a PMC. */ static int pmc_stop(struct pmc *pm) { struct pmc_owner *po; struct pmc_binding pb; struct pmc_classdep *pcd; int adjri, cpu, error, ri; KASSERT(pm != NULL, ("[pmc,%d] null pmc", __LINE__)); PMCDBG3(PMC,OPS,1, "stop pmc=%p mode=%d ri=%d", pm, PMC_TO_MODE(pm), PMC_TO_ROWINDEX(pm)); pm->pm_state = PMC_STATE_STOPPED; /* * If the PMC is a virtual mode one, changing the state to * non-RUNNING is enough to ensure that the PMC never gets * scheduled. * * If this PMC is current running on a CPU, then it will * handled correctly at the time its target process is context * switched out. */ if (PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm))) return 0; /* * A system-mode PMC. 
Move to the CPU associated with * this PMC, and stop the hardware. We update the * 'initial count' so that a subsequent PMCSTART will * resume counting from the current hardware count. */ pmc_save_cpu_binding(&pb); cpu = PMC_TO_CPU(pm); KASSERT(cpu >= 0 && cpu < pmc_cpu_max(), ("[pmc,%d] illegal cpu=%d", __LINE__, cpu)); if (!pmc_cpu_is_active(cpu)) return ENXIO; pmc_select_cpu(cpu); ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); pm->pm_pcpu_state[cpu].pps_cpustate = 0; critical_enter(); if ((error = pcd->pcd_stop_pmc(cpu, adjri, pm)) == 0) error = pcd->pcd_read_pmc(cpu, adjri, pm, &pm->pm_sc.pm_initial); critical_exit(); pmc_restore_cpu_binding(&pb); po = pm->pm_owner; /* remove this owner from the global list of SS PMC owners */ if (PMC_TO_MODE(pm) == PMC_MODE_SS) { po->po_sscount--; if (po->po_sscount == 0) { atomic_subtract_rel_int(&pmc_ss_count, 1); CK_LIST_REMOVE(po, po_ssnext); epoch_wait_preempt(global_epoch_preempt); PMCDBG1(PMC,OPS,2,"po=%p removed from global list", po); } } return (error); } static struct pmc_classdep * pmc_class_to_classdep(enum pmc_class class) { int n; for (n = 0; n < md->pmd_nclass; n++) if (md->pmd_classdep[n].pcd_class == class) return (&md->pmd_classdep[n]); return (NULL); } #if defined(HWPMC_DEBUG) && defined(KTR) static const char *pmc_op_to_name[] = { #undef __PMC_OP #define __PMC_OP(N, D) #N , __PMC_OPS() NULL }; #endif /* * The syscall interface */ #define PMC_GET_SX_XLOCK(...) 
do { \ sx_xlock(&pmc_sx); \ if (pmc_hook == NULL) { \ sx_xunlock(&pmc_sx); \ return __VA_ARGS__; \ } \ } while (0) #define PMC_DOWNGRADE_SX() do { \ sx_downgrade(&pmc_sx); \ is_sx_downgraded = 1; \ } while (0) static int pmc_syscall_handler(struct thread *td, void *syscall_args) { int error, is_sx_downgraded, op; struct pmc_syscall_args *c; void *pmclog_proc_handle; void *arg; c = (struct pmc_syscall_args *)syscall_args; op = c->pmop_code; arg = c->pmop_data; /* PMC isn't set up yet */ if (pmc_hook == NULL) return (EINVAL); if (op == PMC_OP_CONFIGURELOG) { /* * We cannot create the logging process inside * pmclog_configure_log() because there is a LOR * between pmc_sx and process structure locks. * Instead, pre-create the process and ignite the loop * if everything is fine, otherwise direct the process * to exit. */ error = pmclog_proc_create(td, &pmclog_proc_handle); if (error != 0) goto done_syscall; } PMC_GET_SX_XLOCK(ENOSYS); is_sx_downgraded = 0; PMCDBG3(MOD,PMS,1, "syscall op=%d \"%s\" arg=%p", op, pmc_op_to_name[op], arg); error = 0; counter_u64_add(pmc_stats.pm_syscalls, 1); switch (op) { /* * Configure a log file. * * XXX This OP will be reworked. */ case PMC_OP_CONFIGURELOG: { struct proc *p; struct pmc *pm; struct pmc_owner *po; struct pmc_op_configurelog cl; if ((error = copyin(arg, &cl, sizeof(cl))) != 0) { pmclog_proc_ignite(pmclog_proc_handle, NULL); break; } /* No flags currently implemented */ if (cl.pm_flags != 0) { error = EINVAL; break; } /* mark this process as owning a log file */ p = td->td_proc; if ((po = pmc_find_owner_descriptor(p)) == NULL) if ((po = pmc_allocate_owner_descriptor(p)) == NULL) { pmclog_proc_ignite(pmclog_proc_handle, NULL); error = ENOMEM; break; } /* * If a valid fd was passed in, try to configure that, * otherwise if 'fd' was less than zero and there was * a log file configured, flush its buffers and * de-configure it. 
*/ if (cl.pm_logfd >= 0) { error = pmclog_configure_log(md, po, cl.pm_logfd); pmclog_proc_ignite(pmclog_proc_handle, error == 0 ? po : NULL); } else if (po->po_flags & PMC_PO_OWNS_LOGFILE) { pmclog_proc_ignite(pmclog_proc_handle, NULL); error = pmclog_close(po); if (error == 0) { LIST_FOREACH(pm, &po->po_pmcs, pm_next) if (pm->pm_flags & PMC_F_NEEDS_LOGFILE && pm->pm_state == PMC_STATE_RUNNING) pmc_stop(pm); error = pmclog_deconfigure_log(po); } } else { pmclog_proc_ignite(pmclog_proc_handle, NULL); error = EINVAL; } } break; /* * Flush a log file. */ case PMC_OP_FLUSHLOG: { struct pmc_owner *po; sx_assert(&pmc_sx, SX_XLOCKED); if ((po = pmc_find_owner_descriptor(td->td_proc)) == NULL) { error = EINVAL; break; } error = pmclog_flush(po, 0); } break; /* * Close a log file. */ case PMC_OP_CLOSELOG: { struct pmc_owner *po; sx_assert(&pmc_sx, SX_XLOCKED); if ((po = pmc_find_owner_descriptor(td->td_proc)) == NULL) { error = EINVAL; break; } error = pmclog_close(po); } break; /* * Retrieve hardware configuration. */ case PMC_OP_GETCPUINFO: /* CPU information */ { struct pmc_op_getcpuinfo gci; struct pmc_classinfo *pci; struct pmc_classdep *pcd; int cl; memset(&gci, 0, sizeof(gci)); gci.pm_cputype = md->pmd_cputype; gci.pm_ncpu = pmc_cpu_max(); gci.pm_npmc = md->pmd_npmc; gci.pm_nclass = md->pmd_nclass; pci = gci.pm_classes; pcd = md->pmd_classdep; for (cl = 0; cl < md->pmd_nclass; cl++, pci++, pcd++) { pci->pm_caps = pcd->pcd_caps; pci->pm_class = pcd->pcd_class; pci->pm_width = pcd->pcd_width; pci->pm_num = pcd->pcd_num; } error = copyout(&gci, arg, sizeof(gci)); } break; /* * Retrieve soft events list. */ case PMC_OP_GETDYNEVENTINFO: { enum pmc_class cl; enum pmc_event ev; struct pmc_op_getdyneventinfo *gei; struct pmc_dyn_event_descr dev; struct pmc_soft *ps; uint32_t nevent; sx_assert(&pmc_sx, SX_LOCKED); gei = (struct pmc_op_getdyneventinfo *) arg; if ((error = copyin(&gei->pm_class, &cl, sizeof(cl))) != 0) break; /* Only SOFT class is dynamic. 
*/ if (cl != PMC_CLASS_SOFT) { error = EINVAL; break; } nevent = 0; for (ev = PMC_EV_SOFT_FIRST; (int)ev <= PMC_EV_SOFT_LAST; ev++) { ps = pmc_soft_ev_acquire(ev); if (ps == NULL) continue; bcopy(&ps->ps_ev, &dev, sizeof(dev)); pmc_soft_ev_release(ps); error = copyout(&dev, &gei->pm_events[nevent], sizeof(struct pmc_dyn_event_descr)); if (error != 0) break; nevent++; } if (error != 0) break; error = copyout(&nevent, &gei->pm_nevent, sizeof(nevent)); } break; /* * Get module statistics */ case PMC_OP_GETDRIVERSTATS: { struct pmc_op_getdriverstats gms; #define CFETCH(a, b, field) a.field = counter_u64_fetch(b.field) CFETCH(gms, pmc_stats, pm_intr_ignored); CFETCH(gms, pmc_stats, pm_intr_processed); CFETCH(gms, pmc_stats, pm_intr_bufferfull); CFETCH(gms, pmc_stats, pm_syscalls); CFETCH(gms, pmc_stats, pm_syscall_errors); CFETCH(gms, pmc_stats, pm_buffer_requests); CFETCH(gms, pmc_stats, pm_buffer_requests_failed); CFETCH(gms, pmc_stats, pm_log_sweeps); #undef CFETCH error = copyout(&gms, arg, sizeof(gms)); } break; /* * Retrieve module version number */ case PMC_OP_GETMODULEVERSION: { uint32_t cv, modv; /* retrieve the client's idea of the ABI version */ if ((error = copyin(arg, &cv, sizeof(uint32_t))) != 0) break; /* don't service clients newer than our driver */ modv = PMC_VERSION; if ((cv & 0xFFFF0000) > (modv & 0xFFFF0000)) { error = EPROGMISMATCH; break; } error = copyout(&modv, arg, sizeof(int)); } break; /* * Retrieve the state of all the PMCs on a given * CPU. 
*/ case PMC_OP_GETPMCINFO: { int ari; struct pmc *pm; size_t pmcinfo_size; uint32_t cpu, n, npmc; struct pmc_owner *po; struct pmc_binding pb; struct pmc_classdep *pcd; struct pmc_info *p, *pmcinfo; struct pmc_op_getpmcinfo *gpi; PMC_DOWNGRADE_SX(); gpi = (struct pmc_op_getpmcinfo *) arg; if ((error = copyin(&gpi->pm_cpu, &cpu, sizeof(cpu))) != 0) break; if (cpu >= pmc_cpu_max()) { error = EINVAL; break; } if (!pmc_cpu_is_active(cpu)) { error = ENXIO; break; } /* switch to CPU 'cpu' */ pmc_save_cpu_binding(&pb); pmc_select_cpu(cpu); npmc = md->pmd_npmc; pmcinfo_size = npmc * sizeof(struct pmc_info); pmcinfo = malloc(pmcinfo_size, M_PMC, M_WAITOK | M_ZERO); p = pmcinfo; for (n = 0; n < md->pmd_npmc; n++, p++) { pcd = pmc_ri_to_classdep(md, n, &ari); KASSERT(pcd != NULL, ("[pmc,%d] null pcd ri=%d", __LINE__, n)); if ((error = pcd->pcd_describe(cpu, ari, p, &pm)) != 0) break; if (PMC_ROW_DISP_IS_STANDALONE(n)) p->pm_rowdisp = PMC_DISP_STANDALONE; else if (PMC_ROW_DISP_IS_THREAD(n)) p->pm_rowdisp = PMC_DISP_THREAD; else p->pm_rowdisp = PMC_DISP_FREE; p->pm_ownerpid = -1; if (pm == NULL) /* no PMC associated */ continue; po = pm->pm_owner; KASSERT(po->po_owner != NULL, ("[pmc,%d] pmc_owner had a null proc pointer", __LINE__)); p->pm_ownerpid = po->po_owner->p_pid; p->pm_mode = PMC_TO_MODE(pm); p->pm_event = pm->pm_event; p->pm_flags = pm->pm_flags; if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) p->pm_reloadcount = pm->pm_sc.pm_reloadcount; } pmc_restore_cpu_binding(&pb); /* now copy out the PMC info collected */ if (error == 0) error = copyout(pmcinfo, &gpi->pm_pmcs, pmcinfo_size); free(pmcinfo, M_PMC); } break; /* * Set the administrative state of a PMC. I.e. whether * the PMC is to be used or not. 
*/ case PMC_OP_PMCADMIN: { int cpu, ri; enum pmc_state request; struct pmc_cpu *pc; struct pmc_hw *phw; struct pmc_op_pmcadmin pma; struct pmc_binding pb; sx_assert(&pmc_sx, SX_XLOCKED); KASSERT(td == curthread, ("[pmc,%d] td != curthread", __LINE__)); error = priv_check(td, PRIV_PMC_MANAGE); if (error) break; if ((error = copyin(arg, &pma, sizeof(pma))) != 0) break; cpu = pma.pm_cpu; if (cpu < 0 || cpu >= (int) pmc_cpu_max()) { error = EINVAL; break; } if (!pmc_cpu_is_active(cpu)) { error = ENXIO; break; } request = pma.pm_state; if (request != PMC_STATE_DISABLED && request != PMC_STATE_FREE) { error = EINVAL; break; } ri = pma.pm_pmc; /* pmc id == row index */ if (ri < 0 || ri >= (int) md->pmd_npmc) { error = EINVAL; break; } /* * We can't disable a PMC with a row-index allocated * for process virtual PMCs. */ if (PMC_ROW_DISP_IS_THREAD(ri) && request == PMC_STATE_DISABLED) { error = EBUSY; break; } /* * otherwise, this PMC on this CPU is either free or * in system-wide mode. */ pmc_save_cpu_binding(&pb); pmc_select_cpu(cpu); pc = pmc_pcpu[cpu]; phw = pc->pc_hwpmcs[ri]; /* * XXX do we need some kind of 'forced' disable? */ if (phw->phw_pmc == NULL) { if (request == PMC_STATE_DISABLED && (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED)) { phw->phw_state &= ~PMC_PHW_FLAG_IS_ENABLED; PMC_MARK_ROW_STANDALONE(ri); } else if (request == PMC_STATE_FREE && (phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) == 0) { phw->phw_state |= PMC_PHW_FLAG_IS_ENABLED; PMC_UNMARK_ROW_STANDALONE(ri); } /* other cases are a no-op */ } else error = EBUSY; pmc_restore_cpu_binding(&pb); } break; /* * Allocate a PMC. 
*/ case PMC_OP_PMCALLOCATE: { int adjri, n; u_int cpu; uint32_t caps; struct pmc *pmc; enum pmc_mode mode; struct pmc_hw *phw; struct pmc_binding pb; struct pmc_classdep *pcd; struct pmc_op_pmcallocate pa; if ((error = copyin(arg, &pa, sizeof(pa))) != 0) break; caps = pa.pm_caps; mode = pa.pm_mode; cpu = pa.pm_cpu; if ((mode != PMC_MODE_SS && mode != PMC_MODE_SC && mode != PMC_MODE_TS && mode != PMC_MODE_TC) || (cpu != (u_int) PMC_CPU_ANY && cpu >= pmc_cpu_max())) { error = EINVAL; break; } /* * Virtual PMCs should only ask for a default CPU. * System mode PMCs need to specify a non-default CPU. */ if ((PMC_IS_VIRTUAL_MODE(mode) && cpu != (u_int) PMC_CPU_ANY) || (PMC_IS_SYSTEM_MODE(mode) && cpu == (u_int) PMC_CPU_ANY)) { error = EINVAL; break; } /* * Check that an inactive CPU is not being asked for. */ if (PMC_IS_SYSTEM_MODE(mode) && !pmc_cpu_is_active(cpu)) { error = ENXIO; break; } /* * Refuse an allocation for a system-wide PMC if this * process has been jailed, or if this process lacks * super-user credentials and the sysctl tunable * 'security.bsd.unprivileged_syspmcs' is zero. 
*/ if (PMC_IS_SYSTEM_MODE(mode)) { if (jailed(curthread->td_ucred)) { error = EPERM; break; } if (!pmc_unprivileged_syspmcs) { error = priv_check(curthread, PRIV_PMC_SYSTEM); if (error) break; } } /* * Look for valid values for 'pm_flags' */ if ((pa.pm_flags & ~(PMC_F_DESCENDANTS | PMC_F_LOG_PROCCSW | PMC_F_LOG_PROCEXIT | PMC_F_CALLCHAIN | PMC_F_USERCALLCHAIN)) != 0) { error = EINVAL; break; } /* PMC_F_USERCALLCHAIN is only valid with PMC_F_CALLCHAIN */ if ((pa.pm_flags & (PMC_F_CALLCHAIN | PMC_F_USERCALLCHAIN)) == PMC_F_USERCALLCHAIN) { error = EINVAL; break; } /* PMC_F_USERCALLCHAIN is only valid for sampling mode */ if (pa.pm_flags & PMC_F_USERCALLCHAIN && mode != PMC_MODE_TS && mode != PMC_MODE_SS) { error = EINVAL; break; } /* process logging options are not allowed for system PMCs */ if (PMC_IS_SYSTEM_MODE(mode) && (pa.pm_flags & (PMC_F_LOG_PROCCSW | PMC_F_LOG_PROCEXIT))) { error = EINVAL; break; } /* * All sampling mode PMCs need to be able to interrupt the * CPU. */ if (PMC_IS_SAMPLING_MODE(mode)) caps |= PMC_CAP_INTERRUPT; /* A valid class specifier should have been passed in. */ pcd = pmc_class_to_classdep(pa.pm_class); if (pcd == NULL) { error = EINVAL; break; } /* The requested PMC capabilities should be feasible. */ if ((pcd->pcd_caps & caps) != caps) { error = EOPNOTSUPP; break; } PMCDBG4(PMC,ALL,2, "event=%d caps=0x%x mode=%d cpu=%d", pa.pm_ev, caps, mode, cpu); pmc = pmc_allocate_pmc_descriptor(); pmc->pm_id = PMC_ID_MAKE_ID(cpu,pa.pm_mode,pa.pm_class, PMC_ID_INVALID); pmc->pm_event = pa.pm_ev; pmc->pm_state = PMC_STATE_FREE; pmc->pm_caps = caps; pmc->pm_flags = pa.pm_flags; /* XXX set lower bound on sampling for process counters */ if (PMC_IS_SAMPLING_MODE(mode)) { /* * Don't permit requested sample rate to be * less than pmc_mincount. 
*/ if (pa.pm_count < MAX(1, pmc_mincount)) log(LOG_WARNING, "pmcallocate: passed sample " "rate %ju - setting to %u\n", (uintmax_t)pa.pm_count, MAX(1, pmc_mincount)); pmc->pm_sc.pm_reloadcount = MAX(MAX(1, pmc_mincount), pa.pm_count); } else pmc->pm_sc.pm_initial = pa.pm_count; /* switch thread to CPU 'cpu' */ pmc_save_cpu_binding(&pb); #define PMC_IS_SHAREABLE_PMC(cpu, n) \ (pmc_pcpu[(cpu)]->pc_hwpmcs[(n)]->phw_state & \ PMC_PHW_FLAG_IS_SHAREABLE) #define PMC_IS_UNALLOCATED(cpu, n) \ (pmc_pcpu[(cpu)]->pc_hwpmcs[(n)]->phw_pmc == NULL) if (PMC_IS_SYSTEM_MODE(mode)) { pmc_select_cpu(cpu); for (n = pcd->pcd_ri; n < (int) md->pmd_npmc; n++) { pcd = pmc_ri_to_classdep(md, n, &adjri); if (pmc_can_allocate_row(n, mode) == 0 && pmc_can_allocate_rowindex( curthread->td_proc, n, cpu) == 0 && (PMC_IS_UNALLOCATED(cpu, n) || PMC_IS_SHAREABLE_PMC(cpu, n)) && pcd->pcd_allocate_pmc(cpu, adjri, pmc, &pa) == 0) break; } } else { /* Process virtual mode */ for (n = pcd->pcd_ri; n < (int) md->pmd_npmc; n++) { pcd = pmc_ri_to_classdep(md, n, &adjri); if (pmc_can_allocate_row(n, mode) == 0 && pmc_can_allocate_rowindex( curthread->td_proc, n, PMC_CPU_ANY) == 0 && pcd->pcd_allocate_pmc(curthread->td_oncpu, adjri, pmc, &pa) == 0) break; } } #undef PMC_IS_UNALLOCATED #undef PMC_IS_SHAREABLE_PMC pmc_restore_cpu_binding(&pb); if (n == (int) md->pmd_npmc) { pmc_destroy_pmc_descriptor(pmc); pmc = NULL; error = EINVAL; break; } /* Fill in the correct value in the ID field */ pmc->pm_id = PMC_ID_MAKE_ID(cpu,mode,pa.pm_class,n); PMCDBG5(PMC,ALL,2, "ev=%d class=%d mode=%d n=%d -> pmcid=%x", pmc->pm_event, pa.pm_class, mode, n, pmc->pm_id); /* Process mode PMCs with logging enabled need log files */ if (pmc->pm_flags & (PMC_F_LOG_PROCEXIT | PMC_F_LOG_PROCCSW)) pmc->pm_flags |= PMC_F_NEEDS_LOGFILE; /* All system mode sampling PMCs require a log file */ if (PMC_IS_SAMPLING_MODE(mode) && PMC_IS_SYSTEM_MODE(mode)) pmc->pm_flags |= PMC_F_NEEDS_LOGFILE; /* * Configure global pmc's immediately */ if 
(PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pmc))) { pmc_save_cpu_binding(&pb); pmc_select_cpu(cpu); phw = pmc_pcpu[cpu]->pc_hwpmcs[n]; pcd = pmc_ri_to_classdep(md, n, &adjri); if ((phw->phw_state & PMC_PHW_FLAG_IS_ENABLED) == 0 || (error = pcd->pcd_config_pmc(cpu, adjri, pmc)) != 0) { (void) pcd->pcd_release_pmc(cpu, adjri, pmc); pmc_destroy_pmc_descriptor(pmc); pmc = NULL; pmc_restore_cpu_binding(&pb); error = EPERM; break; } pmc_restore_cpu_binding(&pb); } pmc->pm_state = PMC_STATE_ALLOCATED; pmc->pm_class = pa.pm_class; /* * mark row disposition */ if (PMC_IS_SYSTEM_MODE(mode)) PMC_MARK_ROW_STANDALONE(n); else PMC_MARK_ROW_THREAD(n); /* * Register this PMC with the current thread as its owner. */ if ((error = pmc_register_owner(curthread->td_proc, pmc)) != 0) { pmc_release_pmc_descriptor(pmc); pmc_destroy_pmc_descriptor(pmc); pmc = NULL; break; } /* * Return the allocated index. */ pa.pm_pmcid = pmc->pm_id; error = copyout(&pa, arg, sizeof(pa)); } break; /* * Attach a PMC to a process. */ case PMC_OP_PMCATTACH: { struct pmc *pm; struct proc *p; struct pmc_op_pmcattach a; sx_assert(&pmc_sx, SX_XLOCKED); if ((error = copyin(arg, &a, sizeof(a))) != 0) break; if (a.pm_pid < 0) { error = EINVAL; break; } else if (a.pm_pid == 0) a.pm_pid = td->td_proc->p_pid; if ((error = pmc_find_pmc(a.pm_pmc, &pm)) != 0) break; if (PMC_IS_SYSTEM_MODE(PMC_TO_MODE(pm))) { error = EINVAL; break; } /* PMCs may be (re)attached only when allocated or stopped */ if (pm->pm_state == PMC_STATE_RUNNING) { error = EBUSY; break; } else if (pm->pm_state != PMC_STATE_ALLOCATED && pm->pm_state != PMC_STATE_STOPPED) { error = EINVAL; break; } /* lookup pid */ if ((p = pfind(a.pm_pid)) == NULL) { error = ESRCH; break; } /* * Ignore processes that are working on exiting. */ if (p->p_flag & P_WEXIT) { error = ESRCH; PROC_UNLOCK(p); /* pfind() returns a locked process */ break; } /* * we are allowed to attach a PMC to a process if * we can debug it. 
*/ error = p_candebug(curthread, p); PROC_UNLOCK(p); if (error == 0) error = pmc_attach_process(p, pm); } break; /* * Detach an attached PMC from a process. */ case PMC_OP_PMCDETACH: { struct pmc *pm; struct proc *p; struct pmc_op_pmcattach a; if ((error = copyin(arg, &a, sizeof(a))) != 0) break; if (a.pm_pid < 0) { error = EINVAL; break; } else if (a.pm_pid == 0) a.pm_pid = td->td_proc->p_pid; if ((error = pmc_find_pmc(a.pm_pmc, &pm)) != 0) break; if ((p = pfind(a.pm_pid)) == NULL) { error = ESRCH; break; } /* * Treat processes that are in the process of exiting * as if they were not present. */ if (p->p_flag & P_WEXIT) error = ESRCH; PROC_UNLOCK(p); /* pfind() returns a locked process */ if (error == 0) error = pmc_detach_process(p, pm); } break; /* * Retrieve the MSR number associated with the counter * 'pmc_id'. This allows processes to directly use RDPMC * instructions to read their PMCs, without the overhead of a * system call. */ case PMC_OP_PMCGETMSR: { int adjri, ri; struct pmc *pm; struct pmc_target *pt; struct pmc_op_getmsr gm; struct pmc_classdep *pcd; PMC_DOWNGRADE_SX(); if ((error = copyin(arg, &gm, sizeof(gm))) != 0) break; if ((error = pmc_find_pmc(gm.pm_pmcid, &pm)) != 0) break; /* * The allocated PMC has to be a process virtual PMC, * i.e., of type MODE_T[CS]. Global PMCs can only be * read using the PMCREAD operation since they may be * allocated on a different CPU than the one we could * be running on at the time of the RDPMC instruction. * * The GETMSR operation is not allowed for PMCs that * are inherited across processes. */ if (!PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm)) || (pm->pm_flags & PMC_F_DESCENDANTS)) { error = EINVAL; break; } /* * It only makes sense to use a RDPMC (or its * equivalent instruction on non-x86 architectures) on * a process that has allocated and attached a PMC to * itself. Conversely the PMC is only allowed to have * one process attached to it -- its owner. 
*/ if ((pt = LIST_FIRST(&pm->pm_targets)) == NULL || LIST_NEXT(pt, pt_next) != NULL || pt->pt_process->pp_proc != pm->pm_owner->po_owner) { error = EINVAL; break; } ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); /* PMC class has no 'GETMSR' support */ if (pcd->pcd_get_msr == NULL) { error = ENOSYS; break; } if ((error = (*pcd->pcd_get_msr)(adjri, &gm.pm_msr)) < 0) break; if ((error = copyout(&gm, arg, sizeof(gm))) < 0) break; /* * Mark our process as using MSRs. Update machine * state using a forced context switch. */ pt->pt_process->pp_flags |= PMC_PP_ENABLE_MSR_ACCESS; pmc_force_context_switch(); } break; /* * Release an allocated PMC */ case PMC_OP_PMCRELEASE: { pmc_id_t pmcid; struct pmc *pm; struct pmc_owner *po; struct pmc_op_simple sp; /* * Find PMC pointer for the named PMC. * * Use pmc_release_pmc_descriptor() to switch off the * PMC, remove all its target threads, and remove the * PMC from its owner's list. * * Remove the owner record if this is the last PMC * owned. * * Free up space. */ if ((error = copyin(arg, &sp, sizeof(sp))) != 0) break; pmcid = sp.pm_pmcid; if ((error = pmc_find_pmc(pmcid, &pm)) != 0) break; po = pm->pm_owner; pmc_release_pmc_descriptor(pm); pmc_maybe_remove_owner(po); pmc_destroy_pmc_descriptor(pm); } break; /* * Read and/or write a PMC. */ case PMC_OP_PMCRW: { int adjri; struct pmc *pm; uint32_t cpu, ri; pmc_value_t oldvalue; struct pmc_binding pb; struct pmc_op_pmcrw prw; struct pmc_classdep *pcd; struct pmc_op_pmcrw *pprw; PMC_DOWNGRADE_SX(); if ((error = copyin(arg, &prw, sizeof(prw))) != 0) break; PMCDBG2(PMC,OPS,1, "rw id=%d flags=0x%x", prw.pm_pmcid, prw.pm_flags); /* must have at least one flag set */ if ((prw.pm_flags & (PMC_F_OLDVALUE|PMC_F_NEWVALUE)) == 0) { error = EINVAL; break; } /* locate pmc descriptor */ if ((error = pmc_find_pmc(prw.pm_pmcid, &pm)) != 0) break; /* Can't read a PMC that hasn't been started. 
*/ if (pm->pm_state != PMC_STATE_ALLOCATED && pm->pm_state != PMC_STATE_STOPPED && pm->pm_state != PMC_STATE_RUNNING) { error = EINVAL; break; } /* writing a new value is allowed only for 'STOPPED' pmcs */ if (pm->pm_state == PMC_STATE_RUNNING && (prw.pm_flags & PMC_F_NEWVALUE)) { error = EBUSY; break; } if (PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm))) { /* * If this PMC is attached to its owner (i.e., * the process requesting this operation) and * is running, then attempt to get an * up-to-date reading from hardware for a READ. * Writes are only allowed when the PMC is * stopped, so only update the saved value * field. * * If the PMC is not running, or is not * attached to its owner, read/write to the * saved value field. */ ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); mtx_pool_lock_spin(pmc_mtxpool, pm); cpu = curthread->td_oncpu; if (prw.pm_flags & PMC_F_OLDVALUE) { if ((pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) && (pm->pm_state == PMC_STATE_RUNNING)) error = (*pcd->pcd_read_pmc)(cpu, adjri, pm, &oldvalue); else oldvalue = pm->pm_gv.pm_savedvalue; } if (prw.pm_flags & PMC_F_NEWVALUE) pm->pm_gv.pm_savedvalue = prw.pm_value; mtx_pool_unlock_spin(pmc_mtxpool, pm); } else { /* System mode PMCs */ cpu = PMC_TO_CPU(pm); ri = PMC_TO_ROWINDEX(pm); pcd = pmc_ri_to_classdep(md, ri, &adjri); if (!pmc_cpu_is_active(cpu)) { error = ENXIO; break; } /* move this thread to CPU 'cpu' */ pmc_save_cpu_binding(&pb); pmc_select_cpu(cpu); critical_enter(); /* save old value */ if (prw.pm_flags & PMC_F_OLDVALUE) { if ((error = (*pcd->pcd_read_pmc)(cpu, adjri, pm, &oldvalue))) goto error; } /* write out new value */ if (prw.pm_flags & PMC_F_NEWVALUE) error = (*pcd->pcd_write_pmc)(cpu, adjri, pm, prw.pm_value); error: critical_exit(); pmc_restore_cpu_binding(&pb); if (error) break; } pprw = (struct pmc_op_pmcrw *) arg; #ifdef HWPMC_DEBUG if (prw.pm_flags & PMC_F_NEWVALUE) PMCDBG3(PMC,OPS,2, "rw id=%d new %jx -> old %jx", ri, prw.pm_value, oldvalue); else if (prw.pm_flags &
PMC_F_OLDVALUE) PMCDBG2(PMC,OPS,2, "rw id=%d -> old %jx", ri, oldvalue); #endif /* return old value if requested */ if (prw.pm_flags & PMC_F_OLDVALUE) if ((error = copyout(&oldvalue, &pprw->pm_value, sizeof(prw.pm_value)))) break; } break; /* * Set the sampling rate for a sampling mode PMC and the * initial count for a counting mode PMC. */ case PMC_OP_PMCSETCOUNT: { struct pmc *pm; struct pmc_op_pmcsetcount sc; PMC_DOWNGRADE_SX(); if ((error = copyin(arg, &sc, sizeof(sc))) != 0) break; if ((error = pmc_find_pmc(sc.pm_pmcid, &pm)) != 0) break; if (pm->pm_state == PMC_STATE_RUNNING) { error = EBUSY; break; } if (PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm))) { /* * Don't permit requested sample rate to be * less than pmc_mincount. */ if (sc.pm_count < MAX(1, pmc_mincount)) log(LOG_WARNING, "pmcsetcount: passed sample " "rate %ju - setting to %u\n", (uintmax_t)sc.pm_count, MAX(1, pmc_mincount)); pm->pm_sc.pm_reloadcount = MAX(MAX(1, pmc_mincount), sc.pm_count); } else pm->pm_sc.pm_initial = sc.pm_count; } break; /* * Start a PMC. */ case PMC_OP_PMCSTART: { pmc_id_t pmcid; struct pmc *pm; struct pmc_op_simple sp; sx_assert(&pmc_sx, SX_XLOCKED); if ((error = copyin(arg, &sp, sizeof(sp))) != 0) break; pmcid = sp.pm_pmcid; if ((error = pmc_find_pmc(pmcid, &pm)) != 0) break; KASSERT(pmcid == pm->pm_id, ("[pmc,%d] pmcid %x != id %x", __LINE__, pm->pm_id, pmcid)); if (pm->pm_state == PMC_STATE_RUNNING) /* already running */ break; else if (pm->pm_state != PMC_STATE_STOPPED && pm->pm_state != PMC_STATE_ALLOCATED) { error = EINVAL; break; } error = pmc_start(pm); } break; /* * Stop a PMC. */ case PMC_OP_PMCSTOP: { pmc_id_t pmcid; struct pmc *pm; struct pmc_op_simple sp; PMC_DOWNGRADE_SX(); if ((error = copyin(arg, &sp, sizeof(sp))) != 0) break; pmcid = sp.pm_pmcid; /* * Mark the PMC as inactive and invoke the MD stop * routines if needed. 
*/ if ((error = pmc_find_pmc(pmcid, &pm)) != 0) break; KASSERT(pmcid == pm->pm_id, ("[pmc,%d] pmc id %x != pmcid %x", __LINE__, pm->pm_id, pmcid)); if (pm->pm_state == PMC_STATE_STOPPED) /* already stopped */ break; else if (pm->pm_state != PMC_STATE_RUNNING) { error = EINVAL; break; } error = pmc_stop(pm); } break; /* * Write a user-supplied value to the log file. */ case PMC_OP_WRITELOG: { struct pmc_op_writelog wl; struct pmc_owner *po; PMC_DOWNGRADE_SX(); if ((error = copyin(arg, &wl, sizeof(wl))) != 0) break; if ((po = pmc_find_owner_descriptor(td->td_proc)) == NULL) { error = EINVAL; break; } if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) { error = EINVAL; break; } error = pmclog_process_userlog(po, &wl); } break; default: error = EINVAL; break; } if (is_sx_downgraded) sx_sunlock(&pmc_sx); else sx_xunlock(&pmc_sx); done_syscall: if (error) counter_u64_add(pmc_stats.pm_syscall_errors, 1); return (error); } /* * Helper functions */ /* * Mark the thread as needing callchain capture and post an AST. The * actual callchain capture will be done in a context where it is safe * to take page faults. */ static void pmc_post_callchain_callback(void) { struct thread *td; td = curthread; /* * If there are multiple PMCs for the same interrupt, ignore the new post. */ if (td->td_pflags & TDP_CALLCHAIN) return; /* * Mark this thread as needing callchain capture. * `td->td_pflags' will be safe to touch because this thread * was in user space when it was interrupted. */ td->td_pflags |= TDP_CALLCHAIN; /* * Don't let this thread migrate between CPUs until callchain * capture completes. */ sched_pin(); return; } /* * Find a free slot in the per-cpu array of samples and capture the * current callchain there. If a sample was successfully added, a bit * is set in mask 'pmc_cpumask' denoting that the DO_SAMPLES hook * needs to be invoked from the clock handler. * * This function is meant to be called from an NMI handler. It cannot * use any of the locking primitives supplied by the OS.
*/ static int pmc_add_sample(ring_type_t ring, struct pmc *pm, struct trapframe *tf) { int error, cpu, callchaindepth, inuserspace; struct thread *td; struct pmc_sample *ps; struct pmc_samplebuffer *psb; error = 0; /* * Allocate space for a sample buffer. */ cpu = curcpu; psb = pmc_pcpu[cpu]->pc_sb[ring]; inuserspace = TRAPF_USERMODE(tf); ps = PMC_PROD_SAMPLE(psb); if (psb->ps_considx != psb->ps_prodidx && ps->ps_nsamples) { /* in use, reader hasn't caught up */ pm->pm_pcpu_state[cpu].pps_stalled = 1; counter_u64_add(pmc_stats.pm_intr_bufferfull, 1); PMCDBG6(SAM,INT,1,"(spc) cpu=%d pm=%p tf=%p um=%d wr=%d rd=%d", cpu, pm, (void *) tf, inuserspace, (int) (psb->ps_prodidx & pmc_sample_mask), (int) (psb->ps_considx & pmc_sample_mask)); callchaindepth = 1; error = ENOMEM; goto done; } /* Fill in entry. */ PMCDBG6(SAM,INT,1,"cpu=%d pm=%p tf=%p um=%d wr=%d rd=%d", cpu, pm, (void *) tf, inuserspace, (int) (psb->ps_prodidx & pmc_sample_mask), (int) (psb->ps_considx & pmc_sample_mask)); td = curthread; ps->ps_pmc = pm; ps->ps_td = td; ps->ps_pid = td->td_proc->p_pid; ps->ps_tid = td->td_tid; ps->ps_tsc = pmc_rdtsc(); ps->ps_ticks = ticks; ps->ps_cpu = cpu; ps->ps_flags = inuserspace ? PMC_CC_F_USERSPACE : 0; callchaindepth = (pm->pm_flags & PMC_F_CALLCHAIN) ? pmc_callchaindepth : 1; MPASS(ps->ps_pc != NULL); if (callchaindepth == 1) ps->ps_pc[0] = PMC_TRAPFRAME_TO_PC(tf); else { /* * Kernel stack traversals can be done immediately, * while we defer to an AST for user space traversals. 
*/ if (!inuserspace) { callchaindepth = pmc_save_kernel_callchain(ps->ps_pc, callchaindepth, tf); } else { pmc_post_callchain_callback(); callchaindepth = PMC_USER_CALLCHAIN_PENDING; } } ps->ps_nsamples = callchaindepth; /* mark entry as in use */ if (ring == PMC_UR) { ps->ps_nsamples_actual = callchaindepth; /* mark entry as in use */ ps->ps_nsamples = PMC_USER_CALLCHAIN_PENDING; } else ps->ps_nsamples = callchaindepth; /* mark entry as in use */ KASSERT(counter_u64_fetch(pm->pm_runcount) >= 0, ("[pmc,%d] pm=%p runcount %ld", __LINE__, (void *) pm, (unsigned long)counter_u64_fetch(pm->pm_runcount))); counter_u64_add(pm->pm_runcount, 1); /* hold onto PMC */ /* increment write pointer */ psb->ps_prodidx++; done: /* mark CPU as needing processing */ if (callchaindepth != PMC_USER_CALLCHAIN_PENDING) DPCPU_SET(pmc_sampled, 1); return (error); } /* * Interrupt processing. * * This function is meant to be called from an NMI handler. It cannot * use any of the locking primitives supplied by the OS. */ int pmc_process_interrupt(int ring, struct pmc *pm, struct trapframe *tf) { struct thread *td; td = curthread; if ((pm->pm_flags & PMC_F_USERCALLCHAIN) && (td->td_proc->p_flag & P_KPROC) == 0 && !TRAPF_USERMODE(tf)) { atomic_add_int(&td->td_pmcpend, 1); return (pmc_add_sample(PMC_UR, pm, tf)); } return (pmc_add_sample(ring, pm, tf)); } /* * Capture a user call chain. This function will be called from ast() * before control returns to userland and before the process gets * rescheduled. 
*/ static void pmc_capture_user_callchain(int cpu, int ring, struct trapframe *tf) { struct pmc *pm; struct thread *td; struct pmc_sample *ps; struct pmc_samplebuffer *psb; uint64_t considx, prodidx; int nsamples, nrecords, pass, iter; #ifdef INVARIANTS int start_ticks = ticks; #endif psb = pmc_pcpu[cpu]->pc_sb[ring]; td = curthread; KASSERT(td->td_pflags & TDP_CALLCHAIN, ("[pmc,%d] Retrieving callchain for thread that doesn't want it", __LINE__)); nrecords = INT_MAX; pass = 0; restart: if (ring == PMC_UR) nrecords = atomic_readandclear_32(&td->td_pmcpend); for (iter = 0, considx = psb->ps_considx, prodidx = psb->ps_prodidx; considx < prodidx && iter < pmc_nsamples; considx++, iter++) { ps = PMC_CONS_SAMPLE_OFF(psb, considx); /* * Iterate through all deferred callchain requests. * Walk from the current read pointer to the current * write pointer. */ #ifdef INVARIANTS if (ps->ps_nsamples == PMC_SAMPLE_FREE) { continue; } #endif if (ps->ps_td != td || ps->ps_nsamples != PMC_USER_CALLCHAIN_PENDING || ps->ps_pmc->pm_state != PMC_STATE_RUNNING) continue; KASSERT(ps->ps_cpu == cpu, ("[pmc,%d] cpu mismatch ps_cpu=%d pcpu=%d", __LINE__, ps->ps_cpu, PCPU_GET(cpuid))); pm = ps->ps_pmc; KASSERT(pm->pm_flags & PMC_F_CALLCHAIN, ("[pmc,%d] Retrieving callchain for PMC that doesn't " "want it", __LINE__)); KASSERT(counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] runcount %ld", __LINE__, (unsigned long)counter_u64_fetch(pm->pm_runcount))); if (ring == PMC_UR) { nsamples = ps->ps_nsamples_actual; counter_u64_add(pmc_stats.pm_merges, 1); } else nsamples = 0; /* * Retrieve the callchain and mark the sample buffer * as 'processable' by the timer tick sweep code. 
*/ if (__predict_true(nsamples < pmc_callchaindepth - 1)) nsamples += pmc_save_user_callchain(ps->ps_pc + nsamples, pmc_callchaindepth - nsamples - 1, tf); /* * We have to prevent hardclock from potentially overwriting * this sample between when we read the value and when we set * it */ spinlock_enter(); /* * Verify that the sample hasn't been dropped in the meantime */ if (ps->ps_nsamples == PMC_USER_CALLCHAIN_PENDING) { ps->ps_nsamples = nsamples; /* * If we couldn't get a sample, simply drop the reference */ if (nsamples == 0) counter_u64_add(pm->pm_runcount, -1); } spinlock_exit(); if (nrecords-- == 1) break; } if (__predict_false(ring == PMC_UR && td->td_pmcpend)) { if (pass == 0) { pass = 1; goto restart; } /* only collect samples for this part once */ td->td_pmcpend = 0; } #ifdef INVARIANTS if ((ticks - start_ticks) > hz) log(LOG_ERR, "%s took %d ticks\n", __func__, (ticks - start_ticks)); #endif /* mark CPU as needing processing */ DPCPU_SET(pmc_sampled, 1); } /* * Process saved PC samples. 
*/ static void pmc_process_samples(int cpu, ring_type_t ring) { struct pmc *pm; int adjri, n; struct thread *td; struct pmc_owner *po; struct pmc_sample *ps; struct pmc_classdep *pcd; struct pmc_samplebuffer *psb; uint64_t delta __diagused; KASSERT(PCPU_GET(cpuid) == cpu, ("[pmc,%d] not on the correct CPU pcpu=%d cpu=%d", __LINE__, PCPU_GET(cpuid), cpu)); psb = pmc_pcpu[cpu]->pc_sb[ring]; delta = psb->ps_prodidx - psb->ps_considx; MPASS(delta <= pmc_nsamples); MPASS(psb->ps_considx <= psb->ps_prodidx); for (n = 0; psb->ps_considx < psb->ps_prodidx; psb->ps_considx++, n++) { ps = PMC_CONS_SAMPLE(psb); if (__predict_false(ps->ps_nsamples == PMC_SAMPLE_FREE)) continue; pm = ps->ps_pmc; /* skip non-running samples */ if (pm->pm_state != PMC_STATE_RUNNING) goto entrydone; KASSERT(counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] pm=%p runcount %ld", __LINE__, (void *) pm, (unsigned long)counter_u64_fetch(pm->pm_runcount))); po = pm->pm_owner; KASSERT(PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)), ("[pmc,%d] pmc=%p non-sampling mode=%d", __LINE__, pm, PMC_TO_MODE(pm))); /* If there is a pending AST wait for completion */ if (ps->ps_nsamples == PMC_USER_CALLCHAIN_PENDING) { /* if we've been waiting more than 1 tick to * collect a callchain for this record then * drop it and move on. */ if (ticks - ps->ps_ticks > 1) { /* * track how often we hit this as it will * preferentially lose user samples * for long running system calls */ counter_u64_add(pmc_stats.pm_overwrites, 1); goto entrydone; } /* Need a rescan at a later time. */ DPCPU_SET(pmc_sampled, 1); break; } PMCDBG6(SAM,OPS,1,"cpu=%d pm=%p n=%d fl=%x wr=%d rd=%d", cpu, pm, ps->ps_nsamples, ps->ps_flags, (int) (psb->ps_prodidx & pmc_sample_mask), (int) (psb->ps_considx & pmc_sample_mask)); /* * If this is a process-mode PMC that is attached to * its owner, and if the PC is in user mode, update * profiling statistics like timer-based profiling * would have done. 
* * Otherwise, this is either a sampling-mode PMC that * is attached to a different process than its owner, * or a system-wide sampling PMC. Dispatch a log * entry to the PMC's owner process. */ if (pm->pm_flags & PMC_F_ATTACHED_TO_OWNER) { if (ps->ps_flags & PMC_CC_F_USERSPACE) { td = FIRST_THREAD_IN_PROC(po->po_owner); addupc_intr(td, ps->ps_pc[0], 1); } } else pmclog_process_callchain(pm, ps); entrydone: ps->ps_nsamples = 0; /* mark entry as free */ KASSERT(counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] pm=%p runcount %ld", __LINE__, (void *) pm, (unsigned long)counter_u64_fetch(pm->pm_runcount))); counter_u64_add(pm->pm_runcount, -1); } counter_u64_add(pmc_stats.pm_log_sweeps, 1); /* Do not re-enable stalled PMCs if we failed to process any samples */ if (n == 0) return; /* * Restart any stalled sampling PMCs on this CPU. * * If the NMI handler sets the pm_stalled field of a PMC after * the check below, we'll end up processing the stalled PMC at * the next hardclock tick. */ for (n = 0; n < md->pmd_npmc; n++) { pcd = pmc_ri_to_classdep(md, n, &adjri); KASSERT(pcd != NULL, ("[pmc,%d] null pcd ri=%d", __LINE__, n)); (void) (*pcd->pcd_get_config)(cpu,adjri,&pm); if (pm == NULL || /* !cfg'ed */ pm->pm_state != PMC_STATE_RUNNING || /* !active */ !PMC_IS_SAMPLING_MODE(PMC_TO_MODE(pm)) || /* !sampling */ !pm->pm_pcpu_state[cpu].pps_cpustate || /* !desired */ !pm->pm_pcpu_state[cpu].pps_stalled) /* !stalled */ continue; pm->pm_pcpu_state[cpu].pps_stalled = 0; (*pcd->pcd_start_pmc)(cpu, adjri, pm); } } /* * Event handlers. */ /* * Handle a process exit. * * Remove this process from all hash tables. If this process * owned any PMCs, turn off those PMCs and deallocate them, * removing any associations with target processes. * * This function will be called by the last 'thread' of a * process. * * XXX This eventhandler gets called early in the exit process. * Consider using a 'hook' invocation from thread_exit() or equivalent * spot. 
Another negative is that kse_exit doesn't seem to call * exit1() [??]. * */ static void pmc_process_exit(void *arg __unused, struct proc *p) { struct pmc *pm; int adjri, cpu; unsigned int ri; int is_using_hwpmcs; struct pmc_owner *po; struct pmc_process *pp; struct pmc_classdep *pcd; pmc_value_t newvalue, tmp; PROC_LOCK(p); is_using_hwpmcs = p->p_flag & P_HWPMC; PROC_UNLOCK(p); /* * Log a sysexit event to all SS PMC owners. */ PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_sysexit(po, p->p_pid); PMC_EPOCH_EXIT(); if (!is_using_hwpmcs) return; PMC_GET_SX_XLOCK(); PMCDBG3(PRC,EXT,1,"process-exit proc=%p (%d, %s)", p, p->p_pid, p->p_comm); /* * Since this code is invoked by the last thread in an exiting * process, we would have context switched IN at some prior * point. However, with PREEMPTION, kernel mode context * switches may happen any time, so we want to disable a * context switch OUT till we get any PMCs targeting this * process off the hardware. * * We also need to atomically remove this process' * entry from our target process hash table, using * PMC_FLAG_REMOVE. */ PMCDBG3(PRC,EXT,1, "process-exit proc=%p (%d, %s)", p, p->p_pid, p->p_comm); critical_enter(); /* no preemption */ cpu = curthread->td_oncpu; if ((pp = pmc_find_process_descriptor(p, PMC_FLAG_REMOVE)) != NULL) { PMCDBG2(PRC,EXT,2, "process-exit proc=%p pmc-process=%p", p, pp); /* * The exiting process could be the target of * some PMCs which will be running on * the currently executing CPU. * * We need to turn these PMCs off like we * would do at context switch OUT time. */ for (ri = 0; ri < md->pmd_npmc; ri++) { /* * Pick up the pmc pointer from hardware * state similar to the CSW_OUT code.
*/ pm = NULL; pcd = pmc_ri_to_classdep(md, ri, &adjri); (void) (*pcd->pcd_get_config)(cpu, adjri, &pm); PMCDBG2(PRC,EXT,2, "ri=%d pm=%p", ri, pm); if (pm == NULL || !PMC_IS_VIRTUAL_MODE(PMC_TO_MODE(pm))) continue; PMCDBG4(PRC,EXT,2, "ppmcs[%d]=%p pm=%p " "state=%d", ri, pp->pp_pmcs[ri].pp_pmc, pm, pm->pm_state); KASSERT(PMC_TO_ROWINDEX(pm) == ri, ("[pmc,%d] ri mismatch pmc(%d) ri(%d)", __LINE__, PMC_TO_ROWINDEX(pm), ri)); KASSERT(pm == pp->pp_pmcs[ri].pp_pmc, ("[pmc,%d] pm %p != pp_pmcs[%d] %p", __LINE__, pm, ri, pp->pp_pmcs[ri].pp_pmc)); KASSERT(counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] bad runcount ri %d rc %ld", __LINE__, ri, (unsigned long)counter_u64_fetch(pm->pm_runcount))); /* * Change desired state, and then stop if not * stalled. This two-step dance should avoid * race conditions where an interrupt re-enables * the PMC after this code has already checked * the pm_stalled flag. */ if (pm->pm_pcpu_state[cpu].pps_cpustate) { pm->pm_pcpu_state[cpu].pps_cpustate = 0; if (!pm->pm_pcpu_state[cpu].pps_stalled) { (void) pcd->pcd_stop_pmc(cpu, adjri, pm); if (PMC_TO_MODE(pm) == PMC_MODE_TC) { pcd->pcd_read_pmc(cpu, adjri, pm, &newvalue); tmp = newvalue - PMC_PCPU_SAVED(cpu,ri); mtx_pool_lock_spin(pmc_mtxpool, pm); pm->pm_gv.pm_savedvalue += tmp; pp->pp_pmcs[ri].pp_pmcval += tmp; mtx_pool_unlock_spin( pmc_mtxpool, pm); } } } KASSERT((int64_t) counter_u64_fetch(pm->pm_runcount) > 0, ("[pmc,%d] runcount is %d", __LINE__, ri)); counter_u64_add(pm->pm_runcount, -1); (void) pcd->pcd_config_pmc(cpu, adjri, NULL); } /* * Inform the MD layer of this pseudo "context switch * out" */ (void) md->pmd_switch_out(pmc_pcpu[cpu], pp); critical_exit(); /* ok to be pre-empted now */ /* * Unlink this process from the PMCs that are * targeting it. This will send a signal to * all PMC owners whose PMCs are orphaned. * * Log PMC value at exit time if requested.
*/ for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = pp->pp_pmcs[ri].pp_pmc) != NULL) { if (pm->pm_flags & PMC_F_NEEDS_LOGFILE && PMC_IS_COUNTING_MODE(PMC_TO_MODE(pm))) pmclog_process_procexit(pm, pp); pmc_unlink_target_process(pm, pp); } free(pp, M_PMC); } else critical_exit(); /* pp == NULL */ /* * If the process owned PMCs, free them up and free up * memory. */ if ((po = pmc_find_owner_descriptor(p)) != NULL) { pmc_remove_owner(po); pmc_destroy_owner_descriptor(po); } sx_xunlock(&pmc_sx); } /* * Handle a process fork. * * If the parent process 'p1' is under HWPMC monitoring, then copy * over any attached PMCs that have 'do_descendants' semantics. */ static void pmc_process_fork(void *arg __unused, struct proc *p1, struct proc *newproc, int flags) { int is_using_hwpmcs; unsigned int ri; uint32_t do_descendants; struct pmc *pm; struct pmc_owner *po; struct pmc_process *ppnew, *ppold; (void) flags; /* unused parameter */ PROC_LOCK(p1); is_using_hwpmcs = p1->p_flag & P_HWPMC; PROC_UNLOCK(p1); /* * If there are system-wide sampling PMCs active, we need to * log all fork events to their owner's logs. */ PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) { pmclog_process_procfork(po, p1->p_pid, newproc->p_pid); pmclog_process_proccreate(po, newproc, 1); } PMC_EPOCH_EXIT(); if (!is_using_hwpmcs) return; PMC_GET_SX_XLOCK(); PMCDBG4(PMC,FRK,1, "process-fork proc=%p (%d, %s) -> %p", p1, p1->p_pid, p1->p_comm, newproc); /* * If the parent process (curthread->td_proc) is a * target of any PMCs, look for PMCs that are to be * inherited, and link these into the new process * descriptor. 
*/ if ((ppold = pmc_find_process_descriptor(curthread->td_proc, PMC_FLAG_NONE)) == NULL) goto done; /* nothing to do */ do_descendants = 0; for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = ppold->pp_pmcs[ri].pp_pmc) != NULL) do_descendants |= pm->pm_flags & PMC_F_DESCENDANTS; if (do_descendants == 0) /* nothing to do */ goto done; /* * Now mark the new process as being tracked by this driver. */ PROC_LOCK(newproc); newproc->p_flag |= P_HWPMC; PROC_UNLOCK(newproc); /* allocate a descriptor for the new process */ if ((ppnew = pmc_find_process_descriptor(newproc, PMC_FLAG_ALLOCATE)) == NULL) goto done; /* * Run through all PMCs that were targeting the old process * and which specified F_DESCENDANTS and attach them to the * new process. * * Log the fork event to all owners of PMCs attached to this * process, if not already logged. */ for (ri = 0; ri < md->pmd_npmc; ri++) if ((pm = ppold->pp_pmcs[ri].pp_pmc) != NULL && (pm->pm_flags & PMC_F_DESCENDANTS)) { pmc_link_target_process(pm, ppnew); po = pm->pm_owner; if (po->po_sscount == 0 && po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_procfork(po, p1->p_pid, newproc->p_pid); } done: sx_xunlock(&pmc_sx); } static void pmc_process_threadcreate(struct thread *td) { struct pmc_owner *po; PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_threadcreate(po, td, 1); PMC_EPOCH_EXIT(); } static void pmc_process_threadexit(struct thread *td) { struct pmc_owner *po; PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_threadexit(po, td); PMC_EPOCH_EXIT(); } static void pmc_process_proccreate(struct proc *p) { struct pmc_owner *po; PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_proccreate(po, p, 1 /* sync */); PMC_EPOCH_EXIT(); } static void pmc_process_allproc(struct pmc *pm) { struct pmc_owner *po; struct thread *td; 
struct proc *p; po = pm->pm_owner; if ((po->po_flags & PMC_PO_OWNS_LOGFILE) == 0) return; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { pmclog_process_proccreate(po, p, 0 /* sync */); PROC_LOCK(p); FOREACH_THREAD_IN_PROC(p, td) pmclog_process_threadcreate(po, td, 0 /* sync */); PROC_UNLOCK(p); } sx_sunlock(&allproc_lock); pmclog_flush(po, 0); } static void pmc_kld_load(void *arg __unused, linker_file_t lf) { struct pmc_owner *po; /* * Notify owners of system sampling PMCs about KLD operations. */ PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_map_in(po, (pid_t) -1, (uintfptr_t) lf->address, lf->pathname); PMC_EPOCH_EXIT(); /* * TODO: Notify owners of (all) process-sampling PMCs too. */ } static void pmc_kld_unload(void *arg __unused, const char *filename __unused, caddr_t address, size_t size) { struct pmc_owner *po; PMC_EPOCH_ENTER(); CK_LIST_FOREACH(po, &pmc_ss_owners, po_ssnext) if (po->po_flags & PMC_PO_OWNS_LOGFILE) pmclog_process_map_out(po, (pid_t) -1, (uintfptr_t) address, (uintfptr_t) address + size); PMC_EPOCH_EXIT(); /* * TODO: Notify owners of process-sampling PMCs. */ } /* * initialization */ static const char * pmc_name_of_pmcclass(enum pmc_class class) { switch (class) { #undef __PMC_CLASS #define __PMC_CLASS(S,V,D) \ case PMC_CLASS_##S: \ return #S; __PMC_CLASSES(); default: return (""); } } /* * Base class initializer: allocate structure and set default classes. */ struct pmc_mdep * pmc_mdep_alloc(int nclasses) { struct pmc_mdep *md; int n; /* SOFT + md classes */ n = 1 + nclasses; md = malloc(sizeof(struct pmc_mdep) + n * sizeof(struct pmc_classdep), M_PMC, M_WAITOK|M_ZERO); md->pmd_nclass = n; /* Default methods */ md->pmd_switch_in = generic_switch_in; md->pmd_switch_out = generic_switch_out; /* Add base class. 
*/ pmc_soft_initialize(md); return md; } void pmc_mdep_free(struct pmc_mdep *md) { pmc_soft_finalize(md); free(md, M_PMC); } static int generic_switch_in(struct pmc_cpu *pc, struct pmc_process *pp) { (void) pc; (void) pp; return (0); } static int generic_switch_out(struct pmc_cpu *pc, struct pmc_process *pp) { (void) pc; (void) pp; return (0); } static struct pmc_mdep * pmc_generic_cpu_initialize(void) { struct pmc_mdep *md; md = pmc_mdep_alloc(0); md->pmd_cputype = PMC_CPU_GENERIC; return (md); } static void pmc_generic_cpu_finalize(struct pmc_mdep *md) { (void) md; } static int pmc_initialize(void) { int c, cpu, error, n, ri; unsigned int maxcpu, domain; struct pcpu *pc; struct pmc_binding pb; struct pmc_sample *ps; struct pmc_classdep *pcd; struct pmc_samplebuffer *sb; md = NULL; error = 0; pmc_stats.pm_intr_ignored = counter_u64_alloc(M_WAITOK); pmc_stats.pm_intr_processed = counter_u64_alloc(M_WAITOK); pmc_stats.pm_intr_bufferfull = counter_u64_alloc(M_WAITOK); pmc_stats.pm_syscalls = counter_u64_alloc(M_WAITOK); pmc_stats.pm_syscall_errors = counter_u64_alloc(M_WAITOK); pmc_stats.pm_buffer_requests = counter_u64_alloc(M_WAITOK); pmc_stats.pm_buffer_requests_failed = counter_u64_alloc(M_WAITOK); pmc_stats.pm_log_sweeps = counter_u64_alloc(M_WAITOK); pmc_stats.pm_merges = counter_u64_alloc(M_WAITOK); pmc_stats.pm_overwrites = counter_u64_alloc(M_WAITOK); #ifdef HWPMC_DEBUG /* parse debug flags first */ if (TUNABLE_STR_FETCH(PMC_SYSCTL_NAME_PREFIX "debugflags", pmc_debugstr, sizeof(pmc_debugstr))) pmc_debugflags_parse(pmc_debugstr, pmc_debugstr+strlen(pmc_debugstr)); #endif PMCDBG1(MOD,INI,0, "PMC Initialize (version %x)", PMC_VERSION); /* check kernel version */ if (pmc_kernel_version != PMC_VERSION) { if (pmc_kernel_version == 0) printf("hwpmc: this kernel has not been compiled with " "'options HWPMC_HOOKS'.\n"); else printf("hwpmc: kernel version (0x%x) does not match " "module version (0x%x).\n", pmc_kernel_version, PMC_VERSION); return EPROGMISMATCH; } /* * 
check sysctl parameters */ if (pmc_hashsize <= 0) { (void) printf("hwpmc: tunable \"hashsize\"=%d must be " "greater than zero.\n", pmc_hashsize); pmc_hashsize = PMC_HASH_SIZE; } if (pmc_nsamples <= 0 || pmc_nsamples > 65535) { (void) printf("hwpmc: tunable \"nsamples\"=%d out of " "range.\n", pmc_nsamples); pmc_nsamples = PMC_NSAMPLES; } pmc_sample_mask = pmc_nsamples-1; if (pmc_callchaindepth <= 0 || pmc_callchaindepth > PMC_CALLCHAIN_DEPTH_MAX) { (void) printf("hwpmc: tunable \"callchaindepth\"=%d out of " "range - using %d.\n", pmc_callchaindepth, PMC_CALLCHAIN_DEPTH_MAX); pmc_callchaindepth = PMC_CALLCHAIN_DEPTH_MAX; } md = pmc_md_initialize(); if (md == NULL) { /* Default to generic CPU. */ md = pmc_generic_cpu_initialize(); if (md == NULL) return (ENOSYS); } /* * Refresh classes base ri. Optional classes may come in different * order. */ for (ri = c = 0; c < md->pmd_nclass; c++) { pcd = &md->pmd_classdep[c]; pcd->pcd_ri = ri; ri += pcd->pcd_num; } KASSERT(md->pmd_nclass >= 1 && md->pmd_npmc >= 1, ("[pmc,%d] no classes or pmcs", __LINE__)); /* Compute the map from row-indices to classdep pointers. */ pmc_rowindex_to_classdep = malloc(sizeof(struct pmc_classdep *) * md->pmd_npmc, M_PMC, M_WAITOK|M_ZERO); for (n = 0; n < md->pmd_npmc; n++) pmc_rowindex_to_classdep[n] = NULL; for (ri = c = 0; c < md->pmd_nclass; c++) { pcd = &md->pmd_classdep[c]; for (n = 0; n < pcd->pcd_num; n++, ri++) pmc_rowindex_to_classdep[ri] = pcd; } KASSERT(ri == md->pmd_npmc, ("[pmc,%d] npmc miscomputed: ri=%d, md->npmc=%d", __LINE__, ri, md->pmd_npmc)); maxcpu = pmc_cpu_max(); /* allocate space for the per-cpu array */ pmc_pcpu = malloc(maxcpu * sizeof(struct pmc_cpu *), M_PMC, M_WAITOK|M_ZERO); /* per-cpu 'saved values' for managing process-mode PMCs */ pmc_pcpu_saved = malloc(sizeof(pmc_value_t) * maxcpu * md->pmd_npmc, M_PMC, M_WAITOK); /* Perform CPU-dependent initialization. 
*/ pmc_save_cpu_binding(&pb); error = 0; for (cpu = 0; error == 0 && cpu < maxcpu; cpu++) { if (!pmc_cpu_is_active(cpu)) continue; pmc_select_cpu(cpu); pmc_pcpu[cpu] = malloc(sizeof(struct pmc_cpu) + md->pmd_npmc * sizeof(struct pmc_hw *), M_PMC, M_WAITOK|M_ZERO); for (n = 0; error == 0 && n < md->pmd_nclass; n++) if (md->pmd_classdep[n].pcd_num > 0) error = md->pmd_classdep[n].pcd_pcpu_init(md, cpu); } pmc_restore_cpu_binding(&pb); if (error) return (error); /* allocate space for the sample array */ for (cpu = 0; cpu < maxcpu; cpu++) { if (!pmc_cpu_is_active(cpu)) continue; pc = pcpu_find(cpu); domain = pc->pc_domain; sb = malloc_domainset(sizeof(struct pmc_samplebuffer) + pmc_nsamples * sizeof(struct pmc_sample), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); KASSERT(pmc_pcpu[cpu] != NULL, ("[pmc,%d] cpu=%d Null per-cpu data", __LINE__, cpu)); sb->ps_callchains = malloc_domainset(pmc_callchaindepth * pmc_nsamples * sizeof(uintptr_t), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); for (n = 0, ps = sb->ps_samples; n < pmc_nsamples; n++, ps++) ps->ps_pc = sb->ps_callchains + (n * pmc_callchaindepth); pmc_pcpu[cpu]->pc_sb[PMC_HR] = sb; sb = malloc_domainset(sizeof(struct pmc_samplebuffer) + pmc_nsamples * sizeof(struct pmc_sample), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); sb->ps_callchains = malloc_domainset(pmc_callchaindepth * pmc_nsamples * sizeof(uintptr_t), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); for (n = 0, ps = sb->ps_samples; n < pmc_nsamples; n++, ps++) ps->ps_pc = sb->ps_callchains + (n * pmc_callchaindepth); pmc_pcpu[cpu]->pc_sb[PMC_SR] = sb; sb = malloc_domainset(sizeof(struct pmc_samplebuffer) + pmc_nsamples * sizeof(struct pmc_sample), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); sb->ps_callchains = malloc_domainset(pmc_callchaindepth * pmc_nsamples * sizeof(uintptr_t), M_PMC, DOMAINSET_PREF(domain), M_WAITOK | M_ZERO); for (n = 0, ps = sb->ps_samples; n < pmc_nsamples; n++, ps++) ps->ps_pc = sb->ps_callchains + n * 
pmc_callchaindepth; pmc_pcpu[cpu]->pc_sb[PMC_UR] = sb; } /* allocate space for the row disposition array */ pmc_pmcdisp = malloc(sizeof(enum pmc_mode) * md->pmd_npmc, M_PMC, M_WAITOK|M_ZERO); /* mark all PMCs as available */ for (n = 0; n < (int) md->pmd_npmc; n++) PMC_MARK_ROW_FREE(n); /* allocate thread hash tables */ pmc_ownerhash = hashinit(pmc_hashsize, M_PMC, &pmc_ownerhashmask); pmc_processhash = hashinit(pmc_hashsize, M_PMC, &pmc_processhashmask); mtx_init(&pmc_processhash_mtx, "pmc-process-hash", "pmc-leaf", MTX_SPIN); CK_LIST_INIT(&pmc_ss_owners); pmc_ss_count = 0; /* allocate a pool of spin mutexes */ pmc_mtxpool = mtx_pool_create("pmc-leaf", pmc_mtxpool_size, MTX_SPIN); PMCDBG4(MOD,INI,1, "pmc_ownerhash=%p, mask=0x%lx " "targethash=%p mask=0x%lx", pmc_ownerhash, pmc_ownerhashmask, pmc_processhash, pmc_processhashmask); /* Initialize a spin mutex for the thread free list. */ mtx_init(&pmc_threadfreelist_mtx, "pmc-threadfreelist", "pmc-leaf", MTX_SPIN); /* Initialize the task to prune the thread free list. 
*/ TASK_INIT(&free_task, 0, pmc_thread_descriptor_pool_free_task, NULL); /* register process {exit,fork,exec} handlers */ pmc_exit_tag = EVENTHANDLER_REGISTER(process_exit, pmc_process_exit, NULL, EVENTHANDLER_PRI_ANY); pmc_fork_tag = EVENTHANDLER_REGISTER(process_fork, pmc_process_fork, NULL, EVENTHANDLER_PRI_ANY); /* register kld event handlers */ pmc_kld_load_tag = EVENTHANDLER_REGISTER(kld_load, pmc_kld_load, NULL, EVENTHANDLER_PRI_ANY); pmc_kld_unload_tag = EVENTHANDLER_REGISTER(kld_unload, pmc_kld_unload, NULL, EVENTHANDLER_PRI_ANY); /* initialize logging */ pmclog_initialize(); /* set hook functions */ pmc_intr = md->pmd_intr; wmb(); pmc_hook = pmc_hook_handler; if (error == 0) { printf(PMC_MODULE_NAME ":"); for (n = 0; n < (int) md->pmd_nclass; n++) { if (md->pmd_classdep[n].pcd_num == 0) continue; pcd = &md->pmd_classdep[n]; printf(" %s/%d/%d/0x%b", pmc_name_of_pmcclass(pcd->pcd_class), pcd->pcd_num, pcd->pcd_width, pcd->pcd_caps, "\20" "\1INT\2USR\3SYS\4EDG\5THR" "\6REA\7WRI\10INV\11QUA\12PRC" "\13TAG\14CSC"); } printf("\n"); } return (error); } /* prepare to be unloaded */ static void pmc_cleanup(void) { int c, cpu; unsigned int maxcpu; struct pmc_ownerhash *ph; struct pmc_owner *po, *tmp; struct pmc_binding pb; #ifdef HWPMC_DEBUG struct pmc_processhash *prh; #endif PMCDBG0(MOD,INI,0, "cleanup"); /* switch off sampling */ CPU_FOREACH(cpu) DPCPU_ID_SET(cpu, pmc_sampled, 0); pmc_intr = NULL; sx_xlock(&pmc_sx); if (pmc_hook == NULL) { /* being unloaded already */ sx_xunlock(&pmc_sx); return; } pmc_hook = NULL; /* prevent new threads from entering module */ /* deregister event handlers */ EVENTHANDLER_DEREGISTER(process_fork, pmc_fork_tag); EVENTHANDLER_DEREGISTER(process_exit, pmc_exit_tag); EVENTHANDLER_DEREGISTER(kld_load, pmc_kld_load_tag); EVENTHANDLER_DEREGISTER(kld_unload, pmc_kld_unload_tag); /* send SIGBUS to all owner threads, free up allocations */ if (pmc_ownerhash) for (ph = pmc_ownerhash; ph <= &pmc_ownerhash[pmc_ownerhashmask]; ph++) { 
LIST_FOREACH_SAFE(po, ph, po_next, tmp) { pmc_remove_owner(po); /* send SIGBUS to owner processes */ PMCDBG3(MOD,INI,2, "cleanup signal proc=%p " "(%d, %s)", po->po_owner, po->po_owner->p_pid, po->po_owner->p_comm); PROC_LOCK(po->po_owner); kern_psignal(po->po_owner, SIGBUS); PROC_UNLOCK(po->po_owner); pmc_destroy_owner_descriptor(po); } } /* reclaim allocated data structures */ taskqueue_drain(taskqueue_fast, &free_task); mtx_destroy(&pmc_threadfreelist_mtx); pmc_thread_descriptor_pool_drain(); if (pmc_mtxpool) mtx_pool_destroy(&pmc_mtxpool); mtx_destroy(&pmc_processhash_mtx); if (pmc_processhash) { #ifdef HWPMC_DEBUG struct pmc_process *pp; PMCDBG0(MOD,INI,3, "destroy process hash"); for (prh = pmc_processhash; prh <= &pmc_processhash[pmc_processhashmask]; prh++) LIST_FOREACH(pp, prh, pp_next) PMCDBG1(MOD,INI,3, "pid=%d", pp->pp_proc->p_pid); #endif hashdestroy(pmc_processhash, M_PMC, pmc_processhashmask); pmc_processhash = NULL; } if (pmc_ownerhash) { PMCDBG0(MOD,INI,3, "destroy owner hash"); hashdestroy(pmc_ownerhash, M_PMC, pmc_ownerhashmask); pmc_ownerhash = NULL; } KASSERT(CK_LIST_EMPTY(&pmc_ss_owners), ("[pmc,%d] Global SS owner list not empty", __LINE__)); KASSERT(pmc_ss_count == 0, ("[pmc,%d] Global SS count not empty", __LINE__)); /* do processor and pmc-class dependent cleanup */ maxcpu = pmc_cpu_max(); PMCDBG0(MOD,INI,3, "md cleanup"); if (md) { pmc_save_cpu_binding(&pb); for (cpu = 0; cpu < maxcpu; cpu++) { PMCDBG2(MOD,INI,1,"pmc-cleanup cpu=%d pcs=%p", cpu, pmc_pcpu[cpu]); if (!pmc_cpu_is_active(cpu) || pmc_pcpu[cpu] == NULL) continue; pmc_select_cpu(cpu); for (c = 0; c < md->pmd_nclass; c++) if (md->pmd_classdep[c].pcd_num > 0) md->pmd_classdep[c].pcd_pcpu_fini(md, cpu); } if (md->pmd_cputype == PMC_CPU_GENERIC) pmc_generic_cpu_finalize(md); else pmc_md_finalize(md); pmc_mdep_free(md); md = NULL; pmc_restore_cpu_binding(&pb); } /* Free per-cpu descriptors. 
*/ for (cpu = 0; cpu < maxcpu; cpu++) { if (!pmc_cpu_is_active(cpu)) continue; KASSERT(pmc_pcpu[cpu]->pc_sb[PMC_HR] != NULL, ("[pmc,%d] Null hw cpu sample buffer cpu=%d", __LINE__, cpu)); KASSERT(pmc_pcpu[cpu]->pc_sb[PMC_SR] != NULL, ("[pmc,%d] Null sw cpu sample buffer cpu=%d", __LINE__, cpu)); KASSERT(pmc_pcpu[cpu]->pc_sb[PMC_UR] != NULL, ("[pmc,%d] Null userret cpu sample buffer cpu=%d", __LINE__, cpu)); free(pmc_pcpu[cpu]->pc_sb[PMC_HR]->ps_callchains, M_PMC); free(pmc_pcpu[cpu]->pc_sb[PMC_HR], M_PMC); free(pmc_pcpu[cpu]->pc_sb[PMC_SR]->ps_callchains, M_PMC); free(pmc_pcpu[cpu]->pc_sb[PMC_SR], M_PMC); free(pmc_pcpu[cpu]->pc_sb[PMC_UR]->ps_callchains, M_PMC); free(pmc_pcpu[cpu]->pc_sb[PMC_UR], M_PMC); free(pmc_pcpu[cpu], M_PMC); } free(pmc_pcpu, M_PMC); pmc_pcpu = NULL; free(pmc_pcpu_saved, M_PMC); pmc_pcpu_saved = NULL; if (pmc_pmcdisp) { free(pmc_pmcdisp, M_PMC); pmc_pmcdisp = NULL; } if (pmc_rowindex_to_classdep) { free(pmc_rowindex_to_classdep, M_PMC); pmc_rowindex_to_classdep = NULL; } pmclog_shutdown(); counter_u64_free(pmc_stats.pm_intr_ignored); counter_u64_free(pmc_stats.pm_intr_processed); counter_u64_free(pmc_stats.pm_intr_bufferfull); counter_u64_free(pmc_stats.pm_syscalls); counter_u64_free(pmc_stats.pm_syscall_errors); counter_u64_free(pmc_stats.pm_buffer_requests); counter_u64_free(pmc_stats.pm_buffer_requests_failed); counter_u64_free(pmc_stats.pm_log_sweeps); counter_u64_free(pmc_stats.pm_merges); counter_u64_free(pmc_stats.pm_overwrites); sx_xunlock(&pmc_sx); /* we are done */ } /* * The function called at load/unload. 
*/ static int load (struct module *module __unused, int cmd, void *arg __unused) { int error; error = 0; switch (cmd) { case MOD_LOAD : /* initialize the subsystem */ error = pmc_initialize(); if (error != 0) break; PMCDBG2(MOD,INI,1, "syscall=%d maxcpu=%d", pmc_syscall_num, pmc_cpu_max()); break; case MOD_UNLOAD : case MOD_SHUTDOWN: pmc_cleanup(); PMCDBG0(MOD,INI,1, "unloaded"); break; default : error = EINVAL; /* XXX should panic(9) */ break; } return error; } diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c index 14aac3f374d2..a779aa11b4c3 100644 --- a/sys/kern/kern_exec.c +++ b/sys/kern/kern_exec.c @@ -1,2091 +1,2092 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 1993, David Greenman * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_capsicum.h" #include "opt_hwpmc_hooks.h" #include "opt_ktrace.h" #include "opt_vm.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef KTRACE #include #endif #include #include #include #include #include #include #include #include #include #ifdef HWPMC_HOOKS #include #endif #include #include #ifdef KDTRACE_HOOKS #include dtrace_execexit_func_t dtrace_fasttrap_exec; #endif SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE1(proc, , , exec, "char *"); SDT_PROBE_DEFINE1(proc, , , exec__failure, "int"); SDT_PROBE_DEFINE1(proc, , , exec__success, "char *"); MALLOC_DEFINE(M_PARGS, "proc-args", "Process arguments"); int coredump_pack_fileinfo = 1; SYSCTL_INT(_kern, OID_AUTO, coredump_pack_fileinfo, CTLFLAG_RWTUN, &coredump_pack_fileinfo, 0, "Enable file path packing in 'procstat -f' coredump notes"); int coredump_pack_vmmapinfo = 1; SYSCTL_INT(_kern, OID_AUTO, coredump_pack_vmmapinfo, CTLFLAG_RWTUN, &coredump_pack_vmmapinfo, 0, "Enable file path packing in 'procstat -v' coredump notes"); static int sysctl_kern_ps_strings(SYSCTL_HANDLER_ARGS); static int sysctl_kern_usrstack(SYSCTL_HANDLER_ARGS); static int 
sysctl_kern_stackprot(SYSCTL_HANDLER_ARGS); static int do_execve(struct thread *td, struct image_args *args, struct mac *mac_p, struct vmspace *oldvmspace); /* XXX This should be vm_size_t. */ SYSCTL_PROC(_kern, KERN_PS_STRINGS, ps_strings, CTLTYPE_ULONG|CTLFLAG_RD| CTLFLAG_CAPRD|CTLFLAG_MPSAFE, NULL, 0, sysctl_kern_ps_strings, "LU", "Location of process' ps_strings structure"); /* XXX This should be vm_size_t. */ SYSCTL_PROC(_kern, KERN_USRSTACK, usrstack, CTLTYPE_ULONG|CTLFLAG_RD| CTLFLAG_CAPRD|CTLFLAG_MPSAFE, NULL, 0, sysctl_kern_usrstack, "LU", "Top of process stack"); SYSCTL_PROC(_kern, OID_AUTO, stackprot, CTLTYPE_INT|CTLFLAG_RD|CTLFLAG_MPSAFE, NULL, 0, sysctl_kern_stackprot, "I", "Stack memory permissions"); u_long ps_arg_cache_limit = PAGE_SIZE / 16; SYSCTL_ULONG(_kern, OID_AUTO, ps_arg_cache_limit, CTLFLAG_RW, &ps_arg_cache_limit, 0, "Process' command line characters cache limit"); static int disallow_high_osrel; SYSCTL_INT(_kern, OID_AUTO, disallow_high_osrel, CTLFLAG_RW, &disallow_high_osrel, 0, "Disallow execution of binaries built for higher version of the world"); static int map_at_zero = 0; SYSCTL_INT(_security_bsd, OID_AUTO, map_at_zero, CTLFLAG_RWTUN, &map_at_zero, 0, "Permit processes to map an object at virtual address 0."); static int core_dump_can_intr = 1; SYSCTL_INT(_kern, OID_AUTO, core_dump_can_intr, CTLFLAG_RWTUN, &core_dump_can_intr, 0, "Core dumping interruptible with SIGKILL"); static int sysctl_kern_ps_strings(SYSCTL_HANDLER_ARGS) { struct proc *p; vm_offset_t ps_strings; p = curproc; #ifdef SCTL_MASK32 if (req->flags & SCTL_MASK32) { unsigned int val; val = (unsigned int)PROC_PS_STRINGS(p); return (SYSCTL_OUT(req, &val, sizeof(val))); } #endif ps_strings = PROC_PS_STRINGS(p); return (SYSCTL_OUT(req, &ps_strings, sizeof(ps_strings))); } static int sysctl_kern_usrstack(SYSCTL_HANDLER_ARGS) { struct proc *p; vm_offset_t val; p = curproc; #ifdef SCTL_MASK32 if (req->flags & SCTL_MASK32) { unsigned int val32; val32 = round_page((unsigned 
int)p->p_vmspace->vm_stacktop); return (SYSCTL_OUT(req, &val32, sizeof(val32))); } #endif val = round_page(p->p_vmspace->vm_stacktop); return (SYSCTL_OUT(req, &val, sizeof(val))); } static int sysctl_kern_stackprot(SYSCTL_HANDLER_ARGS) { struct proc *p; p = curproc; return (SYSCTL_OUT(req, &p->p_sysent->sv_stackprot, sizeof(p->p_sysent->sv_stackprot))); } /* * Each of the items is a pointer to a `const struct execsw', hence the * double pointer here. */ static const struct execsw **execsw; #ifndef _SYS_SYSPROTO_H_ struct execve_args { char *fname; char **argv; char **envv; }; #endif int sys_execve(struct thread *td, struct execve_args *uap) { struct image_args args; struct vmspace *oldvmspace; int error; error = pre_execve(td, &oldvmspace); if (error != 0) return (error); error = exec_copyin_args(&args, uap->fname, UIO_USERSPACE, uap->argv, uap->envv); if (error == 0) error = kern_execve(td, &args, NULL, oldvmspace); post_execve(td, error, oldvmspace); AUDIT_SYSCALL_EXIT(error == EJUSTRETURN ? 0 : error, td); return (error); } #ifndef _SYS_SYSPROTO_H_ struct fexecve_args { int fd; char **argv; char **envv; }; #endif int sys_fexecve(struct thread *td, struct fexecve_args *uap) { struct image_args args; struct vmspace *oldvmspace; int error; error = pre_execve(td, &oldvmspace); if (error != 0) return (error); error = exec_copyin_args(&args, NULL, UIO_SYSSPACE, uap->argv, uap->envv); if (error == 0) { args.fd = uap->fd; error = kern_execve(td, &args, NULL, oldvmspace); } post_execve(td, error, oldvmspace); AUDIT_SYSCALL_EXIT(error == EJUSTRETURN ? 
0 : error, td); return (error); } #ifndef _SYS_SYSPROTO_H_ struct __mac_execve_args { char *fname; char **argv; char **envv; struct mac *mac_p; }; #endif int sys___mac_execve(struct thread *td, struct __mac_execve_args *uap) { #ifdef MAC struct image_args args; struct vmspace *oldvmspace; int error; error = pre_execve(td, &oldvmspace); if (error != 0) return (error); error = exec_copyin_args(&args, uap->fname, UIO_USERSPACE, uap->argv, uap->envv); if (error == 0) error = kern_execve(td, &args, uap->mac_p, oldvmspace); post_execve(td, error, oldvmspace); AUDIT_SYSCALL_EXIT(error == EJUSTRETURN ? 0 : error, td); return (error); #else return (ENOSYS); #endif } int pre_execve(struct thread *td, struct vmspace **oldvmspace) { struct proc *p; int error; KASSERT(td == curthread, ("non-current thread %p", td)); error = 0; p = td->td_proc; if ((p->p_flag & P_HADTHREADS) != 0) { PROC_LOCK(p); if (thread_single(p, SINGLE_BOUNDARY) != 0) error = ERESTART; PROC_UNLOCK(p); } KASSERT(error != 0 || (td->td_pflags & TDP_EXECVMSPC) == 0, ("nested execve")); *oldvmspace = p->p_vmspace; return (error); } void post_execve(struct thread *td, int error, struct vmspace *oldvmspace) { struct proc *p; KASSERT(td == curthread, ("non-current thread %p", td)); p = td->td_proc; if ((p->p_flag & P_HADTHREADS) != 0) { PROC_LOCK(p); /* * If success, we upgrade to SINGLE_EXIT state to * force other threads to suicide. */ if (error == EJUSTRETURN) thread_single(p, SINGLE_EXIT); else thread_single_end(p, SINGLE_BOUNDARY); PROC_UNLOCK(p); } exec_cleanup(td, oldvmspace); } /* * kern_execve() has the astonishing property of not always returning to * the caller. If sufficiently bad things happen during the call to * do_execve(), it can end up calling exit1(); as a result, callers must * avoid doing anything which they might need to undo (e.g., allocating * memory). 
*/ int kern_execve(struct thread *td, struct image_args *args, struct mac *mac_p, struct vmspace *oldvmspace) { TSEXEC(td->td_proc->p_pid, args->begin_argv); AUDIT_ARG_ARGV(args->begin_argv, args->argc, exec_args_get_begin_envv(args) - args->begin_argv); AUDIT_ARG_ENVV(exec_args_get_begin_envv(args), args->envc, args->endp - exec_args_get_begin_envv(args)); /* Must have at least one argument. */ if (args->argc == 0) { exec_free_args(args); return (EINVAL); } return (do_execve(td, args, mac_p, oldvmspace)); } static void execve_nosetid(struct image_params *imgp) { imgp->credential_setid = false; if (imgp->newcred != NULL) { crfree(imgp->newcred); imgp->newcred = NULL; } } /* * In-kernel implementation of execve(). All arguments are assumed to be * userspace pointers from the passed thread. */ static int do_execve(struct thread *td, struct image_args *args, struct mac *mac_p, struct vmspace *oldvmspace) { struct proc *p = td->td_proc; struct nameidata nd; struct ucred *oldcred; struct uidinfo *euip = NULL; uintptr_t stack_base; struct image_params image_params, *imgp; struct vattr attr; struct pargs *oldargs = NULL, *newargs = NULL; struct sigacts *oldsigacts = NULL, *newsigacts = NULL; #ifdef KTRACE struct ktr_io_params *kiop; #endif struct vnode *oldtextvp, *newtextvp; struct vnode *oldtextdvp, *newtextdvp; char *oldbinname, *newbinname; bool credential_changing; #ifdef MAC struct label *interpvplabel = NULL; bool will_transition; #endif #ifdef HWPMC_HOOKS struct pmckern_procexec pe; #endif int error, i, orig_osrel; uint32_t orig_fctl0; Elf_Brandinfo *orig_brandinfo; size_t freepath_size; static const char fexecv_proc_title[] = "(fexecv)"; imgp = &image_params; oldtextvp = oldtextdvp = NULL; newtextvp = newtextdvp = NULL; newbinname = oldbinname = NULL; #ifdef KTRACE kiop = NULL; #endif /* * Lock the process and set the P_INEXEC flag to indicate that * it should be left alone until we're done here. This is * necessary to avoid race conditions - e.g. 
in ptrace() - * that might allow a local user to illicitly obtain elevated * privileges. */ PROC_LOCK(p); KASSERT((p->p_flag & P_INEXEC) == 0, ("%s(): process already has P_INEXEC flag", __func__)); p->p_flag |= P_INEXEC; PROC_UNLOCK(p); /* * Initialize part of the common data */ bzero(imgp, sizeof(*imgp)); imgp->proc = p; imgp->attr = &attr; imgp->args = args; oldcred = p->p_ucred; orig_osrel = p->p_osrel; orig_fctl0 = p->p_fctl0; orig_brandinfo = p->p_elf_brandinfo; #ifdef MAC error = mac_execve_enter(imgp, mac_p); if (error) goto exec_fail; #endif SDT_PROBE1(proc, , , exec, args->fname); interpret: if (args->fname != NULL) { #ifdef CAPABILITY_MODE /* * While capability mode can't reach this point via direct * path arguments to execve(), we also don't allow * interpreters to be used in capability mode (for now). * Catch indirect lookups and return a permissions error. */ if (IN_CAPABILITY_MODE(td)) { error = ECAPMODE; goto exec_fail; } #endif /* * Translate the file name. namei() returns a vnode * pointer in ni_vp among other things. */ NDINIT(&nd, LOOKUP, ISOPEN | LOCKLEAF | LOCKSHARED | FOLLOW | AUDITVNODE1 | WANTPARENT, UIO_SYSSPACE, args->fname); error = namei(&nd); if (error) goto exec_fail; newtextvp = nd.ni_vp; newtextdvp = nd.ni_dvp; nd.ni_dvp = NULL; newbinname = malloc(nd.ni_cnd.cn_namelen + 1, M_PARGS, M_WAITOK); memcpy(newbinname, nd.ni_cnd.cn_nameptr, nd.ni_cnd.cn_namelen); newbinname[nd.ni_cnd.cn_namelen] = '\0'; imgp->vp = newtextvp; /* * Do the best to calculate the full path to the image file. 
*/ if (args->fname[0] == '/') { imgp->execpath = args->fname; } else { VOP_UNLOCK(imgp->vp); freepath_size = MAXPATHLEN; if (vn_fullpath_hardlink(newtextvp, newtextdvp, newbinname, nd.ni_cnd.cn_namelen, &imgp->execpath, &imgp->freepath, &freepath_size) != 0) imgp->execpath = args->fname; vn_lock(imgp->vp, LK_SHARED | LK_RETRY); } } else if (imgp->interpreter_vp) { /* * An image activator has already provided an open vnode */ newtextvp = imgp->interpreter_vp; imgp->interpreter_vp = NULL; if (vn_fullpath(newtextvp, &imgp->execpath, &imgp->freepath) != 0) imgp->execpath = args->fname; vn_lock(newtextvp, LK_SHARED | LK_RETRY); AUDIT_ARG_VNODE1(newtextvp); imgp->vp = newtextvp; } else { AUDIT_ARG_FD(args->fd); /* * If the descriptors was not opened with O_PATH, then * we require that it was opened with O_EXEC or * O_RDONLY. In either case, exec_check_permissions() * below checks _current_ file access mode regardless * of the permissions additionally checked at the * open(2). */ error = fgetvp_exec(td, args->fd, &cap_fexecve_rights, &newtextvp); if (error != 0) goto exec_fail; if (vn_fullpath(newtextvp, &imgp->execpath, &imgp->freepath) != 0) imgp->execpath = args->fname; vn_lock(newtextvp, LK_SHARED | LK_RETRY); AUDIT_ARG_VNODE1(newtextvp); imgp->vp = newtextvp; } /* * Check file permissions. Also 'opens' file and sets its vnode to * text mode. */ error = exec_check_permissions(imgp); if (error) goto exec_fail_dealloc; imgp->object = imgp->vp->v_object; if (imgp->object != NULL) vm_object_reference(imgp->object); error = exec_map_first_page(imgp); if (error) goto exec_fail_dealloc; imgp->proc->p_osrel = 0; imgp->proc->p_fctl0 = 0; imgp->proc->p_elf_brandinfo = NULL; /* * Implement image setuid/setgid. * * Determine new credentials before attempting image activators * so that it can be used by process_exec handlers to determine * credential/setid changes. * * Don't honor setuid/setgid if the filesystem prohibits it or if * the process is being traced. 
* * We disable setuid/setgid/etc in capability mode on the basis * that most setugid applications are not written with that * environment in mind, and will therefore almost certainly operate * incorrectly. In principle there's no reason that setugid * applications might not be useful in capability mode, so we may want * to reconsider this conservative design choice in the future. * * XXXMAC: For the time being, use NOSUID to also prohibit * transitions on the file system. */ credential_changing = false; credential_changing |= (attr.va_mode & S_ISUID) && oldcred->cr_uid != attr.va_uid; credential_changing |= (attr.va_mode & S_ISGID) && oldcred->cr_gid != attr.va_gid; #ifdef MAC will_transition = mac_vnode_execve_will_transition(oldcred, imgp->vp, interpvplabel, imgp) != 0; credential_changing |= will_transition; #endif /* Don't inherit PROC_PDEATHSIG_CTL value if setuid/setgid. */ if (credential_changing) imgp->proc->p_pdeathsig = 0; if (credential_changing && #ifdef CAPABILITY_MODE ((oldcred->cr_flags & CRED_FLAG_CAPMODE) == 0) && #endif (imgp->vp->v_mount->mnt_flag & MNT_NOSUID) == 0 && (p->p_flag & P_TRACED) == 0) { imgp->credential_setid = true; VOP_UNLOCK(imgp->vp); imgp->newcred = crdup(oldcred); if (attr.va_mode & S_ISUID) { euip = uifind(attr.va_uid); change_euid(imgp->newcred, euip); } vn_lock(imgp->vp, LK_SHARED | LK_RETRY); if (attr.va_mode & S_ISGID) change_egid(imgp->newcred, attr.va_gid); /* * Implement correct POSIX saved-id behavior. * * XXXMAC: Note that the current logic will save the * uid and gid if a MAC domain transition occurs, even * though maybe it shouldn't. */ change_svuid(imgp->newcred, imgp->newcred->cr_uid); change_svgid(imgp->newcred, imgp->newcred->cr_gid); } else { /* * Implement correct POSIX saved-id behavior. * * XXX: It's not clear that the existing behavior is * POSIX-compliant. 
A number of sources indicate that the * saved uid/gid should only be updated if the new ruid is * not equal to the old ruid, or the new euid is not equal * to the old euid and the new euid is not equal to the old * ruid. The FreeBSD code always updates the saved uid/gid. * Also, this code uses the new (replaced) euid and egid as * the source, which may or may not be the right ones to use. */ if (oldcred->cr_svuid != oldcred->cr_uid || oldcred->cr_svgid != oldcred->cr_gid) { VOP_UNLOCK(imgp->vp); imgp->newcred = crdup(oldcred); vn_lock(imgp->vp, LK_SHARED | LK_RETRY); change_svuid(imgp->newcred, imgp->newcred->cr_uid); change_svgid(imgp->newcred, imgp->newcred->cr_gid); } } /* The new credentials are installed into the process later. */ /* * Loop through the list of image activators, calling each one. * An activator returns -1 if there is no match, 0 on success, * and an error otherwise. */ error = -1; for (i = 0; error == -1 && execsw[i]; ++i) { if (execsw[i]->ex_imgact == NULL) continue; error = (*execsw[i]->ex_imgact)(imgp); } if (error) { if (error == -1) error = ENOEXEC; goto exec_fail_dealloc; } /* * Special interpreter operation, cleanup and loop up to try to * activate the interpreter. */ if (imgp->interpreted) { exec_unmap_first_page(imgp); /* * The text reference needs to be removed for scripts. * There is a short period before we determine that * something is a script where text reference is active. * The vnode lock is held over this entire period * so nothing should illegitimately be blocked. 
*/ MPASS(imgp->textset); VOP_UNSET_TEXT_CHECKED(newtextvp); imgp->textset = false; /* free name buffer and old vnode */ #ifdef MAC mac_execve_interpreter_enter(newtextvp, &interpvplabel); #endif if (imgp->opened) { VOP_CLOSE(newtextvp, FREAD, td->td_ucred, td); imgp->opened = false; } vput(newtextvp); imgp->vp = newtextvp = NULL; if (args->fname != NULL) { if (newtextdvp != NULL) { vrele(newtextdvp); newtextdvp = NULL; } NDFREE_PNBUF(&nd); free(newbinname, M_PARGS); newbinname = NULL; } vm_object_deallocate(imgp->object); imgp->object = NULL; execve_nosetid(imgp); imgp->execpath = NULL; free(imgp->freepath, M_TEMP); imgp->freepath = NULL; /* set new name to that of the interpreter */ if (imgp->interpreter_vp) { args->fname = NULL; } else { args->fname = imgp->interpreter_name; } goto interpret; } /* * NB: We unlock the vnode here because it is believed that none * of the sv_copyout_strings/sv_fixup operations require the vnode. */ VOP_UNLOCK(imgp->vp); if (disallow_high_osrel && P_OSREL_MAJOR(p->p_osrel) > P_OSREL_MAJOR(__FreeBSD_version)) { error = ENOEXEC; uprintf("Osrel %d for image %s too high\n", p->p_osrel, imgp->execpath != NULL ? imgp->execpath : ""); vn_lock(imgp->vp, LK_SHARED | LK_RETRY); goto exec_fail_dealloc; } /* * Copy out strings (args and env) and initialize stack base. */ error = (*p->p_sysent->sv_copyout_strings)(imgp, &stack_base); if (error != 0) { vn_lock(imgp->vp, LK_SHARED | LK_RETRY); goto exec_fail_dealloc; } /* * Stack setup. */ error = (*p->p_sysent->sv_fixup)(&stack_base, imgp); if (error != 0) { vn_lock(imgp->vp, LK_SHARED | LK_RETRY); goto exec_fail_dealloc; } /* * For security and other reasons, the file descriptor table cannot be * shared after an exec. */ fdunshare(td); pdunshare(td); /* close files on exec */ fdcloseexec(td); /* * Malloc things before we need locks. 
*/ i = exec_args_get_begin_envv(imgp->args) - imgp->args->begin_argv; /* Cache arguments if they fit inside our allowance */ if (ps_arg_cache_limit >= i + sizeof(struct pargs)) { newargs = pargs_alloc(i); bcopy(imgp->args->begin_argv, newargs->ar_args, i); } /* * For security and other reasons, signal handlers cannot * be shared after an exec. The new process gets a copy of the old * handlers. In execsigs(), the new process will have its signals * reset. */ if (sigacts_shared(p->p_sigacts)) { oldsigacts = p->p_sigacts; newsigacts = sigacts_alloc(); sigacts_copy(newsigacts, oldsigacts); } vn_lock(imgp->vp, LK_SHARED | LK_RETRY); PROC_LOCK(p); if (oldsigacts) p->p_sigacts = newsigacts; /* Stop profiling */ stopprofclock(p); /* reset caught signals */ execsigs(p); /* name this process - nameiexec(p, ndp) */ bzero(p->p_comm, sizeof(p->p_comm)); if (args->fname) bcopy(nd.ni_cnd.cn_nameptr, p->p_comm, min(nd.ni_cnd.cn_namelen, MAXCOMLEN)); else if (vn_commname(newtextvp, p->p_comm, sizeof(p->p_comm)) != 0) bcopy(fexecv_proc_title, p->p_comm, sizeof(fexecv_proc_title)); bcopy(p->p_comm, td->td_name, sizeof(td->td_name)); #ifdef KTR sched_clear_tdname(td); #endif /* * mark as execed, wakeup the process that vforked (if any) and tell * it that it now has its own resources back */ p->p_flag |= P_EXEC; if ((p->p_flag2 & P2_NOTRACE_EXEC) == 0) p->p_flag2 &= ~P2_NOTRACE; if ((p->p_flag2 & P2_STKGAP_DISABLE_EXEC) == 0) p->p_flag2 &= ~P2_STKGAP_DISABLE; if (p->p_flag & P_PPWAIT) { p->p_flag &= ~(P_PPWAIT | P_PPTRACE); cv_broadcast(&p->p_pwait); /* STOPs are no longer ignored, arrange for AST */ signotify(td); } if ((imgp->sysent->sv_setid_allowed != NULL && !(*imgp->sysent->sv_setid_allowed)(td, imgp)) || (p->p_flag2 & P2_NO_NEW_PRIVS) != 0) execve_nosetid(imgp); /* * Implement image setuid/setgid installation. */ if (imgp->credential_setid) { /* * Turn off syscall tracing for set-id programs, except for * root. 
Record any set-id flags first to make sure that * we do not regain any tracing during a possible block. */ setsugid(p); #ifdef KTRACE kiop = ktrprocexec(p); #endif /* * Close any file descriptors 0..2 that reference procfs, * then make sure file descriptors 0..2 are in use. * * Both fdsetugidsafety() and fdcheckstd() may call functions * taking sleepable locks, so temporarily drop our locks. */ PROC_UNLOCK(p); VOP_UNLOCK(imgp->vp); fdsetugidsafety(td); error = fdcheckstd(td); vn_lock(imgp->vp, LK_SHARED | LK_RETRY); if (error != 0) goto exec_fail_dealloc; PROC_LOCK(p); #ifdef MAC if (will_transition) { mac_vnode_execve_transition(oldcred, imgp->newcred, imgp->vp, interpvplabel, imgp); } #endif } else { if (oldcred->cr_uid == oldcred->cr_ruid && oldcred->cr_gid == oldcred->cr_rgid) p->p_flag &= ~P_SUGID; } /* * Set the new credentials. */ if (imgp->newcred != NULL) { proc_set_cred(p, imgp->newcred); crfree(oldcred); oldcred = NULL; } /* * Store the vp for use in kern.proc.pathname. This vnode was * referenced by namei() or by fexecve variant of fname handling. */ oldtextvp = p->p_textvp; p->p_textvp = newtextvp; oldtextdvp = p->p_textdvp; p->p_textdvp = newtextdvp; newtextdvp = NULL; oldbinname = p->p_binname; p->p_binname = newbinname; newbinname = NULL; #ifdef KDTRACE_HOOKS /* * Tell the DTrace fasttrap provider about the exec if it * has declared an interest. */ if (dtrace_fasttrap_exec) dtrace_fasttrap_exec(p); #endif /* * Notify others that we exec'd, and clear the P_INEXEC flag * as we're now a bona fide freshly-execed process. */ KNOTE_LOCKED(p->p_klist, NOTE_EXEC); p->p_flag &= ~P_INEXEC; /* clear "fork but no exec" flag, as we _are_ execing */ p->p_acflag &= ~AFORK; /* * Free any previous argument cache and replace it with * the new argument cache, if any. */ oldargs = p->p_args; p->p_args = newargs; newargs = NULL; PROC_UNLOCK(p); #ifdef HWPMC_HOOKS /* * Check if system-wide sampling is in effect or if the * current process is using PMCs. 
If so, do exec() time * processing. This processing needs to happen AFTER the * P_INEXEC flag is cleared. */ if (PMC_SYSTEM_SAMPLING_ACTIVE() || PMC_PROC_IS_USING_PMCS(p)) { VOP_UNLOCK(imgp->vp); pe.pm_credentialschanged = credential_changing; - pe.pm_entryaddr = imgp->entry_addr; + pe.pm_baseaddr = imgp->reloc_base; + pe.pm_dynaddr = imgp->et_dyn_addr; PMC_CALL_HOOK_X(td, PMC_FN_PROCESS_EXEC, (void *) &pe); vn_lock(imgp->vp, LK_SHARED | LK_RETRY); } #endif /* Set values passed into the program in registers. */ (*p->p_sysent->sv_setregs)(td, imgp, stack_base); VOP_MMAPPED(imgp->vp); SDT_PROBE1(proc, , , exec__success, args->fname); exec_fail_dealloc: if (error != 0) { p->p_osrel = orig_osrel; p->p_fctl0 = orig_fctl0; p->p_elf_brandinfo = orig_brandinfo; } if (imgp->firstpage != NULL) exec_unmap_first_page(imgp); if (imgp->vp != NULL) { if (imgp->opened) VOP_CLOSE(imgp->vp, FREAD, td->td_ucred, td); if (imgp->textset) VOP_UNSET_TEXT_CHECKED(imgp->vp); if (error != 0) vput(imgp->vp); else VOP_UNLOCK(imgp->vp); if (args->fname != NULL) NDFREE_PNBUF(&nd); if (newtextdvp != NULL) vrele(newtextdvp); free(newbinname, M_PARGS); } if (imgp->object != NULL) vm_object_deallocate(imgp->object); free(imgp->freepath, M_TEMP); if (error == 0) { if (p->p_ptevents & PTRACE_EXEC) { PROC_LOCK(p); if (p->p_ptevents & PTRACE_EXEC) td->td_dbgflags |= TDB_EXEC; PROC_UNLOCK(p); } } else { exec_fail: /* we're done here, clear P_INEXEC */ PROC_LOCK(p); p->p_flag &= ~P_INEXEC; PROC_UNLOCK(p); SDT_PROBE1(proc, , , exec__failure, error); } if (imgp->newcred != NULL && oldcred != NULL) crfree(imgp->newcred); #ifdef MAC mac_execve_exit(imgp); mac_execve_interpreter_exit(interpvplabel); #endif exec_free_args(args); /* * Handle deferred decrement of ref counts. 
*/ if (oldtextvp != NULL) vrele(oldtextvp); if (oldtextdvp != NULL) vrele(oldtextdvp); free(oldbinname, M_PARGS); #ifdef KTRACE ktr_io_params_free(kiop); #endif pargs_drop(oldargs); pargs_drop(newargs); if (oldsigacts != NULL) sigacts_free(oldsigacts); if (euip != NULL) uifree(euip); if (error && imgp->vmspace_destroyed) { /* sorry, no more process anymore. exit gracefully */ exec_cleanup(td, oldvmspace); exit1(td, 0, SIGABRT); /* NOT REACHED */ } #ifdef KTRACE if (error == 0) ktrprocctor(p); #endif /* * We don't want cpu_set_syscall_retval() to overwrite any of * the register values put in place by exec_setregs(). * Implementations of cpu_set_syscall_retval() will leave * registers unmodified when returning EJUSTRETURN. */ return (error == 0 ? EJUSTRETURN : error); } void exec_cleanup(struct thread *td, struct vmspace *oldvmspace) { if ((td->td_pflags & TDP_EXECVMSPC) != 0) { KASSERT(td->td_proc->p_vmspace != oldvmspace, ("oldvmspace still used")); vmspace_free(oldvmspace); td->td_pflags &= ~TDP_EXECVMSPC; } } int exec_map_first_page(struct image_params *imgp) { vm_object_t object; vm_page_t m; int error; if (imgp->firstpage != NULL) exec_unmap_first_page(imgp); object = imgp->vp->v_object; if (object == NULL) return (EACCES); #if VM_NRESERVLEVEL > 0 if ((object->flags & OBJ_COLORED) == 0) { VM_OBJECT_WLOCK(object); vm_object_color(object, 0); VM_OBJECT_WUNLOCK(object); } #endif error = vm_page_grab_valid_unlocked(&m, object, 0, VM_ALLOC_COUNT(VM_INITIAL_PAGEIN) | VM_ALLOC_NORMAL | VM_ALLOC_NOBUSY | VM_ALLOC_WIRED); if (error != VM_PAGER_OK) return (EIO); imgp->firstpage = sf_buf_alloc(m, 0); imgp->image_header = (char *)sf_buf_kva(imgp->firstpage); return (0); } void exec_unmap_first_page(struct image_params *imgp) { vm_page_t m; if (imgp->firstpage != NULL) { m = sf_buf_page(imgp->firstpage); sf_buf_free(imgp->firstpage); imgp->firstpage = NULL; vm_page_unwire(m, PQ_ACTIVE); } } void exec_onexec_old(struct thread *td) { sigfastblock_clear(td); 
umtx_exec(td->td_proc); } /* * This is an optimization which removes the unmanaged shared page * mapping. In combination with pmap_remove_pages(), which cleans all * managed mappings in the process' vmspace pmap, no work will be left * for pmap_remove(min, max). */ void exec_free_abi_mappings(struct proc *p) { struct vmspace *vmspace; vmspace = p->p_vmspace; if (refcount_load(&vmspace->vm_refcnt) != 1) return; if (!PROC_HAS_SHP(p)) return; pmap_remove(vmspace_pmap(vmspace), vmspace->vm_shp_base, vmspace->vm_shp_base + p->p_sysent->sv_shared_page_len); } /* * Run down the current address space and install a new one. */ int exec_new_vmspace(struct image_params *imgp, struct sysentvec *sv) { int error; struct proc *p = imgp->proc; struct vmspace *vmspace = p->p_vmspace; struct thread *td = curthread; vm_offset_t sv_minuser; vm_map_t map; imgp->vmspace_destroyed = true; imgp->sysent = sv; if (p->p_sysent->sv_onexec_old != NULL) p->p_sysent->sv_onexec_old(td); itimers_exec(p); EVENTHANDLER_DIRECT_INVOKE(process_exec, p, imgp); /* * Blow away entire process VM, if address space not shared, * otherwise, create a new VM space so that other threads are * not disrupted */ map = &vmspace->vm_map; if (map_at_zero) sv_minuser = sv->sv_minuser; else sv_minuser = MAX(sv->sv_minuser, PAGE_SIZE); if (refcount_load(&vmspace->vm_refcnt) == 1 && vm_map_min(map) == sv_minuser && vm_map_max(map) == sv->sv_maxuser && cpu_exec_vmspace_reuse(p, map)) { exec_free_abi_mappings(p); shmexit(vmspace); pmap_remove_pages(vmspace_pmap(vmspace)); vm_map_remove(map, vm_map_min(map), vm_map_max(map)); /* * An exec terminates mlockall(MCL_FUTURE). * ASLR and W^X states must be re-evaluated. 
*/ vm_map_lock(map); vm_map_modflags(map, 0, MAP_WIREFUTURE | MAP_ASLR | MAP_ASLR_IGNSTART | MAP_ASLR_STACK | MAP_WXORX); vm_map_unlock(map); } else { error = vmspace_exec(p, sv_minuser, sv->sv_maxuser); if (error) return (error); vmspace = p->p_vmspace; map = &vmspace->vm_map; } map->flags |= imgp->map_flags; return (sv->sv_onexec != NULL ? sv->sv_onexec(p, imgp) : 0); } /* * Compute the stack size limit and map the main process stack. * Map the shared page. */ int exec_map_stack(struct image_params *imgp) { struct rlimit rlim_stack; struct sysentvec *sv; struct proc *p; vm_map_t map; struct vmspace *vmspace; vm_offset_t stack_addr, stack_top; vm_offset_t sharedpage_addr; u_long ssiz; int error, find_space, stack_off; vm_prot_t stack_prot; vm_object_t obj; p = imgp->proc; sv = p->p_sysent; if (imgp->stack_sz != 0) { ssiz = trunc_page(imgp->stack_sz); PROC_LOCK(p); lim_rlimit_proc(p, RLIMIT_STACK, &rlim_stack); PROC_UNLOCK(p); if (ssiz > rlim_stack.rlim_max) ssiz = rlim_stack.rlim_max; if (ssiz > rlim_stack.rlim_cur) { rlim_stack.rlim_cur = ssiz; kern_setrlimit(curthread, RLIMIT_STACK, &rlim_stack); } } else if (sv->sv_maxssiz != NULL) { ssiz = *sv->sv_maxssiz; } else { ssiz = maxssiz; } vmspace = p->p_vmspace; map = &vmspace->vm_map; stack_prot = sv->sv_shared_page_obj != NULL && imgp->stack_prot != 0 ? 
imgp->stack_prot : sv->sv_stackprot; if ((map->flags & MAP_ASLR_STACK) != 0) { stack_addr = round_page((vm_offset_t)p->p_vmspace->vm_daddr + lim_max(curthread, RLIMIT_DATA)); find_space = VMFS_ANY_SPACE; } else { stack_addr = sv->sv_usrstack - ssiz; find_space = VMFS_NO_SPACE; } error = vm_map_find(map, NULL, 0, &stack_addr, (vm_size_t)ssiz, sv->sv_usrstack, find_space, stack_prot, VM_PROT_ALL, MAP_STACK_GROWS_DOWN); if (error != KERN_SUCCESS) { uprintf("exec_new_vmspace: mapping stack size %#jx prot %#x " "failed, mach error %d errno %d\n", (uintmax_t)ssiz, stack_prot, error, vm_mmap_to_errno(error)); return (vm_mmap_to_errno(error)); } stack_top = stack_addr + ssiz; if ((map->flags & MAP_ASLR_STACK) != 0) { /* Randomize within the first page of the stack. */ arc4rand(&stack_off, sizeof(stack_off), 0); stack_top -= rounddown2(stack_off & PAGE_MASK, sizeof(void *)); } /* Map a shared page */ obj = sv->sv_shared_page_obj; if (obj == NULL) { sharedpage_addr = 0; goto out; } /* * If randomization is disabled then the shared page will * be mapped at address specified in sysentvec. * Otherwise any address above .data section can be selected. * Same logic is used for stack address randomization. * If the address randomization is applied map a guard page * at the top of UVA. */ vm_object_reference(obj); if ((imgp->imgp_flags & IMGP_ASLR_SHARED_PAGE) != 0) { sharedpage_addr = round_page((vm_offset_t)p->p_vmspace->vm_daddr + lim_max(curthread, RLIMIT_DATA)); error = vm_map_fixed(map, NULL, 0, sv->sv_maxuser - PAGE_SIZE, PAGE_SIZE, VM_PROT_NONE, VM_PROT_NONE, MAP_CREATE_GUARD); if (error != KERN_SUCCESS) { /* * This is not fatal, so let's just print a warning * and continue. 
*/ uprintf("%s: Mapping guard page at the top of UVA failed" " mach error %d errno %d", __func__, error, vm_mmap_to_errno(error)); } error = vm_map_find(map, obj, 0, &sharedpage_addr, sv->sv_shared_page_len, sv->sv_maxuser, VMFS_ANY_SPACE, VM_PROT_READ | VM_PROT_EXECUTE, VM_PROT_READ | VM_PROT_EXECUTE, MAP_INHERIT_SHARE | MAP_ACC_NO_CHARGE); } else { sharedpage_addr = sv->sv_shared_page_base; vm_map_fixed(map, obj, 0, sharedpage_addr, sv->sv_shared_page_len, VM_PROT_READ | VM_PROT_EXECUTE, VM_PROT_READ | VM_PROT_EXECUTE, MAP_INHERIT_SHARE | MAP_ACC_NO_CHARGE); } if (error != KERN_SUCCESS) { uprintf("%s: mapping shared page at addr: %p" "failed, mach error %d errno %d\n", __func__, (void *)sharedpage_addr, error, vm_mmap_to_errno(error)); vm_object_deallocate(obj); return (vm_mmap_to_errno(error)); } out: /* * vm_ssize and vm_maxsaddr are somewhat antiquated concepts, but they * are still used to enforce the stack rlimit on the process stack. */ vmspace->vm_maxsaddr = (char *)stack_addr; vmspace->vm_stacktop = stack_top; vmspace->vm_ssize = sgrowsiz >> PAGE_SHIFT; vmspace->vm_shp_base = sharedpage_addr; return (0); } /* * Copy out argument and environment strings from the old process address * space into the temporary string buffer. */ int exec_copyin_args(struct image_args *args, const char *fname, enum uio_seg segflg, char **argv, char **envv) { u_long arg, env; int error; bzero(args, sizeof(*args)); if (argv == NULL) return (EFAULT); /* * Allocate demand-paged memory for the file name, argument, and * environment strings. */ error = exec_alloc_args(args); if (error != 0) return (error); /* * Copy the file name. 
*/ error = exec_args_add_fname(args, fname, segflg); if (error != 0) goto err_exit; /* * extract arguments first */ for (;;) { error = fueword(argv++, &arg); if (error == -1) { error = EFAULT; goto err_exit; } if (arg == 0) break; error = exec_args_add_arg(args, (char *)(uintptr_t)arg, UIO_USERSPACE); if (error != 0) goto err_exit; } /* * extract environment strings */ if (envv) { for (;;) { error = fueword(envv++, &env); if (error == -1) { error = EFAULT; goto err_exit; } if (env == 0) break; error = exec_args_add_env(args, (char *)(uintptr_t)env, UIO_USERSPACE); if (error != 0) goto err_exit; } } return (0); err_exit: exec_free_args(args); return (error); } struct exec_args_kva { vm_offset_t addr; u_int gen; SLIST_ENTRY(exec_args_kva) next; }; DPCPU_DEFINE_STATIC(struct exec_args_kva *, exec_args_kva); static SLIST_HEAD(, exec_args_kva) exec_args_kva_freelist; static struct mtx exec_args_kva_mtx; static u_int exec_args_gen; static void exec_prealloc_args_kva(void *arg __unused) { struct exec_args_kva *argkva; u_int i; SLIST_INIT(&exec_args_kva_freelist); mtx_init(&exec_args_kva_mtx, "exec args kva", NULL, MTX_DEF); for (i = 0; i < exec_map_entries; i++) { argkva = malloc(sizeof(*argkva), M_PARGS, M_WAITOK); argkva->addr = kmap_alloc_wait(exec_map, exec_map_entry_size); argkva->gen = exec_args_gen; SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next); } } SYSINIT(exec_args_kva, SI_SUB_EXEC, SI_ORDER_ANY, exec_prealloc_args_kva, NULL); static vm_offset_t exec_alloc_args_kva(void **cookie) { struct exec_args_kva *argkva; argkva = (void *)atomic_readandclear_ptr( (uintptr_t *)DPCPU_PTR(exec_args_kva)); if (argkva == NULL) { mtx_lock(&exec_args_kva_mtx); while ((argkva = SLIST_FIRST(&exec_args_kva_freelist)) == NULL) (void)mtx_sleep(&exec_args_kva_freelist, &exec_args_kva_mtx, 0, "execkva", 0); SLIST_REMOVE_HEAD(&exec_args_kva_freelist, next); mtx_unlock(&exec_args_kva_mtx); } kasan_mark((void *)argkva->addr, exec_map_entry_size, exec_map_entry_size, 0); *(struct 
exec_args_kva **)cookie = argkva; return (argkva->addr); } static void exec_release_args_kva(struct exec_args_kva *argkva, u_int gen) { vm_offset_t base; base = argkva->addr; kasan_mark((void *)argkva->addr, 0, exec_map_entry_size, KASAN_EXEC_ARGS_FREED); if (argkva->gen != gen) { (void)vm_map_madvise(exec_map, base, base + exec_map_entry_size, MADV_FREE); argkva->gen = gen; } if (!atomic_cmpset_ptr((uintptr_t *)DPCPU_PTR(exec_args_kva), (uintptr_t)NULL, (uintptr_t)argkva)) { mtx_lock(&exec_args_kva_mtx); SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next); wakeup_one(&exec_args_kva_freelist); mtx_unlock(&exec_args_kva_mtx); } } static void exec_free_args_kva(void *cookie) { exec_release_args_kva(cookie, exec_args_gen); } static void exec_args_kva_lowmem(void *arg __unused) { SLIST_HEAD(, exec_args_kva) head; struct exec_args_kva *argkva; u_int gen; int i; gen = atomic_fetchadd_int(&exec_args_gen, 1) + 1; /* * Force an madvise of each KVA range. Any currently allocated ranges * will have MADV_FREE applied once they are freed. */ SLIST_INIT(&head); mtx_lock(&exec_args_kva_mtx); SLIST_SWAP(&head, &exec_args_kva_freelist, exec_args_kva); mtx_unlock(&exec_args_kva_mtx); while ((argkva = SLIST_FIRST(&head)) != NULL) { SLIST_REMOVE_HEAD(&head, next); exec_release_args_kva(argkva, gen); } CPU_FOREACH(i) { argkva = (void *)atomic_readandclear_ptr( (uintptr_t *)DPCPU_ID_PTR(i, exec_args_kva)); if (argkva != NULL) exec_release_args_kva(argkva, gen); } } EVENTHANDLER_DEFINE(vm_lowmem, exec_args_kva_lowmem, NULL, EVENTHANDLER_PRI_ANY); /* * Allocate temporary demand-paged, zero-filled memory for the file name, * argument, and environment strings. 
*/ int exec_alloc_args(struct image_args *args) { args->buf = (char *)exec_alloc_args_kva(&args->bufkva); return (0); } void exec_free_args(struct image_args *args) { if (args->buf != NULL) { exec_free_args_kva(args->bufkva); args->buf = NULL; } if (args->fname_buf != NULL) { free(args->fname_buf, M_TEMP); args->fname_buf = NULL; } } /* * A set to functions to fill struct image args. * * NOTE: exec_args_add_fname() must be called (possibly with a NULL * fname) before the other functions. All exec_args_add_arg() calls must * be made before any exec_args_add_env() calls. exec_args_adjust_args() * may be called any time after exec_args_add_fname(). * * exec_args_add_fname() - install path to be executed * exec_args_add_arg() - append an argument string * exec_args_add_env() - append an env string * exec_args_adjust_args() - adjust location of the argument list to * allow new arguments to be prepended */ int exec_args_add_fname(struct image_args *args, const char *fname, enum uio_seg segflg) { int error; size_t length; KASSERT(args->fname == NULL, ("fname already appended")); KASSERT(args->endp == NULL, ("already appending to args")); if (fname != NULL) { args->fname = args->buf; error = segflg == UIO_SYSSPACE ? copystr(fname, args->fname, PATH_MAX, &length) : copyinstr(fname, args->fname, PATH_MAX, &length); if (error != 0) return (error == ENAMETOOLONG ? 
E2BIG : error); } else length = 0; /* Set up for _arg_*()/_env_*() */ args->endp = args->buf + length; /* begin_argv must be set and kept updated */ args->begin_argv = args->endp; KASSERT(exec_map_entry_size - length >= ARG_MAX, ("too little space remaining for arguments %zu < %zu", exec_map_entry_size - length, (size_t)ARG_MAX)); args->stringspace = ARG_MAX; return (0); } static int exec_args_add_str(struct image_args *args, const char *str, enum uio_seg segflg, int *countp) { int error; size_t length; KASSERT(args->endp != NULL, ("endp not initialized")); KASSERT(args->begin_argv != NULL, ("begin_argp not initialized")); error = (segflg == UIO_SYSSPACE) ? copystr(str, args->endp, args->stringspace, &length) : copyinstr(str, args->endp, args->stringspace, &length); if (error != 0) return (error == ENAMETOOLONG ? E2BIG : error); args->stringspace -= length; args->endp += length; (*countp)++; return (0); } int exec_args_add_arg(struct image_args *args, const char *argp, enum uio_seg segflg) { KASSERT(args->envc == 0, ("appending args after env")); return (exec_args_add_str(args, argp, segflg, &args->argc)); } int exec_args_add_env(struct image_args *args, const char *envp, enum uio_seg segflg) { if (args->envc == 0) args->begin_envv = args->endp; return (exec_args_add_str(args, envp, segflg, &args->envc)); } int exec_args_adjust_args(struct image_args *args, size_t consume, ssize_t extend) { ssize_t offset; KASSERT(args->endp != NULL, ("endp not initialized")); KASSERT(args->begin_argv != NULL, ("begin_argp not initialized")); offset = extend - consume; if (args->stringspace < offset) return (E2BIG); memmove(args->begin_argv + extend, args->begin_argv + consume, args->endp - args->begin_argv + consume); if (args->envc > 0) args->begin_envv += offset; args->endp += offset; args->stringspace -= offset; return (0); } char * exec_args_get_begin_envv(struct image_args *args) { KASSERT(args->endp != NULL, ("endp not initialized")); if (args->envc > 0) return 
(args->begin_envv); return (args->endp); } /* * Copy strings out to the new process address space, constructing new arg * and env vector tables. Return a pointer to the base so that it can be used * as the initial stack pointer. */ int exec_copyout_strings(struct image_params *imgp, uintptr_t *stack_base) { int argc, envc; char **vectp; char *stringp; uintptr_t destp, ustringp; struct ps_strings *arginfo; struct proc *p; struct sysentvec *sysent; size_t execpath_len; int error, szsigcode; char canary[sizeof(long) * 8]; p = imgp->proc; sysent = p->p_sysent; destp = PROC_PS_STRINGS(p); arginfo = imgp->ps_strings = (void *)destp; /* * Install sigcode. */ if (sysent->sv_shared_page_base == 0 && sysent->sv_szsigcode != NULL) { szsigcode = *(sysent->sv_szsigcode); destp -= szsigcode; destp = rounddown2(destp, sizeof(void *)); error = copyout(sysent->sv_sigcode, (void *)destp, szsigcode); if (error != 0) return (error); } /* * Copy the image path for the rtld. */ if (imgp->execpath != NULL && imgp->auxargs != NULL) { execpath_len = strlen(imgp->execpath) + 1; destp -= execpath_len; destp = rounddown2(destp, sizeof(void *)); imgp->execpathp = (void *)destp; error = copyout(imgp->execpath, imgp->execpathp, execpath_len); if (error != 0) return (error); } /* * Prepare the canary for SSP. */ arc4rand(canary, sizeof(canary), 0); destp -= sizeof(canary); imgp->canary = (void *)destp; error = copyout(canary, imgp->canary, sizeof(canary)); if (error != 0) return (error); imgp->canarylen = sizeof(canary); /* * Prepare the pagesizes array. */ imgp->pagesizeslen = sizeof(pagesizes[0]) * MAXPAGESIZES; destp -= imgp->pagesizeslen; destp = rounddown2(destp, sizeof(void *)); imgp->pagesizes = (void *)destp; error = copyout(pagesizes, imgp->pagesizes, imgp->pagesizeslen); if (error != 0) return (error); /* * Allocate room for the argument and environment strings. 
*/ destp -= ARG_MAX - imgp->args->stringspace; destp = rounddown2(destp, sizeof(void *)); ustringp = destp; if (imgp->auxargs) { /* * Allocate room on the stack for the ELF auxargs * array. It has up to AT_COUNT entries. */ destp -= AT_COUNT * sizeof(Elf_Auxinfo); destp = rounddown2(destp, sizeof(void *)); } vectp = (char **)destp; /* * Allocate room for the argv[] and env vectors including the * terminating NULL pointers. */ vectp -= imgp->args->argc + 1 + imgp->args->envc + 1; /* * vectp also becomes our initial stack base */ *stack_base = (uintptr_t)vectp; stringp = imgp->args->begin_argv; argc = imgp->args->argc; envc = imgp->args->envc; /* * Copy out strings - arguments and environment. */ error = copyout(stringp, (void *)ustringp, ARG_MAX - imgp->args->stringspace); if (error != 0) return (error); /* * Fill in "ps_strings" struct for ps, w, etc. */ imgp->argv = vectp; if (suword(&arginfo->ps_argvstr, (long)(intptr_t)vectp) != 0 || suword32(&arginfo->ps_nargvstr, argc) != 0) return (EFAULT); /* * Fill in argument portion of vector table. */ for (; argc > 0; --argc) { if (suword(vectp++, ustringp) != 0) return (EFAULT); while (*stringp++ != 0) ustringp++; ustringp++; } /* a null vector table pointer separates the argp's from the envp's */ if (suword(vectp++, 0) != 0) return (EFAULT); imgp->envv = vectp; if (suword(&arginfo->ps_envstr, (long)(intptr_t)vectp) != 0 || suword32(&arginfo->ps_nenvstr, envc) != 0) return (EFAULT); /* * Fill in environment portion of vector table. */ for (; envc > 0; --envc) { if (suword(vectp++, ustringp) != 0) return (EFAULT); while (*stringp++ != 0) ustringp++; ustringp++; } /* end of vector table is a null pointer */ if (suword(vectp, 0) != 0) return (EFAULT); if (imgp->auxargs) { vectp++; error = imgp->sysent->sv_copyout_auxargs(imgp, (uintptr_t)vectp); if (error != 0) return (error); } return (0); } /* * Check permissions of file to execute. * Called with imgp->vp locked. * Return 0 for success or error code on failure. 
*/ int exec_check_permissions(struct image_params *imgp) { struct vnode *vp = imgp->vp; struct vattr *attr = imgp->attr; struct thread *td; int error; td = curthread; /* Get file attributes */ error = VOP_GETATTR(vp, attr, td->td_ucred); if (error) return (error); #ifdef MAC error = mac_vnode_check_exec(td->td_ucred, imgp->vp, imgp); if (error) return (error); #endif /* * 1) Check if file execution is disabled for the filesystem that * this file resides on. * 2) Ensure that at least one execute bit is on. Otherwise, a * privileged user will always succeed, and we don't want this * to happen unless the file really is executable. * 3) Ensure that the file is a regular file. */ if ((vp->v_mount->mnt_flag & MNT_NOEXEC) || (attr->va_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0 || (attr->va_type != VREG)) return (EACCES); /* * Zero length files can't be exec'd */ if (attr->va_size == 0) return (ENOEXEC); /* * Check for execute permission to file based on current credentials. */ error = VOP_ACCESS(vp, VEXEC, td->td_ucred, td); if (error) return (error); /* * Check number of open-for-writes on the file and deny execution * if there are any. * * Add a text reference now so no one can write to the * executable while we're activating it. * * Remember if this was set before and unset it in case this is not * actually an executable image. */ error = VOP_SET_TEXT(vp); if (error != 0) return (error); imgp->textset = true; /* * Call filesystem specific open routine (which does nothing in the * general case). 
*/ error = VOP_OPEN(vp, FREAD, td->td_ucred, td, NULL); if (error == 0) imgp->opened = true; return (error); } /* * Exec handler registration */ int exec_register(const struct execsw *execsw_arg) { const struct execsw **es, **xs, **newexecsw; u_int count = 2; /* New slot and trailing NULL */ if (execsw) for (es = execsw; *es; es++) count++; newexecsw = malloc(count * sizeof(*es), M_TEMP, M_WAITOK); xs = newexecsw; if (execsw) for (es = execsw; *es; es++) *xs++ = *es; *xs++ = execsw_arg; *xs = NULL; if (execsw) free(execsw, M_TEMP); execsw = newexecsw; return (0); } int exec_unregister(const struct execsw *execsw_arg) { const struct execsw **es, **xs, **newexecsw; int count = 1; if (execsw == NULL) panic("unregister with no handlers left?\n"); for (es = execsw; *es; es++) { if (*es == execsw_arg) break; } if (*es == NULL) return (ENOENT); for (es = execsw; *es; es++) if (*es != execsw_arg) count++; newexecsw = malloc(count * sizeof(*es), M_TEMP, M_WAITOK); xs = newexecsw; for (es = execsw; *es; es++) if (*es != execsw_arg) *xs++ = *es; *xs = NULL; if (execsw) free(execsw, M_TEMP); execsw = newexecsw; return (0); } /* * Write out a core segment to the compression stream. */ static int compress_chunk(struct coredump_params *cp, char *base, char *buf, size_t len) { size_t chunk_len; int error; while (len > 0) { chunk_len = MIN(len, CORE_BUF_SIZE); /* * We can get EFAULT error here. * In that case zero out the current chunk of the segment. 
*/ error = copyin(base, buf, chunk_len); if (error != 0) bzero(buf, chunk_len); error = compressor_write(cp->comp, buf, chunk_len); if (error != 0) break; base += chunk_len; len -= chunk_len; } return (error); } int core_write(struct coredump_params *cp, const void *base, size_t len, off_t offset, enum uio_seg seg, size_t *resid) { return (vn_rdwr_inchunks(UIO_WRITE, cp->vp, __DECONST(void *, base), len, offset, seg, IO_UNIT | IO_DIRECT | IO_RANGELOCKED, cp->active_cred, cp->file_cred, resid, cp->td)); } int core_output(char *base, size_t len, off_t offset, struct coredump_params *cp, void *tmpbuf) { vm_map_t map; struct mount *mp; size_t resid, runlen; int error; bool success; KASSERT((uintptr_t)base % PAGE_SIZE == 0, ("%s: user address %p is not page-aligned", __func__, base)); if (cp->comp != NULL) return (compress_chunk(cp, base, tmpbuf, len)); map = &cp->td->td_proc->p_vmspace->vm_map; for (; len > 0; base += runlen, offset += runlen, len -= runlen) { /* * Attempt to page in all virtual pages in the range. If a * virtual page is not backed by the pager, it is represented as * a hole in the file. This can occur with zero-filled * anonymous memory or truncated files, for example. */ for (runlen = 0; runlen < len; runlen += PAGE_SIZE) { if (core_dump_can_intr && curproc_sigkilled()) return (EINTR); error = vm_fault(map, (uintptr_t)base + runlen, VM_PROT_READ, VM_FAULT_NOFILL, NULL); if (runlen == 0) success = error == KERN_SUCCESS; else if ((error == KERN_SUCCESS) != success) break; } if (success) { error = core_write(cp, base, runlen, offset, UIO_USERSPACE, &resid); if (error != 0) { if (error != EFAULT) break; /* * EFAULT may be returned if the user mapping * could not be accessed, e.g., because a mapped * file has been truncated. Skip the page if no * progress was made, to protect against a * hypothetical scenario where vm_fault() was * successful but core_write() returns EFAULT * anyway. 
*/ runlen -= resid; if (runlen == 0) { success = false; runlen = PAGE_SIZE; } } } if (!success) { error = vn_start_write(cp->vp, &mp, V_WAIT); if (error != 0) break; vn_lock(cp->vp, LK_EXCLUSIVE | LK_RETRY); error = vn_truncate_locked(cp->vp, offset + runlen, false, cp->td->td_ucred); VOP_UNLOCK(cp->vp); vn_finished_write(mp); if (error != 0) break; } } return (error); } /* * Drain into a core file. */ int sbuf_drain_core_output(void *arg, const char *data, int len) { struct coredump_params *cp; struct proc *p; int error, locked; cp = arg; p = cp->td->td_proc; /* * Some kern_proc out routines that print to this sbuf may * call us with the process lock held. Draining with the * non-sleepable lock held is unsafe. The lock is needed for * those routines when dumping a live process. In our case we * can safely release the lock before draining and acquire * again after. */ locked = PROC_LOCKED(p); if (locked) PROC_UNLOCK(p); if (cp->comp != NULL) error = compressor_write(cp->comp, __DECONST(char *, data), len); else error = core_write(cp, __DECONST(void *, data), len, cp->offset, UIO_SYSSPACE, NULL); if (locked) PROC_LOCK(p); if (error != 0) return (-error); cp->offset += len; return (len); } diff --git a/sys/sys/pmckern.h b/sys/sys/pmckern.h index 7012b0bc9de4..93e772c24563 100644 --- a/sys/sys/pmckern.h +++ b/sys/sys/pmckern.h @@ -1,272 +1,273 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2003-2007, Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * PMC interface used by the base kernel. 
*/ #ifndef _SYS_PMCKERN_H_ #define _SYS_PMCKERN_H_ #include #include #include #include #include #include #include #include #define PMC_FN_PROCESS_EXEC 1 #define PMC_FN_CSW_IN 2 #define PMC_FN_CSW_OUT 3 #define PMC_FN_DO_SAMPLES 4 #define PMC_FN_UNUSED1 5 #define PMC_FN_UNUSED2 6 #define PMC_FN_MMAP 7 #define PMC_FN_MUNMAP 8 #define PMC_FN_USER_CALLCHAIN 9 #define PMC_FN_USER_CALLCHAIN_SOFT 10 #define PMC_FN_SOFT_SAMPLING 11 #define PMC_FN_THR_CREATE 12 #define PMC_FN_THR_EXIT 13 #define PMC_FN_THR_USERRET 14 #define PMC_FN_THR_CREATE_LOG 15 #define PMC_FN_THR_EXIT_LOG 16 #define PMC_FN_PROC_CREATE_LOG 17 typedef enum ring_type { PMC_HR = 0, /* Hardware ring buffer */ PMC_SR = 1, /* Software ring buffer */ PMC_UR = 2, /* userret ring buffer */ PMC_NUM_SR = PMC_UR+1 } ring_type_t; struct pmckern_procexec { int pm_credentialschanged; - uintfptr_t pm_entryaddr; + uintptr_t pm_baseaddr; + uintptr_t pm_dynaddr; }; struct pmckern_map_in { void *pm_file; /* filename or vnode pointer */ uintfptr_t pm_address; /* address object is loaded at */ }; struct pmckern_map_out { uintfptr_t pm_address; /* start address of region */ size_t pm_size; /* size of unmapped region */ }; struct pmckern_soft { enum pmc_event pm_ev; int pm_cpu; struct trapframe *pm_tf; }; /* * Soft PMC. */ #define PMC_SOFT_DEFINE_EX(prov, mod, func, name, alloc, release) \ struct pmc_soft pmc_##prov##_##mod##_##func##_##name = \ { 0, alloc, release, { #prov "_" #mod "_" #func "." 
#name, 0 } }; \ SYSINIT(pmc_##prov##_##mod##_##func##_##name##_init, SI_SUB_KDTRACE, \ SI_ORDER_SECOND + 1, pmc_soft_ev_register, \ &pmc_##prov##_##mod##_##func##_##name ); \ SYSUNINIT(pmc_##prov##_##mod##_##func##_##name##_uninit, \ SI_SUB_KDTRACE, SI_ORDER_SECOND + 1, pmc_soft_ev_deregister, \ &pmc_##prov##_##mod##_##func##_##name ) #define PMC_SOFT_DEFINE(prov, mod, func, name) \ PMC_SOFT_DEFINE_EX(prov, mod, func, name, NULL, NULL) #define PMC_SOFT_DECLARE(prov, mod, func, name) \ extern struct pmc_soft pmc_##prov##_##mod##_##func##_##name /* * PMC_SOFT_CALL can be used anywhere in the kernel. * Require md defined PMC_FAKE_TRAPFRAME. */ #ifdef PMC_FAKE_TRAPFRAME #define PMC_SOFT_CALL(pr, mo, fu, na) \ do { \ if (__predict_false(pmc_##pr##_##mo##_##fu##_##na.ps_running)) { \ struct pmckern_soft ks; \ register_t intr; \ intr = intr_disable(); \ PMC_FAKE_TRAPFRAME(&pmc_tf[curcpu]); \ ks.pm_ev = pmc_##pr##_##mo##_##fu##_##na.ps_ev.pm_ev_code; \ ks.pm_cpu = PCPU_GET(cpuid); \ ks.pm_tf = &pmc_tf[curcpu]; \ PMC_CALL_HOOK_UNLOCKED(curthread, \ PMC_FN_SOFT_SAMPLING, (void *) &ks); \ intr_restore(intr); \ } \ } while (0) #else #define PMC_SOFT_CALL(pr, mo, fu, na) \ do { \ } while (0) #endif /* * PMC_SOFT_CALL_TF need to be used carefully. * Userland capture will be done during AST processing. 
*/ #define PMC_SOFT_CALL_TF(pr, mo, fu, na, tf) \ do { \ if (__predict_false(pmc_##pr##_##mo##_##fu##_##na.ps_running)) { \ struct pmckern_soft ks; \ register_t intr; \ intr = intr_disable(); \ ks.pm_ev = pmc_##pr##_##mo##_##fu##_##na.ps_ev.pm_ev_code; \ ks.pm_cpu = PCPU_GET(cpuid); \ ks.pm_tf = tf; \ PMC_CALL_HOOK_UNLOCKED(curthread, \ PMC_FN_SOFT_SAMPLING, (void *) &ks); \ intr_restore(intr); \ } \ } while (0) struct pmc_soft { int ps_running; void (*ps_alloc)(void); void (*ps_release)(void); struct pmc_dyn_event_descr ps_ev; }; struct pmclog_buffer; struct pmc_domain_buffer_header { struct mtx pdbh_mtx; TAILQ_HEAD(, pmclog_buffer) pdbh_head; struct pmclog_buffer *pdbh_plbs; int pdbh_ncpus; } __aligned(CACHE_LINE_SIZE); /* hook */ extern int (*pmc_hook)(struct thread *_td, int _function, void *_arg); extern int (*pmc_intr)(struct trapframe *_frame); /* SX lock protecting the hook */ extern struct sx pmc_sx; /* Per-cpu flags indicating availability of sampling data */ DPCPU_DECLARE(uint8_t, pmc_sampled); /* Count of system-wide sampling PMCs in existence */ extern volatile int pmc_ss_count; /* kernel version number */ extern const int pmc_kernel_version; /* PMC soft per cpu trapframe */ extern struct trapframe pmc_tf[MAXCPU]; /* per domain buffer header list */ extern struct pmc_domain_buffer_header *pmc_dom_hdrs[MAXMEMDOM]; /* Quick check if preparatory work is necessary */ #define PMC_HOOK_INSTALLED(cmd) __predict_false(pmc_hook != NULL) /* Hook invocation; for use within the kernel */ #define PMC_CALL_HOOK(t, cmd, arg) \ do { \ struct epoch_tracker et; \ epoch_enter_preempt(global_epoch_preempt, &et); \ if (pmc_hook != NULL) \ (pmc_hook)((t), (cmd), (arg)); \ epoch_exit_preempt(global_epoch_preempt, &et); \ } while (0) /* Hook invocation that needs an exclusive lock */ #define PMC_CALL_HOOK_X(t, cmd, arg) \ do { \ sx_xlock(&pmc_sx); \ if (pmc_hook != NULL) \ (pmc_hook)((t), (cmd), (arg)); \ sx_xunlock(&pmc_sx); \ } while (0) /* * Some hook invocations (e.g., 
from context switch and clock handling * code) need to be lock-free. */ #define PMC_CALL_HOOK_UNLOCKED(t, cmd, arg) \ do { \ if (pmc_hook != NULL) \ (pmc_hook)((t), (cmd), (arg)); \ } while (0) #define PMC_SWITCH_CONTEXT(t,cmd) PMC_CALL_HOOK_UNLOCKED(t,cmd,NULL) /* Check if a process is using HWPMCs.*/ #define PMC_PROC_IS_USING_PMCS(p) \ (__predict_false(p->p_flag & P_HWPMC)) #define PMC_THREAD_HAS_SAMPLES(td) \ (__predict_false((td)->td_pmcpend)) /* Check if a thread have pending user capture. */ #define PMC_IS_PENDING_CALLCHAIN(p) \ (__predict_false((p)->td_pflags & TDP_CALLCHAIN)) #define PMC_SYSTEM_SAMPLING_ACTIVE() (pmc_ss_count > 0) /* Check if a CPU has recorded samples. */ #define PMC_CPU_HAS_SAMPLES(C) (__predict_false(DPCPU_ID_GET((C), pmc_sampled))) /* * Helper functions. */ int pmc_cpu_is_disabled(int _cpu); /* deprecated */ int pmc_cpu_is_active(int _cpu); int pmc_cpu_is_present(int _cpu); int pmc_cpu_is_primary(int _cpu); unsigned int pmc_cpu_max(void); #ifdef INVARIANTS int pmc_cpu_max_active(void); #endif /* * Soft events functions. */ void pmc_soft_ev_register(struct pmc_soft *ps); void pmc_soft_ev_deregister(struct pmc_soft *ps); struct pmc_soft *pmc_soft_ev_acquire(enum pmc_event ev); void pmc_soft_ev_release(struct pmc_soft *ps); #endif /* _SYS_PMCKERN_H_ */ diff --git a/sys/sys/pmclog.h b/sys/sys/pmclog.h index 0ce2a29263bf..3659b2505daa 100644 --- a/sys/sys/pmclog.h +++ b/sys/sys/pmclog.h @@ -1,329 +1,332 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * * Copyright (c) 2005-2007, Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_PMCLOG_H_ #define _SYS_PMCLOG_H_ #include enum pmclog_type { /* V1 ABI */ PMCLOG_TYPE_CLOSELOG = 1, PMCLOG_TYPE_DROPNOTIFY = 2, PMCLOG_TYPE_INITIALIZE = 3, PMCLOG_TYPE_PMCALLOCATE = 5, PMCLOG_TYPE_PMCATTACH = 6, PMCLOG_TYPE_PMCDETACH = 7, PMCLOG_TYPE_PROCCSW = 8, PMCLOG_TYPE_PROCEXEC = 9, PMCLOG_TYPE_PROCEXIT = 10, PMCLOG_TYPE_PROCFORK = 11, PMCLOG_TYPE_SYSEXIT = 12, PMCLOG_TYPE_USERDATA = 13, /* * V2 ABI * * The MAP_{IN,OUT} event types obsolete the MAPPING_CHANGE * event type. The CALLCHAIN event type obsoletes the * PCSAMPLE event type. */ PMCLOG_TYPE_MAP_IN = 14, PMCLOG_TYPE_MAP_OUT = 15, PMCLOG_TYPE_CALLCHAIN = 16, /* * V3 ABI * * New variant of PMCLOG_TYPE_PMCALLOCATE for dynamic event. 
*/ PMCLOG_TYPE_PMCALLOCATEDYN = 17, /* * V6 ABI */ PMCLOG_TYPE_THR_CREATE = 18, PMCLOG_TYPE_THR_EXIT = 19, PMCLOG_TYPE_PROC_CREATE = 20 }; /* * A log entry descriptor comprises of a 32 bit header and a 64 bit * time stamp followed by as many 32 bit words are required to record * the event. * * Header field format: * * 31 24 16 0 * +------------+------------+-----------------------------------+ * | MAGIC | TYPE | LENGTH | * +------------+------------+-----------------------------------+ * * MAGIC is the constant PMCLOG_HEADER_MAGIC. * TYPE contains a value of type enum pmclog_type. * LENGTH contains the length of the event record, in bytes. */ #define PMCLOG_ENTRY_HEADER \ uint32_t pl_header; \ uint32_t pl_spare; \ uint64_t pl_tsc; \ struct pmclog_header { PMCLOG_ENTRY_HEADER; }; /* * The following structures are used to describe the size of each kind * of log entry to sizeof(). To keep the compiler from adding * padding, the fields of each structure are aligned to their natural * boundaries, and the structures are marked as 'packed'. * * The actual reading and writing of the log file is always in terms * of 4 byte quantities. 
*/ struct pmclog_callchain { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_tid; uint32_t pl_pmcid; uint32_t pl_cpuflags; /* 8 byte aligned */ uintptr_t pl_pc[PMC_CALLCHAIN_DEPTH_MAX]; } __packed; #define PMC_CALLCHAIN_CPUFLAGS_TO_CPU(CF) (((CF) >> 16) & 0xFFFF) #define PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(CF) ((CF) & PMC_CC_F_USERSPACE) #define PMC_CALLCHAIN_TO_CPUFLAGS(CPU,FLAGS) \ (((CPU) << 16) | ((FLAGS) & 0xFFFF)) struct pmclog_closelog { PMCLOG_ENTRY_HEADER }; struct pmclog_dropnotify { PMCLOG_ENTRY_HEADER }; struct pmclog_initialize { PMCLOG_ENTRY_HEADER uint32_t pl_version; /* driver version */ uint32_t pl_cpu; /* enum pmc_cputype */ uint64_t pl_tsc_freq; struct timespec pl_ts; char pl_cpuid[PMC_CPUID_LEN]; } __packed; struct pmclog_map_in { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_pad; uintfptr_t pl_start; /* 8 byte aligned */ char pl_pathname[PATH_MAX]; } __packed; struct pmclog_map_out { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_pad; uintfptr_t pl_start; /* 8 byte aligned */ uintfptr_t pl_end; } __packed; struct pmclog_pmcallocate { PMCLOG_ENTRY_HEADER uint32_t pl_pmcid; uint32_t pl_event; uint32_t pl_flags; uint32_t pl_pad; uint64_t pl_rate; } __packed; struct pmclog_pmcattach { PMCLOG_ENTRY_HEADER uint32_t pl_pmcid; uint32_t pl_pid; char pl_pathname[PATH_MAX]; } __packed; struct pmclog_pmcdetach { PMCLOG_ENTRY_HEADER uint32_t pl_pmcid; uint32_t pl_pid; } __packed; struct pmclog_proccsw { PMCLOG_ENTRY_HEADER uint64_t pl_value; /* keep 8 byte aligned */ uint32_t pl_pmcid; uint32_t pl_pid; uint32_t pl_tid; uint32_t pl_pad; } __packed; struct pmclog_proccreate { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_flags; char pl_pcomm[MAXCOMLEN+1]; /* keep 8 byte aligned */ } __packed; struct pmclog_procexec { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_pmcid; - uintfptr_t pl_start; /* keep 8 byte aligned */ + /* keep 8 byte aligned */ + uintptr_t pl_base; /* AT_BASE */ + /* keep 8 byte aligned */ + uintptr_t pl_dyn; /* PIE load base */ 
char pl_pathname[PATH_MAX]; } __packed; struct pmclog_procexit { PMCLOG_ENTRY_HEADER uint32_t pl_pmcid; uint32_t pl_pid; uint64_t pl_value; /* keep 8 byte aligned */ } __packed; struct pmclog_procfork { PMCLOG_ENTRY_HEADER uint32_t pl_oldpid; uint32_t pl_newpid; } __packed; struct pmclog_sysexit { PMCLOG_ENTRY_HEADER uint32_t pl_pid; uint32_t pl_pad; } __packed; struct pmclog_threadcreate { PMCLOG_ENTRY_HEADER uint32_t pl_tid; uint32_t pl_pid; uint32_t pl_flags; uint32_t pl_pad; char pl_tdname[MAXCOMLEN+1]; /* keep 8 byte aligned */ } __packed; struct pmclog_threadexit { PMCLOG_ENTRY_HEADER uint32_t pl_tid; uint32_t pl_pad; } __packed; struct pmclog_userdata { PMCLOG_ENTRY_HEADER uint32_t pl_userdata; uint32_t pl_pad; } __packed; struct pmclog_pmcallocatedyn { PMCLOG_ENTRY_HEADER uint32_t pl_pmcid; uint32_t pl_event; uint32_t pl_flags; uint32_t pl_pad; char pl_evname[PMC_NAME_MAX]; } __packed; union pmclog_entry { /* only used to size scratch areas */ struct pmclog_callchain pl_cc; struct pmclog_closelog pl_cl; struct pmclog_dropnotify pl_dn; struct pmclog_initialize pl_i; struct pmclog_map_in pl_mi; struct pmclog_map_out pl_mo; struct pmclog_pmcallocate pl_a; struct pmclog_pmcallocatedyn pl_ad; struct pmclog_pmcattach pl_t; struct pmclog_pmcdetach pl_d; struct pmclog_proccsw pl_c; struct pmclog_proccreate pl_pc; struct pmclog_procexec pl_x; struct pmclog_procexit pl_e; struct pmclog_procfork pl_f; struct pmclog_sysexit pl_se; struct pmclog_threadcreate pl_tc; struct pmclog_threadexit pl_te; struct pmclog_userdata pl_u; }; #define PMCLOG_HEADER_MAGIC 0xEEU #define PMCLOG_HEADER_TO_LENGTH(H) \ ((H) & 0x0000FFFF) #define PMCLOG_HEADER_TO_TYPE(H) \ (((H) & 0x00FF0000) >> 16) #define PMCLOG_HEADER_TO_MAGIC(H) \ (((H) & 0xFF000000) >> 24) #define PMCLOG_HEADER_CHECK_MAGIC(H) \ (PMCLOG_HEADER_TO_MAGIC(H) == PMCLOG_HEADER_MAGIC) #ifdef _KERNEL /* * Prototypes */ int pmclog_configure_log(struct pmc_mdep *_md, struct pmc_owner *_po, int _logfd); int 
pmclog_deconfigure_log(struct pmc_owner *_po); int pmclog_flush(struct pmc_owner *_po, int force); int pmclog_close(struct pmc_owner *_po); void pmclog_initialize(void); int pmclog_proc_create(struct thread *td, void **handlep); void pmclog_proc_ignite(void *handle, struct pmc_owner *po); void pmclog_process_callchain(struct pmc *_pm, struct pmc_sample *_ps); void pmclog_process_closelog(struct pmc_owner *po); void pmclog_process_dropnotify(struct pmc_owner *po); void pmclog_process_map_in(struct pmc_owner *po, pid_t pid, uintfptr_t start, const char *path); void pmclog_process_map_out(struct pmc_owner *po, pid_t pid, uintfptr_t start, uintfptr_t end); void pmclog_process_pmcallocate(struct pmc *_pm); void pmclog_process_pmcattach(struct pmc *_pm, pid_t _pid, char *_path); void pmclog_process_pmcdetach(struct pmc *_pm, pid_t _pid); void pmclog_process_proccsw(struct pmc *_pm, struct pmc_process *_pp, pmc_value_t _v, struct thread *); void pmclog_process_procexec(struct pmc_owner *_po, pmc_id_t _pmid, pid_t _pid, - uintfptr_t _startaddr, char *_path); + uintfptr_t _baseaddr, uintptr_t _dynaddr, char *_path); void pmclog_process_procexit(struct pmc *_pm, struct pmc_process *_pp); void pmclog_process_procfork(struct pmc_owner *_po, pid_t _oldpid, pid_t _newpid); void pmclog_process_sysexit(struct pmc_owner *_po, pid_t _pid); void pmclog_process_threadcreate(struct pmc_owner *_po, struct thread *td, int sync); void pmclog_process_threadexit(struct pmc_owner *_po, struct thread *td); void pmclog_process_proccreate(struct pmc_owner *_po, struct proc *p, int sync); int pmclog_process_userlog(struct pmc_owner *_po, struct pmc_op_writelog *_wl); void pmclog_shutdown(void); #endif /* _KERNEL */ #endif /* _SYS_PMCLOG_H_ */ diff --git a/usr.sbin/pmcstat/pmcstat_log.c b/usr.sbin/pmcstat/pmcstat_log.c index 3f764da964cd..7ff5d032fc99 100644 --- a/usr.sbin/pmcstat/pmcstat_log.c +++ b/usr.sbin/pmcstat/pmcstat_log.c @@ -1,771 +1,772 @@ /*- * SPDX-License-Identifier: BSD-2-Clause * 
* Copyright (c) 2005-2007, Joseph Koshy * Copyright (c) 2007 The FreeBSD Foundation * All rights reserved. * * Portions of this software were developed by A. Joseph Koshy under * sponsorship from the FreeBSD Foundation and Google, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * Transform a hwpmc(4) log into human readable form, and into * gprof(1) compatible profiles. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "pmcstat.h" #include "pmcstat_log.h" #include "pmcstat_top.h" /* * PUBLIC INTERFACES * * pmcstat_initialize_logging() initialize this module, called first * pmcstat_shutdown_logging() orderly shutdown, called last * pmcstat_open_log() open an eventlog for processing * pmcstat_process_log() print/convert an event log * pmcstat_display_log() top mode display for the log * pmcstat_close_log() finish processing an event log * * IMPLEMENTATION NOTES * * We correlate each 'callchain' or 'sample' entry seen in the event * log back to an executable object in the system. Executable objects * include: * - program executables, * - shared libraries loaded by the runtime loader, * - dlopen()'ed objects loaded by the program, * - the runtime loader itself, * - the kernel and kernel modules. * * Each process that we know about is treated as a set of regions that * map to executable objects. Processes are described by * 'pmcstat_process' structures. Executable objects are tracked by * 'pmcstat_image' structures. The kernel and kernel modules are * common to all processes (they reside at the same virtual addresses * for all processes). Individual processes can have their text * segments and shared libraries loaded at process-specific locations. * * A given executable object can be in use by multiple processes * (e.g., libc.so) and loaded at a different address in each. * pmcstat_pcmap structures track per-image mappings. * * The sample log could have samples from multiple PMCs; we * generate one 'gmon.out' profile per PMC. * * IMPLEMENTATION OF GMON OUTPUT * * Each executable object gets one 'gmon.out' profile, per PMC in * use. 
Creation of 'gmon.out' profiles is done lazily. The * 'gmon.out' profiles generated for a given sampling PMC are * aggregates of all the samples for that particular executable * object. * * IMPLEMENTATION OF SYSTEM-WIDE CALLGRAPH OUTPUT * * Each active pmcid has its own callgraph structure, described by a * 'struct pmcstat_callgraph'. Given a process id and a list of pc * values, we map each pc value to a tuple (image, symbol), where * 'image' denotes an executable object and 'symbol' is the closest * symbol that precedes the pc value. Each pc value in the list is * also given a 'rank' that reflects its depth in the call stack. */ struct pmcstat_pmcs pmcstat_pmcs = LIST_HEAD_INITIALIZER(pmcstat_pmcs); /* * All image descriptors are kept in a hash table. */ struct pmcstat_image_hash_list pmcstat_image_hash[PMCSTAT_NHASH]; /* * All process descriptors are kept in a hash table. */ struct pmcstat_process_hash_list pmcstat_process_hash[PMCSTAT_NHASH]; struct pmcstat_stats pmcstat_stats; /* statistics */ static int ps_samples_period; /* samples count between top refresh. 
*/ struct pmcstat_process *pmcstat_kernproc; /* kernel 'process' */ #include "pmcpl_gprof.h" #include "pmcpl_callgraph.h" #include "pmcpl_annotate.h" #include "pmcpl_annotate_cg.h" #include "pmcpl_calltree.h" static struct pmc_plugins plugins[] = { { .pl_name = "none", }, { .pl_name = "callgraph", .pl_init = pmcpl_cg_init, .pl_shutdown = pmcpl_cg_shutdown, .pl_process = pmcpl_cg_process, .pl_topkeypress = pmcpl_cg_topkeypress, .pl_topdisplay = pmcpl_cg_topdisplay }, { .pl_name = "gprof", .pl_shutdown = pmcpl_gmon_shutdown, .pl_process = pmcpl_gmon_process, .pl_initimage = pmcpl_gmon_initimage, .pl_shutdownimage = pmcpl_gmon_shutdownimage, .pl_newpmc = pmcpl_gmon_newpmc }, { .pl_name = "annotate", .pl_process = pmcpl_annotate_process }, { .pl_name = "calltree", .pl_configure = pmcpl_ct_configure, .pl_init = pmcpl_ct_init, .pl_shutdown = pmcpl_ct_shutdown, .pl_process = pmcpl_ct_process, .pl_topkeypress = pmcpl_ct_topkeypress, .pl_topdisplay = pmcpl_ct_topdisplay }, { .pl_name = "annotate_cg", .pl_process = pmcpl_annotate_cg_process }, { .pl_name = NULL } }; static int pmcstat_mergepmc; int pmcstat_pmcinfilter = 0; /* PMC filter for top mode. */ float pmcstat_threshold = 0.5; /* Cost filter for top mode. */ /* * Prototypes */ static void pmcstat_stats_reset(int _reset_global); /* * PMC count. */ int pmcstat_npmcs; /* * PMC Top mode pause state. */ static int pmcstat_pause; static void pmcstat_stats_reset(int reset_global) { struct pmcstat_pmcrecord *pr; /* Flush PMCs stats. */ LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) { pr->pr_samples = 0; pr->pr_dubious_frames = 0; } ps_samples_period = 0; /* Flush global stats. */ if (reset_global) bzero(&pmcstat_stats, sizeof(struct pmcstat_stats)); } /* * Resolve file name and line number for the given address. 
*/ int pmcstat_image_addr2line(struct pmcstat_image *image, uintfptr_t addr, char *sourcefile, size_t sourcefile_len, unsigned *sourceline, char *funcname, size_t funcname_len) { static int addr2line_warn = 0; char *sep, cmdline[PATH_MAX], imagepath[PATH_MAX]; unsigned l; int fd; if (image->pi_addr2line == NULL) { /* Try default debug file location. */ snprintf(imagepath, sizeof(imagepath), "/usr/lib/debug/%s%s.debug", args.pa_fsroot, pmcstat_string_unintern(image->pi_fullpath)); fd = open(imagepath, O_RDONLY); if (fd < 0) { /* Old kernel symbol path. */ snprintf(imagepath, sizeof(imagepath), "%s%s.symbols", args.pa_fsroot, pmcstat_string_unintern(image->pi_fullpath)); fd = open(imagepath, O_RDONLY); if (fd < 0) { snprintf(imagepath, sizeof(imagepath), "%s%s", args.pa_fsroot, pmcstat_string_unintern( image->pi_fullpath)); } } if (fd >= 0) close(fd); /* * New addr2line support recursive inline function with -i * but the format does not add a marker when no more entries * are available. */ snprintf(cmdline, sizeof(cmdline), "addr2line -Cfe \"%s\"", imagepath); image->pi_addr2line = popen(cmdline, "r+"); if (image->pi_addr2line == NULL) { if (!addr2line_warn) { addr2line_warn = 1; warnx( "WARNING: addr2line is needed for source code information." 
); } return (0); } } if (feof(image->pi_addr2line) || ferror(image->pi_addr2line)) { warnx("WARNING: addr2line pipe error"); pclose(image->pi_addr2line); image->pi_addr2line = NULL; return (0); } fprintf(image->pi_addr2line, "%p\n", (void *)addr); if (fgets(funcname, funcname_len, image->pi_addr2line) == NULL) { warnx("WARNING: addr2line function name read error"); return (0); } sep = strchr(funcname, '\n'); if (sep != NULL) *sep = '\0'; if (fgets(sourcefile, sourcefile_len, image->pi_addr2line) == NULL) { warnx("WARNING: addr2line source file read error"); return (0); } sep = strchr(sourcefile, ':'); if (sep == NULL) { warnx("WARNING: addr2line source line separator missing"); return (0); } *sep = '\0'; l = atoi(sep+1); if (l == 0) return (0); *sourceline = l; return (1); } /* * Given a pmcid in use, find its human-readable name. */ const char * pmcstat_pmcid_to_name(pmc_id_t pmcid) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcid == pmcid) return (pmcstat_string_unintern(pr->pr_pmcname)); return NULL; } /* * Convert PMC index to name. */ const char * pmcstat_pmcindex_to_name(int pmcin) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcin == pmcin) return pmcstat_string_unintern(pr->pr_pmcname); return NULL; } /* * Return PMC record with given index. */ struct pmcstat_pmcrecord * pmcstat_pmcindex_to_pmcr(int pmcin) { struct pmcstat_pmcrecord *pr; LIST_FOREACH(pr, &pmcstat_pmcs, pr_next) if (pr->pr_pmcin == pmcin) return pr; return NULL; } /* * Print log entries as text. */ static int pmcstat_print_log(void) { struct pmclog_ev ev; uint32_t npc; while (pmclog_read(args.pa_logparser, &ev) == 0) { assert(ev.pl_state == PMCLOG_OK); switch (ev.pl_type) { case PMCLOG_TYPE_CALLCHAIN: PMCSTAT_PRINT_ENTRY("callchain", "%d 0x%x %d %d %c", ev.pl_u.pl_cc.pl_pid, ev.pl_u.pl_cc.pl_pmcid, PMC_CALLCHAIN_CPUFLAGS_TO_CPU(ev.pl_u.pl_cc. 
\ pl_cpuflags), ev.pl_u.pl_cc.pl_npc, PMC_CALLCHAIN_CPUFLAGS_TO_USERMODE(ev.pl_u.pl_cc.\ pl_cpuflags) ? 'u' : 's'); for (npc = 0; npc < ev.pl_u.pl_cc.pl_npc; npc++) PMCSTAT_PRINT_ENTRY("...", "%p", (void *) ev.pl_u.pl_cc.pl_pc[npc]); break; case PMCLOG_TYPE_CLOSELOG: PMCSTAT_PRINT_ENTRY("closelog",); break; case PMCLOG_TYPE_DROPNOTIFY: PMCSTAT_PRINT_ENTRY("drop",); break; case PMCLOG_TYPE_INITIALIZE: PMCSTAT_PRINT_ENTRY("initlog","0x%x \"%s\"", ev.pl_u.pl_i.pl_version, pmc_name_of_cputype(ev.pl_u.pl_i.pl_arch)); if ((ev.pl_u.pl_i.pl_version & 0xFF000000) != PMC_VERSION_MAJOR << 24) warnx( "WARNING: Log version 0x%x != expected version 0x%x.", ev.pl_u.pl_i.pl_version, PMC_VERSION); break; case PMCLOG_TYPE_MAP_IN: PMCSTAT_PRINT_ENTRY("map-in","%d %p \"%s\"", ev.pl_u.pl_mi.pl_pid, (void *) ev.pl_u.pl_mi.pl_start, ev.pl_u.pl_mi.pl_pathname); break; case PMCLOG_TYPE_MAP_OUT: PMCSTAT_PRINT_ENTRY("map-out","%d %p %p", ev.pl_u.pl_mo.pl_pid, (void *) ev.pl_u.pl_mo.pl_start, (void *) ev.pl_u.pl_mo.pl_end); break; case PMCLOG_TYPE_PMCALLOCATE: PMCSTAT_PRINT_ENTRY("allocate","0x%x \"%s\" 0x%x", ev.pl_u.pl_a.pl_pmcid, ev.pl_u.pl_a.pl_evname, ev.pl_u.pl_a.pl_flags); break; case PMCLOG_TYPE_PMCALLOCATEDYN: PMCSTAT_PRINT_ENTRY("allocatedyn","0x%x \"%s\" 0x%x", ev.pl_u.pl_ad.pl_pmcid, ev.pl_u.pl_ad.pl_evname, ev.pl_u.pl_ad.pl_flags); break; case PMCLOG_TYPE_PMCATTACH: PMCSTAT_PRINT_ENTRY("attach","0x%x %d \"%s\"", ev.pl_u.pl_t.pl_pmcid, ev.pl_u.pl_t.pl_pid, ev.pl_u.pl_t.pl_pathname); break; case PMCLOG_TYPE_PMCDETACH: PMCSTAT_PRINT_ENTRY("detach","0x%x %d", ev.pl_u.pl_d.pl_pmcid, ev.pl_u.pl_d.pl_pid); break; case PMCLOG_TYPE_PROCCSW: PMCSTAT_PRINT_ENTRY("cswval","0x%x %d %jd", ev.pl_u.pl_c.pl_pmcid, ev.pl_u.pl_c.pl_pid, ev.pl_u.pl_c.pl_value); break; case PMCLOG_TYPE_PROC_CREATE: PMCSTAT_PRINT_ENTRY("create","%d %x \"%s\"", ev.pl_u.pl_pc.pl_pid, ev.pl_u.pl_pc.pl_flags, ev.pl_u.pl_pc.pl_pcomm); break; case PMCLOG_TYPE_PROCEXEC: - PMCSTAT_PRINT_ENTRY("exec","0x%x %d %p \"%s\"", + 
PMCSTAT_PRINT_ENTRY("exec","0x%x %d %p %p \"%s\"", ev.pl_u.pl_x.pl_pmcid, ev.pl_u.pl_x.pl_pid, - (void *) ev.pl_u.pl_x.pl_entryaddr, + (void *)ev.pl_u.pl_x.pl_baseaddr, + (void *)ev.pl_u.pl_x.pl_dynaddr, ev.pl_u.pl_x.pl_pathname); break; case PMCLOG_TYPE_PROCEXIT: PMCSTAT_PRINT_ENTRY("exitval","0x%x %d %jd", ev.pl_u.pl_e.pl_pmcid, ev.pl_u.pl_e.pl_pid, ev.pl_u.pl_e.pl_value); break; case PMCLOG_TYPE_PROCFORK: PMCSTAT_PRINT_ENTRY("fork","%d %d", ev.pl_u.pl_f.pl_oldpid, ev.pl_u.pl_f.pl_newpid); break; case PMCLOG_TYPE_USERDATA: PMCSTAT_PRINT_ENTRY("userdata","0x%x", ev.pl_u.pl_u.pl_userdata); break; case PMCLOG_TYPE_SYSEXIT: PMCSTAT_PRINT_ENTRY("exit","%d", ev.pl_u.pl_se.pl_pid); break; case PMCLOG_TYPE_THR_CREATE: PMCSTAT_PRINT_ENTRY("thr-create","%d %d %x \"%s\"", ev.pl_u.pl_tc.pl_tid, ev.pl_u.pl_tc.pl_pid, ev.pl_u.pl_tc.pl_flags, ev.pl_u.pl_tc.pl_tdname); break; case PMCLOG_TYPE_THR_EXIT: PMCSTAT_PRINT_ENTRY("thr-exit","%d", ev.pl_u.pl_tc.pl_tid); break; default: fprintf(args.pa_printfile, "unknown event (type %d).\n", ev.pl_type); } } if (ev.pl_state == PMCLOG_EOF) return (PMCSTAT_FINISHED); else if (ev.pl_state == PMCLOG_REQUIRE_DATA) return (PMCSTAT_RUNNING); errx(EX_DATAERR, "ERROR: event parsing failed (record %jd, offset 0x%jx).", (uintmax_t) ev.pl_count + 1, ev.pl_offset); /*NOTREACHED*/ } /* * Public Interfaces. */ /* * Process a log file in offline analysis mode. */ int pmcstat_process_log(void) { /* * If analysis has not been asked for, just print the log to * the current output file. */ if (args.pa_flags & FLAG_DO_PRINT) return (pmcstat_print_log()); else return (pmcstat_analyze_log(&args, plugins, &pmcstat_stats, pmcstat_kernproc, pmcstat_mergepmc, &pmcstat_npmcs, &ps_samples_period)); } /* * Refresh top display. */ static void pmcstat_refresh_top(void) { int v_attrs; float v; char pmcname[40]; struct pmcstat_pmcrecord *pmcpr; /* If in pause mode do not refresh display. */ if (pmcstat_pause) return; /* Wait until PMC pop in the log. 
*/ pmcpr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter); if (pmcpr == NULL) return; /* Format PMC name. */ if (pmcstat_mergepmc) snprintf(pmcname, sizeof(pmcname), "[%s]", pmcstat_string_unintern(pmcpr->pr_pmcname)); else snprintf(pmcname, sizeof(pmcname), "%s.%d", pmcstat_string_unintern(pmcpr->pr_pmcname), pmcstat_pmcinfilter); /* Format samples count. */ if (ps_samples_period > 0) v = (pmcpr->pr_samples * 100.0) / ps_samples_period; else v = 0.; v_attrs = PMCSTAT_ATTRPERCENT(v); PMCSTAT_PRINTBEGIN(); PMCSTAT_PRINTW("PMC: %s Samples: %u ", pmcname, pmcpr->pr_samples); PMCSTAT_ATTRON(v_attrs); PMCSTAT_PRINTW("(%.1f%%) ", v); PMCSTAT_ATTROFF(v_attrs); PMCSTAT_PRINTW(", %u unresolved\n\n", pmcpr->pr_dubious_frames); if (plugins[args.pa_plugin].pl_topdisplay != NULL) plugins[args.pa_plugin].pl_topdisplay(); PMCSTAT_PRINTEND(); } /* * Find the next pmc index to display. */ static void pmcstat_changefilter(void) { int pmcin; struct pmcstat_pmcrecord *pmcr; /* * Find the next merge target. */ if (pmcstat_mergepmc) { pmcin = pmcstat_pmcinfilter; do { pmcr = pmcstat_pmcindex_to_pmcr(pmcstat_pmcinfilter); if (pmcr == NULL || pmcr == pmcr->pr_merge) break; pmcstat_pmcinfilter++; if (pmcstat_pmcinfilter >= pmcstat_npmcs) pmcstat_pmcinfilter = 0; } while (pmcstat_pmcinfilter != pmcin); } } /* * Top mode keypress. 
*/ int pmcstat_keypress_log(void) { int c, ret = 0; WINDOW *w; w = newwin(1, 0, 1, 0); c = wgetch(w); wprintw(w, "Key: %c => ", c); switch (c) { case 'A': if (args.pa_flags & FLAG_SKIP_TOP_FN_RES) args.pa_flags &= ~FLAG_SKIP_TOP_FN_RES; else args.pa_flags |= FLAG_SKIP_TOP_FN_RES; break; case 'c': wprintw(w, "enter mode 'd' or 'a' => "); c = wgetch(w); if (c == 'd') { args.pa_topmode = PMCSTAT_TOP_DELTA; wprintw(w, "switching to delta mode"); } else { args.pa_topmode = PMCSTAT_TOP_ACCUM; wprintw(w, "switching to accumulation mode"); } break; case 'I': if (args.pa_flags & FLAG_SHOW_OFFSET) args.pa_flags &= ~FLAG_SHOW_OFFSET; else args.pa_flags |= FLAG_SHOW_OFFSET; break; case 'm': pmcstat_mergepmc = !pmcstat_mergepmc; /* * Changing merge state require data reset. */ if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); /* Update filter to be on a merge target. */ pmcstat_changefilter(); wprintw(w, "merge PMC %s", pmcstat_mergepmc ? "on" : "off"); break; case 'n': /* Close current plugin. */ if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); /* Find next top display available. */ do { args.pa_plugin++; if (plugins[args.pa_plugin].pl_name == NULL) args.pa_plugin = 0; } while (plugins[args.pa_plugin].pl_topdisplay == NULL); /* Open new plugin. 
*/ pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); wprintw(w, "switching to plugin %s", plugins[args.pa_plugin].pl_name); break; case 'p': pmcstat_pmcinfilter++; if (pmcstat_pmcinfilter >= pmcstat_npmcs) pmcstat_pmcinfilter = 0; pmcstat_changefilter(); wprintw(w, "switching to PMC %s.%d", pmcstat_pmcindex_to_name(pmcstat_pmcinfilter), pmcstat_pmcinfilter); break; case ' ': pmcstat_pause = !pmcstat_pause; if (pmcstat_pause) wprintw(w, "pause => press space again to continue"); break; case 'q': wprintw(w, "exiting..."); ret = 1; break; default: if (plugins[args.pa_plugin].pl_topkeypress != NULL) if (plugins[args.pa_plugin].pl_topkeypress(c, (void *)w)) ret = 1; } wrefresh(w); delwin(w); return ret; } /* * Top mode display. */ void pmcstat_display_log(void) { pmcstat_refresh_top(); /* Reset everythings if delta mode. */ if (args.pa_topmode == PMCSTAT_TOP_DELTA) { if (plugins[args.pa_plugin].pl_shutdown != NULL) plugins[args.pa_plugin].pl_shutdown(NULL); pmcstat_stats_reset(0); if (plugins[args.pa_plugin].pl_init != NULL) plugins[args.pa_plugin].pl_init(); } } /* * Configure a plugins. */ void pmcstat_pluginconfigure_log(char *opt) { if (strncmp(opt, "threshold=", 10) == 0) { pmcstat_threshold = atof(opt+10); } else { if (plugins[args.pa_plugin].pl_configure != NULL) { if (!plugins[args.pa_plugin].pl_configure(opt)) err(EX_USAGE, "ERROR: unknown option <%s>.", opt); } } } void pmcstat_log_shutdown_logging(void) { pmcstat_shutdown_logging(&args, plugins, &pmcstat_stats); } void pmcstat_log_initialize_logging(void) { pmcstat_initialize_logging(&pmcstat_kernproc, &args, plugins, &pmcstat_npmcs, &pmcstat_mergepmc); }
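The header word layout documented in sys/sys/pmclog.h above (MAGIC in bits 31..24, TYPE in bits 23..16, LENGTH in bits 15..0) can be exercised with a small standalone sketch. The PMCLOG_HEADER_* macros are copied from the header as shown in this patch; the `example_pack_header()` and `example_unpack_header()` helpers and `struct example_header` are hypothetical names used for illustration only and are not part of the patch or of libpmc.

```c
#include <assert.h>
#include <stdint.h>

/* Macros copied from sys/sys/pmclog.h as shown in the patch. */
#define	PMCLOG_HEADER_MAGIC	0xEEU
#define	PMCLOG_HEADER_TO_LENGTH(H)	((H) & 0x0000FFFF)
#define	PMCLOG_HEADER_TO_TYPE(H)	(((H) & 0x00FF0000) >> 16)
#define	PMCLOG_HEADER_TO_MAGIC(H)	(((H) & 0xFF000000) >> 24)
#define	PMCLOG_HEADER_CHECK_MAGIC(H)	\
	(PMCLOG_HEADER_TO_MAGIC(H) == PMCLOG_HEADER_MAGIC)

/* Hypothetical decoded view of a header word (illustration only). */
struct example_header {
	uint32_t type;		/* a value of enum pmclog_type */
	uint32_t length;	/* record length in bytes */
};

/*
 * Hypothetical helper: pack MAGIC, TYPE and LENGTH into one 32 bit
 * header word, following the layout comment in pmclog.h.
 */
uint32_t
example_pack_header(uint32_t type, uint32_t length)
{
	assert(type <= 0xFF);		/* TYPE is an 8 bit field */
	assert(length <= 0xFFFF);	/* LENGTH is a 16 bit field */
	return ((PMCLOG_HEADER_MAGIC << 24) | (type << 16) | length);
}

/*
 * Hypothetical helper: validate the magic byte and split the header
 * word.  Returns 0 on success, -1 if the magic does not match.
 */
int
example_unpack_header(uint32_t h, struct example_header *out)
{
	if (!PMCLOG_HEADER_CHECK_MAGIC(h))
		return (-1);
	out->type = PMCLOG_HEADER_TO_TYPE(h);
	out->length = PMCLOG_HEADER_TO_LENGTH(h);
	return (0);
}
```

For instance, packing type 9 (PMCLOG_TYPE_PROCEXEC in the enum above) with a length of 64 bytes and unpacking the result should give back the same two values, while a word whose top byte is not 0xEE is rejected. The length of 64 is an arbitrary illustrative value, not the real size of a procexec record.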