usr.bin/clang/llvm-mca/llvm-mca.1
.\" $FreeBSD$ | |||||
.\" Man page generated from reStructuredText. | .\" Man page generated from reStructuredText. | ||||
. | . | ||||
. | . | ||||
.nr rst2man-indent-level 0 | .nr rst2man-indent-level 0 | ||||
. | . | ||||
.de1 rstReportMargin | .de1 rstReportMargin | ||||
\\$1 \\n[an-margin] | \\$1 \\n[an-margin] | ||||
level \\n[rst2man-indent-level] | level \\n[rst2man-indent-level] | ||||
.de UNINDENT | .de UNINDENT | ||||
. RE | . RE | ||||
.\" indent \\n[an-margin] | .\" indent \\n[an-margin] | ||||
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] | .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] | ||||
.nr rst2man-indent-level -1 | .nr rst2man-indent-level -1 | ||||
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] | .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] | ||||
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u | .in \\n[rst2man-indent\\n[rst2man-indent-level]]u | ||||
.. | .. | ||||
.TH "LLVM-MCA" "1" "2021-06-07" "12" "LLVM" | .TH "LLVM-MCA" "1" "2021-12-22" "13" "LLVM" | ||||
.SH NAME | .SH NAME | ||||
llvm-mca \- LLVM Machine Code Analyzer | llvm-mca \- LLVM Machine Code Analyzer | ||||
.SH SYNOPSIS | .SH SYNOPSIS | ||||
.sp | .sp | ||||
\fBllvm\-mca\fP [\fIoptions\fP] [input] | \fBllvm\-mca\fP [\fIoptions\fP] [input] | ||||
.SH DESCRIPTION | .SH DESCRIPTION | ||||
.sp | .sp | ||||
\fBllvm\-mca\fP is a performance analysis tool that uses information | \fBllvm\-mca\fP is a performance analysis tool that uses information | ||||
available in LLVM (e.g. scheduling models) to statically measure the performance | available in LLVM (e.g. scheduling models) to statically measure the performance | ||||
of machine code in a specific CPU. | of machine code in a specific CPU. | ||||
.sp | .sp | ||||
Performance is measured in terms of throughput as well as processor resource | Performance is measured in terms of throughput as well as processor resource | ||||
consumption. The tool currently works for processors with an out\-of\-order | consumption. The tool currently works for processors with a backend for which | ||||
backend, for which there is a scheduling model available in LLVM. | there is a scheduling model available in LLVM. | ||||
.sp | .sp | ||||
The main goal of this tool is not just to predict the performance of the code | The main goal of this tool is not just to predict the performance of the code | ||||
when run on the target, but also to help with diagnosing potential performance | when run on the target, but also to help with diagnosing potential performance | ||||
issues. | issues. | ||||
.sp | .sp | ||||
Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions | Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions | ||||
Per Cycle (IPC), as well as hardware resource pressure. The analysis and | Per Cycle (IPC), as well as hardware resource pressure. The analysis and | ||||
reporting style were inspired by the IACA tool from Intel. | reporting style were inspired by the IACA tool from Intel. | ||||
.TP | .TP | ||||
.B \-timeline\-max\-iterations=<iterations> | .B \-timeline\-max\-iterations=<iterations> | ||||
Limit the number of iterations to print in the timeline view. By default, the | Limit the number of iterations to print in the timeline view. By default, the | ||||
timeline view prints information for up to 10 iterations. | timeline view prints information for up to 10 iterations. | ||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | .INDENT 0.0 | ||||
.TP | .TP | ||||
.B \-timeline\-max\-cycles=<cycles> | .B \-timeline\-max\-cycles=<cycles> | ||||
Limit the number of cycles in the timeline view. By default, the number of | Limit the number of cycles in the timeline view, or use 0 for no limit. By | ||||
cycles is set to 80. | default, the number of cycles is set to 80. | ||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | .INDENT 0.0 | ||||
.TP | .TP | ||||
.B \-resource\-pressure | .B \-resource\-pressure | ||||
Enable the resource pressure view. This is enabled by default. | Enable the resource pressure view. This is enabled by default. | ||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | .INDENT 0.0 | ||||
.TP | .TP | ||||
view because it doesn\(aqt require that the code is simulated. It instead prints | view because it doesn\(aqt require that the code is simulated. It instead prints | ||||
the theoretical uniform distribution of resource pressure for every | the theoretical uniform distribution of resource pressure for every | ||||
instruction in sequence. | instruction in sequence. | ||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | .INDENT 0.0 | ||||
.TP | .TP | ||||
.B \-bottleneck\-analysis | .B \-bottleneck\-analysis | ||||
Print information about bottlenecks that affect the throughput. This analysis | Print information about bottlenecks that affect the throughput. This analysis | ||||
can be expensive, and it is disabled by default. Bottlenecks are highlighted | can be expensive, and it is disabled by default. Bottlenecks are highlighted | ||||
in the summary view. | in the summary view. Bottleneck analysis is currently not supported for | ||||
processors with an in\-order backend. | |||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | .INDENT 0.0 | ||||
.TP | .TP | ||||
.B \-json | .B \-json | ||||
Print the requested views in JSON format. The instructions and the processor | Print the requested views in valid JSON format. The instructions and the | ||||
resources are printed as members of special top level JSON objects. The | processor resources are printed as members of special top level JSON objects. | ||||
individual views refer to them by index. | The individual views refer to them by index. However, not all views are | ||||
currently supported. For example, the report from the bottleneck analysis is | |||||
not printed out in JSON. All the default views are currently supported. | |||||
.UNINDENT | .UNINDENT | ||||
.INDENT 0.0 | |||||
.TP | |||||
.B \-disable\-cb | |||||
Force usage of the generic CustomBehaviour class rather than using the target | |||||
specific class. The generic class never detects any custom hazards. | |||||
.UNINDENT | |||||
.SH EXIT STATUS | .SH EXIT STATUS | ||||
.sp | .sp | ||||
\fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed | \fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed | ||||
to standard error, and the tool returns 1. | to standard error, and the tool returns 1. | ||||
.SH USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS | .SH USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS | ||||
.sp | .sp | ||||
\fBllvm\-mca\fP allows for the optional usage of special code comments to | \fBllvm\-mca\fP allows for the optional usage of special code comments to | ||||
mark regions of the assembly code to be analyzed. A comment starting with | mark regions of the assembly code to be analyzed. A comment starting with | ||||
.sp | .sp | ||||
The report is structured in three main sections. The first section collects a | The report is structured in three main sections. The first section collects a | ||||
few performance numbers; the goal of this section is to give a very quick | few performance numbers; the goal of this section is to give a very quick | ||||
overview of the performance throughput. Important performance indicators are | overview of the performance throughput. Important performance indicators are | ||||
\fBIPC\fP, \fBuOps Per Cycle\fP, and \fBBlock RThroughput\fP (Block Reciprocal | \fBIPC\fP, \fBuOps Per Cycle\fP, and \fBBlock RThroughput\fP (Block Reciprocal | ||||
Throughput). | Throughput). | ||||
.sp | .sp | ||||
Field \fIDispatchWidth\fP is the maximum number of micro opcodes that are dispatched | Field \fIDispatchWidth\fP is the maximum number of micro opcodes that are dispatched | ||||
to the out\-of\-order backend every simulated cycle. | to the out\-of\-order backend every simulated cycle. For processors with an | ||||
in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued | |||||
to the backend every simulated cycle. | |||||
.sp | .sp | ||||
IPC is computed by dividing the total number of simulated instructions by the total | IPC is computed by dividing the total number of simulated instructions by the total | ||||
number of cycles. | number of cycles. | ||||
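
As a small illustration of the IPC definition, the sketch below divides a count of simulated instructions by a count of simulated cycles. The function name and the numbers are hypothetical and are not taken from any llvm-mca report.

```python
def ipc(total_instructions: int, total_cycles: int) -> float:
    """Instructions Per Cycle: simulated instructions / simulated cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return total_instructions / total_cycles


# Hypothetical run: 3 instructions per iteration, 300 iterations, 600 cycles.
print(ipc(3 * 300, 600))  # 1.5
```
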
.sp | .sp | ||||
Field \fIBlock RThroughput\fP is the reciprocal of the block throughput. Block | Field \fIBlock RThroughput\fP is the reciprocal of the block throughput. Block | ||||
throughput is a theoretical quantity computed as the maximum number of blocks | throughput is a theoretical quantity computed as the maximum number of blocks | ||||
(i.e. iterations) that can be executed per simulated clock cycle in the absence | (i.e. iterations) that can be executed per simulated clock cycle in the absence | ||||
of loop carried dependencies. Block throughput is superiorly limited by the | of loop carried dependencies. Block throughput is superiorly limited by the | ||||
The \fIcritical sequence\fP is the most expensive sequence of instructions according | The \fIcritical sequence\fP is the most expensive sequence of instructions according | ||||
to the simulation. It is annotated to provide extra information about critical | to the simulation. It is annotated to provide extra information about critical | ||||
register dependencies and resource interferences between instructions. | register dependencies and resource interferences between instructions. | ||||
.sp | .sp | ||||
Instructions from the critical sequence are expected to significantly impact | Instructions from the critical sequence are expected to significantly impact | ||||
performance. By construction, the accuracy of this analysis is strongly | performance. By construction, the accuracy of this analysis is strongly | ||||
dependent on the simulation and (as always) on the quality of the processor | dependent on the simulation and (as always) on the quality of the processor | ||||
model in LLVM. | model in LLVM. | ||||
.sp | |||||
Bottleneck analysis is currently not supported for processors with an in\-order | |||||
backend. | |||||
.SS Extra Statistics to Further Diagnose Performance Issues | .SS Extra Statistics to Further Diagnose Performance Issues | ||||
.sp | .sp | ||||
The \fB\-all\-stats\fP command line option enables extra statistics and performance | The \fB\-all\-stats\fP command line option enables extra statistics and performance | ||||
counters for the dispatch logic, the reorder buffer, the retire control unit, | counters for the dispatch logic, the reorder buffer, the retire control unit, | ||||
and the register file. | and the register file. | ||||
.sp | .sp | ||||
Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP | Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP | ||||
for 300 iterations of the dot\-product example discussed in the previous | for 300 iterations of the dot\-product example discussed in the previous | ||||
.IP \(bu 2 | .IP \(bu 2 | ||||
Issue (Instruction is issued to the processor pipelines). | Issue (Instruction is issued to the processor pipelines). | ||||
.IP \(bu 2 | .IP \(bu 2 | ||||
Write Back (Instruction is executed, and results are written back). | Write Back (Instruction is executed, and results are written back). | ||||
.IP \(bu 2 | .IP \(bu 2 | ||||
Retire (Instruction is retired; writes are architecturally committed). | Retire (Instruction is retired; writes are architecturally committed). | ||||
.UNINDENT | .UNINDENT | ||||
.sp | .sp | ||||
The default pipeline only models the out\-of\-order portion of a processor. | The in\-order pipeline implements the following sequence of stages: | ||||
Therefore, the instruction fetch and decode stages are not modeled. Performance | * InOrderIssue (Instruction is issued to the processor pipelines). | ||||
bottlenecks in the frontend are not diagnosed. \fBllvm\-mca\fP assumes that | * Retire (Instruction is retired; writes are architecturally committed). | ||||
instructions have all been decoded and placed into a queue before the simulation | .sp | ||||
starts. Also, \fBllvm\-mca\fP does not model branch prediction. | \fBllvm\-mca\fP assumes that instructions have all been decoded and placed | ||||
into a queue before the simulation starts. Therefore, the instruction fetch and | |||||
decode stages are not modeled. Performance bottlenecks in the frontend are not | |||||
diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction. | |||||
.SS Instruction Dispatch | .SS Instruction Dispatch | ||||
.sp | .sp | ||||
During the dispatch stage, instructions are picked in program order from a | During the dispatch stage, instructions are picked in program order from a | ||||
queue of already decoded instructions, and dispatched in groups to the | queue of already decoded instructions, and dispatched in groups to the | ||||
simulated hardware schedulers. | simulated hardware schedulers. | ||||
.sp | .sp | ||||
The size of a dispatch group depends on the availability of the simulated | The size of a dispatch group depends on the availability of the simulated | ||||
hardware resources. The processor dispatch width defaults to the value | hardware resources. The processor dispatch width defaults to the value | ||||
A store has to wait until an older store barrier is fully executed. | A store has to wait until an older store barrier is fully executed. | ||||
.IP 4. 3 | .IP 4. 3 | ||||
A load may pass a previous load. | A load may pass a previous load. | ||||
.IP 5. 3 | .IP 5. 3 | ||||
A load may not pass a previous store unless \fB\-noalias\fP is set. | A load may not pass a previous store unless \fB\-noalias\fP is set. | ||||
.IP 6. 3 | .IP 6. 3 | ||||
A load has to wait until an older load barrier is fully executed. | A load has to wait until an older load barrier is fully executed. | ||||
.UNINDENT | .UNINDENT | ||||
.SS In\-order Issue and Execute | |||||
.sp | |||||
In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It | |||||
bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as | |||||
soon as their operand registers are available and resource requirements are | |||||
met. Multiple instructions can be issued in one cycle according to the value of | |||||
the \fBIssueWidth\fP parameter in LLVM\(aqs scheduling model. | |||||
.sp | |||||
Once issued, an instruction is moved to the \fBIssuedInst\fP set until it is ready to | |||||
retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However, | |||||
an instruction is allowed to commit writes and retire out\-of\-order if the | |||||
\fBRetireOOO\fP property is true for at least one of its writes. | |||||
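
The in-order issue rule described in this section can be sketched as a toy model. It is not the InOrderIssueStage implementation: it only tracks operand readiness and an issue width, and it ignores resource requirements, retirement and RetireOOO. The names Inst and issue_cycles, and the latencies used, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    srcs: tuple   # registers read by the instruction
    dst: str      # register written by the instruction
    latency: int  # cycles until dst becomes available


def issue_cycles(insts, issue_width):
    """Return the cycle in which each instruction issues, in program order."""
    ready_at = {}  # register -> first cycle in which its value is available
    cycles = []
    cycle, issued_this_cycle = 0, 0
    for inst in insts:
        # Earliest cycle in which every source operand is available.
        operands_ready = max((ready_at.get(r, 0) for r in inst.srcs), default=0)
        if operands_ready > cycle:
            # Stall until the operands are ready; a new cycle starts empty.
            cycle, issued_this_cycle = operands_ready, 0
        elif issued_this_cycle == issue_width:
            # The current cycle is full; move on to the next one.
            cycle, issued_this_cycle = cycle + 1, 0
        cycles.append(cycle)
        issued_this_cycle += 1
        ready_at[inst.dst] = cycle + inst.latency
    return cycles


program = [
    Inst("a", (), "r1", 1),
    Inst("b", (), "r2", 1),
    Inst("c", ("r1", "r2"), "r3", 3),
    Inst("d", ("r3",), "r4", 1),
]
print(issue_cycles(program, issue_width=2))  # [0, 0, 1, 4]
```

With an issue width of 2, the two independent instructions issue together in cycle 0, while the dependent ones wait for their operands. Because instructions never issue ahead of an older one, a stalled instruction also delays everything behind it.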
.SS Custom Behaviour | |||||
.sp | |||||
Due to certain instructions not being expressed perfectly within their | |||||
scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them | |||||
perfectly. Modifying the scheduling model isn\(aqt always a viable | |||||
option though (maybe because the instruction is modeled incorrectly on | |||||
purpose or the instruction\(aqs behaviour is quite complex). The | |||||
CustomBehaviour class can be used in these cases to enforce proper | |||||
instruction modeling (often by customizing data dependencies and detecting | |||||
hazards that \fBllvm\-mca\fP has no way of knowing about). | |||||
.sp | |||||
\fBllvm\-mca\fP comes with one generic and multiple target specific | |||||
CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP | |||||
flag is used or if a target specific CustomBehaviour class doesn\(aqt exist for | |||||
that target. (The generic class does nothing.) Currently, the CustomBehaviour | |||||
class is only a part of the in\-order pipeline, but there are plans to add it | |||||
to the out\-of\-order pipeline in the future. | |||||
.sp | |||||
CustomBehaviour\(aqs main method is \fIcheckCustomHazard()\fP which uses the | |||||
current instruction and a list of all instructions still executing within | |||||
the pipeline to determine if the current instruction should be dispatched. | |||||
As output, the method returns an integer representing the number of cycles | |||||
that the current instruction must stall for (a value of 0 represents no stall; | |||||
the value can be an underestimate if the exact number isn\(aqt known). | |||||
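
The contract described for checkCustomHazard() can be illustrated with a toy stand-in. This is not the LLVM C++ interface: the function name check_custom_hazard, the dictionary keys and the hazard rule itself (stall while an in-flight instruction shares the same "hazard_group") are all hypothetical, chosen only to show the "return the number of stall cycles, 0 for no stall" convention.

```python
def check_custom_hazard(current, in_flight):
    """Toy hazard check: return the number of cycles to stall (0 = no stall)."""
    stall = 0
    for inst in in_flight:
        # Hypothetical rule: instructions in the same group must not overlap.
        if inst["hazard_group"] == current["hazard_group"]:
            stall = max(stall, inst["cycles_left"])
    return stall


executing = [
    {"hazard_group": "a", "cycles_left": 3},
    {"hazard_group": "b", "cycles_left": 7},
]
print(check_custom_hazard({"hazard_group": "a"}, executing))  # 3
print(check_custom_hazard({"hazard_group": "c"}, executing))  # 0
```
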
.sp | |||||
If you\(aqd like to add a CustomBehaviour class for a target that doesn\(aqt | |||||
already have one, refer to an existing implementation to see how to set it | |||||
up. Remember to look at (and add to) \fI/llvm\-mca/lib/CMakeLists.txt\fP\&. | |||||
.SH AUTHOR | .SH AUTHOR | ||||
Maintained by the LLVM Team (https://llvm.org/). | Maintained by the LLVM Team (https://llvm.org/). | ||||
.SH COPYRIGHT | .SH COPYRIGHT | ||||
2003-2021, LLVM Project | 2003-2021, LLVM Project | ||||
.\" Generated by docutils manpage writer. | .\" Generated by docutils manpage writer. | ||||
. | . |