Differential D33620 Diff 100480 usr.bin/clang/llvm-mca/llvm-mca.1

usr.bin/clang/llvm-mca/llvm-mca.1

	.\" $FreeBSD$
	.\" Man page generated from reStructuredText.			.\" Man page generated from reStructuredText.
	.			.
	.			.
	.nr rst2man-indent-level 0			.nr rst2man-indent-level 0
	.			.
	.de1 rstReportMargin			.de1 rstReportMargin
	\\$1 \\n[an-margin]			\\$1 \\n[an-margin]
	level \\n[rst2man-indent-level]			level \\n[rst2man-indent-level]
	Show All 13 Lines
	.de UNINDENT			.de UNINDENT
	. RE			. RE
	.\" indent \\n[an-margin]			.\" indent \\n[an-margin]
	.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]			.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
	.nr rst2man-indent-level -1			.nr rst2man-indent-level -1
	.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]			.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
	.in \\n[rst2man-indent\\n[rst2man-indent-level]]u			.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
	..			..
	.TH "LLVM-MCA" "1" "2021-06-07" "12" "LLVM"			.TH "LLVM-MCA" "1" "2021-12-22" "13" "LLVM"
	.SH NAME			.SH NAME
	llvm-mca \- LLVM Machine Code Analyzer			llvm-mca \- LLVM Machine Code Analyzer
	.SH SYNOPSIS			.SH SYNOPSIS
	.sp			.sp
	\fBllvm\-mca\fP [\fIoptions\fP] [input]			\fBllvm\-mca\fP [\fIoptions\fP] [input]
	.SH DESCRIPTION			.SH DESCRIPTION
	.sp			.sp
	\fBllvm\-mca\fP is a performance analysis tool that uses information			\fBllvm\-mca\fP is a performance analysis tool that uses information
	available in LLVM (e.g. scheduling models) to statically measure the performance			available in LLVM (e.g. scheduling models) to statically measure the performance
	of machine code in a specific CPU.			of machine code in a specific CPU.
	.sp			.sp
	Performance is measured in terms of throughput as well as processor resource			Performance is measured in terms of throughput as well as processor resource
	consumption. The tool currently works for processors with an out\-of\-order			consumption. The tool currently works for processors with a backend for which
	backend, for which there is a scheduling model available in LLVM.			there is a scheduling model available in LLVM.
	.sp			.sp
	The main goal of this tool is not just to predict the performance of the code			The main goal of this tool is not just to predict the performance of the code
	when run on the target, but also help with diagnosing potential performance			when run on the target, but also help with diagnosing potential performance
	issues.			issues.
	.sp			.sp
	Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions			Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions
	Per Cycle (IPC), as well as hardware resource pressure. The analysis and			Per Cycle (IPC), as well as hardware resource pressure. The analysis and
	reporting style were inspired by the IACA tool from Intel.			reporting style were inspired by the IACA tool from Intel.
	▲ Show 20 Lines • Show All 138 Lines • ▼ Show 20 Lines
	.TP			.TP
	.B \-timeline\-max\-iterations=<iterations>			.B \-timeline\-max\-iterations=<iterations>
	Limit the number of iterations to print in the timeline view. By default, the			Limit the number of iterations to print in the timeline view. By default, the
	timeline view prints information for up to 10 iterations.			timeline view prints information for up to 10 iterations.
	.UNINDENT			.UNINDENT
	.INDENT 0.0			.INDENT 0.0
	.TP			.TP
	.B \-timeline\-max\-cycles=<cycles>			.B \-timeline\-max\-cycles=<cycles>
	Limit the number of cycles in the timeline view. By default, the number of			Limit the number of cycles in the timeline view, or use 0 for no limit. By
	cycles is set to 80.			default, the number of cycles is set to 80.
	.UNINDENT			.UNINDENT
	.INDENT 0.0			.INDENT 0.0
	.TP			.TP
	.B \-resource\-pressure			.B \-resource\-pressure
	Enable the resource pressure view. This is enabled by default.			Enable the resource pressure view. This is enabled by default.
	.UNINDENT			.UNINDENT
	.INDENT 0.0			.INDENT 0.0
	.TP			.TP
	▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
	view because it doesn\(aqt require that the code is simulated. It instead prints			view because it doesn\(aqt require that the code is simulated. It instead prints
	the theoretical uniform distribution of resource pressure for every			the theoretical uniform distribution of resource pressure for every
	instruction in sequence.			instruction in sequence.
	.UNINDENT			.UNINDENT
	.INDENT 0.0			.INDENT 0.0
	.TP			.TP
	.B \-bottleneck\-analysis			.B \-bottleneck\-analysis
	Print information about bottlenecks that affect the throughput. This analysis			Print information about bottlenecks that affect the throughput. This analysis
	can be expensive, and it is disabled by default. Bottlenecks are highlighted			can be expensive, and it is disabled by default. Bottlenecks are highlighted
	in the summary view.			in the summary view. Bottleneck analysis is currently not supported for
				processors with an in\-order backend.
	.UNINDENT			.UNINDENT
	.INDENT 0.0			.INDENT 0.0
	.TP			.TP
	.B \-json			.B \-json
	Print the requested views in JSON format. The instructions and the processor			Print the requested views in valid JSON format. The instructions and the
	resources are printed as members of special top level JSON objects. The			processor resources are printed as members of special top level JSON objects.
	individual views refer to them by index.			The individual views refer to them by index. However, not all views are
				currently supported. For example, the report from the bottleneck analysis is
				not printed out in JSON. All the default views are currently supported.
	.UNINDENT			.UNINDENT
				.INDENT 0.0
				.TP
				.B \-disable\-cb
				Force usage of the generic CustomBehaviour class rather than using the target
				specific class. The generic class never detects any custom hazards.
				.UNINDENT
	.SH EXIT STATUS			.SH EXIT STATUS
	.sp			.sp
	\fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed			\fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
	to standard error, and the tool returns 1.			to standard error, and the tool returns 1.
	.SH USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS			.SH USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
	.sp			.sp
	\fBllvm\-mca\fP allows for the optional usage of special code comments to			\fBllvm\-mca\fP allows for the optional usage of special code comments to
	mark regions of the assembly code to be analyzed. A comment starting with			mark regions of the assembly code to be analyzed. A comment starting with
	▲ Show 20 Lines • Show All 197 Lines • ▼ Show 20 Lines
	.sp			.sp
	The report is structured in three main sections. The first section collects a			The report is structured in three main sections. The first section collects a
	few performance numbers; the goal of this section is to give a very quick			few performance numbers; the goal of this section is to give a very quick
	overview of the performance throughput. Important performance indicators are			overview of the performance throughput. Important performance indicators are
	\fBIPC\fP, \fBuOps Per Cycle\fP, and \fBBlock RThroughput\fP (Block Reciprocal			\fBIPC\fP, \fBuOps Per Cycle\fP, and \fBBlock RThroughput\fP (Block Reciprocal
	Throughput).			Throughput).
	.sp			.sp
	Field \fIDispatchWidth\fP is the maximum number of micro opcodes that are dispatched			Field \fIDispatchWidth\fP is the maximum number of micro opcodes that are dispatched
	to the out\-of\-order backend every simulated cycle.			to the out\-of\-order backend every simulated cycle. For processors with an
				in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued
				to the backend every simulated cycle.
	.sp			.sp
	IPC is computed by dividing the total number of simulated instructions by the total			IPC is computed by dividing the total number of simulated instructions by the total
	number of cycles.			number of cycles.
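The IPC definition above amounts to a single division. A minimal C++ sketch, assuming a hypothetical helper named computeIPC (it is not part of llvm-mca and only illustrates the arithmetic):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an llvm-mca API: IPC as described above,
// i.e. simulated instructions divided by simulated cycles.
double computeIPC(std::uint64_t Instructions, std::uint64_t Cycles) {
  assert(Cycles > 0 && "the simulation must run for at least one cycle");
  return static_cast<double>(Instructions) / static_cast<double>(Cycles);
}
```

For example, 300 simulated instructions over 200 simulated cycles would give an IPC of 1.5. These numbers are illustrative and are not taken from llvm-mca output.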
	.sp			.sp
	Field \fIBlock RThroughput\fP is the reciprocal of the block throughput. Block			Field \fIBlock RThroughput\fP is the reciprocal of the block throughput. Block
	throughput is a theoretical quantity computed as the maximum number of blocks			throughput is a theoretical quantity computed as the maximum number of blocks
	(i.e. iterations) that can be executed per simulated clock cycle in the absence			(i.e. iterations) that can be executed per simulated clock cycle in the absence
	of loop carried dependencies. Block throughput is superiorly limited by the			of loop carried dependencies. Block throughput is superiorly limited by the
	▲ Show 20 Lines • Show All 279 Lines • ▼ Show 20 Lines
	The \fIcritical sequence\fP is the most expensive sequence of instructions according			The \fIcritical sequence\fP is the most expensive sequence of instructions according
	to the simulation. It is annotated to provide extra information about critical			to the simulation. It is annotated to provide extra information about critical
	register dependencies and resource interferences between instructions.			register dependencies and resource interferences between instructions.
	.sp			.sp
	Instructions from the critical sequence are expected to significantly impact			Instructions from the critical sequence are expected to significantly impact
	performance. By construction, the accuracy of this analysis is strongly			performance. By construction, the accuracy of this analysis is strongly
	dependent on the simulation and (as always) on the quality of the processor			dependent on the simulation and (as always) on the quality of the processor
	model in LLVM.			model in LLVM.
				.sp
				Bottleneck analysis is currently not supported for processors with an in\-order
				backend.
	.SS Extra Statistics to Further Diagnose Performance Issues			.SS Extra Statistics to Further Diagnose Performance Issues
	.sp			.sp
	The \fB\-all\-stats\fP command line option enables extra statistics and performance			The \fB\-all\-stats\fP command line option enables extra statistics and performance
	counters for the dispatch logic, the reorder buffer, the retire control unit,			counters for the dispatch logic, the reorder buffer, the retire control unit,
	and the register file.			and the register file.
	.sp			.sp
	Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP			Below is an example of \fB\-all\-stats\fP output generated by \fBllvm\-mca\fP
	for 300 iterations of the dot\-product example discussed in the previous			for 300 iterations of the dot\-product example discussed in the previous
	▲ Show 20 Lines • Show All 141 Lines • ▼ Show 20 Lines
	.IP \(bu 2			.IP \(bu 2
	Issue (Instruction is issued to the processor pipelines).			Issue (Instruction is issued to the processor pipelines).
	.IP \(bu 2			.IP \(bu 2
	Write Back (Instruction is executed, and results are written back).			Write Back (Instruction is executed, and results are written back).
	.IP \(bu 2			.IP \(bu 2
	Retire (Instruction is retired; writes are architecturally committed).			Retire (Instruction is retired; writes are architecturally committed).
	.UNINDENT			.UNINDENT
	.sp			.sp
	The default pipeline only models the out\-of\-order portion of a processor.			The in\-order pipeline implements the following sequence of stages:
	Therefore, the instruction fetch and decode stages are not modeled. Performance			.INDENT 0.0
	bottlenecks in the frontend are not diagnosed. \fBllvm\-mca\fP assumes that			.IP \(bu 2
	instructions have all been decoded and placed into a queue before the simulation			InOrderIssue (Instruction is issued to the processor pipelines).
	start. Also, \fBllvm\-mca\fP does not model branch prediction.			.IP \(bu 2
				Retire (Instruction is retired; writes are architecturally committed).
				.UNINDENT
				.sp
				\fBllvm\-mca\fP assumes that instructions have all been decoded and placed
				into a queue before the simulation starts. Therefore, the instruction fetch and
				decode stages are not modeled. Performance bottlenecks in the frontend are not
				diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction.
	.SS Instruction Dispatch			.SS Instruction Dispatch
	.sp			.sp
	During the dispatch stage, instructions are picked in program order from a			During the dispatch stage, instructions are picked in program order from a
	queue of already decoded instructions, and dispatched in groups to the			queue of already decoded instructions, and dispatched in groups to the
	simulated hardware schedulers.			simulated hardware schedulers.
	.sp			.sp
	The size of a dispatch group depends on the availability of the simulated			The size of a dispatch group depends on the availability of the simulated
	hardware resources. The processor dispatch width defaults to the value			hardware resources. The processor dispatch width defaults to the value
	▲ Show 20 Lines • Show All 160 Lines • ▼ Show 20 Lines
	A store has to wait until an older store barrier is fully executed.			A store has to wait until an older store barrier is fully executed.
	.IP 4. 3			.IP 4. 3
	A load may pass a previous load.			A load may pass a previous load.
	.IP 5. 3			.IP 5. 3
	A load may not pass a previous store unless \fB\-noalias\fP is set.			A load may not pass a previous store unless \fB\-noalias\fP is set.
	.IP 6. 3			.IP 6. 3
	A load has to wait until an older load barrier is fully executed.			A load has to wait until an older load barrier is fully executed.
	.UNINDENT			.UNINDENT
				.SS In\-order Issue and Execute
				.sp
				In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It
				bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as
				soon as their operand registers are available and resource requirements are
				met. Multiple instructions can be issued in one cycle according to the value of
				the \fBIssueWidth\fP parameter in LLVM\(aqs scheduling model.
				.sp
				Once issued, an instruction is moved to the \fBIssuedInst\fP set until it is ready to
				retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However,
				an instruction is allowed to commit writes and retire out\-of\-order if the
				\fBRetireOOO\fP property is true for at least one of its writes.
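The in-order issue rule described above can be sketched in C++ as follows. This is a simplified illustration, not the InOrderIssueStage implementation: the PendingInst type and the countIssuable function are hypothetical, and the sketch assumes that the issue width is counted in whole instructions and that an instruction which is not ready blocks all younger instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction state (not an llvm-mca type).
struct PendingInst {
  bool OperandsReady; // all operand registers are available
  bool ResourcesFree; // the required processor resources are free
};

// Returns how many instructions from the front of Queue can be issued in
// the current cycle. Issue happens in program order, so the first
// instruction that is not ready stops the scan.
std::size_t countIssuable(const std::vector<PendingInst> &Queue,
                          std::size_t IssueWidth) {
  assert(IssueWidth > 0 && "a processor issues at least one instruction");
  std::size_t Issued = 0;
  while (Issued < Queue.size() && Issued < IssueWidth) {
    const PendingInst &Inst = Queue[Issued];
    if (!Inst.OperandsReady || !Inst.ResourcesFree)
      break; // in-order: younger instructions cannot bypass this one
    ++Issued;
  }
  return Issued;
}
```

With three ready instructions and an issue width of 2, the sketch issues two instructions in the cycle. If the second instruction is waiting on an operand, only the first one is issued, whatever the issue width.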
				.SS Custom Behaviour
				.sp
				Due to certain instructions not being expressed perfectly within their
				scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them
				perfectly. Modifying the scheduling model isn\(aqt always a viable
				option though (maybe because the instruction is modeled incorrectly on
				purpose or the instruction\(aqs behaviour is quite complex). The
				CustomBehaviour class can be used in these cases to enforce proper
				instruction modeling (often by customizing data dependencies and detecting
				hazards that \fBllvm\-mca\fP has no way of knowing about).
				.sp
				\fBllvm\-mca\fP comes with one generic and multiple target specific
				CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP
				flag is used or if a target specific CustomBehaviour class doesn\(aqt exist for
				that target. (The generic class does nothing.) Currently, the CustomBehaviour
				class is only a part of the in\-order pipeline, but there are plans to add it
				to the out\-of\-order pipeline in the future.
				.sp
				CustomBehaviour\(aqs main method is \fIcheckCustomHazard()\fP which uses the
				current instruction and a list of all instructions still executing within
				the pipeline to determine if the current instruction should be dispatched.
				As output, the method returns an integer representing the number of cycles
				that the current instruction must stall for. A value of 0 represents no stall,
				and the value can be an underestimate if you don\(aqt know the exact number.
				.sp
				If you\(aqd like to add a CustomBehaviour class for a target that doesn\(aqt
				already have one, refer to an existing implementation to see how to set it
				up. Remember to look at (and add to) \fI/llvm\-mca/lib/CMakeLists.txt\fP\&.
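The contract of the hazard check described above (inspect the current instruction and the instructions still executing, then return a stall count) might look like the C++ sketch below. Everything in it is an assumption made for illustration: InFlightInst, WaitOpcode, SlowOpcode and checkCustomHazardSketch are hypothetical names, and the signature is not that of the real CustomBehaviour class.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an instruction still executing in the pipeline.
struct InFlightInst {
  unsigned Opcode;
  unsigned CyclesLeft; // cycles until the instruction finishes executing
};

// Hypothetical opcodes: a WaitOpcode instruction must not issue while any
// SlowOpcode instruction is still executing.
constexpr unsigned WaitOpcode = 1;
constexpr unsigned SlowOpcode = 2;

// Returns the number of cycles the current instruction must stall for.
// A return value of 0 means that no custom hazard was detected.
unsigned checkCustomHazardSketch(unsigned CurrentOpcode,
                                 const std::vector<InFlightInst> &Executing) {
  if (CurrentOpcode != WaitOpcode)
    return 0; // only WaitOpcode is subject to this illustrative hazard
  unsigned Stall = 0;
  for (const InFlightInst &Inst : Executing) {
    assert(Inst.CyclesLeft > 0 && "executing instructions have cycles left");
    if (Inst.Opcode == SlowOpcode)
      Stall = std::max(Stall, Inst.CyclesLeft);
  }
  return Stall;
}
```

In this sketch the stall is the longest remaining latency among the blocking instructions. A real target-specific class would encode whatever hazard its scheduling model cannot express.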
	.SH AUTHOR			.SH AUTHOR
	Maintained by the LLVM Team (https://llvm.org/).			Maintained by the LLVM Team (https://llvm.org/).
	.SH COPYRIGHT			.SH COPYRIGHT
	2003-2021, LLVM Project			2003-2021, LLVM Project
	.\" Generated by docutils manpage writer.			.\" Generated by docutils manpage writer.
	.			.