diff --git a/docs/ReleaseNotes.rst b/docs/ReleaseNotes.rst
index 48af491f1214..f6ef4e0a3fa2 100644
--- a/docs/ReleaseNotes.rst
+++ b/docs/ReleaseNotes.rst
@@ -1,211 +1,255 @@
========================
LLVM 5.0.0 Release Notes
========================
.. contents::
:local:
.. warning::
These are in-progress notes for the upcoming LLVM 5 release.
Release notes for previous releases can be found on
`the Download Page <http://releases.llvm.org/download.html>`_.
Introduction
============
This document contains the release notes for the LLVM Compiler Infrastructure,
release 5.0.0. Here we describe the status of LLVM, including major improvements
from the previous release, improvements in various subprojects of LLVM, and
some of the current users of the code. All LLVM releases may be downloaded
from the `LLVM releases web site <http://llvm.org/releases/>`_.
For more information about LLVM, including information about the latest
release, please check out the `main LLVM web site <http://llvm.org/>`_. If you
have questions or comments, the `LLVM Developer's Mailing List
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ is a good place to send
them.
Note that if you are reading this file from a Subversion checkout or the main
LLVM web page, this document applies to the *next* release, not the current
one. To see the release notes for a specific release, please see the `releases
page <http://llvm.org/releases/>`_.
Non-comprehensive list of changes in this release
=================================================
.. NOTE
For small 1-3 sentence descriptions, just add an entry at the end of
this list. If your description won't fit comfortably in one bullet
point (e.g. maybe you would like to give an example of the
functionality, or simply have a lot to talk about), see the `NOTE` below
for adding a new subsection.
* LLVM's ``WeakVH`` has been renamed to ``WeakTrackingVH`` and a new ``WeakVH``
has been introduced. The new ``WeakVH`` nulls itself out on deletion, but
does not track values across RAUW; see the sketch after this list.
* A new library named ``BinaryFormat`` has been created which holds a collection
of code which previously lived in ``Support``. This includes the
``file_magic`` structure and ``identify_magic`` functions, as well as all the
structure and type definitions for DWARF, ELF, COFF, WASM, and MachO file
formats.
* The tool ``llvm-pdbdump`` has been renamed ``llvm-pdbutil`` to better reflect
its nature as a general-purpose PDB manipulation / diagnostics tool that does
more than just dump contents.
* The ``BBVectorize`` pass has been removed. It was fully replaced and no
longer used back in 2014, but we didn't get around to removing it. Now it is
gone. The SLP vectorizer is the suggested non-loop vectorization pass.
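
A minimal, illustrative sketch of the new value-handle semantics (the
function and variable names here are hypothetical, not part of the API):

.. code-block:: c++

  #include "llvm/IR/ValueHandle.h"

  void handleSketch(llvm::Value *V, llvm::Value *New) {
    llvm::WeakVH Weak(V);          // Nulled out when V is deleted...
    llvm::WeakTrackingVH Track(V); // ...and this one also follows RAUW.
    V->replaceAllUsesWith(New);
    // Weak still refers to V (no RAUW tracking); Track now refers to New.
  }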
.. NOTE
If you would like to document a larger change, then you can add a
subsection about it right here. You can copy the following boilerplate
and un-indent it (the indentation causes it to be inside this comment).
Special New Feature
-------------------
Makes programs 10x faster by doing Special New Thing.
Changes to the LLVM IR
----------------------
* The datalayout string may now indicate an address space to use for
the pointer type of alloca rather than the default of 0.
* Added the ``speculatable`` attribute, indicating a function that has no
side-effects which could inhibit hoisting of calls.
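
For illustration, a hedged C++ sketch of driving both changes through the IR
APIs (the datalayout string and address-space number 5 are arbitrary choices):

.. code-block:: c++

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"

  void applySketch(llvm::Module &M, llvm::Function &F) {
    // "A5" selects address space 5 as the alloca address space.
    M.setDataLayout("e-m:e-i64:64-A5");
    // speculatable: no side-effects that could inhibit hoisting calls.
    F.addFnAttr(llvm::Attribute::Speculatable);
  }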
-Changes to the ARM Backend
+Changes to the Arm Targets
--------------------------
- During this release ...
-
+During this release the AArch64 target has:
+
+* A much improved Global ISel at O0.
+* Support for ARMv8.1, 8.2 and 8.3 instructions.
+* New scheduler information for ThunderX2.
+* Some SVE type changes but not much more than that.
+* Made instruction fusion more aggressive, resulting in speedups
+ for code making use of AArch64 AES instructions. AES fusion has been
+ enabled for most Cortex-A cores and the AArch64MacroFusion pass was moved
+ to the generic MacroFusion pass.
+* Added preferred function alignments for most Cortex-A cores.
+* OpenMP "offload-to-self" base support.
+
+During this release the ARM target has:
+
+* Improved, but still mostly broken, Global ISel.
+* Updated scheduling models, including a new schedule for Cortex-A57.
+* Hardware breakpoint support in LLDB.
+* New assembler error handling, with spelling corrections and multiple
+ suggestions on how to fix problems.
+* Improved mixed ARM/Thumb code generation. Some cases in which wrong
+ relocations were emitted have been fixed.
+* Added initial support for mixed ARM/Thumb link-time optimization, using the
+ thumb-mode target feature.
Changes to the MIPS Target
--------------------------
During this release ...
Changes to the PowerPC Target
-----------------------------
- During this release ...
+* Additional support and exploitation of POWER ISA 3.0: vabsdub, vabsduh,
+ vabsduw, modsw, moduw, modsd, modud, lxv, stxv, vextublx, vextubrx, vextuhlx,
+ vextuhrx, vextuwlx, vextuwrx, vextsb2w, vextsb2d, vextsh2w, vextsh2d, and
+ vextsw2d
+
+* Implemented Optimal Code Sequences from The PowerPC Compiler Writer's Guide.
+
+* Enabled ``-fomit-frame-pointer`` by default.
+
+* Improved handling of bit reverse intrinsic.
+
+* Improved handling of memcpy and memcmp functions.
+
+* Improved handling of branches with static branch hints.
+
+* Improved codegen for atomic load_acquire.
+
+* Improved block placement during code layout.
+
+* Many improvements to instruction selection and code generation.
+
+
+
Changes to the X86 Target
-------------------------
* Added initial AMD Ryzen (znver1) scheduler support.
* Added support for Intel Goldmont CPUs.
* Added support for avx512vpopcntdq instructions.
* Added heuristics to convert CMOV into branches when it may be profitable.
* More aggressive inlining of memcmp calls.
* Improved vXi64 shuffles on 32-bit targets.
* Improved use of PMOVMSKB for any_of/all_of comparison reductions.
* Improved Silvermont, Sandybridge, and Jaguar (btver2) schedulers.
* Improved support for AVX512 vector rotations.
* Added support for AMD Lightweight Profiling (LWP) instructions.
* Avoid using slow LEA instructions.
* Use alternative sequences for multiply by constant.
* Improved lowering of strided shuffles.
* Improved the AVX512 cost model used by the vectorizer.
* Fixed scalar code performance when AVX512 is enabled by making ``i1`` illegal.
* Fixed many inline assembly bugs.
Changes to the AMDGPU Target
-----------------------------
* Initial gfx9 support
Changes to the AVR Target
-----------------------------
This release consists mainly of bugfixes and implementations of features
required for compiling basic Rust programs.
* Enable the branch relaxation pass so that we don't crash on large
stack load/stores
* Add support for lowering bit-rotations to the native ``ror`` and ``rol``
instructions
* Fix bug where function pointers were treated as pointers to RAM and not
pointers to program memory
* Fix broken code generation for shift-by-variable expressions
* Support zero-sized types in argument lists; this is impossible in C,
but possible in Rust
Changes to the OCaml bindings
-----------------------------
During this release ...
Changes to the C API
--------------------
* Deprecated the ``LLVMAddBBVectorizePass`` interface since the ``BBVectorize``
pass has been removed. It is now a no-op and will be removed in the next
release. Use ``LLVMAddSLPVectorizePass`` instead to get the supported SLP
vectorizer.
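
A hedged migration sketch for C API clients (``PM`` is an assumed,
pre-existing pass manager; the wrapper function is hypothetical):

.. code-block:: c++

  #include "llvm-c/Transforms/Vectorize.h"

  void addVectorizer(LLVMPassManagerRef PM) {
    // LLVMAddBBVectorizePass(PM); // deprecated: now a no-op
    LLVMAddSLPVectorizePass(PM);   // the supported SLP vectorizer
  }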
External Open Source Projects Using LLVM 5
==========================================
Zig Programming Language
------------------------
`Zig <http://ziglang.org>`_ is an open-source programming language designed
for robustness, optimality, and clarity. It integrates closely with C and is
intended to eventually take the place of C. It uses LLVM to produce highly
optimized native code and to cross-compile for any target out of the box. Zig
is in alpha, with a beta release expected in September.
LDC - the LLVM-based D compiler
-------------------------------
`D <http://dlang.org>`_ is a language with C-like syntax and static typing. It
pragmatically combines efficiency, control, and modeling power, with safety and
programmer productivity. D supports powerful concepts like Compile-Time Function
Execution (CTFE) and Template Meta-Programming, provides an innovative approach
to concurrency and offers many classical paradigms.
`LDC <http://wiki.dlang.org/LDC>`_ uses the frontend from the reference compiler
combined with LLVM as backend to produce efficient native code. LDC targets
x86/x86_64 systems like Linux, OS X, FreeBSD and Windows and also Linux on ARM
and PowerPC (32/64 bit). Ports to other architectures like AArch64 and MIPS64
are underway.
Additional Information
======================
A wide variety of additional information is available on the `LLVM web page
<http://llvm.org/>`_, in particular in the `documentation
<http://llvm.org/docs/>`_ section. The web page also contains versions of the
API documentation which is up-to-date with the Subversion version of the source
code. You can access versions of these documents specific to this release by
going into the ``llvm/docs/`` directory in the LLVM tree.
If you have any questions or comments about LLVM, please feel free to contact
us via the `mailing lists <http://llvm.org/docs/#maillist>`_.
diff --git a/include/llvm/CodeGen/SelectionDAGNodes.h b/include/llvm/CodeGen/SelectionDAGNodes.h
index db42fb6c170c..051c93601d3f 100644
--- a/include/llvm/CodeGen/SelectionDAGNodes.h
+++ b/include/llvm/CodeGen/SelectionDAGNodes.h
@@ -1,2329 +1,2332 @@
//===- llvm/CodeGen/SelectionDAGNodes.h - SelectionDAG Nodes ----*- C++ -*-===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file declares the SDNode class and derived classes, which are used to
// represent the nodes and operations present in a SelectionDAG. These nodes
// and operations are machine code level operations, with some similarities to
// the GCC RTL representation.
//
// Clients should include the SelectionDAG.h file instead of this file directly.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_CODEGEN_SELECTIONDAGNODES_H
#define LLVM_CODEGEN_SELECTIONDAGNODES_H
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/GraphTraits.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/ilist_node.h"
#include "llvm/ADT/iterator.h"
#include "llvm/ADT/iterator_range.h"
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/CodeGen/MachineValueType.h"
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Metadata.h"
#include "llvm/Support/AlignOf.h"
#include "llvm/Support/AtomicOrdering.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/ErrorHandling.h"
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iterator>
#include <string>
#include <tuple>
namespace llvm {
class APInt;
class Constant;
template <typename T> struct DenseMapInfo;
class GlobalValue;
class MachineBasicBlock;
class MachineConstantPoolValue;
class MCSymbol;
class raw_ostream;
class SDNode;
class SelectionDAG;
class Type;
class Value;
void checkForCycles(const SDNode *N, const SelectionDAG *DAG = nullptr,
bool force = false);
/// This represents a list of ValueTypes that has been interned by
/// a SelectionDAG. Instances of this simple value class are returned by
/// SelectionDAG::getVTList(...).
///
struct SDVTList {
const EVT *VTs;
unsigned int NumVTs;
};
namespace ISD {
/// Node predicates
/// If N is a BUILD_VECTOR node whose elements are all the same constant or
/// undefined, return true and return the constant value in \p SplatValue.
- bool isConstantSplatVector(const SDNode *N, APInt &SplatValue);
+ /// This sets \p SplatValue to the smallest possible splat unless AllowShrink
+ /// is set to false.
+ bool isConstantSplatVector(const SDNode *N, APInt &SplatValue,
+ bool AllowShrink = true);
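// Illustrative use (not part of this header): given a BUILD_VECTOR node N,
//   APInt Splat;
//   if (ISD::isConstantSplatVector(N, Splat))
//     ...; // Splat is shrunk to the smallest width that still splats.
// Pass AllowShrink = false to keep Splat at the full element width.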
/// Return true if the specified node is a BUILD_VECTOR where all of the
/// elements are ~0 or undef.
bool isBuildVectorAllOnes(const SDNode *N);
/// Return true if the specified node is a BUILD_VECTOR where all of the
/// elements are 0 or undef.
bool isBuildVectorAllZeros(const SDNode *N);
/// Return true if the specified node is a BUILD_VECTOR node of all
/// ConstantSDNode or undef.
bool isBuildVectorOfConstantSDNodes(const SDNode *N);
/// Return true if the specified node is a BUILD_VECTOR node of all
/// ConstantFPSDNode or undef.
bool isBuildVectorOfConstantFPSDNodes(const SDNode *N);
/// Return true if the node has at least one operand and all operands of the
/// specified node are ISD::UNDEF.
bool allOperandsUndef(const SDNode *N);
} // end namespace ISD
//===----------------------------------------------------------------------===//
/// Unlike LLVM values, Selection DAG nodes may return multiple
/// values as the result of a computation. Many nodes return multiple values,
/// from loads (which define a token and a return value) to ADDC (which returns
/// a result and a carry value), to calls (which may return an arbitrary number
/// of values).
///
/// As such, each use of a SelectionDAG computation must indicate the node that
/// computes it as well as which return value to use from that node. This pair
/// of information is represented with the SDValue value type.
///
class SDValue {
friend struct DenseMapInfo<SDValue>;
SDNode *Node = nullptr; // The node defining the value we are using.
unsigned ResNo = 0; // Which return value of the node we are using.
public:
SDValue() = default;
SDValue(SDNode *node, unsigned resno);
/// get the index which selects a specific result in the SDNode
unsigned getResNo() const { return ResNo; }
/// get the SDNode which holds the desired result
SDNode *getNode() const { return Node; }
/// set the SDNode
void setNode(SDNode *N) { Node = N; }
inline SDNode *operator->() const { return Node; }
bool operator==(const SDValue &O) const {
return Node == O.Node && ResNo == O.ResNo;
}
bool operator!=(const SDValue &O) const {
return !operator==(O);
}
bool operator<(const SDValue &O) const {
return std::tie(Node, ResNo) < std::tie(O.Node, O.ResNo);
}
explicit operator bool() const {
return Node != nullptr;
}
SDValue getValue(unsigned R) const {
return SDValue(Node, R);
}
/// Return true if this node is an operand of N.
bool isOperandOf(const SDNode *N) const;
/// Return the ValueType of the referenced return value.
inline EVT getValueType() const;
/// Return the simple ValueType of the referenced return value.
MVT getSimpleValueType() const {
return getValueType().getSimpleVT();
}
/// Returns the size of the value in bits.
unsigned getValueSizeInBits() const {
return getValueType().getSizeInBits();
}
unsigned getScalarValueSizeInBits() const {
return getValueType().getScalarType().getSizeInBits();
}
// Forwarding methods - These forward to the corresponding methods in SDNode.
inline unsigned getOpcode() const;
inline unsigned getNumOperands() const;
inline const SDValue &getOperand(unsigned i) const;
inline uint64_t getConstantOperandVal(unsigned i) const;
inline bool isTargetMemoryOpcode() const;
inline bool isTargetOpcode() const;
inline bool isMachineOpcode() const;
inline bool isUndef() const;
inline unsigned getMachineOpcode() const;
inline const DebugLoc &getDebugLoc() const;
inline void dump() const;
inline void dumpr() const;
/// Return true if this operand (which must be a chain) reaches the
/// specified operand without crossing any side-effecting instructions.
/// In practice, this looks through token factors and non-volatile loads.
/// In order to remain efficient, this only
/// looks a couple of nodes in; it does not do an exhaustive search.
bool reachesChainWithoutSideEffects(SDValue Dest,
unsigned Depth = 2) const;
/// Return true if there are no nodes using value ResNo of Node.
inline bool use_empty() const;
/// Return true if there is exactly one node using value ResNo of Node.
inline bool hasOneUse() const;
};
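// Illustrative use (not part of this header): a load defines two results,
// the loaded value (result 0) and the output chain (result 1):
//   SDValue Loaded(LoadNode, 0);
//   SDValue OutChain(LoadNode, 1);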
template<> struct DenseMapInfo<SDValue> {
static inline SDValue getEmptyKey() {
SDValue V;
V.ResNo = -1U;
return V;
}
static inline SDValue getTombstoneKey() {
SDValue V;
V.ResNo = -2U;
return V;
}
static unsigned getHashValue(const SDValue &Val) {
return ((unsigned)((uintptr_t)Val.getNode() >> 4) ^
(unsigned)((uintptr_t)Val.getNode() >> 9)) + Val.getResNo();
}
static bool isEqual(const SDValue &LHS, const SDValue &RHS) {
return LHS == RHS;
}
};
template <> struct isPodLike<SDValue> { static const bool value = true; };
/// Allow casting operators to work directly on
/// SDValues as if they were SDNode*'s.
template<> struct simplify_type<SDValue> {
using SimpleType = SDNode *;
static SimpleType getSimplifiedValue(SDValue &Val) {
return Val.getNode();
}
};
template<> struct simplify_type<const SDValue> {
using SimpleType = /*const*/ SDNode *;
static SimpleType getSimplifiedValue(const SDValue &Val) {
return Val.getNode();
}
};
/// Represents a use of a SDNode. This class holds an SDValue,
/// which records the SDNode being used and the result number, a
/// pointer to the SDNode using the value, and Next and Prev pointers,
/// which link together all the uses of an SDNode.
///
class SDUse {
/// Val - The value being used.
SDValue Val;
/// User - The user of this value.
SDNode *User = nullptr;
/// Prev, Next - Pointers to the uses list of the SDNode referred by
/// this operand.
SDUse **Prev = nullptr;
SDUse *Next = nullptr;
public:
SDUse() = default;
SDUse(const SDUse &U) = delete;
SDUse &operator=(const SDUse &) = delete;
/// Normally SDUse will just implicitly convert to an SDValue that it holds.
operator const SDValue&() const { return Val; }
/// If implicit conversion to SDValue doesn't work, the get() method returns
/// the SDValue.
const SDValue &get() const { return Val; }
/// This returns the SDNode that contains this Use.
SDNode *getUser() { return User; }
/// Get the next SDUse in the use list.
SDUse *getNext() const { return Next; }
/// Convenience function for get().getNode().
SDNode *getNode() const { return Val.getNode(); }
/// Convenience function for get().getResNo().
unsigned getResNo() const { return Val.getResNo(); }
/// Convenience function for get().getValueType().
EVT getValueType() const { return Val.getValueType(); }
/// Convenience function for get().operator==
bool operator==(const SDValue &V) const {
return Val == V;
}
/// Convenience function for get().operator!=
bool operator!=(const SDValue &V) const {
return Val != V;
}
/// Convenience function for get().operator<
bool operator<(const SDValue &V) const {
return Val < V;
}
private:
friend class SelectionDAG;
friend class SDNode;
// TODO: unfriend HandleSDNode once we fix its operand handling.
friend class HandleSDNode;
void setUser(SDNode *p) { User = p; }
/// Remove this use from its existing use list, assign it the
/// given value, and add it to the new value's node's use list.
inline void set(const SDValue &V);
/// Like set, but only supports initializing a newly-allocated
/// SDUse with a non-null value.
inline void setInitial(const SDValue &V);
/// Like set, but only sets the Node portion of the value,
/// leaving the ResNo portion unmodified.
inline void setNode(SDNode *N);
void addToList(SDUse **List) {
Next = *List;
if (Next) Next->Prev = &Next;
Prev = List;
*List = this;
}
void removeFromList() {
*Prev = Next;
if (Next) Next->Prev = Prev;
}
};
/// simplify_type specializations - Allow casting operators to work directly on
/// SDValues as if they were SDNode*'s.
template<> struct simplify_type<SDUse> {
using SimpleType = SDNode *;
static SimpleType getSimplifiedValue(SDUse &Val) {
return Val.getNode();
}
};
/// These are IR-level optimization flags that may be propagated to SDNodes.
/// TODO: This data structure should be shared by the IR optimizer and
/// the backend.
struct SDNodeFlags {
private:
// This bit is used to determine if the flags are in a defined state.
// Flag bits can only be masked out during intersection if the masking flags
// are defined.
bool AnyDefined : 1;
bool NoUnsignedWrap : 1;
bool NoSignedWrap : 1;
bool Exact : 1;
bool UnsafeAlgebra : 1;
bool NoNaNs : 1;
bool NoInfs : 1;
bool NoSignedZeros : 1;
bool AllowReciprocal : 1;
bool VectorReduction : 1;
bool AllowContract : 1;
public:
/// Default constructor turns off all optimization flags.
SDNodeFlags()
: AnyDefined(false), NoUnsignedWrap(false), NoSignedWrap(false),
Exact(false), UnsafeAlgebra(false), NoNaNs(false), NoInfs(false),
NoSignedZeros(false), AllowReciprocal(false), VectorReduction(false),
AllowContract(false) {}
/// Sets the state of the flags to the defined state.
void setDefined() { AnyDefined = true; }
/// Returns true if the flags are in a defined state.
bool isDefined() const { return AnyDefined; }
// These are mutators for each flag.
void setNoUnsignedWrap(bool b) {
setDefined();
NoUnsignedWrap = b;
}
void setNoSignedWrap(bool b) {
setDefined();
NoSignedWrap = b;
}
void setExact(bool b) {
setDefined();
Exact = b;
}
void setUnsafeAlgebra(bool b) {
setDefined();
UnsafeAlgebra = b;
}
void setNoNaNs(bool b) {
setDefined();
NoNaNs = b;
}
void setNoInfs(bool b) {
setDefined();
NoInfs = b;
}
void setNoSignedZeros(bool b) {
setDefined();
NoSignedZeros = b;
}
void setAllowReciprocal(bool b) {
setDefined();
AllowReciprocal = b;
}
void setVectorReduction(bool b) {
setDefined();
VectorReduction = b;
}
void setAllowContract(bool b) {
setDefined();
AllowContract = b;
}
// These are accessors for each flag.
bool hasNoUnsignedWrap() const { return NoUnsignedWrap; }
bool hasNoSignedWrap() const { return NoSignedWrap; }
bool hasExact() const { return Exact; }
bool hasUnsafeAlgebra() const { return UnsafeAlgebra; }
bool hasNoNaNs() const { return NoNaNs; }
bool hasNoInfs() const { return NoInfs; }
bool hasNoSignedZeros() const { return NoSignedZeros; }
bool hasAllowReciprocal() const { return AllowReciprocal; }
bool hasVectorReduction() const { return VectorReduction; }
bool hasAllowContract() const { return AllowContract; }
/// Clear any flags in this flag set that aren't also set in Flags.
/// If the given Flags are undefined then don't do anything.
void intersectWith(const SDNodeFlags Flags) {
if (!Flags.isDefined())
return;
NoUnsignedWrap &= Flags.NoUnsignedWrap;
NoSignedWrap &= Flags.NoSignedWrap;
Exact &= Flags.Exact;
UnsafeAlgebra &= Flags.UnsafeAlgebra;
NoNaNs &= Flags.NoNaNs;
NoInfs &= Flags.NoInfs;
NoSignedZeros &= Flags.NoSignedZeros;
AllowReciprocal &= Flags.AllowReciprocal;
VectorReduction &= Flags.VectorReduction;
AllowContract &= Flags.AllowContract;
}
};
/// Represents one node in the SelectionDAG.
///
class SDNode : public FoldingSetNode, public ilist_node<SDNode> {
private:
/// The operation that this node performs.
int16_t NodeType;
protected:
// We define a set of mini-helper classes to help us interpret the bits in our
// SubclassData. These are designed to fit within a uint16_t so they pack
// with NodeType.
class SDNodeBitfields {
friend class SDNode;
friend class MemIntrinsicSDNode;
friend class MemSDNode;
uint16_t HasDebugValue : 1;
uint16_t IsMemIntrinsic : 1;
};
enum { NumSDNodeBits = 2 };
class ConstantSDNodeBitfields {
friend class ConstantSDNode;
uint16_t : NumSDNodeBits;
uint16_t IsOpaque : 1;
};
class MemSDNodeBitfields {
friend class MemSDNode;
friend class MemIntrinsicSDNode;
friend class AtomicSDNode;
uint16_t : NumSDNodeBits;
uint16_t IsVolatile : 1;
uint16_t IsNonTemporal : 1;
uint16_t IsDereferenceable : 1;
uint16_t IsInvariant : 1;
};
enum { NumMemSDNodeBits = NumSDNodeBits + 4 };
class LSBaseSDNodeBitfields {
friend class LSBaseSDNode;
uint16_t : NumMemSDNodeBits;
uint16_t AddressingMode : 3; // enum ISD::MemIndexedMode
};
enum { NumLSBaseSDNodeBits = NumMemSDNodeBits + 3 };
class LoadSDNodeBitfields {
friend class LoadSDNode;
friend class MaskedLoadSDNode;
uint16_t : NumLSBaseSDNodeBits;
uint16_t ExtTy : 2; // enum ISD::LoadExtType
uint16_t IsExpanding : 1;
};
class StoreSDNodeBitfields {
friend class StoreSDNode;
friend class MaskedStoreSDNode;
uint16_t : NumLSBaseSDNodeBits;
uint16_t IsTruncating : 1;
uint16_t IsCompressing : 1;
};
union {
char RawSDNodeBits[sizeof(uint16_t)];
SDNodeBitfields SDNodeBits;
ConstantSDNodeBitfields ConstantSDNodeBits;
MemSDNodeBitfields MemSDNodeBits;
LSBaseSDNodeBitfields LSBaseSDNodeBits;
LoadSDNodeBitfields LoadSDNodeBits;
StoreSDNodeBitfields StoreSDNodeBits;
};
// RawSDNodeBits must cover the entirety of the union. This means that all of
// the union's members must have size <= RawSDNodeBits. We write the RHS as
// "2" instead of sizeof(RawSDNodeBits) because MSVC can't handle the latter.
static_assert(sizeof(SDNodeBitfields) <= 2, "field too wide");
static_assert(sizeof(ConstantSDNodeBitfields) <= 2, "field too wide");
static_assert(sizeof(MemSDNodeBitfields) <= 2, "field too wide");
static_assert(sizeof(LSBaseSDNodeBitfields) <= 2, "field too wide");
static_assert(sizeof(LoadSDNodeBitfields) <= 2, "field too wide");
static_assert(sizeof(StoreSDNodeBitfields) <= 2, "field too wide");
private:
friend class SelectionDAG;
// TODO: unfriend HandleSDNode once we fix its operand handling.
friend class HandleSDNode;
/// Unique id per SDNode in the DAG.
int NodeId = -1;
/// The values that are used by this operation.
SDUse *OperandList = nullptr;
/// The types of the values this node defines. SDNode's may
/// define multiple values simultaneously.
const EVT *ValueList;
/// List of uses for this SDNode.
SDUse *UseList = nullptr;
/// The number of entries in the Operand/Value list.
unsigned short NumOperands = 0;
unsigned short NumValues;
// The ordering of the SDNodes. It roughly corresponds to the ordering of the
// original LLVM instructions.
// This is used for turning off scheduling, because we'll forgo
// the normal scheduling algorithms and output the instructions according to
// this ordering.
unsigned IROrder;
/// Source line information.
DebugLoc debugLoc;
/// Return a pointer to the specified value type.
static const EVT *getValueTypeList(EVT VT);
SDNodeFlags Flags;
public:
/// Unique and persistent id per SDNode in the DAG.
/// Used for debug printing.
uint16_t PersistentId;
//===--------------------------------------------------------------------===//
// Accessors
//
/// Return the SelectionDAG opcode value for this node. For
/// pre-isel nodes (those for which isMachineOpcode returns false), these
/// are the opcode values in the ISD and <target>ISD namespaces. For
/// post-isel opcodes, see getMachineOpcode.
unsigned getOpcode() const { return (unsigned short)NodeType; }
/// Test if this node has a target-specific opcode (in the
/// \<target\>ISD namespace).
bool isTargetOpcode() const { return NodeType >= ISD::BUILTIN_OP_END; }
/// Test if this node has a target-specific
/// memory-referencing opcode (in the \<target\>ISD namespace and
/// greater than FIRST_TARGET_MEMORY_OPCODE).
bool isTargetMemoryOpcode() const {
return NodeType >= ISD::FIRST_TARGET_MEMORY_OPCODE;
}
/// Return true if this node's opcode is ISD::UNDEF.
bool isUndef() const { return NodeType == ISD::UNDEF; }
/// Test if this node is a memory intrinsic (with valid pointer information).
/// INTRINSIC_W_CHAIN and INTRINSIC_VOID nodes are sometimes created for
/// non-memory intrinsics (with chains) that are not really instances of
/// MemSDNode. For such nodes, we need some extra state to determine the
/// proper classof relationship.
bool isMemIntrinsic() const {
return (NodeType == ISD::INTRINSIC_W_CHAIN ||
NodeType == ISD::INTRINSIC_VOID) &&
SDNodeBits.IsMemIntrinsic;
}
/// Test if this node is a strict floating point pseudo-op.
bool isStrictFPOpcode() {
switch (NodeType) {
default:
return false;
case ISD::STRICT_FADD:
case ISD::STRICT_FSUB:
case ISD::STRICT_FMUL:
case ISD::STRICT_FDIV:
case ISD::STRICT_FREM:
case ISD::STRICT_FSQRT:
case ISD::STRICT_FPOW:
case ISD::STRICT_FPOWI:
case ISD::STRICT_FSIN:
case ISD::STRICT_FCOS:
case ISD::STRICT_FEXP:
case ISD::STRICT_FEXP2:
case ISD::STRICT_FLOG:
case ISD::STRICT_FLOG10:
case ISD::STRICT_FLOG2:
case ISD::STRICT_FRINT:
case ISD::STRICT_FNEARBYINT:
return true;
}
}
/// Test if this node has a post-isel opcode, directly
/// corresponding to a MachineInstr opcode.
bool isMachineOpcode() const { return NodeType < 0; }
/// This may only be called if isMachineOpcode returns
/// true. It returns the MachineInstr opcode value that the node's opcode
/// corresponds to.
unsigned getMachineOpcode() const {
assert(isMachineOpcode() && "Not a MachineInstr opcode!");
return ~NodeType;
}
bool getHasDebugValue() const { return SDNodeBits.HasDebugValue; }
void setHasDebugValue(bool b) { SDNodeBits.HasDebugValue = b; }
/// Return true if there are no uses of this node.
bool use_empty() const { return UseList == nullptr; }
/// Return true if there is exactly one use of this node.
bool hasOneUse() const {
return !use_empty() && std::next(use_begin()) == use_end();
}
/// Return the number of uses of this node. This method takes
/// time proportional to the number of uses.
size_t use_size() const { return std::distance(use_begin(), use_end()); }
/// Return the unique node id.
int getNodeId() const { return NodeId; }
/// Set unique node id.
void setNodeId(int Id) { NodeId = Id; }
/// Return the node ordering.
unsigned getIROrder() const { return IROrder; }
/// Set the node ordering.
void setIROrder(unsigned Order) { IROrder = Order; }
/// Return the source location info.
const DebugLoc &getDebugLoc() const { return debugLoc; }
/// Set source location info. Try to avoid this, putting
/// it in the constructor is preferable.
void setDebugLoc(DebugLoc dl) { debugLoc = std::move(dl); }
/// This class provides iterator support for SDUse
/// operands that use a specific SDNode.
class use_iterator
: public std::iterator<std::forward_iterator_tag, SDUse, ptrdiff_t> {
friend class SDNode;
SDUse *Op = nullptr;
explicit use_iterator(SDUse *op) : Op(op) {}
public:
using reference = std::iterator<std::forward_iterator_tag,
SDUse, ptrdiff_t>::reference;
using pointer = std::iterator<std::forward_iterator_tag,
SDUse, ptrdiff_t>::pointer;
use_iterator() = default;
use_iterator(const use_iterator &I) : Op(I.Op) {}
bool operator==(const use_iterator &x) const {
return Op == x.Op;
}
bool operator!=(const use_iterator &x) const {
return !operator==(x);
}
/// Return true if this iterator is at the end of uses list.
bool atEnd() const { return Op == nullptr; }
// Iterator traversal: forward iteration only.
use_iterator &operator++() { // Preincrement
assert(Op && "Cannot increment end iterator!");
Op = Op->getNext();
return *this;
}
use_iterator operator++(int) { // Postincrement
use_iterator tmp = *this; ++*this; return tmp;
}
/// Retrieve a pointer to the current user node.
SDNode *operator*() const {
assert(Op && "Cannot dereference end iterator!");
return Op->getUser();
}
SDNode *operator->() const { return operator*(); }
SDUse &getUse() const { return *Op; }
/// Retrieve the operand # of this use in its user.
unsigned getOperandNo() const {
assert(Op && "Cannot dereference end iterator!");
return (unsigned)(Op - Op->getUser()->OperandList);
}
};
/// Provide iteration support to walk over all uses of an SDNode.
use_iterator use_begin() const {
return use_iterator(UseList);
}
static use_iterator use_end() { return use_iterator(nullptr); }
inline iterator_range<use_iterator> uses() {
return make_range(use_begin(), use_end());
}
inline iterator_range<use_iterator> uses() const {
return make_range(use_begin(), use_end());
}
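// Illustrative use (not part of this header): visiting every user of a
// node N, once per use:
//   for (SDNode *User : N->uses())
//     visit(User); // visit() is a hypothetical callback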
/// Return true if there are exactly NUSES uses of the indicated value.
/// This method ignores uses of other values defined by this operation.
bool hasNUsesOfValue(unsigned NUses, unsigned Value) const;
/// Return true if there are any uses of the indicated value.
/// This method ignores uses of other values defined by this operation.
bool hasAnyUseOfValue(unsigned Value) const;
/// Return true if this node is the only use of N.
bool isOnlyUserOf(const SDNode *N) const;
/// Return true if this node is an operand of N.
bool isOperandOf(const SDNode *N) const;
/// Return true if this node is a predecessor of N.
/// NOTE: Implemented on top of hasPredecessor and every bit as
/// expensive. Use carefully.
bool isPredecessorOf(const SDNode *N) const {
return N->hasPredecessor(this);
}
/// Return true if N is a predecessor of this node.
/// N is either an operand of this node, or can be reached by recursively
/// traversing up the operands.
/// NOTE: This is an expensive method. Use it carefully.
bool hasPredecessor(const SDNode *N) const;
/// Returns true if N is a predecessor of any node in Worklist. This
/// helper keeps Visited and Worklist sets externally to allow unioned
/// searches to be performed in parallel, caching of results across
/// queries and incremental addition to Worklist. Stops early if N is
/// found, but the search can be resumed later. Remember to clear Visited
/// and Worklist if the DAG changes.
static bool hasPredecessorHelper(const SDNode *N,
SmallPtrSetImpl<const SDNode *> &Visited,
SmallVectorImpl<const SDNode *> &Worklist) {
if (Visited.count(N))
return true;
while (!Worklist.empty()) {
const SDNode *M = Worklist.pop_back_val();
bool Found = false;
for (const SDValue &OpV : M->op_values()) {
SDNode *Op = OpV.getNode();
if (Visited.insert(Op).second)
Worklist.push_back(Op);
if (Op == N)
Found = true;
}
if (Found)
return true;
}
return false;
}
/// Return true if all the users of N are contained in Nodes.
/// NOTE: Requires at least one match, but doesn't require them all.
static bool areOnlyUsersOf(ArrayRef<const SDNode *> Nodes, const SDNode *N);
/// Return the number of values used by this operation.
unsigned getNumOperands() const { return NumOperands; }
/// Helper method returns the integer value of a ConstantSDNode operand.
inline uint64_t getConstantOperandVal(unsigned Num) const;
const SDValue &getOperand(unsigned Num) const {
assert(Num < NumOperands && "Invalid child # of SDNode!");
return OperandList[Num];
}
using op_iterator = SDUse *;
op_iterator op_begin() const { return OperandList; }
op_iterator op_end() const { return OperandList+NumOperands; }
ArrayRef<SDUse> ops() const { return makeArrayRef(op_begin(), op_end()); }
/// Iterator for directly iterating over the operand SDValue's.
struct value_op_iterator
: iterator_adaptor_base<value_op_iterator, op_iterator,
std::random_access_iterator_tag, SDValue,
ptrdiff_t, value_op_iterator *,
value_op_iterator *> {
explicit value_op_iterator(SDUse *U = nullptr)
: iterator_adaptor_base(U) {}
const SDValue &operator*() const { return I->get(); }
};
iterator_range<value_op_iterator> op_values() const {
return make_range(value_op_iterator(op_begin()),
value_op_iterator(op_end()));
}
SDVTList getVTList() const {
SDVTList X = { ValueList, NumValues };
return X;
}
/// If this node has a glue operand, return the node
/// to which the glue operand points. Otherwise return NULL.
SDNode *getGluedNode() const {
if (getNumOperands() != 0 &&
getOperand(getNumOperands()-1).getValueType() == MVT::Glue)
return getOperand(getNumOperands()-1).getNode();
return nullptr;
}
/// If this node has a glue value with a user, return
/// the user (there is at most one). Otherwise return NULL.
SDNode *getGluedUser() const {
for (use_iterator UI = use_begin(), UE = use_end(); UI != UE; ++UI)
if (UI.getUse().get().getValueType() == MVT::Glue)
return *UI;
return nullptr;
}
const SDNodeFlags getFlags() const { return Flags; }
void setFlags(SDNodeFlags NewFlags) { Flags = NewFlags; }
/// Clear any flags in this node that aren't also set in Flags.
/// If Flags is not in a defined state then this has no effect.
void intersectFlagsWith(const SDNodeFlags Flags);
/// Return the number of values defined/returned by this operator.
unsigned getNumValues() const { return NumValues; }
/// Return the type of a specified result.
EVT getValueType(unsigned ResNo) const {
assert(ResNo < NumValues && "Illegal result number!");
return ValueList[ResNo];
}
/// Return the type of a specified result as a simple type.
MVT getSimpleValueType(unsigned ResNo) const {
return getValueType(ResNo).getSimpleVT();
}
/// Returns MVT::getSizeInBits(getValueType(ResNo)).
unsigned getValueSizeInBits(unsigned ResNo) const {
return getValueType(ResNo).getSizeInBits();
}
using value_iterator = const EVT *;
value_iterator value_begin() const { return ValueList; }
value_iterator value_end() const { return ValueList+NumValues; }
/// Return the opcode of this operation for printing.
std::string getOperationName(const SelectionDAG *G = nullptr) const;
static const char* getIndexedModeName(ISD::MemIndexedMode AM);
void print_types(raw_ostream &OS, const SelectionDAG *G) const;
void print_details(raw_ostream &OS, const SelectionDAG *G) const;
void print(raw_ostream &OS, const SelectionDAG *G = nullptr) const;
void printr(raw_ostream &OS, const SelectionDAG *G = nullptr) const;
/// Print a SelectionDAG node and all children down to
/// the leaves. The given SelectionDAG allows target-specific nodes
/// to be printed in human-readable form. Unlike printr, this will
/// print the whole DAG, including children that appear multiple
/// times.
///
void printrFull(raw_ostream &O, const SelectionDAG *G = nullptr) const;
/// Print a SelectionDAG node and children up to
/// depth "depth." The given SelectionDAG allows target-specific
/// nodes to be printed in human-readable form. Unlike printr, this
/// will print children that appear multiple times wherever they are
/// used.
///
void printrWithDepth(raw_ostream &O, const SelectionDAG *G = nullptr,
unsigned depth = 100) const;
/// Dump this node, for debugging.
void dump() const;
/// Dump (recursively) this node and its use-def subgraph.
void dumpr() const;
/// Dump this node, for debugging.
/// The given SelectionDAG allows target-specific nodes to be printed
/// in human-readable form.
void dump(const SelectionDAG *G) const;
/// Dump (recursively) this node and its use-def subgraph.
/// The given SelectionDAG allows target-specific nodes to be printed
/// in human-readable form.
void dumpr(const SelectionDAG *G) const;
/// printrFull to dbgs(). The given SelectionDAG allows
/// target-specific nodes to be printed in human-readable form.
/// Unlike dumpr, this will print the whole DAG, including children
/// that appear multiple times.
void dumprFull(const SelectionDAG *G = nullptr) const;
/// printrWithDepth to dbgs(). The given
/// SelectionDAG allows target-specific nodes to be printed in
/// human-readable form. Unlike dumpr, this will print children
/// that appear multiple times wherever they are used.
///
void dumprWithDepth(const SelectionDAG *G = nullptr,
unsigned depth = 100) const;
/// Gather unique data for the node.
void Profile(FoldingSetNodeID &ID) const;
/// This method should only be used by the SDUse class.
void addUse(SDUse &U) { U.addToList(&UseList); }
protected:
static SDVTList getSDVTList(EVT VT) {
SDVTList Ret = { getValueTypeList(VT), 1 };
return Ret;
}
/// Create an SDNode.
///
/// SDNodes are created without any operands, and never own the operand
/// storage. To add operands, see SelectionDAG::createOperands.
SDNode(unsigned Opc, unsigned Order, DebugLoc dl, SDVTList VTs)
: NodeType(Opc), ValueList(VTs.VTs), NumValues(VTs.NumVTs),
IROrder(Order), debugLoc(std::move(dl)) {
memset(&RawSDNodeBits, 0, sizeof(RawSDNodeBits));
assert(debugLoc.hasTrivialDestructor() && "Expected trivial destructor");
assert(NumValues == VTs.NumVTs &&
"NumValues wasn't wide enough for its operands!");
}
/// Release the operands and set this node to have zero operands.
void DropOperands();
};
/// Wrapper class for IR location info (IR ordering and DebugLoc) to be passed
/// into SDNode creation functions.
/// When an SDNode is created from the DAGBuilder, the DebugLoc is extracted
/// from the original Instruction, and IROrder is the ordinal position of
/// the instruction.
/// When an SDNode is created after the DAG is being built, both DebugLoc and
/// the IROrder are propagated from the original SDNode.
/// So SDLoc class provides two constructors besides the default one, one to
/// be used by the DAGBuilder, the other to be used by others.
class SDLoc {
private:
DebugLoc DL;
int IROrder = 0;
public:
SDLoc() = default;
SDLoc(const SDNode *N) : DL(N->getDebugLoc()), IROrder(N->getIROrder()) {}
SDLoc(const SDValue V) : SDLoc(V.getNode()) {}
SDLoc(const Instruction *I, int Order) : IROrder(Order) {
assert(Order >= 0 && "bad IROrder");
if (I)
DL = I->getDebugLoc();
}
unsigned getIROrder() const { return IROrder; }
const DebugLoc &getDebugLoc() const { return DL; }
};
// Define inline functions from the SDValue class.
inline SDValue::SDValue(SDNode *node, unsigned resno)
: Node(node), ResNo(resno) {
// Explicitly check for !ResNo to avoid use-after-free, because there are
// callers that use SDValue(N, 0) with a deleted N to indicate successful
// combines.
assert((!Node || !ResNo || ResNo < Node->getNumValues()) &&
"Invalid result number for the given node!");
assert(ResNo < -2U && "Cannot use result numbers reserved for DenseMaps.");
}
inline unsigned SDValue::getOpcode() const {
return Node->getOpcode();
}
inline EVT SDValue::getValueType() const {
return Node->getValueType(ResNo);
}
inline unsigned SDValue::getNumOperands() const {
return Node->getNumOperands();
}
inline const SDValue &SDValue::getOperand(unsigned i) const {
return Node->getOperand(i);
}
inline uint64_t SDValue::getConstantOperandVal(unsigned i) const {
return Node->getConstantOperandVal(i);
}
inline bool SDValue::isTargetOpcode() const {
return Node->isTargetOpcode();
}
inline bool SDValue::isTargetMemoryOpcode() const {
return Node->isTargetMemoryOpcode();
}
inline bool SDValue::isMachineOpcode() const {
return Node->isMachineOpcode();
}
inline unsigned SDValue::getMachineOpcode() const {
return Node->getMachineOpcode();
}
inline bool SDValue::isUndef() const {
return Node->isUndef();
}
inline bool SDValue::use_empty() const {
return !Node->hasAnyUseOfValue(ResNo);
}
inline bool SDValue::hasOneUse() const {
return Node->hasNUsesOfValue(1, ResNo);
}
inline const DebugLoc &SDValue::getDebugLoc() const {
return Node->getDebugLoc();
}
inline void SDValue::dump() const {
return Node->dump();
}
inline void SDValue::dumpr() const {
return Node->dumpr();
}
// Define inline functions from the SDUse class.
inline void SDUse::set(const SDValue &V) {
if (Val.getNode()) removeFromList();
Val = V;
if (V.getNode()) V.getNode()->addUse(*this);
}
inline void SDUse::setInitial(const SDValue &V) {
Val = V;
V.getNode()->addUse(*this);
}
inline void SDUse::setNode(SDNode *N) {
if (Val.getNode()) removeFromList();
Val.setNode(N);
if (N) N->addUse(*this);
}
/// This class is used to form a handle around another node that
/// is persistent and is updated across invocations of replaceAllUsesWith on its
/// operand. This node should be directly created by end-users and not added to
/// the AllNodes list.
class HandleSDNode : public SDNode {
SDUse Op;
public:
explicit HandleSDNode(SDValue X)
: SDNode(ISD::HANDLENODE, 0, DebugLoc(), getSDVTList(MVT::Other)) {
// HandleSDNodes are never inserted into the DAG, so they won't be
// auto-numbered. Use ID 65535 as a sentinel.
PersistentId = 0xffff;
// Manually set up the operand list. This node type is special in that it's
// always stack allocated and SelectionDAG does not manage its operands.
// TODO: This should either (a) not be in the SDNode hierarchy, or (b) not
// be so special.
Op.setUser(this);
Op.setInitial(X);
NumOperands = 1;
OperandList = &Op;
}
~HandleSDNode();
const SDValue &getValue() const { return Op; }
};
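// Illustrative use (not part of this header): keep a value alive and
// current across transformations that may CSE or replace it:
//   HandleSDNode Handle(Val);             // stack-allocated, not in AllNodes
//   // ... combines that may RAUW Val ...
//   SDValue Current = Handle.getValue();  // reflects any replacement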
class AddrSpaceCastSDNode : public SDNode {
private:
unsigned SrcAddrSpace;
unsigned DestAddrSpace;
public:
AddrSpaceCastSDNode(unsigned Order, const DebugLoc &dl, EVT VT,
unsigned SrcAS, unsigned DestAS);
unsigned getSrcAddressSpace() const { return SrcAddrSpace; }
unsigned getDestAddressSpace() const { return DestAddrSpace; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::ADDRSPACECAST;
}
};
/// This is an abstract virtual class for memory operations.
class MemSDNode : public SDNode {
private:
// VT of in-memory value.
EVT MemoryVT;
protected:
/// Memory reference information.
MachineMemOperand *MMO;
public:
MemSDNode(unsigned Opc, unsigned Order, const DebugLoc &dl, SDVTList VTs,
EVT MemoryVT, MachineMemOperand *MMO);
bool readMem() const { return MMO->isLoad(); }
bool writeMem() const { return MMO->isStore(); }
/// Returns alignment and volatility of the memory access
unsigned getOriginalAlignment() const {
return MMO->getBaseAlignment();
}
unsigned getAlignment() const {
return MMO->getAlignment();
}
/// Return the SubclassData value, without HasDebugValue. This contains an
/// encoding of the volatile flag, as well as bits used by subclasses. This
/// function should only be used to compute a FoldingSetNodeID value.
/// The HasDebugValue bit is masked out because the CSE map needs to match
/// nodes with debug info against nodes without debug info.
unsigned getRawSubclassData() const {
uint16_t Data;
union {
char RawSDNodeBits[sizeof(uint16_t)];
SDNodeBitfields SDNodeBits;
};
memcpy(&RawSDNodeBits, &this->RawSDNodeBits, sizeof(this->RawSDNodeBits));
SDNodeBits.HasDebugValue = 0;
memcpy(&Data, &RawSDNodeBits, sizeof(RawSDNodeBits));
return Data;
}
bool isVolatile() const { return MemSDNodeBits.IsVolatile; }
bool isNonTemporal() const { return MemSDNodeBits.IsNonTemporal; }
bool isDereferenceable() const { return MemSDNodeBits.IsDereferenceable; }
bool isInvariant() const { return MemSDNodeBits.IsInvariant; }
// Returns the offset from the location of the access.
int64_t getSrcValueOffset() const { return MMO->getOffset(); }
/// Returns the AA info that describes the dereference.
AAMDNodes getAAInfo() const { return MMO->getAAInfo(); }
/// Returns the Ranges that describes the dereference.
const MDNode *getRanges() const { return MMO->getRanges(); }
/// Returns the synchronization scope ID for this memory operation.
SyncScope::ID getSyncScopeID() const { return MMO->getSyncScopeID(); }
/// Return the atomic ordering requirements for this memory operation. For
/// cmpxchg atomic operations, return the atomic ordering requirements when
/// store occurs.
AtomicOrdering getOrdering() const { return MMO->getOrdering(); }
/// Return the type of the in-memory value.
EVT getMemoryVT() const { return MemoryVT; }
/// Return a MachineMemOperand object describing the memory
/// reference performed by operation.
MachineMemOperand *getMemOperand() const { return MMO; }
const MachinePointerInfo &getPointerInfo() const {
return MMO->getPointerInfo();
}
/// Return the address space for the associated pointer
unsigned getAddressSpace() const {
return getPointerInfo().getAddrSpace();
}
/// Update this MemSDNode's MachineMemOperand information
/// to reflect the alignment of NewMMO, if it has a greater alignment.
/// This must only be used when the new alignment applies to all users of
/// this MachineMemOperand.
void refineAlignment(const MachineMemOperand *NewMMO) {
MMO->refineAlignment(NewMMO);
}
const SDValue &getChain() const { return getOperand(0); }
const SDValue &getBasePtr() const {
return getOperand(getOpcode() == ISD::STORE ? 2 : 1);
}
// Methods to support isa and dyn_cast
static bool classof(const SDNode *N) {
// For some targets, we lower some target intrinsics to a MemIntrinsicNode
// with either an intrinsic or a target opcode.
return N->getOpcode() == ISD::LOAD ||
N->getOpcode() == ISD::STORE ||
N->getOpcode() == ISD::PREFETCH ||
N->getOpcode() == ISD::ATOMIC_CMP_SWAP ||
N->getOpcode() == ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS ||
N->getOpcode() == ISD::ATOMIC_SWAP ||
N->getOpcode() == ISD::ATOMIC_LOAD_ADD ||
N->getOpcode() == ISD::ATOMIC_LOAD_SUB ||
N->getOpcode() == ISD::ATOMIC_LOAD_AND ||
N->getOpcode() == ISD::ATOMIC_LOAD_OR ||
N->getOpcode() == ISD::ATOMIC_LOAD_XOR ||
N->getOpcode() == ISD::ATOMIC_LOAD_NAND ||
N->getOpcode() == ISD::ATOMIC_LOAD_MIN ||
N->getOpcode() == ISD::ATOMIC_LOAD_MAX ||
N->getOpcode() == ISD::ATOMIC_LOAD_UMIN ||
N->getOpcode() == ISD::ATOMIC_LOAD_UMAX ||
N->getOpcode() == ISD::ATOMIC_LOAD ||
N->getOpcode() == ISD::ATOMIC_STORE ||
N->getOpcode() == ISD::MLOAD ||
N->getOpcode() == ISD::MSTORE ||
N->getOpcode() == ISD::MGATHER ||
N->getOpcode() == ISD::MSCATTER ||
N->isMemIntrinsic() ||
N->isTargetMemoryOpcode();
}
};
/// This is an SDNode representing atomic operations.
class AtomicSDNode : public MemSDNode {
public:
AtomicSDNode(unsigned Opc, unsigned Order, const DebugLoc &dl, SDVTList VTL,
EVT MemVT, MachineMemOperand *MMO)
: MemSDNode(Opc, Order, dl, VTL, MemVT, MMO) {}
const SDValue &getBasePtr() const { return getOperand(1); }
const SDValue &getVal() const { return getOperand(2); }
/// Returns true if this SDNode represents cmpxchg atomic operation, false
/// otherwise.
bool isCompareAndSwap() const {
unsigned Op = getOpcode();
return Op == ISD::ATOMIC_CMP_SWAP ||
Op == ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS;
}
/// For cmpxchg atomic operations, return the atomic ordering requirements
/// when store does not occur.
AtomicOrdering getFailureOrdering() const {
assert(isCompareAndSwap() && "Must be cmpxchg operation");
return MMO->getFailureOrdering();
}
// Methods to support isa and dyn_cast
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::ATOMIC_CMP_SWAP ||
N->getOpcode() == ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS ||
N->getOpcode() == ISD::ATOMIC_SWAP ||
N->getOpcode() == ISD::ATOMIC_LOAD_ADD ||
N->getOpcode() == ISD::ATOMIC_LOAD_SUB ||
N->getOpcode() == ISD::ATOMIC_LOAD_AND ||
N->getOpcode() == ISD::ATOMIC_LOAD_OR ||
N->getOpcode() == ISD::ATOMIC_LOAD_XOR ||
N->getOpcode() == ISD::ATOMIC_LOAD_NAND ||
N->getOpcode() == ISD::ATOMIC_LOAD_MIN ||
N->getOpcode() == ISD::ATOMIC_LOAD_MAX ||
N->getOpcode() == ISD::ATOMIC_LOAD_UMIN ||
N->getOpcode() == ISD::ATOMIC_LOAD_UMAX ||
N->getOpcode() == ISD::ATOMIC_LOAD ||
N->getOpcode() == ISD::ATOMIC_STORE;
}
};
/// This SDNode is used for target intrinsics that touch
/// memory and need an associated MachineMemOperand. Its opcode may be
/// INTRINSIC_VOID, INTRINSIC_W_CHAIN, PREFETCH, or a target-specific opcode
/// with a value not less than FIRST_TARGET_MEMORY_OPCODE.
class MemIntrinsicSDNode : public MemSDNode {
public:
MemIntrinsicSDNode(unsigned Opc, unsigned Order, const DebugLoc &dl,
SDVTList VTs, EVT MemoryVT, MachineMemOperand *MMO)
: MemSDNode(Opc, Order, dl, VTs, MemoryVT, MMO) {
SDNodeBits.IsMemIntrinsic = true;
}
// Methods to support isa and dyn_cast
static bool classof(const SDNode *N) {
// We lower some target intrinsics to their target opcode early, so a
// node with a target opcode can be of this class
return N->isMemIntrinsic() ||
N->getOpcode() == ISD::PREFETCH ||
N->isTargetMemoryOpcode();
}
};
/// This SDNode is used to implement the code generator
/// support for the llvm IR shufflevector instruction. It combines elements
/// from two input vectors into a new input vector, with the selection and
/// ordering of elements determined by an array of integers, referred to as
/// the shuffle mask. For input vectors of width N, mask indices of 0..N-1
/// refer to elements from the LHS input, and indices from N to 2N-1 the RHS.
/// An index of -1 is treated as undef, such that the code generator may put
/// any value in the corresponding element of the result.
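/// For example (illustrative): with two <4 x i32> inputs, the mask
/// <0, 4, 1, 5> produces LHS[0], RHS[0], LHS[1], RHS[1], interleaving the
/// low halves of the two inputs.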
class ShuffleVectorSDNode : public SDNode {
// The memory for Mask is owned by the SelectionDAG's OperandAllocator, and
// is freed when the SelectionDAG object is destroyed.
const int *Mask;
protected:
friend class SelectionDAG;
ShuffleVectorSDNode(EVT VT, unsigned Order, const DebugLoc &dl, const int *M)
: SDNode(ISD::VECTOR_SHUFFLE, Order, dl, getSDVTList(VT)), Mask(M) {}
public:
ArrayRef<int> getMask() const {
EVT VT = getValueType(0);
return makeArrayRef(Mask, VT.getVectorNumElements());
}
int getMaskElt(unsigned Idx) const {
assert(Idx < getValueType(0).getVectorNumElements() && "Idx out of range!");
return Mask[Idx];
}
bool isSplat() const { return isSplatMask(Mask, getValueType(0)); }
int getSplatIndex() const {
assert(isSplat() && "Cannot get splat index for non-splat!");
EVT VT = getValueType(0);
for (unsigned i = 0, e = VT.getVectorNumElements(); i != e; ++i) {
if (Mask[i] >= 0)
return Mask[i];
}
llvm_unreachable("Splat with all undef indices?");
}
static bool isSplatMask(const int *Mask, EVT VT);
/// Change values in a shuffle permute mask assuming
/// the two vector operands have swapped position.
static void commuteMask(MutableArrayRef<int> Mask) {
unsigned NumElems = Mask.size();
for (unsigned i = 0; i != NumElems; ++i) {
int idx = Mask[i];
if (idx < 0)
continue;
else if (idx < (int)NumElems)
Mask[i] = idx + NumElems;
else
Mask[i] = idx - NumElems;
}
}
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::VECTOR_SHUFFLE;
}
};
class ConstantSDNode : public SDNode {
friend class SelectionDAG;
const ConstantInt *Value;
ConstantSDNode(bool isTarget, bool isOpaque, const ConstantInt *val,
const DebugLoc &DL, EVT VT)
: SDNode(isTarget ? ISD::TargetConstant : ISD::Constant, 0, DL,
getSDVTList(VT)),
Value(val) {
ConstantSDNodeBits.IsOpaque = isOpaque;
}
public:
const ConstantInt *getConstantIntValue() const { return Value; }
const APInt &getAPIntValue() const { return Value->getValue(); }
uint64_t getZExtValue() const { return Value->getZExtValue(); }
int64_t getSExtValue() const { return Value->getSExtValue(); }
bool isOne() const { return Value->isOne(); }
bool isNullValue() const { return Value->isZero(); }
bool isAllOnesValue() const { return Value->isMinusOne(); }
bool isOpaque() const { return ConstantSDNodeBits.IsOpaque; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::Constant ||
N->getOpcode() == ISD::TargetConstant;
}
};
uint64_t SDNode::getConstantOperandVal(unsigned Num) const {
return cast<ConstantSDNode>(getOperand(Num))->getZExtValue();
}
class ConstantFPSDNode : public SDNode {
friend class SelectionDAG;
const ConstantFP *Value;
ConstantFPSDNode(bool isTarget, const ConstantFP *val, const DebugLoc &DL,
EVT VT)
: SDNode(isTarget ? ISD::TargetConstantFP : ISD::ConstantFP, 0, DL,
getSDVTList(VT)),
Value(val) {}
public:
const APFloat& getValueAPF() const { return Value->getValueAPF(); }
const ConstantFP *getConstantFPValue() const { return Value; }
/// Return true if the value is positive or negative zero.
bool isZero() const { return Value->isZero(); }
/// Return true if the value is a NaN.
bool isNaN() const { return Value->isNaN(); }
/// Return true if the value is an infinity
bool isInfinity() const { return Value->isInfinity(); }
/// Return true if the value is negative.
bool isNegative() const { return Value->isNegative(); }
/// We don't rely on operator== working on double values, as
/// it returns true for things that are clearly not equal, like -0.0 and 0.0.
/// As such, this method can be used to do an exact bit-for-bit comparison of
/// two floating point values.
/// We leave the version with the double argument here because it's just so
/// convenient to write "2.0" and the like. Without this function we'd
/// have to duplicate its logic everywhere it's called.
bool isExactlyValue(double V) const {
bool ignored;
APFloat Tmp(V);
Tmp.convert(Value->getValueAPF().getSemantics(),
APFloat::rmNearestTiesToEven, &ignored);
return isExactlyValue(Tmp);
}
bool isExactlyValue(const APFloat& V) const;
static bool isValueValidForType(EVT VT, const APFloat& Val);
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::ConstantFP ||
N->getOpcode() == ISD::TargetConstantFP;
}
};
/// Returns true if \p V is a constant integer zero.
bool isNullConstant(SDValue V);
/// Returns true if \p V is an FP constant with a value of positive zero.
bool isNullFPConstant(SDValue V);
/// Returns true if \p V is an integer constant with all bits set.
bool isAllOnesConstant(SDValue V);
/// Returns true if \p V is a constant integer one.
bool isOneConstant(SDValue V);
/// Returns true if \p V is a bitwise not operation. Assumes that an all ones
/// constant is canonicalized to be operand 1.
bool isBitwiseNot(SDValue V);
/// Returns the SDNode if it is a constant splat BuildVector or constant int.
ConstantSDNode *isConstOrConstSplat(SDValue V);
/// Returns the SDNode if it is a constant splat BuildVector or constant float.
ConstantFPSDNode *isConstOrConstSplatFP(SDValue V);
class GlobalAddressSDNode : public SDNode {
friend class SelectionDAG;
const GlobalValue *TheGlobal;
int64_t Offset;
unsigned char TargetFlags;
GlobalAddressSDNode(unsigned Opc, unsigned Order, const DebugLoc &DL,
const GlobalValue *GA, EVT VT, int64_t o,
unsigned char TargetFlags);
public:
const GlobalValue *getGlobal() const { return TheGlobal; }
int64_t getOffset() const { return Offset; }
unsigned char getTargetFlags() const { return TargetFlags; }
// Return the address space this GlobalAddress belongs to.
unsigned getAddressSpace() const;
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::GlobalAddress ||
N->getOpcode() == ISD::TargetGlobalAddress ||
N->getOpcode() == ISD::GlobalTLSAddress ||
N->getOpcode() == ISD::TargetGlobalTLSAddress;
}
};
class FrameIndexSDNode : public SDNode {
friend class SelectionDAG;
int FI;
FrameIndexSDNode(int fi, EVT VT, bool isTarg)
: SDNode(isTarg ? ISD::TargetFrameIndex : ISD::FrameIndex,
0, DebugLoc(), getSDVTList(VT)), FI(fi) {
}
public:
int getIndex() const { return FI; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::FrameIndex ||
N->getOpcode() == ISD::TargetFrameIndex;
}
};
class JumpTableSDNode : public SDNode {
friend class SelectionDAG;
int JTI;
unsigned char TargetFlags;
JumpTableSDNode(int jti, EVT VT, bool isTarg, unsigned char TF)
: SDNode(isTarg ? ISD::TargetJumpTable : ISD::JumpTable,
0, DebugLoc(), getSDVTList(VT)), JTI(jti), TargetFlags(TF) {
}
public:
int getIndex() const { return JTI; }
unsigned char getTargetFlags() const { return TargetFlags; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::JumpTable ||
N->getOpcode() == ISD::TargetJumpTable;
}
};
class ConstantPoolSDNode : public SDNode {
friend class SelectionDAG;
union {
const Constant *ConstVal;
MachineConstantPoolValue *MachineCPVal;
} Val;
int Offset; // It's a MachineConstantPoolValue if top bit is set.
unsigned Alignment; // Minimum alignment requirement of CP (not log2 value).
unsigned char TargetFlags;
ConstantPoolSDNode(bool isTarget, const Constant *c, EVT VT, int o,
unsigned Align, unsigned char TF)
: SDNode(isTarget ? ISD::TargetConstantPool : ISD::ConstantPool, 0,
DebugLoc(), getSDVTList(VT)), Offset(o), Alignment(Align),
TargetFlags(TF) {
assert(Offset >= 0 && "Offset is too large");
Val.ConstVal = c;
}
ConstantPoolSDNode(bool isTarget, MachineConstantPoolValue *v,
EVT VT, int o, unsigned Align, unsigned char TF)
: SDNode(isTarget ? ISD::TargetConstantPool : ISD::ConstantPool, 0,
DebugLoc(), getSDVTList(VT)), Offset(o), Alignment(Align),
TargetFlags(TF) {
assert(Offset >= 0 && "Offset is too large");
Val.MachineCPVal = v;
Offset |= 1 << (sizeof(unsigned)*CHAR_BIT-1);
}
public:
bool isMachineConstantPoolEntry() const {
return Offset < 0;
}
const Constant *getConstVal() const {
assert(!isMachineConstantPoolEntry() && "Wrong constantpool type");
return Val.ConstVal;
}
MachineConstantPoolValue *getMachineCPVal() const {
assert(isMachineConstantPoolEntry() && "Wrong constantpool type");
return Val.MachineCPVal;
}
int getOffset() const {
return Offset & ~(1 << (sizeof(unsigned)*CHAR_BIT-1));
}
// Return the alignment of this constant pool object, which is either 0 (for
// default alignment) or the desired value.
unsigned getAlignment() const { return Alignment; }
unsigned char getTargetFlags() const { return TargetFlags; }
Type *getType() const;
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::ConstantPool ||
N->getOpcode() == ISD::TargetConstantPool;
}
};
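// Illustrative sketch (not part of the original header; the constant name is
// hypothetical and a 32-bit unsigned is assumed): how the Offset tag bit
// above behaves. With the tag set, the signed Offset field reads as negative,
// which is exactly what isMachineConstantPoolEntry() tests, while getOffset()
// masks the bit away and recovers the raw offset.
constexpr unsigned CPMachineEntryBit = 1u << 31;
static_assert(((8u | CPMachineEntryBit) & ~CPMachineEntryBit) == 8u,
              "masking the tag bit recovers the raw offset");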
/// Completely target-dependent object reference.
class TargetIndexSDNode : public SDNode {
friend class SelectionDAG;
unsigned char TargetFlags;
int Index;
int64_t Offset;
public:
TargetIndexSDNode(int Idx, EVT VT, int64_t Ofs, unsigned char TF)
: SDNode(ISD::TargetIndex, 0, DebugLoc(), getSDVTList(VT)),
TargetFlags(TF), Index(Idx), Offset(Ofs) {}
unsigned char getTargetFlags() const { return TargetFlags; }
int getIndex() const { return Index; }
int64_t getOffset() const { return Offset; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::TargetIndex;
}
};
class BasicBlockSDNode : public SDNode {
friend class SelectionDAG;
MachineBasicBlock *MBB;
/// Debug info is meaningful and potentially useful here, but we create
/// blocks out of order when they're jumped to, which makes it a bit
/// harder. Let's see if we need it first.
explicit BasicBlockSDNode(MachineBasicBlock *mbb)
: SDNode(ISD::BasicBlock, 0, DebugLoc(), getSDVTList(MVT::Other)), MBB(mbb)
{}
public:
MachineBasicBlock *getBasicBlock() const { return MBB; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::BasicBlock;
}
};
/// A "pseudo-class" with methods for operating on BUILD_VECTORs.
class BuildVectorSDNode : public SDNode {
public:
// These are constructed as SDNodes and then cast to BuildVectorSDNodes.
explicit BuildVectorSDNode() = delete;
/// Check if this is a constant splat, and if so, find the
/// smallest element size that splats the vector. If MinSplatBits is
/// nonzero, the element size must be at least that large. Note that the
/// splat element may be the entire vector (i.e., a one element vector).
/// Returns the splat element value in SplatValue. Any undefined bits in
/// that value are zero, and the corresponding bits in the SplatUndef mask
/// are set. The SplatBitSize value is set to the splat element size in
/// bits. HasAnyUndefs is set to true if any bits in the vector are
/// undefined. isBigEndian describes the endianness of the target.
bool isConstantSplat(APInt &SplatValue, APInt &SplatUndef,
unsigned &SplatBitSize, bool &HasAnyUndefs,
unsigned MinSplatBits = 0,
bool isBigEndian = false) const;
/// \brief Returns the splatted value or a null value if this is not a splat.
///
/// If passed a non-null UndefElements bitvector, it will resize it to match
/// the vector width and set the bits where elements are undef.
SDValue getSplatValue(BitVector *UndefElements = nullptr) const;
/// \brief Returns the splatted constant or null if this is not a constant
/// splat.
///
/// If passed a non-null UndefElements bitvector, it will resize it to match
/// the vector width and set the bits where elements are undef.
ConstantSDNode *
getConstantSplatNode(BitVector *UndefElements = nullptr) const;
/// \brief Returns the splatted constant FP or null if this is not a constant
/// FP splat.
///
/// If passed a non-null UndefElements bitvector, it will resize it to match
/// the vector width and set the bits where elements are undef.
ConstantFPSDNode *
getConstantFPSplatNode(BitVector *UndefElements = nullptr) const;
/// \brief If this is a constant FP splat and the splatted constant FP is an
/// exact power of 2, return the log base 2 integer value. Otherwise,
/// return -1.
///
/// The BitWidth specifies the necessary bit precision.
int32_t getConstantFPSplatPow2ToLog2Int(BitVector *UndefElements,
uint32_t BitWidth) const;
bool isConstant() const;
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::BUILD_VECTOR;
}
};
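// Illustrative sketch (hypothetical helper, not part of the original header):
// the usual isConstantSplat calling pattern described above.
inline bool isSplatAllOnes(const BuildVectorSDNode *BV, bool IsBigEndian) {
  APInt SplatValue, SplatUndef;
  unsigned SplatBitSize;
  bool HasAnyUndefs;
  return BV->isConstantSplat(SplatValue, SplatUndef, SplatBitSize,
                             HasAnyUndefs, /*MinSplatBits=*/0, IsBigEndian) &&
         SplatValue.isAllOnesValue();
}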
/// An SDNode that holds an arbitrary LLVM IR Value. This is
/// used when the SelectionDAG needs to make a simple reference to something
/// in the LLVM IR representation.
///
class SrcValueSDNode : public SDNode {
friend class SelectionDAG;
const Value *V;
/// Create a SrcValue for a general value.
explicit SrcValueSDNode(const Value *v)
: SDNode(ISD::SRCVALUE, 0, DebugLoc(), getSDVTList(MVT::Other)), V(v) {}
public:
/// Return the contained Value.
const Value *getValue() const { return V; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::SRCVALUE;
}
};
class MDNodeSDNode : public SDNode {
friend class SelectionDAG;
const MDNode *MD;
explicit MDNodeSDNode(const MDNode *md)
: SDNode(ISD::MDNODE_SDNODE, 0, DebugLoc(), getSDVTList(MVT::Other)), MD(md)
{}
public:
const MDNode *getMD() const { return MD; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MDNODE_SDNODE;
}
};
class RegisterSDNode : public SDNode {
friend class SelectionDAG;
unsigned Reg;
RegisterSDNode(unsigned reg, EVT VT)
: SDNode(ISD::Register, 0, DebugLoc(), getSDVTList(VT)), Reg(reg) {}
public:
unsigned getReg() const { return Reg; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::Register;
}
};
class RegisterMaskSDNode : public SDNode {
friend class SelectionDAG;
// The memory for RegMask is not owned by the node.
const uint32_t *RegMask;
RegisterMaskSDNode(const uint32_t *mask)
: SDNode(ISD::RegisterMask, 0, DebugLoc(), getSDVTList(MVT::Untyped)),
RegMask(mask) {}
public:
const uint32_t *getRegMask() const { return RegMask; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::RegisterMask;
}
};
class BlockAddressSDNode : public SDNode {
friend class SelectionDAG;
const BlockAddress *BA;
int64_t Offset;
unsigned char TargetFlags;
BlockAddressSDNode(unsigned NodeTy, EVT VT, const BlockAddress *ba,
int64_t o, unsigned char Flags)
: SDNode(NodeTy, 0, DebugLoc(), getSDVTList(VT)),
BA(ba), Offset(o), TargetFlags(Flags) {}
public:
const BlockAddress *getBlockAddress() const { return BA; }
int64_t getOffset() const { return Offset; }
unsigned char getTargetFlags() const { return TargetFlags; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::BlockAddress ||
N->getOpcode() == ISD::TargetBlockAddress;
}
};
class EHLabelSDNode : public SDNode {
friend class SelectionDAG;
MCSymbol *Label;
EHLabelSDNode(unsigned Order, const DebugLoc &dl, MCSymbol *L)
: SDNode(ISD::EH_LABEL, Order, dl, getSDVTList(MVT::Other)), Label(L) {}
public:
MCSymbol *getLabel() const { return Label; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::EH_LABEL;
}
};
class ExternalSymbolSDNode : public SDNode {
friend class SelectionDAG;
const char *Symbol;
unsigned char TargetFlags;
ExternalSymbolSDNode(bool isTarget, const char *Sym, unsigned char TF, EVT VT)
: SDNode(isTarget ? ISD::TargetExternalSymbol : ISD::ExternalSymbol,
0, DebugLoc(), getSDVTList(VT)), Symbol(Sym), TargetFlags(TF) {}
public:
const char *getSymbol() const { return Symbol; }
unsigned char getTargetFlags() const { return TargetFlags; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::ExternalSymbol ||
N->getOpcode() == ISD::TargetExternalSymbol;
}
};
class MCSymbolSDNode : public SDNode {
friend class SelectionDAG;
MCSymbol *Symbol;
MCSymbolSDNode(MCSymbol *Symbol, EVT VT)
: SDNode(ISD::MCSymbol, 0, DebugLoc(), getSDVTList(VT)), Symbol(Symbol) {}
public:
MCSymbol *getMCSymbol() const { return Symbol; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MCSymbol;
}
};
class CondCodeSDNode : public SDNode {
friend class SelectionDAG;
ISD::CondCode Condition;
explicit CondCodeSDNode(ISD::CondCode Cond)
: SDNode(ISD::CONDCODE, 0, DebugLoc(), getSDVTList(MVT::Other)),
Condition(Cond) {}
public:
ISD::CondCode get() const { return Condition; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::CONDCODE;
}
};
/// This class is used to represent EVT's, which are used
/// to parameterize some operations.
class VTSDNode : public SDNode {
friend class SelectionDAG;
EVT ValueType;
explicit VTSDNode(EVT VT)
: SDNode(ISD::VALUETYPE, 0, DebugLoc(), getSDVTList(MVT::Other)),
ValueType(VT) {}
public:
EVT getVT() const { return ValueType; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::VALUETYPE;
}
};
/// Base class for LoadSDNode and StoreSDNode
class LSBaseSDNode : public MemSDNode {
public:
LSBaseSDNode(ISD::NodeType NodeTy, unsigned Order, const DebugLoc &dl,
SDVTList VTs, ISD::MemIndexedMode AM, EVT MemVT,
MachineMemOperand *MMO)
: MemSDNode(NodeTy, Order, dl, VTs, MemVT, MMO) {
LSBaseSDNodeBits.AddressingMode = AM;
assert(getAddressingMode() == AM && "Value truncated");
}
const SDValue &getOffset() const {
return getOperand(getOpcode() == ISD::LOAD ? 2 : 3);
}
/// Return the addressing mode for this load or store:
/// unindexed, pre-inc, pre-dec, post-inc, or post-dec.
ISD::MemIndexedMode getAddressingMode() const {
return static_cast<ISD::MemIndexedMode>(LSBaseSDNodeBits.AddressingMode);
}
/// Return true if this is a pre/post inc/dec load/store.
bool isIndexed() const { return getAddressingMode() != ISD::UNINDEXED; }
/// Return true if this is NOT a pre/post inc/dec load/store.
bool isUnindexed() const { return getAddressingMode() == ISD::UNINDEXED; }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::LOAD ||
N->getOpcode() == ISD::STORE;
}
};
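// Illustrative sketch (hypothetical helper, not part of the original header):
// using the addressing-mode accessor above to single out the pre-indexed
// forms of a load or store.
inline bool isPreIndexedLoadStore(const LSBaseSDNode *N) {
  ISD::MemIndexedMode AM = N->getAddressingMode();
  return AM == ISD::PRE_INC || AM == ISD::PRE_DEC;
}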
/// This class is used to represent ISD::LOAD nodes.
class LoadSDNode : public LSBaseSDNode {
friend class SelectionDAG;
LoadSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
ISD::MemIndexedMode AM, ISD::LoadExtType ETy, EVT MemVT,
MachineMemOperand *MMO)
: LSBaseSDNode(ISD::LOAD, Order, dl, VTs, AM, MemVT, MMO) {
LoadSDNodeBits.ExtTy = ETy;
assert(readMem() && "Load MachineMemOperand is not a load!");
assert(!writeMem() && "Load MachineMemOperand is a store!");
}
public:
/// Return whether this is a plain load,
/// or one of the varieties of value-extending loads.
ISD::LoadExtType getExtensionType() const {
return static_cast<ISD::LoadExtType>(LoadSDNodeBits.ExtTy);
}
const SDValue &getBasePtr() const { return getOperand(1); }
const SDValue &getOffset() const { return getOperand(2); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::LOAD;
}
};
/// This class is used to represent ISD::STORE nodes.
class StoreSDNode : public LSBaseSDNode {
friend class SelectionDAG;
StoreSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
ISD::MemIndexedMode AM, bool isTrunc, EVT MemVT,
MachineMemOperand *MMO)
: LSBaseSDNode(ISD::STORE, Order, dl, VTs, AM, MemVT, MMO) {
StoreSDNodeBits.IsTruncating = isTrunc;
assert(!readMem() && "Store MachineMemOperand is a load!");
assert(writeMem() && "Store MachineMemOperand is not a store!");
}
public:
/// Return true if the op does a truncation before store.
/// For integers this is the same as doing a TRUNCATE and storing the result.
/// For floats, it is the same as doing an FP_ROUND and storing the result.
bool isTruncatingStore() const { return StoreSDNodeBits.IsTruncating; }
const SDValue &getValue() const { return getOperand(1); }
const SDValue &getBasePtr() const { return getOperand(2); }
const SDValue &getOffset() const { return getOperand(3); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::STORE;
}
};
/// This base class is used to represent MLOAD and MSTORE nodes
class MaskedLoadStoreSDNode : public MemSDNode {
public:
friend class SelectionDAG;
MaskedLoadStoreSDNode(ISD::NodeType NodeTy, unsigned Order,
const DebugLoc &dl, SDVTList VTs, EVT MemVT,
MachineMemOperand *MMO)
: MemSDNode(NodeTy, Order, dl, VTs, MemVT, MMO) {}
// In both nodes the address is Op1 and the mask is Op2:
// MaskedLoadSDNode (Chain, ptr, mask, src0), src0 is a passthru value
// MaskedStoreSDNode (Chain, ptr, mask, data)
// Mask is a vector of i1 elements
const SDValue &getBasePtr() const { return getOperand(1); }
const SDValue &getMask() const { return getOperand(2); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MLOAD ||
N->getOpcode() == ISD::MSTORE;
}
};
/// This class is used to represent an MLOAD node
class MaskedLoadSDNode : public MaskedLoadStoreSDNode {
public:
friend class SelectionDAG;
MaskedLoadSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
ISD::LoadExtType ETy, bool IsExpanding, EVT MemVT,
MachineMemOperand *MMO)
: MaskedLoadStoreSDNode(ISD::MLOAD, Order, dl, VTs, MemVT, MMO) {
LoadSDNodeBits.ExtTy = ETy;
LoadSDNodeBits.IsExpanding = IsExpanding;
}
ISD::LoadExtType getExtensionType() const {
return static_cast<ISD::LoadExtType>(LoadSDNodeBits.ExtTy);
}
const SDValue &getSrc0() const { return getOperand(3); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MLOAD;
}
bool isExpandingLoad() const { return LoadSDNodeBits.IsExpanding; }
};
/// This class is used to represent an MSTORE node
class MaskedStoreSDNode : public MaskedLoadStoreSDNode {
public:
friend class SelectionDAG;
MaskedStoreSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
bool isTrunc, bool isCompressing, EVT MemVT,
MachineMemOperand *MMO)
: MaskedLoadStoreSDNode(ISD::MSTORE, Order, dl, VTs, MemVT, MMO) {
StoreSDNodeBits.IsTruncating = isTrunc;
StoreSDNodeBits.IsCompressing = isCompressing;
}
/// Return true if the op does a truncation before store.
/// For integers this is the same as doing a TRUNCATE and storing the result.
/// For floats, it is the same as doing an FP_ROUND and storing the result.
bool isTruncatingStore() const { return StoreSDNodeBits.IsTruncating; }
/// Returns true if the op does a compression to the vector before storing.
/// The node contiguously stores the active elements (integers or floats)
/// in src (those with their respective bit set in writemask k) to unaligned
/// memory at base_addr.
bool isCompressingStore() const { return StoreSDNodeBits.IsCompressing; }
const SDValue &getValue() const { return getOperand(3); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MSTORE;
}
};
/// This is a base class used to represent
/// MGATHER and MSCATTER nodes
///
class MaskedGatherScatterSDNode : public MemSDNode {
public:
friend class SelectionDAG;
MaskedGatherScatterSDNode(unsigned NodeTy, unsigned Order,
const DebugLoc &dl, SDVTList VTs, EVT MemVT,
MachineMemOperand *MMO)
: MemSDNode(NodeTy, Order, dl, VTs, MemVT, MMO) {}
// In both nodes the value/passthru is Op1 and the mask is Op2:
// MaskedGatherSDNode (Chain, src0, mask, base, index), src0 is a passthru value
// MaskedScatterSDNode (Chain, value, mask, base, index)
// Mask is a vector of i1 elements
const SDValue &getBasePtr() const { return getOperand(3); }
const SDValue &getIndex() const { return getOperand(4); }
const SDValue &getMask() const { return getOperand(2); }
const SDValue &getValue() const { return getOperand(1); }
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MGATHER ||
N->getOpcode() == ISD::MSCATTER;
}
};
/// This class is used to represent an MGATHER node
///
class MaskedGatherSDNode : public MaskedGatherScatterSDNode {
public:
friend class SelectionDAG;
MaskedGatherSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
EVT MemVT, MachineMemOperand *MMO)
: MaskedGatherScatterSDNode(ISD::MGATHER, Order, dl, VTs, MemVT, MMO) {}
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MGATHER;
}
};
/// This class is used to represent an MSCATTER node
///
class MaskedScatterSDNode : public MaskedGatherScatterSDNode {
public:
friend class SelectionDAG;
MaskedScatterSDNode(unsigned Order, const DebugLoc &dl, SDVTList VTs,
EVT MemVT, MachineMemOperand *MMO)
: MaskedGatherScatterSDNode(ISD::MSCATTER, Order, dl, VTs, MemVT, MMO) {}
static bool classof(const SDNode *N) {
return N->getOpcode() == ISD::MSCATTER;
}
};
/// An SDNode that represents everything that will be needed
/// to construct a MachineInstr. These nodes are created during the
/// instruction selection proper phase.
class MachineSDNode : public SDNode {
public:
using mmo_iterator = MachineMemOperand **;
private:
friend class SelectionDAG;
MachineSDNode(unsigned Opc, unsigned Order, const DebugLoc &DL, SDVTList VTs)
: SDNode(Opc, Order, DL, VTs) {}
/// Memory reference descriptions for this instruction.
mmo_iterator MemRefs = nullptr;
mmo_iterator MemRefsEnd = nullptr;
public:
mmo_iterator memoperands_begin() const { return MemRefs; }
mmo_iterator memoperands_end() const { return MemRefsEnd; }
bool memoperands_empty() const { return MemRefsEnd == MemRefs; }
/// Assign this MachineSDNodes's memory reference descriptor
/// list. This does not transfer ownership.
void setMemRefs(mmo_iterator NewMemRefs, mmo_iterator NewMemRefsEnd) {
for (mmo_iterator MMI = NewMemRefs, MME = NewMemRefsEnd; MMI != MME; ++MMI)
assert(*MMI && "Null mem ref detected!");
MemRefs = NewMemRefs;
MemRefsEnd = NewMemRefsEnd;
}
static bool classof(const SDNode *N) {
return N->isMachineOpcode();
}
};
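// Illustrative sketch (hypothetical helper, not part of the original header;
// assumes llvm/CodeGen/MachineFunction.h is available): since setMemRefs does
// not take ownership, callers typically place the array in storage owned by
// the MachineFunction, e.g. via allocateMemRefsArray.
inline void attachSingleMemRef(MachineFunction &MF, MachineSDNode *MN,
                               MachineMemOperand *MMO) {
  MachineSDNode::mmo_iterator MemRefs = MF.allocateMemRefsArray(1);
  MemRefs[0] = MMO;
  MN->setMemRefs(MemRefs, MemRefs + 1);
}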
class SDNodeIterator : public std::iterator<std::forward_iterator_tag,
SDNode, ptrdiff_t> {
const SDNode *Node;
unsigned Operand;
SDNodeIterator(const SDNode *N, unsigned Op) : Node(N), Operand(Op) {}
public:
bool operator==(const SDNodeIterator& x) const {
return Operand == x.Operand;
}
bool operator!=(const SDNodeIterator& x) const { return !operator==(x); }
pointer operator*() const {
return Node->getOperand(Operand).getNode();
}
pointer operator->() const { return operator*(); }
SDNodeIterator& operator++() { // Preincrement
++Operand;
return *this;
}
SDNodeIterator operator++(int) { // Postincrement
SDNodeIterator tmp = *this; ++*this; return tmp;
}
size_t operator-(SDNodeIterator Other) const {
assert(Node == Other.Node &&
"Cannot compare iterators of two different nodes!");
return Operand - Other.Operand;
}
static SDNodeIterator begin(const SDNode *N) { return SDNodeIterator(N, 0); }
static SDNodeIterator end (const SDNode *N) {
return SDNodeIterator(N, N->getNumOperands());
}
unsigned getOperand() const { return Operand; }
const SDNode *getNode() const { return Node; }
};
template <> struct GraphTraits<SDNode*> {
using NodeRef = SDNode *;
using ChildIteratorType = SDNodeIterator;
static NodeRef getEntryNode(SDNode *N) { return N; }
static ChildIteratorType child_begin(NodeRef N) {
return SDNodeIterator::begin(N);
}
static ChildIteratorType child_end(NodeRef N) {
return SDNodeIterator::end(N);
}
};
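// Illustrative sketch (hypothetical helper, not part of the original header;
// requires llvm/ADT/DepthFirstIterator.h): with the GraphTraits
// specialization above, generic graph algorithms such as depth_first can walk
// an SDNode's operand graph.
inline unsigned countReachableNodes(SDNode *Root) {
  unsigned Count = 0;
  for (SDNode *N : depth_first(Root)) {
    (void)N;
    ++Count;
  }
  return Count;
}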
/// A representation of the largest SDNode, for use in sizeof().
///
/// This needs to be a union because the largest node differs on 32 bit systems
/// with 4 and 8 byte pointer alignment, respectively.
using LargestSDNode = AlignedCharArrayUnion<AtomicSDNode, TargetIndexSDNode,
BlockAddressSDNode,
GlobalAddressSDNode>;
/// The SDNode class with the greatest alignment requirement.
using MostAlignedSDNode = GlobalAddressSDNode;
namespace ISD {
/// Returns true if the specified node is a non-extending and unindexed load.
inline bool isNormalLoad(const SDNode *N) {
const LoadSDNode *Ld = dyn_cast<LoadSDNode>(N);
return Ld && Ld->getExtensionType() == ISD::NON_EXTLOAD &&
Ld->getAddressingMode() == ISD::UNINDEXED;
}
/// Returns true if the specified node is a non-extending load.
inline bool isNON_EXTLoad(const SDNode *N) {
return isa<LoadSDNode>(N) &&
cast<LoadSDNode>(N)->getExtensionType() == ISD::NON_EXTLOAD;
}
/// Returns true if the specified node is an EXTLOAD.
inline bool isEXTLoad(const SDNode *N) {
return isa<LoadSDNode>(N) &&
cast<LoadSDNode>(N)->getExtensionType() == ISD::EXTLOAD;
}
/// Returns true if the specified node is a SEXTLOAD.
inline bool isSEXTLoad(const SDNode *N) {
return isa<LoadSDNode>(N) &&
cast<LoadSDNode>(N)->getExtensionType() == ISD::SEXTLOAD;
}
/// Returns true if the specified node is a ZEXTLOAD.
inline bool isZEXTLoad(const SDNode *N) {
return isa<LoadSDNode>(N) &&
cast<LoadSDNode>(N)->getExtensionType() == ISD::ZEXTLOAD;
}
/// Returns true if the specified node is an unindexed load.
inline bool isUNINDEXEDLoad(const SDNode *N) {
return isa<LoadSDNode>(N) &&
cast<LoadSDNode>(N)->getAddressingMode() == ISD::UNINDEXED;
}
/// Returns true if the specified node is a non-truncating
/// and unindexed store.
inline bool isNormalStore(const SDNode *N) {
const StoreSDNode *St = dyn_cast<StoreSDNode>(N);
return St && !St->isTruncatingStore() &&
St->getAddressingMode() == ISD::UNINDEXED;
}
/// Returns true if the specified node is a non-truncating store.
inline bool isNON_TRUNCStore(const SDNode *N) {
return isa<StoreSDNode>(N) && !cast<StoreSDNode>(N)->isTruncatingStore();
}
/// Returns true if the specified node is a truncating store.
inline bool isTRUNCStore(const SDNode *N) {
return isa<StoreSDNode>(N) && cast<StoreSDNode>(N)->isTruncatingStore();
}
/// Returns true if the specified node is an unindexed store.
inline bool isUNINDEXEDStore(const SDNode *N) {
return isa<StoreSDNode>(N) &&
cast<StoreSDNode>(N)->getAddressingMode() == ISD::UNINDEXED;
}
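/// Illustrative sketch (hypothetical helper, not part of the original
/// header): the predicates above compose; a "normal" load is exactly a
/// non-extending, unindexed one.
inline LoadSDNode *matchNormalLoad(SDNode *N) {
  return isNormalLoad(N) ? cast<LoadSDNode>(N) : nullptr;
}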
} // end namespace ISD
} // end namespace llvm
#endif // LLVM_CODEGEN_SELECTIONDAGNODES_H
diff --git a/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index e102df5e913d..c46d1b04804c 100644
--- a/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -1,872 +1,873 @@
//===-- LegalizeTypes.h - DAG Type Legalizer class definition ---*- C++ -*-===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file defines the DAGTypeLegalizer class. This is a private interface
// shared between the code that implements the SelectionDAG::LegalizeTypes
// method.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_LIB_CODEGEN_SELECTIONDAG_LEGALIZETYPES_H
#define LLVM_LIB_CODEGEN_SELECTIONDAG_LEGALIZETYPES_H
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Debug.h"
#include "llvm/Target/TargetLowering.h"
namespace llvm {
//===----------------------------------------------------------------------===//
/// This takes an arbitrary SelectionDAG as input and hacks on it until only
/// value types the target machine can handle are left. This involves promoting
/// small sizes to large sizes or splitting up large values into small values.
///
class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
const TargetLowering &TLI;
SelectionDAG &DAG;
public:
/// This pass uses the NodeId on the SDNodes to hold information about the
/// state of the node. The enum below lists the possible states.
enum NodeIdFlags {
/// All operands have been processed, so this node is ready to be handled.
ReadyToProcess = 0,
/// This is a new node, not before seen, that was created in the process of
/// legalizing some other node.
NewNode = -1,
/// This node's ID needs to be set to the number of its unprocessed
/// operands.
Unanalyzed = -2,
/// This is a node that has already been processed.
Processed = -3
// 1+ - This is a node which has this many unprocessed operands.
};
private:
/// This is a bitvector that contains two bits for each simple value type,
/// where the two bits correspond to the LegalizeAction enum from
/// TargetLowering. This can be queried with "getTypeAction(VT)".
TargetLowering::ValueTypeActionImpl ValueTypeActions;
/// Return how we should legalize values of this type.
TargetLowering::LegalizeTypeAction getTypeAction(EVT VT) const {
return TLI.getTypeAction(*DAG.getContext(), VT);
}
/// Return true if this type is legal on this target.
bool isTypeLegal(EVT VT) const {
return TLI.getTypeAction(*DAG.getContext(), VT) == TargetLowering::TypeLegal;
}
/// Return true if this is a simple legal type.
bool isSimpleLegalType(EVT VT) const {
return VT.isSimple() && TLI.isTypeLegal(VT);
}
/// Return true if this type can be passed in registers.
/// For example, x86_64's f128 should be legal in registers
/// and only some operations converted to library calls or integer
/// bitwise operations.
bool isLegalInHWReg(EVT VT) const {
EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
return VT == NVT && isSimpleLegalType(VT);
}
EVT getSetCCResultType(EVT VT) const {
return TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
}
/// Pretend all of this node's results are legal.
bool IgnoreNodeResults(SDNode *N) const {
return N->getOpcode() == ISD::TargetConstant;
}
/// For integer nodes that are below legal width, this map indicates what
/// promoted value to use.
SmallDenseMap<SDValue, SDValue, 8> PromotedIntegers;
/// For integer nodes that need to be expanded this map indicates which
/// operands are the expanded version of the input.
SmallDenseMap<SDValue, std::pair<SDValue, SDValue>, 8> ExpandedIntegers;
/// For floating-point nodes converted to integers of the same size, this map
/// indicates the converted value to use.
SmallDenseMap<SDValue, SDValue, 8> SoftenedFloats;
/// For floating-point nodes that have a smaller precision than the smallest
/// supported precision, this map indicates what promoted value to use.
SmallDenseMap<SDValue, SDValue, 8> PromotedFloats;
/// For float nodes that need to be expanded this map indicates which operands
/// are the expanded version of the input.
SmallDenseMap<SDValue, std::pair<SDValue, SDValue>, 8> ExpandedFloats;
/// For nodes that are <1 x ty>, this map indicates the scalar value of type
/// 'ty' to use.
SmallDenseMap<SDValue, SDValue, 8> ScalarizedVectors;
/// For nodes that need to be split this map indicates which operands are the
/// expanded version of the input.
SmallDenseMap<SDValue, std::pair<SDValue, SDValue>, 8> SplitVectors;
/// For vector nodes that need to be widened, indicates the widened value to
/// use.
SmallDenseMap<SDValue, SDValue, 8> WidenedVectors;
/// For values that have been replaced with another, indicates the replacement
/// value to use.
SmallDenseMap<SDValue, SDValue, 8> ReplacedValues;
/// This defines a worklist of nodes to process. In order to be pushed onto
/// this worklist, all operands of a node must have already been processed.
SmallVector<SDNode*, 128> Worklist;
public:
explicit DAGTypeLegalizer(SelectionDAG &dag)
: TLI(dag.getTargetLoweringInfo()), DAG(dag),
ValueTypeActions(TLI.getValueTypeActions()) {
static_assert(MVT::LAST_VALUETYPE <= MVT::MAX_ALLOWED_VALUETYPE,
"Too many value types for ValueTypeActions to hold!");
}
/// This is the main entry point for the type legalizer. This does a
/// top-down traversal of the dag, legalizing types as it goes. Returns
/// "true" if it made any changes.
bool run();
void NoteDeletion(SDNode *Old, SDNode *New) {
ExpungeNode(Old);
ExpungeNode(New);
for (unsigned i = 0, e = Old->getNumValues(); i != e; ++i)
ReplacedValues[SDValue(Old, i)] = SDValue(New, i);
}
SelectionDAG &getDAG() const { return DAG; }
private:
SDNode *AnalyzeNewNode(SDNode *N);
void AnalyzeNewValue(SDValue &Val);
void ExpungeNode(SDNode *N);
void PerformExpensiveChecks();
void RemapValue(SDValue &N);
// Common routines.
SDValue BitConvertToInteger(SDValue Op);
SDValue BitConvertVectorToIntegerVector(SDValue Op);
SDValue CreateStackStoreLoad(SDValue Op, EVT DestVT);
bool CustomLowerNode(SDNode *N, EVT VT, bool LegalizeResult);
bool CustomWidenLowerNode(SDNode *N, EVT VT);
/// Replace each result of the given MERGE_VALUES node with the corresponding
/// input operand, except for the result 'ResNo', for which the corresponding
/// input operand is returned.
SDValue DisintegrateMERGE_VALUES(SDNode *N, unsigned ResNo);
SDValue JoinIntegers(SDValue Lo, SDValue Hi);
SDValue LibCallify(RTLIB::Libcall LC, SDNode *N, bool isSigned);
std::pair<SDValue, SDValue> ExpandChainLibCall(RTLIB::Libcall LC,
SDNode *Node, bool isSigned);
std::pair<SDValue, SDValue> ExpandAtomic(SDNode *Node);
SDValue PromoteTargetBoolean(SDValue Bool, EVT ValVT);
/// Modify the bit vector to match the SetCC result type of ValVT.
/// The bit vector is widened with zeroes when WithZeroes is true.
SDValue WidenTargetBoolean(SDValue Bool, EVT ValVT, bool WithZeroes = false);
void ReplaceValueWith(SDValue From, SDValue To);
void SplitInteger(SDValue Op, SDValue &Lo, SDValue &Hi);
void SplitInteger(SDValue Op, EVT LoVT, EVT HiVT,
SDValue &Lo, SDValue &Hi);
void AddToWorklist(SDNode *N) {
N->setNodeId(ReadyToProcess);
Worklist.push_back(N);
}
//===--------------------------------------------------------------------===//
// Integer Promotion Support: LegalizeIntegerTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed operand Op which was promoted to a larger integer type,
/// this returns the promoted value. The low bits of the promoted value
/// corresponding to the original type are exactly equal to Op.
/// The extra bits contain rubbish, so the promoted value may need to be zero-
/// or sign-extended from the original type before it is usable (the helpers
/// SExtPromotedInteger and ZExtPromotedInteger can do this for you).
/// For example, if Op is an i16 and was promoted to an i32, then this method
/// returns an i32, the lower 16 bits of which coincide with Op, and the upper
/// 16 bits of which contain rubbish.
SDValue GetPromotedInteger(SDValue Op) {
SDValue &PromotedOp = PromotedIntegers[Op];
RemapValue(PromotedOp);
assert(PromotedOp.getNode() && "Operand wasn't promoted?");
return PromotedOp;
}
void SetPromotedInteger(SDValue Op, SDValue Result);
/// Get a promoted operand and sign extend it to the final size.
SDValue SExtPromotedInteger(SDValue Op) {
EVT OldVT = Op.getValueType();
SDLoc dl(Op);
Op = GetPromotedInteger(Op);
return DAG.getNode(ISD::SIGN_EXTEND_INREG, dl, Op.getValueType(), Op,
DAG.getValueType(OldVT));
}
/// Get a promoted operand and zero extend it to the final size.
SDValue ZExtPromotedInteger(SDValue Op) {
EVT OldVT = Op.getValueType();
SDLoc dl(Op);
Op = GetPromotedInteger(Op);
return DAG.getZeroExtendInReg(Op, dl, OldVT.getScalarType());
}
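/// Illustrative sketch (hypothetical member, not part of the original class):
/// an i16 operand promoted to i32 carries rubbish in its upper 16 bits, so a
/// consumer that needs the arithmetic value extends it first.
SDValue usePromotedExample(SDValue Op16) {
  SDValue Raw = GetPromotedInteger(Op16); // low 16 bits == Op16, rest rubbish
  (void)Raw;
  return ZExtPromotedInteger(Op16);       // upper bits guaranteed zero
}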
// Integer Result Promotion.
void PromoteIntegerResult(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_MERGE_VALUES(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_AssertSext(SDNode *N);
SDValue PromoteIntRes_AssertZext(SDNode *N);
SDValue PromoteIntRes_Atomic0(AtomicSDNode *N);
SDValue PromoteIntRes_Atomic1(AtomicSDNode *N);
SDValue PromoteIntRes_AtomicCmpSwap(AtomicSDNode *N, unsigned ResNo);
SDValue PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N);
SDValue PromoteIntRes_VECTOR_SHUFFLE(SDNode *N);
SDValue PromoteIntRes_BUILD_VECTOR(SDNode *N);
SDValue PromoteIntRes_SCALAR_TO_VECTOR(SDNode *N);
SDValue PromoteIntRes_EXTEND_VECTOR_INREG(SDNode *N);
SDValue PromoteIntRes_INSERT_VECTOR_ELT(SDNode *N);
SDValue PromoteIntRes_CONCAT_VECTORS(SDNode *N);
SDValue PromoteIntRes_BITCAST(SDNode *N);
SDValue PromoteIntRes_BSWAP(SDNode *N);
SDValue PromoteIntRes_BITREVERSE(SDNode *N);
SDValue PromoteIntRes_BUILD_PAIR(SDNode *N);
SDValue PromoteIntRes_Constant(SDNode *N);
SDValue PromoteIntRes_CTLZ(SDNode *N);
SDValue PromoteIntRes_CTPOP(SDNode *N);
SDValue PromoteIntRes_CTTZ(SDNode *N);
SDValue PromoteIntRes_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue PromoteIntRes_FP_TO_XINT(SDNode *N);
SDValue PromoteIntRes_FP_TO_FP16(SDNode *N);
SDValue PromoteIntRes_INT_EXTEND(SDNode *N);
SDValue PromoteIntRes_LOAD(LoadSDNode *N);
SDValue PromoteIntRes_MLOAD(MaskedLoadSDNode *N);
SDValue PromoteIntRes_MGATHER(MaskedGatherSDNode *N);
SDValue PromoteIntRes_Overflow(SDNode *N);
SDValue PromoteIntRes_SADDSUBO(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_SELECT(SDNode *N);
SDValue PromoteIntRes_VSELECT(SDNode *N);
SDValue PromoteIntRes_SELECT_CC(SDNode *N);
SDValue PromoteIntRes_SETCC(SDNode *N);
SDValue PromoteIntRes_SHL(SDNode *N);
SDValue PromoteIntRes_SimpleIntBinOp(SDNode *N);
SDValue PromoteIntRes_ZExtIntBinOp(SDNode *N);
SDValue PromoteIntRes_SExtIntBinOp(SDNode *N);
SDValue PromoteIntRes_SIGN_EXTEND_INREG(SDNode *N);
SDValue PromoteIntRes_SRA(SDNode *N);
SDValue PromoteIntRes_SRL(SDNode *N);
SDValue PromoteIntRes_TRUNCATE(SDNode *N);
SDValue PromoteIntRes_UADDSUBO(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_ADDSUBCARRY(SDNode *N, unsigned ResNo);
SDValue PromoteIntRes_UNDEF(SDNode *N);
SDValue PromoteIntRes_VAARG(SDNode *N);
SDValue PromoteIntRes_XMULO(SDNode *N, unsigned ResNo);
// Integer Operand Promotion.
bool PromoteIntegerOperand(SDNode *N, unsigned OperandNo);
SDValue PromoteIntOp_ANY_EXTEND(SDNode *N);
SDValue PromoteIntOp_ATOMIC_STORE(AtomicSDNode *N);
SDValue PromoteIntOp_BITCAST(SDNode *N);
SDValue PromoteIntOp_BUILD_PAIR(SDNode *N);
SDValue PromoteIntOp_BR_CC(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_BRCOND(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_BUILD_VECTOR(SDNode *N);
SDValue PromoteIntOp_INSERT_VECTOR_ELT(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue PromoteIntOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue PromoteIntOp_CONCAT_VECTORS(SDNode *N);
SDValue PromoteIntOp_SCALAR_TO_VECTOR(SDNode *N);
SDValue PromoteIntOp_SELECT(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_SELECT_CC(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_SETCC(SDNode *N, unsigned OpNo);
SDValue PromoteIntOp_Shift(SDNode *N);
SDValue PromoteIntOp_SIGN_EXTEND(SDNode *N);
SDValue PromoteIntOp_SINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_TRUNCATE(SDNode *N);
SDValue PromoteIntOp_UINT_TO_FP(SDNode *N);
SDValue PromoteIntOp_ZERO_EXTEND(SDNode *N);
SDValue PromoteIntOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_MSCATTER(MaskedScatterSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_MGATHER(MaskedGatherSDNode *N, unsigned OpNo);
SDValue PromoteIntOp_ADDSUBCARRY(SDNode *N, unsigned OpNo);
void PromoteSetCCOperands(SDValue &LHS,SDValue &RHS, ISD::CondCode Code);
//===--------------------------------------------------------------------===//
// Integer Expansion Support: LegalizeIntegerTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed operand Op which was expanded into two integers of half
/// the size, this returns the two halves. The low bits of Op are exactly
/// equal to the bits of Lo; the high bits exactly equal Hi.
/// For example, if Op is an i64 which was expanded into two i32's, then this
/// method returns the two i32's, with Lo being equal to the lower 32 bits of
/// Op, and Hi being equal to the upper 32 bits.
void GetExpandedInteger(SDValue Op, SDValue &Lo, SDValue &Hi);
void SetExpandedInteger(SDValue Op, SDValue Lo, SDValue Hi);
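/// Illustrative sketch (hypothetical member, not part of the original class):
/// by the contract above, joining the two halves of an expanded integer
/// reproduces the original value.
SDValue rejoinExpandedExample(SDValue Op) {
  SDValue Lo, Hi;
  GetExpandedInteger(Op, Lo, Hi);
  return JoinIntegers(Lo, Hi); // value-equivalent to Op
}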
// Integer Result Expansion.
void ExpandIntegerResult(SDNode *N, unsigned ResNo);
void ExpandIntRes_ANY_EXTEND (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_AssertSext (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_AssertZext (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_Constant (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_CTLZ (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_CTPOP (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_CTTZ (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_LOAD (LoadSDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_READCYCLECOUNTER (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_SIGN_EXTEND (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_SIGN_EXTEND_INREG (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_TRUNCATE (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ZERO_EXTEND (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_FLT_ROUNDS (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_FP_TO_SINT (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_FP_TO_UINT (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_Logical (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ADDSUB (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ADDSUBC (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ADDSUBE (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ADDSUBCARRY (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_BITREVERSE (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_BSWAP (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_MUL (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_SDIV (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_SREM (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_UDIV (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_UREM (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_Shift (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_MINMAX (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_SADDSUBO (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_UADDSUBO (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_XMULO (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandIntRes_ATOMIC_LOAD (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandShiftByConstant(SDNode *N, const APInt &Amt,
SDValue &Lo, SDValue &Hi);
bool ExpandShiftWithKnownAmountBit(SDNode *N, SDValue &Lo, SDValue &Hi);
bool ExpandShiftWithUnknownAmountBit(SDNode *N, SDValue &Lo, SDValue &Hi);
// Integer Operand Expansion.
bool ExpandIntegerOperand(SDNode *N, unsigned OperandNo);
SDValue ExpandIntOp_BR_CC(SDNode *N);
SDValue ExpandIntOp_SELECT_CC(SDNode *N);
SDValue ExpandIntOp_SETCC(SDNode *N);
SDValue ExpandIntOp_SETCCE(SDNode *N);
SDValue ExpandIntOp_SETCCCARRY(SDNode *N);
SDValue ExpandIntOp_Shift(SDNode *N);
SDValue ExpandIntOp_SINT_TO_FP(SDNode *N);
SDValue ExpandIntOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue ExpandIntOp_TRUNCATE(SDNode *N);
SDValue ExpandIntOp_UINT_TO_FP(SDNode *N);
SDValue ExpandIntOp_RETURNADDR(SDNode *N);
SDValue ExpandIntOp_ATOMIC_STORE(SDNode *N);
void IntegerExpandSetCCOperands(SDValue &NewLHS, SDValue &NewRHS,
ISD::CondCode &CCCode, const SDLoc &dl);
//===--------------------------------------------------------------------===//
// Float to Integer Conversion Support: LegalizeFloatTypes.cpp
//===--------------------------------------------------------------------===//
/// Given an operand Op of float type that is not supported in target HW and
/// was therefore converted (softened) to an integer, returns that integer.
/// The integer contains exactly the same bits as Op - only the type changed.
/// For example, if Op is an f32 which was softened to an i32, then this method
/// returns an i32, the bits of which coincide with those of Op.
/// If Op can be efficiently supported in target HW, or the operand must
/// stay in a register, Op is not converted to an integer.
/// In that case, the given Op is returned.
SDValue GetSoftenedFloat(SDValue Op) {
SDValue &SoftenedOp = SoftenedFloats[Op];
if (!SoftenedOp.getNode() &&
isSimpleLegalType(Op.getValueType()))
return Op;
RemapValue(SoftenedOp);
assert(SoftenedOp.getNode() && "Operand wasn't converted to integer?");
return SoftenedOp;
}
void SetSoftenedFloat(SDValue Op, SDValue Result);
// Convert Float Results to Integer for Non-HW-supported Operations.
bool SoftenFloatResult(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_MERGE_VALUES(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_BITCAST(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_BUILD_PAIR(SDNode *N);
SDValue SoftenFloatRes_ConstantFP(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_EXTRACT_VECTOR_ELT(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_FABS(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_FMINNUM(SDNode *N);
SDValue SoftenFloatRes_FMAXNUM(SDNode *N);
SDValue SoftenFloatRes_FADD(SDNode *N);
SDValue SoftenFloatRes_FCEIL(SDNode *N);
SDValue SoftenFloatRes_FCOPYSIGN(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_FCOS(SDNode *N);
SDValue SoftenFloatRes_FDIV(SDNode *N);
SDValue SoftenFloatRes_FEXP(SDNode *N);
SDValue SoftenFloatRes_FEXP2(SDNode *N);
SDValue SoftenFloatRes_FFLOOR(SDNode *N);
SDValue SoftenFloatRes_FLOG(SDNode *N);
SDValue SoftenFloatRes_FLOG2(SDNode *N);
SDValue SoftenFloatRes_FLOG10(SDNode *N);
SDValue SoftenFloatRes_FMA(SDNode *N);
SDValue SoftenFloatRes_FMUL(SDNode *N);
SDValue SoftenFloatRes_FNEARBYINT(SDNode *N);
SDValue SoftenFloatRes_FNEG(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_FP_EXTEND(SDNode *N);
SDValue SoftenFloatRes_FP16_TO_FP(SDNode *N);
SDValue SoftenFloatRes_FP_ROUND(SDNode *N);
SDValue SoftenFloatRes_FPOW(SDNode *N);
SDValue SoftenFloatRes_FPOWI(SDNode *N);
SDValue SoftenFloatRes_FREM(SDNode *N);
SDValue SoftenFloatRes_FRINT(SDNode *N);
SDValue SoftenFloatRes_FROUND(SDNode *N);
SDValue SoftenFloatRes_FSIN(SDNode *N);
SDValue SoftenFloatRes_FSQRT(SDNode *N);
SDValue SoftenFloatRes_FSUB(SDNode *N);
SDValue SoftenFloatRes_FTRUNC(SDNode *N);
SDValue SoftenFloatRes_LOAD(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_SELECT(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_SELECT_CC(SDNode *N, unsigned ResNo);
SDValue SoftenFloatRes_UNDEF(SDNode *N);
SDValue SoftenFloatRes_VAARG(SDNode *N);
SDValue SoftenFloatRes_XINT_TO_FP(SDNode *N);
// Return true if we can skip softening the given operand or SDNode because
// either it was softened before by SoftenFloatResult and references to the
// operand were replaced by ReplaceValueWith, or its value type is legal in HW
// registers and the operand can be left unchanged.
bool CanSkipSoftenFloatOperand(SDNode *N, unsigned OpNo);
// Convert Float Operand to Integer for Non-HW-supported Operations.
bool SoftenFloatOperand(SDNode *N, unsigned OpNo);
SDValue SoftenFloatOp_BITCAST(SDNode *N);
SDValue SoftenFloatOp_COPY_TO_REG(SDNode *N);
SDValue SoftenFloatOp_BR_CC(SDNode *N);
SDValue SoftenFloatOp_FABS(SDNode *N);
SDValue SoftenFloatOp_FCOPYSIGN(SDNode *N);
SDValue SoftenFloatOp_FNEG(SDNode *N);
SDValue SoftenFloatOp_FP_EXTEND(SDNode *N);
SDValue SoftenFloatOp_FP_ROUND(SDNode *N);
SDValue SoftenFloatOp_FP_TO_XINT(SDNode *N);
SDValue SoftenFloatOp_SELECT(SDNode *N);
SDValue SoftenFloatOp_SELECT_CC(SDNode *N);
SDValue SoftenFloatOp_SETCC(SDNode *N);
SDValue SoftenFloatOp_STORE(SDNode *N, unsigned OpNo);
//===--------------------------------------------------------------------===//
// Float Expansion Support: LegalizeFloatTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed operand Op which was expanded into two floating-point
/// values of half the size, this returns the two halves.
/// The low bits of Op are exactly equal to the bits of Lo; the high bits
/// exactly equal Hi. For example, if Op is a ppcf128 which was expanded
/// into two f64's, then this method returns the two f64's, with Lo being
/// equal to the lower 64 bits of Op, and Hi to the upper 64 bits.
void GetExpandedFloat(SDValue Op, SDValue &Lo, SDValue &Hi);
void SetExpandedFloat(SDValue Op, SDValue Lo, SDValue Hi);
// Float Result Expansion.
void ExpandFloatResult(SDNode *N, unsigned ResNo);
void ExpandFloatRes_ConstantFP(SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FABS (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FMINNUM (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FMAXNUM (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FADD (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FCEIL (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FCOPYSIGN (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FCOS (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FDIV (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FEXP (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FEXP2 (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FFLOOR (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FLOG (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FLOG2 (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FLOG10 (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FMA (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FMUL (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FNEARBYINT(SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FNEG (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FP_EXTEND (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FPOW (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FPOWI (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FREM (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FRINT (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FROUND (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FSIN (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FSQRT (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FSUB (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_FTRUNC (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_LOAD (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandFloatRes_XINT_TO_FP(SDNode *N, SDValue &Lo, SDValue &Hi);
// Float Operand Expansion.
bool ExpandFloatOperand(SDNode *N, unsigned OperandNo);
SDValue ExpandFloatOp_BR_CC(SDNode *N);
SDValue ExpandFloatOp_FCOPYSIGN(SDNode *N);
SDValue ExpandFloatOp_FP_ROUND(SDNode *N);
SDValue ExpandFloatOp_FP_TO_SINT(SDNode *N);
SDValue ExpandFloatOp_FP_TO_UINT(SDNode *N);
SDValue ExpandFloatOp_SELECT_CC(SDNode *N);
SDValue ExpandFloatOp_SETCC(SDNode *N);
SDValue ExpandFloatOp_STORE(SDNode *N, unsigned OpNo);
void FloatExpandSetCCOperands(SDValue &NewLHS, SDValue &NewRHS,
ISD::CondCode &CCCode, const SDLoc &dl);
//===--------------------------------------------------------------------===//
// Float promotion support: LegalizeFloatTypes.cpp
//===--------------------------------------------------------------------===//
SDValue GetPromotedFloat(SDValue Op) {
SDValue &PromotedOp = PromotedFloats[Op];
RemapValue(PromotedOp);
assert(PromotedOp.getNode() && "Operand wasn't promoted?");
return PromotedOp;
}
void SetPromotedFloat(SDValue Op, SDValue Result);
void PromoteFloatResult(SDNode *N, unsigned ResNo);
SDValue PromoteFloatRes_BITCAST(SDNode *N);
SDValue PromoteFloatRes_BinOp(SDNode *N);
SDValue PromoteFloatRes_ConstantFP(SDNode *N);
SDValue PromoteFloatRes_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue PromoteFloatRes_FCOPYSIGN(SDNode *N);
SDValue PromoteFloatRes_FMAD(SDNode *N);
SDValue PromoteFloatRes_FPOWI(SDNode *N);
SDValue PromoteFloatRes_FP_ROUND(SDNode *N);
SDValue PromoteFloatRes_LOAD(SDNode *N);
SDValue PromoteFloatRes_SELECT(SDNode *N);
SDValue PromoteFloatRes_SELECT_CC(SDNode *N);
SDValue PromoteFloatRes_UnaryOp(SDNode *N);
SDValue PromoteFloatRes_UNDEF(SDNode *N);
SDValue PromoteFloatRes_XINT_TO_FP(SDNode *N);
bool PromoteFloatOperand(SDNode *N, unsigned ResNo);
SDValue PromoteFloatOp_BITCAST(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_FCOPYSIGN(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_FP_EXTEND(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_FP_TO_XINT(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_STORE(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_SELECT_CC(SDNode *N, unsigned OpNo);
SDValue PromoteFloatOp_SETCC(SDNode *N, unsigned OpNo);
//===--------------------------------------------------------------------===//
// Scalarization Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed one-element vector Op which was scalarized to its
/// element type, this returns the element. For example, if Op is a v1i32,
/// Op = < i32 val >, this method returns val, an i32.
SDValue GetScalarizedVector(SDValue Op) {
SDValue &ScalarizedOp = ScalarizedVectors[Op];
RemapValue(ScalarizedOp);
assert(ScalarizedOp.getNode() && "Operand wasn't scalarized?");
return ScalarizedOp;
}
void SetScalarizedVector(SDValue Op, SDValue Result);
// Vector Result Scalarization: <1 x ty> -> ty.
void ScalarizeVectorResult(SDNode *N, unsigned OpNo);
SDValue ScalarizeVecRes_MERGE_VALUES(SDNode *N, unsigned ResNo);
SDValue ScalarizeVecRes_BinOp(SDNode *N);
SDValue ScalarizeVecRes_TernaryOp(SDNode *N);
SDValue ScalarizeVecRes_UnaryOp(SDNode *N);
SDValue ScalarizeVecRes_InregOp(SDNode *N);
SDValue ScalarizeVecRes_VecInregOp(SDNode *N);
SDValue ScalarizeVecRes_BITCAST(SDNode *N);
SDValue ScalarizeVecRes_BUILD_VECTOR(SDNode *N);
SDValue ScalarizeVecRes_EXTRACT_SUBVECTOR(SDNode *N);
SDValue ScalarizeVecRes_FP_ROUND(SDNode *N);
SDValue ScalarizeVecRes_FPOWI(SDNode *N);
SDValue ScalarizeVecRes_INSERT_VECTOR_ELT(SDNode *N);
SDValue ScalarizeVecRes_LOAD(LoadSDNode *N);
SDValue ScalarizeVecRes_SCALAR_TO_VECTOR(SDNode *N);
SDValue ScalarizeVecRes_VSELECT(SDNode *N);
SDValue ScalarizeVecRes_SELECT(SDNode *N);
SDValue ScalarizeVecRes_SELECT_CC(SDNode *N);
SDValue ScalarizeVecRes_SETCC(SDNode *N);
SDValue ScalarizeVecRes_UNDEF(SDNode *N);
SDValue ScalarizeVecRes_VECTOR_SHUFFLE(SDNode *N);
SDValue ScalarizeVecRes_VSETCC(SDNode *N);
// Vector Operand Scalarization: <1 x ty> -> ty.
bool ScalarizeVectorOperand(SDNode *N, unsigned OpNo);
SDValue ScalarizeVecOp_BITCAST(SDNode *N);
SDValue ScalarizeVecOp_UnaryOp(SDNode *N);
SDValue ScalarizeVecOp_CONCAT_VECTORS(SDNode *N);
SDValue ScalarizeVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue ScalarizeVecOp_VSELECT(SDNode *N);
+ SDValue ScalarizeVecOp_VSETCC(SDNode *N);
SDValue ScalarizeVecOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue ScalarizeVecOp_FP_ROUND(SDNode *N, unsigned OpNo);
//===--------------------------------------------------------------------===//
// Vector Splitting Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed vector Op which was split into vectors of half the size,
/// this method returns the halves. The first elements of Op coincide with the
/// elements of Lo; the remaining elements of Op coincide with the elements of
/// Hi: Op is what you would get by concatenating Lo and Hi.
/// For example, if Op is a v8i32 that was split into two v4i32's, then this
/// method returns the two v4i32's, with Lo corresponding to the first 4
/// elements of Op, and Hi to the last 4 elements.
void GetSplitVector(SDValue Op, SDValue &Lo, SDValue &Hi);
void SetSplitVector(SDValue Op, SDValue Lo, SDValue Hi);
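/// Illustrative sketch (hypothetical member, not part of the original class):
/// by the contract above, concatenating the two halves of a split vector
/// reproduces the original value.
SDValue rejoinSplitExample(SDValue Op) {
  SDValue Lo, Hi;
  GetSplitVector(Op, Lo, Hi);
  return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(Op), Op.getValueType(), Lo,
                     Hi);
}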
// Vector Result Splitting: <128 x ty> -> 2 x <64 x ty>.
void SplitVectorResult(SDNode *N, unsigned OpNo);
void SplitVecRes_BinOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_UnaryOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_ExtendOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_InregOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_ExtVecInRegOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_BITCAST(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_BUILD_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_CONCAT_VECTORS(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_EXTRACT_SUBVECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_FPOWI(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_FCOPYSIGN(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_INSERT_VECTOR_ELT(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_LOAD(LoadSDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_MLOAD(MaskedLoadSDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_MGATHER(MaskedGatherSDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SCALAR_TO_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,
SDValue &Hi);
// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.
bool SplitVectorOperand(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_UnaryOp(SDNode *N);
SDValue SplitVecOp_TruncateHelper(SDNode *N);
SDValue SplitVecOp_BITCAST(SDNode *N);
SDValue SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue SplitVecOp_ExtVecInRegOp(SDNode *N);
SDValue SplitVecOp_STORE(StoreSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MSTORE(MaskedStoreSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MSCATTER(MaskedScatterSDNode *N, unsigned OpNo);
SDValue SplitVecOp_MGATHER(MaskedGatherSDNode *N, unsigned OpNo);
SDValue SplitVecOp_CONCAT_VECTORS(SDNode *N);
SDValue SplitVecOp_VSETCC(SDNode *N);
SDValue SplitVecOp_FP_ROUND(SDNode *N);
SDValue SplitVecOp_FCOPYSIGN(SDNode *N);
//===--------------------------------------------------------------------===//
// Vector Widening Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//
/// Given a processed vector Op which was widened into a larger vector, this
/// method returns the larger vector. The elements of the returned vector
/// consist of the elements of Op followed by elements containing rubbish.
/// For example, if Op is a v2i32 that was widened to a v4i32, then this
/// method returns a v4i32 for which the first two elements are the same as
/// those of Op, while the last two elements contain rubbish.
SDValue GetWidenedVector(SDValue Op) {
SDValue &WidenedOp = WidenedVectors[Op];
RemapValue(WidenedOp);
assert(WidenedOp.getNode() && "Operand wasn't widened?");
return WidenedOp;
}
void SetWidenedVector(SDValue Op, SDValue Result);
// Widen Vector Result Promotion.
void WidenVectorResult(SDNode *N, unsigned ResNo);
SDValue WidenVecRes_MERGE_VALUES(SDNode* N, unsigned ResNo);
SDValue WidenVecRes_BITCAST(SDNode* N);
SDValue WidenVecRes_BUILD_VECTOR(SDNode* N);
SDValue WidenVecRes_CONCAT_VECTORS(SDNode* N);
SDValue WidenVecRes_EXTEND_VECTOR_INREG(SDNode* N);
SDValue WidenVecRes_EXTRACT_SUBVECTOR(SDNode* N);
SDValue WidenVecRes_INSERT_VECTOR_ELT(SDNode* N);
SDValue WidenVecRes_LOAD(SDNode* N);
SDValue WidenVecRes_MLOAD(MaskedLoadSDNode* N);
SDValue WidenVecRes_MGATHER(MaskedGatherSDNode* N);
SDValue WidenVecRes_SCALAR_TO_VECTOR(SDNode* N);
SDValue WidenVecRes_SELECT(SDNode* N);
SDValue WidenVSELECTAndMask(SDNode *N);
SDValue WidenVecRes_SELECT_CC(SDNode* N);
SDValue WidenVecRes_SETCC(SDNode* N);
SDValue WidenVecRes_UNDEF(SDNode *N);
SDValue WidenVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N);
SDValue WidenVecRes_VSETCC(SDNode* N);
SDValue WidenVecRes_Ternary(SDNode *N);
SDValue WidenVecRes_Binary(SDNode *N);
SDValue WidenVecRes_BinaryCanTrap(SDNode *N);
SDValue WidenVecRes_Convert(SDNode *N);
SDValue WidenVecRes_FCOPYSIGN(SDNode *N);
SDValue WidenVecRes_POWI(SDNode *N);
SDValue WidenVecRes_Shift(SDNode *N);
SDValue WidenVecRes_Unary(SDNode *N);
SDValue WidenVecRes_InregOp(SDNode *N);
// Widen Vector Operand.
bool WidenVectorOperand(SDNode *N, unsigned OpNo);
SDValue WidenVecOp_BITCAST(SDNode *N);
SDValue WidenVecOp_CONCAT_VECTORS(SDNode *N);
SDValue WidenVecOp_EXTEND(SDNode *N);
SDValue WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N);
SDValue WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N);
SDValue WidenVecOp_STORE(SDNode* N);
SDValue WidenVecOp_MSTORE(SDNode* N, unsigned OpNo);
SDValue WidenVecOp_MSCATTER(SDNode* N, unsigned OpNo);
SDValue WidenVecOp_SETCC(SDNode* N);
SDValue WidenVecOp_Convert(SDNode *N);
SDValue WidenVecOp_FCOPYSIGN(SDNode *N);
//===--------------------------------------------------------------------===//
// Vector Widening Utilities Support: LegalizeVectorTypes.cpp
//===--------------------------------------------------------------------===//
/// Helper function to generate a set of loads to load a vector with a
/// resulting wider type. It takes:
/// LdChain: list of chains for the loads to be generated.
/// LD: the load to widen.
SDValue GenWidenVectorLoads(SmallVectorImpl<SDValue> &LdChain,
LoadSDNode *LD);
/// Helper function to generate a set of extension loads to load a vector with
/// a resulting wider type. It takes:
/// LdChain: list of chains for the loads to be generated.
/// LD: the load to widen.
/// ExtType: extension element type.
SDValue GenWidenVectorExtLoads(SmallVectorImpl<SDValue> &LdChain,
LoadSDNode *LD, ISD::LoadExtType ExtType);
/// Helper function to generate a set of stores to store a widened vector into
/// non-widened memory.
/// StChain: list of chains for the stores we have generated.
/// ST: store of a widened value.
void GenWidenVectorStores(SmallVectorImpl<SDValue> &StChain, StoreSDNode *ST);
/// Helper function to generate a set of stores to store a truncated widened
/// vector into non-widened memory.
/// StChain: list of chains for the stores we have generated.
/// ST: store of a widened value.
void GenWidenVectorTruncStores(SmallVectorImpl<SDValue> &StChain,
StoreSDNode *ST);
/// Modifies a vector input (widens or narrows it) to a vector of NVT. The
/// input vector must have the same element type as NVT.
/// When FillWithZeroes is "on" the vector will be widened with zeroes.
/// By default, the vector will be widened with undefined values.
SDValue ModifyToType(SDValue InOp, EVT NVT, bool FillWithZeroes = false);
/// Return a mask of vector type MaskVT to replace InMask. Also adjust
/// MaskVT to ToMaskVT if needed with vector extension or truncation.
SDValue convertMask(SDValue InMask, EVT MaskVT, EVT ToMaskVT);
/// Get the target mask VT, and widen if needed.
EVT getSETCCWidenedResultTy(SDValue SetCC);
//===--------------------------------------------------------------------===//
// Generic Splitting: LegalizeTypesGeneric.cpp
//===--------------------------------------------------------------------===//
// Legalization methods which rely only on the illegal type being split into
// two not necessarily identical types. As such they can be used for splitting
// vectors and expanding integers and floats.
void GetSplitOp(SDValue Op, SDValue &Lo, SDValue &Hi) {
if (Op.getValueType().isVector())
GetSplitVector(Op, Lo, Hi);
else if (Op.getValueType().isInteger())
GetExpandedInteger(Op, Lo, Hi);
else
GetExpandedFloat(Op, Lo, Hi);
}
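// For example (an illustrative sketch): a v8i32 operand dispatches to
// GetSplitVector and yields two v4i32 halves, while an i64 operand on a
// 32-bit target dispatches to GetExpandedInteger and yields two i32 halves.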
/// Use ISD::EXTRACT_ELEMENT nodes to extract the low and high parts of the
/// given value.
void GetPairElements(SDValue Pair, SDValue &Lo, SDValue &Hi);
// Generic Result Splitting.
void SplitRes_MERGE_VALUES(SDNode *N, unsigned ResNo,
SDValue &Lo, SDValue &Hi);
void SplitRes_SELECT (SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitRes_SELECT_CC (SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitRes_UNDEF (SDNode *N, SDValue &Lo, SDValue &Hi);
//===--------------------------------------------------------------------===//
// Generic Expansion: LegalizeTypesGeneric.cpp
//===--------------------------------------------------------------------===//
// Legalization methods which rely only on the illegal type being split into
// two identical types of half the size, and on the Lo/Hi part being stored
// first in memory on little/big-endian machines, followed by the Hi/Lo part.
// As such they can be used for expanding integers and floats.
void GetExpandedOp(SDValue Op, SDValue &Lo, SDValue &Hi) {
if (Op.getValueType().isInteger())
GetExpandedInteger(Op, Lo, Hi);
else
GetExpandedFloat(Op, Lo, Hi);
}
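// For example (an illustrative sketch): expanding an i64 on a 32-bit target
// yields an i32 Lo and an i32 Hi part; as noted above, the Lo part comes
// first in memory on little-endian machines and the Hi part first on
// big-endian ones.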
/// This function will split the integer \p Op into \p NumElements
/// operations of type \p EltVT and store them in \p Ops.
void IntegerToVector(SDValue Op, unsigned NumElements,
SmallVectorImpl<SDValue> &Ops, EVT EltVT);
// Generic Result Expansion.
void ExpandRes_MERGE_VALUES (SDNode *N, unsigned ResNo,
SDValue &Lo, SDValue &Hi);
void ExpandRes_BITCAST (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandRes_BUILD_PAIR (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandRes_EXTRACT_ELEMENT (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandRes_EXTRACT_VECTOR_ELT(SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandRes_NormalLoad (SDNode *N, SDValue &Lo, SDValue &Hi);
void ExpandRes_VAARG (SDNode *N, SDValue &Lo, SDValue &Hi);
// Generic Operand Expansion.
SDValue ExpandOp_BITCAST (SDNode *N);
SDValue ExpandOp_BUILD_VECTOR (SDNode *N);
SDValue ExpandOp_EXTRACT_ELEMENT (SDNode *N);
SDValue ExpandOp_INSERT_VECTOR_ELT(SDNode *N);
SDValue ExpandOp_SCALAR_TO_VECTOR (SDNode *N);
SDValue ExpandOp_NormalStore (SDNode *N, unsigned OpNo);
};
} // end namespace llvm.
#endif
diff --git a/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index ecb54e1e4b41..6aa3270883f0 100644
--- a/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1,4086 +1,4119 @@
//===------- LegalizeVectorTypes.cpp - Legalization of vector types -------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file performs vector type splitting and scalarization for LegalizeTypes.
// Scalarization is the act of changing a computation in an illegal one-element
// vector type to be a computation in its scalar element type. For example,
// implementing <1 x f32> arithmetic in a scalar f32 register. This is needed
// as a base case when scalarizing vector arithmetic like <4 x f32>, which
// eventually decomposes to scalars if the target doesn't support v4f32 or v2f32
// types.
// Splitting is the act of changing a computation in an illegal vector type to
// be a computation in two vectors of half the size. For example, implementing
// <128 x f32> operations in terms of two <64 x f32> operations.
//
//===----------------------------------------------------------------------===//
#include "LegalizeTypes.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
#define DEBUG_TYPE "legalize-types"
//===----------------------------------------------------------------------===//
// Result Vector Scalarization: <1 x ty> -> ty.
//===----------------------------------------------------------------------===//
void DAGTypeLegalizer::ScalarizeVectorResult(SDNode *N, unsigned ResNo) {
DEBUG(dbgs() << "Scalarize node result " << ResNo << ": ";
N->dump(&DAG);
dbgs() << "\n");
SDValue R = SDValue();
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "ScalarizeVectorResult #" << ResNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
report_fatal_error("Do not know how to scalarize the result of this "
"operator!\n");
case ISD::MERGE_VALUES: R = ScalarizeVecRes_MERGE_VALUES(N, ResNo);break;
case ISD::BITCAST: R = ScalarizeVecRes_BITCAST(N); break;
case ISD::BUILD_VECTOR: R = ScalarizeVecRes_BUILD_VECTOR(N); break;
case ISD::EXTRACT_SUBVECTOR: R = ScalarizeVecRes_EXTRACT_SUBVECTOR(N); break;
case ISD::FP_ROUND: R = ScalarizeVecRes_FP_ROUND(N); break;
case ISD::FP_ROUND_INREG: R = ScalarizeVecRes_InregOp(N); break;
case ISD::FPOWI: R = ScalarizeVecRes_FPOWI(N); break;
case ISD::INSERT_VECTOR_ELT: R = ScalarizeVecRes_INSERT_VECTOR_ELT(N); break;
case ISD::LOAD: R = ScalarizeVecRes_LOAD(cast<LoadSDNode>(N));break;
case ISD::SCALAR_TO_VECTOR: R = ScalarizeVecRes_SCALAR_TO_VECTOR(N); break;
case ISD::SIGN_EXTEND_INREG: R = ScalarizeVecRes_InregOp(N); break;
case ISD::VSELECT: R = ScalarizeVecRes_VSELECT(N); break;
case ISD::SELECT: R = ScalarizeVecRes_SELECT(N); break;
case ISD::SELECT_CC: R = ScalarizeVecRes_SELECT_CC(N); break;
case ISD::SETCC: R = ScalarizeVecRes_SETCC(N); break;
case ISD::UNDEF: R = ScalarizeVecRes_UNDEF(N); break;
case ISD::VECTOR_SHUFFLE: R = ScalarizeVecRes_VECTOR_SHUFFLE(N); break;
case ISD::ANY_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:
R = ScalarizeVecRes_VecInregOp(N);
break;
case ISD::ANY_EXTEND:
case ISD::BITREVERSE:
case ISD::BSWAP:
case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF:
case ISD::CTPOP:
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF:
case ISD::FABS:
case ISD::FCEIL:
case ISD::FCOS:
case ISD::FEXP:
case ISD::FEXP2:
case ISD::FFLOOR:
case ISD::FLOG:
case ISD::FLOG10:
case ISD::FLOG2:
case ISD::FNEARBYINT:
case ISD::FNEG:
case ISD::FP_EXTEND:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::FRINT:
case ISD::FROUND:
case ISD::FSIN:
case ISD::FSQRT:
case ISD::FTRUNC:
case ISD::SIGN_EXTEND:
case ISD::SINT_TO_FP:
case ISD::TRUNCATE:
case ISD::UINT_TO_FP:
case ISD::ZERO_EXTEND:
case ISD::FCANONICALIZE:
R = ScalarizeVecRes_UnaryOp(N);
break;
case ISD::ADD:
case ISD::AND:
case ISD::FADD:
case ISD::FCOPYSIGN:
case ISD::FDIV:
case ISD::FMUL:
case ISD::FMINNUM:
case ISD::FMAXNUM:
case ISD::FMINNAN:
case ISD::FMAXNAN:
case ISD::SMIN:
case ISD::SMAX:
case ISD::UMIN:
case ISD::UMAX:
case ISD::FPOW:
case ISD::FREM:
case ISD::FSUB:
case ISD::MUL:
case ISD::OR:
case ISD::SDIV:
case ISD::SREM:
case ISD::SUB:
case ISD::UDIV:
case ISD::UREM:
case ISD::XOR:
case ISD::SHL:
case ISD::SRA:
case ISD::SRL:
R = ScalarizeVecRes_BinOp(N);
break;
case ISD::FMA:
R = ScalarizeVecRes_TernaryOp(N);
break;
}
// If R is null, the sub-method took care of registering the result.
if (R.getNode())
SetScalarizedVector(SDValue(N, ResNo), R);
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_BinOp(SDNode *N) {
SDValue LHS = GetScalarizedVector(N->getOperand(0));
SDValue RHS = GetScalarizedVector(N->getOperand(1));
return DAG.getNode(N->getOpcode(), SDLoc(N),
LHS.getValueType(), LHS, RHS, N->getFlags());
}
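// For example (an illustrative sketch): a <1 x f32> FADD whose operands were
// scalarized becomes a plain f32 FADD, the base case described in the file
// header for decomposing wider vector arithmetic.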
SDValue DAGTypeLegalizer::ScalarizeVecRes_TernaryOp(SDNode *N) {
SDValue Op0 = GetScalarizedVector(N->getOperand(0));
SDValue Op1 = GetScalarizedVector(N->getOperand(1));
SDValue Op2 = GetScalarizedVector(N->getOperand(2));
return DAG.getNode(N->getOpcode(), SDLoc(N),
Op0.getValueType(), Op0, Op1, Op2);
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_MERGE_VALUES(SDNode *N,
unsigned ResNo) {
SDValue Op = DisintegrateMERGE_VALUES(N, ResNo);
return GetScalarizedVector(Op);
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_BITCAST(SDNode *N) {
EVT NewVT = N->getValueType(0).getVectorElementType();
return DAG.getNode(ISD::BITCAST, SDLoc(N),
NewVT, N->getOperand(0));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_BUILD_VECTOR(SDNode *N) {
EVT EltVT = N->getValueType(0).getVectorElementType();
SDValue InOp = N->getOperand(0);
// The BUILD_VECTOR operands may be of wider element types and
// we may need to truncate them back to the requested return type.
if (EltVT.isInteger())
return DAG.getNode(ISD::TRUNCATE, SDLoc(N), EltVT, InOp);
return InOp;
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_EXTRACT_SUBVECTOR(SDNode *N) {
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SDLoc(N),
N->getValueType(0).getVectorElementType(),
N->getOperand(0), N->getOperand(1));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_FP_ROUND(SDNode *N) {
EVT NewVT = N->getValueType(0).getVectorElementType();
SDValue Op = GetScalarizedVector(N->getOperand(0));
return DAG.getNode(ISD::FP_ROUND, SDLoc(N),
NewVT, Op, N->getOperand(1));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_FPOWI(SDNode *N) {
SDValue Op = GetScalarizedVector(N->getOperand(0));
return DAG.getNode(ISD::FPOWI, SDLoc(N),
Op.getValueType(), Op, N->getOperand(1));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_INSERT_VECTOR_ELT(SDNode *N) {
// The value to insert may have a wider type than the vector element type,
// so be sure to truncate it to the element type if necessary.
SDValue Op = N->getOperand(1);
EVT EltVT = N->getValueType(0).getVectorElementType();
if (Op.getValueType() != EltVT)
// FIXME: Can this happen for floating point types?
Op = DAG.getNode(ISD::TRUNCATE, SDLoc(N), EltVT, Op);
return Op;
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_LOAD(LoadSDNode *N) {
assert(N->isUnindexed() && "Indexed vector load?");
SDValue Result = DAG.getLoad(
ISD::UNINDEXED, N->getExtensionType(),
N->getValueType(0).getVectorElementType(), SDLoc(N), N->getChain(),
N->getBasePtr(), DAG.getUNDEF(N->getBasePtr().getValueType()),
N->getPointerInfo(), N->getMemoryVT().getVectorElementType(),
N->getOriginalAlignment(), N->getMemOperand()->getFlags(),
N->getAAInfo());
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(N, 1), Result.getValue(1));
return Result;
}
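// For example (an illustrative sketch): a <1 x i32> load is re-emitted above
// as a plain i32 load, and every user of the old load's chain is redirected
// to the new load's chain.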
SDValue DAGTypeLegalizer::ScalarizeVecRes_UnaryOp(SDNode *N) {
// Get the dest type - it doesn't always match the input type, e.g. int_to_fp.
EVT DestVT = N->getValueType(0).getVectorElementType();
SDValue Op = N->getOperand(0);
EVT OpVT = Op.getValueType();
SDLoc DL(N);
// The result needs scalarizing, but it's not a given that the source does.
// This is a workaround for targets where it's impossible to scalarize the
// result of a conversion, because the source type is legal.
// For instance, this happens on AArch64: v1i1 is illegal but v1i{8,16,32}
// are widened to v8i8, v4i16, and v2i32, which are legal, because v1i64 is
// legal and was not scalarized.
// See the similar logic in ScalarizeVecRes_VSETCC.
if (getTypeAction(OpVT) == TargetLowering::TypeScalarizeVector) {
Op = GetScalarizedVector(Op);
} else {
EVT VT = OpVT.getVectorElementType();
Op = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, VT, Op,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
return DAG.getNode(N->getOpcode(), SDLoc(N), DestVT, Op);
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_InregOp(SDNode *N) {
EVT EltVT = N->getValueType(0).getVectorElementType();
EVT ExtVT = cast<VTSDNode>(N->getOperand(1))->getVT().getVectorElementType();
SDValue LHS = GetScalarizedVector(N->getOperand(0));
return DAG.getNode(N->getOpcode(), SDLoc(N), EltVT,
LHS, DAG.getValueType(ExtVT));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_VecInregOp(SDNode *N) {
SDLoc DL(N);
SDValue Op = N->getOperand(0);
EVT OpVT = Op.getValueType();
EVT OpEltVT = OpVT.getVectorElementType();
EVT EltVT = N->getValueType(0).getVectorElementType();
if (getTypeAction(OpVT) == TargetLowering::TypeScalarizeVector) {
Op = GetScalarizedVector(Op);
} else {
Op = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, OpEltVT, Op,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
switch (N->getOpcode()) {
case ISD::ANY_EXTEND_VECTOR_INREG:
return DAG.getNode(ISD::ANY_EXTEND, DL, EltVT, Op);
case ISD::SIGN_EXTEND_VECTOR_INREG:
return DAG.getNode(ISD::SIGN_EXTEND, DL, EltVT, Op);
case ISD::ZERO_EXTEND_VECTOR_INREG:
return DAG.getNode(ISD::ZERO_EXTEND, DL, EltVT, Op);
}
llvm_unreachable("Illegal extend_vector_inreg opcode");
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_SCALAR_TO_VECTOR(SDNode *N) {
// If the operand is wider than the vector element type then it is implicitly
// truncated. Make that explicit here.
EVT EltVT = N->getValueType(0).getVectorElementType();
SDValue InOp = N->getOperand(0);
if (InOp.getValueType() != EltVT)
return DAG.getNode(ISD::TRUNCATE, SDLoc(N), EltVT, InOp);
return InOp;
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_VSELECT(SDNode *N) {
SDValue Cond = N->getOperand(0);
EVT OpVT = Cond.getValueType();
SDLoc DL(N);
// The vselect result and true/false value operands need scalarizing, but it's
// not a given that the Cond does. For instance, in AVX512 v1i1 is legal.
// See the similar logic in ScalarizeVecRes_VSETCC.
if (getTypeAction(OpVT) == TargetLowering::TypeScalarizeVector) {
Cond = GetScalarizedVector(Cond);
} else {
EVT VT = OpVT.getVectorElementType();
Cond = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, VT, Cond,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
SDValue LHS = GetScalarizedVector(N->getOperand(1));
TargetLowering::BooleanContent ScalarBool =
TLI.getBooleanContents(false, false);
TargetLowering::BooleanContent VecBool = TLI.getBooleanContents(true, false);
// If integer and float booleans have different contents then we can't
// reliably optimize in all cases. There is a full explanation for this in
// DAGCombiner::visitSELECT() where the same issue affects folding
// (select C, 0, 1) to (xor C, 1).
if (TLI.getBooleanContents(false, false) !=
TLI.getBooleanContents(false, true)) {
// At least try the common case where the boolean is generated by a
// comparison.
if (Cond->getOpcode() == ISD::SETCC) {
EVT OpVT = Cond->getOperand(0)->getValueType(0);
ScalarBool = TLI.getBooleanContents(OpVT.getScalarType());
VecBool = TLI.getBooleanContents(OpVT);
} else
ScalarBool = TargetLowering::UndefinedBooleanContent;
}
if (ScalarBool != VecBool) {
EVT CondVT = Cond.getValueType();
switch (ScalarBool) {
case TargetLowering::UndefinedBooleanContent:
break;
case TargetLowering::ZeroOrOneBooleanContent:
assert(VecBool == TargetLowering::UndefinedBooleanContent ||
VecBool == TargetLowering::ZeroOrNegativeOneBooleanContent);
// The value was read from a vector as all ones, but the scalar expects a
// single 1, so mask it down.
Cond = DAG.getNode(ISD::AND, SDLoc(N), CondVT,
Cond, DAG.getConstant(1, SDLoc(N), CondVT));
break;
case TargetLowering::ZeroOrNegativeOneBooleanContent:
assert(VecBool == TargetLowering::UndefinedBooleanContent ||
VecBool == TargetLowering::ZeroOrOneBooleanContent);
// The value was read from a vector as a single 1, but the scalar expects all
// ones, so sign extend.
Cond = DAG.getNode(ISD::SIGN_EXTEND_INREG, SDLoc(N), CondVT,
Cond, DAG.getValueType(MVT::i1));
break;
}
}
return DAG.getSelect(SDLoc(N),
LHS.getValueType(), Cond, LHS,
GetScalarizedVector(N->getOperand(2)));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_SELECT(SDNode *N) {
SDValue LHS = GetScalarizedVector(N->getOperand(1));
return DAG.getSelect(SDLoc(N),
LHS.getValueType(), N->getOperand(0), LHS,
GetScalarizedVector(N->getOperand(2)));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_SELECT_CC(SDNode *N) {
SDValue LHS = GetScalarizedVector(N->getOperand(2));
return DAG.getNode(ISD::SELECT_CC, SDLoc(N), LHS.getValueType(),
N->getOperand(0), N->getOperand(1),
LHS, GetScalarizedVector(N->getOperand(3)),
N->getOperand(4));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_SETCC(SDNode *N) {
assert(N->getValueType(0).isVector() ==
N->getOperand(0).getValueType().isVector() &&
"Scalar/Vector type mismatch");
if (N->getValueType(0).isVector()) return ScalarizeVecRes_VSETCC(N);
SDValue LHS = GetScalarizedVector(N->getOperand(0));
SDValue RHS = GetScalarizedVector(N->getOperand(1));
SDLoc DL(N);
// Turn it into a scalar SETCC.
return DAG.getNode(ISD::SETCC, DL, MVT::i1, LHS, RHS, N->getOperand(2));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_UNDEF(SDNode *N) {
return DAG.getUNDEF(N->getValueType(0).getVectorElementType());
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_VECTOR_SHUFFLE(SDNode *N) {
// Figure out if the scalar is the LHS or RHS and return it.
SDValue Arg = N->getOperand(2).getOperand(0);
if (Arg.isUndef())
return DAG.getUNDEF(N->getValueType(0).getVectorElementType());
unsigned Op = !cast<ConstantSDNode>(Arg)->isNullValue();
return GetScalarizedVector(N->getOperand(Op));
}
SDValue DAGTypeLegalizer::ScalarizeVecRes_VSETCC(SDNode *N) {
assert(N->getValueType(0).isVector() &&
N->getOperand(0).getValueType().isVector() &&
"Operand types must be vectors");
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
EVT OpVT = LHS.getValueType();
EVT NVT = N->getValueType(0).getVectorElementType();
SDLoc DL(N);
// The result needs scalarizing, but it's not a given that the source does.
if (getTypeAction(OpVT) == TargetLowering::TypeScalarizeVector) {
LHS = GetScalarizedVector(LHS);
RHS = GetScalarizedVector(RHS);
} else {
EVT VT = OpVT.getVectorElementType();
LHS = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, VT, LHS,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
RHS = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, VT, RHS,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
// Turn it into a scalar SETCC.
SDValue Res = DAG.getNode(ISD::SETCC, DL, MVT::i1, LHS, RHS,
N->getOperand(2));
// Vectors may have different boolean contents than scalars. Promote the
// value appropriately.
ISD::NodeType ExtendCode =
TargetLowering::getExtendForContent(TLI.getBooleanContents(OpVT));
return DAG.getNode(ExtendCode, DL, NVT, Res);
}
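// For example (an illustrative sketch): scalarizing a v1i32 SETCC produces a
// scalar i1 SETCC; if the target uses ZeroOrNegativeOneBooleanContent for
// vectors, the i1 result is then sign-extended to i32 so the all-ones 'true'
// convention is preserved.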
//===----------------------------------------------------------------------===//
// Operand Vector Scalarization <1 x ty> -> ty.
//===----------------------------------------------------------------------===//
bool DAGTypeLegalizer::ScalarizeVectorOperand(SDNode *N, unsigned OpNo) {
DEBUG(dbgs() << "Scalarize node operand " << OpNo << ": ";
N->dump(&DAG);
dbgs() << "\n");
SDValue Res = SDValue();
if (!Res.getNode()) {
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "ScalarizeVectorOperand Op #" << OpNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
llvm_unreachable("Do not know how to scalarize this operator's operand!");
case ISD::BITCAST:
Res = ScalarizeVecOp_BITCAST(N);
break;
case ISD::ANY_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::SIGN_EXTEND:
case ISD::TRUNCATE:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:
Res = ScalarizeVecOp_UnaryOp(N);
break;
case ISD::CONCAT_VECTORS:
Res = ScalarizeVecOp_CONCAT_VECTORS(N);
break;
case ISD::EXTRACT_VECTOR_ELT:
Res = ScalarizeVecOp_EXTRACT_VECTOR_ELT(N);
break;
case ISD::VSELECT:
Res = ScalarizeVecOp_VSELECT(N);
break;
+ case ISD::SETCC:
+ Res = ScalarizeVecOp_VSETCC(N);
+ break;
case ISD::STORE:
Res = ScalarizeVecOp_STORE(cast<StoreSDNode>(N), OpNo);
break;
case ISD::FP_ROUND:
Res = ScalarizeVecOp_FP_ROUND(N, OpNo);
break;
}
}
// If the result is null, the sub-method took care of registering results etc.
if (!Res.getNode()) return false;
// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.
if (Res.getNode() == N)
return true;
assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&
"Invalid operand expansion");
ReplaceValueWith(SDValue(N, 0), Res);
return false;
}
/// If the value to convert is a vector that needs to be scalarized, it must be
/// <1 x ty>. Convert the element instead.
SDValue DAGTypeLegalizer::ScalarizeVecOp_BITCAST(SDNode *N) {
SDValue Elt = GetScalarizedVector(N->getOperand(0));
return DAG.getNode(ISD::BITCAST, SDLoc(N),
N->getValueType(0), Elt);
}
/// If the input is a vector that needs to be scalarized, it must be <1 x ty>.
/// Do the operation on the element instead.
SDValue DAGTypeLegalizer::ScalarizeVecOp_UnaryOp(SDNode *N) {
assert(N->getValueType(0).getVectorNumElements() == 1 &&
"Unexpected vector type!");
SDValue Elt = GetScalarizedVector(N->getOperand(0));
SDValue Op = DAG.getNode(N->getOpcode(), SDLoc(N),
N->getValueType(0).getScalarType(), Elt);
// Revectorize the result so the types line up with what the uses of this
// expression expect.
return DAG.getBuildVector(N->getValueType(0), SDLoc(N), Op);
}
/// The vectors to concatenate have length one - use a BUILD_VECTOR instead.
SDValue DAGTypeLegalizer::ScalarizeVecOp_CONCAT_VECTORS(SDNode *N) {
SmallVector<SDValue, 8> Ops(N->getNumOperands());
for (unsigned i = 0, e = N->getNumOperands(); i < e; ++i)
Ops[i] = GetScalarizedVector(N->getOperand(i));
return DAG.getBuildVector(N->getValueType(0), SDLoc(N), Ops);
}
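// For example (an illustrative sketch): concatenating four <1 x i32> operands
// becomes a single v4i32 BUILD_VECTOR of the four scalarized i32 elements.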
/// If the input is a vector that needs to be scalarized, it must be <1 x ty>,
/// so just return the element, ignoring the index.
SDValue DAGTypeLegalizer::ScalarizeVecOp_EXTRACT_VECTOR_ELT(SDNode *N) {
EVT VT = N->getValueType(0);
SDValue Res = GetScalarizedVector(N->getOperand(0));
if (Res.getValueType() != VT)
Res = VT.isFloatingPoint()
? DAG.getNode(ISD::FP_EXTEND, SDLoc(N), VT, Res)
: DAG.getNode(ISD::ANY_EXTEND, SDLoc(N), VT, Res);
return Res;
}
/// If the input condition is a vector that needs to be scalarized, it must be
/// <1 x i1>, so just convert to a normal ISD::SELECT
/// (still with vector output type since that was acceptable if we got here).
SDValue DAGTypeLegalizer::ScalarizeVecOp_VSELECT(SDNode *N) {
SDValue ScalarCond = GetScalarizedVector(N->getOperand(0));
EVT VT = N->getValueType(0);
return DAG.getNode(ISD::SELECT, SDLoc(N), VT, ScalarCond, N->getOperand(1),
N->getOperand(2));
}
+/// If the operand is a vector that needs to be scalarized, then the result
+/// must be v1i1, so just convert to a scalar SETCC and wrap it with a
+/// SCALAR_TO_VECTOR, since the result type is legal if we got here.
+SDValue DAGTypeLegalizer::ScalarizeVecOp_VSETCC(SDNode *N) {
+ assert(N->getValueType(0).isVector() &&
+ N->getOperand(0).getValueType().isVector() &&
+ "Operand types must be vectors");
+ assert(N->getValueType(0) == MVT::v1i1 && "Expected v1i1 type");
+
+ EVT VT = N->getValueType(0);
+ SDValue LHS = GetScalarizedVector(N->getOperand(0));
+ SDValue RHS = GetScalarizedVector(N->getOperand(1));
+
+ EVT OpVT = N->getOperand(0).getValueType();
+ EVT NVT = VT.getVectorElementType();
+ SDLoc DL(N);
+ // Turn it into a scalar SETCC.
+ SDValue Res = DAG.getNode(ISD::SETCC, DL, MVT::i1, LHS, RHS,
+ N->getOperand(2));
+
+ // Vectors may have different boolean contents than scalars. Promote the
+ // value appropriately.
+ ISD::NodeType ExtendCode =
+ TargetLowering::getExtendForContent(TLI.getBooleanContents(OpVT));
+
+ Res = DAG.getNode(ExtendCode, DL, NVT, Res);
+
+ return DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VT, Res);
+}
+
/// If the value to store is a vector that needs to be scalarized, it must be
/// <1 x ty>. Just store the element.
SDValue DAGTypeLegalizer::ScalarizeVecOp_STORE(StoreSDNode *N, unsigned OpNo){
assert(N->isUnindexed() && "Indexed store of one-element vector?");
assert(OpNo == 1 && "Do not know how to scalarize this operand!");
SDLoc dl(N);
if (N->isTruncatingStore())
return DAG.getTruncStore(
N->getChain(), dl, GetScalarizedVector(N->getOperand(1)),
N->getBasePtr(), N->getPointerInfo(),
N->getMemoryVT().getVectorElementType(), N->getAlignment(),
N->getMemOperand()->getFlags(), N->getAAInfo());
return DAG.getStore(N->getChain(), dl, GetScalarizedVector(N->getOperand(1)),
N->getBasePtr(), N->getPointerInfo(),
N->getOriginalAlignment(), N->getMemOperand()->getFlags(),
N->getAAInfo());
}
/// If the value to round is a vector that needs to be scalarized, it must be
/// <1 x ty>. Convert the element instead.
SDValue DAGTypeLegalizer::ScalarizeVecOp_FP_ROUND(SDNode *N, unsigned OpNo) {
SDValue Elt = GetScalarizedVector(N->getOperand(0));
SDValue Res = DAG.getNode(ISD::FP_ROUND, SDLoc(N),
N->getValueType(0).getVectorElementType(), Elt,
N->getOperand(1));
return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), N->getValueType(0), Res);
}
//===----------------------------------------------------------------------===//
// Result Vector Splitting
//===----------------------------------------------------------------------===//
/// This method is called when the specified result of the specified node is
/// found to need vector splitting. At this point, the node may also have
/// invalid operands or may have other results that need legalization, we just
/// know that (at least) one result needs vector splitting.
void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
DEBUG(dbgs() << "Split node result: ";
N->dump(&DAG);
dbgs() << "\n");
SDValue Lo, Hi;
// See if the target wants to custom expand this node.
if (CustomLowerNode(N, N->getValueType(ResNo), true))
return;
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "SplitVectorResult #" << ResNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
report_fatal_error("Do not know how to split the result of this "
"operator!\n");
case ISD::MERGE_VALUES: SplitRes_MERGE_VALUES(N, ResNo, Lo, Hi); break;
case ISD::VSELECT:
case ISD::SELECT: SplitRes_SELECT(N, Lo, Hi); break;
case ISD::SELECT_CC: SplitRes_SELECT_CC(N, Lo, Hi); break;
case ISD::UNDEF: SplitRes_UNDEF(N, Lo, Hi); break;
case ISD::BITCAST: SplitVecRes_BITCAST(N, Lo, Hi); break;
case ISD::BUILD_VECTOR: SplitVecRes_BUILD_VECTOR(N, Lo, Hi); break;
case ISD::CONCAT_VECTORS: SplitVecRes_CONCAT_VECTORS(N, Lo, Hi); break;
case ISD::EXTRACT_SUBVECTOR: SplitVecRes_EXTRACT_SUBVECTOR(N, Lo, Hi); break;
case ISD::INSERT_SUBVECTOR: SplitVecRes_INSERT_SUBVECTOR(N, Lo, Hi); break;
case ISD::FP_ROUND_INREG: SplitVecRes_InregOp(N, Lo, Hi); break;
case ISD::FPOWI: SplitVecRes_FPOWI(N, Lo, Hi); break;
case ISD::FCOPYSIGN: SplitVecRes_FCOPYSIGN(N, Lo, Hi); break;
case ISD::INSERT_VECTOR_ELT: SplitVecRes_INSERT_VECTOR_ELT(N, Lo, Hi); break;
case ISD::SCALAR_TO_VECTOR: SplitVecRes_SCALAR_TO_VECTOR(N, Lo, Hi); break;
case ISD::SIGN_EXTEND_INREG: SplitVecRes_InregOp(N, Lo, Hi); break;
case ISD::LOAD:
SplitVecRes_LOAD(cast<LoadSDNode>(N), Lo, Hi);
break;
case ISD::MLOAD:
SplitVecRes_MLOAD(cast<MaskedLoadSDNode>(N), Lo, Hi);
break;
case ISD::MGATHER:
SplitVecRes_MGATHER(cast<MaskedGatherSDNode>(N), Lo, Hi);
break;
case ISD::SETCC:
SplitVecRes_SETCC(N, Lo, Hi);
break;
case ISD::VECTOR_SHUFFLE:
SplitVecRes_VECTOR_SHUFFLE(cast<ShuffleVectorSDNode>(N), Lo, Hi);
break;
case ISD::ANY_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:
SplitVecRes_ExtVecInRegOp(N, Lo, Hi);
break;
case ISD::BITREVERSE:
case ISD::BSWAP:
case ISD::CTLZ:
case ISD::CTTZ:
case ISD::CTLZ_ZERO_UNDEF:
case ISD::CTTZ_ZERO_UNDEF:
case ISD::CTPOP:
case ISD::FABS:
case ISD::FCEIL:
case ISD::FCOS:
case ISD::FEXP:
case ISD::FEXP2:
case ISD::FFLOOR:
case ISD::FLOG:
case ISD::FLOG10:
case ISD::FLOG2:
case ISD::FNEARBYINT:
case ISD::FNEG:
case ISD::FP_EXTEND:
case ISD::FP_ROUND:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::FRINT:
case ISD::FROUND:
case ISD::FSIN:
case ISD::FSQRT:
case ISD::FTRUNC:
case ISD::SINT_TO_FP:
case ISD::TRUNCATE:
case ISD::UINT_TO_FP:
case ISD::FCANONICALIZE:
SplitVecRes_UnaryOp(N, Lo, Hi);
break;
case ISD::ANY_EXTEND:
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
SplitVecRes_ExtendOp(N, Lo, Hi);
break;
case ISD::ADD:
case ISD::SUB:
case ISD::MUL:
case ISD::MULHS:
case ISD::MULHU:
case ISD::FADD:
case ISD::FSUB:
case ISD::FMUL:
case ISD::FMINNUM:
case ISD::FMAXNUM:
case ISD::FMINNAN:
case ISD::FMAXNAN:
case ISD::SDIV:
case ISD::UDIV:
case ISD::FDIV:
case ISD::FPOW:
case ISD::AND:
case ISD::OR:
case ISD::XOR:
case ISD::SHL:
case ISD::SRA:
case ISD::SRL:
case ISD::UREM:
case ISD::SREM:
case ISD::FREM:
case ISD::SMIN:
case ISD::SMAX:
case ISD::UMIN:
case ISD::UMAX:
SplitVecRes_BinOp(N, Lo, Hi);
break;
case ISD::FMA:
SplitVecRes_TernaryOp(N, Lo, Hi);
break;
}
// If Lo/Hi is null, the sub-method took care of registering results etc.
if (Lo.getNode())
SetSplitVector(SDValue(N, ResNo), Lo, Hi);
}
void DAGTypeLegalizer::SplitVecRes_BinOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue LHSLo, LHSHi;
GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
SDValue RHSLo, RHSHi;
GetSplitVector(N->getOperand(1), RHSLo, RHSHi);
SDLoc dl(N);
const SDNodeFlags Flags = N->getFlags();
unsigned Opcode = N->getOpcode();
Lo = DAG.getNode(Opcode, dl, LHSLo.getValueType(), LHSLo, RHSLo, Flags);
Hi = DAG.getNode(Opcode, dl, LHSHi.getValueType(), LHSHi, RHSHi, Flags);
}
void DAGTypeLegalizer::SplitVecRes_TernaryOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue Op0Lo, Op0Hi;
GetSplitVector(N->getOperand(0), Op0Lo, Op0Hi);
SDValue Op1Lo, Op1Hi;
GetSplitVector(N->getOperand(1), Op1Lo, Op1Hi);
SDValue Op2Lo, Op2Hi;
GetSplitVector(N->getOperand(2), Op2Lo, Op2Hi);
SDLoc dl(N);
Lo = DAG.getNode(N->getOpcode(), dl, Op0Lo.getValueType(),
Op0Lo, Op1Lo, Op2Lo);
Hi = DAG.getNode(N->getOpcode(), dl, Op0Hi.getValueType(),
Op0Hi, Op1Hi, Op2Hi);
}
void DAGTypeLegalizer::SplitVecRes_BITCAST(SDNode *N, SDValue &Lo,
SDValue &Hi) {
// We know the result is a vector. The input may be either a vector or a
// scalar value.
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
SDLoc dl(N);
SDValue InOp = N->getOperand(0);
EVT InVT = InOp.getValueType();
// Handle some special cases efficiently.
switch (getTypeAction(InVT)) {
case TargetLowering::TypeLegal:
case TargetLowering::TypePromoteInteger:
case TargetLowering::TypePromoteFloat:
case TargetLowering::TypeSoftenFloat:
case TargetLowering::TypeScalarizeVector:
case TargetLowering::TypeWidenVector:
break;
case TargetLowering::TypeExpandInteger:
case TargetLowering::TypeExpandFloat:
// A scalar to vector conversion, where the scalar needs expansion.
// If the vector is being split in two then we can just convert the
// expanded pieces.
if (LoVT == HiVT) {
GetExpandedOp(InOp, Lo, Hi);
if (DAG.getDataLayout().isBigEndian())
std::swap(Lo, Hi);
Lo = DAG.getNode(ISD::BITCAST, dl, LoVT, Lo);
Hi = DAG.getNode(ISD::BITCAST, dl, HiVT, Hi);
return;
}
break;
case TargetLowering::TypeSplitVector:
// If the input is a vector that needs to be split, convert each split
// piece of the input now.
GetSplitVector(InOp, Lo, Hi);
Lo = DAG.getNode(ISD::BITCAST, dl, LoVT, Lo);
Hi = DAG.getNode(ISD::BITCAST, dl, HiVT, Hi);
return;
}
// In the general case, convert the input to an integer and split it by hand.
EVT LoIntVT = EVT::getIntegerVT(*DAG.getContext(), LoVT.getSizeInBits());
EVT HiIntVT = EVT::getIntegerVT(*DAG.getContext(), HiVT.getSizeInBits());
if (DAG.getDataLayout().isBigEndian())
std::swap(LoIntVT, HiIntVT);
SplitInteger(BitConvertToInteger(InOp), LoIntVT, HiIntVT, Lo, Hi);
if (DAG.getDataLayout().isBigEndian())
std::swap(Lo, Hi);
Lo = DAG.getNode(ISD::BITCAST, dl, LoVT, Lo);
Hi = DAG.getNode(ISD::BITCAST, dl, HiVT, Hi);
}
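// For example (an illustrative sketch): bitcasting an i128 that needs
// expansion to a v4i32 result that needs splitting takes the LoVT == HiVT
// path above: the i128 is expanded into two i64 pieces (swapped on
// big-endian targets), and each piece is bitcast directly to a v2i32 half.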
void DAGTypeLegalizer::SplitVecRes_BUILD_VECTOR(SDNode *N, SDValue &Lo,
SDValue &Hi) {
EVT LoVT, HiVT;
SDLoc dl(N);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
unsigned LoNumElts = LoVT.getVectorNumElements();
SmallVector<SDValue, 8> LoOps(N->op_begin(), N->op_begin()+LoNumElts);
Lo = DAG.getBuildVector(LoVT, dl, LoOps);
SmallVector<SDValue, 8> HiOps(N->op_begin()+LoNumElts, N->op_end());
Hi = DAG.getBuildVector(HiVT, dl, HiOps);
}
void DAGTypeLegalizer::SplitVecRes_CONCAT_VECTORS(SDNode *N, SDValue &Lo,
SDValue &Hi) {
assert(!(N->getNumOperands() & 1) && "Unsupported CONCAT_VECTORS");
SDLoc dl(N);
unsigned NumSubvectors = N->getNumOperands() / 2;
if (NumSubvectors == 1) {
Lo = N->getOperand(0);
Hi = N->getOperand(1);
return;
}
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
SmallVector<SDValue, 8> LoOps(N->op_begin(), N->op_begin()+NumSubvectors);
Lo = DAG.getNode(ISD::CONCAT_VECTORS, dl, LoVT, LoOps);
SmallVector<SDValue, 8> HiOps(N->op_begin()+NumSubvectors, N->op_end());
Hi = DAG.getNode(ISD::CONCAT_VECTORS, dl, HiVT, HiOps);
}
void DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue Vec = N->getOperand(0);
SDValue Idx = N->getOperand(1);
SDLoc dl(N);
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, LoVT, Vec, Idx);
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, HiVT, Vec,
DAG.getConstant(IdxVal + LoVT.getVectorNumElements(), dl,
TLI.getVectorIdxTy(DAG.getDataLayout())));
}
void DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue Vec = N->getOperand(0);
SDValue SubVec = N->getOperand(1);
SDValue Idx = N->getOperand(2);
SDLoc dl(N);
GetSplitVector(Vec, Lo, Hi);
EVT VecVT = Vec.getValueType();
unsigned VecElems = VecVT.getVectorNumElements();
unsigned SubElems = SubVec.getValueType().getVectorNumElements();
// If we know the index is 0, and we know the subvector doesn't cross the
// boundary between the halves, we can avoid spilling the vector, and insert
// into the lower half of the split vector directly.
// TODO: The IdxVal == 0 constraint is artificial, we could do this whenever
// the index is constant and there is no boundary crossing. But those cases
// don't seem to get hit in practice.
if (ConstantSDNode *ConstIdx = dyn_cast<ConstantSDNode>(Idx)) {
unsigned IdxVal = ConstIdx->getZExtValue();
if ((IdxVal == 0) && (IdxVal + SubElems <= VecElems / 2)) {
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
Lo = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, LoVT, Lo, SubVec, Idx);
return;
}
}
// Spill the vector to the stack.
SDValue StackPtr = DAG.CreateStackTemporary(VecVT);
SDValue Store =
DAG.getStore(DAG.getEntryNode(), dl, Vec, StackPtr, MachinePointerInfo());
// Store the new subvector into the specified index.
SDValue SubVecPtr = TLI.getVectorElementPointer(DAG, StackPtr, VecVT, Idx);
Type *VecType = VecVT.getTypeForEVT(*DAG.getContext());
unsigned Alignment = DAG.getDataLayout().getPrefTypeAlignment(VecType);
Store = DAG.getStore(Store, dl, SubVec, SubVecPtr, MachinePointerInfo());
// Load the Lo part from the stack slot.
Lo =
DAG.getLoad(Lo.getValueType(), dl, Store, StackPtr, MachinePointerInfo());
// Increment the pointer to the other part.
unsigned IncrementSize = Lo.getValueSizeInBits() / 8;
StackPtr =
DAG.getNode(ISD::ADD, dl, StackPtr.getValueType(), StackPtr,
DAG.getConstant(IncrementSize, dl, StackPtr.getValueType()));
// Load the Hi part from the stack slot.
Hi = DAG.getLoad(Hi.getValueType(), dl, Store, StackPtr, MachinePointerInfo(),
MinAlign(Alignment, IncrementSize));
}
void DAGTypeLegalizer::SplitVecRes_FPOWI(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDLoc dl(N);
GetSplitVector(N->getOperand(0), Lo, Hi);
Lo = DAG.getNode(ISD::FPOWI, dl, Lo.getValueType(), Lo, N->getOperand(1));
Hi = DAG.getNode(ISD::FPOWI, dl, Hi.getValueType(), Hi, N->getOperand(1));
}
void DAGTypeLegalizer::SplitVecRes_FCOPYSIGN(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue LHSLo, LHSHi;
GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
SDLoc DL(N);
SDValue RHSLo, RHSHi;
SDValue RHS = N->getOperand(1);
EVT RHSVT = RHS.getValueType();
if (getTypeAction(RHSVT) == TargetLowering::TypeSplitVector)
GetSplitVector(RHS, RHSLo, RHSHi);
else
std::tie(RHSLo, RHSHi) = DAG.SplitVector(RHS, SDLoc(RHS));
Lo = DAG.getNode(ISD::FCOPYSIGN, DL, LHSLo.getValueType(), LHSLo, RHSLo);
Hi = DAG.getNode(ISD::FCOPYSIGN, DL, LHSHi.getValueType(), LHSHi, RHSHi);
}
void DAGTypeLegalizer::SplitVecRes_InregOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue LHSLo, LHSHi;
GetSplitVector(N->getOperand(0), LHSLo, LHSHi);
SDLoc dl(N);
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) =
DAG.GetSplitDestVTs(cast<VTSDNode>(N->getOperand(1))->getVT());
Lo = DAG.getNode(N->getOpcode(), dl, LHSLo.getValueType(), LHSLo,
DAG.getValueType(LoVT));
Hi = DAG.getNode(N->getOpcode(), dl, LHSHi.getValueType(), LHSHi,
DAG.getValueType(HiVT));
}
void DAGTypeLegalizer::SplitVecRes_ExtVecInRegOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
unsigned Opcode = N->getOpcode();
SDValue N0 = N->getOperand(0);
SDLoc dl(N);
SDValue InLo, InHi;
if (getTypeAction(N0.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(N0, InLo, InHi);
else
std::tie(InLo, InHi) = DAG.SplitVectorOperand(N, 0);
EVT InLoVT = InLo.getValueType();
unsigned InNumElements = InLoVT.getVectorNumElements();
EVT OutLoVT, OutHiVT;
std::tie(OutLoVT, OutHiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
unsigned OutNumElements = OutLoVT.getVectorNumElements();
assert((2 * OutNumElements) <= InNumElements &&
"Illegal extend vector in reg split");
// *_EXTEND_VECTOR_INREG instructions extend the lowest elements of the
// input vector (i.e. we only use InLo):
// OutLo will extend the first OutNumElements from InLo.
// OutHi will extend the next OutNumElements from InLo.
// Shuffle the elements from InLo for OutHi into the bottom elements to
// create a 'fake' InHi.
SmallVector<int, 8> SplitHi(InNumElements, -1);
for (unsigned i = 0; i != OutNumElements; ++i)
SplitHi[i] = i + OutNumElements;
InHi = DAG.getVectorShuffle(InLoVT, dl, InLo, DAG.getUNDEF(InLoVT), SplitHi);
Lo = DAG.getNode(Opcode, dl, OutLoVT, InLo);
Hi = DAG.getNode(Opcode, dl, OutHiVT, InHi);
}
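// For example (an illustrative sketch): splitting the v8i32 result of a
// SIGN_EXTEND_VECTOR_INREG whose input split into v8i8 halves uses only
// InLo: OutLo sign-extends InLo elements 0-3, and the shuffle above moves
// elements 4-7 to the bottom of a 'fake' InHi so OutHi can extend them.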
void DAGTypeLegalizer::SplitVecRes_INSERT_VECTOR_ELT(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDValue Vec = N->getOperand(0);
SDValue Elt = N->getOperand(1);
SDValue Idx = N->getOperand(2);
SDLoc dl(N);
GetSplitVector(Vec, Lo, Hi);
if (ConstantSDNode *CIdx = dyn_cast<ConstantSDNode>(Idx)) {
unsigned IdxVal = CIdx->getZExtValue();
unsigned LoNumElts = Lo.getValueType().getVectorNumElements();
if (IdxVal < LoNumElts)
Lo = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl,
Lo.getValueType(), Lo, Elt, Idx);
else
Hi =
DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, Hi.getValueType(), Hi, Elt,
DAG.getConstant(IdxVal - LoNumElts, dl,
TLI.getVectorIdxTy(DAG.getDataLayout())));
return;
}
// See if the target wants to custom expand this node.
if (CustomLowerNode(N, N->getValueType(0), true))
return;
// Spill the vector to the stack.
EVT VecVT = Vec.getValueType();
EVT EltVT = VecVT.getVectorElementType();
SDValue StackPtr = DAG.CreateStackTemporary(VecVT);
SDValue Store =
DAG.getStore(DAG.getEntryNode(), dl, Vec, StackPtr, MachinePointerInfo());
// Store the new element. This may be larger than the vector element type,
// so use a truncating store.
SDValue EltPtr = TLI.getVectorElementPointer(DAG, StackPtr, VecVT, Idx);
Type *VecType = VecVT.getTypeForEVT(*DAG.getContext());
unsigned Alignment = DAG.getDataLayout().getPrefTypeAlignment(VecType);
Store =
DAG.getTruncStore(Store, dl, Elt, EltPtr, MachinePointerInfo(), EltVT);
// Load the Lo part from the stack slot.
Lo =
DAG.getLoad(Lo.getValueType(), dl, Store, StackPtr, MachinePointerInfo());
// Increment the pointer to the other part.
unsigned IncrementSize = Lo.getValueSizeInBits() / 8;
StackPtr = DAG.getNode(ISD::ADD, dl, StackPtr.getValueType(), StackPtr,
DAG.getConstant(IncrementSize, dl,
StackPtr.getValueType()));
// Load the Hi part from the stack slot.
Hi = DAG.getLoad(Hi.getValueType(), dl, Store, StackPtr, MachinePointerInfo(),
MinAlign(Alignment, IncrementSize));
}
void DAGTypeLegalizer::SplitVecRes_SCALAR_TO_VECTOR(SDNode *N, SDValue &Lo,
SDValue &Hi) {
EVT LoVT, HiVT;
SDLoc dl(N);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
Lo = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, LoVT, N->getOperand(0));
Hi = DAG.getUNDEF(HiVT);
}
void DAGTypeLegalizer::SplitVecRes_LOAD(LoadSDNode *LD, SDValue &Lo,
SDValue &Hi) {
assert(ISD::isUNINDEXEDLoad(LD) && "Indexed load during type legalization!");
EVT LoVT, HiVT;
SDLoc dl(LD);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(LD->getValueType(0));
ISD::LoadExtType ExtType = LD->getExtensionType();
SDValue Ch = LD->getChain();
SDValue Ptr = LD->getBasePtr();
SDValue Offset = DAG.getUNDEF(Ptr.getValueType());
EVT MemoryVT = LD->getMemoryVT();
unsigned Alignment = LD->getOriginalAlignment();
MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
AAMDNodes AAInfo = LD->getAAInfo();
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
Lo = DAG.getLoad(ISD::UNINDEXED, ExtType, LoVT, dl, Ch, Ptr, Offset,
LD->getPointerInfo(), LoMemVT, Alignment, MMOFlags, AAInfo);
unsigned IncrementSize = LoMemVT.getSizeInBits()/8;
Ptr = DAG.getNode(ISD::ADD, dl, Ptr.getValueType(), Ptr,
DAG.getConstant(IncrementSize, dl, Ptr.getValueType()));
Hi = DAG.getLoad(ISD::UNINDEXED, ExtType, HiVT, dl, Ch, Ptr, Offset,
LD->getPointerInfo().getWithOffset(IncrementSize), HiMemVT,
Alignment, MMOFlags, AAInfo);
// Build a factor node to remember that this load is independent of the
// other one.
Ch = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Lo.getValue(1),
Hi.getValue(1));
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(LD, 1), Ch);
}
void DAGTypeLegalizer::SplitVecRes_MLOAD(MaskedLoadSDNode *MLD,
SDValue &Lo, SDValue &Hi) {
EVT LoVT, HiVT;
SDLoc dl(MLD);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(MLD->getValueType(0));
SDValue Ch = MLD->getChain();
SDValue Ptr = MLD->getBasePtr();
SDValue Mask = MLD->getMask();
SDValue Src0 = MLD->getSrc0();
unsigned Alignment = MLD->getOriginalAlignment();
ISD::LoadExtType ExtType = MLD->getExtensionType();
// If the alignment is equal to the vector size in bytes, take half of it for
// the second half of the split load.
unsigned SecondHalfAlignment =
(Alignment == MLD->getValueType(0).getSizeInBits()/8) ?
Alignment/2 : Alignment;
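// For example (an illustrative sketch): a 32-byte v8i32 masked load aligned
// to 32 bytes splits into two v4i32 halves; the second half starts 16 bytes
// in, so only 16-byte alignment can be assumed for it.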
// Split Mask operand
SDValue MaskLo, MaskHi;
if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Mask, MaskLo, MaskHi);
else
std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, dl);
EVT MemoryVT = MLD->getMemoryVT();
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
SDValue Src0Lo, Src0Hi;
if (getTypeAction(Src0.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Src0, Src0Lo, Src0Hi);
else
std::tie(Src0Lo, Src0Hi) = DAG.SplitVector(Src0, dl);
MachineMemOperand *MMO = DAG.getMachineFunction().
getMachineMemOperand(MLD->getPointerInfo(),
MachineMemOperand::MOLoad, LoMemVT.getStoreSize(),
Alignment, MLD->getAAInfo(), MLD->getRanges());
Lo = DAG.getMaskedLoad(LoVT, dl, Ch, Ptr, MaskLo, Src0Lo, LoMemVT, MMO,
ExtType, MLD->isExpandingLoad());
Ptr = TLI.IncrementMemoryAddress(Ptr, MaskLo, dl, LoMemVT, DAG,
MLD->isExpandingLoad());
MMO = DAG.getMachineFunction().
getMachineMemOperand(MLD->getPointerInfo(),
MachineMemOperand::MOLoad, HiMemVT.getStoreSize(),
SecondHalfAlignment, MLD->getAAInfo(), MLD->getRanges());
Hi = DAG.getMaskedLoad(HiVT, dl, Ch, Ptr, MaskHi, Src0Hi, HiMemVT, MMO,
ExtType, MLD->isExpandingLoad());
// Build a factor node to remember that this load is independent of the
// other one.
Ch = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Lo.getValue(1),
Hi.getValue(1));
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(MLD, 1), Ch);
}
void DAGTypeLegalizer::SplitVecRes_MGATHER(MaskedGatherSDNode *MGT,
SDValue &Lo, SDValue &Hi) {
EVT LoVT, HiVT;
SDLoc dl(MGT);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(MGT->getValueType(0));
SDValue Ch = MGT->getChain();
SDValue Ptr = MGT->getBasePtr();
SDValue Mask = MGT->getMask();
SDValue Src0 = MGT->getValue();
SDValue Index = MGT->getIndex();
unsigned Alignment = MGT->getOriginalAlignment();
// Split Mask operand
SDValue MaskLo, MaskHi;
if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Mask, MaskLo, MaskHi);
else
std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, dl);
EVT MemoryVT = MGT->getMemoryVT();
EVT LoMemVT, HiMemVT;
// Split MemoryVT
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
SDValue Src0Lo, Src0Hi;
if (getTypeAction(Src0.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Src0, Src0Lo, Src0Hi);
else
std::tie(Src0Lo, Src0Hi) = DAG.SplitVector(Src0, dl);
SDValue IndexHi, IndexLo;
if (getTypeAction(Index.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Index, IndexLo, IndexHi);
else
std::tie(IndexLo, IndexHi) = DAG.SplitVector(Index, dl);
MachineMemOperand *MMO = DAG.getMachineFunction().
getMachineMemOperand(MGT->getPointerInfo(),
MachineMemOperand::MOLoad, LoMemVT.getStoreSize(),
Alignment, MGT->getAAInfo(), MGT->getRanges());
SDValue OpsLo[] = {Ch, Src0Lo, MaskLo, Ptr, IndexLo};
Lo = DAG.getMaskedGather(DAG.getVTList(LoVT, MVT::Other), LoVT, dl, OpsLo,
MMO);
SDValue OpsHi[] = {Ch, Src0Hi, MaskHi, Ptr, IndexHi};
Hi = DAG.getMaskedGather(DAG.getVTList(HiVT, MVT::Other), HiVT, dl, OpsHi,
MMO);
// Build a factor node to remember that this load is independent of the
// other one.
Ch = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Lo.getValue(1),
Hi.getValue(1));
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(MGT, 1), Ch);
}
void DAGTypeLegalizer::SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi) {
assert(N->getValueType(0).isVector() &&
N->getOperand(0).getValueType().isVector() &&
"Operand types must be vectors");
EVT LoVT, HiVT;
SDLoc DL(N);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
// Split the input.
SDValue LL, LH, RL, RH;
std::tie(LL, LH) = DAG.SplitVectorOperand(N, 0);
std::tie(RL, RH) = DAG.SplitVectorOperand(N, 1);
Lo = DAG.getNode(N->getOpcode(), DL, LoVT, LL, RL, N->getOperand(2));
Hi = DAG.getNode(N->getOpcode(), DL, HiVT, LH, RH, N->getOperand(2));
}
void DAGTypeLegalizer::SplitVecRes_UnaryOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
// Get the dest types - they may not match the input types, e.g. int_to_fp.
EVT LoVT, HiVT;
SDLoc dl(N);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));
// If the input also splits, handle it directly for a compile time speedup.
// Otherwise split it by hand.
EVT InVT = N->getOperand(0).getValueType();
if (getTypeAction(InVT) == TargetLowering::TypeSplitVector)
GetSplitVector(N->getOperand(0), Lo, Hi);
else
std::tie(Lo, Hi) = DAG.SplitVectorOperand(N, 0);
if (N->getOpcode() == ISD::FP_ROUND) {
Lo = DAG.getNode(N->getOpcode(), dl, LoVT, Lo, N->getOperand(1));
Hi = DAG.getNode(N->getOpcode(), dl, HiVT, Hi, N->getOperand(1));
} else {
Lo = DAG.getNode(N->getOpcode(), dl, LoVT, Lo);
Hi = DAG.getNode(N->getOpcode(), dl, HiVT, Hi);
}
}
void DAGTypeLegalizer::SplitVecRes_ExtendOp(SDNode *N, SDValue &Lo,
SDValue &Hi) {
SDLoc dl(N);
EVT SrcVT = N->getOperand(0).getValueType();
EVT DestVT = N->getValueType(0);
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(DestVT);
// We can do better than a generic split operation if the extend is doing
// more than just doubling the width of the elements and the following are
// true:
// - The number of vector elements is even,
// - the source type is legal,
// - the type of a split source is illegal,
// - the type of an extended (by doubling element size) source is legal, and
// - the type of that extended source when split is legal.
//
// This won't necessarily completely legalize the operation, but it will
// more effectively move in the right direction and prevent falling down
// to scalarization in many cases due to the input vector being split too
// far.
unsigned NumElements = SrcVT.getVectorNumElements();
if ((NumElements & 1) == 0 &&
SrcVT.getSizeInBits() * 2 < DestVT.getSizeInBits()) {
LLVMContext &Ctx = *DAG.getContext();
EVT NewSrcVT = SrcVT.widenIntegerVectorElementType(Ctx);
EVT SplitSrcVT = SrcVT.getHalfNumVectorElementsVT(Ctx);
EVT SplitLoVT, SplitHiVT;
std::tie(SplitLoVT, SplitHiVT) = DAG.GetSplitDestVTs(NewSrcVT);
if (TLI.isTypeLegal(SrcVT) && !TLI.isTypeLegal(SplitSrcVT) &&
TLI.isTypeLegal(NewSrcVT) && TLI.isTypeLegal(SplitLoVT)) {
DEBUG(dbgs() << "Split vector extend via incremental extend:";
N->dump(&DAG); dbgs() << "\n");
// Extend the source vector by one step.
SDValue NewSrc =
DAG.getNode(N->getOpcode(), dl, NewSrcVT, N->getOperand(0));
// Get the low and high halves of the new, one-step-extended vector.
std::tie(Lo, Hi) = DAG.SplitVector(NewSrc, dl);
// Extend those vector halves the rest of the way.
Lo = DAG.getNode(N->getOpcode(), dl, LoVT, Lo);
Hi = DAG.getNode(N->getOpcode(), dl, HiVT, Hi);
return;
}
}
// Fall back to the generic unary operator splitting otherwise.
SplitVecRes_UnaryOp(N, Lo, Hi);
}
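// For example (an illustrative sketch, assuming a target where v16i8,
// v16i16, and v8i16 are legal but v8i8 is not): extending v16i8 to v16i32
// first extends one step to v16i16, splits that into two v8i16 halves, and
// extends each half to v8i32, rather than splitting the v16i8 source
// directly.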
void DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N,
SDValue &Lo, SDValue &Hi) {
// The low and high parts of the original input give four input vectors.
SDValue Inputs[4];
SDLoc dl(N);
GetSplitVector(N->getOperand(0), Inputs[0], Inputs[1]);
GetSplitVector(N->getOperand(1), Inputs[2], Inputs[3]);
EVT NewVT = Inputs[0].getValueType();
unsigned NewElts = NewVT.getVectorNumElements();
// If Lo or Hi uses elements from at most two of the four input vectors, then
// express it as a vector shuffle of those two inputs. Otherwise extract the
// input elements by hand and construct the Lo/Hi output using a BUILD_VECTOR.
SmallVector<int, 16> Ops;
for (unsigned High = 0; High < 2; ++High) {
SDValue &Output = High ? Hi : Lo;
// Build a shuffle mask for the output, discovering on the fly which
// input vectors to use as shuffle operands (recorded in InputUsed).
// If building a suitable shuffle vector proves too hard, then bail
// out with useBuildVector set.
unsigned InputUsed[2] = { -1U, -1U }; // Not yet discovered.
unsigned FirstMaskIdx = High * NewElts;
bool useBuildVector = false;
for (unsigned MaskOffset = 0; MaskOffset < NewElts; ++MaskOffset) {
// The mask element. This indexes into the input.
int Idx = N->getMaskElt(FirstMaskIdx + MaskOffset);
// The input vector this mask element indexes into.
unsigned Input = (unsigned)Idx / NewElts;
if (Input >= array_lengthof(Inputs)) {
// The mask element does not index into any input vector.
Ops.push_back(-1);
continue;
}
// Turn the index into an offset from the start of the input vector.
Idx -= Input * NewElts;
// Find or create a shuffle vector operand to hold this input.
unsigned OpNo;
for (OpNo = 0; OpNo < array_lengthof(InputUsed); ++OpNo) {
if (InputUsed[OpNo] == Input) {
// This input vector is already an operand.
break;
} else if (InputUsed[OpNo] == -1U) {
// Create a new operand for this input vector.
InputUsed[OpNo] = Input;
break;
}
}
if (OpNo >= array_lengthof(InputUsed)) {
// More than two input vectors used! Give up on trying to create a
// shuffle vector. Insert all elements into a BUILD_VECTOR instead.
useBuildVector = true;
break;
}
// Add the mask index for the new shuffle vector.
Ops.push_back(Idx + OpNo * NewElts);
}
if (useBuildVector) {
EVT EltVT = NewVT.getVectorElementType();
SmallVector<SDValue, 16> SVOps;
// Extract the input elements by hand.
for (unsigned MaskOffset = 0; MaskOffset < NewElts; ++MaskOffset) {
// The mask element. This indexes into the input.
int Idx = N->getMaskElt(FirstMaskIdx + MaskOffset);
// The input vector this mask element indexes into.
unsigned Input = (unsigned)Idx / NewElts;
if (Input >= array_lengthof(Inputs)) {
// The mask element is "undef" or indexes off the end of the input.
SVOps.push_back(DAG.getUNDEF(EltVT));
continue;
}
// Turn the index into an offset from the start of the input vector.
Idx -= Input * NewElts;
// Extract the vector element by hand.
SVOps.push_back(DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Inputs[Input],
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout()))));
}
// Construct the Lo/Hi output using a BUILD_VECTOR.
Output = DAG.getBuildVector(NewVT, dl, SVOps);
} else if (InputUsed[0] == -1U) {
// No input vectors were used! The result is undefined.
Output = DAG.getUNDEF(NewVT);
} else {
SDValue Op0 = Inputs[InputUsed[0]];
// If only one input was used, use an undefined vector for the other.
SDValue Op1 = InputUsed[1] == -1U ?
DAG.getUNDEF(NewVT) : Inputs[InputUsed[1]];
// At least one input vector was used. Create a new shuffle vector.
Output = DAG.getVectorShuffle(NewVT, dl, Op0, Op1, Ops);
}
Ops.clear();
}
}
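// For example (an illustrative sketch): splitting a v8i32 shuffle gives four
// v4i32 inputs. If the Lo (or Hi) half of the mask only draws from two of
// those four inputs, it becomes a new v4i32 shuffle of that pair; a mask
// drawing from three or more inputs falls back to per-element extracts
// feeding a BUILD_VECTOR.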
//===----------------------------------------------------------------------===//
// Operand Vector Splitting
//===----------------------------------------------------------------------===//
/// This method is called when the specified operand of the specified node is
/// found to need vector splitting. At this point, all of the result types of
/// the node are known to be legal, but other operands of the node may need
/// legalization as well as the specified one.
bool DAGTypeLegalizer::SplitVectorOperand(SDNode *N, unsigned OpNo) {
DEBUG(dbgs() << "Split node operand: ";
N->dump(&DAG);
dbgs() << "\n");
SDValue Res = SDValue();
// See if the target wants to custom split this node.
if (CustomLowerNode(N, N->getOperand(OpNo).getValueType(), false))
return false;
if (!Res.getNode()) {
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "SplitVectorOperand Op #" << OpNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
report_fatal_error("Do not know how to split this operator's "
"operand!\n");
case ISD::SETCC: Res = SplitVecOp_VSETCC(N); break;
case ISD::BITCAST: Res = SplitVecOp_BITCAST(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = SplitVecOp_EXTRACT_SUBVECTOR(N); break;
case ISD::EXTRACT_VECTOR_ELT:Res = SplitVecOp_EXTRACT_VECTOR_ELT(N); break;
case ISD::CONCAT_VECTORS: Res = SplitVecOp_CONCAT_VECTORS(N); break;
case ISD::TRUNCATE:
Res = SplitVecOp_TruncateHelper(N);
break;
case ISD::FP_ROUND: Res = SplitVecOp_FP_ROUND(N); break;
case ISD::FCOPYSIGN: Res = SplitVecOp_FCOPYSIGN(N); break;
case ISD::STORE:
Res = SplitVecOp_STORE(cast<StoreSDNode>(N), OpNo);
break;
case ISD::MSTORE:
Res = SplitVecOp_MSTORE(cast<MaskedStoreSDNode>(N), OpNo);
break;
case ISD::MSCATTER:
Res = SplitVecOp_MSCATTER(cast<MaskedScatterSDNode>(N), OpNo);
break;
case ISD::MGATHER:
Res = SplitVecOp_MGATHER(cast<MaskedGatherSDNode>(N), OpNo);
break;
case ISD::VSELECT:
Res = SplitVecOp_VSELECT(N, OpNo);
break;
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
if (N->getValueType(0).bitsLT(N->getOperand(0)->getValueType(0)))
Res = SplitVecOp_TruncateHelper(N);
else
Res = SplitVecOp_UnaryOp(N);
break;
case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:
if (N->getValueType(0).bitsLT(N->getOperand(0)->getValueType(0)))
Res = SplitVecOp_TruncateHelper(N);
else
Res = SplitVecOp_UnaryOp(N);
break;
case ISD::CTTZ:
case ISD::CTLZ:
case ISD::CTPOP:
case ISD::FP_EXTEND:
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:
case ISD::FTRUNC:
case ISD::FCANONICALIZE:
Res = SplitVecOp_UnaryOp(N);
break;
case ISD::ANY_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:
Res = SplitVecOp_ExtVecInRegOp(N);
break;
case ISD::VECREDUCE_FADD:
case ISD::VECREDUCE_FMUL:
case ISD::VECREDUCE_ADD:
case ISD::VECREDUCE_MUL:
case ISD::VECREDUCE_AND:
case ISD::VECREDUCE_OR:
case ISD::VECREDUCE_XOR:
case ISD::VECREDUCE_SMAX:
case ISD::VECREDUCE_SMIN:
case ISD::VECREDUCE_UMAX:
case ISD::VECREDUCE_UMIN:
case ISD::VECREDUCE_FMAX:
case ISD::VECREDUCE_FMIN:
Res = SplitVecOp_VECREDUCE(N, OpNo);
break;
}
}
// If the result is null, the sub-method took care of registering results etc.
if (!Res.getNode()) return false;
// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.
if (Res.getNode() == N)
return true;
assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&
"Invalid operand expansion");
ReplaceValueWith(SDValue(N, 0), Res);
return false;
}
SDValue DAGTypeLegalizer::SplitVecOp_VSELECT(SDNode *N, unsigned OpNo) {
// The only possibility for an illegal operand is the mask, since result type
// legalization would have handled this node already otherwise.
assert(OpNo == 0 && "Illegal operand must be mask");
SDValue Mask = N->getOperand(0);
SDValue Src0 = N->getOperand(1);
SDValue Src1 = N->getOperand(2);
EVT Src0VT = Src0.getValueType();
SDLoc DL(N);
assert(Mask.getValueType().isVector() && "VSELECT without a vector mask?");
SDValue Lo, Hi;
GetSplitVector(N->getOperand(0), Lo, Hi);
assert(Lo.getValueType() == Hi.getValueType() &&
"Lo and Hi have differing types");
EVT LoOpVT, HiOpVT;
std::tie(LoOpVT, HiOpVT) = DAG.GetSplitDestVTs(Src0VT);
assert(LoOpVT == HiOpVT && "Asymmetric vector split?");
SDValue LoOp0, HiOp0, LoOp1, HiOp1, LoMask, HiMask;
std::tie(LoOp0, HiOp0) = DAG.SplitVector(Src0, DL);
std::tie(LoOp1, HiOp1) = DAG.SplitVector(Src1, DL);
std::tie(LoMask, HiMask) = DAG.SplitVector(Mask, DL);
SDValue LoSelect =
DAG.getNode(ISD::VSELECT, DL, LoOpVT, LoMask, LoOp0, LoOp1);
SDValue HiSelect =
DAG.getNode(ISD::VSELECT, DL, HiOpVT, HiMask, HiOp0, HiOp1);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, Src0VT, LoSelect, HiSelect);
}
SDValue DAGTypeLegalizer::SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo) {
EVT ResVT = N->getValueType(0);
SDValue Lo, Hi;
SDLoc dl(N);
SDValue VecOp = N->getOperand(OpNo);
EVT VecVT = VecOp.getValueType();
assert(VecVT.isVector() && "Can only split reduce vector operand");
GetSplitVector(VecOp, Lo, Hi);
EVT LoOpVT, HiOpVT;
std::tie(LoOpVT, HiOpVT) = DAG.GetSplitDestVTs(VecVT);
bool NoNaN = N->getFlags().hasNoNaNs();
unsigned CombineOpc = 0;
switch (N->getOpcode()) {
case ISD::VECREDUCE_FADD: CombineOpc = ISD::FADD; break;
case ISD::VECREDUCE_FMUL: CombineOpc = ISD::FMUL; break;
case ISD::VECREDUCE_ADD: CombineOpc = ISD::ADD; break;
case ISD::VECREDUCE_MUL: CombineOpc = ISD::MUL; break;
case ISD::VECREDUCE_AND: CombineOpc = ISD::AND; break;
case ISD::VECREDUCE_OR: CombineOpc = ISD::OR; break;
case ISD::VECREDUCE_XOR: CombineOpc = ISD::XOR; break;
case ISD::VECREDUCE_SMAX: CombineOpc = ISD::SMAX; break;
case ISD::VECREDUCE_SMIN: CombineOpc = ISD::SMIN; break;
case ISD::VECREDUCE_UMAX: CombineOpc = ISD::UMAX; break;
case ISD::VECREDUCE_UMIN: CombineOpc = ISD::UMIN; break;
case ISD::VECREDUCE_FMAX:
CombineOpc = NoNaN ? ISD::FMAXNUM : ISD::FMAXNAN;
break;
case ISD::VECREDUCE_FMIN:
CombineOpc = NoNaN ? ISD::FMINNUM : ISD::FMINNAN;
break;
default:
llvm_unreachable("Unexpected reduce ISD node");
}
// Use the appropriate scalar instruction on the split subvectors before
// reducing the now partially reduced smaller vector.
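  // For illustration (a sketch, assuming only v4f32 is legal), splitting the
  // operand of "f32 = VECREDUCE_FADD v8f32 %v" would produce:
  //   %lo  = v4f32 extract_subvector %v, 0
  //   %hi  = v4f32 extract_subvector %v, 4
  //   %bin = v4f32 fadd %lo, %hi
  //   %res = f32 VECREDUCE_FADD %bin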
SDValue Partial = DAG.getNode(CombineOpc, dl, LoOpVT, Lo, Hi);
return DAG.getNode(N->getOpcode(), dl, ResVT, Partial);
}
SDValue DAGTypeLegalizer::SplitVecOp_UnaryOp(SDNode *N) {
// The result has a legal vector type, but the input needs splitting.
EVT ResVT = N->getValueType(0);
SDValue Lo, Hi;
SDLoc dl(N);
GetSplitVector(N->getOperand(0), Lo, Hi);
EVT InVT = Lo.getValueType();
EVT OutVT = EVT::getVectorVT(*DAG.getContext(), ResVT.getVectorElementType(),
InVT.getVectorNumElements());
Lo = DAG.getNode(N->getOpcode(), dl, OutVT, Lo);
Hi = DAG.getNode(N->getOpcode(), dl, OutVT, Hi);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, ResVT, Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_BITCAST(SDNode *N) {
// For example, i64 = BITCAST v4i16 on Alpha. Typically the vector will
// end up being split all the way down to individual components. Convert the
// split pieces into integers and reassemble.
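  // As an illustrative sketch on a little-endian target, i64 = BITCAST
  // v4i16 %v, with %v split into two v2i16 halves, becomes:
  //   %lo  = i32 bitcast v2i16 %v.lo
  //   %hi  = i32 bitcast v2i16 %v.hi
  //   %res = i64 (join %hi, %lo)    ; halves swapped when big-endian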
SDValue Lo, Hi;
GetSplitVector(N->getOperand(0), Lo, Hi);
Lo = BitConvertToInteger(Lo);
Hi = BitConvertToInteger(Hi);
if (DAG.getDataLayout().isBigEndian())
std::swap(Lo, Hi);
return DAG.getNode(ISD::BITCAST, SDLoc(N), N->getValueType(0),
JoinIntegers(Lo, Hi));
}
SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N) {
// We know that the extracted result type is legal.
EVT SubVT = N->getValueType(0);
SDValue Idx = N->getOperand(1);
SDLoc dl(N);
SDValue Lo, Hi;
GetSplitVector(N->getOperand(0), Lo, Hi);
uint64_t LoElts = Lo.getValueType().getVectorNumElements();
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
if (IdxVal < LoElts) {
assert(IdxVal + SubVT.getVectorNumElements() <= LoElts &&
"Extracted subvector crosses vector split!");
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, SubVT, Lo, Idx);
} else {
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, SubVT, Hi,
DAG.getConstant(IdxVal - LoElts, dl,
Idx.getValueType()));
}
}
SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_VECTOR_ELT(SDNode *N) {
SDValue Vec = N->getOperand(0);
SDValue Idx = N->getOperand(1);
EVT VecVT = Vec.getValueType();
if (isa<ConstantSDNode>(Idx)) {
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
assert(IdxVal < VecVT.getVectorNumElements() && "Invalid vector index!");
SDValue Lo, Hi;
GetSplitVector(Vec, Lo, Hi);
uint64_t LoElts = Lo.getValueType().getVectorNumElements();
if (IdxVal < LoElts)
return SDValue(DAG.UpdateNodeOperands(N, Lo, Idx), 0);
return SDValue(DAG.UpdateNodeOperands(N, Hi,
DAG.getConstant(IdxVal - LoElts, SDLoc(N),
Idx.getValueType())), 0);
}
// See if the target wants to custom expand this node.
if (CustomLowerNode(N, N->getValueType(0), true))
return SDValue();
// Make the vector elements byte-addressable if they aren't already.
SDLoc dl(N);
EVT EltVT = VecVT.getVectorElementType();
if (EltVT.getSizeInBits() < 8) {
SmallVector<SDValue, 4> ElementOps;
for (unsigned i = 0; i < VecVT.getVectorNumElements(); ++i) {
ElementOps.push_back(DAG.getAnyExtOrTrunc(
DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, Vec,
DAG.getConstant(i, dl, MVT::i8)),
dl, MVT::i8));
}
EltVT = MVT::i8;
VecVT = EVT::getVectorVT(*DAG.getContext(), EltVT,
VecVT.getVectorNumElements());
Vec = DAG.getBuildVector(VecVT, dl, ElementOps);
}
// Store the vector to the stack.
SDValue StackPtr = DAG.CreateStackTemporary(VecVT);
SDValue Store =
DAG.getStore(DAG.getEntryNode(), dl, Vec, StackPtr, MachinePointerInfo());
// Load back the required element.
StackPtr = TLI.getVectorElementPointer(DAG, StackPtr, VecVT, Idx);
return DAG.getExtLoad(ISD::EXTLOAD, dl, N->getValueType(0), Store, StackPtr,
MachinePointerInfo(), EltVT);
}
SDValue DAGTypeLegalizer::SplitVecOp_ExtVecInRegOp(SDNode *N) {
SDValue Lo, Hi;
// The *_EXTEND_VECTOR_INREG opcodes only reference the lower half of the
// input, so splitting the result has the same effect as splitting the input
// operand.
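// Sketch: v4i32 = SIGN_EXTEND_VECTOR_INREG v8i16 %v reads only the low four
// i16 elements of %v, so the two v2i32 halves produced by splitting the
// result cover exactly what splitting the operand would require.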
SplitVecRes_ExtVecInRegOp(N, Lo, Hi);
return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), N->getValueType(0), Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_MGATHER(MaskedGatherSDNode *MGT,
unsigned OpNo) {
EVT LoVT, HiVT;
SDLoc dl(MGT);
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(MGT->getValueType(0));
SDValue Ch = MGT->getChain();
SDValue Ptr = MGT->getBasePtr();
SDValue Index = MGT->getIndex();
SDValue Mask = MGT->getMask();
SDValue Src0 = MGT->getValue();
unsigned Alignment = MGT->getOriginalAlignment();
SDValue MaskLo, MaskHi;
if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
// Split Mask operand
GetSplitVector(Mask, MaskLo, MaskHi);
else
std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, dl);
EVT MemoryVT = MGT->getMemoryVT();
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
SDValue Src0Lo, Src0Hi;
if (getTypeAction(Src0.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Src0, Src0Lo, Src0Hi);
else
std::tie(Src0Lo, Src0Hi) = DAG.SplitVector(Src0, dl);
SDValue IndexHi, IndexLo;
if (getTypeAction(Index.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Index, IndexLo, IndexHi);
else
std::tie(IndexLo, IndexHi) = DAG.SplitVector(Index, dl);
MachineMemOperand *MMO = DAG.getMachineFunction().
getMachineMemOperand(MGT->getPointerInfo(),
MachineMemOperand::MOLoad, LoMemVT.getStoreSize(),
Alignment, MGT->getAAInfo(), MGT->getRanges());
SDValue OpsLo[] = {Ch, Src0Lo, MaskLo, Ptr, IndexLo};
SDValue Lo = DAG.getMaskedGather(DAG.getVTList(LoVT, MVT::Other), LoVT, dl,
OpsLo, MMO);
MMO = DAG.getMachineFunction().
getMachineMemOperand(MGT->getPointerInfo(),
MachineMemOperand::MOLoad, HiMemVT.getStoreSize(),
Alignment, MGT->getAAInfo(),
MGT->getRanges());
SDValue OpsHi[] = {Ch, Src0Hi, MaskHi, Ptr, IndexHi};
SDValue Hi = DAG.getMaskedGather(DAG.getVTList(HiVT, MVT::Other), HiVT, dl,
OpsHi, MMO);
// Build a factor node to remember that this load is independent of the
// other one.
Ch = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Lo.getValue(1),
Hi.getValue(1));
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(MGT, 1), Ch);
SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, dl, MGT->getValueType(0), Lo,
Hi);
ReplaceValueWith(SDValue(MGT, 0), Res);
return SDValue();
}
SDValue DAGTypeLegalizer::SplitVecOp_MSTORE(MaskedStoreSDNode *N,
unsigned OpNo) {
SDValue Ch = N->getChain();
SDValue Ptr = N->getBasePtr();
SDValue Mask = N->getMask();
SDValue Data = N->getValue();
EVT MemoryVT = N->getMemoryVT();
unsigned Alignment = N->getOriginalAlignment();
SDLoc DL(N);
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
SDValue DataLo, DataHi;
if (getTypeAction(Data.getValueType()) == TargetLowering::TypeSplitVector)
// Split Data operand
GetSplitVector(Data, DataLo, DataHi);
else
std::tie(DataLo, DataHi) = DAG.SplitVector(Data, DL);
SDValue MaskLo, MaskHi;
if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
// Split Mask operand
GetSplitVector(Mask, MaskLo, MaskHi);
else
std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, DL);
MaskLo = PromoteTargetBoolean(MaskLo, DataLo.getValueType());
MaskHi = PromoteTargetBoolean(MaskHi, DataHi.getValueType());
// If the alignment is equal to the vector size, take half of it for the
// second half of the store.
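// Sketch: a 32-byte-aligned masked store of a 256-bit vector splits into a
// low half that keeps the 32-byte alignment and a high half at a 16-byte
// offset, which can only be assumed to be 16-byte aligned.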
unsigned SecondHalfAlignment =
(Alignment == Data->getValueType(0).getSizeInBits()/8) ?
Alignment/2 : Alignment;
SDValue Lo, Hi;
MachineMemOperand *MMO = DAG.getMachineFunction().
getMachineMemOperand(N->getPointerInfo(),
MachineMemOperand::MOStore, LoMemVT.getStoreSize(),
Alignment, N->getAAInfo(), N->getRanges());
Lo = DAG.getMaskedStore(Ch, DL, DataLo, Ptr, MaskLo, LoMemVT, MMO,
N->isTruncatingStore(),
N->isCompressingStore());
Ptr = TLI.IncrementMemoryAddress(Ptr, MaskLo, DL, LoMemVT, DAG,
N->isCompressingStore());
MMO = DAG.getMachineFunction().
getMachineMemOperand(N->getPointerInfo(),
MachineMemOperand::MOStore, HiMemVT.getStoreSize(),
SecondHalfAlignment, N->getAAInfo(), N->getRanges());
Hi = DAG.getMaskedStore(Ch, DL, DataHi, Ptr, MaskHi, HiMemVT, MMO,
N->isTruncatingStore(), N->isCompressingStore());
// Build a factor node to remember that this store is independent of the
// other one.
return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_MSCATTER(MaskedScatterSDNode *N,
unsigned OpNo) {
SDValue Ch = N->getChain();
SDValue Ptr = N->getBasePtr();
SDValue Mask = N->getMask();
SDValue Index = N->getIndex();
SDValue Data = N->getValue();
EVT MemoryVT = N->getMemoryVT();
unsigned Alignment = N->getOriginalAlignment();
SDLoc DL(N);
// Split all operands
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
SDValue DataLo, DataHi;
if (getTypeAction(Data.getValueType()) == TargetLowering::TypeSplitVector)
// Split Data operand
GetSplitVector(Data, DataLo, DataHi);
else
std::tie(DataLo, DataHi) = DAG.SplitVector(Data, DL);
SDValue MaskLo, MaskHi;
if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
// Split Mask operand
GetSplitVector(Mask, MaskLo, MaskHi);
else
std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, DL);
SDValue IndexHi, IndexLo;
if (getTypeAction(Index.getValueType()) == TargetLowering::TypeSplitVector)
GetSplitVector(Index, IndexLo, IndexHi);
else
std::tie(IndexLo, IndexHi) = DAG.SplitVector(Index, DL);
SDValue Lo, Hi;
MachineMemOperand *MMO = DAG.getMachineFunction().
getMachineMemOperand(N->getPointerInfo(),
MachineMemOperand::MOStore, LoMemVT.getStoreSize(),
Alignment, N->getAAInfo(), N->getRanges());
SDValue OpsLo[] = {Ch, DataLo, MaskLo, Ptr, IndexLo};
Lo = DAG.getMaskedScatter(DAG.getVTList(MVT::Other), DataLo.getValueType(),
DL, OpsLo, MMO);
MMO = DAG.getMachineFunction().
getMachineMemOperand(N->getPointerInfo(),
MachineMemOperand::MOStore, HiMemVT.getStoreSize(),
Alignment, N->getAAInfo(), N->getRanges());
SDValue OpsHi[] = {Ch, DataHi, MaskHi, Ptr, IndexHi};
Hi = DAG.getMaskedScatter(DAG.getVTList(MVT::Other), DataHi.getValueType(),
DL, OpsHi, MMO);
// Build a factor node to remember that this store is independent of the
// other one.
return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_STORE(StoreSDNode *N, unsigned OpNo) {
assert(N->isUnindexed() && "Indexed store of vector?");
assert(OpNo == 1 && "Can only split the stored value");
SDLoc DL(N);
bool isTruncating = N->isTruncatingStore();
SDValue Ch = N->getChain();
SDValue Ptr = N->getBasePtr();
EVT MemoryVT = N->getMemoryVT();
unsigned Alignment = N->getOriginalAlignment();
MachineMemOperand::Flags MMOFlags = N->getMemOperand()->getFlags();
AAMDNodes AAInfo = N->getAAInfo();
SDValue Lo, Hi;
GetSplitVector(N->getOperand(1), Lo, Hi);
EVT LoMemVT, HiMemVT;
std::tie(LoMemVT, HiMemVT) = DAG.GetSplitDestVTs(MemoryVT);
unsigned IncrementSize = LoMemVT.getSizeInBits()/8;
if (isTruncating)
Lo = DAG.getTruncStore(Ch, DL, Lo, Ptr, N->getPointerInfo(), LoMemVT,
Alignment, MMOFlags, AAInfo);
else
Lo = DAG.getStore(Ch, DL, Lo, Ptr, N->getPointerInfo(), Alignment, MMOFlags,
AAInfo);
// Increment the pointer to the other half.
Ptr = DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
DAG.getConstant(IncrementSize, DL, Ptr.getValueType()));
if (isTruncating)
Hi = DAG.getTruncStore(Ch, DL, Hi, Ptr,
N->getPointerInfo().getWithOffset(IncrementSize),
HiMemVT, Alignment, MMOFlags, AAInfo);
else
Hi = DAG.getStore(Ch, DL, Hi, Ptr,
N->getPointerInfo().getWithOffset(IncrementSize),
Alignment, MMOFlags, AAInfo);
return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_CONCAT_VECTORS(SDNode *N) {
SDLoc DL(N);
// The input operands all must have the same type, and we know the result
// type is valid. Convert this to a buildvector which extracts all the
// input elements.
// TODO: If the input elements are power-of-two vectors, we could convert this to
// a new CONCAT_VECTORS node with elements that are half-wide.
SmallVector<SDValue, 32> Elts;
EVT EltVT = N->getValueType(0).getVectorElementType();
for (const SDValue &Op : N->op_values()) {
for (unsigned i = 0, e = Op.getValueType().getVectorNumElements();
i != e; ++i) {
Elts.push_back(DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Op,
DAG.getConstant(i, DL, TLI.getVectorIdxTy(DAG.getDataLayout()))));
}
}
return DAG.getBuildVector(N->getValueType(0), DL, Elts);
}
SDValue DAGTypeLegalizer::SplitVecOp_TruncateHelper(SDNode *N) {
// The result type is legal, but the input type is illegal. If splitting
// ends up with the result type of each half still being legal, just
// do that. If, however, that would result in an illegal result type,
// we can try to get more clever with power-of-two vectors. Specifically,
// split the input type, but also widen the result element size, then
// concatenate the halves and truncate again. For example, consider a target
// where v8i8 is legal and v8i32 is not (ARM, which doesn't have 256-bit
// vectors). To perform a "%res = v8i8 trunc v8i32 %in" we do:
// %inlo = v4i32 extract_subvector %in, 0
// %inhi = v4i32 extract_subvector %in, 4
// %lo16 = v4i16 trunc v4i32 %inlo
// %hi16 = v4i16 trunc v4i32 %inhi
// %in16 = v8i16 concat_vectors v4i16 %lo16, v4i16 %hi16
// %res = v8i8 trunc v8i16 %in16
//
// Without this transform, the original truncate would end up being
// scalarized, which is pretty much always a last resort.
SDValue InVec = N->getOperand(0);
EVT InVT = InVec->getValueType(0);
EVT OutVT = N->getValueType(0);
unsigned NumElements = OutVT.getVectorNumElements();
bool IsFloat = OutVT.isFloatingPoint();
// Widening should have already made sure this is a power-of-two vector
// if we're trying to split it at all. assert() that's true, just in case.
assert(!(NumElements & 1) && "Splitting vector, but not in half!");
unsigned InElementSize = InVT.getScalarSizeInBits();
unsigned OutElementSize = OutVT.getScalarSizeInBits();
// If the input elements are only 1/2 the width of the result elements,
// just use the normal splitting. Our trick only works if there's room
// to split more than once.
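// Sketch: for v4i64 -> v4i16 (64-bit vs. 16-bit elements) there is room for
// an intermediate type, so the transform below applies; for v4i32 -> v4i16
// (32 vs. 16) there is no room and the plain SplitVecOp_UnaryOp path is
// taken instead.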
if (InElementSize <= OutElementSize * 2)
return SplitVecOp_UnaryOp(N);
SDLoc DL(N);
// Extract the halves of the input via extract_subvector.
SDValue InLoVec, InHiVec;
std::tie(InLoVec, InHiVec) = DAG.SplitVector(InVec, DL);
// Truncate them to 1/2 the element size.
EVT HalfElementVT = IsFloat ?
EVT::getFloatingPointVT(InElementSize/2) :
EVT::getIntegerVT(*DAG.getContext(), InElementSize/2);
EVT HalfVT = EVT::getVectorVT(*DAG.getContext(), HalfElementVT,
NumElements/2);
SDValue HalfLo = DAG.getNode(N->getOpcode(), DL, HalfVT, InLoVec);
SDValue HalfHi = DAG.getNode(N->getOpcode(), DL, HalfVT, InHiVec);
// Concatenate them to get the full intermediate truncation result.
EVT InterVT = EVT::getVectorVT(*DAG.getContext(), HalfElementVT, NumElements);
SDValue InterVec = DAG.getNode(ISD::CONCAT_VECTORS, DL, InterVT, HalfLo,
HalfHi);
// Now finish up by truncating all the way down to the original result
// type. This should normally be something that ends up being legal directly,
// but in theory if a target has very wide vectors and an annoyingly
// restricted set of legal types, this split can chain to build things up.
return IsFloat
? DAG.getNode(ISD::FP_ROUND, DL, OutVT, InterVec,
DAG.getTargetConstant(
0, DL, TLI.getPointerTy(DAG.getDataLayout())))
: DAG.getNode(ISD::TRUNCATE, DL, OutVT, InterVec);
}
SDValue DAGTypeLegalizer::SplitVecOp_VSETCC(SDNode *N) {
assert(N->getValueType(0).isVector() &&
N->getOperand(0).getValueType().isVector() &&
"Operand types must be vectors");
// The result has a legal vector type, but the input needs splitting.
SDValue Lo0, Hi0, Lo1, Hi1, LoRes, HiRes;
SDLoc DL(N);
GetSplitVector(N->getOperand(0), Lo0, Hi0);
GetSplitVector(N->getOperand(1), Lo1, Hi1);
unsigned PartElements = Lo0.getValueType().getVectorNumElements();
EVT PartResVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1, PartElements);
EVT WideResVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1, 2*PartElements);
LoRes = DAG.getNode(ISD::SETCC, DL, PartResVT, Lo0, Lo1, N->getOperand(2));
HiRes = DAG.getNode(ISD::SETCC, DL, PartResVT, Hi0, Hi1, N->getOperand(2));
SDValue Con = DAG.getNode(ISD::CONCAT_VECTORS, DL, WideResVT, LoRes, HiRes);
return PromoteTargetBoolean(Con, N->getValueType(0));
}
SDValue DAGTypeLegalizer::SplitVecOp_FP_ROUND(SDNode *N) {
// The result has a legal vector type, but the input needs splitting.
EVT ResVT = N->getValueType(0);
SDValue Lo, Hi;
SDLoc DL(N);
GetSplitVector(N->getOperand(0), Lo, Hi);
EVT InVT = Lo.getValueType();
EVT OutVT = EVT::getVectorVT(*DAG.getContext(), ResVT.getVectorElementType(),
InVT.getVectorNumElements());
Lo = DAG.getNode(ISD::FP_ROUND, DL, OutVT, Lo, N->getOperand(1));
Hi = DAG.getNode(ISD::FP_ROUND, DL, OutVT, Hi, N->getOperand(1));
return DAG.getNode(ISD::CONCAT_VECTORS, DL, ResVT, Lo, Hi);
}
SDValue DAGTypeLegalizer::SplitVecOp_FCOPYSIGN(SDNode *N) {
// The result (and the first input) has a legal vector type, but the second
// input needs splitting.
return DAG.UnrollVectorOp(N, N->getValueType(0).getVectorNumElements());
}
//===----------------------------------------------------------------------===//
// Result Vector Widening
//===----------------------------------------------------------------------===//
void DAGTypeLegalizer::WidenVectorResult(SDNode *N, unsigned ResNo) {
DEBUG(dbgs() << "Widen node result " << ResNo << ": ";
N->dump(&DAG);
dbgs() << "\n");
// See if the target wants to custom widen this node.
if (CustomWidenLowerNode(N, N->getValueType(ResNo)))
return;
SDValue Res = SDValue();
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "WidenVectorResult #" << ResNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
llvm_unreachable("Do not know how to widen the result of this operator!");
case ISD::MERGE_VALUES: Res = WidenVecRes_MERGE_VALUES(N, ResNo); break;
case ISD::BITCAST: Res = WidenVecRes_BITCAST(N); break;
case ISD::BUILD_VECTOR: Res = WidenVecRes_BUILD_VECTOR(N); break;
case ISD::CONCAT_VECTORS: Res = WidenVecRes_CONCAT_VECTORS(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = WidenVecRes_EXTRACT_SUBVECTOR(N); break;
case ISD::FP_ROUND_INREG: Res = WidenVecRes_InregOp(N); break;
case ISD::INSERT_VECTOR_ELT: Res = WidenVecRes_INSERT_VECTOR_ELT(N); break;
case ISD::LOAD: Res = WidenVecRes_LOAD(N); break;
case ISD::SCALAR_TO_VECTOR: Res = WidenVecRes_SCALAR_TO_VECTOR(N); break;
case ISD::SIGN_EXTEND_INREG: Res = WidenVecRes_InregOp(N); break;
case ISD::VSELECT:
case ISD::SELECT: Res = WidenVecRes_SELECT(N); break;
case ISD::SELECT_CC: Res = WidenVecRes_SELECT_CC(N); break;
case ISD::SETCC: Res = WidenVecRes_SETCC(N); break;
case ISD::UNDEF: Res = WidenVecRes_UNDEF(N); break;
case ISD::VECTOR_SHUFFLE:
Res = WidenVecRes_VECTOR_SHUFFLE(cast<ShuffleVectorSDNode>(N));
break;
case ISD::MLOAD:
Res = WidenVecRes_MLOAD(cast<MaskedLoadSDNode>(N));
break;
case ISD::MGATHER:
Res = WidenVecRes_MGATHER(cast<MaskedGatherSDNode>(N));
break;
case ISD::ADD:
case ISD::AND:
case ISD::MUL:
case ISD::MULHS:
case ISD::MULHU:
case ISD::OR:
case ISD::SUB:
case ISD::XOR:
case ISD::FMINNUM:
case ISD::FMAXNUM:
case ISD::FMINNAN:
case ISD::FMAXNAN:
case ISD::SMIN:
case ISD::SMAX:
case ISD::UMIN:
case ISD::UMAX:
Res = WidenVecRes_Binary(N);
break;
case ISD::FADD:
case ISD::FMUL:
case ISD::FPOW:
case ISD::FSUB:
case ISD::FDIV:
case ISD::FREM:
case ISD::SDIV:
case ISD::UDIV:
case ISD::SREM:
case ISD::UREM:
Res = WidenVecRes_BinaryCanTrap(N);
break;
case ISD::FCOPYSIGN:
Res = WidenVecRes_FCOPYSIGN(N);
break;
case ISD::FPOWI:
Res = WidenVecRes_POWI(N);
break;
case ISD::SHL:
case ISD::SRA:
case ISD::SRL:
Res = WidenVecRes_Shift(N);
break;
case ISD::ANY_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:
Res = WidenVecRes_EXTEND_VECTOR_INREG(N);
break;
case ISD::ANY_EXTEND:
case ISD::FP_EXTEND:
case ISD::FP_ROUND:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::SIGN_EXTEND:
case ISD::SINT_TO_FP:
case ISD::TRUNCATE:
case ISD::UINT_TO_FP:
case ISD::ZERO_EXTEND:
Res = WidenVecRes_Convert(N);
break;
case ISD::BITREVERSE:
case ISD::BSWAP:
case ISD::CTLZ:
case ISD::CTPOP:
case ISD::CTTZ:
case ISD::FABS:
case ISD::FCEIL:
case ISD::FCOS:
case ISD::FEXP:
case ISD::FEXP2:
case ISD::FFLOOR:
case ISD::FLOG:
case ISD::FLOG10:
case ISD::FLOG2:
case ISD::FNEARBYINT:
case ISD::FNEG:
case ISD::FRINT:
case ISD::FROUND:
case ISD::FSIN:
case ISD::FSQRT:
case ISD::FTRUNC:
Res = WidenVecRes_Unary(N);
break;
case ISD::FMA:
Res = WidenVecRes_Ternary(N);
break;
}
// If Res is null, the sub-method took care of registering the result.
if (Res.getNode())
SetWidenedVector(SDValue(N, ResNo), Res);
}
SDValue DAGTypeLegalizer::WidenVecRes_Ternary(SDNode *N) {
// Ternary op widening.
SDLoc dl(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
SDValue InOp3 = GetWidenedVector(N->getOperand(2));
return DAG.getNode(N->getOpcode(), dl, WidenVT, InOp1, InOp2, InOp3);
}
SDValue DAGTypeLegalizer::WidenVecRes_Binary(SDNode *N) {
// Binary op widening.
SDLoc dl(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
return DAG.getNode(N->getOpcode(), dl, WidenVT, InOp1, InOp2, N->getFlags());
}
SDValue DAGTypeLegalizer::WidenVecRes_BinaryCanTrap(SDNode *N) {
// Binary op widening for operations that can trap.
unsigned Opcode = N->getOpcode();
SDLoc dl(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
EVT WidenEltVT = WidenVT.getVectorElementType();
EVT VT = WidenVT;
unsigned NumElts = VT.getVectorNumElements();
const SDNodeFlags Flags = N->getFlags();
while (!TLI.isTypeLegal(VT) && NumElts != 1) {
NumElts = NumElts / 2;
VT = EVT::getVectorVT(*DAG.getContext(), WidenEltVT, NumElts);
}
if (NumElts != 1 && !TLI.canOpTrap(N->getOpcode(), VT)) {
// Operation doesn't trap so just widen as normal.
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
return DAG.getNode(N->getOpcode(), dl, WidenVT, InOp1, InOp2, Flags);
}
// No legal vector version so unroll the vector operation and then widen.
if (NumElts == 1)
return DAG.UnrollVectorOp(N, WidenVT.getVectorNumElements());
// Since the operation can trap, apply operation on the original vector.
EVT MaxVT = VT;
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
unsigned CurNumElts = N->getValueType(0).getVectorNumElements();
SmallVector<SDValue, 16> ConcatOps(CurNumElts);
unsigned ConcatEnd = 0; // Current ConcatOps index.
int Idx = 0; // Current Idx into input vectors.
// NumElts := greatest legal vector size (at most WidenVT)
// while (orig. vector has unhandled elements) {
// take munches of size NumElts from the beginning and add to ConcatOps
// NumElts := next smaller supported vector size or 1
// }
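// Sketch, assuming v4f32 and v2f32 are legal: an FDIV on v6f32 widened to
// v8f32 is evaluated only on the original six elements, as one v4f32 FDIV
// (elements 0-3) plus one v2f32 FDIV (elements 4-5); the pieces are then
// reassembled (padded with undef) into the v8f32 result below.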
while (CurNumElts != 0) {
while (CurNumElts >= NumElts) {
SDValue EOp1 = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, VT, InOp1,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
SDValue EOp2 = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, VT, InOp2,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
ConcatOps[ConcatEnd++] = DAG.getNode(Opcode, dl, VT, EOp1, EOp2, Flags);
Idx += NumElts;
CurNumElts -= NumElts;
}
do {
NumElts = NumElts / 2;
VT = EVT::getVectorVT(*DAG.getContext(), WidenEltVT, NumElts);
} while (!TLI.isTypeLegal(VT) && NumElts != 1);
if (NumElts == 1) {
for (unsigned i = 0; i != CurNumElts; ++i, ++Idx) {
SDValue EOp1 = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, WidenEltVT, InOp1,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
SDValue EOp2 = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, WidenEltVT, InOp2,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
ConcatOps[ConcatEnd++] = DAG.getNode(Opcode, dl, WidenEltVT,
EOp1, EOp2, Flags);
}
CurNumElts = 0;
}
}
// Check to see if we have a single operation with the widen type.
if (ConcatEnd == 1) {
VT = ConcatOps[0].getValueType();
if (VT == WidenVT)
return ConcatOps[0];
}
// while (Some element of ConcatOps is not of type MaxVT) {
// From the end of ConcatOps, collect elements of the same type and put
// them into an op of the next larger supported type
// }
while (ConcatOps[ConcatEnd-1].getValueType() != MaxVT) {
Idx = ConcatEnd - 1;
VT = ConcatOps[Idx--].getValueType();
while (Idx >= 0 && ConcatOps[Idx].getValueType() == VT)
Idx--;
int NextSize = VT.isVector() ? VT.getVectorNumElements() : 1;
EVT NextVT;
do {
NextSize *= 2;
NextVT = EVT::getVectorVT(*DAG.getContext(), WidenEltVT, NextSize);
} while (!TLI.isTypeLegal(NextVT));
if (!VT.isVector()) {
// Scalar type, create an INSERT_VECTOR_ELT of type NextVT
SDValue VecOp = DAG.getUNDEF(NextVT);
unsigned NumToInsert = ConcatEnd - Idx - 1;
for (unsigned i = 0, OpIdx = Idx+1; i < NumToInsert; i++, OpIdx++) {
VecOp = DAG.getNode(
ISD::INSERT_VECTOR_ELT, dl, NextVT, VecOp, ConcatOps[OpIdx],
DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
ConcatOps[Idx+1] = VecOp;
ConcatEnd = Idx + 2;
} else {
// Vector type, create a CONCAT_VECTORS of type NextVT
SDValue undefVec = DAG.getUNDEF(VT);
unsigned OpsToConcat = NextSize/VT.getVectorNumElements();
SmallVector<SDValue, 16> SubConcatOps(OpsToConcat);
unsigned RealVals = ConcatEnd - Idx - 1;
unsigned SubConcatEnd = 0;
unsigned SubConcatIdx = Idx + 1;
while (SubConcatEnd < RealVals)
SubConcatOps[SubConcatEnd++] = ConcatOps[++Idx];
while (SubConcatEnd < OpsToConcat)
SubConcatOps[SubConcatEnd++] = undefVec;
ConcatOps[SubConcatIdx] = DAG.getNode(ISD::CONCAT_VECTORS, dl,
NextVT, SubConcatOps);
ConcatEnd = SubConcatIdx + 1;
}
}
// Check to see if we have a single operation with the widen type.
if (ConcatEnd == 1) {
VT = ConcatOps[0].getValueType();
if (VT == WidenVT)
return ConcatOps[0];
}
// Add undefs of size MaxVT until ConcatOps grows to the length of WidenVT.
unsigned NumOps = WidenVT.getVectorNumElements()/MaxVT.getVectorNumElements();
if (NumOps != ConcatEnd) {
SDValue UndefVal = DAG.getUNDEF(MaxVT);
for (unsigned j = ConcatEnd; j < NumOps; ++j)
ConcatOps[j] = UndefVal;
}
return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT,
makeArrayRef(ConcatOps.data(), NumOps));
}
SDValue DAGTypeLegalizer::WidenVecRes_Convert(SDNode *N) {
SDValue InOp = N->getOperand(0);
SDLoc DL(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
unsigned WidenNumElts = WidenVT.getVectorNumElements();
EVT InVT = InOp.getValueType();
EVT InEltVT = InVT.getVectorElementType();
EVT InWidenVT = EVT::getVectorVT(*DAG.getContext(), InEltVT, WidenNumElts);
unsigned Opcode = N->getOpcode();
unsigned InVTNumElts = InVT.getVectorNumElements();
const SDNodeFlags Flags = N->getFlags();
if (getTypeAction(InVT) == TargetLowering::TypeWidenVector) {
InOp = GetWidenedVector(N->getOperand(0));
InVT = InOp.getValueType();
InVTNumElts = InVT.getVectorNumElements();
if (InVTNumElts == WidenNumElts) {
if (N->getNumOperands() == 1)
return DAG.getNode(Opcode, DL, WidenVT, InOp);
return DAG.getNode(Opcode, DL, WidenVT, InOp, N->getOperand(1), Flags);
}
if (WidenVT.getSizeInBits() == InVT.getSizeInBits()) {
// If both input and result vector types are of same width, extend
// operations should be done with SIGN/ZERO_EXTEND_VECTOR_INREG, which
// accepts fewer elements in the result than in the input.
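// Sketch: for "v2i32 = SIGN_EXTEND v2i16" on a target with 128-bit vectors,
// the result widens to v4i32 and the input to v8i16 (equal widths), so
// v4i32 = SIGN_EXTEND_VECTOR_INREG v8i16 extends just the low four elements.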
if (Opcode == ISD::SIGN_EXTEND)
return DAG.getSignExtendVectorInReg(InOp, DL, WidenVT);
if (Opcode == ISD::ZERO_EXTEND)
return DAG.getZeroExtendVectorInReg(InOp, DL, WidenVT);
}
}
if (TLI.isTypeLegal(InWidenVT)) {
// Because the result and the input are different vector types, widening
// the result could create a legal type but widening the input might make
// it an illegal type that might lead to repeatedly splitting the input
// and then widening it. To avoid this, we widen the input only if
// it results in a legal type.
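// Sketch: converting v2f32 -> v2i8 with WidenVT v16i8 needs the v16f32
// InWidenVT built below; the surrounding legality check avoids creating it
// when v16f32 is illegal, falling back to the extract or unroll paths.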
if (WidenNumElts % InVTNumElts == 0) {
// Widen the input and call convert on the widened input vector.
unsigned NumConcat = WidenNumElts/InVTNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
Ops[0] = InOp;
SDValue UndefVal = DAG.getUNDEF(InVT);
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = UndefVal;
SDValue InVec = DAG.getNode(ISD::CONCAT_VECTORS, DL, InWidenVT, Ops);
if (N->getNumOperands() == 1)
return DAG.getNode(Opcode, DL, WidenVT, InVec);
return DAG.getNode(Opcode, DL, WidenVT, InVec, N->getOperand(1), Flags);
}
if (InVTNumElts % WidenNumElts == 0) {
SDValue InVal = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, DL, InWidenVT, InOp,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
// Extract a subvector of the input and convert that shortened vector.
if (N->getNumOperands() == 1)
return DAG.getNode(Opcode, DL, WidenVT, InVal);
return DAG.getNode(Opcode, DL, WidenVT, InVal, N->getOperand(1), Flags);
}
}
// Otherwise unroll into some nasty scalar code and rebuild the vector.
SmallVector<SDValue, 16> Ops(WidenNumElts);
EVT EltVT = WidenVT.getVectorElementType();
unsigned MinElts = std::min(InVTNumElts, WidenNumElts);
unsigned i;
for (i=0; i < MinElts; ++i) {
SDValue Val = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, InEltVT, InOp,
DAG.getConstant(i, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
if (N->getNumOperands() == 1)
Ops[i] = DAG.getNode(Opcode, DL, EltVT, Val);
else
Ops[i] = DAG.getNode(Opcode, DL, EltVT, Val, N->getOperand(1), Flags);
}
SDValue UndefVal = DAG.getUNDEF(EltVT);
for (; i < WidenNumElts; ++i)
Ops[i] = UndefVal;
return DAG.getBuildVector(WidenVT, DL, Ops);
}
SDValue DAGTypeLegalizer::WidenVecRes_EXTEND_VECTOR_INREG(SDNode *N) {
unsigned Opcode = N->getOpcode();
SDValue InOp = N->getOperand(0);
SDLoc DL(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
EVT WidenSVT = WidenVT.getVectorElementType();
unsigned WidenNumElts = WidenVT.getVectorNumElements();
EVT InVT = InOp.getValueType();
EVT InSVT = InVT.getVectorElementType();
unsigned InVTNumElts = InVT.getVectorNumElements();
if (getTypeAction(InVT) == TargetLowering::TypeWidenVector) {
InOp = GetWidenedVector(InOp);
InVT = InOp.getValueType();
if (InVT.getSizeInBits() == WidenVT.getSizeInBits()) {
switch (Opcode) {
case ISD::ANY_EXTEND_VECTOR_INREG:
return DAG.getAnyExtendVectorInReg(InOp, DL, WidenVT);
case ISD::SIGN_EXTEND_VECTOR_INREG:
return DAG.getSignExtendVectorInReg(InOp, DL, WidenVT);
case ISD::ZERO_EXTEND_VECTOR_INREG:
return DAG.getZeroExtendVectorInReg(InOp, DL, WidenVT);
}
}
}
// Unroll, extend the scalars and rebuild the vector.
SmallVector<SDValue, 16> Ops;
for (unsigned i = 0, e = std::min(InVTNumElts, WidenNumElts); i != e; ++i) {
SDValue Val = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, InSVT, InOp,
DAG.getConstant(i, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
switch (Opcode) {
case ISD::ANY_EXTEND_VECTOR_INREG:
Val = DAG.getNode(ISD::ANY_EXTEND, DL, WidenSVT, Val);
break;
case ISD::SIGN_EXTEND_VECTOR_INREG:
Val = DAG.getNode(ISD::SIGN_EXTEND, DL, WidenSVT, Val);
break;
case ISD::ZERO_EXTEND_VECTOR_INREG:
Val = DAG.getNode(ISD::ZERO_EXTEND, DL, WidenSVT, Val);
break;
default:
llvm_unreachable("A *_EXTEND_VECTOR_INREG node was expected");
}
Ops.push_back(Val);
}
while (Ops.size() != WidenNumElts)
Ops.push_back(DAG.getUNDEF(WidenSVT));
return DAG.getBuildVector(WidenVT, DL, Ops);
}
SDValue DAGTypeLegalizer::WidenVecRes_FCOPYSIGN(SDNode *N) {
// If this is an FCOPYSIGN with the same input types, we can treat it as a
// normal (potentially trapping) binary op.
if (N->getOperand(0).getValueType() == N->getOperand(1).getValueType())
return WidenVecRes_BinaryCanTrap(N);
// If the types are different, fall back to unrolling.
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
return DAG.UnrollVectorOp(N, WidenVT.getVectorNumElements());
}
SDValue DAGTypeLegalizer::WidenVecRes_POWI(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp = GetWidenedVector(N->getOperand(0));
SDValue ShOp = N->getOperand(1);
return DAG.getNode(N->getOpcode(), SDLoc(N), WidenVT, InOp, ShOp);
}
SDValue DAGTypeLegalizer::WidenVecRes_Shift(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp = GetWidenedVector(N->getOperand(0));
SDValue ShOp = N->getOperand(1);
EVT ShVT = ShOp.getValueType();
if (getTypeAction(ShVT) == TargetLowering::TypeWidenVector) {
ShOp = GetWidenedVector(ShOp);
ShVT = ShOp.getValueType();
}
EVT ShWidenVT = EVT::getVectorVT(*DAG.getContext(),
ShVT.getVectorElementType(),
WidenVT.getVectorNumElements());
if (ShVT != ShWidenVT)
ShOp = ModifyToType(ShOp, ShWidenVT);
return DAG.getNode(N->getOpcode(), SDLoc(N), WidenVT, InOp, ShOp);
}
SDValue DAGTypeLegalizer::WidenVecRes_Unary(SDNode *N) {
// Unary op widening.
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp = GetWidenedVector(N->getOperand(0));
return DAG.getNode(N->getOpcode(), SDLoc(N), WidenVT, InOp);
}
SDValue DAGTypeLegalizer::WidenVecRes_InregOp(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
EVT ExtVT = EVT::getVectorVT(*DAG.getContext(),
cast<VTSDNode>(N->getOperand(1))->getVT()
.getVectorElementType(),
WidenVT.getVectorNumElements());
SDValue WidenLHS = GetWidenedVector(N->getOperand(0));
return DAG.getNode(N->getOpcode(), SDLoc(N),
WidenVT, WidenLHS, DAG.getValueType(ExtVT));
}
SDValue DAGTypeLegalizer::WidenVecRes_MERGE_VALUES(SDNode *N, unsigned ResNo) {
SDValue WidenVec = DisintegrateMERGE_VALUES(N, ResNo);
return GetWidenedVector(WidenVec);
}
SDValue DAGTypeLegalizer::WidenVecRes_BITCAST(SDNode *N) {
SDValue InOp = N->getOperand(0);
EVT InVT = InOp.getValueType();
EVT VT = N->getValueType(0);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
SDLoc dl(N);
switch (getTypeAction(InVT)) {
case TargetLowering::TypeLegal:
break;
case TargetLowering::TypePromoteInteger:
// If the incoming type is a vector that is being promoted, then
// we know that the elements are arranged differently and that we
// must perform the conversion using a stack slot.
if (InVT.isVector())
break;
// If the InOp is promoted to the same size, convert it. Otherwise,
// fall out of the switch and widen the promoted input.
InOp = GetPromotedInteger(InOp);
InVT = InOp.getValueType();
if (WidenVT.bitsEq(InVT))
return DAG.getNode(ISD::BITCAST, dl, WidenVT, InOp);
break;
case TargetLowering::TypeSoftenFloat:
case TargetLowering::TypePromoteFloat:
case TargetLowering::TypeExpandInteger:
case TargetLowering::TypeExpandFloat:
case TargetLowering::TypeScalarizeVector:
case TargetLowering::TypeSplitVector:
break;
case TargetLowering::TypeWidenVector:
// If the InOp is widened to the same size, convert it. Otherwise, fall
// out of the switch and widen the widened input.
InOp = GetWidenedVector(InOp);
InVT = InOp.getValueType();
if (WidenVT.bitsEq(InVT))
// The input widens to the same size. Convert to the widened value.
return DAG.getNode(ISD::BITCAST, dl, WidenVT, InOp);
break;
}
unsigned WidenSize = WidenVT.getSizeInBits();
unsigned InSize = InVT.getSizeInBits();
// x86mmx is not an acceptable vector element type, so don't try.
if (WidenSize % InSize == 0 && InVT != MVT::x86mmx) {
// Determine the new input vector type. The new input vector type will use
// the same element type (if the input is a vector) or use the input type as
// its element type. It is the same size as the type to widen to.
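// Sketch: bitcasting i64 to v2i32 with WidenVT v4i32 (128 bits) picks
// NewInVT = v2i64; <%in, undef> is built as a v2i64 and bitcast to v4i32.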
EVT NewInVT;
unsigned NewNumElts = WidenSize / InSize;
if (InVT.isVector()) {
EVT InEltVT = InVT.getVectorElementType();
NewInVT = EVT::getVectorVT(*DAG.getContext(), InEltVT,
WidenSize / InEltVT.getSizeInBits());
} else {
NewInVT = EVT::getVectorVT(*DAG.getContext(), InVT, NewNumElts);
}
if (TLI.isTypeLegal(NewInVT)) {
// Because the result and the input are different vector types, widening
// the result could create a legal type but widening the input might make
// it an illegal type that might lead to repeatedly splitting the input
// and then widening it. To avoid this, we widen the input only if
// it results in a legal type.
SmallVector<SDValue, 16> Ops(NewNumElts);
SDValue UndefVal = DAG.getUNDEF(InVT);
Ops[0] = InOp;
for (unsigned i = 1; i < NewNumElts; ++i)
Ops[i] = UndefVal;
SDValue NewVec;
if (InVT.isVector())
NewVec = DAG.getNode(ISD::CONCAT_VECTORS, dl, NewInVT, Ops);
else
NewVec = DAG.getBuildVector(NewInVT, dl, Ops);
return DAG.getNode(ISD::BITCAST, dl, WidenVT, NewVec);
}
}
return CreateStackStoreLoad(InOp, WidenVT);
}
SDValue DAGTypeLegalizer::WidenVecRes_BUILD_VECTOR(SDNode *N) {
SDLoc dl(N);
// Build a vector, padding with UNDEF values for the new elements.
EVT VT = N->getValueType(0);
// Integer BUILD_VECTOR operands may be larger than the node's vector element
// type. The UNDEFs need to have the same type as the existing operands.
EVT EltVT = N->getOperand(0).getValueType();
unsigned NumElts = VT.getVectorNumElements();
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SmallVector<SDValue, 16> NewOps(N->op_begin(), N->op_end());
assert(WidenNumElts >= NumElts && "Shrinking vector instead of widening!");
NewOps.append(WidenNumElts - NumElts, DAG.getUNDEF(EltVT));
return DAG.getBuildVector(WidenVT, dl, NewOps);
}
SDValue DAGTypeLegalizer::WidenVecRes_CONCAT_VECTORS(SDNode *N) {
EVT InVT = N->getOperand(0).getValueType();
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDLoc dl(N);
unsigned WidenNumElts = WidenVT.getVectorNumElements();
unsigned NumInElts = InVT.getVectorNumElements();
unsigned NumOperands = N->getNumOperands();
bool InputWidened = false; // Indicates we need to widen the input.
if (getTypeAction(InVT) != TargetLowering::TypeWidenVector) {
if (WidenVT.getVectorNumElements() % InVT.getVectorNumElements() == 0) {
// Add undef vectors to widen to correct length.
unsigned NumConcat = WidenVT.getVectorNumElements() /
InVT.getVectorNumElements();
SDValue UndefVal = DAG.getUNDEF(InVT);
SmallVector<SDValue, 16> Ops(NumConcat);
for (unsigned i=0; i < NumOperands; ++i)
Ops[i] = N->getOperand(i);
for (unsigned i = NumOperands; i != NumConcat; ++i)
Ops[i] = UndefVal;
return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT, Ops);
}
} else {
InputWidened = true;
if (WidenVT == TLI.getTypeToTransformTo(*DAG.getContext(), InVT)) {
// The inputs and the result widen to the same type.
unsigned i;
for (i=1; i < NumOperands; ++i)
if (!N->getOperand(i).isUndef())
break;
if (i == NumOperands)
// Everything but the first operand is an UNDEF so just return the
// widened first operand.
return GetWidenedVector(N->getOperand(0));
if (NumOperands == 2) {
// Replace concat of two operands with a shuffle.
SmallVector<int, 16> MaskOps(WidenNumElts, -1);
for (unsigned i = 0; i < NumInElts; ++i) {
MaskOps[i] = i;
MaskOps[i + NumInElts] = i + WidenNumElts;
}
return DAG.getVectorShuffle(WidenVT, dl,
GetWidenedVector(N->getOperand(0)),
GetWidenedVector(N->getOperand(1)),
MaskOps);
}
}
}
// Fall back to using extracts and a BUILD_VECTOR.
EVT EltVT = WidenVT.getVectorElementType();
SmallVector<SDValue, 16> Ops(WidenNumElts);
unsigned Idx = 0;
for (unsigned i=0; i < NumOperands; ++i) {
SDValue InOp = N->getOperand(i);
if (InputWidened)
InOp = GetWidenedVector(InOp);
for (unsigned j=0; j < NumInElts; ++j)
Ops[Idx++] = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, EltVT, InOp,
DAG.getConstant(j, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
SDValue UndefVal = DAG.getUNDEF(EltVT);
for (; Idx < WidenNumElts; ++Idx)
Ops[Idx] = UndefVal;
return DAG.getBuildVector(WidenVT, dl, Ops);
}
SDValue DAGTypeLegalizer::WidenVecRes_EXTRACT_SUBVECTOR(SDNode *N) {
EVT VT = N->getValueType(0);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SDValue InOp = N->getOperand(0);
SDValue Idx = N->getOperand(1);
SDLoc dl(N);
if (getTypeAction(InOp.getValueType()) == TargetLowering::TypeWidenVector)
InOp = GetWidenedVector(InOp);
EVT InVT = InOp.getValueType();
// Check if we can just return the input vector after widening.
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
if (IdxVal == 0 && InVT == WidenVT)
return InOp;
// Check if we can extract from the vector.
unsigned InNumElts = InVT.getVectorNumElements();
if (IdxVal % WidenNumElts == 0 && IdxVal + WidenNumElts < InNumElts)
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, WidenVT, InOp, Idx);
// We could try widening the input to the right length but for now, extract
// the original elements, fill the rest with undefs and build a vector.
SmallVector<SDValue, 16> Ops(WidenNumElts);
EVT EltVT = VT.getVectorElementType();
unsigned NumElts = VT.getVectorNumElements();
unsigned i;
for (i=0; i < NumElts; ++i)
Ops[i] =
DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, InOp,
DAG.getConstant(IdxVal + i, dl,
TLI.getVectorIdxTy(DAG.getDataLayout())));
SDValue UndefVal = DAG.getUNDEF(EltVT);
for (; i < WidenNumElts; ++i)
Ops[i] = UndefVal;
return DAG.getBuildVector(WidenVT, dl, Ops);
}
SDValue DAGTypeLegalizer::WidenVecRes_INSERT_VECTOR_ELT(SDNode *N) {
SDValue InOp = GetWidenedVector(N->getOperand(0));
return DAG.getNode(ISD::INSERT_VECTOR_ELT, SDLoc(N),
InOp.getValueType(), InOp,
N->getOperand(1), N->getOperand(2));
}
SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {
LoadSDNode *LD = cast<LoadSDNode>(N);
ISD::LoadExtType ExtType = LD->getExtensionType();
SDValue Result;
SmallVector<SDValue, 16> LdChain; // Chain for the series of loads
if (ExtType != ISD::NON_EXTLOAD)
Result = GenWidenVectorExtLoads(LdChain, LD, ExtType);
else
Result = GenWidenVectorLoads(LdChain, LD);
// If we generate a single load, we can use that for the chain. Otherwise,
// build a factor node to remember the multiple loads are independent and
// chain to that.
SDValue NewChain;
if (LdChain.size() == 1)
NewChain = LdChain[0];
else
NewChain = DAG.getNode(ISD::TokenFactor, SDLoc(LD), MVT::Other, LdChain);
// Modified the chain - switch anything that used the old chain to use
// the new one.
ReplaceValueWith(SDValue(N, 1), NewChain);
return Result;
}
SDValue DAGTypeLegalizer::WidenVecRes_MLOAD(MaskedLoadSDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(),N->getValueType(0));
SDValue Mask = N->getMask();
EVT MaskVT = Mask.getValueType();
SDValue Src0 = GetWidenedVector(N->getSrc0());
ISD::LoadExtType ExtType = N->getExtensionType();
SDLoc dl(N);
if (getTypeAction(MaskVT) == TargetLowering::TypeWidenVector)
Mask = GetWidenedVector(Mask);
else {
EVT BoolVT = getSetCCResultType(WidenVT);
// We can't use ModifyToType() because we need to fill the mask with
// zeroes.
unsigned WidenNumElts = BoolVT.getVectorNumElements();
unsigned MaskNumElts = MaskVT.getVectorNumElements();
unsigned NumConcat = WidenNumElts / MaskNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
SDValue ZeroVal = DAG.getConstant(0, dl, MaskVT);
Ops[0] = Mask;
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = ZeroVal;
Mask = DAG.getNode(ISD::CONCAT_VECTORS, dl, BoolVT, Ops);
}
SDValue Res = DAG.getMaskedLoad(WidenVT, dl, N->getChain(), N->getBasePtr(),
Mask, Src0, N->getMemoryVT(),
N->getMemOperand(), ExtType,
N->isExpandingLoad());
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(N, 1), Res.getValue(1));
return Res;
}
SDValue DAGTypeLegalizer::WidenVecRes_MGATHER(MaskedGatherSDNode *N) {
EVT WideVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue Mask = N->getMask();
SDValue Src0 = GetWidenedVector(N->getValue());
unsigned NumElts = WideVT.getVectorNumElements();
SDLoc dl(N);
// The mask should be widened as well
Mask = WidenTargetBoolean(Mask, WideVT, true);
// Widen the Index operand
SDValue Index = N->getIndex();
EVT WideIndexVT = EVT::getVectorVT(*DAG.getContext(),
Index.getValueType().getScalarType(),
NumElts);
Index = ModifyToType(Index, WideIndexVT);
SDValue Ops[] = { N->getChain(), Src0, Mask, N->getBasePtr(), Index };
SDValue Res = DAG.getMaskedGather(DAG.getVTList(WideVT, MVT::Other),
N->getMemoryVT(), dl, Ops,
N->getMemOperand());
// Legalize the chain result - switch anything that used the old chain to
// use the new one.
ReplaceValueWith(SDValue(N, 1), Res.getValue(1));
return Res;
}
SDValue DAGTypeLegalizer::WidenVecRes_SCALAR_TO_VECTOR(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N),
WidenVT, N->getOperand(0));
}
// Return true if this is a node that could have two SETCCs as operands.
static inline bool isLogicalMaskOp(unsigned Opcode) {
switch (Opcode) {
case ISD::AND:
case ISD::OR:
case ISD::XOR:
return true;
}
return false;
}
// This is used just for the assert in convertMask(). Check that this is
// either a SETCC or a SETCC that convertMask() has already handled.
#ifndef NDEBUG
static inline bool isSETCCorConvertedSETCC(SDValue N) {
if (N.getOpcode() == ISD::EXTRACT_SUBVECTOR)
N = N.getOperand(0);
else if (N.getOpcode() == ISD::CONCAT_VECTORS) {
for (unsigned i = 1; i < N->getNumOperands(); ++i)
if (!N->getOperand(i)->isUndef())
return false;
N = N.getOperand(0);
}
if (N.getOpcode() == ISD::TRUNCATE)
N = N.getOperand(0);
else if (N.getOpcode() == ISD::SIGN_EXTEND)
N = N.getOperand(0);
if (isLogicalMaskOp(N.getOpcode()))
return isSETCCorConvertedSETCC(N.getOperand(0)) &&
isSETCCorConvertedSETCC(N.getOperand(1));
return (N.getOpcode() == ISD::SETCC ||
ISD::isBuildVectorOfConstantSDNodes(N.getNode()));
}
#endif
// Return a mask of vector type MaskVT to replace InMask, extending or
// truncating it to ToMaskVT if needed.
SDValue DAGTypeLegalizer::convertMask(SDValue InMask, EVT MaskVT,
EVT ToMaskVT) {
// Currently a SETCC or an AND/OR/XOR with two SETCCs is handled.
// FIXME: This code seems to be too restrictive, we might consider
// generalizing it or dropping it.
assert(isSETCCorConvertedSETCC(InMask) && "Unexpected mask argument.");
// Make a new Mask node, with a legal result VT.
SmallVector<SDValue, 4> Ops;
for (unsigned i = 0; i < InMask->getNumOperands(); ++i)
Ops.push_back(InMask->getOperand(i));
SDValue Mask = DAG.getNode(InMask->getOpcode(), SDLoc(InMask), MaskVT, Ops);
// If MaskVT has smaller or bigger elements than ToMaskVT, a vector sign
// extend or truncate is needed.
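// Sketch: a v4i32 SETCC mask headed for a ToMaskVT with i64 elements is
// sign-extended to v4i64 here; a v4i64 mask headed for i32 elements is
// truncated instead.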
LLVMContext &Ctx = *DAG.getContext();
unsigned MaskScalarBits = MaskVT.getScalarSizeInBits();
unsigned ToMaskScalBits = ToMaskVT.getScalarSizeInBits();
if (MaskScalarBits < ToMaskScalBits) {
EVT ExtVT = EVT::getVectorVT(Ctx, ToMaskVT.getVectorElementType(),
MaskVT.getVectorNumElements());
Mask = DAG.getNode(ISD::SIGN_EXTEND, SDLoc(Mask), ExtVT, Mask);
} else if (MaskScalarBits > ToMaskScalBits) {
EVT TruncVT = EVT::getVectorVT(Ctx, ToMaskVT.getVectorElementType(),
MaskVT.getVectorNumElements());
Mask = DAG.getNode(ISD::TRUNCATE, SDLoc(Mask), TruncVT, Mask);
}
assert(Mask->getValueType(0).getScalarSizeInBits() ==
ToMaskVT.getScalarSizeInBits() &&
"Mask should have the right element size by now.");
// Adjust Mask to the right number of elements.
unsigned CurrMaskNumEls = Mask->getValueType(0).getVectorNumElements();
if (CurrMaskNumEls > ToMaskVT.getVectorNumElements()) {
MVT IdxTy = TLI.getVectorIdxTy(DAG.getDataLayout());
SDValue ZeroIdx = DAG.getConstant(0, SDLoc(Mask), IdxTy);
Mask = DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(Mask), ToMaskVT, Mask,
ZeroIdx);
} else if (CurrMaskNumEls < ToMaskVT.getVectorNumElements()) {
unsigned NumSubVecs = (ToMaskVT.getVectorNumElements() / CurrMaskNumEls);
EVT SubVT = Mask->getValueType(0);
SmallVector<SDValue, 16> SubConcatOps(NumSubVecs);
SubConcatOps[0] = Mask;
for (unsigned i = 1; i < NumSubVecs; ++i)
SubConcatOps[i] = DAG.getUNDEF(SubVT);
Mask =
DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(Mask), ToMaskVT, SubConcatOps);
}
assert((Mask->getValueType(0) == ToMaskVT) &&
"A mask of ToMaskVT should have been produced by now.");
return Mask;
}
// Get the target mask VT, and widen if needed.
EVT DAGTypeLegalizer::getSETCCWidenedResultTy(SDValue SetCC) {
assert(SetCC->getOpcode() == ISD::SETCC);
LLVMContext &Ctx = *DAG.getContext();
EVT MaskVT = getSetCCResultType(SetCC->getOperand(0).getValueType());
if (getTypeAction(MaskVT) == TargetLowering::TypeWidenVector)
MaskVT = TLI.getTypeToTransformTo(Ctx, MaskVT);
return MaskVT;
}
// This method tries to handle VSELECT and its mask by legalizing operands
// (which may require widening) and, if needed, adjusting the mask vector
// type to match that of the VSELECT. Without it, many cases end up
// scalarizing the SETCC, with many unnecessary instructions.
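// Sketch: for "v4f32 vselect (setcc v4f32 ...), %a, %b" widened to v8f32 on
// a target without legal i1 mask vectors, the SETCC is rebuilt through
// convertMask() into a v8i32 mask so the widened VSELECT can be used
// directly rather than scalarizing the compare.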
SDValue DAGTypeLegalizer::WidenVSELECTAndMask(SDNode *N) {
LLVMContext &Ctx = *DAG.getContext();
SDValue Cond = N->getOperand(0);
if (N->getOpcode() != ISD::VSELECT)
return SDValue();
if (Cond->getOpcode() != ISD::SETCC && !isLogicalMaskOp(Cond->getOpcode()))
return SDValue();
// If this is a split VSELECT that was already handled previously, do
// nothing.
if (Cond->getValueType(0).getScalarSizeInBits() != 1)
return SDValue();
EVT VSelVT = N->getValueType(0);
// Only handle vector types which are a power of 2.
if (!isPowerOf2_64(VSelVT.getSizeInBits()))
return SDValue();
// Don't touch if this will be scalarized.
EVT FinalVT = VSelVT;
while (getTypeAction(FinalVT) == TargetLowering::TypeSplitVector)
FinalVT = FinalVT.getHalfNumVectorElementsVT(Ctx);
if (FinalVT.getVectorNumElements() == 1)
return SDValue();
// If there is support for an i1 vector mask, don't touch.
if (Cond.getOpcode() == ISD::SETCC) {
EVT SetCCOpVT = Cond->getOperand(0).getValueType();
while (TLI.getTypeAction(Ctx, SetCCOpVT) != TargetLowering::TypeLegal)
SetCCOpVT = TLI.getTypeToTransformTo(Ctx, SetCCOpVT);
EVT SetCCResVT = getSetCCResultType(SetCCOpVT);
if (SetCCResVT.getScalarSizeInBits() == 1)
return SDValue();
}
// Get the VT and operands for VSELECT, and widen if needed.
SDValue VSelOp1 = N->getOperand(1);
SDValue VSelOp2 = N->getOperand(2);
if (getTypeAction(VSelVT) == TargetLowering::TypeWidenVector) {
VSelVT = TLI.getTypeToTransformTo(Ctx, VSelVT);
VSelOp1 = GetWidenedVector(VSelOp1);
VSelOp2 = GetWidenedVector(VSelOp2);
}
// The mask of the VSELECT should have integer elements.
EVT ToMaskVT = VSelVT;
if (!ToMaskVT.getScalarType().isInteger())
ToMaskVT = ToMaskVT.changeVectorElementTypeToInteger();
SDValue Mask;
if (Cond->getOpcode() == ISD::SETCC) {
EVT MaskVT = getSETCCWidenedResultTy(Cond);
Mask = convertMask(Cond, MaskVT, ToMaskVT);
} else if (isLogicalMaskOp(Cond->getOpcode()) &&
Cond->getOperand(0).getOpcode() == ISD::SETCC &&
Cond->getOperand(1).getOpcode() == ISD::SETCC) {
// Cond is (AND/OR/XOR (SETCC, SETCC))
SDValue SETCC0 = Cond->getOperand(0);
SDValue SETCC1 = Cond->getOperand(1);
EVT VT0 = getSETCCWidenedResultTy(SETCC0);
EVT VT1 = getSETCCWidenedResultTy(SETCC1);
unsigned ScalarBits0 = VT0.getScalarSizeInBits();
unsigned ScalarBits1 = VT1.getScalarSizeInBits();
unsigned ScalarBits_ToMask = ToMaskVT.getScalarSizeInBits();
EVT MaskVT;
// If the two SETCCs have different VTs, either extend/truncate one of
// them to the other "towards" ToMaskVT, or truncate one and extend the
// other to ToMaskVT.
if (ScalarBits0 != ScalarBits1) {
EVT NarrowVT = ((ScalarBits0 < ScalarBits1) ? VT0 : VT1);
EVT WideVT = ((NarrowVT == VT0) ? VT1 : VT0);
if (ScalarBits_ToMask >= WideVT.getScalarSizeInBits())
MaskVT = WideVT;
else if (ScalarBits_ToMask <= NarrowVT.getScalarSizeInBits())
MaskVT = NarrowVT;
else
MaskVT = ToMaskVT;
} else
// If the two SETCCs have the same VT, don't change it.
MaskVT = VT0;
// Make new SETCCs and logical nodes.
SETCC0 = convertMask(SETCC0, VT0, MaskVT);
SETCC1 = convertMask(SETCC1, VT1, MaskVT);
Cond = DAG.getNode(Cond->getOpcode(), SDLoc(Cond), MaskVT, SETCC0, SETCC1);
// Convert the logical op for VSELECT if needed.
Mask = convertMask(Cond, MaskVT, ToMaskVT);
} else
return SDValue();
return DAG.getNode(ISD::VSELECT, SDLoc(N), VSelVT, Mask, VSelOp1, VSelOp2);
}
SDValue DAGTypeLegalizer::WidenVecRes_SELECT(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SDValue Cond1 = N->getOperand(0);
EVT CondVT = Cond1.getValueType();
if (CondVT.isVector()) {
if (SDValue Res = WidenVSELECTAndMask(N))
return Res;
EVT CondEltVT = CondVT.getVectorElementType();
EVT CondWidenVT = EVT::getVectorVT(*DAG.getContext(),
CondEltVT, WidenNumElts);
if (getTypeAction(CondVT) == TargetLowering::TypeWidenVector)
Cond1 = GetWidenedVector(Cond1);
// If we have to split the condition there is no point in widening the
// select. This would result in a cycle of widening the select ->
// widening the condition operand -> splitting the condition operand ->
// splitting the select -> widening the select. Instead split this select
// further and widen the resulting type.
if (getTypeAction(CondVT) == TargetLowering::TypeSplitVector) {
SDValue SplitSelect = SplitVecOp_VSELECT(N, 0);
SDValue Res = ModifyToType(SplitSelect, WidenVT);
return Res;
}
if (Cond1.getValueType() != CondWidenVT)
Cond1 = ModifyToType(Cond1, CondWidenVT);
}
SDValue InOp1 = GetWidenedVector(N->getOperand(1));
SDValue InOp2 = GetWidenedVector(N->getOperand(2));
assert(InOp1.getValueType() == WidenVT && InOp2.getValueType() == WidenVT);
return DAG.getNode(N->getOpcode(), SDLoc(N),
WidenVT, Cond1, InOp1, InOp2);
}
SDValue DAGTypeLegalizer::WidenVecRes_SELECT_CC(SDNode *N) {
SDValue InOp1 = GetWidenedVector(N->getOperand(2));
SDValue InOp2 = GetWidenedVector(N->getOperand(3));
return DAG.getNode(ISD::SELECT_CC, SDLoc(N),
InOp1.getValueType(), N->getOperand(0),
N->getOperand(1), InOp1, InOp2, N->getOperand(4));
}
SDValue DAGTypeLegalizer::WidenVecRes_SETCC(SDNode *N) {
assert(N->getValueType(0).isVector() ==
N->getOperand(0).getValueType().isVector() &&
"Scalar/Vector type mismatch");
if (N->getValueType(0).isVector()) return WidenVecRes_VSETCC(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
return DAG.getNode(ISD::SETCC, SDLoc(N), WidenVT,
InOp1, InOp2, N->getOperand(2));
}
SDValue DAGTypeLegalizer::WidenVecRes_UNDEF(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
return DAG.getUNDEF(WidenVT);
}
SDValue DAGTypeLegalizer::WidenVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N) {
EVT VT = N->getValueType(0);
SDLoc dl(N);
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
unsigned NumElts = VT.getVectorNumElements();
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SDValue InOp1 = GetWidenedVector(N->getOperand(0));
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
// Adjust mask based on new input vector length.
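// For example, a v2i32 shuffle with mask <1,3> widened to v4i32 becomes
// <1,5,-1,-1>: operand-1 indices shift by (WidenNumElts - NumElts) and the
// tail is filled with undef (-1).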
SmallVector<int, 16> NewMask;
for (unsigned i = 0; i != NumElts; ++i) {
int Idx = N->getMaskElt(i);
if (Idx < (int)NumElts)
NewMask.push_back(Idx);
else
NewMask.push_back(Idx - NumElts + WidenNumElts);
}
for (unsigned i = NumElts; i != WidenNumElts; ++i)
NewMask.push_back(-1);
return DAG.getVectorShuffle(WidenVT, dl, InOp1, InOp2, NewMask);
}
SDValue DAGTypeLegalizer::WidenVecRes_VSETCC(SDNode *N) {
assert(N->getValueType(0).isVector() &&
N->getOperand(0).getValueType().isVector() &&
"Operands must be vectors");
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SDValue InOp1 = N->getOperand(0);
EVT InVT = InOp1.getValueType();
assert(InVT.isVector() && "cannot widen non-vector type");
EVT WidenInVT = EVT::getVectorVT(*DAG.getContext(),
InVT.getVectorElementType(), WidenNumElts);
// The input and output types often differ here, and it could be that while
// we'd prefer to widen the result type, the input operands have been split.
// In this case, we also need to split the result of this node as well.
if (getTypeAction(InVT) == TargetLowering::TypeSplitVector) {
SDValue SplitVSetCC = SplitVecOp_VSETCC(N);
SDValue Res = ModifyToType(SplitVSetCC, WidenVT);
return Res;
}
InOp1 = GetWidenedVector(InOp1);
SDValue InOp2 = GetWidenedVector(N->getOperand(1));
// Assume that the input and output will be widened appropriately. If not,
// we will have to unroll it at some point.
assert(InOp1.getValueType() == WidenInVT &&
InOp2.getValueType() == WidenInVT &&
"Input not widened to expected type!");
(void)WidenInVT;
return DAG.getNode(ISD::SETCC, SDLoc(N),
WidenVT, InOp1, InOp2, N->getOperand(2));
}
//===----------------------------------------------------------------------===//
// Widen Vector Operand
//===----------------------------------------------------------------------===//
bool DAGTypeLegalizer::WidenVectorOperand(SDNode *N, unsigned OpNo) {
DEBUG(dbgs() << "Widen node operand " << OpNo << ": ";
N->dump(&DAG);
dbgs() << "\n");
SDValue Res = SDValue();
// See if the target wants to custom widen this node.
if (CustomLowerNode(N, N->getOperand(OpNo).getValueType(), false))
return false;
switch (N->getOpcode()) {
default:
#ifndef NDEBUG
dbgs() << "WidenVectorOperand op #" << OpNo << ": ";
N->dump(&DAG);
dbgs() << "\n";
#endif
llvm_unreachable("Do not know how to widen this operator's operand!");
case ISD::BITCAST: Res = WidenVecOp_BITCAST(N); break;
case ISD::CONCAT_VECTORS: Res = WidenVecOp_CONCAT_VECTORS(N); break;
case ISD::EXTRACT_SUBVECTOR: Res = WidenVecOp_EXTRACT_SUBVECTOR(N); break;
case ISD::EXTRACT_VECTOR_ELT: Res = WidenVecOp_EXTRACT_VECTOR_ELT(N); break;
case ISD::STORE: Res = WidenVecOp_STORE(N); break;
case ISD::MSTORE: Res = WidenVecOp_MSTORE(N, OpNo); break;
case ISD::MSCATTER: Res = WidenVecOp_MSCATTER(N, OpNo); break;
case ISD::SETCC: Res = WidenVecOp_SETCC(N); break;
case ISD::FCOPYSIGN: Res = WidenVecOp_FCOPYSIGN(N); break;
case ISD::ANY_EXTEND:
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
Res = WidenVecOp_EXTEND(N);
break;
case ISD::FP_EXTEND:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:
case ISD::TRUNCATE:
Res = WidenVecOp_Convert(N);
break;
}
// If Res is null, the sub-method took care of registering the result.
if (!Res.getNode()) return false;
// If the result is N, the sub-method updated N in place. Tell the legalizer
// core about this.
if (Res.getNode() == N)
return true;
assert(Res.getValueType() == N->getValueType(0) && N->getNumValues() == 1 &&
"Invalid operand expansion");
ReplaceValueWith(SDValue(N, 0), Res);
return false;
}
SDValue DAGTypeLegalizer::WidenVecOp_EXTEND(SDNode *N) {
SDLoc DL(N);
EVT VT = N->getValueType(0);
SDValue InOp = N->getOperand(0);
// If some legalization strategy other than widening is used on the operand,
// we can't safely assume that just extending the low lanes is the correct
// transformation.
if (getTypeAction(InOp.getValueType()) != TargetLowering::TypeWidenVector)
return WidenVecOp_Convert(N);
InOp = GetWidenedVector(InOp);
assert(VT.getVectorNumElements() <
InOp.getValueType().getVectorNumElements() &&
"Input wasn't widened!");
// We may need to further widen the operand until it has the same total
// vector size as the result.
EVT InVT = InOp.getValueType();
if (InVT.getSizeInBits() != VT.getSizeInBits()) {
EVT InEltVT = InVT.getVectorElementType();
for (int i = MVT::FIRST_VECTOR_VALUETYPE, e = MVT::LAST_VECTOR_VALUETYPE; i < e; ++i) {
EVT FixedVT = (MVT::SimpleValueType)i;
EVT FixedEltVT = FixedVT.getVectorElementType();
if (TLI.isTypeLegal(FixedVT) &&
FixedVT.getSizeInBits() == VT.getSizeInBits() &&
FixedEltVT == InEltVT) {
assert(FixedVT.getVectorNumElements() >= VT.getVectorNumElements() &&
"Not enough elements in the fixed type for the operand!");
assert(FixedVT.getVectorNumElements() != InVT.getVectorNumElements() &&
"We can't have the same type as we started with!");
if (FixedVT.getVectorNumElements() > InVT.getVectorNumElements())
InOp = DAG.getNode(
ISD::INSERT_SUBVECTOR, DL, FixedVT, DAG.getUNDEF(FixedVT), InOp,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
else
InOp = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, DL, FixedVT, InOp,
DAG.getConstant(0, DL, TLI.getVectorIdxTy(DAG.getDataLayout())));
break;
}
}
InVT = InOp.getValueType();
if (InVT.getSizeInBits() != VT.getSizeInBits())
// We couldn't find a legal vector type that was a widening of the input
// and could be extended in-register to the result type, so we have to
// scalarize.
return WidenVecOp_Convert(N);
}
// Use special DAG nodes to represent the operation of extending the
// low lanes.
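// For example, sign-extending the low 4 lanes of a widened v8i16 operand to
// produce v4i32 maps to SIGN_EXTEND_VECTOR_INREG.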
switch (N->getOpcode()) {
default:
llvm_unreachable("Extend legalization on on extend operation!");
case ISD::ANY_EXTEND:
return DAG.getAnyExtendVectorInReg(InOp, DL, VT);
case ISD::SIGN_EXTEND:
return DAG.getSignExtendVectorInReg(InOp, DL, VT);
case ISD::ZERO_EXTEND:
return DAG.getZeroExtendVectorInReg(InOp, DL, VT);
}
}
SDValue DAGTypeLegalizer::WidenVecOp_FCOPYSIGN(SDNode *N) {
// The result (and first input) is legal, but the second input is illegal.
// We can't do much to fix that, so just unroll and let the extracts off of
// the second input be widened as needed later.
return DAG.UnrollVectorOp(N);
}
SDValue DAGTypeLegalizer::WidenVecOp_Convert(SDNode *N) {
// Since the result is legal and the input is illegal, it is unlikely that we
// can fix the input to a legal type so unroll the convert into some scalar
// code and create a nasty build vector.
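// For example, a TRUNCATE of a widened input becomes one EXTRACT_VECTOR_ELT
// plus one scalar TRUNCATE per result element, feeding a BUILD_VECTOR.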
EVT VT = N->getValueType(0);
EVT EltVT = VT.getVectorElementType();
SDLoc dl(N);
unsigned NumElts = VT.getVectorNumElements();
SDValue InOp = N->getOperand(0);
if (getTypeAction(InOp.getValueType()) == TargetLowering::TypeWidenVector)
InOp = GetWidenedVector(InOp);
EVT InVT = InOp.getValueType();
EVT InEltVT = InVT.getVectorElementType();
unsigned Opcode = N->getOpcode();
SmallVector<SDValue, 16> Ops(NumElts);
for (unsigned i=0; i < NumElts; ++i)
Ops[i] = DAG.getNode(
Opcode, dl, EltVT,
DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, InEltVT, InOp,
DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout()))));
return DAG.getBuildVector(VT, dl, Ops);
}
SDValue DAGTypeLegalizer::WidenVecOp_BITCAST(SDNode *N) {
EVT VT = N->getValueType(0);
SDValue InOp = GetWidenedVector(N->getOperand(0));
EVT InWidenVT = InOp.getValueType();
SDLoc dl(N);
// Check if we can convert between two legal vector types and extract.
unsigned InWidenSize = InWidenVT.getSizeInBits();
unsigned Size = VT.getSizeInBits();
// x86mmx is not an acceptable vector element type, so don't try.
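// For example, bitcasting a widened v4i32 (128 bits) to i64 becomes a
// bitcast to v2i64 followed by an extract of element 0.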
if (InWidenSize % Size == 0 && !VT.isVector() && VT != MVT::x86mmx) {
unsigned NewNumElts = InWidenSize / Size;
EVT NewVT = EVT::getVectorVT(*DAG.getContext(), VT, NewNumElts);
if (TLI.isTypeLegal(NewVT)) {
SDValue BitOp = DAG.getNode(ISD::BITCAST, dl, NewVT, InOp);
return DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, VT, BitOp,
DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
}
return CreateStackStoreLoad(InOp, VT);
}
SDValue DAGTypeLegalizer::WidenVecOp_CONCAT_VECTORS(SDNode *N) {
// If the input vector is not legal, it is likely that we will not find a
// legal vector of the same size. Replace the CONCAT_VECTORS with a nasty
// build vector.
EVT VT = N->getValueType(0);
EVT EltVT = VT.getVectorElementType();
SDLoc dl(N);
unsigned NumElts = VT.getVectorNumElements();
SmallVector<SDValue, 16> Ops(NumElts);
EVT InVT = N->getOperand(0).getValueType();
unsigned NumInElts = InVT.getVectorNumElements();
unsigned Idx = 0;
unsigned NumOperands = N->getNumOperands();
for (unsigned i=0; i < NumOperands; ++i) {
SDValue InOp = N->getOperand(i);
if (getTypeAction(InOp.getValueType()) == TargetLowering::TypeWidenVector)
InOp = GetWidenedVector(InOp);
for (unsigned j=0; j < NumInElts; ++j)
Ops[Idx++] = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, EltVT, InOp,
DAG.getConstant(j, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
return DAG.getBuildVector(VT, dl, Ops);
}
SDValue DAGTypeLegalizer::WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N) {
SDValue InOp = GetWidenedVector(N->getOperand(0));
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(N),
N->getValueType(0), InOp, N->getOperand(1));
}
SDValue DAGTypeLegalizer::WidenVecOp_EXTRACT_VECTOR_ELT(SDNode *N) {
SDValue InOp = GetWidenedVector(N->getOperand(0));
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SDLoc(N),
N->getValueType(0), InOp, N->getOperand(1));
}
SDValue DAGTypeLegalizer::WidenVecOp_STORE(SDNode *N) {
// We have to widen the value, but we only want to store the original
// vector type.
StoreSDNode *ST = cast<StoreSDNode>(N);
SmallVector<SDValue, 16> StChain;
if (ST->isTruncatingStore())
GenWidenVectorTruncStores(StChain, ST);
else
GenWidenVectorStores(StChain, ST);
if (StChain.size() == 1)
return StChain[0];
else
return DAG.getNode(ISD::TokenFactor, SDLoc(ST), MVT::Other, StChain);
}
SDValue DAGTypeLegalizer::WidenVecOp_MSTORE(SDNode *N, unsigned OpNo) {
MaskedStoreSDNode *MST = cast<MaskedStoreSDNode>(N);
SDValue Mask = MST->getMask();
EVT MaskVT = Mask.getValueType();
SDValue StVal = MST->getValue();
// Widen the value
SDValue WideVal = GetWidenedVector(StVal);
SDLoc dl(N);
if (OpNo == 2 || getTypeAction(MaskVT) == TargetLowering::TypeWidenVector)
Mask = GetWidenedVector(Mask);
else {
// The mask should be widened as well.
EVT BoolVT = getSetCCResultType(WideVal.getValueType());
// We can't use ModifyToType() because we should fill the mask with
// zeroes.
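// For example, widening a v2i1 mask to v8i1 concatenates the original mask
// with three zero v2i1 vectors, so the extra lanes never store.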
unsigned WidenNumElts = BoolVT.getVectorNumElements();
unsigned MaskNumElts = MaskVT.getVectorNumElements();
unsigned NumConcat = WidenNumElts / MaskNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
SDValue ZeroVal = DAG.getConstant(0, dl, MaskVT);
Ops[0] = Mask;
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = ZeroVal;
Mask = DAG.getNode(ISD::CONCAT_VECTORS, dl, BoolVT, Ops);
}
assert(Mask.getValueType().getVectorNumElements() ==
WideVal.getValueType().getVectorNumElements() &&
"Mask and data vectors should have the same number of elements");
return DAG.getMaskedStore(MST->getChain(), dl, WideVal, MST->getBasePtr(),
Mask, MST->getMemoryVT(), MST->getMemOperand(),
false, MST->isCompressingStore());
}
SDValue DAGTypeLegalizer::WidenVecOp_MSCATTER(SDNode *N, unsigned OpNo) {
assert(OpNo == 1 && "Can widen only data operand of mscatter");
MaskedScatterSDNode *MSC = cast<MaskedScatterSDNode>(N);
SDValue DataOp = MSC->getValue();
SDValue Mask = MSC->getMask();
// Widen the value.
SDValue WideVal = GetWidenedVector(DataOp);
EVT WideVT = WideVal.getValueType();
unsigned NumElts = WideVal.getValueType().getVectorNumElements();
SDLoc dl(N);
// The mask should be widened as well.
Mask = WidenTargetBoolean(Mask, WideVT, true);
// Widen index.
SDValue Index = MSC->getIndex();
EVT WideIndexVT = EVT::getVectorVT(*DAG.getContext(),
Index.getValueType().getScalarType(),
NumElts);
Index = ModifyToType(Index, WideIndexVT);
SDValue Ops[] = {MSC->getChain(), WideVal, Mask, MSC->getBasePtr(), Index};
return DAG.getMaskedScatter(DAG.getVTList(MVT::Other),
MSC->getMemoryVT(), dl, Ops,
MSC->getMemOperand());
}
SDValue DAGTypeLegalizer::WidenVecOp_SETCC(SDNode *N) {
SDValue InOp0 = GetWidenedVector(N->getOperand(0));
SDValue InOp1 = GetWidenedVector(N->getOperand(1));
SDLoc dl(N);
// WARNING: In this code we widen the compare instruction with garbage.
// This garbage may contain denormal floats which may be slow. Is this a real
// concern? Should we zero the unused lanes if this is a float compare?
// Get a new SETCC node to compare the newly widened operands.
// Only some of the compared elements are legal.
EVT SVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
InOp0.getValueType());
SDValue WideSETCC = DAG.getNode(ISD::SETCC, SDLoc(N),
SVT, InOp0, InOp1, N->getOperand(2));
// Extract the needed results from the result vector.
EVT ResVT = EVT::getVectorVT(*DAG.getContext(),
SVT.getVectorElementType(),
N->getValueType(0).getVectorNumElements());
SDValue CC = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, ResVT, WideSETCC,
DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
return PromoteTargetBoolean(CC, N->getValueType(0));
}
//===----------------------------------------------------------------------===//
// Vector Widening Utilities
//===----------------------------------------------------------------------===//
// Utility function to find the type to chop up a widened vector for
// load/store.
// TLI: Target lowering used to determine legal types.
// Width: Width left to load/store.
// WidenVT: The widened vector type to load to / store from.
// Align: If 0, don't allow use of a wider type.
// WidenEx: If Align is not 0, the additional amount we can load/store from.
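// For example, with 32 bits left toward a v4i32 WidenVT, no over-read
// allowed (Align == 0), and no legal 32-bit vector type, this falls back to
// the scalar type i32.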
static EVT FindMemType(SelectionDAG& DAG, const TargetLowering &TLI,
unsigned Width, EVT WidenVT,
unsigned Align = 0, unsigned WidenEx = 0) {
EVT WidenEltVT = WidenVT.getVectorElementType();
unsigned WidenWidth = WidenVT.getSizeInBits();
unsigned WidenEltWidth = WidenEltVT.getSizeInBits();
unsigned AlignInBits = Align*8;
// If we have one element to load/store, return it.
EVT RetVT = WidenEltVT;
if (Width == WidenEltWidth)
return RetVT;
// See if there is a larger legal integer type than the element type to
// load/store.
unsigned VT;
for (VT = (unsigned)MVT::LAST_INTEGER_VALUETYPE;
VT >= (unsigned)MVT::FIRST_INTEGER_VALUETYPE; --VT) {
EVT MemVT((MVT::SimpleValueType) VT);
unsigned MemVTWidth = MemVT.getSizeInBits();
if (MemVT.getSizeInBits() <= WidenEltWidth)
break;
auto Action = TLI.getTypeAction(*DAG.getContext(), MemVT);
if ((Action == TargetLowering::TypeLegal ||
Action == TargetLowering::TypePromoteInteger) &&
(WidenWidth % MemVTWidth) == 0 &&
isPowerOf2_32(WidenWidth / MemVTWidth) &&
(MemVTWidth <= Width ||
(Align!=0 && MemVTWidth<=AlignInBits && MemVTWidth<=Width+WidenEx))) {
RetVT = MemVT;
break;
}
}
// See if there is a larger vector type to load/store that has the same
// vector element type and whose width evenly divides the width of WidenVT.
for (VT = (unsigned)MVT::LAST_VECTOR_VALUETYPE;
VT >= (unsigned)MVT::FIRST_VECTOR_VALUETYPE; --VT) {
EVT MemVT = (MVT::SimpleValueType) VT;
unsigned MemVTWidth = MemVT.getSizeInBits();
if (TLI.isTypeLegal(MemVT) && WidenEltVT == MemVT.getVectorElementType() &&
(WidenWidth % MemVTWidth) == 0 &&
isPowerOf2_32(WidenWidth / MemVTWidth) &&
(MemVTWidth <= Width ||
(Align!=0 && MemVTWidth<=AlignInBits && MemVTWidth<=Width+WidenEx))) {
if (RetVT.getSizeInBits() < MemVTWidth || MemVT == WidenVT)
return MemVT;
}
}
return RetVT;
}
// Builds a vector from scalar loads.
// VecTy: Resulting vector type.
// LdOps: Loads to build the vector from.
// [Start,End): The range of loads to use.
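// For example, given i16 scalar loads destined for a v8i16 result, the
// first load seeds a SCALAR_TO_VECTOR and the rest are inserted with
// INSERT_VECTOR_ELT; a load of a different width triggers a bitcast of the
// partial vector and an index rescale.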
static SDValue BuildVectorFromScalar(SelectionDAG& DAG, EVT VecTy,
SmallVectorImpl<SDValue> &LdOps,
unsigned Start, unsigned End) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(LdOps[Start]);
EVT LdTy = LdOps[Start].getValueType();
unsigned Width = VecTy.getSizeInBits();
unsigned NumElts = Width / LdTy.getSizeInBits();
EVT NewVecVT = EVT::getVectorVT(*DAG.getContext(), LdTy, NumElts);
unsigned Idx = 1;
SDValue VecOp = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, NewVecVT,LdOps[Start]);
for (unsigned i = Start + 1; i != End; ++i) {
EVT NewLdTy = LdOps[i].getValueType();
if (NewLdTy != LdTy) {
NumElts = Width / NewLdTy.getSizeInBits();
NewVecVT = EVT::getVectorVT(*DAG.getContext(), NewLdTy, NumElts);
VecOp = DAG.getNode(ISD::BITCAST, dl, NewVecVT, VecOp);
// Readjust the insert position based on the new load type.
Idx = Idx * LdTy.getSizeInBits() / NewLdTy.getSizeInBits();
LdTy = NewLdTy;
}
VecOp = DAG.getNode(
ISD::INSERT_VECTOR_ELT, dl, NewVecVT, VecOp, LdOps[i],
DAG.getConstant(Idx++, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
}
return DAG.getNode(ISD::BITCAST, dl, VecTy, VecOp);
}
SDValue DAGTypeLegalizer::GenWidenVectorLoads(SmallVectorImpl<SDValue> &LdChain,
LoadSDNode *LD) {
// The strategy assumes that we can efficiently load power-of-two widths.
// The routine chops the vector into the largest vector loads with the same
// element type or scalar loads and then recombines them into the widened
// vector type.
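// For example, loading 96 bits for a v3i32 widened to v4i32 may be done as
// a 64-bit piece followed by a 32-bit piece, which are recombined below.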
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(),LD->getValueType(0));
unsigned WidenWidth = WidenVT.getSizeInBits();
EVT LdVT = LD->getMemoryVT();
SDLoc dl(LD);
assert(LdVT.isVector() && WidenVT.isVector());
assert(LdVT.getVectorElementType() == WidenVT.getVectorElementType());
// Load information
SDValue Chain = LD->getChain();
SDValue BasePtr = LD->getBasePtr();
unsigned Align = LD->getAlignment();
MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
AAMDNodes AAInfo = LD->getAAInfo();
int LdWidth = LdVT.getSizeInBits();
int WidthDiff = WidenWidth - LdWidth;
unsigned LdAlign = LD->isVolatile() ? 0 : Align; // Allow wider loads.
// Find the vector type that can load from.
EVT NewVT = FindMemType(DAG, TLI, LdWidth, WidenVT, LdAlign, WidthDiff);
int NewVTWidth = NewVT.getSizeInBits();
SDValue LdOp = DAG.getLoad(NewVT, dl, Chain, BasePtr, LD->getPointerInfo(),
Align, MMOFlags, AAInfo);
LdChain.push_back(LdOp.getValue(1));
// Check if we can load the element with one instruction.
if (LdWidth <= NewVTWidth) {
if (!NewVT.isVector()) {
unsigned NumElts = WidenWidth / NewVTWidth;
EVT NewVecVT = EVT::getVectorVT(*DAG.getContext(), NewVT, NumElts);
SDValue VecOp = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, NewVecVT, LdOp);
return DAG.getNode(ISD::BITCAST, dl, WidenVT, VecOp);
}
if (NewVT == WidenVT)
return LdOp;
assert(WidenWidth % NewVTWidth == 0);
unsigned NumConcat = WidenWidth / NewVTWidth;
SmallVector<SDValue, 16> ConcatOps(NumConcat);
SDValue UndefVal = DAG.getUNDEF(NewVT);
ConcatOps[0] = LdOp;
for (unsigned i = 1; i != NumConcat; ++i)
ConcatOps[i] = UndefVal;
return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT, ConcatOps);
}
// Load the vector using multiple loads, from the largest vector type down
// to scalar.
SmallVector<SDValue, 16> LdOps;
LdOps.push_back(LdOp);
LdWidth -= NewVTWidth;
unsigned Offset = 0;
while (LdWidth > 0) {
unsigned Increment = NewVTWidth / 8;
Offset += Increment;
BasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
DAG.getConstant(Increment, dl, BasePtr.getValueType()));
SDValue L;
if (LdWidth < NewVTWidth) {
// The current type we are using is too large. Find a better size.
NewVT = FindMemType(DAG, TLI, LdWidth, WidenVT, LdAlign, WidthDiff);
NewVTWidth = NewVT.getSizeInBits();
L = DAG.getLoad(NewVT, dl, Chain, BasePtr,
LD->getPointerInfo().getWithOffset(Offset),
MinAlign(Align, Increment), MMOFlags, AAInfo);
LdChain.push_back(L.getValue(1));
if (L->getValueType(0).isVector() && NewVTWidth >= LdWidth) {
// Later code assumes the vector loads produced will be mergeable, so we
// must pad the final entry up to the previous width. Scalars are
// combined separately.
SmallVector<SDValue, 16> Loads;
Loads.push_back(L);
unsigned size = L->getValueSizeInBits(0);
while (size < LdOp->getValueSizeInBits(0)) {
Loads.push_back(DAG.getUNDEF(L->getValueType(0)));
size += L->getValueSizeInBits(0);
}
L = DAG.getNode(ISD::CONCAT_VECTORS, dl, LdOp->getValueType(0), Loads);
}
} else {
L = DAG.getLoad(NewVT, dl, Chain, BasePtr,
LD->getPointerInfo().getWithOffset(Offset),
MinAlign(Align, Increment), MMOFlags, AAInfo);
LdChain.push_back(L.getValue(1));
}
LdOps.push_back(L);
LdWidth -= NewVTWidth;
}
// Build the vector from the load operations.
unsigned End = LdOps.size();
if (!LdOps[0].getValueType().isVector())
// All the loads are scalar loads.
return BuildVectorFromScalar(DAG, WidenVT, LdOps, 0, End);
// If the loads include vector loads, build the result using CONCAT_VECTORS.
// All of the vector loads are power-of-2 sized, and the scalar loads can be
// combined to make a power-of-2 vector.
SmallVector<SDValue, 16> ConcatOps(End);
int i = End - 1;
int Idx = End;
EVT LdTy = LdOps[i].getValueType();
// First, combine the scalar loads to a vector.
if (!LdTy.isVector()) {
for (--i; i >= 0; --i) {
LdTy = LdOps[i].getValueType();
if (LdTy.isVector())
break;
}
ConcatOps[--Idx] = BuildVectorFromScalar(DAG, LdTy, LdOps, i + 1, End);
}
ConcatOps[--Idx] = LdOps[i];
for (--i; i >= 0; --i) {
EVT NewLdTy = LdOps[i].getValueType();
if (NewLdTy != LdTy) {
// Create a larger vector.
ConcatOps[End-1] = DAG.getNode(ISD::CONCAT_VECTORS, dl, NewLdTy,
makeArrayRef(&ConcatOps[Idx], End - Idx));
Idx = End - 1;
LdTy = NewLdTy;
}
ConcatOps[--Idx] = LdOps[i];
}
if (WidenWidth == LdTy.getSizeInBits() * (End - Idx))
return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT,
makeArrayRef(&ConcatOps[Idx], End - Idx));
// We need to fill the rest with undefs to build the vector.
unsigned NumOps = WidenWidth / LdTy.getSizeInBits();
SmallVector<SDValue, 16> WidenOps(NumOps);
SDValue UndefVal = DAG.getUNDEF(LdTy);
{
unsigned i = 0;
for (; i != End-Idx; ++i)
WidenOps[i] = ConcatOps[Idx+i];
for (; i != NumOps; ++i)
WidenOps[i] = UndefVal;
}
return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT, WidenOps);
}
SDValue
DAGTypeLegalizer::GenWidenVectorExtLoads(SmallVectorImpl<SDValue> &LdChain,
LoadSDNode *LD,
ISD::LoadExtType ExtType) {
// For extension loads, it may not be more efficient to chop up the vector
// and then extend it. Instead, we unroll the load and build a new vector.
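// For example, a sextload of v2i8 producing a widened v4i32 result becomes
// two scalar i8 -> i32 sextloads plus two undef elements in a BUILD_VECTOR.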
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(),LD->getValueType(0));
EVT LdVT = LD->getMemoryVT();
SDLoc dl(LD);
assert(LdVT.isVector() && WidenVT.isVector());
// Load information
SDValue Chain = LD->getChain();
SDValue BasePtr = LD->getBasePtr();
unsigned Align = LD->getAlignment();
MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
AAMDNodes AAInfo = LD->getAAInfo();
EVT EltVT = WidenVT.getVectorElementType();
EVT LdEltVT = LdVT.getVectorElementType();
unsigned NumElts = LdVT.getVectorNumElements();
// Load each element and widen.
unsigned WidenNumElts = WidenVT.getVectorNumElements();
SmallVector<SDValue, 16> Ops(WidenNumElts);
unsigned Increment = LdEltVT.getSizeInBits() / 8;
Ops[0] =
DAG.getExtLoad(ExtType, dl, EltVT, Chain, BasePtr, LD->getPointerInfo(),
LdEltVT, Align, MMOFlags, AAInfo);
LdChain.push_back(Ops[0].getValue(1));
unsigned i = 0, Offset = Increment;
for (i=1; i < NumElts; ++i, Offset += Increment) {
SDValue NewBasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(),
BasePtr,
DAG.getConstant(Offset, dl,
BasePtr.getValueType()));
Ops[i] = DAG.getExtLoad(ExtType, dl, EltVT, Chain, NewBasePtr,
LD->getPointerInfo().getWithOffset(Offset), LdEltVT,
Align, MMOFlags, AAInfo);
LdChain.push_back(Ops[i].getValue(1));
}
// Fill the rest with undefs.
SDValue UndefVal = DAG.getUNDEF(EltVT);
for (; i != WidenNumElts; ++i)
Ops[i] = UndefVal;
return DAG.getBuildVector(WidenVT, dl, Ops);
}
void DAGTypeLegalizer::GenWidenVectorStores(SmallVectorImpl<SDValue> &StChain,
StoreSDNode *ST) {
// The strategy assumes that we can efficiently store power-of-two widths.
// The routine chops the vector into the largest vector stores with the same
// element type or scalar stores.
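// For example, storing a v3i32 value widened to v4i32 emits a v2i32 store
// of elements 0-1 followed by an i32 store of element 2 (when those types
// are legal).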
SDValue Chain = ST->getChain();
SDValue BasePtr = ST->getBasePtr();
unsigned Align = ST->getAlignment();
MachineMemOperand::Flags MMOFlags = ST->getMemOperand()->getFlags();
AAMDNodes AAInfo = ST->getAAInfo();
SDValue ValOp = GetWidenedVector(ST->getValue());
SDLoc dl(ST);
EVT StVT = ST->getMemoryVT();
unsigned StWidth = StVT.getSizeInBits();
EVT ValVT = ValOp.getValueType();
unsigned ValWidth = ValVT.getSizeInBits();
EVT ValEltVT = ValVT.getVectorElementType();
unsigned ValEltWidth = ValEltVT.getSizeInBits();
assert(StVT.getVectorElementType() == ValEltVT);
int Idx = 0; // current index to store
unsigned Offset = 0; // offset from base to store
while (StWidth != 0) {
// Find the largest vector type we can store with.
EVT NewVT = FindMemType(DAG, TLI, StWidth, ValVT);
unsigned NewVTWidth = NewVT.getSizeInBits();
unsigned Increment = NewVTWidth / 8;
if (NewVT.isVector()) {
unsigned NumVTElts = NewVT.getVectorNumElements();
do {
SDValue EOp = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, NewVT, ValOp,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
StChain.push_back(DAG.getStore(
Chain, dl, EOp, BasePtr, ST->getPointerInfo().getWithOffset(Offset),
MinAlign(Align, Offset), MMOFlags, AAInfo));
StWidth -= NewVTWidth;
Offset += Increment;
Idx += NumVTElts;
BasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
DAG.getConstant(Increment, dl,
BasePtr.getValueType()));
} while (StWidth != 0 && StWidth >= NewVTWidth);
} else {
// Cast the vector to the scalar type we can store.
unsigned NumElts = ValWidth / NewVTWidth;
EVT NewVecVT = EVT::getVectorVT(*DAG.getContext(), NewVT, NumElts);
SDValue VecOp = DAG.getNode(ISD::BITCAST, dl, NewVecVT, ValOp);
// Readjust index position based on new vector type.
Idx = Idx * ValEltWidth / NewVTWidth;
do {
SDValue EOp = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, NewVT, VecOp,
DAG.getConstant(Idx++, dl,
TLI.getVectorIdxTy(DAG.getDataLayout())));
StChain.push_back(DAG.getStore(
Chain, dl, EOp, BasePtr, ST->getPointerInfo().getWithOffset(Offset),
MinAlign(Align, Offset), MMOFlags, AAInfo));
StWidth -= NewVTWidth;
Offset += Increment;
BasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
DAG.getConstant(Increment, dl,
BasePtr.getValueType()));
} while (StWidth != 0 && StWidth >= NewVTWidth);
// Restore the index to be relative to the original widened element type.
Idx = Idx * NewVTWidth / ValEltWidth;
}
}
}
void
DAGTypeLegalizer::GenWidenVectorTruncStores(SmallVectorImpl<SDValue> &StChain,
StoreSDNode *ST) {
// For truncating stores, it may not be more efficient to truncate the vector
// and then store it. Instead, we extract each element and then store it.
SDValue Chain = ST->getChain();
SDValue BasePtr = ST->getBasePtr();
unsigned Align = ST->getAlignment();
MachineMemOperand::Flags MMOFlags = ST->getMemOperand()->getFlags();
AAMDNodes AAInfo = ST->getAAInfo();
SDValue ValOp = GetWidenedVector(ST->getValue());
SDLoc dl(ST);
EVT StVT = ST->getMemoryVT();
EVT ValVT = ValOp.getValueType();
// It must be true that the widened vector type is bigger than the type we
// need to store.
assert(StVT.isVector() && ValOp.getValueType().isVector());
assert(StVT.bitsLT(ValOp.getValueType()));
// For truncating stores, we cannot play the trick of chopping up legal
// vector types and bitcasting to the right type. Instead, we unroll the
// store.
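// For example, a truncating store of 4 x i32 elements as v4i8 extracts each
// element from the widened value and emits four scalar truncating stores.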
EVT StEltVT = StVT.getVectorElementType();
EVT ValEltVT = ValVT.getVectorElementType();
unsigned Increment = ValEltVT.getSizeInBits() / 8;
unsigned NumElts = StVT.getVectorNumElements();
SDValue EOp = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, ValEltVT, ValOp,
DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
StChain.push_back(DAG.getTruncStore(Chain, dl, EOp, BasePtr,
ST->getPointerInfo(), StEltVT, Align,
MMOFlags, AAInfo));
unsigned Offset = Increment;
for (unsigned i=1; i < NumElts; ++i, Offset += Increment) {
SDValue NewBasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(),
BasePtr,
DAG.getConstant(Offset, dl,
BasePtr.getValueType()));
SDValue EOp = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, ValEltVT, ValOp,
DAG.getConstant(i, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
StChain.push_back(DAG.getTruncStore(
Chain, dl, EOp, NewBasePtr, ST->getPointerInfo().getWithOffset(Offset),
StEltVT, MinAlign(Align, Offset), MMOFlags, AAInfo));
}
}
/// Modifies a vector input (widens or narrows) to a vector of NVT. The
/// input vector must have the same element type as NVT.
/// FillWithZeroes specifies that the vector should be widened with zeroes.
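/// For example, widening v2i32 to v8i32 concatenates the input with three
/// undef (or zero-filled, when FillWithZeroes is set) v2i32 vectors.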
SDValue DAGTypeLegalizer::ModifyToType(SDValue InOp, EVT NVT,
bool FillWithZeroes) {
// Note that InOp might have been widened so it might already have
// the right width or it might need to be narrowed.
EVT InVT = InOp.getValueType();
assert(InVT.getVectorElementType() == NVT.getVectorElementType() &&
"input and widen element type must match");
SDLoc dl(InOp);
// Check if InOp already has the right width.
if (InVT == NVT)
return InOp;
unsigned InNumElts = InVT.getVectorNumElements();
unsigned WidenNumElts = NVT.getVectorNumElements();
if (WidenNumElts > InNumElts && WidenNumElts % InNumElts == 0) {
unsigned NumConcat = WidenNumElts / InNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, InVT) :
DAG.getUNDEF(InVT);
Ops[0] = InOp;
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = FillVal;
return DAG.getNode(ISD::CONCAT_VECTORS, dl, NVT, Ops);
}
if (WidenNumElts < InNumElts && InNumElts % WidenNumElts)
return DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, NVT, InOp,
DAG.getConstant(0, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
// Fall back to extract and build.
SmallVector<SDValue, 16> Ops(WidenNumElts);
EVT EltVT = NVT.getVectorElementType();
unsigned MinNumElts = std::min(WidenNumElts, InNumElts);
unsigned Idx;
for (Idx = 0; Idx < MinNumElts; ++Idx)
Ops[Idx] = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, dl, EltVT, InOp,
DAG.getConstant(Idx, dl, TLI.getVectorIdxTy(DAG.getDataLayout())));
SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, EltVT) :
DAG.getUNDEF(EltVT);
for ( ; Idx < WidenNumElts; ++Idx)
Ops[Idx] = FillVal;
return DAG.getBuildVector(NVT, dl, Ops);
}
diff --git a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 0ff154784f68..16f425dc7969 100644
--- a/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -1,7958 +1,7961 @@
//===- SelectionDAG.cpp - Implement the SelectionDAG data structures ------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This implements the SelectionDAG class.
//
//===----------------------------------------------------------------------===//
#include "llvm/CodeGen/SelectionDAG.h"
#include "SDNodeDbgValue.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/APSInt.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/None.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Triple.h"
#include "llvm/ADT/Twine.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineConstantPool.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/CodeGen/MachineValueType.h"
#include "llvm/CodeGen/RuntimeLibcalls.h"
#include "llvm/CodeGen/SelectionDAGAddressAnalysis.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/SelectionDAGTargetInfo.h"
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/IR/Constant.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Support/Mutex.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetLowering.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/Target/TargetRegisterInfo.h"
#include "llvm/Target/TargetSubtargetInfo.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <set>
#include <string>
#include <utility>
#include <vector>
using namespace llvm;
/// makeVTList - Return an instance of the SDVTList struct initialized with the
/// specified members.
static SDVTList makeVTList(const EVT *VTs, unsigned NumVTs) {
SDVTList Res = {VTs, NumVTs};
return Res;
}
// Default null implementations of the callbacks.
void SelectionDAG::DAGUpdateListener::NodeDeleted(SDNode*, SDNode*) {}
void SelectionDAG::DAGUpdateListener::NodeUpdated(SDNode*) {}
//===----------------------------------------------------------------------===//
// ConstantFPSDNode Class
//===----------------------------------------------------------------------===//
/// isExactlyValue - We don't rely on operator== working on double values, as
/// it returns true for things that are clearly not equal, like -0.0 and 0.0.
/// As such, this method can be used to do an exact bit-for-bit comparison of
/// two floating point values.
bool ConstantFPSDNode::isExactlyValue(const APFloat& V) const {
return getValueAPF().bitwiseIsEqual(V);
}
bool ConstantFPSDNode::isValueValidForType(EVT VT,
const APFloat& Val) {
assert(VT.isFloatingPoint() && "Can only convert between FP types");
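// For example, 1.0 is valid for MVT::f16, while 1e30 is not: converting it
// to half overflows and sets losesInfo.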
// convert modifies in place, so make a copy.
APFloat Val2 = APFloat(Val);
bool losesInfo;
(void) Val2.convert(SelectionDAG::EVTToAPFloatSemantics(VT),
APFloat::rmNearestTiesToEven,
&losesInfo);
return !losesInfo;
}
//===----------------------------------------------------------------------===//
// ISD Namespace
//===----------------------------------------------------------------------===//
-bool ISD::isConstantSplatVector(const SDNode *N, APInt &SplatVal) {
+bool ISD::isConstantSplatVector(const SDNode *N, APInt &SplatVal,
+ bool AllowShrink) {
auto *BV = dyn_cast<BuildVectorSDNode>(N);
if (!BV)
return false;
APInt SplatUndef;
unsigned SplatBitSize;
bool HasUndefs;
- EVT EltVT = N->getValueType(0).getVectorElementType();
- return BV->isConstantSplat(SplatVal, SplatUndef, SplatBitSize, HasUndefs) &&
- EltVT.getSizeInBits() >= SplatBitSize;
+ unsigned EltSize = N->getValueType(0).getVectorElementType().getSizeInBits();
+ unsigned MinSplatBits = AllowShrink ? 0 : EltSize;
+ return BV->isConstantSplat(SplatVal, SplatUndef, SplatBitSize, HasUndefs,
+ MinSplatBits) &&
+ EltSize >= SplatBitSize;
}
// FIXME: AllOnes and AllZeros duplicate a lot of code. Could these be
// specializations of the more general isConstantSplatVector()?
bool ISD::isBuildVectorAllOnes(const SDNode *N) {
// Look through a bit convert.
while (N->getOpcode() == ISD::BITCAST)
N = N->getOperand(0).getNode();
if (N->getOpcode() != ISD::BUILD_VECTOR) return false;
unsigned i = 0, e = N->getNumOperands();
// Skip over all of the undef values.
while (i != e && N->getOperand(i).isUndef())
++i;
// Do not accept an all-undef vector.
if (i == e) return false;
// Do not accept build_vectors that aren't all constants or which have non-~0
// elements. We have to be a bit careful here, as the type of the constant
// may not be the same as the type of the vector elements due to type
// legalization (the elements are promoted to a legal type for the target and
// a vector of a type may be legal when the base element type is not).
// We only want to check enough bits to cover the vector elements, because
// we care if the resultant vector is all ones, not whether the individual
// constants are.
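// For example, a v4i8 all-ones vector may survive legalization as a
// BUILD_VECTOR of i32 constants 0xff, so only the low 8 bits of each
// element are checked.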
SDValue NotZero = N->getOperand(i);
unsigned EltSize = N->getValueType(0).getScalarSizeInBits();
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(NotZero)) {
if (CN->getAPIntValue().countTrailingOnes() < EltSize)
return false;
} else if (ConstantFPSDNode *CFPN = dyn_cast<ConstantFPSDNode>(NotZero)) {
if (CFPN->getValueAPF().bitcastToAPInt().countTrailingOnes() < EltSize)
return false;
} else
return false;
// Okay, we have at least one ~0 value, check to see if the rest match or are
// undefs. Even with the above element type twiddling, this should be OK, as
// the same type legalization should have applied to all the elements.
for (++i; i != e; ++i)
if (N->getOperand(i) != NotZero && !N->getOperand(i).isUndef())
return false;
return true;
}
bool ISD::isBuildVectorAllZeros(const SDNode *N) {
// Look through a bit convert.
while (N->getOpcode() == ISD::BITCAST)
N = N->getOperand(0).getNode();
if (N->getOpcode() != ISD::BUILD_VECTOR) return false;
bool IsAllUndef = true;
for (const SDValue &Op : N->op_values()) {
if (Op.isUndef())
continue;
IsAllUndef = false;
// Do not accept build_vectors that aren't all constants or which have non-0
// elements. We have to be a bit careful here, as the type of the constant
// may not be the same as the type of the vector elements due to type
// legalization (the elements are promoted to a legal type for the target
// and a vector of a type may be legal when the base element type is not).
// We only want to check enough bits to cover the vector elements, because
// we care if the resultant vector is all zeros, not whether the individual
// constants are.
unsigned EltSize = N->getValueType(0).getScalarSizeInBits();
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(Op)) {
if (CN->getAPIntValue().countTrailingZeros() < EltSize)
return false;
} else if (ConstantFPSDNode *CFPN = dyn_cast<ConstantFPSDNode>(Op)) {
if (CFPN->getValueAPF().bitcastToAPInt().countTrailingZeros() < EltSize)
return false;
} else
return false;
}
// Do not accept an all-undef vector.
if (IsAllUndef)
return false;
return true;
}
bool ISD::isBuildVectorOfConstantSDNodes(const SDNode *N) {
if (N->getOpcode() != ISD::BUILD_VECTOR)
return false;
for (const SDValue &Op : N->op_values()) {
if (Op.isUndef())
continue;
if (!isa<ConstantSDNode>(Op))
return false;
}
return true;
}
bool ISD::isBuildVectorOfConstantFPSDNodes(const SDNode *N) {
if (N->getOpcode() != ISD::BUILD_VECTOR)
return false;
for (const SDValue &Op : N->op_values()) {
if (Op.isUndef())
continue;
if (!isa<ConstantFPSDNode>(Op))
return false;
}
return true;
}
bool ISD::allOperandsUndef(const SDNode *N) {
// Return false if the node has no operands.
// This is "logically inconsistent" with the definition of "all" but
// is probably the desired behavior.
if (N->getNumOperands() == 0)
return false;
for (const SDValue &Op : N->op_values())
if (!Op.isUndef())
return false;
return true;
}
ISD::NodeType ISD::getExtForLoadExtType(bool IsFP, ISD::LoadExtType ExtType) {
switch (ExtType) {
case ISD::EXTLOAD:
return IsFP ? ISD::FP_EXTEND : ISD::ANY_EXTEND;
case ISD::SEXTLOAD:
return ISD::SIGN_EXTEND;
case ISD::ZEXTLOAD:
return ISD::ZERO_EXTEND;
default:
break;
}
llvm_unreachable("Invalid LoadExtType");
}
ISD::CondCode ISD::getSetCCSwappedOperands(ISD::CondCode Operation) {
// To perform this operation, we just need to swap the L and G bits of the
// operation.
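// For example, swapping SETLT yields SETGT and swapping SETULE yields
// SETUGE, while SETEQ and SETNE are unchanged.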
unsigned OldL = (Operation >> 2) & 1;
unsigned OldG = (Operation >> 1) & 1;
return ISD::CondCode((Operation & ~6) | // Keep the N, U, E bits
(OldL << 1) | // New G bit
(OldG << 2)); // New L bit.
}
ISD::CondCode ISD::getSetCCInverse(ISD::CondCode Op, bool isInteger) {
unsigned Operation = Op;
if (isInteger)
Operation ^= 7; // Flip L, G, E bits, but not U.
else
Operation ^= 15; // Flip all of the condition bits.
if (Operation > ISD::SETTRUE2)
Operation &= ~8; // Don't let N and U bits get set.
return ISD::CondCode(Operation);
}
/// For an integer comparison, return 1 if the comparison is a signed
/// operation and 2 if it is an unsigned comparison. Return zero if the
/// operation does not depend on the sign of the input (setne and seteq).
static int isSignedOp(ISD::CondCode Opcode) {
switch (Opcode) {
default: llvm_unreachable("Illegal integer setcc operation!");
case ISD::SETEQ:
case ISD::SETNE: return 0;
case ISD::SETLT:
case ISD::SETLE:
case ISD::SETGT:
case ISD::SETGE: return 1;
case ISD::SETULT:
case ISD::SETULE:
case ISD::SETUGT:
case ISD::SETUGE: return 2;
}
}
ISD::CondCode ISD::getSetCCOrOperation(ISD::CondCode Op1, ISD::CondCode Op2,
bool IsInteger) {
if (IsInteger && (isSignedOp(Op1) | isSignedOp(Op2)) == 3)
// Cannot fold a signed integer setcc with an unsigned integer setcc.
return ISD::SETCC_INVALID;
unsigned Op = Op1 | Op2; // Combine all of the condition bits.
// If the N and U bits get set, then the resultant comparison DOES suddenly
// care about orderedness, and it is true when ordered.
if (Op > ISD::SETTRUE2)
Op &= ~16; // Clear the U bit if the N bit is set.
// Canonicalize illegal integer setcc's.
if (IsInteger && Op == ISD::SETUNE) // e.g. SETUGT | SETULT
Op = ISD::SETNE;
return ISD::CondCode(Op);
}
ISD::CondCode ISD::getSetCCAndOperation(ISD::CondCode Op1, ISD::CondCode Op2,
bool IsInteger) {
if (IsInteger && (isSignedOp(Op1) | isSignedOp(Op2)) == 3)
// Cannot fold a signed setcc with an unsigned setcc.
return ISD::SETCC_INVALID;
// Combine all of the condition bits.
ISD::CondCode Result = ISD::CondCode(Op1 & Op2);
// Canonicalize illegal integer setcc's.
if (IsInteger) {
switch (Result) {
default: break;
case ISD::SETUO : Result = ISD::SETFALSE; break; // SETUGT & SETULT
case ISD::SETOEQ: // SETEQ & SETU[LG]E
case ISD::SETUEQ: Result = ISD::SETEQ ; break; // SETUGE & SETULE
case ISD::SETOLT: Result = ISD::SETULT ; break; // SETULT & SETNE
case ISD::SETOGT: Result = ISD::SETUGT ; break; // SETUGT & SETNE
}
}
return Result;
}
//===----------------------------------------------------------------------===//
// SDNode Profile Support
//===----------------------------------------------------------------------===//
/// AddNodeIDOpcode - Add the node opcode to the NodeID data.
static void AddNodeIDOpcode(FoldingSetNodeID &ID, unsigned OpC) {
ID.AddInteger(OpC);
}
/// AddNodeIDValueTypes - Value type lists are intern'd so we can represent them
/// solely with their pointer.
static void AddNodeIDValueTypes(FoldingSetNodeID &ID, SDVTList VTList) {
ID.AddPointer(VTList.VTs);
}
/// AddNodeIDOperands - Various routines for adding operands to the NodeID data.
static void AddNodeIDOperands(FoldingSetNodeID &ID,
ArrayRef<SDValue> Ops) {
for (auto& Op : Ops) {
ID.AddPointer(Op.getNode());
ID.AddInteger(Op.getResNo());
}
}
/// AddNodeIDOperands - Various routines for adding operands to the NodeID data.
static void AddNodeIDOperands(FoldingSetNodeID &ID,
ArrayRef<SDUse> Ops) {
for (auto& Op : Ops) {
ID.AddPointer(Op.getNode());
ID.AddInteger(Op.getResNo());
}
}
static void AddNodeIDNode(FoldingSetNodeID &ID, unsigned short OpC,
SDVTList VTList, ArrayRef<SDValue> OpList) {
AddNodeIDOpcode(ID, OpC);
AddNodeIDValueTypes(ID, VTList);
AddNodeIDOperands(ID, OpList);
}
/// If this is an SDNode with special info, add this info to the NodeID data.
static void AddNodeIDCustom(FoldingSetNodeID &ID, const SDNode *N) {
switch (N->getOpcode()) {
case ISD::TargetExternalSymbol:
case ISD::ExternalSymbol:
case ISD::MCSymbol:
llvm_unreachable("Should only be used on nodes with operands");
default: break; // Normal nodes don't need extra info.
case ISD::TargetConstant:
case ISD::Constant: {
const ConstantSDNode *C = cast<ConstantSDNode>(N);
ID.AddPointer(C->getConstantIntValue());
ID.AddBoolean(C->isOpaque());
break;
}
case ISD::TargetConstantFP:
case ISD::ConstantFP:
ID.AddPointer(cast<ConstantFPSDNode>(N)->getConstantFPValue());
break;
case ISD::TargetGlobalAddress:
case ISD::GlobalAddress:
case ISD::TargetGlobalTLSAddress:
case ISD::GlobalTLSAddress: {
const GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(N);
ID.AddPointer(GA->getGlobal());
ID.AddInteger(GA->getOffset());
ID.AddInteger(GA->getTargetFlags());
break;
}
case ISD::BasicBlock:
ID.AddPointer(cast<BasicBlockSDNode>(N)->getBasicBlock());
break;
case ISD::Register:
ID.AddInteger(cast<RegisterSDNode>(N)->getReg());
break;
case ISD::RegisterMask:
ID.AddPointer(cast<RegisterMaskSDNode>(N)->getRegMask());
break;
case ISD::SRCVALUE:
ID.AddPointer(cast<SrcValueSDNode>(N)->getValue());
break;
case ISD::FrameIndex:
case ISD::TargetFrameIndex:
ID.AddInteger(cast<FrameIndexSDNode>(N)->getIndex());
break;
case ISD::JumpTable:
case ISD::TargetJumpTable:
ID.AddInteger(cast<JumpTableSDNode>(N)->getIndex());
ID.AddInteger(cast<JumpTableSDNode>(N)->getTargetFlags());
break;
case ISD::ConstantPool:
case ISD::TargetConstantPool: {
const ConstantPoolSDNode *CP = cast<ConstantPoolSDNode>(N);
ID.AddInteger(CP->getAlignment());
ID.AddInteger(CP->getOffset());
if (CP->isMachineConstantPoolEntry())
CP->getMachineCPVal()->addSelectionDAGCSEId(ID);
else
ID.AddPointer(CP->getConstVal());
ID.AddInteger(CP->getTargetFlags());
break;
}
case ISD::TargetIndex: {
const TargetIndexSDNode *TI = cast<TargetIndexSDNode>(N);
ID.AddInteger(TI->getIndex());
ID.AddInteger(TI->getOffset());
ID.AddInteger(TI->getTargetFlags());
break;
}
case ISD::LOAD: {
const LoadSDNode *LD = cast<LoadSDNode>(N);
ID.AddInteger(LD->getMemoryVT().getRawBits());
ID.AddInteger(LD->getRawSubclassData());
ID.AddInteger(LD->getPointerInfo().getAddrSpace());
break;
}
case ISD::STORE: {
const StoreSDNode *ST = cast<StoreSDNode>(N);
ID.AddInteger(ST->getMemoryVT().getRawBits());
ID.AddInteger(ST->getRawSubclassData());
ID.AddInteger(ST->getPointerInfo().getAddrSpace());
break;
}
case ISD::ATOMIC_CMP_SWAP:
case ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS:
case ISD::ATOMIC_SWAP:
case ISD::ATOMIC_LOAD_ADD:
case ISD::ATOMIC_LOAD_SUB:
case ISD::ATOMIC_LOAD_AND:
case ISD::ATOMIC_LOAD_OR:
case ISD::ATOMIC_LOAD_XOR:
case ISD::ATOMIC_LOAD_NAND:
case ISD::ATOMIC_LOAD_MIN:
case ISD::ATOMIC_LOAD_MAX:
case ISD::ATOMIC_LOAD_UMIN:
case ISD::ATOMIC_LOAD_UMAX:
case ISD::ATOMIC_LOAD:
case ISD::ATOMIC_STORE: {
const AtomicSDNode *AT = cast<AtomicSDNode>(N);
ID.AddInteger(AT->getMemoryVT().getRawBits());
ID.AddInteger(AT->getRawSubclassData());
ID.AddInteger(AT->getPointerInfo().getAddrSpace());
break;
}
case ISD::PREFETCH: {
const MemSDNode *PF = cast<MemSDNode>(N);
ID.AddInteger(PF->getPointerInfo().getAddrSpace());
break;
}
case ISD::VECTOR_SHUFFLE: {
const ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(N);
for (unsigned i = 0, e = N->getValueType(0).getVectorNumElements();
i != e; ++i)
ID.AddInteger(SVN->getMaskElt(i));
break;
}
case ISD::TargetBlockAddress:
case ISD::BlockAddress: {
const BlockAddressSDNode *BA = cast<BlockAddressSDNode>(N);
ID.AddPointer(BA->getBlockAddress());
ID.AddInteger(BA->getOffset());
ID.AddInteger(BA->getTargetFlags());
break;
}
} // end switch (N->getOpcode())
// Target specific memory nodes could also have address spaces to check.
if (N->isTargetMemoryOpcode())
ID.AddInteger(cast<MemSDNode>(N)->getPointerInfo().getAddrSpace());
}
/// AddNodeIDNode - Generic routine for adding a node's info to the NodeID
/// data.
static void AddNodeIDNode(FoldingSetNodeID &ID, const SDNode *N) {
AddNodeIDOpcode(ID, N->getOpcode());
// Add the return value info.
AddNodeIDValueTypes(ID, N->getVTList());
// Add the operand info.
AddNodeIDOperands(ID, N->ops());
// Handle SDNode leaves with special info.
AddNodeIDCustom(ID, N);
}
//===----------------------------------------------------------------------===//
// SelectionDAG Class
//===----------------------------------------------------------------------===//
/// doNotCSE - Return true if CSE should not be performed for this node.
static bool doNotCSE(SDNode *N) {
if (N->getValueType(0) == MVT::Glue)
return true; // Never CSE anything that produces a flag.
switch (N->getOpcode()) {
default: break;
case ISD::HANDLENODE:
case ISD::EH_LABEL:
return true; // Never CSE these nodes.
}
// Check that remaining values produced are not flags.
for (unsigned i = 1, e = N->getNumValues(); i != e; ++i)
if (N->getValueType(i) == MVT::Glue)
return true; // Never CSE anything that produces a flag.
return false;
}
/// RemoveDeadNodes - This method deletes all unreachable nodes in the
/// SelectionDAG.
void SelectionDAG::RemoveDeadNodes() {
// Create a dummy node (which is not added to allnodes) that adds a reference
// to the root node, preventing it from being deleted.
HandleSDNode Dummy(getRoot());
SmallVector<SDNode*, 128> DeadNodes;
// Add all obviously-dead nodes to the DeadNodes worklist.
for (SDNode &Node : allnodes())
if (Node.use_empty())
DeadNodes.push_back(&Node);
RemoveDeadNodes(DeadNodes);
// If the root changed (e.g. it was a dead load), update the root.
setRoot(Dummy.getValue());
}
/// RemoveDeadNodes - This method deletes the unreachable nodes in the
/// given list, and any nodes that become unreachable as a result.
void SelectionDAG::RemoveDeadNodes(SmallVectorImpl<SDNode *> &DeadNodes) {
// Process the worklist, deleting the nodes and adding their uses to the
// worklist.
while (!DeadNodes.empty()) {
SDNode *N = DeadNodes.pop_back_val();
// Skip to the next node if we've already managed to delete the node. This
// could happen if replacing a node causes a node previously added to the
// worklist to be deleted.
if (N->getOpcode() == ISD::DELETED_NODE)
continue;
for (DAGUpdateListener *DUL = UpdateListeners; DUL; DUL = DUL->Next)
DUL->NodeDeleted(N, nullptr);
// Take the node out of the appropriate CSE map.
RemoveNodeFromCSEMaps(N);
// Next, brutally remove the operand list. This is safe to do, as there are
// no cycles in the graph.
for (SDNode::op_iterator I = N->op_begin(), E = N->op_end(); I != E; ) {
SDUse &Use = *I++;
SDNode *Operand = Use.getNode();
Use.set(SDValue());
// Now that we removed this operand, see if there are no uses of it left.
if (Operand->use_empty())
DeadNodes.push_back(Operand);
}
DeallocateNode(N);
}
}
void SelectionDAG::RemoveDeadNode(SDNode *N){
SmallVector<SDNode*, 16> DeadNodes(1, N);
// Create a dummy node that adds a reference to the root node, preventing
// it from being deleted. (This matters if the root is an operand of the
// dead node.)
HandleSDNode Dummy(getRoot());
RemoveDeadNodes(DeadNodes);
}
void SelectionDAG::DeleteNode(SDNode *N) {
// First take this out of the appropriate CSE map.
RemoveNodeFromCSEMaps(N);
// Finally, remove uses due to operands of this node, remove from the
// AllNodes list, and delete the node.
DeleteNodeNotInCSEMaps(N);
}
void SelectionDAG::DeleteNodeNotInCSEMaps(SDNode *N) {
assert(N->getIterator() != AllNodes.begin() &&
"Cannot delete the entry node!");
assert(N->use_empty() && "Cannot delete a node that is not dead!");
// Drop all of the operands and decrement used node's use counts.
N->DropOperands();
DeallocateNode(N);
}
void SDDbgInfo::erase(const SDNode *Node) {
DbgValMapType::iterator I = DbgValMap.find(Node);
if (I == DbgValMap.end())
return;
for (auto &Val: I->second)
Val->setIsInvalidated();
DbgValMap.erase(I);
}
void SelectionDAG::DeallocateNode(SDNode *N) {
// If we have operands, deallocate them.
removeOperands(N);
NodeAllocator.Deallocate(AllNodes.remove(N));
// Set the opcode to DELETED_NODE to help catch bugs when node
// memory is reallocated.
// FIXME: There are places in SDag that have grown a dependency on the opcode
// value in the released node.
__asan_unpoison_memory_region(&N->NodeType, sizeof(N->NodeType));
N->NodeType = ISD::DELETED_NODE;
// If any of the SDDbgValue nodes refer to this SDNode, invalidate
// them and forget about that node.
DbgInfo->erase(N);
}
#ifndef NDEBUG
/// VerifySDNode - Sanity check the given SDNode. Aborts if it is invalid.
static void VerifySDNode(SDNode *N) {
switch (N->getOpcode()) {
default:
break;
case ISD::BUILD_PAIR: {
EVT VT = N->getValueType(0);
assert(N->getNumValues() == 1 && "Too many results!");
assert(!VT.isVector() && (VT.isInteger() || VT.isFloatingPoint()) &&
"Wrong return type!");
assert(N->getNumOperands() == 2 && "Wrong number of operands!");
assert(N->getOperand(0).getValueType() == N->getOperand(1).getValueType() &&
"Mismatched operand types!");
assert(N->getOperand(0).getValueType().isInteger() == VT.isInteger() &&
"Wrong operand type!");
assert(VT.getSizeInBits() == 2 * N->getOperand(0).getValueSizeInBits() &&
"Wrong return type size");
break;
}
case ISD::BUILD_VECTOR: {
assert(N->getNumValues() == 1 && "Too many results!");
assert(N->getValueType(0).isVector() && "Wrong return type!");
assert(N->getNumOperands() == N->getValueType(0).getVectorNumElements() &&
"Wrong number of operands!");
EVT EltVT = N->getValueType(0).getVectorElementType();
for (SDNode::op_iterator I = N->op_begin(), E = N->op_end(); I != E; ++I) {
assert((I->getValueType() == EltVT ||
(EltVT.isInteger() && I->getValueType().isInteger() &&
EltVT.bitsLE(I->getValueType()))) &&
"Wrong operand type!");
assert(I->getValueType() == N->getOperand(0).getValueType() &&
"Operands must all have the same type");
}
break;
}
}
}
#endif // NDEBUG
/// \brief Insert a newly allocated node into the DAG.
///
/// Handles insertion into the all nodes list and CSE map, as well as
/// verification and other common operations when a new node is allocated.
void SelectionDAG::InsertNode(SDNode *N) {
AllNodes.push_back(N);
#ifndef NDEBUG
N->PersistentId = NextPersistentId++;
VerifySDNode(N);
#endif
}
/// RemoveNodeFromCSEMaps - Take the specified node out of the CSE map that
/// corresponds to it. This is useful when we're about to delete or repurpose
/// the node. We don't want future requests for structurally identical nodes
/// to return N anymore.
bool SelectionDAG::RemoveNodeFromCSEMaps(SDNode *N) {
bool Erased = false;
switch (N->getOpcode()) {
case ISD::HANDLENODE: return false; // noop.
case ISD::CONDCODE:
assert(CondCodeNodes[cast<CondCodeSDNode>(N)->get()] &&
"Cond code doesn't exist!");
Erased = CondCodeNodes[cast<CondCodeSDNode>(N)->get()] != nullptr;
CondCodeNodes[cast<CondCodeSDNode>(N)->get()] = nullptr;
break;
case ISD::ExternalSymbol:
Erased = ExternalSymbols.erase(cast<ExternalSymbolSDNode>(N)->getSymbol());
break;
case ISD::TargetExternalSymbol: {
ExternalSymbolSDNode *ESN = cast<ExternalSymbolSDNode>(N);
Erased = TargetExternalSymbols.erase(
std::pair<std::string,unsigned char>(ESN->getSymbol(),
ESN->getTargetFlags()));
break;
}
case ISD::MCSymbol: {
auto *MCSN = cast<MCSymbolSDNode>(N);
Erased = MCSymbols.erase(MCSN->getMCSymbol());
break;
}
case ISD::VALUETYPE: {
EVT VT = cast<VTSDNode>(N)->getVT();
if (VT.isExtended()) {
Erased = ExtendedValueTypeNodes.erase(VT);
} else {
Erased = ValueTypeNodes[VT.getSimpleVT().SimpleTy] != nullptr;
ValueTypeNodes[VT.getSimpleVT().SimpleTy] = nullptr;
}
break;
}
default:
// Remove it from the CSE Map.
assert(N->getOpcode() != ISD::DELETED_NODE && "DELETED_NODE in CSEMap!");
assert(N->getOpcode() != ISD::EntryToken && "EntryToken in CSEMap!");
Erased = CSEMap.RemoveNode(N);
break;
}
#ifndef NDEBUG
// Verify that the node was actually in one of the CSE maps, unless it has a
// flag result (which cannot be CSE'd) or is one of the special cases that are
// not subject to CSE.
if (!Erased && N->getValueType(N->getNumValues()-1) != MVT::Glue &&
!N->isMachineOpcode() && !doNotCSE(N)) {
N->dump(this);
dbgs() << "\n";
llvm_unreachable("Node is not in map!");
}
#endif
return Erased;
}
/// AddModifiedNodeToCSEMaps - The specified node has been removed from the CSE
/// maps and modified in place. Add it back to the CSE maps, unless an identical
/// node already exists, in which case transfer all its users to the existing
/// node. This transfer can potentially trigger recursive merging.
void
SelectionDAG::AddModifiedNodeToCSEMaps(SDNode *N) {
// For node types that aren't CSE'd, just act as if no identical node
// already exists.
if (!doNotCSE(N)) {
SDNode *Existing = CSEMap.GetOrInsertNode(N);
if (Existing != N) {
// If there was already an existing matching node, use ReplaceAllUsesWith
// to replace the dead one with the existing one. This can cause
// recursive merging of other unrelated nodes down the line.
ReplaceAllUsesWith(N, Existing);
// N is now dead. Inform the listeners and delete it.
for (DAGUpdateListener *DUL = UpdateListeners; DUL; DUL = DUL->Next)
DUL->NodeDeleted(N, Existing);
DeleteNodeNotInCSEMaps(N);
return;
}
}
// If the node doesn't already exist, we updated it. Inform listeners.
for (DAGUpdateListener *DUL = UpdateListeners; DUL; DUL = DUL->Next)
DUL->NodeUpdated(N);
}
/// FindModifiedNodeSlot - Find a slot for the specified node if its operands
/// were replaced with those specified. If this node is never memoized,
/// return null, otherwise return a pointer to the slot it would take. If a
/// node already exists with these operands, the slot will be non-null.
SDNode *SelectionDAG::FindModifiedNodeSlot(SDNode *N, SDValue Op,
void *&InsertPos) {
if (doNotCSE(N))
return nullptr;
SDValue Ops[] = { Op };
FoldingSetNodeID ID;
AddNodeIDNode(ID, N->getOpcode(), N->getVTList(), Ops);
AddNodeIDCustom(ID, N);
SDNode *Node = FindNodeOrInsertPos(ID, SDLoc(N), InsertPos);
if (Node)
Node->intersectFlagsWith(N->getFlags());
return Node;
}
/// FindModifiedNodeSlot - Find a slot for the specified node if its operands
/// were replaced with those specified. If this node is never memoized,
/// return null, otherwise return a pointer to the slot it would take. If a
/// node already exists with these operands, the slot will be non-null.
SDNode *SelectionDAG::FindModifiedNodeSlot(SDNode *N,
SDValue Op1, SDValue Op2,
void *&InsertPos) {
if (doNotCSE(N))
return nullptr;
SDValue Ops[] = { Op1, Op2 };
FoldingSetNodeID ID;
AddNodeIDNode(ID, N->getOpcode(), N->getVTList(), Ops);
AddNodeIDCustom(ID, N);
SDNode *Node = FindNodeOrInsertPos(ID, SDLoc(N), InsertPos);
if (Node)
Node->intersectFlagsWith(N->getFlags());
return Node;
}
/// FindModifiedNodeSlot - Find a slot for the specified node if its operands
/// were replaced with those specified. If this node is never memoized,
/// return null, otherwise return a pointer to the slot it would take. If a
/// node already exists with these operands, the slot will be non-null.
SDNode *SelectionDAG::FindModifiedNodeSlot(SDNode *N, ArrayRef<SDValue> Ops,
void *&InsertPos) {
if (doNotCSE(N))
return nullptr;
FoldingSetNodeID ID;
AddNodeIDNode(ID, N->getOpcode(), N->getVTList(), Ops);
AddNodeIDCustom(ID, N);
SDNode *Node = FindNodeOrInsertPos(ID, SDLoc(N), InsertPos);
if (Node)
Node->intersectFlagsWith(N->getFlags());
return Node;
}
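// A sketch of the typical caller pattern for these overloads (callers such
// as UpdateNodeOperands follow this shape; details vary):
//
//   void *IP = nullptr;
//   if (SDNode *Existing = FindModifiedNodeSlot(N, NewOp, IP))
//     return Existing;          // An identical node exists; reuse it.
//   RemoveNodeFromCSEMaps(N);   // Drop the stale identity.
//   // ...rewrite N's operands in place...
//   CSEMap.InsertNode(N, IP);   // Re-insert N at the precomputed slot.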
unsigned SelectionDAG::getEVTAlignment(EVT VT) const {
Type *Ty = VT == MVT::iPTR ?
PointerType::get(Type::getInt8Ty(*getContext()), 0) :
VT.getTypeForEVT(*getContext());
return getDataLayout().getABITypeAlignment(Ty);
}
// EntryNode could meaningfully have debug info if we can find it...
SelectionDAG::SelectionDAG(const TargetMachine &tm, CodeGenOpt::Level OL)
: TM(tm), OptLevel(OL),
EntryNode(ISD::EntryToken, 0, DebugLoc(), getVTList(MVT::Other)),
Root(getEntryNode()) {
InsertNode(&EntryNode);
DbgInfo = new SDDbgInfo();
}
void SelectionDAG::init(MachineFunction &NewMF,
OptimizationRemarkEmitter &NewORE) {
MF = &NewMF;
ORE = &NewORE;
TLI = getSubtarget().getTargetLowering();
TSI = getSubtarget().getSelectionDAGInfo();
Context = &MF->getFunction()->getContext();
}
SelectionDAG::~SelectionDAG() {
assert(!UpdateListeners && "Dangling registered DAGUpdateListeners");
allnodes_clear();
OperandRecycler.clear(OperandAllocator);
delete DbgInfo;
}
void SelectionDAG::allnodes_clear() {
assert(&*AllNodes.begin() == &EntryNode);
AllNodes.remove(AllNodes.begin());
while (!AllNodes.empty())
DeallocateNode(&AllNodes.front());
#ifndef NDEBUG
NextPersistentId = 0;
#endif
}
SDNode *SelectionDAG::FindNodeOrInsertPos(const FoldingSetNodeID &ID,
void *&InsertPos) {
SDNode *N = CSEMap.FindNodeOrInsertPos(ID, InsertPos);
if (N) {
switch (N->getOpcode()) {
default: break;
case ISD::Constant:
case ISD::ConstantFP:
llvm_unreachable("Querying for Constant and ConstantFP nodes requires "
"debug location. Use another overload.");
}
}
return N;
}
SDNode *SelectionDAG::FindNodeOrInsertPos(const FoldingSetNodeID &ID,
const SDLoc &DL, void *&InsertPos) {
SDNode *N = CSEMap.FindNodeOrInsertPos(ID, InsertPos);
if (N) {
switch (N->getOpcode()) {
case ISD::Constant:
case ISD::ConstantFP:
// Erase debug location from the node if the node is used at several
// different places. Do not propagate one location to all uses as it
// will cause a worse single stepping debugging experience.
if (N->getDebugLoc() != DL.getDebugLoc())
N->setDebugLoc(DebugLoc());
break;
default:
// When the node's point of use is located earlier in the instruction
// sequence than its prior point of use, update its debug info to the
// earlier location.
if (DL.getIROrder() && DL.getIROrder() < N->getIROrder())
N->setDebugLoc(DL.getDebugLoc());
break;
}
}
return N;
}
void SelectionDAG::clear() {
allnodes_clear();
OperandRecycler.clear(OperandAllocator);
OperandAllocator.Reset();
CSEMap.clear();
ExtendedValueTypeNodes.clear();
ExternalSymbols.clear();
TargetExternalSymbols.clear();
MCSymbols.clear();
std::fill(CondCodeNodes.begin(), CondCodeNodes.end(),
static_cast<CondCodeSDNode*>(nullptr));
std::fill(ValueTypeNodes.begin(), ValueTypeNodes.end(),
static_cast<SDNode*>(nullptr));
EntryNode.UseList = nullptr;
InsertNode(&EntryNode);
Root = getEntryNode();
DbgInfo->clear();
}
SDValue SelectionDAG::getFPExtendOrRound(SDValue Op, const SDLoc &DL, EVT VT) {
return VT.bitsGT(Op.getValueType())
? getNode(ISD::FP_EXTEND, DL, VT, Op)
: getNode(ISD::FP_ROUND, DL, VT, Op, getIntPtrConstant(0, DL));
}
SDValue SelectionDAG::getAnyExtOrTrunc(SDValue Op, const SDLoc &DL, EVT VT) {
return VT.bitsGT(Op.getValueType()) ?
getNode(ISD::ANY_EXTEND, DL, VT, Op) :
getNode(ISD::TRUNCATE, DL, VT, Op);
}
SDValue SelectionDAG::getSExtOrTrunc(SDValue Op, const SDLoc &DL, EVT VT) {
return VT.bitsGT(Op.getValueType()) ?
getNode(ISD::SIGN_EXTEND, DL, VT, Op) :
getNode(ISD::TRUNCATE, DL, VT, Op);
}
SDValue SelectionDAG::getZExtOrTrunc(SDValue Op, const SDLoc &DL, EVT VT) {
return VT.bitsGT(Op.getValueType()) ?
getNode(ISD::ZERO_EXTEND, DL, VT, Op) :
getNode(ISD::TRUNCATE, DL, VT, Op);
}
SDValue SelectionDAG::getBoolExtOrTrunc(SDValue Op, const SDLoc &SL, EVT VT,
EVT OpVT) {
if (VT.bitsLE(Op.getValueType()))
return getNode(ISD::TRUNCATE, SL, VT, Op);
TargetLowering::BooleanContent BType = TLI->getBooleanContents(OpVT);
return getNode(TLI->getExtendForContent(BType), SL, VT, Op);
}
SDValue SelectionDAG::getZeroExtendInReg(SDValue Op, const SDLoc &DL, EVT VT) {
assert(!VT.isVector() &&
"getZeroExtendInReg should use the vector element type instead of "
"the vector type!");
if (Op.getValueType() == VT) return Op;
unsigned BitWidth = Op.getScalarValueSizeInBits();
APInt Imm = APInt::getLowBitsSet(BitWidth,
VT.getSizeInBits());
return getNode(ISD::AND, DL, Op.getValueType(), Op,
getConstant(Imm, DL, Op.getValueType()));
}
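// Worked example: zero-extending the low 8 bits of an i32 value in-register
// yields (and x, 0xFF); APInt::getLowBitsSet(32, 8) is exactly that mask.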
SDValue SelectionDAG::getAnyExtendVectorInReg(SDValue Op, const SDLoc &DL,
EVT VT) {
assert(VT.isVector() && "This DAG node is restricted to vector types.");
assert(VT.getSizeInBits() == Op.getValueSizeInBits() &&
"The sizes of the input and result must match in order to perform the "
"extend in-register.");
assert(VT.getVectorNumElements() < Op.getValueType().getVectorNumElements() &&
"The destination vector type must have fewer lanes than the input.");
return getNode(ISD::ANY_EXTEND_VECTOR_INREG, DL, VT, Op);
}
SDValue SelectionDAG::getSignExtendVectorInReg(SDValue Op, const SDLoc &DL,
EVT VT) {
assert(VT.isVector() && "This DAG node is restricted to vector types.");
assert(VT.getSizeInBits() == Op.getValueSizeInBits() &&
"The sizes of the input and result must match in order to perform the "
"extend in-register.");
assert(VT.getVectorNumElements() < Op.getValueType().getVectorNumElements() &&
"The destination vector type must have fewer lanes than the input.");
return getNode(ISD::SIGN_EXTEND_VECTOR_INREG, DL, VT, Op);
}
SDValue SelectionDAG::getZeroExtendVectorInReg(SDValue Op, const SDLoc &DL,
EVT VT) {
assert(VT.isVector() && "This DAG node is restricted to vector types.");
assert(VT.getSizeInBits() == Op.getValueSizeInBits() &&
"The sizes of the input and result must match in order to perform the "
"extend in-register.");
assert(VT.getVectorNumElements() < Op.getValueType().getVectorNumElements() &&
"The destination vector type must have fewer lanes than the input.");
return getNode(ISD::ZERO_EXTEND_VECTOR_INREG, DL, VT, Op);
}
/// getNOT - Create a bitwise NOT operation as (XOR Val, -1).
SDValue SelectionDAG::getNOT(const SDLoc &DL, SDValue Val, EVT VT) {
EVT EltVT = VT.getScalarType();
SDValue NegOne =
getConstant(APInt::getAllOnesValue(EltVT.getSizeInBits()), DL, VT);
return getNode(ISD::XOR, DL, VT, Val, NegOne);
}
SDValue SelectionDAG::getLogicalNOT(const SDLoc &DL, SDValue Val, EVT VT) {
EVT EltVT = VT.getScalarType();
SDValue TrueValue;
switch (TLI->getBooleanContents(VT)) {
case TargetLowering::ZeroOrOneBooleanContent:
case TargetLowering::UndefinedBooleanContent:
TrueValue = getConstant(1, DL, VT);
break;
case TargetLowering::ZeroOrNegativeOneBooleanContent:
TrueValue = getConstant(APInt::getAllOnesValue(EltVT.getSizeInBits()), DL,
VT);
break;
}
return getNode(ISD::XOR, DL, VT, Val, TrueValue);
}
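// The cases above differ only in what "true" looks like: on a
// ZeroOrNegativeOne target true is all-ones, so XOR'ing with all-ones flips
// the logical value, while on a ZeroOrOne target XOR'ing with 1 suffices.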
SDValue SelectionDAG::getConstant(uint64_t Val, const SDLoc &DL, EVT VT,
bool isT, bool isO) {
EVT EltVT = VT.getScalarType();
assert((EltVT.getSizeInBits() >= 64 ||
(uint64_t)((int64_t)Val >> EltVT.getSizeInBits()) + 1 < 2) &&
"getConstant with a uint64_t value that doesn't fit in the type!");
return getConstant(APInt(EltVT.getSizeInBits(), Val), DL, VT, isT, isO);
}
SDValue SelectionDAG::getConstant(const APInt &Val, const SDLoc &DL, EVT VT,
bool isT, bool isO) {
return getConstant(*ConstantInt::get(*Context, Val), DL, VT, isT, isO);
}
SDValue SelectionDAG::getConstant(const ConstantInt &Val, const SDLoc &DL,
EVT VT, bool isT, bool isO) {
assert(VT.isInteger() && "Cannot create FP integer constant!");
EVT EltVT = VT.getScalarType();
const ConstantInt *Elt = &Val;
// In some cases the vector type is legal but the element type is illegal and
// needs to be promoted, for example v8i8 on ARM. In this case, promote the
// inserted value (the type does not need to match the vector element type).
// Any extra bits introduced will be truncated away.
if (VT.isVector() && TLI->getTypeAction(*getContext(), EltVT) ==
TargetLowering::TypePromoteInteger) {
EltVT = TLI->getTypeToTransformTo(*getContext(), EltVT);
APInt NewVal = Elt->getValue().zextOrTrunc(EltVT.getSizeInBits());
Elt = ConstantInt::get(*getContext(), NewVal);
}
// In other cases the element type is illegal and needs to be expanded, for
// example v2i64 on MIPS32. In this case, find the nearest legal type, split
// the value into n parts and use a vector type with n-times the elements.
// Then bitcast to the type requested.
// Legalizing constants too early makes the DAGCombiner's job harder so we
// only legalize if the DAG tells us we must produce legal types.
else if (NewNodesMustHaveLegalTypes && VT.isVector() &&
TLI->getTypeAction(*getContext(), EltVT) ==
TargetLowering::TypeExpandInteger) {
const APInt &NewVal = Elt->getValue();
EVT ViaEltVT = TLI->getTypeToTransformTo(*getContext(), EltVT);
unsigned ViaEltSizeInBits = ViaEltVT.getSizeInBits();
unsigned ViaVecNumElts = VT.getSizeInBits() / ViaEltSizeInBits;
EVT ViaVecVT = EVT::getVectorVT(*getContext(), ViaEltVT, ViaVecNumElts);
// Check the temporary vector is the correct size. If this fails then
// getTypeToTransformTo() probably returned a type whose size (in bits)
// isn't a power-of-2 factor of the requested type size.
assert(ViaVecVT.getSizeInBits() == VT.getSizeInBits());
SmallVector<SDValue, 2> EltParts;
for (unsigned i = 0; i < ViaVecNumElts / VT.getVectorNumElements(); ++i) {
EltParts.push_back(getConstant(NewVal.lshr(i * ViaEltSizeInBits)
.zextOrTrunc(ViaEltSizeInBits), DL,
ViaEltVT, isT, isO));
}
// EltParts is currently in little endian order. If we actually want
// big-endian order then reverse it now.
if (getDataLayout().isBigEndian())
std::reverse(EltParts.begin(), EltParts.end());
// The elements must be reversed when the element order is different
// to the endianness of the elements (because the BITCAST is itself a
// vector shuffle in this situation). However, we do not need any code to
// perform this reversal because getConstant() is producing a vector
// splat.
// This situation occurs in MIPS MSA.
SmallVector<SDValue, 8> Ops;
for (unsigned i = 0, e = VT.getVectorNumElements(); i != e; ++i)
Ops.insert(Ops.end(), EltParts.begin(), EltParts.end());
return getNode(ISD::BITCAST, DL, VT, getBuildVector(ViaVecVT, DL, Ops));
}
assert(Elt->getBitWidth() == EltVT.getSizeInBits() &&
"APInt size does not match type size!");
unsigned Opc = isT ? ISD::TargetConstant : ISD::Constant;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(EltVT), None);
ID.AddPointer(Elt);
ID.AddBoolean(isO);
void *IP = nullptr;
SDNode *N = nullptr;
if ((N = FindNodeOrInsertPos(ID, DL, IP)))
if (!VT.isVector())
return SDValue(N, 0);
if (!N) {
N = newSDNode<ConstantSDNode>(isT, isO, Elt, DL.getDebugLoc(), EltVT);
CSEMap.InsertNode(N, IP);
InsertNode(N);
}
SDValue Result(N, 0);
if (VT.isVector())
Result = getSplatBuildVector(VT, DL, Result);
return Result;
}
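// Sketch of the common vector request (hypothetical names; DAG and DL are
// assumed to be in scope): asking for a v4i32 constant 1 CSEs one scalar
// i32 ConstantSDNode and then splats it via getSplatBuildVector, i.e.
//
//   SDValue Ones = DAG.getConstant(1, DL, MVT::v4i32);
//   // -> BUILD_VECTOR(Constant:i32<1> x 4)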
SDValue SelectionDAG::getIntPtrConstant(uint64_t Val, const SDLoc &DL,
bool isTarget) {
return getConstant(Val, DL, TLI->getPointerTy(getDataLayout()), isTarget);
}
SDValue SelectionDAG::getConstantFP(const APFloat &V, const SDLoc &DL, EVT VT,
bool isTarget) {
return getConstantFP(*ConstantFP::get(*getContext(), V), DL, VT, isTarget);
}
SDValue SelectionDAG::getConstantFP(const ConstantFP &V, const SDLoc &DL,
EVT VT, bool isTarget) {
assert(VT.isFloatingPoint() && "Cannot create integer FP constant!");
EVT EltVT = VT.getScalarType();
// Do the map lookup using the actual bit pattern for the floating point
// value, so that we don't have problems with 0.0 comparing equal to -0.0, and
// we don't have issues with SNANs.
unsigned Opc = isTarget ? ISD::TargetConstantFP : ISD::ConstantFP;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(EltVT), None);
ID.AddPointer(&V);
void *IP = nullptr;
SDNode *N = nullptr;
if ((N = FindNodeOrInsertPos(ID, DL, IP)))
if (!VT.isVector())
return SDValue(N, 0);
if (!N) {
N = newSDNode<ConstantFPSDNode>(isTarget, &V, DL.getDebugLoc(), EltVT);
CSEMap.InsertNode(N, IP);
InsertNode(N);
}
SDValue Result(N, 0);
if (VT.isVector())
Result = getSplatBuildVector(VT, DL, Result);
return Result;
}
SDValue SelectionDAG::getConstantFP(double Val, const SDLoc &DL, EVT VT,
bool isTarget) {
EVT EltVT = VT.getScalarType();
if (EltVT == MVT::f32)
return getConstantFP(APFloat((float)Val), DL, VT, isTarget);
else if (EltVT == MVT::f64)
return getConstantFP(APFloat(Val), DL, VT, isTarget);
else if (EltVT == MVT::f80 || EltVT == MVT::f128 || EltVT == MVT::ppcf128 ||
EltVT == MVT::f16) {
bool Ignored;
APFloat APF = APFloat(Val);
APF.convert(EVTToAPFloatSemantics(EltVT), APFloat::rmNearestTiesToEven,
&Ignored);
return getConstantFP(APF, DL, VT, isTarget);
} else
llvm_unreachable("Unsupported type in getConstantFP");
}
SDValue SelectionDAG::getGlobalAddress(const GlobalValue *GV, const SDLoc &DL,
EVT VT, int64_t Offset, bool isTargetGA,
unsigned char TargetFlags) {
assert((TargetFlags == 0 || isTargetGA) &&
"Cannot set target flags on target-independent globals");
// Truncate (with sign-extension) the offset value to the pointer size.
unsigned BitWidth = getDataLayout().getPointerTypeSizeInBits(GV->getType());
if (BitWidth < 64)
Offset = SignExtend64(Offset, BitWidth);
unsigned Opc;
if (GV->isThreadLocal())
Opc = isTargetGA ? ISD::TargetGlobalTLSAddress : ISD::GlobalTLSAddress;
else
Opc = isTargetGA ? ISD::TargetGlobalAddress : ISD::GlobalAddress;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddPointer(GV);
ID.AddInteger(Offset);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP))
return SDValue(E, 0);
auto *N = newSDNode<GlobalAddressSDNode>(
Opc, DL.getIROrder(), DL.getDebugLoc(), GV, VT, Offset, TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getFrameIndex(int FI, EVT VT, bool isTarget) {
unsigned Opc = isTarget ? ISD::TargetFrameIndex : ISD::FrameIndex;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddInteger(FI);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<FrameIndexSDNode>(FI, VT, isTarget);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getJumpTable(int JTI, EVT VT, bool isTarget,
unsigned char TargetFlags) {
assert((TargetFlags == 0 || isTarget) &&
"Cannot set target flags on target-independent jump tables");
unsigned Opc = isTarget ? ISD::TargetJumpTable : ISD::JumpTable;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddInteger(JTI);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<JumpTableSDNode>(JTI, VT, isTarget, TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getConstantPool(const Constant *C, EVT VT,
unsigned Alignment, int Offset,
bool isTarget,
unsigned char TargetFlags) {
assert((TargetFlags == 0 || isTarget) &&
"Cannot set target flags on target-independent globals");
if (Alignment == 0)
Alignment = MF->getFunction()->optForSize()
? getDataLayout().getABITypeAlignment(C->getType())
: getDataLayout().getPrefTypeAlignment(C->getType());
unsigned Opc = isTarget ? ISD::TargetConstantPool : ISD::ConstantPool;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddInteger(Alignment);
ID.AddInteger(Offset);
ID.AddPointer(C);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<ConstantPoolSDNode>(isTarget, C, VT, Offset, Alignment,
TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getConstantPool(MachineConstantPoolValue *C, EVT VT,
unsigned Alignment, int Offset,
bool isTarget,
unsigned char TargetFlags) {
assert((TargetFlags == 0 || isTarget) &&
"Cannot set target flags on target-independent globals");
if (Alignment == 0)
Alignment = getDataLayout().getPrefTypeAlignment(C->getType());
unsigned Opc = isTarget ? ISD::TargetConstantPool : ISD::ConstantPool;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddInteger(Alignment);
ID.AddInteger(Offset);
C->addSelectionDAGCSEId(ID);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<ConstantPoolSDNode>(isTarget, C, VT, Offset, Alignment,
TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getTargetIndex(int Index, EVT VT, int64_t Offset,
unsigned char TargetFlags) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::TargetIndex, getVTList(VT), None);
ID.AddInteger(Index);
ID.AddInteger(Offset);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<TargetIndexSDNode>(Index, VT, Offset, TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getBasicBlock(MachineBasicBlock *MBB) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::BasicBlock, getVTList(MVT::Other), None);
ID.AddPointer(MBB);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<BasicBlockSDNode>(MBB);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getValueType(EVT VT) {
if (VT.isSimple() && (unsigned)VT.getSimpleVT().SimpleTy >=
ValueTypeNodes.size())
ValueTypeNodes.resize(VT.getSimpleVT().SimpleTy+1);
SDNode *&N = VT.isExtended() ?
ExtendedValueTypeNodes[VT] : ValueTypeNodes[VT.getSimpleVT().SimpleTy];
if (N) return SDValue(N, 0);
N = newSDNode<VTSDNode>(VT);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getExternalSymbol(const char *Sym, EVT VT) {
SDNode *&N = ExternalSymbols[Sym];
if (N) return SDValue(N, 0);
N = newSDNode<ExternalSymbolSDNode>(false, Sym, 0, VT);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMCSymbol(MCSymbol *Sym, EVT VT) {
SDNode *&N = MCSymbols[Sym];
if (N)
return SDValue(N, 0);
N = newSDNode<MCSymbolSDNode>(Sym, VT);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getTargetExternalSymbol(const char *Sym, EVT VT,
unsigned char TargetFlags) {
SDNode *&N =
TargetExternalSymbols[std::pair<std::string,unsigned char>(Sym,
TargetFlags)];
if (N) return SDValue(N, 0);
N = newSDNode<ExternalSymbolSDNode>(true, Sym, TargetFlags, VT);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getCondCode(ISD::CondCode Cond) {
if ((unsigned)Cond >= CondCodeNodes.size())
CondCodeNodes.resize(Cond+1);
if (!CondCodeNodes[Cond]) {
auto *N = newSDNode<CondCodeSDNode>(Cond);
CondCodeNodes[Cond] = N;
InsertNode(N);
}
return SDValue(CondCodeNodes[Cond], 0);
}
/// Swaps the values of N1 and N2. Swaps all indices in the shuffle mask M that
/// point at N1 to point at N2 and indices that point at N2 to point at N1.
static void commuteShuffle(SDValue &N1, SDValue &N2, MutableArrayRef<int> M) {
std::swap(N1, N2);
ShuffleVectorSDNode::commuteMask(M);
}
SDValue SelectionDAG::getVectorShuffle(EVT VT, const SDLoc &dl, SDValue N1,
SDValue N2, ArrayRef<int> Mask) {
assert(VT.getVectorNumElements() == Mask.size() &&
"Must have the same number of vector elements as mask elements!");
assert(VT == N1.getValueType() && VT == N2.getValueType() &&
"Invalid VECTOR_SHUFFLE");
// Canonicalize shuffle undef, undef -> undef
if (N1.isUndef() && N2.isUndef())
return getUNDEF(VT);
// Validate that all indices in Mask are within the range of the elements
// input to the shuffle.
int NElts = Mask.size();
assert(llvm::all_of(Mask, [&](int M) { return M < (NElts * 2); }) &&
"Index out of range");
// Copy the mask so we can do any needed cleanup.
SmallVector<int, 8> MaskVec(Mask.begin(), Mask.end());
// Canonicalize shuffle v, v -> v, undef
if (N1 == N2) {
N2 = getUNDEF(VT);
for (int i = 0; i != NElts; ++i)
if (MaskVec[i] >= NElts) MaskVec[i] -= NElts;
}
// Canonicalize shuffle undef, v -> v, undef. Commute the shuffle mask.
if (N1.isUndef())
commuteShuffle(N1, N2, MaskVec);
// If shuffling a splat, try to blend the splat instead. We do this here so
// that even when this arises during lowering we don't have to re-handle it.
auto BlendSplat = [&](BuildVectorSDNode *BV, int Offset) {
BitVector UndefElements;
SDValue Splat = BV->getSplatValue(&UndefElements);
if (!Splat)
return;
for (int i = 0; i < NElts; ++i) {
if (MaskVec[i] < Offset || MaskVec[i] >= (Offset + NElts))
continue;
// If this input comes from undef, mark it as such.
if (UndefElements[MaskVec[i] - Offset]) {
MaskVec[i] = -1;
continue;
}
// If we can blend a non-undef lane, use that instead.
if (!UndefElements[i])
MaskVec[i] = i + Offset;
}
};
if (auto *N1BV = dyn_cast<BuildVectorSDNode>(N1))
BlendSplat(N1BV, 0);
if (auto *N2BV = dyn_cast<BuildVectorSDNode>(N2))
BlendSplat(N2BV, NElts);
// Canonicalize: if every index selects from the lhs, -> shuffle lhs, undef;
// if every index selects from the rhs, -> shuffle rhs, undef.
bool AllLHS = true, AllRHS = true;
bool N2Undef = N2.isUndef();
for (int i = 0; i != NElts; ++i) {
if (MaskVec[i] >= NElts) {
if (N2Undef)
MaskVec[i] = -1;
else
AllLHS = false;
} else if (MaskVec[i] >= 0) {
AllRHS = false;
}
}
if (AllLHS && AllRHS)
return getUNDEF(VT);
if (AllLHS && !N2Undef)
N2 = getUNDEF(VT);
if (AllRHS) {
N1 = getUNDEF(VT);
commuteShuffle(N1, N2, MaskVec);
}
// Reset our undef status after accounting for the mask.
N2Undef = N2.isUndef();
// Re-check whether both sides ended up undef.
if (N1.isUndef() && N2Undef)
return getUNDEF(VT);
// If this is an identity shuffle, return the lhs operand directly.
bool Identity = true, AllSame = true;
for (int i = 0; i != NElts; ++i) {
if (MaskVec[i] >= 0 && MaskVec[i] != i) Identity = false;
if (MaskVec[i] != MaskVec[0]) AllSame = false;
}
if (Identity && NElts)
return N1;
// Shuffling a constant splat doesn't change the result.
if (N2Undef) {
SDValue V = N1;
// Look through any bitcasts. We check that these don't change the number
// (and size) of elements; they may only change the element types.
while (V.getOpcode() == ISD::BITCAST)
V = V->getOperand(0);
// A splat should always show up as a build vector node.
if (auto *BV = dyn_cast<BuildVectorSDNode>(V)) {
BitVector UndefElements;
SDValue Splat = BV->getSplatValue(&UndefElements);
// If this is a splat of an undef, shuffling it is also undef.
if (Splat && Splat.isUndef())
return getUNDEF(VT);
bool SameNumElts =
V.getValueType().getVectorNumElements() == VT.getVectorNumElements();
// We can only skip the shuffle for a splat when there is a splatted value
// and no undef lanes rearranged by the shuffle.
if (Splat && UndefElements.none()) {
// Splat of <x, x, ..., x>, return <x, x, ..., x>, provided that the
// number of elements match or the value splatted is a zero constant.
if (SameNumElts)
return N1;
if (auto *C = dyn_cast<ConstantSDNode>(Splat))
if (C->isNullValue())
return N1;
}
// If the shuffle itself creates a splat, build the vector directly.
if (AllSame && SameNumElts) {
EVT BuildVT = BV->getValueType(0);
const SDValue &Splatted = BV->getOperand(MaskVec[0]);
SDValue NewBV = getSplatBuildVector(BuildVT, dl, Splatted);
// We may have jumped through bitcasts, so the type of the
// BUILD_VECTOR may not match the type of the shuffle.
if (BuildVT != VT)
NewBV = getNode(ISD::BITCAST, dl, VT, NewBV);
return NewBV;
}
}
}
FoldingSetNodeID ID;
SDValue Ops[2] = { N1, N2 };
AddNodeIDNode(ID, ISD::VECTOR_SHUFFLE, getVTList(VT), Ops);
for (int i = 0; i != NElts; ++i)
ID.AddInteger(MaskVec[i]);
void* IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP))
return SDValue(E, 0);
// Allocate the mask array for the node out of the BumpPtrAllocator, since
// SDNode doesn't have access to it. This memory will be "leaked" when the
// node is deallocated, but recovered when the OperandAllocator is released.
int *MaskAlloc = OperandAllocator.Allocate<int>(NElts);
std::copy(MaskVec.begin(), MaskVec.end(), MaskAlloc);
auto *N = newSDNode<ShuffleVectorSDNode>(VT, dl.getIROrder(),
dl.getDebugLoc(), MaskAlloc);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
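// Worked example of the canonicalizations above for a 4-element shuffle:
// getVectorShuffle(VT, dl, V, V, <4,1,6,3>) first folds the duplicated
// input into shuffle(V, undef, <0,1,2,3>), and the identity check then
// returns V directly without creating a VECTOR_SHUFFLE node.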
SDValue SelectionDAG::getCommutedVectorShuffle(const ShuffleVectorSDNode &SV) {
MVT VT = SV.getSimpleValueType(0);
SmallVector<int, 8> MaskVec(SV.getMask().begin(), SV.getMask().end());
ShuffleVectorSDNode::commuteMask(MaskVec);
SDValue Op0 = SV.getOperand(0);
SDValue Op1 = SV.getOperand(1);
return getVectorShuffle(VT, SDLoc(&SV), Op1, Op0, MaskVec);
}
SDValue SelectionDAG::getRegister(unsigned RegNo, EVT VT) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::Register, getVTList(VT), None);
ID.AddInteger(RegNo);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<RegisterSDNode>(RegNo, VT);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getRegisterMask(const uint32_t *RegMask) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::RegisterMask, getVTList(MVT::Untyped), None);
ID.AddPointer(RegMask);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<RegisterMaskSDNode>(RegMask);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getEHLabel(const SDLoc &dl, SDValue Root,
MCSymbol *Label) {
FoldingSetNodeID ID;
SDValue Ops[] = { Root };
AddNodeIDNode(ID, ISD::EH_LABEL, getVTList(MVT::Other), Ops);
ID.AddPointer(Label);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<EHLabelSDNode>(dl.getIROrder(), dl.getDebugLoc(), Label);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getBlockAddress(const BlockAddress *BA, EVT VT,
int64_t Offset,
bool isTarget,
unsigned char TargetFlags) {
unsigned Opc = isTarget ? ISD::TargetBlockAddress : ISD::BlockAddress;
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, getVTList(VT), None);
ID.AddPointer(BA);
ID.AddInteger(Offset);
ID.AddInteger(TargetFlags);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<BlockAddressSDNode>(Opc, VT, BA, Offset, TargetFlags);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getSrcValue(const Value *V) {
assert((!V || V->getType()->isPointerTy()) &&
"SrcValue is not a pointer?");
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::SRCVALUE, getVTList(MVT::Other), None);
ID.AddPointer(V);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<SrcValueSDNode>(V);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMDNode(const MDNode *MD) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::MDNODE_SDNODE, getVTList(MVT::Other), None);
ID.AddPointer(MD);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, IP))
return SDValue(E, 0);
auto *N = newSDNode<MDNodeSDNode>(MD);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getBitcast(EVT VT, SDValue V) {
if (VT == V.getValueType())
return V;
return getNode(ISD::BITCAST, SDLoc(V), VT, V);
}
SDValue SelectionDAG::getAddrSpaceCast(const SDLoc &dl, EVT VT, SDValue Ptr,
unsigned SrcAS, unsigned DestAS) {
SDValue Ops[] = {Ptr};
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::ADDRSPACECAST, getVTList(VT), Ops);
ID.AddInteger(SrcAS);
ID.AddInteger(DestAS);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP))
return SDValue(E, 0);
auto *N = newSDNode<AddrSpaceCastSDNode>(dl.getIROrder(), dl.getDebugLoc(),
VT, SrcAS, DestAS);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
/// getShiftAmountOperand - Return the specified value casted to
/// the target's desired shift amount type.
SDValue SelectionDAG::getShiftAmountOperand(EVT LHSTy, SDValue Op) {
EVT OpTy = Op.getValueType();
EVT ShTy = TLI->getShiftAmountTy(LHSTy, getDataLayout());
if (OpTy == ShTy || OpTy.isVector()) return Op;
return getZExtOrTrunc(Op, SDLoc(Op), ShTy);
}
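// Example: a scalar i64 shift whose amount arrives as an i8 gets the amount
// zero-extended (or truncated) to the target's preferred shift amount type;
// vector shift amounts are passed through unchanged.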
SDValue SelectionDAG::expandVAArg(SDNode *Node) {
SDLoc dl(Node);
const TargetLowering &TLI = getTargetLoweringInfo();
const Value *V = cast<SrcValueSDNode>(Node->getOperand(2))->getValue();
EVT VT = Node->getValueType(0);
SDValue Tmp1 = Node->getOperand(0);
SDValue Tmp2 = Node->getOperand(1);
unsigned Align = Node->getConstantOperandVal(3);
SDValue VAListLoad = getLoad(TLI.getPointerTy(getDataLayout()), dl, Tmp1,
Tmp2, MachinePointerInfo(V));
SDValue VAList = VAListLoad;
if (Align > TLI.getMinStackArgumentAlignment()) {
assert(((Align & (Align-1)) == 0) && "Expected Align to be a power of 2");
VAList = getNode(ISD::ADD, dl, VAList.getValueType(), VAList,
getConstant(Align - 1, dl, VAList.getValueType()));
VAList = getNode(ISD::AND, dl, VAList.getValueType(), VAList,
getConstant(-(int64_t)Align, dl, VAList.getValueType()));
}
// Increment the pointer, VAList, to the next vaarg
Tmp1 = getNode(ISD::ADD, dl, VAList.getValueType(), VAList,
getConstant(getDataLayout().getTypeAllocSize(
VT.getTypeForEVT(*getContext())),
dl, VAList.getValueType()));
// Store the incremented VAList to the legalized pointer
Tmp1 =
getStore(VAListLoad.getValue(1), dl, Tmp1, Tmp2, MachinePointerInfo(V));
// Load the actual argument out of the pointer VAList
return getLoad(VT, dl, Tmp1, VAList, MachinePointerInfo());
}
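// The over-alignment step above is the usual round-up-to-a-multiple trick,
//   aligned = (p + Align - 1) & -Align
// which for a power-of-two Align clears the low log2(Align) bits: with
// Align = 8, p = 0x1005 rounds up to 0x1008.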
SDValue SelectionDAG::expandVACopy(SDNode *Node) {
SDLoc dl(Node);
const TargetLowering &TLI = getTargetLoweringInfo();
// This defaults to loading a pointer from the input and storing it to the
// output, returning the chain.
const Value *VD = cast<SrcValueSDNode>(Node->getOperand(3))->getValue();
const Value *VS = cast<SrcValueSDNode>(Node->getOperand(4))->getValue();
SDValue Tmp1 =
getLoad(TLI.getPointerTy(getDataLayout()), dl, Node->getOperand(0),
Node->getOperand(2), MachinePointerInfo(VS));
return getStore(Tmp1.getValue(1), dl, Tmp1, Node->getOperand(1),
MachinePointerInfo(VD));
}
SDValue SelectionDAG::CreateStackTemporary(EVT VT, unsigned minAlign) {
MachineFrameInfo &MFI = getMachineFunction().getFrameInfo();
unsigned ByteSize = VT.getStoreSize();
Type *Ty = VT.getTypeForEVT(*getContext());
unsigned StackAlign =
std::max((unsigned)getDataLayout().getPrefTypeAlignment(Ty), minAlign);
int FrameIdx = MFI.CreateStackObject(ByteSize, StackAlign, false);
return getFrameIndex(FrameIdx, TLI->getFrameIndexTy(getDataLayout()));
}
SDValue SelectionDAG::CreateStackTemporary(EVT VT1, EVT VT2) {
unsigned Bytes = std::max(VT1.getStoreSize(), VT2.getStoreSize());
Type *Ty1 = VT1.getTypeForEVT(*getContext());
Type *Ty2 = VT2.getTypeForEVT(*getContext());
const DataLayout &DL = getDataLayout();
unsigned Align =
std::max(DL.getPrefTypeAlignment(Ty1), DL.getPrefTypeAlignment(Ty2));
MachineFrameInfo &MFI = getMachineFunction().getFrameInfo();
int FrameIdx = MFI.CreateStackObject(Bytes, Align, false);
return getFrameIndex(FrameIdx, TLI->getFrameIndexTy(getDataLayout()));
}
SDValue SelectionDAG::FoldSetCC(EVT VT, SDValue N1, SDValue N2,
ISD::CondCode Cond, const SDLoc &dl) {
// These setcc operations always fold.
switch (Cond) {
default: break;
case ISD::SETFALSE:
case ISD::SETFALSE2: return getConstant(0, dl, VT);
case ISD::SETTRUE:
case ISD::SETTRUE2: {
TargetLowering::BooleanContent Cnt =
TLI->getBooleanContents(N1->getValueType(0));
return getConstant(
Cnt == TargetLowering::ZeroOrNegativeOneBooleanContent ? -1ULL : 1, dl,
VT);
}
case ISD::SETOEQ:
case ISD::SETOGT:
case ISD::SETOGE:
case ISD::SETOLT:
case ISD::SETOLE:
case ISD::SETONE:
case ISD::SETO:
case ISD::SETUO:
case ISD::SETUEQ:
case ISD::SETUNE:
assert(!N1.getValueType().isInteger() && "Illegal setcc for integer!");
break;
}
if (ConstantSDNode *N2C = dyn_cast<ConstantSDNode>(N2)) {
const APInt &C2 = N2C->getAPIntValue();
if (ConstantSDNode *N1C = dyn_cast<ConstantSDNode>(N1)) {
const APInt &C1 = N1C->getAPIntValue();
switch (Cond) {
default: llvm_unreachable("Unknown integer setcc!");
case ISD::SETEQ: return getConstant(C1 == C2, dl, VT);
case ISD::SETNE: return getConstant(C1 != C2, dl, VT);
case ISD::SETULT: return getConstant(C1.ult(C2), dl, VT);
case ISD::SETUGT: return getConstant(C1.ugt(C2), dl, VT);
case ISD::SETULE: return getConstant(C1.ule(C2), dl, VT);
case ISD::SETUGE: return getConstant(C1.uge(C2), dl, VT);
case ISD::SETLT: return getConstant(C1.slt(C2), dl, VT);
case ISD::SETGT: return getConstant(C1.sgt(C2), dl, VT);
case ISD::SETLE: return getConstant(C1.sle(C2), dl, VT);
case ISD::SETGE: return getConstant(C1.sge(C2), dl, VT);
}
}
}
if (ConstantFPSDNode *N1C = dyn_cast<ConstantFPSDNode>(N1)) {
if (ConstantFPSDNode *N2C = dyn_cast<ConstantFPSDNode>(N2)) {
APFloat::cmpResult R = N1C->getValueAPF().compare(N2C->getValueAPF());
switch (Cond) {
default: break;
case ISD::SETEQ: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETOEQ: return getConstant(R==APFloat::cmpEqual, dl, VT);
case ISD::SETNE: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETONE: return getConstant(R==APFloat::cmpGreaterThan ||
R==APFloat::cmpLessThan, dl, VT);
case ISD::SETLT: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETOLT: return getConstant(R==APFloat::cmpLessThan, dl, VT);
case ISD::SETGT: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETOGT: return getConstant(R==APFloat::cmpGreaterThan, dl, VT);
case ISD::SETLE: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETOLE: return getConstant(R==APFloat::cmpLessThan ||
R==APFloat::cmpEqual, dl, VT);
case ISD::SETGE: if (R==APFloat::cmpUnordered)
return getUNDEF(VT);
LLVM_FALLTHROUGH;
case ISD::SETOGE: return getConstant(R==APFloat::cmpGreaterThan ||
R==APFloat::cmpEqual, dl, VT);
case ISD::SETO: return getConstant(R!=APFloat::cmpUnordered, dl, VT);
case ISD::SETUO: return getConstant(R==APFloat::cmpUnordered, dl, VT);
case ISD::SETUEQ: return getConstant(R==APFloat::cmpUnordered ||
R==APFloat::cmpEqual, dl, VT);
case ISD::SETUNE: return getConstant(R!=APFloat::cmpEqual, dl, VT);
case ISD::SETULT: return getConstant(R==APFloat::cmpUnordered ||
R==APFloat::cmpLessThan, dl, VT);
case ISD::SETUGT: return getConstant(R==APFloat::cmpGreaterThan ||
R==APFloat::cmpUnordered, dl, VT);
case ISD::SETULE: return getConstant(R!=APFloat::cmpGreaterThan, dl, VT);
case ISD::SETUGE: return getConstant(R!=APFloat::cmpLessThan, dl, VT);
}
} else {
// Ensure that the constant occurs on the RHS.
ISD::CondCode SwappedCond = ISD::getSetCCSwappedOperands(Cond);
MVT CompVT = N1.getValueType().getSimpleVT();
if (!TLI->isCondCodeLegal(SwappedCond, CompVT))
return SDValue();
return getSetCC(dl, VT, N2, N1, SwappedCond);
}
}
// Could not fold it.
return SDValue();
}
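// Worked examples: FoldSetCC(VT, C(3), C(5), SETULT, dl) folds to the
// constant 1 (true), and a float comparison whose APFloat result is
// cmpUnordered (a NaN operand) folds the mixed predicates such as SETEQ
// to UNDEF rather than picking an ordering.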
/// SignBitIsZero - Return true if the sign bit of Op is known to be zero. We
/// use this predicate to simplify operations downstream.
bool SelectionDAG::SignBitIsZero(SDValue Op, unsigned Depth) const {
unsigned BitWidth = Op.getScalarValueSizeInBits();
return MaskedValueIsZero(Op, APInt::getSignMask(BitWidth), Depth);
}
/// MaskedValueIsZero - Return true if 'V & Mask' is known to be zero. We use
/// this predicate to simplify operations downstream. Mask is known to be zero
/// for bits that V cannot have.
bool SelectionDAG::MaskedValueIsZero(SDValue Op, const APInt &Mask,
unsigned Depth) const {
KnownBits Known;
computeKnownBits(Op, Known, Depth);
return Mask.isSubsetOf(Known.Zero);
}
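// Example: if Op is (and x, 0xFF) on i32, computeKnownBits proves bits
// 8..31 are zero, so MaskedValueIsZero(Op, APInt(32, 0xFFFFFF00)) is true.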
/// If a SHL/SRA/SRL node has a constant or splat constant shift amount that
/// is less than the element bit-width of the shift node, return it.
static const APInt *getValidShiftAmountConstant(SDValue V) {
if (ConstantSDNode *SA = isConstOrConstSplat(V.getOperand(1))) {
// Shifting more than the bitwidth is not valid.
const APInt &ShAmt = SA->getAPIntValue();
if (ShAmt.ult(V.getScalarValueSizeInBits()))
return &ShAmt;
}
return nullptr;
}
/// Determine which bits of Op are known to be either zero or one and return
/// them in Known. For vectors, the known bits are those that are shared by
/// every vector element.
void SelectionDAG::computeKnownBits(SDValue Op, KnownBits &Known,
unsigned Depth) const {
EVT VT = Op.getValueType();
APInt DemandedElts = VT.isVector()
? APInt::getAllOnesValue(VT.getVectorNumElements())
: APInt(1, 1);
computeKnownBits(Op, Known, DemandedElts, Depth);
}
/// Determine which bits of Op are known to be either zero or one and return
/// them in Known. The DemandedElts argument allows us to only collect the known
/// bits that are shared by the requested vector elements.
void SelectionDAG::computeKnownBits(SDValue Op, KnownBits &Known,
const APInt &DemandedElts,
unsigned Depth) const {
unsigned BitWidth = Op.getScalarValueSizeInBits();
Known = KnownBits(BitWidth); // Don't know anything.
if (Depth == 6)
return; // Limit search depth.
KnownBits Known2;
unsigned NumElts = DemandedElts.getBitWidth();
if (!DemandedElts)
return; // No demanded elts, better to assume we don't know anything.
unsigned Opcode = Op.getOpcode();
switch (Opcode) {
case ISD::Constant:
// We know all of the bits for a constant!
Known.One = cast<ConstantSDNode>(Op)->getAPIntValue();
Known.Zero = ~Known.One;
break;
case ISD::BUILD_VECTOR:
// Collect the known bits that are shared by every demanded vector element.
assert(NumElts == Op.getValueType().getVectorNumElements() &&
"Unexpected vector size");
Known.Zero.setAllBits(); Known.One.setAllBits();
for (unsigned i = 0, e = Op.getNumOperands(); i != e; ++i) {
if (!DemandedElts[i])
continue;
SDValue SrcOp = Op.getOperand(i);
computeKnownBits(SrcOp, Known2, Depth + 1);
// BUILD_VECTOR can implicitly truncate sources, we must handle this.
if (SrcOp.getValueSizeInBits() != BitWidth) {
assert(SrcOp.getValueSizeInBits() > BitWidth &&
"Expected BUILD_VECTOR implicit truncation");
Known2 = Known2.trunc(BitWidth);
}
// Known bits are the values that are shared by every demanded element.
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
}
break;
case ISD::VECTOR_SHUFFLE: {
// Collect the known bits that are shared by every vector element referenced
// by the shuffle.
APInt DemandedLHS(NumElts, 0), DemandedRHS(NumElts, 0);
Known.Zero.setAllBits(); Known.One.setAllBits();
const ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op);
assert(NumElts == SVN->getMask().size() && "Unexpected vector size");
for (unsigned i = 0; i != NumElts; ++i) {
if (!DemandedElts[i])
continue;
int M = SVN->getMaskElt(i);
if (M < 0) {
// For UNDEF elements, we don't know anything about the common state of
// the shuffle result.
Known.resetAll();
DemandedLHS.clearAllBits();
DemandedRHS.clearAllBits();
break;
}
if ((unsigned)M < NumElts)
DemandedLHS.setBit((unsigned)M % NumElts);
else
DemandedRHS.setBit((unsigned)M % NumElts);
}
// Known bits are the values that are shared by every demanded element.
if (!!DemandedLHS) {
SDValue LHS = Op.getOperand(0);
computeKnownBits(LHS, Known2, DemandedLHS, Depth + 1);
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
}
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
if (!!DemandedRHS) {
SDValue RHS = Op.getOperand(1);
computeKnownBits(RHS, Known2, DemandedRHS, Depth + 1);
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
}
break;
}
case ISD::CONCAT_VECTORS: {
// Split DemandedElts and test each of the demanded subvectors.
Known.Zero.setAllBits(); Known.One.setAllBits();
EVT SubVectorVT = Op.getOperand(0).getValueType();
unsigned NumSubVectorElts = SubVectorVT.getVectorNumElements();
unsigned NumSubVectors = Op.getNumOperands();
for (unsigned i = 0; i != NumSubVectors; ++i) {
APInt DemandedSub = DemandedElts.lshr(i * NumSubVectorElts);
DemandedSub = DemandedSub.trunc(NumSubVectorElts);
if (!!DemandedSub) {
SDValue Sub = Op.getOperand(i);
computeKnownBits(Sub, Known2, DemandedSub, Depth + 1);
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
}
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
}
break;
}
case ISD::EXTRACT_SUBVECTOR: {
// If we know the element index, just demand that subvector elements,
// otherwise demand them all.
SDValue Src = Op.getOperand(0);
ConstantSDNode *SubIdx = dyn_cast<ConstantSDNode>(Op.getOperand(1));
unsigned NumSrcElts = Src.getValueType().getVectorNumElements();
if (SubIdx && SubIdx->getAPIntValue().ule(NumSrcElts - NumElts)) {
// Offset the demanded elts by the subvector index.
uint64_t Idx = SubIdx->getZExtValue();
APInt DemandedSrc = DemandedElts.zext(NumSrcElts).shl(Idx);
computeKnownBits(Src, Known, DemandedSrc, Depth + 1);
} else {
computeKnownBits(Src, Known, Depth + 1);
}
break;
}
case ISD::BITCAST: {
SDValue N0 = Op.getOperand(0);
unsigned SubBitWidth = N0.getScalarValueSizeInBits();
// Ignore bitcasts from floating point.
if (!N0.getValueType().isInteger())
break;
// Fast handling of 'identity' bitcasts.
if (BitWidth == SubBitWidth) {
computeKnownBits(N0, Known, DemandedElts, Depth + 1);
break;
}
// Support big-endian targets when it becomes useful.
bool IsLE = getDataLayout().isLittleEndian();
if (!IsLE)
break;
// Bitcast 'small element' vector to 'large element' scalar/vector.
if ((BitWidth % SubBitWidth) == 0) {
assert(N0.getValueType().isVector() && "Expected bitcast from vector");
// Collect known bits for the (larger) output by collecting the known
// bits from each set of sub elements and shift these into place.
// We need to separately call computeKnownBits for each set of
// sub elements as the knownbits for each is likely to be different.
unsigned SubScale = BitWidth / SubBitWidth;
APInt SubDemandedElts(NumElts * SubScale, 0);
for (unsigned i = 0; i != NumElts; ++i)
if (DemandedElts[i])
SubDemandedElts.setBit(i * SubScale);
for (unsigned i = 0; i != SubScale; ++i) {
computeKnownBits(N0, Known2, SubDemandedElts.shl(i),
Depth + 1);
Known.One |= Known2.One.zext(BitWidth).shl(SubBitWidth * i);
Known.Zero |= Known2.Zero.zext(BitWidth).shl(SubBitWidth * i);
}
}
// Bitcast 'large element' scalar/vector to 'small element' vector.
if ((SubBitWidth % BitWidth) == 0) {
assert(Op.getValueType().isVector() && "Expected bitcast to vector");
// Collect known bits for the (smaller) output by collecting the known
// bits from the overlapping larger input elements and extracting the
// sub sections we actually care about.
unsigned SubScale = SubBitWidth / BitWidth;
APInt SubDemandedElts(NumElts / SubScale, 0);
for (unsigned i = 0; i != NumElts; ++i)
if (DemandedElts[i])
SubDemandedElts.setBit(i / SubScale);
computeKnownBits(N0, Known2, SubDemandedElts, Depth + 1);
Known.Zero.setAllBits(); Known.One.setAllBits();
for (unsigned i = 0; i != NumElts; ++i)
if (DemandedElts[i]) {
unsigned Offset = (i % SubScale) * BitWidth;
Known.One &= Known2.One.lshr(Offset).trunc(BitWidth);
Known.Zero &= Known2.Zero.lshr(Offset).trunc(BitWidth);
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
}
}
break;
}
case ISD::AND:
// If either the LHS or the RHS are Zero, the result is zero.
computeKnownBits(Op.getOperand(1), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// Output known-1 bits are only known if set in both the LHS & RHS.
Known.One &= Known2.One;
// Output known-0 bits are known to be clear if zero in either the LHS or RHS.
Known.Zero |= Known2.Zero;
break;
case ISD::OR:
computeKnownBits(Op.getOperand(1), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// Output known-0 bits are only known if clear in both the LHS & RHS.
Known.Zero &= Known2.Zero;
// Output known-1 bits are known to be set if set in either the LHS or RHS.
Known.One |= Known2.One;
break;
case ISD::XOR: {
computeKnownBits(Op.getOperand(1), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// Output known-0 bits are known if clear or set in both the LHS & RHS.
APInt KnownZeroOut = (Known.Zero & Known2.Zero) | (Known.One & Known2.One);
// Output known-1 bits are known to be set if set in only one of the LHS, RHS.
Known.One = (Known.Zero & Known2.One) | (Known.One & Known2.Zero);
Known.Zero = KnownZeroOut;
break;
}
case ISD::MUL: {
computeKnownBits(Op.getOperand(1), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// If low bits are zero in either operand, output low known-0 bits.
// Also compute a conservative estimate for high known-0 bits.
// More trickiness is possible, but this is sufficient for the
// interesting case of alignment computation.
unsigned TrailZ = Known.countMinTrailingZeros() +
Known2.countMinTrailingZeros();
unsigned LeadZ = std::max(Known.countMinLeadingZeros() +
Known2.countMinLeadingZeros(),
BitWidth) - BitWidth;
Known.resetAll();
Known.Zero.setLowBits(std::min(TrailZ, BitWidth));
Known.Zero.setHighBits(std::min(LeadZ, BitWidth));
break;
}
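// Example for the MUL case above: if one factor has 2 known trailing zero
// bits and the other has 3, the product is known to have at least 5
// trailing zeros, which is exactly the property alignment computations need.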
case ISD::UDIV: {
// For the purposes of computing leading zeros we can conservatively
// treat a udiv as a logical right shift by the power of 2 known to
// be less than the denominator.
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
unsigned LeadZ = Known2.countMinLeadingZeros();
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
unsigned RHSMaxLeadingZeros = Known2.countMaxLeadingZeros();
if (RHSMaxLeadingZeros != BitWidth)
LeadZ = std::min(BitWidth, LeadZ + BitWidth - RHSMaxLeadingZeros - 1);
Known.Zero.setHighBits(LeadZ);
break;
}
case ISD::SELECT:
computeKnownBits(Op.getOperand(2), Known, Depth+1);
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
computeKnownBits(Op.getOperand(1), Known2, Depth+1);
// Only known if known in both the LHS and RHS.
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
break;
case ISD::SELECT_CC:
computeKnownBits(Op.getOperand(3), Known, Depth+1);
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
computeKnownBits(Op.getOperand(2), Known2, Depth+1);
// Only known if known in both the LHS and RHS.
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
break;
case ISD::SMULO:
case ISD::UMULO:
if (Op.getResNo() != 1)
break;
// The boolean result conforms to getBooleanContents.
// If we know the result of a setcc has the top bits zero, use this info.
// We know that we have an integer-based boolean since these operations
// are only available for integer.
if (TLI->getBooleanContents(Op.getValueType().isVector(), false) ==
TargetLowering::ZeroOrOneBooleanContent &&
BitWidth > 1)
Known.Zero.setBitsFrom(1);
break;
case ISD::SETCC:
// If we know the result of a setcc has the top bits zero, use this info.
if (TLI->getBooleanContents(Op.getOperand(0).getValueType()) ==
TargetLowering::ZeroOrOneBooleanContent &&
BitWidth > 1)
Known.Zero.setBitsFrom(1);
break;
case ISD::SHL:
if (const APInt *ShAmt = getValidShiftAmountConstant(Op)) {
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known.Zero <<= *ShAmt;
Known.One <<= *ShAmt;
// Low bits are known zero.
Known.Zero.setLowBits(ShAmt->getZExtValue());
}
break;
case ISD::SRL:
if (const APInt *ShAmt = getValidShiftAmountConstant(Op)) {
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known.Zero.lshrInPlace(*ShAmt);
Known.One.lshrInPlace(*ShAmt);
// High bits are known zero.
Known.Zero.setHighBits(ShAmt->getZExtValue());
}
break;
case ISD::SRA:
if (const APInt *ShAmt = getValidShiftAmountConstant(Op)) {
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known.Zero.lshrInPlace(*ShAmt);
Known.One.lshrInPlace(*ShAmt);
// If we know the value of the sign bit, then we know it is copied across
// the high bits by the shift amount.
APInt SignMask = APInt::getSignMask(BitWidth);
SignMask.lshrInPlace(*ShAmt); // Adjust to where it is now in the mask.
if (Known.Zero.intersects(SignMask)) {
Known.Zero.setHighBits(ShAmt->getZExtValue()); // New bits are known zero.
} else if (Known.One.intersects(SignMask)) {
Known.One.setHighBits(ShAmt->getZExtValue()); // New bits are known one.
}
}
break;
case ISD::SIGN_EXTEND_INREG: {
EVT EVT = cast<VTSDNode>(Op.getOperand(1))->getVT();
unsigned EBits = EVT.getScalarSizeInBits();
// Sign extension. Compute the demanded bits in the result that are not
// present in the input.
APInt NewBits = APInt::getHighBitsSet(BitWidth, BitWidth - EBits);
APInt InSignMask = APInt::getSignMask(EBits);
APInt InputDemandedBits = APInt::getLowBitsSet(BitWidth, EBits);
// If the sign extended bits are demanded, we know that the sign
// bit is demanded.
InSignMask = InSignMask.zext(BitWidth);
if (NewBits.getBoolValue())
InputDemandedBits |= InSignMask;
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known.One &= InputDemandedBits;
Known.Zero &= InputDemandedBits;
// If the sign bit of the input is known set or clear, then we know the
// top bits of the result.
if (Known.Zero.intersects(InSignMask)) { // Input sign bit known clear
Known.Zero |= NewBits;
Known.One &= ~NewBits;
} else if (Known.One.intersects(InSignMask)) { // Input sign bit known set
Known.One |= NewBits;
Known.Zero &= ~NewBits;
} else { // Input sign bit unknown
Known.Zero &= ~NewBits;
Known.One &= ~NewBits;
}
break;
}
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// If we have a known 1, its position is our upper bound.
unsigned PossibleTZ = Known2.countMaxTrailingZeros();
unsigned LowBits = Log2_32(PossibleTZ) + 1;
Known.Zero.setBitsFrom(LowBits);
break;
}
case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// If we have a known 1, its position is our upper bound.
unsigned PossibleLZ = Known2.countMaxLeadingZeros();
unsigned LowBits = Log2_32(PossibleLZ) + 1;
Known.Zero.setBitsFrom(LowBits);
break;
}
case ISD::CTPOP: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// If we know some of the bits are zero, they can't be one.
unsigned PossibleOnes = Known2.countMaxPopulation();
Known.Zero.setBitsFrom(Log2_32(PossibleOnes) + 1);
break;
}
case ISD::LOAD: {
LoadSDNode *LD = cast<LoadSDNode>(Op);
// If this is a ZEXTLoad and we are looking at the loaded value.
if (ISD::isZEXTLoad(Op.getNode()) && Op.getResNo() == 0) {
EVT VT = LD->getMemoryVT();
unsigned MemBits = VT.getScalarSizeInBits();
Known.Zero.setBitsFrom(MemBits);
} else if (const MDNode *Ranges = LD->getRanges()) {
if (LD->getExtensionType() == ISD::NON_EXTLOAD)
computeKnownBitsFromRangeMetadata(*Ranges, Known);
}
break;
}
case ISD::ZERO_EXTEND_VECTOR_INREG: {
EVT InVT = Op.getOperand(0).getValueType();
unsigned InBits = InVT.getScalarSizeInBits();
Known = Known.trunc(InBits);
computeKnownBits(Op.getOperand(0), Known,
DemandedElts.zext(InVT.getVectorNumElements()),
Depth + 1);
Known = Known.zext(BitWidth);
Known.Zero.setBitsFrom(InBits);
break;
}
case ISD::ZERO_EXTEND: {
EVT InVT = Op.getOperand(0).getValueType();
unsigned InBits = InVT.getScalarSizeInBits();
Known = Known.trunc(InBits);
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known = Known.zext(BitWidth);
Known.Zero.setBitsFrom(InBits);
break;
}
// TODO ISD::SIGN_EXTEND_VECTOR_INREG
case ISD::SIGN_EXTEND: {
EVT InVT = Op.getOperand(0).getValueType();
unsigned InBits = InVT.getScalarSizeInBits();
Known = Known.trunc(InBits);
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
// If the sign bit is known to be zero or one, then sext will extend
// it to the top bits, else it will just zext.
Known = Known.sext(BitWidth);
break;
}
case ISD::ANY_EXTEND: {
EVT InVT = Op.getOperand(0).getValueType();
unsigned InBits = InVT.getScalarSizeInBits();
Known = Known.trunc(InBits);
computeKnownBits(Op.getOperand(0), Known, Depth+1);
Known = Known.zext(BitWidth);
break;
}
case ISD::TRUNCATE: {
EVT InVT = Op.getOperand(0).getValueType();
unsigned InBits = InVT.getScalarSizeInBits();
Known = Known.zext(InBits);
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
Known = Known.trunc(BitWidth);
break;
}
case ISD::AssertZext: {
EVT VT = cast<VTSDNode>(Op.getOperand(1))->getVT();
APInt InMask = APInt::getLowBitsSet(BitWidth, VT.getSizeInBits());
computeKnownBits(Op.getOperand(0), Known, Depth+1);
Known.Zero |= (~InMask);
Known.One &= (~Known.Zero);
break;
}
case ISD::FGETSIGN:
// All bits are zero except the low bit.
Known.Zero.setBitsFrom(1);
break;
case ISD::USUBO:
case ISD::SSUBO:
if (Op.getResNo() == 1) {
// If we know the result of a setcc has the top bits zero, use this info.
if (TLI->getBooleanContents(Op.getOperand(0).getValueType()) ==
TargetLowering::ZeroOrOneBooleanContent &&
BitWidth > 1)
Known.Zero.setBitsFrom(1);
break;
}
LLVM_FALLTHROUGH;
case ISD::SUB:
case ISD::SUBC: {
if (ConstantSDNode *CLHS = isConstOrConstSplat(Op.getOperand(0))) {
// We know that the top bits of C-X are clear if X contains fewer bits
// than C (i.e. no wrap-around can happen). For example, 20-X is
// positive if we can prove that X is >= 0 and < 16.
if (CLHS->getAPIntValue().isNonNegative()) {
unsigned NLZ = (CLHS->getAPIntValue()+1).countLeadingZeros();
// NLZ can't be BitWidth: C+1 is nonzero because C's sign bit is clear.
APInt MaskV = APInt::getHighBitsSet(BitWidth, NLZ+1);
computeKnownBits(Op.getOperand(1), Known2, DemandedElts,
Depth + 1);
// If all of the MaskV bits are known to be zero, then we know the
// output top bits are zero, because we now know that the output is
// in the range [0, C].
if ((Known2.Zero & MaskV) == MaskV) {
unsigned NLZ2 = CLHS->getAPIntValue().countLeadingZeros();
// Top bits known zero.
Known.Zero.setHighBits(NLZ2);
}
}
}
// If low bits are known to be zero in both operands, then we know they are
// going to be 0 in the result. Both addition and complement operations
// preserve the low zero bits.
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
unsigned KnownZeroLow = Known2.countMinTrailingZeros();
if (KnownZeroLow == 0)
break;
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
KnownZeroLow = std::min(KnownZeroLow, Known2.countMinTrailingZeros());
Known.Zero.setLowBits(KnownZeroLow);
break;
}
case ISD::UADDO:
case ISD::SADDO:
case ISD::ADDCARRY:
if (Op.getResNo() == 1) {
// If we know the result of a setcc has the top bits zero, use this info.
if (TLI->getBooleanContents(Op.getOperand(0).getValueType()) ==
TargetLowering::ZeroOrOneBooleanContent &&
BitWidth > 1)
Known.Zero.setBitsFrom(1);
break;
}
LLVM_FALLTHROUGH;
case ISD::ADD:
case ISD::ADDC:
case ISD::ADDE: {
// The low known-0 bits of the output are the low bits known to be clear
// in both LHS and RHS. For example, 8+(X<<3) is known to have the
// low 3 bits clear.
// Output known-0 bits are also known if the top bits of each input are
// known to be clear. For example, if one input has the top 10 bits clear
// and the other has the top 8 bits clear, we know the top 7 bits of the
// output must be clear.
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
unsigned KnownZeroHigh = Known2.countMinLeadingZeros();
unsigned KnownZeroLow = Known2.countMinTrailingZeros();
computeKnownBits(Op.getOperand(1), Known2, DemandedElts,
Depth + 1);
KnownZeroHigh = std::min(KnownZeroHigh, Known2.countMinLeadingZeros());
KnownZeroLow = std::min(KnownZeroLow, Known2.countMinTrailingZeros());
if (Opcode == ISD::ADDE || Opcode == ISD::ADDCARRY) {
// With ADDE and ADDCARRY, a carry bit may be added in, so we can only
// use this information if we know (at least) that the low two bits are
// clear. We then return to the caller that the low bit is unknown but
// that other bits are known zero.
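// For example, if both operands have their low three bits clear, the
// sum plus carry-in has bits [1, 3) clear, while bit 0 may hold the
// incoming carry.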
if (KnownZeroLow >= 2)
Known.Zero.setBits(1, KnownZeroLow);
break;
}
Known.Zero.setLowBits(KnownZeroLow);
if (KnownZeroHigh > 1)
Known.Zero.setHighBits(KnownZeroHigh - 1);
break;
}
case ISD::SREM:
if (ConstantSDNode *Rem = isConstOrConstSplat(Op.getOperand(1))) {
const APInt &RA = Rem->getAPIntValue().abs();
if (RA.isPowerOf2()) {
APInt LowBits = RA - 1;
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// The low bits of the first operand are unchanged by the srem.
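// For example, for X srem 8 the low 3 bits of X pass through unchanged.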
Known.Zero = Known2.Zero & LowBits;
Known.One = Known2.One & LowBits;
// If the first operand is non-negative or has all low bits zero, then
// the upper bits are all zero.
if (Known2.Zero[BitWidth-1] || ((Known2.Zero & LowBits) == LowBits))
Known.Zero |= ~LowBits;
// If the first operand is negative and not all low bits are zero, then
// the upper bits are all one.
if (Known2.One[BitWidth-1] && ((Known2.One & LowBits) != 0))
Known.One |= ~LowBits;
assert((Known.Zero & Known.One) == 0 && "Bits known to be one AND zero?");
}
}
break;
case ISD::UREM: {
if (ConstantSDNode *Rem = isConstOrConstSplat(Op.getOperand(1))) {
const APInt &RA = Rem->getAPIntValue();
if (RA.isPowerOf2()) {
APInt LowBits = (RA - 1);
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// The upper bits are all zero, the lower ones are unchanged.
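// For example, X urem 16 keeps the low 4 bits of X and clears the rest.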
Known.Zero = Known2.Zero | ~LowBits;
Known.One = Known2.One & LowBits;
break;
}
}
// Since the result is less than or equal to either operand, any leading
// zero bits in either operand must also exist in the result.
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
uint32_t Leaders =
std::max(Known.countMinLeadingZeros(), Known2.countMinLeadingZeros());
Known.resetAll();
Known.Zero.setHighBits(Leaders);
break;
}
case ISD::EXTRACT_ELEMENT: {
computeKnownBits(Op.getOperand(0), Known, Depth+1);
const unsigned Index = Op.getConstantOperandVal(1);
const unsigned BitWidth = Op.getValueSizeInBits();
// Remove low part of known bits mask
Known.Zero = Known.Zero.getHiBits(Known.Zero.getBitWidth() - Index * BitWidth);
Known.One = Known.One.getHiBits(Known.One.getBitWidth() - Index * BitWidth);
// Remove high part of known bit mask
Known = Known.trunc(BitWidth);
break;
}
case ISD::EXTRACT_VECTOR_ELT: {
SDValue InVec = Op.getOperand(0);
SDValue EltNo = Op.getOperand(1);
EVT VecVT = InVec.getValueType();
const unsigned BitWidth = Op.getValueSizeInBits();
const unsigned EltBitWidth = VecVT.getScalarSizeInBits();
const unsigned NumSrcElts = VecVT.getVectorNumElements();
// If BitWidth > EltBitWidth the value is any-extended. So we do not know
// anything about the extended bits.
if (BitWidth > EltBitWidth)
Known = Known.trunc(EltBitWidth);
ConstantSDNode *ConstEltNo = dyn_cast<ConstantSDNode>(EltNo);
if (ConstEltNo && ConstEltNo->getAPIntValue().ult(NumSrcElts)) {
// If we know the element index, just demand that vector element.
unsigned Idx = ConstEltNo->getZExtValue();
APInt DemandedElt = APInt::getOneBitSet(NumSrcElts, Idx);
computeKnownBits(InVec, Known, DemandedElt, Depth + 1);
} else {
// Unknown element index, so ignore DemandedElts and demand them all.
computeKnownBits(InVec, Known, Depth + 1);
}
if (BitWidth > EltBitWidth)
Known = Known.zext(BitWidth);
break;
}
case ISD::INSERT_VECTOR_ELT: {
SDValue InVec = Op.getOperand(0);
SDValue InVal = Op.getOperand(1);
SDValue EltNo = Op.getOperand(2);
ConstantSDNode *CEltNo = dyn_cast<ConstantSDNode>(EltNo);
if (CEltNo && CEltNo->getAPIntValue().ult(NumElts)) {
// If we know the element index, split the demand between the
// source vector and the inserted element.
Known.Zero = Known.One = APInt::getAllOnesValue(BitWidth);
unsigned EltIdx = CEltNo->getZExtValue();
// If we demand the inserted element then add its common known bits.
if (DemandedElts[EltIdx]) {
computeKnownBits(InVal, Known2, Depth + 1);
Known.One &= Known2.One.zextOrTrunc(Known.One.getBitWidth());
Known.Zero &= Known2.Zero.zextOrTrunc(Known.Zero.getBitWidth());
}
// If we demand the source vector then add its common known bits, ensuring
// that we don't demand the inserted element.
APInt VectorElts = DemandedElts & ~(APInt::getOneBitSet(NumElts, EltIdx));
if (!!VectorElts) {
computeKnownBits(InVec, Known2, VectorElts, Depth + 1);
Known.One &= Known2.One;
Known.Zero &= Known2.Zero;
}
} else {
// Unknown element index, so ignore DemandedElts and demand them all.
computeKnownBits(InVec, Known, Depth + 1);
computeKnownBits(InVal, Known2, Depth + 1);
Known.One &= Known2.One.zextOrTrunc(Known.One.getBitWidth());
Known.Zero &= Known2.Zero.zextOrTrunc(Known.Zero.getBitWidth());
}
break;
}
case ISD::BITREVERSE: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
Known.Zero = Known2.Zero.reverseBits();
Known.One = Known2.One.reverseBits();
break;
}
case ISD::BSWAP: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
Known.Zero = Known2.Zero.byteSwap();
Known.One = Known2.One.byteSwap();
break;
}
case ISD::ABS: {
computeKnownBits(Op.getOperand(0), Known2, DemandedElts, Depth + 1);
// If the source's MSB is zero then we know the rest of the bits already.
if (Known2.isNonNegative()) {
Known.Zero = Known2.Zero;
Known.One = Known2.One;
break;
}
// We only know that the absolute value's MSB will be zero iff there is
// a set bit that isn't the sign bit (otherwise it could be INT_MIN).
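// For example, for an i8 value a known one in bit 1 rules out -128, the
// only i8 whose absolute value keeps the sign bit set.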
Known2.One.clearSignBit();
if (Known2.One.getBoolValue()) {
Known.Zero = APInt::getSignMask(BitWidth);
break;
}
break;
}
case ISD::UMIN: {
computeKnownBits(Op.getOperand(0), Known, DemandedElts, Depth + 1);
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
// UMIN - the result has at least as many leading zero bits as the
// operand with the most known leading zeros.
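// For example, operands with at least 3 and at least 5 known leading
// zeros produce a minimum with at least 5 leading zeros.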
unsigned LeadZero = Known.countMinLeadingZeros();
LeadZero = std::max(LeadZero, Known2.countMinLeadingZeros());
Known.Zero &= Known2.Zero;
Known.One &= Known2.One;
Known.Zero.setHighBits(LeadZero);
break;
}
case ISD::UMAX: {
computeKnownBits(Op.getOperand(0), Known, DemandedElts,
Depth + 1);
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
// UMAX - the result has at least as many leading one bits as the
// operand with the most known leading ones.
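// For example, operands with at least 3 and at least 5 known leading
// ones produce a maximum with at least 5 leading ones.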
unsigned LeadOne = Known.countMinLeadingOnes();
LeadOne = std::max(LeadOne, Known2.countMinLeadingOnes());
Known.Zero &= Known2.Zero;
Known.One &= Known2.One;
Known.One.setHighBits(LeadOne);
break;
}
case ISD::SMIN:
case ISD::SMAX: {
computeKnownBits(Op.getOperand(0), Known, DemandedElts,
Depth + 1);
// If we don't know any bits, early out.
if (!Known.One && !Known.Zero)
break;
computeKnownBits(Op.getOperand(1), Known2, DemandedElts, Depth + 1);
Known.Zero &= Known2.Zero;
Known.One &= Known2.One;
break;
}
case ISD::FrameIndex:
case ISD::TargetFrameIndex:
if (unsigned Align = InferPtrAlignment(Op)) {
// The low bits are known zero if the pointer is aligned.
Known.Zero.setLowBits(Log2_32(Align));
break;
}
break;
default:
if (Opcode < ISD::BUILTIN_OP_END)
break;
LLVM_FALLTHROUGH;
case ISD::INTRINSIC_WO_CHAIN:
case ISD::INTRINSIC_W_CHAIN:
case ISD::INTRINSIC_VOID:
// Allow the target to implement this method for its nodes.
TLI->computeKnownBitsForTargetNode(Op, Known, DemandedElts, *this, Depth);
break;
}
assert((Known.Zero & Known.One) == 0 && "Bits known to be one AND zero?");
}
SelectionDAG::OverflowKind SelectionDAG::computeOverflowKind(SDValue N0,
SDValue N1) const {
// X + 0 never overflows
if (isNullConstant(N1))
return OFK_Never;
KnownBits N1Known;
computeKnownBits(N1, N1Known);
if (N1Known.Zero.getBoolValue()) {
KnownBits N0Known;
computeKnownBits(N0, N0Known);
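// ~Known.Zero is the largest value each operand can take, so if even the
// two maxima add without unsigned overflow, the real sum cannot overflow.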
bool overflow;
(void)(~N0Known.Zero).uadd_ov(~N1Known.Zero, overflow);
if (!overflow)
return OFK_Never;
}
// mulhi + 1 never overflows: the high half of a full multiply is at most
// 2^BitWidth - 2, so adding a value known to be 0 or 1 cannot wrap.
if (N0.getOpcode() == ISD::UMUL_LOHI && N0.getResNo() == 1 &&
(~N1Known.Zero & 0x01) == ~N1Known.Zero)
return OFK_Never;
if (N1.getOpcode() == ISD::UMUL_LOHI && N1.getResNo() == 1) {
KnownBits N0Known;
computeKnownBits(N0, N0Known);
if ((~N0Known.Zero & 0x01) == ~N0Known.Zero)
return OFK_Never;
}
return OFK_Sometime;
}
bool SelectionDAG::isKnownToBeAPowerOfTwo(SDValue Val) const {
EVT OpVT = Val.getValueType();
unsigned BitWidth = OpVT.getScalarSizeInBits();
// Is the constant a known power of 2?
if (ConstantSDNode *Const = dyn_cast<ConstantSDNode>(Val))
return Const->getAPIntValue().zextOrTrunc(BitWidth).isPowerOf2();
// A left-shift of a constant one will have exactly one bit set because
// shifting the bit off the end is undefined.
if (Val.getOpcode() == ISD::SHL) {
auto *C = isConstOrConstSplat(Val.getOperand(0));
if (C && C->getAPIntValue() == 1)
return true;
}
// Similarly, a logical right-shift of a constant sign-bit will have exactly
// one bit set.
if (Val.getOpcode() == ISD::SRL) {
auto *C = isConstOrConstSplat(Val.getOperand(0));
if (C && C->getAPIntValue().isSignMask())
return true;
}
// Are all operands of a build vector constant powers of two?
if (Val.getOpcode() == ISD::BUILD_VECTOR)
if (llvm::all_of(Val->ops(), [BitWidth](SDValue E) {
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(E))
return C->getAPIntValue().zextOrTrunc(BitWidth).isPowerOf2();
return false;
}))
return true;
// More could be done here, though the above checks are enough
// to handle some common cases.
// Fall back to computeKnownBits to catch other known cases.
KnownBits Known;
computeKnownBits(Val, Known);
return (Known.countMaxPopulation() == 1) && (Known.countMinPopulation() == 1);
}
unsigned SelectionDAG::ComputeNumSignBits(SDValue Op, unsigned Depth) const {
EVT VT = Op.getValueType();
APInt DemandedElts = VT.isVector()
? APInt::getAllOnesValue(VT.getVectorNumElements())
: APInt(1, 1);
return ComputeNumSignBits(Op, DemandedElts, Depth);
}
unsigned SelectionDAG::ComputeNumSignBits(SDValue Op, const APInt &DemandedElts,
unsigned Depth) const {
EVT VT = Op.getValueType();
assert(VT.isInteger() && "Invalid VT!");
unsigned VTBits = VT.getScalarSizeInBits();
unsigned NumElts = DemandedElts.getBitWidth();
unsigned Tmp, Tmp2;
unsigned FirstAnswer = 1;
if (Depth == 6)
return 1; // Limit search depth.
if (!DemandedElts)
return 1; // No demanded elts, better to assume we don't know anything.
switch (Op.getOpcode()) {
default: break;
case ISD::AssertSext:
Tmp = cast<VTSDNode>(Op.getOperand(1))->getVT().getSizeInBits();
return VTBits-Tmp+1;
case ISD::AssertZext:
Tmp = cast<VTSDNode>(Op.getOperand(1))->getVT().getSizeInBits();
return VTBits-Tmp;
case ISD::Constant: {
const APInt &Val = cast<ConstantSDNode>(Op)->getAPIntValue();
return Val.getNumSignBits();
}
case ISD::BUILD_VECTOR:
Tmp = VTBits;
for (unsigned i = 0, e = Op.getNumOperands(); (i < e) && (Tmp > 1); ++i) {
if (!DemandedElts[i])
continue;
SDValue SrcOp = Op.getOperand(i);
Tmp2 = ComputeNumSignBits(Op.getOperand(i), Depth + 1);
// BUILD_VECTOR can implicitly truncate sources; we must handle this.
if (SrcOp.getValueSizeInBits() != VTBits) {
assert(SrcOp.getValueSizeInBits() > VTBits &&
"Expected BUILD_VECTOR implicit truncation");
unsigned ExtraBits = SrcOp.getValueSizeInBits() - VTBits;
Tmp2 = (Tmp2 > ExtraBits ? Tmp2 - ExtraBits : 1);
}
Tmp = std::min(Tmp, Tmp2);
}
return Tmp;
case ISD::VECTOR_SHUFFLE: {
// Collect the minimum number of sign bits that are shared by every vector
// element referenced by the shuffle.
APInt DemandedLHS(NumElts, 0), DemandedRHS(NumElts, 0);
const ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op);
assert(NumElts == SVN->getMask().size() && "Unexpected vector size");
for (unsigned i = 0; i != NumElts; ++i) {
int M = SVN->getMaskElt(i);
if (!DemandedElts[i])
continue;
// For UNDEF elements, we don't know anything about the common state of
// the shuffle result.
if (M < 0)
return 1;
if ((unsigned)M < NumElts)
DemandedLHS.setBit((unsigned)M % NumElts);
else
DemandedRHS.setBit((unsigned)M % NumElts);
}
Tmp = std::numeric_limits<unsigned>::max();
if (!!DemandedLHS)
Tmp = ComputeNumSignBits(Op.getOperand(0), DemandedLHS, Depth + 1);
if (!!DemandedRHS) {
Tmp2 = ComputeNumSignBits(Op.getOperand(1), DemandedRHS, Depth + 1);
Tmp = std::min(Tmp, Tmp2);
}
// If we don't know anything, early out and try computeKnownBits fall-back.
if (Tmp == 1)
break;
assert(Tmp <= VTBits && "Failed to determine minimum sign bits");
return Tmp;
}
case ISD::SIGN_EXTEND:
case ISD::SIGN_EXTEND_VECTOR_INREG:
Tmp = VTBits - Op.getOperand(0).getScalarValueSizeInBits();
return ComputeNumSignBits(Op.getOperand(0), Depth+1) + Tmp;
case ISD::SIGN_EXTEND_INREG:
// Max of the input and what this extends.
Tmp = cast<VTSDNode>(Op.getOperand(1))->getVT().getScalarSizeInBits();
Tmp = VTBits-Tmp+1;
Tmp2 = ComputeNumSignBits(Op.getOperand(0), Depth+1);
return std::max(Tmp, Tmp2);
case ISD::SRA:
Tmp = ComputeNumSignBits(Op.getOperand(0), DemandedElts, Depth+1);
// SRA X, C -> adds C sign bits.
if (ConstantSDNode *C = isConstOrConstSplat(Op.getOperand(1))) {
APInt ShiftVal = C->getAPIntValue();
ShiftVal += Tmp;
Tmp = ShiftVal.uge(VTBits) ? VTBits : ShiftVal.getZExtValue();
}
return Tmp;
case ISD::SHL:
if (ConstantSDNode *C = isConstOrConstSplat(Op.getOperand(1))) {
// shl destroys sign bits.
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth+1);
if (C->getAPIntValue().uge(VTBits) || // Bad shift.
C->getAPIntValue().uge(Tmp)) break; // Shifted all sign bits out.
return Tmp - C->getZExtValue();
}
break;
case ISD::AND:
case ISD::OR:
case ISD::XOR: // NOT is handled here.
// Logical binary ops preserve the number of sign bits at the worst.
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth+1);
if (Tmp != 1) {
Tmp2 = ComputeNumSignBits(Op.getOperand(1), Depth+1);
FirstAnswer = std::min(Tmp, Tmp2);
// We computed what we know about the sign bits as our first
// answer. Now proceed to the generic code that uses
// computeKnownBits, and pick whichever answer is better.
}
break;
case ISD::SELECT:
Tmp = ComputeNumSignBits(Op.getOperand(1), Depth+1);
if (Tmp == 1) return 1; // Early out.
Tmp2 = ComputeNumSignBits(Op.getOperand(2), Depth+1);
return std::min(Tmp, Tmp2);
case ISD::SELECT_CC:
Tmp = ComputeNumSignBits(Op.getOperand(2), Depth+1);
if (Tmp == 1) return 1; // Early out.
Tmp2 = ComputeNumSignBits(Op.getOperand(3), Depth+1);
return std::min(Tmp, Tmp2);
case ISD::SMIN:
case ISD::SMAX:
case ISD::UMIN:
case ISD::UMAX:
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth + 1);
if (Tmp == 1)
return 1; // Early out.
Tmp2 = ComputeNumSignBits(Op.getOperand(1), Depth + 1);
return std::min(Tmp, Tmp2);
case ISD::SADDO:
case ISD::UADDO:
case ISD::SSUBO:
case ISD::USUBO:
case ISD::SMULO:
case ISD::UMULO:
if (Op.getResNo() != 1)
break;
// The boolean result conforms to getBooleanContents, so treat it like a
// setcc result: if setcc returns 0/-1, all bits are sign bits.
// We know that we have an integer-based boolean since these operations
// are only available for integer.
if (TLI->getBooleanContents(Op.getValueType().isVector(), false) ==
TargetLowering::ZeroOrNegativeOneBooleanContent)
return VTBits;
break;
case ISD::SETCC:
// If setcc returns 0/-1, all bits are sign bits.
if (TLI->getBooleanContents(Op.getOperand(0).getValueType()) ==
TargetLowering::ZeroOrNegativeOneBooleanContent)
return VTBits;
break;
case ISD::ROTL:
case ISD::ROTR:
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
unsigned RotAmt = C->getZExtValue() & (VTBits-1);
// Handle rotate right by N like a rotate left by 32-N.
if (Op.getOpcode() == ISD::ROTR)
RotAmt = (VTBits-RotAmt) & (VTBits-1);
// If we aren't rotating out all of the known-in sign bits, return the
// number that are left. This handles rotl(sext(x), 1) for example.
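// For example, rotl (sext i8 X to i32), 1 has at least 25 sign bits on
// input, leaving 25 - 1 = 24 identical top bits after the rotate.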
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth+1);
if (Tmp > RotAmt+1) return Tmp-RotAmt;
}
break;
case ISD::ADD:
case ISD::ADDC:
// Add can have at most one carry bit. Thus we know that the output
// is, at worst, one more bit than the inputs.
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth+1);
if (Tmp == 1) return 1; // Early out.
// Special case decrementing a value (ADD X, -1):
if (ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
if (CRHS->isAllOnesValue()) {
KnownBits Known;
computeKnownBits(Op.getOperand(0), Known, Depth+1);
// If the input is known to be 0 or 1, the output is 0/-1, which is all
// sign bits set.
if ((Known.Zero | 1).isAllOnesValue())
return VTBits;
// If we are subtracting one from a positive number, there is no carry
// out of the result.
if (Known.isNonNegative())
return Tmp;
}
Tmp2 = ComputeNumSignBits(Op.getOperand(1), Depth+1);
if (Tmp2 == 1) return 1;
return std::min(Tmp, Tmp2)-1;
case ISD::SUB:
Tmp2 = ComputeNumSignBits(Op.getOperand(1), Depth+1);
if (Tmp2 == 1) return 1;
// Handle NEG.
if (ConstantSDNode *CLHS = isConstOrConstSplat(Op.getOperand(0)))
if (CLHS->isNullValue()) {
KnownBits Known;
computeKnownBits(Op.getOperand(1), Known, Depth+1);
// If the input is known to be 0 or 1, the output is 0/-1, which is all
// sign bits set.
if ((Known.Zero | 1).isAllOnesValue())
return VTBits;
// If the input is known to be positive (the sign bit is known clear),
// the output of the NEG has the same number of sign bits as the input.
if (Known.isNonNegative())
return Tmp2;
// Otherwise, we treat this like a SUB.
}
// Sub can have at most one carry bit. Thus we know that the output
// is, at worst, one more bit than the inputs.
Tmp = ComputeNumSignBits(Op.getOperand(0), Depth+1);
if (Tmp == 1) return 1; // Early out.
return std::min(Tmp, Tmp2)-1;
case ISD::TRUNCATE: {
// Check if the sign bits of source go down as far as the truncated value.
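// For example, truncating an i64 with 40 sign bits to i32 drops the top
// 32 bits and leaves 40 - (64 - 32) = 8 sign bits.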
unsigned NumSrcBits = Op.getOperand(0).getScalarValueSizeInBits();
unsigned NumSrcSignBits = ComputeNumSignBits(Op.getOperand(0), Depth + 1);
if (NumSrcSignBits > (NumSrcBits - VTBits))
return NumSrcSignBits - (NumSrcBits - VTBits);
break;
}
case ISD::EXTRACT_ELEMENT: {
const int KnownSign = ComputeNumSignBits(Op.getOperand(0), Depth+1);
const int BitWidth = Op.getValueSizeInBits();
const int Items = Op.getOperand(0).getValueSizeInBits() / BitWidth;
// Convert to a reverse index: operand 1 indexes elements from the
// little end, while the sign bits sit at the big end.
const int rIndex = Items - 1 - Op.getConstantOperandVal(1);
// If the sign portion ends within our element, the subtraction gives the
// correct result; otherwise it yields a negative or over-width value,
// which the min/max clamp below corrects.
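// For example, splitting an i64 with 40 sign bits into two i32 halves:
// the high half (rIndex 0) keeps min(40, 32) = 32 sign bits and the low
// half (rIndex 1) keeps 40 - 1 * 32 = 8.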
return std::max(std::min(KnownSign - rIndex * BitWidth, BitWidth), 0);
}
case ISD::INSERT_VECTOR_ELT: {
SDValue InVec = Op.getOperand(0);
SDValue InVal = Op.getOperand(1);
SDValue EltNo = Op.getOperand(2);
unsigned NumElts = InVec.getValueType().getVectorNumElements();
ConstantSDNode *CEltNo = dyn_cast<ConstantSDNode>(EltNo);
if (CEltNo && CEltNo->getAPIntValue().ult(NumElts)) {
// If we know the element index, split the demand between the
// source vector and the inserted element.
unsigned EltIdx = CEltNo->getZExtValue();
// If we demand the inserted element then get its sign bits.
Tmp = std::numeric_limits<unsigned>::max();
if (DemandedElts[EltIdx]) {
// TODO - handle implicit truncation of inserted elements.
if (InVal.getScalarValueSizeInBits() != VTBits)
break;
Tmp = ComputeNumSignBits(InVal, Depth + 1);
}
// If we demand the source vector then get its sign bits, and determine
// the minimum.
APInt VectorElts = DemandedElts;
VectorElts.clearBit(EltIdx);
if (!!VectorElts) {
Tmp2 = ComputeNumSignBits(InVec, VectorElts, Depth + 1);
Tmp = std::min(Tmp, Tmp2);
}
} else {
// Unknown element index, so ignore DemandedElts and demand them all.
Tmp = ComputeNumSignBits(InVec, Depth + 1);
Tmp2 = ComputeNumSignBits(InVal, Depth + 1);
Tmp = std::min(Tmp, Tmp2);
}
assert(Tmp <= VTBits && "Failed to determine minimum sign bits");
return Tmp;
}
case ISD::EXTRACT_VECTOR_ELT: {
SDValue InVec = Op.getOperand(0);
SDValue EltNo = Op.getOperand(1);
EVT VecVT = InVec.getValueType();
const unsigned BitWidth = Op.getValueSizeInBits();
const unsigned EltBitWidth = Op.getOperand(0).getScalarValueSizeInBits();
const unsigned NumSrcElts = VecVT.getVectorNumElements();
// If BitWidth > EltBitWidth the value is any-extended, and we do not know
// anything about sign bits. But if the sizes match we can derive knowledge
// about sign bits from the vector operand.
if (BitWidth != EltBitWidth)
break;
// If we know the element index, just demand that vector element, else for
// an unknown element index, ignore DemandedElts and demand them all.
APInt DemandedSrcElts = APInt::getAllOnesValue(NumSrcElts);
ConstantSDNode *ConstEltNo = dyn_cast<ConstantSDNode>(EltNo);
if (ConstEltNo && ConstEltNo->getAPIntValue().ult(NumSrcElts))
DemandedSrcElts =
APInt::getOneBitSet(NumSrcElts, ConstEltNo->getZExtValue());
return ComputeNumSignBits(InVec, DemandedSrcElts, Depth + 1);
}
case ISD::EXTRACT_SUBVECTOR: {
// If we know the element index, just demand that subvector elements,
// otherwise demand them all.
SDValue Src = Op.getOperand(0);
ConstantSDNode *SubIdx = dyn_cast<ConstantSDNode>(Op.getOperand(1));
unsigned NumSrcElts = Src.getValueType().getVectorNumElements();
if (SubIdx && SubIdx->getAPIntValue().ule(NumSrcElts - NumElts)) {
// Offset the demanded elts by the subvector index.
uint64_t Idx = SubIdx->getZExtValue();
APInt DemandedSrc = DemandedElts.zext(NumSrcElts).shl(Idx);
return ComputeNumSignBits(Src, DemandedSrc, Depth + 1);
}
return ComputeNumSignBits(Src, Depth + 1);
}
case ISD::CONCAT_VECTORS:
// Determine the minimum number of sign bits across all demanded
// elts of the input vectors. Early out if the result is already 1.
Tmp = std::numeric_limits<unsigned>::max();
EVT SubVectorVT = Op.getOperand(0).getValueType();
unsigned NumSubVectorElts = SubVectorVT.getVectorNumElements();
unsigned NumSubVectors = Op.getNumOperands();
for (unsigned i = 0; (i < NumSubVectors) && (Tmp > 1); ++i) {
APInt DemandedSub = DemandedElts.lshr(i * NumSubVectorElts);
DemandedSub = DemandedSub.trunc(NumSubVectorElts);
if (!DemandedSub)
continue;
Tmp2 = ComputeNumSignBits(Op.getOperand(i), DemandedSub, Depth + 1);
Tmp = std::min(Tmp, Tmp2);
}
assert(Tmp <= VTBits && "Failed to determine minimum sign bits");
return Tmp;
}
// If we are looking at the loaded value of the SDNode.
if (Op.getResNo() == 0) {
// Handle LOADX separately here. The EXTLOAD case will fall through.
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(Op)) {
unsigned ExtType = LD->getExtensionType();
switch (ExtType) {
default: break;
case ISD::SEXTLOAD: // e.g. i16 -> i32: 32 - 16 + 1 = '17' bits known
Tmp = LD->getMemoryVT().getScalarSizeInBits();
return VTBits-Tmp+1;
case ISD::ZEXTLOAD: // e.g. i16 -> i32: 32 - 16 = '16' bits known
Tmp = LD->getMemoryVT().getScalarSizeInBits();
return VTBits-Tmp;
}
}
}
// Allow the target to implement this method for its nodes.
if (Op.getOpcode() >= ISD::BUILTIN_OP_END ||
Op.getOpcode() == ISD::INTRINSIC_WO_CHAIN ||
Op.getOpcode() == ISD::INTRINSIC_W_CHAIN ||
Op.getOpcode() == ISD::INTRINSIC_VOID) {
unsigned NumBits =
TLI->ComputeNumSignBitsForTargetNode(Op, DemandedElts, *this, Depth);
if (NumBits > 1)
FirstAnswer = std::max(FirstAnswer, NumBits);
}
// Finally, if we can prove that the top bits of the result are 0's or 1's,
// use this information.
KnownBits Known;
computeKnownBits(Op, Known, DemandedElts, Depth);
APInt Mask;
if (Known.isNonNegative()) { // sign bit is 0
Mask = Known.Zero;
} else if (Known.isNegative()) { // sign bit is 1;
Mask = Known.One;
} else {
// Nothing known.
return FirstAnswer;
}
// Okay, we know that the sign bit in Mask is set. Use CLZ to determine
// the number of identical bits in the top of the input value.
Mask = ~Mask;
Mask <<= Mask.getBitWidth()-VTBits;
// Return # leading zeros. We use 'min' here in case Val was zero before
// shifting; we don't want to return '64' for an i32 "0".
return std::max(FirstAnswer, std::min(VTBits, Mask.countLeadingZeros()));
}
bool SelectionDAG::isBaseWithConstantOffset(SDValue Op) const {
if ((Op.getOpcode() != ISD::ADD && Op.getOpcode() != ISD::OR) ||
!isa<ConstantSDNode>(Op.getOperand(1)))
return false;
if (Op.getOpcode() == ISD::OR &&
!MaskedValueIsZero(Op.getOperand(0),
cast<ConstantSDNode>(Op.getOperand(1))->getAPIntValue()))
return false;
return true;
}
bool SelectionDAG::isKnownNeverNaN(SDValue Op) const {
// If we're told that NaNs won't happen, assume they won't.
if (getTarget().Options.NoNaNsFPMath)
return true;
if (Op->getFlags().hasNoNaNs())
return true;
// If the value is a constant, we can obviously see if it is a NaN or not.
if (const ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(Op))
return !C->getValueAPF().isNaN();
// TODO: Recognize more cases here.
return false;
}
bool SelectionDAG::isKnownNeverZero(SDValue Op) const {
// If the value is a constant, we can obviously see if it is a zero or not.
if (const ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(Op))
return !C->isZero();
// TODO: Recognize more cases here.
switch (Op.getOpcode()) {
default: break;
case ISD::OR:
if (const ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
return !C->isNullValue();
break;
}
return false;
}
bool SelectionDAG::isEqualTo(SDValue A, SDValue B) const {
// Check the obvious case.
if (A == B) return true;
// Check for negative and positive zero.
if (const ConstantFPSDNode *CA = dyn_cast<ConstantFPSDNode>(A))
if (const ConstantFPSDNode *CB = dyn_cast<ConstantFPSDNode>(B))
if (CA->isZero() && CB->isZero()) return true;
// Otherwise they may not be equal.
return false;
}
bool SelectionDAG::haveNoCommonBitsSet(SDValue A, SDValue B) const {
assert(A.getValueType() == B.getValueType() &&
"Values must have the same type");
KnownBits AKnown, BKnown;
computeKnownBits(A, AKnown);
computeKnownBits(B, BKnown);
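// With no common set bits, A + B can never produce a carry, so callers
// may treat A + B and A | B interchangeably.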
return (AKnown.Zero | BKnown.Zero).isAllOnesValue();
}
static SDValue FoldCONCAT_VECTORS(const SDLoc &DL, EVT VT,
ArrayRef<SDValue> Ops,
SelectionDAG &DAG) {
assert(!Ops.empty() && "Can't concatenate an empty list of vectors!");
assert(llvm::all_of(Ops,
[Ops](SDValue Op) {
return Ops[0].getValueType() == Op.getValueType();
}) &&
"Concatenation of vectors with inconsistent value types!");
assert((Ops.size() * Ops[0].getValueType().getVectorNumElements()) ==
VT.getVectorNumElements() &&
"Incorrect element count in vector concatenation!");
if (Ops.size() == 1)
return Ops[0];
// Concat of UNDEFs is UNDEF.
if (llvm::all_of(Ops, [](SDValue Op) { return Op.isUndef(); }))
return DAG.getUNDEF(VT);
// A CONCAT_VECTOR with all UNDEF/BUILD_VECTOR operands can be
// simplified to one big BUILD_VECTOR.
// FIXME: Add support for SCALAR_TO_VECTOR as well.
EVT SVT = VT.getScalarType();
SmallVector<SDValue, 16> Elts;
for (SDValue Op : Ops) {
EVT OpVT = Op.getValueType();
if (Op.isUndef())
Elts.append(OpVT.getVectorNumElements(), DAG.getUNDEF(SVT));
else if (Op.getOpcode() == ISD::BUILD_VECTOR)
Elts.append(Op->op_begin(), Op->op_end());
else
return SDValue();
}
// BUILD_VECTOR requires all inputs to be of the same type, find the
// maximum type and extend them all.
for (SDValue Op : Elts)
SVT = (SVT.bitsLT(Op.getValueType()) ? Op.getValueType() : SVT);
if (SVT.bitsGT(VT.getScalarType()))
for (SDValue &Op : Elts)
Op = DAG.getTargetLoweringInfo().isZExtFree(Op.getValueType(), SVT)
? DAG.getZExtOrTrunc(Op, DL, SVT)
: DAG.getSExtOrTrunc(Op, DL, SVT);
return DAG.getBuildVector(VT, DL, Elts);
}
/// Gets or creates the specified node.
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, getVTList(VT), None);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP))
return SDValue(E, 0);
auto *N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(),
getVTList(VT));
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
SDValue Operand, const SDNodeFlags Flags) {
// Constant fold unary operations with an integer constant operand. Even
// opaque constants will be folded, because the folding of unary operations
// doesn't create new constants with different values. Nevertheless, the
// opaque flag is preserved during folding to prevent future folding with
// other constants.
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Operand)) {
const APInt &Val = C->getAPIntValue();
switch (Opcode) {
default: break;
case ISD::SIGN_EXTEND:
return getConstant(Val.sextOrTrunc(VT.getSizeInBits()), DL, VT,
C->isTargetOpcode(), C->isOpaque());
case ISD::ANY_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::TRUNCATE:
return getConstant(Val.zextOrTrunc(VT.getSizeInBits()), DL, VT,
C->isTargetOpcode(), C->isOpaque());
case ISD::UINT_TO_FP:
case ISD::SINT_TO_FP: {
APFloat apf(EVTToAPFloatSemantics(VT),
APInt::getNullValue(VT.getSizeInBits()));
(void)apf.convertFromAPInt(Val,
Opcode==ISD::SINT_TO_FP,
APFloat::rmNearestTiesToEven);
return getConstantFP(apf, DL, VT);
}
case ISD::BITCAST:
if (VT == MVT::f16 && C->getValueType(0) == MVT::i16)
return getConstantFP(APFloat(APFloat::IEEEhalf(), Val), DL, VT);
if (VT == MVT::f32 && C->getValueType(0) == MVT::i32)
return getConstantFP(APFloat(APFloat::IEEEsingle(), Val), DL, VT);
if (VT == MVT::f64 && C->getValueType(0) == MVT::i64)
return getConstantFP(APFloat(APFloat::IEEEdouble(), Val), DL, VT);
if (VT == MVT::f128 && C->getValueType(0) == MVT::i128)
return getConstantFP(APFloat(APFloat::IEEEquad(), Val), DL, VT);
break;
case ISD::ABS:
return getConstant(Val.abs(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::BITREVERSE:
return getConstant(Val.reverseBits(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::BSWAP:
return getConstant(Val.byteSwap(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::CTPOP:
return getConstant(Val.countPopulation(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF:
return getConstant(Val.countLeadingZeros(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF:
return getConstant(Val.countTrailingZeros(), DL, VT, C->isTargetOpcode(),
C->isOpaque());
case ISD::FP16_TO_FP: {
bool Ignored;
APFloat FPV(APFloat::IEEEhalf(),
(Val.getBitWidth() == 16) ? Val : Val.trunc(16));
// This can return overflow, underflow, or inexact; we don't care.
// FIXME need to be more flexible about rounding mode.
(void)FPV.convert(EVTToAPFloatSemantics(VT),
APFloat::rmNearestTiesToEven, &Ignored);
return getConstantFP(FPV, DL, VT);
}
}
}
// Constant fold unary operations with a floating point constant operand.
if (ConstantFPSDNode *C = dyn_cast<ConstantFPSDNode>(Operand)) {
APFloat V = C->getValueAPF(); // make copy
switch (Opcode) {
case ISD::FNEG:
V.changeSign();
return getConstantFP(V, DL, VT);
case ISD::FABS:
V.clearSign();
return getConstantFP(V, DL, VT);
case ISD::FCEIL: {
APFloat::opStatus fs = V.roundToIntegral(APFloat::rmTowardPositive);
if (fs == APFloat::opOK || fs == APFloat::opInexact)
return getConstantFP(V, DL, VT);
break;
}
case ISD::FTRUNC: {
APFloat::opStatus fs = V.roundToIntegral(APFloat::rmTowardZero);
if (fs == APFloat::opOK || fs == APFloat::opInexact)
return getConstantFP(V, DL, VT);
break;
}
case ISD::FFLOOR: {
APFloat::opStatus fs = V.roundToIntegral(APFloat::rmTowardNegative);
if (fs == APFloat::opOK || fs == APFloat::opInexact)
return getConstantFP(V, DL, VT);
break;
}
case ISD::FP_EXTEND: {
bool ignored;
// This can return overflow, underflow, or inexact; we don't care.
// FIXME need to be more flexible about rounding mode.
(void)V.convert(EVTToAPFloatSemantics(VT),
APFloat::rmNearestTiesToEven, &ignored);
return getConstantFP(V, DL, VT);
}
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: {
bool ignored;
APSInt IntVal(VT.getSizeInBits(), Opcode == ISD::FP_TO_UINT);
// FIXME need to be more flexible about rounding mode.
APFloat::opStatus s =
V.convertToInteger(IntVal, APFloat::rmTowardZero, &ignored);
if (s == APFloat::opInvalidOp) // inexact is OK, in fact usual
break;
return getConstant(IntVal, DL, VT);
}
case ISD::BITCAST:
if (VT == MVT::i16 && C->getValueType(0) == MVT::f16)
return getConstant((uint16_t)V.bitcastToAPInt().getZExtValue(), DL, VT);
else if (VT == MVT::i32 && C->getValueType(0) == MVT::f32)
return getConstant((uint32_t)V.bitcastToAPInt().getZExtValue(), DL, VT);
else if (VT == MVT::i64 && C->getValueType(0) == MVT::f64)
return getConstant(V.bitcastToAPInt().getZExtValue(), DL, VT);
break;
case ISD::FP_TO_FP16: {
bool Ignored;
// This can return overflow, underflow, or inexact; we don't care.
// FIXME need to be more flexible about rounding mode.
(void)V.convert(APFloat::IEEEhalf(),
APFloat::rmNearestTiesToEven, &Ignored);
return getConstant(V.bitcastToAPInt(), DL, VT);
}
}
}
// Constant fold unary operations with a vector integer or float operand.
if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(Operand)) {
if (BV->isConstant()) {
switch (Opcode) {
default:
// FIXME: Entirely reasonable to perform folding of other unary
// operations here as the need arises.
break;
case ISD::FNEG:
case ISD::FABS:
case ISD::FCEIL:
case ISD::FTRUNC:
case ISD::FFLOOR:
case ISD::FP_EXTEND:
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
case ISD::TRUNCATE:
case ISD::UINT_TO_FP:
case ISD::SINT_TO_FP:
case ISD::ABS:
case ISD::BITREVERSE:
case ISD::BSWAP:
case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF:
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF:
case ISD::CTPOP: {
SDValue Ops = { Operand };
if (SDValue Fold = FoldConstantVectorArithmetic(Opcode, DL, VT, Ops))
return Fold;
}
}
}
}
unsigned OpOpcode = Operand.getNode()->getOpcode();
switch (Opcode) {
case ISD::TokenFactor:
case ISD::MERGE_VALUES:
case ISD::CONCAT_VECTORS:
return Operand; // Factor, merge or concat of one node? No need.
case ISD::FP_ROUND: llvm_unreachable("Invalid method to make FP_ROUND node");
case ISD::FP_EXTEND:
assert(VT.isFloatingPoint() &&
Operand.getValueType().isFloatingPoint() && "Invalid FP cast!");
if (Operand.getValueType() == VT) return Operand; // noop conversion.
assert((!VT.isVector() ||
VT.getVectorNumElements() ==
Operand.getValueType().getVectorNumElements()) &&
"Vector element count mismatch!");
assert(Operand.getValueType().bitsLT(VT) &&
"Invalid fpext node, dst < src!");
if (Operand.isUndef())
return getUNDEF(VT);
break;
case ISD::SIGN_EXTEND:
assert(VT.isInteger() && Operand.getValueType().isInteger() &&
"Invalid SIGN_EXTEND!");
if (Operand.getValueType() == VT) return Operand; // noop extension
assert((!VT.isVector() ||
VT.getVectorNumElements() ==
Operand.getValueType().getVectorNumElements()) &&
"Vector element count mismatch!");
assert(Operand.getValueType().bitsLT(VT) &&
"Invalid sext node, dst < src!");
if (OpOpcode == ISD::SIGN_EXTEND || OpOpcode == ISD::ZERO_EXTEND)
return getNode(OpOpcode, DL, VT, Operand.getOperand(0));
else if (OpOpcode == ISD::UNDEF)
// sext(undef) = 0, because the top bits will all be the same.
return getConstant(0, DL, VT);
break;
case ISD::ZERO_EXTEND:
assert(VT.isInteger() && Operand.getValueType().isInteger() &&
"Invalid ZERO_EXTEND!");
if (Operand.getValueType() == VT) return Operand; // noop extension
assert((!VT.isVector() ||
VT.getVectorNumElements() ==
Operand.getValueType().getVectorNumElements()) &&
"Vector element count mismatch!");
assert(Operand.getValueType().bitsLT(VT) &&
"Invalid zext node, dst < src!");
if (OpOpcode == ISD::ZERO_EXTEND) // (zext (zext x)) -> (zext x)
return getNode(ISD::ZERO_EXTEND, DL, VT, Operand.getOperand(0));
else if (OpOpcode == ISD::UNDEF)
// zext(undef) = 0, because the top bits will be zero.
return getConstant(0, DL, VT);
break;
case ISD::ANY_EXTEND:
assert(VT.isInteger() && Operand.getValueType().isInteger() &&
"Invalid ANY_EXTEND!");
if (Operand.getValueType() == VT) return Operand; // noop extension
assert((!VT.isVector() ||
VT.getVectorNumElements() ==
Operand.getValueType().getVectorNumElements()) &&
"Vector element count mismatch!");
assert(Operand.getValueType().bitsLT(VT) &&
"Invalid anyext node, dst < src!");
if (OpOpcode == ISD::ZERO_EXTEND || OpOpcode == ISD::SIGN_EXTEND ||
OpOpcode == ISD::ANY_EXTEND)
// (ext (zext x)) -> (zext x) and (ext (sext x)) -> (sext x)
return getNode(OpOpcode, DL, VT, Operand.getOperand(0));
else if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
// (ext (trunc x)) -> x
if (OpOpcode == ISD::TRUNCATE) {
SDValue OpOp = Operand.getOperand(0);
if (OpOp.getValueType() == VT)
return OpOp;
}
break;
case ISD::TRUNCATE:
assert(VT.isInteger() && Operand.getValueType().isInteger() &&
"Invalid TRUNCATE!");
if (Operand.getValueType() == VT) return Operand; // noop truncate
assert((!VT.isVector() ||
VT.getVectorNumElements() ==
Operand.getValueType().getVectorNumElements()) &&
"Vector element count mismatch!");
assert(Operand.getValueType().bitsGT(VT) &&
"Invalid truncate node, src < dst!");
if (OpOpcode == ISD::TRUNCATE)
return getNode(ISD::TRUNCATE, DL, VT, Operand.getOperand(0));
if (OpOpcode == ISD::ZERO_EXTEND || OpOpcode == ISD::SIGN_EXTEND ||
OpOpcode == ISD::ANY_EXTEND) {
// If the source is smaller than the dest, we still need an extend.
if (Operand.getOperand(0).getValueType().getScalarType()
.bitsLT(VT.getScalarType()))
return getNode(OpOpcode, DL, VT, Operand.getOperand(0));
if (Operand.getOperand(0).getValueType().bitsGT(VT))
return getNode(ISD::TRUNCATE, DL, VT, Operand.getOperand(0));
return Operand.getOperand(0);
}
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
break;
case ISD::ABS:
assert(VT.isInteger() && VT == Operand.getValueType() &&
"Invalid ABS!");
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
break;
case ISD::BSWAP:
assert(VT.isInteger() && VT == Operand.getValueType() &&
"Invalid BSWAP!");
assert((VT.getScalarSizeInBits() % 16 == 0) &&
"BSWAP types must be a multiple of 16 bits!");
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
break;
case ISD::BITREVERSE:
assert(VT.isInteger() && VT == Operand.getValueType() &&
"Invalid BITREVERSE!");
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
break;
case ISD::BITCAST:
// Basic sanity checking.
assert(VT.getSizeInBits() == Operand.getValueSizeInBits() &&
"Cannot BITCAST between types of different sizes!");
if (VT == Operand.getValueType()) return Operand; // noop conversion.
if (OpOpcode == ISD::BITCAST) // bitconv(bitconv(x)) -> bitconv(x)
return getNode(ISD::BITCAST, DL, VT, Operand.getOperand(0));
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
break;
case ISD::SCALAR_TO_VECTOR:
assert(VT.isVector() && !Operand.getValueType().isVector() &&
(VT.getVectorElementType() == Operand.getValueType() ||
(VT.getVectorElementType().isInteger() &&
Operand.getValueType().isInteger() &&
VT.getVectorElementType().bitsLE(Operand.getValueType()))) &&
"Illegal SCALAR_TO_VECTOR node!");
if (OpOpcode == ISD::UNDEF)
return getUNDEF(VT);
// scalar_to_vector(extract_vector_elt V, 0) -> V, top bits are undefined.
if (OpOpcode == ISD::EXTRACT_VECTOR_ELT &&
isa<ConstantSDNode>(Operand.getOperand(1)) &&
Operand.getConstantOperandVal(1) == 0 &&
Operand.getOperand(0).getValueType() == VT)
return Operand.getOperand(0);
break;
case ISD::FNEG:
// -(X-Y) -> (Y-X) is unsafe because when X==Y, -0.0 != +0.0
if (getTarget().Options.UnsafeFPMath && OpOpcode == ISD::FSUB)
// FIXME: FNEG has no fast-math-flags to propagate; use the FSUB's flags?
return getNode(ISD::FSUB, DL, VT, Operand.getOperand(1),
Operand.getOperand(0), Operand.getNode()->getFlags());
if (OpOpcode == ISD::FNEG) // --X -> X
return Operand.getOperand(0);
break;
case ISD::FABS:
if (OpOpcode == ISD::FNEG) // abs(-X) -> abs(X)
return getNode(ISD::FABS, DL, VT, Operand.getOperand(0));
break;
}
SDNode *N;
SDVTList VTs = getVTList(VT);
SDValue Ops[] = {Operand};
if (VT != MVT::Glue) { // Don't CSE flag producing nodes
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTs, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP)) {
E->intersectFlagsWith(Flags);
return SDValue(E, 0);
}
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
N->setFlags(Flags);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
static std::pair<APInt, bool> FoldValue(unsigned Opcode, const APInt &C1,
const APInt &C2) {
switch (Opcode) {
case ISD::ADD: return std::make_pair(C1 + C2, true);
case ISD::SUB: return std::make_pair(C1 - C2, true);
case ISD::MUL: return std::make_pair(C1 * C2, true);
case ISD::AND: return std::make_pair(C1 & C2, true);
case ISD::OR: return std::make_pair(C1 | C2, true);
case ISD::XOR: return std::make_pair(C1 ^ C2, true);
case ISD::SHL: return std::make_pair(C1 << C2, true);
case ISD::SRL: return std::make_pair(C1.lshr(C2), true);
case ISD::SRA: return std::make_pair(C1.ashr(C2), true);
case ISD::ROTL: return std::make_pair(C1.rotl(C2), true);
case ISD::ROTR: return std::make_pair(C1.rotr(C2), true);
case ISD::SMIN: return std::make_pair(C1.sle(C2) ? C1 : C2, true);
case ISD::SMAX: return std::make_pair(C1.sge(C2) ? C1 : C2, true);
case ISD::UMIN: return std::make_pair(C1.ule(C2) ? C1 : C2, true);
case ISD::UMAX: return std::make_pair(C1.uge(C2) ? C1 : C2, true);
case ISD::UDIV:
if (!C2.getBoolValue())
break;
return std::make_pair(C1.udiv(C2), true);
case ISD::UREM:
if (!C2.getBoolValue())
break;
return std::make_pair(C1.urem(C2), true);
case ISD::SDIV:
if (!C2.getBoolValue())
break;
return std::make_pair(C1.sdiv(C2), true);
case ISD::SREM:
if (!C2.getBoolValue())
break;
return std::make_pair(C1.srem(C2), true);
}
return std::make_pair(APInt(1, 0), false);
}
SDValue SelectionDAG::FoldConstantArithmetic(unsigned Opcode, const SDLoc &DL,
EVT VT, const ConstantSDNode *Cst1,
const ConstantSDNode *Cst2) {
if (Cst1->isOpaque() || Cst2->isOpaque())
return SDValue();
std::pair<APInt, bool> Folded = FoldValue(Opcode, Cst1->getAPIntValue(),
Cst2->getAPIntValue());
if (!Folded.second)
return SDValue();
return getConstant(Folded.first, DL, VT);
}
SDValue SelectionDAG::FoldSymbolOffset(unsigned Opcode, EVT VT,
const GlobalAddressSDNode *GA,
const SDNode *N2) {
if (GA->getOpcode() != ISD::GlobalAddress)
return SDValue();
if (!TLI->isOffsetFoldingLegal(GA))
return SDValue();
const ConstantSDNode *Cst2 = dyn_cast<ConstantSDNode>(N2);
if (!Cst2)
return SDValue();
int64_t Offset = Cst2->getSExtValue();
switch (Opcode) {
case ISD::ADD: break;
case ISD::SUB: Offset = -uint64_t(Offset); break;
default: return SDValue();
}
return getGlobalAddress(GA->getGlobal(), SDLoc(Cst2), VT,
GA->getOffset() + uint64_t(Offset));
}
bool SelectionDAG::isUndef(unsigned Opcode, ArrayRef<SDValue> Ops) {
switch (Opcode) {
case ISD::SDIV:
case ISD::UDIV:
case ISD::SREM:
case ISD::UREM: {
// If a divisor is zero/undef or any element of a divisor vector is
// zero/undef, the whole op is undef.
assert(Ops.size() == 2 && "Div/rem should have 2 operands");
SDValue Divisor = Ops[1];
if (Divisor.isUndef() || isNullConstant(Divisor))
return true;
return ISD::isBuildVectorOfConstantSDNodes(Divisor.getNode()) &&
llvm::any_of(Divisor->op_values(),
[](SDValue V) { return V.isUndef() ||
isNullConstant(V); });
// TODO: Handle signed overflow.
}
// TODO: Handle oversized shifts.
default:
return false;
}
}
SDValue SelectionDAG::FoldConstantArithmetic(unsigned Opcode, const SDLoc &DL,
EVT VT, SDNode *Cst1,
SDNode *Cst2) {
// If the opcode is a target-specific ISD node, there's nothing we can
// do here and the operand rules may not line up with the below, so
// bail early.
if (Opcode >= ISD::BUILTIN_OP_END)
return SDValue();
if (isUndef(Opcode, {SDValue(Cst1, 0), SDValue(Cst2, 0)}))
return getUNDEF(VT);
// Handle the case of two scalars.
if (const ConstantSDNode *Scalar1 = dyn_cast<ConstantSDNode>(Cst1)) {
if (const ConstantSDNode *Scalar2 = dyn_cast<ConstantSDNode>(Cst2)) {
SDValue Folded = FoldConstantArithmetic(Opcode, DL, VT, Scalar1, Scalar2);
assert((!Folded || !VT.isVector()) &&
"Can't fold vectors ops with scalar operands");
return Folded;
}
}
// fold (add Sym, c) -> Sym+c
if (GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(Cst1))
return FoldSymbolOffset(Opcode, VT, GA, Cst2);
if (TLI->isCommutativeBinOp(Opcode))
if (GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(Cst2))
return FoldSymbolOffset(Opcode, VT, GA, Cst1);
// For vectors extract each constant element into Inputs so we can constant
// fold them individually.
BuildVectorSDNode *BV1 = dyn_cast<BuildVectorSDNode>(Cst1);
BuildVectorSDNode *BV2 = dyn_cast<BuildVectorSDNode>(Cst2);
if (!BV1 || !BV2)
return SDValue();
assert(BV1->getNumOperands() == BV2->getNumOperands() && "Out of sync!");
EVT SVT = VT.getScalarType();
SmallVector<SDValue, 4> Outputs;
for (unsigned I = 0, E = BV1->getNumOperands(); I != E; ++I) {
SDValue V1 = BV1->getOperand(I);
SDValue V2 = BV2->getOperand(I);
// Avoid BUILD_VECTOR nodes that perform implicit truncation.
// FIXME: This is valid and could be handled by truncation.
if (V1->getValueType(0) != SVT || V2->getValueType(0) != SVT)
return SDValue();
// Fold one vector element.
SDValue ScalarResult = getNode(Opcode, DL, SVT, V1, V2);
// Scalar folding only succeeded if the result is a constant or UNDEF.
if (!ScalarResult.isUndef() && ScalarResult.getOpcode() != ISD::Constant &&
ScalarResult.getOpcode() != ISD::ConstantFP)
return SDValue();
Outputs.push_back(ScalarResult);
}
assert(VT.getVectorNumElements() == Outputs.size() &&
"Vector size mismatch!");
// We may have a vector type but a scalar result. Create a splat.
Outputs.resize(VT.getVectorNumElements(), Outputs.back());
// Build a big vector out of the scalar elements we generated.
return getBuildVector(VT, SDLoc(), Outputs);
}
SDValue SelectionDAG::FoldConstantVectorArithmetic(unsigned Opcode,
const SDLoc &DL, EVT VT,
ArrayRef<SDValue> Ops,
const SDNodeFlags Flags) {
// If the opcode is a target-specific ISD node, there's nothing we can
// do here and the operand rules may not line up with the below, so
// bail early.
if (Opcode >= ISD::BUILTIN_OP_END)
return SDValue();
if (isUndef(Opcode, Ops))
return getUNDEF(VT);
// We can only fold vectors - maybe merge with FoldConstantArithmetic someday?
if (!VT.isVector())
return SDValue();
unsigned NumElts = VT.getVectorNumElements();
auto IsScalarOrSameVectorSize = [&](const SDValue &Op) {
return !Op.getValueType().isVector() ||
Op.getValueType().getVectorNumElements() == NumElts;
};
auto IsConstantBuildVectorOrUndef = [&](const SDValue &Op) {
BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(Op);
return (Op.isUndef()) || (Op.getOpcode() == ISD::CONDCODE) ||
(BV && BV->isConstant());
};
// All operands must be vector types with the same number of elements as
// the result type and must be either UNDEF or a build vector of constant
// or UNDEF scalars.
if (!llvm::all_of(Ops, IsConstantBuildVectorOrUndef) ||
!llvm::all_of(Ops, IsScalarOrSameVectorSize))
return SDValue();
// If we are comparing vectors, then the result needs to be an i1 boolean
// that is then sign-extended back to the legal result type.
EVT SVT = (Opcode == ISD::SETCC ? MVT::i1 : VT.getScalarType());
// Find legal integer scalar type for constant promotion and
// ensure that its scalar size is at least as large as the source's.
EVT LegalSVT = VT.getScalarType();
if (NewNodesMustHaveLegalTypes && LegalSVT.isInteger()) {
LegalSVT = TLI->getTypeToTransformTo(*getContext(), LegalSVT);
if (LegalSVT.bitsLT(VT.getScalarType()))
return SDValue();
}
// Constant fold each scalar lane separately.
SmallVector<SDValue, 4> ScalarResults;
for (unsigned i = 0; i != NumElts; i++) {
SmallVector<SDValue, 4> ScalarOps;
for (SDValue Op : Ops) {
EVT InSVT = Op.getValueType().getScalarType();
BuildVectorSDNode *InBV = dyn_cast<BuildVectorSDNode>(Op);
if (!InBV) {
// We've checked that this is UNDEF or a constant of some kind.
if (Op.isUndef())
ScalarOps.push_back(getUNDEF(InSVT));
else
ScalarOps.push_back(Op);
continue;
}
SDValue ScalarOp = InBV->getOperand(i);
EVT ScalarVT = ScalarOp.getValueType();
// Build vector (integer) scalar operands may need implicit
// truncation - do this before constant folding.
if (ScalarVT.isInteger() && ScalarVT.bitsGT(InSVT))
ScalarOp = getNode(ISD::TRUNCATE, DL, InSVT, ScalarOp);
ScalarOps.push_back(ScalarOp);
}
// Constant fold the scalar operands.
SDValue ScalarResult = getNode(Opcode, DL, SVT, ScalarOps, Flags);
// Legalize the (integer) scalar constant if necessary.
if (LegalSVT != SVT)
ScalarResult = getNode(ISD::SIGN_EXTEND, DL, LegalSVT, ScalarResult);
// Scalar folding only succeeded if the result is a constant or UNDEF.
if (!ScalarResult.isUndef() && ScalarResult.getOpcode() != ISD::Constant &&
ScalarResult.getOpcode() != ISD::ConstantFP)
return SDValue();
ScalarResults.push_back(ScalarResult);
}
return getBuildVector(VT, DL, ScalarResults);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
SDValue N1, SDValue N2, const SDNodeFlags Flags) {
ConstantSDNode *N1C = dyn_cast<ConstantSDNode>(N1);
ConstantSDNode *N2C = dyn_cast<ConstantSDNode>(N2);
ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
ConstantFPSDNode *N2CFP = dyn_cast<ConstantFPSDNode>(N2);
// Canonicalize constant to RHS if commutative.
if (TLI->isCommutativeBinOp(Opcode)) {
if (N1C && !N2C) {
std::swap(N1C, N2C);
std::swap(N1, N2);
} else if (N1CFP && !N2CFP) {
std::swap(N1CFP, N2CFP);
std::swap(N1, N2);
}
}
switch (Opcode) {
default: break;
case ISD::TokenFactor:
assert(VT == MVT::Other && N1.getValueType() == MVT::Other &&
N2.getValueType() == MVT::Other && "Invalid token factor!");
// Fold trivial token factors.
if (N1.getOpcode() == ISD::EntryToken) return N2;
if (N2.getOpcode() == ISD::EntryToken) return N1;
if (N1 == N2) return N1;
break;
case ISD::CONCAT_VECTORS: {
// Attempt to fold CONCAT_VECTORS into BUILD_VECTOR or UNDEF.
SDValue Ops[] = {N1, N2};
if (SDValue V = FoldCONCAT_VECTORS(DL, VT, Ops, *this))
return V;
break;
}
case ISD::AND:
assert(VT.isInteger() && "This operator does not apply to FP types!");
assert(N1.getValueType() == N2.getValueType() &&
N1.getValueType() == VT && "Binary operator types must match!");
// (X & 0) -> 0. This commonly occurs when legalizing i64 values, so it's
// worth handling here.
if (N2C && N2C->isNullValue())
return N2;
if (N2C && N2C->isAllOnesValue()) // X & -1 -> X
return N1;
break;
case ISD::OR:
case ISD::XOR:
case ISD::ADD:
case ISD::SUB:
assert(VT.isInteger() && "This operator does not apply to FP types!");
assert(N1.getValueType() == N2.getValueType() &&
N1.getValueType() == VT && "Binary operator types must match!");
// (X ^|+- 0) -> X. This commonly occurs when legalizing i64 values, so
// it's worth handling here.
if (N2C && N2C->isNullValue())
return N1;
break;
case ISD::UDIV:
case ISD::UREM:
case ISD::MULHU:
case ISD::MULHS:
case ISD::MUL:
case ISD::SDIV:
case ISD::SREM:
case ISD::SMIN:
case ISD::SMAX:
case ISD::UMIN:
case ISD::UMAX:
assert(VT.isInteger() && "This operator does not apply to FP types!");
assert(N1.getValueType() == N2.getValueType() &&
N1.getValueType() == VT && "Binary operator types must match!");
break;
case ISD::FADD:
case ISD::FSUB:
case ISD::FMUL:
case ISD::FDIV:
case ISD::FREM:
if (getTarget().Options.UnsafeFPMath) {
if (Opcode == ISD::FADD) {
// x+0 --> x
if (N2CFP && N2CFP->getValueAPF().isZero())
return N1;
} else if (Opcode == ISD::FSUB) {
// x-0 --> x
if (N2CFP && N2CFP->getValueAPF().isZero())
return N1;
} else if (Opcode == ISD::FMUL) {
// x*0 --> 0
if (N2CFP && N2CFP->isZero())
return N2;
// x*1 --> x
if (N2CFP && N2CFP->isExactlyValue(1.0))
return N1;
}
}
assert(VT.isFloatingPoint() && "This operator only applies to FP types!");
assert(N1.getValueType() == N2.getValueType() &&
N1.getValueType() == VT && "Binary operator types must match!");
break;
case ISD::FCOPYSIGN: // N1 and result must match. N1/N2 need not match.
assert(N1.getValueType() == VT &&
N1.getValueType().isFloatingPoint() &&
N2.getValueType().isFloatingPoint() &&
"Invalid FCOPYSIGN!");
break;
case ISD::SHL:
case ISD::SRA:
case ISD::SRL:
case ISD::ROTL:
case ISD::ROTR:
assert(VT == N1.getValueType() &&
"Shift operators return type must be the same as their first arg");
assert(VT.isInteger() && N2.getValueType().isInteger() &&
"Shifts only work on integers");
assert((!VT.isVector() || VT == N2.getValueType()) &&
"Vector shift amounts must be in the same as their first arg");
// Verify that the shift amount VT is big enough to hold valid shift
// amounts. This catches things like trying to shift an i1024 value by an
// i8, which is easy to fall into in generic code that uses
// TLI.getShiftAmount().
assert(N2.getValueSizeInBits() >= Log2_32_Ceil(N1.getValueSizeInBits()) &&
"Invalid use of small shift amount with oversized value!");
// Always fold shifts of i1 values so the code generator doesn't need to
// handle them. Since we know the size of the shift has to be less than the
// size of the value, the shift/rotate count is guaranteed to be zero.
if (VT == MVT::i1)
return N1;
if (N2C && N2C->isNullValue())
return N1;
break;
case ISD::FP_ROUND_INREG: {
EVT EVT = cast<VTSDNode>(N2)->getVT();
assert(VT == N1.getValueType() && "Not an inreg round!");
assert(VT.isFloatingPoint() && EVT.isFloatingPoint() &&
"Cannot FP_ROUND_INREG integer types");
assert(EVT.isVector() == VT.isVector() &&
"FP_ROUND_INREG type should be vector iff the operand "
"type is vector!");
assert((!EVT.isVector() ||
EVT.getVectorNumElements() == VT.getVectorNumElements()) &&
"Vector element counts must match in FP_ROUND_INREG");
assert(EVT.bitsLE(VT) && "Not rounding down!");
(void)EVT;
if (cast<VTSDNode>(N2)->getVT() == VT) return N1; // Not actually rounding.
break;
}
case ISD::FP_ROUND:
assert(VT.isFloatingPoint() &&
N1.getValueType().isFloatingPoint() &&
VT.bitsLE(N1.getValueType()) &&
N2C && (N2C->getZExtValue() == 0 || N2C->getZExtValue() == 1) &&
"Invalid FP_ROUND!");
if (N1.getValueType() == VT) return N1; // noop conversion.
break;
case ISD::AssertSext:
case ISD::AssertZext: {
EVT EVT = cast<VTSDNode>(N2)->getVT();
assert(VT == N1.getValueType() && "Not an inreg extend!");
assert(VT.isInteger() && EVT.isInteger() &&
"Cannot *_EXTEND_INREG FP types");
assert(!EVT.isVector() &&
"AssertSExt/AssertZExt type should be the vector element type "
"rather than the vector type!");
assert(EVT.bitsLE(VT) && "Not extending!");
if (VT == EVT) return N1; // noop assertion.
break;
}
case ISD::SIGN_EXTEND_INREG: {
EVT EVT = cast<VTSDNode>(N2)->getVT();
assert(VT == N1.getValueType() && "Not an inreg extend!");
assert(VT.isInteger() && EVT.isInteger() &&
"Cannot *_EXTEND_INREG FP types");
assert(EVT.isVector() == VT.isVector() &&
"SIGN_EXTEND_INREG type should be vector iff the operand "
"type is vector!");
assert((!EVT.isVector() ||
EVT.getVectorNumElements() == VT.getVectorNumElements()) &&
"Vector element counts must match in SIGN_EXTEND_INREG");
assert(EVT.bitsLE(VT) && "Not extending!");
if (EVT == VT) return N1; // Not actually extending
auto SignExtendInReg = [&](APInt Val, llvm::EVT ConstantVT) {
unsigned FromBits = EVT.getScalarSizeInBits();
Val <<= Val.getBitWidth() - FromBits;
Val.ashrInPlace(Val.getBitWidth() - FromBits);
return getConstant(Val, DL, ConstantVT);
};
if (N1C) {
const APInt &Val = N1C->getAPIntValue();
return SignExtendInReg(Val, VT);
}
if (ISD::isBuildVectorOfConstantSDNodes(N1.getNode())) {
SmallVector<SDValue, 8> Ops;
llvm::EVT OpVT = N1.getOperand(0).getValueType();
for (int i = 0, e = VT.getVectorNumElements(); i != e; ++i) {
SDValue Op = N1.getOperand(i);
if (Op.isUndef()) {
Ops.push_back(getUNDEF(OpVT));
continue;
}
ConstantSDNode *C = cast<ConstantSDNode>(Op);
APInt Val = C->getAPIntValue();
Ops.push_back(SignExtendInReg(Val, OpVT));
}
return getBuildVector(VT, DL, Ops);
}
break;
}
case ISD::EXTRACT_VECTOR_ELT:
// EXTRACT_VECTOR_ELT of an UNDEF is an UNDEF.
if (N1.isUndef())
return getUNDEF(VT);
// EXTRACT_VECTOR_ELT of out-of-bounds element is an UNDEF
if (N2C && N2C->getZExtValue() >= N1.getValueType().getVectorNumElements())
return getUNDEF(VT);
// EXTRACT_VECTOR_ELT of CONCAT_VECTORS is often formed while lowering is
// expanding copies of large vectors from registers.
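// E.g. extracting element 5 from a concat of two v4i32 vectors becomes an
// extract of element 5 % 4 = 1 from concat operand 5 / 4 = 1.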
if (N2C &&
N1.getOpcode() == ISD::CONCAT_VECTORS &&
N1.getNumOperands() > 0) {
unsigned Factor =
N1.getOperand(0).getValueType().getVectorNumElements();
return getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT,
N1.getOperand(N2C->getZExtValue() / Factor),
getConstant(N2C->getZExtValue() % Factor, DL,
N2.getValueType()));
}
// EXTRACT_VECTOR_ELT of BUILD_VECTOR is often formed while lowering is
// expanding large vector constants.
if (N2C && N1.getOpcode() == ISD::BUILD_VECTOR) {
SDValue Elt = N1.getOperand(N2C->getZExtValue());
if (VT != Elt.getValueType())
// If the vector element type is not legal, the BUILD_VECTOR operands
// are promoted and implicitly truncated, and the result implicitly
// extended. Make that explicit here.
Elt = getAnyExtOrTrunc(Elt, DL, VT);
return Elt;
}
// EXTRACT_VECTOR_ELT of INSERT_VECTOR_ELT is often formed when vector
// operations are lowered to scalars.
if (N1.getOpcode() == ISD::INSERT_VECTOR_ELT) {
// If the indices are the same, return the inserted element; if the
// indices are known to differ, extract the element from the original
// vector instead.
SDValue N1Op2 = N1.getOperand(2);
ConstantSDNode *N1Op2C = dyn_cast<ConstantSDNode>(N1Op2);
if (N1Op2C && N2C) {
if (N1Op2C->getZExtValue() == N2C->getZExtValue()) {
if (VT == N1.getOperand(1).getValueType())
return N1.getOperand(1);
else
return getSExtOrTrunc(N1.getOperand(1), DL, VT);
}
return getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, N1.getOperand(0), N2);
}
}
break;
case ISD::EXTRACT_ELEMENT:
assert(N2C && (unsigned)N2C->getZExtValue() < 2 && "Bad EXTRACT_ELEMENT!");
assert(!N1.getValueType().isVector() && !VT.isVector() &&
(N1.getValueType().isInteger() == VT.isInteger()) &&
N1.getValueType() != VT &&
"Wrong types for EXTRACT_ELEMENT!");
// EXTRACT_ELEMENT of BUILD_PAIR is often formed while legalize is expanding
// 64-bit integers into 32-bit parts. Instead of building the extract of
// the BUILD_PAIR, only to have legalize rip it apart, just do it now.
if (N1.getOpcode() == ISD::BUILD_PAIR)
return N1.getOperand(N2C->getZExtValue());
// EXTRACT_ELEMENT of a constant int is also very common.
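// E.g. extracting element 1 (the high half) of the i64 constant
// 0x0123456789ABCDEF as an i32 shifts right by 32 and truncates, yielding
// 0x01234567.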
if (N1C) {
unsigned ElementSize = VT.getSizeInBits();
unsigned Shift = ElementSize * N2C->getZExtValue();
APInt ShiftedVal = N1C->getAPIntValue().lshr(Shift);
return getConstant(ShiftedVal.trunc(ElementSize), DL, VT);
}
break;
case ISD::EXTRACT_SUBVECTOR:
if (VT.isSimple() && N1.getValueType().isSimple()) {
assert(VT.isVector() && N1.getValueType().isVector() &&
"Extract subvector VTs must be a vectors!");
assert(VT.getVectorElementType() ==
N1.getValueType().getVectorElementType() &&
"Extract subvector VTs must have the same element type!");
assert(VT.getSimpleVT() <= N1.getSimpleValueType() &&
"Extract subvector must be from larger vector to smaller vector!");
if (N2C) {
assert((VT.getVectorNumElements() + N2C->getZExtValue()
<= N1.getValueType().getVectorNumElements())
&& "Extract subvector overflow!");
}
// Trivial extraction.
if (VT.getSimpleVT() == N1.getSimpleValueType())
return N1;
// EXTRACT_SUBVECTOR of an UNDEF is an UNDEF.
if (N1.isUndef())
return getUNDEF(VT);
// EXTRACT_SUBVECTOR of CONCAT_VECTORS can be simplified if the pieces of
// the concat have the same type as the extract.
if (N2C && N1.getOpcode() == ISD::CONCAT_VECTORS &&
N1.getNumOperands() > 0 &&
VT == N1.getOperand(0).getValueType()) {
unsigned Factor = VT.getVectorNumElements();
return N1.getOperand(N2C->getZExtValue() / Factor);
}
// EXTRACT_SUBVECTOR of INSERT_SUBVECTOR is often created
// during shuffle legalization.
if (N1.getOpcode() == ISD::INSERT_SUBVECTOR && N2 == N1.getOperand(2) &&
VT == N1.getOperand(1).getValueType())
return N1.getOperand(1);
}
break;
}
// Perform trivial constant folding.
if (SDValue SV =
FoldConstantArithmetic(Opcode, DL, VT, N1.getNode(), N2.getNode()))
return SV;
// Constant fold FP operations.
bool HasFPExceptions = TLI->hasFloatingPointExceptions();
if (N1CFP) {
if (N2CFP) {
APFloat V1 = N1CFP->getValueAPF(), V2 = N2CFP->getValueAPF();
APFloat::opStatus s;
switch (Opcode) {
case ISD::FADD:
s = V1.add(V2, APFloat::rmNearestTiesToEven);
if (!HasFPExceptions || s != APFloat::opInvalidOp)
return getConstantFP(V1, DL, VT);
break;
case ISD::FSUB:
s = V1.subtract(V2, APFloat::rmNearestTiesToEven);
if (!HasFPExceptions || s!=APFloat::opInvalidOp)
return getConstantFP(V1, DL, VT);
break;
case ISD::FMUL:
s = V1.multiply(V2, APFloat::rmNearestTiesToEven);
if (!HasFPExceptions || s!=APFloat::opInvalidOp)
return getConstantFP(V1, DL, VT);
break;
case ISD::FDIV:
s = V1.divide(V2, APFloat::rmNearestTiesToEven);
if (!HasFPExceptions || (s!=APFloat::opInvalidOp &&
s!=APFloat::opDivByZero)) {
return getConstantFP(V1, DL, VT);
}
break;
case ISD::FREM :
s = V1.mod(V2);
if (!HasFPExceptions || (s!=APFloat::opInvalidOp &&
s!=APFloat::opDivByZero)) {
return getConstantFP(V1, DL, VT);
}
break;
case ISD::FCOPYSIGN:
V1.copySign(V2);
return getConstantFP(V1, DL, VT);
default: break;
}
}
if (Opcode == ISD::FP_ROUND) {
APFloat V = N1CFP->getValueAPF(); // make copy
bool ignored;
// This can return overflow, underflow, or inexact; we don't care.
// FIXME need to be more flexible about rounding mode.
(void)V.convert(EVTToAPFloatSemantics(VT),
APFloat::rmNearestTiesToEven, &ignored);
return getConstantFP(V, DL, VT);
}
}
// Canonicalize an UNDEF to the RHS, even over a constant.
if (N1.isUndef()) {
if (TLI->isCommutativeBinOp(Opcode)) {
std::swap(N1, N2);
} else {
switch (Opcode) {
case ISD::FP_ROUND_INREG:
case ISD::SIGN_EXTEND_INREG:
case ISD::SUB:
case ISD::FSUB:
case ISD::FDIV:
case ISD::FREM:
case ISD::SRA:
return N1; // fold op(undef, arg2) -> undef
case ISD::UDIV:
case ISD::SDIV:
case ISD::UREM:
case ISD::SREM:
case ISD::SRL:
case ISD::SHL:
if (!VT.isVector())
return getConstant(0, DL, VT); // fold op(undef, arg2) -> 0
// For vectors, we can't easily build an all-zero vector; just return
// the other (non-undef) operand.
return N2;
}
}
}
// Fold a bunch of operators when the RHS is undef.
if (N2.isUndef()) {
switch (Opcode) {
case ISD::XOR:
if (N1.isUndef())
// Handle undef ^ undef -> 0 special case. This is a common
// idiom (misuse).
return getConstant(0, DL, VT);
LLVM_FALLTHROUGH;
case ISD::ADD:
case ISD::ADDC:
case ISD::ADDE:
case ISD::SUB:
case ISD::UDIV:
case ISD::SDIV:
case ISD::UREM:
case ISD::SREM:
return N2; // fold op(arg1, undef) -> undef
case ISD::FADD:
case ISD::FSUB:
case ISD::FMUL:
case ISD::FDIV:
case ISD::FREM:
if (getTarget().Options.UnsafeFPMath)
return N2;
break;
case ISD::MUL:
case ISD::AND:
case ISD::SRL:
case ISD::SHL:
if (!VT.isVector())
return getConstant(0, DL, VT); // fold op(arg1, undef) -> 0
// For vectors, we can't easily build an all-zero vector; just return
// the LHS.
return N1;
case ISD::OR:
if (!VT.isVector())
return getConstant(APInt::getAllOnesValue(VT.getSizeInBits()), DL, VT);
// For vectors, we can't easily build an all-ones vector; just return
// the LHS.
return N1;
case ISD::SRA:
return N1;
}
}
// Memoize this node if possible.
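// Nodes are uniqued through the CSEMap FoldingSet, keyed on the opcode,
// value type list, and operands. Nodes producing glue are deliberately not
// uniqued (note the VT != MVT::Glue guard below).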
SDNode *N;
SDVTList VTs = getVTList(VT);
SDValue Ops[] = {N1, N2};
if (VT != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTs, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP)) {
E->intersectFlagsWith(Flags);
return SDValue(E, 0);
}
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
N->setFlags(Flags);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
SDValue N1, SDValue N2, SDValue N3) {
// Perform various simplifications.
switch (Opcode) {
case ISD::FMA: {
ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
ConstantFPSDNode *N2CFP = dyn_cast<ConstantFPSDNode>(N2);
ConstantFPSDNode *N3CFP = dyn_cast<ConstantFPSDNode>(N3);
if (N1CFP && N2CFP && N3CFP) {
APFloat V1 = N1CFP->getValueAPF();
const APFloat &V2 = N2CFP->getValueAPF();
const APFloat &V3 = N3CFP->getValueAPF();
APFloat::opStatus s =
V1.fusedMultiplyAdd(V2, V3, APFloat::rmNearestTiesToEven);
if (!TLI->hasFloatingPointExceptions() || s != APFloat::opInvalidOp)
return getConstantFP(V1, DL, VT);
}
break;
}
case ISD::CONCAT_VECTORS: {
// Attempt to fold CONCAT_VECTORS into BUILD_VECTOR or UNDEF.
SDValue Ops[] = {N1, N2, N3};
if (SDValue V = FoldCONCAT_VECTORS(DL, VT, Ops, *this))
return V;
break;
}
case ISD::SETCC: {
// Use FoldSetCC to simplify SETCC's.
if (SDValue V = FoldSetCC(VT, N1, N2, cast<CondCodeSDNode>(N3)->get(), DL))
return V;
// Vector constant folding.
SDValue Ops[] = {N1, N2, N3};
if (SDValue V = FoldConstantVectorArithmetic(Opcode, DL, VT, Ops))
return V;
break;
}
case ISD::SELECT:
if (ConstantSDNode *N1C = dyn_cast<ConstantSDNode>(N1)) {
if (N1C->getZExtValue())
return N2; // select true, X, Y -> X
return N3; // select false, X, Y -> Y
}
if (N2 == N3) return N2; // select C, X, X -> X
break;
case ISD::VECTOR_SHUFFLE:
llvm_unreachable("should use getVectorShuffle constructor!");
case ISD::INSERT_VECTOR_ELT: {
ConstantSDNode *N3C = dyn_cast<ConstantSDNode>(N3);
// INSERT_VECTOR_ELT into out-of-bounds element is an UNDEF
if (N3C && N3C->getZExtValue() >= N1.getValueType().getVectorNumElements())
return getUNDEF(VT);
break;
}
case ISD::INSERT_SUBVECTOR: {
SDValue Index = N3;
if (VT.isSimple() && N1.getValueType().isSimple()
&& N2.getValueType().isSimple()) {
assert(VT.isVector() && N1.getValueType().isVector() &&
N2.getValueType().isVector() &&
"Insert subvector VTs must be a vectors");
assert(VT == N1.getValueType() &&
"Dest and insert subvector source types must match!");
assert(N2.getSimpleValueType() <= N1.getSimpleValueType() &&
"Insert subvector must be from smaller vector to larger vector!");
if (isa<ConstantSDNode>(Index)) {
assert((N2.getValueType().getVectorNumElements() +
cast<ConstantSDNode>(Index)->getZExtValue()
<= VT.getVectorNumElements())
&& "Insert subvector overflow!");
}
// Trivial insertion.
if (VT.getSimpleVT() == N2.getSimpleValueType())
return N2;
}
break;
}
case ISD::BITCAST:
// Fold bit_convert nodes from a type to themselves.
if (N1.getValueType() == VT)
return N1;
break;
}
// Memoize node if it doesn't produce a flag.
SDNode *N;
SDVTList VTs = getVTList(VT);
SDValue Ops[] = {N1, N2, N3};
if (VT != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTs, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP))
return SDValue(E, 0);
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
SDValue N1, SDValue N2, SDValue N3, SDValue N4) {
SDValue Ops[] = { N1, N2, N3, N4 };
return getNode(Opcode, DL, VT, Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
SDValue N1, SDValue N2, SDValue N3, SDValue N4,
SDValue N5) {
SDValue Ops[] = { N1, N2, N3, N4, N5 };
return getNode(Opcode, DL, VT, Ops);
}
/// getStackArgumentTokenFactor - Compute a TokenFactor to force all
/// the incoming stack arguments to be loaded from the stack.
SDValue SelectionDAG::getStackArgumentTokenFactor(SDValue Chain) {
SmallVector<SDValue, 8> ArgChains;
// Include the original chain at the beginning of the list. When this is
// used by target LowerCall hooks, it helps legalize find the
// CALLSEQ_BEGIN node.
ArgChains.push_back(Chain);
// Add a chain value for each stack argument.
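// Incoming (fixed) stack argument objects are given negative frame
// indices, which is what the FI->getIndex() < 0 test below keys on.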
for (SDNode::use_iterator U = getEntryNode().getNode()->use_begin(),
UE = getEntryNode().getNode()->use_end(); U != UE; ++U)
if (LoadSDNode *L = dyn_cast<LoadSDNode>(*U))
if (FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(L->getBasePtr()))
if (FI->getIndex() < 0)
ArgChains.push_back(SDValue(L, 1));
// Build a tokenfactor for all the chains.
return getNode(ISD::TokenFactor, SDLoc(Chain), MVT::Other, ArgChains);
}
/// getMemsetValue - Vectorized representation of the memset value
/// operand.
static SDValue getMemsetValue(SDValue Value, EVT VT, SelectionDAG &DAG,
const SDLoc &dl) {
assert(!Value.isUndef());
unsigned NumBits = VT.getScalarSizeInBits();
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Value)) {
assert(C->getAPIntValue().getBitWidth() == 8);
APInt Val = APInt::getSplat(NumBits, C->getAPIntValue());
if (VT.isInteger())
return DAG.getConstant(Val, dl, VT);
return DAG.getConstantFP(APFloat(DAG.EVTToAPFloatSemantics(VT), Val), dl,
VT);
}
assert(Value.getValueType() == MVT::i8 && "memset with non-byte fill value?");
EVT IntVT = VT.getScalarType();
if (!IntVT.isInteger())
IntVT = EVT::getIntegerVT(*DAG.getContext(), IntVT.getSizeInBits());
Value = DAG.getNode(ISD::ZERO_EXTEND, dl, IntVT, Value);
if (NumBits > 8) {
// Use a multiplication with 0x010101... to extend the input to the
// required length.
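// E.g. for a 32-bit result, 0x000000AB * 0x01010101 = 0xABABABAB.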
APInt Magic = APInt::getSplat(NumBits, APInt(8, 0x01));
Value = DAG.getNode(ISD::MUL, dl, IntVT, Value,
DAG.getConstant(Magic, dl, IntVT));
}
if (VT != Value.getValueType() && !VT.isInteger())
Value = DAG.getBitcast(VT.getScalarType(), Value);
if (VT != Value.getValueType())
Value = DAG.getSplatBuildVector(VT, dl, Value);
return Value;
}
/// getMemsetStringVal - Similar to getMemsetValue, except this is only used
/// when a memcpy is turned into a memset because the source is a constant
/// string pointer.
static SDValue getMemsetStringVal(EVT VT, const SDLoc &dl, SelectionDAG &DAG,
const TargetLowering &TLI,
const ConstantDataArraySlice &Slice) {
// Handle vector with all elements zero.
if (Slice.Array == nullptr) {
if (VT.isInteger())
return DAG.getConstant(0, dl, VT);
else if (VT == MVT::f32 || VT == MVT::f64 || VT == MVT::f128)
return DAG.getConstantFP(0.0, dl, VT);
else if (VT.isVector()) {
unsigned NumElts = VT.getVectorNumElements();
MVT EltVT = (VT.getVectorElementType() == MVT::f32) ? MVT::i32 : MVT::i64;
return DAG.getNode(ISD::BITCAST, dl, VT,
DAG.getConstant(0, dl,
EVT::getVectorVT(*DAG.getContext(),
EltVT, NumElts)));
} else
llvm_unreachable("Expected type!");
}
assert(!VT.isVector() && "Can't handle vector type here!");
unsigned NumVTBits = VT.getSizeInBits();
unsigned NumVTBytes = NumVTBits / 8;
unsigned NumBytes = std::min(NumVTBytes, unsigned(Slice.Length));
APInt Val(NumVTBits, 0);
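// Assemble the leading bytes into an integer of the right width. E.g. the
// bytes "abcd" (0x61 0x62 0x63 0x64) become the i32 0x64636261 on a
// little-endian target and 0x61626364 on a big-endian one.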
if (DAG.getDataLayout().isLittleEndian()) {
for (unsigned i = 0; i != NumBytes; ++i)
Val |= (uint64_t)(unsigned char)Slice[i] << i*8;
} else {
for (unsigned i = 0; i != NumBytes; ++i)
Val |= (uint64_t)(unsigned char)Slice[i] << (NumVTBytes-i-1)*8;
}
// If the "cost" of materializing the integer immediate is less than the cost
// of a load, then it is cost effective to turn the load into the immediate.
Type *Ty = VT.getTypeForEVT(*DAG.getContext());
if (TLI.shouldConvertConstantLoadToIntImm(Val, Ty))
return DAG.getConstant(Val, dl, VT);
return SDValue(nullptr, 0);
}
SDValue SelectionDAG::getMemBasePlusOffset(SDValue Base, unsigned Offset,
const SDLoc &DL) {
EVT VT = Base.getValueType();
return getNode(ISD::ADD, DL, VT, Base, getConstant(Offset, DL, VT));
}
/// Returns true if memcpy source is constant data.
static bool isMemSrcFromConstant(SDValue Src, ConstantDataArraySlice &Slice) {
uint64_t SrcDelta = 0;
GlobalAddressSDNode *G = nullptr;
if (Src.getOpcode() == ISD::GlobalAddress)
G = cast<GlobalAddressSDNode>(Src);
else if (Src.getOpcode() == ISD::ADD &&
Src.getOperand(0).getOpcode() == ISD::GlobalAddress &&
Src.getOperand(1).getOpcode() == ISD::Constant) {
G = cast<GlobalAddressSDNode>(Src.getOperand(0));
SrcDelta = cast<ConstantSDNode>(Src.getOperand(1))->getZExtValue();
}
if (!G)
return false;
return getConstantDataArrayInfo(G->getGlobal(), Slice, 8,
SrcDelta + G->getOffset());
}
/// Determines the optimal series of memory ops to replace the memset / memcpy.
/// Returns true if the number of memory ops is below the threshold (Limit),
/// and fills MemOps with the sequence of memory operation types to use for
/// the memset / memcpy.
static bool FindOptimalMemOpLowering(std::vector<EVT> &MemOps,
unsigned Limit, uint64_t Size,
unsigned DstAlign, unsigned SrcAlign,
bool IsMemset,
bool ZeroMemset,
bool MemcpyStrSrc,
bool AllowOverlap,
unsigned DstAS, unsigned SrcAS,
SelectionDAG &DAG,
const TargetLowering &TLI) {
assert((SrcAlign == 0 || SrcAlign >= DstAlign) &&
"Expecting memcpy / memset source to meet alignment requirement!");
// If 'SrcAlign' is zero, that means the memory operation does not need to
// load the value, i.e. memset or memcpy from constant string. Otherwise,
// it's the inferred alignment of the source. 'DstAlign', on the other hand,
// is the specified alignment of the memory operation. If it is zero, that
// means it's possible to change the alignment of the destination.
// 'MemcpyStrSrc' indicates whether the memcpy source is constant so it does
// not need to be loaded.
EVT VT = TLI.getOptimalMemOpType(Size, DstAlign, SrcAlign,
IsMemset, ZeroMemset, MemcpyStrSrc,
DAG.getMachineFunction());
if (VT == MVT::Other) {
// Use the largest integer type whose alignment constraints are satisfied.
// We only need to check DstAlign here as SrcAlign is always greater or
// equal to DstAlign (or zero).
VT = MVT::i64;
while (DstAlign && DstAlign < VT.getSizeInBits() / 8 &&
!TLI.allowsMisalignedMemoryAccesses(VT, DstAS, DstAlign))
VT = (MVT::SimpleValueType)(VT.getSimpleVT().SimpleTy - 1);
assert(VT.isInteger());
// Find the largest legal integer type.
MVT LVT = MVT::i64;
while (!TLI.isTypeLegal(LVT))
LVT = (MVT::SimpleValueType)(LVT.SimpleTy - 1);
assert(LVT.isInteger());
// If the type we've chosen is larger than the largest legal integer type
// then use that instead.
if (VT.bitsGT(LVT))
VT = LVT;
}
unsigned NumMemOps = 0;
while (Size != 0) {
unsigned VTSize = VT.getSizeInBits() / 8;
while (VTSize > Size) {
// For now, only use non-vector loads / stores for the left-over pieces.
EVT NewVT = VT;
unsigned NewVTSize;
bool Found = false;
if (VT.isVector() || VT.isFloatingPoint()) {
NewVT = (VT.getSizeInBits() > 64) ? MVT::i64 : MVT::i32;
if (TLI.isOperationLegalOrCustom(ISD::STORE, NewVT) &&
TLI.isSafeMemOpType(NewVT.getSimpleVT()))
Found = true;
else if (NewVT == MVT::i64 &&
TLI.isOperationLegalOrCustom(ISD::STORE, MVT::f64) &&
TLI.isSafeMemOpType(MVT::f64)) {
// i64 is usually not legal on 32-bit targets, but f64 may be.
NewVT = MVT::f64;
Found = true;
}
}
if (!Found) {
do {
NewVT = (MVT::SimpleValueType)(NewVT.getSimpleVT().SimpleTy - 1);
if (NewVT == MVT::i8)
break;
} while (!TLI.isSafeMemOpType(NewVT.getSimpleVT()));
}
NewVTSize = NewVT.getSizeInBits() / 8;
// If the new VT cannot cover all of the remaining bits, then consider
// issuing one (or a pair of) unaligned, overlapping loads / stores.
// FIXME: Only do this for 64 bits or more, since we don't have a proper
// cost model for unaligned loads / stores.
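// E.g. a 15-byte memcpy lowered with i64 ops can be done as one 8-byte op
// at offset 0 plus a second, overlapping 8-byte op covering bytes 7..14.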
bool Fast;
if (NumMemOps && AllowOverlap &&
VTSize >= 8 && NewVTSize < Size &&
TLI.allowsMisalignedMemoryAccesses(VT, DstAS, DstAlign, &Fast) && Fast)
VTSize = Size;
else {
VT = NewVT;
VTSize = NewVTSize;
}
}
if (++NumMemOps > Limit)
return false;
MemOps.push_back(VT);
Size -= VTSize;
}
return true;
}
static bool shouldLowerMemFuncForSize(const MachineFunction &MF) {
// On Darwin, -Os means optimize for size without hurting performance, so
// only really optimize for size when -Oz (MinSize) is used.
if (MF.getTarget().getTargetTriple().isOSDarwin())
return MF.getFunction()->optForMinSize();
return MF.getFunction()->optForSize();
}
static SDValue getMemcpyLoadsAndStores(SelectionDAG &DAG, const SDLoc &dl,
SDValue Chain, SDValue Dst, SDValue Src,
uint64_t Size, unsigned Align,
bool isVol, bool AlwaysInline,
MachinePointerInfo DstPtrInfo,
MachinePointerInfo SrcPtrInfo) {
// Turn a memcpy of undef to nop.
if (Src.isUndef())
return Chain;
// Expand memcpy to a series of load and store ops if the size operand falls
// below a certain threshold.
// TODO: In the AlwaysInline case, if the size is big then generate a loop
// rather than a potentially huge number of loads and stores.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
const DataLayout &DL = DAG.getDataLayout();
LLVMContext &C = *DAG.getContext();
std::vector<EVT> MemOps;
bool DstAlignCanChange = false;
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
bool OptSize = shouldLowerMemFuncForSize(MF);
FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Dst);
if (FI && !MFI.isFixedObjectIndex(FI->getIndex()))
DstAlignCanChange = true;
unsigned SrcAlign = DAG.InferPtrAlignment(Src);
if (Align > SrcAlign)
SrcAlign = Align;
ConstantDataArraySlice Slice;
bool CopyFromConstant = isMemSrcFromConstant(Src, Slice);
bool isZeroConstant = CopyFromConstant && Slice.Array == nullptr;
unsigned Limit = AlwaysInline ? ~0U : TLI.getMaxStoresPerMemcpy(OptSize);
if (!FindOptimalMemOpLowering(MemOps, Limit, Size,
(DstAlignCanChange ? 0 : Align),
(isZeroConstant ? 0 : SrcAlign),
false, false, CopyFromConstant, true,
DstPtrInfo.getAddrSpace(),
SrcPtrInfo.getAddrSpace(),
DAG, TLI))
return SDValue();
if (DstAlignCanChange) {
Type *Ty = MemOps[0].getTypeForEVT(C);
unsigned NewAlign = (unsigned)DL.getABITypeAlignment(Ty);
// Don't promote to an alignment that would require dynamic stack
// realignment.
const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
if (!TRI->needsStackRealignment(MF))
while (NewAlign > Align &&
DL.exceedsNaturalStackAlignment(NewAlign))
NewAlign /= 2;
if (NewAlign > Align) {
// Give the stack frame object a larger alignment if needed.
if (MFI.getObjectAlignment(FI->getIndex()) < NewAlign)
MFI.setObjectAlignment(FI->getIndex(), NewAlign);
Align = NewAlign;
}
}
MachineMemOperand::Flags MMOFlags =
isVol ? MachineMemOperand::MOVolatile : MachineMemOperand::MONone;
SmallVector<SDValue, 8> OutChains;
unsigned NumMemOps = MemOps.size();
uint64_t SrcOff = 0, DstOff = 0;
for (unsigned i = 0; i != NumMemOps; ++i) {
EVT VT = MemOps[i];
unsigned VTSize = VT.getSizeInBits() / 8;
SDValue Value, Store;
if (VTSize > Size) {
// Issuing an unaligned load / store pair that overlaps with the previous
// pair. Adjust the offset accordingly.
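// E.g. if 7 bytes remain and the final op is an i64 (VTSize = 8), back
// both offsets up by one byte so the 8-byte access ends exactly at the
// end of the region.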
assert(i == NumMemOps-1 && i != 0);
SrcOff -= VTSize - Size;
DstOff -= VTSize - Size;
}
if (CopyFromConstant &&
(isZeroConstant || (VT.isInteger() && !VT.isVector()))) {
// It's unlikely a store of a vector immediate can be done in a single
// instruction. It would require a load from a constant pool first.
// We only handle zero vectors here.
// FIXME: Handle other cases where store of vector immediate is done in
// a single instruction.
ConstantDataArraySlice SubSlice;
if (SrcOff < Slice.Length) {
SubSlice = Slice;
SubSlice.move(SrcOff);
} else {
// This is an out-of-bounds access and hence UB. Pretend we read zero.
SubSlice.Array = nullptr;
SubSlice.Offset = 0;
SubSlice.Length = VTSize;
}
Value = getMemsetStringVal(VT, dl, DAG, TLI, SubSlice);
if (Value.getNode())
Store = DAG.getStore(Chain, dl, Value,
DAG.getMemBasePlusOffset(Dst, DstOff, dl),
DstPtrInfo.getWithOffset(DstOff), Align,
MMOFlags);
}
if (!Store.getNode()) {
// The type might not be legal for the target. This should only happen
// if the type is smaller than a legal type, as on PPC, so the right
// thing to do is generate a LoadExt/StoreTrunc pair. These simplify
// to Load/Store if NVT==VT.
// FIXME: does the case above also need this?
EVT NVT = TLI.getTypeToTransformTo(C, VT);
assert(NVT.bitsGE(VT));
bool isDereferenceable =
SrcPtrInfo.getWithOffset(SrcOff).isDereferenceable(VTSize, C, DL);
MachineMemOperand::Flags SrcMMOFlags = MMOFlags;
if (isDereferenceable)
SrcMMOFlags |= MachineMemOperand::MODereferenceable;
Value = DAG.getExtLoad(ISD::EXTLOAD, dl, NVT, Chain,
DAG.getMemBasePlusOffset(Src, SrcOff, dl),
SrcPtrInfo.getWithOffset(SrcOff), VT,
MinAlign(SrcAlign, SrcOff), SrcMMOFlags);
OutChains.push_back(Value.getValue(1));
Store = DAG.getTruncStore(
Chain, dl, Value, DAG.getMemBasePlusOffset(Dst, DstOff, dl),
DstPtrInfo.getWithOffset(DstOff), VT, Align, MMOFlags);
}
OutChains.push_back(Store);
SrcOff += VTSize;
DstOff += VTSize;
Size -= VTSize;
}
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
}
static SDValue getMemmoveLoadsAndStores(SelectionDAG &DAG, const SDLoc &dl,
SDValue Chain, SDValue Dst, SDValue Src,
uint64_t Size, unsigned Align,
bool isVol, bool AlwaysInline,
MachinePointerInfo DstPtrInfo,
MachinePointerInfo SrcPtrInfo) {
// Turn a memmove of undef to nop.
if (Src.isUndef())
return Chain;
// Expand memmove to a series of load and store ops if the size operand falls
// below a certain threshold.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
const DataLayout &DL = DAG.getDataLayout();
LLVMContext &C = *DAG.getContext();
std::vector<EVT> MemOps;
bool DstAlignCanChange = false;
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
bool OptSize = shouldLowerMemFuncForSize(MF);
FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Dst);
if (FI && !MFI.isFixedObjectIndex(FI->getIndex()))
DstAlignCanChange = true;
unsigned SrcAlign = DAG.InferPtrAlignment(Src);
if (Align > SrcAlign)
SrcAlign = Align;
unsigned Limit = AlwaysInline ? ~0U : TLI.getMaxStoresPerMemmove(OptSize);
if (!FindOptimalMemOpLowering(MemOps, Limit, Size,
(DstAlignCanChange ? 0 : Align), SrcAlign,
false, false, false, false,
DstPtrInfo.getAddrSpace(),
SrcPtrInfo.getAddrSpace(),
DAG, TLI))
return SDValue();
if (DstAlignCanChange) {
Type *Ty = MemOps[0].getTypeForEVT(C);
unsigned NewAlign = (unsigned)DL.getABITypeAlignment(Ty);
if (NewAlign > Align) {
// Give the stack frame object a larger alignment if needed.
if (MFI.getObjectAlignment(FI->getIndex()) < NewAlign)
MFI.setObjectAlignment(FI->getIndex(), NewAlign);
Align = NewAlign;
}
}
MachineMemOperand::Flags MMOFlags =
isVol ? MachineMemOperand::MOVolatile : MachineMemOperand::MONone;
uint64_t SrcOff = 0, DstOff = 0;
SmallVector<SDValue, 8> LoadValues;
SmallVector<SDValue, 8> LoadChains;
SmallVector<SDValue, 8> OutChains;
unsigned NumMemOps = MemOps.size();
for (unsigned i = 0; i < NumMemOps; i++) {
EVT VT = MemOps[i];
unsigned VTSize = VT.getSizeInBits() / 8;
SDValue Value;
bool isDereferenceable =
SrcPtrInfo.getWithOffset(SrcOff).isDereferenceable(VTSize, C, DL);
MachineMemOperand::Flags SrcMMOFlags = MMOFlags;
if (isDereferenceable)
SrcMMOFlags |= MachineMemOperand::MODereferenceable;
Value =
DAG.getLoad(VT, dl, Chain, DAG.getMemBasePlusOffset(Src, SrcOff, dl),
SrcPtrInfo.getWithOffset(SrcOff), SrcAlign, SrcMMOFlags);
LoadValues.push_back(Value);
LoadChains.push_back(Value.getValue(1));
SrcOff += VTSize;
}
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, LoadChains);
OutChains.clear();
for (unsigned i = 0; i < NumMemOps; i++) {
EVT VT = MemOps[i];
unsigned VTSize = VT.getSizeInBits() / 8;
SDValue Store;
Store = DAG.getStore(Chain, dl, LoadValues[i],
DAG.getMemBasePlusOffset(Dst, DstOff, dl),
DstPtrInfo.getWithOffset(DstOff), Align, MMOFlags);
OutChains.push_back(Store);
DstOff += VTSize;
}
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
}
/// \brief Lower a call to the 'memset' intrinsic into a series of store
/// operations.
///
/// \param DAG Selection DAG where lowered code is placed.
/// \param dl Link to corresponding IR location.
/// \param Chain Control flow dependency.
/// \param Dst Pointer to destination memory location.
/// \param Src Value of byte to write into the memory.
/// \param Size Number of bytes to write.
/// \param Align Alignment of the destination in bytes.
/// \param isVol True if destination is volatile.
/// \param DstPtrInfo IR information on the memory pointer.
/// \returns New head in the control flow, if lowering was successful, empty
/// SDValue otherwise.
///
/// The function tries to replace the 'llvm.memset' intrinsic with several
/// store operations and value calculation code. This is usually profitable
/// for small memory sizes.
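/// For example, a memset of 8 bytes with value 0xAB on a target with a legal
/// i64 may become a single store of the i64 constant 0xABABABABABABABAB.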
static SDValue getMemsetStores(SelectionDAG &DAG, const SDLoc &dl,
SDValue Chain, SDValue Dst, SDValue Src,
uint64_t Size, unsigned Align, bool isVol,
MachinePointerInfo DstPtrInfo) {
// Turn a memset of undef to nop.
if (Src.isUndef())
return Chain;
// Expand memset to a series of load/store ops if the size operand
// falls below a certain threshold.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
std::vector<EVT> MemOps;
bool DstAlignCanChange = false;
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
bool OptSize = shouldLowerMemFuncForSize(MF);
FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Dst);
if (FI && !MFI.isFixedObjectIndex(FI->getIndex()))
DstAlignCanChange = true;
bool IsZeroVal =
isa<ConstantSDNode>(Src) && cast<ConstantSDNode>(Src)->isNullValue();
if (!FindOptimalMemOpLowering(MemOps, TLI.getMaxStoresPerMemset(OptSize),
Size, (DstAlignCanChange ? 0 : Align), 0,
true, IsZeroVal, false, true,
DstPtrInfo.getAddrSpace(), ~0u,
DAG, TLI))
return SDValue();
if (DstAlignCanChange) {
Type *Ty = MemOps[0].getTypeForEVT(*DAG.getContext());
unsigned NewAlign = (unsigned)DAG.getDataLayout().getABITypeAlignment(Ty);
if (NewAlign > Align) {
// Give the stack frame object a larger alignment if needed.
if (MFI.getObjectAlignment(FI->getIndex()) < NewAlign)
MFI.setObjectAlignment(FI->getIndex(), NewAlign);
Align = NewAlign;
}
}
SmallVector<SDValue, 8> OutChains;
uint64_t DstOff = 0;
unsigned NumMemOps = MemOps.size();
// Find the largest store and generate the bit pattern for it.
EVT LargestVT = MemOps[0];
for (unsigned i = 1; i < NumMemOps; i++)
if (MemOps[i].bitsGT(LargestVT))
LargestVT = MemOps[i];
SDValue MemSetValue = getMemsetValue(Src, LargestVT, DAG, dl);
for (unsigned i = 0; i < NumMemOps; i++) {
EVT VT = MemOps[i];
unsigned VTSize = VT.getSizeInBits() / 8;
if (VTSize > Size) {
// Issuing an unaligned load / store pair that overlaps with the previous
// pair. Adjust the offset accordingly.
assert(i == NumMemOps-1 && i != 0);
DstOff -= VTSize - Size;
}
// If this store is smaller than the largest store, see whether we can get
// the smaller value for free with a truncate.
SDValue Value = MemSetValue;
if (VT.bitsLT(LargestVT)) {
if (!LargestVT.isVector() && !VT.isVector() &&
TLI.isTruncateFree(LargestVT, VT))
Value = DAG.getNode(ISD::TRUNCATE, dl, VT, MemSetValue);
else
Value = getMemsetValue(Src, VT, DAG, dl);
}
assert(Value.getValueType() == VT && "Value with wrong type.");
SDValue Store = DAG.getStore(
Chain, dl, Value, DAG.getMemBasePlusOffset(Dst, DstOff, dl),
DstPtrInfo.getWithOffset(DstOff), Align,
isVol ? MachineMemOperand::MOVolatile : MachineMemOperand::MONone);
OutChains.push_back(Store);
DstOff += VT.getSizeInBits() / 8;
Size -= VTSize;
}
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
}
static void checkAddrSpaceIsValidForLibcall(const TargetLowering *TLI,
unsigned AS) {
// Lowering memcpy / memset / memmove intrinsics to calls is only valid if all
// pointer operands can be losslessly bitcasted to pointers of address space 0.
if (AS != 0 && !TLI->isNoopAddrSpaceCast(AS, 0)) {
report_fatal_error("cannot lower memory intrinsic in address space " +
Twine(AS));
}
}
SDValue SelectionDAG::getMemcpy(SDValue Chain, const SDLoc &dl, SDValue Dst,
SDValue Src, SDValue Size, unsigned Align,
bool isVol, bool AlwaysInline, bool isTailCall,
MachinePointerInfo DstPtrInfo,
MachinePointerInfo SrcPtrInfo) {
assert(Align && "The SDAG layer expects explicit alignment and reserves 0");
// Check to see if we should lower the memcpy to loads and stores first.
// For cases within the target-specified limits, this is the best choice.
ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
if (ConstantSize) {
// Memcpy with size zero? Just return the original chain.
if (ConstantSize->isNullValue())
return Chain;
SDValue Result = getMemcpyLoadsAndStores(*this, dl, Chain, Dst, Src,
ConstantSize->getZExtValue(),Align,
isVol, false, DstPtrInfo, SrcPtrInfo);
if (Result.getNode())
return Result;
}
// Then check to see if we should lower the memcpy with target-specific
// code. If the target chooses to do this, this is the next best.
if (TSI) {
SDValue Result = TSI->EmitTargetCodeForMemcpy(
*this, dl, Chain, Dst, Src, Size, Align, isVol, AlwaysInline,
DstPtrInfo, SrcPtrInfo);
if (Result.getNode())
return Result;
}
// If we really need inline code and the target declined to provide it,
// use a (potentially long) sequence of loads and stores.
if (AlwaysInline) {
assert(ConstantSize && "AlwaysInline requires a constant size!");
return getMemcpyLoadsAndStores(*this, dl, Chain, Dst, Src,
ConstantSize->getZExtValue(), Align, isVol,
true, DstPtrInfo, SrcPtrInfo);
}
checkAddrSpaceIsValidForLibcall(TLI, DstPtrInfo.getAddrSpace());
checkAddrSpaceIsValidForLibcall(TLI, SrcPtrInfo.getAddrSpace());
// FIXME: If the memcpy is volatile (isVol), lowering it to a plain libc
// memcpy is not guaranteed to be safe. libc memcpys aren't required to
// respect volatile, so they may do things like read or write memory
// beyond the given memory regions. But fixing this isn't easy, and most
// people don't care.
// Emit a library call.
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Ty = getDataLayout().getIntPtrType(*getContext());
Entry.Node = Dst; Args.push_back(Entry);
Entry.Node = Src; Args.push_back(Entry);
Entry.Node = Size; Args.push_back(Entry);
// FIXME: pass in SDLoc
TargetLowering::CallLoweringInfo CLI(*this);
CLI.setDebugLoc(dl)
.setChain(Chain)
.setLibCallee(TLI->getLibcallCallingConv(RTLIB::MEMCPY),
Dst.getValueType().getTypeForEVT(*getContext()),
getExternalSymbol(TLI->getLibcallName(RTLIB::MEMCPY),
TLI->getPointerTy(getDataLayout())),
std::move(Args))
.setDiscardResult()
.setTailCall(isTailCall);
std::pair<SDValue,SDValue> CallResult = TLI->LowerCallTo(CLI);
return CallResult.second;
}
SDValue SelectionDAG::getMemmove(SDValue Chain, const SDLoc &dl, SDValue Dst,
SDValue Src, SDValue Size, unsigned Align,
bool isVol, bool isTailCall,
MachinePointerInfo DstPtrInfo,
MachinePointerInfo SrcPtrInfo) {
assert(Align && "The SDAG layer expects explicit alignment and reserves 0");
// Check to see if we should lower the memmove to loads and stores first.
// For cases within the target-specified limits, this is the best choice.
ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
if (ConstantSize) {
// Memmove with size zero? Just return the original chain.
if (ConstantSize->isNullValue())
return Chain;
SDValue Result =
getMemmoveLoadsAndStores(*this, dl, Chain, Dst, Src,
ConstantSize->getZExtValue(), Align, isVol,
false, DstPtrInfo, SrcPtrInfo);
if (Result.getNode())
return Result;
}
// Then check to see if we should lower the memmove with target-specific
// code. If the target chooses to do this, this is the next best.
if (TSI) {
SDValue Result = TSI->EmitTargetCodeForMemmove(
*this, dl, Chain, Dst, Src, Size, Align, isVol, DstPtrInfo, SrcPtrInfo);
if (Result.getNode())
return Result;
}
checkAddrSpaceIsValidForLibcall(TLI, DstPtrInfo.getAddrSpace());
checkAddrSpaceIsValidForLibcall(TLI, SrcPtrInfo.getAddrSpace());
// FIXME: If the memmove is volatile, lowering it to plain libc memmove may
// not be safe. See memcpy above for more details.
// Emit a library call.
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Ty = getDataLayout().getIntPtrType(*getContext());
Entry.Node = Dst; Args.push_back(Entry);
Entry.Node = Src; Args.push_back(Entry);
Entry.Node = Size; Args.push_back(Entry);
// FIXME: pass in SDLoc
TargetLowering::CallLoweringInfo CLI(*this);
CLI.setDebugLoc(dl)
.setChain(Chain)
.setLibCallee(TLI->getLibcallCallingConv(RTLIB::MEMMOVE),
Dst.getValueType().getTypeForEVT(*getContext()),
getExternalSymbol(TLI->getLibcallName(RTLIB::MEMMOVE),
TLI->getPointerTy(getDataLayout())),
std::move(Args))
.setDiscardResult()
.setTailCall(isTailCall);
std::pair<SDValue,SDValue> CallResult = TLI->LowerCallTo(CLI);
return CallResult.second;
}
SDValue SelectionDAG::getMemset(SDValue Chain, const SDLoc &dl, SDValue Dst,
SDValue Src, SDValue Size, unsigned Align,
bool isVol, bool isTailCall,
MachinePointerInfo DstPtrInfo) {
assert(Align && "The SDAG layer expects explicit alignment and reserves 0");
// Check to see if we should lower the memset to stores first.
// For cases within the target-specified limits, this is the best choice.
ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
if (ConstantSize) {
// Memset with size zero? Just return the original chain.
if (ConstantSize->isNullValue())
return Chain;
SDValue Result =
getMemsetStores(*this, dl, Chain, Dst, Src, ConstantSize->getZExtValue(),
Align, isVol, DstPtrInfo);
if (Result.getNode())
return Result;
}
// Then check to see if we should lower the memset with target-specific
// code. If the target chooses to do this, this is the next best.
if (TSI) {
SDValue Result = TSI->EmitTargetCodeForMemset(
*this, dl, Chain, Dst, Src, Size, Align, isVol, DstPtrInfo);
if (Result.getNode())
return Result;
}
checkAddrSpaceIsValidForLibcall(TLI, DstPtrInfo.getAddrSpace());
// Emit a library call.
Type *IntPtrTy = getDataLayout().getIntPtrType(*getContext());
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Node = Dst; Entry.Ty = IntPtrTy;
Args.push_back(Entry);
Entry.Node = Src;
Entry.Ty = Src.getValueType().getTypeForEVT(*getContext());
Args.push_back(Entry);
Entry.Node = Size;
Entry.Ty = IntPtrTy;
Args.push_back(Entry);
// FIXME: pass in SDLoc
TargetLowering::CallLoweringInfo CLI(*this);
CLI.setDebugLoc(dl)
.setChain(Chain)
.setLibCallee(TLI->getLibcallCallingConv(RTLIB::MEMSET),
Dst.getValueType().getTypeForEVT(*getContext()),
getExternalSymbol(TLI->getLibcallName(RTLIB::MEMSET),
TLI->getPointerTy(getDataLayout())),
std::move(Args))
.setDiscardResult()
.setTailCall(isTailCall);
std::pair<SDValue,SDValue> CallResult = TLI->LowerCallTo(CLI);
return CallResult.second;
}
SDValue SelectionDAG::getAtomic(unsigned Opcode, const SDLoc &dl, EVT MemVT,
SDVTList VTList, ArrayRef<SDValue> Ops,
MachineMemOperand *MMO) {
FoldingSetNodeID ID;
ID.AddInteger(MemVT.getRawBits());
AddNodeIDNode(ID, Opcode, VTList, Ops);
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void* IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<AtomicSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<AtomicSDNode>(Opcode, dl.getIROrder(), dl.getDebugLoc(),
VTList, MemVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getAtomicCmpSwap(
unsigned Opcode, const SDLoc &dl, EVT MemVT, SDVTList VTs, SDValue Chain,
SDValue Ptr, SDValue Cmp, SDValue Swp, MachinePointerInfo PtrInfo,
unsigned Alignment, AtomicOrdering SuccessOrdering,
AtomicOrdering FailureOrdering, SyncScope::ID SSID) {
assert(Opcode == ISD::ATOMIC_CMP_SWAP ||
Opcode == ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS);
assert(Cmp.getValueType() == Swp.getValueType() && "Invalid Atomic Op Types");
if (Alignment == 0) // Ensure that codegen never sees alignment 0
Alignment = getEVTAlignment(MemVT);
MachineFunction &MF = getMachineFunction();
// FIXME: Volatile isn't really correct; we should keep track of atomic
// orderings in the memoperand.
auto Flags = MachineMemOperand::MOVolatile | MachineMemOperand::MOLoad |
MachineMemOperand::MOStore;
MachineMemOperand *MMO =
MF.getMachineMemOperand(PtrInfo, Flags, MemVT.getStoreSize(), Alignment,
AAMDNodes(), nullptr, SSID, SuccessOrdering,
FailureOrdering);
return getAtomicCmpSwap(Opcode, dl, MemVT, VTs, Chain, Ptr, Cmp, Swp, MMO);
}
SDValue SelectionDAG::getAtomicCmpSwap(unsigned Opcode, const SDLoc &dl,
EVT MemVT, SDVTList VTs, SDValue Chain,
SDValue Ptr, SDValue Cmp, SDValue Swp,
MachineMemOperand *MMO) {
assert(Opcode == ISD::ATOMIC_CMP_SWAP ||
Opcode == ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS);
assert(Cmp.getValueType() == Swp.getValueType() && "Invalid Atomic Op Types");
SDValue Ops[] = {Chain, Ptr, Cmp, Swp};
return getAtomic(Opcode, dl, MemVT, VTs, Ops, MMO);
}
SDValue SelectionDAG::getAtomic(unsigned Opcode, const SDLoc &dl, EVT MemVT,
SDValue Chain, SDValue Ptr, SDValue Val,
const Value *PtrVal, unsigned Alignment,
AtomicOrdering Ordering,
SyncScope::ID SSID) {
if (Alignment == 0) // Ensure that codegen never sees alignment 0
Alignment = getEVTAlignment(MemVT);
MachineFunction &MF = getMachineFunction();
// An atomic store does not load. An atomic load does not store.
// (An atomicrmw obviously both loads and stores.)
// For now, atomics are considered to be volatile always, and they are
// chained as such.
// FIXME: Volatile isn't really correct; we should keep track of atomic
// orderings in the memoperand.
auto Flags = MachineMemOperand::MOVolatile;
if (Opcode != ISD::ATOMIC_STORE)
Flags |= MachineMemOperand::MOLoad;
if (Opcode != ISD::ATOMIC_LOAD)
Flags |= MachineMemOperand::MOStore;
MachineMemOperand *MMO =
MF.getMachineMemOperand(MachinePointerInfo(PtrVal), Flags,
MemVT.getStoreSize(), Alignment, AAMDNodes(),
nullptr, SSID, Ordering);
return getAtomic(Opcode, dl, MemVT, Chain, Ptr, Val, MMO);
}
SDValue SelectionDAG::getAtomic(unsigned Opcode, const SDLoc &dl, EVT MemVT,
SDValue Chain, SDValue Ptr, SDValue Val,
MachineMemOperand *MMO) {
assert((Opcode == ISD::ATOMIC_LOAD_ADD ||
Opcode == ISD::ATOMIC_LOAD_SUB ||
Opcode == ISD::ATOMIC_LOAD_AND ||
Opcode == ISD::ATOMIC_LOAD_OR ||
Opcode == ISD::ATOMIC_LOAD_XOR ||
Opcode == ISD::ATOMIC_LOAD_NAND ||
Opcode == ISD::ATOMIC_LOAD_MIN ||
Opcode == ISD::ATOMIC_LOAD_MAX ||
Opcode == ISD::ATOMIC_LOAD_UMIN ||
Opcode == ISD::ATOMIC_LOAD_UMAX ||
Opcode == ISD::ATOMIC_SWAP ||
Opcode == ISD::ATOMIC_STORE) &&
"Invalid Atomic Op");
EVT VT = Val.getValueType();
SDVTList VTs = Opcode == ISD::ATOMIC_STORE ? getVTList(MVT::Other) :
getVTList(VT, MVT::Other);
SDValue Ops[] = {Chain, Ptr, Val};
return getAtomic(Opcode, dl, MemVT, VTs, Ops, MMO);
}
SDValue SelectionDAG::getAtomic(unsigned Opcode, const SDLoc &dl, EVT MemVT,
EVT VT, SDValue Chain, SDValue Ptr,
MachineMemOperand *MMO) {
assert(Opcode == ISD::ATOMIC_LOAD && "Invalid Atomic Op");
SDVTList VTs = getVTList(VT, MVT::Other);
SDValue Ops[] = {Chain, Ptr};
return getAtomic(Opcode, dl, MemVT, VTs, Ops, MMO);
}
/// getMergeValues - Create a MERGE_VALUES node from the given operands.
SDValue SelectionDAG::getMergeValues(ArrayRef<SDValue> Ops, const SDLoc &dl) {
if (Ops.size() == 1)
return Ops[0];
SmallVector<EVT, 4> VTs;
VTs.reserve(Ops.size());
for (unsigned i = 0; i < Ops.size(); ++i)
VTs.push_back(Ops[i].getValueType());
return getNode(ISD::MERGE_VALUES, dl, getVTList(VTs), Ops);
}
SDValue SelectionDAG::getMemIntrinsicNode(
unsigned Opcode, const SDLoc &dl, SDVTList VTList, ArrayRef<SDValue> Ops,
EVT MemVT, MachinePointerInfo PtrInfo, unsigned Align, bool Vol,
bool ReadMem, bool WriteMem, unsigned Size) {
if (Align == 0) // Ensure that codegen never sees alignment 0
Align = getEVTAlignment(MemVT);
MachineFunction &MF = getMachineFunction();
auto Flags = MachineMemOperand::MONone;
if (WriteMem)
Flags |= MachineMemOperand::MOStore;
if (ReadMem)
Flags |= MachineMemOperand::MOLoad;
if (Vol)
Flags |= MachineMemOperand::MOVolatile;
if (!Size)
Size = MemVT.getStoreSize();
MachineMemOperand *MMO =
MF.getMachineMemOperand(PtrInfo, Flags, Size, Align);
return getMemIntrinsicNode(Opcode, dl, VTList, Ops, MemVT, MMO);
}
SDValue SelectionDAG::getMemIntrinsicNode(unsigned Opcode, const SDLoc &dl,
SDVTList VTList,
ArrayRef<SDValue> Ops, EVT MemVT,
MachineMemOperand *MMO) {
assert((Opcode == ISD::INTRINSIC_VOID ||
Opcode == ISD::INTRINSIC_W_CHAIN ||
Opcode == ISD::PREFETCH ||
Opcode == ISD::LIFETIME_START ||
Opcode == ISD::LIFETIME_END ||
((int)Opcode <= std::numeric_limits<int>::max() &&
(int)Opcode >= ISD::FIRST_TARGET_MEMORY_OPCODE)) &&
"Opcode is not a memory-accessing opcode!");
// Memoize the node unless it returns a flag.
MemIntrinsicSDNode *N;
if (VTList.VTs[VTList.NumVTs-1] != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTList, Ops);
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<MemIntrinsicSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
N = newSDNode<MemIntrinsicSDNode>(Opcode, dl.getIROrder(), dl.getDebugLoc(),
VTList, MemVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<MemIntrinsicSDNode>(Opcode, dl.getIROrder(), dl.getDebugLoc(),
VTList, MemVT, MMO);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
/// InferPointerInfo - If the specified ptr/offset is a frame index, infer a
/// MachinePointerInfo record from it. This is particularly useful because the
/// code generator has many cases where it doesn't bother passing in a
/// MachinePointerInfo to getLoad or getStore when it has "FI+Cst".
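/// For example, (add (FrameIndex 3), (Constant 8)) is modeled as a
/// fixed-stack pointer for frame index 3 at offset 8.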
static MachinePointerInfo InferPointerInfo(SelectionDAG &DAG, SDValue Ptr,
int64_t Offset = 0) {
// If this is FI+Offset, we can model it.
if (const FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Ptr))
return MachinePointerInfo::getFixedStack(DAG.getMachineFunction(),
FI->getIndex(), Offset);
// If this is (FI+Offset1)+Offset2, we can model it.
if (Ptr.getOpcode() != ISD::ADD ||
!isa<ConstantSDNode>(Ptr.getOperand(1)) ||
!isa<FrameIndexSDNode>(Ptr.getOperand(0)))
return MachinePointerInfo();
int FI = cast<FrameIndexSDNode>(Ptr.getOperand(0))->getIndex();
return MachinePointerInfo::getFixedStack(
DAG.getMachineFunction(), FI,
Offset + cast<ConstantSDNode>(Ptr.getOperand(1))->getSExtValue());
}
/// InferPointerInfo - If the specified ptr/offset is a frame index, infer a
/// MachinePointerInfo record from it. This is particularly useful because the
/// code generator has many cases where it doesn't bother passing in a
/// MachinePointerInfo to getLoad or getStore when it has "FI+Cst".
static MachinePointerInfo InferPointerInfo(SelectionDAG &DAG, SDValue Ptr,
SDValue OffsetOp) {
// Handle a constant offset by folding it in; treat an undef offset as zero.
// Any other offset can't be modeled.
if (ConstantSDNode *OffsetNode = dyn_cast<ConstantSDNode>(OffsetOp))
return InferPointerInfo(DAG, Ptr, OffsetNode->getSExtValue());
if (OffsetOp.isUndef())
return InferPointerInfo(DAG, Ptr);
return MachinePointerInfo();
}
SDValue SelectionDAG::getLoad(ISD::MemIndexedMode AM, ISD::LoadExtType ExtType,
EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, SDValue Offset,
MachinePointerInfo PtrInfo, EVT MemVT,
unsigned Alignment,
MachineMemOperand::Flags MMOFlags,
const AAMDNodes &AAInfo, const MDNode *Ranges) {
assert(Chain.getValueType() == MVT::Other &&
"Invalid chain type");
if (Alignment == 0) // Ensure that codegen never sees alignment 0
Alignment = getEVTAlignment(MemVT);
MMOFlags |= MachineMemOperand::MOLoad;
assert((MMOFlags & MachineMemOperand::MOStore) == 0);
// If we don't have a PtrInfo, infer the trivial frame index case to simplify
// clients.
if (PtrInfo.V.isNull())
PtrInfo = InferPointerInfo(*this, Ptr, Offset);
MachineFunction &MF = getMachineFunction();
MachineMemOperand *MMO = MF.getMachineMemOperand(
PtrInfo, MMOFlags, MemVT.getStoreSize(), Alignment, AAInfo, Ranges);
return getLoad(AM, ExtType, VT, dl, Chain, Ptr, Offset, MemVT, MMO);
}
SDValue SelectionDAG::getLoad(ISD::MemIndexedMode AM, ISD::LoadExtType ExtType,
EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, SDValue Offset, EVT MemVT,
MachineMemOperand *MMO) {
if (VT == MemVT) {
ExtType = ISD::NON_EXTLOAD;
} else if (ExtType == ISD::NON_EXTLOAD) {
assert(VT == MemVT && "Non-extending load from different memory type!");
} else {
// Extending load.
assert(MemVT.getScalarType().bitsLT(VT.getScalarType()) &&
"Should only be an extending load, not truncating!");
assert(VT.isInteger() == MemVT.isInteger() &&
"Cannot convert from FP to Int or Int -> FP!");
assert(VT.isVector() == MemVT.isVector() &&
"Cannot use an ext load to convert to or from a vector!");
assert((!VT.isVector() ||
VT.getVectorNumElements() == MemVT.getVectorNumElements()) &&
"Cannot use an ext load to change the number of vector elements!");
}
bool Indexed = AM != ISD::UNINDEXED;
assert((Indexed || Offset.isUndef()) && "Unindexed load with an offset!");
SDVTList VTs = Indexed ?
getVTList(VT, Ptr.getValueType(), MVT::Other) : getVTList(VT, MVT::Other);
SDValue Ops[] = { Chain, Ptr, Offset };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::LOAD, VTs, Ops);
ID.AddInteger(MemVT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<LoadSDNode>(
dl.getIROrder(), VTs, AM, ExtType, MemVT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<LoadSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<LoadSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs, AM,
ExtType, MemVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getLoad(EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, MachinePointerInfo PtrInfo,
unsigned Alignment,
MachineMemOperand::Flags MMOFlags,
const AAMDNodes &AAInfo, const MDNode *Ranges) {
SDValue Undef = getUNDEF(Ptr.getValueType());
return getLoad(ISD::UNINDEXED, ISD::NON_EXTLOAD, VT, dl, Chain, Ptr, Undef,
PtrInfo, VT, Alignment, MMOFlags, AAInfo, Ranges);
}
SDValue SelectionDAG::getLoad(EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, MachineMemOperand *MMO) {
SDValue Undef = getUNDEF(Ptr.getValueType());
return getLoad(ISD::UNINDEXED, ISD::NON_EXTLOAD, VT, dl, Chain, Ptr, Undef,
VT, MMO);
}
SDValue SelectionDAG::getExtLoad(ISD::LoadExtType ExtType, const SDLoc &dl,
EVT VT, SDValue Chain, SDValue Ptr,
MachinePointerInfo PtrInfo, EVT MemVT,
unsigned Alignment,
MachineMemOperand::Flags MMOFlags,
const AAMDNodes &AAInfo) {
SDValue Undef = getUNDEF(Ptr.getValueType());
return getLoad(ISD::UNINDEXED, ExtType, VT, dl, Chain, Ptr, Undef, PtrInfo,
MemVT, Alignment, MMOFlags, AAInfo);
}
SDValue SelectionDAG::getExtLoad(ISD::LoadExtType ExtType, const SDLoc &dl,
EVT VT, SDValue Chain, SDValue Ptr, EVT MemVT,
MachineMemOperand *MMO) {
SDValue Undef = getUNDEF(Ptr.getValueType());
return getLoad(ISD::UNINDEXED, ExtType, VT, dl, Chain, Ptr, Undef,
MemVT, MMO);
}
SDValue SelectionDAG::getIndexedLoad(SDValue OrigLoad, const SDLoc &dl,
SDValue Base, SDValue Offset,
ISD::MemIndexedMode AM) {
LoadSDNode *LD = cast<LoadSDNode>(OrigLoad);
assert(LD->getOffset().isUndef() && "Load is already an indexed load!");
// Don't propagate the invariant or dereferenceable flags.
auto MMOFlags =
LD->getMemOperand()->getFlags() &
~(MachineMemOperand::MOInvariant | MachineMemOperand::MODereferenceable);
return getLoad(AM, LD->getExtensionType(), OrigLoad.getValueType(), dl,
LD->getChain(), Base, Offset, LD->getPointerInfo(),
LD->getMemoryVT(), LD->getAlignment(), MMOFlags,
LD->getAAInfo());
}
SDValue SelectionDAG::getStore(SDValue Chain, const SDLoc &dl, SDValue Val,
SDValue Ptr, MachinePointerInfo PtrInfo,
unsigned Alignment,
MachineMemOperand::Flags MMOFlags,
const AAMDNodes &AAInfo) {
assert(Chain.getValueType() == MVT::Other && "Invalid chain type");
if (Alignment == 0) // Ensure that codegen never sees alignment 0
Alignment = getEVTAlignment(Val.getValueType());
MMOFlags |= MachineMemOperand::MOStore;
assert((MMOFlags & MachineMemOperand::MOLoad) == 0);
if (PtrInfo.V.isNull())
PtrInfo = InferPointerInfo(*this, Ptr);
MachineFunction &MF = getMachineFunction();
MachineMemOperand *MMO = MF.getMachineMemOperand(
PtrInfo, MMOFlags, Val.getValueType().getStoreSize(), Alignment, AAInfo);
return getStore(Chain, dl, Val, Ptr, MMO);
}
SDValue SelectionDAG::getStore(SDValue Chain, const SDLoc &dl, SDValue Val,
SDValue Ptr, MachineMemOperand *MMO) {
assert(Chain.getValueType() == MVT::Other &&
"Invalid chain type");
EVT VT = Val.getValueType();
SDVTList VTs = getVTList(MVT::Other);
SDValue Undef = getUNDEF(Ptr.getValueType());
SDValue Ops[] = { Chain, Val, Ptr, Undef };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::STORE, VTs, Ops);
ID.AddInteger(VT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<StoreSDNode>(
dl.getIROrder(), VTs, ISD::UNINDEXED, false, VT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<StoreSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<StoreSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs,
ISD::UNINDEXED, false, VT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getTruncStore(SDValue Chain, const SDLoc &dl, SDValue Val,
SDValue Ptr, MachinePointerInfo PtrInfo,
EVT SVT, unsigned Alignment,
MachineMemOperand::Flags MMOFlags,
const AAMDNodes &AAInfo) {
assert(Chain.getValueType() == MVT::Other &&
"Invalid chain type");
if (Alignment == 0) // Ensure that codegen never sees alignment 0
Alignment = getEVTAlignment(SVT);
MMOFlags |= MachineMemOperand::MOStore;
assert((MMOFlags & MachineMemOperand::MOLoad) == 0);
if (PtrInfo.V.isNull())
PtrInfo = InferPointerInfo(*this, Ptr);
MachineFunction &MF = getMachineFunction();
MachineMemOperand *MMO = MF.getMachineMemOperand(
PtrInfo, MMOFlags, SVT.getStoreSize(), Alignment, AAInfo);
return getTruncStore(Chain, dl, Val, Ptr, SVT, MMO);
}
SDValue SelectionDAG::getTruncStore(SDValue Chain, const SDLoc &dl, SDValue Val,
SDValue Ptr, EVT SVT,
MachineMemOperand *MMO) {
EVT VT = Val.getValueType();
assert(Chain.getValueType() == MVT::Other &&
"Invalid chain type");
if (VT == SVT)
return getStore(Chain, dl, Val, Ptr, MMO);
assert(SVT.getScalarType().bitsLT(VT.getScalarType()) &&
"Should only be a truncating store, not extending!");
assert(VT.isInteger() == SVT.isInteger() &&
"Can't do FP-INT conversion!");
assert(VT.isVector() == SVT.isVector() &&
"Cannot use trunc store to convert to or from a vector!");
assert((!VT.isVector() ||
VT.getVectorNumElements() == SVT.getVectorNumElements()) &&
"Cannot use trunc store to change the number of vector elements!");
SDVTList VTs = getVTList(MVT::Other);
SDValue Undef = getUNDEF(Ptr.getValueType());
SDValue Ops[] = { Chain, Val, Ptr, Undef };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::STORE, VTs, Ops);
ID.AddInteger(SVT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<StoreSDNode>(
dl.getIROrder(), VTs, ISD::UNINDEXED, true, SVT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<StoreSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<StoreSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs,
ISD::UNINDEXED, true, SVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getIndexedStore(SDValue OrigStore, const SDLoc &dl,
SDValue Base, SDValue Offset,
ISD::MemIndexedMode AM) {
StoreSDNode *ST = cast<StoreSDNode>(OrigStore);
assert(ST->getOffset().isUndef() && "Store is already an indexed store!");
SDVTList VTs = getVTList(Base.getValueType(), MVT::Other);
SDValue Ops[] = { ST->getChain(), ST->getValue(), Base, Offset };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::STORE, VTs, Ops);
ID.AddInteger(ST->getMemoryVT().getRawBits());
ID.AddInteger(ST->getRawSubclassData());
ID.AddInteger(ST->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP))
return SDValue(E, 0);
auto *N = newSDNode<StoreSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs, AM,
ST->isTruncatingStore(), ST->getMemoryVT(),
ST->getMemOperand());
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMaskedLoad(EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, SDValue Mask, SDValue Src0,
EVT MemVT, MachineMemOperand *MMO,
ISD::LoadExtType ExtTy, bool isExpanding) {
SDVTList VTs = getVTList(VT, MVT::Other);
SDValue Ops[] = { Chain, Ptr, Mask, Src0 };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::MLOAD, VTs, Ops);
ID.AddInteger(VT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<MaskedLoadSDNode>(
dl.getIROrder(), VTs, ExtTy, isExpanding, MemVT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<MaskedLoadSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<MaskedLoadSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs,
ExtTy, isExpanding, MemVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMaskedStore(SDValue Chain, const SDLoc &dl,
SDValue Val, SDValue Ptr, SDValue Mask,
EVT MemVT, MachineMemOperand *MMO,
bool IsTruncating, bool IsCompressing) {
assert(Chain.getValueType() == MVT::Other &&
"Invalid chain type");
EVT VT = Val.getValueType();
SDVTList VTs = getVTList(MVT::Other);
SDValue Ops[] = { Chain, Ptr, Mask, Val };
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::MSTORE, VTs, Ops);
ID.AddInteger(VT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<MaskedStoreSDNode>(
dl.getIROrder(), VTs, IsTruncating, IsCompressing, MemVT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<MaskedStoreSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<MaskedStoreSDNode>(dl.getIROrder(), dl.getDebugLoc(), VTs,
IsTruncating, IsCompressing, MemVT, MMO);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMaskedGather(SDVTList VTs, EVT VT, const SDLoc &dl,
ArrayRef<SDValue> Ops,
MachineMemOperand *MMO) {
assert(Ops.size() == 5 && "Incompatible number of operands");
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::MGATHER, VTs, Ops);
ID.AddInteger(VT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<MaskedGatherSDNode>(
dl.getIROrder(), VTs, VT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<MaskedGatherSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<MaskedGatherSDNode>(dl.getIROrder(), dl.getDebugLoc(),
VTs, VT, MMO);
createOperands(N, Ops);
assert(N->getValue().getValueType() == N->getValueType(0) &&
"Incompatible type of the PassThru value in MaskedGatherSDNode");
assert(N->getMask().getValueType().getVectorNumElements() ==
N->getValueType(0).getVectorNumElements() &&
"Vector width mismatch between mask and data");
assert(N->getIndex().getValueType().getVectorNumElements() ==
N->getValueType(0).getVectorNumElements() &&
"Vector width mismatch between index and data");
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getMaskedScatter(SDVTList VTs, EVT VT, const SDLoc &dl,
ArrayRef<SDValue> Ops,
MachineMemOperand *MMO) {
assert(Ops.size() == 5 && "Incompatible number of operands");
FoldingSetNodeID ID;
AddNodeIDNode(ID, ISD::MSCATTER, VTs, Ops);
ID.AddInteger(VT.getRawBits());
ID.AddInteger(getSyntheticNodeSubclassData<MaskedScatterSDNode>(
dl.getIROrder(), VTs, VT, MMO));
ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
cast<MaskedScatterSDNode>(E)->refineAlignment(MMO);
return SDValue(E, 0);
}
auto *N = newSDNode<MaskedScatterSDNode>(dl.getIROrder(), dl.getDebugLoc(),
VTs, VT, MMO);
createOperands(N, Ops);
assert(N->getMask().getValueType().getVectorNumElements() ==
N->getValue().getValueType().getVectorNumElements() &&
"Vector width mismatch between mask and data");
assert(N->getIndex().getValueType().getVectorNumElements() ==
N->getValue().getValueType().getVectorNumElements() &&
"Vector width mismatch between index and data");
CSEMap.InsertNode(N, IP);
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getVAArg(EVT VT, const SDLoc &dl, SDValue Chain,
SDValue Ptr, SDValue SV, unsigned Align) {
SDValue Ops[] = { Chain, Ptr, SV, getTargetConstant(Align, dl, MVT::i32) };
return getNode(ISD::VAARG, dl, getVTList(VT, MVT::Other), Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
ArrayRef<SDUse> Ops) {
switch (Ops.size()) {
case 0: return getNode(Opcode, DL, VT);
case 1: return getNode(Opcode, DL, VT, static_cast<const SDValue>(Ops[0]));
case 2: return getNode(Opcode, DL, VT, Ops[0], Ops[1]);
case 3: return getNode(Opcode, DL, VT, Ops[0], Ops[1], Ops[2]);
default: break;
}
// Copy from an SDUse array into an SDValue array for use with
// the regular getNode logic.
SmallVector<SDValue, 8> NewOps(Ops.begin(), Ops.end());
return getNode(Opcode, DL, VT, NewOps);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
ArrayRef<SDValue> Ops, const SDNodeFlags Flags) {
unsigned NumOps = Ops.size();
switch (NumOps) {
case 0: return getNode(Opcode, DL, VT);
case 1: return getNode(Opcode, DL, VT, Ops[0], Flags);
case 2: return getNode(Opcode, DL, VT, Ops[0], Ops[1], Flags);
case 3: return getNode(Opcode, DL, VT, Ops[0], Ops[1], Ops[2]);
default: break;
}
switch (Opcode) {
default: break;
case ISD::CONCAT_VECTORS:
// Attempt to fold CONCAT_VECTORS into BUILD_VECTOR or UNDEF.
if (SDValue V = FoldCONCAT_VECTORS(DL, VT, Ops, *this))
return V;
break;
case ISD::SELECT_CC:
assert(NumOps == 5 && "SELECT_CC takes 5 operands!");
assert(Ops[0].getValueType() == Ops[1].getValueType() &&
"LHS and RHS of condition must have same type!");
assert(Ops[2].getValueType() == Ops[3].getValueType() &&
"True and False arms of SelectCC must have same type!");
assert(Ops[2].getValueType() == VT &&
"select_cc node must be of same type as true and false value!");
break;
case ISD::BR_CC:
assert(NumOps == 5 && "BR_CC takes 5 operands!");
assert(Ops[2].getValueType() == Ops[3].getValueType() &&
"LHS/RHS of comparison should match types!");
break;
}
// Memoize nodes.
SDNode *N;
SDVTList VTs = getVTList(VT);
if (VT != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTs, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP))
return SDValue(E, 0);
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL,
ArrayRef<EVT> ResultTys, ArrayRef<SDValue> Ops) {
return getNode(Opcode, DL, getVTList(ResultTys), Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
ArrayRef<SDValue> Ops) {
if (VTList.NumVTs == 1)
return getNode(Opcode, DL, VTList.VTs[0], Ops);
#if 0
switch (Opcode) {
// FIXME: figure out how to safely handle things like
// int foo(int x) { return 1 << (x & 255); }
// int bar() { return foo(256); }
case ISD::SRA_PARTS:
case ISD::SRL_PARTS:
case ISD::SHL_PARTS:
if (N3.getOpcode() == ISD::SIGN_EXTEND_INREG &&
cast<VTSDNode>(N3.getOperand(1))->getVT() != MVT::i1)
return getNode(Opcode, DL, VT, N1, N2, N3.getOperand(0));
else if (N3.getOpcode() == ISD::AND)
if (ConstantSDNode *AndRHS = dyn_cast<ConstantSDNode>(N3.getOperand(1))) {
// If the and is only masking out bits that cannot affect the shift,
// eliminate the and.
unsigned NumBits = VT.getScalarSizeInBits()*2;
if ((AndRHS->getValue() & (NumBits-1)) == NumBits-1)
return getNode(Opcode, DL, VT, N1, N2, N3.getOperand(0));
}
break;
}
#endif
// Memoize the node unless it returns a flag.
SDNode *N;
if (VTList.VTs[VTList.NumVTs-1] != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTList, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP))
return SDValue(E, 0);
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTList);
createOperands(N, Ops);
CSEMap.InsertNode(N, IP);
} else {
N = newSDNode<SDNode>(Opcode, DL.getIROrder(), DL.getDebugLoc(), VTList);
createOperands(N, Ops);
}
InsertNode(N);
return SDValue(N, 0);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL,
SDVTList VTList) {
return getNode(Opcode, DL, VTList, None);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
SDValue N1) {
SDValue Ops[] = { N1 };
return getNode(Opcode, DL, VTList, Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
SDValue N1, SDValue N2) {
SDValue Ops[] = { N1, N2 };
return getNode(Opcode, DL, VTList, Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
SDValue N1, SDValue N2, SDValue N3) {
SDValue Ops[] = { N1, N2, N3 };
return getNode(Opcode, DL, VTList, Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
SDValue N1, SDValue N2, SDValue N3, SDValue N4) {
SDValue Ops[] = { N1, N2, N3, N4 };
return getNode(Opcode, DL, VTList, Ops);
}
SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, SDVTList VTList,
SDValue N1, SDValue N2, SDValue N3, SDValue N4,
SDValue N5) {
SDValue Ops[] = { N1, N2, N3, N4, N5 };
return getNode(Opcode, DL, VTList, Ops);
}
SDVTList SelectionDAG::getVTList(EVT VT) {
return makeVTList(SDNode::getValueTypeList(VT), 1);
}
SDVTList SelectionDAG::getVTList(EVT VT1, EVT VT2) {
FoldingSetNodeID ID;
ID.AddInteger(2U);
ID.AddInteger(VT1.getRawBits());
ID.AddInteger(VT2.getRawBits());
void *IP = nullptr;
SDVTListNode *Result = VTListMap.FindNodeOrInsertPos(ID, IP);
if (!Result) {
EVT *Array = Allocator.Allocate<EVT>(2);
Array[0] = VT1;
Array[1] = VT2;
Result = new (Allocator) SDVTListNode(ID.Intern(Allocator), Array, 2);
VTListMap.InsertNode(Result, IP);
}
return Result->getSDVTList();
}
SDVTList SelectionDAG::getVTList(EVT VT1, EVT VT2, EVT VT3) {
FoldingSetNodeID ID;
ID.AddInteger(3U);
ID.AddInteger(VT1.getRawBits());
ID.AddInteger(VT2.getRawBits());
ID.AddInteger(VT3.getRawBits());
void *IP = nullptr;
SDVTListNode *Result = VTListMap.FindNodeOrInsertPos(ID, IP);
if (!Result) {
EVT *Array = Allocator.Allocate<EVT>(3);
Array[0] = VT1;
Array[1] = VT2;
Array[2] = VT3;
Result = new (Allocator) SDVTListNode(ID.Intern(Allocator), Array, 3);
VTListMap.InsertNode(Result, IP);
}
return Result->getSDVTList();
}
SDVTList SelectionDAG::getVTList(EVT VT1, EVT VT2, EVT VT3, EVT VT4) {
FoldingSetNodeID ID;
ID.AddInteger(4U);
ID.AddInteger(VT1.getRawBits());
ID.AddInteger(VT2.getRawBits());
ID.AddInteger(VT3.getRawBits());
ID.AddInteger(VT4.getRawBits());
void *IP = nullptr;
SDVTListNode *Result = VTListMap.FindNodeOrInsertPos(ID, IP);
if (!Result) {
EVT *Array = Allocator.Allocate<EVT>(4);
Array[0] = VT1;
Array[1] = VT2;
Array[2] = VT3;
Array[3] = VT4;
Result = new (Allocator) SDVTListNode(ID.Intern(Allocator), Array, 4);
VTListMap.InsertNode(Result, IP);
}
return Result->getSDVTList();
}
SDVTList SelectionDAG::getVTList(ArrayRef<EVT> VTs) {
unsigned NumVTs = VTs.size();
FoldingSetNodeID ID;
ID.AddInteger(NumVTs);
for (unsigned index = 0; index < NumVTs; index++) {
ID.AddInteger(VTs[index].getRawBits());
}
void *IP = nullptr;
SDVTListNode *Result = VTListMap.FindNodeOrInsertPos(ID, IP);
if (!Result) {
EVT *Array = Allocator.Allocate<EVT>(NumVTs);
std::copy(VTs.begin(), VTs.end(), Array);
Result = new (Allocator) SDVTListNode(ID.Intern(Allocator), Array, NumVTs);
VTListMap.InsertNode(Result, IP);
}
return Result->getSDVTList();
}
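// Example (illustrative sketch, not part of the original source): the
// getVTList overloads intern their result, so repeated queries with the same
// types return the same array and can be compared by pointer. Assumes a
// SelectionDAG &DAG in scope.
#if 0
SDVTList LoadVTs = DAG.getVTList(MVT::i32, MVT::Other); // value + out-chain
SDVTList Again = DAG.getVTList(MVT::i32, MVT::Other);
assert(LoadVTs.VTs == Again.VTs && "Identical lists share one interned array");
#endif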
/// UpdateNodeOperands - *Mutate* the specified node in-place to have the
/// specified operands. If the resultant node already exists in the DAG,
/// this does not modify the specified node, instead it returns the node that
/// already exists. If the resultant node does not exist in the DAG, the
/// input node is returned. As a degenerate case, if you specify the same
/// input operands as the node already has, the input node is returned.
SDNode *SelectionDAG::UpdateNodeOperands(SDNode *N, SDValue Op) {
assert(N->getNumOperands() == 1 && "Update with wrong number of operands");
// Check to see if there is no change.
if (Op == N->getOperand(0)) return N;
// See if the modified node already exists.
void *InsertPos = nullptr;
if (SDNode *Existing = FindModifiedNodeSlot(N, Op, InsertPos))
return Existing;
// Nope it doesn't. Remove the node from its current place in the maps.
if (InsertPos)
if (!RemoveNodeFromCSEMaps(N))
InsertPos = nullptr;
// Now we update the operands.
N->OperandList[0].set(Op);
// If this gets put into a CSE map, add it.
if (InsertPos) CSEMap.InsertNode(N, InsertPos);
return N;
}
SDNode *SelectionDAG::UpdateNodeOperands(SDNode *N, SDValue Op1, SDValue Op2) {
assert(N->getNumOperands() == 2 && "Update with wrong number of operands");
// Check to see if there is no change.
if (Op1 == N->getOperand(0) && Op2 == N->getOperand(1))
return N; // No operands changed, just return the input node.
// See if the modified node already exists.
void *InsertPos = nullptr;
if (SDNode *Existing = FindModifiedNodeSlot(N, Op1, Op2, InsertPos))
return Existing;
// Nope it doesn't. Remove the node from its current place in the maps.
if (InsertPos)
if (!RemoveNodeFromCSEMaps(N))
InsertPos = nullptr;
// Now we update the operands.
if (N->OperandList[0] != Op1)
N->OperandList[0].set(Op1);
if (N->OperandList[1] != Op2)
N->OperandList[1].set(Op2);
// If this gets put into a CSE map, add it.
if (InsertPos) CSEMap.InsertNode(N, InsertPos);
return N;
}
SDNode *SelectionDAG::
UpdateNodeOperands(SDNode *N, SDValue Op1, SDValue Op2, SDValue Op3) {
SDValue Ops[] = { Op1, Op2, Op3 };
return UpdateNodeOperands(N, Ops);
}
SDNode *SelectionDAG::
UpdateNodeOperands(SDNode *N, SDValue Op1, SDValue Op2,
SDValue Op3, SDValue Op4) {
SDValue Ops[] = { Op1, Op2, Op3, Op4 };
return UpdateNodeOperands(N, Ops);
}
SDNode *SelectionDAG::
UpdateNodeOperands(SDNode *N, SDValue Op1, SDValue Op2,
SDValue Op3, SDValue Op4, SDValue Op5) {
SDValue Ops[] = { Op1, Op2, Op3, Op4, Op5 };
return UpdateNodeOperands(N, Ops);
}
SDNode *SelectionDAG::
UpdateNodeOperands(SDNode *N, ArrayRef<SDValue> Ops) {
unsigned NumOps = Ops.size();
assert(N->getNumOperands() == NumOps &&
"Update with wrong number of operands");
// If no operands changed just return the input node.
if (std::equal(Ops.begin(), Ops.end(), N->op_begin()))
return N;
// See if the modified node already exists.
void *InsertPos = nullptr;
if (SDNode *Existing = FindModifiedNodeSlot(N, Ops, InsertPos))
return Existing;
// Nope it doesn't. Remove the node from its current place in the maps.
if (InsertPos)
if (!RemoveNodeFromCSEMaps(N))
InsertPos = nullptr;
// Now we update the operands.
for (unsigned i = 0; i != NumOps; ++i)
if (N->OperandList[i] != Ops[i])
N->OperandList[i].set(Ops[i]);
// If this gets put into a CSE map, add it.
if (InsertPos) CSEMap.InsertNode(N, InsertPos);
return N;
}
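// Example (illustrative sketch, not part of the original source): because
// UpdateNodeOperands consults the CSE maps, the caller must use the returned
// node, which may be a pre-existing equivalent rather than N itself. N, X and
// Y are assumed to be in scope.
#if 0
SDNode *Updated = DAG.UpdateNodeOperands(N, X, Y);
if (Updated != N) {
  // An equivalent node already existed; N was left unmodified.
}
#endif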
/// DropOperands - Release the operands and set this node to have
/// zero operands.
void SDNode::DropOperands() {
// Unlike the code in MorphNodeTo that does this, we don't need to
// watch for dead nodes here.
for (op_iterator I = op_begin(), E = op_end(); I != E; ) {
SDUse &Use = *I++;
Use.set(SDValue());
}
}
/// SelectNodeTo - These are wrappers around MorphNodeTo that accept a
/// machine opcode.
///
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT) {
SDVTList VTs = getVTList(VT);
return SelectNodeTo(N, MachineOpc, VTs, None);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT, SDValue Op1) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1 };
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT, SDValue Op1,
SDValue Op2) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1, Op2 };
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT, SDValue Op1,
SDValue Op2, SDValue Op3) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1, Op2, Op3 };
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT, ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT);
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT1, EVT VT2, ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT1, VT2);
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT1, EVT VT2) {
SDVTList VTs = getVTList(VT1, VT2);
return SelectNodeTo(N, MachineOpc, VTs, None);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT1, EVT VT2, EVT VT3,
ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT1, VT2, VT3);
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
EVT VT1, EVT VT2,
SDValue Op1, SDValue Op2) {
SDVTList VTs = getVTList(VT1, VT2);
SDValue Ops[] = { Op1, Op2 };
return SelectNodeTo(N, MachineOpc, VTs, Ops);
}
SDNode *SelectionDAG::SelectNodeTo(SDNode *N, unsigned MachineOpc,
SDVTList VTs, ArrayRef<SDValue> Ops) {
SDNode *New = MorphNodeTo(N, ~MachineOpc, VTs, Ops);
// Reset the NodeID to -1.
New->setNodeId(-1);
if (New != N) {
ReplaceAllUsesWith(N, New);
RemoveDeadNode(N);
}
return New;
}
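// Example (illustrative sketch, not part of the original source): a target's
// instruction selector morphing a generic node into a machine instruction in
// place. Foo::ADDrr is a hypothetical target opcode; CurDAG is the selector's
// SelectionDAG.
#if 0
// Inside a SelectionDAGISel subclass:
SDValue LHS = N->getOperand(0), RHS = N->getOperand(1);
// The returned node must be used: it may be an existing identical node.
SDNode *New = CurDAG->SelectNodeTo(N, Foo::ADDrr, N->getValueType(0), LHS, RHS);
#endif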
/// UpdateSDLocOnMergeSDNode - If the opt level is -O0 then it throws away
/// the line number information on the merged node since it is not possible to
/// preserve the information that the operation is associated with multiple
/// lines. This makes the debugger work better at -O0, where there is a higher
/// probability of having other instructions associated with that line.
///
/// For IROrder, we keep the smaller of the two
SDNode *SelectionDAG::UpdateSDLocOnMergeSDNode(SDNode *N, const SDLoc &OLoc) {
DebugLoc NLoc = N->getDebugLoc();
if (NLoc && OptLevel == CodeGenOpt::None && OLoc.getDebugLoc() != NLoc) {
N->setDebugLoc(DebugLoc());
}
unsigned Order = std::min(N->getIROrder(), OLoc.getIROrder());
N->setIROrder(Order);
return N;
}
/// MorphNodeTo - This *mutates* the specified node to have the specified
/// return type, opcode, and operands.
///
/// Note that MorphNodeTo returns the resultant node. If there is already a
/// node of the specified opcode and operands, it returns that node instead of
/// the current one. Note that the SDLoc need not be the same.
///
/// Using MorphNodeTo is faster than creating a new node and swapping it in
/// with ReplaceAllUsesWith both because it often avoids allocating a new
/// node, and because it doesn't require CSE recalculation for any of
/// the node's users.
///
/// However, note that MorphNodeTo recursively deletes dead nodes from the DAG.
/// As a consequence it isn't appropriate to use from within the DAG combiner or
/// the legalizer which maintain worklists that would need to be updated when
/// deleting things.
SDNode *SelectionDAG::MorphNodeTo(SDNode *N, unsigned Opc,
SDVTList VTs, ArrayRef<SDValue> Ops) {
// If an identical node already exists, use it.
void *IP = nullptr;
if (VTs.VTs[VTs.NumVTs-1] != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opc, VTs, Ops);
if (SDNode *ON = FindNodeOrInsertPos(ID, SDLoc(N), IP))
return UpdateSDLocOnMergeSDNode(ON, SDLoc(N));
}
if (!RemoveNodeFromCSEMaps(N))
IP = nullptr;
// Start the morphing.
N->NodeType = Opc;
N->ValueList = VTs.VTs;
N->NumValues = VTs.NumVTs;
// Clear the operands list, updating used nodes to remove this from their
// use list. Keep track of any operands that become dead as a result.
SmallPtrSet<SDNode*, 16> DeadNodeSet;
for (SDNode::op_iterator I = N->op_begin(), E = N->op_end(); I != E; ) {
SDUse &Use = *I++;
SDNode *Used = Use.getNode();
Use.set(SDValue());
if (Used->use_empty())
DeadNodeSet.insert(Used);
}
// For MachineNode, initialize the memory references information.
if (MachineSDNode *MN = dyn_cast<MachineSDNode>(N))
MN->setMemRefs(nullptr, nullptr);
// Swap for an appropriately sized array from the recycler.
removeOperands(N);
createOperands(N, Ops);
// Delete any nodes that are still dead after adding the uses for the
// new operands.
if (!DeadNodeSet.empty()) {
SmallVector<SDNode *, 16> DeadNodes;
for (SDNode *N : DeadNodeSet)
if (N->use_empty())
DeadNodes.push_back(N);
RemoveDeadNodes(DeadNodes);
}
if (IP)
CSEMap.InsertNode(N, IP); // Memoize the new node.
return N;
}
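// Example (illustrative sketch, not part of the original source): the caller
// must handle both outcomes of MorphNodeTo, as mutateStrictFPToFP below does
// on real code. NewOpc, VTs and Ops are assumed to be in scope.
#if 0
SDNode *Res = DAG.MorphNodeTo(N, NewOpc, VTs, Ops);
if (Res != N) { // CSE hit: an identical node already existed.
  DAG.ReplaceAllUsesWith(N, Res);
  DAG.RemoveDeadNode(N);
}
#endif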
SDNode* SelectionDAG::mutateStrictFPToFP(SDNode *Node) {
unsigned OrigOpc = Node->getOpcode();
unsigned NewOpc;
bool IsUnary = false;
switch (OrigOpc) {
default:
llvm_unreachable("mutateStrictFPToFP called with unexpected opcode!");
case ISD::STRICT_FADD: NewOpc = ISD::FADD; break;
case ISD::STRICT_FSUB: NewOpc = ISD::FSUB; break;
case ISD::STRICT_FMUL: NewOpc = ISD::FMUL; break;
case ISD::STRICT_FDIV: NewOpc = ISD::FDIV; break;
case ISD::STRICT_FREM: NewOpc = ISD::FREM; break;
case ISD::STRICT_FSQRT: NewOpc = ISD::FSQRT; IsUnary = true; break;
case ISD::STRICT_FPOW: NewOpc = ISD::FPOW; break;
case ISD::STRICT_FPOWI: NewOpc = ISD::FPOWI; break;
case ISD::STRICT_FSIN: NewOpc = ISD::FSIN; IsUnary = true; break;
case ISD::STRICT_FCOS: NewOpc = ISD::FCOS; IsUnary = true; break;
case ISD::STRICT_FEXP: NewOpc = ISD::FEXP; IsUnary = true; break;
case ISD::STRICT_FEXP2: NewOpc = ISD::FEXP2; IsUnary = true; break;
case ISD::STRICT_FLOG: NewOpc = ISD::FLOG; IsUnary = true; break;
case ISD::STRICT_FLOG10: NewOpc = ISD::FLOG10; IsUnary = true; break;
case ISD::STRICT_FLOG2: NewOpc = ISD::FLOG2; IsUnary = true; break;
case ISD::STRICT_FRINT: NewOpc = ISD::FRINT; IsUnary = true; break;
case ISD::STRICT_FNEARBYINT:
NewOpc = ISD::FNEARBYINT;
IsUnary = true;
break;
}
// We're taking this node out of the chain, so we need to re-link things.
SDValue InputChain = Node->getOperand(0);
SDValue OutputChain = SDValue(Node, 1);
ReplaceAllUsesOfValueWith(OutputChain, InputChain);
SDVTList VTs = getVTList(Node->getOperand(1).getValueType());
SDNode *Res = nullptr;
if (IsUnary)
Res = MorphNodeTo(Node, NewOpc, VTs, { Node->getOperand(1) });
else
Res = MorphNodeTo(Node, NewOpc, VTs, { Node->getOperand(1),
Node->getOperand(2) });
// MorphNodeTo can operate in two ways: if an existing node with the
// specified operands exists, it can just return it. Otherwise, it
// updates the node in place to have the requested operands.
if (Res == Node) {
// If we updated the node in place, reset the node ID. To the isel,
// this should be just like a newly allocated machine node.
Res->setNodeId(-1);
} else {
ReplaceAllUsesWith(Node, Res);
RemoveDeadNode(Node);
}
return Res;
}
/// getMachineNode - These are used for target selectors to create a new node
/// with specified return type(s), MachineInstr opcode, and operands.
///
/// Note that getMachineNode returns the resultant node. If there is already a
/// node of the specified opcode and operands, it returns that node instead of
/// the current one.
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT) {
SDVTList VTs = getVTList(VT);
return getMachineNode(Opcode, dl, VTs, None);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT, SDValue Op1) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT, SDValue Op1, SDValue Op2) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1, Op2 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT, SDValue Op1, SDValue Op2,
SDValue Op3) {
SDVTList VTs = getVTList(VT);
SDValue Ops[] = { Op1, Op2, Op3 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT, ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT);
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2, SDValue Op1,
SDValue Op2) {
SDVTList VTs = getVTList(VT1, VT2);
SDValue Ops[] = { Op1, Op2 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2, SDValue Op1,
SDValue Op2, SDValue Op3) {
SDVTList VTs = getVTList(VT1, VT2);
SDValue Ops[] = { Op1, Op2, Op3 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2,
ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT1, VT2);
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2, EVT VT3,
SDValue Op1, SDValue Op2) {
SDVTList VTs = getVTList(VT1, VT2, VT3);
SDValue Ops[] = { Op1, Op2 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2, EVT VT3,
SDValue Op1, SDValue Op2,
SDValue Op3) {
SDVTList VTs = getVTList(VT1, VT2, VT3);
SDValue Ops[] = { Op1, Op2, Op3 };
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
EVT VT1, EVT VT2, EVT VT3,
ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(VT1, VT2, VT3);
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &dl,
ArrayRef<EVT> ResultTys,
ArrayRef<SDValue> Ops) {
SDVTList VTs = getVTList(ResultTys);
return getMachineNode(Opcode, dl, VTs, Ops);
}
MachineSDNode *SelectionDAG::getMachineNode(unsigned Opcode, const SDLoc &DL,
SDVTList VTs,
ArrayRef<SDValue> Ops) {
bool DoCSE = VTs.VTs[VTs.NumVTs-1] != MVT::Glue;
MachineSDNode *N;
void *IP = nullptr;
if (DoCSE) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, ~Opcode, VTs, Ops);
IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP)) {
return cast<MachineSDNode>(UpdateSDLocOnMergeSDNode(E, DL));
}
}
// Allocate a new MachineSDNode.
N = newSDNode<MachineSDNode>(~Opcode, DL.getIROrder(), DL.getDebugLoc(), VTs);
createOperands(N, Ops);
if (DoCSE)
CSEMap.InsertNode(N, IP);
InsertNode(N);
return N;
}
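// Example (illustrative sketch, not part of the original source): creating a
// two-result machine node (value + out-chain) during selection. Foo::LDri is
// a hypothetical opcode; Base, Offset, Chain and DL are assumed.
#if 0
SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::Other);
SDValue Ops[] = { Base, Offset, Chain };
MachineSDNode *Load = CurDAG->getMachineNode(Foo::LDri, DL, VTs, Ops);
#endif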
/// getTargetExtractSubreg - A convenience function for creating
/// TargetOpcode::EXTRACT_SUBREG nodes.
SDValue SelectionDAG::getTargetExtractSubreg(int SRIdx, const SDLoc &DL, EVT VT,
SDValue Operand) {
SDValue SRIdxVal = getTargetConstant(SRIdx, DL, MVT::i32);
SDNode *Subreg = getMachineNode(TargetOpcode::EXTRACT_SUBREG, DL,
VT, Operand, SRIdxVal);
return SDValue(Subreg, 0);
}
/// getTargetInsertSubreg - A convenience function for creating
/// TargetOpcode::INSERT_SUBREG nodes.
SDValue SelectionDAG::getTargetInsertSubreg(int SRIdx, const SDLoc &DL, EVT VT,
SDValue Operand, SDValue Subreg) {
SDValue SRIdxVal = getTargetConstant(SRIdx, DL, MVT::i32);
SDNode *Result = getMachineNode(TargetOpcode::INSERT_SUBREG, DL,
VT, Operand, Subreg, SRIdxVal);
return SDValue(Result, 0);
}
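// Example (illustrative sketch, not part of the original source): extracting
// the low half of a 64-bit value through a subregister index. Foo::sub_lo is
// a hypothetical target-defined index; Val64 and DL are assumed.
#if 0
SDValue Lo = CurDAG->getTargetExtractSubreg(Foo::sub_lo, DL, MVT::i32, Val64);
#endif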
/// getNodeIfExists - Get the specified node if it's already available, or
/// else return NULL.
SDNode *SelectionDAG::getNodeIfExists(unsigned Opcode, SDVTList VTList,
ArrayRef<SDValue> Ops,
const SDNodeFlags Flags) {
if (VTList.VTs[VTList.NumVTs - 1] != MVT::Glue) {
FoldingSetNodeID ID;
AddNodeIDNode(ID, Opcode, VTList, Ops);
void *IP = nullptr;
if (SDNode *E = FindNodeOrInsertPos(ID, SDLoc(), IP)) {
E->intersectFlagsWith(Flags);
return E;
}
}
return nullptr;
}
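// Example (illustrative sketch, not part of the original source): probing the
// CSE maps for an existing node without creating one. X, Y, VT and Flags are
// assumed to be in scope.
#if 0
SDValue Ops[] = { X, Y };
if (SDNode *E = DAG.getNodeIfExists(ISD::ADD, DAG.getVTList(VT), Ops, Flags))
  return SDValue(E, 0); // Reuse the existing ADD.
#endif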
/// getDbgValue - Creates an SDDbgValue node.
///
/// SDNode
SDDbgValue *SelectionDAG::getDbgValue(MDNode *Var, MDNode *Expr, SDNode *N,
unsigned R, bool IsIndirect, uint64_t Off,
const DebugLoc &DL, unsigned O) {
assert(cast<DILocalVariable>(Var)->isValidLocationForIntrinsic(DL) &&
"Expected inlined-at fields to agree");
return new (DbgInfo->getAlloc())
SDDbgValue(Var, Expr, N, R, IsIndirect, Off, DL, O);
}
/// Constant
SDDbgValue *SelectionDAG::getConstantDbgValue(MDNode *Var, MDNode *Expr,
const Value *C, uint64_t Off,
const DebugLoc &DL, unsigned O) {
assert(cast<DILocalVariable>(Var)->isValidLocationForIntrinsic(DL) &&
"Expected inlined-at fields to agree");
return new (DbgInfo->getAlloc()) SDDbgValue(Var, Expr, C, Off, DL, O);
}
/// FrameIndex
SDDbgValue *SelectionDAG::getFrameIndexDbgValue(MDNode *Var, MDNode *Expr,
unsigned FI, uint64_t Off,
const DebugLoc &DL,
unsigned O) {
assert(cast<DILocalVariable>(Var)->isValidLocationForIntrinsic(DL) &&
"Expected inlined-at fields to agree");
return new (DbgInfo->getAlloc()) SDDbgValue(Var, Expr, FI, Off, DL, O);
}
namespace {
/// RAUWUpdateListener - Helper for ReplaceAllUsesWith - When the node
/// pointed to by a use iterator is deleted, increment the use iterator
/// so that it doesn't dangle.
///
class RAUWUpdateListener : public SelectionDAG::DAGUpdateListener {
SDNode::use_iterator &UI;
SDNode::use_iterator &UE;
void NodeDeleted(SDNode *N, SDNode *E) override {
// Increment the iterator as needed.
while (UI != UE && N == *UI)
++UI;
}
public:
RAUWUpdateListener(SelectionDAG &d,
SDNode::use_iterator &ui,
SDNode::use_iterator &ue)
: SelectionDAG::DAGUpdateListener(d), UI(ui), UE(ue) {}
};
} // end anonymous namespace
/// ReplaceAllUsesWith - Modify anything using 'From' to use 'To' instead.
/// This can cause recursive merging of nodes in the DAG.
///
/// This version assumes From has a single result value.
///
void SelectionDAG::ReplaceAllUsesWith(SDValue FromN, SDValue To) {
SDNode *From = FromN.getNode();
assert(From->getNumValues() == 1 && FromN.getResNo() == 0 &&
"Cannot replace with this method!");
assert(From != To.getNode() && "Cannot replace uses of a value with itself");
// Preserve Debug Values
TransferDbgValues(FromN, To);
// Iterate over all the existing uses of From. New uses will be added
// to the beginning of the use list, which we avoid visiting.
// This specifically avoids visiting uses of From that arise while the
// replacement is happening, because any such uses would be the result
// of CSE: If an existing node looks like From after one of its operands
// is replaced by To, we don't want to replace all of its users with To
// too. See PR3018 for more info.
SDNode::use_iterator UI = From->use_begin(), UE = From->use_end();
RAUWUpdateListener Listener(*this, UI, UE);
while (UI != UE) {
SDNode *User = *UI;
// This node is about to morph, remove its old self from the CSE maps.
RemoveNodeFromCSEMaps(User);
// A user can appear in a use list multiple times, and when this
// happens the uses are usually next to each other in the list.
// To help reduce the number of CSE recomputations, process all
// the uses of this user that we can find this way.
do {
SDUse &Use = UI.getUse();
++UI;
Use.set(To);
} while (UI != UE && *UI == User);
// Now that we have modified User, add it back to the CSE maps. If it
// already exists there, recursively merge the results together.
AddModifiedNodeToCSEMaps(User);
}
// If we just RAUW'd the root, take note.
if (FromN == getRoot())
setRoot(To);
}
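// Example (illustrative sketch, not part of the original source): folding
// (add X, 0) by redirecting every user of the add to X; the DAG then merges
// or deletes nodes as needed.
#if 0
if (N->getOpcode() == ISD::ADD && isNullConstant(N->getOperand(1)))
  DAG.ReplaceAllUsesWith(SDValue(N, 0), N->getOperand(0));
#endif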
/// ReplaceAllUsesWith - Modify anything using 'From' to use 'To' instead.
/// This can cause recursive merging of nodes in the DAG.
///
/// This version assumes that for each value of From, there is a
/// corresponding value in To in the same position with the same type.
///
void SelectionDAG::ReplaceAllUsesWith(SDNode *From, SDNode *To) {
#ifndef NDEBUG
for (unsigned i = 0, e = From->getNumValues(); i != e; ++i)
assert((!From->hasAnyUseOfValue(i) ||
From->getValueType(i) == To->getValueType(i)) &&
"Cannot use this version of ReplaceAllUsesWith!");
#endif
// Handle the trivial case.
if (From == To)
return;
// Preserve Debug Info. Only do this if there's a use.
for (unsigned i = 0, e = From->getNumValues(); i != e; ++i)
if (From->hasAnyUseOfValue(i)) {
assert((i < To->getNumValues()) && "Invalid To location");
TransferDbgValues(SDValue(From, i), SDValue(To, i));
}
// Iterate over just the existing users of From. See the comments in
// the ReplaceAllUsesWith above.
SDNode::use_iterator UI = From->use_begin(), UE = From->use_end();
RAUWUpdateListener Listener(*this, UI, UE);
while (UI != UE) {
SDNode *User = *UI;
// This node is about to morph, remove its old self from the CSE maps.
RemoveNodeFromCSEMaps(User);
// A user can appear in a use list multiple times, and when this
// happens the uses are usually next to each other in the list.
// To help reduce the number of CSE recomputations, process all
// the uses of this user that we can find this way.
do {
SDUse &Use = UI.getUse();
++UI;
Use.setNode(To);
} while (UI != UE && *UI == User);
// Now that we have modified User, add it back to the CSE maps. If it
// already exists there, recursively merge the results together.
AddModifiedNodeToCSEMaps(User);
}
// If we just RAUW'd the root, take note.
if (From == getRoot().getNode())
setRoot(SDValue(To, getRoot().getResNo()));
}
/// ReplaceAllUsesWith - Modify anything using 'From' to use 'To' instead.
/// This can cause recursive merging of nodes in the DAG.
///
/// This version can replace From with any result values. To must match the
/// number and types of values returned by From.
void SelectionDAG::ReplaceAllUsesWith(SDNode *From, const SDValue *To) {
if (From->getNumValues() == 1) // Handle the simple case efficiently.
return ReplaceAllUsesWith(SDValue(From, 0), To[0]);
// Preserve Debug Info.
for (unsigned i = 0, e = From->getNumValues(); i != e; ++i)
TransferDbgValues(SDValue(From, i), *To);
// Iterate over just the existing users of From. See the comments in
// the ReplaceAllUsesWith above.
SDNode::use_iterator UI = From->use_begin(), UE = From->use_end();
RAUWUpdateListener Listener(*this, UI, UE);
while (UI != UE) {
SDNode *User = *UI;
// This node is about to morph, remove its old self from the CSE maps.
RemoveNodeFromCSEMaps(User);
// A user can appear in a use list multiple times, and when this
// happens the uses are usually next to each other in the list.
// To help reduce the number of CSE recomputations, process all
// the uses of this user that we can find this way.
do {
SDUse &Use = UI.getUse();
const SDValue &ToOp = To[Use.getResNo()];
++UI;
Use.set(ToOp);
} while (UI != UE && *UI == User);
// Now that we have modified User, add it back to the CSE maps. If it
// already exists there, recursively merge the results together.
AddModifiedNodeToCSEMaps(User);
}
// If we just RAUW'd the root, take note.
if (From == getRoot().getNode())
setRoot(SDValue(To[getRoot().getResNo()]));
}
/// ReplaceAllUsesOfValueWith - Replace any uses of From with To, leaving
/// uses of other values produced by From.getNode() alone.
void SelectionDAG::ReplaceAllUsesOfValueWith(SDValue From, SDValue To){
// Handle the really simple, really trivial case efficiently.
if (From == To) return;
// Handle the simple, trivial case efficiently.
if (From.getNode()->getNumValues() == 1) {
ReplaceAllUsesWith(From, To);
return;
}
// Preserve Debug Info.
TransferDbgValues(From, To);
// Iterate over just the existing users of From. See the comments in
// the ReplaceAllUsesWith above.
SDNode::use_iterator UI = From.getNode()->use_begin(),
UE = From.getNode()->use_end();
RAUWUpdateListener Listener(*this, UI, UE);
while (UI != UE) {
SDNode *User = *UI;
bool UserRemovedFromCSEMaps = false;
// A user can appear in a use list multiple times, and when this
// happens the uses are usually next to each other in the list.
// To help reduce the number of CSE recomputations, process all
// the uses of this user that we can find this way.
do {
SDUse &Use = UI.getUse();
// Skip uses of different values from the same node.
if (Use.getResNo() != From.getResNo()) {
++UI;
continue;
}
// If this node hasn't been modified yet, it's still in the CSE maps,
// so remove its old self from the CSE maps.
if (!UserRemovedFromCSEMaps) {
RemoveNodeFromCSEMaps(User);
UserRemovedFromCSEMaps = true;
}
++UI;
Use.set(To);
} while (UI != UE && *UI == User);
// We are iterating over all uses of the From node, so if a use
// doesn't use the specific value, no changes are made.
if (!UserRemovedFromCSEMaps)
continue;
// Now that we have modified User, add it back to the CSE maps. If it
// already exists there, recursively merge the results together.
AddModifiedNodeToCSEMaps(User);
}
// If we just RAUW'd the root, take note.
if (From == getRoot())
setRoot(To);
}
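// Example (illustrative sketch, not part of the original source): when
// removing a load, each result is rerouted separately: the out-chain
// (result 1) back to the in-chain, the value (result 0) to its replacement.
#if 0
DAG.ReplaceAllUsesOfValueWith(SDValue(Load, 1), Load->getChain());
DAG.ReplaceAllUsesOfValueWith(SDValue(Load, 0), NewValue);
#endif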
namespace {
/// UseMemo - This class is used by SelectionDAG::ReplaceAllUsesOfValuesWith
/// to record information about a use.
struct UseMemo {
SDNode *User;
unsigned Index;
SDUse *Use;
};
/// operator< - Sort Memos by User.
bool operator<(const UseMemo &L, const UseMemo &R) {
return (intptr_t)L.User < (intptr_t)R.User;
}
} // end anonymous namespace
/// ReplaceAllUsesOfValuesWith - Replace any uses of From with To, leaving
/// uses of other values produced by From.getNode() alone. The same value
/// may appear in both the From and To list.
void SelectionDAG::ReplaceAllUsesOfValuesWith(const SDValue *From,
const SDValue *To,
unsigned Num){
// Handle the simple, trivial case efficiently.
if (Num == 1)
return ReplaceAllUsesOfValueWith(*From, *To);
TransferDbgValues(*From, *To);
// Read up all the uses and make records of them. This helps
// processing new uses that are introduced during the
// replacement process.
SmallVector<UseMemo, 4> Uses;
for (unsigned i = 0; i != Num; ++i) {
unsigned FromResNo = From[i].getResNo();
SDNode *FromNode = From[i].getNode();
for (SDNode::use_iterator UI = FromNode->use_begin(),
E = FromNode->use_end(); UI != E; ++UI) {
SDUse &Use = UI.getUse();
if (Use.getResNo() == FromResNo) {
UseMemo Memo = { *UI, i, &Use };
Uses.push_back(Memo);
}
}
}
// Sort the uses, so that all the uses from a given User are together.
std::sort(Uses.begin(), Uses.end());
for (unsigned UseIndex = 0, UseIndexEnd = Uses.size();
UseIndex != UseIndexEnd; ) {
// We know that this user uses some value of From. If it is the right
// value, update it.
SDNode *User = Uses[UseIndex].User;
// This node is about to morph, remove its old self from the CSE maps.
RemoveNodeFromCSEMaps(User);
// The Uses array is sorted, so all the uses for a given User
// are next to each other in the list.
// To help reduce the number of CSE recomputations, process all
// the uses of this user that we can find this way.
do {
unsigned i = Uses[UseIndex].Index;
SDUse &Use = *Uses[UseIndex].Use;
++UseIndex;
Use.set(To[i]);
} while (UseIndex != UseIndexEnd && Uses[UseIndex].User == User);
// Now that we have modified User, add it back to the CSE maps. If it
// already exists there, recursively merge the results together.
AddModifiedNodeToCSEMaps(User);
}
}
/// AssignTopologicalOrder - Assign a unique node id to each node in the DAG
/// based on their topological order. Returns the number of nodes in the DAG,
/// which is one more than the largest id assigned.
unsigned SelectionDAG::AssignTopologicalOrder() {
unsigned DAGSize = 0;
// SortedPos tracks the progress of the algorithm. Nodes before it are
// sorted, nodes after it are unsorted. When the algorithm completes
// it is at the end of the list.
allnodes_iterator SortedPos = allnodes_begin();
// Visit all the nodes. Move nodes with no operands to the front of
// the list immediately. Annotate nodes that do have operands with their
// operand count. Before we do this, the Node Id fields of the nodes
// may contain arbitrary values. After, the Node Id fields for nodes
// before SortedPos will contain the topological sort index, and the
// Node Id fields for nodes at SortedPos and after will contain the
// count of outstanding operands.
for (allnodes_iterator I = allnodes_begin(),E = allnodes_end(); I != E; ) {
SDNode *N = &*I++;
checkForCycles(N, this);
unsigned Degree = N->getNumOperands();
if (Degree == 0) {
// A node with no operands; add it to the result array immediately.
N->setNodeId(DAGSize++);
allnodes_iterator Q(N);
if (Q != SortedPos)
SortedPos = AllNodes.insert(SortedPos, AllNodes.remove(Q));
assert(SortedPos != AllNodes.end() && "Overran node list");
++SortedPos;
} else {
// Temporarily use the Node Id as scratch space for the degree count.
N->setNodeId(Degree);
}
}
// Visit all the nodes. As we iterate, move nodes into sorted order,
// such that by the time the end is reached all nodes will be sorted.
for (SDNode &Node : allnodes()) {
SDNode *N = &Node;
checkForCycles(N, this);
// N is in sorted position, so each of its users has one fewer operand
// that needs to be sorted.
for (SDNode::use_iterator UI = N->use_begin(), UE = N->use_end();
UI != UE; ++UI) {
SDNode *P = *UI;
unsigned Degree = P->getNodeId();
assert(Degree != 0 && "Invalid node degree");
--Degree;
if (Degree == 0) {
// All of P's operands are sorted, so P may be sorted now.
P->setNodeId(DAGSize++);
if (P->getIterator() != SortedPos)
SortedPos = AllNodes.insert(SortedPos, AllNodes.remove(P));
assert(SortedPos != AllNodes.end() && "Overran node list");
++SortedPos;
} else {
// Update P's outstanding operand count.
P->setNodeId(Degree);
}
}
if (Node.getIterator() == SortedPos) {
#ifndef NDEBUG
allnodes_iterator I(N);
SDNode *S = &*++I;
dbgs() << "Overran sorted position:\n";
S->dumprFull(this); dbgs() << "\n";
dbgs() << "Checking if this is due to cycles\n";
checkForCycles(this, true);
#endif
llvm_unreachable(nullptr);
}
}
assert(SortedPos == AllNodes.end() &&
"Topological sort incomplete!");
assert(AllNodes.front().getOpcode() == ISD::EntryToken &&
"First node in topological sort is not the entry token!");
assert(AllNodes.front().getNodeId() == 0 &&
"First node in topological sort has non-zero id!");
assert(AllNodes.front().getNumOperands() == 0 &&
"First node in topological sort has operands!");
assert(AllNodes.back().getNodeId() == (int)DAGSize-1 &&
"Last node in topological sort has unexpected id!");
assert(AllNodes.back().use_empty() &&
"Last node in topological sort has users!");
assert(DAGSize == allnodes_size() && "Node count mismatch!");
return DAGSize;
}
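// Example (illustrative sketch, not part of the original source): after
// AssignTopologicalOrder, a forward walk of the node list visits every
// operand before its users.
#if 0
DAG.AssignTopologicalOrder();
for (SDNode &N : DAG.allnodes()) {
  // Each operand of N now has a smaller NodeId than N.
}
#endif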
/// AddDbgValue - Add a dbg_value SDNode. If SD is non-null that means the
/// value is produced by SD.
void SelectionDAG::AddDbgValue(SDDbgValue *DB, SDNode *SD, bool isParameter) {
if (SD) {
assert(DbgInfo->getSDDbgValues(SD).empty() || SD->getHasDebugValue());
SD->setHasDebugValue(true);
}
DbgInfo->add(DB, SD, isParameter);
}
/// TransferDbgValues - Transfer SDDbgValues. Called when replacing nodes.
void SelectionDAG::TransferDbgValues(SDValue From, SDValue To) {
if (From == To || !From.getNode()->getHasDebugValue())
return;
SDNode *FromNode = From.getNode();
SDNode *ToNode = To.getNode();
ArrayRef<SDDbgValue *> DVs = GetDbgValues(FromNode);
SmallVector<SDDbgValue *, 2> ClonedDVs;
for (ArrayRef<SDDbgValue *>::iterator I = DVs.begin(), E = DVs.end();
I != E; ++I) {
SDDbgValue *Dbg = *I;
// Only add DbgValues attached to the same ResNo.
if (Dbg->getKind() == SDDbgValue::SDNODE &&
Dbg->getSDNode() == From.getNode() &&
Dbg->getResNo() == From.getResNo() && !Dbg->isInvalidated()) {
assert(FromNode != ToNode &&
"Should not transfer Debug Values intranode");
SDDbgValue *Clone =
getDbgValue(Dbg->getVariable(), Dbg->getExpression(), ToNode,
To.getResNo(), Dbg->isIndirect(), Dbg->getOffset(),
Dbg->getDebugLoc(), Dbg->getOrder());
ClonedDVs.push_back(Clone);
Dbg->setIsInvalidated();
}
}
for (SDDbgValue *I : ClonedDVs)
AddDbgValue(I, ToNode, false);
}
SDValue SelectionDAG::makeEquivalentMemoryOrdering(LoadSDNode *OldLoad,
SDValue NewMemOp) {
assert(isa<MemSDNode>(NewMemOp.getNode()) && "Expected a memop node");
// The new memory operation must have the same position as the old load in
// terms of memory dependency. Create a TokenFactor for the old load and new
// memory operation and update uses of the old load's output chain to use that
// TokenFactor.
SDValue OldChain = SDValue(OldLoad, 1);
SDValue NewChain = SDValue(NewMemOp.getNode(), 1);
if (!OldLoad->hasAnyUseOfValue(1))
return NewChain;
SDValue TokenFactor =
getNode(ISD::TokenFactor, SDLoc(OldLoad), MVT::Other, OldChain, NewChain);
ReplaceAllUsesOfValueWith(OldChain, TokenFactor);
UpdateNodeOperands(TokenFactor.getNode(), OldChain, NewChain);
return TokenFactor;
}
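// Example (illustrative sketch, not part of the original source): keeping the
// memory ordering intact when a load is replaced by a new memory operation.
// OldLoad and NewOp are assumed to be in scope; the returned TokenFactor is
// the chain callers should use from here on.
#if 0
SDValue Chain = DAG.makeEquivalentMemoryOrdering(OldLoad, NewOp);
DAG.ReplaceAllUsesOfValueWith(SDValue(OldLoad, 0), NewOp.getValue(0));
#endif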
//===----------------------------------------------------------------------===//
// SDNode Class
//===----------------------------------------------------------------------===//
bool llvm::isNullConstant(SDValue V) {
ConstantSDNode *Const = dyn_cast<ConstantSDNode>(V);
return Const != nullptr && Const->isNullValue();
}
bool llvm::isNullFPConstant(SDValue V) {
ConstantFPSDNode *Const = dyn_cast<ConstantFPSDNode>(V);
return Const != nullptr && Const->isZero() && !Const->isNegative();
}
bool llvm::isAllOnesConstant(SDValue V) {
ConstantSDNode *Const = dyn_cast<ConstantSDNode>(V);
return Const != nullptr && Const->isAllOnesValue();
}
bool llvm::isOneConstant(SDValue V) {
ConstantSDNode *Const = dyn_cast<ConstantSDNode>(V);
return Const != nullptr && Const->isOne();
}
bool llvm::isBitwiseNot(SDValue V) {
return V.getOpcode() == ISD::XOR && isAllOnesConstant(V.getOperand(1));
}
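// Example (illustrative sketch, not part of the original source): these
// predicates match common canonical forms, e.g. 'not' is represented as
// (xor X, -1).
#if 0
if (N->getOpcode() == ISD::AND && isBitwiseNot(N->getOperand(1))) {
  // Matched (and X, (xor Y, -1)), an ANDN-style pattern.
}
#endif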
ConstantSDNode *llvm::isConstOrConstSplat(SDValue N) {
if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(N))
return CN;
if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(N)) {
BitVector UndefElements;
ConstantSDNode *CN = BV->getConstantSplatNode(&UndefElements);
// BuildVectors can truncate their operands. Ignore that case here.
// FIXME: We blindly ignore splats which include undef which is overly
// pessimistic.
if (CN && UndefElements.none() &&
CN->getValueType(0) == N.getValueType().getScalarType())
return CN;
}
return nullptr;
}
ConstantFPSDNode *llvm::isConstOrConstSplatFP(SDValue N) {
if (ConstantFPSDNode *CN = dyn_cast<ConstantFPSDNode>(N))
return CN;
if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(N)) {
BitVector UndefElements;
ConstantFPSDNode *CN = BV->getConstantFPSplatNode(&UndefElements);
if (CN && UndefElements.none())
return CN;
}
return nullptr;
}
HandleSDNode::~HandleSDNode() {
DropOperands();
}
GlobalAddressSDNode::GlobalAddressSDNode(unsigned Opc, unsigned Order,
const DebugLoc &DL,
const GlobalValue *GA, EVT VT,
int64_t o, unsigned char TF)
: SDNode(Opc, Order, DL, getSDVTList(VT)), Offset(o), TargetFlags(TF) {
TheGlobal = GA;
}
AddrSpaceCastSDNode::AddrSpaceCastSDNode(unsigned Order, const DebugLoc &dl,
EVT VT, unsigned SrcAS,
unsigned DestAS)
: SDNode(ISD::ADDRSPACECAST, Order, dl, getSDVTList(VT)),
SrcAddrSpace(SrcAS), DestAddrSpace(DestAS) {}
MemSDNode::MemSDNode(unsigned Opc, unsigned Order, const DebugLoc &dl,
SDVTList VTs, EVT memvt, MachineMemOperand *mmo)
: SDNode(Opc, Order, dl, VTs), MemoryVT(memvt), MMO(mmo) {
MemSDNodeBits.IsVolatile = MMO->isVolatile();
MemSDNodeBits.IsNonTemporal = MMO->isNonTemporal();
MemSDNodeBits.IsDereferenceable = MMO->isDereferenceable();
MemSDNodeBits.IsInvariant = MMO->isInvariant();
// We check here that the size of the memory operand fits within the size of
// the MMO. This is because the MMO might indicate only a possible address
// range instead of specifying the affected memory addresses precisely.
assert(memvt.getStoreSize() <= MMO->getSize() && "Size mismatch!");
}
/// Profile - Gather unique data for the node.
///
void SDNode::Profile(FoldingSetNodeID &ID) const {
AddNodeIDNode(ID, this);
}
namespace {
struct EVTArray {
std::vector<EVT> VTs;
EVTArray() {
VTs.reserve(MVT::LAST_VALUETYPE);
for (unsigned i = 0; i < MVT::LAST_VALUETYPE; ++i)
VTs.push_back(MVT((MVT::SimpleValueType)i));
}
};
} // end anonymous namespace
static ManagedStatic<std::set<EVT, EVT::compareRawBits>> EVTs;
static ManagedStatic<EVTArray> SimpleVTArray;
static ManagedStatic<sys::SmartMutex<true>> VTMutex;
/// getValueTypeList - Return a pointer to the specified value type.
///
const EVT *SDNode::getValueTypeList(EVT VT) {
if (VT.isExtended()) {
sys::SmartScopedLock<true> Lock(*VTMutex);
return &(*EVTs->insert(VT).first);
} else {
assert(VT.getSimpleVT() < MVT::LAST_VALUETYPE &&
"Value type out of range!");
return &SimpleVTArray->VTs[VT.getSimpleVT().SimpleTy];
}
}
/// hasNUsesOfValue - Return true if there are exactly NUSES uses of the
/// indicated value. This method ignores uses of other values defined by this
/// operation.
bool SDNode::hasNUsesOfValue(unsigned NUses, unsigned Value) const {
assert(Value < getNumValues() && "Bad value!");
// TODO: Only iterate over uses of a given value of the node
for (SDNode::use_iterator UI = use_begin(), E = use_end(); UI != E; ++UI) {
if (UI.getUse().getResNo() == Value) {
if (NUses == 0)
return false;
--NUses;
}
}
// Found exactly the right number of uses?
return NUses == 0;
}
/// hasAnyUseOfValue - Return true if there are any uses of the indicated
/// value. This method ignores uses of other values defined by this operation.
bool SDNode::hasAnyUseOfValue(unsigned Value) const {
assert(Value < getNumValues() && "Bad value!");
for (SDNode::use_iterator UI = use_begin(), E = use_end(); UI != E; ++UI)
if (UI.getUse().getResNo() == Value)
return true;
return false;
}
/// isOnlyUserOf - Return true if this node is the only use of N.
bool SDNode::isOnlyUserOf(const SDNode *N) const {
bool Seen = false;
for (SDNode::use_iterator I = N->use_begin(), E = N->use_end(); I != E; ++I) {
SDNode *User = *I;
if (User == this)
Seen = true;
else
return false;
}
return Seen;
}
/// Return true if the only users of N are contained in Nodes.
bool SDNode::areOnlyUsersOf(ArrayRef<const SDNode *> Nodes, const SDNode *N) {
bool Seen = false;
for (SDNode::use_iterator I = N->use_begin(), E = N->use_end(); I != E; ++I) {
SDNode *User = *I;
if (llvm::any_of(Nodes,
[&User](const SDNode *Node) { return User == Node; }))
Seen = true;
else
return false;
}
return Seen;
}
/// isOperandOf - Return true if this node is an operand of N.
bool SDValue::isOperandOf(const SDNode *N) const {
for (const SDValue &Op : N->op_values())
if (*this == Op)
return true;
return false;
}
bool SDNode::isOperandOf(const SDNode *N) const {
for (const SDValue &Op : N->op_values())
if (this == Op.getNode())
return true;
return false;
}
/// reachesChainWithoutSideEffects - Return true if this operand (which must
/// be a chain) reaches the specified operand without crossing any
/// side-effecting instructions on any chain path. In practice, this looks
/// through token factors and non-volatile loads. In order to remain efficient,
/// this only looks a couple of nodes in; it does not do an exhaustive search.
///
/// Note that we only need to examine chains when we're searching for
/// side-effects; SelectionDAG requires that all side-effects are represented
/// by chains, even if another operand would force a specific ordering. This
/// constraint is necessary to allow transformations like splitting loads.
bool SDValue::reachesChainWithoutSideEffects(SDValue Dest,
unsigned Depth) const {
if (*this == Dest) return true;
// Don't search too deeply, we just want to be able to see through
// TokenFactor's etc.
if (Depth == 0) return false;
// If this is a token factor, all inputs to the TF happen in parallel.
if (getOpcode() == ISD::TokenFactor) {
// First, try a shallow search.
if (is_contained((*this)->ops(), Dest)) {
// We found the chain we want as an operand of this TokenFactor.
// Essentially, we reach the chain without side-effects if we could
// serialize the TokenFactor into a simple chain of operations with
// Dest as the last operation. This is automatically true if the
// chain has one use: there are no other ordering constraints.
// If the chain has more than one use, we give up: some other
// use of Dest might force a side-effect between Dest and the current
// node.
if (Dest.hasOneUse())
return true;
}
// Next, try a deep search: check whether every operand of the TokenFactor
// reaches Dest.
return llvm::all_of((*this)->ops(), [=](SDValue Op) {
return Op.reachesChainWithoutSideEffects(Dest, Depth - 1);
});
}
// Loads don't have side effects; look through them.
if (LoadSDNode *Ld = dyn_cast<LoadSDNode>(*this)) {
if (!Ld->isVolatile())
return Ld->getChain().reachesChainWithoutSideEffects(Dest, Depth-1);
}
return false;
}
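// Example (illustrative sketch, not part of the original source): checking
// whether a chain can be rerouted directly to a load's out-chain without
// skipping a side effect. StoreChain and Load are assumed to be in scope.
#if 0
bool Safe = StoreChain.reachesChainWithoutSideEffects(SDValue(Load, 1));
#endif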
bool SDNode::hasPredecessor(const SDNode *N) const {
SmallPtrSet<const SDNode *, 32> Visited;
SmallVector<const SDNode *, 16> Worklist;
Worklist.push_back(this);
return hasPredecessorHelper(N, Visited, Worklist);
}
void SDNode::intersectFlagsWith(const SDNodeFlags Flags) {
this->Flags.intersectWith(Flags);
}
SDValue SelectionDAG::UnrollVectorOp(SDNode *N, unsigned ResNE) {
assert(N->getNumValues() == 1 &&
"Can't unroll a vector with multiple results!");
EVT VT = N->getValueType(0);
unsigned NE = VT.getVectorNumElements();
EVT EltVT = VT.getVectorElementType();
SDLoc dl(N);
SmallVector<SDValue, 8> Scalars;
SmallVector<SDValue, 4> Operands(N->getNumOperands());
// If ResNE is 0, fully unroll the vector op.
if (ResNE == 0)
ResNE = NE;
else if (NE > ResNE)
NE = ResNE;
unsigned i;
for (i = 0; i != NE; ++i) {
for (unsigned j = 0, e = N->getNumOperands(); j != e; ++j) {
SDValue Operand = N->getOperand(j);
EVT OperandVT = Operand.getValueType();
if (OperandVT.isVector()) {
// A vector operand; extract a single element.
EVT OperandEltVT = OperandVT.getVectorElementType();
Operands[j] =
getNode(ISD::EXTRACT_VECTOR_ELT, dl, OperandEltVT, Operand,
getConstant(i, dl, TLI->getVectorIdxTy(getDataLayout())));
} else {
// A scalar operand; just use it as is.
Operands[j] = Operand;
}
}
switch (N->getOpcode()) {
default: {
Scalars.push_back(getNode(N->getOpcode(), dl, EltVT, Operands,
N->getFlags()));
break;
}
case ISD::VSELECT:
Scalars.push_back(getNode(ISD::SELECT, dl, EltVT, Operands));
break;
case ISD::SHL:
case ISD::SRA:
case ISD::SRL:
case ISD::ROTL:
case ISD::ROTR:
Scalars.push_back(getNode(N->getOpcode(), dl, EltVT, Operands[0],
getShiftAmountOperand(Operands[0].getValueType(),
Operands[1])));
break;
case ISD::SIGN_EXTEND_INREG:
case ISD::FP_ROUND_INREG: {
EVT ExtVT = cast<VTSDNode>(Operands[1])->getVT().getVectorElementType();
Scalars.push_back(getNode(N->getOpcode(), dl, EltVT,
Operands[0],
getValueType(ExtVT)));
}
}
}
for (; i < ResNE; ++i)
Scalars.push_back(getUNDEF(EltVT));
EVT VecVT = EVT::getVectorVT(*getContext(), EltVT, ResNE);
return getBuildVector(VecVT, dl, Scalars);
}
bool SelectionDAG::areNonVolatileConsecutiveLoads(LoadSDNode *LD,
LoadSDNode *Base,
unsigned Bytes,
int Dist) const {
if (LD->isVolatile() || Base->isVolatile())
return false;
if (LD->isIndexed() || Base->isIndexed())
return false;
if (LD->getChain() != Base->getChain())
return false;
EVT VT = LD->getValueType(0);
if (VT.getSizeInBits() / 8 != Bytes)
return false;
SDValue Loc = LD->getOperand(1);
SDValue BaseLoc = Base->getOperand(1);
auto BaseLocDecomp = BaseIndexOffset::match(BaseLoc, *this);
auto LocDecomp = BaseIndexOffset::match(Loc, *this);
int64_t Offset = 0;
if (BaseLocDecomp.equalBaseIndex(LocDecomp, *this, Offset))
return (Dist * Bytes == Offset);
return false;
}
/// InferPtrAlignment - Infer alignment of a load / store address. Return 0 if
/// it cannot be inferred.
unsigned SelectionDAG::InferPtrAlignment(SDValue Ptr) const {
// If this is a GlobalAddress + cst, return the alignment.
const GlobalValue *GV;
int64_t GVOffset = 0;
if (TLI->isGAPlusOffset(Ptr.getNode(), GV, GVOffset)) {
unsigned PtrWidth = getDataLayout().getPointerTypeSizeInBits(GV->getType());
KnownBits Known(PtrWidth);
llvm::computeKnownBits(GV, Known, getDataLayout());
unsigned AlignBits = Known.countMinTrailingZeros();
unsigned Align = AlignBits ? 1 << std::min(31U, AlignBits) : 0;
if (Align)
return MinAlign(Align, GVOffset);
}
// If this is a direct reference to a stack slot, use information about the
// stack slot's alignment.
int FrameIdx = 1 << 31;
int64_t FrameOffset = 0;
if (FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(Ptr)) {
FrameIdx = FI->getIndex();
} else if (isBaseWithConstantOffset(Ptr) &&
isa<FrameIndexSDNode>(Ptr.getOperand(0))) {
// Handle FI+Cst
FrameIdx = cast<FrameIndexSDNode>(Ptr.getOperand(0))->getIndex();
FrameOffset = Ptr.getConstantOperandVal(1);
}
if (FrameIdx != (1 << 31)) {
const MachineFrameInfo &MFI = getMachineFunction().getFrameInfo();
unsigned FIInfoAlign = MinAlign(MFI.getObjectAlignment(FrameIdx),
FrameOffset);
return FIInfoAlign;
}
return 0;
}
/// GetSplitDestVTs - Compute the VTs needed for the low/hi parts of a type
/// which is split (or expanded) into two not necessarily identical pieces.
std::pair<EVT, EVT> SelectionDAG::GetSplitDestVTs(const EVT &VT) const {
// Currently all types are split in half.
EVT LoVT, HiVT;
if (!VT.isVector())
LoVT = HiVT = TLI->getTypeToTransformTo(*getContext(), VT);
else
LoVT = HiVT = VT.getHalfNumVectorElementsVT(*getContext());
return std::make_pair(LoVT, HiVT);
}
/// SplitVector - Split the vector with EXTRACT_SUBVECTOR and return the
/// low/high part.
std::pair<SDValue, SDValue>
SelectionDAG::SplitVector(const SDValue &N, const SDLoc &DL, const EVT &LoVT,
const EVT &HiVT) {
assert(LoVT.getVectorNumElements() + HiVT.getVectorNumElements() <=
N.getValueType().getVectorNumElements() &&
"More vector elements requested than available!");
SDValue Lo, Hi;
Lo = getNode(ISD::EXTRACT_SUBVECTOR, DL, LoVT, N,
getConstant(0, DL, TLI->getVectorIdxTy(getDataLayout())));
Hi = getNode(ISD::EXTRACT_SUBVECTOR, DL, HiVT, N,
getConstant(LoVT.getVectorNumElements(), DL,
TLI->getVectorIdxTy(getDataLayout())));
return std::make_pair(Lo, Hi);
}
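// Example (illustrative sketch, not part of the original source): the usual
// split-in-half pattern used by the legalizer. Assumes <tuple>, a vector
// value Vec, and a debug location DL in scope.
#if 0
EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(Vec.getValueType());
SDValue Lo, Hi;
std::tie(Lo, Hi) = DAG.SplitVector(Vec, DL, LoVT, HiVT);
#endif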
void SelectionDAG::ExtractVectorElements(SDValue Op,
SmallVectorImpl<SDValue> &Args,
unsigned Start, unsigned Count) {
EVT VT = Op.getValueType();
if (Count == 0)
Count = VT.getVectorNumElements();
EVT EltVT = VT.getVectorElementType();
EVT IdxTy = TLI->getVectorIdxTy(getDataLayout());
SDLoc SL(Op);
for (unsigned i = Start, e = Start + Count; i != e; ++i) {
Args.push_back(getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT,
Op, getConstant(i, SL, IdxTy)));
}
}
// getAddressSpace - Return the address space this GlobalAddress belongs to.
unsigned GlobalAddressSDNode::getAddressSpace() const {
return getGlobal()->getType()->getAddressSpace();
}
Type *ConstantPoolSDNode::getType() const {
if (isMachineConstantPoolEntry())
return Val.MachineCPVal->getType();
return Val.ConstVal->getType();
}
bool BuildVectorSDNode::isConstantSplat(APInt &SplatValue, APInt &SplatUndef,
unsigned &SplatBitSize,
bool &HasAnyUndefs,
unsigned MinSplatBits,
bool IsBigEndian) const {
EVT VT = getValueType(0);
assert(VT.isVector() && "Expected a vector type");
unsigned VecWidth = VT.getSizeInBits();
if (MinSplatBits > VecWidth)
return false;
// FIXME: The widths are based on this node's type, but build vectors can
// truncate their operands.
SplatValue = APInt(VecWidth, 0);
SplatUndef = APInt(VecWidth, 0);
// Get the bits. Bits with undefined values (when the corresponding element
// of the vector is an ISD::UNDEF value) are set in SplatUndef and cleared
// in SplatValue. If any of the values are not constant, give up and return
// false.
unsigned int NumOps = getNumOperands();
assert(NumOps > 0 && "isConstantSplat has 0-size build vector");
unsigned EltWidth = VT.getScalarSizeInBits();
for (unsigned j = 0; j < NumOps; ++j) {
unsigned i = IsBigEndian ? NumOps - 1 - j : j;
SDValue OpVal = getOperand(i);
unsigned BitPos = j * EltWidth;
if (OpVal.isUndef())
SplatUndef.setBits(BitPos, BitPos + EltWidth);
else if (auto *CN = dyn_cast<ConstantSDNode>(OpVal))
SplatValue.insertBits(CN->getAPIntValue().zextOrTrunc(EltWidth), BitPos);
else if (auto *CN = dyn_cast<ConstantFPSDNode>(OpVal))
SplatValue.insertBits(CN->getValueAPF().bitcastToAPInt(), BitPos);
else
return false;
}
// The build_vector is all constants or undefs. Find the smallest element
// size that splats the vector.
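// For example, a <4 x i32> build_vector with every element 0x01010101 and
// no undefs halves 128 -> 64 -> 32 -> 16 -> 8 bits, so SplatBitSize is 8.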
HasAnyUndefs = (SplatUndef != 0);
// FIXME: This does not work for vectors with elements less than 8 bits.
while (VecWidth > 8) {
unsigned HalfSize = VecWidth / 2;
APInt HighValue = SplatValue.lshr(HalfSize).trunc(HalfSize);
APInt LowValue = SplatValue.trunc(HalfSize);
APInt HighUndef = SplatUndef.lshr(HalfSize).trunc(HalfSize);
APInt LowUndef = SplatUndef.trunc(HalfSize);
// If the two halves do not match (ignoring undef bits), stop here.
if ((HighValue & ~LowUndef) != (LowValue & ~HighUndef) ||
MinSplatBits > HalfSize)
break;
SplatValue = HighValue | LowValue;
SplatUndef = HighUndef & LowUndef;
VecWidth = HalfSize;
}
SplatBitSize = VecWidth;
return true;
}
SDValue BuildVectorSDNode::getSplatValue(BitVector *UndefElements) const {
if (UndefElements) {
UndefElements->clear();
UndefElements->resize(getNumOperands());
}
SDValue Splatted;
for (unsigned i = 0, e = getNumOperands(); i != e; ++i) {
SDValue Op = getOperand(i);
if (Op.isUndef()) {
if (UndefElements)
(*UndefElements)[i] = true;
} else if (!Splatted) {
Splatted = Op;
} else if (Splatted != Op) {
return SDValue();
}
}
if (!Splatted) {
assert(getOperand(0).isUndef() &&
"Can only have a splat without a constant for all undefs.");
return getOperand(0);
}
return Splatted;
}
ConstantSDNode *
BuildVectorSDNode::getConstantSplatNode(BitVector *UndefElements) const {
return dyn_cast_or_null<ConstantSDNode>(getSplatValue(UndefElements));
}
ConstantFPSDNode *
BuildVectorSDNode::getConstantFPSplatNode(BitVector *UndefElements) const {
return dyn_cast_or_null<ConstantFPSDNode>(getSplatValue(UndefElements));
}
int32_t
BuildVectorSDNode::getConstantFPSplatPow2ToLog2Int(BitVector *UndefElements,
uint32_t BitWidth) const {
if (ConstantFPSDNode *CN =
dyn_cast_or_null<ConstantFPSDNode>(getSplatValue(UndefElements))) {
bool IsExact;
APSInt IntVal(BitWidth);
const APFloat &APF = CN->getValueAPF();
if (APF.convertToInteger(IntVal, APFloat::rmTowardZero, &IsExact) !=
APFloat::opOK ||
!IsExact)
return -1;
return IntVal.exactLogBase2();
}
return -1;
}
bool BuildVectorSDNode::isConstant() const {
for (const SDValue &Op : op_values()) {
unsigned Opc = Op.getOpcode();
if (Opc != ISD::UNDEF && Opc != ISD::Constant && Opc != ISD::ConstantFP)
return false;
}
return true;
}
bool ShuffleVectorSDNode::isSplatMask(const int *Mask, EVT VT) {
// Find the first non-undef value in the shuffle mask.
unsigned i, e;
for (i = 0, e = VT.getVectorNumElements(); i != e && Mask[i] < 0; ++i)
/* search */;
assert(i != e && "VECTOR_SHUFFLE node with all undef indices!");
// Make sure all remaining elements are either undef or the same as the first
// non-undef value.
for (int Idx = Mask[i]; i != e; ++i)
if (Mask[i] >= 0 && Mask[i] != Idx)
return false;
return true;
}
/// Returns the SDNode if it is a constant integer BuildVector
/// or constant integer.
SDNode *SelectionDAG::isConstantIntBuildVectorOrConstantInt(SDValue N) {
if (isa<ConstantSDNode>(N))
return N.getNode();
if (ISD::isBuildVectorOfConstantSDNodes(N.getNode()))
return N.getNode();
// Treat a GlobalAddress supporting constant offset folding as a
// constant integer.
if (GlobalAddressSDNode *GA = dyn_cast<GlobalAddressSDNode>(N))
if (GA->getOpcode() == ISD::GlobalAddress &&
TLI->isOffsetFoldingLegal(GA))
return GA;
return nullptr;
}
SDNode *SelectionDAG::isConstantFPBuildVectorOrConstantFP(SDValue N) {
if (isa<ConstantFPSDNode>(N))
return N.getNode();
if (ISD::isBuildVectorOfConstantFPSDNodes(N.getNode()))
return N.getNode();
return nullptr;
}
#ifndef NDEBUG
static void checkForCyclesHelper(const SDNode *N,
SmallPtrSetImpl<const SDNode*> &Visited,
SmallPtrSetImpl<const SDNode*> &Checked,
const llvm::SelectionDAG *DAG) {
// If this node has already been checked, don't check it again.
if (Checked.count(N))
return;
// If a node has already been visited on this depth-first walk, reject it as
// a cycle.
if (!Visited.insert(N).second) {
errs() << "Detected cycle in SelectionDAG\n";
dbgs() << "Offending node:\n";
N->dumprFull(DAG); dbgs() << "\n";
abort();
}
for (const SDValue &Op : N->op_values())
checkForCyclesHelper(Op.getNode(), Visited, Checked, DAG);
Checked.insert(N);
Visited.erase(N);
}
#endif
void llvm::checkForCycles(const llvm::SDNode *N,
const llvm::SelectionDAG *DAG,
bool force) {
#ifndef NDEBUG
bool check = force;
#ifdef EXPENSIVE_CHECKS
check = true;
#endif // EXPENSIVE_CHECKS
if (check) {
assert(N && "Checking nonexistent SDNode");
SmallPtrSet<const SDNode*, 32> visited;
SmallPtrSet<const SDNode*, 32> checked;
checkForCyclesHelper(N, visited, checked, DAG);
}
#endif // !NDEBUG
}
void llvm::checkForCycles(const llvm::SelectionDAG *DAG, bool force) {
checkForCycles(DAG->getRoot().getNode(), DAG, force);
}
diff --git a/lib/ExecutionEngine/CMakeLists.txt b/lib/ExecutionEngine/CMakeLists.txt
index 2d9337bbefd2..84b34919e442 100644
--- a/lib/ExecutionEngine/CMakeLists.txt
+++ b/lib/ExecutionEngine/CMakeLists.txt
@@ -1,28 +1,32 @@
add_llvm_library(LLVMExecutionEngine
ExecutionEngine.cpp
ExecutionEngineBindings.cpp
GDBRegistrationListener.cpp
SectionMemoryManager.cpp
TargetSelect.cpp
ADDITIONAL_HEADER_DIRS
${LLVM_MAIN_INCLUDE_DIR}/llvm/ExecutionEngine
DEPENDS
intrinsics_gen
)
+if(BUILD_SHARED_LIBS)
+ target_link_libraries(LLVMExecutionEngine PUBLIC LLVMRuntimeDyld)
+endif()
+
add_subdirectory(Interpreter)
add_subdirectory(MCJIT)
add_subdirectory(Orc)
add_subdirectory(RuntimeDyld)
if( LLVM_USE_OPROFILE )
add_subdirectory(OProfileJIT)
endif( LLVM_USE_OPROFILE )
if( LLVM_USE_INTEL_JITEVENTS )
add_subdirectory(IntelJITEvents)
endif( LLVM_USE_INTEL_JITEVENTS )
diff --git a/lib/IR/AutoUpgrade.cpp b/lib/IR/AutoUpgrade.cpp
index 6a4b8032ffd5..a501799b4799 100644
--- a/lib/IR/AutoUpgrade.cpp
+++ b/lib/IR/AutoUpgrade.cpp
@@ -1,2332 +1,2350 @@
//===-- AutoUpgrade.cpp - Implement auto-upgrade helper functions ---------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file implements the auto-upgrade helper functions.
// This is where deprecated IR intrinsics and other IR features are updated to
// current specifications.
//
//===----------------------------------------------------------------------===//
#include "llvm/IR/AutoUpgrade.h"
#include "llvm/ADT/StringSwitch.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DIBuilder.h"
#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/Regex.h"
#include <cstring>
using namespace llvm;
static void rename(GlobalValue *GV) { GV->setName(GV->getName() + ".old"); }
// Upgrade the declarations of the SSE4.1 ptest intrinsics whose arguments have
// changed their type from v4f32 to v2i64.
static bool UpgradePTESTIntrinsic(Function* F, Intrinsic::ID IID,
Function *&NewFn) {
// Check whether this is an old version of the function, which received
// v4f32 arguments.
Type *Arg0Type = F->getFunctionType()->getParamType(0);
if (Arg0Type != VectorType::get(Type::getFloatTy(F->getContext()), 4))
return false;
// Yes, it's old; replace it with the new version.
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), IID);
return true;
}
// Upgrade the declarations of intrinsic functions whose 8-bit immediate mask
// arguments have changed their type from i32 to i8.
static bool UpgradeX86IntrinsicsWith8BitMask(Function *F, Intrinsic::ID IID,
Function *&NewFn) {
// Check that the last argument is an i32.
Type *LastArgType = F->getFunctionType()->getParamType(
F->getFunctionType()->getNumParams() - 1);
if (!LastArgType->isIntegerTy(32))
return false;
// Move the old declaration aside so the upgraded declaration can take its name.
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), IID);
return true;
}
static bool ShouldUpgradeX86Intrinsic(Function *F, StringRef Name) {
// All of the intrinsic matches below should be marked with the LLVM
// version in which they started being auto-upgraded. At some point in the
// future we would like to use this information to remove upgrade code for
// some older intrinsics. It is currently undecided how we will determine
// that future point.
if (Name.startswith("sse2.pcmpeq.") || // Added in 3.1
Name.startswith("sse2.pcmpgt.") || // Added in 3.1
Name.startswith("avx2.pcmpeq.") || // Added in 3.1
Name.startswith("avx2.pcmpgt.") || // Added in 3.1
Name.startswith("avx512.mask.pcmpeq.") || // Added in 3.9
Name.startswith("avx512.mask.pcmpgt.") || // Added in 3.9
Name == "sse.add.ss" || // Added in 4.0
Name == "sse2.add.sd" || // Added in 4.0
Name == "sse.sub.ss" || // Added in 4.0
Name == "sse2.sub.sd" || // Added in 4.0
Name == "sse.mul.ss" || // Added in 4.0
Name == "sse2.mul.sd" || // Added in 4.0
Name == "sse.div.ss" || // Added in 4.0
Name == "sse2.div.sd" || // Added in 4.0
Name == "sse41.pmaxsb" || // Added in 3.9
Name == "sse2.pmaxs.w" || // Added in 3.9
Name == "sse41.pmaxsd" || // Added in 3.9
Name == "sse2.pmaxu.b" || // Added in 3.9
Name == "sse41.pmaxuw" || // Added in 3.9
Name == "sse41.pmaxud" || // Added in 3.9
Name == "sse41.pminsb" || // Added in 3.9
Name == "sse2.pmins.w" || // Added in 3.9
Name == "sse41.pminsd" || // Added in 3.9
Name == "sse2.pminu.b" || // Added in 3.9
Name == "sse41.pminuw" || // Added in 3.9
Name == "sse41.pminud" || // Added in 3.9
Name.startswith("avx512.mask.pshuf.b.") || // Added in 4.0
Name.startswith("avx2.pmax") || // Added in 3.9
Name.startswith("avx2.pmin") || // Added in 3.9
Name.startswith("avx512.mask.pmax") || // Added in 4.0
Name.startswith("avx512.mask.pmin") || // Added in 4.0
Name.startswith("avx2.vbroadcast") || // Added in 3.8
Name.startswith("avx2.pbroadcast") || // Added in 3.8
Name.startswith("avx.vpermil.") || // Added in 3.1
Name.startswith("sse2.pshuf") || // Added in 3.9
Name.startswith("avx512.pbroadcast") || // Added in 3.9
Name.startswith("avx512.mask.broadcast.s") || // Added in 3.9
Name.startswith("avx512.mask.movddup") || // Added in 3.9
Name.startswith("avx512.mask.movshdup") || // Added in 3.9
Name.startswith("avx512.mask.movsldup") || // Added in 3.9
Name.startswith("avx512.mask.pshuf.d.") || // Added in 3.9
Name.startswith("avx512.mask.pshufl.w.") || // Added in 3.9
Name.startswith("avx512.mask.pshufh.w.") || // Added in 3.9
Name.startswith("avx512.mask.shuf.p") || // Added in 4.0
Name.startswith("avx512.mask.vpermil.p") || // Added in 3.9
Name.startswith("avx512.mask.perm.df.") || // Added in 3.9
Name.startswith("avx512.mask.perm.di.") || // Added in 3.9
Name.startswith("avx512.mask.punpckl") || // Added in 3.9
Name.startswith("avx512.mask.punpckh") || // Added in 3.9
Name.startswith("avx512.mask.unpckl.") || // Added in 3.9
Name.startswith("avx512.mask.unpckh.") || // Added in 3.9
Name.startswith("avx512.mask.pand.") || // Added in 3.9
Name.startswith("avx512.mask.pandn.") || // Added in 3.9
Name.startswith("avx512.mask.por.") || // Added in 3.9
Name.startswith("avx512.mask.pxor.") || // Added in 3.9
Name.startswith("avx512.mask.and.") || // Added in 3.9
Name.startswith("avx512.mask.andn.") || // Added in 3.9
Name.startswith("avx512.mask.or.") || // Added in 3.9
Name.startswith("avx512.mask.xor.") || // Added in 3.9
Name.startswith("avx512.mask.padd.") || // Added in 4.0
Name.startswith("avx512.mask.psub.") || // Added in 4.0
Name.startswith("avx512.mask.pmull.") || // Added in 4.0
Name.startswith("avx512.mask.cvtdq2pd.") || // Added in 4.0
Name.startswith("avx512.mask.cvtudq2pd.") || // Added in 4.0
Name.startswith("avx512.mask.pmul.dq.") || // Added in 4.0
Name.startswith("avx512.mask.pmulu.dq.") || // Added in 4.0
Name.startswith("avx512.mask.packsswb.") || // Added in 5.0
Name.startswith("avx512.mask.packssdw.") || // Added in 5.0
Name.startswith("avx512.mask.packuswb.") || // Added in 5.0
Name.startswith("avx512.mask.packusdw.") || // Added in 5.0
Name.startswith("avx512.mask.cmp.b") || // Added in 5.0
Name.startswith("avx512.mask.cmp.d") || // Added in 5.0
Name.startswith("avx512.mask.cmp.q") || // Added in 5.0
Name.startswith("avx512.mask.cmp.w") || // Added in 5.0
Name.startswith("avx512.mask.ucmp.") || // Added in 5.0
Name == "avx512.mask.add.pd.128" || // Added in 4.0
Name == "avx512.mask.add.pd.256" || // Added in 4.0
Name == "avx512.mask.add.ps.128" || // Added in 4.0
Name == "avx512.mask.add.ps.256" || // Added in 4.0
Name == "avx512.mask.div.pd.128" || // Added in 4.0
Name == "avx512.mask.div.pd.256" || // Added in 4.0
Name == "avx512.mask.div.ps.128" || // Added in 4.0
Name == "avx512.mask.div.ps.256" || // Added in 4.0
Name == "avx512.mask.mul.pd.128" || // Added in 4.0
Name == "avx512.mask.mul.pd.256" || // Added in 4.0
Name == "avx512.mask.mul.ps.128" || // Added in 4.0
Name == "avx512.mask.mul.ps.256" || // Added in 4.0
Name == "avx512.mask.sub.pd.128" || // Added in 4.0
Name == "avx512.mask.sub.pd.256" || // Added in 4.0
Name == "avx512.mask.sub.ps.128" || // Added in 4.0
Name == "avx512.mask.sub.ps.256" || // Added in 4.0
Name == "avx512.mask.max.pd.128" || // Added in 5.0
Name == "avx512.mask.max.pd.256" || // Added in 5.0
Name == "avx512.mask.max.ps.128" || // Added in 5.0
Name == "avx512.mask.max.ps.256" || // Added in 5.0
Name == "avx512.mask.min.pd.128" || // Added in 5.0
Name == "avx512.mask.min.pd.256" || // Added in 5.0
Name == "avx512.mask.min.ps.128" || // Added in 5.0
Name == "avx512.mask.min.ps.256" || // Added in 5.0
Name.startswith("avx512.mask.vpermilvar.") || // Added in 4.0
Name.startswith("avx512.mask.psll.d") || // Added in 4.0
Name.startswith("avx512.mask.psll.q") || // Added in 4.0
Name.startswith("avx512.mask.psll.w") || // Added in 4.0
Name.startswith("avx512.mask.psra.d") || // Added in 4.0
Name.startswith("avx512.mask.psra.q") || // Added in 4.0
Name.startswith("avx512.mask.psra.w") || // Added in 4.0
Name.startswith("avx512.mask.psrl.d") || // Added in 4.0
Name.startswith("avx512.mask.psrl.q") || // Added in 4.0
Name.startswith("avx512.mask.psrl.w") || // Added in 4.0
Name.startswith("avx512.mask.pslli") || // Added in 4.0
Name.startswith("avx512.mask.psrai") || // Added in 4.0
Name.startswith("avx512.mask.psrli") || // Added in 4.0
Name.startswith("avx512.mask.psllv") || // Added in 4.0
Name.startswith("avx512.mask.psrav") || // Added in 4.0
Name.startswith("avx512.mask.psrlv") || // Added in 4.0
Name.startswith("sse41.pmovsx") || // Added in 3.8
Name.startswith("sse41.pmovzx") || // Added in 3.9
Name.startswith("avx2.pmovsx") || // Added in 3.9
Name.startswith("avx2.pmovzx") || // Added in 3.9
Name.startswith("avx512.mask.pmovsx") || // Added in 4.0
Name.startswith("avx512.mask.pmovzx") || // Added in 4.0
Name.startswith("avx512.mask.lzcnt.") || // Added in 5.0
Name == "sse2.cvtdq2pd" || // Added in 3.9
Name == "sse2.cvtps2pd" || // Added in 3.9
Name == "avx.cvtdq2.pd.256" || // Added in 3.9
Name == "avx.cvt.ps2.pd.256" || // Added in 3.9
Name.startswith("avx.vinsertf128.") || // Added in 3.7
Name == "avx2.vinserti128" || // Added in 3.7
Name.startswith("avx512.mask.insert") || // Added in 4.0
Name.startswith("avx.vextractf128.") || // Added in 3.7
Name == "avx2.vextracti128" || // Added in 3.7
Name.startswith("avx512.mask.vextract") || // Added in 4.0
Name.startswith("sse4a.movnt.") || // Added in 3.9
Name.startswith("avx.movnt.") || // Added in 3.2
Name.startswith("avx512.storent.") || // Added in 3.9
Name == "sse41.movntdqa" || // Added in 5.0
Name == "avx2.movntdqa" || // Added in 5.0
Name == "avx512.movntdqa" || // Added in 5.0
Name == "sse2.storel.dq" || // Added in 3.9
Name.startswith("sse.storeu.") || // Added in 3.9
Name.startswith("sse2.storeu.") || // Added in 3.9
Name.startswith("avx.storeu.") || // Added in 3.9
Name.startswith("avx512.mask.storeu.") || // Added in 3.9
Name.startswith("avx512.mask.store.p") || // Added in 3.9
Name.startswith("avx512.mask.store.b.") || // Added in 3.9
Name.startswith("avx512.mask.store.w.") || // Added in 3.9
Name.startswith("avx512.mask.store.d.") || // Added in 3.9
Name.startswith("avx512.mask.store.q.") || // Added in 3.9
Name.startswith("avx512.mask.loadu.") || // Added in 3.9
Name.startswith("avx512.mask.load.") || // Added in 3.9
Name == "sse42.crc32.64.8" || // Added in 3.4
Name.startswith("avx.vbroadcast.s") || // Added in 3.5
Name.startswith("avx512.mask.palignr.") || // Added in 3.9
Name.startswith("avx512.mask.valign.") || // Added in 4.0
Name.startswith("sse2.psll.dq") || // Added in 3.7
Name.startswith("sse2.psrl.dq") || // Added in 3.7
Name.startswith("avx2.psll.dq") || // Added in 3.7
Name.startswith("avx2.psrl.dq") || // Added in 3.7
Name.startswith("avx512.psll.dq") || // Added in 3.9
Name.startswith("avx512.psrl.dq") || // Added in 3.9
Name == "sse41.pblendw" || // Added in 3.7
Name.startswith("sse41.blendp") || // Added in 3.7
Name.startswith("avx.blend.p") || // Added in 3.7
Name == "avx2.pblendw" || // Added in 3.7
Name.startswith("avx2.pblendd.") || // Added in 3.7
Name.startswith("avx.vbroadcastf128") || // Added in 4.0
Name == "avx2.vbroadcasti128" || // Added in 3.7
Name == "xop.vpcmov" || // Added in 3.8
Name == "xop.vpcmov.256" || // Added in 5.0
Name.startswith("avx512.mask.move.s") || // Added in 4.0
Name.startswith("avx512.cvtmask2") || // Added in 5.0
(Name.startswith("xop.vpcom") && // Added in 3.2
F->arg_size() == 2))
return true;
return false;
}
static bool UpgradeX86IntrinsicFunction(Function *F, StringRef Name,
Function *&NewFn) {
// Only handle intrinsics that start with "x86.".
if (!Name.startswith("x86."))
return false;
// Remove "x86." prefix.
Name = Name.substr(4);
if (ShouldUpgradeX86Intrinsic(F, Name)) {
NewFn = nullptr;
return true;
}
// SSE4.1 ptest functions may have an old signature.
if (Name.startswith("sse41.ptest")) { // Added in 3.2
if (Name.substr(11) == "c")
return UpgradePTESTIntrinsic(F, Intrinsic::x86_sse41_ptestc, NewFn);
if (Name.substr(11) == "z")
return UpgradePTESTIntrinsic(F, Intrinsic::x86_sse41_ptestz, NewFn);
if (Name.substr(11) == "nzc")
return UpgradePTESTIntrinsic(F, Intrinsic::x86_sse41_ptestnzc, NewFn);
}
// Several blend and other instructions with masks used the wrong number of
// bits.
if (Name == "sse41.insertps") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_sse41_insertps,
NewFn);
if (Name == "sse41.dppd") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_sse41_dppd,
NewFn);
if (Name == "sse41.dpps") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_sse41_dpps,
NewFn);
if (Name == "sse41.mpsadbw") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_sse41_mpsadbw,
NewFn);
if (Name == "avx.dp.ps.256") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_avx_dp_ps_256,
NewFn);
if (Name == "avx2.mpsadbw") // Added in 3.6
return UpgradeX86IntrinsicsWith8BitMask(F, Intrinsic::x86_avx2_mpsadbw,
NewFn);
// frcz.ss/sd may need to have an argument dropped. Added in 3.2
if (Name.startswith("xop.vfrcz.ss") && F->arg_size() == 2) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::x86_xop_vfrcz_ss);
return true;
}
if (Name.startswith("xop.vfrcz.sd") && F->arg_size() == 2) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::x86_xop_vfrcz_sd);
return true;
}
// Upgrade any XOP PERMIL2 index operand still using a float/double vector.
if (Name.startswith("xop.vpermil2")) { // Added in 3.9
auto Idx = F->getFunctionType()->getParamType(2);
if (Idx->isFPOrFPVectorTy()) {
rename(F);
unsigned IdxSize = Idx->getPrimitiveSizeInBits();
unsigned EltSize = Idx->getScalarSizeInBits();
Intrinsic::ID Permil2ID;
if (EltSize == 64 && IdxSize == 128)
Permil2ID = Intrinsic::x86_xop_vpermil2pd;
else if (EltSize == 32 && IdxSize == 128)
Permil2ID = Intrinsic::x86_xop_vpermil2ps;
else if (EltSize == 64 && IdxSize == 256)
Permil2ID = Intrinsic::x86_xop_vpermil2pd_256;
else
Permil2ID = Intrinsic::x86_xop_vpermil2ps_256;
NewFn = Intrinsic::getDeclaration(F->getParent(), Permil2ID);
return true;
}
}
return false;
}
static bool UpgradeIntrinsicFunction1(Function *F, Function *&NewFn) {
assert(F && "Illegal to upgrade a non-existent Function.");
// Quickly eliminate it if it's not a candidate.
StringRef Name = F->getName();
if (Name.size() <= 8 || !Name.startswith("llvm."))
return false;
Name = Name.substr(5); // Strip off "llvm."
switch (Name[0]) {
default: break;
case 'a': {
if (Name.startswith("arm.rbit") || Name.startswith("aarch64.rbit")) {
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::bitreverse,
F->arg_begin()->getType());
return true;
}
if (Name.startswith("arm.neon.vclz")) {
Type* args[2] = {
F->arg_begin()->getType(),
Type::getInt1Ty(F->getContext())
};
// Can't use Intrinsic::getDeclaration here as it adds a ".i1" to
// the end of the name. Change name from llvm.arm.neon.vclz.* to
// llvm.ctlz.*
FunctionType* fType = FunctionType::get(F->getReturnType(), args, false);
NewFn = Function::Create(fType, F->getLinkage(),
"llvm.ctlz." + Name.substr(14), F->getParent());
return true;
}
if (Name.startswith("arm.neon.vcnt")) {
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::ctpop,
F->arg_begin()->getType());
return true;
}
Regex vldRegex("^arm\\.neon\\.vld([1234]|[234]lane)\\.v[a-z0-9]*$");
if (vldRegex.match(Name)) {
auto fArgs = F->getFunctionType()->params();
SmallVector<Type *, 4> Tys(fArgs.begin(), fArgs.end());
// Can't use Intrinsic::getDeclaration here as the return types might
// then only be structurally equal.
FunctionType* fType = FunctionType::get(F->getReturnType(), Tys, false);
NewFn = Function::Create(fType, F->getLinkage(),
"llvm." + Name + ".p0i8", F->getParent());
return true;
}
Regex vstRegex("^arm\\.neon\\.vst([1234]|[234]lane)\\.v[a-z0-9]*$");
if (vstRegex.match(Name)) {
static const Intrinsic::ID StoreInts[] = {Intrinsic::arm_neon_vst1,
Intrinsic::arm_neon_vst2,
Intrinsic::arm_neon_vst3,
Intrinsic::arm_neon_vst4};
static const Intrinsic::ID StoreLaneInts[] = {
Intrinsic::arm_neon_vst2lane, Intrinsic::arm_neon_vst3lane,
Intrinsic::arm_neon_vst4lane
};
auto fArgs = F->getFunctionType()->params();
Type *Tys[] = {fArgs[0], fArgs[1]};
if (Name.find("lane") == StringRef::npos)
NewFn = Intrinsic::getDeclaration(F->getParent(),
StoreInts[fArgs.size() - 3], Tys);
else
NewFn = Intrinsic::getDeclaration(F->getParent(),
StoreLaneInts[fArgs.size() - 5], Tys);
return true;
}
if (Name == "aarch64.thread.pointer" || Name == "arm.thread.pointer") {
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::thread_pointer);
return true;
}
break;
}
case 'c': {
if (Name.startswith("ctlz.") && F->arg_size() == 1) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::ctlz,
F->arg_begin()->getType());
return true;
}
if (Name.startswith("cttz.") && F->arg_size() == 1) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::cttz,
F->arg_begin()->getType());
return true;
}
break;
}
case 'i':
case 'l': {
bool IsLifetimeStart = Name.startswith("lifetime.start");
if (IsLifetimeStart || Name.startswith("invariant.start")) {
Intrinsic::ID ID = IsLifetimeStart ?
Intrinsic::lifetime_start : Intrinsic::invariant_start;
auto Args = F->getFunctionType()->params();
Type* ObjectPtr[1] = {Args[1]};
if (F->getName() != Intrinsic::getName(ID, ObjectPtr)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), ID, ObjectPtr);
return true;
}
}
bool IsLifetimeEnd = Name.startswith("lifetime.end");
if (IsLifetimeEnd || Name.startswith("invariant.end")) {
Intrinsic::ID ID = IsLifetimeEnd ?
Intrinsic::lifetime_end : Intrinsic::invariant_end;
auto Args = F->getFunctionType()->params();
Type* ObjectPtr[1] = {Args[IsLifetimeEnd ? 1 : 2]};
if (F->getName() != Intrinsic::getName(ID, ObjectPtr)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), ID, ObjectPtr);
return true;
}
}
break;
}
case 'm': {
if (Name.startswith("masked.load.")) {
Type *Tys[] = { F->getReturnType(), F->arg_begin()->getType() };
if (F->getName() != Intrinsic::getName(Intrinsic::masked_load, Tys)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::masked_load,
Tys);
return true;
}
}
if (Name.startswith("masked.store.")) {
auto Args = F->getFunctionType()->params();
Type *Tys[] = { Args[0], Args[1] };
if (F->getName() != Intrinsic::getName(Intrinsic::masked_store, Tys)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::masked_store,
Tys);
return true;
}
}
// Rename gather/scatter intrinsics with no address space overloading
// to the new overload that includes an address space.
if (Name.startswith("masked.gather.")) {
Type *Tys[] = {F->getReturnType(), F->arg_begin()->getType()};
if (F->getName() != Intrinsic::getName(Intrinsic::masked_gather, Tys)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::masked_gather, Tys);
return true;
}
}
if (Name.startswith("masked.scatter.")) {
auto Args = F->getFunctionType()->params();
Type *Tys[] = {Args[0], Args[1]};
if (F->getName() != Intrinsic::getName(Intrinsic::masked_scatter, Tys)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::masked_scatter, Tys);
return true;
}
}
break;
}
case 'n': {
if (Name.startswith("nvvm.")) {
Name = Name.substr(5);
// The following nvvm intrinsics correspond exactly to an LLVM intrinsic.
Intrinsic::ID IID = StringSwitch<Intrinsic::ID>(Name)
.Cases("brev32", "brev64", Intrinsic::bitreverse)
.Case("clz.i", Intrinsic::ctlz)
.Case("popc.i", Intrinsic::ctpop)
.Default(Intrinsic::not_intrinsic);
if (IID != Intrinsic::not_intrinsic && F->arg_size() == 1) {
NewFn = Intrinsic::getDeclaration(F->getParent(), IID,
{F->getReturnType()});
return true;
}
// The following nvvm intrinsics correspond exactly to an LLVM idiom, but
// not to an intrinsic alone. We expand them in UpgradeIntrinsicCall.
//
// TODO: We could add lohi.i2d.
bool Expand = StringSwitch<bool>(Name)
.Cases("abs.i", "abs.ll", true)
.Cases("clz.ll", "popc.ll", "h2f", true)
.Cases("max.i", "max.ll", "max.ui", "max.ull", true)
.Cases("min.i", "min.ll", "min.ui", "min.ull", true)
.Default(false);
if (Expand) {
NewFn = nullptr;
return true;
}
}
break;
}
case 'o':
// We only need to change the name to match the mangling, including the
// address space.
if (Name.startswith("objectsize.")) {
Type *Tys[2] = { F->getReturnType(), F->arg_begin()->getType() };
if (F->arg_size() == 2 ||
F->getName() != Intrinsic::getName(Intrinsic::objectsize, Tys)) {
rename(F);
NewFn = Intrinsic::getDeclaration(F->getParent(), Intrinsic::objectsize,
Tys);
return true;
}
}
break;
case 's':
if (Name == "stackprotectorcheck") {
NewFn = nullptr;
return true;
}
break;
case 'x':
if (UpgradeX86IntrinsicFunction(F, Name, NewFn))
return true;
}
// Remangle our intrinsic since we upgrade the mangling
auto Result = llvm::Intrinsic::remangleIntrinsicFunction(F);
if (Result != None) {
NewFn = Result.getValue();
return true;
}
// This may not belong here. This function is effectively being overloaded
// to both detect an intrinsic that needs upgrading and to provide the
// upgraded form of the intrinsic. We should perhaps have two separate
// functions for this.
return false;
}
bool llvm::UpgradeIntrinsicFunction(Function *F, Function *&NewFn) {
NewFn = nullptr;
bool Upgraded = UpgradeIntrinsicFunction1(F, NewFn);
assert(F != NewFn && "Intrinsic function upgraded to the same function");
// Upgrade intrinsic attributes. This does not change the function.
if (NewFn)
F = NewFn;
if (Intrinsic::ID id = F->getIntrinsicID())
F->setAttributes(Intrinsic::getAttributes(F->getContext(), id));
return Upgraded;
}
bool llvm::UpgradeGlobalVariable(GlobalVariable *GV) {
// Nothing to do yet.
return false;
}
// Handles upgrading SSE2/AVX2/AVX512BW PSLLDQ intrinsics by converting them
// to byte shuffles.
static Value *UpgradeX86PSLLDQIntrinsics(IRBuilder<> &Builder,
Value *Op, unsigned Shift) {
Type *ResultTy = Op->getType();
unsigned NumElts = ResultTy->getVectorNumElements() * 8;
// Bitcast from a 64-bit element type to a byte element type.
Type *VecTy = VectorType::get(Builder.getInt8Ty(), NumElts);
Op = Builder.CreateBitCast(Op, VecTy, "cast");
// We'll be shuffling in zeroes.
Value *Res = Constant::getNullValue(VecTy);
// If shift is less than 16, emit a shuffle to move the bytes. Otherwise,
// we'll just return the zero vector.
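// For example, a 128-bit shift with Shift == 4 yields indices
// {12,13,14,15, 16,...,27}: the first four bytes come from the zero vector
// and the rest are the low twelve bytes of Op, i.e. a left shift by four
// bytes.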
if (Shift < 16) {
uint32_t Idxs[64];
// 256/512-bit version is split into 2/4 16-byte lanes.
for (unsigned l = 0; l != NumElts; l += 16)
for (unsigned i = 0; i != 16; ++i) {
unsigned Idx = NumElts + i - Shift;
if (Idx < NumElts)
Idx -= NumElts - 16; // end of lane, switch operand.
Idxs[l + i] = Idx + l;
}
Res = Builder.CreateShuffleVector(Res, Op, makeArrayRef(Idxs, NumElts));
}
// Bitcast back to a 64-bit element type.
return Builder.CreateBitCast(Res, ResultTy, "cast");
}
// Handles upgrading SSE2/AVX2/AVX512BW PSRLDQ intrinsics by converting them
// to byte shuffles.
static Value *UpgradeX86PSRLDQIntrinsics(IRBuilder<> &Builder, Value *Op,
unsigned Shift) {
Type *ResultTy = Op->getType();
unsigned NumElts = ResultTy->getVectorNumElements() * 8;
// Bitcast from a 64-bit element type to a byte element type.
Type *VecTy = VectorType::get(Builder.getInt8Ty(), NumElts);
Op = Builder.CreateBitCast(Op, VecTy, "cast");
// We'll be shuffling in zeroes.
Value *Res = Constant::getNullValue(VecTy);
// If shift is less than 16, emit a shuffle to move the bytes. Otherwise,
// we'll just return the zero vector.
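// For example, a 128-bit shift with Shift == 4 yields indices
// {4,...,15, 16,17,18,19}: the low twelve bytes are the high bytes of Op
// and the top four bytes come from the zero vector, i.e. a right shift by
// four bytes.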
if (Shift < 16) {
uint32_t Idxs[64];
// 256/512-bit version is split into 2/4 16-byte lanes.
for (unsigned l = 0; l != NumElts; l += 16)
for (unsigned i = 0; i != 16; ++i) {
unsigned Idx = i + Shift;
if (Idx >= 16)
Idx += NumElts - 16; // end of lane, switch operand.
Idxs[l + i] = Idx + l;
}
Res = Builder.CreateShuffleVector(Op, Res, makeArrayRef(Idxs, NumElts));
}
// Bitcast back to a 64-bit element type.
return Builder.CreateBitCast(Res, ResultTy, "cast");
}
static Value *getX86MaskVec(IRBuilder<> &Builder, Value *Mask,
unsigned NumElts) {
llvm::VectorType *MaskTy = llvm::VectorType::get(Builder.getInt1Ty(),
cast<IntegerType>(Mask->getType())->getBitWidth());
Mask = Builder.CreateBitCast(Mask, MaskTy);
// If we have fewer than 8 elements, then the starting mask was an i8 and
// we need to extract down to the right number of elements.
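// For example, with an i8 mask and NumElts == 4, the mask is bitcast to
// <8 x i1> and the shuffle keeps only lanes 0..3, yielding <4 x i1>.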
if (NumElts < 8) {
uint32_t Indices[4];
for (unsigned i = 0; i != NumElts; ++i)
Indices[i] = i;
Mask = Builder.CreateShuffleVector(Mask, Mask,
makeArrayRef(Indices, NumElts),
"extract");
}
return Mask;
}
static Value *EmitX86Select(IRBuilder<> &Builder, Value *Mask,
Value *Op0, Value *Op1) {
// If the mask is all ones, no select is needed; just return the first operand.
if (const auto *C = dyn_cast<Constant>(Mask))
if (C->isAllOnesValue())
return Op0;
Mask = getX86MaskVec(Builder, Mask, Op0->getType()->getVectorNumElements());
return Builder.CreateSelect(Mask, Op0, Op1);
}
// Handle autoupgrade for masked PALIGNR and VALIGND/Q intrinsics.
// PALIGNR handles large immediates by shifting while VALIGN masks the immediate
// so we need to handle both cases. VALIGN also doesn't have 128-bit lanes.
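// For example, a 128-bit PALIGNR with a shift of 20 becomes a shift of 4
// with zeroes shifting in: Op1 takes Op0's place and Op0 becomes the zero
// vector.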
static Value *UpgradeX86ALIGNIntrinsics(IRBuilder<> &Builder, Value *Op0,
Value *Op1, Value *Shift,
Value *Passthru, Value *Mask,
bool IsVALIGN) {
unsigned ShiftVal = cast<llvm::ConstantInt>(Shift)->getZExtValue();
unsigned NumElts = Op0->getType()->getVectorNumElements();
assert((IsVALIGN || NumElts % 16 == 0) && "Illegal NumElts for PALIGNR!");
assert((!IsVALIGN || NumElts <= 16) && "NumElts too large for VALIGN!");
assert(isPowerOf2_32(NumElts) && "NumElts not a power of 2!");
// Mask the immediate for VALIGN.
if (IsVALIGN)
ShiftVal &= (NumElts - 1);
// If palignr is shifting the pair of vectors more than the size of two
// lanes, emit zero.
if (ShiftVal >= 32)
return llvm::Constant::getNullValue(Op0->getType());
// If palignr is shifting the pair of input vectors more than one lane,
// but less than two lanes, convert to shifting in zeroes.
if (ShiftVal > 16) {
ShiftVal -= 16;
Op1 = Op0;
Op0 = llvm::Constant::getNullValue(Op0->getType());
}
uint32_t Indices[64];
// 256-bit palignr operates on 128-bit lanes, so we need to handle that here.
for (unsigned l = 0; l < NumElts; l += 16) {
for (unsigned i = 0; i != 16; ++i) {
unsigned Idx = ShiftVal + i;
if (!IsVALIGN && Idx >= 16) // Disable wrap for VALIGN.
Idx += NumElts - 16; // End of lane, switch operand.
Indices[l + i] = Idx + l;
}
}
Value *Align = Builder.CreateShuffleVector(Op1, Op0,
makeArrayRef(Indices, NumElts),
"palignr");
return EmitX86Select(Builder, Mask, Align, Passthru);
}
static Value *UpgradeMaskedStore(IRBuilder<> &Builder,
Value *Ptr, Value *Data, Value *Mask,
bool Aligned) {
// Cast the pointer to the right type.
Ptr = Builder.CreateBitCast(Ptr,
llvm::PointerType::getUnqual(Data->getType()));
unsigned Align =
Aligned ? cast<VectorType>(Data->getType())->getBitWidth() / 8 : 1;
// If the mask is all ones just emit a regular store.
if (const auto *C = dyn_cast<Constant>(Mask))
if (C->isAllOnesValue())
return Builder.CreateAlignedStore(Data, Ptr, Align);
// Convert the mask from an integer type to a vector of i1.
unsigned NumElts = Data->getType()->getVectorNumElements();
Mask = getX86MaskVec(Builder, Mask, NumElts);
return Builder.CreateMaskedStore(Data, Ptr, Align, Mask);
}
static Value *UpgradeMaskedLoad(IRBuilder<> &Builder,
Value *Ptr, Value *Passthru, Value *Mask,
bool Aligned) {
// Cast the pointer to the right type.
Ptr = Builder.CreateBitCast(Ptr,
llvm::PointerType::getUnqual(Passthru->getType()));
unsigned Align =
Aligned ? cast<VectorType>(Passthru->getType())->getBitWidth() / 8 : 1;
// If the mask is all ones just emit a regular load.
if (const auto *C = dyn_cast<Constant>(Mask))
if (C->isAllOnesValue())
return Builder.CreateAlignedLoad(Ptr, Align);
// Convert the mask from an integer type to a vector of i1.
unsigned NumElts = Passthru->getType()->getVectorNumElements();
Mask = getX86MaskVec(Builder, Mask, NumElts);
return Builder.CreateMaskedLoad(Ptr, Align, Mask, Passthru);
}
static Value *upgradeIntMinMax(IRBuilder<> &Builder, CallInst &CI,
ICmpInst::Predicate Pred) {
Value *Op0 = CI.getArgOperand(0);
Value *Op1 = CI.getArgOperand(1);
Value *Cmp = Builder.CreateICmp(Pred, Op0, Op1);
Value *Res = Builder.CreateSelect(Cmp, Op0, Op1);
if (CI.getNumArgOperands() == 4)
Res = EmitX86Select(Builder, CI.getArgOperand(3), Res, CI.getArgOperand(2));
return Res;
}
static Value *upgradeMaskedCompare(IRBuilder<> &Builder, CallInst &CI,
unsigned CC, bool Signed) {
Value *Op0 = CI.getArgOperand(0);
unsigned NumElts = Op0->getType()->getVectorNumElements();
Value *Cmp;
if (CC == 3) {
Cmp = Constant::getNullValue(llvm::VectorType::get(Builder.getInt1Ty(), NumElts));
} else if (CC == 7) {
Cmp = Constant::getAllOnesValue(llvm::VectorType::get(Builder.getInt1Ty(), NumElts));
} else {
ICmpInst::Predicate Pred;
switch (CC) {
default: llvm_unreachable("Unknown condition code");
case 0: Pred = ICmpInst::ICMP_EQ; break;
case 1: Pred = Signed ? ICmpInst::ICMP_SLT : ICmpInst::ICMP_ULT; break;
case 2: Pred = Signed ? ICmpInst::ICMP_SLE : ICmpInst::ICMP_ULE; break;
case 4: Pred = ICmpInst::ICMP_NE; break;
case 5: Pred = Signed ? ICmpInst::ICMP_SGE : ICmpInst::ICMP_UGE; break;
case 6: Pred = Signed ? ICmpInst::ICMP_SGT : ICmpInst::ICMP_UGT; break;
}
Cmp = Builder.CreateICmp(Pred, Op0, CI.getArgOperand(1));
}
Value *Mask = CI.getArgOperand(CI.getNumArgOperands() - 1);
const auto *C = dyn_cast<Constant>(Mask);
if (!C || !C->isAllOnesValue())
Cmp = Builder.CreateAnd(Cmp, getX86MaskVec(Builder, Mask, NumElts));
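// Compares on fewer than 8 elements still produce at least an i8 result,
// so widen the <NumElts x i1> compare with zero lanes up to 8 before the
// bitcast below (e.g. a 4-element compare yields <8 x i1> with lanes 4..7
// zero).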
if (NumElts < 8) {
uint32_t Indices[8];
for (unsigned i = 0; i != NumElts; ++i)
Indices[i] = i;
for (unsigned i = NumElts; i != 8; ++i)
Indices[i] = NumElts + i % NumElts;
Cmp = Builder.CreateShuffleVector(Cmp,
Constant::getNullValue(Cmp->getType()),
Indices);
}
return Builder.CreateBitCast(Cmp, IntegerType::get(CI.getContext(),
std::max(NumElts, 8U)));
}
// Replace a masked intrinsic with an older unmasked intrinsic.
static Value *UpgradeX86MaskedShift(IRBuilder<> &Builder, CallInst &CI,
Intrinsic::ID IID) {
Function *F = CI.getCalledFunction();
Function *Intrin = Intrinsic::getDeclaration(F->getParent(), IID);
Value *Rep = Builder.CreateCall(Intrin,
{ CI.getArgOperand(0), CI.getArgOperand(1) });
return EmitX86Select(Builder, CI.getArgOperand(3), Rep, CI.getArgOperand(2));
}
static Value* upgradeMaskedMove(IRBuilder<> &Builder, CallInst &CI) {
Value* A = CI.getArgOperand(0);
Value* B = CI.getArgOperand(1);
Value* Src = CI.getArgOperand(2);
Value* Mask = CI.getArgOperand(3);
Value* AndNode = Builder.CreateAnd(Mask, APInt(8, 1));
Value* Cmp = Builder.CreateIsNotNull(AndNode);
Value* Extract1 = Builder.CreateExtractElement(B, (uint64_t)0);
Value* Extract2 = Builder.CreateExtractElement(Src, (uint64_t)0);
Value* Select = Builder.CreateSelect(Cmp, Extract1, Extract2);
return Builder.CreateInsertElement(A, Select, (uint64_t)0);
}
static Value* UpgradeMaskToInt(IRBuilder<> &Builder, CallInst &CI) {
Value* Op = CI.getArgOperand(0);
Type* ReturnOp = CI.getType();
unsigned NumElts = CI.getType()->getVectorNumElements();
Value *Mask = getX86MaskVec(Builder, Op, NumElts);
return Builder.CreateSExt(Mask, ReturnOp, "vpmovm2");
}
/// Upgrade a call to an old intrinsic. All argument and return casting must be
/// provided to seamlessly integrate with existing context.
void llvm::UpgradeIntrinsicCall(CallInst *CI, Function *NewFn) {
Function *F = CI->getCalledFunction();
LLVMContext &C = CI->getContext();
IRBuilder<> Builder(C);
Builder.SetInsertPoint(CI->getParent(), CI->getIterator());
assert(F && "Intrinsic call is not direct?");
if (!NewFn) {
// Get the Function's name.
StringRef Name = F->getName();
assert(Name.startswith("llvm.") && "Intrinsic doesn't start with 'llvm.'");
Name = Name.substr(5);
bool IsX86 = Name.startswith("x86.");
if (IsX86)
Name = Name.substr(4);
bool IsNVVM = Name.startswith("nvvm.");
if (IsNVVM)
Name = Name.substr(5);
if (IsX86 && Name.startswith("sse4a.movnt.")) {
Module *M = F->getParent();
SmallVector<Metadata *, 1> Elts;
Elts.push_back(
ConstantAsMetadata::get(ConstantInt::get(Type::getInt32Ty(C), 1)));
MDNode *Node = MDNode::get(C, Elts);
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
// Nontemporal (unaligned) store of element 0 of the float/double
// vector.
Type *SrcEltTy = cast<VectorType>(Arg1->getType())->getElementType();
PointerType *EltPtrTy = PointerType::getUnqual(SrcEltTy);
Value *Addr = Builder.CreateBitCast(Arg0, EltPtrTy, "cast");
Value *Extract =
Builder.CreateExtractElement(Arg1, (uint64_t)0, "extractelement");
StoreInst *SI = Builder.CreateAlignedStore(Extract, Addr, 1);
SI->setMetadata(M->getMDKindID("nontemporal"), Node);
// Remove intrinsic.
CI->eraseFromParent();
return;
}
if (IsX86 && (Name.startswith("avx.movnt.") ||
Name.startswith("avx512.storent."))) {
Module *M = F->getParent();
SmallVector<Metadata *, 1> Elts;
Elts.push_back(
ConstantAsMetadata::get(ConstantInt::get(Type::getInt32Ty(C), 1)));
MDNode *Node = MDNode::get(C, Elts);
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
// Convert the type of the pointer to a pointer to the stored type.
Value *BC = Builder.CreateBitCast(Arg0,
PointerType::getUnqual(Arg1->getType()),
"cast");
VectorType *VTy = cast<VectorType>(Arg1->getType());
StoreInst *SI = Builder.CreateAlignedStore(Arg1, BC,
VTy->getBitWidth() / 8);
SI->setMetadata(M->getMDKindID("nontemporal"), Node);
// Remove intrinsic.
CI->eraseFromParent();
return;
}
if (IsX86 && Name == "sse2.storel.dq") {
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
Type *NewVecTy = VectorType::get(Type::getInt64Ty(C), 2);
Value *BC0 = Builder.CreateBitCast(Arg1, NewVecTy, "cast");
Value *Elt = Builder.CreateExtractElement(BC0, (uint64_t)0);
Value *BC = Builder.CreateBitCast(Arg0,
PointerType::getUnqual(Elt->getType()),
"cast");
Builder.CreateAlignedStore(Elt, BC, 1);
// Remove intrinsic.
CI->eraseFromParent();
return;
}
if (IsX86 && (Name.startswith("sse.storeu.") ||
Name.startswith("sse2.storeu.") ||
Name.startswith("avx.storeu."))) {
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
Arg0 = Builder.CreateBitCast(Arg0,
PointerType::getUnqual(Arg1->getType()),
"cast");
Builder.CreateAlignedStore(Arg1, Arg0, 1);
// Remove intrinsic.
CI->eraseFromParent();
return;
}
if (IsX86 && (Name.startswith("avx512.mask.store"))) {
// "avx512.mask.storeu." or "avx512.mask.store."
bool Aligned = Name[17] != 'u'; // "avx512.mask.storeu".
UpgradeMaskedStore(Builder, CI->getArgOperand(0), CI->getArgOperand(1),
CI->getArgOperand(2), Aligned);
// Remove intrinsic.
CI->eraseFromParent();
return;
}
Value *Rep;
// Upgrade packed integer vector compare intrinsics to compare instructions.
if (IsX86 && (Name.startswith("sse2.pcmp") ||
Name.startswith("avx2.pcmp"))) {
// "sse2.pcpmpeq." "sse2.pcmpgt." "avx2.pcmpeq." or "avx2.pcmpgt."
bool CmpEq = Name[9] == 'e';
Rep = Builder.CreateICmp(CmpEq ? ICmpInst::ICMP_EQ : ICmpInst::ICMP_SGT,
CI->getArgOperand(0), CI->getArgOperand(1));
Rep = Builder.CreateSExt(Rep, CI->getType(), "");
} else if (IsX86 && (Name == "sse.add.ss" || Name == "sse2.add.sd")) {
Type *I32Ty = Type::getInt32Ty(C);
Value *Elt0 = Builder.CreateExtractElement(CI->getArgOperand(0),
ConstantInt::get(I32Ty, 0));
Value *Elt1 = Builder.CreateExtractElement(CI->getArgOperand(1),
ConstantInt::get(I32Ty, 0));
Rep = Builder.CreateInsertElement(CI->getArgOperand(0),
Builder.CreateFAdd(Elt0, Elt1),
ConstantInt::get(I32Ty, 0));
} else if (IsX86 && (Name == "sse.sub.ss" || Name == "sse2.sub.sd")) {
Type *I32Ty = Type::getInt32Ty(C);
Value *Elt0 = Builder.CreateExtractElement(CI->getArgOperand(0),
ConstantInt::get(I32Ty, 0));
Value *Elt1 = Builder.CreateExtractElement(CI->getArgOperand(1),
ConstantInt::get(I32Ty, 0));
Rep = Builder.CreateInsertElement(CI->getArgOperand(0),
Builder.CreateFSub(Elt0, Elt1),
ConstantInt::get(I32Ty, 0));
} else if (IsX86 && (Name == "sse.mul.ss" || Name == "sse2.mul.sd")) {
Type *I32Ty = Type::getInt32Ty(C);
Value *Elt0 = Builder.CreateExtractElement(CI->getArgOperand(0),
ConstantInt::get(I32Ty, 0));
Value *Elt1 = Builder.CreateExtractElement(CI->getArgOperand(1),
ConstantInt::get(I32Ty, 0));
Rep = Builder.CreateInsertElement(CI->getArgOperand(0),
Builder.CreateFMul(Elt0, Elt1),
ConstantInt::get(I32Ty, 0));
} else if (IsX86 && (Name == "sse.div.ss" || Name == "sse2.div.sd")) {
Type *I32Ty = Type::getInt32Ty(C);
Value *Elt0 = Builder.CreateExtractElement(CI->getArgOperand(0),
ConstantInt::get(I32Ty, 0));
Value *Elt1 = Builder.CreateExtractElement(CI->getArgOperand(1),
ConstantInt::get(I32Ty, 0));
Rep = Builder.CreateInsertElement(CI->getArgOperand(0),
Builder.CreateFDiv(Elt0, Elt1),
ConstantInt::get(I32Ty, 0));
} else if (IsX86 && Name.startswith("avx512.mask.pcmp")) {
// "avx512.mask.pcmpeq." or "avx512.mask.pcmpgt."
bool CmpEq = Name[16] == 'e';
Rep = upgradeMaskedCompare(Builder, *CI, CmpEq ? 0 : 6, true);
} else if (IsX86 && Name.startswith("avx512.mask.cmp")) {
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(2))->getZExtValue();
Rep = upgradeMaskedCompare(Builder, *CI, Imm, true);
} else if (IsX86 && Name.startswith("avx512.mask.ucmp")) {
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(2))->getZExtValue();
Rep = upgradeMaskedCompare(Builder, *CI, Imm, false);
} else if (IsX86 && (Name == "sse41.pmaxsb" ||
Name == "sse2.pmaxs.w" ||
Name == "sse41.pmaxsd" ||
Name.startswith("avx2.pmaxs") ||
Name.startswith("avx512.mask.pmaxs"))) {
Rep = upgradeIntMinMax(Builder, *CI, ICmpInst::ICMP_SGT);
} else if (IsX86 && (Name == "sse2.pmaxu.b" ||
Name == "sse41.pmaxuw" ||
Name == "sse41.pmaxud" ||
Name.startswith("avx2.pmaxu") ||
Name.startswith("avx512.mask.pmaxu"))) {
Rep = upgradeIntMinMax(Builder, *CI, ICmpInst::ICMP_UGT);
} else if (IsX86 && (Name == "sse41.pminsb" ||
Name == "sse2.pmins.w" ||
Name == "sse41.pminsd" ||
Name.startswith("avx2.pmins") ||
Name.startswith("avx512.mask.pmins"))) {
Rep = upgradeIntMinMax(Builder, *CI, ICmpInst::ICMP_SLT);
} else if (IsX86 && (Name == "sse2.pminu.b" ||
Name == "sse41.pminuw" ||
Name == "sse41.pminud" ||
Name.startswith("avx2.pminu") ||
Name.startswith("avx512.mask.pminu"))) {
Rep = upgradeIntMinMax(Builder, *CI, ICmpInst::ICMP_ULT);
} else if (IsX86 && (Name == "sse2.cvtdq2pd" ||
Name == "sse2.cvtps2pd" ||
Name == "avx.cvtdq2.pd.256" ||
Name == "avx.cvt.ps2.pd.256" ||
Name.startswith("avx512.mask.cvtdq2pd.") ||
Name.startswith("avx512.mask.cvtudq2pd."))) {
// Lossless i32/float to double conversion.
// Extract the bottom elements if necessary and convert to double vector.
Value *Src = CI->getArgOperand(0);
VectorType *SrcTy = cast<VectorType>(Src->getType());
VectorType *DstTy = cast<VectorType>(CI->getType());
Rep = CI->getArgOperand(0);
unsigned NumDstElts = DstTy->getNumElements();
if (NumDstElts < SrcTy->getNumElements()) {
assert(NumDstElts == 2 && "Unexpected vector size");
uint32_t ShuffleMask[2] = { 0, 1 };
Rep = Builder.CreateShuffleVector(Rep, UndefValue::get(SrcTy),
ShuffleMask);
}
bool SInt2Double = (StringRef::npos != Name.find("cvtdq2"));
bool UInt2Double = (StringRef::npos != Name.find("cvtudq2"));
if (SInt2Double)
Rep = Builder.CreateSIToFP(Rep, DstTy, "cvtdq2pd");
else if (UInt2Double)
Rep = Builder.CreateUIToFP(Rep, DstTy, "cvtudq2pd");
else
Rep = Builder.CreateFPExt(Rep, DstTy, "cvtps2pd");
if (CI->getNumArgOperands() == 3)
Rep = EmitX86Select(Builder, CI->getArgOperand(2), Rep,
CI->getArgOperand(1));
} else if (IsX86 && (Name.startswith("avx512.mask.loadu."))) {
Rep = UpgradeMaskedLoad(Builder, CI->getArgOperand(0),
CI->getArgOperand(1), CI->getArgOperand(2),
/*Aligned*/false);
} else if (IsX86 && (Name.startswith("avx512.mask.load."))) {
Rep = UpgradeMaskedLoad(Builder, CI->getArgOperand(0),
CI->getArgOperand(1),CI->getArgOperand(2),
/*Aligned*/true);
} else if (IsX86 && Name.startswith("xop.vpcom")) {
Intrinsic::ID intID;
if (Name.endswith("ub"))
intID = Intrinsic::x86_xop_vpcomub;
else if (Name.endswith("uw"))
intID = Intrinsic::x86_xop_vpcomuw;
else if (Name.endswith("ud"))
intID = Intrinsic::x86_xop_vpcomud;
else if (Name.endswith("uq"))
intID = Intrinsic::x86_xop_vpcomuq;
else if (Name.endswith("b"))
intID = Intrinsic::x86_xop_vpcomb;
else if (Name.endswith("w"))
intID = Intrinsic::x86_xop_vpcomw;
else if (Name.endswith("d"))
intID = Intrinsic::x86_xop_vpcomd;
else if (Name.endswith("q"))
intID = Intrinsic::x86_xop_vpcomq;
else
llvm_unreachable("Unknown suffix");
Name = Name.substr(9); // strip off "xop.vpcom"
unsigned Imm;
if (Name.startswith("lt"))
Imm = 0;
else if (Name.startswith("le"))
Imm = 1;
else if (Name.startswith("gt"))
Imm = 2;
else if (Name.startswith("ge"))
Imm = 3;
else if (Name.startswith("eq"))
Imm = 4;
else if (Name.startswith("ne"))
Imm = 5;
else if (Name.startswith("false"))
Imm = 6;
else if (Name.startswith("true"))
Imm = 7;
else
llvm_unreachable("Unknown condition");
Function *VPCOM = Intrinsic::getDeclaration(F->getParent(), intID);
Rep =
Builder.CreateCall(VPCOM, {CI->getArgOperand(0), CI->getArgOperand(1),
Builder.getInt8(Imm)});
} else if (IsX86 && Name.startswith("xop.vpcmov")) {
Value *Sel = CI->getArgOperand(2);
Value *NotSel = Builder.CreateNot(Sel);
Value *Sel0 = Builder.CreateAnd(CI->getArgOperand(0), Sel);
Value *Sel1 = Builder.CreateAnd(CI->getArgOperand(1), NotSel);
Rep = Builder.CreateOr(Sel0, Sel1);
} else if (IsX86 && Name == "sse42.crc32.64.8") {
Function *CRC32 = Intrinsic::getDeclaration(F->getParent(),
Intrinsic::x86_sse42_crc32_32_8);
Value *Trunc0 = Builder.CreateTrunc(CI->getArgOperand(0), Type::getInt32Ty(C));
Rep = Builder.CreateCall(CRC32, {Trunc0, CI->getArgOperand(1)});
Rep = Builder.CreateZExt(Rep, CI->getType(), "");
} else if (IsX86 && Name.startswith("avx.vbroadcast.s")) {
// Replace broadcasts with a series of insertelements.
Type *VecTy = CI->getType();
Type *EltTy = VecTy->getVectorElementType();
unsigned EltNum = VecTy->getVectorNumElements();
Value *Cast = Builder.CreateBitCast(CI->getArgOperand(0),
EltTy->getPointerTo());
Value *Load = Builder.CreateLoad(EltTy, Cast);
Type *I32Ty = Type::getInt32Ty(C);
Rep = UndefValue::get(VecTy);
for (unsigned I = 0; I < EltNum; ++I)
Rep = Builder.CreateInsertElement(Rep, Load,
ConstantInt::get(I32Ty, I));
} else if (IsX86 && (Name.startswith("sse41.pmovsx") ||
Name.startswith("sse41.pmovzx") ||
Name.startswith("avx2.pmovsx") ||
Name.startswith("avx2.pmovzx") ||
Name.startswith("avx512.mask.pmovsx") ||
Name.startswith("avx512.mask.pmovzx"))) {
VectorType *SrcTy = cast<VectorType>(CI->getArgOperand(0)->getType());
VectorType *DstTy = cast<VectorType>(CI->getType());
unsigned NumDstElts = DstTy->getNumElements();
// Extract a subvector of the first NumDstElts lanes and sign/zero extend.
SmallVector<uint32_t, 8> ShuffleMask(NumDstElts);
for (unsigned i = 0; i != NumDstElts; ++i)
ShuffleMask[i] = i;
Value *SV = Builder.CreateShuffleVector(
CI->getArgOperand(0), UndefValue::get(SrcTy), ShuffleMask);
bool DoSext = (StringRef::npos != Name.find("pmovsx"));
Rep = DoSext ? Builder.CreateSExt(SV, DstTy)
: Builder.CreateZExt(SV, DstTy);
// If there are 3 arguments, it's a masked intrinsic so we need a select.
if (CI->getNumArgOperands() == 3)
Rep = EmitX86Select(Builder, CI->getArgOperand(2), Rep,
CI->getArgOperand(1));
} else if (IsX86 && (Name.startswith("avx.vbroadcastf128") ||
Name == "avx2.vbroadcasti128")) {
// Replace vbroadcastf128/vbroadcasti128 with a vector load+shuffle.
Type *EltTy = CI->getType()->getVectorElementType();
unsigned NumSrcElts = 128 / EltTy->getPrimitiveSizeInBits();
Type *VT = VectorType::get(EltTy, NumSrcElts);
Value *Op = Builder.CreatePointerCast(CI->getArgOperand(0),
PointerType::getUnqual(VT));
Value *Load = Builder.CreateAlignedLoad(Op, 1);
if (NumSrcElts == 2)
Rep = Builder.CreateShuffleVector(Load, UndefValue::get(Load->getType()),
{ 0, 1, 0, 1 });
else
Rep = Builder.CreateShuffleVector(Load, UndefValue::get(Load->getType()),
{ 0, 1, 2, 3, 0, 1, 2, 3 });
} else if (IsX86 && (Name.startswith("avx2.pbroadcast") ||
Name.startswith("avx2.vbroadcast") ||
Name.startswith("avx512.pbroadcast") ||
Name.startswith("avx512.mask.broadcast.s"))) {
// Replace vp?broadcasts with a vector shuffle.
Value *Op = CI->getArgOperand(0);
unsigned NumElts = CI->getType()->getVectorNumElements();
Type *MaskTy = VectorType::get(Type::getInt32Ty(C), NumElts);
Rep = Builder.CreateShuffleVector(Op, UndefValue::get(Op->getType()),
Constant::getNullValue(MaskTy));
if (CI->getNumArgOperands() == 3)
Rep = EmitX86Select(Builder, CI->getArgOperand(2), Rep,
CI->getArgOperand(1));
} else if (IsX86 && Name.startswith("avx512.mask.palignr.")) {
Rep = UpgradeX86ALIGNIntrinsics(Builder, CI->getArgOperand(0),
CI->getArgOperand(1),
CI->getArgOperand(2),
CI->getArgOperand(3),
CI->getArgOperand(4),
false);
} else if (IsX86 && Name.startswith("avx512.mask.valign.")) {
Rep = UpgradeX86ALIGNIntrinsics(Builder, CI->getArgOperand(0),
CI->getArgOperand(1),
CI->getArgOperand(2),
CI->getArgOperand(3),
CI->getArgOperand(4),
true);
} else if (IsX86 && (Name == "sse2.psll.dq" ||
Name == "avx2.psll.dq")) {
// 128/256-bit shift left specified in bits.
unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
Rep = UpgradeX86PSLLDQIntrinsics(Builder, CI->getArgOperand(0),
Shift / 8); // Shift is in bits.
} else if (IsX86 && (Name == "sse2.psrl.dq" ||
Name == "avx2.psrl.dq")) {
// 128/256-bit shift right specified in bits.
unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
Rep = UpgradeX86PSRLDQIntrinsics(Builder, CI->getArgOperand(0),
Shift / 8); // Shift is in bits.
} else if (IsX86 && (Name == "sse2.psll.dq.bs" ||
Name == "avx2.psll.dq.bs" ||
Name == "avx512.psll.dq.512")) {
// 128/256/512-bit shift left specified in bytes.
unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
Rep = UpgradeX86PSLLDQIntrinsics(Builder, CI->getArgOperand(0), Shift);
} else if (IsX86 && (Name == "sse2.psrl.dq.bs" ||
Name == "avx2.psrl.dq.bs" ||
Name == "avx512.psrl.dq.512")) {
// 128/256/512-bit shift right specified in bytes.
unsigned Shift = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
Rep = UpgradeX86PSRLDQIntrinsics(Builder, CI->getArgOperand(0), Shift);
} else if (IsX86 && (Name == "sse41.pblendw" ||
Name.startswith("sse41.blendp") ||
Name.startswith("avx.blend.p") ||
Name == "avx2.pblendw" ||
Name.startswith("avx2.pblendd."))) {
Value *Op0 = CI->getArgOperand(0);
Value *Op1 = CI->getArgOperand(1);
unsigned Imm = cast <ConstantInt>(CI->getArgOperand(2))->getZExtValue();
VectorType *VecTy = cast<VectorType>(CI->getType());
unsigned NumElts = VecTy->getNumElements();
SmallVector<uint32_t, 16> Idxs(NumElts);
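// Each immediate bit selects between the two sources; for example, a
// v8i16 pblendw with Imm == 0xF0 yields Idxs == {0,1,2,3, 12,13,14,15},
// taking the low half from Op0 and the high half from Op1.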
for (unsigned i = 0; i != NumElts; ++i)
Idxs[i] = ((Imm >> (i%8)) & 1) ? i + NumElts : i;
Rep = Builder.CreateShuffleVector(Op0, Op1, Idxs);
} else if (IsX86 && (Name.startswith("avx.vinsertf128.") ||
Name == "avx2.vinserti128" ||
Name.startswith("avx512.mask.insert"))) {
Value *Op0 = CI->getArgOperand(0);
Value *Op1 = CI->getArgOperand(1);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(2))->getZExtValue();
unsigned DstNumElts = CI->getType()->getVectorNumElements();
unsigned SrcNumElts = Op1->getType()->getVectorNumElements();
unsigned Scale = DstNumElts / SrcNumElts;
// Mask off the high bits of the immediate value; hardware ignores those.
Imm = Imm % Scale;
// Extend the second operand into a vector the size of the destination.
Value *UndefV = UndefValue::get(Op1->getType());
SmallVector<uint32_t, 8> Idxs(DstNumElts);
for (unsigned i = 0; i != SrcNumElts; ++i)
Idxs[i] = i;
for (unsigned i = SrcNumElts; i != DstNumElts; ++i)
Idxs[i] = SrcNumElts;
Rep = Builder.CreateShuffleVector(Op1, UndefV, Idxs);
// Insert the second operand into the first operand.
// Note that there is no guarantee that instruction lowering will actually
// produce a vinsertf128 instruction for the created shuffles. In
// particular, the 0 immediate case involves no lane changes, so it can
// be handled as a blend.
// Example of shuffle mask for 32-bit elements:
// Imm = 1 <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
// Imm = 0 <i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7 >
// First fill with the identity mask.
for (unsigned i = 0; i != DstNumElts; ++i)
Idxs[i] = i;
// Then replace the elements where we need to insert.
for (unsigned i = 0; i != SrcNumElts; ++i)
Idxs[i + Imm * SrcNumElts] = i + DstNumElts;
Rep = Builder.CreateShuffleVector(Op0, Rep, Idxs);
// If the intrinsic has a mask operand, handle that.
if (CI->getNumArgOperands() == 5)
Rep = EmitX86Select(Builder, CI->getArgOperand(4), Rep,
CI->getArgOperand(3));
} else if (IsX86 && (Name.startswith("avx.vextractf128.") ||
Name == "avx2.vextracti128" ||
Name.startswith("avx512.mask.vextract"))) {
Value *Op0 = CI->getArgOperand(0);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
unsigned DstNumElts = CI->getType()->getVectorNumElements();
unsigned SrcNumElts = Op0->getType()->getVectorNumElements();
unsigned Scale = SrcNumElts / DstNumElts;
// Mask off the high bits of the immediate value; hardware ignores those.
Imm = Imm % Scale;
// Get indexes for the subvector of the input vector.
SmallVector<uint32_t, 8> Idxs(DstNumElts);
for (unsigned i = 0; i != DstNumElts; ++i) {
Idxs[i] = i + (Imm * DstNumElts);
}
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
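// e.g. avx2.vextracti128 with Imm = 1 on a <4 x i64> source extracts the
// high half: mask <2,3>.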
// If the intrinsic has a mask operand, handle that.
if (CI->getNumArgOperands() == 4)
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (!IsX86 && Name == "stackprotectorcheck") {
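// This intrinsic no longer exists; leaving Rep null simply deletes the call.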
Rep = nullptr;
} else if (IsX86 && (Name.startswith("avx512.mask.perm.df.") ||
Name.startswith("avx512.mask.perm.di."))) {
Value *Op0 = CI->getArgOperand(0);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
VectorType *VecTy = cast<VectorType>(CI->getType());
unsigned NumElts = VecTy->getNumElements();
SmallVector<uint32_t, 8> Idxs(NumElts);
for (unsigned i = 0; i != NumElts; ++i)
Idxs[i] = (i & ~0x3) + ((Imm >> (2 * (i & 0x3))) & 3);
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
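// e.g. avx512.mask.perm.di.256 (vpermq) with Imm = 0x1B reverses the
// vector: mask <3,2,1,0>.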
if (CI->getNumArgOperands() == 4)
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name.startswith("avx.vpermil.") ||
Name == "sse2.pshuf.d" ||
Name.startswith("avx512.mask.vpermil.p") ||
Name.startswith("avx512.mask.pshuf.d."))) {
Value *Op0 = CI->getArgOperand(0);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
VectorType *VecTy = cast<VectorType>(CI->getType());
unsigned NumElts = VecTy->getNumElements();
// Calculate the size of each index in the immediate.
unsigned IdxSize = 64 / VecTy->getScalarSizeInBits();
unsigned IdxMask = ((1 << IdxSize) - 1);
SmallVector<uint32_t, 8> Idxs(NumElts);
// Look up the bits for this element, wrapping around the immediate every
// 8 bits. Elements are grouped into sets of 2 or 4 elements, so we need
// to offset by the first index of each group.
for (unsigned i = 0; i != NumElts; ++i)
Idxs[i] = ((Imm >> ((i * IdxSize) % 8)) & IdxMask) | (i & ~IdxMask);
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
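// e.g. sse2.pshuf.d with Imm = 0x1B on <4 x i32> yields mask <3,2,1,0>.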
if (CI->getNumArgOperands() == 4)
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name == "sse2.pshufl.w" ||
Name.startswith("avx512.mask.pshufl.w."))) {
Value *Op0 = CI->getArgOperand(0);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
unsigned NumElts = CI->getType()->getVectorNumElements();
SmallVector<uint32_t, 16> Idxs(NumElts);
for (unsigned l = 0; l != NumElts; l += 8) {
for (unsigned i = 0; i != 4; ++i)
Idxs[i + l] = ((Imm >> (2 * i)) & 0x3) + l;
for (unsigned i = 4; i != 8; ++i)
Idxs[i + l] = i + l;
}
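// e.g. Imm = 0x1B reverses the low four words of each 128-bit lane:
// mask <3,2,1,0,4,5,6,7> for <8 x i16>.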
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
if (CI->getNumArgOperands() == 4)
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name == "sse2.pshufh.w" ||
Name.startswith("avx512.mask.pshufh.w."))) {
Value *Op0 = CI->getArgOperand(0);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(1))->getZExtValue();
unsigned NumElts = CI->getType()->getVectorNumElements();
SmallVector<uint32_t, 16> Idxs(NumElts);
for (unsigned l = 0; l != NumElts; l += 8) {
for (unsigned i = 0; i != 4; ++i)
Idxs[i + l] = i + l;
for (unsigned i = 0; i != 4; ++i)
Idxs[i + l + 4] = ((Imm >> (2 * i)) & 0x3) + 4 + l;
}
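// e.g. Imm = 0x1B reverses the high four words of each 128-bit lane:
// mask <0,1,2,3,7,6,5,4> for <8 x i16>.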
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
if (CI->getNumArgOperands() == 4)
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.shuf.p")) {
Value *Op0 = CI->getArgOperand(0);
Value *Op1 = CI->getArgOperand(1);
unsigned Imm = cast<ConstantInt>(CI->getArgOperand(2))->getZExtValue();
unsigned NumElts = CI->getType()->getVectorNumElements();
unsigned NumLaneElts = 128/CI->getType()->getScalarSizeInBits();
unsigned HalfLaneElts = NumLaneElts / 2;
SmallVector<uint32_t, 16> Idxs(NumElts);
for (unsigned i = 0; i != NumElts; ++i) {
// Base index is the starting element of the lane.
Idxs[i] = i - (i % NumLaneElts);
// If we are half way through the lane switch to the other source.
if ((i % NumLaneElts) >= HalfLaneElts)
Idxs[i] += NumElts;
// Now select the specific element by adding HalfLaneElts bits from
// the immediate, wrapping around the immediate every 8 bits.
Idxs[i] += (Imm >> ((i * HalfLaneElts) % 8)) & ((1 << HalfLaneElts) - 1);
}
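// e.g. a 128-bit shufps with Imm = 0x4E gives mask <2,3,4,5>: the high
// half of Op0 followed by the low half of Op1.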
Rep = Builder.CreateShuffleVector(Op0, Op1, Idxs);
Rep = EmitX86Select(Builder, CI->getArgOperand(4), Rep,
CI->getArgOperand(3));
} else if (IsX86 && (Name.startswith("avx512.mask.movddup") ||
Name.startswith("avx512.mask.movshdup") ||
Name.startswith("avx512.mask.movsldup"))) {
Value *Op0 = CI->getArgOperand(0);
unsigned NumElts = CI->getType()->getVectorNumElements();
unsigned NumLaneElts = 128/CI->getType()->getScalarSizeInBits();
unsigned Offset = 0;
if (Name.startswith("avx512.mask.movshdup."))
Offset = 1;
SmallVector<uint32_t, 16> Idxs(NumElts);
for (unsigned l = 0; l != NumElts; l += NumLaneElts)
for (unsigned i = 0; i != NumLaneElts; i += 2) {
Idxs[i + l + 0] = i + l + Offset;
Idxs[i + l + 1] = i + l + Offset;
}
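// e.g. on <4 x float>, movsldup yields mask <0,0,2,2> and movshdup yields
// <1,1,3,3>; movddup on <2 x double> yields <0,0>.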
Rep = Builder.CreateShuffleVector(Op0, Op0, Idxs);
Rep = EmitX86Select(Builder, CI->getArgOperand(2), Rep,
CI->getArgOperand(1));
} else if (IsX86 && (Name.startswith("avx512.mask.punpckl") ||
Name.startswith("avx512.mask.unpckl."))) {
Value *Op0 = CI->getArgOperand(0);
Value *Op1 = CI->getArgOperand(1);
int NumElts = CI->getType()->getVectorNumElements();
int NumLaneElts = 128/CI->getType()->getScalarSizeInBits();
SmallVector<uint32_t, 64> Idxs(NumElts);
for (int l = 0; l != NumElts; l += NumLaneElts)
for (int i = 0; i != NumLaneElts; ++i)
Idxs[i + l] = l + (i / 2) + NumElts * (i % 2);
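// e.g. punpckldq on <4 x i32> interleaves the low halves: mask <0,4,1,5>.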
Rep = Builder.CreateShuffleVector(Op0, Op1, Idxs);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name.startswith("avx512.mask.punpckh") ||
Name.startswith("avx512.mask.unpckh."))) {
Value *Op0 = CI->getArgOperand(0);
Value *Op1 = CI->getArgOperand(1);
int NumElts = CI->getType()->getVectorNumElements();
int NumLaneElts = 128/CI->getType()->getScalarSizeInBits();
SmallVector<uint32_t, 64> Idxs(NumElts);
for (int l = 0; l != NumElts; l += NumLaneElts)
for (int i = 0; i != NumLaneElts; ++i)
Idxs[i + l] = (NumLaneElts / 2) + l + (i / 2) + NumElts * (i % 2);
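// e.g. punpckhdq on <4 x i32> interleaves the high halves: mask <2,6,3,7>.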
Rep = Builder.CreateShuffleVector(Op0, Op1, Idxs);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pand.")) {
Rep = Builder.CreateAnd(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pandn.")) {
Rep = Builder.CreateAnd(Builder.CreateNot(CI->getArgOperand(0)),
CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.por.")) {
Rep = Builder.CreateOr(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pxor.")) {
Rep = Builder.CreateXor(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.and.")) {
VectorType *FTy = cast<VectorType>(CI->getType());
VectorType *ITy = VectorType::getInteger(FTy);
Rep = Builder.CreateAnd(Builder.CreateBitCast(CI->getArgOperand(0), ITy),
Builder.CreateBitCast(CI->getArgOperand(1), ITy));
Rep = Builder.CreateBitCast(Rep, FTy);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.andn.")) {
VectorType *FTy = cast<VectorType>(CI->getType());
VectorType *ITy = VectorType::getInteger(FTy);
Rep = Builder.CreateNot(Builder.CreateBitCast(CI->getArgOperand(0), ITy));
Rep = Builder.CreateAnd(Rep,
Builder.CreateBitCast(CI->getArgOperand(1), ITy));
Rep = Builder.CreateBitCast(Rep, FTy);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.or.")) {
VectorType *FTy = cast<VectorType>(CI->getType());
VectorType *ITy = VectorType::getInteger(FTy);
Rep = Builder.CreateOr(Builder.CreateBitCast(CI->getArgOperand(0), ITy),
Builder.CreateBitCast(CI->getArgOperand(1), ITy));
Rep = Builder.CreateBitCast(Rep, FTy);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.xor.")) {
VectorType *FTy = cast<VectorType>(CI->getType());
VectorType *ITy = VectorType::getInteger(FTy);
Rep = Builder.CreateXor(Builder.CreateBitCast(CI->getArgOperand(0), ITy),
Builder.CreateBitCast(CI->getArgOperand(1), ITy));
Rep = Builder.CreateBitCast(Rep, FTy);
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.padd.")) {
Rep = Builder.CreateAdd(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.psub.")) {
Rep = Builder.CreateSub(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pmull.")) {
Rep = Builder.CreateMul(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name.startswith("avx512.mask.add.p"))) {
Rep = Builder.CreateFAdd(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.div.p")) {
Rep = Builder.CreateFDiv(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.mul.p")) {
Rep = Builder.CreateFMul(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.sub.p")) {
Rep = Builder.CreateFSub(CI->getArgOperand(0), CI->getArgOperand(1));
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.lzcnt.")) {
Rep = Builder.CreateCall(Intrinsic::getDeclaration(F->getParent(),
Intrinsic::ctlz,
CI->getType()),
{ CI->getArgOperand(0), Builder.getInt1(false) });
Rep = EmitX86Select(Builder, CI->getArgOperand(2), Rep,
CI->getArgOperand(1));
} else if (IsX86 && (Name.startswith("avx512.mask.max.p") ||
Name.startswith("avx512.mask.min.p"))) {
bool IsMin = Name[13] == 'i';
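// Name[13] is 'i' for "avx512.mask.min.*" and 'a' for "avx512.mask.max.*".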
VectorType *VecTy = cast<VectorType>(CI->getType());
unsigned VecWidth = VecTy->getPrimitiveSizeInBits();
unsigned EltWidth = VecTy->getScalarSizeInBits();
Intrinsic::ID IID;
if (!IsMin && VecWidth == 128 && EltWidth == 32)
IID = Intrinsic::x86_sse_max_ps;
else if (!IsMin && VecWidth == 128 && EltWidth == 64)
IID = Intrinsic::x86_sse2_max_pd;
else if (!IsMin && VecWidth == 256 && EltWidth == 32)
IID = Intrinsic::x86_avx_max_ps_256;
else if (!IsMin && VecWidth == 256 && EltWidth == 64)
IID = Intrinsic::x86_avx_max_pd_256;
else if (IsMin && VecWidth == 128 && EltWidth == 32)
IID = Intrinsic::x86_sse_min_ps;
else if (IsMin && VecWidth == 128 && EltWidth == 64)
IID = Intrinsic::x86_sse2_min_pd;
else if (IsMin && VecWidth == 256 && EltWidth == 32)
IID = Intrinsic::x86_avx_min_ps_256;
else if (IsMin && VecWidth == 256 && EltWidth == 64)
IID = Intrinsic::x86_avx_min_pd_256;
else
llvm_unreachable("Unexpected intrinsic");
Rep = Builder.CreateCall(Intrinsic::getDeclaration(F->getParent(), IID),
{ CI->getArgOperand(0), CI->getArgOperand(1) });
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pshuf.b.")) {
VectorType *VecTy = cast<VectorType>(CI->getType());
Intrinsic::ID IID;
if (VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_ssse3_pshuf_b_128;
else if (VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_pshuf_b;
else if (VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_pshuf_b_512;
else
llvm_unreachable("Unexpected intrinsic");
Rep = Builder.CreateCall(Intrinsic::getDeclaration(F->getParent(), IID),
{ CI->getArgOperand(0), CI->getArgOperand(1) });
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && (Name.startswith("avx512.mask.pmul.dq.") ||
Name.startswith("avx512.mask.pmulu.dq."))) {
bool IsUnsigned = Name[16] == 'u';
VectorType *VecTy = cast<VectorType>(CI->getType());
Intrinsic::ID IID;
if (!IsUnsigned && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse41_pmuldq;
else if (!IsUnsigned && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_pmul_dq;
else if (!IsUnsigned && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_pmul_dq_512;
else if (IsUnsigned && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse2_pmulu_dq;
else if (IsUnsigned && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_pmulu_dq;
else if (IsUnsigned && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_pmulu_dq_512;
else
llvm_unreachable("Unexpected intrinsic");
Rep = Builder.CreateCall(Intrinsic::getDeclaration(F->getParent(), IID),
{ CI->getArgOperand(0), CI->getArgOperand(1) });
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.pack")) {
bool IsUnsigned = Name[16] == 'u';
bool IsDW = Name[18] == 'd';
VectorType *VecTy = cast<VectorType>(CI->getType());
Intrinsic::ID IID;
if (!IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse2_packsswb_128;
else if (!IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_packsswb;
else if (!IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_packsswb_512;
else if (!IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse2_packssdw_128;
else if (!IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_packssdw;
else if (!IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_packssdw_512;
else if (IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse2_packuswb_128;
else if (IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_packuswb;
else if (IsUnsigned && !IsDW && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_packuswb_512;
else if (IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 128)
IID = Intrinsic::x86_sse41_packusdw;
else if (IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 256)
IID = Intrinsic::x86_avx2_packusdw;
else if (IsUnsigned && IsDW && VecTy->getPrimitiveSizeInBits() == 512)
IID = Intrinsic::x86_avx512_packusdw_512;
else
llvm_unreachable("Unexpected intrinsic");
Rep = Builder.CreateCall(Intrinsic::getDeclaration(F->getParent(), IID),
{ CI->getArgOperand(0), CI->getArgOperand(1) });
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.startswith("avx512.mask.psll")) {
bool IsImmediate = Name[16] == 'i' ||
(Name.size() > 18 && Name[18] == 'i');
bool IsVariable = Name[16] == 'v';
char Size = Name[16] == '.' ? Name[17] :
Name[17] == '.' ? Name[18] :
Name[18] == '.' ? Name[19] :
Name[20];
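// e.g. for "avx512.mask.psll.di.512", Name[16] is '.' so Size is Name[17]
// ('d'), and Name[18] == 'i' marks the immediate form; "psllv" names set
// IsVariable via Name[16] == 'v'.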
Intrinsic::ID IID;
if (IsVariable && Name[17] != '.') {
if (Size == 'd' && Name[17] == '2') // avx512.mask.psllv2.di
IID = Intrinsic::x86_avx2_psllv_q;
else if (Size == 'd' && Name[17] == '4') // avx512.mask.psllv4.di
IID = Intrinsic::x86_avx2_psllv_q_256;
else if (Size == 's' && Name[17] == '4') // avx512.mask.psllv4.si
IID = Intrinsic::x86_avx2_psllv_d;
else if (Size == 's' && Name[17] == '8') // avx512.mask.psllv8.si
IID = Intrinsic::x86_avx2_psllv_d_256;
else if (Size == 'h' && Name[17] == '8') // avx512.mask.psllv8.hi
IID = Intrinsic::x86_avx512_psllv_w_128;
else if (Size == 'h' && Name[17] == '1') // avx512.mask.psllv16.hi
IID = Intrinsic::x86_avx512_psllv_w_256;
else if (Name[17] == '3' && Name[18] == '2') // avx512.mask.psllv32hi
IID = Intrinsic::x86_avx512_psllv_w_512;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".128")) {
if (Size == 'd') // avx512.mask.psll.d.128, avx512.mask.psll.di.128
IID = IsImmediate ? Intrinsic::x86_sse2_pslli_d
: Intrinsic::x86_sse2_psll_d;
else if (Size == 'q') // avx512.mask.psll.q.128, avx512.mask.psll.qi.128
IID = IsImmediate ? Intrinsic::x86_sse2_pslli_q
: Intrinsic::x86_sse2_psll_q;
else if (Size == 'w') // avx512.mask.psll.w.128, avx512.mask.psll.wi.128
IID = IsImmediate ? Intrinsic::x86_sse2_pslli_w
: Intrinsic::x86_sse2_psll_w;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".256")) {
if (Size == 'd') // avx512.mask.psll.d.256, avx512.mask.psll.di.256
IID = IsImmediate ? Intrinsic::x86_avx2_pslli_d
: Intrinsic::x86_avx2_psll_d;
else if (Size == 'q') // avx512.mask.psll.q.256, avx512.mask.psll.qi.256
IID = IsImmediate ? Intrinsic::x86_avx2_pslli_q
: Intrinsic::x86_avx2_psll_q;
else if (Size == 'w') // avx512.mask.psll.w.256, avx512.mask.psll.wi.256
IID = IsImmediate ? Intrinsic::x86_avx2_pslli_w
: Intrinsic::x86_avx2_psll_w;
else
llvm_unreachable("Unexpected size");
} else {
if (Size == 'd') // psll.di.512, pslli.d, psll.d, psllv.d.512
IID = IsImmediate ? Intrinsic::x86_avx512_pslli_d_512 :
IsVariable ? Intrinsic::x86_avx512_psllv_d_512 :
Intrinsic::x86_avx512_psll_d_512;
else if (Size == 'q') // psll.qi.512, pslli.q, psll.q, psllv.q.512
IID = IsImmediate ? Intrinsic::x86_avx512_pslli_q_512 :
IsVariable ? Intrinsic::x86_avx512_psllv_q_512 :
Intrinsic::x86_avx512_psll_q_512;
else if (Size == 'w') // psll.wi.512, pslli.w, psll.w
IID = IsImmediate ? Intrinsic::x86_avx512_pslli_w_512
: Intrinsic::x86_avx512_psll_w_512;
else
llvm_unreachable("Unexpected size");
}
Rep = UpgradeX86MaskedShift(Builder, *CI, IID);
} else if (IsX86 && Name.startswith("avx512.mask.psrl")) {
bool IsImmediate = Name[16] == 'i' ||
(Name.size() > 18 && Name[18] == 'i');
bool IsVariable = Name[16] == 'v';
char Size = Name[16] == '.' ? Name[17] :
Name[17] == '.' ? Name[18] :
Name[18] == '.' ? Name[19] :
Name[20];
Intrinsic::ID IID;
if (IsVariable && Name[17] != '.') {
if (Size == 'd' && Name[17] == '2') // avx512.mask.psrlv2.di
IID = Intrinsic::x86_avx2_psrlv_q;
else if (Size == 'd' && Name[17] == '4') // avx512.mask.psrlv4.di
IID = Intrinsic::x86_avx2_psrlv_q_256;
else if (Size == 's' && Name[17] == '4') // avx512.mask.psrlv4.si
IID = Intrinsic::x86_avx2_psrlv_d;
else if (Size == 's' && Name[17] == '8') // avx512.mask.psrlv8.si
IID = Intrinsic::x86_avx2_psrlv_d_256;
else if (Size == 'h' && Name[17] == '8') // avx512.mask.psrlv8.hi
IID = Intrinsic::x86_avx512_psrlv_w_128;
else if (Size == 'h' && Name[17] == '1') // avx512.mask.psrlv16.hi
IID = Intrinsic::x86_avx512_psrlv_w_256;
else if (Name[17] == '3' && Name[18] == '2') // avx512.mask.psrlv32hi
IID = Intrinsic::x86_avx512_psrlv_w_512;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".128")) {
if (Size == 'd') // avx512.mask.psrl.d.128, avx512.mask.psrl.di.128
IID = IsImmediate ? Intrinsic::x86_sse2_psrli_d
: Intrinsic::x86_sse2_psrl_d;
else if (Size == 'q') // avx512.mask.psrl.q.128, avx512.mask.psrl.qi.128
IID = IsImmediate ? Intrinsic::x86_sse2_psrli_q
: Intrinsic::x86_sse2_psrl_q;
else if (Size == 'w') // avx512.mask.psrl.w.128, avx512.mask.psrl.wi.128
IID = IsImmediate ? Intrinsic::x86_sse2_psrli_w
: Intrinsic::x86_sse2_psrl_w;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".256")) {
if (Size == 'd') // avx512.mask.psrl.d.256, avx512.mask.psrl.di.256
IID = IsImmediate ? Intrinsic::x86_avx2_psrli_d
: Intrinsic::x86_avx2_psrl_d;
else if (Size == 'q') // avx512.mask.psrl.q.256, avx512.mask.psrl.qi.256
IID = IsImmediate ? Intrinsic::x86_avx2_psrli_q
: Intrinsic::x86_avx2_psrl_q;
else if (Size == 'w') // avx512.mask.psrl.w.256, avx512.mask.psrl.wi.256
IID = IsImmediate ? Intrinsic::x86_avx2_psrli_w
: Intrinsic::x86_avx2_psrl_w;
else
llvm_unreachable("Unexpected size");
} else {
if (Size == 'd') // psrl.di.512, psrli.d, psrl.d, psrlv.d.512
IID = IsImmediate ? Intrinsic::x86_avx512_psrli_d_512 :
IsVariable ? Intrinsic::x86_avx512_psrlv_d_512 :
Intrinsic::x86_avx512_psrl_d_512;
else if (Size == 'q') // psrl.qi.512, psrli.q, psrl.q, psrlv.q.512
IID = IsImmediate ? Intrinsic::x86_avx512_psrli_q_512 :
IsVariable ? Intrinsic::x86_avx512_psrlv_q_512 :
Intrinsic::x86_avx512_psrl_q_512;
else if (Size == 'w') // psrl.wi.512, psrli.w, psrl.w
IID = IsImmediate ? Intrinsic::x86_avx512_psrli_w_512
: Intrinsic::x86_avx512_psrl_w_512;
else
llvm_unreachable("Unexpected size");
}
Rep = UpgradeX86MaskedShift(Builder, *CI, IID);
} else if (IsX86 && Name.startswith("avx512.mask.psra")) {
bool IsImmediate = Name[16] == 'i' ||
(Name.size() > 18 && Name[18] == 'i');
bool IsVariable = Name[16] == 'v';
char Size = Name[16] == '.' ? Name[17] :
Name[17] == '.' ? Name[18] :
Name[18] == '.' ? Name[19] :
Name[20];
Intrinsic::ID IID;
if (IsVariable && Name[17] != '.') {
if (Size == 's' && Name[17] == '4') // avx512.mask.psrav4.si
IID = Intrinsic::x86_avx2_psrav_d;
else if (Size == 's' && Name[17] == '8') // avx512.mask.psrav8.si
IID = Intrinsic::x86_avx2_psrav_d_256;
else if (Size == 'h' && Name[17] == '8') // avx512.mask.psrav8.hi
IID = Intrinsic::x86_avx512_psrav_w_128;
else if (Size == 'h' && Name[17] == '1') // avx512.mask.psrav16.hi
IID = Intrinsic::x86_avx512_psrav_w_256;
else if (Name[17] == '3' && Name[18] == '2') // avx512.mask.psrav32hi
IID = Intrinsic::x86_avx512_psrav_w_512;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".128")) {
if (Size == 'd') // avx512.mask.psra.d.128, avx512.mask.psra.di.128
IID = IsImmediate ? Intrinsic::x86_sse2_psrai_d
: Intrinsic::x86_sse2_psra_d;
else if (Size == 'q') // avx512.mask.psra.q.128, avx512.mask.psra.qi.128
IID = IsImmediate ? Intrinsic::x86_avx512_psrai_q_128 :
IsVariable ? Intrinsic::x86_avx512_psrav_q_128 :
Intrinsic::x86_avx512_psra_q_128;
else if (Size == 'w') // avx512.mask.psra.w.128, avx512.mask.psra.wi.128
IID = IsImmediate ? Intrinsic::x86_sse2_psrai_w
: Intrinsic::x86_sse2_psra_w;
else
llvm_unreachable("Unexpected size");
} else if (Name.endswith(".256")) {
if (Size == 'd') // avx512.mask.psra.d.256, avx512.mask.psra.di.256
IID = IsImmediate ? Intrinsic::x86_avx2_psrai_d
: Intrinsic::x86_avx2_psra_d;
else if (Size == 'q') // avx512.mask.psra.q.256, avx512.mask.psra.qi.256
IID = IsImmediate ? Intrinsic::x86_avx512_psrai_q_256 :
IsVariable ? Intrinsic::x86_avx512_psrav_q_256 :
Intrinsic::x86_avx512_psra_q_256;
else if (Size == 'w') // avx512.mask.psra.w.256, avx512.mask.psra.wi.256
IID = IsImmediate ? Intrinsic::x86_avx2_psrai_w
: Intrinsic::x86_avx2_psra_w;
else
llvm_unreachable("Unexpected size");
} else {
if (Size == 'd') // psra.di.512, psrai.d, psra.d, psrav.d.512
IID = IsImmediate ? Intrinsic::x86_avx512_psrai_d_512 :
IsVariable ? Intrinsic::x86_avx512_psrav_d_512 :
Intrinsic::x86_avx512_psra_d_512;
else if (Size == 'q') // psra.qi.512, psrai.q, psra.q, psrav.q.512
IID = IsImmediate ? Intrinsic::x86_avx512_psrai_q_512 :
IsVariable ? Intrinsic::x86_avx512_psrav_q_512 :
Intrinsic::x86_avx512_psra_q_512;
else if (Size == 'w') // psra.wi.512, psrai.w, psra.w
IID = IsImmediate ? Intrinsic::x86_avx512_psrai_w_512
: Intrinsic::x86_avx512_psra_w_512;
else
llvm_unreachable("Unexpected size");
}
Rep = UpgradeX86MaskedShift(Builder, *CI, IID);
} else if (IsX86 && Name.startswith("avx512.mask.move.s")) {
Rep = upgradeMaskedMove(Builder, *CI);
} else if (IsX86 && Name.startswith("avx512.cvtmask2")) {
Rep = UpgradeMaskToInt(Builder, *CI);
} else if (IsX86 && Name.startswith("avx512.mask.vpermilvar.")) {
Intrinsic::ID IID;
if (Name.endswith("ps.128"))
IID = Intrinsic::x86_avx_vpermilvar_ps;
else if (Name.endswith("pd.128"))
IID = Intrinsic::x86_avx_vpermilvar_pd;
else if (Name.endswith("ps.256"))
IID = Intrinsic::x86_avx_vpermilvar_ps_256;
else if (Name.endswith("pd.256"))
IID = Intrinsic::x86_avx_vpermilvar_pd_256;
else if (Name.endswith("ps.512"))
IID = Intrinsic::x86_avx512_vpermilvar_ps_512;
else if (Name.endswith("pd.512"))
IID = Intrinsic::x86_avx512_vpermilvar_pd_512;
else
llvm_unreachable("Unexpected vpermilvar intrinsic");
Function *Intrin = Intrinsic::getDeclaration(F->getParent(), IID);
Rep = Builder.CreateCall(Intrin,
{ CI->getArgOperand(0), CI->getArgOperand(1) });
Rep = EmitX86Select(Builder, CI->getArgOperand(3), Rep,
CI->getArgOperand(2));
} else if (IsX86 && Name.endswith(".movntdqa")) {
Module *M = F->getParent();
MDNode *Node = MDNode::get(
C, ConstantAsMetadata::get(ConstantInt::get(Type::getInt32Ty(C), 1)));
Value *Ptr = CI->getArgOperand(0);
VectorType *VTy = cast<VectorType>(CI->getType());
// Convert the type of the pointer to a pointer to the stored type.
Value *BC =
Builder.CreateBitCast(Ptr, PointerType::getUnqual(VTy), "cast");
LoadInst *LI = Builder.CreateAlignedLoad(BC, VTy->getBitWidth() / 8);
LI->setMetadata(M->getMDKindID("nontemporal"), Node);
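// The resulting load carries !nontemporal metadata (!{i32 1}), allowing
// codegen to select a non-temporal load such as movntdqa.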
Rep = LI;
} else if (IsNVVM && (Name == "abs.i" || Name == "abs.ll")) {
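// Upgrade to a compare plus select: abs(x) == (x s>= 0) ? x : -x.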
Value *Arg = CI->getArgOperand(0);
Value *Neg = Builder.CreateNeg(Arg, "neg");
Value *Cmp = Builder.CreateICmpSGE(
Arg, llvm::Constant::getNullValue(Arg->getType()), "abs.cond");
Rep = Builder.CreateSelect(Cmp, Arg, Neg, "abs");
} else if (IsNVVM && (Name == "max.i" || Name == "max.ll" ||
Name == "max.ui" || Name == "max.ull")) {
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
Value *Cmp = Name.endswith(".ui") || Name.endswith(".ull")
? Builder.CreateICmpUGE(Arg0, Arg1, "max.cond")
: Builder.CreateICmpSGE(Arg0, Arg1, "max.cond");
Rep = Builder.CreateSelect(Cmp, Arg0, Arg1, "max");
} else if (IsNVVM && (Name == "min.i" || Name == "min.ll" ||
Name == "min.ui" || Name == "min.ull")) {
Value *Arg0 = CI->getArgOperand(0);
Value *Arg1 = CI->getArgOperand(1);
Value *Cmp = Name.endswith(".ui") || Name.endswith(".ull")
? Builder.CreateICmpULE(Arg0, Arg1, "min.cond")
: Builder.CreateICmpSLE(Arg0, Arg1, "min.cond");
Rep = Builder.CreateSelect(Cmp, Arg0, Arg1, "min");
} else if (IsNVVM && Name == "clz.ll") {
// llvm.nvvm.clz.ll returns an i32, but llvm.ctlz.i64 returns an i64.
Value *Arg = CI->getArgOperand(0);
Value *Ctlz = Builder.CreateCall(
Intrinsic::getDeclaration(F->getParent(), Intrinsic::ctlz,
{Arg->getType()}),
{Arg, Builder.getFalse()}, "ctlz");
Rep = Builder.CreateTrunc(Ctlz, Builder.getInt32Ty(), "ctlz.trunc");
} else if (IsNVVM && Name == "popc.ll") {
// llvm.nvvm.popc.ll returns an i32, but llvm.ctpop.i64 returns an i64.
Value *Arg = CI->getArgOperand(0);
Value *Popc = Builder.CreateCall(
Intrinsic::getDeclaration(F->getParent(), Intrinsic::ctpop,
{Arg->getType()}),
Arg, "ctpop");
Rep = Builder.CreateTrunc(Popc, Builder.getInt32Ty(), "ctpop.trunc");
} else if (IsNVVM && Name == "h2f") {
Rep = Builder.CreateCall(Intrinsic::getDeclaration(
F->getParent(), Intrinsic::convert_from_fp16,
{Builder.getFloatTy()}),
CI->getArgOperand(0), "h2f");
} else {
llvm_unreachable("Unknown function for CallInst upgrade.");
}
if (Rep)
CI->replaceAllUsesWith(Rep);
CI->eraseFromParent();
return;
}
CallInst *NewCall = nullptr;
switch (NewFn->getIntrinsicID()) {
default: {
// Handle a generic mangling change, but nothing else
assert(
(CI->getCalledFunction()->getName() != NewFn->getName()) &&
"Unknown function for CallInst upgrade and isn't just a name change");
CI->setCalledFunction(NewFn);
return;
}
case Intrinsic::arm_neon_vld1:
case Intrinsic::arm_neon_vld2:
case Intrinsic::arm_neon_vld3:
case Intrinsic::arm_neon_vld4:
case Intrinsic::arm_neon_vld2lane:
case Intrinsic::arm_neon_vld3lane:
case Intrinsic::arm_neon_vld4lane:
case Intrinsic::arm_neon_vst1:
case Intrinsic::arm_neon_vst2:
case Intrinsic::arm_neon_vst3:
case Intrinsic::arm_neon_vst4:
case Intrinsic::arm_neon_vst2lane:
case Intrinsic::arm_neon_vst3lane:
case Intrinsic::arm_neon_vst4lane: {
SmallVector<Value *, 4> Args(CI->arg_operands().begin(),
CI->arg_operands().end());
NewCall = Builder.CreateCall(NewFn, Args);
break;
}
case Intrinsic::bitreverse:
NewCall = Builder.CreateCall(NewFn, {CI->getArgOperand(0)});
break;
case Intrinsic::ctlz:
case Intrinsic::cttz:
assert(CI->getNumArgOperands() == 1 &&
"Mismatch between function args and call args");
NewCall =
Builder.CreateCall(NewFn, {CI->getArgOperand(0), Builder.getFalse()});
break;
case Intrinsic::objectsize: {
Value *NullIsUnknownSize = CI->getNumArgOperands() == 2
? Builder.getFalse()
: CI->getArgOperand(2);
NewCall = Builder.CreateCall(
NewFn, {CI->getArgOperand(0), CI->getArgOperand(1), NullIsUnknownSize});
break;
}
case Intrinsic::ctpop:
NewCall = Builder.CreateCall(NewFn, {CI->getArgOperand(0)});
break;
case Intrinsic::convert_from_fp16:
NewCall = Builder.CreateCall(NewFn, {CI->getArgOperand(0)});
break;
case Intrinsic::x86_xop_vfrcz_ss:
case Intrinsic::x86_xop_vfrcz_sd:
NewCall = Builder.CreateCall(NewFn, {CI->getArgOperand(1)});
break;
case Intrinsic::x86_xop_vpermil2pd:
case Intrinsic::x86_xop_vpermil2ps:
case Intrinsic::x86_xop_vpermil2pd_256:
case Intrinsic::x86_xop_vpermil2ps_256: {
SmallVector<Value *, 4> Args(CI->arg_operands().begin(),
CI->arg_operands().end());
VectorType *FltIdxTy = cast<VectorType>(Args[2]->getType());
VectorType *IntIdxTy = VectorType::getInteger(FltIdxTy);
Args[2] = Builder.CreateBitCast(Args[2], IntIdxTy);
NewCall = Builder.CreateCall(NewFn, Args);
break;
}
case Intrinsic::x86_sse41_ptestc:
case Intrinsic::x86_sse41_ptestz:
case Intrinsic::x86_sse41_ptestnzc: {
// The arguments for these intrinsics used to be v4f32 but were changed
// to v2i64. This is purely a no-op, since those are bitwise intrinsics.
// So the only thing required is a bitcast for both arguments.
// First, check the arguments have the old type.
Value *Arg0 = CI->getArgOperand(0);
if (Arg0->getType() != VectorType::get(Type::getFloatTy(C), 4))
return;
// Old intrinsic, add bitcasts
Value *Arg1 = CI->getArgOperand(1);
Type *NewVecTy = VectorType::get(Type::getInt64Ty(C), 2);
Value *BC0 = Builder.CreateBitCast(Arg0, NewVecTy, "cast");
Value *BC1 = Builder.CreateBitCast(Arg1, NewVecTy, "cast");
NewCall = Builder.CreateCall(NewFn, {BC0, BC1});
break;
}
case Intrinsic::x86_sse41_insertps:
case Intrinsic::x86_sse41_dppd:
case Intrinsic::x86_sse41_dpps:
case Intrinsic::x86_sse41_mpsadbw:
case Intrinsic::x86_avx_dp_ps_256:
case Intrinsic::x86_avx2_mpsadbw: {
// Need to truncate the last argument from i32 to i8 -- this argument models
// an inherently 8-bit immediate operand to these x86 instructions.
SmallVector<Value *, 4> Args(CI->arg_operands().begin(),
CI->arg_operands().end());
// Replace the last argument with a trunc.
Args.back() = Builder.CreateTrunc(Args.back(), Type::getInt8Ty(C), "trunc");
NewCall = Builder.CreateCall(NewFn, Args);
break;
}
case Intrinsic::thread_pointer: {
NewCall = Builder.CreateCall(NewFn, {});
break;
}
case Intrinsic::invariant_start:
case Intrinsic::invariant_end:
case Intrinsic::masked_load:
case Intrinsic::masked_store:
case Intrinsic::masked_gather:
case Intrinsic::masked_scatter: {
SmallVector<Value *, 4> Args(CI->arg_operands().begin(),
CI->arg_operands().end());
NewCall = Builder.CreateCall(NewFn, Args);
break;
}
}
assert(NewCall && "Should have either set this variable or returned through "
"the default case");
std::string Name = CI->getName();
if (!Name.empty()) {
CI->setName(Name + ".old");
NewCall->setName(Name);
}
CI->replaceAllUsesWith(NewCall);
CI->eraseFromParent();
}
void llvm::UpgradeCallsToIntrinsic(Function *F) {
assert(F && "Illegal attempt to upgrade a non-existent intrinsic.");
// Check if this function should be upgraded and get the replacement function
// if there is one.
Function *NewFn;
if (UpgradeIntrinsicFunction(F, NewFn)) {
// Replace all users of the old function with the new function or new
// instructions. This is not a range loop because the call is deleted.
for (auto UI = F->user_begin(), UE = F->user_end(); UI != UE; )
if (CallInst *CI = dyn_cast<CallInst>(*UI++))
UpgradeIntrinsicCall(CI, NewFn);
// Remove old function, no longer used, from the module.
F->eraseFromParent();
}
}
MDNode *llvm::UpgradeTBAANode(MDNode &MD) {
// Check if the tag uses struct-path aware TBAA format.
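// An old scalar tag is !{!"name", !parent} (optionally followed by an i64
// 'constant' flag); struct-path tags begin with an MDNode rather than a string.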
if (isa<MDNode>(MD.getOperand(0)) && MD.getNumOperands() >= 3)
return &MD;
auto &Context = MD.getContext();
if (MD.getNumOperands() == 3) {
Metadata *Elts[] = {MD.getOperand(0), MD.getOperand(1)};
MDNode *ScalarType = MDNode::get(Context, Elts);
// Create an MDNode <ScalarType, ScalarType, offset 0, const>
Metadata *Elts2[] = {ScalarType, ScalarType,
ConstantAsMetadata::get(
Constant::getNullValue(Type::getInt64Ty(Context))),
MD.getOperand(2)};
return MDNode::get(Context, Elts2);
}
// Create an MDNode <MD, MD, offset 0>
Metadata *Elts[] = {&MD, &MD, ConstantAsMetadata::get(Constant::getNullValue(
Type::getInt64Ty(Context)))};
return MDNode::get(Context, Elts);
}
Instruction *llvm::UpgradeBitCastInst(unsigned Opc, Value *V, Type *DestTy,
Instruction *&Temp) {
if (Opc != Instruction::BitCast)
return nullptr;
Temp = nullptr;
Type *SrcTy = V->getType();
if (SrcTy->isPtrOrPtrVectorTy() && DestTy->isPtrOrPtrVectorTy() &&
SrcTy->getPointerAddressSpace() != DestTy->getPointerAddressSpace()) {
LLVMContext &Context = V->getContext();
// We have no information about the target data layout, so we assume that
// the maximum pointer size is 64 bits.
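// e.g. a bitcast of i8* to i8 addrspace(1)* becomes a ptrtoint to i64
// followed by an inttoptr to the destination type.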
Type *MidTy = Type::getInt64Ty(Context);
Temp = CastInst::Create(Instruction::PtrToInt, V, MidTy);
return CastInst::Create(Instruction::IntToPtr, Temp, DestTy);
}
return nullptr;
}
Value *llvm::UpgradeBitCastExpr(unsigned Opc, Constant *C, Type *DestTy) {
if (Opc != Instruction::BitCast)
return nullptr;
Type *SrcTy = C->getType();
if (SrcTy->isPtrOrPtrVectorTy() && DestTy->isPtrOrPtrVectorTy() &&
SrcTy->getPointerAddressSpace() != DestTy->getPointerAddressSpace()) {
LLVMContext &Context = C->getContext();
// We have no information about the target data layout, so we assume that
// the maximum pointer size is 64 bits.
Type *MidTy = Type::getInt64Ty(Context);
return ConstantExpr::getIntToPtr(ConstantExpr::getPtrToInt(C, MidTy),
DestTy);
}
return nullptr;
}
/// Check the debug info version number; if it is outdated, drop the debug
/// info. Return true if the module is modified.
bool llvm::UpgradeDebugInfo(Module &M) {
unsigned Version = getDebugMetadataVersionFromModule(M);
if (Version == DEBUG_METADATA_VERSION)
return false;
bool RetCode = StripDebugInfo(M);
if (RetCode) {
DiagnosticInfoDebugMetadataVersion DiagVersion(M, Version);
M.getContext().diagnose(DiagVersion);
}
return RetCode;
}
bool llvm::UpgradeModuleFlags(Module &M) {
- const NamedMDNode *ModFlags = M.getModuleFlagsMetadata();
+ NamedMDNode *ModFlags = M.getModuleFlagsMetadata();
if (!ModFlags)
return false;
- bool HasObjCFlag = false, HasClassProperties = false;
+ bool HasObjCFlag = false, HasClassProperties = false, Changed = false;
for (unsigned I = 0, E = ModFlags->getNumOperands(); I != E; ++I) {
MDNode *Op = ModFlags->getOperand(I);
- if (Op->getNumOperands() < 2)
+ if (Op->getNumOperands() != 3)
continue;
MDString *ID = dyn_cast_or_null<MDString>(Op->getOperand(1));
if (!ID)
continue;
if (ID->getString() == "Objective-C Image Info Version")
HasObjCFlag = true;
if (ID->getString() == "Objective-C Class Properties")
HasClassProperties = true;
+ // Upgrade the PIC/PIE module flags. The module flag behavior for these
+ // two flags was Error and is now Max.
+ if (ID->getString() == "PIC Level" || ID->getString() == "PIE Level") {
+ if (auto *Behavior =
+ mdconst::dyn_extract_or_null<ConstantInt>(Op->getOperand(0))) {
+ if (Behavior->getLimitedValue() == Module::Error) {
+ Type *Int32Ty = Type::getInt32Ty(M.getContext());
+ Metadata *Ops[3] = {
+ ConstantAsMetadata::get(ConstantInt::get(Int32Ty, Module::Max)),
+ MDString::get(M.getContext(), ID->getString()),
+ Op->getOperand(2)};
+ ModFlags->setOperand(I, MDNode::get(M.getContext(), Ops));
+ Changed = true;
+ }
+ }
+ }
}
+
// "Objective-C Class Properties" was recently added for Objective-C. We
// upgrade ObjC bitcodes to contain an "Objective-C Class Properties" module
// flag of value 0, so we can correctly downgrade this flag when trying to
// link an ObjC bitcode without this module flag against an ObjC bitcode with
// this module flag.
if (HasObjCFlag && !HasClassProperties) {
M.addModuleFlag(llvm::Module::Override, "Objective-C Class Properties",
(uint32_t)0);
- return true;
+ Changed = true;
}
- return false;
+
+ return Changed;
}
static bool isOldLoopArgument(Metadata *MD) {
auto *T = dyn_cast_or_null<MDTuple>(MD);
if (!T)
return false;
if (T->getNumOperands() < 1)
return false;
auto *S = dyn_cast_or_null<MDString>(T->getOperand(0));
if (!S)
return false;
return S->getString().startswith("llvm.vectorizer.");
}
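// e.g. "llvm.vectorizer.width" is renamed to "llvm.loop.vectorize.width",
// while "llvm.vectorizer.unroll" maps to "llvm.loop.interleave.count".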
static MDString *upgradeLoopTag(LLVMContext &C, StringRef OldTag) {
StringRef OldPrefix = "llvm.vectorizer.";
assert(OldTag.startswith(OldPrefix) && "Expected old prefix");
if (OldTag == "llvm.vectorizer.unroll")
return MDString::get(C, "llvm.loop.interleave.count");
return MDString::get(
C, (Twine("llvm.loop.vectorize.") + OldTag.drop_front(OldPrefix.size()))
.str());
}
static Metadata *upgradeLoopArgument(Metadata *MD) {
auto *T = dyn_cast_or_null<MDTuple>(MD);
if (!T)
return MD;
if (T->getNumOperands() < 1)
return MD;
auto *OldTag = dyn_cast_or_null<MDString>(T->getOperand(0));
if (!OldTag)
return MD;
if (!OldTag->getString().startswith("llvm.vectorizer."))
return MD;
// This has an old tag. Upgrade it.
SmallVector<Metadata *, 8> Ops;
Ops.reserve(T->getNumOperands());
Ops.push_back(upgradeLoopTag(T->getContext(), OldTag->getString()));
for (unsigned I = 1, E = T->getNumOperands(); I != E; ++I)
Ops.push_back(T->getOperand(I));
return MDTuple::get(T->getContext(), Ops);
}
MDNode *llvm::upgradeInstructionLoopAttachment(MDNode &N) {
auto *T = dyn_cast<MDTuple>(&N);
if (!T)
return &N;
if (none_of(T->operands(), isOldLoopArgument))
return &N;
SmallVector<Metadata *, 8> Ops;
Ops.reserve(T->getNumOperands());
for (Metadata *MD : T->operands())
Ops.push_back(upgradeLoopArgument(MD));
return MDTuple::get(T->getContext(), Ops);
}
diff --git a/lib/Object/COFFModuleDefinition.cpp b/lib/Object/COFFModuleDefinition.cpp
index ed9140d1fe08..510eac8b239b 100644
--- a/lib/Object/COFFModuleDefinition.cpp
+++ b/lib/Object/COFFModuleDefinition.cpp
@@ -1,331 +1,337 @@
//===--- COFFModuleDefinition.cpp - Simple DEF parser ---------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// Windows-specific.
// A parser for the module-definition file (.def file).
//
// The format of module-definition files is described in this document:
// https://msdn.microsoft.com/en-us/library/28d6s79h.aspx
//
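// For illustration, a minimal .def file for a hypothetical DLL:
//
//   LIBRARY example.dll
//   EXPORTS
//     func1
//     func2 @4 NONAME
//     data_sym DATA
//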
//===----------------------------------------------------------------------===//
#include "llvm/Object/COFFModuleDefinition.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"
#include "llvm/Object/COFF.h"
#include "llvm/Object/COFFImportFile.h"
#include "llvm/Object/Error.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/Path.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm::COFF;
using namespace llvm;
namespace llvm {
namespace object {
enum Kind {
Unknown,
Eof,
Identifier,
Comma,
Equal,
KwBase,
KwConstant,
KwData,
KwExports,
KwHeapsize,
KwLibrary,
KwName,
KwNoname,
KwPrivate,
KwStacksize,
KwVersion,
};
struct Token {
explicit Token(Kind T = Unknown, StringRef S = "") : K(T), Value(S) {}
Kind K;
StringRef Value;
};
static bool isDecorated(StringRef Sym, bool MingwDef) {
// mingw does not prepend "_".
return (!MingwDef && Sym.startswith("_")) || Sym.startswith("@") ||
Sym.startswith("?");
}
static Error createError(const Twine &Err) {
return make_error<StringError>(StringRef(Err.str()),
object_error::parse_failed);
}
class Lexer {
public:
Lexer(StringRef S) : Buf(S) {}
Token lex() {
Buf = Buf.trim();
if (Buf.empty())
return Token(Eof);
switch (Buf[0]) {
case '\0':
return Token(Eof);
case ';': {
size_t End = Buf.find('\n');
Buf = (End == Buf.npos) ? "" : Buf.drop_front(End);
return lex();
}
case '=':
Buf = Buf.drop_front();
// GNU dlltool accepts both = and ==.
if (Buf.startswith("="))
Buf = Buf.drop_front();
return Token(Equal, "=");
case ',':
Buf = Buf.drop_front();
return Token(Comma, ",");
case '"': {
StringRef S;
std::tie(S, Buf) = Buf.substr(1).split('"');
return Token(Identifier, S);
}
default: {
size_t End = Buf.find_first_of("=,\r\n \t\v");
StringRef Word = Buf.substr(0, End);
Kind K = llvm::StringSwitch<Kind>(Word)
.Case("BASE", KwBase)
.Case("CONSTANT", KwConstant)
.Case("DATA", KwData)
.Case("EXPORTS", KwExports)
.Case("HEAPSIZE", KwHeapsize)
.Case("LIBRARY", KwLibrary)
.Case("NAME", KwName)
.Case("NONAME", KwNoname)
.Case("PRIVATE", KwPrivate)
.Case("STACKSIZE", KwStacksize)
.Case("VERSION", KwVersion)
.Default(Identifier);
Buf = (End == Buf.npos) ? "" : Buf.drop_front(End);
return Token(K, Word);
}
}
}
private:
StringRef Buf;
};
class Parser {
public:
explicit Parser(StringRef S, MachineTypes M, bool B)
: Lex(S), Machine(M), MingwDef(B) {}
Expected<COFFModuleDefinition> parse() {
do {
if (Error Err = parseOne())
return std::move(Err);
} while (Tok.K != Eof);
return Info;
}
private:
void read() {
if (Stack.empty()) {
Tok = Lex.lex();
return;
}
Tok = Stack.back();
Stack.pop_back();
}
Error readAsInt(uint64_t *I) {
read();
if (Tok.K != Identifier || Tok.Value.getAsInteger(10, *I))
return createError("integer expected");
return Error::success();
}
Error expect(Kind Expected, StringRef Msg) {
read();
if (Tok.K != Expected)
return createError(Msg);
return Error::success();
}
void unget() { Stack.push_back(Tok); }
Error parseOne() {
read();
switch (Tok.K) {
case Eof:
return Error::success();
case KwExports:
for (;;) {
read();
if (Tok.K != Identifier) {
unget();
return Error::success();
}
if (Error Err = parseExport())
return Err;
}
case KwHeapsize:
return parseNumbers(&Info.HeapReserve, &Info.HeapCommit);
case KwStacksize:
return parseNumbers(&Info.StackReserve, &Info.StackCommit);
case KwLibrary:
case KwName: {
bool IsDll = Tok.K == KwLibrary; // Check before parseName.
std::string Name;
if (Error Err = parseName(&Name, &Info.ImageBase))
return Err;
Info.ImportName = Name;
// Set the output file, but don't override /out if it was already passed.
if (Info.OutputFile.empty()) {
Info.OutputFile = Name;
// Append the appropriate file extension if not already present.
if (!sys::path::has_extension(Name))
Info.OutputFile += IsDll ? ".dll" : ".exe";
}
return Error::success();
}
case KwVersion:
return parseVersion(&Info.MajorImageVersion, &Info.MinorImageVersion);
default:
return createError("unknown directive: " + Tok.Value);
}
}
Error parseExport() {
COFFShortExport E;
E.Name = Tok.Value;
read();
if (Tok.K == Equal) {
read();
if (Tok.K != Identifier)
return createError("identifier expected, but got " + Tok.Value);
E.ExtName = E.Name;
E.Name = Tok.Value;
} else {
unget();
}
if (Machine == IMAGE_FILE_MACHINE_I386) {
if (!isDecorated(E.Name, MingwDef))
E.Name = (std::string("_").append(E.Name));
if (!E.ExtName.empty() && !isDecorated(E.ExtName, MingwDef))
E.ExtName = (std::string("_").append(E.ExtName));
}
for (;;) {
read();
if (Tok.K == Identifier && Tok.Value[0] == '@') {
- Tok.Value.drop_front().getAsInteger(10, E.Ordinal);
+ if (Tok.Value.drop_front().getAsInteger(10, E.Ordinal)) {
+ // Not an ordinal modifier at all, but the next export (a fastcall-
+ // decorated symbol); complete the current one.
+ unget();
+ Info.Exports.push_back(E);
+ return Error::success();
+ }
read();
if (Tok.K == KwNoname) {
E.Noname = true;
} else {
unget();
}
continue;
}
if (Tok.K == KwData) {
E.Data = true;
continue;
}
if (Tok.K == KwConstant) {
E.Constant = true;
continue;
}
if (Tok.K == KwPrivate) {
E.Private = true;
continue;
}
unget();
Info.Exports.push_back(E);
return Error::success();
}
}
// HEAPSIZE/STACKSIZE reserve[,commit]
Error parseNumbers(uint64_t *Reserve, uint64_t *Commit) {
if (Error Err = readAsInt(Reserve))
return Err;
read();
if (Tok.K != Comma) {
unget();
Commit = nullptr;
return Error::success();
}
if (Error Err = readAsInt(Commit))
return Err;
return Error::success();
}
// NAME outputPath [BASE=address]
Error parseName(std::string *Out, uint64_t *Baseaddr) {
read();
if (Tok.K == Identifier) {
*Out = Tok.Value;
} else {
*Out = "";
unget();
return Error::success();
}
read();
if (Tok.K == KwBase) {
if (Error Err = expect(Equal, "'=' expected"))
return Err;
if (Error Err = readAsInt(Baseaddr))
return Err;
} else {
unget();
*Baseaddr = 0;
}
return Error::success();
}
// VERSION major[.minor]
Error parseVersion(uint32_t *Major, uint32_t *Minor) {
read();
if (Tok.K != Identifier)
return createError("identifier expected, but got " + Tok.Value);
StringRef V1, V2;
std::tie(V1, V2) = Tok.Value.split('.');
if (V1.getAsInteger(10, *Major))
return createError("integer expected, but got " + Tok.Value);
if (V2.empty())
*Minor = 0;
else if (V2.getAsInteger(10, *Minor))
return createError("integer expected, but got " + Tok.Value);
return Error::success();
}
Lexer Lex;
Token Tok;
std::vector<Token> Stack;
MachineTypes Machine;
COFFModuleDefinition Info;
bool MingwDef;
};
Expected<COFFModuleDefinition> parseCOFFModuleDefinition(MemoryBufferRef MB,
MachineTypes Machine,
bool MingwDef) {
return Parser(MB.getBuffer(), Machine, MingwDef).parse();
}
} // namespace object
} // namespace llvm
diff --git a/lib/Target/ARM/ARMISelLowering.cpp b/lib/Target/ARM/ARMISelLowering.cpp
index 6ba7593543a9..27dda93387b6 100644
--- a/lib/Target/ARM/ARMISelLowering.cpp
+++ b/lib/Target/ARM/ARMISelLowering.cpp
@@ -1,14085 +1,14101 @@
//===-- ARMISelLowering.cpp - ARM DAG Lowering Implementation -------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file defines the interfaces that ARM uses to lower LLVM code into a
// selection DAG.
//
//===----------------------------------------------------------------------===//
#include "ARMISelLowering.h"
#include "ARMBaseInstrInfo.h"
#include "ARMBaseRegisterInfo.h"
#include "ARMCallingConv.h"
#include "ARMConstantPoolValue.h"
#include "ARMMachineFunctionInfo.h"
#include "ARMPerfectShuffle.h"
#include "ARMRegisterInfo.h"
#include "ARMSelectionDAGInfo.h"
#include "ARMSubtarget.h"
#include "MCTargetDesc/ARMAddressingModes.h"
#include "MCTargetDesc/ARMBaseInfo.h"
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"
#include "llvm/ADT/Triple.h"
#include "llvm/ADT/Twine.h"
#include "llvm/Analysis/VectorUtils.h"
#include "llvm/CodeGen/CallingConvLower.h"
#include "llvm/CodeGen/ISDOpcodes.h"
#include "llvm/CodeGen/IntrinsicLowering.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineConstantPool.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineJumpTableInfo.h"
#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/MachineValueType.h"
#include "llvm/CodeGen/RuntimeLibcalls.h"
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Constant.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InlineAsm.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/User.h"
#include "llvm/IR/Value.h"
#include "llvm/MC/MCInstrDesc.h"
#include "llvm/MC/MCInstrItineraries.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/MC/MCSchedule.h"
#include "llvm/Support/AtomicOrdering.h"
#include "llvm/Support/BranchProbability.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetInstrInfo.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <iterator>
#include <limits>
#include <string>
#include <tuple>
#include <utility>
#include <vector>
using namespace llvm;
#define DEBUG_TYPE "arm-isel"
STATISTIC(NumTailCalls, "Number of tail calls");
STATISTIC(NumMovwMovt, "Number of GAs materialized with movw + movt");
STATISTIC(NumLoopByVals, "Number of loops generated for byval arguments");
STATISTIC(NumConstpoolPromoted,
"Number of constants with their storage promoted into constant pools");
static cl::opt<bool>
ARMInterworking("arm-interworking", cl::Hidden,
cl::desc("Enable / disable ARM interworking (for debugging only)"),
cl::init(true));
static cl::opt<bool> EnableConstpoolPromotion(
"arm-promote-constant", cl::Hidden,
cl::desc("Enable / disable promotion of unnamed_addr constants into "
"constant pools"),
cl::init(false)); // FIXME: set to true by default once PR32780 is fixed
static cl::opt<unsigned> ConstpoolPromotionMaxSize(
"arm-promote-constant-max-size", cl::Hidden,
cl::desc("Maximum size of constant to promote into a constant pool"),
cl::init(64));
static cl::opt<unsigned> ConstpoolPromotionMaxTotal(
"arm-promote-constant-max-total", cl::Hidden,
cl::desc("Maximum size of ALL constants to promote into a constant pool"),
cl::init(128));
// The APCS parameter registers.
static const MCPhysReg GPRArgRegs[] = {
ARM::R0, ARM::R1, ARM::R2, ARM::R3
};
void ARMTargetLowering::addTypeForNEON(MVT VT, MVT PromotedLdStVT,
MVT PromotedBitwiseVT) {
if (VT != PromotedLdStVT) {
setOperationAction(ISD::LOAD, VT, Promote);
AddPromotedToType (ISD::LOAD, VT, PromotedLdStVT);
setOperationAction(ISD::STORE, VT, Promote);
AddPromotedToType (ISD::STORE, VT, PromotedLdStVT);
}
MVT ElemTy = VT.getVectorElementType();
if (ElemTy != MVT::f64)
setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
if (ElemTy == MVT::i32) {
setOperationAction(ISD::SINT_TO_FP, VT, Custom);
setOperationAction(ISD::UINT_TO_FP, VT, Custom);
setOperationAction(ISD::FP_TO_SINT, VT, Custom);
setOperationAction(ISD::FP_TO_UINT, VT, Custom);
} else {
setOperationAction(ISD::SINT_TO_FP, VT, Expand);
setOperationAction(ISD::UINT_TO_FP, VT, Expand);
setOperationAction(ISD::FP_TO_SINT, VT, Expand);
setOperationAction(ISD::FP_TO_UINT, VT, Expand);
}
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::CONCAT_VECTORS, VT, Legal);
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Legal);
setOperationAction(ISD::SELECT, VT, Expand);
setOperationAction(ISD::SELECT_CC, VT, Expand);
setOperationAction(ISD::VSELECT, VT, Expand);
setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Expand);
if (VT.isInteger()) {
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);
setOperationAction(ISD::SRL, VT, Custom);
}
// Promote all bit-wise operations.
if (VT.isInteger() && VT != PromotedBitwiseVT) {
setOperationAction(ISD::AND, VT, Promote);
AddPromotedToType (ISD::AND, VT, PromotedBitwiseVT);
setOperationAction(ISD::OR, VT, Promote);
AddPromotedToType (ISD::OR, VT, PromotedBitwiseVT);
setOperationAction(ISD::XOR, VT, Promote);
AddPromotedToType (ISD::XOR, VT, PromotedBitwiseVT);
}
// Neon does not support vector divide/remainder operations.
setOperationAction(ISD::SDIV, VT, Expand);
setOperationAction(ISD::UDIV, VT, Expand);
setOperationAction(ISD::FDIV, VT, Expand);
setOperationAction(ISD::SREM, VT, Expand);
setOperationAction(ISD::UREM, VT, Expand);
setOperationAction(ISD::FREM, VT, Expand);
if (!VT.isFloatingPoint() &&
VT != MVT::v2i64 && VT != MVT::v1i64)
for (auto Opcode : {ISD::ABS, ISD::SMIN, ISD::SMAX, ISD::UMIN, ISD::UMAX})
setOperationAction(Opcode, VT, Legal);
}
void ARMTargetLowering::addDRTypeForNEON(MVT VT) {
addRegisterClass(VT, &ARM::DPRRegClass);
addTypeForNEON(VT, MVT::f64, MVT::v2i32);
}
void ARMTargetLowering::addQRTypeForNEON(MVT VT) {
addRegisterClass(VT, &ARM::DPairRegClass);
addTypeForNEON(VT, MVT::v2f64, MVT::v4i32);
}
ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,
const ARMSubtarget &STI)
: TargetLowering(TM), Subtarget(&STI) {
RegInfo = Subtarget->getRegisterInfo();
Itins = Subtarget->getInstrItineraryData();
setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
if (!Subtarget->isTargetDarwin() && !Subtarget->isTargetIOS() &&
!Subtarget->isTargetWatchOS()) {
const auto &E = Subtarget->getTargetTriple().getEnvironment();
bool IsHFTarget = E == Triple::EABIHF || E == Triple::GNUEABIHF ||
E == Triple::MuslEABIHF;
// Windows is a special case. Technically, we will replace all of the "GNU"
// calls with calls to MSVCRT if appropriate and adjust the calling
// convention then.
IsHFTarget = IsHFTarget || Subtarget->isTargetWindows();
for (int LCID = 0; LCID < RTLIB::UNKNOWN_LIBCALL; ++LCID)
setLibcallCallingConv(static_cast<RTLIB::Libcall>(LCID),
IsHFTarget ? CallingConv::ARM_AAPCS_VFP
: CallingConv::ARM_AAPCS);
}
if (Subtarget->isTargetMachO()) {
// Uses VFP for Thumb libfuncs if available.
if (Subtarget->isThumb() && Subtarget->hasVFP2() &&
Subtarget->hasARMOps() && !Subtarget->useSoftFloat()) {
static const struct {
const RTLIB::Libcall Op;
const char * const Name;
const ISD::CondCode Cond;
} LibraryCalls[] = {
// Single-precision floating-point arithmetic.
{ RTLIB::ADD_F32, "__addsf3vfp", ISD::SETCC_INVALID },
{ RTLIB::SUB_F32, "__subsf3vfp", ISD::SETCC_INVALID },
{ RTLIB::MUL_F32, "__mulsf3vfp", ISD::SETCC_INVALID },
{ RTLIB::DIV_F32, "__divsf3vfp", ISD::SETCC_INVALID },
// Double-precision floating-point arithmetic.
{ RTLIB::ADD_F64, "__adddf3vfp", ISD::SETCC_INVALID },
{ RTLIB::SUB_F64, "__subdf3vfp", ISD::SETCC_INVALID },
{ RTLIB::MUL_F64, "__muldf3vfp", ISD::SETCC_INVALID },
{ RTLIB::DIV_F64, "__divdf3vfp", ISD::SETCC_INVALID },
// Single-precision comparisons.
{ RTLIB::OEQ_F32, "__eqsf2vfp", ISD::SETNE },
{ RTLIB::UNE_F32, "__nesf2vfp", ISD::SETNE },
{ RTLIB::OLT_F32, "__ltsf2vfp", ISD::SETNE },
{ RTLIB::OLE_F32, "__lesf2vfp", ISD::SETNE },
{ RTLIB::OGE_F32, "__gesf2vfp", ISD::SETNE },
{ RTLIB::OGT_F32, "__gtsf2vfp", ISD::SETNE },
{ RTLIB::UO_F32, "__unordsf2vfp", ISD::SETNE },
{ RTLIB::O_F32, "__unordsf2vfp", ISD::SETEQ },
// Double-precision comparisons.
{ RTLIB::OEQ_F64, "__eqdf2vfp", ISD::SETNE },
{ RTLIB::UNE_F64, "__nedf2vfp", ISD::SETNE },
{ RTLIB::OLT_F64, "__ltdf2vfp", ISD::SETNE },
{ RTLIB::OLE_F64, "__ledf2vfp", ISD::SETNE },
{ RTLIB::OGE_F64, "__gedf2vfp", ISD::SETNE },
{ RTLIB::OGT_F64, "__gtdf2vfp", ISD::SETNE },
{ RTLIB::UO_F64, "__unorddf2vfp", ISD::SETNE },
{ RTLIB::O_F64, "__unorddf2vfp", ISD::SETEQ },
// Floating-point to integer conversions.
// i64 conversions are done via library routines even when generating VFP
// instructions, so use the same ones.
{ RTLIB::FPTOSINT_F64_I32, "__fixdfsivfp", ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F64_I32, "__fixunsdfsivfp", ISD::SETCC_INVALID },
{ RTLIB::FPTOSINT_F32_I32, "__fixsfsivfp", ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F32_I32, "__fixunssfsivfp", ISD::SETCC_INVALID },
// Conversions between floating types.
{ RTLIB::FPROUND_F64_F32, "__truncdfsf2vfp", ISD::SETCC_INVALID },
{ RTLIB::FPEXT_F32_F64, "__extendsfdf2vfp", ISD::SETCC_INVALID },
// Integer to floating-point conversions.
// i64 conversions are done via library routines even when generating VFP
// instructions, so use the same ones.
// FIXME: There appears to be some naming inconsistency in ARM libgcc:
// e.g., __floatunsidf vs. __floatunssidfvfp.
{ RTLIB::SINTTOFP_I32_F64, "__floatsidfvfp", ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I32_F64, "__floatunssidfvfp", ISD::SETCC_INVALID },
{ RTLIB::SINTTOFP_I32_F32, "__floatsisfvfp", ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I32_F32, "__floatunssisfvfp", ISD::SETCC_INVALID },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
if (LC.Cond != ISD::SETCC_INVALID)
setCmpLibcallCC(LC.Op, LC.Cond);
}
}
// Set the correct calling convention for ARMv7k WatchOS. It's just
// AAPCS_VFP, even for functions as simple as libcalls.
if (Subtarget->isTargetWatchABI()) {
for (int i = 0; i < RTLIB::UNKNOWN_LIBCALL; ++i)
setLibcallCallingConv((RTLIB::Libcall)i, CallingConv::ARM_AAPCS_VFP);
}
}
// These libcalls are not available on 32-bit targets.
setLibcallName(RTLIB::SHL_I128, nullptr);
setLibcallName(RTLIB::SRL_I128, nullptr);
setLibcallName(RTLIB::SRA_I128, nullptr);
// RTLIB: AEABI (Run-time ABI for the ARM architecture) helper functions.
if (Subtarget->isAAPCS_ABI() &&
(Subtarget->isTargetAEABI() || Subtarget->isTargetGNUAEABI() ||
Subtarget->isTargetMuslAEABI() || Subtarget->isTargetAndroid())) {
static const struct {
const RTLIB::Libcall Op;
const char * const Name;
const CallingConv::ID CC;
const ISD::CondCode Cond;
} LibraryCalls[] = {
// Double-precision floating-point arithmetic helper functions
// RTABI chapter 4.1.2, Table 2
{ RTLIB::ADD_F64, "__aeabi_dadd", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::DIV_F64, "__aeabi_ddiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::MUL_F64, "__aeabi_dmul", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SUB_F64, "__aeabi_dsub", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Double-precision floating-point comparison helper functions
// RTABI chapter 4.1.2, Table 3
{ RTLIB::OEQ_F64, "__aeabi_dcmpeq", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::UNE_F64, "__aeabi_dcmpeq", CallingConv::ARM_AAPCS, ISD::SETEQ },
{ RTLIB::OLT_F64, "__aeabi_dcmplt", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OLE_F64, "__aeabi_dcmple", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OGE_F64, "__aeabi_dcmpge", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OGT_F64, "__aeabi_dcmpgt", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::UO_F64, "__aeabi_dcmpun", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::O_F64, "__aeabi_dcmpun", CallingConv::ARM_AAPCS, ISD::SETEQ },
// Single-precision floating-point arithmetic helper functions
// RTABI chapter 4.1.2, Table 4
{ RTLIB::ADD_F32, "__aeabi_fadd", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::DIV_F32, "__aeabi_fdiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::MUL_F32, "__aeabi_fmul", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SUB_F32, "__aeabi_fsub", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Single-precision floating-point comparison helper functions
// RTABI chapter 4.1.2, Table 5
{ RTLIB::OEQ_F32, "__aeabi_fcmpeq", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::UNE_F32, "__aeabi_fcmpeq", CallingConv::ARM_AAPCS, ISD::SETEQ },
{ RTLIB::OLT_F32, "__aeabi_fcmplt", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OLE_F32, "__aeabi_fcmple", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OGE_F32, "__aeabi_fcmpge", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::OGT_F32, "__aeabi_fcmpgt", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::UO_F32, "__aeabi_fcmpun", CallingConv::ARM_AAPCS, ISD::SETNE },
{ RTLIB::O_F32, "__aeabi_fcmpun", CallingConv::ARM_AAPCS, ISD::SETEQ },
// Floating-point to integer conversions.
// RTABI chapter 4.1.2, Table 6
{ RTLIB::FPTOSINT_F64_I32, "__aeabi_d2iz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F64_I32, "__aeabi_d2uiz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOSINT_F64_I64, "__aeabi_d2lz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F64_I64, "__aeabi_d2ulz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOSINT_F32_I32, "__aeabi_f2iz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F32_I32, "__aeabi_f2uiz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOSINT_F32_I64, "__aeabi_f2lz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPTOUINT_F32_I64, "__aeabi_f2ulz", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Conversions between floating types.
// RTABI chapter 4.1.2, Table 7
{ RTLIB::FPROUND_F64_F32, "__aeabi_d2f", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPROUND_F64_F16, "__aeabi_d2h", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::FPEXT_F32_F64, "__aeabi_f2d", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Integer to floating-point conversions.
// RTABI chapter 4.1.2, Table 8
{ RTLIB::SINTTOFP_I32_F64, "__aeabi_i2d", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I32_F64, "__aeabi_ui2d", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SINTTOFP_I64_F64, "__aeabi_l2d", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I64_F64, "__aeabi_ul2d", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SINTTOFP_I32_F32, "__aeabi_i2f", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I32_F32, "__aeabi_ui2f", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SINTTOFP_I64_F32, "__aeabi_l2f", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UINTTOFP_I64_F32, "__aeabi_ul2f", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Long long helper functions
// RTABI chapter 4.2, Table 9
{ RTLIB::MUL_I64, "__aeabi_lmul", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SHL_I64, "__aeabi_llsl", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SRL_I64, "__aeabi_llsr", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SRA_I64, "__aeabi_lasr", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
// Integer division functions
// RTABI chapter 4.3.1
{ RTLIB::SDIV_I8, "__aeabi_idiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SDIV_I16, "__aeabi_idiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SDIV_I32, "__aeabi_idiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::SDIV_I64, "__aeabi_ldivmod", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UDIV_I8, "__aeabi_uidiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UDIV_I16, "__aeabi_uidiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UDIV_I32, "__aeabi_uidiv", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::UDIV_I64, "__aeabi_uldivmod", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
if (LC.Cond != ISD::SETCC_INVALID)
setCmpLibcallCC(LC.Op, LC.Cond);
}
// EABI dependent RTLIB
if (TM.Options.EABIVersion == EABI::EABI4 ||
TM.Options.EABIVersion == EABI::EABI5) {
static const struct {
const RTLIB::Libcall Op;
const char *const Name;
const CallingConv::ID CC;
const ISD::CondCode Cond;
} MemOpsLibraryCalls[] = {
// Memory operations
// RTABI chapter 4.3.4
{ RTLIB::MEMCPY, "__aeabi_memcpy", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::MEMMOVE, "__aeabi_memmove", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
{ RTLIB::MEMSET, "__aeabi_memset", CallingConv::ARM_AAPCS, ISD::SETCC_INVALID },
};
for (const auto &LC : MemOpsLibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
if (LC.Cond != ISD::SETCC_INVALID)
setCmpLibcallCC(LC.Op, LC.Cond);
}
}
}
if (Subtarget->isTargetWindows()) {
static const struct {
const RTLIB::Libcall Op;
const char * const Name;
const CallingConv::ID CC;
} LibraryCalls[] = {
{ RTLIB::FPTOSINT_F32_I64, "__stoi64", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::FPTOSINT_F64_I64, "__dtoi64", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::FPTOUINT_F32_I64, "__stou64", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::FPTOUINT_F64_I64, "__dtou64", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::SINTTOFP_I64_F32, "__i64tos", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::SINTTOFP_I64_F64, "__i64tod", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::UINTTOFP_I64_F32, "__u64tos", CallingConv::ARM_AAPCS_VFP },
{ RTLIB::UINTTOFP_I64_F64, "__u64tod", CallingConv::ARM_AAPCS_VFP },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
}
}
// Use divmod compiler-rt calls for iOS 5.0 and later.
if (Subtarget->isTargetMachO() &&
!(Subtarget->isTargetIOS() &&
Subtarget->getTargetTriple().isOSVersionLT(5, 0))) {
setLibcallName(RTLIB::SDIVREM_I32, "__divmodsi4");
setLibcallName(RTLIB::UDIVREM_I32, "__udivmodsi4");
}
// The half <-> float conversion functions are always soft-float on
// non-WatchOS platforms, but are needed for some targets which use a
// hard-float calling convention by default.
if (!Subtarget->isTargetWatchABI()) {
if (Subtarget->isAAPCS_ABI()) {
setLibcallCallingConv(RTLIB::FPROUND_F32_F16, CallingConv::ARM_AAPCS);
setLibcallCallingConv(RTLIB::FPROUND_F64_F16, CallingConv::ARM_AAPCS);
setLibcallCallingConv(RTLIB::FPEXT_F16_F32, CallingConv::ARM_AAPCS);
} else {
setLibcallCallingConv(RTLIB::FPROUND_F32_F16, CallingConv::ARM_APCS);
setLibcallCallingConv(RTLIB::FPROUND_F64_F16, CallingConv::ARM_APCS);
setLibcallCallingConv(RTLIB::FPEXT_F16_F32, CallingConv::ARM_APCS);
}
}
// In EABI, these functions have an __aeabi_ prefix, but in GNUEABI they have
// a __gnu_ prefix (which is the default).
if (Subtarget->isTargetAEABI()) {
static const struct {
const RTLIB::Libcall Op;
const char * const Name;
const CallingConv::ID CC;
} LibraryCalls[] = {
{ RTLIB::FPROUND_F32_F16, "__aeabi_f2h", CallingConv::ARM_AAPCS },
{ RTLIB::FPROUND_F64_F16, "__aeabi_d2h", CallingConv::ARM_AAPCS },
{ RTLIB::FPEXT_F16_F32, "__aeabi_h2f", CallingConv::ARM_AAPCS },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
}
}
if (Subtarget->isThumb1Only())
addRegisterClass(MVT::i32, &ARM::tGPRRegClass);
else
addRegisterClass(MVT::i32, &ARM::GPRRegClass);
if (!Subtarget->useSoftFloat() && Subtarget->hasVFP2() &&
!Subtarget->isThumb1Only()) {
addRegisterClass(MVT::f32, &ARM::SPRRegClass);
addRegisterClass(MVT::f64, &ARM::DPRRegClass);
}
for (MVT VT : MVT::vector_valuetypes()) {
for (MVT InnerVT : MVT::vector_valuetypes()) {
setTruncStoreAction(VT, InnerVT, Expand);
setLoadExtAction(ISD::SEXTLOAD, VT, InnerVT, Expand);
setLoadExtAction(ISD::ZEXTLOAD, VT, InnerVT, Expand);
setLoadExtAction(ISD::EXTLOAD, VT, InnerVT, Expand);
}
setOperationAction(ISD::MULHS, VT, Expand);
setOperationAction(ISD::SMUL_LOHI, VT, Expand);
setOperationAction(ISD::MULHU, VT, Expand);
setOperationAction(ISD::UMUL_LOHI, VT, Expand);
setOperationAction(ISD::BSWAP, VT, Expand);
}
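// A reminder on the setOperationAction markings used throughout this
// constructor: Legal keeps the node for instruction selection unchanged,
// Expand lets the legalizer rewrite it in terms of other operations (or a
// libcall when no expansion exists), Custom routes it through
// ARMTargetLowering::LowerOperation, and LibCall always emits a runtime
// call. For example, the ISD::BSWAP marking in the loop above makes the
// generic legalizer decompose vector byte swaps instead of selecting them
// directly.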
setOperationAction(ISD::ConstantFP, MVT::f32, Custom);
setOperationAction(ISD::ConstantFP, MVT::f64, Custom);
setOperationAction(ISD::READ_REGISTER, MVT::i64, Custom);
setOperationAction(ISD::WRITE_REGISTER, MVT::i64, Custom);
if (Subtarget->hasNEON()) {
addDRTypeForNEON(MVT::v2f32);
addDRTypeForNEON(MVT::v8i8);
addDRTypeForNEON(MVT::v4i16);
addDRTypeForNEON(MVT::v2i32);
addDRTypeForNEON(MVT::v1i64);
addQRTypeForNEON(MVT::v4f32);
addQRTypeForNEON(MVT::v2f64);
addQRTypeForNEON(MVT::v16i8);
addQRTypeForNEON(MVT::v8i16);
addQRTypeForNEON(MVT::v4i32);
addQRTypeForNEON(MVT::v2i64);
// v2f64 is legal so that QR subregs can be extracted as f64 elements, but
// neither NEON nor VFP supports any arithmetic operations on it.
// The same applies to v4f32, although keep in mind that vadd, vsub and vmul
// are natively supported for v4f32.
setOperationAction(ISD::FADD, MVT::v2f64, Expand);
setOperationAction(ISD::FSUB, MVT::v2f64, Expand);
setOperationAction(ISD::FMUL, MVT::v2f64, Expand);
// FIXME: Code duplication: FDIV and FREM are expanded always, see
// ARMTargetLowering::addTypeForNEON method for details.
setOperationAction(ISD::FDIV, MVT::v2f64, Expand);
setOperationAction(ISD::FREM, MVT::v2f64, Expand);
// FIXME: Create a unit test.
// In other words, find a case where "copysign" appears in the DAG with
// vector operands.
setOperationAction(ISD::FCOPYSIGN, MVT::v2f64, Expand);
// FIXME: Code duplication: SETCC has custom operation action, see
// ARMTargetLowering::addTypeForNEON method for details.
setOperationAction(ISD::SETCC, MVT::v2f64, Expand);
// FIXME: Create unittest for FNEG and for FABS.
setOperationAction(ISD::FNEG, MVT::v2f64, Expand);
setOperationAction(ISD::FABS, MVT::v2f64, Expand);
setOperationAction(ISD::FSQRT, MVT::v2f64, Expand);
setOperationAction(ISD::FSIN, MVT::v2f64, Expand);
setOperationAction(ISD::FCOS, MVT::v2f64, Expand);
setOperationAction(ISD::FPOW, MVT::v2f64, Expand);
setOperationAction(ISD::FLOG, MVT::v2f64, Expand);
setOperationAction(ISD::FLOG2, MVT::v2f64, Expand);
setOperationAction(ISD::FLOG10, MVT::v2f64, Expand);
setOperationAction(ISD::FEXP, MVT::v2f64, Expand);
setOperationAction(ISD::FEXP2, MVT::v2f64, Expand);
// FIXME: Create unittest for FCEIL, FTRUNC, FRINT, FNEARBYINT, FFLOOR.
setOperationAction(ISD::FCEIL, MVT::v2f64, Expand);
setOperationAction(ISD::FTRUNC, MVT::v2f64, Expand);
setOperationAction(ISD::FRINT, MVT::v2f64, Expand);
setOperationAction(ISD::FNEARBYINT, MVT::v2f64, Expand);
setOperationAction(ISD::FFLOOR, MVT::v2f64, Expand);
setOperationAction(ISD::FMA, MVT::v2f64, Expand);
setOperationAction(ISD::FSQRT, MVT::v4f32, Expand);
setOperationAction(ISD::FSIN, MVT::v4f32, Expand);
setOperationAction(ISD::FCOS, MVT::v4f32, Expand);
setOperationAction(ISD::FPOW, MVT::v4f32, Expand);
setOperationAction(ISD::FLOG, MVT::v4f32, Expand);
setOperationAction(ISD::FLOG2, MVT::v4f32, Expand);
setOperationAction(ISD::FLOG10, MVT::v4f32, Expand);
setOperationAction(ISD::FEXP, MVT::v4f32, Expand);
setOperationAction(ISD::FEXP2, MVT::v4f32, Expand);
setOperationAction(ISD::FCEIL, MVT::v4f32, Expand);
setOperationAction(ISD::FTRUNC, MVT::v4f32, Expand);
setOperationAction(ISD::FRINT, MVT::v4f32, Expand);
setOperationAction(ISD::FNEARBYINT, MVT::v4f32, Expand);
setOperationAction(ISD::FFLOOR, MVT::v4f32, Expand);
// Mark v2f32 intrinsics.
setOperationAction(ISD::FSQRT, MVT::v2f32, Expand);
setOperationAction(ISD::FSIN, MVT::v2f32, Expand);
setOperationAction(ISD::FCOS, MVT::v2f32, Expand);
setOperationAction(ISD::FPOW, MVT::v2f32, Expand);
setOperationAction(ISD::FLOG, MVT::v2f32, Expand);
setOperationAction(ISD::FLOG2, MVT::v2f32, Expand);
setOperationAction(ISD::FLOG10, MVT::v2f32, Expand);
setOperationAction(ISD::FEXP, MVT::v2f32, Expand);
setOperationAction(ISD::FEXP2, MVT::v2f32, Expand);
setOperationAction(ISD::FCEIL, MVT::v2f32, Expand);
setOperationAction(ISD::FTRUNC, MVT::v2f32, Expand);
setOperationAction(ISD::FRINT, MVT::v2f32, Expand);
setOperationAction(ISD::FNEARBYINT, MVT::v2f32, Expand);
setOperationAction(ISD::FFLOOR, MVT::v2f32, Expand);
// Neon does not support some operations on v1i64 and v2i64 types.
setOperationAction(ISD::MUL, MVT::v1i64, Expand);
// Custom handling for some quad-vector types to detect VMULL.
setOperationAction(ISD::MUL, MVT::v8i16, Custom);
setOperationAction(ISD::MUL, MVT::v4i32, Custom);
setOperationAction(ISD::MUL, MVT::v2i64, Custom);
// Custom handling for some vector types to avoid expensive expansions
setOperationAction(ISD::SDIV, MVT::v4i16, Custom);
setOperationAction(ISD::SDIV, MVT::v8i8, Custom);
setOperationAction(ISD::UDIV, MVT::v4i16, Custom);
setOperationAction(ISD::UDIV, MVT::v8i8, Custom);
// NEON does not have single-instruction SINT_TO_FP and UINT_TO_FP with
// a destination type that is wider than the source, nor does it have a
// FP_TO_[SU]INT instruction with a destination narrower than the
// source.
setOperationAction(ISD::SINT_TO_FP, MVT::v4i16, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i16, Custom);
setOperationAction(ISD::FP_TO_UINT, MVT::v4i16, Custom);
setOperationAction(ISD::FP_TO_SINT, MVT::v4i16, Custom);
setOperationAction(ISD::FP_ROUND, MVT::v2f32, Expand);
setOperationAction(ISD::FP_EXTEND, MVT::v2f64, Expand);
// NEON does not have a single-instruction CTPOP for vectors with element
// types wider than 8 bits. However, custom lowering can leverage the
// v8i8/v16i8 vcnt instruction.
setOperationAction(ISD::CTPOP, MVT::v2i32, Custom);
setOperationAction(ISD::CTPOP, MVT::v4i32, Custom);
setOperationAction(ISD::CTPOP, MVT::v4i16, Custom);
setOperationAction(ISD::CTPOP, MVT::v8i16, Custom);
setOperationAction(ISD::CTPOP, MVT::v1i64, Expand);
setOperationAction(ISD::CTPOP, MVT::v2i64, Expand);
setOperationAction(ISD::CTLZ, MVT::v1i64, Expand);
setOperationAction(ISD::CTLZ, MVT::v2i64, Expand);
// NEON does not have single instruction CTTZ for vectors.
setOperationAction(ISD::CTTZ, MVT::v8i8, Custom);
setOperationAction(ISD::CTTZ, MVT::v4i16, Custom);
setOperationAction(ISD::CTTZ, MVT::v2i32, Custom);
setOperationAction(ISD::CTTZ, MVT::v1i64, Custom);
setOperationAction(ISD::CTTZ, MVT::v16i8, Custom);
setOperationAction(ISD::CTTZ, MVT::v8i16, Custom);
setOperationAction(ISD::CTTZ, MVT::v4i32, Custom);
setOperationAction(ISD::CTTZ, MVT::v2i64, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v8i8, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v4i16, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v2i32, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v1i64, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v16i8, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v8i16, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v4i32, Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::v2i64, Custom);
// NEON only has FMA instructions as of VFP4.
if (!Subtarget->hasVFP4()) {
setOperationAction(ISD::FMA, MVT::v2f32, Expand);
setOperationAction(ISD::FMA, MVT::v4f32, Expand);
}
setTargetDAGCombine(ISD::INTRINSIC_VOID);
setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
setTargetDAGCombine(ISD::INTRINSIC_WO_CHAIN);
setTargetDAGCombine(ISD::SHL);
setTargetDAGCombine(ISD::SRL);
setTargetDAGCombine(ISD::SRA);
setTargetDAGCombine(ISD::SIGN_EXTEND);
setTargetDAGCombine(ISD::ZERO_EXTEND);
setTargetDAGCombine(ISD::ANY_EXTEND);
setTargetDAGCombine(ISD::BUILD_VECTOR);
setTargetDAGCombine(ISD::VECTOR_SHUFFLE);
setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
setTargetDAGCombine(ISD::STORE);
setTargetDAGCombine(ISD::FP_TO_SINT);
setTargetDAGCombine(ISD::FP_TO_UINT);
setTargetDAGCombine(ISD::FDIV);
setTargetDAGCombine(ISD::LOAD);
// It is legal to extload from v4i8 to v4i16 or v4i32.
for (MVT Ty : {MVT::v8i8, MVT::v4i8, MVT::v2i8, MVT::v4i16, MVT::v2i16,
MVT::v2i32}) {
for (MVT VT : MVT::integer_vector_valuetypes()) {
setLoadExtAction(ISD::EXTLOAD, VT, Ty, Legal);
setLoadExtAction(ISD::ZEXTLOAD, VT, Ty, Legal);
setLoadExtAction(ISD::SEXTLOAD, VT, Ty, Legal);
}
}
}
if (Subtarget->isFPOnlySP()) {
// When targeting a floating-point unit with only single-precision
// operations, f64 is legal for the few double-precision instructions which
// are present. However, no double-precision operations other than moves,
// loads and stores are provided by the hardware.
setOperationAction(ISD::FADD, MVT::f64, Expand);
setOperationAction(ISD::FSUB, MVT::f64, Expand);
setOperationAction(ISD::FMUL, MVT::f64, Expand);
setOperationAction(ISD::FMA, MVT::f64, Expand);
setOperationAction(ISD::FDIV, MVT::f64, Expand);
setOperationAction(ISD::FREM, MVT::f64, Expand);
setOperationAction(ISD::FCOPYSIGN, MVT::f64, Expand);
setOperationAction(ISD::FGETSIGN, MVT::f64, Expand);
setOperationAction(ISD::FNEG, MVT::f64, Expand);
setOperationAction(ISD::FABS, MVT::f64, Expand);
setOperationAction(ISD::FSQRT, MVT::f64, Expand);
setOperationAction(ISD::FSIN, MVT::f64, Expand);
setOperationAction(ISD::FCOS, MVT::f64, Expand);
setOperationAction(ISD::FPOW, MVT::f64, Expand);
setOperationAction(ISD::FLOG, MVT::f64, Expand);
setOperationAction(ISD::FLOG2, MVT::f64, Expand);
setOperationAction(ISD::FLOG10, MVT::f64, Expand);
setOperationAction(ISD::FEXP, MVT::f64, Expand);
setOperationAction(ISD::FEXP2, MVT::f64, Expand);
setOperationAction(ISD::FCEIL, MVT::f64, Expand);
setOperationAction(ISD::FTRUNC, MVT::f64, Expand);
setOperationAction(ISD::FRINT, MVT::f64, Expand);
setOperationAction(ISD::FNEARBYINT, MVT::f64, Expand);
setOperationAction(ISD::FFLOOR, MVT::f64, Expand);
setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::i32, Custom);
setOperationAction(ISD::FP_TO_SINT, MVT::i32, Custom);
setOperationAction(ISD::FP_TO_UINT, MVT::i32, Custom);
setOperationAction(ISD::FP_TO_SINT, MVT::f64, Custom);
setOperationAction(ISD::FP_TO_UINT, MVT::f64, Custom);
setOperationAction(ISD::FP_ROUND, MVT::f32, Custom);
setOperationAction(ISD::FP_EXTEND, MVT::f64, Custom);
}
computeRegisterProperties(Subtarget->getRegisterInfo());
// ARM does not have floating-point extending loads.
for (MVT VT : MVT::fp_valuetypes()) {
setLoadExtAction(ISD::EXTLOAD, VT, MVT::f32, Expand);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::f16, Expand);
}
// ... or truncating stores
setTruncStoreAction(MVT::f64, MVT::f32, Expand);
setTruncStoreAction(MVT::f32, MVT::f16, Expand);
setTruncStoreAction(MVT::f64, MVT::f16, Expand);
// ARM does not have i1 sign extending load.
for (MVT VT : MVT::integer_valuetypes())
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i1, Promote);
// ARM supports all 4 flavors of integer indexed load / store.
if (!Subtarget->isThumb1Only()) {
for (unsigned im = (unsigned)ISD::PRE_INC;
im != (unsigned)ISD::LAST_INDEXED_MODE; ++im) {
setIndexedLoadAction(im, MVT::i1, Legal);
setIndexedLoadAction(im, MVT::i8, Legal);
setIndexedLoadAction(im, MVT::i16, Legal);
setIndexedLoadAction(im, MVT::i32, Legal);
setIndexedStoreAction(im, MVT::i1, Legal);
setIndexedStoreAction(im, MVT::i8, Legal);
setIndexedStoreAction(im, MVT::i16, Legal);
setIndexedStoreAction(im, MVT::i32, Legal);
}
} else {
// Thumb-1 has limited post-inc load/store support - LDM r0!, {r1}.
setIndexedLoadAction(ISD::POST_INC, MVT::i32, Legal);
setIndexedStoreAction(ISD::POST_INC, MVT::i32, Legal);
}
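// Pre-indexed forms update the base register before the access (e.g.
// "ldr r0, [r1, #4]!") while post-indexed forms update it afterwards
// (e.g. "ldr r0, [r1], #4"); marking them Legal allows the address
// arithmetic to be folded into the memory operation.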
setOperationAction(ISD::SADDO, MVT::i32, Custom);
setOperationAction(ISD::UADDO, MVT::i32, Custom);
setOperationAction(ISD::SSUBO, MVT::i32, Custom);
setOperationAction(ISD::USUBO, MVT::i32, Custom);
// i64 operation support.
setOperationAction(ISD::MUL, MVT::i64, Expand);
setOperationAction(ISD::MULHU, MVT::i32, Expand);
if (Subtarget->isThumb1Only()) {
setOperationAction(ISD::UMUL_LOHI, MVT::i32, Expand);
setOperationAction(ISD::SMUL_LOHI, MVT::i32, Expand);
}
if (Subtarget->isThumb1Only() || !Subtarget->hasV6Ops()
|| (Subtarget->isThumb2() && !Subtarget->hasDSP()))
setOperationAction(ISD::MULHS, MVT::i32, Expand);
setOperationAction(ISD::SHL_PARTS, MVT::i32, Custom);
setOperationAction(ISD::SRA_PARTS, MVT::i32, Custom);
setOperationAction(ISD::SRL_PARTS, MVT::i32, Custom);
setOperationAction(ISD::SRL, MVT::i64, Custom);
setOperationAction(ISD::SRA, MVT::i64, Custom);
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i64, Custom);
setOperationAction(ISD::ADDC, MVT::i32, Custom);
setOperationAction(ISD::ADDE, MVT::i32, Custom);
setOperationAction(ISD::SUBC, MVT::i32, Custom);
setOperationAction(ISD::SUBE, MVT::i32, Custom);
if (!Subtarget->isThumb1Only() && Subtarget->hasV6T2Ops())
setOperationAction(ISD::BITREVERSE, MVT::i32, Legal);
// ARM does not have ROTL.
setOperationAction(ISD::ROTL, MVT::i32, Expand);
for (MVT VT : MVT::vector_valuetypes()) {
setOperationAction(ISD::ROTL, VT, Expand);
setOperationAction(ISD::ROTR, VT, Expand);
}
setOperationAction(ISD::CTTZ, MVT::i32, Custom);
setOperationAction(ISD::CTPOP, MVT::i32, Expand);
if (!Subtarget->hasV5TOps() || Subtarget->isThumb1Only())
setOperationAction(ISD::CTLZ, MVT::i32, Expand);
// @llvm.readcyclecounter requires the Performance Monitors extension.
// Default to the 0 expansion on unsupported platforms.
// FIXME: Technically there are older ARM CPUs that have
// implementation-specific ways of obtaining this information.
if (Subtarget->hasPerfMon())
setOperationAction(ISD::READCYCLECOUNTER, MVT::i64, Custom);
// Only ARMv6 and later have BSWAP (the REV instruction).
if (!Subtarget->hasV6Ops())
setOperationAction(ISD::BSWAP, MVT::i32, Expand);
bool hasDivide = Subtarget->isThumb() ? Subtarget->hasDivideInThumbMode()
: Subtarget->hasDivideInARMMode();
if (!hasDivide) {
// These are expanded into libcalls if the CPU doesn't have a hardware divider.
setOperationAction(ISD::SDIV, MVT::i32, LibCall);
setOperationAction(ISD::UDIV, MVT::i32, LibCall);
}
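// The LibCall marking makes the legalizer emit the corresponding RTLIB
// routine directly; on AEABI targets these are the __aeabi_idiv and
// __aeabi_uidiv helpers registered in the table earlier in this
// constructor.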
if (Subtarget->isTargetWindows() && !Subtarget->hasDivideInThumbMode()) {
setOperationAction(ISD::SDIV, MVT::i32, Custom);
setOperationAction(ISD::UDIV, MVT::i32, Custom);
setOperationAction(ISD::SDIV, MVT::i64, Custom);
setOperationAction(ISD::UDIV, MVT::i64, Custom);
}
setOperationAction(ISD::SREM, MVT::i32, Expand);
setOperationAction(ISD::UREM, MVT::i32, Expand);
// Register based DivRem for AEABI (RTABI 4.2)
if (Subtarget->isTargetAEABI() || Subtarget->isTargetAndroid() ||
Subtarget->isTargetGNUAEABI() || Subtarget->isTargetMuslAEABI() ||
Subtarget->isTargetWindows()) {
setOperationAction(ISD::SREM, MVT::i64, Custom);
setOperationAction(ISD::UREM, MVT::i64, Custom);
HasStandaloneRem = false;
if (Subtarget->isTargetWindows()) {
const struct {
const RTLIB::Libcall Op;
const char * const Name;
const CallingConv::ID CC;
} LibraryCalls[] = {
{ RTLIB::SDIVREM_I8, "__rt_sdiv", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I16, "__rt_sdiv", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I32, "__rt_sdiv", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I64, "__rt_sdiv64", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I8, "__rt_udiv", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I16, "__rt_udiv", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I32, "__rt_udiv", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I64, "__rt_udiv64", CallingConv::ARM_AAPCS },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
}
} else {
const struct {
const RTLIB::Libcall Op;
const char * const Name;
const CallingConv::ID CC;
} LibraryCalls[] = {
{ RTLIB::SDIVREM_I8, "__aeabi_idivmod", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I16, "__aeabi_idivmod", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I32, "__aeabi_idivmod", CallingConv::ARM_AAPCS },
{ RTLIB::SDIVREM_I64, "__aeabi_ldivmod", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I8, "__aeabi_uidivmod", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I16, "__aeabi_uidivmod", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I32, "__aeabi_uidivmod", CallingConv::ARM_AAPCS },
{ RTLIB::UDIVREM_I64, "__aeabi_uldivmod", CallingConv::ARM_AAPCS },
};
for (const auto &LC : LibraryCalls) {
setLibcallName(LC.Op, LC.Name);
setLibcallCallingConv(LC.Op, LC.CC);
}
}
setOperationAction(ISD::SDIVREM, MVT::i32, Custom);
setOperationAction(ISD::UDIVREM, MVT::i32, Custom);
setOperationAction(ISD::SDIVREM, MVT::i64, Custom);
setOperationAction(ISD::UDIVREM, MVT::i64, Custom);
} else {
setOperationAction(ISD::SDIVREM, MVT::i32, Expand);
setOperationAction(ISD::UDIVREM, MVT::i32, Expand);
}
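// __aeabi_idivmod and friends return the quotient in r0 and the remainder
// in r1, so a single call produces both results of an ISD::SDIVREM or
// ISD::UDIVREM node; that is why the divrem nodes (and i64 SREM/UREM) are
// marked Custom in this block.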
if (Subtarget->isTargetWindows() && Subtarget->getTargetTriple().isOSMSVCRT())
for (auto &VT : {MVT::f32, MVT::f64})
setOperationAction(ISD::FPOWI, VT, Custom);
setOperationAction(ISD::GlobalAddress, MVT::i32, Custom);
setOperationAction(ISD::ConstantPool, MVT::i32, Custom);
setOperationAction(ISD::GlobalTLSAddress, MVT::i32, Custom);
setOperationAction(ISD::BlockAddress, MVT::i32, Custom);
setOperationAction(ISD::TRAP, MVT::Other, Legal);
// Use the default implementation.
setOperationAction(ISD::VASTART, MVT::Other, Custom);
setOperationAction(ISD::VAARG, MVT::Other, Expand);
setOperationAction(ISD::VACOPY, MVT::Other, Expand);
setOperationAction(ISD::VAEND, MVT::Other, Expand);
setOperationAction(ISD::STACKSAVE, MVT::Other, Expand);
setOperationAction(ISD::STACKRESTORE, MVT::Other, Expand);
if (Subtarget->getTargetTriple().isWindowsItaniumEnvironment())
setOperationAction(ISD::DYNAMIC_STACKALLOC, MVT::i32, Custom);
else
setOperationAction(ISD::DYNAMIC_STACKALLOC, MVT::i32, Expand);
// ARMv6 Thumb1 (except for CPUs that support dmb / dsb) and earlier use
// the default expansion.
InsertFencesForAtomic = false;
if (Subtarget->hasAnyDataBarrier() &&
(!Subtarget->isThumb() || Subtarget->hasV8MBaselineOps())) {
// ATOMIC_FENCE needs custom lowering; the others should have been expanded
// to ldrex/strex loops already.
setOperationAction(ISD::ATOMIC_FENCE, MVT::Other, Custom);
if (!Subtarget->isThumb() || !Subtarget->isMClass())
setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i64, Custom);
// On v8, we have particularly efficient implementations of atomic fences
// if they can be combined with nearby atomic loads and stores.
if (!Subtarget->hasV8Ops() || getTargetMachine().getOptLevel() == 0) {
// Automatically insert fences (dmb ish) around ATOMIC_SWAP etc.
InsertFencesForAtomic = true;
}
} else {
// If there's anything we can use as a barrier, go through custom lowering
// for ATOMIC_FENCE.
// If the target has DMB in Thumb mode, fences can be inserted.
if (Subtarget->hasDataBarrier())
InsertFencesForAtomic = true;
setOperationAction(ISD::ATOMIC_FENCE, MVT::Other,
Subtarget->hasAnyDataBarrier() ? Custom : Expand);
// Set them all for expansion, which will force libcalls.
setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_SWAP, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_ADD, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_SUB, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_AND, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_OR, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_XOR, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_NAND, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_MIN, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_MAX, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_UMIN, MVT::i32, Expand);
setOperationAction(ISD::ATOMIC_LOAD_UMAX, MVT::i32, Expand);
// Mark ATOMIC_LOAD and ATOMIC_STORE custom so we can handle the
// Unordered/Monotonic case.
if (!InsertFencesForAtomic) {
setOperationAction(ISD::ATOMIC_LOAD, MVT::i32, Custom);
setOperationAction(ISD::ATOMIC_STORE, MVT::i32, Custom);
}
}
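// With the Expand markings above, each 32-bit atomic operation is lowered
// to a call into the __sync_* family of runtime helpers, leaving atomicity
// to the platform runtime.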
setOperationAction(ISD::PREFETCH, MVT::Other, Custom);
// Requires SXTB/SXTH, available on v6 and up in both ARM and Thumb modes.
if (!Subtarget->hasV6Ops()) {
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i16, Expand);
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i8, Expand);
}
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i1, Expand);
if (!Subtarget->useSoftFloat() && Subtarget->hasVFP2() &&
!Subtarget->isThumb1Only()) {
// Turn f64->i64 into VMOVRRD and i64->f64 into VMOVDRR,
// iff the target supports VFP2.
setOperationAction(ISD::BITCAST, MVT::i64, Custom);
setOperationAction(ISD::FLT_ROUNDS_, MVT::i32, Custom);
}
// We want to custom lower some of our intrinsics.
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::Other, Custom);
setOperationAction(ISD::EH_SJLJ_SETJMP, MVT::i32, Custom);
setOperationAction(ISD::EH_SJLJ_LONGJMP, MVT::Other, Custom);
setOperationAction(ISD::EH_SJLJ_SETUP_DISPATCH, MVT::Other, Custom);
if (Subtarget->useSjLjEH())
setLibcallName(RTLIB::UNWIND_RESUME, "_Unwind_SjLj_Resume");
setOperationAction(ISD::SETCC, MVT::i32, Expand);
setOperationAction(ISD::SETCC, MVT::f32, Expand);
setOperationAction(ISD::SETCC, MVT::f64, Expand);
setOperationAction(ISD::SELECT, MVT::i32, Custom);
setOperationAction(ISD::SELECT, MVT::f32, Custom);
setOperationAction(ISD::SELECT, MVT::f64, Custom);
setOperationAction(ISD::SELECT_CC, MVT::i32, Custom);
setOperationAction(ISD::SELECT_CC, MVT::f32, Custom);
setOperationAction(ISD::SELECT_CC, MVT::f64, Custom);
// Thumb-1 cannot currently select ARMISD::SUBE.
if (!Subtarget->isThumb1Only())
setOperationAction(ISD::SETCCE, MVT::i32, Custom);
setOperationAction(ISD::BRCOND, MVT::Other, Expand);
setOperationAction(ISD::BR_CC, MVT::i32, Custom);
setOperationAction(ISD::BR_CC, MVT::f32, Custom);
setOperationAction(ISD::BR_CC, MVT::f64, Custom);
setOperationAction(ISD::BR_JT, MVT::Other, Custom);
// We don't support sin/cos/fmod/copysign/pow
setOperationAction(ISD::FSIN, MVT::f64, Expand);
setOperationAction(ISD::FSIN, MVT::f32, Expand);
setOperationAction(ISD::FCOS, MVT::f32, Expand);
setOperationAction(ISD::FCOS, MVT::f64, Expand);
setOperationAction(ISD::FSINCOS, MVT::f64, Expand);
setOperationAction(ISD::FSINCOS, MVT::f32, Expand);
setOperationAction(ISD::FREM, MVT::f64, Expand);
setOperationAction(ISD::FREM, MVT::f32, Expand);
if (!Subtarget->useSoftFloat() && Subtarget->hasVFP2() &&
!Subtarget->isThumb1Only()) {
setOperationAction(ISD::FCOPYSIGN, MVT::f64, Custom);
setOperationAction(ISD::FCOPYSIGN, MVT::f32, Custom);
}
setOperationAction(ISD::FPOW, MVT::f64, Expand);
setOperationAction(ISD::FPOW, MVT::f32, Expand);
if (!Subtarget->hasVFP4()) {
setOperationAction(ISD::FMA, MVT::f64, Expand);
setOperationAction(ISD::FMA, MVT::f32, Expand);
}
// Various VFP goodness
if (!Subtarget->useSoftFloat() && !Subtarget->isThumb1Only()) {
// FP-ARMv8 adds f64 <-> f16 conversion. Before that it should be expanded.
if (!Subtarget->hasFPARMv8() || Subtarget->isFPOnlySP()) {
setOperationAction(ISD::FP16_TO_FP, MVT::f64, Expand);
setOperationAction(ISD::FP_TO_FP16, MVT::f64, Expand);
}
// fp16 is a special v7 extension that adds f16 <-> f32 conversions.
if (!Subtarget->hasFP16()) {
setOperationAction(ISD::FP16_TO_FP, MVT::f32, Expand);
setOperationAction(ISD::FP_TO_FP16, MVT::f32, Expand);
}
}
// Combine sin / cos into one node or libcall if possible.
if (Subtarget->hasSinCos()) {
setLibcallName(RTLIB::SINCOS_F32, "sincosf");
setLibcallName(RTLIB::SINCOS_F64, "sincos");
if (Subtarget->isTargetWatchABI()) {
setLibcallCallingConv(RTLIB::SINCOS_F32, CallingConv::ARM_AAPCS_VFP);
setLibcallCallingConv(RTLIB::SINCOS_F64, CallingConv::ARM_AAPCS_VFP);
}
if (Subtarget->isTargetIOS() || Subtarget->isTargetWatchOS()) {
// For iOS, we don't want the normal expansion of a libcall to sincos;
// we want to issue a libcall to __sincos_stret instead.
setOperationAction(ISD::FSINCOS, MVT::f64, Custom);
setOperationAction(ISD::FSINCOS, MVT::f32, Custom);
}
}
// FP-ARMv8 implements a lot of rounding-like FP operations.
if (Subtarget->hasFPARMv8()) {
setOperationAction(ISD::FFLOOR, MVT::f32, Legal);
setOperationAction(ISD::FCEIL, MVT::f32, Legal);
setOperationAction(ISD::FROUND, MVT::f32, Legal);
setOperationAction(ISD::FTRUNC, MVT::f32, Legal);
setOperationAction(ISD::FNEARBYINT, MVT::f32, Legal);
setOperationAction(ISD::FRINT, MVT::f32, Legal);
setOperationAction(ISD::FMINNUM, MVT::f32, Legal);
setOperationAction(ISD::FMAXNUM, MVT::f32, Legal);
setOperationAction(ISD::FMINNUM, MVT::v2f32, Legal);
setOperationAction(ISD::FMAXNUM, MVT::v2f32, Legal);
setOperationAction(ISD::FMINNUM, MVT::v4f32, Legal);
setOperationAction(ISD::FMAXNUM, MVT::v4f32, Legal);
if (!Subtarget->isFPOnlySP()) {
setOperationAction(ISD::FFLOOR, MVT::f64, Legal);
setOperationAction(ISD::FCEIL, MVT::f64, Legal);
setOperationAction(ISD::FROUND, MVT::f64, Legal);
setOperationAction(ISD::FTRUNC, MVT::f64, Legal);
setOperationAction(ISD::FNEARBYINT, MVT::f64, Legal);
setOperationAction(ISD::FRINT, MVT::f64, Legal);
setOperationAction(ISD::FMINNUM, MVT::f64, Legal);
setOperationAction(ISD::FMAXNUM, MVT::f64, Legal);
}
}
if (Subtarget->hasNEON()) {
// vmin and vmax aren't available in a scalar form, so we use
// a NEON instruction with an undef lane instead.
setOperationAction(ISD::FMINNAN, MVT::f32, Legal);
setOperationAction(ISD::FMAXNAN, MVT::f32, Legal);
setOperationAction(ISD::FMINNAN, MVT::v2f32, Legal);
setOperationAction(ISD::FMAXNAN, MVT::v2f32, Legal);
setOperationAction(ISD::FMINNAN, MVT::v4f32, Legal);
setOperationAction(ISD::FMAXNAN, MVT::v4f32, Legal);
}
// We have target-specific dag combine patterns for the following nodes:
// ARMISD::VMOVRRD - No need to call setTargetDAGCombine
setTargetDAGCombine(ISD::ADD);
setTargetDAGCombine(ISD::SUB);
setTargetDAGCombine(ISD::MUL);
setTargetDAGCombine(ISD::AND);
setTargetDAGCombine(ISD::OR);
setTargetDAGCombine(ISD::XOR);
if (Subtarget->hasV6Ops())
setTargetDAGCombine(ISD::SRL);
setStackPointerRegisterToSaveRestore(ARM::SP);
if (Subtarget->useSoftFloat() || Subtarget->isThumb1Only() ||
!Subtarget->hasVFP2())
setSchedulingPreference(Sched::RegPressure);
else
setSchedulingPreference(Sched::Hybrid);
//// temporary - rewrite interface to use type
MaxStoresPerMemset = 8;
MaxStoresPerMemsetOptSize = 4;
MaxStoresPerMemcpy = 4; // For @llvm.memcpy -> sequence of stores
MaxStoresPerMemcpyOptSize = 2;
MaxStoresPerMemmove = 4; // For @llvm.memmove -> sequence of stores
MaxStoresPerMemmoveOptSize = 2;
// On ARM arguments smaller than 4 bytes are extended, so all arguments
// are at least 4 bytes aligned.
setMinStackArgumentAlignment(4);
// Prefer likely predicted branches to selects on out-of-order cores.
PredictableSelectIsExpensive = Subtarget->getSchedModel().isOutOfOrder();
setMinFunctionAlignment(Subtarget->isThumb() ? 1 : 2);
}
bool ARMTargetLowering::useSoftFloat() const {
return Subtarget->useSoftFloat();
}
// FIXME: It might make sense to define the representative register class as the
// nearest super-register that has a non-null superset. For example, DPR_VFP2 is
// a super-register of SPR, and DPR is a superset of DPR_VFP2. Consequently,
// SPR's representative would be DPR_VFP2. This should work well if register
// pressure tracking were modified such that a register use would increment the
// pressure of the register class's representative and all of its super
// classes' representatives transitively. We have not implemented this because
// of the difficulty prior to coalescing of modeling operand register classes
// due to the common occurrence of cross class copies and subregister insertions
// and extractions.
std::pair<const TargetRegisterClass *, uint8_t>
ARMTargetLowering::findRepresentativeClass(const TargetRegisterInfo *TRI,
MVT VT) const {
const TargetRegisterClass *RRC = nullptr;
uint8_t Cost = 1;
switch (VT.SimpleTy) {
default:
return TargetLowering::findRepresentativeClass(TRI, VT);
// Use DPR as representative register class for all floating point
// and vector types. Since there are 32 SPR registers and 32 DPR registers,
// the cost is 1 for both f32 and f64.
case MVT::f32: case MVT::f64: case MVT::v8i8: case MVT::v4i16:
case MVT::v2i32: case MVT::v1i64: case MVT::v2f32:
RRC = &ARM::DPRRegClass;
// When NEON is used for SP, only half of the register file is available
// because operations that define both SP and DP results will be constrained
// to the VFP2 class (D0-D15). We currently model this constraint prior to
// coalescing by double-counting the SP regs. See the FIXME above.
if (Subtarget->useNEONForSinglePrecisionFP())
Cost = 2;
break;
case MVT::v16i8: case MVT::v8i16: case MVT::v4i32: case MVT::v2i64:
case MVT::v4f32: case MVT::v2f64:
RRC = &ARM::DPRRegClass;
Cost = 2;
break;
case MVT::v4i64:
RRC = &ARM::DPRRegClass;
Cost = 4;
break;
case MVT::v8i64:
RRC = &ARM::DPRRegClass;
Cost = 8;
break;
}
return std::make_pair(RRC, Cost);
}
const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {
switch ((ARMISD::NodeType)Opcode) {
case ARMISD::FIRST_NUMBER: break;
case ARMISD::Wrapper: return "ARMISD::Wrapper";
case ARMISD::WrapperPIC: return "ARMISD::WrapperPIC";
case ARMISD::WrapperJT: return "ARMISD::WrapperJT";
case ARMISD::COPY_STRUCT_BYVAL: return "ARMISD::COPY_STRUCT_BYVAL";
case ARMISD::CALL: return "ARMISD::CALL";
case ARMISD::CALL_PRED: return "ARMISD::CALL_PRED";
case ARMISD::CALL_NOLINK: return "ARMISD::CALL_NOLINK";
case ARMISD::BRCOND: return "ARMISD::BRCOND";
case ARMISD::BR_JT: return "ARMISD::BR_JT";
case ARMISD::BR2_JT: return "ARMISD::BR2_JT";
case ARMISD::RET_FLAG: return "ARMISD::RET_FLAG";
case ARMISD::INTRET_FLAG: return "ARMISD::INTRET_FLAG";
case ARMISD::PIC_ADD: return "ARMISD::PIC_ADD";
case ARMISD::CMP: return "ARMISD::CMP";
case ARMISD::CMN: return "ARMISD::CMN";
case ARMISD::CMPZ: return "ARMISD::CMPZ";
case ARMISD::CMPFP: return "ARMISD::CMPFP";
case ARMISD::CMPFPw0: return "ARMISD::CMPFPw0";
case ARMISD::BCC_i64: return "ARMISD::BCC_i64";
case ARMISD::FMSTAT: return "ARMISD::FMSTAT";
case ARMISD::CMOV: return "ARMISD::CMOV";
case ARMISD::SSAT: return "ARMISD::SSAT";
case ARMISD::SRL_FLAG: return "ARMISD::SRL_FLAG";
case ARMISD::SRA_FLAG: return "ARMISD::SRA_FLAG";
case ARMISD::RRX: return "ARMISD::RRX";
case ARMISD::ADDC: return "ARMISD::ADDC";
case ARMISD::ADDE: return "ARMISD::ADDE";
case ARMISD::SUBC: return "ARMISD::SUBC";
case ARMISD::SUBE: return "ARMISD::SUBE";
case ARMISD::VMOVRRD: return "ARMISD::VMOVRRD";
case ARMISD::VMOVDRR: return "ARMISD::VMOVDRR";
case ARMISD::EH_SJLJ_SETJMP: return "ARMISD::EH_SJLJ_SETJMP";
case ARMISD::EH_SJLJ_LONGJMP: return "ARMISD::EH_SJLJ_LONGJMP";
case ARMISD::EH_SJLJ_SETUP_DISPATCH: return "ARMISD::EH_SJLJ_SETUP_DISPATCH";
case ARMISD::TC_RETURN: return "ARMISD::TC_RETURN";
case ARMISD::THREAD_POINTER:return "ARMISD::THREAD_POINTER";
case ARMISD::DYN_ALLOC: return "ARMISD::DYN_ALLOC";
case ARMISD::MEMBARRIER_MCR: return "ARMISD::MEMBARRIER_MCR";
case ARMISD::PRELOAD: return "ARMISD::PRELOAD";
case ARMISD::WIN__CHKSTK: return "ARMISD::WIN__CHKSTK";
case ARMISD::WIN__DBZCHK: return "ARMISD::WIN__DBZCHK";
case ARMISD::VCEQ: return "ARMISD::VCEQ";
case ARMISD::VCEQZ: return "ARMISD::VCEQZ";
case ARMISD::VCGE: return "ARMISD::VCGE";
case ARMISD::VCGEZ: return "ARMISD::VCGEZ";
case ARMISD::VCLEZ: return "ARMISD::VCLEZ";
case ARMISD::VCGEU: return "ARMISD::VCGEU";
case ARMISD::VCGT: return "ARMISD::VCGT";
case ARMISD::VCGTZ: return "ARMISD::VCGTZ";
case ARMISD::VCLTZ: return "ARMISD::VCLTZ";
case ARMISD::VCGTU: return "ARMISD::VCGTU";
case ARMISD::VTST: return "ARMISD::VTST";
case ARMISD::VSHL: return "ARMISD::VSHL";
case ARMISD::VSHRs: return "ARMISD::VSHRs";
case ARMISD::VSHRu: return "ARMISD::VSHRu";
case ARMISD::VRSHRs: return "ARMISD::VRSHRs";
case ARMISD::VRSHRu: return "ARMISD::VRSHRu";
case ARMISD::VRSHRN: return "ARMISD::VRSHRN";
case ARMISD::VQSHLs: return "ARMISD::VQSHLs";
case ARMISD::VQSHLu: return "ARMISD::VQSHLu";
case ARMISD::VQSHLsu: return "ARMISD::VQSHLsu";
case ARMISD::VQSHRNs: return "ARMISD::VQSHRNs";
case ARMISD::VQSHRNu: return "ARMISD::VQSHRNu";
case ARMISD::VQSHRNsu: return "ARMISD::VQSHRNsu";
case ARMISD::VQRSHRNs: return "ARMISD::VQRSHRNs";
case ARMISD::VQRSHRNu: return "ARMISD::VQRSHRNu";
case ARMISD::VQRSHRNsu: return "ARMISD::VQRSHRNsu";
case ARMISD::VSLI: return "ARMISD::VSLI";
case ARMISD::VSRI: return "ARMISD::VSRI";
case ARMISD::VGETLANEu: return "ARMISD::VGETLANEu";
case ARMISD::VGETLANEs: return "ARMISD::VGETLANEs";
case ARMISD::VMOVIMM: return "ARMISD::VMOVIMM";
case ARMISD::VMVNIMM: return "ARMISD::VMVNIMM";
case ARMISD::VMOVFPIMM: return "ARMISD::VMOVFPIMM";
case ARMISD::VDUP: return "ARMISD::VDUP";
case ARMISD::VDUPLANE: return "ARMISD::VDUPLANE";
case ARMISD::VEXT: return "ARMISD::VEXT";
case ARMISD::VREV64: return "ARMISD::VREV64";
case ARMISD::VREV32: return "ARMISD::VREV32";
case ARMISD::VREV16: return "ARMISD::VREV16";
case ARMISD::VZIP: return "ARMISD::VZIP";
case ARMISD::VUZP: return "ARMISD::VUZP";
case ARMISD::VTRN: return "ARMISD::VTRN";
case ARMISD::VTBL1: return "ARMISD::VTBL1";
case ARMISD::VTBL2: return "ARMISD::VTBL2";
case ARMISD::VMULLs: return "ARMISD::VMULLs";
case ARMISD::VMULLu: return "ARMISD::VMULLu";
case ARMISD::UMAAL: return "ARMISD::UMAAL";
case ARMISD::UMLAL: return "ARMISD::UMLAL";
case ARMISD::SMLAL: return "ARMISD::SMLAL";
case ARMISD::SMLALBB: return "ARMISD::SMLALBB";
case ARMISD::SMLALBT: return "ARMISD::SMLALBT";
case ARMISD::SMLALTB: return "ARMISD::SMLALTB";
case ARMISD::SMLALTT: return "ARMISD::SMLALTT";
case ARMISD::SMULWB: return "ARMISD::SMULWB";
case ARMISD::SMULWT: return "ARMISD::SMULWT";
case ARMISD::SMLALD: return "ARMISD::SMLALD";
case ARMISD::SMLALDX: return "ARMISD::SMLALDX";
case ARMISD::SMLSLD: return "ARMISD::SMLSLD";
case ARMISD::SMLSLDX: return "ARMISD::SMLSLDX";
case ARMISD::BUILD_VECTOR: return "ARMISD::BUILD_VECTOR";
case ARMISD::BFI: return "ARMISD::BFI";
case ARMISD::VORRIMM: return "ARMISD::VORRIMM";
case ARMISD::VBICIMM: return "ARMISD::VBICIMM";
case ARMISD::VBSL: return "ARMISD::VBSL";
case ARMISD::MEMCPY: return "ARMISD::MEMCPY";
case ARMISD::VLD1DUP: return "ARMISD::VLD1DUP";
case ARMISD::VLD2DUP: return "ARMISD::VLD2DUP";
case ARMISD::VLD3DUP: return "ARMISD::VLD3DUP";
case ARMISD::VLD4DUP: return "ARMISD::VLD4DUP";
case ARMISD::VLD1_UPD: return "ARMISD::VLD1_UPD";
case ARMISD::VLD2_UPD: return "ARMISD::VLD2_UPD";
case ARMISD::VLD3_UPD: return "ARMISD::VLD3_UPD";
case ARMISD::VLD4_UPD: return "ARMISD::VLD4_UPD";
case ARMISD::VLD2LN_UPD: return "ARMISD::VLD2LN_UPD";
case ARMISD::VLD3LN_UPD: return "ARMISD::VLD3LN_UPD";
case ARMISD::VLD4LN_UPD: return "ARMISD::VLD4LN_UPD";
case ARMISD::VLD1DUP_UPD: return "ARMISD::VLD1DUP_UPD";
case ARMISD::VLD2DUP_UPD: return "ARMISD::VLD2DUP_UPD";
case ARMISD::VLD3DUP_UPD: return "ARMISD::VLD3DUP_UPD";
case ARMISD::VLD4DUP_UPD: return "ARMISD::VLD4DUP_UPD";
case ARMISD::VST1_UPD: return "ARMISD::VST1_UPD";
case ARMISD::VST2_UPD: return "ARMISD::VST2_UPD";
case ARMISD::VST3_UPD: return "ARMISD::VST3_UPD";
case ARMISD::VST4_UPD: return "ARMISD::VST4_UPD";
case ARMISD::VST2LN_UPD: return "ARMISD::VST2LN_UPD";
case ARMISD::VST3LN_UPD: return "ARMISD::VST3LN_UPD";
case ARMISD::VST4LN_UPD: return "ARMISD::VST4LN_UPD";
}
return nullptr;
}
EVT ARMTargetLowering::getSetCCResultType(const DataLayout &DL, LLVMContext &,
EVT VT) const {
if (!VT.isVector())
return getPointerTy(DL);
return VT.changeVectorElementTypeToInteger();
}
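// For example, with getSetCCResultType above a setcc of two v4f32 operands
// yields a v4i32 result whose lanes are all-ones or all-zeros, matching the
// mask format produced by NEON compare instructions (see the
// ZeroOrNegativeOneBooleanContent setting in the constructor).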
/// getRegClassFor - Return the register class that should be used for the
/// specified value type.
const TargetRegisterClass *ARMTargetLowering::getRegClassFor(MVT VT) const {
// Map v4i64 to QQ registers but do not make the type legal. Similarly map
// v8i64 to QQQQ registers. v4i64 and v8i64 are only used for REG_SEQUENCE to
// load / store 4 to 8 consecutive D registers.
if (Subtarget->hasNEON()) {
if (VT == MVT::v4i64)
return &ARM::QQPRRegClass;
if (VT == MVT::v8i64)
return &ARM::QQQQPRRegClass;
}
return TargetLowering::getRegClassFor(VT);
}
// memcpy and other memory intrinsics typically try to use LDM/STM if the
// source/dest is aligned and the copy size is large enough. We therefore want
// to align such objects when they are passed to memory intrinsics.
bool ARMTargetLowering::shouldAlignPointerArgs(CallInst *CI, unsigned &MinSize,
unsigned &PrefAlign) const {
if (!isa<MemIntrinsic>(CI))
return false;
MinSize = 8;
// On ARM11 onwards (excluding M class) 8-byte aligned LDM is typically 1
// cycle faster than 4-byte aligned LDM.
PrefAlign = (Subtarget->hasV6Ops() && !Subtarget->isMClass() ? 8 : 4);
return true;
}
// Create a fast isel object.
FastISel *
ARMTargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo) const {
return ARM::createFastISel(funcInfo, libInfo);
}
Sched::Preference ARMTargetLowering::getSchedulingPreference(SDNode *N) const {
unsigned NumVals = N->getNumValues();
if (!NumVals)
return Sched::RegPressure;
for (unsigned i = 0; i != NumVals; ++i) {
EVT VT = N->getValueType(i);
if (VT == MVT::Glue || VT == MVT::Other)
continue;
if (VT.isFloatingPoint() || VT.isVector())
return Sched::ILP;
}
if (!N->isMachineOpcode())
return Sched::RegPressure;
// Loads are scheduled for latency even if the instruction itinerary
// is not available.
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
const MCInstrDesc &MCID = TII->get(N->getMachineOpcode());
if (MCID.getNumDefs() == 0)
return Sched::RegPressure;
if (!Itins->isEmpty() &&
Itins->getOperandCycle(MCID.getSchedClass(), 0) > 2)
return Sched::ILP;
return Sched::RegPressure;
}
//===----------------------------------------------------------------------===//
// Lowering Code
//===----------------------------------------------------------------------===//
static bool isSRL16(const SDValue &Op) {
if (Op.getOpcode() != ISD::SRL)
return false;
if (auto Const = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
return Const->getZExtValue() == 16;
return false;
}
static bool isSRA16(const SDValue &Op) {
if (Op.getOpcode() != ISD::SRA)
return false;
if (auto Const = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
return Const->getZExtValue() == 16;
return false;
}
static bool isSHL16(const SDValue &Op) {
if (Op.getOpcode() != ISD::SHL)
return false;
if (auto Const = dyn_cast<ConstantSDNode>(Op.getOperand(1)))
return Const->getZExtValue() == 16;
return false;
}
// Check for a signed 16-bit value. We special-case SRA because it keeps
// things simpler when also looking for SRAs that aren't sign-extending a
// smaller value. Without the check, we'd need to take extra care with the
// checking order for some operations.
static bool isS16(const SDValue &Op, SelectionDAG &DAG) {
if (isSRA16(Op))
return isSHL16(Op.getOperand(0));
return DAG.ComputeNumSignBits(Op) == 17;
}
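// Rationale for isS16: ComputeNumSignBits(Op) == 17 on an i32 value means
// the top 17 bits all replicate the sign bit, i.e. the value is a
// sign-extended 16-bit quantity, which is what halfword multiplies such as
// SMULBB expect. The SRA-of-SHL special case catches the common
// sign_extend_inreg pattern directly.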
/// IntCCToARMCC - Convert a DAG integer condition code to an ARM CC
static ARMCC::CondCodes IntCCToARMCC(ISD::CondCode CC) {
switch (CC) {
default: llvm_unreachable("Unknown condition code!");
case ISD::SETNE: return ARMCC::NE;
case ISD::SETEQ: return ARMCC::EQ;
case ISD::SETGT: return ARMCC::GT;
case ISD::SETGE: return ARMCC::GE;
case ISD::SETLT: return ARMCC::LT;
case ISD::SETLE: return ARMCC::LE;
case ISD::SETUGT: return ARMCC::HI;
case ISD::SETUGE: return ARMCC::HS;
case ISD::SETULT: return ARMCC::LO;
case ISD::SETULE: return ARMCC::LS;
}
}
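// The unsigned cases above map onto ARM's carry-based condition codes:
// HI (unsigned higher), HS (unsigned higher or same), LO (unsigned lower)
// and LS (unsigned lower or same).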
/// FPCCToARMCC - Convert a DAG fp condition code to an ARM CC.
static void FPCCToARMCC(ISD::CondCode CC, ARMCC::CondCodes &CondCode,
ARMCC::CondCodes &CondCode2, bool &InvalidOnQNaN) {
CondCode2 = ARMCC::AL;
InvalidOnQNaN = true;
switch (CC) {
default: llvm_unreachable("Unknown FP condition!");
case ISD::SETEQ:
case ISD::SETOEQ:
CondCode = ARMCC::EQ;
InvalidOnQNaN = false;
break;
case ISD::SETGT:
case ISD::SETOGT: CondCode = ARMCC::GT; break;
case ISD::SETGE:
case ISD::SETOGE: CondCode = ARMCC::GE; break;
case ISD::SETOLT: CondCode = ARMCC::MI; break;
case ISD::SETOLE: CondCode = ARMCC::LS; break;
case ISD::SETONE:
CondCode = ARMCC::MI;
CondCode2 = ARMCC::GT;
InvalidOnQNaN = false;
break;
case ISD::SETO: CondCode = ARMCC::VC; break;
case ISD::SETUO: CondCode = ARMCC::VS; break;
case ISD::SETUEQ:
CondCode = ARMCC::EQ;
CondCode2 = ARMCC::VS;
InvalidOnQNaN = false;
break;
case ISD::SETUGT: CondCode = ARMCC::HI; break;
case ISD::SETUGE: CondCode = ARMCC::PL; break;
case ISD::SETLT:
case ISD::SETULT: CondCode = ARMCC::LT; break;
case ISD::SETLE:
case ISD::SETULE: CondCode = ARMCC::LE; break;
case ISD::SETNE:
case ISD::SETUNE:
CondCode = ARMCC::NE;
InvalidOnQNaN = false;
break;
}
}
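// Note on the two-code cases above: when CondCode2 is set to something
// other than AL, the comparison requires two predicated instructions, e.g.
// SETONE (ordered and not equal) is matched as "less than" (MI) or else
// "greater than" (GT) once the VFP compare result has been transferred to
// the CPSR flags via FMSTAT.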
//===----------------------------------------------------------------------===//
// Calling Convention Implementation
//===----------------------------------------------------------------------===//
#include "ARMGenCallingConv.inc"
/// getEffectiveCallingConv - Get the effective calling convention, taking into
/// account presence of floating point hardware and calling convention
/// limitations, such as support for variadic functions.
CallingConv::ID
ARMTargetLowering::getEffectiveCallingConv(CallingConv::ID CC,
bool isVarArg) const {
switch (CC) {
default:
llvm_unreachable("Unsupported calling convention");
case CallingConv::ARM_AAPCS:
case CallingConv::ARM_APCS:
case CallingConv::GHC:
return CC;
case CallingConv::PreserveMost:
return CallingConv::PreserveMost;
case CallingConv::ARM_AAPCS_VFP:
case CallingConv::Swift:
return isVarArg ? CallingConv::ARM_AAPCS : CallingConv::ARM_AAPCS_VFP;
case CallingConv::C:
if (!Subtarget->isAAPCS_ABI())
return CallingConv::ARM_APCS;
else if (Subtarget->hasVFP2() && !Subtarget->isThumb1Only() &&
getTargetMachine().Options.FloatABIType == FloatABI::Hard &&
!isVarArg)
return CallingConv::ARM_AAPCS_VFP;
else
return CallingConv::ARM_AAPCS;
case CallingConv::Fast:
case CallingConv::CXX_FAST_TLS:
if (!Subtarget->isAAPCS_ABI()) {
if (Subtarget->hasVFP2() && !Subtarget->isThumb1Only() && !isVarArg)
return CallingConv::Fast;
return CallingConv::ARM_APCS;
} else if (Subtarget->hasVFP2() && !Subtarget->isThumb1Only() && !isVarArg)
return CallingConv::ARM_AAPCS_VFP;
else
return CallingConv::ARM_AAPCS;
}
}
CCAssignFn *ARMTargetLowering::CCAssignFnForCall(CallingConv::ID CC,
bool isVarArg) const {
return CCAssignFnForNode(CC, false, isVarArg);
}
CCAssignFn *ARMTargetLowering::CCAssignFnForReturn(CallingConv::ID CC,
bool isVarArg) const {
return CCAssignFnForNode(CC, true, isVarArg);
}
/// CCAssignFnForNode - Selects the correct CCAssignFn for the given
/// CallingConvention.
CCAssignFn *ARMTargetLowering::CCAssignFnForNode(CallingConv::ID CC,
bool Return,
bool isVarArg) const {
switch (getEffectiveCallingConv(CC, isVarArg)) {
default:
llvm_unreachable("Unsupported calling convention");
case CallingConv::ARM_APCS:
return (Return ? RetCC_ARM_APCS : CC_ARM_APCS);
case CallingConv::ARM_AAPCS:
return (Return ? RetCC_ARM_AAPCS : CC_ARM_AAPCS);
case CallingConv::ARM_AAPCS_VFP:
return (Return ? RetCC_ARM_AAPCS_VFP : CC_ARM_AAPCS_VFP);
case CallingConv::Fast:
return (Return ? RetFastCC_ARM_APCS : FastCC_ARM_APCS);
case CallingConv::GHC:
return (Return ? RetCC_ARM_APCS : CC_ARM_APCS_GHC);
case CallingConv::PreserveMost:
return (Return ? RetCC_ARM_AAPCS : CC_ARM_AAPCS);
}
}
/// LowerCallResult - Lower the result values of a call into the
/// appropriate copies out of appropriate physical registers.
SDValue ARMTargetLowering::LowerCallResult(
SDValue Chain, SDValue InFlag, CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals, bool isThisReturn,
SDValue ThisVal) const {
// Assign locations to each value returned by this call.
SmallVector<CCValAssign, 16> RVLocs;
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), RVLocs,
*DAG.getContext());
CCInfo.AnalyzeCallResult(Ins, CCAssignFnForReturn(CallConv, isVarArg));
// Copy all of the result registers out of their specified physreg.
for (unsigned i = 0; i != RVLocs.size(); ++i) {
CCValAssign VA = RVLocs[i];
// Pass the 'this' value directly from the argument to the return value,
// to avoid register unit interference.
if (i == 0 && isThisReturn) {
assert(!VA.needsCustom() && VA.getLocVT() == MVT::i32 &&
"unexpected return calling convention register assignment");
InVals.push_back(ThisVal);
continue;
}
SDValue Val;
if (VA.needsCustom()) {
// Handle f64 or half of a v2f64.
SDValue Lo = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), MVT::i32,
InFlag);
Chain = Lo.getValue(1);
InFlag = Lo.getValue(2);
VA = RVLocs[++i]; // skip ahead to next loc
SDValue Hi = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), MVT::i32,
InFlag);
Chain = Hi.getValue(1);
InFlag = Hi.getValue(2);
if (!Subtarget->isLittle())
std::swap (Lo, Hi);
Val = DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, Lo, Hi);
if (VA.getLocVT() == MVT::v2f64) {
SDValue Vec = DAG.getNode(ISD::UNDEF, dl, MVT::v2f64);
Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64, Vec, Val,
DAG.getConstant(0, dl, MVT::i32));
VA = RVLocs[++i]; // skip ahead to next loc
Lo = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), MVT::i32, InFlag);
Chain = Lo.getValue(1);
InFlag = Lo.getValue(2);
VA = RVLocs[++i]; // skip ahead to next loc
Hi = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), MVT::i32, InFlag);
Chain = Hi.getValue(1);
InFlag = Hi.getValue(2);
if (!Subtarget->isLittle())
std::swap (Lo, Hi);
Val = DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, Lo, Hi);
Val = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64, Vec, Val,
DAG.getConstant(1, dl, MVT::i32));
}
} else {
Val = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), VA.getLocVT(),
InFlag);
Chain = Val.getValue(1);
InFlag = Val.getValue(2);
}
switch (VA.getLocInfo()) {
default: llvm_unreachable("Unknown loc info!");
case CCValAssign::Full: break;
case CCValAssign::BCvt:
Val = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), Val);
break;
}
InVals.push_back(Val);
}
return Chain;
}
/// LowerMemOpCallTo - Store the argument to the stack.
SDValue ARMTargetLowering::LowerMemOpCallTo(SDValue Chain, SDValue StackPtr,
SDValue Arg, const SDLoc &dl,
SelectionDAG &DAG,
const CCValAssign &VA,
ISD::ArgFlagsTy Flags) const {
unsigned LocMemOffset = VA.getLocMemOffset();
SDValue PtrOff = DAG.getIntPtrConstant(LocMemOffset, dl);
PtrOff = DAG.getNode(ISD::ADD, dl, getPointerTy(DAG.getDataLayout()),
StackPtr, PtrOff);
return DAG.getStore(
Chain, dl, Arg, PtrOff,
MachinePointerInfo::getStack(DAG.getMachineFunction(), LocMemOffset));
}
void ARMTargetLowering::PassF64ArgInRegs(const SDLoc &dl, SelectionDAG &DAG,
SDValue Chain, SDValue &Arg,
RegsToPassVector &RegsToPass,
CCValAssign &VA, CCValAssign &NextVA,
SDValue &StackPtr,
SmallVectorImpl<SDValue> &MemOpChains,
ISD::ArgFlagsTy Flags) const {
SDValue fmrrd = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), Arg);
unsigned id = Subtarget->isLittle() ? 0 : 1;
RegsToPass.push_back(std::make_pair(VA.getLocReg(), fmrrd.getValue(id)));
if (NextVA.isRegLoc())
RegsToPass.push_back(std::make_pair(NextVA.getLocReg(), fmrrd.getValue(1-id)));
else {
assert(NextVA.isMemLoc());
if (!StackPtr.getNode())
StackPtr = DAG.getCopyFromReg(Chain, dl, ARM::SP,
getPointerTy(DAG.getDataLayout()));
MemOpChains.push_back(LowerMemOpCallTo(Chain, StackPtr, fmrrd.getValue(1-id),
dl, DAG, NextVA,
Flags));
}
}
/// LowerCall - Lower a call into a callseq_start <-
/// ARMISD::CALL <- callseq_end chain. Also add input and output parameter
/// nodes.
SDValue
ARMTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
SmallVectorImpl<SDValue> &InVals) const {
SelectionDAG &DAG = CLI.DAG;
SDLoc &dl = CLI.DL;
SmallVectorImpl<ISD::OutputArg> &Outs = CLI.Outs;
SmallVectorImpl<SDValue> &OutVals = CLI.OutVals;
SmallVectorImpl<ISD::InputArg> &Ins = CLI.Ins;
SDValue Chain = CLI.Chain;
SDValue Callee = CLI.Callee;
bool &isTailCall = CLI.IsTailCall;
CallingConv::ID CallConv = CLI.CallConv;
bool doesNotRet = CLI.DoesNotReturn;
bool isVarArg = CLI.IsVarArg;
MachineFunction &MF = DAG.getMachineFunction();
bool isStructRet = (Outs.empty()) ? false : Outs[0].Flags.isSRet();
bool isThisReturn = false;
bool isSibCall = false;
auto Attr = MF.getFunction()->getFnAttribute("disable-tail-calls");
// Disable tail calls if they're not supported.
if (!Subtarget->supportsTailCall() || Attr.getValueAsString() == "true")
isTailCall = false;
if (isTailCall) {
// Check if it's really possible to do a tail call.
isTailCall = IsEligibleForTailCallOptimization(Callee, CallConv,
isVarArg, isStructRet, MF.getFunction()->hasStructRetAttr(),
Outs, OutVals, Ins, DAG);
if (!isTailCall && CLI.CS && CLI.CS->isMustTailCall())
report_fatal_error("failed to perform tail call elimination on a call "
"site marked musttail");
// We don't support GuaranteedTailCallOpt for ARM, only automatically
// detected sibcalls.
if (isTailCall) {
++NumTailCalls;
isSibCall = true;
}
}
// Analyze operands of the call, assigning locations to each operand.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), ArgLocs,
*DAG.getContext());
CCInfo.AnalyzeCallOperands(Outs, CCAssignFnForCall(CallConv, isVarArg));
// Get a count of how many bytes are to be pushed on the stack.
unsigned NumBytes = CCInfo.getNextStackOffset();
// For tail calls, memory operands are available in our caller's stack.
if (isSibCall)
NumBytes = 0;
// Adjust the stack pointer for the new arguments...
// These operations are automatically eliminated by the prolog/epilog pass
if (!isSibCall)
Chain = DAG.getCALLSEQ_START(Chain, NumBytes, 0, dl);
SDValue StackPtr =
DAG.getCopyFromReg(Chain, dl, ARM::SP, getPointerTy(DAG.getDataLayout()));
RegsToPassVector RegsToPass;
SmallVector<SDValue, 8> MemOpChains;
// Walk the register/memloc assignments, inserting copies/loads. In the case
// of tail call optimization, arguments are handled later.
for (unsigned i = 0, realArgIdx = 0, e = ArgLocs.size();
i != e;
++i, ++realArgIdx) {
CCValAssign &VA = ArgLocs[i];
SDValue Arg = OutVals[realArgIdx];
ISD::ArgFlagsTy Flags = Outs[realArgIdx].Flags;
bool isByVal = Flags.isByVal();
// Promote the value if needed.
switch (VA.getLocInfo()) {
default: llvm_unreachable("Unknown loc info!");
case CCValAssign::Full: break;
case CCValAssign::SExt:
Arg = DAG.getNode(ISD::SIGN_EXTEND, dl, VA.getLocVT(), Arg);
break;
case CCValAssign::ZExt:
Arg = DAG.getNode(ISD::ZERO_EXTEND, dl, VA.getLocVT(), Arg);
break;
case CCValAssign::AExt:
Arg = DAG.getNode(ISD::ANY_EXTEND, dl, VA.getLocVT(), Arg);
break;
case CCValAssign::BCvt:
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
break;
}
// f64 and v2f64 might be passed in i32 pairs and must be split into pieces
if (VA.needsCustom()) {
if (VA.getLocVT() == MVT::v2f64) {
SDValue Op0 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, Arg,
DAG.getConstant(0, dl, MVT::i32));
SDValue Op1 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, Arg,
DAG.getConstant(1, dl, MVT::i32));
PassF64ArgInRegs(dl, DAG, Chain, Op0, RegsToPass,
VA, ArgLocs[++i], StackPtr, MemOpChains, Flags);
VA = ArgLocs[++i]; // skip ahead to next loc
if (VA.isRegLoc()) {
PassF64ArgInRegs(dl, DAG, Chain, Op1, RegsToPass,
VA, ArgLocs[++i], StackPtr, MemOpChains, Flags);
} else {
assert(VA.isMemLoc());
MemOpChains.push_back(LowerMemOpCallTo(Chain, StackPtr, Op1,
dl, DAG, VA, Flags));
}
} else {
PassF64ArgInRegs(dl, DAG, Chain, Arg, RegsToPass, VA, ArgLocs[++i],
StackPtr, MemOpChains, Flags);
}
} else if (VA.isRegLoc()) {
if (realArgIdx == 0 && Flags.isReturned() && !Flags.isSwiftSelf() &&
Outs[0].VT == MVT::i32) {
assert(VA.getLocVT() == MVT::i32 &&
"unexpected calling convention register assignment");
assert(!Ins.empty() && Ins[0].VT == MVT::i32 &&
"unexpected use of 'returned'");
isThisReturn = true;
}
RegsToPass.push_back(std::make_pair(VA.getLocReg(), Arg));
} else if (isByVal) {
assert(VA.isMemLoc());
unsigned offset = 0;
// True if this byval aggregate will be split between registers
// and memory.
unsigned ByValArgsCount = CCInfo.getInRegsParamsCount();
unsigned CurByValIdx = CCInfo.getInRegsParamsProcessed();
if (CurByValIdx < ByValArgsCount) {
unsigned RegBegin, RegEnd;
CCInfo.getInRegsParamInfo(CurByValIdx, RegBegin, RegEnd);
EVT PtrVT =
DAG.getTargetLoweringInfo().getPointerTy(DAG.getDataLayout());
unsigned int i, j;
for (i = 0, j = RegBegin; j < RegEnd; i++, j++) {
SDValue Const = DAG.getConstant(4*i, dl, MVT::i32);
SDValue AddArg = DAG.getNode(ISD::ADD, dl, PtrVT, Arg, Const);
SDValue Load = DAG.getLoad(PtrVT, dl, Chain, AddArg,
MachinePointerInfo(),
DAG.InferPtrAlignment(AddArg));
MemOpChains.push_back(Load.getValue(1));
RegsToPass.push_back(std::make_pair(j, Load));
}
// If the parameter size exceeds the register area, the "offset" value
// helps us compute the stack slot for the remaining part properly.
offset = RegEnd - RegBegin;
CCInfo.nextInRegsParam();
}
if (Flags.getByValSize() > 4*offset) {
auto PtrVT = getPointerTy(DAG.getDataLayout());
unsigned LocMemOffset = VA.getLocMemOffset();
SDValue StkPtrOff = DAG.getIntPtrConstant(LocMemOffset, dl);
SDValue Dst = DAG.getNode(ISD::ADD, dl, PtrVT, StackPtr, StkPtrOff);
SDValue SrcOffset = DAG.getIntPtrConstant(4*offset, dl);
SDValue Src = DAG.getNode(ISD::ADD, dl, PtrVT, Arg, SrcOffset);
SDValue SizeNode = DAG.getConstant(Flags.getByValSize() - 4*offset, dl,
MVT::i32);
SDValue AlignNode = DAG.getConstant(Flags.getByValAlign(), dl,
MVT::i32);
SDVTList VTs = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue Ops[] = { Chain, Dst, Src, SizeNode, AlignNode};
MemOpChains.push_back(DAG.getNode(ARMISD::COPY_STRUCT_BYVAL, dl, VTs,
Ops));
}
} else if (!isSibCall) {
assert(VA.isMemLoc());
MemOpChains.push_back(LowerMemOpCallTo(Chain, StackPtr, Arg,
dl, DAG, VA, Flags));
}
}
if (!MemOpChains.empty())
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, MemOpChains);
// Build a sequence of copy-to-reg nodes chained together with token chain
// and flag operands which copy the outgoing args into the appropriate regs.
SDValue InFlag;
// Tail call byval lowering might overwrite argument registers so in case of
// tail call optimization the copies to registers are lowered later.
if (!isTailCall)
for (unsigned i = 0, e = RegsToPass.size(); i != e; ++i) {
Chain = DAG.getCopyToReg(Chain, dl, RegsToPass[i].first,
RegsToPass[i].second, InFlag);
InFlag = Chain.getValue(1);
}
// For tail calls lower the arguments to the 'real' stack slot.
if (isTailCall) {
// Force all the incoming stack arguments to be loaded from the stack
// before any new outgoing arguments are stored to the stack, because the
// outgoing stack slots may alias the incoming argument stack slots, and
// the alias isn't otherwise explicit. This is slightly more conservative
// than necessary, because it means that each store effectively depends
// on every argument instead of just those arguments it would clobber.
// Do not flag preceding copytoreg stuff together with the following stuff.
InFlag = SDValue();
for (unsigned i = 0, e = RegsToPass.size(); i != e; ++i) {
Chain = DAG.getCopyToReg(Chain, dl, RegsToPass[i].first,
RegsToPass[i].second, InFlag);
InFlag = Chain.getValue(1);
}
InFlag = SDValue();
}
// If the callee is a GlobalAddress/ExternalSymbol node (quite common, every
// direct call is) turn it into a TargetGlobalAddress/TargetExternalSymbol
// node so that legalize doesn't hack it.
bool isDirect = false;
const TargetMachine &TM = getTargetMachine();
const Module *Mod = MF.getFunction()->getParent();
const GlobalValue *GV = nullptr;
if (GlobalAddressSDNode *G = dyn_cast<GlobalAddressSDNode>(Callee))
GV = G->getGlobal();
bool isStub =
!TM.shouldAssumeDSOLocal(*Mod, GV) && Subtarget->isTargetMachO();
bool isARMFunc = !Subtarget->isThumb() || (isStub && !Subtarget->isMClass());
bool isLocalARMFunc = false;
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
auto PtrVt = getPointerTy(DAG.getDataLayout());
if (Subtarget->genLongCalls()) {
assert((!isPositionIndependent() || Subtarget->isTargetWindows()) &&
"long-calls codegen is not position independent!");
// Handle a global address or an external symbol. If it's not one of
// those, the target's already in a register, so we don't need to do
// anything extra.
if (isa<GlobalAddressSDNode>(Callee)) {
// Create a constant pool entry for the callee address
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(GV, ARMPCLabelIndex, ARMCP::CPValue, 0);
// Get the address of the callee into a register
SDValue CPAddr = DAG.getTargetConstantPool(CPV, PtrVt, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
Callee = DAG.getLoad(
PtrVt, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
} else if (ExternalSymbolSDNode *S=dyn_cast<ExternalSymbolSDNode>(Callee)) {
const char *Sym = S->getSymbol();
// Create a constant pool entry for the callee address
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
ARMConstantPoolValue *CPV =
ARMConstantPoolSymbol::Create(*DAG.getContext(), Sym,
ARMPCLabelIndex, 0);
// Get the address of the callee into a register
SDValue CPAddr = DAG.getTargetConstantPool(CPV, PtrVt, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
Callee = DAG.getLoad(
PtrVt, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
}
} else if (isa<GlobalAddressSDNode>(Callee)) {
// If we're optimizing for minimum size and the function is called three or
// more times in this block, we can improve codesize by calling indirectly
// as BLXr has a 16-bit encoding.
auto *GV = cast<GlobalAddressSDNode>(Callee)->getGlobal();
auto *BB = CLI.CS->getParent();
bool PreferIndirect =
Subtarget->isThumb() && MF.getFunction()->optForMinSize() &&
count_if(GV->users(), [&BB](const User *U) {
return isa<Instruction>(U) && cast<Instruction>(U)->getParent() == BB;
}) > 2;
if (!PreferIndirect) {
isDirect = true;
bool isDef = GV->isStrongDefinitionForLinker();
// ARM call to a local ARM function is predicable.
isLocalARMFunc = !Subtarget->isThumb() && (isDef || !ARMInterworking);
// tBX takes a register source operand.
if (isStub && Subtarget->isThumb1Only() && !Subtarget->hasV5TOps()) {
assert(Subtarget->isTargetMachO() && "WrapperPIC use on non-MachO?");
Callee = DAG.getNode(
ARMISD::WrapperPIC, dl, PtrVt,
DAG.getTargetGlobalAddress(GV, dl, PtrVt, 0, ARMII::MO_NONLAZY));
Callee = DAG.getLoad(
PtrVt, dl, DAG.getEntryNode(), Callee,
MachinePointerInfo::getGOT(DAG.getMachineFunction()),
/* Alignment = */ 0, MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant);
} else if (Subtarget->isTargetCOFF()) {
assert(Subtarget->isTargetWindows() &&
"Windows is the only supported COFF target");
unsigned TargetFlags = GV->hasDLLImportStorageClass()
? ARMII::MO_DLLIMPORT
: ARMII::MO_NO_FLAG;
Callee = DAG.getTargetGlobalAddress(GV, dl, PtrVt, /*Offset=*/0,
TargetFlags);
if (GV->hasDLLImportStorageClass())
Callee =
DAG.getLoad(PtrVt, dl, DAG.getEntryNode(),
DAG.getNode(ARMISD::Wrapper, dl, PtrVt, Callee),
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
} else {
Callee = DAG.getTargetGlobalAddress(GV, dl, PtrVt, 0, 0);
}
}
} else if (ExternalSymbolSDNode *S = dyn_cast<ExternalSymbolSDNode>(Callee)) {
isDirect = true;
// tBX takes a register source operand.
const char *Sym = S->getSymbol();
if (isARMFunc && Subtarget->isThumb1Only() && !Subtarget->hasV5TOps()) {
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
ARMConstantPoolValue *CPV =
ARMConstantPoolSymbol::Create(*DAG.getContext(), Sym,
ARMPCLabelIndex, 4);
SDValue CPAddr = DAG.getTargetConstantPool(CPV, PtrVt, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
Callee = DAG.getLoad(
PtrVt, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, dl, MVT::i32);
Callee = DAG.getNode(ARMISD::PIC_ADD, dl, PtrVt, Callee, PICLabel);
} else {
Callee = DAG.getTargetExternalSymbol(Sym, PtrVt, 0);
}
}
// FIXME: handle tail calls differently.
unsigned CallOpc;
if (Subtarget->isThumb()) {
if ((!isDirect || isARMFunc) && !Subtarget->hasV5TOps())
CallOpc = ARMISD::CALL_NOLINK;
else
CallOpc = ARMISD::CALL;
} else {
if (!isDirect && !Subtarget->hasV5TOps())
CallOpc = ARMISD::CALL_NOLINK;
else if (doesNotRet && isDirect && Subtarget->hasRetAddrStack() &&
// Emit regular call when code size is the priority
!MF.getFunction()->optForMinSize())
// "mov lr, pc; b _foo" to avoid confusing the return stack predictor (RSP)
CallOpc = ARMISD::CALL_NOLINK;
else
CallOpc = isLocalARMFunc ? ARMISD::CALL_PRED : ARMISD::CALL;
}
std::vector<SDValue> Ops;
Ops.push_back(Chain);
Ops.push_back(Callee);
// Add argument registers to the end of the list so that they are known live
// into the call.
for (unsigned i = 0, e = RegsToPass.size(); i != e; ++i)
Ops.push_back(DAG.getRegister(RegsToPass[i].first,
RegsToPass[i].second.getValueType()));
// Add a register mask operand representing the call-preserved registers.
if (!isTailCall) {
const uint32_t *Mask;
const ARMBaseRegisterInfo *ARI = Subtarget->getRegisterInfo();
if (isThisReturn) {
// For 'this' returns, use the R0-preserving mask if applicable
Mask = ARI->getThisReturnPreservedMask(MF, CallConv);
if (!Mask) {
// Set isThisReturn to false if the calling convention is not one that
// allows 'returned' to be modeled in this way, so LowerCallResult does
// not try to pass 'this' straight through
isThisReturn = false;
Mask = ARI->getCallPreservedMask(MF, CallConv);
}
} else
Mask = ARI->getCallPreservedMask(MF, CallConv);
assert(Mask && "Missing call preserved mask for calling convention");
Ops.push_back(DAG.getRegisterMask(Mask));
}
if (InFlag.getNode())
Ops.push_back(InFlag);
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
if (isTailCall) {
MF.getFrameInfo().setHasTailCall();
return DAG.getNode(ARMISD::TC_RETURN, dl, NodeTys, Ops);
}
// Returns a chain and a flag for retval copy to use.
Chain = DAG.getNode(CallOpc, dl, NodeTys, Ops);
InFlag = Chain.getValue(1);
Chain = DAG.getCALLSEQ_END(Chain, DAG.getIntPtrConstant(NumBytes, dl, true),
DAG.getIntPtrConstant(0, dl, true), InFlag, dl);
if (!Ins.empty())
InFlag = Chain.getValue(1);
// Handle result values, copying them out of physregs into vregs that we
// return.
return LowerCallResult(Chain, InFlag, CallConv, isVarArg, Ins, dl, DAG,
InVals, isThisReturn,
isThisReturn ? OutVals[0] : SDValue());
}
/// HandleByVal - Every parameter *after* a byval parameter is passed
/// on the stack. Remember the next parameter register to allocate,
/// and then confiscate the rest of the parameter registers to ensure
/// this.
void ARMTargetLowering::HandleByVal(CCState *State, unsigned &Size,
unsigned Align) const {
// Byval (as with any stack) slots are always at least 4 byte aligned.
Align = std::max(Align, 4U);
unsigned Reg = State->AllocateReg(GPRArgRegs);
if (!Reg)
return;
unsigned AlignInRegs = Align / 4;
unsigned Waste = (ARM::R4 - Reg) % AlignInRegs;
for (unsigned i = 0; i < Waste; ++i)
Reg = State->AllocateReg(GPRArgRegs);
if (!Reg)
return;
unsigned Excess = 4 * (ARM::R4 - Reg);
// Special case when NSAA != SP and the parameter size is greater than the
// size of all remaining GPR registers. In that case we can't split the
// parameter, we must send it to the stack. We also must set the NCRN to R4,
// so we waste all remaining registers.
const unsigned NSAAOffset = State->getNextStackOffset();
if (NSAAOffset != 0 && Size > Excess) {
while (State->AllocateReg(GPRArgRegs))
;
return;
}
// The first register for a byval parameter is the first register that
// wasn't allocated before this method call, so it is "reg".
// If the parameter is small enough to fit in the range [reg, r4), the end
// (one past the last) register is reg + param-size-in-regs; otherwise the
// parameter is split between registers and the stack, and the end register
// is r4.
unsigned ByValRegBegin = Reg;
unsigned ByValRegEnd = std::min<unsigned>(Reg + Size / 4, ARM::R4);
State->addInRegsParamInfo(ByValRegBegin, ByValRegEnd);
// Note that the first register was already allocated at the beginning of
// this method; allocate the remaining registers we need.
for (unsigned i = Reg + 1; i != ByValRegEnd; ++i)
State->AllocateReg(GPRArgRegs);
// A byval parameter that is split between registers and memory needs its
// size truncated here.
// In the case where the entire structure fits in registers, we set the
// size in memory to zero.
Size = std::max<int>(Size - Excess, 0);
}
/// MatchingStackOffset - Return true if the given stack call argument is
/// already available in the same position (relatively) of the caller's
/// incoming argument stack.
static
bool MatchingStackOffset(SDValue Arg, unsigned Offset, ISD::ArgFlagsTy Flags,
MachineFrameInfo &MFI, const MachineRegisterInfo *MRI,
const TargetInstrInfo *TII) {
unsigned Bytes = Arg.getValueSizeInBits() / 8;
int FI = std::numeric_limits<int>::max();
if (Arg.getOpcode() == ISD::CopyFromReg) {
unsigned VR = cast<RegisterSDNode>(Arg.getOperand(1))->getReg();
if (!TargetRegisterInfo::isVirtualRegister(VR))
return false;
MachineInstr *Def = MRI->getVRegDef(VR);
if (!Def)
return false;
if (!Flags.isByVal()) {
if (!TII->isLoadFromStackSlot(*Def, FI))
return false;
} else {
return false;
}
} else if (LoadSDNode *Ld = dyn_cast<LoadSDNode>(Arg)) {
if (Flags.isByVal())
// ByVal argument is passed in as a pointer but it's now being
// dereferenced. e.g.
// define @foo(%struct.X* %A) {
// tail call @bar(%struct.X* byval %A)
// }
return false;
SDValue Ptr = Ld->getBasePtr();
FrameIndexSDNode *FINode = dyn_cast<FrameIndexSDNode>(Ptr);
if (!FINode)
return false;
FI = FINode->getIndex();
} else
return false;
assert(FI != std::numeric_limits<int>::max());
if (!MFI.isFixedObjectIndex(FI))
return false;
return Offset == MFI.getObjectOffset(FI) && Bytes == MFI.getObjectSize(FI);
}
/// IsEligibleForTailCallOptimization - Check whether the call is eligible
/// for tail call optimization. Targets which want to do tail call
/// optimization should implement this function.
bool
ARMTargetLowering::IsEligibleForTailCallOptimization(SDValue Callee,
CallingConv::ID CalleeCC,
bool isVarArg,
bool isCalleeStructRet,
bool isCallerStructRet,
const SmallVectorImpl<ISD::OutputArg> &Outs,
const SmallVectorImpl<SDValue> &OutVals,
const SmallVectorImpl<ISD::InputArg> &Ins,
SelectionDAG& DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
const Function *CallerF = MF.getFunction();
CallingConv::ID CallerCC = CallerF->getCallingConv();
assert(Subtarget->supportsTailCall());
// Look for obvious safe cases to perform tail call optimization that do not
// require ABI changes. This is what gcc calls sibcall.
// Exception-handling functions need a special set of instructions to indicate
// a return to the hardware. Tail-calling another function would probably
// break this.
if (CallerF->hasFnAttribute("interrupt"))
return false;
// Also avoid sibcall optimization if either caller or callee uses struct
// return semantics.
if (isCalleeStructRet || isCallerStructRet)
return false;
// Externally-defined functions with weak linkage should not be
// tail-called on ARM when the OS does not support dynamic
// pre-emption of symbols, as the AAELF spec requires normal calls
// to undefined weak functions to be replaced with a NOP or jump to the
// next instruction. The behaviour of branch instructions in this
// situation (as used for tail calls) is implementation-defined, so we
// cannot rely on the linker replacing the tail call with a return.
if (GlobalAddressSDNode *G = dyn_cast<GlobalAddressSDNode>(Callee)) {
const GlobalValue *GV = G->getGlobal();
const Triple &TT = getTargetMachine().getTargetTriple();
if (GV->hasExternalWeakLinkage() &&
(!TT.isOSWindows() || TT.isOSBinFormatELF() || TT.isOSBinFormatMachO()))
return false;
}
// Check that the call results are passed in the same way.
LLVMContext &C = *DAG.getContext();
if (!CCState::resultsCompatible(CalleeCC, CallerCC, MF, C, Ins,
CCAssignFnForReturn(CalleeCC, isVarArg),
CCAssignFnForReturn(CallerCC, isVarArg)))
return false;
// The callee has to preserve all registers the caller needs to preserve.
const ARMBaseRegisterInfo *TRI = Subtarget->getRegisterInfo();
const uint32_t *CallerPreserved = TRI->getCallPreservedMask(MF, CallerCC);
if (CalleeCC != CallerCC) {
const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF, CalleeCC);
if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
return false;
}
// If Caller's vararg or byval argument has been split between registers and
// stack, do not perform tail call, since part of the argument is in caller's
// local frame.
const ARMFunctionInfo *AFI_Caller = MF.getInfo<ARMFunctionInfo>();
if (AFI_Caller->getArgRegsSaveSize())
return false;
// If the callee takes no arguments then go on to check the results of the
// call.
if (!Outs.empty()) {
// Check if stack adjustment is needed. For now, do not do this if any
// argument is passed on the stack.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CalleeCC, isVarArg, MF, ArgLocs, C);
CCInfo.AnalyzeCallOperands(Outs, CCAssignFnForCall(CalleeCC, isVarArg));
if (CCInfo.getNextStackOffset()) {
// Check if the arguments are already laid out in the right way as
// the caller's fixed stack objects.
MachineFrameInfo &MFI = MF.getFrameInfo();
const MachineRegisterInfo *MRI = &MF.getRegInfo();
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
for (unsigned i = 0, realArgIdx = 0, e = ArgLocs.size();
i != e;
++i, ++realArgIdx) {
CCValAssign &VA = ArgLocs[i];
EVT RegVT = VA.getLocVT();
SDValue Arg = OutVals[realArgIdx];
ISD::ArgFlagsTy Flags = Outs[realArgIdx].Flags;
if (VA.getLocInfo() == CCValAssign::Indirect)
return false;
if (VA.needsCustom()) {
// f64 and vector types are split into multiple registers or
// register/stack-slot combinations. The types will not match
// the registers; give up on memory f64 refs until we figure
// out what to do about this.
if (!VA.isRegLoc())
return false;
if (!ArgLocs[++i].isRegLoc())
return false;
if (RegVT == MVT::v2f64) {
if (!ArgLocs[++i].isRegLoc())
return false;
if (!ArgLocs[++i].isRegLoc())
return false;
}
} else if (!VA.isRegLoc()) {
if (!MatchingStackOffset(Arg, VA.getLocMemOffset(), Flags,
MFI, MRI, TII))
return false;
}
}
}
const MachineRegisterInfo &MRI = MF.getRegInfo();
if (!parametersInCSRMatch(MRI, CallerPreserved, ArgLocs, OutVals))
return false;
}
return true;
}
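/// CanLowerReturn - Check whether the return values described by Outs can be
/// lowered as-is by the calling convention's return-assignment function,
/// i.e. without demoting the return to an sret parameter.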
bool
ARMTargetLowering::CanLowerReturn(CallingConv::ID CallConv,
MachineFunction &MF, bool isVarArg,
const SmallVectorImpl<ISD::OutputArg> &Outs,
LLVMContext &Context) const {
SmallVector<CCValAssign, 16> RVLocs;
CCState CCInfo(CallConv, isVarArg, MF, RVLocs, Context);
return CCInfo.CheckReturn(Outs, CCAssignFnForReturn(CallConv, isVarArg));
}
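/// LowerInterruptReturn - Emit the special return sequence used by interrupt
/// handlers: an ARMISD::INTRET_FLAG node carrying the LR offset to subtract,
/// i.e. a "subs pc, lr, #N" return (see the table below).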
static SDValue LowerInterruptReturn(SmallVectorImpl<SDValue> &RetOps,
const SDLoc &DL, SelectionDAG &DAG) {
const MachineFunction &MF = DAG.getMachineFunction();
const Function *F = MF.getFunction();
StringRef IntKind = F->getFnAttribute("interrupt").getValueAsString();
// See ARM ARM v7 B1.8.3. On exception entry LR is set to a possibly offset
// version of the "preferred return address". These offsets affect the return
// instruction if this is a return from PL1 without hypervisor extensions.
// IRQ/FIQ: +4 "subs pc, lr, #4"
// SWI: 0 "subs pc, lr, #0"
// ABORT: +4 "subs pc, lr, #4"
// UNDEF: +4/+2 "subs pc, lr, #0"
// UNDEF varies depending on whether the exception came from ARM or Thumb
// mode. Alongside GCC, we throw our hands up in disgust and pretend it's 0.
int64_t LROffset;
if (IntKind == "" || IntKind == "IRQ" || IntKind == "FIQ" ||
IntKind == "ABORT")
LROffset = 4;
else if (IntKind == "SWI" || IntKind == "UNDEF")
LROffset = 0;
else
report_fatal_error("Unsupported interrupt attribute. If present, value "
"must be one of: IRQ, FIQ, SWI, ABORT or UNDEF");
RetOps.insert(RetOps.begin() + 1,
DAG.getConstant(LROffset, DL, MVT::i32, false));
return DAG.getNode(ARMISD::INTRET_FLAG, DL, MVT::Other, RetOps);
}
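/// LowerReturn - Copy the outgoing return values into their assigned return
/// registers (splitting f64/v2f64 into i32 pairs as needed) and emit an
/// ARMISD::RET_FLAG, or an interrupt return for "interrupt" handlers on
/// non-M-class targets.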
SDValue
ARMTargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
bool isVarArg,
const SmallVectorImpl<ISD::OutputArg> &Outs,
const SmallVectorImpl<SDValue> &OutVals,
const SDLoc &dl, SelectionDAG &DAG) const {
// CCValAssign - represent the assignment of the return value to a location.
SmallVector<CCValAssign, 16> RVLocs;
// CCState - Info about the registers and stack slots.
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), RVLocs,
*DAG.getContext());
// Analyze outgoing return values.
CCInfo.AnalyzeReturn(Outs, CCAssignFnForReturn(CallConv, isVarArg));
SDValue Flag;
SmallVector<SDValue, 4> RetOps;
RetOps.push_back(Chain); // Operand #0 = Chain (updated below)
bool isLittleEndian = Subtarget->isLittle();
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
AFI->setReturnRegsCount(RVLocs.size());
// Copy the result values into the output registers.
for (unsigned i = 0, realRVLocIdx = 0;
i != RVLocs.size();
++i, ++realRVLocIdx) {
CCValAssign &VA = RVLocs[i];
assert(VA.isRegLoc() && "Can only return in registers!");
SDValue Arg = OutVals[realRVLocIdx];
switch (VA.getLocInfo()) {
default: llvm_unreachable("Unknown loc info!");
case CCValAssign::Full: break;
case CCValAssign::BCvt:
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
break;
}
if (VA.needsCustom()) {
if (VA.getLocVT() == MVT::v2f64) {
// Extract the first half and return it in two registers.
SDValue Half = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, Arg,
DAG.getConstant(0, dl, MVT::i32));
SDValue HalfGPRs = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), Half);
Chain = DAG.getCopyToReg(Chain, dl, VA.getLocReg(),
HalfGPRs.getValue(isLittleEndian ? 0 : 1),
Flag);
Flag = Chain.getValue(1);
RetOps.push_back(DAG.getRegister(VA.getLocReg(), VA.getLocVT()));
VA = RVLocs[++i]; // skip ahead to next loc
Chain = DAG.getCopyToReg(Chain, dl, VA.getLocReg(),
HalfGPRs.getValue(isLittleEndian ? 1 : 0),
Flag);
Flag = Chain.getValue(1);
RetOps.push_back(DAG.getRegister(VA.getLocReg(), VA.getLocVT()));
VA = RVLocs[++i]; // skip ahead to next loc
// Extract the 2nd half and fall through to handle it as an f64 value.
Arg = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, Arg,
DAG.getConstant(1, dl, MVT::i32));
}
// Legalize ret f64 -> ret 2 x i32. We always have fmrrd if f64 is
// available.
SDValue fmrrd = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), Arg);
Chain = DAG.getCopyToReg(Chain, dl, VA.getLocReg(),
fmrrd.getValue(isLittleEndian ? 0 : 1),
Flag);
Flag = Chain.getValue(1);
RetOps.push_back(DAG.getRegister(VA.getLocReg(), VA.getLocVT()));
VA = RVLocs[++i]; // skip ahead to next loc
Chain = DAG.getCopyToReg(Chain, dl, VA.getLocReg(),
fmrrd.getValue(isLittleEndian ? 1 : 0),
Flag);
} else
Chain = DAG.getCopyToReg(Chain, dl, VA.getLocReg(), Arg, Flag);
// Guarantee that all emitted copies are glued together so the scheduler
// cannot separate them.
Flag = Chain.getValue(1);
RetOps.push_back(DAG.getRegister(VA.getLocReg(), VA.getLocVT()));
}
const ARMBaseRegisterInfo *TRI = Subtarget->getRegisterInfo();
const MCPhysReg *I =
TRI->getCalleeSavedRegsViaCopy(&DAG.getMachineFunction());
if (I) {
for (; *I; ++I) {
if (ARM::GPRRegClass.contains(*I))
RetOps.push_back(DAG.getRegister(*I, MVT::i32));
else if (ARM::DPRRegClass.contains(*I))
RetOps.push_back(DAG.getRegister(*I, MVT::getFloatingPointVT(64)));
else
llvm_unreachable("Unexpected register class in CSRsViaCopy!");
}
}
// Update chain and glue.
RetOps[0] = Chain;
if (Flag.getNode())
RetOps.push_back(Flag);
// CPUs which aren't M-class use a special sequence to return from
// exceptions (roughly, any instruction setting pc and cpsr simultaneously,
// though we use "subs pc, lr, #N").
//
// M-class CPUs actually use a normal return sequence with a special
// (hardware-provided) value in LR, so the normal code path works.
if (DAG.getMachineFunction().getFunction()->hasFnAttribute("interrupt") &&
!Subtarget->isMClass()) {
if (Subtarget->isThumb1Only())
report_fatal_error("interrupt attribute is not supported in Thumb1");
return LowerInterruptReturn(RetOps, dl, DAG);
}
return DAG.getNode(ARMISD::RET_FLAG, dl, MVT::Other, RetOps);
}
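/// isUsedByReturnOnly - Return true if the result of N is used only by return
/// nodes, looking through CopyToReg, a VMOVRRD GPR pair for f64, or a BITCAST
/// for f32; on success, Chain is updated to the chain a tail call should use.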
bool ARMTargetLowering::isUsedByReturnOnly(SDNode *N, SDValue &Chain) const {
if (N->getNumValues() != 1)
return false;
if (!N->hasNUsesOfValue(1, 0))
return false;
SDValue TCChain = Chain;
SDNode *Copy = *N->use_begin();
if (Copy->getOpcode() == ISD::CopyToReg) {
// If the copy has a glue operand, we conservatively assume it isn't safe to
// perform a tail call.
if (Copy->getOperand(Copy->getNumOperands()-1).getValueType() == MVT::Glue)
return false;
TCChain = Copy->getOperand(0);
} else if (Copy->getOpcode() == ARMISD::VMOVRRD) {
SDNode *VMov = Copy;
// f64 returned in a pair of GPRs.
SmallPtrSet<SDNode*, 2> Copies;
for (SDNode::use_iterator UI = VMov->use_begin(), UE = VMov->use_end();
UI != UE; ++UI) {
if (UI->getOpcode() != ISD::CopyToReg)
return false;
Copies.insert(*UI);
}
if (Copies.size() > 2)
return false;
for (SDNode::use_iterator UI = VMov->use_begin(), UE = VMov->use_end();
UI != UE; ++UI) {
SDValue UseChain = UI->getOperand(0);
if (Copies.count(UseChain.getNode()))
// Second CopyToReg
Copy = *UI;
else {
// We are at the top of this chain.
// If the copy has a glue operand, we conservatively assume it
// isn't safe to perform a tail call.
if (UI->getOperand(UI->getNumOperands()-1).getValueType() == MVT::Glue)
return false;
// First CopyToReg
TCChain = UseChain;
}
}
} else if (Copy->getOpcode() == ISD::BITCAST) {
// f32 returned in a single GPR.
if (!Copy->hasOneUse())
return false;
Copy = *Copy->use_begin();
if (Copy->getOpcode() != ISD::CopyToReg || !Copy->hasNUsesOfValue(1, 0))
return false;
// If the copy has a glue operand, we conservatively assume it isn't safe to
// perform a tail call.
if (Copy->getOperand(Copy->getNumOperands()-1).getValueType() == MVT::Glue)
return false;
TCChain = Copy->getOperand(0);
} else {
return false;
}
bool HasRet = false;
for (SDNode::use_iterator UI = Copy->use_begin(), UE = Copy->use_end();
UI != UE; ++UI) {
if (UI->getOpcode() != ARMISD::RET_FLAG &&
UI->getOpcode() != ARMISD::INTRET_FLAG)
return false;
HasRet = true;
}
if (!HasRet)
return false;
Chain = TCChain;
return true;
}
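/// mayBeEmittedAsTailCall - Return true if the call CI is eligible to be
/// emitted as a tail call: the subtarget supports tail calls, the call is
/// marked "tail", and the caller does not carry "disable-tail-calls"="true".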
bool ARMTargetLowering::mayBeEmittedAsTailCall(const CallInst *CI) const {
if (!Subtarget->supportsTailCall())
return false;
auto Attr =
CI->getParent()->getParent()->getFnAttribute("disable-tail-calls");
if (!CI->isTailCall() || Attr.getValueAsString() == "true")
return false;
return true;
}
// Writing a 64-bit value, so we need to split it into two 32-bit values first
// and pass the low and high parts through.
static SDValue LowerWRITE_REGISTER(SDValue Op, SelectionDAG &DAG) {
SDLoc DL(Op);
SDValue WriteValue = Op->getOperand(2);
// This function is only supposed to be called for i64 type argument.
assert(WriteValue.getValueType() == MVT::i64
&& "LowerWRITE_REGISTER called for non-i64 type argument.");
SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, WriteValue,
DAG.getConstant(0, DL, MVT::i32));
SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, WriteValue,
DAG.getConstant(1, DL, MVT::i32));
SDValue Ops[] = { Op->getOperand(0), Op->getOperand(1), Lo, Hi };
return DAG.getNode(ISD::WRITE_REGISTER, DL, MVT::Other, Ops);
}
// ConstantPool, JumpTable, GlobalAddress, and ExternalSymbol are lowered as
// their target counterpart wrapped in the ARMISD::Wrapper node. Suppose N is
// one of the above mentioned nodes. It has to be wrapped because otherwise
// Select(N) returns N. So the raw TargetGlobalAddress nodes, etc. can only
// be used to form an addressing mode. These wrapped nodes will be selected
// into MOVi.
SDValue ARMTargetLowering::LowerConstantPool(SDValue Op,
SelectionDAG &DAG) const {
EVT PtrVT = Op.getValueType();
// FIXME there is no actual debug info here
SDLoc dl(Op);
ConstantPoolSDNode *CP = cast<ConstantPoolSDNode>(Op);
SDValue Res;
// When generating execute-only code, constant pools must be promoted to the
// global data section. It's a bit ugly that we can't share them across basic
// blocks, but this way we guarantee that execute-only behaves correctly with
// position-independent addressing modes.
if (Subtarget->genExecuteOnly()) {
auto AFI = DAG.getMachineFunction().getInfo<ARMFunctionInfo>();
auto T = const_cast<Type*>(CP->getType());
auto C = const_cast<Constant*>(CP->getConstVal());
auto M = const_cast<Module*>(DAG.getMachineFunction().
getFunction()->getParent());
auto GV = new GlobalVariable(
*M, T, /*isConst=*/true, GlobalVariable::InternalLinkage, C,
Twine(DAG.getDataLayout().getPrivateGlobalPrefix()) + "CP" +
Twine(DAG.getMachineFunction().getFunctionNumber()) + "_" +
Twine(AFI->createPICLabelUId())
);
SDValue GA = DAG.getTargetGlobalAddress(dyn_cast<GlobalValue>(GV),
dl, PtrVT);
return LowerGlobalAddress(GA, DAG);
}
if (CP->isMachineConstantPoolEntry())
Res = DAG.getTargetConstantPool(CP->getMachineCPVal(), PtrVT,
CP->getAlignment());
else
Res = DAG.getTargetConstantPool(CP->getConstVal(), PtrVT,
CP->getAlignment());
return DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, Res);
}
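/// getJumpTableEncoding - ARM always emits jump table entries inline in the
/// function body rather than in a separate section.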
unsigned ARMTargetLowering::getJumpTableEncoding() const {
return MachineJumpTableInfo::EK_Inline;
}
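/// LowerBlockAddress - Materialize a BlockAddress via a constant-pool load,
/// adding a PIC label and an ARMISD::PIC_ADD when generating
/// position-independent (or ROPI) code.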
SDValue ARMTargetLowering::LowerBlockAddress(SDValue Op,
SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned ARMPCLabelIndex = 0;
SDLoc DL(Op);
EVT PtrVT = getPointerTy(DAG.getDataLayout());
const BlockAddress *BA = cast<BlockAddressSDNode>(Op)->getBlockAddress();
SDValue CPAddr;
bool IsPositionIndependent = isPositionIndependent() || Subtarget->isROPI();
if (!IsPositionIndependent) {
CPAddr = DAG.getTargetConstantPool(BA, PtrVT, 4);
} else {
unsigned PCAdj = Subtarget->isThumb() ? 4 : 8;
ARMPCLabelIndex = AFI->createPICLabelUId();
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(BA, ARMPCLabelIndex,
ARMCP::CPBlockAddress, PCAdj);
CPAddr = DAG.getTargetConstantPool(CPV, PtrVT, 4);
}
CPAddr = DAG.getNode(ARMISD::Wrapper, DL, PtrVT, CPAddr);
SDValue Result = DAG.getLoad(
PtrVT, DL, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
if (!IsPositionIndependent)
return Result;
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, DL, MVT::i32);
return DAG.getNode(ARMISD::PIC_ADD, DL, PtrVT, Result, PICLabel);
}
/// \brief Convert a TLS address reference into the correct sequence of loads
/// and calls to compute the variable's address for Darwin, and return an
/// SDValue containing the final node.
/// Darwin only has one TLS scheme which must be capable of dealing with the
/// fully general situation, in the worst case. This means:
/// + "extern __thread" declaration.
/// + Defined in a possibly unknown dynamic library.
///
/// The general system is that each __thread variable has a [3 x i32] descriptor
/// which contains information used by the runtime to calculate the address. The
/// only part of this the compiler needs to know about is the first word, which
/// contains a function pointer that must be called with the address of the
/// entire descriptor in "r0".
///
/// Since this descriptor may be in a different unit, in general access must
/// proceed along the usual ARM rules. A common sequence to produce is:
///
/// movw rT1, :lower16:_var$non_lazy_ptr
/// movt rT1, :upper16:_var$non_lazy_ptr
/// ldr r0, [rT1]
/// ldr rT2, [r0]
/// blx rT2
/// [...address now in r0...]
SDValue
ARMTargetLowering::LowerGlobalTLSAddressDarwin(SDValue Op,
SelectionDAG &DAG) const {
assert(Subtarget->isTargetDarwin() && "TLS only supported on Darwin");
SDLoc DL(Op);
// The first step is to get the address of the actual global symbol. This is
// where the TLS descriptor lives.
SDValue DescAddr = LowerGlobalAddressDarwin(Op, DAG);
// The first entry in the descriptor is a function pointer that we must call
// to obtain the address of the variable.
SDValue Chain = DAG.getEntryNode();
SDValue FuncTLVGet = DAG.getLoad(
MVT::i32, DL, Chain, DescAddr,
MachinePointerInfo::getGOT(DAG.getMachineFunction()),
/* Alignment = */ 4,
MachineMemOperand::MONonTemporal | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant);
Chain = FuncTLVGet.getValue(1);
MachineFunction &F = DAG.getMachineFunction();
MachineFrameInfo &MFI = F.getFrameInfo();
MFI.setAdjustsStack(true);
// TLS calls preserve all registers except those that absolutely must be
// trashed: R0 (it takes an argument), LR (it's a call) and CPSR (let's not be
// silly).
auto TRI =
getTargetMachine().getSubtargetImpl(*F.getFunction())->getRegisterInfo();
auto ARI = static_cast<const ARMRegisterInfo *>(TRI);
const uint32_t *Mask = ARI->getTLSCallPreservedMask(DAG.getMachineFunction());
// Finally, we can make the call. This is just a degenerate version of a
// normal ARM call node: r0 takes the address of the descriptor, and the call
// returns the address of the variable in this thread.
Chain = DAG.getCopyToReg(Chain, DL, ARM::R0, DescAddr, SDValue());
Chain =
DAG.getNode(ARMISD::CALL, DL, DAG.getVTList(MVT::Other, MVT::Glue),
Chain, FuncTLVGet, DAG.getRegister(ARM::R0, MVT::i32),
DAG.getRegisterMask(Mask), Chain.getValue(1));
return DAG.getCopyFromReg(Chain, DL, ARM::R0, MVT::i32, Chain.getValue(1));
}
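/// Lower a TLS address for Windows on ARM: read the TEB with an MRC from
/// CP15, load _tls_index and the ThreadLocalStoragePointer, index into the
/// TLS array, and add the variable's SECREL offset from the constant pool.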
SDValue
ARMTargetLowering::LowerGlobalTLSAddressWindows(SDValue Op,
SelectionDAG &DAG) const {
assert(Subtarget->isTargetWindows() && "Windows specific TLS lowering");
SDValue Chain = DAG.getEntryNode();
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDLoc DL(Op);
// Load the current TEB (thread environment block)
SDValue Ops[] = {Chain,
DAG.getConstant(Intrinsic::arm_mrc, DL, MVT::i32),
DAG.getConstant(15, DL, MVT::i32),
DAG.getConstant(0, DL, MVT::i32),
DAG.getConstant(13, DL, MVT::i32),
DAG.getConstant(0, DL, MVT::i32),
DAG.getConstant(2, DL, MVT::i32)};
SDValue CurrentTEB = DAG.getNode(ISD::INTRINSIC_W_CHAIN, DL,
DAG.getVTList(MVT::i32, MVT::Other), Ops);
SDValue TEB = CurrentTEB.getValue(0);
Chain = CurrentTEB.getValue(1);
// Load the ThreadLocalStoragePointer from the TEB
// A pointer to the TLS array is located at offset 0x2c from the TEB.
SDValue TLSArray =
DAG.getNode(ISD::ADD, DL, PtrVT, TEB, DAG.getIntPtrConstant(0x2c, DL));
TLSArray = DAG.getLoad(PtrVT, DL, Chain, TLSArray, MachinePointerInfo());
// The pointer to the thread's TLS data area is at the TLS index, scaled by 4,
// offset into the TLS array.
// Load the TLS index from the C runtime.
SDValue TLSIndex =
DAG.getTargetExternalSymbol("_tls_index", PtrVT, ARMII::MO_NO_FLAG);
TLSIndex = DAG.getNode(ARMISD::Wrapper, DL, PtrVT, TLSIndex);
TLSIndex = DAG.getLoad(PtrVT, DL, Chain, TLSIndex, MachinePointerInfo());
SDValue Slot = DAG.getNode(ISD::SHL, DL, PtrVT, TLSIndex,
DAG.getConstant(2, DL, MVT::i32));
SDValue TLS = DAG.getLoad(PtrVT, DL, Chain,
DAG.getNode(ISD::ADD, DL, PtrVT, TLSArray, Slot),
MachinePointerInfo());
// Get the offset of the start of the .tls section (section base)
const auto *GA = cast<GlobalAddressSDNode>(Op);
auto *CPV = ARMConstantPoolConstant::Create(GA->getGlobal(), ARMCP::SECREL);
SDValue Offset = DAG.getLoad(
PtrVT, DL, Chain, DAG.getNode(ARMISD::Wrapper, DL, MVT::i32,
DAG.getTargetConstantPool(CPV, PtrVT, 4)),
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
return DAG.getNode(ISD::ADD, DL, PtrVT, TLS, Offset);
}
// Lower ISD::GlobalTLSAddress using the "general dynamic" model
SDValue
ARMTargetLowering::LowerToTLSGeneralDynamicModel(GlobalAddressSDNode *GA,
SelectionDAG &DAG) const {
SDLoc dl(GA);
EVT PtrVT = getPointerTy(DAG.getDataLayout());
unsigned char PCAdj = Subtarget->isThumb() ? 4 : 8;
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(GA->getGlobal(), ARMPCLabelIndex,
ARMCP::CPValue, PCAdj, ARMCP::TLSGD, true);
SDValue Argument = DAG.getTargetConstantPool(CPV, PtrVT, 4);
Argument = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, Argument);
Argument = DAG.getLoad(
PtrVT, dl, DAG.getEntryNode(), Argument,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
SDValue Chain = Argument.getValue(1);
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, dl, MVT::i32);
Argument = DAG.getNode(ARMISD::PIC_ADD, dl, PtrVT, Argument, PICLabel);
// call __tls_get_addr.
ArgListTy Args;
ArgListEntry Entry;
Entry.Node = Argument;
Entry.Ty = (Type *) Type::getInt32Ty(*DAG.getContext());
Args.push_back(Entry);
// FIXME: is there useful debug info available here?
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl).setChain(Chain).setLibCallee(
CallingConv::C, Type::getInt32Ty(*DAG.getContext()),
DAG.getExternalSymbol("__tls_get_addr", PtrVT), std::move(Args));
std::pair<SDValue, SDValue> CallResult = LowerCallTo(CLI);
return CallResult.first;
}
// Lower ISD::GlobalTLSAddress using the "initial exec" or
// "local exec" model.
SDValue
ARMTargetLowering::LowerToTLSExecModels(GlobalAddressSDNode *GA,
SelectionDAG &DAG,
TLSModel::Model model) const {
const GlobalValue *GV = GA->getGlobal();
SDLoc dl(GA);
SDValue Offset;
SDValue Chain = DAG.getEntryNode();
EVT PtrVT = getPointerTy(DAG.getDataLayout());
// Get the Thread Pointer
SDValue ThreadPointer = DAG.getNode(ARMISD::THREAD_POINTER, dl, PtrVT);
if (model == TLSModel::InitialExec) {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
// Initial exec model.
unsigned char PCAdj = Subtarget->isThumb() ? 4 : 8;
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(GA->getGlobal(), ARMPCLabelIndex,
ARMCP::CPValue, PCAdj, ARMCP::GOTTPOFF,
true);
Offset = DAG.getTargetConstantPool(CPV, PtrVT, 4);
Offset = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, Offset);
Offset = DAG.getLoad(
PtrVT, dl, Chain, Offset,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
Chain = Offset.getValue(1);
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, dl, MVT::i32);
Offset = DAG.getNode(ARMISD::PIC_ADD, dl, PtrVT, Offset, PICLabel);
Offset = DAG.getLoad(
PtrVT, dl, Chain, Offset,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
} else {
// local exec model
assert(model == TLSModel::LocalExec);
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(GV, ARMCP::TPOFF);
Offset = DAG.getTargetConstantPool(CPV, PtrVT, 4);
Offset = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, Offset);
Offset = DAG.getLoad(
PtrVT, dl, Chain, Offset,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
}
// The address of the thread local variable is the add of the thread
// pointer with the offset of the variable.
return DAG.getNode(ISD::ADD, dl, PtrVT, ThreadPointer, Offset);
}
SDValue
ARMTargetLowering::LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const {
if (Subtarget->isTargetDarwin())
return LowerGlobalTLSAddressDarwin(Op, DAG);
if (Subtarget->isTargetWindows())
return LowerGlobalTLSAddressWindows(Op, DAG);
// TODO: implement the "local dynamic" model
assert(Subtarget->isTargetELF() && "Only ELF implemented here");
GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);
if (DAG.getTarget().Options.EmulatedTLS)
return LowerToTLSEmulatedModel(GA, DAG);
TLSModel::Model model = getTargetMachine().getTLSModel(GA->getGlobal());
switch (model) {
case TLSModel::GeneralDynamic:
case TLSModel::LocalDynamic:
return LowerToTLSGeneralDynamicModel(GA, DAG);
case TLSModel::InitialExec:
case TLSModel::LocalExec:
return LowerToTLSExecModels(GA, DAG, model);
}
llvm_unreachable("bogus TLS model");
}
/// Return true if all users of V are within function F, looking through
/// ConstantExprs.
static bool allUsersAreInFunction(const Value *V, const Function *F) {
SmallVector<const User*,4> Worklist;
for (auto *U : V->users())
Worklist.push_back(U);
while (!Worklist.empty()) {
auto *U = Worklist.pop_back_val();
if (isa<ConstantExpr>(U)) {
for (auto *UU : U->users())
Worklist.push_back(UU);
continue;
}
auto *I = dyn_cast<Instruction>(U);
if (!I || I->getParent()->getParent() != F)
return false;
}
return true;
}
/// Return true if all users of V are within some (any) function, looking
/// through ConstantExprs. In other words, return false if V has any global
/// constant users.
static bool allUsersAreInFunctions(const Value *V) {
SmallVector<const User*,4> Worklist;
for (auto *U : V->users())
Worklist.push_back(U);
while (!Worklist.empty()) {
auto *U = Worklist.pop_back_val();
if (isa<ConstantExpr>(U)) {
for (auto *UU : U->users())
Worklist.push_back(UU);
continue;
}
if (!isa<Instruction>(U))
return false;
}
return true;
}
// Return true if T is an integer, float or an array/vector of either.
static bool isSimpleType(Type *T) {
if (T->isIntegerTy() || T->isFloatingPointTy())
return true;
Type *SubT = nullptr;
if (T->isArrayTy())
SubT = T->getArrayElementType();
else if (T->isVectorTy())
SubT = T->getVectorElementType();
else
return false;
return SubT->isIntegerTy() || SubT->isFloatingPointTy();
}
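/// Try to promote a small, constant, internal global into this function's
/// constant pool so that it can be loaded PC-relative without a separate
/// global-address indirection.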
static SDValue promoteToConstantPool(const GlobalValue *GV, SelectionDAG &DAG,
EVT PtrVT, const SDLoc &dl) {
// If we're creating a pool entry for a constant global with unnamed address,
// and the global is small enough, we can emit it inline into the constant pool
// to save ourselves an indirection.
//
// This is a win if the constant is only used in one function (so it doesn't
// need to be duplicated) or duplicating the constant wouldn't increase code
// size (implying the constant is no larger than 4 bytes).
const Function *F = DAG.getMachineFunction().getFunction();
// We rely on this decision to inline being idempotent and unrelated to the
// use-site. We know that if we inline a variable at one use site, we'll
// inline it elsewhere too (and reuse the constant pool entry). Fast-isel
// doesn't know about this optimization, so bail out if it's enabled;
// otherwise we could decide to inline here (and thus never emit the GV)
// while fast-isel-generated code still requires the GV.
if (!EnableConstpoolPromotion ||
DAG.getMachineFunction().getTarget().Options.EnableFastISel)
return SDValue();
auto *GVar = dyn_cast<GlobalVariable>(GV);
if (!GVar || !GVar->hasInitializer() ||
!GVar->isConstant() || !GVar->hasGlobalUnnamedAddr() ||
!GVar->hasLocalLinkage())
return SDValue();
// Ensure that we don't try and inline any type that contains pointers. If
// we inline a value that contains relocations, we move the relocations from
// .data to .text which is not ideal.
auto *Init = GVar->getInitializer();
if (!isSimpleType(Init->getType()))
return SDValue();
// The constant islands pass can only really deal with alignment requests
// <= 4 bytes and cannot pad constants itself. Therefore we cannot promote
// any type requiring alignment greater than 4 bytes. We can also only
// promote constants that are multiples of 4 bytes in size, or that can be
// padded to a multiple of 4. Currently we only try to pad constants that
// are strings, for simplicity.
auto *CDAInit = dyn_cast<ConstantDataArray>(Init);
unsigned Size = DAG.getDataLayout().getTypeAllocSize(Init->getType());
unsigned Align = GVar->getAlignment();
unsigned RequiredPadding = 4 - (Size % 4);
bool PaddingPossible =
RequiredPadding == 4 || (CDAInit && CDAInit->isString());
if (!PaddingPossible || Align > 4 || Size > ConstpoolPromotionMaxSize ||
Size == 0)
return SDValue();
unsigned PaddedSize = Size + ((RequiredPadding == 4) ? 0 : RequiredPadding);
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
// We can't bloat the constant pool too much, else the ConstantIslands pass
// may fail to converge. If we haven't promoted this global yet (it may have
// multiple uses), and promoting it would increase the constant pool size (Sz
// > 4), ensure we have space to do so up to MaxTotal.
if (!AFI->getGlobalsPromotedToConstantPool().count(GVar) && Size > 4)
if (AFI->getPromotedConstpoolIncrease() + PaddedSize - 4 >=
ConstpoolPromotionMaxTotal)
return SDValue();
// This is only valid if all users are in a single function, OR it has users
// in multiple functions but is no larger than a pointer. We also check if
// GVar has constant (non-ConstantExpr) users. If so, it essentially has its
// address taken.
if (!allUsersAreInFunction(GVar, F) &&
!(Size <= 4 && allUsersAreInFunctions(GVar)))
return SDValue();
// We're going to inline this global. Pad it out if needed.
if (RequiredPadding != 4) {
StringRef S = CDAInit->getAsString();
SmallVector<uint8_t,16> V(S.size());
std::copy(S.bytes_begin(), S.bytes_end(), V.begin());
while (RequiredPadding--)
V.push_back(0);
Init = ConstantDataArray::get(*DAG.getContext(), V);
}
auto CPVal = ARMConstantPoolConstant::Create(GVar, Init);
SDValue CPAddr =
DAG.getTargetConstantPool(CPVal, PtrVT, /*Align=*/4);
if (!AFI->getGlobalsPromotedToConstantPool().count(GVar)) {
AFI->markGlobalAsPromotedToConstantPool(GVar);
AFI->setPromotedConstpoolIncrease(AFI->getPromotedConstpoolIncrease() +
PaddedSize - 4);
}
++NumConstpoolPromoted;
return DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
}
static bool isReadOnly(const GlobalValue *GV) {
if (const GlobalAlias *GA = dyn_cast<GlobalAlias>(GV))
GV = GA->getBaseObject();
return (isa<GlobalVariable>(GV) && cast<GlobalVariable>(GV)->isConstant()) ||
isa<Function>(GV);
}
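/// LowerGlobalAddress - Dispatch global-address lowering on the object file
/// format of the target triple (COFF, ELF or MachO).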
SDValue ARMTargetLowering::LowerGlobalAddress(SDValue Op,
SelectionDAG &DAG) const {
switch (Subtarget->getTargetTriple().getObjectFormat()) {
default: llvm_unreachable("unknown object format");
case Triple::COFF:
return LowerGlobalAddressWindows(Op, DAG);
case Triple::ELF:
return LowerGlobalAddressELF(Op, DAG);
case Triple::MachO:
return LowerGlobalAddressDarwin(Op, DAG);
}
}
SDValue ARMTargetLowering::LowerGlobalAddressELF(SDValue Op,
SelectionDAG &DAG) const {
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDLoc dl(Op);
const GlobalValue *GV = cast<GlobalAddressSDNode>(Op)->getGlobal();
const TargetMachine &TM = getTargetMachine();
bool IsRO = isReadOnly(GV);
// promoteToConstantPool only if not generating XO text section
if (TM.shouldAssumeDSOLocal(*GV->getParent(), GV) && !Subtarget->genExecuteOnly())
if (SDValue V = promoteToConstantPool(GV, DAG, PtrVT, dl))
return V;
if (isPositionIndependent()) {
bool UseGOT_PREL = !TM.shouldAssumeDSOLocal(*GV->getParent(), GV);
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDLoc dl(Op);
unsigned PCAdj = Subtarget->isThumb() ? 4 : 8;
ARMConstantPoolValue *CPV = ARMConstantPoolConstant::Create(
GV, ARMPCLabelIndex, ARMCP::CPValue, PCAdj,
UseGOT_PREL ? ARMCP::GOT_PREL : ARMCP::no_modifier,
/*AddCurrentAddress=*/UseGOT_PREL);
SDValue CPAddr = DAG.getTargetConstantPool(CPV, PtrVT, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
SDValue Result = DAG.getLoad(
PtrVT, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
SDValue Chain = Result.getValue(1);
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, dl, MVT::i32);
Result = DAG.getNode(ARMISD::PIC_ADD, dl, PtrVT, Result, PICLabel);
if (UseGOT_PREL)
Result =
DAG.getLoad(PtrVT, dl, Chain, Result,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
return Result;
} else if (Subtarget->isROPI() && IsRO) {
// PC-relative.
SDValue G = DAG.getTargetGlobalAddress(GV, dl, PtrVT);
SDValue Result = DAG.getNode(ARMISD::WrapperPIC, dl, PtrVT, G);
return Result;
} else if (Subtarget->isRWPI() && !IsRO) {
// SB-relative.
SDValue RelAddr;
if (Subtarget->useMovt(DAG.getMachineFunction())) {
++NumMovwMovt;
SDValue G = DAG.getTargetGlobalAddress(GV, dl, PtrVT, 0, ARMII::MO_SBREL);
RelAddr = DAG.getNode(ARMISD::Wrapper, dl, PtrVT, G);
} else { // use literal pool for address constant
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(GV, ARMCP::SBREL);
SDValue CPAddr = DAG.getTargetConstantPool(CPV, PtrVT, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
RelAddr = DAG.getLoad(
PtrVT, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
}
SDValue SB = DAG.getCopyFromReg(DAG.getEntryNode(), dl, ARM::R9, PtrVT);
SDValue Result = DAG.getNode(ISD::ADD, dl, PtrVT, SB, RelAddr);
return Result;
}
// If we have T2 ops, we can materialize the address directly via movt/movw
// pair. This is always cheaper.
if (Subtarget->useMovt(DAG.getMachineFunction())) {
++NumMovwMovt;
// FIXME: Once remat is capable of dealing with instructions with register
// operands, expand this into two nodes.
return DAG.getNode(ARMISD::Wrapper, dl, PtrVT,
DAG.getTargetGlobalAddress(GV, dl, PtrVT));
} else {
SDValue CPAddr = DAG.getTargetConstantPool(GV, PtrVT, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
return DAG.getLoad(
PtrVT, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
}
}
SDValue ARMTargetLowering::LowerGlobalAddressDarwin(SDValue Op,
SelectionDAG &DAG) const {
assert(!Subtarget->isROPI() && !Subtarget->isRWPI() &&
"ROPI/RWPI not currently supported for Darwin");
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDLoc dl(Op);
const GlobalValue *GV = cast<GlobalAddressSDNode>(Op)->getGlobal();
if (Subtarget->useMovt(DAG.getMachineFunction()))
++NumMovwMovt;
// FIXME: Once remat is capable of dealing with instructions with register
// operands, expand this into multiple nodes
unsigned Wrapper =
isPositionIndependent() ? ARMISD::WrapperPIC : ARMISD::Wrapper;
SDValue G = DAG.getTargetGlobalAddress(GV, dl, PtrVT, 0, ARMII::MO_NONLAZY);
SDValue Result = DAG.getNode(Wrapper, dl, PtrVT, G);
if (Subtarget->isGVIndirectSymbol(GV))
Result = DAG.getLoad(PtrVT, dl, DAG.getEntryNode(), Result,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
return Result;
}
SDValue ARMTargetLowering::LowerGlobalAddressWindows(SDValue Op,
SelectionDAG &DAG) const {
assert(Subtarget->isTargetWindows() && "non-Windows COFF is not supported");
assert(Subtarget->useMovt(DAG.getMachineFunction()) &&
"Windows on ARM expects to use movw/movt");
assert(!Subtarget->isROPI() && !Subtarget->isRWPI() &&
"ROPI/RWPI not currently supported for Windows");
const GlobalValue *GV = cast<GlobalAddressSDNode>(Op)->getGlobal();
const ARMII::TOF TargetFlags =
(GV->hasDLLImportStorageClass() ? ARMII::MO_DLLIMPORT : ARMII::MO_NO_FLAG);
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result;
SDLoc DL(Op);
++NumMovwMovt;
// FIXME: Once remat is capable of dealing with instructions with register
// operands, expand this into two nodes.
Result = DAG.getNode(ARMISD::Wrapper, DL, PtrVT,
DAG.getTargetGlobalAddress(GV, DL, PtrVT, /*Offset=*/0,
TargetFlags));
if (GV->hasDLLImportStorageClass())
Result = DAG.getLoad(PtrVT, DL, DAG.getEntryNode(), Result,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
return Result;
}
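// The SjLj exception-handling intrinsics below lower directly onto their
// matching ARMISD nodes; setjmp and longjmp carry an extra i32 zero operand
// alongside the chain and the buffer address.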
SDValue
ARMTargetLowering::LowerEH_SJLJ_SETJMP(SDValue Op, SelectionDAG &DAG) const {
SDLoc dl(Op);
SDValue Val = DAG.getConstant(0, dl, MVT::i32);
return DAG.getNode(ARMISD::EH_SJLJ_SETJMP, dl,
DAG.getVTList(MVT::i32, MVT::Other), Op.getOperand(0),
Op.getOperand(1), Val);
}
SDValue
ARMTargetLowering::LowerEH_SJLJ_LONGJMP(SDValue Op, SelectionDAG &DAG) const {
SDLoc dl(Op);
return DAG.getNode(ARMISD::EH_SJLJ_LONGJMP, dl, MVT::Other, Op.getOperand(0),
Op.getOperand(1), DAG.getConstant(0, dl, MVT::i32));
}
SDValue ARMTargetLowering::LowerEH_SJLJ_SETUP_DISPATCH(SDValue Op,
SelectionDAG &DAG) const {
SDLoc dl(Op);
return DAG.getNode(ARMISD::EH_SJLJ_SETUP_DISPATCH, dl, MVT::Other,
Op.getOperand(0));
}
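/// LowerINTRINSIC_WO_CHAIN - Custom-lower chainless ARM intrinsics: the
/// thread pointer, eh_sjlj_lsda, and several NEON operations that map
/// directly onto generic or ARMISD opcodes.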
SDValue
ARMTargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) const {
unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
SDLoc dl(Op);
switch (IntNo) {
default: return SDValue(); // Don't custom lower most intrinsics.
case Intrinsic::thread_pointer: {
EVT PtrVT = getPointerTy(DAG.getDataLayout());
return DAG.getNode(ARMISD::THREAD_POINTER, dl, PtrVT);
}
case Intrinsic::eh_sjlj_lsda: {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned ARMPCLabelIndex = AFI->createPICLabelUId();
EVT PtrVT = getPointerTy(DAG.getDataLayout());
SDValue CPAddr;
bool IsPositionIndependent = isPositionIndependent();
unsigned PCAdj = IsPositionIndependent ? (Subtarget->isThumb() ? 4 : 8) : 0;
ARMConstantPoolValue *CPV =
ARMConstantPoolConstant::Create(MF.getFunction(), ARMPCLabelIndex,
ARMCP::CPLSDA, PCAdj);
CPAddr = DAG.getTargetConstantPool(CPV, PtrVT, 4);
CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
SDValue Result = DAG.getLoad(
PtrVT, dl, DAG.getEntryNode(), CPAddr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()));
if (IsPositionIndependent) {
SDValue PICLabel = DAG.getConstant(ARMPCLabelIndex, dl, MVT::i32);
Result = DAG.getNode(ARMISD::PIC_ADD, dl, PtrVT, Result, PICLabel);
}
return Result;
}
case Intrinsic::arm_neon_vabs:
return DAG.getNode(ISD::ABS, SDLoc(Op), Op.getValueType(),
Op.getOperand(1));
case Intrinsic::arm_neon_vmulls:
case Intrinsic::arm_neon_vmullu: {
unsigned NewOpc = (IntNo == Intrinsic::arm_neon_vmulls)
? ARMISD::VMULLs : ARMISD::VMULLu;
return DAG.getNode(NewOpc, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
}
case Intrinsic::arm_neon_vminnm:
case Intrinsic::arm_neon_vmaxnm: {
unsigned NewOpc = (IntNo == Intrinsic::arm_neon_vminnm)
? ISD::FMINNUM : ISD::FMAXNUM;
return DAG.getNode(NewOpc, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
}
case Intrinsic::arm_neon_vminu:
case Intrinsic::arm_neon_vmaxu: {
if (Op.getValueType().isFloatingPoint())
return SDValue();
unsigned NewOpc = (IntNo == Intrinsic::arm_neon_vminu)
? ISD::UMIN : ISD::UMAX;
return DAG.getNode(NewOpc, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
}
case Intrinsic::arm_neon_vmins:
case Intrinsic::arm_neon_vmaxs: {
// v{min,max}s is overloaded between signed integers and floats.
if (!Op.getValueType().isFloatingPoint()) {
unsigned NewOpc = (IntNo == Intrinsic::arm_neon_vmins)
? ISD::SMIN : ISD::SMAX;
return DAG.getNode(NewOpc, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
}
unsigned NewOpc = (IntNo == Intrinsic::arm_neon_vmins)
? ISD::FMINNAN : ISD::FMAXNAN;
return DAG.getNode(NewOpc, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
}
case Intrinsic::arm_neon_vtbl1:
return DAG.getNode(ARMISD::VTBL1, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2));
case Intrinsic::arm_neon_vtbl2:
return DAG.getNode(ARMISD::VTBL2, SDLoc(Op), Op.getValueType(),
Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
}
}
static SDValue LowerATOMIC_FENCE(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
SDLoc dl(Op);
ConstantSDNode *SSIDNode = cast<ConstantSDNode>(Op.getOperand(2));
auto SSID = static_cast<SyncScope::ID>(SSIDNode->getZExtValue());
if (SSID == SyncScope::SingleThread)
return Op;
if (!Subtarget->hasDataBarrier()) {
// Some ARMv6 CPUs can support data barriers with an mcr instruction.
// Thumb1 and pre-v6 ARM mode use a libcall instead and should never get
// here.
assert(Subtarget->hasV6Ops() && !Subtarget->isThumb() &&
"Unexpected ISD::ATOMIC_FENCE encountered. Should be libcall!");
return DAG.getNode(ARMISD::MEMBARRIER_MCR, dl, MVT::Other, Op.getOperand(0),
DAG.getConstant(0, dl, MVT::i32));
}
ConstantSDNode *OrdN = cast<ConstantSDNode>(Op.getOperand(1));
AtomicOrdering Ord = static_cast<AtomicOrdering>(OrdN->getZExtValue());
ARM_MB::MemBOpt Domain = ARM_MB::ISH;
if (Subtarget->isMClass()) {
// Only a full system barrier exists in the M-class architectures.
Domain = ARM_MB::SY;
} else if (Subtarget->preferISHSTBarriers() &&
Ord == AtomicOrdering::Release) {
// Swift happens to implement ISHST barriers in a way that's compatible with
// Release semantics but weaker than ISH, so we'd be fools not to use it.
// Beware: other processors probably don't!
Domain = ARM_MB::ISHST;
}
return DAG.getNode(ISD::INTRINSIC_VOID, dl, MVT::Other, Op.getOperand(0),
DAG.getConstant(Intrinsic::arm_dmb, dl, MVT::i32),
DAG.getConstant(Domain, dl, MVT::i32));
}
static SDValue LowerPREFETCH(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
// ARM prior to v5TE and Thumb1 do not have preload instructions.
if (!(Subtarget->isThumb2() ||
(!Subtarget->isThumb1Only() && Subtarget->hasV5TEOps())))
// Just preserve the chain.
return Op.getOperand(0);
SDLoc dl(Op);
unsigned isRead = ~cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue() & 1;
if (!isRead &&
(!Subtarget->hasV7Ops() || !Subtarget->hasMPExtension()))
// ARMv7 with MP extension has PLDW.
return Op.getOperand(0);
unsigned isData = cast<ConstantSDNode>(Op.getOperand(4))->getZExtValue();
if (Subtarget->isThumb()) {
// Invert the bits.
isRead = ~isRead & 1;
isData = ~isData & 1;
}
return DAG.getNode(ARMISD::PRELOAD, dl, MVT::Other, Op.getOperand(0),
Op.getOperand(1), DAG.getConstant(isRead, dl, MVT::i32),
DAG.getConstant(isData, dl, MVT::i32));
}
static SDValue LowerVASTART(SDValue Op, SelectionDAG &DAG) {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *FuncInfo = MF.getInfo<ARMFunctionInfo>();
// vastart just stores the address of the VarArgsFrameIndex slot into the
// memory location argument.
SDLoc dl(Op);
EVT PtrVT = DAG.getTargetLoweringInfo().getPointerTy(DAG.getDataLayout());
SDValue FR = DAG.getFrameIndex(FuncInfo->getVarArgsFrameIndex(), PtrVT);
const Value *SV = cast<SrcValueSDNode>(Op.getOperand(2))->getValue();
return DAG.getStore(Op.getOperand(0), dl, FR, Op.getOperand(1),
MachinePointerInfo(SV));
}
SDValue ARMTargetLowering::GetF64FormalArgument(CCValAssign &VA,
CCValAssign &NextVA,
SDValue &Root,
SelectionDAG &DAG,
const SDLoc &dl) const {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
const TargetRegisterClass *RC;
if (AFI->isThumb1OnlyFunction())
RC = &ARM::tGPRRegClass;
else
RC = &ARM::GPRRegClass;
// Transform the arguments stored in physical registers into virtual ones.
unsigned Reg = MF.addLiveIn(VA.getLocReg(), RC);
SDValue ArgValue = DAG.getCopyFromReg(Root, dl, Reg, MVT::i32);
SDValue ArgValue2;
if (NextVA.isMemLoc()) {
MachineFrameInfo &MFI = MF.getFrameInfo();
int FI = MFI.CreateFixedObject(4, NextVA.getLocMemOffset(), true);
// Create load node to retrieve arguments from the stack.
SDValue FIN = DAG.getFrameIndex(FI, getPointerTy(DAG.getDataLayout()));
ArgValue2 = DAG.getLoad(
MVT::i32, dl, Root, FIN,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI));
} else {
Reg = MF.addLiveIn(NextVA.getLocReg(), RC);
ArgValue2 = DAG.getCopyFromReg(Root, dl, Reg, MVT::i32);
}
if (!Subtarget->isLittle())
std::swap (ArgValue, ArgValue2);
return DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, ArgValue, ArgValue2);
}
// The remaining GPRs hold either the beginning of variable-argument
// data, or the beginning of an aggregate passed by value (usually
// byval). Either way, we allocate stack slots adjacent to the data
// provided by our caller, and store the unallocated registers there.
// If this is a variadic function, the va_list pointer will begin with
// these values; otherwise, this reassembles a (byval) structure that
// was split between registers and memory.
// Return: the frame index that the registers were stored into.
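// Illustration (assuming the AAPCS register assignment r0-r3): for a
// variadic callee whose fixed arguments occupy only r0, the remaining
// r1-r3 are stored just below the incoming stack arguments, so the
// va_list data is contiguous in memory.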
int ARMTargetLowering::StoreByValRegs(CCState &CCInfo, SelectionDAG &DAG,
const SDLoc &dl, SDValue &Chain,
const Value *OrigArg,
unsigned InRegsParamRecordIdx,
int ArgOffset, unsigned ArgSize) const {
// Currently, two use-cases are possible:
// Case #1. Non-var-args function, and we meet the first byval parameter.
//          Set up the first unallocated register as the first byval register;
//          eat all remaining registers
//          (these two actions are performed by the HandleByVal method).
//          Then, here, we initialize the stack frame with
//          "store-reg" instructions.
// Case #2. Var-args function that doesn't contain byval parameters.
//          The same: eat all remaining unallocated registers,
//          initialize the stack frame.
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
unsigned RBegin, REnd;
if (InRegsParamRecordIdx < CCInfo.getInRegsParamsCount()) {
CCInfo.getInRegsParamInfo(InRegsParamRecordIdx, RBegin, REnd);
} else {
unsigned RBeginIdx = CCInfo.getFirstUnallocated(GPRArgRegs);
RBegin = RBeginIdx == 4 ? (unsigned)ARM::R4 : GPRArgRegs[RBeginIdx];
REnd = ARM::R4;
}
if (REnd != RBegin)
ArgOffset = -4 * (ARM::R4 - RBegin);
auto PtrVT = getPointerTy(DAG.getDataLayout());
int FrameIndex = MFI.CreateFixedObject(ArgSize, ArgOffset, false);
SDValue FIN = DAG.getFrameIndex(FrameIndex, PtrVT);
SmallVector<SDValue, 4> MemOps;
const TargetRegisterClass *RC =
AFI->isThumb1OnlyFunction() ? &ARM::tGPRRegClass : &ARM::GPRRegClass;
for (unsigned Reg = RBegin, i = 0; Reg < REnd; ++Reg, ++i) {
unsigned VReg = MF.addLiveIn(Reg, RC);
SDValue Val = DAG.getCopyFromReg(Chain, dl, VReg, MVT::i32);
SDValue Store = DAG.getStore(Val.getValue(1), dl, Val, FIN,
MachinePointerInfo(OrigArg, 4 * i));
MemOps.push_back(Store);
FIN = DAG.getNode(ISD::ADD, dl, PtrVT, FIN, DAG.getConstant(4, dl, PtrVT));
}
if (!MemOps.empty())
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, MemOps);
return FrameIndex;
}
// Set up the stack frame that the va_list pointer will start from.
void ARMTargetLowering::VarArgStyleRegisters(CCState &CCInfo, SelectionDAG &DAG,
const SDLoc &dl, SDValue &Chain,
unsigned ArgOffset,
unsigned TotalArgRegsSaveSize,
bool ForceMutable) const {
MachineFunction &MF = DAG.getMachineFunction();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
// Try to store any remaining integer argument regs
// to their spots on the stack so that they may be loaded by dereferencing
// the result of va_next.
// If there are no regs to be stored, just point the address past the last
// argument passed via the stack.
int FrameIndex = StoreByValRegs(CCInfo, DAG, dl, Chain, nullptr,
CCInfo.getInRegsParamsCount(),
CCInfo.getNextStackOffset(), 4);
AFI->setVarArgsFrameIndex(FrameIndex);
}
SDValue ARMTargetLowering::LowerFormalArguments(
SDValue Chain, CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals) const {
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
// Assign locations to all of the incoming arguments.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), ArgLocs,
*DAG.getContext());
CCInfo.AnalyzeFormalArguments(Ins, CCAssignFnForCall(CallConv, isVarArg));
SmallVector<SDValue, 16> ArgValues;
SDValue ArgValue;
Function::const_arg_iterator CurOrigArg = MF.getFunction()->arg_begin();
unsigned CurArgIdx = 0;
// Initially ArgRegsSaveSize is zero.
// Then we increase this value each time we meet a byval parameter.
// We also increase this value in the case of a varargs function.
AFI->setArgRegsSaveSize(0);
// Calculate the amount of stack space that we need to allocate to store
// byval and variadic arguments that are passed in registers.
// We need to know this before we allocate the first byval or variadic
// argument, as they will be allocated a stack slot below the CFA (Canonical
// Frame Address, the stack pointer at entry to the function).
unsigned ArgRegBegin = ARM::R4;
for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i) {
if (CCInfo.getInRegsParamsProcessed() >= CCInfo.getInRegsParamsCount())
break;
CCValAssign &VA = ArgLocs[i];
unsigned Index = VA.getValNo();
ISD::ArgFlagsTy Flags = Ins[Index].Flags;
if (!Flags.isByVal())
continue;
assert(VA.isMemLoc() && "unexpected byval pointer in reg");
unsigned RBegin, REnd;
CCInfo.getInRegsParamInfo(CCInfo.getInRegsParamsProcessed(), RBegin, REnd);
ArgRegBegin = std::min(ArgRegBegin, RBegin);
CCInfo.nextInRegsParam();
}
CCInfo.rewindByValRegsInfo();
int lastInsIndex = -1;
if (isVarArg && MFI.hasVAStart()) {
unsigned RegIdx = CCInfo.getFirstUnallocated(GPRArgRegs);
if (RegIdx != array_lengthof(GPRArgRegs))
ArgRegBegin = std::min(ArgRegBegin, (unsigned)GPRArgRegs[RegIdx]);
}
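// Illustration: if ArgRegBegin ended up as r2, both r2 and r3 must be
// saved below the CFA, so the computation below yields 4 * (R4 - R2) == 8
// bytes.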
unsigned TotalArgRegsSaveSize = 4 * (ARM::R4 - ArgRegBegin);
AFI->setArgRegsSaveSize(TotalArgRegsSaveSize);
auto PtrVT = getPointerTy(DAG.getDataLayout());
for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i) {
CCValAssign &VA = ArgLocs[i];
if (Ins[VA.getValNo()].isOrigArg()) {
std::advance(CurOrigArg,
Ins[VA.getValNo()].getOrigArgIndex() - CurArgIdx);
CurArgIdx = Ins[VA.getValNo()].getOrigArgIndex();
}
// Arguments stored in registers.
if (VA.isRegLoc()) {
EVT RegVT = VA.getLocVT();
if (VA.needsCustom()) {
// f64 and vector types are split up into multiple registers or
// combinations of registers and stack slots.
if (VA.getLocVT() == MVT::v2f64) {
SDValue ArgValue1 = GetF64FormalArgument(VA, ArgLocs[++i],
Chain, DAG, dl);
VA = ArgLocs[++i]; // skip ahead to next loc
SDValue ArgValue2;
if (VA.isMemLoc()) {
int FI = MFI.CreateFixedObject(8, VA.getLocMemOffset(), true);
SDValue FIN = DAG.getFrameIndex(FI, PtrVT);
ArgValue2 = DAG.getLoad(MVT::f64, dl, Chain, FIN,
MachinePointerInfo::getFixedStack(
DAG.getMachineFunction(), FI));
} else {
ArgValue2 = GetF64FormalArgument(VA, ArgLocs[++i],
Chain, DAG, dl);
}
ArgValue = DAG.getNode(ISD::UNDEF, dl, MVT::v2f64);
ArgValue = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64,
ArgValue, ArgValue1,
DAG.getIntPtrConstant(0, dl));
ArgValue = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64,
ArgValue, ArgValue2,
DAG.getIntPtrConstant(1, dl));
} else
ArgValue = GetF64FormalArgument(VA, ArgLocs[++i], Chain, DAG, dl);
} else {
const TargetRegisterClass *RC;
if (RegVT == MVT::f32)
RC = &ARM::SPRRegClass;
else if (RegVT == MVT::f64)
RC = &ARM::DPRRegClass;
else if (RegVT == MVT::v2f64)
RC = &ARM::QPRRegClass;
else if (RegVT == MVT::i32)
RC = AFI->isThumb1OnlyFunction() ? &ARM::tGPRRegClass
: &ARM::GPRRegClass;
else
llvm_unreachable("RegVT not supported by FORMAL_ARGUMENTS Lowering");
// Transform the arguments in physical registers into virtual ones.
unsigned Reg = MF.addLiveIn(VA.getLocReg(), RC);
ArgValue = DAG.getCopyFromReg(Chain, dl, Reg, RegVT);
}
// If this is an 8 or 16-bit value, it is really passed promoted
// to 32 bits. Insert an assert[sz]ext to capture this, then
// truncate to the right size.
switch (VA.getLocInfo()) {
default: llvm_unreachable("Unknown loc info!");
case CCValAssign::Full: break;
case CCValAssign::BCvt:
ArgValue = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), ArgValue);
break;
case CCValAssign::SExt:
ArgValue = DAG.getNode(ISD::AssertSext, dl, RegVT, ArgValue,
DAG.getValueType(VA.getValVT()));
ArgValue = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), ArgValue);
break;
case CCValAssign::ZExt:
ArgValue = DAG.getNode(ISD::AssertZext, dl, RegVT, ArgValue,
DAG.getValueType(VA.getValVT()));
ArgValue = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), ArgValue);
break;
}
InVals.push_back(ArgValue);
} else { // VA.isRegLoc()
// sanity check
assert(VA.isMemLoc());
assert(VA.getValVT() != MVT::i64 && "i64 should already be lowered");
int index = VA.getValNo();
// Some Ins[] entries become multiple ArgLoc[] entries.
// Process them only once.
if (index != lastInsIndex)
{
ISD::ArgFlagsTy Flags = Ins[index].Flags;
// FIXME: For now, all byval parameter objects are marked mutable.
// This can be changed with more analysis.
// In case of tail call optimization, mark all arguments mutable, since
// they could be overwritten by the lowering of the arguments of a tail
// call.
if (Flags.isByVal()) {
assert(Ins[index].isOrigArg() &&
"Byval arguments cannot be implicit");
unsigned CurByValIndex = CCInfo.getInRegsParamsProcessed();
int FrameIndex = StoreByValRegs(
CCInfo, DAG, dl, Chain, &*CurOrigArg, CurByValIndex,
VA.getLocMemOffset(), Flags.getByValSize());
InVals.push_back(DAG.getFrameIndex(FrameIndex, PtrVT));
CCInfo.nextInRegsParam();
} else {
unsigned FIOffset = VA.getLocMemOffset();
int FI = MFI.CreateFixedObject(VA.getLocVT().getSizeInBits()/8,
FIOffset, true);
// Create load nodes to retrieve arguments from the stack.
SDValue FIN = DAG.getFrameIndex(FI, PtrVT);
InVals.push_back(DAG.getLoad(VA.getValVT(), dl, Chain, FIN,
MachinePointerInfo::getFixedStack(
DAG.getMachineFunction(), FI)));
}
lastInsIndex = index;
}
}
}
// varargs
if (isVarArg && MFI.hasVAStart())
VarArgStyleRegisters(CCInfo, DAG, dl, Chain,
CCInfo.getNextStackOffset(),
TotalArgRegsSaveSize);
AFI->setArgumentStackSize(CCInfo.getNextStackOffset());
return Chain;
}
/// isFloatingPointZero - Return true if this is +0.0.
static bool isFloatingPointZero(SDValue Op) {
if (ConstantFPSDNode *CFP = dyn_cast<ConstantFPSDNode>(Op))
return CFP->getValueAPF().isPosZero();
else if (ISD::isEXTLoad(Op.getNode()) || ISD::isNON_EXTLoad(Op.getNode())) {
// Maybe this has already been legalized into the constant pool?
if (Op.getOperand(1).getOpcode() == ARMISD::Wrapper) {
SDValue WrapperOp = Op.getOperand(1).getOperand(0);
if (ConstantPoolSDNode *CP = dyn_cast<ConstantPoolSDNode>(WrapperOp))
if (const ConstantFP *CFP = dyn_cast<ConstantFP>(CP->getConstVal()))
return CFP->getValueAPF().isPosZero();
}
} else if (Op->getOpcode() == ISD::BITCAST &&
Op->getValueType(0) == MVT::f64) {
// Handle (ISD::BITCAST (ARMISD::VMOVIMM (ISD::TargetConstant 0)) MVT::f64)
// created by LowerConstantFP().
SDValue BitcastOp = Op->getOperand(0);
if (BitcastOp->getOpcode() == ARMISD::VMOVIMM &&
isNullConstant(BitcastOp->getOperand(0)))
return true;
}
return false;
}
/// Returns an appropriate ARM CMP (cmp) and the corresponding condition code
/// for the given operands.
SDValue ARMTargetLowering::getARMCmp(SDValue LHS, SDValue RHS, ISD::CondCode CC,
SDValue &ARMcc, SelectionDAG &DAG,
const SDLoc &dl) const {
if (ConstantSDNode *RHSC = dyn_cast<ConstantSDNode>(RHS.getNode())) {
unsigned C = RHSC->getZExtValue();
if (!isLegalICmpImmediate(C)) {
// Constant does not fit; try adjusting it by one.
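// Illustration (exact legality depends on the subtarget's immediate
// encoding): 0x101 is not a valid ARM modified immediate, but 0x100 is,
// so (x < 0x101) can be rewritten as (x <= 0x100) and still use a single
// cmp with an immediate operand.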
switch (CC) {
default: break;
case ISD::SETLT:
case ISD::SETGE:
if (C != 0x80000000 && isLegalICmpImmediate(C-1)) {
CC = (CC == ISD::SETLT) ? ISD::SETLE : ISD::SETGT;
RHS = DAG.getConstant(C - 1, dl, MVT::i32);
}
break;
case ISD::SETULT:
case ISD::SETUGE:
if (C != 0 && isLegalICmpImmediate(C-1)) {
CC = (CC == ISD::SETULT) ? ISD::SETULE : ISD::SETUGT;
RHS = DAG.getConstant(C - 1, dl, MVT::i32);
}
break;
case ISD::SETLE:
case ISD::SETGT:
if (C != 0x7fffffff && isLegalICmpImmediate(C+1)) {
CC = (CC == ISD::SETLE) ? ISD::SETLT : ISD::SETGE;
RHS = DAG.getConstant(C + 1, dl, MVT::i32);
}
break;
case ISD::SETULE:
case ISD::SETUGT:
if (C != 0xffffffff && isLegalICmpImmediate(C+1)) {
CC = (CC == ISD::SETULE) ? ISD::SETULT : ISD::SETUGE;
RHS = DAG.getConstant(C + 1, dl, MVT::i32);
}
break;
}
}
}
ARMCC::CondCodes CondCode = IntCCToARMCC(CC);
ARMISD::NodeType CompareType;
switch (CondCode) {
default:
CompareType = ARMISD::CMP;
break;
case ARMCC::EQ:
case ARMCC::NE:
// Uses only Z Flag
CompareType = ARMISD::CMPZ;
break;
}
ARMcc = DAG.getConstant(CondCode, dl, MVT::i32);
return DAG.getNode(CompareType, dl, MVT::Glue, LHS, RHS);
}
/// Returns an appropriate VFP CMP (fcmp{s|d}+fmstat) for the given operands.
SDValue ARMTargetLowering::getVFPCmp(SDValue LHS, SDValue RHS,
SelectionDAG &DAG, const SDLoc &dl,
bool InvalidOnQNaN) const {
assert(!Subtarget->isFPOnlySP() || RHS.getValueType() != MVT::f64);
SDValue Cmp;
SDValue C = DAG.getConstant(InvalidOnQNaN, dl, MVT::i32);
if (!isFloatingPointZero(RHS))
Cmp = DAG.getNode(ARMISD::CMPFP, dl, MVT::Glue, LHS, RHS, C);
else
Cmp = DAG.getNode(ARMISD::CMPFPw0, dl, MVT::Glue, LHS, C);
return DAG.getNode(ARMISD::FMSTAT, dl, MVT::Glue, Cmp);
}
/// duplicateCmp - Glue values can have only one use, so this function
/// duplicates a comparison node.
SDValue
ARMTargetLowering::duplicateCmp(SDValue Cmp, SelectionDAG &DAG) const {
unsigned Opc = Cmp.getOpcode();
SDLoc DL(Cmp);
if (Opc == ARMISD::CMP || Opc == ARMISD::CMPZ)
return DAG.getNode(Opc, DL, MVT::Glue, Cmp.getOperand(0),Cmp.getOperand(1));
assert(Opc == ARMISD::FMSTAT && "unexpected comparison operation");
Cmp = Cmp.getOperand(0);
Opc = Cmp.getOpcode();
if (Opc == ARMISD::CMPFP)
Cmp = DAG.getNode(Opc, DL, MVT::Glue, Cmp.getOperand(0),
Cmp.getOperand(1), Cmp.getOperand(2));
else {
assert(Opc == ARMISD::CMPFPw0 && "unexpected operand of FMSTAT");
Cmp = DAG.getNode(Opc, DL, MVT::Glue, Cmp.getOperand(0),
Cmp.getOperand(1));
}
return DAG.getNode(ARMISD::FMSTAT, DL, MVT::Glue, Cmp);
}
std::pair<SDValue, SDValue>
ARMTargetLowering::getARMXALUOOp(SDValue Op, SelectionDAG &DAG,
SDValue &ARMcc) const {
assert(Op.getValueType() == MVT::i32 && "Unsupported value type");
SDValue Value, OverflowCmp;
SDValue LHS = Op.getOperand(0);
SDValue RHS = Op.getOperand(1);
SDLoc dl(Op);
// FIXME: We are currently always generating CMPs because we don't support
// generating CMN through the backend. This is not as good as the natural
// CMP case because it causes a register dependency and cannot be folded
// later.
switch (Op.getOpcode()) {
default:
llvm_unreachable("Unknown overflow instruction!");
case ISD::SADDO:
ARMcc = DAG.getConstant(ARMCC::VC, dl, MVT::i32);
Value = DAG.getNode(ISD::ADD, dl, Op.getValueType(), LHS, RHS);
OverflowCmp = DAG.getNode(ARMISD::CMP, dl, MVT::Glue, Value, LHS);
break;
case ISD::UADDO:
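// Unsigned addition wraps iff the result is unsigned-lower than either
// operand, so the CMP of Value against LHS below sets HS exactly when no
// overflow occurred.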
ARMcc = DAG.getConstant(ARMCC::HS, dl, MVT::i32);
Value = DAG.getNode(ISD::ADD, dl, Op.getValueType(), LHS, RHS);
OverflowCmp = DAG.getNode(ARMISD::CMP, dl, MVT::Glue, Value, LHS);
break;
case ISD::SSUBO:
ARMcc = DAG.getConstant(ARMCC::VC, dl, MVT::i32);
Value = DAG.getNode(ISD::SUB, dl, Op.getValueType(), LHS, RHS);
OverflowCmp = DAG.getNode(ARMISD::CMP, dl, MVT::Glue, LHS, RHS);
break;
case ISD::USUBO:
ARMcc = DAG.getConstant(ARMCC::HS, dl, MVT::i32);
Value = DAG.getNode(ISD::SUB, dl, Op.getValueType(), LHS, RHS);
OverflowCmp = DAG.getNode(ARMISD::CMP, dl, MVT::Glue, LHS, RHS);
break;
} // switch (...)
return std::make_pair(Value, OverflowCmp);
}
SDValue
ARMTargetLowering::LowerXALUO(SDValue Op, SelectionDAG &DAG) const {
// Let legalize expand this if it isn't a legal type yet.
if (!DAG.getTargetLoweringInfo().isTypeLegal(Op.getValueType()))
return SDValue();
SDValue Value, OverflowCmp;
SDValue ARMcc;
std::tie(Value, OverflowCmp) = getARMXALUOOp(Op, DAG, ARMcc);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
SDLoc dl(Op);
// We use 0 and 1 as false and true values.
SDValue TVal = DAG.getConstant(1, dl, MVT::i32);
SDValue FVal = DAG.getConstant(0, dl, MVT::i32);
EVT VT = Op.getValueType();
SDValue Overflow = DAG.getNode(ARMISD::CMOV, dl, VT, TVal, FVal,
ARMcc, CCR, OverflowCmp);
SDVTList VTs = DAG.getVTList(Op.getValueType(), MVT::i32);
return DAG.getNode(ISD::MERGE_VALUES, dl, VTs, Value, Overflow);
}
SDValue ARMTargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
SDValue Cond = Op.getOperand(0);
SDValue SelectTrue = Op.getOperand(1);
SDValue SelectFalse = Op.getOperand(2);
SDLoc dl(Op);
unsigned Opc = Cond.getOpcode();
if (Cond.getResNo() == 1 &&
(Opc == ISD::SADDO || Opc == ISD::UADDO || Opc == ISD::SSUBO ||
Opc == ISD::USUBO)) {
if (!DAG.getTargetLoweringInfo().isTypeLegal(Cond->getValueType(0)))
return SDValue();
SDValue Value, OverflowCmp;
SDValue ARMcc;
std::tie(Value, OverflowCmp) = getARMXALUOOp(Cond, DAG, ARMcc);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
EVT VT = Op.getValueType();
return getCMOV(dl, VT, SelectTrue, SelectFalse, ARMcc, CCR,
OverflowCmp, DAG);
}
// Convert:
//
// (select (cmov 1, 0, cond), t, f) -> (cmov t, f, cond)
// (select (cmov 0, 1, cond), t, f) -> (cmov f, t, cond)
//
if (Cond.getOpcode() == ARMISD::CMOV && Cond.hasOneUse()) {
const ConstantSDNode *CMOVTrue =
dyn_cast<ConstantSDNode>(Cond.getOperand(0));
const ConstantSDNode *CMOVFalse =
dyn_cast<ConstantSDNode>(Cond.getOperand(1));
if (CMOVTrue && CMOVFalse) {
unsigned CMOVTrueVal = CMOVTrue->getZExtValue();
unsigned CMOVFalseVal = CMOVFalse->getZExtValue();
SDValue True;
SDValue False;
if (CMOVTrueVal == 1 && CMOVFalseVal == 0) {
True = SelectTrue;
False = SelectFalse;
} else if (CMOVTrueVal == 0 && CMOVFalseVal == 1) {
True = SelectFalse;
False = SelectTrue;
}
if (True.getNode() && False.getNode()) {
EVT VT = Op.getValueType();
SDValue ARMcc = Cond.getOperand(2);
SDValue CCR = Cond.getOperand(3);
SDValue Cmp = duplicateCmp(Cond.getOperand(4), DAG);
assert(True.getValueType() == VT);
return getCMOV(dl, VT, True, False, ARMcc, CCR, Cmp, DAG);
}
}
}
// ARM's BooleanContents value is UndefinedBooleanContent. Mask out the
// undefined bits before doing a full-word comparison with zero.
Cond = DAG.getNode(ISD::AND, dl, Cond.getValueType(), Cond,
DAG.getConstant(1, dl, Cond.getValueType()));
return DAG.getSelectCC(dl, Cond,
DAG.getConstant(0, dl, Cond.getValueType()),
SelectTrue, SelectFalse, ISD::SETNE);
}
static void checkVSELConstraints(ISD::CondCode CC, ARMCC::CondCodes &CondCode,
bool &swpCmpOps, bool &swpVselOps) {
// Start by selecting the GE condition code for opcodes that return true for
// 'equality'
if (CC == ISD::SETUGE || CC == ISD::SETOGE || CC == ISD::SETOLE ||
CC == ISD::SETULE)
CondCode = ARMCC::GE;
// and GT for opcodes that return false for 'equality'.
else if (CC == ISD::SETUGT || CC == ISD::SETOGT || CC == ISD::SETOLT ||
CC == ISD::SETULT)
CondCode = ARMCC::GT;
// Since we are constrained to GE/GT, if the opcode contains 'less', we need
// to swap the compare operands.
if (CC == ISD::SETOLE || CC == ISD::SETULE || CC == ISD::SETOLT ||
CC == ISD::SETULT)
swpCmpOps = true;
// Both GT and GE are ordered comparisons, and return false for 'unordered'.
// If we have an unordered opcode, we need to swap the operands to the VSEL
// instruction (effectively negating the condition).
//
// This also has the effect of swapping which one of 'less' or 'greater'
// returns true, so we also swap the compare operands. It also switches
// whether we return true for 'equality', so we compensate by picking the
// opposite condition code to our original choice.
if (CC == ISD::SETULE || CC == ISD::SETULT || CC == ISD::SETUGE ||
CC == ISD::SETUGT) {
swpCmpOps = !swpCmpOps;
swpVselOps = !swpVselOps;
CondCode = CondCode == ARMCC::GT ? ARMCC::GE : ARMCC::GT;
}
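// Worked example (illustrative): for CC == SETULT the code above picks GT
// and a compare-operand swap; this clause then undoes the compare swap,
// requests a VSEL-operand swap instead, and relaxes GT to GE. The VSEL
// then yields the original 'true' value exactly when GE(a, b) is false,
// i.e. when a < b or the operands are unordered (SETULT).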
// 'ordered' is 'anything but unordered', so use the VS condition code and
// swap the VSEL operands.
if (CC == ISD::SETO) {
CondCode = ARMCC::VS;
swpVselOps = true;
}
// 'unordered or not equal' is 'anything but equal', so use the EQ condition
// code and swap the VSEL operands.
if (CC == ISD::SETUNE) {
CondCode = ARMCC::EQ;
swpVselOps = true;
}
}
SDValue ARMTargetLowering::getCMOV(const SDLoc &dl, EVT VT, SDValue FalseVal,
SDValue TrueVal, SDValue ARMcc, SDValue CCR,
SDValue Cmp, SelectionDAG &DAG) const {
if (Subtarget->isFPOnlySP() && VT == MVT::f64) {
FalseVal = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), FalseVal);
TrueVal = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), TrueVal);
SDValue TrueLow = TrueVal.getValue(0);
SDValue TrueHigh = TrueVal.getValue(1);
SDValue FalseLow = FalseVal.getValue(0);
SDValue FalseHigh = FalseVal.getValue(1);
SDValue Low = DAG.getNode(ARMISD::CMOV, dl, MVT::i32, FalseLow, TrueLow,
ARMcc, CCR, Cmp);
SDValue High = DAG.getNode(ARMISD::CMOV, dl, MVT::i32, FalseHigh, TrueHigh,
ARMcc, CCR, duplicateCmp(Cmp, DAG));
return DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, Low, High);
} else {
return DAG.getNode(ARMISD::CMOV, dl, VT, FalseVal, TrueVal, ARMcc, CCR,
Cmp);
}
}
static bool isGTorGE(ISD::CondCode CC) {
return CC == ISD::SETGT || CC == ISD::SETGE;
}
static bool isLTorLE(ISD::CondCode CC) {
return CC == ISD::SETLT || CC == ISD::SETLE;
}
// See if a conditional (LHS CC RHS ? TrueVal : FalseVal) is lower-saturating.
// All of these conditions (and their <= and >= counterparts) will do:
// x < k ? k : x
// x > k ? x : k
// k < x ? x : k
// k > x ? k : x
static bool isLowerSaturate(const SDValue LHS, const SDValue RHS,
const SDValue TrueVal, const SDValue FalseVal,
const ISD::CondCode CC, const SDValue K) {
return (isGTorGE(CC) &&
((K == LHS && K == TrueVal) || (K == RHS && K == FalseVal))) ||
(isLTorLE(CC) &&
((K == RHS && K == TrueVal) || (K == LHS && K == FalseVal)));
}
// Similar to isLowerSaturate(), but checks for upper-saturating conditions.
static bool isUpperSaturate(const SDValue LHS, const SDValue RHS,
const SDValue TrueVal, const SDValue FalseVal,
const ISD::CondCode CC, const SDValue K) {
return (isGTorGE(CC) &&
((K == RHS && K == TrueVal) || (K == LHS && K == FalseVal))) ||
(isLTorLE(CC) &&
((K == LHS && K == TrueVal) || (K == RHS && K == FalseVal)));
}
// Check if two chained conditionals could be converted into SSAT.
//
// SSAT can replace a set of two conditional selectors that bound a number to an
// interval of type [k, ~k] when k + 1 is a power of 2. Here are some examples:
//
// x < -k ? -k : (x > k ? k : x)
// x < -k ? -k : (x < k ? x : k)
// x > -k ? (x > k ? k : x) : -k
// x < k ? (x < -k ? -k : x) : k
// etc.
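// A concrete instance (illustrative): with k == 127 (k + 1 == 128, a power
// of 2), "x < -128 ? -128 : (x > 127 ? 127 : x)" clamps x to [-128, 127],
// i.e. an 8-bit signed saturation.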
//
// It returns true if the conversion can be done, false otherwise.
// Additionally, the variable is returned in parameter V and the constant in K.
static bool isSaturatingConditional(const SDValue &Op, SDValue &V,
uint64_t &K) {
SDValue LHS1 = Op.getOperand(0);
SDValue RHS1 = Op.getOperand(1);
SDValue TrueVal1 = Op.getOperand(2);
SDValue FalseVal1 = Op.getOperand(3);
ISD::CondCode CC1 = cast<CondCodeSDNode>(Op.getOperand(4))->get();
const SDValue Op2 = isa<ConstantSDNode>(TrueVal1) ? FalseVal1 : TrueVal1;
if (Op2.getOpcode() != ISD::SELECT_CC)
return false;
SDValue LHS2 = Op2.getOperand(0);
SDValue RHS2 = Op2.getOperand(1);
SDValue TrueVal2 = Op2.getOperand(2);
SDValue FalseVal2 = Op2.getOperand(3);
ISD::CondCode CC2 = cast<CondCodeSDNode>(Op2.getOperand(4))->get();
// Find out which are the constants and which are the variables
// in each conditional
SDValue *K1 = isa<ConstantSDNode>(LHS1) ? &LHS1 : isa<ConstantSDNode>(RHS1)
? &RHS1
: nullptr;
SDValue *K2 = isa<ConstantSDNode>(LHS2) ? &LHS2 : isa<ConstantSDNode>(RHS2)
? &RHS2
: nullptr;
SDValue K2Tmp = isa<ConstantSDNode>(TrueVal2) ? TrueVal2 : FalseVal2;
SDValue V1Tmp = (K1 && *K1 == LHS1) ? RHS1 : LHS1;
SDValue V2Tmp = (K2 && *K2 == LHS2) ? RHS2 : LHS2;
SDValue V2 = (K2Tmp == TrueVal2) ? FalseVal2 : TrueVal2;
// We must detect cases where the original operations worked with 16- or
// 8-bit values. In such cases, V2Tmp != V2 because the comparison operations
// must work with sign-extended values but the select operations return
// the original non-extended value.
SDValue V2TmpReg = V2Tmp;
if (V2Tmp->getOpcode() == ISD::SIGN_EXTEND_INREG)
V2TmpReg = V2Tmp->getOperand(0);
// Check that the registers and the constants have the correct values
// in both conditionals
if (!K1 || !K2 || *K1 == Op2 || *K2 != K2Tmp || V1Tmp != V2Tmp ||
V2TmpReg != V2)
return false;
// Figure out which conditional is saturating the lower/upper bound.
const SDValue *LowerCheckOp =
isLowerSaturate(LHS1, RHS1, TrueVal1, FalseVal1, CC1, *K1)
? &Op
: isLowerSaturate(LHS2, RHS2, TrueVal2, FalseVal2, CC2, *K2)
? &Op2
: nullptr;
const SDValue *UpperCheckOp =
isUpperSaturate(LHS1, RHS1, TrueVal1, FalseVal1, CC1, *K1)
? &Op
: isUpperSaturate(LHS2, RHS2, TrueVal2, FalseVal2, CC2, *K2)
? &Op2
: nullptr;
if (!UpperCheckOp || !LowerCheckOp || LowerCheckOp == UpperCheckOp)
return false;
// Check that the constant in the lower-bound check is
// the one's complement of the constant in the upper-bound check.
int64_t Val1 = cast<ConstantSDNode>(*K1)->getSExtValue();
int64_t Val2 = cast<ConstantSDNode>(*K2)->getSExtValue();
int64_t PosVal = std::max(Val1, Val2);
if (((Val1 > Val2 && UpperCheckOp == &Op) ||
(Val1 < Val2 && UpperCheckOp == &Op2)) &&
Val1 == ~Val2 && isPowerOf2_64(PosVal + 1)) {
V = V2;
K = (uint64_t)PosVal; // At this point, PosVal is guaranteed to be positive
return true;
}
return false;
}
SDValue ARMTargetLowering::LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const {
EVT VT = Op.getValueType();
SDLoc dl(Op);
// Try to convert two saturating conditional selects into a single SSAT
SDValue SatValue;
uint64_t SatConstant;
if (((!Subtarget->isThumb() && Subtarget->hasV6Ops()) ||
     Subtarget->isThumb2()) &&
    isSaturatingConditional(Op, SatValue, SatConstant))
return DAG.getNode(ARMISD::SSAT, dl, VT, SatValue,
DAG.getConstant(countTrailingOnes(SatConstant), dl, VT));
SDValue LHS = Op.getOperand(0);
SDValue RHS = Op.getOperand(1);
ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(4))->get();
SDValue TrueVal = Op.getOperand(2);
SDValue FalseVal = Op.getOperand(3);
if (Subtarget->isFPOnlySP() && LHS.getValueType() == MVT::f64) {
DAG.getTargetLoweringInfo().softenSetCCOperands(DAG, MVT::f64, LHS, RHS, CC,
dl);
// If softenSetCCOperands only returned one value, we should compare it to
// zero.
if (!RHS.getNode()) {
RHS = DAG.getConstant(0, dl, LHS.getValueType());
CC = ISD::SETNE;
}
}
if (LHS.getValueType() == MVT::i32) {
// Try to generate VSEL on ARMv8.
// The VSEL instruction can't use all the usual ARM condition
// codes: it only has two bits to select the condition code, so it's
// constrained to use only GE, GT, VS and EQ.
//
// To implement all the various ISD::SETXXX opcodes, we sometimes need to
// swap the operands of the previous compare instruction (effectively
// inverting the compare condition, swapping 'less' and 'greater') and
// sometimes need to swap the operands to the VSEL (which inverts the
// condition in the sense of firing whenever the previous condition didn't)
if (Subtarget->hasFPARMv8() && (TrueVal.getValueType() == MVT::f32 ||
TrueVal.getValueType() == MVT::f64)) {
ARMCC::CondCodes CondCode = IntCCToARMCC(CC);
if (CondCode == ARMCC::LT || CondCode == ARMCC::LE ||
CondCode == ARMCC::VC || CondCode == ARMCC::NE) {
CC = ISD::getSetCCInverse(CC, true);
std::swap(TrueVal, FalseVal);
}
}
SDValue ARMcc;
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
SDValue Cmp = getARMCmp(LHS, RHS, CC, ARMcc, DAG, dl);
return getCMOV(dl, VT, FalseVal, TrueVal, ARMcc, CCR, Cmp, DAG);
}
ARMCC::CondCodes CondCode, CondCode2;
bool InvalidOnQNaN;
FPCCToARMCC(CC, CondCode, CondCode2, InvalidOnQNaN);
// Try to generate VMAXNM/VMINNM on ARMv8.
if (Subtarget->hasFPARMv8() && (TrueVal.getValueType() == MVT::f32 ||
TrueVal.getValueType() == MVT::f64)) {
bool swpCmpOps = false;
bool swpVselOps = false;
checkVSELConstraints(CC, CondCode, swpCmpOps, swpVselOps);
if (CondCode == ARMCC::GT || CondCode == ARMCC::GE ||
CondCode == ARMCC::VS || CondCode == ARMCC::EQ) {
if (swpCmpOps)
std::swap(LHS, RHS);
if (swpVselOps)
std::swap(TrueVal, FalseVal);
}
}
SDValue ARMcc = DAG.getConstant(CondCode, dl, MVT::i32);
SDValue Cmp = getVFPCmp(LHS, RHS, DAG, dl, InvalidOnQNaN);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
SDValue Result = getCMOV(dl, VT, FalseVal, TrueVal, ARMcc, CCR, Cmp, DAG);
if (CondCode2 != ARMCC::AL) {
SDValue ARMcc2 = DAG.getConstant(CondCode2, dl, MVT::i32);
// FIXME: Needs another CMP because the flag can have only one use.
SDValue Cmp2 = getVFPCmp(LHS, RHS, DAG, dl, InvalidOnQNaN);
Result = getCMOV(dl, VT, Result, TrueVal, ARMcc2, CCR, Cmp2, DAG);
}
return Result;
}
/// canChangeToInt - Given the fp compare operand, return true if it is suitable
/// to morph to an integer compare sequence.
static bool canChangeToInt(SDValue Op, bool &SeenZero,
const ARMSubtarget *Subtarget) {
SDNode *N = Op.getNode();
if (!N->hasOneUse())
// Otherwise it requires moving the value from fp to integer registers.
return false;
if (!N->getNumValues())
return false;
EVT VT = Op.getValueType();
if (VT != MVT::f32 && !Subtarget->isFPBrccSlow())
// The f32 case is generally profitable. The f64 case only makes sense when
// vcmpe + vmrs are very slow, e.g. on Cortex-A8.
return false;
if (isFloatingPointZero(Op)) {
SeenZero = true;
return true;
}
return ISD::isNormalLoad(N);
}
static SDValue bitcastf32Toi32(SDValue Op, SelectionDAG &DAG) {
if (isFloatingPointZero(Op))
return DAG.getConstant(0, SDLoc(Op), MVT::i32);
if (LoadSDNode *Ld = dyn_cast<LoadSDNode>(Op))
return DAG.getLoad(MVT::i32, SDLoc(Op), Ld->getChain(), Ld->getBasePtr(),
Ld->getPointerInfo(), Ld->getAlignment(),
Ld->getMemOperand()->getFlags());
llvm_unreachable("Unknown VFP cmp argument!");
}
static void expandf64Toi32(SDValue Op, SelectionDAG &DAG,
SDValue &RetVal1, SDValue &RetVal2) {
SDLoc dl(Op);
if (isFloatingPointZero(Op)) {
RetVal1 = DAG.getConstant(0, dl, MVT::i32);
RetVal2 = DAG.getConstant(0, dl, MVT::i32);
return;
}
if (LoadSDNode *Ld = dyn_cast<LoadSDNode>(Op)) {
SDValue Ptr = Ld->getBasePtr();
RetVal1 =
DAG.getLoad(MVT::i32, dl, Ld->getChain(), Ptr, Ld->getPointerInfo(),
Ld->getAlignment(), Ld->getMemOperand()->getFlags());
EVT PtrType = Ptr.getValueType();
unsigned NewAlign = MinAlign(Ld->getAlignment(), 4);
SDValue NewPtr = DAG.getNode(ISD::ADD, dl,
PtrType, Ptr, DAG.getConstant(4, dl, PtrType));
RetVal2 = DAG.getLoad(MVT::i32, dl, Ld->getChain(), NewPtr,
Ld->getPointerInfo().getWithOffset(4), NewAlign,
Ld->getMemOperand()->getFlags());
return;
}
llvm_unreachable("Unknown VFP cmp argument!");
}
/// OptimizeVFPBrcond - With -enable-unsafe-fp-math, it's legal to optimize some
/// f32 and even f64 comparisons to integer ones.
SDValue
ARMTargetLowering::OptimizeVFPBrcond(SDValue Op, SelectionDAG &DAG) const {
SDValue Chain = Op.getOperand(0);
ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(1))->get();
SDValue LHS = Op.getOperand(2);
SDValue RHS = Op.getOperand(3);
SDValue Dest = Op.getOperand(4);
SDLoc dl(Op);
bool LHSSeenZero = false;
bool LHSOk = canChangeToInt(LHS, LHSSeenZero, Subtarget);
bool RHSSeenZero = false;
bool RHSOk = canChangeToInt(RHS, RHSSeenZero, Subtarget);
if (LHSOk && RHSOk && (LHSSeenZero || RHSSeenZero)) {
// If unsafe fp math optimization is enabled and there are no other uses of
// the CMP operands, and the condition code is EQ or NE, we can optimize it
// to an integer comparison.
if (CC == ISD::SETOEQ)
CC = ISD::SETEQ;
else if (CC == ISD::SETUNE)
CC = ISD::SETNE;
SDValue Mask = DAG.getConstant(0x7fffffff, dl, MVT::i32);
SDValue ARMcc;
if (LHS.getValueType() == MVT::f32) {
LHS = DAG.getNode(ISD::AND, dl, MVT::i32,
bitcastf32Toi32(LHS, DAG), Mask);
RHS = DAG.getNode(ISD::AND, dl, MVT::i32,
bitcastf32Toi32(RHS, DAG), Mask);
SDValue Cmp = getARMCmp(LHS, RHS, CC, ARMcc, DAG, dl);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
return DAG.getNode(ARMISD::BRCOND, dl, MVT::Other,
Chain, Dest, ARMcc, CCR, Cmp);
}
SDValue LHS1, LHS2;
SDValue RHS1, RHS2;
expandf64Toi32(LHS, DAG, LHS1, LHS2);
expandf64Toi32(RHS, DAG, RHS1, RHS2);
LHS2 = DAG.getNode(ISD::AND, dl, MVT::i32, LHS2, Mask);
RHS2 = DAG.getNode(ISD::AND, dl, MVT::i32, RHS2, Mask);
ARMCC::CondCodes CondCode = IntCCToARMCC(CC);
ARMcc = DAG.getConstant(CondCode, dl, MVT::i32);
SDVTList VTList = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue Ops[] = { Chain, ARMcc, LHS1, LHS2, RHS1, RHS2, Dest };
return DAG.getNode(ARMISD::BCC_i64, dl, VTList, Ops);
}
return SDValue();
}
SDValue ARMTargetLowering::LowerBR_CC(SDValue Op, SelectionDAG &DAG) const {
SDValue Chain = Op.getOperand(0);
ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(1))->get();
SDValue LHS = Op.getOperand(2);
SDValue RHS = Op.getOperand(3);
SDValue Dest = Op.getOperand(4);
SDLoc dl(Op);
if (Subtarget->isFPOnlySP() && LHS.getValueType() == MVT::f64) {
DAG.getTargetLoweringInfo().softenSetCCOperands(DAG, MVT::f64, LHS, RHS, CC,
dl);
// If softenSetCCOperands only returned one value, we should compare it to
// zero.
if (!RHS.getNode()) {
RHS = DAG.getConstant(0, dl, LHS.getValueType());
CC = ISD::SETNE;
}
}
if (LHS.getValueType() == MVT::i32) {
SDValue ARMcc;
SDValue Cmp = getARMCmp(LHS, RHS, CC, ARMcc, DAG, dl);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
return DAG.getNode(ARMISD::BRCOND, dl, MVT::Other,
Chain, Dest, ARMcc, CCR, Cmp);
}
assert(LHS.getValueType() == MVT::f32 || LHS.getValueType() == MVT::f64);
if (getTargetMachine().Options.UnsafeFPMath &&
(CC == ISD::SETEQ || CC == ISD::SETOEQ ||
CC == ISD::SETNE || CC == ISD::SETUNE)) {
if (SDValue Result = OptimizeVFPBrcond(Op, DAG))
return Result;
}
ARMCC::CondCodes CondCode, CondCode2;
bool InvalidOnQNaN;
FPCCToARMCC(CC, CondCode, CondCode2, InvalidOnQNaN);
SDValue ARMcc = DAG.getConstant(CondCode, dl, MVT::i32);
SDValue Cmp = getVFPCmp(LHS, RHS, DAG, dl, InvalidOnQNaN);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
SDVTList VTList = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue Ops[] = { Chain, Dest, ARMcc, CCR, Cmp };
SDValue Res = DAG.getNode(ARMISD::BRCOND, dl, VTList, Ops);
if (CondCode2 != ARMCC::AL) {
ARMcc = DAG.getConstant(CondCode2, dl, MVT::i32);
SDValue Ops[] = { Res, Dest, ARMcc, CCR, Res.getValue(1) };
Res = DAG.getNode(ARMISD::BRCOND, dl, VTList, Ops);
}
return Res;
}
SDValue ARMTargetLowering::LowerBR_JT(SDValue Op, SelectionDAG &DAG) const {
SDValue Chain = Op.getOperand(0);
SDValue Table = Op.getOperand(1);
SDValue Index = Op.getOperand(2);
SDLoc dl(Op);
EVT PTy = getPointerTy(DAG.getDataLayout());
JumpTableSDNode *JT = cast<JumpTableSDNode>(Table);
SDValue JTI = DAG.getTargetJumpTable(JT->getIndex(), PTy);
Table = DAG.getNode(ARMISD::WrapperJT, dl, MVT::i32, JTI);
Index = DAG.getNode(ISD::MUL, dl, PTy, Index, DAG.getConstant(4, dl, PTy));
SDValue Addr = DAG.getNode(ISD::ADD, dl, PTy, Index, Table);
if (Subtarget->isThumb2() ||
    (Subtarget->hasV8MBaselineOps() && Subtarget->isThumb())) {
// Thumb2 and ARMv8-M use a two-level jump. That is, they jump into the
// jump table, which does another jump to the destination. This also makes
// it easier to translate it to TBB / TBH later (Thumb2 only).
// FIXME: This might not work if the function is extremely large.
return DAG.getNode(ARMISD::BR2_JT, dl, MVT::Other, Chain,
Addr, Op.getOperand(2), JTI);
}
if (isPositionIndependent() || Subtarget->isROPI()) {
Addr =
DAG.getLoad((EVT)MVT::i32, dl, Chain, Addr,
MachinePointerInfo::getJumpTable(DAG.getMachineFunction()));
Chain = Addr.getValue(1);
Addr = DAG.getNode(ISD::ADD, dl, PTy, Addr, Table);
return DAG.getNode(ARMISD::BR_JT, dl, MVT::Other, Chain, Addr, JTI);
} else {
Addr =
DAG.getLoad(PTy, dl, Chain, Addr,
MachinePointerInfo::getJumpTable(DAG.getMachineFunction()));
Chain = Addr.getValue(1);
return DAG.getNode(ARMISD::BR_JT, dl, MVT::Other, Chain, Addr, JTI);
}
}
static SDValue LowerVectorFP_TO_INT(SDValue Op, SelectionDAG &DAG) {
EVT VT = Op.getValueType();
SDLoc dl(Op);
if (Op.getValueType().getVectorElementType() == MVT::i32) {
if (Op.getOperand(0).getValueType().getVectorElementType() == MVT::f32)
return Op;
return DAG.UnrollVectorOp(Op.getNode());
}
assert(Op.getOperand(0).getValueType() == MVT::v4f32 &&
"Invalid type for custom lowering!");
if (VT != MVT::v4i16)
return DAG.UnrollVectorOp(Op.getNode());
Op = DAG.getNode(Op.getOpcode(), dl, MVT::v4i32, Op.getOperand(0));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Op);
}
SDValue ARMTargetLowering::LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const {
EVT VT = Op.getValueType();
if (VT.isVector())
return LowerVectorFP_TO_INT(Op, DAG);
if (Subtarget->isFPOnlySP() && Op.getOperand(0).getValueType() == MVT::f64) {
RTLIB::Libcall LC;
if (Op.getOpcode() == ISD::FP_TO_SINT)
LC = RTLIB::getFPTOSINT(Op.getOperand(0).getValueType(),
Op.getValueType());
else
LC = RTLIB::getFPTOUINT(Op.getOperand(0).getValueType(),
Op.getValueType());
return makeLibCall(DAG, LC, Op.getValueType(), Op.getOperand(0),
/*isSigned*/ false, SDLoc(Op)).first;
}
return Op;
}
static SDValue LowerVectorINT_TO_FP(SDValue Op, SelectionDAG &DAG) {
EVT VT = Op.getValueType();
SDLoc dl(Op);
if (Op.getOperand(0).getValueType().getVectorElementType() == MVT::i32) {
if (VT.getVectorElementType() == MVT::f32)
return Op;
return DAG.UnrollVectorOp(Op.getNode());
}
assert(Op.getOperand(0).getValueType() == MVT::v4i16 &&
"Invalid type for custom lowering!");
if (VT != MVT::v4f32)
return DAG.UnrollVectorOp(Op.getNode());
unsigned CastOpc;
unsigned Opc;
switch (Op.getOpcode()) {
default: llvm_unreachable("Invalid opcode!");
case ISD::SINT_TO_FP:
CastOpc = ISD::SIGN_EXTEND;
Opc = ISD::SINT_TO_FP;
break;
case ISD::UINT_TO_FP:
CastOpc = ISD::ZERO_EXTEND;
Opc = ISD::UINT_TO_FP;
break;
}
Op = DAG.getNode(CastOpc, dl, MVT::v4i32, Op.getOperand(0));
return DAG.getNode(Opc, dl, VT, Op);
}
SDValue ARMTargetLowering::LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG) const {
EVT VT = Op.getValueType();
if (VT.isVector())
return LowerVectorINT_TO_FP(Op, DAG);
if (Subtarget->isFPOnlySP() && Op.getValueType() == MVT::f64) {
RTLIB::Libcall LC;
if (Op.getOpcode() == ISD::SINT_TO_FP)
LC = RTLIB::getSINTTOFP(Op.getOperand(0).getValueType(),
Op.getValueType());
else
LC = RTLIB::getUINTTOFP(Op.getOperand(0).getValueType(),
Op.getValueType());
return makeLibCall(DAG, LC, Op.getValueType(), Op.getOperand(0),
/*isSigned*/ false, SDLoc(Op)).first;
}
return Op;
}
SDValue ARMTargetLowering::LowerFCOPYSIGN(SDValue Op, SelectionDAG &DAG) const {
// Implement fcopysign by combining the sign bit of operand 1 with the
// magnitude of operand 0, using either NEON VBSL or integer bit operations.
SDValue Tmp0 = Op.getOperand(0);
SDValue Tmp1 = Op.getOperand(1);
SDLoc dl(Op);
EVT VT = Op.getValueType();
EVT SrcVT = Tmp1.getValueType();
bool InGPR = Tmp0.getOpcode() == ISD::BITCAST ||
Tmp0.getOpcode() == ARMISD::VMOVDRR;
bool UseNEON = !InGPR && Subtarget->hasNEON();
if (UseNEON) {
// Use VBSL to copy the sign bit.
unsigned EncodedVal = ARM_AM::createNEONModImm(0x6, 0x80);
SDValue Mask = DAG.getNode(ARMISD::VMOVIMM, dl, MVT::v2i32,
DAG.getTargetConstant(EncodedVal, dl, MVT::i32));
EVT OpVT = (VT == MVT::f32) ? MVT::v2i32 : MVT::v1i64;
if (VT == MVT::f64)
Mask = DAG.getNode(ARMISD::VSHL, dl, OpVT,
DAG.getNode(ISD::BITCAST, dl, OpVT, Mask),
DAG.getConstant(32, dl, MVT::i32));
else /*if (VT == MVT::f32)*/
Tmp0 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2f32, Tmp0);
if (SrcVT == MVT::f32) {
Tmp1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2f32, Tmp1);
if (VT == MVT::f64)
Tmp1 = DAG.getNode(ARMISD::VSHL, dl, OpVT,
DAG.getNode(ISD::BITCAST, dl, OpVT, Tmp1),
DAG.getConstant(32, dl, MVT::i32));
} else if (VT == MVT::f32)
Tmp1 = DAG.getNode(ARMISD::VSHRu, dl, MVT::v1i64,
DAG.getNode(ISD::BITCAST, dl, MVT::v1i64, Tmp1),
DAG.getConstant(32, dl, MVT::i32));
Tmp0 = DAG.getNode(ISD::BITCAST, dl, OpVT, Tmp0);
Tmp1 = DAG.getNode(ISD::BITCAST, dl, OpVT, Tmp1);
SDValue AllOnes = DAG.getTargetConstant(ARM_AM::createNEONModImm(0xe, 0xff),
dl, MVT::i32);
AllOnes = DAG.getNode(ARMISD::VMOVIMM, dl, MVT::v8i8, AllOnes);
SDValue MaskNot = DAG.getNode(ISD::XOR, dl, OpVT, Mask,
DAG.getNode(ISD::BITCAST, dl, OpVT, AllOnes));
SDValue Res = DAG.getNode(ISD::OR, dl, OpVT,
DAG.getNode(ISD::AND, dl, OpVT, Tmp1, Mask),
DAG.getNode(ISD::AND, dl, OpVT, Tmp0, MaskNot));
if (VT == MVT::f32) {
Res = DAG.getNode(ISD::BITCAST, dl, MVT::v2f32, Res);
Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f32, Res,
DAG.getConstant(0, dl, MVT::i32));
} else {
Res = DAG.getNode(ISD::BITCAST, dl, MVT::f64, Res);
}
return Res;
}
// Bitcast operand 1 to i32.
if (SrcVT == MVT::f64)
Tmp1 = DAG.getNode(ARMISD::VMOVRRD, dl, DAG.getVTList(MVT::i32, MVT::i32),
Tmp1).getValue(1);
Tmp1 = DAG.getNode(ISD::BITCAST, dl, MVT::i32, Tmp1);
// Or in the signbit with integer operations.
SDValue Mask1 = DAG.getConstant(0x80000000, dl, MVT::i32);
SDValue Mask2 = DAG.getConstant(0x7fffffff, dl, MVT::i32);
Tmp1 = DAG.getNode(ISD::AND, dl, MVT::i32, Tmp1, Mask1);
if (VT == MVT::f32) {
Tmp0 = DAG.getNode(ISD::AND, dl, MVT::i32,
DAG.getNode(ISD::BITCAST, dl, MVT::i32, Tmp0), Mask2);
return DAG.getNode(ISD::BITCAST, dl, MVT::f32,
DAG.getNode(ISD::OR, dl, MVT::i32, Tmp0, Tmp1));
}
// f64: Or the high part with signbit and then combine two parts.
Tmp0 = DAG.getNode(ARMISD::VMOVRRD, dl, DAG.getVTList(MVT::i32, MVT::i32),
Tmp0);
SDValue Lo = Tmp0.getValue(0);
SDValue Hi = DAG.getNode(ISD::AND, dl, MVT::i32, Tmp0.getValue(1), Mask2);
Hi = DAG.getNode(ISD::OR, dl, MVT::i32, Hi, Tmp1);
return DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, Lo, Hi);
}
SDValue ARMTargetLowering::LowerRETURNADDR(SDValue Op, SelectionDAG &DAG) const{
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
MFI.setReturnAddressIsTaken(true);
if (verifyReturnAddressArgumentIsConstant(Op, DAG))
return SDValue();
EVT VT = Op.getValueType();
SDLoc dl(Op);
unsigned Depth = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
if (Depth) {
SDValue FrameAddr = LowerFRAMEADDR(Op, DAG);
SDValue Offset = DAG.getConstant(4, dl, MVT::i32);
return DAG.getLoad(VT, dl, DAG.getEntryNode(),
DAG.getNode(ISD::ADD, dl, VT, FrameAddr, Offset),
MachinePointerInfo());
}
// Return LR, which contains the return address. Mark it an implicit live-in.
unsigned Reg = MF.addLiveIn(ARM::LR, getRegClassFor(MVT::i32));
return DAG.getCopyFromReg(DAG.getEntryNode(), dl, Reg, VT);
}
SDValue ARMTargetLowering::LowerFRAMEADDR(SDValue Op, SelectionDAG &DAG) const {
const ARMBaseRegisterInfo &ARI =
*static_cast<const ARMBaseRegisterInfo*>(RegInfo);
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
MFI.setFrameAddressIsTaken(true);
EVT VT = Op.getValueType();
SDLoc dl(Op); // FIXME probably not meaningful
unsigned Depth = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
unsigned FrameReg = ARI.getFrameRegister(MF);
SDValue FrameAddr = DAG.getCopyFromReg(DAG.getEntryNode(), dl, FrameReg, VT);
while (Depth--)
FrameAddr = DAG.getLoad(VT, dl, DAG.getEntryNode(), FrameAddr,
MachinePointerInfo());
return FrameAddr;
}
// FIXME? Maybe this could be a TableGen attribute on some registers and
// this table could be generated automatically from RegInfo.
unsigned ARMTargetLowering::getRegisterByName(const char* RegName, EVT VT,
SelectionDAG &DAG) const {
unsigned Reg = StringSwitch<unsigned>(RegName)
.Case("sp", ARM::SP)
.Default(0);
if (Reg)
return Reg;
report_fatal_error(Twine("Invalid register name \""
+ StringRef(RegName) + "\"."));
}
// The result is a 64-bit value, so split it into two 32-bit values and
// return them as a pair of values.
static void ExpandREAD_REGISTER(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) {
SDLoc DL(N);
// This function is only supposed to be called for an i64 destination type.
assert(N->getValueType(0) == MVT::i64
&& "ExpandREAD_REGISTER called for non-i64 type result.");
SDValue Read = DAG.getNode(ISD::READ_REGISTER, DL,
DAG.getVTList(MVT::i32, MVT::i32, MVT::Other),
N->getOperand(0),
N->getOperand(1));
Results.push_back(DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Read.getValue(0),
Read.getValue(1)));
Results.push_back(Read.getOperand(0));
}
/// \p BC is a bitcast that is about to be turned into a VMOVDRR.
/// When \p DstVT, the destination type of \p BC, is on the vector
/// register bank and the source of bitcast, \p Op, operates on the same bank,
/// it might be possible to combine them, such that everything stays on the
/// vector register bank.
/// \return The node that would replace \p BC, if the combine
/// is possible.
static SDValue CombineVMOVDRRCandidateWithVecOp(const SDNode *BC,
SelectionDAG &DAG) {
SDValue Op = BC->getOperand(0);
EVT DstVT = BC->getValueType(0);
// The only vector instruction that can produce a scalar (remember,
// since the bitcast was about to be turned into VMOVDRR, the source
// type is i64) from a vector is EXTRACT_VECTOR_ELT.
// Moreover, we can do this combine only if there is one use.
// Finally, if the destination type is not a vector, there is not
// much point in forcing everything onto the vector bank.
if (!DstVT.isVector() || Op.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
!Op.hasOneUse())
return SDValue();
// If the index is not constant, we will introduce an additional
// multiply that cannot be folded away.
// Give up in that case.
ConstantSDNode *Index = dyn_cast<ConstantSDNode>(Op.getOperand(1));
if (!Index)
return SDValue();
unsigned DstNumElt = DstVT.getVectorNumElements();
// Compute the new index.
const APInt &APIntIndex = Index->getAPIntValue();
APInt NewIndex(APIntIndex.getBitWidth(), DstNumElt);
NewIndex *= APIntIndex;
// Check if the new constant index fits into i32.
if (NewIndex.getBitWidth() > 32)
return SDValue();
// vMTy bitcast(i64 extractelt vNi64 src, i32 index) ->
// vMTy extractsubvector vNxMTy (bitcast vNi64 src), i32 index*M)
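// Concrete instance (illustrative):
// v2f32 bitcast(i64 extractelt v2i64 src, i32 1) ->
// v2f32 extractsubvector (v4f32 bitcast v2i64 src), i32 2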
SDLoc dl(Op);
SDValue ExtractSrc = Op.getOperand(0);
EVT VecVT = EVT::getVectorVT(
*DAG.getContext(), DstVT.getScalarType(),
ExtractSrc.getValueType().getVectorNumElements() * DstNumElt);
SDValue BitCast = DAG.getNode(ISD::BITCAST, dl, VecVT, ExtractSrc);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DstVT, BitCast,
DAG.getConstant(NewIndex.getZExtValue(), dl, MVT::i32));
}
/// ExpandBITCAST - If the target supports VFP, this function is called to
/// expand a bit convert where either the source or destination type is i64 to
/// use a VMOVDRR or VMOVRRD node. This should not be done when the non-i64
/// operand type is illegal (e.g., v2f32 for a target that doesn't support
/// vectors), since the legalizer won't know what to do with that.
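/// For example (illustrative), (f64 (bitcast (i64 x))) becomes
/// (VMOVDRR (extract_element x, 0), (extract_element x, 1)), and
/// (i64 (bitcast (f64 y))) becomes a BUILD_PAIR of the two VMOVRRD results.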
static SDValue ExpandBITCAST(SDNode *N, SelectionDAG &DAG) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(N);
SDValue Op = N->getOperand(0);
// This function is only supposed to be called for i64 types, either as the
// source or destination of the bit convert.
EVT SrcVT = Op.getValueType();
EVT DstVT = N->getValueType(0);
assert((SrcVT == MVT::i64 || DstVT == MVT::i64) &&
"ExpandBITCAST called for non-i64 type");
// Turn i64->f64 into VMOVDRR.
if (SrcVT == MVT::i64 && TLI.isTypeLegal(DstVT)) {
// Do not force values to GPRs (this is what VMOVDRR does for the inputs)
// if we can combine the bitcast with its source.
if (SDValue Val = CombineVMOVDRRCandidateWithVecOp(N, DAG))
return Val;
SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op,
DAG.getConstant(0, dl, MVT::i32));
SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op,
DAG.getConstant(1, dl, MVT::i32));
return DAG.getNode(ISD::BITCAST, dl, DstVT,
DAG.getNode(ARMISD::VMOVDRR, dl, MVT::f64, Lo, Hi));
}
// Turn f64->i64 into VMOVRRD.
if (DstVT == MVT::i64 && TLI.isTypeLegal(SrcVT)) {
SDValue Cvt;
if (DAG.getDataLayout().isBigEndian() && SrcVT.isVector() &&
SrcVT.getVectorNumElements() > 1)
Cvt = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32),
DAG.getNode(ARMISD::VREV64, dl, SrcVT, Op));
else
Cvt = DAG.getNode(ARMISD::VMOVRRD, dl,
DAG.getVTList(MVT::i32, MVT::i32), Op);
// Merge the pieces into a single i64 value.
return DAG.getNode(ISD::BUILD_PAIR, dl, MVT::i64, Cvt, Cvt.getValue(1));
}
return SDValue();
}
/// getZeroVector - Returns a vector of specified type with all zero elements.
/// Zero vectors are used to represent vector negation and in those cases
/// will be implemented with the NEON VNEG instruction. However, VNEG does
/// not support i64 elements, so sometimes the zero vectors will need to be
/// explicitly constructed. Regardless, use a canonical VMOV to create the
/// zero vector.
static SDValue getZeroVector(EVT VT, SelectionDAG &DAG, const SDLoc &dl) {
assert(VT.isVector() && "Expected a vector type");
// The canonical modified immediate encoding of a zero vector is....0!
SDValue EncodedVal = DAG.getTargetConstant(0, dl, MVT::i32);
EVT VmovVT = VT.is128BitVector() ? MVT::v4i32 : MVT::v2i32;
SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, EncodedVal);
return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
}
/// LowerShiftRightParts - Lower SRA_PARTS, which returns two
/// i32 values and takes a 2 x i32 value to shift plus a shift amount.
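/// Decomposition sketch (illustrative), for Lo/Hi i32 halves and a shift
/// amount n:
///   n < 32:  Lo = (Lo srl n) | (Hi shl (32 - n)),  Hi = Hi {sra|srl} n
///   n >= 32: Lo = Hi {sra|srl} (n - 32),  Hi = sra ? (Hi sra 31) : 0
/// Both variants are computed and ARMISD::CMOV nodes select between them
/// based on a compare of (n - 32) against zero.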
SDValue ARMTargetLowering::LowerShiftRightParts(SDValue Op,
SelectionDAG &DAG) const {
assert(Op.getNumOperands() == 3 && "Not a double-shift!");
EVT VT = Op.getValueType();
unsigned VTBits = VT.getSizeInBits();
SDLoc dl(Op);
SDValue ShOpLo = Op.getOperand(0);
SDValue ShOpHi = Op.getOperand(1);
SDValue ShAmt = Op.getOperand(2);
SDValue ARMcc;
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
unsigned Opc = (Op.getOpcode() == ISD::SRA_PARTS) ? ISD::SRA : ISD::SRL;
assert(Op.getOpcode() == ISD::SRA_PARTS || Op.getOpcode() == ISD::SRL_PARTS);
SDValue RevShAmt = DAG.getNode(ISD::SUB, dl, MVT::i32,
DAG.getConstant(VTBits, dl, MVT::i32), ShAmt);
SDValue Tmp1 = DAG.getNode(ISD::SRL, dl, VT, ShOpLo, ShAmt);
SDValue ExtraShAmt = DAG.getNode(ISD::SUB, dl, MVT::i32, ShAmt,
DAG.getConstant(VTBits, dl, MVT::i32));
SDValue Tmp2 = DAG.getNode(ISD::SHL, dl, VT, ShOpHi, RevShAmt);
SDValue LoSmallShift = DAG.getNode(ISD::OR, dl, VT, Tmp1, Tmp2);
SDValue LoBigShift = DAG.getNode(Opc, dl, VT, ShOpHi, ExtraShAmt);
SDValue CmpLo = getARMCmp(ExtraShAmt, DAG.getConstant(0, dl, MVT::i32),
ISD::SETGE, ARMcc, DAG, dl);
SDValue Lo = DAG.getNode(ARMISD::CMOV, dl, VT, LoSmallShift, LoBigShift,
ARMcc, CCR, CmpLo);
SDValue HiSmallShift = DAG.getNode(Opc, dl, VT, ShOpHi, ShAmt);
SDValue HiBigShift = Opc == ISD::SRA
? DAG.getNode(Opc, dl, VT, ShOpHi,
DAG.getConstant(VTBits - 1, dl, VT))
: DAG.getConstant(0, dl, VT);
SDValue CmpHi = getARMCmp(ExtraShAmt, DAG.getConstant(0, dl, MVT::i32),
ISD::SETGE, ARMcc, DAG, dl);
SDValue Hi = DAG.getNode(ARMISD::CMOV, dl, VT, HiSmallShift, HiBigShift,
ARMcc, CCR, CmpHi);
SDValue Ops[2] = { Lo, Hi };
return DAG.getMergeValues(Ops, dl);
}
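// Editor's note: an illustrative scalar model of the SRL/SRA_PARTS
// decomposition above, using only builtin types; shown for the SRL case
// (SRA would replace the zeroed high result with a sign fill). The helper
// name ModelSRL64 is invented for illustration and is not part of the
// backend. Amt is assumed to be in [0, 63].
static inline unsigned long long ModelSRL64(unsigned Lo, unsigned Hi,
                                            unsigned Amt) {
  unsigned NewLo, NewHi;
  if (Amt == 0) {                             // avoid UB on 32-bit shifts
    NewLo = Lo;
    NewHi = Hi;
  } else if (Amt < 32) {
    NewLo = (Lo >> Amt) | (Hi << (32 - Amt)); // LoSmallShift
    NewHi = Hi >> Amt;                        // HiSmallShift
  } else {
    NewLo = Hi >> (Amt - 32);                 // LoBigShift
    NewHi = 0;                                // HiBigShift (SRL case)
  }
  return ((unsigned long long)NewHi << 32) | NewLo;
}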
/// LowerShiftLeftParts - Lower SHL_PARTS, which returns two
/// i32 values and takes a 2 x i32 value to shift plus a shift amount.
SDValue ARMTargetLowering::LowerShiftLeftParts(SDValue Op,
SelectionDAG &DAG) const {
assert(Op.getNumOperands() == 3 && "Not a double-shift!");
EVT VT = Op.getValueType();
unsigned VTBits = VT.getSizeInBits();
SDLoc dl(Op);
SDValue ShOpLo = Op.getOperand(0);
SDValue ShOpHi = Op.getOperand(1);
SDValue ShAmt = Op.getOperand(2);
SDValue ARMcc;
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
assert(Op.getOpcode() == ISD::SHL_PARTS);
SDValue RevShAmt = DAG.getNode(ISD::SUB, dl, MVT::i32,
DAG.getConstant(VTBits, dl, MVT::i32), ShAmt);
SDValue Tmp1 = DAG.getNode(ISD::SRL, dl, VT, ShOpLo, RevShAmt);
SDValue Tmp2 = DAG.getNode(ISD::SHL, dl, VT, ShOpHi, ShAmt);
SDValue HiSmallShift = DAG.getNode(ISD::OR, dl, VT, Tmp1, Tmp2);
SDValue ExtraShAmt = DAG.getNode(ISD::SUB, dl, MVT::i32, ShAmt,
DAG.getConstant(VTBits, dl, MVT::i32));
SDValue HiBigShift = DAG.getNode(ISD::SHL, dl, VT, ShOpLo, ExtraShAmt);
SDValue CmpHi = getARMCmp(ExtraShAmt, DAG.getConstant(0, dl, MVT::i32),
ISD::SETGE, ARMcc, DAG, dl);
SDValue Hi = DAG.getNode(ARMISD::CMOV, dl, VT, HiSmallShift, HiBigShift,
ARMcc, CCR, CmpHi);
SDValue CmpLo = getARMCmp(ExtraShAmt, DAG.getConstant(0, dl, MVT::i32),
ISD::SETGE, ARMcc, DAG, dl);
SDValue LoSmallShift = DAG.getNode(ISD::SHL, dl, VT, ShOpLo, ShAmt);
SDValue Lo = DAG.getNode(ARMISD::CMOV, dl, VT, LoSmallShift,
DAG.getConstant(0, dl, VT), ARMcc, CCR, CmpLo);
SDValue Ops[2] = { Lo, Hi };
return DAG.getMergeValues(Ops, dl);
}
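// Editor's note: the matching scalar sketch for SHL_PARTS (the ModelSHL64
// name is hypothetical, Amt assumed in [0, 63]). For Amt < 32 the high
// word combines both inputs; for Amt >= 32 the low word becomes zero and
// the high word is the low input shifted by (Amt - 32), mirroring the
// CMOV selections above.
static inline unsigned long long ModelSHL64(unsigned Lo, unsigned Hi,
                                            unsigned Amt) {
  if (Amt == 0)                               // avoid UB on 32-bit shifts
    return ((unsigned long long)Hi << 32) | Lo;
  unsigned NewLo = Amt < 32 ? Lo << Amt : 0;
  unsigned NewHi = Amt < 32 ? (Hi << Amt) | (Lo >> (32 - Amt))
                            : Lo << (Amt - 32);
  return ((unsigned long long)NewHi << 32) | NewLo;
}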
SDValue ARMTargetLowering::LowerFLT_ROUNDS_(SDValue Op,
SelectionDAG &DAG) const {
// The rounding mode is in bits 23:22 of the FPSCR.
// The ARM rounding mode value to FLT_ROUNDS mapping is 0->1, 1->2, 2->3, 3->0
// The formula we use to implement this is ((FPSCR + (1 << 22)) >> 22) & 3
// so that the shift + and get folded into a bitfield extract.
SDLoc dl(Op);
SDValue Ops[] = { DAG.getEntryNode(),
DAG.getConstant(Intrinsic::arm_get_fpscr, dl, MVT::i32) };
SDValue FPSCR = DAG.getNode(ISD::INTRINSIC_W_CHAIN, dl, MVT::i32, Ops);
SDValue FltRounds = DAG.getNode(ISD::ADD, dl, MVT::i32, FPSCR,
DAG.getConstant(1U << 22, dl, MVT::i32));
SDValue RMODE = DAG.getNode(ISD::SRL, dl, MVT::i32, FltRounds,
DAG.getConstant(22, dl, MVT::i32));
return DAG.getNode(ISD::AND, dl, MVT::i32, RMODE,
DAG.getConstant(3, dl, MVT::i32));
}
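// Editor's sketch of the same computation on a plain FPSCR value (the
// ModelFltRounds name is invented for illustration): rounding-mode bits
// 23:22 map 0->1, 1->2, 2->3, 3->0 because adding 1 to the two-bit field
// and masking wraps 3 back to 0.
static inline unsigned ModelFltRounds(unsigned FPSCR) {
  return ((FPSCR + (1u << 22)) >> 22) & 3;
}
// For example, round-to-nearest (mode 0) yields FLT_ROUNDS == 1.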
static SDValue LowerCTTZ(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
SDLoc dl(N);
EVT VT = N->getValueType(0);
if (VT.isVector()) {
assert(ST->hasNEON());
// Compute the least significant set bit: LSB = X & -X
SDValue X = N->getOperand(0);
SDValue NX = DAG.getNode(ISD::SUB, dl, VT, getZeroVector(VT, DAG, dl), X);
SDValue LSB = DAG.getNode(ISD::AND, dl, VT, X, NX);
EVT ElemTy = VT.getVectorElementType();
if (ElemTy == MVT::i8) {
// Compute with: cttz(x) = ctpop(lsb - 1)
SDValue One = DAG.getNode(ARMISD::VMOVIMM, dl, VT,
DAG.getTargetConstant(1, dl, ElemTy));
SDValue Bits = DAG.getNode(ISD::SUB, dl, VT, LSB, One);
return DAG.getNode(ISD::CTPOP, dl, VT, Bits);
}
if ((ElemTy == MVT::i16 || ElemTy == MVT::i32) &&
(N->getOpcode() == ISD::CTTZ_ZERO_UNDEF)) {
// Compute with: cttz(x) = (width - 1) - ctlz(lsb), if x != 0
unsigned NumBits = ElemTy.getSizeInBits();
SDValue WidthMinus1 =
DAG.getNode(ARMISD::VMOVIMM, dl, VT,
DAG.getTargetConstant(NumBits - 1, dl, ElemTy));
SDValue CTLZ = DAG.getNode(ISD::CTLZ, dl, VT, LSB);
return DAG.getNode(ISD::SUB, dl, VT, WidthMinus1, CTLZ);
}
// Compute with: cttz(x) = ctpop(lsb - 1)
// Since vcnt.8 can only count the set bits within each byte, we have to
// gather the result with pairwise addition (vpaddl) for i16, i32,
// and i64.
// Compute LSB - 1.
SDValue Bits;
if (ElemTy == MVT::i64) {
// Load constant 0xffff'ffff'ffff'ffff to register; 0x1eff is its
// encoded NEON modified-immediate form (Op=1, Cmode=1110, Imm=0xff).
SDValue FF = DAG.getNode(ARMISD::VMOVIMM, dl, VT,
DAG.getTargetConstant(0x1eff, dl, MVT::i32));
Bits = DAG.getNode(ISD::ADD, dl, VT, LSB, FF);
} else {
SDValue One = DAG.getNode(ARMISD::VMOVIMM, dl, VT,
DAG.getTargetConstant(1, dl, ElemTy));
Bits = DAG.getNode(ISD::SUB, dl, VT, LSB, One);
}
// Count #bits with vcnt.8.
EVT VT8Bit = VT.is64BitVector() ? MVT::v8i8 : MVT::v16i8;
SDValue BitsVT8 = DAG.getNode(ISD::BITCAST, dl, VT8Bit, Bits);
SDValue Cnt8 = DAG.getNode(ISD::CTPOP, dl, VT8Bit, BitsVT8);
// Gather the #bits with vpaddl (pairwise add.)
EVT VT16Bit = VT.is64BitVector() ? MVT::v4i16 : MVT::v8i16;
SDValue Cnt16 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT16Bit,
DAG.getTargetConstant(Intrinsic::arm_neon_vpaddlu, dl, MVT::i32),
Cnt8);
if (ElemTy == MVT::i16)
return Cnt16;
EVT VT32Bit = VT.is64BitVector() ? MVT::v2i32 : MVT::v4i32;
SDValue Cnt32 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT32Bit,
DAG.getTargetConstant(Intrinsic::arm_neon_vpaddlu, dl, MVT::i32),
Cnt16);
if (ElemTy == MVT::i32)
return Cnt32;
assert(ElemTy == MVT::i64);
SDValue Cnt64 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT,
DAG.getTargetConstant(Intrinsic::arm_neon_vpaddlu, dl, MVT::i32),
Cnt32);
return Cnt64;
}
if (!ST->hasV6T2Ops())
return SDValue();
SDValue rbit = DAG.getNode(ISD::BITREVERSE, dl, VT, N->getOperand(0));
return DAG.getNode(ISD::CTLZ, dl, VT, rbit);
}
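// Editor's sketch of the two scalar identities the vector path relies on
// (ModelCTTZ32 is a hypothetical name; __builtin_popcount is the
// GCC/Clang builtin, used here for brevity): with lsb = x & -x,
// cttz(x) == popcount(lsb - 1), and for x != 0 also
// cttz(x) == 31 - ctlz(lsb).
static inline unsigned ModelCTTZ32(unsigned X) {
  if (X == 0)
    return 32;
  unsigned LSB = X & (0u - X);        // isolate the least significant set bit
  return __builtin_popcount(LSB - 1); // count the zeros below it
}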
/// getCTPOP16BitCounts - Returns a v8i8/v16i8 vector containing the bit-count
/// for each 16-bit element from operand, repeated. The basic idea is to
/// leverage vcnt to get the 8-bit counts, gather and add the results.
///
/// Trace for v4i16:
/// input = [v0 v1 v2 v3 ] (vi 16-bit element)
/// cast: N0 = [w0 w1 w2 w3 w4 w5 w6 w7] (v0 = [w0 w1], wi 8-bit element)
/// vcnt: N1 = [b0 b1 b2 b3 b4 b5 b6 b7] (bi = bit-count of 8-bit element wi)
/// vrev: N2 = [b1 b0 b3 b2 b5 b4 b7 b6]
/// [b0 b1 b2 b3 b4 b5 b6 b7]
/// +[b1 b0 b3 b2 b5 b4 b7 b6]
/// N3=N1+N2 = [k0 k0 k1 k1 k2 k2 k3 k3] (k0 = b0+b1 = bit-count of 16-bit v0,
/// vuzp: = [k0 k1 k2 k3 k0 k1 k2 k3] each ki is 8-bits)
static SDValue getCTPOP16BitCounts(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);
SDLoc DL(N);
EVT VT8Bit = VT.is64BitVector() ? MVT::v8i8 : MVT::v16i8;
SDValue N0 = DAG.getNode(ISD::BITCAST, DL, VT8Bit, N->getOperand(0));
SDValue N1 = DAG.getNode(ISD::CTPOP, DL, VT8Bit, N0);
SDValue N2 = DAG.getNode(ARMISD::VREV16, DL, VT8Bit, N1);
SDValue N3 = DAG.getNode(ISD::ADD, DL, VT8Bit, N1, N2);
return DAG.getNode(ARMISD::VUZP, DL, VT8Bit, N3, N3);
}
/// lowerCTPOP16BitElements - Returns a v4i16/v8i16 vector containing the
/// bit-count for each 16-bit element from the operand. We need slightly
/// different sequencing for v4i16 and v8i16 to stay within NEON's available
/// 64/128-bit registers.
///
/// Trace for v4i16:
/// input = [v0 v1 v2 v3 ] (vi 16-bit element)
/// v8i8: BitCounts = [k0 k1 k2 k3 k0 k1 k2 k3 ] (ki is the bit-count of vi)
/// v8i16:Extended = [k0 k1 k2 k3 k0 k1 k2 k3 ]
/// v4i16:Extracted = [k0 k1 k2 k3 ]
static SDValue lowerCTPOP16BitElements(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);
SDLoc DL(N);
SDValue BitCounts = getCTPOP16BitCounts(N, DAG);
if (VT.is64BitVector()) {
SDValue Extended = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v8i16, BitCounts);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Extended,
DAG.getIntPtrConstant(0, DL));
} else {
SDValue Extracted = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v8i8,
BitCounts, DAG.getIntPtrConstant(0, DL));
return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v8i16, Extracted);
}
}
/// lowerCTPOP32BitElements - Returns a v2i32/v4i32 vector containing the
/// bit-count for each 32-bit element from the operand. The idea here is
/// to split the vector into 16-bit elements, leverage the 16-bit count
/// routine, and then combine the results.
///
/// Trace for v2i32 (v4i32 similar with Extracted/Extended exchanged):
/// input = [v0 v1 ] (vi: 32-bit elements)
/// Bitcast = [w0 w1 w2 w3 ] (wi: 16-bit elements, v0 = [w0 w1])
/// Counts16 = [k0 k1 k2 k3 ] (ki: 16-bit elements, bit-count of wi)
/// vrev: N0 = [k1 k0 k3 k2 ]
/// [k0 k1 k2 k3 ]
/// N1 =+[k1 k0 k3 k2 ]
/// [k0 k2 k1 k3 ]
/// N2 =+[k1 k3 k0 k2 ]
/// [k0 k2 k1 k3 ]
/// Extended =+[k1 k3 k0 k2 ]
/// [k0 k2 ]
/// Extracted=+[k1 k3 ]
///
static SDValue lowerCTPOP32BitElements(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);
SDLoc DL(N);
EVT VT16Bit = VT.is64BitVector() ? MVT::v4i16 : MVT::v8i16;
SDValue Bitcast = DAG.getNode(ISD::BITCAST, DL, VT16Bit, N->getOperand(0));
SDValue Counts16 = lowerCTPOP16BitElements(Bitcast.getNode(), DAG);
SDValue N0 = DAG.getNode(ARMISD::VREV32, DL, VT16Bit, Counts16);
SDValue N1 = DAG.getNode(ISD::ADD, DL, VT16Bit, Counts16, N0);
SDValue N2 = DAG.getNode(ARMISD::VUZP, DL, VT16Bit, N1, N1);
if (VT.is64BitVector()) {
SDValue Extended = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v4i32, N2);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i32, Extended,
DAG.getIntPtrConstant(0, DL));
} else {
SDValue Extracted = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, N2,
DAG.getIntPtrConstant(0, DL));
return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v4i32, Extracted);
}
}
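// Editor's sketch of the widening idea shared by the CTPOP helpers above
// (ModelCTPOP32 is an invented name): vcnt.8 gives per-byte counts, and
// the pairwise additions merely sum adjacent counts into wider lanes.
// Shown here for a single 32-bit lane.
static inline unsigned ModelCTPOP32(unsigned X) {
  unsigned Cnt = 0;
  for (unsigned Byte = 0; Byte < 4; ++Byte)              // vcnt.8 per byte
    Cnt += __builtin_popcount((X >> (8 * Byte)) & 0xff); // vpaddl-style sums
  return Cnt;
}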
static SDValue LowerCTPOP(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
EVT VT = N->getValueType(0);
assert(ST->hasNEON() && "Custom ctpop lowering requires NEON.");
assert((VT == MVT::v2i32 || VT == MVT::v4i32 ||
VT == MVT::v4i16 || VT == MVT::v8i16) &&
"Unexpected type for custom ctpop lowering");
if (VT.getVectorElementType() == MVT::i32)
return lowerCTPOP32BitElements(N, DAG);
else
return lowerCTPOP16BitElements(N, DAG);
}
static SDValue LowerShift(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
EVT VT = N->getValueType(0);
SDLoc dl(N);
if (!VT.isVector())
return SDValue();
// Lower vector shifts on NEON to use VSHL.
assert(ST->hasNEON() && "unexpected vector shift");
// Left shifts translate directly to the vshiftu intrinsic.
if (N->getOpcode() == ISD::SHL)
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT,
DAG.getConstant(Intrinsic::arm_neon_vshiftu, dl,
MVT::i32),
N->getOperand(0), N->getOperand(1));
assert((N->getOpcode() == ISD::SRA ||
N->getOpcode() == ISD::SRL) && "unexpected vector shift opcode");
// NEON uses the same intrinsics for both left and right shifts. For
// right shifts, the shift amounts are negative, so negate the vector of
// shift amounts.
EVT ShiftVT = N->getOperand(1).getValueType();
SDValue NegatedCount = DAG.getNode(ISD::SUB, dl, ShiftVT,
getZeroVector(ShiftVT, DAG, dl),
N->getOperand(1));
Intrinsic::ID vshiftInt = (N->getOpcode() == ISD::SRA ?
Intrinsic::arm_neon_vshifts :
Intrinsic::arm_neon_vshiftu);
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT,
DAG.getConstant(vshiftInt, dl, MVT::i32),
N->getOperand(0), NegatedCount);
}
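// Editor's sketch of the negated-count trick (ModelVSHLU is a hypothetical
// name): NEON's VSHL shifts right when the signed per-lane count is
// negative, so a right shift by n is emitted as vshiftu(x, -n).
static inline unsigned ModelVSHLU(unsigned X, int Count) {
  return Count >= 0 ? X << Count : X >> -Count;
}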
static SDValue Expand64BitShift(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
EVT VT = N->getValueType(0);
SDLoc dl(N);
// We can get here for a node like i32 = ISD::SHL i32, i64
if (VT != MVT::i64)
return SDValue();
assert((N->getOpcode() == ISD::SRL || N->getOpcode() == ISD::SRA) &&
"Unknown shift to lower!");
// We only lower SRA and SRL by 1 here; all others use generic lowering.
if (!isOneConstant(N->getOperand(1)))
return SDValue();
// If we are in thumb mode, we don't have RRX.
if (ST->isThumb1Only()) return SDValue();
// Okay, we have a 64-bit SRA or SRL of 1. Lower this to an RRX expr.
SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, N->getOperand(0),
DAG.getConstant(0, dl, MVT::i32));
SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, N->getOperand(0),
DAG.getConstant(1, dl, MVT::i32));
// First, build a SRA_FLAG/SRL_FLAG op, which shifts the top part by one and
// captures the result into a carry flag.
unsigned Opc = N->getOpcode() == ISD::SRL ? ARMISD::SRL_FLAG:ARMISD::SRA_FLAG;
Hi = DAG.getNode(Opc, dl, DAG.getVTList(MVT::i32, MVT::Glue), Hi);
// The low part is an ARMISD::RRX operand, which shifts the carry in.
Lo = DAG.getNode(ARMISD::RRX, dl, MVT::i32, Lo, Hi.getValue(1));
// Merge the pieces into a single i64 value.
return DAG.getNode(ISD::BUILD_PAIR, dl, MVT::i64, Lo, Hi);
}
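// Editor's sketch of the RRX expansion above for a 64-bit logical shift
// right by one (ModelSRL64By1 is an invented name): SRL_FLAG shifts the
// high word and captures its low bit as carry, and RRX rotates that carry
// into the top of the low word.
static inline unsigned long long ModelSRL64By1(unsigned Lo, unsigned Hi) {
  unsigned Carry = Hi & 1;                    // produced by SRL_FLAG
  unsigned NewHi = Hi >> 1;
  unsigned NewLo = (Lo >> 1) | (Carry << 31); // RRX shifts the carry in
  return ((unsigned long long)NewHi << 32) | NewLo;
}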
static SDValue LowerVSETCC(SDValue Op, SelectionDAG &DAG) {
SDValue TmpOp0, TmpOp1;
bool Invert = false;
bool Swap = false;
unsigned Opc = 0;
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDValue CC = Op.getOperand(2);
EVT CmpVT = Op0.getValueType().changeVectorElementTypeToInteger();
EVT VT = Op.getValueType();
ISD::CondCode SetCCOpcode = cast<CondCodeSDNode>(CC)->get();
SDLoc dl(Op);
if (Op0.getValueType().getVectorElementType() == MVT::i64 &&
(SetCCOpcode == ISD::SETEQ || SetCCOpcode == ISD::SETNE)) {
// Special-case integer 64-bit equality comparisons. They aren't legal,
// but they can be lowered with a few vector instructions.
unsigned CmpElements = CmpVT.getVectorNumElements() * 2;
EVT SplitVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32, CmpElements);
SDValue CastOp0 = DAG.getNode(ISD::BITCAST, dl, SplitVT, Op0);
SDValue CastOp1 = DAG.getNode(ISD::BITCAST, dl, SplitVT, Op1);
SDValue Cmp = DAG.getNode(ISD::SETCC, dl, SplitVT, CastOp0, CastOp1,
DAG.getCondCode(ISD::SETEQ));
SDValue Reversed = DAG.getNode(ARMISD::VREV64, dl, SplitVT, Cmp);
SDValue Merged = DAG.getNode(ISD::AND, dl, SplitVT, Cmp, Reversed);
Merged = DAG.getNode(ISD::BITCAST, dl, CmpVT, Merged);
if (SetCCOpcode == ISD::SETNE)
Merged = DAG.getNOT(dl, Merged, CmpVT);
Merged = DAG.getSExtOrTrunc(Merged, dl, VT);
return Merged;
}
if (CmpVT.getVectorElementType() == MVT::i64)
// 64-bit comparisons are not legal in general.
return SDValue();
if (Op1.getValueType().isFloatingPoint()) {
switch (SetCCOpcode) {
default: llvm_unreachable("Illegal FP comparison");
case ISD::SETUNE:
case ISD::SETNE: Invert = true; LLVM_FALLTHROUGH;
case ISD::SETOEQ:
case ISD::SETEQ: Opc = ARMISD::VCEQ; break;
case ISD::SETOLT:
case ISD::SETLT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETOGT:
case ISD::SETGT: Opc = ARMISD::VCGT; break;
case ISD::SETOLE:
case ISD::SETLE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETOGE:
case ISD::SETGE: Opc = ARMISD::VCGE; break;
case ISD::SETUGE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETULE: Invert = true; Opc = ARMISD::VCGT; break;
case ISD::SETUGT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETULT: Invert = true; Opc = ARMISD::VCGE; break;
case ISD::SETUEQ: Invert = true; LLVM_FALLTHROUGH;
case ISD::SETONE:
// Expand this to (OLT | OGT).
TmpOp0 = Op0;
TmpOp1 = Op1;
Opc = ISD::OR;
Op0 = DAG.getNode(ARMISD::VCGT, dl, CmpVT, TmpOp1, TmpOp0);
Op1 = DAG.getNode(ARMISD::VCGT, dl, CmpVT, TmpOp0, TmpOp1);
break;
case ISD::SETUO:
Invert = true;
LLVM_FALLTHROUGH;
case ISD::SETO:
// Expand this to (OLT | OGE).
TmpOp0 = Op0;
TmpOp1 = Op1;
Opc = ISD::OR;
Op0 = DAG.getNode(ARMISD::VCGT, dl, CmpVT, TmpOp1, TmpOp0);
Op1 = DAG.getNode(ARMISD::VCGE, dl, CmpVT, TmpOp0, TmpOp1);
break;
}
} else {
// Integer comparisons.
switch (SetCCOpcode) {
default: llvm_unreachable("Illegal integer comparison");
case ISD::SETNE: Invert = true; LLVM_FALLTHROUGH;
case ISD::SETEQ: Opc = ARMISD::VCEQ; break;
case ISD::SETLT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETGT: Opc = ARMISD::VCGT; break;
case ISD::SETLE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETGE: Opc = ARMISD::VCGE; break;
case ISD::SETULT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETUGT: Opc = ARMISD::VCGTU; break;
case ISD::SETULE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETUGE: Opc = ARMISD::VCGEU; break;
}
// Detect VTST (Vector Test Bits) = icmp ne (and (op0, op1), zero).
if (Opc == ARMISD::VCEQ) {
SDValue AndOp;
if (ISD::isBuildVectorAllZeros(Op1.getNode()))
AndOp = Op0;
else if (ISD::isBuildVectorAllZeros(Op0.getNode()))
AndOp = Op1;
// Ignore bitconvert.
if (AndOp.getNode() && AndOp.getOpcode() == ISD::BITCAST)
AndOp = AndOp.getOperand(0);
if (AndOp.getNode() && AndOp.getOpcode() == ISD::AND) {
Opc = ARMISD::VTST;
Op0 = DAG.getNode(ISD::BITCAST, dl, CmpVT, AndOp.getOperand(0));
Op1 = DAG.getNode(ISD::BITCAST, dl, CmpVT, AndOp.getOperand(1));
Invert = !Invert;
}
}
}
if (Swap)
std::swap(Op0, Op1);
// If one of the operands is a constant vector zero, attempt to fold the
// comparison to a specialized compare-against-zero form.
SDValue SingleOp;
if (ISD::isBuildVectorAllZeros(Op1.getNode()))
SingleOp = Op0;
else if (ISD::isBuildVectorAllZeros(Op0.getNode())) {
if (Opc == ARMISD::VCGE)
Opc = ARMISD::VCLEZ;
else if (Opc == ARMISD::VCGT)
Opc = ARMISD::VCLTZ;
SingleOp = Op1;
}
SDValue Result;
if (SingleOp.getNode()) {
switch (Opc) {
case ARMISD::VCEQ:
Result = DAG.getNode(ARMISD::VCEQZ, dl, CmpVT, SingleOp); break;
case ARMISD::VCGE:
Result = DAG.getNode(ARMISD::VCGEZ, dl, CmpVT, SingleOp); break;
case ARMISD::VCLEZ:
Result = DAG.getNode(ARMISD::VCLEZ, dl, CmpVT, SingleOp); break;
case ARMISD::VCGT:
Result = DAG.getNode(ARMISD::VCGTZ, dl, CmpVT, SingleOp); break;
case ARMISD::VCLTZ:
Result = DAG.getNode(ARMISD::VCLTZ, dl, CmpVT, SingleOp); break;
default:
Result = DAG.getNode(Opc, dl, CmpVT, Op0, Op1);
}
} else {
Result = DAG.getNode(Opc, dl, CmpVT, Op0, Op1);
}
Result = DAG.getSExtOrTrunc(Result, dl, VT);
if (Invert)
Result = DAG.getNOT(dl, Result, VT);
return Result;
}
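// Editor's sketch of the SETONE/SETO expansions above on scalars
// (ModelSetONE is an invented name): "ordered and not equal" is exactly
// (b > a) || (a > b), which is false whenever either operand is NaN, so
// no explicit ordered check is needed.
static inline bool ModelSetONE(float A, float B) {
  return (B > A) || (A > B);
}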
static SDValue LowerSETCCE(SDValue Op, SelectionDAG &DAG) {
SDValue LHS = Op.getOperand(0);
SDValue RHS = Op.getOperand(1);
SDValue Carry = Op.getOperand(2);
SDValue Cond = Op.getOperand(3);
SDLoc DL(Op);
assert(LHS.getSimpleValueType().isInteger() && "SETCCE is integer only.");
assert(Carry.getOpcode() != ISD::CARRY_FALSE);
SDVTList VTs = DAG.getVTList(LHS.getValueType(), MVT::i32);
SDValue Cmp = DAG.getNode(ARMISD::SUBE, DL, VTs, LHS, RHS, Carry);
SDValue FVal = DAG.getConstant(0, DL, MVT::i32);
SDValue TVal = DAG.getConstant(1, DL, MVT::i32);
SDValue ARMcc = DAG.getConstant(
IntCCToARMCC(cast<CondCodeSDNode>(Cond)->get()), DL, MVT::i32);
SDValue CCR = DAG.getRegister(ARM::CPSR, MVT::i32);
SDValue Chain = DAG.getCopyToReg(DAG.getEntryNode(), DL, ARM::CPSR,
Cmp.getValue(1), SDValue());
return DAG.getNode(ARMISD::CMOV, DL, Op.getValueType(), FVal, TVal, ARMcc,
CCR, Chain.getValue(1));
}
/// isNEONModifiedImm - Check if the specified splat value corresponds to a
/// valid vector constant for a NEON instruction with a "modified immediate"
/// operand (e.g., VMOV). If so, return the encoded value.
static SDValue isNEONModifiedImm(uint64_t SplatBits, uint64_t SplatUndef,
unsigned SplatBitSize, SelectionDAG &DAG,
const SDLoc &dl, EVT &VT, bool is128Bits,
NEONModImmType type) {
unsigned OpCmode, Imm;
// SplatBitSize is set to the smallest size that splats the vector, so a
// zero vector will always have SplatBitSize == 8. However, NEON modified
// immediate instructions other than VMOV do not support the 8-bit encoding
// of a zero vector, and the default encoding of zero is supposed to be the
// 32-bit version.
if (SplatBits == 0)
SplatBitSize = 32;
switch (SplatBitSize) {
case 8:
if (type != VMOVModImm)
return SDValue();
// Any 1-byte value is OK. Op=0, Cmode=1110.
assert((SplatBits & ~0xff) == 0 && "one byte splat value is too big");
OpCmode = 0xe;
Imm = SplatBits;
VT = is128Bits ? MVT::v16i8 : MVT::v8i8;
break;
case 16:
// NEON's 16-bit VMOV supports splat values where only one byte is nonzero.
VT = is128Bits ? MVT::v8i16 : MVT::v4i16;
if ((SplatBits & ~0xff) == 0) {
// Value = 0x00nn: Op=x, Cmode=100x.
OpCmode = 0x8;
Imm = SplatBits;
break;
}
if ((SplatBits & ~0xff00) == 0) {
// Value = 0xnn00: Op=x, Cmode=101x.
OpCmode = 0xa;
Imm = SplatBits >> 8;
break;
}
return SDValue();
case 32:
// NEON's 32-bit VMOV supports splat values where:
// * only one byte is nonzero, or
// * the least significant byte is 0xff and the second byte is nonzero, or
// * the least significant 2 bytes are 0xff and the third is nonzero.
VT = is128Bits ? MVT::v4i32 : MVT::v2i32;
if ((SplatBits & ~0xff) == 0) {
// Value = 0x000000nn: Op=x, Cmode=000x.
OpCmode = 0;
Imm = SplatBits;
break;
}
if ((SplatBits & ~0xff00) == 0) {
// Value = 0x0000nn00: Op=x, Cmode=001x.
OpCmode = 0x2;
Imm = SplatBits >> 8;
break;
}
if ((SplatBits & ~0xff0000) == 0) {
// Value = 0x00nn0000: Op=x, Cmode=010x.
OpCmode = 0x4;
Imm = SplatBits >> 16;
break;
}
if ((SplatBits & ~0xff000000) == 0) {
// Value = 0xnn000000: Op=x, Cmode=011x.
OpCmode = 0x6;
Imm = SplatBits >> 24;
break;
}
// cmode == 0b1100 and cmode == 0b1101 are not supported for VORR or VBIC
if (type == OtherModImm) return SDValue();
if ((SplatBits & ~0xffff) == 0 &&
((SplatBits | SplatUndef) & 0xff) == 0xff) {
// Value = 0x0000nnff: Op=x, Cmode=1100.
OpCmode = 0xc;
Imm = SplatBits >> 8;
break;
}
if ((SplatBits & ~0xffffff) == 0 &&
((SplatBits | SplatUndef) & 0xffff) == 0xffff) {
// Value = 0x00nnffff: Op=x, Cmode=1101.
OpCmode = 0xd;
Imm = SplatBits >> 16;
break;
}
// Note: there are a few 32-bit splat values (specifically: 00ffff00,
// ff000000, ff0000ff, and ffff00ff) that are valid for VMOV.I64 but not
// VMOV.I32. A (very) minor optimization would be to replicate the value
// and fall through here to test for a valid 64-bit splat. But, then the
// caller would also need to check and handle the change in size.
return SDValue();
case 64: {
if (type != VMOVModImm)
return SDValue();
// NEON has a 64-bit VMOV splat where each byte is either 0 or 0xff.
uint64_t BitMask = 0xff;
uint64_t Val = 0;
unsigned ImmMask = 1;
Imm = 0;
for (int ByteNum = 0; ByteNum < 8; ++ByteNum) {
if (((SplatBits | SplatUndef) & BitMask) == BitMask) {
Val |= BitMask;
Imm |= ImmMask;
} else if ((SplatBits & BitMask) != 0) {
return SDValue();
}
BitMask <<= 8;
ImmMask <<= 1;
}
if (DAG.getDataLayout().isBigEndian())
// Swap the higher and lower 32-bit words.
Imm = ((Imm & 0xf) << 4) | ((Imm & 0xf0) >> 4);
// Op=1, Cmode=1110.
OpCmode = 0x1e;
VT = is128Bits ? MVT::v2i64 : MVT::v1i64;
break;
}
default:
llvm_unreachable("unexpected size for isNEONModifiedImm");
}
unsigned EncodedVal = ARM_AM::createNEONModImm(OpCmode, Imm);
return DAG.getTargetConstant(EncodedVal, dl, MVT::i32);
}
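// Editor's example of the single-nonzero-byte cases for the 32-bit VMOV
// above (ModelIsOneByteSplat32 is a hypothetical helper): the byte's
// position selects Cmode, e.g. a splat of 0x00004500 encodes as
// OpCmode = 0x2 with Imm = 0x45.
static inline bool ModelIsOneByteSplat32(unsigned V, unsigned &OpCmode,
                                         unsigned &Imm) {
  for (unsigned Shift = 0; Shift < 32; Shift += 8) {
    if ((V & ~(0xffu << Shift)) == 0) {
      OpCmode = (Shift / 8) * 2; // 0x0, 0x2, 0x4, 0x6 as in the cases above
      Imm = V >> Shift;
      return true;
    }
  }
  return false;
}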
SDValue ARMTargetLowering::LowerConstantFP(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *ST) const {
bool IsDouble = Op.getValueType() == MVT::f64;
ConstantFPSDNode *CFP = cast<ConstantFPSDNode>(Op);
const APFloat &FPVal = CFP->getValueAPF();
// Prevent floating-point constants from using literal loads
// when execute-only is enabled.
if (ST->genExecuteOnly()) {
APInt INTVal = FPVal.bitcastToAPInt();
SDLoc DL(CFP);
if (IsDouble) {
SDValue Lo = DAG.getConstant(INTVal.trunc(32), DL, MVT::i32);
SDValue Hi = DAG.getConstant(INTVal.lshr(32).trunc(32), DL, MVT::i32);
if (!ST->isLittle())
std::swap(Lo, Hi);
return DAG.getNode(ARMISD::VMOVDRR, DL, MVT::f64, Lo, Hi);
} else {
return DAG.getConstant(INTVal, DL, MVT::i32);
}
}
if (!ST->hasVFP3())
return SDValue();
// Use the default (constant pool) lowering for double constants when we have
// an SP-only FPU
if (IsDouble && Subtarget->isFPOnlySP())
return SDValue();
// Try splatting with a VMOV.f32...
int ImmVal = IsDouble ? ARM_AM::getFP64Imm(FPVal) : ARM_AM::getFP32Imm(FPVal);
if (ImmVal != -1) {
if (IsDouble || !ST->useNEONForSinglePrecisionFP()) {
// We have code in place to select a valid ConstantFP already, no need to
// do any mangling.
return Op;
}
// It's a float and we are trying to use NEON operations where
// possible. Lower it to a splat followed by an extract.
SDLoc DL(Op);
SDValue NewVal = DAG.getTargetConstant(ImmVal, DL, MVT::i32);
SDValue VecConstant = DAG.getNode(ARMISD::VMOVFPIMM, DL, MVT::v2f32,
NewVal);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VecConstant,
DAG.getConstant(0, DL, MVT::i32));
}
// The rest of our options are NEON only, make sure that's allowed before
// proceeding..
if (!ST->hasNEON() || (!IsDouble && !ST->useNEONForSinglePrecisionFP()))
return SDValue();
EVT VMovVT;
uint64_t iVal = FPVal.bitcastToAPInt().getZExtValue();
// It wouldn't really be worth bothering for doubles except for one very
// important value, which does happen to match: 0.0. So make sure we don't do
// anything stupid.
if (IsDouble && (iVal & 0xffffffff) != (iVal >> 32))
return SDValue();
// Try a VMOV.i32 (FIXME: i8, i16, or i64 could work too).
SDValue NewVal = isNEONModifiedImm(iVal & 0xffffffffU, 0, 32, DAG, SDLoc(Op),
VMovVT, false, VMOVModImm);
if (NewVal != SDValue()) {
SDLoc DL(Op);
SDValue VecConstant = DAG.getNode(ARMISD::VMOVIMM, DL, VMovVT,
NewVal);
if (IsDouble)
return DAG.getNode(ISD::BITCAST, DL, MVT::f64, VecConstant);
// It's a float: cast and extract a vector element.
SDValue VecFConstant = DAG.getNode(ISD::BITCAST, DL, MVT::v2f32,
VecConstant);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VecFConstant,
DAG.getConstant(0, DL, MVT::i32));
}
// Finally, try a VMVN.i32
NewVal = isNEONModifiedImm(~iVal & 0xffffffffU, 0, 32, DAG, SDLoc(Op), VMovVT,
false, VMVNModImm);
if (NewVal != SDValue()) {
SDLoc DL(Op);
SDValue VecConstant = DAG.getNode(ARMISD::VMVNIMM, DL, VMovVT, NewVal);
if (IsDouble)
return DAG.getNode(ISD::BITCAST, DL, MVT::f64, VecConstant);
// It's a float: cast and extract a vector element.
SDValue VecFConstant = DAG.getNode(ISD::BITCAST, DL, MVT::v2f32,
VecConstant);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VecFConstant,
DAG.getConstant(0, DL, MVT::i32));
}
return SDValue();
}
// Check if a VEXT instruction can handle the shuffle mask when the
// vector sources of the shuffle are the same.
static bool isSingletonVEXTMask(ArrayRef<int> M, EVT VT, unsigned &Imm) {
unsigned NumElts = VT.getVectorNumElements();
// Assume that the first shuffle index is not UNDEF. Fail if it is.
if (M[0] < 0)
return false;
Imm = M[0];
// If this is a VEXT shuffle, the immediate value is the index of the first
// element. The other shuffle indices must be the successive elements after
// the first one.
unsigned ExpectedElt = Imm;
for (unsigned i = 1; i < NumElts; ++i) {
// Increment the expected index. If it wraps around, just follow it
// back to index zero and keep going.
++ExpectedElt;
if (ExpectedElt == NumElts)
ExpectedElt = 0;
if (M[i] < 0) continue; // ignore UNDEF indices
if (ExpectedElt != static_cast<unsigned>(M[i]))
return false;
}
return true;
}
static bool isVEXTMask(ArrayRef<int> M, EVT VT,
bool &ReverseVEXT, unsigned &Imm) {
unsigned NumElts = VT.getVectorNumElements();
ReverseVEXT = false;
// Assume that the first shuffle index is not UNDEF. Fail if it is.
if (M[0] < 0)
return false;
Imm = M[0];
// If this is a VEXT shuffle, the immediate value is the index of the first
// element. The other shuffle indices must be the successive elements after
// the first one.
unsigned ExpectedElt = Imm;
for (unsigned i = 1; i < NumElts; ++i) {
// Increment the expected index. If it wraps around, it may still be
// a VEXT but the source vectors must be swapped.
ExpectedElt += 1;
if (ExpectedElt == NumElts * 2) {
ExpectedElt = 0;
ReverseVEXT = true;
}
if (M[i] < 0) continue; // ignore UNDEF indices
if (ExpectedElt != static_cast<unsigned>(M[i]))
return false;
}
// Adjust the index value if the source operands will be swapped.
if (ReverseVEXT)
Imm -= NumElts;
return true;
}
/// isVREVMask - Check if a vector shuffle corresponds to a VREV
/// instruction with the specified blocksize. (The order of the elements
/// within each block of the vector is reversed.)
static bool isVREVMask(ArrayRef<int> M, EVT VT, unsigned BlockSize) {
assert((BlockSize==16 || BlockSize==32 || BlockSize==64) &&
"Only possible block sizes for VREV are: 16, 32, 64");
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
unsigned BlockElts = M[0] + 1;
// If the first shuffle index is UNDEF, be optimistic.
if (M[0] < 0)
BlockElts = BlockSize / EltSz;
if (BlockSize <= EltSz || BlockSize != BlockElts * EltSz)
return false;
for (unsigned i = 0; i < NumElts; ++i) {
if (M[i] < 0) continue; // ignore UNDEF indices
if ((unsigned) M[i] != (i - i%BlockElts) + (BlockElts - 1 - i%BlockElts))
return false;
}
return true;
}
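// Editor's example of the index formula checked above (ModelVREV32MaskElt
// is an invented name): reversing within 32-bit blocks of a v8i8 vector
// corresponds to the mask <3,2,1,0,7,6,5,4>.
static inline int ModelVREV32MaskElt(unsigned I) {
  const unsigned BlockElts = 4; // 32-bit block over 8-bit elements
  return (int)((I - I % BlockElts) + (BlockElts - 1 - I % BlockElts));
}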
static bool isVTBLMask(ArrayRef<int> M, EVT VT) {
// We can handle <8 x i8> vector shuffles. If the index in the mask is out of
// range, then 0 is placed into the resulting vector. So pretty much any mask
// of 8 elements can work here.
return VT == MVT::v8i8 && M.size() == 8;
}
// Checks whether the shuffle mask represents a vector transpose (VTRN) by
// checking that pairs of elements in the shuffle mask represent the same index
// in each vector, incrementing the expected index by 2 at each step.
// e.g. For v1,v2 of type v4i32 a valid shuffle mask is: [0, 4, 2, 6]
// v1={a,b,c,d} => x=shufflevector v1, v2 shufflemask => x={a,e,c,g}
// v2={e,f,g,h}
// WhichResult gives the offset for each element in the mask based on which
// of the two results it belongs to.
//
// The transpose can be represented either as:
// result1 = shufflevector v1, v2, result1_shuffle_mask
// result2 = shufflevector v1, v2, result2_shuffle_mask
// where v1/v2 and the shuffle masks have the same number of elements
// (here WhichResult (see below) indicates which result is being checked)
//
// or as:
// results = shufflevector v1, v2, shuffle_mask
// where both results are returned in one vector and the shuffle mask has
// twice as many elements as v1/v2 (in this form WhichResult will always be
// 0 if the mask matches). Here we check the low half and the high half of
// the shuffle mask as if each were a separate mask for the first case.
static bool isVTRNMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
// If the mask is twice as long as the input vector then we need to check the
// upper and lower parts of the mask with a matching value for WhichResult.
// FIXME: A mask with only even values will be rejected in case the first
// element is undefined, e.g. [-1, 4, 2, 6] will be rejected, because only
// M[0] is used to determine WhichResult.
for (unsigned i = 0; i < M.size(); i += NumElts) {
if (M.size() == NumElts * 2)
WhichResult = i / NumElts;
else
WhichResult = M[i] == 0 ? 0 : 1;
for (unsigned j = 0; j < NumElts; j += 2) {
if ((M[i+j] >= 0 && (unsigned) M[i+j] != j + WhichResult) ||
(M[i+j+1] >= 0 && (unsigned) M[i+j+1] != j + NumElts + WhichResult))
return false;
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
return true;
}
/// isVTRN_v_undef_Mask - Special case of isVTRNMask for canonical form of
/// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
/// Mask is e.g., <0, 0, 2, 2> instead of <0, 4, 2, 6>.
static bool isVTRN_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult){
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
for (unsigned i = 0; i < M.size(); i += NumElts) {
if (M.size() == NumElts * 2)
WhichResult = i / NumElts;
else
WhichResult = M[i] == 0 ? 0 : 1;
for (unsigned j = 0; j < NumElts; j += 2) {
if ((M[i+j] >= 0 && (unsigned) M[i+j] != j + WhichResult) ||
(M[i+j+1] >= 0 && (unsigned) M[i+j+1] != j + WhichResult))
return false;
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
return true;
}
// Checks whether the shuffle mask represents a vector unzip (VUZP) by checking
// that the mask elements are either all even and in steps of size 2 or all odd
// and in steps of size 2.
// e.g. For v1,v2 of type v4i32 a valid shuffle mask is: [0, 2, 4, 6]
// v1={a,b,c,d} => x=shufflevector v1, v2 shufflemask => x={a,c,e,g}
// v2={e,f,g,h}
// Requires checks similar to those of isVTRNMask with respect to how the
// results are returned.
static bool isVUZPMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
for (unsigned i = 0; i < M.size(); i += NumElts) {
- WhichResult = M[i] == 0 ? 0 : 1;
+ if (M.size() == NumElts * 2)
+ WhichResult = i / NumElts;
+ else
+ WhichResult = M[i] == 0 ? 0 : 1;
for (unsigned j = 0; j < NumElts; ++j) {
if (M[i+j] >= 0 && (unsigned) M[i+j] != 2 * j + WhichResult)
return false;
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
// VUZP.32 for 64-bit vectors is a pseudo-instruction alias for VTRN.32.
if (VT.is64BitVector() && EltSz == 32)
return false;
return true;
}
/// isVUZP_v_undef_Mask - Special case of isVUZPMask for canonical form of
/// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
/// Mask is e.g., <0, 2, 0, 2> instead of <0, 2, 4, 6>.
static bool isVUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult){
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
unsigned Half = NumElts / 2;
for (unsigned i = 0; i < M.size(); i += NumElts) {
- WhichResult = M[i] == 0 ? 0 : 1;
+ if (M.size() == NumElts * 2)
+ WhichResult = i / NumElts;
+ else
+ WhichResult = M[i] == 0 ? 0 : 1;
for (unsigned j = 0; j < NumElts; j += Half) {
unsigned Idx = WhichResult;
for (unsigned k = 0; k < Half; ++k) {
int MIdx = M[i + j + k];
if (MIdx >= 0 && (unsigned) MIdx != Idx)
return false;
Idx += 2;
}
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
// VUZP.32 for 64-bit vectors is a pseudo-instruction alias for VTRN.32.
if (VT.is64BitVector() && EltSz == 32)
return false;
return true;
}
// Checks whether the shuffle mask represents a vector zip (VZIP) by checking
// that pairs of elements of the shufflemask represent the same index in each
// vector incrementing sequentially through the vectors.
// e.g. For v1,v2 of type v4i32 a valid shuffle mask is: [0, 4, 1, 5]
// v1={a,b,c,d} => x=shufflevector v1, v2 shufflemask => x={a,e,b,f}
// v2={e,f,g,h}
// Requires checks similar to those of isVTRNMask with respect to how the
// results are returned.
static bool isVZIPMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
for (unsigned i = 0; i < M.size(); i += NumElts) {
- WhichResult = M[i] == 0 ? 0 : 1;
+ if (M.size() == NumElts * 2)
+ WhichResult = i / NumElts;
+ else
+ WhichResult = M[i] == 0 ? 0 : 1;
unsigned Idx = WhichResult * NumElts / 2;
for (unsigned j = 0; j < NumElts; j += 2) {
if ((M[i+j] >= 0 && (unsigned) M[i+j] != Idx) ||
(M[i+j+1] >= 0 && (unsigned) M[i+j+1] != Idx + NumElts))
return false;
Idx += 1;
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
// VZIP.32 for 64-bit vectors is a pseudo-instruction alias for VTRN.32.
if (VT.is64BitVector() && EltSz == 32)
return false;
return true;
}
/// isVZIP_v_undef_Mask - Special case of isVZIPMask for canonical form of
/// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
/// Mask is e.g., <0, 0, 1, 1> instead of <0, 4, 1, 5>.
static bool isVZIP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult){
unsigned EltSz = VT.getScalarSizeInBits();
if (EltSz == 64)
return false;
unsigned NumElts = VT.getVectorNumElements();
if (M.size() != NumElts && M.size() != NumElts*2)
return false;
for (unsigned i = 0; i < M.size(); i += NumElts) {
- WhichResult = M[i] == 0 ? 0 : 1;
+ if (M.size() == NumElts * 2)
+ WhichResult = i / NumElts;
+ else
+ WhichResult = M[i] == 0 ? 0 : 1;
unsigned Idx = WhichResult * NumElts / 2;
for (unsigned j = 0; j < NumElts; j += 2) {
if ((M[i+j] >= 0 && (unsigned) M[i+j] != Idx) ||
(M[i+j+1] >= 0 && (unsigned) M[i+j+1] != Idx))
return false;
Idx += 1;
}
}
if (M.size() == NumElts*2)
WhichResult = 0;
// VZIP.32 for 64-bit vectors is a pseudo-instruction alias for VTRN.32.
if (VT.is64BitVector() && EltSz == 32)
return false;
return true;
}
/// Check if \p ShuffleMask is a NEON two-result shuffle (VZIP, VUZP, VTRN),
/// and return the corresponding ARMISD opcode if it is, or 0 if it isn't.
static unsigned isNEONTwoResultShuffleMask(ArrayRef<int> ShuffleMask, EVT VT,
unsigned &WhichResult,
bool &isV_UNDEF) {
isV_UNDEF = false;
if (isVTRNMask(ShuffleMask, VT, WhichResult))
return ARMISD::VTRN;
if (isVUZPMask(ShuffleMask, VT, WhichResult))
return ARMISD::VUZP;
if (isVZIPMask(ShuffleMask, VT, WhichResult))
return ARMISD::VZIP;
isV_UNDEF = true;
if (isVTRN_v_undef_Mask(ShuffleMask, VT, WhichResult))
return ARMISD::VTRN;
if (isVUZP_v_undef_Mask(ShuffleMask, VT, WhichResult))
return ARMISD::VUZP;
if (isVZIP_v_undef_Mask(ShuffleMask, VT, WhichResult))
return ARMISD::VZIP;
return 0;
}
/// \return true if this is a reverse operation on a vector.
static bool isReverseMask(ArrayRef<int> M, EVT VT) {
unsigned NumElts = VT.getVectorNumElements();
// Make sure the mask has the right size.
if (NumElts != M.size())
return false;
// Look for <15, ..., 3, -1, 1, 0>.
for (unsigned i = 0; i != NumElts; ++i)
if (M[i] >= 0 && M[i] != (int) (NumElts - 1 - i))
return false;
return true;
}
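// Editor's example for the check above (ModelReverseMaskElt is an
// invented name): the full-reverse mask for v8i16 is <7,6,5,4,3,2,1,0>,
// and undef (-1) entries are accepted at any position.
static inline int ModelReverseMaskElt(unsigned NumElts, unsigned I) {
  return (int)(NumElts - 1 - I);
}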
// If N is an integer constant that can be moved into a register in one
// instruction, return an SDValue of such a constant (will become a MOV
// instruction). Otherwise return null.
static SDValue IsSingleInstrConstant(SDValue N, SelectionDAG &DAG,
const ARMSubtarget *ST, const SDLoc &dl) {
uint64_t Val;
if (!isa<ConstantSDNode>(N))
return SDValue();
Val = cast<ConstantSDNode>(N)->getZExtValue();
if (ST->isThumb1Only()) {
if (Val <= 255 || ~Val <= 255)
return DAG.getConstant(Val, dl, MVT::i32);
} else {
if (ARM_AM::getSOImmVal(Val) != -1 || ARM_AM::getSOImmVal(~Val) != -1)
return DAG.getConstant(Val, dl, MVT::i32);
}
return SDValue();
}
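// Editor's sketch of the Thumb1 test above over 32-bit values
// (ModelThumb1OneInstrImm is a hypothetical name): an 8-bit value is a
// single MOV, and a value whose complement fits in 8 bits is a single MVN.
static inline bool ModelThumb1OneInstrImm(unsigned Val) {
  return Val <= 255 || ~Val <= 255;
}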
// If this is a case we can't handle, return null and let the default
// expansion code take care of it.
SDValue ARMTargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *ST) const {
BuildVectorSDNode *BVN = cast<BuildVectorSDNode>(Op.getNode());
SDLoc dl(Op);
EVT VT = Op.getValueType();
APInt SplatBits, SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
if (BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs)) {
if (SplatUndef.isAllOnesValue())
return DAG.getUNDEF(VT);
if (SplatBitSize <= 64) {
// Check if an immediate VMOV works.
EVT VmovVT;
SDValue Val = isNEONModifiedImm(SplatBits.getZExtValue(),
SplatUndef.getZExtValue(), SplatBitSize,
DAG, dl, VmovVT, VT.is128BitVector(),
VMOVModImm);
if (Val.getNode()) {
SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, Val);
return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
}
// Try an immediate VMVN.
uint64_t NegatedImm = (~SplatBits).getZExtValue();
Val = isNEONModifiedImm(NegatedImm,
SplatUndef.getZExtValue(), SplatBitSize,
DAG, dl, VmovVT, VT.is128BitVector(),
VMVNModImm);
if (Val.getNode()) {
SDValue Vmov = DAG.getNode(ARMISD::VMVNIMM, dl, VmovVT, Val);
return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
}
// Use vmov.f32 to materialize other v2f32 and v4f32 splats.
if ((VT == MVT::v2f32 || VT == MVT::v4f32) && SplatBitSize == 32) {
int ImmVal = ARM_AM::getFP32Imm(SplatBits);
if (ImmVal != -1) {
SDValue Val = DAG.getTargetConstant(ImmVal, dl, MVT::i32);
return DAG.getNode(ARMISD::VMOVFPIMM, dl, VT, Val);
}
}
}
}
// Scan through the operands to see if only one value is used.
//
// As an optimisation, even if more than one value is used it may be more
// profitable to splat with one value and then change some lanes.
//
// Heuristically we decide to do this if the vector has a "dominant" value,
// defined as splatted to more than half of the lanes.
unsigned NumElts = VT.getVectorNumElements();
bool isOnlyLowElement = true;
bool usesOnlyOneValue = true;
bool hasDominantValue = false;
bool isConstant = true;
// Map of the number of times a particular SDValue appears in the
// element list.
DenseMap<SDValue, unsigned> ValueCounts;
SDValue Value;
for (unsigned i = 0; i < NumElts; ++i) {
SDValue V = Op.getOperand(i);
if (V.isUndef())
continue;
if (i > 0)
isOnlyLowElement = false;
if (!isa<ConstantFPSDNode>(V) && !isa<ConstantSDNode>(V))
isConstant = false;
ValueCounts.insert(std::make_pair(V, 0));
unsigned &Count = ValueCounts[V];
// Is this value dominant? (takes up more than half of the lanes)
if (++Count > (NumElts / 2)) {
hasDominantValue = true;
Value = V;
}
}
if (ValueCounts.size() != 1)
usesOnlyOneValue = false;
if (!Value.getNode() && !ValueCounts.empty())
Value = ValueCounts.begin()->first;
if (ValueCounts.empty())
return DAG.getUNDEF(VT);
// Loads are better lowered with insert_vector_elt/ARMISD::BUILD_VECTOR.
// Keep going if we are hitting this case.
if (isOnlyLowElement && !ISD::isNormalLoad(Value.getNode()))
return DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Value);
unsigned EltSize = VT.getScalarSizeInBits();
// Use VDUP for non-constant splats. For f32 constant splats, reduce to
// i32 and try again.
if (hasDominantValue && EltSize <= 32) {
if (!isConstant) {
SDValue N;
// If we are VDUPing a value that comes directly from a vector, that will
// cause an unnecessary move to and from a GPR, where instead we could
// just use VDUPLANE. We can only do this if the lane being extracted
// is at a constant index, as the VDUP from lane instructions only have
// constant-index forms.
ConstantSDNode *constIndex;
if (Value->getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
(constIndex = dyn_cast<ConstantSDNode>(Value->getOperand(1)))) {
// We need to create a new undef vector to use for the VDUPLANE if the
// size of the vector from which we get the value is different than the
// size of the vector that we need to create. We will insert the element
// such that the register coalescer will remove unnecessary copies.
if (VT != Value->getOperand(0).getValueType()) {
unsigned index = constIndex->getAPIntValue().getLimitedValue() %
VT.getVectorNumElements();
N = DAG.getNode(ARMISD::VDUPLANE, dl, VT,
DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, DAG.getUNDEF(VT),
Value, DAG.getConstant(index, dl, MVT::i32)),
DAG.getConstant(index, dl, MVT::i32));
} else
N = DAG.getNode(ARMISD::VDUPLANE, dl, VT,
Value->getOperand(0), Value->getOperand(1));
} else
N = DAG.getNode(ARMISD::VDUP, dl, VT, Value);
if (!usesOnlyOneValue) {
// The dominant value was splatted as 'N', but we now have to insert
// all differing elements.
for (unsigned I = 0; I < NumElts; ++I) {
if (Op.getOperand(I) == Value)
continue;
SmallVector<SDValue, 3> Ops;
Ops.push_back(N);
Ops.push_back(Op.getOperand(I));
Ops.push_back(DAG.getConstant(I, dl, MVT::i32));
N = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, Ops);
}
}
return N;
}
if (VT.getVectorElementType().isFloatingPoint()) {
SmallVector<SDValue, 8> Ops;
for (unsigned i = 0; i < NumElts; ++i)
Ops.push_back(DAG.getNode(ISD::BITCAST, dl, MVT::i32,
Op.getOperand(i)));
EVT VecVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32, NumElts);
SDValue Val = DAG.getBuildVector(VecVT, dl, Ops);
Val = LowerBUILD_VECTOR(Val, DAG, ST);
if (Val.getNode())
return DAG.getNode(ISD::BITCAST, dl, VT, Val);
}
if (usesOnlyOneValue) {
SDValue Val = IsSingleInstrConstant(Value, DAG, ST, dl);
if (isConstant && Val.getNode())
return DAG.getNode(ARMISD::VDUP, dl, VT, Val);
}
}
// If all elements are constants and the case above didn't get hit, fall back
// to the default expansion, which will generate a load from the constant
// pool.
if (isConstant)
return SDValue();
// Empirical tests suggest this is rarely worth it for vectors of length <= 2.
if (NumElts >= 4) {
SDValue shuffle = ReconstructShuffle(Op, DAG);
if (shuffle != SDValue())
return shuffle;
}
if (VT.is128BitVector() && VT != MVT::v2f64 && VT != MVT::v4f32) {
// If we haven't found an efficient lowering, try splitting a 128-bit vector
// into two 64-bit vectors; we might discover a better way to lower it.
SmallVector<SDValue, 64> Ops(Op->op_begin(), Op->op_begin() + NumElts);
EVT ExtVT = VT.getVectorElementType();
EVT HVT = EVT::getVectorVT(*DAG.getContext(), ExtVT, NumElts / 2);
SDValue Lower =
DAG.getBuildVector(HVT, dl, makeArrayRef(&Ops[0], NumElts / 2));
if (Lower.getOpcode() == ISD::BUILD_VECTOR)
Lower = LowerBUILD_VECTOR(Lower, DAG, ST);
SDValue Upper = DAG.getBuildVector(
HVT, dl, makeArrayRef(&Ops[NumElts / 2], NumElts / 2));
if (Upper.getOpcode() == ISD::BUILD_VECTOR)
Upper = LowerBUILD_VECTOR(Upper, DAG, ST);
if (Lower && Upper)
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, Lower, Upper);
}
// Vectors with 32- or 64-bit elements can be built by directly assigning
// the subregisters. Lower it to an ARMISD::BUILD_VECTOR so the operands
// will be legalized.
if (EltSize >= 32) {
// Do the expansion with floating-point types, since that is what the VFP
// registers are defined to use, and since i64 is not legal.
EVT EltVT = EVT::getFloatingPointVT(EltSize);
EVT VecVT = EVT::getVectorVT(*DAG.getContext(), EltVT, NumElts);
SmallVector<SDValue, 8> Ops;
for (unsigned i = 0; i < NumElts; ++i)
Ops.push_back(DAG.getNode(ISD::BITCAST, dl, EltVT, Op.getOperand(i)));
SDValue Val = DAG.getNode(ARMISD::BUILD_VECTOR, dl, VecVT, Ops);
return DAG.getNode(ISD::BITCAST, dl, VT, Val);
}
// If all else fails, just use a sequence of INSERT_VECTOR_ELT when we
// know the default expansion would otherwise fall back on something even
// worse. For a vector with one or two non-undef values, that's
// scalar_to_vector for the elements followed by a shuffle (provided the
// shuffle is valid for the target) and materialization element by element
// on the stack followed by a load for everything else.
if (!isConstant && !usesOnlyOneValue) {
SDValue Vec = DAG.getUNDEF(VT);
for (unsigned i = 0 ; i < NumElts; ++i) {
SDValue V = Op.getOperand(i);
if (V.isUndef())
continue;
SDValue LaneIdx = DAG.getConstant(i, dl, MVT::i32);
Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, Vec, V, LaneIdx);
}
return Vec;
}
return SDValue();
}
// Gather data to see if the operation can be modelled as a
// shuffle in combination with VEXTs.
SDValue ARMTargetLowering::ReconstructShuffle(SDValue Op,
SelectionDAG &DAG) const {
assert(Op.getOpcode() == ISD::BUILD_VECTOR && "Unknown opcode!");
SDLoc dl(Op);
EVT VT = Op.getValueType();
unsigned NumElts = VT.getVectorNumElements();
struct ShuffleSourceInfo {
SDValue Vec;
unsigned MinElt = std::numeric_limits<unsigned>::max();
unsigned MaxElt = 0;
// We may insert some combination of BITCASTs and VEXT nodes to force Vec to
// be compatible with the shuffle we intend to construct. As a result
// ShuffleVec will be some sliding window into the original Vec.
SDValue ShuffleVec;
// Code should guarantee that element i in Vec starts at element
// "WindowBase + i * WindowScale" in ShuffleVec.
int WindowBase = 0;
int WindowScale = 1;
ShuffleSourceInfo(SDValue Vec) : Vec(Vec), ShuffleVec(Vec) {}
bool operator ==(SDValue OtherVec) { return Vec == OtherVec; }
};
// First gather all vectors used as an immediate source for this BUILD_VECTOR
// node.
SmallVector<ShuffleSourceInfo, 2> Sources;
for (unsigned i = 0; i < NumElts; ++i) {
SDValue V = Op.getOperand(i);
if (V.isUndef())
continue;
else if (V.getOpcode() != ISD::EXTRACT_VECTOR_ELT) {
// A shuffle can only come from building a vector from various
// elements of other vectors.
return SDValue();
} else if (!isa<ConstantSDNode>(V.getOperand(1))) {
// Furthermore, shuffles require a constant mask, whereas extractelts
// accept variable indices.
return SDValue();
}
// Add this element source to the list if it's not already there.
SDValue SourceVec = V.getOperand(0);
auto Source = llvm::find(Sources, SourceVec);
if (Source == Sources.end())
Source = Sources.insert(Sources.end(), ShuffleSourceInfo(SourceVec));
// Update the minimum and maximum lane number seen.
unsigned EltNo = cast<ConstantSDNode>(V.getOperand(1))->getZExtValue();
Source->MinElt = std::min(Source->MinElt, EltNo);
Source->MaxElt = std::max(Source->MaxElt, EltNo);
}
// Currently only do something sane when at most two source vectors
// are involved.
if (Sources.size() > 2)
return SDValue();
// Find out the smallest element size among result and two sources, and use
// it as element size to build the shuffle_vector.
EVT SmallestEltTy = VT.getVectorElementType();
for (auto &Source : Sources) {
EVT SrcEltTy = Source.Vec.getValueType().getVectorElementType();
if (SrcEltTy.bitsLT(SmallestEltTy))
SmallestEltTy = SrcEltTy;
}
unsigned ResMultiplier =
VT.getScalarSizeInBits() / SmallestEltTy.getSizeInBits();
NumElts = VT.getSizeInBits() / SmallestEltTy.getSizeInBits();
EVT ShuffleVT = EVT::getVectorVT(*DAG.getContext(), SmallestEltTy, NumElts);
// If the source vector is too wide or too narrow, we may nevertheless be able
// to construct a compatible shuffle either by concatenating it with UNDEF or
// extracting a suitable range of elements.
for (auto &Src : Sources) {
EVT SrcVT = Src.ShuffleVec.getValueType();
if (SrcVT.getSizeInBits() == VT.getSizeInBits())
continue;
// This stage of the search produces a source with the same element type as
// the original, but with a total width matching the BUILD_VECTOR output.
EVT EltVT = SrcVT.getVectorElementType();
unsigned NumSrcElts = VT.getSizeInBits() / EltVT.getSizeInBits();
EVT DestVT = EVT::getVectorVT(*DAG.getContext(), EltVT, NumSrcElts);
if (SrcVT.getSizeInBits() < VT.getSizeInBits()) {
if (2 * SrcVT.getSizeInBits() != VT.getSizeInBits())
return SDValue();
// We can pad out the smaller vector for free, so if it's part of a
// shuffle...
Src.ShuffleVec =
DAG.getNode(ISD::CONCAT_VECTORS, dl, DestVT, Src.ShuffleVec,
DAG.getUNDEF(Src.ShuffleVec.getValueType()));
continue;
}
if (SrcVT.getSizeInBits() != 2 * VT.getSizeInBits())
return SDValue();
if (Src.MaxElt - Src.MinElt >= NumSrcElts) {
// Span too large for a VEXT to cope
return SDValue();
}
if (Src.MinElt >= NumSrcElts) {
// The extraction can just take the second half
Src.ShuffleVec =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(NumSrcElts, dl, MVT::i32));
Src.WindowBase = -NumSrcElts;
} else if (Src.MaxElt < NumSrcElts) {
// The extraction can just take the first half
Src.ShuffleVec =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(0, dl, MVT::i32));
} else {
// An actual VEXT is needed
SDValue VEXTSrc1 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(0, dl, MVT::i32));
SDValue VEXTSrc2 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DestVT, Src.ShuffleVec,
DAG.getConstant(NumSrcElts, dl, MVT::i32));
Src.ShuffleVec = DAG.getNode(ARMISD::VEXT, dl, DestVT, VEXTSrc1,
VEXTSrc2,
DAG.getConstant(Src.MinElt, dl, MVT::i32));
Src.WindowBase = -Src.MinElt;
}
}
// Another possible incompatibility occurs from the vector element types. We
// can fix this by bitcasting the source vectors to the same type we intend
// for the shuffle.
for (auto &Src : Sources) {
EVT SrcEltTy = Src.ShuffleVec.getValueType().getVectorElementType();
if (SrcEltTy == SmallestEltTy)
continue;
assert(ShuffleVT.getVectorElementType() == SmallestEltTy);
Src.ShuffleVec = DAG.getNode(ISD::BITCAST, dl, ShuffleVT, Src.ShuffleVec);
Src.WindowScale = SrcEltTy.getSizeInBits() / SmallestEltTy.getSizeInBits();
Src.WindowBase *= Src.WindowScale;
}
// Final sanity check before we try to actually produce a shuffle.
DEBUG(
for (auto Src : Sources)
assert(Src.ShuffleVec.getValueType() == ShuffleVT);
);
// The stars all align, our next step is to produce the mask for the shuffle.
SmallVector<int, 8> Mask(ShuffleVT.getVectorNumElements(), -1);
int BitsPerShuffleLane = ShuffleVT.getScalarSizeInBits();
for (unsigned i = 0; i < VT.getVectorNumElements(); ++i) {
SDValue Entry = Op.getOperand(i);
if (Entry.isUndef())
continue;
auto Src = llvm::find(Sources, Entry.getOperand(0));
int EltNo = cast<ConstantSDNode>(Entry.getOperand(1))->getSExtValue();
// EXTRACT_VECTOR_ELT performs an implicit any_ext; BUILD_VECTOR an implicit
// trunc. So only std::min(SrcBits, DestBits) actually get defined in this
// segment.
EVT OrigEltTy = Entry.getOperand(0).getValueType().getVectorElementType();
int BitsDefined = std::min(OrigEltTy.getSizeInBits(),
VT.getScalarSizeInBits());
int LanesDefined = BitsDefined / BitsPerShuffleLane;
// This source is expected to fill ResMultiplier lanes of the final shuffle,
// starting at the appropriate offset.
int *LaneMask = &Mask[i * ResMultiplier];
int ExtractBase = EltNo * Src->WindowScale + Src->WindowBase;
ExtractBase += NumElts * (Src - Sources.begin());
for (int j = 0; j < LanesDefined; ++j)
LaneMask[j] = ExtractBase + j;
}
// Final check before we try to produce nonsense...
if (!isShuffleMaskLegal(Mask, ShuffleVT))
return SDValue();
// We can't handle more than two sources. This should have already
// been checked before this point.
assert(Sources.size() <= 2 && "Too many sources!");
SDValue ShuffleOps[] = { DAG.getUNDEF(ShuffleVT), DAG.getUNDEF(ShuffleVT) };
for (unsigned i = 0; i < Sources.size(); ++i)
ShuffleOps[i] = Sources[i].ShuffleVec;
SDValue Shuffle = DAG.getVectorShuffle(ShuffleVT, dl, ShuffleOps[0],
ShuffleOps[1], Mask);
return DAG.getNode(ISD::BITCAST, dl, VT, Shuffle);
}
/// isShuffleMaskLegal - Targets can use this to indicate that they only
/// support *some* VECTOR_SHUFFLE operations, those with specific masks.
/// By default, if a target supports the VECTOR_SHUFFLE node, all mask values
/// are assumed to be legal.
bool
ARMTargetLowering::isShuffleMaskLegal(const SmallVectorImpl<int> &M,
EVT VT) const {
if (VT.getVectorNumElements() == 4 &&
(VT.is128BitVector() || VT.is64BitVector())) {
unsigned PFIndexes[4];
for (unsigned i = 0; i != 4; ++i) {
if (M[i] < 0)
PFIndexes[i] = 8;
else
PFIndexes[i] = M[i];
}
// Compute the index in the perfect shuffle table.
unsigned PFTableIndex =
PFIndexes[0]*9*9*9+PFIndexes[1]*9*9+PFIndexes[2]*9+PFIndexes[3];
unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
unsigned Cost = (PFEntry >> 30);
if (Cost <= 4)
return true;
}
bool ReverseVEXT, isV_UNDEF;
unsigned Imm, WhichResult;
unsigned EltSize = VT.getScalarSizeInBits();
return (EltSize >= 32 ||
ShuffleVectorSDNode::isSplatMask(&M[0], VT) ||
isVREVMask(M, VT, 64) ||
isVREVMask(M, VT, 32) ||
isVREVMask(M, VT, 16) ||
isVEXTMask(M, VT, ReverseVEXT, Imm) ||
isVTBLMask(M, VT) ||
isNEONTwoResultShuffleMask(M, VT, WhichResult, isV_UNDEF) ||
((VT == MVT::v8i16 || VT == MVT::v16i8) && isReverseMask(M, VT)));
}
/// GeneratePerfectShuffle - Given an entry in the perfect-shuffle table, emit
/// the specified operations to build the shuffle.
static SDValue GeneratePerfectShuffle(unsigned PFEntry, SDValue LHS,
SDValue RHS, SelectionDAG &DAG,
const SDLoc &dl) {
unsigned OpNum = (PFEntry >> 26) & 0x0F;
unsigned LHSID = (PFEntry >> 13) & ((1 << 13)-1);
unsigned RHSID = (PFEntry >> 0) & ((1 << 13)-1);
enum {
OP_COPY = 0, // Copy, used for things like <u,u,u,3> to say it is <0,1,2,3>
OP_VREV,
OP_VDUP0,
OP_VDUP1,
OP_VDUP2,
OP_VDUP3,
OP_VEXT1,
OP_VEXT2,
OP_VEXT3,
OP_VUZPL, // VUZP, left result
OP_VUZPR, // VUZP, right result
OP_VZIPL, // VZIP, left result
OP_VZIPR, // VZIP, right result
OP_VTRNL, // VTRN, left result
OP_VTRNR // VTRN, right result
};
if (OpNum == OP_COPY) {
if (LHSID == (1*9+2)*9+3) return LHS;
assert(LHSID == ((4*9+5)*9+6)*9+7 && "Illegal OP_COPY!");
return RHS;
}
SDValue OpLHS, OpRHS;
OpLHS = GeneratePerfectShuffle(PerfectShuffleTable[LHSID], LHS, RHS, DAG, dl);
OpRHS = GeneratePerfectShuffle(PerfectShuffleTable[RHSID], LHS, RHS, DAG, dl);
EVT VT = OpLHS.getValueType();
switch (OpNum) {
default: llvm_unreachable("Unknown shuffle opcode!");
case OP_VREV:
// VREV divides the vector in half and swaps within the half.
if (VT.getVectorElementType() == MVT::i32 ||
VT.getVectorElementType() == MVT::f32)
return DAG.getNode(ARMISD::VREV64, dl, VT, OpLHS);
// vrev <4 x i16> -> VREV32
if (VT.getVectorElementType() == MVT::i16)
return DAG.getNode(ARMISD::VREV32, dl, VT, OpLHS);
// vrev <4 x i8> -> VREV16
assert(VT.getVectorElementType() == MVT::i8);
return DAG.getNode(ARMISD::VREV16, dl, VT, OpLHS);
case OP_VDUP0:
case OP_VDUP1:
case OP_VDUP2:
case OP_VDUP3:
return DAG.getNode(ARMISD::VDUPLANE, dl, VT,
OpLHS, DAG.getConstant(OpNum-OP_VDUP0, dl, MVT::i32));
case OP_VEXT1:
case OP_VEXT2:
case OP_VEXT3:
return DAG.getNode(ARMISD::VEXT, dl, VT,
OpLHS, OpRHS,
DAG.getConstant(OpNum - OP_VEXT1 + 1, dl, MVT::i32));
case OP_VUZPL:
case OP_VUZPR:
return DAG.getNode(ARMISD::VUZP, dl, DAG.getVTList(VT, VT),
OpLHS, OpRHS).getValue(OpNum-OP_VUZPL);
case OP_VZIPL:
case OP_VZIPR:
return DAG.getNode(ARMISD::VZIP, dl, DAG.getVTList(VT, VT),
OpLHS, OpRHS).getValue(OpNum-OP_VZIPL);
case OP_VTRNL:
case OP_VTRNR:
return DAG.getNode(ARMISD::VTRN, dl, DAG.getVTList(VT, VT),
OpLHS, OpRHS).getValue(OpNum-OP_VTRNL);
}
}
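// Editor's sketch of the perfect-shuffle table entry layout decoded above
// (the ModelPFEntryFields/ModelDecodePFEntry names are invented): bits
// 31:30 hold the cost, bits 29:26 the opcode, and bits 25:13 / 12:0 the
// two operand entry IDs, matching the extractions at the top of
// GeneratePerfectShuffle and the cost test in isShuffleMaskLegal.
struct ModelPFEntryFields {
  unsigned Cost, OpNum, LHSID, RHSID;
};
static inline ModelPFEntryFields ModelDecodePFEntry(unsigned PFEntry) {
  return {PFEntry >> 30, (PFEntry >> 26) & 0x0F,
          (PFEntry >> 13) & ((1u << 13) - 1), PFEntry & ((1u << 13) - 1)};
}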
static SDValue LowerVECTOR_SHUFFLEv8i8(SDValue Op,
ArrayRef<int> ShuffleMask,
SelectionDAG &DAG) {
// Check to see if we can use the VTBL instruction.
SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);
SDLoc DL(Op);
SmallVector<SDValue, 8> VTBLMask;
for (ArrayRef<int>::iterator
I = ShuffleMask.begin(), E = ShuffleMask.end(); I != E; ++I)
VTBLMask.push_back(DAG.getConstant(*I, DL, MVT::i32));
if (V2.getNode()->isUndef())
return DAG.getNode(ARMISD::VTBL1, DL, MVT::v8i8, V1,
DAG.getBuildVector(MVT::v8i8, DL, VTBLMask));
return DAG.getNode(ARMISD::VTBL2, DL, MVT::v8i8, V1, V2,
DAG.getBuildVector(MVT::v8i8, DL, VTBLMask));
}
static SDValue LowerReverse_VECTOR_SHUFFLEv16i8_v8i16(SDValue Op,
SelectionDAG &DAG) {
SDLoc DL(Op);
SDValue OpLHS = Op.getOperand(0);
EVT VT = OpLHS.getValueType();
assert((VT == MVT::v8i16 || VT == MVT::v16i8) &&
"Expect an v8i16/v16i8 type");
OpLHS = DAG.getNode(ARMISD::VREV64, DL, VT, OpLHS);
// For a v16i8 type: After the VREV, we have got <8, ...15, 8, ..., 0>. Now,
// extract the first 8 bytes into the top double word and the last 8 bytes
// into the bottom double word. The v8i16 case is similar.
unsigned ExtractNum = (VT == MVT::v16i8) ? 8 : 4;
return DAG.getNode(ARMISD::VEXT, DL, VT, OpLHS, OpLHS,
DAG.getConstant(ExtractNum, DL, MVT::i32));
}
static SDValue LowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) {
SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);
SDLoc dl(Op);
EVT VT = Op.getValueType();
ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());
// Convert shuffles that are directly supported on NEON to target-specific
// DAG nodes, instead of keeping them as shuffles and matching them again
// during code selection. This is more efficient and avoids the possibility
// of inconsistencies between legalization and selection.
// FIXME: floating-point vectors should be canonicalized to integer vectors
  // of the same size so that they get CSEd properly.
ArrayRef<int> ShuffleMask = SVN->getMask();
unsigned EltSize = VT.getScalarSizeInBits();
if (EltSize <= 32) {
if (SVN->isSplat()) {
int Lane = SVN->getSplatIndex();
      // If this is an undef splat, generate it via "just" vdup, if possible.
if (Lane == -1) Lane = 0;
// Test if V1 is a SCALAR_TO_VECTOR.
if (Lane == 0 && V1.getOpcode() == ISD::SCALAR_TO_VECTOR) {
return DAG.getNode(ARMISD::VDUP, dl, VT, V1.getOperand(0));
}
// Test if V1 is a BUILD_VECTOR which is equivalent to a SCALAR_TO_VECTOR
// (and probably will turn into a SCALAR_TO_VECTOR once legalization
// reaches it).
if (Lane == 0 && V1.getOpcode() == ISD::BUILD_VECTOR &&
!isa<ConstantSDNode>(V1.getOperand(0))) {
bool IsScalarToVector = true;
for (unsigned i = 1, e = V1.getNumOperands(); i != e; ++i)
if (!V1.getOperand(i).isUndef()) {
IsScalarToVector = false;
break;
}
if (IsScalarToVector)
return DAG.getNode(ARMISD::VDUP, dl, VT, V1.getOperand(0));
}
return DAG.getNode(ARMISD::VDUPLANE, dl, VT, V1,
DAG.getConstant(Lane, dl, MVT::i32));
}
bool ReverseVEXT;
unsigned Imm;
if (isVEXTMask(ShuffleMask, VT, ReverseVEXT, Imm)) {
if (ReverseVEXT)
std::swap(V1, V2);
return DAG.getNode(ARMISD::VEXT, dl, VT, V1, V2,
DAG.getConstant(Imm, dl, MVT::i32));
}
if (isVREVMask(ShuffleMask, VT, 64))
return DAG.getNode(ARMISD::VREV64, dl, VT, V1);
if (isVREVMask(ShuffleMask, VT, 32))
return DAG.getNode(ARMISD::VREV32, dl, VT, V1);
if (isVREVMask(ShuffleMask, VT, 16))
return DAG.getNode(ARMISD::VREV16, dl, VT, V1);
if (V2->isUndef() && isSingletonVEXTMask(ShuffleMask, VT, Imm)) {
return DAG.getNode(ARMISD::VEXT, dl, VT, V1, V1,
DAG.getConstant(Imm, dl, MVT::i32));
}
// Check for Neon shuffles that modify both input vectors in place.
// If both results are used, i.e., if there are two shuffles with the same
// source operands and with masks corresponding to both results of one of
// these operations, DAG memoization will ensure that a single node is
// used for both shuffles.
unsigned WhichResult;
bool isV_UNDEF;
if (unsigned ShuffleOpc = isNEONTwoResultShuffleMask(
ShuffleMask, VT, WhichResult, isV_UNDEF)) {
if (isV_UNDEF)
V2 = V1;
return DAG.getNode(ShuffleOpc, dl, DAG.getVTList(VT, VT), V1, V2)
.getValue(WhichResult);
}
// Also check for these shuffles through CONCAT_VECTORS: we canonicalize
// shuffles that produce a result larger than their operands with:
// shuffle(concat(v1, undef), concat(v2, undef))
// ->
// shuffle(concat(v1, v2), undef)
// because we can access quad vectors (see PerformVECTOR_SHUFFLECombine).
//
// This is useful in the general case, but there are special cases where
// native shuffles produce larger results: the two-result ops.
//
// Look through the concat when lowering them:
// shuffle(concat(v1, v2), undef)
// ->
// concat(VZIP(v1, v2):0, :1)
//
if (V1->getOpcode() == ISD::CONCAT_VECTORS && V2->isUndef()) {
SDValue SubV1 = V1->getOperand(0);
SDValue SubV2 = V1->getOperand(1);
EVT SubVT = SubV1.getValueType();
// We expect these to have been canonicalized to -1.
assert(llvm::all_of(ShuffleMask, [&](int i) {
return i < (int)VT.getVectorNumElements();
}) && "Unexpected shuffle index into UNDEF operand!");
if (unsigned ShuffleOpc = isNEONTwoResultShuffleMask(
ShuffleMask, SubVT, WhichResult, isV_UNDEF)) {
if (isV_UNDEF)
SubV2 = SubV1;
assert((WhichResult == 0) &&
"In-place shuffle of concat can only have one result!");
SDValue Res = DAG.getNode(ShuffleOpc, dl, DAG.getVTList(SubVT, SubVT),
SubV1, SubV2);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, Res.getValue(0),
Res.getValue(1));
}
}
}
// If the shuffle is not directly supported and it has 4 elements, use
// the PerfectShuffle-generated table to synthesize it from other shuffles.
unsigned NumElts = VT.getVectorNumElements();
if (NumElts == 4) {
unsigned PFIndexes[4];
for (unsigned i = 0; i != 4; ++i) {
if (ShuffleMask[i] < 0)
PFIndexes[i] = 8;
else
PFIndexes[i] = ShuffleMask[i];
}
// Compute the index in the perfect shuffle table.
unsigned PFTableIndex =
PFIndexes[0]*9*9*9+PFIndexes[1]*9*9+PFIndexes[2]*9+PFIndexes[3];
unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
unsigned Cost = (PFEntry >> 30);
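    // The top two bits of the entry give the cost (roughly, the number of
    // operations the expansion needs); only use the table when the shuffle
    // is cheap enough to be worth synthesizing.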
if (Cost <= 4)
return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
}
// Implement shuffles with 32- or 64-bit elements as ARMISD::BUILD_VECTORs.
if (EltSize >= 32) {
// Do the expansion with floating-point types, since that is what the VFP
// registers are defined to use, and since i64 is not legal.
EVT EltVT = EVT::getFloatingPointVT(EltSize);
EVT VecVT = EVT::getVectorVT(*DAG.getContext(), EltVT, NumElts);
V1 = DAG.getNode(ISD::BITCAST, dl, VecVT, V1);
V2 = DAG.getNode(ISD::BITCAST, dl, VecVT, V2);
SmallVector<SDValue, 8> Ops;
for (unsigned i = 0; i < NumElts; ++i) {
if (ShuffleMask[i] < 0)
Ops.push_back(DAG.getUNDEF(EltVT));
else
Ops.push_back(DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT,
ShuffleMask[i] < (int)NumElts ? V1 : V2,
DAG.getConstant(ShuffleMask[i] & (NumElts-1),
dl, MVT::i32)));
}
SDValue Val = DAG.getNode(ARMISD::BUILD_VECTOR, dl, VecVT, Ops);
return DAG.getNode(ISD::BITCAST, dl, VT, Val);
}
if ((VT == MVT::v8i16 || VT == MVT::v16i8) && isReverseMask(ShuffleMask, VT))
return LowerReverse_VECTOR_SHUFFLEv16i8_v8i16(Op, DAG);
if (VT == MVT::v8i8)
if (SDValue NewOp = LowerVECTOR_SHUFFLEv8i8(Op, ShuffleMask, DAG))
return NewOp;
return SDValue();
}
static SDValue LowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) {
// INSERT_VECTOR_ELT is legal only for immediate indexes.
SDValue Lane = Op.getOperand(2);
if (!isa<ConstantSDNode>(Lane))
return SDValue();
return Op;
}
static SDValue LowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) {
// EXTRACT_VECTOR_ELT is legal only for immediate indexes.
SDValue Lane = Op.getOperand(1);
if (!isa<ConstantSDNode>(Lane))
return SDValue();
SDValue Vec = Op.getOperand(0);
if (Op.getValueType() == MVT::i32 && Vec.getScalarValueSizeInBits() < 32) {
SDLoc dl(Op);
return DAG.getNode(ARMISD::VGETLANEu, dl, MVT::i32, Vec, Lane);
}
return Op;
}
static SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) {
// The only time a CONCAT_VECTORS operation can have legal types is when
// two 64-bit vectors are concatenated to a 128-bit vector.
assert(Op.getValueType().is128BitVector() && Op.getNumOperands() == 2 &&
"unexpected CONCAT_VECTORS");
SDLoc dl(Op);
SDValue Val = DAG.getUNDEF(MVT::v2f64);
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
if (!Op0.isUndef())
Val = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64, Val,
DAG.getNode(ISD::BITCAST, dl, MVT::f64, Op0),
DAG.getIntPtrConstant(0, dl));
if (!Op1.isUndef())
Val = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v2f64, Val,
DAG.getNode(ISD::BITCAST, dl, MVT::f64, Op1),
DAG.getIntPtrConstant(1, dl));
return DAG.getNode(ISD::BITCAST, dl, Op.getValueType(), Val);
}
/// isExtendedBUILD_VECTOR - Check if N is a constant BUILD_VECTOR where each
/// element has been zero/sign-extended, depending on the isSigned parameter,
/// from an integer type half its size.
static bool isExtendedBUILD_VECTOR(SDNode *N, SelectionDAG &DAG,
bool isSigned) {
// A v2i64 BUILD_VECTOR will have been legalized to a BITCAST from v4i32.
EVT VT = N->getValueType(0);
if (VT == MVT::v2i64 && N->getOpcode() == ISD::BITCAST) {
SDNode *BVN = N->getOperand(0).getNode();
if (BVN->getValueType(0) != MVT::v4i32 ||
BVN->getOpcode() != ISD::BUILD_VECTOR)
return false;
unsigned LoElt = DAG.getDataLayout().isBigEndian() ? 1 : 0;
unsigned HiElt = 1 - LoElt;
ConstantSDNode *Lo0 = dyn_cast<ConstantSDNode>(BVN->getOperand(LoElt));
ConstantSDNode *Hi0 = dyn_cast<ConstantSDNode>(BVN->getOperand(HiElt));
ConstantSDNode *Lo1 = dyn_cast<ConstantSDNode>(BVN->getOperand(LoElt+2));
ConstantSDNode *Hi1 = dyn_cast<ConstantSDNode>(BVN->getOperand(HiElt+2));
if (!Lo0 || !Hi0 || !Lo1 || !Hi1)
return false;
if (isSigned) {
if (Hi0->getSExtValue() == Lo0->getSExtValue() >> 32 &&
Hi1->getSExtValue() == Lo1->getSExtValue() >> 32)
return true;
} else {
if (Hi0->isNullValue() && Hi1->isNullValue())
return true;
}
return false;
}
if (N->getOpcode() != ISD::BUILD_VECTOR)
return false;
for (unsigned i = 0, e = N->getNumOperands(); i != e; ++i) {
SDNode *Elt = N->getOperand(i).getNode();
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Elt)) {
unsigned EltSize = VT.getScalarSizeInBits();
unsigned HalfSize = EltSize / 2;
if (isSigned) {
if (!isIntN(HalfSize, C->getSExtValue()))
return false;
} else {
if (!isUIntN(HalfSize, C->getZExtValue()))
return false;
}
continue;
}
return false;
}
return true;
}
/// isSignExtended - Check if a node is a vector value that is sign-extended
/// or a constant BUILD_VECTOR with sign-extended elements.
static bool isSignExtended(SDNode *N, SelectionDAG &DAG) {
if (N->getOpcode() == ISD::SIGN_EXTEND || ISD::isSEXTLoad(N))
return true;
if (isExtendedBUILD_VECTOR(N, DAG, true))
return true;
return false;
}
/// isZeroExtended - Check if a node is a vector value that is zero-extended
/// or a constant BUILD_VECTOR with zero-extended elements.
static bool isZeroExtended(SDNode *N, SelectionDAG &DAG) {
if (N->getOpcode() == ISD::ZERO_EXTEND || ISD::isZEXTLoad(N))
return true;
if (isExtendedBUILD_VECTOR(N, DAG, false))
return true;
return false;
}
static EVT getExtensionTo64Bits(const EVT &OrigVT) {
if (OrigVT.getSizeInBits() >= 64)
return OrigVT;
assert(OrigVT.isSimple() && "Expecting a simple value type");
MVT::SimpleValueType OrigSimpleTy = OrigVT.getSimpleVT().SimpleTy;
switch (OrigSimpleTy) {
default: llvm_unreachable("Unexpected Vector Type");
case MVT::v2i8:
case MVT::v2i16:
return MVT::v2i32;
case MVT::v4i8:
return MVT::v4i16;
}
}
/// AddRequiredExtensionForVMULL - Add a sign/zero extension to extend the total
/// value size to 64 bits. We need a 64-bit D register as an operand to VMULL.
/// We insert the required extension here to get the vector to fill a D register.
static SDValue AddRequiredExtensionForVMULL(SDValue N, SelectionDAG &DAG,
const EVT &OrigTy,
const EVT &ExtTy,
unsigned ExtOpcode) {
// The vector originally had a size of OrigTy. It was then extended to ExtTy.
// We expect the ExtTy to be 128-bits total. If the OrigTy is less than
// 64-bits we need to insert a new extension so that it will be 64-bits.
assert(ExtTy.is128BitVector() && "Unexpected extension size");
if (OrigTy.getSizeInBits() >= 64)
return N;
// Must extend size to at least 64 bits to be used as an operand for VMULL.
EVT NewVT = getExtensionTo64Bits(OrigTy);
return DAG.getNode(ExtOpcode, SDLoc(N), NewVT, N);
}
/// SkipLoadExtensionForVMULL - return a load of the original vector size that
/// does not do any sign/zero extension. If the original vector is less
/// than 64 bits, an appropriate extension will be added after the load to
/// reach a total size of 64 bits. We have to add the extension separately
/// because ARM does not have a sign/zero extending load for vectors.
static SDValue SkipLoadExtensionForVMULL(LoadSDNode *LD, SelectionDAG& DAG) {
EVT ExtendedTy = getExtensionTo64Bits(LD->getMemoryVT());
// The load already has the right type.
if (ExtendedTy == LD->getMemoryVT())
return DAG.getLoad(LD->getMemoryVT(), SDLoc(LD), LD->getChain(),
LD->getBasePtr(), LD->getPointerInfo(),
LD->getAlignment(), LD->getMemOperand()->getFlags());
// We need to create a zextload/sextload. We cannot just create a load
  // followed by a sext/zext node because LowerMUL is also run during normal
// operation legalization where we can't create illegal types.
return DAG.getExtLoad(LD->getExtensionType(), SDLoc(LD), ExtendedTy,
LD->getChain(), LD->getBasePtr(), LD->getPointerInfo(),
LD->getMemoryVT(), LD->getAlignment(),
LD->getMemOperand()->getFlags());
}
/// SkipExtensionForVMULL - For a node that is a SIGN_EXTEND, ZERO_EXTEND,
/// extending load, or BUILD_VECTOR with extended elements, return the
/// unextended value. The unextended vector should be 64 bits so that it can
/// be used as an operand to a VMULL instruction. If the original vector size
/// before extension is less than 64 bits, we add an extension to resize
/// the vector to 64 bits.
static SDValue SkipExtensionForVMULL(SDNode *N, SelectionDAG &DAG) {
if (N->getOpcode() == ISD::SIGN_EXTEND || N->getOpcode() == ISD::ZERO_EXTEND)
return AddRequiredExtensionForVMULL(N->getOperand(0), DAG,
N->getOperand(0)->getValueType(0),
N->getValueType(0),
N->getOpcode());
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(N)) {
assert((ISD::isSEXTLoad(LD) || ISD::isZEXTLoad(LD)) &&
"Expected extending load");
SDValue newLoad = SkipLoadExtensionForVMULL(LD, DAG);
DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), newLoad.getValue(1));
unsigned Opcode = ISD::isSEXTLoad(LD) ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
SDValue extLoad =
DAG.getNode(Opcode, SDLoc(newLoad), LD->getValueType(0), newLoad);
DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 0), extLoad);
return newLoad;
}
// Otherwise, the value must be a BUILD_VECTOR. For v2i64, it will
// have been legalized as a BITCAST from v4i32.
if (N->getOpcode() == ISD::BITCAST) {
SDNode *BVN = N->getOperand(0).getNode();
assert(BVN->getOpcode() == ISD::BUILD_VECTOR &&
BVN->getValueType(0) == MVT::v4i32 && "expected v4i32 BUILD_VECTOR");
unsigned LowElt = DAG.getDataLayout().isBigEndian() ? 1 : 0;
return DAG.getBuildVector(
MVT::v2i32, SDLoc(N),
{BVN->getOperand(LowElt), BVN->getOperand(LowElt + 2)});
}
// Construct a new BUILD_VECTOR with elements truncated to half the size.
assert(N->getOpcode() == ISD::BUILD_VECTOR && "expected BUILD_VECTOR");
EVT VT = N->getValueType(0);
unsigned EltSize = VT.getScalarSizeInBits() / 2;
unsigned NumElts = VT.getVectorNumElements();
MVT TruncVT = MVT::getIntegerVT(EltSize);
SmallVector<SDValue, 8> Ops;
SDLoc dl(N);
for (unsigned i = 0; i != NumElts; ++i) {
ConstantSDNode *C = cast<ConstantSDNode>(N->getOperand(i));
const APInt &CInt = C->getAPIntValue();
// Element types smaller than 32 bits are not legal, so use i32 elements.
// The values are implicitly truncated so sext vs. zext doesn't matter.
Ops.push_back(DAG.getConstant(CInt.zextOrTrunc(32), dl, MVT::i32));
}
return DAG.getBuildVector(MVT::getVectorVT(TruncVT, NumElts), dl, Ops);
}
static bool isAddSubSExt(SDNode *N, SelectionDAG &DAG) {
unsigned Opcode = N->getOpcode();
if (Opcode == ISD::ADD || Opcode == ISD::SUB) {
SDNode *N0 = N->getOperand(0).getNode();
SDNode *N1 = N->getOperand(1).getNode();
return N0->hasOneUse() && N1->hasOneUse() &&
isSignExtended(N0, DAG) && isSignExtended(N1, DAG);
}
return false;
}
static bool isAddSubZExt(SDNode *N, SelectionDAG &DAG) {
unsigned Opcode = N->getOpcode();
if (Opcode == ISD::ADD || Opcode == ISD::SUB) {
SDNode *N0 = N->getOperand(0).getNode();
SDNode *N1 = N->getOperand(1).getNode();
return N0->hasOneUse() && N1->hasOneUse() &&
isZeroExtended(N0, DAG) && isZeroExtended(N1, DAG);
}
return false;
}
static SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) {
// Multiplications are only custom-lowered for 128-bit vectors so that
// VMULL can be detected. Otherwise v2i64 multiplications are not legal.
EVT VT = Op.getValueType();
assert(VT.is128BitVector() && VT.isInteger() &&
"unexpected type for custom-lowering ISD::MUL");
SDNode *N0 = Op.getOperand(0).getNode();
SDNode *N1 = Op.getOperand(1).getNode();
unsigned NewOpc = 0;
bool isMLA = false;
bool isN0SExt = isSignExtended(N0, DAG);
bool isN1SExt = isSignExtended(N1, DAG);
if (isN0SExt && isN1SExt)
NewOpc = ARMISD::VMULLs;
else {
bool isN0ZExt = isZeroExtended(N0, DAG);
bool isN1ZExt = isZeroExtended(N1, DAG);
if (isN0ZExt && isN1ZExt)
NewOpc = ARMISD::VMULLu;
else if (isN1SExt || isN1ZExt) {
// Look for (s/zext A + s/zext B) * (s/zext C). We want to turn these
// into (s/zext A * s/zext C) + (s/zext B * s/zext C)
if (isN1SExt && isAddSubSExt(N0, DAG)) {
NewOpc = ARMISD::VMULLs;
isMLA = true;
} else if (isN1ZExt && isAddSubZExt(N0, DAG)) {
NewOpc = ARMISD::VMULLu;
isMLA = true;
} else if (isN0ZExt && isAddSubZExt(N1, DAG)) {
std::swap(N0, N1);
NewOpc = ARMISD::VMULLu;
isMLA = true;
}
}
if (!NewOpc) {
if (VT == MVT::v2i64)
// Fall through to expand this. It is not legal.
return SDValue();
else
// Other vector multiplications are legal.
return Op;
}
}
// Legalize to a VMULL instruction.
SDLoc DL(Op);
SDValue Op0;
SDValue Op1 = SkipExtensionForVMULL(N1, DAG);
if (!isMLA) {
Op0 = SkipExtensionForVMULL(N0, DAG);
assert(Op0.getValueType().is64BitVector() &&
Op1.getValueType().is64BitVector() &&
"unexpected types for extended operands to VMULL");
return DAG.getNode(NewOpc, DL, VT, Op0, Op1);
}
// Optimizing (zext A + zext B) * C, to (VMULL A, C) + (VMULL B, C) during
// isel lowering to take advantage of no-stall back to back vmul + vmla.
// vmull q0, d4, d6
// vmlal q0, d5, d6
// is faster than
// vaddl q0, d4, d5
// vmovl q1, d6
// vmul q0, q0, q1
SDValue N00 = SkipExtensionForVMULL(N0->getOperand(0).getNode(), DAG);
SDValue N01 = SkipExtensionForVMULL(N0->getOperand(1).getNode(), DAG);
EVT Op1VT = Op1.getValueType();
return DAG.getNode(N0->getOpcode(), DL, VT,
DAG.getNode(NewOpc, DL, VT,
DAG.getNode(ISD::BITCAST, DL, Op1VT, N00), Op1),
DAG.getNode(NewOpc, DL, VT,
DAG.getNode(ISD::BITCAST, DL, Op1VT, N01), Op1));
}
static SDValue LowerSDIV_v4i8(SDValue X, SDValue Y, const SDLoc &dl,
SelectionDAG &DAG) {
// TODO: Should this propagate fast-math-flags?
// Convert to float
// float4 xf = vcvt_f32_s32(vmovl_s16(a.lo));
// float4 yf = vcvt_f32_s32(vmovl_s16(b.lo));
X = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v4i32, X);
Y = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v4i32, Y);
X = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, X);
Y = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, Y);
// Get reciprocal estimate.
// float4 recip = vrecpeq_f32(yf);
Y = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecpe, dl, MVT::i32),
Y);
// Because char has a smaller range than uchar, we can actually get away
  // without any Newton steps. This requires that we use a weird bias
// of 0xb000, however (again, this has been exhaustively tested).
// float4 result = as_float4(as_int4(xf*recip) + 0xb000);
X = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, X, Y);
X = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, X);
Y = DAG.getConstant(0xb000, dl, MVT::v4i32);
X = DAG.getNode(ISD::ADD, dl, MVT::v4i32, X, Y);
X = DAG.getNode(ISD::BITCAST, dl, MVT::v4f32, X);
// Convert back to short.
X = DAG.getNode(ISD::FP_TO_SINT, dl, MVT::v4i32, X);
X = DAG.getNode(ISD::TRUNCATE, dl, MVT::v4i16, X);
return X;
}
static SDValue LowerSDIV_v4i16(SDValue N0, SDValue N1, const SDLoc &dl,
SelectionDAG &DAG) {
// TODO: Should this propagate fast-math-flags?
SDValue N2;
// Convert to float.
// float4 yf = vcvt_f32_s32(vmovl_s16(y));
// float4 xf = vcvt_f32_s32(vmovl_s16(x));
N0 = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v4i32, N0);
N1 = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v4i32, N1);
N0 = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, N0);
N1 = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, N1);
// Use reciprocal estimate and one refinement step.
// float4 recip = vrecpeq_f32(yf);
// recip *= vrecpsq_f32(yf, recip);
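  // VRECPS computes 2.0 - a*b, so multiplying the estimate by
  // vrecps(y, recip) performs one Newton-Raphson iteration, roughly
  // doubling the number of correct bits in the reciprocal.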
N2 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecpe, dl, MVT::i32),
N1);
N1 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecps, dl, MVT::i32),
N1, N2);
N2 = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, N1, N2);
// Because short has a smaller range than ushort, we can actually get away
  // with only a single Newton step. This requires that we use a weird bias
  // of 0x89, however (again, this has been exhaustively tested).
// float4 result = as_float4(as_int4(xf*recip) + 0x89);
N0 = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, N0, N2);
N0 = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, N0);
N1 = DAG.getConstant(0x89, dl, MVT::v4i32);
N0 = DAG.getNode(ISD::ADD, dl, MVT::v4i32, N0, N1);
N0 = DAG.getNode(ISD::BITCAST, dl, MVT::v4f32, N0);
// Convert back to integer and return.
// return vmovn_s32(vcvt_s32_f32(result));
N0 = DAG.getNode(ISD::FP_TO_SINT, dl, MVT::v4i32, N0);
N0 = DAG.getNode(ISD::TRUNCATE, dl, MVT::v4i16, N0);
return N0;
}
static SDValue LowerSDIV(SDValue Op, SelectionDAG &DAG) {
EVT VT = Op.getValueType();
assert((VT == MVT::v4i16 || VT == MVT::v8i8) &&
"unexpected type for custom-lowering ISD::SDIV");
SDLoc dl(Op);
SDValue N0 = Op.getOperand(0);
SDValue N1 = Op.getOperand(1);
SDValue N2, N3;
if (VT == MVT::v8i8) {
N0 = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i16, N0);
N1 = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i16, N1);
N2 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N0,
DAG.getIntPtrConstant(4, dl));
N3 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N1,
DAG.getIntPtrConstant(4, dl));
N0 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N0,
DAG.getIntPtrConstant(0, dl));
N1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N1,
DAG.getIntPtrConstant(0, dl));
N0 = LowerSDIV_v4i8(N0, N1, dl, DAG); // v4i16
N2 = LowerSDIV_v4i8(N2, N3, dl, DAG); // v4i16
N0 = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v8i16, N0, N2);
N0 = LowerCONCAT_VECTORS(N0, DAG);
N0 = DAG.getNode(ISD::TRUNCATE, dl, MVT::v8i8, N0);
return N0;
}
return LowerSDIV_v4i16(N0, N1, dl, DAG);
}
static SDValue LowerUDIV(SDValue Op, SelectionDAG &DAG) {
// TODO: Should this propagate fast-math-flags?
EVT VT = Op.getValueType();
assert((VT == MVT::v4i16 || VT == MVT::v8i8) &&
"unexpected type for custom-lowering ISD::UDIV");
SDLoc dl(Op);
SDValue N0 = Op.getOperand(0);
SDValue N1 = Op.getOperand(1);
SDValue N2, N3;
if (VT == MVT::v8i8) {
N0 = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v8i16, N0);
N1 = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v8i16, N1);
N2 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N0,
DAG.getIntPtrConstant(4, dl));
N3 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N1,
DAG.getIntPtrConstant(4, dl));
N0 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N0,
DAG.getIntPtrConstant(0, dl));
N1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v4i16, N1,
DAG.getIntPtrConstant(0, dl));
N0 = LowerSDIV_v4i16(N0, N1, dl, DAG); // v4i16
N2 = LowerSDIV_v4i16(N2, N3, dl, DAG); // v4i16
N0 = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v8i16, N0, N2);
N0 = LowerCONCAT_VECTORS(N0, DAG);
N0 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v8i8,
DAG.getConstant(Intrinsic::arm_neon_vqmovnsu, dl,
MVT::i32),
N0);
return N0;
}
  // v4i16 udiv ... Convert to float.
// float4 yf = vcvt_f32_s32(vmovl_u16(y));
// float4 xf = vcvt_f32_s32(vmovl_u16(x));
N0 = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v4i32, N0);
N1 = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v4i32, N1);
N0 = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, N0);
SDValue BN1 = DAG.getNode(ISD::SINT_TO_FP, dl, MVT::v4f32, N1);
// Use reciprocal estimate and two refinement steps.
// float4 recip = vrecpeq_f32(yf);
// recip *= vrecpsq_f32(yf, recip);
// recip *= vrecpsq_f32(yf, recip);
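  // As in the signed case, each multiply by vrecps(y, recip), which yields
  // 2.0 - y*recip, is one Newton-Raphson refinement; two steps are used
  // here since unsigned inputs span twice the range of the signed case.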
N2 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecpe, dl, MVT::i32),
BN1);
N1 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecps, dl, MVT::i32),
BN1, N2);
N2 = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, N1, N2);
N1 = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v4f32,
DAG.getConstant(Intrinsic::arm_neon_vrecps, dl, MVT::i32),
BN1, N2);
N2 = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, N1, N2);
// Simply multiplying by the reciprocal estimate can leave us a few ulps
// too low, so we add 2 ulps (exhaustive testing shows that this is enough,
// and that it will never cause us to return an answer too large).
// float4 result = as_float4(as_int4(xf*recip) + 2);
N0 = DAG.getNode(ISD::FMUL, dl, MVT::v4f32, N0, N2);
N0 = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, N0);
N1 = DAG.getConstant(2, dl, MVT::v4i32);
N0 = DAG.getNode(ISD::ADD, dl, MVT::v4i32, N0, N1);
N0 = DAG.getNode(ISD::BITCAST, dl, MVT::v4f32, N0);
// Convert back to integer and return.
// return vmovn_u32(vcvt_s32_f32(result));
N0 = DAG.getNode(ISD::FP_TO_SINT, dl, MVT::v4i32, N0);
N0 = DAG.getNode(ISD::TRUNCATE, dl, MVT::v4i16, N0);
return N0;
}
static SDValue LowerADDC_ADDE_SUBC_SUBE(SDValue Op, SelectionDAG &DAG) {
EVT VT = Op.getNode()->getValueType(0);
SDVTList VTs = DAG.getVTList(VT, MVT::i32);
unsigned Opc;
bool ExtraOp = false;
switch (Op.getOpcode()) {
default: llvm_unreachable("Invalid code");
case ISD::ADDC: Opc = ARMISD::ADDC; break;
case ISD::ADDE: Opc = ARMISD::ADDE; ExtraOp = true; break;
case ISD::SUBC: Opc = ARMISD::SUBC; break;
case ISD::SUBE: Opc = ARMISD::SUBE; ExtraOp = true; break;
}
if (!ExtraOp)
return DAG.getNode(Opc, SDLoc(Op), VTs, Op.getOperand(0),
Op.getOperand(1));
return DAG.getNode(Opc, SDLoc(Op), VTs, Op.getOperand(0),
Op.getOperand(1), Op.getOperand(2));
}
SDValue ARMTargetLowering::LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const {
assert(Subtarget->isTargetDarwin());
// For iOS, we want to call an alternative entry point: __sincos_stret,
  // whose return values are passed via sret.
SDLoc dl(Op);
SDValue Arg = Op.getOperand(0);
EVT ArgVT = Arg.getValueType();
Type *ArgTy = ArgVT.getTypeForEVT(*DAG.getContext());
auto PtrVT = getPointerTy(DAG.getDataLayout());
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// Pair of floats / doubles used to pass the result.
Type *RetTy = StructType::get(ArgTy, ArgTy);
auto &DL = DAG.getDataLayout();
ArgListTy Args;
bool ShouldUseSRet = Subtarget->isAPCS_ABI();
SDValue SRet;
if (ShouldUseSRet) {
// Create stack object for sret.
const uint64_t ByteSize = DL.getTypeAllocSize(RetTy);
const unsigned StackAlign = DL.getPrefTypeAlignment(RetTy);
int FrameIdx = MFI.CreateStackObject(ByteSize, StackAlign, false);
SRet = DAG.getFrameIndex(FrameIdx, TLI.getPointerTy(DL));
ArgListEntry Entry;
Entry.Node = SRet;
Entry.Ty = RetTy->getPointerTo();
Entry.IsSExt = false;
Entry.IsZExt = false;
Entry.IsSRet = true;
Args.push_back(Entry);
RetTy = Type::getVoidTy(*DAG.getContext());
}
ArgListEntry Entry;
Entry.Node = Arg;
Entry.Ty = ArgTy;
Entry.IsSExt = false;
Entry.IsZExt = false;
Args.push_back(Entry);
const char *LibcallName =
(ArgVT == MVT::f64) ? "__sincos_stret" : "__sincosf_stret";
RTLIB::Libcall LC =
(ArgVT == MVT::f64) ? RTLIB::SINCOS_F64 : RTLIB::SINCOS_F32;
CallingConv::ID CC = getLibcallCallingConv(LC);
SDValue Callee = DAG.getExternalSymbol(LibcallName, getPointerTy(DL));
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl)
.setChain(DAG.getEntryNode())
.setCallee(CC, RetTy, Callee, std::move(Args))
.setDiscardResult(ShouldUseSRet);
std::pair<SDValue, SDValue> CallResult = LowerCallTo(CLI);
if (!ShouldUseSRet)
return CallResult.first;
SDValue LoadSin =
DAG.getLoad(ArgVT, dl, CallResult.second, SRet, MachinePointerInfo());
// Address of cos field.
SDValue Add = DAG.getNode(ISD::ADD, dl, PtrVT, SRet,
DAG.getIntPtrConstant(ArgVT.getStoreSize(), dl));
SDValue LoadCos =
DAG.getLoad(ArgVT, dl, LoadSin.getValue(1), Add, MachinePointerInfo());
SDVTList Tys = DAG.getVTList(ArgVT, ArgVT);
return DAG.getNode(ISD::MERGE_VALUES, dl, Tys,
LoadSin.getValue(0), LoadCos.getValue(0));
}
SDValue ARMTargetLowering::LowerWindowsDIVLibCall(SDValue Op, SelectionDAG &DAG,
bool Signed,
SDValue &Chain) const {
EVT VT = Op.getValueType();
assert((VT == MVT::i32 || VT == MVT::i64) &&
"unexpected type for custom lowering DIV");
SDLoc dl(Op);
const auto &DL = DAG.getDataLayout();
const auto &TLI = DAG.getTargetLoweringInfo();
const char *Name = nullptr;
if (Signed)
Name = (VT == MVT::i32) ? "__rt_sdiv" : "__rt_sdiv64";
else
Name = (VT == MVT::i32) ? "__rt_udiv" : "__rt_udiv64";
SDValue ES = DAG.getExternalSymbol(Name, TLI.getPointerTy(DL));
ARMTargetLowering::ArgListTy Args;
for (auto AI : {1, 0}) {
ArgListEntry Arg;
Arg.Node = Op.getOperand(AI);
Arg.Ty = Arg.Node.getValueType().getTypeForEVT(*DAG.getContext());
Args.push_back(Arg);
}
CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl)
.setChain(Chain)
.setCallee(CallingConv::ARM_AAPCS_VFP, VT.getTypeForEVT(*DAG.getContext()),
ES, std::move(Args));
return LowerCallTo(CLI).first;
}
SDValue ARMTargetLowering::LowerDIV_Windows(SDValue Op, SelectionDAG &DAG,
bool Signed) const {
assert(Op.getValueType() == MVT::i32 &&
"unexpected type for custom lowering DIV");
SDLoc dl(Op);
SDValue DBZCHK = DAG.getNode(ARMISD::WIN__DBZCHK, dl, MVT::Other,
DAG.getEntryNode(), Op.getOperand(1));
return LowerWindowsDIVLibCall(Op, DAG, Signed, DBZCHK);
}
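// Emit a WIN__DBZCHK over the denominator of N. For an i64 divide, the two
// 32-bit halves are ORed together first, so the check still only has to
// test a single i32 value against zero.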
static SDValue WinDBZCheckDenominator(SelectionDAG &DAG, SDNode *N, SDValue InChain) {
SDLoc DL(N);
SDValue Op = N->getOperand(1);
if (N->getValueType(0) == MVT::i32)
return DAG.getNode(ARMISD::WIN__DBZCHK, DL, MVT::Other, InChain, Op);
SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, Op,
DAG.getConstant(0, DL, MVT::i32));
SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, Op,
DAG.getConstant(1, DL, MVT::i32));
return DAG.getNode(ARMISD::WIN__DBZCHK, DL, MVT::Other, InChain,
DAG.getNode(ISD::OR, DL, MVT::i32, Lo, Hi));
}
void ARMTargetLowering::ExpandDIV_Windows(
SDValue Op, SelectionDAG &DAG, bool Signed,
SmallVectorImpl<SDValue> &Results) const {
const auto &DL = DAG.getDataLayout();
const auto &TLI = DAG.getTargetLoweringInfo();
assert(Op.getValueType() == MVT::i64 &&
"unexpected type for custom lowering DIV");
SDLoc dl(Op);
SDValue DBZCHK = WinDBZCheckDenominator(DAG, Op.getNode(), DAG.getEntryNode());
SDValue Result = LowerWindowsDIVLibCall(Op, DAG, Signed, DBZCHK);
SDValue Lower = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Result);
SDValue Upper = DAG.getNode(ISD::SRL, dl, MVT::i64, Result,
DAG.getConstant(32, dl, TLI.getPointerTy(DL)));
Upper = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Upper);
Results.push_back(Lower);
Results.push_back(Upper);
}
static SDValue LowerAtomicLoadStore(SDValue Op, SelectionDAG &DAG) {
if (isStrongerThanMonotonic(cast<AtomicSDNode>(Op)->getOrdering()))
// Acquire/Release load/store is not legal for targets without a dmb or
// equivalent available.
return SDValue();
// Monotonic load/store is legal for all targets.
return Op;
}
static void ReplaceREADCYCLECOUNTER(SDNode *N,
SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
SDLoc DL(N);
// Under Power Management extensions, the cycle-count is:
// mrc p15, #0, <Rt>, c9, c13, #0
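  // The intrinsic operands spell out the mrc fields: coprocessor p15, opc1
  // #0, CRn c9, CRm c13, opc2 #0, i.e. the PMCCNTR cycle counter. The
  // 32-bit result is widened to i64 with a zero high word below.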
SDValue Ops[] = { N->getOperand(0), // Chain
DAG.getConstant(Intrinsic::arm_mrc, DL, MVT::i32),
DAG.getConstant(15, DL, MVT::i32),
DAG.getConstant(0, DL, MVT::i32),
DAG.getConstant(9, DL, MVT::i32),
DAG.getConstant(13, DL, MVT::i32),
DAG.getConstant(0, DL, MVT::i32)
};
SDValue Cycles32 = DAG.getNode(ISD::INTRINSIC_W_CHAIN, DL,
DAG.getVTList(MVT::i32, MVT::Other), Ops);
Results.push_back(DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Cycles32,
DAG.getConstant(0, DL, MVT::i32)));
Results.push_back(Cycles32.getValue(1));
}
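// Tie the two i32 halves of a 64-bit value into a GPRPair register with a
// REG_SEQUENCE node. On big-endian targets the halves are swapped so that
// gsub_0 receives the most-significant word.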
static SDValue createGPRPairNode(SelectionDAG &DAG, SDValue V) {
SDLoc dl(V.getNode());
SDValue VLo = DAG.getAnyExtOrTrunc(V, dl, MVT::i32);
SDValue VHi = DAG.getAnyExtOrTrunc(
DAG.getNode(ISD::SRL, dl, MVT::i64, V, DAG.getConstant(32, dl, MVT::i32)),
dl, MVT::i32);
bool isBigEndian = DAG.getDataLayout().isBigEndian();
if (isBigEndian)
std::swap (VLo, VHi);
SDValue RegClass =
DAG.getTargetConstant(ARM::GPRPairRegClassID, dl, MVT::i32);
SDValue SubReg0 = DAG.getTargetConstant(ARM::gsub_0, dl, MVT::i32);
SDValue SubReg1 = DAG.getTargetConstant(ARM::gsub_1, dl, MVT::i32);
const SDValue Ops[] = { RegClass, VLo, SubReg0, VHi, SubReg1 };
return SDValue(
DAG.getMachineNode(TargetOpcode::REG_SEQUENCE, dl, MVT::Untyped, Ops), 0);
}
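// Expand a 64-bit ATOMIC_CMP_SWAP into the CMP_SWAP_64 pseudo, which
// operates on GPR pairs; the two halves of the result are then extracted
// in endian order.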
static void ReplaceCMP_SWAP_64Results(SDNode *N,
SmallVectorImpl<SDValue> & Results,
SelectionDAG &DAG) {
assert(N->getValueType(0) == MVT::i64 &&
"AtomicCmpSwap on types less than 64 should be legal");
SDValue Ops[] = {N->getOperand(1),
createGPRPairNode(DAG, N->getOperand(2)),
createGPRPairNode(DAG, N->getOperand(3)),
N->getOperand(0)};
SDNode *CmpSwap = DAG.getMachineNode(
ARM::CMP_SWAP_64, SDLoc(N),
DAG.getVTList(MVT::Untyped, MVT::i32, MVT::Other), Ops);
MachineFunction &MF = DAG.getMachineFunction();
MachineSDNode::mmo_iterator MemOp = MF.allocateMemRefsArray(1);
MemOp[0] = cast<MemSDNode>(N)->getMemOperand();
cast<MachineSDNode>(CmpSwap)->setMemRefs(MemOp, MemOp + 1);
bool isBigEndian = DAG.getDataLayout().isBigEndian();
Results.push_back(
DAG.getTargetExtractSubreg(isBigEndian ? ARM::gsub_1 : ARM::gsub_0,
SDLoc(N), MVT::i32, SDValue(CmpSwap, 0)));
Results.push_back(
DAG.getTargetExtractSubreg(isBigEndian ? ARM::gsub_0 : ARM::gsub_1,
SDLoc(N), MVT::i32, SDValue(CmpSwap, 0)));
Results.push_back(SDValue(CmpSwap, 2));
}
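// Lower FPOWI into a pow/powf libcall, converting the integer exponent to
// floating point first. This custom lowering exists for MSVCRT targets,
// which presumably lack the usual __powi* compiler-rt helpers.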
static SDValue LowerFPOWI(SDValue Op, const ARMSubtarget &Subtarget,
SelectionDAG &DAG) {
const auto &TLI = DAG.getTargetLoweringInfo();
assert(Subtarget.getTargetTriple().isOSMSVCRT() &&
"Custom lowering is MSVCRT specific!");
SDLoc dl(Op);
SDValue Val = Op.getOperand(0);
MVT Ty = Val->getSimpleValueType(0);
SDValue Exponent = DAG.getNode(ISD::SINT_TO_FP, dl, Ty, Op.getOperand(1));
SDValue Callee = DAG.getExternalSymbol(Ty == MVT::f32 ? "powf" : "pow",
TLI.getPointerTy(DAG.getDataLayout()));
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Node = Val;
Entry.Ty = Val.getValueType().getTypeForEVT(*DAG.getContext());
Entry.IsZExt = true;
Args.push_back(Entry);
Entry.Node = Exponent;
Entry.Ty = Exponent.getValueType().getTypeForEVT(*DAG.getContext());
Entry.IsZExt = true;
Args.push_back(Entry);
Type *LCRTy = Val.getValueType().getTypeForEVT(*DAG.getContext());
  // The in-chain to the call is the entry node. If we are emitting a
  // tailcall, the chain will be mutated if the node has a non-entry input
  // chain.
SDValue InChain = DAG.getEntryNode();
SDValue TCChain = InChain;
const auto *F = DAG.getMachineFunction().getFunction();
bool IsTC = TLI.isInTailCallPosition(DAG, Op.getNode(), TCChain) &&
F->getReturnType() == LCRTy;
if (IsTC)
InChain = TCChain;
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl)
.setChain(InChain)
.setCallee(CallingConv::ARM_AAPCS_VFP, LCRTy, Callee, std::move(Args))
.setTailCall(IsTC);
std::pair<SDValue, SDValue> CI = TLI.LowerCallTo(CLI);
// Return the chain (the DAG root) if it is a tail call
return !CI.second.getNode() ? DAG.getRoot() : CI.first;
}
SDValue ARMTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
switch (Op.getOpcode()) {
default: llvm_unreachable("Don't know how to custom lower this!");
case ISD::WRITE_REGISTER: return LowerWRITE_REGISTER(Op, DAG);
case ISD::ConstantPool: return LowerConstantPool(Op, DAG);
case ISD::BlockAddress: return LowerBlockAddress(Op, DAG);
case ISD::GlobalAddress: return LowerGlobalAddress(Op, DAG);
case ISD::GlobalTLSAddress: return LowerGlobalTLSAddress(Op, DAG);
case ISD::SELECT: return LowerSELECT(Op, DAG);
case ISD::SELECT_CC: return LowerSELECT_CC(Op, DAG);
case ISD::BR_CC: return LowerBR_CC(Op, DAG);
case ISD::BR_JT: return LowerBR_JT(Op, DAG);
case ISD::VASTART: return LowerVASTART(Op, DAG);
case ISD::ATOMIC_FENCE: return LowerATOMIC_FENCE(Op, DAG, Subtarget);
case ISD::PREFETCH: return LowerPREFETCH(Op, DAG, Subtarget);
case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP: return LowerINT_TO_FP(Op, DAG);
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: return LowerFP_TO_INT(Op, DAG);
case ISD::FCOPYSIGN: return LowerFCOPYSIGN(Op, DAG);
case ISD::RETURNADDR: return LowerRETURNADDR(Op, DAG);
case ISD::FRAMEADDR: return LowerFRAMEADDR(Op, DAG);
case ISD::EH_SJLJ_SETJMP: return LowerEH_SJLJ_SETJMP(Op, DAG);
case ISD::EH_SJLJ_LONGJMP: return LowerEH_SJLJ_LONGJMP(Op, DAG);
case ISD::EH_SJLJ_SETUP_DISPATCH: return LowerEH_SJLJ_SETUP_DISPATCH(Op, DAG);
case ISD::INTRINSIC_WO_CHAIN: return LowerINTRINSIC_WO_CHAIN(Op, DAG,
Subtarget);
case ISD::BITCAST: return ExpandBITCAST(Op.getNode(), DAG);
case ISD::SHL:
case ISD::SRL:
case ISD::SRA: return LowerShift(Op.getNode(), DAG, Subtarget);
case ISD::SREM: return LowerREM(Op.getNode(), DAG);
case ISD::UREM: return LowerREM(Op.getNode(), DAG);
case ISD::SHL_PARTS: return LowerShiftLeftParts(Op, DAG);
case ISD::SRL_PARTS:
case ISD::SRA_PARTS: return LowerShiftRightParts(Op, DAG);
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF: return LowerCTTZ(Op.getNode(), DAG, Subtarget);
case ISD::CTPOP: return LowerCTPOP(Op.getNode(), DAG, Subtarget);
case ISD::SETCC: return LowerVSETCC(Op, DAG);
case ISD::SETCCE: return LowerSETCCE(Op, DAG);
case ISD::ConstantFP: return LowerConstantFP(Op, DAG, Subtarget);
case ISD::BUILD_VECTOR: return LowerBUILD_VECTOR(Op, DAG, Subtarget);
case ISD::VECTOR_SHUFFLE: return LowerVECTOR_SHUFFLE(Op, DAG);
case ISD::INSERT_VECTOR_ELT: return LowerINSERT_VECTOR_ELT(Op, DAG);
case ISD::EXTRACT_VECTOR_ELT: return LowerEXTRACT_VECTOR_ELT(Op, DAG);
case ISD::CONCAT_VECTORS: return LowerCONCAT_VECTORS(Op, DAG);
case ISD::FLT_ROUNDS_: return LowerFLT_ROUNDS_(Op, DAG);
case ISD::MUL: return LowerMUL(Op, DAG);
case ISD::SDIV:
if (Subtarget->isTargetWindows() && !Op.getValueType().isVector())
return LowerDIV_Windows(Op, DAG, /* Signed */ true);
return LowerSDIV(Op, DAG);
case ISD::UDIV:
if (Subtarget->isTargetWindows() && !Op.getValueType().isVector())
return LowerDIV_Windows(Op, DAG, /* Signed */ false);
return LowerUDIV(Op, DAG);
case ISD::ADDC:
case ISD::ADDE:
case ISD::SUBC:
case ISD::SUBE: return LowerADDC_ADDE_SUBC_SUBE(Op, DAG);
case ISD::SADDO:
case ISD::UADDO:
case ISD::SSUBO:
case ISD::USUBO:
return LowerXALUO(Op, DAG);
case ISD::ATOMIC_LOAD:
case ISD::ATOMIC_STORE: return LowerAtomicLoadStore(Op, DAG);
case ISD::FSINCOS: return LowerFSINCOS(Op, DAG);
case ISD::SDIVREM:
case ISD::UDIVREM: return LowerDivRem(Op, DAG);
case ISD::DYNAMIC_STACKALLOC:
if (Subtarget->getTargetTriple().isWindowsItaniumEnvironment())
return LowerDYNAMIC_STACKALLOC(Op, DAG);
llvm_unreachable("Don't know how to custom lower this!");
case ISD::FP_ROUND: return LowerFP_ROUND(Op, DAG);
case ISD::FP_EXTEND: return LowerFP_EXTEND(Op, DAG);
case ISD::FPOWI: return LowerFPOWI(Op, *Subtarget, DAG);
case ARMISD::WIN__DBZCHK: return SDValue();
}
}
static void ReplaceLongIntrinsic(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) {
unsigned IntNo = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
unsigned Opc = 0;
if (IntNo == Intrinsic::arm_smlald)
Opc = ARMISD::SMLALD;
else if (IntNo == Intrinsic::arm_smlaldx)
Opc = ARMISD::SMLALDX;
else if (IntNo == Intrinsic::arm_smlsld)
Opc = ARMISD::SMLSLD;
else if (IntNo == Intrinsic::arm_smlsldx)
Opc = ARMISD::SMLSLDX;
else
return;
SDLoc dl(N);
SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32,
N->getOperand(3),
DAG.getConstant(0, dl, MVT::i32));
SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32,
N->getOperand(3),
DAG.getConstant(1, dl, MVT::i32));
SDValue LongMul = DAG.getNode(Opc, dl,
DAG.getVTList(MVT::i32, MVT::i32),
N->getOperand(1), N->getOperand(2),
Lo, Hi);
Results.push_back(LongMul.getValue(0));
Results.push_back(LongMul.getValue(1));
}
/// ReplaceNodeResults - Replace the results of a node with an illegal result
/// type with new values built out of custom code.
void ARMTargetLowering::ReplaceNodeResults(SDNode *N,
SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const {
SDValue Res;
switch (N->getOpcode()) {
default:
llvm_unreachable("Don't know how to custom expand this!");
case ISD::READ_REGISTER:
ExpandREAD_REGISTER(N, Results, DAG);
break;
case ISD::BITCAST:
Res = ExpandBITCAST(N, DAG);
break;
case ISD::SRL:
case ISD::SRA:
Res = Expand64BitShift(N, DAG, Subtarget);
break;
case ISD::SREM:
case ISD::UREM:
Res = LowerREM(N, DAG);
break;
case ISD::SDIVREM:
case ISD::UDIVREM:
Res = LowerDivRem(SDValue(N, 0), DAG);
assert(Res.getNumOperands() == 2 && "DivRem needs two values");
Results.push_back(Res.getValue(0));
Results.push_back(Res.getValue(1));
return;
case ISD::READCYCLECOUNTER:
ReplaceREADCYCLECOUNTER(N, Results, DAG, Subtarget);
return;
case ISD::UDIV:
case ISD::SDIV:
assert(Subtarget->isTargetWindows() && "can only expand DIV on Windows");
return ExpandDIV_Windows(SDValue(N, 0), DAG, N->getOpcode() == ISD::SDIV,
Results);
case ISD::ATOMIC_CMP_SWAP:
ReplaceCMP_SWAP_64Results(N, Results, DAG);
return;
case ISD::INTRINSIC_WO_CHAIN:
return ReplaceLongIntrinsic(N, Results, DAG);
}
if (Res.getNode())
Results.push_back(Res);
}
//===----------------------------------------------------------------------===//
// ARM Scheduler Hooks
//===----------------------------------------------------------------------===//
/// SetupEntryBlockForSjLj - Insert code into the entry block that creates and
/// registers the function context.
void ARMTargetLowering::SetupEntryBlockForSjLj(MachineInstr &MI,
MachineBasicBlock *MBB,
MachineBasicBlock *DispatchBB,
int FI) const {
assert(!Subtarget->isROPI() && !Subtarget->isRWPI() &&
"ROPI/RWPI not currently supported with SjLj");
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc dl = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
MachineRegisterInfo *MRI = &MF->getRegInfo();
MachineConstantPool *MCP = MF->getConstantPool();
ARMFunctionInfo *AFI = MF->getInfo<ARMFunctionInfo>();
const Function *F = MF->getFunction();
bool isThumb = Subtarget->isThumb();
bool isThumb2 = Subtarget->isThumb2();
unsigned PCLabelId = AFI->createPICLabelUId();
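  // The PC-relative adjustment accounts for the pipeline: a read of PC
  // yields the current instruction address plus 4 in Thumb state and plus 8
  // in ARM state.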
unsigned PCAdj = (isThumb || isThumb2) ? 4 : 8;
ARMConstantPoolValue *CPV =
ARMConstantPoolMBB::Create(F->getContext(), DispatchBB, PCLabelId, PCAdj);
unsigned CPI = MCP->getConstantPoolIndex(CPV, 4);
const TargetRegisterClass *TRC = isThumb ? &ARM::tGPRRegClass
: &ARM::GPRRegClass;
// Grab constant pool and fixed stack memory operands.
MachineMemOperand *CPMMO =
MF->getMachineMemOperand(MachinePointerInfo::getConstantPool(*MF),
MachineMemOperand::MOLoad, 4, 4);
MachineMemOperand *FIMMOSt =
MF->getMachineMemOperand(MachinePointerInfo::getFixedStack(*MF, FI),
MachineMemOperand::MOStore, 4, 4);
// Load the address of the dispatch MBB into the jump buffer.
if (isThumb2) {
// Incoming value: jbuf
// ldr.n r5, LCPI1_1
// orr r5, r5, #1
// add r5, pc
// str r5, [$jbuf, #+4] ; &jbuf[1]
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::t2LDRpci), NewVReg1)
.addConstantPoolIndex(CPI)
.addMemOperand(CPMMO)
.add(predOps(ARMCC::AL));
// Set the low bit because of thumb mode.
unsigned NewVReg2 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::t2ORRri), NewVReg2)
.addReg(NewVReg1, RegState::Kill)
.addImm(0x01)
.add(predOps(ARMCC::AL))
.add(condCodeOp());
unsigned NewVReg3 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tPICADD), NewVReg3)
.addReg(NewVReg2, RegState::Kill)
.addImm(PCLabelId);
BuildMI(*MBB, MI, dl, TII->get(ARM::t2STRi12))
.addReg(NewVReg3, RegState::Kill)
.addFrameIndex(FI)
.addImm(36) // &jbuf[1] :: pc
.addMemOperand(FIMMOSt)
.add(predOps(ARMCC::AL));
} else if (isThumb) {
// Incoming value: jbuf
// ldr.n r1, LCPI1_4
// add r1, pc
// mov r2, #1
// orrs r1, r2
// add r2, $jbuf, #+4 ; &jbuf[1]
// str r1, [r2]
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tLDRpci), NewVReg1)
.addConstantPoolIndex(CPI)
.addMemOperand(CPMMO)
.add(predOps(ARMCC::AL));
unsigned NewVReg2 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tPICADD), NewVReg2)
.addReg(NewVReg1, RegState::Kill)
.addImm(PCLabelId);
// Set the low bit because of thumb mode.
unsigned NewVReg3 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tMOVi8), NewVReg3)
.addReg(ARM::CPSR, RegState::Define)
.addImm(1)
.add(predOps(ARMCC::AL));
unsigned NewVReg4 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tORR), NewVReg4)
.addReg(ARM::CPSR, RegState::Define)
.addReg(NewVReg2, RegState::Kill)
.addReg(NewVReg3, RegState::Kill)
.add(predOps(ARMCC::AL));
unsigned NewVReg5 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::tADDframe), NewVReg5)
.addFrameIndex(FI)
.addImm(36); // &jbuf[1] :: pc
BuildMI(*MBB, MI, dl, TII->get(ARM::tSTRi))
.addReg(NewVReg4, RegState::Kill)
.addReg(NewVReg5, RegState::Kill)
.addImm(0)
.addMemOperand(FIMMOSt)
.add(predOps(ARMCC::AL));
} else {
// Incoming value: jbuf
// ldr r1, LCPI1_1
// add r1, pc, r1
// str r1, [$jbuf, #+4] ; &jbuf[1]
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::LDRi12), NewVReg1)
.addConstantPoolIndex(CPI)
.addImm(0)
.addMemOperand(CPMMO)
.add(predOps(ARMCC::AL));
unsigned NewVReg2 = MRI->createVirtualRegister(TRC);
BuildMI(*MBB, MI, dl, TII->get(ARM::PICADD), NewVReg2)
.addReg(NewVReg1, RegState::Kill)
.addImm(PCLabelId)
.add(predOps(ARMCC::AL));
BuildMI(*MBB, MI, dl, TII->get(ARM::STRi12))
.addReg(NewVReg2, RegState::Kill)
.addFrameIndex(FI)
.addImm(36) // &jbuf[1] :: pc
.addMemOperand(FIMMOSt)
.add(predOps(ARMCC::AL));
}
}
void ARMTargetLowering::EmitSjLjDispatchBlock(MachineInstr &MI,
MachineBasicBlock *MBB) const {
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc dl = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
MachineRegisterInfo *MRI = &MF->getRegInfo();
MachineFrameInfo &MFI = MF->getFrameInfo();
int FI = MFI.getFunctionContextIndex();
const TargetRegisterClass *TRC = Subtarget->isThumb() ? &ARM::tGPRRegClass
: &ARM::GPRnopcRegClass;
// Get a mapping of the call site numbers to all of the landing pads they're
// associated with.
DenseMap<unsigned, SmallVector<MachineBasicBlock*, 2>> CallSiteNumToLPad;
unsigned MaxCSNum = 0;
for (MachineFunction::iterator BB = MF->begin(), E = MF->end(); BB != E;
++BB) {
if (!BB->isEHPad()) continue;
// FIXME: We should assert that the EH_LABEL is the first MI in the landing
// pad.
for (MachineBasicBlock::iterator
II = BB->begin(), IE = BB->end(); II != IE; ++II) {
if (!II->isEHLabel()) continue;
MCSymbol *Sym = II->getOperand(0).getMCSymbol();
if (!MF->hasCallSiteLandingPad(Sym)) continue;
SmallVectorImpl<unsigned> &CallSiteIdxs = MF->getCallSiteLandingPad(Sym);
for (SmallVectorImpl<unsigned>::iterator
CSI = CallSiteIdxs.begin(), CSE = CallSiteIdxs.end();
CSI != CSE; ++CSI) {
CallSiteNumToLPad[*CSI].push_back(&*BB);
MaxCSNum = std::max(MaxCSNum, *CSI);
}
break;
}
}
// Get an ordered list of the machine basic blocks for the jump table.
std::vector<MachineBasicBlock*> LPadList;
SmallPtrSet<MachineBasicBlock*, 32> InvokeBBs;
LPadList.reserve(CallSiteNumToLPad.size());
for (unsigned I = 1; I <= MaxCSNum; ++I) {
SmallVectorImpl<MachineBasicBlock*> &MBBList = CallSiteNumToLPad[I];
for (SmallVectorImpl<MachineBasicBlock*>::iterator
II = MBBList.begin(), IE = MBBList.end(); II != IE; ++II) {
LPadList.push_back(*II);
InvokeBBs.insert((*II)->pred_begin(), (*II)->pred_end());
}
}
assert(!LPadList.empty() &&
"No landing pad destinations for the dispatch jump table!");
// Create the jump table and associated information.
MachineJumpTableInfo *JTI =
MF->getOrCreateJumpTableInfo(MachineJumpTableInfo::EK_Inline);
unsigned MJTI = JTI->createJumpTableIndex(LPadList);
// Create the MBBs for the dispatch code.
// Shove the dispatch's address into the return slot in the function context.
MachineBasicBlock *DispatchBB = MF->CreateMachineBasicBlock();
DispatchBB->setIsEHPad();
MachineBasicBlock *TrapBB = MF->CreateMachineBasicBlock();
unsigned trap_opcode;
if (Subtarget->isThumb())
trap_opcode = ARM::tTRAP;
else
trap_opcode = Subtarget->useNaClTrap() ? ARM::TRAPNaCl : ARM::TRAP;
BuildMI(TrapBB, dl, TII->get(trap_opcode));
DispatchBB->addSuccessor(TrapBB);
MachineBasicBlock *DispContBB = MF->CreateMachineBasicBlock();
DispatchBB->addSuccessor(DispContBB);
  // Insert the MBBs.
MF->insert(MF->end(), DispatchBB);
MF->insert(MF->end(), DispContBB);
MF->insert(MF->end(), TrapBB);
// Insert code into the entry block that creates and registers the function
// context.
SetupEntryBlockForSjLj(MI, MBB, DispatchBB, FI);
MachineMemOperand *FIMMOLd = MF->getMachineMemOperand(
MachinePointerInfo::getFixedStack(*MF, FI),
MachineMemOperand::MOLoad | MachineMemOperand::MOVolatile, 4, 4);
MachineInstrBuilder MIB;
MIB = BuildMI(DispatchBB, dl, TII->get(ARM::Int_eh_sjlj_dispatchsetup));
const ARMBaseInstrInfo *AII = static_cast<const ARMBaseInstrInfo*>(TII);
const ARMBaseRegisterInfo &RI = AII->getRegisterInfo();
// Add a register mask with no preserved registers. This results in all
// registers being marked as clobbered. This can't work if the dispatch block
// is in a Thumb1 function and is linked with ARM code which uses the FP
// registers, as there is no way to preserve the FP registers in Thumb1 mode.
MIB.addRegMask(RI.getSjLjDispatchPreservedMask(*MF));
bool IsPositionIndependent = isPositionIndependent();
unsigned NumLPads = LPadList.size();
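  // Each path below reloads the call-site index (which the runtime stores at
  // offset 4 in the function context) and bounds-checks it against the
  // number of landing pads before indexing the jump table.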
if (Subtarget->isThumb2()) {
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::t2LDRi12), NewVReg1)
.addFrameIndex(FI)
.addImm(4)
.addMemOperand(FIMMOLd)
.add(predOps(ARMCC::AL));
if (NumLPads < 256) {
BuildMI(DispatchBB, dl, TII->get(ARM::t2CMPri))
.addReg(NewVReg1)
.addImm(LPadList.size())
.add(predOps(ARMCC::AL));
} else {
unsigned VReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::t2MOVi16), VReg1)
.addImm(NumLPads & 0xFFFF)
.add(predOps(ARMCC::AL));
unsigned VReg2 = VReg1;
if ((NumLPads & 0xFFFF0000) != 0) {
VReg2 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::t2MOVTi16), VReg2)
.addReg(VReg1)
.addImm(NumLPads >> 16)
.add(predOps(ARMCC::AL));
}
BuildMI(DispatchBB, dl, TII->get(ARM::t2CMPrr))
.addReg(NewVReg1)
.addReg(VReg2)
.add(predOps(ARMCC::AL));
}
BuildMI(DispatchBB, dl, TII->get(ARM::t2Bcc))
.addMBB(TrapBB)
.addImm(ARMCC::HI)
.addReg(ARM::CPSR);
unsigned NewVReg3 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::t2LEApcrelJT), NewVReg3)
.addJumpTableIndex(MJTI)
.add(predOps(ARMCC::AL));
unsigned NewVReg4 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::t2ADDrs), NewVReg4)
.addReg(NewVReg3, RegState::Kill)
.addReg(NewVReg1)
.addImm(ARM_AM::getSORegOpc(ARM_AM::lsl, 2))
.add(predOps(ARMCC::AL))
.add(condCodeOp());
BuildMI(DispContBB, dl, TII->get(ARM::t2BR_JT))
.addReg(NewVReg4, RegState::Kill)
.addReg(NewVReg1)
.addJumpTableIndex(MJTI);
} else if (Subtarget->isThumb()) {
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::tLDRspi), NewVReg1)
.addFrameIndex(FI)
.addImm(1)
.addMemOperand(FIMMOLd)
.add(predOps(ARMCC::AL));
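    // tCMPi8 only takes an 8-bit immediate, so larger landing-pad counts
    // must be compared against a value loaded from the constant pool.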
if (NumLPads < 256) {
BuildMI(DispatchBB, dl, TII->get(ARM::tCMPi8))
.addReg(NewVReg1)
.addImm(NumLPads)
.add(predOps(ARMCC::AL));
} else {
MachineConstantPool *ConstantPool = MF->getConstantPool();
Type *Int32Ty = Type::getInt32Ty(MF->getFunction()->getContext());
const Constant *C = ConstantInt::get(Int32Ty, NumLPads);
// MachineConstantPool wants an explicit alignment.
unsigned Align = MF->getDataLayout().getPrefTypeAlignment(Int32Ty);
if (Align == 0)
Align = MF->getDataLayout().getTypeAllocSize(C->getType());
unsigned Idx = ConstantPool->getConstantPoolIndex(C, Align);
unsigned VReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::tLDRpci))
.addReg(VReg1, RegState::Define)
.addConstantPoolIndex(Idx)
.add(predOps(ARMCC::AL));
BuildMI(DispatchBB, dl, TII->get(ARM::tCMPr))
.addReg(NewVReg1)
.addReg(VReg1)
.add(predOps(ARMCC::AL));
}
BuildMI(DispatchBB, dl, TII->get(ARM::tBcc))
.addMBB(TrapBB)
.addImm(ARMCC::HI)
.addReg(ARM::CPSR);
unsigned NewVReg2 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::tLSLri), NewVReg2)
.addReg(ARM::CPSR, RegState::Define)
.addReg(NewVReg1)
.addImm(2)
.add(predOps(ARMCC::AL));
unsigned NewVReg3 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::tLEApcrelJT), NewVReg3)
.addJumpTableIndex(MJTI)
.add(predOps(ARMCC::AL));
unsigned NewVReg4 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::tADDrr), NewVReg4)
.addReg(ARM::CPSR, RegState::Define)
.addReg(NewVReg2, RegState::Kill)
.addReg(NewVReg3)
.add(predOps(ARMCC::AL));
MachineMemOperand *JTMMOLd = MF->getMachineMemOperand(
MachinePointerInfo::getJumpTable(*MF), MachineMemOperand::MOLoad, 4, 4);
unsigned NewVReg5 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::tLDRi), NewVReg5)
.addReg(NewVReg4, RegState::Kill)
.addImm(0)
.addMemOperand(JTMMOLd)
.add(predOps(ARMCC::AL));
unsigned NewVReg6 = NewVReg5;
if (IsPositionIndependent) {
NewVReg6 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::tADDrr), NewVReg6)
.addReg(ARM::CPSR, RegState::Define)
.addReg(NewVReg5, RegState::Kill)
.addReg(NewVReg3)
.add(predOps(ARMCC::AL));
}
BuildMI(DispContBB, dl, TII->get(ARM::tBR_JTr))
.addReg(NewVReg6, RegState::Kill)
.addJumpTableIndex(MJTI);
} else {
unsigned NewVReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::LDRi12), NewVReg1)
.addFrameIndex(FI)
.addImm(4)
.addMemOperand(FIMMOLd)
.add(predOps(ARMCC::AL));
if (NumLPads < 256) {
BuildMI(DispatchBB, dl, TII->get(ARM::CMPri))
.addReg(NewVReg1)
.addImm(NumLPads)
.add(predOps(ARMCC::AL));
} else if (Subtarget->hasV6T2Ops() && isUInt<16>(NumLPads)) {
unsigned VReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::MOVi16), VReg1)
.addImm(NumLPads & 0xFFFF)
.add(predOps(ARMCC::AL));
unsigned VReg2 = VReg1;
if ((NumLPads & 0xFFFF0000) != 0) {
VReg2 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::MOVTi16), VReg2)
.addReg(VReg1)
.addImm(NumLPads >> 16)
.add(predOps(ARMCC::AL));
}
BuildMI(DispatchBB, dl, TII->get(ARM::CMPrr))
.addReg(NewVReg1)
.addReg(VReg2)
.add(predOps(ARMCC::AL));
} else {
MachineConstantPool *ConstantPool = MF->getConstantPool();
Type *Int32Ty = Type::getInt32Ty(MF->getFunction()->getContext());
const Constant *C = ConstantInt::get(Int32Ty, NumLPads);
// MachineConstantPool wants an explicit alignment.
unsigned Align = MF->getDataLayout().getPrefTypeAlignment(Int32Ty);
if (Align == 0)
Align = MF->getDataLayout().getTypeAllocSize(C->getType());
unsigned Idx = ConstantPool->getConstantPoolIndex(C, Align);
unsigned VReg1 = MRI->createVirtualRegister(TRC);
BuildMI(DispatchBB, dl, TII->get(ARM::LDRcp))
.addReg(VReg1, RegState::Define)
.addConstantPoolIndex(Idx)
.addImm(0)
.add(predOps(ARMCC::AL));
BuildMI(DispatchBB, dl, TII->get(ARM::CMPrr))
.addReg(NewVReg1)
.addReg(VReg1, RegState::Kill)
.add(predOps(ARMCC::AL));
}
BuildMI(DispatchBB, dl, TII->get(ARM::Bcc))
.addMBB(TrapBB)
.addImm(ARMCC::HI)
.addReg(ARM::CPSR);
unsigned NewVReg3 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::MOVsi), NewVReg3)
.addReg(NewVReg1)
.addImm(ARM_AM::getSORegOpc(ARM_AM::lsl, 2))
.add(predOps(ARMCC::AL))
.add(condCodeOp());
unsigned NewVReg4 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::LEApcrelJT), NewVReg4)
.addJumpTableIndex(MJTI)
.add(predOps(ARMCC::AL));
MachineMemOperand *JTMMOLd = MF->getMachineMemOperand(
MachinePointerInfo::getJumpTable(*MF), MachineMemOperand::MOLoad, 4, 4);
unsigned NewVReg5 = MRI->createVirtualRegister(TRC);
BuildMI(DispContBB, dl, TII->get(ARM::LDRrs), NewVReg5)
.addReg(NewVReg3, RegState::Kill)
.addReg(NewVReg4)
.addImm(0)
.addMemOperand(JTMMOLd)
.add(predOps(ARMCC::AL));
if (IsPositionIndependent) {
BuildMI(DispContBB, dl, TII->get(ARM::BR_JTadd))
.addReg(NewVReg5, RegState::Kill)
.addReg(NewVReg4)
.addJumpTableIndex(MJTI);
} else {
BuildMI(DispContBB, dl, TII->get(ARM::BR_JTr))
.addReg(NewVReg5, RegState::Kill)
.addJumpTableIndex(MJTI);
}
}
// Add the jump table entries as successors to the MBB.
SmallPtrSet<MachineBasicBlock*, 8> SeenMBBs;
for (std::vector<MachineBasicBlock*>::iterator
I = LPadList.begin(), E = LPadList.end(); I != E; ++I) {
MachineBasicBlock *CurMBB = *I;
if (SeenMBBs.insert(CurMBB).second)
DispContBB->addSuccessor(CurMBB);
}
// N.B. the order the invoke BBs are processed in doesn't matter here.
const MCPhysReg *SavedRegs = RI.getCalleeSavedRegs(MF);
SmallVector<MachineBasicBlock*, 64> MBBLPads;
for (MachineBasicBlock *BB : InvokeBBs) {
// Remove the landing pad successor from the invoke block and replace it
// with the new dispatch block.
SmallVector<MachineBasicBlock*, 4> Successors(BB->succ_begin(),
BB->succ_end());
while (!Successors.empty()) {
MachineBasicBlock *SMBB = Successors.pop_back_val();
if (SMBB->isEHPad()) {
BB->removeSuccessor(SMBB);
MBBLPads.push_back(SMBB);
}
}
BB->addSuccessor(DispatchBB, BranchProbability::getZero());
BB->normalizeSuccProbs();
// Find the invoke call and mark all of the callee-saved registers as
// 'implicit defined' so that they're spilled. This prevents code from
// moving instructions to before the EH block, where they will never be
// executed.
for (MachineBasicBlock::reverse_iterator
II = BB->rbegin(), IE = BB->rend(); II != IE; ++II) {
if (!II->isCall()) continue;
DenseMap<unsigned, bool> DefRegs;
for (MachineInstr::mop_iterator
OI = II->operands_begin(), OE = II->operands_end();
OI != OE; ++OI) {
if (!OI->isReg()) continue;
DefRegs[OI->getReg()] = true;
}
MachineInstrBuilder MIB(*MF, &*II);
for (unsigned i = 0; SavedRegs[i] != 0; ++i) {
unsigned Reg = SavedRegs[i];
if (Subtarget->isThumb2() &&
!ARM::tGPRRegClass.contains(Reg) &&
!ARM::hGPRRegClass.contains(Reg))
continue;
if (Subtarget->isThumb1Only() && !ARM::tGPRRegClass.contains(Reg))
continue;
if (!Subtarget->isThumb() && !ARM::GPRRegClass.contains(Reg))
continue;
if (!DefRegs[Reg])
MIB.addReg(Reg, RegState::ImplicitDefine | RegState::Dead);
}
break;
}
}
// Mark all former landing pads as non-landing pads. The dispatch is the only
// landing pad now.
for (SmallVectorImpl<MachineBasicBlock*>::iterator
I = MBBLPads.begin(), E = MBBLPads.end(); I != E; ++I)
(*I)->setIsEHPad(false);
// The instruction is gone now.
MI.eraseFromParent();
}
static
MachineBasicBlock *OtherSucc(MachineBasicBlock *MBB, MachineBasicBlock *Succ) {
for (MachineBasicBlock::succ_iterator I = MBB->succ_begin(),
E = MBB->succ_end(); I != E; ++I)
if (*I != Succ)
return *I;
llvm_unreachable("Expecting a BB with two successors!");
}
/// Return the load opcode for a given load size. If load size >= 8,
/// a NEON opcode will be returned.
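/// For example, getLdOpcode(4, false, false) yields ARM::LDR_POST_IMM,
/// while a 16-byte request selects the NEON ARM::VLD1q32wb_fixed
/// writeback load; unsupported sizes return 0.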
static unsigned getLdOpcode(unsigned LdSize, bool IsThumb1, bool IsThumb2) {
if (LdSize >= 8)
return LdSize == 16 ? ARM::VLD1q32wb_fixed
: LdSize == 8 ? ARM::VLD1d32wb_fixed : 0;
if (IsThumb1)
return LdSize == 4 ? ARM::tLDRi
: LdSize == 2 ? ARM::tLDRHi
: LdSize == 1 ? ARM::tLDRBi : 0;
if (IsThumb2)
return LdSize == 4 ? ARM::t2LDR_POST
: LdSize == 2 ? ARM::t2LDRH_POST
: LdSize == 1 ? ARM::t2LDRB_POST : 0;
return LdSize == 4 ? ARM::LDR_POST_IMM
: LdSize == 2 ? ARM::LDRH_POST
: LdSize == 1 ? ARM::LDRB_POST_IMM : 0;
}
/// Return the store opcode for a given store size. If store size >= 8,
/// a NEON opcode will be returned.
static unsigned getStOpcode(unsigned StSize, bool IsThumb1, bool IsThumb2) {
if (StSize >= 8)
return StSize == 16 ? ARM::VST1q32wb_fixed
: StSize == 8 ? ARM::VST1d32wb_fixed : 0;
if (IsThumb1)
return StSize == 4 ? ARM::tSTRi
: StSize == 2 ? ARM::tSTRHi
: StSize == 1 ? ARM::tSTRBi : 0;
if (IsThumb2)
return StSize == 4 ? ARM::t2STR_POST
: StSize == 2 ? ARM::t2STRH_POST
: StSize == 1 ? ARM::t2STRB_POST : 0;
return StSize == 4 ? ARM::STR_POST_IMM
: StSize == 2 ? ARM::STRH_POST
: StSize == 1 ? ARM::STRB_POST_IMM : 0;
}
/// Emit a post-increment load operation with given size. The instructions
/// will be added to BB at Pos.
static void emitPostLd(MachineBasicBlock *BB, MachineBasicBlock::iterator Pos,
const TargetInstrInfo *TII, const DebugLoc &dl,
unsigned LdSize, unsigned Data, unsigned AddrIn,
unsigned AddrOut, bool IsThumb1, bool IsThumb2) {
unsigned LdOpc = getLdOpcode(LdSize, IsThumb1, IsThumb2);
assert(LdOpc != 0 && "Should have a load opcode");
if (LdSize >= 8) {
BuildMI(*BB, Pos, dl, TII->get(LdOpc), Data)
.addReg(AddrOut, RegState::Define)
.addReg(AddrIn)
.addImm(0)
.add(predOps(ARMCC::AL));
} else if (IsThumb1) {
// load + update AddrIn
BuildMI(*BB, Pos, dl, TII->get(LdOpc), Data)
.addReg(AddrIn)
.addImm(0)
.add(predOps(ARMCC::AL));
BuildMI(*BB, Pos, dl, TII->get(ARM::tADDi8), AddrOut)
.add(t1CondCodeOp())
.addReg(AddrIn)
.addImm(LdSize)
.add(predOps(ARMCC::AL));
} else if (IsThumb2) {
BuildMI(*BB, Pos, dl, TII->get(LdOpc), Data)
.addReg(AddrOut, RegState::Define)
.addReg(AddrIn)
.addImm(LdSize)
.add(predOps(ARMCC::AL));
} else { // arm
BuildMI(*BB, Pos, dl, TII->get(LdOpc), Data)
.addReg(AddrOut, RegState::Define)
.addReg(AddrIn)
.addReg(0)
.addImm(LdSize)
.add(predOps(ARMCC::AL));
}
}
/// Emit a post-increment store operation with given size. The instructions
/// will be added to BB at Pos.
static void emitPostSt(MachineBasicBlock *BB, MachineBasicBlock::iterator Pos,
const TargetInstrInfo *TII, const DebugLoc &dl,
unsigned StSize, unsigned Data, unsigned AddrIn,
unsigned AddrOut, bool IsThumb1, bool IsThumb2) {
unsigned StOpc = getStOpcode(StSize, IsThumb1, IsThumb2);
assert(StOpc != 0 && "Should have a store opcode");
if (StSize >= 8) {
BuildMI(*BB, Pos, dl, TII->get(StOpc), AddrOut)
.addReg(AddrIn)
.addImm(0)
.addReg(Data)
.add(predOps(ARMCC::AL));
} else if (IsThumb1) {
// store + update AddrIn
BuildMI(*BB, Pos, dl, TII->get(StOpc))
.addReg(Data)
.addReg(AddrIn)
.addImm(0)
.add(predOps(ARMCC::AL));
BuildMI(*BB, Pos, dl, TII->get(ARM::tADDi8), AddrOut)
.add(t1CondCodeOp())
.addReg(AddrIn)
.addImm(StSize)
.add(predOps(ARMCC::AL));
} else if (IsThumb2) {
BuildMI(*BB, Pos, dl, TII->get(StOpc), AddrOut)
.addReg(Data)
.addReg(AddrIn)
.addImm(StSize)
.add(predOps(ARMCC::AL));
} else { // arm
BuildMI(*BB, Pos, dl, TII->get(StOpc), AddrOut)
.addReg(Data)
.addReg(AddrIn)
.addReg(0)
.addImm(StSize)
.add(predOps(ARMCC::AL));
}
}
MachineBasicBlock *
ARMTargetLowering::EmitStructByval(MachineInstr &MI,
MachineBasicBlock *BB) const {
// This pseudo instruction has 4 operands: dst, src, size, alignment.
// We expand it to a loop if size > Subtarget->getMaxInlineSizeThreshold().
// Otherwise, we will generate unrolled scalar copies.
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
const BasicBlock *LLVM_BB = BB->getBasicBlock();
MachineFunction::iterator It = ++BB->getIterator();
unsigned dest = MI.getOperand(0).getReg();
unsigned src = MI.getOperand(1).getReg();
unsigned SizeVal = MI.getOperand(2).getImm();
unsigned Align = MI.getOperand(3).getImm();
DebugLoc dl = MI.getDebugLoc();
MachineFunction *MF = BB->getParent();
MachineRegisterInfo &MRI = MF->getRegInfo();
unsigned UnitSize = 0;
const TargetRegisterClass *TRC = nullptr;
const TargetRegisterClass *VecTRC = nullptr;
bool IsThumb1 = Subtarget->isThumb1Only();
bool IsThumb2 = Subtarget->isThumb2();
bool IsThumb = Subtarget->isThumb();
if (Align & 1) {
UnitSize = 1;
} else if (Align & 2) {
UnitSize = 2;
} else {
// Check whether we can use NEON instructions.
if (!MF->getFunction()->hasFnAttribute(Attribute::NoImplicitFloat) &&
Subtarget->hasNEON()) {
if ((Align % 16 == 0) && SizeVal >= 16)
UnitSize = 16;
else if ((Align % 8 == 0) && SizeVal >= 8)
UnitSize = 8;
}
// Can't use NEON instructions.
if (UnitSize == 0)
UnitSize = 4;
}
// Select the correct opcode and register class for unit size load/store
bool IsNeon = UnitSize >= 8;
TRC = IsThumb ? &ARM::tGPRRegClass : &ARM::GPRRegClass;
if (IsNeon)
VecTRC = UnitSize == 16 ? &ARM::DPairRegClass
: UnitSize == 8 ? &ARM::DPRRegClass
: nullptr;
unsigned BytesLeft = SizeVal % UnitSize;
unsigned LoopSize = SizeVal - BytesLeft;
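// For example, copying 18 bytes with UnitSize == 4 gives LoopSize == 16
// (four word-sized copies) and BytesLeft == 2 (two trailing byte copies).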
if (SizeVal <= Subtarget->getMaxInlineSizeThreshold()) {
// Use LDR and STR to copy.
// [scratch, srcOut] = LDR_POST(srcIn, UnitSize)
// [destOut] = STR_POST(scratch, destIn, UnitSize)
unsigned srcIn = src;
unsigned destIn = dest;
for (unsigned i = 0; i < LoopSize; i+=UnitSize) {
unsigned srcOut = MRI.createVirtualRegister(TRC);
unsigned destOut = MRI.createVirtualRegister(TRC);
unsigned scratch = MRI.createVirtualRegister(IsNeon ? VecTRC : TRC);
emitPostLd(BB, MI, TII, dl, UnitSize, scratch, srcIn, srcOut,
IsThumb1, IsThumb2);
emitPostSt(BB, MI, TII, dl, UnitSize, scratch, destIn, destOut,
IsThumb1, IsThumb2);
srcIn = srcOut;
destIn = destOut;
}
// Handle the leftover bytes with LDRB and STRB.
// [scratch, srcOut] = LDRB_POST(srcIn, 1)
// [destOut] = STRB_POST(scratch, destIn, 1)
for (unsigned i = 0; i < BytesLeft; i++) {
unsigned srcOut = MRI.createVirtualRegister(TRC);
unsigned destOut = MRI.createVirtualRegister(TRC);
unsigned scratch = MRI.createVirtualRegister(TRC);
emitPostLd(BB, MI, TII, dl, 1, scratch, srcIn, srcOut,
IsThumb1, IsThumb2);
emitPostSt(BB, MI, TII, dl, 1, scratch, destIn, destOut,
IsThumb1, IsThumb2);
srcIn = srcOut;
destIn = destOut;
}
MI.eraseFromParent(); // The instruction is gone now.
return BB;
}
// Expand the pseudo op to a loop.
// thisMBB:
// ...
// movw varEnd, # --> with movt support
// movt varEnd, #
// ldrcp varEnd, idx --> without movt support
// fallthrough --> loopMBB
// loopMBB:
// PHI varPhi, varEnd, varLoop
// PHI srcPhi, src, srcLoop
// PHI destPhi, dst, destLoop
// [scratch, srcLoop] = LDR_POST(srcPhi, UnitSize)
// [destLoop] = STR_POST(scratch, destPhi, UnitSize)
// subs varLoop, varPhi, #UnitSize
// bne loopMBB
// fallthrough --> exitMBB
// exitMBB:
// epilogue to handle left-over bytes
// [scratch, srcOut] = LDRB_POST(srcLoop, 1)
// [destOut] = STRB_POST(scratch, destLoop, 1)
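// varEnd starts at LoopSize and is decremented by UnitSize on each
// iteration, so the loop body executes LoopSize / UnitSize times before
// falling through to exitMBB.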
MachineBasicBlock *loopMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *exitMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MF->insert(It, loopMBB);
MF->insert(It, exitMBB);
// Transfer the remainder of BB and its successor edges to exitMBB.
exitMBB->splice(exitMBB->begin(), BB,
std::next(MachineBasicBlock::iterator(MI)), BB->end());
exitMBB->transferSuccessorsAndUpdatePHIs(BB);
// Load an immediate to varEnd.
unsigned varEnd = MRI.createVirtualRegister(TRC);
if (Subtarget->useMovt(*MF)) {
unsigned Vtmp = varEnd;
if ((LoopSize & 0xFFFF0000) != 0)
Vtmp = MRI.createVirtualRegister(TRC);
BuildMI(BB, dl, TII->get(IsThumb ? ARM::t2MOVi16 : ARM::MOVi16), Vtmp)
.addImm(LoopSize & 0xFFFF)
.add(predOps(ARMCC::AL));
if ((LoopSize & 0xFFFF0000) != 0)
BuildMI(BB, dl, TII->get(IsThumb ? ARM::t2MOVTi16 : ARM::MOVTi16), varEnd)
.addReg(Vtmp)
.addImm(LoopSize >> 16)
.add(predOps(ARMCC::AL));
} else {
MachineConstantPool *ConstantPool = MF->getConstantPool();
Type *Int32Ty = Type::getInt32Ty(MF->getFunction()->getContext());
const Constant *C = ConstantInt::get(Int32Ty, LoopSize);
// MachineConstantPool wants an explicit alignment.
unsigned Align = MF->getDataLayout().getPrefTypeAlignment(Int32Ty);
if (Align == 0)
Align = MF->getDataLayout().getTypeAllocSize(C->getType());
unsigned Idx = ConstantPool->getConstantPoolIndex(C, Align);
if (IsThumb)
BuildMI(*BB, MI, dl, TII->get(ARM::tLDRpci))
.addReg(varEnd, RegState::Define)
.addConstantPoolIndex(Idx)
.add(predOps(ARMCC::AL));
else
BuildMI(*BB, MI, dl, TII->get(ARM::LDRcp))
.addReg(varEnd, RegState::Define)
.addConstantPoolIndex(Idx)
.addImm(0)
.add(predOps(ARMCC::AL));
}
BB->addSuccessor(loopMBB);
// Generate the loop body:
// varPhi = PHI(varLoop, varEnd)
// srcPhi = PHI(srcLoop, src)
// destPhi = PHI(destLoop, dst)
MachineBasicBlock *entryBB = BB;
BB = loopMBB;
unsigned varLoop = MRI.createVirtualRegister(TRC);
unsigned varPhi = MRI.createVirtualRegister(TRC);
unsigned srcLoop = MRI.createVirtualRegister(TRC);
unsigned srcPhi = MRI.createVirtualRegister(TRC);
unsigned destLoop = MRI.createVirtualRegister(TRC);
unsigned destPhi = MRI.createVirtualRegister(TRC);
BuildMI(*BB, BB->begin(), dl, TII->get(ARM::PHI), varPhi)
.addReg(varLoop).addMBB(loopMBB)
.addReg(varEnd).addMBB(entryBB);
BuildMI(BB, dl, TII->get(ARM::PHI), srcPhi)
.addReg(srcLoop).addMBB(loopMBB)
.addReg(src).addMBB(entryBB);
BuildMI(BB, dl, TII->get(ARM::PHI), destPhi)
.addReg(destLoop).addMBB(loopMBB)
.addReg(dest).addMBB(entryBB);
// [scratch, srcLoop] = LDR_POST(srcPhi, UnitSize)
// [destLoop] = STR_POST(scratch, destPhi, UnitSize)
unsigned scratch = MRI.createVirtualRegister(IsNeon ? VecTRC : TRC);
emitPostLd(BB, BB->end(), TII, dl, UnitSize, scratch, srcPhi, srcLoop,
IsThumb1, IsThumb2);
emitPostSt(BB, BB->end(), TII, dl, UnitSize, scratch, destPhi, destLoop,
IsThumb1, IsThumb2);
// Decrement loop variable by UnitSize.
if (IsThumb1) {
BuildMI(*BB, BB->end(), dl, TII->get(ARM::tSUBi8), varLoop)
.add(t1CondCodeOp())
.addReg(varPhi)
.addImm(UnitSize)
.add(predOps(ARMCC::AL));
} else {
MachineInstrBuilder MIB =
BuildMI(*BB, BB->end(), dl,
TII->get(IsThumb2 ? ARM::t2SUBri : ARM::SUBri), varLoop);
MIB.addReg(varPhi)
.addImm(UnitSize)
.add(predOps(ARMCC::AL))
.add(condCodeOp());
MIB->getOperand(5).setReg(ARM::CPSR);
MIB->getOperand(5).setIsDef(true);
}
BuildMI(*BB, BB->end(), dl,
TII->get(IsThumb1 ? ARM::tBcc : IsThumb2 ? ARM::t2Bcc : ARM::Bcc))
.addMBB(loopMBB).addImm(ARMCC::NE).addReg(ARM::CPSR);
// loopMBB can loop back to loopMBB or fall through to exitMBB.
BB->addSuccessor(loopMBB);
BB->addSuccessor(exitMBB);
// Add epilogue to handle BytesLeft.
BB = exitMBB;
auto StartOfExit = exitMBB->begin();
// [scratch, srcOut] = LDRB_POST(srcLoop, 1)
// [destOut] = STRB_POST(scratch, destLoop, 1)
unsigned srcIn = srcLoop;
unsigned destIn = destLoop;
for (unsigned i = 0; i < BytesLeft; i++) {
unsigned srcOut = MRI.createVirtualRegister(TRC);
unsigned destOut = MRI.createVirtualRegister(TRC);
unsigned scratch = MRI.createVirtualRegister(TRC);
emitPostLd(BB, StartOfExit, TII, dl, 1, scratch, srcIn, srcOut,
IsThumb1, IsThumb2);
emitPostSt(BB, StartOfExit, TII, dl, 1, scratch, destIn, destOut,
IsThumb1, IsThumb2);
srcIn = srcOut;
destIn = destOut;
}
MI.eraseFromParent(); // The instruction is gone now.
return BB;
}
MachineBasicBlock *
ARMTargetLowering::EmitLowered__chkstk(MachineInstr &MI,
MachineBasicBlock *MBB) const {
const TargetMachine &TM = getTargetMachine();
const TargetInstrInfo &TII = *Subtarget->getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
assert(Subtarget->isTargetWindows() &&
"__chkstk is only supported on Windows");
assert(Subtarget->isThumb2() && "Windows on ARM requires Thumb-2 mode");
// __chkstk takes the number of words to allocate on the stack in R4, and
// returns the stack adjustment in number of bytes in R4. This will not
// clobber any other registers (other than the obvious lr).
//
// Although, technically, IP should be considered a register which may be
// clobbered, the call itself will not touch it. Windows on ARM is a pure
// thumb-2 environment, so there is no interworking required. As a result, we
// do not expect a veneer to be emitted by the linker, clobbering IP.
//
// Each module receives its own copy of __chkstk, so no import thunk is
// required, again, ensuring that IP is not clobbered.
//
// Finally, although some linkers may theoretically provide a trampoline for
// out of range calls (which is quite common due to a 32M range limitation of
// branches for Thumb), we can generate the long-call version via
// -mcmodel=large, alleviating the need for the trampoline which may clobber
// IP.
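// For example, a 4096-byte allocation is expected to reach here with
// 1024 (the word count) already in R4; __chkstk then hands back 4096 in
// R4, which the t2SUBrr below subtracts from SP.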
switch (TM.getCodeModel()) {
case CodeModel::Small:
case CodeModel::Medium:
case CodeModel::Default:
case CodeModel::Kernel:
BuildMI(*MBB, MI, DL, TII.get(ARM::tBL))
.add(predOps(ARMCC::AL))
.addExternalSymbol("__chkstk")
.addReg(ARM::R4, RegState::Implicit | RegState::Kill)
.addReg(ARM::R4, RegState::Implicit | RegState::Define)
.addReg(ARM::R12,
RegState::Implicit | RegState::Define | RegState::Dead)
.addReg(ARM::CPSR,
RegState::Implicit | RegState::Define | RegState::Dead);
break;
case CodeModel::Large:
case CodeModel::JITDefault: {
MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();
unsigned Reg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
BuildMI(*MBB, MI, DL, TII.get(ARM::t2MOVi32imm), Reg)
.addExternalSymbol("__chkstk");
BuildMI(*MBB, MI, DL, TII.get(ARM::tBLXr))
.add(predOps(ARMCC::AL))
.addReg(Reg, RegState::Kill)
.addReg(ARM::R4, RegState::Implicit | RegState::Kill)
.addReg(ARM::R4, RegState::Implicit | RegState::Define)
.addReg(ARM::R12,
RegState::Implicit | RegState::Define | RegState::Dead)
.addReg(ARM::CPSR,
RegState::Implicit | RegState::Define | RegState::Dead);
break;
}
}
BuildMI(*MBB, MI, DL, TII.get(ARM::t2SUBrr), ARM::SP)
.addReg(ARM::SP, RegState::Kill)
.addReg(ARM::R4, RegState::Kill)
.setMIFlags(MachineInstr::FrameSetup)
.add(predOps(ARMCC::AL))
.add(condCodeOp());
MI.eraseFromParent();
return MBB;
}
MachineBasicBlock *
ARMTargetLowering::EmitLowered__dbzchk(MachineInstr &MI,
MachineBasicBlock *MBB) const {
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
MachineBasicBlock *ContBB = MF->CreateMachineBasicBlock();
MF->insert(++MBB->getIterator(), ContBB);
ContBB->splice(ContBB->begin(), MBB,
std::next(MachineBasicBlock::iterator(MI)), MBB->end());
ContBB->transferSuccessorsAndUpdatePHIs(MBB);
MBB->addSuccessor(ContBB);
MachineBasicBlock *TrapBB = MF->CreateMachineBasicBlock();
BuildMI(TrapBB, DL, TII->get(ARM::t__brkdiv0));
MF->push_back(TrapBB);
MBB->addSuccessor(TrapBB);
BuildMI(*MBB, MI, DL, TII->get(ARM::tCMPi8))
.addReg(MI.getOperand(0).getReg())
.addImm(0)
.add(predOps(ARMCC::AL));
BuildMI(*MBB, MI, DL, TII->get(ARM::t2Bcc))
.addMBB(TrapBB)
.addImm(ARMCC::EQ)
.addReg(ARM::CPSR);
MI.eraseFromParent();
return ContBB;
}
MachineBasicBlock *
ARMTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
MachineBasicBlock *BB) const {
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc dl = MI.getDebugLoc();
bool isThumb2 = Subtarget->isThumb2();
switch (MI.getOpcode()) {
default: {
MI.print(errs());
llvm_unreachable("Unexpected instr type to insert");
}
// Thumb1 post-indexed loads are really just single-register LDMs.
case ARM::tLDR_postidx: {
BuildMI(*BB, MI, dl, TII->get(ARM::tLDMIA_UPD))
.add(MI.getOperand(1)) // Rn_wb
.add(MI.getOperand(2)) // Rn
.add(MI.getOperand(3)) // PredImm
.add(MI.getOperand(4)) // PredReg
.add(MI.getOperand(0)); // Rt
MI.eraseFromParent();
return BB;
}
// The Thumb2 pre-indexed stores have the same MI operands; they just
// define them differently in the .td files from the isel patterns, so
// they need pseudos.
case ARM::t2STR_preidx:
MI.setDesc(TII->get(ARM::t2STR_PRE));
return BB;
case ARM::t2STRB_preidx:
MI.setDesc(TII->get(ARM::t2STRB_PRE));
return BB;
case ARM::t2STRH_preidx:
MI.setDesc(TII->get(ARM::t2STRH_PRE));
return BB;
case ARM::STRi_preidx:
case ARM::STRBi_preidx: {
unsigned NewOpc = MI.getOpcode() == ARM::STRi_preidx ? ARM::STR_PRE_IMM
: ARM::STRB_PRE_IMM;
// Decode the offset.
unsigned Offset = MI.getOperand(4).getImm();
bool isSub = ARM_AM::getAM2Op(Offset) == ARM_AM::sub;
Offset = ARM_AM::getAM2Offset(Offset);
if (isSub)
Offset = -Offset;
MachineMemOperand *MMO = *MI.memoperands_begin();
BuildMI(*BB, MI, dl, TII->get(NewOpc))
.add(MI.getOperand(0)) // Rn_wb
.add(MI.getOperand(1)) // Rt
.add(MI.getOperand(2)) // Rn
.addImm(Offset) // offset (skip GPR==zero_reg)
.add(MI.getOperand(5)) // pred
.add(MI.getOperand(6))
.addMemOperand(MMO);
MI.eraseFromParent();
return BB;
}
case ARM::STRr_preidx:
case ARM::STRBr_preidx:
case ARM::STRH_preidx: {
unsigned NewOpc;
switch (MI.getOpcode()) {
default: llvm_unreachable("unexpected opcode!");
case ARM::STRr_preidx: NewOpc = ARM::STR_PRE_REG; break;
case ARM::STRBr_preidx: NewOpc = ARM::STRB_PRE_REG; break;
case ARM::STRH_preidx: NewOpc = ARM::STRH_PRE; break;
}
MachineInstrBuilder MIB = BuildMI(*BB, MI, dl, TII->get(NewOpc));
for (unsigned i = 0; i < MI.getNumOperands(); ++i)
MIB.add(MI.getOperand(i));
MI.eraseFromParent();
return BB;
}
case ARM::tMOVCCr_pseudo: {
// To "insert" a SELECT_CC instruction, we actually have to insert the
// diamond control-flow pattern. The incoming instruction knows the
// destination vreg to set, the condition code register to branch on, the
// true/false values to select between, and a branch opcode to use.
const BasicBlock *LLVM_BB = BB->getBasicBlock();
MachineFunction::iterator It = ++BB->getIterator();
// thisMBB:
// ...
// TrueVal = ...
// cmpTY ccX, r1, r2
// bCC copy1MBB
// fallthrough --> copy0MBB
MachineBasicBlock *thisMBB = BB;
MachineFunction *F = BB->getParent();
MachineBasicBlock *copy0MBB = F->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *sinkMBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(It, copy0MBB);
F->insert(It, sinkMBB);
// Transfer the remainder of BB and its successor edges to sinkMBB.
sinkMBB->splice(sinkMBB->begin(), BB,
std::next(MachineBasicBlock::iterator(MI)), BB->end());
sinkMBB->transferSuccessorsAndUpdatePHIs(BB);
BB->addSuccessor(copy0MBB);
BB->addSuccessor(sinkMBB);
BuildMI(BB, dl, TII->get(ARM::tBcc))
.addMBB(sinkMBB)
.addImm(MI.getOperand(3).getImm())
.addReg(MI.getOperand(4).getReg());
// copy0MBB:
// %FalseValue = ...
// # fallthrough to sinkMBB
BB = copy0MBB;
// Update machine-CFG edges
BB->addSuccessor(sinkMBB);
// sinkMBB:
// %Result = phi [ %FalseValue, copy0MBB ], [ %TrueValue, thisMBB ]
// ...
BB = sinkMBB;
BuildMI(*BB, BB->begin(), dl, TII->get(ARM::PHI), MI.getOperand(0).getReg())
.addReg(MI.getOperand(1).getReg())
.addMBB(copy0MBB)
.addReg(MI.getOperand(2).getReg())
.addMBB(thisMBB);
MI.eraseFromParent(); // The pseudo instruction is gone now.
return BB;
}
case ARM::BCCi64:
case ARM::BCCZi64: {
// If there is an unconditional branch to the other successor, remove it.
BB->erase(std::next(MachineBasicBlock::iterator(MI)), BB->end());
// Compare both parts that make up the double comparison separately for
// equality.
bool RHSisZero = MI.getOpcode() == ARM::BCCZi64;
unsigned LHS1 = MI.getOperand(1).getReg();
unsigned LHS2 = MI.getOperand(2).getReg();
if (RHSisZero) {
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2CMPri : ARM::CMPri))
.addReg(LHS1)
.addImm(0)
.add(predOps(ARMCC::AL));
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2CMPri : ARM::CMPri))
.addReg(LHS2).addImm(0)
.addImm(ARMCC::EQ).addReg(ARM::CPSR);
} else {
unsigned RHS1 = MI.getOperand(3).getReg();
unsigned RHS2 = MI.getOperand(4).getReg();
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2CMPrr : ARM::CMPrr))
.addReg(LHS1)
.addReg(RHS1)
.add(predOps(ARMCC::AL));
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2CMPrr : ARM::CMPrr))
.addReg(LHS2).addReg(RHS2)
.addImm(ARMCC::EQ).addReg(ARM::CPSR);
}
MachineBasicBlock *destMBB = MI.getOperand(RHSisZero ? 3 : 5).getMBB();
MachineBasicBlock *exitMBB = OtherSucc(BB, destMBB);
if (MI.getOperand(0).getImm() == ARMCC::NE)
std::swap(destMBB, exitMBB);
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2Bcc : ARM::Bcc))
.addMBB(destMBB).addImm(ARMCC::EQ).addReg(ARM::CPSR);
if (isThumb2)
BuildMI(BB, dl, TII->get(ARM::t2B))
.addMBB(exitMBB)
.add(predOps(ARMCC::AL));
else
BuildMI(BB, dl, TII->get(ARM::B)).addMBB(exitMBB);
MI.eraseFromParent(); // The pseudo instruction is gone now.
return BB;
}
case ARM::Int_eh_sjlj_setjmp:
case ARM::Int_eh_sjlj_setjmp_nofp:
case ARM::tInt_eh_sjlj_setjmp:
case ARM::t2Int_eh_sjlj_setjmp:
case ARM::t2Int_eh_sjlj_setjmp_nofp:
return BB;
case ARM::Int_eh_sjlj_setup_dispatch:
EmitSjLjDispatchBlock(MI, BB);
return BB;
case ARM::ABS:
case ARM::t2ABS: {
// To insert an ABS instruction, we have to insert the
// diamond control-flow pattern. The incoming instruction knows the
// source vreg to test against 0, the destination vreg to set,
// the condition code register to branch on, the
// true/false values to select between, and a branch opcode to use.
// It transforms
// V1 = ABS V0
// into
// V2 = MOVS V0
// BCC (branch to SinkBB if V0 >= 0)
// RSBBB: V3 = RSBri V2, 0 (compute ABS if V2 < 0)
// SinkBB: V1 = PHI(V2, V3)
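// For example, with V0 == -5: the CMP sets the N flag, the PL branch to
// SinkBB is not taken, and RSBBB computes V3 = 0 - (-5) = 5, which the
// PHI selects as the result.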
const BasicBlock *LLVM_BB = BB->getBasicBlock();
MachineFunction::iterator BBI = ++BB->getIterator();
MachineFunction *Fn = BB->getParent();
MachineBasicBlock *RSBBB = Fn->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *SinkBB = Fn->CreateMachineBasicBlock(LLVM_BB);
Fn->insert(BBI, RSBBB);
Fn->insert(BBI, SinkBB);
unsigned int ABSSrcReg = MI.getOperand(1).getReg();
unsigned int ABSDstReg = MI.getOperand(0).getReg();
bool ABSSrcKill = MI.getOperand(1).isKill();
bool isThumb2 = Subtarget->isThumb2();
MachineRegisterInfo &MRI = Fn->getRegInfo();
// In Thumb mode, the S bit must not be specified if the source register is
// SP or PC, or if the destination register is SP, so restrict the register
// class.
unsigned NewRsbDstReg =
MRI.createVirtualRegister(isThumb2 ? &ARM::rGPRRegClass : &ARM::GPRRegClass);
// Transfer the remainder of BB and its successor edges to sinkMBB.
SinkBB->splice(SinkBB->begin(), BB,
std::next(MachineBasicBlock::iterator(MI)), BB->end());
SinkBB->transferSuccessorsAndUpdatePHIs(BB);
BB->addSuccessor(RSBBB);
BB->addSuccessor(SinkBB);
// fall through to SinkMBB
RSBBB->addSuccessor(SinkBB);
// insert a cmp at the end of BB
BuildMI(BB, dl, TII->get(isThumb2 ? ARM::t2CMPri : ARM::CMPri))
.addReg(ABSSrcReg)
.addImm(0)
.add(predOps(ARMCC::AL));
// insert a bcc with opposite CC to ARMCC::MI at the end of BB
BuildMI(BB, dl,
TII->get(isThumb2 ? ARM::t2Bcc : ARM::Bcc)).addMBB(SinkBB)
.addImm(ARMCC::getOppositeCondition(ARMCC::MI)).addReg(ARM::CPSR);
// insert rsbri in RSBBB
// Note: BCC and rsbri will be converted into predicated rsbmi
// by the if-conversion pass.
BuildMI(*RSBBB, RSBBB->begin(), dl,
TII->get(isThumb2 ? ARM::t2RSBri : ARM::RSBri), NewRsbDstReg)
.addReg(ABSSrcReg, ABSSrcKill ? RegState::Kill : 0)
.addImm(0)
.add(predOps(ARMCC::AL))
.add(condCodeOp());
// insert PHI in SinkBB,
// reuse ABSDstReg to not change uses of ABS instruction
BuildMI(*SinkBB, SinkBB->begin(), dl,
TII->get(ARM::PHI), ABSDstReg)
.addReg(NewRsbDstReg).addMBB(RSBBB)
.addReg(ABSSrcReg).addMBB(BB);
// remove ABS instruction
MI.eraseFromParent();
// return last added BB
return SinkBB;
}
case ARM::COPY_STRUCT_BYVAL_I32:
++NumLoopByVals;
return EmitStructByval(MI, BB);
case ARM::WIN__CHKSTK:
return EmitLowered__chkstk(MI, BB);
case ARM::WIN__DBZCHK:
return EmitLowered__dbzchk(MI, BB);
}
}
/// \brief Attaches vregs to MEMCPY that it will use as scratch registers
/// when it is expanded into LDM/STM. This is done as a post-isel lowering
/// instead of as a custom inserter because we need the use list from the SDNode.
static void attachMEMCPYScratchRegs(const ARMSubtarget *Subtarget,
MachineInstr &MI, const SDNode *Node) {
bool isThumb1 = Subtarget->isThumb1Only();
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = MI.getParent()->getParent();
MachineRegisterInfo &MRI = MF->getRegInfo();
MachineInstrBuilder MIB(*MF, MI);
// If the new dst/src is unused, mark it as dead.
if (!Node->hasAnyUseOfValue(0)) {
MI.getOperand(0).setIsDead(true);
}
if (!Node->hasAnyUseOfValue(1)) {
MI.getOperand(1).setIsDead(true);
}
// The MEMCPY both defines and kills the scratch registers.
for (unsigned I = 0; I != MI.getOperand(4).getImm(); ++I) {
unsigned TmpReg = MRI.createVirtualRegister(isThumb1 ? &ARM::tGPRRegClass
: &ARM::GPRRegClass);
MIB.addReg(TmpReg, RegState::Define|RegState::Dead);
}
}
void ARMTargetLowering::AdjustInstrPostInstrSelection(MachineInstr &MI,
SDNode *Node) const {
if (MI.getOpcode() == ARM::MEMCPY) {
attachMEMCPYScratchRegs(Subtarget, MI, Node);
return;
}
const MCInstrDesc *MCID = &MI.getDesc();
// Adjust potentially 's' setting instructions after isel, i.e. ADC, SBC, RSB,
// RSC. Coming out of isel, they have an implicit CPSR def, but the optional
// operand is still set to noreg. If needed, set the optional operand's
// register to CPSR, and remove the redundant implicit def.
//
// e.g. ADCS (..., CPSR<imp-def>) -> ADC (... opt:CPSR<def>).
// Rename pseudo opcodes.
unsigned NewOpc = convertAddSubFlagsOpcode(MI.getOpcode());
unsigned ccOutIdx;
if (NewOpc) {
const ARMBaseInstrInfo *TII = Subtarget->getInstrInfo();
MCID = &TII->get(NewOpc);
assert(MCID->getNumOperands() ==
MI.getDesc().getNumOperands() + 5 - MI.getDesc().getSize()
&& "converted opcode should be the same except for cc_out"
" (and, on Thumb1, pred)");
MI.setDesc(*MCID);
// Add the optional cc_out operand
MI.addOperand(MachineOperand::CreateReg(0, /*isDef=*/true));
// On Thumb1, move all input operands to the end, then add the predicate
if (Subtarget->isThumb1Only()) {
for (unsigned c = MCID->getNumOperands() - 4; c--;) {
MI.addOperand(MI.getOperand(1));
MI.RemoveOperand(1);
}
// Restore the ties
for (unsigned i = MI.getNumOperands(); i--;) {
const MachineOperand& op = MI.getOperand(i);
if (op.isReg() && op.isUse()) {
int DefIdx = MCID->getOperandConstraint(i, MCOI::TIED_TO);
if (DefIdx != -1)
MI.tieOperands(DefIdx, i);
}
}
MI.addOperand(MachineOperand::CreateImm(ARMCC::AL));
MI.addOperand(MachineOperand::CreateReg(0, /*isDef=*/false));
ccOutIdx = 1;
} else
ccOutIdx = MCID->getNumOperands() - 1;
} else
ccOutIdx = MCID->getNumOperands() - 1;
// Any ARM instruction that sets the 's' bit should specify an optional
// "cc_out" operand in the last operand position.
if (!MI.hasOptionalDef() || !MCID->OpInfo[ccOutIdx].isOptionalDef()) {
assert(!NewOpc && "Optional cc_out operand required");
return;
}
// Look for an implicit def of CPSR added by MachineInstr ctor. Remove it
// since we already have an optional CPSR def.
bool definesCPSR = false;
bool deadCPSR = false;
for (unsigned i = MCID->getNumOperands(), e = MI.getNumOperands(); i != e;
++i) {
const MachineOperand &MO = MI.getOperand(i);
if (MO.isReg() && MO.isDef() && MO.getReg() == ARM::CPSR) {
definesCPSR = true;
if (MO.isDead())
deadCPSR = true;
MI.RemoveOperand(i);
break;
}
}
if (!definesCPSR) {
assert(!NewOpc && "Optional cc_out operand required");
return;
}
assert(deadCPSR == !Node->hasAnyUseOfValue(1) && "inconsistent dead flag");
if (deadCPSR) {
assert(!MI.getOperand(ccOutIdx).getReg() &&
"expect uninitialized optional cc_out operand");
// Thumb1 instructions must have the S bit even if the CPSR is dead.
if (!Subtarget->isThumb1Only())
return;
}
// If this instruction was defined with an optional CPSR def and its dag node
// had a live implicit CPSR def, then activate the optional CPSR def.
MachineOperand &MO = MI.getOperand(ccOutIdx);
MO.setReg(ARM::CPSR);
MO.setIsDef(true);
}
//===----------------------------------------------------------------------===//
// ARM Optimization Hooks
//===----------------------------------------------------------------------===//
// Helper function that checks if N is a null or all ones constant.
static inline bool isZeroOrAllOnes(SDValue N, bool AllOnes) {
return AllOnes ? isAllOnesConstant(N) : isNullConstant(N);
}
// Return true if N is conditionally 0 or all ones.
// Detects these expressions where cc is an i1 value:
//
// (select cc 0, y) [AllOnes=0]
// (select cc y, 0) [AllOnes=0]
// (zext cc) [AllOnes=0]
// (sext cc) [AllOnes=0/1]
// (select cc -1, y) [AllOnes=1]
// (select cc y, -1) [AllOnes=1]
//
// Invert is set when N is the null/all ones constant when CC is false.
// OtherOp is set to the alternative value of N.
static bool isConditionalZeroOrAllOnes(SDNode *N, bool AllOnes,
SDValue &CC, bool &Invert,
SDValue &OtherOp,
SelectionDAG &DAG) {
switch (N->getOpcode()) {
default: return false;
case ISD::SELECT: {
CC = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDValue N2 = N->getOperand(2);
if (isZeroOrAllOnes(N1, AllOnes)) {
Invert = false;
OtherOp = N2;
return true;
}
if (isZeroOrAllOnes(N2, AllOnes)) {
Invert = true;
OtherOp = N1;
return true;
}
return false;
}
case ISD::ZERO_EXTEND:
// (zext cc) can never be the all ones value.
if (AllOnes)
return false;
LLVM_FALLTHROUGH;
case ISD::SIGN_EXTEND: {
SDLoc dl(N);
EVT VT = N->getValueType(0);
CC = N->getOperand(0);
if (CC.getValueType() != MVT::i1 || CC.getOpcode() != ISD::SETCC)
return false;
Invert = !AllOnes;
if (AllOnes)
// When looking for an AllOnes constant, N is an sext, and the 'other'
// value is 0.
OtherOp = DAG.getConstant(0, dl, VT);
else if (N->getOpcode() == ISD::ZERO_EXTEND)
// When looking for a 0 constant, N can be zext or sext.
OtherOp = DAG.getConstant(1, dl, VT);
else
OtherOp = DAG.getConstant(APInt::getAllOnesValue(VT.getSizeInBits()), dl,
VT);
return true;
}
}
}
// Combine a constant select operand into its use:
//
// (add (select cc, 0, c), x) -> (select cc, x, (add, x, c))
// (sub x, (select cc, 0, c)) -> (select cc, x, (sub, x, c))
// (and (select cc, -1, c), x) -> (select cc, x, (and, x, c)) [AllOnes=1]
// (or (select cc, 0, c), x) -> (select cc, x, (or, x, c))
// (xor (select cc, 0, c), x) -> (select cc, x, (xor, x, c))
//
// The transform is rejected if the select doesn't have a constant operand that
// is null, or all ones when AllOnes is set.
//
// Also recognize sext/zext from i1:
//
// (add (zext cc), x) -> (select cc (add x, 1), x)
// (add (sext cc), x) -> (select cc (add x, -1), x)
//
// These transformations eventually create predicated instructions.
//
// @param N The node to transform.
// @param Slct The N operand that is a select.
// @param OtherOp The other N operand (x above).
// @param DCI Context.
// @param AllOnes Require the select constant to be all ones instead of null.
// @returns The new node, or SDValue() on failure.
static
SDValue combineSelectAndUse(SDNode *N, SDValue Slct, SDValue OtherOp,
TargetLowering::DAGCombinerInfo &DCI,
bool AllOnes = false) {
SelectionDAG &DAG = DCI.DAG;
EVT VT = N->getValueType(0);
SDValue NonConstantVal;
SDValue CCOp;
bool SwapSelectOps;
if (!isConditionalZeroOrAllOnes(Slct.getNode(), AllOnes, CCOp, SwapSelectOps,
NonConstantVal, DAG))
return SDValue();
// Slct is now known to be the desired identity constant when CC is true.
SDValue TrueVal = OtherOp;
SDValue FalseVal = DAG.getNode(N->getOpcode(), SDLoc(N), VT,
OtherOp, NonConstantVal);
// Unless SwapSelectOps says CC should be false.
if (SwapSelectOps)
std::swap(TrueVal, FalseVal);
return DAG.getNode(ISD::SELECT, SDLoc(N), VT,
CCOp, TrueVal, FalseVal);
}
// Attempt combineSelectAndUse on each operand of a commutative operator N.
static
SDValue combineSelectAndUseCommutative(SDNode *N, bool AllOnes,
TargetLowering::DAGCombinerInfo &DCI) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
if (N0.getNode()->hasOneUse())
if (SDValue Result = combineSelectAndUse(N, N0, N1, DCI, AllOnes))
return Result;
if (N1.getNode()->hasOneUse())
if (SDValue Result = combineSelectAndUse(N, N1, N0, DCI, AllOnes))
return Result;
return SDValue();
}
static bool IsVUZPShuffleNode(SDNode *N) {
// VUZP shuffle node.
if (N->getOpcode() == ARMISD::VUZP)
return true;
// "VUZP" on i32 is an alias for VTRN.
if (N->getOpcode() == ARMISD::VTRN && N->getValueType(0) == MVT::v2i32)
return true;
return false;
}
static SDValue AddCombineToVPADD(SDNode *N, SDValue N0, SDValue N1,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Look for ADD(VUZP.0, VUZP.1).
if (!IsVUZPShuffleNode(N0.getNode()) || N0.getNode() != N1.getNode() ||
N0 == N1)
return SDValue();
// Make sure the ADD is a 64-bit add; there is no 128-bit VPADD.
if (!N->getValueType(0).is64BitVector())
return SDValue();
// Generate vpadd.
SelectionDAG &DAG = DCI.DAG;
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(N);
SDNode *Unzip = N0.getNode();
EVT VT = N->getValueType(0);
SmallVector<SDValue, 8> Ops;
Ops.push_back(DAG.getConstant(Intrinsic::arm_neon_vpadd, dl,
TLI.getPointerTy(DAG.getDataLayout())));
Ops.push_back(Unzip->getOperand(0));
Ops.push_back(Unzip->getOperand(1));
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, Ops);
}
static SDValue AddCombineVUZPToVPADDL(SDNode *N, SDValue N0, SDValue N1,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Check for two extended operands.
if (!(N0.getOpcode() == ISD::SIGN_EXTEND &&
N1.getOpcode() == ISD::SIGN_EXTEND) &&
!(N0.getOpcode() == ISD::ZERO_EXTEND &&
N1.getOpcode() == ISD::ZERO_EXTEND))
return SDValue();
SDValue N00 = N0.getOperand(0);
SDValue N10 = N1.getOperand(0);
// Look for ADD(SEXT(VUZP.0), SEXT(VUZP.1))
if (!IsVUZPShuffleNode(N00.getNode()) || N00.getNode() != N10.getNode() ||
N00 == N10)
return SDValue();
// We only recognize Q register paddl here; this can't be reached until
// after type legalization.
if (!N00.getValueType().is64BitVector() ||
!N0.getValueType().is128BitVector())
return SDValue();
// Generate vpaddl.
SelectionDAG &DAG = DCI.DAG;
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(N);
EVT VT = N->getValueType(0);
SmallVector<SDValue, 8> Ops;
// Form vpaddl.sN or vpaddl.uN depending on the kind of extension.
unsigned Opcode;
if (N0.getOpcode() == ISD::SIGN_EXTEND)
Opcode = Intrinsic::arm_neon_vpaddls;
else
Opcode = Intrinsic::arm_neon_vpaddlu;
Ops.push_back(DAG.getConstant(Opcode, dl,
TLI.getPointerTy(DAG.getDataLayout())));
EVT ElemTy = N00.getValueType().getVectorElementType();
unsigned NumElts = VT.getVectorNumElements();
EVT ConcatVT = EVT::getVectorVT(*DAG.getContext(), ElemTy, NumElts * 2);
SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), ConcatVT,
N00.getOperand(0), N00.getOperand(1));
Ops.push_back(Concat);
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, Ops);
}
// FIXME: This function shouldn't be necessary; if we lower BUILD_VECTOR in
// an appropriate manner, we end up with ADD(VUZP(ZEXT(N))), which is
// much easier to match.
static SDValue
AddCombineBUILD_VECTORToVPADDL(SDNode *N, SDValue N0, SDValue N1,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Only perform the optimization after legalization, and only if NEON is
// available. We also expect both operands to be BUILD_VECTORs.
if (DCI.isBeforeLegalize() || !Subtarget->hasNEON()
|| N0.getOpcode() != ISD::BUILD_VECTOR
|| N1.getOpcode() != ISD::BUILD_VECTOR)
return SDValue();
// Check output type since VPADDL operand elements can only be 8, 16, or 32.
EVT VT = N->getValueType(0);
if (!VT.isInteger() || VT.getVectorElementType() == MVT::i64)
return SDValue();
// Check that the vector operands are of the right form.
// N0 and N1 are BUILD_VECTOR nodes with N EXTRACT_VECTOR_ELT operands,
// where N is the size of the formed vector.
// Each EXTRACT_VECTOR_ELT should reference the same input vector with an
// odd or even index, such that we have a pairwise add pattern.
// Grab the vector that all EXTRACT_VECTOR nodes should be referencing.
if (N0->getOperand(0)->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
return SDValue();
SDValue Vec = N0->getOperand(0)->getOperand(0);
SDNode *V = Vec.getNode();
unsigned nextIndex = 0;
// For each operand of the ADD that is a BUILD_VECTOR,
// check whether each of its operands is an EXTRACT_VECTOR_ELT of
// the same vector with the appropriate index.
for (unsigned i = 0, e = N0->getNumOperands(); i != e; ++i) {
if (N0->getOperand(i)->getOpcode() == ISD::EXTRACT_VECTOR_ELT
&& N1->getOperand(i)->getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
SDValue ExtVec0 = N0->getOperand(i);
SDValue ExtVec1 = N1->getOperand(i);
// First operand is the vector; verify it's the same.
if (V != ExtVec0->getOperand(0).getNode() ||
V != ExtVec1->getOperand(0).getNode())
return SDValue();
// Second is the constant; verify it's correct.
ConstantSDNode *C0 = dyn_cast<ConstantSDNode>(ExtVec0->getOperand(1));
ConstantSDNode *C1 = dyn_cast<ConstantSDNode>(ExtVec1->getOperand(1));
// For the constants, we expect to see all even or all odd indices.
if (!C0 || !C1 || C0->getZExtValue() != nextIndex
|| C1->getZExtValue() != nextIndex+1)
return SDValue();
// Increment index.
nextIndex+=2;
} else
return SDValue();
}
// Don't generate vpaddl+vmovn; we'll match it to vpadd later. Also make sure
// we're using the entire input vector, otherwise there's a size/legality
// mismatch somewhere.
if (nextIndex != Vec.getValueType().getVectorNumElements() ||
Vec.getValueType().getVectorElementType() == VT.getVectorElementType())
return SDValue();
// Create VPADDL node.
SelectionDAG &DAG = DCI.DAG;
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(N);
// Build operand list.
SmallVector<SDValue, 8> Ops;
Ops.push_back(DAG.getConstant(Intrinsic::arm_neon_vpaddls, dl,
TLI.getPointerTy(DAG.getDataLayout())));
// Input is the vector.
Ops.push_back(Vec);
// Get widened type and narrowed type.
MVT widenType;
unsigned numElem = VT.getVectorNumElements();
EVT inputLaneType = Vec.getValueType().getVectorElementType();
switch (inputLaneType.getSimpleVT().SimpleTy) {
case MVT::i8: widenType = MVT::getVectorVT(MVT::i16, numElem); break;
case MVT::i16: widenType = MVT::getVectorVT(MVT::i32, numElem); break;
case MVT::i32: widenType = MVT::getVectorVT(MVT::i64, numElem); break;
default:
llvm_unreachable("Invalid vector element type for padd optimization.");
}
SDValue tmp = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, widenType, Ops);
unsigned ExtOp = VT.bitsGT(tmp.getValueType()) ? ISD::ANY_EXTEND : ISD::TRUNCATE;
return DAG.getNode(ExtOp, dl, VT, tmp);
}
static SDValue findMUL_LOHI(SDValue V) {
if (V->getOpcode() == ISD::UMUL_LOHI ||
V->getOpcode() == ISD::SMUL_LOHI)
return V;
return SDValue();
}
static SDValue AddCombineTo64BitSMLAL16(SDNode *AddcNode, SDNode *AddeNode,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
if (Subtarget->isThumb()) {
if (!Subtarget->hasDSP())
return SDValue();
} else if (!Subtarget->hasV5TEOps())
return SDValue();
// SMLALBB, SMLALBT, SMLALTB and SMLALTT multiply two 16-bit values and
// accumulate the product into a 64-bit value. The 16-bit values will
// be sign-extended somehow or SRA'd into 32-bit values
// (addc (adde (mul 16bit, 16bit), lo), hi)
SDValue Mul = AddcNode->getOperand(0);
SDValue Lo = AddcNode->getOperand(1);
if (Mul.getOpcode() != ISD::MUL) {
Lo = AddcNode->getOperand(0);
Mul = AddcNode->getOperand(1);
if (Mul.getOpcode() != ISD::MUL)
return SDValue();
}
SDValue SRA = AddeNode->getOperand(0);
SDValue Hi = AddeNode->getOperand(1);
if (SRA.getOpcode() != ISD::SRA) {
SRA = AddeNode->getOperand(1);
Hi = AddeNode->getOperand(0);
if (SRA.getOpcode() != ISD::SRA)
return SDValue();
}
if (auto Const = dyn_cast<ConstantSDNode>(SRA.getOperand(1))) {
if (Const->getZExtValue() != 31)
return SDValue();
} else
return SDValue();
if (SRA.getOperand(0) != Mul)
return SDValue();
SelectionDAG &DAG = DCI.DAG;
SDLoc dl(AddcNode);
unsigned Opcode = 0;
SDValue Op0;
SDValue Op1;
if (isS16(Mul.getOperand(0), DAG) && isS16(Mul.getOperand(1), DAG)) {
Opcode = ARMISD::SMLALBB;
Op0 = Mul.getOperand(0);
Op1 = Mul.getOperand(1);
} else if (isS16(Mul.getOperand(0), DAG) && isSRA16(Mul.getOperand(1))) {
Opcode = ARMISD::SMLALBT;
Op0 = Mul.getOperand(0);
Op1 = Mul.getOperand(1).getOperand(0);
} else if (isSRA16(Mul.getOperand(0)) && isS16(Mul.getOperand(1), DAG)) {
Opcode = ARMISD::SMLALTB;
Op0 = Mul.getOperand(0).getOperand(0);
Op1 = Mul.getOperand(1);
} else if (isSRA16(Mul.getOperand(0)) && isSRA16(Mul.getOperand(1))) {
Opcode = ARMISD::SMLALTT;
Op0 = Mul->getOperand(0).getOperand(0);
Op1 = Mul->getOperand(1).getOperand(0);
}
if (!Op0 || !Op1)
return SDValue();
SDValue SMLAL = DAG.getNode(Opcode, dl, DAG.getVTList(MVT::i32, MVT::i32),
Op0, Op1, Lo, Hi);
// Replace the ADDs' nodes uses by the MLA node's values.
SDValue HiMLALResult(SMLAL.getNode(), 1);
SDValue LoMLALResult(SMLAL.getNode(), 0);
DAG.ReplaceAllUsesOfValueWith(SDValue(AddcNode, 0), LoMLALResult);
DAG.ReplaceAllUsesOfValueWith(SDValue(AddeNode, 0), HiMLALResult);
// Return original node to notify the driver to stop replacing.
SDValue resNode(AddcNode, 0);
return resNode;
}
static SDValue AddCombineTo64bitMLAL(SDNode *AddeNode,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Look for multiply add opportunities.
// The pattern is an ISD::UMUL_LOHI followed by two add nodes, where
// each add node consumes a value from ISD::UMUL_LOHI and there is
// a glue link from the first add to the second add.
// If we find this pattern, we can replace the U/SMUL_LOHI, ADDC, and ADDE by
// a S/UMLAL instruction.
//                  UMUL_LOHI
//                 / :lo    \ :hi
//                /          \          [no multiline comment]
//    loAdd -> ADDC           |
//                 \  :glue  /
//                  \       /
//                   ADDE <- hiAdd
//
assert(AddeNode->getOpcode() == ARMISD::ADDE && "Expect an ADDE");
assert(AddeNode->getNumOperands() == 3 &&
AddeNode->getOperand(2).getValueType() == MVT::i32 &&
"ADDE node has the wrong inputs");
// Check that we have a glued ADDC node.
SDNode* AddcNode = AddeNode->getOperand(2).getNode();
if (AddcNode->getOpcode() != ARMISD::ADDC)
return SDValue();
SDValue AddcOp0 = AddcNode->getOperand(0);
SDValue AddcOp1 = AddcNode->getOperand(1);
// Check if the two operands are from the same mul_lohi node.
if (AddcOp0.getNode() == AddcOp1.getNode())
return SDValue();
assert(AddcNode->getNumValues() == 2 &&
AddcNode->getValueType(0) == MVT::i32 &&
"Expect ADDC with two result values. First: i32");
// Check that the ADDC adds the low result of the S/UMUL_LOHI. If not, it
// may be an SMLAL which multiplies two 16-bit values.
if (AddcOp0->getOpcode() != ISD::UMUL_LOHI &&
AddcOp0->getOpcode() != ISD::SMUL_LOHI &&
AddcOp1->getOpcode() != ISD::UMUL_LOHI &&
AddcOp1->getOpcode() != ISD::SMUL_LOHI)
return AddCombineTo64BitSMLAL16(AddcNode, AddeNode, DCI, Subtarget);
// Check for the triangle shape.
SDValue AddeOp0 = AddeNode->getOperand(0);
SDValue AddeOp1 = AddeNode->getOperand(1);
// Make sure that the ADDE operands are not coming from the same node.
if (AddeOp0.getNode() == AddeOp1.getNode())
return SDValue();
// Find the MUL_LOHI node walking up ADDE's operands.
bool IsLeftOperandMUL = false;
SDValue MULOp = findMUL_LOHI(AddeOp0);
if (MULOp == SDValue())
MULOp = findMUL_LOHI(AddeOp1);
else
IsLeftOperandMUL = true;
if (MULOp == SDValue())
return SDValue();
// Figure out the right opcode.
unsigned Opc = MULOp->getOpcode();
unsigned FinalOpc = (Opc == ISD::SMUL_LOHI) ? ARMISD::SMLAL : ARMISD::UMLAL;
// Figure out the high and low input values to the MLAL node.
SDValue* HiAdd = nullptr;
SDValue* LoMul = nullptr;
SDValue* LowAdd = nullptr;
// Ensure that ADDE is from high result of ISD::SMUL_LOHI.
if ((AddeOp0 != MULOp.getValue(1)) && (AddeOp1 != MULOp.getValue(1)))
return SDValue();
if (IsLeftOperandMUL)
HiAdd = &AddeOp1;
else
HiAdd = &AddeOp0;
// Ensure that LoMul and LowAdd are taken from correct ISD::SMUL_LOHI node
// whose low result is fed to the ADDC we are checking.
if (AddcOp0 == MULOp.getValue(0)) {
LoMul = &AddcOp0;
LowAdd = &AddcOp1;
}
if (AddcOp1 == MULOp.getValue(0)) {
LoMul = &AddcOp1;
LowAdd = &AddcOp0;
}
if (!LoMul)
return SDValue();
// Create the merged node.
SelectionDAG &DAG = DCI.DAG;
// Build operand list.
SmallVector<SDValue, 8> Ops;
Ops.push_back(LoMul->getOperand(0));
Ops.push_back(LoMul->getOperand(1));
Ops.push_back(*LowAdd);
Ops.push_back(*HiAdd);
SDValue MLALNode = DAG.getNode(FinalOpc, SDLoc(AddcNode),
DAG.getVTList(MVT::i32, MVT::i32), Ops);
// Replace the ADDs' nodes uses by the MLA node's values.
SDValue HiMLALResult(MLALNode.getNode(), 1);
DAG.ReplaceAllUsesOfValueWith(SDValue(AddeNode, 0), HiMLALResult);
SDValue LoMLALResult(MLALNode.getNode(), 0);
DAG.ReplaceAllUsesOfValueWith(SDValue(AddcNode, 0), LoMLALResult);
// Return original node to notify the driver to stop replacing.
return SDValue(AddeNode, 0);
}
static SDValue AddCombineTo64bitUMAAL(SDNode *AddeNode,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// UMAAL is similar to UMLAL except that it adds two unsigned values.
// While trying to combine for the other MLAL nodes, first search for the
// chance to use UMAAL. Check if Addc uses a node which has already
// been combined into a UMLAL. The other pattern is UMLAL using Addc/Adde
// as the addend, and it's handled in PerformUMLALCombine.
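// (UMAAL computes RdHi:RdLo = Rn * Rm + RdHi + RdLo, i.e. a 64-bit
// unsigned multiply with two 32-bit addends.)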
if (!Subtarget->hasV6Ops() || !Subtarget->hasDSP())
return AddCombineTo64bitMLAL(AddeNode, DCI, Subtarget);
// Check that we have a glued ADDC node.
SDNode* AddcNode = AddeNode->getOperand(2).getNode();
if (AddcNode->getOpcode() != ARMISD::ADDC)
return SDValue();
// Find the converted UMAAL or quit if it doesn't exist.
SDNode *UmlalNode = nullptr;
SDValue AddHi;
if (AddcNode->getOperand(0).getOpcode() == ARMISD::UMLAL) {
UmlalNode = AddcNode->getOperand(0).getNode();
AddHi = AddcNode->getOperand(1);
} else if (AddcNode->getOperand(1).getOpcode() == ARMISD::UMLAL) {
UmlalNode = AddcNode->getOperand(1).getNode();
AddHi = AddcNode->getOperand(0);
} else {
return AddCombineTo64bitMLAL(AddeNode, DCI, Subtarget);
}
// The ADDC should be glued to an ADDE node, which uses the same UMLAL as
// the ADDC as well as Zero.
if (!isNullConstant(UmlalNode->getOperand(3)))
return SDValue();
if ((isNullConstant(AddeNode->getOperand(0)) &&
AddeNode->getOperand(1).getNode() == UmlalNode) ||
(AddeNode->getOperand(0).getNode() == UmlalNode &&
isNullConstant(AddeNode->getOperand(1)))) {
SelectionDAG &DAG = DCI.DAG;
SDValue Ops[] = { UmlalNode->getOperand(0), UmlalNode->getOperand(1),
UmlalNode->getOperand(2), AddHi };
SDValue UMAAL = DAG.getNode(ARMISD::UMAAL, SDLoc(AddcNode),
DAG.getVTList(MVT::i32, MVT::i32), Ops);
// Replace the ADDs' nodes uses by the UMAAL node's values.
DAG.ReplaceAllUsesOfValueWith(SDValue(AddeNode, 0), SDValue(UMAAL.getNode(), 1));
DAG.ReplaceAllUsesOfValueWith(SDValue(AddcNode, 0), SDValue(UMAAL.getNode(), 0));
// Return original node to notify the driver to stop replacing.
return SDValue(AddeNode, 0);
}
return SDValue();
}
static SDValue PerformUMLALCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
if (!Subtarget->hasV6Ops() || !Subtarget->hasDSP())
return SDValue();
// Check that we have a pair of ADDC and ADDE as operands.
// Both addends of the ADDE must be zero.
SDNode* AddcNode = N->getOperand(2).getNode();
SDNode* AddeNode = N->getOperand(3).getNode();
if ((AddcNode->getOpcode() == ARMISD::ADDC) &&
(AddeNode->getOpcode() == ARMISD::ADDE) &&
isNullConstant(AddeNode->getOperand(0)) &&
isNullConstant(AddeNode->getOperand(1)) &&
(AddeNode->getOperand(2).getNode() == AddcNode))
return DAG.getNode(ARMISD::UMAAL, SDLoc(N),
DAG.getVTList(MVT::i32, MVT::i32),
{N->getOperand(0), N->getOperand(1),
AddcNode->getOperand(0), AddcNode->getOperand(1)});
else
return SDValue();
}
static SDValue PerformAddcSubcCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
if (Subtarget->isThumb1Only()) {
SDValue RHS = N->getOperand(1);
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(RHS)) {
int32_t imm = C->getSExtValue();
if (imm < 0 && imm > INT_MIN) {
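// E.g. (ARMISD::ADDC x, -7) is rewritten as (ARMISD::SUBC x, 7), since
// the positive immediate is generally cheaper to materialize on Thumb1.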
SDLoc DL(N);
RHS = DAG.getConstant(-imm, DL, MVT::i32);
unsigned Opcode = (N->getOpcode() == ARMISD::ADDC) ? ARMISD::SUBC
: ARMISD::ADDC;
return DAG.getNode(Opcode, DL, N->getVTList(), N->getOperand(0), RHS);
}
}
}
return SDValue();
}
static SDValue PerformAddeSubeCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
if (Subtarget->isThumb1Only()) {
SDValue RHS = N->getOperand(1);
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(RHS)) {
int64_t imm = C->getSExtValue();
if (imm < 0) {
SDLoc DL(N);
// The with-carry-in form matches bitwise not instead of the negation.
// Effectively, the inverse interpretation of the carry flag already
// accounts for part of the negation.
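// E.g. (ARMISD::ADDE x, -5, carry) becomes (ARMISD::SUBE x, 4, carry),
// since ~(-5) == 4 and x + (-5) + carry == x - 4 - 1 + carry.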
RHS = DAG.getConstant(~imm, DL, MVT::i32);
unsigned Opcode = (N->getOpcode() == ARMISD::ADDE) ? ARMISD::SUBE
: ARMISD::ADDE;
return DAG.getNode(Opcode, DL, N->getVTList(),
N->getOperand(0), RHS, N->getOperand(2));
}
}
}
return SDValue();
}
/// PerformADDECombine - Target-specific dag combine transform from
/// ARMISD::ADDC, ARMISD::ADDE, and ISD::[SU]MUL_LOHI to MLAL, or from
/// ARMISD::ADDC, ARMISD::ADDE and ARMISD::UMLAL to ARMISD::UMAAL.
static SDValue PerformADDECombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Only ARM and Thumb2 support UMLAL/SMLAL.
if (Subtarget->isThumb1Only())
return PerformAddeSubeCombine(N, DCI.DAG, Subtarget);
// Only perform the checks after legalize when the pattern is available.
if (DCI.isBeforeLegalize()) return SDValue();
return AddCombineTo64bitUMAAL(N, DCI, Subtarget);
}
/// PerformADDCombineWithOperands - Try DAG combinations for an ADD with
/// operands N0 and N1. This is a helper for PerformADDCombine that is
/// called with the default operands, and if that fails, with commuted
/// operands.
static SDValue PerformADDCombineWithOperands(SDNode *N, SDValue N0, SDValue N1,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget){
// Attempt to create vpadd for this add.
if (SDValue Result = AddCombineToVPADD(N, N0, N1, DCI, Subtarget))
return Result;
// Attempt to create vpaddl for this add.
if (SDValue Result = AddCombineVUZPToVPADDL(N, N0, N1, DCI, Subtarget))
return Result;
if (SDValue Result = AddCombineBUILD_VECTORToVPADDL(N, N0, N1, DCI,
Subtarget))
return Result;
// fold (add (select cc, 0, c), x) -> (select cc, x, (add, x, c))
if (N0.getNode()->hasOneUse())
if (SDValue Result = combineSelectAndUse(N, N0, N1, DCI))
return Result;
return SDValue();
}
/// PerformADDCombine - Target-specific dag combine xforms for ISD::ADD.
///
static SDValue PerformADDCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
// First try with the default operand order.
if (SDValue Result = PerformADDCombineWithOperands(N, N0, N1, DCI, Subtarget))
return Result;
// If that didn't work, try again with the operands commuted.
return PerformADDCombineWithOperands(N, N1, N0, DCI, Subtarget);
}
/// PerformSUBCombine - Target-specific dag combine xforms for ISD::SUB.
///
static SDValue PerformSUBCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
// fold (sub x, (select cc, 0, c)) -> (select cc, x, (sub, x, c))
if (N1.getNode()->hasOneUse())
if (SDValue Result = combineSelectAndUse(N, N1, N0, DCI))
return Result;
return SDValue();
}
/// PerformVMULCombine
/// Distribute (A + B) * C to (A * C) + (B * C) to take advantage of the
/// special multiplier accumulator forwarding.
/// vmul d3, d0, d2
/// vmla d3, d1, d2
/// is faster than
/// vadd d3, d0, d1
/// vmul d3, d3, d2
// However, for (A + B) * (A + B),
// vadd d2, d0, d1
// vmul d3, d0, d2
// vmla d3, d1, d2
// is slower than
// vadd d2, d0, d1
// vmul d3, d2, d2
static SDValue PerformVMULCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
if (!Subtarget->hasVMLxForwarding())
return SDValue();
SelectionDAG &DAG = DCI.DAG;
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
unsigned Opcode = N0.getOpcode();
if (Opcode != ISD::ADD && Opcode != ISD::SUB &&
Opcode != ISD::FADD && Opcode != ISD::FSUB) {
Opcode = N1.getOpcode();
if (Opcode != ISD::ADD && Opcode != ISD::SUB &&
Opcode != ISD::FADD && Opcode != ISD::FSUB)
return SDValue();
std::swap(N0, N1);
}
if (N0 == N1)
return SDValue();
EVT VT = N->getValueType(0);
SDLoc DL(N);
SDValue N00 = N0->getOperand(0);
SDValue N01 = N0->getOperand(1);
return DAG.getNode(Opcode, DL, VT,
DAG.getNode(ISD::MUL, DL, VT, N00, N1),
DAG.getNode(ISD::MUL, DL, VT, N01, N1));
}
static SDValue PerformMULCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
SelectionDAG &DAG = DCI.DAG;
if (Subtarget->isThumb1Only())
return SDValue();
if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
return SDValue();
EVT VT = N->getValueType(0);
if (VT.is64BitVector() || VT.is128BitVector())
return PerformVMULCombine(N, DCI, Subtarget);
if (VT != MVT::i32)
return SDValue();
ConstantSDNode *C = dyn_cast<ConstantSDNode>(N->getOperand(1));
if (!C)
return SDValue();
int64_t MulAmt = C->getSExtValue();
unsigned ShiftAmt = countTrailingZeros<uint64_t>(MulAmt);
ShiftAmt = ShiftAmt & (32 - 1);
SDValue V = N->getOperand(0);
SDLoc DL(N);
SDValue Res;
MulAmt >>= ShiftAmt;
if (MulAmt >= 0) {
if (isPowerOf2_32(MulAmt - 1)) {
// (mul x, 2^N + 1) => (add (shl x, N), x)
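// e.g. (mul x, 9) => (add (shl x, 3), x)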
Res = DAG.getNode(ISD::ADD, DL, VT,
V,
DAG.getNode(ISD::SHL, DL, VT,
V,
DAG.getConstant(Log2_32(MulAmt - 1), DL,
MVT::i32)));
} else if (isPowerOf2_32(MulAmt + 1)) {
// (mul x, 2^N - 1) => (sub (shl x, N), x)
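// e.g. (mul x, 7) => (sub (shl x, 3), x)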
Res = DAG.getNode(ISD::SUB, DL, VT,
DAG.getNode(ISD::SHL, DL, VT,
V,
DAG.getConstant(Log2_32(MulAmt + 1), DL,
MVT::i32)),
V);
} else
return SDValue();
} else {
uint64_t MulAmtAbs = -MulAmt;
if (isPowerOf2_32(MulAmtAbs + 1)) {
// (mul x, -(2^N - 1)) => (sub x, (shl x, N))
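// e.g. (mul x, -7) => (sub x, (shl x, 3))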
Res = DAG.getNode(ISD::SUB, DL, VT,
V,
DAG.getNode(ISD::SHL, DL, VT,
V,
DAG.getConstant(Log2_32(MulAmtAbs + 1), DL,
MVT::i32)));
} else if (isPowerOf2_32(MulAmtAbs - 1)) {
// (mul x, -(2^N + 1)) => - (add (shl x, N), x)
Res = DAG.getNode(ISD::ADD, DL, VT,
V,
DAG.getNode(ISD::SHL, DL, VT,
V,
DAG.getConstant(Log2_32(MulAmtAbs - 1), DL,
MVT::i32)));
Res = DAG.getNode(ISD::SUB, DL, VT,
DAG.getConstant(0, DL, MVT::i32), Res);
} else
return SDValue();
}
if (ShiftAmt != 0)
Res = DAG.getNode(ISD::SHL, DL, VT,
Res, DAG.getConstant(ShiftAmt, DL, MVT::i32));
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, Res, false);
return SDValue();
}
static SDValue PerformANDCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Attempt to use immediate-form VBIC
BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(N->getOperand(1));
SDLoc dl(N);
EVT VT = N->getValueType(0);
SelectionDAG &DAG = DCI.DAG;
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
APInt SplatBits, SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
if (BVN &&
BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs)) {
if (SplatBitSize <= 64) {
EVT VbicVT;
SDValue Val = isNEONModifiedImm((~SplatBits).getZExtValue(),
SplatUndef.getZExtValue(), SplatBitSize,
DAG, dl, VbicVT, VT.is128BitVector(),
OtherModImm);
if (Val.getNode()) {
SDValue Input =
DAG.getNode(ISD::BITCAST, dl, VbicVT, N->getOperand(0));
SDValue Vbic = DAG.getNode(ARMISD::VBICIMM, dl, VbicVT, Input, Val);
return DAG.getNode(ISD::BITCAST, dl, VT, Vbic);
}
}
}
if (!Subtarget->isThumb1Only()) {
// fold (and (select cc, -1, c), x) -> (select cc, x, (and, x, c))
if (SDValue Result = combineSelectAndUseCommutative(N, true, DCI))
return Result;
}
return SDValue();
}
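// Illustrative case for the VBIC path above (assuming a legal v4i32 type):
//   (and x, splat(0xffffff00)): the complement 0x000000ff is encodable as a
//   NEON modified immediate, so this becomes (VBICIMM x, #0xff), i.e.
//   vbic.i32 qN, #0xff.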
// Try combining OR nodes to SMULWB, SMULWT.
static SDValue PerformORCombineToSMULWBT(SDNode *OR,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
if (!Subtarget->hasV6Ops() ||
(Subtarget->isThumb() &&
(!Subtarget->hasThumb2() || !Subtarget->hasDSP())))
return SDValue();
SDValue SRL = OR->getOperand(0);
SDValue SHL = OR->getOperand(1);
if (SRL.getOpcode() != ISD::SRL || SHL.getOpcode() != ISD::SHL) {
SRL = OR->getOperand(1);
SHL = OR->getOperand(0);
}
if (!isSRL16(SRL) || !isSHL16(SHL))
return SDValue();
// The first operands to the shifts need to be the two results from the
// same smul_lohi node.
if ((SRL.getOperand(0).getNode() != SHL.getOperand(0).getNode()) ||
SRL.getOperand(0).getOpcode() != ISD::SMUL_LOHI)
return SDValue();
SDNode *SMULLOHI = SRL.getOperand(0).getNode();
if (SRL.getOperand(0) != SDValue(SMULLOHI, 0) ||
SHL.getOperand(0) != SDValue(SMULLOHI, 1))
return SDValue();
// Now we have:
// (or (srl (smul_lohi ?, ?), 16), (shl (smul_lohi ?, ?), 16))
// For SMULW[B|T], smul_lohi takes a 32-bit and a 16-bit argument.
// For SMULWB the 16-bit value will be sign extended somehow.
// For SMULWT only the SRA is required.
// Check both sides of SMUL_LOHI.
SDValue OpS16 = SMULLOHI->getOperand(0);
SDValue OpS32 = SMULLOHI->getOperand(1);
SelectionDAG &DAG = DCI.DAG;
if (!isS16(OpS16, DAG) && !isSRA16(OpS16)) {
OpS16 = OpS32;
OpS32 = SMULLOHI->getOperand(0);
}
SDLoc dl(OR);
unsigned Opcode = 0;
if (isS16(OpS16, DAG))
Opcode = ARMISD::SMULWB;
else if (isSRA16(OpS16)) {
Opcode = ARMISD::SMULWT;
OpS16 = OpS16->getOperand(0);
}
else
return SDValue();
SDValue Res = DAG.getNode(Opcode, dl, MVT::i32, OpS32, OpS16);
DAG.ReplaceAllUsesOfValueWith(SDValue(OR, 0), Res);
return SDValue(OR, 0);
}
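// Shape matched above (illustrative): for i32 x and y,
//   (or (srl (smul_lohi x, y):lo, 16), (shl (smul_lohi x, y):hi, 16))
// selects bits [47:16] of the 64-bit product, which is exactly the
// 32x16->32 result that smulwb/smulwt produce for a 16-bit y.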
/// PerformORCombine - Target-specific dag combine xforms for ISD::OR
static SDValue PerformORCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// Attempt to use immediate-form VORR
BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(N->getOperand(1));
SDLoc dl(N);
EVT VT = N->getValueType(0);
SelectionDAG &DAG = DCI.DAG;
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
APInt SplatBits, SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
if (BVN && Subtarget->hasNEON() &&
BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs)) {
if (SplatBitSize <= 64) {
EVT VorrVT;
SDValue Val = isNEONModifiedImm(SplatBits.getZExtValue(),
SplatUndef.getZExtValue(), SplatBitSize,
DAG, dl, VorrVT, VT.is128BitVector(),
OtherModImm);
if (Val.getNode()) {
SDValue Input =
DAG.getNode(ISD::BITCAST, dl, VorrVT, N->getOperand(0));
SDValue Vorr = DAG.getNode(ARMISD::VORRIMM, dl, VorrVT, Input, Val);
return DAG.getNode(ISD::BITCAST, dl, VT, Vorr);
}
}
}
if (!Subtarget->isThumb1Only()) {
// fold (or (select cc, 0, c), x) -> (select cc, x, (or x, c))
if (SDValue Result = combineSelectAndUseCommutative(N, false, DCI))
return Result;
if (SDValue Result = PerformORCombineToSMULWBT(N, DCI, Subtarget))
return Result;
}
// The code below optimizes (or (and X, Y), Z).
// The AND operand needs to have a single user to make these optimizations
// profitable.
SDValue N0 = N->getOperand(0);
if (N0.getOpcode() != ISD::AND || !N0.hasOneUse())
return SDValue();
SDValue N1 = N->getOperand(1);
// (or (and B, A), (and C, ~A)) => (VBSL A, B, C) when A is a constant.
if (Subtarget->hasNEON() && N1.getOpcode() == ISD::AND && VT.isVector() &&
DAG.getTargetLoweringInfo().isTypeLegal(VT)) {
APInt SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
APInt SplatBits0, SplatBits1;
BuildVectorSDNode *BVN0 = dyn_cast<BuildVectorSDNode>(N0->getOperand(1));
BuildVectorSDNode *BVN1 = dyn_cast<BuildVectorSDNode>(N1->getOperand(1));
// Ensure that the second operands of both ANDs are constants.
if (BVN0 && BVN0->isConstantSplat(SplatBits0, SplatUndef, SplatBitSize,
HasAnyUndefs) && !HasAnyUndefs) {
if (BVN1 && BVN1->isConstantSplat(SplatBits1, SplatUndef, SplatBitSize,
HasAnyUndefs) && !HasAnyUndefs) {
// Ensure that the bit widths of the constants are the same and that
// the splat arguments are logical inverses as per the pattern we
// are trying to simplify.
if (SplatBits0.getBitWidth() == SplatBits1.getBitWidth() &&
SplatBits0 == ~SplatBits1) {
// Canonicalize the vector type to make instruction selection
// simpler.
EVT CanonicalVT = VT.is128BitVector() ? MVT::v4i32 : MVT::v2i32;
SDValue Result = DAG.getNode(ARMISD::VBSL, dl, CanonicalVT,
N0->getOperand(1),
N0->getOperand(0),
N1->getOperand(0));
return DAG.getNode(ISD::BITCAST, dl, VT, Result);
}
}
}
}
// Try to use the ARM/Thumb2 BFI (bitfield insert) instruction when
// reasonable.
// BFI is only available on V6T2+
if (Subtarget->isThumb1Only() || !Subtarget->hasV6T2Ops())
return SDValue();
SDLoc DL(N);
// 1) or (and A, mask), val => ARMbfi A, val, mask
// iff (val & mask) == val
//
// 2) or (and A, mask), (and B, mask2) => ARMbfi A, (lsr B, amt), mask
// 2a) iff isBitFieldInvertedMask(mask) && isBitFieldInvertedMask(~mask2)
// && mask == ~mask2
// 2b) iff isBitFieldInvertedMask(~mask) && isBitFieldInvertedMask(mask2)
// && ~mask == mask2
// (i.e., copy a bitfield value into another bitfield of the same width)
if (VT != MVT::i32)
return SDValue();
SDValue N00 = N0.getOperand(0);
// The value and the mask need to be constants so we can verify this is
// actually a bitfield set. If the mask is 0xffff, we can do better
// via a movt instruction, so don't use BFI in that case.
SDValue MaskOp = N0.getOperand(1);
ConstantSDNode *MaskC = dyn_cast<ConstantSDNode>(MaskOp);
if (!MaskC)
return SDValue();
unsigned Mask = MaskC->getZExtValue();
if (Mask == 0xffff)
return SDValue();
SDValue Res;
// Case (1): or (and A, mask), val => ARMbfi A, val, mask
ConstantSDNode *N1C = dyn_cast<ConstantSDNode>(N1);
if (N1C) {
unsigned Val = N1C->getZExtValue();
if ((Val & ~Mask) != Val)
return SDValue();
if (ARM::isBitFieldInvertedMask(Mask)) {
Val >>= countTrailingZeros(~Mask);
Res = DAG.getNode(ARMISD::BFI, DL, VT, N00,
DAG.getConstant(Val, DL, MVT::i32),
DAG.getConstant(Mask, DL, MVT::i32));
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, Res, false);
return SDValue();
}
} else if (N1.getOpcode() == ISD::AND) {
// case (2) or (and A, mask), (and B, mask2) => ARMbfi A, (lsr B, amt), mask
ConstantSDNode *N11C = dyn_cast<ConstantSDNode>(N1.getOperand(1));
if (!N11C)
return SDValue();
unsigned Mask2 = N11C->getZExtValue();
// Mask and ~Mask2 (or the reverse) must be equivalent for the BFI
// pattern to match as-is.
if (ARM::isBitFieldInvertedMask(Mask) &&
(Mask == ~Mask2)) {
// The pack halfword instruction works better for masks that fit it,
// so use that when it's available.
if (Subtarget->hasDSP() &&
(Mask == 0xffff || Mask == 0xffff0000))
return SDValue();
// 2a
unsigned amt = countTrailingZeros(Mask2);
Res = DAG.getNode(ISD::SRL, DL, VT, N1.getOperand(0),
DAG.getConstant(amt, DL, MVT::i32));
Res = DAG.getNode(ARMISD::BFI, DL, VT, N00, Res,
DAG.getConstant(Mask, DL, MVT::i32));
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, Res, false);
return SDValue();
} else if (ARM::isBitFieldInvertedMask(~Mask) &&
(~Mask == Mask2)) {
// The pack halfword instruction works better for masks that fit it,
// so use that when it's available.
if (Subtarget->hasDSP() &&
(Mask2 == 0xffff || Mask2 == 0xffff0000))
return SDValue();
// 2b
unsigned lsb = countTrailingZeros(Mask);
Res = DAG.getNode(ISD::SRL, DL, VT, N00,
DAG.getConstant(lsb, DL, MVT::i32));
Res = DAG.getNode(ARMISD::BFI, DL, VT, N1.getOperand(0), Res,
DAG.getConstant(Mask2, DL, MVT::i32));
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, Res, false);
return SDValue();
}
}
if (DAG.MaskedValueIsZero(N1, MaskC->getAPIntValue()) &&
N00.getOpcode() == ISD::SHL && isa<ConstantSDNode>(N00.getOperand(1)) &&
ARM::isBitFieldInvertedMask(~Mask)) {
// Case (3): or (and (shl A, #shamt), mask), B => ARMbfi B, A, ~mask
// where lsb(mask) == #shamt and masked bits of B are known zero.
SDValue ShAmt = N00.getOperand(1);
unsigned ShAmtC = cast<ConstantSDNode>(ShAmt)->getZExtValue();
unsigned LSB = countTrailingZeros(Mask);
if (ShAmtC != LSB)
return SDValue();
Res = DAG.getNode(ARMISD::BFI, DL, VT, N1, N00.getOperand(0),
DAG.getConstant(~Mask, DL, MVT::i32));
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, Res, false);
}
return SDValue();
}
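// Worked instance of case (1) above (illustrative):
//   (or (and A, 0xffff00ff), 0x00001200)
// ~Mask = 0x0000ff00 is contiguous, and Val = 0x1200 satisfies
// (Val & ~Mask) == Val; after Val >>= 8 this becomes
// (ARMbfi A, 0x12, 0xffff00ff), a single bitfield insert at lsb 8, width 8.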
static SDValue PerformXORCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
EVT VT = N->getValueType(0);
SelectionDAG &DAG = DCI.DAG;
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
if (!Subtarget->isThumb1Only()) {
// fold (xor (select cc, 0, c), x) -> (select cc, x, (xor x, c))
if (SDValue Result = combineSelectAndUseCommutative(N, false, DCI))
return Result;
}
return SDValue();
}
// ParseBFI - Given a BFI instruction in N, extract the "from" value (Rn) and
// return it, and fill in FromMask and ToMask with the (consecutive) bits in
// "from" to be extracted and their positions in "to" (Rd).
static SDValue ParseBFI(SDNode *N, APInt &ToMask, APInt &FromMask) {
assert(N->getOpcode() == ARMISD::BFI);
SDValue From = N->getOperand(1);
ToMask = ~cast<ConstantSDNode>(N->getOperand(2))->getAPIntValue();
FromMask = APInt::getLowBitsSet(ToMask.getBitWidth(), ToMask.countPopulation());
// If the Base came from a SHR #C, we can deduce that it is really testing bit
// #C in the base of the SHR.
if (From->getOpcode() == ISD::SRL &&
isa<ConstantSDNode>(From->getOperand(1))) {
APInt Shift = cast<ConstantSDNode>(From->getOperand(1))->getAPIntValue();
assert(Shift.getLimitedValue() < 32 && "Shift too large!");
FromMask <<= Shift.getLimitedValue(31);
From = From->getOperand(0);
}
return From;
}
// If A and B contain one contiguous set of bits, does A | B == A . B?
// (I.e., do the set bits of A sit directly above those of B, so that the
// union forms one contiguous run?) Neither A nor B may be zero.
static bool BitsProperlyConcatenate(const APInt &A, const APInt &B) {
unsigned LastActiveBitInA = A.countTrailingZeros();
unsigned FirstActiveBitInB = B.getBitWidth() - B.countLeadingZeros() - 1;
return LastActiveBitInA - 1 == FirstActiveBitInB;
}
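// Example: A = 0b1100 and B = 0b0011 concatenate properly: the lowest set bit
// of A (index 2) sits directly above the highest set bit of B (index 1), so
// A | B = 0b1111 is one contiguous run. A = 0b1100, B = 0b0001 would not.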
static SDValue FindBFIToCombineWith(SDNode *N) {
// We have a BFI in N. Follow a possible chain of BFIs and find a BFI it can combine with,
// if one exists.
APInt ToMask, FromMask;
SDValue From = ParseBFI(N, ToMask, FromMask);
SDValue To = N->getOperand(0);
// Now check for a compatible BFI to merge with. We can pass through BFIs that
// aren't compatible, but not if they set the same bit in their destination as
// we do (or that of any BFI we're going to combine with).
SDValue V = To;
APInt CombinedToMask = ToMask;
while (V.getOpcode() == ARMISD::BFI) {
APInt NewToMask, NewFromMask;
SDValue NewFrom = ParseBFI(V.getNode(), NewToMask, NewFromMask);
if (NewFrom != From) {
// This BFI has a different base. Keep going.
CombinedToMask |= NewToMask;
V = V.getOperand(0);
continue;
}
// Do the written bits conflict with any we've seen so far?
if ((NewToMask & CombinedToMask).getBoolValue())
// Conflicting bits - bail out because going further is unsafe.
return SDValue();
// Are the new bits contiguous when combined with the old bits?
if (BitsProperlyConcatenate(ToMask, NewToMask) &&
BitsProperlyConcatenate(FromMask, NewFromMask))
return V;
if (BitsProperlyConcatenate(NewToMask, ToMask) &&
BitsProperlyConcatenate(NewFromMask, FromMask))
return V;
// We've seen a write to some bits, so track it.
CombinedToMask |= NewToMask;
// Keep going...
V = V.getOperand(0);
}
return SDValue();
}
static SDValue PerformBFICombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
SDValue N1 = N->getOperand(1);
if (N1.getOpcode() == ISD::AND) {
// (bfi A, (and B, Mask1), Mask2) -> (bfi A, B, Mask2) iff
// the bits being cleared by the AND are not demanded by the BFI.
ConstantSDNode *N11C = dyn_cast<ConstantSDNode>(N1.getOperand(1));
if (!N11C)
return SDValue();
unsigned InvMask = cast<ConstantSDNode>(N->getOperand(2))->getZExtValue();
unsigned LSB = countTrailingZeros(~InvMask);
unsigned Width = (32 - countLeadingZeros(~InvMask)) - LSB;
assert(Width <
static_cast<unsigned>(std::numeric_limits<unsigned>::digits) &&
"undefined behavior");
unsigned Mask = (1u << Width) - 1;
unsigned Mask2 = N11C->getZExtValue();
if ((Mask & (~Mask2)) == 0)
return DCI.DAG.getNode(ARMISD::BFI, SDLoc(N), N->getValueType(0),
N->getOperand(0), N1.getOperand(0),
N->getOperand(2));
} else if (N->getOperand(0).getOpcode() == ARMISD::BFI) {
// We have a BFI of a BFI. Walk up the BFI chain to see how long it goes.
// Keep track of any consecutive bits set that all come from the same base
// value. We can combine these together into a single BFI.
SDValue CombineBFI = FindBFIToCombineWith(N);
if (CombineBFI == SDValue())
return SDValue();
// We've found a BFI.
APInt ToMask1, FromMask1;
SDValue From1 = ParseBFI(N, ToMask1, FromMask1);
APInt ToMask2, FromMask2;
SDValue From2 = ParseBFI(CombineBFI.getNode(), ToMask2, FromMask2);
assert(From1 == From2);
(void)From2;
// First, unlink CombineBFI.
DCI.DAG.ReplaceAllUsesWith(CombineBFI, CombineBFI.getOperand(0));
// Then create a new BFI, combining the two together.
APInt NewFromMask = FromMask1 | FromMask2;
APInt NewToMask = ToMask1 | ToMask2;
EVT VT = N->getValueType(0);
SDLoc dl(N);
if (NewFromMask[0] == 0)
From1 = DCI.DAG.getNode(
ISD::SRL, dl, VT, From1,
DCI.DAG.getConstant(NewFromMask.countTrailingZeros(), dl, VT));
return DCI.DAG.getNode(ARMISD::BFI, dl, VT, N->getOperand(0), From1,
DCI.DAG.getConstant(~NewToMask, dl, VT));
}
return SDValue();
}
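// Sketch of the BFI-of-BFI merge above (illustrative):
//   (bfi (bfi A, From, ~0x000000ff), (srl From, 8), ~0x0000ff00)
// writes bits [7:0] and [15:8] of the destination from adjacent bits of the
// same base, so it collapses to the single (bfi A, From, ~0x0000ffff).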
/// PerformVMOVRRDCombine - Target-specific dag combine xforms for
/// ARMISD::VMOVRRD.
static SDValue PerformVMOVRRDCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// vmovrrd(vmovdrr x, y) -> x,y
SDValue InDouble = N->getOperand(0);
if (InDouble.getOpcode() == ARMISD::VMOVDRR && !Subtarget->isFPOnlySP())
return DCI.CombineTo(N, InDouble.getOperand(0), InDouble.getOperand(1));
// vmovrrd(load f64) -> (load i32), (load i32)
SDNode *InNode = InDouble.getNode();
if (ISD::isNormalLoad(InNode) && InNode->hasOneUse() &&
InNode->getValueType(0) == MVT::f64 &&
InNode->getOperand(1).getOpcode() == ISD::FrameIndex &&
!cast<LoadSDNode>(InNode)->isVolatile()) {
// TODO: Should this be done for non-FrameIndex operands?
LoadSDNode *LD = cast<LoadSDNode>(InNode);
SelectionDAG &DAG = DCI.DAG;
SDLoc DL(LD);
SDValue BasePtr = LD->getBasePtr();
SDValue NewLD1 =
DAG.getLoad(MVT::i32, DL, LD->getChain(), BasePtr, LD->getPointerInfo(),
LD->getAlignment(), LD->getMemOperand()->getFlags());
SDValue OffsetPtr = DAG.getNode(ISD::ADD, DL, MVT::i32, BasePtr,
DAG.getConstant(4, DL, MVT::i32));
SDValue NewLD2 = DAG.getLoad(
MVT::i32, DL, NewLD1.getValue(1), OffsetPtr, LD->getPointerInfo(),
std::min(4U, LD->getAlignment() / 2), LD->getMemOperand()->getFlags());
DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), NewLD2.getValue(1));
if (DCI.DAG.getDataLayout().isBigEndian())
std::swap (NewLD1, NewLD2);
SDValue Result = DCI.CombineTo(N, NewLD1, NewLD2);
return Result;
}
return SDValue();
}
/// PerformVMOVDRRCombine - Target-specific dag combine xforms for
/// ARMISD::VMOVDRR. This is also used for BUILD_VECTORs with 2 operands.
static SDValue PerformVMOVDRRCombine(SDNode *N, SelectionDAG &DAG) {
// N=vmovrrd(X); vmovdrr(N:0, N:1) -> bit_convert(X)
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
if (Op0.getOpcode() == ISD::BITCAST)
Op0 = Op0.getOperand(0);
if (Op1.getOpcode() == ISD::BITCAST)
Op1 = Op1.getOperand(0);
if (Op0.getOpcode() == ARMISD::VMOVRRD &&
Op0.getNode() == Op1.getNode() &&
Op0.getResNo() == 0 && Op1.getResNo() == 1)
return DAG.getNode(ISD::BITCAST, SDLoc(N),
N->getValueType(0), Op0.getOperand(0));
return SDValue();
}
/// hasNormalLoadOperand - Check if any of the operands of a BUILD_VECTOR node
/// are normal, non-volatile loads. If so, it is profitable to bitcast an
/// i64 vector to have f64 elements, since the value can then be loaded
/// directly into a VFP register.
static bool hasNormalLoadOperand(SDNode *N) {
unsigned NumElts = N->getValueType(0).getVectorNumElements();
for (unsigned i = 0; i < NumElts; ++i) {
SDNode *Elt = N->getOperand(i).getNode();
if (ISD::isNormalLoad(Elt) && !cast<LoadSDNode>(Elt)->isVolatile())
return true;
}
return false;
}
/// PerformBUILD_VECTORCombine - Target-specific dag combine xforms for
/// ISD::BUILD_VECTOR.
static SDValue PerformBUILD_VECTORCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,
const ARMSubtarget *Subtarget) {
// build_vector(N=ARMISD::VMOVRRD(X), N:1) -> bit_convert(X):
// VMOVRRD is introduced when legalizing i64 types. It forces the i64 value
// into a pair of GPRs, which is fine when the value is used as a scalar,
// but if the i64 value is converted to a vector, we need to undo the VMOVRRD.
SelectionDAG &DAG = DCI.DAG;
if (N->getNumOperands() == 2)
if (SDValue RV = PerformVMOVDRRCombine(N, DAG))
return RV;
// Load i64 elements as f64 values so that type legalization does not split
// them up into i32 values.
EVT VT = N->getValueType(0);
if (VT.getVectorElementType() != MVT::i64 || !hasNormalLoadOperand(N))
return SDValue();
SDLoc dl(N);
SmallVector<SDValue, 8> Ops;
unsigned NumElts = VT.getVectorNumElements();
for (unsigned i = 0; i < NumElts; ++i) {
SDValue V = DAG.getNode(ISD::BITCAST, dl, MVT::f64, N->getOperand(i));
Ops.push_back(V);
// Make the DAGCombiner fold the bitcast.
DCI.AddToWorklist(V.getNode());
}
EVT FloatVT = EVT::getVectorVT(*DAG.getContext(), MVT::f64, NumElts);
SDValue BV = DAG.getBuildVector(FloatVT, dl, Ops);
return DAG.getNode(ISD::BITCAST, dl, VT, BV);
}
/// \brief Target-specific dag combine xforms for ARMISD::BUILD_VECTOR.
static SDValue
PerformARMBUILD_VECTORCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
// ARMISD::BUILD_VECTOR is introduced when legalizing ISD::BUILD_VECTOR.
// At that time, we may have inserted bitcasts from integer to float.
// If these bitcasts have survived DAGCombine, change the lowering of this
// BUILD_VECTOR in something more vector friendly, i.e., that does not
// force to use floating point types.
// Make sure we can change the type of the vector.
// This is possible iff:
// 1. The vector is only used in a bitcast to an integer type. I.e.,
// 1.1. Vector is used only once.
// 1.2. Use is a bit convert to an integer type.
// 2. The size of its operands are 32-bits (64-bits are not legal).
EVT VT = N->getValueType(0);
EVT EltVT = VT.getVectorElementType();
// Check 1.1. and 2.
if (EltVT.getSizeInBits() != 32 || !N->hasOneUse())
return SDValue();
// By construction, the input type must be float.
assert(EltVT == MVT::f32 && "Unexpected type!");
// Check 1.2.
SDNode *Use = *N->use_begin();
if (Use->getOpcode() != ISD::BITCAST ||
Use->getValueType(0).isFloatingPoint())
return SDValue();
// Check profitability.
// The model is: if more than half of the relevant operands are bitcast from
// i32, turn the build_vector into a sequence of insert_vector_elt.
// Relevant operands are everything that is not statically
// (i.e., at compile time) bitcasted.
unsigned NumOfBitCastedElts = 0;
unsigned NumElts = VT.getVectorNumElements();
unsigned NumOfRelevantElts = NumElts;
for (unsigned Idx = 0; Idx < NumElts; ++Idx) {
SDValue Elt = N->getOperand(Idx);
if (Elt->getOpcode() == ISD::BITCAST) {
// Assume only bit cast to i32 will go away.
if (Elt->getOperand(0).getValueType() == MVT::i32)
++NumOfBitCastedElts;
} else if (Elt.isUndef() || isa<ConstantSDNode>(Elt))
// Constants are statically casted, thus do not count them as
// relevant operands.
--NumOfRelevantElts;
}
// Check if more than half of the elements require a non-free bitcast.
if (NumOfBitCastedElts <= NumOfRelevantElts / 2)
return SDValue();
SelectionDAG &DAG = DCI.DAG;
// Create the new vector type.
EVT VecVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32, NumElts);
// Check if the type is legal.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isTypeLegal(VecVT))
return SDValue();
// Combine:
// ARMISD::BUILD_VECTOR E1, E2, ..., EN.
// => BITCAST INSERT_VECTOR_ELT
// (INSERT_VECTOR_ELT (...), (BITCAST EN-1), N-1),
// (BITCAST EN), N.
SDValue Vec = DAG.getUNDEF(VecVT);
SDLoc dl(N);
for (unsigned Idx = 0 ; Idx < NumElts; ++Idx) {
SDValue V = N->getOperand(Idx);
if (V.isUndef())
continue;
if (V.getOpcode() == ISD::BITCAST &&
V->getOperand(0).getValueType() == MVT::i32)
// Fold obvious case.
V = V.getOperand(0);
else {
V = DAG.getNode(ISD::BITCAST, SDLoc(V), MVT::i32, V);
// Make the DAGCombiner fold the bitcasts.
DCI.AddToWorklist(V.getNode());
}
SDValue LaneIdx = DAG.getConstant(Idx, dl, MVT::i32);
Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VecVT, Vec, V, LaneIdx);
}
Vec = DAG.getNode(ISD::BITCAST, dl, VT, Vec);
// Make the DAGCombiner fold the bitcasts.
DCI.AddToWorklist(Vec.getNode());
return Vec;
}
/// PerformInsertEltCombine - Target-specific dag combine xforms for
/// ISD::INSERT_VECTOR_ELT.
static SDValue PerformInsertEltCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
// Bitcast an i64 load inserted into a vector to f64.
// Otherwise, the i64 value will be legalized to a pair of i32 values.
EVT VT = N->getValueType(0);
SDNode *Elt = N->getOperand(1).getNode();
if (VT.getVectorElementType() != MVT::i64 ||
!ISD::isNormalLoad(Elt) || cast<LoadSDNode>(Elt)->isVolatile())
return SDValue();
SelectionDAG &DAG = DCI.DAG;
SDLoc dl(N);
EVT FloatVT = EVT::getVectorVT(*DAG.getContext(), MVT::f64,
VT.getVectorNumElements());
SDValue Vec = DAG.getNode(ISD::BITCAST, dl, FloatVT, N->getOperand(0));
SDValue V = DAG.getNode(ISD::BITCAST, dl, MVT::f64, N->getOperand(1));
// Make the DAGCombiner fold the bitcasts.
DCI.AddToWorklist(Vec.getNode());
DCI.AddToWorklist(V.getNode());
SDValue InsElt = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, FloatVT,
Vec, V, N->getOperand(2));
return DAG.getNode(ISD::BITCAST, dl, VT, InsElt);
}
/// PerformVECTOR_SHUFFLECombine - Target-specific dag combine xforms for
/// ISD::VECTOR_SHUFFLE.
static SDValue PerformVECTOR_SHUFFLECombine(SDNode *N, SelectionDAG &DAG) {
// The LLVM shufflevector instruction does not require the shuffle mask
// length to match the operand vector length, but ISD::VECTOR_SHUFFLE does
// have that requirement. When translating to ISD::VECTOR_SHUFFLE, if the
// operands do not match the mask length, they are extended by concatenating
// them with undef vectors. That is probably the right thing for other
// targets, but for NEON it is better to concatenate two double-register
// size vector operands into a single quad-register size vector. Do that
// transformation here:
// shuffle(concat(v1, undef), concat(v2, undef)) ->
// shuffle(concat(v1, v2), undef)
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
if (Op0.getOpcode() != ISD::CONCAT_VECTORS ||
Op1.getOpcode() != ISD::CONCAT_VECTORS ||
Op0.getNumOperands() != 2 ||
Op1.getNumOperands() != 2)
return SDValue();
SDValue Concat0Op1 = Op0.getOperand(1);
SDValue Concat1Op1 = Op1.getOperand(1);
if (!Concat0Op1.isUndef() || !Concat1Op1.isUndef())
return SDValue();
// Skip the transformation if any of the types are illegal.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
EVT VT = N->getValueType(0);
if (!TLI.isTypeLegal(VT) ||
!TLI.isTypeLegal(Concat0Op1.getValueType()) ||
!TLI.isTypeLegal(Concat1Op1.getValueType()))
return SDValue();
SDValue NewConcat = DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT,
Op0.getOperand(0), Op1.getOperand(0));
// Translate the shuffle mask.
SmallVector<int, 16> NewMask;
unsigned NumElts = VT.getVectorNumElements();
unsigned HalfElts = NumElts/2;
ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(N);
for (unsigned n = 0; n < NumElts; ++n) {
int MaskElt = SVN->getMaskElt(n);
int NewElt = -1;
if (MaskElt < (int)HalfElts)
NewElt = MaskElt;
else if (MaskElt >= (int)NumElts && MaskElt < (int)(NumElts + HalfElts))
NewElt = HalfElts + MaskElt - NumElts;
NewMask.push_back(NewElt);
}
return DAG.getVectorShuffle(VT, SDLoc(N), NewConcat,
DAG.getUNDEF(VT), NewMask);
}
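// Mask translation example (illustrative, NumElts = 4, HalfElts = 2):
//   shuffle(concat(v1, undef), concat(v2, undef)), mask <0, 1, 4, 5>
//   --> shuffle(concat(v1, v2), undef), mask <0, 1, 2, 3>
// Lanes drawn from v2 (indices NumElts..NumElts+HalfElts-1) are remapped to
// HalfElts + MaskElt - NumElts; lanes from either undef half become -1.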
/// CombineBaseUpdate - Target-specific DAG combine function for VLDDUP,
/// NEON load/store intrinsics, and generic vector load/stores, to merge
/// base address updates.
/// For generic load/stores, the memory type is assumed to be a vector.
/// The caller is assumed to have checked legality.
static SDValue CombineBaseUpdate(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
SelectionDAG &DAG = DCI.DAG;
const bool isIntrinsic = (N->getOpcode() == ISD::INTRINSIC_VOID ||
N->getOpcode() == ISD::INTRINSIC_W_CHAIN);
const bool isStore = N->getOpcode() == ISD::STORE;
const unsigned AddrOpIdx = ((isIntrinsic || isStore) ? 2 : 1);
SDValue Addr = N->getOperand(AddrOpIdx);
MemSDNode *MemN = cast<MemSDNode>(N);
SDLoc dl(N);
// Search for a use of the address operand that is an increment.
for (SDNode::use_iterator UI = Addr.getNode()->use_begin(),
UE = Addr.getNode()->use_end(); UI != UE; ++UI) {
SDNode *User = *UI;
if (User->getOpcode() != ISD::ADD ||
UI.getUse().getResNo() != Addr.getResNo())
continue;
// Check that the add is independent of the load/store. Otherwise, folding
// it would create a cycle.
if (User->isPredecessorOf(N) || N->isPredecessorOf(User))
continue;
// Find the new opcode for the updating load/store.
bool isLoadOp = true;
bool isLaneOp = false;
unsigned NewOpc = 0;
unsigned NumVecs = 0;
if (isIntrinsic) {
unsigned IntNo = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
switch (IntNo) {
default: llvm_unreachable("unexpected intrinsic for Neon base update");
case Intrinsic::arm_neon_vld1: NewOpc = ARMISD::VLD1_UPD;
NumVecs = 1; break;
case Intrinsic::arm_neon_vld2: NewOpc = ARMISD::VLD2_UPD;
NumVecs = 2; break;
case Intrinsic::arm_neon_vld3: NewOpc = ARMISD::VLD3_UPD;
NumVecs = 3; break;
case Intrinsic::arm_neon_vld4: NewOpc = ARMISD::VLD4_UPD;
NumVecs = 4; break;
case Intrinsic::arm_neon_vld2lane: NewOpc = ARMISD::VLD2LN_UPD;
NumVecs = 2; isLaneOp = true; break;
case Intrinsic::arm_neon_vld3lane: NewOpc = ARMISD::VLD3LN_UPD;
NumVecs = 3; isLaneOp = true; break;
case Intrinsic::arm_neon_vld4lane: NewOpc = ARMISD::VLD4LN_UPD;
NumVecs = 4; isLaneOp = true; break;
case Intrinsic::arm_neon_vst1: NewOpc = ARMISD::VST1_UPD;
NumVecs = 1; isLoadOp = false; break;
case Intrinsic::arm_neon_vst2: NewOpc = ARMISD::VST2_UPD;
NumVecs = 2; isLoadOp = false; break;
case Intrinsic::arm_neon_vst3: NewOpc = ARMISD::VST3_UPD;
NumVecs = 3; isLoadOp = false; break;
case Intrinsic::arm_neon_vst4: NewOpc = ARMISD::VST4_UPD;
NumVecs = 4; isLoadOp = false; break;
case Intrinsic::arm_neon_vst2lane: NewOpc = ARMISD::VST2LN_UPD;
NumVecs = 2; isLoadOp = false; isLaneOp = true; break;
case Intrinsic::arm_neon_vst3lane: NewOpc = ARMISD::VST3LN_UPD;
NumVecs = 3; isLoadOp = false; isLaneOp = true; break;
case Intrinsic::arm_neon_vst4lane: NewOpc = ARMISD::VST4LN_UPD;
NumVecs = 4; isLoadOp = false; isLaneOp = true; break;
}
} else {
isLaneOp = true;
switch (N->getOpcode()) {
default: llvm_unreachable("unexpected opcode for Neon base update");
case ARMISD::VLD1DUP: NewOpc = ARMISD::VLD1DUP_UPD; NumVecs = 1; break;
case ARMISD::VLD2DUP: NewOpc = ARMISD::VLD2DUP_UPD; NumVecs = 2; break;
case ARMISD::VLD3DUP: NewOpc = ARMISD::VLD3DUP_UPD; NumVecs = 3; break;
case ARMISD::VLD4DUP: NewOpc = ARMISD::VLD4DUP_UPD; NumVecs = 4; break;
case ISD::LOAD: NewOpc = ARMISD::VLD1_UPD;
NumVecs = 1; isLaneOp = false; break;
case ISD::STORE: NewOpc = ARMISD::VST1_UPD;
NumVecs = 1; isLaneOp = false; isLoadOp = false; break;
}
}
// Find the size of memory referenced by the load/store.
EVT VecTy;
if (isLoadOp) {
VecTy = N->getValueType(0);
} else if (isIntrinsic) {
VecTy = N->getOperand(AddrOpIdx+1).getValueType();
} else {
assert(isStore && "Node has to be a load, a store, or an intrinsic!");
VecTy = N->getOperand(1).getValueType();
}
unsigned NumBytes = NumVecs * VecTy.getSizeInBits() / 8;
if (isLaneOp)
NumBytes /= VecTy.getVectorNumElements();
// If the increment is a constant, it must match the memory ref size.
SDValue Inc = User->getOperand(User->getOperand(0) == Addr ? 1 : 0);
ConstantSDNode *CInc = dyn_cast<ConstantSDNode>(Inc.getNode());
if (NumBytes >= 3 * 16 && (!CInc || CInc->getZExtValue() != NumBytes)) {
// VLD3/4 and VST3/4 for 128-bit vectors are implemented with two
// separate instructions that make it harder to use a non-constant update.
continue;
}
// OK, we found an ADD we can fold into the base update.
// Now, create a _UPD node, taking care of not breaking alignment.
EVT AlignedVecTy = VecTy;
unsigned Alignment = MemN->getAlignment();
// If this is a less-than-standard-aligned load/store, change the type to
// match the standard alignment.
// The alignment is overlooked when selecting _UPD variants; and it's
// easier to introduce bitcasts here than fix that.
// There are 3 ways to get to this base-update combine:
// - intrinsics: they are assumed to be properly aligned (to the standard
// alignment of the memory type), so we don't need to do anything.
// - ARMISD::VLDx nodes: they are only generated from the aforementioned
// intrinsics, so, likewise, there's nothing to do.
// - generic load/store instructions: the alignment is specified as an
// explicit operand, rather than implicitly as the standard alignment
// of the memory type (like the intrinsics). We need to change the
// memory type to match the explicit alignment. That way, we don't
// generate non-standard-aligned ARMISD::VLDx nodes.
if (isa<LSBaseSDNode>(N)) {
if (Alignment == 0)
Alignment = 1;
if (Alignment < VecTy.getScalarSizeInBits() / 8) {
MVT EltTy = MVT::getIntegerVT(Alignment * 8);
assert(NumVecs == 1 && "Unexpected multi-element generic load/store.");
assert(!isLaneOp && "Unexpected generic load/store lane.");
unsigned NumElts = NumBytes / (EltTy.getSizeInBits() / 8);
AlignedVecTy = MVT::getVectorVT(EltTy, NumElts);
}
// Don't set an explicit alignment on regular load/stores that we want
// to transform to VLD/VST 1_UPD nodes.
// This matches the behavior of regular load/stores, which only get an
// explicit alignment if the MMO alignment is larger than the standard
// alignment of the memory type.
// Intrinsics, however, always get an explicit alignment, set to the
// alignment of the MMO.
Alignment = 1;
}
// Create the new updating load/store node.
// First, create an SDVTList for the new updating node's results.
EVT Tys[6];
unsigned NumResultVecs = (isLoadOp ? NumVecs : 0);
unsigned n;
for (n = 0; n < NumResultVecs; ++n)
Tys[n] = AlignedVecTy;
Tys[n++] = MVT::i32;
Tys[n] = MVT::Other;
SDVTList SDTys = DAG.getVTList(makeArrayRef(Tys, NumResultVecs+2));
// Then, gather the new node's operands.
SmallVector<SDValue, 8> Ops;
Ops.push_back(N->getOperand(0)); // incoming chain
Ops.push_back(N->getOperand(AddrOpIdx));
Ops.push_back(Inc);
if (StoreSDNode *StN = dyn_cast<StoreSDNode>(N)) {
// Try to match the intrinsic's signature
Ops.push_back(StN->getValue());
} else {
// Loads (and of course intrinsics) match the intrinsics' signature,
// so just add all but the alignment operand.
for (unsigned i = AddrOpIdx + 1; i < N->getNumOperands() - 1; ++i)
Ops.push_back(N->getOperand(i));
}
// For all node types, the alignment operand is always the last one.
Ops.push_back(DAG.getConstant(Alignment, dl, MVT::i32));
// If this is a non-standard-aligned STORE, the penultimate operand is the
// stored value. Bitcast it to the aligned type.
if (AlignedVecTy != VecTy && N->getOpcode() == ISD::STORE) {
SDValue &StVal = Ops[Ops.size()-2];
StVal = DAG.getNode(ISD::BITCAST, dl, AlignedVecTy, StVal);
}
EVT LoadVT = isLaneOp ? VecTy.getVectorElementType() : AlignedVecTy;
SDValue UpdN = DAG.getMemIntrinsicNode(NewOpc, dl, SDTys, Ops, LoadVT,
MemN->getMemOperand());
// Update the uses.
SmallVector<SDValue, 5> NewResults;
for (unsigned i = 0; i < NumResultVecs; ++i)
NewResults.push_back(SDValue(UpdN.getNode(), i));
// If this is a non-standard-aligned LOAD, the first result is the loaded
// value. Bitcast it to the expected result type.
if (AlignedVecTy != VecTy && N->getOpcode() == ISD::LOAD) {
SDValue &LdVal = NewResults[0];
LdVal = DAG.getNode(ISD::BITCAST, dl, VecTy, LdVal);
}
NewResults.push_back(SDValue(UpdN.getNode(), NumResultVecs+1)); // chain
DCI.CombineTo(N, NewResults);
DCI.CombineTo(User, SDValue(UpdN.getNode(), NumResultVecs));
break;
}
return SDValue();
}
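// Illustrative base-update match: a (load v4i32, addr) whose address is also
// consumed by (add addr, 16) folds into a single VLD1_UPD that yields the
// loaded vector, the post-incremented address, and the chain -- in assembly,
// vld1.32 {q0}, [r0]!.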
static SDValue PerformVLDCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
return SDValue();
return CombineBaseUpdate(N, DCI);
}
/// CombineVLDDUP - For a VDUPLANE node N, check if its source operand is a
/// vldN-lane (N > 1) intrinsic, and if all the other uses of that intrinsic
/// are also VDUPLANEs. If so, combine them to a vldN-dup operation and
/// return true.
static bool CombineVLDDUP(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
SelectionDAG &DAG = DCI.DAG;
EVT VT = N->getValueType(0);
// vldN-dup instructions only support 64-bit vectors for N > 1.
if (!VT.is64BitVector())
return false;
// Check if the VDUPLANE operand is a vldN-dup intrinsic.
SDNode *VLD = N->getOperand(0).getNode();
if (VLD->getOpcode() != ISD::INTRINSIC_W_CHAIN)
return false;
unsigned NumVecs = 0;
unsigned NewOpc = 0;
unsigned IntNo = cast<ConstantSDNode>(VLD->getOperand(1))->getZExtValue();
if (IntNo == Intrinsic::arm_neon_vld2lane) {
NumVecs = 2;
NewOpc = ARMISD::VLD2DUP;
} else if (IntNo == Intrinsic::arm_neon_vld3lane) {
NumVecs = 3;
NewOpc = ARMISD::VLD3DUP;
} else if (IntNo == Intrinsic::arm_neon_vld4lane) {
NumVecs = 4;
NewOpc = ARMISD::VLD4DUP;
} else {
return false;
}
// First check that all the vldN-lane uses are VDUPLANEs and that the lane
// numbers match the load.
unsigned VLDLaneNo =
cast<ConstantSDNode>(VLD->getOperand(NumVecs+3))->getZExtValue();
for (SDNode::use_iterator UI = VLD->use_begin(), UE = VLD->use_end();
UI != UE; ++UI) {
// Ignore uses of the chain result.
if (UI.getUse().getResNo() == NumVecs)
continue;
SDNode *User = *UI;
if (User->getOpcode() != ARMISD::VDUPLANE ||
VLDLaneNo != cast<ConstantSDNode>(User->getOperand(1))->getZExtValue())
return false;
}
// Create the vldN-dup node.
EVT Tys[5];
unsigned n;
for (n = 0; n < NumVecs; ++n)
Tys[n] = VT;
Tys[n] = MVT::Other;
SDVTList SDTys = DAG.getVTList(makeArrayRef(Tys, NumVecs+1));
SDValue Ops[] = { VLD->getOperand(0), VLD->getOperand(2) };
MemIntrinsicSDNode *VLDMemInt = cast<MemIntrinsicSDNode>(VLD);
SDValue VLDDup = DAG.getMemIntrinsicNode(NewOpc, SDLoc(VLD), SDTys,
Ops, VLDMemInt->getMemoryVT(),
VLDMemInt->getMemOperand());
// Update the uses.
for (SDNode::use_iterator UI = VLD->use_begin(), UE = VLD->use_end();
UI != UE; ++UI) {
unsigned ResNo = UI.getUse().getResNo();
// Ignore uses of the chain result.
if (ResNo == NumVecs)
continue;
SDNode *User = *UI;
DCI.CombineTo(User, SDValue(VLDDup.getNode(), ResNo));
}
// Now the vldN-lane intrinsic is dead except for its chain result.
// Update uses of the chain.
std::vector<SDValue> VLDDupResults;
for (unsigned n = 0; n < NumVecs; ++n)
VLDDupResults.push_back(SDValue(VLDDup.getNode(), n));
VLDDupResults.push_back(SDValue(VLDDup.getNode(), NumVecs));
DCI.CombineTo(VLD, VLDDupResults);
return true;
}
/// PerformVDUPLANECombine - Target-specific dag combine xforms for
/// ARMISD::VDUPLANE.
static SDValue PerformVDUPLANECombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
SDValue Op = N->getOperand(0);
// If the source is a vldN-lane (N > 1) intrinsic, and all the other uses
// of that intrinsic are also VDUPLANEs, combine them to a vldN-dup operation.
if (CombineVLDDUP(N, DCI))
return SDValue(N, 0);
// If the source is already a VMOVIMM or VMVNIMM splat, the VDUPLANE is
// redundant. Ignore bit_converts for now; element sizes are checked below.
while (Op.getOpcode() == ISD::BITCAST)
Op = Op.getOperand(0);
if (Op.getOpcode() != ARMISD::VMOVIMM && Op.getOpcode() != ARMISD::VMVNIMM)
return SDValue();
// Make sure the VMOV element size is not bigger than the VDUPLANE elements.
unsigned EltSize = Op.getScalarValueSizeInBits();
// The canonical VMOV for a zero vector uses a 32-bit element size.
unsigned Imm = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
unsigned EltBits;
if (ARM_AM::decodeNEONModImm(Imm, EltBits) == 0)
EltSize = 8;
EVT VT = N->getValueType(0);
if (EltSize > VT.getScalarSizeInBits())
return SDValue();
return DCI.DAG.getNode(ISD::BITCAST, SDLoc(N), VT, Op);
}
/// PerformVDUPCombine - Target-specific dag combine xforms for ARMISD::VDUP.
static SDValue PerformVDUPCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
SelectionDAG &DAG = DCI.DAG;
SDValue Op = N->getOperand(0);
// Match VDUP(LOAD) -> VLD1DUP.
// We match this pattern here rather than waiting for isel because the
// transform is only legal for unindexed loads.
LoadSDNode *LD = dyn_cast<LoadSDNode>(Op.getNode());
if (LD && Op.hasOneUse() && LD->isUnindexed() &&
LD->getMemoryVT() == N->getValueType(0).getVectorElementType()) {
SDValue Ops[] = { LD->getOperand(0), LD->getOperand(1),
DAG.getConstant(LD->getAlignment(), SDLoc(N), MVT::i32) };
SDVTList SDTys = DAG.getVTList(N->getValueType(0), MVT::Other);
SDValue VLDDup = DAG.getMemIntrinsicNode(ARMISD::VLD1DUP, SDLoc(N), SDTys,
Ops, LD->getMemoryVT(),
LD->getMemOperand());
DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), VLDDup.getValue(1));
return VLDDup;
}
return SDValue();
}
static SDValue PerformLOADCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
EVT VT = N->getValueType(0);
// If this is a legal vector load, try to combine it into a VLD1_UPD.
if (ISD::isNormalLoad(N) && VT.isVector() &&
DCI.DAG.getTargetLoweringInfo().isTypeLegal(VT))
return CombineBaseUpdate(N, DCI);
return SDValue();
}
/// PerformSTORECombine - Target-specific dag combine xforms for
/// ISD::STORE.
static SDValue PerformSTORECombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {
StoreSDNode *St = cast<StoreSDNode>(N);
if (St->isVolatile())
return SDValue();
// Optimize trunc store (of multiple scalars) to shuffle and store. First,
// pack all of the elements in one place. Next, store to memory in fewer
// chunks.
SDValue StVal = St->getValue();
EVT VT = StVal.getValueType();
if (St->isTruncatingStore() && VT.isVector()) {
SelectionDAG &DAG = DCI.DAG;
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
EVT StVT = St->getMemoryVT();
unsigned NumElems = VT.getVectorNumElements();
assert(StVT != VT && "Cannot truncate to the same type");
unsigned FromEltSz = VT.getScalarSizeInBits();
unsigned ToEltSz = StVT.getScalarSizeInBits();
// The From and To element sizes and the element count must be powers of two.
if (!isPowerOf2_32(NumElems * FromEltSz * ToEltSz)) return SDValue();
// We are going to use the original vector elt for storing.
// Accumulated smaller vector elements must be a multiple of the store size.
if (0 != (NumElems * FromEltSz) % ToEltSz) return SDValue();
unsigned SizeRatio = FromEltSz / ToEltSz;
assert(SizeRatio * NumElems * ToEltSz == VT.getSizeInBits());
// Create a type on which we perform the shuffle.
EVT WideVecVT = EVT::getVectorVT(*DAG.getContext(), StVT.getScalarType(),
NumElems*SizeRatio);
assert(WideVecVT.getSizeInBits() == VT.getSizeInBits());
SDLoc DL(St);
SDValue WideVec = DAG.getNode(ISD::BITCAST, DL, WideVecVT, StVal);
SmallVector<int, 8> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i < NumElems; ++i)
ShuffleVec[i] = DAG.getDataLayout().isBigEndian()
? (i + 1) * SizeRatio - 1
: i * SizeRatio;
// Can't shuffle using an illegal type.
if (!TLI.isTypeLegal(WideVecVT)) return SDValue();
SDValue Shuff = DAG.getVectorShuffle(WideVecVT, DL, WideVec,
DAG.getUNDEF(WideVec.getValueType()),
ShuffleVec);
// At this point all of the data is stored at the bottom of the
// register. We now need to save it to mem.
// Find the largest store unit
MVT StoreType = MVT::i8;
for (MVT Tp : MVT::integer_valuetypes()) {
if (TLI.isTypeLegal(Tp) && Tp.getSizeInBits() <= NumElems * ToEltSz)
StoreType = Tp;
}
// Didn't find a legal store type.
if (!TLI.isTypeLegal(StoreType))
return SDValue();
// Bitcast the original vector into a vector of store-size units
EVT StoreVecVT = EVT::getVectorVT(*DAG.getContext(),
StoreType, VT.getSizeInBits()/EVT(StoreType).getSizeInBits());
assert(StoreVecVT.getSizeInBits() == VT.getSizeInBits());
SDValue ShuffWide = DAG.getNode(ISD::BITCAST, DL, StoreVecVT, Shuff);
SmallVector<SDValue, 8> Chains;
SDValue Increment = DAG.getConstant(StoreType.getSizeInBits() / 8, DL,
TLI.getPointerTy(DAG.getDataLayout()));
SDValue BasePtr = St->getBasePtr();
// Perform one or more big stores into memory.
unsigned E = (ToEltSz*NumElems)/StoreType.getSizeInBits();
for (unsigned I = 0; I < E; I++) {
SDValue SubVec = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL,
StoreType, ShuffWide,
DAG.getIntPtrConstant(I, DL));
SDValue Ch = DAG.getStore(St->getChain(), DL, SubVec, BasePtr,
St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags());
BasePtr = DAG.getNode(ISD::ADD, DL, BasePtr.getValueType(), BasePtr,
Increment);
Chains.push_back(Ch);
}
return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Chains);
}
if (!ISD::isNormalStore(St))
return SDValue();
// Split a store of a VMOVDRR into two integer stores to avoid mixing NEON and
// ARM stores of arguments in the same cache line.
if (StVal.getNode()->getOpcode() == ARMISD::VMOVDRR &&
StVal.getNode()->hasOneUse()) {
SelectionDAG &DAG = DCI.DAG;
bool isBigEndian = DAG.getDataLayout().isBigEndian();
SDLoc DL(St);
SDValue BasePtr = St->getBasePtr();
SDValue NewST1 = DAG.getStore(
St->getChain(), DL, StVal.getNode()->getOperand(isBigEndian ? 1 : 0),
BasePtr, St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags());
SDValue OffsetPtr = DAG.getNode(ISD::ADD, DL, MVT::i32, BasePtr,
DAG.getConstant(4, DL, MVT::i32));
return DAG.getStore(NewST1.getValue(0), DL,
StVal.getNode()->getOperand(isBigEndian ? 0 : 1),
OffsetPtr, St->getPointerInfo(),
std::min(4U, St->getAlignment() / 2),
St->getMemOperand()->getFlags());
}
if (StVal.getValueType() == MVT::i64 &&
StVal.getNode()->getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
// Bitcast an i64 store extracted from a vector to f64.
// Otherwise, the i64 value will be legalized to a pair of i32 values.
SelectionDAG &DAG = DCI.DAG;
SDLoc dl(StVal);
SDValue IntVec = StVal.getOperand(0);
EVT FloatVT = EVT::getVectorVT(*DAG.getContext(), MVT::f64,
IntVec.getValueType().getVectorNumElements());
SDValue Vec = DAG.getNode(ISD::BITCAST, dl, FloatVT, IntVec);
SDValue ExtElt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64,
Vec, StVal.getOperand(1));
dl = SDLoc(N);
SDValue V = DAG.getNode(ISD::BITCAST, dl, MVT::i64, ExtElt);
// Make the DAGCombiner fold the bitcasts.
DCI.AddToWorklist(Vec.getNode());
DCI.AddToWorklist(ExtElt.getNode());
DCI.AddToWorklist(V.getNode());
return DAG.getStore(St->getChain(), dl, V, St->getBasePtr(),
St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags(), St->getAAInfo());
}
// If this is a legal vector store, try to combine it into a VST1_UPD.
if (ISD::isNormalStore(N) && VT.isVector() &&
DCI.DAG.getTargetLoweringInfo().isTypeLegal(VT))
return CombineBaseUpdate(N, DCI);
return SDValue();
}
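// Worked instance of the truncating-store path above (illustrative,
// little-endian): storing v4i32 truncated to v4i8 gives SizeRatio = 4 and
// WideVecVT = v16i8; ShuffleVec = <0, 4, 8, 12, -1, ...> packs the four live
// bytes at the bottom of the register, which is then written out with a
// single i32 store.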
/// PerformVCVTCombine - VCVT (floating-point to fixed-point, Advanced SIMD)
/// can replace combinations of VMUL and VCVT (floating-point to integer)
/// when the VMUL has a constant operand that is a power of 2.
///
/// Example (assume d17 = <float 8.000000e+00, float 8.000000e+00>):
/// vmul.f32 d16, d17, d16
/// vcvt.s32.f32 d16, d16
/// becomes:
/// vcvt.s32.f32 d16, d16, #3
static SDValue PerformVCVTCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
if (!Subtarget->hasNEON())
return SDValue();
SDValue Op = N->getOperand(0);
if (!Op.getValueType().isVector() || !Op.getValueType().isSimple() ||
Op.getOpcode() != ISD::FMUL)
return SDValue();
SDValue ConstVec = Op->getOperand(1);
if (!isa<BuildVectorSDNode>(ConstVec))
return SDValue();
MVT FloatTy = Op.getSimpleValueType().getVectorElementType();
uint32_t FloatBits = FloatTy.getSizeInBits();
MVT IntTy = N->getSimpleValueType(0).getVectorElementType();
uint32_t IntBits = IntTy.getSizeInBits();
unsigned NumLanes = Op.getValueType().getVectorNumElements();
if (FloatBits != 32 || IntBits > 32 || NumLanes > 4) {
// These instructions only exist converting from f32 to i32. We can handle
// smaller integers by generating an extra truncate, but larger ones would
// be lossy. We also can't handle more than 4 lanes, since these instructions
// only support v2i32/v4i32 types.
return SDValue();
}
BitVector UndefElements;
BuildVectorSDNode *BV = cast<BuildVectorSDNode>(ConstVec);
int32_t C = BV->getConstantFPSplatPow2ToLog2Int(&UndefElements, 33);
if (C == -1 || C == 0 || C > 32)
return SDValue();
SDLoc dl(N);
bool isSigned = N->getOpcode() == ISD::FP_TO_SINT;
unsigned IntrinsicOpcode = isSigned ? Intrinsic::arm_neon_vcvtfp2fxs :
Intrinsic::arm_neon_vcvtfp2fxu;
SDValue FixConv = DAG.getNode(
ISD::INTRINSIC_WO_CHAIN, dl, NumLanes == 2 ? MVT::v2i32 : MVT::v4i32,
DAG.getConstant(IntrinsicOpcode, dl, MVT::i32), Op->getOperand(0),
DAG.getConstant(C, dl, MVT::i32));
if (IntBits < FloatBits)
FixConv = DAG.getNode(ISD::TRUNCATE, dl, N->getValueType(0), FixConv);
return FixConv;
}
/// PerformVDIVCombine - VCVT (fixed-point to floating-point, Advanced SIMD)
/// can replace combinations of VCVT (integer to floating-point) and VDIV
/// when the VDIV has a constant operand that is a power of 2.
///
/// Example (assume d17 = <float 8.000000e+00, float 8.000000e+00>):
/// vcvt.f32.s32 d16, d16
/// vdiv.f32 d16, d17, d16
/// becomes:
/// vcvt.f32.s32 d16, d16, #3
static SDValue PerformVDIVCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {
if (!Subtarget->hasNEON())
return SDValue();
SDValue Op = N->getOperand(0);
unsigned OpOpcode = Op.getNode()->getOpcode();
if (!N->getValueType(0).isVector() || !N->getValueType(0).isSimple() ||
(OpOpcode != ISD::SINT_TO_FP && OpOpcode != ISD::UINT_TO_FP))
return SDValue();
SDValue ConstVec = N->getOperand(1);
if (!isa<BuildVectorSDNode>(ConstVec))
return SDValue();
MVT FloatTy = N->getSimpleValueType(0).getVectorElementType();
uint32_t FloatBits = FloatTy.getSizeInBits();
MVT IntTy = Op.getOperand(0).getSimpleValueType().getVectorElementType();
uint32_t IntBits = IntTy.getSizeInBits();
unsigned NumLanes = Op.getValueType().getVectorNumElements();
if (FloatBits != 32 || IntBits > 32 || NumLanes > 4) {
// These instructions only exist converting from i32 to f32. We can handle
// smaller integers by generating an extra extend, but larger ones would
// be lossy. We also can't handle more than 4 lanes, since these instructions
// only support v2i32/v4i32 types.
return SDValue();
}
BitVector UndefElements;
BuildVectorSDNode *BV = cast<BuildVectorSDNode>(ConstVec);
int32_t C = BV->getConstantFPSplatPow2ToLog2Int(&UndefElements, 33);
if (C == -1 || C == 0 || C > 32)
return SDValue();
SDLoc dl(N);
bool isSigned = OpOpcode == ISD::SINT_TO_FP;
SDValue ConvInput = Op.getOperand(0);
if (IntBits < FloatBits)
ConvInput = DAG.getNode(isSigned ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND,
dl, NumLanes == 2 ? MVT::v2i32 : MVT::v4i32,
ConvInput);
unsigned IntrinsicOpcode = isSigned ? Intrinsic::arm_neon_vcvtfxs2fp :
Intrinsic::arm_neon_vcvtfxu2fp;
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl,
Op.getValueType(),
DAG.getConstant(IntrinsicOpcode, dl, MVT::i32),
ConvInput, DAG.getConstant(C, dl, MVT::i32));
}
/// getVShiftImm - Check if this is a valid build_vector for the immediate
/// operand of a vector shift operation, where all the elements of the
/// build_vector must have the same constant integer value.
static bool getVShiftImm(SDValue Op, unsigned ElementBits, int64_t &Cnt) {
// Ignore bit_converts.
while (Op.getOpcode() == ISD::BITCAST)
Op = Op.getOperand(0);
BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(Op.getNode());
APInt SplatBits, SplatUndef;
unsigned SplatBitSize;
bool HasAnyUndefs;
if (!BVN || !BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize,
HasAnyUndefs, ElementBits) ||
SplatBitSize > ElementBits)
return false;
Cnt = SplatBits.getSExtValue();
return true;
}
/// isVShiftLImm - Check if this is a valid build_vector for the immediate
/// operand of a vector shift left operation. That value must be in the range:
/// 0 <= Value < ElementBits for a left shift; or
/// 0 <= Value <= ElementBits for a long left shift.
static bool isVShiftLImm(SDValue Op, EVT VT, bool isLong, int64_t &Cnt) {
assert(VT.isVector() && "vector shift count is not a vector type");
int64_t ElementBits = VT.getScalarSizeInBits();
if (!getVShiftImm(Op, ElementBits, Cnt))
return false;
return (Cnt >= 0 && (isLong ? Cnt-1 : Cnt) < ElementBits);
}
/// isVShiftRImm - Check if this is a valid build_vector for the immediate
/// operand of a vector shift right operation. For a shift opcode, the value
/// is positive, but for an intrinsic the value must be negative. The
/// absolute value must be in the range:
/// 1 <= |Value| <= ElementBits for a right shift; or
/// 1 <= |Value| <= ElementBits/2 for a narrow right shift.
static bool isVShiftRImm(SDValue Op, EVT VT, bool isNarrow, bool isIntrinsic,
int64_t &Cnt) {
assert(VT.isVector() && "vector shift count is not a vector type");
int64_t ElementBits = VT.getScalarSizeInBits();
if (!getVShiftImm(Op, ElementBits, Cnt))
return false;
if (!isIntrinsic)
return (Cnt >= 1 && Cnt <= (isNarrow ? ElementBits/2 : ElementBits));
if (Cnt >= -(isNarrow ? ElementBits/2 : ElementBits) && Cnt <= -1) {
Cnt = -Cnt;
return true;
}
return false;
}
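// Example: for a v8i16 shift, ElementBits = 16, so a shift opcode accepts an
// immediate count of 1..16 (1..8 when narrowing), while the intrinsics encode
// right shifts as negative counts, e.g. Cnt = -8 is accepted and flipped to 8.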
/// PerformIntrinsicCombine - ARM-specific DAG combining for intrinsics.
static SDValue PerformIntrinsicCombine(SDNode *N, SelectionDAG &DAG) {
unsigned IntNo = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
switch (IntNo) {
default:
// Don't do anything for most intrinsics.
break;
// Vector shifts: check for immediate versions and lower them.
// Note: This is done during DAG combining instead of DAG legalizing because
// the build_vectors for 64-bit vector element shift counts are generally
// not legal, and it is hard to see their values after they get legalized to
// loads from a constant pool.
case Intrinsic::arm_neon_vshifts:
case Intrinsic::arm_neon_vshiftu:
case Intrinsic::arm_neon_vrshifts:
case Intrinsic::arm_neon_vrshiftu:
case Intrinsic::arm_neon_vrshiftn:
case Intrinsic::arm_neon_vqshifts:
case Intrinsic::arm_neon_vqshiftu:
case Intrinsic::arm_neon_vqshiftsu:
case Intrinsic::arm_neon_vqshiftns:
case Intrinsic::arm_neon_vqshiftnu:
case Intrinsic::arm_neon_vqshiftnsu:
case Intrinsic::arm_neon_vqrshiftns:
case Intrinsic::arm_neon_vqrshiftnu:
case Intrinsic::arm_neon_vqrshiftnsu: {
EVT VT = N->getOperand(1).getValueType();
int64_t Cnt;
unsigned VShiftOpc = 0;
switch (IntNo) {
case Intrinsic::arm_neon_vshifts:
case Intrinsic::arm_neon_vshiftu:
if (isVShiftLImm(N->getOperand(2), VT, false, Cnt)) {
VShiftOpc = ARMISD::VSHL;
break;
}
if (isVShiftRImm(N->getOperand(2), VT, false, true, Cnt)) {
VShiftOpc = (IntNo == Intrinsic::arm_neon_vshifts ?
ARMISD::VSHRs : ARMISD::VSHRu);
break;
}
return SDValue();
case Intrinsic::arm_neon_vrshifts:
case Intrinsic::arm_neon_vrshiftu:
if (isVShiftRImm(N->getOperand(2), VT, false, true, Cnt))
break;
return SDValue();
case Intrinsic::arm_neon_vqshifts:
case Intrinsic::arm_neon_vqshiftu:
if (isVShiftLImm(N->getOperand(2), VT, false, Cnt))
break;
return SDValue();
case Intrinsic::arm_neon_vqshiftsu:
if (isVShiftLImm(N->getOperand(2), VT, false, Cnt))
break;
llvm_unreachable("invalid shift count for vqshlu intrinsic");
case Intrinsic::arm_neon_vrshiftn:
case Intrinsic::arm_neon_vqshiftns:
case Intrinsic::arm_neon_vqshiftnu:
case Intrinsic::arm_neon_vqshiftnsu:
case Intrinsic::arm_neon_vqrshiftns:
case Intrinsic::arm_neon_vqrshiftnu:
case Intrinsic::arm_neon_vqrshiftnsu:
// Narrowing shifts require an immediate right shift.
if (isVShiftRImm(N->getOperand(2), VT, true, true, Cnt))
break;
llvm_unreachable("invalid shift count for narrowing vector shift "
"intrinsic");
default:
llvm_unreachable("unhandled vector shift");
}
switch (IntNo) {
case Intrinsic::arm_neon_vshifts:
case Intrinsic::arm_neon_vshiftu:
// Opcode already set above.
break;
case Intrinsic::arm_neon_vrshifts:
VShiftOpc = ARMISD::VRSHRs; break;
case Intrinsic::arm_neon_vrshiftu:
VShiftOpc = ARMISD::VRSHRu; break;
case Intrinsic::arm_neon_vrshiftn:
VShiftOpc = ARMISD::VRSHRN; break;
case Intrinsic::arm_neon_vqshifts:
VShiftOpc = ARMISD::VQSHLs; break;
case Intrinsic::arm_neon_vqshiftu:
VShiftOpc = ARMISD::VQSHLu; break;
case Intrinsic::arm_neon_vqshiftsu:
VShiftOpc = ARMISD::VQSHLsu; break;
case Intrinsic::arm_neon_vqshiftns:
VShiftOpc = ARMISD::VQSHRNs; break;
case Intrinsic::arm_neon_vqshiftnu:
VShiftOpc = ARMISD::VQSHRNu; break;
case Intrinsic::arm_neon_vqshiftnsu:
VShiftOpc = ARMISD::VQSHRNsu; break;
case Intrinsic::arm_neon_vqrshiftns:
VShiftOpc = ARMISD::VQRSHRNs; break;
case Intrinsic::arm_neon_vqrshiftnu:
VShiftOpc = ARMISD::VQRSHRNu; break;
case Intrinsic::arm_neon_vqrshiftnsu:
VShiftOpc = ARMISD::VQRSHRNsu; break;
}
SDLoc dl(N);
return DAG.getNode(VShiftOpc, dl, N->getValueType(0),
N->getOperand(1), DAG.getConstant(Cnt, dl, MVT::i32));
}
case Intrinsic::arm_neon_vshiftins: {
EVT VT = N->getOperand(1).getValueType();
int64_t Cnt;
unsigned VShiftOpc = 0;
if (isVShiftLImm(N->getOperand(3), VT, false, Cnt))
VShiftOpc = ARMISD::VSLI;
else if (isVShiftRImm(N->getOperand(3), VT, false, true, Cnt))
VShiftOpc = ARMISD::VSRI;
else {
llvm_unreachable("invalid shift count for vsli/vsri intrinsic");
}
SDLoc dl(N);
return DAG.getNode(VShiftOpc, dl, N->getValueType(0),
N->getOperand(1), N->getOperand(2),
DAG.getConstant(Cnt, dl, MVT::i32));
}
case Intrinsic::arm_neon_vqrshifts:
case Intrinsic::arm_neon_vqrshiftu:
// No immediate versions of these to check for.
break;
}
return SDValue();
}
/// PerformShiftCombine - Checks for immediate versions of vector shifts and
/// lowers them. As with the vector shift intrinsics, this is done during DAG
/// combining instead of DAG legalizing because the build_vectors for 64-bit
/// vector element shift counts are generally not legal, and it is hard to see
/// their values after they get legalized to loads from a constant pool.
static SDValue PerformShiftCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
EVT VT = N->getValueType(0);
if (N->getOpcode() == ISD::SRL && VT == MVT::i32 && ST->hasV6Ops()) {
// Canonicalize (srl (bswap x), 16) to (rotr (bswap x), 16) if the high
// 16-bits of x is zero. This optimizes rev + lsr 16 to rev16.
SDValue N1 = N->getOperand(1);
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(N1)) {
SDValue N0 = N->getOperand(0);
if (C->getZExtValue() == 16 && N0.getOpcode() == ISD::BSWAP &&
DAG.MaskedValueIsZero(N0.getOperand(0),
APInt::getHighBitsSet(32, 16)))
return DAG.getNode(ISD::ROTR, SDLoc(N), VT, N0, N1);
}
}
// Nothing to be done for scalar shifts.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!VT.isVector() || !TLI.isTypeLegal(VT))
return SDValue();
assert(ST->hasNEON() && "unexpected vector shift");
int64_t Cnt;
switch (N->getOpcode()) {
default: llvm_unreachable("unexpected shift opcode");
case ISD::SHL:
if (isVShiftLImm(N->getOperand(1), VT, false, Cnt)) {
SDLoc dl(N);
return DAG.getNode(ARMISD::VSHL, dl, VT, N->getOperand(0),
DAG.getConstant(Cnt, dl, MVT::i32));
}
break;
case ISD::SRA:
case ISD::SRL:
if (isVShiftRImm(N->getOperand(1), VT, false, false, Cnt)) {
unsigned VShiftOpc = (N->getOpcode() == ISD::SRA ?
ARMISD::VSHRs : ARMISD::VSHRu);
SDLoc dl(N);
return DAG.getNode(VShiftOpc, dl, VT, N->getOperand(0),
DAG.getConstant(Cnt, dl, MVT::i32));
}
}
return SDValue();
}
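// Example of the bswap canonicalization above (illustrative): when the high
// 16 bits of x are known zero,
//   (srl (bswap x), 16) == (rotr (bswap x), 16)
// and the rotate form is what the rev16 pattern matches, so
//   rev r0, r0 ; lsr r0, r0, #16  becomes  rev16 r0, r0.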
/// PerformExtendCombine - Target-specific DAG combining for ISD::SIGN_EXTEND,
/// ISD::ZERO_EXTEND, and ISD::ANY_EXTEND.
static SDValue PerformExtendCombine(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *ST) {
SDValue N0 = N->getOperand(0);
// Check for sign- and zero-extensions of vector extract operations of 8-
// and 16-bit vector elements. NEON supports these directly. They are
// handled during DAG combining because type legalization will promote them
// to 32-bit types and it is messy to recognize the operations after that.
if (ST->hasNEON() && N0.getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
SDValue Vec = N0.getOperand(0);
SDValue Lane = N0.getOperand(1);
EVT VT = N->getValueType(0);
EVT EltVT = N0.getValueType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (VT == MVT::i32 &&
(EltVT == MVT::i8 || EltVT == MVT::i16) &&
TLI.isTypeLegal(Vec.getValueType()) &&
isa<ConstantSDNode>(Lane)) {
unsigned Opc = 0;
switch (N->getOpcode()) {
default: llvm_unreachable("unexpected opcode");
case ISD::SIGN_EXTEND:
Opc = ARMISD::VGETLANEs;
break;
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:
Opc = ARMISD::VGETLANEu;
break;
}
return DAG.getNode(Opc, SDLoc(N), VT, Vec, Lane);
}
}
return SDValue();
}
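// Example of the lane-extract extension above (illustrative):
//   (sext (extract_vector_elt v8i16 x, 3)) to i32 --> (VGETLANEs x, 3),
// i.e. vmov.s16 r0, d0[3], which sign-extends during the lane move instead
// of emitting a separate extend.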
SDValue ARMTargetLowering::PerformCMOVToBFICombine(SDNode *CMOV, SelectionDAG &DAG) const {
// If we have a CMOV, OR and AND combination such as:
// if (x & CN)
// y |= CM;
//
// And:
// * CN is a single bit;
// * All bits covered by CM are known zero in y
//
// Then we can convert this into a sequence of BFI instructions. This will
// always be a win if CM is a single bit, will always be no worse than the
// TST&OR sequence if CM is two bits, and for Thumb will be no worse if CM is
// three bits (due to the extra IT instruction).
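//
// For illustration (register assignments assumed): with CN = 0x4 and
// CM = 0x30, "if (x & 4) y |= 0x30;" can become roughly:
//   lsr r2, r0, #2       ; move bit 2 of x down to bit 0
//   bfi r1, r2, #4, #1   ; insert it at bit 4 of y
//   bfi r1, r2, #5, #1   ; and at bit 5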
SDValue Op0 = CMOV->getOperand(0);
SDValue Op1 = CMOV->getOperand(1);
auto CCNode = cast<ConstantSDNode>(CMOV->getOperand(2));
auto CC = CCNode->getAPIntValue().getLimitedValue();
SDValue CmpZ = CMOV->getOperand(4);
// The compare must be against zero.
if (!isNullConstant(CmpZ->getOperand(1)))
return SDValue();
assert(CmpZ->getOpcode() == ARMISD::CMPZ);
SDValue And = CmpZ->getOperand(0);
if (And->getOpcode() != ISD::AND)
return SDValue();
ConstantSDNode *AndC = dyn_cast<ConstantSDNode>(And->getOperand(1));
if (!AndC || !AndC->getAPIntValue().isPowerOf2())
return SDValue();
SDValue X = And->getOperand(0);
if (CC == ARMCC::EQ) {
// We're performing an "equal to zero" compare. Swap the operands so we
// canonicalize on a "not equal to zero" compare.
std::swap(Op0, Op1);
} else {
assert(CC == ARMCC::NE && "How can a CMPZ node not be EQ or NE?");
}
if (Op1->getOpcode() != ISD::OR)
return SDValue();
ConstantSDNode *OrC = dyn_cast<ConstantSDNode>(Op1->getOperand(1));
if (!OrC)
return SDValue();
SDValue Y = Op1->getOperand(0);
if (Op0 != Y)
return SDValue();
// Now, is it profitable to continue?
APInt OrCI = OrC->getAPIntValue();
unsigned Heuristic = Subtarget->isThumb() ? 3 : 2;
if (OrCI.countPopulation() > Heuristic)
return SDValue();
// Lastly, can we determine that the bits defined by OrCI
// are zero in Y?
KnownBits Known;
DAG.computeKnownBits(Y, Known);
if ((OrCI & Known.Zero) != OrCI)
return SDValue();
// OK, we can do the combine.
SDValue V = Y;
SDLoc dl(X);
EVT VT = X.getValueType();
unsigned BitInX = AndC->getAPIntValue().logBase2();
if (BitInX != 0) {
// We must shift X first.
X = DAG.getNode(ISD::SRL, dl, VT, X,
DAG.getConstant(BitInX, dl, VT));
}
for (unsigned BitInY = 0, NumActiveBits = OrCI.getActiveBits();
BitInY < NumActiveBits; ++BitInY) {
if (OrCI[BitInY] == 0)
continue;
APInt Mask(VT.getSizeInBits(), 0);
Mask.setBit(BitInY);
V = DAG.getNode(ARMISD::BFI, dl, VT, V, X,
// Confusingly, the operand is an *inverted* mask.
DAG.getConstant(~Mask, dl, VT));
}
return V;
}
/// PerformBRCONDCombine - Target-specific DAG combining for ARMISD::BRCOND.
SDValue
ARMTargetLowering::PerformBRCONDCombine(SDNode *N, SelectionDAG &DAG) const {
SDValue Cmp = N->getOperand(4);
if (Cmp.getOpcode() != ARMISD::CMPZ)
// Only looking at NE cases.
return SDValue();
EVT VT = N->getValueType(0);
SDLoc dl(N);
SDValue LHS = Cmp.getOperand(0);
SDValue RHS = Cmp.getOperand(1);
SDValue Chain = N->getOperand(0);
SDValue BB = N->getOperand(1);
SDValue ARMcc = N->getOperand(2);
ARMCC::CondCodes CC =
(ARMCC::CondCodes)cast<ConstantSDNode>(ARMcc)->getZExtValue();
// (brcond Chain BB ne CPSR (cmpz (and (cmov 0 1 CC CPSR Cmp) 1) 0))
// -> (brcond Chain BB CC CPSR Cmp)
if (CC == ARMCC::NE && LHS.getOpcode() == ISD::AND && LHS->hasOneUse() &&
LHS->getOperand(0)->getOpcode() == ARMISD::CMOV &&
LHS->getOperand(0)->hasOneUse()) {
auto *LHS00C = dyn_cast<ConstantSDNode>(LHS->getOperand(0)->getOperand(0));
auto *LHS01C = dyn_cast<ConstantSDNode>(LHS->getOperand(0)->getOperand(1));
auto *LHS1C = dyn_cast<ConstantSDNode>(LHS->getOperand(1));
auto *RHSC = dyn_cast<ConstantSDNode>(RHS);
if ((LHS00C && LHS00C->getZExtValue() == 0) &&
(LHS01C && LHS01C->getZExtValue() == 1) &&
(LHS1C && LHS1C->getZExtValue() == 1) &&
(RHSC && RHSC->getZExtValue() == 0)) {
return DAG.getNode(
ARMISD::BRCOND, dl, VT, Chain, BB, LHS->getOperand(0)->getOperand(2),
LHS->getOperand(0)->getOperand(3), LHS->getOperand(0)->getOperand(4));
}
}
return SDValue();
}
/// PerformCMOVCombine - Target-specific DAG combining for ARMISD::CMOV.
SDValue
ARMTargetLowering::PerformCMOVCombine(SDNode *N, SelectionDAG &DAG) const {
SDValue Cmp = N->getOperand(4);
if (Cmp.getOpcode() != ARMISD::CMPZ)
// Only looking at EQ and NE cases.
return SDValue();
EVT VT = N->getValueType(0);
SDLoc dl(N);
SDValue LHS = Cmp.getOperand(0);
SDValue RHS = Cmp.getOperand(1);
SDValue FalseVal = N->getOperand(0);
SDValue TrueVal = N->getOperand(1);
SDValue ARMcc = N->getOperand(2);
ARMCC::CondCodes CC =
(ARMCC::CondCodes)cast<ConstantSDNode>(ARMcc)->getZExtValue();
// BFI is only available on V6T2+.
if (!Subtarget->isThumb1Only() && Subtarget->hasV6T2Ops()) {
SDValue R = PerformCMOVToBFICombine(N, DAG);
if (R)
return R;
}
// Simplify
// mov r1, r0
// cmp r1, x
// mov r0, y
// moveq r0, x
// to
// cmp r0, x
// movne r0, y
//
// mov r1, r0
// cmp r1, x
// mov r0, x
// movne r0, y
// to
// cmp r0, x
// movne r0, y
/// FIXME: Turn this into a target neutral optimization?
SDValue Res;
if (CC == ARMCC::NE && FalseVal == RHS && FalseVal != LHS) {
Res = DAG.getNode(ARMISD::CMOV, dl, VT, LHS, TrueVal, ARMcc,
N->getOperand(3), Cmp);
} else if (CC == ARMCC::EQ && TrueVal == RHS) {
SDValue ARMcc;
SDValue NewCmp = getARMCmp(LHS, RHS, ISD::SETNE, ARMcc, DAG, dl);
Res = DAG.getNode(ARMISD::CMOV, dl, VT, LHS, FalseVal, ARMcc,
N->getOperand(3), NewCmp);
}
// (cmov F T ne CPSR (cmpz (cmov 0 1 CC CPSR Cmp) 0))
// -> (cmov F T CC CPSR Cmp)
if (CC == ARMCC::NE && LHS.getOpcode() == ARMISD::CMOV && LHS->hasOneUse()) {
auto *LHS0C = dyn_cast<ConstantSDNode>(LHS->getOperand(0));
auto *LHS1C = dyn_cast<ConstantSDNode>(LHS->getOperand(1));
auto *RHSC = dyn_cast<ConstantSDNode>(RHS);
if ((LHS0C && LHS0C->getZExtValue() == 0) &&
(LHS1C && LHS1C->getZExtValue() == 1) &&
(RHSC && RHSC->getZExtValue() == 0)) {
return DAG.getNode(ARMISD::CMOV, dl, VT, FalseVal, TrueVal,
LHS->getOperand(2), LHS->getOperand(3),
LHS->getOperand(4));
}
}
if (Res.getNode()) {
KnownBits Known;
DAG.computeKnownBits(SDValue(N,0), Known);
// Capture demanded bits information that would be otherwise lost.
if (Known.Zero == 0xfffffffe)
Res = DAG.getNode(ISD::AssertZext, dl, MVT::i32, Res,
DAG.getValueType(MVT::i1));
else if (Known.Zero == 0xffffff00)
Res = DAG.getNode(ISD::AssertZext, dl, MVT::i32, Res,
DAG.getValueType(MVT::i8));
else if (Known.Zero == 0xffff0000)
Res = DAG.getNode(ISD::AssertZext, dl, MVT::i32, Res,
DAG.getValueType(MVT::i16));
}
return Res;
}
SDValue ARMTargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {
switch (N->getOpcode()) {
default: break;
case ARMISD::ADDE: return PerformADDECombine(N, DCI, Subtarget);
case ARMISD::UMLAL: return PerformUMLALCombine(N, DCI.DAG, Subtarget);
case ISD::ADD: return PerformADDCombine(N, DCI, Subtarget);
case ISD::SUB: return PerformSUBCombine(N, DCI);
case ISD::MUL: return PerformMULCombine(N, DCI, Subtarget);
case ISD::OR: return PerformORCombine(N, DCI, Subtarget);
case ISD::XOR: return PerformXORCombine(N, DCI, Subtarget);
case ISD::AND: return PerformANDCombine(N, DCI, Subtarget);
case ARMISD::ADDC:
case ARMISD::SUBC: return PerformAddcSubcCombine(N, DCI.DAG, Subtarget);
case ARMISD::SUBE: return PerformAddeSubeCombine(N, DCI.DAG, Subtarget);
case ARMISD::BFI: return PerformBFICombine(N, DCI);
case ARMISD::VMOVRRD: return PerformVMOVRRDCombine(N, DCI, Subtarget);
case ARMISD::VMOVDRR: return PerformVMOVDRRCombine(N, DCI.DAG);
case ISD::STORE: return PerformSTORECombine(N, DCI);
case ISD::BUILD_VECTOR: return PerformBUILD_VECTORCombine(N, DCI, Subtarget);
case ISD::INSERT_VECTOR_ELT: return PerformInsertEltCombine(N, DCI);
case ISD::VECTOR_SHUFFLE: return PerformVECTOR_SHUFFLECombine(N, DCI.DAG);
case ARMISD::VDUPLANE: return PerformVDUPLANECombine(N, DCI);
case ARMISD::VDUP: return PerformVDUPCombine(N, DCI);
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT:
return PerformVCVTCombine(N, DCI.DAG, Subtarget);
case ISD::FDIV:
return PerformVDIVCombine(N, DCI.DAG, Subtarget);
case ISD::INTRINSIC_WO_CHAIN: return PerformIntrinsicCombine(N, DCI.DAG);
case ISD::SHL:
case ISD::SRA:
case ISD::SRL: return PerformShiftCombine(N, DCI.DAG, Subtarget);
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND: return PerformExtendCombine(N, DCI.DAG, Subtarget);
case ARMISD::CMOV: return PerformCMOVCombine(N, DCI.DAG);
case ARMISD::BRCOND: return PerformBRCONDCombine(N, DCI.DAG);
case ISD::LOAD: return PerformLOADCombine(N, DCI);
case ARMISD::VLD1DUP:
case ARMISD::VLD2DUP:
case ARMISD::VLD3DUP:
case ARMISD::VLD4DUP:
return PerformVLDCombine(N, DCI);
case ARMISD::BUILD_VECTOR:
return PerformARMBUILD_VECTORCombine(N, DCI);
case ARMISD::SMULWB: {
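// SMULWB multiplies a word by the bottom 16 bits of the second operand, so
// only the low half of operand 1 is demanded.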
unsigned BitWidth = N->getValueType(0).getSizeInBits();
APInt DemandedMask = APInt::getLowBitsSet(BitWidth, 16);
if (SimplifyDemandedBits(N->getOperand(1), DemandedMask, DCI))
return SDValue();
break;
}
case ARMISD::SMULWT: {
unsigned BitWidth = N->getValueType(0).getSizeInBits();
APInt DemandedMask = APInt::getHighBitsSet(BitWidth, 16);
if (SimplifyDemandedBits(N->getOperand(1), DemandedMask, DCI))
return SDValue();
break;
}
case ARMISD::SMLALBB: {
unsigned BitWidth = N->getValueType(0).getSizeInBits();
APInt DemandedMask = APInt::getLowBitsSet(BitWidth, 16);
if ((SimplifyDemandedBits(N->getOperand(0), DemandedMask, DCI)) ||
(SimplifyDemandedBits(N->getOperand(1), DemandedMask, DCI)))
return SDValue();
break;
}
case ARMISD::SMLALBT: {
unsigned LowWidth = N->getOperand(0).getValueType().getSizeInBits();
APInt LowMask = APInt::getLowBitsSet(LowWidth, 16);
unsigned HighWidth = N->getOperand(1).getValueType().getSizeInBits();
APInt HighMask = APInt::getHighBitsSet(HighWidth, 16);
if ((SimplifyDemandedBits(N->getOperand(0), LowMask, DCI)) ||
(SimplifyDemandedBits(N->getOperand(1), HighMask, DCI)))
return SDValue();
break;
}
case ARMISD::SMLALTB: {
unsigned HighWidth = N->getOperand(0).getValueType().getSizeInBits();
APInt HighMask = APInt::getHighBitsSet(HighWidth, 16);
unsigned LowWidth = N->getOperand(1).getValueType().getSizeInBits();
APInt LowMask = APInt::getLowBitsSet(LowWidth, 16);
if ((SimplifyDemandedBits(N->getOperand(0), HighMask, DCI)) ||
(SimplifyDemandedBits(N->getOperand(1), LowMask, DCI)))
return SDValue();
break;
}
case ARMISD::SMLALTT: {
unsigned BitWidth = N->getValueType(0).getSizeInBits();
APInt DemandedMask = APInt::getHighBitsSet(BitWidth, 16);
if ((SimplifyDemandedBits(N->getOperand(0), DemandedMask, DCI)) ||
(SimplifyDemandedBits(N->getOperand(1), DemandedMask, DCI)))
return SDValue();
break;
}
case ISD::INTRINSIC_VOID:
case ISD::INTRINSIC_W_CHAIN:
switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
case Intrinsic::arm_neon_vld1:
case Intrinsic::arm_neon_vld2:
case Intrinsic::arm_neon_vld3:
case Intrinsic::arm_neon_vld4:
case Intrinsic::arm_neon_vld2lane:
case Intrinsic::arm_neon_vld3lane:
case Intrinsic::arm_neon_vld4lane:
case Intrinsic::arm_neon_vst1:
case Intrinsic::arm_neon_vst2:
case Intrinsic::arm_neon_vst3:
case Intrinsic::arm_neon_vst4:
case Intrinsic::arm_neon_vst2lane:
case Intrinsic::arm_neon_vst3lane:
case Intrinsic::arm_neon_vst4lane:
return PerformVLDCombine(N, DCI);
default: break;
}
break;
}
return SDValue();
}
bool ARMTargetLowering::isDesirableToTransformToIntegerOp(unsigned Opc,
EVT VT) const {
return (VT == MVT::f32) && (Opc == ISD::LOAD || Opc == ISD::STORE);
}
bool ARMTargetLowering::allowsMisalignedMemoryAccesses(EVT VT,
unsigned,
unsigned,
bool *Fast) const {
// The AllowsUnaligned flag models the SCTLR.A setting in ARM CPUs.
bool AllowsUnaligned = Subtarget->allowsUnalignedMem();
switch (VT.getSimpleVT().SimpleTy) {
default:
return false;
case MVT::i8:
case MVT::i16:
case MVT::i32: {
// Unaligned access can use (for example) LDRB, LDRH, LDR
if (AllowsUnaligned) {
if (Fast)
*Fast = Subtarget->hasV7Ops();
return true;
}
return false;
}
case MVT::f64:
case MVT::v2f64: {
// For any little-endian targets with neon, we can support unaligned ld/st
// of D and Q (e.g. {D0,D1}) registers by using vld1.i8/vst1.i8.
// A big-endian target may also explicitly support unaligned accesses
if (Subtarget->hasNEON() && (AllowsUnaligned || Subtarget->isLittle())) {
if (Fast)
*Fast = true;
return true;
}
return false;
}
}
}
static bool memOpAlign(unsigned DstAlign, unsigned SrcAlign,
unsigned AlignCheck) {
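// Note this check is symmetric in SrcAlign/DstAlign, so the swapped-argument
// call sites in getOptimalMemOpType below are still correct.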
return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&
(DstAlign == 0 || DstAlign % AlignCheck == 0));
}
EVT ARMTargetLowering::getOptimalMemOpType(uint64_t Size,
unsigned DstAlign, unsigned SrcAlign,
bool IsMemset, bool ZeroMemset,
bool MemcpyStrSrc,
MachineFunction &MF) const {
const Function *F = MF.getFunction();
// See if we can use NEON instructions for this...
if ((!IsMemset || ZeroMemset) && Subtarget->hasNEON() &&
!F->hasFnAttribute(Attribute::NoImplicitFloat)) {
bool Fast;
if (Size >= 16 &&
(memOpAlign(SrcAlign, DstAlign, 16) ||
(allowsMisalignedMemoryAccesses(MVT::v2f64, 0, 1, &Fast) && Fast))) {
return MVT::v2f64;
} else if (Size >= 8 &&
(memOpAlign(SrcAlign, DstAlign, 8) ||
(allowsMisalignedMemoryAccesses(MVT::f64, 0, 1, &Fast) &&
Fast))) {
return MVT::f64;
}
}
// Let the target-independent logic figure it out.
return MVT::Other;
}
bool ARMTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
if (Val.getOpcode() != ISD::LOAD)
return false;
EVT VT1 = Val.getValueType();
if (!VT1.isSimple() || !VT1.isInteger() ||
!VT2.isSimple() || !VT2.isInteger())
return false;
switch (VT1.getSimpleVT().SimpleTy) {
default: break;
case MVT::i1:
case MVT::i8:
case MVT::i16:
// 8-bit and 16-bit loads implicitly zero-extend to 32 bits.
return true;
}
return false;
}
bool ARMTargetLowering::isVectorLoadExtDesirable(SDValue ExtVal) const {
EVT VT = ExtVal.getValueType();
if (!isTypeLegal(VT))
return false;
// Don't create a loadext if we can fold the extension into a wide/long
// instruction.
// If there's more than one user instruction, the loadext is desirable no
// matter what. There can be two uses by the same instruction.
if (ExtVal->use_empty() ||
!ExtVal->use_begin()->isOnlyUserOf(ExtVal.getNode()))
return true;
SDNode *U = *ExtVal->use_begin();
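// These opcodes can fold the extension into a long instruction (e.g. a zext
// folded into VADDL/VSHLL), so a separate extending load buys nothing.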
if ((U->getOpcode() == ISD::ADD || U->getOpcode() == ISD::SUB ||
U->getOpcode() == ISD::SHL || U->getOpcode() == ARMISD::VSHL))
return false;
return true;
}
bool ARMTargetLowering::allowTruncateForTailCall(Type *Ty1, Type *Ty2) const {
if (!Ty1->isIntegerTy() || !Ty2->isIntegerTy())
return false;
if (!isTypeLegal(EVT::getEVT(Ty1)))
return false;
assert(Ty1->getPrimitiveSizeInBits() <= 64 && "i128 is probably not a noop");
// Assuming the caller doesn't have a zeroext or signext return parameter,
// truncation all the way down to i1 is valid.
return true;
}
int ARMTargetLowering::getScalingFactorCost(const DataLayout &DL,
const AddrMode &AM, Type *Ty,
unsigned AS) const {
if (isLegalAddressingMode(DL, AM, Ty, AS)) {
if (Subtarget->hasFPAO())
return AM.Scale < 0 ? 1 : 0; // positive offsets execute faster
return 0;
}
return -1;
}
static bool isLegalT1AddressImmediate(int64_t V, EVT VT) {
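// Thumb1 load/store offsets are an unsigned 5-bit immediate scaled by the
// access size, so e.g. an i32 LDR can reach byte offsets 0, 4, ..., 124.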
if (V < 0)
return false;
unsigned Scale = 1;
switch (VT.getSimpleVT().SimpleTy) {
default: return false;
case MVT::i1:
case MVT::i8:
// Scale == 1;
break;
case MVT::i16:
// Scale == 2;
Scale = 2;
break;
case MVT::i32:
// Scale == 4;
Scale = 4;
break;
}
if ((V & (Scale - 1)) != 0)
return false;
V /= Scale;
return V == (V & ((1LL << 5) - 1));
}
static bool isLegalT2AddressImmediate(int64_t V, EVT VT,
const ARMSubtarget *Subtarget) {
bool isNeg = false;
if (V < 0) {
isNeg = true;
V = - V;
}
switch (VT.getSimpleVT().SimpleTy) {
default: return false;
case MVT::i1:
case MVT::i8:
case MVT::i16:
case MVT::i32:
// + imm12 or - imm8
if (isNeg)
return V == (V & ((1LL << 8) - 1));
return V == (V & ((1LL << 12) - 1));
case MVT::f32:
case MVT::f64:
// Same as ARM mode. FIXME: NEON?
if (!Subtarget->hasVFP2())
return false;
if ((V & 3) != 0)
return false;
V >>= 2;
return V == (V & ((1LL << 8) - 1));
}
}
/// isLegalAddressImmediate - Return true if the integer value can be used
/// as the offset of the target addressing mode for load / store of the
/// given type.
static bool isLegalAddressImmediate(int64_t V, EVT VT,
const ARMSubtarget *Subtarget) {
if (V == 0)
return true;
if (!VT.isSimple())
return false;
if (Subtarget->isThumb1Only())
return isLegalT1AddressImmediate(V, VT);
else if (Subtarget->isThumb2())
return isLegalT2AddressImmediate(V, VT, Subtarget);
// ARM mode.
if (V < 0)
V = - V;
switch (VT.getSimpleVT().SimpleTy) {
default: return false;
case MVT::i1:
case MVT::i8:
case MVT::i32:
// +- imm12
return V == (V & ((1LL << 12) - 1));
case MVT::i16:
// +- imm8
return V == (V & ((1LL << 8) - 1));
case MVT::f32:
case MVT::f64:
if (!Subtarget->hasVFP2()) // FIXME: NEON?
return false;
if ((V & 3) != 0)
return false;
V >>= 2;
return V == (V & ((1LL << 8) - 1));
}
}
bool ARMTargetLowering::isLegalT2ScaledAddressingMode(const AddrMode &AM,
EVT VT) const {
int Scale = AM.Scale;
if (Scale < 0)
return false;
switch (VT.getSimpleVT().SimpleTy) {
default: return false;
case MVT::i1:
case MVT::i8:
case MVT::i16:
case MVT::i32:
if (Scale == 1)
return true;
// r + r << imm
Scale = Scale & ~1;
return Scale == 2 || Scale == 4 || Scale == 8;
case MVT::i64:
// r + r
if (((unsigned)AM.HasBaseReg + Scale) <= 2)
return true;
return false;
case MVT::isVoid:
// Note, we allow "void" uses (basically, uses that aren't loads or
// stores), because ARM allows folding a scale into many arithmetic
// operations. This should be made more precise and revisited later.
// Allow r << imm, but the imm has to be a multiple of two.
if (Scale & 1) return false;
return isPowerOf2_32(Scale);
}
}
/// isLegalAddressingMode - Return true if the addressing mode represented
/// by AM is legal for this target, for a load/store of the specified type.
bool ARMTargetLowering::isLegalAddressingMode(const DataLayout &DL,
const AddrMode &AM, Type *Ty,
unsigned AS) const {
EVT VT = getValueType(DL, Ty, true);
if (!isLegalAddressImmediate(AM.BaseOffs, VT, Subtarget))
return false;
// Can never fold addr of global into load/store.
if (AM.BaseGV)
return false;
switch (AM.Scale) {
case 0: // no scale reg, must be "r+i" or "r", or "i".
break;
case 1:
if (Subtarget->isThumb1Only())
return false;
LLVM_FALLTHROUGH;
default:
// ARM doesn't support any R+R*scale+imm addr modes.
if (AM.BaseOffs)
return false;
if (!VT.isSimple())
return false;
if (Subtarget->isThumb2())
return isLegalT2ScaledAddressingMode(AM, VT);
int Scale = AM.Scale;
switch (VT.getSimpleVT().SimpleTy) {
default: return false;
case MVT::i1:
case MVT::i8:
case MVT::i32:
if (Scale < 0) Scale = -Scale;
if (Scale == 1)
return true;
// r + r << imm
return isPowerOf2_32(Scale & ~1);
case MVT::i16:
case MVT::i64:
// r + r
if (((unsigned)AM.HasBaseReg + Scale) <= 2)
return true;
return false;
case MVT::isVoid:
// Note, we allow "void" uses (basically, uses that aren't loads or
// stores), because ARM allows folding a scale into many arithmetic
// operations. This should be made more precise and revisited later.
// Allow r << imm, but the imm has to be a multiple of two.
if (Scale & 1) return false;
return isPowerOf2_32(Scale);
}
}
return true;
}
/// isLegalICmpImmediate - Return true if the specified immediate is legal
/// icmp immediate, that is the target has icmp instructions which can compare
/// a register against the immediate without having to materialize the
/// immediate into a register.
bool ARMTargetLowering::isLegalICmpImmediate(int64_t Imm) const {
// Thumb2 and ARM modes can use cmn for negative immediates.
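// e.g. "cmp r0, #-1" can be selected as "cmn r0, #1".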
if (!Subtarget->isThumb())
return ARM_AM::getSOImmVal(std::abs(Imm)) != -1;
if (Subtarget->isThumb2())
return ARM_AM::getT2SOImmVal(std::abs(Imm)) != -1;
// Thumb1 doesn't have cmn, and supports only 8-bit unsigned immediates.
return Imm >= 0 && Imm <= 255;
}
/// isLegalAddImmediate - Return true if the specified immediate is a legal add
/// *or sub* immediate, that is the target has add or sub instructions which can
/// add a register with the immediate without having to materialize the
/// immediate into a register.
bool ARMTargetLowering::isLegalAddImmediate(int64_t Imm) const {
// Same encoding for add/sub, just flip the sign.
int64_t AbsImm = std::abs(Imm);
if (!Subtarget->isThumb())
return ARM_AM::getSOImmVal(AbsImm) != -1;
if (Subtarget->isThumb2())
return ARM_AM::getT2SOImmVal(AbsImm) != -1;
// Thumb1 only has 8-bit unsigned immediates.
return AbsImm >= 0 && AbsImm <= 255;
}
static bool getARMIndexedAddressParts(SDNode *Ptr, EVT VT,
bool isSEXTLoad, SDValue &Base,
SDValue &Offset, bool &isInc,
SelectionDAG &DAG) {
if (Ptr->getOpcode() != ISD::ADD && Ptr->getOpcode() != ISD::SUB)
return false;
if (VT == MVT::i16 || ((VT == MVT::i8 || VT == MVT::i1) && isSEXTLoad)) {
// AddressingMode 3
Base = Ptr->getOperand(0);
if (ConstantSDNode *RHS = dyn_cast<ConstantSDNode>(Ptr->getOperand(1))) {
int RHSC = (int)RHS->getZExtValue();
if (RHSC < 0 && RHSC > -256) {
assert(Ptr->getOpcode() == ISD::ADD);
isInc = false;
Offset = DAG.getConstant(-RHSC, SDLoc(Ptr), RHS->getValueType(0));
return true;
}
}
isInc = (Ptr->getOpcode() == ISD::ADD);
Offset = Ptr->getOperand(1);
return true;
} else if (VT == MVT::i32 || VT == MVT::i8 || VT == MVT::i1) {
// AddressingMode 2
if (ConstantSDNode *RHS = dyn_cast<ConstantSDNode>(Ptr->getOperand(1))) {
int RHSC = (int)RHS->getZExtValue();
if (RHSC < 0 && RHSC > -0x1000) {
assert(Ptr->getOpcode() == ISD::ADD);
isInc = false;
Offset = DAG.getConstant(-RHSC, SDLoc(Ptr), RHS->getValueType(0));
Base = Ptr->getOperand(0);
return true;
}
}
if (Ptr->getOpcode() == ISD::ADD) {
isInc = true;
ARM_AM::ShiftOpc ShOpcVal =
ARM_AM::getShiftOpcForNode(Ptr->getOperand(0).getOpcode());
if (ShOpcVal != ARM_AM::no_shift) {
Base = Ptr->getOperand(1);
Offset = Ptr->getOperand(0);
} else {
Base = Ptr->getOperand(0);
Offset = Ptr->getOperand(1);
}
return true;
}
isInc = (Ptr->getOpcode() == ISD::ADD);
Base = Ptr->getOperand(0);
Offset = Ptr->getOperand(1);
return true;
}
// FIXME: Use VLDM / VSTM to emulate indexed FP load / store.
return false;
}
static bool getT2IndexedAddressParts(SDNode *Ptr, EVT VT,
bool isSEXTLoad, SDValue &Base,
SDValue &Offset, bool &isInc,
SelectionDAG &DAG) {
if (Ptr->getOpcode() != ISD::ADD && Ptr->getOpcode() != ISD::SUB)
return false;
Base = Ptr->getOperand(0);
if (ConstantSDNode *RHS = dyn_cast<ConstantSDNode>(Ptr->getOperand(1))) {
int RHSC = (int)RHS->getZExtValue();
if (RHSC < 0 && RHSC > -0x100) { // 8 bits.
assert(Ptr->getOpcode() == ISD::ADD);
isInc = false;
Offset = DAG.getConstant(-RHSC, SDLoc(Ptr), RHS->getValueType(0));
return true;
} else if (RHSC > 0 && RHSC < 0x100) { // 8 bit, no zero.
isInc = Ptr->getOpcode() == ISD::ADD;
Offset = DAG.getConstant(RHSC, SDLoc(Ptr), RHS->getValueType(0));
return true;
}
}
return false;
}
/// getPreIndexedAddressParts - Returns true by value, and sets the base
/// pointer, offset, and addressing mode by reference, if the node's address
/// can be legally represented as a pre-indexed load / store address.
bool
ARMTargetLowering::getPreIndexedAddressParts(SDNode *N, SDValue &Base,
SDValue &Offset,
ISD::MemIndexedMode &AM,
SelectionDAG &DAG) const {
if (Subtarget->isThumb1Only())
return false;
EVT VT;
SDValue Ptr;
bool isSEXTLoad = false;
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(N)) {
Ptr = LD->getBasePtr();
VT = LD->getMemoryVT();
isSEXTLoad = LD->getExtensionType() == ISD::SEXTLOAD;
} else if (StoreSDNode *ST = dyn_cast<StoreSDNode>(N)) {
Ptr = ST->getBasePtr();
VT = ST->getMemoryVT();
} else
return false;
bool isInc;
bool isLegal = false;
if (Subtarget->isThumb2())
isLegal = getT2IndexedAddressParts(Ptr.getNode(), VT, isSEXTLoad, Base,
Offset, isInc, DAG);
else
isLegal = getARMIndexedAddressParts(Ptr.getNode(), VT, isSEXTLoad, Base,
Offset, isInc, DAG);
if (!isLegal)
return false;
AM = isInc ? ISD::PRE_INC : ISD::PRE_DEC;
return true;
}
/// getPostIndexedAddressParts - Returns true by value, and sets the base
/// pointer, offset, and addressing mode by reference, if this node can be
/// combined with a load / store to form a post-indexed load / store.
bool ARMTargetLowering::getPostIndexedAddressParts(SDNode *N, SDNode *Op,
SDValue &Base,
SDValue &Offset,
ISD::MemIndexedMode &AM,
SelectionDAG &DAG) const {
EVT VT;
SDValue Ptr;
bool isSEXTLoad = false, isNonExt;
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(N)) {
VT = LD->getMemoryVT();
Ptr = LD->getBasePtr();
isSEXTLoad = LD->getExtensionType() == ISD::SEXTLOAD;
isNonExt = LD->getExtensionType() == ISD::NON_EXTLOAD;
} else if (StoreSDNode *ST = dyn_cast<StoreSDNode>(N)) {
VT = ST->getMemoryVT();
Ptr = ST->getBasePtr();
isNonExt = !ST->isTruncatingStore();
} else
return false;
if (Subtarget->isThumb1Only()) {
// Thumb-1 can do a limited post-inc load or store as an updating LDM. It
// must be non-extending/truncating, i32, with an offset of 4.
assert(Op->getValueType(0) == MVT::i32 && "Non-i32 post-inc op?!");
if (Op->getOpcode() != ISD::ADD || !isNonExt)
return false;
auto *RHS = dyn_cast<ConstantSDNode>(Op->getOperand(1));
if (!RHS || RHS->getZExtValue() != 4)
return false;
Offset = Op->getOperand(1);
Base = Op->getOperand(0);
AM = ISD::POST_INC;
return true;
}
bool isInc;
bool isLegal = false;
if (Subtarget->isThumb2())
isLegal = getT2IndexedAddressParts(Op, VT, isSEXTLoad, Base, Offset,
isInc, DAG);
else
isLegal = getARMIndexedAddressParts(Op, VT, isSEXTLoad, Base, Offset,
isInc, DAG);
if (!isLegal)
return false;
if (Ptr != Base) {
// Swap base ptr and offset to catch more post-index load / store when
// it's legal. In Thumb2 mode, offset must be an immediate.
if (Ptr == Offset && Op->getOpcode() == ISD::ADD &&
!Subtarget->isThumb2())
std::swap(Base, Offset);
// Post-indexed load / store update the base pointer.
if (Ptr != Base)
return false;
}
AM = isInc ? ISD::POST_INC : ISD::POST_DEC;
return true;
}
void ARMTargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
KnownBits &Known,
const APInt &DemandedElts,
const SelectionDAG &DAG,
unsigned Depth) const {
unsigned BitWidth = Known.getBitWidth();
Known.resetAll();
switch (Op.getOpcode()) {
default: break;
case ARMISD::ADDC:
case ARMISD::ADDE:
case ARMISD::SUBC:
case ARMISD::SUBE:
// These nodes' second result is a boolean, so only the low bit can be set.
if (Op.getResNo() == 0)
break;
Known.Zero |= APInt::getHighBitsSet(BitWidth, BitWidth - 1);
break;
case ARMISD::CMOV: {
// Bits are known zero/one if known on the LHS and RHS.
DAG.computeKnownBits(Op.getOperand(0), Known, Depth+1);
if (Known.isUnknown())
return;
KnownBits KnownRHS;
DAG.computeKnownBits(Op.getOperand(1), KnownRHS, Depth+1);
Known.Zero &= KnownRHS.Zero;
Known.One &= KnownRHS.One;
return;
}
case ISD::INTRINSIC_W_CHAIN: {
ConstantSDNode *CN = cast<ConstantSDNode>(Op->getOperand(1));
Intrinsic::ID IntID = static_cast<Intrinsic::ID>(CN->getZExtValue());
switch (IntID) {
default: return;
case Intrinsic::arm_ldaex:
case Intrinsic::arm_ldrex: {
EVT VT = cast<MemIntrinsicSDNode>(Op)->getMemoryVT();
unsigned MemBits = VT.getScalarSizeInBits();
Known.Zero |= APInt::getHighBitsSet(BitWidth, BitWidth - MemBits);
return;
}
}
}
case ARMISD::BFI: {
// Conservatively, we can recurse down the first operand
// and just mask out all affected bits.
DAG.computeKnownBits(Op.getOperand(0), Known, Depth + 1);
// The operand to BFI is already a mask suitable for removing the bits it
// sets.
ConstantSDNode *CI = cast<ConstantSDNode>(Op.getOperand(2));
const APInt &Mask = CI->getAPIntValue();
Known.Zero &= Mask;
Known.One &= Mask;
return;
}
}
}
//===----------------------------------------------------------------------===//
// ARM Inline Assembly Support
//===----------------------------------------------------------------------===//
bool ARMTargetLowering::ExpandInlineAsm(CallInst *CI) const {
// Looking for "rev" which is V6+.
if (!Subtarget->hasV6Ops())
return false;
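// Recognizes (assuming low-register operands) the GCC byte-swap idiom:
//   __asm__("rev %0, %1" : "=l"(r) : "l"(x));
// and lowers it to the bswap intrinsic instead of emitting inline asm.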
InlineAsm *IA = cast<InlineAsm>(CI->getCalledValue());
std::string AsmStr = IA->getAsmString();
SmallVector<StringRef, 4> AsmPieces;
SplitString(AsmStr, AsmPieces, ";\n");
switch (AsmPieces.size()) {
default: return false;
case 1:
AsmStr = AsmPieces[0];
AsmPieces.clear();
SplitString(AsmStr, AsmPieces, " \t,");
// rev $0, $1
if (AsmPieces.size() == 3 &&
AsmPieces[0] == "rev" && AsmPieces[1] == "$0" && AsmPieces[2] == "$1" &&
IA->getConstraintString().compare(0, 4, "=l,l") == 0) {
IntegerType *Ty = dyn_cast<IntegerType>(CI->getType());
if (Ty && Ty->getBitWidth() == 32)
return IntrinsicLowering::LowerToByteSwap(CI);
}
break;
}
return false;
}
const char *ARMTargetLowering::LowerXConstraint(EVT ConstraintVT) const {
// At this point, we have to lower this constraint to something else, so we
// lower it to an "r" or "w". However, by doing this we will force the result
// to be in register, while the X constraint is much more permissive.
//
// Although we are correct (we are free to emit anything, without
// constraints), we might break use cases that would expect us to be more
// efficient and emit something else.
if (!Subtarget->hasVFP2())
return "r";
if (ConstraintVT.isFloatingPoint())
return "w";
if (ConstraintVT.isVector() && Subtarget->hasNEON() &&
(ConstraintVT.getSizeInBits() == 64 ||
ConstraintVT.getSizeInBits() == 128))
return "w";
return "r";
}
/// getConstraintType - Given a constraint letter, return the type of
/// constraint it is for this target.
ARMTargetLowering::ConstraintType
ARMTargetLowering::getConstraintType(StringRef Constraint) const {
if (Constraint.size() == 1) {
switch (Constraint[0]) {
default: break;
case 'l': return C_RegisterClass;
case 'w': return C_RegisterClass;
case 'h': return C_RegisterClass;
case 'x': return C_RegisterClass;
case 't': return C_RegisterClass;
case 'j': return C_Other; // Constant for movw.
// An address with a single base register. Due to the way we
// currently handle addresses it is the same as an 'r' memory constraint.
case 'Q': return C_Memory;
}
} else if (Constraint.size() == 2) {
switch (Constraint[0]) {
default: break;
// All 'U+' constraints are addresses.
case 'U': return C_Memory;
}
}
return TargetLowering::getConstraintType(Constraint);
}
/// Examine constraint type and operand type and determine a weight value.
/// This object must already have been set up with the operand type
/// and the current alternative constraint selected.
TargetLowering::ConstraintWeight
ARMTargetLowering::getSingleConstraintMatchWeight(
AsmOperandInfo &info, const char *constraint) const {
ConstraintWeight weight = CW_Invalid;
Value *CallOperandVal = info.CallOperandVal;
// If we don't have a value, we can't do a match,
// but allow it at the lowest weight.
if (!CallOperandVal)
return CW_Default;
Type *type = CallOperandVal->getType();
// Look at the constraint type.
switch (*constraint) {
default:
weight = TargetLowering::getSingleConstraintMatchWeight(info, constraint);
break;
case 'l':
if (type->isIntegerTy()) {
if (Subtarget->isThumb())
weight = CW_SpecificReg;
else
weight = CW_Register;
}
break;
case 'w':
if (type->isFloatingPointTy())
weight = CW_Register;
break;
}
return weight;
}
typedef std::pair<unsigned, const TargetRegisterClass*> RCPair;
RCPair ARMTargetLowering::getRegForInlineAsmConstraint(
const TargetRegisterInfo *TRI, StringRef Constraint, MVT VT) const {
if (Constraint.size() == 1) {
// GCC ARM Constraint Letters
switch (Constraint[0]) {
case 'l': // Low regs or general regs.
if (Subtarget->isThumb())
return RCPair(0U, &ARM::tGPRRegClass);
return RCPair(0U, &ARM::GPRRegClass);
case 'h': // High regs or no regs.
if (Subtarget->isThumb())
return RCPair(0U, &ARM::hGPRRegClass);
break;
case 'r':
if (Subtarget->isThumb1Only())
return RCPair(0U, &ARM::tGPRRegClass);
return RCPair(0U, &ARM::GPRRegClass);
case 'w':
if (VT == MVT::Other)
break;
if (VT == MVT::f32)
return RCPair(0U, &ARM::SPRRegClass);
if (VT.getSizeInBits() == 64)
return RCPair(0U, &ARM::DPRRegClass);
if (VT.getSizeInBits() == 128)
return RCPair(0U, &ARM::QPRRegClass);
break;
case 'x':
if (VT == MVT::Other)
break;
if (VT == MVT::f32)
return RCPair(0U, &ARM::SPR_8RegClass);
if (VT.getSizeInBits() == 64)
return RCPair(0U, &ARM::DPR_8RegClass);
if (VT.getSizeInBits() == 128)
return RCPair(0U, &ARM::QPR_8RegClass);
break;
case 't':
if (VT == MVT::f32)
return RCPair(0U, &ARM::SPRRegClass);
break;
}
}
if (StringRef("{cc}").equals_lower(Constraint))
return std::make_pair(unsigned(ARM::CPSR), &ARM::CCRRegClass);
return TargetLowering::getRegForInlineAsmConstraint(TRI, Constraint, VT);
}
/// LowerAsmOperandForConstraint - Lower the specified operand into the Ops
/// vector. If it is invalid, don't add anything to Ops.
void ARMTargetLowering::LowerAsmOperandForConstraint(SDValue Op,
std::string &Constraint,
std::vector<SDValue>&Ops,
SelectionDAG &DAG) const {
SDValue Result;
// Currently only support length 1 constraints.
if (Constraint.length() != 1) return;
char ConstraintLetter = Constraint[0];
switch (ConstraintLetter) {
default: break;
case 'j':
case 'I': case 'J': case 'K': case 'L':
case 'M': case 'N': case 'O':
ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op);
if (!C)
return;
int64_t CVal64 = C->getSExtValue();
int CVal = (int) CVal64;
// None of these constraints allow values larger than 32 bits. Check
// that the value fits in an int.
if (CVal != CVal64)
return;
switch (ConstraintLetter) {
case 'j':
// Constant suitable for movw, must be between 0 and
// 65535.
if (Subtarget->hasV6T2Ops())
if (CVal >= 0 && CVal <= 65535)
break;
return;
case 'I':
if (Subtarget->isThumb1Only()) {
// This must be a constant between 0 and 255, for ADD
// immediates.
if (CVal >= 0 && CVal <= 255)
break;
} else if (Subtarget->isThumb2()) {
// A constant that can be used as an immediate value in a
// data-processing instruction.
if (ARM_AM::getT2SOImmVal(CVal) != -1)
break;
} else {
// A constant that can be used as an immediate value in a
// data-processing instruction.
if (ARM_AM::getSOImmVal(CVal) != -1)
break;
}
return;
case 'J':
if (Subtarget->isThumb1Only()) {
// This must be a constant between -255 and -1, for negated ADD
// immediates. This can be used in GCC with an "n" modifier that
// prints the negated value, for use with SUB instructions. It is
// not useful otherwise but is implemented for compatibility.
if (CVal >= -255 && CVal <= -1)
break;
} else {
// This must be a constant between -4095 and 4095. It is not clear
// what this constraint is intended for. Implemented for
// compatibility with GCC.
if (CVal >= -4095 && CVal <= 4095)
break;
}
return;
case 'K':
if (Subtarget->isThumb1Only()) {
// A 32-bit value where only one byte has a nonzero value. Exclude
// zero to match GCC. This constraint is used by GCC internally for
// constants that can be loaded with a move/shift combination.
// It is not useful otherwise but is implemented for compatibility.
if (CVal != 0 && ARM_AM::isThumbImmShiftedVal(CVal))
break;
} else if (Subtarget->isThumb2()) {
// A constant whose bitwise inverse can be used as an immediate
// value in a data-processing instruction. This can be used in GCC
// with a "B" modifier that prints the inverted value, for use with
// BIC and MVN instructions. It is not useful otherwise but is
// implemented for compatibility.
if (ARM_AM::getT2SOImmVal(~CVal) != -1)
break;
} else {
// A constant whose bitwise inverse can be used as an immediate
// value in a data-processing instruction. This can be used in GCC
// with a "B" modifier that prints the inverted value, for use with
// BIC and MVN instructions. It is not useful otherwise but is
// implemented for compatibility.
if (ARM_AM::getSOImmVal(~CVal) != -1)
break;
}
return;
case 'L':
if (Subtarget->isThumb1Only()) {
// This must be a constant between -7 and 7,
// for 3-operand ADD/SUB immediate instructions.
if (CVal >= -7 && CVal < 7)
break;
} else if (Subtarget->isThumb2()) {
// A constant whose negation can be used as an immediate value in a
// data-processing instruction. This can be used in GCC with an "n"
// modifier that prints the negated value, for use with SUB
// instructions. It is not useful otherwise but is implemented for
// compatibility.
if (ARM_AM::getT2SOImmVal(-CVal) != -1)
break;
} else {
// A constant whose negation can be used as an immediate value in a
// data-processing instruction. This can be used in GCC with an "n"
// modifier that prints the negated value, for use with SUB
// instructions. It is not useful otherwise but is implemented for
// compatibility.
if (ARM_AM::getSOImmVal(-CVal) != -1)
break;
}
return;
case 'M':
if (Subtarget->isThumb1Only()) {
// This must be a multiple of 4 between 0 and 1020, for
// ADD sp + immediate.
if ((CVal >= 0 && CVal <= 1020) && ((CVal & 3) == 0))
break;
} else {
// A power of two or a constant between 0 and 32. This is used in
// GCC for the shift amount on shifted register operands, but it is
// useful in general for any shift amounts.
if ((CVal >= 0 && CVal <= 32) || ((CVal & (CVal - 1)) == 0))
break;
}
return;
case 'N':
if (Subtarget->isThumb()) { // FIXME thumb2
// This must be a constant between 0 and 31, for shift amounts.
if (CVal >= 0 && CVal <= 31)
break;
}
return;
case 'O':
if (Subtarget->isThumb()) { // FIXME thumb2
// This must be a multiple of 4 between -508 and 508, for
// ADD/SUB sp = sp + immediate.
if ((CVal >= -508 && CVal <= 508) && ((CVal & 3) == 0))
break;
}
return;
}
Result = DAG.getTargetConstant(CVal, SDLoc(Op), Op.getValueType());
break;
}
if (Result.getNode()) {
Ops.push_back(Result);
return;
}
return TargetLowering::LowerAsmOperandForConstraint(Op, Constraint, Ops, DAG);
}
static RTLIB::Libcall getDivRemLibcall(
const SDNode *N, MVT::SimpleValueType SVT) {
assert((N->getOpcode() == ISD::SDIVREM || N->getOpcode() == ISD::UDIVREM ||
N->getOpcode() == ISD::SREM || N->getOpcode() == ISD::UREM) &&
"Unhandled Opcode in getDivRemLibcall");
bool isSigned = N->getOpcode() == ISD::SDIVREM ||
N->getOpcode() == ISD::SREM;
RTLIB::Libcall LC;
switch (SVT) {
default: llvm_unreachable("Unexpected request for libcall!");
case MVT::i8: LC = isSigned ? RTLIB::SDIVREM_I8 : RTLIB::UDIVREM_I8; break;
case MVT::i16: LC = isSigned ? RTLIB::SDIVREM_I16 : RTLIB::UDIVREM_I16; break;
case MVT::i32: LC = isSigned ? RTLIB::SDIVREM_I32 : RTLIB::UDIVREM_I32; break;
case MVT::i64: LC = isSigned ? RTLIB::SDIVREM_I64 : RTLIB::UDIVREM_I64; break;
}
return LC;
}
static TargetLowering::ArgListTy getDivRemArgList(
const SDNode *N, LLVMContext *Context, const ARMSubtarget *Subtarget) {
assert((N->getOpcode() == ISD::SDIVREM || N->getOpcode() == ISD::UDIVREM ||
N->getOpcode() == ISD::SREM || N->getOpcode() == ISD::UREM) &&
"Unhandled Opcode in getDivRemArgList");
bool isSigned = N->getOpcode() == ISD::SDIVREM ||
N->getOpcode() == ISD::SREM;
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
for (unsigned i = 0, e = N->getNumOperands(); i != e; ++i) {
EVT ArgVT = N->getOperand(i).getValueType();
Type *ArgTy = ArgVT.getTypeForEVT(*Context);
Entry.Node = N->getOperand(i);
Entry.Ty = ArgTy;
Entry.IsSExt = isSigned;
Entry.IsZExt = !isSigned;
Args.push_back(Entry);
}
if (Subtarget->isTargetWindows() && Args.size() >= 2)
std::swap(Args[0], Args[1]);
return Args;
}
SDValue ARMTargetLowering::LowerDivRem(SDValue Op, SelectionDAG &DAG) const {
assert((Subtarget->isTargetAEABI() || Subtarget->isTargetAndroid() ||
Subtarget->isTargetGNUAEABI() || Subtarget->isTargetMuslAEABI() ||
Subtarget->isTargetWindows()) &&
"Register-based DivRem lowering only");
unsigned Opcode = Op->getOpcode();
assert((Opcode == ISD::SDIVREM || Opcode == ISD::UDIVREM) &&
"Invalid opcode for Div/Rem lowering");
bool isSigned = (Opcode == ISD::SDIVREM);
EVT VT = Op->getValueType(0);
Type *Ty = VT.getTypeForEVT(*DAG.getContext());
SDLoc dl(Op);
// If the target has hardware divide, use divide + multiply + subtract:
// div = a / b
// rem = a - b * div
// return {div, rem}
// This should be lowered into UDIV/SDIV + MLS later on.
bool hasDivide = Subtarget->isThumb() ? Subtarget->hasDivideInThumbMode()
: Subtarget->hasDivideInARMMode();
if (hasDivide && Op->getValueType(0).isSimple() &&
Op->getSimpleValueType(0) == MVT::i32) {
unsigned DivOpcode = isSigned ? ISD::SDIV : ISD::UDIV;
const SDValue Dividend = Op->getOperand(0);
const SDValue Divisor = Op->getOperand(1);
SDValue Div = DAG.getNode(DivOpcode, dl, VT, Dividend, Divisor);
SDValue Mul = DAG.getNode(ISD::MUL, dl, VT, Div, Divisor);
SDValue Rem = DAG.getNode(ISD::SUB, dl, VT, Dividend, Mul);
SDValue Values[2] = {Div, Rem};
return DAG.getNode(ISD::MERGE_VALUES, dl, DAG.getVTList(VT, VT), Values);
}
RTLIB::Libcall LC = getDivRemLibcall(Op.getNode(),
VT.getSimpleVT().SimpleTy);
SDValue InChain = DAG.getEntryNode();
TargetLowering::ArgListTy Args = getDivRemArgList(Op.getNode(),
DAG.getContext(),
Subtarget);
SDValue Callee = DAG.getExternalSymbol(getLibcallName(LC),
getPointerTy(DAG.getDataLayout()));
Type *RetTy = StructType::get(Ty, Ty);
if (Subtarget->isTargetWindows())
InChain = WinDBZCheckDenominator(DAG, Op.getNode(), InChain);
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl).setChain(InChain)
.setCallee(getLibcallCallingConv(LC), RetTy, Callee, std::move(Args))
.setInRegister().setSExtResult(isSigned).setZExtResult(!isSigned);
std::pair<SDValue, SDValue> CallInfo = LowerCallTo(CLI);
return CallInfo.first;
}
// Lowers REM using divmod helpers; see RTABI section 4.2/4.3.
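// e.g. on AEABI targets, an i32 srem becomes a call to __aeabi_idivmod,
// which returns {quotient, remainder} in {r0, r1}; only the remainder is used.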
SDValue ARMTargetLowering::LowerREM(SDNode *N, SelectionDAG &DAG) const {
// Build return types (div and rem)
std::vector<Type*> RetTyParams;
Type *RetTyElement;
switch (N->getValueType(0).getSimpleVT().SimpleTy) {
default: llvm_unreachable("Unexpected request for libcall!");
case MVT::i8: RetTyElement = Type::getInt8Ty(*DAG.getContext()); break;
case MVT::i16: RetTyElement = Type::getInt16Ty(*DAG.getContext()); break;
case MVT::i32: RetTyElement = Type::getInt32Ty(*DAG.getContext()); break;
case MVT::i64: RetTyElement = Type::getInt64Ty(*DAG.getContext()); break;
}
RetTyParams.push_back(RetTyElement);
RetTyParams.push_back(RetTyElement);
ArrayRef<Type*> ret = ArrayRef<Type*>(RetTyParams);
Type *RetTy = StructType::get(*DAG.getContext(), ret);
RTLIB::Libcall LC = getDivRemLibcall(N, N->getValueType(0).getSimpleVT().
SimpleTy);
SDValue InChain = DAG.getEntryNode();
TargetLowering::ArgListTy Args = getDivRemArgList(N, DAG.getContext(),
Subtarget);
bool isSigned = N->getOpcode() == ISD::SREM;
SDValue Callee = DAG.getExternalSymbol(getLibcallName(LC),
getPointerTy(DAG.getDataLayout()));
if (Subtarget->isTargetWindows())
InChain = WinDBZCheckDenominator(DAG, N, InChain);
// Lower call
CallLoweringInfo CLI(DAG);
CLI.setChain(InChain)
.setCallee(CallingConv::ARM_AAPCS, RetTy, Callee, std::move(Args))
.setSExtResult(isSigned).setZExtResult(!isSigned).setDebugLoc(SDLoc(N));
std::pair<SDValue, SDValue> CallResult = LowerCallTo(CLI);
// Return second (rem) result operand (first contains div)
SDNode *ResNode = CallResult.first.getNode();
assert(ResNode->getNumOperands() == 2 && "divmod should return two operands");
return ResNode->getOperand(1);
}
SDValue
ARMTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const {
assert(Subtarget->isTargetWindows() && "unsupported target platform");
SDLoc DL(Op);
// Get the inputs.
SDValue Chain = Op.getOperand(0);
SDValue Size = Op.getOperand(1);
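// The Windows __chkstk routine expects the allocation size in R4 as a count
// of 4-byte words, so convert the byte count before the probe call.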
SDValue Words = DAG.getNode(ISD::SRL, DL, MVT::i32, Size,
DAG.getConstant(2, DL, MVT::i32));
SDValue Flag;
Chain = DAG.getCopyToReg(Chain, DL, ARM::R4, Words, Flag);
Flag = Chain.getValue(1);
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
Chain = DAG.getNode(ARMISD::WIN__CHKSTK, DL, NodeTys, Chain, Flag);
SDValue NewSP = DAG.getCopyFromReg(Chain, DL, ARM::SP, MVT::i32);
Chain = NewSP.getValue(1);
SDValue Ops[2] = { NewSP, Chain };
return DAG.getMergeValues(Ops, DL);
}
SDValue ARMTargetLowering::LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const {
assert(Op.getValueType() == MVT::f64 && Subtarget->isFPOnlySP() &&
"Unexpected type for custom-lowering FP_EXTEND");
RTLIB::Libcall LC;
LC = RTLIB::getFPEXT(Op.getOperand(0).getValueType(), Op.getValueType());
SDValue SrcVal = Op.getOperand(0);
return makeLibCall(DAG, LC, Op.getValueType(), SrcVal, /*isSigned*/ false,
SDLoc(Op)).first;
}
SDValue ARMTargetLowering::LowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
assert(Op.getOperand(0).getValueType() == MVT::f64 &&
Subtarget->isFPOnlySP() &&
"Unexpected type for custom-lowering FP_ROUND");
RTLIB::Libcall LC;
LC = RTLIB::getFPROUND(Op.getOperand(0).getValueType(), Op.getValueType());
SDValue SrcVal = Op.getOperand(0);
return makeLibCall(DAG, LC, Op.getValueType(), SrcVal, /*isSigned*/ false,
SDLoc(Op)).first;
}
bool
ARMTargetLowering::isOffsetFoldingLegal(const GlobalAddressSDNode *GA) const {
// The ARM target isn't yet aware of offsets.
return false;
}
bool ARM::isBitFieldInvertedMask(unsigned v) {
if (v == 0xffffffff)
return false;
// There can be 1's on either or both "outsides"; all the "inside"
// bits must be 0's.
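// e.g. 0xf000000f qualifies (~v = 0x0ffffff0 is a shifted mask),
// but 0x0ff00ff0 does not.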
return isShiftedMask_32(~v);
}
/// isFPImmLegal - Returns true if the target can instruction select the
/// specified FP immediate natively. If false, the legalizer will
/// materialize the FP immediate as a load from a constant pool.
bool ARMTargetLowering::isFPImmLegal(const APFloat &Imm, EVT VT) const {
if (!Subtarget->hasVFP3())
return false;
if (VT == MVT::f32)
return ARM_AM::getFP32Imm(Imm) != -1;
if (VT == MVT::f64 && !Subtarget->isFPOnlySP())
return ARM_AM::getFP64Imm(Imm) != -1;
return false;
}
/// getTgtMemIntrinsic - Represent NEON load and store intrinsics as
/// MemIntrinsicNodes. The associated MachineMemOperands record the alignment
/// specified in the intrinsic calls.
bool ARMTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
const CallInst &I,
unsigned Intrinsic) const {
switch (Intrinsic) {
case Intrinsic::arm_neon_vld1:
case Intrinsic::arm_neon_vld2:
case Intrinsic::arm_neon_vld3:
case Intrinsic::arm_neon_vld4:
case Intrinsic::arm_neon_vld2lane:
case Intrinsic::arm_neon_vld3lane:
case Intrinsic::arm_neon_vld4lane: {
Info.opc = ISD::INTRINSIC_W_CHAIN;
// Conservatively set memVT to the entire set of vectors loaded.
auto &DL = I.getCalledFunction()->getParent()->getDataLayout();
uint64_t NumElts = DL.getTypeSizeInBits(I.getType()) / 64;
Info.memVT = EVT::getVectorVT(I.getType()->getContext(), MVT::i64, NumElts);
Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;
Value *AlignArg = I.getArgOperand(I.getNumArgOperands() - 1);
Info.align = cast<ConstantInt>(AlignArg)->getZExtValue();
Info.vol = false; // volatile loads with NEON intrinsics not supported
Info.readMem = true;
Info.writeMem = false;
return true;
}
case Intrinsic::arm_neon_vst1:
case Intrinsic::arm_neon_vst2:
case Intrinsic::arm_neon_vst3:
case Intrinsic::arm_neon_vst4:
case Intrinsic::arm_neon_vst2lane:
case Intrinsic::arm_neon_vst3lane:
case Intrinsic::arm_neon_vst4lane: {
Info.opc = ISD::INTRINSIC_VOID;
// Conservatively set memVT to the entire set of vectors stored.
auto &DL = I.getCalledFunction()->getParent()->getDataLayout();
unsigned NumElts = 0;
for (unsigned ArgI = 1, ArgE = I.getNumArgOperands(); ArgI < ArgE; ++ArgI) {
Type *ArgTy = I.getArgOperand(ArgI)->getType();
if (!ArgTy->isVectorTy())
break;
NumElts += DL.getTypeSizeInBits(ArgTy) / 64;
}
Info.memVT = EVT::getVectorVT(I.getType()->getContext(), MVT::i64, NumElts);
Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;
Value *AlignArg = I.getArgOperand(I.getNumArgOperands() - 1);
Info.align = cast<ConstantInt>(AlignArg)->getZExtValue();
Info.vol = false; // volatile stores with NEON intrinsics not supported
Info.readMem = false;
Info.writeMem = true;
return true;
}
case Intrinsic::arm_ldaex:
case Intrinsic::arm_ldrex: {
auto &DL = I.getCalledFunction()->getParent()->getDataLayout();
PointerType *PtrTy = cast<PointerType>(I.getArgOperand(0)->getType());
Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::getVT(PtrTy->getElementType());
Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;
Info.align = DL.getABITypeAlignment(PtrTy->getElementType());
Info.vol = true;
Info.readMem = true;
Info.writeMem = false;
return true;
}
case Intrinsic::arm_stlex:
case Intrinsic::arm_strex: {
auto &DL = I.getCalledFunction()->getParent()->getDataLayout();
PointerType *PtrTy = cast<PointerType>(I.getArgOperand(1)->getType());
Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::getVT(PtrTy->getElementType());
Info.ptrVal = I.getArgOperand(1);
Info.offset = 0;
Info.align = DL.getABITypeAlignment(PtrTy->getElementType());
Info.vol = true;
Info.readMem = false;
Info.writeMem = true;
return true;
}
case Intrinsic::arm_stlexd:
case Intrinsic::arm_strexd:
Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::i64;
Info.ptrVal = I.getArgOperand(2);
Info.offset = 0;
Info.align = 8;
Info.vol = true;
Info.readMem = false;
Info.writeMem = true;
return true;
case Intrinsic::arm_ldaexd:
case Intrinsic::arm_ldrexd:
Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::i64;
Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;
Info.align = 8;
Info.vol = true;
Info.readMem = true;
Info.writeMem = false;
return true;
default:
break;
}
return false;
}
/// \brief Returns true if it is beneficial to convert a load of a constant
/// to just the constant itself.
bool ARMTargetLowering::shouldConvertConstantLoadToIntImm(const APInt &Imm,
Type *Ty) const {
assert(Ty->isIntegerTy());
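// A constant of at most 32 bits can typically be materialized in one or two
// instructions (e.g. MOVW/MOVT on v6t2+), which beats a constant-pool load.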
unsigned Bits = Ty->getPrimitiveSizeInBits();
if (Bits == 0 || Bits > 32)
return false;
return true;
}
bool ARMTargetLowering::isExtractSubvectorCheap(EVT ResVT,
unsigned Index) const {
if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
return false;
return (Index == 0 || Index == ResVT.getVectorNumElements());
}
Instruction* ARMTargetLowering::makeDMB(IRBuilder<> &Builder,
ARM_MB::MemBOpt Domain) const {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();
// First, if the target has no DMB, see what fallback we can use.
if (!Subtarget->hasDataBarrier()) {
// Some ARMv6 cpus can support data barriers with an mcr instruction.
// Thumb1 and pre-v6 ARM mode use a libcall instead and should never get
// here.
if (Subtarget->hasV6Ops() && !Subtarget->isThumb()) {
Function *MCR = Intrinsic::getDeclaration(M, Intrinsic::arm_mcr);
Value* args[6] = {Builder.getInt32(15), Builder.getInt32(0),
Builder.getInt32(0), Builder.getInt32(7),
Builder.getInt32(10), Builder.getInt32(5)};
return Builder.CreateCall(MCR, args);
} else {
// Instead of using barriers, atomic accesses on these subtargets use
// libcalls.
llvm_unreachable("makeDMB on a target so old that it has no barriers");
}
} else {
Function *DMB = Intrinsic::getDeclaration(M, Intrinsic::arm_dmb);
// Only a full system barrier exists in the M-class architectures.
Domain = Subtarget->isMClass() ? ARM_MB::SY : Domain;
Constant *CDomain = Builder.getInt32(Domain);
return Builder.CreateCall(DMB, CDomain);
}
}
// Based on http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
Instruction *ARMTargetLowering::emitLeadingFence(IRBuilder<> &Builder,
Instruction *Inst,
AtomicOrdering Ord) const {
switch (Ord) {
case AtomicOrdering::NotAtomic:
case AtomicOrdering::Unordered:
llvm_unreachable("Invalid fence: unordered/non-atomic");
case AtomicOrdering::Monotonic:
case AtomicOrdering::Acquire:
return nullptr; // Nothing to do
case AtomicOrdering::SequentiallyConsistent:
if (!Inst->hasAtomicStore())
return nullptr; // Nothing to do
/*FALLTHROUGH*/
case AtomicOrdering::Release:
case AtomicOrdering::AcquireRelease:
if (Subtarget->preferISHSTBarriers())
return makeDMB(Builder, ARM_MB::ISHST);
// FIXME: add a comment with a link to documentation justifying this.
else
return makeDMB(Builder, ARM_MB::ISH);
}
llvm_unreachable("Unknown fence ordering in emitLeadingFence");
}
Instruction *ARMTargetLowering::emitTrailingFence(IRBuilder<> &Builder,
Instruction *Inst,
AtomicOrdering Ord) const {
switch (Ord) {
case AtomicOrdering::NotAtomic:
case AtomicOrdering::Unordered:
llvm_unreachable("Invalid fence: unordered/not-atomic");
case AtomicOrdering::Monotonic:
case AtomicOrdering::Release:
return nullptr; // Nothing to do
case AtomicOrdering::Acquire:
case AtomicOrdering::AcquireRelease:
case AtomicOrdering::SequentiallyConsistent:
return makeDMB(Builder, ARM_MB::ISH);
}
llvm_unreachable("Unknown fence ordering in emitTrailingFence");
}
// Loads and stores less than 64 bits are already atomic; ones above that
// are doomed anyway, so defer to the default libcall and blame the OS when
// things go wrong. Cortex M doesn't have ldrexd/strexd though, so don't emit
// anything for those.
bool ARMTargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
unsigned Size = SI->getValueOperand()->getType()->getPrimitiveSizeInBits();
return (Size == 64) && !Subtarget->isMClass();
}
// Loads and stores less than 64 bits are already atomic; ones above that
// are doomed anyway, so defer to the default libcall and blame the OS when
// things go wrong. Cortex M doesn't have ldrexd/strexd though, so don't emit
// anything for those.
// FIXME: ldrd and strd are atomic if the CPU has LPAE (e.g. A15 has that
// guarantee, see DDI0406C ARM architecture reference manual,
// sections A8.8.72-74 LDRD)
TargetLowering::AtomicExpansionKind
ARMTargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
unsigned Size = LI->getType()->getPrimitiveSizeInBits();
return ((Size == 64) && !Subtarget->isMClass()) ? AtomicExpansionKind::LLOnly
: AtomicExpansionKind::None;
}
// For the real atomic operations, we have ldrex/strex up to 32 bits,
// and up to 64 bits on the non-M profiles
TargetLowering::AtomicExpansionKind
ARMTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
unsigned Size = AI->getType()->getPrimitiveSizeInBits();
bool hasAtomicRMW = !Subtarget->isThumb() || Subtarget->hasV8MBaselineOps();
return (Size <= (Subtarget->isMClass() ? 32U : 64U) && hasAtomicRMW)
? AtomicExpansionKind::LLSC
: AtomicExpansionKind::None;
}
bool ARMTargetLowering::shouldExpandAtomicCmpXchgInIR(
AtomicCmpXchgInst *AI) const {
// At -O0, fast-regalloc cannot cope with the live vregs necessary to
// implement cmpxchg without spilling. If the address being exchanged is also
// on the stack and close enough to the spill slot, this can lead to a
// situation where the monitor always gets cleared and the atomic operation
// can never succeed. So at -O0 we need a late-expanded pseudo-inst instead.
bool hasAtomicCmpXchg =
!Subtarget->isThumb() || Subtarget->hasV8MBaselineOps();
return getTargetMachine().getOptLevel() != 0 && hasAtomicCmpXchg;
}
bool ARMTargetLowering::shouldInsertFencesForAtomic(
const Instruction *I) const {
return InsertFencesForAtomic;
}
// This has so far only been implemented for MachO.
bool ARMTargetLowering::useLoadStackGuardNode() const {
return Subtarget->isTargetMachO();
}
bool ARMTargetLowering::canCombineStoreAndExtract(Type *VectorTy, Value *Idx,
unsigned &Cost) const {
// If we do not have NEON, vector types are not natively supported.
if (!Subtarget->hasNEON())
return false;
// Floating point values and vector values map to the same register file.
// Therefore, although we could do a store extract of a vector type, it is
// better to leave these as floats, as we have more freedom in the addressing
// modes for them.
if (VectorTy->isFPOrFPVectorTy())
return false;
// If the index is unknown at compile time, this is very expensive to lower
// and it is not possible to combine the store with the extract.
if (!isa<ConstantInt>(Idx))
return false;
assert(VectorTy->isVectorTy() && "VectorTy is not a vector type");
unsigned BitWidth = cast<VectorType>(VectorTy)->getBitWidth();
// We can do a store + vector extract on any vector that fits perfectly in a D
// or Q register.
if (BitWidth == 64 || BitWidth == 128) {
Cost = 0;
return true;
}
return false;
}
bool ARMTargetLowering::isCheapToSpeculateCttz() const {
return Subtarget->hasV6T2Ops();
}
bool ARMTargetLowering::isCheapToSpeculateCtlz() const {
return Subtarget->hasV6T2Ops();
}
Value *ARMTargetLowering::emitLoadLinked(IRBuilder<> &Builder, Value *Addr,
AtomicOrdering Ord) const {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Type *ValTy = cast<PointerType>(Addr->getType())->getElementType();
bool IsAcquire = isAcquireOrStronger(Ord);
// Since i64 isn't legal and intrinsics don't get type-lowered, the ldrexd
// intrinsic must return {i32, i32} and we have to recombine them into a
// single i64 here.
if (ValTy->getPrimitiveSizeInBits() == 64) {
Intrinsic::ID Int =
IsAcquire ? Intrinsic::arm_ldaexd : Intrinsic::arm_ldrexd;
Function *Ldrex = Intrinsic::getDeclaration(M, Int);
Addr = Builder.CreateBitCast(Addr, Type::getInt8PtrTy(M->getContext()));
Value *LoHi = Builder.CreateCall(Ldrex, Addr, "lohi");
Value *Lo = Builder.CreateExtractValue(LoHi, 0, "lo");
Value *Hi = Builder.CreateExtractValue(LoHi, 1, "hi");
if (!Subtarget->isLittle())
std::swap (Lo, Hi);
Lo = Builder.CreateZExt(Lo, ValTy, "lo64");
Hi = Builder.CreateZExt(Hi, ValTy, "hi64");
return Builder.CreateOr(
Lo, Builder.CreateShl(Hi, ConstantInt::get(ValTy, 32)), "val64");
}
Type *Tys[] = { Addr->getType() };
Intrinsic::ID Int = IsAcquire ? Intrinsic::arm_ldaex : Intrinsic::arm_ldrex;
Function *Ldrex = Intrinsic::getDeclaration(M, Int, Tys);
return Builder.CreateTruncOrBitCast(
Builder.CreateCall(Ldrex, Addr),
cast<PointerType>(Addr->getType())->getElementType());
}
void ARMTargetLowering::emitAtomicCmpXchgNoStoreLLBalance(
IRBuilder<> &Builder) const {
if (!Subtarget->hasV7Ops())
return;
Module *M = Builder.GetInsertBlock()->getParent()->getParent();
Builder.CreateCall(Intrinsic::getDeclaration(M, Intrinsic::arm_clrex));
}
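/// Emit a store-conditional (strex/stlex) of Val to Addr. As an illustrative
/// sketch, a release store-conditional of an i64 value is marshalled into
/// roughly:
///   %lo = trunc i64 %val to i32
///   %shr = lshr i64 %val, 32
///   %hi = trunc i64 %shr to i32
///   %status = call i32 @llvm.arm.stlexd(i32 %lo, i32 %hi, i8* %addr)
/// with %lo and %hi swapped on big-endian targets.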
Value *ARMTargetLowering::emitStoreConditional(IRBuilder<> &Builder, Value *Val,
Value *Addr,
AtomicOrdering Ord) const {
Module *M = Builder.GetInsertBlock()->getParent()->getParent();
bool IsRelease = isReleaseOrStronger(Ord);
// Since the intrinsics must have legal type, the i64 intrinsics take two
// parameters: "i32, i32". We must marshal Val into the appropriate form
// before the call.
if (Val->getType()->getPrimitiveSizeInBits() == 64) {
Intrinsic::ID Int =
IsRelease ? Intrinsic::arm_stlexd : Intrinsic::arm_strexd;
Function *Strex = Intrinsic::getDeclaration(M, Int);
Type *Int32Ty = Type::getInt32Ty(M->getContext());
Value *Lo = Builder.CreateTrunc(Val, Int32Ty, "lo");
Value *Hi = Builder.CreateTrunc(Builder.CreateLShr(Val, 32), Int32Ty, "hi");
if (!Subtarget->isLittle())
std::swap (Lo, Hi);
Addr = Builder.CreateBitCast(Addr, Type::getInt8PtrTy(M->getContext()));
return Builder.CreateCall(Strex, {Lo, Hi, Addr});
}
Intrinsic::ID Int = IsRelease ? Intrinsic::arm_stlex : Intrinsic::arm_strex;
Type *Tys[] = { Addr->getType() };
Function *Strex = Intrinsic::getDeclaration(M, Int, Tys);
return Builder.CreateCall(
Strex, {Builder.CreateZExtOrBitCast(
Val, Strex->getFunctionType()->getParamType(0)),
Addr});
}
/// A helper function for determining the number of interleaved accesses we
/// will generate when lowering accesses of the given type.
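///
/// For example, a 512-bit <16 x i32> access yields (512 + 127) / 128 = 4
/// interleaved accesses.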
unsigned
ARMTargetLowering::getNumInterleavedAccesses(VectorType *VecTy,
const DataLayout &DL) const {
return (DL.getTypeSizeInBits(VecTy) + 127) / 128;
}
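/// Returns true if \p VecTy is a legal interleaved access type. For example,
/// <8 x i8> (64 bits) and <16 x i16> (256 bits, a multiple of 128) are legal,
/// while <4 x half>, <1 x i32>, and <3 x i32> (96 bits) are not.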
bool ARMTargetLowering::isLegalInterleavedAccessType(
VectorType *VecTy, const DataLayout &DL) const {
unsigned VecSize = DL.getTypeSizeInBits(VecTy);
unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());
// Ensure the vector doesn't have f16 elements. Even though we could do an
// i16 vldN, we can't hold the f16 vectors and will end up converting via
// f32.
if (VecTy->getElementType()->isHalfTy())
return false;
// Ensure the number of vector elements is greater than 1.
if (VecTy->getNumElements() < 2)
return false;
// Ensure the element type is legal.
if (ElSize != 8 && ElSize != 16 && ElSize != 32)
return false;
// Ensure the total vector size is 64 or a multiple of 128. Types larger than
// 128 will be split into multiple interleaved accesses.
return VecSize == 64 || VecSize % 128 == 0;
}
/// \brief Lower an interleaved load into a vldN intrinsic.
///
/// E.g. Lower an interleaved load (Factor = 2):
/// %wide.vec = load <8 x i32>, <8 x i32>* %ptr, align 4
/// %v0 = shuffle %wide.vec, undef, <0, 2, 4, 6> ; Extract even elements
/// %v1 = shuffle %wide.vec, undef, <1, 3, 5, 7> ; Extract odd elements
///
/// Into:
/// %vld2 = { <4 x i32>, <4 x i32> } call llvm.arm.neon.vld2(%ptr, 4)
/// %vec0 = extractvalue { <4 x i32>, <4 x i32> } %vld2, 0
/// %vec1 = extractvalue { <4 x i32>, <4 x i32> } %vld2, 1
bool ARMTargetLowering::lowerInterleavedLoad(
LoadInst *LI, ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices, unsigned Factor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");
assert(!Shuffles.empty() && "Empty shufflevector input");
assert(Shuffles.size() == Indices.size() &&
"Unmatched number of shufflevectors and indices");
VectorType *VecTy = Shuffles[0]->getType();
Type *EltTy = VecTy->getVectorElementType();
const DataLayout &DL = LI->getModule()->getDataLayout();
// Skip if we do not have NEON or the vector type is illegal. We can
// "legalize" wide vector types into multiple interleaved accesses as long as
// the vector size is divisible by 128.
if (!Subtarget->hasNEON() || !isLegalInterleavedAccessType(VecTy, DL))
return false;
unsigned NumLoads = getNumInterleavedAccesses(VecTy, DL);
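// For example, with Factor = 2 and <8 x i32> shuffle results (256 bits each),
// NumLoads is 2; each vld2 then yields two <4 x i32> sub-vectors, which are
// concatenated back into <8 x i32> values below.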
// A pointer vector cannot be the return type of the ldN intrinsics, so load
// integer vectors first and then convert them to pointer vectors.
if (EltTy->isPointerTy())
VecTy =
VectorType::get(DL.getIntPtrType(EltTy), VecTy->getVectorNumElements());
IRBuilder<> Builder(LI);
// The base address of the load.
Value *BaseAddr = LI->getPointerOperand();
if (NumLoads > 1) {
// If we're going to generate more than one load, reset the sub-vector type
// to something legal.
VecTy = VectorType::get(VecTy->getVectorElementType(),
VecTy->getVectorNumElements() / NumLoads);
// We will compute the pointer operand of each load from the original base
// address using GEPs. Cast the base address to a pointer to the scalar
// element type.
BaseAddr = Builder.CreateBitCast(
BaseAddr, VecTy->getVectorElementType()->getPointerTo(
LI->getPointerAddressSpace()));
}
assert(isTypeLegal(EVT::getEVT(VecTy)) && "Illegal vldN vector type!");
Type *Int8Ptr = Builder.getInt8PtrTy(LI->getPointerAddressSpace());
Type *Tys[] = {VecTy, Int8Ptr};
static const Intrinsic::ID LoadInts[3] = {Intrinsic::arm_neon_vld2,
Intrinsic::arm_neon_vld3,
Intrinsic::arm_neon_vld4};
Function *VldnFunc =
Intrinsic::getDeclaration(LI->getModule(), LoadInts[Factor - 2], Tys);
// Holds sub-vectors extracted from the load intrinsic return values. The
// sub-vectors are associated with the shufflevector instructions they will
// replace.
DenseMap<ShuffleVectorInst *, SmallVector<Value *, 4>> SubVecs;
for (unsigned LoadCount = 0; LoadCount < NumLoads; ++LoadCount) {
// If we're generating more than one load, compute the base address of
// subsequent loads as an offset from the previous.
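// E.g. with Factor = 2 and <4 x i32> sub-vectors, each subsequent vld2
// starts 4 * 2 = 8 i32 elements (32 bytes) past the previous one.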
if (LoadCount > 0)
BaseAddr = Builder.CreateConstGEP1_32(
BaseAddr, VecTy->getVectorNumElements() * Factor);
SmallVector<Value *, 2> Ops;
Ops.push_back(Builder.CreateBitCast(BaseAddr, Int8Ptr));
Ops.push_back(Builder.getInt32(LI->getAlignment()));
CallInst *VldN = Builder.CreateCall(VldnFunc, Ops, "vldN");
// Replace uses of each shufflevector with the corresponding vector loaded
// by ldN.
for (unsigned i = 0; i < Shuffles.size(); i++) {
ShuffleVectorInst *SV = Shuffles[i];
unsigned Index = Indices[i];
Value *SubVec = Builder.CreateExtractValue(VldN, Index);
// Convert the integer vector to a pointer vector if the element type is
// a pointer.
if (EltTy->isPointerTy())
SubVec = Builder.CreateIntToPtr(
SubVec, VectorType::get(SV->getType()->getVectorElementType(),
VecTy->getVectorNumElements()));
SubVecs[SV].push_back(SubVec);
}
}
// Replace uses of the shufflevector instructions with the sub-vectors
// returned by the load intrinsic. If a shufflevector instruction is
// associated with more than one sub-vector, those sub-vectors will be
// concatenated into a single wide vector.
for (ShuffleVectorInst *SVI : Shuffles) {
auto &SubVec = SubVecs[SVI];
auto *WideVec =
SubVec.size() > 1 ? concatenateVectors(Builder, SubVec) : SubVec[0];
SVI->replaceAllUsesWith(WideVec);
}
return true;
}
/// \brief Lower an interleaved store into a vstN intrinsic.
///
/// E.g. Lower an interleaved store (Factor = 3):
/// %i.vec = shuffle <8 x i32> %v0, <8 x i32> %v1,
/// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>
/// store <12 x i32> %i.vec, <12 x i32>* %ptr, align 4
///
/// Into:
/// %sub.v0 = shuffle <8 x i32> %v0, <8 x i32> %v1, <0, 1, 2, 3>
/// %sub.v1 = shuffle <8 x i32> %v0, <8 x i32> %v1, <4, 5, 6, 7>
/// %sub.v2 = shuffle <8 x i32> %v0, <8 x i32> %v1, <8, 9, 10, 11>
/// call void llvm.arm.neon.vst3(%ptr, %sub.v0, %sub.v1, %sub.v2, 4)
///
/// Note that the new shufflevectors will be removed and we'll only generate one
/// vst3 instruction in CodeGen.
///
/// Example for a more general valid mask (Factor 3). Lower:
/// %i.vec = shuffle <32 x i32> %v0, <32 x i32> %v1,
/// <4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19>
/// store <12 x i32> %i.vec, <12 x i32>* %ptr
///
/// Into:
/// %sub.v0 = shuffle <32 x i32> %v0, <32 x i32> %v1, <4, 5, 6, 7>
/// %sub.v1 = shuffle <32 x i32> %v0, <32 x i32> %v1, <32, 33, 34, 35>
/// %sub.v2 = shuffle <32 x i32> %v0, <32 x i32> %v1, <16, 17, 18, 19>
/// call void llvm.arm.neon.vst3(%ptr, %sub.v0, %sub.v1, %sub.v2, 4)
bool ARMTargetLowering::lowerInterleavedStore(StoreInst *SI,
ShuffleVectorInst *SVI,
unsigned Factor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");
VectorType *VecTy = SVI->getType();
assert(VecTy->getVectorNumElements() % Factor == 0 &&
"Invalid interleaved store");
unsigned LaneLen = VecTy->getVectorNumElements() / Factor;
Type *EltTy = VecTy->getVectorElementType();
VectorType *SubVecTy = VectorType::get(EltTy, LaneLen);
const DataLayout &DL = SI->getModule()->getDataLayout();
// Skip if we do not have NEON or the vector type is illegal. We can
// "legalize" wide vector types into multiple interleaved accesses as long as
// the vector size is divisible by 128.
if (!Subtarget->hasNEON() || !isLegalInterleavedAccessType(SubVecTy, DL))
return false;
unsigned NumStores = getNumInterleavedAccesses(SubVecTy, DL);
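// For example, storing a <24 x i32> %i.vec with Factor = 3 gives LaneLen = 8
// and SubVecTy = <8 x i32> (256 bits), so NumStores is 2; LaneLen is then
// halved below and each vst3 writes three <4 x i32> sub-vectors.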
Value *Op0 = SVI->getOperand(0);
Value *Op1 = SVI->getOperand(1);
IRBuilder<> Builder(SI);
// StN intrinsics don't support pointer vectors as arguments. Convert pointer
// vectors to integer vectors.
if (EltTy->isPointerTy()) {
Type *IntTy = DL.getIntPtrType(EltTy);
// Convert to the corresponding integer vector.
Type *IntVecTy =
VectorType::get(IntTy, Op0->getType()->getVectorNumElements());
Op0 = Builder.CreatePtrToInt(Op0, IntVecTy);
Op1 = Builder.CreatePtrToInt(Op1, IntVecTy);
SubVecTy = VectorType::get(IntTy, LaneLen);
}
// The base address of the store.
Value *BaseAddr = SI->getPointerOperand();
if (NumStores > 1) {
// If we're going to generate more than one store, reset the lane length
// and sub-vector type to something legal.
LaneLen /= NumStores;
SubVecTy = VectorType::get(SubVecTy->getVectorElementType(), LaneLen);
// We will compute the pointer operand of each store from the original base
// address using GEPs. Cast the base address to a pointer to the scalar
// element type.
BaseAddr = Builder.CreateBitCast(
BaseAddr, SubVecTy->getVectorElementType()->getPointerTo(
SI->getPointerAddressSpace()));
}
assert(isTypeLegal(EVT::getEVT(SubVecTy)) && "Illegal vstN vector type!");
auto Mask = SVI->getShuffleMask();
Type *Int8Ptr = Builder.getInt8PtrTy(SI->getPointerAddressSpace());
Type *Tys[] = {Int8Ptr, SubVecTy};
static const Intrinsic::ID StoreInts[3] = {Intrinsic::arm_neon_vst2,
Intrinsic::arm_neon_vst3,
Intrinsic::arm_neon_vst4};
for (unsigned StoreCount = 0; StoreCount < NumStores; ++StoreCount) {
// If we're generating more than one store, compute the base address of
// subsequent stores as an offset from the previous one.
if (StoreCount > 0)
BaseAddr = Builder.CreateConstGEP1_32(BaseAddr, LaneLen * Factor);
SmallVector<Value *, 6> Ops;
Ops.push_back(Builder.CreateBitCast(BaseAddr, Int8Ptr));
Function *VstNFunc =
Intrinsic::getDeclaration(SI->getModule(), StoreInts[Factor - 2], Tys);
// Split the shufflevector operands into sub vectors for the new vstN call.
for (unsigned i = 0; i < Factor; i++) {
unsigned IdxI = StoreCount * LaneLen * Factor + i;
if (Mask[IdxI] >= 0) {
Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, Mask[IdxI], LaneLen, 0)));
} else {
unsigned StartMask = 0;
for (unsigned j = 1; j < LaneLen; j++) {
unsigned IdxJ = StoreCount * LaneLen * Factor + j;
if (Mask[IdxJ * Factor + IdxI] >= 0) {
StartMask = Mask[IdxJ * Factor + IdxI] - IdxJ;
break;
}
}
// Note: If all elements in a chunk are undef, StartMask stays 0.
// Filling undef gaps with arbitrary elements is fine, since those lanes
// were going to be written with undefs anyway; for an all-undef chunk we
// default to using elements starting from 0.
// StartMask cannot be negative; this is checked in isReInterleaveMask.
Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, StartMask, LaneLen, 0)));
}
}
Ops.push_back(Builder.getInt32(SI->getAlignment()));
Builder.CreateCall(VstNFunc, Ops);
}
return true;
}
enum HABaseType {
HA_UNKNOWN = 0,
HA_FLOAT,
HA_DOUBLE,
HA_VECT64,
HA_VECT128
};
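/// Recursively check whether a type is an AAPCS-VFP homogeneous aggregate,
/// i.e. nested structs/arrays whose leaves are all floats, all doubles, or
/// all 64-/128-bit vectors, with at most four members in total. For example,
/// struct { float x; float y; } is an HA (Base = HA_FLOAT, Members = 2),
/// while struct { float x; double y; } is not.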
static bool isHomogeneousAggregate(Type *Ty, HABaseType &Base,
uint64_t &Members) {
if (auto *ST = dyn_cast<StructType>(Ty)) {
for (unsigned i = 0; i < ST->getNumElements(); ++i) {
uint64_t SubMembers = 0;
if (!isHomogeneousAggregate(ST->getElementType(i), Base, SubMembers))
return false;
Members += SubMembers;
}
} else if (auto *AT = dyn_cast<ArrayType>(Ty)) {
uint64_t SubMembers = 0;
if (!isHomogeneousAggregate(AT->getElementType(), Base, SubMembers))
return false;
Members += SubMembers * AT->getNumElements();
} else if (Ty->isFloatTy()) {
if (Base != HA_UNKNOWN && Base != HA_FLOAT)
return false;
Members = 1;
Base = HA_FLOAT;
} else if (Ty->isDoubleTy()) {
if (Base != HA_UNKNOWN && Base != HA_DOUBLE)
return false;
Members = 1;
Base = HA_DOUBLE;
} else if (auto *VT = dyn_cast<VectorType>(Ty)) {
Members = 1;
switch (Base) {
case HA_FLOAT:
case HA_DOUBLE:
return false;
case HA_VECT64:
return VT->getBitWidth() == 64;
case HA_VECT128:
return VT->getBitWidth() == 128;
case HA_UNKNOWN:
switch (VT->getBitWidth()) {
case 64:
Base = HA_VECT64;
return true;
case 128:
Base = HA_VECT128;
return true;
default:
return false;
}
}
}
return (Members > 0 && Members <= 4);
}
/// \brief Return true if a type is an AAPCS-VFP homogeneous aggregate or one of
/// [N x i32] or [N x i64]. This allows front-ends to skip emitting padding when
/// passing according to AAPCS rules.
bool ARMTargetLowering::functionArgumentNeedsConsecutiveRegisters(
Type *Ty, CallingConv::ID CallConv, bool isVarArg) const {
if (getEffectiveCallingConv(CallConv, isVarArg) !=
CallingConv::ARM_AAPCS_VFP)
return false;
HABaseType Base = HA_UNKNOWN;
uint64_t Members = 0;
bool IsHA = isHomogeneousAggregate(Ty, Base, Members);
DEBUG(dbgs() << "isHA: " << IsHA << " "; Ty->dump());
bool IsIntArray = Ty->isArrayTy() && Ty->getArrayElementType()->isIntegerTy();
return IsHA || IsIntArray;
}
unsigned ARMTargetLowering::getExceptionPointerRegister(
const Constant *PersonalityFn) const {
// Platforms which do not use SjLj EH may return values in these registers
// via the personality function.
return Subtarget->useSjLjEH() ? ARM::NoRegister : ARM::R0;
}
unsigned ARMTargetLowering::getExceptionSelectorRegister(
const Constant *PersonalityFn) const {
// Platforms which do not use SjLj EH may return values in these registers
// via the personality function.
return Subtarget->useSjLjEH() ? ARM::NoRegister : ARM::R1;
}
void ARMTargetLowering::initializeSplitCSR(MachineBasicBlock *Entry) const {
// Update IsSplitCSR in ARMFunctionInfo.
ARMFunctionInfo *AFI = Entry->getParent()->getInfo<ARMFunctionInfo>();
AFI->setIsSplitCSR(true);
}
void ARMTargetLowering::insertCopiesSplitCSR(
MachineBasicBlock *Entry,
const SmallVectorImpl<MachineBasicBlock *> &Exits) const {
const ARMBaseRegisterInfo *TRI = Subtarget->getRegisterInfo();
const MCPhysReg *IStart = TRI->getCalleeSavedRegsViaCopy(Entry->getParent());
if (!IStart)
return;
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
MachineRegisterInfo *MRI = &Entry->getParent()->getRegInfo();
MachineBasicBlock::iterator MBBI = Entry->begin();
for (const MCPhysReg *I = IStart; *I; ++I) {
const TargetRegisterClass *RC = nullptr;
if (ARM::GPRRegClass.contains(*I))
RC = &ARM::GPRRegClass;
else if (ARM::DPRRegClass.contains(*I))
RC = &ARM::DPRRegClass;
else
llvm_unreachable("Unexpected register class in CSRsViaCopy!");
unsigned NewVR = MRI->createVirtualRegister(RC);
// Create copy from CSR to a virtual register.
// FIXME: this currently does not emit CFI pseudo-instructions, it works
// fine for CXX_FAST_TLS since the C++-style TLS access functions should be
// nounwind. If we want to generalize this later, we may need to emit
// CFI pseudo-instructions.
assert(Entry->getParent()->getFunction()->hasFnAttribute(
Attribute::NoUnwind) &&
"Function should be nounwind in insertCopiesSplitCSR!");
Entry->addLiveIn(*I);
BuildMI(*Entry, MBBI, DebugLoc(), TII->get(TargetOpcode::COPY), NewVR)
.addReg(*I);
// Insert the copy-back instructions right before the terminator.
for (auto *Exit : Exits)
BuildMI(*Exit, Exit->getFirstTerminator(), DebugLoc(),
TII->get(TargetOpcode::COPY), *I)
.addReg(NewVR);
}
}
void ARMTargetLowering::finalizeLowering(MachineFunction &MF) const {
MF.getFrameInfo().computeMaxCallFrameSize(MF);
TargetLoweringBase::finalizeLowering(MF);
}
diff --git a/lib/Target/X86/X86ISelLowering.cpp b/lib/Target/X86/X86ISelLowering.cpp
index 1e73122cdc38..193ee8de6192 100644
--- a/lib/Target/X86/X86ISelLowering.cpp
+++ b/lib/Target/X86/X86ISelLowering.cpp
@@ -1,36742 +1,36749 @@
//===-- X86ISelLowering.cpp - X86 DAG Lowering Implementation -------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file defines the interfaces that X86 uses to lower LLVM code into a
// selection DAG.
//
//===----------------------------------------------------------------------===//
#include "X86ISelLowering.h"
#include "Utils/X86ShuffleDecode.h"
#include "X86CallingConv.h"
#include "X86FrameLowering.h"
#include "X86InstrBuilder.h"
#include "X86IntrinsicsInfo.h"
#include "X86MachineFunctionInfo.h"
#include "X86ShuffleDecodeConstantPool.h"
#include "X86TargetMachine.h"
#include "X86TargetObjectFile.h"
#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/StringSwitch.h"
#include "llvm/Analysis/EHPersonalities.h"
#include "llvm/CodeGen/IntrinsicLowering.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineJumpTableInfo.h"
#include "llvm/CodeGen/MachineModuleInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/WinEHFuncInfo.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/MC/MCAsmInfo.h"
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCSymbol.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/MathExtras.h"
#include "llvm/Target/TargetLowering.h"
#include "llvm/Target/TargetOptions.h"
#include <algorithm>
#include <bitset>
#include <cctype>
#include <numeric>
using namespace llvm;
#define DEBUG_TYPE "x86-isel"
STATISTIC(NumTailCalls, "Number of tail calls");
static cl::opt<bool> ExperimentalVectorWideningLegalization(
"x86-experimental-vector-widening-legalization", cl::init(false),
cl::desc("Enable an experimental vector type legalization through widening "
"rather than promotion."),
cl::Hidden);
static cl::opt<int> ExperimentalPrefLoopAlignment(
"x86-experimental-pref-loop-alignment", cl::init(4),
cl::desc("Sets the preferable loop alignment for experiments "
"(the last x86-experimental-pref-loop-alignment bits"
" of the loop header PC will be 0)."),
cl::Hidden);
static cl::opt<bool> MulConstantOptimization(
"mul-constant-optimization", cl::init(true),
cl::desc("Replace 'mul x, Const' with more effective instructions like "
"SHIFT, LEA, etc."),
cl::Hidden);
/// Call this when the user attempts to do something unsupported, like
/// returning a double without SSE2 enabled on x86_64. This is not fatal, unlike
/// report_fatal_error, so calling code should attempt to recover without
/// crashing.
static void errorUnsupported(SelectionDAG &DAG, const SDLoc &dl,
const char *Msg) {
MachineFunction &MF = DAG.getMachineFunction();
DAG.getContext()->diagnose(
DiagnosticInfoUnsupported(*MF.getFunction(), Msg, dl.getDebugLoc()));
}
X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
const X86Subtarget &STI)
: TargetLowering(TM), Subtarget(STI) {
bool UseX87 = !Subtarget.useSoftFloat() && Subtarget.hasX87();
X86ScalarSSEf64 = Subtarget.hasSSE2();
X86ScalarSSEf32 = Subtarget.hasSSE1();
MVT PtrVT = MVT::getIntegerVT(8 * TM.getPointerSize());
// Set up the TargetLowering object.
// X86 is weird. It always uses i8 for shift amounts and setcc results.
setBooleanContents(ZeroOrOneBooleanContent);
// X86-SSE is even stranger. It uses -1 or 0 for vector masks.
setBooleanVectorContents(ZeroOrNegativeOneBooleanContent);
// For 64-bit, since we have so many registers, use the ILP scheduler.
// For 32-bit, use the register pressure specific scheduling.
// For Atom, always use ILP scheduling.
if (Subtarget.isAtom())
setSchedulingPreference(Sched::ILP);
else if (Subtarget.is64Bit())
setSchedulingPreference(Sched::ILP);
else
setSchedulingPreference(Sched::RegPressure);
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
setStackPointerRegisterToSaveRestore(RegInfo->getStackRegister());
// Bypass expensive divides and use cheaper ones.
if (TM.getOptLevel() >= CodeGenOpt::Default) {
if (Subtarget.hasSlowDivide32())
addBypassSlowDiv(32, 8);
if (Subtarget.hasSlowDivide64() && Subtarget.is64Bit())
addBypassSlowDiv(64, 32);
}
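// For example, addBypassSlowDiv(64, 32) arranges for a 64-bit divide to be
// guarded by a run-time check: when both operands fit in 32 bits, the much
// cheaper 32-bit divide is used instead.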
if (Subtarget.isTargetKnownWindowsMSVC() ||
Subtarget.isTargetWindowsItanium()) {
// Setup Windows compiler runtime calls.
setLibcallName(RTLIB::SDIV_I64, "_alldiv");
setLibcallName(RTLIB::UDIV_I64, "_aulldiv");
setLibcallName(RTLIB::SREM_I64, "_allrem");
setLibcallName(RTLIB::UREM_I64, "_aullrem");
setLibcallName(RTLIB::MUL_I64, "_allmul");
setLibcallCallingConv(RTLIB::SDIV_I64, CallingConv::X86_StdCall);
setLibcallCallingConv(RTLIB::UDIV_I64, CallingConv::X86_StdCall);
setLibcallCallingConv(RTLIB::SREM_I64, CallingConv::X86_StdCall);
setLibcallCallingConv(RTLIB::UREM_I64, CallingConv::X86_StdCall);
setLibcallCallingConv(RTLIB::MUL_I64, CallingConv::X86_StdCall);
}
if (Subtarget.isTargetDarwin()) {
// Darwin should use _setjmp/_longjmp instead of setjmp/longjmp.
setUseUnderscoreSetJmp(false);
setUseUnderscoreLongJmp(false);
} else if (Subtarget.isTargetWindowsGNU()) {
// MS runtime is weird: it exports _setjmp, but longjmp!
setUseUnderscoreSetJmp(true);
setUseUnderscoreLongJmp(false);
} else {
setUseUnderscoreSetJmp(true);
setUseUnderscoreLongJmp(true);
}
// Set up the register classes.
addRegisterClass(MVT::i8, &X86::GR8RegClass);
addRegisterClass(MVT::i16, &X86::GR16RegClass);
addRegisterClass(MVT::i32, &X86::GR32RegClass);
if (Subtarget.is64Bit())
addRegisterClass(MVT::i64, &X86::GR64RegClass);
for (MVT VT : MVT::integer_valuetypes())
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i1, Promote);
// We don't accept any truncstore of integer registers.
setTruncStoreAction(MVT::i64, MVT::i32, Expand);
setTruncStoreAction(MVT::i64, MVT::i16, Expand);
setTruncStoreAction(MVT::i64, MVT::i8 , Expand);
setTruncStoreAction(MVT::i32, MVT::i16, Expand);
setTruncStoreAction(MVT::i32, MVT::i8 , Expand);
setTruncStoreAction(MVT::i16, MVT::i8, Expand);
setTruncStoreAction(MVT::f64, MVT::f32, Expand);
// SETOEQ and SETUNE require checking two conditions.
setCondCodeAction(ISD::SETOEQ, MVT::f32, Expand);
setCondCodeAction(ISD::SETOEQ, MVT::f64, Expand);
setCondCodeAction(ISD::SETOEQ, MVT::f80, Expand);
setCondCodeAction(ISD::SETUNE, MVT::f32, Expand);
setCondCodeAction(ISD::SETUNE, MVT::f64, Expand);
setCondCodeAction(ISD::SETUNE, MVT::f80, Expand);
// Promote all UINT_TO_FP to larger SINT_TO_FP's, as X86 doesn't have this
// operation.
setOperationAction(ISD::UINT_TO_FP , MVT::i1 , Promote);
setOperationAction(ISD::UINT_TO_FP , MVT::i8 , Promote);
setOperationAction(ISD::UINT_TO_FP , MVT::i16 , Promote);
if (Subtarget.is64Bit()) {
if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512())
// f32/f64 are legal, f80 is custom.
setOperationAction(ISD::UINT_TO_FP , MVT::i32 , Custom);
else
setOperationAction(ISD::UINT_TO_FP , MVT::i32 , Promote);
setOperationAction(ISD::UINT_TO_FP , MVT::i64 , Custom);
} else if (!Subtarget.useSoftFloat()) {
// We have an algorithm for SSE2->double, and we turn this into a
// 64-bit FILD followed by conditional FADD for other targets.
setOperationAction(ISD::UINT_TO_FP , MVT::i64 , Custom);
// We have an algorithm for SSE2, and we turn this into a 64-bit
// FILD or VCVTUSI2SS/SD for other targets.
setOperationAction(ISD::UINT_TO_FP , MVT::i32 , Custom);
}
// Promote i1/i8 SINT_TO_FP to larger SINT_TO_FP's, as X86 doesn't have
// this operation.
setOperationAction(ISD::SINT_TO_FP , MVT::i1 , Promote);
setOperationAction(ISD::SINT_TO_FP , MVT::i8 , Promote);
if (!Subtarget.useSoftFloat()) {
// SSE has no i16 to fp conversion, only i32.
if (X86ScalarSSEf32) {
setOperationAction(ISD::SINT_TO_FP , MVT::i16 , Promote);
// f32 and f64 cases are Legal, f80 case is not
setOperationAction(ISD::SINT_TO_FP , MVT::i32 , Custom);
} else {
setOperationAction(ISD::SINT_TO_FP , MVT::i16 , Custom);
setOperationAction(ISD::SINT_TO_FP , MVT::i32 , Custom);
}
} else {
setOperationAction(ISD::SINT_TO_FP , MVT::i16 , Promote);
setOperationAction(ISD::SINT_TO_FP , MVT::i32 , Promote);
}
// Promote i1/i8 FP_TO_SINT to larger FP_TO_SINTS's, as X86 doesn't have
// this operation.
setOperationAction(ISD::FP_TO_SINT , MVT::i1 , Promote);
setOperationAction(ISD::FP_TO_SINT , MVT::i8 , Promote);
if (!Subtarget.useSoftFloat()) {
// In 32-bit mode these are custom lowered. In 64-bit mode F32 and F64
// are Legal, f80 is custom lowered.
setOperationAction(ISD::FP_TO_SINT , MVT::i64 , Custom);
setOperationAction(ISD::SINT_TO_FP , MVT::i64 , Custom);
if (X86ScalarSSEf32) {
setOperationAction(ISD::FP_TO_SINT , MVT::i16 , Promote);
// f32 and f64 cases are Legal, f80 case is not
setOperationAction(ISD::FP_TO_SINT , MVT::i32 , Custom);
} else {
setOperationAction(ISD::FP_TO_SINT , MVT::i16 , Custom);
setOperationAction(ISD::FP_TO_SINT , MVT::i32 , Custom);
}
} else {
setOperationAction(ISD::FP_TO_SINT , MVT::i16 , Promote);
setOperationAction(ISD::FP_TO_SINT , MVT::i32 , Expand);
setOperationAction(ISD::FP_TO_SINT , MVT::i64 , Expand);
}
// Handle FP_TO_UINT by promoting the destination to a larger signed
// conversion.
setOperationAction(ISD::FP_TO_UINT , MVT::i1 , Promote);
setOperationAction(ISD::FP_TO_UINT , MVT::i8 , Promote);
setOperationAction(ISD::FP_TO_UINT , MVT::i16 , Promote);
if (Subtarget.is64Bit()) {
if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
// FP_TO_UINT-i32/i64 is legal for f32/f64, but custom for f80.
setOperationAction(ISD::FP_TO_UINT , MVT::i32 , Custom);
setOperationAction(ISD::FP_TO_UINT , MVT::i64 , Custom);
} else {
setOperationAction(ISD::FP_TO_UINT , MVT::i32 , Promote);
setOperationAction(ISD::FP_TO_UINT , MVT::i64 , Expand);
}
} else if (!Subtarget.useSoftFloat()) {
// Since AVX is a superset of SSE3, only check for SSE here.
if (Subtarget.hasSSE1() && !Subtarget.hasSSE3())
// Expand FP_TO_UINT into a select.
// FIXME: We would like to use a Custom expander here eventually to do
// the optimal thing for SSE vs. the default expansion in the legalizer.
setOperationAction(ISD::FP_TO_UINT , MVT::i32 , Expand);
else
// With AVX512 we can use vcvts[ds]2usi for f32/f64->i32, f80 is custom.
// With SSE3 we can use fisttpll to convert to a signed i64; without
// SSE, we're stuck with a fistpll.
setOperationAction(ISD::FP_TO_UINT , MVT::i32 , Custom);
setOperationAction(ISD::FP_TO_UINT , MVT::i64 , Custom);
}
// TODO: when we have SSE, these could be more efficient, by using movd/movq.
if (!X86ScalarSSEf64) {
setOperationAction(ISD::BITCAST , MVT::f32 , Expand);
setOperationAction(ISD::BITCAST , MVT::i32 , Expand);
if (Subtarget.is64Bit()) {
setOperationAction(ISD::BITCAST , MVT::f64 , Expand);
// Without SSE, i64->f64 goes through memory.
setOperationAction(ISD::BITCAST , MVT::i64 , Expand);
}
} else if (!Subtarget.is64Bit())
setOperationAction(ISD::BITCAST , MVT::i64 , Custom);
// Scalar integer divide and remainder are lowered to use operations that
// produce two results, to match the available instructions. This exposes
// the two-result form to trivial CSE, which is able to combine x/y and x%y
// into a single instruction.
//
// Scalar integer multiply-high is also lowered to use two-result
// operations, to match the available instructions. However, plain multiply
// (low) operations are left as Legal, as there are single-result
// instructions for this in x86. Using the two-result multiply instructions
// when both high and low results are needed must be arranged by dagcombine.
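//
// For example, on x86 the "div" instruction produces the quotient in *AX and
// the remainder in *DX in a single operation, so x/y followed by x%y can CSE
// into one divide node.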
for (auto VT : { MVT::i8, MVT::i16, MVT::i32, MVT::i64 }) {
setOperationAction(ISD::MULHS, VT, Expand);
setOperationAction(ISD::MULHU, VT, Expand);
setOperationAction(ISD::SDIV, VT, Expand);
setOperationAction(ISD::UDIV, VT, Expand);
setOperationAction(ISD::SREM, VT, Expand);
setOperationAction(ISD::UREM, VT, Expand);
}
setOperationAction(ISD::BR_JT , MVT::Other, Expand);
setOperationAction(ISD::BRCOND , MVT::Other, Custom);
for (auto VT : { MVT::f32, MVT::f64, MVT::f80, MVT::f128,
MVT::i8, MVT::i16, MVT::i32, MVT::i64 }) {
setOperationAction(ISD::BR_CC, VT, Expand);
setOperationAction(ISD::SELECT_CC, VT, Expand);
}
if (Subtarget.is64Bit())
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i32, Legal);
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i16 , Legal);
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i8 , Legal);
setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i1 , Expand);
setOperationAction(ISD::FP_ROUND_INREG , MVT::f32 , Expand);
setOperationAction(ISD::FREM , MVT::f32 , Expand);
setOperationAction(ISD::FREM , MVT::f64 , Expand);
setOperationAction(ISD::FREM , MVT::f80 , Expand);
setOperationAction(ISD::FLT_ROUNDS_ , MVT::i32 , Custom);
// Promote the i8 variants up to i32, which has a shorter encoding.
setOperationPromotedToType(ISD::CTTZ , MVT::i8 , MVT::i32);
setOperationPromotedToType(ISD::CTTZ_ZERO_UNDEF, MVT::i8 , MVT::i32);
if (!Subtarget.hasBMI()) {
setOperationAction(ISD::CTTZ , MVT::i16 , Custom);
setOperationAction(ISD::CTTZ , MVT::i32 , Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i16 , Legal);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i32 , Legal);
if (Subtarget.is64Bit()) {
setOperationAction(ISD::CTTZ , MVT::i64 , Custom);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i64, Legal);
}
}
if (Subtarget.hasLZCNT()) {
// When promoting the i8 variants, force them to i32 for a shorter
// encoding.
setOperationPromotedToType(ISD::CTLZ , MVT::i8 , MVT::i32);
setOperationPromotedToType(ISD::CTLZ_ZERO_UNDEF, MVT::i8 , MVT::i32);
} else {
setOperationAction(ISD::CTLZ , MVT::i8 , Custom);
setOperationAction(ISD::CTLZ , MVT::i16 , Custom);
setOperationAction(ISD::CTLZ , MVT::i32 , Custom);
setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i8 , Custom);
setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i16 , Custom);
setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32 , Custom);
if (Subtarget.is64Bit()) {
setOperationAction(ISD::CTLZ , MVT::i64 , Custom);
setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i64, Custom);
}
}
// Special handling for half-precision floating point conversions.
// If we don't have F16C support, then lower half float conversions
// into library calls.
if (Subtarget.useSoftFloat() ||
(!Subtarget.hasF16C() && !Subtarget.hasAVX512())) {
setOperationAction(ISD::FP16_TO_FP, MVT::f32, Expand);
setOperationAction(ISD::FP_TO_FP16, MVT::f32, Expand);
}
// There's never any support for operations beyond MVT::f32.
setOperationAction(ISD::FP16_TO_FP, MVT::f64, Expand);
setOperationAction(ISD::FP16_TO_FP, MVT::f80, Expand);
setOperationAction(ISD::FP_TO_FP16, MVT::f64, Expand);
setOperationAction(ISD::FP_TO_FP16, MVT::f80, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::f32, MVT::f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::f64, MVT::f16, Expand);
setLoadExtAction(ISD::EXTLOAD, MVT::f80, MVT::f16, Expand);
setTruncStoreAction(MVT::f32, MVT::f16, Expand);
setTruncStoreAction(MVT::f64, MVT::f16, Expand);
setTruncStoreAction(MVT::f80, MVT::f16, Expand);
if (Subtarget.hasPOPCNT()) {
setOperationAction(ISD::CTPOP , MVT::i8 , Promote);
} else {
setOperationAction(ISD::CTPOP , MVT::i8 , Expand);
setOperationAction(ISD::CTPOP , MVT::i16 , Expand);
setOperationAction(ISD::CTPOP , MVT::i32 , Expand);
if (Subtarget.is64Bit())
setOperationAction(ISD::CTPOP , MVT::i64 , Expand);
}
setOperationAction(ISD::READCYCLECOUNTER , MVT::i64 , Custom);
if (!Subtarget.hasMOVBE())
setOperationAction(ISD::BSWAP , MVT::i16 , Expand);
// These should be promoted to a larger select which is supported.
setOperationAction(ISD::SELECT , MVT::i1 , Promote);
// X86 wants to expand cmov itself.
for (auto VT : { MVT::f32, MVT::f64, MVT::f80, MVT::f128 }) {
setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);
}
for (auto VT : { MVT::i8, MVT::i16, MVT::i32, MVT::i64 }) {
if (VT == MVT::i64 && !Subtarget.is64Bit())
continue;
setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);
}
// Custom action for SELECT MMX and expand action for SELECT_CC MMX
setOperationAction(ISD::SELECT, MVT::x86mmx, Custom);
setOperationAction(ISD::SELECT_CC, MVT::x86mmx, Expand);
setOperationAction(ISD::EH_RETURN , MVT::Other, Custom);
// NOTE: EH_SJLJ_SETJMP/_LONGJMP are NOT intended to support SjLj exception
// handling, but rather a lightweight setjmp/longjmp replacement to support
// continuations, user-level threading, etc. As a result, no other SjLj
// exception interfaces are implemented; please don't build your own
// exception handling on top of them.
// LLVM/Clang supports zero-cost DWARF exception handling.
setOperationAction(ISD::EH_SJLJ_SETJMP, MVT::i32, Custom);
setOperationAction(ISD::EH_SJLJ_LONGJMP, MVT::Other, Custom);
setOperationAction(ISD::EH_SJLJ_SETUP_DISPATCH, MVT::Other, Custom);
if (TM.Options.ExceptionModel == ExceptionHandling::SjLj)
setLibcallName(RTLIB::UNWIND_RESUME, "_Unwind_SjLj_Resume");
// Darwin ABI issue.
for (auto VT : { MVT::i32, MVT::i64 }) {
if (VT == MVT::i64 && !Subtarget.is64Bit())
continue;
setOperationAction(ISD::ConstantPool , VT, Custom);
setOperationAction(ISD::JumpTable , VT, Custom);
setOperationAction(ISD::GlobalAddress , VT, Custom);
setOperationAction(ISD::GlobalTLSAddress, VT, Custom);
setOperationAction(ISD::ExternalSymbol , VT, Custom);
setOperationAction(ISD::BlockAddress , VT, Custom);
}
// 64-bit shl, sra, srl (iff 32-bit x86)
for (auto VT : { MVT::i32, MVT::i64 }) {
if (VT == MVT::i64 && !Subtarget.is64Bit())
continue;
setOperationAction(ISD::SHL_PARTS, VT, Custom);
setOperationAction(ISD::SRA_PARTS, VT, Custom);
setOperationAction(ISD::SRL_PARTS, VT, Custom);
}
if (Subtarget.hasSSE1())
setOperationAction(ISD::PREFETCH , MVT::Other, Legal);
setOperationAction(ISD::ATOMIC_FENCE , MVT::Other, Custom);
// Expand certain atomics
for (auto VT : { MVT::i8, MVT::i16, MVT::i32, MVT::i64 }) {
setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, VT, Custom);
setOperationAction(ISD::ATOMIC_LOAD_SUB, VT, Custom);
setOperationAction(ISD::ATOMIC_LOAD_ADD, VT, Custom);
setOperationAction(ISD::ATOMIC_LOAD_OR, VT, Custom);
setOperationAction(ISD::ATOMIC_LOAD_XOR, VT, Custom);
setOperationAction(ISD::ATOMIC_LOAD_AND, VT, Custom);
setOperationAction(ISD::ATOMIC_STORE, VT, Custom);
}
if (Subtarget.hasCmpxchg16b()) {
setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, MVT::i128, Custom);
}
// FIXME - use subtarget debug flags
if (!Subtarget.isTargetDarwin() && !Subtarget.isTargetELF() &&
!Subtarget.isTargetCygMing() && !Subtarget.isTargetWin64() &&
TM.Options.ExceptionModel != ExceptionHandling::SjLj) {
setOperationAction(ISD::EH_LABEL, MVT::Other, Expand);
}
setOperationAction(ISD::FRAME_TO_ARGS_OFFSET, MVT::i32, Custom);
setOperationAction(ISD::FRAME_TO_ARGS_OFFSET, MVT::i64, Custom);
setOperationAction(ISD::INIT_TRAMPOLINE, MVT::Other, Custom);
setOperationAction(ISD::ADJUST_TRAMPOLINE, MVT::Other, Custom);
setOperationAction(ISD::TRAP, MVT::Other, Legal);
setOperationAction(ISD::DEBUGTRAP, MVT::Other, Legal);
// VASTART needs to be custom lowered to use the VarArgsFrameIndex
setOperationAction(ISD::VASTART , MVT::Other, Custom);
setOperationAction(ISD::VAEND , MVT::Other, Expand);
bool Is64Bit = Subtarget.is64Bit();
setOperationAction(ISD::VAARG, MVT::Other, Is64Bit ? Custom : Expand);
setOperationAction(ISD::VACOPY, MVT::Other, Is64Bit ? Custom : Expand);
setOperationAction(ISD::STACKSAVE, MVT::Other, Expand);
setOperationAction(ISD::STACKRESTORE, MVT::Other, Expand);
setOperationAction(ISD::DYNAMIC_STACKALLOC, PtrVT, Custom);
// GC_TRANSITION_START and GC_TRANSITION_END need custom lowering.
setOperationAction(ISD::GC_TRANSITION_START, MVT::Other, Custom);
setOperationAction(ISD::GC_TRANSITION_END, MVT::Other, Custom);
if (!Subtarget.useSoftFloat() && X86ScalarSSEf64) {
// f32 and f64 use SSE.
// Set up the FP register classes.
addRegisterClass(MVT::f32, Subtarget.hasAVX512() ? &X86::FR32XRegClass
: &X86::FR32RegClass);
addRegisterClass(MVT::f64, Subtarget.hasAVX512() ? &X86::FR64XRegClass
: &X86::FR64RegClass);
for (auto VT : { MVT::f32, MVT::f64 }) {
// Use ANDPD to simulate FABS.
setOperationAction(ISD::FABS, VT, Custom);
// Use XORP to simulate FNEG.
setOperationAction(ISD::FNEG, VT, Custom);
// Use ANDPD and ORPD to simulate FCOPYSIGN.
setOperationAction(ISD::FCOPYSIGN, VT, Custom);
// We don't support sin/cos/fmod
setOperationAction(ISD::FSIN , VT, Expand);
setOperationAction(ISD::FCOS , VT, Expand);
setOperationAction(ISD::FSINCOS, VT, Expand);
}
// Lower this to MOVMSK plus an AND.
setOperationAction(ISD::FGETSIGN, MVT::i64, Custom);
setOperationAction(ISD::FGETSIGN, MVT::i32, Custom);
// Expand FP immediates into loads from the stack, except for the special
// cases we handle.
addLegalFPImmediate(APFloat(+0.0)); // xorpd
addLegalFPImmediate(APFloat(+0.0f)); // xorps
} else if (UseX87 && X86ScalarSSEf32) {
// Use SSE for f32, x87 for f64.
// Set up the FP register classes.
addRegisterClass(MVT::f32, Subtarget.hasAVX512() ? &X86::FR32XRegClass
: &X86::FR32RegClass);
addRegisterClass(MVT::f64, &X86::RFP64RegClass);
// Use ANDPS to simulate FABS.
setOperationAction(ISD::FABS , MVT::f32, Custom);
// Use XORP to simulate FNEG.
setOperationAction(ISD::FNEG , MVT::f32, Custom);
setOperationAction(ISD::UNDEF, MVT::f64, Expand);
// Use ANDPS and ORPS to simulate FCOPYSIGN.
setOperationAction(ISD::FCOPYSIGN, MVT::f64, Expand);
setOperationAction(ISD::FCOPYSIGN, MVT::f32, Custom);
// We don't support sin/cos/fmod
setOperationAction(ISD::FSIN , MVT::f32, Expand);
setOperationAction(ISD::FCOS , MVT::f32, Expand);
setOperationAction(ISD::FSINCOS, MVT::f32, Expand);
// Special cases we handle for FP constants.
addLegalFPImmediate(APFloat(+0.0f)); // xorps
addLegalFPImmediate(APFloat(+0.0)); // FLD0
addLegalFPImmediate(APFloat(+1.0)); // FLD1
addLegalFPImmediate(APFloat(-0.0)); // FLD0/FCHS
addLegalFPImmediate(APFloat(-1.0)); // FLD1/FCHS
if (!TM.Options.UnsafeFPMath) {
setOperationAction(ISD::FSIN , MVT::f64, Expand);
setOperationAction(ISD::FCOS , MVT::f64, Expand);
setOperationAction(ISD::FSINCOS, MVT::f64, Expand);
}
} else if (UseX87) {
// f32 and f64 in x87.
// Set up the FP register classes.
addRegisterClass(MVT::f64, &X86::RFP64RegClass);
addRegisterClass(MVT::f32, &X86::RFP32RegClass);
for (auto VT : { MVT::f32, MVT::f64 }) {
setOperationAction(ISD::UNDEF, VT, Expand);
setOperationAction(ISD::FCOPYSIGN, VT, Expand);
if (!TM.Options.UnsafeFPMath) {
setOperationAction(ISD::FSIN , VT, Expand);
setOperationAction(ISD::FCOS , VT, Expand);
setOperationAction(ISD::FSINCOS, VT, Expand);
}
}
addLegalFPImmediate(APFloat(+0.0)); // FLD0
addLegalFPImmediate(APFloat(+1.0)); // FLD1
addLegalFPImmediate(APFloat(-0.0)); // FLD0/FCHS
addLegalFPImmediate(APFloat(-1.0)); // FLD1/FCHS
addLegalFPImmediate(APFloat(+0.0f)); // FLD0
addLegalFPImmediate(APFloat(+1.0f)); // FLD1
addLegalFPImmediate(APFloat(-0.0f)); // FLD0/FCHS
addLegalFPImmediate(APFloat(-1.0f)); // FLD1/FCHS
}
// We don't support FMA.
setOperationAction(ISD::FMA, MVT::f64, Expand);
setOperationAction(ISD::FMA, MVT::f32, Expand);
// Long double always uses X87, except f128 in MMX.
if (UseX87) {
if (Subtarget.is64Bit() && Subtarget.hasMMX()) {
addRegisterClass(MVT::f128, &X86::FR128RegClass);
ValueTypeActions.setTypeAction(MVT::f128, TypeSoftenFloat);
setOperationAction(ISD::FABS , MVT::f128, Custom);
setOperationAction(ISD::FNEG , MVT::f128, Custom);
setOperationAction(ISD::FCOPYSIGN, MVT::f128, Custom);
}
addRegisterClass(MVT::f80, &X86::RFP80RegClass);
setOperationAction(ISD::UNDEF, MVT::f80, Expand);
setOperationAction(ISD::FCOPYSIGN, MVT::f80, Expand);
{
APFloat TmpFlt = APFloat::getZero(APFloat::x87DoubleExtended());
addLegalFPImmediate(TmpFlt); // FLD0
TmpFlt.changeSign();
addLegalFPImmediate(TmpFlt); // FLD0/FCHS
bool ignored;
APFloat TmpFlt2(+1.0);
TmpFlt2.convert(APFloat::x87DoubleExtended(), APFloat::rmNearestTiesToEven,
&ignored);
addLegalFPImmediate(TmpFlt2); // FLD1
TmpFlt2.changeSign();
addLegalFPImmediate(TmpFlt2); // FLD1/FCHS
}
if (!TM.Options.UnsafeFPMath) {
setOperationAction(ISD::FSIN , MVT::f80, Expand);
setOperationAction(ISD::FCOS , MVT::f80, Expand);
setOperationAction(ISD::FSINCOS, MVT::f80, Expand);
}
setOperationAction(ISD::FFLOOR, MVT::f80, Expand);
setOperationAction(ISD::FCEIL, MVT::f80, Expand);
setOperationAction(ISD::FTRUNC, MVT::f80, Expand);
setOperationAction(ISD::FRINT, MVT::f80, Expand);
setOperationAction(ISD::FNEARBYINT, MVT::f80, Expand);
setOperationAction(ISD::FMA, MVT::f80, Expand);
}
// Always use a library call for pow.
setOperationAction(ISD::FPOW , MVT::f32 , Expand);
setOperationAction(ISD::FPOW , MVT::f64 , Expand);
setOperationAction(ISD::FPOW , MVT::f80 , Expand);
setOperationAction(ISD::FLOG, MVT::f80, Expand);
setOperationAction(ISD::FLOG2, MVT::f80, Expand);
setOperationAction(ISD::FLOG10, MVT::f80, Expand);
setOperationAction(ISD::FEXP, MVT::f80, Expand);
setOperationAction(ISD::FEXP2, MVT::f80, Expand);
setOperationAction(ISD::FMINNUM, MVT::f80, Expand);
setOperationAction(ISD::FMAXNUM, MVT::f80, Expand);
// Some FP actions are always expanded for vector types.
for (auto VT : { MVT::v4f32, MVT::v8f32, MVT::v16f32,
MVT::v2f64, MVT::v4f64, MVT::v8f64 }) {
setOperationAction(ISD::FSIN, VT, Expand);
setOperationAction(ISD::FSINCOS, VT, Expand);
setOperationAction(ISD::FCOS, VT, Expand);
setOperationAction(ISD::FREM, VT, Expand);
setOperationAction(ISD::FCOPYSIGN, VT, Expand);
setOperationAction(ISD::FPOW, VT, Expand);
setOperationAction(ISD::FLOG, VT, Expand);
setOperationAction(ISD::FLOG2, VT, Expand);
setOperationAction(ISD::FLOG10, VT, Expand);
setOperationAction(ISD::FEXP, VT, Expand);
setOperationAction(ISD::FEXP2, VT, Expand);
}
// First set operation action for all vector types to either promote
// (for widening) or expand (for scalarization). Then we will selectively
// turn on ones that can be effectively codegen'd.
for (MVT VT : MVT::vector_valuetypes()) {
setOperationAction(ISD::SDIV, VT, Expand);
setOperationAction(ISD::UDIV, VT, Expand);
setOperationAction(ISD::SREM, VT, Expand);
setOperationAction(ISD::UREM, VT, Expand);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT,Expand);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Expand);
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT,Expand);
setOperationAction(ISD::INSERT_SUBVECTOR, VT,Expand);
setOperationAction(ISD::FMA, VT, Expand);
setOperationAction(ISD::FFLOOR, VT, Expand);
setOperationAction(ISD::FCEIL, VT, Expand);
setOperationAction(ISD::FTRUNC, VT, Expand);
setOperationAction(ISD::FRINT, VT, Expand);
setOperationAction(ISD::FNEARBYINT, VT, Expand);
setOperationAction(ISD::SMUL_LOHI, VT, Expand);
setOperationAction(ISD::MULHS, VT, Expand);
setOperationAction(ISD::UMUL_LOHI, VT, Expand);
setOperationAction(ISD::MULHU, VT, Expand);
setOperationAction(ISD::SDIVREM, VT, Expand);
setOperationAction(ISD::UDIVREM, VT, Expand);
setOperationAction(ISD::CTPOP, VT, Expand);
setOperationAction(ISD::CTTZ, VT, Expand);
setOperationAction(ISD::CTLZ, VT, Expand);
setOperationAction(ISD::ROTL, VT, Expand);
setOperationAction(ISD::ROTR, VT, Expand);
setOperationAction(ISD::BSWAP, VT, Expand);
setOperationAction(ISD::SETCC, VT, Expand);
setOperationAction(ISD::FP_TO_UINT, VT, Expand);
setOperationAction(ISD::FP_TO_SINT, VT, Expand);
setOperationAction(ISD::UINT_TO_FP, VT, Expand);
setOperationAction(ISD::SINT_TO_FP, VT, Expand);
setOperationAction(ISD::SIGN_EXTEND_INREG, VT,Expand);
setOperationAction(ISD::TRUNCATE, VT, Expand);
setOperationAction(ISD::SIGN_EXTEND, VT, Expand);
setOperationAction(ISD::ZERO_EXTEND, VT, Expand);
setOperationAction(ISD::ANY_EXTEND, VT, Expand);
setOperationAction(ISD::SELECT_CC, VT, Expand);
for (MVT InnerVT : MVT::vector_valuetypes()) {
setTruncStoreAction(InnerVT, VT, Expand);
setLoadExtAction(ISD::SEXTLOAD, InnerVT, VT, Expand);
setLoadExtAction(ISD::ZEXTLOAD, InnerVT, VT, Expand);
// N.b. ISD::EXTLOAD legality is basically ignored except for i1-like
// types, we have to deal with them whether we ask for Expansion or not.
// Setting Expand causes its own optimisation problems though, so leave
// them legal.
if (VT.getVectorElementType() == MVT::i1)
setLoadExtAction(ISD::EXTLOAD, InnerVT, VT, Expand);
// EXTLOAD for MVT::f16 vectors is not legal because f16 vectors are
// split/scalarized right now.
if (VT.getVectorElementType() == MVT::f16)
setLoadExtAction(ISD::EXTLOAD, InnerVT, VT, Expand);
}
}
// FIXME: In order to prevent SSE instructions being expanded to MMX ones
// with -msoft-float, disable use of MMX as well.
if (!Subtarget.useSoftFloat() && Subtarget.hasMMX()) {
addRegisterClass(MVT::x86mmx, &X86::VR64RegClass);
// No operations on x86mmx supported, everything uses intrinsics.
}
if (!Subtarget.useSoftFloat() && Subtarget.hasSSE1()) {
addRegisterClass(MVT::v4f32, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
setOperationAction(ISD::FNEG, MVT::v4f32, Custom);
setOperationAction(ISD::FABS, MVT::v4f32, Custom);
setOperationAction(ISD::FCOPYSIGN, MVT::v4f32, Custom);
setOperationAction(ISD::BUILD_VECTOR, MVT::v4f32, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v4f32, Custom);
setOperationAction(ISD::VSELECT, MVT::v4f32, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4f32, Custom);
setOperationAction(ISD::SELECT, MVT::v4f32, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i32, Custom);
}
if (!Subtarget.useSoftFloat() && Subtarget.hasSSE2()) {
addRegisterClass(MVT::v2f64, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
// FIXME: Unfortunately, -soft-float and -no-implicit-float mean XMM
// registers cannot be used even for integer operations.
addRegisterClass(MVT::v16i8, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
addRegisterClass(MVT::v8i16, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
addRegisterClass(MVT::v4i32, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
addRegisterClass(MVT::v2i64, Subtarget.hasVLX() ? &X86::VR128XRegClass
: &X86::VR128RegClass);
setOperationAction(ISD::MUL, MVT::v16i8, Custom);
setOperationAction(ISD::MUL, MVT::v4i32, Custom);
setOperationAction(ISD::MUL, MVT::v2i64, Custom);
setOperationAction(ISD::UMUL_LOHI, MVT::v4i32, Custom);
setOperationAction(ISD::SMUL_LOHI, MVT::v4i32, Custom);
setOperationAction(ISD::MULHU, MVT::v16i8, Custom);
setOperationAction(ISD::MULHS, MVT::v16i8, Custom);
setOperationAction(ISD::MULHU, MVT::v8i16, Legal);
setOperationAction(ISD::MULHS, MVT::v8i16, Legal);
setOperationAction(ISD::MUL, MVT::v8i16, Legal);
setOperationAction(ISD::FNEG, MVT::v2f64, Custom);
setOperationAction(ISD::FABS, MVT::v2f64, Custom);
setOperationAction(ISD::FCOPYSIGN, MVT::v2f64, Custom);
setOperationAction(ISD::SMAX, MVT::v8i16, Legal);
setOperationAction(ISD::UMAX, MVT::v16i8, Legal);
setOperationAction(ISD::SMIN, MVT::v8i16, Legal);
setOperationAction(ISD::UMIN, MVT::v16i8, Legal);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v8i16, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4i32, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4f32, Custom);
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64 }) {
setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::CTPOP, VT, Custom);
setOperationAction(ISD::CTTZ, VT, Custom);
}
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32 }) {
setOperationAction(ISD::SCALAR_TO_VECTOR, VT, Custom);
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
}
// We support custom legalizing of sext and anyext loads for specific
// memory vector types which we can load as a scalar (or sequence of
// scalars) and extend in-register to a legal 128-bit vector type. For sext
// loads these must work with a single scalar load.
for (MVT VT : MVT::integer_vector_valuetypes()) {
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v4i8, Custom);
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v4i16, Custom);
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v8i8, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v2i8, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v2i16, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v2i32, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v4i8, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v4i16, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v8i8, Custom);
}
for (auto VT : { MVT::v2f64, MVT::v2i64 }) {
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Custom);
if (VT == MVT::v2i64 && !Subtarget.is64Bit())
continue;
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
}
// Promote v16i8, v8i16, v4i32 load, select, and, or, xor to v2i64.
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32 }) {
setOperationPromotedToType(ISD::AND, VT, MVT::v2i64);
setOperationPromotedToType(ISD::OR, VT, MVT::v2i64);
setOperationPromotedToType(ISD::XOR, VT, MVT::v2i64);
setOperationPromotedToType(ISD::LOAD, VT, MVT::v2i64);
setOperationPromotedToType(ISD::SELECT, VT, MVT::v2i64);
}
// Custom lower v2i64 and v2f64 selects.
setOperationAction(ISD::SELECT, MVT::v2f64, Custom);
setOperationAction(ISD::SELECT, MVT::v2i64, Custom);
setOperationAction(ISD::FP_TO_SINT, MVT::v4i32, Legal);
setOperationAction(ISD::FP_TO_SINT, MVT::v2i32, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v4i32, Legal);
setOperationAction(ISD::SINT_TO_FP, MVT::v2i32, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i8, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i16, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v2i32, Custom);
// Fast v2f32 UINT_TO_FP( v2i32 ) custom conversion.
setOperationAction(ISD::UINT_TO_FP, MVT::v2f32, Custom);
setOperationAction(ISD::FP_EXTEND, MVT::v2f32, Custom);
setOperationAction(ISD::FP_ROUND, MVT::v2f32, Custom);
for (MVT VT : MVT::fp_vector_valuetypes())
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v2f32, Legal);
setOperationAction(ISD::BITCAST, MVT::v2i32, Custom);
setOperationAction(ISD::BITCAST, MVT::v4i16, Custom);
setOperationAction(ISD::BITCAST, MVT::v8i8, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v2i64, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v4i32, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v8i16, Custom);
// In the customized shift lowering, the legal v4i32/v2i64 cases
// in AVX2 will be recognized.
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64 }) {
setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);
}
}
if (!Subtarget.useSoftFloat() && Subtarget.hasSSSE3()) {
setOperationAction(ISD::ABS, MVT::v16i8, Legal);
setOperationAction(ISD::ABS, MVT::v8i16, Legal);
setOperationAction(ISD::ABS, MVT::v4i32, Legal);
setOperationAction(ISD::BITREVERSE, MVT::v16i8, Custom);
setOperationAction(ISD::CTLZ, MVT::v16i8, Custom);
setOperationAction(ISD::CTLZ, MVT::v8i16, Custom);
setOperationAction(ISD::CTLZ, MVT::v4i32, Custom);
setOperationAction(ISD::CTLZ, MVT::v2i64, Custom);
}
if (!Subtarget.useSoftFloat() && Subtarget.hasSSE41()) {
for (MVT RoundedTy : {MVT::f32, MVT::f64, MVT::v4f32, MVT::v2f64}) {
setOperationAction(ISD::FFLOOR, RoundedTy, Legal);
setOperationAction(ISD::FCEIL, RoundedTy, Legal);
setOperationAction(ISD::FTRUNC, RoundedTy, Legal);
setOperationAction(ISD::FRINT, RoundedTy, Legal);
setOperationAction(ISD::FNEARBYINT, RoundedTy, Legal);
}
setOperationAction(ISD::SMAX, MVT::v16i8, Legal);
setOperationAction(ISD::SMAX, MVT::v4i32, Legal);
setOperationAction(ISD::UMAX, MVT::v8i16, Legal);
setOperationAction(ISD::UMAX, MVT::v4i32, Legal);
setOperationAction(ISD::SMIN, MVT::v16i8, Legal);
setOperationAction(ISD::SMIN, MVT::v4i32, Legal);
setOperationAction(ISD::UMIN, MVT::v8i16, Legal);
setOperationAction(ISD::UMIN, MVT::v4i32, Legal);
// FIXME: Do we need to handle scalar-to-vector here?
setOperationAction(ISD::MUL, MVT::v4i32, Legal);
// We directly match byte blends in the backend as they match the VSELECT
// condition form.
setOperationAction(ISD::VSELECT, MVT::v16i8, Legal);
// SSE41 brings specific instructions for doing vector sign extend even in
// cases where we don't have SRA.
for (auto VT : { MVT::v8i16, MVT::v4i32, MVT::v2i64 }) {
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, VT, Legal);
setOperationAction(ISD::ZERO_EXTEND_VECTOR_INREG, VT, Legal);
}
for (MVT VT : MVT::integer_vector_valuetypes()) {
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v2i8, Custom);
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v2i16, Custom);
setLoadExtAction(ISD::SEXTLOAD, VT, MVT::v2i32, Custom);
}
// SSE41 also has vector sign/zero extending loads, PMOV[SZ]X
for (auto LoadExtOp : { ISD::SEXTLOAD, ISD::ZEXTLOAD }) {
setLoadExtAction(LoadExtOp, MVT::v8i16, MVT::v8i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v4i32, MVT::v4i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v2i64, MVT::v2i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v4i32, MVT::v4i16, Legal);
setLoadExtAction(LoadExtOp, MVT::v2i64, MVT::v2i16, Legal);
setLoadExtAction(LoadExtOp, MVT::v2i64, MVT::v2i32, Legal);
}
// i8 vectors are custom because the source register and source memory
// operand types are not the same width.
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v16i8, Custom);
}
if (!Subtarget.useSoftFloat() && Subtarget.hasXOP()) {
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64,
MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 })
setOperationAction(ISD::ROTL, VT, Custom);
// XOP can efficiently perform BITREVERSE with VPPERM.
for (auto VT : { MVT::i8, MVT::i16, MVT::i32, MVT::i64 })
setOperationAction(ISD::BITREVERSE, VT, Custom);
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64,
MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 })
setOperationAction(ISD::BITREVERSE, VT, Custom);
}
if (!Subtarget.useSoftFloat() && Subtarget.hasFp256()) {
bool HasInt256 = Subtarget.hasInt256();
addRegisterClass(MVT::v32i8, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
addRegisterClass(MVT::v16i16, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
addRegisterClass(MVT::v8i32, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
addRegisterClass(MVT::v8f32, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
addRegisterClass(MVT::v4i64, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
addRegisterClass(MVT::v4f64, Subtarget.hasVLX() ? &X86::VR256XRegClass
: &X86::VR256RegClass);
for (auto VT : { MVT::v8f32, MVT::v4f64 }) {
setOperationAction(ISD::FFLOOR, VT, Legal);
setOperationAction(ISD::FCEIL, VT, Legal);
setOperationAction(ISD::FTRUNC, VT, Legal);
setOperationAction(ISD::FRINT, VT, Legal);
setOperationAction(ISD::FNEARBYINT, VT, Legal);
setOperationAction(ISD::FNEG, VT, Custom);
setOperationAction(ISD::FABS, VT, Custom);
setOperationAction(ISD::FCOPYSIGN, VT, Custom);
}
// (fp_to_int:v8i16 (v8f32 ..)) requires the result type to be promoted
// even though v8i16 is a legal type.
setOperationAction(ISD::FP_TO_SINT, MVT::v8i16, Promote);
setOperationAction(ISD::FP_TO_UINT, MVT::v8i16, Promote);
setOperationAction(ISD::FP_TO_SINT, MVT::v8i32, Legal);
setOperationAction(ISD::SINT_TO_FP, MVT::v8i16, Promote);
setOperationAction(ISD::SINT_TO_FP, MVT::v8i32, Legal);
setOperationAction(ISD::FP_ROUND, MVT::v4f32, Legal);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i8, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i16, Custom);
for (MVT VT : MVT::fp_vector_valuetypes())
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v4f32, Legal);
// In the customized shift lowering, the legal v8i32/v4i64 cases
// in AVX2 will be recognized.
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 }) {
setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);
}
setOperationAction(ISD::SELECT, MVT::v4f64, Custom);
setOperationAction(ISD::SELECT, MVT::v4i64, Custom);
setOperationAction(ISD::SELECT, MVT::v8f32, Custom);
for (auto VT : { MVT::v16i16, MVT::v8i32, MVT::v4i64 }) {
setOperationAction(ISD::SIGN_EXTEND, VT, Custom);
setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
setOperationAction(ISD::ANY_EXTEND, VT, Custom);
}
setOperationAction(ISD::TRUNCATE, MVT::v16i8, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v8i16, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v4i32, Custom);
setOperationAction(ISD::BITREVERSE, MVT::v32i8, Custom);
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 }) {
setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::CTPOP, VT, Custom);
setOperationAction(ISD::CTTZ, VT, Custom);
setOperationAction(ISD::CTLZ, VT, Custom);
}
if (Subtarget.hasAnyFMA()) {
for (auto VT : { MVT::f32, MVT::f64, MVT::v4f32, MVT::v8f32,
MVT::v2f64, MVT::v4f64 })
setOperationAction(ISD::FMA, VT, Legal);
}
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64 }) {
setOperationAction(ISD::ADD, VT, HasInt256 ? Legal : Custom);
setOperationAction(ISD::SUB, VT, HasInt256 ? Legal : Custom);
}
setOperationAction(ISD::MUL, MVT::v4i64, Custom);
setOperationAction(ISD::MUL, MVT::v8i32, HasInt256 ? Legal : Custom);
setOperationAction(ISD::MUL, MVT::v16i16, HasInt256 ? Legal : Custom);
setOperationAction(ISD::MUL, MVT::v32i8, Custom);
setOperationAction(ISD::UMUL_LOHI, MVT::v8i32, Custom);
setOperationAction(ISD::SMUL_LOHI, MVT::v8i32, Custom);
setOperationAction(ISD::MULHU, MVT::v16i16, HasInt256 ? Legal : Custom);
setOperationAction(ISD::MULHS, MVT::v16i16, HasInt256 ? Legal : Custom);
setOperationAction(ISD::MULHU, MVT::v32i8, Custom);
setOperationAction(ISD::MULHS, MVT::v32i8, Custom);
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32 }) {
setOperationAction(ISD::ABS, VT, HasInt256 ? Legal : Custom);
setOperationAction(ISD::SMAX, VT, HasInt256 ? Legal : Custom);
setOperationAction(ISD::UMAX, VT, HasInt256 ? Legal : Custom);
setOperationAction(ISD::SMIN, VT, HasInt256 ? Legal : Custom);
setOperationAction(ISD::UMIN, VT, HasInt256 ? Legal : Custom);
}
if (HasInt256) {
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v4i64, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v8i32, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v16i16, Custom);
// The custom lowering for UINT_TO_FP for v8i32 becomes interesting
// when we have a 256-bit-wide blend with immediate.
setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Custom);
// AVX2 also has wider vector sign/zero extending loads, VPMOV[SZ]X
for (auto LoadExtOp : { ISD::SEXTLOAD, ISD::ZEXTLOAD }) {
setLoadExtAction(LoadExtOp, MVT::v16i16, MVT::v16i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v8i32, MVT::v8i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v4i64, MVT::v4i8, Legal);
setLoadExtAction(LoadExtOp, MVT::v8i32, MVT::v8i16, Legal);
setLoadExtAction(LoadExtOp, MVT::v4i64, MVT::v4i16, Legal);
setLoadExtAction(LoadExtOp, MVT::v4i64, MVT::v4i32, Legal);
}
}
for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,
MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64 }) {
setOperationAction(ISD::MLOAD, VT, Legal);
setOperationAction(ISD::MSTORE, VT, Legal);
}
// Extract subvector is special because the value type
// (result) is 128-bit but the source is 256-bit wide.
for (auto VT : { MVT::v16i8, MVT::v8i16, MVT::v4i32, MVT::v2i64,
MVT::v4f32, MVT::v2f64 }) {
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
}
// Custom lower several nodes for 256-bit types.
for (MVT VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64,
MVT::v8f32, MVT::v4f64 }) {
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::SCALAR_TO_VECTOR, VT, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, VT, Legal);
setOperationAction(ISD::CONCAT_VECTORS, VT, Custom);
}
if (HasInt256)
setOperationAction(ISD::VSELECT, MVT::v32i8, Legal);
// Promote v32i8, v16i16, v8i32 select, and, or, xor to v4i64.
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32 }) {
setOperationPromotedToType(ISD::AND, VT, MVT::v4i64);
setOperationPromotedToType(ISD::OR, VT, MVT::v4i64);
setOperationPromotedToType(ISD::XOR, VT, MVT::v4i64);
setOperationPromotedToType(ISD::LOAD, VT, MVT::v4i64);
setOperationPromotedToType(ISD::SELECT, VT, MVT::v4i64);
}
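// Illustrative note (not from the original source): promotion here means the
// operands are bitcast to v4i64, the operation is emitted once for that
// type, and the result is bitcast back - e.g. a v8i32 AND becomes a v4i64
// AND, so only the v4i64 form needs instruction patterns.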
}
if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
addRegisterClass(MVT::v8i64, &X86::VR512RegClass);
addRegisterClass(MVT::v8f64, &X86::VR512RegClass);
addRegisterClass(MVT::v1i1, &X86::VK1RegClass);
addRegisterClass(MVT::v8i1, &X86::VK8RegClass);
addRegisterClass(MVT::v16i1, &X86::VK16RegClass);
for (MVT VT : MVT::fp_vector_valuetypes())
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v8f32, Legal);
for (auto ExtType : {ISD::ZEXTLOAD, ISD::SEXTLOAD, ISD::EXTLOAD}) {
setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i8, Legal);
setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i16, Legal);
setLoadExtAction(ExtType, MVT::v32i16, MVT::v32i8, Legal);
setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i8, Legal);
setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i16, Legal);
setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i32, Legal);
}
for (MVT VT : {MVT::v2i64, MVT::v4i32, MVT::v8i32, MVT::v4i64, MVT::v8i16,
MVT::v16i8, MVT::v16i16, MVT::v32i8, MVT::v16i32,
MVT::v8i64, MVT::v32i16, MVT::v64i8}) {
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
setLoadExtAction(ISD::SEXTLOAD, VT, MaskVT, Custom);
setLoadExtAction(ISD::ZEXTLOAD, VT, MaskVT, Custom);
setLoadExtAction(ISD::EXTLOAD, VT, MaskVT, Custom);
setTruncStoreAction(VT, MaskVT, Custom);
}
for (MVT VT : { MVT::v16f32, MVT::v8f64 }) {
setOperationAction(ISD::FNEG, VT, Custom);
setOperationAction(ISD::FABS, VT, Custom);
setOperationAction(ISD::FMA, VT, Legal);
setOperationAction(ISD::FCOPYSIGN, VT, Custom);
}
setOperationAction(ISD::FP_TO_SINT, MVT::v16i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v16i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v8i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v4i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v2i32, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v16i32, Legal);
setOperationAction(ISD::SINT_TO_FP, MVT::v8i1, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v16i1, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v16i8, Promote);
setOperationAction(ISD::SINT_TO_FP, MVT::v16i16, Promote);
setOperationAction(ISD::UINT_TO_FP, MVT::v16i32, Legal);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Legal);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i32, Legal);
setOperationAction(ISD::UINT_TO_FP, MVT::v16i8, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v16i16, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v16i1, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v16i1, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v8i1, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i1, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v4i1, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v4i1, Custom);
setOperationAction(ISD::SINT_TO_FP, MVT::v2i1, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v2i1, Custom);
setOperationAction(ISD::FP_ROUND, MVT::v8f32, Legal);
setOperationAction(ISD::FP_EXTEND, MVT::v8f32, Legal);
setTruncStoreAction(MVT::v8i64, MVT::v8i8, Legal);
setTruncStoreAction(MVT::v8i64, MVT::v8i16, Legal);
setTruncStoreAction(MVT::v8i64, MVT::v8i32, Legal);
setTruncStoreAction(MVT::v16i32, MVT::v16i8, Legal);
setTruncStoreAction(MVT::v16i32, MVT::v16i16, Legal);
if (Subtarget.hasVLX()) {
setTruncStoreAction(MVT::v4i64, MVT::v4i8, Legal);
setTruncStoreAction(MVT::v4i64, MVT::v4i16, Legal);
setTruncStoreAction(MVT::v4i64, MVT::v4i32, Legal);
setTruncStoreAction(MVT::v8i32, MVT::v8i8, Legal);
setTruncStoreAction(MVT::v8i32, MVT::v8i16, Legal);
setTruncStoreAction(MVT::v2i64, MVT::v2i8, Legal);
setTruncStoreAction(MVT::v2i64, MVT::v2i16, Legal);
setTruncStoreAction(MVT::v2i64, MVT::v2i32, Legal);
setTruncStoreAction(MVT::v4i32, MVT::v4i8, Legal);
setTruncStoreAction(MVT::v4i32, MVT::v4i16, Legal);
} else {
for (auto VT : {MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,
MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64}) {
setOperationAction(ISD::MLOAD, VT, Custom);
setOperationAction(ISD::MSTORE, VT, Custom);
}
}
setOperationAction(ISD::TRUNCATE, MVT::v16i8, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v8i32, Custom);
if (Subtarget.hasDQI()) {
for (auto VT : { MVT::v2i64, MVT::v4i64, MVT::v8i64 }) {
setOperationAction(ISD::SINT_TO_FP, VT, Legal);
setOperationAction(ISD::UINT_TO_FP, VT, Legal);
setOperationAction(ISD::FP_TO_SINT, VT, Legal);
setOperationAction(ISD::FP_TO_UINT, VT, Legal);
}
if (Subtarget.hasVLX()) {
// Fast v2f32 SINT_TO_FP( v2i32 ) custom conversion.
setOperationAction(ISD::SINT_TO_FP, MVT::v2f32, Custom);
setOperationAction(ISD::FP_TO_SINT, MVT::v2f32, Custom);
setOperationAction(ISD::FP_TO_UINT, MVT::v2f32, Custom);
}
}
if (Subtarget.hasVLX()) {
setOperationAction(ISD::SINT_TO_FP, MVT::v8i32, Legal);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Legal);
setOperationAction(ISD::FP_TO_SINT, MVT::v8i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v8i32, Legal);
setOperationAction(ISD::SINT_TO_FP, MVT::v4i32, Legal);
setOperationAction(ISD::FP_TO_SINT, MVT::v4i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v4i32, Legal);
setOperationAction(ISD::ZERO_EXTEND, MVT::v4i32, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v2i64, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v4i32, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v2i64, Custom);
// FIXME: These extending loads are also available on SSE/AVX2; add relevant patterns.
setLoadExtAction(ISD::EXTLOAD, MVT::v8i32, MVT::v8i8, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v8i32, MVT::v8i16, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v4i32, MVT::v4i8, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v4i32, MVT::v4i16, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v4i64, MVT::v4i8, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v4i64, MVT::v4i16, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v4i64, MVT::v4i32, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v2i64, MVT::v2i8, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v2i64, MVT::v2i16, Legal);
setLoadExtAction(ISD::EXTLOAD, MVT::v2i64, MVT::v2i32, Legal);
}
setOperationAction(ISD::TRUNCATE, MVT::v16i16, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v16i32, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v8i64, Custom);
setOperationAction(ISD::ANY_EXTEND, MVT::v16i32, Custom);
setOperationAction(ISD::ANY_EXTEND, MVT::v8i64, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v16i32, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v8i64, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v16i8, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v8i16, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v16i16, Custom);
for (auto VT : { MVT::v16f32, MVT::v8f64 }) {
setOperationAction(ISD::FFLOOR, VT, Legal);
setOperationAction(ISD::FCEIL, VT, Legal);
setOperationAction(ISD::FTRUNC, VT, Legal);
setOperationAction(ISD::FRINT, VT, Legal);
setOperationAction(ISD::FNEARBYINT, VT, Legal);
}
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v8i64, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v16i32, Custom);
// Without BWI we need to use custom lowering to handle MVT::v64i8 input.
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v64i8, Custom);
setOperationAction(ISD::ZERO_EXTEND_VECTOR_INREG, MVT::v64i8, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v8f64, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v8i64, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v16f32, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v16i32, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v16i1, Custom);
setOperationAction(ISD::MUL, MVT::v8i64, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v1i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v16i1, Custom);
setOperationAction(ISD::BUILD_VECTOR, MVT::v1i1, Custom);
setOperationAction(ISD::SELECT, MVT::v8f64, Custom);
setOperationAction(ISD::SELECT, MVT::v8i64, Custom);
setOperationAction(ISD::SELECT, MVT::v16f32, Custom);
setOperationAction(ISD::MUL, MVT::v16i32, Legal);
// NonVLX sub-targets extend 128/256 vectors to use the 512 version.
setOperationAction(ISD::ABS, MVT::v4i64, Legal);
setOperationAction(ISD::ABS, MVT::v2i64, Legal);
for (auto VT : { MVT::v8i1, MVT::v16i1 }) {
setOperationAction(ISD::ADD, VT, Custom);
setOperationAction(ISD::SUB, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::TRUNCATE, VT, Custom);
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Expand);
}
for (auto VT : { MVT::v16i32, MVT::v8i64 }) {
setOperationAction(ISD::SMAX, VT, Legal);
setOperationAction(ISD::UMAX, VT, Legal);
setOperationAction(ISD::SMIN, VT, Legal);
setOperationAction(ISD::UMIN, VT, Legal);
setOperationAction(ISD::ABS, VT, Legal);
setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);
setOperationAction(ISD::CTPOP, VT, Custom);
setOperationAction(ISD::CTTZ, VT, Custom);
}
// NonVLX sub-targets extend 128/256 vectors to use the 512 version.
for (auto VT : {MVT::v4i32, MVT::v8i32, MVT::v16i32, MVT::v2i64, MVT::v4i64,
MVT::v8i64}) {
setOperationAction(ISD::ROTL, VT, Custom);
setOperationAction(ISD::ROTR, VT, Custom);
}
// Need to promote to 64-bit even though we have 32-bit masked instructions
// because the IR optimizers rearrange bitcasts around logic ops leaving
// too many variations to handle if we don't promote them.
setOperationPromotedToType(ISD::AND, MVT::v16i32, MVT::v8i64);
setOperationPromotedToType(ISD::OR, MVT::v16i32, MVT::v8i64);
setOperationPromotedToType(ISD::XOR, MVT::v16i32, MVT::v8i64);
if (Subtarget.hasCDI()) {
// NonVLX sub-targets extend 128/256 vectors to use the 512 version.
for (auto VT : {MVT::v4i32, MVT::v8i32, MVT::v16i32, MVT::v2i64,
MVT::v4i64, MVT::v8i64}) {
setOperationAction(ISD::CTLZ, VT, Legal);
setOperationAction(ISD::CTTZ_ZERO_UNDEF, VT, Custom);
}
} // Subtarget.hasCDI()
if (Subtarget.hasDQI()) {
// NonVLX sub-targets extend 128/256 vectors to use the 512 version.
setOperationAction(ISD::MUL, MVT::v2i64, Legal);
setOperationAction(ISD::MUL, MVT::v4i64, Legal);
setOperationAction(ISD::MUL, MVT::v8i64, Legal);
}
if (Subtarget.hasVPOPCNTDQ()) {
// VPOPCNTDQ sub-targets extend 128/256 vectors to use the avx512
// version of popcntd/q.
for (auto VT : {MVT::v16i32, MVT::v8i64, MVT::v8i32, MVT::v4i64,
MVT::v4i32, MVT::v2i64})
setOperationAction(ISD::CTPOP, VT, Legal);
}
// Custom lower several nodes.
for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,
MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64 }) {
setOperationAction(ISD::MGATHER, VT, Custom);
setOperationAction(ISD::MSCATTER, VT, Custom);
}
// Extract subvector is special because the value type
// (result) is 256-bit but the source is 512-bit wide.
// 128-bit was made Custom under AVX1.
for (auto VT : { MVT::v32i8, MVT::v16i16, MVT::v8i32, MVT::v4i64,
MVT::v8f32, MVT::v4f64, MVT::v1i1 })
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
for (auto VT : { MVT::v2i1, MVT::v4i1, MVT::v8i1,
MVT::v16i1, MVT::v32i1, MVT::v64i1 })
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Legal);
for (auto VT : { MVT::v16i32, MVT::v8i64, MVT::v16f32, MVT::v8f64 }) {
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::SCALAR_TO_VECTOR, VT, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, VT, Legal);
setOperationAction(ISD::MLOAD, VT, Legal);
setOperationAction(ISD::MSTORE, VT, Legal);
setOperationAction(ISD::MGATHER, VT, Legal);
setOperationAction(ISD::MSCATTER, VT, Custom);
}
for (auto VT : { MVT::v64i8, MVT::v32i16, MVT::v16i32 }) {
setOperationPromotedToType(ISD::LOAD, VT, MVT::v8i64);
setOperationPromotedToType(ISD::SELECT, VT, MVT::v8i64);
}
} // has AVX-512
if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {
addRegisterClass(MVT::v32i16, &X86::VR512RegClass);
addRegisterClass(MVT::v64i8, &X86::VR512RegClass);
addRegisterClass(MVT::v32i1, &X86::VK32RegClass);
addRegisterClass(MVT::v64i1, &X86::VK64RegClass);
setOperationAction(ISD::ADD, MVT::v32i1, Custom);
setOperationAction(ISD::ADD, MVT::v64i1, Custom);
setOperationAction(ISD::SUB, MVT::v32i1, Custom);
setOperationAction(ISD::SUB, MVT::v64i1, Custom);
setOperationAction(ISD::MUL, MVT::v32i1, Custom);
setOperationAction(ISD::MUL, MVT::v64i1, Custom);
setOperationAction(ISD::SETCC, MVT::v32i1, Custom);
setOperationAction(ISD::SETCC, MVT::v64i1, Custom);
setOperationAction(ISD::MUL, MVT::v32i16, Legal);
setOperationAction(ISD::MUL, MVT::v64i8, Custom);
setOperationAction(ISD::MULHS, MVT::v32i16, Legal);
setOperationAction(ISD::MULHU, MVT::v32i16, Legal);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v32i1, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v64i1, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v32i16, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v64i8, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v32i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v64i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v32i16, Legal);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v64i8, Legal);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v32i16, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v64i8, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v32i1, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v64i1, Custom);
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v32i16, Custom);
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v64i8, Custom);
setOperationAction(ISD::SELECT, MVT::v32i1, Custom);
setOperationAction(ISD::SELECT, MVT::v64i1, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v32i8, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v32i8, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v32i16, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v32i16, Custom);
setOperationAction(ISD::ANY_EXTEND, MVT::v32i16, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v32i16, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v64i8, Custom);
setOperationAction(ISD::SIGN_EXTEND, MVT::v64i8, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v64i8, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v32i1, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v64i1, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v32i16, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v64i8, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v32i1, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v64i1, Custom);
setOperationAction(ISD::TRUNCATE, MVT::v32i8, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v32i1, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, MVT::v64i1, Custom);
setOperationAction(ISD::BUILD_VECTOR, MVT::v32i1, Custom);
setOperationAction(ISD::BUILD_VECTOR, MVT::v64i1, Custom);
setOperationAction(ISD::VSELECT, MVT::v32i1, Expand);
setOperationAction(ISD::VSELECT, MVT::v64i1, Expand);
setOperationAction(ISD::BITREVERSE, MVT::v64i8, Custom);
setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, MVT::v32i16, Custom);
setTruncStoreAction(MVT::v32i16, MVT::v32i8, Legal);
if (Subtarget.hasVLX()) {
setTruncStoreAction(MVT::v16i16, MVT::v16i8, Legal);
setTruncStoreAction(MVT::v8i16, MVT::v8i8, Legal);
}
LegalizeAction Action = Subtarget.hasVLX() ? Legal : Custom;
for (auto VT : { MVT::v32i8, MVT::v16i8, MVT::v16i16, MVT::v8i16 }) {
setOperationAction(ISD::MLOAD, VT, Action);
setOperationAction(ISD::MSTORE, VT, Action);
}
if (Subtarget.hasCDI()) {
setOperationAction(ISD::CTLZ, MVT::v32i16, Custom);
setOperationAction(ISD::CTLZ, MVT::v64i8, Custom);
}
for (auto VT : { MVT::v64i8, MVT::v32i16 }) {
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Custom);
setOperationAction(ISD::ABS, VT, Legal);
setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);
setOperationAction(ISD::MLOAD, VT, Legal);
setOperationAction(ISD::MSTORE, VT, Legal);
setOperationAction(ISD::CTPOP, VT, Custom);
setOperationAction(ISD::CTTZ, VT, Custom);
setOperationAction(ISD::SMAX, VT, Legal);
setOperationAction(ISD::UMAX, VT, Legal);
setOperationAction(ISD::SMIN, VT, Legal);
setOperationAction(ISD::UMIN, VT, Legal);
setOperationPromotedToType(ISD::AND, VT, MVT::v8i64);
setOperationPromotedToType(ISD::OR, VT, MVT::v8i64);
setOperationPromotedToType(ISD::XOR, VT, MVT::v8i64);
}
for (auto ExtType : {ISD::ZEXTLOAD, ISD::SEXTLOAD, ISD::EXTLOAD}) {
setLoadExtAction(ExtType, MVT::v32i16, MVT::v32i8, Legal);
if (Subtarget.hasVLX()) {
// FIXME: These extending loads are also available on SSE/AVX2; add relevant patterns.
setLoadExtAction(ExtType, MVT::v16i16, MVT::v16i8, Legal);
setLoadExtAction(ExtType, MVT::v8i16, MVT::v8i8, Legal);
}
}
}
if (!Subtarget.useSoftFloat() && Subtarget.hasVLX()) {
addRegisterClass(MVT::v4i1, &X86::VK4RegClass);
addRegisterClass(MVT::v2i1, &X86::VK2RegClass);
for (auto VT : { MVT::v2i1, MVT::v4i1 }) {
setOperationAction(ISD::ADD, VT, Custom);
setOperationAction(ISD::SUB, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Expand);
setOperationAction(ISD::TRUNCATE, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
}
setOperationAction(ISD::CONCAT_VECTORS, MVT::v8i1, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v4i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4i1, Custom);
for (auto VT : { MVT::v2i64, MVT::v4i64 }) {
setOperationAction(ISD::SMAX, VT, Legal);
setOperationAction(ISD::UMAX, VT, Legal);
setOperationAction(ISD::SMIN, VT, Legal);
setOperationAction(ISD::UMIN, VT, Legal);
}
}
// We want to custom lower some of our intrinsics.
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::Other, Custom);
setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::Other, Custom);
setOperationAction(ISD::INTRINSIC_VOID, MVT::Other, Custom);
if (!Subtarget.is64Bit()) {
setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::i64, Custom);
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i64, Custom);
}
// Only custom-lower 64-bit SADDO and friends on 64-bit because we don't
// handle type legalization for these operations here.
//
// FIXME: We really should do custom legalization for addition and
// subtraction on x86-32 once PR3203 is fixed. We really can't do much better
// than generic legalization for 64-bit multiplication-with-overflow, though.
for (auto VT : { MVT::i8, MVT::i16, MVT::i32, MVT::i64 }) {
if (VT == MVT::i64 && !Subtarget.is64Bit())
continue;
// Add/Sub/Mul with overflow operations are custom lowered.
setOperationAction(ISD::SADDO, VT, Custom);
setOperationAction(ISD::UADDO, VT, Custom);
setOperationAction(ISD::SSUBO, VT, Custom);
setOperationAction(ISD::USUBO, VT, Custom);
setOperationAction(ISD::SMULO, VT, Custom);
setOperationAction(ISD::UMULO, VT, Custom);
// Support carry-in as a value rather than glue.
setOperationAction(ISD::ADDCARRY, VT, Custom);
setOperationAction(ISD::SUBCARRY, VT, Custom);
setOperationAction(ISD::SETCCCARRY, VT, Custom);
}
if (!Subtarget.is64Bit()) {
// These libcalls are not available in 32-bit.
setLibcallName(RTLIB::SHL_I128, nullptr);
setLibcallName(RTLIB::SRL_I128, nullptr);
setLibcallName(RTLIB::SRA_I128, nullptr);
}
// Combine sin / cos into one node or libcall if possible.
if (Subtarget.hasSinCos()) {
setLibcallName(RTLIB::SINCOS_F32, "sincosf");
setLibcallName(RTLIB::SINCOS_F64, "sincos");
if (Subtarget.isTargetDarwin()) {
// For MacOSX, we don't want the normal expansion of a libcall to sincos.
// We want to issue a libcall to __sincos_stret to avoid memory traffic.
setOperationAction(ISD::FSINCOS, MVT::f64, Custom);
setOperationAction(ISD::FSINCOS, MVT::f32, Custom);
}
}
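// Illustrative note (not from the original source): with this in place, a
// pair of calls 'sinf(x)' and 'cosf(x)' on the same operand can be merged
// into a single 'sincosf' libcall, and on Darwin into a '__sincos_stret'
// call that returns both results in registers.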
if (Subtarget.isTargetWin64()) {
setOperationAction(ISD::SDIV, MVT::i128, Custom);
setOperationAction(ISD::UDIV, MVT::i128, Custom);
setOperationAction(ISD::SREM, MVT::i128, Custom);
setOperationAction(ISD::UREM, MVT::i128, Custom);
setOperationAction(ISD::SDIVREM, MVT::i128, Custom);
setOperationAction(ISD::UDIVREM, MVT::i128, Custom);
}
// On 32-bit MSVC, `fmodf(f32)` is not defined - only `fmod(f64)`
// is. We should promote the value to 64 bits to solve this.
// This is what the CRT headers do - `fmodf` is an inline header
// function casting to f64 and calling `fmod`.
if (Subtarget.is32Bit() && (Subtarget.isTargetKnownWindowsMSVC() ||
Subtarget.isTargetWindowsItanium()))
for (ISD::NodeType Op :
{ISD::FCEIL, ISD::FCOS, ISD::FEXP, ISD::FFLOOR, ISD::FREM, ISD::FLOG,
ISD::FLOG10, ISD::FPOW, ISD::FSIN})
if (isOperationExpand(Op, MVT::f32))
setOperationAction(Op, MVT::f32, Promote);
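// Illustrative sketch (an assumption about how the Promote action expands
// here, not from the original source): a 32-bit 'frem' on f32 would become
// roughly
//   %xd = fpext float %x to double
//   %yd = fpext float %y to double
//   %rd = frem double %xd, %yd      ; lowers to a call to 'fmod'
//   %r  = fptrunc double %rd to float
// mirroring what the CRT's inline 'fmodf' wrapper does.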
// We have target-specific dag combine patterns for the following nodes:
setTargetDAGCombine(ISD::VECTOR_SHUFFLE);
setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
setTargetDAGCombine(ISD::INSERT_SUBVECTOR);
setTargetDAGCombine(ISD::BITCAST);
setTargetDAGCombine(ISD::VSELECT);
setTargetDAGCombine(ISD::SELECT);
setTargetDAGCombine(ISD::SHL);
setTargetDAGCombine(ISD::SRA);
setTargetDAGCombine(ISD::SRL);
setTargetDAGCombine(ISD::OR);
setTargetDAGCombine(ISD::AND);
setTargetDAGCombine(ISD::ADD);
setTargetDAGCombine(ISD::FADD);
setTargetDAGCombine(ISD::FSUB);
setTargetDAGCombine(ISD::FNEG);
setTargetDAGCombine(ISD::FMA);
setTargetDAGCombine(ISD::FMINNUM);
setTargetDAGCombine(ISD::FMAXNUM);
setTargetDAGCombine(ISD::SUB);
setTargetDAGCombine(ISD::LOAD);
setTargetDAGCombine(ISD::MLOAD);
setTargetDAGCombine(ISD::STORE);
setTargetDAGCombine(ISD::MSTORE);
setTargetDAGCombine(ISD::TRUNCATE);
setTargetDAGCombine(ISD::ZERO_EXTEND);
setTargetDAGCombine(ISD::ANY_EXTEND);
setTargetDAGCombine(ISD::SIGN_EXTEND);
setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);
setTargetDAGCombine(ISD::SIGN_EXTEND_VECTOR_INREG);
setTargetDAGCombine(ISD::ZERO_EXTEND_VECTOR_INREG);
setTargetDAGCombine(ISD::SINT_TO_FP);
setTargetDAGCombine(ISD::UINT_TO_FP);
setTargetDAGCombine(ISD::SETCC);
setTargetDAGCombine(ISD::MUL);
setTargetDAGCombine(ISD::XOR);
setTargetDAGCombine(ISD::MSCATTER);
setTargetDAGCombine(ISD::MGATHER);
computeRegisterProperties(Subtarget.getRegisterInfo());
MaxStoresPerMemset = 16; // For @llvm.memset -> sequence of stores
MaxStoresPerMemsetOptSize = 8;
MaxStoresPerMemcpy = 8; // For @llvm.memcpy -> sequence of stores
MaxStoresPerMemcpyOptSize = 4;
MaxStoresPerMemmove = 8; // For @llvm.memmove -> sequence of stores
MaxStoresPerMemmoveOptSize = 4;
// TODO: These control memcmp expansion in CGP and could be raised higher, but
// that needs to be benchmarked and balanced with the potential use of vector
// load/store types (PR33329, PR33914).
MaxLoadsPerMemcmp = 2;
MaxLoadsPerMemcmpOptSize = 2;
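// Illustrative note (an assumption about the expansion, not from the
// original source): with a limit of 2, a call like memcmp(a, b, 16) on
// x86-64 can be expanded inline into two pairs of 8-byte loads and
// compares instead of a libcall.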
// Set loop alignment to 2^ExperimentalPrefLoopAlignment bytes (default: 2^4).
setPrefLoopAlignment(ExperimentalPrefLoopAlignment);
// An out-of-order CPU can speculatively execute past a predictable branch,
// but a conditional move could be stalled by an expensive earlier operation.
PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder();
EnableExtLdPromotion = true;
setPrefFunctionAlignment(4); // 2^4 bytes.
verifyIntrinsicTables();
}
// This has so far only been implemented for 64-bit MachO.
bool X86TargetLowering::useLoadStackGuardNode() const {
return Subtarget.isTargetMachO() && Subtarget.is64Bit();
}
TargetLoweringBase::LegalizeTypeAction
X86TargetLowering::getPreferredVectorAction(EVT VT) const {
if (ExperimentalVectorWideningLegalization &&
VT.getVectorNumElements() != 1 &&
VT.getVectorElementType().getSimpleVT() != MVT::i1)
return TypeWidenVector;
return TargetLoweringBase::getPreferredVectorAction(VT);
}
EVT X86TargetLowering::getSetCCResultType(const DataLayout &DL,
LLVMContext& Context,
EVT VT) const {
if (!VT.isVector())
return MVT::i8;
if (VT.isSimple()) {
MVT VVT = VT.getSimpleVT();
const unsigned NumElts = VVT.getVectorNumElements();
MVT EltVT = VVT.getVectorElementType();
if (VVT.is512BitVector()) {
if (Subtarget.hasAVX512())
if (EltVT == MVT::i32 || EltVT == MVT::i64 ||
EltVT == MVT::f32 || EltVT == MVT::f64)
switch(NumElts) {
case 8: return MVT::v8i1;
case 16: return MVT::v16i1;
}
if (Subtarget.hasBWI())
if (EltVT == MVT::i8 || EltVT == MVT::i16)
switch(NumElts) {
case 32: return MVT::v32i1;
case 64: return MVT::v64i1;
}
}
if (Subtarget.hasBWI() && Subtarget.hasVLX())
return MVT::getVectorVT(MVT::i1, NumElts);
if (!isTypeLegal(VT) && getTypeAction(Context, VT) == TypePromoteInteger) {
EVT LegalVT = getTypeToTransformTo(Context, VT);
EltVT = LegalVT.getVectorElementType().getSimpleVT();
}
if (Subtarget.hasVLX() && EltVT.getSizeInBits() >= 32)
switch(NumElts) {
case 2: return MVT::v2i1;
case 4: return MVT::v4i1;
case 8: return MVT::v8i1;
}
}
return VT.changeVectorElementTypeToInteger();
}
/// Helper for getByValTypeAlignment to determine
/// the desired ByVal argument alignment.
static void getMaxByValAlign(Type *Ty, unsigned &MaxAlign) {
if (MaxAlign == 16)
return;
if (VectorType *VTy = dyn_cast<VectorType>(Ty)) {
if (VTy->getBitWidth() == 128)
MaxAlign = 16;
} else if (ArrayType *ATy = dyn_cast<ArrayType>(Ty)) {
unsigned EltAlign = 0;
getMaxByValAlign(ATy->getElementType(), EltAlign);
if (EltAlign > MaxAlign)
MaxAlign = EltAlign;
} else if (StructType *STy = dyn_cast<StructType>(Ty)) {
for (auto *EltTy : STy->elements()) {
unsigned EltAlign = 0;
getMaxByValAlign(EltTy, EltAlign);
if (EltAlign > MaxAlign)
MaxAlign = EltAlign;
if (MaxAlign == 16)
break;
}
}
}
/// Return the desired alignment for ByVal aggregate
/// function arguments in the caller parameter area. For X86, aggregates
/// that contain SSE vectors are placed at 16-byte boundaries while the rest
/// are at 4-byte boundaries.
unsigned X86TargetLowering::getByValTypeAlignment(Type *Ty,
const DataLayout &DL) const {
if (Subtarget.is64Bit()) {
// Max of 8 and alignment of type.
unsigned TyAlign = DL.getABITypeAlignment(Ty);
if (TyAlign > 8)
return TyAlign;
return 8;
}
unsigned Align = 4;
if (Subtarget.hasSSE1())
getMaxByValAlign(Ty, Align);
return Align;
}
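// Illustrative note (not from the original source): on 32-bit x86 with SSE1,
// a struct containing a 128-bit vector member gets 16-byte byval alignment
// from the logic above, while a struct of plain scalars stays at the default
// 4-byte boundary.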
/// Returns the target specific optimal type for load
/// and store operations as a result of memset, memcpy, and memmove
/// lowering. If DstAlign is zero, it is safe to assume the destination
/// alignment can satisfy any constraint. Similarly, if SrcAlign is zero
/// there is no need to check it against an alignment requirement,
/// probably because the source does not need to be loaded. If 'IsMemset' is
/// true, that means it's expanding a memset. If 'ZeroMemset' is true, that
/// means it's a memset of zero. 'MemcpyStrSrc' indicates whether the memcpy
/// source is constant so it does not need to be loaded.
/// It returns EVT::Other if the type should be determined using generic
/// target-independent logic.
EVT
X86TargetLowering::getOptimalMemOpType(uint64_t Size,
unsigned DstAlign, unsigned SrcAlign,
bool IsMemset, bool ZeroMemset,
bool MemcpyStrSrc,
MachineFunction &MF) const {
const Function *F = MF.getFunction();
if (!F->hasFnAttribute(Attribute::NoImplicitFloat)) {
if (Size >= 16 &&
(!Subtarget.isUnalignedMem16Slow() ||
((DstAlign == 0 || DstAlign >= 16) &&
(SrcAlign == 0 || SrcAlign >= 16)))) {
// FIXME: Check if unaligned 32-byte accesses are slow.
if (Size >= 32 && Subtarget.hasAVX()) {
// Although this isn't a well-supported type for AVX1, we'll let
// legalization and shuffle lowering produce the optimal codegen. If we
// choose an optimal type with a vector element larger than a byte,
// getMemsetStores() may create an intermediate splat (using an integer
// multiply) before we splat as a vector.
return MVT::v32i8;
}
if (Subtarget.hasSSE2())
return MVT::v16i8;
// TODO: Can SSE1 handle a byte vector?
if (Subtarget.hasSSE1())
return MVT::v4f32;
} else if ((!IsMemset || ZeroMemset) && !MemcpyStrSrc && Size >= 8 &&
!Subtarget.is64Bit() && Subtarget.hasSSE2()) {
// Do not use f64 to lower memcpy if source is string constant. It's
// better to use i32 to avoid the loads.
// Also, do not use f64 to lower memset unless this is a memset of zeros.
// The gymnastics of splatting a byte value into an XMM register and then
// only using 8-byte stores (because this is a CPU with slow unaligned
// 16-byte accesses) makes that a loser.
return MVT::f64;
}
}
// This is a compromise. If we reach here, unaligned accesses may be slow on
// this target. However, creating smaller, aligned accesses could be even
// slower and would certainly be a lot more code.
if (Subtarget.is64Bit() && Size >= 8)
return MVT::i64;
return MVT::i32;
}
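// Illustrative note (not from the original source): for a 32-byte memset on
// an AVX target where unaligned accesses are not slow, the function above
// returns MVT::v32i8, so the memset can be emitted as a single 32-byte
// vector store rather than a sequence of scalar stores.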
bool X86TargetLowering::isSafeMemOpType(MVT VT) const {
if (VT == MVT::f32)
return X86ScalarSSEf32;
else if (VT == MVT::f64)
return X86ScalarSSEf64;
return true;
}
bool
X86TargetLowering::allowsMisalignedMemoryAccesses(EVT VT,
unsigned,
unsigned,
bool *Fast) const {
if (Fast) {
switch (VT.getSizeInBits()) {
default:
// 8-byte and under are always assumed to be fast.
*Fast = true;
break;
case 128:
*Fast = !Subtarget.isUnalignedMem16Slow();
break;
case 256:
*Fast = !Subtarget.isUnalignedMem32Slow();
break;
// TODO: What about AVX-512 (512-bit) accesses?
}
}
// Misaligned accesses of any size are always allowed.
return true;
}
/// Return the entry encoding for a jump table in the
/// current function. The returned value is a member of the
/// MachineJumpTableInfo::JTEntryKind enum.
unsigned X86TargetLowering::getJumpTableEncoding() const {
// In GOT pic mode, each entry in the jump table is emitted as a @GOTOFF
// symbol.
if (isPositionIndependent() && Subtarget.isPICStyleGOT())
return MachineJumpTableInfo::EK_Custom32;
// Otherwise, use the normal jump table encoding heuristics.
return TargetLowering::getJumpTableEncoding();
}
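// Illustrative note (not from the original source): with EK_Custom32 and the
// @GOTOFF lowering below, an entry for a (hypothetical) block label .LBB0_2
// would be emitted as
//   .long .LBB0_2@GOTOFF
// i.e. a 32-bit offset from the GOT base rather than an absolute address.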
bool X86TargetLowering::useSoftFloat() const {
return Subtarget.useSoftFloat();
}
void X86TargetLowering::markLibCallAttributes(MachineFunction *MF, unsigned CC,
ArgListTy &Args) const {
// Only relabel X86-32 for C / Stdcall CCs.
if (Subtarget.is64Bit())
return;
if (CC != CallingConv::C && CC != CallingConv::X86_StdCall)
return;
unsigned ParamRegs = 0;
if (auto *M = MF->getFunction()->getParent())
ParamRegs = M->getNumberRegisterParameters();
// Mark the first N integer arguments as being passed in registers.
for (unsigned Idx = 0; Idx < Args.size(); Idx++) {
Type *T = Args[Idx].Ty;
if (T->isPointerTy() || T->isIntegerTy())
if (MF->getDataLayout().getTypeAllocSize(T) <= 8) {
unsigned numRegs = 1;
if (MF->getDataLayout().getTypeAllocSize(T) > 4)
numRegs = 2;
if (ParamRegs < numRegs)
return;
ParamRegs -= numRegs;
Args[Idx].IsInReg = true;
}
}
}
const MCExpr *
X86TargetLowering::LowerCustomJumpTableEntry(const MachineJumpTableInfo *MJTI,
const MachineBasicBlock *MBB,
unsigned uid, MCContext &Ctx) const {
assert(isPositionIndependent() && Subtarget.isPICStyleGOT());
// In 32-bit ELF systems, our jump table entries are formed with @GOTOFF
// entries.
return MCSymbolRefExpr::create(MBB->getSymbol(),
MCSymbolRefExpr::VK_GOTOFF, Ctx);
}
/// Returns relocation base for the given PIC jumptable.
SDValue X86TargetLowering::getPICJumpTableRelocBase(SDValue Table,
SelectionDAG &DAG) const {
if (!Subtarget.is64Bit())
// This doesn't have SDLoc associated with it, but is not really the
// same as a Register.
return DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(),
getPointerTy(DAG.getDataLayout()));
return Table;
}
/// This returns the relocation base for the given PIC jumptable,
/// the same as getPICJumpTableRelocBase, but as an MCExpr.
const MCExpr *X86TargetLowering::
getPICJumpTableRelocBaseExpr(const MachineFunction *MF, unsigned JTI,
MCContext &Ctx) const {
// X86-64 uses RIP relative addressing based on the jump table label.
if (Subtarget.isPICStyleRIPRel())
return TargetLowering::getPICJumpTableRelocBaseExpr(MF, JTI, Ctx);
// Otherwise, the reference is relative to the PIC base.
return MCSymbolRefExpr::create(MF->getPICBaseSymbol(), Ctx);
}
std::pair<const TargetRegisterClass *, uint8_t>
X86TargetLowering::findRepresentativeClass(const TargetRegisterInfo *TRI,
MVT VT) const {
const TargetRegisterClass *RRC = nullptr;
uint8_t Cost = 1;
switch (VT.SimpleTy) {
default:
return TargetLowering::findRepresentativeClass(TRI, VT);
case MVT::i8: case MVT::i16: case MVT::i32: case MVT::i64:
RRC = Subtarget.is64Bit() ? &X86::GR64RegClass : &X86::GR32RegClass;
break;
case MVT::x86mmx:
RRC = &X86::VR64RegClass;
break;
case MVT::f32: case MVT::f64:
case MVT::v16i8: case MVT::v8i16: case MVT::v4i32: case MVT::v2i64:
case MVT::v4f32: case MVT::v2f64:
case MVT::v32i8: case MVT::v16i16: case MVT::v8i32: case MVT::v4i64:
case MVT::v8f32: case MVT::v4f64:
case MVT::v64i8: case MVT::v32i16: case MVT::v16i32: case MVT::v8i64:
case MVT::v16f32: case MVT::v8f64:
RRC = &X86::VR128XRegClass;
break;
}
return std::make_pair(RRC, Cost);
}
unsigned X86TargetLowering::getAddressSpace() const {
if (Subtarget.is64Bit())
return (getTargetMachine().getCodeModel() == CodeModel::Kernel) ? 256 : 257;
return 256;
}
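// Illustrative note (not from the original source): on x86, address space
// 256 denotes %gs-relative and 257 denotes %fs-relative addressing, so the
// value returned above selects which segment register the stack guard and
// safe-stack loads below go through (e.g. a load of %fs:0x28 on 64-bit
// Linux).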
static bool hasStackGuardSlotTLS(const Triple &TargetTriple) {
return TargetTriple.isOSGlibc() || TargetTriple.isOSFuchsia() ||
(TargetTriple.isAndroid() && !TargetTriple.isAndroidVersionLT(17));
}
static Constant* SegmentOffset(IRBuilder<> &IRB,
unsigned Offset, unsigned AddressSpace) {
return ConstantExpr::getIntToPtr(
ConstantInt::get(Type::getInt32Ty(IRB.getContext()), Offset),
Type::getInt8PtrTy(IRB.getContext())->getPointerTo(AddressSpace));
}
Value *X86TargetLowering::getIRStackGuard(IRBuilder<> &IRB) const {
// glibc, bionic, and Fuchsia have a special slot for the stack guard in
// tcbhead_t; use it instead of the usual global variable (see
// sysdeps/{i386,x86_64}/nptl/tls.h)
if (hasStackGuardSlotTLS(Subtarget.getTargetTriple())) {
if (Subtarget.isTargetFuchsia()) {
// <magenta/tls.h> defines MX_TLS_STACK_GUARD_OFFSET with this value.
return SegmentOffset(IRB, 0x10, getAddressSpace());
} else {
// %fs:0x28, unless we're using a Kernel code model, in which case
// it's %gs:0x28. gs:0x14 on i386.
unsigned Offset = (Subtarget.is64Bit()) ? 0x28 : 0x14;
return SegmentOffset(IRB, Offset, getAddressSpace());
}
}
return TargetLowering::getIRStackGuard(IRB);
}
void X86TargetLowering::insertSSPDeclarations(Module &M) const {
// The MSVC CRT provides functionality for stack protection.
if (Subtarget.getTargetTriple().isOSMSVCRT()) {
// MSVC CRT has a global variable holding security cookie.
M.getOrInsertGlobal("__security_cookie",
Type::getInt8PtrTy(M.getContext()));
// MSVC CRT has a function to validate security cookie.
auto *SecurityCheckCookie = cast<Function>(
M.getOrInsertFunction("__security_check_cookie",
Type::getVoidTy(M.getContext()),
Type::getInt8PtrTy(M.getContext())));
SecurityCheckCookie->setCallingConv(CallingConv::X86_FastCall);
SecurityCheckCookie->addAttribute(1, Attribute::AttrKind::InReg);
return;
}
// glibc, bionic, and Fuchsia have a special slot for the stack guard.
if (hasStackGuardSlotTLS(Subtarget.getTargetTriple()))
return;
TargetLowering::insertSSPDeclarations(M);
}
Value *X86TargetLowering::getSDagStackGuard(const Module &M) const {
// MSVC CRT has a global variable holding security cookie.
if (Subtarget.getTargetTriple().isOSMSVCRT())
return M.getGlobalVariable("__security_cookie");
return TargetLowering::getSDagStackGuard(M);
}
Value *X86TargetLowering::getSSPStackGuardCheck(const Module &M) const {
// MSVC CRT has a function to validate security cookie.
if (Subtarget.getTargetTriple().isOSMSVCRT())
return M.getFunction("__security_check_cookie");
return TargetLowering::getSSPStackGuardCheck(M);
}
Value *X86TargetLowering::getSafeStackPointerLocation(IRBuilder<> &IRB) const {
if (Subtarget.getTargetTriple().isOSContiki())
return getDefaultSafeStackPointerLocation(IRB, false);
// Android provides a fixed TLS slot for the SafeStack pointer. See the
// definition of TLS_SLOT_SAFESTACK in
// https://android.googlesource.com/platform/bionic/+/master/libc/private/bionic_tls.h
if (Subtarget.isTargetAndroid()) {
// %fs:0x48, unless we're using a Kernel code model, in which case it's
// %gs:0x48; %gs:0x24 on i386.
unsigned Offset = (Subtarget.is64Bit()) ? 0x48 : 0x24;
return SegmentOffset(IRB, Offset, getAddressSpace());
}
// Fuchsia is similar.
if (Subtarget.isTargetFuchsia()) {
// <magenta/tls.h> defines MX_TLS_UNSAFE_SP_OFFSET with this value.
return SegmentOffset(IRB, 0x18, getAddressSpace());
}
return TargetLowering::getSafeStackPointerLocation(IRB);
}
bool X86TargetLowering::isNoopAddrSpaceCast(unsigned SrcAS,
unsigned DestAS) const {
assert(SrcAS != DestAS && "Expected different address spaces!");
return SrcAS < 256 && DestAS < 256;
}
//===----------------------------------------------------------------------===//
// Return Value Calling Convention Implementation
//===----------------------------------------------------------------------===//
#include "X86GenCallingConv.inc"
bool X86TargetLowering::CanLowerReturn(
CallingConv::ID CallConv, MachineFunction &MF, bool isVarArg,
const SmallVectorImpl<ISD::OutputArg> &Outs, LLVMContext &Context) const {
SmallVector<CCValAssign, 16> RVLocs;
CCState CCInfo(CallConv, isVarArg, MF, RVLocs, Context);
return CCInfo.CheckReturn(Outs, RetCC_X86);
}
const MCPhysReg *X86TargetLowering::getScratchRegisters(CallingConv::ID) const {
static const MCPhysReg ScratchRegs[] = { X86::R11, 0 };
return ScratchRegs;
}
/// Lowers mask values (v*i1) to the local register values.
/// \returns DAG node after lowering to register type
static SDValue lowerMasksToReg(const SDValue &ValArg, const EVT &ValLoc,
const SDLoc &Dl, SelectionDAG &DAG) {
EVT ValVT = ValArg.getValueType();
if ((ValVT == MVT::v8i1 && (ValLoc == MVT::i8 || ValLoc == MVT::i32)) ||
(ValVT == MVT::v16i1 && (ValLoc == MVT::i16 || ValLoc == MVT::i32))) {
// Two stage lowering might be required
// bitcast: v8i1 -> i8 / v16i1 -> i16
// anyextend: i8 -> i32 / i16 -> i32
EVT TempValLoc = ValVT == MVT::v8i1 ? MVT::i8 : MVT::i16;
SDValue ValToCopy = DAG.getBitcast(TempValLoc, ValArg);
if (ValLoc == MVT::i32)
ValToCopy = DAG.getNode(ISD::ANY_EXTEND, Dl, ValLoc, ValToCopy);
return ValToCopy;
} else if ((ValVT == MVT::v32i1 && ValLoc == MVT::i32) ||
(ValVT == MVT::v64i1 && ValLoc == MVT::i64)) {
// One stage lowering is required
// bitcast: v32i1 -> i32 / v64i1 -> i64
return DAG.getBitcast(ValLoc, ValArg);
} else
return DAG.getNode(ISD::SIGN_EXTEND, Dl, ValLoc, ValArg);
}
/// Breaks a v64i1 value into two registers and adds the new node to the DAG.
static void Passv64i1ArgInRegs(
const SDLoc &Dl, SelectionDAG &DAG, SDValue Chain, SDValue &Arg,
SmallVector<std::pair<unsigned, SDValue>, 8> &RegsToPass, CCValAssign &VA,
CCValAssign &NextVA, const X86Subtarget &Subtarget) {
assert((Subtarget.hasBWI() || Subtarget.hasBMI()) &&
"Expected AVX512BW or AVX512BMI target!");
assert(Subtarget.is32Bit() && "Expecting 32 bit target");
assert(Arg.getValueType() == MVT::i64 && "Expecting 64 bit value");
assert(VA.isRegLoc() && NextVA.isRegLoc() &&
"The value should reside in two registers");
// Before splitting the value we cast it to i64
Arg = DAG.getBitcast(MVT::i64, Arg);
// Splitting the value into two i32 types
SDValue Lo, Hi;
Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, Dl, MVT::i32, Arg,
DAG.getConstant(0, Dl, MVT::i32));
Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, Dl, MVT::i32, Arg,
DAG.getConstant(1, Dl, MVT::i32));
// Attach the two i32 types into corresponding registers
RegsToPass.push_back(std::make_pair(VA.getLocReg(), Lo));
RegsToPass.push_back(std::make_pair(NextVA.getLocReg(), Hi));
}
SDValue
X86TargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CallConv,
bool isVarArg,
const SmallVectorImpl<ISD::OutputArg> &Outs,
const SmallVectorImpl<SDValue> &OutVals,
const SDLoc &dl, SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
// In some cases we need to disable registers from the default CSR list.
// For example, when they are used for argument passing.
bool ShouldDisableCalleeSavedRegister =
CallConv == CallingConv::X86_RegCall ||
MF.getFunction()->hasFnAttribute("no_caller_saved_registers");
if (CallConv == CallingConv::X86_INTR && !Outs.empty())
report_fatal_error("X86 interrupts may not return any value");
SmallVector<CCValAssign, 16> RVLocs;
CCState CCInfo(CallConv, isVarArg, MF, RVLocs, *DAG.getContext());
CCInfo.AnalyzeReturn(Outs, RetCC_X86);
SDValue Flag;
SmallVector<SDValue, 6> RetOps;
RetOps.push_back(Chain); // Operand #0 = Chain (updated below)
// Operand #1 = Bytes To Pop
RetOps.push_back(DAG.getTargetConstant(FuncInfo->getBytesToPopOnReturn(), dl,
MVT::i32));
// Copy the result values into the output registers.
for (unsigned I = 0, OutsIndex = 0, E = RVLocs.size(); I != E;
++I, ++OutsIndex) {
CCValAssign &VA = RVLocs[I];
assert(VA.isRegLoc() && "Can only return in registers!");
// Add the register to the CalleeSaveDisableRegs list.
if (ShouldDisableCalleeSavedRegister)
MF.getRegInfo().disableCalleeSavedRegister(VA.getLocReg());
SDValue ValToCopy = OutVals[OutsIndex];
EVT ValVT = ValToCopy.getValueType();
// Promote values to the appropriate types.
if (VA.getLocInfo() == CCValAssign::SExt)
ValToCopy = DAG.getNode(ISD::SIGN_EXTEND, dl, VA.getLocVT(), ValToCopy);
else if (VA.getLocInfo() == CCValAssign::ZExt)
ValToCopy = DAG.getNode(ISD::ZERO_EXTEND, dl, VA.getLocVT(), ValToCopy);
else if (VA.getLocInfo() == CCValAssign::AExt) {
if (ValVT.isVector() && ValVT.getVectorElementType() == MVT::i1)
ValToCopy = lowerMasksToReg(ValToCopy, VA.getLocVT(), dl, DAG);
else
ValToCopy = DAG.getNode(ISD::ANY_EXTEND, dl, VA.getLocVT(), ValToCopy);
}
else if (VA.getLocInfo() == CCValAssign::BCvt)
ValToCopy = DAG.getBitcast(VA.getLocVT(), ValToCopy);
assert(VA.getLocInfo() != CCValAssign::FPExt &&
"Unexpected FP-extend for return value.");
// If this is x86-64, and we disabled SSE, we can't return FP values,
// or SSE or MMX vectors.
if ((ValVT == MVT::f32 || ValVT == MVT::f64 ||
VA.getLocReg() == X86::XMM0 || VA.getLocReg() == X86::XMM1) &&
(Subtarget.is64Bit() && !Subtarget.hasSSE1())) {
errorUnsupported(DAG, dl, "SSE register return with SSE disabled");
VA.convertToReg(X86::FP0); // Set reg to FP0, avoid hitting asserts.
} else if (ValVT == MVT::f64 &&
(Subtarget.is64Bit() && !Subtarget.hasSSE2())) {
// Likewise we can't return F64 values with SSE1 only. gcc does so, but
// llvm-gcc has never done it right and no one has noticed, so this
// should be OK for now.
errorUnsupported(DAG, dl, "SSE2 register return with SSE2 disabled");
VA.convertToReg(X86::FP0); // Set reg to FP0, avoid hitting asserts.
}
// Returns in ST0/ST1 are handled specially: these are pushed as operands to
// the RET instruction and handled by the FP Stackifier.
if (VA.getLocReg() == X86::FP0 ||
VA.getLocReg() == X86::FP1) {
// If this is a copy from an xmm register to ST(0), use an FPExtend to
// change the value to the FP stack register class.
if (isScalarFPTypeInSSEReg(VA.getValVT()))
ValToCopy = DAG.getNode(ISD::FP_EXTEND, dl, MVT::f80, ValToCopy);
RetOps.push_back(ValToCopy);
// Don't emit a copytoreg.
continue;
}
// 64-bit vector (MMX) values are returned in XMM0 / XMM1 except for v1i64
// which is returned in RAX / RDX.
if (Subtarget.is64Bit()) {
if (ValVT == MVT::x86mmx) {
if (VA.getLocReg() == X86::XMM0 || VA.getLocReg() == X86::XMM1) {
ValToCopy = DAG.getBitcast(MVT::i64, ValToCopy);
ValToCopy = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2i64,
ValToCopy);
// If we don't have SSE2 available, convert to v4f32 so the generated
// register is legal.
if (!Subtarget.hasSSE2())
ValToCopy = DAG.getBitcast(MVT::v4f32, ValToCopy);
}
}
}
SmallVector<std::pair<unsigned, SDValue>, 8> RegsToPass;
if (VA.needsCustom()) {
assert(VA.getValVT() == MVT::v64i1 &&
"Currently the only custom case is when we split v64i1 to 2 regs");
Passv64i1ArgInRegs(dl, DAG, Chain, ValToCopy, RegsToPass, VA, RVLocs[++I],
Subtarget);
assert(2 == RegsToPass.size() &&
"Expecting two registers after Passv64i1ArgInRegs");
// Add the second register to the CalleeSaveDisableRegs list.
if (ShouldDisableCalleeSavedRegister)
MF.getRegInfo().disableCalleeSavedRegister(RVLocs[I].getLocReg());
} else {
RegsToPass.push_back(std::make_pair(VA.getLocReg(), ValToCopy));
}
// Add nodes to the DAG and add the values into the RetOps list
for (auto &Reg : RegsToPass) {
Chain = DAG.getCopyToReg(Chain, dl, Reg.first, Reg.second, Flag);
Flag = Chain.getValue(1);
RetOps.push_back(DAG.getRegister(Reg.first, Reg.second.getValueType()));
}
}
// The Swift calling convention does not require us to copy the sret argument
// into %rax/%eax for the return, and SRetReturnReg is not set for Swift.
// All x86 ABIs require that for returning structs by value we copy
// the sret argument into %rax/%eax (depending on ABI) for the return.
// We saved the argument into a virtual register in the entry block,
// so now we copy the value out and into %rax/%eax.
//
// Checking Function.hasStructRetAttr() here is insufficient because the IR
// may not have an explicit sret argument. If FuncInfo.CanLowerReturn is
// false, then an sret argument may be implicitly inserted in the SelDAG. In
// either case FuncInfo->setSRetReturnReg() will have been called.
if (unsigned SRetReg = FuncInfo->getSRetReturnReg()) {
// When we have both sret and another return value, we should use the
// original Chain stored in RetOps[0], instead of the current Chain updated
// in the above loop. If we only have sret, RetOps[0] equals Chain.
// For the case of sret and another return value, we have
// Chain_0 at the function entry
// Chain_1 = getCopyToReg(Chain_0) in the above loop
// If we use Chain_1 in getCopyFromReg, we will have
// Val = getCopyFromReg(Chain_1)
// Chain_2 = getCopyToReg(Chain_1, Val) from below
// getCopyToReg(Chain_0) will be glued together with
// getCopyToReg(Chain_1, Val) into Unit A, getCopyFromReg(Chain_1) will be
// in Unit B, and we will have cyclic dependency between Unit A and Unit B:
// Data dependency from Unit B to Unit A due to usage of Val in
// getCopyToReg(Chain_1, Val)
// Chain dependency from Unit A to Unit B
// So here, we use RetOps[0] (i.e Chain_0) for getCopyFromReg.
SDValue Val = DAG.getCopyFromReg(RetOps[0], dl, SRetReg,
getPointerTy(MF.getDataLayout()));
unsigned RetValReg
= (Subtarget.is64Bit() && !Subtarget.isTarget64BitILP32()) ?
X86::RAX : X86::EAX;
Chain = DAG.getCopyToReg(Chain, dl, RetValReg, Val, Flag);
Flag = Chain.getValue(1);
// RAX/EAX now acts like a return value.
RetOps.push_back(
DAG.getRegister(RetValReg, getPointerTy(DAG.getDataLayout())));
// Add the returned register to the CalleeSaveDisableRegs list.
if (ShouldDisableCalleeSavedRegister)
MF.getRegInfo().disableCalleeSavedRegister(RetValReg);
}
const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
const MCPhysReg *I =
TRI->getCalleeSavedRegsViaCopy(&DAG.getMachineFunction());
if (I) {
for (; *I; ++I) {
if (X86::GR64RegClass.contains(*I))
RetOps.push_back(DAG.getRegister(*I, MVT::i64));
else
llvm_unreachable("Unexpected register class in CSRsViaCopy!");
}
}
RetOps[0] = Chain; // Update chain.
// Add the flag if we have it.
if (Flag.getNode())
RetOps.push_back(Flag);
X86ISD::NodeType opcode = X86ISD::RET_FLAG;
if (CallConv == CallingConv::X86_INTR)
opcode = X86ISD::IRET;
return DAG.getNode(opcode, dl, MVT::Other, RetOps);
}
bool X86TargetLowering::isUsedByReturnOnly(SDNode *N, SDValue &Chain) const {
if (N->getNumValues() != 1 || !N->hasNUsesOfValue(1, 0))
return false;
SDValue TCChain = Chain;
SDNode *Copy = *N->use_begin();
if (Copy->getOpcode() == ISD::CopyToReg) {
// If the copy has a glue operand, we conservatively assume it isn't safe to
// perform a tail call.
if (Copy->getOperand(Copy->getNumOperands()-1).getValueType() == MVT::Glue)
return false;
TCChain = Copy->getOperand(0);
} else if (Copy->getOpcode() != ISD::FP_EXTEND)
return false;
bool HasRet = false;
for (SDNode::use_iterator UI = Copy->use_begin(), UE = Copy->use_end();
UI != UE; ++UI) {
if (UI->getOpcode() != X86ISD::RET_FLAG)
return false;
// If we are returning more than one value, we can definitely
// not make a tail call; see PR19530.
if (UI->getNumOperands() > 4)
return false;
if (UI->getNumOperands() == 4 &&
UI->getOperand(UI->getNumOperands()-1).getValueType() != MVT::Glue)
return false;
HasRet = true;
}
if (!HasRet)
return false;
Chain = TCChain;
return true;
}
EVT X86TargetLowering::getTypeForExtReturn(LLVMContext &Context, EVT VT,
ISD::NodeType ExtendKind) const {
MVT ReturnMVT = MVT::i32;
bool Darwin = Subtarget.getTargetTriple().isOSDarwin();
if (VT == MVT::i1 || (!Darwin && (VT == MVT::i8 || VT == MVT::i16))) {
// The ABI does not require i1, i8 or i16 to be extended.
//
// On Darwin, there is code in the wild relying on Clang's old behaviour of
// always extending i8/i16 return values, so keep doing that for now.
// (PR26665).
ReturnMVT = MVT::i8;
}
EVT MinVT = getRegisterType(Context, ReturnMVT);
return VT.bitsLT(MinVT) ? MinVT : VT;
}
/// Reads two 32-bit registers and creates a 64-bit mask value.
/// \param VA The current 32-bit value that needs to be assigned.
/// \param NextVA The next 32-bit value that needs to be assigned.
/// \param Root The parent DAG node.
/// \param [in,out] InFlag Represents an SDValue in the parent DAG node for
/// glue purposes. In case the DAG is already using a
/// physical register instead of a virtual one, we should
/// glue our new SDValue to the InFlag SDValue.
/// \return a new 64-bit SDValue.
static SDValue getv64i1Argument(CCValAssign &VA, CCValAssign &NextVA,
SDValue &Root, SelectionDAG &DAG,
const SDLoc &Dl, const X86Subtarget &Subtarget,
SDValue *InFlag = nullptr) {
assert((Subtarget.hasBWI()) && "Expected AVX512BW target!");
assert(Subtarget.is32Bit() && "Expecting 32 bit target");
assert(VA.getValVT() == MVT::v64i1 &&
"Expecting first location of 64 bit width type");
assert(NextVA.getValVT() == VA.getValVT() &&
"The locations should have the same type");
assert(VA.isRegLoc() && NextVA.isRegLoc() &&
"The values should reside in two registers");
SDValue Lo, Hi;
unsigned Reg;
SDValue ArgValueLo, ArgValueHi;
MachineFunction &MF = DAG.getMachineFunction();
const TargetRegisterClass *RC = &X86::GR32RegClass;
// Read a 32 bit value from the registers
if (nullptr == InFlag) {
// When no physical register is present,
// create an intermediate virtual register
Reg = MF.addLiveIn(VA.getLocReg(), RC);
ArgValueLo = DAG.getCopyFromReg(Root, Dl, Reg, MVT::i32);
Reg = MF.addLiveIn(NextVA.getLocReg(), RC);
ArgValueHi = DAG.getCopyFromReg(Root, Dl, Reg, MVT::i32);
} else {
// When a physical register is available read the value from it and glue
// the reads together.
ArgValueLo =
DAG.getCopyFromReg(Root, Dl, VA.getLocReg(), MVT::i32, *InFlag);
*InFlag = ArgValueLo.getValue(2);
ArgValueHi =
DAG.getCopyFromReg(Root, Dl, NextVA.getLocReg(), MVT::i32, *InFlag);
*InFlag = ArgValueHi.getValue(2);
}
// Convert the lower 32 bits into a v32i1 mask
Lo = DAG.getBitcast(MVT::v32i1, ArgValueLo);
// Convert the upper 32 bits into a v32i1 mask
Hi = DAG.getBitcast(MVT::v32i1, ArgValueHi);
// Concatenate the two values together
return DAG.getNode(ISD::CONCAT_VECTORS, Dl, MVT::v64i1, Lo, Hi);
}
/// The function will lower a register of various sizes (8/16/32/64)
/// to a mask value of the expected size (v8i1/v16i1/v32i1/v64i1).
/// \returns a DAG node containing the operand after lowering to the mask type.
static SDValue lowerRegToMasks(const SDValue &ValArg, const EVT &ValVT,
const EVT &ValLoc, const SDLoc &Dl,
SelectionDAG &DAG) {
SDValue ValReturned = ValArg;
if (ValVT == MVT::v1i1)
return DAG.getNode(ISD::SCALAR_TO_VECTOR, Dl, MVT::v1i1, ValReturned);
if (ValVT == MVT::v64i1) {
// On a 32-bit machine this case is handled by getv64i1Argument.
assert(ValLoc == MVT::i64 && "Expecting only i64 locations");
// On a 64-bit machine there is no need to truncate the value, only bitcast it.
} else {
MVT maskLen;
switch (ValVT.getSimpleVT().SimpleTy) {
case MVT::v8i1:
maskLen = MVT::i8;
break;
case MVT::v16i1:
maskLen = MVT::i16;
break;
case MVT::v32i1:
maskLen = MVT::i32;
break;
default:
llvm_unreachable("Expecting a vector of i1 types");
}
ValReturned = DAG.getNode(ISD::TRUNCATE, Dl, maskLen, ValReturned);
}
return DAG.getBitcast(ValVT, ValReturned);
}
/// Lower the result values of a call into the
/// appropriate copies out of appropriate physical registers.
///
SDValue X86TargetLowering::LowerCallResult(
SDValue Chain, SDValue InFlag, CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals,
uint32_t *RegMask) const {
const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
// Assign locations to each value returned by this call.
SmallVector<CCValAssign, 16> RVLocs;
bool Is64Bit = Subtarget.is64Bit();
CCState CCInfo(CallConv, isVarArg, DAG.getMachineFunction(), RVLocs,
*DAG.getContext());
CCInfo.AnalyzeCallResult(Ins, RetCC_X86);
// Copy all of the result registers out of their specified physreg.
for (unsigned I = 0, InsIndex = 0, E = RVLocs.size(); I != E;
++I, ++InsIndex) {
CCValAssign &VA = RVLocs[I];
EVT CopyVT = VA.getLocVT();
// In some calling conventions we need to remove the used registers
// from the register mask.
if (RegMask) {
for (MCSubRegIterator SubRegs(VA.getLocReg(), TRI, /*IncludeSelf=*/true);
SubRegs.isValid(); ++SubRegs)
RegMask[*SubRegs / 32] &= ~(1u << (*SubRegs % 32));
}
// If this is x86-64, and we disabled SSE, we can't return FP values
if ((CopyVT == MVT::f32 || CopyVT == MVT::f64 || CopyVT == MVT::f128) &&
((Is64Bit || Ins[InsIndex].Flags.isInReg()) && !Subtarget.hasSSE1())) {
errorUnsupported(DAG, dl, "SSE register return with SSE disabled");
VA.convertToReg(X86::FP0); // Set reg to FP0, avoid hitting asserts.
}
// If we prefer to use the value in xmm registers, copy it out as f80 and
// use a truncate to move it from fp stack reg to xmm reg.
bool RoundAfterCopy = false;
if ((VA.getLocReg() == X86::FP0 || VA.getLocReg() == X86::FP1) &&
isScalarFPTypeInSSEReg(VA.getValVT())) {
if (!Subtarget.hasX87())
report_fatal_error("X87 register return with X87 disabled");
CopyVT = MVT::f80;
RoundAfterCopy = (CopyVT != VA.getLocVT());
}
SDValue Val;
if (VA.needsCustom()) {
assert(VA.getValVT() == MVT::v64i1 &&
"Currently the only custom case is when we split v64i1 to 2 regs");
Val =
getv64i1Argument(VA, RVLocs[++I], Chain, DAG, dl, Subtarget, &InFlag);
} else {
Chain = DAG.getCopyFromReg(Chain, dl, VA.getLocReg(), CopyVT, InFlag)
.getValue(1);
Val = Chain.getValue(0);
InFlag = Chain.getValue(2);
}
if (RoundAfterCopy)
Val = DAG.getNode(ISD::FP_ROUND, dl, VA.getValVT(), Val,
// This truncation won't change the value.
DAG.getIntPtrConstant(1, dl));
if (VA.isExtInLoc() && (VA.getValVT().getScalarType() == MVT::i1)) {
if (VA.getValVT().isVector() &&
((VA.getLocVT() == MVT::i64) || (VA.getLocVT() == MVT::i32) ||
(VA.getLocVT() == MVT::i16) || (VA.getLocVT() == MVT::i8))) {
// Promoting a mask type (v*i1) into a register of type i64/i32/i16/i8
Val = lowerRegToMasks(Val, VA.getValVT(), VA.getLocVT(), dl, DAG);
} else
Val = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), Val);
}
InVals.push_back(Val);
}
return Chain;
}
//===----------------------------------------------------------------------===//
// C & StdCall & Fast Calling Convention implementation
//===----------------------------------------------------------------------===//
// The StdCall calling convention is standard for many Windows API routines.
// It differs from the C calling convention only a little: the callee cleans
// up the stack instead of the caller, and symbols are decorated in some
// fancy way :) It doesn't support any vector arguments.
// For info on fast calling convention see Fast Calling Convention (tail call)
// implementation LowerX86_32FastCCCallTo.
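// For example, a 32 bit stdcall function "int f(int, int)" is decorated as
// "_f@8": a leading underscore plus '@' and the number of argument bytes the
// callee pops.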
/// CallIsStructReturn - Determines whether a call uses struct return
/// semantics.
enum StructReturnType {
NotStructReturn,
RegStructReturn,
StackStructReturn
};
static StructReturnType
callIsStructReturn(const SmallVectorImpl<ISD::OutputArg> &Outs, bool IsMCU) {
if (Outs.empty())
return NotStructReturn;
const ISD::ArgFlagsTy &Flags = Outs[0].Flags;
if (!Flags.isSRet())
return NotStructReturn;
if (Flags.isInReg() || IsMCU)
return RegStructReturn;
return StackStructReturn;
}
/// Determines whether a function uses struct return semantics.
static StructReturnType
argsAreStructReturn(const SmallVectorImpl<ISD::InputArg> &Ins, bool IsMCU) {
if (Ins.empty())
return NotStructReturn;
const ISD::ArgFlagsTy &Flags = Ins[0].Flags;
if (!Flags.isSRet())
return NotStructReturn;
if (Flags.isInReg() || IsMCU)
return RegStructReturn;
return StackStructReturn;
}
/// Make a copy of an aggregate at address specified by "Src" to address
/// "Dst" with size and alignment information specified by the specific
/// parameter attribute. The copy will be passed as a byval function parameter.
static SDValue CreateCopyOfByValArgument(SDValue Src, SDValue Dst,
SDValue Chain, ISD::ArgFlagsTy Flags,
SelectionDAG &DAG, const SDLoc &dl) {
SDValue SizeNode = DAG.getConstant(Flags.getByValSize(), dl, MVT::i32);
return DAG.getMemcpy(Chain, dl, Dst, Src, SizeNode, Flags.getByValAlign(),
/*isVolatile*/false, /*AlwaysInline=*/true,
/*isTailCall*/false,
MachinePointerInfo(), MachinePointerInfo());
}
/// Return true if the calling convention is one that we can guarantee TCO for.
static bool canGuaranteeTCO(CallingConv::ID CC) {
return (CC == CallingConv::Fast || CC == CallingConv::GHC ||
CC == CallingConv::X86_RegCall || CC == CallingConv::HiPE ||
CC == CallingConv::HHVM);
}
/// Return true if we might ever do TCO for calls with this calling convention.
static bool mayTailCallThisCC(CallingConv::ID CC) {
switch (CC) {
// C calling conventions:
case CallingConv::C:
case CallingConv::Win64:
case CallingConv::X86_64_SysV:
// Callee pop conventions:
case CallingConv::X86_ThisCall:
case CallingConv::X86_StdCall:
case CallingConv::X86_VectorCall:
case CallingConv::X86_FastCall:
return true;
default:
return canGuaranteeTCO(CC);
}
}
/// Return true if the function is being made into a tailcall target by
/// changing its ABI.
static bool shouldGuaranteeTCO(CallingConv::ID CC, bool GuaranteedTailCallOpt) {
return GuaranteedTailCallOpt && canGuaranteeTCO(CC);
}
bool X86TargetLowering::mayBeEmittedAsTailCall(const CallInst *CI) const {
auto Attr =
CI->getParent()->getParent()->getFnAttribute("disable-tail-calls");
if (!CI->isTailCall() || Attr.getValueAsString() == "true")
return false;
ImmutableCallSite CS(CI);
CallingConv::ID CalleeCC = CS.getCallingConv();
if (!mayTailCallThisCC(CalleeCC))
return false;
return true;
}
SDValue
X86TargetLowering::LowerMemArgument(SDValue Chain, CallingConv::ID CallConv,
const SmallVectorImpl<ISD::InputArg> &Ins,
const SDLoc &dl, SelectionDAG &DAG,
const CCValAssign &VA,
MachineFrameInfo &MFI, unsigned i) const {
// Create the nodes corresponding to a load from this parameter slot.
ISD::ArgFlagsTy Flags = Ins[i].Flags;
bool AlwaysUseMutable = shouldGuaranteeTCO(
CallConv, DAG.getTarget().Options.GuaranteedTailCallOpt);
bool isImmutable = !AlwaysUseMutable && !Flags.isByVal();
EVT ValVT;
MVT PtrVT = getPointerTy(DAG.getDataLayout());
// If the value is passed by pointer, we have the address passed instead of
// the value itself. No need to extend if the mask value and the location
// share the same absolute size.
bool ExtendedInMem =
VA.isExtInLoc() && VA.getValVT().getScalarType() == MVT::i1 &&
VA.getValVT().getSizeInBits() != VA.getLocVT().getSizeInBits();
if (VA.getLocInfo() == CCValAssign::Indirect || ExtendedInMem)
ValVT = VA.getLocVT();
else
ValVT = VA.getValVT();
// Calculate SP offset of interrupt parameter, re-arrange the slot normally
// taken by a return address.
int Offset = 0;
if (CallConv == CallingConv::X86_INTR) {
// X86 interrupts may take one or two arguments.
// On the stack there is no return address as in a regular call.
// The offset of the last argument needs to be set to -4/-8 bytes,
// while the offset of the first argument (out of two) is set to 0 bytes.
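// For example, with the formula below a 32 bit handler taking two arguments
// sees the first argument at offset 0 and the second at offset -4.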
Offset = (Subtarget.is64Bit() ? 8 : 4) * ((i + 1) % Ins.size() - 1);
if (Subtarget.is64Bit() && Ins.size() == 2) {
// The stack pointer needs to be realigned for 64 bit handlers with error
// code, so the argument offset changes by 8 bytes.
Offset += 8;
}
}
// FIXME: For now, all byval parameter objects are marked mutable. This can be
// changed with more analysis.
// In case of tail call optimization, mark all arguments mutable, since they
// could be overwritten by the lowering of arguments in case of a tail call.
if (Flags.isByVal()) {
unsigned Bytes = Flags.getByValSize();
if (Bytes == 0) Bytes = 1; // Don't create zero-sized stack objects.
int FI = MFI.CreateFixedObject(Bytes, VA.getLocMemOffset(), isImmutable);
// Adjust SP offset of interrupt parameter.
if (CallConv == CallingConv::X86_INTR) {
MFI.setObjectOffset(FI, Offset);
}
return DAG.getFrameIndex(FI, PtrVT);
}
// This is an argument in memory. We might be able to perform copy elision.
if (Flags.isCopyElisionCandidate()) {
EVT ArgVT = Ins[i].ArgVT;
SDValue PartAddr;
if (Ins[i].PartOffset == 0) {
// If this is a one-part value or the first part of a multi-part value,
// create a stack object for the entire argument value type and return a
// load from our portion of it. This assumes that if the first part of an
// argument is in memory, the rest will also be in memory.
int FI = MFI.CreateFixedObject(ArgVT.getStoreSize(), VA.getLocMemOffset(),
/*Immutable=*/false);
PartAddr = DAG.getFrameIndex(FI, PtrVT);
return DAG.getLoad(
ValVT, dl, Chain, PartAddr,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI));
} else {
// This is not the first piece of an argument in memory. See if there is
// already a fixed stack object including this offset. If so, assume it
// was created by the PartOffset == 0 branch above and create a load from
// the appropriate offset into it.
int64_t PartBegin = VA.getLocMemOffset();
int64_t PartEnd = PartBegin + ValVT.getSizeInBits() / 8;
int FI = MFI.getObjectIndexBegin();
for (; MFI.isFixedObjectIndex(FI); ++FI) {
int64_t ObjBegin = MFI.getObjectOffset(FI);
int64_t ObjEnd = ObjBegin + MFI.getObjectSize(FI);
if (ObjBegin <= PartBegin && PartEnd <= ObjEnd)
break;
}
if (MFI.isFixedObjectIndex(FI)) {
SDValue Addr =
DAG.getNode(ISD::ADD, dl, PtrVT, DAG.getFrameIndex(FI, PtrVT),
DAG.getIntPtrConstant(Ins[i].PartOffset, dl));
return DAG.getLoad(
ValVT, dl, Chain, Addr,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI,
Ins[i].PartOffset));
}
}
}
int FI = MFI.CreateFixedObject(ValVT.getSizeInBits() / 8,
VA.getLocMemOffset(), isImmutable);
// Set SExt or ZExt flag.
if (VA.getLocInfo() == CCValAssign::ZExt) {
MFI.setObjectZExt(FI, true);
} else if (VA.getLocInfo() == CCValAssign::SExt) {
MFI.setObjectSExt(FI, true);
}
// Adjust SP offset of interrupt parameter.
if (CallConv == CallingConv::X86_INTR) {
MFI.setObjectOffset(FI, Offset);
}
SDValue FIN = DAG.getFrameIndex(FI, PtrVT);
SDValue Val = DAG.getLoad(
ValVT, dl, Chain, FIN,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI));
return ExtendedInMem
? (VA.getValVT().isVector()
? DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VA.getValVT(), Val)
: DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), Val))
: Val;
}
// FIXME: Get this from tablegen.
static ArrayRef<MCPhysReg> get64BitArgumentGPRs(CallingConv::ID CallConv,
const X86Subtarget &Subtarget) {
assert(Subtarget.is64Bit());
if (Subtarget.isCallingConvWin64(CallConv)) {
static const MCPhysReg GPR64ArgRegsWin64[] = {
X86::RCX, X86::RDX, X86::R8, X86::R9
};
return makeArrayRef(std::begin(GPR64ArgRegsWin64), std::end(GPR64ArgRegsWin64));
}
static const MCPhysReg GPR64ArgRegs64Bit[] = {
X86::RDI, X86::RSI, X86::RDX, X86::RCX, X86::R8, X86::R9
};
return makeArrayRef(std::begin(GPR64ArgRegs64Bit), std::end(GPR64ArgRegs64Bit));
}
// FIXME: Get this from tablegen.
static ArrayRef<MCPhysReg> get64BitArgumentXMMs(MachineFunction &MF,
CallingConv::ID CallConv,
const X86Subtarget &Subtarget) {
assert(Subtarget.is64Bit());
if (Subtarget.isCallingConvWin64(CallConv)) {
// The XMM registers which might contain var arg parameters are shadowed
// in their paired GPR. So we only need to save the GPR to their home
// slots.
// TODO: __vectorcall will change this.
return None;
}
const Function *Fn = MF.getFunction();
bool NoImplicitFloatOps = Fn->hasFnAttribute(Attribute::NoImplicitFloat);
bool isSoftFloat = Subtarget.useSoftFloat();
assert(!(isSoftFloat && NoImplicitFloatOps) &&
"SSE register cannot be used when SSE is disabled!");
if (isSoftFloat || NoImplicitFloatOps || !Subtarget.hasSSE1())
// Kernel mode asks for SSE to be disabled, so there are no XMM argument
// registers.
return None;
static const MCPhysReg XMMArgRegs64Bit[] = {
X86::XMM0, X86::XMM1, X86::XMM2, X86::XMM3,
X86::XMM4, X86::XMM5, X86::XMM6, X86::XMM7
};
return makeArrayRef(std::begin(XMMArgRegs64Bit), std::end(XMMArgRegs64Bit));
}
#ifndef NDEBUG
static bool isSortedByValueNo(const SmallVectorImpl<CCValAssign> &ArgLocs) {
return std::is_sorted(ArgLocs.begin(), ArgLocs.end(),
[](const CCValAssign &A, const CCValAssign &B) -> bool {
return A.getValNo() < B.getValNo();
});
}
#endif
SDValue X86TargetLowering::LowerFormalArguments(
SDValue Chain, CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals) const {
MachineFunction &MF = DAG.getMachineFunction();
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
const TargetFrameLowering &TFI = *Subtarget.getFrameLowering();
const Function *Fn = MF.getFunction();
if (Fn->hasExternalLinkage() &&
Subtarget.isTargetCygMing() &&
Fn->getName() == "main")
FuncInfo->setForceFramePointer(true);
MachineFrameInfo &MFI = MF.getFrameInfo();
bool Is64Bit = Subtarget.is64Bit();
bool IsWin64 = Subtarget.isCallingConvWin64(CallConv);
assert(
!(isVarArg && canGuaranteeTCO(CallConv)) &&
"Var args not supported with calling conv' regcall, fastcc, ghc or hipe");
if (CallConv == CallingConv::X86_INTR) {
bool isLegal = Ins.size() == 1 ||
(Ins.size() == 2 && ((Is64Bit && Ins[1].VT == MVT::i64) ||
(!Is64Bit && Ins[1].VT == MVT::i32)));
if (!isLegal)
report_fatal_error("X86 interrupts may take one or two arguments");
}
// Assign locations to all of the incoming arguments.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CallConv, isVarArg, MF, ArgLocs, *DAG.getContext());
// Allocate shadow area for Win64.
if (IsWin64)
CCInfo.AllocateStack(32, 8);
CCInfo.AnalyzeArguments(Ins, CC_X86);
// In vectorcall calling convention a second pass is required for the HVA
// types.
if (CallingConv::X86_VectorCall == CallConv) {
CCInfo.AnalyzeArgumentsSecondPass(Ins, CC_X86);
}
// The next loop assumes that the locations are in the same order as the
// input arguments.
assert(isSortedByValueNo(ArgLocs) &&
"Argument Location list must be sorted before lowering");
SDValue ArgValue;
for (unsigned I = 0, InsIndex = 0, E = ArgLocs.size(); I != E;
++I, ++InsIndex) {
assert(InsIndex < Ins.size() && "Invalid Ins index");
CCValAssign &VA = ArgLocs[I];
if (VA.isRegLoc()) {
EVT RegVT = VA.getLocVT();
if (VA.needsCustom()) {
assert(
VA.getValVT() == MVT::v64i1 &&
"Currently the only custom case is when we split v64i1 to 2 regs");
// In the regcall calling convention, v64i1 values compiled for a
// 32 bit arch are split up into two registers.
ArgValue =
getv64i1Argument(VA, ArgLocs[++I], Chain, DAG, dl, Subtarget);
} else {
const TargetRegisterClass *RC;
if (RegVT == MVT::i32)
RC = &X86::GR32RegClass;
else if (Is64Bit && RegVT == MVT::i64)
RC = &X86::GR64RegClass;
else if (RegVT == MVT::f32)
RC = Subtarget.hasAVX512() ? &X86::FR32XRegClass : &X86::FR32RegClass;
else if (RegVT == MVT::f64)
RC = Subtarget.hasAVX512() ? &X86::FR64XRegClass : &X86::FR64RegClass;
else if (RegVT == MVT::f80)
RC = &X86::RFP80RegClass;
else if (RegVT == MVT::f128)
RC = &X86::FR128RegClass;
else if (RegVT.is512BitVector())
RC = &X86::VR512RegClass;
else if (RegVT.is256BitVector())
RC = Subtarget.hasVLX() ? &X86::VR256XRegClass : &X86::VR256RegClass;
else if (RegVT.is128BitVector())
RC = Subtarget.hasVLX() ? &X86::VR128XRegClass : &X86::VR128RegClass;
else if (RegVT == MVT::x86mmx)
RC = &X86::VR64RegClass;
else if (RegVT == MVT::v1i1)
RC = &X86::VK1RegClass;
else if (RegVT == MVT::v8i1)
RC = &X86::VK8RegClass;
else if (RegVT == MVT::v16i1)
RC = &X86::VK16RegClass;
else if (RegVT == MVT::v32i1)
RC = &X86::VK32RegClass;
else if (RegVT == MVT::v64i1)
RC = &X86::VK64RegClass;
else
llvm_unreachable("Unknown argument type!");
unsigned Reg = MF.addLiveIn(VA.getLocReg(), RC);
ArgValue = DAG.getCopyFromReg(Chain, dl, Reg, RegVT);
}
// If this is an 8 or 16-bit value, it is really passed promoted to 32
// bits. Insert an assert[sz]ext to capture this, then truncate to the
// right size.
if (VA.getLocInfo() == CCValAssign::SExt)
ArgValue = DAG.getNode(ISD::AssertSext, dl, RegVT, ArgValue,
DAG.getValueType(VA.getValVT()));
else if (VA.getLocInfo() == CCValAssign::ZExt)
ArgValue = DAG.getNode(ISD::AssertZext, dl, RegVT, ArgValue,
DAG.getValueType(VA.getValVT()));
else if (VA.getLocInfo() == CCValAssign::BCvt)
ArgValue = DAG.getBitcast(VA.getValVT(), ArgValue);
if (VA.isExtInLoc()) {
// Handle MMX values passed in XMM regs.
if (RegVT.isVector() && VA.getValVT().getScalarType() != MVT::i1)
ArgValue = DAG.getNode(X86ISD::MOVDQ2Q, dl, VA.getValVT(), ArgValue);
else if (VA.getValVT().isVector() &&
VA.getValVT().getScalarType() == MVT::i1 &&
((VA.getLocVT() == MVT::i64) || (VA.getLocVT() == MVT::i32) ||
(VA.getLocVT() == MVT::i16) || (VA.getLocVT() == MVT::i8))) {
// Promoting a mask type (v*i1) into a register of type i64/i32/i16/i8
ArgValue = lowerRegToMasks(ArgValue, VA.getValVT(), RegVT, dl, DAG);
} else
ArgValue = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), ArgValue);
}
} else {
assert(VA.isMemLoc());
ArgValue =
LowerMemArgument(Chain, CallConv, Ins, dl, DAG, VA, MFI, InsIndex);
}
// If the value is passed via a pointer, do a load.
if (VA.getLocInfo() == CCValAssign::Indirect)
ArgValue =
DAG.getLoad(VA.getValVT(), dl, Chain, ArgValue, MachinePointerInfo());
InVals.push_back(ArgValue);
}
for (unsigned I = 0, E = Ins.size(); I != E; ++I) {
// The Swift calling convention does not require us to copy the sret argument
// into %rax/%eax for the return. We don't set SRetReturnReg for Swift.
if (CallConv == CallingConv::Swift)
continue;
// All x86 ABIs require that for returning structs by value we copy the
// sret argument into %rax/%eax (depending on ABI) for the return. Save
// the argument into a virtual register so that we can access it from the
// return points.
if (Ins[I].Flags.isSRet()) {
unsigned Reg = FuncInfo->getSRetReturnReg();
if (!Reg) {
MVT PtrTy = getPointerTy(DAG.getDataLayout());
Reg = MF.getRegInfo().createVirtualRegister(getRegClassFor(PtrTy));
FuncInfo->setSRetReturnReg(Reg);
}
SDValue Copy = DAG.getCopyToReg(DAG.getEntryNode(), dl, Reg, InVals[I]);
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Copy, Chain);
break;
}
}
unsigned StackSize = CCInfo.getNextStackOffset();
// Align stack specially for tail calls.
if (shouldGuaranteeTCO(CallConv,
MF.getTarget().Options.GuaranteedTailCallOpt))
StackSize = GetAlignedArgumentStackSize(StackSize, DAG);
// If the function takes variable number of arguments, make a frame index for
// the start of the first vararg value... for expansion of llvm.va_start. We
// can skip this if there are no va_start calls.
if (MFI.hasVAStart() &&
(Is64Bit || (CallConv != CallingConv::X86_FastCall &&
CallConv != CallingConv::X86_ThisCall))) {
FuncInfo->setVarArgsFrameIndex(MFI.CreateFixedObject(1, StackSize, true));
}
// Figure out if XMM registers are in use.
assert(!(Subtarget.useSoftFloat() &&
Fn->hasFnAttribute(Attribute::NoImplicitFloat)) &&
"SSE register cannot be used when SSE is disabled!");
// 64-bit calling conventions support varargs and register parameters, so we
// have to do extra work to spill them in the prologue.
if (Is64Bit && isVarArg && MFI.hasVAStart()) {
// Find the first unallocated argument registers.
ArrayRef<MCPhysReg> ArgGPRs = get64BitArgumentGPRs(CallConv, Subtarget);
ArrayRef<MCPhysReg> ArgXMMs = get64BitArgumentXMMs(MF, CallConv, Subtarget);
unsigned NumIntRegs = CCInfo.getFirstUnallocated(ArgGPRs);
unsigned NumXMMRegs = CCInfo.getFirstUnallocated(ArgXMMs);
assert(!(NumXMMRegs && !Subtarget.hasSSE1()) &&
"SSE register cannot be used when SSE is disabled!");
// Gather all the live in physical registers.
SmallVector<SDValue, 6> LiveGPRs;
SmallVector<SDValue, 8> LiveXMMRegs;
SDValue ALVal;
for (MCPhysReg Reg : ArgGPRs.slice(NumIntRegs)) {
unsigned GPR = MF.addLiveIn(Reg, &X86::GR64RegClass);
LiveGPRs.push_back(
DAG.getCopyFromReg(Chain, dl, GPR, MVT::i64));
}
if (!ArgXMMs.empty()) {
unsigned AL = MF.addLiveIn(X86::AL, &X86::GR8RegClass);
ALVal = DAG.getCopyFromReg(Chain, dl, AL, MVT::i8);
for (MCPhysReg Reg : ArgXMMs.slice(NumXMMRegs)) {
unsigned XMMReg = MF.addLiveIn(Reg, &X86::VR128RegClass);
LiveXMMRegs.push_back(
DAG.getCopyFromReg(Chain, dl, XMMReg, MVT::v4f32));
}
}
if (IsWin64) {
// Get to the caller-allocated home save location. Add 8 to account
// for the return address.
int HomeOffset = TFI.getOffsetOfLocalArea() + 8;
FuncInfo->setRegSaveFrameIndex(
MFI.CreateFixedObject(1, NumIntRegs * 8 + HomeOffset, false));
// Fixup to set vararg frame on shadow area (4 x i64).
if (NumIntRegs < 4)
FuncInfo->setVarArgsFrameIndex(FuncInfo->getRegSaveFrameIndex());
} else {
// For X86-64, if there are vararg parameters that are passed via
// registers, then we must store them to their spots on the stack so
// they may be loaded by dereferencing the result of va_next.
FuncInfo->setVarArgsGPOffset(NumIntRegs * 8);
FuncInfo->setVarArgsFPOffset(ArgGPRs.size() * 8 + NumXMMRegs * 16);
FuncInfo->setRegSaveFrameIndex(MFI.CreateStackObject(
ArgGPRs.size() * 8 + ArgXMMs.size() * 16, 16, false));
}
// Store the integer parameter registers.
SmallVector<SDValue, 8> MemOps;
SDValue RSFIN = DAG.getFrameIndex(FuncInfo->getRegSaveFrameIndex(),
getPointerTy(DAG.getDataLayout()));
unsigned Offset = FuncInfo->getVarArgsGPOffset();
for (SDValue Val : LiveGPRs) {
SDValue FIN = DAG.getNode(ISD::ADD, dl, getPointerTy(DAG.getDataLayout()),
RSFIN, DAG.getIntPtrConstant(Offset, dl));
SDValue Store =
DAG.getStore(Val.getValue(1), dl, Val, FIN,
MachinePointerInfo::getFixedStack(
DAG.getMachineFunction(),
FuncInfo->getRegSaveFrameIndex(), Offset));
MemOps.push_back(Store);
Offset += 8;
}
if (!ArgXMMs.empty() && NumXMMRegs != ArgXMMs.size()) {
// Now store the XMM (fp + vector) parameter registers.
SmallVector<SDValue, 12> SaveXMMOps;
SaveXMMOps.push_back(Chain);
SaveXMMOps.push_back(ALVal);
SaveXMMOps.push_back(DAG.getIntPtrConstant(
FuncInfo->getRegSaveFrameIndex(), dl));
SaveXMMOps.push_back(DAG.getIntPtrConstant(
FuncInfo->getVarArgsFPOffset(), dl));
SaveXMMOps.insert(SaveXMMOps.end(), LiveXMMRegs.begin(),
LiveXMMRegs.end());
MemOps.push_back(DAG.getNode(X86ISD::VASTART_SAVE_XMM_REGS, dl,
MVT::Other, SaveXMMOps));
}
if (!MemOps.empty())
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, MemOps);
}
if (isVarArg && MFI.hasMustTailInVarArgFunc()) {
// Find the largest legal vector type.
MVT VecVT = MVT::Other;
// FIXME: Only some x86_32 calling conventions support AVX512.
if (Subtarget.hasAVX512() &&
(Is64Bit || (CallConv == CallingConv::X86_VectorCall ||
CallConv == CallingConv::Intel_OCL_BI)))
VecVT = MVT::v16f32;
else if (Subtarget.hasAVX())
VecVT = MVT::v8f32;
else if (Subtarget.hasSSE2())
VecVT = MVT::v4f32;
// We forward some GPRs and some vector types.
SmallVector<MVT, 2> RegParmTypes;
MVT IntVT = Is64Bit ? MVT::i64 : MVT::i32;
RegParmTypes.push_back(IntVT);
if (VecVT != MVT::Other)
RegParmTypes.push_back(VecVT);
// Compute the set of forwarded registers. The rest are scratch.
SmallVectorImpl<ForwardedRegister> &Forwards =
FuncInfo->getForwardedMustTailRegParms();
CCInfo.analyzeMustTailForwardedRegisters(Forwards, RegParmTypes, CC_X86);
// Conservatively forward AL on x86_64, since it might be used for varargs.
if (Is64Bit && !CCInfo.isAllocated(X86::AL)) {
unsigned ALVReg = MF.addLiveIn(X86::AL, &X86::GR8RegClass);
Forwards.push_back(ForwardedRegister(ALVReg, X86::AL, MVT::i8));
}
// Copy all forwards from physical to virtual registers.
for (ForwardedRegister &F : Forwards) {
// FIXME: Can we use a less constrained schedule?
SDValue RegVal = DAG.getCopyFromReg(Chain, dl, F.VReg, F.VT);
F.VReg = MF.getRegInfo().createVirtualRegister(getRegClassFor(F.VT));
Chain = DAG.getCopyToReg(Chain, dl, F.VReg, RegVal);
}
}
// Some CCs need callee pop.
if (X86::isCalleePop(CallConv, Is64Bit, isVarArg,
MF.getTarget().Options.GuaranteedTailCallOpt)) {
FuncInfo->setBytesToPopOnReturn(StackSize); // Callee pops everything.
} else if (CallConv == CallingConv::X86_INTR && Ins.size() == 2) {
// X86 interrupts must pop the error code (and the alignment padding) if
// present: an 8-byte error code plus 8 bytes of padding on 64 bit, or a
// 4-byte error code on 32 bit.
FuncInfo->setBytesToPopOnReturn(Is64Bit ? 16 : 4);
} else {
FuncInfo->setBytesToPopOnReturn(0); // Callee pops nothing.
// If this is an sret function, the return should pop the hidden pointer.
if (!Is64Bit && !canGuaranteeTCO(CallConv) &&
!Subtarget.getTargetTriple().isOSMSVCRT() &&
argsAreStructReturn(Ins, Subtarget.isTargetMCU()) == StackStructReturn)
FuncInfo->setBytesToPopOnReturn(4);
}
if (!Is64Bit) {
// RegSaveFrameIndex is X86-64 only.
FuncInfo->setRegSaveFrameIndex(0xAAAAAAA);
if (CallConv == CallingConv::X86_FastCall ||
CallConv == CallingConv::X86_ThisCall)
// fastcall functions can't have varargs.
FuncInfo->setVarArgsFrameIndex(0xAAAAAAA);
}
FuncInfo->setArgumentStackSize(StackSize);
if (WinEHFuncInfo *EHInfo = MF.getWinEHFuncInfo()) {
EHPersonality Personality = classifyEHPersonality(Fn->getPersonalityFn());
if (Personality == EHPersonality::CoreCLR) {
assert(Is64Bit);
// TODO: Add a mechanism to frame lowering that will allow us to indicate
// that we'd prefer this slot be allocated towards the bottom of the frame
// (i.e. near the stack pointer after allocating the frame). Every
// funclet needs a copy of this slot in its (mostly empty) frame, and the
// offset from the bottom of this and each funclet's frame must be the
// same, so the size of funclets' (mostly empty) frames is dictated by
// how far this slot is from the bottom (since they allocate just enough
// space to accommodate holding this slot at the correct offset).
int PSPSymFI = MFI.CreateStackObject(8, 8, /*isSS=*/false);
EHInfo->PSPSymFrameIdx = PSPSymFI;
}
}
if (CallConv == CallingConv::X86_RegCall ||
Fn->hasFnAttribute("no_caller_saved_registers")) {
const MachineRegisterInfo &MRI = MF.getRegInfo();
for (const auto &Pair : make_range(MRI.livein_begin(), MRI.livein_end()))
MF.getRegInfo().disableCalleeSavedRegister(Pair.first);
}
return Chain;
}
SDValue X86TargetLowering::LowerMemOpCallTo(SDValue Chain, SDValue StackPtr,
SDValue Arg, const SDLoc &dl,
SelectionDAG &DAG,
const CCValAssign &VA,
ISD::ArgFlagsTy Flags) const {
unsigned LocMemOffset = VA.getLocMemOffset();
SDValue PtrOff = DAG.getIntPtrConstant(LocMemOffset, dl);
PtrOff = DAG.getNode(ISD::ADD, dl, getPointerTy(DAG.getDataLayout()),
StackPtr, PtrOff);
if (Flags.isByVal())
return CreateCopyOfByValArgument(Arg, PtrOff, Chain, Flags, DAG, dl);
return DAG.getStore(
Chain, dl, Arg, PtrOff,
MachinePointerInfo::getStack(DAG.getMachineFunction(), LocMemOffset));
}
/// Emit a load of the return address if tail call
/// optimization is performed and it is required.
SDValue X86TargetLowering::EmitTailCallLoadRetAddr(
SelectionDAG &DAG, SDValue &OutRetAddr, SDValue Chain, bool IsTailCall,
bool Is64Bit, int FPDiff, const SDLoc &dl) const {
// Adjust the Return address stack slot.
EVT VT = getPointerTy(DAG.getDataLayout());
OutRetAddr = getReturnAddressFrameIndex(DAG);
// Load the "old" Return address.
OutRetAddr = DAG.getLoad(VT, dl, Chain, OutRetAddr, MachinePointerInfo());
return SDValue(OutRetAddr.getNode(), 1);
}
/// Emit a store of the return address if tail call
/// optimization is performed and it is required (FPDiff!=0).
static SDValue EmitTailCallStoreRetAddr(SelectionDAG &DAG, MachineFunction &MF,
SDValue Chain, SDValue RetAddrFrIdx,
EVT PtrVT, unsigned SlotSize,
int FPDiff, const SDLoc &dl) {
// Store the return address to the appropriate stack slot.
if (!FPDiff) return Chain;
// Calculate the new stack slot for the return address.
int NewReturnAddrFI =
MF.getFrameInfo().CreateFixedObject(SlotSize, (int64_t)FPDiff - SlotSize,
false);
SDValue NewRetAddrFrIdx = DAG.getFrameIndex(NewReturnAddrFI, PtrVT);
Chain = DAG.getStore(Chain, dl, RetAddrFrIdx, NewRetAddrFrIdx,
MachinePointerInfo::getFixedStack(
DAG.getMachineFunction(), NewReturnAddrFI));
return Chain;
}
/// Returns a vector_shuffle mask for a movs{s|d} or movd
/// operation of the specified width.
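/// For example, for v4f32 the mask is <4,1,2,3>: element 0 is taken from V2
/// and the remaining elements from V1, matching the MOVSS pattern of merging
/// the low element of the source into the destination.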
static SDValue getMOVL(SelectionDAG &DAG, const SDLoc &dl, MVT VT, SDValue V1,
SDValue V2) {
unsigned NumElems = VT.getVectorNumElements();
SmallVector<int, 8> Mask;
Mask.push_back(NumElems);
for (unsigned i = 1; i != NumElems; ++i)
Mask.push_back(i);
return DAG.getVectorShuffle(VT, dl, V1, V2, Mask);
}
SDValue
X86TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
SmallVectorImpl<SDValue> &InVals) const {
SelectionDAG &DAG = CLI.DAG;
SDLoc &dl = CLI.DL;
SmallVectorImpl<ISD::OutputArg> &Outs = CLI.Outs;
SmallVectorImpl<SDValue> &OutVals = CLI.OutVals;
SmallVectorImpl<ISD::InputArg> &Ins = CLI.Ins;
SDValue Chain = CLI.Chain;
SDValue Callee = CLI.Callee;
CallingConv::ID CallConv = CLI.CallConv;
bool &isTailCall = CLI.IsTailCall;
bool isVarArg = CLI.IsVarArg;
MachineFunction &MF = DAG.getMachineFunction();
bool Is64Bit = Subtarget.is64Bit();
bool IsWin64 = Subtarget.isCallingConvWin64(CallConv);
StructReturnType SR = callIsStructReturn(Outs, Subtarget.isTargetMCU());
bool IsSibcall = false;
X86MachineFunctionInfo *X86Info = MF.getInfo<X86MachineFunctionInfo>();
auto Attr = MF.getFunction()->getFnAttribute("disable-tail-calls");
const CallInst *CI =
CLI.CS ? dyn_cast<CallInst>(CLI.CS->getInstruction()) : nullptr;
const Function *Fn = CI ? CI->getCalledFunction() : nullptr;
bool HasNCSR = (CI && CI->hasFnAttr("no_caller_saved_registers")) ||
(Fn && Fn->hasFnAttribute("no_caller_saved_registers"));
if (CallConv == CallingConv::X86_INTR)
report_fatal_error("X86 interrupts may not be called directly");
if (Attr.getValueAsString() == "true")
isTailCall = false;
if (Subtarget.isPICStyleGOT() &&
!MF.getTarget().Options.GuaranteedTailCallOpt) {
// If we are using a GOT, disable tail calls to external symbols with
// default visibility. Tail calling such a symbol requires using a GOT
// relocation, which forces early binding of the symbol. This breaks code
// that requires lazy function symbol resolution. Using musttail or
// GuaranteedTailCallOpt will override this.
GlobalAddressSDNode *G = dyn_cast<GlobalAddressSDNode>(Callee);
if (!G || (!G->getGlobal()->hasLocalLinkage() &&
G->getGlobal()->hasDefaultVisibility()))
isTailCall = false;
}
bool IsMustTail = CLI.CS && CLI.CS->isMustTailCall();
if (IsMustTail) {
// Force this to be a tail call. The verifier rules are enough to ensure
// that we can lower this successfully without moving the return address
// around.
isTailCall = true;
} else if (isTailCall) {
// Check if it's really possible to do a tail call.
isTailCall = IsEligibleForTailCallOptimization(Callee, CallConv,
isVarArg, SR != NotStructReturn,
MF.getFunction()->hasStructRetAttr(), CLI.RetTy,
Outs, OutVals, Ins, DAG);
// Sibcalls are automatically detected tailcalls which do not require
// ABI changes.
if (!MF.getTarget().Options.GuaranteedTailCallOpt && isTailCall)
IsSibcall = true;
if (isTailCall)
++NumTailCalls;
}
assert(!(isVarArg && canGuaranteeTCO(CallConv)) &&
"Var args not supported with calling convention fastcc, ghc or hipe");
// Analyze operands of the call, assigning locations to each operand.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CallConv, isVarArg, MF, ArgLocs, *DAG.getContext());
// Allocate shadow area for Win64.
if (IsWin64)
CCInfo.AllocateStack(32, 8);
CCInfo.AnalyzeArguments(Outs, CC_X86);
// In vectorcall calling convention a second pass is required for the HVA
// types.
if (CallingConv::X86_VectorCall == CallConv) {
CCInfo.AnalyzeArgumentsSecondPass(Outs, CC_X86);
}
// Get a count of how many bytes are to be pushed on the stack.
unsigned NumBytes = CCInfo.getAlignedCallFrameSize();
if (IsSibcall)
// This is a sibcall. The memory operands are already available in the
// caller's own caller's stack.
NumBytes = 0;
else if (MF.getTarget().Options.GuaranteedTailCallOpt &&
canGuaranteeTCO(CallConv))
NumBytes = GetAlignedArgumentStackSize(NumBytes, DAG);
int FPDiff = 0;
if (isTailCall && !IsSibcall && !IsMustTail) {
// Lower arguments at fp - stackoffset + fpdiff.
unsigned NumBytesCallerPushed = X86Info->getBytesToPopOnReturn();
FPDiff = NumBytesCallerPushed - NumBytes;
// Set the delta of movement of the return address stack slot, but only
// if the delta is greater than the previous delta.
if (FPDiff < X86Info->getTCReturnAddrDelta())
X86Info->setTCReturnAddrDelta(FPDiff);
}
unsigned NumBytesToPush = NumBytes;
unsigned NumBytesToPop = NumBytes;
// If we have an inalloca argument, all stack space has already been allocated
// for us and is right at the top of the stack. We don't support multiple
// arguments passed in memory when using inalloca.
if (!Outs.empty() && Outs.back().Flags.isInAlloca()) {
NumBytesToPush = 0;
if (!ArgLocs.back().isMemLoc())
report_fatal_error("cannot use inalloca attribute on a register "
"parameter");
if (ArgLocs.back().getLocMemOffset() != 0)
report_fatal_error("any parameter with the inalloca attribute must be "
"the only memory argument");
}
if (!IsSibcall)
Chain = DAG.getCALLSEQ_START(Chain, NumBytesToPush,
NumBytes - NumBytesToPush, dl);
SDValue RetAddrFrIdx;
// Load return address for tail calls.
if (isTailCall && FPDiff)
Chain = EmitTailCallLoadRetAddr(DAG, RetAddrFrIdx, Chain, isTailCall,
Is64Bit, FPDiff, dl);
SmallVector<std::pair<unsigned, SDValue>, 8> RegsToPass;
SmallVector<SDValue, 8> MemOpChains;
SDValue StackPtr;
// The next loop assumes that the locations are in the same order as the
// input arguments.
assert(isSortedByValueNo(ArgLocs) &&
"Argument Location list must be sorted before lowering");
// Walk the register/memloc assignments, inserting copies/loads. In the case
// of tail call optimization, arguments are handled later.
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
for (unsigned I = 0, OutIndex = 0, E = ArgLocs.size(); I != E;
++I, ++OutIndex) {
assert(OutIndex < Outs.size() && "Invalid Out index");
// Skip inalloca arguments, they have already been written.
ISD::ArgFlagsTy Flags = Outs[OutIndex].Flags;
if (Flags.isInAlloca())
continue;
CCValAssign &VA = ArgLocs[I];
EVT RegVT = VA.getLocVT();
SDValue Arg = OutVals[OutIndex];
bool isByVal = Flags.isByVal();
// Promote the value if needed.
switch (VA.getLocInfo()) {
default: llvm_unreachable("Unknown loc info!");
case CCValAssign::Full: break;
case CCValAssign::SExt:
Arg = DAG.getNode(ISD::SIGN_EXTEND, dl, RegVT, Arg);
break;
case CCValAssign::ZExt:
Arg = DAG.getNode(ISD::ZERO_EXTEND, dl, RegVT, Arg);
break;
case CCValAssign::AExt:
if (Arg.getValueType().isVector() &&
Arg.getValueType().getVectorElementType() == MVT::i1)
Arg = lowerMasksToReg(Arg, RegVT, dl, DAG);
else if (RegVT.is128BitVector()) {
// Special case: passing MMX values in XMM registers.
Arg = DAG.getBitcast(MVT::i64, Arg);
Arg = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2i64, Arg);
Arg = getMOVL(DAG, dl, MVT::v2i64, DAG.getUNDEF(MVT::v2i64), Arg);
} else
Arg = DAG.getNode(ISD::ANY_EXTEND, dl, RegVT, Arg);
break;
case CCValAssign::BCvt:
Arg = DAG.getBitcast(RegVT, Arg);
break;
case CCValAssign::Indirect: {
// Store the argument.
SDValue SpillSlot = DAG.CreateStackTemporary(VA.getValVT());
int FI = cast<FrameIndexSDNode>(SpillSlot)->getIndex();
Chain = DAG.getStore(
Chain, dl, Arg, SpillSlot,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI));
Arg = SpillSlot;
break;
}
}
if (VA.needsCustom()) {
assert(VA.getValVT() == MVT::v64i1 &&
"Currently the only custom case is when we split v64i1 to 2 regs");
// Split v64i1 value into two registers
Passv64i1ArgInRegs(dl, DAG, Chain, Arg, RegsToPass, VA, ArgLocs[++I],
Subtarget);
} else if (VA.isRegLoc()) {
RegsToPass.push_back(std::make_pair(VA.getLocReg(), Arg));
if (isVarArg && IsWin64) {
// The Win64 ABI requires an argument in an XMM reg to be copied to the
// corresponding shadow reg if the callee is a varargs function.
unsigned ShadowReg = 0;
switch (VA.getLocReg()) {
case X86::XMM0: ShadowReg = X86::RCX; break;
case X86::XMM1: ShadowReg = X86::RDX; break;
case X86::XMM2: ShadowReg = X86::R8; break;
case X86::XMM3: ShadowReg = X86::R9; break;
}
if (ShadowReg)
RegsToPass.push_back(std::make_pair(ShadowReg, Arg));
}
} else if (!IsSibcall && (!isTailCall || isByVal)) {
assert(VA.isMemLoc());
if (!StackPtr.getNode())
StackPtr = DAG.getCopyFromReg(Chain, dl, RegInfo->getStackRegister(),
getPointerTy(DAG.getDataLayout()));
MemOpChains.push_back(LowerMemOpCallTo(Chain, StackPtr, Arg,
dl, DAG, VA, Flags));
}
}
if (!MemOpChains.empty())
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, MemOpChains);
if (Subtarget.isPICStyleGOT()) {
// ELF / PIC requires the GOT pointer to be in the EBX register before
// function calls via the PLT.
if (!isTailCall) {
RegsToPass.push_back(std::make_pair(
unsigned(X86::EBX), DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(),
getPointerTy(DAG.getDataLayout()))));
} else {
// If we are tail calling and generating PIC/GOT style code, load the
// address of the callee into ECX. The value in ecx is used as target of
// the tail jump. This is done to circumvent the ebx/callee-saved problem
// for tail calls on PIC/GOT architectures. Normally we would just put the
// address of GOT into ebx and then call target@PLT. But for tail calls
// ebx would be restored (since ebx is callee saved) before jumping to the
// target@PLT.
// Note: The actual moving to ECX is done further down.
GlobalAddressSDNode *G = dyn_cast<GlobalAddressSDNode>(Callee);
if (G && !G->getGlobal()->hasLocalLinkage() &&
G->getGlobal()->hasDefaultVisibility())
Callee = LowerGlobalAddress(Callee, DAG);
else if (isa<ExternalSymbolSDNode>(Callee))
Callee = LowerExternalSymbol(Callee, DAG);
}
}
if (Is64Bit && isVarArg && !IsWin64 && !IsMustTail) {
// From AMD64 ABI document:
// For calls that may call functions that use varargs or stdargs
// (prototype-less calls or calls to functions containing ellipsis (...) in
// the declaration) %al is used as hidden argument to specify the number
// of SSE registers used. The contents of %al do not need to match exactly
// the number of registers, but must be an upper bound on the number of SSE
// registers used and is in the range 0 - 8 inclusive.
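// For example, for a call such as printf("%f", x), where one double is
// passed in XMM0, the code below sets %al to 1.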
// Count the number of XMM registers allocated.
static const MCPhysReg XMMArgRegs[] = {
X86::XMM0, X86::XMM1, X86::XMM2, X86::XMM3,
X86::XMM4, X86::XMM5, X86::XMM6, X86::XMM7
};
unsigned NumXMMRegs = CCInfo.getFirstUnallocated(XMMArgRegs);
assert((Subtarget.hasSSE1() || !NumXMMRegs)
&& "SSE registers cannot be used when SSE is disabled");
RegsToPass.push_back(std::make_pair(unsigned(X86::AL),
DAG.getConstant(NumXMMRegs, dl,
MVT::i8)));
}
if (isVarArg && IsMustTail) {
const auto &Forwards = X86Info->getForwardedMustTailRegParms();
for (const auto &F : Forwards) {
SDValue Val = DAG.getCopyFromReg(Chain, dl, F.VReg, F.VT);
RegsToPass.push_back(std::make_pair(unsigned(F.PReg), Val));
}
}
// For tail calls lower the arguments to the 'real' stack slots. Sibcalls
// don't need this because the eligibility check rejects calls that require
// shuffling arguments passed in memory.
if (!IsSibcall && isTailCall) {
// Force all the incoming stack arguments to be loaded from the stack
// before any new outgoing arguments are stored to the stack, because the
// outgoing stack slots may alias the incoming argument stack slots, and
// the alias isn't otherwise explicit. This is slightly more conservative
// than necessary, because it means that each store effectively depends
// on every argument instead of just those arguments it would clobber.
SDValue ArgChain = DAG.getStackArgumentTokenFactor(Chain);
SmallVector<SDValue, 8> MemOpChains2;
SDValue FIN;
int FI = 0;
for (unsigned I = 0, OutsIndex = 0, E = ArgLocs.size(); I != E;
++I, ++OutsIndex) {
CCValAssign &VA = ArgLocs[I];
if (VA.isRegLoc()) {
if (VA.needsCustom()) {
assert((CallConv == CallingConv::X86_RegCall) &&
"Expecting custom case only in regcall calling convention");
// This means that we are in a special case where one argument was
// passed through two register locations, so skip the next location.
++I;
}
continue;
}
assert(VA.isMemLoc());
SDValue Arg = OutVals[OutsIndex];
ISD::ArgFlagsTy Flags = Outs[OutsIndex].Flags;
// Skip inalloca arguments. They don't require any work.
if (Flags.isInAlloca())
continue;
// Create frame index.
int32_t Offset = VA.getLocMemOffset()+FPDiff;
uint32_t OpSize = (VA.getLocVT().getSizeInBits()+7)/8;
FI = MF.getFrameInfo().CreateFixedObject(OpSize, Offset, true);
FIN = DAG.getFrameIndex(FI, getPointerTy(DAG.getDataLayout()));
if (Flags.isByVal()) {
// Copy relative to framepointer.
SDValue Source = DAG.getIntPtrConstant(VA.getLocMemOffset(), dl);
if (!StackPtr.getNode())
StackPtr = DAG.getCopyFromReg(Chain, dl, RegInfo->getStackRegister(),
getPointerTy(DAG.getDataLayout()));
Source = DAG.getNode(ISD::ADD, dl, getPointerTy(DAG.getDataLayout()),
StackPtr, Source);
MemOpChains2.push_back(CreateCopyOfByValArgument(Source, FIN,
ArgChain,
Flags, DAG, dl));
} else {
// Store relative to framepointer.
MemOpChains2.push_back(DAG.getStore(
ArgChain, dl, Arg, FIN,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI)));
}
}
if (!MemOpChains2.empty())
Chain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, MemOpChains2);
// Store the return address to the appropriate stack slot.
Chain = EmitTailCallStoreRetAddr(DAG, MF, Chain, RetAddrFrIdx,
getPointerTy(DAG.getDataLayout()),
RegInfo->getSlotSize(), FPDiff, dl);
}
// Build a sequence of copy-to-reg nodes chained together with token chain
// and flag operands which copy the outgoing args into registers.
SDValue InFlag;
for (unsigned i = 0, e = RegsToPass.size(); i != e; ++i) {
Chain = DAG.getCopyToReg(Chain, dl, RegsToPass[i].first,
RegsToPass[i].second, InFlag);
InFlag = Chain.getValue(1);
}
if (DAG.getTarget().getCodeModel() == CodeModel::Large) {
assert(Is64Bit && "Large code model is only legal in 64-bit mode.");
// In the 64-bit large code model, we have to make all calls
// through a register, since the call instruction's 32-bit
// pc-relative offset may not be large enough to hold the whole
// address.
} else if (Callee->getOpcode() == ISD::GlobalAddress) {
// If the callee is a GlobalAddress node (quite common, every direct call
// is) turn it into a TargetGlobalAddress node so that legalize doesn't hack
// it.
GlobalAddressSDNode* G = cast<GlobalAddressSDNode>(Callee);
// We should use an extra load for direct calls to dllimported functions in
// non-JIT mode.
const GlobalValue *GV = G->getGlobal();
if (!GV->hasDLLImportStorageClass()) {
unsigned char OpFlags = Subtarget.classifyGlobalFunctionReference(GV);
Callee = DAG.getTargetGlobalAddress(
GV, dl, getPointerTy(DAG.getDataLayout()), G->getOffset(), OpFlags);
if (OpFlags == X86II::MO_GOTPCREL) {
// Add a wrapper.
Callee = DAG.getNode(X86ISD::WrapperRIP, dl,
getPointerTy(DAG.getDataLayout()), Callee);
// Add extra indirection
Callee = DAG.getLoad(
getPointerTy(DAG.getDataLayout()), dl, DAG.getEntryNode(), Callee,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
}
}
} else if (ExternalSymbolSDNode *S = dyn_cast<ExternalSymbolSDNode>(Callee)) {
const Module *Mod = DAG.getMachineFunction().getFunction()->getParent();
unsigned char OpFlags =
Subtarget.classifyGlobalFunctionReference(nullptr, *Mod);
Callee = DAG.getTargetExternalSymbol(
S->getSymbol(), getPointerTy(DAG.getDataLayout()), OpFlags);
} else if (Subtarget.isTarget64BitILP32() &&
Callee->getValueType(0) == MVT::i32) {
// Zero-extend the 32-bit Callee address into 64 bits according to the x32 ABI
Callee = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i64, Callee);
}
// Returns a chain & a flag for retval copy to use.
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
SmallVector<SDValue, 8> Ops;
if (!IsSibcall && isTailCall) {
Chain = DAG.getCALLSEQ_END(Chain,
DAG.getIntPtrConstant(NumBytesToPop, dl, true),
DAG.getIntPtrConstant(0, dl, true), InFlag, dl);
InFlag = Chain.getValue(1);
}
Ops.push_back(Chain);
Ops.push_back(Callee);
if (isTailCall)
Ops.push_back(DAG.getConstant(FPDiff, dl, MVT::i32));
// Add argument registers to the end of the list so that they are known live
// into the call.
for (unsigned i = 0, e = RegsToPass.size(); i != e; ++i)
Ops.push_back(DAG.getRegister(RegsToPass[i].first,
RegsToPass[i].second.getValueType()));
// Add a register mask operand representing the call-preserved registers.
// If HasNCSR is asserted (the NoCallerSavedRegisters attribute exists), then
// we use the X86_INTR calling convention, because it has the same CSR mask
// (the same preserved registers).
const uint32_t *Mask = RegInfo->getCallPreservedMask(
MF, HasNCSR ? (CallingConv::ID)CallingConv::X86_INTR : CallConv);
assert(Mask && "Missing call preserved mask for calling convention");
// If this is an invoke in a 32-bit function using a funclet-based
// personality, assume the function clobbers all registers. If an exception
// is thrown, the runtime will not restore CSRs.
// FIXME: Model this more precisely so that we can register allocate across
// the normal edge and spill and fill across the exceptional edge.
if (!Is64Bit && CLI.CS && CLI.CS->isInvoke()) {
const Function *CallerFn = MF.getFunction();
EHPersonality Pers =
CallerFn->hasPersonalityFn()
? classifyEHPersonality(CallerFn->getPersonalityFn())
: EHPersonality::Unknown;
if (isFuncletEHPersonality(Pers))
Mask = RegInfo->getNoPreservedMask();
}
// Define a new register mask from the existing mask.
uint32_t *RegMask = nullptr;
// In some calling conventions we need to remove the used physical registers
// from the reg mask.
if (CallConv == CallingConv::X86_RegCall || HasNCSR) {
const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
// Allocate a new Reg Mask and copy Mask.
RegMask = MF.allocateRegisterMask(TRI->getNumRegs());
unsigned RegMaskSize = (TRI->getNumRegs() + 31) / 32;
memcpy(RegMask, Mask, sizeof(uint32_t) * RegMaskSize);
// Make sure all sub registers of the argument registers are reset
// in the RegMask.
for (auto const &RegPair : RegsToPass)
for (MCSubRegIterator SubRegs(RegPair.first, TRI, /*IncludeSelf=*/true);
SubRegs.isValid(); ++SubRegs)
RegMask[*SubRegs / 32] &= ~(1u << (*SubRegs % 32));
// Create the RegMask Operand according to our updated mask.
Ops.push_back(DAG.getRegisterMask(RegMask));
} else {
// Create the RegMask Operand according to the static mask.
Ops.push_back(DAG.getRegisterMask(Mask));
}
if (InFlag.getNode())
Ops.push_back(InFlag);
if (isTailCall) {
// We used to do:
//// If this is the first return lowered for this function, add the regs
//// to the liveout set for the function.
// This isn't right, although it's probably harmless on x86; liveouts
// should be computed from returns not tail calls. Consider a void
// function making a tail call to a function returning int.
MF.getFrameInfo().setHasTailCall();
return DAG.getNode(X86ISD::TC_RETURN, dl, NodeTys, Ops);
}
Chain = DAG.getNode(X86ISD::CALL, dl, NodeTys, Ops);
InFlag = Chain.getValue(1);
// Create the CALLSEQ_END node.
unsigned NumBytesForCalleeToPop;
if (X86::isCalleePop(CallConv, Is64Bit, isVarArg,
DAG.getTarget().Options.GuaranteedTailCallOpt))
NumBytesForCalleeToPop = NumBytes; // Callee pops everything
else if (!Is64Bit && !canGuaranteeTCO(CallConv) &&
!Subtarget.getTargetTriple().isOSMSVCRT() &&
SR == StackStructReturn)
// If this is a call to a struct-return function, the callee
// pops the hidden struct pointer, so we have to push it back.
// This is common for Darwin/X86, Linux & Mingw32 targets.
// For MSVC Win32 targets, the caller pops the hidden struct pointer.
NumBytesForCalleeToPop = 4;
else
NumBytesForCalleeToPop = 0; // Callee pops nothing.
if (CLI.DoesNotReturn && !getTargetMachine().Options.TrapUnreachable) {
// No need to reset the stack after the call if the call doesn't return. To
// make the MI verifier happy, we'll pretend the callee does it for us.
NumBytesForCalleeToPop = NumBytes;
}
// Returns a flag for retval copy to use.
if (!IsSibcall) {
Chain = DAG.getCALLSEQ_END(Chain,
DAG.getIntPtrConstant(NumBytesToPop, dl, true),
DAG.getIntPtrConstant(NumBytesForCalleeToPop, dl,
true),
InFlag, dl);
InFlag = Chain.getValue(1);
}
// Handle result values, copying them out of physregs into vregs that we
// return.
return LowerCallResult(Chain, InFlag, CallConv, isVarArg, Ins, dl, DAG,
InVals, RegMask);
}
//===----------------------------------------------------------------------===//
// Fast Calling Convention (tail call) implementation
//===----------------------------------------------------------------------===//
// Like stdcall, the callee cleans up the arguments, except that ECX is
// reserved for storing the address of the tail-called function. Only 2
// registers are free for argument passing (inreg). Tail call optimization
// is performed provided:
// * tailcallopt is enabled
// * caller/callee are fastcc
// On X86_64 architecture with GOT-style position independent code only local
// (within module) calls are supported at the moment.
// To keep the stack aligned according to the platform ABI, the function
// GetAlignedArgumentStackSize ensures that the argument delta is always a
// multiple of the stack alignment. (Dynamic linkers need this - darwin's
// dyld for example.)
// If a tail-called callee has more arguments than the caller, the caller
// needs to make sure that there is room to move the RETADDR to. This is
// achieved by reserving an area the size of the argument delta right after
// the original RETADDR, but before the saved framepointer or the spilled
// registers, e.g. caller(arg1, arg2) calls callee(arg1, arg2, arg3, arg4).
// stack layout:
// arg1
// arg2
// RETADDR
// [ new RETADDR
// move area ]
// (possible EBP)
// ESI
// EDI
// local1 ..
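// For example, with 4 byte arguments the callee above needs 8 more bytes of
// argument space than the caller, so an 8 byte move area is reserved for the
// relocated RETADDR.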
/// Align the stack size, e.g. to 16n + 12, so that together with the return
/// address slot a 16-byte alignment requirement is met.
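/// For example, with a 16 byte stack alignment and a 4 byte slot size, a
/// StackSize of 20 is rounded up to 28 (16n + 12), so that together with the
/// 4 byte return address slot the stack stays 16 byte aligned.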
unsigned
X86TargetLowering::GetAlignedArgumentStackSize(unsigned StackSize,
SelectionDAG& DAG) const {
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
const TargetFrameLowering &TFI = *Subtarget.getFrameLowering();
unsigned StackAlignment = TFI.getStackAlignment();
uint64_t AlignMask = StackAlignment - 1;
int64_t Offset = StackSize;
unsigned SlotSize = RegInfo->getSlotSize();
if ( (Offset & AlignMask) <= (StackAlignment - SlotSize) ) {
// Number smaller than 12 so just add the difference.
Offset += ((StackAlignment - SlotSize) - (Offset & AlignMask));
} else {
// Mask out the lower bits, add the stack alignment once plus the 12 bytes.
Offset = ((~AlignMask) & Offset) + StackAlignment +
(StackAlignment-SlotSize);
}
return Offset;
}
/// Return true if the given stack call argument is already available in the
/// same relative position of the caller's incoming argument stack.
static
bool MatchingStackOffset(SDValue Arg, unsigned Offset, ISD::ArgFlagsTy Flags,
MachineFrameInfo &MFI, const MachineRegisterInfo *MRI,
const X86InstrInfo *TII, const CCValAssign &VA) {
unsigned Bytes = Arg.getValueSizeInBits() / 8;
for (;;) {
// Look through nodes that don't alter the bits of the incoming value.
unsigned Op = Arg.getOpcode();
if (Op == ISD::ZERO_EXTEND || Op == ISD::ANY_EXTEND || Op == ISD::BITCAST) {
Arg = Arg.getOperand(0);
continue;
}
if (Op == ISD::TRUNCATE) {
const SDValue &TruncInput = Arg.getOperand(0);
if (TruncInput.getOpcode() == ISD::AssertZext &&
cast<VTSDNode>(TruncInput.getOperand(1))->getVT() ==
Arg.getValueType()) {
Arg = TruncInput.getOperand(0);
continue;
}
}
break;
}
int FI = INT_MAX;
if (Arg.getOpcode() == ISD::CopyFromReg) {
unsigned VR = cast<RegisterSDNode>(Arg.getOperand(1))->getReg();
if (!TargetRegisterInfo::isVirtualRegister(VR))
return false;
MachineInstr *Def = MRI->getVRegDef(VR);
if (!Def)
return false;
if (!Flags.isByVal()) {
if (!TII->isLoadFromStackSlot(*Def, FI))
return false;
} else {
unsigned Opcode = Def->getOpcode();
if ((Opcode == X86::LEA32r || Opcode == X86::LEA64r ||
Opcode == X86::LEA64_32r) &&
Def->getOperand(1).isFI()) {
FI = Def->getOperand(1).getIndex();
Bytes = Flags.getByValSize();
} else
return false;
}
} else if (LoadSDNode *Ld = dyn_cast<LoadSDNode>(Arg)) {
if (Flags.isByVal())
// ByVal argument is passed in as a pointer but it's now being
// dereferenced. e.g.
// define @foo(%struct.X* %A) {
// tail call @bar(%struct.X* byval %A)
// }
return false;
SDValue Ptr = Ld->getBasePtr();
FrameIndexSDNode *FINode = dyn_cast<FrameIndexSDNode>(Ptr);
if (!FINode)
return false;
FI = FINode->getIndex();
} else if (Arg.getOpcode() == ISD::FrameIndex && Flags.isByVal()) {
FrameIndexSDNode *FINode = cast<FrameIndexSDNode>(Arg);
FI = FINode->getIndex();
Bytes = Flags.getByValSize();
} else
return false;
assert(FI != INT_MAX);
if (!MFI.isFixedObjectIndex(FI))
return false;
if (Offset != MFI.getObjectOffset(FI))
return false;
// If this is not byval, check that the argument stack object is immutable.
// inalloca and argument copy elision can create mutable argument stack
// objects. Byval objects can be mutated, but a byval call intends to pass the
// mutated memory.
if (!Flags.isByVal() && !MFI.isImmutableObjectIndex(FI))
return false;
if (VA.getLocVT().getSizeInBits() > Arg.getValueSizeInBits()) {
// If the argument location is wider than the argument type, check that any
// extension flags match.
if (Flags.isZExt() != MFI.isObjectZExt(FI) ||
Flags.isSExt() != MFI.isObjectSExt(FI)) {
return false;
}
}
return Bytes == MFI.getObjectSize(FI);
}
/// Check whether the call is eligible for tail call optimization. Targets
/// that want to do tail call optimization should implement this function.
bool X86TargetLowering::IsEligibleForTailCallOptimization(
SDValue Callee, CallingConv::ID CalleeCC, bool isVarArg,
bool isCalleeStructRet, bool isCallerStructRet, Type *RetTy,
const SmallVectorImpl<ISD::OutputArg> &Outs,
const SmallVectorImpl<SDValue> &OutVals,
const SmallVectorImpl<ISD::InputArg> &Ins, SelectionDAG &DAG) const {
if (!mayTailCallThisCC(CalleeCC))
return false;
// If -tailcallopt is specified, make fastcc functions tail-callable.
MachineFunction &MF = DAG.getMachineFunction();
const Function *CallerF = MF.getFunction();
// If the function return type is x86_fp80 and the callee return type is not,
// then the FP_EXTEND of the call result is not a nop. It's not safe to
// perform a tailcall optimization here.
if (CallerF->getReturnType()->isX86_FP80Ty() && !RetTy->isX86_FP80Ty())
return false;
CallingConv::ID CallerCC = CallerF->getCallingConv();
bool CCMatch = CallerCC == CalleeCC;
bool IsCalleeWin64 = Subtarget.isCallingConvWin64(CalleeCC);
bool IsCallerWin64 = Subtarget.isCallingConvWin64(CallerCC);
// Win64 functions have extra shadow space for argument homing. Don't do the
// sibcall if the caller and callee have mismatched expectations for this
// space.
if (IsCalleeWin64 != IsCallerWin64)
return false;
if (DAG.getTarget().Options.GuaranteedTailCallOpt) {
if (canGuaranteeTCO(CalleeCC) && CCMatch)
return true;
return false;
}
// Look for obvious safe cases to perform tail call optimization that do not
// require ABI changes. This is what gcc calls sibcall.
// Can't do sibcall if stack needs to be dynamically re-aligned. PEI needs to
// emit a special epilogue.
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
if (RegInfo->needsStackRealignment(MF))
return false;
// Also avoid sibcall optimization if either caller or callee uses struct
// return semantics.
if (isCalleeStructRet || isCallerStructRet)
return false;
// Do not sibcall optimize vararg calls unless all arguments are passed via
// registers.
LLVMContext &C = *DAG.getContext();
if (isVarArg && !Outs.empty()) {
// Optimizing for varargs on Win64 is unlikely to be safe without
// additional testing.
if (IsCalleeWin64 || IsCallerWin64)
return false;
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CalleeCC, isVarArg, MF, ArgLocs, C);
CCInfo.AnalyzeCallOperands(Outs, CC_X86);
for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i)
if (!ArgLocs[i].isRegLoc())
return false;
}
// If the call result is in ST0 / ST1, it needs to be popped off the x87
// stack. Therefore, if it's not used by the call it is not safe to optimize
// this into a sibcall.
bool Unused = false;
for (unsigned i = 0, e = Ins.size(); i != e; ++i) {
if (!Ins[i].Used) {
Unused = true;
break;
}
}
if (Unused) {
SmallVector<CCValAssign, 16> RVLocs;
CCState CCInfo(CalleeCC, false, MF, RVLocs, C);
CCInfo.AnalyzeCallResult(Ins, RetCC_X86);
for (unsigned i = 0, e = RVLocs.size(); i != e; ++i) {
CCValAssign &VA = RVLocs[i];
if (VA.getLocReg() == X86::FP0 || VA.getLocReg() == X86::FP1)
return false;
}
}
// Check that the call results are passed in the same way.
if (!CCState::resultsCompatible(CalleeCC, CallerCC, MF, C, Ins,
RetCC_X86, RetCC_X86))
return false;
// The callee has to preserve all registers the caller needs to preserve.
const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
const uint32_t *CallerPreserved = TRI->getCallPreservedMask(MF, CallerCC);
if (!CCMatch) {
const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF, CalleeCC);
if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
return false;
}
unsigned StackArgsSize = 0;
// If the callee takes no arguments then go on to check the results of the
// call.
if (!Outs.empty()) {
// Check if stack adjustment is needed. For now, do not do this if any
// argument is passed on the stack.
SmallVector<CCValAssign, 16> ArgLocs;
CCState CCInfo(CalleeCC, isVarArg, MF, ArgLocs, C);
// Allocate shadow area for Win64
if (IsCalleeWin64)
CCInfo.AllocateStack(32, 8);
CCInfo.AnalyzeCallOperands(Outs, CC_X86);
StackArgsSize = CCInfo.getNextStackOffset();
if (CCInfo.getNextStackOffset()) {
// Check if the arguments are already laid out in the right way as
// the caller's fixed stack objects.
MachineFrameInfo &MFI = MF.getFrameInfo();
const MachineRegisterInfo *MRI = &MF.getRegInfo();
const X86InstrInfo *TII = Subtarget.getInstrInfo();
for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i) {
CCValAssign &VA = ArgLocs[i];
SDValue Arg = OutVals[i];
ISD::ArgFlagsTy Flags = Outs[i].Flags;
if (VA.getLocInfo() == CCValAssign::Indirect)
return false;
if (!VA.isRegLoc()) {
if (!MatchingStackOffset(Arg, VA.getLocMemOffset(), Flags,
MFI, MRI, TII, VA))
return false;
}
}
}
bool PositionIndependent = isPositionIndependent();
// If the tailcall address may be in a register, then make sure it's
// possible to register allocate for it. In 32-bit, the call address can
// only target EAX, EDX, or ECX since the tail call must be scheduled after
// callee-saved registers are restored. These happen to be the same
// registers used to pass 'inreg' arguments so watch out for those.
if (!Subtarget.is64Bit() && ((!isa<GlobalAddressSDNode>(Callee) &&
!isa<ExternalSymbolSDNode>(Callee)) ||
PositionIndependent)) {
unsigned NumInRegs = 0;
// In PIC we need an extra register to formulate the address computation
// for the callee.
unsigned MaxInRegs = PositionIndependent ? 2 : 3;
for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i) {
CCValAssign &VA = ArgLocs[i];
if (!VA.isRegLoc())
continue;
unsigned Reg = VA.getLocReg();
switch (Reg) {
default: break;
case X86::EAX: case X86::EDX: case X86::ECX:
if (++NumInRegs == MaxInRegs)
return false;
break;
}
}
}
const MachineRegisterInfo &MRI = MF.getRegInfo();
if (!parametersInCSRMatch(MRI, CallerPreserved, ArgLocs, OutVals))
return false;
}
bool CalleeWillPop =
X86::isCalleePop(CalleeCC, Subtarget.is64Bit(), isVarArg,
MF.getTarget().Options.GuaranteedTailCallOpt);
if (unsigned BytesToPop =
MF.getInfo<X86MachineFunctionInfo>()->getBytesToPopOnReturn()) {
// If we have bytes to pop, the callee must pop them.
bool CalleePopMatches = CalleeWillPop && BytesToPop == StackArgsSize;
if (!CalleePopMatches)
return false;
} else if (CalleeWillPop && StackArgsSize > 0) {
// If we don't have bytes to pop, make sure the callee doesn't pop any.
return false;
}
return true;
}
FastISel *
X86TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo) const {
return X86::createFastISel(funcInfo, libInfo);
}
//===----------------------------------------------------------------------===//
// Other Lowering Hooks
//===----------------------------------------------------------------------===//
static bool MayFoldLoad(SDValue Op) {
return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
static bool MayFoldIntoStore(SDValue Op) {
return Op.hasOneUse() && ISD::isNormalStore(*Op.getNode()->use_begin());
}
static bool MayFoldIntoZeroExtend(SDValue Op) {
if (Op.hasOneUse()) {
unsigned Opcode = Op.getNode()->use_begin()->getOpcode();
return (ISD::ZERO_EXTEND == Opcode);
}
return false;
}
static bool isTargetShuffle(unsigned Opcode) {
switch(Opcode) {
default: return false;
case X86ISD::BLENDI:
case X86ISD::PSHUFB:
case X86ISD::PSHUFD:
case X86ISD::PSHUFHW:
case X86ISD::PSHUFLW:
case X86ISD::SHUFP:
case X86ISD::INSERTPS:
case X86ISD::EXTRQI:
case X86ISD::INSERTQI:
case X86ISD::PALIGNR:
case X86ISD::VSHLDQ:
case X86ISD::VSRLDQ:
case X86ISD::MOVLHPS:
case X86ISD::MOVLHPD:
case X86ISD::MOVHLPS:
case X86ISD::MOVLPS:
case X86ISD::MOVLPD:
case X86ISD::MOVSHDUP:
case X86ISD::MOVSLDUP:
case X86ISD::MOVDDUP:
case X86ISD::MOVSS:
case X86ISD::MOVSD:
case X86ISD::UNPCKL:
case X86ISD::UNPCKH:
case X86ISD::VBROADCAST:
case X86ISD::VPERMILPI:
case X86ISD::VPERMILPV:
case X86ISD::VPERM2X128:
case X86ISD::VPERMIL2:
case X86ISD::VPERMI:
case X86ISD::VPPERM:
case X86ISD::VPERMV:
case X86ISD::VPERMV3:
case X86ISD::VPERMIV3:
case X86ISD::VZEXT_MOVL:
return true;
}
}
static bool isTargetShuffleVariableMask(unsigned Opcode) {
switch (Opcode) {
default: return false;
// Target Shuffles.
case X86ISD::PSHUFB:
case X86ISD::VPERMILPV:
case X86ISD::VPERMIL2:
case X86ISD::VPPERM:
case X86ISD::VPERMV:
case X86ISD::VPERMV3:
case X86ISD::VPERMIV3:
return true;
// 'Faux' Target Shuffles.
case ISD::AND:
case X86ISD::ANDNP:
return true;
}
}
SDValue X86TargetLowering::getReturnAddressFrameIndex(SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
int ReturnAddrIndex = FuncInfo->getRAIndex();
if (ReturnAddrIndex == 0) {
// Set up a frame object for the return address.
unsigned SlotSize = RegInfo->getSlotSize();
ReturnAddrIndex = MF.getFrameInfo().CreateFixedObject(SlotSize,
-(int64_t)SlotSize,
false);
FuncInfo->setRAIndex(ReturnAddrIndex);
}
return DAG.getFrameIndex(ReturnAddrIndex, getPointerTy(DAG.getDataLayout()));
}
bool X86::isOffsetSuitableForCodeModel(int64_t Offset, CodeModel::Model M,
bool hasSymbolicDisplacement) {
// The offset should fit into a 32-bit immediate field.
if (!isInt<32>(Offset))
return false;
// If we don't have a symbolic displacement - we don't have any extra
// restrictions.
if (!hasSymbolicDisplacement)
return true;
// FIXME: Some tweaks might be needed for medium code model.
if (M != CodeModel::Small && M != CodeModel::Kernel)
return false;
// For the small code model we assume that the latest object is 16MB below
// the end of the 31-bit address boundary. We may also accept pretty large
// negative constants, knowing that all objects are in the positive half of
// the address space.
if (M == CodeModel::Small && Offset < 16*1024*1024)
return true;
// For the kernel code model we know that all objects reside in the negative
// half of the 32-bit address space. We must not accept negative offsets,
// since they may push the address out of range, but we may accept pretty
// large positive ones.
if (M == CodeModel::Kernel && Offset >= 0)
return true;
return false;
}
/// Determines whether the callee is required to pop its own arguments.
/// Callee pop is necessary to support tail calls.
bool X86::isCalleePop(CallingConv::ID CallingConv,
bool is64Bit, bool IsVarArg, bool GuaranteeTCO) {
// If GuaranteeTCO is true, we force some calls to be callee pop so that we
// can guarantee TCO.
if (!IsVarArg && shouldGuaranteeTCO(CallingConv, GuaranteeTCO))
return true;
switch (CallingConv) {
default:
return false;
case CallingConv::X86_StdCall:
case CallingConv::X86_FastCall:
case CallingConv::X86_ThisCall:
case CallingConv::X86_VectorCall:
return !is64Bit;
}
}
/// \brief Return true if the condition is an unsigned comparison operation.
static bool isX86CCUnsigned(unsigned X86CC) {
switch (X86CC) {
default:
llvm_unreachable("Invalid integer condition!");
case X86::COND_E:
case X86::COND_NE:
case X86::COND_B:
case X86::COND_A:
case X86::COND_BE:
case X86::COND_AE:
return true;
case X86::COND_G:
case X86::COND_GE:
case X86::COND_L:
case X86::COND_LE:
return false;
}
}
static X86::CondCode TranslateIntegerX86CC(ISD::CondCode SetCCOpcode) {
switch (SetCCOpcode) {
default: llvm_unreachable("Invalid integer condition!");
case ISD::SETEQ: return X86::COND_E;
case ISD::SETGT: return X86::COND_G;
case ISD::SETGE: return X86::COND_GE;
case ISD::SETLT: return X86::COND_L;
case ISD::SETLE: return X86::COND_LE;
case ISD::SETNE: return X86::COND_NE;
case ISD::SETULT: return X86::COND_B;
case ISD::SETUGT: return X86::COND_A;
case ISD::SETULE: return X86::COND_BE;
case ISD::SETUGE: return X86::COND_AE;
}
}
/// Do a one-to-one translation of an ISD::CondCode to the X86-specific
/// condition code, returning the condition code and the LHS/RHS of the
/// comparison to make.
static X86::CondCode TranslateX86CC(ISD::CondCode SetCCOpcode, const SDLoc &DL,
bool isFP, SDValue &LHS, SDValue &RHS,
SelectionDAG &DAG) {
if (!isFP) {
if (ConstantSDNode *RHSC = dyn_cast<ConstantSDNode>(RHS)) {
if (SetCCOpcode == ISD::SETGT && RHSC->isAllOnesValue()) {
// X > -1 -> X == 0, jump !sign.
RHS = DAG.getConstant(0, DL, RHS.getValueType());
return X86::COND_NS;
}
if (SetCCOpcode == ISD::SETLT && RHSC->isNullValue()) {
// X < 0 -> X == 0, jump on sign.
return X86::COND_S;
}
if (SetCCOpcode == ISD::SETLT && RHSC->getZExtValue() == 1) {
// X < 1 -> X <= 0
RHS = DAG.getConstant(0, DL, RHS.getValueType());
return X86::COND_LE;
}
}
return TranslateIntegerX86CC(SetCCOpcode);
}
// First determine if it is required or profitable to flip the operands.
// If LHS is a foldable load, but RHS is not, flip the condition.
if (ISD::isNON_EXTLoad(LHS.getNode()) &&
!ISD::isNON_EXTLoad(RHS.getNode())) {
SetCCOpcode = getSetCCSwappedOperands(SetCCOpcode);
std::swap(LHS, RHS);
}
switch (SetCCOpcode) {
default: break;
case ISD::SETOLT:
case ISD::SETOLE:
case ISD::SETUGT:
case ISD::SETUGE:
std::swap(LHS, RHS);
break;
}
// On a floating point condition, the flags are set as follows:
// ZF PF CF op
// 0 | 0 | 0 | X > Y
// 0 | 0 | 1 | X < Y
// 1 | 0 | 0 | X == Y
// 1 | 1 | 1 | unordered
switch (SetCCOpcode) {
default: llvm_unreachable("Condcode should be pre-legalized away");
case ISD::SETUEQ:
case ISD::SETEQ: return X86::COND_E;
case ISD::SETOLT: // flipped
case ISD::SETOGT:
case ISD::SETGT: return X86::COND_A;
case ISD::SETOLE: // flipped
case ISD::SETOGE:
case ISD::SETGE: return X86::COND_AE;
case ISD::SETUGT: // flipped
case ISD::SETULT:
case ISD::SETLT: return X86::COND_B;
case ISD::SETUGE: // flipped
case ISD::SETULE:
case ISD::SETLE: return X86::COND_BE;
case ISD::SETONE:
case ISD::SETNE: return X86::COND_NE;
case ISD::SETUO: return X86::COND_P;
case ISD::SETO: return X86::COND_NP;
case ISD::SETOEQ:
case ISD::SETUNE: return X86::COND_INVALID;
}
}
/// Is there a floating point cmov for the specific X86 condition code?
/// The current x86 ISA includes the following FP cmov instructions:
/// fcmovb, fcmovbe, fcmove, fcmovu, fcmovae, fcmova, fcmovne, fcmovnu.
static bool hasFPCMov(unsigned X86CC) {
switch (X86CC) {
default:
return false;
case X86::COND_B:
case X86::COND_BE:
case X86::COND_E:
case X86::COND_P:
case X86::COND_A:
case X86::COND_AE:
case X86::COND_NE:
case X86::COND_NP:
return true;
}
}
bool X86TargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
const CallInst &I,
unsigned Intrinsic) const {
const IntrinsicData* IntrData = getIntrinsicWithChain(Intrinsic);
if (!IntrData)
return false;
Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.readMem = false;
Info.writeMem = false;
Info.vol = false;
Info.offset = 0;
switch (IntrData->Type) {
case EXPAND_FROM_MEM: {
Info.ptrVal = I.getArgOperand(0);
Info.memVT = MVT::getVT(I.getType());
Info.align = 1;
Info.readMem = true;
break;
}
case COMPRESS_TO_MEM: {
Info.ptrVal = I.getArgOperand(0);
Info.memVT = MVT::getVT(I.getArgOperand(1)->getType());
Info.align = 1;
Info.writeMem = true;
break;
}
case TRUNCATE_TO_MEM_VI8:
case TRUNCATE_TO_MEM_VI16:
case TRUNCATE_TO_MEM_VI32: {
Info.ptrVal = I.getArgOperand(0);
MVT VT = MVT::getVT(I.getArgOperand(1)->getType());
MVT ScalarVT = MVT::INVALID_SIMPLE_VALUE_TYPE;
if (IntrData->Type == TRUNCATE_TO_MEM_VI8)
ScalarVT = MVT::i8;
else if (IntrData->Type == TRUNCATE_TO_MEM_VI16)
ScalarVT = MVT::i16;
else if (IntrData->Type == TRUNCATE_TO_MEM_VI32)
ScalarVT = MVT::i32;
Info.memVT = MVT::getVectorVT(ScalarVT, VT.getVectorNumElements());
Info.align = 1;
Info.writeMem = true;
break;
}
default:
return false;
}
return true;
}
/// Returns true if the target can instruction select the
/// specified FP immediate natively. If false, the legalizer will
/// materialize the FP immediate as a load from a constant pool.
bool X86TargetLowering::isFPImmLegal(const APFloat &Imm, EVT VT) const {
for (unsigned i = 0, e = LegalFPImmediates.size(); i != e; ++i) {
if (Imm.bitwiseIsEqual(LegalFPImmediates[i]))
return true;
}
return false;
}
bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
ISD::LoadExtType ExtTy,
EVT NewVT) const {
// "ELF Handling for Thread-Local Storage" specifies that R_X86_64_GOTTPOFF
// relocation target a movq or addq instruction: don't let the load shrink.
SDValue BasePtr = cast<LoadSDNode>(Load)->getBasePtr();
if (BasePtr.getOpcode() == X86ISD::WrapperRIP)
if (const auto *GA = dyn_cast<GlobalAddressSDNode>(BasePtr.getOperand(0)))
return GA->getTargetFlags() != X86II::MO_GOTTPOFF;
return true;
}
/// \brief Returns true if it is beneficial to convert a load of a constant
/// to just the constant itself.
bool X86TargetLowering::shouldConvertConstantLoadToIntImm(const APInt &Imm,
Type *Ty) const {
assert(Ty->isIntegerTy());
unsigned BitSize = Ty->getPrimitiveSizeInBits();
if (BitSize == 0 || BitSize > 64)
return false;
return true;
}
bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT,
unsigned Index) const {
if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
return false;
return (Index == 0 || Index == ResVT.getVectorNumElements());
}
bool X86TargetLowering::isCheapToSpeculateCttz() const {
// Speculate cttz only if we can directly use TZCNT.
return Subtarget.hasBMI();
}
bool X86TargetLowering::isCheapToSpeculateCtlz() const {
// Speculate ctlz only if we can directly use LZCNT.
return Subtarget.hasLZCNT();
}
bool X86TargetLowering::isCtlzFast() const {
return Subtarget.hasFastLZCNT();
}
bool X86TargetLowering::isMaskAndCmp0FoldingBeneficial(
const Instruction &AndI) const {
return true;
}
bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {
if (!Subtarget.hasBMI())
return false;
// There are only 32-bit and 64-bit forms for 'andn'.
EVT VT = Y.getValueType();
if (VT != MVT::i32 && VT != MVT::i64)
return false;
return true;
}
MVT X86TargetLowering::hasFastEqualityCompare(unsigned NumBits) const {
MVT VT = MVT::getIntegerVT(NumBits);
if (isTypeLegal(VT))
return VT;
// PMOVMSKB can handle this.
if (NumBits == 128 && isTypeLegal(MVT::v16i8))
return MVT::v16i8;
// VPMOVMSKB can handle this.
if (NumBits == 256 && isTypeLegal(MVT::v32i8))
return MVT::v32i8;
// TODO: Allow 64-bit type for 32-bit target.
// TODO: 512-bit types should be allowed, but make sure that those
// cases are handled in combineVectorSizedSetCCEquality().
return MVT::INVALID_SIMPLE_VALUE_TYPE;
}
/// Val is the undef sentinel value or equal to the specified value.
static bool isUndefOrEqual(int Val, int CmpVal) {
return ((Val == SM_SentinelUndef) || (Val == CmpVal));
}
/// Val is either the undef or zero sentinel value.
static bool isUndefOrZero(int Val) {
return ((Val == SM_SentinelUndef) || (Val == SM_SentinelZero));
}
/// Return true if every element in Mask, beginning at position Pos and
/// ending at Pos+Size, is the undef sentinel value.
static bool isUndefInRange(ArrayRef<int> Mask, unsigned Pos, unsigned Size) {
for (unsigned i = Pos, e = Pos + Size; i != e; ++i)
if (Mask[i] != SM_SentinelUndef)
return false;
return true;
}
/// Return true if Val is undef or if its value falls within the
/// specified range [Low, Hi).
static bool isUndefOrInRange(int Val, int Low, int Hi) {
return (Val == SM_SentinelUndef) || (Val >= Low && Val < Hi);
}
/// Return true if every element in Mask is undef or if its value
/// falls within the specified range [Low, Hi).
static bool isUndefOrInRange(ArrayRef<int> Mask,
int Low, int Hi) {
for (int M : Mask)
if (!isUndefOrInRange(M, Low, Hi))
return false;
return true;
}
/// Return true if Val is undef, zero or if its value falls within the
/// specified range [Low, Hi).
static bool isUndefOrZeroOrInRange(int Val, int Low, int Hi) {
return isUndefOrZero(Val) || (Val >= Low && Val < Hi);
}
/// Return true if every element in Mask is undef, zero or if its value
/// falls within the specified range [Low, Hi).
static bool isUndefOrZeroOrInRange(ArrayRef<int> Mask, int Low, int Hi) {
for (int M : Mask)
if (!isUndefOrZeroOrInRange(M, Low, Hi))
return false;
return true;
}
/// Return true if every element in Mask, beginning at position Pos and
/// ending at Pos+Size, falls within the specified sequential range
/// [Low, Low+Size), or is undef.
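/// For example (an illustrative case): Mask = <4, 5, -1, 7> with Pos = 0,
/// Size = 4 and Low = 4 returns true, since element 2 is undef and the
/// remaining elements match the sequence 4, 5, 6, 7.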
static bool isSequentialOrUndefInRange(ArrayRef<int> Mask,
unsigned Pos, unsigned Size, int Low) {
for (unsigned i = Pos, e = Pos+Size; i != e; ++i, ++Low)
if (!isUndefOrEqual(Mask[i], Low))
return false;
return true;
}
/// Return true if every element in Mask, beginning at position Pos and
/// ending at Pos+Size, falls within the specified sequential range
/// [Low, Low+Size), or is undef or zero.
static bool isSequentialOrUndefOrZeroInRange(ArrayRef<int> Mask, unsigned Pos,
unsigned Size, int Low) {
for (unsigned i = Pos, e = Pos + Size; i != e; ++i, ++Low)
if (!isUndefOrZero(Mask[i]) && Mask[i] != Low)
return false;
return true;
}
/// Return true if every element in Mask, beginning at position Pos and
/// ending at Pos+Size, is undef or zero.
static bool isUndefOrZeroInRange(ArrayRef<int> Mask, unsigned Pos,
unsigned Size) {
for (unsigned i = Pos, e = Pos + Size; i != e; ++i)
if (!isUndefOrZero(Mask[i]))
return false;
return true;
}
/// \brief Helper function to test whether a shuffle mask could be
/// simplified by widening the elements being shuffled.
///
/// Appends the mask for wider elements in WidenedMask if valid. Otherwise
/// leaves it in an unspecified state.
///
/// NOTE: This must handle normal vector shuffle masks and *target* vector
/// shuffle masks. The latter have the special property of a '-2' representing
/// a zero-ed lane of a vector.
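///
/// For example (an illustrative case): the mask <0, 1, 2, 3, -1, -1, 6, 7>
/// widens to <0, 1, -1, 3>, pairing adjacent elements into single wider
/// elements.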
static bool canWidenShuffleElements(ArrayRef<int> Mask,
SmallVectorImpl<int> &WidenedMask) {
WidenedMask.assign(Mask.size() / 2, 0);
for (int i = 0, Size = Mask.size(); i < Size; i += 2) {
int M0 = Mask[i];
int M1 = Mask[i + 1];
// If both elements are undef, it's trivial.
if (M0 == SM_SentinelUndef && M1 == SM_SentinelUndef) {
WidenedMask[i / 2] = SM_SentinelUndef;
continue;
}
// Check for an undef mask and a mask value properly aligned to fit with
// a pair of values. If we find such a case, use the non-undef mask's value.
if (M0 == SM_SentinelUndef && M1 >= 0 && (M1 % 2) == 1) {
WidenedMask[i / 2] = M1 / 2;
continue;
}
if (M1 == SM_SentinelUndef && M0 >= 0 && (M0 % 2) == 0) {
WidenedMask[i / 2] = M0 / 2;
continue;
}
// When zeroing, we need to spread the zeroing across both lanes to widen.
if (M0 == SM_SentinelZero || M1 == SM_SentinelZero) {
if ((M0 == SM_SentinelZero || M0 == SM_SentinelUndef) &&
(M1 == SM_SentinelZero || M1 == SM_SentinelUndef)) {
WidenedMask[i / 2] = SM_SentinelZero;
continue;
}
return false;
}
// Finally check if the two mask values are adjacent and aligned with
// a pair.
if (M0 != SM_SentinelUndef && (M0 % 2) == 0 && (M0 + 1) == M1) {
WidenedMask[i / 2] = M0 / 2;
continue;
}
// Otherwise we can't safely widen the elements used in this shuffle.
return false;
}
assert(WidenedMask.size() == Mask.size() / 2 &&
"Incorrect size of mask after widening the elements!");
return true;
}
/// Helper function to scale a shuffle or target shuffle mask, replacing each
/// mask index with the scaled sequential indices for an equivalent narrowed
/// mask. This is the reverse process to canWidenShuffleElements, but can always
/// succeed.
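///
/// For example (an illustrative case): scaling the mask <0, -1, 3, 2> by 2
/// produces <0, 1, -1, -1, 6, 7, 4, 5>.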
static void scaleShuffleMask(int Scale, ArrayRef<int> Mask,
SmallVectorImpl<int> &ScaledMask) {
assert(0 < Scale && "Unexpected scaling factor");
int NumElts = Mask.size();
ScaledMask.assign(static_cast<size_t>(NumElts * Scale), -1);
for (int i = 0; i != NumElts; ++i) {
int M = Mask[i];
// Repeat sentinel values in every mask element.
if (M < 0) {
for (int s = 0; s != Scale; ++s)
ScaledMask[(Scale * i) + s] = M;
continue;
}
// Scale mask element and increment across each mask element.
for (int s = 0; s != Scale; ++s)
ScaledMask[(Scale * i) + s] = (Scale * M) + s;
}
}
/// Return true if the specified EXTRACT_SUBVECTOR operand specifies a vector
/// extract that is suitable for instructions that extract 128- or 256-bit
/// vectors.
static bool isVEXTRACTIndex(SDNode *N, unsigned vecWidth) {
assert((vecWidth == 128 || vecWidth == 256) && "Unexpected vector width");
if (!isa<ConstantSDNode>(N->getOperand(1).getNode()))
return false;
// The index should be aligned on a vecWidth-bit boundary.
uint64_t Index = N->getConstantOperandVal(1);
MVT VT = N->getSimpleValueType(0);
unsigned ElSize = VT.getScalarSizeInBits();
return (Index * ElSize) % vecWidth == 0;
}
/// Return true if the specified INSERT_SUBVECTOR operand specifies a
/// subvector insert that is suitable for the insertion of 128- or 256-bit
/// subvectors.
static bool isVINSERTIndex(SDNode *N, unsigned vecWidth) {
assert((vecWidth == 128 || vecWidth == 256) && "Unexpected vector width");
if (!isa<ConstantSDNode>(N->getOperand(2).getNode()))
return false;
// The index should be aligned on a vecWidth-bit boundary.
uint64_t Index = N->getConstantOperandVal(2);
MVT VT = N->getSimpleValueType(0);
unsigned ElSize = VT.getScalarSizeInBits();
return (Index * ElSize) % vecWidth == 0;
}
bool X86::isVINSERT128Index(SDNode *N) {
return isVINSERTIndex(N, 128);
}
bool X86::isVINSERT256Index(SDNode *N) {
return isVINSERTIndex(N, 256);
}
bool X86::isVEXTRACT128Index(SDNode *N) {
return isVEXTRACTIndex(N, 128);
}
bool X86::isVEXTRACT256Index(SDNode *N) {
return isVEXTRACTIndex(N, 256);
}
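// For example (an illustrative case): extracting the upper 128-bit half of
// a v8i32 vector (element index 4) yields the immediate 1, since each
// 128-bit chunk holds four i32 elements.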
static unsigned getExtractVEXTRACTImmediate(SDNode *N, unsigned vecWidth) {
assert((vecWidth == 128 || vecWidth == 256) && "Unsupported vector width");
assert(isa<ConstantSDNode>(N->getOperand(1).getNode()) &&
"Illegal extract subvector for VEXTRACT");
uint64_t Index = N->getConstantOperandVal(1);
MVT VecVT = N->getOperand(0).getSimpleValueType();
unsigned NumElemsPerChunk = vecWidth / VecVT.getScalarSizeInBits();
return Index / NumElemsPerChunk;
}
static unsigned getInsertVINSERTImmediate(SDNode *N, unsigned vecWidth) {
assert((vecWidth == 128 || vecWidth == 256) && "Unsupported vector width");
assert(isa<ConstantSDNode>(N->getOperand(2).getNode()) &&
"Illegal insert subvector for VINSERT");
uint64_t Index = N->getConstantOperandVal(2);
MVT VecVT = N->getSimpleValueType(0);
unsigned NumElemsPerChunk = vecWidth / VecVT.getScalarSizeInBits();
return Index / NumElemsPerChunk;
}
/// Return the appropriate immediate to extract the specified
/// EXTRACT_SUBVECTOR index with VEXTRACTF128 and VEXTRACTI128 instructions.
unsigned X86::getExtractVEXTRACT128Immediate(SDNode *N) {
return getExtractVEXTRACTImmediate(N, 128);
}
/// Return the appropriate immediate to extract the specified
/// EXTRACT_SUBVECTOR index with VEXTRACTF64x4 and VEXTRACTI64x4 instructions.
unsigned X86::getExtractVEXTRACT256Immediate(SDNode *N) {
return getExtractVEXTRACTImmediate(N, 256);
}
/// Return the appropriate immediate to insert at the specified
/// INSERT_SUBVECTOR index with VINSERTF128 and VINSERTI128 instructions.
unsigned X86::getInsertVINSERT128Immediate(SDNode *N) {
return getInsertVINSERTImmediate(N, 128);
}
/// Return the appropriate immediate to insert at the specified
/// INSERT_SUBVECTOR index with VINSERTF64x4 and VINSERTI64x4 instructions.
unsigned X86::getInsertVINSERT256Immediate(SDNode *N) {
return getInsertVINSERTImmediate(N, 256);
}
/// Returns true if Elt is a constant zero or a floating point constant +0.0.
bool X86::isZeroNode(SDValue Elt) {
return isNullConstant(Elt) || isNullFPConstant(Elt);
}
// Build a vector of constants. Use an UNDEF node if MaskElt == -1.
// Split 64-bit constants in 32-bit mode (i.e. when i64 is not a legal type).
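// For example (an illustrative case): in 32-bit mode the v2i64 mask
// <M0, M1> is built as the v4i32 vector <M0, 0, M1, 0> and then bitcast
// back to v2i64.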
static SDValue getConstVector(ArrayRef<int> Values, MVT VT, SelectionDAG &DAG,
const SDLoc &dl, bool IsMask = false) {
SmallVector<SDValue, 32> Ops;
bool Split = false;
MVT ConstVecVT = VT;
unsigned NumElts = VT.getVectorNumElements();
bool In64BitMode = DAG.getTargetLoweringInfo().isTypeLegal(MVT::i64);
if (!In64BitMode && VT.getVectorElementType() == MVT::i64) {
ConstVecVT = MVT::getVectorVT(MVT::i32, NumElts * 2);
Split = true;
}
MVT EltVT = ConstVecVT.getVectorElementType();
for (unsigned i = 0; i < NumElts; ++i) {
bool IsUndef = Values[i] < 0 && IsMask;
SDValue OpNode = IsUndef ? DAG.getUNDEF(EltVT) :
DAG.getConstant(Values[i], dl, EltVT);
Ops.push_back(OpNode);
if (Split)
Ops.push_back(IsUndef ? DAG.getUNDEF(EltVT) :
DAG.getConstant(0, dl, EltVT));
}
SDValue ConstsNode = DAG.getBuildVector(ConstVecVT, dl, Ops);
if (Split)
ConstsNode = DAG.getBitcast(VT, ConstsNode);
return ConstsNode;
}
static SDValue getConstVector(ArrayRef<APInt> Bits, APInt &Undefs,
MVT VT, SelectionDAG &DAG, const SDLoc &dl) {
assert(Bits.size() == Undefs.getBitWidth() &&
"Unequal constant and undef arrays");
SmallVector<SDValue, 32> Ops;
bool Split = false;
MVT ConstVecVT = VT;
unsigned NumElts = VT.getVectorNumElements();
bool In64BitMode = DAG.getTargetLoweringInfo().isTypeLegal(MVT::i64);
if (!In64BitMode && VT.getVectorElementType() == MVT::i64) {
ConstVecVT = MVT::getVectorVT(MVT::i32, NumElts * 2);
Split = true;
}
MVT EltVT = ConstVecVT.getVectorElementType();
for (unsigned i = 0, e = Bits.size(); i != e; ++i) {
if (Undefs[i]) {
Ops.append(Split ? 2 : 1, DAG.getUNDEF(EltVT));
continue;
}
const APInt &V = Bits[i];
assert(V.getBitWidth() == VT.getScalarSizeInBits() && "Unexpected sizes");
if (Split) {
Ops.push_back(DAG.getConstant(V.trunc(32), dl, EltVT));
Ops.push_back(DAG.getConstant(V.lshr(32).trunc(32), dl, EltVT));
} else if (EltVT == MVT::f32) {
APFloat FV(APFloat::IEEEsingle(), V);
Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
} else if (EltVT == MVT::f64) {
APFloat FV(APFloat::IEEEdouble(), V);
Ops.push_back(DAG.getConstantFP(FV, dl, EltVT));
} else {
Ops.push_back(DAG.getConstant(V, dl, EltVT));
}
}
SDValue ConstsNode = DAG.getBuildVector(ConstVecVT, dl, Ops);
return DAG.getBitcast(VT, ConstsNode);
}
/// Returns a vector of specified type with all zero elements.
static SDValue getZeroVector(MVT VT, const X86Subtarget &Subtarget,
SelectionDAG &DAG, const SDLoc &dl) {
assert((VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector() ||
VT.getVectorElementType() == MVT::i1) &&
"Unexpected vector type");
// Try to build SSE/AVX zero vectors as <N x i32> bitcasted to their dest
// type. This ensures they get CSE'd. But if the integer type is not
// available, use a floating-point +0.0 instead.
SDValue Vec;
if (!Subtarget.hasSSE2() && VT.is128BitVector()) {
Vec = DAG.getConstantFP(+0.0, dl, MVT::v4f32);
} else if (VT.getVectorElementType() == MVT::i1) {
assert((Subtarget.hasBWI() || VT.getVectorNumElements() <= 16) &&
"Unexpected vector type");
assert((Subtarget.hasVLX() || VT.getVectorNumElements() >= 8) &&
"Unexpected vector type");
Vec = DAG.getConstant(0, dl, VT);
} else {
unsigned Num32BitElts = VT.getSizeInBits() / 32;
Vec = DAG.getConstant(0, dl, MVT::getVectorVT(MVT::i32, Num32BitElts));
}
return DAG.getBitcast(VT, Vec);
}
static SDValue extractSubVector(SDValue Vec, unsigned IdxVal, SelectionDAG &DAG,
const SDLoc &dl, unsigned vectorWidth) {
EVT VT = Vec.getValueType();
EVT ElVT = VT.getVectorElementType();
unsigned Factor = VT.getSizeInBits()/vectorWidth;
EVT ResultVT = EVT::getVectorVT(*DAG.getContext(), ElVT,
VT.getVectorNumElements()/Factor);
// Extract the relevant vectorWidth bits. Generate an EXTRACT_SUBVECTOR
unsigned ElemsPerChunk = vectorWidth / ElVT.getSizeInBits();
assert(isPowerOf2_32(ElemsPerChunk) && "Elements per chunk not power of 2");
// This is the index of the first element of the vectorWidth-bit chunk
// we want. Since ElemsPerChunk is a power of 2 just need to clear bits.
IdxVal &= ~(ElemsPerChunk - 1);
// If the input is a buildvector just emit a smaller one.
if (Vec.getOpcode() == ISD::BUILD_VECTOR)
return DAG.getBuildVector(
ResultVT, dl, makeArrayRef(Vec->op_begin() + IdxVal, ElemsPerChunk));
SDValue VecIdx = DAG.getIntPtrConstant(IdxVal, dl);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, ResultVT, Vec, VecIdx);
}
/// Generate a DAG to grab 128 bits from a vector > 128 bits. This
/// sets things up to match to an AVX VEXTRACTF128 / VEXTRACTI128
/// or AVX-512 VEXTRACTF32x4 / VEXTRACTI32x4
/// instructions or a simple subregister reference. Idx is an index in the
/// 128 bits we want. It need not be aligned to a 128-bit boundary. That makes
/// lowering EXTRACT_VECTOR_ELT operations easier.
static SDValue extract128BitVector(SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, const SDLoc &dl) {
assert((Vec.getValueType().is256BitVector() ||
Vec.getValueType().is512BitVector()) && "Unexpected vector size!");
return extractSubVector(Vec, IdxVal, DAG, dl, 128);
}
/// Generate a DAG to grab 256 bits from a 512-bit vector.
static SDValue extract256BitVector(SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, const SDLoc &dl) {
assert(Vec.getValueType().is512BitVector() && "Unexpected vector size!");
return extractSubVector(Vec, IdxVal, DAG, dl, 256);
}
static SDValue insertSubVector(SDValue Result, SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, const SDLoc &dl,
unsigned vectorWidth) {
assert((vectorWidth == 128 || vectorWidth == 256) &&
"Unsupported vector width");
// Inserting an UNDEF subvector just returns Result unchanged.
if (Vec.isUndef())
return Result;
EVT VT = Vec.getValueType();
EVT ElVT = VT.getVectorElementType();
EVT ResultVT = Result.getValueType();
// Insert the relevant vectorWidth bits.
unsigned ElemsPerChunk = vectorWidth/ElVT.getSizeInBits();
assert(isPowerOf2_32(ElemsPerChunk) && "Elements per chunk not power of 2");
// This is the index of the first element of the vectorWidth-bit chunk
// we want. Since ElemsPerChunk is a power of 2 just need to clear bits.
IdxVal &= ~(ElemsPerChunk - 1);
SDValue VecIdx = DAG.getIntPtrConstant(IdxVal, dl);
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResultVT, Result, Vec, VecIdx);
}
/// Generate a DAG to put 128 bits into a vector > 128 bits. This
/// sets things up to match to an AVX VINSERTF128/VINSERTI128 or
/// AVX-512 VINSERTF32x4/VINSERTI32x4 instructions or a
/// simple superregister reference. Idx is an index in the 128 bits
/// we want. It need not be aligned to a 128-bit boundary. That makes
/// lowering INSERT_VECTOR_ELT operations easier.
static SDValue insert128BitVector(SDValue Result, SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, const SDLoc &dl) {
assert(Vec.getValueType().is128BitVector() && "Unexpected vector size!");
return insertSubVector(Result, Vec, IdxVal, DAG, dl, 128);
}
static SDValue insert256BitVector(SDValue Result, SDValue Vec, unsigned IdxVal,
SelectionDAG &DAG, const SDLoc &dl) {
assert(Vec.getValueType().is256BitVector() && "Unexpected vector size!");
return insertSubVector(Result, Vec, IdxVal, DAG, dl, 256);
}
// Return true if the instruction zeroes the unused upper part of the
// destination and accepts a mask.
static bool isMaskedZeroUpperBitsvXi1(unsigned int Opcode) {
switch (Opcode) {
default:
return false;
case X86ISD::PCMPEQM:
case X86ISD::PCMPGTM:
case X86ISD::CMPM:
case X86ISD::CMPMU:
return true;
}
}
/// Insert an i1-subvector into an i1-vector.
static SDValue insert1BitVector(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc dl(Op);
SDValue Vec = Op.getOperand(0);
SDValue SubVec = Op.getOperand(1);
SDValue Idx = Op.getOperand(2);
if (!isa<ConstantSDNode>(Idx))
return SDValue();
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
if (IdxVal == 0 && Vec.isUndef()) // the operation is legal
return Op;
MVT OpVT = Op.getSimpleValueType();
MVT SubVecVT = SubVec.getSimpleValueType();
unsigned NumElems = OpVT.getVectorNumElements();
unsigned SubVecNumElems = SubVecVT.getVectorNumElements();
assert(IdxVal + SubVecNumElems <= NumElems &&
IdxVal % SubVecVT.getSizeInBits() == 0 &&
"Unexpected index value in INSERT_SUBVECTOR");
// There are 3 possible cases:
// 1. Subvector should be inserted in the lower part (IdxVal == 0)
// 2. Subvector should be inserted in the upper part
// (IdxVal + SubVecNumElems == NumElems)
// 3. Subvector should be inserted in the middle (for example v2i1
// to v16i1, index 2)
// If this node widens - by concatenating zeroes - the type of the result
// of a node whose instruction zeroes all upper (irrelevant) bits of the
// output register, mark this node as legal so it can be replaced with the
// v8i1 version of the previous instruction during instruction selection.
// For example, the VPCMPEQDZ128rr instruction stores its v4i1 result in a
// k-register while zeroing the remaining upper 60 bits of the register. If
// the result of such an instruction is inserted into an all-zeroes vector,
// then we can safely remove the insert_subvector (in instruction selection)
// as the cmp instruction already zeroed the rest of the register.
if (ISD::isBuildVectorAllZeros(Vec.getNode()) && IdxVal == 0 &&
(isMaskedZeroUpperBitsvXi1(SubVec.getOpcode()) ||
(SubVec.getOpcode() == ISD::AND &&
(isMaskedZeroUpperBitsvXi1(SubVec.getOperand(0).getOpcode()) ||
isMaskedZeroUpperBitsvXi1(SubVec.getOperand(1).getOpcode())))))
return Op;
// Extend to a natively supported kshift vector type.
MVT MinVT = Subtarget.hasDQI() ? MVT::v8i1 : MVT::v16i1;
MVT WideOpVT = OpVT;
if (OpVT.getSizeInBits() < MinVT.getStoreSizeInBits())
WideOpVT = MinVT;
SDValue ZeroIdx = DAG.getIntPtrConstant(0, dl);
SDValue Undef = DAG.getUNDEF(WideOpVT);
SDValue WideSubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, WideOpVT,
Undef, SubVec, ZeroIdx);
// Extract the sub-vector if required.
auto ExtractSubVec = [&](SDValue V) {
return (WideOpVT == OpVT) ? V : DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl,
OpVT, V, ZeroIdx);
};
if (Vec.isUndef()) {
if (IdxVal != 0) {
SDValue ShiftBits = DAG.getConstant(IdxVal, dl, MVT::i8);
WideSubVec = DAG.getNode(X86ISD::KSHIFTL, dl, WideOpVT, WideSubVec,
ShiftBits);
}
return ExtractSubVec(WideSubVec);
}
if (ISD::isBuildVectorAllZeros(Vec.getNode())) {
NumElems = WideOpVT.getVectorNumElements();
unsigned ShiftLeft = NumElems - SubVecNumElems;
unsigned ShiftRight = NumElems - SubVecNumElems - IdxVal;
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, WideOpVT, WideSubVec,
DAG.getConstant(ShiftLeft, dl, MVT::i8));
Vec = ShiftRight ? DAG.getNode(X86ISD::KSHIFTR, dl, WideOpVT, Vec,
DAG.getConstant(ShiftRight, dl, MVT::i8)) : Vec;
return ExtractSubVec(Vec);
}
if (IdxVal == 0) {
// Zero lower bits of the Vec
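// (shifting the whole register right and then left again by SubVecNumElems
// clears exactly the low SubVecNumElems lanes).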
SDValue ShiftBits = DAG.getConstant(SubVecNumElems, dl, MVT::i8);
Vec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, WideOpVT, Undef, Vec, ZeroIdx);
Vec = DAG.getNode(X86ISD::KSHIFTR, dl, WideOpVT, Vec, ShiftBits);
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, WideOpVT, Vec, ShiftBits);
// Merge them together, SubVec should be zero extended.
WideSubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, WideOpVT,
getZeroVector(WideOpVT, Subtarget, DAG, dl),
SubVec, ZeroIdx);
Vec = DAG.getNode(ISD::OR, dl, WideOpVT, Vec, WideSubVec);
return ExtractSubVec(Vec);
}
// Simple case when we put the subvector in the upper part.
if (IdxVal + SubVecNumElems == NumElems) {
// Zero upper bits of the Vec
WideSubVec = DAG.getNode(X86ISD::KSHIFTL, dl, WideOpVT, WideSubVec,
DAG.getConstant(IdxVal, dl, MVT::i8));
SDValue ShiftBits = DAG.getConstant(SubVecNumElems, dl, MVT::i8);
Vec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, WideOpVT, Undef, Vec, ZeroIdx);
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, WideOpVT, Vec, ShiftBits);
Vec = DAG.getNode(X86ISD::KSHIFTR, dl, WideOpVT, Vec, ShiftBits);
Vec = DAG.getNode(ISD::OR, dl, WideOpVT, Vec, WideSubVec);
return ExtractSubVec(Vec);
}
// Subvector should be inserted in the middle - use shuffle
WideSubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, OpVT, Undef,
SubVec, ZeroIdx);
SmallVector<int, 64> Mask;
for (unsigned i = 0; i < NumElems; ++i)
Mask.push_back(i >= IdxVal && i < IdxVal + SubVecNumElems ?
i : i + NumElems);
return DAG.getVectorShuffle(OpVT, dl, WideSubVec, Vec, Mask);
}
/// Concat two 128-bit vectors into a 256-bit vector using VINSERTF128
/// instructions. This is used because creating CONCAT_VECTORS nodes of
/// BUILD_VECTORS returns a larger BUILD_VECTOR while we're trying to lower
/// large BUILD_VECTORS.
static SDValue concat128BitVectors(SDValue V1, SDValue V2, EVT VT,
unsigned NumElems, SelectionDAG &DAG,
const SDLoc &dl) {
SDValue V = insert128BitVector(DAG.getUNDEF(VT), V1, 0, DAG, dl);
return insert128BitVector(V, V2, NumElems / 2, DAG, dl);
}
static SDValue concat256BitVectors(SDValue V1, SDValue V2, EVT VT,
unsigned NumElems, SelectionDAG &DAG,
const SDLoc &dl) {
SDValue V = insert256BitVector(DAG.getUNDEF(VT), V1, 0, DAG, dl);
return insert256BitVector(V, V2, NumElems / 2, DAG, dl);
}
/// Returns a vector of specified type with all bits set.
/// Always build ones vectors as <4 x i32>, <8 x i32> or <16 x i32>.
/// Then bitcast to their original type, ensuring they get CSE'd.
static SDValue getOnesVector(EVT VT, SelectionDAG &DAG, const SDLoc &dl) {
assert((VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector()) &&
"Expected a 128/256/512-bit vector type");
APInt Ones = APInt::getAllOnesValue(32);
unsigned NumElts = VT.getSizeInBits() / 32;
SDValue Vec = DAG.getConstant(Ones, dl, MVT::getVectorVT(MVT::i32, NumElts));
return DAG.getBitcast(VT, Vec);
}
static SDValue getExtendInVec(unsigned Opc, const SDLoc &DL, EVT VT, SDValue In,
SelectionDAG &DAG) {
EVT InVT = In.getValueType();
assert((X86ISD::VSEXT == Opc || X86ISD::VZEXT == Opc) && "Unexpected opcode");
if (VT.is128BitVector() && InVT.is128BitVector())
return X86ISD::VSEXT == Opc ? DAG.getSignExtendVectorInReg(In, DL, VT)
: DAG.getZeroExtendVectorInReg(In, DL, VT);
// For 256-bit vectors, we only need the lower (128-bit) input half.
// For 512-bit vectors, we only need the lower input half or quarter.
if (VT.getSizeInBits() > 128 && InVT.getSizeInBits() > 128) {
int Scale = VT.getScalarSizeInBits() / InVT.getScalarSizeInBits();
In = extractSubVector(In, 0, DAG, DL,
std::max(128, (int)VT.getSizeInBits() / Scale));
}
return DAG.getNode(Opc, DL, VT, In);
}
/// Generate unpacklo/unpackhi shuffle mask.
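/// For example (an illustrative case): for v4i32 the binary unpacklo mask
/// is <0, 4, 1, 5> and the unpackhi mask is <2, 6, 3, 7>.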
static void createUnpackShuffleMask(MVT VT, SmallVectorImpl<int> &Mask, bool Lo,
bool Unary) {
assert(Mask.empty() && "Expected an empty shuffle mask vector");
int NumElts = VT.getVectorNumElements();
int NumEltsInLane = 128 / VT.getScalarSizeInBits();
for (int i = 0; i < NumElts; ++i) {
unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
int Pos = (i % NumEltsInLane) / 2 + LaneStart;
Pos += (Unary ? 0 : NumElts * (i % 2));
Pos += (Lo ? 0 : NumEltsInLane / 2);
Mask.push_back(Pos);
}
}
/// Returns a vector_shuffle node for an unpackl operation.
static SDValue getUnpackl(SelectionDAG &DAG, const SDLoc &dl, MVT VT,
SDValue V1, SDValue V2) {
SmallVector<int, 8> Mask;
createUnpackShuffleMask(VT, Mask, /* Lo = */ true, /* Unary = */ false);
return DAG.getVectorShuffle(VT, dl, V1, V2, Mask);
}
/// Returns a vector_shuffle node for an unpackh operation.
static SDValue getUnpackh(SelectionDAG &DAG, const SDLoc &dl, MVT VT,
SDValue V1, SDValue V2) {
SmallVector<int, 8> Mask;
createUnpackShuffleMask(VT, Mask, /* Lo = */ false, /* Unary = */ false);
return DAG.getVectorShuffle(VT, dl, V1, V2, Mask);
}
/// Return a vector_shuffle of the specified vector and a zero or undef vector.
/// This produces a shuffle where the low element of V2 is swizzled into the
/// zero/undef vector, landing at element Idx.
/// This produces a shuffle mask like 4,1,2,3 (idx=0) or 0,1,2,4 (idx=3).
static SDValue getShuffleVectorZeroOrUndef(SDValue V2, int Idx,
bool IsZero,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = V2.getSimpleValueType();
SDValue V1 = IsZero
? getZeroVector(VT, Subtarget, DAG, SDLoc(V2)) : DAG.getUNDEF(VT);
int NumElems = VT.getVectorNumElements();
SmallVector<int, 16> MaskVec(NumElems);
for (int i = 0; i != NumElems; ++i)
// If this is the insertion idx, put the low elt of V2 here.
MaskVec[i] = (i == Idx) ? NumElems : i;
return DAG.getVectorShuffle(VT, SDLoc(V2), V1, V2, MaskVec);
}
static SDValue peekThroughBitcasts(SDValue V) {
while (V.getNode() && V.getOpcode() == ISD::BITCAST)
V = V.getOperand(0);
return V;
}
static SDValue peekThroughOneUseBitcasts(SDValue V) {
while (V.getNode() && V.getOpcode() == ISD::BITCAST &&
V.getOperand(0).hasOneUse())
V = V.getOperand(0);
return V;
}
static const Constant *getTargetConstantFromNode(SDValue Op) {
Op = peekThroughBitcasts(Op);
auto *Load = dyn_cast<LoadSDNode>(Op);
if (!Load)
return nullptr;
SDValue Ptr = Load->getBasePtr();
if (Ptr->getOpcode() == X86ISD::Wrapper ||
Ptr->getOpcode() == X86ISD::WrapperRIP)
Ptr = Ptr->getOperand(0);
auto *CNode = dyn_cast<ConstantPoolSDNode>(Ptr);
if (!CNode || CNode->isMachineConstantPoolEntry())
return nullptr;
return dyn_cast<Constant>(CNode->getConstVal());
}
// Extract raw constant bits from constant pools.
static bool getTargetConstantBitsFromNode(SDValue Op, unsigned EltSizeInBits,
APInt &UndefElts,
SmallVectorImpl<APInt> &EltBits,
bool AllowWholeUndefs = true,
bool AllowPartialUndefs = true) {
assert(EltBits.empty() && "Expected an empty EltBits vector");
Op = peekThroughBitcasts(Op);
EVT VT = Op.getValueType();
unsigned SizeInBits = VT.getSizeInBits();
assert((SizeInBits % EltSizeInBits) == 0 && "Can't split constant!");
unsigned NumElts = SizeInBits / EltSizeInBits;
// Bitcast a source array of element bits to the target size.
auto CastBitData = [&](APInt &UndefSrcElts, ArrayRef<APInt> SrcEltBits) {
unsigned NumSrcElts = UndefSrcElts.getBitWidth();
unsigned SrcEltSizeInBits = SrcEltBits[0].getBitWidth();
assert((NumSrcElts * SrcEltSizeInBits) == SizeInBits &&
"Constant bit sizes don't match");
// Don't split if we don't allow undef bits.
bool AllowUndefs = AllowWholeUndefs || AllowPartialUndefs;
if (UndefSrcElts.getBoolValue() && !AllowUndefs)
return false;
// If we're already the right size, don't bother bitcasting.
if (NumSrcElts == NumElts) {
UndefElts = UndefSrcElts;
EltBits.assign(SrcEltBits.begin(), SrcEltBits.end());
return true;
}
// Extract all the undef/constant element data and pack into single bitsets.
APInt UndefBits(SizeInBits, 0);
APInt MaskBits(SizeInBits, 0);
for (unsigned i = 0; i != NumSrcElts; ++i) {
unsigned BitOffset = i * SrcEltSizeInBits;
if (UndefSrcElts[i])
UndefBits.setBits(BitOffset, BitOffset + SrcEltSizeInBits);
MaskBits.insertBits(SrcEltBits[i], BitOffset);
}
// Split the undef/constant single bitset data into the target elements.
UndefElts = APInt(NumElts, 0);
EltBits.resize(NumElts, APInt(EltSizeInBits, 0));
for (unsigned i = 0; i != NumElts; ++i) {
unsigned BitOffset = i * EltSizeInBits;
APInt UndefEltBits = UndefBits.extractBits(EltSizeInBits, BitOffset);
// Only treat an element as UNDEF if all bits are UNDEF.
if (UndefEltBits.isAllOnesValue()) {
if (!AllowWholeUndefs)
return false;
UndefElts.setBit(i);
continue;
}
// If only some bits are UNDEF then treat them as zero (or bail if not
// supported).
if (UndefEltBits.getBoolValue() && !AllowPartialUndefs)
return false;
APInt Bits = MaskBits.extractBits(EltSizeInBits, BitOffset);
EltBits[i] = Bits.getZExtValue();
}
return true;
};
// Collect constant bits and insert into mask/undef bit masks.
auto CollectConstantBits = [](const Constant *Cst, APInt &Mask, APInt &Undefs,
unsigned UndefBitIndex) {
if (!Cst)
return false;
if (isa<UndefValue>(Cst)) {
Undefs.setBit(UndefBitIndex);
return true;
}
if (auto *CInt = dyn_cast<ConstantInt>(Cst)) {
Mask = CInt->getValue();
return true;
}
if (auto *CFP = dyn_cast<ConstantFP>(Cst)) {
Mask = CFP->getValueAPF().bitcastToAPInt();
return true;
}
return false;
};
// Extract constant bits from build vector.
if (ISD::isBuildVectorOfConstantSDNodes(Op.getNode())) {
unsigned SrcEltSizeInBits = VT.getScalarSizeInBits();
unsigned NumSrcElts = SizeInBits / SrcEltSizeInBits;
APInt UndefSrcElts(NumSrcElts, 0);
SmallVector<APInt, 64> SrcEltBits(NumSrcElts, APInt(SrcEltSizeInBits, 0));
for (unsigned i = 0, e = Op.getNumOperands(); i != e; ++i) {
const SDValue &Src = Op.getOperand(i);
if (Src.isUndef()) {
UndefSrcElts.setBit(i);
continue;
}
auto *Cst = cast<ConstantSDNode>(Src);
SrcEltBits[i] = Cst->getAPIntValue().zextOrTrunc(SrcEltSizeInBits);
}
return CastBitData(UndefSrcElts, SrcEltBits);
}
// Extract constant bits from constant pool vector.
if (auto *Cst = getTargetConstantFromNode(Op)) {
Type *CstTy = Cst->getType();
if (!CstTy->isVectorTy() || (SizeInBits != CstTy->getPrimitiveSizeInBits()))
return false;
unsigned SrcEltSizeInBits = CstTy->getScalarSizeInBits();
unsigned NumSrcElts = CstTy->getVectorNumElements();
APInt UndefSrcElts(NumSrcElts, 0);
SmallVector<APInt, 64> SrcEltBits(NumSrcElts, APInt(SrcEltSizeInBits, 0));
for (unsigned i = 0; i != NumSrcElts; ++i)
if (!CollectConstantBits(Cst->getAggregateElement(i), SrcEltBits[i],
UndefSrcElts, i))
return false;
return CastBitData(UndefSrcElts, SrcEltBits);
}
// Extract constant bits from a broadcasted constant pool scalar.
if (Op.getOpcode() == X86ISD::VBROADCAST &&
EltSizeInBits <= VT.getScalarSizeInBits()) {
if (auto *Broadcast = getTargetConstantFromNode(Op.getOperand(0))) {
unsigned SrcEltSizeInBits = Broadcast->getType()->getScalarSizeInBits();
unsigned NumSrcElts = SizeInBits / SrcEltSizeInBits;
APInt UndefSrcElts(NumSrcElts, 0);
SmallVector<APInt, 64> SrcEltBits(1, APInt(SrcEltSizeInBits, 0));
if (CollectConstantBits(Broadcast, SrcEltBits[0], UndefSrcElts, 0)) {
if (UndefSrcElts[0])
UndefSrcElts.setBits(0, NumSrcElts);
SrcEltBits.append(NumSrcElts - 1, SrcEltBits[0]);
return CastBitData(UndefSrcElts, SrcEltBits);
}
}
}
// Extract a rematerialized scalar constant insertion.
if (Op.getOpcode() == X86ISD::VZEXT_MOVL &&
Op.getOperand(0).getOpcode() == ISD::SCALAR_TO_VECTOR &&
isa<ConstantSDNode>(Op.getOperand(0).getOperand(0))) {
unsigned SrcEltSizeInBits = VT.getScalarSizeInBits();
unsigned NumSrcElts = SizeInBits / SrcEltSizeInBits;
APInt UndefSrcElts(NumSrcElts, 0);
SmallVector<APInt, 64> SrcEltBits;
auto *CN = cast<ConstantSDNode>(Op.getOperand(0).getOperand(0));
SrcEltBits.push_back(CN->getAPIntValue().zextOrTrunc(SrcEltSizeInBits));
SrcEltBits.append(NumSrcElts - 1, APInt(SrcEltSizeInBits, 0));
return CastBitData(UndefSrcElts, SrcEltBits);
}
return false;
}
static bool getTargetShuffleMaskIndices(SDValue MaskNode,
unsigned MaskEltSizeInBits,
SmallVectorImpl<uint64_t> &RawMask) {
APInt UndefElts;
SmallVector<APInt, 64> EltBits;
// Extract the raw target constant bits.
// FIXME: We currently don't support UNDEF bits or mask entries.
if (!getTargetConstantBitsFromNode(MaskNode, MaskEltSizeInBits, UndefElts,
EltBits, /* AllowWholeUndefs */ false,
/* AllowPartialUndefs */ false))
return false;
// Insert the extracted elements into the mask.
for (APInt Elt : EltBits)
RawMask.push_back(Elt.getZExtValue());
return true;
}
/// Calculates the shuffle mask corresponding to the target-specific opcode.
/// If the mask could be calculated, returns it in \p Mask, returns the shuffle
/// operands in \p Ops, and returns true.
/// Sets \p IsUnary to true if only one source is used. Note that this will set
/// IsUnary for shuffles which use a single input multiple times, and in those
/// cases it will adjust the mask to only have indices within that single input.
/// It is an error to call this with non-empty Mask/Ops vectors.
static bool getTargetShuffleMask(SDNode *N, MVT VT, bool AllowSentinelZero,
SmallVectorImpl<SDValue> &Ops,
SmallVectorImpl<int> &Mask, bool &IsUnary) {
unsigned NumElems = VT.getVectorNumElements();
SDValue ImmN;
assert(Mask.empty() && "getTargetShuffleMask expects an empty Mask vector");
assert(Ops.empty() && "getTargetShuffleMask expects an empty Ops vector");
IsUnary = false;
bool IsFakeUnary = false;
switch(N->getOpcode()) {
case X86ISD::BLENDI:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodeBLENDMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::SHUFP:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodeSHUFPMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::INSERTPS:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodeINSERTPSMask(cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::EXTRQI:
if (isa<ConstantSDNode>(N->getOperand(1)) &&
isa<ConstantSDNode>(N->getOperand(2))) {
int BitLen = N->getConstantOperandVal(1);
int BitIdx = N->getConstantOperandVal(2);
DecodeEXTRQIMask(VT, BitLen, BitIdx, Mask);
IsUnary = true;
}
break;
case X86ISD::INSERTQI:
if (isa<ConstantSDNode>(N->getOperand(2)) &&
isa<ConstantSDNode>(N->getOperand(3))) {
int BitLen = N->getConstantOperandVal(2);
int BitIdx = N->getConstantOperandVal(3);
DecodeINSERTQIMask(VT, BitLen, BitIdx, Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
}
break;
case X86ISD::UNPCKH:
DecodeUNPCKHMask(VT, Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::UNPCKL:
DecodeUNPCKLMask(VT, Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::MOVHLPS:
DecodeMOVHLPSMask(NumElems, Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::MOVLHPS:
DecodeMOVLHPSMask(NumElems, Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::PALIGNR:
assert(VT.getScalarType() == MVT::i8 && "Byte vector expected");
ImmN = N->getOperand(N->getNumOperands()-1);
DecodePALIGNRMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
Ops.push_back(N->getOperand(1));
Ops.push_back(N->getOperand(0));
break;
case X86ISD::VSHLDQ:
assert(VT.getScalarType() == MVT::i8 && "Byte vector expected");
ImmN = N->getOperand(N->getNumOperands() - 1);
DecodePSLLDQMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::VSRLDQ:
assert(VT.getScalarType() == MVT::i8 && "Byte vector expected");
ImmN = N->getOperand(N->getNumOperands() - 1);
DecodePSRLDQMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::PSHUFD:
case X86ISD::VPERMILPI:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodePSHUFMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::PSHUFHW:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodePSHUFHWMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::PSHUFLW:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodePSHUFLWMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::VZEXT_MOVL:
DecodeZeroMoveLowMask(VT, Mask);
IsUnary = true;
break;
case X86ISD::VBROADCAST: {
SDValue N0 = N->getOperand(0);
// See if we're broadcasting from index 0 of an EXTRACT_SUBVECTOR. If so,
// add the pre-extracted value to the Ops vector.
if (N0.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
N0.getOperand(0).getValueType() == VT &&
N0.getConstantOperandVal(1) == 0)
Ops.push_back(N0.getOperand(0));
// We only decode broadcasts of same-sized vectors, unless the broadcast
// came from an extract of the original width. If we found one, we
// pushed it onto the Ops vector above.
if (N0.getValueType() == VT || !Ops.empty()) {
DecodeVectorBroadcast(VT, Mask);
IsUnary = true;
break;
}
return false;
}
case X86ISD::VPERMILPV: {
IsUnary = true;
SDValue MaskNode = N->getOperand(1);
unsigned MaskEltSize = VT.getScalarSizeInBits();
SmallVector<uint64_t, 32> RawMask;
if (getTargetShuffleMaskIndices(MaskNode, MaskEltSize, RawMask)) {
DecodeVPERMILPMask(VT, RawMask, Mask);
break;
}
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPERMILPMask(C, MaskEltSize, Mask);
break;
}
return false;
}
case X86ISD::PSHUFB: {
IsUnary = true;
SDValue MaskNode = N->getOperand(1);
SmallVector<uint64_t, 32> RawMask;
if (getTargetShuffleMaskIndices(MaskNode, 8, RawMask)) {
DecodePSHUFBMask(RawMask, Mask);
break;
}
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodePSHUFBMask(C, Mask);
break;
}
return false;
}
case X86ISD::VPERMI:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodeVPERMMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;
break;
case X86ISD::MOVSS:
case X86ISD::MOVSD:
DecodeScalarMoveMask(VT, /* IsLoad */ false, Mask);
break;
case X86ISD::VPERM2X128:
ImmN = N->getOperand(N->getNumOperands()-1);
DecodeVPERM2X128Mask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
break;
case X86ISD::MOVSLDUP:
DecodeMOVSLDUPMask(VT, Mask);
IsUnary = true;
break;
case X86ISD::MOVSHDUP:
DecodeMOVSHDUPMask(VT, Mask);
IsUnary = true;
break;
case X86ISD::MOVDDUP:
DecodeMOVDDUPMask(VT, Mask);
IsUnary = true;
break;
case X86ISD::MOVLHPD:
case X86ISD::MOVLPD:
case X86ISD::MOVLPS:
// Not yet implemented
return false;
case X86ISD::VPERMIL2: {
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
unsigned MaskEltSize = VT.getScalarSizeInBits();
SDValue MaskNode = N->getOperand(2);
SDValue CtrlNode = N->getOperand(3);
if (ConstantSDNode *CtrlOp = dyn_cast<ConstantSDNode>(CtrlNode)) {
unsigned CtrlImm = CtrlOp->getZExtValue();
SmallVector<uint64_t, 32> RawMask;
if (getTargetShuffleMaskIndices(MaskNode, MaskEltSize, RawMask)) {
DecodeVPERMIL2PMask(VT, CtrlImm, RawMask, Mask);
break;
}
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPERMIL2PMask(C, CtrlImm, MaskEltSize, Mask);
break;
}
}
return false;
}
case X86ISD::VPPERM: {
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(1);
SDValue MaskNode = N->getOperand(2);
SmallVector<uint64_t, 32> RawMask;
if (getTargetShuffleMaskIndices(MaskNode, 8, RawMask)) {
DecodeVPPERMMask(RawMask, Mask);
break;
}
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPPERMMask(C, Mask);
break;
}
return false;
}
case X86ISD::VPERMV: {
IsUnary = true;
// Unlike most shuffle nodes, VPERMV's mask operand is operand 0.
Ops.push_back(N->getOperand(1));
SDValue MaskNode = N->getOperand(0);
SmallVector<uint64_t, 32> RawMask;
unsigned MaskEltSize = VT.getScalarSizeInBits();
if (getTargetShuffleMaskIndices(MaskNode, MaskEltSize, RawMask)) {
DecodeVPERMVMask(RawMask, Mask);
break;
}
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPERMVMask(C, MaskEltSize, Mask);
break;
}
return false;
}
case X86ISD::VPERMV3: {
IsUnary = IsFakeUnary = N->getOperand(0) == N->getOperand(2);
// Unlike most shuffle nodes, VPERMV3's mask operand is the middle one.
Ops.push_back(N->getOperand(0));
Ops.push_back(N->getOperand(2));
SDValue MaskNode = N->getOperand(1);
unsigned MaskEltSize = VT.getScalarSizeInBits();
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPERMV3Mask(C, MaskEltSize, Mask);
break;
}
return false;
}
case X86ISD::VPERMIV3: {
IsUnary = IsFakeUnary = N->getOperand(1) == N->getOperand(2);
// Unlike most shuffle nodes, VPERMIV3's mask operand is the first one.
Ops.push_back(N->getOperand(1));
Ops.push_back(N->getOperand(2));
SDValue MaskNode = N->getOperand(0);
unsigned MaskEltSize = VT.getScalarSizeInBits();
if (auto *C = getTargetConstantFromNode(MaskNode)) {
DecodeVPERMV3Mask(C, MaskEltSize, Mask);
break;
}
return false;
}
default: llvm_unreachable("unknown target shuffle node");
}
// Empty mask indicates the decode failed.
if (Mask.empty())
return false;
// Check if we're getting a shuffle mask with zero'd elements.
if (!AllowSentinelZero)
if (any_of(Mask, [](int M) { return M == SM_SentinelZero; }))
return false;
// If we have a fake unary shuffle, the shuffle mask is spread across two
// inputs that are actually the same node. Re-map the mask to always point
// into the first input.
if (IsFakeUnary)
for (int &M : Mask)
if (M >= (int)Mask.size())
M -= Mask.size();
// If we didn't already add operands in the opcode-specific code, default to
// adding 1 or 2 operands starting at 0.
if (Ops.empty()) {
Ops.push_back(N->getOperand(0));
if (!IsUnary || IsFakeUnary)
Ops.push_back(N->getOperand(1));
}
return true;
}
/// Check a target shuffle mask's inputs to see if we can set any values to
/// SM_SentinelZero - this is for elements that are known to be zero
/// (not just zeroable) from their inputs.
/// Returns true if the target shuffle mask was decoded.
static bool setTargetShuffleZeroElements(SDValue N,
SmallVectorImpl<int> &Mask,
SmallVectorImpl<SDValue> &Ops) {
bool IsUnary;
if (!isTargetShuffle(N.getOpcode()))
return false;
MVT VT = N.getSimpleValueType();
if (!getTargetShuffleMask(N.getNode(), VT, true, Ops, Mask, IsUnary))
return false;
SDValue V1 = Ops[0];
SDValue V2 = IsUnary ? V1 : Ops[1];
V1 = peekThroughBitcasts(V1);
V2 = peekThroughBitcasts(V2);
assert((VT.getSizeInBits() % Mask.size()) == 0 &&
"Illegal split of shuffle value type");
unsigned EltSizeInBits = VT.getSizeInBits() / Mask.size();
// Extract known constant input data.
APInt UndefSrcElts[2];
SmallVector<APInt, 32> SrcEltBits[2];
bool IsSrcConstant[2] = {
getTargetConstantBitsFromNode(V1, EltSizeInBits, UndefSrcElts[0],
SrcEltBits[0], true, false),
getTargetConstantBitsFromNode(V2, EltSizeInBits, UndefSrcElts[1],
SrcEltBits[1], true, false)};
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
int M = Mask[i];
// Already decoded as SM_SentinelZero / SM_SentinelUndef.
if (M < 0)
continue;
// Determine shuffle input and normalize the mask.
unsigned SrcIdx = M / Size;
SDValue V = M < Size ? V1 : V2;
M %= Size;
// We are referencing an UNDEF input.
if (V.isUndef()) {
Mask[i] = SM_SentinelUndef;
continue;
}
// SCALAR_TO_VECTOR - only the first element is defined, and the rest UNDEF.
// TODO: We currently only set UNDEF for integer types - floats use the same
// registers as vectors and many of the scalar folded loads rely on the
// SCALAR_TO_VECTOR pattern.
if (V.getOpcode() == ISD::SCALAR_TO_VECTOR &&
(Size % V.getValueType().getVectorNumElements()) == 0) {
int Scale = Size / V.getValueType().getVectorNumElements();
int Idx = M / Scale;
if (Idx != 0 && !VT.isFloatingPoint())
Mask[i] = SM_SentinelUndef;
else if (Idx == 0 && X86::isZeroNode(V.getOperand(0)))
Mask[i] = SM_SentinelZero;
continue;
}
// Attempt to extract from the source's constant bits.
if (IsSrcConstant[SrcIdx]) {
if (UndefSrcElts[SrcIdx][M])
Mask[i] = SM_SentinelUndef;
else if (SrcEltBits[SrcIdx][M] == 0)
Mask[i] = SM_SentinelZero;
}
}
assert(VT.getVectorNumElements() == Mask.size() &&
"Different mask size from vector size!");
return true;
}
// Attempt to decode ops that could be represented as a shuffle mask.
// The decoded shuffle mask may contain a different number of elements than the
// destination value type.
static bool getFauxShuffleMask(SDValue N, SmallVectorImpl<int> &Mask,
SmallVectorImpl<SDValue> &Ops,
SelectionDAG &DAG) {
Mask.clear();
Ops.clear();
MVT VT = N.getSimpleValueType();
unsigned NumElts = VT.getVectorNumElements();
unsigned NumSizeInBits = VT.getSizeInBits();
unsigned NumBitsPerElt = VT.getScalarSizeInBits();
assert((NumBitsPerElt % 8) == 0 && (NumSizeInBits % 8) == 0 &&
"Expected byte aligned value types");
unsigned Opcode = N.getOpcode();
switch (Opcode) {
case ISD::AND:
case X86ISD::ANDNP: {
// Attempt to decode as a per-byte mask.
APInt UndefElts;
SmallVector<APInt, 32> EltBits;
SDValue N0 = N.getOperand(0);
SDValue N1 = N.getOperand(1);
bool IsAndN = (X86ISD::ANDNP == Opcode);
uint64_t ZeroMask = IsAndN ? 255 : 0;
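// For AND, a constant byte of 0x00 zeros the lane and 0xFF keeps it; for
// ANDNP the roles are inverted. e.g. (and X, <0xFF,0x00,0xFF,0x00>) decodes
// to the per-byte mask <0,Z,2,Z>, where Z is SM_SentinelZero.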
if (!getTargetConstantBitsFromNode(IsAndN ? N0 : N1, 8, UndefElts, EltBits))
return false;
for (int i = 0, e = (int)EltBits.size(); i != e; ++i) {
if (UndefElts[i]) {
Mask.push_back(SM_SentinelUndef);
continue;
}
uint64_t ByteBits = EltBits[i].getZExtValue();
if (ByteBits != 0 && ByteBits != 255)
return false;
Mask.push_back(ByteBits == ZeroMask ? SM_SentinelZero : i);
}
Ops.push_back(IsAndN ? N1 : N0);
return true;
}
case ISD::SCALAR_TO_VECTOR: {
// Match against a scalar_to_vector of an extract from a vector;
// for PEXTRW/PEXTRB we must handle the implicit zext of the scalar.
SDValue N0 = N.getOperand(0);
SDValue SrcExtract;
if (N0.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
N0.getOperand(0).getValueType() == VT) {
SrcExtract = N0;
} else if (N0.getOpcode() == ISD::AssertZext &&
N0.getOperand(0).getOpcode() == X86ISD::PEXTRW &&
cast<VTSDNode>(N0.getOperand(1))->getVT() == MVT::i16) {
SrcExtract = N0.getOperand(0);
assert(SrcExtract.getOperand(0).getValueType() == MVT::v8i16);
} else if (N0.getOpcode() == ISD::AssertZext &&
N0.getOperand(0).getOpcode() == X86ISD::PEXTRB &&
cast<VTSDNode>(N0.getOperand(1))->getVT() == MVT::i8) {
SrcExtract = N0.getOperand(0);
assert(SrcExtract.getOperand(0).getValueType() == MVT::v16i8);
}
if (!SrcExtract || !isa<ConstantSDNode>(SrcExtract.getOperand(1)))
return false;
SDValue SrcVec = SrcExtract.getOperand(0);
EVT SrcVT = SrcVec.getValueType();
unsigned NumSrcElts = SrcVT.getVectorNumElements();
unsigned NumZeros = (NumBitsPerElt / SrcVT.getScalarSizeInBits()) - 1;
unsigned SrcIdx = SrcExtract.getConstantOperandVal(1);
if (NumSrcElts <= SrcIdx)
return false;
Ops.push_back(SrcVec);
Mask.push_back(SrcIdx);
Mask.append(NumZeros, SM_SentinelZero);
Mask.append(NumSrcElts - Mask.size(), SM_SentinelUndef);
return true;
}
case X86ISD::PINSRB:
case X86ISD::PINSRW: {
SDValue InVec = N.getOperand(0);
SDValue InScl = N.getOperand(1);
uint64_t InIdx = N.getConstantOperandVal(2);
assert(InIdx < NumElts && "Illegal insertion index");
// Attempt to recognise a PINSR*(VEC, 0, Idx) shuffle pattern.
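// e.g. PINSRW(V, 0, 2) on v8i16 decodes to the mask <0,1,Z,3,4,5,6,7>.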
if (X86::isZeroNode(InScl)) {
Ops.push_back(InVec);
for (unsigned i = 0; i != NumElts; ++i)
Mask.push_back(i == InIdx ? SM_SentinelZero : (int)i);
return true;
}
// Attempt to recognise a PINSR*(ASSERTZEXT(PEXTR*)) shuffle pattern.
// TODO: Expand this to support INSERT_VECTOR_ELT/etc.
unsigned ExOp =
(X86ISD::PINSRB == Opcode ? X86ISD::PEXTRB : X86ISD::PEXTRW);
if (InScl.getOpcode() != ISD::AssertZext ||
InScl.getOperand(0).getOpcode() != ExOp)
return false;
SDValue ExVec = InScl.getOperand(0).getOperand(0);
uint64_t ExIdx = InScl.getOperand(0).getConstantOperandVal(1);
assert(ExIdx < NumElts && "Illegal extraction index");
Ops.push_back(InVec);
Ops.push_back(ExVec);
for (unsigned i = 0; i != NumElts; ++i)
Mask.push_back(i == InIdx ? NumElts + ExIdx : i);
return true;
}
case X86ISD::PACKSS: {
// If we know input saturation won't happen we can treat this
// as a truncation shuffle.
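// e.g. a PACKSS of two v4i32 inputs that never saturates is equivalent to
// the v8i16 truncating shuffle <0,2,4,6,8,10,12,14> of the concatenated
// inputs reinterpreted as i16 elements.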
if (DAG.ComputeNumSignBits(N.getOperand(0)) <= NumBitsPerElt ||
DAG.ComputeNumSignBits(N.getOperand(1)) <= NumBitsPerElt)
return false;
Ops.push_back(N.getOperand(0));
Ops.push_back(N.getOperand(1));
for (unsigned i = 0; i != NumElts; ++i)
Mask.push_back(i * 2);
return true;
}
case X86ISD::VSHLI:
case X86ISD::VSRLI: {
uint64_t ShiftVal = N.getConstantOperandVal(1);
// Out of range bit shifts are guaranteed to be zero.
if (NumBitsPerElt <= ShiftVal) {
Mask.append(NumElts, SM_SentinelZero);
return true;
}
// We can only decode 'whole byte' bit shifts as shuffles.
if ((ShiftVal % 8) != 0)
break;
uint64_t ByteShift = ShiftVal / 8;
unsigned NumBytes = NumSizeInBits / 8;
unsigned NumBytesPerElt = NumBitsPerElt / 8;
Ops.push_back(N.getOperand(0));
// Clear mask to all zeros and insert the shifted byte indices.
Mask.append(NumBytes, SM_SentinelZero);
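// e.g. a v2i64 VSHLI by 16 bits (ByteShift == 2) decodes to the byte mask
// <Z,Z,0,1,2,3,4,5,Z,Z,8,9,10,11,12,13>, where Z is SM_SentinelZero.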
if (X86ISD::VSHLI == Opcode) {
for (unsigned i = 0; i != NumBytes; i += NumBytesPerElt)
for (unsigned j = ByteShift; j != NumBytesPerElt; ++j)
Mask[i + j] = i + j - ByteShift;
} else {
for (unsigned i = 0; i != NumBytes; i += NumBytesPerElt)
for (unsigned j = ByteShift; j != NumBytesPerElt; ++j)
Mask[i + j - ByteShift] = i + j;
}
return true;
}
case ISD::ZERO_EXTEND_VECTOR_INREG:
case X86ISD::VZEXT: {
// TODO - add support for VPMOVZX with smaller input vector types.
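// e.g. a VZEXT from v4i32 to v2i64 decodes to the i32-granularity mask
// <0,Z,1,Z>, zeroing the upper half of each widened element.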
SDValue Src = N.getOperand(0);
MVT SrcVT = Src.getSimpleValueType();
if (NumSizeInBits != SrcVT.getSizeInBits())
break;
DecodeZeroExtendMask(SrcVT.getScalarType(), VT, Mask);
Ops.push_back(Src);
return true;
}
}
return false;
}
/// Removes unused shuffle source inputs and adjusts the shuffle mask accordingly.
static void resolveTargetShuffleInputsAndMask(SmallVectorImpl<SDValue> &Inputs,
SmallVectorImpl<int> &Mask) {
int MaskWidth = Mask.size();
SmallVector<SDValue, 16> UsedInputs;
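// An input whose mask range [lo, hi) is never referenced is dropped, and all
// mask indices at or above that range are shifted down by one input's width.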
for (int i = 0, e = Inputs.size(); i < e; ++i) {
int lo = UsedInputs.size() * MaskWidth;
int hi = lo + MaskWidth;
if (any_of(Mask, [lo, hi](int i) { return (lo <= i) && (i < hi); })) {
UsedInputs.push_back(Inputs[i]);
continue;
}
for (int &M : Mask)
if (lo <= M)
M -= MaskWidth;
}
Inputs = UsedInputs;
}
/// Calls setTargetShuffleZeroElements to resolve a target shuffle mask's inputs
/// and set the SM_SentinelUndef and SM_SentinelZero values. Then check the
/// remaining input indices in case we now have a unary shuffle and adjust the
/// inputs accordingly.
/// Returns true if the target shuffle mask was decoded.
static bool resolveTargetShuffleInputs(SDValue Op,
SmallVectorImpl<SDValue> &Inputs,
SmallVectorImpl<int> &Mask,
SelectionDAG &DAG) {
if (!setTargetShuffleZeroElements(Op, Mask, Inputs))
if (!getFauxShuffleMask(Op, Mask, Inputs, DAG))
return false;
resolveTargetShuffleInputsAndMask(Inputs, Mask);
return true;
}
/// Returns the scalar element that will make up the ith
/// element of the result of the vector shuffle.
static SDValue getShuffleScalarElt(SDNode *N, unsigned Index, SelectionDAG &DAG,
unsigned Depth) {
if (Depth == 6)
return SDValue(); // Limit search depth.
SDValue V = SDValue(N, 0);
EVT VT = V.getValueType();
unsigned Opcode = V.getOpcode();
// Recurse into ISD::VECTOR_SHUFFLE node to find scalars.
if (const ShuffleVectorSDNode *SV = dyn_cast<ShuffleVectorSDNode>(N)) {
int Elt = SV->getMaskElt(Index);
if (Elt < 0)
return DAG.getUNDEF(VT.getVectorElementType());
unsigned NumElems = VT.getVectorNumElements();
SDValue NewV = (Elt < (int)NumElems) ? SV->getOperand(0)
: SV->getOperand(1);
return getShuffleScalarElt(NewV.getNode(), Elt % NumElems, DAG, Depth+1);
}
// Recurse into target specific vector shuffles to find scalars.
if (isTargetShuffle(Opcode)) {
MVT ShufVT = V.getSimpleValueType();
MVT ShufSVT = ShufVT.getVectorElementType();
int NumElems = (int)ShufVT.getVectorNumElements();
SmallVector<int, 16> ShuffleMask;
SmallVector<SDValue, 16> ShuffleOps;
bool IsUnary;
if (!getTargetShuffleMask(N, ShufVT, true, ShuffleOps, ShuffleMask, IsUnary))
return SDValue();
int Elt = ShuffleMask[Index];
if (Elt == SM_SentinelZero)
return ShufSVT.isInteger() ? DAG.getConstant(0, SDLoc(N), ShufSVT)
: DAG.getConstantFP(+0.0, SDLoc(N), ShufSVT);
if (Elt == SM_SentinelUndef)
return DAG.getUNDEF(ShufSVT);
assert(0 <= Elt && Elt < (2*NumElems) && "Shuffle index out of range");
SDValue NewV = (Elt < NumElems) ? ShuffleOps[0] : ShuffleOps[1];
return getShuffleScalarElt(NewV.getNode(), Elt % NumElems, DAG,
Depth+1);
}
// Actual nodes that may contain scalar elements
if (Opcode == ISD::BITCAST) {
V = V.getOperand(0);
EVT SrcVT = V.getValueType();
unsigned NumElems = VT.getVectorNumElements();
if (!SrcVT.isVector() || SrcVT.getVectorNumElements() != NumElems)
return SDValue();
}
if (V.getOpcode() == ISD::SCALAR_TO_VECTOR)
return (Index == 0) ? V.getOperand(0)
: DAG.getUNDEF(VT.getVectorElementType());
if (V.getOpcode() == ISD::BUILD_VECTOR)
return V.getOperand(Index);
return SDValue();
}
/// Custom lower build_vector of v16i8.
static SDValue LowerBuildVectorv16i8(SDValue Op, unsigned NonZeros,
unsigned NumNonZero, unsigned NumZero,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (NumNonZero > 8 && !Subtarget.hasSSE41())
return SDValue();
SDLoc dl(Op);
SDValue V;
bool First = true;
// SSE4.1 - use PINSRB to insert each byte directly.
if (Subtarget.hasSSE41()) {
for (unsigned i = 0; i < 16; ++i) {
bool IsNonZero = (NonZeros & (1 << i)) != 0;
if (IsNonZero) {
// If the build vector contains zeros or our first insertion is not the
// first index, then insert into a zero vector to break any register
// dependency; else use SCALAR_TO_VECTOR/VZEXT_MOVL.
if (First) {
First = false;
if (NumZero || 0 != i)
V = getZeroVector(MVT::v16i8, Subtarget, DAG, dl);
else {
assert(0 == i && "Expected insertion into zero-index");
V = DAG.getAnyExtOrTrunc(Op.getOperand(i), dl, MVT::i32);
V = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, V);
V = DAG.getNode(X86ISD::VZEXT_MOVL, dl, MVT::v4i32, V);
V = DAG.getBitcast(MVT::v16i8, V);
continue;
}
}
V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v16i8, V,
Op.getOperand(i), DAG.getIntPtrConstant(i, dl));
}
}
return V;
}
// Pre-SSE4.1 - merge byte pairs and insert with PINSRW.
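// Each byte pair (2*k, 2*k+1) is zero-extended to i16, the odd byte is
// shifted left by 8 and OR'd with the even byte, and the combined word is
// inserted at word index k.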
for (unsigned i = 0; i < 16; ++i) {
bool ThisIsNonZero = (NonZeros & (1 << i)) != 0;
if (ThisIsNonZero && First) {
if (NumZero)
V = getZeroVector(MVT::v8i16, Subtarget, DAG, dl);
else
V = DAG.getUNDEF(MVT::v8i16);
First = false;
}
if ((i & 1) != 0) {
// FIXME: Investigate extending to i32 instead of just i16.
// FIXME: Investigate combining the first 4 bytes as a i32 instead.
SDValue ThisElt, LastElt;
bool LastIsNonZero = (NonZeros & (1 << (i - 1))) != 0;
if (LastIsNonZero) {
LastElt =
DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i16, Op.getOperand(i - 1));
}
if (ThisIsNonZero) {
ThisElt = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i16, Op.getOperand(i));
ThisElt = DAG.getNode(ISD::SHL, dl, MVT::i16, ThisElt,
DAG.getConstant(8, dl, MVT::i8));
if (LastIsNonZero)
ThisElt = DAG.getNode(ISD::OR, dl, MVT::i16, ThisElt, LastElt);
} else
ThisElt = LastElt;
if (ThisElt) {
if (1 == i) {
V = NumZero ? DAG.getZExtOrTrunc(ThisElt, dl, MVT::i32)
: DAG.getAnyExtOrTrunc(ThisElt, dl, MVT::i32);
V = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, V);
V = DAG.getNode(X86ISD::VZEXT_MOVL, dl, MVT::v4i32, V);
V = DAG.getBitcast(MVT::v8i16, V);
} else {
V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v8i16, V, ThisElt,
DAG.getIntPtrConstant(i / 2, dl));
}
}
}
}
return DAG.getBitcast(MVT::v16i8, V);
}
/// Custom lower build_vector of v8i16.
static SDValue LowerBuildVectorv8i16(SDValue Op, unsigned NonZeros,
unsigned NumNonZero, unsigned NumZero,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (NumNonZero > 4 && !Subtarget.hasSSE41())
return SDValue();
SDLoc dl(Op);
SDValue V;
bool First = true;
for (unsigned i = 0; i < 8; ++i) {
bool IsNonZero = (NonZeros & (1 << i)) != 0;
if (IsNonZero) {
// If the build vector contains zeros or our first insertion is not the
// first index, then insert into a zero vector to break any register
// dependency; else use SCALAR_TO_VECTOR/VZEXT_MOVL.
if (First) {
First = false;
if (NumZero || 0 != i)
V = getZeroVector(MVT::v8i16, Subtarget, DAG, dl);
else {
assert(0 == i && "Expected insertion into zero-index");
V = DAG.getAnyExtOrTrunc(Op.getOperand(i), dl, MVT::i32);
V = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, V);
V = DAG.getNode(X86ISD::VZEXT_MOVL, dl, MVT::v4i32, V);
V = DAG.getBitcast(MVT::v8i16, V);
continue;
}
}
V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, MVT::v8i16, V,
Op.getOperand(i), DAG.getIntPtrConstant(i, dl));
}
}
return V;
}
/// Custom lower build_vector of v4i32 or v4f32.
static SDValue LowerBuildVectorv4x32(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// Find all zeroable elements.
std::bitset<4> Zeroable;
for (int i=0; i < 4; ++i) {
SDValue Elt = Op->getOperand(i);
Zeroable[i] = (Elt.isUndef() || X86::isZeroNode(Elt));
}
assert(Zeroable.size() - Zeroable.count() > 1 &&
"We expect at least two non-zero elements!");
// We only know how to deal with build_vector nodes where elements are either
// zeroable or extract_vector_elt with constant index.
SDValue FirstNonZero;
unsigned FirstNonZeroIdx;
for (unsigned i=0; i < 4; ++i) {
if (Zeroable[i])
continue;
SDValue Elt = Op->getOperand(i);
if (Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
!isa<ConstantSDNode>(Elt.getOperand(1)))
return SDValue();
// Make sure that this node is extracting from a 128-bit vector.
MVT VT = Elt.getOperand(0).getSimpleValueType();
if (!VT.is128BitVector())
return SDValue();
if (!FirstNonZero.getNode()) {
FirstNonZero = Elt;
FirstNonZeroIdx = i;
}
}
assert(FirstNonZero.getNode() && "Unexpected build vector of all zeros!");
SDValue V1 = FirstNonZero.getOperand(0);
MVT VT = V1.getSimpleValueType();
// See if this build_vector can be lowered as a blend with zero.
SDValue Elt;
unsigned EltMaskIdx, EltIdx;
int Mask[4];
for (EltIdx = 0; EltIdx < 4; ++EltIdx) {
if (Zeroable[EltIdx]) {
// The zero vector will be on the right hand side.
Mask[EltIdx] = EltIdx+4;
continue;
}
Elt = Op->getOperand(EltIdx);
// By construction, Elt is a EXTRACT_VECTOR_ELT with constant index.
EltMaskIdx = Elt.getConstantOperandVal(1);
if (Elt.getOperand(0) != V1 || EltMaskIdx != EltIdx)
break;
Mask[EltIdx] = EltIdx;
}
if (EltIdx == 4) {
// Let the shuffle legalizer deal with blend operations.
SDValue VZero = getZeroVector(VT, Subtarget, DAG, SDLoc(Op));
if (V1.getSimpleValueType() != VT)
V1 = DAG.getBitcast(VT, V1);
return DAG.getVectorShuffle(VT, SDLoc(V1), V1, VZero, Mask);
}
// See if we can lower this build_vector to a INSERTPS.
if (!Subtarget.hasSSE41())
return SDValue();
SDValue V2 = Elt.getOperand(0);
if (Elt == FirstNonZero && EltIdx == FirstNonZeroIdx)
V1 = SDValue();
bool CanFold = true;
for (unsigned i = EltIdx + 1; i < 4 && CanFold; ++i) {
if (Zeroable[i])
continue;
SDValue Current = Op->getOperand(i);
SDValue SrcVector = Current->getOperand(0);
if (!V1.getNode())
V1 = SrcVector;
CanFold = (SrcVector == V1) && (Current.getConstantOperandVal(1) == i);
}
if (!CanFold)
return SDValue();
assert(V1.getNode() && "Expected at least two non-zero elements!");
if (V1.getSimpleValueType() != MVT::v4f32)
V1 = DAG.getBitcast(MVT::v4f32, V1);
if (V2.getSimpleValueType() != MVT::v4f32)
V2 = DAG.getBitcast(MVT::v4f32, V2);
// Ok, we can emit an INSERTPS instruction.
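// The immediate encodes the source element in bits [7:6], the destination
// element in bits [5:4] and the zero mask in bits [3:0].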
unsigned ZMask = Zeroable.to_ulong();
unsigned InsertPSMask = EltMaskIdx << 6 | EltIdx << 4 | ZMask;
assert((InsertPSMask & ~0xFFu) == 0 && "Invalid mask!");
SDLoc DL(Op);
SDValue Result = DAG.getNode(X86ISD::INSERTPS, DL, MVT::v4f32, V1, V2,
DAG.getIntPtrConstant(InsertPSMask, DL));
return DAG.getBitcast(VT, Result);
}
/// Return a vector logical shift node.
static SDValue getVShift(bool isLeft, EVT VT, SDValue SrcOp, unsigned NumBits,
SelectionDAG &DAG, const TargetLowering &TLI,
const SDLoc &dl) {
assert(VT.is128BitVector() && "Unknown type for VShift");
MVT ShVT = MVT::v16i8;
unsigned Opc = isLeft ? X86ISD::VSHLDQ : X86ISD::VSRLDQ;
SrcOp = DAG.getBitcast(ShVT, SrcOp);
MVT ScalarShiftTy = TLI.getScalarShiftAmountTy(DAG.getDataLayout(), VT);
assert(NumBits % 8 == 0 && "Only support byte sized shifts");
SDValue ShiftVal = DAG.getConstant(NumBits/8, dl, ScalarShiftTy);
return DAG.getBitcast(VT, DAG.getNode(Opc, dl, ShVT, SrcOp, ShiftVal));
}
static SDValue LowerAsSplatVectorLoad(SDValue SrcOp, MVT VT, const SDLoc &dl,
SelectionDAG &DAG) {
// Check if the scalar load can be widened into a vector load. And if
// the address is "base + cst" see if the cst can be "absorbed" into
// the shuffle mask.
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(SrcOp)) {
SDValue Ptr = LD->getBasePtr();
if (!ISD::isNormalLoad(LD) || LD->isVolatile())
return SDValue();
EVT PVT = LD->getValueType(0);
if (PVT != MVT::i32 && PVT != MVT::f32)
return SDValue();
int FI = -1;
int64_t Offset = 0;
if (FrameIndexSDNode *FINode = dyn_cast<FrameIndexSDNode>(Ptr)) {
FI = FINode->getIndex();
Offset = 0;
} else if (DAG.isBaseWithConstantOffset(Ptr) &&
isa<FrameIndexSDNode>(Ptr.getOperand(0))) {
FI = cast<FrameIndexSDNode>(Ptr.getOperand(0))->getIndex();
Offset = Ptr.getConstantOperandVal(1);
Ptr = Ptr.getOperand(0);
} else {
return SDValue();
}
// FIXME: 256-bit vector instructions don't require a strict alignment,
// improve this code to support it better.
unsigned RequiredAlign = VT.getSizeInBits()/8;
SDValue Chain = LD->getChain();
// Make sure the stack object alignment is at least 16 or 32.
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
if (DAG.InferPtrAlignment(Ptr) < RequiredAlign) {
if (MFI.isFixedObjectIndex(FI)) {
// Can't change the alignment. FIXME: It's possible to compute
// the exact stack offset and reference FI + adjust offset instead,
// if someone *really* cares about this; that's the way to implement it.
return SDValue();
} else {
MFI.setObjectAlignment(FI, RequiredAlign);
}
}
// (Offset % 16 or 32) must be a multiple of 4. The address is then
// Ptr + (Offset & ~15).
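// e.g. with RequiredAlign == 16, an Offset of 20 is absorbed as
// StartOffset == 16, and the splat reads element (20 - 16) / 4 == 1 of the
// widened load.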
if (Offset < 0)
return SDValue();
if ((Offset % RequiredAlign) & 3)
return SDValue();
int64_t StartOffset = Offset & ~int64_t(RequiredAlign - 1);
if (StartOffset) {
SDLoc DL(Ptr);
Ptr = DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
DAG.getConstant(StartOffset, DL, Ptr.getValueType()));
}
int EltNo = (Offset - StartOffset) >> 2;
unsigned NumElems = VT.getVectorNumElements();
EVT NVT = EVT::getVectorVT(*DAG.getContext(), PVT, NumElems);
SDValue V1 = DAG.getLoad(NVT, dl, Chain, Ptr,
LD->getPointerInfo().getWithOffset(StartOffset));
SmallVector<int, 8> Mask(NumElems, EltNo);
return DAG.getVectorShuffle(NVT, dl, V1, DAG.getUNDEF(NVT), Mask);
}
return SDValue();
}
/// Given the initializing elements 'Elts' of a vector of type 'VT', see if the
/// elements can be replaced by a single large load which has the same value as
/// a build_vector or insert_subvector whose loaded operands are 'Elts'.
///
/// Example: <load i32 *a, load i32 *a+4, zero, undef> -> zextload a
static SDValue EltsFromConsecutiveLoads(EVT VT, ArrayRef<SDValue> Elts,
const SDLoc &DL, SelectionDAG &DAG,
const X86Subtarget &Subtarget,
bool isAfterLegalize) {
unsigned NumElems = Elts.size();
int LastLoadedElt = -1;
SmallBitVector LoadMask(NumElems, false);
SmallBitVector ZeroMask(NumElems, false);
SmallBitVector UndefMask(NumElems, false);
// For each element in the initializer, see if we've found a load, zero or an
// undef.
for (unsigned i = 0; i < NumElems; ++i) {
SDValue Elt = peekThroughBitcasts(Elts[i]);
if (!Elt.getNode())
return SDValue();
if (Elt.isUndef())
UndefMask[i] = true;
else if (X86::isZeroNode(Elt) || ISD::isBuildVectorAllZeros(Elt.getNode()))
ZeroMask[i] = true;
else if (ISD::isNON_EXTLoad(Elt.getNode())) {
LoadMask[i] = true;
LastLoadedElt = i;
// Each loaded element must be the correct fractional portion of the
// requested vector load.
if ((NumElems * Elt.getValueSizeInBits()) != VT.getSizeInBits())
return SDValue();
} else
return SDValue();
}
assert((ZeroMask | UndefMask | LoadMask).count() == NumElems &&
"Incomplete element masks");
// Handle Special Cases - all undef or undef/zero.
if (UndefMask.count() == NumElems)
return DAG.getUNDEF(VT);
// FIXME: Should we return this as a BUILD_VECTOR instead?
if ((ZeroMask | UndefMask).count() == NumElems)
return VT.isInteger() ? DAG.getConstant(0, DL, VT)
: DAG.getConstantFP(0.0, DL, VT);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
int FirstLoadedElt = LoadMask.find_first();
SDValue EltBase = peekThroughBitcasts(Elts[FirstLoadedElt]);
LoadSDNode *LDBase = cast<LoadSDNode>(EltBase);
EVT LDBaseVT = EltBase.getValueType();
// Consecutive loads can contain UNDEFs but not ZERO elements.
// Consecutive loads with UNDEF and ZERO elements require
// an additional shuffle stage to clear the ZERO elements.
bool IsConsecutiveLoad = true;
bool IsConsecutiveLoadWithZeros = true;
for (int i = FirstLoadedElt + 1; i <= LastLoadedElt; ++i) {
if (LoadMask[i]) {
SDValue Elt = peekThroughBitcasts(Elts[i]);
LoadSDNode *LD = cast<LoadSDNode>(Elt);
if (!DAG.areNonVolatileConsecutiveLoads(
LD, LDBase, Elt.getValueType().getStoreSizeInBits() / 8,
i - FirstLoadedElt)) {
IsConsecutiveLoad = false;
IsConsecutiveLoadWithZeros = false;
break;
}
} else if (ZeroMask[i]) {
IsConsecutiveLoad = false;
}
}
auto CreateLoad = [&DAG, &DL](EVT VT, LoadSDNode *LDBase) {
auto MMOFlags = LDBase->getMemOperand()->getFlags();
assert(!(MMOFlags & MachineMemOperand::MOVolatile) &&
"Cannot merge volatile loads.");
SDValue NewLd =
DAG.getLoad(VT, DL, LDBase->getChain(), LDBase->getBasePtr(),
LDBase->getPointerInfo(), LDBase->getAlignment(), MMOFlags);
DAG.makeEquivalentMemoryOrdering(LDBase, NewLd);
return NewLd;
};
// LOAD - all consecutive load/undefs (must start/end with a load).
// If we have found an entire vector of loads and undefs, then return a large
// load of the entire vector width starting at the base pointer.
// If the vector contains zeros, then attempt to shuffle those elements.
if (FirstLoadedElt == 0 && LastLoadedElt == (int)(NumElems - 1) &&
(IsConsecutiveLoad || IsConsecutiveLoadWithZeros)) {
assert(LDBase && "Did not find base load for merging consecutive loads");
EVT EltVT = LDBase->getValueType(0);
// Ensure that the input vector size for the merged loads matches the
// cumulative size of the input elements.
if (VT.getSizeInBits() != EltVT.getSizeInBits() * NumElems)
return SDValue();
if (isAfterLegalize && !TLI.isOperationLegal(ISD::LOAD, VT))
return SDValue();
// Don't create 256-bit non-temporal aligned loads without AVX2 as these
// will lower to regular temporal loads and use the cache.
if (LDBase->isNonTemporal() && LDBase->getAlignment() >= 32 &&
VT.is256BitVector() && !Subtarget.hasInt256())
return SDValue();
if (IsConsecutiveLoad)
return CreateLoad(VT, LDBase);
// IsConsecutiveLoadWithZeros - we need to create a shuffle of the loaded
// vector and a zero vector to clear out the zero elements.
if (!isAfterLegalize && NumElems == VT.getVectorNumElements()) {
SmallVector<int, 4> ClearMask(NumElems, -1);
for (unsigned i = 0; i < NumElems; ++i) {
if (ZeroMask[i])
ClearMask[i] = i + NumElems;
else if (LoadMask[i])
ClearMask[i] = i;
}
SDValue V = CreateLoad(VT, LDBase);
SDValue Z = VT.isInteger() ? DAG.getConstant(0, DL, VT)
: DAG.getConstantFP(0.0, DL, VT);
return DAG.getVectorShuffle(VT, DL, V, Z, ClearMask);
}
}
int LoadSize =
(1 + LastLoadedElt - FirstLoadedElt) * LDBaseVT.getStoreSizeInBits();
// VZEXT_LOAD - consecutive 32/64-bit load/undefs followed by zeros/undefs.
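// e.g. a v4i32 whose first two elements are consecutive i32 loads and whose
// upper two elements are zero can be lowered as a single 64-bit VZEXT_LOAD.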
if (IsConsecutiveLoad && FirstLoadedElt == 0 &&
(LoadSize == 32 || LoadSize == 64) &&
((VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector()))) {
MVT VecSVT = VT.isFloatingPoint() ? MVT::getFloatingPointVT(LoadSize)
: MVT::getIntegerVT(LoadSize);
MVT VecVT = MVT::getVectorVT(VecSVT, VT.getSizeInBits() / LoadSize);
if (TLI.isTypeLegal(VecVT)) {
SDVTList Tys = DAG.getVTList(VecVT, MVT::Other);
SDValue Ops[] = { LDBase->getChain(), LDBase->getBasePtr() };
SDValue ResNode =
DAG.getMemIntrinsicNode(X86ISD::VZEXT_LOAD, DL, Tys, Ops, VecSVT,
LDBase->getPointerInfo(),
LDBase->getAlignment(),
false/*isVolatile*/, true/*ReadMem*/,
false/*WriteMem*/);
DAG.makeEquivalentMemoryOrdering(LDBase, ResNode);
return DAG.getBitcast(VT, ResNode);
}
}
return SDValue();
}
static Constant *getConstantVector(MVT VT, const APInt &SplatValue,
unsigned SplatBitSize, LLVMContext &C) {
unsigned ScalarSize = VT.getScalarSizeInBits();
unsigned NumElm = SplatBitSize / ScalarSize;
SmallVector<Constant *, 32> ConstantVec;
for (unsigned i = 0; i < NumElm; i++) {
APInt Val = SplatValue.extractBits(ScalarSize, ScalarSize * i);
Constant *Const;
if (VT.isFloatingPoint()) {
if (ScalarSize == 32) {
Const = ConstantFP::get(C, APFloat(APFloat::IEEEsingle(), Val));
} else {
assert(ScalarSize == 64 && "Unsupported floating point scalar size");
Const = ConstantFP::get(C, APFloat(APFloat::IEEEdouble(), Val));
}
} else
Const = Constant::getIntegerValue(Type::getIntNTy(C, ScalarSize), Val);
ConstantVec.push_back(Const);
}
return ConstantVector::get(ArrayRef<Constant *>(ConstantVec));
}
static bool isUseOfShuffle(SDNode *N) {
for (auto *U : N->uses()) {
if (isTargetShuffle(U->getOpcode()))
return true;
if (U->getOpcode() == ISD::BITCAST) // Ignore bitcasts
return isUseOfShuffle(U);
}
return false;
}
/// Attempt to use the vbroadcast instruction to generate a splat value
/// from a splat BUILD_VECTOR which uses:
/// a. A single scalar load, or a constant.
/// b. Repeated pattern of constants (e.g. <0,1,0,1> or <0,1,2,3,0,1,2,3>).
///
/// The VBROADCAST node is returned when a pattern is found,
/// or SDValue() otherwise.
static SDValue lowerBuildVectorAsBroadcast(BuildVectorSDNode *BVOp,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
// VBROADCAST requires AVX.
// TODO: Splats could be generated for non-AVX CPUs using SSE
// instructions, but there's less potential gain for only 128-bit vectors.
if (!Subtarget.hasAVX())
return SDValue();
MVT VT = BVOp->getSimpleValueType(0);
SDLoc dl(BVOp);
assert((VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector()) &&
"Unsupported vector type for broadcast.");
BitVector UndefElements;
SDValue Ld = BVOp->getSplatValue(&UndefElements);
// We need a splat of a single value to use broadcast, and it doesn't
// make any sense if the value is only in one element of the vector.
if (!Ld || (VT.getVectorNumElements() - UndefElements.count()) <= 1) {
APInt SplatValue, Undef;
unsigned SplatBitSize;
bool HasUndef;
// Check if this is a repeated constant pattern suitable for broadcasting.
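// e.g. a v8i32 <0,1,0,1,0,1,0,1> splats a 64-bit pattern: the constant can
// be loaded once as an i64 and broadcast into all four 64-bit lanes.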
if (BVOp->isConstantSplat(SplatValue, Undef, SplatBitSize, HasUndef) &&
SplatBitSize > VT.getScalarSizeInBits() &&
SplatBitSize < VT.getSizeInBits()) {
// Avoid replacing with broadcast when it's a use of a shuffle
// instruction to preserve the present custom lowering of shuffles.
if (isUseOfShuffle(BVOp) || BVOp->hasOneUse())
return SDValue();
// Replace BUILD_VECTOR with a broadcast of the repeated constants.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
LLVMContext *Ctx = DAG.getContext();
MVT PVT = TLI.getPointerTy(DAG.getDataLayout());
if (Subtarget.hasAVX()) {
if (SplatBitSize <= 64 && Subtarget.hasAVX2() &&
!(SplatBitSize == 64 && Subtarget.is32Bit())) {
// Splatted value can fit in one INTEGER constant in constant pool.
// Load the constant and broadcast it.
MVT CVT = MVT::getIntegerVT(SplatBitSize);
Type *ScalarTy = Type::getIntNTy(*Ctx, SplatBitSize);
Constant *C = Constant::getIntegerValue(ScalarTy, SplatValue);
SDValue CP = DAG.getConstantPool(C, PVT);
unsigned Repeat = VT.getSizeInBits() / SplatBitSize;
unsigned Alignment = cast<ConstantPoolSDNode>(CP)->getAlignment();
Ld = DAG.getLoad(
CVT, dl, DAG.getEntryNode(), CP,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
Alignment);
SDValue Brdcst = DAG.getNode(X86ISD::VBROADCAST, dl,
MVT::getVectorVT(CVT, Repeat), Ld);
return DAG.getBitcast(VT, Brdcst);
} else if (SplatBitSize == 32 || SplatBitSize == 64) {
// Splatted value can fit in one FLOAT constant in constant pool.
// Load the constant and broadcast it.
// AVX has support for 32 and 64 bit broadcasts for floats only.
// There is no 64-bit integer broadcast on a 32-bit subtarget.
MVT CVT = MVT::getFloatingPointVT(SplatBitSize);
// Lower the splat via APFloat directly, to avoid any conversion.
Constant *C =
SplatBitSize == 32
? ConstantFP::get(*Ctx,
APFloat(APFloat::IEEEsingle(), SplatValue))
: ConstantFP::get(*Ctx,
APFloat(APFloat::IEEEdouble(), SplatValue));
SDValue CP = DAG.getConstantPool(C, PVT);
unsigned Repeat = VT.getSizeInBits() / SplatBitSize;
unsigned Alignment = cast<ConstantPoolSDNode>(CP)->getAlignment();
Ld = DAG.getLoad(
CVT, dl, DAG.getEntryNode(), CP,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
Alignment);
SDValue Brdcst = DAG.getNode(X86ISD::VBROADCAST, dl,
MVT::getVectorVT(CVT, Repeat), Ld);
return DAG.getBitcast(VT, Brdcst);
} else if (SplatBitSize > 64) {
// Load the vector of constants and broadcast it.
MVT CVT = VT.getScalarType();
Constant *VecC = getConstantVector(VT, SplatValue, SplatBitSize,
*Ctx);
SDValue VCP = DAG.getConstantPool(VecC, PVT);
unsigned NumElm = SplatBitSize / VT.getScalarSizeInBits();
unsigned Alignment = cast<ConstantPoolSDNode>(VCP)->getAlignment();
Ld = DAG.getLoad(
MVT::getVectorVT(CVT, NumElm), dl, DAG.getEntryNode(), VCP,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
Alignment);
SDValue Brdcst = DAG.getNode(X86ISD::SUBV_BROADCAST, dl, VT, Ld);
return DAG.getBitcast(VT, Brdcst);
}
}
}
return SDValue();
}
bool ConstSplatVal =
(Ld.getOpcode() == ISD::Constant || Ld.getOpcode() == ISD::ConstantFP);
// Make sure that all of the users of a non-constant load are from the
// BUILD_VECTOR node.
if (!ConstSplatVal && !BVOp->isOnlyUserOf(Ld.getNode()))
return SDValue();
unsigned ScalarSize = Ld.getValueSizeInBits();
bool IsGE256 = (VT.getSizeInBits() >= 256);
// When optimizing for size, generate up to 5 extra bytes for a broadcast
// instruction to save 8 or more bytes of constant pool data.
// TODO: If multiple splats are generated to load the same constant,
// it may be detrimental to overall size. There needs to be a way to detect
// that condition to know if this is truly a size win.
bool OptForSize = DAG.getMachineFunction().getFunction()->optForSize();
// Handle broadcasting a single constant scalar from the constant pool
// into a vector.
// On Sandybridge (no AVX2), it is still better to load a constant vector
// from the constant pool and not to broadcast it from a scalar.
// But override that restriction when optimizing for size.
// TODO: Check if splatting is recommended for other AVX-capable CPUs.
if (ConstSplatVal && (Subtarget.hasAVX2() || OptForSize)) {
EVT CVT = Ld.getValueType();
assert(!CVT.isVector() && "Must not broadcast a vector type");
// Splat f32, i32, v4f64, v4i64 in all cases with AVX2.
// For size optimization, also splat v2f64 and v2i64, and for size opt
// with AVX2, also splat i8 and i16.
// With pattern matching, the VBROADCAST node may become a VMOVDDUP.
if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
(OptForSize && (ScalarSize == 64 || Subtarget.hasAVX2()))) {
const Constant *C = nullptr;
if (ConstantSDNode *CI = dyn_cast<ConstantSDNode>(Ld))
C = CI->getConstantIntValue();
else if (ConstantFPSDNode *CF = dyn_cast<ConstantFPSDNode>(Ld))
C = CF->getConstantFPValue();
assert(C && "Invalid constant type");
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue CP =
DAG.getConstantPool(C, TLI.getPointerTy(DAG.getDataLayout()));
unsigned Alignment = cast<ConstantPoolSDNode>(CP)->getAlignment();
Ld = DAG.getLoad(
CVT, dl, DAG.getEntryNode(), CP,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
Alignment);
return DAG.getNode(X86ISD::VBROADCAST, dl, VT, Ld);
}
}
bool IsLoad = ISD::isNormalLoad(Ld.getNode());
// Handle AVX2 in-register broadcasts.
if (!IsLoad && Subtarget.hasInt256() &&
(ScalarSize == 32 || (IsGE256 && ScalarSize == 64)))
return DAG.getNode(X86ISD::VBROADCAST, dl, VT, Ld);
// The scalar source must be a normal load.
if (!IsLoad)
return SDValue();
if (ScalarSize == 32 || (IsGE256 && ScalarSize == 64) ||
(Subtarget.hasVLX() && ScalarSize == 64))
return DAG.getNode(X86ISD::VBROADCAST, dl, VT, Ld);
// The integer check is needed for the 64-bit into 128-bit case, so it doesn't
// match double, since there is no vbroadcastsd xmm.
if (Subtarget.hasInt256() && Ld.getValueType().isInteger()) {
if (ScalarSize == 8 || ScalarSize == 16 || ScalarSize == 64)
return DAG.getNode(X86ISD::VBROADCAST, dl, VT, Ld);
}
// Unsupported broadcast.
return SDValue();
}
/// \brief For an EXTRACT_VECTOR_ELT with a constant index return the real
/// underlying vector and index.
///
/// Modifies \p ExtractedFromVec to the real vector and returns the real
/// index.
static int getUnderlyingExtractedFromVec(SDValue &ExtractedFromVec,
SDValue ExtIdx) {
int Idx = cast<ConstantSDNode>(ExtIdx)->getZExtValue();
if (!isa<ShuffleVectorSDNode>(ExtractedFromVec))
return Idx;
// For 256-bit vectors, LowerEXTRACT_VECTOR_ELT_SSE4 may have already
// lowered this:
// (extract_vector_elt (v8f32 %vreg1), Constant<6>)
// to:
// (extract_vector_elt (vector_shuffle<2,u,u,u>
// (extract_subvector (v8f32 %vreg0), Constant<4>),
// undef)
// Constant<0>)
// In this case the vector is the extract_subvector expression and the index
// is 2, as specified by the shuffle.
ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(ExtractedFromVec);
SDValue ShuffleVec = SVOp->getOperand(0);
MVT ShuffleVecVT = ShuffleVec.getSimpleValueType();
assert(ShuffleVecVT.getVectorElementType() ==
ExtractedFromVec.getSimpleValueType().getVectorElementType());
int ShuffleIdx = SVOp->getMaskElt(Idx);
if (isUndefOrInRange(ShuffleIdx, 0, ShuffleVecVT.getVectorNumElements())) {
ExtractedFromVec = ShuffleVec;
return ShuffleIdx;
}
return Idx;
}
static SDValue buildFromShuffleMostly(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
// Skip if insert_vec_elt is not supported.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isOperationLegalOrCustom(ISD::INSERT_VECTOR_ELT, VT))
return SDValue();
SDLoc DL(Op);
unsigned NumElems = Op.getNumOperands();
SDValue VecIn1;
SDValue VecIn2;
SmallVector<unsigned, 4> InsertIndices;
SmallVector<int, 8> Mask(NumElems, -1);
for (unsigned i = 0; i != NumElems; ++i) {
unsigned Opc = Op.getOperand(i).getOpcode();
if (Opc == ISD::UNDEF)
continue;
if (Opc != ISD::EXTRACT_VECTOR_ELT) {
// Quit if more than 1 element needs inserting.
if (InsertIndices.size() > 1)
return SDValue();
InsertIndices.push_back(i);
continue;
}
SDValue ExtractedFromVec = Op.getOperand(i).getOperand(0);
SDValue ExtIdx = Op.getOperand(i).getOperand(1);
// Quit if non-constant index.
if (!isa<ConstantSDNode>(ExtIdx))
return SDValue();
int Idx = getUnderlyingExtractedFromVec(ExtractedFromVec, ExtIdx);
// Quit if extracted from vector of different type.
if (ExtractedFromVec.getValueType() != VT)
return SDValue();
if (!VecIn1.getNode())
VecIn1 = ExtractedFromVec;
else if (VecIn1 != ExtractedFromVec) {
if (!VecIn2.getNode())
VecIn2 = ExtractedFromVec;
else if (VecIn2 != ExtractedFromVec)
// Quit if more than 2 vectors to shuffle
return SDValue();
}
if (ExtractedFromVec == VecIn1)
Mask[i] = Idx;
else if (ExtractedFromVec == VecIn2)
Mask[i] = Idx + NumElems;
}
if (!VecIn1.getNode())
return SDValue();
VecIn2 = VecIn2.getNode() ? VecIn2 : DAG.getUNDEF(VT);
SDValue NV = DAG.getVectorShuffle(VT, DL, VecIn1, VecIn2, Mask);
for (unsigned Idx : InsertIndices)
NV = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, NV, Op.getOperand(Idx),
DAG.getIntPtrConstant(Idx, DL));
return NV;
}
static SDValue ConvertI1VectorToInteger(SDValue Op, SelectionDAG &DAG) {
assert(ISD::isBuildVectorOfConstantSDNodes(Op.getNode()) &&
Op.getScalarValueSizeInBits() == 1 &&
"Can not convert non-constant vector");
uint64_t Immediate = 0;
for (unsigned idx = 0, e = Op.getNumOperands(); idx < e; ++idx) {
SDValue In = Op.getOperand(idx);
if (!In.isUndef())
Immediate |= (cast<ConstantSDNode>(In)->getZExtValue() & 0x1) << idx;
}
SDLoc dl(Op);
MVT VT = MVT::getIntegerVT(std::max((int)Op.getValueSizeInBits(), 8));
return DAG.getConstant(Immediate, dl, VT);
}
// Lower BUILD_VECTOR operation for v8i1 and v16i1 types.
SDValue
X86TargetLowering::LowerBUILD_VECTORvXi1(SDValue Op, SelectionDAG &DAG) const {
MVT VT = Op.getSimpleValueType();
assert((VT.getVectorElementType() == MVT::i1) &&
"Unexpected type in LowerBUILD_VECTORvXi1!");
SDLoc dl(Op);
if (ISD::isBuildVectorAllZeros(Op.getNode()))
return DAG.getTargetConstant(0, dl, VT);
if (ISD::isBuildVectorAllOnes(Op.getNode()))
return DAG.getTargetConstant(1, dl, VT);
if (ISD::isBuildVectorOfConstantSDNodes(Op.getNode())) {
SDValue Imm = ConvertI1VectorToInteger(Op, DAG);
if (Imm.getValueSizeInBits() == VT.getSizeInBits())
return DAG.getBitcast(VT, Imm);
SDValue ExtVec = DAG.getBitcast(MVT::v8i1, Imm);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, ExtVec,
DAG.getIntPtrConstant(0, dl));
}
// Vector has one or more non-const elements
uint64_t Immediate = 0;
SmallVector<unsigned, 16> NonConstIdx;
bool IsSplat = true;
bool HasConstElts = false;
int SplatIdx = -1;
for (unsigned idx = 0, e = Op.getNumOperands(); idx < e; ++idx) {
SDValue In = Op.getOperand(idx);
if (In.isUndef())
continue;
if (!isa<ConstantSDNode>(In))
NonConstIdx.push_back(idx);
else {
Immediate |= (cast<ConstantSDNode>(In)->getZExtValue() & 0x1) << idx;
HasConstElts = true;
}
if (SplatIdx < 0)
SplatIdx = idx;
else if (In != Op.getOperand(SplatIdx))
IsSplat = false;
}
// for splat use " (select i1 splat_elt, all-ones, all-zeroes)"
if (IsSplat)
return DAG.getSelect(dl, VT, Op.getOperand(SplatIdx),
DAG.getConstant(1, dl, VT),
DAG.getConstant(0, dl, VT));
// Insert elements one by one.
SDValue DstVec;
SDValue Imm;
if (Immediate) {
MVT ImmVT = MVT::getIntegerVT(std::max((int)VT.getSizeInBits(), 8));
Imm = DAG.getConstant(Immediate, dl, ImmVT);
}
else if (HasConstElts)
Imm = DAG.getConstant(0, dl, VT);
else
Imm = DAG.getUNDEF(VT);
if (Imm.getValueSizeInBits() == VT.getSizeInBits())
DstVec = DAG.getBitcast(VT, Imm);
else {
SDValue ExtVec = DAG.getBitcast(MVT::v8i1, Imm);
DstVec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, ExtVec,
DAG.getIntPtrConstant(0, dl));
}
for (unsigned i = 0, e = NonConstIdx.size(); i != e; ++i) {
unsigned InsertIdx = NonConstIdx[i];
DstVec = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, DstVec,
Op.getOperand(InsertIdx),
DAG.getIntPtrConstant(InsertIdx, dl));
}
return DstVec;
}
/// \brief Return true if \p N implements a horizontal binop and return the
/// operands for the horizontal binop into V0 and V1.
///
/// This is a helper function of LowerToHorizontalOp().
/// This function checks that the build_vector \p N in input implements a
/// horizontal operation. Parameter \p Opcode defines the kind of horizontal
/// operation to match.
/// For example, if \p Opcode is equal to ISD::ADD, then this function
/// checks if \p N implements a horizontal arithmetic add; if instead \p Opcode
/// is equal to ISD::SUB, then this function checks if this is a horizontal
/// arithmetic sub.
///
/// This function only analyzes elements of \p N whose indices are
/// in range [BaseIdx, LastIdx).
static bool isHorizontalBinOp(const BuildVectorSDNode *N, unsigned Opcode,
SelectionDAG &DAG,
unsigned BaseIdx, unsigned LastIdx,
SDValue &V0, SDValue &V1) {
EVT VT = N->getValueType(0);
assert(BaseIdx * 2 <= LastIdx && "Invalid Indices in input!");
assert(VT.isVector() && VT.getVectorNumElements() >= LastIdx &&
"Invalid Vector in input!");
bool IsCommutable = (Opcode == ISD::ADD || Opcode == ISD::FADD);
bool CanFold = true;
unsigned ExpectedVExtractIdx = BaseIdx;
unsigned NumElts = LastIdx - BaseIdx;
V0 = DAG.getUNDEF(VT);
V1 = DAG.getUNDEF(VT);
// Check if N implements a horizontal binop.
for (unsigned i = 0, e = NumElts; i != e && CanFold; ++i) {
SDValue Op = N->getOperand(i + BaseIdx);
// Skip UNDEFs.
if (Op->isUndef()) {
// Update the expected vector extract index.
if (i * 2 == NumElts)
ExpectedVExtractIdx = BaseIdx;
ExpectedVExtractIdx += 2;
continue;
}
CanFold = Op->getOpcode() == Opcode && Op->hasOneUse();
if (!CanFold)
break;
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
// Try to match the following pattern:
// (BINOP (extract_vector_elt A, I), (extract_vector_elt A, I+1))
CanFold = (Op0.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
Op1.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
Op0.getOperand(0) == Op1.getOperand(0) &&
isa<ConstantSDNode>(Op0.getOperand(1)) &&
isa<ConstantSDNode>(Op1.getOperand(1)));
if (!CanFold)
break;
unsigned I0 = cast<ConstantSDNode>(Op0.getOperand(1))->getZExtValue();
unsigned I1 = cast<ConstantSDNode>(Op1.getOperand(1))->getZExtValue();
if (i * 2 < NumElts) {
if (V0.isUndef()) {
V0 = Op0.getOperand(0);
if (V0.getValueType() != VT)
return false;
}
} else {
if (V1.isUndef()) {
V1 = Op0.getOperand(0);
if (V1.getValueType() != VT)
return false;
}
if (i * 2 == NumElts)
ExpectedVExtractIdx = BaseIdx;
}
SDValue Expected = (i * 2 < NumElts) ? V0 : V1;
if (I0 == ExpectedVExtractIdx)
CanFold = I1 == I0 + 1 && Op0.getOperand(0) == Expected;
else if (IsCommutable && I1 == ExpectedVExtractIdx) {
// Try to match the following dag sequence:
// (BINOP (extract_vector_elt A, I+1), (extract_vector_elt A, I))
CanFold = I0 == I1 + 1 && Op1.getOperand(0) == Expected;
} else
CanFold = false;
ExpectedVExtractIdx += 2;
}
return CanFold;
}
/// \brief Emit a sequence of two 128-bit horizontal add/sub followed by
/// a concat_vector.
///
/// This is a helper function of LowerToHorizontalOp().
/// This function expects two 256-bit vectors called V0 and V1.
/// At first, each vector is split into two separate 128-bit vectors.
/// Then, the resulting 128-bit vectors are used to implement two
/// horizontal binary operations.
///
/// The kind of horizontal binary operation is defined by \p X86Opcode.
///
/// \p Mode specifies how the 128-bit parts of V0 and V1 are passed in input to
/// the two new horizontal binops.
/// When Mode is set, the first horizontal binop dag node takes as input
/// the lower 128 bits of V0 and the upper 128 bits of V0. The second
/// horizontal binop dag node takes as input the lower 128 bits of V1
/// and the upper 128 bits of V1.
/// Example:
/// HADD V0_LO, V0_HI
/// HADD V1_LO, V1_HI
///
/// Otherwise, the first horizontal binop dag node takes as input the lower
/// 128 bits of V0 and the lower 128 bits of V1, and the second horizontal
/// binop dag node takes the upper 128 bits of V0 and the upper 128 bits of V1.
/// Example:
/// HADD V0_LO, V1_LO
/// HADD V0_HI, V1_HI
///
/// If \p isUndefLO is set, then the algorithm propagates UNDEF to the lower
/// 128 bits of the result. If \p isUndefHI is set, then UNDEF is propagated to
/// the upper 128 bits of the result.
static SDValue ExpandHorizontalBinOp(const SDValue &V0, const SDValue &V1,
const SDLoc &DL, SelectionDAG &DAG,
unsigned X86Opcode, bool Mode,
bool isUndefLO, bool isUndefHI) {
MVT VT = V0.getSimpleValueType();
assert(VT.is256BitVector() && VT == V1.getSimpleValueType() &&
"Invalid nodes in input!");
unsigned NumElts = VT.getVectorNumElements();
SDValue V0_LO = extract128BitVector(V0, 0, DAG, DL);
SDValue V0_HI = extract128BitVector(V0, NumElts/2, DAG, DL);
SDValue V1_LO = extract128BitVector(V1, 0, DAG, DL);
SDValue V1_HI = extract128BitVector(V1, NumElts/2, DAG, DL);
MVT NewVT = V0_LO.getSimpleValueType();
SDValue LO = DAG.getUNDEF(NewVT);
SDValue HI = DAG.getUNDEF(NewVT);
if (Mode) {
// Don't emit a horizontal binop if the result is expected to be UNDEF.
if (!isUndefLO && !V0->isUndef())
LO = DAG.getNode(X86Opcode, DL, NewVT, V0_LO, V0_HI);
if (!isUndefHI && !V1->isUndef())
HI = DAG.getNode(X86Opcode, DL, NewVT, V1_LO, V1_HI);
} else {
// Don't emit a horizontal binop if the result is expected to be UNDEF.
if (!isUndefLO && (!V0_LO->isUndef() || !V1_LO->isUndef()))
LO = DAG.getNode(X86Opcode, DL, NewVT, V0_LO, V1_LO);
if (!isUndefHI && (!V0_HI->isUndef() || !V1_HI->isUndef()))
HI = DAG.getNode(X86Opcode, DL, NewVT, V0_HI, V1_HI);
}
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, LO, HI);
}
/// Returns true iff \p BV builds a vector with the result equivalent to
/// the result of an ADDSUB operation.
/// If true is returned, then the operands of the ADDSUB = Opnd0 +- Opnd1
/// operation are written to the parameters \p Opnd0 and \p Opnd1.
static bool isAddSub(const BuildVectorSDNode *BV,
const X86Subtarget &Subtarget, SelectionDAG &DAG,
SDValue &Opnd0, SDValue &Opnd1) {
MVT VT = BV->getSimpleValueType(0);
if ((!Subtarget.hasSSE3() || (VT != MVT::v4f32 && VT != MVT::v2f64)) &&
(!Subtarget.hasAVX() || (VT != MVT::v8f32 && VT != MVT::v4f64)) &&
(!Subtarget.hasAVX512() || (VT != MVT::v16f32 && VT != MVT::v8f64)))
return false;
unsigned NumElts = VT.getVectorNumElements();
SDValue InVec0 = DAG.getUNDEF(VT);
SDValue InVec1 = DAG.getUNDEF(VT);
// Odd-numbered elements in the input build vector are obtained from
// adding two integer/float elements.
// Even-numbered elements in the input build vector are obtained from
// subtracting two integer/float elements.
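// i.e. an ADDSUB of v4f32 computes <A0-B0, A1+B1, A2-B2, A3+B3>.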
unsigned ExpectedOpcode = ISD::FSUB;
unsigned NextExpectedOpcode = ISD::FADD;
bool AddFound = false;
bool SubFound = false;
for (unsigned i = 0, e = NumElts; i != e; ++i) {
SDValue Op = BV->getOperand(i);
// Skip 'undef' values.
unsigned Opcode = Op.getOpcode();
if (Opcode == ISD::UNDEF) {
std::swap(ExpectedOpcode, NextExpectedOpcode);
continue;
}
// Early exit if we found an unexpected opcode.
if (Opcode != ExpectedOpcode)
return false;
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
// Try to match the following pattern:
// (BINOP (extract_vector_elt A, i), (extract_vector_elt B, i))
// Early exit if we cannot match that sequence.
if (Op0.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
Op1.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
!isa<ConstantSDNode>(Op0.getOperand(1)) ||
!isa<ConstantSDNode>(Op1.getOperand(1)) ||
Op0.getOperand(1) != Op1.getOperand(1))
return false;
unsigned I0 = cast<ConstantSDNode>(Op0.getOperand(1))->getZExtValue();
if (I0 != i)
return false;
// We found a valid add/sub node. Update the information accordingly.
if (i & 1)
AddFound = true;
else
SubFound = true;
// Update InVec0 and InVec1.
if (InVec0.isUndef()) {
InVec0 = Op0.getOperand(0);
if (InVec0.getSimpleValueType() != VT)
return false;
}
if (InVec1.isUndef()) {
InVec1 = Op1.getOperand(0);
if (InVec1.getSimpleValueType() != VT)
return false;
}
// Make sure that operands in input to each add/sub node always
// come from the same pair of vectors.
if (InVec0 != Op0.getOperand(0)) {
if (ExpectedOpcode == ISD::FSUB)
return false;
// FADD is commutable. Try to commute the operands
// and then test again.
std::swap(Op0, Op1);
if (InVec0 != Op0.getOperand(0))
return false;
}
if (InVec1 != Op1.getOperand(0))
return false;
// Update the pair of expected opcodes.
std::swap(ExpectedOpcode, NextExpectedOpcode);
}
// Don't try to fold this build_vector into an ADDSUB if the inputs are undef.
if (!AddFound || !SubFound || InVec0.isUndef() || InVec1.isUndef())
return false;
Opnd0 = InVec0;
Opnd1 = InVec1;
return true;
}
/// Returns true if it is possible to fold MUL and an idiom that has already been
/// recognized as ADDSUB(\p Opnd0, \p Opnd1) into FMADDSUB(x, y, \p Opnd1).
/// If (and only if) true is returned, the operands of FMADDSUB are written to
/// parameters \p Opnd0, \p Opnd1, \p Opnd2.
///
/// Prior to calling this function it should be known that there is some
/// SDNode that potentially can be replaced with an X86ISD::ADDSUB operation
/// using \p Opnd0 and \p Opnd1 as operands. Also, this method is called
/// before replacement of such SDNode with ADDSUB operation. Thus the number
/// of \p Opnd0 uses is expected to be equal to 2.
/// For example, this function may be called for the following IR:
/// %AB = fmul fast <2 x double> %A, %B
/// %Sub = fsub fast <2 x double> %AB, %C
/// %Add = fadd fast <2 x double> %AB, %C
/// %Addsub = shufflevector <2 x double> %Sub, <2 x double> %Add,
/// <2 x i32> <i32 0, i32 3>
/// There is a def for %Addsub here, which potentially can be replaced by
/// X86ISD::ADDSUB operation:
/// %Addsub = X86ISD::ADDSUB %AB, %C
/// and such ADDSUB can further be replaced with FMADDSUB:
/// %Addsub = FMADDSUB %A, %B, %C.
///
/// The main reason why this method is called before the replacement of the
/// recognized ADDSUB idiom with ADDSUB operation is that such replacement
/// is illegal sometimes. E.g. 512-bit ADDSUB is not available, while 512-bit
/// FMADDSUB is.
static bool isFMAddSub(const X86Subtarget &Subtarget, SelectionDAG &DAG,
SDValue &Opnd0, SDValue &Opnd1, SDValue &Opnd2) {
if (Opnd0.getOpcode() != ISD::FMUL || Opnd0->use_size() != 2 ||
!Subtarget.hasAnyFMA())
return false;
// FIXME: These checks must match the similar ones in
// DAGCombiner::visitFADDForFMACombine. It would be good to have one
// function that would answer if it is Ok to fuse MUL + ADD to FMADD
// or MUL + ADDSUB to FMADDSUB.
const TargetOptions &Options = DAG.getTarget().Options;
bool AllowFusion =
(Options.AllowFPOpFusion == FPOpFusion::Fast || Options.UnsafeFPMath);
if (!AllowFusion)
return false;
Opnd2 = Opnd1;
Opnd1 = Opnd0.getOperand(1);
Opnd0 = Opnd0.getOperand(0);
return true;
}
/// Try to fold a build_vector that performs an 'addsub' or 'fmaddsub' operation
/// into an X86ISD::ADDSUB or X86ISD::FMADDSUB node accordingly.
static SDValue lowerToAddSubOrFMAddSub(const BuildVectorSDNode *BV,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue Opnd0, Opnd1;
if (!isAddSub(BV, Subtarget, DAG, Opnd0, Opnd1))
return SDValue();
MVT VT = BV->getSimpleValueType(0);
SDLoc DL(BV);
// Try to generate X86ISD::FMADDSUB node here.
SDValue Opnd2;
if (isFMAddSub(Subtarget, DAG, Opnd0, Opnd1, Opnd2))
return DAG.getNode(X86ISD::FMADDSUB, DL, VT, Opnd0, Opnd1, Opnd2);
// Do not generate X86ISD::ADDSUB node for 512-bit types even though
// the ADDSUB idiom has been successfully recognized. There are no known
// X86 targets with 512-bit ADDSUB instructions!
// 512-bit ADDSUB idiom recognition was needed only as part of FMADDSUB idiom
// recognition.
if (VT.is512BitVector())
return SDValue();
return DAG.getNode(X86ISD::ADDSUB, DL, VT, Opnd0, Opnd1);
}
/// Lower BUILD_VECTOR to a horizontal add/sub operation if possible.
static SDValue LowerToHorizontalOp(const BuildVectorSDNode *BV,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = BV->getSimpleValueType(0);
unsigned NumElts = VT.getVectorNumElements();
unsigned NumUndefsLO = 0;
unsigned NumUndefsHI = 0;
unsigned Half = NumElts/2;
// Count the number of UNDEF operands in the build_vector in input.
for (unsigned i = 0, e = Half; i != e; ++i)
if (BV->getOperand(i)->isUndef())
NumUndefsLO++;
for (unsigned i = Half, e = NumElts; i != e; ++i)
if (BV->getOperand(i)->isUndef())
NumUndefsHI++;
// Early exit if this is either a build_vector of all UNDEFs, or if all the
// operands but one are UNDEF.
if (NumUndefsLO + NumUndefsHI + 1 >= NumElts)
return SDValue();
SDLoc DL(BV);
SDValue InVec0, InVec1;
if ((VT == MVT::v4f32 || VT == MVT::v2f64) && Subtarget.hasSSE3()) {
// Try to match an SSE3 float HADD/HSUB.
if (isHorizontalBinOp(BV, ISD::FADD, DAG, 0, NumElts, InVec0, InVec1))
return DAG.getNode(X86ISD::FHADD, DL, VT, InVec0, InVec1);
if (isHorizontalBinOp(BV, ISD::FSUB, DAG, 0, NumElts, InVec0, InVec1))
return DAG.getNode(X86ISD::FHSUB, DL, VT, InVec0, InVec1);
} else if ((VT == MVT::v4i32 || VT == MVT::v8i16) && Subtarget.hasSSSE3()) {
// Try to match an SSSE3 integer HADD/HSUB.
if (isHorizontalBinOp(BV, ISD::ADD, DAG, 0, NumElts, InVec0, InVec1))
return DAG.getNode(X86ISD::HADD, DL, VT, InVec0, InVec1);
if (isHorizontalBinOp(BV, ISD::SUB, DAG, 0, NumElts, InVec0, InVec1))
return DAG.getNode(X86ISD::HSUB, DL, VT, InVec0, InVec1);
}
if (!Subtarget.hasAVX())
return SDValue();
if ((VT == MVT::v8f32 || VT == MVT::v4f64)) {
// Try to match an AVX horizontal add/sub of packed single/double
// precision floating point values from 256-bit vectors.
SDValue InVec2, InVec3;
if (isHorizontalBinOp(BV, ISD::FADD, DAG, 0, Half, InVec0, InVec1) &&
isHorizontalBinOp(BV, ISD::FADD, DAG, Half, NumElts, InVec2, InVec3) &&
((InVec0.isUndef() || InVec2.isUndef()) || InVec0 == InVec2) &&
((InVec1.isUndef() || InVec3.isUndef()) || InVec1 == InVec3))
return DAG.getNode(X86ISD::FHADD, DL, VT, InVec0, InVec1);
if (isHorizontalBinOp(BV, ISD::FSUB, DAG, 0, Half, InVec0, InVec1) &&
isHorizontalBinOp(BV, ISD::FSUB, DAG, Half, NumElts, InVec2, InVec3) &&
((InVec0.isUndef() || InVec2.isUndef()) || InVec0 == InVec2) &&
((InVec1.isUndef() || InVec3.isUndef()) || InVec1 == InVec3))
return DAG.getNode(X86ISD::FHSUB, DL, VT, InVec0, InVec1);
} else if (VT == MVT::v8i32 || VT == MVT::v16i16) {
// Try to match an AVX2 horizontal add/sub of signed integers.
SDValue InVec2, InVec3;
unsigned X86Opcode;
bool CanFold = true;
if (isHorizontalBinOp(BV, ISD::ADD, DAG, 0, Half, InVec0, InVec1) &&
isHorizontalBinOp(BV, ISD::ADD, DAG, Half, NumElts, InVec2, InVec3) &&
((InVec0.isUndef() || InVec2.isUndef()) || InVec0 == InVec2) &&
((InVec1.isUndef() || InVec3.isUndef()) || InVec1 == InVec3))
X86Opcode = X86ISD::HADD;
else if (isHorizontalBinOp(BV, ISD::SUB, DAG, 0, Half, InVec0, InVec1) &&
isHorizontalBinOp(BV, ISD::SUB, DAG, Half, NumElts, InVec2, InVec3) &&
((InVec0.isUndef() || InVec2.isUndef()) || InVec0 == InVec2) &&
((InVec1.isUndef() || InVec3.isUndef()) || InVec1 == InVec3))
X86Opcode = X86ISD::HSUB;
else
CanFold = false;
if (CanFold) {
// Fold this build_vector into a single horizontal add/sub.
// Do this only if the target has AVX2.
if (Subtarget.hasAVX2())
return DAG.getNode(X86Opcode, DL, VT, InVec0, InVec1);
// Do not try to expand this build_vector into a pair of horizontal
// add/sub if we can emit a pair of scalar add/sub.
if (NumUndefsLO + 1 == Half || NumUndefsHI + 1 == Half)
return SDValue();
// Convert this build_vector into a pair of horizontal binop followed by
// a concat vector.
bool isUndefLO = NumUndefsLO == Half;
bool isUndefHI = NumUndefsHI == Half;
return ExpandHorizontalBinOp(InVec0, InVec1, DL, DAG, X86Opcode, false,
isUndefLO, isUndefHI);
}
}
if ((VT == MVT::v8f32 || VT == MVT::v4f64 || VT == MVT::v8i32 ||
VT == MVT::v16i16) && Subtarget.hasAVX()) {
unsigned X86Opcode;
if (isHorizontalBinOp(BV, ISD::ADD, DAG, 0, NumElts, InVec0, InVec1))
X86Opcode = X86ISD::HADD;
else if (isHorizontalBinOp(BV, ISD::SUB, DAG, 0, NumElts, InVec0, InVec1))
X86Opcode = X86ISD::HSUB;
else if (isHorizontalBinOp(BV, ISD::FADD, DAG, 0, NumElts, InVec0, InVec1))
X86Opcode = X86ISD::FHADD;
else if (isHorizontalBinOp(BV, ISD::FSUB, DAG, 0, NumElts, InVec0, InVec1))
X86Opcode = X86ISD::FHSUB;
else
return SDValue();
// Don't try to expand this build_vector into a pair of horizontal add/sub
// if we can simply emit a pair of scalar add/sub.
if (NumUndefsLO + 1 == Half || NumUndefsHI + 1 == Half)
return SDValue();
// Convert this build_vector into two horizontal add/sub followed by
// a concat vector.
bool isUndefLO = NumUndefsLO == Half;
bool isUndefHI = NumUndefsHI == Half;
return ExpandHorizontalBinOp(InVec0, InVec1, DL, DAG, X86Opcode, true,
isUndefLO, isUndefHI);
}
return SDValue();
}
/// If a BUILD_VECTOR's source elements all apply the same bit operation and
/// one of their operands is constant, lower to a pair of BUILD_VECTOR and
/// just apply the bit to the vectors.
/// NOTE: It's not in our interest to start making a general purpose vectorizer
/// from this, but enough scalar bit operations are created by the later
/// legalization + scalarization stages to need basic support.
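/// e.g. (build_vector (and a, 1), (and b, 2), (and c, 4), (and d, 8)) becomes
/// (and (build_vector a, b, c, d), (build_vector 1, 2, 4, 8)).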
static SDValue lowerBuildVectorToBitOp(BuildVectorSDNode *Op,
SelectionDAG &DAG) {
SDLoc DL(Op);
MVT VT = Op->getSimpleValueType(0);
unsigned NumElems = VT.getVectorNumElements();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// Check that all elements have the same opcode.
// TODO: Should we allow UNDEFS and if so how many?
unsigned Opcode = Op->getOperand(0).getOpcode();
for (unsigned i = 1; i < NumElems; ++i)
if (Opcode != Op->getOperand(i).getOpcode())
return SDValue();
// TODO: We may be able to add support for other Ops (ADD/SUB + shifts).
switch (Opcode) {
default:
return SDValue();
case ISD::AND:
case ISD::XOR:
case ISD::OR:
if (!TLI.isOperationLegalOrPromote(Opcode, VT))
return SDValue();
break;
}
SmallVector<SDValue, 4> LHSElts, RHSElts;
for (SDValue Elt : Op->ops()) {
SDValue LHS = Elt.getOperand(0);
SDValue RHS = Elt.getOperand(1);
// We expect the canonicalized RHS operand to be the constant.
if (!isa<ConstantSDNode>(RHS))
return SDValue();
LHSElts.push_back(LHS);
RHSElts.push_back(RHS);
}
SDValue LHS = DAG.getBuildVector(VT, DL, LHSElts);
SDValue RHS = DAG.getBuildVector(VT, DL, RHSElts);
return DAG.getNode(Opcode, DL, VT, LHS, RHS);
}
/// Create a vector constant without a load. SSE/AVX provide the bare minimum
/// functionality to do this, so it's all zeros, all ones, or some derivation
/// that is cheap to calculate.
static SDValue materializeVectorConstant(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc DL(Op);
MVT VT = Op.getSimpleValueType();
// Vectors containing all zeros can be matched by pxor and xorps.
if (ISD::isBuildVectorAllZeros(Op.getNode())) {
// Canonicalize this to <4 x i32> to 1) ensure the zero vectors are CSE'd
// and 2) ensure that i64 scalars are eliminated on x86-32 hosts.
if (VT == MVT::v4i32 || VT == MVT::v8i32 || VT == MVT::v16i32)
return Op;
return getZeroVector(VT, Subtarget, DAG, DL);
}
// Vectors containing all ones can be matched by pcmpeqd on 128-bit width
// vectors or broken into v4i32 operations on 256-bit vectors. AVX2 can use
// vpcmpeqd on 256-bit vectors.
if (Subtarget.hasSSE2() && ISD::isBuildVectorAllOnes(Op.getNode())) {
if (VT == MVT::v4i32 || VT == MVT::v16i32 ||
(VT == MVT::v8i32 && Subtarget.hasInt256()))
return Op;
return getOnesVector(VT, DAG, DL);
}
return SDValue();
}
SDValue
X86TargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const {
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
MVT ExtVT = VT.getVectorElementType();
unsigned NumElems = Op.getNumOperands();
// Generate vectors for predicate vectors.
if (VT.getVectorElementType() == MVT::i1 && Subtarget.hasAVX512())
return LowerBUILD_VECTORvXi1(Op, DAG);
if (SDValue VectorConstant = materializeVectorConstant(Op, DAG, Subtarget))
return VectorConstant;
BuildVectorSDNode *BV = cast<BuildVectorSDNode>(Op.getNode());
if (SDValue AddSub = lowerToAddSubOrFMAddSub(BV, Subtarget, DAG))
return AddSub;
if (SDValue HorizontalOp = LowerToHorizontalOp(BV, Subtarget, DAG))
return HorizontalOp;
if (SDValue Broadcast = lowerBuildVectorAsBroadcast(BV, Subtarget, DAG))
return Broadcast;
if (SDValue BitOp = lowerBuildVectorToBitOp(BV, DAG))
return BitOp;
unsigned EVTBits = ExtVT.getSizeInBits();
unsigned NumZero = 0;
unsigned NumNonZero = 0;
uint64_t NonZeros = 0;
bool IsAllConstants = true;
SmallSet<SDValue, 8> Values;
for (unsigned i = 0; i < NumElems; ++i) {
SDValue Elt = Op.getOperand(i);
if (Elt.isUndef())
continue;
Values.insert(Elt);
if (Elt.getOpcode() != ISD::Constant &&
Elt.getOpcode() != ISD::ConstantFP)
IsAllConstants = false;
if (X86::isZeroNode(Elt))
NumZero++;
else {
assert(i < sizeof(NonZeros) * 8); // Make sure the shift is within range.
NonZeros |= ((uint64_t)1 << i);
NumNonZero++;
}
}
// All undef vector. Return an UNDEF. All zero vectors were handled above.
if (NumNonZero == 0)
return DAG.getUNDEF(VT);
// Special case for single non-zero, non-undef, element.
if (NumNonZero == 1) {
unsigned Idx = countTrailingZeros(NonZeros);
SDValue Item = Op.getOperand(Idx);
// If this is an insertion of an i64 value on x86-32, and if the top bits of
// the value are obviously zero, truncate the value to i32 and do the
// insertion that way. Only do this if the value is non-constant or if the
// value is a constant being inserted into element 0. It is cheaper to do
// a constant pool load than it is to do a movd + shuffle.
if (ExtVT == MVT::i64 && !Subtarget.is64Bit() &&
(!IsAllConstants || Idx == 0)) {
if (DAG.MaskedValueIsZero(Item, APInt::getHighBitsSet(64, 32))) {
// Handle SSE only.
assert(VT == MVT::v2i64 && "Expected an SSE value type!");
MVT VecVT = MVT::v4i32;
// Truncate the value (which may itself be a constant) to i32, and
// convert it to a vector with movd (S2V+shuffle to zero extend).
Item = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Item);
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT, Item);
return DAG.getBitcast(VT, getShuffleVectorZeroOrUndef(
Item, Idx * 2, true, Subtarget, DAG));
}
}
// If we have a constant or non-constant insertion into the low element of
// a vector, we can do this with SCALAR_TO_VECTOR + shuffle of zero into
// the rest of the elements. This will be matched as movd/movq/movss/movsd
// depending on what the source datatype is.
if (Idx == 0) {
if (NumZero == 0)
return DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Item);
if (ExtVT == MVT::i32 || ExtVT == MVT::f32 || ExtVT == MVT::f64 ||
(ExtVT == MVT::i64 && Subtarget.is64Bit())) {
assert((VT.is128BitVector() || VT.is256BitVector() ||
VT.is512BitVector()) &&
"Expected an SSE value type!");
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Item);
// Turn it into a MOVL (i.e. movss, movsd, or movd) to a zero vector.
return getShuffleVectorZeroOrUndef(Item, 0, true, Subtarget, DAG);
}
// We can't directly insert an i8 or i16 into a vector, so zero extend
// it to i32 first.
if (ExtVT == MVT::i16 || ExtVT == MVT::i8) {
Item = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, Item);
if (VT.getSizeInBits() >= 256) {
MVT ShufVT = MVT::getVectorVT(MVT::i32, VT.getSizeInBits()/32);
if (Subtarget.hasAVX()) {
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, ShufVT, Item);
Item = getShuffleVectorZeroOrUndef(Item, 0, true, Subtarget, DAG);
} else {
// Without AVX, we need to extend to a 128-bit vector and then
// insert into the 256-bit vector.
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, Item);
SDValue ZeroVec = getZeroVector(ShufVT, Subtarget, DAG, dl);
Item = insert128BitVector(ZeroVec, Item, 0, DAG, dl);
}
} else {
assert(VT.is128BitVector() && "Expected an SSE value type!");
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, Item);
Item = getShuffleVectorZeroOrUndef(Item, 0, true, Subtarget, DAG);
}
return DAG.getBitcast(VT, Item);
}
}
// Is it a vector logical left shift?
if (NumElems == 2 && Idx == 1 &&
X86::isZeroNode(Op.getOperand(0)) &&
!X86::isZeroNode(Op.getOperand(1))) {
unsigned NumBits = VT.getSizeInBits();
return getVShift(true, VT,
DAG.getNode(ISD::SCALAR_TO_VECTOR, dl,
VT, Op.getOperand(1)),
NumBits/2, DAG, *this, dl);
}
if (IsAllConstants) // Otherwise, it's better to do a constpool load.
return SDValue();
// Otherwise, if this is a vector with i32 or f32 elements, and the element
// is a non-constant being inserted into an element other than the low one,
// we can't use a constant pool load. Instead, use SCALAR_TO_VECTOR (aka
// movd/movss) to move this into the low element, then shuffle it into
// place.
if (EVTBits == 32) {
Item = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Item);
return getShuffleVectorZeroOrUndef(Item, Idx, NumZero > 0, Subtarget, DAG);
}
}
// Splat is obviously ok. Let legalizer expand it to a shuffle.
if (Values.size() == 1) {
if (EVTBits == 32) {
// Instead of a shuffle like this:
// shuffle (scalar_to_vector (load (ptr + 4))), undef, <0, 0, 0, 0>
// Check if it's possible to issue this instead.
// shuffle (vload ptr), undef, <1, 1, 1, 1>
unsigned Idx = countTrailingZeros(NonZeros);
SDValue Item = Op.getOperand(Idx);
if (Op.getNode()->isOnlyUserOf(Item.getNode()))
return LowerAsSplatVectorLoad(Item, VT, dl, DAG);
}
return SDValue();
}
// A vector full of immediates; various special cases are already
// handled, so this is best done with a single constant-pool load.
if (IsAllConstants)
return SDValue();
// See if we can use a vector load to get all of the elements.
if (VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector()) {
SmallVector<SDValue, 64> Ops(Op->op_begin(), Op->op_begin() + NumElems);
if (SDValue LD =
EltsFromConsecutiveLoads(VT, Ops, dl, DAG, Subtarget, false))
return LD;
}
// For AVX-length vectors, build the individual 128-bit pieces and use
// shuffles to put them in place.
if (VT.is256BitVector() || VT.is512BitVector()) {
SmallVector<SDValue, 64> Ops(Op->op_begin(), Op->op_begin() + NumElems);
EVT HVT = EVT::getVectorVT(*DAG.getContext(), ExtVT, NumElems/2);
// Build both the lower and upper subvector.
SDValue Lower =
DAG.getBuildVector(HVT, dl, makeArrayRef(&Ops[0], NumElems / 2));
SDValue Upper = DAG.getBuildVector(
HVT, dl, makeArrayRef(&Ops[NumElems / 2], NumElems / 2));
// Recreate the wider vector with the lower and upper part.
if (VT.is256BitVector())
return concat128BitVectors(Lower, Upper, VT, NumElems, DAG, dl);
return concat256BitVectors(Lower, Upper, VT, NumElems, DAG, dl);
}
// Let legalizer expand 2-wide build_vectors.
if (EVTBits == 64) {
if (NumNonZero == 1) {
// One half is zero or undef.
unsigned Idx = countTrailingZeros(NonZeros);
SDValue V2 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT,
Op.getOperand(Idx));
return getShuffleVectorZeroOrUndef(V2, Idx, true, Subtarget, DAG);
}
return SDValue();
}
// If element VT is < 32 bits, convert it to inserts into a zero vector.
if (EVTBits == 8 && NumElems == 16)
if (SDValue V = LowerBuildVectorv16i8(Op, NonZeros, NumNonZero, NumZero,
DAG, Subtarget))
return V;
if (EVTBits == 16 && NumElems == 8)
if (SDValue V = LowerBuildVectorv8i16(Op, NonZeros, NumNonZero, NumZero,
DAG, Subtarget))
return V;
// If element VT is == 32 bits and has 4 elems, try to generate an INSERTPS
if (EVTBits == 32 && NumElems == 4)
if (SDValue V = LowerBuildVectorv4x32(Op, DAG, Subtarget))
return V;
// If element VT is == 32 bits, turn it into a number of shuffles.
if (NumElems == 4 && NumZero > 0) {
SmallVector<SDValue, 8> Ops(NumElems);
for (unsigned i = 0; i < 4; ++i) {
bool isZero = !(NonZeros & (1ULL << i));
if (isZero)
Ops[i] = getZeroVector(VT, Subtarget, DAG, dl);
else
Ops[i] = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Op.getOperand(i));
}
for (unsigned i = 0; i < 2; ++i) {
switch ((NonZeros & (0x3 << i*2)) >> (i*2)) {
default: break;
case 0:
Ops[i] = Ops[i*2]; // Must be a zero vector.
break;
case 1:
Ops[i] = getMOVL(DAG, dl, VT, Ops[i*2+1], Ops[i*2]);
break;
case 2:
Ops[i] = getMOVL(DAG, dl, VT, Ops[i*2], Ops[i*2+1]);
break;
case 3:
Ops[i] = getUnpackl(DAG, dl, VT, Ops[i*2], Ops[i*2+1]);
break;
}
}
bool Reverse1 = (NonZeros & 0x3) == 2;
bool Reverse2 = ((NonZeros & (0x3 << 2)) >> 2) == 2;
int MaskVec[] = {
Reverse1 ? 1 : 0,
Reverse1 ? 0 : 1,
static_cast<int>(Reverse2 ? NumElems+1 : NumElems),
static_cast<int>(Reverse2 ? NumElems : NumElems+1)
};
return DAG.getVectorShuffle(VT, dl, Ops[0], Ops[1], MaskVec);
}
if (Values.size() > 1 && VT.is128BitVector()) {
// Check for a build vector from mostly shuffle plus few inserting.
if (SDValue Sh = buildFromShuffleMostly(Op, DAG))
return Sh;
// For SSE 4.1, use insertps to put the high elements into the low element.
if (Subtarget.hasSSE41()) {
SDValue Result;
if (!Op.getOperand(0).isUndef())
Result = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Op.getOperand(0));
else
Result = DAG.getUNDEF(VT);
for (unsigned i = 1; i < NumElems; ++i) {
if (Op.getOperand(i).isUndef()) continue;
Result = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, Result,
Op.getOperand(i), DAG.getIntPtrConstant(i, dl));
}
return Result;
}
// Otherwise, expand into a number of unpckl*, start by extending each of
// our (non-undef) elements to the full vector width with the element in the
// bottom slot of the vector (which generates no code for SSE).
SmallVector<SDValue, 8> Ops(NumElems);
for (unsigned i = 0; i < NumElems; ++i) {
if (!Op.getOperand(i).isUndef())
Ops[i] = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Op.getOperand(i));
else
Ops[i] = DAG.getUNDEF(VT);
}
// Next, we iteratively mix elements, e.g. for v4f32:
// Step 1: unpcklps 0, 1 ==> X: <?, ?, 1, 0>
// : unpcklps 2, 3 ==> Y: <?, ?, 3, 2>
// Step 2: unpcklpd X, Y ==> <3, 2, 1, 0>
for (unsigned Scale = 1; Scale < NumElems; Scale *= 2) {
// Generate scaled UNPCKL shuffle mask.
SmallVector<int, 16> Mask;
for (unsigned i = 0; i != Scale; ++i)
Mask.push_back(i);
for (unsigned i = 0; i != Scale; ++i)
Mask.push_back(NumElems+i);
Mask.append(NumElems - Mask.size(), SM_SentinelUndef);
for (unsigned i = 0, e = NumElems / (2 * Scale); i != e; ++i)
Ops[i] = DAG.getVectorShuffle(VT, dl, Ops[2*i], Ops[(2*i)+1], Mask);
}
return Ops[0];
}
return SDValue();
}
// 256-bit AVX can use the vinsertf128 instruction
// to create 256-bit vectors from two other 128-bit ones.
static SDValue LowerAVXCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) {
SDLoc dl(Op);
MVT ResVT = Op.getSimpleValueType();
assert((ResVT.is256BitVector() ||
ResVT.is512BitVector()) && "Value type must be 256-/512-bit wide");
SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);
unsigned NumElems = ResVT.getVectorNumElements();
if (ResVT.is256BitVector())
return concat128BitVectors(V1, V2, ResVT, NumElems, DAG, dl);
if (Op.getNumOperands() == 4) {
MVT HalfVT = MVT::getVectorVT(ResVT.getVectorElementType(),
ResVT.getVectorNumElements()/2);
SDValue V3 = Op.getOperand(2);
SDValue V4 = Op.getOperand(3);
return concat256BitVectors(
concat128BitVectors(V1, V2, HalfVT, NumElems / 2, DAG, dl),
concat128BitVectors(V3, V4, HalfVT, NumElems / 2, DAG, dl), ResVT,
NumElems, DAG, dl);
}
return concat256BitVectors(V1, V2, ResVT, NumElems, DAG, dl);
}
// Return true if all the operands of the given CONCAT_VECTORS node are zeros
// except for the first one. (CONCAT_VECTORS Op, 0, 0,...,0)
static bool isExpandWithZeros(const SDValue &Op) {
assert(Op.getOpcode() == ISD::CONCAT_VECTORS &&
"Expand with zeros only possible in CONCAT_VECTORS nodes!");
for (unsigned i = 1; i < Op.getNumOperands(); i++)
if (!ISD::isBuildVectorAllZeros(Op.getOperand(i).getNode()))
return false;
return true;
}
// Returns the original node if the given node is a type promotion (by
// concatenating i1 zeros) of the result of a node that already zeros all
// upper bits of a k-register; otherwise returns an empty SDValue.
static SDValue isTypePromotionOfi1ZeroUpBits(SDValue Op) {
unsigned Opc = Op.getOpcode();
assert(Opc == ISD::CONCAT_VECTORS &&
Op.getSimpleValueType().getVectorElementType() == MVT::i1 &&
"Unexpected node to check for type promotion!");
// As long as we are concatenating zeros to the upper part of a previous node
// result, climb up the tree until a node with a different opcode is
// encountered.
while (Opc == ISD::INSERT_SUBVECTOR || Opc == ISD::CONCAT_VECTORS) {
if (Opc == ISD::INSERT_SUBVECTOR) {
if (ISD::isBuildVectorAllZeros(Op.getOperand(0).getNode()) &&
Op.getConstantOperandVal(2) == 0)
Op = Op.getOperand(1);
else
return SDValue();
} else { // Opc == ISD::CONCAT_VECTORS
if (isExpandWithZeros(Op))
Op = Op.getOperand(0);
else
return SDValue();
}
Opc = Op.getOpcode();
}
// Check if the first inserted node zeroes the upper bits, or an 'and' result
// of a node that zeros the upper bits (its masked version).
if (isMaskedZeroUpperBitsvXi1(Op.getOpcode()) ||
(Op.getOpcode() == ISD::AND &&
(isMaskedZeroUpperBitsvXi1(Op.getOperand(0).getOpcode()) ||
isMaskedZeroUpperBitsvXi1(Op.getOperand(1).getOpcode())))) {
return Op;
}
return SDValue();
}
static SDValue LowerCONCAT_VECTORSvXi1(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc dl(Op);
MVT ResVT = Op.getSimpleValueType();
unsigned NumOfOperands = Op.getNumOperands();
assert(isPowerOf2_32(NumOfOperands) &&
"Unexpected number of operands in CONCAT_VECTORS");
// If this node promotes - by concatenating zeroes - the type of the result
// of a node with an instruction that zeroes all upper (irrelevant) bits of
// the output register, mark it as legal and catch the pattern in instruction
// selection to avoid emitting extra instructions (for zeroing upper bits).
if (SDValue Promoted = isTypePromotionOfi1ZeroUpBits(Op)) {
SDValue ZeroC = DAG.getConstant(0, dl, MVT::i64);
SDValue AllZeros = DAG.getSplatBuildVector(ResVT, dl, ZeroC);
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, AllZeros, Promoted,
ZeroC);
}
SDValue Undef = DAG.getUNDEF(ResVT);
if (NumOfOperands > 2) {
// Specialize the cases when all, or all but one, of the operands are undef.
unsigned NumOfDefinedOps = 0;
unsigned OpIdx = 0;
for (unsigned i = 0; i < NumOfOperands; i++)
if (!Op.getOperand(i).isUndef()) {
NumOfDefinedOps++;
OpIdx = i;
}
if (NumOfDefinedOps == 0)
return Undef;
if (NumOfDefinedOps == 1) {
unsigned SubVecNumElts =
Op.getOperand(OpIdx).getValueType().getVectorNumElements();
SDValue IdxVal = DAG.getIntPtrConstant(SubVecNumElts * OpIdx, dl);
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, Undef,
Op.getOperand(OpIdx), IdxVal);
}
MVT HalfVT = MVT::getVectorVT(ResVT.getVectorElementType(),
ResVT.getVectorNumElements()/2);
SmallVector<SDValue, 2> Ops;
for (unsigned i = 0; i < NumOfOperands/2; i++)
Ops.push_back(Op.getOperand(i));
SDValue Lo = DAG.getNode(ISD::CONCAT_VECTORS, dl, HalfVT, Ops);
Ops.clear();
for (unsigned i = NumOfOperands/2; i < NumOfOperands; i++)
Ops.push_back(Op.getOperand(i));
SDValue Hi = DAG.getNode(ISD::CONCAT_VECTORS, dl, HalfVT, Ops);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, ResVT, Lo, Hi);
}
// 2 operands
SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);
unsigned NumElems = ResVT.getVectorNumElements();
assert(V1.getValueType() == V2.getValueType() &&
V1.getValueType().getVectorNumElements() == NumElems/2 &&
"Unexpected operands in CONCAT_VECTORS");
if (ResVT.getSizeInBits() >= 16)
return Op; // The operation is legal with KUNPCK
bool IsZeroV1 = ISD::isBuildVectorAllZeros(V1.getNode());
bool IsZeroV2 = ISD::isBuildVectorAllZeros(V2.getNode());
SDValue ZeroVec = getZeroVector(ResVT, Subtarget, DAG, dl);
if (IsZeroV1 && IsZeroV2)
return ZeroVec;
SDValue ZeroIdx = DAG.getIntPtrConstant(0, dl);
if (V2.isUndef())
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, Undef, V1, ZeroIdx);
if (IsZeroV2)
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, ZeroVec, V1, ZeroIdx);
SDValue IdxVal = DAG.getIntPtrConstant(NumElems/2, dl);
if (V1.isUndef())
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, Undef, V2, IdxVal);
if (IsZeroV1)
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, ZeroVec, V2, IdxVal);
V1 = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, Undef, V1, ZeroIdx);
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ResVT, V1, V2, IdxVal);
}
static SDValue LowerCONCAT_VECTORS(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
if (VT.getVectorElementType() == MVT::i1)
return LowerCONCAT_VECTORSvXi1(Op, Subtarget, DAG);
assert((VT.is256BitVector() && Op.getNumOperands() == 2) ||
(VT.is512BitVector() && (Op.getNumOperands() == 2 ||
Op.getNumOperands() == 4)));
// AVX can use the vinsertf128 instruction to create 256-bit vectors
// from two other 128-bit ones.
// 512-bit vector may contain 2 256-bit vectors or 4 128-bit vectors
return LowerAVXCONCAT_VECTORS(Op, DAG);
}
//===----------------------------------------------------------------------===//
// Vector shuffle lowering
//
// This is an experimental code path for lowering vector shuffles on x86. It is
// designed to handle arbitrary vector shuffles and blends, gracefully
// degrading performance as necessary. It works hard to recognize idiomatic
// shuffles and lower them to optimal instruction patterns without leaving
// a framework that allows reasonably efficient handling of all vector shuffle
// patterns.
//===----------------------------------------------------------------------===//
/// \brief Tiny helper function to identify a no-op mask.
///
/// This is a somewhat boring predicate function. It checks whether the mask
/// array input, which is assumed to be a single-input shuffle mask of the kind
/// used by the X86 shuffle instructions (not a fully general
/// ShuffleVectorSDNode mask), requires any shuffles to occur. Both undef and an
/// in-place shuffle are no-ops.
static bool isNoopShuffleMask(ArrayRef<int> Mask) {
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
assert(Mask[i] >= -1 && "Out of bound mask element!");
if (Mask[i] >= 0 && Mask[i] != i)
return false;
}
return true;
}
/// \brief Test whether there are elements crossing 128-bit lanes in this
/// shuffle mask.
///
/// X86 divides up its shuffles into in-lane and cross-lane shuffle operations
/// and we routinely test for these.
static bool is128BitLaneCrossingShuffleMask(MVT VT, ArrayRef<int> Mask) {
int LaneSize = 128 / VT.getScalarSizeInBits();
int Size = Mask.size();
for (int i = 0; i < Size; ++i)
if (Mask[i] >= 0 && (Mask[i] % Size) / LaneSize != i / LaneSize)
return true;
return false;
}
/// \brief Test whether a shuffle mask is equivalent within each sub-lane.
///
/// This checks a shuffle mask to see if it is performing the same
/// lane-relative shuffle in each sub-lane. This trivially implies
/// that it is also not lane-crossing. It may however involve a blend from the
/// same lane of a second vector.
///
/// The specific repeated shuffle mask is populated in \p RepeatedMask, as it is
/// non-trivial to compute in the face of undef lanes. The representation is
/// suitable for use with existing 128-bit shuffles as entries from the second
/// vector have been remapped to [LaneSize, 2*LaneSize).
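///
/// For example, the v8f32 mask <0, 9, 2, 11, 4, 13, 6, 15> repeats across both
/// 128-bit lanes and yields the repeated mask <0, 5, 2, 7>, with the second
/// vector's indices remapped into [4, 8).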
static bool isRepeatedShuffleMask(unsigned LaneSizeInBits, MVT VT,
ArrayRef<int> Mask,
SmallVectorImpl<int> &RepeatedMask) {
auto LaneSize = LaneSizeInBits / VT.getScalarSizeInBits();
RepeatedMask.assign(LaneSize, -1);
int Size = Mask.size();
for (int i = 0; i < Size; ++i) {
assert(Mask[i] == SM_SentinelUndef || Mask[i] >= 0);
if (Mask[i] < 0)
continue;
if ((Mask[i] % Size) / LaneSize != i / LaneSize)
// This entry crosses lanes, so there is no way to model this shuffle.
return false;
// Ok, handle the in-lane shuffles by detecting if and when they repeat.
// Adjust second vector indices to start at LaneSize instead of Size.
int LocalM = Mask[i] < Size ? Mask[i] % LaneSize
: Mask[i] % LaneSize + LaneSize;
if (RepeatedMask[i % LaneSize] < 0)
// This is the first non-undef entry in this slot of a repeated lane.
RepeatedMask[i % LaneSize] = LocalM;
else if (RepeatedMask[i % LaneSize] != LocalM)
// Found a mismatch with the repeated mask.
return false;
}
return true;
}
/// Test whether a shuffle mask is equivalent within each 128-bit lane.
static bool
is128BitLaneRepeatedShuffleMask(MVT VT, ArrayRef<int> Mask,
SmallVectorImpl<int> &RepeatedMask) {
return isRepeatedShuffleMask(128, VT, Mask, RepeatedMask);
}
/// Test whether a shuffle mask is equivalent within each 256-bit lane.
static bool
is256BitLaneRepeatedShuffleMask(MVT VT, ArrayRef<int> Mask,
SmallVectorImpl<int> &RepeatedMask) {
return isRepeatedShuffleMask(256, VT, Mask, RepeatedMask);
}
/// Test whether a target shuffle mask is equivalent within each sub-lane.
/// Unlike isRepeatedShuffleMask we must respect SM_SentinelZero.
static bool isRepeatedTargetShuffleMask(unsigned LaneSizeInBits, MVT VT,
ArrayRef<int> Mask,
SmallVectorImpl<int> &RepeatedMask) {
int LaneSize = LaneSizeInBits / VT.getScalarSizeInBits();
RepeatedMask.assign(LaneSize, SM_SentinelUndef);
int Size = Mask.size();
for (int i = 0; i < Size; ++i) {
assert(isUndefOrZero(Mask[i]) || (Mask[i] >= 0));
if (Mask[i] == SM_SentinelUndef)
continue;
if (Mask[i] == SM_SentinelZero) {
if (!isUndefOrZero(RepeatedMask[i % LaneSize]))
return false;
RepeatedMask[i % LaneSize] = SM_SentinelZero;
continue;
}
if ((Mask[i] % Size) / LaneSize != i / LaneSize)
// This entry crosses lanes, so there is no way to model this shuffle.
return false;
// Ok, handle the in-lane shuffles by detecting if and when they repeat.
// Adjust second vector indices to start at LaneSize instead of Size.
int LocalM =
Mask[i] < Size ? Mask[i] % LaneSize : Mask[i] % LaneSize + LaneSize;
if (RepeatedMask[i % LaneSize] == SM_SentinelUndef)
// This is the first non-undef entry in this slot of a repeated lane.
RepeatedMask[i % LaneSize] = LocalM;
else if (RepeatedMask[i % LaneSize] != LocalM)
// Found a mismatch with the repeated mask.
return false;
}
return true;
}
/// \brief Checks whether a shuffle mask is equivalent to an explicit list of
/// arguments.
///
/// This is a fast way to test a shuffle mask against a fixed pattern:
///
/// if (isShuffleEquivalent(V1, V2, Mask, {3, 2, 1, 0})) { ... }
///
/// It returns true if the mask is exactly as wide as the argument list, and
/// each element of the mask is either -1 (signifying undef) or the value given
/// in the argument.
static bool isShuffleEquivalent(SDValue V1, SDValue V2, ArrayRef<int> Mask,
ArrayRef<int> ExpectedMask) {
if (Mask.size() != ExpectedMask.size())
return false;
int Size = Mask.size();
// If the values are build vectors, we can look through them to find
// equivalent inputs that make the shuffles equivalent.
auto *BV1 = dyn_cast<BuildVectorSDNode>(V1);
auto *BV2 = dyn_cast<BuildVectorSDNode>(V2);
for (int i = 0; i < Size; ++i) {
assert(Mask[i] >= -1 && "Out of bound mask element!");
if (Mask[i] >= 0 && Mask[i] != ExpectedMask[i]) {
auto *MaskBV = Mask[i] < Size ? BV1 : BV2;
auto *ExpectedBV = ExpectedMask[i] < Size ? BV1 : BV2;
if (!MaskBV || !ExpectedBV ||
MaskBV->getOperand(Mask[i] % Size) !=
ExpectedBV->getOperand(ExpectedMask[i] % Size))
return false;
}
}
return true;
}
/// Checks whether a target shuffle mask is equivalent to an explicit pattern.
///
/// The masks must be exactly the same width.
///
/// If an element in Mask matches SM_SentinelUndef (-1) then the corresponding
/// value in ExpectedMask is always accepted. Otherwise the indices must match.
///
/// SM_SentinelZero is accepted as a valid negative index but must match in
/// both masks.
static bool isTargetShuffleEquivalent(ArrayRef<int> Mask,
ArrayRef<int> ExpectedMask) {
int Size = Mask.size();
if (Size != (int)ExpectedMask.size())
return false;
for (int i = 0; i < Size; ++i)
if (Mask[i] == SM_SentinelUndef)
continue;
else if (Mask[i] < 0 && Mask[i] != SM_SentinelZero)
return false;
else if (Mask[i] != ExpectedMask[i])
return false;
return true;
}
// Merges a general DAG shuffle mask and zeroable bit mask into a target shuffle
// mask.
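// For example, Mask <0, -1, 2, 7> with Zeroable bit 2 set becomes the target
// mask <0, SM_SentinelUndef, SM_SentinelZero, 7>.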
static SmallVector<int, 64> createTargetShuffleMask(ArrayRef<int> Mask,
const APInt &Zeroable) {
int NumElts = Mask.size();
assert(NumElts == (int)Zeroable.getBitWidth() && "Mismatch mask sizes");
SmallVector<int, 64> TargetMask(NumElts, SM_SentinelUndef);
for (int i = 0; i != NumElts; ++i) {
int M = Mask[i];
if (M == SM_SentinelUndef)
continue;
assert(0 <= M && M < (2 * NumElts) && "Out of range shuffle index");
TargetMask[i] = (Zeroable[i] ? SM_SentinelZero : M);
}
return TargetMask;
}
// Check if the shuffle mask is suitable for the AVX vpunpcklwd or vpunpckhwd
// instructions.
static bool isUnpackWdShuffleMask(ArrayRef<int> Mask, MVT VT) {
if (VT != MVT::v8i32 && VT != MVT::v8f32)
return false;
SmallVector<int, 8> Unpcklwd;
createUnpackShuffleMask(MVT::v8i16, Unpcklwd, /* Lo = */ true,
/* Unary = */ false);
SmallVector<int, 8> Unpckhwd;
createUnpackShuffleMask(MVT::v8i16, Unpckhwd, /* Lo = */ false,
/* Unary = */ false);
bool IsUnpackwdMask = (isTargetShuffleEquivalent(Mask, Unpcklwd) ||
isTargetShuffleEquivalent(Mask, Unpckhwd));
return IsUnpackwdMask;
}
/// \brief Get a 4-lane 8-bit shuffle immediate for a mask.
///
/// This helper function produces an 8-bit shuffle immediate corresponding to
/// the ubiquitous shuffle encoding scheme used in x86 instructions for
/// shuffling 4 lanes. It can be used with most of the PSHUF instructions for
/// example.
///
/// NB: We rely heavily on "undef" masks preserving the input lane.
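///
/// For example, the mask <3, 1, 2, 0> encodes as 0b00100111 (0x27): two bits
/// per lane, with lane 0 in the lowest pair of bits.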
static unsigned getV4X86ShuffleImm(ArrayRef<int> Mask) {
assert(Mask.size() == 4 && "Only 4-lane shuffle masks");
assert(Mask[0] >= -1 && Mask[0] < 4 && "Out of bound mask element!");
assert(Mask[1] >= -1 && Mask[1] < 4 && "Out of bound mask element!");
assert(Mask[2] >= -1 && Mask[2] < 4 && "Out of bound mask element!");
assert(Mask[3] >= -1 && Mask[3] < 4 && "Out of bound mask element!");
unsigned Imm = 0;
Imm |= (Mask[0] < 0 ? 0 : Mask[0]) << 0;
Imm |= (Mask[1] < 0 ? 1 : Mask[1]) << 2;
Imm |= (Mask[2] < 0 ? 2 : Mask[2]) << 4;
Imm |= (Mask[3] < 0 ? 3 : Mask[3]) << 6;
return Imm;
}
static SDValue getV4X86ShuffleImm8ForMask(ArrayRef<int> Mask, const SDLoc &DL,
SelectionDAG &DAG) {
return DAG.getConstant(getV4X86ShuffleImm(Mask), DL, MVT::i8);
}
/// \brief Compute whether each element of a shuffle is zeroable.
///
/// A "zeroable" vector shuffle element is one which can be lowered to zero.
/// Either it is an undef element in the shuffle mask, the element of the input
/// referenced is undef, or the element of the input referenced is known to be
/// zero. Many x86 shuffles can zero lanes cheaply and we often want to handle
/// as many lanes with this technique as possible to simplify the remaining
/// shuffle.
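///
/// For example, given Mask <0, 2, 4, 7> where V2 is an all-zeros build vector,
/// elements 2 and 3 of the result are zeroable and the returned APInt is
/// 0b1100.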
static APInt computeZeroableShuffleElements(ArrayRef<int> Mask,
SDValue V1, SDValue V2) {
APInt Zeroable(Mask.size(), 0);
V1 = peekThroughBitcasts(V1);
V2 = peekThroughBitcasts(V2);
bool V1IsZero = ISD::isBuildVectorAllZeros(V1.getNode());
bool V2IsZero = ISD::isBuildVectorAllZeros(V2.getNode());
int VectorSizeInBits = V1.getValueSizeInBits();
int ScalarSizeInBits = VectorSizeInBits / Mask.size();
assert(!(VectorSizeInBits % ScalarSizeInBits) && "Illegal shuffle mask size");
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
int M = Mask[i];
// Handle the easy cases.
if (M < 0 || (M >= 0 && M < Size && V1IsZero) || (M >= Size && V2IsZero)) {
Zeroable.setBit(i);
continue;
}
// Determine shuffle input and normalize the mask.
SDValue V = M < Size ? V1 : V2;
M %= Size;
// Currently we can only search BUILD_VECTOR for UNDEF/ZERO elements.
if (V.getOpcode() != ISD::BUILD_VECTOR)
continue;
// If the BUILD_VECTOR has fewer elements, then the bitcasted portion of
// the (larger) source element must be UNDEF/ZERO.
if ((Size % V.getNumOperands()) == 0) {
int Scale = Size / V->getNumOperands();
SDValue Op = V.getOperand(M / Scale);
if (Op.isUndef() || X86::isZeroNode(Op))
Zeroable.setBit(i);
else if (ConstantSDNode *Cst = dyn_cast<ConstantSDNode>(Op)) {
APInt Val = Cst->getAPIntValue();
Val.lshrInPlace((M % Scale) * ScalarSizeInBits);
Val = Val.getLoBits(ScalarSizeInBits);
if (Val == 0)
Zeroable.setBit(i);
} else if (ConstantFPSDNode *Cst = dyn_cast<ConstantFPSDNode>(Op)) {
APInt Val = Cst->getValueAPF().bitcastToAPInt();
Val.lshrInPlace((M % Scale) * ScalarSizeInBits);
Val = Val.getLoBits(ScalarSizeInBits);
if (Val == 0)
Zeroable.setBit(i);
}
continue;
}
// If the BUILD_VECTOR has more elements, then all the (smaller) source
// elements must be UNDEF or ZERO.
if ((V.getNumOperands() % Size) == 0) {
int Scale = V->getNumOperands() / Size;
bool AllZeroable = true;
for (int j = 0; j < Scale; ++j) {
SDValue Op = V.getOperand((M * Scale) + j);
AllZeroable &= (Op.isUndef() || X86::isZeroNode(Op));
}
if (AllZeroable)
Zeroable.setBit(i);
continue;
}
}
return Zeroable;
}
// The shuffle result is as follows:
// 0*a[0] 0*a[1] ... 0*a[n], n >= 0, where the a[] elements appear in
// ascending order. Each Zeroable element corresponds to a particular Mask
// element, as described in the computeZeroableShuffleElements function.
//
// The function looks for a sub-mask whose nonzero elements are in
// increasing order; if such a sub-mask exists, the function returns true.
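//
// For example (illustrative), the v4i32 mask <0, 5, 1, 7> with elements 1 and
// 3 zeroable matches with IsZeroSideLeft == false: the non-zeroable elements
// <0, 1> are consecutive from the start of V1, as VEXPAND requires.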
static bool isNonZeroElementsInOrder(const APInt &Zeroable,
ArrayRef<int> Mask, const EVT &VectorType,
bool &IsZeroSideLeft) {
int NextElement = -1;
// Check if the Mask's nonzero elements are in increasing order.
for (int i = 0, e = Mask.size(); i < e; i++) {
// The zeroable elements must be built from actual zeros, not undefs.
assert(Mask[i] >= -1 && "Out of bound mask element!");
if (Mask[i] < 0)
return false;
if (Zeroable[i])
continue;
// Find the lowest non-zero element.
if (NextElement < 0) {
NextElement = Mask[i] != 0 ? VectorType.getVectorNumElements() : 0;
IsZeroSideLeft = NextElement != 0;
}
// Exit if the mask's non-zero elements are not in increasing order.
if (NextElement != Mask[i])
return false;
NextElement++;
}
return true;
}
/// Try to lower a shuffle with a single PSHUFB of V1 or V2.
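///
/// For a v4i32 single-input mask <1, 0, 3, 2>, for example, the generated
/// PSHUFB byte mask would be <4,5,6,7, 0,1,2,3, 12,13,14,15, 8,9,10,11>.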
static SDValue lowerVectorShuffleWithPSHUFB(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
int Size = Mask.size();
int LaneSize = 128 / VT.getScalarSizeInBits();
const int NumBytes = VT.getSizeInBits() / 8;
const int NumEltBytes = VT.getScalarSizeInBits() / 8;
assert((Subtarget.hasSSSE3() && VT.is128BitVector()) ||
(Subtarget.hasAVX2() && VT.is256BitVector()) ||
(Subtarget.hasBWI() && VT.is512BitVector()));
SmallVector<SDValue, 64> PSHUFBMask(NumBytes);
// Sign bit set in i8 mask means zero element.
SDValue ZeroMask = DAG.getConstant(0x80, DL, MVT::i8);
SDValue V;
for (int i = 0; i < NumBytes; ++i) {
int M = Mask[i / NumEltBytes];
if (M < 0) {
PSHUFBMask[i] = DAG.getUNDEF(MVT::i8);
continue;
}
if (Zeroable[i / NumEltBytes]) {
PSHUFBMask[i] = ZeroMask;
continue;
}
// We can only use a single input of V1 or V2.
SDValue SrcV = (M >= Size ? V2 : V1);
if (V && V != SrcV)
return SDValue();
V = SrcV;
M %= Size;
// PSHUFB can't cross lanes, ensure this doesn't happen.
if ((M / LaneSize) != ((i / NumEltBytes) / LaneSize))
return SDValue();
M = M % LaneSize;
M = M * NumEltBytes + (i % NumEltBytes);
PSHUFBMask[i] = DAG.getConstant(M, DL, MVT::i8);
}
assert(V && "Failed to find a source input");
MVT I8VT = MVT::getVectorVT(MVT::i8, NumBytes);
return DAG.getBitcast(
VT, DAG.getNode(X86ISD::PSHUFB, DL, I8VT, DAG.getBitcast(I8VT, V),
DAG.getBuildVector(I8VT, DL, PSHUFBMask)));
}
static SDValue getMaskNode(SDValue Mask, MVT MaskVT,
const X86Subtarget &Subtarget, SelectionDAG &DAG,
const SDLoc &dl);
// X86 has a dedicated shuffle that can be lowered to VEXPAND.
static SDValue lowerVectorShuffleToEXPAND(const SDLoc &DL, MVT VT,
const APInt &Zeroable,
ArrayRef<int> Mask, SDValue &V1,
SDValue &V2, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
bool IsLeftZeroSide = true;
if (!isNonZeroElementsInOrder(Zeroable, Mask, V1.getValueType(),
IsLeftZeroSide))
return SDValue();
unsigned VEXPANDMask = (~Zeroable).getZExtValue();
MVT IntegerType =
MVT::getIntegerVT(std::max((int)VT.getVectorNumElements(), 8));
SDValue MaskNode = DAG.getConstant(VEXPANDMask, DL, IntegerType);
unsigned NumElts = VT.getVectorNumElements();
assert((NumElts == 4 || NumElts == 8 || NumElts == 16) &&
"Unexpected number of vector elements");
SDValue VMask = getMaskNode(MaskNode, MVT::getVectorVT(MVT::i1, NumElts),
Subtarget, DAG, DL);
SDValue ZeroVector = getZeroVector(VT, Subtarget, DAG, DL);
SDValue ExpandedVector = IsLeftZeroSide ? V2 : V1;
return DAG.getSelect(DL, VT, VMask,
DAG.getNode(X86ISD::EXPAND, DL, VT, ExpandedVector),
ZeroVector);
}
static bool matchVectorShuffleWithUNPCK(MVT VT, SDValue &V1, SDValue &V2,
unsigned &UnpackOpcode, bool IsUnary,
ArrayRef<int> TargetMask, SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
int NumElts = VT.getVectorNumElements();
bool Undef1 = true, Undef2 = true, Zero1 = true, Zero2 = true;
for (int i = 0; i != NumElts; i += 2) {
int M1 = TargetMask[i + 0];
int M2 = TargetMask[i + 1];
Undef1 &= (SM_SentinelUndef == M1);
Undef2 &= (SM_SentinelUndef == M2);
Zero1 &= isUndefOrZero(M1);
Zero2 &= isUndefOrZero(M2);
}
assert(!((Undef1 || Zero1) && (Undef2 || Zero2)) &&
"Zeroable shuffle detected");
// Attempt to match the target mask against the unpack lo/hi mask patterns.
SmallVector<int, 64> Unpckl, Unpckh;
createUnpackShuffleMask(VT, Unpckl, /* Lo = */ true, IsUnary);
if (isTargetShuffleEquivalent(TargetMask, Unpckl)) {
UnpackOpcode = X86ISD::UNPCKL;
V2 = (Undef2 ? DAG.getUNDEF(VT) : (IsUnary ? V1 : V2));
V1 = (Undef1 ? DAG.getUNDEF(VT) : V1);
return true;
}
createUnpackShuffleMask(VT, Unpckh, /* Lo = */ false, IsUnary);
if (isTargetShuffleEquivalent(TargetMask, Unpckh)) {
UnpackOpcode = X86ISD::UNPCKH;
V2 = (Undef2 ? DAG.getUNDEF(VT) : (IsUnary ? V1 : V2));
V1 = (Undef1 ? DAG.getUNDEF(VT) : V1);
return true;
}
// If an unary shuffle, attempt to match as an unpack lo/hi with zero.
if (IsUnary && (Zero1 || Zero2)) {
// Don't bother if we can blend instead.
if ((Subtarget.hasSSE41() || VT == MVT::v2i64 || VT == MVT::v2f64) &&
isSequentialOrUndefOrZeroInRange(TargetMask, 0, NumElts, 0))
return false;
bool MatchLo = true, MatchHi = true;
for (int i = 0; (i != NumElts) && (MatchLo || MatchHi); ++i) {
int M = TargetMask[i];
// Ignore if the input is known to be zero or the index is undef.
if ((((i & 1) == 0) && Zero1) || (((i & 1) == 1) && Zero2) ||
(M == SM_SentinelUndef))
continue;
MatchLo &= (M == Unpckl[i]);
MatchHi &= (M == Unpckh[i]);
}
if (MatchLo || MatchHi) {
UnpackOpcode = MatchLo ? X86ISD::UNPCKL : X86ISD::UNPCKH;
V2 = Zero2 ? getZeroVector(VT, Subtarget, DAG, DL) : V1;
V1 = Zero1 ? getZeroVector(VT, Subtarget, DAG, DL) : V1;
return true;
}
}
// If a binary shuffle, commute and try again.
if (!IsUnary) {
ShuffleVectorSDNode::commuteMask(Unpckl);
if (isTargetShuffleEquivalent(TargetMask, Unpckl)) {
UnpackOpcode = X86ISD::UNPCKL;
std::swap(V1, V2);
return true;
}
ShuffleVectorSDNode::commuteMask(Unpckh);
if (isTargetShuffleEquivalent(TargetMask, Unpckh)) {
UnpackOpcode = X86ISD::UNPCKH;
std::swap(V1, V2);
return true;
}
}
return false;
}
// X86 has dedicated unpack instructions that can handle specific blend
// operations: UNPCKH and UNPCKL.
static SDValue lowerVectorShuffleWithUNPCK(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2, SelectionDAG &DAG) {
SmallVector<int, 8> Unpckl;
createUnpackShuffleMask(VT, Unpckl, /* Lo = */ true, /* Unary = */ false);
if (isShuffleEquivalent(V1, V2, Mask, Unpckl))
return DAG.getNode(X86ISD::UNPCKL, DL, VT, V1, V2);
SmallVector<int, 8> Unpckh;
createUnpackShuffleMask(VT, Unpckh, /* Lo = */ false, /* Unary = */ false);
if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
return DAG.getNode(X86ISD::UNPCKH, DL, VT, V1, V2);
// Commute and try again.
ShuffleVectorSDNode::commuteMask(Unpckl);
if (isShuffleEquivalent(V1, V2, Mask, Unpckl))
return DAG.getNode(X86ISD::UNPCKL, DL, VT, V2, V1);
ShuffleVectorSDNode::commuteMask(Unpckh);
if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
return DAG.getNode(X86ISD::UNPCKH, DL, VT, V2, V1);
return SDValue();
}
/// \brief Try to emit a bitmask instruction for a shuffle.
///
/// This handles cases where we can model a blend exactly as a bitmask due to
/// one of the inputs being zeroable.
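///
/// For example, a v4i32 mask <0, 1, 2, 3> where elements 1 and 3 are zeroable
/// lowers to V1 & <-1, 0, -1, 0>, clearing the zeroable lanes with a single
/// AND.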
static SDValue lowerVectorShuffleAsBitMask(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable,
SelectionDAG &DAG) {
assert(!VT.isFloatingPoint() && "Floating point types are not supported");
MVT EltVT = VT.getVectorElementType();
SDValue Zero = DAG.getConstant(0, DL, EltVT);
SDValue AllOnes = DAG.getAllOnesConstant(DL, EltVT);
SmallVector<SDValue, 16> VMaskOps(Mask.size(), Zero);
SDValue V;
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
if (Zeroable[i])
continue;
if (Mask[i] % Size != i)
return SDValue(); // Not a blend.
if (!V)
V = Mask[i] < Size ? V1 : V2;
else if (V != (Mask[i] < Size ? V1 : V2))
return SDValue(); // Can only let one input through the mask.
VMaskOps[i] = AllOnes;
}
if (!V)
return SDValue(); // No non-zeroable elements!
SDValue VMask = DAG.getBuildVector(VT, DL, VMaskOps);
return DAG.getNode(ISD::AND, DL, VT, V, VMask);
}
/// \brief Try to emit a blend instruction for a shuffle using bit math.
///
/// This is used as a fallback approach when first class blend instructions are
/// unavailable. Currently it is only suitable for integer vectors, but could
/// be generalized for floating point vectors if desirable.
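///
/// The result is computed as (V1 & M) | (V2 & ~M), where M has all-ones
/// elements wherever the mask selects from V1.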
static SDValue lowerVectorShuffleAsBitBlend(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(VT.isInteger() && "Only supports integer vector types!");
MVT EltVT = VT.getVectorElementType();
SDValue Zero = DAG.getConstant(0, DL, EltVT);
SDValue AllOnes = DAG.getAllOnesConstant(DL, EltVT);
SmallVector<SDValue, 16> MaskOps;
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
if (Mask[i] >= 0 && Mask[i] != i && Mask[i] != i + Size)
return SDValue(); // Shuffled input!
MaskOps.push_back(Mask[i] < Size ? AllOnes : Zero);
}
SDValue V1Mask = DAG.getBuildVector(VT, DL, MaskOps);
V1 = DAG.getNode(ISD::AND, DL, VT, V1, V1Mask);
// We have to cast V2 around.
MVT MaskVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);
V2 = DAG.getBitcast(VT, DAG.getNode(X86ISD::ANDNP, DL, MaskVT,
DAG.getBitcast(MaskVT, V1Mask),
DAG.getBitcast(MaskVT, V2)));
return DAG.getNode(ISD::OR, DL, VT, V1, V2);
}
static SDValue getVectorMaskingNode(SDValue Op, SDValue Mask,
SDValue PreservedSrc,
const X86Subtarget &Subtarget,
SelectionDAG &DAG);
static bool matchVectorShuffleAsBlend(SDValue V1, SDValue V2,
MutableArrayRef<int> TargetMask,
bool &ForceV1Zero, bool &ForceV2Zero,
uint64_t &BlendMask) {
bool V1IsZeroOrUndef =
V1.isUndef() || ISD::isBuildVectorAllZeros(V1.getNode());
bool V2IsZeroOrUndef =
V2.isUndef() || ISD::isBuildVectorAllZeros(V2.getNode());
BlendMask = 0;
ForceV1Zero = false, ForceV2Zero = false;
assert(TargetMask.size() <= 64 && "Shuffle mask too big for blend mask");
// Attempt to generate the binary blend mask. If an input is zero then
// we can use any lane.
// TODO: generalize the zero matching to any scalar like isShuffleEquivalent.
for (int i = 0, Size = TargetMask.size(); i < Size; ++i) {
int M = TargetMask[i];
if (M == SM_SentinelUndef)
continue;
if (M == i)
continue;
if (M == i + Size) {
BlendMask |= 1ull << i;
continue;
}
if (M == SM_SentinelZero) {
if (V1IsZeroOrUndef) {
ForceV1Zero = true;
TargetMask[i] = i;
continue;
}
if (V2IsZeroOrUndef) {
ForceV2Zero = true;
BlendMask |= 1ull << i;
TargetMask[i] = i + Size;
continue;
}
}
return false;
}
return true;
}
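// Widen a blend mask by duplicating each bit Scale times. For example, scaling
// BlendMask 0b0101 with Size 4 and Scale 2 yields 0b00110011.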
static uint64_t scaleVectorShuffleBlendMask(uint64_t BlendMask, int Size, int Scale) {
uint64_t ScaledMask = 0;
for (int i = 0; i != Size; ++i)
if (BlendMask & (1ull << i))
ScaledMask |= ((1ull << Scale) - 1) << (i * Scale);
return ScaledMask;
}
/// \brief Try to emit a blend instruction for a shuffle.
///
/// This doesn't do any checks for the availability of instructions for blending
/// these values. It relies on the availability of the X86ISD::BLENDI pattern to
/// be matched in the backend with the type given. What it does check for is
/// that the shuffle mask is a blend, or convertible into a blend with zero.
static SDValue lowerVectorShuffleAsBlend(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Original,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SmallVector<int, 64> Mask = createTargetShuffleMask(Original, Zeroable);
uint64_t BlendMask = 0;
bool ForceV1Zero = false, ForceV2Zero = false;
if (!matchVectorShuffleAsBlend(V1, V2, Mask, ForceV1Zero, ForceV2Zero,
BlendMask))
return SDValue();
// Create a REAL zero vector - ISD::isBuildVectorAllZeros allows UNDEFs.
if (ForceV1Zero)
V1 = getZeroVector(VT, Subtarget, DAG, DL);
if (ForceV2Zero)
V2 = getZeroVector(VT, Subtarget, DAG, DL);
switch (VT.SimpleTy) {
case MVT::v2f64:
case MVT::v4f32:
case MVT::v4f64:
case MVT::v8f32:
return DAG.getNode(X86ISD::BLENDI, DL, VT, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8));
case MVT::v4i64:
case MVT::v8i32:
assert(Subtarget.hasAVX2() && "256-bit integer blends require AVX2!");
LLVM_FALLTHROUGH;
case MVT::v2i64:
case MVT::v4i32:
// If we have AVX2 it is faster to use VPBLENDD when the shuffle fits into
// that instruction.
if (Subtarget.hasAVX2()) {
// Scale the blend by the number of 32-bit dwords per element.
int Scale = VT.getScalarSizeInBits() / 32;
BlendMask = scaleVectorShuffleBlendMask(BlendMask, Mask.size(), Scale);
MVT BlendVT = VT.getSizeInBits() > 128 ? MVT::v8i32 : MVT::v4i32;
V1 = DAG.getBitcast(BlendVT, V1);
V2 = DAG.getBitcast(BlendVT, V2);
return DAG.getBitcast(
VT, DAG.getNode(X86ISD::BLENDI, DL, BlendVT, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8)));
}
LLVM_FALLTHROUGH;
case MVT::v8i16: {
// For integer shuffles we need to expand the mask and cast the inputs to
// v8i16s prior to blending.
int Scale = 8 / VT.getVectorNumElements();
BlendMask = scaleVectorShuffleBlendMask(BlendMask, Mask.size(), Scale);
V1 = DAG.getBitcast(MVT::v8i16, V1);
V2 = DAG.getBitcast(MVT::v8i16, V2);
return DAG.getBitcast(VT,
DAG.getNode(X86ISD::BLENDI, DL, MVT::v8i16, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8)));
}
case MVT::v16i16: {
assert(Subtarget.hasAVX2() && "256-bit integer blends require AVX2!");
SmallVector<int, 8> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v16i16, Mask, RepeatedMask)) {
// We can lower these with PBLENDW which is mirrored across 128-bit lanes.
assert(RepeatedMask.size() == 8 && "Repeated mask size doesn't match!");
BlendMask = 0;
for (int i = 0; i < 8; ++i)
if (RepeatedMask[i] >= 8)
BlendMask |= 1ull << i;
return DAG.getNode(X86ISD::BLENDI, DL, MVT::v16i16, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8));
}
LLVM_FALLTHROUGH;
}
case MVT::v16i8:
case MVT::v32i8: {
assert((VT.is128BitVector() || Subtarget.hasAVX2()) &&
"256-bit byte-blends require AVX2 support!");
if (Subtarget.hasBWI() && Subtarget.hasVLX()) {
MVT IntegerType =
MVT::getIntegerVT(std::max((int)VT.getVectorNumElements(), 8));
SDValue MaskNode = DAG.getConstant(BlendMask, DL, IntegerType);
return getVectorMaskingNode(V2, MaskNode, V1, Subtarget, DAG);
}
// Attempt to lower to a bitmask if we can. VPAND is faster than VPBLENDVB.
if (SDValue Masked =
lowerVectorShuffleAsBitMask(DL, VT, V1, V2, Mask, Zeroable, DAG))
return Masked;
// Scale the blend by the number of bytes per element.
int Scale = VT.getScalarSizeInBits() / 8;
// This form of blend is always done on bytes. Compute the byte vector
// type.
MVT BlendVT = MVT::getVectorVT(MVT::i8, VT.getSizeInBits() / 8);
// Compute the VSELECT mask. Note that VSELECT is really confusing in the
// mix of LLVM's code generator and the x86 backend. We tell the code
// generator that boolean values in the elements of an x86 vector register
// are -1 for true and 0 for false. We then use the LLVM semantics of 'true'
// mapping a select to operand #1, and 'false' mapping to operand #2. The
// reality in x86 is that vector masks (pre-AVX-512) use only the high bit
// of the element (the remaining are ignored) and 0 in that high bit would
// mean operand #1 while 1 in the high bit would mean operand #2. So while
// the LLVM model for boolean values in vector elements gets the relevant
// bit set, it is set backwards and over constrained relative to x86's
// actual model.
SmallVector<SDValue, 32> VSELECTMask;
for (int i = 0, Size = Mask.size(); i < Size; ++i)
for (int j = 0; j < Scale; ++j)
VSELECTMask.push_back(
Mask[i] < 0 ? DAG.getUNDEF(MVT::i8)
: DAG.getConstant(Mask[i] < Size ? -1 : 0, DL,
MVT::i8));
V1 = DAG.getBitcast(BlendVT, V1);
V2 = DAG.getBitcast(BlendVT, V2);
return DAG.getBitcast(
VT,
DAG.getSelect(DL, BlendVT, DAG.getBuildVector(BlendVT, DL, VSELECTMask),
V1, V2));
}
case MVT::v16f32:
case MVT::v8f64:
case MVT::v8i64:
case MVT::v16i32:
case MVT::v32i16:
case MVT::v64i8: {
MVT IntegerType =
MVT::getIntegerVT(std::max((int)VT.getVectorNumElements(), 8));
SDValue MaskNode = DAG.getConstant(BlendMask, DL, IntegerType);
return getVectorMaskingNode(V2, MaskNode, V1, Subtarget, DAG);
}
default:
llvm_unreachable("Not a supported integer vector type!");
}
}
/// \brief Try to lower as a blend of elements from two inputs followed by
/// a single-input permutation.
///
/// This matches the pattern where we can blend elements from two inputs and
/// then reduce the shuffle to a single-input permutation.
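///
/// For example, the v4i32 mask <2, 5, 0, 7> becomes a blend with mask
/// <0, 5, 2, 7> (each lane keeps its own index, picking V1 or V2) followed by
/// the single-input permute <2, 1, 0, 3>.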
static SDValue lowerVectorShuffleAsBlendAndPermute(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
// We build up the blend mask while checking whether a blend is a viable way
// to reduce the shuffle.
SmallVector<int, 32> BlendMask(Mask.size(), -1);
SmallVector<int, 32> PermuteMask(Mask.size(), -1);
for (int i = 0, Size = Mask.size(); i < Size; ++i) {
if (Mask[i] < 0)
continue;
assert(Mask[i] < Size * 2 && "Shuffle input is out of bounds.");
if (BlendMask[Mask[i] % Size] < 0)
BlendMask[Mask[i] % Size] = Mask[i];
else if (BlendMask[Mask[i] % Size] != Mask[i])
return SDValue(); // Can't blend in the needed input!
PermuteMask[i] = Mask[i] % Size;
}
SDValue V = DAG.getVectorShuffle(VT, DL, V1, V2, BlendMask);
return DAG.getVectorShuffle(VT, DL, V, DAG.getUNDEF(VT), PermuteMask);
}
/// \brief Generic routine to decompose a shuffle and blend into independent
/// blends and permutes.
///
/// This matches the extremely common pattern for handling combined
/// shuffle+blend operations on newer X86 ISAs where we have very fast blend
/// operations. It will try to pick the best arrangement of shuffles and
/// blends.
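///
/// For example, the v4i32 mask <5, 0, 6, 3> decomposes into V1Mask
/// <-1, 0, -1, 3>, V2Mask <1, -1, 2, -1>, and the final blend mask
/// <4, 1, 6, 3>.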
static SDValue lowerVectorShuffleAsDecomposedShuffleBlend(const SDLoc &DL,
MVT VT, SDValue V1,
SDValue V2,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
// Shuffle the input elements into the desired positions in V1 and V2 and
// blend them together.
SmallVector<int, 32> V1Mask(Mask.size(), -1);
SmallVector<int, 32> V2Mask(Mask.size(), -1);
SmallVector<int, 32> BlendMask(Mask.size(), -1);
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (Mask[i] >= 0 && Mask[i] < Size) {
V1Mask[i] = Mask[i];
BlendMask[i] = i;
} else if (Mask[i] >= Size) {
V2Mask[i] = Mask[i] - Size;
BlendMask[i] = i + Size;
}
// Try to lower with the simpler initial blend strategy unless one of the
// input shuffles would be a no-op. We prefer to shuffle inputs as the
// shuffle may be able to fold with a load or other benefit. However, when
// we would have to do twice as many shuffles to achieve this, blending
// first is the better strategy.
if (!isNoopShuffleMask(V1Mask) && !isNoopShuffleMask(V2Mask))
if (SDValue BlendPerm =
lowerVectorShuffleAsBlendAndPermute(DL, VT, V1, V2, Mask, DAG))
return BlendPerm;
V1 = DAG.getVectorShuffle(VT, DL, V1, DAG.getUNDEF(VT), V1Mask);
V2 = DAG.getVectorShuffle(VT, DL, V2, DAG.getUNDEF(VT), V2Mask);
return DAG.getVectorShuffle(VT, DL, V1, V2, BlendMask);
}
/// \brief Try to lower a vector shuffle as a rotation.
///
/// This is used to support PALIGNR for SSSE3 or VALIGND/Q for AVX512.
static int matchVectorShuffleAsRotate(SDValue &V1, SDValue &V2,
ArrayRef<int> Mask) {
int NumElts = Mask.size();
// We need to detect various ways of spelling a rotation:
// [11, 12, 13, 14, 15, 0, 1, 2]
// [-1, 12, 13, 14, -1, -1, 1, -1]
// [-1, -1, -1, -1, -1, -1, 1, 2]
// [ 3, 4, 5, 6, 7, 8, 9, 10]
// [-1, 4, 5, 6, -1, -1, 9, -1]
// [-1, 4, 5, 6, -1, -1, -1, -1]
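// For example, [11, 12, 13, 14, 15, 0, 1, 2] is a rotation of 3 with
// Lo = V1 and Hi = V2.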
int Rotation = 0;
SDValue Lo, Hi;
for (int i = 0; i < NumElts; ++i) {
int M = Mask[i];
assert((M == SM_SentinelUndef || (0 <= M && M < (2*NumElts))) &&
"Unexpected mask index.");
if (M < 0)
continue;
// Determine where a rotated vector would have started.
int StartIdx = i - (M % NumElts);
if (StartIdx == 0)
// The identity rotation isn't interesting, stop.
return -1;
// If we found the tail of a vector, the rotation must be the number of
// elements missing from the front. If we found the head of a vector, the
// rotation is how much of that head appears in the result.
int CandidateRotation = StartIdx < 0 ? -StartIdx : NumElts - StartIdx;
if (Rotation == 0)
Rotation = CandidateRotation;
else if (Rotation != CandidateRotation)
// The rotations don't match, so we can't match this mask.
return -1;
// Compute which value this mask is pointing at.
SDValue MaskV = M < NumElts ? V1 : V2;
// Compute which of the two target values this index should be assigned
// to. This reflects whether the high elements are remaining or the low
// elements are remaining.
SDValue &TargetV = StartIdx < 0 ? Hi : Lo;
// Either set up this value if we've not encountered it before, or check
// that it remains consistent.
if (!TargetV)
TargetV = MaskV;
else if (TargetV != MaskV)
// This may be a rotation, but it pulls from the inputs in some
// unsupported interleaving.
return -1;
}
// Check that we successfully analyzed the mask, and normalize the results.
assert(Rotation != 0 && "Failed to locate a viable rotation!");
assert((Lo || Hi) && "Failed to find a rotated input vector!");
if (!Lo)
Lo = Hi;
else if (!Hi)
Hi = Lo;
V1 = Lo;
V2 = Hi;
return Rotation;
}
/// \brief Try to lower a vector shuffle as a byte rotation.
///
/// SSSE3 has a generic PALIGNR instruction in x86 that will do an arbitrary
/// byte-rotation of the concatenation of two vectors; pre-SSSE3 can use
/// a PSRLDQ/PSLLDQ/POR pattern to get a similar effect. This routine will
/// try to generically lower a vector shuffle through such a pattern. It
/// does not check for the profitability of lowering either as PALIGNR or
/// PSRLDQ/PSLLDQ/POR, only whether the mask is valid to lower in that form.
/// This matches shuffle vectors that look like:
///
/// v8i16 [11, 12, 13, 14, 15, 0, 1, 2]
///
/// Essentially it concatenates V1 and V2, shifts right by some number of
/// elements, and takes the low elements as the result. Note that while this is
/// specified as a *right shift* because x86 is little-endian, it is a *left
/// rotate* of the vector lanes.
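///
/// The returned amount is in bytes: an element rotation of 3 on v8i16, for
/// example, becomes a byte rotation of 6 (3 elements * 2 bytes each).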
static int matchVectorShuffleAsByteRotate(MVT VT, SDValue &V1, SDValue &V2,
ArrayRef<int> Mask) {
// Don't accept any shuffles with zero elements.
if (any_of(Mask, [](int M) { return M == SM_SentinelZero; }))
return -1;
// PALIGNR works on 128-bit lanes.
SmallVector<int, 16> RepeatedMask;
if (!is128BitLaneRepeatedShuffleMask(VT, Mask, RepeatedMask))
return -1;
int Rotation = matchVectorShuffleAsRotate(V1, V2, RepeatedMask);
if (Rotation <= 0)
return -1;
// PALIGNR rotates bytes, so we need to scale the
// rotation based on how many bytes are in the vector lane.
int NumElts = RepeatedMask.size();
int Scale = 16 / NumElts;
return Rotation * Scale;
}
static SDValue lowerVectorShuffleAsByteRotate(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(!isNoopShuffleMask(Mask) && "We shouldn't lower no-op shuffles!");
SDValue Lo = V1, Hi = V2;
int ByteRotation = matchVectorShuffleAsByteRotate(VT, Lo, Hi, Mask);
if (ByteRotation <= 0)
return SDValue();
// Cast the inputs to i8 vector of correct length to match PALIGNR or
// PSLLDQ/PSRLDQ.
MVT ByteVT = MVT::getVectorVT(MVT::i8, VT.getSizeInBits() / 8);
Lo = DAG.getBitcast(ByteVT, Lo);
Hi = DAG.getBitcast(ByteVT, Hi);
// SSSE3 targets can use the palignr instruction.
if (Subtarget.hasSSSE3()) {
assert((!VT.is512BitVector() || Subtarget.hasBWI()) &&
"512-bit PALIGNR requires BWI instructions");
return DAG.getBitcast(
VT, DAG.getNode(X86ISD::PALIGNR, DL, ByteVT, Lo, Hi,
DAG.getConstant(ByteRotation, DL, MVT::i8)));
}
assert(VT.is128BitVector() &&
"Rotate-based lowering only supports 128-bit lowering!");
assert(Mask.size() <= 16 &&
"Can shuffle at most 16 bytes in a 128-bit vector!");
assert(ByteVT == MVT::v16i8 &&
"SSE2 rotate lowering only needed for v16i8!");
// Default SSE2 implementation
int LoByteShift = 16 - ByteRotation;
int HiByteShift = ByteRotation;
SDValue LoShift = DAG.getNode(X86ISD::VSHLDQ, DL, MVT::v16i8, Lo,
DAG.getConstant(LoByteShift, DL, MVT::i8));
SDValue HiShift = DAG.getNode(X86ISD::VSRLDQ, DL, MVT::v16i8, Hi,
DAG.getConstant(HiByteShift, DL, MVT::i8));
return DAG.getBitcast(VT,
DAG.getNode(ISD::OR, DL, MVT::v16i8, LoShift, HiShift));
}
/// \brief Try to lower a vector shuffle as a dword/qword rotation.
///
/// AVX512 has VALIGND/VALIGNQ instructions that will do an arbitrary
/// rotation of the concatenation of two vectors; this routine will
/// try to generically lower a vector shuffle through such a pattern.
///
/// Essentially it concatenates V1 and V2, shifts right by some number of
/// elements, and takes the low elements as the result. Note that while this is
/// specified as a *right shift* because x86 is little-endian, it is a *left
/// rotate* of the vector lanes.
static SDValue lowerVectorShuffleAsRotate(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert((VT.getScalarType() == MVT::i32 || VT.getScalarType() == MVT::i64) &&
"Only 32-bit and 64-bit elements are supported!");
// 128/256-bit vectors are only supported with VLX.
assert((Subtarget.hasVLX() || (!VT.is128BitVector() && !VT.is256BitVector()))
&& "VLX required for 128/256-bit vectors");
SDValue Lo = V1, Hi = V2;
int Rotation = matchVectorShuffleAsRotate(Lo, Hi, Mask);
if (Rotation <= 0)
return SDValue();
return DAG.getNode(X86ISD::VALIGN, DL, VT, Lo, Hi,
DAG.getConstant(Rotation, DL, MVT::i8));
}
/// \brief Try to lower a vector shuffle as a bit shift (shifts in zeros).
///
/// Attempts to match a shuffle mask against the PSLL(W/D/Q/DQ) and
/// PSRL(W/D/Q/DQ) SSE2 and AVX2 logical bit-shift instructions. The function
/// matches elements from one of the input vectors shuffled to the left or
/// right with zeroable elements 'shifted in'. It handles both the strictly
/// bit-wise element shifts and the byte shift across an entire 128-bit double
/// quad word lane.
///
/// PSLL : (little-endian) left bit shift.
/// [ zz, 0, zz, 2 ]
/// [ -1, 4, zz, -1 ]
/// PSRL : (little-endian) right bit shift.
/// [ 1, zz, 3, zz]
/// [ -1, -1, 7, zz]
/// PSLLDQ : (little-endian) left byte shift
/// [ zz, 0, 1, 2, 3, 4, 5, 6]
/// [ zz, zz, -1, -1, 2, 3, 4, -1]
/// [ zz, zz, zz, zz, zz, zz, -1, 1]
/// PSRLDQ : (little-endian) right byte shift
/// [ 5, 6, 7, zz, zz, zz, zz, zz]
/// [ -1, 5, 6, 7, zz, zz, zz, zz]
/// [ 1, 2, -1, -1, -1, -1, zz, zz]
static int matchVectorShuffleAsShift(MVT &ShiftVT, unsigned &Opcode,
unsigned ScalarSizeInBits,
ArrayRef<int> Mask, int MaskOffset,
const APInt &Zeroable,
const X86Subtarget &Subtarget) {
int Size = Mask.size();
unsigned SizeInBits = Size * ScalarSizeInBits;
auto CheckZeros = [&](int Shift, int Scale, bool Left) {
for (int i = 0; i < Size; i += Scale)
for (int j = 0; j < Shift; ++j)
if (!Zeroable[i + j + (Left ? 0 : (Scale - Shift))])
return false;
return true;
};
auto MatchShift = [&](int Shift, int Scale, bool Left) {
for (int i = 0; i != Size; i += Scale) {
unsigned Pos = Left ? i + Shift : i;
unsigned Low = Left ? i : i + Shift;
unsigned Len = Scale - Shift;
if (!isSequentialOrUndefInRange(Mask, Pos, Len, Low + MaskOffset))
return -1;
}
int ShiftEltBits = ScalarSizeInBits * Scale;
bool ByteShift = ShiftEltBits > 64;
Opcode = Left ? (ByteShift ? X86ISD::VSHLDQ : X86ISD::VSHLI)
: (ByteShift ? X86ISD::VSRLDQ : X86ISD::VSRLI);
int ShiftAmt = Shift * ScalarSizeInBits / (ByteShift ? 8 : 1);
// Normalize the scale for byte shifts to still produce an i64 element
// type.
Scale = ByteShift ? Scale / 2 : Scale;
// We need to round trip through the appropriate type for the shift.
MVT ShiftSVT = MVT::getIntegerVT(ScalarSizeInBits * Scale);
ShiftVT = ByteShift ? MVT::getVectorVT(MVT::i8, SizeInBits / 8)
: MVT::getVectorVT(ShiftSVT, Size / Scale);
return (int)ShiftAmt;
};
// SSE/AVX supports logical shifts up to 64-bit integers - so we can just
// keep doubling the size of the integer elements up to that. We can
// then shift the elements of the integer vector by whole multiples of
// their width within the elements of the larger integer vector. Test each
// multiple to see if we can find a match with the moved element indices
// and that the shifted in elements are all zeroable.
unsigned MaxWidth = ((SizeInBits == 512) && !Subtarget.hasBWI() ? 64 : 128);
for (int Scale = 2; Scale * ScalarSizeInBits <= MaxWidth; Scale *= 2)
for (int Shift = 1; Shift != Scale; ++Shift)
for (bool Left : {true, false})
if (CheckZeros(Shift, Scale, Left)) {
int ShiftAmt = MatchShift(Shift, Scale, Left);
if (0 < ShiftAmt)
return ShiftAmt;
}
  // No match found.
return -1;
}
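// Worked example (illustrative only): for a v8i16 mask
// [zz, 0, zz, 2, zz, 4, zz, 6] (zz = zeroable), Scale = 2 and Shift = 1 with
// Left = true match: every even element is zeroable and the remaining
// elements are sequential within each pair. ShiftEltBits = 32, so this is
// lowered as VSHLI (PSLLD) by 16 bits on a v4i32 bitcast of the input.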
static SDValue lowerVectorShuffleAsShift(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
int Size = Mask.size();
assert(Size == (int)VT.getVectorNumElements() && "Unexpected mask size");
MVT ShiftVT;
SDValue V = V1;
unsigned Opcode;
// Try to match shuffle against V1 shift.
int ShiftAmt = matchVectorShuffleAsShift(
ShiftVT, Opcode, VT.getScalarSizeInBits(), Mask, 0, Zeroable, Subtarget);
// If V1 failed, try to match shuffle against V2 shift.
if (ShiftAmt < 0) {
ShiftAmt =
matchVectorShuffleAsShift(ShiftVT, Opcode, VT.getScalarSizeInBits(),
Mask, Size, Zeroable, Subtarget);
V = V2;
}
if (ShiftAmt < 0)
return SDValue();
assert(DAG.getTargetLoweringInfo().isTypeLegal(ShiftVT) &&
"Illegal integer vector type");
V = DAG.getBitcast(ShiftVT, V);
V = DAG.getNode(Opcode, DL, ShiftVT, V,
DAG.getConstant(ShiftAmt, DL, MVT::i8));
return DAG.getBitcast(VT, V);
}
// EXTRQ: Extract Len elements from lower half of source, starting at Idx.
// Remainder of lower half result is zero and upper half is all undef.
static bool matchVectorShuffleAsEXTRQ(MVT VT, SDValue &V1, SDValue &V2,
ArrayRef<int> Mask, uint64_t &BitLen,
uint64_t &BitIdx, const APInt &Zeroable) {
int Size = Mask.size();
int HalfSize = Size / 2;
assert(Size == (int)VT.getVectorNumElements() && "Unexpected mask size");
assert(!Zeroable.isAllOnesValue() && "Fully zeroable shuffle mask");
// Upper half must be undefined.
if (!isUndefInRange(Mask, HalfSize, HalfSize))
return false;
// Determine the extraction length from the part of the
// lower half that isn't zeroable.
int Len = HalfSize;
for (; Len > 0; --Len)
if (!Zeroable[Len - 1])
break;
assert(Len > 0 && "Zeroable shuffle mask");
// Attempt to match first Len sequential elements from the lower half.
SDValue Src;
int Idx = -1;
for (int i = 0; i != Len; ++i) {
int M = Mask[i];
if (M == SM_SentinelUndef)
continue;
SDValue &V = (M < Size ? V1 : V2);
M = M % Size;
// The extracted elements must start at a valid index and all mask
// elements must be in the lower half.
if (i > M || M >= HalfSize)
return false;
if (Idx < 0 || (Src == V && Idx == (M - i))) {
Src = V;
Idx = M - i;
continue;
}
return false;
}
if (!Src || Idx < 0)
return false;
assert((Idx + Len) <= HalfSize && "Illegal extraction mask");
BitLen = (Len * VT.getScalarSizeInBits()) & 0x3f;
BitIdx = (Idx * VT.getScalarSizeInBits()) & 0x3f;
V1 = Src;
return true;
}
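// Worked example (illustrative only): a v8i16 mask [2, 3, zz, zz, u, u, u, u]
// (zz = zeroable, u = undef) matches with Len = 2 and Idx = 2, giving
// BitLen = 2 * 16 = 32 and BitIdx = 2 * 16 = 32: EXTRQ pulls 32 bits
// starting at bit 32 into the low half of the result.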
// INSERTQ: Extract lowest Len elements from lower half of second source and
// insert over first source, starting at Idx.
// { A[0], .., A[Idx-1], B[0], .., B[Len-1], A[Idx+Len], .., UNDEF, ... }
static bool matchVectorShuffleAsINSERTQ(MVT VT, SDValue &V1, SDValue &V2,
ArrayRef<int> Mask, uint64_t &BitLen,
uint64_t &BitIdx) {
int Size = Mask.size();
int HalfSize = Size / 2;
assert(Size == (int)VT.getVectorNumElements() && "Unexpected mask size");
// Upper half must be undefined.
if (!isUndefInRange(Mask, HalfSize, HalfSize))
return false;
for (int Idx = 0; Idx != HalfSize; ++Idx) {
SDValue Base;
// Attempt to match first source from mask before insertion point.
if (isUndefInRange(Mask, 0, Idx)) {
/* EMPTY */
} else if (isSequentialOrUndefInRange(Mask, 0, Idx, 0)) {
Base = V1;
} else if (isSequentialOrUndefInRange(Mask, 0, Idx, Size)) {
Base = V2;
} else {
continue;
}
// Extend the extraction length looking to match both the insertion of
// the second source and the remaining elements of the first.
for (int Hi = Idx + 1; Hi <= HalfSize; ++Hi) {
SDValue Insert;
int Len = Hi - Idx;
// Match insertion.
if (isSequentialOrUndefInRange(Mask, Idx, Len, 0)) {
Insert = V1;
} else if (isSequentialOrUndefInRange(Mask, Idx, Len, Size)) {
Insert = V2;
} else {
continue;
}
// Match the remaining elements of the lower half.
if (isUndefInRange(Mask, Hi, HalfSize - Hi)) {
/* EMPTY */
} else if ((!Base || (Base == V1)) &&
isSequentialOrUndefInRange(Mask, Hi, HalfSize - Hi, Hi)) {
Base = V1;
} else if ((!Base || (Base == V2)) &&
isSequentialOrUndefInRange(Mask, Hi, HalfSize - Hi,
Size + Hi)) {
Base = V2;
} else {
continue;
}
BitLen = (Len * VT.getScalarSizeInBits()) & 0x3f;
BitIdx = (Idx * VT.getScalarSizeInBits()) & 0x3f;
V1 = Base;
V2 = Insert;
return true;
}
}
return false;
}
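// Worked example (illustrative only): for v8i16, the mask
// [0, 8, 9, 3, u, u, u, u] keeps A[0] and A[3] in place and inserts B[0..1]
// at element 1, matching with Idx = 1 and Len = 2, i.e. BitIdx = 16 and
// BitLen = 32 for the INSERTQ immediate operands.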
/// \brief Try to lower a vector shuffle using SSE4a EXTRQ/INSERTQ.
static SDValue lowerVectorShuffleWithSSE4A(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable,
SelectionDAG &DAG) {
uint64_t BitLen, BitIdx;
if (matchVectorShuffleAsEXTRQ(VT, V1, V2, Mask, BitLen, BitIdx, Zeroable))
return DAG.getNode(X86ISD::EXTRQI, DL, VT, V1,
DAG.getConstant(BitLen, DL, MVT::i8),
DAG.getConstant(BitIdx, DL, MVT::i8));
if (matchVectorShuffleAsINSERTQ(VT, V1, V2, Mask, BitLen, BitIdx))
return DAG.getNode(X86ISD::INSERTQI, DL, VT, V1 ? V1 : DAG.getUNDEF(VT),
V2 ? V2 : DAG.getUNDEF(VT),
DAG.getConstant(BitLen, DL, MVT::i8),
DAG.getConstant(BitIdx, DL, MVT::i8));
return SDValue();
}
/// \brief Lower a vector shuffle as a zero or any extension.
///
/// Given a specific number of elements, element bit width, and extension
/// stride, produce either a zero or any extension based on the available
/// features of the subtarget. The extended elements are consecutive and
/// can start from an offset element index in the input; to
/// avoid excess shuffling the offset must either be in the bottom lane
/// or at the start of a higher lane. All extended elements must be from
/// the same lane.
static SDValue lowerVectorShuffleAsSpecificZeroOrAnyExtend(
const SDLoc &DL, MVT VT, int Scale, int Offset, bool AnyExt, SDValue InputV,
ArrayRef<int> Mask, const X86Subtarget &Subtarget, SelectionDAG &DAG) {
assert(Scale > 1 && "Need a scale to extend.");
int EltBits = VT.getScalarSizeInBits();
int NumElements = VT.getVectorNumElements();
int NumEltsPerLane = 128 / EltBits;
int OffsetLane = Offset / NumEltsPerLane;
assert((EltBits == 8 || EltBits == 16 || EltBits == 32) &&
"Only 8, 16, and 32 bit elements can be extended.");
assert(Scale * EltBits <= 64 && "Cannot zero extend past 64 bits.");
  assert(0 <= Offset && "Extension offset must be non-negative.");
assert((Offset < NumEltsPerLane || Offset % NumEltsPerLane == 0) &&
"Extension offset must be in the first lane or start an upper lane.");
// Check that an index is in same lane as the base offset.
auto SafeOffset = [&](int Idx) {
return OffsetLane == (Idx / NumEltsPerLane);
};
// Shift along an input so that the offset base moves to the first element.
auto ShuffleOffset = [&](SDValue V) {
if (!Offset)
return V;
SmallVector<int, 8> ShMask((unsigned)NumElements, -1);
for (int i = 0; i * Scale < NumElements; ++i) {
int SrcIdx = i + Offset;
ShMask[i] = SafeOffset(SrcIdx) ? SrcIdx : -1;
}
return DAG.getVectorShuffle(VT, DL, V, DAG.getUNDEF(VT), ShMask);
};
// Found a valid zext mask! Try various lowering strategies based on the
// input type and available ISA extensions.
if (Subtarget.hasSSE41()) {
    // Not worth offsetting 128-bit vectors if scale == 2; a pattern using
    // PUNPCK will catch this in a later shuffle match.
if (Offset && Scale == 2 && VT.is128BitVector())
return SDValue();
MVT ExtVT = MVT::getVectorVT(MVT::getIntegerVT(EltBits * Scale),
NumElements / Scale);
InputV = ShuffleOffset(InputV);
InputV = getExtendInVec(X86ISD::VZEXT, DL, ExtVT, InputV, DAG);
return DAG.getBitcast(VT, InputV);
}
assert(VT.is128BitVector() && "Only 128-bit vectors can be extended.");
// For any extends we can cheat for larger element sizes and use shuffle
// instructions that can fold with a load and/or copy.
if (AnyExt && EltBits == 32) {
int PSHUFDMask[4] = {Offset, -1, SafeOffset(Offset + 1) ? Offset + 1 : -1,
-1};
return DAG.getBitcast(
VT, DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32,
DAG.getBitcast(MVT::v4i32, InputV),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
}
if (AnyExt && EltBits == 16 && Scale > 2) {
int PSHUFDMask[4] = {Offset / 2, -1,
SafeOffset(Offset + 1) ? (Offset + 1) / 2 : -1, -1};
InputV = DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32,
DAG.getBitcast(MVT::v4i32, InputV),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG));
int PSHUFWMask[4] = {1, -1, -1, -1};
unsigned OddEvenOp = (Offset & 1 ? X86ISD::PSHUFLW : X86ISD::PSHUFHW);
return DAG.getBitcast(
VT, DAG.getNode(OddEvenOp, DL, MVT::v8i16,
DAG.getBitcast(MVT::v8i16, InputV),
getV4X86ShuffleImm8ForMask(PSHUFWMask, DL, DAG)));
}
// The SSE4A EXTRQ instruction can efficiently extend the first 2 lanes
// to 64-bits.
if ((Scale * EltBits) == 64 && EltBits < 32 && Subtarget.hasSSE4A()) {
assert(NumElements == (int)Mask.size() && "Unexpected shuffle mask size!");
assert(VT.is128BitVector() && "Unexpected vector width!");
int LoIdx = Offset * EltBits;
SDValue Lo = DAG.getBitcast(
MVT::v2i64, DAG.getNode(X86ISD::EXTRQI, DL, VT, InputV,
DAG.getConstant(EltBits, DL, MVT::i8),
DAG.getConstant(LoIdx, DL, MVT::i8)));
if (isUndefInRange(Mask, NumElements / 2, NumElements / 2) ||
!SafeOffset(Offset + 1))
return DAG.getBitcast(VT, Lo);
int HiIdx = (Offset + 1) * EltBits;
SDValue Hi = DAG.getBitcast(
MVT::v2i64, DAG.getNode(X86ISD::EXTRQI, DL, VT, InputV,
DAG.getConstant(EltBits, DL, MVT::i8),
DAG.getConstant(HiIdx, DL, MVT::i8)));
return DAG.getBitcast(VT,
DAG.getNode(X86ISD::UNPCKL, DL, MVT::v2i64, Lo, Hi));
}
// If this would require more than 2 unpack instructions to expand, use
// pshufb when available. We can only use more than 2 unpack instructions
  // when zero extending i8 elements, which also makes it easier to use pshufb.
if (Scale > 4 && EltBits == 8 && Subtarget.hasSSSE3()) {
assert(NumElements == 16 && "Unexpected byte vector width!");
SDValue PSHUFBMask[16];
for (int i = 0; i < 16; ++i) {
int Idx = Offset + (i / Scale);
PSHUFBMask[i] = DAG.getConstant(
(i % Scale == 0 && SafeOffset(Idx)) ? Idx : 0x80, DL, MVT::i8);
}
InputV = DAG.getBitcast(MVT::v16i8, InputV);
return DAG.getBitcast(
VT, DAG.getNode(X86ISD::PSHUFB, DL, MVT::v16i8, InputV,
DAG.getBuildVector(MVT::v16i8, DL, PSHUFBMask)));
}
// If we are extending from an offset, ensure we start on a boundary that
// we can unpack from.
int AlignToUnpack = Offset % (NumElements / Scale);
if (AlignToUnpack) {
SmallVector<int, 8> ShMask((unsigned)NumElements, -1);
for (int i = AlignToUnpack; i < NumElements; ++i)
ShMask[i - AlignToUnpack] = i;
InputV = DAG.getVectorShuffle(VT, DL, InputV, DAG.getUNDEF(VT), ShMask);
Offset -= AlignToUnpack;
}
// Otherwise emit a sequence of unpacks.
do {
unsigned UnpackLoHi = X86ISD::UNPCKL;
if (Offset >= (NumElements / 2)) {
UnpackLoHi = X86ISD::UNPCKH;
Offset -= (NumElements / 2);
}
MVT InputVT = MVT::getVectorVT(MVT::getIntegerVT(EltBits), NumElements);
SDValue Ext = AnyExt ? DAG.getUNDEF(InputVT)
: getZeroVector(InputVT, Subtarget, DAG, DL);
InputV = DAG.getBitcast(InputVT, InputV);
InputV = DAG.getNode(UnpackLoHi, DL, InputVT, InputV, Ext);
Scale /= 2;
EltBits *= 2;
NumElements /= 2;
} while (Scale > 1);
return DAG.getBitcast(VT, InputV);
}
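// Example of the final unpack loop (illustrative only): zero-extending the
// low four i8 elements of a v16i8 input to v4i32 with only SSE2 takes two
// rounds: PUNPCKLBW against zero widens i8 -> i16, then PUNPCKLWD against
// zero widens i16 -> i32, halving Scale from 4 to 2 to 1.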
/// \brief Try to lower a vector shuffle as a zero extension on any microarch.
///
/// This routine will try to do everything in its power to cleverly lower
/// a shuffle which happens to match the pattern of a zero extend. It doesn't
/// check for the profitability of this lowering; it tries to aggressively
/// match this pattern. It will use all of the micro-architectural details it
/// can to emit an efficient lowering. It handles both blends with all-zero
/// inputs (to explicitly zero-extend) and undef lanes (sometimes undef due to
/// later masking).
///
/// The reason we have dedicated lowering for zext-style shuffles is that they
/// are both incredibly common and often quite performance sensitive.
static SDValue lowerVectorShuffleAsZeroOrAnyExtend(
const SDLoc &DL, MVT VT, SDValue V1, SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
int Bits = VT.getSizeInBits();
int NumLanes = Bits / 128;
int NumElements = VT.getVectorNumElements();
int NumEltsPerLane = NumElements / NumLanes;
assert(VT.getScalarSizeInBits() <= 32 &&
"Exceeds 32-bit integer zero extension limit");
assert((int)Mask.size() == NumElements && "Unexpected shuffle mask size");
// Define a helper function to check a particular ext-scale and lower to it if
// valid.
auto Lower = [&](int Scale) -> SDValue {
SDValue InputV;
bool AnyExt = true;
int Offset = 0;
int Matches = 0;
for (int i = 0; i < NumElements; ++i) {
int M = Mask[i];
if (M < 0)
continue; // Valid anywhere but doesn't tell us anything.
if (i % Scale != 0) {
        // Each of the extended elements needs to be zeroable.
if (!Zeroable[i])
return SDValue();
// We no longer are in the anyext case.
AnyExt = false;
continue;
}
// Each of the base elements needs to be consecutive indices into the
// same input vector.
SDValue V = M < NumElements ? V1 : V2;
M = M % NumElements;
if (!InputV) {
InputV = V;
Offset = M - (i / Scale);
} else if (InputV != V)
return SDValue(); // Flip-flopping inputs.
// Offset must start in the lowest 128-bit lane or at the start of an
// upper lane.
// FIXME: Is it ever worth allowing a negative base offset?
if (!((0 <= Offset && Offset < NumEltsPerLane) ||
(Offset % NumEltsPerLane) == 0))
return SDValue();
// If we are offsetting, all referenced entries must come from the same
// lane.
if (Offset && (Offset / NumEltsPerLane) != (M / NumEltsPerLane))
return SDValue();
if ((M % NumElements) != (Offset + (i / Scale)))
return SDValue(); // Non-consecutive strided elements.
Matches++;
}
// If we fail to find an input, we have a zero-shuffle which should always
// have already been handled.
// FIXME: Maybe handle this here in case during blending we end up with one?
if (!InputV)
return SDValue();
// If we are offsetting, don't extend if we only match a single input, we
// can always do better by using a basic PSHUF or PUNPCK.
if (Offset != 0 && Matches < 2)
return SDValue();
return lowerVectorShuffleAsSpecificZeroOrAnyExtend(
DL, VT, Scale, Offset, AnyExt, InputV, Mask, Subtarget, DAG);
};
// The widest scale possible for extending is to a 64-bit integer.
assert(Bits % 64 == 0 &&
"The number of bits in a vector must be divisible by 64 on x86!");
int NumExtElements = Bits / 64;
// Each iteration, try extending the elements half as much, but into twice as
// many elements.
for (; NumExtElements < NumElements; NumExtElements *= 2) {
assert(NumElements % NumExtElements == 0 &&
"The input vector size must be divisible by the extended size.");
if (SDValue V = Lower(NumElements / NumExtElements))
return V;
}
// General extends failed, but 128-bit vectors may be able to use MOVQ.
if (Bits != 128)
return SDValue();
// Returns one of the source operands if the shuffle can be reduced to a
// MOVQ, copying the lower 64-bits and zero-extending to the upper 64-bits.
auto CanZExtLowHalf = [&]() {
for (int i = NumElements / 2; i != NumElements; ++i)
if (!Zeroable[i])
return SDValue();
if (isSequentialOrUndefInRange(Mask, 0, NumElements / 2, 0))
return V1;
if (isSequentialOrUndefInRange(Mask, 0, NumElements / 2, NumElements))
return V2;
return SDValue();
};
if (SDValue V = CanZExtLowHalf()) {
V = DAG.getBitcast(MVT::v2i64, V);
V = DAG.getNode(X86ISD::VZEXT_MOVL, DL, MVT::v2i64, V);
return DAG.getBitcast(VT, V);
}
// No viable ext lowering found.
return SDValue();
}
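// Example of the scale search above (illustrative only): for a 128-bit
// v16i8 shuffle the loop tries Scale = 8 (i8 -> i64), then Scale = 4
// (i8 -> i32), then Scale = 2 (i8 -> i16), preferring the widest extension
// that the mask and zeroable elements allow.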
/// \brief Try to get a scalar value for a specific element of a vector.
///
/// Looks through BUILD_VECTOR and SCALAR_TO_VECTOR nodes to find a scalar.
static SDValue getScalarValueForVectorElement(SDValue V, int Idx,
SelectionDAG &DAG) {
MVT VT = V.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
V = peekThroughBitcasts(V);
// If the bitcasts shift the element size, we can't extract an equivalent
// element from it.
MVT NewVT = V.getSimpleValueType();
if (!NewVT.isVector() || NewVT.getScalarSizeInBits() != VT.getScalarSizeInBits())
return SDValue();
if (V.getOpcode() == ISD::BUILD_VECTOR ||
(Idx == 0 && V.getOpcode() == ISD::SCALAR_TO_VECTOR)) {
// Ensure the scalar operand is the same size as the destination.
// FIXME: Add support for scalar truncation where possible.
SDValue S = V.getOperand(Idx);
if (EltVT.getSizeInBits() == S.getSimpleValueType().getSizeInBits())
return DAG.getBitcast(EltVT, S);
}
return SDValue();
}
/// \brief Helper to test for a load that can be folded with x86 shuffles.
///
/// This is particularly important because the set of instructions varies
/// significantly based on whether the operand is a load or not.
static bool isShuffleFoldableLoad(SDValue V) {
V = peekThroughBitcasts(V);
return ISD::isNON_EXTLoad(V.getNode());
}
/// \brief Try to lower insertion of a single element into a zero vector.
///
/// This is a common pattern for which we have especially efficient lowering
/// patterns across all subtarget feature sets.
static SDValue lowerVectorShuffleAsElementInsertion(
const SDLoc &DL, MVT VT, SDValue V1, SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT ExtVT = VT;
MVT EltVT = VT.getVectorElementType();
int V2Index =
find_if(Mask, [&Mask](int M) { return M >= (int)Mask.size(); }) -
Mask.begin();
bool IsV1Zeroable = true;
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (i != V2Index && !Zeroable[i]) {
IsV1Zeroable = false;
break;
}
// Check for a single input from a SCALAR_TO_VECTOR node.
// FIXME: All of this should be canonicalized into INSERT_VECTOR_ELT and
// all the smarts here sunk into that routine. However, the current
// lowering of BUILD_VECTOR makes that nearly impossible until the old
// vector shuffle lowering is dead.
SDValue V2S = getScalarValueForVectorElement(V2, Mask[V2Index] - Mask.size(),
DAG);
if (V2S && DAG.getTargetLoweringInfo().isTypeLegal(V2S.getValueType())) {
// We need to zext the scalar if it is smaller than an i32.
V2S = DAG.getBitcast(EltVT, V2S);
if (EltVT == MVT::i8 || EltVT == MVT::i16) {
// Using zext to expand a narrow element won't work for non-zero
// insertions.
if (!IsV1Zeroable)
return SDValue();
// Zero-extend directly to i32.
ExtVT = MVT::v4i32;
V2S = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i32, V2S);
}
V2 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, ExtVT, V2S);
} else if (Mask[V2Index] != (int)Mask.size() || EltVT == MVT::i8 ||
EltVT == MVT::i16) {
// Either not inserting from the low element of the input or the input
// element size is too small to use VZEXT_MOVL to clear the high bits.
return SDValue();
}
if (!IsV1Zeroable) {
// If V1 can't be treated as a zero vector we have fewer options to lower
// this. We can't support integer vectors or non-zero targets cheaply, and
// the V1 elements can't be permuted in any way.
assert(VT == ExtVT && "Cannot change extended type when non-zeroable!");
if (!VT.isFloatingPoint() || V2Index != 0)
return SDValue();
SmallVector<int, 8> V1Mask(Mask.begin(), Mask.end());
V1Mask[V2Index] = -1;
if (!isNoopShuffleMask(V1Mask))
return SDValue();
// This is essentially a special case blend operation, but if we have
// general purpose blend operations, they are always faster. Bail and let
// the rest of the lowering handle these as blends.
if (Subtarget.hasSSE41())
return SDValue();
// Otherwise, use MOVSD or MOVSS.
assert((EltVT == MVT::f32 || EltVT == MVT::f64) &&
"Only two types of floating point element types to handle!");
return DAG.getNode(EltVT == MVT::f32 ? X86ISD::MOVSS : X86ISD::MOVSD, DL,
ExtVT, V1, V2);
}
// This lowering only works for the low element with floating point vectors.
if (VT.isFloatingPoint() && V2Index != 0)
return SDValue();
V2 = DAG.getNode(X86ISD::VZEXT_MOVL, DL, ExtVT, V2);
if (ExtVT != VT)
V2 = DAG.getBitcast(VT, V2);
if (V2Index != 0) {
// If we have 4 or fewer lanes we can cheaply shuffle the element into
// the desired position. Otherwise it is more efficient to do a vector
// shift left. We know that we can do a vector shift left because all
// the inputs are zero.
if (VT.isFloatingPoint() || VT.getVectorNumElements() <= 4) {
SmallVector<int, 4> V2Shuffle(Mask.size(), 1);
V2Shuffle[V2Index] = 0;
V2 = DAG.getVectorShuffle(VT, DL, V2, DAG.getUNDEF(VT), V2Shuffle);
} else {
V2 = DAG.getBitcast(MVT::v16i8, V2);
V2 = DAG.getNode(
X86ISD::VSHLDQ, DL, MVT::v16i8, V2,
DAG.getConstant(V2Index * EltVT.getSizeInBits() / 8, DL,
DAG.getTargetLoweringInfo().getScalarShiftAmountTy(
DAG.getDataLayout(), VT)));
V2 = DAG.getBitcast(VT, V2);
}
}
return V2;
}
/// Try to lower a broadcast of a single (truncated) integer element
/// coming from a scalar_to_vector/build_vector node \p V0 with larger elements.
///
/// This assumes we have AVX2.
static SDValue lowerVectorShuffleAsTruncBroadcast(const SDLoc &DL, MVT VT,
SDValue V0, int BroadcastIdx,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX2() &&
"We can only lower integer broadcasts with AVX2!");
EVT EltVT = VT.getVectorElementType();
EVT V0VT = V0.getValueType();
assert(VT.isInteger() && "Unexpected non-integer trunc broadcast!");
assert(V0VT.isVector() && "Unexpected non-vector vector-sized value!");
EVT V0EltVT = V0VT.getVectorElementType();
if (!V0EltVT.isInteger())
return SDValue();
const unsigned EltSize = EltVT.getSizeInBits();
const unsigned V0EltSize = V0EltVT.getSizeInBits();
// This is only a truncation if the original element type is larger.
if (V0EltSize <= EltSize)
return SDValue();
assert(((V0EltSize % EltSize) == 0) &&
"Scalar type sizes must all be powers of 2 on x86!");
const unsigned V0Opc = V0.getOpcode();
const unsigned Scale = V0EltSize / EltSize;
const unsigned V0BroadcastIdx = BroadcastIdx / Scale;
if ((V0Opc != ISD::SCALAR_TO_VECTOR || V0BroadcastIdx != 0) &&
V0Opc != ISD::BUILD_VECTOR)
return SDValue();
SDValue Scalar = V0.getOperand(V0BroadcastIdx);
// If we're extracting non-least-significant bits, shift so we can truncate.
// Hopefully, we can fold away the trunc/srl/load into the broadcast.
// Even if we can't (and !isShuffleFoldableLoad(Scalar)), prefer
// vpbroadcast+vmovd+shr to vpshufb(m)+vmovd.
if (const int OffsetIdx = BroadcastIdx % Scale)
Scalar = DAG.getNode(ISD::SRL, DL, Scalar.getValueType(), Scalar,
DAG.getConstant(OffsetIdx * EltSize, DL, Scalar.getValueType()));
return DAG.getNode(X86ISD::VBROADCAST, DL, VT,
DAG.getNode(ISD::TRUNCATE, DL, EltVT, Scalar));
}
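// Worked example (illustrative only): broadcasting i8 element 5 of a
// bitcast v4i32 build_vector gives Scale = 4, V0BroadcastIdx = 5 / 4 = 1 and
// OffsetIdx = 5 % 4 = 1, so the selected i32 scalar is shifted right by
// 1 * 8 = 8 bits, truncated to i8, and fed to VBROADCAST.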
/// \brief Try to lower broadcast of a single element.
///
/// For convenience, this code also bundles all of the subtarget feature set
/// filtering. While a little annoying to re-dispatch on type here, there isn't
/// a convenient way to factor it out.
static SDValue lowerVectorShuffleAsBroadcast(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
if (!((Subtarget.hasSSE3() && VT == MVT::v2f64) ||
(Subtarget.hasAVX() && VT.isFloatingPoint()) ||
(Subtarget.hasAVX2() && VT.isInteger())))
return SDValue();
// With MOVDDUP (v2f64) we can broadcast from a register or a load, otherwise
// we can only broadcast from a register with AVX2.
unsigned NumElts = Mask.size();
unsigned Opcode = VT == MVT::v2f64 ? X86ISD::MOVDDUP : X86ISD::VBROADCAST;
bool BroadcastFromReg = (Opcode == X86ISD::MOVDDUP) || Subtarget.hasAVX2();
// Check that the mask is a broadcast.
int BroadcastIdx = -1;
for (int i = 0; i != (int)NumElts; ++i) {
SmallVector<int, 8> BroadcastMask(NumElts, i);
if (isShuffleEquivalent(V1, V2, Mask, BroadcastMask)) {
BroadcastIdx = i;
break;
}
}
if (BroadcastIdx < 0)
return SDValue();
assert(BroadcastIdx < (int)Mask.size() && "We only expect to be called with "
"a sorted mask where the broadcast "
"comes from V1.");
// Go up the chain of (vector) values to find a scalar load that we can
// combine with the broadcast.
SDValue V = V1;
for (;;) {
switch (V.getOpcode()) {
case ISD::BITCAST: {
SDValue VSrc = V.getOperand(0);
MVT SrcVT = VSrc.getSimpleValueType();
if (VT.getScalarSizeInBits() != SrcVT.getScalarSizeInBits())
break;
V = VSrc;
continue;
}
case ISD::CONCAT_VECTORS: {
int OperandSize = Mask.size() / V.getNumOperands();
V = V.getOperand(BroadcastIdx / OperandSize);
BroadcastIdx %= OperandSize;
continue;
}
case ISD::INSERT_SUBVECTOR: {
SDValue VOuter = V.getOperand(0), VInner = V.getOperand(1);
auto ConstantIdx = dyn_cast<ConstantSDNode>(V.getOperand(2));
if (!ConstantIdx)
break;
int BeginIdx = (int)ConstantIdx->getZExtValue();
int EndIdx =
BeginIdx + (int)VInner.getSimpleValueType().getVectorNumElements();
if (BroadcastIdx >= BeginIdx && BroadcastIdx < EndIdx) {
BroadcastIdx -= BeginIdx;
V = VInner;
} else {
V = VOuter;
}
continue;
}
}
break;
}
// Check if this is a broadcast of a scalar. We special case lowering
// for scalars so that we can more effectively fold with loads.
// First, look through bitcast: if the original value has a larger element
// type than the shuffle, the broadcast element is in essence truncated.
// Make that explicit to ease folding.
if (V.getOpcode() == ISD::BITCAST && VT.isInteger())
if (SDValue TruncBroadcast = lowerVectorShuffleAsTruncBroadcast(
DL, VT, V.getOperand(0), BroadcastIdx, Subtarget, DAG))
return TruncBroadcast;
MVT BroadcastVT = VT;
// Peek through any bitcast (only useful for loads).
SDValue BC = peekThroughBitcasts(V);
// Also check the simpler case, where we can directly reuse the scalar.
if (V.getOpcode() == ISD::BUILD_VECTOR ||
(V.getOpcode() == ISD::SCALAR_TO_VECTOR && BroadcastIdx == 0)) {
V = V.getOperand(BroadcastIdx);
// If we can't broadcast from a register, check that the input is a load.
if (!BroadcastFromReg && !isShuffleFoldableLoad(V))
return SDValue();
} else if (MayFoldLoad(BC) && !cast<LoadSDNode>(BC)->isVolatile()) {
// 32-bit targets need to load i64 as a f64 and then bitcast the result.
if (!Subtarget.is64Bit() && VT.getScalarType() == MVT::i64) {
BroadcastVT = MVT::getVectorVT(MVT::f64, VT.getVectorNumElements());
Opcode = (BroadcastVT.is128BitVector() ? X86ISD::MOVDDUP : Opcode);
}
// If we are broadcasting a load that is only used by the shuffle
// then we can reduce the vector load to the broadcasted scalar load.
LoadSDNode *Ld = cast<LoadSDNode>(BC);
SDValue BaseAddr = Ld->getOperand(1);
EVT SVT = BroadcastVT.getScalarType();
unsigned Offset = BroadcastIdx * SVT.getStoreSize();
SDValue NewAddr = DAG.getMemBasePlusOffset(BaseAddr, Offset, DL);
V = DAG.getLoad(SVT, DL, Ld->getChain(), NewAddr,
DAG.getMachineFunction().getMachineMemOperand(
Ld->getMemOperand(), Offset, SVT.getStoreSize()));
DAG.makeEquivalentMemoryOrdering(Ld, V);
} else if (!BroadcastFromReg) {
// We can't broadcast from a vector register.
return SDValue();
} else if (BroadcastIdx != 0) {
// We can only broadcast from the zero-element of a vector register,
// but it can be advantageous to broadcast from the zero-element of a
// subvector.
if (!VT.is256BitVector() && !VT.is512BitVector())
return SDValue();
// VPERMQ/VPERMPD can perform the cross-lane shuffle directly.
if (VT == MVT::v4f64 || VT == MVT::v4i64)
return SDValue();
// Only broadcast the zero-element of a 128-bit subvector.
unsigned EltSize = VT.getScalarSizeInBits();
if (((BroadcastIdx * EltSize) % 128) != 0)
return SDValue();
// The shuffle input might have been a bitcast we looked through; look at
// the original input vector. Emit an EXTRACT_SUBVECTOR of that type; we'll
// later bitcast it to BroadcastVT.
MVT SrcVT = V.getSimpleValueType();
assert(SrcVT.getScalarSizeInBits() == BroadcastVT.getScalarSizeInBits() &&
"Unexpected vector element size");
assert((SrcVT.is256BitVector() || SrcVT.is512BitVector()) &&
"Unexpected vector size");
MVT ExtVT = MVT::getVectorVT(SrcVT.getScalarType(), 128 / EltSize);
V = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtVT, V,
DAG.getIntPtrConstant(BroadcastIdx, DL));
}
if (Opcode == X86ISD::MOVDDUP && !V.getValueType().isVector())
V = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v2f64,
DAG.getBitcast(MVT::f64, V));
// Bitcast back to the same scalar type as BroadcastVT.
MVT SrcVT = V.getSimpleValueType();
if (SrcVT.getScalarType() != BroadcastVT.getScalarType()) {
assert(SrcVT.getScalarSizeInBits() == BroadcastVT.getScalarSizeInBits() &&
"Unexpected vector element size");
if (SrcVT.isVector()) {
unsigned NumSrcElts = SrcVT.getVectorNumElements();
SrcVT = MVT::getVectorVT(BroadcastVT.getScalarType(), NumSrcElts);
} else {
SrcVT = BroadcastVT.getScalarType();
}
V = DAG.getBitcast(SrcVT, V);
}
// 32-bit targets need to load i64 as a f64 and then bitcast the result.
if (!Subtarget.is64Bit() && SrcVT == MVT::i64) {
V = DAG.getBitcast(MVT::f64, V);
unsigned NumBroadcastElts = BroadcastVT.getVectorNumElements();
BroadcastVT = MVT::getVectorVT(MVT::f64, NumBroadcastElts);
}
// We only support broadcasting from 128-bit vectors to minimize the
// number of patterns we need to deal with in isel. So extract down to
// 128-bits.
if (SrcVT.getSizeInBits() > 128)
V = extract128BitVector(V, 0, DAG, DL);
return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
}
// Check for whether we can use INSERTPS to perform the shuffle. We only use
// INSERTPS when the V1 elements are already in the correct locations
// because otherwise we can just always use two SHUFPS instructions which
// are much smaller to encode than a SHUFPS and an INSERTPS. We can also
// perform INSERTPS if a single V1 element is out of place and all V2
// elements are zeroable.
static bool matchVectorShuffleAsInsertPS(SDValue &V1, SDValue &V2,
unsigned &InsertPSMask,
const APInt &Zeroable,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType().is128BitVector() && "Bad operand type!");
assert(V2.getSimpleValueType().is128BitVector() && "Bad operand type!");
assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");
// Attempt to match INSERTPS with one element from VA or VB being
// inserted into VA (or undef). If successful, V1, V2 and InsertPSMask
// are updated.
auto matchAsInsertPS = [&](SDValue VA, SDValue VB,
ArrayRef<int> CandidateMask) {
unsigned ZMask = 0;
int VADstIndex = -1;
int VBDstIndex = -1;
bool VAUsedInPlace = false;
for (int i = 0; i < 4; ++i) {
// Synthesize a zero mask from the zeroable elements (includes undefs).
if (Zeroable[i]) {
ZMask |= 1 << i;
continue;
}
// Flag if we use any VA inputs in place.
if (i == CandidateMask[i]) {
VAUsedInPlace = true;
continue;
}
// We can only insert a single non-zeroable element.
if (VADstIndex >= 0 || VBDstIndex >= 0)
return false;
if (CandidateMask[i] < 4) {
// VA input out of place for insertion.
VADstIndex = i;
} else {
// VB input for insertion.
VBDstIndex = i;
}
}
// Don't bother if we have no (non-zeroable) element for insertion.
if (VADstIndex < 0 && VBDstIndex < 0)
return false;
// Determine element insertion src/dst indices. The src index is from the
// start of the inserted vector, not the start of the concatenated vector.
unsigned VBSrcIndex = 0;
if (VADstIndex >= 0) {
// If we have a VA input out of place, we use VA as the V2 element
// insertion and don't use the original V2 at all.
VBSrcIndex = CandidateMask[VADstIndex];
VBDstIndex = VADstIndex;
VB = VA;
} else {
VBSrcIndex = CandidateMask[VBDstIndex] - 4;
}
// If no V1 inputs are used in place, then the result is created only from
// the zero mask and the V2 insertion - so remove V1 dependency.
if (!VAUsedInPlace)
VA = DAG.getUNDEF(MVT::v4f32);
// Update V1, V2 and InsertPSMask accordingly.
V1 = VA;
V2 = VB;
// Insert the V2 element into the desired position.
InsertPSMask = VBSrcIndex << 6 | VBDstIndex << 4 | ZMask;
assert((InsertPSMask & ~0xFFu) == 0 && "Invalid mask!");
return true;
};
if (matchAsInsertPS(V1, V2, Mask))
return true;
// Commute and try again.
SmallVector<int, 4> CommutedMask(Mask.begin(), Mask.end());
ShuffleVectorSDNode::commuteMask(CommutedMask);
if (matchAsInsertPS(V2, V1, CommutedMask))
return true;
return false;
}
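// The INSERTPS immediate encodes the operation directly: bits [7:6] select
// the source element of V2, bits [5:4] select the destination slot, and
// bits [3:0] are the zero mask. For example (illustrative only), inserting
// V2[2] into slot 1 while zeroing element 3 gives
// (2 << 6) | (1 << 4) | (1 << 3) = 0x98.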
static SDValue lowerVectorShuffleAsInsertPS(const SDLoc &DL, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
// Attempt to match the insertps pattern.
unsigned InsertPSMask;
if (!matchVectorShuffleAsInsertPS(V1, V2, InsertPSMask, Zeroable, Mask, DAG))
return SDValue();
// Insert the V2 element into the desired position.
return DAG.getNode(X86ISD::INSERTPS, DL, MVT::v4f32, V1, V2,
DAG.getConstant(InsertPSMask, DL, MVT::i8));
}
/// \brief Try to lower a shuffle as a permute of the inputs followed by an
/// UNPCK instruction.
///
/// This specifically targets cases where we end up with alternating between
/// the two inputs, and so can permute them into something that feeds a single
/// UNPCK instruction. Note that this routine only targets integer vectors
/// because for floating point vectors we have a generalized SHUFPS lowering
/// strategy that handles everything that doesn't *exactly* match an unpack,
/// making this clever lowering unnecessary.
static SDValue lowerVectorShuffleAsPermuteAndUnpack(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(!VT.isFloatingPoint() &&
"This routine only supports integer vectors.");
assert(VT.is128BitVector() &&
"This routine only works on 128-bit vectors.");
assert(!V2.isUndef() &&
"This routine should only be used when blending two inputs.");
assert(Mask.size() >= 2 && "Single element masks are invalid.");
int Size = Mask.size();
int NumLoInputs =
count_if(Mask, [Size](int M) { return M >= 0 && M % Size < Size / 2; });
int NumHiInputs =
count_if(Mask, [Size](int M) { return M % Size >= Size / 2; });
bool UnpackLo = NumLoInputs >= NumHiInputs;
auto TryUnpack = [&](int ScalarSize, int Scale) {
SmallVector<int, 16> V1Mask((unsigned)Size, -1);
SmallVector<int, 16> V2Mask((unsigned)Size, -1);
for (int i = 0; i < Size; ++i) {
if (Mask[i] < 0)
continue;
// Each element of the unpack contains Scale elements from this mask.
int UnpackIdx = i / Scale;
// We only handle the case where V1 feeds the first slots of the unpack.
// We rely on canonicalization to ensure this is the case.
if ((UnpackIdx % 2 == 0) != (Mask[i] < Size))
return SDValue();
// Setup the mask for this input. The indexing is tricky as we have to
// handle the unpack stride.
SmallVectorImpl<int> &VMask = (UnpackIdx % 2 == 0) ? V1Mask : V2Mask;
VMask[(UnpackIdx / 2) * Scale + i % Scale + (UnpackLo ? 0 : Size / 2)] =
Mask[i] % Size;
}
// If we will have to shuffle both inputs to use the unpack, check whether
// we can just unpack first and shuffle the result. If so, skip this unpack.
if ((NumLoInputs == 0 || NumHiInputs == 0) && !isNoopShuffleMask(V1Mask) &&
!isNoopShuffleMask(V2Mask))
return SDValue();
// Shuffle the inputs into place.
V1 = DAG.getVectorShuffle(VT, DL, V1, DAG.getUNDEF(VT), V1Mask);
V2 = DAG.getVectorShuffle(VT, DL, V2, DAG.getUNDEF(VT), V2Mask);
// Cast the inputs to the type we will use to unpack them.
MVT UnpackVT = MVT::getVectorVT(MVT::getIntegerVT(ScalarSize), Size / Scale);
V1 = DAG.getBitcast(UnpackVT, V1);
V2 = DAG.getBitcast(UnpackVT, V2);
// Unpack the inputs and cast the result back to the desired type.
return DAG.getBitcast(
VT, DAG.getNode(UnpackLo ? X86ISD::UNPCKL : X86ISD::UNPCKH, DL,
UnpackVT, V1, V2));
};
  // We try each unpack from the largest to the smallest to try to find one
  // that fits this mask.
int OrigScalarSize = VT.getScalarSizeInBits();
for (int ScalarSize = 64; ScalarSize >= OrigScalarSize; ScalarSize /= 2)
if (SDValue Unpack = TryUnpack(ScalarSize, ScalarSize / OrigScalarSize))
return Unpack;
// If none of the unpack-rooted lowerings worked (or were profitable) try an
// initial unpack.
if (NumLoInputs == 0 || NumHiInputs == 0) {
assert((NumLoInputs > 0 || NumHiInputs > 0) &&
"We have to have *some* inputs!");
int HalfOffset = NumLoInputs == 0 ? Size / 2 : 0;
// FIXME: We could consider the total complexity of the permute of each
// possible unpacking. Or at the least we should consider how many
// half-crossings are created.
// FIXME: We could consider commuting the unpacks.
SmallVector<int, 32> PermMask((unsigned)Size, -1);
for (int i = 0; i < Size; ++i) {
if (Mask[i] < 0)
continue;
assert(Mask[i] % Size >= HalfOffset && "Found input from wrong half!");
PermMask[i] =
2 * ((Mask[i] % Size) - HalfOffset) + (Mask[i] < Size ? 0 : 1);
}
return DAG.getVectorShuffle(
VT, DL, DAG.getNode(NumLoInputs == 0 ? X86ISD::UNPCKH : X86ISD::UNPCKL,
DL, VT, V1, V2),
DAG.getUNDEF(VT), PermMask);
}
return SDValue();
}
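// Example (illustrative only): a v8i16 mask <0, 8, 2, 10, 4, 12, 6, 14>
// alternates between the two inputs but is not a direct unpack. The routine
// above first shuffles each input so elements {0, 2, 4, 6} are consecutive
// (V1Mask = V2Mask = [0, 2, 4, 6, -1, -1, -1, -1]) and then emits a single
// PUNPCKLWD to interleave them.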
/// \brief Handle lowering of 2-lane 64-bit floating point shuffles.
///
/// This is the basis function for the 2-lane 64-bit shuffles as we have full
/// support for floating point shuffles but not integer shuffles. These
/// instructions will incur a domain crossing penalty on some chips though so
/// it is better to avoid lowering through this for integer vectors where
/// possible.
static SDValue lowerV2F64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v2f64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v2f64 && "Bad operand type!");
assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");
if (V2.isUndef()) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v2f64, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Straight shuffle of a single input vector. Simulate this by using the
    // single input as both of the "inputs" to this instruction.
unsigned SHUFPDMask = (Mask[0] == 1) | ((Mask[1] == 1) << 1);
if (Subtarget.hasAVX()) {
// If we have AVX, we can use VPERMILPS which will allow folding a load
// into the shuffle.
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v2f64, V1,
DAG.getConstant(SHUFPDMask, DL, MVT::i8));
}
return DAG.getNode(
X86ISD::SHUFP, DL, MVT::v2f64,
Mask[0] == SM_SentinelUndef ? DAG.getUNDEF(MVT::v2f64) : V1,
Mask[1] == SM_SentinelUndef ? DAG.getUNDEF(MVT::v2f64) : V1,
DAG.getConstant(SHUFPDMask, DL, MVT::i8));
}
assert(Mask[0] >= 0 && Mask[0] < 2 && "Non-canonicalized blend!");
assert(Mask[1] >= 2 && "Non-canonicalized blend!");
// If we have a single input, insert that into V1 if we can do so cheaply.
if ((Mask[0] >= 2) + (Mask[1] >= 2) == 1) {
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, MVT::v2f64, V1, V2, Mask, Zeroable, Subtarget, DAG))
return Insertion;
// Try inverting the insertion since for v2 masks it is easy to do and we
// can't reliably sort the mask one way or the other.
int InverseMask[2] = {Mask[0] < 0 ? -1 : (Mask[0] ^ 2),
Mask[1] < 0 ? -1 : (Mask[1] ^ 2)};
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, MVT::v2f64, V2, V1, InverseMask, Zeroable, Subtarget, DAG))
return Insertion;
}
// Try to use one of the special instruction patterns to handle two common
// blend patterns if a zero-blend above didn't work.
if (isShuffleEquivalent(V1, V2, Mask, {0, 3}) ||
isShuffleEquivalent(V1, V2, Mask, {1, 3}))
if (SDValue V1S = getScalarValueForVectorElement(V1, Mask[0], DAG))
// We can either use a special instruction to load over the low double or
// to move just the low double.
return DAG.getNode(
isShuffleFoldableLoad(V1S) ? X86ISD::MOVLPD : X86ISD::MOVSD,
DL, MVT::v2f64, V2,
DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v2f64, V1S));
if (Subtarget.hasSSE41())
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v2f64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v2f64, Mask, V1, V2, DAG))
return V;
unsigned SHUFPDMask = (Mask[0] == 1) | (((Mask[1] - 2) == 1) << 1);
return DAG.getNode(X86ISD::SHUFP, DL, MVT::v2f64, V1, V2,
DAG.getConstant(SHUFPDMask, DL, MVT::i8));
}
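// Example of the SHUFPD immediate above (illustrative only): a mask of
// {1, 3} selects the high element of each input, so SHUFPDMask =
// 1 | (1 << 1) = 3.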
/// \brief Handle lowering of 2-lane 64-bit integer shuffles.
///
/// Tries to lower a 2-lane 64-bit shuffle using shuffle operations provided by
/// the integer unit to minimize domain crossing penalties. However, for blends
/// it falls back to the floating point shuffle operation with appropriate bit
/// casting.
static SDValue lowerV2I64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v2i64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v2i64 && "Bad operand type!");
assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");
if (V2.isUndef()) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v2i64, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Straight shuffle of a single input vector. For everything from SSE2
// onward this has a single fast instruction with no scary immediates.
// We have to map the mask as it is actually a v4i32 shuffle instruction.
V1 = DAG.getBitcast(MVT::v4i32, V1);
int WidenedMask[4] = {
std::max(Mask[0], 0) * 2, std::max(Mask[0], 0) * 2 + 1,
std::max(Mask[1], 0) * 2, std::max(Mask[1], 0) * 2 + 1};
return DAG.getBitcast(
MVT::v2i64,
DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32, V1,
getV4X86ShuffleImm8ForMask(WidenedMask, DL, DAG)));
}
assert(Mask[0] != -1 && "No undef lanes in multi-input v2 shuffles!");
assert(Mask[1] != -1 && "No undef lanes in multi-input v2 shuffles!");
assert(Mask[0] < 2 && "We sort V1 to be the first input.");
assert(Mask[1] >= 2 && "We sort V2 to be the second input.");
// If we have a blend of two same-type PACKUS operations and the blend aligns
// with the low and high halves, we can just merge the PACKUS operations.
// This is particularly important as it lets us merge shuffles that this
// routine itself creates.
auto GetPackNode = [](SDValue V) {
V = peekThroughBitcasts(V);
return V.getOpcode() == X86ISD::PACKUS ? V : SDValue();
};
if (SDValue V1Pack = GetPackNode(V1))
if (SDValue V2Pack = GetPackNode(V2)) {
EVT PackVT = V1Pack.getValueType();
if (PackVT == V2Pack.getValueType())
return DAG.getBitcast(MVT::v2i64,
DAG.getNode(X86ISD::PACKUS, DL, PackVT,
Mask[0] == 0 ? V1Pack.getOperand(0)
: V1Pack.getOperand(1),
Mask[1] == 2 ? V2Pack.getOperand(0)
: V2Pack.getOperand(1)));
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v2i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// When loading a scalar and then shuffling it into a vector we can often do
// the insertion cheaply.
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, MVT::v2i64, V1, V2, Mask, Zeroable, Subtarget, DAG))
return Insertion;
// Try inverting the insertion since for v2 masks it is easy to do and we
// can't reliably sort the mask one way or the other.
int InverseMask[2] = {Mask[0] ^ 2, Mask[1] ^ 2};
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, MVT::v2i64, V2, V1, InverseMask, Zeroable, Subtarget, DAG))
return Insertion;
// We have different paths for blend lowering, but they all must use the
// *exact* same predicate.
bool IsBlendSupported = Subtarget.hasSSE41();
if (IsBlendSupported)
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v2i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v2i64, Mask, V1, V2, DAG))
return V;
// Try to use byte rotation instructions.
  // It's more profitable for pre-SSSE3 to use shuffles/unpacks.
if (Subtarget.hasSSSE3())
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v2i64, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// If we have direct support for blends, we should lower by decomposing into
// a permute. That will be faster than the domain cross.
if (IsBlendSupported)
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v2i64, V1, V2,
Mask, DAG);
// We implement this with SHUFPD which is pretty lame because it will likely
// incur 2 cycles of stall for integer vectors on Nehalem and older chips.
// However, all the alternatives are still more cycles and newer chips don't
// have this problem. It would be really nice if x86 had better shuffles here.
V1 = DAG.getBitcast(MVT::v2f64, V1);
V2 = DAG.getBitcast(MVT::v2f64, V2);
return DAG.getBitcast(MVT::v2i64,
DAG.getVectorShuffle(MVT::v2f64, DL, V1, V2, Mask));
}
/// \brief Test whether this can be lowered with a single SHUFPS instruction.
///
/// This is used to disable more specialized lowerings when the shufps lowering
/// will happen to be efficient.
static bool isSingleSHUFPSMask(ArrayRef<int> Mask) {
// This routine only handles 128-bit shufps.
assert(Mask.size() == 4 && "Unsupported mask size!");
assert(Mask[0] >= -1 && Mask[0] < 8 && "Out of bound mask element!");
assert(Mask[1] >= -1 && Mask[1] < 8 && "Out of bound mask element!");
assert(Mask[2] >= -1 && Mask[2] < 8 && "Out of bound mask element!");
assert(Mask[3] >= -1 && Mask[3] < 8 && "Out of bound mask element!");
// To lower with a single SHUFPS we need to have the low half and high half
// each requiring a single input.
if (Mask[0] >= 0 && Mask[1] >= 0 && (Mask[0] < 4) != (Mask[1] < 4))
return false;
if (Mask[2] >= 0 && Mask[3] >= 0 && (Mask[2] < 4) != (Mask[3] < 4))
return false;
return true;
}
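// Example (illustrative only): {0, 1, 4, 5} is a single-SHUFPS mask (low
// half entirely from V1, high half entirely from V2), while {0, 4, 1, 5}
// is not, because its low half mixes elements from both inputs.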
/// \brief Lower a vector shuffle using the SHUFPS instruction.
///
/// This is a helper routine dedicated to lowering vector shuffles using SHUFPS.
/// It makes no assumptions about whether this is the *best* lowering, it simply
/// uses it.
static SDValue lowerVectorShuffleWithSHUFPS(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2, SelectionDAG &DAG) {
SDValue LowV = V1, HighV = V2;
int NewMask[4] = {Mask[0], Mask[1], Mask[2], Mask[3]};
int NumV2Elements = count_if(Mask, [](int M) { return M >= 4; });
if (NumV2Elements == 1) {
int V2Index = find_if(Mask, [](int M) { return M >= 4; }) - Mask.begin();
// Compute the index adjacent to V2Index and in the same half by toggling
// the low bit.
int V2AdjIndex = V2Index ^ 1;
if (Mask[V2AdjIndex] < 0) {
// Handles all the cases where we have a single V2 element and an undef.
// This will only ever happen in the high lanes because we commute the
// vector otherwise.
if (V2Index < 2)
std::swap(LowV, HighV);
NewMask[V2Index] -= 4;
} else {
// Handle the case where the V2 element ends up adjacent to a V1 element.
// To make this work, blend them together as the first step.
int V1Index = V2AdjIndex;
int BlendMask[4] = {Mask[V2Index] - 4, 0, Mask[V1Index], 0};
V2 = DAG.getNode(X86ISD::SHUFP, DL, VT, V2, V1,
getV4X86ShuffleImm8ForMask(BlendMask, DL, DAG));
// Now proceed to reconstruct the final blend as we have the necessary
// high or low half formed.
if (V2Index < 2) {
LowV = V2;
HighV = V1;
} else {
HighV = V2;
}
NewMask[V1Index] = 2; // We put the V1 element in V2[2].
NewMask[V2Index] = 0; // We shifted the V2 element into V2[0].
}
} else if (NumV2Elements == 2) {
if (Mask[0] < 4 && Mask[1] < 4) {
// Handle the easy case where we have V1 in the low lanes and V2 in the
// high lanes.
NewMask[2] -= 4;
NewMask[3] -= 4;
} else if (Mask[2] < 4 && Mask[3] < 4) {
// We also handle the reversed case because this utility may get called
// when we detect a SHUFPS pattern but can't easily commute the shuffle to
// arrange things in the right direction.
NewMask[0] -= 4;
NewMask[1] -= 4;
HighV = V1;
LowV = V2;
} else {
// We have a mixture of V1 and V2 in both low and high lanes. Rather than
// trying to place elements directly, just blend them and set up the final
// shuffle to place them.
// The first two blend mask elements are for V1, the second two are for
// V2.
int BlendMask[4] = {Mask[0] < 4 ? Mask[0] : Mask[1],
Mask[2] < 4 ? Mask[2] : Mask[3],
(Mask[0] >= 4 ? Mask[0] : Mask[1]) - 4,
(Mask[2] >= 4 ? Mask[2] : Mask[3]) - 4};
V1 = DAG.getNode(X86ISD::SHUFP, DL, VT, V1, V2,
getV4X86ShuffleImm8ForMask(BlendMask, DL, DAG));
// Now we do a normal shuffle of V1 by giving V1 as both operands to
// a blend.
LowV = HighV = V1;
NewMask[0] = Mask[0] < 4 ? 0 : 2;
NewMask[1] = Mask[0] < 4 ? 2 : 0;
NewMask[2] = Mask[2] < 4 ? 1 : 3;
NewMask[3] = Mask[2] < 4 ? 3 : 1;
}
}
return DAG.getNode(X86ISD::SHUFP, DL, VT, LowV, HighV,
getV4X86ShuffleImm8ForMask(NewMask, DL, DAG));
}
/// \brief Lower 4-lane 32-bit floating point shuffles.
///
/// Uses instructions exclusively from the floating point unit to minimize
/// domain crossing penalties, as these are sufficient to implement all v4f32
/// shuffles.
static SDValue lowerV4F32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");
int NumV2Elements = count_if(Mask, [](int M) { return M >= 4; });
if (NumV2Elements == 0) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v4f32, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Use even/odd duplicate instructions for masks that match their pattern.
if (Subtarget.hasSSE3()) {
if (isShuffleEquivalent(V1, V2, Mask, {0, 0, 2, 2}))
return DAG.getNode(X86ISD::MOVSLDUP, DL, MVT::v4f32, V1);
if (isShuffleEquivalent(V1, V2, Mask, {1, 1, 3, 3}))
return DAG.getNode(X86ISD::MOVSHDUP, DL, MVT::v4f32, V1);
}
if (Subtarget.hasAVX()) {
// If we have AVX, we can use VPERMILPS which will allow folding a load
// into the shuffle.
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v4f32, V1,
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
}
// Otherwise, use a straight shuffle of a single input vector. We pass the
// input vector to both operands to simulate this with a SHUFPS.
return DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V1, V1,
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
}
// There are special ways we can lower some single-element blends. However, we
// have custom ways we can lower more complex single-element blends below that
  // we defer to if both this and BLENDPS fail to match, so restrict this to
  // cases where the V2 input targets element 0 of the mask -- that is the fast
// case here.
if (NumV2Elements == 1 && Mask[0] >= 4)
if (SDValue V = lowerVectorShuffleAsElementInsertion(
DL, MVT::v4f32, V1, V2, Mask, Zeroable, Subtarget, DAG))
return V;
if (Subtarget.hasSSE41()) {
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4f32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Use INSERTPS if we can complete the shuffle efficiently.
if (SDValue V =
lowerVectorShuffleAsInsertPS(DL, V1, V2, Mask, Zeroable, DAG))
return V;
if (!isSingleSHUFPSMask(Mask))
if (SDValue BlendPerm = lowerVectorShuffleAsBlendAndPermute(
DL, MVT::v4f32, V1, V2, Mask, DAG))
return BlendPerm;
}
// Use low/high mov instructions.
if (isShuffleEquivalent(V1, V2, Mask, {0, 1, 4, 5}))
return DAG.getNode(X86ISD::MOVLHPS, DL, MVT::v4f32, V1, V2);
if (isShuffleEquivalent(V1, V2, Mask, {2, 3, 6, 7}))
return DAG.getNode(X86ISD::MOVHLPS, DL, MVT::v4f32, V2, V1);
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v4f32, Mask, V1, V2, DAG))
return V;
// Otherwise fall back to a SHUFPS lowering strategy.
return lowerVectorShuffleWithSHUFPS(DL, MVT::v4f32, Mask, V1, V2, DAG);
}
/// \brief Lower 4-lane i32 vector shuffles.
///
/// We try to handle these with integer-domain shuffles where we can, but for
/// blends we use the floating point domain blend instructions.
static SDValue lowerV4I32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v4i32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v4i32 && "Bad operand type!");
assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v4i32, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
int NumV2Elements = count_if(Mask, [](int M) { return M >= 4; });
if (NumV2Elements == 0) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v4i32, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Straight shuffle of a single input vector. For everything from SSE2
// onward this has a single fast instruction with no scary immediates.
// We coerce the shuffle pattern to be compatible with UNPCK instructions
// but we aren't actually going to use the UNPCK instruction because doing
// so prevents folding a load into this instruction or making a copy.
const int UnpackLoMask[] = {0, 0, 1, 1};
const int UnpackHiMask[] = {2, 2, 3, 3};
if (isShuffleEquivalent(V1, V2, Mask, {0, 0, 1, 1}))
Mask = UnpackLoMask;
else if (isShuffleEquivalent(V1, V2, Mask, {2, 2, 3, 3}))
Mask = UnpackHiMask;
return DAG.getNode(X86ISD::PSHUFD, DL, MVT::v4i32, V1,
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v4i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// There are special ways we can lower some single-element blends.
if (NumV2Elements == 1)
if (SDValue V = lowerVectorShuffleAsElementInsertion(
DL, MVT::v4i32, V1, V2, Mask, Zeroable, Subtarget, DAG))
return V;
// We have different paths for blend lowering, but they all must use the
// *exact* same predicate.
bool IsBlendSupported = Subtarget.hasSSE41();
if (IsBlendSupported)
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
if (SDValue Masked = lowerVectorShuffleAsBitMask(DL, MVT::v4i32, V1, V2, Mask,
Zeroable, DAG))
return Masked;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v4i32, Mask, V1, V2, DAG))
return V;
// Try to use byte rotation instructions.
  // It's more profitable for pre-SSSE3 to use shuffles/unpacks.
if (Subtarget.hasSSSE3())
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v4i32, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Assume that a single SHUFPS is faster than an alternative sequence of
// multiple instructions (even if the CPU has a domain penalty).
// If some CPU is harmed by the domain switch, we can fix it in a later pass.
if (!isSingleSHUFPSMask(Mask)) {
// If we have direct support for blends, we should lower by decomposing into
// a permute and a blend. That will be faster than the domain cross.
if (IsBlendSupported)
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v4i32, V1, V2,
Mask, DAG);
// Try to lower by permuting the inputs into an unpack instruction.
if (SDValue Unpack = lowerVectorShuffleAsPermuteAndUnpack(
DL, MVT::v4i32, V1, V2, Mask, DAG))
return Unpack;
}
// We implement this with SHUFPS because it can blend from two vectors.
// Because we're going to eventually use SHUFPS, we use SHUFPS even to build
// up the inputs, bypassing domain shift penalties that we would incur if we
// directly used PSHUFD on Nehalem and older. For newer chips, this isn't
// relevant.
SDValue CastV1 = DAG.getBitcast(MVT::v4f32, V1);
SDValue CastV2 = DAG.getBitcast(MVT::v4f32, V2);
SDValue ShufPS = DAG.getVectorShuffle(MVT::v4f32, DL, CastV1, CastV2, Mask);
return DAG.getBitcast(MVT::v4i32, ShufPS);
}
/// \brief Lowering of single-input v8i16 shuffles is the cornerstone of SSE2
/// shuffle lowering, and the most complex part.
///
/// The lowering strategy is to try to form pairs of input lanes which are
/// targeted at the same half of the final vector, and then use a dword shuffle
/// to place them onto the right half, and finally unpack the paired lanes into
/// their final position.
///
/// The exact breakdown of how to form these dword pairs and align them on the
/// correct sides is really tricky. See the comments within the function for
/// more of the details.
///
/// This code also handles repeated 128-bit lanes of v8i16 shuffles, but each
/// lane must shuffle the *exact* same way. In fact, you must pass a v8 Mask to
/// this routine for it to work correctly. To shuffle a 256-bit or 512-bit i16
/// vector, form the analogous 128-bit 8-element Mask.
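///
/// As an illustrative example, the mask [0, 1, 4, 5, 2, 3, 6, 7] needs no
/// word pairing at all: each pair of inputs already shares a dword, so a
/// single PSHUFD with the dword mask [0, 2, 1, 3] places every pair and the
/// final half-word shuffles degenerate to no-ops.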
static SDValue lowerV8I16GeneralSingleInputVectorShuffle(
const SDLoc &DL, MVT VT, SDValue V, MutableArrayRef<int> Mask,
const X86Subtarget &Subtarget, SelectionDAG &DAG) {
assert(VT.getVectorElementType() == MVT::i16 && "Bad input type!");
MVT PSHUFDVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() / 2);
assert(Mask.size() == 8 && "Shuffle mask length doesn't match!");
MutableArrayRef<int> LoMask = Mask.slice(0, 4);
MutableArrayRef<int> HiMask = Mask.slice(4, 4);
SmallVector<int, 4> LoInputs;
copy_if(LoMask, std::back_inserter(LoInputs), [](int M) { return M >= 0; });
std::sort(LoInputs.begin(), LoInputs.end());
LoInputs.erase(std::unique(LoInputs.begin(), LoInputs.end()), LoInputs.end());
SmallVector<int, 4> HiInputs;
copy_if(HiMask, std::back_inserter(HiInputs), [](int M) { return M >= 0; });
std::sort(HiInputs.begin(), HiInputs.end());
HiInputs.erase(std::unique(HiInputs.begin(), HiInputs.end()), HiInputs.end());
int NumLToL =
std::lower_bound(LoInputs.begin(), LoInputs.end(), 4) - LoInputs.begin();
int NumHToL = LoInputs.size() - NumLToL;
int NumLToH =
std::lower_bound(HiInputs.begin(), HiInputs.end(), 4) - HiInputs.begin();
int NumHToH = HiInputs.size() - NumLToH;
MutableArrayRef<int> LToLInputs(LoInputs.data(), NumLToL);
MutableArrayRef<int> LToHInputs(HiInputs.data(), NumLToH);
MutableArrayRef<int> HToLInputs(LoInputs.data() + NumLToL, NumHToL);
MutableArrayRef<int> HToHInputs(HiInputs.data() + NumLToH, NumHToH);
// If we are splatting two values from one half - one to each half, then
// we can shuffle that half so each is splatted to a dword, then splat those
// to their respective halves.
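// Illustrative example: Mask = [1, 1, 1, 1, 0, 0, 0, 0] takes this path as
// PSHUFLW [1, 1, 0, 0] (giving words [1, 1, 0, 0, 4, 5, 6, 7]) followed by
// PSHUFD [0, 0, 1, 1], splatting dword 0 across the low half and dword 1
// across the high half.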
auto SplatHalfs = [&](int LoInput, int HiInput, unsigned ShufWOp,
int DOffset) {
int PSHUFHalfMask[] = {LoInput % 4, LoInput % 4, HiInput % 4, HiInput % 4};
int PSHUFDMask[] = {DOffset + 0, DOffset + 0, DOffset + 1, DOffset + 1};
V = DAG.getNode(ShufWOp, DL, VT, V,
getV4X86ShuffleImm8ForMask(PSHUFHalfMask, DL, DAG));
V = DAG.getBitcast(PSHUFDVT, V);
V = DAG.getNode(X86ISD::PSHUFD, DL, PSHUFDVT, V,
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG));
return DAG.getBitcast(VT, V);
};
if (NumLToL == 1 && NumLToH == 1 && (NumHToL + NumHToH) == 0)
return SplatHalfs(LToLInputs[0], LToHInputs[0], X86ISD::PSHUFLW, 0);
if (NumHToL == 1 && NumHToH == 1 && (NumLToL + NumLToH) == 0)
return SplatHalfs(HToLInputs[0], HToHInputs[0], X86ISD::PSHUFHW, 2);
// Simplify the 1-into-3 and 3-into-1 cases with a single pshufd. For all
// such inputs we can swap two of the dwords across the half mark and end up
// with <=2 inputs to each half from each half. Once there, we can fall
// through to the generic code below. For example:
//
// Input: [a, b, c, d, e, f, g, h] -PSHUFD[0,2,1,3]-> [a, b, e, f, c, d, g, h]
// Mask: [0, 1, 2, 7, 4, 5, 6, 3] -----------------> [0, 1, 4, 7, 2, 3, 6, 5]
//
// However, in some very rare cases we have a 1-into-3 or 3-into-1 on one half
// and an existing 2-into-2 on the other half. In this case we may have to
// pre-shuffle the 2-into-2 half to avoid turning it into a 3-into-1 or
// 1-into-3, which could cause us to cycle endlessly fixing each side in turn.
// Fortunately, we don't have to handle anything but a 2-into-2 pattern
// because any other situation (including a 3-into-1 or 1-into-3 in the half
// other than the one we target for fixing) will be fixed when we re-enter
// this path. Any resulting sequence of PSHUFD instructions will also be
// combined away into a single instruction. Here is an example of the tricky
// case:
//
// Input: [a, b, c, d, e, f, g, h] -PSHUFD[0,2,1,3]-> [a, b, e, f, c, d, g, h]
// Mask: [3, 7, 1, 0, 2, 7, 3, 5] -THIS-IS-BAD!!!!-> [5, 7, 1, 0, 4, 7, 5, 3]
//
// This now has a 1-into-3 in the high half! Instead, we do two shuffles:
//
// Input: [a, b, c, d, e, f, g, h] PSHUFHW[0,2,1,3]-> [a, b, c, d, e, g, f, h]
// Mask: [3, 7, 1, 0, 2, 7, 3, 5] -----------------> [3, 7, 1, 0, 2, 7, 3, 6]
//
// Input: [a, b, c, d, e, g, f, h] -PSHUFD[0,2,1,3]-> [a, b, e, g, c, d, f, h]
// Mask: [3, 7, 1, 0, 2, 7, 3, 6] -----------------> [5, 7, 1, 0, 4, 7, 5, 6]
//
// The result is fine to be handled by the generic logic.
auto balanceSides = [&](ArrayRef<int> AToAInputs, ArrayRef<int> BToAInputs,
ArrayRef<int> BToBInputs, ArrayRef<int> AToBInputs,
int AOffset, int BOffset) {
assert((AToAInputs.size() == 3 || AToAInputs.size() == 1) &&
"Must call this with A having 3 or 1 inputs from the A half.");
assert((BToAInputs.size() == 1 || BToAInputs.size() == 3) &&
"Must call this with B having 1 or 3 inputs from the B half.");
assert(AToAInputs.size() + BToAInputs.size() == 4 &&
"Must call this with either 3:1 or 1:3 inputs (summing to 4).");
bool ThreeAInputs = AToAInputs.size() == 3;
// Compute the index of the dword with only one word among the three inputs
// in a half by taking the sum of the half with three inputs and subtracting
// the sum of the actual three inputs. The difference is the remaining
// slot.
int ADWord, BDWord;
int &TripleDWord = ThreeAInputs ? ADWord : BDWord;
int &OneInputDWord = ThreeAInputs ? BDWord : ADWord;
int TripleInputOffset = ThreeAInputs ? AOffset : BOffset;
ArrayRef<int> TripleInputs = ThreeAInputs ? AToAInputs : BToAInputs;
int OneInput = ThreeAInputs ? BToAInputs[0] : AToAInputs[0];
int TripleInputSum = 0 + 1 + 2 + 3 + (4 * TripleInputOffset);
int TripleNonInputIdx =
TripleInputSum - std::accumulate(TripleInputs.begin(), TripleInputs.end(), 0);
TripleDWord = TripleNonInputIdx / 2;
// We use xor with one to compute the adjacent DWord to whichever one the
// OneInput is in.
OneInputDWord = (OneInput / 2) ^ 1;
// Check for one tricky case: We're fixing a 3<-1 or a 1<-3 shuffle for AToA
// and BToA inputs. If there is also such a problem with the BToB and AToB
// inputs, we don't try to fix it necessarily -- we'll recurse and see it in
// the next pass. However, if we have a 2<-2 in the BToB and AToB inputs, it
// is essential that we don't *create* a 3<-1 as then we might oscillate.
if (BToBInputs.size() == 2 && AToBInputs.size() == 2) {
// Compute how many inputs will be flipped by swapping these DWords. We
// need to balance this to ensure we don't form a 3-1 shuffle in the
// other half.
int NumFlippedAToBInputs =
std::count(AToBInputs.begin(), AToBInputs.end(), 2 * ADWord) +
std::count(AToBInputs.begin(), AToBInputs.end(), 2 * ADWord + 1);
int NumFlippedBToBInputs =
std::count(BToBInputs.begin(), BToBInputs.end(), 2 * BDWord) +
std::count(BToBInputs.begin(), BToBInputs.end(), 2 * BDWord + 1);
if ((NumFlippedAToBInputs == 1 &&
(NumFlippedBToBInputs == 0 || NumFlippedBToBInputs == 2)) ||
(NumFlippedBToBInputs == 1 &&
(NumFlippedAToBInputs == 0 || NumFlippedAToBInputs == 2))) {
// We choose whether to fix the A half or B half based on whether that
// half has zero flipped inputs. At zero, we may not be able to fix it
// with that half. We also bias towards fixing the B half because that
// will more commonly be the high half, and we have to bias one way.
auto FixFlippedInputs = [&V, &DL, &Mask, &DAG](int PinnedIdx, int DWord,
ArrayRef<int> Inputs) {
int FixIdx = PinnedIdx ^ 1; // The adjacent slot to the pinned slot.
bool IsFixIdxInput = is_contained(Inputs, PinnedIdx ^ 1);
// Determine whether the free index is in the flipped dword or the
// unflipped dword based on where the pinned index is. We use this bit
// in an xor to conditionally select the adjacent dword.
int FixFreeIdx = 2 * (DWord ^ (PinnedIdx / 2 == DWord));
bool IsFixFreeIdxInput = is_contained(Inputs, FixFreeIdx);
if (IsFixIdxInput == IsFixFreeIdxInput)
FixFreeIdx += 1;
IsFixFreeIdxInput = is_contained(Inputs, FixFreeIdx);
assert(IsFixIdxInput != IsFixFreeIdxInput &&
"We need to be changing the number of flipped inputs!");
int PSHUFHalfMask[] = {0, 1, 2, 3};
std::swap(PSHUFHalfMask[FixFreeIdx % 4], PSHUFHalfMask[FixIdx % 4]);
V = DAG.getNode(
FixIdx < 4 ? X86ISD::PSHUFLW : X86ISD::PSHUFHW, DL,
MVT::getVectorVT(MVT::i16, V.getValueSizeInBits() / 16), V,
getV4X86ShuffleImm8ForMask(PSHUFHalfMask, DL, DAG));
for (int &M : Mask)
if (M >= 0 && M == FixIdx)
M = FixFreeIdx;
else if (M >= 0 && M == FixFreeIdx)
M = FixIdx;
};
if (NumFlippedBToBInputs != 0) {
int BPinnedIdx =
BToAInputs.size() == 3 ? TripleNonInputIdx : OneInput;
FixFlippedInputs(BPinnedIdx, BDWord, BToBInputs);
} else {
assert(NumFlippedAToBInputs != 0 && "Impossible given predicates!");
int APinnedIdx = ThreeAInputs ? TripleNonInputIdx : OneInput;
FixFlippedInputs(APinnedIdx, ADWord, AToBInputs);
}
}
}
int PSHUFDMask[] = {0, 1, 2, 3};
PSHUFDMask[ADWord] = BDWord;
PSHUFDMask[BDWord] = ADWord;
V = DAG.getBitcast(
VT,
DAG.getNode(X86ISD::PSHUFD, DL, PSHUFDVT, DAG.getBitcast(PSHUFDVT, V),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
// Adjust the mask to match the new locations of A and B.
for (int &M : Mask)
if (M >= 0 && M/2 == ADWord)
M = 2 * BDWord + M % 2;
else if (M >= 0 && M/2 == BDWord)
M = 2 * ADWord + M % 2;
// Recurse back into this routine to re-compute state now that this isn't
// a 3 and 1 problem.
return lowerV8I16GeneralSingleInputVectorShuffle(DL, VT, V, Mask, Subtarget,
DAG);
};
if ((NumLToL == 3 && NumHToL == 1) || (NumLToL == 1 && NumHToL == 3))
return balanceSides(LToLInputs, HToLInputs, HToHInputs, LToHInputs, 0, 4);
if ((NumHToH == 3 && NumLToH == 1) || (NumHToH == 1 && NumLToH == 3))
return balanceSides(HToHInputs, LToHInputs, LToLInputs, HToLInputs, 4, 0);
// At this point there are at most two inputs to the low and high halves from
// each half. That means the inputs can always be grouped into dwords and
// those dwords can then be moved to the correct half with a dword shuffle.
// We use at most one low and one high word shuffle to collect these paired
// inputs into dwords, and finally a dword shuffle to place them.
int PSHUFLMask[4] = {-1, -1, -1, -1};
int PSHUFHMask[4] = {-1, -1, -1, -1};
int PSHUFDMask[4] = {-1, -1, -1, -1};
// First fix the masks for all the inputs that are staying in their
// original halves. This will then dictate the targets of the cross-half
// shuffles.
auto fixInPlaceInputs =
[&PSHUFDMask](ArrayRef<int> InPlaceInputs, ArrayRef<int> IncomingInputs,
MutableArrayRef<int> SourceHalfMask,
MutableArrayRef<int> HalfMask, int HalfOffset) {
if (InPlaceInputs.empty())
return;
if (InPlaceInputs.size() == 1) {
SourceHalfMask[InPlaceInputs[0] - HalfOffset] =
InPlaceInputs[0] - HalfOffset;
PSHUFDMask[InPlaceInputs[0] / 2] = InPlaceInputs[0] / 2;
return;
}
if (IncomingInputs.empty()) {
// Just fix all of the in place inputs.
for (int Input : InPlaceInputs) {
SourceHalfMask[Input - HalfOffset] = Input - HalfOffset;
PSHUFDMask[Input / 2] = Input / 2;
}
return;
}
assert(InPlaceInputs.size() == 2 && "Cannot handle 3 or 4 inputs!");
SourceHalfMask[InPlaceInputs[0] - HalfOffset] =
InPlaceInputs[0] - HalfOffset;
// Put the second input next to the first so that they are packed into
// a dword. We find the adjacent index by toggling the low bit.
int AdjIndex = InPlaceInputs[0] ^ 1;
SourceHalfMask[AdjIndex - HalfOffset] = InPlaceInputs[1] - HalfOffset;
std::replace(HalfMask.begin(), HalfMask.end(), InPlaceInputs[1], AdjIndex);
PSHUFDMask[AdjIndex / 2] = AdjIndex / 2;
};
fixInPlaceInputs(LToLInputs, HToLInputs, PSHUFLMask, LoMask, 0);
fixInPlaceInputs(HToHInputs, LToHInputs, PSHUFHMask, HiMask, 4);
// Now gather the cross-half inputs and place them into a free dword of
// their target half.
// FIXME: This operation could almost certainly be simplified dramatically to
// look more like the 3-1 fixing operation.
auto moveInputsToRightHalf = [&PSHUFDMask](
MutableArrayRef<int> IncomingInputs, ArrayRef<int> ExistingInputs,
MutableArrayRef<int> SourceHalfMask, MutableArrayRef<int> HalfMask,
MutableArrayRef<int> FinalSourceHalfMask, int SourceOffset,
int DestOffset) {
auto isWordClobbered = [](ArrayRef<int> SourceHalfMask, int Word) {
return SourceHalfMask[Word] >= 0 && SourceHalfMask[Word] != Word;
};
auto isDWordClobbered = [&isWordClobbered](ArrayRef<int> SourceHalfMask,
int Word) {
int LowWord = Word & ~1;
int HighWord = Word | 1;
return isWordClobbered(SourceHalfMask, LowWord) ||
isWordClobbered(SourceHalfMask, HighWord);
};
if (IncomingInputs.empty())
return;
if (ExistingInputs.empty()) {
// Map any dwords with inputs from them into the right half.
for (int Input : IncomingInputs) {
// If the source half mask maps over the inputs, turn those into
// swaps and use the swapped lane.
if (isWordClobbered(SourceHalfMask, Input - SourceOffset)) {
if (SourceHalfMask[SourceHalfMask[Input - SourceOffset]] < 0) {
SourceHalfMask[SourceHalfMask[Input - SourceOffset]] =
Input - SourceOffset;
// We have to swap the uses in our half mask in one sweep.
for (int &M : HalfMask)
if (M == SourceHalfMask[Input - SourceOffset] + SourceOffset)
M = Input;
else if (M == Input)
M = SourceHalfMask[Input - SourceOffset] + SourceOffset;
} else {
assert(SourceHalfMask[SourceHalfMask[Input - SourceOffset]] ==
Input - SourceOffset &&
"Previous placement doesn't match!");
}
// Note that this correctly re-maps both when we do a swap and when
// we observe the other side of the swap above. We rely on that to
// avoid swapping the members of the input list directly.
Input = SourceHalfMask[Input - SourceOffset] + SourceOffset;
}
// Map the input's dword into the correct half.
if (PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] < 0)
PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] = Input / 2;
else
assert(PSHUFDMask[(Input - SourceOffset + DestOffset) / 2] ==
Input / 2 &&
"Previous placement doesn't match!");
}
// And just directly shift any other-half mask elements to be same-half
// as we will have mirrored the dword containing the element into the
// same position within that half.
for (int &M : HalfMask)
if (M >= SourceOffset && M < SourceOffset + 4) {
M = M - SourceOffset + DestOffset;
assert(M >= 0 && "This should never wrap below zero!");
}
return;
}
// Ensure we have the input in a viable dword of its current half. This
// is particularly tricky because the original position may be clobbered
// by inputs being moved and *staying* in that half.
if (IncomingInputs.size() == 1) {
if (isWordClobbered(SourceHalfMask, IncomingInputs[0] - SourceOffset)) {
int InputFixed = find(SourceHalfMask, -1) - std::begin(SourceHalfMask) +
SourceOffset;
SourceHalfMask[InputFixed - SourceOffset] =
IncomingInputs[0] - SourceOffset;
std::replace(HalfMask.begin(), HalfMask.end(), IncomingInputs[0],
InputFixed);
IncomingInputs[0] = InputFixed;
}
} else if (IncomingInputs.size() == 2) {
if (IncomingInputs[0] / 2 != IncomingInputs[1] / 2 ||
isDWordClobbered(SourceHalfMask, IncomingInputs[0] - SourceOffset)) {
// We have two non-adjacent or clobbered inputs we need to extract from
// the source half. To do this, we need to map them into some adjacent
// dword slot in the source mask.
int InputsFixed[2] = {IncomingInputs[0] - SourceOffset,
IncomingInputs[1] - SourceOffset};
// If there is a free slot in the source half mask adjacent to one of
// the inputs, place the other input in it. We use (Index XOR 1) to
// compute an adjacent index.
if (!isWordClobbered(SourceHalfMask, InputsFixed[0]) &&
SourceHalfMask[InputsFixed[0] ^ 1] < 0) {
SourceHalfMask[InputsFixed[0]] = InputsFixed[0];
SourceHalfMask[InputsFixed[0] ^ 1] = InputsFixed[1];
InputsFixed[1] = InputsFixed[0] ^ 1;
} else if (!isWordClobbered(SourceHalfMask, InputsFixed[1]) &&
SourceHalfMask[InputsFixed[1] ^ 1] < 0) {
SourceHalfMask[InputsFixed[1]] = InputsFixed[1];
SourceHalfMask[InputsFixed[1] ^ 1] = InputsFixed[0];
InputsFixed[0] = InputsFixed[1] ^ 1;
} else if (SourceHalfMask[2 * ((InputsFixed[0] / 2) ^ 1)] < 0 &&
SourceHalfMask[2 * ((InputsFixed[0] / 2) ^ 1) + 1] < 0) {
// The two inputs are in the same DWord but it is clobbered and the
// adjacent DWord isn't used at all. Move both inputs to the free
// slot.
SourceHalfMask[2 * ((InputsFixed[0] / 2) ^ 1)] = InputsFixed[0];
SourceHalfMask[2 * ((InputsFixed[0] / 2) ^ 1) + 1] = InputsFixed[1];
InputsFixed[0] = 2 * ((InputsFixed[0] / 2) ^ 1);
InputsFixed[1] = 2 * ((InputsFixed[0] / 2) ^ 1) + 1;
} else {
// The only way we hit this point is if there is no clobbering
// (because there are no off-half inputs to this half) and there is no
// free slot adjacent to one of the inputs. In this case, we have to
// swap an input with a non-input.
for (int i = 0; i < 4; ++i)
assert((SourceHalfMask[i] < 0 || SourceHalfMask[i] == i) &&
"We can't handle any clobbers here!");
assert(InputsFixed[1] != (InputsFixed[0] ^ 1) &&
"Cannot have adjacent inputs here!");
SourceHalfMask[InputsFixed[0] ^ 1] = InputsFixed[1];
SourceHalfMask[InputsFixed[1]] = InputsFixed[0] ^ 1;
// We also have to update the final source mask in this case because
// it may need to undo the above swap.
for (int &M : FinalSourceHalfMask)
if (M == (InputsFixed[0] ^ 1) + SourceOffset)
M = InputsFixed[1] + SourceOffset;
else if (M == InputsFixed[1] + SourceOffset)
M = (InputsFixed[0] ^ 1) + SourceOffset;
InputsFixed[1] = InputsFixed[0] ^ 1;
}
// Point everything at the fixed inputs.
for (int &M : HalfMask)
if (M == IncomingInputs[0])
M = InputsFixed[0] + SourceOffset;
else if (M == IncomingInputs[1])
M = InputsFixed[1] + SourceOffset;
IncomingInputs[0] = InputsFixed[0] + SourceOffset;
IncomingInputs[1] = InputsFixed[1] + SourceOffset;
}
} else {
llvm_unreachable("Unhandled input size!");
}
// Now hoist the DWord down to the right half.
int FreeDWord = (PSHUFDMask[DestOffset / 2] < 0 ? 0 : 1) + DestOffset / 2;
assert(PSHUFDMask[FreeDWord] < 0 && "DWord not free");
PSHUFDMask[FreeDWord] = IncomingInputs[0] / 2;
for (int &M : HalfMask)
for (int Input : IncomingInputs)
if (M == Input)
M = FreeDWord * 2 + Input % 2;
};
moveInputsToRightHalf(HToLInputs, LToLInputs, PSHUFHMask, LoMask, HiMask,
/*SourceOffset*/ 4, /*DestOffset*/ 0);
moveInputsToRightHalf(LToHInputs, HToHInputs, PSHUFLMask, HiMask, LoMask,
/*SourceOffset*/ 0, /*DestOffset*/ 4);
// Now enact all the shuffles we've computed to move the inputs into their
// target half.
if (!isNoopShuffleMask(PSHUFLMask))
V = DAG.getNode(X86ISD::PSHUFLW, DL, VT, V,
getV4X86ShuffleImm8ForMask(PSHUFLMask, DL, DAG));
if (!isNoopShuffleMask(PSHUFHMask))
V = DAG.getNode(X86ISD::PSHUFHW, DL, VT, V,
getV4X86ShuffleImm8ForMask(PSHUFHMask, DL, DAG));
if (!isNoopShuffleMask(PSHUFDMask))
V = DAG.getBitcast(
VT,
DAG.getNode(X86ISD::PSHUFD, DL, PSHUFDVT, DAG.getBitcast(PSHUFDVT, V),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
// At this point, each half should contain all its inputs, and we can then
// just shuffle them into their final position.
assert(count_if(LoMask, [](int M) { return M >= 4; }) == 0 &&
"Failed to lift all the high half inputs to the low mask!");
assert(count_if(HiMask, [](int M) { return M >= 0 && M < 4; }) == 0 &&
"Failed to lift all the low half inputs to the high mask!");
// Do a half shuffle for the low mask.
if (!isNoopShuffleMask(LoMask))
V = DAG.getNode(X86ISD::PSHUFLW, DL, VT, V,
getV4X86ShuffleImm8ForMask(LoMask, DL, DAG));
// Do a half shuffle with the high mask after shifting its values down.
for (int &M : HiMask)
if (M >= 0)
M -= 4;
if (!isNoopShuffleMask(HiMask))
V = DAG.getNode(X86ISD::PSHUFHW, DL, VT, V,
getV4X86ShuffleImm8ForMask(HiMask, DL, DAG));
return V;
}
/// Helper to form a PSHUFB-based shuffle+blend, opportunistically avoiding the
/// blend if only one input is used.
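///
/// A sketch of the byte-index math below (illustrative): for a v8i16 shuffle,
/// Size == 8 and Scale == 2, so mask element M expands to the byte pair
/// {2*M, 2*M+1} in whichever source it selects, while the other source's
/// PSHUFB control receives 0x80 (PSHUFB's "zero this byte" value) in those
/// two slots.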
static SDValue lowerVectorShuffleAsBlendOfPSHUFBs(
const SDLoc &DL, MVT VT, SDValue V1, SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable, SelectionDAG &DAG, bool &V1InUse,
bool &V2InUse) {
SDValue V1Mask[16];
SDValue V2Mask[16];
V1InUse = false;
V2InUse = false;
int Size = Mask.size();
int Scale = 16 / Size;
for (int i = 0; i < 16; ++i) {
if (Mask[i / Scale] < 0) {
V1Mask[i] = V2Mask[i] = DAG.getUNDEF(MVT::i8);
} else {
const int ZeroMask = 0x80;
int V1Idx = Mask[i / Scale] < Size ? Mask[i / Scale] * Scale + i % Scale
: ZeroMask;
int V2Idx = Mask[i / Scale] < Size
? ZeroMask
: (Mask[i / Scale] - Size) * Scale + i % Scale;
if (Zeroable[i / Scale])
V1Idx = V2Idx = ZeroMask;
V1Mask[i] = DAG.getConstant(V1Idx, DL, MVT::i8);
V2Mask[i] = DAG.getConstant(V2Idx, DL, MVT::i8);
V1InUse |= (ZeroMask != V1Idx);
V2InUse |= (ZeroMask != V2Idx);
}
}
if (V1InUse)
V1 = DAG.getNode(X86ISD::PSHUFB, DL, MVT::v16i8,
DAG.getBitcast(MVT::v16i8, V1),
DAG.getBuildVector(MVT::v16i8, DL, V1Mask));
if (V2InUse)
V2 = DAG.getNode(X86ISD::PSHUFB, DL, MVT::v16i8,
DAG.getBitcast(MVT::v16i8, V2),
DAG.getBuildVector(MVT::v16i8, DL, V2Mask));
// If we need shuffled inputs from both, blend the two.
SDValue V;
if (V1InUse && V2InUse)
V = DAG.getNode(ISD::OR, DL, MVT::v16i8, V1, V2);
else
V = V1InUse ? V1 : V2;
// Cast the result back to the correct type.
return DAG.getBitcast(VT, V);
}
/// \brief Generic lowering of 8-lane i16 shuffles.
///
/// This handles both single-input shuffles and combined shuffle/blends with
/// two inputs. The single input shuffles are immediately delegated to
/// a dedicated lowering routine.
///
/// The blends are lowered in one of three fundamental ways. If there are few
/// enough inputs, it delegates to a basic UNPCK-based strategy. If the shuffle
/// of the input is significantly cheaper when lowered as an interleaving of
/// the two inputs, try to interleave them. Otherwise, blend the low and high
/// halves of the inputs separately (making them have relatively few inputs)
/// and then concatenate them.
static SDValue lowerV8I16VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v8i16 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v8i16 && "Bad operand type!");
assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v8i16, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
int NumV2Inputs = count_if(Mask, [](int M) { return M >= 8; });
if (NumV2Inputs == 0) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v8i16, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v8i16, V1, V1, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v8i16, Mask, V1, V2, DAG))
return V;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(DL, MVT::v8i16, V1, V1,
Mask, Subtarget, DAG))
return Rotate;
// Make a copy of the mask so it can be modified.
SmallVector<int, 8> MutableMask(Mask.begin(), Mask.end());
return lowerV8I16GeneralSingleInputVectorShuffle(DL, MVT::v8i16, V1,
MutableMask, Subtarget,
DAG);
}
assert(llvm::any_of(Mask, [](int M) { return M >= 0 && M < 8; }) &&
"All single-input shuffles should be canonicalized to be V1-input "
"shuffles.");
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v8i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// See if we can use SSE4A Extraction / Insertion.
if (Subtarget.hasSSE4A())
if (SDValue V = lowerVectorShuffleWithSSE4A(DL, MVT::v8i16, V1, V2, Mask,
Zeroable, DAG))
return V;
// There are special ways we can lower some single-element blends.
if (NumV2Inputs == 1)
if (SDValue V = lowerVectorShuffleAsElementInsertion(
DL, MVT::v8i16, V1, V2, Mask, Zeroable, Subtarget, DAG))
return V;
// We have different paths for blend lowering, but they all must use the
// *exact* same predicate.
bool IsBlendSupported = Subtarget.hasSSE41();
if (IsBlendSupported)
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
if (SDValue Masked = lowerVectorShuffleAsBitMask(DL, MVT::v8i16, V1, V2, Mask,
Zeroable, DAG))
return Masked;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v8i16, Mask, V1, V2, DAG))
return V;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v8i16, V1, V2, Mask, Subtarget, DAG))
return Rotate;
if (SDValue BitBlend =
lowerVectorShuffleAsBitBlend(DL, MVT::v8i16, V1, V2, Mask, DAG))
return BitBlend;
// Try to lower by permuting the inputs into an unpack instruction.
if (SDValue Unpack = lowerVectorShuffleAsPermuteAndUnpack(DL, MVT::v8i16, V1,
V2, Mask, DAG))
return Unpack;
// If we can't directly blend but can use PSHUFB, that will be better as it
// can both shuffle and set up the inefficient blend.
if (!IsBlendSupported && Subtarget.hasSSSE3()) {
bool V1InUse, V2InUse;
return lowerVectorShuffleAsBlendOfPSHUFBs(DL, MVT::v8i16, V1, V2, Mask,
Zeroable, DAG, V1InUse, V2InUse);
}
// We can always bit-blend if we have to so the fallback strategy is to
// decompose into single-input permutes and blends.
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v8i16, V1, V2,
Mask, DAG);
}
/// \brief Check whether a compaction lowering can be done by dropping even
/// elements and compute how many times even elements must be dropped.
///
/// This handles shuffles which take every Nth element where N is a power of
/// two. Example shuffle masks:
///
/// N = 1: 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14
/// N = 1: 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
/// N = 2: 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12
/// N = 2: 0, 4, 8, 12, 16, 20, 24, 28, 0, 4, 8, 12, 16, 20, 24, 28
/// N = 3: 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8
/// N = 3: 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24, 0, 8, 16, 24
///
/// Any of these lanes can of course be undef.
///
/// This routine only supports N <= 3.
/// FIXME: Evaluate whether either AVX or AVX-512 has any opportunities here
/// for larger N.
///
/// \returns N above, or the number of times even elements must be dropped if
/// there is such a number. Otherwise returns zero.
static int canLowerByDroppingEvenElements(ArrayRef<int> Mask,
bool IsSingleInput) {
// The modulus for the shuffle vector entries is based on whether this is
// a single input or not.
int ShuffleModulus = Mask.size() * (IsSingleInput ? 1 : 2);
assert(isPowerOf2_32((uint32_t)ShuffleModulus) &&
"We should only be called with masks with a power-of-2 size!");
uint64_t ModMask = (uint64_t)ShuffleModulus - 1;
// We track whether the input is viable for all power-of-2 strides 2^1, 2^2,
// and 2^3 simultaneously. This is because we may have ambiguity with
// partially undef inputs.
bool ViableForN[3] = {true, true, true};
for (int i = 0, e = Mask.size(); i < e; ++i) {
// Ignore undef lanes, we'll optimistically collapse them to the pattern we
// want.
if (Mask[i] < 0)
continue;
bool IsAnyViable = false;
for (unsigned j = 0; j != array_lengthof(ViableForN); ++j)
if (ViableForN[j]) {
uint64_t N = j + 1;
// The shuffle mask must be equal to (i * 2^N) % M.
if ((uint64_t)Mask[i] == (((uint64_t)i << N) & ModMask))
IsAnyViable = true;
else
ViableForN[j] = false;
}
// Early exit if we exhaust the possible powers of two.
if (!IsAnyViable)
break;
}
for (unsigned j = 0; j != array_lengthof(ViableForN); ++j)
if (ViableForN[j])
return j + 1;
// Return 0 as there is no viable power of two.
return 0;
}
/// \brief Generic lowering of v16i8 shuffles.
///
/// This is a hybrid strategy to lower v16i8 vectors. It first attempts to
/// detect any complexity reducing interleaving. If that doesn't help, it uses
/// UNPCK to spread the i8 elements across two i16-element vectors, and uses
/// the existing lowering for v8i16 blends on each half, finally PACK-ing them
/// back together.
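///
/// As an illustrative sketch of that fallback: unpacking V1 against a zero
/// vector zero-extends its low eight bytes into one v8i16 (VLoHalf) and its
/// high eight into another (VHiHalf); the two are shuffled as i16s and a
/// final PACKUS packs them back into a v16i8.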
static SDValue lowerV16I8VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v16i8 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v16i8 && "Bad operand type!");
assert(Mask.size() == 16 && "Unexpected mask size for v16 shuffle!");
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v16i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v16i8, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Try to use a zext lowering.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v16i8, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// See if we can use SSE4A Extraction / Insertion.
if (Subtarget.hasSSE4A())
if (SDValue V = lowerVectorShuffleWithSSE4A(DL, MVT::v16i8, V1, V2, Mask,
Zeroable, DAG))
return V;
int NumV2Elements = count_if(Mask, [](int M) { return M >= 16; });
// For single-input shuffles, there are some nicer lowering tricks we can use.
if (NumV2Elements == 0) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v16i8, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Check whether we can widen this to an i16 shuffle by duplicating bytes.
// Notably, this handles splat and partial-splat shuffles more efficiently.
// However, it only makes sense if the pre-duplication shuffle simplifies
// things significantly. Currently, this means we need to be able to
// express the pre-duplication shuffle as an i16 shuffle.
//
// FIXME: We should check for other patterns which can be widened into an
// i16 shuffle as well.
auto canWidenViaDuplication = [](ArrayRef<int> Mask) {
for (int i = 0; i < 16; i += 2)
if (Mask[i] >= 0 && Mask[i + 1] >= 0 && Mask[i] != Mask[i + 1])
return false;
return true;
};
auto tryToWidenViaDuplication = [&]() -> SDValue {
if (!canWidenViaDuplication(Mask))
return SDValue();
SmallVector<int, 4> LoInputs;
copy_if(Mask, std::back_inserter(LoInputs),
[](int M) { return M >= 0 && M < 8; });
std::sort(LoInputs.begin(), LoInputs.end());
LoInputs.erase(std::unique(LoInputs.begin(), LoInputs.end()),
LoInputs.end());
SmallVector<int, 4> HiInputs;
copy_if(Mask, std::back_inserter(HiInputs), [](int M) { return M >= 8; });
std::sort(HiInputs.begin(), HiInputs.end());
HiInputs.erase(std::unique(HiInputs.begin(), HiInputs.end()),
HiInputs.end());
bool TargetLo = LoInputs.size() >= HiInputs.size();
ArrayRef<int> InPlaceInputs = TargetLo ? LoInputs : HiInputs;
ArrayRef<int> MovingInputs = TargetLo ? HiInputs : LoInputs;
int PreDupI16Shuffle[] = {-1, -1, -1, -1, -1, -1, -1, -1};
SmallDenseMap<int, int, 8> LaneMap;
for (int I : InPlaceInputs) {
PreDupI16Shuffle[I/2] = I/2;
LaneMap[I] = I;
}
int j = TargetLo ? 0 : 4, je = j + 4;
for (int i = 0, ie = MovingInputs.size(); i < ie; ++i) {
// Check if j is already a shuffle of this input. This happens when
// there are two adjacent bytes after we move the low one.
if (PreDupI16Shuffle[j] != MovingInputs[i] / 2) {
// If we haven't yet mapped the input, search for a slot into which
// we can map it.
while (j < je && PreDupI16Shuffle[j] >= 0)
++j;
if (j == je)
// We can't place the inputs into a single half with a simple i16
// shuffle, so bail.
return SDValue();
// Map this input with the i16 shuffle.
PreDupI16Shuffle[j] = MovingInputs[i] / 2;
}
// Update the lane map based on the mapping we ended up with.
LaneMap[MovingInputs[i]] = 2 * j + MovingInputs[i] % 2;
}
V1 = DAG.getBitcast(
MVT::v16i8,
DAG.getVectorShuffle(MVT::v8i16, DL, DAG.getBitcast(MVT::v8i16, V1),
DAG.getUNDEF(MVT::v8i16), PreDupI16Shuffle));
// Unpack the bytes to form the i16s that will be shuffled into place.
V1 = DAG.getNode(TargetLo ? X86ISD::UNPCKL : X86ISD::UNPCKH, DL,
MVT::v16i8, V1, V1);
int PostDupI16Shuffle[8] = {-1, -1, -1, -1, -1, -1, -1, -1};
for (int i = 0; i < 16; ++i)
if (Mask[i] >= 0) {
int MappedMask = LaneMap[Mask[i]] - (TargetLo ? 0 : 8);
assert(MappedMask < 8 && "Invalid v8 shuffle mask!");
if (PostDupI16Shuffle[i / 2] < 0)
PostDupI16Shuffle[i / 2] = MappedMask;
else
assert(PostDupI16Shuffle[i / 2] == MappedMask &&
"Conflicting entries in the original shuffle!");
}
return DAG.getBitcast(
MVT::v16i8,
DAG.getVectorShuffle(MVT::v8i16, DL, DAG.getBitcast(MVT::v8i16, V1),
DAG.getUNDEF(MVT::v8i16), PostDupI16Shuffle));
};
if (SDValue V = tryToWidenViaDuplication())
return V;
}
if (SDValue Masked = lowerVectorShuffleAsBitMask(DL, MVT::v16i8, V1, V2, Mask,
Zeroable, DAG))
return Masked;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v16i8, Mask, V1, V2, DAG))
return V;
// Check for SSSE3 which lets us lower all v16i8 shuffles much more directly
// with PSHUFB. It is important to do this before we attempt to generate any
// blends but after all of the single-input lowerings. If the single input
// lowerings can find an instruction sequence that is faster than a PSHUFB, we
// want to preserve that and we can DAG combine any longer sequences into
// a PSHUFB in the end. But once we start blending from multiple inputs,
// the complexity of DAG combining bad patterns back into PSHUFB is too high,
// and there are *very* few patterns that would actually be faster than the
// PSHUFB approach because of its ability to zero lanes.
//
// FIXME: The only exceptions to the above are blends which are exact
// interleavings with direct instructions supporting them. We currently don't
// handle those well here.
if (Subtarget.hasSSSE3()) {
bool V1InUse = false;
bool V2InUse = false;
SDValue PSHUFB = lowerVectorShuffleAsBlendOfPSHUFBs(
DL, MVT::v16i8, V1, V2, Mask, Zeroable, DAG, V1InUse, V2InUse);
// If both V1 and V2 are in use and we can use a direct blend or an unpack,
// do so. This avoids using them to handle blends-with-zero which is
// important as a single pshufb is significantly faster for that.
if (V1InUse && V2InUse) {
if (Subtarget.hasSSE41())
if (SDValue Blend = lowerVectorShuffleAsBlend(
DL, MVT::v16i8, V1, V2, Mask, Zeroable, Subtarget, DAG))
return Blend;
// We can use an unpack to do the blending rather than an or in some
// cases. Even though the or may be (very slightly) more efficient, we
// prefer this lowering because there are common cases where part of
// the complexity of the shuffles goes away when we do the final blend as
// an unpack.
// FIXME: It might be worth trying to detect if the unpack-feeding
// shuffles will both be pshufb, in which case we shouldn't bother with
// this.
if (SDValue Unpack = lowerVectorShuffleAsPermuteAndUnpack(
DL, MVT::v16i8, V1, V2, Mask, DAG))
return Unpack;
}
return PSHUFB;
}
// There are special ways we can lower some single-element blends.
if (NumV2Elements == 1)
if (SDValue V = lowerVectorShuffleAsElementInsertion(
DL, MVT::v16i8, V1, V2, Mask, Zeroable, Subtarget, DAG))
return V;
if (SDValue BitBlend =
lowerVectorShuffleAsBitBlend(DL, MVT::v16i8, V1, V2, Mask, DAG))
return BitBlend;
// Check whether a compaction lowering can be done. This handles shuffles
// which take every Nth element for some even N. See the helper function for
// details.
//
// We special case these as they can be particularly efficiently handled with
// the PACKUSWB instruction on x86 and they show up in common patterns of
// rearranging bytes to truncate wide elements.
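// For example (illustrative, single input): Mask = <0, 2, 4, ..., 14, 0, 2,
// ..., 14> gives NumEvenDrops == 1; the 0x00FF word mask below clears the
// odd bytes and a single PACKUS of the masked vector with itself produces
// the result.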
bool IsSingleInput = V2.isUndef();
if (int NumEvenDrops = canLowerByDroppingEvenElements(Mask, IsSingleInput)) {
// NumEvenDrops is the power of two stride of the elements. Another way of
// thinking about it is that we need to drop the even elements this many
// times to get the original input.
// First we need to zero all the dropped bytes.
assert(NumEvenDrops <= 3 &&
"No support for dropping even elements more than 3 times.");
// We use the mask type to pick which bytes are preserved based on how many
// elements are dropped.
MVT MaskVTs[] = { MVT::v8i16, MVT::v4i32, MVT::v2i64 };
SDValue ByteClearMask = DAG.getBitcast(
MVT::v16i8, DAG.getConstant(0xFF, DL, MaskVTs[NumEvenDrops - 1]));
V1 = DAG.getNode(ISD::AND, DL, MVT::v16i8, V1, ByteClearMask);
if (!IsSingleInput)
V2 = DAG.getNode(ISD::AND, DL, MVT::v16i8, V2, ByteClearMask);
// Now pack things back together.
V1 = DAG.getBitcast(MVT::v8i16, V1);
V2 = IsSingleInput ? V1 : DAG.getBitcast(MVT::v8i16, V2);
SDValue Result = DAG.getNode(X86ISD::PACKUS, DL, MVT::v16i8, V1, V2);
for (int i = 1; i < NumEvenDrops; ++i) {
Result = DAG.getBitcast(MVT::v8i16, Result);
Result = DAG.getNode(X86ISD::PACKUS, DL, MVT::v16i8, Result, Result);
}
return Result;
}
// Handle multi-input cases by blending single-input shuffles.
if (NumV2Elements > 0)
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v16i8, V1, V2,
Mask, DAG);
// The fallback path for single-input shuffles widens this into two v8i16
// vectors with unpacks, shuffles those, and then pulls them back together
// with a pack.
SDValue V = V1;
std::array<int, 8> LoBlendMask = {{-1, -1, -1, -1, -1, -1, -1, -1}};
std::array<int, 8> HiBlendMask = {{-1, -1, -1, -1, -1, -1, -1, -1}};
for (int i = 0; i < 16; ++i)
if (Mask[i] >= 0)
(i < 8 ? LoBlendMask[i] : HiBlendMask[i % 8]) = Mask[i];
SDValue VLoHalf, VHiHalf;
// Check if any of the odd lanes in the v16i8 are used. If not, we can mask
// them out and avoid using UNPCK{L,H} to extract the elements of V as
// i16s.
if (none_of(LoBlendMask, [](int M) { return M >= 0 && M % 2 == 1; }) &&
none_of(HiBlendMask, [](int M) { return M >= 0 && M % 2 == 1; })) {
// Use a mask to drop the high bytes.
VLoHalf = DAG.getBitcast(MVT::v8i16, V);
VLoHalf = DAG.getNode(ISD::AND, DL, MVT::v8i16, VLoHalf,
DAG.getConstant(0x00FF, DL, MVT::v8i16));
// This will be a single vector shuffle instead of a blend so nuke VHiHalf.
VHiHalf = DAG.getUNDEF(MVT::v8i16);
// Squash the masks to point directly into VLoHalf.
for (int &M : LoBlendMask)
if (M >= 0)
M /= 2;
for (int &M : HiBlendMask)
if (M >= 0)
M /= 2;
} else {
// Otherwise just unpack the low half of V into VLoHalf and the high half
// into VHiHalf so that we can blend them as i16s.
SDValue Zero = getZeroVector(MVT::v16i8, Subtarget, DAG, DL);
VLoHalf = DAG.getBitcast(
MVT::v8i16, DAG.getNode(X86ISD::UNPCKL, DL, MVT::v16i8, V, Zero));
VHiHalf = DAG.getBitcast(
MVT::v8i16, DAG.getNode(X86ISD::UNPCKH, DL, MVT::v16i8, V, Zero));
}
SDValue LoV = DAG.getVectorShuffle(MVT::v8i16, DL, VLoHalf, VHiHalf, LoBlendMask);
SDValue HiV = DAG.getVectorShuffle(MVT::v8i16, DL, VLoHalf, VHiHalf, HiBlendMask);
return DAG.getNode(X86ISD::PACKUS, DL, MVT::v16i8, LoV, HiV);
}
/// \brief Dispatching routine to lower various 128-bit x86 vector shuffles.
///
/// This routine breaks down the specific type of 128-bit shuffle and
/// dispatches to the lowering routines accordingly.
static SDValue lower128BitVectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
MVT VT, SDValue V1, SDValue V2,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
switch (VT.SimpleTy) {
case MVT::v2i64:
return lowerV2I64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v2f64:
return lowerV2F64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v4i32:
return lowerV4I32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v4f32:
return lowerV4F32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v8i16:
return lowerV8I16VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v16i8:
return lowerV16I8VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
default:
llvm_unreachable("Unimplemented!");
}
}
/// \brief Generic routine to split vector shuffle into half-sized shuffles.
///
/// This routine just extracts two subvectors, shuffles them independently, and
/// then concatenates them back together. This should work effectively with all
/// AVX vector shuffle types.
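///
/// As a sketch: a v8i32 shuffle is lowered as two v4i32 shuffles, one
/// producing the low half of the result and one the high half, each blending
/// from the four extracted half-vectors as needed, and the halves are then
/// rejoined with CONCAT_VECTORS.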
static SDValue splitAndLowerVectorShuffle(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(VT.getSizeInBits() >= 256 &&
"Only for 256-bit or wider vector shuffles!");
assert(V1.getSimpleValueType() == VT && "Bad operand type!");
assert(V2.getSimpleValueType() == VT && "Bad operand type!");
ArrayRef<int> LoMask = Mask.slice(0, Mask.size() / 2);
ArrayRef<int> HiMask = Mask.slice(Mask.size() / 2);
int NumElements = VT.getVectorNumElements();
int SplitNumElements = NumElements / 2;
MVT ScalarVT = VT.getVectorElementType();
MVT SplitVT = MVT::getVectorVT(ScalarVT, NumElements / 2);
// Rather than splitting build-vectors, just build two narrower build
// vectors. This helps shuffling with splats and zeros.
auto SplitVector = [&](SDValue V) {
V = peekThroughBitcasts(V);
MVT OrigVT = V.getSimpleValueType();
int OrigNumElements = OrigVT.getVectorNumElements();
int OrigSplitNumElements = OrigNumElements / 2;
MVT OrigScalarVT = OrigVT.getVectorElementType();
MVT OrigSplitVT = MVT::getVectorVT(OrigScalarVT, OrigNumElements / 2);
SDValue LoV, HiV;
auto *BV = dyn_cast<BuildVectorSDNode>(V);
if (!BV) {
LoV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OrigSplitVT, V,
DAG.getIntPtrConstant(0, DL));
HiV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OrigSplitVT, V,
DAG.getIntPtrConstant(OrigSplitNumElements, DL));
} else {
SmallVector<SDValue, 16> LoOps, HiOps;
for (int i = 0; i < OrigSplitNumElements; ++i) {
LoOps.push_back(BV->getOperand(i));
HiOps.push_back(BV->getOperand(i + OrigSplitNumElements));
}
LoV = DAG.getBuildVector(OrigSplitVT, DL, LoOps);
HiV = DAG.getBuildVector(OrigSplitVT, DL, HiOps);
}
return std::make_pair(DAG.getBitcast(SplitVT, LoV),
DAG.getBitcast(SplitVT, HiV));
};
SDValue LoV1, HiV1, LoV2, HiV2;
std::tie(LoV1, HiV1) = SplitVector(V1);
std::tie(LoV2, HiV2) = SplitVector(V2);
// Now create two 4-way blends of these half-width vectors.
auto HalfBlend = [&](ArrayRef<int> HalfMask) {
bool UseLoV1 = false, UseHiV1 = false, UseLoV2 = false, UseHiV2 = false;
SmallVector<int, 32> V1BlendMask((unsigned)SplitNumElements, -1);
SmallVector<int, 32> V2BlendMask((unsigned)SplitNumElements, -1);
SmallVector<int, 32> BlendMask((unsigned)SplitNumElements, -1);
for (int i = 0; i < SplitNumElements; ++i) {
int M = HalfMask[i];
if (M >= NumElements) {
if (M >= NumElements + SplitNumElements)
UseHiV2 = true;
else
UseLoV2 = true;
V2BlendMask[i] = M - NumElements;
BlendMask[i] = SplitNumElements + i;
} else if (M >= 0) {
if (M >= SplitNumElements)
UseHiV1 = true;
else
UseLoV1 = true;
V1BlendMask[i] = M;
BlendMask[i] = i;
}
}
// Because the lowering happens after all combining takes place, we need to
// manually combine these blend masks as much as possible so that we create
// a minimal number of high-level vector shuffle nodes.
// First try just blending the halves of V1 or V2.
if (!UseLoV1 && !UseHiV1 && !UseLoV2 && !UseHiV2)
return DAG.getUNDEF(SplitVT);
if (!UseLoV2 && !UseHiV2)
return DAG.getVectorShuffle(SplitVT, DL, LoV1, HiV1, V1BlendMask);
if (!UseLoV1 && !UseHiV1)
return DAG.getVectorShuffle(SplitVT, DL, LoV2, HiV2, V2BlendMask);
SDValue V1Blend, V2Blend;
if (UseLoV1 && UseHiV1) {
V1Blend =
DAG.getVectorShuffle(SplitVT, DL, LoV1, HiV1, V1BlendMask);
} else {
// We only use half of V1 so map the usage down into the final blend mask.
V1Blend = UseLoV1 ? LoV1 : HiV1;
for (int i = 0; i < SplitNumElements; ++i)
if (BlendMask[i] >= 0 && BlendMask[i] < SplitNumElements)
BlendMask[i] = V1BlendMask[i] - (UseLoV1 ? 0 : SplitNumElements);
}
if (UseLoV2 && UseHiV2) {
V2Blend =
DAG.getVectorShuffle(SplitVT, DL, LoV2, HiV2, V2BlendMask);
} else {
// We only use half of V2 so map the usage down into the final blend mask.
V2Blend = UseLoV2 ? LoV2 : HiV2;
for (int i = 0; i < SplitNumElements; ++i)
if (BlendMask[i] >= SplitNumElements)
BlendMask[i] = V2BlendMask[i] + (UseLoV2 ? SplitNumElements : 0);
}
return DAG.getVectorShuffle(SplitVT, DL, V1Blend, V2Blend, BlendMask);
};
SDValue Lo = HalfBlend(LoMask);
SDValue Hi = HalfBlend(HiMask);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Lo, Hi);
}
/// \brief Either split a vector in halves or decompose the shuffles and the
/// blend.
///
/// This is provided as a good fallback for many lowerings of non-single-input
/// shuffles with more than one 128-bit lane. In those cases, we want to select
/// between splitting the shuffle into 128-bit components and stitching those
/// back together vs. extracting the single-input shuffles and blending those
/// results.
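///
/// For example (illustrative): a v4f64 mask <0, 5, 0, 5> broadcasts element 0
/// of V1 and element 1 of V2, so decomposing into two broadcasts plus a blend
/// is preferred here over splitting.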
static SDValue lowerVectorShuffleAsSplitOrBlend(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(!V2.isUndef() && "This routine must not be used to lower single-input "
"shuffles as it could then recurse on itself.");
int Size = Mask.size();
// If this can be modeled as a broadcast of two elements followed by a blend,
// prefer that lowering. This is especially important because broadcasts can
// often fold with memory operands.
auto DoBothBroadcast = [&] {
int V1BroadcastIdx = -1, V2BroadcastIdx = -1;
for (int M : Mask)
if (M >= Size) {
if (V2BroadcastIdx < 0)
V2BroadcastIdx = M - Size;
else if (M - Size != V2BroadcastIdx)
return false;
} else if (M >= 0) {
if (V1BroadcastIdx < 0)
V1BroadcastIdx = M;
else if (M != V1BroadcastIdx)
return false;
}
return true;
};
if (DoBothBroadcast())
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, VT, V1, V2, Mask,
DAG);
// If the inputs all stem from a single 128-bit lane of each input, then we
// split them rather than blending because the split will decompose to
// unusually few instructions.
int LaneCount = VT.getSizeInBits() / 128;
int LaneSize = Size / LaneCount;
SmallBitVector LaneInputs[2];
LaneInputs[0].resize(LaneCount, false);
LaneInputs[1].resize(LaneCount, false);
for (int i = 0; i < Size; ++i)
if (Mask[i] >= 0)
LaneInputs[Mask[i] / Size][(Mask[i] % Size) / LaneSize] = true;
if (LaneInputs[0].count() <= 1 && LaneInputs[1].count() <= 1)
return splitAndLowerVectorShuffle(DL, VT, V1, V2, Mask, DAG);
// Otherwise, just fall back to decomposed shuffles and a blend. This requires
// that the decomposed single-input shuffles don't end up here.
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, VT, V1, V2, Mask, DAG);
}
/// \brief Lower a vector shuffle crossing multiple 128-bit lanes as
/// a permutation and blend of those lanes.
///
/// This essentially blends the out-of-lane inputs to each lane into the lane
/// from a permuted copy of the vector. This lowering strategy results in four
/// instructions in the worst case for a single-input cross lane shuffle,
/// fewer than any other fully general cross-lane shuffle strategy I'm aware
/// of. Special cases for each particular shuffle pattern should be handled
/// prior to trying this lowering.
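///
/// Illustrative single-input example: for v8f32 with Mask = <0, 1, 6, 7, 4,
/// 5, 2, 3>, we form a VPERM2X128 that swaps V1's two 128-bit halves and
/// then blend V1 with the flipped copy using the now in-lane mask
/// <0, 1, 10, 11, 4, 5, 14, 15>.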
static SDValue lowerVectorShuffleAsLanePermuteAndBlend(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
SelectionDAG &DAG) {
// FIXME: This should probably be generalized for 512-bit vectors as well.
assert(VT.is256BitVector() && "Only for 256-bit vector shuffles!");
int Size = Mask.size();
int LaneSize = Size / 2;
// If there are only inputs from one 128-bit lane, splitting will in fact be
// less expensive. The flags track whether the given lane contains an element
// that crosses to another lane.
bool LaneCrossing[2] = {false, false};
for (int i = 0; i < Size; ++i)
if (Mask[i] >= 0 && (Mask[i] % Size) / LaneSize != i / LaneSize)
LaneCrossing[(Mask[i] % Size) / LaneSize] = true;
if (!LaneCrossing[0] || !LaneCrossing[1])
return splitAndLowerVectorShuffle(DL, VT, V1, V2, Mask, DAG);
assert(V2.isUndef() &&
"This last part of this routine only works on single input shuffles");
SmallVector<int, 32> FlippedBlendMask(Size);
for (int i = 0; i < Size; ++i)
FlippedBlendMask[i] =
Mask[i] < 0 ? -1 : (((Mask[i] % Size) / LaneSize == i / LaneSize)
? Mask[i]
: Mask[i] % LaneSize +
(i / LaneSize) * LaneSize + Size);
// Flip the vector, and blend the results which should now be in-lane. The
// VPERM2X128 immediate uses its low 2 bits to select the source for the low
// half of the destination and bits 4 and 5 for the high half. The value 3
// selects the high half of source 2 and the value 2 selects the low half of
// source 2. We only use source 2 to allow folding it into a memory operand.
unsigned PERMMask = 3 | 2 << 4;
SDValue Flipped = DAG.getNode(X86ISD::VPERM2X128, DL, VT, DAG.getUNDEF(VT),
V1, DAG.getConstant(PERMMask, DL, MVT::i8));
return DAG.getVectorShuffle(VT, DL, V1, Flipped, FlippedBlendMask);
}
/// \brief Handle lowering 2-lane 128-bit shuffles.
static SDValue lowerV2X128VectorShuffle(const SDLoc &DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SmallVector<int, 4> WidenedMask;
if (!canWidenShuffleElements(Mask, WidenedMask))
return SDValue();
// TODO: If minimizing size and one of the inputs is a zero vector and the
// zero vector has only one use, we could use a VPERM2X128 to save the
// instruction bytes needed to explicitly generate the zero vector.
// Blends are faster and handle all the non-lane-crossing cases.
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, VT, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
bool IsV1Zero = ISD::isBuildVectorAllZeros(V1.getNode());
bool IsV2Zero = ISD::isBuildVectorAllZeros(V2.getNode());
// If either input operand is a zero vector, use VPERM2X128 because its mask
// allows us to replace the zero input with an implicit zero.
if (!IsV1Zero && !IsV2Zero) {
// Check for patterns which can be matched with a single insert of a 128-bit
// subvector.
bool OnlyUsesV1 = isShuffleEquivalent(V1, V2, Mask, {0, 1, 0, 1});
if (OnlyUsesV1 || isShuffleEquivalent(V1, V2, Mask, {0, 1, 4, 5})) {
// With AVX2, use VPERMQ/VPERMPD to allow memory folding.
if (Subtarget.hasAVX2() && V2.isUndef())
return SDValue();
// With AVX1, use vperm2f128 (below) to allow load folding. Otherwise,
// this will likely become vinsertf128 which can't fold a 256-bit memop.
if (!isa<LoadSDNode>(peekThroughBitcasts(V1))) {
MVT SubVT = MVT::getVectorVT(VT.getVectorElementType(),
VT.getVectorNumElements() / 2);
SDValue LoV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubVT, V1,
DAG.getIntPtrConstant(0, DL));
SDValue HiV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubVT,
OnlyUsesV1 ? V1 : V2,
DAG.getIntPtrConstant(0, DL));
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, LoV, HiV);
}
}
}
// Otherwise form a 128-bit permutation. After accounting for undefs,
// convert the 64-bit shuffle mask selection values into 128-bit
// selection bits by dividing the indexes by 2 and shifting into positions
// defined by a vperm2*128 instruction's immediate control byte.
// The immediate permute control byte looks like this:
// [1:0] - select 128 bits from sources for low half of destination
// [2] - ignore
// [3] - zero low half of destination
// [5:4] - select 128 bits from sources for high half of destination
// [6] - ignore
// [7] - zero high half of destination
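// For example, WidenedMask = <1, 2> yields PermMask = 0x21, selecting V1's
// high 128 bits for the low half of the destination and V2's low 128 bits
// for the high half.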
int MaskLO = WidenedMask[0] < 0 ? 0 : WidenedMask[0];
int MaskHI = WidenedMask[1] < 0 ? 0 : WidenedMask[1];
unsigned PermMask = MaskLO | (MaskHI << 4);
// If either input is a zero vector, replace it with an undef input.
// Shuffle mask values < 4 are selecting elements of V1.
// Shuffle mask values >= 4 are selecting elements of V2.
// Adjust each half of the permute mask by clearing the half that was
// selecting the zero vector and setting the zero mask bit.
if (IsV1Zero) {
V1 = DAG.getUNDEF(VT);
if (MaskLO < 2)
PermMask = (PermMask & 0xf0) | 0x08;
if (MaskHI < 2)
PermMask = (PermMask & 0x0f) | 0x80;
}
if (IsV2Zero) {
V2 = DAG.getUNDEF(VT);
if (MaskLO >= 2)
PermMask = (PermMask & 0xf0) | 0x08;
if (MaskHI >= 2)
PermMask = (PermMask & 0x0f) | 0x80;
}
return DAG.getNode(X86ISD::VPERM2X128, DL, VT, V1, V2,
DAG.getConstant(PermMask, DL, MVT::i8));
}
/// \brief Lower a vector shuffle by first fixing the 128-bit lanes and then
/// shuffling each lane.
///
/// This will only succeed when the result of fixing the 128-bit lanes results
/// in a single-input non-lane-crossing shuffle with a repeating shuffle mask in
/// each 128-bit lane. This handles many cases where we can quickly blend away
/// the lane crosses early and then use simpler shuffles within each lane.
///
/// FIXME: It might be worthwhile at some point to support this without
/// requiring the 128-bit lane-relative shuffles to be repeating, but currently
/// in x86 only floating point has interesting non-repeating shuffles, and even
/// those are still *marginally* more expensive.
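///
/// Illustrative example: for v8i32 with Mask = <5, 4, 7, 6, 13, 12, 15, 14>,
/// the lane-fixing shuffle gathers V1's high lane and V2's high lane with a
/// single 64-bit-element shuffle, after which the repeating in-lane mask
/// <1, 0, 3, 2> finishes the job without crossing lanes.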
static SDValue lowerVectorShuffleByMerging128BitLanes(
const SDLoc &DL, MVT VT, SDValue V1, SDValue V2, ArrayRef<int> Mask,
const X86Subtarget &Subtarget, SelectionDAG &DAG) {
assert(!V2.isUndef() && "This is only useful with multiple inputs.");
int Size = Mask.size();
int LaneSize = 128 / VT.getScalarSizeInBits();
int NumLanes = Size / LaneSize;
assert(NumLanes > 1 && "Only handles 256-bit and wider shuffles.");
// See if we can build a hypothetical 128-bit lane-fixing shuffle mask. Also
// check whether the in-128-bit lane shuffles share a repeating pattern.
SmallVector<int, 4> Lanes((unsigned)NumLanes, -1);
SmallVector<int, 4> InLaneMask((unsigned)LaneSize, -1);
for (int i = 0; i < Size; ++i) {
if (Mask[i] < 0)
continue;
int j = i / LaneSize;
if (Lanes[j] < 0) {
// First entry we've seen for this lane.
Lanes[j] = Mask[i] / LaneSize;
} else if (Lanes[j] != Mask[i] / LaneSize) {
// This doesn't match the lane selected previously!
return SDValue();
}
// Check that within each lane we have a consistent shuffle mask.
int k = i % LaneSize;
if (InLaneMask[k] < 0) {
InLaneMask[k] = Mask[i] % LaneSize;
} else if (InLaneMask[k] != Mask[i] % LaneSize) {
// This doesn't fit a repeating in-lane mask.
return SDValue();
}
}
// First shuffle the lanes into place.
MVT LaneVT = MVT::getVectorVT(VT.isFloatingPoint() ? MVT::f64 : MVT::i64,
VT.getSizeInBits() / 64);
SmallVector<int, 8> LaneMask((unsigned)NumLanes * 2, -1);
for (int i = 0; i < NumLanes; ++i)
if (Lanes[i] >= 0) {
LaneMask[2 * i + 0] = 2*Lanes[i] + 0;
LaneMask[2 * i + 1] = 2*Lanes[i] + 1;
}
V1 = DAG.getBitcast(LaneVT, V1);
V2 = DAG.getBitcast(LaneVT, V2);
SDValue LaneShuffle = DAG.getVectorShuffle(LaneVT, DL, V1, V2, LaneMask);
// Cast it back to the type we actually want.
LaneShuffle = DAG.getBitcast(VT, LaneShuffle);
// Now do a simple shuffle that isn't lane crossing.
SmallVector<int, 8> NewMask((unsigned)Size, -1);
for (int i = 0; i < Size; ++i)
if (Mask[i] >= 0)
NewMask[i] = (i / LaneSize) * LaneSize + Mask[i] % LaneSize;
assert(!is128BitLaneCrossingShuffleMask(VT, NewMask) &&
"Must not introduce lane crosses at this point!");
return DAG.getVectorShuffle(VT, DL, LaneShuffle, DAG.getUNDEF(VT), NewMask);
}
/// Lower shuffles where an entire half of a 256-bit vector is UNDEF.
/// This allows for fast cases such as subvector extraction/insertion
/// or shuffling smaller vector types which can lower more efficiently.
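/// For example, a v8f32 shuffle <u, u, u, u, 0, 1, 2, 3> is handled by
/// extracting V1's low half and inserting it as the upper half of an
/// otherwise-undef vector, with no full-width shuffle at all.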
static SDValue lowerVectorShuffleWithUndefHalf(const SDLoc &DL, MVT VT,
SDValue V1, SDValue V2,
ArrayRef<int> Mask,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(VT.is256BitVector() && "Expected 256-bit vector");
unsigned NumElts = VT.getVectorNumElements();
unsigned HalfNumElts = NumElts / 2;
MVT HalfVT = MVT::getVectorVT(VT.getVectorElementType(), HalfNumElts);
bool UndefLower = isUndefInRange(Mask, 0, HalfNumElts);
bool UndefUpper = isUndefInRange(Mask, HalfNumElts, HalfNumElts);
if (!UndefLower && !UndefUpper)
return SDValue();
// Upper half is undef and lower half is whole upper subvector.
// e.g. vector_shuffle <4, 5, 6, 7, u, u, u, u> or <2, 3, u, u>
if (UndefUpper &&
isSequentialOrUndefInRange(Mask, 0, HalfNumElts, HalfNumElts)) {
SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, V1,
DAG.getIntPtrConstant(HalfNumElts, DL));
return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT), Hi,
DAG.getIntPtrConstant(0, DL));
}
// Lower half is undef and upper half is whole lower subvector.
// e.g. vector_shuffle <u, u, u, u, 0, 1, 2, 3> or <u, u, 0, 1>
if (UndefLower &&
isSequentialOrUndefInRange(Mask, HalfNumElts, HalfNumElts, 0)) {
SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, V1,
DAG.getIntPtrConstant(0, DL));
return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT), Hi,
DAG.getIntPtrConstant(HalfNumElts, DL));
}
// If the shuffle only uses two of the four halves of the input operands,
// then extract them and perform the 'half' shuffle at half width.
// e.g. vector_shuffle <X, X, X, X, u, u, u, u> or <X, X, u, u>
int HalfIdx1 = -1, HalfIdx2 = -1;
SmallVector<int, 8> HalfMask(HalfNumElts);
unsigned Offset = UndefLower ? HalfNumElts : 0;
for (unsigned i = 0; i != HalfNumElts; ++i) {
int M = Mask[i + Offset];
if (M < 0) {
HalfMask[i] = M;
continue;
}
// Determine which of the 4 half vectors this element is from.
// i.e. 0 = Lower V1, 1 = Upper V1, 2 = Lower V2, 3 = Upper V2.
int HalfIdx = M / HalfNumElts;
// Determine the element index into its half vector source.
int HalfElt = M % HalfNumElts;
// We can shuffle with up to 2 half vectors, set the new 'half'
// shuffle mask accordingly.
if (HalfIdx1 < 0 || HalfIdx1 == HalfIdx) {
HalfMask[i] = HalfElt;
HalfIdx1 = HalfIdx;
continue;
}
if (HalfIdx2 < 0 || HalfIdx2 == HalfIdx) {
HalfMask[i] = HalfElt + HalfNumElts;
HalfIdx2 = HalfIdx;
continue;
}
// Too many half vectors referenced.
return SDValue();
}
assert(HalfMask.size() == HalfNumElts && "Unexpected shuffle mask length");
// Only shuffle the halves of the inputs when useful.
int NumLowerHalves =
(HalfIdx1 == 0 || HalfIdx1 == 2) + (HalfIdx2 == 0 || HalfIdx2 == 2);
int NumUpperHalves =
(HalfIdx1 == 1 || HalfIdx1 == 3) + (HalfIdx2 == 1 || HalfIdx2 == 3);
// uuuuXXXX - don't extract uppers just to insert again.
if (UndefLower && NumUpperHalves != 0)
return SDValue();
// XXXXuuuu - don't extract both uppers, instead shuffle and then extract.
if (UndefUpper && NumUpperHalves == 2)
return SDValue();
// AVX2 - XXXXuuuu - always extract lowers.
if (Subtarget.hasAVX2() && !(UndefUpper && NumUpperHalves == 0)) {
// AVX2 supports efficient immediate 64-bit element cross-lane shuffles.
if (VT == MVT::v4f64 || VT == MVT::v4i64)
return SDValue();
// AVX2 supports variable 32-bit element cross-lane shuffles.
if (VT == MVT::v8f32 || VT == MVT::v8i32) {
// XXXXuuuu - don't extract lowers and uppers.
if (UndefUpper && NumLowerHalves != 0 && NumUpperHalves != 0)
return SDValue();
}
}
auto GetHalfVector = [&](int HalfIdx) {
if (HalfIdx < 0)
return DAG.getUNDEF(HalfVT);
SDValue V = (HalfIdx < 2 ? V1 : V2);
HalfIdx = (HalfIdx % 2) * HalfNumElts;
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfVT, V,
DAG.getIntPtrConstant(HalfIdx, DL));
};
SDValue Half1 = GetHalfVector(HalfIdx1);
SDValue Half2 = GetHalfVector(HalfIdx2);
SDValue V = DAG.getVectorShuffle(HalfVT, DL, Half1, Half2, HalfMask);
return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT), V,
DAG.getIntPtrConstant(Offset, DL));
}
/// \brief Test whether the specified input (0 or 1) is in-place blended by the
/// given mask.
///
/// This returns true if the elements from a particular input are already in the
/// slot required by the given mask and require no permutation.
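///
/// e.g. for Mask <0, 1, 2, 4> input 0 is already in place, but input 1 is not:
/// element 4 (V2[0]) appears in slot 3 rather than slot 0.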
static bool isShuffleMaskInputInPlace(int Input, ArrayRef<int> Mask) {
assert((Input == 0 || Input == 1) && "Only two inputs to shuffles.");
int Size = Mask.size();
for (int i = 0; i < Size; ++i)
if (Mask[i] >= 0 && Mask[i] / Size == Input && Mask[i] % Size != i)
return false;
return true;
}
/// Handle case where shuffle sources are coming from the same 128-bit lane and
/// every lane can be represented as the same repeating mask - allowing us to
/// shuffle the sources with the repeating shuffle and then permute the result
/// to the destination lanes.
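///
/// e.g. with AVX2 the v8f32 shuffle <1, 0, 5, 4, 5, 4, 1, 0> becomes the
/// in-lane shuffle <1, 0, u, u, 5, 4, u, u> followed by the 64-bit sub-lane
/// permute <0, 1, 4, 5, 4, 5, 0, 1>.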
static SDValue lowerShuffleAsRepeatedMaskAndLanePermute(
const SDLoc &DL, MVT VT, SDValue V1, SDValue V2, ArrayRef<int> Mask,
const X86Subtarget &Subtarget, SelectionDAG &DAG) {
int NumElts = VT.getVectorNumElements();
int NumLanes = VT.getSizeInBits() / 128;
int NumLaneElts = NumElts / NumLanes;
// On AVX2 we may be able to just shuffle the lowest elements and then
// broadcast the result.
if (Subtarget.hasAVX2()) {
for (unsigned BroadcastSize : {16, 32, 64}) {
if (BroadcastSize <= VT.getScalarSizeInBits())
continue;
int NumBroadcastElts = BroadcastSize / VT.getScalarSizeInBits();
// Attempt to match a repeating pattern every NumBroadcastElts, accounting
// for UNDEFs but only allowing references to the lowest 128-bit lane of the
// inputs.
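// e.g. for a v8i32 Mask <1, 0, 1, 0, 1, 0, 1, 0> and BroadcastSize 64 this
// finds RepeatMask <1, 0, u, u, u, u, u, u>, which is then broadcast with
// BroadcastMask <0, 1, 0, 1, 0, 1, 0, 1>.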
auto FindRepeatingBroadcastMask = [&](SmallVectorImpl<int> &RepeatMask) {
for (int i = 0; i != NumElts; i += NumBroadcastElts)
for (int j = 0; j != NumBroadcastElts; ++j) {
int M = Mask[i + j];
if (M < 0)
continue;
int &R = RepeatMask[j];
if (0 != ((M % NumElts) / NumLaneElts))
return false;
if (0 <= R && R != M)
return false;
R = M;
}
return true;
};
SmallVector<int, 8> RepeatMask((unsigned)NumElts, -1);
if (!FindRepeatingBroadcastMask(RepeatMask))
continue;
// Shuffle the (lowest) repeated elements in place for broadcast.
SDValue RepeatShuf = DAG.getVectorShuffle(VT, DL, V1, V2, RepeatMask);
// Shuffle the actual broadcast.
SmallVector<int, 8> BroadcastMask((unsigned)NumElts, -1);
for (int i = 0; i != NumElts; i += NumBroadcastElts)
for (int j = 0; j != NumBroadcastElts; ++j)
BroadcastMask[i + j] = j;
return DAG.getVectorShuffle(VT, DL, RepeatShuf, DAG.getUNDEF(VT),
BroadcastMask);
}
}
// Bail if the shuffle mask doesn't cross 128-bit lanes.
if (!is128BitLaneCrossingShuffleMask(VT, Mask))
return SDValue();
// Bail if we already have a repeated lane shuffle mask.
SmallVector<int, 8> RepeatedShuffleMask;
if (is128BitLaneRepeatedShuffleMask(VT, Mask, RepeatedShuffleMask))
return SDValue();
// On AVX2 targets we can permute 256-bit vectors as 64-bit sub-lanes
// (with PERMQ/PERMPD), otherwise we can only permute whole 128-bit lanes.
int SubLaneScale = Subtarget.hasAVX2() && VT.is256BitVector() ? 2 : 1;
int NumSubLanes = NumLanes * SubLaneScale;
int NumSubLaneElts = NumLaneElts / SubLaneScale;
// Check that all the sources are coming from the same lane and see if we can
// form a repeating shuffle mask (local to each sub-lane). At the same time,
// determine the source sub-lane for each destination sub-lane.
int TopSrcSubLane = -1;
SmallVector<int, 8> Dst2SrcSubLanes((unsigned)NumSubLanes, -1);
SmallVector<int, 8> RepeatedSubLaneMasks[2] = {
SmallVector<int, 8>((unsigned)NumSubLaneElts, SM_SentinelUndef),
SmallVector<int, 8>((unsigned)NumSubLaneElts, SM_SentinelUndef)};
for (int DstSubLane = 0; DstSubLane != NumSubLanes; ++DstSubLane) {
// Extract the sub-lane mask, check that it all comes from the same lane
// and normalize the mask entries to come from the first lane.
int SrcLane = -1;
SmallVector<int, 8> SubLaneMask((unsigned)NumSubLaneElts, -1);
for (int Elt = 0; Elt != NumSubLaneElts; ++Elt) {
int M = Mask[(DstSubLane * NumSubLaneElts) + Elt];
if (M < 0)
continue;
int Lane = (M % NumElts) / NumLaneElts;
if ((0 <= SrcLane) && (SrcLane != Lane))
return SDValue();
SrcLane = Lane;
int LocalM = (M % NumLaneElts) + (M < NumElts ? 0 : NumElts);
SubLaneMask[Elt] = LocalM;
}
// Whole sub-lane is UNDEF.
if (SrcLane < 0)
continue;
// Attempt to match against the candidate repeated sub-lane masks.
for (int SubLane = 0; SubLane != SubLaneScale; ++SubLane) {
auto MatchMasks = [NumSubLaneElts](ArrayRef<int> M1, ArrayRef<int> M2) {
for (int i = 0; i != NumSubLaneElts; ++i) {
if (M1[i] < 0 || M2[i] < 0)
continue;
if (M1[i] != M2[i])
return false;
}
return true;
};
auto &RepeatedSubLaneMask = RepeatedSubLaneMasks[SubLane];
if (!MatchMasks(SubLaneMask, RepeatedSubLaneMask))
continue;
// Merge the sub-lane mask into the matching repeated sub-lane mask.
for (int i = 0; i != NumSubLaneElts; ++i) {
int M = SubLaneMask[i];
if (M < 0)
continue;
assert((RepeatedSubLaneMask[i] < 0 || RepeatedSubLaneMask[i] == M) &&
"Unexpected mask element");
RepeatedSubLaneMask[i] = M;
}
// Track the topmost source sub-lane - by setting the remaining to UNDEF
// we can greatly simplify shuffle matching.
int SrcSubLane = (SrcLane * SubLaneScale) + SubLane;
TopSrcSubLane = std::max(TopSrcSubLane, SrcSubLane);
Dst2SrcSubLanes[DstSubLane] = SrcSubLane;
break;
}
// Bail if we failed to find a matching repeated sub-lane mask.
if (Dst2SrcSubLanes[DstSubLane] < 0)
return SDValue();
}
assert(0 <= TopSrcSubLane && TopSrcSubLane < NumSubLanes &&
"Unexpected source lane");
// Create a repeating shuffle mask for the entire vector.
SmallVector<int, 8> RepeatedMask((unsigned)NumElts, -1);
for (int SubLane = 0; SubLane <= TopSrcSubLane; ++SubLane) {
int Lane = SubLane / SubLaneScale;
auto &RepeatedSubLaneMask = RepeatedSubLaneMasks[SubLane % SubLaneScale];
for (int Elt = 0; Elt != NumSubLaneElts; ++Elt) {
int M = RepeatedSubLaneMask[Elt];
if (M < 0)
continue;
int Idx = (SubLane * NumSubLaneElts) + Elt;
RepeatedMask[Idx] = M + (Lane * NumLaneElts);
}
}
SDValue RepeatedShuffle = DAG.getVectorShuffle(VT, DL, V1, V2, RepeatedMask);
// Shuffle each source sub-lane to its destination.
SmallVector<int, 8> SubLaneMask((unsigned)NumElts, -1);
for (int i = 0; i != NumElts; i += NumSubLaneElts) {
int SrcSubLane = Dst2SrcSubLanes[i / NumSubLaneElts];
if (SrcSubLane < 0)
continue;
for (int j = 0; j != NumSubLaneElts; ++j)
SubLaneMask[i + j] = j + (SrcSubLane * NumSubLaneElts);
}
return DAG.getVectorShuffle(VT, DL, RepeatedShuffle, DAG.getUNDEF(VT),
SubLaneMask);
}
static bool matchVectorShuffleWithSHUFPD(MVT VT, SDValue &V1, SDValue &V2,
unsigned &ShuffleImm,
ArrayRef<int> Mask) {
int NumElts = VT.getVectorNumElements();
assert(VT.getScalarSizeInBits() == 64 &&
(NumElts == 2 || NumElts == 4 || NumElts == 8) &&
"Unexpected data type for VSHUFPD");
// Mask for V8F64: 0/1, 8/9, 2/3, 10/11, 4/5, ..
// Mask for V4F64: 0/1, 4/5, 2/3, 6/7, ..
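// e.g. the v4f64 mask <0, 5, 2, 7> matches directly, giving ShuffleImm 0b1010.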
ShuffleImm = 0;
bool ShufpdMask = true;
bool CommutableMask = true;
for (int i = 0; i < NumElts; ++i) {
if (Mask[i] == SM_SentinelUndef)
continue;
if (Mask[i] < 0)
return false;
int Val = (i & 6) + NumElts * (i & 1);
int CommutVal = (i & 0xe) + NumElts * ((i & 1) ^ 1);
if (Mask[i] < Val || Mask[i] > Val + 1)
ShufpdMask = false;
if (Mask[i] < CommutVal || Mask[i] > CommutVal + 1)
CommutableMask = false;
ShuffleImm |= (Mask[i] % 2) << i;
}
if (ShufpdMask)
return true;
if (CommutableMask) {
std::swap(V1, V2);
return true;
}
return false;
}
static SDValue lowerVectorShuffleWithSHUFPD(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2, SelectionDAG &DAG) {
assert((VT == MVT::v2f64 || VT == MVT::v4f64 || VT == MVT::v8f64) &&
"Unexpected data type for VSHUFPD");
unsigned Immediate = 0;
if (!matchVectorShuffleWithSHUFPD(VT, V1, V2, Immediate, Mask))
return SDValue();
return DAG.getNode(X86ISD::SHUFP, DL, VT, V1, V2,
DAG.getConstant(Immediate, DL, MVT::i8));
}
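/// \brief Lower a shuffle with a single variable cross-lane permute (VPERMV)
/// when only one input is used, or with a two-source variable permute
/// (VPERMV3) otherwise, where the constant mask vector indexes into the
/// concatenation of both inputs.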
static SDValue lowerVectorShuffleWithPERMV(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2, SelectionDAG &DAG) {
MVT MaskEltVT = MVT::getIntegerVT(VT.getScalarSizeInBits());
MVT MaskVecVT = MVT::getVectorVT(MaskEltVT, VT.getVectorNumElements());
SDValue MaskNode = getConstVector(Mask, MaskVecVT, DAG, DL, true);
if (V2.isUndef())
return DAG.getNode(X86ISD::VPERMV, DL, VT, MaskNode, V1);
return DAG.getNode(X86ISD::VPERMV3, DL, VT, V1, MaskNode, V2);
}
/// \brief Handle lowering of 4-lane 64-bit floating point shuffles.
///
/// Also ends up handling lowering of 4-lane 64-bit integer shuffles when AVX2
/// isn't available.
static SDValue lowerV4F64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v4f64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v4f64 && "Bad operand type!");
assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");
if (SDValue V = lowerV2X128VectorShuffle(DL, MVT::v4f64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return V;
if (V2.isUndef()) {
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(
DL, MVT::v4f64, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Use low duplicate instructions for masks that match their pattern.
if (isShuffleEquivalent(V1, V2, Mask, {0, 0, 2, 2}))
return DAG.getNode(X86ISD::MOVDDUP, DL, MVT::v4f64, V1);
if (!is128BitLaneCrossingShuffleMask(MVT::v4f64, Mask)) {
// Non-half-crossing single input shuffles can be lowered with an
// interleaved permutation.
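// e.g. Mask <1, 0, 3, 2> yields the immediate 0b0101.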
unsigned VPERMILPMask = (Mask[0] == 1) | ((Mask[1] == 1) << 1) |
((Mask[2] == 3) << 2) | ((Mask[3] == 3) << 3);
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v4f64, V1,
DAG.getConstant(VPERMILPMask, DL, MVT::i8));
}
// With AVX2 we have direct support for this permutation.
if (Subtarget.hasAVX2())
return DAG.getNode(X86ISD::VPERMI, DL, MVT::v4f64, V1,
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v4f64, V1, V2, Mask, Subtarget, DAG))
return V;
// Otherwise, fall back.
return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v4f64, V1, V2, Mask,
DAG);
}
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v4f64, Mask, V1, V2, DAG))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4f64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Check if the blend happens to exactly fit that of SHUFPD.
if (SDValue Op =
lowerVectorShuffleWithSHUFPD(DL, MVT::v4f64, Mask, V1, V2, DAG))
return Op;
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v4f64, V1, V2, Mask, Subtarget, DAG))
return V;
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle. However, if we have AVX2 and either input is already in place,
// we will be able to shuffle the other input across lanes in a single
// instruction, so skip this pattern.
if (!(Subtarget.hasAVX2() && (isShuffleMaskInputInPlace(0, Mask) ||
isShuffleMaskInputInPlace(1, Mask))))
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v4f64, V1, V2, Mask, Subtarget, DAG))
return Result;
// If we have VLX support, we can use VEXPAND.
if (Subtarget.hasVLX())
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v4f64, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
// If we have AVX2 then we always want to lower with a blend because at v4 we
// can fully permute the elements.
if (Subtarget.hasAVX2())
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v4f64, V1, V2,
Mask, DAG);
// Otherwise fall back on generic lowering.
return lowerVectorShuffleAsSplitOrBlend(DL, MVT::v4f64, V1, V2, Mask, DAG);
}
/// \brief Handle lowering of 4-lane 64-bit integer shuffles.
///
/// This routine is only called when we have AVX2 and thus a reasonable
/// instruction set for v4i64 shuffling.
static SDValue lowerV4I64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v4i64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v4i64 && "Bad operand type!");
assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");
assert(Subtarget.hasAVX2() && "We can only lower v4i64 with AVX2!");
if (SDValue V = lowerV2X128VectorShuffle(DL, MVT::v4i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v4i64, V1, V2,
Mask, Subtarget, DAG))
return Broadcast;
if (V2.isUndef()) {
// When the shuffle is mirrored between the 128-bit lanes of the unit, we
// can use lower latency instructions that will operate on both lanes.
SmallVector<int, 2> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v4i64, Mask, RepeatedMask)) {
SmallVector<int, 4> PSHUFDMask;
scaleShuffleMask(2, RepeatedMask, PSHUFDMask);
return DAG.getBitcast(
MVT::v4i64,
DAG.getNode(X86ISD::PSHUFD, DL, MVT::v8i32,
DAG.getBitcast(MVT::v8i32, V1),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
}
// AVX2 provides a direct instruction for permuting a single input across
// lanes.
return DAG.getNode(X86ISD::VPERMI, DL, MVT::v4i64, V1,
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v4i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// If we have VLX support, we can use VALIGN or VEXPAND.
if (Subtarget.hasVLX()) {
if (SDValue Rotate = lowerVectorShuffleAsRotate(DL, MVT::v4i64, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v4i64, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
}
// Try to use PALIGNR.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(DL, MVT::v4i64, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v4i64, Mask, V1, V2, DAG))
return V;
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle. However, if we have AVX2 and either input is already in place,
// we will be able to shuffle the other input across lanes in a single
// instruction, so skip this pattern.
if (!isShuffleMaskInputInPlace(0, Mask) &&
!isShuffleMaskInputInPlace(1, Mask))
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v4i64, V1, V2, Mask, Subtarget, DAG))
return Result;
// Otherwise fall back on generic blend lowering.
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v4i64, V1, V2,
Mask, DAG);
}
/// \brief Handle lowering of 8-lane 32-bit floating point shuffles.
///
/// Also ends up handling lowering of 8-lane 32-bit integer shuffles when AVX2
/// isn't available.
static SDValue lowerV8F32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v8f32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v8f32 && "Bad operand type!");
assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8f32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v8f32, V1, V2,
Mask, Subtarget, DAG))
return Broadcast;
// If the shuffle mask is repeated in each 128-bit lane, we have many more
// options to efficiently lower the shuffle.
SmallVector<int, 4> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v8f32, Mask, RepeatedMask)) {
assert(RepeatedMask.size() == 4 &&
"Repeated masks must be half the mask width!");
// Use even/odd duplicate instructions for masks that match their pattern.
if (isShuffleEquivalent(V1, V2, RepeatedMask, {0, 0, 2, 2}))
return DAG.getNode(X86ISD::MOVSLDUP, DL, MVT::v8f32, V1);
if (isShuffleEquivalent(V1, V2, RepeatedMask, {1, 1, 3, 3}))
return DAG.getNode(X86ISD::MOVSHDUP, DL, MVT::v8f32, V1);
if (V2.isUndef())
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v8f32, V1,
getV4X86ShuffleImm8ForMask(RepeatedMask, DL, DAG));
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v8f32, Mask, V1, V2, DAG))
return V;
// Otherwise, fall back to a SHUFPS sequence. Here it is important that we
// have already handled any direct blends.
return lowerVectorShuffleWithSHUFPS(DL, MVT::v8f32, RepeatedMask, V1, V2, DAG);
}
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v8f32, V1, V2, Mask, Subtarget, DAG))
return V;
// If we have a single input shuffle with different shuffle patterns in the
// two 128-bit lanes use the variable mask to VPERMILPS.
if (V2.isUndef()) {
SDValue VPermMask = getConstVector(Mask, MVT::v8i32, DAG, DL, true);
if (!is128BitLaneCrossingShuffleMask(MVT::v8f32, Mask))
return DAG.getNode(X86ISD::VPERMILPV, DL, MVT::v8f32, V1, VPermMask);
if (Subtarget.hasAVX2())
return DAG.getNode(X86ISD::VPERMV, DL, MVT::v8f32, VPermMask, V1);
// Otherwise, fall back.
return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v8f32, V1, V2, Mask,
DAG);
}
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle.
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v8f32, V1, V2, Mask, Subtarget, DAG))
return Result;
// If we have VLX support, we can use VEXPAND.
if (Subtarget.hasVLX())
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v8f32, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
// For non-AVX512, if the mask matches an in-lane 16-bit unpack pattern then
// try to split, since after splitting we get more efficient code using
// vpunpcklwd and vpunpckhwd than with vblend.
if (!Subtarget.hasAVX512() && isUnpackWdShuffleMask(Mask, MVT::v8f32))
if (SDValue V = lowerVectorShuffleAsSplitOrBlend(DL, MVT::v8f32, V1, V2,
Mask, DAG))
return V;
// If we have AVX2 then we always want to lower with a blend because at v8 we
// can fully permute the elements.
if (Subtarget.hasAVX2())
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v8f32, V1, V2,
Mask, DAG);
// Otherwise fall back on generic lowering.
return lowerVectorShuffleAsSplitOrBlend(DL, MVT::v8f32, V1, V2, Mask, DAG);
}
/// \brief Handle lowering of 8-lane 32-bit integer shuffles.
///
/// This routine is only called when we have AVX2 and thus a reasonable
/// instruction set for v8i32 shuffling.
static SDValue lowerV8I32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v8i32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v8i32 && "Bad operand type!");
assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");
assert(Subtarget.hasAVX2() && "We can only lower v8i32 with AVX2!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v8i32, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// For non-AVX512, if the mask matches an in-lane 16-bit unpack pattern then
// try to split, since after splitting we get more efficient code using
// vpunpcklwd and vpunpckhwd than with vblend.
if (isUnpackWdShuffleMask(Mask, MVT::v8i32) && !V2.isUndef() &&
!Subtarget.hasAVX512())
if (SDValue V =
lowerVectorShuffleAsSplitOrBlend(DL, MVT::v8i32, V1, V2, Mask, DAG))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v8i32, V1, V2,
Mask, Subtarget, DAG))
return Broadcast;
// If the shuffle mask is repeated in each 128-bit lane we can use more
// efficient instructions that mirror the shuffles across the two 128-bit
// lanes.
SmallVector<int, 4> RepeatedMask;
bool Is128BitLaneRepeatedShuffle =
is128BitLaneRepeatedShuffleMask(MVT::v8i32, Mask, RepeatedMask);
if (Is128BitLaneRepeatedShuffle) {
assert(RepeatedMask.size() == 4 && "Unexpected repeated mask size!");
if (V2.isUndef())
return DAG.getNode(X86ISD::PSHUFD, DL, MVT::v8i32, V1,
getV4X86ShuffleImm8ForMask(RepeatedMask, DL, DAG));
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v8i32, Mask, V1, V2, DAG))
return V;
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v8i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// If we have VLX support, we can use VALIGN or EXPAND.
if (Subtarget.hasVLX()) {
if (SDValue Rotate = lowerVectorShuffleAsRotate(DL, MVT::v8i32, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v8i32, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
}
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v8i32, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v8i32, V1, V2, Mask, Subtarget, DAG))
return V;
// If the shuffle patterns aren't repeated but it is a single input, directly
// generate a cross-lane VPERMD instruction.
if (V2.isUndef()) {
SDValue VPermMask = getConstVector(Mask, MVT::v8i32, DAG, DL, true);
return DAG.getNode(X86ISD::VPERMV, DL, MVT::v8i32, VPermMask, V1);
}
// Assume that a single SHUFPS is faster than an alternative sequence of
// multiple instructions (even if the CPU has a domain penalty).
// If some CPU is harmed by the domain switch, we can fix it in a later pass.
if (Is128BitLaneRepeatedShuffle && isSingleSHUFPSMask(RepeatedMask)) {
SDValue CastV1 = DAG.getBitcast(MVT::v8f32, V1);
SDValue CastV2 = DAG.getBitcast(MVT::v8f32, V2);
SDValue ShufPS = lowerVectorShuffleWithSHUFPS(DL, MVT::v8f32, RepeatedMask,
CastV1, CastV2, DAG);
return DAG.getBitcast(MVT::v8i32, ShufPS);
}
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle.
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v8i32, V1, V2, Mask, Subtarget, DAG))
return Result;
// Otherwise fall back on generic blend lowering.
return lowerVectorShuffleAsDecomposedShuffleBlend(DL, MVT::v8i32, V1, V2,
Mask, DAG);
}
/// \brief Handle lowering of 16-lane 16-bit integer shuffles.
///
/// This routine is only called when we have AVX2 and thus a reasonable
/// instruction set for v16i16 shuffling.
static SDValue lowerV16I16VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v16i16 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v16i16 && "Bad operand type!");
assert(Mask.size() == 16 && "Unexpected mask size for v16 shuffle!");
assert(Subtarget.hasAVX2() && "We can only lower v16i16 with AVX2!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v16i16, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v16i16, V1, V2,
Mask, Subtarget, DAG))
return Broadcast;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v16i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v16i16, Mask, V1, V2, DAG))
return V;
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v16i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
return V;
if (V2.isUndef()) {
// There are no generalized cross-lane shuffle operations available on i16
// element types.
if (is128BitLaneCrossingShuffleMask(MVT::v16i16, Mask))
return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v16i16, V1, V2,
Mask, DAG);
SmallVector<int, 8> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v16i16, Mask, RepeatedMask)) {
// As this is a single-input shuffle, the repeated mask should be
// a strictly valid v8i16 mask that we can pass through to the v8i16
// lowering to handle even the v16 case.
return lowerV8I16GeneralSingleInputVectorShuffle(
DL, MVT::v16i16, V1, RepeatedMask, Subtarget, DAG);
}
}
if (SDValue PSHUFB = lowerVectorShuffleWithPSHUFB(
DL, MVT::v16i16, Mask, V1, V2, Zeroable, Subtarget, DAG))
return PSHUFB;
// AVX512BWVL can lower to VPERMW.
if (Subtarget.hasBWI() && Subtarget.hasVLX())
return lowerVectorShuffleWithPERMV(DL, MVT::v16i16, Mask, V1, V2, DAG);
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle.
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
return Result;
// Otherwise fall back on generic lowering.
return lowerVectorShuffleAsSplitOrBlend(DL, MVT::v16i16, V1, V2, Mask, DAG);
}
/// \brief Handle lowering of 32-lane 8-bit integer shuffles.
///
/// This routine is only called when we have AVX2 and thus a reasonable
/// instruction set for v32i8 shuffling.
static SDValue lowerV32I8VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v32i8 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v32i8 && "Bad operand type!");
assert(Mask.size() == 32 && "Unexpected mask size for v32 shuffle!");
assert(Subtarget.hasAVX2() && "We can only lower v32i8 with AVX2!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v32i8, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// Check for being able to broadcast a single element.
if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v32i8, V1, V2,
Mask, Subtarget, DAG))
return Broadcast;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v32i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v32i8, Mask, V1, V2, DAG))
return V;
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v32i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
return V;
// There are no generalized cross-lane shuffle operations available on i8
// element types.
if (V2.isUndef() && is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask))
return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v32i8, V1, V2, Mask,
DAG);
if (SDValue PSHUFB = lowerVectorShuffleWithPSHUFB(
DL, MVT::v32i8, Mask, V1, V2, Zeroable, Subtarget, DAG))
return PSHUFB;
// Try to simplify this by merging 128-bit lanes to enable a lane-based
// shuffle.
if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
return Result;
// Otherwise fall back on generic lowering.
return lowerVectorShuffleAsSplitOrBlend(DL, MVT::v32i8, V1, V2, Mask, DAG);
}
/// \brief High-level routine to lower various 256-bit x86 vector shuffles.
///
/// This routine either breaks down the specific type of a 256-bit x86 vector
/// shuffle or splits it into two 128-bit shuffles and fuses the results back
/// together based on the available instructions.
static SDValue lower256BitVectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
MVT VT, SDValue V1, SDValue V2,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
// If we have a single input to the zero element, insert that into V1 if we
// can do so cheaply.
int NumElts = VT.getVectorNumElements();
int NumV2Elements = count_if(Mask, [NumElts](int M) { return M >= NumElts; });
if (NumV2Elements == 1 && Mask[0] >= NumElts)
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, VT, V1, V2, Mask, Zeroable, Subtarget, DAG))
return Insertion;
// Handle special cases where the lower or upper half is UNDEF.
if (SDValue V =
lowerVectorShuffleWithUndefHalf(DL, VT, V1, V2, Mask, Subtarget, DAG))
return V;
// There is a really nice hard cut-over between AVX1 and AVX2 that means we
// can check for those subtargets here and avoid much of the subtarget
// querying in the per-vector-type lowering routines. With AVX1 we have
// essentially *zero* ability to manipulate a 256-bit vector with integer
// types. Since we'll use floating point types there eventually, just
// immediately cast everything to a float and operate entirely in that domain.
if (VT.isInteger() && !Subtarget.hasAVX2()) {
int ElementBits = VT.getScalarSizeInBits();
if (ElementBits < 32) {
// No floating point type available, if we can't use the bit operations
// for masking/blending then decompose into 128-bit vectors.
if (SDValue V =
lowerVectorShuffleAsBitMask(DL, VT, V1, V2, Mask, Zeroable, DAG))
return V;
if (SDValue V = lowerVectorShuffleAsBitBlend(DL, VT, V1, V2, Mask, DAG))
return V;
return splitAndLowerVectorShuffle(DL, VT, V1, V2, Mask, DAG);
}
MVT FpVT = MVT::getVectorVT(MVT::getFloatingPointVT(ElementBits),
VT.getVectorNumElements());
V1 = DAG.getBitcast(FpVT, V1);
V2 = DAG.getBitcast(FpVT, V2);
return DAG.getBitcast(VT, DAG.getVectorShuffle(FpVT, DL, V1, V2, Mask));
}
switch (VT.SimpleTy) {
case MVT::v4f64:
return lowerV4F64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v4i64:
return lowerV4I64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v8f32:
return lowerV8F32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v8i32:
return lowerV8I32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v16i16:
return lowerV16I16VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v32i8:
return lowerV32I8VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
default:
llvm_unreachable("Not a valid 256-bit x86 vector type!");
}
}
/// \brief Try to lower a vector shuffle as a shuffle of 128-bit subvectors.
static SDValue lowerV4X128VectorShuffle(const SDLoc &DL, MVT VT,
ArrayRef<int> Mask, SDValue V1,
SDValue V2, SelectionDAG &DAG) {
assert(VT.getScalarSizeInBits() == 64 &&
"Unexpected element type size for 128bit shuffle.");
// Handling a 256-bit vector would require VLX, and most probably
// lowerV2X128VectorShuffle() is the better solution there.
assert(VT.is512BitVector() && "Unexpected vector size for 512bit shuffle.");
SmallVector<int, 4> WidenedMask;
if (!canWidenShuffleElements(Mask, WidenedMask))
return SDValue();
// Check for patterns which can be matched with a single insert of a 256-bit
// subvector.
bool OnlyUsesV1 = isShuffleEquivalent(V1, V2, Mask,
{0, 1, 2, 3, 0, 1, 2, 3});
if (OnlyUsesV1 || isShuffleEquivalent(V1, V2, Mask,
{0, 1, 2, 3, 8, 9, 10, 11})) {
MVT SubVT = MVT::getVectorVT(VT.getVectorElementType(), 4);
SDValue LoV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubVT, V1,
DAG.getIntPtrConstant(0, DL));
SDValue HiV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubVT,
OnlyUsesV1 ? V1 : V2,
DAG.getIntPtrConstant(0, DL));
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, LoV, HiV);
}
assert(WidenedMask.size() == 4);
// See if this is an insertion of the lower 128-bits of V2 into V1.
bool IsInsert = true;
int V2Index = -1;
for (int i = 0; i < 4; ++i) {
assert(WidenedMask[i] >= -1);
if (WidenedMask[i] < 0)
continue;
// Make sure all V1 subvectors are in place.
if (WidenedMask[i] < 4) {
if (WidenedMask[i] != i) {
IsInsert = false;
break;
}
} else {
// Make sure we only have a single V2 index and it's the lowest 128 bits.
if (V2Index >= 0 || WidenedMask[i] != 4) {
IsInsert = false;
break;
}
V2Index = i;
}
}
if (IsInsert && V2Index >= 0) {
MVT SubVT = MVT::getVectorVT(VT.getVectorElementType(), 2);
SDValue Subvec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubVT, V2,
DAG.getIntPtrConstant(0, DL));
return insert128BitVector(V1, Subvec, V2Index * 2, DAG, DL);
}
// Try to lower to vshuf64x2/vshuf32x4.
SDValue Ops[2] = {DAG.getUNDEF(VT), DAG.getUNDEF(VT)};
unsigned PermMask = 0;
// Ensure the elements come from the same Op.
for (int i = 0; i < 4; ++i) {
assert(WidenedMask[i] >= -1);
if (WidenedMask[i] < 0)
continue;
SDValue Op = WidenedMask[i] >= 4 ? V2 : V1;
unsigned OpIndex = i / 2;
if (Ops[OpIndex].isUndef())
Ops[OpIndex] = Op;
else if (Ops[OpIndex] != Op)
return SDValue();
// Convert the 128-bit shuffle mask selection values into 128-bit selection
// bits defined by a vshuf64x2 instruction's immediate control byte.
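// e.g. WidenedMask <2, 3, 4, 5> takes the upper two 128-bit quarters of V1
// and the lower two of V2, giving Ops = {V1, V2} and PermMask 0x4E.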
PermMask |= (WidenedMask[i] % 4) << (i * 2);
}
return DAG.getNode(X86ISD::SHUF128, DL, VT, Ops[0], Ops[1],
DAG.getConstant(PermMask, DL, MVT::i8));
}
/// \brief Handle lowering of 8-lane 64-bit floating point shuffles.
static SDValue lowerV8F64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v8f64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v8f64 && "Bad operand type!");
assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");
if (V2.isUndef()) {
// Use low duplicate instructions for masks that match their pattern.
if (isShuffleEquivalent(V1, V2, Mask, {0, 0, 2, 2, 4, 4, 6, 6}))
return DAG.getNode(X86ISD::MOVDDUP, DL, MVT::v8f64, V1);
if (!is128BitLaneCrossingShuffleMask(MVT::v8f64, Mask)) {
// Non-half-crossing single input shuffles can be lowered with an
// interleaved permutation.
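// e.g. Mask <1, 0, 3, 2, 5, 4, 7, 6> yields the immediate 0b01010101.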
unsigned VPERMILPMask = (Mask[0] == 1) | ((Mask[1] == 1) << 1) |
((Mask[2] == 3) << 2) | ((Mask[3] == 3) << 3) |
((Mask[4] == 5) << 4) | ((Mask[5] == 5) << 5) |
((Mask[6] == 7) << 6) | ((Mask[7] == 7) << 7);
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v8f64, V1,
DAG.getConstant(VPERMILPMask, DL, MVT::i8));
}
SmallVector<int, 4> RepeatedMask;
if (is256BitLaneRepeatedShuffleMask(MVT::v8f64, Mask, RepeatedMask))
return DAG.getNode(X86ISD::VPERMI, DL, MVT::v8f64, V1,
getV4X86ShuffleImm8ForMask(RepeatedMask, DL, DAG));
}
if (SDValue Shuf128 =
lowerV4X128VectorShuffle(DL, MVT::v8f64, Mask, V1, V2, DAG))
return Shuf128;
if (SDValue Unpck =
lowerVectorShuffleWithUNPCK(DL, MVT::v8f64, Mask, V1, V2, DAG))
return Unpck;
// Check if the blend happens to exactly fit that of SHUFPD.
if (SDValue Op =
lowerVectorShuffleWithSHUFPD(DL, MVT::v8f64, Mask, V1, V2, DAG))
return Op;
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v8f64, Zeroable, Mask, V1,
V2, DAG, Subtarget))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8f64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
return lowerVectorShuffleWithPERMV(DL, MVT::v8f64, Mask, V1, V2, DAG);
}
/// \brief Handle lowering of 16-lane 32-bit floating point shuffles.
static SDValue lowerV16F32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v16f32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v16f32 && "Bad operand type!");
assert(Mask.size() == 16 && "Unexpected mask size for v16 shuffle!");
// If the shuffle mask is repeated in each 128-bit lane, we have many more
// options to efficiently lower the shuffle.
SmallVector<int, 4> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v16f32, Mask, RepeatedMask)) {
assert(RepeatedMask.size() == 4 && "Unexpected repeated mask size!");
// Use even/odd duplicate instructions for masks that match their pattern.
if (isShuffleEquivalent(V1, V2, RepeatedMask, {0, 0, 2, 2}))
return DAG.getNode(X86ISD::MOVSLDUP, DL, MVT::v16f32, V1);
if (isShuffleEquivalent(V1, V2, RepeatedMask, {1, 1, 3, 3}))
return DAG.getNode(X86ISD::MOVSHDUP, DL, MVT::v16f32, V1);
if (V2.isUndef())
return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v16f32, V1,
getV4X86ShuffleImm8ForMask(RepeatedMask, DL, DAG));
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue Unpck =
lowerVectorShuffleWithUNPCK(DL, MVT::v16f32, Mask, V1, V2, DAG))
return Unpck;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v16f32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// Otherwise, fall back to a SHUFPS sequence.
return lowerVectorShuffleWithSHUFPS(DL, MVT::v16f32, RepeatedMask, V1, V2, DAG);
}
// If we have AVX512F support, we can use VEXPAND.
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v16f32, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
return lowerVectorShuffleWithPERMV(DL, MVT::v16f32, Mask, V1, V2, DAG);
}
/// \brief Handle lowering of 8-lane 64-bit integer shuffles.
static SDValue lowerV8I64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v8i64 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v8i64 && "Bad operand type!");
assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");
if (SDValue Shuf128 =
lowerV4X128VectorShuffle(DL, MVT::v8i64, Mask, V1, V2, DAG))
return Shuf128;
if (V2.isUndef()) {
// When the shuffle is mirrored between the 128-bit lanes of the unit, we
// can use lower latency instructions that will operate on all four
// 128-bit lanes.
SmallVector<int, 2> Repeated128Mask;
if (is128BitLaneRepeatedShuffleMask(MVT::v8i64, Mask, Repeated128Mask)) {
SmallVector<int, 4> PSHUFDMask;
scaleShuffleMask(2, Repeated128Mask, PSHUFDMask);
return DAG.getBitcast(
MVT::v8i64,
DAG.getNode(X86ISD::PSHUFD, DL, MVT::v16i32,
DAG.getBitcast(MVT::v16i32, V1),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
}
SmallVector<int, 4> Repeated256Mask;
if (is256BitLaneRepeatedShuffleMask(MVT::v8i64, Mask, Repeated256Mask))
return DAG.getNode(X86ISD::VPERMI, DL, MVT::v8i64, V1,
getV4X86ShuffleImm8ForMask(Repeated256Mask, DL, DAG));
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v8i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use VALIGN.
if (SDValue Rotate = lowerVectorShuffleAsRotate(DL, MVT::v8i64, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
// Try to use PALIGNR.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(DL, MVT::v8i64, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
if (SDValue Unpck =
lowerVectorShuffleWithUNPCK(DL, MVT::v8i64, Mask, V1, V2, DAG))
return Unpck;
// If we have AVX512F support, we can use VEXPAND.
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v8i64, Zeroable, Mask, V1,
V2, DAG, Subtarget))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8i64, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
return lowerVectorShuffleWithPERMV(DL, MVT::v8i64, Mask, V1, V2, DAG);
}
/// \brief Handle lowering of 16-lane 32-bit integer shuffles.
static SDValue lowerV16I32VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v16i32 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v16i32 && "Bad operand type!");
assert(Mask.size() == 16 && "Unexpected mask size for v16 shuffle!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v16i32, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// If the shuffle mask is repeated in each 128-bit lane we can use more
// efficient instructions that mirror the shuffles across the four 128-bit
// lanes.
SmallVector<int, 4> RepeatedMask;
bool Is128BitLaneRepeatedShuffle =
is128BitLaneRepeatedShuffleMask(MVT::v16i32, Mask, RepeatedMask);
if (Is128BitLaneRepeatedShuffle) {
assert(RepeatedMask.size() == 4 && "Unexpected repeated mask size!");
if (V2.isUndef())
return DAG.getNode(X86ISD::PSHUFD, DL, MVT::v16i32, V1,
getV4X86ShuffleImm8ForMask(RepeatedMask, DL, DAG));
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v16i32, Mask, V1, V2, DAG))
return V;
}
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v16i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use VALIGN.
if (SDValue Rotate = lowerVectorShuffleAsRotate(DL, MVT::v16i32, V1, V2,
Mask, Subtarget, DAG))
return Rotate;
// Try to use byte rotation instructions.
if (Subtarget.hasBWI())
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v16i32, V1, V2, Mask, Subtarget, DAG))
return Rotate;
// Assume that a single SHUFPS is faster than using a permv shuffle.
// If some CPU is harmed by the domain switch, we can fix it in a later pass.
if (Is128BitLaneRepeatedShuffle && isSingleSHUFPSMask(RepeatedMask)) {
SDValue CastV1 = DAG.getBitcast(MVT::v16f32, V1);
SDValue CastV2 = DAG.getBitcast(MVT::v16f32, V2);
SDValue ShufPS = lowerVectorShuffleWithSHUFPS(DL, MVT::v16f32, RepeatedMask,
CastV1, CastV2, DAG);
return DAG.getBitcast(MVT::v16i32, ShufPS);
}
// If we have AVX512F support, we can use VEXPAND.
if (SDValue V = lowerVectorShuffleToEXPAND(DL, MVT::v16i32, Zeroable, Mask,
V1, V2, DAG, Subtarget))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v16i32, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
return lowerVectorShuffleWithPERMV(DL, MVT::v16i32, Mask, V1, V2, DAG);
}
/// \brief Handle lowering of 32-lane 16-bit integer shuffles.
static SDValue lowerV32I16VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v32i16 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v32i16 && "Bad operand type!");
assert(Mask.size() == 32 && "Unexpected mask size for v32 shuffle!");
assert(Subtarget.hasBWI() && "We can only lower v32i16 with AVX-512-BWI!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v32i16, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v32i16, Mask, V1, V2, DAG))
return V;
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v32i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v32i16, V1, V2, Mask, Subtarget, DAG))
return Rotate;
if (V2.isUndef()) {
SmallVector<int, 8> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v32i16, Mask, RepeatedMask)) {
// As this is a single-input shuffle, the repeated mask should be
// a strictly valid v8i16 mask that we can pass through to the v8i16
// lowering to handle even the v32 case.
return lowerV8I16GeneralSingleInputVectorShuffle(
DL, MVT::v32i16, V1, RepeatedMask, Subtarget, DAG);
}
}
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v32i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
return lowerVectorShuffleWithPERMV(DL, MVT::v32i16, Mask, V1, V2, DAG);
}
/// \brief Handle lowering of 64-lane 8-bit integer shuffles.
static SDValue lowerV64I8VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
const APInt &Zeroable,
SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(V1.getSimpleValueType() == MVT::v64i8 && "Bad operand type!");
assert(V2.getSimpleValueType() == MVT::v64i8 && "Bad operand type!");
assert(Mask.size() == 64 && "Unexpected mask size for v64 shuffle!");
assert(Subtarget.hasBWI() && "We can only lower v64i8 with AVX-512-BWI!");
// Whenever we can lower this as a zext, that instruction is strictly faster
// than any alternative. It also allows us to fold memory operands into the
// shuffle in many cases.
if (SDValue ZExt = lowerVectorShuffleAsZeroOrAnyExtend(
DL, MVT::v64i8, V1, V2, Mask, Zeroable, Subtarget, DAG))
return ZExt;
// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V =
lowerVectorShuffleWithUNPCK(DL, MVT::v64i8, Mask, V1, V2, DAG))
return V;
// Try to use shift instructions.
if (SDValue Shift = lowerVectorShuffleAsShift(DL, MVT::v64i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Shift;
// Try to use byte rotation instructions.
if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
DL, MVT::v64i8, V1, V2, Mask, Subtarget, DAG))
return Rotate;
if (SDValue PSHUFB = lowerVectorShuffleWithPSHUFB(
DL, MVT::v64i8, Mask, V1, V2, Zeroable, Subtarget, DAG))
return PSHUFB;
// VBMI can use VPERMV/VPERMV3 byte shuffles.
if (Subtarget.hasVBMI())
return lowerVectorShuffleWithPERMV(DL, MVT::v64i8, Mask, V1, V2, DAG);
// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v64i8, V1, V2, Mask, Subtarget, DAG))
return V;
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v64i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))
return Blend;
// FIXME: Implement direct support for this type!
return splitAndLowerVectorShuffle(DL, MVT::v64i8, V1, V2, Mask, DAG);
}
/// \brief High-level routine to lower various 512-bit x86 vector shuffles.
///
/// This routine either breaks down the specific type of a 512-bit x86 vector
/// shuffle or splits it into two 256-bit shuffles and fuses the results back
/// together based on the available instructions.
static SDValue lower512BitVectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
MVT VT, SDValue V1, SDValue V2,
const APInt &Zeroable,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX512() &&
"Cannot lower 512-bit vectors w/ basic ISA!");
// If we have a single input to the zero element, insert that into V1 if we
// can do so cheaply.
int NumElts = Mask.size();
int NumV2Elements = count_if(Mask, [NumElts](int M) { return M >= NumElts; });
if (NumV2Elements == 1 && Mask[0] >= NumElts)
if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
DL, VT, V1, V2, Mask, Zeroable, Subtarget, DAG))
return Insertion;
// Check for being able to broadcast a single element.
if (SDValue Broadcast =
lowerVectorShuffleAsBroadcast(DL, VT, V1, V2, Mask, Subtarget, DAG))
return Broadcast;
// Dispatch to each element type for lowering. If we don't have support for
// specific element type shuffles at 512 bits, immediately split them and
// lower them. Each lowering routine of a given type is allowed to assume that
// the requisite ISA extensions for that element type are available.
switch (VT.SimpleTy) {
case MVT::v8f64:
return lowerV8F64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v16f32:
return lowerV16F32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v8i64:
return lowerV8I64VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v16i32:
return lowerV16I32VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v32i16:
return lowerV32I16VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
case MVT::v64i8:
return lowerV64I8VectorShuffle(DL, Mask, Zeroable, V1, V2, Subtarget, DAG);
default:
llvm_unreachable("Not a valid 512-bit x86 vector type!");
}
}
// Lower vXi1 vector shuffles.
// There is no dedicated instruction on AVX-512 that shuffles the masks.
// The only way to shuffle bits is to sign-extend the mask vector to a SIMD
// vector, shuffle it, and then truncate it back.
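// e.g. a v16i1 shuffle sign-extends the inputs to v16i32, shuffles there, and
// truncates the result back to v16i1 (or uses X86ISD::CVT2MASK when available).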
static SDValue lower1BitVectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,
MVT VT, SDValue V1, SDValue V2,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX512() &&
"Cannot lower 512-bit vectors w/o basic ISA!");
MVT ExtVT;
switch (VT.SimpleTy) {
default:
llvm_unreachable("Expected a vector of i1 elements");
case MVT::v2i1:
ExtVT = MVT::v2i64;
break;
case MVT::v4i1:
ExtVT = MVT::v4i32;
break;
case MVT::v8i1:
ExtVT = MVT::v8i64; // Take a 512-bit type; more shuffles are available on KNL.
break;
case MVT::v16i1:
ExtVT = MVT::v16i32;
break;
case MVT::v32i1:
ExtVT = MVT::v32i16;
break;
case MVT::v64i1:
ExtVT = MVT::v64i8;
break;
}
if (ISD::isBuildVectorAllZeros(V1.getNode()))
V1 = getZeroVector(ExtVT, Subtarget, DAG, DL);
else if (ISD::isBuildVectorAllOnes(V1.getNode()))
V1 = getOnesVector(ExtVT, DAG, DL);
else
V1 = DAG.getNode(ISD::SIGN_EXTEND, DL, ExtVT, V1);
if (V2.isUndef())
V2 = DAG.getUNDEF(ExtVT);
else if (ISD::isBuildVectorAllZeros(V2.getNode()))
V2 = getZeroVector(ExtVT, Subtarget, DAG, DL);
else if (ISD::isBuildVectorAllOnes(V2.getNode()))
V2 = getOnesVector(ExtVT, DAG, DL);
else
V2 = DAG.getNode(ISD::SIGN_EXTEND, DL, ExtVT, V2);
SDValue Shuffle = DAG.getVectorShuffle(ExtVT, DL, V1, V2, Mask);
// Since i1 was sign-extended, we can use X86ISD::CVT2MASK.
int NumElems = VT.getVectorNumElements();
if ((Subtarget.hasBWI() && (NumElems >= 32)) ||
(Subtarget.hasDQI() && (NumElems < 32)))
return DAG.getNode(X86ISD::CVT2MASK, DL, VT, Shuffle);
return DAG.getNode(ISD::TRUNCATE, DL, VT, Shuffle);
}
/// Helper function that returns true if the shuffle mask should be
/// commuted to improve canonicalization.
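///
/// e.g. for a 4-element mask <4, 5, 6, 3>, three elements come from V2 and
/// only one from V1, so commuting lets the lowering assume most elements come
/// from V1.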
static bool canonicalizeShuffleMaskWithCommute(ArrayRef<int> Mask) {
int NumElements = Mask.size();
int NumV1Elements = 0, NumV2Elements = 0;
for (int M : Mask)
if (M < 0)
continue;
else if (M < NumElements)
++NumV1Elements;
else
++NumV2Elements;
// Commute the shuffle as needed such that more elements come from V1 than
// V2. This allows us to match the shuffle pattern strictly on how many
// elements come from V1 without handling the symmetric cases.
if (NumV2Elements > NumV1Elements)
return true;
assert(NumV1Elements > 0 && "No V1 indices");
if (NumV2Elements == 0)
return false;
// When the number of V1 and V2 elements are the same, try to minimize the
// number of uses of V2 in the low half of the vector. When that is tied,
// ensure that the sum of indices for V1 is equal to or lower than the sum of
// indices for V2. When those are equal, try to ensure that the number of odd
// indices for V1 is lower than the number of odd indices for V2.
if (NumV1Elements == NumV2Elements) {
int LowV1Elements = 0, LowV2Elements = 0;
for (int M : Mask.slice(0, NumElements / 2))
if (M >= NumElements)
++LowV2Elements;
else if (M >= 0)
++LowV1Elements;
if (LowV2Elements > LowV1Elements)
return true;
if (LowV2Elements == LowV1Elements) {
int SumV1Indices = 0, SumV2Indices = 0;
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (Mask[i] >= NumElements)
SumV2Indices += i;
else if (Mask[i] >= 0)
SumV1Indices += i;
if (SumV2Indices < SumV1Indices)
return true;
if (SumV2Indices == SumV1Indices) {
int NumV1OddIndices = 0, NumV2OddIndices = 0;
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (Mask[i] >= NumElements)
NumV2OddIndices += i % 2;
else if (Mask[i] >= 0)
NumV1OddIndices += i % 2;
if (NumV2OddIndices < NumV1OddIndices)
return true;
}
}
}
return false;
}
/// \brief Top-level lowering for x86 vector shuffles.
///
/// This handles decomposition, canonicalization, and lowering of all x86
/// vector shuffles. Most of the specific lowering strategies are encapsulated
/// above in helper routines. The canonicalization attempts to widen shuffles
/// to involve fewer lanes of wider elements, consolidate symmetric patterns
/// so that only one of the two inputs needs to be tested, etc.
static SDValue lowerVectorShuffle(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
ArrayRef<int> Mask = SVOp->getMask();
SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);
MVT VT = Op.getSimpleValueType();
int NumElements = VT.getVectorNumElements();
SDLoc DL(Op);
bool Is1BitVector = (VT.getVectorElementType() == MVT::i1);
assert((VT.getSizeInBits() != 64 || Is1BitVector) &&
"Can't lower MMX shuffles");
bool V1IsUndef = V1.isUndef();
bool V2IsUndef = V2.isUndef();
if (V1IsUndef && V2IsUndef)
return DAG.getUNDEF(VT);
// When we create a shuffle node we put the UNDEF node in the second operand,
// but in some cases the first operand may be transformed to UNDEF.
// In this case we should just commute the node.
if (V1IsUndef)
return DAG.getCommutedVectorShuffle(*SVOp);
// Check for non-undef masks pointing at an undef vector and make the masks
// undef as well. This makes it easier to match the shuffle based solely on
// the mask.
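// For example, with V2 undef, a 4-element mask of <0,5,2,7> becomes
// <0,-1,2,-1>.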
if (V2IsUndef)
for (int M : Mask)
if (M >= NumElements) {
SmallVector<int, 8> NewMask(Mask.begin(), Mask.end());
for (int &M : NewMask)
if (M >= NumElements)
M = -1;
return DAG.getVectorShuffle(VT, DL, V1, V2, NewMask);
}
// Check for illegal shuffle mask element index values.
int MaskUpperLimit = Mask.size() * (V2IsUndef ? 1 : 2);
(void)MaskUpperLimit;
assert(llvm::all_of(Mask,
[&](int M) { return -1 <= M && M < MaskUpperLimit; }) &&
"Out of bounds shuffle index");
// We actually see shuffles that are entirely re-arrangements of a set of
// zero inputs. This mostly happens while decomposing complex shuffles into
// simple ones. Directly lower these as a buildvector of zeros.
APInt Zeroable = computeZeroableShuffleElements(Mask, V1, V2);
if (Zeroable.isAllOnesValue())
return getZeroVector(VT, Subtarget, DAG, DL);
// Try to collapse shuffles into using a vector type with fewer elements but
// wider element types. We cap this to not form integers or floating point
// elements wider than 64 bits, but it might be interesting to form i128
// integers to handle flipping the low and high halves of AVX 256-bit vectors.
SmallVector<int, 16> WidenedMask;
if (VT.getScalarSizeInBits() < 64 && !Is1BitVector &&
canWidenShuffleElements(Mask, WidenedMask)) {
MVT NewEltVT = VT.isFloatingPoint()
? MVT::getFloatingPointVT(VT.getScalarSizeInBits() * 2)
: MVT::getIntegerVT(VT.getScalarSizeInBits() * 2);
MVT NewVT = MVT::getVectorVT(NewEltVT, VT.getVectorNumElements() / 2);
// Make sure that the new vector type is legal. For example, v2f64 isn't
// legal on SSE1.
if (DAG.getTargetLoweringInfo().isTypeLegal(NewVT)) {
V1 = DAG.getBitcast(NewVT, V1);
V2 = DAG.getBitcast(NewVT, V2);
return DAG.getBitcast(
VT, DAG.getVectorShuffle(NewVT, DL, V1, V2, WidenedMask));
}
}
// Commute the shuffle if it will improve canonicalization.
if (canonicalizeShuffleMaskWithCommute(Mask))
return DAG.getCommutedVectorShuffle(*SVOp);
// For each vector width, delegate to a specialized lowering routine.
if (VT.is128BitVector())
return lower128BitVectorShuffle(DL, Mask, VT, V1, V2, Zeroable, Subtarget,
DAG);
if (VT.is256BitVector())
return lower256BitVectorShuffle(DL, Mask, VT, V1, V2, Zeroable, Subtarget,
DAG);
if (VT.is512BitVector())
return lower512BitVectorShuffle(DL, Mask, VT, V1, V2, Zeroable, Subtarget,
DAG);
if (Is1BitVector)
return lower1BitVectorShuffle(DL, Mask, VT, V1, V2, Subtarget, DAG);
llvm_unreachable("Unimplemented!");
}
/// \brief Try to lower a VSELECT instruction to a vector shuffle.
static SDValue lowerVSELECTtoVectorShuffle(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue Cond = Op.getOperand(0);
SDValue LHS = Op.getOperand(1);
SDValue RHS = Op.getOperand(2);
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
if (!ISD::isBuildVectorOfConstantSDNodes(Cond.getNode()))
return SDValue();
auto *CondBV = cast<BuildVectorSDNode>(Cond);
// Only non-legal VSELECTs reach this lowering, convert those into generic
// shuffles and re-use the shuffle lowering path for blends.
SmallVector<int, 32> Mask;
for (int i = 0, Size = VT.getVectorNumElements(); i < Size; ++i) {
SDValue CondElt = CondBV->getOperand(i);
Mask.push_back(
isa<ConstantSDNode>(CondElt) ? i + (isNullConstant(CondElt) ? Size : 0)
: -1);
}
return DAG.getVectorShuffle(VT, dl, LHS, RHS, Mask);
}
SDValue X86TargetLowering::LowerVSELECT(SDValue Op, SelectionDAG &DAG) const {
// A vselect where all conditions and data are constants can be optimized into
// a single vector load by SelectionDAGLegalize::ExpandBUILD_VECTOR().
if (ISD::isBuildVectorOfConstantSDNodes(Op.getOperand(0).getNode()) &&
ISD::isBuildVectorOfConstantSDNodes(Op.getOperand(1).getNode()) &&
ISD::isBuildVectorOfConstantSDNodes(Op.getOperand(2).getNode()))
return SDValue();
// If this VSELECT has a vector of i1 as a mask, it will be directly matched
// with patterns on the mask registers on AVX-512.
if (Op->getOperand(0).getValueType().getScalarSizeInBits() == 1)
return Op;
// Try to lower this to a blend-style vector shuffle. This can handle all
// constant condition cases.
if (SDValue BlendOp = lowerVSELECTtoVectorShuffle(Op, Subtarget, DAG))
return BlendOp;
// Variable blends are only legal from SSE4.1 onward.
if (!Subtarget.hasSSE41())
return SDValue();
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
// If the VSELECT is on a 512-bit type, we have to convert a non-i1 condition
// into an i1 condition so that we can use the mask-based 512-bit blend
// instructions.
if (VT.getSizeInBits() == 512) {
SDValue Cond = Op.getOperand(0);
// The vNi1 condition case should be handled above as it can be trivially
// lowered.
assert(Cond.getValueType().getScalarSizeInBits() ==
VT.getScalarSizeInBits() &&
"Should have a size-matched integer condition!");
// Build a mask by testing the condition against itself; a lane's mask bit
// is set iff that lane of the condition is nonzero.
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
SDValue Mask = DAG.getNode(X86ISD::TESTM, dl, MaskVT, Cond, Cond);
// Now return a new VSELECT using the mask.
return DAG.getSelect(dl, VT, Mask, Op.getOperand(1), Op.getOperand(2));
}
// Only some types will be legal on some subtargets. If we can emit a legal
// VSELECT-matching blend, return Op, but if we need to expand, return
// a null value.
switch (VT.SimpleTy) {
default:
// Most of the vector types have blends past SSE4.1.
return Op;
case MVT::v32i8:
// The byte blends for AVX vectors were introduced only in AVX2.
if (Subtarget.hasAVX2())
return Op;
return SDValue();
case MVT::v8i16:
case MVT::v16i16:
// AVX-512 BWI and VLX features support VSELECT with i16 elements.
if (Subtarget.hasBWI() && Subtarget.hasVLX())
return Op;
// FIXME: We should custom lower this by fixing the condition and using i8
// blends.
return SDValue();
}
}
static SDValue LowerEXTRACT_VECTOR_ELT_SSE4(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
if (!Op.getOperand(0).getSimpleValueType().is128BitVector())
return SDValue();
if (VT.getSizeInBits() == 8) {
SDValue Extract = DAG.getNode(X86ISD::PEXTRB, dl, MVT::i32,
Op.getOperand(0), Op.getOperand(1));
SDValue Assert = DAG.getNode(ISD::AssertZext, dl, MVT::i32, Extract,
DAG.getValueType(VT));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Assert);
}
if (VT == MVT::f32) {
// EXTRACTPS outputs to a GPR32 register which will require a movd to copy
// the result back to an FR32 register. It's only worth matching if the
// result has a single use which is a store or a bitcast to i32. And in
// the case of a store, it's not worth it if the index is a constant 0,
// because a MOVSSmr can be used instead, which is smaller and faster.
if (!Op.hasOneUse())
return SDValue();
SDNode *User = *Op.getNode()->use_begin();
if ((User->getOpcode() != ISD::STORE ||
isNullConstant(Op.getOperand(1))) &&
(User->getOpcode() != ISD::BITCAST ||
User->getValueType(0) != MVT::i32))
return SDValue();
SDValue Extract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32,
DAG.getBitcast(MVT::v4i32, Op.getOperand(0)),
Op.getOperand(1));
return DAG.getBitcast(MVT::f32, Extract);
}
if (VT == MVT::i32 || VT == MVT::i64) {
// EXTRACTPS/PEXTRQ work with a constant index.
if (isa<ConstantSDNode>(Op.getOperand(1)))
return Op;
}
return SDValue();
}
/// Extract one bit from a mask vector, like v16i1 or v8i1.
/// AVX-512 feature.
SDValue
X86TargetLowering::ExtractBitFromMaskVector(SDValue Op, SelectionDAG &DAG) const {
SDValue Vec = Op.getOperand(0);
SDLoc dl(Vec);
MVT VecVT = Vec.getSimpleValueType();
SDValue Idx = Op.getOperand(1);
MVT EltVT = Op.getSimpleValueType();
assert((VecVT.getVectorNumElements() <= 16 || Subtarget.hasBWI()) &&
"Unexpected vector type in ExtractBitFromMaskVector");
// A variable index can't be handled in mask registers;
// extend the vector to VR512/VR128.
if (!isa<ConstantSDNode>(Idx)) {
unsigned NumElts = VecVT.getVectorNumElements();
// Extending v8i1/v16i1 to 512 bits gets better performance on KNL
// than extending to 128/256 bits.
unsigned VecSize = (NumElts <= 4 ? 128 : 512);
MVT ExtVT = MVT::getVectorVT(MVT::getIntegerVT(VecSize/NumElts), NumElts);
SDValue Ext = DAG.getNode(ISD::SIGN_EXTEND, dl, ExtVT, Vec);
SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl,
ExtVT.getVectorElementType(), Ext, Idx);
return DAG.getNode(ISD::TRUNCATE, dl, EltVT, Elt);
}
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
if ((!Subtarget.hasDQI() && (VecVT.getVectorNumElements() == 8)) ||
(VecVT.getVectorNumElements() < 8)) {
// Use the kshiftlw/rw instructions.
VecVT = MVT::v16i1;
Vec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, VecVT,
DAG.getUNDEF(VecVT),
Vec,
DAG.getIntPtrConstant(0, dl));
}
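// Move the requested bit to the MSB of the mask register with KSHIFTL, then
// down to bit 0 with KSHIFTR; every other bit is shifted out along the way.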
unsigned MaxShift = VecVT.getVectorNumElements() - 1;
if (MaxShift != IdxVal)
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, Vec,
DAG.getConstant(MaxShift - IdxVal, dl, MVT::i8));
Vec = DAG.getNode(X86ISD::KSHIFTR, dl, VecVT, Vec,
DAG.getConstant(MaxShift, dl, MVT::i8));
return DAG.getNode(X86ISD::VEXTRACT, dl, Op.getSimpleValueType(), Vec,
DAG.getIntPtrConstant(0, dl));
}
SDValue
X86TargetLowering::LowerEXTRACT_VECTOR_ELT(SDValue Op,
SelectionDAG &DAG) const {
SDLoc dl(Op);
SDValue Vec = Op.getOperand(0);
MVT VecVT = Vec.getSimpleValueType();
SDValue Idx = Op.getOperand(1);
if (VecVT.getVectorElementType() == MVT::i1)
return ExtractBitFromMaskVector(Op, DAG);
if (!isa<ConstantSDNode>(Idx)) {
// It's more profitable to go through memory (1 cycle throughput)
// than to use a VMOVD + VPERMV/PSHUFB sequence (2/3 cycles throughput).
// The IACA tool was used to get these performance estimates
// (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
//
// example : extractelement <16 x i8> %a, i32 %i
//
// Block Throughput: 3.00 Cycles
// Throughput Bottleneck: Port5
//
// | Num Of | Ports pressure in cycles | |
// | Uops | 0 - DV | 5 | 6 | 7 | |
// ---------------------------------------------
// | 1 | | 1.0 | | | CP | vmovd xmm1, edi
// | 1 | | 1.0 | | | CP | vpshufb xmm0, xmm0, xmm1
// | 2 | 1.0 | 1.0 | | | CP | vpextrb eax, xmm0, 0x0
// Total Num Of Uops: 4
//
//
// Block Throughput: 1.00 Cycles
// Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4
//
// | | Ports pressure in cycles | |
// |Uops| 1 | 2 - D |3 - D | 4 | 5 | |
// ---------------------------------------------------------
// |2^ | | 0.5 | 0.5 |1.0| |CP| vmovaps xmmword ptr [rsp-0x18], xmm0
// |1 |0.5| | | |0.5| | lea rax, ptr [rsp-0x18]
// |1 | |0.5, 0.5|0.5, 0.5| | |CP| mov al, byte ptr [rdi+rax*1]
// Total Num Of Uops: 4
return SDValue();
}
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
// If this is a 256-bit vector result, first extract the 128-bit vector and
// then extract the element from the 128-bit vector.
if (VecVT.is256BitVector() || VecVT.is512BitVector()) {
// Get the 128-bit vector.
Vec = extract128BitVector(Vec, IdxVal, DAG, dl);
MVT EltVT = VecVT.getVectorElementType();
unsigned ElemsPerChunk = 128 / EltVT.getSizeInBits();
assert(isPowerOf2_32(ElemsPerChunk) && "Elements per chunk not power of 2");
// Find IdxVal modulo ElemsPerChunk. Since ElemsPerChunk is a power of 2
// this can be done with a mask.
IdxVal &= ElemsPerChunk - 1;
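// For example, extracting element 5 of a v8i32 extracts the upper 128-bit
// half (elements 4..7) above and then reads lane 5 & 3 == 1 from it here.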
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, Op.getValueType(), Vec,
DAG.getConstant(IdxVal, dl, MVT::i32));
}
assert(VecVT.is128BitVector() && "Unexpected vector length");
MVT VT = Op.getSimpleValueType();
if (VT.getSizeInBits() == 16) {
// If IdxVal is 0, it's cheaper to do a move instead of a pextrw, unless
// we're going to zero extend the register or fold the store (SSE41 only).
if (IdxVal == 0 && !MayFoldIntoZeroExtend(Op) &&
!(Subtarget.hasSSE41() && MayFoldIntoStore(Op)))
return DAG.getNode(ISD::TRUNCATE, dl, MVT::i16,
DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32,
DAG.getBitcast(MVT::v4i32, Vec), Idx));
// Transform it so it matches PEXTRW, which produces a 32-bit result.
SDValue Extract = DAG.getNode(X86ISD::PEXTRW, dl, MVT::i32,
Op.getOperand(0), Op.getOperand(1));
SDValue Assert = DAG.getNode(ISD::AssertZext, dl, MVT::i32, Extract,
DAG.getValueType(VT));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Assert);
}
if (Subtarget.hasSSE41())
if (SDValue Res = LowerEXTRACT_VECTOR_ELT_SSE4(Op, DAG))
return Res;
// TODO: We only extract a single element from v16i8, we can probably afford
// to be more aggressive here before using the default approach of spilling to
// stack.
if (VT.getSizeInBits() == 8 && Op->isOnlyUserOf(Vec.getNode())) {
// Extract either the lowest i32 or any i16, and extract the sub-byte.
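// For example, extracting byte 6 reads word 3 (bytes 6..7) and truncates,
// while extracting byte 1 reads dword 0 and shifts right by 8 first.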
int DWordIdx = IdxVal / 4;
if (DWordIdx == 0) {
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32,
DAG.getBitcast(MVT::v4i32, Vec),
DAG.getIntPtrConstant(DWordIdx, dl));
int ShiftVal = (IdxVal % 4) * 8;
if (ShiftVal != 0)
Res = DAG.getNode(ISD::SRL, dl, MVT::i32, Res,
DAG.getConstant(ShiftVal, dl, MVT::i32));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Res);
}
int WordIdx = IdxVal / 2;
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i16,
DAG.getBitcast(MVT::v8i16, Vec),
DAG.getIntPtrConstant(WordIdx, dl));
int ShiftVal = (IdxVal % 2) * 8;
if (ShiftVal != 0)
Res = DAG.getNode(ISD::SRL, dl, MVT::i16, Res,
DAG.getConstant(ShiftVal, dl, MVT::i16));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Res);
}
if (VT.getSizeInBits() == 32) {
if (IdxVal == 0)
return Op;
// SHUFPS the element to the lowest double word, then movss.
int Mask[4] = { static_cast<int>(IdxVal), -1, -1, -1 };
Vec = DAG.getVectorShuffle(VecVT, dl, Vec, DAG.getUNDEF(VecVT), Mask);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, Vec,
DAG.getIntPtrConstant(0, dl));
}
if (VT.getSizeInBits() == 64) {
// FIXME: .td only matches this for <2 x f64>, not <2 x i64> on 32b
// FIXME: seems like this should be unnecessary if mov{h,l}pd were taught
// to match extract_elt for f64.
if (IdxVal == 0)
return Op;
// UNPCKHPD the element to the lowest double word, then movsd.
// Note if the lower 64 bits of the result of the UNPCKHPD is then stored
// to a f64mem, the whole operation is folded into a single MOVHPDmr.
int Mask[2] = { 1, -1 };
Vec = DAG.getVectorShuffle(VecVT, dl, Vec, DAG.getUNDEF(VecVT), Mask);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, Vec,
DAG.getIntPtrConstant(0, dl));
}
return SDValue();
}
/// Insert one bit into a mask vector, like v16i1 or v8i1.
/// AVX-512 feature.
SDValue
X86TargetLowering::InsertBitToMaskVector(SDValue Op, SelectionDAG &DAG) const {
SDLoc dl(Op);
SDValue Vec = Op.getOperand(0);
SDValue Elt = Op.getOperand(1);
SDValue Idx = Op.getOperand(2);
MVT VecVT = Vec.getSimpleValueType();
if (!isa<ConstantSDNode>(Idx)) {
// Non-constant index: extend the source and destination,
// insert the element, and then truncate the result.
MVT ExtVecVT = (VecVT == MVT::v8i1 ? MVT::v8i64 : MVT::v16i32);
MVT ExtEltVT = (VecVT == MVT::v8i1 ? MVT::i64 : MVT::i32);
SDValue ExtOp = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, ExtVecVT,
DAG.getNode(ISD::ZERO_EXTEND, dl, ExtVecVT, Vec),
DAG.getNode(ISD::ZERO_EXTEND, dl, ExtEltVT, Elt), Idx);
return DAG.getNode(ISD::TRUNCATE, dl, VecVT, ExtOp);
}
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
SDValue EltInVec = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT, Elt);
unsigned NumElems = VecVT.getVectorNumElements();
if (Vec.isUndef()) {
if (IdxVal)
EltInVec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, EltInVec,
DAG.getConstant(IdxVal, dl, MVT::i8));
return EltInVec;
}
// Insertion of one bit into the first position.
if (IdxVal == 0) {
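// For example, in a v8i1 the two KSHIFTs below reduce EltInVec to just its
// bit 0 (left then right by NumElems - 1), while shifting Vec right then
// left by one clears its bit 0; the OR merges the new bit into the vector.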
// Clean top bits of vector.
EltInVec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, EltInVec,
DAG.getConstant(NumElems - 1, dl, MVT::i8));
EltInVec = DAG.getNode(X86ISD::KSHIFTR, dl, VecVT, EltInVec,
DAG.getConstant(NumElems - 1, dl, MVT::i8));
// Clean the first bit in source vector.
Vec = DAG.getNode(X86ISD::KSHIFTR, dl, VecVT, Vec,
DAG.getConstant(1, dl, MVT::i8));
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, Vec,
DAG.getConstant(1, dl, MVT::i8));
return DAG.getNode(ISD::OR, dl, VecVT, Vec, EltInVec);
}
// Insertion of one bit into the last position.
if (IdxVal == NumElems - 1) {
// Move the bit to the last position inside the vector.
EltInVec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, EltInVec,
DAG.getConstant(IdxVal, dl, MVT::i8));
// Clean the last bit in the source vector.
Vec = DAG.getNode(X86ISD::KSHIFTL, dl, VecVT, Vec,
DAG.getConstant(1, dl, MVT::i8));
Vec = DAG.getNode(X86ISD::KSHIFTR, dl, VecVT, Vec,
DAG.getConstant(1, dl, MVT::i8));
return DAG.getNode(ISD::OR, dl, VecVT, Vec, EltInVec);
}
// Use shuffle to insert element.
SmallVector<int, 64> MaskVec(NumElems);
for (unsigned i = 0; i != NumElems; ++i)
MaskVec[i] = (i == IdxVal) ? NumElems : i;
return DAG.getVectorShuffle(VecVT, dl, Vec, EltInVec, MaskVec);
}
SDValue X86TargetLowering::LowerINSERT_VECTOR_ELT(SDValue Op,
SelectionDAG &DAG) const {
MVT VT = Op.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
unsigned NumElts = VT.getVectorNumElements();
if (EltVT == MVT::i1)
return InsertBitToMaskVector(Op, DAG);
SDLoc dl(Op);
SDValue N0 = Op.getOperand(0);
SDValue N1 = Op.getOperand(1);
SDValue N2 = Op.getOperand(2);
if (!isa<ConstantSDNode>(N2))
return SDValue();
auto *N2C = cast<ConstantSDNode>(N2);
unsigned IdxVal = N2C->getZExtValue();
bool IsZeroElt = X86::isZeroNode(N1);
bool IsAllOnesElt = VT.isInteger() && llvm::isAllOnesConstant(N1);
// If we are inserting an element, see if we can do this more efficiently
// with a blend shuffle against a rematerializable vector than with a costly
// integer insertion.
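// For example, inserting zero into lane 2 of a v4i32 becomes a blend with
// mask <0,1,6,3> against an all-zeros second operand.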
if ((IsZeroElt || IsAllOnesElt) && Subtarget.hasSSE41() &&
16 <= EltVT.getSizeInBits()) {
SmallVector<int, 8> BlendMask;
for (unsigned i = 0; i != NumElts; ++i)
BlendMask.push_back(i == IdxVal ? i + NumElts : i);
SDValue CstVector = IsZeroElt ? getZeroVector(VT, Subtarget, DAG, dl)
: DAG.getConstant(-1, dl, VT);
return DAG.getVectorShuffle(VT, dl, N0, CstVector, BlendMask);
}
// If the vector is wider than 128 bits, extract the 128-bit subvector, insert
// into that, and then insert the subvector back into the result.
if (VT.is256BitVector() || VT.is512BitVector()) {
// With a 256-bit vector, we can insert into the zero element efficiently
// using a blend if we have AVX or AVX2 and the right data type.
if (VT.is256BitVector() && IdxVal == 0) {
// TODO: It is worthwhile to cast integer to floating point and back
// and incur a domain crossing penalty if that's what we'll end up
// doing anyway after extracting to a 128-bit vector.
if ((Subtarget.hasAVX() && (EltVT == MVT::f64 || EltVT == MVT::f32)) ||
(Subtarget.hasAVX2() && EltVT == MVT::i32)) {
SDValue N1Vec = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, N1);
N2 = DAG.getIntPtrConstant(1, dl);
return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1Vec, N2);
}
}
// Get the desired 128-bit vector chunk.
SDValue V = extract128BitVector(N0, IdxVal, DAG, dl);
// Insert the element into the desired chunk.
unsigned NumEltsIn128 = 128 / EltVT.getSizeInBits();
assert(isPowerOf2_32(NumEltsIn128));
// Since NumEltsIn128 is a power of 2 we can use mask instead of modulo.
unsigned IdxIn128 = IdxVal & (NumEltsIn128 - 1);
V = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, V.getValueType(), V, N1,
DAG.getConstant(IdxIn128, dl, MVT::i32));
// Insert the changed part back into the bigger vector
return insert128BitVector(N0, V, IdxVal, DAG, dl);
}
assert(VT.is128BitVector() && "Only 128-bit vector types should be left!");
// Transform it so it matches pinsr{b,w}, which expects a GR32 as its second
// argument. SSE41 required for pinsrb.
if (VT == MVT::v8i16 || (VT == MVT::v16i8 && Subtarget.hasSSE41())) {
unsigned Opc;
if (VT == MVT::v8i16) {
assert(Subtarget.hasSSE2() && "SSE2 required for PINSRW");
Opc = X86ISD::PINSRW;
} else {
assert(VT == MVT::v16i8 && "PINSRB requires v16i8 vector");
assert(Subtarget.hasSSE41() && "SSE41 required for PINSRB");
Opc = X86ISD::PINSRB;
}
if (N1.getValueType() != MVT::i32)
N1 = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32, N1);
if (N2.getValueType() != MVT::i32)
N2 = DAG.getIntPtrConstant(IdxVal, dl);
return DAG.getNode(Opc, dl, VT, N0, N1, N2);
}
if (Subtarget.hasSSE41()) {
if (EltVT == MVT::f32) {
// Bits [7:6] of the constant are the source select. This will always be
// zero here. The DAG Combiner may combine an extract_elt index into
// these bits. For example (insert (extract, 3), 2) could be matched by
// putting the '3' into bits [7:6] of X86ISD::INSERTPS.
// Bits [5:4] of the constant are the destination select. This is the
// value of the incoming immediate.
// Bits [3:0] of the constant are the zero mask. The DAG Combiner may
// combine either bitwise AND or insert of float 0.0 to set these bits.
bool MinSize = DAG.getMachineFunction().getFunction()->optForMinSize();
if (IdxVal == 0 && (!MinSize || !MayFoldLoad(N1))) {
// If this is an insertion of 32-bits into the low 32-bits of
// a vector, we prefer to generate a blend with immediate rather
// than an insertps. Blends are simpler operations in hardware and so
// will always have equal or better performance than insertps.
// But if optimizing for size and there's a load folding opportunity,
// generate insertps because blendps does not have a 32-bit memory
// operand form.
N2 = DAG.getIntPtrConstant(1, dl);
N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);
return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1, N2);
}
N2 = DAG.getIntPtrConstant(IdxVal << 4, dl);
// Create this as a scalar-to-vector.
N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);
return DAG.getNode(X86ISD::INSERTPS, dl, VT, N0, N1, N2);
}
// PINSR* works with constant index.
if (EltVT == MVT::i32 || EltVT == MVT::i64)
return Op;
}
return SDValue();
}
static SDValue LowerSCALAR_TO_VECTOR(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc dl(Op);
MVT OpVT = Op.getSimpleValueType();
// It's always cheaper to replace an xor+movd with xorps, and this also
// simplifies further combines.
if (X86::isZeroNode(Op.getOperand(0)))
return getZeroVector(OpVT, Subtarget, DAG, dl);
// If this is a 256-bit vector result, first insert into a 128-bit
// vector and then insert into the 256-bit vector.
if (!OpVT.is128BitVector()) {
// Insert into a 128-bit vector.
unsigned SizeFactor = OpVT.getSizeInBits() / 128;
MVT VT128 = MVT::getVectorVT(OpVT.getVectorElementType(),
OpVT.getVectorNumElements() / SizeFactor);
Op = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT128, Op.getOperand(0));
// Insert the 128-bit vector.
return insert128BitVector(DAG.getUNDEF(OpVT), Op, 0, DAG, dl);
}
assert(OpVT.is128BitVector() && "Expected an SSE type!");
// Pass through a v4i32 SCALAR_TO_VECTOR as that's what we use in tblgen.
if (OpVT == MVT::v4i32)
return Op;
SDValue AnyExt = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32, Op.getOperand(0));
return DAG.getBitcast(
OpVT, DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32, AnyExt));
}
// Lower a node with an EXTRACT_SUBVECTOR opcode. This may result in
// a simple subregister reference or explicit instructions to grab
// upper bits of a vector.
static SDValue LowerEXTRACT_SUBVECTOR(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX() && "EXTRACT_SUBVECTOR requires AVX");
SDLoc dl(Op);
SDValue In = Op.getOperand(0);
SDValue Idx = Op.getOperand(1);
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
MVT ResVT = Op.getSimpleValueType();
// When v1i1 is legal, a scalarization of a vselect with a vXi1 Cond
// would result in: v1i1 = extract_subvector(vXi1, idx).
// Lower these into extract_vector_elt which is already selectable.
if (ResVT == MVT::v1i1) {
assert(Subtarget.hasAVX512() &&
"Boolean EXTRACT_SUBVECTOR requires AVX512");
MVT EltVT = ResVT.getVectorElementType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
MVT LegalVT =
(TLI.getTypeToTransformTo(*DAG.getContext(), EltVT)).getSimpleVT();
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, LegalVT, In, Idx);
return DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, ResVT, Res);
}
assert((In.getSimpleValueType().is256BitVector() ||
In.getSimpleValueType().is512BitVector()) &&
"Can only extract from 256-bit or 512-bit vectors");
// If the input is a buildvector, just emit a smaller one.
unsigned ElemsPerChunk = ResVT.getVectorNumElements();
if (In.getOpcode() == ISD::BUILD_VECTOR)
return DAG.getBuildVector(
ResVT, dl, makeArrayRef(In->op_begin() + IdxVal, ElemsPerChunk));
// Everything else is legal.
return Op;
}
// Lower a node with an INSERT_SUBVECTOR opcode. This may result in a
// simple superregister reference or explicit instructions to insert
// the upper bits of a vector.
static SDValue LowerINSERT_SUBVECTOR(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Op.getSimpleValueType().getVectorElementType() == MVT::i1);
return insert1BitVector(Op, DAG, Subtarget);
}
// Returns the appropriate wrapper opcode for a global reference.
unsigned X86TargetLowering::getGlobalWrapperKind(const GlobalValue *GV) const {
// References to absolute symbols are never PC-relative.
if (GV && GV->isAbsoluteSymbolRef())
return X86ISD::Wrapper;
CodeModel::Model M = getTargetMachine().getCodeModel();
if (Subtarget.isPICStyleRIPRel() &&
(M == CodeModel::Small || M == CodeModel::Kernel))
return X86ISD::WrapperRIP;
return X86ISD::Wrapper;
}
// ConstantPool, JumpTable, GlobalAddress, and ExternalSymbol are lowered as
// their target counterpart wrapped in the X86ISD::Wrapper node. Suppose N is
// one of the above-mentioned nodes. It has to be wrapped because otherwise
// Select(N) returns N. So the raw TargetGlobalAddress nodes, etc. can only
// be used to form an addressing mode. These wrapped nodes will be selected
// into MOV32ri.
SDValue
X86TargetLowering::LowerConstantPool(SDValue Op, SelectionDAG &DAG) const {
ConstantPoolSDNode *CP = cast<ConstantPoolSDNode>(Op);
// In PIC mode (unless we're in RIPRel PIC mode) we add an offset to the
// global base reg.
unsigned char OpFlag = Subtarget.classifyLocalReference(nullptr);
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result = DAG.getTargetConstantPool(
CP->getConstVal(), PtrVT, CP->getAlignment(), CP->getOffset(), OpFlag);
SDLoc DL(CP);
Result = DAG.getNode(getGlobalWrapperKind(), DL, PtrVT, Result);
// With PIC, the address is actually $g + Offset.
if (OpFlag) {
Result =
DAG.getNode(ISD::ADD, DL, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT), Result);
}
return Result;
}
SDValue X86TargetLowering::LowerJumpTable(SDValue Op, SelectionDAG &DAG) const {
JumpTableSDNode *JT = cast<JumpTableSDNode>(Op);
// In PIC mode (unless we're in RIPRel PIC mode) we add an offset to the
// global base reg.
unsigned char OpFlag = Subtarget.classifyLocalReference(nullptr);
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result = DAG.getTargetJumpTable(JT->getIndex(), PtrVT, OpFlag);
SDLoc DL(JT);
Result = DAG.getNode(getGlobalWrapperKind(), DL, PtrVT, Result);
// With PIC, the address is actually $g + Offset.
if (OpFlag)
Result =
DAG.getNode(ISD::ADD, DL, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT), Result);
return Result;
}
SDValue
X86TargetLowering::LowerExternalSymbol(SDValue Op, SelectionDAG &DAG) const {
const char *Sym = cast<ExternalSymbolSDNode>(Op)->getSymbol();
// In PIC mode (unless we're in RIPRel PIC mode) we add an offset to the
// global base reg.
const Module *Mod = DAG.getMachineFunction().getFunction()->getParent();
unsigned char OpFlag = Subtarget.classifyGlobalReference(nullptr, *Mod);
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result = DAG.getTargetExternalSymbol(Sym, PtrVT, OpFlag);
SDLoc DL(Op);
Result = DAG.getNode(getGlobalWrapperKind(), DL, PtrVT, Result);
// With PIC, the address is actually $g + Offset.
if (isPositionIndependent() && !Subtarget.is64Bit()) {
Result =
DAG.getNode(ISD::ADD, DL, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT), Result);
}
// For symbols that require a load from a stub to get the address, emit the
// load.
if (isGlobalStubReference(OpFlag))
Result = DAG.getLoad(PtrVT, DL, DAG.getEntryNode(), Result,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
return Result;
}
SDValue
X86TargetLowering::LowerBlockAddress(SDValue Op, SelectionDAG &DAG) const {
// Create the TargetBlockAddress node.
unsigned char OpFlags =
Subtarget.classifyBlockAddressReference();
const BlockAddress *BA = cast<BlockAddressSDNode>(Op)->getBlockAddress();
int64_t Offset = cast<BlockAddressSDNode>(Op)->getOffset();
SDLoc dl(Op);
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result = DAG.getTargetBlockAddress(BA, PtrVT, Offset, OpFlags);
Result = DAG.getNode(getGlobalWrapperKind(), dl, PtrVT, Result);
// With PIC, the address is actually $g + Offset.
if (isGlobalRelativeToPICBase(OpFlags)) {
Result = DAG.getNode(ISD::ADD, dl, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, dl, PtrVT), Result);
}
return Result;
}
SDValue X86TargetLowering::LowerGlobalAddress(const GlobalValue *GV,
const SDLoc &dl, int64_t Offset,
SelectionDAG &DAG) const {
// Create the TargetGlobalAddress node, folding in the constant
// offset if it is legal.
unsigned char OpFlags = Subtarget.classifyGlobalReference(GV);
CodeModel::Model M = DAG.getTarget().getCodeModel();
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue Result;
if (OpFlags == X86II::MO_NO_FLAG &&
X86::isOffsetSuitableForCodeModel(Offset, M)) {
// A direct static reference to a global.
Result = DAG.getTargetGlobalAddress(GV, dl, PtrVT, Offset);
Offset = 0;
} else {
Result = DAG.getTargetGlobalAddress(GV, dl, PtrVT, 0, OpFlags);
}
Result = DAG.getNode(getGlobalWrapperKind(GV), dl, PtrVT, Result);
// With PIC, the address is actually $g + Offset.
if (isGlobalRelativeToPICBase(OpFlags)) {
Result = DAG.getNode(ISD::ADD, dl, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, dl, PtrVT), Result);
}
// For globals that require a load from a stub to get the address, emit the
// load.
if (isGlobalStubReference(OpFlags))
Result = DAG.getLoad(PtrVT, dl, DAG.getEntryNode(), Result,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
// If there was a non-zero offset that we didn't fold, create an explicit
// addition for it.
if (Offset != 0)
Result = DAG.getNode(ISD::ADD, dl, PtrVT, Result,
DAG.getConstant(Offset, dl, PtrVT));
return Result;
}
SDValue
X86TargetLowering::LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) const {
const GlobalValue *GV = cast<GlobalAddressSDNode>(Op)->getGlobal();
int64_t Offset = cast<GlobalAddressSDNode>(Op)->getOffset();
return LowerGlobalAddress(GV, SDLoc(Op), Offset, DAG);
}
static SDValue
GetTLSADDR(SelectionDAG &DAG, SDValue Chain, GlobalAddressSDNode *GA,
SDValue *InFlag, const EVT PtrVT, unsigned ReturnReg,
unsigned char OperandFlags, bool LocalDynamic = false) {
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
SDLoc dl(GA);
SDValue TGA = DAG.getTargetGlobalAddress(GA->getGlobal(), dl,
GA->getValueType(0),
GA->getOffset(),
OperandFlags);
X86ISD::NodeType CallType = LocalDynamic ? X86ISD::TLSBASEADDR
: X86ISD::TLSADDR;
if (InFlag) {
SDValue Ops[] = { Chain, TGA, *InFlag };
Chain = DAG.getNode(CallType, dl, NodeTys, Ops);
} else {
SDValue Ops[] = { Chain, TGA };
Chain = DAG.getNode(CallType, dl, NodeTys, Ops);
}
// TLSADDR will be codegen'ed as a call. Inform MFI that the function has calls.
MFI.setAdjustsStack(true);
MFI.setHasCalls(true);
SDValue Flag = Chain.getValue(1);
return DAG.getCopyFromReg(Chain, dl, ReturnReg, PtrVT, Flag);
}
// Lower ISD::GlobalTLSAddress using the "general dynamic" model, 32 bit
static SDValue
LowerToTLSGeneralDynamicModel32(GlobalAddressSDNode *GA, SelectionDAG &DAG,
const EVT PtrVT) {
SDValue InFlag;
SDLoc dl(GA); // ? function entry point might be better
SDValue Chain = DAG.getCopyToReg(DAG.getEntryNode(), dl, X86::EBX,
DAG.getNode(X86ISD::GlobalBaseReg,
SDLoc(), PtrVT), InFlag);
InFlag = Chain.getValue(1);
return GetTLSADDR(DAG, Chain, GA, &InFlag, PtrVT, X86::EAX, X86II::MO_TLSGD);
}
// Lower ISD::GlobalTLSAddress using the "general dynamic" model, 64 bit
static SDValue
LowerToTLSGeneralDynamicModel64(GlobalAddressSDNode *GA, SelectionDAG &DAG,
const EVT PtrVT) {
return GetTLSADDR(DAG, DAG.getEntryNode(), GA, nullptr, PtrVT,
X86::RAX, X86II::MO_TLSGD);
}
static SDValue LowerToTLSLocalDynamicModel(GlobalAddressSDNode *GA,
SelectionDAG &DAG,
const EVT PtrVT,
bool is64Bit) {
SDLoc dl(GA);
// Get the start address of the TLS block for this module.
X86MachineFunctionInfo *MFI = DAG.getMachineFunction()
.getInfo<X86MachineFunctionInfo>();
MFI->incNumLocalDynamicTLSAccesses();
SDValue Base;
if (is64Bit) {
Base = GetTLSADDR(DAG, DAG.getEntryNode(), GA, nullptr, PtrVT, X86::RAX,
X86II::MO_TLSLD, /*LocalDynamic=*/true);
} else {
SDValue InFlag;
SDValue Chain = DAG.getCopyToReg(DAG.getEntryNode(), dl, X86::EBX,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT), InFlag);
InFlag = Chain.getValue(1);
Base = GetTLSADDR(DAG, Chain, GA, &InFlag, PtrVT, X86::EAX,
X86II::MO_TLSLDM, /*LocalDynamic=*/true);
}
// Note: the CleanupLocalDynamicTLSPass will remove redundant computations
// of Base.
// Build x@dtpoff.
unsigned char OperandFlags = X86II::MO_DTPOFF;
unsigned WrapperKind = X86ISD::Wrapper;
SDValue TGA = DAG.getTargetGlobalAddress(GA->getGlobal(), dl,
GA->getValueType(0),
GA->getOffset(), OperandFlags);
SDValue Offset = DAG.getNode(WrapperKind, dl, PtrVT, TGA);
// Add x@dtpoff to the base.
return DAG.getNode(ISD::ADD, dl, PtrVT, Offset, Base);
}
// Lower ISD::GlobalTLSAddress using the "initial exec" or "local exec" model.
static SDValue LowerToTLSExecModel(GlobalAddressSDNode *GA, SelectionDAG &DAG,
const EVT PtrVT, TLSModel::Model model,
bool is64Bit, bool isPIC) {
SDLoc dl(GA);
// Get the Thread Pointer, which is %gs:0 (32-bit) or %fs:0 (64-bit).
Value *Ptr = Constant::getNullValue(Type::getInt8PtrTy(*DAG.getContext(),
is64Bit ? 257 : 256));
SDValue ThreadPointer =
DAG.getLoad(PtrVT, dl, DAG.getEntryNode(), DAG.getIntPtrConstant(0, dl),
MachinePointerInfo(Ptr));
unsigned char OperandFlags = 0;
// Most TLS accesses are not RIP relative, even on x86-64. One exception is
// initialexec.
unsigned WrapperKind = X86ISD::Wrapper;
if (model == TLSModel::LocalExec) {
OperandFlags = is64Bit ? X86II::MO_TPOFF : X86II::MO_NTPOFF;
} else if (model == TLSModel::InitialExec) {
if (is64Bit) {
OperandFlags = X86II::MO_GOTTPOFF;
WrapperKind = X86ISD::WrapperRIP;
} else {
OperandFlags = isPIC ? X86II::MO_GOTNTPOFF : X86II::MO_INDNTPOFF;
}
} else {
llvm_unreachable("Unexpected model");
}
// emit "addl x@ntpoff,%eax" (local exec)
// or "addl x@indntpoff,%eax" (initial exec)
// or "addl x@gotntpoff(%ebx) ,%eax" (initial exec, 32-bit pic)
SDValue TGA =
DAG.getTargetGlobalAddress(GA->getGlobal(), dl, GA->getValueType(0),
GA->getOffset(), OperandFlags);
SDValue Offset = DAG.getNode(WrapperKind, dl, PtrVT, TGA);
if (model == TLSModel::InitialExec) {
if (isPIC && !is64Bit) {
Offset = DAG.getNode(ISD::ADD, dl, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT),
Offset);
}
Offset = DAG.getLoad(PtrVT, dl, DAG.getEntryNode(), Offset,
MachinePointerInfo::getGOT(DAG.getMachineFunction()));
}
// The address of the thread-local variable is the thread pointer plus the
// offset of the variable.
return DAG.getNode(ISD::ADD, dl, PtrVT, ThreadPointer, Offset);
}
SDValue
X86TargetLowering::LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const {
GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);
if (DAG.getTarget().Options.EmulatedTLS)
return LowerToTLSEmulatedModel(GA, DAG);
const GlobalValue *GV = GA->getGlobal();
auto PtrVT = getPointerTy(DAG.getDataLayout());
bool PositionIndependent = isPositionIndependent();
if (Subtarget.isTargetELF()) {
TLSModel::Model model = DAG.getTarget().getTLSModel(GV);
switch (model) {
case TLSModel::GeneralDynamic:
if (Subtarget.is64Bit())
return LowerToTLSGeneralDynamicModel64(GA, DAG, PtrVT);
return LowerToTLSGeneralDynamicModel32(GA, DAG, PtrVT);
case TLSModel::LocalDynamic:
return LowerToTLSLocalDynamicModel(GA, DAG, PtrVT,
Subtarget.is64Bit());
case TLSModel::InitialExec:
case TLSModel::LocalExec:
return LowerToTLSExecModel(GA, DAG, PtrVT, model, Subtarget.is64Bit(),
PositionIndependent);
}
llvm_unreachable("Unknown TLS model.");
}
if (Subtarget.isTargetDarwin()) {
// Darwin only has one model of TLS. Lower to that.
unsigned char OpFlag = 0;
unsigned WrapperKind = Subtarget.isPICStyleRIPRel() ?
X86ISD::WrapperRIP : X86ISD::Wrapper;
// In PIC mode (unless we're in RIPRel PIC mode) we add an offset to the
// global base reg.
bool PIC32 = PositionIndependent && !Subtarget.is64Bit();
if (PIC32)
OpFlag = X86II::MO_TLVP_PIC_BASE;
else
OpFlag = X86II::MO_TLVP;
SDLoc DL(Op);
SDValue Result = DAG.getTargetGlobalAddress(GA->getGlobal(), DL,
GA->getValueType(0),
GA->getOffset(), OpFlag);
SDValue Offset = DAG.getNode(WrapperKind, DL, PtrVT, Result);
// With PIC32, the address is actually $g + Offset.
if (PIC32)
Offset = DAG.getNode(ISD::ADD, DL, PtrVT,
DAG.getNode(X86ISD::GlobalBaseReg, SDLoc(), PtrVT),
Offset);
// Lowering the machine isd will make sure everything is in the right
// location.
SDValue Chain = DAG.getEntryNode();
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
Chain = DAG.getCALLSEQ_START(Chain, 0, 0, DL);
SDValue Args[] = { Chain, Offset };
Chain = DAG.getNode(X86ISD::TLSCALL, DL, NodeTys, Args);
Chain = DAG.getCALLSEQ_END(Chain, DAG.getIntPtrConstant(0, DL, true),
DAG.getIntPtrConstant(0, DL, true),
Chain.getValue(1), DL);
// TLSCALL will be codegen'ed as a call. Inform MFI that the function has calls.
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
MFI.setAdjustsStack(true);
// And our return value (tls address) is in the standard call return value
// location.
unsigned Reg = Subtarget.is64Bit() ? X86::RAX : X86::EAX;
return DAG.getCopyFromReg(Chain, DL, Reg, PtrVT, Chain.getValue(1));
}
if (Subtarget.isTargetKnownWindowsMSVC() ||
Subtarget.isTargetWindowsItanium() ||
Subtarget.isTargetWindowsGNU()) {
// Just use the implicit TLS architecture
// Need to generate something similar to:
// mov rdx, qword [gs:abs 58H]; Load pointer to ThreadLocalStorage
// ; from TEB
// mov ecx, dword [rel _tls_index] ; Load index (from C runtime)
// mov rcx, qword [rdx+rcx*8]
// mov eax, .tls$:tlsvar
// [rax+rcx] contains the address
// Windows 64bit: gs:0x58
// Windows 32bit: fs:__tls_array
SDLoc dl(GA);
SDValue Chain = DAG.getEntryNode();
// Get the Thread Pointer, which is %fs:__tls_array (32-bit) or
// %gs:0x58 (64-bit). On MinGW, __tls_array is not available, so directly
// use its literal value of 0x2C.
Value *Ptr = Constant::getNullValue(Subtarget.is64Bit()
? Type::getInt8PtrTy(*DAG.getContext(),
256)
: Type::getInt32PtrTy(*DAG.getContext(),
257));
SDValue TlsArray = Subtarget.is64Bit()
? DAG.getIntPtrConstant(0x58, dl)
: (Subtarget.isTargetWindowsGNU()
? DAG.getIntPtrConstant(0x2C, dl)
: DAG.getExternalSymbol("_tls_array", PtrVT));
SDValue ThreadPointer =
DAG.getLoad(PtrVT, dl, Chain, TlsArray, MachinePointerInfo(Ptr));
SDValue res;
if (GV->getThreadLocalMode() == GlobalVariable::LocalExecTLSModel) {
res = ThreadPointer;
} else {
// Load the _tls_index variable
SDValue IDX = DAG.getExternalSymbol("_tls_index", PtrVT);
if (Subtarget.is64Bit())
IDX = DAG.getExtLoad(ISD::ZEXTLOAD, dl, PtrVT, Chain, IDX,
MachinePointerInfo(), MVT::i32);
else
IDX = DAG.getLoad(PtrVT, dl, Chain, IDX, MachinePointerInfo());
auto &DL = DAG.getDataLayout();
SDValue Scale =
DAG.getConstant(Log2_64_Ceil(DL.getPointerSize()), dl, PtrVT);
IDX = DAG.getNode(ISD::SHL, dl, PtrVT, IDX, Scale);
res = DAG.getNode(ISD::ADD, dl, PtrVT, ThreadPointer, IDX);
}
res = DAG.getLoad(PtrVT, dl, Chain, res, MachinePointerInfo());
// Get the offset of start of .tls section
SDValue TGA = DAG.getTargetGlobalAddress(GA->getGlobal(), dl,
GA->getValueType(0),
GA->getOffset(), X86II::MO_SECREL);
SDValue Offset = DAG.getNode(X86ISD::Wrapper, dl, PtrVT, TGA);
// The address of the thread-local variable is the thread pointer plus the
// offset of the variable.
return DAG.getNode(ISD::ADD, dl, PtrVT, res, Offset);
}
llvm_unreachable("TLS not implemented for this target.");
}
/// Lower SRA_PARTS and friends, which return two i32 values
/// and take a 2 x i32 value to shift plus a shift amount.
static SDValue LowerShiftParts(SDValue Op, SelectionDAG &DAG) {
assert(Op.getNumOperands() == 3 && "Not a double-shift!");
MVT VT = Op.getSimpleValueType();
unsigned VTBits = VT.getSizeInBits();
SDLoc dl(Op);
bool isSRA = Op.getOpcode() == ISD::SRA_PARTS;
SDValue ShOpLo = Op.getOperand(0);
SDValue ShOpHi = Op.getOperand(1);
SDValue ShAmt = Op.getOperand(2);
// X86ISD::SHLD and X86ISD::SHRD have defined overflow behavior but the
// generic ISD nodes don't. Insert an AND to be safe; it's optimized away
// during isel.
SDValue SafeShAmt = DAG.getNode(ISD::AND, dl, MVT::i8, ShAmt,
DAG.getConstant(VTBits - 1, dl, MVT::i8));
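// For example, with VTBits == 32 a shift amount of 35 is masked to 3 here;
// the explicit test against VTBits below then selects the large-shift
// results instead.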
SDValue Tmp1 = isSRA ? DAG.getNode(ISD::SRA, dl, VT, ShOpHi,
DAG.getConstant(VTBits - 1, dl, MVT::i8))
: DAG.getConstant(0, dl, VT);
SDValue Tmp2, Tmp3;
if (Op.getOpcode() == ISD::SHL_PARTS) {
Tmp2 = DAG.getNode(X86ISD::SHLD, dl, VT, ShOpHi, ShOpLo, ShAmt);
Tmp3 = DAG.getNode(ISD::SHL, dl, VT, ShOpLo, SafeShAmt);
} else {
Tmp2 = DAG.getNode(X86ISD::SHRD, dl, VT, ShOpLo, ShOpHi, ShAmt);
Tmp3 = DAG.getNode(isSRA ? ISD::SRA : ISD::SRL, dl, VT, ShOpHi, SafeShAmt);
}
// If the shift amount is greater than or equal to the width of a part we
// can't rely on the results of shld/shrd. Insert a test and select the
// appropriate values for large shift amounts.
SDValue AndNode = DAG.getNode(ISD::AND, dl, MVT::i8, ShAmt,
DAG.getConstant(VTBits, dl, MVT::i8));
SDValue Cond = DAG.getNode(X86ISD::CMP, dl, MVT::i32,
AndNode, DAG.getConstant(0, dl, MVT::i8));
SDValue Hi, Lo;
SDValue CC = DAG.getConstant(X86::COND_NE, dl, MVT::i8);
SDValue Ops0[4] = { Tmp2, Tmp3, CC, Cond };
SDValue Ops1[4] = { Tmp3, Tmp1, CC, Cond };
if (Op.getOpcode() == ISD::SHL_PARTS) {
Hi = DAG.getNode(X86ISD::CMOV, dl, VT, Ops0);
Lo = DAG.getNode(X86ISD::CMOV, dl, VT, Ops1);
} else {
Lo = DAG.getNode(X86ISD::CMOV, dl, VT, Ops0);
Hi = DAG.getNode(X86ISD::CMOV, dl, VT, Ops1);
}
SDValue Ops[2] = { Lo, Hi };
return DAG.getMergeValues(Ops, dl);
}
SDValue X86TargetLowering::LowerSINT_TO_FP(SDValue Op,
SelectionDAG &DAG) const {
SDValue Src = Op.getOperand(0);
MVT SrcVT = Src.getSimpleValueType();
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (SrcVT.isVector()) {
if (SrcVT == MVT::v2i32 && VT == MVT::v2f64) {
return DAG.getNode(X86ISD::CVTSI2P, dl, VT,
DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v4i32, Src,
DAG.getUNDEF(SrcVT)));
}
if (SrcVT.getVectorElementType() == MVT::i1) {
if (SrcVT == MVT::v2i1 && TLI.isTypeLegal(SrcVT))
return DAG.getNode(ISD::SINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v2i64, Src));
MVT IntegerVT = MVT::getVectorVT(MVT::i32, SrcVT.getVectorNumElements());
return DAG.getNode(ISD::SINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::SIGN_EXTEND, dl, IntegerVT, Src));
}
return SDValue();
}
assert(SrcVT <= MVT::i64 && SrcVT >= MVT::i16 &&
"Unknown SINT_TO_FP to lower!");
// These are really Legal; return the operand so the caller accepts it as
// Legal.
if (SrcVT == MVT::i32 && isScalarFPTypeInSSEReg(Op.getValueType()))
return Op;
if (SrcVT == MVT::i64 && isScalarFPTypeInSSEReg(Op.getValueType()) &&
Subtarget.is64Bit()) {
return Op;
}
SDValue ValueToStore = Op.getOperand(0);
if (SrcVT == MVT::i64 && isScalarFPTypeInSSEReg(Op.getValueType()) &&
!Subtarget.is64Bit())
// Bitcasting to f64 here allows us to do a single 64-bit store from
// an SSE register, avoiding the store forwarding penalty that would come
// with two 32-bit stores.
ValueToStore = DAG.getBitcast(MVT::f64, ValueToStore);
unsigned Size = SrcVT.getSizeInBits()/8;
MachineFunction &MF = DAG.getMachineFunction();
auto PtrVT = getPointerTy(MF.getDataLayout());
int SSFI = MF.getFrameInfo().CreateStackObject(Size, Size, false);
SDValue StackSlot = DAG.getFrameIndex(SSFI, PtrVT);
SDValue Chain = DAG.getStore(
DAG.getEntryNode(), dl, ValueToStore, StackSlot,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI));
return BuildFILD(Op, SrcVT, Chain, StackSlot, DAG);
}
SDValue X86TargetLowering::BuildFILD(SDValue Op, EVT SrcVT, SDValue Chain,
SDValue StackSlot,
SelectionDAG &DAG) const {
// Build the FILD
SDLoc DL(Op);
SDVTList Tys;
bool useSSE = isScalarFPTypeInSSEReg(Op.getValueType());
if (useSSE)
Tys = DAG.getVTList(MVT::f64, MVT::Other, MVT::Glue);
else
Tys = DAG.getVTList(Op.getValueType(), MVT::Other);
unsigned ByteSize = SrcVT.getSizeInBits()/8;
FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(StackSlot);
MachineMemOperand *MMO;
if (FI) {
int SSFI = FI->getIndex();
MMO = DAG.getMachineFunction().getMachineMemOperand(
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI),
MachineMemOperand::MOLoad, ByteSize, ByteSize);
} else {
MMO = cast<LoadSDNode>(StackSlot)->getMemOperand();
StackSlot = StackSlot.getOperand(1);
}
SDValue Ops[] = { Chain, StackSlot, DAG.getValueType(SrcVT) };
SDValue Result = DAG.getMemIntrinsicNode(useSSE ? X86ISD::FILD_FLAG :
X86ISD::FILD, DL,
Tys, Ops, SrcVT, MMO);
if (useSSE) {
Chain = Result.getValue(1);
SDValue InFlag = Result.getValue(2);
// FIXME: Currently the FST is flagged to the FILD_FLAG. This
// shouldn't be necessary except that RFP cannot be live across
// multiple blocks. When stackifier is fixed, they can be uncoupled.
MachineFunction &MF = DAG.getMachineFunction();
unsigned SSFISize = Op.getValueSizeInBits()/8;
int SSFI = MF.getFrameInfo().CreateStackObject(SSFISize, SSFISize, false);
auto PtrVT = getPointerTy(MF.getDataLayout());
SDValue StackSlot = DAG.getFrameIndex(SSFI, PtrVT);
Tys = DAG.getVTList(MVT::Other);
SDValue Ops[] = {
Chain, Result, StackSlot, DAG.getValueType(Op.getValueType()), InFlag
};
MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI),
MachineMemOperand::MOStore, SSFISize, SSFISize);
Chain = DAG.getMemIntrinsicNode(X86ISD::FST, DL, Tys,
Ops, Op.getValueType(), MMO);
Result = DAG.getLoad(
Op.getValueType(), DL, Chain, StackSlot,
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI));
}
return Result;
}
/// 64-bit unsigned integer to double expansion.
SDValue X86TargetLowering::LowerUINT_TO_FP_i64(SDValue Op,
SelectionDAG &DAG) const {
// This algorithm is not obvious. Here is what we're trying to output:
/*
movq %rax, %xmm0
punpckldq (c0), %xmm0 // c0: (uint4){ 0x43300000U, 0x45300000U, 0U, 0U }
subpd (c1), %xmm0 // c1: (double2){ 0x1.0p52, 0x1.0p52 * 0x1.0p32 }
#ifdef __SSE3__
haddpd %xmm0, %xmm0
#else
pshufd $0x4e, %xmm0, %xmm1
addpd %xmm1, %xmm0
#endif
*/
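// The unpack below builds the doubles (2^52 + lo) and (2^84 + hi * 2^32)
// from the two 32-bit halves of the input; subtracting c1 leaves exactly lo
// and hi * 2^32, and the horizontal add yields lo + hi * 2^32, the original
// value.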
SDLoc dl(Op);
LLVMContext *Context = DAG.getContext();
// Build some magic constants.
static const uint32_t CV0[] = { 0x43300000, 0x45300000, 0, 0 };
Constant *C0 = ConstantDataVector::get(*Context, CV0);
auto PtrVT = getPointerTy(DAG.getDataLayout());
SDValue CPIdx0 = DAG.getConstantPool(C0, PtrVT, 16);
SmallVector<Constant*,2> CV1;
CV1.push_back(
ConstantFP::get(*Context, APFloat(APFloat::IEEEdouble(),
APInt(64, 0x4330000000000000ULL))));
CV1.push_back(
ConstantFP::get(*Context, APFloat(APFloat::IEEEdouble(),
APInt(64, 0x4530000000000000ULL))));
Constant *C1 = ConstantVector::get(CV1);
SDValue CPIdx1 = DAG.getConstantPool(C1, PtrVT, 16);
// Load the 64-bit value into an XMM register.
SDValue XR1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2i64,
Op.getOperand(0));
SDValue CLod0 =
DAG.getLoad(MVT::v4i32, dl, DAG.getEntryNode(), CPIdx0,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
/* Alignment = */ 16);
SDValue Unpck1 =
getUnpackl(DAG, dl, MVT::v4i32, DAG.getBitcast(MVT::v4i32, XR1), CLod0);
SDValue CLod1 =
DAG.getLoad(MVT::v2f64, dl, CLod0.getValue(1), CPIdx1,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()),
/* Alignment = */ 16);
SDValue XR2F = DAG.getBitcast(MVT::v2f64, Unpck1);
// TODO: Are there any fast-math-flags to propagate here?
SDValue Sub = DAG.getNode(ISD::FSUB, dl, MVT::v2f64, XR2F, CLod1);
SDValue Result;
if (Subtarget.hasSSE3()) {
// FIXME: The 'haddpd' instruction may be slower than 'movhlps + addsd'.
Result = DAG.getNode(X86ISD::FHADD, dl, MVT::v2f64, Sub, Sub);
} else {
SDValue S2F = DAG.getBitcast(MVT::v4i32, Sub);
SDValue Shuffle = DAG.getVectorShuffle(MVT::v4i32, dl, S2F, S2F, {2,3,0,1});
Result = DAG.getNode(ISD::FADD, dl, MVT::v2f64,
DAG.getBitcast(MVT::v2f64, Shuffle), Sub);
}
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, Result,
DAG.getIntPtrConstant(0, dl));
}
/// 32-bit unsigned integer to float expansion.
SDValue X86TargetLowering::LowerUINT_TO_FP_i32(SDValue Op,
SelectionDAG &DAG) const {
SDLoc dl(Op);
// FP constant to bias correct the final result.
SDValue Bias = DAG.getConstantFP(BitsToDouble(0x4330000000000000ULL), dl,
MVT::f64);
// Load the 32-bit value into an XMM register.
SDValue Load = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4i32,
Op.getOperand(0));
// Zero out the upper parts of the register.
Load = getShuffleVectorZeroOrUndef(Load, 0, true, Subtarget, DAG);
Load = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64,
DAG.getBitcast(MVT::v2f64, Load),
DAG.getIntPtrConstant(0, dl));
// Or the load with the bias.
SDValue Or = DAG.getNode(
ISD::OR, dl, MVT::v2i64,
DAG.getBitcast(MVT::v2i64,
DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2f64, Load)),
DAG.getBitcast(MVT::v2i64,
DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v2f64, Bias)));
Or =
DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64,
DAG.getBitcast(MVT::v2f64, Or), DAG.getIntPtrConstant(0, dl));
// Subtract the bias.
// TODO: Are there any fast-math-flags to propagate here?
SDValue Sub = DAG.getNode(ISD::FSUB, dl, MVT::f64, Or, Bias);
// Handle final rounding.
MVT DestVT = Op.getSimpleValueType();
if (DestVT.bitsLT(MVT::f64))
return DAG.getNode(ISD::FP_ROUND, dl, DestVT, Sub,
DAG.getIntPtrConstant(0, dl));
if (DestVT.bitsGT(MVT::f64))
return DAG.getNode(ISD::FP_EXTEND, dl, DestVT, Sub);
// Handle final rounding.
return Sub;
}
static SDValue lowerUINT_TO_FP_v2i32(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget, SDLoc &DL) {
if (Op.getSimpleValueType() != MVT::v2f64)
return SDValue();
SDValue N0 = Op.getOperand(0);
assert(N0.getSimpleValueType() == MVT::v2i32 && "Unexpected input type");
// Legalize to v4i32 type.
N0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v4i32, N0,
DAG.getUNDEF(MVT::v2i32));
if (Subtarget.hasAVX512())
return DAG.getNode(X86ISD::CVTUI2P, DL, MVT::v2f64, N0);
// Same implementation as VectorLegalizer::ExpandUINT_TO_FLOAT,
// but using v2i32 to v2f64 with X86ISD::CVTSI2P.
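// Each u32 is split as HI * 2^16 + LO. Both halves fit in 16 bits, so the
// signed CVTSI2P conversions below are exact, and fHI * 2^16 + fLO
// reassembles the value in f64 without rounding.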
SDValue HalfWord = DAG.getConstant(16, DL, MVT::v4i32);
SDValue HalfWordMask = DAG.getConstant(0x0000FFFF, DL, MVT::v4i32);
// Two to the power of half-word-size.
SDValue TWOHW = DAG.getConstantFP(1 << 16, DL, MVT::v2f64);
// Clear upper part of LO, lower HI.
SDValue HI = DAG.getNode(ISD::SRL, DL, MVT::v4i32, N0, HalfWord);
SDValue LO = DAG.getNode(ISD::AND, DL, MVT::v4i32, N0, HalfWordMask);
SDValue fHI = DAG.getNode(X86ISD::CVTSI2P, DL, MVT::v2f64, HI);
fHI = DAG.getNode(ISD::FMUL, DL, MVT::v2f64, fHI, TWOHW);
SDValue fLO = DAG.getNode(X86ISD::CVTSI2P, DL, MVT::v2f64, LO);
// Add the two halves.
return DAG.getNode(ISD::FADD, DL, MVT::v2f64, fHI, fLO);
}
static SDValue lowerUINT_TO_FP_vXi32(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// The algorithm is the following:
// #ifdef __SSE4_1__
// uint4 lo = _mm_blend_epi16( v, (uint4) 0x4b000000, 0xaa);
// uint4 hi = _mm_blend_epi16( _mm_srli_epi32(v,16),
// (uint4) 0x53000000, 0xaa);
// #else
// uint4 lo = (v & (uint4) 0xffff) | (uint4) 0x4b000000;
// uint4 hi = (v >> 16) | (uint4) 0x53000000;
// #endif
// float4 fhi = (float4) hi - (0x1.0p39f + 0x1.0p23f);
// return (float4) lo + fhi;
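// (Here lo is the float 2^23 + (v & 0xffff) and hi is 2^39 + (v >> 16) * 2^16,
// so (hi - (0x1.0p39f + 0x1.0p23f)) + lo equals v; the final FADD is the
// only step that rounds.)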
// We shouldn't use it when unsafe-fp-math is enabled though: we might later
// reassociate the two FADDs, and if we do that, the algorithm fails
// spectacularly (PR24512).
// FIXME: If we ever have some kind of Machine FMF, this should be marked
// as non-fast and always be enabled. Why isn't SDAG FMF enough? Because
// there are also the MachineCombiner reassociations happening on Machine IR.
if (DAG.getTarget().Options.UnsafeFPMath)
return SDValue();
SDLoc DL(Op);
SDValue V = Op->getOperand(0);
MVT VecIntVT = V.getSimpleValueType();
bool Is128 = VecIntVT == MVT::v4i32;
MVT VecFloatVT = Is128 ? MVT::v4f32 : MVT::v8f32;
// If we convert to something other than the supported type, e.g., to v4f64,
// abort early.
if (VecFloatVT != Op->getSimpleValueType(0))
return SDValue();
assert((VecIntVT == MVT::v4i32 || VecIntVT == MVT::v8i32) &&
"Unsupported custom type");
// In the #ifdef/#else code, we have in common:
// - The vector of constants:
// -- 0x4b000000
// -- 0x53000000
// - A shift:
// -- v >> 16
// Create the splat vector for 0x4b000000.
SDValue VecCstLow = DAG.getConstant(0x4b000000, DL, VecIntVT);
// Create the splat vector for 0x53000000.
SDValue VecCstHigh = DAG.getConstant(0x53000000, DL, VecIntVT);
// Create the right shift.
SDValue VecCstShift = DAG.getConstant(16, DL, VecIntVT);
SDValue HighShift = DAG.getNode(ISD::SRL, DL, VecIntVT, V, VecCstShift);
SDValue Low, High;
if (Subtarget.hasSSE41()) {
MVT VecI16VT = Is128 ? MVT::v8i16 : MVT::v16i16;
// uint4 lo = _mm_blend_epi16( v, (uint4) 0x4b000000, 0xaa);
SDValue VecCstLowBitcast = DAG.getBitcast(VecI16VT, VecCstLow);
SDValue VecBitcast = DAG.getBitcast(VecI16VT, V);
// Low will be bitcasted right away, so do not bother bitcasting back to its
// original type.
Low = DAG.getNode(X86ISD::BLENDI, DL, VecI16VT, VecBitcast,
VecCstLowBitcast, DAG.getConstant(0xaa, DL, MVT::i32));
// uint4 hi = _mm_blend_epi16( _mm_srli_epi32(v,16),
// (uint4) 0x53000000, 0xaa);
SDValue VecCstHighBitcast = DAG.getBitcast(VecI16VT, VecCstHigh);
SDValue VecShiftBitcast = DAG.getBitcast(VecI16VT, HighShift);
// High will be bitcasted right away, so do not bother bitcasting back to
// its original type.
High = DAG.getNode(X86ISD::BLENDI, DL, VecI16VT, VecShiftBitcast,
VecCstHighBitcast, DAG.getConstant(0xaa, DL, MVT::i32));
} else {
SDValue VecCstMask = DAG.getConstant(0xffff, DL, VecIntVT);
// uint4 lo = (v & (uint4) 0xffff) | (uint4) 0x4b000000;
SDValue LowAnd = DAG.getNode(ISD::AND, DL, VecIntVT, V, VecCstMask);
Low = DAG.getNode(ISD::OR, DL, VecIntVT, LowAnd, VecCstLow);
// uint4 hi = (v >> 16) | (uint4) 0x53000000;
High = DAG.getNode(ISD::OR, DL, VecIntVT, HighShift, VecCstHigh);
}
// Create the vector constant for -(0x1.0p39f + 0x1.0p23f).
SDValue VecCstFAdd = DAG.getConstantFP(
APFloat(APFloat::IEEEsingle(), APInt(32, 0xD3000080)), DL, VecFloatVT);
// float4 fhi = (float4) hi - (0x1.0p39f + 0x1.0p23f);
SDValue HighBitcast = DAG.getBitcast(VecFloatVT, High);
// TODO: Are there any fast-math-flags to propagate here?
SDValue FHigh =
DAG.getNode(ISD::FADD, DL, VecFloatVT, HighBitcast, VecCstFAdd);
// return (float4) lo + fhi;
SDValue LowBitcast = DAG.getBitcast(VecFloatVT, Low);
return DAG.getNode(ISD::FADD, DL, VecFloatVT, LowBitcast, FHigh);
}
SDValue X86TargetLowering::lowerUINT_TO_FP_vec(SDValue Op,
SelectionDAG &DAG) const {
SDValue N0 = Op.getOperand(0);
MVT SrcVT = N0.getSimpleValueType();
SDLoc dl(Op);
if (SrcVT.getVectorElementType() == MVT::i1) {
if (SrcVT == MVT::v2i1)
return DAG.getNode(ISD::UINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v2i64, N0));
MVT IntegerVT = MVT::getVectorVT(MVT::i32, SrcVT.getVectorNumElements());
return DAG.getNode(ISD::UINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::ZERO_EXTEND, dl, IntegerVT, N0));
}
switch (SrcVT.SimpleTy) {
default:
llvm_unreachable("Custom UINT_TO_FP is not supported!");
case MVT::v4i8:
case MVT::v4i16:
case MVT::v8i8:
case MVT::v8i16: {
MVT NVT = MVT::getVectorVT(MVT::i32, SrcVT.getVectorNumElements());
return DAG.getNode(ISD::SINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::ZERO_EXTEND, dl, NVT, N0));
}
case MVT::v2i32:
return lowerUINT_TO_FP_v2i32(Op, DAG, Subtarget, dl);
case MVT::v4i32:
case MVT::v8i32:
return lowerUINT_TO_FP_vXi32(Op, DAG, Subtarget);
case MVT::v16i8:
case MVT::v16i16:
assert(Subtarget.hasAVX512());
return DAG.getNode(ISD::UINT_TO_FP, dl, Op.getValueType(),
DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v16i32, N0));
}
}
SDValue X86TargetLowering::LowerUINT_TO_FP(SDValue Op,
SelectionDAG &DAG) const {
SDValue N0 = Op.getOperand(0);
SDLoc dl(Op);
auto PtrVT = getPointerTy(DAG.getDataLayout());
// Since UINT_TO_FP is legal (it's marked custom), the DAG combiner won't
// optimize it to a SINT_TO_FP when the sign bit is known zero. Perform
// the optimization here.
if (DAG.SignBitIsZero(N0))
return DAG.getNode(ISD::SINT_TO_FP, dl, Op.getValueType(), N0);
if (Op.getSimpleValueType().isVector())
return lowerUINT_TO_FP_vec(Op, DAG);
MVT SrcVT = N0.getSimpleValueType();
MVT DstVT = Op.getSimpleValueType();
if (Subtarget.hasAVX512() && isScalarFPTypeInSSEReg(DstVT) &&
(SrcVT == MVT::i32 || (SrcVT == MVT::i64 && Subtarget.is64Bit()))) {
// Conversions from unsigned i32 to f32/f64 are legal,
// using VCVTUSI2SS/SD. Same for i64 in 64-bit mode.
return Op;
}
if (SrcVT == MVT::i64 && DstVT == MVT::f64 && X86ScalarSSEf64)
return LowerUINT_TO_FP_i64(Op, DAG);
if (SrcVT == MVT::i32 && X86ScalarSSEf64)
return LowerUINT_TO_FP_i32(Op, DAG);
if (Subtarget.is64Bit() && SrcVT == MVT::i64 && DstVT == MVT::f32)
return SDValue();
// Make a 64-bit buffer, and use it to build an FILD.
SDValue StackSlot = DAG.CreateStackTemporary(MVT::i64);
if (SrcVT == MVT::i32) {
SDValue OffsetSlot = DAG.getMemBasePlusOffset(StackSlot, 4, dl);
SDValue Store1 = DAG.getStore(DAG.getEntryNode(), dl, Op.getOperand(0),
StackSlot, MachinePointerInfo());
SDValue Store2 = DAG.getStore(Store1, dl, DAG.getConstant(0, dl, MVT::i32),
OffsetSlot, MachinePointerInfo());
SDValue Fild = BuildFILD(Op, MVT::i64, Store2, StackSlot, DAG);
return Fild;
}
assert(SrcVT == MVT::i64 && "Unexpected type in UINT_TO_FP");
SDValue ValueToStore = Op.getOperand(0);
if (isScalarFPTypeInSSEReg(Op.getValueType()) && !Subtarget.is64Bit())
// Bitcasting to f64 here allows us to do a single 64-bit store from
// an SSE register, avoiding the store forwarding penalty that would come
// with two 32-bit stores.
ValueToStore = DAG.getBitcast(MVT::f64, ValueToStore);
SDValue Store = DAG.getStore(DAG.getEntryNode(), dl, ValueToStore, StackSlot,
MachinePointerInfo());
// For i64 source, we need to add the appropriate power of 2 if the input
// was negative. This is the same as the optimization in
// DAGTypeLegalizer::ExpandIntOp_UINT_TO_FP, and for it to be safe here,
// we must be careful to do the computation in x87 extended precision, not
// in SSE. (The generic code can't know it's OK to do this, or how to.)
int SSFI = cast<FrameIndexSDNode>(StackSlot)->getIndex();
MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI),
MachineMemOperand::MOLoad, 8, 8);
SDVTList Tys = DAG.getVTList(MVT::f80, MVT::Other);
SDValue Ops[] = { Store, StackSlot, DAG.getValueType(MVT::i64) };
SDValue Fild = DAG.getMemIntrinsicNode(X86ISD::FILD, dl, Tys, Ops,
MVT::i64, MMO);
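// 0x5F800000 is 2^64 in IEEE single precision; adding it compensates for
// FILD having interpreted a negative input as a signed value.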
APInt FF(32, 0x5F800000ULL);
// Check whether the sign bit is set.
SDValue SignSet = DAG.getSetCC(
dl, getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), MVT::i64),
Op.getOperand(0), DAG.getConstant(0, dl, MVT::i64), ISD::SETLT);
// Build a 64-bit pair (0, FF) in the constant pool, with FF in the low bits.
SDValue FudgePtr = DAG.getConstantPool(
ConstantInt::get(*DAG.getContext(), FF.zext(64)), PtrVT);
// Get a pointer to FF if the sign bit was set, or to 0 otherwise.
SDValue Zero = DAG.getIntPtrConstant(0, dl);
SDValue Four = DAG.getIntPtrConstant(4, dl);
SDValue Offset = DAG.getSelect(dl, Zero.getValueType(), SignSet, Zero, Four);
FudgePtr = DAG.getNode(ISD::ADD, dl, PtrVT, FudgePtr, Offset);
// Load the value out, extending it from f32 to f80.
// FIXME: Avoid the extend by constructing the right constant pool?
SDValue Fudge = DAG.getExtLoad(
ISD::EXTLOAD, dl, MVT::f80, DAG.getEntryNode(), FudgePtr,
MachinePointerInfo::getConstantPool(DAG.getMachineFunction()), MVT::f32,
/* Alignment = */ 4);
// Extend everything to 80 bits to force it to be done on x87.
// TODO: Are there any fast-math-flags to propagate here?
SDValue Add = DAG.getNode(ISD::FADD, dl, MVT::f80, Fild, Fudge);
return DAG.getNode(ISD::FP_ROUND, dl, DstVT, Add,
DAG.getIntPtrConstant(0, dl));
}
// If the given FP_TO_SINT (IsSigned) or FP_TO_UINT (!IsSigned) operation
// is legal, or has an fp128 or f16 source (which needs to be promoted to f32),
// just return an <SDValue(), SDValue()> pair.
// Otherwise it is assumed to be a conversion from one of f32, f64 or f80
// to i16, i32 or i64, and we lower it to a legal sequence.
// If lowered to the final integer result we return a <result, SDValue()> pair.
// Otherwise we lower it to a sequence ending with a FIST, return a
// <FIST, StackSlot> pair, and the caller is responsible for loading
// the final integer result from StackSlot.
std::pair<SDValue,SDValue>
X86TargetLowering::FP_TO_INTHelper(SDValue Op, SelectionDAG &DAG,
bool IsSigned, bool IsReplace) const {
SDLoc DL(Op);
EVT DstTy = Op.getValueType();
EVT TheVT = Op.getOperand(0).getValueType();
auto PtrVT = getPointerTy(DAG.getDataLayout());
if (TheVT != MVT::f32 && TheVT != MVT::f64 && TheVT != MVT::f80) {
// f16 must be promoted before using the lowering in this routine.
// fp128 does not use this lowering.
return std::make_pair(SDValue(), SDValue());
}
// If using FIST to compute an unsigned i64, we'll need some fixup
// to handle values above the maximum signed i64. A FIST is always
// used for the 32-bit subtarget, but also for f80 on a 64-bit target.
bool UnsignedFixup = !IsSigned &&
DstTy == MVT::i64 &&
(!Subtarget.is64Bit() ||
!isScalarFPTypeInSSEReg(TheVT));
if (!IsSigned && DstTy != MVT::i64 && !Subtarget.hasAVX512()) {
// Replace the fp-to-uint32 operation with an fp-to-sint64 FIST.
// The low 32 bits of the fist result will have the correct uint32 result.
assert(DstTy == MVT::i32 && "Unexpected FP_TO_UINT");
DstTy = MVT::i64;
}
assert(DstTy.getSimpleVT() <= MVT::i64 &&
DstTy.getSimpleVT() >= MVT::i16 &&
"Unknown FP_TO_INT to lower!");
// These are really Legal.
if (DstTy == MVT::i32 &&
isScalarFPTypeInSSEReg(Op.getOperand(0).getValueType()))
return std::make_pair(SDValue(), SDValue());
if (Subtarget.is64Bit() &&
DstTy == MVT::i64 &&
isScalarFPTypeInSSEReg(Op.getOperand(0).getValueType()))
return std::make_pair(SDValue(), SDValue());
// We lower FP->int64 into FISTP64 followed by a load from a temporary
// stack slot.
MachineFunction &MF = DAG.getMachineFunction();
unsigned MemSize = DstTy.getSizeInBits()/8;
int SSFI = MF.getFrameInfo().CreateStackObject(MemSize, MemSize, false);
SDValue StackSlot = DAG.getFrameIndex(SSFI, PtrVT);
unsigned Opc;
switch (DstTy.getSimpleVT().SimpleTy) {
default: llvm_unreachable("Invalid FP_TO_SINT to lower!");
case MVT::i16: Opc = X86ISD::FP_TO_INT16_IN_MEM; break;
case MVT::i32: Opc = X86ISD::FP_TO_INT32_IN_MEM; break;
case MVT::i64: Opc = X86ISD::FP_TO_INT64_IN_MEM; break;
}
SDValue Chain = DAG.getEntryNode();
SDValue Value = Op.getOperand(0);
SDValue Adjust; // 0x0 or 0x80000000, for result sign bit adjustment.
if (UnsignedFixup) {
//
// Conversion to unsigned i64 is implemented with a select,
// depending on whether the source value fits in the range
// of a signed i64. Let Thresh be the FP equivalent of
// 0x8000000000000000ULL.
//
// Adjust i32 = (Value < Thresh) ? 0 : 0x80000000;
// FistSrc = (Value < Thresh) ? Value : (Value - Thresh);
// Fist-to-mem64 FistSrc
// Add 0 or 0x800...0ULL to the 64-bit result, which is equivalent
// to XOR'ing the high 32 bits with Adjust.
//
// Being a power of 2, Thresh is exactly representable in all FP formats.
// For X87 we'd like to use the smallest FP type for this constant, but
// for DAG type consistency we have to match the FP operand type.
APFloat Thresh(APFloat::IEEEsingle(), APInt(32, 0x5f000000));
LLVM_ATTRIBUTE_UNUSED APFloat::opStatus Status = APFloat::opOK;
bool LosesInfo = false;
if (TheVT == MVT::f64)
// The rounding mode is irrelevant as the conversion should be exact.
Status = Thresh.convert(APFloat::IEEEdouble(), APFloat::rmNearestTiesToEven,
&LosesInfo);
else if (TheVT == MVT::f80)
Status = Thresh.convert(APFloat::x87DoubleExtended(),
APFloat::rmNearestTiesToEven, &LosesInfo);
assert(Status == APFloat::opOK && !LosesInfo &&
"FP conversion should have been exact");
SDValue ThreshVal = DAG.getConstantFP(Thresh, DL, TheVT);
SDValue Cmp = DAG.getSetCC(DL,
getSetCCResultType(DAG.getDataLayout(),
*DAG.getContext(), TheVT),
Value, ThreshVal, ISD::SETLT);
Adjust = DAG.getSelect(DL, MVT::i32, Cmp,
DAG.getConstant(0, DL, MVT::i32),
DAG.getConstant(0x80000000, DL, MVT::i32));
SDValue Sub = DAG.getNode(ISD::FSUB, DL, TheVT, Value, ThreshVal);
Cmp = DAG.getSetCC(DL, getSetCCResultType(DAG.getDataLayout(),
*DAG.getContext(), TheVT),
Value, ThreshVal, ISD::SETLT);
Value = DAG.getSelect(DL, TheVT, Cmp, Value, Sub);
}
// FIXME: This causes a redundant load/store if the SSE-class value is already
// in memory, such as when it is on the call stack.
if (isScalarFPTypeInSSEReg(TheVT)) {
assert(DstTy == MVT::i64 && "Invalid FP_TO_SINT to lower!");
Chain = DAG.getStore(Chain, DL, Value, StackSlot,
MachinePointerInfo::getFixedStack(MF, SSFI));
SDVTList Tys = DAG.getVTList(Op.getOperand(0).getValueType(), MVT::Other);
SDValue Ops[] = {
Chain, StackSlot, DAG.getValueType(TheVT)
};
MachineMemOperand *MMO =
MF.getMachineMemOperand(MachinePointerInfo::getFixedStack(MF, SSFI),
MachineMemOperand::MOLoad, MemSize, MemSize);
Value = DAG.getMemIntrinsicNode(X86ISD::FLD, DL, Tys, Ops, DstTy, MMO);
Chain = Value.getValue(1);
SSFI = MF.getFrameInfo().CreateStackObject(MemSize, MemSize, false);
StackSlot = DAG.getFrameIndex(SSFI, PtrVT);
}
MachineMemOperand *MMO =
MF.getMachineMemOperand(MachinePointerInfo::getFixedStack(MF, SSFI),
MachineMemOperand::MOStore, MemSize, MemSize);
if (UnsignedFixup) {
// Insert the FIST, load its result as two i32's,
// and XOR the high i32 with Adjust.
SDValue FistOps[] = { Chain, Value, StackSlot };
SDValue FIST = DAG.getMemIntrinsicNode(Opc, DL, DAG.getVTList(MVT::Other),
FistOps, DstTy, MMO);
SDValue Low32 =
DAG.getLoad(MVT::i32, DL, FIST, StackSlot, MachinePointerInfo());
SDValue HighAddr = DAG.getMemBasePlusOffset(StackSlot, 4, DL);
SDValue High32 =
DAG.getLoad(MVT::i32, DL, FIST, HighAddr, MachinePointerInfo());
High32 = DAG.getNode(ISD::XOR, DL, MVT::i32, High32, Adjust);
if (Subtarget.is64Bit()) {
// Join High32 and Low32 into a 64-bit result.
// (High32 << 32) | Low32
Low32 = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, Low32);
High32 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i64, High32);
High32 = DAG.getNode(ISD::SHL, DL, MVT::i64, High32,
DAG.getConstant(32, DL, MVT::i8));
SDValue Result = DAG.getNode(ISD::OR, DL, MVT::i64, High32, Low32);
return std::make_pair(Result, SDValue());
}
SDValue ResultOps[] = { Low32, High32 };
SDValue pair = IsReplace
? DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, ResultOps)
: DAG.getMergeValues(ResultOps, DL);
return std::make_pair(pair, SDValue());
} else {
// Build the FP_TO_INT*_IN_MEM
SDValue Ops[] = { Chain, Value, StackSlot };
SDValue FIST = DAG.getMemIntrinsicNode(Opc, DL, DAG.getVTList(MVT::Other),
Ops, DstTy, MMO);
return std::make_pair(FIST, StackSlot);
}
}
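/// Lower 128-bit to 256-bit vector extension on AVX targets: with AVX2 use
/// VZEXT directly; otherwise unpack the source against zero (or undef for
/// any_extend) and concatenate the low and high halves.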
static SDValue LowerAVXExtend(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MVT VT = Op->getSimpleValueType(0);
SDValue In = Op->getOperand(0);
MVT InVT = In.getSimpleValueType();
SDLoc dl(Op);
if (VT.is512BitVector() || InVT.getVectorElementType() == MVT::i1)
return DAG.getNode(ISD::ZERO_EXTEND, dl, VT, In);
// Optimize vectors in AVX mode:
//
// v8i16 -> v8i32
// Use vpunpcklwd for 4 lower elements v8i16 -> v4i32.
// Use vpunpckhwd for 4 upper elements v8i16 -> v4i32.
// Concat upper and lower parts.
//
// v4i32 -> v4i64
// Use vpunpckldq for 4 lower elements v4i32 -> v2i64.
// Use vpunpckhdq for 4 upper elements v4i32 -> v2i64.
// Concat upper and lower parts.
//
if (((VT != MVT::v16i16) || (InVT != MVT::v16i8)) &&
((VT != MVT::v8i32) || (InVT != MVT::v8i16)) &&
((VT != MVT::v4i64) || (InVT != MVT::v4i32)))
return SDValue();
if (Subtarget.hasInt256())
return DAG.getNode(X86ISD::VZEXT, dl, VT, In);
SDValue ZeroVec = getZeroVector(InVT, Subtarget, DAG, dl);
SDValue Undef = DAG.getUNDEF(InVT);
bool NeedZero = Op.getOpcode() == ISD::ZERO_EXTEND;
SDValue OpLo = getUnpackl(DAG, dl, InVT, In, NeedZero ? ZeroVec : Undef);
SDValue OpHi = getUnpackh(DAG, dl, InVT, In, NeedZero ? ZeroVec : Undef);
MVT HVT = MVT::getVectorVT(VT.getVectorElementType(),
VT.getVectorNumElements()/2);
OpLo = DAG.getBitcast(HVT, OpLo);
OpHi = DAG.getBitcast(HVT, OpHi);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, OpLo, OpHi);
}
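/// AVX-512 zero extension: non-mask sources map onto VZEXT; vXi1 mask
/// sources become a select between splatted 1 and 0 constants, truncated
/// back down if the select had to be performed in a wider type.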
static SDValue LowerZERO_EXTEND_AVX512(SDValue Op,
const X86Subtarget &Subtarget, SelectionDAG &DAG) {
MVT VT = Op->getSimpleValueType(0);
SDValue In = Op->getOperand(0);
MVT InVT = In.getSimpleValueType();
SDLoc DL(Op);
unsigned NumElts = VT.getVectorNumElements();
if (VT.is512BitVector() && InVT.getVectorElementType() != MVT::i1 &&
(NumElts == 8 || NumElts == 16 || Subtarget.hasBWI()))
return DAG.getNode(X86ISD::VZEXT, DL, VT, In);
if (InVT.getVectorElementType() != MVT::i1)
return SDValue();
// Widen VT if the type is a 128- or 256-bit vector and VLX is not supported.
MVT ExtVT = VT;
if (!VT.is512BitVector() && !Subtarget.hasVLX())
ExtVT = MVT::getVectorVT(MVT::getIntegerVT(512/NumElts), NumElts);
SDValue One =
DAG.getConstant(APInt(ExtVT.getScalarSizeInBits(), 1), DL, ExtVT);
SDValue Zero =
DAG.getConstant(APInt::getNullValue(ExtVT.getScalarSizeInBits()), DL, ExtVT);
SDValue SelectedVal = DAG.getSelect(DL, ExtVT, In, One, Zero);
if (VT == ExtVT)
return SelectedVal;
return DAG.getNode(X86ISD::VTRUNC, DL, VT, SelectedVal);
}
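/// ANY_EXTEND only needs custom handling for the AVX unpack-based path.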
static SDValue LowerANY_EXTEND(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
if (Subtarget.hasFp256())
if (SDValue Res = LowerAVXExtend(Op, DAG, Subtarget))
return Res;
return SDValue();
}
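/// ZERO_EXTEND dispatch: 512-bit results and i1 mask sources take the
/// AVX-512 path, other AVX vectors take the unpack-based path, and the rest
/// get no custom lowering.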
static SDValue LowerZERO_EXTEND(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc DL(Op);
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
MVT SVT = In.getSimpleValueType();
if (VT.is512BitVector() || SVT.getVectorElementType() == MVT::i1)
return LowerZERO_EXTEND_AVX512(Op, Subtarget, DAG);
if (Subtarget.hasFp256())
if (SDValue Res = LowerAVXExtend(Op, DAG, Subtarget))
return Res;
assert(!VT.is256BitVector() || !SVT.is128BitVector() ||
VT.getVectorNumElements() != SVT.getVectorNumElements());
return SDValue();
}
/// Helper to recursively truncate vector elements in half with PACKSS.
/// It makes use of the fact that vector comparison results will be all-zeros
/// or all-ones to use (vXi8 PACKSS(vYi16, vYi16)) instead of matching types.
/// AVX2 (Int256) sub-targets require extra shuffling as the PACKSS operates
/// within each 128-bit lane.
static SDValue truncateVectorCompareWithPACKSS(EVT DstVT, SDValue In,
const SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// Requires SSE2 but AVX512 has fast truncate.
if (!Subtarget.hasSSE2() || Subtarget.hasAVX512())
return SDValue();
EVT SrcVT = In.getValueType();
// No truncation required; we might get here due to recursive calls.
if (SrcVT == DstVT)
return In;
// We only support vector truncation to 128 bits or greater from a
// source of 256 bits or greater.
if ((DstVT.getSizeInBits() % 128) != 0)
return SDValue();
if ((SrcVT.getSizeInBits() % 256) != 0)
return SDValue();
unsigned NumElems = SrcVT.getVectorNumElements();
assert(DstVT.getVectorNumElements() == NumElems && "Illegal truncation");
assert(SrcVT.getSizeInBits() > DstVT.getSizeInBits() && "Illegal truncation");
EVT PackedSVT =
EVT::getIntegerVT(*DAG.getContext(), SrcVT.getScalarSizeInBits() / 2);
// Extract lower/upper subvectors.
unsigned NumSubElts = NumElems / 2;
unsigned SrcSizeInBits = SrcVT.getSizeInBits();
SDValue Lo = extractSubVector(In, 0 * NumSubElts, DAG, DL, SrcSizeInBits / 2);
SDValue Hi = extractSubVector(In, 1 * NumSubElts, DAG, DL, SrcSizeInBits / 2);
// 256-bit -> 128-bit truncate - PACKSS lower/upper 128-bit subvectors.
if (SrcVT.is256BitVector()) {
Lo = DAG.getBitcast(MVT::v8i16, Lo);
Hi = DAG.getBitcast(MVT::v8i16, Hi);
SDValue Res = DAG.getNode(X86ISD::PACKSS, DL, MVT::v16i8, Lo, Hi);
return DAG.getBitcast(DstVT, Res);
}
// AVX2: 512-bit -> 256-bit truncate - PACKSS lower/upper 256-bit subvectors.
// AVX2: 512-bit -> 128-bit truncate - PACKSS(PACKSS, PACKSS).
if (SrcVT.is512BitVector() && Subtarget.hasInt256()) {
Lo = DAG.getBitcast(MVT::v16i16, Lo);
Hi = DAG.getBitcast(MVT::v16i16, Hi);
SDValue Res = DAG.getNode(X86ISD::PACKSS, DL, MVT::v32i8, Lo, Hi);
// 256-bit PACKSS(ARG0, ARG1) leaves us with ((LO0,LO1),(HI0,HI1)),
// so we need to shuffle to get ((LO0,HI0),(LO1,HI1)).
Res = DAG.getBitcast(MVT::v4i64, Res);
Res = DAG.getVectorShuffle(MVT::v4i64, DL, Res, Res, {0, 2, 1, 3});
if (DstVT.is256BitVector())
return DAG.getBitcast(DstVT, Res);
// For 512-bit -> 128-bit, truncate another stage.
EVT PackedVT = EVT::getVectorVT(*DAG.getContext(), PackedSVT, NumElems);
Res = DAG.getBitcast(PackedVT, Res);
return truncateVectorCompareWithPACKSS(DstVT, Res, DL, DAG, Subtarget);
}
// Recursively pack lower/upper subvectors, concat result and pack again.
assert(SrcVT.getSizeInBits() >= 512 && "Expected 512-bit vector or greater");
EVT PackedVT = EVT::getVectorVT(*DAG.getContext(), PackedSVT, NumElems / 2);
Lo = truncateVectorCompareWithPACKSS(PackedVT, Lo, DL, DAG, Subtarget);
Hi = truncateVectorCompareWithPACKSS(PackedVT, Hi, DL, DAG, Subtarget);
PackedVT = EVT::getVectorVT(*DAG.getContext(), PackedSVT, NumElems);
SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, DL, PackedVT, Lo, Hi);
return truncateVectorCompareWithPACKSS(DstVT, Res, DL, DAG, Subtarget);
}
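/// Lower truncation to a vXi1 mask: shift each element's LSB into the sign
/// bit, then form the mask with VPMOVB2M/VPMOVW2M under BWI or with TESTM.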
static SDValue LowerTruncateVecI1(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc DL(Op);
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
MVT InVT = In.getSimpleValueType();
assert(VT.getVectorElementType() == MVT::i1 && "Unexpected vector type.");
// Shift LSB to MSB and use VPMOVB/W2M or TESTD/Q.
unsigned ShiftInx = InVT.getScalarSizeInBits() - 1;
if (InVT.getScalarSizeInBits() <= 16) {
if (Subtarget.hasBWI()) {
// Legal; this will be selected to VPMOVB2M/VPMOVW2M.
// Shifting packed bytes is not supported natively, so bitcast to words.
MVT ExtVT = MVT::getVectorVT(MVT::i16, InVT.getSizeInBits()/16);
SDValue ShiftNode = DAG.getNode(ISD::SHL, DL, ExtVT,
DAG.getBitcast(ExtVT, In),
DAG.getConstant(ShiftInx, DL, ExtVT));
ShiftNode = DAG.getBitcast(InVT, ShiftNode);
return DAG.getNode(X86ISD::CVT2MASK, DL, VT, ShiftNode);
}
// Use TESTD/Q, extending the vector to packed dwords/qwords.
assert((InVT.is256BitVector() || InVT.is128BitVector()) &&
"Unexpected vector type.");
unsigned NumElts = InVT.getVectorNumElements();
MVT ExtVT = MVT::getVectorVT(MVT::getIntegerVT(512/NumElts), NumElts);
In = DAG.getNode(ISD::SIGN_EXTEND, DL, ExtVT, In);
InVT = ExtVT;
ShiftInx = InVT.getScalarSizeInBits() - 1;
}
SDValue ShiftNode = DAG.getNode(ISD::SHL, DL, InVT, In,
DAG.getConstant(ShiftInx, DL, InVT));
return DAG.getNode(X86ISD::TESTM, DL, VT, ShiftNode, ShiftNode);
}
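/// Custom TRUNCATE lowering: scalar i1, vXi1 masks, AVX-512 VPMOV*
/// truncates, PACKSS for sign-bit-splat inputs, and shuffle-based
/// truncation of 256-bit vectors to 128 bits.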
SDValue X86TargetLowering::LowerTRUNCATE(SDValue Op, SelectionDAG &DAG) const {
SDLoc DL(Op);
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
MVT InVT = In.getSimpleValueType();
if (VT == MVT::i1) {
assert((InVT.isInteger() && (InVT.getSizeInBits() <= 64)) &&
"Invalid scalar TRUNCATE operation");
if (InVT.getSizeInBits() >= 32)
return SDValue();
In = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, In);
return DAG.getNode(ISD::TRUNCATE, DL, VT, In);
}
assert(VT.getVectorNumElements() == InVT.getVectorNumElements() &&
"Invalid TRUNCATE operation");
if (VT.getVectorElementType() == MVT::i1)
return LowerTruncateVecI1(Op, DAG, Subtarget);
// vpmovqb/w/d, vpmovdb/w, vpmovwb
if (Subtarget.hasAVX512()) {
// Word-to-byte truncation is only legal with BWI.
if (InVT == MVT::v16i16 && !Subtarget.hasBWI()) // v16i16 -> v16i8
return DAG.getNode(X86ISD::VTRUNC, DL, VT,
getExtendInVec(X86ISD::VSEXT, DL, MVT::v16i32, In, DAG));
return DAG.getNode(X86ISD::VTRUNC, DL, VT, In);
}
// Truncate with PACKSS if we are truncating a vector zero/all-bits result.
if (InVT.getScalarSizeInBits() == DAG.ComputeNumSignBits(In))
if (SDValue V = truncateVectorCompareWithPACKSS(VT, In, DL, DAG, Subtarget))
return V;
if ((VT == MVT::v4i32) && (InVT == MVT::v4i64)) {
// On AVX2, v4i64 -> v4i32 becomes VPERMD.
if (Subtarget.hasInt256()) {
static const int ShufMask[] = {0, 2, 4, 6, -1, -1, -1, -1};
In = DAG.getBitcast(MVT::v8i32, In);
In = DAG.getVectorShuffle(MVT::v8i32, DL, In, In, ShufMask);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, In,
DAG.getIntPtrConstant(0, DL));
}
SDValue OpLo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i64, In,
DAG.getIntPtrConstant(0, DL));
SDValue OpHi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i64, In,
DAG.getIntPtrConstant(2, DL));
OpLo = DAG.getBitcast(MVT::v4i32, OpLo);
OpHi = DAG.getBitcast(MVT::v4i32, OpHi);
static const int ShufMask[] = {0, 2, 4, 6};
return DAG.getVectorShuffle(VT, DL, OpLo, OpHi, ShufMask);
}
if ((VT == MVT::v8i16) && (InVT == MVT::v8i32)) {
// On AVX2, v8i32 -> v8i16 becomes PSHUFB.
if (Subtarget.hasInt256()) {
In = DAG.getBitcast(MVT::v32i8, In);
// The PSHUFB mask:
static const int ShufMask1[] = { 0, 1, 4, 5, 8, 9, 12, 13,
-1, -1, -1, -1, -1, -1, -1, -1,
16, 17, 20, 21, 24, 25, 28, 29,
-1, -1, -1, -1, -1, -1, -1, -1 };
In = DAG.getVectorShuffle(MVT::v32i8, DL, In, In, ShufMask1);
In = DAG.getBitcast(MVT::v4i64, In);
static const int ShufMask2[] = {0, 2, -1, -1};
In = DAG.getVectorShuffle(MVT::v4i64, DL, In, In, ShufMask2);
In = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i64, In,
DAG.getIntPtrConstant(0, DL));
return DAG.getBitcast(VT, In);
}
SDValue OpLo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i32, In,
DAG.getIntPtrConstant(0, DL));
SDValue OpHi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i32, In,
DAG.getIntPtrConstant(4, DL));
OpLo = DAG.getBitcast(MVT::v16i8, OpLo);
OpHi = DAG.getBitcast(MVT::v16i8, OpHi);
// The PSHUFB mask:
static const int ShufMask1[] = {0, 1, 4, 5, 8, 9, 12, 13,
-1, -1, -1, -1, -1, -1, -1, -1};
OpLo = DAG.getVectorShuffle(MVT::v16i8, DL, OpLo, OpLo, ShufMask1);
OpHi = DAG.getVectorShuffle(MVT::v16i8, DL, OpHi, OpHi, ShufMask1);
OpLo = DAG.getBitcast(MVT::v4i32, OpLo);
OpHi = DAG.getBitcast(MVT::v4i32, OpHi);
// The MOVLHPS Mask:
static const int ShufMask2[] = {0, 1, 4, 5};
SDValue res = DAG.getVectorShuffle(MVT::v4i32, DL, OpLo, OpHi, ShufMask2);
return DAG.getBitcast(MVT::v8i16, res);
}
// Handle truncation of V256 to V128 using shuffles.
if (!VT.is128BitVector() || !InVT.is256BitVector())
return SDValue();
assert(Subtarget.hasFp256() && "256-bit vector without AVX!");
unsigned NumElems = VT.getVectorNumElements();
MVT NVT = MVT::getVectorVT(VT.getVectorElementType(), NumElems * 2);
SmallVector<int, 16> MaskVec(NumElems * 2, -1);
// Prepare truncation shuffle mask
for (unsigned i = 0; i != NumElems; ++i)
MaskVec[i] = i * 2;
In = DAG.getBitcast(NVT, In);
SDValue V = DAG.getVectorShuffle(NVT, DL, In, In, MaskVec);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, V,
DAG.getIntPtrConstant(0, DL));
}
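/// Lower FP_TO_SINT/FP_TO_UINT: vector cases use AVX-512 CVTTP2SI/CVTTP2UI;
/// scalar cases go through FP_TO_INTHelper, loading the result from the
/// returned stack slot when a FIST sequence was emitted.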
SDValue X86TargetLowering::LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const {
bool IsSigned = Op.getOpcode() == ISD::FP_TO_SINT;
MVT VT = Op.getSimpleValueType();
if (VT.isVector()) {
assert(Subtarget.hasDQI() && Subtarget.hasVLX() && "Requires AVX512DQVL!");
SDValue Src = Op.getOperand(0);
SDLoc dl(Op);
if (VT == MVT::v2i64 && Src.getSimpleValueType() == MVT::v2f32) {
return DAG.getNode(IsSigned ? X86ISD::CVTTP2SI : X86ISD::CVTTP2UI, dl, VT,
DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v4f32, Src,
DAG.getUNDEF(MVT::v2f32)));
}
return SDValue();
}
assert(!VT.isVector());
std::pair<SDValue,SDValue> Vals = FP_TO_INTHelper(Op, DAG,
IsSigned, /*IsReplace=*/ false);
SDValue FIST = Vals.first, StackSlot = Vals.second;
// If FP_TO_INTHelper failed, the node is actually supposed to be Legal.
if (!FIST.getNode())
return Op;
if (StackSlot.getNode())
// Load the result.
return DAG.getLoad(VT, SDLoc(Op), FIST, StackSlot, MachinePointerInfo());
// The node is the result.
return FIST;
}
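/// Lower v2f32 -> v2f64 FP_EXTEND by widening the source to v4f32 with undef
/// elements so VFPEXT can be used.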
static SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) {
SDLoc DL(Op);
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
MVT SVT = In.getSimpleValueType();
assert(SVT == MVT::v2f32 && "Only customize MVT::v2f32 type legalization!");
return DAG.getNode(X86ISD::VFPEXT, DL, VT,
DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v4f32,
In, DAG.getUNDEF(SVT)));
}
/// The only differences between FABS and FNEG are the mask and the logic op.
/// FNEG also has a folding opportunity for FNEG(FABS(x)).
static SDValue LowerFABSorFNEG(SDValue Op, SelectionDAG &DAG) {
assert((Op.getOpcode() == ISD::FABS || Op.getOpcode() == ISD::FNEG) &&
"Wrong opcode for lowering FABS or FNEG.");
bool IsFABS = (Op.getOpcode() == ISD::FABS);
// If this is a FABS and it has an FNEG user, bail out to fold the combination
// into an FNABS. We'll lower the FABS after that if it is still in use.
if (IsFABS)
for (SDNode *User : Op->uses())
if (User->getOpcode() == ISD::FNEG)
return Op;
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
bool IsF128 = (VT == MVT::f128);
// FIXME: Use function attribute "OptimizeForSize" and/or CodeGenOpt::Level to
// decide if we should generate a 16-byte constant mask when we only need 4 or
// 8 bytes for the scalar case.
MVT LogicVT;
MVT EltVT;
if (VT.isVector()) {
LogicVT = VT;
EltVT = VT.getVectorElementType();
} else if (IsF128) {
// SSE instructions are used for optimized f128 logical operations.
LogicVT = MVT::f128;
EltVT = VT;
} else {
// There are no scalar bitwise logical SSE/AVX instructions, so we
// generate a 16-byte vector constant and logic op even for the scalar case.
// Using a 16-byte mask allows folding the load of the mask with
// the logic op, which can save ~4 bytes of code size.
LogicVT = (VT == MVT::f64) ? MVT::v2f64 : MVT::v4f32;
EltVT = VT;
}
unsigned EltBits = EltVT.getSizeInBits();
// For FABS, mask is 0x7f...; for FNEG, mask is 0x80...
APInt MaskElt =
IsFABS ? APInt::getSignedMaxValue(EltBits) : APInt::getSignMask(EltBits);
const fltSemantics &Sem =
EltVT == MVT::f64 ? APFloat::IEEEdouble() :
(IsF128 ? APFloat::IEEEquad() : APFloat::IEEEsingle());
SDValue Mask = DAG.getConstantFP(APFloat(Sem, MaskElt), dl, LogicVT);
SDValue Op0 = Op.getOperand(0);
bool IsFNABS = !IsFABS && (Op0.getOpcode() == ISD::FABS);
unsigned LogicOp =
IsFABS ? X86ISD::FAND : IsFNABS ? X86ISD::FOR : X86ISD::FXOR;
SDValue Operand = IsFNABS ? Op0.getOperand(0) : Op0;
if (VT.isVector() || IsF128)
return DAG.getNode(LogicOp, dl, LogicVT, Operand, Mask);
// For the scalar case extend to a 128-bit vector, perform the logic op,
// and extract the scalar result back out.
Operand = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, LogicVT, Operand);
SDValue LogicNode = DAG.getNode(LogicOp, dl, LogicVT, Operand, Mask);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, LogicNode,
DAG.getIntPtrConstant(0, dl));
}
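/// Lower FCOPYSIGN with FP logic ops: mask off all but the sign bit of the
/// sign operand, clear the sign bit of the magnitude, and OR the results.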
static SDValue LowerFCOPYSIGN(SDValue Op, SelectionDAG &DAG) {
SDValue Mag = Op.getOperand(0);
SDValue Sign = Op.getOperand(1);
SDLoc dl(Op);
// If the sign operand is smaller, extend it first.
MVT VT = Op.getSimpleValueType();
if (Sign.getSimpleValueType().bitsLT(VT))
Sign = DAG.getNode(ISD::FP_EXTEND, dl, VT, Sign);
// And if it is bigger, shrink it first.
if (Sign.getSimpleValueType().bitsGT(VT))
Sign = DAG.getNode(ISD::FP_ROUND, dl, VT, Sign, DAG.getIntPtrConstant(1, dl));
// At this point the operands and the result should have the same
// type, and that won't be f80 since that is not custom lowered.
bool IsF128 = (VT == MVT::f128);
assert((VT == MVT::f64 || VT == MVT::f32 || VT == MVT::f128 ||
VT == MVT::v2f64 || VT == MVT::v4f64 || VT == MVT::v4f32 ||
VT == MVT::v8f32 || VT == MVT::v8f64 || VT == MVT::v16f32) &&
"Unexpected type in LowerFCOPYSIGN");
MVT EltVT = VT.getScalarType();
const fltSemantics &Sem =
EltVT == MVT::f64 ? APFloat::IEEEdouble()
: (IsF128 ? APFloat::IEEEquad() : APFloat::IEEEsingle());
// Perform all scalar logic operations as 16-byte vectors because there are no
// scalar FP logic instructions in SSE.
// TODO: This isn't necessary. If we used scalar types, we might avoid some
// unnecessary splats, but we might miss load folding opportunities. Should
// this decision be based on OptimizeForSize?
bool IsFakeVector = !VT.isVector() && !IsF128;
MVT LogicVT = VT;
if (IsFakeVector)
LogicVT = (VT == MVT::f64) ? MVT::v2f64 : MVT::v4f32;
// The mask constants are automatically splatted for vector types.
unsigned EltSizeInBits = VT.getScalarSizeInBits();
SDValue SignMask = DAG.getConstantFP(
APFloat(Sem, APInt::getSignMask(EltSizeInBits)), dl, LogicVT);
SDValue MagMask = DAG.getConstantFP(
APFloat(Sem, ~APInt::getSignMask(EltSizeInBits)), dl, LogicVT);
// First, clear all bits but the sign bit from the second operand (sign).
if (IsFakeVector)
Sign = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, LogicVT, Sign);
SDValue SignBit = DAG.getNode(X86ISD::FAND, dl, LogicVT, Sign, SignMask);
// Next, clear the sign bit from the first operand (magnitude).
// TODO: If we had general constant folding for FP logic ops, this check
// wouldn't be necessary.
SDValue MagBits;
if (ConstantFPSDNode *Op0CN = dyn_cast<ConstantFPSDNode>(Mag)) {
APFloat APF = Op0CN->getValueAPF();
APF.clearSign();
MagBits = DAG.getConstantFP(APF, dl, LogicVT);
} else {
// If the magnitude operand wasn't a constant, we need to AND out the sign.
if (IsFakeVector)
Mag = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, LogicVT, Mag);
MagBits = DAG.getNode(X86ISD::FAND, dl, LogicVT, Mag, MagMask);
}
// OR the magnitude value with the sign bit.
SDValue Or = DAG.getNode(X86ISD::FOR, dl, LogicVT, MagBits, SignBit);
return !IsFakeVector ? Or : DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, VT, Or,
DAG.getIntPtrConstant(0, dl));
}
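/// Lower FGETSIGN by moving the scalar into a vector and extracting the sign
/// bit with MOVMSK.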
static SDValue LowerFGETSIGN(SDValue Op, SelectionDAG &DAG) {
SDValue N0 = Op.getOperand(0);
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
MVT OpVT = N0.getSimpleValueType();
assert((OpVT == MVT::f32 || OpVT == MVT::f64) &&
"Unexpected type for FGETSIGN");
// Lower ISD::FGETSIGN to (AND (X86ISD::MOVMSK ...) 1).
MVT VecVT = (OpVT == MVT::f32 ? MVT::v4f32 : MVT::v2f64);
SDValue Res = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT, N0);
Res = DAG.getNode(X86ISD::MOVMSK, dl, MVT::i32, Res);
Res = DAG.getZExtOrTrunc(Res, dl, VT);
Res = DAG.getNode(ISD::AND, dl, VT, Res, DAG.getConstant(1, dl, VT));
return Res;
}
// Check whether an OR'd tree is PTEST-able.
static SDValue LowerVectorAllZeroTest(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Op.getOpcode() == ISD::OR && "Only check OR'd tree.");
if (!Subtarget.hasSSE41())
return SDValue();
if (!Op->hasOneUse())
return SDValue();
SDNode *N = Op.getNode();
SDLoc DL(N);
SmallVector<SDValue, 8> Opnds;
DenseMap<SDValue, unsigned> VecInMap;
SmallVector<SDValue, 8> VecIns;
EVT VT = MVT::Other;
// Recognize a special case where a vector is cast into a wide integer to
// test for all zeros.
Opnds.push_back(N->getOperand(0));
Opnds.push_back(N->getOperand(1));
for (unsigned Slot = 0, e = Opnds.size(); Slot < e; ++Slot) {
SmallVectorImpl<SDValue>::const_iterator I = Opnds.begin() + Slot;
// BFS traverse all OR'd operands.
if (I->getOpcode() == ISD::OR) {
Opnds.push_back(I->getOperand(0));
Opnds.push_back(I->getOperand(1));
// Re-evaluate the number of nodes to be traversed.
e += 2; // 2 more nodes (LHS and RHS) are pushed.
continue;
}
// Quit if this is not an EXTRACT_VECTOR_ELT.
if (I->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
return SDValue();
// Quit if the index is not a constant.
SDValue Idx = I->getOperand(1);
if (!isa<ConstantSDNode>(Idx))
return SDValue();
SDValue ExtractedFromVec = I->getOperand(0);
DenseMap<SDValue, unsigned>::iterator M = VecInMap.find(ExtractedFromVec);
if (M == VecInMap.end()) {
VT = ExtractedFromVec.getValueType();
// Quit if not 128/256-bit vector.
if (!VT.is128BitVector() && !VT.is256BitVector())
return SDValue();
// Quit if not the same type.
if (VecInMap.begin() != VecInMap.end() &&
VT != VecInMap.begin()->first.getValueType())
return SDValue();
M = VecInMap.insert(std::make_pair(ExtractedFromVec, 0)).first;
VecIns.push_back(ExtractedFromVec);
}
M->second |= 1U << cast<ConstantSDNode>(Idx)->getZExtValue();
}
assert((VT.is128BitVector() || VT.is256BitVector()) &&
"Not extracted from 128-/256-bit vector.");
unsigned FullMask = (1U << VT.getVectorNumElements()) - 1U;
for (DenseMap<SDValue, unsigned>::const_iterator
I = VecInMap.begin(), E = VecInMap.end(); I != E; ++I) {
// Quit if not all elements are used.
if (I->second != FullMask)
return SDValue();
}
MVT TestVT = VT.is128BitVector() ? MVT::v2i64 : MVT::v4i64;
// Cast all vectors into TestVT for PTEST.
for (unsigned i = 0, e = VecIns.size(); i < e; ++i)
VecIns[i] = DAG.getBitcast(TestVT, VecIns[i]);
// If more than one full vector is evaluated, OR them first before PTEST.
for (unsigned Slot = 0, e = VecIns.size(); e - Slot > 1; Slot += 2, e += 1) {
// Each iteration will OR 2 nodes and append the result until there is only
// 1 node left, i.e. the final OR'd value of all vectors.
SDValue LHS = VecIns[Slot];
SDValue RHS = VecIns[Slot + 1];
VecIns.push_back(DAG.getNode(ISD::OR, DL, TestVT, LHS, RHS));
}
return DAG.getNode(X86ISD::PTEST, DL, MVT::i32, VecIns.back(), VecIns.back());
}
/// \brief Return true if \c Op has a use that doesn't just read flags.
static bool hasNonFlagsUse(SDValue Op) {
for (SDNode::use_iterator UI = Op->use_begin(), UE = Op->use_end(); UI != UE;
++UI) {
SDNode *User = *UI;
unsigned UOpNo = UI.getOperandNo();
if (User->getOpcode() == ISD::TRUNCATE && User->hasOneUse()) {
// Look past the truncate.
UOpNo = User->use_begin().getOperandNo();
User = *User->use_begin();
}
if (User->getOpcode() != ISD::BRCOND && User->getOpcode() != ISD::SETCC &&
!(User->getOpcode() == ISD::SELECT && UOpNo == 0))
return true;
}
return false;
}
// Emit KTEST instruction for bit vectors on AVX-512
static SDValue EmitKTEST(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (Op.getOpcode() == ISD::BITCAST) {
auto hasKTEST = [&](MVT VT) {
unsigned SizeInBits = VT.getSizeInBits();
return (Subtarget.hasDQI() && (SizeInBits == 8 || SizeInBits == 16)) ||
(Subtarget.hasBWI() && (SizeInBits == 32 || SizeInBits == 64));
};
SDValue Op0 = Op.getOperand(0);
MVT Op0VT = Op0.getValueType().getSimpleVT();
if (Op0VT.isVector() && Op0VT.getVectorElementType() == MVT::i1 &&
hasKTEST(Op0VT))
return DAG.getNode(X86ISD::KTEST, SDLoc(Op), Op0VT, Op0, Op0);
}
return SDValue();
}
/// Emit nodes that will be selected as "test Op0,Op0", or something
/// equivalent.
SDValue X86TargetLowering::EmitTest(SDValue Op, unsigned X86CC, const SDLoc &dl,
SelectionDAG &DAG) const {
if (Op.getValueType() == MVT::i1) {
SDValue ExtOp = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i8, Op);
return DAG.getNode(X86ISD::CMP, dl, MVT::i32, ExtOp,
DAG.getConstant(0, dl, MVT::i8));
}
// CF and OF aren't always set the way we want. Determine which
// of these we need.
bool NeedCF = false;
bool NeedOF = false;
switch (X86CC) {
default: break;
case X86::COND_A: case X86::COND_AE:
case X86::COND_B: case X86::COND_BE:
NeedCF = true;
break;
case X86::COND_G: case X86::COND_GE:
case X86::COND_L: case X86::COND_LE:
case X86::COND_O: case X86::COND_NO: {
// Check if we really need to set the
// Overflow flag. If NoSignedWrap is present,
// it is not actually needed.
switch (Op->getOpcode()) {
case ISD::ADD:
case ISD::SUB:
case ISD::MUL:
case ISD::SHL:
if (Op.getNode()->getFlags().hasNoSignedWrap())
break;
LLVM_FALLTHROUGH;
default:
NeedOF = true;
break;
}
break;
}
}
// See if we can use the EFLAGS value from the operand instead of
// doing a separate TEST. TEST always sets OF and CF to 0, so unless
// we prove that the arithmetic won't overflow, we can't use OF or CF.
if (Op.getResNo() != 0 || NeedOF || NeedCF) {
// Emit KTEST for bit vectors
if (auto Node = EmitKTEST(Op, DAG, Subtarget))
return Node;
// Emit a CMP with 0, which is the TEST pattern.
return DAG.getNode(X86ISD::CMP, dl, MVT::i32, Op,
DAG.getConstant(0, dl, Op.getValueType()));
}
unsigned Opcode = 0;
unsigned NumOperands = 0;
// Truncate operations may prevent the merge of the SETCC instruction
// and the arithmetic instruction before it. Attempt to truncate the operands
// of the arithmetic instruction and use a reduced bit-width instruction.
bool NeedTruncation = false;
SDValue ArithOp = Op;
if (Op->getOpcode() == ISD::TRUNCATE && Op->hasOneUse()) {
SDValue Arith = Op->getOperand(0);
// Both the trunc and the arithmetic op need to have one user each.
if (Arith->hasOneUse())
switch (Arith.getOpcode()) {
default: break;
case ISD::ADD:
case ISD::SUB:
case ISD::AND:
case ISD::OR:
case ISD::XOR: {
NeedTruncation = true;
ArithOp = Arith;
}
}
}
// Sometimes flags can be set either with an AND or with an SRL/SHL
// instruction. The SRL/SHL variant should be preferred for masks longer than
// this number of bits.
const int ShiftToAndMaxMaskWidth = 32;
const bool ZeroCheck = (X86CC == X86::COND_E || X86CC == X86::COND_NE);
// NOTICE: In the code below we use ArithOp to hold the arithmetic operation,
// which may be the result of a CAST. We use the variable 'Op', the
// non-cast value, when we check for possible users.
switch (ArithOp.getOpcode()) {
case ISD::ADD:
// Due to an isel shortcoming, be conservative if this add is likely to be
// selected as part of a load-modify-store instruction. When the root node
// in a match is a store, isel doesn't know how to remap non-chain non-flag
// uses of other nodes in the match, such as the ADD in this case. This
// leads to the ADD being left around and reselected, with the result being
// two adds in the output. Alas, even if none of our users are stores, that
// doesn't prove we're O.K. Ergo, if we have any parents that aren't
// CopyToReg or SETCC, eschew INC/DEC. A better fix seems to require
// climbing the DAG back to the root, and it doesn't seem to be worth the
// effort.
for (SDNode::use_iterator UI = Op.getNode()->use_begin(),
UE = Op.getNode()->use_end(); UI != UE; ++UI)
if (UI->getOpcode() != ISD::CopyToReg &&
UI->getOpcode() != ISD::SETCC &&
UI->getOpcode() != ISD::STORE)
goto default_case;
if (ConstantSDNode *C =
dyn_cast<ConstantSDNode>(ArithOp.getOperand(1))) {
// An add of one will be selected as an INC.
if (C->isOne() && !Subtarget.slowIncDec()) {
Opcode = X86ISD::INC;
NumOperands = 1;
break;
}
// An add of negative one (subtract of one) will be selected as a DEC.
if (C->isAllOnesValue() && !Subtarget.slowIncDec()) {
Opcode = X86ISD::DEC;
NumOperands = 1;
break;
}
}
// Otherwise use a regular EFLAGS-setting add.
Opcode = X86ISD::ADD;
NumOperands = 2;
break;
case ISD::SHL:
case ISD::SRL:
// If we have a constant logical shift that's only used in a comparison
// against zero turn it into an equivalent AND. This allows turning it into
// a TEST instruction later.
if (ZeroCheck && Op->hasOneUse() &&
isa<ConstantSDNode>(Op->getOperand(1)) && !hasNonFlagsUse(Op)) {
EVT VT = Op.getValueType();
unsigned BitWidth = VT.getSizeInBits();
unsigned ShAmt = Op->getConstantOperandVal(1);
if (ShAmt >= BitWidth) // Avoid undefined shifts.
break;
APInt Mask = ArithOp.getOpcode() == ISD::SRL
? APInt::getHighBitsSet(BitWidth, BitWidth - ShAmt)
: APInt::getLowBitsSet(BitWidth, BitWidth - ShAmt);
if (!Mask.isSignedIntN(ShiftToAndMaxMaskWidth))
break;
Op = DAG.getNode(ISD::AND, dl, VT, Op->getOperand(0),
DAG.getConstant(Mask, dl, VT));
}
break;
case ISD::AND:
// If the primary 'and' result isn't used, don't bother using X86ISD::AND,
// because a TEST instruction will be better. However, AND should be
// preferred if the instruction can be combined into ANDN.
if (!hasNonFlagsUse(Op)) {
SDValue Op0 = ArithOp->getOperand(0);
SDValue Op1 = ArithOp->getOperand(1);
EVT VT = ArithOp.getValueType();
bool isAndn = isBitwiseNot(Op0) || isBitwiseNot(Op1);
bool isLegalAndnType = VT == MVT::i32 || VT == MVT::i64;
bool isProperAndn = isAndn && isLegalAndnType && Subtarget.hasBMI();
// If we cannot select an ANDN instruction, check if we can replace
// AND+IMM64 with a shift before giving up. This is possible for masks
// like 0xFF000000 or 0x00FFFFFF and if we care only about the zero flag.
if (!isProperAndn) {
if (!ZeroCheck)
break;
assert(!isa<ConstantSDNode>(Op0) && "AND node isn't canonicalized");
auto *CN = dyn_cast<ConstantSDNode>(Op1);
if (!CN)
break;
const APInt &Mask = CN->getAPIntValue();
if (Mask.isSignedIntN(ShiftToAndMaxMaskWidth))
break; // Prefer TEST instruction.
unsigned BitWidth = Mask.getBitWidth();
unsigned LeadingOnes = Mask.countLeadingOnes();
unsigned TrailingZeros = Mask.countTrailingZeros();
if (LeadingOnes + TrailingZeros == BitWidth) {
assert(TrailingZeros < VT.getSizeInBits() &&
"Shift amount should be less than the type width");
MVT ShTy = getScalarShiftAmountTy(DAG.getDataLayout(), VT);
SDValue ShAmt = DAG.getConstant(TrailingZeros, dl, ShTy);
Op = DAG.getNode(ISD::SRL, dl, VT, Op0, ShAmt);
break;
}
unsigned LeadingZeros = Mask.countLeadingZeros();
unsigned TrailingOnes = Mask.countTrailingOnes();
if (LeadingZeros + TrailingOnes == BitWidth) {
assert(LeadingZeros < VT.getSizeInBits() &&
"Shift amount should be less than the type width");
MVT ShTy = getScalarShiftAmountTy(DAG.getDataLayout(), VT);
SDValue ShAmt = DAG.getConstant(LeadingZeros, dl, ShTy);
Op = DAG.getNode(ISD::SHL, dl, VT, Op0, ShAmt);
break;
}
break;
}
}
LLVM_FALLTHROUGH;
case ISD::SUB:
case ISD::OR:
case ISD::XOR:
// Due to the ISEL shortcoming noted above, be conservative if this op is
// likely to be selected as part of a load-modify-store instruction.
for (SDNode::use_iterator UI = Op.getNode()->use_begin(),
UE = Op.getNode()->use_end(); UI != UE; ++UI)
if (UI->getOpcode() == ISD::STORE)
goto default_case;
// Otherwise use a regular EFLAGS-setting instruction.
switch (ArithOp.getOpcode()) {
default: llvm_unreachable("unexpected operator!");
case ISD::SUB: Opcode = X86ISD::SUB; break;
case ISD::XOR: Opcode = X86ISD::XOR; break;
case ISD::AND: Opcode = X86ISD::AND; break;
case ISD::OR: {
if (!NeedTruncation && ZeroCheck) {
if (SDValue EFLAGS = LowerVectorAllZeroTest(Op, Subtarget, DAG))
return EFLAGS;
}
Opcode = X86ISD::OR;
break;
}
}
NumOperands = 2;
break;
case X86ISD::ADD:
case X86ISD::SUB:
case X86ISD::INC:
case X86ISD::DEC:
case X86ISD::OR:
case X86ISD::XOR:
case X86ISD::AND:
return SDValue(Op.getNode(), 1);
default:
default_case:
break;
}
// If we found that truncation is beneficial, perform the truncation and
// update 'Op'.
if (NeedTruncation) {
EVT VT = Op.getValueType();
SDValue WideVal = Op->getOperand(0);
EVT WideVT = WideVal.getValueType();
unsigned ConvertedOp = 0;
// Use a target machine opcode to prevent further DAGCombine
// optimizations that may separate the arithmetic operations
// from the setcc node.
switch (WideVal.getOpcode()) {
default: break;
case ISD::ADD: ConvertedOp = X86ISD::ADD; break;
case ISD::SUB: ConvertedOp = X86ISD::SUB; break;
case ISD::AND: ConvertedOp = X86ISD::AND; break;
case ISD::OR: ConvertedOp = X86ISD::OR; break;
case ISD::XOR: ConvertedOp = X86ISD::XOR; break;
}
if (ConvertedOp) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (TLI.isOperationLegal(WideVal.getOpcode(), WideVT)) {
SDValue V0 = DAG.getNode(ISD::TRUNCATE, dl, VT, WideVal.getOperand(0));
SDValue V1 = DAG.getNode(ISD::TRUNCATE, dl, VT, WideVal.getOperand(1));
Op = DAG.getNode(ConvertedOp, dl, VT, V0, V1);
}
}
}
if (Opcode == 0) {
// Emit KTEST for bit vectors
if (auto Node = EmitKTEST(Op, DAG, Subtarget))
return Node;
// Emit a CMP with 0, which is the TEST pattern.
return DAG.getNode(X86ISD::CMP, dl, MVT::i32, Op,
DAG.getConstant(0, dl, Op.getValueType()));
}
SDVTList VTs = DAG.getVTList(Op.getValueType(), MVT::i32);
SmallVector<SDValue, 4> Ops(Op->op_begin(), Op->op_begin() + NumOperands);
SDValue New = DAG.getNode(Opcode, dl, VTs, Ops);
DAG.ReplaceAllUsesWith(Op, New);
return SDValue(New.getNode(), 1);
}
/// Emit nodes that will be selected as "cmp Op0,Op1", or something
/// equivalent.
SDValue X86TargetLowering::EmitCmp(SDValue Op0, SDValue Op1, unsigned X86CC,
const SDLoc &dl, SelectionDAG &DAG) const {
if (isNullConstant(Op1))
return EmitTest(Op0, X86CC, dl, DAG);
assert(!(isa<ConstantSDNode>(Op1) && Op0.getValueType() == MVT::i1) &&
"Unexpected comparison operation for MVT::i1 operands");
if ((Op0.getValueType() == MVT::i8 || Op0.getValueType() == MVT::i16 ||
Op0.getValueType() == MVT::i32 || Op0.getValueType() == MVT::i64)) {
// Only promote the compare up to i32 if it is a 16-bit operation
// with an immediate. 16-bit immediates are to be avoided.
if ((Op0.getValueType() == MVT::i16 &&
(isa<ConstantSDNode>(Op0) || isa<ConstantSDNode>(Op1))) &&
!DAG.getMachineFunction().getFunction()->optForMinSize() &&
!Subtarget.isAtom()) {
unsigned ExtendOp =
isX86CCUnsigned(X86CC) ? ISD::ZERO_EXTEND : ISD::SIGN_EXTEND;
Op0 = DAG.getNode(ExtendOp, dl, MVT::i32, Op0);
Op1 = DAG.getNode(ExtendOp, dl, MVT::i32, Op1);
}
// Use SUB instead of CMP to enable CSE between SUB and CMP.
SDVTList VTs = DAG.getVTList(Op0.getValueType(), MVT::i32);
SDValue Sub = DAG.getNode(X86ISD::SUB, dl, VTs,
Op0, Op1);
return SDValue(Sub.getNode(), 1);
}
return DAG.getNode(X86ISD::CMP, dl, MVT::i32, Op0, Op1);
}
/// Convert a comparison if required by the subtarget.
SDValue X86TargetLowering::ConvertCmpIfNecessary(SDValue Cmp,
SelectionDAG &DAG) const {
// If the subtarget does not support the FUCOMI instruction, floating-point
// comparisons have to be converted.
if (Subtarget.hasCMov() ||
Cmp.getOpcode() != X86ISD::CMP ||
!Cmp.getOperand(0).getValueType().isFloatingPoint() ||
!Cmp.getOperand(1).getValueType().isFloatingPoint())
return Cmp;
// The instruction selector will select an FUCOM instruction instead of
// FUCOMI, which writes the comparison result to FPSW instead of EFLAGS. Hence
// build an SDNode sequence that transfers the result from FPSW into EFLAGS:
// (X86sahf (trunc (srl (X86fp_stsw (trunc (X86cmp ...)), 8))))
SDLoc dl(Cmp);
SDValue TruncFPSW = DAG.getNode(ISD::TRUNCATE, dl, MVT::i16, Cmp);
SDValue FNStSW = DAG.getNode(X86ISD::FNSTSW16r, dl, MVT::i16, TruncFPSW);
SDValue Srl = DAG.getNode(ISD::SRL, dl, MVT::i16, FNStSW,
DAG.getConstant(8, dl, MVT::i8));
SDValue TruncSrl = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Srl);
// Some 64-bit targets lack SAHF support, but they do support FCOMI.
assert(Subtarget.hasLAHFSAHF() && "Target doesn't support SAHF or FCOMI?");
return DAG.getNode(X86ISD::SAHF, dl, MVT::i32, TruncSrl);
}
/// Check if replacement of SQRT with RSQRT should be disabled.
bool X86TargetLowering::isFsqrtCheap(SDValue Op, SelectionDAG &DAG) const {
EVT VT = Op.getValueType();
// We never want to use both SQRT and RSQRT instructions for the same input.
if (DAG.getNodeIfExists(X86ISD::FRSQRT, DAG.getVTList(VT), Op))
return false;
if (VT.isVector())
return Subtarget.hasFastVectorFSQRT();
return Subtarget.hasFastScalarFSQRT();
}
/// The minimum architected relative accuracy is 2^-12. We need one
/// Newton-Raphson step to have a good float result (24 bits of precision).
SDValue X86TargetLowering::getSqrtEstimate(SDValue Op,
SelectionDAG &DAG, int Enabled,
int &RefinementSteps,
bool &UseOneConstNR,
bool Reciprocal) const {
EVT VT = Op.getValueType();
// SSE1 has rsqrtss and rsqrtps. AVX adds a 256-bit variant for rsqrtps.
// TODO: Add support for AVX512 (v16f32).
// It is likely not profitable to do this for f64 because a double-precision
// rsqrt estimate with refinement on x86 prior to FMA requires at least 16
// instructions: convert to single, rsqrtss, convert back to double, refine
// (3 steps = at least 13 insts). If an 'rsqrtsd' variant was added to the ISA
// along with FMA, this could be a throughput win.
if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||
(VT == MVT::v4f32 && Subtarget.hasSSE1()) ||
(VT == MVT::v8f32 && Subtarget.hasAVX())) {
if (RefinementSteps == ReciprocalEstimate::Unspecified)
RefinementSteps = 1;
UseOneConstNR = false;
return DAG.getNode(X86ISD::FRSQRT, SDLoc(Op), VT, Op);
}
return SDValue();
}
/// The minimum architected relative accuracy is 2^-12. We need one
/// Newton-Raphson step to have a good float result (24 bits of precision).
SDValue X86TargetLowering::getRecipEstimate(SDValue Op, SelectionDAG &DAG,
int Enabled,
int &RefinementSteps) const {
EVT VT = Op.getValueType();
// SSE1 has rcpss and rcpps. AVX adds a 256-bit variant for rcpps.
// TODO: Add support for AVX512 (v16f32).
// It is likely not profitable to do this for f64 because a double-precision
// reciprocal estimate with refinement on x86 prior to FMA requires
// 15 instructions: convert to single, rcpss, convert back to double, refine
// (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
// along with FMA, this could be a throughput win.
if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||
(VT == MVT::v4f32 && Subtarget.hasSSE1()) ||
(VT == MVT::v8f32 && Subtarget.hasAVX())) {
// Enable estimate codegen with 1 refinement step for vector division.
// Scalar division estimates are disabled because they break too much
// real-world code. These defaults are intended to match GCC behavior.
if (VT == MVT::f32 && Enabled == ReciprocalEstimate::Unspecified)
return SDValue();
if (RefinementSteps == ReciprocalEstimate::Unspecified)
RefinementSteps = 1;
return DAG.getNode(X86ISD::FRCP, SDLoc(Op), VT, Op);
}
return SDValue();
}
/// If we have at least two divisions that use the same divisor, convert to
/// multiplication by a reciprocal. This may need to be adjusted for a given
/// CPU if a division's cost is not at least twice the cost of a multiplication.
/// This is because we still need one division to calculate the reciprocal and
/// then we need two multiplies by that reciprocal as replacements for the
/// original divisions.
unsigned X86TargetLowering::combineRepeatedFPDivisors() const {
return 2;
}
/// Helper for creating a X86ISD::SETCC node.
static SDValue getSETCC(X86::CondCode Cond, SDValue EFLAGS, const SDLoc &dl,
SelectionDAG &DAG) {
return DAG.getNode(X86ISD::SETCC, dl, MVT::i8,
DAG.getConstant(Cond, dl, MVT::i8), EFLAGS);
}
/// Create a BT (Bit Test) node - Test bit \p BitNo in \p Src and set condition
/// according to equal/not-equal condition code \p CC.
static SDValue getBitTestCondition(SDValue Src, SDValue BitNo, ISD::CondCode CC,
const SDLoc &dl, SelectionDAG &DAG) {
// If Src is i8, promote it to i32 with any_extend. There is no i8 BT
// instruction. Since the shift amount is in-range-or-undefined, we know
// that doing a bittest on the i32 value is ok. We extend to i32 because
// the encoding for the i16 version is larger than the i32 version.
// Also promote i16 to i32 for performance / code size reasons.
if (Src.getValueType() == MVT::i8 || Src.getValueType() == MVT::i16)
Src = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32, Src);
// See if we can use the 32-bit instruction instead of the 64-bit one for a
// shorter encoding. Since the former takes BitNo modulo 32 and the latter
// takes it modulo 64, this is only valid if bit 5 of BitNo is known to be
// zero.
if (Src.getValueType() == MVT::i64 &&
DAG.MaskedValueIsZero(BitNo, APInt(BitNo.getValueSizeInBits(), 32)))
Src = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Src);
// If the operand types disagree, extend the shift amount to match. Since
// BT ignores high bits (like shifts) we can use anyextend.
if (Src.getValueType() != BitNo.getValueType())
BitNo = DAG.getNode(ISD::ANY_EXTEND, dl, Src.getValueType(), BitNo);
SDValue BT = DAG.getNode(X86ISD::BT, dl, MVT::i32, Src, BitNo);
X86::CondCode Cond = CC == ISD::SETEQ ? X86::COND_AE : X86::COND_B;
return getSETCC(Cond, BT, dl, DAG);
}
/// Result of 'and' is compared against zero. Change to a BT node if possible.
static SDValue LowerAndToBT(SDValue And, ISD::CondCode CC,
const SDLoc &dl, SelectionDAG &DAG) {
SDValue Op0 = And.getOperand(0);
SDValue Op1 = And.getOperand(1);
if (Op0.getOpcode() == ISD::TRUNCATE)
Op0 = Op0.getOperand(0);
if (Op1.getOpcode() == ISD::TRUNCATE)
Op1 = Op1.getOperand(0);
SDValue LHS, RHS;
if (Op1.getOpcode() == ISD::SHL)
std::swap(Op0, Op1);
if (Op0.getOpcode() == ISD::SHL) {
if (isOneConstant(Op0.getOperand(0))) {
// If we looked past a truncate, check that it's only truncating away
// known zeros.
unsigned BitWidth = Op0.getValueSizeInBits();
unsigned AndBitWidth = And.getValueSizeInBits();
if (BitWidth > AndBitWidth) {
KnownBits Known;
DAG.computeKnownBits(Op0, Known);
if (Known.countMinLeadingZeros() < BitWidth - AndBitWidth)
return SDValue();
}
LHS = Op1;
RHS = Op0.getOperand(1);
}
} else if (Op1.getOpcode() == ISD::Constant) {
ConstantSDNode *AndRHS = cast<ConstantSDNode>(Op1);
uint64_t AndRHSVal = AndRHS->getZExtValue();
SDValue AndLHS = Op0;
if (AndRHSVal == 1 && AndLHS.getOpcode() == ISD::SRL) {
LHS = AndLHS.getOperand(0);
RHS = AndLHS.getOperand(1);
}
// Use BT if the immediate can't be encoded in a TEST instruction.
if (!isUInt<32>(AndRHSVal) && isPowerOf2_64(AndRHSVal)) {
LHS = AndLHS;
RHS = DAG.getConstant(Log2_64_Ceil(AndRHSVal), dl, LHS.getValueType());
}
}
if (LHS.getNode())
return getBitTestCondition(LHS, RHS, CC, dl, DAG);
return SDValue();
}
// Convert (truncate (srl X, N) to i1) to (bt X, N)
static SDValue LowerTruncateToBT(SDValue Op, ISD::CondCode CC,
const SDLoc &dl, SelectionDAG &DAG) {
assert(Op.getOpcode() == ISD::TRUNCATE && Op.getValueType() == MVT::i1 &&
"Expected TRUNCATE to i1 node");
if (Op.getOperand(0).getOpcode() != ISD::SRL)
return SDValue();
SDValue ShiftRight = Op.getOperand(0);
return getBitTestCondition(ShiftRight.getOperand(0), ShiftRight.getOperand(1),
CC, dl, DAG);
}
/// Result of 'and' or 'trunc to i1' is compared against zero.
/// Change to a BT node if possible.
SDValue X86TargetLowering::LowerToBT(SDValue Op, ISD::CondCode CC,
const SDLoc &dl, SelectionDAG &DAG) const {
if (Op.getOpcode() == ISD::AND)
return LowerAndToBT(Op, CC, dl, DAG);
if (Op.getOpcode() == ISD::TRUNCATE && Op.getValueType() == MVT::i1)
return LowerTruncateToBT(Op, CC, dl, DAG);
return SDValue();
}
/// Turns an ISD::CondCode into a value suitable for SSE floating-point mask
/// CMPs.
static int translateX86FSETCC(ISD::CondCode SetCCOpcode, SDValue &Op0,
SDValue &Op1) {
unsigned SSECC;
bool Swap = false;
// SSE Condition code mapping:
// 0 - EQ
// 1 - LT
// 2 - LE
// 3 - UNORD
// 4 - NEQ
// 5 - NLT
// 6 - NLE
// 7 - ORD
switch (SetCCOpcode) {
default: llvm_unreachable("Unexpected SETCC condition");
case ISD::SETOEQ:
case ISD::SETEQ: SSECC = 0; break;
case ISD::SETOGT:
case ISD::SETGT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETLT:
case ISD::SETOLT: SSECC = 1; break;
case ISD::SETOGE:
case ISD::SETGE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETLE:
case ISD::SETOLE: SSECC = 2; break;
case ISD::SETUO: SSECC = 3; break;
case ISD::SETUNE:
case ISD::SETNE: SSECC = 4; break;
case ISD::SETULE: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETUGE: SSECC = 5; break;
case ISD::SETULT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETUGT: SSECC = 6; break;
case ISD::SETO: SSECC = 7; break;
case ISD::SETUEQ:
case ISD::SETONE: SSECC = 8; break;
}
if (Swap)
std::swap(Op0, Op1);
return SSECC;
}
/// Break a 256-bit integer VSETCC into two new 128-bit ones and then
/// concatenate the results back together.
static SDValue Lower256IntVSETCC(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
assert(VT.is256BitVector() && Op.getOpcode() == ISD::SETCC &&
"Unsupported value type for operation");
unsigned NumElems = VT.getVectorNumElements();
SDLoc dl(Op);
SDValue CC = Op.getOperand(2);
// Extract the LHS vectors
SDValue LHS = Op.getOperand(0);
SDValue LHS1 = extract128BitVector(LHS, 0, DAG, dl);
SDValue LHS2 = extract128BitVector(LHS, NumElems / 2, DAG, dl);
// Extract the RHS vectors
SDValue RHS = Op.getOperand(1);
SDValue RHS1 = extract128BitVector(RHS, 0, DAG, dl);
SDValue RHS2 = extract128BitVector(RHS, NumElems / 2, DAG, dl);
// Issue the operation on the smaller types and concatenate the result back
MVT EltVT = VT.getVectorElementType();
MVT NewVT = MVT::getVectorVT(EltVT, NumElems/2);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS1, RHS1, CC),
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS2, RHS2, CC));
}
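/// Lower a SETCC on vXi1 mask operands to pure bitwise logic on the masks.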
static SDValue LowerBoolVSETCC_AVX512(SDValue Op, SelectionDAG &DAG) {
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDValue CC = Op.getOperand(2);
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
assert(Op0.getSimpleValueType().getVectorElementType() == MVT::i1 &&
"Unexpected type for boolean compare operation");
ISD::CondCode SetCCOpcode = cast<CondCodeSDNode>(CC)->get();
SDValue NotOp0 = DAG.getNode(ISD::XOR, dl, VT, Op0,
DAG.getConstant(-1, dl, VT));
SDValue NotOp1 = DAG.getNode(ISD::XOR, dl, VT, Op1,
DAG.getConstant(-1, dl, VT));
switch (SetCCOpcode) {
default: llvm_unreachable("Unexpected SETCC condition");
case ISD::SETEQ:
// (x == y) -> ~(x ^ y)
return DAG.getNode(ISD::XOR, dl, VT,
DAG.getNode(ISD::XOR, dl, VT, Op0, Op1),
DAG.getConstant(-1, dl, VT));
case ISD::SETNE:
// (x != y) -> (x ^ y)
return DAG.getNode(ISD::XOR, dl, VT, Op0, Op1);
case ISD::SETUGT:
case ISD::SETGT:
// (x > y) -> (x & ~y)
return DAG.getNode(ISD::AND, dl, VT, Op0, NotOp1);
case ISD::SETULT:
case ISD::SETLT:
// (x < y) -> (~x & y)
return DAG.getNode(ISD::AND, dl, VT, NotOp0, Op1);
case ISD::SETULE:
case ISD::SETLE:
// (x <= y) -> (~x | y)
return DAG.getNode(ISD::OR, dl, VT, NotOp0, Op1);
case ISD::SETUGE:
case ISD::SETGE:
// (x >= y) -> (x | ~y)
return DAG.getNode(ISD::OR, dl, VT, Op0, NotOp1);
}
}
static SDValue LowerIntVSETCC_AVX512(SDValue Op, SelectionDAG &DAG) {
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDValue CC = Op.getOperand(2);
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
assert(VT.getVectorElementType() == MVT::i1 &&
"Cannot set masked compare for this operation");
ISD::CondCode SetCCOpcode = cast<CondCodeSDNode>(CC)->get();
unsigned Opc = 0;
bool Unsigned = false;
bool Swap = false;
unsigned SSECC;
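// Comparisons that map directly to a mask-compare node set Opc
// (PCMPEQM/PCMPGTM); the rest are emitted as CMPM/CMPMU with the SSECC
// immediate selected below.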
switch (SetCCOpcode) {
default: llvm_unreachable("Unexpected SETCC condition");
case ISD::SETNE: SSECC = 4; break;
case ISD::SETEQ: Opc = X86ISD::PCMPEQM; break;
case ISD::SETUGT: SSECC = 6; Unsigned = true; break;
case ISD::SETLT: Swap = true; LLVM_FALLTHROUGH;
case ISD::SETGT: Opc = X86ISD::PCMPGTM; break;
case ISD::SETULT: SSECC = 1; Unsigned = true; break;
case ISD::SETUGE: SSECC = 5; Unsigned = true; break; // NLT
case ISD::SETGE: Swap = true; SSECC = 2; break; // LE + swap
case ISD::SETULE: Unsigned = true; LLVM_FALLTHROUGH;
case ISD::SETLE: SSECC = 2; break;
}
if (Swap)
std::swap(Op0, Op1);
if (Opc)
return DAG.getNode(Opc, dl, VT, Op0, Op1);
Opc = Unsigned ? X86ISD::CMPMU: X86ISD::CMPM;
return DAG.getNode(Opc, dl, VT, Op0, Op1,
DAG.getConstant(SSECC, dl, MVT::i8));
}
/// \brief Try to turn a VSETULT into a VSETULE by modifying its second
/// operand \p Op1. If the transformation is not trivial (for example
/// because Op1 is not a constant), return an empty value.
static SDValue ChangeVSETULTtoVSETULE(const SDLoc &dl, SDValue Op1,
SelectionDAG &DAG) {
BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(Op1.getNode());
if (!BV)
return SDValue();
MVT VT = Op1.getSimpleValueType();
MVT EVT = VT.getVectorElementType();
unsigned n = VT.getVectorNumElements();
SmallVector<SDValue, 8> ULTOp1;
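// (x <u C) is equivalent to (x <=u C-1) as long as no element of C is
// zero; build the decremented constant vector element by element.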
for (unsigned i = 0; i < n; ++i) {
ConstantSDNode *Elt = dyn_cast<ConstantSDNode>(BV->getOperand(i));
if (!Elt || Elt->isOpaque() || Elt->getSimpleValueType(0) != EVT)
return SDValue();
// Avoid underflow.
APInt Val = Elt->getAPIntValue();
if (Val == 0)
return SDValue();
ULTOp1.push_back(DAG.getConstant(Val - 1, dl, EVT));
}
return DAG.getBuildVector(VT, dl, ULTOp1);
}
static SDValue LowerVSETCC(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDValue CC = Op.getOperand(2);
MVT VT = Op.getSimpleValueType();
ISD::CondCode Cond = cast<CondCodeSDNode>(CC)->get();
bool isFP = Op.getOperand(1).getSimpleValueType().isFloatingPoint();
SDLoc dl(Op);
if (isFP) {
#ifndef NDEBUG
MVT EltVT = Op0.getSimpleValueType().getVectorElementType();
assert(EltVT == MVT::f32 || EltVT == MVT::f64);
#endif
unsigned Opc;
if (Subtarget.hasAVX512() && VT.getVectorElementType() == MVT::i1) {
assert(VT.getVectorNumElements() <= 16);
Opc = X86ISD::CMPM;
} else {
Opc = X86ISD::CMPP;
// The SSE/AVX packed FP comparison nodes are defined with a
// floating-point vector result that matches the operand type. This allows
// them to work with an SSE1 target (integer vector types are not legal).
VT = Op0.getSimpleValueType();
}
// In the two cases not handled by SSE compare predicates (SETUEQ/SETONE),
// emit two comparisons and a logic op to tie them together.
// TODO: This can be avoided if AVX (which, as of 2016, is Intel-only) is
// available.
SDValue Cmp;
unsigned SSECC = translateX86FSETCC(Cond, Op0, Op1);
if (SSECC == 8) {
// LLVM predicate is SETUEQ or SETONE.
unsigned CC0, CC1;
unsigned CombineOpc;
if (Cond == ISD::SETUEQ) {
CC0 = 3; // UNORD
CC1 = 0; // EQ
CombineOpc = Opc == X86ISD::CMPP ? static_cast<unsigned>(X86ISD::FOR) :
static_cast<unsigned>(ISD::OR);
} else {
assert(Cond == ISD::SETONE);
CC0 = 7; // ORD
CC1 = 4; // NEQ
CombineOpc = Opc == X86ISD::CMPP ? static_cast<unsigned>(X86ISD::FAND) :
static_cast<unsigned>(ISD::AND);
}
SDValue Cmp0 = DAG.getNode(Opc, dl, VT, Op0, Op1,
DAG.getConstant(CC0, dl, MVT::i8));
SDValue Cmp1 = DAG.getNode(Opc, dl, VT, Op0, Op1,
DAG.getConstant(CC1, dl, MVT::i8));
Cmp = DAG.getNode(CombineOpc, dl, VT, Cmp0, Cmp1);
} else {
// Handle all other FP comparisons here.
Cmp = DAG.getNode(Opc, dl, VT, Op0, Op1,
DAG.getConstant(SSECC, dl, MVT::i8));
}
// If this is SSE/AVX CMPP, bitcast the result back to integer to match the
// result type of SETCC. The bitcast is expected to be optimized away
// during combining/isel.
if (Opc == X86ISD::CMPP)
Cmp = DAG.getBitcast(Op.getSimpleValueType(), Cmp);
return Cmp;
}
MVT VTOp0 = Op0.getSimpleValueType();
assert(VTOp0 == Op1.getSimpleValueType() &&
"Expected operands with same type!");
assert(VT.getVectorNumElements() == VTOp0.getVectorNumElements() &&
"Invalid number of packed elements for source and destination!");
if (VT.is128BitVector() && VTOp0.is256BitVector()) {
// On non-AVX512 targets, a vector of MVT::i1 is promoted by the type
// legalizer to a wider vector type. In the case of 'vsetcc' nodes, the
// legalizer first checks whether the first operand of the setcc has
// a legal type. If so, then it promotes the return type to that same type.
// Otherwise, the return type is promoted to the 'next legal type' which,
// for a vector of MVT::i1 is always a 128-bit integer vector type.
//
// We reach this code only if the following two conditions are met:
// 1. Both return type and operand type have been promoted to wider types
// by the type legalizer.
// 2. The original operand type has been promoted to a 256-bit vector.
//
// Note that condition 2. only applies for AVX targets.
SDValue NewOp = DAG.getSetCC(dl, VTOp0, Op0, Op1, Cond);
return DAG.getZExtOrTrunc(NewOp, dl, VT);
}
// The non-AVX512 code below works under the assumption that source and
// destination types are the same.
assert((Subtarget.hasAVX512() || (VT == VTOp0)) &&
"Value types for source and destination must be the same!");
// Break 256-bit integer vector compare into smaller ones.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntVSETCC(Op, DAG);
// Operands are boolean (vectors of i1)
MVT OpVT = Op1.getSimpleValueType();
if (OpVT.getVectorElementType() == MVT::i1)
return LowerBoolVSETCC_AVX512(Op, DAG);
// The result is boolean, but operands are int/float
if (VT.getVectorElementType() == MVT::i1) {
// In the AVX-512 architecture, setcc returns a mask with i1 elements,
// but there are no compare instructions for i8 and i16 elements on KNL.
// In that case, use an SSE compare instead.
bool UseAVX512Inst =
(OpVT.is512BitVector() ||
OpVT.getScalarSizeInBits() >= 32 ||
(Subtarget.hasBWI() && Subtarget.hasVLX()));
if (UseAVX512Inst)
return LowerIntVSETCC_AVX512(Op, DAG);
return DAG.getNode(ISD::TRUNCATE, dl, VT,
DAG.getNode(ISD::SETCC, dl, OpVT, Op0, Op1, CC));
}
// Lower using XOP integer comparisons.
if ((VT == MVT::v16i8 || VT == MVT::v8i16 ||
VT == MVT::v4i32 || VT == MVT::v2i64) && Subtarget.hasXOP()) {
// Translate compare code to XOP PCOM compare mode.
unsigned CmpMode = 0;
switch (Cond) {
default: llvm_unreachable("Unexpected SETCC condition");
case ISD::SETULT:
case ISD::SETLT: CmpMode = 0x00; break;
case ISD::SETULE:
case ISD::SETLE: CmpMode = 0x01; break;
case ISD::SETUGT:
case ISD::SETGT: CmpMode = 0x02; break;
case ISD::SETUGE:
case ISD::SETGE: CmpMode = 0x03; break;
case ISD::SETEQ: CmpMode = 0x04; break;
case ISD::SETNE: CmpMode = 0x05; break;
}
// Are we comparing unsigned or signed integers?
unsigned Opc =
ISD::isUnsignedIntSetCC(Cond) ? X86ISD::VPCOMU : X86ISD::VPCOM;
return DAG.getNode(Opc, dl, VT, Op0, Op1,
DAG.getConstant(CmpMode, dl, MVT::i8));
}
// We are handling one of the integer comparisons here. Since SSE only has
// GT and EQ comparisons for integers, swapping operands and multiple
// operations may be required for some comparisons.
unsigned Opc = (Cond == ISD::SETEQ || Cond == ISD::SETNE) ? X86ISD::PCMPEQ
: X86ISD::PCMPGT;
bool Swap = Cond == ISD::SETLT || Cond == ISD::SETULT ||
Cond == ISD::SETGE || Cond == ISD::SETUGE;
bool Invert = Cond == ISD::SETNE ||
(Cond != ISD::SETEQ && ISD::isTrueWhenEqual(Cond));
// If both operands are known non-negative, then an unsigned compare is the
// same as a signed compare and there's no need to flip signbits.
// TODO: We could check for more general simplifications here since we're
// computing known bits.
bool FlipSigns = ISD::isUnsignedIntSetCC(Cond) &&
!(DAG.SignBitIsZero(Op0) && DAG.SignBitIsZero(Op1));
// Special case: Use min/max operations for SETULE/SETUGE
MVT VET = VT.getVectorElementType();
bool HasMinMax =
(Subtarget.hasSSE41() && (VET >= MVT::i8 && VET <= MVT::i32)) ||
(Subtarget.hasSSE2() && (VET == MVT::i8));
bool MinMax = false;
if (HasMinMax) {
switch (Cond) {
default: break;
case ISD::SETULE: Opc = ISD::UMIN; MinMax = true; break;
case ISD::SETUGE: Opc = ISD::UMAX; MinMax = true; break;
}
if (MinMax)
Swap = Invert = FlipSigns = false;
}
bool HasSubus = Subtarget.hasSSE2() && (VET == MVT::i8 || VET == MVT::i16);
bool Subus = false;
if (!MinMax && HasSubus) {
// As another special case, use PSUBUS[BW] when it's profitable. E.g. for
// Op0 u<= Op1:
// t = psubus Op0, Op1
// pcmpeq t, <0..0>
switch (Cond) {
default: break;
case ISD::SETULT: {
// If the comparison is against a constant we can turn this into a
// setule. With psubus, setule does not require a swap. This is
// beneficial because the constant in the register is no longer
// clobbered as the destination, so it can be hoisted out of a loop.
// Only do this pre-AVX since vpcmp* is no longer destructive.
if (Subtarget.hasAVX())
break;
if (SDValue ULEOp1 = ChangeVSETULTtoVSETULE(dl, Op1, DAG)) {
Op1 = ULEOp1;
Subus = true; Invert = false; Swap = false;
}
break;
}
// Psubus is better than flip-sign because it requires no inversion.
case ISD::SETUGE: Subus = true; Invert = false; Swap = true; break;
case ISD::SETULE: Subus = true; Invert = false; Swap = false; break;
}
if (Subus) {
Opc = X86ISD::SUBUS;
FlipSigns = false;
}
}
if (Swap)
std::swap(Op0, Op1);
// Check that the operation in question is available (most are plain SSE2,
// but PCMPGTQ and PCMPEQQ have different requirements).
if (VT == MVT::v2i64) {
if (Opc == X86ISD::PCMPGT && !Subtarget.hasSSE42()) {
assert(Subtarget.hasSSE2() && "Don't know how to lower!");
// First cast everything to the right type.
Op0 = DAG.getBitcast(MVT::v4i32, Op0);
Op1 = DAG.getBitcast(MVT::v4i32, Op1);
// Since SSE has no unsigned integer comparisons, we need to flip the sign
// bits of the inputs before performing those operations. The lower
// compare is always unsigned.
SDValue SB;
if (FlipSigns) {
SB = DAG.getConstant(0x80000000U, dl, MVT::v4i32);
} else {
SDValue Sign = DAG.getConstant(0x80000000U, dl, MVT::i32);
SDValue Zero = DAG.getConstant(0x00000000U, dl, MVT::i32);
SB = DAG.getBuildVector(MVT::v4i32, dl, {Sign, Zero, Sign, Zero});
}
Op0 = DAG.getNode(ISD::XOR, dl, MVT::v4i32, Op0, SB);
Op1 = DAG.getNode(ISD::XOR, dl, MVT::v4i32, Op1, SB);
// Emulate PCMPGTQ with (hi1 > hi2) | ((hi1 == hi2) & (lo1 > lo2))
SDValue GT = DAG.getNode(X86ISD::PCMPGT, dl, MVT::v4i32, Op0, Op1);
SDValue EQ = DAG.getNode(X86ISD::PCMPEQ, dl, MVT::v4i32, Op0, Op1);
// Create masks for only the low parts/high parts of the 64-bit integers.
static const int MaskHi[] = { 1, 1, 3, 3 };
static const int MaskLo[] = { 0, 0, 2, 2 };
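// MaskHi/MaskLo broadcast the high/low 32-bit half of each 64-bit element
// into both lanes of that element.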
SDValue EQHi = DAG.getVectorShuffle(MVT::v4i32, dl, EQ, EQ, MaskHi);
SDValue GTLo = DAG.getVectorShuffle(MVT::v4i32, dl, GT, GT, MaskLo);
SDValue GTHi = DAG.getVectorShuffle(MVT::v4i32, dl, GT, GT, MaskHi);
SDValue Result = DAG.getNode(ISD::AND, dl, MVT::v4i32, EQHi, GTLo);
Result = DAG.getNode(ISD::OR, dl, MVT::v4i32, Result, GTHi);
if (Invert)
Result = DAG.getNOT(dl, Result, MVT::v4i32);
return DAG.getBitcast(VT, Result);
}
if (Opc == X86ISD::PCMPEQ && !Subtarget.hasSSE41()) {
// If pcmpeqq is missing but pcmpeqd is available synthesize pcmpeqq with
// pcmpeqd + pshufd + pand.
assert(Subtarget.hasSSE2() && !FlipSigns && "Don't know how to lower!");
// First cast everything to the right type.
Op0 = DAG.getBitcast(MVT::v4i32, Op0);
Op1 = DAG.getBitcast(MVT::v4i32, Op1);
// Do the compare.
SDValue Result = DAG.getNode(Opc, dl, MVT::v4i32, Op0, Op1);
// Make sure the lower and upper halves are both all-ones.
static const int Mask[] = { 1, 0, 3, 2 };
SDValue Shuf = DAG.getVectorShuffle(MVT::v4i32, dl, Result, Result, Mask);
Result = DAG.getNode(ISD::AND, dl, MVT::v4i32, Result, Shuf);
if (Invert)
Result = DAG.getNOT(dl, Result, MVT::v4i32);
return DAG.getBitcast(VT, Result);
}
}
// Since SSE has no unsigned integer comparisons, we need to flip the sign
// bits of the inputs before performing those operations.
if (FlipSigns) {
MVT EltVT = VT.getVectorElementType();
SDValue SM = DAG.getConstant(APInt::getSignMask(EltVT.getSizeInBits()), dl,
VT);
Op0 = DAG.getNode(ISD::XOR, dl, VT, Op0, SM);
Op1 = DAG.getNode(ISD::XOR, dl, VT, Op1, SM);
}
SDValue Result = DAG.getNode(Opc, dl, VT, Op0, Op1);
// If the logical-not of the result is required, perform that now.
if (Invert)
Result = DAG.getNOT(dl, Result, VT);
if (MinMax)
Result = DAG.getNode(X86ISD::PCMPEQ, dl, VT, Op0, Result);
if (Subus)
Result = DAG.getNode(X86ISD::PCMPEQ, dl, VT, Result,
getZeroVector(VT, Subtarget, DAG, dl));
return Result;
}
SDValue X86TargetLowering::LowerSETCC(SDValue Op, SelectionDAG &DAG) const {
MVT VT = Op.getSimpleValueType();
if (VT.isVector()) return LowerVSETCC(Op, Subtarget, DAG);
assert(VT == MVT::i8 && "SetCC type must be 8-bit integer");
SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDLoc dl(Op);
ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(2))->get();
// Optimize to BT if possible.
// Lower (X & (1 << N)) == 0 to BT(X, N).
// Lower ((X >>u N) & 1) != 0 to BT(X, N).
// Lower ((X >>s N) & 1) != 0 to BT(X, N).
// Lower (trunc (X >> N) to i1) to BT(X, N).
if (Op0.hasOneUse() && isNullConstant(Op1) &&
(CC == ISD::SETEQ || CC == ISD::SETNE)) {
if (SDValue NewSetCC = LowerToBT(Op0, CC, dl, DAG)) {
if (VT == MVT::i1)
return DAG.getNode(ISD::TRUNCATE, dl, MVT::i1, NewSetCC);
return NewSetCC;
}
}
// Look for X == 0, X == 1, X != 0, or X != 1. We can simplify some forms of
// these.
if ((isOneConstant(Op1) || isNullConstant(Op1)) &&
(CC == ISD::SETEQ || CC == ISD::SETNE)) {
// If the input is a setcc, then reuse the input setcc or use a new one with
// the inverted condition.
if (Op0.getOpcode() == X86ISD::SETCC) {
X86::CondCode CCode = (X86::CondCode)Op0.getConstantOperandVal(0);
bool Invert = (CC == ISD::SETNE) ^ isNullConstant(Op1);
if (!Invert)
return Op0;
CCode = X86::GetOppositeBranchCondition(CCode);
SDValue SetCC = getSETCC(CCode, Op0.getOperand(1), dl, DAG);
if (VT == MVT::i1)
return DAG.getNode(ISD::TRUNCATE, dl, MVT::i1, SetCC);
return SetCC;
}
}
if (Op0.getValueType() == MVT::i1 && (CC == ISD::SETEQ || CC == ISD::SETNE)) {
if (isOneConstant(Op1)) {
ISD::CondCode NewCC = ISD::getSetCCInverse(CC, true);
return DAG.getSetCC(dl, VT, Op0, DAG.getConstant(0, dl, MVT::i1), NewCC);
}
if (!isNullConstant(Op1)) {
SDValue Xor = DAG.getNode(ISD::XOR, dl, MVT::i1, Op0, Op1);
return DAG.getSetCC(dl, VT, Xor, DAG.getConstant(0, dl, MVT::i1), CC);
}
}
bool IsFP = Op1.getSimpleValueType().isFloatingPoint();
X86::CondCode X86CC = TranslateX86CC(CC, dl, IsFP, Op0, Op1, DAG);
if (X86CC == X86::COND_INVALID)
return SDValue();
SDValue EFLAGS = EmitCmp(Op0, Op1, X86CC, dl, DAG);
EFLAGS = ConvertCmpIfNecessary(EFLAGS, DAG);
SDValue SetCC = getSETCC(X86CC, EFLAGS, dl, DAG);
if (VT == MVT::i1)
return DAG.getNode(ISD::TRUNCATE, dl, MVT::i1, SetCC);
return SetCC;
}
SDValue X86TargetLowering::LowerSETCCCARRY(SDValue Op, SelectionDAG &DAG) const {
SDValue LHS = Op.getOperand(0);
SDValue RHS = Op.getOperand(1);
SDValue Carry = Op.getOperand(2);
SDValue Cond = Op.getOperand(3);
SDLoc DL(Op);
assert(LHS.getSimpleValueType().isInteger() && "SETCCCARRY is integer only.");
X86::CondCode CC = TranslateIntegerX86CC(cast<CondCodeSDNode>(Cond)->get());
// Recreate the carry if needed.
EVT CarryVT = Carry.getValueType();
APInt NegOne = APInt::getAllOnesValue(CarryVT.getScalarSizeInBits());
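// Adding all-ones to the carry value sets CF exactly when the carry is
// non-zero, rematerializing the carry flag in EFLAGS for the SBB below.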
Carry = DAG.getNode(X86ISD::ADD, DL, DAG.getVTList(CarryVT, MVT::i32),
Carry, DAG.getConstant(NegOne, DL, CarryVT));
SDVTList VTs = DAG.getVTList(LHS.getValueType(), MVT::i32);
SDValue Cmp = DAG.getNode(X86ISD::SBB, DL, VTs, LHS, RHS, Carry.getValue(1));
SDValue SetCC = getSETCC(CC, Cmp.getValue(1), DL, DAG);
if (Op.getSimpleValueType() == MVT::i1)
return DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, SetCC);
return SetCC;
}
/// Return true if the opcode is an X86 logical comparison.
static bool isX86LogicalCmp(SDValue Op) {
unsigned Opc = Op.getOpcode();
if (Opc == X86ISD::CMP || Opc == X86ISD::COMI || Opc == X86ISD::UCOMI ||
Opc == X86ISD::SAHF)
return true;
if (Op.getResNo() == 1 &&
(Opc == X86ISD::ADD || Opc == X86ISD::SUB || Opc == X86ISD::ADC ||
Opc == X86ISD::SBB || Opc == X86ISD::SMUL || Opc == X86ISD::UMUL ||
Opc == X86ISD::INC || Opc == X86ISD::DEC || Opc == X86ISD::OR ||
Opc == X86ISD::XOR || Opc == X86ISD::AND))
return true;
if (Op.getResNo() == 2 && Opc == X86ISD::UMUL)
return true;
return false;
}
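/// Return true if V is a truncate whose discarded high bits are known to
/// be zero, i.e. the truncation is lossless.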
static bool isTruncWithZeroHighBitsInput(SDValue V, SelectionDAG &DAG) {
if (V.getOpcode() != ISD::TRUNCATE)
return false;
SDValue VOp0 = V.getOperand(0);
unsigned InBits = VOp0.getValueSizeInBits();
unsigned Bits = V.getValueSizeInBits();
return DAG.MaskedValueIsZero(VOp0, APInt::getHighBitsSet(InBits,InBits-Bits));
}
SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
bool AddTest = true;
SDValue Cond = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);
SDValue Op2 = Op.getOperand(2);
SDLoc DL(Op);
MVT VT = Op1.getSimpleValueType();
SDValue CC;
// Lower FP selects into a CMP/AND/ANDN/OR sequence when the necessary SSE ops
// are available or VBLENDV if AVX is available.
// Otherwise FP cmovs get lowered into a less efficient branch sequence later.
if (Cond.getOpcode() == ISD::SETCC &&
((Subtarget.hasSSE2() && (VT == MVT::f32 || VT == MVT::f64)) ||
(Subtarget.hasSSE1() && VT == MVT::f32)) &&
VT == Cond.getOperand(0).getSimpleValueType() && Cond->hasOneUse()) {
SDValue CondOp0 = Cond.getOperand(0), CondOp1 = Cond.getOperand(1);
int SSECC = translateX86FSETCC(
cast<CondCodeSDNode>(Cond.getOperand(2))->get(), CondOp0, CondOp1);
if (SSECC != 8) {
if (Subtarget.hasAVX512()) {
SDValue Cmp = DAG.getNode(X86ISD::FSETCCM, DL, MVT::v1i1, CondOp0,
CondOp1, DAG.getConstant(SSECC, DL, MVT::i8));
return DAG.getNode(VT.isVector() ? X86ISD::SELECT : X86ISD::SELECTS,
DL, VT, Cmp, Op1, Op2);
}
SDValue Cmp = DAG.getNode(X86ISD::FSETCC, DL, VT, CondOp0, CondOp1,
DAG.getConstant(SSECC, DL, MVT::i8));
// If we have AVX, we can use a variable vector select (VBLENDV) instead
// of 3 logic instructions for size savings and potentially speed.
// Unfortunately, there is no scalar form of VBLENDV.
// If either operand is a constant, don't try this. We can expect to
// optimize away at least one of the logic instructions later in that
// case, so that sequence would be faster than a variable blend.
// BLENDV was introduced with SSE 4.1, but the two-register form implicitly
// uses XMM0 as the selection register. That may need just as many
// instructions as the AND/ANDN/OR sequence due to register moves, so
// don't bother.
if (Subtarget.hasAVX() &&
!isa<ConstantFPSDNode>(Op1) && !isa<ConstantFPSDNode>(Op2)) {
// Convert to vectors, do a VSELECT, and convert back to scalar.
// All of the conversions should be optimized away.
MVT VecVT = VT == MVT::f32 ? MVT::v4f32 : MVT::v2f64;
SDValue VOp1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Op1);
SDValue VOp2 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Op2);
SDValue VCmp = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, Cmp);
MVT VCmpVT = VT == MVT::f32 ? MVT::v4i32 : MVT::v2i64;
VCmp = DAG.getBitcast(VCmpVT, VCmp);
SDValue VSel = DAG.getSelect(DL, VecVT, VCmp, VOp1, VOp2);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT,
VSel, DAG.getIntPtrConstant(0, DL));
}
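// Fall back to the scalar AND/ANDN/OR select sequence:
// (Cmp & Op1) | (~Cmp & Op2).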
SDValue AndN = DAG.getNode(X86ISD::FANDN, DL, VT, Cmp, Op2);
SDValue And = DAG.getNode(X86ISD::FAND, DL, VT, Cmp, Op1);
return DAG.getNode(X86ISD::FOR, DL, VT, AndN, And);
}
}
// AVX512 fallback is to lower selects of scalar floats to masked moves.
if ((VT == MVT::f64 || VT == MVT::f32) && Subtarget.hasAVX512()) {
SDValue Cmp = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v1i1, Cond);
return DAG.getNode(X86ISD::SELECTS, DL, VT, Cmp, Op1, Op2);
}
if (VT.isVector() && VT.getVectorElementType() == MVT::i1) {
SDValue Op1Scalar;
if (ISD::isBuildVectorOfConstantSDNodes(Op1.getNode()))
Op1Scalar = ConvertI1VectorToInteger(Op1, DAG);
else if (Op1.getOpcode() == ISD::BITCAST && Op1.getOperand(0))
Op1Scalar = Op1.getOperand(0);
SDValue Op2Scalar;
if (ISD::isBuildVectorOfConstantSDNodes(Op2.getNode()))
Op2Scalar = ConvertI1VectorToInteger(Op2, DAG);
else if (Op2.getOpcode() == ISD::BITCAST && Op2.getOperand(0))
Op2Scalar = Op2.getOperand(0);
if (Op1Scalar.getNode() && Op2Scalar.getNode()) {
SDValue newSelect = DAG.getSelect(DL, Op1Scalar.getValueType(), Cond,
Op1Scalar, Op2Scalar);
if (newSelect.getValueSizeInBits() == VT.getSizeInBits())
return DAG.getBitcast(VT, newSelect);
SDValue ExtVec = DAG.getBitcast(MVT::v8i1, newSelect);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, ExtVec,
DAG.getIntPtrConstant(0, DL));
}
}
if (VT == MVT::v4i1 || VT == MVT::v2i1) {
SDValue zeroConst = DAG.getIntPtrConstant(0, DL);
Op1 = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, MVT::v8i1,
DAG.getUNDEF(MVT::v8i1), Op1, zeroConst);
Op2 = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, MVT::v8i1,
DAG.getUNDEF(MVT::v8i1), Op2, zeroConst);
SDValue newSelect = DAG.getSelect(DL, MVT::v8i1, Cond, Op1, Op2);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, newSelect, zeroConst);
}
if (Cond.getOpcode() == ISD::SETCC) {
if (SDValue NewCond = LowerSETCC(Cond, DAG)) {
Cond = NewCond;
// If the condition was updated, it's possible that the operands of the
// select were also updated (for example, EmitTest has a RAUW). Refresh
// the local references to the select operands in case they got stale.
Op1 = Op.getOperand(1);
Op2 = Op.getOperand(2);
}
}
// (select (x == 0), -1, y) -> (sign_bit (x - 1)) | y
// (select (x == 0), y, -1) -> ~(sign_bit (x - 1)) | y
// (select (x != 0), y, -1) -> (sign_bit (x - 1)) | y
// (select (x != 0), -1, y) -> ~(sign_bit (x - 1)) | y
// (select (and (x, 0x1) == 0), y, (z ^ y)) -> (-(and (x, 0x1)) & z) ^ y
// (select (and (x, 0x1) == 0), y, (z | y)) -> (-(and (x, 0x1)) & z) | y
if (Cond.getOpcode() == X86ISD::SETCC &&
Cond.getOperand(1).getOpcode() == X86ISD::CMP &&
isNullConstant(Cond.getOperand(1).getOperand(1))) {
SDValue Cmp = Cond.getOperand(1);
unsigned CondCode =
cast<ConstantSDNode>(Cond.getOperand(0))->getZExtValue();
if ((isAllOnesConstant(Op1) || isAllOnesConstant(Op2)) &&
(CondCode == X86::COND_E || CondCode == X86::COND_NE)) {
SDValue Y = isAllOnesConstant(Op2) ? Op1 : Op2;
SDValue CmpOp0 = Cmp.getOperand(0);
// Apply further optimizations for special cases
// (select (x != 0), -1, 0) -> neg & sbb
// (select (x == 0), 0, -1) -> neg & sbb
if (isNullConstant(Y) &&
(isAllOnesConstant(Op1) == (CondCode == X86::COND_NE))) {
SDVTList VTs = DAG.getVTList(CmpOp0.getValueType(), MVT::i32);
SDValue Zero = DAG.getConstant(0, DL, CmpOp0.getValueType());
SDValue Neg = DAG.getNode(X86ISD::SUB, DL, VTs, Zero, CmpOp0);
SDValue Res = DAG.getNode(X86ISD::SETCC_CARRY, DL, Op.getValueType(),
DAG.getConstant(X86::COND_B, DL, MVT::i8),
SDValue(Neg.getNode(), 1));
return Res;
}
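// Comparing CmpOp0 against 1 sets CF exactly when CmpOp0 == 0 (unsigned
// x < 1), so the SETCC_CARRY below materializes 0 or -1 accordingly.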
Cmp = DAG.getNode(X86ISD::CMP, DL, MVT::i32,
CmpOp0, DAG.getConstant(1, DL, CmpOp0.getValueType()));
Cmp = ConvertCmpIfNecessary(Cmp, DAG);
SDValue Res = // Res = 0 or -1.
DAG.getNode(X86ISD::SETCC_CARRY, DL, Op.getValueType(),
DAG.getConstant(X86::COND_B, DL, MVT::i8), Cmp);
if (isAllOnesConstant(Op1) != (CondCode == X86::COND_E))
Res = DAG.getNOT(DL, Res, Res.getValueType());
if (!isNullConstant(Op2))
Res = DAG.getNode(ISD::OR, DL, Res.getValueType(), Res, Y);
return Res;
} else if (!Subtarget.hasCMov() && CondCode == X86::COND_E &&
Cmp.getOperand(0).getOpcode() == ISD::AND &&
isOneConstant(Cmp.getOperand(0).getOperand(1))) {
SDValue CmpOp0 = Cmp.getOperand(0);
SDValue Src1, Src2;
// Returns true if Op2 is an XOR or OR operation and one of its operands
// equals Op1:
// (a, a op b) || (b, a op b)
auto isOrXorPattern = [&]() {
if ((Op2.getOpcode() == ISD::XOR || Op2.getOpcode() == ISD::OR) &&
(Op2.getOperand(0) == Op1 || Op2.getOperand(1) == Op1)) {
Src1 =
Op2.getOperand(0) == Op1 ? Op2.getOperand(1) : Op2.getOperand(0);
Src2 = Op1;
return true;
}
return false;
};
if (isOrXorPattern()) {
SDValue Neg;
unsigned int CmpSz = CmpOp0.getSimpleValueType().getSizeInBits();
// We need a mask of all zeros or all ones with the same size as the other
// operands.
if (CmpSz > VT.getSizeInBits())
Neg = DAG.getNode(ISD::TRUNCATE, DL, VT, CmpOp0);
else if (CmpSz < VT.getSizeInBits())
Neg = DAG.getNode(ISD::AND, DL, VT,
DAG.getNode(ISD::ANY_EXTEND, DL, VT, CmpOp0.getOperand(0)),
DAG.getConstant(1, DL, VT));
else
Neg = CmpOp0;
SDValue Mask = DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT),
Neg); // -(and (x, 0x1))
SDValue And = DAG.getNode(ISD::AND, DL, VT, Mask, Src1); // Mask & z
return DAG.getNode(Op2.getOpcode(), DL, VT, And, Src2); // (Mask & z) op y
}
}
}
// Look past (and (setcc_carry (cmp ...)), 1).
if (Cond.getOpcode() == ISD::AND &&
Cond.getOperand(0).getOpcode() == X86ISD::SETCC_CARRY &&
isOneConstant(Cond.getOperand(1)))
Cond = Cond.getOperand(0);
// If condition flag is set by a X86ISD::CMP, then use it as the condition
// setting operand in place of the X86ISD::SETCC.
unsigned CondOpcode = Cond.getOpcode();
if (CondOpcode == X86ISD::SETCC ||
CondOpcode == X86ISD::SETCC_CARRY) {
CC = Cond.getOperand(0);
SDValue Cmp = Cond.getOperand(1);
unsigned Opc = Cmp.getOpcode();
MVT VT = Op.getSimpleValueType();
bool IllegalFPCMov = false;
if (VT.isFloatingPoint() && !VT.isVector() &&
!isScalarFPTypeInSSEReg(VT)) // FPStack?
IllegalFPCMov = !hasFPCMov(cast<ConstantSDNode>(CC)->getSExtValue());
if ((isX86LogicalCmp(Cmp) && !IllegalFPCMov) ||
Opc == X86ISD::BT) { // FIXME
Cond = Cmp;
AddTest = false;
}
} else if (CondOpcode == ISD::USUBO || CondOpcode == ISD::SSUBO ||
CondOpcode == ISD::UADDO || CondOpcode == ISD::SADDO ||
((CondOpcode == ISD::UMULO || CondOpcode == ISD::SMULO) &&
Cond.getOperand(0).getValueType() != MVT::i8)) {
SDValue LHS = Cond.getOperand(0);
SDValue RHS = Cond.getOperand(1);
unsigned X86Opcode;
unsigned X86Cond;
SDVTList VTs;
switch (CondOpcode) {
case ISD::UADDO: X86Opcode = X86ISD::ADD; X86Cond = X86::COND_B; break;
case ISD::SADDO: X86Opcode = X86ISD::ADD; X86Cond = X86::COND_O; break;
case ISD::USUBO: X86Opcode = X86ISD::SUB; X86Cond = X86::COND_B; break;
case ISD::SSUBO: X86Opcode = X86ISD::SUB; X86Cond = X86::COND_O; break;
case ISD::UMULO: X86Opcode = X86ISD::UMUL; X86Cond = X86::COND_O; break;
case ISD::SMULO: X86Opcode = X86ISD::SMUL; X86Cond = X86::COND_O; break;
default: llvm_unreachable("unexpected overflowing operator");
}
if (CondOpcode == ISD::UMULO)
VTs = DAG.getVTList(LHS.getValueType(), LHS.getValueType(),
MVT::i32);
else
VTs = DAG.getVTList(LHS.getValueType(), MVT::i32);
SDValue X86Op = DAG.getNode(X86Opcode, DL, VTs, LHS, RHS);
if (CondOpcode == ISD::UMULO)
Cond = X86Op.getValue(2);
else
Cond = X86Op.getValue(1);
CC = DAG.getConstant(X86Cond, DL, MVT::i8);
AddTest = false;
}
if (AddTest) {
// Look past the truncate if the high bits are known zero.
if (isTruncWithZeroHighBitsInput(Cond, DAG))
Cond = Cond.getOperand(0);
// We know the result of AND is compared against zero. Try to match
// it to BT.
if (Cond.getOpcode() == ISD::AND && Cond.hasOneUse()) {
if (SDValue NewSetCC = LowerToBT(Cond, ISD::SETNE, DL, DAG)) {
CC = NewSetCC.getOperand(0);
Cond = NewSetCC.getOperand(1);
AddTest = false;
}
}
}
if (AddTest) {
CC = DAG.getConstant(X86::COND_NE, DL, MVT::i8);
Cond = EmitTest(Cond, X86::COND_NE, DL, DAG);
}
// a < b ? -1 : 0 -> RES = ~setcc_carry
// a < b ? 0 : -1 -> RES = setcc_carry
// a >= b ? -1 : 0 -> RES = setcc_carry
// a >= b ? 0 : -1 -> RES = ~setcc_carry
if (Cond.getOpcode() == X86ISD::SUB) {
Cond = ConvertCmpIfNecessary(Cond, DAG);
unsigned CondCode = cast<ConstantSDNode>(CC)->getZExtValue();
if ((CondCode == X86::COND_AE || CondCode == X86::COND_B) &&
(isAllOnesConstant(Op1) || isAllOnesConstant(Op2)) &&
(isNullConstant(Op1) || isNullConstant(Op2))) {
SDValue Res = DAG.getNode(X86ISD::SETCC_CARRY, DL, Op.getValueType(),
DAG.getConstant(X86::COND_B, DL, MVT::i8),
Cond);
if (isAllOnesConstant(Op1) != (CondCode == X86::COND_B))
return DAG.getNOT(DL, Res, Res.getValueType());
return Res;
}
}
// X86 doesn't have an i8 cmov. If both operands are the result of a
// truncate, widen the cmov and push the truncate through. This avoids
// introducing a new branch during isel and doesn't add any extensions.
if (Op.getValueType() == MVT::i8 &&
Op1.getOpcode() == ISD::TRUNCATE && Op2.getOpcode() == ISD::TRUNCATE) {
SDValue T1 = Op1.getOperand(0), T2 = Op2.getOperand(0);
if (T1.getValueType() == T2.getValueType() &&
// Blacklist CopyFromReg to avoid partial register stalls.
T1.getOpcode() != ISD::CopyFromReg && T2.getOpcode()!=ISD::CopyFromReg){
SDVTList VTs = DAG.getVTList(T1.getValueType(), MVT::Glue);
SDValue Cmov = DAG.getNode(X86ISD::CMOV, DL, VTs, T2, T1, CC, Cond);
return DAG.getNode(ISD::TRUNCATE, DL, Op.getValueType(), Cmov);
}
}
// X86ISD::CMOV means set the result (which is operand 1) to the RHS if
// condition is true.
SDVTList VTs = DAG.getVTList(Op.getValueType(), MVT::Glue);
SDValue Ops[] = { Op2, Op1, CC, Cond };
return DAG.getNode(X86ISD::CMOV, DL, VTs, Ops);
}
static SDValue LowerSIGN_EXTEND_AVX512(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op->getSimpleValueType(0);
SDValue In = Op->getOperand(0);
MVT InVT = In.getSimpleValueType();
MVT VTElt = VT.getVectorElementType();
MVT InVTElt = InVT.getVectorElementType();
SDLoc dl(Op);
// On SKX-class processors, extend i1 vectors directly: BWI covers 8/16-bit
// elements and DQI covers 32/64-bit elements.
if ((InVTElt == MVT::i1) &&
(((Subtarget.hasBWI() && VTElt.getSizeInBits() <= 16)) ||
((Subtarget.hasDQI() && VTElt.getSizeInBits() >= 32))))
return DAG.getNode(X86ISD::VSEXT, dl, VT, In);
unsigned NumElts = VT.getVectorNumElements();
if (VT.is512BitVector() && InVTElt != MVT::i1 &&
(NumElts == 8 || NumElts == 16 || Subtarget.hasBWI())) {
if (In.getOpcode() == X86ISD::VSEXT || In.getOpcode() == X86ISD::VZEXT)
return getExtendInVec(In.getOpcode(), dl, VT, In.getOperand(0), DAG);
return getExtendInVec(X86ISD::VSEXT, dl, VT, In, DAG);
}
if (InVTElt != MVT::i1)
return SDValue();
MVT ExtVT = VT;
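// Without VLX, only 512-bit vector operations are available, so widen to a
// 512-bit type here and truncate back to VT at the end.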
if (!VT.is512BitVector() && !Subtarget.hasVLX())
ExtVT = MVT::getVectorVT(MVT::getIntegerVT(512/NumElts), NumElts);
SDValue V;
if (Subtarget.hasDQI()) {
V = getExtendInVec(X86ISD::VSEXT, dl, ExtVT, In, DAG);
assert(!VT.is512BitVector() && "Unexpected vector type");
} else {
SDValue NegOne = getOnesVector(ExtVT, DAG, dl);
SDValue Zero = getZeroVector(ExtVT, Subtarget, DAG, dl);
V = DAG.getSelect(dl, ExtVT, In, NegOne, Zero);
if (ExtVT == VT)
return V;
}
return DAG.getNode(X86ISD::VTRUNC, dl, VT, V);
}
// Lowering for SIGN_EXTEND_VECTOR_INREG and ZERO_EXTEND_VECTOR_INREG.
// For sign extend this needs to handle all vector sizes and SSE4.1 and
// non-SSE4.1 targets. For zero extend this should only handle inputs of
// MVT::v64i8 when BWI is not supported, but AVX512 is.
static SDValue LowerEXTEND_VECTOR_INREG(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue In = Op->getOperand(0);
MVT VT = Op->getSimpleValueType(0);
MVT InVT = In.getSimpleValueType();
assert(VT.getSizeInBits() == InVT.getSizeInBits());
MVT SVT = VT.getVectorElementType();
MVT InSVT = InVT.getVectorElementType();
assert(SVT.getSizeInBits() > InSVT.getSizeInBits());
if (SVT != MVT::i64 && SVT != MVT::i32 && SVT != MVT::i16)
return SDValue();
if (InSVT != MVT::i32 && InSVT != MVT::i16 && InSVT != MVT::i8)
return SDValue();
if (!(VT.is128BitVector() && Subtarget.hasSSE2()) &&
!(VT.is256BitVector() && Subtarget.hasInt256()) &&
!(VT.is512BitVector() && Subtarget.hasAVX512()))
return SDValue();
SDLoc dl(Op);
// For 256-bit vectors, we only need the lower (128-bit) half of the input.
// For 512-bit vectors, we need 128-bits or 256-bits.
if (VT.getSizeInBits() > 128) {
// Input needs to be at least the same number of elements as output, and
// at least 128-bits.
int InSize = InSVT.getSizeInBits() * VT.getVectorNumElements();
In = extractSubVector(In, 0, DAG, dl, std::max(InSize, 128));
}
assert((Op.getOpcode() != ISD::ZERO_EXTEND_VECTOR_INREG ||
InVT == MVT::v64i8) && "Zero extend only for v64i8 input!");
// SSE41 targets can use the pmovsx* instructions directly for 128-bit results,
// so those are legal and shouldn't occur here. AVX2/AVX512 pmovsx* instructions
// still need to be handled here for 256/512-bit results.
if (Subtarget.hasInt256()) {
assert(VT.getSizeInBits() > 128 && "Unexpected 128-bit vector extension");
unsigned ExtOpc = Op.getOpcode() == ISD::SIGN_EXTEND_VECTOR_INREG ?
X86ISD::VSEXT : X86ISD::VZEXT;
return DAG.getNode(ExtOpc, dl, VT, In);
}
// We should only get here for sign extend.
assert(Op.getOpcode() == ISD::SIGN_EXTEND_VECTOR_INREG &&
"Unexpected opcode!");
// Pre-SSE41 targets unpack the lower lanes and then sign-extend using SRAI.
SDValue Curr = In;
MVT CurrVT = InVT;
// As SRAI is only available on i16/i32 types, we expand only up to i32
// and handle i64 separately.
while (CurrVT != VT && CurrVT.getVectorElementType() != MVT::i32) {
Curr = DAG.getNode(X86ISD::UNPCKL, dl, CurrVT, DAG.getUNDEF(CurrVT), Curr);
MVT CurrSVT = MVT::getIntegerVT(CurrVT.getScalarSizeInBits() * 2);
CurrVT = MVT::getVectorVT(CurrSVT, CurrVT.getVectorNumElements() / 2);
Curr = DAG.getBitcast(CurrVT, Curr);
}
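// After unpacking, each original element sits in the high bits of a wider
// lane; the arithmetic shift below moves it down while sign-extending.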
SDValue SignExt = Curr;
if (CurrVT != InVT) {
unsigned SignExtShift =
CurrVT.getScalarSizeInBits() - InSVT.getSizeInBits();
SignExt = DAG.getNode(X86ISD::VSRAI, dl, CurrVT, Curr,
DAG.getConstant(SignExtShift, dl, MVT::i8));
}
if (CurrVT == VT)
return SignExt;
if (VT == MVT::v2i64 && CurrVT == MVT::v4i32) {
SDValue Sign = DAG.getNode(X86ISD::VSRAI, dl, CurrVT, Curr,
DAG.getConstant(31, dl, MVT::i8));
SDValue Ext = DAG.getVectorShuffle(CurrVT, dl, SignExt, Sign, {0, 4, 1, 5});
return DAG.getBitcast(VT, Ext);
}
return SDValue();
}
static SDValue LowerSIGN_EXTEND(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op->getSimpleValueType(0);
SDValue In = Op->getOperand(0);
MVT InVT = In.getSimpleValueType();
SDLoc dl(Op);
if (VT.is512BitVector() || InVT.getVectorElementType() == MVT::i1)
return LowerSIGN_EXTEND_AVX512(Op, Subtarget, DAG);
if ((VT != MVT::v4i64 || InVT != MVT::v4i32) &&
(VT != MVT::v8i32 || InVT != MVT::v8i16) &&
(VT != MVT::v16i16 || InVT != MVT::v16i8))
return SDValue();
if (Subtarget.hasInt256())
return DAG.getNode(X86ISD::VSEXT, dl, VT, In);
// Optimize vectors in AVX mode: sign-extend v8i16 to v8i32 and v4i32 to
// v4i64.
//
// Divide the input vector into two parts; for v4i32 the shuffle masks will
// be {0, 1, -1, -1} and {2, 3, -1, -1}. Use the vpmovsx instruction to
// extend v4i32 -> v2i64 and v8i16 -> v4i32, then concatenate the vectors
// back to the original VT.
unsigned NumElems = InVT.getVectorNumElements();
SDValue Undef = DAG.getUNDEF(InVT);
SmallVector<int,8> ShufMask1(NumElems, -1);
for (unsigned i = 0; i != NumElems/2; ++i)
ShufMask1[i] = i;
SDValue OpLo = DAG.getVectorShuffle(InVT, dl, In, Undef, ShufMask1);
SmallVector<int,8> ShufMask2(NumElems, -1);
for (unsigned i = 0; i != NumElems/2; ++i)
ShufMask2[i] = i + NumElems/2;
SDValue OpHi = DAG.getVectorShuffle(InVT, dl, In, Undef, ShufMask2);
MVT HalfVT = MVT::getVectorVT(VT.getVectorElementType(),
VT.getVectorNumElements() / 2);
OpLo = DAG.getSignExtendVectorInReg(OpLo, dl, HalfVT);
OpHi = DAG.getSignExtendVectorInReg(OpHi, dl, HalfVT);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, OpLo, OpHi);
}
// Lower a truncating store. We need special lowering for vXi1 vectors.
static SDValue LowerTruncatingStore(SDValue StOp, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
StoreSDNode *St = cast<StoreSDNode>(StOp.getNode());
SDLoc dl(St);
EVT MemVT = St->getMemoryVT();
assert(St->isTruncatingStore() && "We only custom lower truncating stores.");
assert(MemVT.isVector() && MemVT.getVectorElementType() == MVT::i1 &&
"Expected truncstore of i1 vector");
SDValue Op = St->getValue();
MVT OpVT = Op.getValueType().getSimpleVT();
unsigned NumElts = OpVT.getVectorNumElements();
if ((Subtarget.hasVLX() && Subtarget.hasBWI() && Subtarget.hasDQI()) ||
NumElts == 16) {
// Truncate and store - everything is legal
Op = DAG.getNode(ISD::TRUNCATE, dl, MemVT, Op);
if (MemVT.getSizeInBits() < 8)
Op = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, MVT::v8i1,
DAG.getUNDEF(MVT::v8i1), Op,
DAG.getIntPtrConstant(0, dl));
return DAG.getStore(St->getChain(), dl, Op, St->getBasePtr(),
St->getMemOperand());
}
// A subset; assume that we have only AVX-512F.
if (NumElts <= 8) {
if (NumElts < 8) {
// Extend to an 8-element vector.
MVT ExtVT = MVT::getVectorVT(OpVT.getScalarType(), 8);
Op = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, ExtVT,
DAG.getUNDEF(ExtVT), Op, DAG.getIntPtrConstant(0, dl));
}
Op = DAG.getNode(ISD::TRUNCATE, dl, MVT::v8i1, Op);
return DAG.getStore(St->getChain(), dl, Op, St->getBasePtr(),
St->getMemOperand());
}
// v32i8
assert(OpVT == MVT::v32i8 && "Unexpected operand type");
// Divide the vector into 2 parts and store each part separately
SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, Op,
DAG.getIntPtrConstant(0, dl));
Lo = DAG.getNode(ISD::TRUNCATE, dl, MVT::v16i1, Lo);
SDValue BasePtr = St->getBasePtr();
SDValue StLo = DAG.getStore(St->getChain(), dl, Lo, BasePtr,
St->getMemOperand());
SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, Op,
DAG.getIntPtrConstant(16, dl));
Hi = DAG.getNode(ISD::TRUNCATE, dl, MVT::v16i1, Hi);
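// Store the upper 16 mask bits 2 bytes (16 bits) past the base pointer.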
SDValue BasePtrHi =
DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
DAG.getConstant(2, dl, BasePtr.getValueType()));
SDValue StHi = DAG.getStore(St->getChain(), dl, Hi,
BasePtrHi, St->getMemOperand());
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, StLo, StHi);
}
static SDValue LowerExtended1BitVectorLoad(SDValue Op,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
LoadSDNode *Ld = cast<LoadSDNode>(Op.getNode());
SDLoc dl(Ld);
EVT MemVT = Ld->getMemoryVT();
assert(MemVT.isVector() && MemVT.getScalarType() == MVT::i1 &&
"Expected i1 vector load");
unsigned ExtOpcode = Ld->getExtensionType() == ISD::ZEXTLOAD ?
ISD::ZERO_EXTEND : ISD::SIGN_EXTEND;
MVT VT = Op.getValueType().getSimpleVT();
unsigned NumElts = VT.getVectorNumElements();
if ((Subtarget.hasBWI() && NumElts >= 32) ||
(Subtarget.hasDQI() && NumElts < 16) ||
NumElts == 16) {
// Load and extend - everything is legal
if (NumElts < 8) {
SDValue Load = DAG.getLoad(MVT::v8i1, dl, Ld->getChain(),
Ld->getBasePtr(),
Ld->getMemOperand());
// Replace chain users with the new chain.
assert(Load->getNumValues() == 2 && "Loads must carry a chain!");
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Load.getValue(1));
MVT ExtVT = MVT::getVectorVT(VT.getScalarType(), 8);
SDValue ExtVec = DAG.getNode(ExtOpcode, dl, ExtVT, Load);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, ExtVec,
DAG.getIntPtrConstant(0, dl));
}
SDValue Load = DAG.getLoad(MemVT, dl, Ld->getChain(),
Ld->getBasePtr(),
Ld->getMemOperand());
// Replace chain users with the new chain.
assert(Load->getNumValues() == 2 && "Loads must carry a chain!");
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Load.getValue(1));
// Finally, do a normal sign-extend to the desired register.
return DAG.getNode(ExtOpcode, dl, Op.getValueType(), Load);
}
if (NumElts <= 8) {
// A subset; assume that we have only AVX-512F.
unsigned NumBitsToLoad = 8;
MVT TypeToLoad = MVT::getIntegerVT(NumBitsToLoad);
SDValue Load = DAG.getLoad(TypeToLoad, dl, Ld->getChain(),
Ld->getBasePtr(),
Ld->getMemOperand());
// Replace chain users with the new chain.
assert(Load->getNumValues() == 2 && "Loads must carry a chain!");
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Load.getValue(1));
MVT MaskVT = MVT::getVectorVT(MVT::i1, NumBitsToLoad);
SDValue BitVec = DAG.getBitcast(MaskVT, Load);
if (NumElts == 8)
return DAG.getNode(ExtOpcode, dl, VT, BitVec);
// Handle v4i1 and v2i1: extend to 8 elements, then extract the subvector.
MVT ExtVT = MVT::getVectorVT(VT.getScalarType(), 8);
SDValue ExtVec = DAG.getNode(ExtOpcode, dl, ExtVT, BitVec);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, ExtVec,
DAG.getIntPtrConstant(0, dl));
}
assert(VT == MVT::v32i8 && "Unexpected extload type");
SmallVector<SDValue, 2> Chains;
SDValue BasePtr = Ld->getBasePtr();
SDValue LoadLo = DAG.getLoad(MVT::v16i1, dl, Ld->getChain(),
Ld->getBasePtr(),
Ld->getMemOperand());
Chains.push_back(LoadLo.getValue(1));
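// The upper 16 mask bits are loaded from 2 bytes past the base pointer.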
SDValue BasePtrHi =
DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
DAG.getConstant(2, dl, BasePtr.getValueType()));
SDValue LoadHi = DAG.getLoad(MVT::v16i1, dl, Ld->getChain(),
BasePtrHi,
Ld->getMemOperand());
Chains.push_back(LoadHi.getValue(1));
SDValue NewChain = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Chains);
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), NewChain);
SDValue Lo = DAG.getNode(ExtOpcode, dl, MVT::v16i8, LoadLo);
SDValue Hi = DAG.getNode(ExtOpcode, dl, MVT::v16i8, LoadHi);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v32i8, Lo, Hi);
}
// Lower vector extended loads using a shuffle. If SSSE3 is not available we
// may emit an illegal shuffle but the expansion is still better than scalar
// code. We generate X86ISD::VSEXT for SEXTLOADs when SSE4.1 is available;
// otherwise we'll emit a shuffle and an arithmetic shift.
// FIXME: Is the expansion actually better than scalar code? It doesn't seem so.
// TODO: It is possible to support ZExt by zeroing the undef values during
// the shuffle phase or after the shuffle.
static SDValue LowerExtendedLoad(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT RegVT = Op.getSimpleValueType();
assert(RegVT.isVector() && "We only custom lower vector sext loads.");
assert(RegVT.isInteger() &&
"We only custom lower integer vector sext loads.");
// Nothing useful we can do without SSE2 shuffles.
assert(Subtarget.hasSSE2() && "We only custom lower sext loads with SSE2.");
LoadSDNode *Ld = cast<LoadSDNode>(Op.getNode());
SDLoc dl(Ld);
EVT MemVT = Ld->getMemoryVT();
if (MemVT.getScalarType() == MVT::i1)
return LowerExtended1BitVectorLoad(Op, Subtarget, DAG);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
unsigned RegSz = RegVT.getSizeInBits();
ISD::LoadExtType Ext = Ld->getExtensionType();
assert((Ext == ISD::EXTLOAD || Ext == ISD::SEXTLOAD)
&& "Only anyext and sext are currently implemented.");
assert(MemVT != RegVT && "Cannot extend to the same type");
assert(MemVT.isVector() && "Must load a vector from memory");
unsigned NumElems = RegVT.getVectorNumElements();
unsigned MemSz = MemVT.getSizeInBits();
assert(RegSz > MemSz && "Register size must be greater than the mem size");
if (Ext == ISD::SEXTLOAD && RegSz == 256 && !Subtarget.hasInt256()) {
// The only way in which we have a legal 256-bit vector result but not the
// integer 256-bit operations needed to directly lower a sextload is if we
// have AVX1 but not AVX2. In that case, we can always emit a sextload to
// a 128-bit vector and a normal sign_extend to 256-bits that should get
// correctly legalized. We do this late to allow the canonical form of
// sextload to persist throughout the rest of the DAG combiner -- it wants
// to fold together any extensions it can, and so will fuse a sign_extend
// of an sextload into a sextload targeting a wider value.
SDValue Load;
if (MemSz == 128) {
// Just switch this to a normal load.
assert(TLI.isTypeLegal(MemVT) && "If the memory type is a 128-bit type, "
"it must be a legal 128-bit vector "
"type!");
Load = DAG.getLoad(MemVT, dl, Ld->getChain(), Ld->getBasePtr(),
Ld->getPointerInfo(), Ld->getAlignment(),
Ld->getMemOperand()->getFlags());
} else {
assert(MemSz < 128 &&
"Can't extend a type wider than 128 bits to a 256 bit vector!");
// Do an sext load to a 128-bit vector type. We want to use the same
// number of elements, but elements half as wide. This will end up being
// recursively lowered by this routine, but will succeed as we definitely
// have all the necessary features if we're using AVX1.
EVT HalfEltVT =
EVT::getIntegerVT(*DAG.getContext(), RegVT.getScalarSizeInBits() / 2);
EVT HalfVecVT = EVT::getVectorVT(*DAG.getContext(), HalfEltVT, NumElems);
Load =
DAG.getExtLoad(Ext, dl, HalfVecVT, Ld->getChain(), Ld->getBasePtr(),
Ld->getPointerInfo(), MemVT, Ld->getAlignment(),
Ld->getMemOperand()->getFlags());
}
// Replace chain users with the new chain.
assert(Load->getNumValues() == 2 && "Loads must carry a chain!");
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), Load.getValue(1));
// Finally, do a normal sign-extend to the desired register.
return DAG.getSExtOrTrunc(Load, dl, RegVT);
}
// All sizes must be a power of two.
assert(isPowerOf2_32(RegSz * MemSz * NumElems) &&
"Non-power-of-two elements are not custom lowered!");
// Attempt to load the original value using scalar loads.
// Find the largest scalar type that divides the total loaded size.
MVT SclrLoadTy = MVT::i8;
for (MVT Tp : MVT::integer_valuetypes()) {
if (TLI.isTypeLegal(Tp) && ((MemSz % Tp.getSizeInBits()) == 0)) {
SclrLoadTy = Tp;
}
}
// On 32-bit systems, we can't save 64-bit integers. Try bitcasting to F64.
if (TLI.isTypeLegal(MVT::f64) && SclrLoadTy.getSizeInBits() < 64 &&
(64 <= MemSz))
SclrLoadTy = MVT::f64;
// Calculate the number of scalar loads that we need to perform
// in order to load our vector from memory.
unsigned NumLoads = MemSz / SclrLoadTy.getSizeInBits();
assert((Ext != ISD::SEXTLOAD || NumLoads == 1) &&
"Can only lower sext loads with a single scalar load!");
unsigned loadRegSize = RegSz;
if (Ext == ISD::SEXTLOAD && RegSz >= 256)
loadRegSize = 128;
// Represent our vector as a sequence of elements of the largest scalar
// type that we can load.
EVT LoadUnitVecVT = EVT::getVectorVT(
*DAG.getContext(), SclrLoadTy, loadRegSize / SclrLoadTy.getSizeInBits());
// Represent the data using the same element type that is stored in
// memory. In practice, we "widen" MemVT.
EVT WideVecVT =
EVT::getVectorVT(*DAG.getContext(), MemVT.getScalarType(),
loadRegSize / MemVT.getScalarSizeInBits());
assert(WideVecVT.getSizeInBits() == LoadUnitVecVT.getSizeInBits() &&
"Invalid vector type");
// We can't shuffle using an illegal type.
assert(TLI.isTypeLegal(WideVecVT) &&
"We only lower types that form legal widened vector types");
SmallVector<SDValue, 8> Chains;
SDValue Ptr = Ld->getBasePtr();
SDValue Increment = DAG.getConstant(SclrLoadTy.getSizeInBits() / 8, dl,
TLI.getPointerTy(DAG.getDataLayout()));
SDValue Res = DAG.getUNDEF(LoadUnitVecVT);
for (unsigned i = 0; i < NumLoads; ++i) {
// Perform a single load.
SDValue ScalarLoad =
DAG.getLoad(SclrLoadTy, dl, Ld->getChain(), Ptr, Ld->getPointerInfo(),
Ld->getAlignment(), Ld->getMemOperand()->getFlags());
Chains.push_back(ScalarLoad.getValue(1));
// Create the first element type using SCALAR_TO_VECTOR in order to avoid
// another round of DAGCombining.
if (i == 0)
Res = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, LoadUnitVecVT, ScalarLoad);
else
Res = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, LoadUnitVecVT, Res,
ScalarLoad, DAG.getIntPtrConstant(i, dl));
Ptr = DAG.getNode(ISD::ADD, dl, Ptr.getValueType(), Ptr, Increment);
}
SDValue TF = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Chains);
// Bitcast the loaded value to a vector of the original element type, in
// the size of the target vector type.
SDValue SlicedVec = DAG.getBitcast(WideVecVT, Res);
unsigned SizeRatio = RegSz / MemSz;
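// SizeRatio is how many times wider each register element is than the
// corresponding element in memory.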
if (Ext == ISD::SEXTLOAD) {
// If we have SSE4.1, we can directly emit a VSEXT node.
if (Subtarget.hasSSE41()) {
SDValue Sext = getExtendInVec(X86ISD::VSEXT, dl, RegVT, SlicedVec, DAG);
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), TF);
return Sext;
}
// Otherwise we'll use SIGN_EXTEND_VECTOR_INREG to sign extend the lowest
// lanes.
assert(TLI.isOperationLegalOrCustom(ISD::SIGN_EXTEND_VECTOR_INREG, RegVT) &&
"We can't implement a sext load without SIGN_EXTEND_VECTOR_INREG!");
SDValue Shuff = DAG.getSignExtendVectorInReg(SlicedVec, dl, RegVT);
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), TF);
return Shuff;
}
// Redistribute the loaded elements into the different locations.
SmallVector<int, 16> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i * SizeRatio] = i;
SDValue Shuff = DAG.getVectorShuffle(WideVecVT, dl, SlicedVec,
DAG.getUNDEF(WideVecVT), ShuffleVec);
// Bitcast to the requested type.
Shuff = DAG.getBitcast(RegVT, Shuff);
DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), TF);
return Shuff;
}
/// Return true if the node is an ISD::AND or ISD::OR of two X86ISD::SETCC
/// nodes, each of which has no other use apart from the AND / OR.
static bool isAndOrOfSetCCs(SDValue Op, unsigned &Opc) {
Opc = Op.getOpcode();
if (Opc != ISD::OR && Opc != ISD::AND)
return false;
return (Op.getOperand(0).getOpcode() == X86ISD::SETCC &&
Op.getOperand(0).hasOneUse() &&
Op.getOperand(1).getOpcode() == X86ISD::SETCC &&
Op.getOperand(1).hasOneUse());
}
/// Return true if the node is an ISD::XOR of an X86ISD::SETCC and 1, and the
/// SETCC node has a single use.
static bool isXor1OfSetCC(SDValue Op) {
if (Op.getOpcode() != ISD::XOR)
return false;
if (isOneConstant(Op.getOperand(1)))
return Op.getOperand(0).getOpcode() == X86ISD::SETCC &&
Op.getOperand(0).hasOneUse();
return false;
}
SDValue X86TargetLowering::LowerBRCOND(SDValue Op, SelectionDAG &DAG) const {
bool addTest = true;
SDValue Chain = Op.getOperand(0);
SDValue Cond = Op.getOperand(1);
SDValue Dest = Op.getOperand(2);
SDLoc dl(Op);
SDValue CC;
bool Inverted = false;
if (Cond.getOpcode() == ISD::SETCC) {
// Check for setcc([su]{add,sub,mul}o == 0).
if (cast<CondCodeSDNode>(Cond.getOperand(2))->get() == ISD::SETEQ &&
isNullConstant(Cond.getOperand(1)) &&
Cond.getOperand(0).getResNo() == 1 &&
(Cond.getOperand(0).getOpcode() == ISD::SADDO ||
Cond.getOperand(0).getOpcode() == ISD::UADDO ||
Cond.getOperand(0).getOpcode() == ISD::SSUBO ||
Cond.getOperand(0).getOpcode() == ISD::USUBO ||
Cond.getOperand(0).getOpcode() == ISD::SMULO ||
Cond.getOperand(0).getOpcode() == ISD::UMULO)) {
Inverted = true;
Cond = Cond.getOperand(0);
} else {
if (SDValue NewCond = LowerSETCC(Cond, DAG))
Cond = NewCond;
}
}
#if 0
// FIXME: LowerXALUO doesn't handle these!!
else if (Cond.getOpcode() == X86ISD::ADD ||
Cond.getOpcode() == X86ISD::SUB ||
Cond.getOpcode() == X86ISD::SMUL ||
Cond.getOpcode() == X86ISD::UMUL)
Cond = LowerXALUO(Cond, DAG);
#endif
// Look past (and (setcc_carry (cmp ...)), 1).
if (Cond.getOpcode() == ISD::AND &&
Cond.getOperand(0).getOpcode() == X86ISD::SETCC_CARRY &&
isOneConstant(Cond.getOperand(1)))
Cond = Cond.getOperand(0);
// If condition flag is set by a X86ISD::CMP, then use it as the condition
// setting operand in place of the X86ISD::SETCC.
unsigned CondOpcode = Cond.getOpcode();
if (CondOpcode == X86ISD::SETCC ||
CondOpcode == X86ISD::SETCC_CARRY) {
CC = Cond.getOperand(0);
SDValue Cmp = Cond.getOperand(1);
unsigned Opc = Cmp.getOpcode();
// FIXME: WHY THE SPECIAL CASING OF LogicalCmp??
if (isX86LogicalCmp(Cmp) || Opc == X86ISD::BT) {
Cond = Cmp;
addTest = false;
} else {
switch (cast<ConstantSDNode>(CC)->getZExtValue()) {
default: break;
case X86::COND_O:
case X86::COND_B:
// These can only come from an arithmetic instruction with overflow,
// e.g. SADDO, UADDO.
Cond = Cond.getOperand(1);
addTest = false;
break;
}
}
}
CondOpcode = Cond.getOpcode();
if (CondOpcode == ISD::UADDO || CondOpcode == ISD::SADDO ||
CondOpcode == ISD::USUBO || CondOpcode == ISD::SSUBO ||
((CondOpcode == ISD::UMULO || CondOpcode == ISD::SMULO) &&
Cond.getOperand(0).getValueType() != MVT::i8)) {
SDValue LHS = Cond.getOperand(0);
SDValue RHS = Cond.getOperand(1);
unsigned X86Opcode;
unsigned X86Cond;
SDVTList VTs;
// Keep this in sync with LowerXALUO, otherwise we might create redundant
// instructions that can't be removed afterwards (i.e. X86ISD::ADD and
// X86ISD::INC).
switch (CondOpcode) {
case ISD::UADDO: X86Opcode = X86ISD::ADD; X86Cond = X86::COND_B; break;
case ISD::SADDO:
if (isOneConstant(RHS)) {
X86Opcode = X86ISD::INC; X86Cond = X86::COND_O;
break;
}
X86Opcode = X86ISD::ADD; X86Cond = X86::COND_O; break;
case ISD::USUBO: X86Opcode = X86ISD::SUB; X86Cond = X86::COND_B; break;
case ISD::SSUBO:
if (isOneConstant(RHS)) {
X86Opcode = X86ISD::DEC; X86Cond = X86::COND_O;
break;
}
X86Opcode = X86ISD::SUB; X86Cond = X86::COND_O; break;
case ISD::UMULO: X86Opcode = X86ISD::UMUL; X86Cond = X86::COND_O; break;
case ISD::SMULO: X86Opcode = X86ISD::SMUL; X86Cond = X86::COND_O; break;
default: llvm_unreachable("unexpected overflowing operator");
}
if (Inverted)
X86Cond = X86::GetOppositeBranchCondition((X86::CondCode)X86Cond);
if (CondOpcode == ISD::UMULO)
VTs = DAG.getVTList(LHS.getValueType(), LHS.getValueType(),
MVT::i32);
else
VTs = DAG.getVTList(LHS.getValueType(), MVT::i32);
SDValue X86Op = DAG.getNode(X86Opcode, dl, VTs, LHS, RHS);
if (CondOpcode == ISD::UMULO)
Cond = X86Op.getValue(2);
else
Cond = X86Op.getValue(1);
CC = DAG.getConstant(X86Cond, dl, MVT::i8);
addTest = false;
} else {
unsigned CondOpc;
if (Cond.hasOneUse() && isAndOrOfSetCCs(Cond, CondOpc)) {
SDValue Cmp = Cond.getOperand(0).getOperand(1);
if (CondOpc == ISD::OR) {
// Also, recognize the pattern generated by an FCMP_UNE. We can emit
// two branches instead of an explicit OR instruction with a
// separate test.
if (Cmp == Cond.getOperand(1).getOperand(1) &&
isX86LogicalCmp(Cmp)) {
CC = Cond.getOperand(0).getOperand(0);
Chain = DAG.getNode(X86ISD::BRCOND, dl, Op.getValueType(),
Chain, Dest, CC, Cmp);
CC = Cond.getOperand(1).getOperand(0);
Cond = Cmp;
addTest = false;
}
} else { // ISD::AND
// Also, recognize the pattern generated by an FCMP_OEQ. We can emit
// two branches instead of an explicit AND instruction with a
// separate test. However, we only do this if this block doesn't
// have a fall-through edge, because this requires an explicit
// jmp when the condition is false.
if (Cmp == Cond.getOperand(1).getOperand(1) &&
isX86LogicalCmp(Cmp) &&
Op.getNode()->hasOneUse()) {
X86::CondCode CCode =
(X86::CondCode)Cond.getOperand(0).getConstantOperandVal(0);
CCode = X86::GetOppositeBranchCondition(CCode);
CC = DAG.getConstant(CCode, dl, MVT::i8);
SDNode *User = *Op.getNode()->use_begin();
// Look for an unconditional branch following this conditional branch.
// We need this because we need to reverse the successors in order
// to implement FCMP_OEQ.
if (User->getOpcode() == ISD::BR) {
SDValue FalseBB = User->getOperand(1);
SDNode *NewBR =
DAG.UpdateNodeOperands(User, User->getOperand(0), Dest);
assert(NewBR == User);
(void)NewBR;
Dest = FalseBB;
Chain = DAG.getNode(X86ISD::BRCOND, dl, Op.getValueType(),
Chain, Dest, CC, Cmp);
X86::CondCode CCode =
(X86::CondCode)Cond.getOperand(1).getConstantOperandVal(0);
CCode = X86::GetOppositeBranchCondition(CCode);
CC = DAG.getConstant(CCode, dl, MVT::i8);
Cond = Cmp;
addTest = false;
}
}
}
} else if (Cond.hasOneUse() && isXor1OfSetCC(Cond)) {
// Recognize the pattern xorb (setcc), 1. The xor inverts the condition.
// It should be transformed by the DAG combiner, except when the condition
// is set by an arithmetic-with-overflow node.
X86::CondCode CCode =
(X86::CondCode)Cond.getOperand(0).getConstantOperandVal(0);
CCode = X86::GetOppositeBranchCondition(CCode);
CC = DAG.getConstant(CCode, dl, MVT::i8);
Cond = Cond.getOperand(0).getOperand(1);
addTest = false;
} else if (Cond.getOpcode() == ISD::SETCC &&
cast<CondCodeSDNode>(Cond.getOperand(2))->get() == ISD::SETOEQ) {
// For FCMP_OEQ, we can emit
// two branches instead of an explicit AND instruction with a
// separate test. However, we only do this if this block doesn't
// have a fall-through edge, because this requires an explicit
// jmp when the condition is false.
if (Op.getNode()->hasOneUse()) {
SDNode *User = *Op.getNode()->use_begin();
// Look for an unconditional branch following this conditional branch.
// We need this because we need to reverse the successors in order
// to implement FCMP_OEQ.
if (User->getOpcode() == ISD::BR) {
SDValue FalseBB = User->getOperand(1);
SDNode *NewBR =
DAG.UpdateNodeOperands(User, User->getOperand(0), Dest);
assert(NewBR == User);
(void)NewBR;
Dest = FalseBB;
SDValue Cmp = DAG.getNode(X86ISD::CMP, dl, MVT::i32,
Cond.getOperand(0), Cond.getOperand(1));
Cmp = ConvertCmpIfNecessary(Cmp, DAG);
CC = DAG.getConstant(X86::COND_NE, dl, MVT::i8);
Chain = DAG.getNode(X86ISD::BRCOND, dl, Op.getValueType(),
Chain, Dest, CC, Cmp);
CC = DAG.getConstant(X86::COND_P, dl, MVT::i8);
Cond = Cmp;
addTest = false;
}
}
} else if (Cond.getOpcode() == ISD::SETCC &&
cast<CondCodeSDNode>(Cond.getOperand(2))->get() == ISD::SETUNE) {
// For FCMP_UNE, we can emit
// two branches instead of an explicit AND instruction with a
// separate test. However, we only do this if this block doesn't
// have a fall-through edge, because this requires an explicit
// jmp when the condition is false.
if (Op.getNode()->hasOneUse()) {
SDNode *User = *Op.getNode()->use_begin();
// Look for an unconditional branch following this conditional branch.
// We need this because we need to reverse the successors in order
// to implement FCMP_UNE.
if (User->getOpcode() == ISD::BR) {
SDValue FalseBB = User->getOperand(1);
SDNode *NewBR =
DAG.UpdateNodeOperands(User, User->getOperand(0), Dest);
assert(NewBR == User);
(void)NewBR;
SDValue Cmp = DAG.getNode(X86ISD::CMP, dl, MVT::i32,
Cond.getOperand(0), Cond.getOperand(1));
Cmp = ConvertCmpIfNecessary(Cmp, DAG);
CC = DAG.getConstant(X86::COND_NE, dl, MVT::i8);
Chain = DAG.getNode(X86ISD::BRCOND, dl, Op.getValueType(),
Chain, Dest, CC, Cmp);
CC = DAG.getConstant(X86::COND_NP, dl, MVT::i8);
Cond = Cmp;
addTest = false;
Dest = FalseBB;
}
}
}
}
if (addTest) {
// Look past the truncate if the high bits are known zero.
if (isTruncWithZeroHighBitsInput(Cond, DAG))
Cond = Cond.getOperand(0);
// We know the result is compared against zero. Try to match it to BT.
if (Cond.hasOneUse()) {
if (SDValue NewSetCC = LowerToBT(Cond, ISD::SETNE, dl, DAG)) {
CC = NewSetCC.getOperand(0);
Cond = NewSetCC.getOperand(1);
addTest = false;
}
}
}
if (addTest) {
X86::CondCode X86Cond = Inverted ? X86::COND_E : X86::COND_NE;
CC = DAG.getConstant(X86Cond, dl, MVT::i8);
Cond = EmitTest(Cond, X86Cond, dl, DAG);
}
Cond = ConvertCmpIfNecessary(Cond, DAG);
return DAG.getNode(X86ISD::BRCOND, dl, Op.getValueType(),
Chain, Dest, CC, Cond);
}
// Lower dynamic stack allocation to _alloca call for Cygwin/Mingw targets.
// Calls to _alloca are needed to probe the stack when allocating more than 4k
// bytes in one go. Touching the stack at 4K increments is necessary to ensure
// that the guard pages used by the OS virtual memory manager are allocated in
// correct sequence.
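// Illustrative example: a single 10000-byte alloca must touch the stack at
// SP-4096 and SP-8192 before the final SP adjustment, so each guard page is
// faulted in (and committed) in order.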
SDValue
X86TargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op,
SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
bool SplitStack = MF.shouldSplitStack();
bool EmitStackProbe = !getStackProbeSymbolName(MF).empty();
bool Lower = (Subtarget.isOSWindows() && !Subtarget.isTargetMachO()) ||
SplitStack || EmitStackProbe;
SDLoc dl(Op);
// Get the inputs.
SDNode *Node = Op.getNode();
SDValue Chain = Op.getOperand(0);
SDValue Size = Op.getOperand(1);
unsigned Align = cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue();
EVT VT = Node->getValueType(0);
// Chain the dynamic stack allocation so that it doesn't modify the stack
// pointer when other instructions are using the stack.
Chain = DAG.getCALLSEQ_START(Chain, 0, 0, dl);
bool Is64Bit = Subtarget.is64Bit();
MVT SPTy = getPointerTy(DAG.getDataLayout());
SDValue Result;
if (!Lower) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
unsigned SPReg = TLI.getStackPointerRegisterToSaveRestore();
assert(SPReg && "Target cannot require DYNAMIC_STACKALLOC expansion and"
" not tell us which reg is the stack pointer!");
SDValue SP = DAG.getCopyFromReg(Chain, dl, SPReg, VT);
Chain = SP.getValue(1);
const TargetFrameLowering &TFI = *Subtarget.getFrameLowering();
unsigned StackAlign = TFI.getStackAlignment();
Result = DAG.getNode(ISD::SUB, dl, VT, SP, Size); // Value
if (Align > StackAlign)
Result = DAG.getNode(ISD::AND, dl, VT, Result,
DAG.getConstant(-(uint64_t)Align, dl, VT));
Chain = DAG.getCopyToReg(Chain, dl, SPReg, Result); // Output chain
} else if (SplitStack) {
MachineRegisterInfo &MRI = MF.getRegInfo();
if (Is64Bit) {
// The 64-bit implementation of segmented stacks needs to clobber both r10
// and r11. This makes it impossible to use it along with nested parameters.
const Function *F = MF.getFunction();
for (const auto &A : F->args()) {
if (A.hasNestAttr())
report_fatal_error("Cannot use segmented stacks with functions that "
"have nested arguments.");
}
}
const TargetRegisterClass *AddrRegClass = getRegClassFor(SPTy);
unsigned Vreg = MRI.createVirtualRegister(AddrRegClass);
Chain = DAG.getCopyToReg(Chain, dl, Vreg, Size);
Result = DAG.getNode(X86ISD::SEG_ALLOCA, dl, SPTy, Chain,
DAG.getRegister(Vreg, SPTy));
} else {
SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
Chain = DAG.getNode(X86ISD::WIN_ALLOCA, dl, NodeTys, Chain, Size);
MF.getInfo<X86MachineFunctionInfo>()->setHasWinAlloca(true);
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
unsigned SPReg = RegInfo->getStackRegister();
SDValue SP = DAG.getCopyFromReg(Chain, dl, SPReg, SPTy);
Chain = SP.getValue(1);
if (Align) {
SP = DAG.getNode(ISD::AND, dl, VT, SP.getValue(0),
DAG.getConstant(-(uint64_t)Align, dl, VT));
Chain = DAG.getCopyToReg(Chain, dl, SPReg, SP);
}
Result = SP;
}
Chain = DAG.getCALLSEQ_END(Chain, DAG.getIntPtrConstant(0, dl, true),
DAG.getIntPtrConstant(0, dl, true), SDValue(), dl);
SDValue Ops[2] = {Result, Chain};
return DAG.getMergeValues(Ops, dl);
}
SDValue X86TargetLowering::LowerVASTART(SDValue Op, SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
auto PtrVT = getPointerTy(MF.getDataLayout());
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
const Value *SV = cast<SrcValueSDNode>(Op.getOperand(2))->getValue();
SDLoc DL(Op);
if (!Subtarget.is64Bit() ||
Subtarget.isCallingConvWin64(MF.getFunction()->getCallingConv())) {
// vastart just stores the address of the VarArgsFrameIndex slot into the
// memory location argument.
SDValue FR = DAG.getFrameIndex(FuncInfo->getVarArgsFrameIndex(), PtrVT);
return DAG.getStore(Op.getOperand(0), DL, FR, Op.getOperand(1),
MachinePointerInfo(SV));
}
// __va_list_tag:
//   gp_offset         (0 .. 6 * 8)
//   fp_offset         (48 .. 48 + 8 * 16)
//   overflow_arg_area (points to parameters passed in memory)
//   reg_save_area
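// In C terms, on LP64 this is the System V x86-64 va_list element (sketch):
//   typedef struct {
//     unsigned int gp_offset;    // byte offset 0
//     unsigned int fp_offset;    // byte offset 4
//     void *overflow_arg_area;   // byte offset 8
//     void *reg_save_area;       // byte offset 16
//   } __va_list_tag;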
SmallVector<SDValue, 8> MemOps;
SDValue FIN = Op.getOperand(1);
// Store gp_offset
SDValue Store = DAG.getStore(
Op.getOperand(0), DL,
DAG.getConstant(FuncInfo->getVarArgsGPOffset(), DL, MVT::i32), FIN,
MachinePointerInfo(SV));
MemOps.push_back(Store);
// Store fp_offset
FIN = DAG.getMemBasePlusOffset(FIN, 4, DL);
Store = DAG.getStore(
Op.getOperand(0), DL,
DAG.getConstant(FuncInfo->getVarArgsFPOffset(), DL, MVT::i32), FIN,
MachinePointerInfo(SV, 4));
MemOps.push_back(Store);
// Store ptr to overflow_arg_area
FIN = DAG.getNode(ISD::ADD, DL, PtrVT, FIN, DAG.getIntPtrConstant(4, DL));
SDValue OVFIN = DAG.getFrameIndex(FuncInfo->getVarArgsFrameIndex(), PtrVT);
Store =
DAG.getStore(Op.getOperand(0), DL, OVFIN, FIN, MachinePointerInfo(SV, 8));
MemOps.push_back(Store);
// Store ptr to reg_save_area.
FIN = DAG.getNode(ISD::ADD, DL, PtrVT, FIN, DAG.getIntPtrConstant(
Subtarget.isTarget64BitLP64() ? 8 : 4, DL));
SDValue RSFIN = DAG.getFrameIndex(FuncInfo->getRegSaveFrameIndex(), PtrVT);
Store = DAG.getStore(
Op.getOperand(0), DL, RSFIN, FIN,
MachinePointerInfo(SV, Subtarget.isTarget64BitLP64() ? 16 : 12));
MemOps.push_back(Store);
return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, MemOps);
}
SDValue X86TargetLowering::LowerVAARG(SDValue Op, SelectionDAG &DAG) const {
assert(Subtarget.is64Bit() &&
"LowerVAARG only handles 64-bit va_arg!");
assert(Op.getNumOperands() == 4);
MachineFunction &MF = DAG.getMachineFunction();
if (Subtarget.isCallingConvWin64(MF.getFunction()->getCallingConv()))
// The Win64 ABI uses char* instead of a structure.
return DAG.expandVAArg(Op.getNode());
SDValue Chain = Op.getOperand(0);
SDValue SrcPtr = Op.getOperand(1);
const Value *SV = cast<SrcValueSDNode>(Op.getOperand(2))->getValue();
unsigned Align = Op.getConstantOperandVal(3);
SDLoc dl(Op);
EVT ArgVT = Op.getNode()->getValueType(0);
Type *ArgTy = ArgVT.getTypeForEVT(*DAG.getContext());
uint32_t ArgSize = DAG.getDataLayout().getTypeAllocSize(ArgTy);
uint8_t ArgMode;
// Decide which area this value should be read from.
// TODO: Implement the AMD64 ABI in its entirety. This simple
// selection mechanism works only for the basic types.
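// e.g. a 'double' (floating point, 8 bytes) is read via fp_offset
// (ArgMode 2), while a 'long' (integer, 8 bytes) is read via gp_offset
// (ArgMode 1).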
if (ArgVT == MVT::f80) {
llvm_unreachable("va_arg for f80 not yet implemented");
} else if (ArgVT.isFloatingPoint() && ArgSize <= 16 /*bytes*/) {
ArgMode = 2; // Argument passed in XMM register. Use fp_offset.
} else if (ArgVT.isInteger() && ArgSize <= 32 /*bytes*/) {
ArgMode = 1; // Argument passed in GPR64 register(s). Use gp_offset.
} else {
llvm_unreachable("Unhandled argument type in LowerVAARG");
}
if (ArgMode == 2) {
// Sanity Check: Make sure using fp_offset makes sense.
assert(!Subtarget.useSoftFloat() &&
!(MF.getFunction()->hasFnAttribute(Attribute::NoImplicitFloat)) &&
Subtarget.hasSSE1());
}
// Insert VAARG_64 node into the DAG
// VAARG_64 returns two values: Variable Argument Address, Chain
SDValue InstOps[] = {Chain, SrcPtr, DAG.getConstant(ArgSize, dl, MVT::i32),
DAG.getConstant(ArgMode, dl, MVT::i8),
DAG.getConstant(Align, dl, MVT::i32)};
SDVTList VTs = DAG.getVTList(getPointerTy(DAG.getDataLayout()), MVT::Other);
SDValue VAARG = DAG.getMemIntrinsicNode(X86ISD::VAARG_64, dl,
VTs, InstOps, MVT::i64,
MachinePointerInfo(SV),
/*Align=*/0,
/*Volatile=*/false,
/*ReadMem=*/true,
/*WriteMem=*/true);
Chain = VAARG.getValue(1);
// Load the next argument and return it
return DAG.getLoad(ArgVT, dl, Chain, VAARG, MachinePointerInfo());
}
static SDValue LowerVACOPY(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
// X86-64 va_list is a struct { i32, i32, i8*, i8* }, except on Windows,
// where a va_list is still an i8*.
assert(Subtarget.is64Bit() && "This code only handles 64-bit va_copy!");
if (Subtarget.isCallingConvWin64(
DAG.getMachineFunction().getFunction()->getCallingConv()))
// Probably a Win64 va_copy.
return DAG.expandVACopy(Op.getNode());
SDValue Chain = Op.getOperand(0);
SDValue DstPtr = Op.getOperand(1);
SDValue SrcPtr = Op.getOperand(2);
const Value *DstSV = cast<SrcValueSDNode>(Op.getOperand(3))->getValue();
const Value *SrcSV = cast<SrcValueSDNode>(Op.getOperand(4))->getValue();
SDLoc DL(Op);
return DAG.getMemcpy(Chain, DL, DstPtr, SrcPtr,
DAG.getIntPtrConstant(24, DL), 8, /*isVolatile*/false,
false, false,
MachinePointerInfo(DstSV), MachinePointerInfo(SrcSV));
}
/// Handle vector element shifts where the shift amount is a constant.
/// Takes immediate version of shift as input.
static SDValue getTargetVShiftByConstNode(unsigned Opc, const SDLoc &dl, MVT VT,
SDValue SrcOp, uint64_t ShiftAmt,
SelectionDAG &DAG) {
MVT ElementType = VT.getVectorElementType();
// Bitcast the source vector to the output type; this is mainly necessary
// for vXi8/vXi64 shifts.
if (VT != SrcOp.getSimpleValueType())
SrcOp = DAG.getBitcast(VT, SrcOp);
// Fold this packed shift into its first operand if ShiftAmt is 0.
if (ShiftAmt == 0)
return SrcOp;
// Check for ShiftAmt >= element width
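// e.g. (VSRLI v4i32 X, 32) folds to zero, while (VSRAI v4i32 X, 32) is
// clamped to a shift by 31 so every element becomes its sign mask.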
if (ShiftAmt >= ElementType.getSizeInBits()) {
if (Opc == X86ISD::VSRAI)
ShiftAmt = ElementType.getSizeInBits() - 1;
else
return DAG.getConstant(0, dl, VT);
}
assert((Opc == X86ISD::VSHLI || Opc == X86ISD::VSRLI || Opc == X86ISD::VSRAI)
&& "Unknown target vector shift-by-constant node");
// Fold this packed vector shift into a build vector if SrcOp is a
// vector of Constants or UNDEFs.
if (ISD::isBuildVectorOfConstantSDNodes(SrcOp.getNode())) {
SmallVector<SDValue, 8> Elts;
unsigned NumElts = SrcOp->getNumOperands();
ConstantSDNode *ND;
switch(Opc) {
default: llvm_unreachable("Unknown opcode!");
case X86ISD::VSHLI:
for (unsigned i=0; i!=NumElts; ++i) {
SDValue CurrentOp = SrcOp->getOperand(i);
if (CurrentOp->isUndef()) {
Elts.push_back(CurrentOp);
continue;
}
ND = cast<ConstantSDNode>(CurrentOp);
const APInt &C = ND->getAPIntValue();
Elts.push_back(DAG.getConstant(C.shl(ShiftAmt), dl, ElementType));
}
break;
case X86ISD::VSRLI:
for (unsigned i=0; i!=NumElts; ++i) {
SDValue CurrentOp = SrcOp->getOperand(i);
if (CurrentOp->isUndef()) {
Elts.push_back(CurrentOp);
continue;
}
ND = cast<ConstantSDNode>(CurrentOp);
const APInt &C = ND->getAPIntValue();
Elts.push_back(DAG.getConstant(C.lshr(ShiftAmt), dl, ElementType));
}
break;
case X86ISD::VSRAI:
for (unsigned i=0; i!=NumElts; ++i) {
SDValue CurrentOp = SrcOp->getOperand(i);
if (CurrentOp->isUndef()) {
Elts.push_back(CurrentOp);
continue;
}
ND = cast<ConstantSDNode>(CurrentOp);
const APInt &C = ND->getAPIntValue();
Elts.push_back(DAG.getConstant(C.ashr(ShiftAmt), dl, ElementType));
}
break;
}
return DAG.getBuildVector(VT, dl, Elts);
}
return DAG.getNode(Opc, dl, VT, SrcOp,
DAG.getConstant(ShiftAmt, dl, MVT::i8));
}
/// Handle vector element shifts where the shift amount may or may not be a
/// constant. Takes immediate version of shift as input.
static SDValue getTargetVShiftNode(unsigned Opc, const SDLoc &dl, MVT VT,
SDValue SrcOp, SDValue ShAmt,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT SVT = ShAmt.getSimpleValueType();
assert((SVT == MVT::i32 || SVT == MVT::i64) && "Unexpected value type!");
// Catch shift-by-constant.
if (ConstantSDNode *CShAmt = dyn_cast<ConstantSDNode>(ShAmt))
return getTargetVShiftByConstNode(Opc, dl, VT, SrcOp,
CShAmt->getZExtValue(), DAG);
// Change opcode to non-immediate version
switch (Opc) {
default: llvm_unreachable("Unknown target vector shift node");
case X86ISD::VSHLI: Opc = X86ISD::VSHL; break;
case X86ISD::VSRLI: Opc = X86ISD::VSRL; break;
case X86ISD::VSRAI: Opc = X86ISD::VSRA; break;
}
// Need to build a vector containing the shift amount.
// SSE/AVX packed shifts only use the lower 64 bits of the shift count.
// +=================+============+=======================================+
// | ShAmt is | HasSSE4.1? | Construct ShAmt vector as |
// +=================+============+=======================================+
// | i64 | Yes, No | Use ShAmt as lowest elt |
// | i32 | Yes | zero-extend in-reg |
// | (i32 zext(i16)) | Yes | zero-extend in-reg |
// | i16/i32 | No | v4i32 build_vector(ShAmt, 0, ud, ud) |
// +=================+============+=======================================+
if (SVT == MVT::i64)
ShAmt = DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(ShAmt), MVT::v2i64, ShAmt);
else if (Subtarget.hasSSE41() && ShAmt.getOpcode() == ISD::ZERO_EXTEND &&
ShAmt.getOperand(0).getSimpleValueType() == MVT::i16) {
ShAmt = ShAmt.getOperand(0);
ShAmt = DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(ShAmt), MVT::v8i16, ShAmt);
ShAmt = DAG.getZeroExtendVectorInReg(ShAmt, SDLoc(ShAmt), MVT::v2i64);
} else if (Subtarget.hasSSE41() &&
ShAmt.getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
ShAmt = DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(ShAmt), MVT::v4i32, ShAmt);
ShAmt = DAG.getZeroExtendVectorInReg(ShAmt, SDLoc(ShAmt), MVT::v2i64);
} else {
SmallVector<SDValue, 4> ShOps = {ShAmt, DAG.getConstant(0, dl, SVT),
DAG.getUNDEF(SVT), DAG.getUNDEF(SVT)};
ShAmt = DAG.getBuildVector(MVT::v4i32, dl, ShOps);
}
// The return type has to be a 128-bit type with the same element
// type as the input type.
MVT EltVT = VT.getVectorElementType();
MVT ShVT = MVT::getVectorVT(EltVT, 128/EltVT.getSizeInBits());
ShAmt = DAG.getBitcast(ShVT, ShAmt);
return DAG.getNode(Opc, dl, VT, SrcOp, ShAmt);
}
/// \brief Return Mask with the necessary casting or extending
/// for \p Mask according to \p MaskVT when lowering masking intrinsics
static SDValue getMaskNode(SDValue Mask, MVT MaskVT,
const X86Subtarget &Subtarget, SelectionDAG &DAG,
const SDLoc &dl) {
if (isAllOnesConstant(Mask))
return DAG.getTargetConstant(1, dl, MaskVT);
if (X86::isZeroNode(Mask))
return DAG.getTargetConstant(0, dl, MaskVT);
if (MaskVT.bitsGT(Mask.getSimpleValueType())) {
// Mask should be extended
Mask = DAG.getNode(ISD::ANY_EXTEND, dl,
MVT::getIntegerVT(MaskVT.getSizeInBits()), Mask);
}
if (Mask.getSimpleValueType() == MVT::i64 && Subtarget.is32Bit()) {
if (MaskVT == MVT::v64i1) {
assert(Subtarget.hasBWI() && "Expected AVX512BW target!");
// In 32-bit mode, bitcasting an i64 is illegal; split it and bitcast the
// two halves instead.
SDValue Lo, Hi;
Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Mask,
DAG.getConstant(0, dl, MVT::i32));
Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Mask,
DAG.getConstant(1, dl, MVT::i32));
Lo = DAG.getBitcast(MVT::v32i1, Lo);
Hi = DAG.getBitcast(MVT::v32i1, Hi);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v64i1, Lo, Hi);
} else {
// MaskVT requires fewer than 64 bits. Truncate the mask (which should
// succeed in any case) and bitcast.
MVT TruncVT = MVT::getIntegerVT(MaskVT.getSizeInBits());
return DAG.getBitcast(MaskVT,
DAG.getNode(ISD::TRUNCATE, dl, TruncVT, Mask));
}
} else {
MVT BitcastVT = MVT::getVectorVT(MVT::i1,
Mask.getSimpleValueType().getSizeInBits());
// When MaskVT is v2i1 or v4i1, the low 2 or 4 elements
// are extracted by EXTRACT_SUBVECTOR.
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MaskVT,
DAG.getBitcast(BitcastVT, Mask),
DAG.getIntPtrConstant(0, dl));
}
}
/// \brief Return (and \p Op, \p Mask) for compare instructions or
/// (vselect \p Mask, \p Op, \p PreservedSrc) for others along with the
/// necessary casting or extending for \p Mask when lowering masking intrinsics
static SDValue getVectorMaskingNode(SDValue Op, SDValue Mask,
SDValue PreservedSrc,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
unsigned OpcodeSelect = ISD::VSELECT;
SDLoc dl(Op);
if (isAllOnesConstant(Mask))
return Op;
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
switch (Op.getOpcode()) {
default: break;
case X86ISD::PCMPEQM:
case X86ISD::PCMPGTM:
case X86ISD::CMPM:
case X86ISD::CMPMU:
return DAG.getNode(ISD::AND, dl, VT, Op, VMask);
case X86ISD::VFPCLASS:
case X86ISD::VFPCLASSS:
return DAG.getNode(ISD::OR, dl, VT, Op, VMask);
case X86ISD::VTRUNC:
case X86ISD::VTRUNCS:
case X86ISD::VTRUNCUS:
case X86ISD::CVTPS2PH:
// We can't use ISD::VSELECT here because it is not always "Legal"
// for the destination type. For example, vpmovqb requires only AVX512,
// while a vselect operating on byte elements requires BWI.
OpcodeSelect = X86ISD::SELECT;
break;
}
if (PreservedSrc.isUndef())
PreservedSrc = getZeroVector(VT, Subtarget, DAG, dl);
return DAG.getNode(OpcodeSelect, dl, VT, VMask, Op, PreservedSrc);
}
/// \brief Creates an SDNode for a predicated scalar operation.
/// \returns (X86vselect \p Mask, \p Op, \p PreservedSrc).
/// The mask comes in as MVT::i8 and must be transformed
/// to MVT::v1i1 while lowering masking intrinsics.
/// The main difference between ScalarMaskingNode and VectorMaskingNode is
/// that the former uses "X86select" instead of "vselect": we simply can't
/// create a "vselect" node for a scalar instruction.
static SDValue getScalarMaskingNode(SDValue Op, SDValue Mask,
SDValue PreservedSrc,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
if (auto *MaskConst = dyn_cast<ConstantSDNode>(Mask))
if (MaskConst->getZExtValue() & 0x1)
return Op;
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
SDValue IMask = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v1i1, Mask);
if (Op.getOpcode() == X86ISD::FSETCCM ||
Op.getOpcode() == X86ISD::FSETCCM_RND)
return DAG.getNode(ISD::AND, dl, VT, Op, IMask);
if (Op.getOpcode() == X86ISD::VFPCLASSS)
return DAG.getNode(ISD::OR, dl, VT, Op, IMask);
if (PreservedSrc.isUndef())
PreservedSrc = getZeroVector(VT, Subtarget, DAG, dl);
return DAG.getNode(X86ISD::SELECTS, dl, VT, IMask, Op, PreservedSrc);
}
static int getSEHRegistrationNodeSize(const Function *Fn) {
if (!Fn->hasPersonalityFn())
report_fatal_error(
"querying registration node size for function without personality");
// The RegNodeSize is 6 32-bit words for SEH and 4 for C++ EH. See
// WinEHStatePass for the full struct definition.
switch (classifyEHPersonality(Fn->getPersonalityFn())) {
case EHPersonality::MSVC_X86SEH: return 24;
case EHPersonality::MSVC_CXX: return 16;
default: break;
}
report_fatal_error(
"can only recover FP for 32-bit MSVC EH personality functions");
}
/// When the MSVC runtime transfers control to us, either to an outlined
/// function or when returning to a parent frame after catching an exception, we
/// recover the parent frame pointer by doing arithmetic on the incoming EBP.
/// Here's the math:
/// RegNodeBase = EntryEBP - RegNodeSize
/// ParentFP = RegNodeBase - ParentFrameOffset
/// Subtracting RegNodeSize takes us to the offset of the registration node, and
/// subtracting the offset (negative on x86) takes us back to the parent FP.
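/// Illustrative numbers (hypothetical offset): with the MSVC C++ personality,
/// RegNodeSize = 16; if ParentFrameOffset = -64, then
/// ParentFP = (EntryEBP - 16) - (-64) = EntryEBP + 48.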
static SDValue recoverFramePointer(SelectionDAG &DAG, const Function *Fn,
SDValue EntryEBP) {
MachineFunction &MF = DAG.getMachineFunction();
SDLoc dl;
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
MVT PtrVT = TLI.getPointerTy(DAG.getDataLayout());
// It's possible that the parent function no longer has a personality function
// if the exceptional code was optimized away, in which case we just return
// the incoming EBP.
if (!Fn->hasPersonalityFn())
return EntryEBP;
// Get an MCSymbol that will ultimately resolve to the frame offset of the EH
// registration, or the .set_setframe offset.
MCSymbol *OffsetSym =
MF.getMMI().getContext().getOrCreateParentFrameOffsetSymbol(
GlobalValue::dropLLVMManglingEscape(Fn->getName()));
SDValue OffsetSymVal = DAG.getMCSymbol(OffsetSym, PtrVT);
SDValue ParentFrameOffset =
DAG.getNode(ISD::LOCAL_RECOVER, dl, PtrVT, OffsetSymVal);
// Return EntryEBP + ParentFrameOffset for x64. This adjusts from RSP after
// prologue to RBP in the parent function.
const X86Subtarget &Subtarget =
static_cast<const X86Subtarget &>(DAG.getSubtarget());
if (Subtarget.is64Bit())
return DAG.getNode(ISD::ADD, dl, PtrVT, EntryEBP, ParentFrameOffset);
int RegNodeSize = getSEHRegistrationNodeSize(Fn);
// RegNodeBase = EntryEBP - RegNodeSize
// ParentFP = RegNodeBase - ParentFrameOffset
SDValue RegNodeBase = DAG.getNode(ISD::SUB, dl, PtrVT, EntryEBP,
DAG.getConstant(RegNodeSize, dl, PtrVT));
return DAG.getNode(ISD::SUB, dl, PtrVT, RegNodeBase, ParentFrameOffset);
}
static SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
// Helper to detect whether an operand is the CUR_DIRECTION rounding mode.
auto isRoundModeCurDirection = [](SDValue Rnd) {
if (!isa<ConstantSDNode>(Rnd))
return false;
unsigned Round = cast<ConstantSDNode>(Rnd)->getZExtValue();
return Round == X86::STATIC_ROUNDING::CUR_DIRECTION;
};
SDLoc dl(Op);
unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
MVT VT = Op.getSimpleValueType();
const IntrinsicData* IntrData = getIntrinsicWithoutChain(IntNo);
if (IntrData) {
switch(IntrData->Type) {
case INTR_TYPE_1OP:
return DAG.getNode(IntrData->Opc0, dl, Op.getValueType(), Op.getOperand(1));
case INTR_TYPE_2OP:
return DAG.getNode(IntrData->Opc0, dl, Op.getValueType(), Op.getOperand(1),
Op.getOperand(2));
case INTR_TYPE_3OP:
return DAG.getNode(IntrData->Opc0, dl, Op.getValueType(), Op.getOperand(1),
Op.getOperand(2), Op.getOperand(3));
case INTR_TYPE_4OP:
return DAG.getNode(IntrData->Opc0, dl, Op.getValueType(), Op.getOperand(1),
Op.getOperand(2), Op.getOperand(3), Op.getOperand(4));
case INTR_TYPE_1OP_MASK_RM: {
SDValue Src = Op.getOperand(1);
SDValue PassThru = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
SDValue RoundingMode;
// We always add a rounding-mode operand to the node.
// If no rounding mode is specified, we use the
// "current direction" mode.
if (Op.getNumOperands() == 4)
RoundingMode =
DAG.getConstant(X86::STATIC_ROUNDING::CUR_DIRECTION, dl, MVT::i32);
else
RoundingMode = Op.getOperand(4);
assert(IntrData->Opc1 == 0 && "Unexpected second opcode!");
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src,
RoundingMode),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_1OP_MASK: {
SDValue Src = Op.getOperand(1);
SDValue PassThru = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
// We add rounding mode to the Node when
// - RM Opcode is specified and
// - RM is not "current direction".
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(4);
if (!isRoundModeCurDirection(Rnd)) {
return getVectorMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, Op.getValueType(),
Src, Rnd),
Mask, PassThru, Subtarget, DAG);
}
}
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_SCALAR_MASK: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue passThru = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(5);
if (!isRoundModeCurDirection(Rnd))
return getScalarMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, VT, Src1, Src2, Rnd),
Mask, passThru, Subtarget, DAG);
}
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src1, Src2),
Mask, passThru, Subtarget, DAG);
}
case INTR_TYPE_SCALAR_MASK_RM: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src0 = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
// There are 2 kinds of intrinsics in this group:
// (1) With suppress-all-exceptions (sae) or rounding mode - 6 operands
// (2) With rounding mode and sae - 7 operands.
if (Op.getNumOperands() == 6) {
SDValue Sae = Op.getOperand(5);
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src1, Src2,
Sae),
Mask, Src0, Subtarget, DAG);
}
assert(Op.getNumOperands() == 7 && "Unexpected intrinsic form");
SDValue RoundingMode = Op.getOperand(5);
SDValue Sae = Op.getOperand(6);
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src1, Src2,
RoundingMode, Sae),
Mask, Src0, Subtarget, DAG);
}
case INTR_TYPE_2OP_MASK:
case INTR_TYPE_2OP_IMM8_MASK: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue PassThru = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
if (IntrData->Type == INTR_TYPE_2OP_IMM8_MASK)
Src2 = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Src2);
// We specify 2 possible opcodes for intrinsics with rounding modes.
// First, we check whether the intrinsic may have a non-default rounding mode
// (IntrData->Opc1 != 0), then we check the rounding mode operand.
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(5);
if (!isRoundModeCurDirection(Rnd)) {
return getVectorMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, Op.getValueType(),
Src1, Src2, Rnd),
Mask, PassThru, Subtarget, DAG);
}
}
// TODO: Intrinsics should have fast-math-flags to propagate.
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,Src1,Src2),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_2OP_MASK_RM: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue PassThru = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
// We specify 2 possible modes for intrinsics, with and without rounding
// modes.
// First, we check whether the intrinsic has a rounding mode (6 operands);
// if not, we set the rounding mode to "current".
SDValue Rnd;
if (Op.getNumOperands() == 6)
Rnd = Op.getOperand(5);
else
Rnd = DAG.getConstant(X86::STATIC_ROUNDING::CUR_DIRECTION, dl, MVT::i32);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Rnd),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_3OP_SCALAR_MASK_RM: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue PassThru = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
SDValue Sae = Op.getOperand(6);
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src1,
Src2, Src3, Sae),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_3OP_MASK_RM: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Imm = Op.getOperand(3);
SDValue PassThru = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
// We specify 2 possible modes for intrinsics, with and without rounding
// modes.
// First, we check whether the intrinsic has a rounding mode (7 operands);
// if not, we set the rounding mode to "current".
SDValue Rnd;
if (Op.getNumOperands() == 7)
Rnd = Op.getOperand(6);
else
Rnd = DAG.getConstant(X86::STATIC_ROUNDING::CUR_DIRECTION, dl, MVT::i32);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Imm, Rnd),
Mask, PassThru, Subtarget, DAG);
}
case INTR_TYPE_3OP_IMM8_MASK:
case INTR_TYPE_3OP_MASK: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue PassThru = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
if (IntrData->Type == INTR_TYPE_3OP_IMM8_MASK)
Src3 = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Src3);
// We specify 2 possible opcodes for intrinsics with rounding modes.
// First, we check whether the intrinsic may have a non-default rounding mode
// (IntrData->Opc1 != 0), then we check the rounding mode operand.
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(6);
if (!isRoundModeCurDirection(Rnd)) {
return getVectorMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, Op.getValueType(),
Src1, Src2, Src3, Rnd),
Mask, PassThru, Subtarget, DAG);
}
}
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Src3),
Mask, PassThru, Subtarget, DAG);
}
case VPERM_2OP_MASK : {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue PassThru = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
// Swap Src1 and Src2 in the node creation
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,Src2, Src1),
Mask, PassThru, Subtarget, DAG);
}
case VPERM_3OP_MASKZ:
case VPERM_3OP_MASK:{
MVT VT = Op.getSimpleValueType();
// Src2 is the PassThru
SDValue Src1 = Op.getOperand(1);
// PassThru needs to be the same type as the destination in order
// to pattern match correctly.
SDValue Src2 = DAG.getBitcast(VT, Op.getOperand(2));
SDValue Src3 = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
SDValue PassThru = SDValue();
// set PassThru element
if (IntrData->Type == VPERM_3OP_MASKZ)
PassThru = getZeroVector(VT, Subtarget, DAG, dl);
else
PassThru = Src2;
// Swap Src1 and Src2 in the node creation
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0,
dl, Op.getValueType(),
Src2, Src1, Src3),
Mask, PassThru, Subtarget, DAG);
}
case FMA_OP_MASK3:
case FMA_OP_MASKZ:
case FMA_OP_MASK: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
MVT VT = Op.getSimpleValueType();
SDValue PassThru = SDValue();
// set PassThru element
if (IntrData->Type == FMA_OP_MASKZ)
PassThru = getZeroVector(VT, Subtarget, DAG, dl);
else if (IntrData->Type == FMA_OP_MASK3)
PassThru = Src3;
else
PassThru = Src1;
// We specify 2 possible opcodes for intrinsics with rounding modes.
// First, we check whether the intrinsic may have a non-default rounding mode
// (IntrData->Opc1 != 0), then we check the rounding mode operand.
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(5);
if (!isRoundModeCurDirection(Rnd))
return getVectorMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, Op.getValueType(),
Src1, Src2, Src3, Rnd),
Mask, PassThru, Subtarget, DAG);
}
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0,
dl, Op.getValueType(),
Src1, Src2, Src3),
Mask, PassThru, Subtarget, DAG);
}
case FMA_OP_SCALAR_MASK:
case FMA_OP_SCALAR_MASK3:
case FMA_OP_SCALAR_MASKZ: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue Mask = Op.getOperand(4);
MVT VT = Op.getSimpleValueType();
SDValue PassThru = SDValue();
// set PassThru element
if (IntrData->Type == FMA_OP_SCALAR_MASKZ)
PassThru = getZeroVector(VT, Subtarget, DAG, dl);
else if (IntrData->Type == FMA_OP_SCALAR_MASK3)
PassThru = Src3;
else
PassThru = Src1;
SDValue Rnd = Op.getOperand(5);
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl,
Op.getValueType(), Src1, Src2,
Src3, Rnd),
Mask, PassThru, Subtarget, DAG);
}
case TERLOG_OP_MASK:
case TERLOG_OP_MASKZ: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue Src4 = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Op.getOperand(4));
SDValue Mask = Op.getOperand(5);
MVT VT = Op.getSimpleValueType();
SDValue PassThru = Src1;
// Set PassThru element.
if (IntrData->Type == TERLOG_OP_MASKZ)
PassThru = getZeroVector(VT, Subtarget, DAG, dl);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Src3, Src4),
Mask, PassThru, Subtarget, DAG);
}
case CVTPD2PS:
// ISD::FP_ROUND has a second argument that indicates if the truncation
// does not change the value. Set it to 0 since it can change.
return DAG.getNode(IntrData->Opc0, dl, VT, Op.getOperand(1),
DAG.getIntPtrConstant(0, dl));
case CVTPD2PS_MASK: {
SDValue Src = Op.getOperand(1);
SDValue PassThru = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
// We add rounding mode to the Node when
// - RM Opcode is specified and
// - RM is not "current direction".
unsigned IntrWithRoundingModeOpcode = IntrData->Opc1;
if (IntrWithRoundingModeOpcode != 0) {
SDValue Rnd = Op.getOperand(4);
if (!isRoundModeCurDirection(Rnd)) {
return getVectorMaskingNode(DAG.getNode(IntrWithRoundingModeOpcode,
dl, Op.getValueType(),
Src, Rnd),
Mask, PassThru, Subtarget, DAG);
}
}
assert(IntrData->Opc0 == ISD::FP_ROUND && "Unexpected opcode!");
// ISD::FP_ROUND has a second argument that indicates if the truncation
// does not change the value. Set it to 0 since it can change.
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src,
DAG.getIntPtrConstant(0, dl)),
Mask, PassThru, Subtarget, DAG);
}
case FPCLASS: {
// FPclass intrinsics with mask
SDValue Src1 = Op.getOperand(1);
MVT VT = Src1.getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
SDValue Imm = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
MVT BitcastVT = MVT::getVectorVT(MVT::i1,
Mask.getSimpleValueType().getSizeInBits());
SDValue FPclass = DAG.getNode(IntrData->Opc0, dl, MaskVT, Src1, Imm);
SDValue FPclassMask = getVectorMaskingNode(FPclass, Mask,
DAG.getTargetConstant(0, dl, MaskVT),
Subtarget, DAG);
SDValue Res = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, BitcastVT,
DAG.getUNDEF(BitcastVT), FPclassMask,
DAG.getIntPtrConstant(0, dl));
return DAG.getBitcast(Op.getValueType(), Res);
}
case FPCLASSS: {
SDValue Src1 = Op.getOperand(1);
SDValue Imm = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
SDValue FPclass = DAG.getNode(IntrData->Opc0, dl, MVT::v1i1, Src1, Imm);
SDValue FPclassMask = getScalarMaskingNode(FPclass, Mask,
DAG.getTargetConstant(0, dl, MVT::i1), Subtarget, DAG);
return DAG.getNode(X86ISD::VEXTRACT, dl, MVT::i8, FPclassMask,
DAG.getIntPtrConstant(0, dl));
}
case CMP_MASK:
case CMP_MASK_CC: {
// Comparison intrinsics with masks.
// Example of transformation:
// (i8 (int_x86_avx512_mask_pcmpeq_q_128
// (v2i64 %a), (v2i64 %b), (i8 %mask))) ->
// (i8 (bitcast
// (v8i1 (insert_subvector undef,
// (v2i1 (and (PCMPEQM %a, %b),
// (extract_subvector
// (v8i1 (bitcast %mask)), 0))), 0))))
MVT VT = Op.getOperand(1).getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
SDValue Mask = Op.getOperand((IntrData->Type == CMP_MASK_CC) ? 4 : 3);
MVT BitcastVT = MVT::getVectorVT(MVT::i1,
Mask.getSimpleValueType().getSizeInBits());
SDValue Cmp;
if (IntrData->Type == CMP_MASK_CC) {
SDValue CC = Op.getOperand(3);
CC = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, CC);
// We specify 2 possible opcodes for intrinsics with rounding modes.
// First, we check whether the intrinsic may have a non-default rounding mode
// (IntrData->Opc1 != 0), then we check the rounding mode operand.
if (IntrData->Opc1 != 0) {
SDValue Rnd = Op.getOperand(5);
if (!isRoundModeCurDirection(Rnd))
Cmp = DAG.getNode(IntrData->Opc1, dl, MaskVT, Op.getOperand(1),
Op.getOperand(2), CC, Rnd);
}
// Default rounding mode.
if (!Cmp.getNode())
Cmp = DAG.getNode(IntrData->Opc0, dl, MaskVT, Op.getOperand(1),
Op.getOperand(2), CC);
} else {
assert(IntrData->Type == CMP_MASK && "Unexpected intrinsic type!");
Cmp = DAG.getNode(IntrData->Opc0, dl, MaskVT, Op.getOperand(1),
Op.getOperand(2));
}
SDValue CmpMask = getVectorMaskingNode(Cmp, Mask,
DAG.getTargetConstant(0, dl,
MaskVT),
Subtarget, DAG);
SDValue Res = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, BitcastVT,
DAG.getUNDEF(BitcastVT), CmpMask,
DAG.getIntPtrConstant(0, dl));
return DAG.getBitcast(Op.getValueType(), Res);
}
case CMP_MASK_SCALAR_CC: {
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue CC = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Op.getOperand(3));
SDValue Mask = Op.getOperand(4);
SDValue Cmp;
if (IntrData->Opc1 != 0) {
SDValue Rnd = Op.getOperand(5);
if (!isRoundModeCurDirection(Rnd))
Cmp = DAG.getNode(IntrData->Opc1, dl, MVT::v1i1, Src1, Src2, CC, Rnd);
}
// Default rounding mode.
if (!Cmp.getNode())
Cmp = DAG.getNode(IntrData->Opc0, dl, MVT::v1i1, Src1, Src2, CC);
SDValue CmpMask = getScalarMaskingNode(Cmp, Mask,
DAG.getTargetConstant(0, dl,
MVT::i1),
Subtarget, DAG);
return DAG.getNode(X86ISD::VEXTRACT, dl, MVT::i8, CmpMask,
DAG.getIntPtrConstant(0, dl));
}
case COMI: { // Comparison intrinsics
ISD::CondCode CC = (ISD::CondCode)IntrData->Opc1;
SDValue LHS = Op.getOperand(1);
SDValue RHS = Op.getOperand(2);
SDValue Comi = DAG.getNode(IntrData->Opc0, dl, MVT::i32, LHS, RHS);
SDValue InvComi = DAG.getNode(IntrData->Opc0, dl, MVT::i32, RHS, LHS);
SDValue SetCC;
switch (CC) {
case ISD::SETEQ: { // (ZF = 0 and PF = 0)
SetCC = getSETCC(X86::COND_E, Comi, dl, DAG);
SDValue SetNP = getSETCC(X86::COND_NP, Comi, dl, DAG);
SetCC = DAG.getNode(ISD::AND, dl, MVT::i8, SetCC, SetNP);
break;
}
case ISD::SETNE: { // (ZF = 1 or PF = 1)
SetCC = getSETCC(X86::COND_NE, Comi, dl, DAG);
SDValue SetP = getSETCC(X86::COND_P, Comi, dl, DAG);
SetCC = DAG.getNode(ISD::OR, dl, MVT::i8, SetCC, SetP);
break;
}
case ISD::SETGT: // (CF = 0 and ZF = 0)
SetCC = getSETCC(X86::COND_A, Comi, dl, DAG);
break;
case ISD::SETLT: { // The condition is opposite to GT. Swap the operands.
SetCC = getSETCC(X86::COND_A, InvComi, dl, DAG);
break;
}
case ISD::SETGE: // CF = 0
SetCC = getSETCC(X86::COND_AE, Comi, dl, DAG);
break;
case ISD::SETLE: // The condition is opposite to GE. Swap the operands.
SetCC = getSETCC(X86::COND_AE, InvComi, dl, DAG);
break;
default:
llvm_unreachable("Unexpected illegal condition!");
}
return DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, SetCC);
}
case COMI_RM: { // Comparison intrinsics with Sae
SDValue LHS = Op.getOperand(1);
SDValue RHS = Op.getOperand(2);
unsigned CondVal = cast<ConstantSDNode>(Op.getOperand(3))->getZExtValue();
SDValue Sae = Op.getOperand(4);
SDValue FCmp;
if (isRoundModeCurDirection(Sae))
FCmp = DAG.getNode(X86ISD::FSETCCM, dl, MVT::v1i1, LHS, RHS,
DAG.getConstant(CondVal, dl, MVT::i8));
else
FCmp = DAG.getNode(X86ISD::FSETCCM_RND, dl, MVT::v1i1, LHS, RHS,
DAG.getConstant(CondVal, dl, MVT::i8), Sae);
return DAG.getNode(X86ISD::VEXTRACT, dl, MVT::i32, FCmp,
DAG.getIntPtrConstant(0, dl));
}
case VSHIFT:
return getTargetVShiftNode(IntrData->Opc0, dl, Op.getSimpleValueType(),
Op.getOperand(1), Op.getOperand(2), Subtarget,
DAG);
case COMPRESS_EXPAND_IN_REG: {
SDValue Mask = Op.getOperand(3);
SDValue DataToCompress = Op.getOperand(1);
SDValue PassThru = Op.getOperand(2);
if (isAllOnesConstant(Mask)) // return data as is
return Op.getOperand(1);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
DataToCompress),
Mask, PassThru, Subtarget, DAG);
}
case BROADCASTM: {
SDValue Mask = Op.getOperand(1);
MVT MaskVT = MVT::getVectorVT(MVT::i1,
Mask.getSimpleValueType().getSizeInBits());
Mask = DAG.getBitcast(MaskVT, Mask);
return DAG.getNode(IntrData->Opc0, dl, Op.getValueType(), Mask);
}
case KUNPCK: {
MVT VT = Op.getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getSizeInBits()/2);
SDValue Src1 = getMaskNode(Op.getOperand(1), MaskVT, Subtarget, DAG, dl);
SDValue Src2 = getMaskNode(Op.getOperand(2), MaskVT, Subtarget, DAG, dl);
// Arguments should be swapped.
SDValue Res = DAG.getNode(IntrData->Opc0, dl,
MVT::getVectorVT(MVT::i1, VT.getSizeInBits()),
Src2, Src1);
return DAG.getBitcast(VT, Res);
}
case MASK_BINOP: {
MVT VT = Op.getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getSizeInBits());
SDValue Src1 = getMaskNode(Op.getOperand(1), MaskVT, Subtarget, DAG, dl);
SDValue Src2 = getMaskNode(Op.getOperand(2), MaskVT, Subtarget, DAG, dl);
SDValue Res = DAG.getNode(IntrData->Opc0, dl, MaskVT, Src1, Src2);
return DAG.getBitcast(VT, Res);
}
case FIXUPIMMS:
case FIXUPIMMS_MASKZ:
case FIXUPIMM:
case FIXUPIMM_MASKZ:{
SDValue Src1 = Op.getOperand(1);
SDValue Src2 = Op.getOperand(2);
SDValue Src3 = Op.getOperand(3);
SDValue Imm = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
SDValue Passthru = (IntrData->Type == FIXUPIMM || IntrData->Type == FIXUPIMMS ) ?
Src1 : getZeroVector(VT, Subtarget, DAG, dl);
// We specify 2 possible modes for intrinsics, with and without rounding
// modes.
// First, we check whether the intrinsic has a rounding mode (7 operands);
// if not, we set the rounding mode to "current".
SDValue Rnd;
if (Op.getNumOperands() == 7)
Rnd = Op.getOperand(6);
else
Rnd = DAG.getConstant(X86::STATIC_ROUNDING::CUR_DIRECTION, dl, MVT::i32);
if (IntrData->Type == FIXUPIMM || IntrData->Type == FIXUPIMM_MASKZ)
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Src3, Imm, Rnd),
Mask, Passthru, Subtarget, DAG);
else // Scalar - FIXUPIMMS, FIXUPIMMS_MASKZ
return getScalarMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
Src1, Src2, Src3, Imm, Rnd),
Mask, Passthru, Subtarget, DAG);
}
case CONVERT_TO_MASK: {
MVT SrcVT = Op.getOperand(1).getSimpleValueType();
MVT MaskVT = MVT::getVectorVT(MVT::i1, SrcVT.getVectorNumElements());
MVT BitcastVT = MVT::getVectorVT(MVT::i1, VT.getSizeInBits());
SDValue CvtMask = DAG.getNode(IntrData->Opc0, dl, MaskVT,
Op.getOperand(1));
SDValue Res = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, BitcastVT,
DAG.getUNDEF(BitcastVT), CvtMask,
DAG.getIntPtrConstant(0, dl));
return DAG.getBitcast(Op.getValueType(), Res);
}
case BRCST_SUBVEC_TO_VEC: {
SDValue Src = Op.getOperand(1);
SDValue Passthru = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
EVT resVT = Passthru.getValueType();
SDValue subVec = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, resVT,
DAG.getUNDEF(resVT), Src,
DAG.getIntPtrConstant(0, dl));
SDValue immVal;
if (Src.getSimpleValueType().is256BitVector() && resVT.is512BitVector())
immVal = DAG.getConstant(0x44, dl, MVT::i8);
else
immVal = DAG.getConstant(0, dl, MVT::i8);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT,
subVec, subVec, immVal),
Mask, Passthru, Subtarget, DAG);
}
case BRCST32x2_TO_VEC: {
SDValue Src = Op.getOperand(1);
SDValue PassThru = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
assert((VT.getScalarType() == MVT::i32 ||
VT.getScalarType() == MVT::f32) && "Unexpected type!");
// Bitcast Src to a packed 64-bit element type.
MVT ScalarVT = VT.getScalarType() == MVT::i32 ? MVT::i64 : MVT::f64;
MVT BitcastVT = MVT::getVectorVT(ScalarVT, Src.getValueSizeInBits()/64);
Src = DAG.getBitcast(BitcastVT, Src);
return getVectorMaskingNode(DAG.getNode(IntrData->Opc0, dl, VT, Src),
Mask, PassThru, Subtarget, DAG);
}
default:
break;
}
}
switch (IntNo) {
default: return SDValue(); // Don't custom lower most intrinsics.
case Intrinsic::x86_avx2_permd:
case Intrinsic::x86_avx2_permps:
// Operands intentionally swapped. Mask is last operand to intrinsic,
// but second operand for node/instruction.
return DAG.getNode(X86ISD::VPERMV, dl, Op.getValueType(),
Op.getOperand(2), Op.getOperand(1));
// ptest and testp intrinsics. The intrinsics these come from are designed
// to return an integer value, not just an instruction, so lower them to the
// ptest or testp pattern and a setcc for the result.
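// e.g. (illustrative) _mm_testz_si128(a, b) becomes PTEST a, b followed by
// SETE and a zero-extend of the result to i32.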
case Intrinsic::x86_sse41_ptestz:
case Intrinsic::x86_sse41_ptestc:
case Intrinsic::x86_sse41_ptestnzc:
case Intrinsic::x86_avx_ptestz_256:
case Intrinsic::x86_avx_ptestc_256:
case Intrinsic::x86_avx_ptestnzc_256:
case Intrinsic::x86_avx_vtestz_ps:
case Intrinsic::x86_avx_vtestc_ps:
case Intrinsic::x86_avx_vtestnzc_ps:
case Intrinsic::x86_avx_vtestz_pd:
case Intrinsic::x86_avx_vtestc_pd:
case Intrinsic::x86_avx_vtestnzc_pd:
case Intrinsic::x86_avx_vtestz_ps_256:
case Intrinsic::x86_avx_vtestc_ps_256:
case Intrinsic::x86_avx_vtestnzc_ps_256:
case Intrinsic::x86_avx_vtestz_pd_256:
case Intrinsic::x86_avx_vtestc_pd_256:
case Intrinsic::x86_avx_vtestnzc_pd_256: {
bool IsTestPacked = false;
X86::CondCode X86CC;
switch (IntNo) {
default: llvm_unreachable("Bad fallthrough in Intrinsic lowering.");
case Intrinsic::x86_avx_vtestz_ps:
case Intrinsic::x86_avx_vtestz_pd:
case Intrinsic::x86_avx_vtestz_ps_256:
case Intrinsic::x86_avx_vtestz_pd_256:
IsTestPacked = true;
LLVM_FALLTHROUGH;
case Intrinsic::x86_sse41_ptestz:
case Intrinsic::x86_avx_ptestz_256:
// ZF = 1
X86CC = X86::COND_E;
break;
case Intrinsic::x86_avx_vtestc_ps:
case Intrinsic::x86_avx_vtestc_pd:
case Intrinsic::x86_avx_vtestc_ps_256:
case Intrinsic::x86_avx_vtestc_pd_256:
IsTestPacked = true;
LLVM_FALLTHROUGH;
case Intrinsic::x86_sse41_ptestc:
case Intrinsic::x86_avx_ptestc_256:
// CF = 1
X86CC = X86::COND_B;
break;
case Intrinsic::x86_avx_vtestnzc_ps:
case Intrinsic::x86_avx_vtestnzc_pd:
case Intrinsic::x86_avx_vtestnzc_ps_256:
case Intrinsic::x86_avx_vtestnzc_pd_256:
IsTestPacked = true;
LLVM_FALLTHROUGH;
case Intrinsic::x86_sse41_ptestnzc:
case Intrinsic::x86_avx_ptestnzc_256:
// ZF and CF = 0
X86CC = X86::COND_A;
break;
}
SDValue LHS = Op.getOperand(1);
SDValue RHS = Op.getOperand(2);
unsigned TestOpc = IsTestPacked ? X86ISD::TESTP : X86ISD::PTEST;
SDValue Test = DAG.getNode(TestOpc, dl, MVT::i32, LHS, RHS);
SDValue SetCC = getSETCC(X86CC, Test, dl, DAG);
return DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, SetCC);
}
case Intrinsic::x86_avx512_kortestz_w:
case Intrinsic::x86_avx512_kortestc_w: {
X86::CondCode X86CC =
(IntNo == Intrinsic::x86_avx512_kortestz_w) ? X86::COND_E : X86::COND_B;
SDValue LHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(1));
SDValue RHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(2));
SDValue Test = DAG.getNode(X86ISD::KORTEST, dl, MVT::i32, LHS, RHS);
SDValue SetCC = getSETCC(X86CC, Test, dl, DAG);
return DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, SetCC);
}
case Intrinsic::x86_avx512_knot_w: {
SDValue LHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(1));
SDValue RHS = DAG.getConstant(1, dl, MVT::v16i1);
SDValue Res = DAG.getNode(ISD::XOR, dl, MVT::v16i1, LHS, RHS);
return DAG.getBitcast(MVT::i16, Res);
}
case Intrinsic::x86_avx512_kandn_w: {
SDValue LHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(1));
// Invert LHS for the not.
LHS = DAG.getNode(ISD::XOR, dl, MVT::v16i1, LHS,
DAG.getConstant(1, dl, MVT::v16i1));
SDValue RHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(2));
SDValue Res = DAG.getNode(ISD::AND, dl, MVT::v16i1, LHS, RHS);
return DAG.getBitcast(MVT::i16, Res);
}
case Intrinsic::x86_avx512_kxnor_w: {
SDValue LHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(1));
SDValue RHS = DAG.getBitcast(MVT::v16i1, Op.getOperand(2));
SDValue Res = DAG.getNode(ISD::XOR, dl, MVT::v16i1, LHS, RHS);
// Invert result for the not.
Res = DAG.getNode(ISD::XOR, dl, MVT::v16i1, Res,
DAG.getConstant(1, dl, MVT::v16i1));
return DAG.getBitcast(MVT::i16, Res);
}
case Intrinsic::x86_sse42_pcmpistria128:
case Intrinsic::x86_sse42_pcmpestria128:
case Intrinsic::x86_sse42_pcmpistric128:
case Intrinsic::x86_sse42_pcmpestric128:
case Intrinsic::x86_sse42_pcmpistrio128:
case Intrinsic::x86_sse42_pcmpestrio128:
case Intrinsic::x86_sse42_pcmpistris128:
case Intrinsic::x86_sse42_pcmpestris128:
case Intrinsic::x86_sse42_pcmpistriz128:
case Intrinsic::x86_sse42_pcmpestriz128: {
unsigned Opcode;
X86::CondCode X86CC;
switch (IntNo) {
default: llvm_unreachable("Impossible intrinsic"); // Can't reach here.
case Intrinsic::x86_sse42_pcmpistria128:
Opcode = X86ISD::PCMPISTRI;
X86CC = X86::COND_A;
break;
case Intrinsic::x86_sse42_pcmpestria128:
Opcode = X86ISD::PCMPESTRI;
X86CC = X86::COND_A;
break;
case Intrinsic::x86_sse42_pcmpistric128:
Opcode = X86ISD::PCMPISTRI;
X86CC = X86::COND_B;
break;
case Intrinsic::x86_sse42_pcmpestric128:
Opcode = X86ISD::PCMPESTRI;
X86CC = X86::COND_B;
break;
case Intrinsic::x86_sse42_pcmpistrio128:
Opcode = X86ISD::PCMPISTRI;
X86CC = X86::COND_O;
break;
case Intrinsic::x86_sse42_pcmpestrio128:
Opcode = X86ISD::PCMPESTRI;
X86CC = X86::COND_O;
break;
case Intrinsic::x86_sse42_pcmpistris128:
Opcode = X86ISD::PCMPISTRI;
X86CC = X86::COND_S;
break;
case Intrinsic::x86_sse42_pcmpestris128:
Opcode = X86ISD::PCMPESTRI;
X86CC = X86::COND_S;
break;
case Intrinsic::x86_sse42_pcmpistriz128:
Opcode = X86ISD::PCMPISTRI;
X86CC = X86::COND_E;
break;
case Intrinsic::x86_sse42_pcmpestriz128:
Opcode = X86ISD::PCMPESTRI;
X86CC = X86::COND_E;
break;
}
SmallVector<SDValue, 5> NewOps(Op->op_begin()+1, Op->op_end());
SDVTList VTs = DAG.getVTList(Op.getValueType(), MVT::i32);
SDValue PCMP = DAG.getNode(Opcode, dl, VTs, NewOps);
SDValue SetCC = getSETCC(X86CC, SDValue(PCMP.getNode(), 1), dl, DAG);
return DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, SetCC);
}
case Intrinsic::x86_sse42_pcmpistri128:
case Intrinsic::x86_sse42_pcmpestri128: {
unsigned Opcode;
if (IntNo == Intrinsic::x86_sse42_pcmpistri128)
Opcode = X86ISD::PCMPISTRI;
else
Opcode = X86ISD::PCMPESTRI;
SmallVector<SDValue, 5> NewOps(Op->op_begin()+1, Op->op_end());
SDVTList VTs = DAG.getVTList(Op.getValueType(), MVT::i32);
return DAG.getNode(Opcode, dl, VTs, NewOps);
}
case Intrinsic::eh_sjlj_lsda: {
MachineFunction &MF = DAG.getMachineFunction();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
MVT PtrVT = TLI.getPointerTy(DAG.getDataLayout());
auto &Context = MF.getMMI().getContext();
MCSymbol *S = Context.getOrCreateSymbol(Twine("GCC_except_table") +
Twine(MF.getFunctionNumber()));
return DAG.getNode(X86ISD::Wrapper, dl, VT, DAG.getMCSymbol(S, PtrVT));
}
case Intrinsic::x86_seh_lsda: {
// Compute the symbol for the LSDA. We know it'll get emitted later.
MachineFunction &MF = DAG.getMachineFunction();
SDValue Op1 = Op.getOperand(1);
auto *Fn = cast<Function>(cast<GlobalAddressSDNode>(Op1)->getGlobal());
MCSymbol *LSDASym = MF.getMMI().getContext().getOrCreateLSDASymbol(
GlobalValue::dropLLVMManglingEscape(Fn->getName()));
// Generate a simple absolute symbol reference. This intrinsic is only
// supported on 32-bit Windows, which isn't PIC.
SDValue Result = DAG.getMCSymbol(LSDASym, VT);
return DAG.getNode(X86ISD::Wrapper, dl, VT, Result);
}
case Intrinsic::x86_seh_recoverfp: {
SDValue FnOp = Op.getOperand(1);
SDValue IncomingFPOp = Op.getOperand(2);
GlobalAddressSDNode *GSD = dyn_cast<GlobalAddressSDNode>(FnOp);
auto *Fn = dyn_cast_or_null<Function>(GSD ? GSD->getGlobal() : nullptr);
if (!Fn)
report_fatal_error(
"llvm.x86.seh.recoverfp must take a function as the first argument");
return recoverFramePointer(DAG, Fn, IncomingFPOp);
}
case Intrinsic::localaddress: {
// Returns one of the stack, base, or frame pointer registers, depending on
// which is used to reference local variables.
MachineFunction &MF = DAG.getMachineFunction();
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
unsigned Reg;
if (RegInfo->hasBasePointer(MF))
Reg = RegInfo->getBaseRegister();
else // This function handles the SP or FP case.
Reg = RegInfo->getPtrSizedFrameRegister(MF);
return DAG.getCopyFromReg(DAG.getEntryNode(), dl, Reg, VT);
}
}
}
static SDValue getAVX2GatherNode(unsigned Opc, SDValue Op, SelectionDAG &DAG,
SDValue Src, SDValue Mask, SDValue Base,
SDValue Index, SDValue ScaleOp, SDValue Chain,
const X86Subtarget &Subtarget) {
SDLoc dl(Op);
auto *C = dyn_cast<ConstantSDNode>(ScaleOp);
// Scale must be constant.
if (!C)
return SDValue();
SDValue Scale = DAG.getTargetConstant(C->getZExtValue(), dl, MVT::i8);
EVT MaskVT = Mask.getValueType();
SDVTList VTs = DAG.getVTList(Op.getValueType(), MaskVT, MVT::Other);
SDValue Disp = DAG.getTargetConstant(0, dl, MVT::i32);
SDValue Segment = DAG.getRegister(0, MVT::i32);
// If source is undef or we know it won't be used, use a zero vector
// to break register dependency.
// TODO: use undef instead and let ExecutionDepsFix deal with it?
if (Src.isUndef() || ISD::isBuildVectorAllOnes(Mask.getNode()))
Src = getZeroVector(Op.getSimpleValueType(), Subtarget, DAG, dl);
SDValue Ops[] = {Src, Base, Scale, Index, Disp, Segment, Mask, Chain};
SDNode *Res = DAG.getMachineNode(Opc, dl, VTs, Ops);
SDValue RetOps[] = { SDValue(Res, 0), SDValue(Res, 2) };
return DAG.getMergeValues(RetOps, dl);
}
static SDValue getGatherNode(unsigned Opc, SDValue Op, SelectionDAG &DAG,
SDValue Src, SDValue Mask, SDValue Base,
SDValue Index, SDValue ScaleOp, SDValue Chain,
const X86Subtarget &Subtarget) {
SDLoc dl(Op);
auto *C = dyn_cast<ConstantSDNode>(ScaleOp);
// Scale must be constant.
if (!C)
return SDValue();
SDValue Scale = DAG.getTargetConstant(C->getZExtValue(), dl, MVT::i8);
MVT MaskVT = MVT::getVectorVT(MVT::i1,
Index.getSimpleValueType().getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
SDVTList VTs = DAG.getVTList(Op.getValueType(), MaskVT, MVT::Other);
SDValue Disp = DAG.getTargetConstant(0, dl, MVT::i32);
SDValue Segment = DAG.getRegister(0, MVT::i32);
// If source is undef or we know it won't be used, use a zero vector
// to break register dependency.
// TODO: use undef instead and let ExecutionDepsFix deal with it?
if (Src.isUndef() || ISD::isBuildVectorAllOnes(VMask.getNode()))
Src = getZeroVector(Op.getSimpleValueType(), Subtarget, DAG, dl);
SDValue Ops[] = {Src, VMask, Base, Scale, Index, Disp, Segment, Chain};
SDNode *Res = DAG.getMachineNode(Opc, dl, VTs, Ops);
SDValue RetOps[] = { SDValue(Res, 0), SDValue(Res, 2) };
return DAG.getMergeValues(RetOps, dl);
}
static SDValue getScatterNode(unsigned Opc, SDValue Op, SelectionDAG &DAG,
SDValue Src, SDValue Mask, SDValue Base,
SDValue Index, SDValue ScaleOp, SDValue Chain,
const X86Subtarget &Subtarget) {
SDLoc dl(Op);
auto *C = dyn_cast<ConstantSDNode>(ScaleOp);
// Scale must be constant.
if (!C)
return SDValue();
SDValue Scale = DAG.getTargetConstant(C->getZExtValue(), dl, MVT::i8);
SDValue Disp = DAG.getTargetConstant(0, dl, MVT::i32);
SDValue Segment = DAG.getRegister(0, MVT::i32);
MVT MaskVT = MVT::getVectorVT(MVT::i1,
Index.getSimpleValueType().getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
SDVTList VTs = DAG.getVTList(MaskVT, MVT::Other);
SDValue Ops[] = {Base, Scale, Index, Disp, Segment, VMask, Src, Chain};
SDNode *Res = DAG.getMachineNode(Opc, dl, VTs, Ops);
return SDValue(Res, 1);
}
static SDValue getPrefetchNode(unsigned Opc, SDValue Op, SelectionDAG &DAG,
SDValue Mask, SDValue Base, SDValue Index,
SDValue ScaleOp, SDValue Chain,
const X86Subtarget &Subtarget) {
SDLoc dl(Op);
auto *C = dyn_cast<ConstantSDNode>(ScaleOp);
// Scale must be constant.
if (!C)
return SDValue();
SDValue Scale = DAG.getTargetConstant(C->getZExtValue(), dl, MVT::i8);
SDValue Disp = DAG.getTargetConstant(0, dl, MVT::i32);
SDValue Segment = DAG.getRegister(0, MVT::i32);
MVT MaskVT =
MVT::getVectorVT(MVT::i1, Index.getSimpleValueType().getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
SDValue Ops[] = {VMask, Base, Scale, Index, Disp, Segment, Chain};
SDNode *Res = DAG.getMachineNode(Opc, dl, MVT::Other, Ops);
return SDValue(Res, 0);
}
/// Handles the lowering of builtin intrinsics that return the value
/// of an extended control register.
static void getExtendedControlRegister(SDNode *N, const SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget,
SmallVectorImpl<SDValue> &Results) {
assert(N->getNumOperands() == 3 && "Unexpected number of operands!");
SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue LO, HI;
// The ECX register is used to select the index of the XCR register to
// return.
SDValue Chain =
DAG.getCopyToReg(N->getOperand(0), DL, X86::ECX, N->getOperand(2));
SDNode *N1 = DAG.getMachineNode(X86::XGETBV, DL, Tys, Chain);
Chain = SDValue(N1, 0);
// Reads the content of XCR and returns it in registers EDX:EAX.
if (Subtarget.is64Bit()) {
LO = DAG.getCopyFromReg(Chain, DL, X86::RAX, MVT::i64, SDValue(N1, 1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::RDX, MVT::i64,
LO.getValue(2));
} else {
LO = DAG.getCopyFromReg(Chain, DL, X86::EAX, MVT::i32, SDValue(N1, 1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::EDX, MVT::i32,
LO.getValue(2));
}
Chain = HI.getValue(1);
if (Subtarget.is64Bit()) {
// Merge the two 32-bit values into a 64-bit one.
SDValue Tmp = DAG.getNode(ISD::SHL, DL, MVT::i64, HI,
DAG.getConstant(32, DL, MVT::i8));
Results.push_back(DAG.getNode(ISD::OR, DL, MVT::i64, LO, Tmp));
Results.push_back(Chain);
return;
}
// Use a buildpair to merge the two 32-bit values into a 64-bit one.
SDValue Ops[] = { LO, HI };
SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Ops);
Results.push_back(Pair);
Results.push_back(Chain);
}
/// Handles the lowering of builtin intrinsics that read performance monitor
/// counters (x86_rdpmc).
static void getReadPerformanceCounter(SDNode *N, const SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget,
SmallVectorImpl<SDValue> &Results) {
assert(N->getNumOperands() == 3 && "Unexpected number of operands!");
SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue LO, HI;
// The ECX register is used to select the index of the performance counter
// to read.
SDValue Chain = DAG.getCopyToReg(N->getOperand(0), DL, X86::ECX,
N->getOperand(2));
SDValue rd = DAG.getNode(X86ISD::RDPMC_DAG, DL, Tys, Chain);
// Reads the content of a 64-bit performance counter and returns it in the
// registers EDX:EAX.
if (Subtarget.is64Bit()) {
LO = DAG.getCopyFromReg(rd, DL, X86::RAX, MVT::i64, rd.getValue(1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::RDX, MVT::i64,
LO.getValue(2));
} else {
LO = DAG.getCopyFromReg(rd, DL, X86::EAX, MVT::i32, rd.getValue(1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::EDX, MVT::i32,
LO.getValue(2));
}
Chain = HI.getValue(1);
if (Subtarget.is64Bit()) {
// The EAX register is loaded with the low-order 32 bits. The EDX register
// is loaded with the supported high-order bits of the counter.
SDValue Tmp = DAG.getNode(ISD::SHL, DL, MVT::i64, HI,
DAG.getConstant(32, DL, MVT::i8));
Results.push_back(DAG.getNode(ISD::OR, DL, MVT::i64, LO, Tmp));
Results.push_back(Chain);
return;
}
// Use a buildpair to merge the two 32-bit values into a 64-bit one.
SDValue Ops[] = { LO, HI };
SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Ops);
Results.push_back(Pair);
Results.push_back(Chain);
}
/// Handles the lowering of builtin intrinsics that read the time stamp counter
/// (x86_rdtsc and x86_rdtscp). This function is also used to custom lower
/// READCYCLECOUNTER nodes.
static void getReadTimeStampCounter(SDNode *N, const SDLoc &DL, unsigned Opcode,
SelectionDAG &DAG,
const X86Subtarget &Subtarget,
SmallVectorImpl<SDValue> &Results) {
SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue rd = DAG.getNode(Opcode, DL, Tys, N->getOperand(0));
SDValue LO, HI;
// The processor's time-stamp counter (a 64-bit MSR) is stored into the
// EDX:EAX registers. EDX is loaded with the high-order 32 bits of the MSR
// and the EAX register is loaded with the low-order 32 bits.
if (Subtarget.is64Bit()) {
LO = DAG.getCopyFromReg(rd, DL, X86::RAX, MVT::i64, rd.getValue(1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::RDX, MVT::i64,
LO.getValue(2));
} else {
LO = DAG.getCopyFromReg(rd, DL, X86::EAX, MVT::i32, rd.getValue(1));
HI = DAG.getCopyFromReg(LO.getValue(1), DL, X86::EDX, MVT::i32,
LO.getValue(2));
}
SDValue Chain = HI.getValue(1);
if (Opcode == X86ISD::RDTSCP_DAG) {
assert(N->getNumOperands() == 3 && "Unexpected number of operands!");
// Instruction RDTSCP loads the IA32_TSC_AUX MSR (address C000_0103H) into
// the ECX register. Add 'ecx' explicitly to the chain.
SDValue ecx = DAG.getCopyFromReg(Chain, DL, X86::ECX, MVT::i32,
HI.getValue(2));
// Explicitly store the content of ECX at the location passed as input
// to the 'rdtscp' intrinsic.
Chain = DAG.getStore(ecx.getValue(1), DL, ecx, N->getOperand(2),
MachinePointerInfo());
}
if (Subtarget.is64Bit()) {
// The EDX register is loaded with the high-order 32 bits of the MSR, and
// the EAX register is loaded with the low-order 32 bits.
SDValue Tmp = DAG.getNode(ISD::SHL, DL, MVT::i64, HI,
DAG.getConstant(32, DL, MVT::i8));
Results.push_back(DAG.getNode(ISD::OR, DL, MVT::i64, LO, Tmp));
Results.push_back(Chain);
return;
}
// Use a buildpair to merge the two 32-bit values into a 64-bit one.
SDValue Ops[] = { LO, HI };
SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Ops);
Results.push_back(Pair);
Results.push_back(Chain);
}
static SDValue LowerREADCYCLECOUNTER(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SmallVector<SDValue, 2> Results;
SDLoc DL(Op);
getReadTimeStampCounter(Op.getNode(), DL, X86ISD::RDTSC_DAG, DAG, Subtarget,
Results);
return DAG.getMergeValues(Results, DL);
}
static SDValue MarkEHRegistrationNode(SDValue Op, SelectionDAG &DAG) {
MachineFunction &MF = DAG.getMachineFunction();
SDValue Chain = Op.getOperand(0);
SDValue RegNode = Op.getOperand(2);
WinEHFuncInfo *EHInfo = MF.getWinEHFuncInfo();
if (!EHInfo)
report_fatal_error("EH registrations only live in functions using WinEH");
// Cast the operand to an alloca, and remember the frame index.
auto *FINode = dyn_cast<FrameIndexSDNode>(RegNode);
if (!FINode)
report_fatal_error("llvm.x86.seh.ehregnode expects a static alloca");
EHInfo->EHRegNodeFrameIndex = FINode->getIndex();
// Return the chain operand without making any DAG nodes.
return Chain;
}
static SDValue MarkEHGuard(SDValue Op, SelectionDAG &DAG) {
MachineFunction &MF = DAG.getMachineFunction();
SDValue Chain = Op.getOperand(0);
SDValue EHGuard = Op.getOperand(2);
WinEHFuncInfo *EHInfo = MF.getWinEHFuncInfo();
if (!EHInfo)
report_fatal_error("EHGuard only live in functions using WinEH");
// Cast the operand to an alloca, and remember the frame index.
auto *FINode = dyn_cast<FrameIndexSDNode>(EHGuard);
if (!FINode)
report_fatal_error("llvm.x86.seh.ehguard expects a static alloca");
EHInfo->EHGuardFrameIndex = FINode->getIndex();
// Return the chain operand without making any DAG nodes.
return Chain;
}
/// Emit Truncating Store with signed or unsigned saturation.
static SDValue
EmitTruncSStore(bool SignedSat, SDValue Chain, const SDLoc &Dl, SDValue Val,
SDValue Ptr, EVT MemVT, MachineMemOperand *MMO,
SelectionDAG &DAG) {
SDVTList VTs = DAG.getVTList(MVT::Other);
SDValue Undef = DAG.getUNDEF(Ptr.getValueType());
SDValue Ops[] = { Chain, Val, Ptr, Undef };
return SignedSat ?
DAG.getTargetMemSDNode<TruncSStoreSDNode>(VTs, Ops, Dl, MemVT, MMO) :
DAG.getTargetMemSDNode<TruncUSStoreSDNode>(VTs, Ops, Dl, MemVT, MMO);
}
/// Emit Masked Truncating Store with signed or unsigned saturation.
static SDValue
EmitMaskedTruncSStore(bool SignedSat, SDValue Chain, const SDLoc &Dl,
SDValue Val, SDValue Ptr, SDValue Mask, EVT MemVT,
MachineMemOperand *MMO, SelectionDAG &DAG) {
SDVTList VTs = DAG.getVTList(MVT::Other);
SDValue Ops[] = { Chain, Ptr, Mask, Val };
return SignedSat ?
DAG.getTargetMemSDNode<MaskedTruncSStoreSDNode>(VTs, Ops, Dl, MemVT, MMO) :
DAG.getTargetMemSDNode<MaskedTruncUSStoreSDNode>(VTs, Ops, Dl, MemVT, MMO);
}
static SDValue LowerINTRINSIC_W_CHAIN(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
const IntrinsicData *IntrData = getIntrinsicWithChain(IntNo);
if (!IntrData) {
switch (IntNo) {
case llvm::Intrinsic::x86_seh_ehregnode:
return MarkEHRegistrationNode(Op, DAG);
case llvm::Intrinsic::x86_seh_ehguard:
return MarkEHGuard(Op, DAG);
case llvm::Intrinsic::x86_flags_read_u32:
case llvm::Intrinsic::x86_flags_read_u64:
case llvm::Intrinsic::x86_flags_write_u32:
case llvm::Intrinsic::x86_flags_write_u64: {
// We need a frame pointer because this will get lowered to a PUSH/POP
// sequence.
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
MFI.setHasCopyImplyingStackAdjustment(true);
// Don't do anything here, we will expand these intrinsics out later
// during ExpandISelPseudos in EmitInstrWithCustomInserter.
return SDValue();
}
case Intrinsic::x86_lwpins32:
case Intrinsic::x86_lwpins64: {
SDLoc dl(Op);
SDValue Chain = Op->getOperand(0);
SDVTList VTs = DAG.getVTList(MVT::i32, MVT::Other);
SDValue LwpIns =
DAG.getNode(X86ISD::LWPINS, dl, VTs, Chain, Op->getOperand(2),
Op->getOperand(3), Op->getOperand(4));
SDValue SetCC = getSETCC(X86::COND_B, LwpIns.getValue(0), dl, DAG);
SDValue Result = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i8, SetCC);
return DAG.getNode(ISD::MERGE_VALUES, dl, Op->getVTList(), Result,
LwpIns.getValue(1));
}
}
return SDValue();
}
SDLoc dl(Op);
switch(IntrData->Type) {
default: llvm_unreachable("Unknown Intrinsic Type");
case RDSEED:
case RDRAND: {
// Emit the node with the right value type.
SDVTList VTs = DAG.getVTList(Op->getValueType(0), MVT::Glue, MVT::Other);
SDValue Result = DAG.getNode(IntrData->Opc0, dl, VTs, Op.getOperand(0));
// If the value returned by RDRAND/RDSEED was valid (CF=1), return 1.
// Otherwise return the value from Rand, which is always 0, cast to i32.
SDValue Ops[] = { DAG.getZExtOrTrunc(Result, dl, Op->getValueType(1)),
DAG.getConstant(1, dl, Op->getValueType(1)),
DAG.getConstant(X86::COND_B, dl, MVT::i32),
SDValue(Result.getNode(), 1) };
SDValue isValid = DAG.getNode(X86ISD::CMOV, dl,
DAG.getVTList(Op->getValueType(1), MVT::Glue),
Ops);
// Return { result, isValid, chain }.
return DAG.getNode(ISD::MERGE_VALUES, dl, Op->getVTList(), Result, isValid,
SDValue(Result.getNode(), 2));
}
case GATHER_AVX2: {
SDValue Chain = Op.getOperand(0);
SDValue Src = Op.getOperand(2);
SDValue Base = Op.getOperand(3);
SDValue Index = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
SDValue Scale = Op.getOperand(6);
return getAVX2GatherNode(IntrData->Opc0, Op, DAG, Src, Mask, Base, Index,
Scale, Chain, Subtarget);
}
case GATHER: {
//gather(v1, mask, index, base, scale);
SDValue Chain = Op.getOperand(0);
SDValue Src = Op.getOperand(2);
SDValue Base = Op.getOperand(3);
SDValue Index = Op.getOperand(4);
SDValue Mask = Op.getOperand(5);
SDValue Scale = Op.getOperand(6);
return getGatherNode(IntrData->Opc0, Op, DAG, Src, Mask, Base, Index, Scale,
Chain, Subtarget);
}
case SCATTER: {
//scatter(base, mask, index, v1, scale);
SDValue Chain = Op.getOperand(0);
SDValue Base = Op.getOperand(2);
SDValue Mask = Op.getOperand(3);
SDValue Index = Op.getOperand(4);
SDValue Src = Op.getOperand(5);
SDValue Scale = Op.getOperand(6);
return getScatterNode(IntrData->Opc0, Op, DAG, Src, Mask, Base, Index,
Scale, Chain, Subtarget);
}
case PREFETCH: {
SDValue Hint = Op.getOperand(6);
unsigned HintVal = cast<ConstantSDNode>(Hint)->getZExtValue();
assert((HintVal == 2 || HintVal == 3) &&
"Wrong prefetch hint in intrinsic: should be 2 or 3");
unsigned Opcode = (HintVal == 2 ? IntrData->Opc1 : IntrData->Opc0);
SDValue Chain = Op.getOperand(0);
SDValue Mask = Op.getOperand(2);
SDValue Index = Op.getOperand(3);
SDValue Base = Op.getOperand(4);
SDValue Scale = Op.getOperand(5);
return getPrefetchNode(Opcode, Op, DAG, Mask, Base, Index, Scale, Chain,
Subtarget);
}
// Read Time Stamp Counter (RDTSC) and Processor ID (RDTSCP).
case RDTSC: {
SmallVector<SDValue, 2> Results;
getReadTimeStampCounter(Op.getNode(), dl, IntrData->Opc0, DAG, Subtarget,
Results);
return DAG.getMergeValues(Results, dl);
}
// Read Performance Monitoring Counters.
case RDPMC: {
SmallVector<SDValue, 2> Results;
getReadPerformanceCounter(Op.getNode(), dl, DAG, Subtarget, Results);
return DAG.getMergeValues(Results, dl);
}
// Get Extended Control Register.
case XGETBV: {
SmallVector<SDValue, 2> Results;
getExtendedControlRegister(Op.getNode(), dl, DAG, Subtarget, Results);
return DAG.getMergeValues(Results, dl);
}
// XTEST intrinsics.
case XTEST: {
SDVTList VTs = DAG.getVTList(Op->getValueType(0), MVT::Other);
SDValue InTrans = DAG.getNode(IntrData->Opc0, dl, VTs, Op.getOperand(0));
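// XTEST sets ZF when the processor is not in transactional execution, so
// testing COND_NE below yields 1 exactly when inside an RTM/HLE region.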
SDValue SetCC = getSETCC(X86::COND_NE, InTrans, dl, DAG);
SDValue Ret = DAG.getNode(ISD::ZERO_EXTEND, dl, Op->getValueType(0), SetCC);
return DAG.getNode(ISD::MERGE_VALUES, dl, Op->getVTList(),
Ret, SDValue(InTrans.getNode(), 1));
}
// ADC/ADCX/SBB
case ADX: {
SDVTList CFVTs = DAG.getVTList(Op->getValueType(0), MVT::i32);
SDVTList VTs = DAG.getVTList(Op.getOperand(3)->getValueType(0), MVT::i32);
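// Adding -1 (0xFF) to the incoming i8 carry value produces a carry-out
// exactly when that value is nonzero, re-materializing CF for the
// flag-consuming add below.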
SDValue GenCF = DAG.getNode(X86ISD::ADD, dl, CFVTs, Op.getOperand(2),
DAG.getConstant(-1, dl, MVT::i8));
SDValue Res = DAG.getNode(IntrData->Opc0, dl, VTs, Op.getOperand(3),
Op.getOperand(4), GenCF.getValue(1));
SDValue Store = DAG.getStore(Op.getOperand(0), dl, Res.getValue(0),
Op.getOperand(5), MachinePointerInfo());
SDValue SetCC = getSETCC(X86::COND_B, Res.getValue(1), dl, DAG);
SDValue Results[] = { SetCC, Store };
return DAG.getMergeValues(Results, dl);
}
case COMPRESS_TO_MEM: {
SDValue Mask = Op.getOperand(4);
SDValue DataToCompress = Op.getOperand(3);
SDValue Addr = Op.getOperand(2);
SDValue Chain = Op.getOperand(0);
MVT VT = DataToCompress.getSimpleValueType();
MemIntrinsicSDNode *MemIntr = dyn_cast<MemIntrinsicSDNode>(Op);
assert(MemIntr && "Expected MemIntrinsicSDNode!");
if (isAllOnesConstant(Mask)) // return just a store
return DAG.getStore(Chain, dl, DataToCompress, Addr,
MemIntr->getMemOperand());
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
return DAG.getMaskedStore(Chain, dl, DataToCompress, Addr, VMask, VT,
MemIntr->getMemOperand(),
false /* truncating */, true /* compressing */);
}
case TRUNCATE_TO_MEM_VI8:
case TRUNCATE_TO_MEM_VI16:
case TRUNCATE_TO_MEM_VI32: {
SDValue Mask = Op.getOperand(4);
SDValue DataToTruncate = Op.getOperand(3);
SDValue Addr = Op.getOperand(2);
SDValue Chain = Op.getOperand(0);
MemIntrinsicSDNode *MemIntr = dyn_cast<MemIntrinsicSDNode>(Op);
assert(MemIntr && "Expected MemIntrinsicSDNode!");
EVT MemVT = MemIntr->getMemoryVT();
uint16_t TruncationOp = IntrData->Opc0;
switch (TruncationOp) {
case X86ISD::VTRUNC: {
if (isAllOnesConstant(Mask)) // return just a truncate store
return DAG.getTruncStore(Chain, dl, DataToTruncate, Addr, MemVT,
MemIntr->getMemOperand());
MVT MaskVT = MVT::getVectorVT(MVT::i1, MemVT.getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
return DAG.getMaskedStore(Chain, dl, DataToTruncate, Addr, VMask, MemVT,
MemIntr->getMemOperand(), true /* truncating */);
}
case X86ISD::VTRUNCUS:
case X86ISD::VTRUNCS: {
bool IsSigned = (TruncationOp == X86ISD::VTRUNCS);
if (isAllOnesConstant(Mask))
return EmitTruncSStore(IsSigned, Chain, dl, DataToTruncate, Addr, MemVT,
MemIntr->getMemOperand(), DAG);
MVT MaskVT = MVT::getVectorVT(MVT::i1, MemVT.getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
return EmitMaskedTruncSStore(IsSigned, Chain, dl, DataToTruncate, Addr,
VMask, MemVT, MemIntr->getMemOperand(), DAG);
}
default:
llvm_unreachable("Unsupported truncstore intrinsic");
}
}
case EXPAND_FROM_MEM: {
SDValue Mask = Op.getOperand(4);
SDValue PassThru = Op.getOperand(3);
SDValue Addr = Op.getOperand(2);
SDValue Chain = Op.getOperand(0);
MVT VT = Op.getSimpleValueType();
MemIntrinsicSDNode *MemIntr = dyn_cast<MemIntrinsicSDNode>(Op);
assert(MemIntr && "Expected MemIntrinsicSDNode!");
if (isAllOnesConstant(Mask)) // Return a regular (unmasked) vector load.
return DAG.getLoad(VT, dl, Chain, Addr, MemIntr->getMemOperand());
if (X86::isZeroNode(Mask))
return DAG.getUNDEF(VT);
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
SDValue VMask = getMaskNode(Mask, MaskVT, Subtarget, DAG, dl);
return DAG.getMaskedLoad(VT, dl, Chain, Addr, VMask, PassThru, VT,
MemIntr->getMemOperand(), ISD::NON_EXTLOAD,
true /* expanding */);
}
}
}
SDValue X86TargetLowering::LowerRETURNADDR(SDValue Op,
SelectionDAG &DAG) const {
MachineFrameInfo &MFI = DAG.getMachineFunction().getFrameInfo();
MFI.setReturnAddressIsTaken(true);
if (verifyReturnAddressArgumentIsConstant(Op, DAG))
return SDValue();
unsigned Depth = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
SDLoc dl(Op);
EVT PtrVT = getPointerTy(DAG.getDataLayout());
if (Depth > 0) {
SDValue FrameAddr = LowerFRAMEADDR(Op, DAG);
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
SDValue Offset = DAG.getConstant(RegInfo->getSlotSize(), dl, PtrVT);
return DAG.getLoad(PtrVT, dl, DAG.getEntryNode(),
DAG.getNode(ISD::ADD, dl, PtrVT, FrameAddr, Offset),
MachinePointerInfo());
}
// Just load the return address.
SDValue RetAddrFI = getReturnAddressFrameIndex(DAG);
return DAG.getLoad(PtrVT, dl, DAG.getEntryNode(), RetAddrFI,
MachinePointerInfo());
}
SDValue X86TargetLowering::LowerADDROFRETURNADDR(SDValue Op,
SelectionDAG &DAG) const {
DAG.getMachineFunction().getFrameInfo().setReturnAddressIsTaken(true);
return getReturnAddressFrameIndex(DAG);
}
SDValue X86TargetLowering::LowerFRAMEADDR(SDValue Op, SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();
MachineFrameInfo &MFI = MF.getFrameInfo();
X86MachineFunctionInfo *FuncInfo = MF.getInfo<X86MachineFunctionInfo>();
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
EVT VT = Op.getValueType();
MFI.setFrameAddressIsTaken(true);
if (MF.getTarget().getMCAsmInfo()->usesWindowsCFI()) {
// Depth > 0 makes no sense on targets which use Windows unwind codes. It
// is not possible to crawl up the stack without looking at the unwind codes
// simultaneously.
int FrameAddrIndex = FuncInfo->getFAIndex();
if (!FrameAddrIndex) {
// Set up a frame object for the return address.
unsigned SlotSize = RegInfo->getSlotSize();
FrameAddrIndex = MF.getFrameInfo().CreateFixedObject(
SlotSize, /*Offset=*/0, /*IsImmutable=*/false);
FuncInfo->setFAIndex(FrameAddrIndex);
}
return DAG.getFrameIndex(FrameAddrIndex, VT);
}
unsigned FrameReg =
RegInfo->getPtrSizedFrameRegister(DAG.getMachineFunction());
SDLoc dl(Op); // FIXME probably not meaningful
unsigned Depth = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
assert(((FrameReg == X86::RBP && VT == MVT::i64) ||
(FrameReg == X86::EBP && VT == MVT::i32)) &&
"Invalid Frame Register!");
SDValue FrameAddr = DAG.getCopyFromReg(DAG.getEntryNode(), dl, FrameReg, VT);
while (Depth--)
FrameAddr = DAG.getLoad(VT, dl, DAG.getEntryNode(), FrameAddr,
MachinePointerInfo());
return FrameAddr;
}
// FIXME? Maybe this could be a TableGen attribute on some registers and
// this table could be generated automatically from RegInfo.
unsigned X86TargetLowering::getRegisterByName(const char* RegName, EVT VT,
SelectionDAG &DAG) const {
const TargetFrameLowering &TFI = *Subtarget.getFrameLowering();
const MachineFunction &MF = DAG.getMachineFunction();
unsigned Reg = StringSwitch<unsigned>(RegName)
.Case("esp", X86::ESP)
.Case("rsp", X86::RSP)
.Case("ebp", X86::EBP)
.Case("rbp", X86::RBP)
.Default(0);
if (Reg == X86::EBP || Reg == X86::RBP) {
if (!TFI.hasFP(MF))
report_fatal_error("register " + StringRef(RegName) +
" is allocatable: function has no frame pointer");
#ifndef NDEBUG
else {
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
unsigned FrameReg =
RegInfo->getPtrSizedFrameRegister(DAG.getMachineFunction());
assert((FrameReg == X86::EBP || FrameReg == X86::RBP) &&
"Invalid Frame Register!");
}
#endif
}
if (Reg)
return Reg;
report_fatal_error("Invalid register name global variable");
}
SDValue X86TargetLowering::LowerFRAME_TO_ARGS_OFFSET(SDValue Op,
SelectionDAG &DAG) const {
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
return DAG.getIntPtrConstant(2 * RegInfo->getSlotSize(), SDLoc(Op));
}
unsigned X86TargetLowering::getExceptionPointerRegister(
const Constant *PersonalityFn) const {
if (classifyEHPersonality(PersonalityFn) == EHPersonality::CoreCLR)
return Subtarget.isTarget64BitLP64() ? X86::RDX : X86::EDX;
return Subtarget.isTarget64BitLP64() ? X86::RAX : X86::EAX;
}
unsigned X86TargetLowering::getExceptionSelectorRegister(
const Constant *PersonalityFn) const {
// Funclet personalities don't use selectors (the runtime does the selection).
assert(!isFuncletEHPersonality(classifyEHPersonality(PersonalityFn)));
return Subtarget.isTarget64BitLP64() ? X86::RDX : X86::EDX;
}
bool X86TargetLowering::needsFixedCatchObjects() const {
return Subtarget.isTargetWin64();
}
SDValue X86TargetLowering::LowerEH_RETURN(SDValue Op, SelectionDAG &DAG) const {
SDValue Chain = Op.getOperand(0);
SDValue Offset = Op.getOperand(1);
SDValue Handler = Op.getOperand(2);
SDLoc dl (Op);
EVT PtrVT = getPointerTy(DAG.getDataLayout());
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
unsigned FrameReg = RegInfo->getFrameRegister(DAG.getMachineFunction());
assert(((FrameReg == X86::RBP && PtrVT == MVT::i64) ||
(FrameReg == X86::EBP && PtrVT == MVT::i32)) &&
"Invalid Frame Register!");
SDValue Frame = DAG.getCopyFromReg(DAG.getEntryNode(), dl, FrameReg, PtrVT);
unsigned StoreAddrReg = (PtrVT == MVT::i64) ? X86::RCX : X86::ECX;
SDValue StoreAddr = DAG.getNode(ISD::ADD, dl, PtrVT, Frame,
DAG.getIntPtrConstant(RegInfo->getSlotSize(),
dl));
StoreAddr = DAG.getNode(ISD::ADD, dl, PtrVT, StoreAddr, Offset);
Chain = DAG.getStore(Chain, dl, Handler, StoreAddr, MachinePointerInfo());
Chain = DAG.getCopyToReg(Chain, dl, StoreAddrReg, StoreAddr);
return DAG.getNode(X86ISD::EH_RETURN, dl, MVT::Other, Chain,
DAG.getRegister(StoreAddrReg, PtrVT));
}
SDValue X86TargetLowering::lowerEH_SJLJ_SETJMP(SDValue Op,
SelectionDAG &DAG) const {
SDLoc DL(Op);
// If the subtarget is not 64-bit, we may need the global base register
// after the pseudo-instruction expansion in isel, i.e., after the CGBR
// pass has run. Therefore, ask for the GlobalBaseReg now so that the pass
// inserts the code for us in case we need it. Otherwise we would end up
// referencing a virtual register that is not defined!
if (!Subtarget.is64Bit()) {
const X86InstrInfo *TII = Subtarget.getInstrInfo();
(void)TII->getGlobalBaseReg(&DAG.getMachineFunction());
}
return DAG.getNode(X86ISD::EH_SJLJ_SETJMP, DL,
DAG.getVTList(MVT::i32, MVT::Other),
Op.getOperand(0), Op.getOperand(1));
}
SDValue X86TargetLowering::lowerEH_SJLJ_LONGJMP(SDValue Op,
SelectionDAG &DAG) const {
SDLoc DL(Op);
return DAG.getNode(X86ISD::EH_SJLJ_LONGJMP, DL, MVT::Other,
Op.getOperand(0), Op.getOperand(1));
}
SDValue X86TargetLowering::lowerEH_SJLJ_SETUP_DISPATCH(SDValue Op,
SelectionDAG &DAG) const {
SDLoc DL(Op);
return DAG.getNode(X86ISD::EH_SJLJ_SETUP_DISPATCH, DL, MVT::Other,
Op.getOperand(0));
}
static SDValue LowerADJUST_TRAMPOLINE(SDValue Op, SelectionDAG &DAG) {
return Op.getOperand(0);
}
SDValue X86TargetLowering::LowerINIT_TRAMPOLINE(SDValue Op,
SelectionDAG &DAG) const {
SDValue Root = Op.getOperand(0);
SDValue Trmp = Op.getOperand(1); // trampoline
SDValue FPtr = Op.getOperand(2); // nested function
SDValue Nest = Op.getOperand(3); // 'nest' parameter value
SDLoc dl (Op);
const Value *TrmpAddr = cast<SrcValueSDNode>(Op.getOperand(4))->getValue();
const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
if (Subtarget.is64Bit()) {
SDValue OutChains[6];
// Large code-model.
const unsigned char JMP64r = 0xFF; // 64-bit jmp through register opcode.
const unsigned char MOV64ri = 0xB8; // X86::MOV64ri opcode.
const unsigned char N86R10 = TRI->getEncodingValue(X86::R10) & 0x7;
const unsigned char N86R11 = TRI->getEncodingValue(X86::R11) & 0x7;
const unsigned char REX_WB = 0x40 | 0x08 | 0x01; // REX prefix
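// The values stored below should decode to this 23-byte sequence
// (little-endian):
//   0:  49 BB <FPtr:8>  movabsq $FPtr, %r11
//   10: 49 BA <Nest:8>  movabsq $Nest, %r10
//   20: 49 FF E3        jmpq *%r11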
// Load the pointer to the nested function into R11.
unsigned OpCode = ((MOV64ri | N86R11) << 8) | REX_WB; // movabsq r11
SDValue Addr = Trmp;
OutChains[0] = DAG.getStore(Root, dl, DAG.getConstant(OpCode, dl, MVT::i16),
Addr, MachinePointerInfo(TrmpAddr));
Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
DAG.getConstant(2, dl, MVT::i64));
OutChains[1] =
DAG.getStore(Root, dl, FPtr, Addr, MachinePointerInfo(TrmpAddr, 2),
/* Alignment = */ 2);
// Load the 'nest' parameter value into R10.
// R10 is specified in X86CallingConv.td
OpCode = ((MOV64ri | N86R10) << 8) | REX_WB; // movabsq r10
Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
DAG.getConstant(10, dl, MVT::i64));
OutChains[2] = DAG.getStore(Root, dl, DAG.getConstant(OpCode, dl, MVT::i16),
Addr, MachinePointerInfo(TrmpAddr, 10));
Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
DAG.getConstant(12, dl, MVT::i64));
OutChains[3] =
DAG.getStore(Root, dl, Nest, Addr, MachinePointerInfo(TrmpAddr, 12),
/* Alignment = */ 2);
// Jump to the nested function.
OpCode = (JMP64r << 8) | REX_WB; // jmpq *...
Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
DAG.getConstant(20, dl, MVT::i64));
OutChains[4] = DAG.getStore(Root, dl, DAG.getConstant(OpCode, dl, MVT::i16),
Addr, MachinePointerInfo(TrmpAddr, 20));
unsigned char ModRM = N86R11 | (4 << 3) | (3 << 6); // ...r11
Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
DAG.getConstant(22, dl, MVT::i64));
OutChains[5] = DAG.getStore(Root, dl, DAG.getConstant(ModRM, dl, MVT::i8),
Addr, MachinePointerInfo(TrmpAddr, 22));
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
} else {
const Function *Func =
cast<Function>(cast<SrcValueSDNode>(Op.getOperand(5))->getValue());
CallingConv::ID CC = Func->getCallingConv();
unsigned NestReg;
switch (CC) {
default:
llvm_unreachable("Unsupported calling convention");
case CallingConv::C:
case CallingConv::X86_StdCall: {
// Pass 'nest' parameter in ECX.
// Must be kept in sync with X86CallingConv.td
NestReg = X86::ECX;
// Check that ECX wasn't needed by an 'inreg' parameter.
FunctionType *FTy = Func->getFunctionType();
const AttributeList &Attrs = Func->getAttributes();
if (!Attrs.isEmpty() && !Func->isVarArg()) {
unsigned InRegCount = 0;
unsigned Idx = 1;
for (FunctionType::param_iterator I = FTy->param_begin(),
E = FTy->param_end(); I != E; ++I, ++Idx)
if (Attrs.hasAttribute(Idx, Attribute::InReg)) {
auto &DL = DAG.getDataLayout();
// FIXME: should only count parameters that are lowered to integers.
InRegCount += (DL.getTypeSizeInBits(*I) + 31) / 32;
}
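// With inreg parameters the first two 32-bit pieces presumably land in EAX
// and EDX; a third piece would claim ECX, which must remain free for 'nest'.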
if (InRegCount > 2) {
report_fatal_error("Nest register in use - reduce number of inreg"
" parameters!");
}
}
break;
}
case CallingConv::X86_FastCall:
case CallingConv::X86_ThisCall:
case CallingConv::Fast:
// Pass 'nest' parameter in EAX.
// Must be kept in sync with X86CallingConv.td
NestReg = X86::EAX;
break;
}
SDValue OutChains[4];
SDValue Addr, Disp;
Addr = DAG.getNode(ISD::ADD, dl, MVT::i32, Trmp,
DAG.getConstant(10, dl, MVT::i32));
Disp = DAG.getNode(ISD::SUB, dl, MVT::i32, FPtr, Addr);
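// The values stored below should form this 10-byte trampoline:
//   0: B8+r <Nest:4>  movl $Nest, %NestReg
//   5: E9 <Disp:4>    jmp FPtr   (Disp = FPtr - (Trmp + 10))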
// This is storing the opcode for MOV32ri.
const unsigned char MOV32ri = 0xB8; // X86::MOV32ri's opcode byte.
const unsigned char N86Reg = TRI->getEncodingValue(NestReg) & 0x7;
OutChains[0] =
DAG.getStore(Root, dl, DAG.getConstant(MOV32ri | N86Reg, dl, MVT::i8),
Trmp, MachinePointerInfo(TrmpAddr));
Addr = DAG.getNode(ISD::ADD, dl, MVT::i32, Trmp,
DAG.getConstant(1, dl, MVT::i32));
OutChains[1] =
DAG.getStore(Root, dl, Nest, Addr, MachinePointerInfo(TrmpAddr, 1),
/* Alignment = */ 1);
const unsigned char JMP = 0xE9; // jmp <32bit dst> opcode.
Addr = DAG.getNode(ISD::ADD, dl, MVT::i32, Trmp,
DAG.getConstant(5, dl, MVT::i32));
OutChains[2] = DAG.getStore(Root, dl, DAG.getConstant(JMP, dl, MVT::i8),
Addr, MachinePointerInfo(TrmpAddr, 5),
/* Alignment = */ 1);
Addr = DAG.getNode(ISD::ADD, dl, MVT::i32, Trmp,
DAG.getConstant(6, dl, MVT::i32));
OutChains[3] =
DAG.getStore(Root, dl, Disp, Addr, MachinePointerInfo(TrmpAddr, 6),
/* Alignment = */ 1);
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
}
}
SDValue X86TargetLowering::LowerFLT_ROUNDS_(SDValue Op,
SelectionDAG &DAG) const {
/*
The rounding mode is in bits 11:10 of the x87 FP control word (FPCW),
which FNSTCW stores below, and has the following settings:
00 Round to nearest
01 Round to -inf
10 Round to +inf
11 Round to 0
FLT_ROUNDS, on the other hand, expects the following:
-1 Undefined
0 Round to 0
1 Round to nearest
2 Round to +inf
3 Round to -inf
To perform the conversion, we do:
(((((FPCW & 0x800) >> 11) | ((FPCW & 0x400) >> 9)) + 1) & 3)
*/
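// Worked through for each mode: 00 (nearest) -> ((0|0)+1)&3 = 1;
// 01 (-inf) -> ((0|2)+1)&3 = 3; 10 (+inf) -> ((1|0)+1)&3 = 2;
// 11 (to zero) -> ((1|2)+1)&3 = 0.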
MachineFunction &MF = DAG.getMachineFunction();
const TargetFrameLowering &TFI = *Subtarget.getFrameLowering();
unsigned StackAlignment = TFI.getStackAlignment();
MVT VT = Op.getSimpleValueType();
SDLoc DL(Op);
// Save FP Control Word to stack slot
int SSFI = MF.getFrameInfo().CreateStackObject(2, StackAlignment, false);
SDValue StackSlot =
DAG.getFrameIndex(SSFI, getPointerTy(DAG.getDataLayout()));
MachineMemOperand *MMO =
MF.getMachineMemOperand(MachinePointerInfo::getFixedStack(MF, SSFI),
MachineMemOperand::MOStore, 2, 2);
SDValue Ops[] = { DAG.getEntryNode(), StackSlot };
SDValue Chain = DAG.getMemIntrinsicNode(X86ISD::FNSTCW16m, DL,
DAG.getVTList(MVT::Other),
Ops, MVT::i16, MMO);
// Load FP Control Word from stack slot
SDValue CWD =
DAG.getLoad(MVT::i16, DL, Chain, StackSlot, MachinePointerInfo());
// Transform as necessary
SDValue CWD1 =
DAG.getNode(ISD::SRL, DL, MVT::i16,
DAG.getNode(ISD::AND, DL, MVT::i16,
CWD, DAG.getConstant(0x800, DL, MVT::i16)),
DAG.getConstant(11, DL, MVT::i8));
SDValue CWD2 =
DAG.getNode(ISD::SRL, DL, MVT::i16,
DAG.getNode(ISD::AND, DL, MVT::i16,
CWD, DAG.getConstant(0x400, DL, MVT::i16)),
DAG.getConstant(9, DL, MVT::i8));
SDValue RetVal =
DAG.getNode(ISD::AND, DL, MVT::i16,
DAG.getNode(ISD::ADD, DL, MVT::i16,
DAG.getNode(ISD::OR, DL, MVT::i16, CWD1, CWD2),
DAG.getConstant(1, DL, MVT::i16)),
DAG.getConstant(3, DL, MVT::i16));
return DAG.getNode((VT.getSizeInBits() < 16 ?
ISD::TRUNCATE : ISD::ZERO_EXTEND), DL, VT, RetVal);
}
// Split a unary integer op into 2 half-sized ops.
static SDValue LowerVectorIntUnary(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
unsigned NumElems = VT.getVectorNumElements();
unsigned SizeInBits = VT.getSizeInBits();
// Extract the Lo/Hi vectors
SDLoc dl(Op);
SDValue Src = Op.getOperand(0);
SDValue Lo = extractSubVector(Src, 0, DAG, dl, SizeInBits / 2);
SDValue Hi = extractSubVector(Src, NumElems / 2, DAG, dl, SizeInBits / 2);
MVT EltVT = VT.getVectorElementType();
MVT NewVT = MVT::getVectorVT(EltVT, NumElems / 2);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
DAG.getNode(Op.getOpcode(), dl, NewVT, Lo),
DAG.getNode(Op.getOpcode(), dl, NewVT, Hi));
}
// Decompose 256-bit ops into smaller 128-bit ops.
static SDValue Lower256IntUnary(SDValue Op, SelectionDAG &DAG) {
assert(Op.getSimpleValueType().is256BitVector() &&
Op.getSimpleValueType().isInteger() &&
"Only handle AVX 256-bit vector integer operation");
return LowerVectorIntUnary(Op, DAG);
}
// Decompose 512-bit ops into smaller 256-bit ops.
static SDValue Lower512IntUnary(SDValue Op, SelectionDAG &DAG) {
assert(Op.getSimpleValueType().is512BitVector() &&
Op.getSimpleValueType().isInteger() &&
"Only handle AVX 512-bit vector integer operation");
return LowerVectorIntUnary(Op, DAG);
}
/// \brief Lower a vector CTLZ using the natively supported vector CTLZ instruction.
//
// i8/i16 vectors are implemented using the dword LZCNT vector instruction
// ( sub(trunc(lzcnt(zext32(x)))) ). In case zext32(x) is illegal,
// split the vector, perform the operation on its Lo and Hi parts and
// concatenate the results.
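// For example, for an i8 element x = 0x1A, lzcnt(zext32(x)) = 27 and
// 27 - (32 - 8) = 3 = ctlz(x).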
static SDValue LowerVectorCTLZ_AVX512CDI(SDValue Op, SelectionDAG &DAG) {
assert(Op.getOpcode() == ISD::CTLZ);
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
unsigned NumElems = VT.getVectorNumElements();
assert((EltVT == MVT::i8 || EltVT == MVT::i16) &&
"Unsupported element type");
// Split the vector; its Lo and Hi parts will be handled in the next iteration.
if (16 < NumElems)
return LowerVectorIntUnary(Op, DAG);
MVT NewVT = MVT::getVectorVT(MVT::i32, NumElems);
assert((NewVT.is256BitVector() || NewVT.is512BitVector()) &&
"Unsupported value type for operation");
// Use the natively supported vector instruction vplzcntd.
Op = DAG.getNode(ISD::ZERO_EXTEND, dl, NewVT, Op.getOperand(0));
SDValue CtlzNode = DAG.getNode(ISD::CTLZ, dl, NewVT, Op);
SDValue TruncNode = DAG.getNode(ISD::TRUNCATE, dl, VT, CtlzNode);
SDValue Delta = DAG.getConstant(32 - EltVT.getSizeInBits(), dl, VT);
return DAG.getNode(ISD::SUB, dl, VT, TruncNode, Delta);
}
// Lower CTLZ using a PSHUFB lookup table implementation.
static SDValue LowerVectorCTLZInRegLUT(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
int NumElts = VT.getVectorNumElements();
int NumBytes = NumElts * (VT.getScalarSizeInBits() / 8);
MVT CurrVT = MVT::getVectorVT(MVT::i8, NumBytes);
// Per-nibble leading zero PSHUFB lookup table.
const int LUT[16] = {/* 0 */ 4, /* 1 */ 3, /* 2 */ 2, /* 3 */ 2,
/* 4 */ 1, /* 5 */ 1, /* 6 */ 1, /* 7 */ 1,
/* 8 */ 0, /* 9 */ 0, /* a */ 0, /* b */ 0,
/* c */ 0, /* d */ 0, /* e */ 0, /* f */ 0};
SmallVector<SDValue, 64> LUTVec;
for (int i = 0; i < NumBytes; ++i)
LUTVec.push_back(DAG.getConstant(LUT[i % 16], DL, MVT::i8));
SDValue InRegLUT = DAG.getBuildVector(CurrVT, DL, LUTVec);
// Begin by bitcasting the input to a byte vector, then split those bytes
// into lo/hi nibbles and use the PSHUFB LUT to perform CTLZ on each of them.
// If the hi input nibble is zero then we add both results together, otherwise
// we just take the hi result (by masking the lo result to zero before the
// add).
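// For example, for the byte 0x1A the hi nibble is 0x1 (LUT -> 3) and is
// nonzero, so the lo result is masked away and ctlz = 3. For 0x05 the hi
// nibble is zero (LUT -> 4) and the lo nibble is 0x5 (LUT -> 1), giving
// ctlz = 4 + 1 = 5.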
SDValue Op0 = DAG.getBitcast(CurrVT, Op.getOperand(0));
SDValue Zero = getZeroVector(CurrVT, Subtarget, DAG, DL);
SDValue NibbleMask = DAG.getConstant(0xF, DL, CurrVT);
SDValue NibbleShift = DAG.getConstant(0x4, DL, CurrVT);
SDValue Lo = DAG.getNode(ISD::AND, DL, CurrVT, Op0, NibbleMask);
SDValue Hi = DAG.getNode(ISD::SRL, DL, CurrVT, Op0, NibbleShift);
SDValue HiZ = DAG.getSetCC(DL, CurrVT, Hi, Zero, ISD::SETEQ);
Lo = DAG.getNode(X86ISD::PSHUFB, DL, CurrVT, InRegLUT, Lo);
Hi = DAG.getNode(X86ISD::PSHUFB, DL, CurrVT, InRegLUT, Hi);
Lo = DAG.getNode(ISD::AND, DL, CurrVT, Lo, HiZ);
SDValue Res = DAG.getNode(ISD::ADD, DL, CurrVT, Lo, Hi);
// Merge the result from vXi8 back to VT, working on the lo/hi halves
// of the current vector width in the same way we did for the nibbles.
// If the upper half of the input element is zero then add the halves'
// leading zero counts together, otherwise just use the upper half's.
// Double the width of the result until we are at target width.
while (CurrVT != VT) {
int CurrScalarSizeInBits = CurrVT.getScalarSizeInBits();
int CurrNumElts = CurrVT.getVectorNumElements();
MVT NextSVT = MVT::getIntegerVT(CurrScalarSizeInBits * 2);
MVT NextVT = MVT::getVectorVT(NextSVT, CurrNumElts / 2);
SDValue Shift = DAG.getConstant(CurrScalarSizeInBits, DL, NextVT);
// Check if the upper half of the input element is zero.
SDValue HiZ = DAG.getSetCC(DL, CurrVT, DAG.getBitcast(CurrVT, Op0),
DAG.getBitcast(CurrVT, Zero), ISD::SETEQ);
HiZ = DAG.getBitcast(NextVT, HiZ);
// Move the upper/lower halves to the lower bits as we'll be extending to
// NextVT. Mask the lower result to zero if HiZ is true and add the results
// together.
SDValue ResNext = Res = DAG.getBitcast(NextVT, Res);
SDValue R0 = DAG.getNode(ISD::SRL, DL, NextVT, ResNext, Shift);
SDValue R1 = DAG.getNode(ISD::SRL, DL, NextVT, HiZ, Shift);
R1 = DAG.getNode(ISD::AND, DL, NextVT, ResNext, R1);
Res = DAG.getNode(ISD::ADD, DL, NextVT, R0, R1);
CurrVT = NextVT;
}
return Res;
}
static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
if (Subtarget.hasCDI())
return LowerVectorCTLZ_AVX512CDI(Op, DAG);
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntUnary(Op, DAG);
// Decompose 512-bit ops into smaller 256-bit ops.
if (VT.is512BitVector() && !Subtarget.hasBWI())
return Lower512IntUnary(Op, DAG);
assert(Subtarget.hasSSSE3() && "Expected SSSE3 support for PSHUFB");
return LowerVectorCTLZInRegLUT(Op, DL, Subtarget, DAG);
}
static SDValue LowerCTLZ(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
MVT OpVT = VT;
unsigned NumBits = VT.getSizeInBits();
SDLoc dl(Op);
unsigned Opc = Op.getOpcode();
if (VT.isVector())
return LowerVectorCTLZ(Op, dl, Subtarget, DAG);
Op = Op.getOperand(0);
if (VT == MVT::i8) {
// Zero extend to i32 since there is not an i8 bsr.
OpVT = MVT::i32;
Op = DAG.getNode(ISD::ZERO_EXTEND, dl, OpVT, Op);
}
// Issue a bsr (scan bits in reverse) which also sets EFLAGS.
SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);
if (Opc == ISD::CTLZ) {
// If src is zero (i.e. bsr sets ZF), returns NumBits.
SDValue Ops[] = {
Op,
DAG.getConstant(NumBits + NumBits - 1, dl, OpVT),
DAG.getConstant(X86::COND_E, dl, MVT::i8),
Op.getValue(1)
};
Op = DAG.getNode(X86ISD::CMOV, dl, OpVT, Ops);
}
// Finally xor with NumBits-1.
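// For a nonzero source, BSR returns the index of the highest set bit, so
// ctlz = (NumBits - 1) - BSR; because NumBits - 1 is all ones, the
// subtraction is the same as this xor (e.g. i32 0x00010000: BSR = 16,
// 31 ^ 16 = 15). The zero case also works out:
// (2 * NumBits - 1) ^ (NumBits - 1) == NumBits.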
Op = DAG.getNode(ISD::XOR, dl, OpVT, Op,
DAG.getConstant(NumBits - 1, dl, OpVT));
if (VT == MVT::i8)
Op = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Op);
return Op;
}
static SDValue LowerCTTZ(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
unsigned NumBits = VT.getScalarSizeInBits();
SDLoc dl(Op);
if (VT.isVector()) {
SDValue N0 = Op.getOperand(0);
SDValue Zero = DAG.getConstant(0, dl, VT);
// lsb(x) = (x & -x)
SDValue LSB = DAG.getNode(ISD::AND, dl, VT, N0,
DAG.getNode(ISD::SUB, dl, VT, Zero, N0));
// cttz_undef(x) = (width - 1) - ctlz(lsb)
if (Op.getOpcode() == ISD::CTTZ_ZERO_UNDEF) {
SDValue WidthMinusOne = DAG.getConstant(NumBits - 1, dl, VT);
return DAG.getNode(ISD::SUB, dl, VT, WidthMinusOne,
DAG.getNode(ISD::CTLZ, dl, VT, LSB));
}
// cttz(x) = ctpop(lsb - 1)
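// e.g. x = 0b01101000: lsb = 0b00001000, lsb - 1 = 0b00000111, and
// ctpop(0b00000111) = 3 = cttz(x).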
SDValue One = DAG.getConstant(1, dl, VT);
return DAG.getNode(ISD::CTPOP, dl, VT,
DAG.getNode(ISD::SUB, dl, VT, LSB, One));
}
assert(Op.getOpcode() == ISD::CTTZ &&
"Only scalar CTTZ requires custom lowering");
// Issue a bsf (scan bits forward) which also sets EFLAGS.
SDVTList VTs = DAG.getVTList(VT, MVT::i32);
Op = DAG.getNode(X86ISD::BSF, dl, VTs, Op.getOperand(0));
// If src is zero (i.e. bsf sets ZF), returns NumBits.
SDValue Ops[] = {
Op,
DAG.getConstant(NumBits, dl, VT),
DAG.getConstant(X86::COND_E, dl, MVT::i8),
Op.getValue(1)
};
return DAG.getNode(X86ISD::CMOV, dl, VT, Ops);
}
/// Break a 256-bit integer operation into two new 128-bit ones and then
/// concatenate the result back.
static SDValue Lower256IntArith(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
assert(VT.is256BitVector() && VT.isInteger() &&
"Unsupported value type for operation");
unsigned NumElems = VT.getVectorNumElements();
SDLoc dl(Op);
// Extract the LHS vectors
SDValue LHS = Op.getOperand(0);
SDValue LHS1 = extract128BitVector(LHS, 0, DAG, dl);
SDValue LHS2 = extract128BitVector(LHS, NumElems / 2, DAG, dl);
// Extract the RHS vectors
SDValue RHS = Op.getOperand(1);
SDValue RHS1 = extract128BitVector(RHS, 0, DAG, dl);
SDValue RHS2 = extract128BitVector(RHS, NumElems / 2, DAG, dl);
MVT EltVT = VT.getVectorElementType();
MVT NewVT = MVT::getVectorVT(EltVT, NumElems/2);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS1, RHS1),
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS2, RHS2));
}
/// Break a 512-bit integer operation into two new 256-bit ones and then
/// concatenate the result back.
static SDValue Lower512IntArith(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
assert(VT.is512BitVector() && VT.isInteger() &&
"Unsupported value type for operation");
unsigned NumElems = VT.getVectorNumElements();
SDLoc dl(Op);
// Extract the LHS vectors
SDValue LHS = Op.getOperand(0);
SDValue LHS1 = extract256BitVector(LHS, 0, DAG, dl);
SDValue LHS2 = extract256BitVector(LHS, NumElems / 2, DAG, dl);
// Extract the RHS vectors
SDValue RHS = Op.getOperand(1);
SDValue RHS1 = extract256BitVector(RHS, 0, DAG, dl);
SDValue RHS2 = extract256BitVector(RHS, NumElems / 2, DAG, dl);
MVT EltVT = VT.getVectorElementType();
MVT NewVT = MVT::getVectorVT(EltVT, NumElems/2);
return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS1, RHS1),
DAG.getNode(Op.getOpcode(), dl, NewVT, LHS2, RHS2));
}
static SDValue LowerADD_SUB(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
if (VT.getScalarType() == MVT::i1)
return DAG.getNode(ISD::XOR, SDLoc(Op), VT,
Op.getOperand(0), Op.getOperand(1));
assert(Op.getSimpleValueType().is256BitVector() &&
Op.getSimpleValueType().isInteger() &&
"Only handle AVX 256-bit vector integer operation");
return Lower256IntArith(Op, DAG);
}
static SDValue LowerABS(SDValue Op, SelectionDAG &DAG) {
assert(Op.getSimpleValueType().is256BitVector() &&
Op.getSimpleValueType().isInteger() &&
"Only handle AVX 256-bit vector integer operation");
return Lower256IntUnary(Op, DAG);
}
static SDValue LowerMINMAX(SDValue Op, SelectionDAG &DAG) {
assert(Op.getSimpleValueType().is256BitVector() &&
Op.getSimpleValueType().isInteger() &&
"Only handle AVX 256-bit vector integer operation");
return Lower256IntArith(Op, DAG);
}
static SDValue LowerMUL(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
if (VT.getScalarType() == MVT::i1)
return DAG.getNode(ISD::AND, dl, VT, Op.getOperand(0), Op.getOperand(1));
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntArith(Op, DAG);
SDValue A = Op.getOperand(0);
SDValue B = Op.getOperand(1);
// Lower v16i8/v32i8/v64i8 mul as sign-extension to v8i16/v16i16/v32i16
// vector pairs, multiply and truncate.
if (VT == MVT::v16i8 || VT == MVT::v32i8 || VT == MVT::v64i8) {
if (Subtarget.hasInt256()) {
// For 512-bit vectors, split into 256-bit vectors to allow the
// sign-extension to occur.
if (VT == MVT::v64i8)
return Lower512IntArith(Op, DAG);
// For 256-bit vectors, split into 128-bit vectors to allow the
// sign-extension to occur. We don't need this on AVX512BW as we can
// safely sign-extend to v32i16.
if (VT == MVT::v32i8 && !Subtarget.hasBWI())
return Lower256IntArith(Op, DAG);
MVT ExVT = MVT::getVectorVT(MVT::i16, VT.getVectorNumElements());
return DAG.getNode(
ISD::TRUNCATE, dl, VT,
DAG.getNode(ISD::MUL, dl, ExVT,
DAG.getNode(ISD::SIGN_EXTEND, dl, ExVT, A),
DAG.getNode(ISD::SIGN_EXTEND, dl, ExVT, B)));
}
assert(VT == MVT::v16i8 &&
"Pre-AVX2 support only supports v16i8 multiplication");
MVT ExVT = MVT::v8i16;
// Extract the lo parts and sign extend to i16
SDValue ALo, BLo;
if (Subtarget.hasSSE41()) {
ALo = DAG.getSignExtendVectorInReg(A, dl, ExVT);
BLo = DAG.getSignExtendVectorInReg(B, dl, ExVT);
} else {
const int ShufMask[] = {-1, 0, -1, 1, -1, 2, -1, 3,
-1, 4, -1, 5, -1, 6, -1, 7};
ALo = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BLo = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
ALo = DAG.getBitcast(ExVT, ALo);
BLo = DAG.getBitcast(ExVT, BLo);
ALo = DAG.getNode(ISD::SRA, dl, ExVT, ALo, DAG.getConstant(8, dl, ExVT));
BLo = DAG.getNode(ISD::SRA, dl, ExVT, BLo, DAG.getConstant(8, dl, ExVT));
}
// Extract the hi parts and sign extend to i16
SDValue AHi, BHi;
if (Subtarget.hasSSE41()) {
const int ShufMask[] = {8, 9, 10, 11, 12, 13, 14, 15,
-1, -1, -1, -1, -1, -1, -1, -1};
AHi = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BHi = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
AHi = DAG.getSignExtendVectorInReg(AHi, dl, ExVT);
BHi = DAG.getSignExtendVectorInReg(BHi, dl, ExVT);
} else {
const int ShufMask[] = {-1, 8, -1, 9, -1, 10, -1, 11,
-1, 12, -1, 13, -1, 14, -1, 15};
AHi = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BHi = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
AHi = DAG.getBitcast(ExVT, AHi);
BHi = DAG.getBitcast(ExVT, BHi);
AHi = DAG.getNode(ISD::SRA, dl, ExVT, AHi, DAG.getConstant(8, dl, ExVT));
BHi = DAG.getNode(ISD::SRA, dl, ExVT, BHi, DAG.getConstant(8, dl, ExVT));
}
// Multiply, mask the lower 8 bits of the lo/hi results and pack
SDValue RLo = DAG.getNode(ISD::MUL, dl, ExVT, ALo, BLo);
SDValue RHi = DAG.getNode(ISD::MUL, dl, ExVT, AHi, BHi);
RLo = DAG.getNode(ISD::AND, dl, ExVT, RLo, DAG.getConstant(255, dl, ExVT));
RHi = DAG.getNode(ISD::AND, dl, ExVT, RHi, DAG.getConstant(255, dl, ExVT));
return DAG.getNode(X86ISD::PACKUS, dl, VT, RLo, RHi);
}
// Lower v4i32 mul as 2x shuffle, 2x pmuludq, 2x shuffle.
if (VT == MVT::v4i32) {
assert(Subtarget.hasSSE2() && !Subtarget.hasSSE41() &&
"Should not custom lower when pmuldq is available!");
// Extract the odd parts.
static const int UnpackMask[] = { 1, -1, 3, -1 };
SDValue Aodds = DAG.getVectorShuffle(VT, dl, A, A, UnpackMask);
SDValue Bodds = DAG.getVectorShuffle(VT, dl, B, B, UnpackMask);
// Multiply the even parts.
SDValue Evens = DAG.getNode(X86ISD::PMULUDQ, dl, MVT::v2i64, A, B);
// Now multiply odd parts.
SDValue Odds = DAG.getNode(X86ISD::PMULUDQ, dl, MVT::v2i64, Aodds, Bodds);
Evens = DAG.getBitcast(VT, Evens);
Odds = DAG.getBitcast(VT, Odds);
// Merge the two vectors back together with a shuffle. This expands into 2
// shuffles.
static const int ShufMask[] = { 0, 4, 2, 6 };
return DAG.getVectorShuffle(VT, dl, Evens, Odds, ShufMask);
}
assert((VT == MVT::v2i64 || VT == MVT::v4i64 || VT == MVT::v8i64) &&
"Only know how to lower V2I64/V4I64/V8I64 multiply");
// 32-bit vector types used for PMULDQ/PMULUDQ.
MVT MulVT = MVT::getVectorVT(MVT::i32, VT.getSizeInBits() / 32);
// PMULDQ returns the 64-bit result of the signed multiplication of the lower
// 32 bits. We can lower with this if the sign bits stretch that far.
if (Subtarget.hasSSE41() && DAG.ComputeNumSignBits(A) > 32 &&
DAG.ComputeNumSignBits(B) > 32) {
return DAG.getNode(X86ISD::PMULDQ, dl, VT, DAG.getBitcast(MulVT, A),
DAG.getBitcast(MulVT, B));
}
// Ahi = psrlqi(a, 32);
// Bhi = psrlqi(b, 32);
//
// AloBlo = pmuludq(a, b);
// AloBhi = pmuludq(a, Bhi);
// AhiBlo = pmuludq(Ahi, b);
//
// Hi = psllqi(AloBhi + AhiBlo, 32);
// return AloBlo + Hi;
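// Writing a = 2^32*Ahi + Alo and b = 2^32*Bhi + Blo, the product is
// a*b = 2^64*Ahi*Bhi + 2^32*(Alo*Bhi + Ahi*Blo) + Alo*Blo; the first term
// is congruent to 0 mod 2^64, leaving exactly the three pmuludq products
// combined below.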
APInt LowerBitsMask = APInt::getLowBitsSet(64, 32);
bool ALoIsZero = DAG.MaskedValueIsZero(A, LowerBitsMask);
bool BLoIsZero = DAG.MaskedValueIsZero(B, LowerBitsMask);
APInt UpperBitsMask = APInt::getHighBitsSet(64, 32);
bool AHiIsZero = DAG.MaskedValueIsZero(A, UpperBitsMask);
bool BHiIsZero = DAG.MaskedValueIsZero(B, UpperBitsMask);
// Bit cast to 32-bit vectors for MULUDQ.
SDValue Alo = DAG.getBitcast(MulVT, A);
SDValue Blo = DAG.getBitcast(MulVT, B);
SDValue Zero = getZeroVector(VT, Subtarget, DAG, dl);
// Only multiply lo/hi halves that aren't known to be zero.
SDValue AloBlo = Zero;
if (!ALoIsZero && !BLoIsZero)
AloBlo = DAG.getNode(X86ISD::PMULUDQ, dl, VT, Alo, Blo);
SDValue AloBhi = Zero;
if (!ALoIsZero && !BHiIsZero) {
SDValue Bhi = getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, B, 32, DAG);
Bhi = DAG.getBitcast(MulVT, Bhi);
AloBhi = DAG.getNode(X86ISD::PMULUDQ, dl, VT, Alo, Bhi);
}
SDValue AhiBlo = Zero;
if (!AHiIsZero && !BLoIsZero) {
SDValue Ahi = getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, A, 32, DAG);
Ahi = DAG.getBitcast(MulVT, Ahi);
AhiBlo = DAG.getNode(X86ISD::PMULUDQ, dl, VT, Ahi, Blo);
}
SDValue Hi = DAG.getNode(ISD::ADD, dl, VT, AloBhi, AhiBlo);
Hi = getTargetVShiftByConstNode(X86ISD::VSHLI, dl, VT, Hi, 32, DAG);
return DAG.getNode(ISD::ADD, dl, VT, AloBlo, Hi);
}
static SDValue LowerMULH(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntArith(Op, DAG);
// Only i8 vectors should need custom lowering after this.
assert((VT == MVT::v16i8 || (VT == MVT::v32i8 && Subtarget.hasInt256())) &&
"Unsupported vector type");
// Lower v16i8/v32i8 as extension to v8i16/v16i16 vector pairs, multiply,
// logical shift down the upper half and pack back to i8.
SDValue A = Op.getOperand(0);
SDValue B = Op.getOperand(1);
// With SSE41 we can use sign/zero extend, but for pre-SSE41 we unpack
// and then ashr/lshr the upper bits down to the lower bits before the
// multiply.
unsigned Opcode = Op.getOpcode();
unsigned ExShift = (ISD::MULHU == Opcode ? ISD::SRL : ISD::SRA);
unsigned ExSSE41 = (ISD::MULHU == Opcode ? X86ISD::VZEXT : X86ISD::VSEXT);
// AVX2 implementations - extend xmm subvectors to ymm.
if (Subtarget.hasInt256()) {
SDValue Lo = DAG.getIntPtrConstant(0, dl);
SDValue Hi = DAG.getIntPtrConstant(VT.getVectorNumElements() / 2, dl);
if (VT == MVT::v32i8) {
SDValue ALo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, A, Lo);
SDValue BLo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, B, Lo);
SDValue AHi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, A, Hi);
SDValue BHi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v16i8, B, Hi);
ALo = DAG.getNode(ExSSE41, dl, MVT::v16i16, ALo);
BLo = DAG.getNode(ExSSE41, dl, MVT::v16i16, BLo);
AHi = DAG.getNode(ExSSE41, dl, MVT::v16i16, AHi);
BHi = DAG.getNode(ExSSE41, dl, MVT::v16i16, BHi);
Lo = DAG.getNode(ISD::SRL, dl, MVT::v16i16,
DAG.getNode(ISD::MUL, dl, MVT::v16i16, ALo, BLo),
DAG.getConstant(8, dl, MVT::v16i16));
Hi = DAG.getNode(ISD::SRL, dl, MVT::v16i16,
DAG.getNode(ISD::MUL, dl, MVT::v16i16, AHi, BHi),
DAG.getConstant(8, dl, MVT::v16i16));
// The ymm variant of PACKUS treats the 128-bit lanes separately, so before
// using PACKUS we need to permute the inputs to the correct lo/hi xmm lane.
const int LoMask[] = {0, 1, 2, 3, 4, 5, 6, 7,
16, 17, 18, 19, 20, 21, 22, 23};
const int HiMask[] = {8, 9, 10, 11, 12, 13, 14, 15,
24, 25, 26, 27, 28, 29, 30, 31};
return DAG.getNode(X86ISD::PACKUS, dl, VT,
DAG.getVectorShuffle(MVT::v16i16, dl, Lo, Hi, LoMask),
DAG.getVectorShuffle(MVT::v16i16, dl, Lo, Hi, HiMask));
}
SDValue ExA = getExtendInVec(ExSSE41, dl, MVT::v16i16, A, DAG);
SDValue ExB = getExtendInVec(ExSSE41, dl, MVT::v16i16, B, DAG);
SDValue Mul = DAG.getNode(ISD::MUL, dl, MVT::v16i16, ExA, ExB);
SDValue MulH = DAG.getNode(ISD::SRL, dl, MVT::v16i16, Mul,
DAG.getConstant(8, dl, MVT::v16i16));
Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v8i16, MulH, Lo);
Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v8i16, MulH, Hi);
return DAG.getNode(X86ISD::PACKUS, dl, VT, Lo, Hi);
}
assert(VT == MVT::v16i8 &&
"Pre-AVX2 support only supports v16i8 multiplication");
MVT ExVT = MVT::v8i16;
// Extract the lo parts and zero/sign extend to i16.
SDValue ALo, BLo;
if (Subtarget.hasSSE41()) {
ALo = getExtendInVec(ExSSE41, dl, ExVT, A, DAG);
BLo = getExtendInVec(ExSSE41, dl, ExVT, B, DAG);
} else {
const int ShufMask[] = {-1, 0, -1, 1, -1, 2, -1, 3,
-1, 4, -1, 5, -1, 6, -1, 7};
ALo = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BLo = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
ALo = DAG.getBitcast(ExVT, ALo);
BLo = DAG.getBitcast(ExVT, BLo);
ALo = DAG.getNode(ExShift, dl, ExVT, ALo, DAG.getConstant(8, dl, ExVT));
BLo = DAG.getNode(ExShift, dl, ExVT, BLo, DAG.getConstant(8, dl, ExVT));
}
// Extract the hi parts and zero/sign extend to i16.
SDValue AHi, BHi;
if (Subtarget.hasSSE41()) {
const int ShufMask[] = {8, 9, 10, 11, 12, 13, 14, 15,
-1, -1, -1, -1, -1, -1, -1, -1};
AHi = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BHi = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
AHi = getExtendInVec(ExSSE41, dl, ExVT, AHi, DAG);
BHi = getExtendInVec(ExSSE41, dl, ExVT, BHi, DAG);
} else {
const int ShufMask[] = {-1, 8, -1, 9, -1, 10, -1, 11,
-1, 12, -1, 13, -1, 14, -1, 15};
AHi = DAG.getVectorShuffle(VT, dl, A, A, ShufMask);
BHi = DAG.getVectorShuffle(VT, dl, B, B, ShufMask);
AHi = DAG.getBitcast(ExVT, AHi);
BHi = DAG.getBitcast(ExVT, BHi);
AHi = DAG.getNode(ExShift, dl, ExVT, AHi, DAG.getConstant(8, dl, ExVT));
BHi = DAG.getNode(ExShift, dl, ExVT, BHi, DAG.getConstant(8, dl, ExVT));
}
// Multiply, lshr the upper 8 bits down to the lower 8 bits of the lo/hi
// results and pack back to v16i8.
SDValue RLo = DAG.getNode(ISD::MUL, dl, ExVT, ALo, BLo);
SDValue RHi = DAG.getNode(ISD::MUL, dl, ExVT, AHi, BHi);
RLo = DAG.getNode(ISD::SRL, dl, ExVT, RLo, DAG.getConstant(8, dl, ExVT));
RHi = DAG.getNode(ISD::SRL, dl, ExVT, RHi, DAG.getConstant(8, dl, ExVT));
return DAG.getNode(X86ISD::PACKUS, dl, VT, RLo, RHi);
}
SDValue X86TargetLowering::LowerWin64_i128OP(SDValue Op, SelectionDAG &DAG) const {
assert(Subtarget.isTargetWin64() && "Unexpected target");
EVT VT = Op.getValueType();
assert(VT.isInteger() && VT.getSizeInBits() == 128 &&
"Unexpected return type for lowering");
RTLIB::Libcall LC;
bool isSigned;
switch (Op->getOpcode()) {
default: llvm_unreachable("Unexpected request for libcall!");
case ISD::SDIV: isSigned = true; LC = RTLIB::SDIV_I128; break;
case ISD::UDIV: isSigned = false; LC = RTLIB::UDIV_I128; break;
case ISD::SREM: isSigned = true; LC = RTLIB::SREM_I128; break;
case ISD::UREM: isSigned = false; LC = RTLIB::UREM_I128; break;
case ISD::SDIVREM: isSigned = true; LC = RTLIB::SDIVREM_I128; break;
case ISD::UDIVREM: isSigned = false; LC = RTLIB::UDIVREM_I128; break;
}
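// The Win64 calling convention has no 128-bit integer argument class, so
// each i128 operand is spilled to a 16-byte-aligned stack slot and passed
// to the libcall by pointer.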
SDLoc dl(Op);
SDValue InChain = DAG.getEntryNode();
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
for (unsigned i = 0, e = Op->getNumOperands(); i != e; ++i) {
EVT ArgVT = Op->getOperand(i).getValueType();
assert(ArgVT.isInteger() && ArgVT.getSizeInBits() == 128 &&
"Unexpected argument type for lowering");
SDValue StackPtr = DAG.CreateStackTemporary(ArgVT, 16);
Entry.Node = StackPtr;
InChain = DAG.getStore(InChain, dl, Op->getOperand(i), StackPtr,
MachinePointerInfo(), /* Alignment = */ 16);
Type *ArgTy = ArgVT.getTypeForEVT(*DAG.getContext());
Entry.Ty = PointerType::get(ArgTy,0);
Entry.IsSExt = false;
Entry.IsZExt = false;
Args.push_back(Entry);
}
SDValue Callee = DAG.getExternalSymbol(getLibcallName(LC),
getPointerTy(DAG.getDataLayout()));
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl)
.setChain(InChain)
.setLibCallee(
getLibcallCallingConv(LC),
static_cast<EVT>(MVT::v2i64).getTypeForEVT(*DAG.getContext()), Callee,
std::move(Args))
.setInRegister()
.setSExtResult(isSigned)
.setZExtResult(!isSigned);
std::pair<SDValue, SDValue> CallInfo = LowerCallTo(CLI);
return DAG.getBitcast(VT, CallInfo.first);
}
static SDValue LowerMUL_LOHI(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue Op0 = Op.getOperand(0), Op1 = Op.getOperand(1);
MVT VT = Op0.getSimpleValueType();
SDLoc dl(Op);
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256()) {
unsigned Opcode = Op.getOpcode();
unsigned NumElems = VT.getVectorNumElements();
MVT HalfVT = MVT::getVectorVT(VT.getScalarType(), NumElems / 2);
SDValue Lo0 = extract128BitVector(Op0, 0, DAG, dl);
SDValue Lo1 = extract128BitVector(Op1, 0, DAG, dl);
SDValue Hi0 = extract128BitVector(Op0, NumElems / 2, DAG, dl);
SDValue Hi1 = extract128BitVector(Op1, NumElems / 2, DAG, dl);
SDValue Lo = DAG.getNode(Opcode, dl, DAG.getVTList(HalfVT, HalfVT), Lo0, Lo1);
SDValue Hi = DAG.getNode(Opcode, dl, DAG.getVTList(HalfVT, HalfVT), Hi0, Hi1);
SDValue Ops[] = {
DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, Lo.getValue(0), Hi.getValue(0)),
DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, Lo.getValue(1), Hi.getValue(1))
};
return DAG.getMergeValues(Ops, dl);
}
assert((VT == MVT::v4i32 && Subtarget.hasSSE2()) ||
(VT == MVT::v8i32 && Subtarget.hasInt256()));
// PMULxD operations multiply each even value (starting at 0) of LHS with
// the corresponding value of RHS and produce a widened result.
// E.g., PMULUDQ <4 x i32> <a|b|c|d>, <4 x i32> <e|f|g|h>
// => <2 x i64> <ae|cg>
//
// In other words, to have all the results, we need to perform two PMULxD:
// 1. one with the even values.
// 2. one with the odd values.
// To achieve #2, we need to place the odd values at an even position.
//
// Place the odd values at an even position (basically, shift all values 1
// step to the left):
const int Mask[] = {1, -1, 3, -1, 5, -1, 7, -1};
// <a|b|c|d> => <b|undef|d|undef>
SDValue Odd0 = DAG.getVectorShuffle(VT, dl, Op0, Op0,
makeArrayRef(&Mask[0], VT.getVectorNumElements()));
// <e|f|g|h> => <f|undef|h|undef>
SDValue Odd1 = DAG.getVectorShuffle(VT, dl, Op1, Op1,
makeArrayRef(&Mask[0], VT.getVectorNumElements()));
// Emit two multiplies, one for the lower 2 ints and one for the higher 2
// ints.
MVT MulVT = VT == MVT::v4i32 ? MVT::v2i64 : MVT::v4i64;
bool IsSigned = Op->getOpcode() == ISD::SMUL_LOHI;
unsigned Opcode =
(!IsSigned || !Subtarget.hasSSE41()) ? X86ISD::PMULUDQ : X86ISD::PMULDQ;
// PMULUDQ <4 x i32> <a|b|c|d>, <4 x i32> <e|f|g|h>
// => <2 x i64> <ae|cg>
SDValue Mul1 = DAG.getBitcast(VT, DAG.getNode(Opcode, dl, MulVT, Op0, Op1));
// PMULUDQ <4 x i32> <b|undef|d|undef>, <4 x i32> <f|undef|h|undef>
// => <2 x i64> <bf|dh>
SDValue Mul2 = DAG.getBitcast(VT, DAG.getNode(Opcode, dl, MulVT, Odd0, Odd1));
// Shuffle it back into the right order.
SDValue Highs, Lows;
if (VT == MVT::v8i32) {
const int HighMask[] = {1, 9, 3, 11, 5, 13, 7, 15};
Highs = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, HighMask);
const int LowMask[] = {0, 8, 2, 10, 4, 12, 6, 14};
Lows = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, LowMask);
} else {
const int HighMask[] = {1, 5, 3, 7};
Highs = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, HighMask);
const int LowMask[] = {0, 4, 2, 6};
Lows = DAG.getVectorShuffle(VT, dl, Mul1, Mul2, LowMask);
}
// If we have a signed multiply but no PMULDQ, fix up the high parts of the
// unsigned multiply.
if (IsSigned && !Subtarget.hasSSE41()) {
SDValue ShAmt = DAG.getConstant(
31, dl,
DAG.getTargetLoweringInfo().getShiftAmountTy(VT, DAG.getDataLayout()));
SDValue T1 = DAG.getNode(ISD::AND, dl, VT,
DAG.getNode(ISD::SRA, dl, VT, Op0, ShAmt), Op1);
SDValue T2 = DAG.getNode(ISD::AND, dl, VT,
DAG.getNode(ISD::SRA, dl, VT, Op1, ShAmt), Op0);
SDValue Fixup = DAG.getNode(ISD::ADD, dl, VT, T1, T2);
Highs = DAG.getNode(ISD::SUB, dl, VT, Highs, Fixup);
}
// The first result of MUL_LOHI is actually the low value, followed by the
// high value.
SDValue Ops[] = {Lows, Highs};
return DAG.getMergeValues(Ops, dl);
}
// Return true if the required (according to Opcode) shift-imm form is natively
// supported by the Subtarget
static bool SupportedVectorShiftWithImm(MVT VT, const X86Subtarget &Subtarget,
unsigned Opcode) {
if (VT.getScalarSizeInBits() < 16)
return false;
if (VT.is512BitVector() && Subtarget.hasAVX512() &&
(VT.getScalarSizeInBits() > 16 || Subtarget.hasBWI()))
return true;
bool LShift = (VT.is128BitVector() && Subtarget.hasSSE2()) ||
(VT.is256BitVector() && Subtarget.hasInt256());
bool AShift = LShift && (Subtarget.hasAVX512() ||
(VT != MVT::v2i64 && VT != MVT::v4i64));
return (Opcode == ISD::SRA) ? AShift : LShift;
}
// The shift amount is a variable, but it is the same for all vector lanes.
// These instructions are defined together with the shift-immediate forms.
static
bool SupportedVectorShiftWithBaseAmnt(MVT VT, const X86Subtarget &Subtarget,
unsigned Opcode) {
return SupportedVectorShiftWithImm(VT, Subtarget, Opcode);
}
// Return true if the required (according to Opcode) variable-shift form is
// natively supported by the Subtarget
static bool SupportedVectorVarShift(MVT VT, const X86Subtarget &Subtarget,
unsigned Opcode) {
if (!Subtarget.hasInt256() || VT.getScalarSizeInBits() < 16)
return false;
// vXi16 is supported only on AVX-512 with BWI.
if (VT.getScalarSizeInBits() == 16 && !Subtarget.hasBWI())
return false;
if (Subtarget.hasAVX512())
return true;
bool LShift = VT.is128BitVector() || VT.is256BitVector();
bool AShift = LShift && VT != MVT::v2i64 && VT != MVT::v4i64;
return (Opcode == ISD::SRA) ? AShift : LShift;
}
static SDValue LowerScalarImmediateShift(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);
unsigned X86Opc = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :
(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;
auto ArithmeticShiftRight64 = [&](uint64_t ShiftAmt) {
assert((VT == MVT::v2i64 || VT == MVT::v4i64) && "Unexpected SRA type");
MVT ExVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() * 2);
SDValue Ex = DAG.getBitcast(ExVT, R);
// ashr(R, 63) === cmp_slt(R, 0)
if (ShiftAmt == 63 && Subtarget.hasSSE42()) {
assert((VT != MVT::v4i64 || Subtarget.hasInt256()) &&
"Unsupported PCMPGT op");
return DAG.getNode(X86ISD::PCMPGT, dl, VT,
getZeroVector(VT, Subtarget, DAG, dl), R);
}
if (ShiftAmt >= 32) {
// Splat sign to upper i32 dst, and SRA upper i32 src to lower i32.
SDValue Upper =
getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex, 31, DAG);
SDValue Lower = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
ShiftAmt - 32, DAG);
if (VT == MVT::v2i64)
Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {5, 1, 7, 3});
if (VT == MVT::v4i64)
Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
{9, 1, 11, 3, 13, 5, 15, 7});
} else {
// SRA upper i32, SHL whole i64 and select lower i32.
SDValue Upper = getTargetVShiftByConstNode(X86ISD::VSRAI, dl, ExVT, Ex,
ShiftAmt, DAG);
SDValue Lower =
getTargetVShiftByConstNode(X86ISD::VSRLI, dl, VT, R, ShiftAmt, DAG);
Lower = DAG.getBitcast(ExVT, Lower);
if (VT == MVT::v2i64)
Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower, {4, 1, 6, 3});
if (VT == MVT::v4i64)
Ex = DAG.getVectorShuffle(ExVT, dl, Upper, Lower,
{8, 1, 10, 3, 12, 5, 14, 7});
}
return DAG.getBitcast(VT, Ex);
};
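// E.g. for a v2i64 ashr by 40 (the ShiftAmt >= 32 case above), each i64 lane
// <lo, hi> becomes <sra(hi, 8), sra(hi, 31)>: the low i32 takes the shifted
// value bits and the high i32 takes the splatted sign.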
// Optimize shl/srl/sra with constant shift amount.
if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {
if (auto *ShiftConst = BVAmt->getConstantSplatNode()) {
uint64_t ShiftAmt = ShiftConst->getZExtValue();
if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);
// i64 SRA needs to be performed as partial shifts.
if (((!Subtarget.hasXOP() && VT == MVT::v2i64) ||
(Subtarget.hasInt256() && VT == MVT::v4i64)) &&
Op.getOpcode() == ISD::SRA)
return ArithmeticShiftRight64(ShiftAmt);
if (VT == MVT::v16i8 ||
(Subtarget.hasInt256() && VT == MVT::v32i8) ||
VT == MVT::v64i8) {
unsigned NumElts = VT.getVectorNumElements();
MVT ShiftVT = MVT::getVectorVT(MVT::i16, NumElts / 2);
// Simple i8 add case
if (Op.getOpcode() == ISD::SHL && ShiftAmt == 1)
return DAG.getNode(ISD::ADD, dl, VT, R, R);
// ashr(R, 7) === cmp_slt(R, 0)
if (Op.getOpcode() == ISD::SRA && ShiftAmt == 7) {
SDValue Zeros = getZeroVector(VT, Subtarget, DAG, dl);
if (VT.is512BitVector()) {
assert(VT == MVT::v64i8 && "Unexpected element type!");
SDValue CMP = DAG.getNode(X86ISD::PCMPGTM, dl, MVT::v64i1, Zeros, R);
return DAG.getNode(ISD::SIGN_EXTEND, dl, VT, CMP);
}
return DAG.getNode(X86ISD::PCMPGT, dl, VT, Zeros, R);
}
// XOP can shift v16i8 directly instead of as shift v8i16 + mask.
if (VT == MVT::v16i8 && Subtarget.hasXOP())
return SDValue();
if (Op.getOpcode() == ISD::SHL) {
// Make a large shift.
SDValue SHL = getTargetVShiftByConstNode(X86ISD::VSHLI, dl, ShiftVT,
R, ShiftAmt, DAG);
SHL = DAG.getBitcast(VT, SHL);
// Zero out the rightmost bits.
return DAG.getNode(ISD::AND, dl, VT, SHL,
DAG.getConstant(uint8_t(-1U << ShiftAmt), dl, VT));
}
if (Op.getOpcode() == ISD::SRL) {
// Make a large shift.
SDValue SRL = getTargetVShiftByConstNode(X86ISD::VSRLI, dl, ShiftVT,
R, ShiftAmt, DAG);
SRL = DAG.getBitcast(VT, SRL);
// Zero out the leftmost bits.
return DAG.getNode(ISD::AND, dl, VT, SRL,
DAG.getConstant(uint8_t(-1U) >> ShiftAmt, dl, VT));
}
if (Op.getOpcode() == ISD::SRA) {
// ashr(R, Amt) === sub(xor(lshr(R, Amt), Mask), Mask)
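// Worked example with ShiftAmt == 2 on the i8 value 0xF0 (-16): lshr
// gives 0x3C, Mask is 128 >> 2 == 0x20, xor gives 0x1C, and sub gives
// 0xFC == -4 == ashr(-16, 2).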
SDValue Res = DAG.getNode(ISD::SRL, dl, VT, R, Amt);
SDValue Mask = DAG.getConstant(128 >> ShiftAmt, dl, VT);
Res = DAG.getNode(ISD::XOR, dl, VT, Res, Mask);
Res = DAG.getNode(ISD::SUB, dl, VT, Res, Mask);
return Res;
}
llvm_unreachable("Unknown shift opcode.");
}
}
}
// Special case in 32-bit mode, where i64 is expanded into high and low parts.
// TODO: Replace constant extraction with getTargetConstantBitsFromNode.
if (!Subtarget.is64Bit() && !Subtarget.hasXOP() &&
(VT == MVT::v2i64 || (Subtarget.hasInt256() && VT == MVT::v4i64) ||
(Subtarget.hasAVX512() && VT == MVT::v8i64))) {
// AVX1 targets may be extracting a 128-bit vector from a 256-bit constant.
unsigned SubVectorScale = 1;
if (Amt.getOpcode() == ISD::EXTRACT_SUBVECTOR) {
SubVectorScale =
Amt.getOperand(0).getValueSizeInBits() / Amt.getValueSizeInBits();
Amt = Amt.getOperand(0);
}
// Peek through any splat that was introduced for i64 shift vectorization.
int SplatIndex = -1;
if (ShuffleVectorSDNode *SVN = dyn_cast<ShuffleVectorSDNode>(Amt.getNode()))
if (SVN->isSplat()) {
SplatIndex = SVN->getSplatIndex();
Amt = Amt.getOperand(0);
assert(SplatIndex < (int)VT.getVectorNumElements() &&
"Splat shuffle referencing second operand");
}
if (Amt.getOpcode() != ISD::BITCAST ||
Amt.getOperand(0).getOpcode() != ISD::BUILD_VECTOR)
return SDValue();
Amt = Amt.getOperand(0);
unsigned Ratio = Amt.getSimpleValueType().getVectorNumElements() /
(SubVectorScale * VT.getVectorNumElements());
unsigned RatioInLog2 = Log2_32_Ceil(Ratio);
uint64_t ShiftAmt = 0;
unsigned BaseOp = (SplatIndex < 0 ? 0 : SplatIndex * Ratio);
for (unsigned i = 0; i != Ratio; ++i) {
ConstantSDNode *C = dyn_cast<ConstantSDNode>(Amt.getOperand(i + BaseOp));
if (!C)
return SDValue();
// 6 == Log2(64)
ShiftAmt |= C->getZExtValue() << (i * (1 << (6 - RatioInLog2)));
}
// Check remaining shift amounts (if not a splat).
if (SplatIndex < 0) {
for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
uint64_t ShAmt = 0;
for (unsigned j = 0; j != Ratio; ++j) {
ConstantSDNode *C = dyn_cast<ConstantSDNode>(Amt.getOperand(i + j));
if (!C)
return SDValue();
// 6 == Log2(64)
ShAmt |= C->getZExtValue() << (j * (1 << (6 - RatioInLog2)));
}
if (ShAmt != ShiftAmt)
return SDValue();
}
}
if (SupportedVectorShiftWithImm(VT, Subtarget, Op.getOpcode()))
return getTargetVShiftByConstNode(X86Opc, dl, VT, R, ShiftAmt, DAG);
if (Op.getOpcode() == ISD::SRA)
return ArithmeticShiftRight64(ShiftAmt);
}
return SDValue();
}
static SDValue LowerScalarVariableShift(SDValue Op, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);
unsigned X86OpcI = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHLI :
(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRLI : X86ISD::VSRAI;
unsigned X86OpcV = (Op.getOpcode() == ISD::SHL) ? X86ISD::VSHL :
(Op.getOpcode() == ISD::SRL) ? X86ISD::VSRL : X86ISD::VSRA;
if (SupportedVectorShiftWithBaseAmnt(VT, Subtarget, Op.getOpcode())) {
SDValue BaseShAmt;
MVT EltVT = VT.getVectorElementType();
if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(Amt)) {
// Check if this build_vector node is doing a splat.
// If so, then set BaseShAmt equal to the splat value.
BaseShAmt = BV->getSplatValue();
if (BaseShAmt && BaseShAmt.isUndef())
BaseShAmt = SDValue();
} else {
if (Amt.getOpcode() == ISD::EXTRACT_SUBVECTOR)
Amt = Amt.getOperand(0);
ShuffleVectorSDNode *SVN = dyn_cast<ShuffleVectorSDNode>(Amt);
if (SVN && SVN->isSplat()) {
unsigned SplatIdx = (unsigned)SVN->getSplatIndex();
SDValue InVec = Amt.getOperand(0);
if (InVec.getOpcode() == ISD::BUILD_VECTOR) {
assert((SplatIdx < InVec.getSimpleValueType().getVectorNumElements()) &&
"Unexpected shuffle index found!");
BaseShAmt = InVec.getOperand(SplatIdx);
} else if (InVec.getOpcode() == ISD::INSERT_VECTOR_ELT) {
if (ConstantSDNode *C =
dyn_cast<ConstantSDNode>(InVec.getOperand(2))) {
if (C->getZExtValue() == SplatIdx)
BaseShAmt = InVec.getOperand(1);
}
}
if (!BaseShAmt)
// Avoid introducing an extract element from a shuffle.
BaseShAmt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, EltVT, InVec,
DAG.getIntPtrConstant(SplatIdx, dl));
}
}
if (BaseShAmt.getNode()) {
assert(EltVT.bitsLE(MVT::i64) && "Unexpected element type!");
if (EltVT != MVT::i64 && EltVT.bitsGT(MVT::i32))
BaseShAmt = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i64, BaseShAmt);
else if (EltVT.bitsLT(MVT::i32))
BaseShAmt = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::i32, BaseShAmt);
return getTargetVShiftNode(X86OpcI, dl, VT, R, BaseShAmt, Subtarget, DAG);
}
}
// Special case in 32-bit mode, where i64 is expanded into high and low parts.
if (!Subtarget.is64Bit() && VT == MVT::v2i64 &&
Amt.getOpcode() == ISD::BITCAST &&
Amt.getOperand(0).getOpcode() == ISD::BUILD_VECTOR) {
Amt = Amt.getOperand(0);
unsigned Ratio = Amt.getSimpleValueType().getVectorNumElements() /
VT.getVectorNumElements();
std::vector<SDValue> Vals(Ratio);
for (unsigned i = 0; i != Ratio; ++i)
Vals[i] = Amt.getOperand(i);
for (unsigned i = Ratio; i != Amt.getNumOperands(); i += Ratio) {
for (unsigned j = 0; j != Ratio; ++j)
if (Vals[j] != Amt.getOperand(i + j))
return SDValue();
}
if (SupportedVectorShiftWithBaseAmnt(VT, Subtarget, Op.getOpcode()))
return DAG.getNode(X86OpcV, dl, VT, R, Op.getOperand(1));
}
return SDValue();
}
static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
SDLoc dl(Op);
SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);
bool ConstantAmt = ISD::isBuildVectorOfConstantSDNodes(Amt.getNode());
assert(VT.isVector() && "Custom lowering only for vector shifts!");
assert(Subtarget.hasSSE2() && "Only custom lower when we have SSE2!");
if (SDValue V = LowerScalarImmediateShift(Op, DAG, Subtarget))
return V;
if (SDValue V = LowerScalarVariableShift(Op, DAG, Subtarget))
return V;
if (SupportedVectorVarShift(VT, Subtarget, Op.getOpcode()))
return Op;
// XOP has 128-bit variable logical/arithmetic shifts.
// +ve/-ve Amt = shift left/right.
if (Subtarget.hasXOP() &&
(VT == MVT::v2i64 || VT == MVT::v4i32 ||
VT == MVT::v8i16 || VT == MVT::v16i8)) {
if (Op.getOpcode() == ISD::SRL || Op.getOpcode() == ISD::SRA) {
SDValue Zero = getZeroVector(VT, Subtarget, DAG, dl);
Amt = DAG.getNode(ISD::SUB, dl, VT, Zero, Amt);
}
if (Op.getOpcode() == ISD::SHL || Op.getOpcode() == ISD::SRL)
return DAG.getNode(X86ISD::VPSHL, dl, VT, R, Amt);
if (Op.getOpcode() == ISD::SRA)
return DAG.getNode(X86ISD::VPSHA, dl, VT, R, Amt);
}
// v2i64 vector logical shifts can efficiently avoid scalarization - do the
// shifts per-lane and then shuffle the partial results back together.
if (VT == MVT::v2i64 && Op.getOpcode() != ISD::SRA) {
// Splat the shift amounts so the scalar shifts above will catch them.
SDValue Amt0 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {0, 0});
SDValue Amt1 = DAG.getVectorShuffle(VT, dl, Amt, Amt, {1, 1});
SDValue R0 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt0);
SDValue R1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Amt1);
return DAG.getVectorShuffle(VT, dl, R0, R1, {0, 3});
}
// i64 vector arithmetic shift can be emulated with the transform:
// M = lshr(SIGN_MASK, Amt)
// ashr(R, Amt) === sub(xor(lshr(R, Amt), M), M)
if ((VT == MVT::v2i64 || (VT == MVT::v4i64 && Subtarget.hasInt256())) &&
Op.getOpcode() == ISD::SRA) {
SDValue S = DAG.getConstant(APInt::getSignMask(64), dl, VT);
SDValue M = DAG.getNode(ISD::SRL, dl, VT, S, Amt);
R = DAG.getNode(ISD::SRL, dl, VT, R, Amt);
R = DAG.getNode(ISD::XOR, dl, VT, R, M);
R = DAG.getNode(ISD::SUB, dl, VT, R, M);
return R;
}
// If possible, lower this packed shift into a vector multiply instead of
// expanding it into a sequence of scalar shifts.
// Do this only if the vector shift count is a constant build_vector.
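// E.g. (v4i32 (shl A, <1, 2, 3, 4>)) becomes (v4i32 (mul A, <2, 4, 8, 16>)),
// i.e. shl x, c == mul x, (1 << c).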
if (ConstantAmt && Op.getOpcode() == ISD::SHL &&
(VT == MVT::v8i16 || VT == MVT::v4i32 ||
(Subtarget.hasInt256() && VT == MVT::v16i16))) {
SmallVector<SDValue, 8> Elts;
MVT SVT = VT.getVectorElementType();
unsigned SVTBits = SVT.getSizeInBits();
APInt One(SVTBits, 1);
unsigned NumElems = VT.getVectorNumElements();
for (unsigned i = 0; i != NumElems; ++i) {
SDValue Op = Amt->getOperand(i);
if (Op->isUndef()) {
Elts.push_back(Op);
continue;
}
ConstantSDNode *ND = cast<ConstantSDNode>(Op);
APInt C(SVTBits, ND->getAPIntValue().getZExtValue());
uint64_t ShAmt = C.getZExtValue();
if (ShAmt >= SVTBits) {
Elts.push_back(DAG.getUNDEF(SVT));
continue;
}
Elts.push_back(DAG.getConstant(One.shl(ShAmt), dl, SVT));
}
SDValue BV = DAG.getBuildVector(VT, dl, Elts);
return DAG.getNode(ISD::MUL, dl, VT, R, BV);
}
// Lower SHL with variable shift amount.
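// The sequence below builds 2^Amt via IEEE-754 bit tricks: (Amt << 23) added
// to 0x3f800000 (the bit pattern of 1.0f) yields the bits of the float
// 2^Amt, which FP_TO_SINT turns back into the integer 2^Amt; the multiply
// then performs the shift. E.g. Amt == 3: (3 << 23) + 0x3f800000 ==
// 0x41000000 == 8.0f, and x * 8 == x << 3.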
if (VT == MVT::v4i32 && Op->getOpcode() == ISD::SHL) {
Op = DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(23, dl, VT));
Op = DAG.getNode(ISD::ADD, dl, VT, Op,
DAG.getConstant(0x3f800000U, dl, VT));
Op = DAG.getBitcast(MVT::v4f32, Op);
Op = DAG.getNode(ISD::FP_TO_SINT, dl, VT, Op);
return DAG.getNode(ISD::MUL, dl, VT, Op, R);
}
// If possible, lower this shift as a sequence of two shifts by
// constant plus a MOVSS/MOVSD/PBLEND instead of scalarizing it.
// Example:
// (v4i32 (srl A, (build_vector < X, Y, Y, Y>)))
//
// Could be rewritten as:
// (v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>)))
//
// The advantage is that the two shifts from the example would be
// lowered as X86ISD::VSRLI nodes. This would be cheaper than scalarizing
// the vector shift into four scalar shifts plus four pairs of vector
// insert/extract.
if (ConstantAmt && (VT == MVT::v8i16 || VT == MVT::v4i32)) {
unsigned TargetOpcode = X86ISD::MOVSS;
bool CanBeSimplified;
// The splat value for the first packed shift (the 'X' from the example).
SDValue Amt1 = Amt->getOperand(0);
// The splat value for the second packed shift (the 'Y' from the example).
SDValue Amt2 = (VT == MVT::v4i32) ? Amt->getOperand(1) : Amt->getOperand(2);
// See if it is possible to replace this node with a sequence of
// two shifts followed by a MOVSS/MOVSD/PBLEND.
if (VT == MVT::v4i32) {
// Check if it is legal to use a MOVSS.
CanBeSimplified = Amt2 == Amt->getOperand(2) &&
Amt2 == Amt->getOperand(3);
if (!CanBeSimplified) {
// Otherwise, check if we can still simplify this node using a MOVSD.
CanBeSimplified = Amt1 == Amt->getOperand(1) &&
Amt->getOperand(2) == Amt->getOperand(3);
TargetOpcode = X86ISD::MOVSD;
Amt2 = Amt->getOperand(2);
}
} else {
// Do similar checks for the case where the machine value type
// is MVT::v8i16.
CanBeSimplified = Amt1 == Amt->getOperand(1);
for (unsigned i = 3; i != 8 && CanBeSimplified; ++i)
CanBeSimplified = Amt2 == Amt->getOperand(i);
if (!CanBeSimplified) {
TargetOpcode = X86ISD::MOVSD;
CanBeSimplified = true;
Amt2 = Amt->getOperand(4);
for (unsigned i = 0; i != 4 && CanBeSimplified; ++i)
CanBeSimplified = Amt1 == Amt->getOperand(i);
for (unsigned j = 4; j != 8 && CanBeSimplified; ++j)
CanBeSimplified = Amt2 == Amt->getOperand(j);
}
}
if (CanBeSimplified && isa<ConstantSDNode>(Amt1) &&
isa<ConstantSDNode>(Amt2)) {
// Replace this node with two shifts followed by a MOVSS/MOVSD/PBLEND.
MVT CastVT = MVT::v4i32;
SDValue Splat1 =
DAG.getConstant(cast<ConstantSDNode>(Amt1)->getAPIntValue(), dl, VT);
SDValue Shift1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Splat1);
SDValue Splat2 =
DAG.getConstant(cast<ConstantSDNode>(Amt2)->getAPIntValue(), dl, VT);
SDValue Shift2 = DAG.getNode(Op->getOpcode(), dl, VT, R, Splat2);
SDValue BitCast1 = DAG.getBitcast(CastVT, Shift1);
SDValue BitCast2 = DAG.getBitcast(CastVT, Shift2);
if (TargetOpcode == X86ISD::MOVSD)
return DAG.getBitcast(VT, DAG.getVectorShuffle(CastVT, dl, BitCast1,
BitCast2, {0, 1, 6, 7}));
return DAG.getBitcast(VT, DAG.getVectorShuffle(CastVT, dl, BitCast1,
BitCast2, {0, 5, 6, 7}));
}
}
// v4i32 non-uniform shifts.
// If the shift amount is constant we can shift each lane using the SSE2
// immediate shifts, else we need to zero-extend each lane to the lower i64
// and shift using the SSE2 variable shifts.
// The separate results can then be blended together.
if (VT == MVT::v4i32) {
unsigned Opc = Op.getOpcode();
SDValue Amt0, Amt1, Amt2, Amt3;
if (ConstantAmt) {
Amt0 = DAG.getVectorShuffle(VT, dl, Amt, DAG.getUNDEF(VT), {0, 0, 0, 0});
Amt1 = DAG.getVectorShuffle(VT, dl, Amt, DAG.getUNDEF(VT), {1, 1, 1, 1});
Amt2 = DAG.getVectorShuffle(VT, dl, Amt, DAG.getUNDEF(VT), {2, 2, 2, 2});
Amt3 = DAG.getVectorShuffle(VT, dl, Amt, DAG.getUNDEF(VT), {3, 3, 3, 3});
} else {
// ISD::SHL is handled above but we include it here for completeness.
switch (Opc) {
default:
llvm_unreachable("Unknown target vector shift node");
case ISD::SHL:
Opc = X86ISD::VSHL;
break;
case ISD::SRL:
Opc = X86ISD::VSRL;
break;
case ISD::SRA:
Opc = X86ISD::VSRA;
break;
}
// The SSE2 shifts use the lower i64 as the same shift amount for
// all lanes and the upper i64 is ignored. These shuffle masks
// optimally zero-extend each lane on SSE2/SSE41/AVX targets.
SDValue Z = getZeroVector(VT, Subtarget, DAG, dl);
Amt0 = DAG.getVectorShuffle(VT, dl, Amt, Z, {0, 4, -1, -1});
Amt1 = DAG.getVectorShuffle(VT, dl, Amt, Z, {1, 5, -1, -1});
Amt2 = DAG.getVectorShuffle(VT, dl, Amt, Z, {2, 6, -1, -1});
Amt3 = DAG.getVectorShuffle(VT, dl, Amt, Z, {3, 7, -1, -1});
}
SDValue R0 = DAG.getNode(Opc, dl, VT, R, Amt0);
SDValue R1 = DAG.getNode(Opc, dl, VT, R, Amt1);
SDValue R2 = DAG.getNode(Opc, dl, VT, R, Amt2);
SDValue R3 = DAG.getNode(Opc, dl, VT, R, Amt3);
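// Each Ri holds the correct result for lane i (in every lane); the two
// blends and the final shuffle gather lane i of Ri into lane i of the
// result.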
SDValue R02 = DAG.getVectorShuffle(VT, dl, R0, R2, {0, -1, 6, -1});
SDValue R13 = DAG.getVectorShuffle(VT, dl, R1, R3, {-1, 1, -1, 7});
return DAG.getVectorShuffle(VT, dl, R02, R13, {0, 5, 2, 7});
}
// It's worth extending once and using the vXi16/vXi32 shifts for smaller
// types, but without AVX512 the extra overheads to get from vXi8 to vXi32
// make the existing SSE solution better.
if ((Subtarget.hasInt256() && VT == MVT::v8i16) ||
(Subtarget.hasAVX512() && VT == MVT::v16i16) ||
(Subtarget.hasAVX512() && VT == MVT::v16i8) ||
(Subtarget.hasBWI() && VT == MVT::v32i8)) {
MVT EvtSVT = (VT == MVT::v32i8 ? MVT::i16 : MVT::i32);
MVT ExtVT = MVT::getVectorVT(EvtSVT, VT.getVectorNumElements());
unsigned ExtOpc =
Op.getOpcode() == ISD::SRA ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
R = DAG.getNode(ExtOpc, dl, ExtVT, R);
Amt = DAG.getNode(ISD::ANY_EXTEND, dl, ExtVT, Amt);
return DAG.getNode(ISD::TRUNCATE, dl, VT,
DAG.getNode(Op.getOpcode(), dl, ExtVT, R, Amt));
}
if (VT == MVT::v16i8 ||
(VT == MVT::v32i8 && Subtarget.hasInt256() && !Subtarget.hasXOP()) ||
(VT == MVT::v64i8 && Subtarget.hasBWI())) {
MVT ExtVT = MVT::getVectorVT(MVT::i16, VT.getVectorNumElements() / 2);
unsigned ShiftOpcode = Op->getOpcode();
auto SignBitSelect = [&](MVT SelVT, SDValue Sel, SDValue V0, SDValue V1) {
if (VT.is512BitVector()) {
// On AVX512BW targets we make use of the fact that VSELECT lowers
// to a masked blend which selects bytes based just on the sign bit
// extracted to a mask.
MVT MaskVT = MVT::getVectorVT(MVT::i1, VT.getVectorNumElements());
V0 = DAG.getBitcast(VT, V0);
V1 = DAG.getBitcast(VT, V1);
Sel = DAG.getBitcast(VT, Sel);
Sel = DAG.getNode(X86ISD::CVT2MASK, dl, MaskVT, Sel);
return DAG.getBitcast(SelVT, DAG.getSelect(dl, VT, Sel, V0, V1));
} else if (Subtarget.hasSSE41()) {
// On SSE41 targets we make use of the fact that VSELECT lowers
// to PBLENDVB which selects bytes based just on the sign bit.
V0 = DAG.getBitcast(VT, V0);
V1 = DAG.getBitcast(VT, V1);
Sel = DAG.getBitcast(VT, Sel);
return DAG.getBitcast(SelVT, DAG.getSelect(dl, VT, Sel, V0, V1));
}
// On pre-SSE41 targets we test for the sign bit by comparing to
// zero - a negative value will set all bits of the lanes to true
// and VSELECT uses that in its OR(AND(V0,C),AND(V1,~C)) lowering.
SDValue Z = getZeroVector(SelVT, Subtarget, DAG, dl);
SDValue C = DAG.getNode(X86ISD::PCMPGT, dl, SelVT, Z, Sel);
return DAG.getSelect(dl, SelVT, C, V0, V1);
};
// Turn 'a' into a mask suitable for VSELECT: a = a << 5;
// We can safely do this using i16 shifts as we're only interested in
// the 3 lower bits of each byte.
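// E.g. a byte shift amount of 5 (0b101): after the shift by 5, bit 2 sits in
// the sign bit, so the first select below applies the shift-by-4 step; each
// subsequent doubling exposes bit 1 (0, step skipped) and then bit 0 (1,
// shift-by-1 applied), for a total shift of 5.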
Amt = DAG.getBitcast(ExtVT, Amt);
Amt = DAG.getNode(ISD::SHL, dl, ExtVT, Amt, DAG.getConstant(5, dl, ExtVT));
Amt = DAG.getBitcast(VT, Amt);
if (Op->getOpcode() == ISD::SHL || Op->getOpcode() == ISD::SRL) {
// r = VSELECT(r, shift(r, 4), a);
SDValue M =
DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(4, dl, VT));
R = SignBitSelect(VT, Amt, M, R);
// a += a
Amt = DAG.getNode(ISD::ADD, dl, VT, Amt, Amt);
// r = VSELECT(r, shift(r, 2), a);
M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(2, dl, VT));
R = SignBitSelect(VT, Amt, M, R);
// a += a
Amt = DAG.getNode(ISD::ADD, dl, VT, Amt, Amt);
// return VSELECT(r, shift(r, 1), a);
M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(1, dl, VT));
R = SignBitSelect(VT, Amt, M, R);
return R;
}
if (Op->getOpcode() == ISD::SRA) {
// For SRA we need to unpack each byte to the higher byte of an i16 vector
// so we can correctly sign extend. We don't care what happens to the
// lower byte.
SDValue ALo = DAG.getNode(X86ISD::UNPCKL, dl, VT, DAG.getUNDEF(VT), Amt);
SDValue AHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, DAG.getUNDEF(VT), Amt);
SDValue RLo = DAG.getNode(X86ISD::UNPCKL, dl, VT, DAG.getUNDEF(VT), R);
SDValue RHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, DAG.getUNDEF(VT), R);
ALo = DAG.getBitcast(ExtVT, ALo);
AHi = DAG.getBitcast(ExtVT, AHi);
RLo = DAG.getBitcast(ExtVT, RLo);
RHi = DAG.getBitcast(ExtVT, RHi);
// r = VSELECT(r, shift(r, 4), a);
SDValue MLo = DAG.getNode(ShiftOpcode, dl, ExtVT, RLo,
DAG.getConstant(4, dl, ExtVT));
SDValue MHi = DAG.getNode(ShiftOpcode, dl, ExtVT, RHi,
DAG.getConstant(4, dl, ExtVT));
RLo = SignBitSelect(ExtVT, ALo, MLo, RLo);
RHi = SignBitSelect(ExtVT, AHi, MHi, RHi);
// a += a
ALo = DAG.getNode(ISD::ADD, dl, ExtVT, ALo, ALo);
AHi = DAG.getNode(ISD::ADD, dl, ExtVT, AHi, AHi);
// r = VSELECT(r, shift(r, 2), a);
MLo = DAG.getNode(ShiftOpcode, dl, ExtVT, RLo,
DAG.getConstant(2, dl, ExtVT));
MHi = DAG.getNode(ShiftOpcode, dl, ExtVT, RHi,
DAG.getConstant(2, dl, ExtVT));
RLo = SignBitSelect(ExtVT, ALo, MLo, RLo);
RHi = SignBitSelect(ExtVT, AHi, MHi, RHi);
// a += a
ALo = DAG.getNode(ISD::ADD, dl, ExtVT, ALo, ALo);
AHi = DAG.getNode(ISD::ADD, dl, ExtVT, AHi, AHi);
// r = VSELECT(r, shift(r, 1), a);
MLo = DAG.getNode(ShiftOpcode, dl, ExtVT, RLo,
DAG.getConstant(1, dl, ExtVT));
MHi = DAG.getNode(ShiftOpcode, dl, ExtVT, RHi,
DAG.getConstant(1, dl, ExtVT));
RLo = SignBitSelect(ExtVT, ALo, MLo, RLo);
RHi = SignBitSelect(ExtVT, AHi, MHi, RHi);
// Logical shift the result back to the lower byte, leaving a zero upper
// byte, meaning that we can safely pack with PACKUSWB.
RLo =
DAG.getNode(ISD::SRL, dl, ExtVT, RLo, DAG.getConstant(8, dl, ExtVT));
RHi =
DAG.getNode(ISD::SRL, dl, ExtVT, RHi, DAG.getConstant(8, dl, ExtVT));
return DAG.getNode(X86ISD::PACKUS, dl, VT, RLo, RHi);
}
}
if (Subtarget.hasInt256() && !Subtarget.hasXOP() && VT == MVT::v16i16) {
MVT ExtVT = MVT::v8i32;
SDValue Z = getZeroVector(VT, Subtarget, DAG, dl);
SDValue ALo = DAG.getNode(X86ISD::UNPCKL, dl, VT, Amt, Z);
SDValue AHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, Amt, Z);
SDValue RLo = DAG.getNode(X86ISD::UNPCKL, dl, VT, Z, R);
SDValue RHi = DAG.getNode(X86ISD::UNPCKH, dl, VT, Z, R);
ALo = DAG.getBitcast(ExtVT, ALo);
AHi = DAG.getBitcast(ExtVT, AHi);
RLo = DAG.getBitcast(ExtVT, RLo);
RHi = DAG.getBitcast(ExtVT, RHi);
SDValue Lo = DAG.getNode(Op.getOpcode(), dl, ExtVT, RLo, ALo);
SDValue Hi = DAG.getNode(Op.getOpcode(), dl, ExtVT, RHi, AHi);
Lo = DAG.getNode(ISD::SRL, dl, ExtVT, Lo, DAG.getConstant(16, dl, ExtVT));
Hi = DAG.getNode(ISD::SRL, dl, ExtVT, Hi, DAG.getConstant(16, dl, ExtVT));
return DAG.getNode(X86ISD::PACKUS, dl, VT, Lo, Hi);
}
if (VT == MVT::v8i16) {
unsigned ShiftOpcode = Op->getOpcode();
// If we have a constant shift amount, the non-SSE41 path is best as
// avoiding bitcasts makes it easier to constant fold and reduce to PBLENDW.
bool UseSSE41 = Subtarget.hasSSE41() &&
!ISD::isBuildVectorOfConstantSDNodes(Amt.getNode());
auto SignBitSelect = [&](SDValue Sel, SDValue V0, SDValue V1) {
// On SSE41 targets we make use of the fact that VSELECT lowers
// to PBLENDVB which selects bytes based just on the sign bit.
if (UseSSE41) {
MVT ExtVT = MVT::getVectorVT(MVT::i8, VT.getVectorNumElements() * 2);
V0 = DAG.getBitcast(ExtVT, V0);
V1 = DAG.getBitcast(ExtVT, V1);
Sel = DAG.getBitcast(ExtVT, Sel);
return DAG.getBitcast(VT, DAG.getSelect(dl, ExtVT, Sel, V0, V1));
}
// On pre-SSE41 targets we splat the sign bit - a negative value will
// set all bits of the lanes to true and VSELECT uses that in
// its OR(AND(V0,C),AND(V1,~C)) lowering.
SDValue C =
DAG.getNode(ISD::SRA, dl, VT, Sel, DAG.getConstant(15, dl, VT));
return DAG.getSelect(dl, VT, C, V0, V1);
};
// Turn 'a' into a mask suitable for VSELECT: a = a << 12;
if (UseSSE41) {
// On SSE41 targets we need to replicate the shift mask in both
// bytes for PBLENDVB.
Amt = DAG.getNode(
ISD::OR, dl, VT,
DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(4, dl, VT)),
DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(12, dl, VT)));
} else {
Amt = DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(12, dl, VT));
}
// r = VSELECT(r, shift(r, 8), a);
SDValue M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(8, dl, VT));
R = SignBitSelect(Amt, M, R);
// a += a
Amt = DAG.getNode(ISD::ADD, dl, VT, Amt, Amt);
// r = VSELECT(r, shift(r, 4), a);
M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(4, dl, VT));
R = SignBitSelect(Amt, M, R);
// a += a
Amt = DAG.getNode(ISD::ADD, dl, VT, Amt, Amt);
// r = VSELECT(r, shift(r, 2), a);
M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(2, dl, VT));
R = SignBitSelect(Amt, M, R);
// a += a
Amt = DAG.getNode(ISD::ADD, dl, VT, Amt, Amt);
// return VSELECT(r, shift(r, 1), a);
M = DAG.getNode(ShiftOpcode, dl, VT, R, DAG.getConstant(1, dl, VT));
R = SignBitSelect(Amt, M, R);
return R;
}
// Decompose 256-bit shifts into smaller 128-bit shifts.
if (VT.is256BitVector())
return Lower256IntArith(Op, DAG);
return SDValue();
}
static SDValue LowerRotate(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
SDLoc DL(Op);
SDValue R = Op.getOperand(0);
SDValue Amt = Op.getOperand(1);
unsigned Opcode = Op.getOpcode();
unsigned EltSizeInBits = VT.getScalarSizeInBits();
if (Subtarget.hasAVX512()) {
// Attempt to rotate by immediate.
APInt UndefElts;
SmallVector<APInt, 16> EltBits;
if (getTargetConstantBitsFromNode(Amt, EltSizeInBits, UndefElts, EltBits)) {
if (!UndefElts && llvm::all_of(EltBits, [EltBits](APInt &V) {
return EltBits[0] == V;
})) {
unsigned Op = (Opcode == ISD::ROTL ? X86ISD::VROTLI : X86ISD::VROTRI);
uint64_t RotateAmt = EltBits[0].urem(EltSizeInBits);
return DAG.getNode(Op, DL, VT, R,
DAG.getConstant(RotateAmt, DL, MVT::i8));
}
}
// Else, fall-back on VPROLV/VPRORV.
return Op;
}
assert(VT.isVector() && "Custom lowering only for vector rotates!");
assert(Subtarget.hasXOP() && "XOP support required for vector rotates!");
assert((Opcode == ISD::ROTL) && "Only ROTL supported");
// XOP has 128-bit vector variable + immediate rotates.
// +ve/-ve Amt = rotate left/right.
// Split 256-bit integers.
if (VT.is256BitVector())
return Lower256IntArith(Op, DAG);
assert(VT.is128BitVector() && "Only rotate 128-bit vectors!");
// Attempt to rotate by immediate.
if (auto *BVAmt = dyn_cast<BuildVectorSDNode>(Amt)) {
if (auto *RotateConst = BVAmt->getConstantSplatNode()) {
uint64_t RotateAmt = RotateConst->getAPIntValue().getZExtValue();
assert(RotateAmt < EltSizeInBits && "Rotation out of range");
return DAG.getNode(X86ISD::VPROTI, DL, VT, R,
DAG.getConstant(RotateAmt, DL, MVT::i8));
}
}
// Use general rotate by variable (per-element).
return DAG.getNode(X86ISD::VPROT, DL, VT, R, Amt);
}
static SDValue LowerXALUO(SDValue Op, SelectionDAG &DAG) {
// Lower the "add/sub/mul with overflow" instruction into a regular ins plus
// a "setcc" instruction that checks the overflow flag. The "brcond" lowering
// looks for this combo and may remove the "setcc" instruction if the "setcc"
// has only one use.
SDNode *N = Op.getNode();
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
unsigned BaseOp = 0;
X86::CondCode Cond;
SDLoc DL(Op);
switch (Op.getOpcode()) {
default: llvm_unreachable("Unknown ovf instruction!");
case ISD::SADDO:
// An add of one will be selected as an INC. Note that INC doesn't
// set CF, so we can't do this for UADDO.
if (isOneConstant(RHS)) {
BaseOp = X86ISD::INC;
Cond = X86::COND_O;
break;
}
BaseOp = X86ISD::ADD;
Cond = X86::COND_O;
break;
case ISD::UADDO:
BaseOp = X86ISD::ADD;
Cond = X86::COND_B;
break;
case ISD::SSUBO:
// A subtract of one will be selected as a DEC. Note that DEC doesn't
// set CF, so we can't do this for USUBO.
if (isOneConstant(RHS)) {
BaseOp = X86ISD::DEC;
Cond = X86::COND_O;
break;
}
BaseOp = X86ISD::SUB;
Cond = X86::COND_O;
break;
case ISD::USUBO:
BaseOp = X86ISD::SUB;
Cond = X86::COND_B;
break;
case ISD::SMULO:
BaseOp = N->getValueType(0) == MVT::i8 ? X86ISD::SMUL8 : X86ISD::SMUL;
Cond = X86::COND_O;
break;
case ISD::UMULO: { // i64, i8 = umulo lhs, rhs --> i64, i64, i32 umul lhs,rhs
if (N->getValueType(0) == MVT::i8) {
BaseOp = X86ISD::UMUL8;
Cond = X86::COND_O;
break;
}
SDVTList VTs = DAG.getVTList(N->getValueType(0), N->getValueType(0),
MVT::i32);
SDValue Sum = DAG.getNode(X86ISD::UMUL, DL, VTs, LHS, RHS);
SDValue SetCC = getSETCC(X86::COND_O, SDValue(Sum.getNode(), 2), DL, DAG);
if (N->getValueType(1) == MVT::i1)
SetCC = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, SetCC);
return DAG.getNode(ISD::MERGE_VALUES, DL, N->getVTList(), Sum, SetCC);
}
}
// Also sets EFLAGS.
SDVTList VTs = DAG.getVTList(N->getValueType(0), MVT::i32);
SDValue Sum = DAG.getNode(BaseOp, DL, VTs, LHS, RHS);
SDValue SetCC = getSETCC(Cond, SDValue(Sum.getNode(), 1), DL, DAG);
if (N->getValueType(1) == MVT::i1)
SetCC = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, SetCC);
return DAG.getNode(ISD::MERGE_VALUES, DL, N->getVTList(), Sum, SetCC);
}
/// Returns true if the operand type is exactly twice the native width, and
/// the corresponding cmpxchg8b or cmpxchg16b instruction is available.
/// Used to know whether to use cmpxchg8/16b when expanding atomic operations
/// (otherwise we leave them alone to become __sync_fetch_and_... calls).
bool X86TargetLowering::needsCmpXchgNb(Type *MemType) const {
unsigned OpWidth = MemType->getPrimitiveSizeInBits();
if (OpWidth == 64)
return !Subtarget.is64Bit(); // FIXME this should be Subtarget.hasCmpxchg8b
else if (OpWidth == 128)
return Subtarget.hasCmpxchg16b();
else
return false;
}
bool X86TargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
return needsCmpXchgNb(SI->getValueOperand()->getType());
}
// Note: this turns large loads into lock cmpxchg8b/16b.
// FIXME: On 32 bits x86, fild/movq might be faster than lock cmpxchg8b.
TargetLowering::AtomicExpansionKind
X86TargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
auto PTy = cast<PointerType>(LI->getPointerOperandType());
return needsCmpXchgNb(PTy->getElementType()) ? AtomicExpansionKind::CmpXChg
: AtomicExpansionKind::None;
}
TargetLowering::AtomicExpansionKind
X86TargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
unsigned NativeWidth = Subtarget.is64Bit() ? 64 : 32;
Type *MemType = AI->getType();
// If the operand is too big, we must see if cmpxchg8/16b is available
// and default to library calls otherwise.
if (MemType->getPrimitiveSizeInBits() > NativeWidth) {
return needsCmpXchgNb(MemType) ? AtomicExpansionKind::CmpXChg
: AtomicExpansionKind::None;
}
AtomicRMWInst::BinOp Op = AI->getOperation();
switch (Op) {
default:
llvm_unreachable("Unknown atomic operation");
case AtomicRMWInst::Xchg:
case AtomicRMWInst::Add:
case AtomicRMWInst::Sub:
// It's better to use xadd, xsub or xchg for these in all cases.
return AtomicExpansionKind::None;
case AtomicRMWInst::Or:
case AtomicRMWInst::And:
case AtomicRMWInst::Xor:
// If the atomicrmw's result isn't actually used, we can just add a "lock"
// prefix to a normal instruction for these operations.
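// E.g. an 'atomicrmw or' whose result is unused can later be emitted as a
// single LOCK-prefixed instruction, roughly 'lock orl $1, (%rdi)'.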
return !AI->use_empty() ? AtomicExpansionKind::CmpXChg
: AtomicExpansionKind::None;
case AtomicRMWInst::Nand:
case AtomicRMWInst::Max:
case AtomicRMWInst::Min:
case AtomicRMWInst::UMax:
case AtomicRMWInst::UMin:
// These always require a non-trivial set of data operations on x86. We must
// use a cmpxchg loop.
return AtomicExpansionKind::CmpXChg;
}
}
LoadInst *
X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
unsigned NativeWidth = Subtarget.is64Bit() ? 64 : 32;
Type *MemType = AI->getType();
// Accesses larger than the native width are turned into cmpxchg/libcalls, so
// there is no benefit in turning such RMWs into loads, and it is actually
// harmful as it introduces an mfence.
if (MemType->getPrimitiveSizeInBits() > NativeWidth)
return nullptr;
auto Builder = IRBuilder<>(AI);
Module *M = Builder.GetInsertBlock()->getParent()->getParent();
auto SSID = AI->getSyncScopeID();
// We must restrict the ordering to avoid generating loads with Release or
// ReleaseAcquire orderings.
auto Order = AtomicCmpXchgInst::getStrongestFailureOrdering(AI->getOrdering());
auto Ptr = AI->getPointerOperand();
// Before the load we need a fence. Here is an example lifted from
// http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf showing why a fence
// is required:
// Thread 0:
// x.store(1, relaxed);
// r1 = y.fetch_add(0, release);
// Thread 1:
// y.fetch_add(42, acquire);
// r2 = x.load(relaxed);
// r1 = r2 = 0 is impossible, but becomes possible if the idempotent rmw is
// lowered to just a load without a fence. A mfence flushes the store buffer,
// making the optimization clearly correct.
// FIXME: it is required if isReleaseOrStronger(Order), but it is not clear
// otherwise; we might be able to be more aggressive on relaxed idempotent
// RMWs. In practice, they do not look useful, so we don't try to be
// especially clever.
if (SSID == SyncScope::SingleThread)
// FIXME: we could just insert an X86ISD::MEMBARRIER here, except we are at
// the IR level, so we must wrap it in an intrinsic.
return nullptr;
if (!Subtarget.hasMFence())
// FIXME: it might make sense to use a locked operation here but on a
// different cache-line to prevent cache-line bouncing. In practice it
// is probably a small win, and x86 processors without mfence are rare
// enough that we do not bother.
return nullptr;
Function *MFence =
llvm::Intrinsic::getDeclaration(M, Intrinsic::x86_sse2_mfence);
Builder.CreateCall(MFence, {});
// Finally we can emit the atomic load.
LoadInst *Loaded = Builder.CreateAlignedLoad(Ptr,
AI->getType()->getPrimitiveSizeInBits());
Loaded->setAtomic(Order, SSID);
AI->replaceAllUsesWith(Loaded);
AI->eraseFromParent();
return Loaded;
}
static SDValue LowerATOMIC_FENCE(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc dl(Op);
AtomicOrdering FenceOrdering = static_cast<AtomicOrdering>(
cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue());
SyncScope::ID FenceSSID = static_cast<SyncScope::ID>(
cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue());
// The only fence that needs an instruction is a sequentially-consistent
// cross-thread fence.
if (FenceOrdering == AtomicOrdering::SequentiallyConsistent &&
FenceSSID == SyncScope::System) {
if (Subtarget.hasMFence())
return DAG.getNode(X86ISD::MFENCE, dl, MVT::Other, Op.getOperand(0));
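// Without MFENCE, fall back to a LOCK-prefixed OR of zero into the top of
// the stack; any LOCK-prefixed instruction acts as a full memory barrier
// on x86.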
SDValue Chain = Op.getOperand(0);
SDValue Zero = DAG.getConstant(0, dl, MVT::i32);
SDValue Ops[] = {
DAG.getRegister(X86::ESP, MVT::i32), // Base
DAG.getTargetConstant(1, dl, MVT::i8), // Scale
DAG.getRegister(0, MVT::i32), // Index
DAG.getTargetConstant(0, dl, MVT::i32), // Disp
DAG.getRegister(0, MVT::i32), // Segment.
Zero,
Chain
};
SDNode *Res = DAG.getMachineNode(X86::OR32mrLocked, dl, MVT::Other, Ops);
return SDValue(Res, 0);
}
// MEMBARRIER is a compiler barrier; it codegens to a no-op.
return DAG.getNode(X86ISD::MEMBARRIER, dl, MVT::Other, Op.getOperand(0));
}
static SDValue LowerCMP_SWAP(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT T = Op.getSimpleValueType();
SDLoc DL(Op);
unsigned Reg = 0;
unsigned size = 0;
switch(T.SimpleTy) {
default: llvm_unreachable("Invalid value type!");
case MVT::i8: Reg = X86::AL; size = 1; break;
case MVT::i16: Reg = X86::AX; size = 2; break;
case MVT::i32: Reg = X86::EAX; size = 4; break;
case MVT::i64:
assert(Subtarget.is64Bit() && "Node not type legal!");
Reg = X86::RAX; size = 8;
break;
}
SDValue cpIn = DAG.getCopyToReg(Op.getOperand(0), DL, Reg,
Op.getOperand(2), SDValue());
SDValue Ops[] = { cpIn.getValue(0),
Op.getOperand(1),
Op.getOperand(3),
DAG.getTargetConstant(size, DL, MVT::i8),
cpIn.getValue(1) };
SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
MachineMemOperand *MMO = cast<AtomicSDNode>(Op)->getMemOperand();
SDValue Result = DAG.getMemIntrinsicNode(X86ISD::LCMPXCHG_DAG, DL, Tys,
Ops, T, MMO);
SDValue cpOut =
DAG.getCopyFromReg(Result.getValue(0), DL, Reg, T, Result.getValue(1));
SDValue EFLAGS = DAG.getCopyFromReg(cpOut.getValue(1), DL, X86::EFLAGS,
MVT::i32, cpOut.getValue(2));
SDValue Success = getSETCC(X86::COND_E, EFLAGS, DL, DAG);
DAG.ReplaceAllUsesOfValueWith(Op.getValue(0), cpOut);
DAG.ReplaceAllUsesOfValueWith(Op.getValue(1), Success);
DAG.ReplaceAllUsesOfValueWith(Op.getValue(2), EFLAGS.getValue(1));
return SDValue();
}
static SDValue LowerBITCAST(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT SrcVT = Op.getOperand(0).getSimpleValueType();
MVT DstVT = Op.getSimpleValueType();
if (SrcVT == MVT::v2i32 || SrcVT == MVT::v4i16 || SrcVT == MVT::v8i8 ||
SrcVT == MVT::i64) {
assert(Subtarget.hasSSE2() && "Requires at least SSE2!");
if (DstVT != MVT::f64)
// This conversion needs to be expanded.
return SDValue();
SDValue Op0 = Op->getOperand(0);
SmallVector<SDValue, 16> Elts;
SDLoc dl(Op);
unsigned NumElts;
MVT SVT;
if (SrcVT.isVector()) {
NumElts = SrcVT.getVectorNumElements();
SVT = SrcVT.getVectorElementType();
// Widen the input vector in the case of MVT::v2i32.
// Example: from MVT::v2i32 to MVT::v4i32.
for (unsigned i = 0, e = NumElts; i != e; ++i)
Elts.push_back(DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, SVT, Op0,
DAG.getIntPtrConstant(i, dl)));
} else {
assert(SrcVT == MVT::i64 && !Subtarget.is64Bit() &&
"Unexpected source type in LowerBITCAST");
Elts.push_back(DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op0,
DAG.getIntPtrConstant(0, dl)));
Elts.push_back(DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op0,
DAG.getIntPtrConstant(1, dl)));
NumElts = 2;
SVT = MVT::i32;
}
// Explicitly mark the extra elements as Undef.
Elts.append(NumElts, DAG.getUNDEF(SVT));
EVT NewVT = EVT::getVectorVT(*DAG.getContext(), SVT, NumElts * 2);
SDValue BV = DAG.getBuildVector(NewVT, dl, Elts);
SDValue ToV2F64 = DAG.getBitcast(MVT::v2f64, BV);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, ToV2F64,
DAG.getIntPtrConstant(0, dl));
}
assert(Subtarget.is64Bit() && !Subtarget.hasSSE2() &&
Subtarget.hasMMX() && "Unexpected custom BITCAST");
assert((DstVT == MVT::i64 ||
(DstVT.isVector() && DstVT.getSizeInBits()==64)) &&
"Unexpected custom BITCAST");
// i64 <=> MMX conversions are Legal.
if (SrcVT==MVT::i64 && DstVT.isVector())
return Op;
if (DstVT==MVT::i64 && SrcVT.isVector())
return Op;
// MMX <=> MMX conversions are Legal.
if (SrcVT.isVector() && DstVT.isVector())
return Op;
// All other conversions need to be expanded.
return SDValue();
}
/// Compute the horizontal sum of bytes in V for the elements of VT.
///
/// Requires V to be a byte vector and VT to be an integer vector type with
/// wider elements than V's type. The width of the elements of VT determines
/// how many bytes of V are summed horizontally to produce each element of the
/// result.
static SDValue LowerHorizontalByteSum(SDValue V, MVT VT,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDLoc DL(V);
MVT ByteVecVT = V.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
assert(ByteVecVT.getVectorElementType() == MVT::i8 &&
"Expected value to have byte element type.");
assert(EltVT != MVT::i8 &&
"Horizontal byte sum only makes sense for wider elements!");
unsigned VecSize = VT.getSizeInBits();
assert(ByteVecVT.getSizeInBits() == VecSize && "Cannot change vector size!");
// The PSADBW instruction horizontally adds all bytes and leaves the result
// in i64 chunks, thus directly computing the pop count for v2i64 and v4i64.
if (EltVT == MVT::i64) {
SDValue Zeros = getZeroVector(ByteVecVT, Subtarget, DAG, DL);
MVT SadVecVT = MVT::getVectorVT(MVT::i64, VecSize / 64);
V = DAG.getNode(X86ISD::PSADBW, DL, SadVecVT, V, Zeros);
return DAG.getBitcast(VT, V);
}
if (EltVT == MVT::i32) {
// We unpack the low half and high half into i32s interleaved with zeros so
// that we can use PSADBW to horizontally sum them. The most useful part of
// this is that it lines up the results of two PSADBW instructions to be
// two v2i64 vectors which, concatenated, are the 4 population counts. We can
// then use PACKUSWB to shrink and concatenate them into a v4i32 again.
SDValue Zeros = getZeroVector(VT, Subtarget, DAG, DL);
SDValue V32 = DAG.getBitcast(VT, V);
SDValue Low = DAG.getNode(X86ISD::UNPCKL, DL, VT, V32, Zeros);
SDValue High = DAG.getNode(X86ISD::UNPCKH, DL, VT, V32, Zeros);
// Do the horizontal sums into two v2i64s.
Zeros = getZeroVector(ByteVecVT, Subtarget, DAG, DL);
MVT SadVecVT = MVT::getVectorVT(MVT::i64, VecSize / 64);
Low = DAG.getNode(X86ISD::PSADBW, DL, SadVecVT,
DAG.getBitcast(ByteVecVT, Low), Zeros);
High = DAG.getNode(X86ISD::PSADBW, DL, SadVecVT,
DAG.getBitcast(ByteVecVT, High), Zeros);
// Merge them together.
MVT ShortVecVT = MVT::getVectorVT(MVT::i16, VecSize / 16);
V = DAG.getNode(X86ISD::PACKUS, DL, ByteVecVT,
DAG.getBitcast(ShortVecVT, Low),
DAG.getBitcast(ShortVecVT, High));
return DAG.getBitcast(VT, V);
}
// The only element type left is i16.
assert(EltVT == MVT::i16 && "Unknown how to handle type");
// To obtain pop count for each i16 element starting from the pop count for
// i8 elements, shift the i16s left by 8, sum as i8s, and then shift as i16s
// right by 8. It is important to shift as i16s as i8 vector shift isn't
// directly supported.
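// E.g. viewing each i16 as the byte pair (lo, hi) of per-byte counts: the
// shl gives (0, lo), the byte-wise add gives (lo, lo + hi), and the final
// i16 shift right by 8 leaves lo + hi, the i16 pop count, in the low byte.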
SDValue ShifterV = DAG.getConstant(8, DL, VT);
SDValue Shl = DAG.getNode(ISD::SHL, DL, VT, DAG.getBitcast(VT, V), ShifterV);
V = DAG.getNode(ISD::ADD, DL, ByteVecVT, DAG.getBitcast(ByteVecVT, Shl),
DAG.getBitcast(ByteVecVT, V));
return DAG.getNode(ISD::SRL, DL, VT, DAG.getBitcast(VT, V), ShifterV);
}
static SDValue LowerVectorCTPOPInRegLUT(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
unsigned VecSize = VT.getSizeInBits();
// Implement a lookup table in register by using an algorithm based on:
// http://wm.ite.pl/articles/sse-popcount.html
//
// The general idea is that each nibble of every byte in the input vector is
// an index into an in-register pre-computed pop count table. We then split
// the input vector into two new ones: (1) a vector with only the
// shifted-right higher nibbles for each byte and (2) a vector with the lower
// nibbles (and masked-out higher ones) for each byte. PSHUFB is used
// separately with both to index the in-register table. Next, both are added
// and the result is an i8 vector where each element contains the pop count
// for its input byte.
//
// To obtain the pop count for elements != i8, we follow up with the same
// approach and use additional tricks as described below.
//
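// Worked example for the byte 0xA7 (0b10100111): the high nibble 0xA indexes
// LUT entry 2 and the low nibble 0x7 indexes LUT entry 3, giving the correct
// pop count 2 + 3 == 5.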
const int LUT[16] = {/* 0 */ 0, /* 1 */ 1, /* 2 */ 1, /* 3 */ 2,
/* 4 */ 1, /* 5 */ 2, /* 6 */ 2, /* 7 */ 3,
/* 8 */ 1, /* 9 */ 2, /* a */ 2, /* b */ 3,
/* c */ 2, /* d */ 3, /* e */ 3, /* f */ 4};
int NumByteElts = VecSize / 8;
MVT ByteVecVT = MVT::getVectorVT(MVT::i8, NumByteElts);
SDValue In = DAG.getBitcast(ByteVecVT, Op);
SmallVector<SDValue, 64> LUTVec;
for (int i = 0; i < NumByteElts; ++i)
LUTVec.push_back(DAG.getConstant(LUT[i % 16], DL, MVT::i8));
SDValue InRegLUT = DAG.getBuildVector(ByteVecVT, DL, LUTVec);
SDValue M0F = DAG.getConstant(0x0F, DL, ByteVecVT);
// High nibbles
SDValue FourV = DAG.getConstant(4, DL, ByteVecVT);
SDValue HighNibbles = DAG.getNode(ISD::SRL, DL, ByteVecVT, In, FourV);
// Low nibbles
SDValue LowNibbles = DAG.getNode(ISD::AND, DL, ByteVecVT, In, M0F);
// The input vector is used as the shuffle mask that indexes elements into
// the LUT. After counting low and high nibbles, add the two results to
// obtain the final pop count per i8 element.
SDValue HighPopCnt =
DAG.getNode(X86ISD::PSHUFB, DL, ByteVecVT, InRegLUT, HighNibbles);
SDValue LowPopCnt =
DAG.getNode(X86ISD::PSHUFB, DL, ByteVecVT, InRegLUT, LowNibbles);
SDValue PopCnt = DAG.getNode(ISD::ADD, DL, ByteVecVT, HighPopCnt, LowPopCnt);
if (EltVT == MVT::i8)
return PopCnt;
return LowerHorizontalByteSum(PopCnt, VT, Subtarget, DAG);
}
static SDValue LowerVectorCTPOPBitmath(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
assert(VT.is128BitVector() &&
"Only 128-bit vector bitmath lowering supported.");
int VecSize = VT.getSizeInBits();
MVT EltVT = VT.getVectorElementType();
int Len = EltVT.getSizeInBits();
// This is the vectorized version of the "best" algorithm from
// http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
// with a minor tweak to use a series of adds + shifts instead of vector
// multiplications. Implemented for all integer vector types. We only use
// this when we don't have SSSE3 which allows a LUT-based lowering that is
// much faster, even faster than using native popcnt instructions.
auto GetShift = [&](unsigned OpCode, SDValue V, int Shifter) {
MVT VT = V.getSimpleValueType();
SDValue ShifterV = DAG.getConstant(Shifter, DL, VT);
return DAG.getNode(OpCode, DL, VT, V, ShifterV);
};
auto GetMask = [&](SDValue V, APInt Mask) {
MVT VT = V.getSimpleValueType();
SDValue MaskV = DAG.getConstant(Mask, DL, VT);
return DAG.getNode(ISD::AND, DL, VT, V, MaskV);
};
// We don't want to incur the implicit masks required to SRL vNi8 vectors on
// x86, so set the SRL type to have elements at least i16 wide. This is
// correct because all of our SRLs are followed immediately by a mask anyway
// that handles any bits that sneak into the high bits of the byte elements.
MVT SrlVT = Len > 8 ? VT : MVT::getVectorVT(MVT::i16, VecSize / 16);
SDValue V = Op;
// v = v - ((v >> 1) & 0x55555555...)
SDValue Srl =
DAG.getBitcast(VT, GetShift(ISD::SRL, DAG.getBitcast(SrlVT, V), 1));
SDValue And = GetMask(Srl, APInt::getSplat(Len, APInt(8, 0x55)));
V = DAG.getNode(ISD::SUB, DL, VT, V, And);
// v = (v & 0x33333333...) + ((v >> 2) & 0x33333333...)
SDValue AndLHS = GetMask(V, APInt::getSplat(Len, APInt(8, 0x33)));
Srl = DAG.getBitcast(VT, GetShift(ISD::SRL, DAG.getBitcast(SrlVT, V), 2));
SDValue AndRHS = GetMask(Srl, APInt::getSplat(Len, APInt(8, 0x33)));
V = DAG.getNode(ISD::ADD, DL, VT, AndLHS, AndRHS);
// v = (v + (v >> 4)) & 0x0F0F0F0F...
Srl = DAG.getBitcast(VT, GetShift(ISD::SRL, DAG.getBitcast(SrlVT, V), 4));
SDValue Add = DAG.getNode(ISD::ADD, DL, VT, V, Srl);
V = GetMask(Add, APInt::getSplat(Len, APInt(8, 0x0F)));
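// Worked example on the byte 0xFF: step 1 gives 0xFF - 0x55 == 0xAA, step 2
// gives 0x22 + 0x22 == 0x44, and step 3 gives (0x44 + 0x04) & 0x0F == 8,
// the pop count of 0xFF.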
// At this point, V contains the byte-wise population count, and we are
// merely doing a horizontal sum if necessary to get the wider element
// counts.
if (EltVT == MVT::i8)
return V;
return LowerHorizontalByteSum(
DAG.getBitcast(MVT::getVectorVT(MVT::i8, VecSize / 8), V), VT, Subtarget,
DAG);
}
// Please ensure that any codegen change from LowerVectorCTPOP is reflected in
// updated cost models in X86TTIImpl::getIntrinsicInstrCost.
static SDValue LowerVectorCTPOP(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
assert((VT.is512BitVector() || VT.is256BitVector() || VT.is128BitVector()) &&
"Unknown CTPOP type to handle");
SDLoc DL(Op.getNode());
SDValue Op0 = Op.getOperand(0);
// TRUNC(CTPOP(ZEXT(X))) to make use of vXi32/vXi64 VPOPCNT instructions.
if (Subtarget.hasVPOPCNTDQ()) {
if (VT == MVT::v8i16) {
Op = DAG.getNode(X86ISD::VZEXT, DL, MVT::v8i64, Op0);
Op = DAG.getNode(ISD::CTPOP, DL, MVT::v8i64, Op);
return DAG.getNode(X86ISD::VTRUNC, DL, VT, Op);
}
if (VT == MVT::v16i8 || VT == MVT::v16i16) {
Op = DAG.getNode(X86ISD::VZEXT, DL, MVT::v16i32, Op0);
Op = DAG.getNode(ISD::CTPOP, DL, MVT::v16i32, Op);
return DAG.getNode(X86ISD::VTRUNC, DL, VT, Op);
}
}
if (!Subtarget.hasSSSE3()) {
// We can't use the fast LUT approach, so fall back on vectorized bitmath.
assert(VT.is128BitVector() && "Only 128-bit vectors supported in SSE!");
return LowerVectorCTPOPBitmath(Op0, DL, Subtarget, DAG);
}
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntUnary(Op, DAG);
// Decompose 512-bit ops into smaller 256-bit ops.
if (VT.is512BitVector() && !Subtarget.hasBWI())
return Lower512IntUnary(Op, DAG);
return LowerVectorCTPOPInRegLUT(Op0, DL, Subtarget, DAG);
}
static SDValue LowerCTPOP(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Op.getSimpleValueType().isVector() &&
"We only do custom lowering for vector population count.");
return LowerVectorCTPOP(Op, Subtarget, DAG);
}
static SDValue LowerBITREVERSE_XOP(SDValue Op, SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
SDLoc DL(Op);
// For scalars, it's still beneficial to transfer to/from the SIMD unit to
// perform the BITREVERSE.
if (!VT.isVector()) {
MVT VecVT = MVT::getVectorVT(VT, 128 / VT.getSizeInBits());
SDValue Res = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, In);
Res = DAG.getNode(ISD::BITREVERSE, DL, VecVT, Res);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Res,
DAG.getIntPtrConstant(0, DL));
}
int NumElts = VT.getVectorNumElements();
int ScalarSizeInBytes = VT.getScalarSizeInBits() / 8;
// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector())
return Lower256IntUnary(Op, DAG);
assert(VT.is128BitVector() &&
"Only 128-bit vector bitreverse lowering supported.");
// VPPERM reverses the bits of a byte with the permute Op (2 << 5), and we
// perform the BSWAP in the shuffle.
// It's best to shuffle using the second operand as this will implicitly allow
// memory folding for multiple vectors.
SmallVector<SDValue, 16> MaskElts;
for (int i = 0; i != NumElts; ++i) {
for (int j = ScalarSizeInBytes - 1; j >= 0; --j) {
int SourceByte = 16 + (i * ScalarSizeInBytes) + j;
int PermuteByte = SourceByte | (2 << 5);
MaskElts.push_back(DAG.getConstant(PermuteByte, DL, MVT::i8));
}
}
SDValue Mask = DAG.getBuildVector(MVT::v16i8, DL, MaskElts);
SDValue Res = DAG.getBitcast(MVT::v16i8, In);
Res = DAG.getNode(X86ISD::VPPERM, DL, MVT::v16i8, DAG.getUNDEF(MVT::v16i8),
Res, Mask);
return DAG.getBitcast(VT, Res);
}
static SDValue LowerBITREVERSE(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
if (Subtarget.hasXOP())
return LowerBITREVERSE_XOP(Op, DAG);
assert(Subtarget.hasSSSE3() && "SSSE3 required for BITREVERSE");
MVT VT = Op.getSimpleValueType();
SDValue In = Op.getOperand(0);
SDLoc DL(Op);
unsigned NumElts = VT.getVectorNumElements();
assert(VT.getScalarType() == MVT::i8 &&
"Only byte vector BITREVERSE supported");
// Decompose 256-bit ops into smaller 128-bit ops on pre-AVX2.
if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntUnary(Op, DAG);
// Perform BITREVERSE using PSHUFB lookups. Each byte is split into
// two nibbles and a PSHUFB lookup to find the bitreverse of each
// 0-15 value (moved to the other nibble).
SDValue NibbleMask = DAG.getConstant(0xF, DL, VT);
SDValue Lo = DAG.getNode(ISD::AND, DL, VT, In, NibbleMask);
SDValue Hi = DAG.getNode(ISD::SRL, DL, VT, In, DAG.getConstant(4, DL, VT));
const int LoLUT[16] = {
/* 0 */ 0x00, /* 1 */ 0x80, /* 2 */ 0x40, /* 3 */ 0xC0,
/* 4 */ 0x20, /* 5 */ 0xA0, /* 6 */ 0x60, /* 7 */ 0xE0,
/* 8 */ 0x10, /* 9 */ 0x90, /* a */ 0x50, /* b */ 0xD0,
/* c */ 0x30, /* d */ 0xB0, /* e */ 0x70, /* f */ 0xF0};
const int HiLUT[16] = {
/* 0 */ 0x00, /* 1 */ 0x08, /* 2 */ 0x04, /* 3 */ 0x0C,
/* 4 */ 0x02, /* 5 */ 0x0A, /* 6 */ 0x06, /* 7 */ 0x0E,
/* 8 */ 0x01, /* 9 */ 0x09, /* a */ 0x05, /* b */ 0x0D,
/* c */ 0x03, /* d */ 0x0B, /* e */ 0x07, /* f */ 0x0F};
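// E.g. the byte 0xB4 (0b10110100): LoLUT[0x4] == 0x20 (low nibble reversed
// into the high nibble) and HiLUT[0xB] == 0x0D (high nibble reversed into
// the low nibble), so the final OR yields 0x2D == 0b00101101, the
// bit-reversal of 0xB4.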
SmallVector<SDValue, 16> LoMaskElts, HiMaskElts;
for (unsigned i = 0; i < NumElts; ++i) {
LoMaskElts.push_back(DAG.getConstant(LoLUT[i % 16], DL, MVT::i8));
HiMaskElts.push_back(DAG.getConstant(HiLUT[i % 16], DL, MVT::i8));
}
SDValue LoMask = DAG.getBuildVector(VT, DL, LoMaskElts);
SDValue HiMask = DAG.getBuildVector(VT, DL, HiMaskElts);
Lo = DAG.getNode(X86ISD::PSHUFB, DL, VT, LoMask, Lo);
Hi = DAG.getNode(X86ISD::PSHUFB, DL, VT, HiMask, Hi);
return DAG.getNode(ISD::OR, DL, VT, Lo, Hi);
}
static SDValue lowerAtomicArithWithLOCK(SDValue N, SelectionDAG &DAG) {
unsigned NewOpc = 0;
switch (N->getOpcode()) {
case ISD::ATOMIC_LOAD_ADD:
NewOpc = X86ISD::LADD;
break;
case ISD::ATOMIC_LOAD_SUB:
NewOpc = X86ISD::LSUB;
break;
case ISD::ATOMIC_LOAD_OR:
NewOpc = X86ISD::LOR;
break;
case ISD::ATOMIC_LOAD_XOR:
NewOpc = X86ISD::LXOR;
break;
case ISD::ATOMIC_LOAD_AND:
NewOpc = X86ISD::LAND;
break;
default:
llvm_unreachable("Unknown ATOMIC_LOAD_ opcode");
}
MachineMemOperand *MMO = cast<MemSDNode>(N)->getMemOperand();
return DAG.getMemIntrinsicNode(
NewOpc, SDLoc(N), DAG.getVTList(MVT::i32, MVT::Other),
{N->getOperand(0), N->getOperand(1), N->getOperand(2)},
/*MemVT=*/N->getSimpleValueType(0), MMO);
}
/// Lower atomic_load_ops into LOCK-prefixed operations.
static SDValue lowerAtomicArith(SDValue N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue Chain = N->getOperand(0);
SDValue LHS = N->getOperand(1);
SDValue RHS = N->getOperand(2);
unsigned Opc = N->getOpcode();
MVT VT = N->getSimpleValueType(0);
SDLoc DL(N);
// We can lower atomic_load_add into LXADD. However, any other atomicrmw op
// can only be lowered when the result is unused. They should have already
// been transformed into a cmpxchg loop in AtomicExpand.
if (N->hasAnyUseOfValue(0)) {
// Handle (atomic_load_sub p, v) as (atomic_load_add p, -v), to be able to
// select LXADD if LOCK_SUB can't be selected.
if (Opc == ISD::ATOMIC_LOAD_SUB) {
AtomicSDNode *AN = cast<AtomicSDNode>(N.getNode());
RHS = DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), RHS);
return DAG.getAtomic(ISD::ATOMIC_LOAD_ADD, DL, VT, Chain, LHS,
RHS, AN->getMemOperand());
}
assert(Opc == ISD::ATOMIC_LOAD_ADD &&
"Used AtomicRMW ops other than Add should have been expanded!");
return N;
}
SDValue LockOp = lowerAtomicArithWithLOCK(N, DAG);
// RAUW the chain, but don't worry about the result, as it's unused.
assert(!N->hasAnyUseOfValue(0));
DAG.ReplaceAllUsesOfValueWith(N.getValue(1), LockOp.getValue(1));
return SDValue();
}
static SDValue LowerATOMIC_STORE(SDValue Op, SelectionDAG &DAG) {
SDNode *Node = Op.getNode();
SDLoc dl(Node);
EVT VT = cast<AtomicSDNode>(Node)->getMemoryVT();
// Convert seq_cst store -> xchg
// Convert wide store -> swap (-> cmpxchg8b/cmpxchg16b)
// FIXME: On 32-bit, store -> fist or movq would be more efficient
// (The only way to get a 16-byte store is cmpxchg16b)
// FIXME: 16-byte ATOMIC_SWAP isn't actually hooked up at the moment.
if (cast<AtomicSDNode>(Node)->getOrdering() ==
AtomicOrdering::SequentiallyConsistent ||
!DAG.getTargetLoweringInfo().isTypeLegal(VT)) {
SDValue Swap = DAG.getAtomic(ISD::ATOMIC_SWAP, dl,
cast<AtomicSDNode>(Node)->getMemoryVT(),
Node->getOperand(0),
Node->getOperand(1), Node->getOperand(2),
cast<AtomicSDNode>(Node)->getMemOperand());
return Swap.getValue(1);
}
// Other atomic stores have a simple pattern.
return Op;
}
static SDValue LowerADDSUBCARRY(SDValue Op, SelectionDAG &DAG) {
SDNode *N = Op.getNode();
MVT VT = N->getSimpleValueType(0);
// Let legalize expand this if it isn't a legal type yet.
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
SDVTList VTs = DAG.getVTList(VT, MVT::i32);
SDLoc DL(N);
// Set the carry flag.
SDValue Carry = Op.getOperand(2);
EVT CarryVT = Carry.getValueType();
APInt NegOne = APInt::getAllOnesValue(CarryVT.getScalarSizeInBits());
Carry = DAG.getNode(X86ISD::ADD, DL, DAG.getVTList(CarryVT, MVT::i32),
Carry, DAG.getConstant(NegOne, DL, CarryVT));
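// Adding all-ones to the carry operand materializes it into EFLAGS: the add
// overflows (setting CF) exactly when the incoming carry is non-zero.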
unsigned Opc = Op.getOpcode() == ISD::ADDCARRY ? X86ISD::ADC : X86ISD::SBB;
SDValue Sum = DAG.getNode(Opc, DL, VTs, Op.getOperand(0),
Op.getOperand(1), Carry.getValue(1));
SDValue SetCC = getSETCC(X86::COND_B, Sum.getValue(1), DL, DAG);
if (N->getValueType(1) == MVT::i1)
SetCC = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, SetCC);
return DAG.getNode(ISD::MERGE_VALUES, DL, N->getVTList(), Sum, SetCC);
}
static SDValue LowerFSINCOS(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.isTargetDarwin() && Subtarget.is64Bit());
// For MacOSX, we want to call an alternative entry point: __sincos_stret,
// which returns the values as { float, float } (packed in XMM0) or
// { double, double } (returned in XMM0 and XMM1).
SDLoc dl(Op);
SDValue Arg = Op.getOperand(0);
EVT ArgVT = Arg.getValueType();
Type *ArgTy = ArgVT.getTypeForEVT(*DAG.getContext());
TargetLowering::ArgListTy Args;
TargetLowering::ArgListEntry Entry;
Entry.Node = Arg;
Entry.Ty = ArgTy;
Entry.IsSExt = false;
Entry.IsZExt = false;
Args.push_back(Entry);
bool isF64 = ArgVT == MVT::f64;
// Only optimize x86_64 for now. i386 is a bit messy. For f32,
// the small struct {f32, f32} is returned in (eax, edx). For f64,
// the results are returned via SRet in memory.
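// Illustrative signatures as modeled by this lowering (Darwin x86_64):
//   { double, double } __sincos_stret(double)  // f64: XMM0 / XMM1
//   <4 x float>        __sincosf_stret(float)  // f32: sin/cos in lanes 0/1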
const char *LibcallName = isF64 ? "__sincos_stret" : "__sincosf_stret";
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue Callee =
DAG.getExternalSymbol(LibcallName, TLI.getPointerTy(DAG.getDataLayout()));
Type *RetTy = isF64 ? (Type *)StructType::get(ArgTy, ArgTy)
: (Type *)VectorType::get(ArgTy, 4);
TargetLowering::CallLoweringInfo CLI(DAG);
CLI.setDebugLoc(dl)
.setChain(DAG.getEntryNode())
.setLibCallee(CallingConv::C, RetTy, Callee, std::move(Args));
std::pair<SDValue, SDValue> CallResult = TLI.LowerCallTo(CLI);
if (isF64)
// Returned in xmm0 and xmm1.
return CallResult.first;
// Returned in bits 0:31 and 32:63 of xmm0.
SDValue SinVal = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, ArgVT,
CallResult.first, DAG.getIntPtrConstant(0, dl));
SDValue CosVal = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, ArgVT,
CallResult.first, DAG.getIntPtrConstant(1, dl));
SDVTList Tys = DAG.getVTList(ArgVT, ArgVT);
return DAG.getNode(ISD::MERGE_VALUES, dl, Tys, SinVal, CosVal);
}
/// Widen a vector input to a vector of NVT. The
/// input vector must have the same element type as NVT.
static SDValue ExtendToType(SDValue InOp, MVT NVT, SelectionDAG &DAG,
bool FillWithZeroes = false) {
// Check if InOp already has the right width.
MVT InVT = InOp.getSimpleValueType();
if (InVT == NVT)
return InOp;
if (InOp.isUndef())
return DAG.getUNDEF(NVT);
assert(InVT.getVectorElementType() == NVT.getVectorElementType() &&
"input and widen element type must match");
unsigned InNumElts = InVT.getVectorNumElements();
unsigned WidenNumElts = NVT.getVectorNumElements();
assert(WidenNumElts > InNumElts && WidenNumElts % InNumElts == 0 &&
"Unexpected request for vector widening");
SDLoc dl(InOp);
if (InOp.getOpcode() == ISD::CONCAT_VECTORS &&
InOp.getNumOperands() == 2) {
SDValue N1 = InOp.getOperand(1);
if ((ISD::isBuildVectorAllZeros(N1.getNode()) && FillWithZeroes) ||
N1.isUndef()) {
InOp = InOp.getOperand(0);
InVT = InOp.getSimpleValueType();
InNumElts = InVT.getVectorNumElements();
}
}
if (ISD::isBuildVectorOfConstantSDNodes(InOp.getNode()) ||
ISD::isBuildVectorOfConstantFPSDNodes(InOp.getNode())) {
SmallVector<SDValue, 16> Ops;
for (unsigned i = 0; i < InNumElts; ++i)
Ops.push_back(InOp.getOperand(i));
EVT EltVT = InOp.getOperand(0).getValueType();
SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, EltVT) :
DAG.getUNDEF(EltVT);
for (unsigned i = 0; i < WidenNumElts - InNumElts; ++i)
Ops.push_back(FillVal);
return DAG.getBuildVector(NVT, dl, Ops);
}
SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, NVT) :
DAG.getUNDEF(NVT);
return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, NVT, FillVal,
InOp, DAG.getIntPtrConstant(0, dl));
}
static SDValue LowerMSCATTER(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX512() &&
"MGATHER/MSCATTER are supported on AVX-512 arch only");
// The X86 scatter instruction kills the mask register, so its type must be
// added to the list of return values.
// If the "scatter" already has 2 return values, it has been handled.
if (Op.getNode()->getNumValues() == 2)
return Op;
MaskedScatterSDNode *N = cast<MaskedScatterSDNode>(Op.getNode());
SDValue Src = N->getValue();
MVT VT = Src.getSimpleValueType();
assert(VT.getScalarSizeInBits() >= 32 && "Unsupported scatter op");
SDLoc dl(Op);
SDValue NewScatter;
SDValue Index = N->getIndex();
SDValue Mask = N->getMask();
SDValue Chain = N->getChain();
SDValue BasePtr = N->getBasePtr();
MVT MemVT = N->getMemoryVT().getSimpleVT();
MVT IndexVT = Index.getSimpleValueType();
MVT MaskVT = Mask.getSimpleValueType();
if (MemVT.getScalarSizeInBits() < VT.getScalarSizeInBits()) {
// The v2i32 value was promoted to v2i64.
// Now we "redo" the type legalizer's work and widen the original
// v2i32 value to v4i32. The original v2i32 is retrieved from v2i64
// with a shuffle.
assert((MemVT == MVT::v2i32 && VT == MVT::v2i64) &&
"Unexpected memory type");
int ShuffleMask[] = {0, 2, -1, -1};
Src = DAG.getVectorShuffle(MVT::v4i32, dl, DAG.getBitcast(MVT::v4i32, Src),
DAG.getUNDEF(MVT::v4i32), ShuffleMask);
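// On little-endian x86, the v4i32 view of the promoted v2i64 keeps the
// original i32 values in lanes 0 and 2, so {0, 2, -1, -1} compacts them
// into the low half.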
// Now we have 4 elements instead of 2.
// Expand the index.
MVT NewIndexVT = MVT::getVectorVT(IndexVT.getScalarType(), 4);
Index = ExtendToType(Index, NewIndexVT, DAG);
// Expand the mask with zeroes. The mask may be <2 x i64> or <2 x i1> at
// this point.
assert((MaskVT == MVT::v2i1 || MaskVT == MVT::v2i64) &&
"Unexpected mask type");
MVT ExtMaskVT = MVT::getVectorVT(MaskVT.getScalarType(), 4);
Mask = ExtendToType(Mask, ExtMaskVT, DAG, true);
VT = MVT::v4i32;
}
unsigned NumElts = VT.getVectorNumElements();
if (!Subtarget.hasVLX() && !VT.is512BitVector() &&
!Index.getSimpleValueType().is512BitVector()) {
// AVX512F supports only 512-bit vectors; either the data or the index
// must be 512 bits wide. If both the index and the data are currently
// 256-bit but the vector holds 8 elements, we just sign-extend the index.
if (IndexVT == MVT::v8i32)
// Just extend index
Index = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i64, Index);
else {
// The minimum number of elements in a scatter is 8.
NumElts = 8;
// Index
MVT NewIndexVT = MVT::getVectorVT(IndexVT.getScalarType(), NumElts);
// Use the original index here; do not modify the index twice.
Index = ExtendToType(N->getIndex(), NewIndexVT, DAG);
if (IndexVT.getScalarType() == MVT::i32)
Index = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i64, Index);
// Mask
// At this point the mask operand has been promoted.
assert(MaskVT.getScalarSizeInBits() >= 32 && "unexpected mask type");
MVT ExtMaskVT = MVT::getVectorVT(MaskVT.getScalarType(), NumElts);
// Use the original mask here; do not modify the mask twice.
Mask = ExtendToType(N->getMask(), ExtMaskVT, DAG, true);
// The value that should be stored
MVT NewVT = MVT::getVectorVT(VT.getScalarType(), NumElts);
Src = ExtendToType(Src, NewVT, DAG);
}
}
// If the mask is "wide" at this point, truncate it to an i1 vector.
MVT BitMaskVT = MVT::getVectorVT(MVT::i1, NumElts);
Mask = DAG.getNode(ISD::TRUNCATE, dl, BitMaskVT, Mask);
// The mask is killed by scatter, add it to the values
SDVTList VTs = DAG.getVTList(BitMaskVT, MVT::Other);
SDValue Ops[] = {Chain, Src, Mask, BasePtr, Index};
NewScatter = DAG.getMaskedScatter(VTs, N->getMemoryVT(), dl, Ops,
N->getMemOperand());
DAG.ReplaceAllUsesWith(Op, SDValue(NewScatter.getNode(), 1));
return SDValue(NewScatter.getNode(), 1);
}
static SDValue LowerMLOAD(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MaskedLoadSDNode *N = cast<MaskedLoadSDNode>(Op.getNode());
MVT VT = Op.getSimpleValueType();
MVT ScalarVT = VT.getScalarType();
SDValue Mask = N->getMask();
SDLoc dl(Op);
assert((!N->isExpandingLoad() || Subtarget.hasAVX512()) &&
"Expanding masked load is supported on AVX-512 target only!");
assert((!N->isExpandingLoad() || ScalarVT.getSizeInBits() >= 32) &&
"Expanding masked load is supported for 32 and 64-bit types only!");
// 4x32, 4x64 and 2x64 vectors of non-expanding loads are legal regardless of
// VLX. Expanding loads of these types are handled here.
if (!N->isExpandingLoad() && VT.getVectorNumElements() <= 4)
return Op;
assert(Subtarget.hasAVX512() && !Subtarget.hasVLX() && !VT.is512BitVector() &&
"Cannot lower masked load op.");
assert((ScalarVT.getSizeInBits() >= 32 ||
(Subtarget.hasBWI() &&
(ScalarVT == MVT::i8 || ScalarVT == MVT::i16))) &&
"Unsupported masked load op.");
// This operation is legal for targets with VLX, but without
// VLX the vector should be widened to 512 bits.
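// e.g. a v8i32 masked load on AVX512F without VLX is widened to v16i32
// (512 / 32 = 16 lanes) and the low v8i32 is extracted again below.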
unsigned NumEltsInWideVec = 512 / VT.getScalarSizeInBits();
MVT WideDataVT = MVT::getVectorVT(ScalarVT, NumEltsInWideVec);
SDValue Src0 = N->getSrc0();
Src0 = ExtendToType(Src0, WideDataVT, DAG);
// Mask element has to be i1.
MVT MaskEltTy = Mask.getSimpleValueType().getScalarType();
assert((MaskEltTy == MVT::i1 || VT.getVectorNumElements() <= 4) &&
"We handle 4x32, 4x64 and 2x64 vectors only in this case");
MVT WideMaskVT = MVT::getVectorVT(MaskEltTy, NumEltsInWideVec);
Mask = ExtendToType(Mask, WideMaskVT, DAG, true);
if (MaskEltTy != MVT::i1)
Mask = DAG.getNode(ISD::TRUNCATE, dl,
MVT::getVectorVT(MVT::i1, NumEltsInWideVec), Mask);
SDValue NewLoad = DAG.getMaskedLoad(WideDataVT, dl, N->getChain(),
N->getBasePtr(), Mask, Src0,
N->getMemoryVT(), N->getMemOperand(),
N->getExtensionType(),
N->isExpandingLoad());
SDValue Extract = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT,
NewLoad.getValue(0),
DAG.getIntPtrConstant(0, dl));
SDValue RetOps[] = {Extract, NewLoad.getValue(1)};
return DAG.getMergeValues(RetOps, dl);
}
static SDValue LowerMSTORE(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
MaskedStoreSDNode *N = cast<MaskedStoreSDNode>(Op.getNode());
SDValue DataToStore = N->getValue();
MVT VT = DataToStore.getSimpleValueType();
MVT ScalarVT = VT.getScalarType();
SDValue Mask = N->getMask();
SDLoc dl(Op);
assert((!N->isCompressingStore() || Subtarget.hasAVX512()) &&
"Compressing store is supported on AVX-512 targets only!");
assert((!N->isCompressingStore() || ScalarVT.getSizeInBits() >= 32) &&
"Compressing store is supported for 32 and 64-bit types only!");
// 4x32 and 2x64 vectors of non-compressing stores are legal regardless of VLX.
if (!N->isCompressingStore() && VT.getVectorNumElements() <= 4)
return Op;
assert(Subtarget.hasAVX512() && !Subtarget.hasVLX() && !VT.is512BitVector() &&
"Cannot lower masked store op.");
assert((ScalarVT.getSizeInBits() >= 32 ||
(Subtarget.hasBWI() &&
(ScalarVT == MVT::i8 || ScalarVT == MVT::i16))) &&
"Unsupported masked store op.");
// This operation is legal for targets with VLX, but without
// VLX the vector should be widened to 512 bits.
unsigned NumEltsInWideVec = 512/VT.getScalarSizeInBits();
MVT WideDataVT = MVT::getVectorVT(ScalarVT, NumEltsInWideVec);
// Mask element has to be i1.
MVT MaskEltTy = Mask.getSimpleValueType().getScalarType();
assert((MaskEltTy == MVT::i1 || VT.getVectorNumElements() <= 4) &&
"We handle 4x32, 4x64 and 2x64 vectors only in this case");
MVT WideMaskVT = MVT::getVectorVT(MaskEltTy, NumEltsInWideVec);
DataToStore = ExtendToType(DataToStore, WideDataVT, DAG);
Mask = ExtendToType(Mask, WideMaskVT, DAG, true);
if (MaskEltTy != MVT::i1)
Mask = DAG.getNode(ISD::TRUNCATE, dl,
MVT::getVectorVT(MVT::i1, NumEltsInWideVec), Mask);
return DAG.getMaskedStore(N->getChain(), dl, DataToStore, N->getBasePtr(),
Mask, N->getMemoryVT(), N->getMemOperand(),
N->isTruncatingStore(), N->isCompressingStore());
}
static SDValue LowerMGATHER(SDValue Op, const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
assert(Subtarget.hasAVX512() &&
"MGATHER/MSCATTER are supported on AVX-512 arch only");
MaskedGatherSDNode *N = cast<MaskedGatherSDNode>(Op.getNode());
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
SDValue Index = N->getIndex();
SDValue Mask = N->getMask();
SDValue Src0 = N->getValue();
MVT IndexVT = Index.getSimpleValueType();
MVT MaskVT = Mask.getSimpleValueType();
unsigned NumElts = VT.getVectorNumElements();
assert(VT.getScalarSizeInBits() >= 32 && "Unsupported gather op");
if (!Subtarget.hasVLX() && !VT.is512BitVector() &&
!Index.getSimpleValueType().is512BitVector()) {
// AVX512F supports only 512-bit vectors; either the data or the index
// must be 512 bits wide. If both the index and the data are currently
// 256-bit but the vector holds 8 elements, we just sign-extend the index.
if (NumElts == 8) {
Index = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i64, Index);
SDValue Ops[] = { N->getOperand(0), N->getOperand(1), N->getOperand(2),
N->getOperand(3), Index };
DAG.UpdateNodeOperands(N, Ops);
return Op;
}
// The minimum number of elements in a gather is 8.
NumElts = 8;
// Index
MVT NewIndexVT = MVT::getVectorVT(IndexVT.getScalarType(), NumElts);
Index = ExtendToType(Index, NewIndexVT, DAG);
if (IndexVT.getScalarType() == MVT::i32)
Index = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i64, Index);
// Mask
MVT MaskBitVT = MVT::getVectorVT(MVT::i1, NumElts);
// At this point the mask operand has been promoted.
assert(MaskVT.getScalarSizeInBits() >= 32 && "unexpected mask type");
MVT ExtMaskVT = MVT::getVectorVT(MaskVT.getScalarType(), NumElts);
Mask = ExtendToType(Mask, ExtMaskVT, DAG, true);
Mask = DAG.getNode(ISD::TRUNCATE, dl, MaskBitVT, Mask);
// The pass-through value
MVT NewVT = MVT::getVectorVT(VT.getScalarType(), NumElts);
Src0 = ExtendToType(Src0, NewVT, DAG);
SDValue Ops[] = { N->getChain(), Src0, Mask, N->getBasePtr(), Index };
SDValue NewGather = DAG.getMaskedGather(DAG.getVTList(NewVT, MVT::Other),
N->getMemoryVT(), dl, Ops,
N->getMemOperand());
SDValue Extract = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT,
NewGather.getValue(0),
DAG.getIntPtrConstant(0, dl));
SDValue RetOps[] = {Extract, NewGather.getValue(1)};
return DAG.getMergeValues(RetOps, dl);
}
if (N->getMemoryVT() == MVT::v2i32 && Subtarget.hasVLX()) {
// There is a special case when the return type v2i32 is illegal and
// the type legalizer extended it to v2i64. Without this conversion we end up
// with VPGATHERQQ (reading q-words from the memory) instead of VPGATHERQD.
// In order to avoid this situation, we'll build an X86-specific gather node
// with index type v2i64 and value type v4i32.
assert(VT == MVT::v2i64 && Src0.getValueType() == MVT::v2i64 &&
"Unexpected type in masked gather");
Src0 = DAG.getVectorShuffle(MVT::v4i32, dl,
DAG.getBitcast(MVT::v4i32, Src0),
DAG.getUNDEF(MVT::v4i32), { 0, 2, -1, -1 });
// The mask should match the destination type. Extending the mask with
// zeroes is not necessary since the instruction itself reads only two
// values from memory.
Mask = ExtendToType(Mask, MVT::v4i1, DAG, false);
SDValue Ops[] = { N->getChain(), Src0, Mask, N->getBasePtr(), Index };
SDValue NewGather = DAG.getTargetMemSDNode<X86MaskedGatherSDNode>(
DAG.getVTList(MVT::v4i32, MVT::Other), Ops, dl, N->getMemoryVT(),
N->getMemOperand());
SDValue Sext = getExtendInVec(X86ISD::VSEXT, dl, MVT::v2i64,
NewGather.getValue(0), DAG);
SDValue RetOps[] = { Sext, NewGather.getValue(1) };
return DAG.getMergeValues(RetOps, dl);
}
if (N->getMemoryVT() == MVT::v2f32 && Subtarget.hasVLX()) {
// This transformation is for optimization only.
// The type legalizer extended the mask and index to 4-element vectors to
// match the requirements of the common gather node: the index and value
// must have the same vector width. The X86 gather node tolerates a width
// mismatch so that a more optimal instruction can be selected in the end.
assert(VT == MVT::v4f32 && Src0.getValueType() == MVT::v4f32 &&
"Unexpected type in masked gather");
if (Mask.getOpcode() == ISD::CONCAT_VECTORS &&
ISD::isBuildVectorAllZeros(Mask.getOperand(1).getNode()) &&
Index.getOpcode() == ISD::CONCAT_VECTORS &&
Index.getOperand(1).isUndef()) {
Mask = ExtendToType(Mask.getOperand(0), MVT::v4i1, DAG, false);
Index = Index.getOperand(0);
} else
return Op;
SDValue Ops[] = { N->getChain(), Src0, Mask, N->getBasePtr(), Index };
SDValue NewGather = DAG.getTargetMemSDNode<X86MaskedGatherSDNode>(
DAG.getVTList(MVT::v4f32, MVT::Other), Ops, dl, N->getMemoryVT(),
N->getMemOperand());
SDValue RetOps[] = { NewGather.getValue(0), NewGather.getValue(1) };
return DAG.getMergeValues(RetOps, dl);
}
return Op;
}
SDValue X86TargetLowering::LowerGC_TRANSITION_START(SDValue Op,
SelectionDAG &DAG) const {
// TODO: Eventually, the lowering of these nodes should be informed by or
// deferred to the GC strategy for the function in which they appear. For
// now, however, they must be lowered to something. Since they are logically
// no-ops in the case of a null GC strategy (or a GC strategy which does not
// require special handling for these nodes), lower them as literal NOOPs for
// the time being.
SmallVector<SDValue, 2> Ops;
Ops.push_back(Op.getOperand(0));
if (Op->getGluedNode())
Ops.push_back(Op->getOperand(Op->getNumOperands() - 1));
SDLoc OpDL(Op);
SDVTList VTs = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue NOOP(DAG.getMachineNode(X86::NOOP, SDLoc(Op), VTs, Ops), 0);
return NOOP;
}
SDValue X86TargetLowering::LowerGC_TRANSITION_END(SDValue Op,
SelectionDAG &DAG) const {
// TODO: Eventually, the lowering of these nodes should be informed by or
// deferred to the GC strategy for the function in which they appear. For
// now, however, they must be lowered to something. Since they are logically
// no-ops in the case of a null GC strategy (or a GC strategy which does not
// require special handling for these nodes), lower them as literal NOOPs for
// the time being.
SmallVector<SDValue, 2> Ops;
Ops.push_back(Op.getOperand(0));
if (Op->getGluedNode())
Ops.push_back(Op->getOperand(Op->getNumOperands() - 1));
SDLoc OpDL(Op);
SDVTList VTs = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue NOOP(DAG.getMachineNode(X86::NOOP, SDLoc(Op), VTs, Ops), 0);
return NOOP;
}
/// Provide custom lowering hooks for some operations.
SDValue X86TargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
switch (Op.getOpcode()) {
default: llvm_unreachable("Should not custom lower this!");
case ISD::ATOMIC_FENCE: return LowerATOMIC_FENCE(Op, Subtarget, DAG);
case ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS:
return LowerCMP_SWAP(Op, Subtarget, DAG);
case ISD::CTPOP: return LowerCTPOP(Op, Subtarget, DAG);
case ISD::ATOMIC_LOAD_ADD:
case ISD::ATOMIC_LOAD_SUB:
case ISD::ATOMIC_LOAD_OR:
case ISD::ATOMIC_LOAD_XOR:
case ISD::ATOMIC_LOAD_AND: return lowerAtomicArith(Op, DAG, Subtarget);
case ISD::ATOMIC_STORE: return LowerATOMIC_STORE(Op, DAG);
case ISD::BITREVERSE: return LowerBITREVERSE(Op, Subtarget, DAG);
case ISD::BUILD_VECTOR: return LowerBUILD_VECTOR(Op, DAG);
case ISD::CONCAT_VECTORS: return LowerCONCAT_VECTORS(Op, Subtarget, DAG);
case ISD::VECTOR_SHUFFLE: return lowerVectorShuffle(Op, Subtarget, DAG);
case ISD::VSELECT: return LowerVSELECT(Op, DAG);
case ISD::EXTRACT_VECTOR_ELT: return LowerEXTRACT_VECTOR_ELT(Op, DAG);
case ISD::INSERT_VECTOR_ELT: return LowerINSERT_VECTOR_ELT(Op, DAG);
case ISD::EXTRACT_SUBVECTOR: return LowerEXTRACT_SUBVECTOR(Op,Subtarget,DAG);
case ISD::INSERT_SUBVECTOR: return LowerINSERT_SUBVECTOR(Op, Subtarget,DAG);
case ISD::SCALAR_TO_VECTOR: return LowerSCALAR_TO_VECTOR(Op, Subtarget,DAG);
case ISD::ConstantPool: return LowerConstantPool(Op, DAG);
case ISD::GlobalAddress: return LowerGlobalAddress(Op, DAG);
case ISD::GlobalTLSAddress: return LowerGlobalTLSAddress(Op, DAG);
case ISD::ExternalSymbol: return LowerExternalSymbol(Op, DAG);
case ISD::BlockAddress: return LowerBlockAddress(Op, DAG);
case ISD::SHL_PARTS:
case ISD::SRA_PARTS:
case ISD::SRL_PARTS: return LowerShiftParts(Op, DAG);
case ISD::SINT_TO_FP: return LowerSINT_TO_FP(Op, DAG);
case ISD::UINT_TO_FP: return LowerUINT_TO_FP(Op, DAG);
case ISD::TRUNCATE: return LowerTRUNCATE(Op, DAG);
case ISD::ZERO_EXTEND: return LowerZERO_EXTEND(Op, Subtarget, DAG);
case ISD::SIGN_EXTEND: return LowerSIGN_EXTEND(Op, Subtarget, DAG);
case ISD::ANY_EXTEND: return LowerANY_EXTEND(Op, Subtarget, DAG);
case ISD::ZERO_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:
return LowerEXTEND_VECTOR_INREG(Op, Subtarget, DAG);
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: return LowerFP_TO_INT(Op, DAG);
case ISD::FP_EXTEND: return LowerFP_EXTEND(Op, DAG);
case ISD::LOAD: return LowerExtendedLoad(Op, Subtarget, DAG);
case ISD::FABS:
case ISD::FNEG: return LowerFABSorFNEG(Op, DAG);
case ISD::FCOPYSIGN: return LowerFCOPYSIGN(Op, DAG);
case ISD::FGETSIGN: return LowerFGETSIGN(Op, DAG);
case ISD::SETCC: return LowerSETCC(Op, DAG);
case ISD::SETCCCARRY: return LowerSETCCCARRY(Op, DAG);
case ISD::SELECT: return LowerSELECT(Op, DAG);
case ISD::BRCOND: return LowerBRCOND(Op, DAG);
case ISD::JumpTable: return LowerJumpTable(Op, DAG);
case ISD::VASTART: return LowerVASTART(Op, DAG);
case ISD::VAARG: return LowerVAARG(Op, DAG);
case ISD::VACOPY: return LowerVACOPY(Op, Subtarget, DAG);
case ISD::INTRINSIC_WO_CHAIN: return LowerINTRINSIC_WO_CHAIN(Op, Subtarget, DAG);
case ISD::INTRINSIC_VOID:
case ISD::INTRINSIC_W_CHAIN: return LowerINTRINSIC_W_CHAIN(Op, Subtarget, DAG);
case ISD::RETURNADDR: return LowerRETURNADDR(Op, DAG);
case ISD::ADDROFRETURNADDR: return LowerADDROFRETURNADDR(Op, DAG);
case ISD::FRAMEADDR: return LowerFRAMEADDR(Op, DAG);
case ISD::FRAME_TO_ARGS_OFFSET:
return LowerFRAME_TO_ARGS_OFFSET(Op, DAG);
case ISD::DYNAMIC_STACKALLOC: return LowerDYNAMIC_STACKALLOC(Op, DAG);
case ISD::EH_RETURN: return LowerEH_RETURN(Op, DAG);
case ISD::EH_SJLJ_SETJMP: return lowerEH_SJLJ_SETJMP(Op, DAG);
case ISD::EH_SJLJ_LONGJMP: return lowerEH_SJLJ_LONGJMP(Op, DAG);
case ISD::EH_SJLJ_SETUP_DISPATCH:
return lowerEH_SJLJ_SETUP_DISPATCH(Op, DAG);
case ISD::INIT_TRAMPOLINE: return LowerINIT_TRAMPOLINE(Op, DAG);
case ISD::ADJUST_TRAMPOLINE: return LowerADJUST_TRAMPOLINE(Op, DAG);
case ISD::FLT_ROUNDS_: return LowerFLT_ROUNDS_(Op, DAG);
case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF: return LowerCTLZ(Op, Subtarget, DAG);
case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF: return LowerCTTZ(Op, DAG);
case ISD::MUL: return LowerMUL(Op, Subtarget, DAG);
case ISD::MULHS:
case ISD::MULHU: return LowerMULH(Op, Subtarget, DAG);
case ISD::UMUL_LOHI:
case ISD::SMUL_LOHI: return LowerMUL_LOHI(Op, Subtarget, DAG);
case ISD::ROTL:
case ISD::ROTR: return LowerRotate(Op, Subtarget, DAG);
case ISD::SRA:
case ISD::SRL:
case ISD::SHL: return LowerShift(Op, Subtarget, DAG);
case ISD::SADDO:
case ISD::UADDO:
case ISD::SSUBO:
case ISD::USUBO:
case ISD::SMULO:
case ISD::UMULO: return LowerXALUO(Op, DAG);
case ISD::READCYCLECOUNTER: return LowerREADCYCLECOUNTER(Op, Subtarget,DAG);
case ISD::BITCAST: return LowerBITCAST(Op, Subtarget, DAG);
case ISD::ADDCARRY:
case ISD::SUBCARRY: return LowerADDSUBCARRY(Op, DAG);
case ISD::ADD:
case ISD::SUB: return LowerADD_SUB(Op, DAG);
case ISD::SMAX:
case ISD::SMIN:
case ISD::UMAX:
case ISD::UMIN: return LowerMINMAX(Op, DAG);
case ISD::ABS: return LowerABS(Op, DAG);
case ISD::FSINCOS: return LowerFSINCOS(Op, Subtarget, DAG);
case ISD::MLOAD: return LowerMLOAD(Op, Subtarget, DAG);
case ISD::MSTORE: return LowerMSTORE(Op, Subtarget, DAG);
case ISD::MGATHER: return LowerMGATHER(Op, Subtarget, DAG);
case ISD::MSCATTER: return LowerMSCATTER(Op, Subtarget, DAG);
case ISD::GC_TRANSITION_START:
return LowerGC_TRANSITION_START(Op, DAG);
case ISD::GC_TRANSITION_END: return LowerGC_TRANSITION_END(Op, DAG);
case ISD::STORE: return LowerTruncatingStore(Op, Subtarget, DAG);
}
}
/// Places new result values for the node in Results (their number
/// and types must exactly match those of the original return values of
/// the node), or leaves Results empty, which indicates that the node is not
/// to be custom lowered after all.
void X86TargetLowering::LowerOperationWrapper(SDNode *N,
SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const {
SDValue Res = LowerOperation(SDValue(N, 0), DAG);
if (!Res.getNode())
return;
assert((N->getNumValues() <= Res->getNumValues()) &&
"Lowering returned the wrong number of results!");
// Place new result values based on N's result numbers.
// In some cases (LowerSINT_TO_FP for example) Res has more result values
// than the original node; the chain (the last value) should be dropped.
for (unsigned I = 0, E = N->getNumValues(); I != E; ++I)
Results.push_back(Res.getValue(I));
}
/// Replace a node with an illegal result type with a new node built out of
/// custom code.
void X86TargetLowering::ReplaceNodeResults(SDNode *N,
SmallVectorImpl<SDValue>&Results,
SelectionDAG &DAG) const {
SDLoc dl(N);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
switch (N->getOpcode()) {
default:
llvm_unreachable("Do not know how to custom type legalize this operation!");
case X86ISD::AVG: {
// Legalize types for X86ISD::AVG by expanding vectors.
assert(Subtarget.hasSSE2() && "Requires at least SSE2!");
auto InVT = N->getValueType(0);
auto InVTSize = InVT.getSizeInBits();
const unsigned RegSize =
(InVTSize > 128) ? ((InVTSize > 256) ? 512 : 256) : 128;
assert((Subtarget.hasBWI() || RegSize < 512) &&
"512-bit vector requires AVX512BW");
assert((Subtarget.hasAVX2() || RegSize < 256) &&
"256-bit vector requires AVX2");
auto ElemVT = InVT.getVectorElementType();
auto RegVT = EVT::getVectorVT(*DAG.getContext(), ElemVT,
RegSize / ElemVT.getSizeInBits());
assert(RegSize % InVT.getSizeInBits() == 0);
unsigned NumConcat = RegSize / InVT.getSizeInBits();
SmallVector<SDValue, 16> Ops(NumConcat, DAG.getUNDEF(InVT));
Ops[0] = N->getOperand(0);
SDValue InVec0 = DAG.getNode(ISD::CONCAT_VECTORS, dl, RegVT, Ops);
Ops[0] = N->getOperand(1);
SDValue InVec1 = DAG.getNode(ISD::CONCAT_VECTORS, dl, RegVT, Ops);
SDValue Res = DAG.getNode(X86ISD::AVG, dl, RegVT, InVec0, InVec1);
Results.push_back(DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, InVT, Res,
DAG.getIntPtrConstant(0, dl)));
return;
}
// We might have generated v2f32 FMIN/FMAX operations. Widen them to v4f32.
case X86ISD::FMINC:
case X86ISD::FMIN:
case X86ISD::FMAXC:
case X86ISD::FMAX: {
EVT VT = N->getValueType(0);
assert(VT == MVT::v2f32 && "Unexpected type (!= v2f32) on FMIN/FMAX.");
SDValue UNDEF = DAG.getUNDEF(VT);
SDValue LHS = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v4f32,
N->getOperand(0), UNDEF);
SDValue RHS = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v4f32,
N->getOperand(1), UNDEF);
Results.push_back(DAG.getNode(N->getOpcode(), dl, MVT::v4f32, LHS, RHS));
return;
}
case ISD::SDIV:
case ISD::UDIV:
case ISD::SREM:
case ISD::UREM:
case ISD::SDIVREM:
case ISD::UDIVREM: {
SDValue V = LowerWin64_i128OP(SDValue(N,0), DAG);
Results.push_back(V);
return;
}
case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: {
bool IsSigned = N->getOpcode() == ISD::FP_TO_SINT;
if (N->getValueType(0) == MVT::v2i32) {
assert((IsSigned || Subtarget.hasAVX512()) &&
"Can only handle signed conversion without AVX512");
assert(Subtarget.hasSSE2() && "Requires at least SSE2!");
SDValue Src = N->getOperand(0);
if (Src.getValueType() == MVT::v2f64) {
SDValue Idx = DAG.getIntPtrConstant(0, dl);
SDValue Res = DAG.getNode(IsSigned ? X86ISD::CVTTP2SI
: X86ISD::CVTTP2UI,
dl, MVT::v4i32, Src);
Res = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v2i32, Res, Idx);
Results.push_back(Res);
return;
}
if (Src.getValueType() == MVT::v2f32) {
SDValue Idx = DAG.getIntPtrConstant(0, dl);
SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v4f32, Src,
DAG.getUNDEF(MVT::v2f32));
Res = DAG.getNode(IsSigned ? ISD::FP_TO_SINT
: ISD::FP_TO_UINT, dl, MVT::v4i32, Res);
Res = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v2i32, Res, Idx);
Results.push_back(Res);
return;
}
// The FP_TO_INTHelper below only handles f32/f64/f80 scalar inputs,
// so early out here.
return;
}
std::pair<SDValue,SDValue> Vals =
FP_TO_INTHelper(SDValue(N, 0), DAG, IsSigned, /*IsReplace=*/ true);
SDValue FIST = Vals.first, StackSlot = Vals.second;
if (FIST.getNode()) {
EVT VT = N->getValueType(0);
// Return a load from the stack slot.
if (StackSlot.getNode())
Results.push_back(
DAG.getLoad(VT, dl, FIST, StackSlot, MachinePointerInfo()));
else
Results.push_back(FIST);
}
return;
}
case ISD::SINT_TO_FP: {
assert(Subtarget.hasDQI() && Subtarget.hasVLX() && "Requires AVX512DQVL!");
SDValue Src = N->getOperand(0);
if (N->getValueType(0) != MVT::v2f32 || Src.getValueType() != MVT::v2i64)
return;
Results.push_back(DAG.getNode(X86ISD::CVTSI2P, dl, MVT::v4f32, Src));
return;
}
case ISD::UINT_TO_FP: {
assert(Subtarget.hasSSE2() && "Requires at least SSE2!");
EVT VT = N->getValueType(0);
if (VT != MVT::v2f32)
return;
SDValue Src = N->getOperand(0);
EVT SrcVT = Src.getValueType();
if (Subtarget.hasDQI() && Subtarget.hasVLX() && SrcVT == MVT::v2i64) {
Results.push_back(DAG.getNode(X86ISD::CVTUI2P, dl, MVT::v4f32, Src));
return;
}
if (SrcVT != MVT::v2i32)
return;
SDValue ZExtIn = DAG.getNode(ISD::ZERO_EXTEND, dl, MVT::v2i64, Src);
SDValue VBias =
DAG.getConstantFP(BitsToDouble(0x4330000000000000ULL), dl, MVT::v2f64);
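// 0x4330000000000000 is 2^52 as an IEEE double. OR-ing a zero-extended
// 32-bit value into its mantissa yields exactly 2^52 + x, so the FSUB
// below recovers x as a double before VFPROUND narrows it to f32.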
SDValue Or = DAG.getNode(ISD::OR, dl, MVT::v2i64, ZExtIn,
DAG.getBitcast(MVT::v2i64, VBias));
Or = DAG.getBitcast(MVT::v2f64, Or);
// TODO: Are there any fast-math-flags to propagate here?
SDValue Sub = DAG.getNode(ISD::FSUB, dl, MVT::v2f64, Or, VBias);
Results.push_back(DAG.getNode(X86ISD::VFPROUND, dl, MVT::v4f32, Sub));
return;
}
case ISD::FP_ROUND: {
if (!TLI.isTypeLegal(N->getOperand(0).getValueType()))
return;
SDValue V = DAG.getNode(X86ISD::VFPROUND, dl, MVT::v4f32, N->getOperand(0));
Results.push_back(V);
return;
}
case ISD::FP_EXTEND: {
// Right now, only MVT::v2f32 has OperationAction for FP_EXTEND.
// No other ValueType for FP_EXTEND should reach this point.
assert(N->getValueType(0) == MVT::v2f32 &&
"Do not know how to legalize this Node");
return;
}
case ISD::INTRINSIC_W_CHAIN: {
unsigned IntNo = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
switch (IntNo) {
default : llvm_unreachable("Do not know how to custom type "
"legalize this intrinsic operation!");
case Intrinsic::x86_rdtsc:
return getReadTimeStampCounter(N, dl, X86ISD::RDTSC_DAG, DAG, Subtarget,
Results);
case Intrinsic::x86_rdtscp:
return getReadTimeStampCounter(N, dl, X86ISD::RDTSCP_DAG, DAG, Subtarget,
Results);
case Intrinsic::x86_rdpmc:
return getReadPerformanceCounter(N, dl, DAG, Subtarget, Results);
case Intrinsic::x86_xgetbv:
return getExtendedControlRegister(N, dl, DAG, Subtarget, Results);
}
}
case ISD::INTRINSIC_WO_CHAIN: {
if (SDValue V = LowerINTRINSIC_WO_CHAIN(SDValue(N, 0), Subtarget, DAG))
Results.push_back(V);
return;
}
case ISD::READCYCLECOUNTER: {
return getReadTimeStampCounter(N, dl, X86ISD::RDTSC_DAG, DAG, Subtarget,
Results);
}
case ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS: {
EVT T = N->getValueType(0);
assert((T == MVT::i64 || T == MVT::i128) && "can only expand cmpxchg pair");
bool Regs64bit = T == MVT::i128;
MVT HalfT = Regs64bit ? MVT::i64 : MVT::i32;
SDValue cpInL, cpInH;
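// CMPXCHG8B/16B expects the compare value in EDX:EAX (RDX:RAX) and the
// replacement in ECX:EBX (RCX:RBX); ZF reports success and the old value
// comes back in EDX:EAX (RDX:RAX), which the code below wires up.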
cpInL = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, HalfT, N->getOperand(2),
DAG.getConstant(0, dl, HalfT));
cpInH = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, HalfT, N->getOperand(2),
DAG.getConstant(1, dl, HalfT));
cpInL = DAG.getCopyToReg(N->getOperand(0), dl,
Regs64bit ? X86::RAX : X86::EAX,
cpInL, SDValue());
cpInH = DAG.getCopyToReg(cpInL.getValue(0), dl,
Regs64bit ? X86::RDX : X86::EDX,
cpInH, cpInL.getValue(1));
SDValue swapInL, swapInH;
swapInL = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, HalfT, N->getOperand(3),
DAG.getConstant(0, dl, HalfT));
swapInH = DAG.getNode(ISD::EXTRACT_ELEMENT, dl, HalfT, N->getOperand(3),
DAG.getConstant(1, dl, HalfT));
swapInH =
DAG.getCopyToReg(cpInH.getValue(0), dl, Regs64bit ? X86::RCX : X86::ECX,
swapInH, cpInH.getValue(1));
// If the current function needs the base pointer, RBX, we shouldn't use
// cmpxchg directly. The lowering of that instruction will clobber that
// register, and since RBX will be a reserved register the register
// allocator will not make sure its value is properly saved and restored
// around this live range.
const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
SDValue Result;
SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
unsigned BasePtr = TRI->getBaseRegister();
MachineMemOperand *MMO = cast<AtomicSDNode>(N)->getMemOperand();
if (TRI->hasBasePointer(DAG.getMachineFunction()) &&
(BasePtr == X86::RBX || BasePtr == X86::EBX)) {
// ISel prefers the LCMPXCHG64 variant.
// If that assert breaks, it is no longer the case, and we need to teach
// LCMPXCHG8_SAVE_EBX_DAG how to save RBX, not just EBX. This is a matter
// of accepting i64 input for that pseudo, and restoring into a register
// of the right width in the expand pseudo. Everything else should just
// work.
assert(((Regs64bit == (BasePtr == X86::RBX)) || BasePtr == X86::EBX) &&
"Saving only half of the RBX");
unsigned Opcode = Regs64bit ? X86ISD::LCMPXCHG16_SAVE_RBX_DAG
: X86ISD::LCMPXCHG8_SAVE_EBX_DAG;
SDValue RBXSave = DAG.getCopyFromReg(swapInH.getValue(0), dl,
Regs64bit ? X86::RBX : X86::EBX,
HalfT, swapInH.getValue(1));
SDValue Ops[] = {/*Chain*/ RBXSave.getValue(1), N->getOperand(1), swapInL,
RBXSave,
/*Glue*/ RBXSave.getValue(2)};
Result = DAG.getMemIntrinsicNode(Opcode, dl, Tys, Ops, T, MMO);
} else {
unsigned Opcode =
Regs64bit ? X86ISD::LCMPXCHG16_DAG : X86ISD::LCMPXCHG8_DAG;
swapInL = DAG.getCopyToReg(swapInH.getValue(0), dl,
Regs64bit ? X86::RBX : X86::EBX, swapInL,
swapInH.getValue(1));
SDValue Ops[] = {swapInL.getValue(0), N->getOperand(1),
swapInL.getValue(1)};
Result = DAG.getMemIntrinsicNode(Opcode, dl, Tys, Ops, T, MMO);
}
SDValue cpOutL = DAG.getCopyFromReg(Result.getValue(0), dl,
Regs64bit ? X86::RAX : X86::EAX,
HalfT, Result.getValue(1));
SDValue cpOutH = DAG.getCopyFromReg(cpOutL.getValue(1), dl,
Regs64bit ? X86::RDX : X86::EDX,
HalfT, cpOutL.getValue(2));
SDValue OpsF[] = { cpOutL.getValue(0), cpOutH.getValue(0)};
SDValue EFLAGS = DAG.getCopyFromReg(cpOutH.getValue(1), dl, X86::EFLAGS,
MVT::i32, cpOutH.getValue(2));
SDValue Success = getSETCC(X86::COND_E, EFLAGS, dl, DAG);
Success = DAG.getZExtOrTrunc(Success, dl, N->getValueType(1));
Results.push_back(DAG.getNode(ISD::BUILD_PAIR, dl, T, OpsF));
Results.push_back(Success);
Results.push_back(EFLAGS.getValue(1));
return;
}
case ISD::ATOMIC_SWAP:
case ISD::ATOMIC_LOAD_ADD:
case ISD::ATOMIC_LOAD_SUB:
case ISD::ATOMIC_LOAD_AND:
case ISD::ATOMIC_LOAD_OR:
case ISD::ATOMIC_LOAD_XOR:
case ISD::ATOMIC_LOAD_NAND:
case ISD::ATOMIC_LOAD_MIN:
case ISD::ATOMIC_LOAD_MAX:
case ISD::ATOMIC_LOAD_UMIN:
case ISD::ATOMIC_LOAD_UMAX:
case ISD::ATOMIC_LOAD: {
// Delegate to generic TypeLegalization. Situations we can really handle
// should have already been dealt with by AtomicExpandPass.cpp.
break;
}
case ISD::BITCAST: {
assert(Subtarget.hasSSE2() && "Requires at least SSE2!");
EVT DstVT = N->getValueType(0);
EVT SrcVT = N->getOperand(0)->getValueType(0);
if (SrcVT != MVT::f64 ||
(DstVT != MVT::v2i32 && DstVT != MVT::v4i16 && DstVT != MVT::v8i8))
return;
unsigned NumElts = DstVT.getVectorNumElements();
EVT SVT = DstVT.getVectorElementType();
EVT WiderVT = EVT::getVectorVT(*DAG.getContext(), SVT, NumElts * 2);
SDValue Expanded = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl,
MVT::v2f64, N->getOperand(0));
SDValue ToVecInt = DAG.getBitcast(WiderVT, Expanded);
if (ExperimentalVectorWideningLegalization) {
// If we are legalizing vectors by widening, we already have the desired
// legal vector type, just return it.
Results.push_back(ToVecInt);
return;
}
SmallVector<SDValue, 8> Elts;
for (unsigned i = 0, e = NumElts; i != e; ++i)
Elts.push_back(DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, SVT,
ToVecInt, DAG.getIntPtrConstant(i, dl)));
Results.push_back(DAG.getBuildVector(DstVT, dl, Elts));
}
}
}
const char *X86TargetLowering::getTargetNodeName(unsigned Opcode) const {
switch ((X86ISD::NodeType)Opcode) {
case X86ISD::FIRST_NUMBER: break;
case X86ISD::BSF: return "X86ISD::BSF";
case X86ISD::BSR: return "X86ISD::BSR";
case X86ISD::SHLD: return "X86ISD::SHLD";
case X86ISD::SHRD: return "X86ISD::SHRD";
case X86ISD::FAND: return "X86ISD::FAND";
case X86ISD::FANDN: return "X86ISD::FANDN";
case X86ISD::FOR: return "X86ISD::FOR";
case X86ISD::FXOR: return "X86ISD::FXOR";
case X86ISD::FILD: return "X86ISD::FILD";
case X86ISD::FILD_FLAG: return "X86ISD::FILD_FLAG";
case X86ISD::FP_TO_INT16_IN_MEM: return "X86ISD::FP_TO_INT16_IN_MEM";
case X86ISD::FP_TO_INT32_IN_MEM: return "X86ISD::FP_TO_INT32_IN_MEM";
case X86ISD::FP_TO_INT64_IN_MEM: return "X86ISD::FP_TO_INT64_IN_MEM";
case X86ISD::FLD: return "X86ISD::FLD";
case X86ISD::FST: return "X86ISD::FST";
case X86ISD::CALL: return "X86ISD::CALL";
case X86ISD::RDTSC_DAG: return "X86ISD::RDTSC_DAG";
case X86ISD::RDTSCP_DAG: return "X86ISD::RDTSCP_DAG";
case X86ISD::RDPMC_DAG: return "X86ISD::RDPMC_DAG";
case X86ISD::BT: return "X86ISD::BT";
case X86ISD::CMP: return "X86ISD::CMP";
case X86ISD::COMI: return "X86ISD::COMI";
case X86ISD::UCOMI: return "X86ISD::UCOMI";
case X86ISD::CMPM: return "X86ISD::CMPM";
case X86ISD::CMPMU: return "X86ISD::CMPMU";
case X86ISD::CMPM_RND: return "X86ISD::CMPM_RND";
case X86ISD::SETCC: return "X86ISD::SETCC";
case X86ISD::SETCC_CARRY: return "X86ISD::SETCC_CARRY";
case X86ISD::FSETCC: return "X86ISD::FSETCC";
case X86ISD::FSETCCM: return "X86ISD::FSETCCM";
case X86ISD::FSETCCM_RND: return "X86ISD::FSETCCM_RND";
case X86ISD::CMOV: return "X86ISD::CMOV";
case X86ISD::BRCOND: return "X86ISD::BRCOND";
case X86ISD::RET_FLAG: return "X86ISD::RET_FLAG";
case X86ISD::IRET: return "X86ISD::IRET";
case X86ISD::REP_STOS: return "X86ISD::REP_STOS";
case X86ISD::REP_MOVS: return "X86ISD::REP_MOVS";
case X86ISD::GlobalBaseReg: return "X86ISD::GlobalBaseReg";
case X86ISD::Wrapper: return "X86ISD::Wrapper";
case X86ISD::WrapperRIP: return "X86ISD::WrapperRIP";
case X86ISD::MOVDQ2Q: return "X86ISD::MOVDQ2Q";
case X86ISD::MMX_MOVD2W: return "X86ISD::MMX_MOVD2W";
case X86ISD::MMX_MOVW2D: return "X86ISD::MMX_MOVW2D";
case X86ISD::PEXTRB: return "X86ISD::PEXTRB";
case X86ISD::PEXTRW: return "X86ISD::PEXTRW";
case X86ISD::INSERTPS: return "X86ISD::INSERTPS";
case X86ISD::PINSRB: return "X86ISD::PINSRB";
case X86ISD::PINSRW: return "X86ISD::PINSRW";
case X86ISD::PSHUFB: return "X86ISD::PSHUFB";
case X86ISD::ANDNP: return "X86ISD::ANDNP";
case X86ISD::BLENDI: return "X86ISD::BLENDI";
case X86ISD::SHRUNKBLEND: return "X86ISD::SHRUNKBLEND";
case X86ISD::ADDUS: return "X86ISD::ADDUS";
case X86ISD::SUBUS: return "X86ISD::SUBUS";
case X86ISD::HADD: return "X86ISD::HADD";
case X86ISD::HSUB: return "X86ISD::HSUB";
case X86ISD::FHADD: return "X86ISD::FHADD";
case X86ISD::FHSUB: return "X86ISD::FHSUB";
case X86ISD::CONFLICT: return "X86ISD::CONFLICT";
case X86ISD::FMAX: return "X86ISD::FMAX";
case X86ISD::FMAXS: return "X86ISD::FMAXS";
case X86ISD::FMAX_RND: return "X86ISD::FMAX_RND";
case X86ISD::FMAXS_RND: return "X86ISD::FMAXS_RND";
case X86ISD::FMIN: return "X86ISD::FMIN";
case X86ISD::FMINS: return "X86ISD::FMINS";
case X86ISD::FMIN_RND: return "X86ISD::FMIN_RND";
case X86ISD::FMINS_RND: return "X86ISD::FMINS_RND";
case X86ISD::FMAXC: return "X86ISD::FMAXC";
case X86ISD::FMINC: return "X86ISD::FMINC";
case X86ISD::FRSQRT: return "X86ISD::FRSQRT";
case X86ISD::FRSQRTS: return "X86ISD::FRSQRTS";
case X86ISD::FRCP: return "X86ISD::FRCP";
case X86ISD::FRCPS: return "X86ISD::FRCPS";
case X86ISD::EXTRQI: return "X86ISD::EXTRQI";
case X86ISD::INSERTQI: return "X86ISD::INSERTQI";
case X86ISD::TLSADDR: return "X86ISD::TLSADDR";
case X86ISD::TLSBASEADDR: return "X86ISD::TLSBASEADDR";
case X86ISD::TLSCALL: return "X86ISD::TLSCALL";
case X86ISD::EH_SJLJ_SETJMP: return "X86ISD::EH_SJLJ_SETJMP";
case X86ISD::EH_SJLJ_LONGJMP: return "X86ISD::EH_SJLJ_LONGJMP";
case X86ISD::EH_SJLJ_SETUP_DISPATCH:
return "X86ISD::EH_SJLJ_SETUP_DISPATCH";
case X86ISD::EH_RETURN: return "X86ISD::EH_RETURN";
case X86ISD::TC_RETURN: return "X86ISD::TC_RETURN";
case X86ISD::FNSTCW16m: return "X86ISD::FNSTCW16m";
case X86ISD::FNSTSW16r: return "X86ISD::FNSTSW16r";
case X86ISD::LCMPXCHG_DAG: return "X86ISD::LCMPXCHG_DAG";
case X86ISD::LCMPXCHG8_DAG: return "X86ISD::LCMPXCHG8_DAG";
case X86ISD::LCMPXCHG16_DAG: return "X86ISD::LCMPXCHG16_DAG";
case X86ISD::LCMPXCHG8_SAVE_EBX_DAG:
return "X86ISD::LCMPXCHG8_SAVE_EBX_DAG";
case X86ISD::LCMPXCHG16_SAVE_RBX_DAG:
return "X86ISD::LCMPXCHG16_SAVE_RBX_DAG";
case X86ISD::LADD: return "X86ISD::LADD";
case X86ISD::LSUB: return "X86ISD::LSUB";
case X86ISD::LOR: return "X86ISD::LOR";
case X86ISD::LXOR: return "X86ISD::LXOR";
case X86ISD::LAND: return "X86ISD::LAND";
case X86ISD::VZEXT_MOVL: return "X86ISD::VZEXT_MOVL";
case X86ISD::VZEXT_LOAD: return "X86ISD::VZEXT_LOAD";
case X86ISD::VZEXT: return "X86ISD::VZEXT";
case X86ISD::VSEXT: return "X86ISD::VSEXT";
case X86ISD::VTRUNC: return "X86ISD::VTRUNC";
case X86ISD::VTRUNCS: return "X86ISD::VTRUNCS";
case X86ISD::VTRUNCUS: return "X86ISD::VTRUNCUS";
case X86ISD::VTRUNCSTORES: return "X86ISD::VTRUNCSTORES";
case X86ISD::VTRUNCSTOREUS: return "X86ISD::VTRUNCSTOREUS";
case X86ISD::VMTRUNCSTORES: return "X86ISD::VMTRUNCSTORES";
case X86ISD::VMTRUNCSTOREUS: return "X86ISD::VMTRUNCSTOREUS";
case X86ISD::VFPEXT: return "X86ISD::VFPEXT";
case X86ISD::VFPEXT_RND: return "X86ISD::VFPEXT_RND";
case X86ISD::VFPEXTS_RND: return "X86ISD::VFPEXTS_RND";
case X86ISD::VFPROUND: return "X86ISD::VFPROUND";
case X86ISD::VFPROUND_RND: return "X86ISD::VFPROUND_RND";
case X86ISD::VFPROUNDS_RND: return "X86ISD::VFPROUNDS_RND";
case X86ISD::CVT2MASK: return "X86ISD::CVT2MASK";
case X86ISD::VSHLDQ: return "X86ISD::VSHLDQ";
case X86ISD::VSRLDQ: return "X86ISD::VSRLDQ";
case X86ISD::VSHL: return "X86ISD::VSHL";
case X86ISD::VSRL: return "X86ISD::VSRL";
case X86ISD::VSRA: return "X86ISD::VSRA";
case X86ISD::VSHLI: return "X86ISD::VSHLI";
case X86ISD::VSRLI: return "X86ISD::VSRLI";
case X86ISD::VSRAI: return "X86ISD::VSRAI";
case X86ISD::VSRAV: return "X86ISD::VSRAV";
case X86ISD::VROTLI: return "X86ISD::VROTLI";
case X86ISD::VROTRI: return "X86ISD::VROTRI";
case X86ISD::VPPERM: return "X86ISD::VPPERM";
case X86ISD::CMPP: return "X86ISD::CMPP";
case X86ISD::PCMPEQ: return "X86ISD::PCMPEQ";
case X86ISD::PCMPGT: return "X86ISD::PCMPGT";
case X86ISD::PCMPEQM: return "X86ISD::PCMPEQM";
case X86ISD::PCMPGTM: return "X86ISD::PCMPGTM";
case X86ISD::ADD: return "X86ISD::ADD";
case X86ISD::SUB: return "X86ISD::SUB";
case X86ISD::ADC: return "X86ISD::ADC";
case X86ISD::SBB: return "X86ISD::SBB";
case X86ISD::SMUL: return "X86ISD::SMUL";
case X86ISD::UMUL: return "X86ISD::UMUL";
case X86ISD::SMUL8: return "X86ISD::SMUL8";
case X86ISD::UMUL8: return "X86ISD::UMUL8";
case X86ISD::SDIVREM8_SEXT_HREG: return "X86ISD::SDIVREM8_SEXT_HREG";
case X86ISD::UDIVREM8_ZEXT_HREG: return "X86ISD::UDIVREM8_ZEXT_HREG";
case X86ISD::INC: return "X86ISD::INC";
case X86ISD::DEC: return "X86ISD::DEC";
case X86ISD::OR: return "X86ISD::OR";
case X86ISD::XOR: return "X86ISD::XOR";
case X86ISD::AND: return "X86ISD::AND";
case X86ISD::BEXTR: return "X86ISD::BEXTR";
case X86ISD::MUL_IMM: return "X86ISD::MUL_IMM";
case X86ISD::MOVMSK: return "X86ISD::MOVMSK";
case X86ISD::PTEST: return "X86ISD::PTEST";
case X86ISD::TESTP: return "X86ISD::TESTP";
case X86ISD::TESTM: return "X86ISD::TESTM";
case X86ISD::TESTNM: return "X86ISD::TESTNM";
case X86ISD::KORTEST: return "X86ISD::KORTEST";
case X86ISD::KTEST: return "X86ISD::KTEST";
case X86ISD::KSHIFTL: return "X86ISD::KSHIFTL";
case X86ISD::KSHIFTR: return "X86ISD::KSHIFTR";
case X86ISD::PACKSS: return "X86ISD::PACKSS";
case X86ISD::PACKUS: return "X86ISD::PACKUS";
case X86ISD::PALIGNR: return "X86ISD::PALIGNR";
case X86ISD::VALIGN: return "X86ISD::VALIGN";
case X86ISD::PSHUFD: return "X86ISD::PSHUFD";
case X86ISD::PSHUFHW: return "X86ISD::PSHUFHW";
case X86ISD::PSHUFLW: return "X86ISD::PSHUFLW";
case X86ISD::SHUFP: return "X86ISD::SHUFP";
case X86ISD::SHUF128: return "X86ISD::SHUF128";
case X86ISD::MOVLHPS: return "X86ISD::MOVLHPS";
case X86ISD::MOVLHPD: return "X86ISD::MOVLHPD";
case X86ISD::MOVHLPS: return "X86ISD::MOVHLPS";
case X86ISD::MOVLPS: return "X86ISD::MOVLPS";
case X86ISD::MOVLPD: return "X86ISD::MOVLPD";
case X86ISD::MOVDDUP: return "X86ISD::MOVDDUP";
case X86ISD::MOVSHDUP: return "X86ISD::MOVSHDUP";
case X86ISD::MOVSLDUP: return "X86ISD::MOVSLDUP";
case X86ISD::MOVSD: return "X86ISD::MOVSD";
case X86ISD::MOVSS: return "X86ISD::MOVSS";
case X86ISD::UNPCKL: return "X86ISD::UNPCKL";
case X86ISD::UNPCKH: return "X86ISD::UNPCKH";
case X86ISD::VBROADCAST: return "X86ISD::VBROADCAST";
case X86ISD::VBROADCASTM: return "X86ISD::VBROADCASTM";
case X86ISD::SUBV_BROADCAST: return "X86ISD::SUBV_BROADCAST";
case X86ISD::VEXTRACT: return "X86ISD::VEXTRACT";
case X86ISD::VPERMILPV: return "X86ISD::VPERMILPV";
case X86ISD::VPERMILPI: return "X86ISD::VPERMILPI";
case X86ISD::VPERM2X128: return "X86ISD::VPERM2X128";
case X86ISD::VPERMV: return "X86ISD::VPERMV";
case X86ISD::VPERMV3: return "X86ISD::VPERMV3";
case X86ISD::VPERMIV3: return "X86ISD::VPERMIV3";
case X86ISD::VPERMI: return "X86ISD::VPERMI";
case X86ISD::VPTERNLOG: return "X86ISD::VPTERNLOG";
case X86ISD::VFIXUPIMM: return "X86ISD::VFIXUPIMM";
case X86ISD::VFIXUPIMMS: return "X86ISD::VFIXUPIMMS";
case X86ISD::VRANGE: return "X86ISD::VRANGE";
case X86ISD::PMULUDQ: return "X86ISD::PMULUDQ";
case X86ISD::PMULDQ: return "X86ISD::PMULDQ";
case X86ISD::PSADBW: return "X86ISD::PSADBW";
case X86ISD::DBPSADBW: return "X86ISD::DBPSADBW";
case X86ISD::VASTART_SAVE_XMM_REGS: return "X86ISD::VASTART_SAVE_XMM_REGS";
case X86ISD::VAARG_64: return "X86ISD::VAARG_64";
case X86ISD::WIN_ALLOCA: return "X86ISD::WIN_ALLOCA";
case X86ISD::MEMBARRIER: return "X86ISD::MEMBARRIER";
case X86ISD::MFENCE: return "X86ISD::MFENCE";
case X86ISD::SEG_ALLOCA: return "X86ISD::SEG_ALLOCA";
case X86ISD::SAHF: return "X86ISD::SAHF";
case X86ISD::RDRAND: return "X86ISD::RDRAND";
case X86ISD::RDSEED: return "X86ISD::RDSEED";
case X86ISD::VPMADDUBSW: return "X86ISD::VPMADDUBSW";
case X86ISD::VPMADDWD: return "X86ISD::VPMADDWD";
case X86ISD::VPROT: return "X86ISD::VPROT";
case X86ISD::VPROTI: return "X86ISD::VPROTI";
case X86ISD::VPSHA: return "X86ISD::VPSHA";
case X86ISD::VPSHL: return "X86ISD::VPSHL";
case X86ISD::VPCOM: return "X86ISD::VPCOM";
case X86ISD::VPCOMU: return "X86ISD::VPCOMU";
case X86ISD::VPERMIL2: return "X86ISD::VPERMIL2";
case X86ISD::FMADD: return "X86ISD::FMADD";
case X86ISD::FMSUB: return "X86ISD::FMSUB";
case X86ISD::FNMADD: return "X86ISD::FNMADD";
case X86ISD::FNMSUB: return "X86ISD::FNMSUB";
case X86ISD::FMADDSUB: return "X86ISD::FMADDSUB";
case X86ISD::FMSUBADD: return "X86ISD::FMSUBADD";
case X86ISD::FMADD_RND: return "X86ISD::FMADD_RND";
case X86ISD::FNMADD_RND: return "X86ISD::FNMADD_RND";
case X86ISD::FMSUB_RND: return "X86ISD::FMSUB_RND";
case X86ISD::FNMSUB_RND: return "X86ISD::FNMSUB_RND";
case X86ISD::FMADDSUB_RND: return "X86ISD::FMADDSUB_RND";
case X86ISD::FMSUBADD_RND: return "X86ISD::FMSUBADD_RND";
case X86ISD::FMADDS1_RND: return "X86ISD::FMADDS1_RND";
case X86ISD::FNMADDS1_RND: return "X86ISD::FNMADDS1_RND";
case X86ISD::FMSUBS1_RND: return "X86ISD::FMSUBS1_RND";
case X86ISD::FNMSUBS1_RND: return "X86ISD::FNMSUBS1_RND";
case X86ISD::FMADDS3_RND: return "X86ISD::FMADDS3_RND";
case X86ISD::FNMADDS3_RND: return "X86ISD::FNMADDS3_RND";
case X86ISD::FMSUBS3_RND: return "X86ISD::FMSUBS3_RND";
case X86ISD::FNMSUBS3_RND: return "X86ISD::FNMSUBS3_RND";
case X86ISD::VPMADD52H: return "X86ISD::VPMADD52H";
case X86ISD::VPMADD52L: return "X86ISD::VPMADD52L";
case X86ISD::VRNDSCALE: return "X86ISD::VRNDSCALE";
case X86ISD::VRNDSCALES: return "X86ISD::VRNDSCALES";
case X86ISD::VREDUCE: return "X86ISD::VREDUCE";
case X86ISD::VREDUCES: return "X86ISD::VREDUCES";
case X86ISD::VGETMANT: return "X86ISD::VGETMANT";
case X86ISD::VGETMANTS: return "X86ISD::VGETMANTS";
case X86ISD::PCMPESTRI: return "X86ISD::PCMPESTRI";
case X86ISD::PCMPISTRI: return "X86ISD::PCMPISTRI";
case X86ISD::XTEST: return "X86ISD::XTEST";
case X86ISD::COMPRESS: return "X86ISD::COMPRESS";
case X86ISD::EXPAND: return "X86ISD::EXPAND";
case X86ISD::SELECT: return "X86ISD::SELECT";
case X86ISD::SELECTS: return "X86ISD::SELECTS";
case X86ISD::ADDSUB: return "X86ISD::ADDSUB";
case X86ISD::RCP28: return "X86ISD::RCP28";
case X86ISD::RCP28S: return "X86ISD::RCP28S";
case X86ISD::EXP2: return "X86ISD::EXP2";
case X86ISD::RSQRT28: return "X86ISD::RSQRT28";
case X86ISD::RSQRT28S: return "X86ISD::RSQRT28S";
case X86ISD::FADD_RND: return "X86ISD::FADD_RND";
case X86ISD::FADDS_RND: return "X86ISD::FADDS_RND";
case X86ISD::FSUB_RND: return "X86ISD::FSUB_RND";
case X86ISD::FSUBS_RND: return "X86ISD::FSUBS_RND";
case X86ISD::FMUL_RND: return "X86ISD::FMUL_RND";
case X86ISD::FMULS_RND: return "X86ISD::FMULS_RND";
case X86ISD::FDIV_RND: return "X86ISD::FDIV_RND";
case X86ISD::FDIVS_RND: return "X86ISD::FDIVS_RND";
case X86ISD::FSQRT_RND: return "X86ISD::FSQRT_RND";
case X86ISD::FSQRTS_RND: return "X86ISD::FSQRTS_RND";
case X86ISD::FGETEXP_RND: return "X86ISD::FGETEXP_RND";
case X86ISD::FGETEXPS_RND: return "X86ISD::FGETEXPS_RND";
case X86ISD::SCALEF: return "X86ISD::SCALEF";
case X86ISD::SCALEFS: return "X86ISD::SCALEFS";
case X86ISD::ADDS: return "X86ISD::ADDS";
case X86ISD::SUBS: return "X86ISD::SUBS";
case X86ISD::AVG: return "X86ISD::AVG";
case X86ISD::MULHRS: return "X86ISD::MULHRS";
case X86ISD::SINT_TO_FP_RND: return "X86ISD::SINT_TO_FP_RND";
case X86ISD::UINT_TO_FP_RND: return "X86ISD::UINT_TO_FP_RND";
case X86ISD::CVTTP2SI: return "X86ISD::CVTTP2SI";
case X86ISD::CVTTP2UI: return "X86ISD::CVTTP2UI";
case X86ISD::CVTTP2SI_RND: return "X86ISD::CVTTP2SI_RND";
case X86ISD::CVTTP2UI_RND: return "X86ISD::CVTTP2UI_RND";
case X86ISD::CVTTS2SI_RND: return "X86ISD::CVTTS2SI_RND";
case X86ISD::CVTTS2UI_RND: return "X86ISD::CVTTS2UI_RND";
case X86ISD::CVTSI2P: return "X86ISD::CVTSI2P";
case X86ISD::CVTUI2P: return "X86ISD::CVTUI2P";
case X86ISD::VFPCLASS: return "X86ISD::VFPCLASS";
case X86ISD::VFPCLASSS: return "X86ISD::VFPCLASSS";
case X86ISD::MULTISHIFT: return "X86ISD::MULTISHIFT";
case X86ISD::SCALAR_SINT_TO_FP_RND: return "X86ISD::SCALAR_SINT_TO_FP_RND";
case X86ISD::SCALAR_UINT_TO_FP_RND: return "X86ISD::SCALAR_UINT_TO_FP_RND";
case X86ISD::CVTPS2PH: return "X86ISD::CVTPS2PH";
case X86ISD::CVTPH2PS: return "X86ISD::CVTPH2PS";
case X86ISD::CVTP2SI: return "X86ISD::CVTP2SI";
case X86ISD::CVTP2UI: return "X86ISD::CVTP2UI";
case X86ISD::CVTP2SI_RND: return "X86ISD::CVTP2SI_RND";
case X86ISD::CVTP2UI_RND: return "X86ISD::CVTP2UI_RND";
case X86ISD::CVTS2SI_RND: return "X86ISD::CVTS2SI_RND";
case X86ISD::CVTS2UI_RND: return "X86ISD::CVTS2UI_RND";
case X86ISD::LWPINS: return "X86ISD::LWPINS";
case X86ISD::MGATHER: return "X86ISD::MGATHER";
}
return nullptr;
}
/// Return true if the addressing mode represented by AM is legal for this
/// target, for a load/store of the specified type.
bool X86TargetLowering::isLegalAddressingMode(const DataLayout &DL,
const AddrMode &AM, Type *Ty,
unsigned AS) const {
// X86 supports extremely general addressing modes.
CodeModel::Model M = getTargetMachine().getCodeModel();
// X86 allows a sign-extended 32-bit immediate field as a displacement.
if (!X86::isOffsetSuitableForCodeModel(AM.BaseOffs, M, AM.BaseGV != nullptr))
return false;
if (AM.BaseGV) {
unsigned GVFlags = Subtarget.classifyGlobalReference(AM.BaseGV);
// If a reference to this global requires an extra load, we can't fold it.
if (isGlobalStubReference(GVFlags))
return false;
// If BaseGV requires a register for the PIC base, we cannot also have a
// BaseReg specified.
if (AM.HasBaseReg && isGlobalRelativeToPICBase(GVFlags))
return false;
// If the lower 4G is not available, then we must use RIP-relative addressing.
if ((M != CodeModel::Small || isPositionIndependent()) &&
Subtarget.is64Bit() && (AM.BaseOffs || AM.Scale > 1))
return false;
}
switch (AM.Scale) {
case 0:
case 1:
case 2:
case 4:
case 8:
// These scales always work.
break;
case 3:
case 5:
case 9:
// These scales are formed with basereg+scalereg. Only accept if there is
// no basereg yet.
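// e.g. x*3 is only encodable as lea (%rax,%rax,2), which occupies both
// the base and index slots, so no separate base register can be added.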
if (AM.HasBaseReg)
return false;
break;
default: // Other stuff never works.
return false;
}
return true;
}
bool X86TargetLowering::isVectorShiftByScalarCheap(Type *Ty) const {
unsigned Bits = Ty->getScalarSizeInBits();
// 8-bit shifts are always expensive, but versions with a scalar amount aren't
// particularly cheaper than those without.
if (Bits == 8)
return false;
// On AVX2 there are new vpsllv[dq] instructions (and other shifts), that make
// variable shifts just as cheap as scalar ones.
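// e.g. VPSLLVD can shift each v8i32 lane by a different amount, so
// splatting a scalar amount first buys nothing.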
if (Subtarget.hasInt256() && (Bits == 32 || Bits == 64))
return false;
// Otherwise, it's significantly cheaper to shift by a scalar amount than by a
// fully general vector.
return true;
}
bool X86TargetLowering::isTruncateFree(Type *Ty1, Type *Ty2) const {
if (!Ty1->isIntegerTy() || !Ty2->isIntegerTy())
return false;
unsigned NumBits1 = Ty1->getPrimitiveSizeInBits();
unsigned NumBits2 = Ty2->getPrimitiveSizeInBits();
return NumBits1 > NumBits2;
}
bool X86TargetLowering::allowTruncateForTailCall(Type *Ty1, Type *Ty2) const {
if (!Ty1->isIntegerTy() || !Ty2->isIntegerTy())
return false;
if (!isTypeLegal(EVT::getEVT(Ty1)))
return false;
assert(Ty1->getPrimitiveSizeInBits() <= 64 && "i128 is probably not a noop");
// Assuming the caller doesn't have a zeroext or signext return parameter,
// truncation all the way down to i1 is valid.
return true;
}
bool X86TargetLowering::isLegalICmpImmediate(int64_t Imm) const {
return isInt<32>(Imm);
}
bool X86TargetLowering::isLegalAddImmediate(int64_t Imm) const {
// Can also use sub to handle negated immediates.
return isInt<32>(Imm);
}
bool X86TargetLowering::isTruncateFree(EVT VT1, EVT VT2) const {
if (!VT1.isInteger() || !VT2.isInteger())
return false;
unsigned NumBits1 = VT1.getSizeInBits();
unsigned NumBits2 = VT2.getSizeInBits();
return NumBits1 > NumBits2;
}
bool X86TargetLowering::isZExtFree(Type *Ty1, Type *Ty2) const {
// x86-64 implicitly zero-extends 32-bit results in 64-bit registers.
return Ty1->isIntegerTy(32) && Ty2->isIntegerTy(64) && Subtarget.is64Bit();
}
bool X86TargetLowering::isZExtFree(EVT VT1, EVT VT2) const {
// x86-64 implicitly zero-extends 32-bit results in 64-bit registers.
return VT1 == MVT::i32 && VT2 == MVT::i64 && Subtarget.is64Bit();
}
bool X86TargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
EVT VT1 = Val.getValueType();
if (isZExtFree(VT1, VT2))
return true;
if (Val.getOpcode() != ISD::LOAD)
return false;
if (!VT1.isSimple() || !VT1.isInteger() ||
!VT2.isSimple() || !VT2.isInteger())
return false;
switch (VT1.getSimpleVT().SimpleTy) {
default: break;
case MVT::i8:
case MVT::i16:
case MVT::i32:
// X86 has 8, 16, and 32-bit zero-extending loads.
return true;
}
return false;
}
bool X86TargetLowering::isVectorLoadExtDesirable(SDValue) const { return true; }
bool
X86TargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
if (!Subtarget.hasAnyFMA())
return false;
VT = VT.getScalarType();
if (!VT.isSimple())
return false;
switch (VT.getSimpleVT().SimpleTy) {
case MVT::f32:
case MVT::f64:
return true;
default:
break;
}
return false;
}
bool X86TargetLowering::isNarrowingProfitable(EVT VT1, EVT VT2) const {
// i16 instructions are longer (0x66 prefix) and potentially slower.
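// e.g. a 16-bit add with an immediate needs the 0x66 prefix and can hit
// length-changing-prefix stalls on some microarchitectures.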
return !(VT1 == MVT::i32 && VT2 == MVT::i16);
}
/// Targets can use this to indicate that they only support *some*
/// VECTOR_SHUFFLE operations, those with specific masks.
/// By default, if a target supports the VECTOR_SHUFFLE node, all mask values
/// are assumed to be legal.
bool
X86TargetLowering::isShuffleMaskLegal(const SmallVectorImpl<int> &M,
EVT VT) const {
if (!VT.isSimple())
return false;
// Not for i1 vectors
if (VT.getSimpleVT().getScalarType() == MVT::i1)
return false;
// Very little shuffling can be done for 64-bit vectors right now.
if (VT.getSimpleVT().getSizeInBits() == 64)
return false;
// We only care that the types being shuffled are legal. The lowering can
// handle any possible shuffle mask that results.
return isTypeLegal(VT.getSimpleVT());
}
bool
X86TargetLowering::isVectorClearMaskLegal(const SmallVectorImpl<int> &Mask,
EVT VT) const {
// Just delegate to the generic legality, clear masks aren't special.
return isShuffleMaskLegal(Mask, VT);
}
//===----------------------------------------------------------------------===//
// X86 Scheduler Hooks
//===----------------------------------------------------------------------===//
/// Utility function to emit xbegin specifying the start of an RTM region.
static MachineBasicBlock *emitXBegin(MachineInstr &MI, MachineBasicBlock *MBB,
const TargetInstrInfo *TII) {
DebugLoc DL = MI.getDebugLoc();
const BasicBlock *BB = MBB->getBasicBlock();
MachineFunction::iterator I = ++MBB->getIterator();
// For the v = xbegin(), we generate
//
// thisMBB:
//  xbegin fallMBB
//
// mainMBB:
//  s0 = -1
//
// fallMBB:
//  eax = # XABORT_DEF
//  s1 = eax
//
// sinkMBB:
//  v = phi(s0/mainMBB, s1/fallMBB)
MachineBasicBlock *thisMBB = MBB;
MachineFunction *MF = MBB->getParent();
MachineBasicBlock *mainMBB = MF->CreateMachineBasicBlock(BB);
MachineBasicBlock *fallMBB = MF->CreateMachineBasicBlock(BB);
MachineBasicBlock *sinkMBB = MF->CreateMachineBasicBlock(BB);
MF->insert(I, mainMBB);
MF->insert(I, fallMBB);
MF->insert(I, sinkMBB);
// Transfer the remainder of BB and its successor edges to sinkMBB.
sinkMBB->splice(sinkMBB->begin(), MBB,
std::next(MachineBasicBlock::iterator(MI)), MBB->end());
sinkMBB->transferSuccessorsAndUpdatePHIs(MBB);
MachineRegisterInfo &MRI = MF->getRegInfo();
unsigned DstReg = MI.getOperand(0).getReg();
const TargetRegisterClass *RC = MRI.getRegClass(DstReg);
unsigned mainDstReg = MRI.createVirtualRegister(RC);
unsigned fallDstReg = MRI.createVirtualRegister(RC);
// thisMBB:
// xbegin fallMBB
// # fallthrough to mainMBB
// # abort jumps to fallMBB
BuildMI(thisMBB, DL, TII->get(X86::XBEGIN_4)).addMBB(fallMBB);
thisMBB->addSuccessor(mainMBB);
thisMBB->addSuccessor(fallMBB);
// mainMBB:
// mainDstReg := -1
BuildMI(mainMBB, DL, TII->get(X86::MOV32ri), mainDstReg).addImm(-1);
BuildMI(mainMBB, DL, TII->get(X86::JMP_1)).addMBB(sinkMBB);
mainMBB->addSuccessor(sinkMBB);
// fallMBB:
// ; pseudo instruction to model hardware's definition from XABORT
// EAX := XABORT_DEF
// fallDstReg := EAX
BuildMI(fallMBB, DL, TII->get(X86::XABORT_DEF));
BuildMI(fallMBB, DL, TII->get(TargetOpcode::COPY), fallDstReg)
.addReg(X86::EAX);
fallMBB->addSuccessor(sinkMBB);
// sinkMBB:
// DstReg := phi(mainDstReg/mainMBB, fallDstReg/fallMBB)
BuildMI(*sinkMBB, sinkMBB->begin(), DL, TII->get(X86::PHI), DstReg)
.addReg(mainDstReg).addMBB(mainMBB)
.addReg(fallDstReg).addMBB(fallMBB);
MI.eraseFromParent();
return sinkMBB;
}
// FIXME: When we get size-specific XMM0 registers, e.g. XMM0_V16I8
// or XMM0_V32I8 in AVX, all of this code can be replaced with patterns
// in the .td file.
static MachineBasicBlock *emitPCMPSTRM(MachineInstr &MI, MachineBasicBlock *BB,
const TargetInstrInfo *TII) {
unsigned Opc;
switch (MI.getOpcode()) {
default: llvm_unreachable("illegal opcode!");
case X86::PCMPISTRM128REG: Opc = X86::PCMPISTRM128rr; break;
case X86::VPCMPISTRM128REG: Opc = X86::VPCMPISTRM128rr; break;
case X86::PCMPISTRM128MEM: Opc = X86::PCMPISTRM128rm; break;
case X86::VPCMPISTRM128MEM: Opc = X86::VPCMPISTRM128rm; break;
case X86::PCMPESTRM128REG: Opc = X86::PCMPESTRM128rr; break;
case X86::VPCMPESTRM128REG: Opc = X86::VPCMPESTRM128rr; break;
case X86::PCMPESTRM128MEM: Opc = X86::PCMPESTRM128rm; break;
case X86::VPCMPESTRM128MEM: Opc = X86::VPCMPESTRM128rm; break;
}
DebugLoc dl = MI.getDebugLoc();
MachineInstrBuilder MIB = BuildMI(*BB, MI, dl, TII->get(Opc));
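// Copy the pseudo's explicit operands onto the real instruction, skipping
// the result (operand 0) and any implicit register operands; BuildMI has
// already added the implicit defs/uses from the real opcode's descriptor.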
unsigned NumArgs = MI.getNumOperands();
for (unsigned i = 1; i < NumArgs; ++i) {
MachineOperand &Op = MI.getOperand(i);
if (!(Op.isReg() && Op.isImplicit()))
MIB.add(Op);
}
if (MI.hasOneMemOperand())
MIB->setMemRefs(MI.memoperands_begin(), MI.memoperands_end());
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), MI.getOperand(0).getReg())
.addReg(X86::XMM0);
MI.eraseFromParent();
return BB;
}
// FIXME: Custom handling because TableGen doesn't support multiple implicit
// defs in an instruction pattern
static MachineBasicBlock *emitPCMPSTRI(MachineInstr &MI, MachineBasicBlock *BB,
const TargetInstrInfo *TII) {
unsigned Opc;
switch (MI.getOpcode()) {
default: llvm_unreachable("illegal opcode!");
case X86::PCMPISTRIREG: Opc = X86::PCMPISTRIrr; break;
case X86::VPCMPISTRIREG: Opc = X86::VPCMPISTRIrr; break;
case X86::PCMPISTRIMEM: Opc = X86::PCMPISTRIrm; break;
case X86::VPCMPISTRIMEM: Opc = X86::VPCMPISTRIrm; break;
case X86::PCMPESTRIREG: Opc = X86::PCMPESTRIrr; break;
case X86::VPCMPESTRIREG: Opc = X86::VPCMPESTRIrr; break;
case X86::PCMPESTRIMEM: Opc = X86::PCMPESTRIrm; break;
case X86::VPCMPESTRIMEM: Opc = X86::VPCMPESTRIrm; break;
}
DebugLoc dl = MI.getDebugLoc();
MachineInstrBuilder MIB = BuildMI(*BB, MI, dl, TII->get(Opc));
unsigned NumArgs = MI.getNumOperands(); // operand 0 is the result; skipped below
for (unsigned i = 1; i < NumArgs; ++i) {
MachineOperand &Op = MI.getOperand(i);
if (!(Op.isReg() && Op.isImplicit()))
MIB.add(Op);
}
if (MI.hasOneMemOperand())
MIB->setMemRefs(MI.memoperands_begin(), MI.memoperands_end());
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), MI.getOperand(0).getReg())
.addReg(X86::ECX);
MI.eraseFromParent();
return BB;
}
static MachineBasicBlock *emitWRPKRU(MachineInstr &MI, MachineBasicBlock *BB,
const X86Subtarget &Subtarget) {
DebugLoc dl = MI.getDebugLoc();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
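// WRPKRU writes PKRU from EAX and requires ECX and EDX to be zero, so
// materialize all three registers before emitting the real instruction.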
// insert input VAL into EAX
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), X86::EAX)
.addReg(MI.getOperand(0).getReg());
// insert zero into ECX
BuildMI(*BB, MI, dl, TII->get(X86::MOV32r0), X86::ECX);
// insert zero into EDX
BuildMI(*BB, MI, dl, TII->get(X86::MOV32r0), X86::EDX);
// insert WRPKRU instruction
BuildMI(*BB, MI, dl, TII->get(X86::WRPKRUr));
MI.eraseFromParent(); // The pseudo is gone now.
return BB;
}
static MachineBasicBlock *emitRDPKRU(MachineInstr &MI, MachineBasicBlock *BB,
const X86Subtarget &Subtarget) {
DebugLoc dl = MI.getDebugLoc();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
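// RDPKRU requires ECX to be zero and returns the PKRU value in EAX
// (clearing EDX), so zero ECX first and copy the result out of EAX.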
// insert zero into ECX
BuildMI(*BB, MI, dl, TII->get(X86::MOV32r0), X86::ECX);
// insert RDPKRU instruction
BuildMI(*BB, MI, dl, TII->get(X86::RDPKRUr));
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), MI.getOperand(0).getReg())
.addReg(X86::EAX);
MI.eraseFromParent(); // The pseudo is gone now.
return BB;
}
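// Expand a MONITOR/MONITORX pseudo: materialize the address operand into
// RAX/EAX with an LEA and copy the extension/hint arguments into ECX and
// EDX, then emit the real opcode, which reads those registers implicitly.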
static MachineBasicBlock *emitMonitor(MachineInstr &MI, MachineBasicBlock *BB,
const X86Subtarget &Subtarget,
unsigned Opc) {
DebugLoc dl = MI.getDebugLoc();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
// Address into RAX/EAX, other two args into ECX, EDX.
unsigned MemOpc = Subtarget.is64Bit() ? X86::LEA64r : X86::LEA32r;
unsigned MemReg = Subtarget.is64Bit() ? X86::RAX : X86::EAX;
MachineInstrBuilder MIB = BuildMI(*BB, MI, dl, TII->get(MemOpc), MemReg);
for (int i = 0; i < X86::AddrNumOperands; ++i)
MIB.add(MI.getOperand(i));
unsigned ValOps = X86::AddrNumOperands;
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), X86::ECX)
.addReg(MI.getOperand(ValOps).getReg());
BuildMI(*BB, MI, dl, TII->get(TargetOpcode::COPY), X86::EDX)
.addReg(MI.getOperand(ValOps + 1).getReg());
// The real instruction takes no explicit operands; it reads RAX/EAX, ECX,
// and EDX implicitly.
BuildMI(*BB, MI, dl, TII->get(Opc));
MI.eraseFromParent(); // The pseudo is gone now.
return BB;
}
static MachineBasicBlock *emitClzero(MachineInstr *MI, MachineBasicBlock *BB,
const X86Subtarget &Subtarget) {
DebugLoc dl = MI->getDebugLoc();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
// Address into RAX/EAX
unsigned MemOpc = Subtarget.is64Bit() ? X86::LEA64r : X86::LEA32r;
unsigned MemReg = Subtarget.is64Bit() ? X86::RAX : X86::EAX;
MachineInstrBuilder MIB = BuildMI(*BB, MI, dl, TII->get(MemOpc), MemReg);
for (int i = 0; i < X86::AddrNumOperands; ++i)
MIB.add(MI->getOperand(i));
// CLZERO itself takes no explicit operands; the cache-line address is
// implicit in RAX/EAX.
BuildMI(*BB, MI, dl, TII->get(X86::CLZEROr));
MI->eraseFromParent(); // The pseudo is gone now.
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitVAARG64WithCustomInserter(MachineInstr &MI,
MachineBasicBlock *MBB) const {
// Emit va_arg instruction on X86-64.
// Operands to this pseudo-instruction:
// 0 ) Output : destination address (reg)
// 1-5) Input : va_list address (addr, i64mem)
// 6 ) ArgSize : Size (in bytes) of vararg type
// 7 ) ArgMode : 0=overflow only, 1=use gp_offset, 2=use fp_offset
// 8 ) Align : Alignment of type
// 9 ) EFLAGS (implicit-def)
assert(MI.getNumOperands() == 10 && "VAARG_64 should have 10 operands!");
static_assert(X86::AddrNumOperands == 5,
"VAARG_64 assumes 5 address operands");
unsigned DestReg = MI.getOperand(0).getReg();
MachineOperand &Base = MI.getOperand(1);
MachineOperand &Scale = MI.getOperand(2);
MachineOperand &Index = MI.getOperand(3);
MachineOperand &Disp = MI.getOperand(4);
MachineOperand &Segment = MI.getOperand(5);
unsigned ArgSize = MI.getOperand(6).getImm();
unsigned ArgMode = MI.getOperand(7).getImm();
unsigned Align = MI.getOperand(8).getImm();
// Memory Reference
assert(MI.hasOneMemOperand() && "Expected VAARG_64 to have one memoperand");
MachineInstr::mmo_iterator MMOBegin = MI.memoperands_begin();
MachineInstr::mmo_iterator MMOEnd = MI.memoperands_end();
// Machine Information
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();
const TargetRegisterClass *AddrRegClass = getRegClassFor(MVT::i64);
const TargetRegisterClass *OffsetRegClass = getRegClassFor(MVT::i32);
DebugLoc DL = MI.getDebugLoc();
// struct va_list {
// i32 gp_offset
// i32 fp_offset
// i64 overflow_area (address)
// i64 reg_save_area (address)
// }
// sizeof(va_list) = 24
// alignment(va_list) = 8
unsigned TotalNumIntRegs = 6;
unsigned TotalNumXMMRegs = 8;
bool UseGPOffset = (ArgMode == 1);
bool UseFPOffset = (ArgMode == 2);
unsigned MaxOffset = TotalNumIntRegs * 8 +
(UseFPOffset ? TotalNumXMMRegs * 16 : 0);
// Align ArgSize to a multiple of 8.
unsigned ArgSizeA8 = (ArgSize + 7) & ~7;
bool NeedsAlign = (Align > 8);
MachineBasicBlock *thisMBB = MBB;
MachineBasicBlock *overflowMBB;
MachineBasicBlock *offsetMBB;
MachineBasicBlock *endMBB;
unsigned OffsetDestReg = 0; // Argument address computed by offsetMBB
unsigned OverflowDestReg = 0; // Argument address computed by overflowMBB
unsigned OffsetReg = 0;
if (!UseGPOffset && !UseFPOffset) {
// If we only pull from the overflow region, we don't need to alter
// control flow, so we don't create a branch.
OffsetDestReg = 0; // unused
OverflowDestReg = DestReg;
offsetMBB = nullptr;
overflowMBB = thisMBB;
endMBB = thisMBB;
} else {
// First emit code to check if gp_offset (or fp_offset) is below the bound.
// If so, pull the argument from reg_save_area. (branch to offsetMBB)
// If not, pull from overflow_area. (branch to overflowMBB)
//
//      thisMBB
//        |     .
//        |        .
//        offsetMBB   overflowMBB
//        |        .
//        |     .
//       endMBB
// Registers for the PHI in endMBB
OffsetDestReg = MRI.createVirtualRegister(AddrRegClass);
OverflowDestReg = MRI.createVirtualRegister(AddrRegClass);
const BasicBlock *LLVM_BB = MBB->getBasicBlock();
MachineFunction *MF = MBB->getParent();
overflowMBB = MF->CreateMachineBasicBlock(LLVM_BB);
offsetMBB = MF->CreateMachineBasicBlock(LLVM_BB);
endMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MachineFunction::iterator MBBIter = ++MBB->getIterator();
// Insert the new basic blocks
MF->insert(MBBIter, offsetMBB);
MF->insert(MBBIter, overflowMBB);
MF->insert(MBBIter, endMBB);
// Transfer the remainder of MBB and its successor edges to endMBB.
endMBB->splice(endMBB->begin(), thisMBB,
std::next(MachineBasicBlock::iterator(MI)), thisMBB->end());
endMBB->transferSuccessorsAndUpdatePHIs(thisMBB);
// Make offsetMBB and overflowMBB successors of thisMBB
thisMBB->addSuccessor(offsetMBB);
thisMBB->addSuccessor(overflowMBB);
// endMBB is a successor of both offsetMBB and overflowMBB
offsetMBB->addSuccessor(endMBB);
overflowMBB->addSuccessor(endMBB);
// Load the offset value into a register
OffsetReg = MRI.createVirtualRegister(OffsetRegClass);
BuildMI(thisMBB, DL, TII->get(X86::MOV32rm), OffsetReg)
.add(Base)
.add(Scale)
.add(Index)
.addDisp(Disp, UseFPOffset ? 4 : 0)
.add(Segment)
.setMemRefs(MMOBegin, MMOEnd);
// Check if there is enough room left to pull this argument from the
// register save area: it fits iff OffsetReg + ArgSizeA8 <= MaxOffset.
// Since all quantities are multiples of 8, that is equivalent to
// OffsetReg < MaxOffset + 8 - ArgSizeA8, so we branch to the overflow
// path on "above or equal".
BuildMI(thisMBB, DL, TII->get(X86::CMP32ri))
.addReg(OffsetReg)
.addImm(MaxOffset + 8 - ArgSizeA8);
// Branch to "overflowMBB" if offset >= max
// Fall through to "offsetMBB" otherwise
BuildMI(thisMBB, DL, TII->get(X86::GetCondBranchFromCond(X86::COND_AE)))
.addMBB(overflowMBB);
}
// In offsetMBB, emit code to use the reg_save_area.
if (offsetMBB) {
assert(OffsetReg != 0);
// Read the reg_save_area address.
unsigned RegSaveReg = MRI.createVirtualRegister(AddrRegClass);
BuildMI(offsetMBB, DL, TII->get(X86::MOV64rm), RegSaveReg)
.add(Base)
.add(Scale)
.add(Index)
.addDisp(Disp, 16)
.add(Segment)
.setMemRefs(MMOBegin, MMOEnd);
// Zero-extend the offset
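// (SUBREG_TO_REG models the fact that a 32-bit write on x86-64 implicitly
// zeroes the upper 32 bits of the containing 64-bit register.)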
unsigned OffsetReg64 = MRI.createVirtualRegister(AddrRegClass);
BuildMI(offsetMBB, DL, TII->get(X86::SUBREG_TO_REG), OffsetReg64)
.addImm(0)
.addReg(OffsetReg)
.addImm(X86::sub_32bit);
// Add the offset to the reg_save_area to get the final address.
BuildMI(offsetMBB, DL, TII->get(X86::ADD64rr), OffsetDestReg)
.addReg(OffsetReg64)
.addReg(RegSaveReg);
// Compute the offset for the next argument
unsigned NextOffsetReg = MRI.createVirtualRegister(OffsetRegClass);
BuildMI(offsetMBB, DL, TII->get(X86::ADD32ri), NextOffsetReg)
.addReg(OffsetReg)
.addImm(UseFPOffset ? 16 : 8);
// Store it back into the va_list.
BuildMI(offsetMBB, DL, TII->get(X86::MOV32mr))
.add(Base)
.add(Scale)
.add(Index)
.addDisp(Disp, UseFPOffset ? 4 : 0)
.add(Segment)
.addReg(NextOffsetReg)
.setMemRefs(MMOBegin, MMOEnd);
// Jump to endMBB
BuildMI(offsetMBB, DL, TII->get(X86::JMP_1))
.addMBB(endMBB);
}
//
// Emit code to use overflow area
//
// Load the overflow_area address into a register.
unsigned OverflowAddrReg = MRI.createVirtualRegister(AddrRegClass);
BuildMI(overflowMBB, DL, TII->get(X86::MOV64rm), OverflowAddrReg)
.add(Base)
.add(Scale)
.add(Index)
.addDisp(Disp, 8)
.add(Segment)
.setMemRefs(MMOBegin, MMOEnd);
// If we need to align it, do so. Otherwise, just copy the address
// to OverflowDestReg.
if (NeedsAlign) {
// Align the overflow address
assert(isPowerOf2_32(Align) && "Alignment must be a power of 2");
unsigned TmpReg = MRI.createVirtualRegister(AddrRegClass);
// aligned_addr = (addr + (align-1)) & ~(align-1)
BuildMI(overflowMBB, DL, TII->get(X86::ADD64ri32), TmpReg)
.addReg(OverflowAddrReg)
.addImm(Align-1);
BuildMI(overflowMBB, DL, TII->get(X86::AND64ri32), OverflowDestReg)
.addReg(TmpReg)
.addImm(~(uint64_t)(Align-1));
} else {
BuildMI(overflowMBB, DL, TII->get(TargetOpcode::COPY), OverflowDestReg)
.addReg(OverflowAddrReg);
}
// Compute the next overflow address after this argument.
// (the overflow address should be kept 8-byte aligned)
unsigned NextAddrReg = MRI.createVirtualRegister(AddrRegClass);
BuildMI(overflowMBB, DL, TII->get(X86::ADD64ri32), NextAddrReg)
.addReg(OverflowDestReg)
.addImm(ArgSizeA8);
// Store the new overflow address.
BuildMI(overflowMBB, DL, TII->get(X86::MOV64mr))
.add(Base)
.add(Scale)
.add(Index)
.addDisp(Disp, 8)
.add(Segment)
.addReg(NextAddrReg)
.setMemRefs(MMOBegin, MMOEnd);
// If we branched, emit the PHI to the front of endMBB.
if (offsetMBB) {
BuildMI(*endMBB, endMBB->begin(), DL,
TII->get(X86::PHI), DestReg)
.addReg(OffsetDestReg).addMBB(offsetMBB)
.addReg(OverflowDestReg).addMBB(overflowMBB);
}
// Erase the pseudo instruction
MI.eraseFromParent();
return endMBB;
}
MachineBasicBlock *X86TargetLowering::EmitVAStartSaveXMMRegsWithCustomInserter(
MachineInstr &MI, MachineBasicBlock *MBB) const {
// Emit code to save XMM registers to the stack. The ABI says that the
// number of registers to save is given in %al, so it's theoretically
// possible to do an indirect jump trick to avoid saving all of them,
// however this code takes a simpler approach and just executes all
// of the stores if %al is non-zero. It's less code, and it's probably
// easier on the hardware branch predictor, and stores aren't all that
// expensive anyway.
// Create the new basic blocks. One block contains all the XMM stores,
// and one block is the final destination regardless of whether any
// stores were performed.
const BasicBlock *LLVM_BB = MBB->getBasicBlock();
MachineFunction *F = MBB->getParent();
MachineFunction::iterator MBBIter = ++MBB->getIterator();
MachineBasicBlock *XMMSaveMBB = F->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *EndMBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(MBBIter, XMMSaveMBB);
F->insert(MBBIter, EndMBB);
// Transfer the remainder of MBB and its successor edges to EndMBB.
EndMBB->splice(EndMBB->begin(), MBB,
std::next(MachineBasicBlock::iterator(MI)), MBB->end());
EndMBB->transferSuccessorsAndUpdatePHIs(MBB);
// The original block will now fall through to the XMM save block.
MBB->addSuccessor(XMMSaveMBB);
// The XMMSaveMBB will fall through to the end block.
XMMSaveMBB->addSuccessor(EndMBB);
// Now add the instructions.
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
unsigned CountReg = MI.getOperand(0).getReg();
int64_t RegSaveFrameIndex = MI.getOperand(1).getImm();
int64_t VarArgsFPOffset = MI.getOperand(2).getImm();
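// In the SysV AMD64 ABI, %al carries an upper bound on the number of vector
// registers used by a varargs call; Win64 has no such convention, so the
// guard branch is only emitted for non-Win64 calling conventions.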
if (!Subtarget.isCallingConvWin64(F->getFunction()->getCallingConv())) {
// If %al is 0, branch around the XMM save block.
BuildMI(MBB, DL, TII->get(X86::TEST8rr)).addReg(CountReg).addReg(CountReg);
BuildMI(MBB, DL, TII->get(X86::JE_1)).addMBB(EndMBB);
MBB->addSuccessor(EndMBB);
}
// Make sure the last operand is EFLAGS, which gets clobbered by the branch
// that was just emitted, but clearly shouldn't be "saved".
assert((MI.getNumOperands() <= 3 ||
!MI.getOperand(MI.getNumOperands() - 1).isReg() ||
MI.getOperand(MI.getNumOperands() - 1).getReg() == X86::EFLAGS) &&
"Expected last argument to be EFLAGS");
unsigned MOVOpc = Subtarget.hasFp256() ? X86::VMOVAPSmr : X86::MOVAPSmr;
// In the XMM save block, save all the XMM argument registers.
for (int i = 3, e = MI.getNumOperands() - 1; i != e; ++i) {
int64_t Offset = (i - 3) * 16 + VarArgsFPOffset;
MachineMemOperand *MMO = F->getMachineMemOperand(
MachinePointerInfo::getFixedStack(*F, RegSaveFrameIndex, Offset),
MachineMemOperand::MOStore,
/*Size=*/16, /*Align=*/16);
BuildMI(XMMSaveMBB, DL, TII->get(MOVOpc))
.addFrameIndex(RegSaveFrameIndex)
.addImm(/*Scale=*/1)
.addReg(/*IndexReg=*/0)
.addImm(/*Disp=*/Offset)
.addReg(/*Segment=*/0)
.addReg(MI.getOperand(i).getReg())
.addMemOperand(MMO);
}
MI.eraseFromParent(); // The pseudo instruction is gone now.
return EndMBB;
}
// The EFLAGS operand of SelectItr might be missing a kill marker
// because there were multiple uses of EFLAGS, and ISel didn't know
// which to mark. Figure out whether SelectItr should have had a
// kill marker, and set it if it should. Returns the correct kill
// marker value.
static bool checkAndUpdateEFLAGSKill(MachineBasicBlock::iterator SelectItr,
MachineBasicBlock* BB,
const TargetRegisterInfo* TRI) {
// Scan forward through BB for a use/def of EFLAGS.
MachineBasicBlock::iterator miI(std::next(SelectItr));
for (MachineBasicBlock::iterator miE = BB->end(); miI != miE; ++miI) {
const MachineInstr& mi = *miI;
if (mi.readsRegister(X86::EFLAGS))
return false;
if (mi.definesRegister(X86::EFLAGS))
break; // Should have kill-flag - update below.
}
// If we hit the end of the block, check whether EFLAGS is live into a
// successor.
if (miI == BB->end()) {
for (MachineBasicBlock::succ_iterator sItr = BB->succ_begin(),
sEnd = BB->succ_end();
sItr != sEnd; ++sItr) {
MachineBasicBlock* succ = *sItr;
if (succ->isLiveIn(X86::EFLAGS))
return false;
}
}
// We found a def, or hit the end of the basic block and EFLAGS wasn't live
// out. SelectMI should have a kill flag on EFLAGS.
SelectItr->addRegisterKilled(X86::EFLAGS, TRI);
return true;
}
// Return true if it is OK for this CMOV pseudo-opcode to be cascaded
// together with other CMOV pseudo-opcodes into a single basic-block with
// conditional jump around it.
static bool isCMOVPseudo(MachineInstr &MI) {
switch (MI.getOpcode()) {
case X86::CMOV_FR32:
case X86::CMOV_FR64:
case X86::CMOV_GR8:
case X86::CMOV_GR16:
case X86::CMOV_GR32:
case X86::CMOV_RFP32:
case X86::CMOV_RFP64:
case X86::CMOV_RFP80:
case X86::CMOV_V2F64:
case X86::CMOV_V2I64:
case X86::CMOV_V4F32:
case X86::CMOV_V4F64:
case X86::CMOV_V4I64:
case X86::CMOV_V16F32:
case X86::CMOV_V8F32:
case X86::CMOV_V8F64:
case X86::CMOV_V8I64:
case X86::CMOV_V8I1:
case X86::CMOV_V16I1:
case X86::CMOV_V32I1:
case X86::CMOV_V64I1:
return true;
default:
return false;
}
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredSelect(MachineInstr &MI,
MachineBasicBlock *BB) const {
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
// To "insert" a SELECT_CC instruction, we actually have to insert the
// diamond control-flow pattern. The incoming instruction knows the
// destination vreg to set, the condition code register to branch on, the
// true/false values to select between, and a branch opcode to use.
const BasicBlock *LLVM_BB = BB->getBasicBlock();
MachineFunction::iterator It = ++BB->getIterator();
// thisMBB:
// ...
// TrueVal = ...
// cmpTY ccX, r1, r2
// bCC copy1MBB
// fallthrough --> copy0MBB
MachineBasicBlock *thisMBB = BB;
MachineFunction *F = BB->getParent();
// This code lowers all pseudo-CMOV instructions. Generally it lowers these
// as described above, by inserting a BB, and then making a PHI at the join
// point to select the true and false operands of the CMOV in the PHI.
//
// The code also handles two different cases of multiple CMOV opcodes
// in a row.
//
// Case 1:
// In this case, there are multiple CMOVs in a row, all of which are based on
// the same condition setting (or the exact opposite condition setting).
// In this case we can lower all the CMOVs using a single inserted BB, and
// then make a number of PHIs at the join point to model the CMOVs. The only
// trickiness here is that in a case like:
//
// t2 = CMOV cond1 t1, f1
// t3 = CMOV cond1 t2, f2
//
// when rewriting this into PHIs, we have to perform some renaming on the
// temps since you cannot have a PHI operand refer to a PHI result earlier
// in the same block. The "simple" but wrong lowering would be:
//
// t2 = PHI t1(BB1), f1(BB2)
// t3 = PHI t2(BB1), f2(BB2)
//
// but clearly t2 is not defined in BB1, so that is incorrect. The proper
// renaming is to note that on the path through BB1, t2 is really just a
// copy of t1, and do that renaming, properly generating:
//
// t2 = PHI t1(BB1), f1(BB2)
// t3 = PHI t1(BB1), f2(BB2)
//
// In case 2, we lower cascaded CMOVs such as
//
// (CMOV (CMOV F, T, cc1), T, cc2)
//
// to two successive branches. For that, we look for another CMOV as the
// following instruction.
//
// Without this, we would add a PHI between the two jumps, which ends up
// creating a few copies all around. For instance, for
//
// (sitofp (zext (fcmp une)))
//
// we would generate:
//
// ucomiss %xmm1, %xmm0
// movss <1.0f>, %xmm0
// movaps %xmm0, %xmm1
// jne .LBB5_2
// xorps %xmm1, %xmm1
// .LBB5_2:
// jp .LBB5_4
// movaps %xmm1, %xmm0
// .LBB5_4:
// retq
//
// because this custom-inserter would have generated:
//
//        A
//        | \
//        |  B
//        | /
//        C
//        | \
//        |  D
//        | /
//        E
//
// A: X = ...; Y = ...
// B: empty
// C: Z = PHI [X, A], [Y, B]
// D: empty
// E: PHI [X, C], [Z, D]
//
// If we lower both CMOVs in a single step, we can instead generate:
//
//        A
//        | \
//        |  C
//        | /|
//        |/ |
//        |  |
//        |  D
//        | /
//        E
//
// A: X = ...; Y = ...
// D: empty
// E: PHI [X, A], [X, C], [Y, D]
//
// Which, in our sitofp/fcmp example, gives us something like:
//
// ucomiss %xmm1, %xmm0
// movss <1.0f>, %xmm0
// jne .LBB5_4
// jp .LBB5_4
// xorps %xmm0, %xmm0
// .LBB5_4:
// retq
//
MachineInstr *CascadedCMOV = nullptr;
MachineInstr *LastCMOV = &MI;
X86::CondCode CC = X86::CondCode(MI.getOperand(3).getImm());
X86::CondCode OppCC = X86::GetOppositeBranchCondition(CC);
MachineBasicBlock::iterator NextMIIt =
std::next(MachineBasicBlock::iterator(MI));
// Check for case 1 first, where there are multiple CMOVs with the same
// condition. Of the two cases of multiple CMOV lowerings, case 1 reduces
// the number of jumps the most.
if (isCMOVPseudo(MI)) {
// See if we have a run of CMOVs with the same condition.
while (NextMIIt != BB->end() && isCMOVPseudo(*NextMIIt) &&
(NextMIIt->getOperand(3).getImm() == CC ||
NextMIIt->getOperand(3).getImm() == OppCC)) {
LastCMOV = &*NextMIIt;
++NextMIIt;
}
}
// Now check for case 2, but only if we didn't already find case 1
// (indicated by LastCMOV still pointing at MI).
if (LastCMOV == &MI && NextMIIt != BB->end() &&
NextMIIt->getOpcode() == MI.getOpcode() &&
NextMIIt->getOperand(2).getReg() == MI.getOperand(2).getReg() &&
NextMIIt->getOperand(1).getReg() == MI.getOperand(0).getReg() &&
NextMIIt->getOperand(1).isKill()) {
CascadedCMOV = &*NextMIIt;
}
MachineBasicBlock *jcc1MBB = nullptr;
// If we have a cascaded CMOV, we lower it to two successive branches to
// the same block. EFLAGS is used by both, so mark it as live in the second.
if (CascadedCMOV) {
jcc1MBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(It, jcc1MBB);
jcc1MBB->addLiveIn(X86::EFLAGS);
}
MachineBasicBlock *copy0MBB = F->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *sinkMBB = F->CreateMachineBasicBlock(LLVM_BB);
F->insert(It, copy0MBB);
F->insert(It, sinkMBB);
// If the EFLAGS register isn't dead in the terminator, then claim that it's
// live into the sink and copy blocks.
const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
MachineInstr *LastEFLAGSUser = CascadedCMOV ? CascadedCMOV : LastCMOV;
if (!LastEFLAGSUser->killsRegister(X86::EFLAGS) &&
!checkAndUpdateEFLAGSKill(LastEFLAGSUser, BB, TRI)) {
copy0MBB->addLiveIn(X86::EFLAGS);
sinkMBB->addLiveIn(X86::EFLAGS);
}
// Transfer the remainder of BB and its successor edges to sinkMBB.
sinkMBB->splice(sinkMBB->begin(), BB,
std::next(MachineBasicBlock::iterator(LastCMOV)), BB->end());
sinkMBB->transferSuccessorsAndUpdatePHIs(BB);
// Add the true and fallthrough blocks as its successors.
if (CascadedCMOV) {
// The fallthrough block may be jcc1MBB, if we have a cascaded CMOV.
BB->addSuccessor(jcc1MBB);
// In that case, jcc1MBB will itself fall through to copy0MBB, and
// jump to sinkMBB.
jcc1MBB->addSuccessor(copy0MBB);
jcc1MBB->addSuccessor(sinkMBB);
} else {
BB->addSuccessor(copy0MBB);
}
// The true block target of the first (or only) branch is always sinkMBB.
BB->addSuccessor(sinkMBB);
// Create the conditional branch instruction.
unsigned Opc = X86::GetCondBranchFromCond(CC);
BuildMI(BB, DL, TII->get(Opc)).addMBB(sinkMBB);
if (CascadedCMOV) {
unsigned Opc2 = X86::GetCondBranchFromCond(
(X86::CondCode)CascadedCMOV->getOperand(3).getImm());
BuildMI(jcc1MBB, DL, TII->get(Opc2)).addMBB(sinkMBB);
}
// copy0MBB:
// %FalseValue = ...
// # fallthrough to sinkMBB
copy0MBB->addSuccessor(sinkMBB);
// sinkMBB:
// %Result = phi [ %FalseValue, copy0MBB ], [ %TrueValue, thisMBB ]
// ...
MachineBasicBlock::iterator MIItBegin = MachineBasicBlock::iterator(MI);
MachineBasicBlock::iterator MIItEnd =
std::next(MachineBasicBlock::iterator(LastCMOV));
MachineBasicBlock::iterator SinkInsertionPoint = sinkMBB->begin();
DenseMap<unsigned, std::pair<unsigned, unsigned>> RegRewriteTable;
MachineInstrBuilder MIB;
// As we are creating the PHIs, we have to be careful if there is more than
// one. Later CMOVs may reference the results of earlier CMOVs, but later
// PHIs have to reference the individual true/false inputs from earlier PHIs.
// That also means that PHI construction must work forward from earlier to
// later, and that the code must maintain a mapping from each earlier
// PHI's destination register to the registers that went into that PHI.
for (MachineBasicBlock::iterator MIIt = MIItBegin; MIIt != MIItEnd; ++MIIt) {
unsigned DestReg = MIIt->getOperand(0).getReg();
unsigned Op1Reg = MIIt->getOperand(1).getReg();
unsigned Op2Reg = MIIt->getOperand(2).getReg();
// If the CMOV we are generating uses the opposite condition from
// the jump we generated, then we have to swap the operands for the
// PHI that is going to be generated.
if (MIIt->getOperand(3).getImm() == OppCC)
std::swap(Op1Reg, Op2Reg);
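// If an input was produced by an earlier CMOV in this group, it is now
// defined by a PHI in sinkMBB; substitute the value that PHI carries on
// the corresponding edge (the renaming described above).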
if (RegRewriteTable.find(Op1Reg) != RegRewriteTable.end())
Op1Reg = RegRewriteTable[Op1Reg].first;
if (RegRewriteTable.find(Op2Reg) != RegRewriteTable.end())
Op2Reg = RegRewriteTable[Op2Reg].second;
MIB = BuildMI(*sinkMBB, SinkInsertionPoint, DL,
TII->get(X86::PHI), DestReg)
.addReg(Op1Reg).addMBB(copy0MBB)
.addReg(Op2Reg).addMBB(thisMBB);
// Add this PHI to the rewrite table.
RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg);
}
// If we have a cascaded CMOV, the second Jcc provides the same incoming
// value as the first Jcc (the True operand of the SELECT_CC/CMOV nodes).
if (CascadedCMOV) {
MIB.addReg(MI.getOperand(2).getReg()).addMBB(jcc1MBB);
// Copy the PHI result to the register defined by the second CMOV.
BuildMI(*sinkMBB, std::next(MachineBasicBlock::iterator(MIB.getInstr())),
DL, TII->get(TargetOpcode::COPY),
CascadedCMOV->getOperand(0).getReg())
.addReg(MI.getOperand(0).getReg());
CascadedCMOV->eraseFromParent();
}
// Now remove the CMOV(s).
for (MachineBasicBlock::iterator MIIt = MIItBegin; MIIt != MIItEnd; )
(MIIt++)->eraseFromParent();
return sinkMBB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredAtomicFP(MachineInstr &MI,
MachineBasicBlock *BB) const {
// Lower the following atomic floating-point modification pattern:
//   a.store(reg OP a.load(acquire), release)
// into:
//   OPss (%gpr), %xmm
//   movss %xmm, (%gpr)
// or the sd equivalent for 64-bit operations.
unsigned MOp, FOp;
switch (MI.getOpcode()) {
default: llvm_unreachable("unexpected instr type for EmitLoweredAtomicFP");
case X86::RELEASE_FADD32mr:
FOp = X86::ADDSSrm;
MOp = X86::MOVSSmr;
break;
case X86::RELEASE_FADD64mr:
FOp = X86::ADDSDrm;
MOp = X86::MOVSDmr;
break;
}
const X86InstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
MachineRegisterInfo &MRI = BB->getParent()->getRegInfo();
unsigned ValOpIdx = X86::AddrNumOperands;
unsigned VSrc = MI.getOperand(ValOpIdx).getReg();
MachineInstrBuilder MIB =
BuildMI(*BB, MI, DL, TII->get(FOp),
MRI.createVirtualRegister(MRI.getRegClass(VSrc)))
.addReg(VSrc);
for (int i = 0; i < X86::AddrNumOperands; ++i) {
MachineOperand &Operand = MI.getOperand(i);
// Clear any kill flags on register operands as we'll create a second
// instruction using the same address operands.
if (Operand.isReg())
Operand.setIsKill(false);
MIB.add(Operand);
}
MachineInstr *FOpMI = MIB;
MIB = BuildMI(*BB, MI, DL, TII->get(MOp));
for (int i = 0; i < X86::AddrNumOperands; ++i)
MIB.add(MI.getOperand(i));
MIB.addReg(FOpMI->getOperand(0).getReg(), RegState::Kill);
MI.eraseFromParent(); // The pseudo instruction is gone now.
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredSegAlloca(MachineInstr &MI,
MachineBasicBlock *BB) const {
MachineFunction *MF = BB->getParent();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
const BasicBlock *LLVM_BB = BB->getBasicBlock();
assert(MF->shouldSplitStack());
const bool Is64Bit = Subtarget.is64Bit();
const bool IsLP64 = Subtarget.isTarget64BitLP64();
const unsigned TlsReg = Is64Bit ? X86::FS : X86::GS;
const unsigned TlsOffset = IsLP64 ? 0x70 : Is64Bit ? 0x40 : 0x30;
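// These are the per-ABI offsets of the stack-limit slot in thread-local
// storage used by the segmented-stack (-fsplit-stack) runtime: 0x70 for
// LP64, 0x40 for x32, and 0x30 for 32-bit targets.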
// BB:
// ... [Till the alloca]
// If stacklet is not large enough, jump to mallocMBB
//
// bumpMBB:
// Allocate by subtracting from RSP
// Jump to continueMBB
//
// mallocMBB:
// Allocate by call to runtime
//
// continueMBB:
// ...
// [rest of original BB]
//
MachineBasicBlock *mallocMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *bumpMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MachineBasicBlock *continueMBB = MF->CreateMachineBasicBlock(LLVM_BB);
MachineRegisterInfo &MRI = MF->getRegInfo();
const TargetRegisterClass *AddrRegClass =
getRegClassFor(getPointerTy(MF->getDataLayout()));
unsigned mallocPtrVReg = MRI.createVirtualRegister(AddrRegClass),
bumpSPPtrVReg = MRI.createVirtualRegister(AddrRegClass),
tmpSPVReg = MRI.createVirtualRegister(AddrRegClass),
SPLimitVReg = MRI.createVirtualRegister(AddrRegClass),
sizeVReg = MI.getOperand(1).getReg(),
physSPReg =
IsLP64 || Subtarget.isTargetNaCl64() ? X86::RSP : X86::ESP;
MachineFunction::iterator MBBIter = ++BB->getIterator();
MF->insert(MBBIter, bumpMBB);
MF->insert(MBBIter, mallocMBB);
MF->insert(MBBIter, continueMBB);
continueMBB->splice(continueMBB->begin(), BB,
std::next(MachineBasicBlock::iterator(MI)), BB->end());
continueMBB->transferSuccessorsAndUpdatePHIs(BB);
// Add code to the main basic block to check if the stack limit has been hit,
// and if so, jump to mallocMBB; otherwise fall through to bumpMBB.
BuildMI(BB, DL, TII->get(TargetOpcode::COPY), tmpSPVReg).addReg(physSPReg);
BuildMI(BB, DL, TII->get(IsLP64 ? X86::SUB64rr:X86::SUB32rr), SPLimitVReg)
.addReg(tmpSPVReg).addReg(sizeVReg);
BuildMI(BB, DL, TII->get(IsLP64 ? X86::CMP64mr:X86::CMP32mr))
.addReg(0).addImm(1).addReg(0).addImm(TlsOffset).addReg(TlsReg)
.addReg(SPLimitVReg);
BuildMI(BB, DL, TII->get(X86::JG_1)).addMBB(mallocMBB);
// bumpMBB simply decreases the stack pointer, since we know the current
// stacklet has enough space.
BuildMI(bumpMBB, DL, TII->get(TargetOpcode::COPY), physSPReg)
.addReg(SPLimitVReg);
BuildMI(bumpMBB, DL, TII->get(TargetOpcode::COPY), bumpSPPtrVReg)
.addReg(SPLimitVReg);
BuildMI(bumpMBB, DL, TII->get(X86::JMP_1)).addMBB(continueMBB);
// Calls into a routine in libgcc to allocate more space from the heap.
const uint32_t *RegMask =
Subtarget.getRegisterInfo()->getCallPreservedMask(*MF, CallingConv::C);
if (IsLP64) {
BuildMI(mallocMBB, DL, TII->get(X86::MOV64rr), X86::RDI)
.addReg(sizeVReg);
BuildMI(mallocMBB, DL, TII->get(X86::CALL64pcrel32))
.addExternalSymbol("__morestack_allocate_stack_space")
.addRegMask(RegMask)
.addReg(X86::RDI, RegState::Implicit)
.addReg(X86::RAX, RegState::ImplicitDefine);
} else if (Is64Bit) {
BuildMI(mallocMBB, DL, TII->get(X86::MOV32rr), X86::EDI)
.addReg(sizeVReg);
BuildMI(mallocMBB, DL, TII->get(X86::CALL64pcrel32))
.addExternalSymbol("__morestack_allocate_stack_space")
.addRegMask(RegMask)
.addReg(X86::EDI, RegState::Implicit)
.addReg(X86::EAX, RegState::ImplicitDefine);
} else {
BuildMI(mallocMBB, DL, TII->get(X86::SUB32ri), physSPReg).addReg(physSPReg)
.addImm(12);
BuildMI(mallocMBB, DL, TII->get(X86::PUSH32r)).addReg(sizeVReg);
BuildMI(mallocMBB, DL, TII->get(X86::CALLpcrel32))
.addExternalSymbol("__morestack_allocate_stack_space")
.addRegMask(RegMask)
.addReg(X86::EAX, RegState::ImplicitDefine);
}
if (!Is64Bit)
BuildMI(mallocMBB, DL, TII->get(X86::ADD32ri), physSPReg).addReg(physSPReg)
.addImm(16);
BuildMI(mallocMBB, DL, TII->get(TargetOpcode::COPY), mallocPtrVReg)
.addReg(IsLP64 ? X86::RAX : X86::EAX);
BuildMI(mallocMBB, DL, TII->get(X86::JMP_1)).addMBB(continueMBB);
// Set up the CFG correctly.
BB->addSuccessor(bumpMBB);
BB->addSuccessor(mallocMBB);
mallocMBB->addSuccessor(continueMBB);
bumpMBB->addSuccessor(continueMBB);
// Take care of the PHI nodes.
BuildMI(*continueMBB, continueMBB->begin(), DL, TII->get(X86::PHI),
MI.getOperand(0).getReg())
.addReg(mallocPtrVReg)
.addMBB(mallocMBB)
.addReg(bumpSPPtrVReg)
.addMBB(bumpMBB);
// Delete the original pseudo instruction.
MI.eraseFromParent();
// And we're done.
return continueMBB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredCatchRet(MachineInstr &MI,
MachineBasicBlock *BB) const {
MachineFunction *MF = BB->getParent();
const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
MachineBasicBlock *TargetMBB = MI.getOperand(0).getMBB();
DebugLoc DL = MI.getDebugLoc();
assert(!isAsynchronousEHPersonality(
classifyEHPersonality(MF->getFunction()->getPersonalityFn())) &&
"SEH does not use catchret!");
// Only 32-bit EH needs to worry about manually restoring stack pointers.
if (!Subtarget.is32Bit())
return BB;
// C++ EH creates a new target block to hold the restore code, and wires up
// the new block to the return destination with a normal JMP_4.
MachineBasicBlock *RestoreMBB =
MF->CreateMachineBasicBlock(BB->getBasicBlock());
assert(BB->succ_size() == 1);
MF->insert(std::next(BB->getIterator()), RestoreMBB);
RestoreMBB->transferSuccessorsAndUpdatePHIs(BB);
BB->addSuccessor(RestoreMBB);
MI.getOperand(0).setMBB(RestoreMBB);
auto RestoreMBBI = RestoreMBB->begin();
BuildMI(*RestoreMBB, RestoreMBBI, DL, TII.get(X86::EH_RESTORE));
BuildMI(*RestoreMBB, RestoreMBBI, DL, TII.get(X86::JMP_4)).addMBB(TargetMBB);
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredCatchPad(MachineInstr &MI,
MachineBasicBlock *BB) const {
MachineFunction *MF = BB->getParent();
const Constant *PerFn = MF->getFunction()->getPersonalityFn();
bool IsSEH = isAsynchronousEHPersonality(classifyEHPersonality(PerFn));
// Only 32-bit SEH requires special handling for catchpad.
if (IsSEH && Subtarget.is32Bit()) {
const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
BuildMI(*BB, MI, DL, TII.get(X86::EH_RESTORE));
}
MI.eraseFromParent();
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredTLSAddr(MachineInstr &MI,
MachineBasicBlock *BB) const {
// Here we replace TLSADDR with the sequence:
//   adjust_stackdown -> TLSADDR -> adjust_stackup.
// We need this because TLSADDR is lowered into a call inside MC;
// without the two markers, shrink-wrapping may push the
// prologue/epilogue past them.
const TargetInstrInfo &TII = *Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
MachineFunction &MF = *BB->getParent();
// Emit CALLSEQ_START right before the instruction.
unsigned AdjStackDown = TII.getCallFrameSetupOpcode();
MachineInstrBuilder CallseqStart =
BuildMI(MF, DL, TII.get(AdjStackDown)).addImm(0).addImm(0).addImm(0);
BB->insert(MachineBasicBlock::iterator(MI), CallseqStart);
// Emit CALLSEQ_END right after the instruction.
// We don't call erase from parent because we want to keep the
// original instruction around.
unsigned AdjStackUp = TII.getCallFrameDestroyOpcode();
MachineInstrBuilder CallseqEnd =
BuildMI(MF, DL, TII.get(AdjStackUp)).addImm(0).addImm(0);
BB->insertAfter(MachineBasicBlock::iterator(MI), CallseqEnd);
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitLoweredTLSCall(MachineInstr &MI,
MachineBasicBlock *BB) const {
// This is pretty easy. We're taking the value that we loaded from the
// TLS relocation, sticking it in either RDI (x86-64) or EAX (x86), and
// doing an indirect call. The return value will then be in the normal
// return register.
MachineFunction *F = BB->getParent();
const X86InstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
assert(Subtarget.isTargetDarwin() && "Darwin only instr emitted?");
assert(MI.getOperand(3).isGlobal() && "This should be a global");
// Get a register mask for the lowered call.
// FIXME: The 32-bit calls have non-standard calling conventions. Use a
// proper register mask.
const uint32_t *RegMask =
Subtarget.is64Bit() ?
Subtarget.getRegisterInfo()->getDarwinTLSCallPreservedMask() :
Subtarget.getRegisterInfo()->getCallPreservedMask(*F, CallingConv::C);
if (Subtarget.is64Bit()) {
MachineInstrBuilder MIB =
BuildMI(*BB, MI, DL, TII->get(X86::MOV64rm), X86::RDI)
.addReg(X86::RIP)
.addImm(0)
.addReg(0)
.addGlobalAddress(MI.getOperand(3).getGlobal(), 0,
MI.getOperand(3).getTargetFlags())
.addReg(0);
MIB = BuildMI(*BB, MI, DL, TII->get(X86::CALL64m));
addDirectMem(MIB, X86::RDI);
MIB.addReg(X86::RAX, RegState::ImplicitDefine).addRegMask(RegMask);
} else if (!isPositionIndependent()) {
MachineInstrBuilder MIB =
BuildMI(*BB, MI, DL, TII->get(X86::MOV32rm), X86::EAX)
.addReg(0)
.addImm(0)
.addReg(0)
.addGlobalAddress(MI.getOperand(3).getGlobal(), 0,
MI.getOperand(3).getTargetFlags())
.addReg(0);
MIB = BuildMI(*BB, MI, DL, TII->get(X86::CALL32m));
addDirectMem(MIB, X86::EAX);
MIB.addReg(X86::EAX, RegState::ImplicitDefine).addRegMask(RegMask);
} else {
MachineInstrBuilder MIB =
BuildMI(*BB, MI, DL, TII->get(X86::MOV32rm), X86::EAX)
.addReg(TII->getGlobalBaseReg(F))
.addImm(0)
.addReg(0)
.addGlobalAddress(MI.getOperand(3).getGlobal(), 0,
MI.getOperand(3).getTargetFlags())
.addReg(0);
MIB = BuildMI(*BB, MI, DL, TII->get(X86::CALL32m));
addDirectMem(MIB, X86::EAX);
MIB.addReg(X86::EAX, RegState::ImplicitDefine).addRegMask(RegMask);
}
MI.eraseFromParent(); // The pseudo instruction is gone now.
return BB;
}
MachineBasicBlock *
X86TargetLowering::emitEHSjLjSetJmp(MachineInstr &MI,
MachineBasicBlock *MBB) const {
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
MachineRegisterInfo &MRI = MF->getRegInfo();
const BasicBlock *BB = MBB->getBasicBlock();
MachineFunction::iterator I = ++MBB->getIterator();
// Memory Reference
MachineInstr::mmo_iterator MMOBegin = MI.memoperands_begin();
MachineInstr::mmo_iterator MMOEnd = MI.memoperands_end();
unsigned DstReg;
unsigned MemOpndSlot = 0;
unsigned CurOp = 0;
DstReg = MI.getOperand(CurOp++).getReg();
const TargetRegisterClass *RC = MRI.getRegClass(DstReg);
assert(TRI->isTypeLegalForClass(*RC, MVT::i32) && "Invalid destination!");
(void)TRI;
unsigned mainDstReg = MRI.createVirtualRegister(RC);
unsigned restoreDstReg = MRI.createVirtualRegister(RC);
MemOpndSlot = CurOp;
MVT PVT = getPointerTy(MF->getDataLayout());
assert((PVT == MVT::i64 || PVT == MVT::i32) &&
"Invalid Pointer Size!");
// For v = setjmp(buf), we generate
//
// thisMBB:
// buf[LabelOffset] = restoreMBB <-- takes address of restoreMBB
// SjLjSetup restoreMBB
//
// mainMBB:
// v_main = 0
//
// sinkMBB:
// v = phi(main, restore)
//
// restoreMBB:
// if base pointer being used, load it from frame
// v_restore = 1
MachineBasicBlock *thisMBB = MBB;
MachineBasicBlock *mainMBB = MF->CreateMachineBasicBlock(BB);
MachineBasicBlock *sinkMBB = MF->CreateMachineBasicBlock(BB);
MachineBasicBlock *restoreMBB = MF->CreateMachineBasicBlock(BB);
MF->insert(I, mainMBB);
MF->insert(I, sinkMBB);
MF->push_back(restoreMBB);
restoreMBB->setHasAddressTaken();
MachineInstrBuilder MIB;
// Transfer the remainder of BB and its successor edges to sinkMBB.
sinkMBB->splice(sinkMBB->begin(), MBB,
std::next(MachineBasicBlock::iterator(MI)), MBB->end());
sinkMBB->transferSuccessorsAndUpdatePHIs(MBB);
// thisMBB:
unsigned PtrStoreOpc = 0;
unsigned LabelReg = 0;
const int64_t LabelOffset = 1 * PVT.getStoreSize();
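// Jump-buffer layout used by the SjLj pseudos: slot 0 holds the frame
// pointer, slot 1 the resume address stored below, and slot 2 the stack
// pointer (see emitEHSjLjLongJmp, which reloads all three).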
bool UseImmLabel = (MF->getTarget().getCodeModel() == CodeModel::Small) &&
!isPositionIndependent();
// Prepare IP either in reg or imm.
if (!UseImmLabel) {
PtrStoreOpc = (PVT == MVT::i64) ? X86::MOV64mr : X86::MOV32mr;
const TargetRegisterClass *PtrRC = getRegClassFor(PVT);
LabelReg = MRI.createVirtualRegister(PtrRC);
if (Subtarget.is64Bit()) {
MIB = BuildMI(*thisMBB, MI, DL, TII->get(X86::LEA64r), LabelReg)
.addReg(X86::RIP)
.addImm(0)
.addReg(0)
.addMBB(restoreMBB)
.addReg(0);
} else {
const X86InstrInfo *XII = static_cast<const X86InstrInfo*>(TII);
MIB = BuildMI(*thisMBB, MI, DL, TII->get(X86::LEA32r), LabelReg)
.addReg(XII->getGlobalBaseReg(MF))
.addImm(0)
.addReg(0)
.addMBB(restoreMBB, Subtarget.classifyBlockAddressReference())
.addReg(0);
}
} else
PtrStoreOpc = (PVT == MVT::i64) ? X86::MOV64mi32 : X86::MOV32mi;
// Store IP
MIB = BuildMI(*thisMBB, MI, DL, TII->get(PtrStoreOpc));
for (unsigned i = 0; i < X86::AddrNumOperands; ++i) {
if (i == X86::AddrDisp)
MIB.addDisp(MI.getOperand(MemOpndSlot + i), LabelOffset);
else
MIB.add(MI.getOperand(MemOpndSlot + i));
}
if (!UseImmLabel)
MIB.addReg(LabelReg);
else
MIB.addMBB(restoreMBB);
MIB.setMemRefs(MMOBegin, MMOEnd);
// Setup
MIB = BuildMI(*thisMBB, MI, DL, TII->get(X86::EH_SjLj_Setup))
.addMBB(restoreMBB);
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
MIB.addRegMask(RegInfo->getNoPreservedMask());
thisMBB->addSuccessor(mainMBB);
thisMBB->addSuccessor(restoreMBB);
// mainMBB:
// EAX = 0
BuildMI(mainMBB, DL, TII->get(X86::MOV32r0), mainDstReg);
mainMBB->addSuccessor(sinkMBB);
// sinkMBB:
BuildMI(*sinkMBB, sinkMBB->begin(), DL,
TII->get(X86::PHI), DstReg)
.addReg(mainDstReg).addMBB(mainMBB)
.addReg(restoreDstReg).addMBB(restoreMBB);
// restoreMBB:
if (RegInfo->hasBasePointer(*MF)) {
const bool Uses64BitFramePtr =
Subtarget.isTarget64BitLP64() || Subtarget.isTargetNaCl64();
X86MachineFunctionInfo *X86FI = MF->getInfo<X86MachineFunctionInfo>();
X86FI->setRestoreBasePointer(MF);
unsigned FramePtr = RegInfo->getFrameRegister(*MF);
unsigned BasePtr = RegInfo->getBaseRegister();
unsigned Opm = Uses64BitFramePtr ? X86::MOV64rm : X86::MOV32rm;
addRegOffset(BuildMI(restoreMBB, DL, TII->get(Opm), BasePtr),
FramePtr, true, X86FI->getRestoreBasePointerOffset())
.setMIFlag(MachineInstr::FrameSetup);
}
BuildMI(restoreMBB, DL, TII->get(X86::MOV32ri), restoreDstReg).addImm(1);
BuildMI(restoreMBB, DL, TII->get(X86::JMP_1)).addMBB(sinkMBB);
restoreMBB->addSuccessor(sinkMBB);
MI.eraseFromParent();
return sinkMBB;
}
MachineBasicBlock *
X86TargetLowering::emitEHSjLjLongJmp(MachineInstr &MI,
MachineBasicBlock *MBB) const {
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
MachineRegisterInfo &MRI = MF->getRegInfo();
// Memory Reference
MachineInstr::mmo_iterator MMOBegin = MI.memoperands_begin();
MachineInstr::mmo_iterator MMOEnd = MI.memoperands_end();
MVT PVT = getPointerTy(MF->getDataLayout());
assert((PVT == MVT::i64 || PVT == MVT::i32) &&
"Invalid Pointer Size!");
const TargetRegisterClass *RC =
(PVT == MVT::i64) ? &X86::GR64RegClass : &X86::GR32RegClass;
unsigned Tmp = MRI.createVirtualRegister(RC);
// Since FP is only updated here but NOT referenced, it's treated as GPR.
const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
unsigned FP = (PVT == MVT::i64) ? X86::RBP : X86::EBP;
unsigned SP = RegInfo->getStackRegister();
MachineInstrBuilder MIB;
const int64_t LabelOffset = 1 * PVT.getStoreSize();
const int64_t SPOffset = 2 * PVT.getStoreSize();
unsigned PtrLoadOpc = (PVT == MVT::i64) ? X86::MOV64rm : X86::MOV32rm;
unsigned IJmpOpc = (PVT == MVT::i64) ? X86::JMP64r : X86::JMP32r;
// Reload FP
MIB = BuildMI(*MBB, MI, DL, TII->get(PtrLoadOpc), FP);
for (unsigned i = 0; i < X86::AddrNumOperands; ++i)
MIB.add(MI.getOperand(i));
MIB.setMemRefs(MMOBegin, MMOEnd);
// Reload IP
MIB = BuildMI(*MBB, MI, DL, TII->get(PtrLoadOpc), Tmp);
for (unsigned i = 0; i < X86::AddrNumOperands; ++i) {
if (i == X86::AddrDisp)
MIB.addDisp(MI.getOperand(i), LabelOffset);
else
MIB.add(MI.getOperand(i));
}
MIB.setMemRefs(MMOBegin, MMOEnd);
// Reload SP
MIB = BuildMI(*MBB, MI, DL, TII->get(PtrLoadOpc), SP);
for (unsigned i = 0; i < X86::AddrNumOperands; ++i) {
if (i == X86::AddrDisp)
MIB.addDisp(MI.getOperand(i), SPOffset);
else
MIB.add(MI.getOperand(i));
}
MIB.setMemRefs(MMOBegin, MMOEnd);
// Jump
BuildMI(*MBB, MI, DL, TII->get(IJmpOpc)).addReg(Tmp);
MI.eraseFromParent();
return MBB;
}
void X86TargetLowering::SetupEntryBlockForSjLj(MachineInstr &MI,
MachineBasicBlock *MBB,
MachineBasicBlock *DispatchBB,
int FI) const {
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = MBB->getParent();
MachineRegisterInfo *MRI = &MF->getRegInfo();
const X86InstrInfo *TII = Subtarget.getInstrInfo();
MVT PVT = getPointerTy(MF->getDataLayout());
assert((PVT == MVT::i64 || PVT == MVT::i32) && "Invalid Pointer Size!");
unsigned Op = 0;
unsigned VR = 0;
bool UseImmLabel = (MF->getTarget().getCodeModel() == CodeModel::Small) &&
!isPositionIndependent();
if (UseImmLabel) {
Op = (PVT == MVT::i64) ? X86::MOV64mi32 : X86::MOV32mi;
} else {
const TargetRegisterClass *TRC =
(PVT == MVT::i64) ? &X86::GR64RegClass : &X86::GR32RegClass;
VR = MRI->createVirtualRegister(TRC);
Op = (PVT == MVT::i64) ? X86::MOV64mr : X86::MOV32mr;
if (Subtarget.is64Bit())
BuildMI(*MBB, MI, DL, TII->get(X86::LEA64r), VR)
.addReg(X86::RIP)
.addImm(1)
.addReg(0)
.addMBB(DispatchBB)
.addReg(0);
else
BuildMI(*MBB, MI, DL, TII->get(X86::LEA32r), VR)
.addReg(0) /* TII->getGlobalBaseReg(MF) */
.addImm(1)
.addReg(0)
.addMBB(DispatchBB, Subtarget.classifyBlockAddressReference())
.addReg(0);
}
MachineInstrBuilder MIB = BuildMI(*MBB, MI, DL, TII->get(Op));
addFrameReference(MIB, FI, 36);
if (UseImmLabel)
MIB.addMBB(DispatchBB);
else
MIB.addReg(VR);
}
MachineBasicBlock *
X86TargetLowering::EmitSjLjDispatchBlock(MachineInstr &MI,
MachineBasicBlock *BB) const {
DebugLoc DL = MI.getDebugLoc();
MachineFunction *MF = BB->getParent();
MachineFrameInfo &MFI = MF->getFrameInfo();
MachineRegisterInfo *MRI = &MF->getRegInfo();
const X86InstrInfo *TII = Subtarget.getInstrInfo();
int FI = MFI.getFunctionContextIndex();
// Get a mapping of the call site numbers to all of the landing pads they're
// associated with.
DenseMap<unsigned, SmallVector<MachineBasicBlock *, 2>> CallSiteNumToLPad;
unsigned MaxCSNum = 0;
for (auto &MBB : *MF) {
if (!MBB.isEHPad())
continue;
MCSymbol *Sym = nullptr;
for (const auto &MI : MBB) {
if (MI.isDebugValue())
continue;
assert(MI.isEHLabel() && "expected EH_LABEL");
Sym = MI.getOperand(0).getMCSymbol();
break;
}
if (!MF->hasCallSiteLandingPad(Sym))
continue;
for (unsigned CSI : MF->getCallSiteLandingPad(Sym)) {
CallSiteNumToLPad[CSI].push_back(&MBB);
MaxCSNum = std::max(MaxCSNum, CSI);
}
}
// Get an ordered list of the machine basic blocks for the jump table.
std::vector<MachineBasicBlock *> LPadList;
SmallPtrSet<MachineBasicBlock *, 32> InvokeBBs;
LPadList.reserve(CallSiteNumToLPad.size());
for (unsigned CSI = 1; CSI <= MaxCSNum; ++CSI) {
for (auto &LP : CallSiteNumToLPad[CSI]) {
LPadList.push_back(LP);
InvokeBBs.insert(LP->pred_begin(), LP->pred_end());
}
}
assert(!LPadList.empty() &&
"No landing pad destinations for the dispatch jump table!");
// Create the MBBs for the dispatch code.
// Shove the dispatch's address into the return slot in the function context.
MachineBasicBlock *DispatchBB = MF->CreateMachineBasicBlock();
DispatchBB->setIsEHPad(true);
MachineBasicBlock *TrapBB = MF->CreateMachineBasicBlock();
BuildMI(TrapBB, DL, TII->get(X86::TRAP));
DispatchBB->addSuccessor(TrapBB);
MachineBasicBlock *DispContBB = MF->CreateMachineBasicBlock();
DispatchBB->addSuccessor(DispContBB);
// Insert MBBs.
MF->push_back(DispatchBB);
MF->push_back(DispContBB);
MF->push_back(TrapBB);
// Insert code into the entry block that creates and registers the function
// context.
SetupEntryBlockForSjLj(MI, BB, DispatchBB, FI);
// Create the jump table and associated information
MachineJumpTableInfo *JTI =
MF->getOrCreateJumpTableInfo(getJumpTableEncoding());
unsigned MJTI = JTI->createJumpTableIndex(LPadList);
const X86RegisterInfo &RI = TII->getRegisterInfo();
// Add a register mask with no preserved registers. This results in all
// registers being marked as clobbered.
if (RI.hasBasePointer(*MF)) {
const bool FPIs64Bit =
Subtarget.isTarget64BitLP64() || Subtarget.isTargetNaCl64();
X86MachineFunctionInfo *MFI = MF->getInfo<X86MachineFunctionInfo>();
MFI->setRestoreBasePointer(MF);
unsigned FP = RI.getFrameRegister(*MF);
unsigned BP = RI.getBaseRegister();
unsigned Op = FPIs64Bit ? X86::MOV64rm : X86::MOV32rm;
addRegOffset(BuildMI(DispatchBB, DL, TII->get(Op), BP), FP, true,
MFI->getRestoreBasePointerOffset())
.addRegMask(RI.getNoPreservedMask());
} else {
BuildMI(DispatchBB, DL, TII->get(X86::NOOP))
.addRegMask(RI.getNoPreservedMask());
}
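// Load the current call-site index from its slot in the function context
// and range-check it against the landing-pad table; out-of-range values
// are sent to the trap block.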
unsigned IReg = MRI->createVirtualRegister(&X86::GR32RegClass);
addFrameReference(BuildMI(DispatchBB, DL, TII->get(X86::MOV32rm), IReg), FI,
4);
BuildMI(DispatchBB, DL, TII->get(X86::CMP32ri))
.addReg(IReg)
.addImm(LPadList.size());
BuildMI(DispatchBB, DL, TII->get(X86::JA_1)).addMBB(TrapBB);
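// Call-site indices are 1-based (LPadList is filled starting at CSI == 1),
// so subtract one to form the jump-table index.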
unsigned JReg = MRI->createVirtualRegister(&X86::GR32RegClass);
BuildMI(DispContBB, DL, TII->get(X86::SUB32ri), JReg)
.addReg(IReg)
.addImm(1);
BuildMI(DispContBB, DL,
TII->get(Subtarget.is64Bit() ? X86::JMP64m : X86::JMP32m))
.addReg(0)
.addImm(Subtarget.is64Bit() ? 8 : 4)
.addReg(JReg)
.addJumpTableIndex(MJTI)
.addReg(0);
// Add the jump table entries as successors to the MBB.
SmallPtrSet<MachineBasicBlock *, 8> SeenMBBs;
for (auto &LP : LPadList)
if (SeenMBBs.insert(LP).second)
DispContBB->addSuccessor(LP);
// N.B. the order the invoke BBs are processed in doesn't matter here.
SmallVector<MachineBasicBlock *, 64> MBBLPads;
const MCPhysReg *SavedRegs = MF->getRegInfo().getCalleeSavedRegs();
for (MachineBasicBlock *MBB : InvokeBBs) {
// Remove the landing pad successor from the invoke block and replace it
// with the new dispatch block.
// Keep a copy of Successors since it's modified inside the loop.
SmallVector<MachineBasicBlock *, 8> Successors(MBB->succ_rbegin(),
MBB->succ_rend());
// FIXME: Avoid quadratic complexity.
for (auto MBBS : Successors) {
if (MBBS->isEHPad()) {
MBB->removeSuccessor(MBBS);
MBBLPads.push_back(MBBS);
}
}
MBB->addSuccessor(DispatchBB);
// Find the invoke call and mark all of the callee-saved registers as
// 'implicitly defined' so that they're spilled. This prevents code from
// moving instructions to before the EH block, where they will never be
// executed.
for (auto &II : reverse(*MBB)) {
if (!II.isCall())
continue;
DenseMap<unsigned, bool> DefRegs;
for (auto &MOp : II.operands())
if (MOp.isReg())
DefRegs[MOp.getReg()] = true;
MachineInstrBuilder MIB(*MF, &II);
for (unsigned RI = 0; SavedRegs[RI]; ++RI) {
unsigned Reg = SavedRegs[RI];
if (!DefRegs[Reg])
MIB.addReg(Reg, RegState::ImplicitDefine | RegState::Dead);
}
break;
}
}
// Mark all former landing pads as non-landing pads. The dispatch is the only
// landing pad now.
for (auto &LP : MBBLPads)
LP->setIsEHPad(false);
// The instruction is gone now.
MI.eraseFromParent();
return BB;
}
MachineBasicBlock *
X86TargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
MachineBasicBlock *BB) const {
MachineFunction *MF = BB->getParent();
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
DebugLoc DL = MI.getDebugLoc();
switch (MI.getOpcode()) {
default: llvm_unreachable("Unexpected instr type to insert");
case X86::TAILJMPd64:
case X86::TAILJMPr64:
case X86::TAILJMPm64:
case X86::TAILJMPr64_REX:
case X86::TAILJMPm64_REX:
llvm_unreachable("TAILJMP64 would not be touched here.");
case X86::TCRETURNdi64:
case X86::TCRETURNri64:
case X86::TCRETURNmi64:
return BB;
case X86::TLS_addr32:
case X86::TLS_addr64:
case X86::TLS_base_addr32:
case X86::TLS_base_addr64:
return EmitLoweredTLSAddr(MI, BB);
case X86::CATCHRET:
return EmitLoweredCatchRet(MI, BB);
case X86::CATCHPAD:
return EmitLoweredCatchPad(MI, BB);
case X86::SEG_ALLOCA_32:
case X86::SEG_ALLOCA_64:
return EmitLoweredSegAlloca(MI, BB);
case X86::TLSCall_32:
case X86::TLSCall_64:
return EmitLoweredTLSCall(MI, BB);
case X86::CMOV_FR32:
case X86::CMOV_FR64:
case X86::CMOV_FR128:
case X86::CMOV_GR8:
case X86::CMOV_GR16:
case X86::CMOV_GR32:
case X86::CMOV_RFP32:
case X86::CMOV_RFP64:
case X86::CMOV_RFP80:
case X86::CMOV_V2F64:
case X86::CMOV_V2I64:
case X86::CMOV_V4F32:
case X86::CMOV_V4F64:
case X86::CMOV_V4I64:
case X86::CMOV_V16F32:
case X86::CMOV_V8F32:
case X86::CMOV_V8F64:
case X86::CMOV_V8I64:
case X86::CMOV_V8I1:
case X86::CMOV_V16I1:
case X86::CMOV_V32I1:
case X86::CMOV_V64I1:
return EmitLoweredSelect(MI, BB);
case X86::RDFLAGS32:
case X86::RDFLAGS64: {
unsigned PushF =
MI.getOpcode() == X86::RDFLAGS32 ? X86::PUSHF32 : X86::PUSHF64;
unsigned Pop = MI.getOpcode() == X86::RDFLAGS32 ? X86::POP32r : X86::POP64r;
MachineInstr *Push = BuildMI(*BB, MI, DL, TII->get(PushF));
// Permit reads of the FLAGS register without it being defined.
// This intrinsic exists to read external processor state in flags, such as
// the trap flag, interrupt flag, and direction flag, none of which are
// modeled by the backend.
Push->getOperand(2).setIsUndef();
BuildMI(*BB, MI, DL, TII->get(Pop), MI.getOperand(0).getReg());
MI.eraseFromParent(); // The pseudo is gone now.
return BB;
}
case X86::WRFLAGS32:
case X86::WRFLAGS64: {
unsigned Push =
MI.getOpcode() == X86::WRFLAGS32 ? X86::PUSH32r : X86::PUSH64r;
unsigned PopF =
MI.getOpcode() == X86::WRFLAGS32 ? X86::POPF32 : X86::POPF64;
BuildMI(*BB, MI, DL, TII->get(Push)).addReg(MI.getOperand(0).getReg());
BuildMI(*BB, MI, DL, TII->get(PopF));
MI.eraseFromParent(); // The pseudo is gone now.
return BB;
}
case X86::RELEASE_FADD32mr:
case X86::RELEASE_FADD64mr:
return EmitLoweredAtomicFP(MI, BB);
case X86::FP32_TO_INT16_IN_MEM:
case X86::FP32_TO_INT32_IN_MEM:
case X86::FP32_TO_INT64_IN_MEM:
case X86::FP64_TO_INT16_IN_MEM:
case X86::FP64_TO_INT32_IN_MEM:
case X86::FP64_TO_INT64_IN_MEM:
case X86::FP80_TO_INT16_IN_MEM:
case X86::FP80_TO_INT32_IN_MEM:
case X86::FP80_TO_INT64_IN_MEM: {
// Change the floating point control register to use "round towards zero"
// mode when truncating to an integer value.
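// Allocate a two-byte stack slot to hold the saved FP control word.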
int CWFrameIdx = MF->getFrameInfo().CreateStackObject(2, 2, false);
addFrameReference(BuildMI(*BB, MI, DL,
TII->get(X86::FNSTCW16m)), CWFrameIdx);
// Load the old value of the control word...
unsigned OldCW =
MF->getRegInfo().createVirtualRegister(&X86::GR16RegClass);
addFrameReference(BuildMI(*BB, MI, DL, TII->get(X86::MOV16rm), OldCW),
CWFrameIdx);
// Store a control word selecting round-toward-zero (0xC7F: rounding-control
// bits RC = 11, all exceptions masked)...
addFrameReference(BuildMI(*BB, MI, DL, TII->get(X86::MOV16mi)), CWFrameIdx)
.addImm(0xC7F);
// Reload the modified control word now...
addFrameReference(BuildMI(*BB, MI, DL,
TII->get(X86::FLDCW16m)), CWFrameIdx);
// Restore the memory image of the control word to its original value
addFrameReference(BuildMI(*BB, MI, DL, TII->get(X86::MOV16mr)), CWFrameIdx)
.addReg(OldCW);
// Get the X86 opcode to use.
unsigned Opc;
switch (MI.getOpcode()) {
default: llvm_unreachable("illegal opcode!");
case X86::FP32_TO_INT16_IN_MEM: Opc = X86::IST_Fp16m32; break;
case X86::FP32_TO_INT32_IN_MEM: Opc = X86::IST_Fp32m32; break;
case X86::FP32_TO_INT64_IN_MEM: Opc = X86::IST_Fp64m32; break;
case X86::FP64_TO_INT16_IN_MEM: Opc = X86::IST_Fp16m64; break;
case X86::FP64_TO_INT32_IN_MEM: Opc = X86::IST_Fp32m64; break;
case X86::FP64_TO_INT64_IN_MEM: Opc = X86::IST_Fp64m64; break;
case X86::FP80_TO_INT16_IN_MEM: Opc = X86::IST_Fp16m80; break;
case X86::FP80_TO_INT32_IN_MEM: Opc = X86::IST_Fp32m80; break;
case X86::FP80_TO_INT64_IN_MEM: Opc = X86::IST_Fp64m80; break;
}
X86AddressMode AM = getAddressFromInstr(&MI, 0);
addFullAddress(BuildMI(*BB, MI, DL, TII->get(Opc)), AM)
.addReg(MI.getOperand(X86::AddrNumOperands).getReg());
// Reload the original control word now.
addFrameReference(BuildMI(*BB, MI, DL,
TII->get(X86::FLDCW16m)), CWFrameIdx);
MI.eraseFromParent(); // The pseudo instruction is gone now.
return BB;
}
// String/text processing lowering.
case X86::PCMPISTRM128REG:
case X86::VPCMPISTRM128REG:
case X86::PCMPISTRM128MEM:
case X86::VPCMPISTRM128MEM:
case X86::PCMPESTRM128REG:
case X86::VPCMPESTRM128REG:
case X86::PCMPESTRM128MEM:
case X86::VPCMPESTRM128MEM:
assert(Subtarget.hasSSE42() &&
"Target must have SSE4.2 or AVX features enabled");
return emitPCMPSTRM(MI, BB, Subtarget.getInstrInfo());
// String/text processing lowering.
case X86::PCMPISTRIREG:
case X86::VPCMPISTRIREG:
case X86::PCMPISTRIMEM:
case X86::VPCMPISTRIMEM:
case X86::PCMPESTRIREG:
case X86::VPCMPESTRIREG:
case X86::PCMPESTRIMEM:
case X86::VPCMPESTRIMEM:
assert(Subtarget.hasSSE42() &&
"Target must have SSE4.2 or AVX features enabled");
return emitPCMPSTRI(MI, BB, Subtarget.getInstrInfo());
// Thread synchronization.
case X86::MONITOR:
return emitMonitor(MI, BB, Subtarget, X86::MONITORrrr);
case X86::MONITORX:
return emitMonitor(MI, BB, Subtarget, X86::MONITORXrrr);
// Cache line zero
case X86::CLZERO:
return emitClzero(&MI, BB, Subtarget);
// PKU feature
case X86::WRPKRU:
return emitWRPKRU(MI, BB, Subtarget);
case X86::RDPKRU:
return emitRDPKRU(MI, BB, Subtarget);
// xbegin
case X86::XBEGIN:
return emitXBegin(MI, BB, Subtarget.getInstrInfo());
case X86::VASTART_SAVE_XMM_REGS:
return EmitVAStartSaveXMMRegsWithCustomInserter(MI, BB);
case X86::VAARG_64:
return EmitVAARG64WithCustomInserter(MI, BB);
case X86::EH_SjLj_SetJmp32:
case X86::EH_SjLj_SetJmp64:
return emitEHSjLjSetJmp(MI, BB);
case X86::EH_SjLj_LongJmp32:
case X86::EH_SjLj_LongJmp64:
return emitEHSjLjLongJmp(MI, BB);
case X86::Int_eh_sjlj_setup_dispatch:
return EmitSjLjDispatchBlock(MI, BB);
case TargetOpcode::STATEPOINT:
// As an implementation detail, STATEPOINT shares the STACKMAP format at
// this point in the process. We diverge later.
return emitPatchPoint(MI, BB);
case TargetOpcode::STACKMAP:
case TargetOpcode::PATCHPOINT:
return emitPatchPoint(MI, BB);
case TargetOpcode::PATCHABLE_EVENT_CALL:
// Do nothing here, handle in xray instrumentation pass.
return BB;
case X86::LCMPXCHG8B: {
const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
// In addition to the four E[ABCD] registers implied by its encoding,
// CMPXCHG8B requires a memory operand. If the current target is i686 and
// the current function needs a base pointer - which is ESI on i686 - the
// register allocator would be unable to allocate registers for an address
// of the form X(%reg, %reg, Y): there would never be enough unreserved
// registers during regalloc (without the base pointer the only option
// would be X(%edi, %esi, Y)). We give the register allocator a hand by
// precomputing the address in a new vreg using LEA.
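// For example (illustrative): an operand like 16(%esi,%ebx,4) would be
// rewritten as "leal 16(%esi,%ebx,4), %vreg" followed by "cmpxchg8b (%vreg)".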
// If the target is not i686 or there is no base pointer, there is nothing to do here.
if (!Subtarget.is32Bit() || !TRI->hasBasePointer(*MF))
return BB;
// Even though this code does not strictly need the base pointer to be
// ESI, we check for that anyway. The reason: if this assert fails, some
// change has happened in the compiler's base pointer handling, which most
// probably has to be addressed here as well.
assert(TRI->getBaseRegister() == X86::ESI &&
"LCMPXCHG8B custom insertion for i686 is written with X86::ESI as a "
"base pointer in mind");
MachineRegisterInfo &MRI = MF->getRegInfo();
MVT SPTy = getPointerTy(MF->getDataLayout());
const TargetRegisterClass *AddrRegClass = getRegClassFor(SPTy);
unsigned computedAddrVReg = MRI.createVirtualRegister(AddrRegClass);
X86AddressMode AM = getAddressFromInstr(&MI, 0);
// Regalloc does not need any help when the memory operand of CMPXCHG8B
// does not use an index register.
if (AM.IndexReg == X86::NoRegister)
return BB;
// After X86TargetLowering::ReplaceNodeResults CMPXCHG8B is glued to its
// four operand definitions that are E[ABCD] registers. We skip them and
// then insert the LEA.
MachineBasicBlock::iterator MBBI(MI);
while (MBBI->definesRegister(X86::EAX) || MBBI->definesRegister(X86::EBX) ||
MBBI->definesRegister(X86::ECX) || MBBI->definesRegister(X86::EDX))
--MBBI;
addFullAddress(
BuildMI(*BB, *MBBI, DL, TII->get(X86::LEA32r), computedAddrVReg), AM);
setDirectAddressInInstr(&MI, 0, computedAddrVReg);
return BB;
}
case X86::LCMPXCHG16B:
return BB;
case X86::LCMPXCHG8B_SAVE_EBX:
case X86::LCMPXCHG16B_SAVE_RBX: {
unsigned BasePtr =
MI.getOpcode() == X86::LCMPXCHG8B_SAVE_EBX ? X86::EBX : X86::RBX;
if (!BB->isLiveIn(BasePtr))
BB->addLiveIn(BasePtr);
return BB;
}
}
}
//===----------------------------------------------------------------------===//
// X86 Optimization Hooks
//===----------------------------------------------------------------------===//
void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
KnownBits &Known,
const APInt &DemandedElts,
const SelectionDAG &DAG,
unsigned Depth) const {
unsigned BitWidth = Known.getBitWidth();
unsigned Opc = Op.getOpcode();
EVT VT = Op.getValueType();
assert((Opc >= ISD::BUILTIN_OP_END ||
Opc == ISD::INTRINSIC_WO_CHAIN ||
Opc == ISD::INTRINSIC_W_CHAIN ||
Opc == ISD::INTRINSIC_VOID) &&
"Should use MaskedValueIsZero if you don't know whether Op"
" is a target node!");
Known.resetAll();
switch (Opc) {
default: break;
case X86ISD::ADD:
case X86ISD::SUB:
case X86ISD::ADC:
case X86ISD::SBB:
case X86ISD::SMUL:
case X86ISD::UMUL:
case X86ISD::INC:
case X86ISD::DEC:
case X86ISD::OR:
case X86ISD::XOR:
case X86ISD::AND:
// These nodes' second result is a boolean.
if (Op.getResNo() == 0)
break;
LLVM_FALLTHROUGH;
case X86ISD::SETCC:
Known.Zero.setBitsFrom(1);
break;
case X86ISD::MOVMSK: {
unsigned NumLoBits = Op.getOperand(0).getValueType().getVectorNumElements();
Known.Zero.setBitsFrom(NumLoBits);
break;
}
case X86ISD::VSHLI:
case X86ISD::VSRLI: {
if (auto *ShiftImm = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
if (ShiftImm->getAPIntValue().uge(VT.getScalarSizeInBits())) {
Known.setAllZero();
break;
}
DAG.computeKnownBits(Op.getOperand(0), Known, Depth + 1);
unsigned ShAmt = ShiftImm->getZExtValue();
if (Opc == X86ISD::VSHLI) {
Known.Zero <<= ShAmt;
Known.One <<= ShAmt;
// Low bits are known zero.
Known.Zero.setLowBits(ShAmt);
} else {
Known.Zero.lshrInPlace(ShAmt);
Known.One.lshrInPlace(ShAmt);
// High bits are known zero.
Known.Zero.setHighBits(ShAmt);
}
}
break;
}
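// Illustrative example: VSHLI of v4i32 elements by 8 shifts the tracked
// Zero/One bits up by 8 and marks the low 8 bits of each element known zero;
// VSRLI by 8 shifts them down and marks the high 8 bits known zero.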
case X86ISD::VZEXT: {
SDValue N0 = Op.getOperand(0);
unsigned NumElts = VT.getVectorNumElements();
EVT SrcVT = N0.getValueType();
unsigned InNumElts = SrcVT.getVectorNumElements();
unsigned InBitWidth = SrcVT.getScalarSizeInBits();
assert(InNumElts >= NumElts && "Illegal VZEXT input");
Known = KnownBits(InBitWidth);
APInt DemandedSrcElts = APInt::getLowBitsSet(InNumElts, NumElts);
DAG.computeKnownBits(N0, Known, DemandedSrcElts, Depth + 1);
Known = Known.zext(BitWidth);
Known.Zero.setBitsFrom(InBitWidth);
break;
}
}
}
unsigned X86TargetLowering::ComputeNumSignBitsForTargetNode(
SDValue Op, const APInt &DemandedElts, const SelectionDAG &DAG,
unsigned Depth) const {
unsigned VTBits = Op.getScalarValueSizeInBits();
unsigned Opcode = Op.getOpcode();
switch (Opcode) {
case X86ISD::SETCC_CARRY:
// SETCC_CARRY sets the dest to ~0 for true or 0 for false.
return VTBits;
case X86ISD::VSEXT: {
SDValue Src = Op.getOperand(0);
unsigned Tmp = DAG.ComputeNumSignBits(Src, Depth + 1);
Tmp += VTBits - Src.getScalarValueSizeInBits();
return Tmp;
}
case X86ISD::VSHLI: {
SDValue Src = Op.getOperand(0);
unsigned Tmp = DAG.ComputeNumSignBits(Src, Depth + 1);
APInt ShiftVal = cast<ConstantSDNode>(Op.getOperand(1))->getAPIntValue();
if (ShiftVal.uge(VTBits))
return VTBits; // Shifted all bits out --> zero.
if (ShiftVal.uge(Tmp))
return 1; // Shifted all sign bits out --> unknown.
return Tmp - ShiftVal.getZExtValue();
}
case X86ISD::VSRAI: {
SDValue Src = Op.getOperand(0);
unsigned Tmp = DAG.ComputeNumSignBits(Src, Depth + 1);
APInt ShiftVal = cast<ConstantSDNode>(Op.getOperand(1))->getAPIntValue();
ShiftVal += Tmp;
return ShiftVal.uge(VTBits) ? VTBits : ShiftVal.getZExtValue();
}
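// Illustrative example: VSRAI of v8i16 elements by 4, where the source has 3
// known sign bits, yields min(3 + 4, 16) = 7 sign bits; VSHLI by 4 on the
// same source shifts out all known sign bits and reports just 1.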
case X86ISD::PCMPGT:
case X86ISD::PCMPEQ:
case X86ISD::CMPP:
case X86ISD::VPCOM:
case X86ISD::VPCOMU:
// Vector compares return zero/all-bits result values.
return VTBits;
}
// Fallback case.
return 1;
}
/// Returns true (and the GlobalValue and the offset) if the node is a
/// GlobalAddress + offset.
bool X86TargetLowering::isGAPlusOffset(SDNode *N,
const GlobalValue* &GA,
int64_t &Offset) const {
if (N->getOpcode() == X86ISD::Wrapper) {
if (isa<GlobalAddressSDNode>(N->getOperand(0))) {
GA = cast<GlobalAddressSDNode>(N->getOperand(0))->getGlobal();
Offset = cast<GlobalAddressSDNode>(N->getOperand(0))->getOffset();
return true;
}
}
return TargetLowering::isGAPlusOffset(N, GA, Offset);
}
// Attempt to match a combined shuffle mask against supported unary shuffle
// instructions.
// TODO: Investigate sharing more of this with shuffle lowering.
static bool matchUnaryVectorShuffle(MVT MaskVT, ArrayRef<int> Mask,
bool AllowFloatDomain, bool AllowIntDomain,
SDValue &V1, SDLoc &DL, SelectionDAG &DAG,
const X86Subtarget &Subtarget,
unsigned &Shuffle, MVT &SrcVT, MVT &DstVT) {
unsigned NumMaskElts = Mask.size();
unsigned MaskEltSize = MaskVT.getScalarSizeInBits();
// Match against a ZERO_EXTEND_VECTOR_INREG/VZEXT instruction.
// TODO: Add 512-bit vector support (split AVX512F and AVX512BW).
if (AllowIntDomain && ((MaskVT.is128BitVector() && Subtarget.hasSSE41()) ||
(MaskVT.is256BitVector() && Subtarget.hasInt256()))) {
unsigned MaxScale = 64 / MaskEltSize;
for (unsigned Scale = 2; Scale <= MaxScale; Scale *= 2) {
bool Match = true;
unsigned NumDstElts = NumMaskElts / Scale;
for (unsigned i = 0; i != NumDstElts && Match; ++i) {
Match &= isUndefOrEqual(Mask[i * Scale], (int)i);
Match &= isUndefOrZeroInRange(Mask, (i * Scale) + 1, Scale - 1);
}
if (Match) {
unsigned SrcSize = std::max(128u, NumDstElts * MaskEltSize);
SrcVT = MVT::getVectorVT(MaskVT.getScalarType(), SrcSize / MaskEltSize);
if (SrcVT != MaskVT)
V1 = extractSubVector(V1, 0, DAG, DL, SrcSize);
DstVT = MVT::getIntegerVT(Scale * MaskEltSize);
DstVT = MVT::getVectorVT(DstVT, NumDstElts);
Shuffle = SrcVT != MaskVT ? unsigned(X86ISD::VZEXT)
: unsigned(ISD::ZERO_EXTEND_VECTOR_INREG);
return true;
}
}
}
// Match against a VZEXT_MOVL instruction; SSE1 only supports 32-bit elements (MOVSS).
if (((MaskEltSize == 32) || (MaskEltSize == 64 && Subtarget.hasSSE2())) &&
isUndefOrEqual(Mask[0], 0) &&
isUndefOrZeroInRange(Mask, 1, NumMaskElts - 1)) {
Shuffle = X86ISD::VZEXT_MOVL;
SrcVT = DstVT = !Subtarget.hasSSE2() ? MVT::v4f32 : MaskVT;
return true;
}
// Check if we have SSE3, which will let us use MOVDDUP etc. These
// instructions are no slower than UNPCKLPD but have the option to
// fold the input operand into even an unaligned memory load.
if (MaskVT.is128BitVector() && Subtarget.hasSSE3() && AllowFloatDomain) {
if (isTargetShuffleEquivalent(Mask, {0, 0})) {
Shuffle = X86ISD::MOVDDUP;
SrcVT = DstVT = MVT::v2f64;
return true;
}
if (isTargetShuffleEquivalent(Mask, {0, 0, 2, 2})) {
Shuffle = X86ISD::MOVSLDUP;
SrcVT = DstVT = MVT::v4f32;
return true;
}
if (isTargetShuffleEquivalent(Mask, {1, 1, 3, 3})) {
Shuffle = X86ISD::MOVSHDUP;
SrcVT = DstVT = MVT::v4f32;
return true;
}
}
if (MaskVT.is256BitVector() && AllowFloatDomain) {
assert(Subtarget.hasAVX() && "AVX required for 256-bit vector shuffles");
if (isTargetShuffleEquivalent(Mask, {0, 0, 2, 2})) {
Shuffle = X86ISD::MOVDDUP;
SrcVT = DstVT = MVT::v4f64;
return true;
}
if (isTargetShuffleEquivalent(Mask, {0, 0, 2, 2, 4, 4, 6, 6})) {
Shuffle = X86ISD::MOVSLDUP;
SrcVT = DstVT = MVT::v8f32;
return true;
}
if (isTargetShuffleEquivalent(Mask, {1, 1, 3, 3, 5, 5, 7, 7})) {
Shuffle = X86ISD::MOVSHDUP;
SrcVT = DstVT = MVT::v8f32;
return true;
}
}
if (MaskVT.is512BitVector() && AllowFloatDomain) {
assert(Subtarget.hasAVX512() &&
"AVX512 required for 512-bit vector shuffles");
if (isTargetShuffleEquivalent(Mask, {0, 0, 2, 2, 4, 4, 6, 6})) {
Shuffle = X86ISD::MOVDDUP;
SrcVT = DstVT = MVT::v8f64;
return true;
}
if (isTargetShuffleEquivalent(
Mask, {0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14})) {
Shuffle = X86ISD::MOVSLDUP;
SrcVT = DstVT = MVT::v16f32;
return true;
}
if (isTargetShuffleEquivalent(
Mask, {1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 13, 15, 15})) {
Shuffle = X86ISD::MOVSHDUP;
SrcVT = DstVT = MVT::v16f32;
return true;
}
}
// Attempt to match against broadcast-from-vector.
if (Subtarget.hasAVX2()) {
SmallVector<int, 64> BroadcastMask(NumMaskElts, 0);
if (isTargetShuffleEquivalent(Mask, BroadcastMask)) {
SrcVT = DstVT = MaskVT;
Shuffle = X86ISD::VBROADCAST;
return true;
}
}
return false;
}
// Attempt to match a combined shuffle mask against supported unary immediate
// permute instructions.
// TODO: Investigate sharing more of this with shuffle lowering.
static bool matchUnaryPermuteVectorShuffle(MVT MaskVT, ArrayRef<int> Mask,
const APInt &Zeroable,
bool AllowFloatDomain,
bool AllowIntDomain,
const X86Subtarget &Subtarget,
unsigned &Shuffle, MVT &ShuffleVT,
unsigned &PermuteImm) {
unsigned NumMaskElts = Mask.size();
unsigned InputSizeInBits = MaskVT.getSizeInBits();
unsigned MaskScalarSizeInBits = InputSizeInBits / NumMaskElts;
MVT MaskEltVT = MVT::getIntegerVT(MaskScalarSizeInBits);
bool ContainsZeros =
llvm::any_of(Mask, [](int M) { return M == SM_SentinelZero; });
// Handle VPERMI/VPERMILPD vXi64/vXf64 patterns.
if (!ContainsZeros && MaskScalarSizeInBits == 64) {
// Check for lane crossing permutes.
if (is128BitLaneCrossingShuffleMask(MaskEltVT, Mask)) {
// PERMPD/PERMQ permutes within a 256-bit vector (AVX2+).
if (Subtarget.hasAVX2() && MaskVT.is256BitVector()) {
Shuffle = X86ISD::VPERMI;
ShuffleVT = (AllowFloatDomain ? MVT::v4f64 : MVT::v4i64);
PermuteImm = getV4X86ShuffleImm(Mask);
return true;
}
if (Subtarget.hasAVX512() && MaskVT.is512BitVector()) {
SmallVector<int, 4> RepeatedMask;
if (is256BitLaneRepeatedShuffleMask(MVT::v8f64, Mask, RepeatedMask)) {
Shuffle = X86ISD::VPERMI;
ShuffleVT = (AllowFloatDomain ? MVT::v8f64 : MVT::v8i64);
PermuteImm = getV4X86ShuffleImm(RepeatedMask);
return true;
}
}
} else if (AllowFloatDomain && Subtarget.hasAVX()) {
// VPERMILPD can permute with a non-repeating shuffle.
Shuffle = X86ISD::VPERMILPI;
ShuffleVT = MVT::getVectorVT(MVT::f64, Mask.size());
PermuteImm = 0;
for (int i = 0, e = Mask.size(); i != e; ++i) {
int M = Mask[i];
if (M == SM_SentinelUndef)
continue;
assert(((M / 2) == (i / 2)) && "Out of range shuffle mask index");
PermuteImm |= (M & 1) << i;
}
return true;
}
}
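// Illustrative example: for a v4f64 mask {1, 0, 3, 2} the loop above yields
// PermuteImm = 0b0101 - bit i holds the low bit of Mask[i], selecting the
// upper or lower double within each 128-bit lane for VPERMILPD.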
// Handle PSHUFD/VPERMILPI vXi32/vXf32 repeated patterns.
// AVX introduced the VPERMILPD/VPERMILPS float permutes; before then we
// had to use 2-input SHUFPD/SHUFPS shuffles (not handled here).
if ((MaskScalarSizeInBits == 64 || MaskScalarSizeInBits == 32) &&
!ContainsZeros && (AllowIntDomain || Subtarget.hasAVX())) {
SmallVector<int, 4> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MaskEltVT, Mask, RepeatedMask)) {
// Narrow the repeated mask to create 32-bit element permutes.
SmallVector<int, 4> WordMask = RepeatedMask;
if (MaskScalarSizeInBits == 64)
scaleShuffleMask(2, RepeatedMask, WordMask);
Shuffle = (AllowIntDomain ? X86ISD::PSHUFD : X86ISD::VPERMILPI);
ShuffleVT = (AllowIntDomain ? MVT::i32 : MVT::f32);
ShuffleVT = MVT::getVectorVT(ShuffleVT, InputSizeInBits / 32);
PermuteImm = getV4X86ShuffleImm(WordMask);
return true;
}
}
// Handle PSHUFLW/PSHUFHW vXi16 repeated patterns.
if (!ContainsZeros && AllowIntDomain && MaskScalarSizeInBits == 16) {
SmallVector<int, 4> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MaskEltVT, Mask, RepeatedMask)) {
ArrayRef<int> LoMask(Mask.data() + 0, 4);
ArrayRef<int> HiMask(Mask.data() + 4, 4);
// PSHUFLW: permute lower 4 elements only.
if (isUndefOrInRange(LoMask, 0, 4) &&
isSequentialOrUndefInRange(HiMask, 0, 4, 4)) {
Shuffle = X86ISD::PSHUFLW;
ShuffleVT = MVT::getVectorVT(MVT::i16, InputSizeInBits / 16);
PermuteImm = getV4X86ShuffleImm(LoMask);
return true;
}
// PSHUFHW: permute upper 4 elements only.
if (isUndefOrInRange(HiMask, 4, 8) &&
isSequentialOrUndefInRange(LoMask, 0, 4, 0)) {
// Offset the HiMask so that we can create the shuffle immediate.
int OffsetHiMask[4];
for (int i = 0; i != 4; ++i)
OffsetHiMask[i] = (HiMask[i] < 0 ? HiMask[i] : HiMask[i] - 4);
Shuffle = X86ISD::PSHUFHW;
ShuffleVT = MVT::getVectorVT(MVT::i16, InputSizeInBits / 16);
PermuteImm = getV4X86ShuffleImm(OffsetHiMask);
return true;
}
}
}
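// Illustrative example: a HiMask of {5, 4, 7, 6} is offset to {1, 0, 3, 2},
// which getV4X86ShuffleImm packs two bits per element into the PSHUFHW
// immediate 0xB1.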
// Attempt to match against byte/bit shifts.
// FIXME: Add 512-bit support.
if (AllowIntDomain && ((MaskVT.is128BitVector() && Subtarget.hasSSE2()) ||
(MaskVT.is256BitVector() && Subtarget.hasAVX2()))) {
int ShiftAmt = matchVectorShuffleAsShift(ShuffleVT, Shuffle,
MaskScalarSizeInBits, Mask,
0, Zeroable, Subtarget);
if (0 < ShiftAmt) {
PermuteImm = (unsigned)ShiftAmt;
return true;
}
}
return false;
}
// Attempt to match a combined unary shuffle mask against supported binary
// shuffle instructions.
// TODO: Investigate sharing more of this with shuffle lowering.
static bool matchBinaryVectorShuffle(MVT MaskVT, ArrayRef<int> Mask,
bool AllowFloatDomain, bool AllowIntDomain,
SDValue &V1, SDValue &V2, SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget,
unsigned &Shuffle, MVT &ShuffleVT,
bool IsUnary) {
unsigned EltSizeInBits = MaskVT.getScalarSizeInBits();
if (MaskVT.is128BitVector()) {
if (isTargetShuffleEquivalent(Mask, {0, 0}) && AllowFloatDomain) {
V2 = V1;
Shuffle = X86ISD::MOVLHPS;
ShuffleVT = MVT::v4f32;
return true;
}
if (isTargetShuffleEquivalent(Mask, {1, 1}) && AllowFloatDomain) {
V2 = V1;
Shuffle = X86ISD::MOVHLPS;
ShuffleVT = MVT::v4f32;
return true;
}
if (isTargetShuffleEquivalent(Mask, {0, 3}) && Subtarget.hasSSE2() &&
(AllowFloatDomain || !Subtarget.hasSSE41())) {
std::swap(V1, V2);
Shuffle = X86ISD::MOVSD;
ShuffleVT = MaskVT;
return true;
}
if (isTargetShuffleEquivalent(Mask, {4, 1, 2, 3}) &&
(AllowFloatDomain || !Subtarget.hasSSE41())) {
Shuffle = X86ISD::MOVSS;
ShuffleVT = MaskVT;
return true;
}
}
// Attempt to match against either a unary or binary UNPCKL/UNPCKH shuffle.
if ((MaskVT == MVT::v4f32 && Subtarget.hasSSE1()) ||
(MaskVT.is128BitVector() && Subtarget.hasSSE2()) ||
(MaskVT.is256BitVector() && 32 <= EltSizeInBits && Subtarget.hasAVX()) ||
(MaskVT.is256BitVector() && Subtarget.hasAVX2()) ||
(MaskVT.is512BitVector() && Subtarget.hasAVX512())) {
if (matchVectorShuffleWithUNPCK(MaskVT, V1, V2, Shuffle, IsUnary, Mask, DL,
DAG, Subtarget)) {
ShuffleVT = MaskVT;
if (ShuffleVT.is256BitVector() && !Subtarget.hasAVX2())
ShuffleVT = (32 == EltSizeInBits ? MVT::v8f32 : MVT::v4f64);
return true;
}
}
return false;
}
static bool matchBinaryPermuteVectorShuffle(MVT MaskVT, ArrayRef<int> Mask,
const APInt &Zeroable,
bool AllowFloatDomain,
bool AllowIntDomain,
SDValue &V1, SDValue &V2, SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget,
unsigned &Shuffle, MVT &ShuffleVT,
unsigned &PermuteImm) {
unsigned NumMaskElts = Mask.size();
unsigned EltSizeInBits = MaskVT.getScalarSizeInBits();
// Attempt to match against PALIGNR byte rotate.
if (AllowIntDomain && ((MaskVT.is128BitVector() && Subtarget.hasSSSE3()) ||
(MaskVT.is256BitVector() && Subtarget.hasAVX2()))) {
int ByteRotation = matchVectorShuffleAsByteRotate(MaskVT, V1, V2, Mask);
if (0 < ByteRotation) {
Shuffle = X86ISD::PALIGNR;
ShuffleVT = MVT::getVectorVT(MVT::i8, MaskVT.getSizeInBits() / 8);
PermuteImm = ByteRotation;
return true;
}
}
// Attempt to combine to X86ISD::BLENDI.
if ((NumMaskElts <= 8 && ((Subtarget.hasSSE41() && MaskVT.is128BitVector()) ||
(Subtarget.hasAVX() && MaskVT.is256BitVector()))) ||
(MaskVT == MVT::v16i16 && Subtarget.hasAVX2())) {
uint64_t BlendMask = 0;
bool ForceV1Zero = false, ForceV2Zero = false;
SmallVector<int, 8> TargetMask(Mask.begin(), Mask.end());
if (matchVectorShuffleAsBlend(V1, V2, TargetMask, ForceV1Zero, ForceV2Zero,
BlendMask)) {
if (MaskVT == MVT::v16i16) {
// We can only use v16i16 PBLENDW if the lanes are repeated.
SmallVector<int, 8> RepeatedMask;
if (isRepeatedTargetShuffleMask(128, MaskVT, TargetMask,
RepeatedMask)) {
assert(RepeatedMask.size() == 8 &&
"Repeated mask size doesn't match!");
PermuteImm = 0;
for (int i = 0; i < 8; ++i)
if (RepeatedMask[i] >= 8)
PermuteImm |= 1 << i;
V1 = ForceV1Zero ? getZeroVector(MaskVT, Subtarget, DAG, DL) : V1;
V2 = ForceV2Zero ? getZeroVector(MaskVT, Subtarget, DAG, DL) : V2;
Shuffle = X86ISD::BLENDI;
ShuffleVT = MaskVT;
return true;
}
} else {
// Determine a type compatible with X86ISD::BLENDI.
ShuffleVT = MaskVT;
if (Subtarget.hasAVX2()) {
if (ShuffleVT == MVT::v4i64)
ShuffleVT = MVT::v8i32;
else if (ShuffleVT == MVT::v2i64)
ShuffleVT = MVT::v4i32;
} else {
if (ShuffleVT == MVT::v2i64 || ShuffleVT == MVT::v4i32)
ShuffleVT = MVT::v8i16;
else if (ShuffleVT == MVT::v4i64)
ShuffleVT = MVT::v4f64;
else if (ShuffleVT == MVT::v8i32)
ShuffleVT = MVT::v8f32;
}
if (!ShuffleVT.isFloatingPoint()) {
int Scale = EltSizeInBits / ShuffleVT.getScalarSizeInBits();
BlendMask =
scaleVectorShuffleBlendMask(BlendMask, NumMaskElts, Scale);
ShuffleVT = MVT::getIntegerVT(EltSizeInBits / Scale);
ShuffleVT = MVT::getVectorVT(ShuffleVT, NumMaskElts * Scale);
}
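// Illustrative example: with AVX2, a v2i64 blend taking element 1 from V2
// (BlendMask = 0b10) is rescaled to a v4i32 blend with BlendMask = 0b1100,
// each original bit being replicated across its Scale narrower elements.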
V1 = ForceV1Zero ? getZeroVector(MaskVT, Subtarget, DAG, DL) : V1;
V2 = ForceV2Zero ? getZeroVector(MaskVT, Subtarget, DAG, DL) : V2;
PermuteImm = (unsigned)BlendMask;
Shuffle = X86ISD::BLENDI;
return true;
}
}
}
// Attempt to combine to INSERTPS.
if (AllowFloatDomain && EltSizeInBits == 32 && Subtarget.hasSSE41() &&
MaskVT.is128BitVector()) {
if (Zeroable.getBoolValue() &&
matchVectorShuffleAsInsertPS(V1, V2, PermuteImm, Zeroable, Mask, DAG)) {
Shuffle = X86ISD::INSERTPS;
ShuffleVT = MVT::v4f32;
return true;
}
}
// Attempt to combine to SHUFPD.
if (AllowFloatDomain && EltSizeInBits == 64 &&
((MaskVT.is128BitVector() && Subtarget.hasSSE2()) ||
(MaskVT.is256BitVector() && Subtarget.hasAVX()) ||
(MaskVT.is512BitVector() && Subtarget.hasAVX512()))) {
if (matchVectorShuffleWithSHUFPD(MaskVT, V1, V2, PermuteImm, Mask)) {
Shuffle = X86ISD::SHUFP;
ShuffleVT = MVT::getVectorVT(MVT::f64, MaskVT.getSizeInBits() / 64);
return true;
}
}
// Attempt to combine to SHUFPS.
if (AllowFloatDomain && EltSizeInBits == 32 &&
((MaskVT.is128BitVector() && Subtarget.hasSSE1()) ||
(MaskVT.is256BitVector() && Subtarget.hasAVX()) ||
(MaskVT.is512BitVector() && Subtarget.hasAVX512()))) {
SmallVector<int, 4> RepeatedMask;
if (isRepeatedTargetShuffleMask(128, MaskVT, Mask, RepeatedMask)) {
// Match each half of the repeated mask to determine whether it is just
// referencing one of the vectors, is zeroable, or is entirely undef.
auto MatchHalf = [&](unsigned Offset, int &S0, int &S1) {
int M0 = RepeatedMask[Offset];
int M1 = RepeatedMask[Offset + 1];
if (isUndefInRange(RepeatedMask, Offset, 2)) {
return DAG.getUNDEF(MaskVT);
} else if (isUndefOrZeroInRange(RepeatedMask, Offset, 2)) {
S0 = (SM_SentinelUndef == M0 ? -1 : 0);
S1 = (SM_SentinelUndef == M1 ? -1 : 1);
return getZeroVector(MaskVT, Subtarget, DAG, DL);
} else if (isUndefOrInRange(M0, 0, 4) && isUndefOrInRange(M1, 0, 4)) {
S0 = (SM_SentinelUndef == M0 ? -1 : M0 & 3);
S1 = (SM_SentinelUndef == M1 ? -1 : M1 & 3);
return V1;
} else if (isUndefOrInRange(M0, 4, 8) && isUndefOrInRange(M1, 4, 8)) {
S0 = (SM_SentinelUndef == M0 ? -1 : M0 & 3);
S1 = (SM_SentinelUndef == M1 ? -1 : M1 & 3);
return V2;
}
return SDValue();
};
int ShufMask[4] = {-1, -1, -1, -1};
SDValue Lo = MatchHalf(0, ShufMask[0], ShufMask[1]);
SDValue Hi = MatchHalf(2, ShufMask[2], ShufMask[3]);
if (Lo && Hi) {
V1 = Lo;
V2 = Hi;
Shuffle = X86ISD::SHUFP;
ShuffleVT = MVT::getVectorVT(MVT::f32, MaskVT.getSizeInBits() / 32);
PermuteImm = getV4X86ShuffleImm(ShufMask);
return true;
}
}
}
return false;
}
/// \brief Combine an arbitrary chain of shuffles into a single instruction if
/// possible.
///
/// This is the leaf of the recursive combine below. When we have found some
/// chain of single-use x86 shuffle instructions and accumulated the combined
/// shuffle mask represented by them, this will try to pattern match that mask
/// into either a single instruction if there is a special purpose instruction
/// for this operation, or into a PSHUFB instruction which is a fully general
/// instruction but should only be used to replace chains over a certain depth.
static bool combineX86ShuffleChain(ArrayRef<SDValue> Inputs, SDValue Root,
ArrayRef<int> BaseMask, int Depth,
bool HasVariableMask, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
assert(!BaseMask.empty() && "Cannot combine an empty shuffle mask!");
assert((Inputs.size() == 1 || Inputs.size() == 2) &&
"Unexpected number of shuffle inputs!");
// Find the inputs that enter the chain. Note that multiple uses are OK
// here; we're not going to remove the operands we find.
bool UnaryShuffle = (Inputs.size() == 1);
SDValue V1 = peekThroughBitcasts(Inputs[0]);
SDValue V2 = (UnaryShuffle ? DAG.getUNDEF(V1.getValueType())
: peekThroughBitcasts(Inputs[1]));
MVT VT1 = V1.getSimpleValueType();
MVT VT2 = V2.getSimpleValueType();
MVT RootVT = Root.getSimpleValueType();
assert(VT1.getSizeInBits() == RootVT.getSizeInBits() &&
VT2.getSizeInBits() == RootVT.getSizeInBits() &&
"Vector size mismatch");
SDLoc DL(Root);
SDValue Res;
unsigned NumBaseMaskElts = BaseMask.size();
if (NumBaseMaskElts == 1) {
assert(BaseMask[0] == 0 && "Invalid shuffle index found!");
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, V1),
/*AddTo*/ true);
return true;
}
unsigned RootSizeInBits = RootVT.getSizeInBits();
unsigned NumRootElts = RootVT.getVectorNumElements();
unsigned BaseMaskEltSizeInBits = RootSizeInBits / NumBaseMaskElts;
bool FloatDomain = VT1.isFloatingPoint() || VT2.isFloatingPoint() ||
(RootVT.is256BitVector() && !Subtarget.hasAVX2());
// Don't combine if we are an AVX512/EVEX target and the mask element size
// is different from the root element size - this would prevent writemasks
// from being reused.
// TODO - this currently prevents all lane shuffles from occurring.
// TODO - check for writemasks usage instead of always preventing combining.
// TODO - attempt to narrow Mask back to writemask size.
bool IsEVEXShuffle =
RootSizeInBits == 512 || (Subtarget.hasVLX() && RootSizeInBits >= 128);
if (IsEVEXShuffle && (RootVT.getScalarSizeInBits() != BaseMaskEltSizeInBits))
return false;
// TODO - handle 128/256-bit lane shuffles of 512-bit vectors.
// Handle 128-bit lane shuffles of 256-bit vectors.
// TODO - this should support binary shuffles.
if (UnaryShuffle && RootVT.is256BitVector() && NumBaseMaskElts == 2 &&
!isSequentialOrUndefOrZeroInRange(BaseMask, 0, 2, 0)) {
if (Depth == 1 && Root.getOpcode() == X86ISD::VPERM2X128)
return false; // Nothing to do!
MVT ShuffleVT = (FloatDomain ? MVT::v4f64 : MVT::v4i64);
unsigned PermMask = 0;
PermMask |= ((BaseMask[0] < 0 ? 0x8 : (BaseMask[0] & 1)) << 0);
PermMask |= ((BaseMask[1] < 0 ? 0x8 : (BaseMask[1] & 1)) << 4);
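// Illustrative example: BaseMask {1, -1} yields PermMask = 0x81 - the low
// nibble selects 128-bit lane 1 of the source for the low half, and the 0x8
// in the high nibble zeroes the upper half (VPERM2F128 immediate encoding).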
Res = DAG.getBitcast(ShuffleVT, V1);
DCI.AddToWorklist(Res.getNode());
Res = DAG.getNode(X86ISD::VPERM2X128, DL, ShuffleVT, Res,
DAG.getUNDEF(ShuffleVT),
DAG.getConstant(PermMask, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// For masks that have been widened to 128-bit elements or more,
// narrow back down to 64-bit elements.
SmallVector<int, 64> Mask;
if (BaseMaskEltSizeInBits > 64) {
assert((BaseMaskEltSizeInBits % 64) == 0 && "Illegal mask size");
int MaskScale = BaseMaskEltSizeInBits / 64;
scaleShuffleMask(MaskScale, BaseMask, Mask);
} else {
Mask = SmallVector<int, 64>(BaseMask.begin(), BaseMask.end());
}
unsigned NumMaskElts = Mask.size();
unsigned MaskEltSizeInBits = RootSizeInBits / NumMaskElts;
// Determine the effective mask value type.
FloatDomain &= (32 <= MaskEltSizeInBits);
MVT MaskVT = FloatDomain ? MVT::getFloatingPointVT(MaskEltSizeInBits)
: MVT::getIntegerVT(MaskEltSizeInBits);
MaskVT = MVT::getVectorVT(MaskVT, NumMaskElts);
// Only allow legal mask types.
if (!DAG.getTargetLoweringInfo().isTypeLegal(MaskVT))
return false;
// Attempt to match the mask against known shuffle patterns.
MVT ShuffleSrcVT, ShuffleVT;
unsigned Shuffle, PermuteImm;
// Which shuffle domains are permitted?
// Permit domain crossing at higher combine depths.
bool AllowFloatDomain = FloatDomain || (Depth > 3);
bool AllowIntDomain = (!FloatDomain || (Depth > 3)) &&
(!MaskVT.is256BitVector() || Subtarget.hasAVX2());
// Determine zeroable mask elements.
APInt Zeroable(NumMaskElts, 0);
for (unsigned i = 0; i != NumMaskElts; ++i)
if (isUndefOrZero(Mask[i]))
Zeroable.setBit(i);
if (UnaryShuffle) {
// If we are shuffling an X86ISD::VZEXT_LOAD then we can use the load
// directly if we don't shuffle the lower element and we shuffle the upper
// (zero) elements within themselves.
if (V1.getOpcode() == X86ISD::VZEXT_LOAD &&
(V1.getScalarValueSizeInBits() % MaskEltSizeInBits) == 0) {
unsigned Scale = V1.getScalarValueSizeInBits() / MaskEltSizeInBits;
ArrayRef<int> HiMask(Mask.data() + Scale, NumMaskElts - Scale);
if (isSequentialOrUndefInRange(Mask, 0, Scale, 0) &&
isUndefOrZeroOrInRange(HiMask, Scale, NumMaskElts)) {
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, V1),
/*AddTo*/ true);
return true;
}
}
if (matchUnaryVectorShuffle(MaskVT, Mask, AllowFloatDomain, AllowIntDomain,
V1, DL, DAG, Subtarget, Shuffle, ShuffleSrcVT,
ShuffleVT)) {
if (Depth == 1 && Root.getOpcode() == Shuffle)
return false; // Nothing to do!
if (IsEVEXShuffle && (NumRootElts != ShuffleVT.getVectorNumElements()))
return false; // AVX512 Writemask clash.
Res = DAG.getBitcast(ShuffleSrcVT, V1);
DCI.AddToWorklist(Res.getNode());
Res = DAG.getNode(Shuffle, DL, ShuffleVT, Res);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
if (matchUnaryPermuteVectorShuffle(MaskVT, Mask, Zeroable, AllowFloatDomain,
AllowIntDomain, Subtarget, Shuffle,
ShuffleVT, PermuteImm)) {
if (Depth == 1 && Root.getOpcode() == Shuffle)
return false; // Nothing to do!
if (IsEVEXShuffle && (NumRootElts != ShuffleVT.getVectorNumElements()))
return false; // AVX512 Writemask clash.
Res = DAG.getBitcast(ShuffleVT, V1);
DCI.AddToWorklist(Res.getNode());
Res = DAG.getNode(Shuffle, DL, ShuffleVT, Res,
DAG.getConstant(PermuteImm, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
}
if (matchBinaryVectorShuffle(MaskVT, Mask, AllowFloatDomain, AllowIntDomain,
V1, V2, DL, DAG, Subtarget, Shuffle, ShuffleVT,
UnaryShuffle)) {
if (Depth == 1 && Root.getOpcode() == Shuffle)
return false; // Nothing to do!
if (IsEVEXShuffle && (NumRootElts != ShuffleVT.getVectorNumElements()))
return false; // AVX512 Writemask clash.
V1 = DAG.getBitcast(ShuffleVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(ShuffleVT, V2);
DCI.AddToWorklist(V2.getNode());
Res = DAG.getNode(Shuffle, DL, ShuffleVT, V1, V2);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
if (matchBinaryPermuteVectorShuffle(MaskVT, Mask, Zeroable, AllowFloatDomain,
AllowIntDomain, V1, V2, DL, DAG,
Subtarget, Shuffle, ShuffleVT,
PermuteImm)) {
if (Depth == 1 && Root.getOpcode() == Shuffle)
return false; // Nothing to do!
if (IsEVEXShuffle && (NumRootElts != ShuffleVT.getVectorNumElements()))
return false; // AVX512 Writemask clash.
V1 = DAG.getBitcast(ShuffleVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(ShuffleVT, V2);
DCI.AddToWorklist(V2.getNode());
Res = DAG.getNode(Shuffle, DL, ShuffleVT, V1, V2,
DAG.getConstant(PermuteImm, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// Typically from here on, we need an integer version of MaskVT.
MVT IntMaskVT = MVT::getIntegerVT(MaskEltSizeInBits);
IntMaskVT = MVT::getVectorVT(IntMaskVT, NumMaskElts);
// Annoyingly, SSE4A instructions don't map into the above match helpers.
if (Subtarget.hasSSE4A() && AllowIntDomain && RootSizeInBits == 128) {
uint64_t BitLen, BitIdx;
if (matchVectorShuffleAsEXTRQ(IntMaskVT, V1, V2, Mask, BitLen, BitIdx,
Zeroable)) {
if (Depth == 1 && Root.getOpcode() == X86ISD::EXTRQI)
return false; // Nothing to do!
V1 = DAG.getBitcast(IntMaskVT, V1);
DCI.AddToWorklist(V1.getNode());
Res = DAG.getNode(X86ISD::EXTRQI, DL, IntMaskVT, V1,
DAG.getConstant(BitLen, DL, MVT::i8),
DAG.getConstant(BitIdx, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
if (matchVectorShuffleAsINSERTQ(IntMaskVT, V1, V2, Mask, BitLen, BitIdx)) {
if (Depth == 1 && Root.getOpcode() == X86ISD::INSERTQI)
return false; // Nothing to do!
V1 = DAG.getBitcast(IntMaskVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(IntMaskVT, V2);
DCI.AddToWorklist(V2.getNode());
Res = DAG.getNode(X86ISD::INSERTQI, DL, IntMaskVT, V1, V2,
DAG.getConstant(BitLen, DL, MVT::i8),
DAG.getConstant(BitIdx, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
}
// Don't try to re-form single instruction chains under any circumstances now
// that we've done encoding canonicalization for them.
if (Depth < 2)
return false;
bool MaskContainsZeros =
any_of(Mask, [](int M) { return M == SM_SentinelZero; });
if (is128BitLaneCrossingShuffleMask(MaskVT, Mask)) {
// If we have a single input lane-crossing shuffle then lower to VPERMV.
if (UnaryShuffle && (Depth >= 3 || HasVariableMask) && !MaskContainsZeros &&
((Subtarget.hasAVX2() &&
(MaskVT == MVT::v8f32 || MaskVT == MVT::v8i32)) ||
(Subtarget.hasAVX512() &&
(MaskVT == MVT::v8f64 || MaskVT == MVT::v8i64 ||
MaskVT == MVT::v16f32 || MaskVT == MVT::v16i32)) ||
(Subtarget.hasBWI() && MaskVT == MVT::v32i16) ||
(Subtarget.hasBWI() && Subtarget.hasVLX() && MaskVT == MVT::v16i16) ||
(Subtarget.hasVBMI() && MaskVT == MVT::v64i8) ||
(Subtarget.hasVBMI() && Subtarget.hasVLX() && MaskVT == MVT::v32i8))) {
SDValue VPermMask = getConstVector(Mask, IntMaskVT, DAG, DL, true);
DCI.AddToWorklist(VPermMask.getNode());
Res = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(Res.getNode());
Res = DAG.getNode(X86ISD::VPERMV, DL, MaskVT, VPermMask, Res);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// Lower a unary+zero lane-crossing shuffle as VPERMV3 with a zero
// vector as the second source.
if (UnaryShuffle && (Depth >= 3 || HasVariableMask) &&
((Subtarget.hasAVX512() &&
(MaskVT == MVT::v8f64 || MaskVT == MVT::v8i64 ||
MaskVT == MVT::v16f32 || MaskVT == MVT::v16i32)) ||
(Subtarget.hasVLX() &&
(MaskVT == MVT::v4f64 || MaskVT == MVT::v4i64 ||
MaskVT == MVT::v8f32 || MaskVT == MVT::v8i32)) ||
(Subtarget.hasBWI() && MaskVT == MVT::v32i16) ||
(Subtarget.hasBWI() && Subtarget.hasVLX() && MaskVT == MVT::v16i16) ||
(Subtarget.hasVBMI() && MaskVT == MVT::v64i8) ||
(Subtarget.hasVBMI() && Subtarget.hasVLX() && MaskVT == MVT::v32i8))) {
// Adjust shuffle mask - replace SM_SentinelZero with second source index.
for (unsigned i = 0; i != NumMaskElts; ++i)
if (Mask[i] == SM_SentinelZero)
Mask[i] = NumMaskElts + i;
SDValue VPermMask = getConstVector(Mask, IntMaskVT, DAG, DL, true);
DCI.AddToWorklist(VPermMask.getNode());
Res = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(Res.getNode());
SDValue Zero = getZeroVector(MaskVT, Subtarget, DAG, DL);
DCI.AddToWorklist(Zero.getNode());
Res = DAG.getNode(X86ISD::VPERMV3, DL, MaskVT, Res, VPermMask, Zero);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// If we have a dual input lane-crossing shuffle then lower to VPERMV3.
if ((Depth >= 3 || HasVariableMask) && !MaskContainsZeros &&
((Subtarget.hasAVX512() &&
(MaskVT == MVT::v8f64 || MaskVT == MVT::v8i64 ||
MaskVT == MVT::v16f32 || MaskVT == MVT::v16i32)) ||
(Subtarget.hasVLX() &&
(MaskVT == MVT::v4f64 || MaskVT == MVT::v4i64 ||
MaskVT == MVT::v8f32 || MaskVT == MVT::v8i32)) ||
(Subtarget.hasBWI() && MaskVT == MVT::v32i16) ||
(Subtarget.hasBWI() && Subtarget.hasVLX() && MaskVT == MVT::v16i16) ||
(Subtarget.hasVBMI() && MaskVT == MVT::v64i8) ||
(Subtarget.hasVBMI() && Subtarget.hasVLX() && MaskVT == MVT::v32i8))) {
SDValue VPermMask = getConstVector(Mask, IntMaskVT, DAG, DL, true);
DCI.AddToWorklist(VPermMask.getNode());
V1 = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(MaskVT, V2);
DCI.AddToWorklist(V2.getNode());
Res = DAG.getNode(X86ISD::VPERMV3, DL, MaskVT, V1, VPermMask, V2);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
return false;
}
// See if we can combine a single input shuffle with zeros to a bit-mask,
// which is much simpler than any shuffle.
if (UnaryShuffle && MaskContainsZeros && (Depth >= 3 || HasVariableMask) &&
isSequentialOrUndefOrZeroInRange(Mask, 0, NumMaskElts, 0) &&
DAG.getTargetLoweringInfo().isTypeLegal(MaskVT)) {
APInt Zero = APInt::getNullValue(MaskEltSizeInBits);
APInt AllOnes = APInt::getAllOnesValue(MaskEltSizeInBits);
APInt UndefElts(NumMaskElts, 0);
SmallVector<APInt, 64> EltBits(NumMaskElts, Zero);
for (unsigned i = 0; i != NumMaskElts; ++i) {
int M = Mask[i];
if (M == SM_SentinelUndef) {
UndefElts.setBit(i);
continue;
}
if (M == SM_SentinelZero)
continue;
EltBits[i] = AllOnes;
}
SDValue BitMask = getConstVector(EltBits, UndefElts, MaskVT, DAG, DL);
DCI.AddToWorklist(BitMask.getNode());
Res = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(Res.getNode());
unsigned AndOpcode =
FloatDomain ? unsigned(X86ISD::FAND) : unsigned(ISD::AND);
Res = DAG.getNode(AndOpcode, DL, MaskVT, Res, BitMask);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// If we have a single input shuffle with different shuffle patterns in the
// 128-bit lanes, use a variable shuffle mask with VPERMILPS.
// TODO: Combine other mask types at higher depths.
if (UnaryShuffle && HasVariableMask && !MaskContainsZeros &&
((MaskVT == MVT::v8f32 && Subtarget.hasAVX()) ||
(MaskVT == MVT::v16f32 && Subtarget.hasAVX512()))) {
SmallVector<SDValue, 16> VPermIdx;
for (int M : Mask) {
SDValue Idx =
M < 0 ? DAG.getUNDEF(MVT::i32) : DAG.getConstant(M % 4, DL, MVT::i32);
VPermIdx.push_back(Idx);
}
SDValue VPermMask = DAG.getBuildVector(IntMaskVT, DL, VPermIdx);
DCI.AddToWorklist(VPermMask.getNode());
Res = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(Res.getNode());
Res = DAG.getNode(X86ISD::VPERMILPV, DL, MaskVT, Res, VPermMask);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// With XOP, binary shuffles of 128/256-bit floating point vectors can combine
// to VPERMIL2PD/VPERMIL2PS.
if ((Depth >= 3 || HasVariableMask) && Subtarget.hasXOP() &&
(MaskVT == MVT::v2f64 || MaskVT == MVT::v4f64 || MaskVT == MVT::v4f32 ||
MaskVT == MVT::v8f32)) {
// VPERMIL2 Operation.
// Bits[3] - Match Bit.
// Bits[2:1] - (Per Lane) PD Shuffle Mask.
// Bits[2:0] - (Per Lane) PS Shuffle Mask.
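// Illustrative example: for v4f32, M = 5 (element 1 of the second source)
// gives Index = (5 % 4) + ((5 / 4) * 4) = 5, i.e. selector 0b101 - bit 2
// picks the second source, bits 1:0 pick the element within the lane. For
// 64-bit elements the index is doubled since the PD selector uses bits 2:1.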
unsigned NumLanes = MaskVT.getSizeInBits() / 128;
unsigned NumEltsPerLane = NumMaskElts / NumLanes;
SmallVector<int, 8> VPerm2Idx;
unsigned M2ZImm = 0;
for (int M : Mask) {
if (M == SM_SentinelUndef) {
VPerm2Idx.push_back(-1);
continue;
}
if (M == SM_SentinelZero) {
M2ZImm = 2;
VPerm2Idx.push_back(8);
continue;
}
int Index = (M % NumEltsPerLane) + ((M / NumMaskElts) * NumEltsPerLane);
Index = (MaskVT.getScalarSizeInBits() == 64 ? Index << 1 : Index);
VPerm2Idx.push_back(Index);
}
V1 = DAG.getBitcast(MaskVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(MaskVT, V2);
DCI.AddToWorklist(V2.getNode());
SDValue VPerm2MaskOp = getConstVector(VPerm2Idx, IntMaskVT, DAG, DL, true);
DCI.AddToWorklist(VPerm2MaskOp.getNode());
Res = DAG.getNode(X86ISD::VPERMIL2, DL, MaskVT, V1, V2, VPerm2MaskOp,
DAG.getConstant(M2ZImm, DL, MVT::i8));
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// If we have 3 or more shuffle instructions or a chain involving a variable
// mask, we can replace them with a single PSHUFB instruction profitably.
// Intel's manuals suggest only using PSHUFB if doing so replaces 5
// instructions, but in practice PSHUFB tends to be *very* fast so we're
// more aggressive.
if (UnaryShuffle && (Depth >= 3 || HasVariableMask) &&
((RootVT.is128BitVector() && Subtarget.hasSSSE3()) ||
(RootVT.is256BitVector() && Subtarget.hasAVX2()) ||
(RootVT.is512BitVector() && Subtarget.hasBWI()))) {
SmallVector<SDValue, 16> PSHUFBMask;
int NumBytes = RootVT.getSizeInBits() / 8;
int Ratio = NumBytes / NumMaskElts;
for (int i = 0; i < NumBytes; ++i) {
int M = Mask[i / Ratio];
if (M == SM_SentinelUndef) {
PSHUFBMask.push_back(DAG.getUNDEF(MVT::i8));
continue;
}
if (M == SM_SentinelZero) {
PSHUFBMask.push_back(DAG.getConstant(255, DL, MVT::i8));
continue;
}
M = Ratio * M + i % Ratio;
assert((M / 16) == (i / 16) && "Lane crossing detected");
PSHUFBMask.push_back(DAG.getConstant(M, DL, MVT::i8));
}
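// Illustrative example: a v4i32 mask {1, 0, 3, 2} on a 128-bit vector
// (Ratio = 4) expands to the byte mask {4,5,6,7, 0,1,2,3, 12,13,14,15,
// 8,9,10,11}; any 255 entries have the high bit set, which makes PSHUFB
// write zero to that destination byte.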
MVT ByteVT = MVT::getVectorVT(MVT::i8, NumBytes);
Res = DAG.getBitcast(ByteVT, V1);
DCI.AddToWorklist(Res.getNode());
SDValue PSHUFBMaskOp = DAG.getBuildVector(ByteVT, DL, PSHUFBMask);
DCI.AddToWorklist(PSHUFBMaskOp.getNode());
Res = DAG.getNode(X86ISD::PSHUFB, DL, ByteVT, Res, PSHUFBMaskOp);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// With XOP, if we have a 128-bit binary input shuffle we can always combine
// to VPPERM. We match the depth requirement of PSHUFB - VPPERM is never
// slower than PSHUFB on targets that support both.
if ((Depth >= 3 || HasVariableMask) && RootVT.is128BitVector() &&
Subtarget.hasXOP()) {
// VPPERM Mask Operation
// Bits[4:0] - Byte Index (0 - 31)
// Bits[7:5] - Permute Operation (0 - Source byte, 4 - ZERO)
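// Illustrative example: selector 19 (0b00010011) copies byte 3 of the second
// source, while selector 128 (0b10000000) uses permute operation 4 to force
// the destination byte to zero.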
SmallVector<SDValue, 16> VPPERMMask;
int NumBytes = 16;
int Ratio = NumBytes / NumMaskElts;
for (int i = 0; i < NumBytes; ++i) {
int M = Mask[i / Ratio];
if (M == SM_SentinelUndef) {
VPPERMMask.push_back(DAG.getUNDEF(MVT::i8));
continue;
}
if (M == SM_SentinelZero) {
VPPERMMask.push_back(DAG.getConstant(128, DL, MVT::i8));
continue;
}
M = Ratio * M + i % Ratio;
VPPERMMask.push_back(DAG.getConstant(M, DL, MVT::i8));
}
MVT ByteVT = MVT::v16i8;
V1 = DAG.getBitcast(ByteVT, V1);
DCI.AddToWorklist(V1.getNode());
V2 = DAG.getBitcast(ByteVT, V2);
DCI.AddToWorklist(V2.getNode());
SDValue VPPERMMaskOp = DAG.getBuildVector(ByteVT, DL, VPPERMMask);
DCI.AddToWorklist(VPPERMMaskOp.getNode());
Res = DAG.getNode(X86ISD::VPPERM, DL, ByteVT, V1, V2, VPPERMMaskOp);
DCI.AddToWorklist(Res.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(RootVT, Res),
/*AddTo*/ true);
return true;
}
// Failed to find any combines.
return false;
}
// Attempt to constant fold all of the constant source ops.
// Returns true if the entire shuffle is folded to a constant.
// TODO: Extend this to merge multiple constant Ops and update the mask.
static bool combineX86ShufflesConstants(const SmallVectorImpl<SDValue> &Ops,
ArrayRef<int> Mask, SDValue Root,
bool HasVariableMask, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
MVT VT = Root.getSimpleValueType();
unsigned SizeInBits = VT.getSizeInBits();
unsigned NumMaskElts = Mask.size();
unsigned MaskSizeInBits = SizeInBits / NumMaskElts;
unsigned NumOps = Ops.size();
// Extract constant bits from each source op.
bool OneUseConstantOp = false;
SmallVector<APInt, 16> UndefEltsOps(NumOps);
SmallVector<SmallVector<APInt, 16>, 16> RawBitsOps(NumOps);
for (unsigned i = 0; i != NumOps; ++i) {
SDValue SrcOp = Ops[i];
OneUseConstantOp |= SrcOp.hasOneUse();
if (!getTargetConstantBitsFromNode(SrcOp, MaskSizeInBits, UndefEltsOps[i],
RawBitsOps[i]))
return false;
}
// Only fold if at least one of the constants is only used once or
// the combined shuffle has included a variable mask shuffle; this
// is to avoid constant pool bloat.
if (!OneUseConstantOp && !HasVariableMask)
return false;
// Shuffle the constant bits according to the mask.
APInt UndefElts(NumMaskElts, 0);
APInt ZeroElts(NumMaskElts, 0);
APInt ConstantElts(NumMaskElts, 0);
SmallVector<APInt, 8> ConstantBitData(NumMaskElts,
APInt::getNullValue(MaskSizeInBits));
for (unsigned i = 0; i != NumMaskElts; ++i) {
int M = Mask[i];
if (M == SM_SentinelUndef) {
UndefElts.setBit(i);
continue;
} else if (M == SM_SentinelZero) {
ZeroElts.setBit(i);
continue;
}
assert(0 <= M && M < (int)(NumMaskElts * NumOps));
unsigned SrcOpIdx = (unsigned)M / NumMaskElts;
unsigned SrcMaskIdx = (unsigned)M % NumMaskElts;
auto &SrcUndefElts = UndefEltsOps[SrcOpIdx];
if (SrcUndefElts[SrcMaskIdx]) {
UndefElts.setBit(i);
continue;
}
auto &SrcEltBits = RawBitsOps[SrcOpIdx];
APInt &Bits = SrcEltBits[SrcMaskIdx];
if (!Bits) {
ZeroElts.setBit(i);
continue;
}
ConstantElts.setBit(i);
ConstantBitData[i] = Bits;
}
assert((UndefElts | ZeroElts | ConstantElts).isAllOnesValue());
// Create the constant data.
MVT MaskSVT;
if (VT.isFloatingPoint() && (MaskSizeInBits == 32 || MaskSizeInBits == 64))
MaskSVT = MVT::getFloatingPointVT(MaskSizeInBits);
else
MaskSVT = MVT::getIntegerVT(MaskSizeInBits);
MVT MaskVT = MVT::getVectorVT(MaskSVT, NumMaskElts);
SDLoc DL(Root);
SDValue CstOp = getConstVector(ConstantBitData, UndefElts, MaskVT, DAG, DL);
DCI.AddToWorklist(CstOp.getNode());
DCI.CombineTo(Root.getNode(), DAG.getBitcast(VT, CstOp));
return true;
}
/// \brief Fully generic combining of x86 shuffle instructions.
///
/// This should be the last combine run over the x86 shuffle instructions. Once
/// they have been fully optimized, this will recursively consider all chains
/// of single-use shuffle instructions, build a generic model of the cumulative
/// shuffle operation, and check for simpler instructions which implement this
/// operation. We use this primarily for two purposes:
///
/// 1) Collapse generic shuffles to specialized single instructions when
/// equivalent. In most cases, this is just an encoding size win, but
/// sometimes we will collapse multiple generic shuffles into a single
/// special-purpose shuffle.
/// 2) Look for sequences of shuffle instructions with 3 or more total
/// instructions, and replace them with the slightly more expensive SSSE3
/// PSHUFB instruction if available. We do this as the last combining step
/// to ensure we avoid using PSHUFB if we can implement the shuffle with
/// a suitable short sequence of other instructions. The PSHUFB will either
/// use a register or have to read from memory and so is slightly (but only
/// slightly) more expensive than the other shuffle instructions.
///
/// Because this is inherently a quadratic operation (for each shuffle in
/// a chain, we recurse up the chain), the depth is limited to 8 instructions.
/// This should never be an issue in practice as the shuffle lowering doesn't
/// produce sequences of more than 8 instructions.
///
/// FIXME: We will currently miss some cases where the redundant shuffling
/// would simplify under the threshold for PSHUFB formation because of
/// combine-ordering. To fix this, we should do the redundant instruction
/// combining in this recursive walk.
static bool combineX86ShufflesRecursively(ArrayRef<SDValue> SrcOps,
int SrcOpIndex, SDValue Root,
ArrayRef<int> RootMask,
ArrayRef<const SDNode*> SrcNodes,
int Depth, bool HasVariableMask,
SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
// Bound the depth of our recursive combine because this is ultimately
// quadratic in nature.
if (Depth > 8)
return false;
// Directly rip through bitcasts to find the underlying operand.
SDValue Op = SrcOps[SrcOpIndex];
Op = peekThroughOneUseBitcasts(Op);
MVT VT = Op.getSimpleValueType();
if (!VT.isVector())
return false; // Bail if we hit a non-vector.
assert(Root.getSimpleValueType().isVector() &&
"Shuffles operate on vector types!");
assert(VT.getSizeInBits() == Root.getSimpleValueType().getSizeInBits() &&
"Can only combine shuffles of the same vector register size.");
// Extract target shuffle mask and resolve sentinels and inputs.
SmallVector<int, 64> OpMask;
SmallVector<SDValue, 2> OpInputs;
if (!resolveTargetShuffleInputs(Op, OpInputs, OpMask, DAG))
return false;
assert(OpInputs.size() <= 2 && "Too many shuffle inputs");
SDValue Input0 = (OpInputs.size() > 0 ? OpInputs[0] : SDValue());
SDValue Input1 = (OpInputs.size() > 1 ? OpInputs[1] : SDValue());
// Add the inputs to the Ops list, avoiding duplicates.
SmallVector<SDValue, 16> Ops(SrcOps.begin(), SrcOps.end());
int InputIdx0 = -1, InputIdx1 = -1;
for (int i = 0, e = Ops.size(); i < e; ++i) {
SDValue BC = peekThroughBitcasts(Ops[i]);
if (Input0 && BC == peekThroughBitcasts(Input0))
InputIdx0 = i;
if (Input1 && BC == peekThroughBitcasts(Input1))
InputIdx1 = i;
}
if (Input0 && InputIdx0 < 0) {
InputIdx0 = SrcOpIndex;
Ops[SrcOpIndex] = Input0;
}
if (Input1 && InputIdx1 < 0) {
InputIdx1 = Ops.size();
Ops.push_back(Input1);
}
assert(((RootMask.size() > OpMask.size() &&
RootMask.size() % OpMask.size() == 0) ||
(OpMask.size() > RootMask.size() &&
OpMask.size() % RootMask.size() == 0) ||
OpMask.size() == RootMask.size()) &&
"The smaller number of elements must divide the larger.");
// This function can be performance-critical, so we rely on the power-of-2
// knowledge that we have about the mask sizes to replace div/rem ops with
// bit-masks and shifts.
assert(isPowerOf2_32(RootMask.size()) && "Non-power-of-2 shuffle mask sizes");
assert(isPowerOf2_32(OpMask.size()) && "Non-power-of-2 shuffle mask sizes");
unsigned RootMaskSizeLog2 = countTrailingZeros(RootMask.size());
unsigned OpMaskSizeLog2 = countTrailingZeros(OpMask.size());
unsigned MaskWidth = std::max<unsigned>(OpMask.size(), RootMask.size());
unsigned RootRatio = std::max<unsigned>(1, OpMask.size() >> RootMaskSizeLog2);
unsigned OpRatio = std::max<unsigned>(1, RootMask.size() >> OpMaskSizeLog2);
assert((RootRatio == 1 || OpRatio == 1) &&
"Must not have a ratio for both incoming and op masks!");
assert(isPowerOf2_32(MaskWidth) && "Non-power-of-2 shuffle mask sizes");
assert(isPowerOf2_32(RootRatio) && "Non-power-of-2 shuffle mask sizes");
assert(isPowerOf2_32(OpRatio) && "Non-power-of-2 shuffle mask sizes");
unsigned RootRatioLog2 = countTrailingZeros(RootRatio);
unsigned OpRatioLog2 = countTrailingZeros(OpRatio);
SmallVector<int, 64> Mask(MaskWidth, SM_SentinelUndef);
// Merge this shuffle operation's mask into our accumulated mask. Note that
// this shuffle's mask will be the first applied to the input, followed by the
// root mask to get us all the way to the root value arrangement. The reason
// for this order is that we are recursing up the operation chain.
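// Illustrative example: a RootMask of {1, 0} over an op whose OpMask has 4
// elements gives MaskWidth = 4 and RootRatio = 2, so the root mask first
// expands to {2, 3, 0, 1} and each entry is then mapped through OpMask.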
for (unsigned i = 0; i < MaskWidth; ++i) {
unsigned RootIdx = i >> RootRatioLog2;
if (RootMask[RootIdx] < 0) {
// This is a zero or undef lane, we're done.
Mask[i] = RootMask[RootIdx];
continue;
}
unsigned RootMaskedIdx =
RootRatio == 1
? RootMask[RootIdx]
: (RootMask[RootIdx] << RootRatioLog2) + (i & (RootRatio - 1));
// Just insert the scaled root mask value if it references an input other
// than the SrcOp we're currently inserting.
if ((RootMaskedIdx < (SrcOpIndex * MaskWidth)) ||
(((SrcOpIndex + 1) * MaskWidth) <= RootMaskedIdx)) {
Mask[i] = RootMaskedIdx;
continue;
}
RootMaskedIdx = RootMaskedIdx & (MaskWidth - 1);
unsigned OpIdx = RootMaskedIdx >> OpRatioLog2;
if (OpMask[OpIdx] < 0) {
// The incoming lanes are zero or undef; it doesn't matter which ones we
// are using.
Mask[i] = OpMask[OpIdx];
continue;
}
// Ok, we have non-zero lanes, map them through to one of the Op's inputs.
unsigned OpMaskedIdx =
OpRatio == 1
? OpMask[OpIdx]
: (OpMask[OpIdx] << OpRatioLog2) + (RootMaskedIdx & (OpRatio - 1));
OpMaskedIdx = OpMaskedIdx & (MaskWidth - 1);
if (OpMask[OpIdx] < (int)OpMask.size()) {
assert(0 <= InputIdx0 && "Unknown target shuffle input");
OpMaskedIdx += InputIdx0 * MaskWidth;
} else {
assert(0 <= InputIdx1 && "Unknown target shuffle input");
OpMaskedIdx += InputIdx1 * MaskWidth;
}
Mask[i] = OpMaskedIdx;
}
// Handle the all undef/zero cases early.
if (all_of(Mask, [](int Idx) { return Idx == SM_SentinelUndef; })) {
DCI.CombineTo(Root.getNode(), DAG.getUNDEF(Root.getValueType()));
return true;
}
if (all_of(Mask, [](int Idx) { return Idx < 0; })) {
// TODO - should we handle the mixed zero/undef case as well? Just returning
// a zero mask will lose information on undef elements, possibly reducing
// future combine possibilities.
DCI.CombineTo(Root.getNode(), getZeroVector(Root.getSimpleValueType(),
Subtarget, DAG, SDLoc(Root)));
return true;
}
// Remove unused shuffle source ops.
resolveTargetShuffleInputsAndMask(Ops, Mask);
assert(!Ops.empty() && "Shuffle with no inputs detected");
HasVariableMask |= isTargetShuffleVariableMask(Op.getOpcode());
// Update the list of shuffle nodes that have been combined so far.
SmallVector<const SDNode *, 16> CombinedNodes(SrcNodes.begin(),
SrcNodes.end());
CombinedNodes.push_back(Op.getNode());
// See if we can recurse into each shuffle source op (if it's a target
// shuffle). The source op should only be combined if it either has a
// single use (i.e. current Op) or all its users have already been combined.
for (int i = 0, e = Ops.size(); i < e; ++i)
if (Ops[i].getNode()->hasOneUse() ||
SDNode::areOnlyUsersOf(CombinedNodes, Ops[i].getNode()))
if (combineX86ShufflesRecursively(Ops, i, Root, Mask, CombinedNodes,
Depth + 1, HasVariableMask, DAG, DCI,
Subtarget))
return true;
// Attempt to constant fold all of the constant source ops.
if (combineX86ShufflesConstants(Ops, Mask, Root, HasVariableMask, DAG, DCI,
Subtarget))
return true;
// We can only combine unary and binary shuffle mask cases.
if (Ops.size() > 2)
return false;
// Minor canonicalization of the accumulated shuffle mask to make it easier
// to match below. All this does is detect masks with sequential pairs of
// elements, and shrink them to the half-width mask. It does this in a loop
// so it will reduce the size of the mask to the minimal width mask which
// performs an equivalent shuffle.
SmallVector<int, 64> WidenedMask;
while (Mask.size() > 1 && canWidenShuffleElements(Mask, WidenedMask)) {
Mask = std::move(WidenedMask);
}
// Canonicalization of binary shuffle masks to improve pattern matching by
// commuting the inputs.
if (Ops.size() == 2 && canonicalizeShuffleMaskWithCommute(Mask)) {
ShuffleVectorSDNode::commuteMask(Mask);
std::swap(Ops[0], Ops[1]);
}
return combineX86ShuffleChain(Ops, Root, Mask, Depth, HasVariableMask, DAG,
DCI, Subtarget);
}
/// \brief Get the PSHUF-style mask from PSHUF node.
///
/// This is a very minor wrapper around getTargetShuffleMask to ease forming v4
/// PSHUF-style masks that can be reused with such instructions.
static SmallVector<int, 4> getPSHUFShuffleMask(SDValue N) {
MVT VT = N.getSimpleValueType();
SmallVector<int, 4> Mask;
SmallVector<SDValue, 2> Ops;
bool IsUnary;
bool HaveMask =
getTargetShuffleMask(N.getNode(), VT, false, Ops, Mask, IsUnary);
(void)HaveMask;
assert(HaveMask);
// If we have more than 128 bits, only the low 128 bits of the shuffle mask
// matter. Check that the upper masks are repeats and remove them.
if (VT.getSizeInBits() > 128) {
int LaneElts = 128 / VT.getScalarSizeInBits();
#ifndef NDEBUG
for (int i = 1, NumLanes = VT.getSizeInBits() / 128; i < NumLanes; ++i)
for (int j = 0; j < LaneElts; ++j)
assert(Mask[j] == Mask[i * LaneElts + j] - (LaneElts * i) &&
"Mask doesn't repeat in high 128-bit lanes!");
#endif
Mask.resize(LaneElts);
}
switch (N.getOpcode()) {
case X86ISD::PSHUFD:
return Mask;
case X86ISD::PSHUFLW:
Mask.resize(4);
return Mask;
case X86ISD::PSHUFHW:
Mask.erase(Mask.begin(), Mask.begin() + 4);
for (int &M : Mask)
M -= 4;
return Mask;
default:
llvm_unreachable("No valid shuffle instruction found!");
}
}
/// \brief Search for a combinable shuffle across a chain ending in pshufd.
///
/// We walk up the chain and look for a combinable shuffle, skipping over
/// shuffles that we could hoist this shuffle's transformation past without
/// altering anything.
static SDValue
combineRedundantDWordShuffle(SDValue N, MutableArrayRef<int> Mask,
SelectionDAG &DAG) {
assert(N.getOpcode() == X86ISD::PSHUFD &&
"Called with something other than an x86 128-bit half shuffle!");
SDLoc DL(N);
// Walk up a single-use chain looking for a combinable shuffle. Keep a stack
// of the shuffles in the chain so that we can form a fresh chain to replace
// this one.
SmallVector<SDValue, 8> Chain;
SDValue V = N.getOperand(0);
for (; V.hasOneUse(); V = V.getOperand(0)) {
switch (V.getOpcode()) {
default:
return SDValue(); // Nothing combined!
case ISD::BITCAST:
// Skip bitcasts as we always know the type for the target specific
// instructions.
continue;
case X86ISD::PSHUFD:
// Found another dword shuffle.
break;
case X86ISD::PSHUFLW:
// Check that the low words (being shuffled) are the identity in the
// dword shuffle, and the high words are self-contained.
if (Mask[0] != 0 || Mask[1] != 1 ||
!(Mask[2] >= 2 && Mask[2] < 4 && Mask[3] >= 2 && Mask[3] < 4))
return SDValue();
Chain.push_back(V);
continue;
case X86ISD::PSHUFHW:
// Check that the high words (being shuffled) are the identity in the
// dword shuffle, and the low words are self-contained.
if (Mask[2] != 2 || Mask[3] != 3 ||
!(Mask[0] >= 0 && Mask[0] < 2 && Mask[1] >= 0 && Mask[1] < 2))
return SDValue();
Chain.push_back(V);
continue;
case X86ISD::UNPCKL:
case X86ISD::UNPCKH:
// For either i8 -> i16 or i16 -> i32 unpacks, we can combine a dword
// shuffle into a preceding word shuffle.
if (V.getSimpleValueType().getVectorElementType() != MVT::i8 &&
V.getSimpleValueType().getVectorElementType() != MVT::i16)
return SDValue();
// Search for a half-shuffle which we can combine with.
unsigned CombineOp =
V.getOpcode() == X86ISD::UNPCKL ? X86ISD::PSHUFLW : X86ISD::PSHUFHW;
if (V.getOperand(0) != V.getOperand(1) ||
!V->isOnlyUserOf(V.getOperand(0).getNode()))
return SDValue();
Chain.push_back(V);
V = V.getOperand(0);
do {
switch (V.getOpcode()) {
default:
return SDValue(); // Nothing to combine.
case X86ISD::PSHUFLW:
case X86ISD::PSHUFHW:
if (V.getOpcode() == CombineOp)
break;
Chain.push_back(V);
LLVM_FALLTHROUGH;
case ISD::BITCAST:
V = V.getOperand(0);
continue;
}
break;
} while (V.hasOneUse());
break;
}
// Break out of the loop if we break out of the switch.
break;
}
if (!V.hasOneUse())
// We fell out of the loop without finding a viable combining instruction.
return SDValue();
// Merge this node's mask and our incoming mask.
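// The loop below composes the two shuffles by mapping each of our mask
// entries through V's mask, e.g. merging VMask = <2,3,0,1> with
// Mask = <1,0,3,2> yields the composed mask <3,2,1,0>.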
SmallVector<int, 4> VMask = getPSHUFShuffleMask(V);
for (int &M : Mask)
M = VMask[M];
V = DAG.getNode(V.getOpcode(), DL, V.getValueType(), V.getOperand(0),
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
// Rebuild the chain around this new shuffle.
while (!Chain.empty()) {
SDValue W = Chain.pop_back_val();
if (V.getValueType() != W.getOperand(0).getValueType())
V = DAG.getBitcast(W.getOperand(0).getValueType(), V);
switch (W.getOpcode()) {
default:
llvm_unreachable("Only PSHUF and UNPCK instructions get here!");
case X86ISD::UNPCKL:
case X86ISD::UNPCKH:
V = DAG.getNode(W.getOpcode(), DL, W.getValueType(), V, V);
break;
case X86ISD::PSHUFD:
case X86ISD::PSHUFLW:
case X86ISD::PSHUFHW:
V = DAG.getNode(W.getOpcode(), DL, W.getValueType(), V, W.getOperand(1));
break;
}
}
if (V.getValueType() != N.getValueType())
V = DAG.getBitcast(N.getValueType(), V);
// Return the new chain to replace N.
return V;
}
/// \brief Search for a combinable shuffle across a chain ending in pshuflw or
/// pshufhw.
///
/// We walk up the chain, skipping shuffles of the other half and looking
/// through shuffles which switch halves trying to find a shuffle of the same
/// pair of dwords.
static bool combineRedundantHalfShuffle(SDValue N, MutableArrayRef<int> Mask,
SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
assert(
(N.getOpcode() == X86ISD::PSHUFLW || N.getOpcode() == X86ISD::PSHUFHW) &&
"Called with something other than an x86 128-bit half shuffle!");
SDLoc DL(N);
unsigned CombineOpcode = N.getOpcode();
// Walk up a single-use chain looking for a combinable shuffle.
SDValue V = N.getOperand(0);
for (; V.hasOneUse(); V = V.getOperand(0)) {
switch (V.getOpcode()) {
default:
return false; // Nothing combined!
case ISD::BITCAST:
// Skip bitcasts as we always know the type for the target specific
// instructions.
continue;
case X86ISD::PSHUFLW:
case X86ISD::PSHUFHW:
if (V.getOpcode() == CombineOpcode)
break;
// Other-half shuffles are no-ops.
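// (A PSHUFLW leaves the high four words untouched and a PSHUFHW the low
// four, so a shuffle of the opposite half commutes with this one.)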
continue;
}
// Break out of the loop if we break out of the switch.
break;
}
if (!V.hasOneUse())
// We fell out of the loop without finding a viable combining instruction.
return false;
// Combine away the bottom node as its shuffle will be accumulated into
// a preceding shuffle.
DCI.CombineTo(N.getNode(), N.getOperand(0), /*AddTo*/ true);
// Record the old value.
SDValue Old = V;
// Merge this node's mask and our incoming mask (adjusted to account for all
// the pshufd instructions encountered).
SmallVector<int, 4> VMask = getPSHUFShuffleMask(V);
for (int &M : Mask)
M = VMask[M];
V = DAG.getNode(V.getOpcode(), DL, MVT::v8i16, V.getOperand(0),
getV4X86ShuffleImm8ForMask(Mask, DL, DAG));
// Check that the shuffles didn't cancel each other out. If not, we need to
// combine to the new one.
if (Old != V)
// Replace the combinable shuffle with the combined one, updating all users
// so that we re-evaluate the chain here.
DCI.CombineTo(Old.getNode(), V, /*AddTo*/ true);
return true;
}
/// \brief Try to combine x86 target specific shuffles.
static SDValue combineTargetShuffle(SDValue N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
MVT VT = N.getSimpleValueType();
SmallVector<int, 4> Mask;
unsigned Opcode = N.getOpcode();
switch (Opcode) {
case X86ISD::PSHUFD:
case X86ISD::PSHUFLW:
case X86ISD::PSHUFHW:
Mask = getPSHUFShuffleMask(N);
assert(Mask.size() == 4);
break;
case X86ISD::UNPCKL: {
auto Op0 = N.getOperand(0);
auto Op1 = N.getOperand(1);
unsigned Opcode0 = Op0.getOpcode();
unsigned Opcode1 = Op1.getOpcode();
// Combine X86ISD::UNPCKL with 2 X86ISD::FHADD inputs into a single
// X86ISD::FHADD. This is generated by UINT_TO_FP v2f64 scalarization.
// TODO: Add other horizontal operations as required.
if (VT == MVT::v2f64 && Opcode0 == Opcode1 && Opcode0 == X86ISD::FHADD)
return DAG.getNode(Opcode0, DL, VT, Op0.getOperand(0), Op1.getOperand(0));
// Combine X86ISD::UNPCKL and ISD::VECTOR_SHUFFLE into X86ISD::UNPCKH, in
// which X86ISD::UNPCKL has a ISD::UNDEF operand, and ISD::VECTOR_SHUFFLE
// moves upper half elements into the lower half part. For example:
//
// t2: v16i8 = vector_shuffle<8,9,10,11,12,13,14,15,u,u,u,u,u,u,u,u> t1,
// undef:v16i8
// t3: v16i8 = X86ISD::UNPCKL undef:v16i8, t2
//
// will be combined to:
//
// t3: v16i8 = X86ISD::UNPCKH undef:v16i8, t1
// This is only done for 128-bit vectors. From SSE4.1 onward this combine may
// not arise because more capable shuffle instructions are selected instead.
if (!VT.is128BitVector())
return SDValue();
if (Op0.isUndef() && Opcode1 == ISD::VECTOR_SHUFFLE) {
ArrayRef<int> Mask = cast<ShuffleVectorSDNode>(Op1.getNode())->getMask();
unsigned NumElts = VT.getVectorNumElements();
SmallVector<int, 8> ExpectedMask(NumElts, -1);
std::iota(ExpectedMask.begin(), ExpectedMask.begin() + NumElts / 2,
NumElts / 2);
auto ShufOp = Op1.getOperand(0);
if (isShuffleEquivalent(Op1, ShufOp, Mask, ExpectedMask))
return DAG.getNode(X86ISD::UNPCKH, DL, VT, N.getOperand(0), ShufOp);
}
return SDValue();
}
case X86ISD::BLENDI: {
SDValue V0 = N->getOperand(0);
SDValue V1 = N->getOperand(1);
assert(VT == V0.getSimpleValueType() && VT == V1.getSimpleValueType() &&
"Unexpected input vector types");
// Canonicalize a v2f64 blend with a mask of 2 by swapping the vector
// operands and changing the mask to 1. This saves us a bunch of
// pattern-matching possibilities related to scalar math ops in SSE/AVX.
// x86InstrInfo knows how to commute this back after instruction selection
// if it would help register allocation.
// TODO: If optimizing for size or a processor that doesn't suffer from
// partial register update stalls, this should be transformed into a MOVSD
// instruction because a MOVSD is 1-2 bytes smaller than a BLENDPD.
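// A v2f64 blend mask of 2 (0b10) selects <V0[0],V1[1]>; with the operands
// swapped, mask 1 (0b01) selects exactly the same elements.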
if (VT == MVT::v2f64)
if (auto *Mask = dyn_cast<ConstantSDNode>(N->getOperand(2)))
if (Mask->getZExtValue() == 2 && !isShuffleFoldableLoad(V0)) {
SDValue NewMask = DAG.getConstant(1, DL, MVT::i8);
return DAG.getNode(X86ISD::BLENDI, DL, VT, V1, V0, NewMask);
}
return SDValue();
}
case X86ISD::MOVSD:
case X86ISD::MOVSS: {
SDValue V0 = peekThroughBitcasts(N->getOperand(0));
SDValue V1 = peekThroughBitcasts(N->getOperand(1));
bool isZero0 = ISD::isBuildVectorAllZeros(V0.getNode());
bool isZero1 = ISD::isBuildVectorAllZeros(V1.getNode());
if (isZero0 && isZero1)
return SDValue();
// We often lower to MOVSD/MOVSS from integer as well as native float
// types; remove unnecessary domain-crossing bitcasts if we can to make it
// easier to combine shuffles later on. We've already accounted for the
// domain switching cost when we decided to lower with it.
bool isFloat = VT.isFloatingPoint();
bool isFloat0 = V0.getSimpleValueType().isFloatingPoint();
bool isFloat1 = V1.getSimpleValueType().isFloatingPoint();
if ((isFloat != isFloat0 || isZero0) && (isFloat != isFloat1 || isZero1)) {
MVT NewVT = isFloat ? (X86ISD::MOVSD == Opcode ? MVT::v2i64 : MVT::v4i32)
: (X86ISD::MOVSD == Opcode ? MVT::v2f64 : MVT::v4f32);
V0 = DAG.getBitcast(NewVT, V0);
V1 = DAG.getBitcast(NewVT, V1);
return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, NewVT, V0, V1));
}
return SDValue();
}
case X86ISD::INSERTPS: {
assert(VT == MVT::v4f32 && "INSERTPS ValueType must be MVT::v4f32");
SDValue Op0 = N.getOperand(0);
SDValue Op1 = N.getOperand(1);
SDValue Op2 = N.getOperand(2);
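// The INSERTPS immediate encodes the source element in bits [7:6], the
// destination element in bits [5:4], and in bits [3:0] a mask of result
// elements that are forced to zero.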
unsigned InsertPSMask = cast<ConstantSDNode>(Op2)->getZExtValue();
unsigned SrcIdx = (InsertPSMask >> 6) & 0x3;
unsigned DstIdx = (InsertPSMask >> 4) & 0x3;
unsigned ZeroMask = InsertPSMask & 0xF;
// If we zero out all elements from Op0 then we don't need to reference it.
if (((ZeroMask | (1u << DstIdx)) == 0xF) && !Op0.isUndef())
return DAG.getNode(X86ISD::INSERTPS, DL, VT, DAG.getUNDEF(VT), Op1,
DAG.getConstant(InsertPSMask, DL, MVT::i8));
// If we zero out the element from Op1 then we don't need to reference it.
if ((ZeroMask & (1u << DstIdx)) && !Op1.isUndef())
return DAG.getNode(X86ISD::INSERTPS, DL, VT, Op0, DAG.getUNDEF(VT),
DAG.getConstant(InsertPSMask, DL, MVT::i8));
// Attempt to merge insertps Op1 with an inner target shuffle node.
SmallVector<int, 8> TargetMask1;
SmallVector<SDValue, 2> Ops1;
if (setTargetShuffleZeroElements(Op1, TargetMask1, Ops1)) {
int M = TargetMask1[SrcIdx];
if (isUndefOrZero(M)) {
// Zero/UNDEF insertion - zero out element and remove dependency.
InsertPSMask |= (1u << DstIdx);
return DAG.getNode(X86ISD::INSERTPS, DL, VT, Op0, DAG.getUNDEF(VT),
DAG.getConstant(InsertPSMask, DL, MVT::i8));
}
// Update insertps mask srcidx and reference the source input directly.
assert(0 <= M && M < 8 && "Shuffle index out of range");
InsertPSMask = (InsertPSMask & 0x3f) | ((M & 0x3) << 6);
Op1 = Ops1[M < 4 ? 0 : 1];
return DAG.getNode(X86ISD::INSERTPS, DL, VT, Op0, Op1,
DAG.getConstant(InsertPSMask, DL, MVT::i8));
}
// Attempt to merge insertps Op0 with an inner target shuffle node.
SmallVector<int, 8> TargetMask0;
SmallVector<SDValue, 2> Ops0;
if (!setTargetShuffleZeroElements(Op0, TargetMask0, Ops0))
return SDValue();
bool Updated = false;
bool UseInput00 = false;
bool UseInput01 = false;
for (int i = 0; i != 4; ++i) {
int M = TargetMask0[i];
if ((InsertPSMask & (1u << i)) || (i == (int)DstIdx)) {
// No change if element is already zero or the inserted element.
continue;
} else if (isUndefOrZero(M)) {
// If the target mask is undef/zero then we must zero the element.
InsertPSMask |= (1u << i);
Updated = true;
continue;
}
// The input vector element must be inline.
if (M != i && M != (i + 4))
return SDValue();
// Determine which inputs of the target shuffle we're using.
UseInput00 |= (0 <= M && M < 4);
UseInput01 |= (4 <= M);
}
// If we're not using both inputs of the target shuffle then use the
// referenced input directly.
if (UseInput00 && !UseInput01) {
Updated = true;
Op0 = Ops0[0];
} else if (!UseInput00 && UseInput01) {
Updated = true;
Op0 = Ops0[1];
}
if (Updated)
return DAG.getNode(X86ISD::INSERTPS, DL, VT, Op0, Op1,
DAG.getConstant(InsertPSMask, DL, MVT::i8));
return SDValue();
}
default:
return SDValue();
}
// Nuke no-op shuffles that show up after combining.
if (isNoopShuffleMask(Mask))
return DCI.CombineTo(N.getNode(), N.getOperand(0), /*AddTo*/ true);
// Look for simplifications involving one or two shuffle instructions.
SDValue V = N.getOperand(0);
switch (N.getOpcode()) {
default:
break;
case X86ISD::PSHUFLW:
case X86ISD::PSHUFHW:
assert(VT.getVectorElementType() == MVT::i16 && "Bad word shuffle type!");
if (combineRedundantHalfShuffle(N, Mask, DAG, DCI))
return SDValue(); // We combined away this shuffle, so we're done.
// See if this reduces to a PSHUFD which is no more expensive and can
// combine with more operations. Note that it has to at least flip the
// dwords as otherwise it would have been removed as a no-op.
if (makeArrayRef(Mask).equals({2, 3, 0, 1})) {
int DMask[] = {0, 1, 2, 3};
int DOffset = N.getOpcode() == X86ISD::PSHUFLW ? 0 : 2;
DMask[DOffset + 0] = DOffset + 1;
DMask[DOffset + 1] = DOffset + 0;
MVT DVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() / 2);
V = DAG.getBitcast(DVT, V);
DCI.AddToWorklist(V.getNode());
V = DAG.getNode(X86ISD::PSHUFD, DL, DVT, V,
getV4X86ShuffleImm8ForMask(DMask, DL, DAG));
DCI.AddToWorklist(V.getNode());
return DAG.getBitcast(VT, V);
}
// Look for shuffle patterns which can be implemented as a single unpack.
// FIXME: This doesn't handle the location of the PSHUFD generically, and
// only works when we have a PSHUFD followed by two half-shuffles.
if (Mask[0] == Mask[1] && Mask[2] == Mask[3] &&
(V.getOpcode() == X86ISD::PSHUFLW ||
V.getOpcode() == X86ISD::PSHUFHW) &&
V.getOpcode() != N.getOpcode() &&
V.hasOneUse()) {
SDValue D = peekThroughOneUseBitcasts(V.getOperand(0));
if (D.getOpcode() == X86ISD::PSHUFD && D.hasOneUse()) {
SmallVector<int, 4> VMask = getPSHUFShuffleMask(V);
SmallVector<int, 4> DMask = getPSHUFShuffleMask(D);
int NOffset = N.getOpcode() == X86ISD::PSHUFLW ? 0 : 4;
int VOffset = V.getOpcode() == X86ISD::PSHUFLW ? 0 : 4;
int WordMask[8];
for (int i = 0; i < 4; ++i) {
WordMask[i + NOffset] = Mask[i] + NOffset;
WordMask[i + VOffset] = VMask[i] + VOffset;
}
// Map the word mask through the DWord mask.
int MappedMask[8];
for (int i = 0; i < 8; ++i)
MappedMask[i] = 2 * DMask[WordMask[i] / 2] + WordMask[i] % 2;
if (makeArrayRef(MappedMask).equals({0, 0, 1, 1, 2, 2, 3, 3}) ||
makeArrayRef(MappedMask).equals({4, 4, 5, 5, 6, 6, 7, 7})) {
// We can replace all three shuffles with an unpack.
V = DAG.getBitcast(VT, D.getOperand(0));
DCI.AddToWorklist(V.getNode());
return DAG.getNode(MappedMask[0] == 0 ? X86ISD::UNPCKL
: X86ISD::UNPCKH,
DL, VT, V, V);
}
}
}
break;
case X86ISD::PSHUFD:
if (SDValue NewN = combineRedundantDWordShuffle(N, Mask, DAG))
return NewN;
break;
}
return SDValue();
}
/// Returns true iff the shuffle node \p N can be replaced with ADDSUB
/// operation. If true is returned then the operands of ADDSUB operation
/// are written to the parameters \p Opnd0 and \p Opnd1.
///
/// We combine shuffle to ADDSUB directly on the abstract vector shuffle nodes
/// so it is easier to generically match. We also insert dummy vector shuffle
/// nodes for the operands which explicitly discard the lanes unused by this
/// operation, so that the fact that they're unused can flow through the rest
/// of the combiner.
static bool isAddSub(SDNode *N, const X86Subtarget &Subtarget,
SDValue &Opnd0, SDValue &Opnd1) {
EVT VT = N->getValueType(0);
if ((!Subtarget.hasSSE3() || (VT != MVT::v4f32 && VT != MVT::v2f64)) &&
(!Subtarget.hasAVX() || (VT != MVT::v8f32 && VT != MVT::v4f64)) &&
(!Subtarget.hasAVX512() || (VT != MVT::v16f32 && VT != MVT::v8f64)))
return false;
// We only handle target-independent shuffles.
// FIXME: It would be easy and harmless to use the target shuffle mask
// extraction tool to support more.
if (N->getOpcode() != ISD::VECTOR_SHUFFLE)
return false;
ArrayRef<int> OrigMask = cast<ShuffleVectorSDNode>(N)->getMask();
SmallVector<int, 16> Mask(OrigMask.begin(), OrigMask.end());
SDValue V1 = N->getOperand(0);
SDValue V2 = N->getOperand(1);
// We require the first shuffle operand to be the FSUB node, and the second to
// be the FADD node.
if (V1.getOpcode() == ISD::FADD && V2.getOpcode() == ISD::FSUB) {
ShuffleVectorSDNode::commuteMask(Mask);
std::swap(V1, V2);
} else if (V1.getOpcode() != ISD::FSUB || V2.getOpcode() != ISD::FADD)
return false;
// If there are other uses of these operations we can't fold them.
if (!V1->hasOneUse() || !V2->hasOneUse())
return false;
// Ensure that both operations have the same operands. Note that we can
// commute the FADD operands.
SDValue LHS = V1->getOperand(0), RHS = V1->getOperand(1);
if ((V2->getOperand(0) != LHS || V2->getOperand(1) != RHS) &&
(V2->getOperand(0) != RHS || V2->getOperand(1) != LHS))
return false;
// We're looking for blends between FADD and FSUB nodes. We insist on these
// nodes being lined up in a specific expected pattern.
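// ADDSUBPS/PD subtracts in the even lanes and adds in the odd lanes, so the
// masks below take the even elements from the FSUB node and the odd elements
// from the FADD node, e.g. <0,5,2,7> for v4f32.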
if (!(isShuffleEquivalent(V1, V2, Mask, {0, 3}) ||
isShuffleEquivalent(V1, V2, Mask, {0, 5, 2, 7}) ||
isShuffleEquivalent(V1, V2, Mask, {0, 9, 2, 11, 4, 13, 6, 15}) ||
isShuffleEquivalent(V1, V2, Mask, {0, 17, 2, 19, 4, 21, 6, 23,
8, 25, 10, 27, 12, 29, 14, 31})))
return false;
Opnd0 = LHS;
Opnd1 = RHS;
return true;
}
/// \brief Try to combine a shuffle into a target-specific add-sub or
/// mul-add-sub node.
static SDValue combineShuffleToAddSubOrFMAddSub(SDNode *N,
const X86Subtarget &Subtarget,
SelectionDAG &DAG) {
SDValue Opnd0, Opnd1;
if (!isAddSub(N, Subtarget, Opnd0, Opnd1))
return SDValue();
EVT VT = N->getValueType(0);
SDLoc DL(N);
// Try to generate X86ISD::FMADDSUB node here.
SDValue Opnd2;
if (isFMAddSub(Subtarget, DAG, Opnd0, Opnd1, Opnd2))
return DAG.getNode(X86ISD::FMADDSUB, DL, VT, Opnd0, Opnd1, Opnd2);
// Do not generate X86ISD::ADDSUB node for 512-bit types even though
// the ADDSUB idiom has been successfully recognized. There are no known
// X86 targets with 512-bit ADDSUB instructions!
if (VT.is512BitVector())
return SDValue();
return DAG.getNode(X86ISD::ADDSUB, DL, VT, Opnd0, Opnd1);
}
// We are looking for a shuffle where both sources are concatenated with undef
// and have a width that is half of the output's width. AVX2 has VPERMD/Q, so
// if we can express this as a single-source shuffle, that's preferable.
static SDValue combineShuffleOfConcatUndef(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (!Subtarget.hasAVX2() || !isa<ShuffleVectorSDNode>(N))
return SDValue();
EVT VT = N->getValueType(0);
// We only care about shuffles of 128/256-bit vectors of 32/64-bit values.
if (!VT.is128BitVector() && !VT.is256BitVector())
return SDValue();
if (VT.getVectorElementType() != MVT::i32 &&
VT.getVectorElementType() != MVT::i64 &&
VT.getVectorElementType() != MVT::f32 &&
VT.getVectorElementType() != MVT::f64)
return SDValue();
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
// Check that both sources are concats with undef.
if (N0.getOpcode() != ISD::CONCAT_VECTORS ||
N1.getOpcode() != ISD::CONCAT_VECTORS || N0.getNumOperands() != 2 ||
N1.getNumOperands() != 2 || !N0.getOperand(1).isUndef() ||
!N1.getOperand(1).isUndef())
return SDValue();
// Construct the new shuffle mask. Elements from the first source retain their
// index, but elements from the second source no longer need to skip an undef.
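// e.g. for v8i32, shuffle index 8 (the first element of the second concat)
// becomes 8 - 8/2 = 4, the position of t2's first element in concat(t1, t2).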
SmallVector<int, 8> Mask;
int NumElts = VT.getVectorNumElements();
ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(N);
for (int Elt : SVOp->getMask())
Mask.push_back(Elt < NumElts ? Elt : (Elt - NumElts / 2));
SDLoc DL(N);
SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, N0.getOperand(0),
N1.getOperand(0));
return DAG.getVectorShuffle(VT, DL, Concat, DAG.getUNDEF(VT), Mask);
}
static SDValue combineShuffle(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDLoc dl(N);
EVT VT = N->getValueType(0);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// If we have legalized the vector types, look for blends of FADD and FSUB
// nodes that we can fuse into an ADDSUB node.
if (TLI.isTypeLegal(VT))
if (SDValue AddSub = combineShuffleToAddSubOrFMAddSub(N, Subtarget, DAG))
return AddSub;
// During Type Legalization, when promoting illegal vector types,
// the backend might introduce new shuffle dag nodes and bitcasts.
//
// This code performs the following transformation:
// fold: (shuffle (bitcast (BINOP A, B)), Undef, <Mask>) ->
// (shuffle (BINOP (bitcast A), (bitcast B)), Undef, <Mask>)
//
// We do this only if both the bitcast and the BINOP dag nodes have
// one use. Also, perform this transformation only if the new binary
// operation is legal. This is to avoid introducing dag nodes that
// potentially need to be further expanded (or custom lowered) into a
// less optimal sequence of dag nodes.
if (!DCI.isBeforeLegalize() && DCI.isBeforeLegalizeOps() &&
N->getOpcode() == ISD::VECTOR_SHUFFLE &&
N->getOperand(0).getOpcode() == ISD::BITCAST &&
N->getOperand(1).isUndef() && N->getOperand(0).hasOneUse()) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDValue BC0 = N0.getOperand(0);
EVT SVT = BC0.getValueType();
unsigned Opcode = BC0.getOpcode();
unsigned NumElts = VT.getVectorNumElements();
if (BC0.hasOneUse() && SVT.isVector() &&
SVT.getVectorNumElements() * 2 == NumElts &&
TLI.isOperationLegal(Opcode, VT)) {
bool CanFold = false;
switch (Opcode) {
default : break;
case ISD::ADD:
case ISD::SUB:
case ISD::MUL:
// isOperationLegal lies for integer ops on floating point types.
CanFold = VT.isInteger();
break;
case ISD::FADD:
case ISD::FSUB:
case ISD::FMUL:
// isOperationLegal lies for floating point ops on integer types.
CanFold = VT.isFloatingPoint();
break;
}
unsigned SVTNumElts = SVT.getVectorNumElements();
ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(N);
for (unsigned i = 0, e = SVTNumElts; i != e && CanFold; ++i)
CanFold = SVOp->getMaskElt(i) == (int)(i * 2);
for (unsigned i = SVTNumElts, e = NumElts; i != e && CanFold; ++i)
CanFold = SVOp->getMaskElt(i) < 0;
if (CanFold) {
SDValue BC00 = DAG.getBitcast(VT, BC0.getOperand(0));
SDValue BC01 = DAG.getBitcast(VT, BC0.getOperand(1));
SDValue NewBinOp = DAG.getNode(BC0.getOpcode(), dl, VT, BC00, BC01);
return DAG.getVectorShuffle(VT, dl, NewBinOp, N1, SVOp->getMask());
}
}
}
// Combine a vector_shuffle that is equal to build_vector load1, load2, load3,
// load4, <0, 1, 2, 3> into a 128-bit load if the load addresses are
// consecutive, non-overlapping, and in the right order.
SmallVector<SDValue, 16> Elts;
for (unsigned i = 0, e = VT.getVectorNumElements(); i != e; ++i) {
if (SDValue Elt = getShuffleScalarElt(N, i, DAG, 0)) {
Elts.push_back(Elt);
continue;
}
Elts.clear();
break;
}
if (Elts.size() == VT.getVectorNumElements())
if (SDValue LD =
EltsFromConsecutiveLoads(VT, Elts, dl, DAG, Subtarget, true))
return LD;
// For AVX2, we sometimes want to combine
// (vector_shuffle <mask> (concat_vectors t1, undef)
// (concat_vectors t2, undef))
// Into:
// (vector_shuffle <mask> (concat_vectors t1, t2), undef)
// Since the latter can be efficiently lowered with VPERMD/VPERMQ
if (SDValue ShufConcat = combineShuffleOfConcatUndef(N, DAG, Subtarget))
return ShufConcat;
if (isTargetShuffle(N->getOpcode())) {
SDValue Op(N, 0);
if (SDValue Shuffle = combineTargetShuffle(Op, DAG, DCI, Subtarget))
return Shuffle;
// Try recursively combining arbitrary sequences of x86 shuffle
// instructions into higher-order shuffles. We do this after combining
// specific PSHUF instruction sequences into their minimal form so that we
// can evaluate how many specialized shuffle instructions are involved in
// a particular chain.
SmallVector<int, 1> NonceMask; // Just a placeholder.
NonceMask.push_back(0);
if (combineX86ShufflesRecursively({Op}, 0, Op, NonceMask, {},
/*Depth*/ 1, /*HasVarMask*/ false, DAG,
DCI, Subtarget))
return SDValue(); // This routine will use CombineTo to replace N.
}
return SDValue();
}
/// Check if a vector extract from a target-specific shuffle of a load can be
/// folded into a single element load.
/// Similar handling for VECTOR_SHUFFLE is performed by DAGCombiner, but
/// shuffles have been custom lowered so we need to handle those here.
static SDValue XFormVExtractWithShuffleIntoLoad(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
if (DCI.isBeforeLegalizeOps())
return SDValue();
SDValue InVec = N->getOperand(0);
SDValue EltNo = N->getOperand(1);
EVT EltVT = N->getValueType(0);
if (!isa<ConstantSDNode>(EltNo))
return SDValue();
EVT OriginalVT = InVec.getValueType();
// Peek through bitcasts, don't duplicate a load with other uses.
InVec = peekThroughOneUseBitcasts(InVec);
EVT CurrentVT = InVec.getValueType();
if (!CurrentVT.isVector() ||
CurrentVT.getVectorNumElements() != OriginalVT.getVectorNumElements())
return SDValue();
if (!isTargetShuffle(InVec.getOpcode()))
return SDValue();
// Don't duplicate a load with other uses.
if (!InVec.hasOneUse())
return SDValue();
SmallVector<int, 16> ShuffleMask;
SmallVector<SDValue, 2> ShuffleOps;
bool UnaryShuffle;
if (!getTargetShuffleMask(InVec.getNode(), CurrentVT.getSimpleVT(), true,
ShuffleOps, ShuffleMask, UnaryShuffle))
return SDValue();
// Select the input vector, guarding against an out-of-range extract index.
unsigned NumElems = CurrentVT.getVectorNumElements();
int Elt = cast<ConstantSDNode>(EltNo)->getZExtValue();
int Idx = (Elt > (int)NumElems) ? SM_SentinelUndef : ShuffleMask[Elt];
if (Idx == SM_SentinelZero)
return EltVT.isInteger() ? DAG.getConstant(0, SDLoc(N), EltVT)
: DAG.getConstantFP(+0.0, SDLoc(N), EltVT);
if (Idx == SM_SentinelUndef)
return DAG.getUNDEF(EltVT);
assert(0 <= Idx && Idx < (int)(2 * NumElems) && "Shuffle index out of range");
SDValue LdNode = (Idx < (int)NumElems) ? ShuffleOps[0]
: ShuffleOps[1];
// If inputs to shuffle are the same for both ops, then allow 2 uses
unsigned AllowedUses =
(ShuffleOps.size() > 1 && ShuffleOps[0] == ShuffleOps[1]) ? 2 : 1;
if (LdNode.getOpcode() == ISD::BITCAST) {
// Don't duplicate a load with other uses.
if (!LdNode.getNode()->hasNUsesOfValue(AllowedUses, 0))
return SDValue();
AllowedUses = 1; // only allow 1 load use if we have a bitcast
LdNode = LdNode.getOperand(0);
}
if (!ISD::isNormalLoad(LdNode.getNode()))
return SDValue();
LoadSDNode *LN0 = cast<LoadSDNode>(LdNode);
if (!LN0 || !LN0->hasNUsesOfValue(AllowedUses, 0) || LN0->isVolatile())
return SDValue();
// If there's a bitcast before the shuffle, check if the load type and
// alignment are valid.
unsigned Align = LN0->getAlignment();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
unsigned NewAlign = DAG.getDataLayout().getABITypeAlignment(
EltVT.getTypeForEVT(*DAG.getContext()));
if (NewAlign > Align || !TLI.isOperationLegalOrCustom(ISD::LOAD, EltVT))
return SDValue();
// All checks match, so transform back to a vector_shuffle so that the DAG
// combiner can finish the job.
SDLoc dl(N);
// Create the shuffle node, taking into account the case that it's a unary shuffle.
SDValue Shuffle = (UnaryShuffle) ? DAG.getUNDEF(CurrentVT) : ShuffleOps[1];
Shuffle = DAG.getVectorShuffle(CurrentVT, dl, ShuffleOps[0], Shuffle,
ShuffleMask);
Shuffle = DAG.getBitcast(OriginalVT, Shuffle);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, N->getValueType(0), Shuffle,
EltNo);
}
// Try to match patterns such as
// (i16 bitcast (v16i1 x))
// ->
// (i16 movmsk (16i8 sext (v16i1 x)))
// before the illegal vector is scalarized on subtargets that don't have legal
// vxi1 types.
static SDValue combineBitcastvxi1(SelectionDAG &DAG, SDValue BitCast,
const X86Subtarget &Subtarget) {
EVT VT = BitCast.getValueType();
SDValue N0 = BitCast.getOperand(0);
EVT VecVT = N0->getValueType(0);
if (!VT.isScalarInteger() || !VecVT.isSimple())
return SDValue();
// With AVX512 vxi1 types are legal and we prefer using k-regs.
// MOVMSK is supported in SSE2 or later.
if (Subtarget.hasAVX512() || !Subtarget.hasSSE2())
return SDValue();
// There are MOVMSK flavors for types v16i8, v32i8, v4f32, v8f32, v4f64 and
// v8f64. So all legal 128-bit and 256-bit vectors are covered except for
// v8i16 and v16i16.
// For these two cases, we can shuffle the upper element bytes to a
// consecutive sequence at the start of the vector and treat the results as
// v16i8 or v32i8, and for v16i8 this is the preferable solution. However,
// for v16i16 this is not the case, because the shuffle is expensive, so we
// avoid sign-extending to this type entirely.
// For example, t0 := (v8i16 sext(v8i1 x)) needs to be shuffled as:
// (v16i8 shuffle <0,2,4,6,8,10,12,14,u,u,...,u> (v16i8 bitcast t0), undef)
MVT SExtVT;
MVT FPCastVT = MVT::INVALID_SIMPLE_VALUE_TYPE;
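// SExtVT is the type we sign-extend the mask to so that each element becomes
// all-ones or all-zeros; FPCastVT, when set, is the floating-point type we
// bitcast to in order to use the MOVMSKPS/MOVMSKPD flavors of MOVMSK.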
switch (VecVT.getSimpleVT().SimpleTy) {
default:
return SDValue();
case MVT::v2i1:
SExtVT = MVT::v2i64;
FPCastVT = MVT::v2f64;
break;
case MVT::v4i1:
SExtVT = MVT::v4i32;
FPCastVT = MVT::v4f32;
// For cases such as (i4 bitcast (v4i1 setcc v4i64 v1, v2))
// sign-extend to a 256-bit operation to avoid truncation.
if (N0->getOpcode() == ISD::SETCC &&
N0->getOperand(0)->getValueType(0).is256BitVector() &&
Subtarget.hasInt256()) {
SExtVT = MVT::v4i64;
FPCastVT = MVT::v4f64;
}
break;
case MVT::v8i1:
SExtVT = MVT::v8i16;
// For cases such as (i8 bitcast (v8i1 setcc v8i32 v1, v2)),
// sign-extend to a 256-bit operation to match the compare.
// If the setcc operand is 128-bit, prefer sign-extending to 128-bit over
// 256-bit because the shuffle is cheaper than sign extending the result of
// the compare.
if (N0->getOpcode() == ISD::SETCC &&
N0->getOperand(0)->getValueType(0).is256BitVector() &&
Subtarget.hasInt256()) {
SExtVT = MVT::v8i32;
FPCastVT = MVT::v8f32;
}
break;
case MVT::v16i1:
SExtVT = MVT::v16i8;
// For the case (i16 bitcast (v16i1 setcc v16i16 v1, v2)),
// it is not profitable to sign-extend to 256-bit because this will
// require an extra cross-lane shuffle which is more expensive than
// truncating the result of the compare to 128-bits.
break;
case MVT::v32i1:
// TODO: Handle pre-AVX2 cases by splitting to two v16i1's.
if (!Subtarget.hasInt256())
return SDValue();
SExtVT = MVT::v32i8;
break;
}
SDLoc DL(BitCast);
SDValue V = DAG.getSExtOrTrunc(N0, DL, SExtVT);
if (SExtVT == MVT::v8i16) {
V = DAG.getBitcast(MVT::v16i8, V);
V = DAG.getVectorShuffle(
MVT::v16i8, DL, V, DAG.getUNDEF(MVT::v16i8),
{0, 2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1});
} else
assert(SExtVT.getScalarType() != MVT::i16 &&
"Vectors of i16 must be shuffled");
if (FPCastVT != MVT::INVALID_SIMPLE_VALUE_TYPE)
V = DAG.getBitcast(FPCastVT, V);
V = DAG.getNode(X86ISD::MOVMSK, DL, MVT::i32, V);
return DAG.getZExtOrTrunc(V, DL, VT);
}
static SDValue combineBitcast(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDValue N0 = N->getOperand(0);
EVT VT = N->getValueType(0);
EVT SrcVT = N0.getValueType();
// Try to match patterns such as
// (i16 bitcast (v16i1 x))
// ->
// (i16 movmsk (16i8 sext (v16i1 x)))
// before the setcc result is scalarized on subtargets that don't have legal
// vxi1 types.
if (DCI.isBeforeLegalize())
if (SDValue V = combineBitcastvxi1(DAG, SDValue(N, 0), Subtarget))
return V;
// Since MMX types are special and don't usually play with other vector types,
// it's better to handle them early to be sure we emit efficient code by
// avoiding store-load conversions.
// Detect bitcasts from i32 to the x86mmx low word.
if (VT == MVT::x86mmx && N0.getOpcode() == ISD::BUILD_VECTOR &&
SrcVT == MVT::v2i32 && isNullConstant(N0.getOperand(1))) {
SDValue N00 = N0->getOperand(0);
if (N00.getValueType() == MVT::i32)
return DAG.getNode(X86ISD::MMX_MOVW2D, SDLoc(N00), VT, N00);
}
// Detect bitcasts from element or subvector extractions to x86mmx.
if (VT == MVT::x86mmx &&
(N0.getOpcode() == ISD::EXTRACT_VECTOR_ELT ||
N0.getOpcode() == ISD::EXTRACT_SUBVECTOR) &&
isNullConstant(N0.getOperand(1))) {
SDValue N00 = N0->getOperand(0);
if (N00.getValueType().is128BitVector())
return DAG.getNode(X86ISD::MOVDQ2Q, SDLoc(N00), VT,
DAG.getBitcast(MVT::v2i64, N00));
}
// Detect bitcasts from FP_TO_SINT to x86mmx.
if (VT == MVT::x86mmx && SrcVT == MVT::v2i32 &&
N0.getOpcode() == ISD::FP_TO_SINT) {
SDLoc DL(N0);
SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v4i32, N0,
DAG.getUNDEF(MVT::v2i32));
return DAG.getNode(X86ISD::MOVDQ2Q, DL, VT,
DAG.getBitcast(MVT::v2i64, Res));
}
// Convert a bitcasted integer logic operation that has one bitcasted
// floating-point operand into a floating-point logic operation. This may
// create a load of a constant, but that is cheaper than materializing the
// constant in an integer register and transferring it to an SSE register or
// transferring the SSE operand to integer register and back.
unsigned FPOpcode;
switch (N0.getOpcode()) {
case ISD::AND: FPOpcode = X86ISD::FAND; break;
case ISD::OR: FPOpcode = X86ISD::FOR; break;
case ISD::XOR: FPOpcode = X86ISD::FXOR; break;
default: return SDValue();
}
if (!((Subtarget.hasSSE1() && VT == MVT::f32) ||
(Subtarget.hasSSE2() && VT == MVT::f64)))
return SDValue();
SDValue LogicOp0 = N0.getOperand(0);
SDValue LogicOp1 = N0.getOperand(1);
SDLoc DL0(N0);
// bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y))
if (N0.hasOneUse() && LogicOp0.getOpcode() == ISD::BITCAST &&
LogicOp0.hasOneUse() && LogicOp0.getOperand(0).getValueType() == VT &&
!isa<ConstantSDNode>(LogicOp0.getOperand(0))) {
SDValue CastedOp1 = DAG.getBitcast(VT, LogicOp1);
return DAG.getNode(FPOpcode, DL0, VT, LogicOp0.getOperand(0), CastedOp1);
}
// bitcast(logic(X, bitcast(Y))) --> logic'(bitcast(X), Y)
if (N0.hasOneUse() && LogicOp1.getOpcode() == ISD::BITCAST &&
LogicOp1.hasOneUse() && LogicOp1.getOperand(0).getValueType() == VT &&
!isa<ConstantSDNode>(LogicOp1.getOperand(0))) {
SDValue CastedOp0 = DAG.getBitcast(VT, LogicOp0);
return DAG.getNode(FPOpcode, DL0, VT, LogicOp1.getOperand(0), CastedOp0);
}
return SDValue();
}
// Match a binop + shuffle pyramid that represents a horizontal reduction over
// the elements of a vector.
// Returns the vector that is being reduced on, or SDValue() if a reduction
// was not matched.
static SDValue matchBinOpReduction(SDNode *Extract, ISD::NodeType BinOp) {
// The pattern must end in an extract from index 0.
if ((Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT) ||
!isNullConstant(Extract->getOperand(1)))
return SDValue();
unsigned Stages =
Log2_32(Extract->getOperand(0).getValueType().getVectorNumElements());
SDValue Op = Extract->getOperand(0);
// At each stage, we're looking for something that looks like:
// %s = shufflevector <8 x i32> %op, <8 x i32> undef,
// <8 x i32> <i32 2, i32 3, i32 undef, i32 undef,
// i32 undef, i32 undef, i32 undef, i32 undef>
// %a = binop <8 x i32> %op, %s
// Where the mask changes according to the stage. E.g. for a 3-stage pyramid,
// we expect something like:
// <4,5,6,7,u,u,u,u>
// <2,3,u,u,u,u,u,u>
// <1,u,u,u,u,u,u,u>
for (unsigned i = 0; i < Stages; ++i) {
if (Op.getOpcode() != BinOp)
return SDValue();
ShuffleVectorSDNode *Shuffle =
dyn_cast<ShuffleVectorSDNode>(Op.getOperand(0).getNode());
if (Shuffle) {
Op = Op.getOperand(1);
} else {
Shuffle = dyn_cast<ShuffleVectorSDNode>(Op.getOperand(1).getNode());
Op = Op.getOperand(0);
}
// The first operand of the shuffle should be the same as the other operand
// of the add.
if (!Shuffle || (Shuffle->getOperand(0) != Op))
return SDValue();
// Verify the shuffle has the expected (at this stage of the pyramid) mask.
for (int Index = 0, MaskEnd = 1 << i; Index < MaskEnd; ++Index)
if (Shuffle->getMaskElt(Index) != MaskEnd + Index)
return SDValue();
}
return Op;
}
// Given a select, detect the following pattern:
// 1: %2 = zext <N x i8> %0 to <N x i32>
// 2: %3 = zext <N x i8> %1 to <N x i32>
// 3: %4 = sub nsw <N x i32> %2, %3
// 4: %5 = icmp sgt <N x i32> %4, [0 x N] or [-1 x N]
// 5: %6 = sub nsw <N x i32> zeroinitializer, %4
// 6: %7 = select <N x i1> %5, <N x i32> %4, <N x i32> %6
// This is useful as it is the input into a SAD pattern.
static bool detectZextAbsDiff(const SDValue &Select, SDValue &Op0,
SDValue &Op1) {
// Check the condition of the select instruction is greater-than.
SDValue SetCC = Select->getOperand(0);
if (SetCC.getOpcode() != ISD::SETCC)
return false;
ISD::CondCode CC = cast<CondCodeSDNode>(SetCC.getOperand(2))->get();
if (CC != ISD::SETGT && CC != ISD::SETLT)
return false;
SDValue SelectOp1 = Select->getOperand(1);
SDValue SelectOp2 = Select->getOperand(2);
// The following instructions assume SelectOp1 is the subtraction operand
// and SelectOp2 is the negation operand.
// In the case of SETLT this is the other way around.
if (CC == ISD::SETLT)
std::swap(SelectOp1, SelectOp2);
// The second operand of the select should be the negation of the first
// operand, which is implemented as 0 - SelectOp1.
if (!(SelectOp2.getOpcode() == ISD::SUB &&
ISD::isBuildVectorAllZeros(SelectOp2.getOperand(0).getNode()) &&
SelectOp2.getOperand(1) == SelectOp1))
return false;
// The first operand of SetCC is the first operand of the select, which is the
// difference between the two input vectors.
if (SetCC.getOperand(0) != SelectOp1)
return false;
// In the SETLT case, the second operand of the comparison can be either 1 or 0.
APInt SplatVal;
if ((CC == ISD::SETLT) &&
- !((ISD::isConstantSplatVector(SetCC.getOperand(1).getNode(), SplatVal) &&
- SplatVal == 1) ||
+ !((ISD::isConstantSplatVector(SetCC.getOperand(1).getNode(), SplatVal,
+ /*AllowShrink*/false) &&
+ SplatVal.isOneValue()) ||
(ISD::isBuildVectorAllZeros(SetCC.getOperand(1).getNode()))))
return false;
// In the SETGT case, the second operand of the comparison can be either -1 or 0.
if ((CC == ISD::SETGT) &&
!(ISD::isBuildVectorAllZeros(SetCC.getOperand(1).getNode()) ||
ISD::isBuildVectorAllOnes(SetCC.getOperand(1).getNode())))
return false;
// The first operand of the select is the difference between the two input
// vectors.
if (SelectOp1.getOpcode() != ISD::SUB)
return false;
Op0 = SelectOp1.getOperand(0);
Op1 = SelectOp1.getOperand(1);
// Check if the operands of the sub are zero-extended from vectors of i8.
if (Op0.getOpcode() != ISD::ZERO_EXTEND ||
Op0.getOperand(0).getValueType().getVectorElementType() != MVT::i8 ||
Op1.getOpcode() != ISD::ZERO_EXTEND ||
Op1.getOperand(0).getValueType().getVectorElementType() != MVT::i8)
return false;
return true;
}
// Given two zexts of <k x i8> to <k x i32>, create a PSADBW of the inputs
// to these zexts.
static SDValue createPSADBW(SelectionDAG &DAG, const SDValue &Zext0,
const SDValue &Zext1, const SDLoc &DL) {
// Find the appropriate width for the PSADBW.
EVT InVT = Zext0.getOperand(0).getValueType();
unsigned RegSize = std::max(128u, InVT.getSizeInBits());
// "Zero-extend" the i8 vectors. This is not a per-element zext, rather we
// fill in the missing vector elements with 0.
unsigned NumConcat = RegSize / InVT.getSizeInBits();
SmallVector<SDValue, 16> Ops(NumConcat, DAG.getConstant(0, DL, InVT));
Ops[0] = Zext0.getOperand(0);
MVT ExtendedVT = MVT::getVectorVT(MVT::i8, RegSize / 8);
SDValue SadOp0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);
Ops[0] = Zext1.getOperand(0);
SDValue SadOp1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);
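// PSADBW takes the absolute difference of each pair of unsigned bytes and
// sums each group of eight differences into the low 16 bits of a 64-bit
// lane, hence the vXi64 result type below.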
// Actually build the SAD
MVT SadVT = MVT::getVectorVT(MVT::i64, RegSize / 64);
return DAG.getNode(X86ISD::PSADBW, DL, SadVT, SadOp0, SadOp1);
}
// Attempt to replace an all_of/any_of style horizontal reduction with a MOVMSK.
static SDValue combineHorizontalPredicateResult(SDNode *Extract,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// Bail without SSE2 or with AVX512VL (which uses predicate registers).
if (!Subtarget.hasSSE2() || Subtarget.hasVLX())
return SDValue();
EVT ExtractVT = Extract->getValueType(0);
unsigned BitWidth = ExtractVT.getSizeInBits();
if (ExtractVT != MVT::i64 && ExtractVT != MVT::i32 && ExtractVT != MVT::i16 &&
ExtractVT != MVT::i8)
return SDValue();
// Check for OR(any_of) and AND(all_of) horizontal reduction patterns.
for (ISD::NodeType Op : {ISD::OR, ISD::AND}) {
SDValue Match = matchBinOpReduction(Extract, Op);
if (!Match)
continue;
// EXTRACT_VECTOR_ELT can require implicit extension of the vector element
// which we can't support here for now.
if (Match.getScalarValueSizeInBits() != BitWidth)
continue;
// We require AVX2 for PMOVMSKB for v16i16/v32i8.
unsigned MatchSizeInBits = Match.getValueSizeInBits();
if (!(MatchSizeInBits == 128 ||
(MatchSizeInBits == 256 &&
((Subtarget.hasAVX() && BitWidth >= 32) || Subtarget.hasAVX2()))))
return SDValue();
// Don't bother performing this for 2-element vectors.
if (Match.getValueType().getVectorNumElements() <= 2)
return SDValue();
// Check that we are extracting a reduction of all sign bits.
if (DAG.ComputeNumSignBits(Match) != BitWidth)
return SDValue();
// For 32/64 bit comparisons use MOVMSKPS/MOVMSKPD, else PMOVMSKB.
MVT MaskVT;
if (64 == BitWidth || 32 == BitWidth)
MaskVT = MVT::getVectorVT(MVT::getFloatingPointVT(BitWidth),
MatchSizeInBits / BitWidth);
else
MaskVT = MVT::getVectorVT(MVT::i8, MatchSizeInBits / 8);
APInt CompareBits;
ISD::CondCode CondCode;
if (Op == ISD::OR) {
// any_of -> MOVMSK != 0
CompareBits = APInt::getNullValue(32);
CondCode = ISD::CondCode::SETNE;
} else {
// all_of -> MOVMSK == ((1 << NumElts) - 1)
CompareBits = APInt::getLowBitsSet(32, MaskVT.getVectorNumElements());
CondCode = ISD::CondCode::SETEQ;
}
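// e.g. an all_of reduction of a v4i32 compare result becomes
// MOVMSKPS == 0xF, while an any_of reduction becomes MOVMSK != 0.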
// Perform the select as i32/i64 and then truncate to avoid partial register
// stalls.
unsigned ResWidth = std::max(BitWidth, 32u);
EVT ResVT = EVT::getIntegerVT(*DAG.getContext(), ResWidth);
SDLoc DL(Extract);
SDValue Zero = DAG.getConstant(0, DL, ResVT);
SDValue Ones = DAG.getAllOnesConstant(DL, ResVT);
SDValue Res = DAG.getBitcast(MaskVT, Match);
Res = DAG.getNode(X86ISD::MOVMSK, DL, MVT::i32, Res);
Res = DAG.getSelectCC(DL, Res, DAG.getConstant(CompareBits, DL, MVT::i32),
Ones, Zero, CondCode);
return DAG.getSExtOrTrunc(Res, DL, ExtractVT);
}
return SDValue();
}
static SDValue combineBasicSADPattern(SDNode *Extract, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// PSADBW is only supported on SSE2 and up.
if (!Subtarget.hasSSE2())
return SDValue();
// Verify the type we're extracting from is any integer type above i16.
EVT VT = Extract->getOperand(0).getValueType();
if (!VT.isSimple() || !(VT.getVectorElementType().getSizeInBits() > 16))
return SDValue();
unsigned RegSize = 128;
if (Subtarget.hasBWI())
RegSize = 512;
else if (Subtarget.hasAVX2())
RegSize = 256;
// We handle up to v16i* for SSE2 / v32i* for AVX2 / v64i* for AVX512.
// TODO: We should be able to handle larger vectors by splitting them before
// feeding them into several SADs, and then reducing over those.
if (RegSize / VT.getVectorNumElements() < 8)
return SDValue();
// Match shuffle + add pyramid.
SDValue Root = matchBinOpReduction(Extract, ISD::ADD);
// The operand is expected to be zero extended from i8
// (verified in detectZextAbsDiff).
// To convert the result to i64 or wider, an additional any/zero/sign
// extend is expected above the reduction.
// A zero extend from 32 bits has no mathematical effect on the result,
// and a sign extend behaves like a zero extend here (the sign bit being
// extended is known to be zero), so it is correct to skip the sign/zero
// extend instruction.
if (Root && (Root.getOpcode() == ISD::SIGN_EXTEND ||
Root.getOpcode() == ISD::ZERO_EXTEND ||
Root.getOpcode() == ISD::ANY_EXTEND))
Root = Root.getOperand(0);
// If there was a match, we want Root to be a select that is the root of an
// abs-diff pattern.
if (!Root || (Root.getOpcode() != ISD::VSELECT))
return SDValue();
// Check whether we have an abs-diff pattern feeding into the select.
SDValue Zext0, Zext1;
if (!detectZextAbsDiff(Root, Zext0, Zext1))
return SDValue();
// Create the SAD instruction.
SDLoc DL(Extract);
SDValue SAD = createPSADBW(DAG, Zext0, Zext1, DL);
// If the original vector was wider than 8 elements, sum over the results
// in the SAD vector.
unsigned Stages = Log2_32(VT.getVectorNumElements());
MVT SadVT = SAD.getSimpleValueType();
if (Stages > 3) {
unsigned SadElems = SadVT.getVectorNumElements();
for (unsigned i = Stages - 3; i > 0; --i) {
SmallVector<int, 16> Mask(SadElems, -1);
for (unsigned j = 0, MaskEnd = 1 << (i - 1); j < MaskEnd; ++j)
Mask[j] = MaskEnd + j;
SDValue Shuffle =
DAG.getVectorShuffle(SadVT, DL, SAD, DAG.getUNDEF(SadVT), Mask);
SAD = DAG.getNode(ISD::ADD, DL, SadVT, SAD, Shuffle);
}
}
MVT Type = Extract->getSimpleValueType(0);
unsigned TypeSizeInBits = Type.getSizeInBits();
// Return the lowest TypeSizeInBits bits.
MVT ResVT = MVT::getVectorVT(Type, SadVT.getSizeInBits() / TypeSizeInBits);
SAD = DAG.getNode(ISD::BITCAST, DL, ResVT, SAD);
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, Type, SAD,
Extract->getOperand(1));
}
// Attempt to peek through a target shuffle and extract the scalar from the
// source.
static SDValue combineExtractWithShuffle(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalizeOps())
return SDValue();
SDValue Src = N->getOperand(0);
SDValue Idx = N->getOperand(1);
EVT VT = N->getValueType(0);
EVT SrcVT = Src.getValueType();
EVT SrcSVT = SrcVT.getVectorElementType();
unsigned NumSrcElts = SrcVT.getVectorNumElements();
// Don't attempt this for boolean mask vectors or unknown extraction indices.
if (SrcSVT == MVT::i1 || !isa<ConstantSDNode>(Idx))
return SDValue();
// Resolve the target shuffle inputs and mask.
SmallVector<int, 16> Mask;
SmallVector<SDValue, 2> Ops;
if (!resolveTargetShuffleInputs(peekThroughBitcasts(Src), Ops, Mask, DAG))
return SDValue();
// Attempt to narrow/widen the shuffle mask to the correct size.
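// e.g. a v4i32 mask <0,2,1,3> scales by 2 to the v8i16 mask
// <0,1,4,5,2,3,6,7>; conversely, a mask that moves whole wider elements can
// be widened.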
if (Mask.size() != NumSrcElts) {
if ((NumSrcElts % Mask.size()) == 0) {
SmallVector<int, 16> ScaledMask;
int Scale = NumSrcElts / Mask.size();
scaleShuffleMask(Scale, Mask, ScaledMask);
Mask = std::move(ScaledMask);
} else if ((Mask.size() % NumSrcElts) == 0) {
SmallVector<int, 16> WidenedMask;
while (Mask.size() > NumSrcElts &&
canWidenShuffleElements(Mask, WidenedMask))
Mask = std::move(WidenedMask);
// TODO - investigate support for wider shuffle masks with known upper
// undef/zero elements for implicit zero-extension.
}
}
// Check if narrowing/widening failed.
if (Mask.size() != NumSrcElts)
return SDValue();
int SrcIdx = Mask[N->getConstantOperandVal(1)];
SDLoc dl(N);
// If the shuffle source element is undef/zero then we can just accept it.
if (SrcIdx == SM_SentinelUndef)
return DAG.getUNDEF(VT);
if (SrcIdx == SM_SentinelZero)
return VT.isFloatingPoint() ? DAG.getConstantFP(0.0, dl, VT)
: DAG.getConstant(0, dl, VT);
SDValue SrcOp = Ops[SrcIdx / Mask.size()];
SrcOp = DAG.getBitcast(SrcVT, SrcOp);
SrcIdx = SrcIdx % Mask.size();
// We can only extract other elements from 128-bit vectors and in certain
// circumstances, depending on SSE-level.
// TODO: Investigate using extract_subvector for larger vectors.
// TODO: Investigate float/double extraction if it will be just stored.
if ((SrcVT == MVT::v4i32 || SrcVT == MVT::v2i64) &&
((SrcIdx == 0 && Subtarget.hasSSE2()) || Subtarget.hasSSE41())) {
assert(SrcSVT == VT && "Unexpected extraction type");
return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, SrcSVT, SrcOp,
DAG.getIntPtrConstant(SrcIdx, dl));
}
if ((SrcVT == MVT::v8i16 && Subtarget.hasSSE2()) ||
(SrcVT == MVT::v16i8 && Subtarget.hasSSE41())) {
assert(VT.getSizeInBits() >= SrcSVT.getSizeInBits() &&
"Unexpected extraction type");
unsigned OpCode = (SrcVT == MVT::v8i16 ? X86ISD::PEXTRW : X86ISD::PEXTRB);
SDValue ExtOp = DAG.getNode(OpCode, dl, MVT::i32, SrcOp,
DAG.getIntPtrConstant(SrcIdx, dl));
SDValue Assert = DAG.getNode(ISD::AssertZext, dl, MVT::i32, ExtOp,
DAG.getValueType(SrcSVT));
return DAG.getZExtOrTrunc(Assert, dl, VT);
}
return SDValue();
}
/// Detect vector gather/scatter index generation and convert it from being a
/// bunch of shuffles and extracts into a somewhat faster sequence.
/// For i686, the best sequence is apparently storing the value and loading
/// scalars back, while for x64 we should use 64-bit extracts and shifts.
static SDValue combineExtractVectorElt(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (SDValue NewOp = XFormVExtractWithShuffleIntoLoad(N, DAG, DCI))
return NewOp;
if (SDValue NewOp = combineExtractWithShuffle(N, DAG, DCI, Subtarget))
return NewOp;
SDValue InputVector = N->getOperand(0);
SDValue EltIdx = N->getOperand(1);
EVT SrcVT = InputVector.getValueType();
EVT VT = N->getValueType(0);
SDLoc dl(InputVector);
// Detect mmx extraction of all bits as a i64. It works better as a bitcast.
if (InputVector.getOpcode() == ISD::BITCAST && InputVector.hasOneUse() &&
VT == MVT::i64 && SrcVT == MVT::v1i64 && isNullConstant(EltIdx)) {
SDValue MMXSrc = InputVector.getOperand(0);
// The bitcast source is a direct mmx result.
if (MMXSrc.getValueType() == MVT::x86mmx)
return DAG.getBitcast(VT, InputVector);
}
// Detect mmx to i32 conversion through a v2i32 elt extract.
if (InputVector.getOpcode() == ISD::BITCAST && InputVector.hasOneUse() &&
VT == MVT::i32 && SrcVT == MVT::v2i32 && isNullConstant(EltIdx)) {
SDValue MMXSrc = InputVector.getOperand(0);
// The bitcast source is a direct mmx result.
if (MMXSrc.getValueType() == MVT::x86mmx)
return DAG.getNode(X86ISD::MMX_MOVD2W, dl, MVT::i32, MMXSrc);
}
if (VT == MVT::i1 && InputVector.getOpcode() == ISD::BITCAST &&
isa<ConstantSDNode>(EltIdx) &&
isa<ConstantSDNode>(InputVector.getOperand(0))) {
uint64_t ExtractedElt = N->getConstantOperandVal(1);
uint64_t InputValue = InputVector.getConstantOperandVal(0);
uint64_t Res = (InputValue >> ExtractedElt) & 1;
return DAG.getConstant(Res, dl, MVT::i1);
}
// Check whether this extract is the root of a sum of absolute differences
// pattern. This has to be done here because we really want it to happen
// pre-legalization.
if (SDValue SAD = combineBasicSADPattern(N, DAG, Subtarget))
return SAD;
// Attempt to replace an all_of/any_of horizontal reduction with a MOVMSK.
if (SDValue Cmp = combineHorizontalPredicateResult(N, DAG, Subtarget))
return Cmp;
// Only operate on vectors of 4 elements, where the alternative shuffling
// gets to be more expensive.
if (SrcVT != MVT::v4i32)
return SDValue();
// Check whether every use of InputVector is an EXTRACT_VECTOR_ELT with a
// single use which is a sign-extend or zero-extend, and all elements are
// used.
SmallVector<SDNode *, 4> Uses;
unsigned ExtractedElements = 0;
for (SDNode::use_iterator UI = InputVector.getNode()->use_begin(),
UE = InputVector.getNode()->use_end(); UI != UE; ++UI) {
if (UI.getUse().getResNo() != InputVector.getResNo())
return SDValue();
SDNode *Extract = *UI;
if (Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
return SDValue();
if (Extract->getValueType(0) != MVT::i32)
return SDValue();
if (!Extract->hasOneUse())
return SDValue();
if (Extract->use_begin()->getOpcode() != ISD::SIGN_EXTEND &&
Extract->use_begin()->getOpcode() != ISD::ZERO_EXTEND)
return SDValue();
if (!isa<ConstantSDNode>(Extract->getOperand(1)))
return SDValue();
// Record which element was extracted.
ExtractedElements |= 1 << Extract->getConstantOperandVal(1);
Uses.push_back(Extract);
}
// If not all the elements were used, this may not be worthwhile.
if (ExtractedElements != 15)
return SDValue();
// Ok, we've now decided to do the transformation.
// If 64-bit shifts are legal, use the extract-shift sequence,
// otherwise bounce the vector off the cache.
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDValue Vals[4];
if (TLI.isOperationLegal(ISD::SRA, MVT::i64)) {
SDValue Cst = DAG.getBitcast(MVT::v2i64, InputVector);
auto &DL = DAG.getDataLayout();
EVT VecIdxTy = DAG.getTargetLoweringInfo().getVectorIdxTy(DL);
SDValue BottomHalf = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Cst,
DAG.getConstant(0, dl, VecIdxTy));
SDValue TopHalf = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i64, Cst,
DAG.getConstant(1, dl, VecIdxTy));
SDValue ShAmt = DAG.getConstant(
32, dl, DAG.getTargetLoweringInfo().getShiftAmountTy(MVT::i64, DL));
Vals[0] = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, BottomHalf);
Vals[1] = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32,
DAG.getNode(ISD::SRA, dl, MVT::i64, BottomHalf, ShAmt));
Vals[2] = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, TopHalf);
Vals[3] = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32,
DAG.getNode(ISD::SRA, dl, MVT::i64, TopHalf, ShAmt));
} else {
// Store the value to a temporary stack slot.
SDValue StackPtr = DAG.CreateStackTemporary(SrcVT);
SDValue Ch = DAG.getStore(DAG.getEntryNode(), dl, InputVector, StackPtr,
MachinePointerInfo());
EVT ElementType = SrcVT.getVectorElementType();
unsigned EltSize = ElementType.getSizeInBits() / 8;
// Replace each use (extract) with a load of the appropriate element.
for (unsigned i = 0; i < 4; ++i) {
uint64_t Offset = EltSize * i;
auto PtrVT = TLI.getPointerTy(DAG.getDataLayout());
SDValue OffsetVal = DAG.getConstant(Offset, dl, PtrVT);
SDValue ScalarAddr =
DAG.getNode(ISD::ADD, dl, PtrVT, StackPtr, OffsetVal);
// Load the scalar.
Vals[i] =
DAG.getLoad(ElementType, dl, Ch, ScalarAddr, MachinePointerInfo());
}
}
// Replace the extracts
for (SmallVectorImpl<SDNode *>::iterator UI = Uses.begin(),
UE = Uses.end(); UI != UE; ++UI) {
SDNode *Extract = *UI;
uint64_t IdxVal = Extract->getConstantOperandVal(1);
DAG.ReplaceAllUsesOfValueWith(SDValue(Extract, 0), Vals[IdxVal]);
}
// The replacement was made in place; don't return anything.
return SDValue();
}
// TODO - merge with combineExtractVectorElt once it can handle the implicit
// zero-extension of X86ISD::PINSRW/X86ISD::PINSRB in:
// XFormVExtractWithShuffleIntoLoad, combineHorizontalPredicateResult and
// combineBasicSADPattern.
static SDValue combineExtractVectorElt_SSE(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
return combineExtractWithShuffle(N, DAG, DCI, Subtarget);
}
/// If a vector select has an operand that is -1 or 0, try to simplify the
/// select to a bitwise logic operation.
static SDValue
combineVSelectWithAllOnesOrZeros(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDValue Cond = N->getOperand(0);
SDValue LHS = N->getOperand(1);
SDValue RHS = N->getOperand(2);
EVT VT = LHS.getValueType();
EVT CondVT = Cond.getValueType();
SDLoc DL(N);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (N->getOpcode() != ISD::VSELECT)
return SDValue();
assert(CondVT.isVector() && "Vector select expects a vector selector!");
bool FValIsAllZeros = ISD::isBuildVectorAllZeros(LHS.getNode());
// Check if the first operand is all zeros and Cond type is vXi1.
// This situation only applies to avx512.
if (FValIsAllZeros && Subtarget.hasAVX512() && Cond.hasOneUse() &&
CondVT.getVectorElementType() == MVT::i1) {
// Invert the cond to not(cond) : xor(op,allones)=not(op)
SDValue CondNew = DAG.getNode(ISD::XOR, DL, CondVT, Cond,
DAG.getAllOnesConstant(DL, CondVT));
// Vselect cond, op1, op2 = Vselect not(cond), op2, op1
return DAG.getSelect(DL, VT, CondNew, RHS, LHS);
}
// To use the condition operand as a bitwise mask, it must have elements that
// are the same size as the select elements. I.e., the condition operand must
// have already been promoted from the IR select condition type <N x i1>.
// Don't check if the types themselves are equal because that excludes
// vector floating-point selects.
if (CondVT.getScalarSizeInBits() != VT.getScalarSizeInBits())
return SDValue();
bool TValIsAllOnes = ISD::isBuildVectorAllOnes(LHS.getNode());
FValIsAllZeros = ISD::isBuildVectorAllZeros(RHS.getNode());
// Try to invert the condition if true value is not all 1s and false value is
// not all 0s.
if (!TValIsAllOnes && !FValIsAllZeros &&
// Check if the selector will be produced by CMPP*/PCMP*.
Cond.getOpcode() == ISD::SETCC &&
// Check if SETCC has already been promoted.
TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT) ==
CondVT) {
bool TValIsAllZeros = ISD::isBuildVectorAllZeros(LHS.getNode());
bool FValIsAllOnes = ISD::isBuildVectorAllOnes(RHS.getNode());
if (TValIsAllZeros || FValIsAllOnes) {
SDValue CC = Cond.getOperand(2);
ISD::CondCode NewCC =
ISD::getSetCCInverse(cast<CondCodeSDNode>(CC)->get(),
Cond.getOperand(0).getValueType().isInteger());
Cond = DAG.getSetCC(DL, CondVT, Cond.getOperand(0), Cond.getOperand(1),
NewCC);
std::swap(LHS, RHS);
TValIsAllOnes = FValIsAllOnes;
FValIsAllZeros = TValIsAllZeros;
}
}
// vselect Cond, 111..., 000... -> Cond
if (TValIsAllOnes && FValIsAllZeros)
return DAG.getBitcast(VT, Cond);
if (!DCI.isBeforeLegalize() && !TLI.isTypeLegal(CondVT))
return SDValue();
// vselect Cond, 111..., X -> or Cond, X
if (TValIsAllOnes) {
SDValue CastRHS = DAG.getBitcast(CondVT, RHS);
SDValue Or = DAG.getNode(ISD::OR, DL, CondVT, Cond, CastRHS);
return DAG.getBitcast(VT, Or);
}
// vselect Cond, X, 000... -> and Cond, X
if (FValIsAllZeros) {
SDValue CastLHS = DAG.getBitcast(CondVT, LHS);
SDValue And = DAG.getNode(ISD::AND, DL, CondVT, Cond, CastLHS);
return DAG.getBitcast(VT, And);
}
return SDValue();
}
static SDValue combineSelectOfTwoConstants(SDNode *N, SelectionDAG &DAG) {
SDValue Cond = N->getOperand(0);
SDValue LHS = N->getOperand(1);
SDValue RHS = N->getOperand(2);
SDLoc DL(N);
auto *TrueC = dyn_cast<ConstantSDNode>(LHS);
auto *FalseC = dyn_cast<ConstantSDNode>(RHS);
if (!TrueC || !FalseC)
return SDValue();
// Don't do this for crazy integer types.
if (!DAG.getTargetLoweringInfo().isTypeLegal(LHS.getValueType()))
return SDValue();
// If this is efficiently invertible, canonicalize the LHSC/RHSC values
// so that TrueC (the true value) is larger than FalseC.
bool NeedsCondInvert = false;
if (TrueC->getAPIntValue().ult(FalseC->getAPIntValue()) &&
// Efficiently invertible.
(Cond.getOpcode() == ISD::SETCC || // setcc -> invertible.
(Cond.getOpcode() == ISD::XOR && // xor(X, C) -> invertible.
isa<ConstantSDNode>(Cond.getOperand(1))))) {
NeedsCondInvert = true;
std::swap(TrueC, FalseC);
}
// Optimize C ? 8 : 0 -> zext(C) << 3. Likewise for any pow2/0.
if (FalseC->getAPIntValue() == 0 && TrueC->getAPIntValue().isPowerOf2()) {
if (NeedsCondInvert) // Invert the condition if needed.
Cond = DAG.getNode(ISD::XOR, DL, Cond.getValueType(), Cond,
DAG.getConstant(1, DL, Cond.getValueType()));
// Zero extend the condition if needed.
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, LHS.getValueType(), Cond);
unsigned ShAmt = TrueC->getAPIntValue().logBase2();
return DAG.getNode(ISD::SHL, DL, LHS.getValueType(), Cond,
DAG.getConstant(ShAmt, DL, MVT::i8));
}
// Optimize cases that will turn into an LEA instruction. This requires
// an i32 or i64 and an efficient multiplier (1, 2, 3, 4, 5, 8, 9).
if (N->getValueType(0) == MVT::i32 || N->getValueType(0) == MVT::i64) {
uint64_t Diff = TrueC->getZExtValue() - FalseC->getZExtValue();
if (N->getValueType(0) == MVT::i32)
Diff = (unsigned)Diff;
bool IsFastMultiplier = false;
if (Diff < 10) {
switch ((unsigned char)Diff) {
default:
break;
case 1: // result = add base, cond
case 2: // result = lea base( , cond*2)
case 3: // result = lea base(cond, cond*2)
case 4: // result = lea base( , cond*4)
case 5: // result = lea base(cond, cond*4)
case 8: // result = lea base( , cond*8)
case 9: // result = lea base(cond, cond*8)
IsFastMultiplier = true;
break;
}
}
if (IsFastMultiplier) {
APInt Diff = TrueC->getAPIntValue() - FalseC->getAPIntValue();
if (NeedsCondInvert) // Invert the condition if needed.
Cond = DAG.getNode(ISD::XOR, DL, Cond.getValueType(), Cond,
DAG.getConstant(1, DL, Cond.getValueType()));
// Zero extend the condition if needed.
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, FalseC->getValueType(0), Cond);
// Scale the condition by the difference.
if (Diff != 1)
Cond = DAG.getNode(ISD::MUL, DL, Cond.getValueType(), Cond,
DAG.getConstant(Diff, DL, Cond.getValueType()));
// Add the base if non-zero.
if (FalseC->getAPIntValue() != 0)
Cond = DAG.getNode(ISD::ADD, DL, Cond.getValueType(), Cond,
SDValue(FalseC, 0));
return Cond;
}
}
return SDValue();
}
// If this is a bitcasted op that can be represented as another type, push
// the bitcast to the inputs. This allows more opportunities for pattern
// matching masked instructions. This is called when we know that the operation
// is used as one of the inputs of a vselect.
static bool combineBitcastForMaskedOp(SDValue OrigOp, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
// Make sure we have a bitcast.
if (OrigOp.getOpcode() != ISD::BITCAST)
return false;
SDValue Op = OrigOp.getOperand(0);
// If the operation is used by anything other than the bitcast, we shouldn't
// do this combine as that would replicate the operation.
if (!Op.hasOneUse())
return false;
MVT VT = OrigOp.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
SDLoc DL(Op.getNode());
auto BitcastAndCombineShuffle = [&](unsigned Opcode, SDValue Op0, SDValue Op1,
SDValue Op2) {
Op0 = DAG.getBitcast(VT, Op0);
DCI.AddToWorklist(Op0.getNode());
Op1 = DAG.getBitcast(VT, Op1);
DCI.AddToWorklist(Op1.getNode());
DCI.CombineTo(OrigOp.getNode(),
DAG.getNode(Opcode, DL, VT, Op0, Op1, Op2));
return true;
};
unsigned Opcode = Op.getOpcode();
switch (Opcode) {
case X86ISD::PALIGNR:
// PALIGNR can be converted to VALIGND/Q for 128-bit vectors.
if (!VT.is128BitVector())
return false;
Opcode = X86ISD::VALIGN;
LLVM_FALLTHROUGH;
case X86ISD::VALIGN: {
if (EltVT != MVT::i32 && EltVT != MVT::i64)
return false;
uint64_t Imm = cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue();
MVT OpEltVT = Op.getSimpleValueType().getVectorElementType();
unsigned ShiftAmt = Imm * OpEltVT.getSizeInBits();
unsigned EltSize = EltVT.getSizeInBits();
// Make sure we can represent the same shift with the new VT.
if ((ShiftAmt % EltSize) != 0)
return false;
Imm = ShiftAmt / EltSize;
return BitcastAndCombineShuffle(Opcode, Op.getOperand(0), Op.getOperand(1),
DAG.getConstant(Imm, DL, MVT::i8));
}
case X86ISD::SHUF128: {
if (EltVT.getSizeInBits() != 32 && EltVT.getSizeInBits() != 64)
return false;
// Only change element size, not type.
if (VT.isInteger() != Op.getSimpleValueType().isInteger())
return false;
return BitcastAndCombineShuffle(Opcode, Op.getOperand(0), Op.getOperand(1),
Op.getOperand(2));
}
case ISD::INSERT_SUBVECTOR: {
unsigned EltSize = EltVT.getSizeInBits();
if (EltSize != 32 && EltSize != 64)
return false;
MVT OpEltVT = Op.getSimpleValueType().getVectorElementType();
// Only change element size, not type.
if (EltVT.isInteger() != OpEltVT.isInteger())
return false;
uint64_t Imm = cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue();
Imm = (Imm * OpEltVT.getSizeInBits()) / EltSize;
SDValue Op0 = DAG.getBitcast(VT, Op.getOperand(0));
DCI.AddToWorklist(Op0.getNode());
// Op1 needs to be bitcasted to a smaller vector with the same element type.
SDValue Op1 = Op.getOperand(1);
MVT Op1VT = MVT::getVectorVT(EltVT,
Op1.getSimpleValueType().getSizeInBits() / EltSize);
Op1 = DAG.getBitcast(Op1VT, Op1);
DCI.AddToWorklist(Op1.getNode());
DCI.CombineTo(OrigOp.getNode(),
DAG.getNode(Opcode, DL, VT, Op0, Op1,
DAG.getIntPtrConstant(Imm, DL)));
return true;
}
case ISD::EXTRACT_SUBVECTOR: {
unsigned EltSize = EltVT.getSizeInBits();
if (EltSize != 32 && EltSize != 64)
return false;
MVT OpEltVT = Op.getSimpleValueType().getVectorElementType();
// Only change element size, not type.
if (EltVT.isInteger() != OpEltVT.isInteger())
return false;
uint64_t Imm = cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue();
Imm = (Imm * OpEltVT.getSizeInBits()) / EltSize;
// Op0 needs to be bitcasted to a larger vector with the same element type.
SDValue Op0 = Op.getOperand(0);
MVT Op0VT = MVT::getVectorVT(EltVT,
Op0.getSimpleValueType().getSizeInBits() / EltSize);
Op0 = DAG.getBitcast(Op0VT, Op0);
DCI.AddToWorklist(Op0.getNode());
DCI.CombineTo(OrigOp.getNode(),
DAG.getNode(Opcode, DL, VT, Op0,
DAG.getIntPtrConstant(Imm, DL)));
return true;
}
case X86ISD::SUBV_BROADCAST: {
unsigned EltSize = EltVT.getSizeInBits();
if (EltSize != 32 && EltSize != 64)
return false;
// Only change element size, not type.
if (VT.isInteger() != Op.getSimpleValueType().isInteger())
return false;
SDValue Op0 = Op.getOperand(0);
MVT Op0VT = MVT::getVectorVT(EltVT,
Op0.getSimpleValueType().getSizeInBits() / EltSize);
Op0 = DAG.getBitcast(Op0VT, Op.getOperand(0));
DCI.AddToWorklist(Op0.getNode());
DCI.CombineTo(OrigOp.getNode(),
DAG.getNode(Opcode, DL, VT, Op0));
return true;
}
}
return false;
}
/// Do target-specific dag combines on SELECT and VSELECT nodes.
static SDValue combineSelect(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
SDValue Cond = N->getOperand(0);
// Get the LHS/RHS of the select.
SDValue LHS = N->getOperand(1);
SDValue RHS = N->getOperand(2);
EVT VT = LHS.getValueType();
EVT CondVT = Cond.getValueType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// If we have SSE[12] support, try to form min/max nodes. SSE min/max
// instructions match the semantics of the common C idiom x<y?x:y but not
// x<=y?x:y, because of how they handle negative zero (which can be
// ignored in unsafe-math mode).
// We also try to create v2f32 min/max nodes, which we later widen to v4f32.
if (Cond.getOpcode() == ISD::SETCC && VT.isFloatingPoint() &&
VT != MVT::f80 && VT != MVT::f128 &&
(TLI.isTypeLegal(VT) || VT == MVT::v2f32) &&
(Subtarget.hasSSE2() ||
(Subtarget.hasSSE1() && VT.getScalarType() == MVT::f32))) {
ISD::CondCode CC = cast<CondCodeSDNode>(Cond.getOperand(2))->get();
unsigned Opcode = 0;
// Check for x CC y ? x : y.
if (DAG.isEqualTo(LHS, Cond.getOperand(0)) &&
DAG.isEqualTo(RHS, Cond.getOperand(1))) {
switch (CC) {
default: break;
case ISD::SETULT:
// Converting this to a min would handle NaNs incorrectly, and swapping
// the operands would cause it to handle comparisons between positive
// and negative zero incorrectly.
if (!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS)) {
if (!DAG.getTarget().Options.UnsafeFPMath &&
!(DAG.isKnownNeverZero(LHS) || DAG.isKnownNeverZero(RHS)))
break;
std::swap(LHS, RHS);
}
Opcode = X86ISD::FMIN;
break;
case ISD::SETOLE:
// Converting this to a min would handle comparisons between positive
// and negative zero incorrectly.
if (!DAG.getTarget().Options.UnsafeFPMath &&
!DAG.isKnownNeverZero(LHS) && !DAG.isKnownNeverZero(RHS))
break;
Opcode = X86ISD::FMIN;
break;
case ISD::SETULE:
// Converting this to a min would handle both negative zeros and NaNs
// incorrectly, but we can swap the operands to fix both.
std::swap(LHS, RHS);
LLVM_FALLTHROUGH;
case ISD::SETOLT:
case ISD::SETLT:
case ISD::SETLE:
Opcode = X86ISD::FMIN;
break;
case ISD::SETOGE:
// Converting this to a max would handle comparisons between positive
// and negative zero incorrectly.
if (!DAG.getTarget().Options.UnsafeFPMath &&
!DAG.isKnownNeverZero(LHS) && !DAG.isKnownNeverZero(RHS))
break;
Opcode = X86ISD::FMAX;
break;
case ISD::SETUGT:
// Converting this to a max would handle NaNs incorrectly, and swapping
// the operands would cause it to handle comparisons between positive
// and negative zero incorrectly.
if (!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS)) {
if (!DAG.getTarget().Options.UnsafeFPMath &&
!(DAG.isKnownNeverZero(LHS) || DAG.isKnownNeverZero(RHS)))
break;
std::swap(LHS, RHS);
}
Opcode = X86ISD::FMAX;
break;
case ISD::SETUGE:
// Converting this to a max would handle both negative zeros and NaNs
// incorrectly, but we can swap the operands to fix both.
std::swap(LHS, RHS);
LLVM_FALLTHROUGH;
case ISD::SETOGT:
case ISD::SETGT:
case ISD::SETGE:
Opcode = X86ISD::FMAX;
break;
}
// Check for x CC y ? y : x -- a min/max with reversed arms.
} else if (DAG.isEqualTo(LHS, Cond.getOperand(1)) &&
DAG.isEqualTo(RHS, Cond.getOperand(0))) {
switch (CC) {
default: break;
case ISD::SETOGE:
// Converting this to a min would handle comparisons between positive
// and negative zero incorrectly, and swapping the operands would
// cause it to handle NaNs incorrectly.
if (!DAG.getTarget().Options.UnsafeFPMath &&
!(DAG.isKnownNeverZero(LHS) || DAG.isKnownNeverZero(RHS))) {
if (!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS))
break;
std::swap(LHS, RHS);
}
Opcode = X86ISD::FMIN;
break;
case ISD::SETUGT:
// Converting this to a min would handle NaNs incorrectly.
if (!DAG.getTarget().Options.UnsafeFPMath &&
(!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS)))
break;
Opcode = X86ISD::FMIN;
break;
case ISD::SETUGE:
// Converting this to a min would handle both negative zeros and NaNs
// incorrectly, but we can swap the operands to fix both.
std::swap(LHS, RHS);
LLVM_FALLTHROUGH;
case ISD::SETOGT:
case ISD::SETGT:
case ISD::SETGE:
Opcode = X86ISD::FMIN;
break;
case ISD::SETULT:
// Converting this to a max would handle NaNs incorrectly.
if (!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS))
break;
Opcode = X86ISD::FMAX;
break;
case ISD::SETOLE:
// Converting this to a max would handle comparisons between positive
// and negative zero incorrectly, and swapping the operands would
// cause it to handle NaNs incorrectly.
if (!DAG.getTarget().Options.UnsafeFPMath &&
!DAG.isKnownNeverZero(LHS) && !DAG.isKnownNeverZero(RHS)) {
if (!DAG.isKnownNeverNaN(LHS) || !DAG.isKnownNeverNaN(RHS))
break;
std::swap(LHS, RHS);
}
Opcode = X86ISD::FMAX;
break;
case ISD::SETULE:
// Converting this to a max would handle both negative zeros and NaNs
// incorrectly, but we can swap the operands to fix both.
std::swap(LHS, RHS);
LLVM_FALLTHROUGH;
case ISD::SETOLT:
case ISD::SETLT:
case ISD::SETLE:
Opcode = X86ISD::FMAX;
break;
}
}
if (Opcode)
return DAG.getNode(Opcode, DL, N->getValueType(0), LHS, RHS);
}
// v16i8 (select v16i1, v16i8, v16i8) does not have a proper
// lowering on KNL. In this case we convert it to
// v16i8 (select v16i8, v16i8, v16i8) and use an AVX instruction.
// The same applies to all 128- and 256-bit vectors of i8 and i16.
// Starting with SKX, these selects have a proper lowering.
if (Subtarget.hasAVX512() && CondVT.isVector() &&
CondVT.getVectorElementType() == MVT::i1 &&
(VT.is128BitVector() || VT.is256BitVector()) &&
(VT.getVectorElementType() == MVT::i8 ||
VT.getVectorElementType() == MVT::i16) &&
!(Subtarget.hasBWI() && Subtarget.hasVLX())) {
Cond = DAG.getNode(ISD::SIGN_EXTEND, DL, VT, Cond);
DCI.AddToWorklist(Cond.getNode());
return DAG.getNode(N->getOpcode(), DL, VT, Cond, LHS, RHS);
}
if (SDValue V = combineSelectOfTwoConstants(N, DAG))
return V;
// Canonicalize max and min:
// (x > y) ? x : y -> (x >= y) ? x : y
// (x < y) ? x : y -> (x <= y) ? x : y
// This allows use of COND_S / COND_NS (see TranslateX86CC) which eliminates
// the need for an extra compare against zero. e.g.
// (x - y) > 0 ? (x - y) : 0 -> (x - y) >= 0 ? (x - y) : 0
// subl %esi, %edi
// testl %edi, %edi
// movl $0, %eax
// cmovgl %edi, %eax
// =>
// xorl %eax, %eax
// subl %esi, %edi
// cmovsl %eax, %edi
if (N->getOpcode() == ISD::SELECT && Cond.getOpcode() == ISD::SETCC &&
DAG.isEqualTo(LHS, Cond.getOperand(0)) &&
DAG.isEqualTo(RHS, Cond.getOperand(1))) {
ISD::CondCode CC = cast<CondCodeSDNode>(Cond.getOperand(2))->get();
switch (CC) {
default: break;
case ISD::SETLT:
case ISD::SETGT: {
ISD::CondCode NewCC = (CC == ISD::SETLT) ? ISD::SETLE : ISD::SETGE;
Cond = DAG.getSetCC(SDLoc(Cond), Cond.getValueType(),
Cond.getOperand(0), Cond.getOperand(1), NewCC);
return DAG.getSelect(DL, VT, Cond, LHS, RHS);
}
}
}
// Early exit check
if (!TLI.isTypeLegal(VT))
return SDValue();
// Match VSELECTs into subs with unsigned saturation.
if (N->getOpcode() == ISD::VSELECT && Cond.getOpcode() == ISD::SETCC &&
// psubus is available in SSE2 and AVX2 for i8 and i16 vectors.
((Subtarget.hasSSE2() && (VT == MVT::v16i8 || VT == MVT::v8i16)) ||
(Subtarget.hasAVX2() && (VT == MVT::v32i8 || VT == MVT::v16i16)))) {
ISD::CondCode CC = cast<CondCodeSDNode>(Cond.getOperand(2))->get();
// Check if one of the arms of the VSELECT is a zero vector. If it's on the
// left side, invert the predicate to simplify the logic below.
SDValue Other;
if (ISD::isBuildVectorAllZeros(LHS.getNode())) {
Other = RHS;
CC = ISD::getSetCCInverse(CC, true);
} else if (ISD::isBuildVectorAllZeros(RHS.getNode())) {
Other = LHS;
}
if (Other.getNode() && Other->getNumOperands() == 2 &&
DAG.isEqualTo(Other->getOperand(0), Cond.getOperand(0))) {
SDValue OpLHS = Other->getOperand(0), OpRHS = Other->getOperand(1);
SDValue CondRHS = Cond->getOperand(1);
// Look for a general sub with unsigned saturation first.
// x >= y ? x-y : 0 --> subus x, y
// x > y ? x-y : 0 --> subus x, y
if ((CC == ISD::SETUGE || CC == ISD::SETUGT) &&
Other->getOpcode() == ISD::SUB && DAG.isEqualTo(OpRHS, CondRHS))
return DAG.getNode(X86ISD::SUBUS, DL, VT, OpLHS, OpRHS);
if (auto *OpRHSBV = dyn_cast<BuildVectorSDNode>(OpRHS))
if (auto *OpRHSConst = OpRHSBV->getConstantSplatNode()) {
if (auto *CondRHSBV = dyn_cast<BuildVectorSDNode>(CondRHS))
if (auto *CondRHSConst = CondRHSBV->getConstantSplatNode())
// If the RHS is a constant we have to reverse the const
// canonicalization.
// x > C-1 ? x + (-C) : 0 --> subus x, C
if (CC == ISD::SETUGT && Other->getOpcode() == ISD::ADD &&
CondRHSConst->getAPIntValue() ==
(-OpRHSConst->getAPIntValue() - 1))
return DAG.getNode(
X86ISD::SUBUS, DL, VT, OpLHS,
DAG.getConstant(-OpRHSConst->getAPIntValue(), DL, VT));
// Another special case: If C was a sign bit, the sub has been
// canonicalized into a xor.
// FIXME: Would it be better to use computeKnownBits to determine
// whether it's safe to decanonicalize the xor?
// x s< 0 ? x^C : 0 --> subus x, C
if (CC == ISD::SETLT && Other->getOpcode() == ISD::XOR &&
ISD::isBuildVectorAllZeros(CondRHS.getNode()) &&
OpRHSConst->getAPIntValue().isSignMask())
// Note that we have to rebuild the RHS constant here to ensure we
// don't rely on particular values of undef lanes.
return DAG.getNode(
X86ISD::SUBUS, DL, VT, OpLHS,
DAG.getConstant(OpRHSConst->getAPIntValue(), DL, VT));
}
}
}
if (SDValue V = combineVSelectWithAllOnesOrZeros(N, DAG, DCI, Subtarget))
return V;
// If this is a *dynamic* select (non-constant condition) and we can match
// this node with one of the variable blend instructions, restructure the
// condition so that blends can use the high (sign) bit of each element and
// use SimplifyDemandedBits to simplify the condition operand.
if (N->getOpcode() == ISD::VSELECT && DCI.isBeforeLegalizeOps() &&
!DCI.isBeforeLegalize() &&
!ISD::isBuildVectorOfConstantSDNodes(Cond.getNode())) {
unsigned BitWidth = Cond.getScalarValueSizeInBits();
// Don't optimize vector selects that map to mask-registers.
if (BitWidth == 1)
return SDValue();
// We can only handle the cases where VSELECT is directly legal on the
// subtarget. We custom lower VSELECT nodes with constant conditions and
// this makes it hard to see whether a dynamic VSELECT will correctly
// lower, so we both check the operation's status and explicitly handle the
// cases where a *dynamic* blend will fail even though a constant-condition
// blend could be custom lowered.
// FIXME: We should find a better way to handle this class of problems.
// Potentially, we should combine constant-condition vselect nodes
// pre-legalization into shuffles and not mark as many types as custom
// lowered.
if (!TLI.isOperationLegalOrCustom(ISD::VSELECT, VT))
return SDValue();
// FIXME: We don't support i16-element blends currently. We could and
// should support them by making *all* the bits in the condition be set
// rather than just the high bit and using an i8-element blend.
if (VT.getVectorElementType() == MVT::i16)
return SDValue();
// Dynamic blending was only available from SSE4.1 onward.
if (VT.is128BitVector() && !Subtarget.hasSSE41())
return SDValue();
// Byte blends are only available in AVX2.
if (VT == MVT::v32i8 && !Subtarget.hasAVX2())
return SDValue();
// There are no 512-bit blend instructions that use sign bits.
if (VT.is512BitVector())
return SDValue();
assert(BitWidth >= 8 && BitWidth <= 64 && "Invalid mask size");
APInt DemandedMask(APInt::getSignMask(BitWidth));
KnownBits Known;
TargetLowering::TargetLoweringOpt TLO(DAG, !DCI.isBeforeLegalize(),
!DCI.isBeforeLegalizeOps());
if (TLI.ShrinkDemandedConstant(Cond, DemandedMask, TLO) ||
TLI.SimplifyDemandedBits(Cond, DemandedMask, Known, TLO)) {
// If we changed the computation somewhere in the DAG, this change will
// affect all users of Cond. Make sure it is fine and update all the nodes
// so that we do not use the generic VSELECT anymore. Otherwise, we may
// perform wrong optimizations as we messed with the actual expectation
// for the vector boolean values.
if (Cond != TLO.Old) {
// Check all uses of the condition operand to check whether it will be
// consumed by non-BLEND instructions. Those may require that all bits
// are set properly.
for (SDNode *U : Cond->uses()) {
// TODO: Add other opcodes eventually lowered into BLEND.
if (U->getOpcode() != ISD::VSELECT)
return SDValue();
}
// Update all users of the condition before committing the change, so
// that the VSELECT optimizations that expect the correct vector boolean
// value will not be triggered.
for (SDNode *U : Cond->uses()) {
SDValue SB = DAG.getNode(X86ISD::SHRUNKBLEND, SDLoc(U),
U->getValueType(0), Cond, U->getOperand(1),
U->getOperand(2));
DAG.ReplaceAllUsesOfValueWith(SDValue(U, 0), SB);
}
DCI.CommitTargetLoweringOpt(TLO);
return SDValue();
}
// Only Cond (rather than other nodes in the computation chain) was
// changed. Change the condition just for N to keep the opportunity to
// optimize all the other users in their own way.
SDValue SB = DAG.getNode(X86ISD::SHRUNKBLEND, DL, VT, TLO.New, LHS, RHS);
DAG.ReplaceAllUsesOfValueWith(SDValue(N, 0), SB);
return SDValue();
}
}
// Look for vselects with LHS/RHS being bitcasted from an operation that
// can be executed on another type. Push the bitcast to the inputs of
// the operation. This exposes opportunities for using masking instructions.
if (N->getOpcode() == ISD::VSELECT && DCI.isAfterLegalizeVectorOps() &&
CondVT.getVectorElementType() == MVT::i1) {
if (combineBitcastForMaskedOp(LHS, DAG, DCI))
return SDValue(N, 0);
if (combineBitcastForMaskedOp(RHS, DAG, DCI))
return SDValue(N, 0);
}
// Custom action for SELECT MMX
if (VT == MVT::x86mmx) {
LHS = DAG.getBitcast(MVT::i64, LHS);
RHS = DAG.getBitcast(MVT::i64, RHS);
SDValue newSelect = DAG.getNode(ISD::SELECT, DL, MVT::i64, Cond, LHS, RHS);
return DAG.getBitcast(VT, newSelect);
}
return SDValue();
}
/// Combine:
/// (brcond/cmov/setcc .., (cmp (atomic_load_add x, 1), 0), COND_S)
/// to:
/// (brcond/cmov/setcc .., (LADD x, 1), COND_LE)
/// i.e., reusing the EFLAGS produced by the LOCKed instruction.
/// Note that this is only legal for some op/cc combinations.
static SDValue combineSetCCAtomicArith(SDValue Cmp, X86::CondCode &CC,
SelectionDAG &DAG) {
// This combine only operates on CMP-like nodes.
if (!(Cmp.getOpcode() == X86ISD::CMP ||
(Cmp.getOpcode() == X86ISD::SUB && !Cmp->hasAnyUseOfValue(0))))
return SDValue();
// Can't replace the cmp if it has more uses than the one we're looking at.
// FIXME: We would like to be able to handle this, but would need to make sure
// all uses were updated.
if (!Cmp.hasOneUse())
return SDValue();
// This only applies to variations of the common case:
// (icmp slt x, 0) -> (icmp sle (add x, 1), 0)
// (icmp sge x, 0) -> (icmp sgt (add x, 1), 0)
// (icmp sle x, 0) -> (icmp slt (sub x, 1), 0)
// (icmp sgt x, 0) -> (icmp sge (sub x, 1), 0)
// Using the proper condcodes (see below), overflow is checked for.
// FIXME: We can generalize both constraints:
// - XOR/OR/AND (if they were made to survive AtomicExpand)
// - LHS != 1
// if the result is compared.
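// For example: the CMP against 0 sees the value returned by
// (atomic_load_add x, 1), i.e. the old value x, while LOCK ADD sets EFLAGS
// from x + 1. "x < 0" (COND_S) is therefore rewritten as "x + 1 <= 0"
// (COND_LE); since the signed condition codes consult OF, overflow of
// x + 1 is accounted for.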
SDValue CmpLHS = Cmp.getOperand(0);
SDValue CmpRHS = Cmp.getOperand(1);
if (!CmpLHS.hasOneUse())
return SDValue();
auto *CmpRHSC = dyn_cast<ConstantSDNode>(CmpRHS);
if (!CmpRHSC || CmpRHSC->getZExtValue() != 0)
return SDValue();
const unsigned Opc = CmpLHS.getOpcode();
if (Opc != ISD::ATOMIC_LOAD_ADD && Opc != ISD::ATOMIC_LOAD_SUB)
return SDValue();
SDValue OpRHS = CmpLHS.getOperand(2);
auto *OpRHSC = dyn_cast<ConstantSDNode>(OpRHS);
if (!OpRHSC)
return SDValue();
APInt Addend = OpRHSC->getAPIntValue();
if (Opc == ISD::ATOMIC_LOAD_SUB)
Addend = -Addend;
if (CC == X86::COND_S && Addend == 1)
CC = X86::COND_LE;
else if (CC == X86::COND_NS && Addend == 1)
CC = X86::COND_G;
else if (CC == X86::COND_G && Addend == -1)
CC = X86::COND_GE;
else if (CC == X86::COND_LE && Addend == -1)
CC = X86::COND_L;
else
return SDValue();
SDValue LockOp = lowerAtomicArithWithLOCK(CmpLHS, DAG);
DAG.ReplaceAllUsesOfValueWith(CmpLHS.getValue(0),
DAG.getUNDEF(CmpLHS.getValueType()));
DAG.ReplaceAllUsesOfValueWith(CmpLHS.getValue(1), LockOp.getValue(1));
return LockOp;
}
// Check whether a boolean test is testing a boolean value generated by
// X86ISD::SETCC. If so, return the operand of that SETCC and the proper
// condition code.
//
// Simplify the following patterns:
// (Op (CMP (SETCC Cond EFLAGS) 1) EQ) or
// (Op (CMP (SETCC Cond EFLAGS) 0) NEQ)
// to (Op EFLAGS Cond)
//
// (Op (CMP (SETCC Cond EFLAGS) 0) EQ) or
// (Op (CMP (SETCC Cond EFLAGS) 1) NEQ)
// to (Op EFLAGS !Cond)
//
// where Op could be BRCOND or CMOV.
//
static SDValue checkBoolTestSetCCCombine(SDValue Cmp, X86::CondCode &CC) {
// This combine only operates on CMP-like nodes.
if (!(Cmp.getOpcode() == X86ISD::CMP ||
(Cmp.getOpcode() == X86ISD::SUB && !Cmp->hasAnyUseOfValue(0))))
return SDValue();
// Quit if not used as a boolean value.
if (CC != X86::COND_E && CC != X86::COND_NE)
return SDValue();
// Check CMP operands. One of them should be 0 or 1 and the other should be
// an SetCC or extended from it.
SDValue Op1 = Cmp.getOperand(0);
SDValue Op2 = Cmp.getOperand(1);
SDValue SetCC;
const ConstantSDNode* C = nullptr;
bool needOppositeCond = (CC == X86::COND_E);
bool checkAgainstTrue = false; // Is it a comparison against 1?
if ((C = dyn_cast<ConstantSDNode>(Op1)))
SetCC = Op2;
else if ((C = dyn_cast<ConstantSDNode>(Op2)))
SetCC = Op1;
else // Quit if neither operand is a constant.
return SDValue();
if (C->getZExtValue() == 1) {
needOppositeCond = !needOppositeCond;
checkAgainstTrue = true;
} else if (C->getZExtValue() != 0)
// Quit if the constant is neither 0 nor 1.
return SDValue();
bool truncatedToBoolWithAnd = false;
// Skip (zext $x), (trunc $x), or (and $x, 1) node.
while (SetCC.getOpcode() == ISD::ZERO_EXTEND ||
SetCC.getOpcode() == ISD::TRUNCATE ||
SetCC.getOpcode() == ISD::AND) {
if (SetCC.getOpcode() == ISD::AND) {
int OpIdx = -1;
if (isOneConstant(SetCC.getOperand(0)))
OpIdx = 1;
if (isOneConstant(SetCC.getOperand(1)))
OpIdx = 0;
if (OpIdx < 0)
break;
SetCC = SetCC.getOperand(OpIdx);
truncatedToBoolWithAnd = true;
} else
SetCC = SetCC.getOperand(0);
}
switch (SetCC.getOpcode()) {
case X86ISD::SETCC_CARRY:
// Since SETCC_CARRY gives output based on R = CF ? ~0 : 0, it's unsafe to
// simplify it if the result of SETCC_CARRY is not canonicalized to 0 or 1,
// i.e. it's a comparison against true but the result of SETCC_CARRY is not
// truncated to i1 using 'and'.
if (checkAgainstTrue && !truncatedToBoolWithAnd)
break;
assert(X86::CondCode(SetCC.getConstantOperandVal(0)) == X86::COND_B &&
"Invalid use of SETCC_CARRY!");
LLVM_FALLTHROUGH;
case X86ISD::SETCC:
// Set the condition code or opposite one if necessary.
CC = X86::CondCode(SetCC.getConstantOperandVal(0));
if (needOppositeCond)
CC = X86::GetOppositeBranchCondition(CC);
return SetCC.getOperand(1);
case X86ISD::CMOV: {
// Check whether the false/true values are canonical, i.e. 0 or 1.
ConstantSDNode *FVal = dyn_cast<ConstantSDNode>(SetCC.getOperand(0));
ConstantSDNode *TVal = dyn_cast<ConstantSDNode>(SetCC.getOperand(1));
// Quit if true value is not a constant.
if (!TVal)
return SDValue();
// Quit if false value is not a constant.
if (!FVal) {
SDValue Op = SetCC.getOperand(0);
// Skip 'zext' or 'trunc' node.
if (Op.getOpcode() == ISD::ZERO_EXTEND ||
Op.getOpcode() == ISD::TRUNCATE)
Op = Op.getOperand(0);
// A special case for rdrand/rdseed, which set the result to 0 when the
// false condition is found.
if ((Op.getOpcode() != X86ISD::RDRAND &&
Op.getOpcode() != X86ISD::RDSEED) || Op.getResNo() != 0)
return SDValue();
}
// Quit if false value is not the constant 0 or 1.
bool FValIsFalse = true;
if (FVal && FVal->getZExtValue() != 0) {
if (FVal->getZExtValue() != 1)
return SDValue();
// If FVal is 1, opposite cond is needed.
needOppositeCond = !needOppositeCond;
FValIsFalse = false;
}
// Quit if TVal is not the constant opposite of FVal.
if (FValIsFalse && TVal->getZExtValue() != 1)
return SDValue();
if (!FValIsFalse && TVal->getZExtValue() != 0)
return SDValue();
CC = X86::CondCode(SetCC.getConstantOperandVal(2));
if (needOppositeCond)
CC = X86::GetOppositeBranchCondition(CC);
return SetCC.getOperand(3);
}
}
return SDValue();
}
/// Check whether Cond is an AND/OR of SETCCs off of the same EFLAGS.
/// Match:
/// (X86or (X86setcc) (X86setcc))
/// (X86cmp (and (X86setcc) (X86setcc)), 0)
static bool checkBoolTestAndOrSetCCCombine(SDValue Cond, X86::CondCode &CC0,
X86::CondCode &CC1, SDValue &Flags,
bool &isAnd) {
if (Cond->getOpcode() == X86ISD::CMP) {
if (!isNullConstant(Cond->getOperand(1)))
return false;
Cond = Cond->getOperand(0);
}
isAnd = false;
SDValue SetCC0, SetCC1;
switch (Cond->getOpcode()) {
default: return false;
case ISD::AND:
case X86ISD::AND:
isAnd = true;
LLVM_FALLTHROUGH;
case ISD::OR:
case X86ISD::OR:
SetCC0 = Cond->getOperand(0);
SetCC1 = Cond->getOperand(1);
break;
}
// Make sure we have SETCC nodes, using the same flags value.
if (SetCC0.getOpcode() != X86ISD::SETCC ||
SetCC1.getOpcode() != X86ISD::SETCC ||
SetCC0->getOperand(1) != SetCC1->getOperand(1))
return false;
CC0 = (X86::CondCode)SetCC0->getConstantOperandVal(0);
CC1 = (X86::CondCode)SetCC1->getConstantOperandVal(0);
Flags = SetCC0->getOperand(1);
return true;
}
/// Optimize an EFLAGS definition used according to the condition code \p CC
/// into a simpler EFLAGS value, potentially returning a new \p CC and replacing
/// uses of chain values.
static SDValue combineSetCCEFLAGS(SDValue EFLAGS, X86::CondCode &CC,
SelectionDAG &DAG) {
if (SDValue R = checkBoolTestSetCCCombine(EFLAGS, CC))
return R;
return combineSetCCAtomicArith(EFLAGS, CC, DAG);
}
/// Optimize X86ISD::CMOV [LHS, RHS, CONDCODE (e.g. X86::COND_NE), CONDVAL]
static SDValue combineCMov(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
// If the flag operand isn't dead, don't touch this CMOV.
if (N->getNumValues() == 2 && !SDValue(N, 1).use_empty())
return SDValue();
SDValue FalseOp = N->getOperand(0);
SDValue TrueOp = N->getOperand(1);
X86::CondCode CC = (X86::CondCode)N->getConstantOperandVal(2);
SDValue Cond = N->getOperand(3);
if (CC == X86::COND_E || CC == X86::COND_NE) {
switch (Cond.getOpcode()) {
default: break;
case X86ISD::BSR:
case X86ISD::BSF:
// If operand of BSR / BSF are proven never zero, then ZF cannot be set.
if (DAG.isKnownNeverZero(Cond.getOperand(0)))
return (CC == X86::COND_E) ? FalseOp : TrueOp;
}
}
// Try to simplify the EFLAGS and condition code operands.
// We can't always do this as FCMOV only supports a subset of the X86
// condition codes.
if (SDValue Flags = combineSetCCEFLAGS(Cond, CC, DAG)) {
if (FalseOp.getValueType() != MVT::f80 || hasFPCMov(CC)) {
SDValue Ops[] = {FalseOp, TrueOp, DAG.getConstant(CC, DL, MVT::i8),
Flags};
return DAG.getNode(X86ISD::CMOV, DL, N->getVTList(), Ops);
}
}
// If this is a select between two integer constants, try to do some
// optimizations. Note that the operands are ordered the opposite of SELECT
// operands.
if (ConstantSDNode *TrueC = dyn_cast<ConstantSDNode>(TrueOp)) {
if (ConstantSDNode *FalseC = dyn_cast<ConstantSDNode>(FalseOp)) {
// Canonicalize the TrueC/FalseC values so that TrueC (the true value) is
// larger than FalseC (the false value).
if (TrueC->getAPIntValue().ult(FalseC->getAPIntValue())) {
CC = X86::GetOppositeBranchCondition(CC);
std::swap(TrueC, FalseC);
std::swap(TrueOp, FalseOp);
}
// Optimize C ? 8 : 0 -> zext(setcc(C)) << 3. Likewise for any pow2/0.
// This is efficient for any integer data type (including i8/i16) and
// shift amount.
if (FalseC->getAPIntValue() == 0 && TrueC->getAPIntValue().isPowerOf2()) {
Cond = getSETCC(CC, Cond, DL, DAG);
// Zero extend the condition if needed.
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, TrueC->getValueType(0), Cond);
unsigned ShAmt = TrueC->getAPIntValue().logBase2();
Cond = DAG.getNode(ISD::SHL, DL, Cond.getValueType(), Cond,
DAG.getConstant(ShAmt, DL, MVT::i8));
if (N->getNumValues() == 2) // Dead flag value?
return DCI.CombineTo(N, Cond, SDValue());
return Cond;
}
// Optimize Cond ? cst+1 : cst -> zext(setcc(C)) + cst. This is efficient
// for any integer data type, including i8/i16.
if (FalseC->getAPIntValue()+1 == TrueC->getAPIntValue()) {
Cond = getSETCC(CC, Cond, DL, DAG);
// Zero extend the condition if needed.
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL,
FalseC->getValueType(0), Cond);
Cond = DAG.getNode(ISD::ADD, DL, Cond.getValueType(), Cond,
SDValue(FalseC, 0));
if (N->getNumValues() == 2) // Dead flag value?
return DCI.CombineTo(N, Cond, SDValue());
return Cond;
}
// Optimize cases that will turn into an LEA instruction. This requires
// an i32 or i64 and an efficient multiplier (1, 2, 3, 4, 5, 8, 9).
if (N->getValueType(0) == MVT::i32 || N->getValueType(0) == MVT::i64) {
uint64_t Diff = TrueC->getZExtValue()-FalseC->getZExtValue();
if (N->getValueType(0) == MVT::i32) Diff = (unsigned)Diff;
bool isFastMultiplier = false;
if (Diff < 10) {
switch ((unsigned char)Diff) {
default: break;
case 1: // result = add base, cond
case 2: // result = lea base( , cond*2)
case 3: // result = lea base(cond, cond*2)
case 4: // result = lea base( , cond*4)
case 5: // result = lea base(cond, cond*4)
case 8: // result = lea base( , cond*8)
case 9: // result = lea base(cond, cond*8)
isFastMultiplier = true;
break;
}
}
if (isFastMultiplier) {
APInt Diff = TrueC->getAPIntValue()-FalseC->getAPIntValue();
Cond = getSETCC(CC, Cond, DL, DAG);
// Zero extend the condition if needed.
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, FalseC->getValueType(0),
Cond);
// Scale the condition by the difference.
if (Diff != 1)
Cond = DAG.getNode(ISD::MUL, DL, Cond.getValueType(), Cond,
DAG.getConstant(Diff, DL, Cond.getValueType()));
// Add the base if non-zero.
if (FalseC->getAPIntValue() != 0)
Cond = DAG.getNode(ISD::ADD, DL, Cond.getValueType(), Cond,
SDValue(FalseC, 0));
if (N->getNumValues() == 2) // Dead flag value?
return DCI.CombineTo(N, Cond, SDValue());
return Cond;
}
}
}
}
// Handle these cases:
// (select (x != c), e, c) -> (select (x != c), e, x),
// (select (x == c), c, e) -> (select (x == c), x, e)
// where c is an integer constant, and the "select" is the combination
// of CMOV and CMP.
//
// The rationale for this change is that the conditional-move from a constant
// needs two instructions; however, a conditional-move from a register needs
// only one instruction.
//
// CAVEAT: By replacing a constant with a symbolic value, it may obscure
// some instruction-combining opportunities. This opt needs to be
// postponed as late as possible.
//
if (!DCI.isBeforeLegalize() && !DCI.isBeforeLegalizeOps()) {
// The DCI.xxxx conditions are provided to postpone the optimization as
// late as possible.
ConstantSDNode *CmpAgainst = nullptr;
if ((Cond.getOpcode() == X86ISD::CMP || Cond.getOpcode() == X86ISD::SUB) &&
(CmpAgainst = dyn_cast<ConstantSDNode>(Cond.getOperand(1))) &&
!isa<ConstantSDNode>(Cond.getOperand(0))) {
if (CC == X86::COND_NE &&
CmpAgainst == dyn_cast<ConstantSDNode>(FalseOp)) {
CC = X86::GetOppositeBranchCondition(CC);
std::swap(TrueOp, FalseOp);
}
if (CC == X86::COND_E &&
CmpAgainst == dyn_cast<ConstantSDNode>(TrueOp)) {
SDValue Ops[] = { FalseOp, Cond.getOperand(0),
DAG.getConstant(CC, DL, MVT::i8), Cond };
return DAG.getNode(X86ISD::CMOV, DL, N->getVTList(), Ops);
}
}
}
// Fold and/or of setcc's to double CMOV:
// (CMOV F, T, ((cc1 | cc2) != 0)) -> (CMOV (CMOV F, T, cc1), T, cc2)
// (CMOV F, T, ((cc1 & cc2) != 0)) -> (CMOV (CMOV T, F, !cc1), F, !cc2)
//
// This combine lets us generate:
// cmovcc1 (jcc1 if we don't have CMOV)
// cmovcc2 (same)
// instead of:
// setcc1
// setcc2
// and/or
// cmovne (jne if we don't have CMOV)
// When we can't use the CMOV instruction, it might increase branch
// mispredicts.
// When we can use CMOV, or when there is no mispredict, this improves
// throughput and reduces register pressure.
//
if (CC == X86::COND_NE) {
SDValue Flags;
X86::CondCode CC0, CC1;
bool isAndSetCC;
if (checkBoolTestAndOrSetCCCombine(Cond, CC0, CC1, Flags, isAndSetCC)) {
if (isAndSetCC) {
std::swap(FalseOp, TrueOp);
CC0 = X86::GetOppositeBranchCondition(CC0);
CC1 = X86::GetOppositeBranchCondition(CC1);
}
SDValue LOps[] = {FalseOp, TrueOp, DAG.getConstant(CC0, DL, MVT::i8),
Flags};
SDValue LCMOV = DAG.getNode(X86ISD::CMOV, DL, N->getVTList(), LOps);
SDValue Ops[] = {LCMOV, TrueOp, DAG.getConstant(CC1, DL, MVT::i8), Flags};
SDValue CMOV = DAG.getNode(X86ISD::CMOV, DL, N->getVTList(), Ops);
DAG.ReplaceAllUsesOfValueWith(SDValue(N, 1), SDValue(CMOV.getNode(), 1));
return CMOV;
}
}
return SDValue();
}
/// Different mul shrinking modes.
enum ShrinkMode { MULS8, MULU8, MULS16, MULU16 };
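/// Determine whether both operands of a 32-bit vector multiply are known to
/// fit in i8 or i16 (signed or unsigned), and if so report the matching
/// shrinking mode.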
static bool canReduceVMulWidth(SDNode *N, SelectionDAG &DAG, ShrinkMode &Mode) {
EVT VT = N->getOperand(0).getValueType();
if (VT.getScalarSizeInBits() != 32)
return false;
assert(N->getNumOperands() == 2 && "NumOperands of Mul are 2");
unsigned SignBits[2] = {1, 1};
bool IsPositive[2] = {false, false};
for (unsigned i = 0; i < 2; i++) {
SDValue Opd = N->getOperand(i);
// DAG.ComputeNumSignBits returns 1 for ISD::ANY_EXTEND, so we need to
// compute the sign bits for it separately.
if (Opd.getOpcode() == ISD::ANY_EXTEND) {
// For anyextend, it is safe to assume an appropriate number of leading
// sign/zero bits.
if (Opd.getOperand(0).getValueType().getVectorElementType() == MVT::i8)
SignBits[i] = 25;
else if (Opd.getOperand(0).getValueType().getVectorElementType() ==
MVT::i16)
SignBits[i] = 17;
else
return false;
IsPositive[i] = true;
} else if (Opd.getOpcode() == ISD::BUILD_VECTOR) {
// All the operands of BUILD_VECTOR need to be integer constants.
// Find the smallest value range to which all the operands belong.
SignBits[i] = 32;
IsPositive[i] = true;
for (const SDValue &SubOp : Opd.getNode()->op_values()) {
if (SubOp.isUndef())
continue;
auto *CN = dyn_cast<ConstantSDNode>(SubOp);
if (!CN)
return false;
APInt IntVal = CN->getAPIntValue();
if (IntVal.isNegative())
IsPositive[i] = false;
SignBits[i] = std::min(SignBits[i], IntVal.getNumSignBits());
}
} else {
SignBits[i] = DAG.ComputeNumSignBits(Opd);
if (Opd.getOpcode() == ISD::ZERO_EXTEND)
IsPositive[i] = true;
}
}
bool AllPositive = IsPositive[0] && IsPositive[1];
unsigned MinSignBits = std::min(SignBits[0], SignBits[1]);
// When ranges are from -128 ~ 127, use MULS8 mode.
if (MinSignBits >= 25)
Mode = MULS8;
// When ranges are from 0 ~ 255, use MULU8 mode.
else if (AllPositive && MinSignBits >= 24)
Mode = MULU8;
// When ranges are from -32768 ~ 32767, use MULS16 mode.
else if (MinSignBits >= 17)
Mode = MULS16;
// When ranges are from 0 ~ 65535, use MULU16 mode.
else if (AllPositive && MinSignBits >= 16)
Mode = MULU16;
else
return false;
return true;
}
/// When the operands of vector mul are extended from smaller size values,
/// like i8 and i16, the type of the mul may be shrunk to generate more
/// efficient code. Two typical patterns are handled:
/// Pattern1:
/// %2 = sext/zext <N x i8> %1 to <N x i32>
/// %4 = sext/zext <N x i8> %3 to <N x i32>
/// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
/// %5 = mul <N x i32> %2, %4
///
/// Pattern2:
/// %2 = zext/sext <N x i16> %1 to <N x i32>
/// %4 = zext/sext <N x i16> %3 to <N x i32>
/// or %4 = build_vector <N x i32> %C1, ..., %CN (%C1..%CN are constants)
/// %5 = mul <N x i32> %2, %4
///
/// There are four mul shrinking modes:
/// If %2 == sext32(trunc8(%2)), i.e., the scalar value range of %2 is
/// -128 to 127, and the scalar value range of %4 is also -128 to 127,
/// generate pmullw+sext32 for it (MULS8 mode).
/// If %2 == zext32(trunc8(%2)), i.e., the scalar value range of %2 is
/// 0 to 255, and the scalar value range of %4 is also 0 to 255,
/// generate pmullw+zext32 for it (MULU8 mode).
/// If %2 == sext32(trunc16(%2)), i.e., the scalar value range of %2 is
/// -32768 to 32767, and the scalar value range of %4 is also -32768 to 32767,
/// generate pmullw+pmulhw for it (MULS16 mode).
/// If %2 == zext32(trunc16(%2)), i.e., the scalar value range of %2 is
/// 0 to 65535, and the scalar value range of %4 is also 0 to 65535,
/// generate pmullw+pmulhuw for it (MULU16 mode).
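///
/// For example (MULS8 mode with SSE2):
///   %2 = sext <8 x i8> %1 to <8 x i32>
///   %4 = sext <8 x i8> %3 to <8 x i32>
///   %5 = mul <8 x i32> %2, %4
/// is lowered as a pmullw on the operands truncated to <8 x i16>, followed
/// by a sign extension of the 16-bit products back to <8 x i32>.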
static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// Check for legality.
// pmullw/pmulhw require SSE2; they are not available with SSE1 alone.
if (!Subtarget.hasSSE2())
return SDValue();
// Check for profitability.
// pmulld has been available since SSE4.1. It is usually better to use pmulld
// instead of pmullw+pmulhw, except on subtargets where pmulld is slower than
// the expansion.
bool OptForMinSize = DAG.getMachineFunction().getFunction()->optForMinSize();
if (Subtarget.hasSSE41() && (OptForMinSize || !Subtarget.isPMULLDSlow()))
return SDValue();
ShrinkMode Mode;
if (!canReduceVMulWidth(N, DAG, Mode))
return SDValue();
SDLoc DL(N);
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT VT = N->getOperand(0).getValueType();
unsigned RegSize = 128;
MVT OpsVT = MVT::getVectorVT(MVT::i16, RegSize / 16);
EVT ReducedVT =
EVT::getVectorVT(*DAG.getContext(), MVT::i16, VT.getVectorNumElements());
// Shrink the operands of mul.
SDValue NewN0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N0);
SDValue NewN1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, N1);
if (VT.getVectorNumElements() >= OpsVT.getVectorNumElements()) {
// Generate the lower part of mul: pmullw. For MULU8/MULS8, only the
// lower part is needed.
SDValue MulLo = DAG.getNode(ISD::MUL, DL, ReducedVT, NewN0, NewN1);
if (Mode == MULU8 || Mode == MULS8) {
return DAG.getNode((Mode == MULU8) ? ISD::ZERO_EXTEND : ISD::SIGN_EXTEND,
DL, VT, MulLo);
} else {
MVT ResVT = MVT::getVectorVT(MVT::i32, VT.getVectorNumElements() / 2);
// Generate the higher part of mul: pmulhw/pmulhuw. For MULU16/MULS16,
// the higher part is also needed.
SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
ReducedVT, NewN0, NewN1);
// Repack the lower part and higher part result of mul into a wider
// result.
// Generate shuffle functioning as punpcklwd.
SmallVector<int, 16> ShuffleMask(VT.getVectorNumElements());
for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
ShuffleMask[2 * i] = i;
ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements();
}
SDValue ResLo =
DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, ShuffleMask);
ResLo = DAG.getNode(ISD::BITCAST, DL, ResVT, ResLo);
// Generate shuffle functioning as punpckhwd.
for (unsigned i = 0; i < VT.getVectorNumElements() / 2; i++) {
ShuffleMask[2 * i] = i + VT.getVectorNumElements() / 2;
ShuffleMask[2 * i + 1] = i + VT.getVectorNumElements() * 3 / 2;
}
SDValue ResHi =
DAG.getVectorShuffle(ReducedVT, DL, MulLo, MulHi, ShuffleMask);
ResHi = DAG.getNode(ISD::BITCAST, DL, ResVT, ResHi);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
}
} else {
// When VT.getVectorNumElements() < OpsVT.getVectorNumElements(), we want
// to legalize the mul explicitly because implicit legalization for type
// <4 x i16> to <4 x i32> sometimes involves unnecessary unpack
// instructions which will not exist when we explicitly legalize it by
// extending <4 x i16> to <8 x i16> (concatenating the <4 x i16> val with
// <4 x i16> undef).
//
// Legalize the operands of mul.
// FIXME: We may be able to handle non-concatenated vectors by insertion.
unsigned ReducedSizeInBits = ReducedVT.getSizeInBits();
if ((RegSize % ReducedSizeInBits) != 0)
return SDValue();
SmallVector<SDValue, 16> Ops(RegSize / ReducedSizeInBits,
DAG.getUNDEF(ReducedVT));
Ops[0] = NewN0;
NewN0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);
Ops[0] = NewN1;
NewN1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, OpsVT, Ops);
if (Mode == MULU8 || Mode == MULS8) {
// Generate lower part of mul: pmullw. For MULU8/MULS8, only the lower
// part is needed.
SDValue Mul = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);
// Convert the type of the mul result to VT.
MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
SDValue Res = DAG.getNode(Mode == MULU8 ? ISD::ZERO_EXTEND_VECTOR_INREG
: ISD::SIGN_EXTEND_VECTOR_INREG,
DL, ResVT, Mul);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
DAG.getIntPtrConstant(0, DL));
} else {
// Generate the lower and higher part of mul: pmulhw/pmulhuw. For
// MULU16/MULS16, both parts are needed.
SDValue MulLo = DAG.getNode(ISD::MUL, DL, OpsVT, NewN0, NewN1);
SDValue MulHi = DAG.getNode(Mode == MULS16 ? ISD::MULHS : ISD::MULHU, DL,
OpsVT, NewN0, NewN1);
// Repack the lower part and higher part result of mul into a wider
// result. Make sure the type of mul result is VT.
MVT ResVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
SDValue Res = DAG.getNode(X86ISD::UNPCKL, DL, OpsVT, MulLo, MulHi);
Res = DAG.getNode(ISD::BITCAST, DL, ResVT, Res);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, Res,
DAG.getIntPtrConstant(0, DL));
}
}
}
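/// Expand a handful of specific constant multiplies (11, 13, 14, 19, 21, 22,
/// 23, 26, 28, 29, 30) into short LEA-style multiply/shift/add/sub sequences
/// that avoid a full imul.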
static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
EVT VT, SDLoc DL) {
auto combineMulShlAddOrSub = [&](int Mult, int Shift, bool isAdd) {
SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
DAG.getConstant(Mult, DL, VT));
Result = DAG.getNode(ISD::SHL, DL, VT, Result,
DAG.getConstant(Shift, DL, MVT::i8));
Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
N->getOperand(0));
return Result;
};
auto combineMulMulAddOrSub = [&](bool isAdd) {
SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
DAG.getConstant(9, DL, VT));
Result = DAG.getNode(ISD::MUL, DL, VT, Result, DAG.getConstant(3, DL, VT));
Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
N->getOperand(0));
return Result;
};
switch (MulAmt) {
default:
break;
case 11:
// mul x, 11 => add ((shl (mul x, 5), 1), x)
return combineMulShlAddOrSub(5, 1, /*isAdd*/ true);
case 21:
// mul x, 21 => add ((shl (mul x, 5), 2), x)
return combineMulShlAddOrSub(5, 2, /*isAdd*/ true);
case 22:
// mul x, 22 => add (add ((shl (mul x, 5), 2), x), x)
return DAG.getNode(ISD::ADD, DL, VT, N->getOperand(0),
combineMulShlAddOrSub(5, 2, /*isAdd*/ true));
case 19:
// mul x, 19 => sub ((shl (mul x, 5), 2), x)
return combineMulShlAddOrSub(5, 2, /*isAdd*/ false);
case 13:
// mul x, 13 => add ((shl (mul x, 3), 2), x)
return combineMulShlAddOrSub(3, 2, /*isAdd*/ true);
case 23:
// mul x, 23 => sub ((shl (mul x, 3), 3), x)
return combineMulShlAddOrSub(3, 3, /*isAdd*/ false);
case 14:
// mul x, 14 => add (add ((shl (mul x, 3), 2), x), x)
return DAG.getNode(ISD::ADD, DL, VT, N->getOperand(0),
combineMulShlAddOrSub(3, 2, /*isAdd*/ true));
case 26:
// mul x, 26 => sub ((mul (mul x, 9), 3), x)
return combineMulMulAddOrSub(/*isAdd*/ false);
case 28:
// mul x, 28 => add ((mul (mul x, 9), 3), x)
return combineMulMulAddOrSub(/*isAdd*/ true);
case 29:
// mul x, 29 => add (add ((mul (mul x, 9), 3), x), x)
return DAG.getNode(ISD::ADD, DL, VT, N->getOperand(0),
combineMulMulAddOrSub(/*isAdd*/ true));
case 30:
// mul x, 30 => sub (sub ((shl x, 5), x), x)
return DAG.getNode(
ISD::SUB, DL, VT,
DAG.getNode(ISD::SUB, DL, VT,
DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(5, DL, MVT::i8)),
N->getOperand(0)),
N->getOperand(0));
}
return SDValue();
}
/// Optimize a single multiply with constant into two operations in order to
/// implement it with two cheaper instructions, e.g. LEA + SHL, LEA + LEA.
static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
if (DCI.isBeforeLegalize() && VT.isVector())
return reduceVMULWidth(N, DAG, Subtarget);
if (!MulConstantOptimization)
return SDValue();
// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())
return SDValue();
if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
return SDValue();
if (VT != MVT::i64 && VT != MVT::i32)
return SDValue();
ConstantSDNode *C = dyn_cast<ConstantSDNode>(N->getOperand(1));
if (!C)
return SDValue();
uint64_t MulAmt = C->getZExtValue();
if (isPowerOf2_64(MulAmt) || MulAmt == 3 || MulAmt == 5 || MulAmt == 9)
return SDValue();
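// Factor MulAmt as (9|5|3) * MulAmt2 so that each factor can be lowered to
// an LEA (base plus index scaled by 8/4/2) or, if a power of 2, to a shift.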
uint64_t MulAmt1 = 0;
uint64_t MulAmt2 = 0;
if ((MulAmt % 9) == 0) {
MulAmt1 = 9;
MulAmt2 = MulAmt / 9;
} else if ((MulAmt % 5) == 0) {
MulAmt1 = 5;
MulAmt2 = MulAmt / 5;
} else if ((MulAmt % 3) == 0) {
MulAmt1 = 3;
MulAmt2 = MulAmt / 3;
}
SDLoc DL(N);
SDValue NewMul;
if (MulAmt2 &&
(isPowerOf2_64(MulAmt2) || MulAmt2 == 3 || MulAmt2 == 5 || MulAmt2 == 9)){
if (isPowerOf2_64(MulAmt2) &&
!(N->hasOneUse() && N->use_begin()->getOpcode() == ISD::ADD))
// If the second multiplier is pow2, issue it first. We want the multiply
// by 3, 5, or 9 to be folded into the addressing mode unless the lone use
// is an add.
std::swap(MulAmt1, MulAmt2);
if (isPowerOf2_64(MulAmt1))
NewMul = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(Log2_64(MulAmt1), DL, MVT::i8));
else
NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
DAG.getConstant(MulAmt1, DL, VT));
if (isPowerOf2_64(MulAmt2))
NewMul = DAG.getNode(ISD::SHL, DL, VT, NewMul,
DAG.getConstant(Log2_64(MulAmt2), DL, MVT::i8));
else
NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, NewMul,
DAG.getConstant(MulAmt2, DL, VT));
} else if (!Subtarget.slowLEA())
NewMul = combineMulSpecial(MulAmt, N, DAG, VT, DL);
if (!NewMul) {
assert(MulAmt != 0 &&
MulAmt != (VT == MVT::i64 ? UINT64_MAX : UINT32_MAX) &&
"Both cases that could cause potential overflows should have "
"already been handled.");
int64_t SignMulAmt = C->getSExtValue();
if ((SignMulAmt != INT64_MIN) && (SignMulAmt != INT64_MAX) &&
(SignMulAmt != -INT64_MAX)) {
int NumSign = SignMulAmt > 0 ? 1 : -1;
bool IsPowerOf2_64PlusOne = isPowerOf2_64(NumSign * SignMulAmt - 1);
bool IsPowerOf2_64MinusOne = isPowerOf2_64(NumSign * SignMulAmt + 1);
if (IsPowerOf2_64PlusOne) {
// (mul x, 2^N + 1) => (add (shl x, N), x)
NewMul = DAG.getNode(
ISD::ADD, DL, VT, N->getOperand(0),
DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(Log2_64(NumSign * SignMulAmt - 1), DL,
MVT::i8)));
} else if (IsPowerOf2_64MinusOne) {
// (mul x, 2^N - 1) => (sub (shl x, N), x)
NewMul = DAG.getNode(
ISD::SUB, DL, VT,
DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(Log2_64(NumSign * SignMulAmt + 1), DL,
MVT::i8)),
N->getOperand(0));
}
// To negate, subtract the number from zero
if ((IsPowerOf2_64PlusOne || IsPowerOf2_64MinusOne) && NumSign == -1)
NewMul =
DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), NewMul);
}
}
if (NewMul)
// Do not add new nodes to DAG combiner worklist.
DCI.CombineTo(N, NewMul, false);
return SDValue();
}
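/// Fold (shl (and (setcc_c), c1), c2) into a single AND when the shifted mask
/// is known to preserve semantics, and turn a vector shift-left-by-one into
/// an ADD of the value with itself.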
static SDValue combineShiftLeft(SDNode *N, SelectionDAG &DAG) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
ConstantSDNode *N1C = dyn_cast<ConstantSDNode>(N1);
EVT VT = N0.getValueType();
// fold (shl (and (setcc_c), c1), c2) -> (and setcc_c, (c1 << c2))
// since the result of setcc_c is all zero's or all ones.
if (VT.isInteger() && !VT.isVector() &&
N1C && N0.getOpcode() == ISD::AND &&
N0.getOperand(1).getOpcode() == ISD::Constant) {
SDValue N00 = N0.getOperand(0);
APInt Mask = cast<ConstantSDNode>(N0.getOperand(1))->getAPIntValue();
Mask <<= N1C->getAPIntValue();
bool MaskOK = false;
// We can handle cases concerning bit-widening nodes containing setcc_c if
// we carefully interrogate the mask to make sure we are semantics
// preserving.
// The transform is not safe if the result of C1 << C2 exceeds the bitwidth
// of the underlying setcc_c operation if the setcc_c was zero extended.
// Consider the following example:
// zext(setcc_c) -> i32 0x0000FFFF
// c1 -> i32 0x0000FFFF
// c2 -> i32 0x00000001
// (shl (and (setcc_c), c1), c2) -> i32 0x0001FFFE
// (and setcc_c, (c1 << c2)) -> i32 0x0000FFFE
if (N00.getOpcode() == X86ISD::SETCC_CARRY) {
MaskOK = true;
} else if (N00.getOpcode() == ISD::SIGN_EXTEND &&
N00.getOperand(0).getOpcode() == X86ISD::SETCC_CARRY) {
MaskOK = true;
} else if ((N00.getOpcode() == ISD::ZERO_EXTEND ||
N00.getOpcode() == ISD::ANY_EXTEND) &&
N00.getOperand(0).getOpcode() == X86ISD::SETCC_CARRY) {
MaskOK = Mask.isIntN(N00.getOperand(0).getValueSizeInBits());
}
if (MaskOK && Mask != 0) {
SDLoc DL(N);
return DAG.getNode(ISD::AND, DL, VT, N00, DAG.getConstant(Mask, DL, VT));
}
}
// Hardware support for vector shifts is sparse which makes us scalarize the
// vector operations in many cases. Also, on sandybridge ADD is faster than
// shl.
// (shl V, 1) -> add V,V
if (auto *N1BV = dyn_cast<BuildVectorSDNode>(N1))
if (auto *N1SplatC = N1BV->getConstantSplatNode()) {
assert(N0.getValueType().isVector() && "Invalid vector shift type");
// We shift all of the values by one. In many cases we do not have
// hardware support for this operation. This is better expressed as an ADD
// of two values.
if (N1SplatC->getAPIntValue() == 1)
return DAG.getNode(ISD::ADD, SDLoc(N), VT, N0, N0);
}
return SDValue();
}
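/// Fold (sra (shl X, C1), C2) into a SIGN_EXTEND_INREG of X followed by a
/// smaller (or no) shift when the shl isolates a standard-width subvalue.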
static SDValue combineShiftRightAlgebraic(SDNode *N, SelectionDAG &DAG) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT VT = N0.getValueType();
unsigned Size = VT.getSizeInBits();
// fold (ashr (shl, a, [56,48,32,24,16]), SarConst)
// into (shl, (sext (a), [56,48,32,24,16] - SarConst)) or
// into (lshr, (sext (a), SarConst - [56,48,32,24,16]))
// depending on sign of (SarConst - [56,48,32,24,16])
// sexts in X86 are MOVs. The MOVs have the same code size
// as the above SHIFTs (only a SHIFT by 1 has smaller code size).
// However, the MOVs have two advantages over a SHIFT:
// 1. A MOV can write to a destination register that differs from its source.
// 2. A MOV accepts memory operands.
if (!VT.isInteger() || VT.isVector() || N1.getOpcode() != ISD::Constant ||
N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
N0.getOperand(1).getOpcode() != ISD::Constant)
return SDValue();
SDValue N00 = N0.getOperand(0);
SDValue N01 = N0.getOperand(1);
APInt ShlConst = (cast<ConstantSDNode>(N01))->getAPIntValue();
APInt SarConst = (cast<ConstantSDNode>(N1))->getAPIntValue();
EVT CVT = N1.getValueType();
if (SarConst.isNegative())
return SDValue();
for (MVT SVT : MVT::integer_valuetypes()) {
unsigned ShiftSize = SVT.getSizeInBits();
// Skip types without a corresponding sext/zext and ShlConst values that
// are not one of [56,48,32,24,16].
if (ShiftSize < 8 || ShiftSize > 64 || ShlConst != Size - ShiftSize)
continue;
SDLoc DL(N);
SDValue NN =
DAG.getNode(ISD::SIGN_EXTEND_INREG, DL, VT, N00, DAG.getValueType(SVT));
SarConst = SarConst - (Size - ShiftSize);
if (SarConst == 0)
return NN;
else if (SarConst.isNegative())
return DAG.getNode(ISD::SHL, DL, VT, NN,
DAG.getConstant(-SarConst, DL, CVT));
else
return DAG.getNode(ISD::SRA, DL, VT, NN,
DAG.getConstant(SarConst, DL, CVT));
}
return SDValue();
}
/// \brief Returns a vector of 0s if the node in input is a vector logical
/// shift by a constant amount which is known to be bigger than or equal
/// to the vector element size in bits.
static SDValue performShiftToAllZeros(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
if (VT != MVT::v2i64 && VT != MVT::v4i32 && VT != MVT::v8i16 &&
(!Subtarget.hasInt256() ||
(VT != MVT::v4i64 && VT != MVT::v8i32 && VT != MVT::v16i16)))
return SDValue();
SDValue Amt = N->getOperand(1);
SDLoc DL(N);
if (auto *AmtBV = dyn_cast<BuildVectorSDNode>(Amt))
if (auto *AmtSplat = AmtBV->getConstantSplatNode()) {
const APInt &ShiftAmt = AmtSplat->getAPIntValue();
unsigned MaxAmount =
VT.getSimpleVT().getScalarSizeInBits();
// SSE2/AVX2 logical shifts always return a vector of 0s
// if the shift amount is bigger than or equal to
// the element size. The constant shift amount will be
// encoded as an 8-bit immediate.
if (ShiftAmt.trunc(8).uge(MaxAmount))
return getZeroVector(VT.getSimpleVT(), Subtarget, DAG, DL);
}
return SDValue();
}
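/// Dispatch SHL/SRA/SRL nodes to the dedicated shift combines above.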
static SDValue combineShift(SDNode* N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (N->getOpcode() == ISD::SHL)
if (SDValue V = combineShiftLeft(N, DAG))
return V;
if (N->getOpcode() == ISD::SRA)
if (SDValue V = combineShiftRightAlgebraic(N, DAG))
return V;
// Try to fold this logical shift into a zero vector.
if (N->getOpcode() != ISD::SRA)
if (SDValue V = performShiftToAllZeros(N, DAG, Subtarget))
return V;
return SDValue();
}
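/// Simplify X86 vector shifts by immediate: fold out-of-range, zero-amount,
/// and all-zero-input shifts, constant-fold through build vectors, and
/// decode whole-byte logical shifts as shuffles.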
static SDValue combineVectorShiftImm(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
unsigned Opcode = N->getOpcode();
assert((X86ISD::VSHLI == Opcode || X86ISD::VSRAI == Opcode ||
X86ISD::VSRLI == Opcode) &&
"Unexpected shift opcode");
bool LogicalShift = X86ISD::VSHLI == Opcode || X86ISD::VSRLI == Opcode;
EVT VT = N->getValueType(0);
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
unsigned NumBitsPerElt = VT.getScalarSizeInBits();
assert(VT == N0.getValueType() && (NumBitsPerElt % 8) == 0 &&
"Unexpected value type");
// Out of range logical bit shifts are guaranteed to be zero.
// Out of range arithmetic bit shifts splat the sign bit.
APInt ShiftVal = cast<ConstantSDNode>(N1)->getAPIntValue();
if (ShiftVal.zextOrTrunc(8).uge(NumBitsPerElt)) {
if (LogicalShift)
return getZeroVector(VT.getSimpleVT(), Subtarget, DAG, SDLoc(N));
else
ShiftVal = NumBitsPerElt - 1;
}
// Shift N0 by zero -> N0.
if (!ShiftVal)
return N0;
// Shift zero -> zero.
if (ISD::isBuildVectorAllZeros(N0.getNode()))
return getZeroVector(VT.getSimpleVT(), Subtarget, DAG, SDLoc(N));
// fold (VSRLI (VSRAI X, Y), 31) -> (VSRLI X, 31).
// This VSRLI only looks at the sign bit, which is unmodified by VSRAI.
// TODO - support other sra opcodes as needed.
if (Opcode == X86ISD::VSRLI && (ShiftVal + 1) == NumBitsPerElt &&
N0.getOpcode() == X86ISD::VSRAI)
return DAG.getNode(X86ISD::VSRLI, SDLoc(N), VT, N0.getOperand(0), N1);
// We can decode 'whole byte' logical bit shifts as shuffles.
if (LogicalShift && (ShiftVal.getZExtValue() % 8) == 0) {
SDValue Op(N, 0);
SmallVector<int, 1> NonceMask; // Just a placeholder.
NonceMask.push_back(0);
if (combineX86ShufflesRecursively({Op}, 0, Op, NonceMask, {},
/*Depth*/ 1, /*HasVarMask*/ false, DAG,
DCI, Subtarget))
return SDValue(); // This routine will use CombineTo to replace N.
}
// Constant Folding.
APInt UndefElts;
SmallVector<APInt, 32> EltBits;
if (N->isOnlyUserOf(N0.getNode()) &&
getTargetConstantBitsFromNode(N0, NumBitsPerElt, UndefElts, EltBits)) {
assert(EltBits.size() == VT.getVectorNumElements() &&
"Unexpected shift value type");
unsigned ShiftImm = ShiftVal.getZExtValue();
for (APInt &Elt : EltBits) {
if (X86ISD::VSHLI == Opcode)
Elt <<= ShiftImm;
else if (X86ISD::VSRAI == Opcode)
Elt.ashrInPlace(ShiftImm);
else
Elt.lshrInPlace(ShiftImm);
}
return getConstVector(EltBits, UndefElts, VT.getSimpleVT(), DAG, SDLoc(N));
}
return SDValue();
}
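/// Attempt to fold a PINSRB/PINSRW node into an equivalent target shuffle.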
static SDValue combineVectorInsert(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
assert(
((N->getOpcode() == X86ISD::PINSRB && N->getValueType(0) == MVT::v16i8) ||
(N->getOpcode() == X86ISD::PINSRW &&
N->getValueType(0) == MVT::v8i16)) &&
"Unexpected vector insertion");
// Attempt to combine PINSRB/PINSRW patterns to a shuffle.
SDValue Op(N, 0);
SmallVector<int, 1> NonceMask; // Just a placeholder.
NonceMask.push_back(0);
combineX86ShufflesRecursively({Op}, 0, Op, NonceMask, {},
/*Depth*/ 1, /*HasVarMask*/ false, DAG,
DCI, Subtarget);
return SDValue();
}
/// Recognize the distinctive (AND (setcc ...) (setcc ..)) where both setccs
/// reference the same FP CMP, and rewrite for CMPEQSS and friends. Likewise for
/// OR -> CMPNEQSS.
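/// For example, (and (setcc COND_E, (fcmp X, Y)), (setcc COND_NP, (fcmp X, Y)))
/// is the x86 lowering of an ordered-equal test and maps to CMPEQSS (SSE CC 0).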
static SDValue combineCompareEqual(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
unsigned opcode;
// SSE1 supports CMP{eq|ne}SS, and SSE2 added CMP{eq|ne}SD, but
// we're requiring SSE2 for both.
if (Subtarget.hasSSE2() && isAndOrOfSetCCs(SDValue(N, 0U), opcode)) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDValue CMP0 = N0->getOperand(1);
SDValue CMP1 = N1->getOperand(1);
SDLoc DL(N);
// The SETCCs should both refer to the same CMP.
if (CMP0.getOpcode() != X86ISD::CMP || CMP0 != CMP1)
return SDValue();
SDValue CMP00 = CMP0->getOperand(0);
SDValue CMP01 = CMP0->getOperand(1);
EVT VT = CMP00.getValueType();
if (VT == MVT::f32 || VT == MVT::f64) {
bool ExpectingFlags = false;
// Check for any users that want flags:
for (SDNode::use_iterator UI = N->use_begin(), UE = N->use_end();
!ExpectingFlags && UI != UE; ++UI)
switch (UI->getOpcode()) {
default:
case ISD::BR_CC:
case ISD::BRCOND:
case ISD::SELECT:
ExpectingFlags = true;
break;
case ISD::CopyToReg:
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:
break;
}
if (!ExpectingFlags) {
enum X86::CondCode cc0 = (enum X86::CondCode)N0.getConstantOperandVal(0);
enum X86::CondCode cc1 = (enum X86::CondCode)N1.getConstantOperandVal(0);
if (cc1 == X86::COND_E || cc1 == X86::COND_NE) {
X86::CondCode tmp = cc0;
cc0 = cc1;
cc1 = tmp;
}
if ((cc0 == X86::COND_E && cc1 == X86::COND_NP) ||
(cc0 == X86::COND_NE && cc1 == X86::COND_P)) {
// FIXME: need symbolic constants for these magic numbers.
// See X86ATTInstPrinter.cpp:printSSECC().
unsigned x86cc = (cc0 == X86::COND_E) ? 0 : 4;
if (Subtarget.hasAVX512()) {
SDValue FSetCC =
DAG.getNode(X86ISD::FSETCCM, DL, MVT::v1i1, CMP00, CMP01,
DAG.getConstant(x86cc, DL, MVT::i8));
return DAG.getNode(X86ISD::VEXTRACT, DL, N->getSimpleValueType(0),
FSetCC, DAG.getIntPtrConstant(0, DL));
}
SDValue OnesOrZeroesF = DAG.getNode(X86ISD::FSETCC, DL,
CMP00.getValueType(), CMP00, CMP01,
DAG.getConstant(x86cc, DL,
MVT::i8));
bool is64BitFP = (CMP00.getValueType() == MVT::f64);
MVT IntVT = is64BitFP ? MVT::i64 : MVT::i32;
if (is64BitFP && !Subtarget.is64Bit()) {
// On a 32-bit target, we cannot bitcast the 64-bit float to a
// 64-bit integer, since that's not a legal type. Since
// OnesOrZeroesF is all ones or all zeroes, we don't need all the
// bits, but can do this little dance to extract the lowest 32 bits
// and work with those going forward.
SDValue Vector64 = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v2f64,
OnesOrZeroesF);
SDValue Vector32 = DAG.getBitcast(MVT::v4f32, Vector64);
OnesOrZeroesF = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32,
Vector32, DAG.getIntPtrConstant(0, DL));
IntVT = MVT::i32;
}
SDValue OnesOrZeroesI = DAG.getBitcast(IntVT, OnesOrZeroesF);
SDValue ANDed = DAG.getNode(ISD::AND, DL, IntVT, OnesOrZeroesI,
DAG.getConstant(1, DL, IntVT));
SDValue OneBitOfTruth = DAG.getNode(ISD::TRUNCATE, DL, MVT::i8,
ANDed);
return OneBitOfTruth;
}
}
}
}
return SDValue();
}
/// Try to fold: (and (xor X, -1), Y) -> (andnp X, Y).
static SDValue combineANDXORWithAllOnesIntoANDNP(SDNode *N, SelectionDAG &DAG) {
assert(N->getOpcode() == ISD::AND);
EVT VT = N->getValueType(0);
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDLoc DL(N);
if (VT != MVT::v2i64 && VT != MVT::v4i64 && VT != MVT::v8i64)
return SDValue();
if (N0.getOpcode() == ISD::XOR &&
ISD::isBuildVectorAllOnes(N0.getOperand(1).getNode()))
return DAG.getNode(X86ISD::ANDNP, DL, VT, N0.getOperand(0), N1);
if (N1.getOpcode() == ISD::XOR &&
ISD::isBuildVectorAllOnes(N1.getOperand(1).getNode()))
return DAG.getNode(X86ISD::ANDNP, DL, VT, N1.getOperand(0), N0);
return SDValue();
}
// On AVX/AVX2 the type v8i1 is legalized to v8i16, which is an XMM sized
// register. In most cases we actually compare or select YMM-sized registers
// and mixing the two types creates horrible code. This method optimizes
// some of the transition sequences.
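// e.g. (sext (xor (trunc A:v8i32), (trunc B:v8i32)) to v8i32) can instead
// perform the xor on the wide values: (sext_inreg (xor A, B), v8i16).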
static SDValue WidenMaskArithmetic(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
if (!VT.is256BitVector())
return SDValue();
assert((N->getOpcode() == ISD::ANY_EXTEND ||
N->getOpcode() == ISD::ZERO_EXTEND ||
N->getOpcode() == ISD::SIGN_EXTEND) && "Invalid Node");
SDValue Narrow = N->getOperand(0);
EVT NarrowVT = Narrow->getValueType(0);
if (!NarrowVT.is128BitVector())
return SDValue();
if (Narrow->getOpcode() != ISD::XOR &&
Narrow->getOpcode() != ISD::AND &&
Narrow->getOpcode() != ISD::OR)
return SDValue();
SDValue N0 = Narrow->getOperand(0);
SDValue N1 = Narrow->getOperand(1);
SDLoc DL(Narrow);
// The Left side has to be a trunc.
if (N0.getOpcode() != ISD::TRUNCATE)
return SDValue();
// The type of the truncated inputs.
EVT WideVT = N0->getOperand(0)->getValueType(0);
if (WideVT != VT)
return SDValue();
// The right side has to be a 'trunc' or a constant vector.
bool RHSTrunc = N1.getOpcode() == ISD::TRUNCATE;
ConstantSDNode *RHSConstSplat = nullptr;
if (auto *RHSBV = dyn_cast<BuildVectorSDNode>(N1))
RHSConstSplat = RHSBV->getConstantSplatNode();
if (!RHSTrunc && !RHSConstSplat)
return SDValue();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isOperationLegalOrPromote(Narrow->getOpcode(), WideVT))
return SDValue();
// Set N0 and N1 to hold the inputs to the new wide operation.
N0 = N0->getOperand(0);
if (RHSConstSplat) {
N1 = DAG.getNode(ISD::ZERO_EXTEND, DL, WideVT.getVectorElementType(),
SDValue(RHSConstSplat, 0));
N1 = DAG.getSplatBuildVector(WideVT, DL, N1);
} else if (RHSTrunc) {
N1 = N1->getOperand(0);
}
// Generate the wide operation.
SDValue Op = DAG.getNode(Narrow->getOpcode(), DL, WideVT, N0, N1);
unsigned Opcode = N->getOpcode();
switch (Opcode) {
case ISD::ANY_EXTEND:
return Op;
case ISD::ZERO_EXTEND: {
unsigned InBits = NarrowVT.getScalarSizeInBits();
APInt Mask = APInt::getAllOnesValue(InBits);
Mask = Mask.zext(VT.getScalarSizeInBits());
return DAG.getNode(ISD::AND, DL, VT,
Op, DAG.getConstant(Mask, DL, VT));
}
case ISD::SIGN_EXTEND:
return DAG.getNode(ISD::SIGN_EXTEND_INREG, DL, VT,
Op, DAG.getValueType(NarrowVT));
default:
llvm_unreachable("Unexpected opcode");
}
}
/// If both input operands of a logic op are being cast from floating point
/// types, try to convert this into a floating point logic node to avoid
/// unnecessary moves from SSE to integer registers.
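/// For example, (and (bitcast f32:X to i32), (bitcast f32:Y to i32)) becomes
/// (bitcast (FAND X, Y) to i32), keeping both values in SSE registers.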
static SDValue convertIntLogicToFPLogic(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
unsigned FPOpcode = ISD::DELETED_NODE;
if (N->getOpcode() == ISD::AND)
FPOpcode = X86ISD::FAND;
else if (N->getOpcode() == ISD::OR)
FPOpcode = X86ISD::FOR;
else if (N->getOpcode() == ISD::XOR)
FPOpcode = X86ISD::FXOR;
assert(FPOpcode != ISD::DELETED_NODE &&
"Unexpected input node for FP logic conversion");
EVT VT = N->getValueType(0);
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDLoc DL(N);
if (N0.getOpcode() == ISD::BITCAST && N1.getOpcode() == ISD::BITCAST &&
((Subtarget.hasSSE1() && VT == MVT::i32) ||
(Subtarget.hasSSE2() && VT == MVT::i64))) {
SDValue N00 = N0.getOperand(0);
SDValue N10 = N1.getOperand(0);
EVT N00Type = N00.getValueType();
EVT N10Type = N10.getValueType();
if (N00Type.isFloatingPoint() && N10Type.isFloatingPoint()) {
SDValue FPLogic = DAG.getNode(FPOpcode, DL, N00Type, N00, N10);
return DAG.getBitcast(VT, FPLogic);
}
}
return SDValue();
}
/// If this is a zero/all-bits result that is bitwise-anded with a low-bits
/// mask (Mask == 1 for the x86 lowering of a SETCC + ZEXT), replace the 'and'
/// with a shift-right to eliminate loading the vector constant mask value.
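/// For example, if each i32 element of Op0 is known to be all-zeros or
/// all-ones and the mask is 1, then (and Op0, 1) == (srl Op0, 31) and the
/// constant-pool load of the mask is avoided.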
static SDValue combineAndMaskToShift(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue Op0 = peekThroughBitcasts(N->getOperand(0));
SDValue Op1 = peekThroughBitcasts(N->getOperand(1));
EVT VT0 = Op0.getValueType();
EVT VT1 = Op1.getValueType();
if (VT0 != VT1 || !VT0.isSimple() || !VT0.isInteger())
return SDValue();
APInt SplatVal;
- if (!ISD::isConstantSplatVector(Op1.getNode(), SplatVal) ||
+ if (!ISD::isConstantSplatVector(Op1.getNode(), SplatVal,
+ /*AllowShrink*/false) ||
!SplatVal.isMask())
return SDValue();
if (!SupportedVectorShiftWithImm(VT0.getSimpleVT(), Subtarget, ISD::SRL))
return SDValue();
unsigned EltBitWidth = VT0.getScalarSizeInBits();
if (EltBitWidth != DAG.ComputeNumSignBits(Op0))
return SDValue();
SDLoc DL(N);
unsigned ShiftVal = SplatVal.countTrailingOnes();
SDValue ShAmt = DAG.getConstant(EltBitWidth - ShiftVal, DL, MVT::i8);
SDValue Shift = DAG.getNode(X86ISD::VSRLI, DL, VT0, Op0, ShAmt);
return DAG.getBitcast(N->getValueType(0), Shift);
}
static SDValue combineAnd(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalizeOps())
return SDValue();
if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))
return R;
if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, Subtarget))
return FPLogic;
if (SDValue R = combineANDXORWithAllOnesIntoANDNP(N, DAG))
return R;
if (SDValue ShiftRight = combineAndMaskToShift(N, DAG, Subtarget))
return ShiftRight;
EVT VT = N->getValueType(0);
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDLoc DL(N);
// Attempt to recursively combine a bitmask AND with shuffles.
if (VT.isVector() && (VT.getScalarSizeInBits() % 8) == 0) {
SDValue Op(N, 0);
SmallVector<int, 1> NonceMask; // Just a placeholder.
NonceMask.push_back(0);
if (combineX86ShufflesRecursively({Op}, 0, Op, NonceMask, {},
/*Depth*/ 1, /*HasVarMask*/ false, DAG,
DCI, Subtarget))
return SDValue(); // This routine will use CombineTo to replace N.
}
// Create BEXTR instructions
// BEXTR is ((X >> imm) & (2**size-1))
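// e.g. (and (srl X, 8), 0xFF) becomes BEXTR with control (8 | (8 << 8)):
// start at bit 8, extract 8 bits.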
if (VT != MVT::i32 && VT != MVT::i64)
return SDValue();
if (!Subtarget.hasBMI() && !Subtarget.hasTBM())
return SDValue();
if (N0.getOpcode() != ISD::SRA && N0.getOpcode() != ISD::SRL)
return SDValue();
ConstantSDNode *MaskNode = dyn_cast<ConstantSDNode>(N1);
ConstantSDNode *ShiftNode = dyn_cast<ConstantSDNode>(N0.getOperand(1));
if (MaskNode && ShiftNode) {
uint64_t Mask = MaskNode->getZExtValue();
uint64_t Shift = ShiftNode->getZExtValue();
if (isMask_64(Mask)) {
uint64_t MaskSize = countPopulation(Mask);
if (Shift + MaskSize <= VT.getSizeInBits())
return DAG.getNode(X86ISD::BEXTR, DL, VT, N0.getOperand(0),
DAG.getConstant(Shift | (MaskSize << 8), DL,
VT));
}
}
return SDValue();
}
// Try to fold:
// (or (and m, y), (pandn m, x))
// into:
// (vselect m, x, y)
// As a special case, try to fold:
// (or (and m, (sub 0, x)), (pandn m, x))
// into:
// (sub (xor x, m), m)
static SDValue combineLogicBlendIntoPBLENDV(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
assert(N->getOpcode() == ISD::OR && "Unexpected Opcode");
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);
if (!((VT.is128BitVector() && Subtarget.hasSSE2()) ||
(VT.is256BitVector() && Subtarget.hasInt256())))
return SDValue();
// Canonicalize AND to LHS.
if (N1.getOpcode() == ISD::AND)
std::swap(N0, N1);
// TODO: Attempt to match against AND(XOR(-1,X),Y) as well, waiting for
// ANDNP combine allows other combines to happen that prevent matching.
if (N0.getOpcode() != ISD::AND || N1.getOpcode() != X86ISD::ANDNP)
return SDValue();
SDValue Mask = N1.getOperand(0);
SDValue X = N1.getOperand(1);
SDValue Y;
if (N0.getOperand(0) == Mask)
Y = N0.getOperand(1);
if (N0.getOperand(1) == Mask)
Y = N0.getOperand(0);
// Check to see if the mask appeared in both the AND and ANDNP.
if (!Y.getNode())
return SDValue();
// Validate that X, Y, and Mask are bitcasts, and see through them.
Mask = peekThroughBitcasts(Mask);
X = peekThroughBitcasts(X);
Y = peekThroughBitcasts(Y);
EVT MaskVT = Mask.getValueType();
unsigned EltBits = MaskVT.getScalarSizeInBits();
// TODO: Attempt to handle floating point cases as well?
if (!MaskVT.isInteger() || DAG.ComputeNumSignBits(Mask) != EltBits)
return SDValue();
SDLoc DL(N);
// Try to match:
// (or (and M, (sub 0, X)), (pandn M, X))
// which is a special case of vselect:
// (vselect M, (sub 0, X), X)
// Per:
// http://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate
// We know that, if fNegate is 0 or 1:
// (fNegate ? -v : v) == ((v ^ -fNegate) + fNegate)
//
// Here, we have a mask, M (all 1s or all 0s), and, similarly, we know that:
// ((M & 1) ? -X : X) == ((X ^ -(M & 1)) + (M & 1))
// ( M ? -X : X) == ((X ^ M ) + (M & 1))
// This lets us transform our vselect to:
// (add (xor X, M), (and M, 1))
// And further to:
// (sub (xor X, M), M)
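// Sanity check: with M == all-ones, (xor X, M) - M == ~X + 1 == -X;
// with M == 0 the result is simply X.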
if (X.getValueType() == MaskVT && Y.getValueType() == MaskVT &&
DAG.getTargetLoweringInfo().isOperationLegal(ISD::SUB, MaskVT)) {
auto IsNegV = [](SDNode *N, SDValue V) {
return N->getOpcode() == ISD::SUB && N->getOperand(1) == V &&
ISD::isBuildVectorAllZeros(N->getOperand(0).getNode());
};
SDValue V;
if (IsNegV(Y.getNode(), X))
V = X;
else if (IsNegV(X.getNode(), Y))
V = Y;
if (V) {
SDValue SubOp1 = DAG.getNode(ISD::XOR, DL, MaskVT, V, Mask);
SDValue SubOp2 = Mask;
// If the negate was on the false side of the select, then
// the operands of the SUB need to be swapped. PR 27251.
// This is because the pattern being matched above is
// (vselect M, (sub 0, X), X) -> (sub (xor X, M), M)
// but if the pattern matched was
// (vselect M, X, (sub 0, X)), that is really the negation of the pattern
// above, -(vselect M, (sub 0, X), X), and therefore the replacement
// pattern also needs to be a negation of the replacement pattern above.
// And -(sub X, Y) is just sub (Y, X), so swapping the operands of the
// sub accomplishes the negation of the replacement pattern.
if (V == Y)
std::swap(SubOp1, SubOp2);
SDValue Res = DAG.getNode(ISD::SUB, DL, MaskVT, SubOp1, SubOp2);
return DAG.getBitcast(VT, Res);
}
}
// PBLENDVB is only available on SSE 4.1.
if (!Subtarget.hasSSE41())
return SDValue();
MVT BlendVT = (VT == MVT::v4i64) ? MVT::v32i8 : MVT::v16i8;
X = DAG.getBitcast(BlendVT, X);
Y = DAG.getBitcast(BlendVT, Y);
Mask = DAG.getBitcast(BlendVT, Mask);
Mask = DAG.getSelect(DL, BlendVT, Mask, Y, X);
return DAG.getBitcast(VT, Mask);
}
// Helper function for combineOrCmpEqZeroToCtlzSrl
// Transforms:
// seteq(cmp x, 0)
// into:
// srl(ctlz x), log2(bitsize(x))
// Input pattern is checked by caller.
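// e.g. for i32 with LZCNT: ctlz(0) == 32, so srl(32, 5) == 1, while any
// x != 0 has ctlz(x) < 32 and the srl yields 0.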
static SDValue lowerX86CmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy,
SelectionDAG &DAG) {
SDValue Cmp = Op.getOperand(1);
EVT VT = Cmp.getOperand(0).getValueType();
unsigned Log2b = Log2_32(VT.getSizeInBits());
SDLoc dl(Op);
SDValue Clz = DAG.getNode(ISD::CTLZ, dl, VT, Cmp->getOperand(0));
// The result of the shift is true or false, and on X86, the 32-bit
// encoding of shr and lzcnt is more desirable.
SDValue Trunc = DAG.getZExtOrTrunc(Clz, dl, MVT::i32);
SDValue Scc = DAG.getNode(ISD::SRL, dl, MVT::i32, Trunc,
DAG.getConstant(Log2b, dl, VT));
return DAG.getZExtOrTrunc(Scc, dl, ExtTy);
}
// Try to transform:
// zext(or(setcc(eq, (cmp x, 0)), setcc(eq, (cmp y, 0))))
// into:
// srl(or(ctlz(x), ctlz(y)), log2(bitsize(x)))
// Will also attempt to match more generic cases, eg:
// zext(or(or(setcc(eq, cmp 0), setcc(eq, cmp 0)), setcc(eq, cmp 0)))
// Only applies if the target supports the FastLZCNT feature.
static SDValue combineOrCmpEqZeroToCtlzSrl(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalize() || !Subtarget.getTargetLowering()->isCtlzFast())
return SDValue();
auto isORCandidate = [](SDValue N) {
return (N->getOpcode() == ISD::OR && N->hasOneUse());
};
// Check that the zero extend is extending to 32 bits or more. The code generated by
// srl(ctlz) for 16-bit or less variants of the pattern would require extra
// instructions to clear the upper bits.
if (!N->hasOneUse() || !N->getSimpleValueType(0).bitsGE(MVT::i32) ||
!isORCandidate(N->getOperand(0)))
return SDValue();
// Check the node matches: setcc(eq, cmp 0)
auto isSetCCCandidate = [](SDValue N) {
return N->getOpcode() == X86ISD::SETCC && N->hasOneUse() &&
X86::CondCode(N->getConstantOperandVal(0)) == X86::COND_E &&
N->getOperand(1).getOpcode() == X86ISD::CMP &&
isNullConstant(N->getOperand(1).getOperand(1)) &&
N->getOperand(1).getValueType().bitsGE(MVT::i32);
};
SDNode *OR = N->getOperand(0).getNode();
SDValue LHS = OR->getOperand(0);
SDValue RHS = OR->getOperand(1);
// Save nodes matching or(or, setcc(eq, cmp 0)).
SmallVector<SDNode *, 2> ORNodes;
while (((isORCandidate(LHS) && isSetCCCandidate(RHS)) ||
(isORCandidate(RHS) && isSetCCCandidate(LHS)))) {
ORNodes.push_back(OR);
OR = (LHS->getOpcode() == ISD::OR) ? LHS.getNode() : RHS.getNode();
LHS = OR->getOperand(0);
RHS = OR->getOperand(1);
}
// The last OR node should match or(setcc(eq, cmp 0), setcc(eq, cmp 0)).
if (!(isSetCCCandidate(LHS) && isSetCCCandidate(RHS)) ||
!isORCandidate(SDValue(OR, 0)))
return SDValue();
// We have a or(setcc(eq, cmp 0), setcc(eq, cmp 0)) pattern, try to lower it
// to
// or(srl(ctlz),srl(ctlz)).
// The dag combiner can then fold it into:
// srl(or(ctlz, ctlz)).
EVT VT = OR->getValueType(0);
SDValue NewLHS = lowerX86CmpEqZeroToCtlzSrl(LHS, VT, DAG);
SDValue Ret, NewRHS;
if (NewLHS && (NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG)))
Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, NewLHS, NewRHS);
if (!Ret)
return SDValue();
// Try to lower nodes matching the or(or, setcc(eq, cmp 0)) pattern.
while (ORNodes.size() > 0) {
OR = ORNodes.pop_back_val();
LHS = OR->getOperand(0);
RHS = OR->getOperand(1);
// Swap rhs with lhs to match or(setcc(eq, cmp, 0), or).
if (RHS->getOpcode() == ISD::OR)
std::swap(LHS, RHS);
EVT VT = OR->getValueType(0);
SDValue NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG);
if (!NewRHS)
return SDValue();
Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, Ret, NewRHS);
}
if (Ret)
Ret = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), N->getValueType(0), Ret);
return Ret;
}
static SDValue combineOr(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalizeOps())
return SDValue();
if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))
return R;
if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, Subtarget))
return FPLogic;
if (SDValue R = combineLogicBlendIntoPBLENDV(N, DAG, Subtarget))
return R;
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);
if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)
return SDValue();
// fold (or (x << c), (y >> (64 - c))) ==> (shld64 x, y, c)
bool OptForSize = DAG.getMachineFunction().getFunction()->optForSize();
// SHLD/SHRD instructions have lower register pressure, but on some
// platforms they have higher latency than the equivalent
// series of shifts/or that would otherwise be generated.
// Don't fold (or (x << c), (y >> (64 - c))) if SHLD/SHRD instructions
// have higher latencies and we are not optimizing for size.
if (!OptForSize && Subtarget.isSHLDSlow())
return SDValue();
if (N0.getOpcode() == ISD::SRL && N1.getOpcode() == ISD::SHL)
std::swap(N0, N1);
if (N0.getOpcode() != ISD::SHL || N1.getOpcode() != ISD::SRL)
return SDValue();
if (!N0.hasOneUse() || !N1.hasOneUse())
return SDValue();
SDValue ShAmt0 = N0.getOperand(1);
if (ShAmt0.getValueType() != MVT::i8)
return SDValue();
SDValue ShAmt1 = N1.getOperand(1);
if (ShAmt1.getValueType() != MVT::i8)
return SDValue();
if (ShAmt0.getOpcode() == ISD::TRUNCATE)
ShAmt0 = ShAmt0.getOperand(0);
if (ShAmt1.getOpcode() == ISD::TRUNCATE)
ShAmt1 = ShAmt1.getOperand(0);
SDLoc DL(N);
unsigned Opc = X86ISD::SHLD;
SDValue Op0 = N0.getOperand(0);
SDValue Op1 = N1.getOperand(0);
if (ShAmt0.getOpcode() == ISD::SUB ||
ShAmt0.getOpcode() == ISD::XOR) {
Opc = X86ISD::SHRD;
std::swap(Op0, Op1);
std::swap(ShAmt0, ShAmt1);
}
// OR( SHL( X, C ), SRL( Y, 32 - C ) ) -> SHLD( X, Y, C )
// OR( SRL( X, C ), SHL( Y, 32 - C ) ) -> SHRD( X, Y, C )
// OR( SHL( X, C ), SRL( SRL( Y, 1 ), XOR( C, 31 ) ) ) -> SHLD( X, Y, C )
// OR( SRL( X, C ), SHL( SHL( Y, 1 ), XOR( C, 31 ) ) ) -> SHRD( X, Y, C )
unsigned Bits = VT.getSizeInBits();
if (ShAmt1.getOpcode() == ISD::SUB) {
SDValue Sum = ShAmt1.getOperand(0);
if (ConstantSDNode *SumC = dyn_cast<ConstantSDNode>(Sum)) {
SDValue ShAmt1Op1 = ShAmt1.getOperand(1);
if (ShAmt1Op1.getOpcode() == ISD::TRUNCATE)
ShAmt1Op1 = ShAmt1Op1.getOperand(0);
if (SumC->getSExtValue() == Bits && ShAmt1Op1 == ShAmt0)
return DAG.getNode(Opc, DL, VT,
Op0, Op1,
DAG.getNode(ISD::TRUNCATE, DL,
MVT::i8, ShAmt0));
}
} else if (ConstantSDNode *ShAmt1C = dyn_cast<ConstantSDNode>(ShAmt1)) {
ConstantSDNode *ShAmt0C = dyn_cast<ConstantSDNode>(ShAmt0);
if (ShAmt0C && (ShAmt0C->getSExtValue() + ShAmt1C->getSExtValue()) == Bits)
return DAG.getNode(Opc, DL, VT,
N0.getOperand(0), N1.getOperand(0),
DAG.getNode(ISD::TRUNCATE, DL,
MVT::i8, ShAmt0));
} else if (ShAmt1.getOpcode() == ISD::XOR) {
SDValue Mask = ShAmt1.getOperand(1);
if (ConstantSDNode *MaskC = dyn_cast<ConstantSDNode>(Mask)) {
unsigned InnerShift = (X86ISD::SHLD == Opc ? ISD::SRL : ISD::SHL);
SDValue ShAmt1Op0 = ShAmt1.getOperand(0);
if (ShAmt1Op0.getOpcode() == ISD::TRUNCATE)
ShAmt1Op0 = ShAmt1Op0.getOperand(0);
if (MaskC->getSExtValue() == (Bits - 1) && ShAmt1Op0 == ShAmt0) {
if (Op1.getOpcode() == InnerShift &&
isa<ConstantSDNode>(Op1.getOperand(1)) &&
Op1.getConstantOperandVal(1) == 1) {
return DAG.getNode(Opc, DL, VT, Op0, Op1.getOperand(0),
DAG.getNode(ISD::TRUNCATE, DL, MVT::i8, ShAmt0));
}
// Test for ADD( Y, Y ) as an equivalent to SHL( Y, 1 ).
if (InnerShift == ISD::SHL && Op1.getOpcode() == ISD::ADD &&
Op1.getOperand(0) == Op1.getOperand(1)) {
return DAG.getNode(Opc, DL, VT, Op0, Op1.getOperand(0),
DAG.getNode(ISD::TRUNCATE, DL, MVT::i8, ShAmt0));
}
}
}
}
return SDValue();
}
/// Generate NEG and CMOV for integer abs.
static SDValue combineIntegerAbs(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);
// Since X86 does not have CMOV for 8-bit integer, we don't convert
// 8-bit integer abs to NEG and CMOV.
if (VT.isInteger() && VT.getSizeInBits() == 8)
return SDValue();
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
SDLoc DL(N);
// Check pattern of XOR(ADD(X,Y), Y) where Y is SRA(X, size(X)-1)
// and change it to SUB and CMOV.
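// With Y == SRA(X, size(X)-1): if X < 0 then Y == -1 and
// (X + Y) ^ Y == ~(X - 1) == -X; if X >= 0 then Y == 0 and the result is X.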
if (VT.isInteger() && N->getOpcode() == ISD::XOR &&
N0.getOpcode() == ISD::ADD && N0.getOperand(1) == N1 &&
N1.getOpcode() == ISD::SRA && N1.getOperand(0) == N0.getOperand(0)) {
auto *Y1C = dyn_cast<ConstantSDNode>(N1.getOperand(1));
if (Y1C && Y1C->getAPIntValue() == VT.getSizeInBits() - 1) {
// Generate SUB & CMOV.
SDValue Neg = DAG.getNode(X86ISD::SUB, DL, DAG.getVTList(VT, MVT::i32),
DAG.getConstant(0, DL, VT), N0.getOperand(0));
SDValue Ops[] = {N0.getOperand(0), Neg,
DAG.getConstant(X86::COND_GE, DL, MVT::i8),
SDValue(Neg.getNode(), 1)};
return DAG.getNode(X86ISD::CMOV, DL, DAG.getVTList(VT, MVT::Glue), Ops);
}
}
return SDValue();
}
/// Try to turn tests against the signbit in the form of:
/// XOR(TRUNCATE(SRL(X, size(X)-1)), 1)
/// into:
/// SETGT(X, -1)
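/// The SRL by size(X)-1 extracts the sign bit, so the xor with 1 yields 1
/// exactly when X is non-negative, i.e. X > -1.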
static SDValue foldXorTruncShiftIntoCmp(SDNode *N, SelectionDAG &DAG) {
// This is only worth doing if the output type is i8 or i1.
EVT ResultType = N->getValueType(0);
if (ResultType != MVT::i8 && ResultType != MVT::i1)
return SDValue();
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
// We should be performing an xor against a truncated shift.
if (N0.getOpcode() != ISD::TRUNCATE || !N0.hasOneUse())
return SDValue();
// Make sure we are performing an xor against one.
if (!isOneConstant(N1))
return SDValue();
// SetCC on x86 zero extends so only act on this if it's a logical shift.
SDValue Shift = N0.getOperand(0);
if (Shift.getOpcode() != ISD::SRL || !Shift.hasOneUse())
return SDValue();
// Make sure we are truncating from one of i16, i32 or i64.
EVT ShiftTy = Shift.getValueType();
if (ShiftTy != MVT::i16 && ShiftTy != MVT::i32 && ShiftTy != MVT::i64)
return SDValue();
// Make sure the shift amount extracts the sign bit.
if (!isa<ConstantSDNode>(Shift.getOperand(1)) ||
Shift.getConstantOperandVal(1) != ShiftTy.getSizeInBits() - 1)
return SDValue();
// Create a greater-than comparison against -1.
// N.B. Using SETGE against 0 works but we want a canonical-looking
// comparison; using SETGT matches up with what TranslateX86CC does.
SDLoc DL(N);
SDValue ShiftOp = Shift.getOperand(0);
EVT ShiftOpTy = ShiftOp.getValueType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
EVT SetCCResultType = TLI.getSetCCResultType(DAG.getDataLayout(),
*DAG.getContext(), ResultType);
SDValue Cond = DAG.getSetCC(DL, SetCCResultType, ShiftOp,
DAG.getConstant(-1, DL, ShiftOpTy), ISD::SETGT);
if (SetCCResultType != ResultType)
Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, ResultType, Cond);
return Cond;
}
/// Turn vector tests of the signbit in the form of:
/// xor (sra X, elt_size(X)-1), -1
/// into:
/// pcmpgt X, -1
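/// The sra smears each element's sign bit across the element (all-ones iff
/// negative) and the xor with -1 inverts that, which is exactly
/// pcmpgt X, -1 (all-ones iff X > -1).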
///
/// This should be called before type legalization because the pattern may not
/// persist after that.
static SDValue foldVectorXorShiftIntoCmp(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
if (!VT.isSimple())
return SDValue();
switch (VT.getSimpleVT().SimpleTy) {
default: return SDValue();
case MVT::v16i8:
case MVT::v8i16:
case MVT::v4i32: if (!Subtarget.hasSSE2()) return SDValue(); break;
case MVT::v2i64: if (!Subtarget.hasSSE42()) return SDValue(); break;
case MVT::v32i8:
case MVT::v16i16:
case MVT::v8i32:
case MVT::v4i64: if (!Subtarget.hasAVX2()) return SDValue(); break;
}
// There must be a shift right algebraic before the xor, and the xor must be a
// 'not' operation.
SDValue Shift = N->getOperand(0);
SDValue Ones = N->getOperand(1);
if (Shift.getOpcode() != ISD::SRA || !Shift.hasOneUse() ||
!ISD::isBuildVectorAllOnes(Ones.getNode()))
return SDValue();
// The shift should be smearing the sign bit across each vector element.
auto *ShiftBV = dyn_cast<BuildVectorSDNode>(Shift.getOperand(1));
if (!ShiftBV)
return SDValue();
EVT ShiftEltTy = Shift.getValueType().getVectorElementType();
auto *ShiftAmt = ShiftBV->getConstantSplatNode();
if (!ShiftAmt || ShiftAmt->getZExtValue() != ShiftEltTy.getSizeInBits() - 1)
return SDValue();
// Create a greater-than comparison against -1. We don't use the more obvious
// greater-than-or-equal-to-zero because SSE/AVX don't have that instruction.
return DAG.getNode(X86ISD::PCMPGT, SDLoc(N), VT, Shift.getOperand(0), Ones);
}
/// Check if truncation with saturation from type \p SrcVT to \p DstVT
/// is valid for the given \p Subtarget.
static bool isSATValidOnAVX512Subtarget(EVT SrcVT, EVT DstVT,
const X86Subtarget &Subtarget) {
if (!Subtarget.hasAVX512())
return false;
// FIXME: Scalar type may be supported if we move it to vector register.
if (!SrcVT.isVector() || !SrcVT.isSimple() || SrcVT.getSizeInBits() > 512)
return false;
EVT SrcElVT = SrcVT.getScalarType();
EVT DstElVT = DstVT.getScalarType();
if (SrcElVT.getSizeInBits() < 16 || SrcElVT.getSizeInBits() > 64)
return false;
if (DstElVT.getSizeInBits() < 8 || DstElVT.getSizeInBits() > 32)
return false;
if (SrcVT.is512BitVector() || Subtarget.hasVLX())
return SrcElVT.getSizeInBits() >= 32 || Subtarget.hasBWI();
return false;
}
/// Detect a pattern of truncation with saturation:
/// (truncate (umin (x, unsigned_max_of_dest_type)) to dest_type).
/// Return the source value to be truncated or SDValue() if the pattern was not
/// matched.
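/// For example, an unsigned saturating truncate from i32 to i16 elements
/// appears as (truncate (umin x, 65535) to i16).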
static SDValue detectUSatPattern(SDValue In, EVT VT) {
if (In.getOpcode() != ISD::UMIN)
return SDValue();
// Saturation with truncation. We truncate from InVT to VT.
assert(In.getScalarValueSizeInBits() > VT.getScalarSizeInBits() &&
"Unexpected types for truncate operation");
APInt C;
- if (ISD::isConstantSplatVector(In.getOperand(1).getNode(), C)) {
+ if (ISD::isConstantSplatVector(In.getOperand(1).getNode(), C,
+ /*AllowShrink*/false)) {
// C should be equal to UINT32_MAX / UINT16_MAX / UINT8_MAX according to
// the element size of the destination type.
return C.isMask(VT.getScalarSizeInBits()) ? In.getOperand(0) :
SDValue();
}
return SDValue();
}
/// Detect a pattern of truncation with saturation:
/// (truncate (umin (x, unsigned_max_of_dest_type)) to dest_type).
/// The types should allow using the VPMOVUS* instructions on AVX512.
/// Return the source value to be truncated or SDValue() if the pattern was not
/// matched.
static SDValue detectAVX512USatPattern(SDValue In, EVT VT,
const X86Subtarget &Subtarget) {
if (!isSATValidOnAVX512Subtarget(In.getValueType(), VT, Subtarget))
return SDValue();
return detectUSatPattern(In, VT);
}
static SDValue
combineTruncateWithUSat(SDValue In, EVT VT, SDLoc &DL, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (!TLI.isTypeLegal(In.getValueType()) || !TLI.isTypeLegal(VT))
return SDValue();
if (auto USatVal = detectUSatPattern(In, VT))
if (isSATValidOnAVX512Subtarget(In.getValueType(), VT, Subtarget))
return DAG.getNode(X86ISD::VTRUNCUS, DL, VT, USatVal);
return SDValue();
}
/// This function detects the AVG pattern between vectors of unsigned i8/i16,
/// which is c = (a + b + 1) / 2, and replaces this operation with the efficient
/// X86ISD::AVG instruction.
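/// For example, PAVGB/PAVGW compute (a + b + 1) >> 1 per unsigned i8/i16
/// element without widening, which is exactly this pattern.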
static SDValue detectAVGPattern(SDValue In, EVT VT, SelectionDAG &DAG,
const X86Subtarget &Subtarget,
const SDLoc &DL) {
if (!VT.isVector() || !VT.isSimple())
return SDValue();
EVT InVT = In.getValueType();
unsigned NumElems = VT.getVectorNumElements();
EVT ScalarVT = VT.getVectorElementType();
if (!((ScalarVT == MVT::i8 || ScalarVT == MVT::i16) &&
isPowerOf2_32(NumElems)))
return SDValue();
// InScalarVT is the intermediate type in the AVG pattern and it should be
// greater than the original input type (i8/i16).
EVT InScalarVT = InVT.getVectorElementType();
if (InScalarVT.getSizeInBits() <= ScalarVT.getSizeInBits())
return SDValue();
if (!Subtarget.hasSSE2())
return SDValue();
if (Subtarget.hasBWI()) {
if (VT.getSizeInBits() > 512)
return SDValue();
} else if (Subtarget.hasAVX2()) {
if (VT.getSizeInBits() > 256)
return SDValue();
} else {
if (VT.getSizeInBits() > 128)
return SDValue();
}
// Detect the following pattern:
//
// %1 = zext <N x i8> %a to <N x i32>
// %2 = zext <N x i8> %b to <N x i32>
// %3 = add nuw nsw <N x i32> %1, <i32 1 x N>
// %4 = add nuw nsw <N x i32> %3, %2
// %5 = lshr <N x i32> %4, <i32 1 x N>
// %6 = trunc <N x i32> %5 to <N x i8>
//
// In AVX512, the last instruction can also be a trunc store.
if (In.getOpcode() != ISD::SRL)
return SDValue();
// A lambda checking that the given SDValue is a constant vector and each
// element is in the range [Min, Max].
auto IsConstVectorInRange = [](SDValue V, unsigned Min, unsigned Max) {
BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(V);
if (!BV || !BV->isConstant())
return false;
for (SDValue Op : V->ops()) {
ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op);
if (!C)
return false;
uint64_t Val = C->getZExtValue();
if (Val < Min || Val > Max)
return false;
}
return true;
};
// Check if each element of the vector is right-shifted by one.
auto LHS = In.getOperand(0);
auto RHS = In.getOperand(1);
if (!IsConstVectorInRange(RHS, 1, 1))
return SDValue();
if (LHS.getOpcode() != ISD::ADD)
return SDValue();
// Detect a pattern of a + b + 1 where the order doesn't matter.
SDValue Operands[3];
Operands[0] = LHS.getOperand(0);
Operands[1] = LHS.getOperand(1);
// Take care of the case when one of the operands is a constant vector whose
// element is in the range [1, 256].
if (IsConstVectorInRange(Operands[1], 1, ScalarVT == MVT::i8 ? 256 : 65536) &&
Operands[0].getOpcode() == ISD::ZERO_EXTEND &&
Operands[0].getOperand(0).getValueType() == VT) {
// The pattern is detected. Subtract one from the constant vector, then
// demote it and emit X86ISD::AVG instruction.
SDValue VecOnes = DAG.getConstant(1, DL, InVT);
Operands[1] = DAG.getNode(ISD::SUB, DL, InVT, Operands[1], VecOnes);
Operands[1] = DAG.getNode(ISD::TRUNCATE, DL, VT, Operands[1]);
return DAG.getNode(X86ISD::AVG, DL, VT, Operands[0].getOperand(0),
Operands[1]);
}
if (Operands[0].getOpcode() == ISD::ADD)
std::swap(Operands[0], Operands[1]);
else if (Operands[1].getOpcode() != ISD::ADD)
return SDValue();
Operands[2] = Operands[1].getOperand(0);
Operands[1] = Operands[1].getOperand(1);
// Now we have three operands of two additions. Check that one of them is a
// constant vector with ones, and the other two are promoted from i8/i16.
for (int i = 0; i < 3; ++i) {
if (!IsConstVectorInRange(Operands[i], 1, 1))
continue;
std::swap(Operands[i], Operands[2]);
// Check if Operands[0] and Operands[1] are results of type promotion.
for (int j = 0; j < 2; ++j)
if (Operands[j].getOpcode() != ISD::ZERO_EXTEND ||
Operands[j].getOperand(0).getValueType() != VT)
return SDValue();
// The pattern is detected, emit X86ISD::AVG instruction.
return DAG.getNode(X86ISD::AVG, DL, VT, Operands[0].getOperand(0),
Operands[1].getOperand(0));
}
return SDValue();
}
static SDValue combineLoad(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
LoadSDNode *Ld = cast<LoadSDNode>(N);
EVT RegVT = Ld->getValueType(0);
EVT MemVT = Ld->getMemoryVT();
SDLoc dl(Ld);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// For chips with slow 32-byte unaligned loads, break the 32-byte operation
// into two 16-byte operations. Also split non-temporal aligned loads on
// pre-AVX2 targets as 32-byte loads will lower to regular temporal loads.
ISD::LoadExtType Ext = Ld->getExtensionType();
bool Fast;
unsigned AddressSpace = Ld->getAddressSpace();
unsigned Alignment = Ld->getAlignment();
if (RegVT.is256BitVector() && !DCI.isBeforeLegalizeOps() &&
Ext == ISD::NON_EXTLOAD &&
((Ld->isNonTemporal() && !Subtarget.hasInt256() && Alignment >= 16) ||
(TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), RegVT,
AddressSpace, Alignment, &Fast) && !Fast))) {
unsigned NumElems = RegVT.getVectorNumElements();
if (NumElems < 2)
return SDValue();
SDValue Ptr = Ld->getBasePtr();
EVT HalfVT = EVT::getVectorVT(*DAG.getContext(), MemVT.getScalarType(),
NumElems/2);
SDValue Load1 =
DAG.getLoad(HalfVT, dl, Ld->getChain(), Ptr, Ld->getPointerInfo(),
Alignment, Ld->getMemOperand()->getFlags());
Ptr = DAG.getMemBasePlusOffset(Ptr, 16, dl);
SDValue Load2 =
DAG.getLoad(HalfVT, dl, Ld->getChain(), Ptr, Ld->getPointerInfo(),
std::min(16U, Alignment), Ld->getMemOperand()->getFlags());
SDValue TF = DAG.getNode(ISD::TokenFactor, dl, MVT::Other,
Load1.getValue(1),
Load2.getValue(1));
SDValue NewVec = DAG.getUNDEF(RegVT);
NewVec = insert128BitVector(NewVec, Load1, 0, DAG, dl);
NewVec = insert128BitVector(NewVec, Load2, NumElems / 2, DAG, dl);
return DCI.CombineTo(N, NewVec, TF, true);
}
return SDValue();
}
/// If V is a build vector of boolean constants and exactly one of those
/// constants is true, return the operand index of that true element.
/// Otherwise, return -1.
static int getOneTrueElt(SDValue V) {
// This needs to be a build vector of booleans.
// TODO: Checking for the i1 type matches the IR definition for the mask,
// but the mask check could be loosened to i8 or other types. That might
// also require checking more than 'allOnesValue'; eg, the x86 HW
// instructions only require that the MSB is set for each mask element.
// The ISD::MSTORE comments/definition do not specify how the mask operand
// is formatted.
auto *BV = dyn_cast<BuildVectorSDNode>(V);
if (!BV || BV->getValueType(0).getVectorElementType() != MVT::i1)
return -1;
int TrueIndex = -1;
unsigned NumElts = BV->getValueType(0).getVectorNumElements();
for (unsigned i = 0; i < NumElts; ++i) {
const SDValue &Op = BV->getOperand(i);
if (Op.isUndef())
continue;
auto *ConstNode = dyn_cast<ConstantSDNode>(Op);
if (!ConstNode)
return -1;
if (ConstNode->getAPIntValue().isAllOnesValue()) {
// If we already found a one, this is too many.
if (TrueIndex >= 0)
return -1;
TrueIndex = i;
}
}
return TrueIndex;
}
/// Given a masked memory load/store operation, return true if it has one mask
/// bit set. If it has one mask bit set, then also return the memory address of
/// the scalar element to load/store, the vector index to insert/extract that
/// scalar element, and the alignment for the scalar memory access.
static bool getParamsForOneTrueMaskedElt(MaskedLoadStoreSDNode *MaskedOp,
SelectionDAG &DAG, SDValue &Addr,
SDValue &Index, unsigned &Alignment) {
int TrueMaskElt = getOneTrueElt(MaskedOp->getMask());
if (TrueMaskElt < 0)
return false;
// Get the address of the one scalar element that is specified by the mask
// using the appropriate offset from the base pointer.
EVT EltVT = MaskedOp->getMemoryVT().getVectorElementType();
Addr = MaskedOp->getBasePtr();
if (TrueMaskElt != 0) {
unsigned Offset = TrueMaskElt * EltVT.getStoreSize();
Addr = DAG.getMemBasePlusOffset(Addr, Offset, SDLoc(MaskedOp));
}
Index = DAG.getIntPtrConstant(TrueMaskElt, SDLoc(MaskedOp));
Alignment = MinAlign(MaskedOp->getAlignment(), EltVT.getStoreSize());
return true;
}
/// If exactly one element of the mask is set for a non-extending masked load,
/// it is a scalar load and vector insert.
/// Note: It is expected that the degenerate cases of an all-zeros or all-ones
/// mask have already been optimized in IR, so we don't bother with those here.
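/// For example, a v4i32 masked load with mask <0,0,1,0> becomes a scalar i32
/// load from BasePtr + 8 inserted into the pass-through vector at index 2.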
static SDValue
reduceMaskedLoadToScalarLoad(MaskedLoadSDNode *ML, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
// TODO: This is not x86-specific, so it could be lifted to DAGCombiner.
// However, some target hooks may need to be added to know when the transform
// is profitable. Endianness would also have to be considered.
SDValue Addr, VecIndex;
unsigned Alignment;
if (!getParamsForOneTrueMaskedElt(ML, DAG, Addr, VecIndex, Alignment))
return SDValue();
// Load the one scalar element that is specified by the mask using the
// appropriate offset from the base pointer.
SDLoc DL(ML);
EVT VT = ML->getValueType(0);
EVT EltVT = VT.getVectorElementType();
SDValue Load =
DAG.getLoad(EltVT, DL, ML->getChain(), Addr, ML->getPointerInfo(),
Alignment, ML->getMemOperand()->getFlags());
// Insert the loaded element into the appropriate place in the vector.
SDValue Insert = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, ML->getSrc0(),
Load, VecIndex);
return DCI.CombineTo(ML, Insert, Load.getValue(1), true);
}
static SDValue
combineMaskedLoadConstantMask(MaskedLoadSDNode *ML, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
if (!ISD::isBuildVectorOfConstantSDNodes(ML->getMask().getNode()))
return SDValue();
SDLoc DL(ML);
EVT VT = ML->getValueType(0);
// If we are loading the first and last elements of a vector, it is safe and
// always faster to load the whole vector. Replace the masked load with a
// vector load and select.
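// Touching the first and last elements means both pages spanned by the
// vector are dereferenceable, so the full load cannot fault anywhere the
// masked load would not.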
unsigned NumElts = VT.getVectorNumElements();
BuildVectorSDNode *MaskBV = cast<BuildVectorSDNode>(ML->getMask());
bool LoadFirstElt = !isNullConstant(MaskBV->getOperand(0));
bool LoadLastElt = !isNullConstant(MaskBV->getOperand(NumElts - 1));
if (LoadFirstElt && LoadLastElt) {
SDValue VecLd = DAG.getLoad(VT, DL, ML->getChain(), ML->getBasePtr(),
ML->getMemOperand());
SDValue Blend = DAG.getSelect(DL, VT, ML->getMask(), VecLd, ML->getSrc0());
return DCI.CombineTo(ML, Blend, VecLd.getValue(1), true);
}
// Convert a masked load with a constant mask into a masked load and a select.
// This allows the select operation to use a faster kind of select instruction
// (for example, vblendvps -> vblendps).
// Don't try this if the pass-through operand is already undefined. That would
// cause an infinite loop because that's what we're about to create.
if (ML->getSrc0().isUndef())
return SDValue();
// The new masked load has an undef pass-through operand. The select uses the
// original pass-through operand.
SDValue NewML = DAG.getMaskedLoad(VT, DL, ML->getChain(), ML->getBasePtr(),
ML->getMask(), DAG.getUNDEF(VT),
ML->getMemoryVT(), ML->getMemOperand(),
ML->getExtensionType());
SDValue Blend = DAG.getSelect(DL, VT, ML->getMask(), NewML, ML->getSrc0());
return DCI.CombineTo(ML, Blend, NewML.getValue(1), true);
}
static SDValue combineMaskedLoad(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
MaskedLoadSDNode *Mld = cast<MaskedLoadSDNode>(N);
// TODO: Expanding load with constant mask may be optimized as well.
if (Mld->isExpandingLoad())
return SDValue();
if (Mld->getExtensionType() == ISD::NON_EXTLOAD) {
if (SDValue ScalarLoad = reduceMaskedLoadToScalarLoad(Mld, DAG, DCI))
return ScalarLoad;
// TODO: Do some AVX512 subsets benefit from this transform?
if (!Subtarget.hasAVX512())
if (SDValue Blend = combineMaskedLoadConstantMask(Mld, DAG, DCI))
return Blend;
}
if (Mld->getExtensionType() != ISD::SEXTLOAD)
return SDValue();
// Resolve extending loads.
EVT VT = Mld->getValueType(0);
unsigned NumElems = VT.getVectorNumElements();
EVT LdVT = Mld->getMemoryVT();
SDLoc dl(Mld);
assert(LdVT != VT && "Cannot extend to the same type");
unsigned ToSz = VT.getScalarSizeInBits();
unsigned FromSz = LdVT.getScalarSizeInBits();
// From/To sizes and ElemCount must be pow of two.
assert (isPowerOf2_32(NumElems * FromSz * ToSz) &&
"Unexpected size for extending masked load");
unsigned SizeRatio = ToSz / FromSz;
assert(SizeRatio * NumElems * FromSz == VT.getSizeInBits());
// Create a type on which we perform the shuffle.
EVT WideVecVT = EVT::getVectorVT(*DAG.getContext(),
LdVT.getScalarType(), NumElems*SizeRatio);
assert(WideVecVT.getSizeInBits() == VT.getSizeInBits());
// Convert Src0 value.
SDValue WideSrc0 = DAG.getBitcast(WideVecVT, Mld->getSrc0());
if (!Mld->getSrc0().isUndef()) {
SmallVector<int, 16> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i] = i * SizeRatio;
// Can't shuffle using an illegal type.
assert(DAG.getTargetLoweringInfo().isTypeLegal(WideVecVT) &&
"WideVecVT should be legal");
WideSrc0 = DAG.getVectorShuffle(WideVecVT, dl, WideSrc0,
DAG.getUNDEF(WideVecVT), ShuffleVec);
}
// Prepare the new mask.
SDValue NewMask;
SDValue Mask = Mld->getMask();
if (Mask.getValueType() == VT) {
// Mask and original value have the same type.
NewMask = DAG.getBitcast(WideVecVT, Mask);
SmallVector<int, 16> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i] = i * SizeRatio;
for (unsigned i = NumElems; i != NumElems * SizeRatio; ++i)
ShuffleVec[i] = NumElems * SizeRatio;
NewMask = DAG.getVectorShuffle(WideVecVT, dl, NewMask,
DAG.getConstant(0, dl, WideVecVT),
ShuffleVec);
} else {
assert(Mask.getValueType().getVectorElementType() == MVT::i1);
unsigned WidenNumElts = NumElems*SizeRatio;
unsigned MaskNumElts = VT.getVectorNumElements();
EVT NewMaskVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
WidenNumElts);
unsigned NumConcat = WidenNumElts / MaskNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
SDValue ZeroVal = DAG.getConstant(0, dl, Mask.getValueType());
Ops[0] = Mask;
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = ZeroVal;
NewMask = DAG.getNode(ISD::CONCAT_VECTORS, dl, NewMaskVT, Ops);
}
SDValue WideLd = DAG.getMaskedLoad(WideVecVT, dl, Mld->getChain(),
Mld->getBasePtr(), NewMask, WideSrc0,
Mld->getMemoryVT(), Mld->getMemOperand(),
ISD::NON_EXTLOAD);
SDValue NewVec = getExtendInVec(X86ISD::VSEXT, dl, VT, WideLd, DAG);
return DCI.CombineTo(N, NewVec, WideLd.getValue(1), true);
}
/// If exactly one element of the mask is set for a non-truncating masked store,
/// it is a vector extract and scalar store.
/// Note: It is expected that the degenerate cases of an all-zeros or all-ones
/// mask have already been optimized in IR, so we don't bother with those here.
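/// For example, a v4i32 masked store with mask <0,1,0,0> extracts element 1
/// and stores that i32 at BasePtr + 4.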
static SDValue reduceMaskedStoreToScalarStore(MaskedStoreSDNode *MS,
SelectionDAG &DAG) {
// TODO: This is not x86-specific, so it could be lifted to DAGCombiner.
// However, some target hooks may need to be added to know when the transform
// is profitable. Endianness would also have to be considered.
SDValue Addr, VecIndex;
unsigned Alignment;
if (!getParamsForOneTrueMaskedElt(MS, DAG, Addr, VecIndex, Alignment))
return SDValue();
// Extract the one scalar element that is actually being stored.
SDLoc DL(MS);
EVT VT = MS->getValue().getValueType();
EVT EltVT = VT.getVectorElementType();
SDValue Extract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT,
MS->getValue(), VecIndex);
// Store that element at the appropriate offset from the base pointer.
return DAG.getStore(MS->getChain(), DL, Extract, Addr, MS->getPointerInfo(),
Alignment, MS->getMemOperand()->getFlags());
}
static SDValue combineMaskedStore(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MaskedStoreSDNode *Mst = cast<MaskedStoreSDNode>(N);
if (Mst->isCompressingStore())
return SDValue();
if (!Mst->isTruncatingStore())
return reduceMaskedStoreToScalarStore(Mst, DAG);
// Resolve truncating stores.
EVT VT = Mst->getValue().getValueType();
unsigned NumElems = VT.getVectorNumElements();
EVT StVT = Mst->getMemoryVT();
SDLoc dl(Mst);
assert(StVT != VT && "Cannot truncate to the same type");
unsigned FromSz = VT.getScalarSizeInBits();
unsigned ToSz = StVT.getScalarSizeInBits();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// The truncating store is legal in some cases. For example
// vpmovqb, vpmovqw, vpmovqd, vpmovdb, vpmovdw
// are designated for truncate store.
// In this case we don't need any further transformations.
if (TLI.isTruncStoreLegal(VT, StVT))
return SDValue();
// From/To sizes and ElemCount must be pow of two.
assert (isPowerOf2_32(NumElems * FromSz * ToSz) &&
"Unexpected size for truncating masked store");
// We are going to use the original vector elt for storing.
// Accumulated smaller vector elements must be a multiple of the store size.
assert (((NumElems * FromSz) % ToSz) == 0 &&
"Unexpected ratio for truncating masked store");
unsigned SizeRatio = FromSz / ToSz;
assert(SizeRatio * NumElems * ToSz == VT.getSizeInBits());
// Create a type on which we perform the shuffle.
EVT WideVecVT = EVT::getVectorVT(*DAG.getContext(),
StVT.getScalarType(), NumElems*SizeRatio);
assert(WideVecVT.getSizeInBits() == VT.getSizeInBits());
SDValue WideVec = DAG.getBitcast(WideVecVT, Mst->getValue());
SmallVector<int, 16> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i] = i * SizeRatio;
// Can't shuffle using an illegal type.
assert(DAG.getTargetLoweringInfo().isTypeLegal(WideVecVT) &&
"WideVecVT should be legal");
SDValue TruncatedVal = DAG.getVectorShuffle(WideVecVT, dl, WideVec,
DAG.getUNDEF(WideVecVT),
ShuffleVec);
SDValue NewMask;
SDValue Mask = Mst->getMask();
if (Mask.getValueType() == VT) {
// Mask and original value have the same type.
NewMask = DAG.getBitcast(WideVecVT, Mask);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i] = i * SizeRatio;
for (unsigned i = NumElems; i != NumElems*SizeRatio; ++i)
ShuffleVec[i] = NumElems*SizeRatio;
NewMask = DAG.getVectorShuffle(WideVecVT, dl, NewMask,
DAG.getConstant(0, dl, WideVecVT),
ShuffleVec);
} else {
assert(Mask.getValueType().getVectorElementType() == MVT::i1);
unsigned WidenNumElts = NumElems*SizeRatio;
unsigned MaskNumElts = VT.getVectorNumElements();
EVT NewMaskVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
WidenNumElts);
unsigned NumConcat = WidenNumElts / MaskNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);
SDValue ZeroVal = DAG.getConstant(0, dl, Mask.getValueType());
Ops[0] = Mask;
for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = ZeroVal;
NewMask = DAG.getNode(ISD::CONCAT_VECTORS, dl, NewMaskVT, Ops);
}
return DAG.getMaskedStore(Mst->getChain(), dl, TruncatedVal,
Mst->getBasePtr(), NewMask, StVT,
Mst->getMemOperand(), false);
}
static SDValue combineStore(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
StoreSDNode *St = cast<StoreSDNode>(N);
EVT VT = St->getValue().getValueType();
EVT StVT = St->getMemoryVT();
SDLoc dl(St);
SDValue StoredVal = St->getOperand(1);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// If we are saving a concatenation of two XMM registers and 32-byte stores
// are slow, such as on Sandy Bridge, perform two 16-byte stores.
bool Fast;
unsigned AddressSpace = St->getAddressSpace();
unsigned Alignment = St->getAlignment();
if (VT.is256BitVector() && StVT == VT &&
TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), VT,
AddressSpace, Alignment, &Fast) &&
!Fast) {
unsigned NumElems = VT.getVectorNumElements();
if (NumElems < 2)
return SDValue();
SDValue Value0 = extract128BitVector(StoredVal, 0, DAG, dl);
SDValue Value1 = extract128BitVector(StoredVal, NumElems / 2, DAG, dl);
SDValue Ptr0 = St->getBasePtr();
SDValue Ptr1 = DAG.getMemBasePlusOffset(Ptr0, 16, dl);
SDValue Ch0 =
DAG.getStore(St->getChain(), dl, Value0, Ptr0, St->getPointerInfo(),
Alignment, St->getMemOperand()->getFlags());
SDValue Ch1 =
DAG.getStore(St->getChain(), dl, Value1, Ptr1, St->getPointerInfo(),
std::min(16U, Alignment), St->getMemOperand()->getFlags());
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Ch0, Ch1);
}
// Optimize trunc store (of multiple scalars) to shuffle and store.
// First, pack all of the elements in one place. Next, store to memory
// in fewer chunks.
if (St->isTruncatingStore() && VT.isVector()) {
// Check if we can detect an AVG pattern from the truncation. If yes,
// replace the trunc store by a normal store with the result of X86ISD::AVG
// instruction.
if (SDValue Avg = detectAVGPattern(St->getValue(), St->getMemoryVT(), DAG,
Subtarget, dl))
return DAG.getStore(St->getChain(), dl, Avg, St->getBasePtr(),
St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags());
if (SDValue Val =
detectAVX512USatPattern(St->getValue(), St->getMemoryVT(), Subtarget))
return EmitTruncSStore(false /* Unsigned saturation */, St->getChain(),
dl, Val, St->getBasePtr(),
St->getMemoryVT(), St->getMemOperand(), DAG);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
unsigned NumElems = VT.getVectorNumElements();
assert(StVT != VT && "Cannot truncate to the same type");
unsigned FromSz = VT.getScalarSizeInBits();
unsigned ToSz = StVT.getScalarSizeInBits();
// The truncating store is legal in some cases. For example
// vpmovqb, vpmovqw, vpmovqd, vpmovdb, vpmovdw
// are designated for truncate store.
// In this case we don't need any further transformations.
if (TLI.isTruncStoreLegalOrCustom(VT, StVT))
return SDValue();
// From, To sizes and ElemCount must be pow of two
if (!isPowerOf2_32(NumElems * FromSz * ToSz)) return SDValue();
// We are going to use the original vector elt for storing.
// Accumulated smaller vector elements must be a multiple of the store size.
if (0 != (NumElems * FromSz) % ToSz) return SDValue();
unsigned SizeRatio = FromSz / ToSz;
assert(SizeRatio * NumElems * ToSz == VT.getSizeInBits());
// Create a type on which we perform the shuffle
EVT WideVecVT = EVT::getVectorVT(*DAG.getContext(),
StVT.getScalarType(), NumElems*SizeRatio);
assert(WideVecVT.getSizeInBits() == VT.getSizeInBits());
SDValue WideVec = DAG.getBitcast(WideVecVT, St->getValue());
SmallVector<int, 8> ShuffleVec(NumElems * SizeRatio, -1);
for (unsigned i = 0; i != NumElems; ++i)
ShuffleVec[i] = i * SizeRatio;
// Can't shuffle using an illegal type.
if (!TLI.isTypeLegal(WideVecVT))
return SDValue();
SDValue Shuff = DAG.getVectorShuffle(WideVecVT, dl, WideVec,
DAG.getUNDEF(WideVecVT),
ShuffleVec);
// At this point all of the data is stored at the bottom of the
// register. We now need to save it to mem.
// Find the largest store unit
MVT StoreType = MVT::i8;
for (MVT Tp : MVT::integer_valuetypes()) {
if (TLI.isTypeLegal(Tp) && Tp.getSizeInBits() <= NumElems * ToSz)
StoreType = Tp;
}
// On 32-bit systems, we can't save 64-bit integers. Try bitcasting to f64.
if (TLI.isTypeLegal(MVT::f64) && StoreType.getSizeInBits() < 64 &&
(64 <= NumElems * ToSz))
StoreType = MVT::f64;
// Bitcast the original vector into a vector of store-size units
EVT StoreVecVT = EVT::getVectorVT(*DAG.getContext(),
StoreType, VT.getSizeInBits()/StoreType.getSizeInBits());
assert(StoreVecVT.getSizeInBits() == VT.getSizeInBits());
SDValue ShuffWide = DAG.getBitcast(StoreVecVT, Shuff);
SmallVector<SDValue, 8> Chains;
SDValue Ptr = St->getBasePtr();
// Perform one or more big stores into memory.
for (unsigned i=0, e=(ToSz*NumElems)/StoreType.getSizeInBits(); i!=e; ++i) {
SDValue SubVec = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl,
StoreType, ShuffWide,
DAG.getIntPtrConstant(i, dl));
SDValue Ch =
DAG.getStore(St->getChain(), dl, SubVec, Ptr, St->getPointerInfo(),
St->getAlignment(), St->getMemOperand()->getFlags());
Ptr = DAG.getMemBasePlusOffset(Ptr, StoreType.getStoreSize(), dl);
Chains.push_back(Ch);
}
return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Chains);
}
// Turn load->store of MMX types into GPR load/stores. This avoids clobbering
// the FP state in cases where an emms may be missing.
// A preferable solution to the general problem is to figure out the right
// places to insert EMMS. This qualifies as a quick hack.
// Similarly, turn load->store of i64 into double load/stores in 32-bit mode.
if (VT.getSizeInBits() != 64)
return SDValue();
const Function *F = DAG.getMachineFunction().getFunction();
bool NoImplicitFloatOps = F->hasFnAttribute(Attribute::NoImplicitFloat);
bool F64IsLegal =
!Subtarget.useSoftFloat() && !NoImplicitFloatOps && Subtarget.hasSSE2();
if ((VT.isVector() ||
(VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit())) &&
isa<LoadSDNode>(St->getValue()) &&
!cast<LoadSDNode>(St->getValue())->isVolatile() &&
St->getChain().hasOneUse() && !St->isVolatile()) {
SDNode* LdVal = St->getValue().getNode();
LoadSDNode *Ld = nullptr;
int TokenFactorIndex = -1;
SmallVector<SDValue, 8> Ops;
SDNode* ChainVal = St->getChain().getNode();
// Must be a store of a load. We currently handle two cases: the load
// is a direct child, and it's under an intervening TokenFactor. It is
// possible to dig deeper under nested TokenFactors.
if (ChainVal == LdVal)
Ld = cast<LoadSDNode>(St->getChain());
else if (St->getValue().hasOneUse() &&
ChainVal->getOpcode() == ISD::TokenFactor) {
for (unsigned i = 0, e = ChainVal->getNumOperands(); i != e; ++i) {
if (ChainVal->getOperand(i).getNode() == LdVal) {
TokenFactorIndex = i;
Ld = cast<LoadSDNode>(St->getValue());
} else
Ops.push_back(ChainVal->getOperand(i));
}
}
if (!Ld || !ISD::isNormalLoad(Ld))
return SDValue();
// If this is not the MMX case, i.e. we are just turning i64 load/store
// into f64 load/store, avoid the transformation if there are multiple
// uses of the loaded value.
if (!VT.isVector() && !Ld->hasNUsesOfValue(1, 0))
return SDValue();
SDLoc LdDL(Ld);
SDLoc StDL(N);
// If we are a 64-bit capable x86, lower to a single movq load/store pair.
// Otherwise, if it's legal to use f64 SSE instructions, use f64 load/store
// pair instead.
if (Subtarget.is64Bit() || F64IsLegal) {
MVT LdVT = Subtarget.is64Bit() ? MVT::i64 : MVT::f64;
SDValue NewLd = DAG.getLoad(LdVT, LdDL, Ld->getChain(), Ld->getBasePtr(),
Ld->getPointerInfo(), Ld->getAlignment(),
Ld->getMemOperand()->getFlags());
// Make sure new load is placed in same chain order.
SDValue NewChain = DAG.makeEquivalentMemoryOrdering(Ld, NewLd);
if (TokenFactorIndex >= 0) {
Ops.push_back(NewChain);
NewChain = DAG.getNode(ISD::TokenFactor, LdDL, MVT::Other, Ops);
}
return DAG.getStore(NewChain, StDL, NewLd, St->getBasePtr(),
St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags());
}
// Otherwise, lower to two pairs of 32-bit loads / stores.
SDValue LoAddr = Ld->getBasePtr();
SDValue HiAddr = DAG.getMemBasePlusOffset(LoAddr, 4, LdDL);
SDValue LoLd = DAG.getLoad(MVT::i32, LdDL, Ld->getChain(), LoAddr,
Ld->getPointerInfo(), Ld->getAlignment(),
Ld->getMemOperand()->getFlags());
SDValue HiLd = DAG.getLoad(MVT::i32, LdDL, Ld->getChain(), HiAddr,
Ld->getPointerInfo().getWithOffset(4),
MinAlign(Ld->getAlignment(), 4),
Ld->getMemOperand()->getFlags());
// Make sure new loads are placed in same chain order.
SDValue NewChain = DAG.makeEquivalentMemoryOrdering(Ld, LoLd);
NewChain = DAG.makeEquivalentMemoryOrdering(Ld, HiLd);
if (TokenFactorIndex >= 0) {
Ops.push_back(NewChain);
NewChain = DAG.getNode(ISD::TokenFactor, LdDL, MVT::Other, Ops);
}
LoAddr = St->getBasePtr();
HiAddr = DAG.getMemBasePlusOffset(LoAddr, 4, StDL);
SDValue LoSt =
DAG.getStore(NewChain, StDL, LoLd, LoAddr, St->getPointerInfo(),
St->getAlignment(), St->getMemOperand()->getFlags());
SDValue HiSt = DAG.getStore(
NewChain, StDL, HiLd, HiAddr, St->getPointerInfo().getWithOffset(4),
MinAlign(St->getAlignment(), 4), St->getMemOperand()->getFlags());
return DAG.getNode(ISD::TokenFactor, StDL, MVT::Other, LoSt, HiSt);
}
// This is similar to the above case, but here we handle a scalar 64-bit
// integer store that is extracted from a vector on a 32-bit target.
// If we have SSE2, then we can treat it like a floating-point double
// to get past legalization. The execution dependencies fixup pass will
// choose the optimal machine instruction for the store if this really is
// an integer or v2f32 rather than an f64.
if (VT == MVT::i64 && F64IsLegal && !Subtarget.is64Bit() &&
St->getOperand(1).getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
SDValue OldExtract = St->getOperand(1);
SDValue ExtOp0 = OldExtract.getOperand(0);
unsigned VecSize = ExtOp0.getValueSizeInBits();
EVT VecVT = EVT::getVectorVT(*DAG.getContext(), MVT::f64, VecSize / 64);
SDValue BitCast = DAG.getBitcast(VecVT, ExtOp0);
SDValue NewExtract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64,
BitCast, OldExtract.getOperand(1));
return DAG.getStore(St->getChain(), dl, NewExtract, St->getBasePtr(),
St->getPointerInfo(), St->getAlignment(),
St->getMemOperand()->getFlags());
}
return SDValue();
}
/// Return 'true' if this vector operation is "horizontal"
/// and return the operands for the horizontal operation in LHS and RHS. A
/// horizontal operation performs the binary operation on successive elements
/// of its first operand, then on successive elements of its second operand,
/// returning the resulting values in a vector. For example, if
/// A = < float a0, float a1, float a2, float a3 >
/// and
/// B = < float b0, float b1, float b2, float b3 >
/// then the result of doing a horizontal operation on A and B is
/// A horizontal-op B = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >.
/// In short, LHS and RHS are inspected to see if LHS op RHS is of the form
/// A horizontal-op B, for some already available A and B, and if so then LHS is
/// set to A, RHS to B, and the routine returns 'true'.
/// Note that the binary operation should have the property that if one of the
/// operands is UNDEF then the result is UNDEF.
static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative) {
// Look for the following pattern: if
// A = < float a0, float a1, float a2, float a3 >
// B = < float b0, float b1, float b2, float b3 >
// and
// LHS = VECTOR_SHUFFLE A, B, <0, 2, 4, 6>
// RHS = VECTOR_SHUFFLE A, B, <1, 3, 5, 7>
// then LHS op RHS = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >
// which is A horizontal-op B.
// At least one of the operands should be a vector shuffle.
if (LHS.getOpcode() != ISD::VECTOR_SHUFFLE &&
RHS.getOpcode() != ISD::VECTOR_SHUFFLE)
return false;
MVT VT = LHS.getSimpleValueType();
assert((VT.is128BitVector() || VT.is256BitVector()) &&
"Unsupported vector type for horizontal add/sub");
// Handle 128 and 256-bit vector lengths. AVX defines horizontal add/sub to
// operate independently on 128-bit lanes.
unsigned NumElts = VT.getVectorNumElements();
unsigned NumLanes = VT.getSizeInBits()/128;
unsigned NumLaneElts = NumElts / NumLanes;
assert((NumLaneElts % 2 == 0) &&
"Vector type should have an even number of elements in each lane");
unsigned HalfLaneElts = NumLaneElts/2;
// View LHS in the form
// LHS = VECTOR_SHUFFLE A, B, LMask
// If LHS is not a shuffle then pretend it is the shuffle
// LHS = VECTOR_SHUFFLE LHS, undef, <0, 1, ..., N-1>
// NOTE: in what follows a default initialized SDValue represents an UNDEF of
// type VT.
SDValue A, B;
SmallVector<int, 16> LMask(NumElts);
if (LHS.getOpcode() == ISD::VECTOR_SHUFFLE) {
if (!LHS.getOperand(0).isUndef())
A = LHS.getOperand(0);
if (!LHS.getOperand(1).isUndef())
B = LHS.getOperand(1);
ArrayRef<int> Mask = cast<ShuffleVectorSDNode>(LHS.getNode())->getMask();
std::copy(Mask.begin(), Mask.end(), LMask.begin());
} else {
if (!LHS.isUndef())
A = LHS;
for (unsigned i = 0; i != NumElts; ++i)
LMask[i] = i;
}
// Likewise, view RHS in the form
// RHS = VECTOR_SHUFFLE C, D, RMask
SDValue C, D;
SmallVector<int, 16> RMask(NumElts);
if (RHS.getOpcode() == ISD::VECTOR_SHUFFLE) {
if (!RHS.getOperand(0).isUndef())
C = RHS.getOperand(0);
if (!RHS.getOperand(1).isUndef())
D = RHS.getOperand(1);
ArrayRef<int> Mask = cast<ShuffleVectorSDNode>(RHS.getNode())->getMask();
std::copy(Mask.begin(), Mask.end(), RMask.begin());
} else {
if (!RHS.isUndef())
C = RHS;
for (unsigned i = 0; i != NumElts; ++i)
RMask[i] = i;
}
// Check that the shuffles are both shuffling the same vectors.
if (!(A == C && B == D) && !(A == D && B == C))
return false;
// If everything is UNDEF then bail out: it would be better to fold to UNDEF.
if (!A.getNode() && !B.getNode())
return false;
// If A and B occur in reverse order in RHS, then "swap" them (which means
// rewriting the mask).
if (A != C)
ShuffleVectorSDNode::commuteMask(RMask);
// At this point LHS and RHS are equivalent to
// LHS = VECTOR_SHUFFLE A, B, LMask
// RHS = VECTOR_SHUFFLE A, B, RMask
// Check that the masks correspond to performing a horizontal operation.
for (unsigned l = 0; l != NumElts; l += NumLaneElts) {
for (unsigned i = 0; i != NumLaneElts; ++i) {
int LIdx = LMask[i+l], RIdx = RMask[i+l];
// Ignore any UNDEF components.
if (LIdx < 0 || RIdx < 0 ||
(!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||
(!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))
continue;
// Check that successive elements are being operated on. If not, this is
// not a horizontal operation.
unsigned Src = (i/HalfLaneElts); // each lane is split between srcs
int Index = 2*(i%HalfLaneElts) + NumElts*Src + l;
if (!(LIdx == Index && RIdx == Index + 1) &&
!(IsCommutative && LIdx == Index + 1 && RIdx == Index))
return false;
}
}
LHS = A.getNode() ? A : B; // If A is 'UNDEF', use B for it.
RHS = B.getNode() ? B : A; // If B is 'UNDEF', use A for it.
return true;
}
/// Do target-specific dag combines on floating-point adds/subs.
static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
bool IsFadd = N->getOpcode() == ISD::FADD;
assert((IsFadd || N->getOpcode() == ISD::FSUB) && "Wrong opcode");
// Try to synthesize horizontal add/sub from adds/subs of shuffles.
if (((Subtarget.hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
(Subtarget.hasFp256() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&
isHorizontalBinOp(LHS, RHS, IsFadd)) {
auto NewOpcode = IsFadd ? X86ISD::FHADD : X86ISD::FHSUB;
return DAG.getNode(NewOpcode, SDLoc(N), VT, LHS, RHS);
}
return SDValue();
}
/// Attempt to pre-truncate inputs to arithmetic ops if it will simplify
/// the codegen.
/// e.g. TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) )
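/// For instance (illustrative, assuming a target where v4i64 multiplies are
/// not legal but v4i32 multiplies are):
///   (v4i32 trunc (v4i64 mul X, Y)) --> (v4i32 mul (trunc X), (trunc Y))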
static SDValue combineTruncatedArithmetic(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget,
SDLoc &DL) {
assert(N->getOpcode() == ISD::TRUNCATE && "Wrong opcode");
SDValue Src = N->getOperand(0);
unsigned Opcode = Src.getOpcode();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
EVT VT = N->getValueType(0);
EVT SrcVT = Src.getValueType();
auto IsRepeatedOpOrFreeTruncation = [VT](SDValue Op0, SDValue Op1) {
unsigned TruncSizeInBits = VT.getScalarSizeInBits();
// Repeated operand, so we are only trading one output truncation for
// one input truncation.
if (Op0 == Op1)
return true;
// See if either operand has been extended from a smaller/equal size to
// the truncation size, allowing a truncation to combine with the extend.
unsigned Opcode0 = Op0.getOpcode();
if ((Opcode0 == ISD::ANY_EXTEND || Opcode0 == ISD::SIGN_EXTEND ||
Opcode0 == ISD::ZERO_EXTEND) &&
Op0.getOperand(0).getScalarValueSizeInBits() <= TruncSizeInBits)
return true;
unsigned Opcode1 = Op1.getOpcode();
if ((Opcode1 == ISD::ANY_EXTEND || Opcode1 == ISD::SIGN_EXTEND ||
Opcode1 == ISD::ZERO_EXTEND) &&
Op1.getOperand(0).getScalarValueSizeInBits() <= TruncSizeInBits)
return true;
// See if either operand is a single use constant which can be constant
// folded.
SDValue BC0 = peekThroughOneUseBitcasts(Op0);
SDValue BC1 = peekThroughOneUseBitcasts(Op1);
return ISD::isBuildVectorOfConstantSDNodes(BC0.getNode()) ||
ISD::isBuildVectorOfConstantSDNodes(BC1.getNode());
};
auto TruncateArithmetic = [&](SDValue N0, SDValue N1) {
SDValue Trunc0 = DAG.getNode(ISD::TRUNCATE, DL, VT, N0);
SDValue Trunc1 = DAG.getNode(ISD::TRUNCATE, DL, VT, N1);
return DAG.getNode(Opcode, DL, VT, Trunc0, Trunc1);
};
// Don't combine if the operation has other uses.
if (!N->isOnlyUserOf(Src.getNode()))
return SDValue();
// Only support vector truncation for now.
// TODO: i64 scalar math would benefit as well.
if (!VT.isVector())
return SDValue();
// In most cases it's only worth pre-truncating if we're only facing the cost
// of one truncation, i.e. if one of the inputs will constant fold or the
// input is repeated.
switch (Opcode) {
case ISD::AND:
case ISD::XOR:
case ISD::OR: {
SDValue Op0 = Src.getOperand(0);
SDValue Op1 = Src.getOperand(1);
if (TLI.isOperationLegalOrPromote(Opcode, VT) &&
IsRepeatedOpOrFreeTruncation(Op0, Op1))
return TruncateArithmetic(Op0, Op1);
break;
}
case ISD::MUL:
// X86 is rubbish at scalar and vector i64 multiplies (until AVX512DQ) - it's
// better to truncate if we have the chance.
if (SrcVT.getScalarType() == MVT::i64 && TLI.isOperationLegal(Opcode, VT) &&
!TLI.isOperationLegal(Opcode, SrcVT))
return TruncateArithmetic(Src.getOperand(0), Src.getOperand(1));
LLVM_FALLTHROUGH;
case ISD::ADD: {
SDValue Op0 = Src.getOperand(0);
SDValue Op1 = Src.getOperand(1);
if (TLI.isOperationLegal(Opcode, VT) &&
IsRepeatedOpOrFreeTruncation(Op0, Op1))
return TruncateArithmetic(Op0, Op1);
break;
}
}
return SDValue();
}
/// Truncate a group of v4i32 into v16i8/v8i16 using X86ISD::PACKUS.
static SDValue
combineVectorTruncationWithPACKUS(SDNode *N, SelectionDAG &DAG,
SmallVector<SDValue, 8> &Regs) {
assert(Regs.size() > 0 && (Regs[0].getValueType() == MVT::v4i32 ||
Regs[0].getValueType() == MVT::v2i64));
EVT OutVT = N->getValueType(0);
EVT OutSVT = OutVT.getVectorElementType();
EVT InVT = Regs[0].getValueType();
EVT InSVT = InVT.getVectorElementType();
SDLoc DL(N);
// First, use a mask to unset all bits that won't appear in the result.
assert((OutSVT == MVT::i8 || OutSVT == MVT::i16) &&
"OutSVT can only be either i8 or i16.");
APInt Mask =
APInt::getLowBitsSet(InSVT.getSizeInBits(), OutSVT.getSizeInBits());
SDValue MaskVal = DAG.getConstant(Mask, DL, InVT);
for (auto &Reg : Regs)
Reg = DAG.getNode(ISD::AND, DL, InVT, MaskVal, Reg);
MVT UnpackedVT, PackedVT;
if (OutSVT == MVT::i8) {
UnpackedVT = MVT::v8i16;
PackedVT = MVT::v16i8;
} else {
UnpackedVT = MVT::v4i32;
PackedVT = MVT::v8i16;
}
// In each iteration, halve the element size by packing adjacent register
// pairs.
auto RegNum = Regs.size();
for (unsigned j = 1, e = InSVT.getSizeInBits() / OutSVT.getSizeInBits();
j < e; j *= 2, RegNum /= 2) {
for (unsigned i = 0; i < RegNum; i++)
Regs[i] = DAG.getBitcast(UnpackedVT, Regs[i]);
for (unsigned i = 0; i < RegNum / 2; i++)
Regs[i] = DAG.getNode(X86ISD::PACKUS, DL, PackedVT, Regs[i * 2],
Regs[i * 2 + 1]);
}
// If the result type is v8i8, we need to do one more X86ISD::PACKUS and
// then extract a subvector as the result, since v8i8 is not a legal type.
if (OutVT == MVT::v8i8) {
Regs[0] = DAG.getNode(X86ISD::PACKUS, DL, PackedVT, Regs[0], Regs[0]);
Regs[0] = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OutVT, Regs[0],
DAG.getIntPtrConstant(0, DL));
return Regs[0];
} else if (RegNum > 1) {
Regs.resize(RegNum);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, OutVT, Regs);
} else
return Regs[0];
}
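// A rough sketch of the PACKUS loop above (illustrative only): each pass
// bitcasts the registers to the unpacked type and packs adjacent pairs,
// halving the register count until the element size matches OutSVT, e.g.
//   v4i32,v4i32 --(PACKUS)--> v16i8 --(PACKUS with itself)--> v8i8 extract.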
/// Truncate a group of v4i32 into v8i16 using X86ISD::PACKSS.
static SDValue
combineVectorTruncationWithPACKSS(SDNode *N, const X86Subtarget &Subtarget,
SelectionDAG &DAG,
SmallVector<SDValue, 8> &Regs) {
assert(Regs.size() > 0 && Regs[0].getValueType() == MVT::v4i32);
EVT OutVT = N->getValueType(0);
SDLoc DL(N);
// Shift left by 16 bits, then arithmetic-shift right by 16 bits.
SDValue ShAmt = DAG.getConstant(16, DL, MVT::i32);
for (auto &Reg : Regs) {
Reg = getTargetVShiftNode(X86ISD::VSHLI, DL, MVT::v4i32, Reg, ShAmt,
Subtarget, DAG);
Reg = getTargetVShiftNode(X86ISD::VSRAI, DL, MVT::v4i32, Reg, ShAmt,
Subtarget, DAG);
}
for (unsigned i = 0, e = Regs.size() / 2; i < e; i++)
Regs[i] = DAG.getNode(X86ISD::PACKSS, DL, MVT::v8i16, Regs[i * 2],
Regs[i * 2 + 1]);
if (Regs.size() > 2) {
Regs.resize(Regs.size() / 2);
return DAG.getNode(ISD::CONCAT_VECTORS, DL, OutVT, Regs);
} else
return Regs[0];
}
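// A sketch of the PACKSS path above (illustrative only): for each v4i32 reg,
//   X' = sra (shl X, 16), 16   ; sign-extend the low 16 bits of each lane
// so PACKSS(X'0, X'1) -> v8i16 cannot saturate and yields an exact truncate.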
/// This function transforms truncation from vXi32/vXi64 to vXi8/vXi16 into
/// X86ISD::PACKUS/X86ISD::PACKSS operations. We do it here because after type
/// legalization the truncation will be translated into a BUILD_VECTOR with each
/// element that is extracted from a vector and then truncated, and it is
/// difficult to do this optimization based on them.
static SDValue combineVectorTruncation(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT OutVT = N->getValueType(0);
if (!OutVT.isVector())
return SDValue();
SDValue In = N->getOperand(0);
if (!In.getValueType().isSimple())
return SDValue();
EVT InVT = In.getValueType();
unsigned NumElems = OutVT.getVectorNumElements();
// TODO: On AVX2, the behavior of X86ISD::PACKUS is different from that on
// SSE2, and we need to take care of it specially.
// AVX512 provides vpmovdb.
if (!Subtarget.hasSSE2() || Subtarget.hasAVX2())
return SDValue();
EVT OutSVT = OutVT.getVectorElementType();
EVT InSVT = InVT.getVectorElementType();
if (!((InSVT == MVT::i32 || InSVT == MVT::i64) &&
(OutSVT == MVT::i8 || OutSVT == MVT::i16) && isPowerOf2_32(NumElems) &&
NumElems >= 8))
return SDValue();
// SSSE3's pshufb results in fewer instructions in the cases below.
if (Subtarget.hasSSSE3() && NumElems == 8 &&
((OutSVT == MVT::i8 && InSVT != MVT::i64) ||
(InSVT == MVT::i32 && OutSVT == MVT::i16)))
return SDValue();
SDLoc DL(N);
// Split a long vector into vectors of legal type.
unsigned RegNum = InVT.getSizeInBits() / 128;
SmallVector<SDValue, 8> SubVec(RegNum);
unsigned NumSubRegElts = 128 / InSVT.getSizeInBits();
EVT SubRegVT = EVT::getVectorVT(*DAG.getContext(), InSVT, NumSubRegElts);
for (unsigned i = 0; i < RegNum; i++)
SubVec[i] = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, SubRegVT, In,
DAG.getIntPtrConstant(i * NumSubRegElts, DL));
// SSE2 provides PACKUS for only 2 x v8i16 -> v16i8 and SSE4.1 provides PACKUS
// for 2 x v4i32 -> v8i16. For SSSE3 and below, we need to use PACKSS to
// truncate 2 x v4i32 to v8i16.
if (Subtarget.hasSSE41() || OutSVT == MVT::i8)
return combineVectorTruncationWithPACKUS(N, DAG, SubVec);
else if (InSVT == MVT::i32)
return combineVectorTruncationWithPACKSS(N, Subtarget, DAG, SubVec);
else
return SDValue();
}
/// This function transforms a vector truncation of 'all or none' bits values
/// (vXi16/vXi32/vXi64 to vXi8/vXi16/vXi32) into X86ISD::PACKSS operations.
static SDValue combineVectorSignBitsTruncation(SDNode *N, SDLoc &DL,
SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// Requires SSE2, but skip on AVX512 since it already has fast truncates.
if (!Subtarget.hasSSE2() || Subtarget.hasAVX512())
return SDValue();
if (!N->getValueType(0).isVector() || !N->getValueType(0).isSimple())
return SDValue();
SDValue In = N->getOperand(0);
if (!In.getValueType().isSimple())
return SDValue();
MVT VT = N->getValueType(0).getSimpleVT();
MVT SVT = VT.getScalarType();
MVT InVT = In.getValueType().getSimpleVT();
MVT InSVT = InVT.getScalarType();
// Use PACKSS if the input is a splatted sign bit.
// e.g. Comparison result, sext_in_reg, etc.
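// Illustrative example (not from the original source):
// (v8i16 trunc (v8i32 setcc X, Y)) qualifies, since every element of the
// setcc result is all-ones or all-zeros.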
unsigned NumSignBits = DAG.ComputeNumSignBits(In);
if (NumSignBits != InSVT.getSizeInBits())
return SDValue();
// Check we have a truncation suited for PACKSS.
if (!VT.is128BitVector() && !VT.is256BitVector())
return SDValue();
if (SVT != MVT::i8 && SVT != MVT::i16 && SVT != MVT::i32)
return SDValue();
if (InSVT != MVT::i16 && InSVT != MVT::i32 && InSVT != MVT::i64)
return SDValue();
return truncateVectorCompareWithPACKSS(VT, In, DL, DAG, Subtarget);
}
static SDValue combineTruncate(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
SDValue Src = N->getOperand(0);
SDLoc DL(N);
// Attempt to pre-truncate inputs to arithmetic ops instead.
if (SDValue V = combineTruncatedArithmetic(N, DAG, Subtarget, DL))
return V;
// Try to detect AVG pattern first.
if (SDValue Avg = detectAVGPattern(Src, VT, DAG, Subtarget, DL))
return Avg;
// Try to combine truncation with unsigned saturation.
if (SDValue Val = combineTruncateWithUSat(Src, VT, DL, DAG, Subtarget))
return Val;
// Detect a truncation to i32 of a bitcast from x86mmx, i.e. the bitcast
// source is a direct MMX result; lower it to X86ISD::MMX_MOVD2W.
if (Src.getOpcode() == ISD::BITCAST && VT == MVT::i32) {
SDValue BCSrc = Src.getOperand(0);
if (BCSrc.getValueType() == MVT::x86mmx)
return DAG.getNode(X86ISD::MMX_MOVD2W, DL, MVT::i32, BCSrc);
}
// Try to truncate extended sign bits with PACKSS.
if (SDValue V = combineVectorSignBitsTruncation(N, DL, DAG, Subtarget))
return V;
return combineVectorTruncation(N, DAG, Subtarget);
}
/// Returns the negated value if the node \p N flips the sign of an FP value.
///
/// An FP-negation node may have different forms: FNEG(x) or
/// FXOR(x, 0x80000000). AVX512F does not have FXOR, so FNEG is lowered as
/// (bitcast (xor (bitcast x), (bitcast ConstantFP(0x80000000)))).
/// In this case we look through all bitcasts.
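/// An illustrative instance of the bitcasted form (a sketch, not from the
/// original source):
///   (v4f32 (bitcast (xor (bitcast X), (build_vector 0x80000000, ...))))
/// is recognized here as FNEG(X).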
static SDValue isFNEG(SDNode *N) {
if (N->getOpcode() == ISD::FNEG)
return N->getOperand(0);
SDValue Op = peekThroughBitcasts(SDValue(N, 0));
if (Op.getOpcode() != X86ISD::FXOR && Op.getOpcode() != ISD::XOR)
return SDValue();
SDValue Op1 = peekThroughBitcasts(Op.getOperand(1));
if (!Op1.getValueType().isFloatingPoint())
return SDValue();
SDValue Op0 = peekThroughBitcasts(Op.getOperand(0));
unsigned EltBits = Op1.getScalarValueSizeInBits();
auto isSignMask = [&](const ConstantFP *C) {
return C->getValueAPF().bitcastToAPInt() == APInt::getSignMask(EltBits);
};
// There is more than one way to represent the same constant on
// different X86 targets; the form of the node may also depend on the size:
// - load scalar value and broadcast
// - BUILD_VECTOR node
// - load from a constant pool.
// We check all variants here.
if (Op1.getOpcode() == X86ISD::VBROADCAST) {
if (auto *C = getTargetConstantFromNode(Op1.getOperand(0)))
if (isSignMask(cast<ConstantFP>(C)))
return Op0;
} else if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(Op1)) {
if (ConstantFPSDNode *CN = BV->getConstantFPSplatNode())
if (isSignMask(CN->getConstantFPValue()))
return Op0;
} else if (auto *C = getTargetConstantFromNode(Op1)) {
if (C->getType()->isVectorTy()) {
if (auto *SplatV = C->getSplatValue())
if (isSignMask(cast<ConstantFP>(SplatV)))
return Op0;
} else if (auto *FPConst = dyn_cast<ConstantFP>(C))
if (isSignMask(FPConst))
return Op0;
}
return SDValue();
}
/// Do target-specific dag combines on floating point negations.
static SDValue combineFneg(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT OrigVT = N->getValueType(0);
SDValue Arg = isFNEG(N);
assert(Arg.getNode() && "N is expected to be an FNEG node");
EVT VT = Arg.getValueType();
EVT SVT = VT.getScalarType();
SDLoc DL(N);
// Let legalize expand this if it isn't a legal type yet.
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
// If we're negating a FMUL node on a target with FMA, then we can avoid the
// use of a constant by performing (-0 - A*B) instead.
// FIXME: Check rounding control flags as well once it becomes available.
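// Sketch of the fold below (illustrative): with nsz,
//   fneg (fmul A, B) --> FNMSUB(A, B, 0.0), i.e. (-0 - A*B).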
if (Arg.getOpcode() == ISD::FMUL && (SVT == MVT::f32 || SVT == MVT::f64) &&
Arg->getFlags().hasNoSignedZeros() && Subtarget.hasAnyFMA()) {
SDValue Zero = DAG.getConstantFP(0.0, DL, VT);
SDValue NewNode = DAG.getNode(X86ISD::FNMSUB, DL, VT, Arg.getOperand(0),
Arg.getOperand(1), Zero);
return DAG.getBitcast(OrigVT, NewNode);
}
// If we're negating an FMA node, then we can adjust the
// instruction to include the extra negation.
unsigned NewOpcode = 0;
if (Arg.hasOneUse()) {
switch (Arg.getOpcode()) {
case X86ISD::FMADD: NewOpcode = X86ISD::FNMSUB; break;
case X86ISD::FMSUB: NewOpcode = X86ISD::FNMADD; break;
case X86ISD::FNMADD: NewOpcode = X86ISD::FMSUB; break;
case X86ISD::FNMSUB: NewOpcode = X86ISD::FMADD; break;
case X86ISD::FMADD_RND: NewOpcode = X86ISD::FNMSUB_RND; break;
case X86ISD::FMSUB_RND: NewOpcode = X86ISD::FNMADD_RND; break;
case X86ISD::FNMADD_RND: NewOpcode = X86ISD::FMSUB_RND; break;
case X86ISD::FNMSUB_RND: NewOpcode = X86ISD::FMADD_RND; break;
// We can't handle a scalar intrinsic node here because it would only
// invert one element and not the whole vector. But we could try to handle
// a negation of the lower element only.
}
}
if (NewOpcode)
return DAG.getBitcast(OrigVT, DAG.getNode(NewOpcode, DL, VT,
Arg.getNode()->ops()));
return SDValue();
}
static SDValue lowerX86FPLogicOp(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MVT VT = N->getSimpleValueType(0);
// If we have integer vector types available, use the integer opcodes.
if (VT.isVector() && Subtarget.hasSSE2()) {
SDLoc dl(N);
MVT IntVT = MVT::getVectorVT(MVT::i64, VT.getSizeInBits() / 64);
SDValue Op0 = DAG.getBitcast(IntVT, N->getOperand(0));
SDValue Op1 = DAG.getBitcast(IntVT, N->getOperand(1));
unsigned IntOpcode;
switch (N->getOpcode()) {
default: llvm_unreachable("Unexpected FP logic op");
case X86ISD::FOR: IntOpcode = ISD::OR; break;
case X86ISD::FXOR: IntOpcode = ISD::XOR; break;
case X86ISD::FAND: IntOpcode = ISD::AND; break;
case X86ISD::FANDN: IntOpcode = X86ISD::ANDNP; break;
}
SDValue IntOp = DAG.getNode(IntOpcode, dl, IntVT, Op0, Op1);
return DAG.getBitcast(VT, IntOp);
}
return SDValue();
}
static SDValue combineXor(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (SDValue Cmp = foldVectorXorShiftIntoCmp(N, DAG, Subtarget))
return Cmp;
if (DCI.isBeforeLegalizeOps())
return SDValue();
if (SDValue RV = foldXorTruncShiftIntoCmp(N, DAG))
return RV;
if (Subtarget.hasCMov())
if (SDValue RV = combineIntegerAbs(N, DAG))
return RV;
if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, Subtarget))
return FPLogic;
if (isFNEG(N))
return combineFneg(N, DAG, Subtarget);
return SDValue();
}
static bool isNullFPScalarOrVectorConst(SDValue V) {
return isNullFPConstant(V) || ISD::isBuildVectorAllZeros(V.getNode());
}
/// If a value is a scalar FP zero or a vector FP zero (potentially including
/// undefined elements), return a zero constant that may be used to fold away
/// that value. In the case of a vector, the returned constant will not contain
/// undefined elements even if the input parameter does. This makes it suitable
/// to be used as a replacement operand with operations (e.g., bitwise-and)
/// where
/// an undef should not propagate.
static SDValue getNullFPConstForNullVal(SDValue V, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (!isNullFPScalarOrVectorConst(V))
return SDValue();
if (V.getValueType().isVector())
return getZeroVector(V.getSimpleValueType(), Subtarget, DAG, SDLoc(V));
return V;
}
static SDValue combineFAndFNotToFAndn(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);
SDLoc DL(N);
// Vector types are handled in combineANDXORWithAllOnesIntoANDNP().
if (!((VT == MVT::f32 && Subtarget.hasSSE1()) ||
(VT == MVT::f64 && Subtarget.hasSSE2())))
return SDValue();
auto isAllOnesConstantFP = [](SDValue V) {
auto *C = dyn_cast<ConstantFPSDNode>(V);
return C && C->getConstantFPValue()->isAllOnesValue();
};
// fand (fxor X, -1), Y --> fandn X, Y
if (N0.getOpcode() == X86ISD::FXOR && isAllOnesConstantFP(N0.getOperand(1)))
return DAG.getNode(X86ISD::FANDN, DL, VT, N0.getOperand(0), N1);
// fand X, (fxor Y, -1) --> fandn Y, X
if (N1.getOpcode() == X86ISD::FXOR && isAllOnesConstantFP(N1.getOperand(1)))
return DAG.getNode(X86ISD::FANDN, DL, VT, N1.getOperand(0), N0);
return SDValue();
}
/// Do target-specific dag combines on X86ISD::FAND nodes.
static SDValue combineFAnd(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// FAND(0.0, x) -> 0.0
if (SDValue V = getNullFPConstForNullVal(N->getOperand(0), DAG, Subtarget))
return V;
// FAND(x, 0.0) -> 0.0
if (SDValue V = getNullFPConstForNullVal(N->getOperand(1), DAG, Subtarget))
return V;
if (SDValue V = combineFAndFNotToFAndn(N, DAG, Subtarget))
return V;
return lowerX86FPLogicOp(N, DAG, Subtarget);
}
/// Do target-specific dag combines on X86ISD::FANDN nodes.
static SDValue combineFAndn(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// FANDN(0.0, x) -> x
if (isNullFPScalarOrVectorConst(N->getOperand(0)))
return N->getOperand(1);
// FANDN(x, 0.0) -> 0.0
if (SDValue V = getNullFPConstForNullVal(N->getOperand(1), DAG, Subtarget))
return V;
return lowerX86FPLogicOp(N, DAG, Subtarget);
}
/// Do target-specific dag combines on X86ISD::FOR and X86ISD::FXOR nodes.
static SDValue combineFOr(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
assert(N->getOpcode() == X86ISD::FOR || N->getOpcode() == X86ISD::FXOR);
// F[X]OR(0.0, x) -> x
if (isNullFPScalarOrVectorConst(N->getOperand(0)))
return N->getOperand(1);
// F[X]OR(x, 0.0) -> x
if (isNullFPScalarOrVectorConst(N->getOperand(1)))
return N->getOperand(0);
if (isFNEG(N))
if (SDValue NewVal = combineFneg(N, DAG, Subtarget))
return NewVal;
return lowerX86FPLogicOp(N, DAG, Subtarget);
}
/// Do target-specific dag combines on X86ISD::FMIN and X86ISD::FMAX nodes.
static SDValue combineFMinFMax(SDNode *N, SelectionDAG &DAG) {
assert(N->getOpcode() == X86ISD::FMIN || N->getOpcode() == X86ISD::FMAX);
// Only perform optimizations if UnsafeMath is used.
if (!DAG.getTarget().Options.UnsafeFPMath)
return SDValue();
// If we run in unsafe-math mode, then convert the FMAX and FMIN nodes
// into FMAXC and FMINC, which are commutative operations.
unsigned NewOp = 0;
switch (N->getOpcode()) {
default: llvm_unreachable("unknown opcode");
case X86ISD::FMIN: NewOp = X86ISD::FMINC; break;
case X86ISD::FMAX: NewOp = X86ISD::FMAXC; break;
}
return DAG.getNode(NewOp, SDLoc(N), N->getValueType(0),
N->getOperand(0), N->getOperand(1));
}
static SDValue combineFMinNumFMaxNum(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (Subtarget.useSoftFloat())
return SDValue();
// TODO: Check for global or instruction-level "nnan". In that case, we
// should be able to lower to FMAX/FMIN alone.
// TODO: If an operand is already known to be a NaN or not a NaN, this
// should be an optional swap and FMAX/FMIN.
EVT VT = N->getValueType(0);
if (!((Subtarget.hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) ||
(Subtarget.hasSSE2() && (VT == MVT::f64 || VT == MVT::v2f64)) ||
(Subtarget.hasAVX() && (VT == MVT::v8f32 || VT == MVT::v4f64))))
return SDValue();
// This takes at least 3 instructions, so favor a library call when operating
// on a scalar and minimizing code size.
if (!VT.isVector() && DAG.getMachineFunction().getFunction()->optForMinSize())
return SDValue();
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
SDLoc DL(N);
EVT SetCCType = DAG.getTargetLoweringInfo().getSetCCResultType(
DAG.getDataLayout(), *DAG.getContext(), VT);
// There are 4 possibilities involving NaN inputs, and these are the required
// outputs:
// Op1
// Num NaN
// ----------------
// Num | Max | Op0 |
// Op0 ----------------
// NaN | Op1 | NaN |
// ----------------
//
// The SSE FP max/min instructions were not designed for this case, but rather
// to implement:
// Min = Op1 < Op0 ? Op1 : Op0
// Max = Op1 > Op0 ? Op1 : Op0
//
// So they always return Op0 if either input is a NaN. However, we can still
// use those instructions for fmaxnum by selecting away a NaN input.
// If either operand is NaN, the 2nd source operand (Op0) is passed through.
auto MinMaxOp = N->getOpcode() == ISD::FMAXNUM ? X86ISD::FMAX : X86ISD::FMIN;
SDValue MinOrMax = DAG.getNode(MinMaxOp, DL, VT, Op1, Op0);
SDValue IsOp0Nan = DAG.getSetCC(DL, SetCCType , Op0, Op0, ISD::SETUO);
// If Op0 is a NaN, select Op1. Otherwise, select the max. If both operands
// are NaN, the NaN value of Op1 is the result.
return DAG.getSelect(DL, VT, IsOp0Nan, Op1, MinOrMax);
}
/// Do target-specific dag combines on X86ISD::ANDNP nodes.
static SDValue combineAndnp(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
// ANDNP(0, x) -> x
if (ISD::isBuildVectorAllZeros(N->getOperand(0).getNode()))
return N->getOperand(1);
// ANDNP(x, 0) -> 0
if (ISD::isBuildVectorAllZeros(N->getOperand(1).getNode()))
return getZeroVector(N->getSimpleValueType(0), Subtarget, DAG, SDLoc(N));
EVT VT = N->getValueType(0);
// Attempt to recursively combine a bitmask ANDNP with shuffles.
if (VT.isVector() && (VT.getScalarSizeInBits() % 8) == 0) {
SDValue Op(N, 0);
SmallVector<int, 1> NonceMask; // Just a placeholder.
NonceMask.push_back(0);
if (combineX86ShufflesRecursively({Op}, 0, Op, NonceMask, {},
/*Depth*/ 1, /*HasVarMask*/ false, DAG,
DCI, Subtarget))
return SDValue(); // This routine will use CombineTo to replace N.
}
return SDValue();
}
static SDValue combineBT(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI) {
// BT ignores high bits in the bit index operand.
SDValue Op1 = N->getOperand(1);
if (Op1.hasOneUse()) {
unsigned BitWidth = Op1.getValueSizeInBits();
APInt DemandedMask = APInt::getLowBitsSet(BitWidth, Log2_32(BitWidth));
KnownBits Known;
TargetLowering::TargetLoweringOpt TLO(DAG, !DCI.isBeforeLegalize(),
!DCI.isBeforeLegalizeOps());
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
if (TLI.ShrinkDemandedConstant(Op1, DemandedMask, TLO) ||
TLI.SimplifyDemandedBits(Op1, DemandedMask, Known, TLO))
DCI.CommitTargetLoweringOpt(TLO);
}
return SDValue();
}
static SDValue combineSignExtendInReg(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
if (!VT.isVector())
return SDValue();
SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);
EVT ExtraVT = cast<VTSDNode>(N1)->getVT();
SDLoc dl(N);
// SIGN_EXTEND_INREG to v4i64 is an expensive operation on both SSE and
// AVX2 since there is no sign-extending shift-right operation on vectors
// with 64-bit elements.
// (sext_in_reg (v4i64 anyext (v4i32 x )), ExtraVT) ->
//   (v4i64 sext (v4i32 sext_in_reg (v4i32 x , ExtraVT)))
if (VT == MVT::v4i64 && (N0.getOpcode() == ISD::ANY_EXTEND ||
N0.getOpcode() == ISD::SIGN_EXTEND)) {
SDValue N00 = N0.getOperand(0);
// EXTLOAD has a better solution on AVX2: it may be replaced with an
// X86ISD::VSEXT node.
if (N00.getOpcode() == ISD::LOAD && Subtarget.hasInt256())
if (!ISD::isNormalLoad(N00.getNode()))
return SDValue();
if (N00.getValueType() == MVT::v4i32 && ExtraVT.getSizeInBits() < 128) {
SDValue Tmp = DAG.getNode(ISD::SIGN_EXTEND_INREG, dl, MVT::v4i32,
N00, N1);
return DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v4i64, Tmp);
}
}
return SDValue();
}
/// sext(add_nsw(x, C)) --> add(sext(x), C_sext)
/// zext(add_nuw(x, C)) --> add(zext(x), C_zext)
/// Promoting a sign/zero extension ahead of a no overflow 'add' exposes
/// opportunities to combine math ops, use an LEA, or use a complex addressing
/// mode. This can eliminate extend, add, and shift instructions.
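/// For instance (illustrative only):
///   (i64 sext (add nsw i32 X, C)) --> (i64 add (sext X), C_sext)
/// where C_sext can then become an LEA displacement.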
static SDValue promoteExtBeforeAdd(SDNode *Ext, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
if (Ext->getOpcode() != ISD::SIGN_EXTEND &&
Ext->getOpcode() != ISD::ZERO_EXTEND)
return SDValue();
// TODO: This should be valid for other integer types.
EVT VT = Ext->getValueType(0);
if (VT != MVT::i64)
return SDValue();
SDValue Add = Ext->getOperand(0);
if (Add.getOpcode() != ISD::ADD)
return SDValue();
bool Sext = Ext->getOpcode() == ISD::SIGN_EXTEND;
bool NSW = Add->getFlags().hasNoSignedWrap();
bool NUW = Add->getFlags().hasNoUnsignedWrap();
// We need an 'add nsw' feeding into the 'sext', or an 'add nuw' feeding
// into the 'zext'.
if ((Sext && !NSW) || (!Sext && !NUW))
return SDValue();
// Having a constant operand to the 'add' ensures that we are not increasing
// the instruction count because the constant is extended for free below.
// A constant operand can also become the displacement field of an LEA.
auto *AddOp1 = dyn_cast<ConstantSDNode>(Add.getOperand(1));
if (!AddOp1)
return SDValue();
// Don't make the 'add' bigger if there's no hope of combining it with some
// other 'add' or 'shl' instruction.
// TODO: It may be profitable to generate simpler LEA instructions in place
// of single 'add' instructions, but the cost model for selecting an LEA
// currently has a high threshold.
bool HasLEAPotential = false;
for (auto *User : Ext->uses()) {
if (User->getOpcode() == ISD::ADD || User->getOpcode() == ISD::SHL) {
HasLEAPotential = true;
break;
}
}
if (!HasLEAPotential)
return SDValue();
// Everything looks good, so pull the '{s|z}ext' ahead of the 'add'.
int64_t AddConstant = Sext ? AddOp1->getSExtValue() : AddOp1->getZExtValue();
SDValue AddOp0 = Add.getOperand(0);
SDValue NewExt = DAG.getNode(Ext->getOpcode(), SDLoc(Ext), VT, AddOp0);
SDValue NewConstant = DAG.getConstant(AddConstant, SDLoc(Add), VT);
// The wider add is guaranteed not to wrap because both operands are
// sign- or zero-extended to match the original no-wrap flag.
SDNodeFlags Flags;
Flags.setNoSignedWrap(NSW);
Flags.setNoUnsignedWrap(NUW);
return DAG.getNode(ISD::ADD, SDLoc(Add), VT, NewExt, NewConstant, Flags);
}
/// (i8,i32 {s/z}ext ({s/u}divrem (i8 x, i8 y))) ->
/// (i8,i32 ({s/u}divrem_{s/z}ext_hreg (i8 x, i8 y)))
/// This exposes the {s/z}ext to the sdivrem lowering, so that it directly
/// extends from AH (which we otherwise need contortions to access).
static SDValue getDivRem8(SDNode *N, SelectionDAG &DAG) {
SDValue N0 = N->getOperand(0);
auto OpcodeN = N->getOpcode();
auto OpcodeN0 = N0.getOpcode();
if (!((OpcodeN == ISD::SIGN_EXTEND && OpcodeN0 == ISD::SDIVREM) ||
(OpcodeN == ISD::ZERO_EXTEND && OpcodeN0 == ISD::UDIVREM)))
return SDValue();
EVT VT = N->getValueType(0);
EVT InVT = N0.getValueType();
if (N0.getResNo() != 1 || InVT != MVT::i8 || VT != MVT::i32)
return SDValue();
SDVTList NodeTys = DAG.getVTList(MVT::i8, VT);
auto DivRemOpcode = OpcodeN0 == ISD::SDIVREM ? X86ISD::SDIVREM8_SEXT_HREG
: X86ISD::UDIVREM8_ZEXT_HREG;
SDValue R = DAG.getNode(DivRemOpcode, SDLoc(N), NodeTys, N0.getOperand(0),
N0.getOperand(1));
DAG.ReplaceAllUsesOfValueWith(N0.getValue(0), R.getValue(0));
return R.getValue(1);
}
/// Convert a SEXT or ZEXT of a vector to a SIGN_EXTEND_VECTOR_INREG or
/// ZERO_EXTEND_VECTOR_INREG. This requires splitting the input (or
/// concatenating it with UNDEFs) into vectors of the same size as the target
/// type, which then extend the lowest elements.
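/// A small sketch (illustrative, 128-bit SSE41 case):
///   (v8i16 zext (v8i8 X)) -->
///   (v8i16 zero_extend_vector_inreg (v16i8 concat X, undef))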
static SDValue combineToExtendVectorInReg(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
unsigned Opcode = N->getOpcode();
if (Opcode != ISD::SIGN_EXTEND && Opcode != ISD::ZERO_EXTEND)
return SDValue();
if (!DCI.isBeforeLegalizeOps())
return SDValue();
if (!Subtarget.hasSSE2())
return SDValue();
SDValue N0 = N->getOperand(0);
EVT VT = N->getValueType(0);
EVT SVT = VT.getScalarType();
EVT InVT = N0.getValueType();
EVT InSVT = InVT.getScalarType();
// Input type must be a vector and we must be extending legal integer types.
if (!VT.isVector())
return SDValue();
if (SVT != MVT::i64 && SVT != MVT::i32 && SVT != MVT::i16)
return SDValue();
if (InSVT != MVT::i32 && InSVT != MVT::i16 && InSVT != MVT::i8)
return SDValue();
// On AVX2+ targets, if the input/output types are both legal then we will be
// able to use SIGN_EXTEND/ZERO_EXTEND directly.
if (Subtarget.hasInt256() && DAG.getTargetLoweringInfo().isTypeLegal(VT) &&
DAG.getTargetLoweringInfo().isTypeLegal(InVT))
return SDValue();
SDLoc DL(N);
auto ExtendVecSize = [&DAG](const SDLoc &DL, SDValue N, unsigned Size) {
EVT InVT = N.getValueType();
EVT OutVT = EVT::getVectorVT(*DAG.getContext(), InVT.getScalarType(),
Size / InVT.getScalarSizeInBits());
SmallVector<SDValue, 8> Opnds(Size / InVT.getSizeInBits(),
DAG.getUNDEF(InVT));
Opnds[0] = N;
return DAG.getNode(ISD::CONCAT_VECTORS, DL, OutVT, Opnds);
};
// If the target size is less than 128 bits, extend to a type that would
// extend to 128 bits, extend that, and extract the original target vector.
if (VT.getSizeInBits() < 128 && !(128 % VT.getSizeInBits())) {
unsigned Scale = 128 / VT.getSizeInBits();
EVT ExVT =
EVT::getVectorVT(*DAG.getContext(), SVT, 128 / SVT.getSizeInBits());
SDValue Ex = ExtendVecSize(DL, N0, Scale * InVT.getSizeInBits());
SDValue SExt = DAG.getNode(Opcode, DL, ExVT, Ex);
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, SExt,
DAG.getIntPtrConstant(0, DL));
}
// If the target size is 128 bits (or 256 bits on an AVX2 target), then
// convert to ISD::*_EXTEND_VECTOR_INREG, which ensures lowering to
// X86ISD::V*EXT. Also use this if we don't have SSE41, to let the
// legalizer do its job.
if (!Subtarget.hasSSE41() || VT.is128BitVector() ||
(VT.is256BitVector() && Subtarget.hasInt256()) ||
(VT.is512BitVector() && Subtarget.hasAVX512())) {
SDValue ExOp = ExtendVecSize(DL, N0, VT.getSizeInBits());
return Opcode == ISD::SIGN_EXTEND
? DAG.getSignExtendVectorInReg(ExOp, DL, VT)
: DAG.getZeroExtendVectorInReg(ExOp, DL, VT);
}
auto SplitAndExtendInReg = [&](unsigned SplitSize) {
unsigned NumVecs = VT.getSizeInBits() / SplitSize;
unsigned NumSubElts = SplitSize / SVT.getSizeInBits();
EVT SubVT = EVT::getVectorVT(*DAG.getContext(), SVT, NumSubElts);
EVT InSubVT = EVT::getVectorVT(*DAG.getContext(), InSVT, NumSubElts);
SmallVector<SDValue, 8> Opnds;
for (unsigned i = 0, Offset = 0; i != NumVecs; ++i, Offset += NumSubElts) {
SDValue SrcVec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, InSubVT, N0,
DAG.getIntPtrConstant(Offset, DL));
SrcVec = ExtendVecSize(DL, SrcVec, SplitSize);
SrcVec = Opcode == ISD::SIGN_EXTEND
? DAG.getSignExtendVectorInReg(SrcVec, DL, SubVT)
: DAG.getZeroExtendVectorInReg(SrcVec, DL, SubVT);
Opnds.push_back(SrcVec);
}
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Opnds);
};
// On pre-AVX2 targets, split into 128-bit nodes of
// ISD::*_EXTEND_VECTOR_INREG.
if (!Subtarget.hasInt256() && !(VT.getSizeInBits() % 128))
return SplitAndExtendInReg(128);
// On pre-AVX512 targets, split into 256-bit nodes of
// ISD::*_EXTEND_VECTOR_INREG.
if (!Subtarget.hasAVX512() && !(VT.getSizeInBits() % 256))
return SplitAndExtendInReg(256);
return SDValue();
}
static SDValue combineSext(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
SDValue N0 = N->getOperand(0);
EVT VT = N->getValueType(0);
EVT InVT = N0.getValueType();
SDLoc DL(N);
if (SDValue DivRem8 = getDivRem8(N, DAG))
return DivRem8;
if (!DCI.isBeforeLegalizeOps()) {
if (InVT == MVT::i1) {
SDValue Zero = DAG.getConstant(0, DL, VT);
SDValue AllOnes = DAG.getAllOnesConstant(DL, VT);
return DAG.getSelect(DL, VT, N0, AllOnes, Zero);
}
return SDValue();
}
if (InVT == MVT::i1 && N0.getOpcode() == ISD::XOR &&
isAllOnesConstant(N0.getOperand(1)) && N0.hasOneUse()) {
// Inverting and sign-extending a boolean is the same as zero-extending and
// subtracting 1, because 0 becomes -1 and 1 becomes 0. The subtract is
// efficiently lowered with an LEA or a DEC. This is the same as:
// select Bool, 0, -1.
// sext (xor Bool, -1) --> sub (zext Bool), 1
SDValue Zext = DAG.getNode(ISD::ZERO_EXTEND, DL, VT, N0.getOperand(0));
return DAG.getNode(ISD::SUB, DL, VT, Zext, DAG.getConstant(1, DL, VT));
}
if (SDValue V = combineToExtendVectorInReg(N, DAG, DCI, Subtarget))
return V;
if (Subtarget.hasAVX() && VT.is256BitVector())
if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
return R;
if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))
return NewAdd;
return SDValue();
}
static SDValue combineFMA(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc dl(N);
EVT VT = N->getValueType(0);
// Let legalize expand this if it isn't a legal type yet.
if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();
EVT ScalarVT = VT.getScalarType();
if ((ScalarVT != MVT::f32 && ScalarVT != MVT::f64) || !Subtarget.hasAnyFMA())
return SDValue();
SDValue A = N->getOperand(0);
SDValue B = N->getOperand(1);
SDValue C = N->getOperand(2);
auto invertIfNegative = [](SDValue &V) {
if (SDValue NegVal = isFNEG(V.getNode())) {
V = NegVal;
return true;
}
return false;
};
// Do not convert the passthru input of scalar intrinsics.
// FIXME: We could allow negations of the lower element only.
bool NegA = N->getOpcode() != X86ISD::FMADDS1_RND && invertIfNegative(A);
bool NegB = invertIfNegative(B);
bool NegC = N->getOpcode() != X86ISD::FMADDS3_RND && invertIfNegative(C);
// The multiplication is negated when NegA xor NegB.
bool NegMul = (NegA != NegB);
unsigned NewOpcode;
if (!NegMul)
NewOpcode = (!NegC) ? X86ISD::FMADD : X86ISD::FMSUB;
else
NewOpcode = (!NegC) ? X86ISD::FNMADD : X86ISD::FNMSUB;
if (N->getOpcode() == X86ISD::FMADD_RND) {
switch (NewOpcode) {
case X86ISD::FMADD: NewOpcode = X86ISD::FMADD_RND; break;
case X86ISD::FMSUB: NewOpcode = X86ISD::FMSUB_RND; break;
case X86ISD::FNMADD: NewOpcode = X86ISD::FNMADD_RND; break;
case X86ISD::FNMSUB: NewOpcode = X86ISD::FNMSUB_RND; break;
}
} else if (N->getOpcode() == X86ISD::FMADDS1_RND) {
switch (NewOpcode) {
case X86ISD::FMADD: NewOpcode = X86ISD::FMADDS1_RND; break;
case X86ISD::FMSUB: NewOpcode = X86ISD::FMSUBS1_RND; break;
case X86ISD::FNMADD: NewOpcode = X86ISD::FNMADDS1_RND; break;
case X86ISD::FNMSUB: NewOpcode = X86ISD::FNMSUBS1_RND; break;
}
} else if (N->getOpcode() == X86ISD::FMADDS3_RND) {
switch (NewOpcode) {
case X86ISD::FMADD: NewOpcode = X86ISD::FMADDS3_RND; break;
case X86ISD::FMSUB: NewOpcode = X86ISD::FMSUBS3_RND; break;
case X86ISD::FNMADD: NewOpcode = X86ISD::FNMADDS3_RND; break;
case X86ISD::FNMSUB: NewOpcode = X86ISD::FNMSUBS3_RND; break;
}
} else {
assert((N->getOpcode() == X86ISD::FMADD || N->getOpcode() == ISD::FMA) &&
"Unexpected opcode!");
return DAG.getNode(NewOpcode, dl, VT, A, B, C);
}
return DAG.getNode(NewOpcode, dl, VT, A, B, C, N->getOperand(3));
}
static SDValue combineZext(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
// (i32 zext (and (i8 x86isd::setcc_carry), 1)) ->
// (and (i32 x86isd::setcc_carry), 1)
// This eliminates the zext. This transformation is necessary because
// ISD::SETCC is always legalized to i8.
SDLoc dl(N);
SDValue N0 = N->getOperand(0);
EVT VT = N->getValueType(0);
if (N0.getOpcode() == ISD::AND &&
N0.hasOneUse() &&
N0.getOperand(0).hasOneUse()) {
SDValue N00 = N0.getOperand(0);
if (N00.getOpcode() == X86ISD::SETCC_CARRY) {
if (!isOneConstant(N0.getOperand(1)))
return SDValue();
return DAG.getNode(ISD::AND, dl, VT,
DAG.getNode(X86ISD::SETCC_CARRY, dl, VT,
N00.getOperand(0), N00.getOperand(1)),
DAG.getConstant(1, dl, VT));
}
}
if (N0.getOpcode() == ISD::TRUNCATE &&
N0.hasOneUse() &&
N0.getOperand(0).hasOneUse()) {
SDValue N00 = N0.getOperand(0);
if (N00.getOpcode() == X86ISD::SETCC_CARRY) {
return DAG.getNode(ISD::AND, dl, VT,
DAG.getNode(X86ISD::SETCC_CARRY, dl, VT,
N00.getOperand(0), N00.getOperand(1)),
DAG.getConstant(1, dl, VT));
}
}
if (SDValue V = combineToExtendVectorInReg(N, DAG, DCI, Subtarget))
return V;
if (VT.is256BitVector())
if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
return R;
if (SDValue DivRem8 = getDivRem8(N, DAG))
return DivRem8;
if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))
return NewAdd;
if (SDValue R = combineOrCmpEqZeroToCtlzSrl(N, DAG, DCI, Subtarget))
return R;
return SDValue();
}
/// Try to map a 128-bit or larger integer comparison to vector instructions
/// before type legalization splits it up into chunks.
static SDValue combineVectorSizedSetCCEquality(SDNode *SetCC, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
ISD::CondCode CC = cast<CondCodeSDNode>(SetCC->getOperand(2))->get();
assert((CC == ISD::SETNE || CC == ISD::SETEQ) && "Bad comparison predicate");
// We're looking for an oversized integer equality comparison, but ignore a
// comparison with zero because that gets special treatment in EmitTest().
SDValue X = SetCC->getOperand(0);
SDValue Y = SetCC->getOperand(1);
EVT OpVT = X.getValueType();
unsigned OpSize = OpVT.getSizeInBits();
if (!OpVT.isScalarInteger() || OpSize < 128 || isNullConstant(Y))
return SDValue();
// TODO: Use PXOR + PTEST for SSE4.1 or later?
// TODO: Add support for AVX-512.
EVT VT = SetCC->getValueType(0);
SDLoc DL(SetCC);
if ((OpSize == 128 && Subtarget.hasSSE2()) ||
(OpSize == 256 && Subtarget.hasAVX2())) {
EVT VecVT = OpSize == 128 ? MVT::v16i8 : MVT::v32i8;
SDValue VecX = DAG.getBitcast(VecVT, X);
SDValue VecY = DAG.getBitcast(VecVT, Y);
// If all bytes match (bitmask is 0x(FFFF)FFFF), that's equality.
// setcc i128 X, Y, eq --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, eq
// setcc i128 X, Y, ne --> setcc (pmovmskb (pcmpeqb X, Y)), 0xFFFF, ne
// setcc i256 X, Y, eq --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, eq
// setcc i256 X, Y, ne --> setcc (vpmovmskb (vpcmpeqb X, Y)), 0xFFFFFFFF, ne
SDValue Cmp = DAG.getNode(X86ISD::PCMPEQ, DL, VecVT, VecX, VecY);
SDValue MovMsk = DAG.getNode(X86ISD::MOVMSK, DL, MVT::i32, Cmp);
SDValue FFFFs = DAG.getConstant(OpSize == 128 ? 0xFFFF : 0xFFFFFFFF, DL,
MVT::i32);
return DAG.getSetCC(DL, VT, MovMsk, FFFFs, CC);
}
return SDValue();
}
static SDValue combineSetCC(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
EVT VT = N->getValueType(0);
SDLoc DL(N);
if (CC == ISD::SETNE || CC == ISD::SETEQ) {
EVT OpVT = LHS.getValueType();
// 0-x == y --> x+y == 0
// 0-x != y --> x+y != 0
if (LHS.getOpcode() == ISD::SUB && isNullConstant(LHS.getOperand(0)) &&
LHS.hasOneUse()) {
SDValue Add = DAG.getNode(ISD::ADD, DL, OpVT, RHS, LHS.getOperand(1));
return DAG.getSetCC(DL, VT, Add, DAG.getConstant(0, DL, OpVT), CC);
}
// x == 0-y --> x+y == 0
// x != 0-y --> x+y != 0
if (RHS.getOpcode() == ISD::SUB && isNullConstant(RHS.getOperand(0)) &&
RHS.hasOneUse()) {
SDValue Add = DAG.getNode(ISD::ADD, DL, OpVT, LHS, RHS.getOperand(1));
return DAG.getSetCC(DL, VT, Add, DAG.getConstant(0, DL, OpVT), CC);
}
if (SDValue V = combineVectorSizedSetCCEquality(N, DAG, Subtarget))
return V;
}
if (VT.getScalarType() == MVT::i1 &&
(CC == ISD::SETNE || CC == ISD::SETEQ || ISD::isSignedIntSetCC(CC))) {
bool IsSEXT0 =
(LHS.getOpcode() == ISD::SIGN_EXTEND) &&
(LHS.getOperand(0).getValueType().getScalarType() == MVT::i1);
bool IsVZero1 = ISD::isBuildVectorAllZeros(RHS.getNode());
if (!IsSEXT0 || !IsVZero1) {
// Swap the operands and update the condition code.
std::swap(LHS, RHS);
CC = ISD::getSetCCSwappedOperands(CC);
IsSEXT0 = (LHS.getOpcode() == ISD::SIGN_EXTEND) &&
(LHS.getOperand(0).getValueType().getScalarType() == MVT::i1);
IsVZero1 = ISD::isBuildVectorAllZeros(RHS.getNode());
}
if (IsSEXT0 && IsVZero1) {
assert(VT == LHS.getOperand(0).getValueType() &&
"Uexpected operand type");
if (CC == ISD::SETGT)
return DAG.getConstant(0, DL, VT);
if (CC == ISD::SETLE)
return DAG.getConstant(1, DL, VT);
if (CC == ISD::SETEQ || CC == ISD::SETGE)
return DAG.getNOT(DL, LHS.getOperand(0), VT);
assert((CC == ISD::SETNE || CC == ISD::SETLT) &&
"Unexpected condition code!");
return LHS.getOperand(0);
}
}
// For an SSE1-only target, lower a comparison of v4f32 to X86ISD::CMPP early
// to avoid scalarization via legalization because v4i32 is not a legal type.
if (Subtarget.hasSSE1() && !Subtarget.hasSSE2() && VT == MVT::v4i32 &&
LHS.getValueType() == MVT::v4f32)
return LowerVSETCC(SDValue(N, 0), Subtarget, DAG);
return SDValue();
}
static SDValue combineGatherScatter(SDNode *N, SelectionDAG &DAG) {
SDLoc DL(N);
// Gather and Scatter instructions use k-registers for masks. The type of
// the masks is v*i1, so the mask will be truncated anyway.
// The SIGN_EXTEND_INREG may be dropped.
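// e.g. (illustrative): gather(..., (sign_extend_inreg M), ...) -->
// gather(..., M, ...)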
SDValue Mask = N->getOperand(2);
if (Mask.getOpcode() == ISD::SIGN_EXTEND_INREG) {
SmallVector<SDValue, 5> NewOps(N->op_begin(), N->op_end());
NewOps[2] = Mask.getOperand(0);
DAG.UpdateNodeOperands(N, NewOps);
}
return SDValue();
}
// Optimize RES = X86ISD::SETCC CONDCODE, EFLAG_INPUT
static SDValue combineX86SetCC(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
X86::CondCode CC = X86::CondCode(N->getConstantOperandVal(0));
SDValue EFLAGS = N->getOperand(1);
// Try to simplify the EFLAGS and condition code operands.
if (SDValue Flags = combineSetCCEFLAGS(EFLAGS, CC, DAG))
return getSETCC(CC, Flags, DL, DAG);
return SDValue();
}
/// Optimize branch condition evaluation.
static SDValue combineBrCond(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
SDValue EFLAGS = N->getOperand(3);
X86::CondCode CC = X86::CondCode(N->getConstantOperandVal(2));
// Try to simplify the EFLAGS and condition code operands.
// Make sure to not keep references to operands, as combineSetCCEFLAGS can
// RAUW them under us.
if (SDValue Flags = combineSetCCEFLAGS(EFLAGS, CC, DAG)) {
SDValue Cond = DAG.getConstant(CC, DL, MVT::i8);
return DAG.getNode(X86ISD::BRCOND, DL, N->getVTList(), N->getOperand(0),
N->getOperand(1), Cond, Flags);
}
return SDValue();
}
static SDValue combineVectorCompareAndMaskUnaryOp(SDNode *N,
SelectionDAG &DAG) {
// Take advantage of vector comparisons producing 0 or -1 in each lane to
// optimize away the operation when its input comes from a constant.
//
// The general transformation is:
// UNARYOP(AND(VECTOR_CMP(x,y), constant)) -->
// AND(VECTOR_CMP(x,y), constant2)
// constant2 = UNARYOP(constant)
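// For instance (illustrative), with UNARYOP = sint_to_fp:
//   (v4f32 sint_to_fp (and (setcc X, Y), <1,1,1,1>)) -->
//   (v4f32 bitcast (and (setcc X, Y), (bitcast <1.0,1.0,1.0,1.0>)))
// folding the conversion of the constant at compile time.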
// Early exit if this isn't a vector operation, the operand of the
// unary operation isn't a bitwise AND, or if the sizes of the operations
// aren't the same.
EVT VT = N->getValueType(0);
if (!VT.isVector() || N->getOperand(0)->getOpcode() != ISD::AND ||
N->getOperand(0)->getOperand(0)->getOpcode() != ISD::SETCC ||
VT.getSizeInBits() != N->getOperand(0)->getValueType(0).getSizeInBits())
return SDValue();
// Now check that the other operand of the AND is a constant. We could
// make the transformation for non-constant splats as well, but it's unclear
// that would be a benefit as it would not eliminate any operations, just
// perform one more step in scalar code before moving to the vector unit.
if (BuildVectorSDNode *BV =
dyn_cast<BuildVectorSDNode>(N->getOperand(0)->getOperand(1))) {
// Bail out if the vector isn't a constant.
if (!BV->isConstant())
return SDValue();
// Everything checks out. Build up the new and improved node.
SDLoc DL(N);
EVT IntVT = BV->getValueType(0);
// Create a new constant of the appropriate type for the transformed
// DAG.
SDValue SourceConst = DAG.getNode(N->getOpcode(), DL, VT, SDValue(BV, 0));
// The AND node needs bitcasts to/from an integer vector type around it.
SDValue MaskConst = DAG.getBitcast(IntVT, SourceConst);
SDValue NewAnd = DAG.getNode(ISD::AND, DL, IntVT,
N->getOperand(0)->getOperand(0), MaskConst);
SDValue Res = DAG.getBitcast(VT, NewAnd);
return Res;
}
return SDValue();
}
static SDValue combineUIntToFP(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue Op0 = N->getOperand(0);
EVT VT = N->getValueType(0);
EVT InVT = Op0.getValueType();
EVT InSVT = InVT.getScalarType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// UINT_TO_FP(vXi8) -> SINT_TO_FP(ZEXT(vXi8 to vXi32))
// UINT_TO_FP(vXi16) -> SINT_TO_FP(ZEXT(vXi16 to vXi32))
if (InVT.isVector() && (InSVT == MVT::i8 || InSVT == MVT::i16)) {
SDLoc dl(N);
EVT DstVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32,
InVT.getVectorNumElements());
SDValue P = DAG.getNode(ISD::ZERO_EXTEND, dl, DstVT, Op0);
if (TLI.isOperationLegal(ISD::UINT_TO_FP, DstVT))
return DAG.getNode(ISD::UINT_TO_FP, dl, VT, P);
return DAG.getNode(ISD::SINT_TO_FP, dl, VT, P);
}
// Since UINT_TO_FP is legal (it's marked custom), the DAG combiner won't
// optimize it to a SINT_TO_FP when the sign bit is known zero. Perform
// the optimization here.
if (DAG.SignBitIsZero(Op0))
return DAG.getNode(ISD::SINT_TO_FP, SDLoc(N), VT, Op0);
return SDValue();
}
static SDValue combineSIntToFP(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
// First try to optimize away the conversion entirely when it's
// conditionally from a constant. Vectors only.
if (SDValue Res = combineVectorCompareAndMaskUnaryOp(N, DAG))
return Res;
// Now move on to more general possibilities.
SDValue Op0 = N->getOperand(0);
EVT VT = N->getValueType(0);
EVT InVT = Op0.getValueType();
EVT InSVT = InVT.getScalarType();
// SINT_TO_FP(vXi1) -> SINT_TO_FP(SEXT(vXi1 to vXi32))
// SINT_TO_FP(vXi8) -> SINT_TO_FP(SEXT(vXi8 to vXi32))
// SINT_TO_FP(vXi16) -> SINT_TO_FP(SEXT(vXi16 to vXi32))
if (InVT.isVector() &&
(InSVT == MVT::i8 || InSVT == MVT::i16 ||
(InSVT == MVT::i1 && !DAG.getTargetLoweringInfo().isTypeLegal(InVT)))) {
SDLoc dl(N);
EVT DstVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32,
InVT.getVectorNumElements());
SDValue P = DAG.getNode(ISD::SIGN_EXTEND, dl, DstVT, Op0);
return DAG.getNode(ISD::SINT_TO_FP, dl, VT, P);
}
// Without AVX512DQ we only support i64 to float scalar conversion. For both
// vectors and scalars, see if we know that the upper bits are all the sign
// bit, in which case we can truncate the input to i32 and convert from that.
if (InVT.getScalarSizeInBits() > 32 && !Subtarget.hasDQI()) {
unsigned BitWidth = InVT.getScalarSizeInBits();
unsigned NumSignBits = DAG.ComputeNumSignBits(Op0);
if (NumSignBits >= (BitWidth - 31)) {
EVT TruncVT = EVT::getIntegerVT(*DAG.getContext(), 32);
if (InVT.isVector())
TruncVT = EVT::getVectorVT(*DAG.getContext(), TruncVT,
InVT.getVectorNumElements());
SDLoc dl(N);
SDValue Trunc = DAG.getNode(ISD::TRUNCATE, dl, TruncVT, Op0);
return DAG.getNode(ISD::SINT_TO_FP, dl, VT, Trunc);
}
}
// Transform (SINT_TO_FP (i64 ...)) into an x87 operation if we have
// a 32-bit target where SSE doesn't support i64->FP operations.
if (!Subtarget.useSoftFloat() && Op0.getOpcode() == ISD::LOAD) {
LoadSDNode *Ld = cast<LoadSDNode>(Op0.getNode());
EVT LdVT = Ld->getValueType(0);
// This transformation is not supported if the result type is f16 or f128.
if (VT == MVT::f16 || VT == MVT::f128)
return SDValue();
if (!Ld->isVolatile() && !VT.isVector() &&
ISD::isNON_EXTLoad(Op0.getNode()) && Op0.hasOneUse() &&
!Subtarget.is64Bit() && LdVT == MVT::i64) {
SDValue FILDChain = Subtarget.getTargetLowering()->BuildFILD(
SDValue(N, 0), LdVT, Ld->getChain(), Op0, DAG);
DAG.ReplaceAllUsesOfValueWith(Op0.getValue(1), FILDChain.getValue(1));
return FILDChain;
}
}
return SDValue();
}
// Optimize RES, EFLAGS = X86ISD::ADD LHS, RHS
static SDValue combineX86ADD(SDNode *N, SelectionDAG &DAG,
X86TargetLowering::DAGCombinerInfo &DCI) {
// When legalizing carry, we create carries via 'add X, -1'.
// If that comes from an actual carry, via setcc, we use the
// carry directly.
if (isAllOnesConstant(N->getOperand(1)) && N->hasAnyUseOfValue(1)) {
SDValue Carry = N->getOperand(0);
while (Carry.getOpcode() == ISD::TRUNCATE ||
Carry.getOpcode() == ISD::ZERO_EXTEND ||
Carry.getOpcode() == ISD::SIGN_EXTEND ||
Carry.getOpcode() == ISD::ANY_EXTEND ||
(Carry.getOpcode() == ISD::AND &&
isOneConstant(Carry.getOperand(1))))
Carry = Carry.getOperand(0);
if (Carry.getOpcode() == X86ISD::SETCC ||
Carry.getOpcode() == X86ISD::SETCC_CARRY) {
if (Carry.getConstantOperandVal(0) == X86::COND_B)
return DCI.CombineTo(N, SDValue(N, 0), Carry.getOperand(1));
}
}
return SDValue();
}
// Optimize RES, EFLAGS = X86ISD::ADC LHS, RHS, EFLAGS
static SDValue combineADC(SDNode *N, SelectionDAG &DAG,
X86TargetLowering::DAGCombinerInfo &DCI) {
// If the LHS and RHS of the ADC node are zero, then it can't overflow and
// the result is either zero or one (depending on the input carry bit).
// Strength reduce this down to a "set on carry" aka SETCC_CARRY&1.
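// i.e. (illustrative):
//   (adc 0, 0, EFLAGS) --> (and (setcc_carry COND_B, EFLAGS), 1)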
if (X86::isZeroNode(N->getOperand(0)) &&
X86::isZeroNode(N->getOperand(1)) &&
// We don't have a good way to replace an EFLAGS use, so only do this when
// dead right now.
SDValue(N, 1).use_empty()) {
SDLoc DL(N);
EVT VT = N->getValueType(0);
SDValue CarryOut = DAG.getConstant(0, DL, N->getValueType(1));
SDValue Res1 = DAG.getNode(ISD::AND, DL, VT,
DAG.getNode(X86ISD::SETCC_CARRY, DL, VT,
DAG.getConstant(X86::COND_B, DL,
MVT::i8),
N->getOperand(2)),
DAG.getConstant(1, DL, VT));
return DCI.CombineTo(N, Res1, CarryOut);
}
return SDValue();
}
/// Materialize "setb reg" as "sbb reg,reg", since it produces an all-ones bit
/// which is more useful than 0/1 in some cases.
static SDValue materializeSBB(SDNode *N, SDValue EFLAGS, SelectionDAG &DAG) {
SDLoc DL(N);
// "Condition code B" is also known as "the carry flag" (CF).
SDValue CF = DAG.getConstant(X86::COND_B, DL, MVT::i8);
SDValue SBB = DAG.getNode(X86ISD::SETCC_CARRY, DL, MVT::i8, CF, EFLAGS);
MVT VT = N->getSimpleValueType(0);
if (VT == MVT::i8)
return DAG.getNode(ISD::AND, DL, VT, SBB, DAG.getConstant(1, DL, VT));
assert(VT == MVT::i1 && "Unexpected type for SETCC node");
return DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, SBB);
}
/// If this is an add or subtract where one operand is produced by a cmp+setcc,
/// then try to convert it to an ADC or SBB. This replaces TEST+SET+{ADD/SUB}
/// with CMP+{ADC, SBB}.
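/// For example (illustrative): add X, (setb F) is rebuilt below so the carry
/// in F is materialized with SBB rather than a separate SETcc and add.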
static SDValue combineAddOrSubToADCOrSBB(SDNode *N, SelectionDAG &DAG) {
bool IsSub = N->getOpcode() == ISD::SUB;
SDValue X = N->getOperand(0);
SDValue Y = N->getOperand(1);
// If this is an add, canonicalize a zext operand to the RHS.
// TODO: Incomplete? What if both sides are zexts?
if (!IsSub && X.getOpcode() == ISD::ZERO_EXTEND &&
Y.getOpcode() != ISD::ZERO_EXTEND)
std::swap(X, Y);
// Look through a one-use zext.
bool PeekedThroughZext = false;
if (Y.getOpcode() == ISD::ZERO_EXTEND && Y.hasOneUse()) {
Y = Y.getOperand(0);
PeekedThroughZext = true;
}
// If this is an add, canonicalize a setcc operand to the RHS.
// TODO: Incomplete? What if both sides are setcc?
// TODO: Should we allow peeking through a zext of the other operand?
if (!IsSub && !PeekedThroughZext && X.getOpcode() == X86ISD::SETCC &&
Y.getOpcode() != X86ISD::SETCC)
std::swap(X, Y);
if (Y.getOpcode() != X86ISD::SETCC || !Y.hasOneUse())
return SDValue();
SDLoc DL(N);
EVT VT = N->getValueType(0);
X86::CondCode CC = (X86::CondCode)Y.getConstantOperandVal(0);
// If X is -1 or 0, then we have an opportunity to avoid constants required in
// the general case below.
auto *ConstantX = dyn_cast<ConstantSDNode>(X);
if (ConstantX) {
if ((!IsSub && CC == X86::COND_AE && ConstantX->isAllOnesValue()) ||
(IsSub && CC == X86::COND_B && ConstantX->isNullValue())) {
// This is a complicated way to get -1 or 0 from the carry flag:
// -1 + SETAE --> -1 + (!CF) --> CF ? -1 : 0 --> SBB %eax, %eax
// 0 - SETB --> 0 - (CF) --> CF ? -1 : 0 --> SBB %eax, %eax
return DAG.getNode(X86ISD::SETCC_CARRY, DL, VT,
DAG.getConstant(X86::COND_B, DL, MVT::i8),
Y.getOperand(1));
}
if ((!IsSub && CC == X86::COND_BE && ConstantX->isAllOnesValue()) ||
(IsSub && CC == X86::COND_A && ConstantX->isNullValue())) {
SDValue EFLAGS = Y->getOperand(1);
if (EFLAGS.getOpcode() == X86ISD::SUB && EFLAGS.hasOneUse() &&
EFLAGS.getValueType().isInteger() &&
!isa<ConstantSDNode>(EFLAGS.getOperand(1))) {
// Swap the operands of a SUB, and we have the same pattern as above.
// -1 + SETBE (SUB A, B) --> -1 + SETAE (SUB B, A) --> SUB + SBB
// 0 - SETA (SUB A, B) --> 0 - SETB (SUB B, A) --> SUB + SBB
SDValue NewSub = DAG.getNode(
X86ISD::SUB, SDLoc(EFLAGS), EFLAGS.getNode()->getVTList(),
EFLAGS.getOperand(1), EFLAGS.getOperand(0));
SDValue NewEFLAGS = SDValue(NewSub.getNode(), EFLAGS.getResNo());
return DAG.getNode(X86ISD::SETCC_CARRY, DL, VT,
DAG.getConstant(X86::COND_B, DL, MVT::i8),
NewEFLAGS);
}
}
}
if (CC == X86::COND_B) {
// X + SETB Z --> X + (mask SBB Z, Z)
// X - SETB Z --> X - (mask SBB Z, Z)
// TODO: Produce ADC/SBB here directly and avoid SETCC_CARRY?
SDValue SBB = materializeSBB(Y.getNode(), Y.getOperand(1), DAG);
if (SBB.getValueSizeInBits() != VT.getSizeInBits())
SBB = DAG.getZExtOrTrunc(SBB, DL, VT);
return DAG.getNode(IsSub ? ISD::SUB : ISD::ADD, DL, VT, X, SBB);
}
if (CC == X86::COND_A) {
SDValue EFLAGS = Y->getOperand(1);
// Try to convert COND_A into COND_B in an attempt to facilitate
// materializing "setb reg".
//
// Do not flip "e > c", where "c" is a constant, because Cmp instruction
// cannot take an immediate as its first operand.
//
if (EFLAGS.getOpcode() == X86ISD::SUB && EFLAGS.hasOneUse() &&
EFLAGS.getValueType().isInteger() &&
!isa<ConstantSDNode>(EFLAGS.getOperand(1))) {
SDValue NewSub = DAG.getNode(X86ISD::SUB, SDLoc(EFLAGS),
EFLAGS.getNode()->getVTList(),
EFLAGS.getOperand(1), EFLAGS.getOperand(0));
SDValue NewEFLAGS = SDValue(NewSub.getNode(), EFLAGS.getResNo());
SDValue SBB = materializeSBB(Y.getNode(), NewEFLAGS, DAG);
if (SBB.getValueSizeInBits() != VT.getSizeInBits())
SBB = DAG.getZExtOrTrunc(SBB, DL, VT);
return DAG.getNode(IsSub ? ISD::SUB : ISD::ADD, DL, VT, X, SBB);
}
}
if (CC != X86::COND_E && CC != X86::COND_NE)
return SDValue();
SDValue Cmp = Y.getOperand(1);
if (Cmp.getOpcode() != X86ISD::CMP || !Cmp.hasOneUse() ||
!X86::isZeroNode(Cmp.getOperand(1)) ||
!Cmp.getOperand(0).getValueType().isInteger())
return SDValue();
SDValue Z = Cmp.getOperand(0);
EVT ZVT = Z.getValueType();
// If X is -1 or 0, then we have an opportunity to avoid constants required in
// the general case below.
if (ConstantX) {
// 'neg' sets the carry flag when Z != 0, so create 0 or -1 using 'sbb' with
// fake operands:
// 0 - (Z != 0) --> sbb %eax, %eax, (neg Z)
// -1 + (Z == 0) --> sbb %eax, %eax, (neg Z)
if ((IsSub && CC == X86::COND_NE && ConstantX->isNullValue()) ||
(!IsSub && CC == X86::COND_E && ConstantX->isAllOnesValue())) {
SDValue Zero = DAG.getConstant(0, DL, ZVT);
SDVTList X86SubVTs = DAG.getVTList(ZVT, MVT::i32);
SDValue Neg = DAG.getNode(X86ISD::SUB, DL, X86SubVTs, Zero, Z);
return DAG.getNode(X86ISD::SETCC_CARRY, DL, VT,
DAG.getConstant(X86::COND_B, DL, MVT::i8),
SDValue(Neg.getNode(), 1));
}
// cmp with 1 sets the carry flag when Z == 0, so create 0 or -1 using 'sbb'
// with fake operands:
// 0 - (Z == 0) --> sbb %eax, %eax, (cmp Z, 1)
// -1 + (Z != 0) --> sbb %eax, %eax, (cmp Z, 1)
if ((IsSub && CC == X86::COND_E && ConstantX->isNullValue()) ||
(!IsSub && CC == X86::COND_NE && ConstantX->isAllOnesValue())) {
SDValue One = DAG.getConstant(1, DL, ZVT);
SDValue Cmp1 = DAG.getNode(X86ISD::CMP, DL, MVT::i32, Z, One);
return DAG.getNode(X86ISD::SETCC_CARRY, DL, VT,
DAG.getConstant(X86::COND_B, DL, MVT::i8), Cmp1);
}
}
// (cmp Z, 1) sets the carry flag if Z is 0.
SDValue One = DAG.getConstant(1, DL, ZVT);
SDValue Cmp1 = DAG.getNode(X86ISD::CMP, DL, MVT::i32, Z, One);
// Add the flags type for ADC/SBB nodes.
SDVTList VTs = DAG.getVTList(VT, MVT::i32);
// X - (Z != 0) --> sub X, (zext(setne Z, 0)) --> adc X, -1, (cmp Z, 1)
// X + (Z != 0) --> add X, (zext(setne Z, 0)) --> sbb X, -1, (cmp Z, 1)
if (CC == X86::COND_NE)
return DAG.getNode(IsSub ? X86ISD::ADC : X86ISD::SBB, DL, VTs, X,
DAG.getConstant(-1ULL, DL, VT), Cmp1);
// X - (Z == 0) --> sub X, (zext(sete Z, 0)) --> sbb X, 0, (cmp Z, 1)
// X + (Z == 0) --> add X, (zext(sete Z, 0)) --> adc X, 0, (cmp Z, 1)
return DAG.getNode(IsSub ? X86ISD::SBB : X86ISD::ADC, DL, VTs, X,
DAG.getConstant(0, DL, VT), Cmp1);
}
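// Editorial example of the overall rewrite, assuming i32 operands: computing
// "X + (Z != 0)" naively needs a flag materialization:
//   testl  %ecx, %ecx
//   setne  %al
//   movzbl %al, %eax
//   addl   %eax, %edi
// whereas after this combine (CC == COND_NE, !IsSub) it becomes:
//   cmpl   $1, %ecx       ; CF = (Z == 0)
//   sbbl   $-1, %edi      ; X - (-1) - CF == X + (Z != 0)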
static SDValue combineLoopMAddPattern(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue MulOp = N->getOperand(0);
SDValue Phi = N->getOperand(1);
if (MulOp.getOpcode() != ISD::MUL)
std::swap(MulOp, Phi);
if (MulOp.getOpcode() != ISD::MUL)
return SDValue();
ShrinkMode Mode;
if (!canReduceVMulWidth(MulOp.getNode(), DAG, Mode) || Mode == MULU16)
return SDValue();
EVT VT = N->getValueType(0);
unsigned RegSize = 128;
if (Subtarget.hasBWI())
RegSize = 512;
else if (Subtarget.hasAVX2())
RegSize = 256;
unsigned VectorSize = VT.getVectorNumElements() * 16;
// If the vector size is less than 128, or greater than the supported RegSize,
// do not use PMADD.
if (VectorSize < 128 || VectorSize > RegSize)
return SDValue();
SDLoc DL(N);
EVT ReducedVT = EVT::getVectorVT(*DAG.getContext(), MVT::i16,
VT.getVectorNumElements());
EVT MAddVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32,
VT.getVectorNumElements() / 2);
// Shrink the operands of mul.
SDValue N0 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, MulOp->getOperand(0));
SDValue N1 = DAG.getNode(ISD::TRUNCATE, DL, ReducedVT, MulOp->getOperand(1));
// Madd vector size is half of the original vector size
SDValue Madd = DAG.getNode(X86ISD::VPMADDWD, DL, MAddVT, N0, N1);
// Fill the rest of the output with 0
SDValue Zero = getZeroVector(Madd.getSimpleValueType(), Subtarget, DAG, DL);
SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Madd, Zero);
return DAG.getNode(ISD::ADD, DL, VT, Concat, Phi);
}
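// Sketch of the shape this produces (illustrative, types assumed): for a
// v8i32 reduction where both multiplicands are known to fit in i16,
//   add (mul (sext v8i16), (sext v8i16)), phi
// is narrowed to a single horizontal multiply-add:
//   t0: v4i32 = X86ISD::VPMADDWD t1: v8i16, t2: v8i16
//   t3: v8i32 = concat_vectors t0, zero
//   add t3, phi
// so half as many full-width multiplies feed the reduction.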
static SDValue combineLoopSADPattern(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDLoc DL(N);
EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
// TODO: There's nothing special about i32, any integer type above i16 should
// work just as well.
if (!VT.isVector() || !VT.isSimple() ||
!(VT.getVectorElementType() == MVT::i32))
return SDValue();
unsigned RegSize = 128;
if (Subtarget.hasBWI())
RegSize = 512;
else if (Subtarget.hasAVX2())
RegSize = 256;
// We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
// TODO: We should be able to handle larger vectors by splitting them before
// feeding them into several SADs, and then reducing over those.
if (VT.getSizeInBits() / 4 > RegSize)
return SDValue();
// We know N is a reduction add, which means one of its operands is a phi.
// To match SAD, we need the other operand to be a vector select.
SDValue SelectOp, Phi;
if (Op0.getOpcode() == ISD::VSELECT) {
SelectOp = Op0;
Phi = Op1;
} else if (Op1.getOpcode() == ISD::VSELECT) {
SelectOp = Op1;
Phi = Op0;
} else
return SDValue();
// Check whether we have an abs-diff pattern feeding into the select.
if (!detectZextAbsDiff(SelectOp, Op0, Op1))
return SDValue();
// SAD pattern detected. Now build a SAD instruction and an addition for
// reduction. Note that the number of elements of the result of SAD is less
// than the number of elements of its input. Therefore, we can only update
// part of the elements in the reduction vector.
SDValue Sad = createPSADBW(DAG, Op0, Op1, DL);
// The output of PSADBW is a vector of i64.
// We need to turn the vector of i64 into a vector of i32.
// If the reduction vector is at least as wide as the psadbw result, just
// bitcast. If it's narrower, truncate - the high i32 of each i64 is zero
// anyway.
MVT ResVT = MVT::getVectorVT(MVT::i32, Sad.getValueSizeInBits() / 32);
if (VT.getSizeInBits() >= ResVT.getSizeInBits())
Sad = DAG.getNode(ISD::BITCAST, DL, ResVT, Sad);
else
Sad = DAG.getNode(ISD::TRUNCATE, DL, VT, Sad);
if (VT.getSizeInBits() > ResVT.getSizeInBits()) {
// Update part of the elements of the reduction vector. This is done by
// first extracting a sub-vector from it, updating this sub-vector, and
// inserting it back.
SDValue SubPhi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ResVT, Phi,
DAG.getIntPtrConstant(0, DL));
SDValue Res = DAG.getNode(ISD::ADD, DL, ResVT, Sad, SubPhi);
return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, Phi, Res,
DAG.getIntPtrConstant(0, DL));
} else
return DAG.getNode(ISD::ADD, DL, VT, Sad, Phi);
}
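// Illustrative only: for a sum-of-absolute-differences loop over i8 data,
// a select-of-subs pattern of roughly the form
//   vselect (setcc a, b), (sub a, b), (sub b, a)
// feeding this reduction becomes a single
//   psadbw %xmm1, %xmm0
// which produces i64 partial sums; the code above then bitcasts or truncates
// those into the i32 reduction vector and adds them into the phi.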
/// Convert vector increment or decrement to sub/add with an all-ones constant:
/// add X, <1, 1...> --> sub X, <-1, -1...>
/// sub X, <1, 1...> --> add X, <-1, -1...>
/// The all-ones vector constant can be materialized using a pcmpeq instruction
/// that is commonly recognized as an idiom (has no register dependency), so
/// that's better/smaller than loading a splat 1 constant.
static SDValue combineIncDecVector(SDNode *N, SelectionDAG &DAG) {
assert((N->getOpcode() == ISD::ADD || N->getOpcode() == ISD::SUB) &&
"Unexpected opcode for increment/decrement transform");
// Pseudo-legality check: getOnesVector() expects one of these types, so bail
// out and wait for legalization if we have an unsupported vector length.
EVT VT = N->getValueType(0);
if (!VT.is128BitVector() && !VT.is256BitVector() && !VT.is512BitVector())
return SDValue();
SDNode *N1 = N->getOperand(1).getNode();
APInt SplatVal;
- if (!ISD::isConstantSplatVector(N1, SplatVal) || !SplatVal.isOneValue())
+ if (!ISD::isConstantSplatVector(N1, SplatVal, /*AllowShrink*/false) ||
+ !SplatVal.isOneValue())
return SDValue();
SDValue AllOnesVec = getOnesVector(VT, DAG, SDLoc(N));
unsigned NewOpcode = N->getOpcode() == ISD::ADD ? ISD::SUB : ISD::ADD;
return DAG.getNode(NewOpcode, SDLoc(N), VT, N->getOperand(0), AllOnesVec);
}
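// For example (editorial note), a v4i32 increment "add X, <1,1,1,1>" becomes:
//   pcmpeqd %xmm1, %xmm1   ; all-ones idiom, no constant-pool load
//   psubd   %xmm1, %xmm0   ; x - (-1) == x + 1
// trading a memory operand for a dependency-free register idiom.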
static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
const SDNodeFlags Flags = N->getFlags();
if (Flags.hasVectorReduction()) {
if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))
return Sad;
if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))
return MAdd;
}
EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
// Try to synthesize horizontal adds from adds of shuffles.
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, true))
return DAG.getNode(X86ISD::HADD, SDLoc(N), VT, Op0, Op1);
if (SDValue V = combineIncDecVector(N, DAG))
return V;
return combineAddOrSubToADCOrSBB(N, DAG);
}
static SDValue combineSub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
// X86 can't encode an immediate LHS of a sub. See if we can push the
// negation into a preceding instruction.
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op0)) {
// If the RHS of the sub is a XOR with one use and a constant, invert the
// immediate. Then add one to the LHS of the sub so we can turn
// X-Y -> X+~Y+1, saving one register.
if (Op1->hasOneUse() && Op1.getOpcode() == ISD::XOR &&
isa<ConstantSDNode>(Op1.getOperand(1))) {
APInt XorC = cast<ConstantSDNode>(Op1.getOperand(1))->getAPIntValue();
EVT VT = Op0.getValueType();
SDValue NewXor = DAG.getNode(ISD::XOR, SDLoc(Op1), VT,
Op1.getOperand(0),
DAG.getConstant(~XorC, SDLoc(Op1), VT));
return DAG.getNode(ISD::ADD, SDLoc(N), VT, NewXor,
DAG.getConstant(C->getAPIntValue() + 1, SDLoc(N), VT));
}
}
// Try to synthesize horizontal subs from subs of shuffles.
EVT VT = N->getValueType(0);
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, false))
return DAG.getNode(X86ISD::HSUB, SDLoc(N), VT, Op0, Op1);
if (SDValue V = combineIncDecVector(N, DAG))
return V;
return combineAddOrSubToADCOrSBB(N, DAG);
}
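// A worked instance of the immediate-LHS rewrite above (illustrative):
//   5 - (y ^ 3)  ==>  (y ^ ~3) + 6
// since X - Y == X + ~Y + 1 and ~(y ^ C) == y ^ ~C; the constant LHS is
// folded away and the unencodable sub becomes an ordinary add.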
static SDValue combineVSZext(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalize())
return SDValue();
SDLoc DL(N);
unsigned Opcode = N->getOpcode();
MVT VT = N->getSimpleValueType(0);
MVT SVT = VT.getVectorElementType();
unsigned NumElts = VT.getVectorNumElements();
unsigned EltSizeInBits = SVT.getSizeInBits();
SDValue Op = N->getOperand(0);
MVT OpVT = Op.getSimpleValueType();
MVT OpEltVT = OpVT.getVectorElementType();
unsigned OpEltSizeInBits = OpEltVT.getSizeInBits();
unsigned InputBits = OpEltSizeInBits * NumElts;
// Perform any constant folding.
// FIXME: Reduce constant pool usage and don't fold when OptSize is enabled.
APInt UndefElts;
SmallVector<APInt, 64> EltBits;
if (getTargetConstantBitsFromNode(Op, OpEltSizeInBits, UndefElts, EltBits)) {
APInt Undefs(NumElts, 0);
SmallVector<APInt, 4> Vals(NumElts, APInt(EltSizeInBits, 0));
bool IsZEXT =
(Opcode == X86ISD::VZEXT) || (Opcode == ISD::ZERO_EXTEND_VECTOR_INREG);
for (unsigned i = 0; i != NumElts; ++i) {
if (UndefElts[i]) {
Undefs.setBit(i);
continue;
}
Vals[i] = IsZEXT ? EltBits[i].zextOrTrunc(EltSizeInBits)
: EltBits[i].sextOrTrunc(EltSizeInBits);
}
return getConstVector(Vals, Undefs, VT, DAG, DL);
}
// (vzext (bitcast (vzext x))) -> (vzext x)
// TODO: (vsext (bitcast (vsext x))) -> (vsext x)
SDValue V = peekThroughBitcasts(Op);
if (Opcode == X86ISD::VZEXT && V != Op && V.getOpcode() == X86ISD::VZEXT) {
MVT InnerVT = V.getSimpleValueType();
MVT InnerEltVT = InnerVT.getVectorElementType();
// If the element sizes match exactly, we can just do one larger vzext. This
// is always an exact type match as vzext operates on integer types.
if (OpEltVT == InnerEltVT) {
assert(OpVT == InnerVT && "Types must match for vzext!");
return DAG.getNode(X86ISD::VZEXT, DL, VT, V.getOperand(0));
}
// The only other way we can combine them is if only a single element of the
// inner vzext is used in the input to the outer vzext.
if (InnerEltVT.getSizeInBits() < InputBits)
return SDValue();
// In this case, the inner vzext is completely dead because we're going to
// only look at bits inside of the low element. Just do the outer vzext on
// a bitcast of the input to the inner.
return DAG.getNode(X86ISD::VZEXT, DL, VT, DAG.getBitcast(OpVT, V));
}
// Check if we can bypass extracting and re-inserting an element of an input
// vector. Essentially:
// (bitcast (sclr2vec (ext_vec_elt x))) -> (bitcast x)
// TODO: Add X86ISD::VSEXT support
if (Opcode == X86ISD::VZEXT &&
V.getOpcode() == ISD::SCALAR_TO_VECTOR &&
V.getOperand(0).getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
V.getOperand(0).getSimpleValueType().getSizeInBits() == InputBits) {
SDValue ExtractedV = V.getOperand(0);
SDValue OrigV = ExtractedV.getOperand(0);
if (isNullConstant(ExtractedV.getOperand(1))) {
MVT OrigVT = OrigV.getSimpleValueType();
// Extract a subvector if necessary...
if (OrigVT.getSizeInBits() > OpVT.getSizeInBits()) {
int Ratio = OrigVT.getSizeInBits() / OpVT.getSizeInBits();
OrigVT = MVT::getVectorVT(OrigVT.getVectorElementType(),
OrigVT.getVectorNumElements() / Ratio);
OrigV = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OrigVT, OrigV,
DAG.getIntPtrConstant(0, DL));
}
Op = DAG.getBitcast(OpVT, OrigV);
return DAG.getNode(X86ISD::VZEXT, DL, VT, Op);
}
}
return SDValue();
}
/// Canonicalize (LSUB p, 1) -> (LADD p, -1).
static SDValue combineLockSub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
SDValue Chain = N->getOperand(0);
SDValue LHS = N->getOperand(1);
SDValue RHS = N->getOperand(2);
MVT VT = RHS.getSimpleValueType();
SDLoc DL(N);
auto *C = dyn_cast<ConstantSDNode>(RHS);
if (!C || C->getZExtValue() != 1)
return SDValue();
RHS = DAG.getConstant(-1, DL, VT);
MachineMemOperand *MMO = cast<MemSDNode>(N)->getMemOperand();
return DAG.getMemIntrinsicNode(X86ISD::LADD, DL,
DAG.getVTList(MVT::i32, MVT::Other),
{Chain, LHS, RHS}, VT, MMO);
}
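// In assembly terms (editorial sketch), this turns "lock subl $1, (%rdi)"
// into "lock addl $-1, (%rdi)", so later matching only has to reason about
// the LADD form of the locked read-modify-write operation.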
// TEST (AND a, b), (AND a, b) -> TEST a, b
static SDValue combineTestM(SDNode *N, SelectionDAG &DAG) {
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);
if (Op0 != Op1 || Op1->getOpcode() != ISD::AND)
return SDValue();
EVT VT = N->getValueType(0);
SDLoc DL(N);
return DAG.getNode(X86ISD::TESTM, DL, VT,
Op0->getOperand(0), Op0->getOperand(1));
}
static SDValue combineVectorCompare(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
MVT VT = N->getSimpleValueType(0);
SDLoc DL(N);
if (N->getOperand(0) == N->getOperand(1)) {
if (N->getOpcode() == X86ISD::PCMPEQ)
return getOnesVector(VT, DAG, DL);
if (N->getOpcode() == X86ISD::PCMPGT)
return getZeroVector(VT, Subtarget, DAG, DL);
}
return SDValue();
}
static SDValue combineInsertSubvector(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalizeOps())
return SDValue();
SDLoc dl(N);
SDValue Vec = N->getOperand(0);
SDValue SubVec = N->getOperand(1);
SDValue Idx = N->getOperand(2);
unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
MVT OpVT = N->getSimpleValueType(0);
MVT SubVecVT = SubVec.getSimpleValueType();
// If this is an insert of an extract, combine to a shuffle. Don't do this
// if the insert or extract can be represented with a subvector operation.
if (SubVec.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
SubVec.getOperand(0).getSimpleValueType() == OpVT &&
(IdxVal != 0 || !Vec.isUndef())) {
int ExtIdxVal = cast<ConstantSDNode>(SubVec.getOperand(1))->getZExtValue();
if (ExtIdxVal != 0) {
int VecNumElts = OpVT.getVectorNumElements();
int SubVecNumElts = SubVecVT.getVectorNumElements();
SmallVector<int, 64> Mask(VecNumElts);
// First create an identity shuffle mask.
for (int i = 0; i != VecNumElts; ++i)
Mask[i] = i;
// Now insert the extracted portion.
for (int i = 0; i != SubVecNumElts; ++i)
Mask[i + IdxVal] = i + ExtIdxVal + VecNumElts;
return DAG.getVectorShuffle(OpVT, dl, Vec, SubVec.getOperand(0), Mask);
}
}
// Fold two 16-byte or 32-byte subvector loads into one 32-byte or 64-byte
// load:
// (insert_subvector (insert_subvector undef, (load16 addr), 0),
// (load16 addr + 16), Elts/2)
// --> load32 addr
// or:
// (insert_subvector (insert_subvector undef, (load32 addr), 0),
// (load32 addr + 32), Elts/2)
// --> load64 addr
// or a 16-byte or 32-byte broadcast:
// (insert_subvector (insert_subvector undef, (load16 addr), 0),
// (load16 addr), Elts/2)
// --> X86SubVBroadcast(load16 addr)
// or:
// (insert_subvector (insert_subvector undef, (load32 addr), 0),
// (load32 addr), Elts/2)
// --> X86SubVBroadcast(load32 addr)
if ((IdxVal == OpVT.getVectorNumElements() / 2) &&
Vec.getOpcode() == ISD::INSERT_SUBVECTOR &&
OpVT.getSizeInBits() == SubVecVT.getSizeInBits() * 2) {
auto *Idx2 = dyn_cast<ConstantSDNode>(Vec.getOperand(2));
if (Idx2 && Idx2->getZExtValue() == 0) {
SDValue SubVec2 = Vec.getOperand(1);
// If needed, look through bitcasts to get to the load.
if (auto *FirstLd = dyn_cast<LoadSDNode>(peekThroughBitcasts(SubVec2))) {
bool Fast;
unsigned Alignment = FirstLd->getAlignment();
unsigned AS = FirstLd->getAddressSpace();
const X86TargetLowering *TLI = Subtarget.getTargetLowering();
if (TLI->allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(),
OpVT, AS, Alignment, &Fast) && Fast) {
SDValue Ops[] = {SubVec2, SubVec};
if (SDValue Ld = EltsFromConsecutiveLoads(OpVT, Ops, dl, DAG,
Subtarget, false))
return Ld;
}
}
// If lower/upper loads are the same and the only users of the load, then
// lower to a VBROADCASTF128/VBROADCASTI128/etc.
if (auto *Ld = dyn_cast<LoadSDNode>(peekThroughOneUseBitcasts(SubVec2))) {
if (SubVec2 == SubVec && ISD::isNormalLoad(Ld) &&
SDNode::areOnlyUsersOf({N, Vec.getNode()}, SubVec2.getNode())) {
return DAG.getNode(X86ISD::SUBV_BROADCAST, dl, OpVT, SubVec);
}
}
// If this is subv_broadcast insert into both halves, use a larger
// subv_broadcast.
if (SubVec.getOpcode() == X86ISD::SUBV_BROADCAST && SubVec == SubVec2) {
return DAG.getNode(X86ISD::SUBV_BROADCAST, dl, OpVT,
SubVec.getOperand(0));
}
}
}
return SDValue();
}
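// Worked instance of the insert-of-extract case above (illustrative): for
// v8i32, inserting elements 4..7 of X at position 0 of Y,
//   insert_subvector Y, (extract_subvector X, 4), 0
// becomes shuffle(Y, X, <12,13,14,15, 4,5,6,7>), where indices >= 8 select
// from X.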
SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;
switch (N->getOpcode()) {
default: break;
case ISD::EXTRACT_VECTOR_ELT:
return combineExtractVectorElt(N, DAG, DCI, Subtarget);
case X86ISD::PEXTRW:
case X86ISD::PEXTRB:
return combineExtractVectorElt_SSE(N, DAG, DCI, Subtarget);
case ISD::INSERT_SUBVECTOR:
return combineInsertSubvector(N, DAG, DCI, Subtarget);
case ISD::VSELECT:
case ISD::SELECT:
case X86ISD::SHRUNKBLEND: return combineSelect(N, DAG, DCI, Subtarget);
case ISD::BITCAST: return combineBitcast(N, DAG, DCI, Subtarget);
case X86ISD::CMOV: return combineCMov(N, DAG, DCI, Subtarget);
case ISD::ADD: return combineAdd(N, DAG, Subtarget);
case ISD::SUB: return combineSub(N, DAG, Subtarget);
case X86ISD::ADD: return combineX86ADD(N, DAG, DCI);
case X86ISD::ADC: return combineADC(N, DAG, DCI);
case ISD::MUL: return combineMul(N, DAG, DCI, Subtarget);
case ISD::SHL:
case ISD::SRA:
case ISD::SRL: return combineShift(N, DAG, DCI, Subtarget);
case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);
case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);
case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);
case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);
case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);
case ISD::STORE: return combineStore(N, DAG, Subtarget);
case ISD::MSTORE: return combineMaskedStore(N, DAG, Subtarget);
case ISD::SINT_TO_FP: return combineSIntToFP(N, DAG, Subtarget);
case ISD::UINT_TO_FP: return combineUIntToFP(N, DAG, Subtarget);
case ISD::FADD:
case ISD::FSUB: return combineFaddFsub(N, DAG, Subtarget);
case ISD::FNEG: return combineFneg(N, DAG, Subtarget);
case ISD::TRUNCATE: return combineTruncate(N, DAG, Subtarget);
case X86ISD::ANDNP: return combineAndnp(N, DAG, DCI, Subtarget);
case X86ISD::FAND: return combineFAnd(N, DAG, Subtarget);
case X86ISD::FANDN: return combineFAndn(N, DAG, Subtarget);
case X86ISD::FXOR:
case X86ISD::FOR: return combineFOr(N, DAG, Subtarget);
case X86ISD::FMIN:
case X86ISD::FMAX: return combineFMinFMax(N, DAG);
case ISD::FMINNUM:
case ISD::FMAXNUM: return combineFMinNumFMaxNum(N, DAG, Subtarget);
case X86ISD::BT: return combineBT(N, DAG, DCI);
case ISD::ANY_EXTEND:
case ISD::ZERO_EXTEND: return combineZext(N, DAG, DCI, Subtarget);
case ISD::SIGN_EXTEND: return combineSext(N, DAG, DCI, Subtarget);
case ISD::SIGN_EXTEND_INREG: return combineSignExtendInReg(N, DAG, Subtarget);
case ISD::SETCC: return combineSetCC(N, DAG, Subtarget);
case X86ISD::SETCC: return combineX86SetCC(N, DAG, Subtarget);
case X86ISD::BRCOND: return combineBrCond(N, DAG, Subtarget);
case X86ISD::VSHLI:
case X86ISD::VSRAI:
case X86ISD::VSRLI:
return combineVectorShiftImm(N, DAG, DCI, Subtarget);
case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:
case X86ISD::VSEXT:
case X86ISD::VZEXT: return combineVSZext(N, DAG, DCI, Subtarget);
case X86ISD::PINSRB:
case X86ISD::PINSRW: return combineVectorInsert(N, DAG, DCI, Subtarget);
case X86ISD::SHUFP: // Handle all target specific shuffles
case X86ISD::INSERTPS:
case X86ISD::EXTRQI:
case X86ISD::INSERTQI:
case X86ISD::PALIGNR:
case X86ISD::VSHLDQ:
case X86ISD::VSRLDQ:
case X86ISD::BLENDI:
case X86ISD::UNPCKH:
case X86ISD::UNPCKL:
case X86ISD::MOVHLPS:
case X86ISD::MOVLHPS:
case X86ISD::PSHUFB:
case X86ISD::PSHUFD:
case X86ISD::PSHUFHW:
case X86ISD::PSHUFLW:
case X86ISD::MOVSHDUP:
case X86ISD::MOVSLDUP:
case X86ISD::MOVDDUP:
case X86ISD::MOVSS:
case X86ISD::MOVSD:
case X86ISD::VPPERM:
case X86ISD::VPERMI:
case X86ISD::VPERMV:
case X86ISD::VPERMV3:
case X86ISD::VPERMIV3:
case X86ISD::VPERMIL2:
case X86ISD::VPERMILPI:
case X86ISD::VPERMILPV:
case X86ISD::VPERM2X128:
case X86ISD::VZEXT_MOVL:
case ISD::VECTOR_SHUFFLE: return combineShuffle(N, DAG, DCI, Subtarget);
case X86ISD::FMADD:
case X86ISD::FMADD_RND:
case X86ISD::FMADDS1_RND:
case X86ISD::FMADDS3_RND:
case ISD::FMA: return combineFMA(N, DAG, Subtarget);
case ISD::MGATHER:
case ISD::MSCATTER: return combineGatherScatter(N, DAG);
case X86ISD::LSUB: return combineLockSub(N, DAG, Subtarget);
case X86ISD::TESTM: return combineTestM(N, DAG);
case X86ISD::PCMPEQ:
case X86ISD::PCMPGT: return combineVectorCompare(N, DAG, Subtarget);
}
return SDValue();
}
/// Return true if the target has native support for the specified value type
/// and it is 'desirable' to use the type for the given node type. For
/// example, on x86 i16 is legal, but undesirable since i16 instruction
/// encodings are longer and some i16 instructions are slow.
bool X86TargetLowering::isTypeDesirableForOp(unsigned Opc, EVT VT) const {
if (!isTypeLegal(VT))
return false;
if (VT != MVT::i16)
return true;
switch (Opc) {
default:
return true;
case ISD::LOAD:
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:
case ISD::SHL:
case ISD::SRL:
case ISD::SUB:
case ISD::ADD:
case ISD::MUL:
case ISD::AND:
case ISD::OR:
case ISD::XOR:
return false;
}
}
/// This function checks if any of the users of EFLAGS copies the EFLAGS. We
/// know that the code that lowers COPY of EFLAGS has to use the stack, and if
/// we don't adjust the stack we clobber the first frame index.
/// See X86InstrInfo::copyPhysReg.
static bool hasCopyImplyingStackAdjustment(const MachineFunction &MF) {
const MachineRegisterInfo &MRI = MF.getRegInfo();
return any_of(MRI.reg_instructions(X86::EFLAGS),
[](const MachineInstr &RI) { return RI.isCopy(); });
}
void X86TargetLowering::finalizeLowering(MachineFunction &MF) const {
if (hasCopyImplyingStackAdjustment(MF)) {
MachineFrameInfo &MFI = MF.getFrameInfo();
MFI.setHasCopyImplyingStackAdjustment(true);
}
TargetLoweringBase::finalizeLowering(MF);
}
/// This method queries the target about whether it is beneficial for the DAG
/// combiner to promote the specified node. If true, it should return the
/// desired promotion type by reference.
bool X86TargetLowering::IsDesirableToPromoteOp(SDValue Op, EVT &PVT) const {
EVT VT = Op.getValueType();
if (VT != MVT::i16)
return false;
bool Promote = false;
bool Commute = false;
switch (Op.getOpcode()) {
default: break;
case ISD::SIGN_EXTEND:
case ISD::ZERO_EXTEND:
case ISD::ANY_EXTEND:
Promote = true;
break;
case ISD::SHL:
case ISD::SRL: {
SDValue N0 = Op.getOperand(0);
// Look out for (store (shl (load), x)).
if (MayFoldLoad(N0) && MayFoldIntoStore(Op))
return false;
Promote = true;
break;
}
case ISD::ADD:
case ISD::MUL:
case ISD::AND:
case ISD::OR:
case ISD::XOR:
Commute = true;
LLVM_FALLTHROUGH;
case ISD::SUB: {
SDValue N0 = Op.getOperand(0);
SDValue N1 = Op.getOperand(1);
if (!Commute && MayFoldLoad(N1))
return false;
// Avoid disabling potential load folding opportunities.
if (MayFoldLoad(N0) && (!isa<ConstantSDNode>(N1) || MayFoldIntoStore(Op)))
return false;
if (MayFoldLoad(N1) && (!isa<ConstantSDNode>(N0) || MayFoldIntoStore(Op)))
return false;
Promote = true;
}
}
PVT = MVT::i32;
return Promote;
}
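// For instance (illustrative), an i16 operation such as (or i16 %a, %b) is
// promoted here to i32, avoiding the 66h operand-size prefix carried by the
// 16-bit encoding; the MayFoldLoad/MayFoldIntoStore checks above make sure
// promotion never blocks a profitable load or store folding opportunity.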
//===----------------------------------------------------------------------===//
// X86 Inline Assembly Support
//===----------------------------------------------------------------------===//
// Helper to match a string against a sequence of pieces separated by whitespace.
static bool matchAsm(StringRef S, ArrayRef<const char *> Pieces) {
S = S.substr(S.find_first_not_of(" \t")); // Skip leading whitespace.
for (StringRef Piece : Pieces) {
if (!S.startswith(Piece)) // Check if the piece matches.
return false;
S = S.substr(Piece.size());
StringRef::size_type Pos = S.find_first_not_of(" \t");
if (Pos == 0) // We matched a prefix.
return false;
S = S.substr(Pos);
}
return S.empty();
}
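// Hypothetical usage (editorial, not in the source) showing the prefix rule:
//   matchAsm("bswap $0",  {"bswap", "$0"})  // true
//   matchAsm("bswapx $0", {"bswap", "$0"})  // false: "bswap" only matches a
//                                           // prefix of "bswapx"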
static bool clobbersFlagRegisters(const SmallVector<StringRef, 4> &AsmPieces) {
if (AsmPieces.size() == 3 || AsmPieces.size() == 4) {
if (std::count(AsmPieces.begin(), AsmPieces.end(), "~{cc}") &&
std::count(AsmPieces.begin(), AsmPieces.end(), "~{flags}") &&
std::count(AsmPieces.begin(), AsmPieces.end(), "~{fpsr}")) {
if (AsmPieces.size() == 3)
return true;
else if (std::count(AsmPieces.begin(), AsmPieces.end(), "~{dirflag}"))
return true;
}
}
return false;
}
bool X86TargetLowering::ExpandInlineAsm(CallInst *CI) const {
InlineAsm *IA = cast<InlineAsm>(CI->getCalledValue());
const std::string &AsmStr = IA->getAsmString();
IntegerType *Ty = dyn_cast<IntegerType>(CI->getType());
if (!Ty || Ty->getBitWidth() % 16 != 0)
return false;
// TODO: should remove alternatives from the asmstring: "foo {a|b}" -> "foo a"
SmallVector<StringRef, 4> AsmPieces;
SplitString(AsmStr, AsmPieces, ";\n");
switch (AsmPieces.size()) {
default: return false;
case 1:
// FIXME: this should verify that we are targeting a 486 or better. If not,
// we will turn this bswap into something that will be lowered to logical
// ops instead of emitting the bswap asm. For now, we don't support 486 or
// lower so don't worry about this.
// bswap $0
if (matchAsm(AsmPieces[0], {"bswap", "$0"}) ||
matchAsm(AsmPieces[0], {"bswapl", "$0"}) ||
matchAsm(AsmPieces[0], {"bswapq", "$0"}) ||
matchAsm(AsmPieces[0], {"bswap", "${0:q}"}) ||
matchAsm(AsmPieces[0], {"bswapl", "${0:q}"}) ||
matchAsm(AsmPieces[0], {"bswapq", "${0:q}"})) {
// No need to check constraints, nothing other than the equivalent of
// "=r,0" would be valid here.
return IntrinsicLowering::LowerToByteSwap(CI);
}
// rorw $$8, ${0:w} --> llvm.bswap.i16
if (CI->getType()->isIntegerTy(16) &&
IA->getConstraintString().compare(0, 5, "=r,0,") == 0 &&
(matchAsm(AsmPieces[0], {"rorw", "$$8,", "${0:w}"}) ||
matchAsm(AsmPieces[0], {"rolw", "$$8,", "${0:w}"}))) {
AsmPieces.clear();
StringRef ConstraintsStr = IA->getConstraintString();
SplitString(StringRef(ConstraintsStr).substr(5), AsmPieces, ",");
array_pod_sort(AsmPieces.begin(), AsmPieces.end());
if (clobbersFlagRegisters(AsmPieces))
return IntrinsicLowering::LowerToByteSwap(CI);
}
break;
case 3:
if (CI->getType()->isIntegerTy(32) &&
IA->getConstraintString().compare(0, 5, "=r,0,") == 0 &&
matchAsm(AsmPieces[0], {"rorw", "$$8,", "${0:w}"}) &&
matchAsm(AsmPieces[1], {"rorl", "$$16,", "$0"}) &&
matchAsm(AsmPieces[2], {"rorw", "$$8,", "${0:w}"})) {
AsmPieces.clear();
StringRef ConstraintsStr = IA->getConstraintString();
SplitString(StringRef(ConstraintsStr).substr(5), AsmPieces, ",");
array_pod_sort(AsmPieces.begin(), AsmPieces.end());
if (clobbersFlagRegisters(AsmPieces))
return IntrinsicLowering::LowerToByteSwap(CI);
}
if (CI->getType()->isIntegerTy(64)) {
InlineAsm::ConstraintInfoVector Constraints = IA->ParseConstraints();
if (Constraints.size() >= 2 &&
Constraints[0].Codes.size() == 1 && Constraints[0].Codes[0] == "A" &&
Constraints[1].Codes.size() == 1 && Constraints[1].Codes[0] == "0") {
// bswap %eax / bswap %edx / xchgl %eax, %edx -> llvm.bswap.i64
if (matchAsm(AsmPieces[0], {"bswap", "%eax"}) &&
matchAsm(AsmPieces[1], {"bswap", "%edx"}) &&
matchAsm(AsmPieces[2], {"xchgl", "%eax,", "%edx"}))
return IntrinsicLowering::LowerToByteSwap(CI);
}
}
break;
}
return false;
}
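// As a concrete (editorial) example, a 32-bit byte swap written as
//   asm("bswap $0" : "=r"(x) : "0"(x))
// is expanded here into a call to the llvm.bswap.i32 intrinsic, which the
// optimizer and instruction selector can then treat like any other bswap.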
/// Given a constraint letter, return the type of constraint for this target.
X86TargetLowering::ConstraintType
X86TargetLowering::getConstraintType(StringRef Constraint) const {
if (Constraint.size() == 1) {
switch (Constraint[0]) {
case 'R':
case 'q':
case 'Q':
case 'f':
case 't':
case 'u':
case 'y':
case 'x':
case 'v':
case 'Y':
case 'l':
return C_RegisterClass;
case 'k': // AVX512 masking registers.
case 'a':
case 'b':
case 'c':
case 'd':
case 'S':
case 'D':
case 'A':
return C_Register;
case 'I':
case 'J':
case 'K':
case 'L':
case 'M':
case 'N':
case 'G':
case 'C':
case 'e':
case 'Z':
return C_Other;
default:
break;
}
}
else if (Constraint.size() == 2) {
switch (Constraint[0]) {
default:
break;
case 'Y':
switch (Constraint[1]) {
default:
break;
case 'k':
return C_Register;
}
}
}
return TargetLowering::getConstraintType(Constraint);
}
/// Examine constraint type and operand type and determine a weight value.
/// This object must already have been set up with the operand type
/// and the current alternative constraint selected.
TargetLowering::ConstraintWeight
X86TargetLowering::getSingleConstraintMatchWeight(
AsmOperandInfo &info, const char *constraint) const {
ConstraintWeight weight = CW_Invalid;
Value *CallOperandVal = info.CallOperandVal;
// If we don't have a value, we can't do a match,
// but allow it at the lowest weight.
if (!CallOperandVal)
return CW_Default;
Type *type = CallOperandVal->getType();
// Look at the constraint type.
switch (*constraint) {
default:
weight = TargetLowering::getSingleConstraintMatchWeight(info, constraint);
LLVM_FALLTHROUGH;
case 'R':
case 'q':
case 'Q':
case 'a':
case 'b':
case 'c':
case 'd':
case 'S':
case 'D':
case 'A':
if (CallOperandVal->getType()->isIntegerTy())
weight = CW_SpecificReg;
break;
case 'f':
case 't':
case 'u':
if (type->isFloatingPointTy())
weight = CW_SpecificReg;
break;
case 'y':
if (type->isX86_MMXTy() && Subtarget.hasMMX())
weight = CW_SpecificReg;
break;
case 'Y':
// Other "Y<x>" (e.g. "Yk") constraints should be implemented below.
if (constraint[1] == 'k') {
// Support for 'Yk' (similarly to the 'k' variant below).
weight = CW_SpecificReg;
break;
}
// Else fall through (handle "Y" constraint).
LLVM_FALLTHROUGH;
case 'v':
if ((type->getPrimitiveSizeInBits() == 512) && Subtarget.hasAVX512())
weight = CW_Register;
LLVM_FALLTHROUGH;
case 'x':
if (((type->getPrimitiveSizeInBits() == 128) && Subtarget.hasSSE1()) ||
((type->getPrimitiveSizeInBits() == 256) && Subtarget.hasFp256()))
weight = CW_Register;
break;
case 'k':
// Enable conditional vector operations using %k<#> registers.
weight = CW_SpecificReg;
break;
case 'I':
if (ConstantInt *C = dyn_cast<ConstantInt>(info.CallOperandVal)) {
if (C->getZExtValue() <= 31)
weight = CW_Constant;
}
break;
case 'J':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if (C->getZExtValue() <= 63)
weight = CW_Constant;
}
break;
case 'K':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if ((C->getSExtValue() >= -0x80) && (C->getSExtValue() <= 0x7f))
weight = CW_Constant;
}
break;
case 'L':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if ((C->getZExtValue() == 0xff) || (C->getZExtValue() == 0xffff))
weight = CW_Constant;
}
break;
case 'M':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if (C->getZExtValue() <= 3)
weight = CW_Constant;
}
break;
case 'N':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if (C->getZExtValue() <= 0xff)
weight = CW_Constant;
}
break;
case 'G':
case 'C':
if (isa<ConstantFP>(CallOperandVal)) {
weight = CW_Constant;
}
break;
case 'e':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if ((C->getSExtValue() >= -0x80000000LL) &&
(C->getSExtValue() <= 0x7fffffffLL))
weight = CW_Constant;
}
break;
case 'Z':
if (ConstantInt *C = dyn_cast<ConstantInt>(CallOperandVal)) {
if (C->getZExtValue() <= 0xffffffff)
weight = CW_Constant;
}
break;
}
return weight;
}
/// Try to replace an X constraint, which matches anything, with another that
/// has more specific requirements based on the type of the corresponding
/// operand.
const char *X86TargetLowering::
LowerXConstraint(EVT ConstraintVT) const {
// FP X constraints get lowered to SSE1/2 registers if available, otherwise
// 'f' like normal targets.
if (ConstraintVT.isFloatingPoint()) {
if (Subtarget.hasSSE2())
return "Y";
if (Subtarget.hasSSE1())
return "x";
}
return TargetLowering::LowerXConstraint(ConstraintVT);
}
/// Lower the specified operand into the Ops vector.
/// If it is invalid, don't add anything to Ops.
void X86TargetLowering::LowerAsmOperandForConstraint(SDValue Op,
std::string &Constraint,
std::vector<SDValue>&Ops,
SelectionDAG &DAG) const {
SDValue Result;
// Only support length 1 constraints for now.
if (Constraint.length() > 1) return;
char ConstraintLetter = Constraint[0];
switch (ConstraintLetter) {
default: break;
case 'I':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() <= 31) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'J':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() <= 63) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'K':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (isInt<8>(C->getSExtValue())) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'L':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() == 0xff || C->getZExtValue() == 0xffff ||
(Subtarget.is64Bit() && C->getZExtValue() == 0xffffffff)) {
Result = DAG.getTargetConstant(C->getSExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'M':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() <= 3) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'N':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() <= 255) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'O':
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (C->getZExtValue() <= 127) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
return;
case 'e': {
// 32-bit signed value
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (ConstantInt::isValueValidForType(Type::getInt32Ty(*DAG.getContext()),
C->getSExtValue())) {
// Widen to 64 bits here to get it sign extended.
Result = DAG.getTargetConstant(C->getSExtValue(), SDLoc(Op), MVT::i64);
break;
}
// FIXME gcc accepts some relocatable values here too, but only in certain
// memory models; it's complicated.
}
return;
}
case 'Z': {
// 32-bit unsigned value
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op)) {
if (ConstantInt::isValueValidForType(Type::getInt32Ty(*DAG.getContext()),
C->getZExtValue())) {
Result = DAG.getTargetConstant(C->getZExtValue(), SDLoc(Op),
Op.getValueType());
break;
}
}
// FIXME gcc accepts some relocatable values here too, but only in certain
// memory models; it's complicated.
return;
}
case 'i': {
// Literal immediates are always ok.
if (ConstantSDNode *CST = dyn_cast<ConstantSDNode>(Op)) {
// Widen to 64 bits here to get it sign extended.
Result = DAG.getTargetConstant(CST->getSExtValue(), SDLoc(Op), MVT::i64);
break;
}
// In any sort of PIC mode addresses need to be computed at runtime by
// adding in a register or some sort of table lookup. These can't
// be used as immediates.
if (Subtarget.isPICStyleGOT() || Subtarget.isPICStyleStubPIC())
return;
// If we are in non-pic codegen mode, we allow the address of a global (with
// an optional displacement) to be used with 'i'.
GlobalAddressSDNode *GA = nullptr;
int64_t Offset = 0;
// Match either (GA), (GA+C), (GA+C1+C2), etc.
while (1) {
if ((GA = dyn_cast<GlobalAddressSDNode>(Op))) {
Offset += GA->getOffset();
break;
} else if (Op.getOpcode() == ISD::ADD) {
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
Offset += C->getZExtValue();
Op = Op.getOperand(0);
continue;
}
} else if (Op.getOpcode() == ISD::SUB) {
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
Offset += -C->getZExtValue();
Op = Op.getOperand(0);
continue;
}
}
// Otherwise, this isn't something we can handle, reject it.
return;
}
const GlobalValue *GV = GA->getGlobal();
// If we require an extra load to get this address, as in PIC mode, we
// can't accept it.
if (isGlobalStubReference(Subtarget.classifyGlobalReference(GV)))
return;
Result = DAG.getTargetGlobalAddress(GV, SDLoc(Op),
GA->getValueType(0), Offset);
break;
}
}
if (Result.getNode()) {
Ops.push_back(Result);
return;
}
return TargetLowering::LowerAsmOperandForConstraint(Op, Constraint, Ops, DAG);
}
/// Check if \p RC is a general purpose register class.
/// I.e., GR* or one of their variant.
static bool isGRClass(const TargetRegisterClass &RC) {
return RC.hasSuperClassEq(&X86::GR8RegClass) ||
RC.hasSuperClassEq(&X86::GR16RegClass) ||
RC.hasSuperClassEq(&X86::GR32RegClass) ||
RC.hasSuperClassEq(&X86::GR64RegClass) ||
RC.hasSuperClassEq(&X86::LOW32_ADDR_ACCESS_RBPRegClass);
}
/// Check if \p RC is a vector register class.
/// I.e., FR* / VR* or one of their variant.
static bool isFRClass(const TargetRegisterClass &RC) {
return RC.hasSuperClassEq(&X86::FR32XRegClass) ||
RC.hasSuperClassEq(&X86::FR64XRegClass) ||
RC.hasSuperClassEq(&X86::VR128XRegClass) ||
RC.hasSuperClassEq(&X86::VR256XRegClass) ||
RC.hasSuperClassEq(&X86::VR512RegClass);
}
std::pair<unsigned, const TargetRegisterClass *>
X86TargetLowering::getRegForInlineAsmConstraint(const TargetRegisterInfo *TRI,
StringRef Constraint,
MVT VT) const {
// First, see if this is a constraint that directly corresponds to an LLVM
// register class.
if (Constraint.size() == 1) {
// GCC Constraint Letters
switch (Constraint[0]) {
default: break;
// TODO: Slight differences here in allocation order and leaving
// RIP in the class. Do they matter any more here than they do
// in the normal allocation?
case 'k':
if (Subtarget.hasAVX512()) {
// Only supported in AVX512 or later.
switch (VT.SimpleTy) {
default: break;
case MVT::i32:
return std::make_pair(0U, &X86::VK32RegClass);
case MVT::i16:
return std::make_pair(0U, &X86::VK16RegClass);
case MVT::i8:
return std::make_pair(0U, &X86::VK8RegClass);
case MVT::i1:
return std::make_pair(0U, &X86::VK1RegClass);
case MVT::i64:
return std::make_pair(0U, &X86::VK64RegClass);
}
}
break;
case 'q': // GENERAL_REGS in 64-bit mode, Q_REGS in 32-bit mode.
if (Subtarget.is64Bit()) {
if (VT == MVT::i32 || VT == MVT::f32)
return std::make_pair(0U, &X86::GR32RegClass);
if (VT == MVT::i16)
return std::make_pair(0U, &X86::GR16RegClass);
if (VT == MVT::i8 || VT == MVT::i1)
return std::make_pair(0U, &X86::GR8RegClass);
if (VT == MVT::i64 || VT == MVT::f64)
return std::make_pair(0U, &X86::GR64RegClass);
break;
}
LLVM_FALLTHROUGH;
// 32-bit fallthrough
case 'Q': // Q_REGS
if (VT == MVT::i32 || VT == MVT::f32)
return std::make_pair(0U, &X86::GR32_ABCDRegClass);
if (VT == MVT::i16)
return std::make_pair(0U, &X86::GR16_ABCDRegClass);
if (VT == MVT::i8 || VT == MVT::i1)
return std::make_pair(0U, &X86::GR8_ABCD_LRegClass);
if (VT == MVT::i64)
return std::make_pair(0U, &X86::GR64_ABCDRegClass);
break;
case 'r': // GENERAL_REGS
case 'l': // INDEX_REGS
if (VT == MVT::i8 || VT == MVT::i1)
return std::make_pair(0U, &X86::GR8RegClass);
if (VT == MVT::i16)
return std::make_pair(0U, &X86::GR16RegClass);
if (VT == MVT::i32 || VT == MVT::f32 || !Subtarget.is64Bit())
return std::make_pair(0U, &X86::GR32RegClass);
return std::make_pair(0U, &X86::GR64RegClass);
case 'R': // LEGACY_REGS
if (VT == MVT::i8 || VT == MVT::i1)
return std::make_pair(0U, &X86::GR8_NOREXRegClass);
if (VT == MVT::i16)
return std::make_pair(0U, &X86::GR16_NOREXRegClass);
if (VT == MVT::i32 || !Subtarget.is64Bit())
return std::make_pair(0U, &X86::GR32_NOREXRegClass);
return std::make_pair(0U, &X86::GR64_NOREXRegClass);
case 'f': // FP Stack registers.
// If SSE is enabled for this VT, use f80 to ensure the isel moves the
// value to the correct fpstack register class.
if (VT == MVT::f32 && !isScalarFPTypeInSSEReg(VT))
return std::make_pair(0U, &X86::RFP32RegClass);
if (VT == MVT::f64 && !isScalarFPTypeInSSEReg(VT))
return std::make_pair(0U, &X86::RFP64RegClass);
return std::make_pair(0U, &X86::RFP80RegClass);
case 'y': // MMX_REGS if MMX allowed.
if (!Subtarget.hasMMX()) break;
return std::make_pair(0U, &X86::VR64RegClass);
case 'Y': // SSE_REGS if SSE2 allowed
if (!Subtarget.hasSSE2()) break;
LLVM_FALLTHROUGH;
case 'v':
case 'x': // SSE_REGS if SSE1 allowed or AVX_REGS if AVX allowed
if (!Subtarget.hasSSE1()) break;
bool VConstraint = (Constraint[0] == 'v');
switch (VT.SimpleTy) {
default: break;
// Scalar SSE types.
case MVT::f32:
case MVT::i32:
if (VConstraint && Subtarget.hasAVX512() && Subtarget.hasVLX())
return std::make_pair(0U, &X86::FR32XRegClass);
return std::make_pair(0U, &X86::FR32RegClass);
case MVT::f64:
case MVT::i64:
if (VConstraint && Subtarget.hasVLX())
return std::make_pair(0U, &X86::FR64XRegClass);
return std::make_pair(0U, &X86::FR64RegClass);
// TODO: Handle f128 and i128 in FR128RegClass after it is tested well.
// Vector types.
case MVT::v16i8:
case MVT::v8i16:
case MVT::v4i32:
case MVT::v2i64:
case MVT::v4f32:
case MVT::v2f64:
if (VConstraint && Subtarget.hasVLX())
return std::make_pair(0U, &X86::VR128XRegClass);
return std::make_pair(0U, &X86::VR128RegClass);
// AVX types.
case MVT::v32i8:
case MVT::v16i16:
case MVT::v8i32:
case MVT::v4i64:
case MVT::v8f32:
case MVT::v4f64:
if (VConstraint && Subtarget.hasVLX())
return std::make_pair(0U, &X86::VR256XRegClass);
return std::make_pair(0U, &X86::VR256RegClass);
case MVT::v8f64:
case MVT::v16f32:
case MVT::v16i32:
case MVT::v8i64:
return std::make_pair(0U, &X86::VR512RegClass);
}
break;
}
} else if (Constraint.size() == 2 && Constraint[0] == 'Y') {
switch (Constraint[1]) {
default:
break;
case 'k':
// This register class doesn't allocate k0 for masked vector operations.
if (Subtarget.hasAVX512()) { // Only supported in AVX512.
switch (VT.SimpleTy) {
default: break;
case MVT::i32:
return std::make_pair(0U, &X86::VK32WMRegClass);
case MVT::i16:
return std::make_pair(0U, &X86::VK16WMRegClass);
case MVT::i8:
return std::make_pair(0U, &X86::VK8WMRegClass);
case MVT::i1:
return std::make_pair(0U, &X86::VK1WMRegClass);
case MVT::i64:
return std::make_pair(0U, &X86::VK64WMRegClass);
}
}
break;
}
}
// Use the default implementation in TargetLowering to convert the register
// constraint into a member of a register class.
std::pair<unsigned, const TargetRegisterClass*> Res;
Res = TargetLowering::getRegForInlineAsmConstraint(TRI, Constraint, VT);
// Not found as a standard register?
if (!Res.second) {
// Map st(0) .. st(7) to the corresponding FP registers (FP0 .. FP7).
if (Constraint.size() == 7 && Constraint[0] == '{' &&
tolower(Constraint[1]) == 's' &&
tolower(Constraint[2]) == 't' &&
Constraint[3] == '(' &&
(Constraint[4] >= '0' && Constraint[4] <= '7') &&
Constraint[5] == ')' &&
Constraint[6] == '}') {
Res.first = X86::FP0+Constraint[4]-'0';
Res.second = &X86::RFP80RegClass;
return Res;
}
// GCC allows "st(0)" to be called just plain "st".
if (StringRef("{st}").equals_lower(Constraint)) {
Res.first = X86::FP0;
Res.second = &X86::RFP80RegClass;
return Res;
}
// flags -> EFLAGS
if (StringRef("{flags}").equals_lower(Constraint)) {
Res.first = X86::EFLAGS;
Res.second = &X86::CCRRegClass;
return Res;
}
// 'A' means [ER]AX + [ER]DX.
if (Constraint == "A") {
if (Subtarget.is64Bit()) {
Res.first = X86::RAX;
Res.second = &X86::GR64_ADRegClass;
} else {
assert((Subtarget.is32Bit() || Subtarget.is16Bit()) &&
"Expecting 64, 32 or 16 bit subtarget");
Res.first = X86::EAX;
Res.second = &X86::GR32_ADRegClass;
}
return Res;
}
return Res;
}
// Otherwise, check to see if this is a register class of the wrong value
// type. For example, we want to map "{ax},i32" -> {eax}, we don't want it to
// turn into {ax},{dx}.
// MVT::Other is used to specify clobber names.
if (TRI->isTypeLegalForClass(*Res.second, VT) || VT == MVT::Other)
return Res; // Correct type already, nothing to do.
// Get a matching integer of the correct size, i.e. "ax" with MVT::i32 should
// return "eax". This should even work for things like getting 64-bit integer
// registers when given an f64 type.
const TargetRegisterClass *Class = Res.second;
// The generic code will match the first register class that contains the
// given register. Thus, based on the ordering of the tablegened file,
// the "plain" GR classes might not come first.
// Therefore, use a helper method.
if (isGRClass(*Class)) {
unsigned Size = VT.getSizeInBits();
if (Size == 1) Size = 8;
unsigned DestReg = getX86SubSuperRegisterOrZero(Res.first, Size);
if (DestReg > 0) {
Res.first = DestReg;
Res.second = Size == 8 ? &X86::GR8RegClass
: Size == 16 ? &X86::GR16RegClass
: Size == 32 ? &X86::GR32RegClass
: &X86::GR64RegClass;
assert(Res.second->contains(Res.first) && "Register in register class");
} else {
// No register found/type mismatch.
Res.first = 0;
Res.second = nullptr;
}
} else if (isFRClass(*Class)) {
// Handle references to XMM physical registers that got mapped into the
// wrong class. This can happen with constraints like {xmm0} where the
// target independent register mapper will just pick the first match it can
// find, ignoring the required type.
// TODO: Handle f128 and i128 in FR128RegClass after it is tested well.
if (VT == MVT::f32 || VT == MVT::i32)
Res.second = &X86::FR32RegClass;
else if (VT == MVT::f64 || VT == MVT::i64)
Res.second = &X86::FR64RegClass;
else if (TRI->isTypeLegalForClass(X86::VR128RegClass, VT))
Res.second = &X86::VR128RegClass;
else if (TRI->isTypeLegalForClass(X86::VR256RegClass, VT))
Res.second = &X86::VR256RegClass;
else if (TRI->isTypeLegalForClass(X86::VR512RegClass, VT))
Res.second = &X86::VR512RegClass;
else {
// Type mismatch and not a clobber: return an error.
Res.first = 0;
Res.second = nullptr;
}
}
return Res;
}
int X86TargetLowering::getScalingFactorCost(const DataLayout &DL,
const AddrMode &AM, Type *Ty,
unsigned AS) const {
// Scaling factors are not free at all.
// An indexed folded instruction, i.e., inst (reg1, reg2, scale),
// will take 2 allocations in the out of order engine instead of 1
// for plain addressing mode, i.e. inst (reg1).
// E.g.,
// vaddps (%rsi,%rdx), %ymm0, %ymm1
// Requires two allocations (one for the load, one for the computation)
// whereas:
// vaddps (%rsi), %ymm0, %ymm1
// Requires just 1 allocation, i.e., freeing allocations for other operations
// and having less micro operations to execute.
//
// For some X86 architectures, this is even worse because for instance for
// stores, the complex addressing mode forces the instruction to use the
// "load" ports instead of the dedicated "store" port.
// E.g., on Haswell:
// vmovaps %ymm1, (%r8, %rdi) can use port 2 or 3.
// vmovaps %ymm1, (%r8) can use port 2, 3, or 7.
if (isLegalAddressingMode(DL, AM, Ty, AS))
// Scale represents reg2 * scale, thus account for 1
// as soon as we use a second register.
return AM.Scale != 0;
return -1;
}
bool X86TargetLowering::isIntDivCheap(EVT VT, AttributeList Attr) const {
// Integer division on x86 is expensive. However, when aggressively optimizing
// for code size, we prefer to use a div instruction, as it is usually smaller
// than the alternative sequence.
// The exception to this is vector division. Since x86 doesn't have vector
// integer division, leaving the division as-is is a loss even in terms of
// size, because it will have to be scalarized, while the alternative code
// sequence can be performed in vector form.
bool OptSize =
Attr.hasAttribute(AttributeList::FunctionIndex, Attribute::MinSize);
return OptSize && !VT.isVector();
}
void X86TargetLowering::initializeSplitCSR(MachineBasicBlock *Entry) const {
if (!Subtarget.is64Bit())
return;
// Update IsSplitCSR in X86MachineFunctionInfo.
X86MachineFunctionInfo *AFI =
Entry->getParent()->getInfo<X86MachineFunctionInfo>();
AFI->setIsSplitCSR(true);
}
void X86TargetLowering::insertCopiesSplitCSR(
MachineBasicBlock *Entry,
const SmallVectorImpl<MachineBasicBlock *> &Exits) const {
const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
const MCPhysReg *IStart = TRI->getCalleeSavedRegsViaCopy(Entry->getParent());
if (!IStart)
return;
const TargetInstrInfo *TII = Subtarget.getInstrInfo();
MachineRegisterInfo *MRI = &Entry->getParent()->getRegInfo();
MachineBasicBlock::iterator MBBI = Entry->begin();
for (const MCPhysReg *I = IStart; *I; ++I) {
const TargetRegisterClass *RC = nullptr;
if (X86::GR64RegClass.contains(*I))
RC = &X86::GR64RegClass;
else
llvm_unreachable("Unexpected register class in CSRsViaCopy!");
unsigned NewVR = MRI->createVirtualRegister(RC);
// Create copy from CSR to a virtual register.
// FIXME: this currently does not emit CFI pseudo-instructions, it works
// fine for CXX_FAST_TLS since the C++-style TLS access functions should be
// nounwind. If we want to generalize this later, we may need to emit
// CFI pseudo-instructions.
assert(Entry->getParent()->getFunction()->hasFnAttribute(
Attribute::NoUnwind) &&
"Function should be nounwind in insertCopiesSplitCSR!");
Entry->addLiveIn(*I);
BuildMI(*Entry, MBBI, DebugLoc(), TII->get(TargetOpcode::COPY), NewVR)
.addReg(*I);
// Insert the copy-back instructions right before the terminator.
for (auto *Exit : Exits)
BuildMI(*Exit, Exit->getFirstTerminator(), DebugLoc(),
TII->get(TargetOpcode::COPY), *I)
.addReg(NewVR);
}
}
bool X86TargetLowering::supportSwiftError() const {
return Subtarget.is64Bit();
}
/// Returns the name of the symbol used to emit stack probes or the empty
/// string if not applicable.
StringRef X86TargetLowering::getStackProbeSymbolName(MachineFunction &MF) const {
// If the function specifically requests stack probes, emit them.
if (MF.getFunction()->hasFnAttribute("probe-stack"))
return MF.getFunction()->getFnAttribute("probe-stack").getValueAsString();
// Generally, if we aren't on Windows, the platform ABI does not include
// support for stack probes, so don't emit them.
if (!Subtarget.isOSWindows() || Subtarget.isTargetMachO())
return "";
// We need a stack probe to conform to the Windows ABI. Choose the right
// symbol.
if (Subtarget.is64Bit())
return Subtarget.isTargetCygMing() ? "___chkstk_ms" : "__chkstk";
return Subtarget.isTargetCygMing() ? "_alloca" : "_chkstk";
}
diff --git a/lib/Target/X86/X86InstrAVX512.td b/lib/Target/X86/X86InstrAVX512.td
index 0e654a380e7c..0ae960e7d566 100644
--- a/lib/Target/X86/X86InstrAVX512.td
+++ b/lib/Target/X86/X86InstrAVX512.td
@@ -1,10244 +1,10244 @@
//===-- X86InstrAVX512.td - AVX512 Instruction Set ---------*- tablegen -*-===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file describes the X86 AVX512 instruction set, defining the
// instructions, and properties of the instructions which are needed for code
// generation, machine code emission, and analysis.
//
//===----------------------------------------------------------------------===//
// Group template arguments that can be derived from the vector type (EltNum x
// EltVT). These are things like the register class for the writemask, etc.
// The idea is to pass one of these as the template argument rather than the
// individual arguments.
// The template is also used for scalar types, in this case numelts is 1.
class X86VectorVTInfo<int numelts, ValueType eltvt, RegisterClass rc,
string suffix = ""> {
RegisterClass RC = rc;
ValueType EltVT = eltvt;
int NumElts = numelts;
// Corresponding mask register class.
RegisterClass KRC = !cast<RegisterClass>("VK" # NumElts);
// Corresponding write-mask register class.
RegisterClass KRCWM = !cast<RegisterClass>("VK" # NumElts # "WM");
// The mask VT.
ValueType KVT = !cast<ValueType>("v" # NumElts # "i1");
// Suffix used in the instruction mnemonic.
string Suffix = suffix;
// VTName is a string name for the vector VT. For vector types it will be
// v # NumElts # EltVT, so for a vector of 8 elements of i32 it will be v8i32.
// It is a little more complex for scalar types, where NumElts = 1;
// in that case we build v4f32 or v2f64.
string VTName = "v" # !if (!eq (NumElts, 1),
!if (!eq (EltVT.Size, 32), 4,
!if (!eq (EltVT.Size, 64), 2, NumElts)), NumElts) # EltVT;
// The vector VT.
ValueType VT = !cast<ValueType>(VTName);
string EltTypeName = !cast<string>(EltVT);
// Size of the element type in bits, e.g. 32 for v16i32.
string EltSizeName = !subst("i", "", !subst("f", "", EltTypeName));
int EltSize = EltVT.Size;
// "i" for integer types and "f" for floating-point types
string TypeVariantName = !subst(EltSizeName, "", EltTypeName);
// Size of RC in bits, e.g. 512 for VR512.
int Size = VT.Size;
// The corresponding memory operand, e.g. i512mem for VR512.
X86MemOperand MemOp = !cast<X86MemOperand>(TypeVariantName # Size # "mem");
X86MemOperand ScalarMemOp = !cast<X86MemOperand>(EltVT # "mem");
// FP scalar memory operand for intrinsics - ssmem/sdmem.
Operand IntScalarMemOp = !if (!eq (EltTypeName, "f32"), !cast<Operand>("ssmem"),
!if (!eq (EltTypeName, "f64"), !cast<Operand>("sdmem"), ?));
// Load patterns
// Note: For 128/256-bit integer VT we choose loadv2i64/loadv4i64
// due to load promotion during legalization
PatFrag LdFrag = !cast<PatFrag>("load" #
!if (!eq (TypeVariantName, "i"),
!if (!eq (Size, 128), "v2i64",
!if (!eq (Size, 256), "v4i64",
!if (!eq (Size, 512), "v8i64",
VTName))), VTName));
PatFrag AlignedLdFrag = !cast<PatFrag>("alignedload" #
!if (!eq (TypeVariantName, "i"),
!if (!eq (Size, 128), "v2i64",
!if (!eq (Size, 256), "v4i64",
!if (!eq (Size, 512), "v8i64",
VTName))), VTName));
PatFrag ScalarLdFrag = !cast<PatFrag>("load" # EltVT);
ComplexPattern ScalarIntMemCPat = !if (!eq (EltTypeName, "f32"),
!cast<ComplexPattern>("sse_load_f32"),
!if (!eq (EltTypeName, "f64"),
!cast<ComplexPattern>("sse_load_f64"),
?));
// The corresponding float type, e.g. v16f32 for v16i32
// Note: For EltSize < 32, FloatVT is illegal and TableGen
// fails to compile, so we choose FloatVT = VT
ValueType FloatVT = !cast<ValueType>(
!if (!eq (!srl(EltSize,5),0),
VTName,
!if (!eq(TypeVariantName, "i"),
"v" # NumElts # "f" # EltSize,
VTName)));
ValueType IntVT = !cast<ValueType>(
!if (!eq (!srl(EltSize,5),0),
VTName,
!if (!eq(TypeVariantName, "f"),
"v" # NumElts # "i" # EltSize,
VTName)));
// The string to specify embedded broadcast in assembly.
string BroadcastStr = "{1to" # NumElts # "}";
// 8-bit compressed displacement tuple/subvector format. This is only
// defined for NumElts <= 8.
CD8VForm CD8TupleForm = !if (!eq (!srl(NumElts, 4), 0),
!cast<CD8VForm>("CD8VT" # NumElts), ?);
SubRegIndex SubRegIdx = !if (!eq (Size, 128), sub_xmm,
!if (!eq (Size, 256), sub_ymm, ?));
Domain ExeDomain = !if (!eq (EltTypeName, "f32"), SSEPackedSingle,
!if (!eq (EltTypeName, "f64"), SSEPackedDouble,
SSEPackedInt));
RegisterClass FRC = !if (!eq (EltTypeName, "f32"), FR32X, FR64X);
// A vector type of the same width with element type i64. This is used to
// create patterns for logic ops.
ValueType i64VT = !cast<ValueType>("v" # !srl(Size, 6) # "i64");
// A vector type of the same width with element type i32. This is used to
// create the canonical constant zero node ImmAllZerosV.
ValueType i32VT = !cast<ValueType>("v" # !srl(Size, 5) # "i32");
dag ImmAllZerosV = (VT (bitconvert (i32VT immAllZerosV)));
string ZSuffix = !if (!eq (Size, 128), "Z128",
!if (!eq (Size, 256), "Z256", "Z"));
}
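// As an illustrative sketch of how the fields above resolve: for
// X86VectorVTInfo<8, i64, VR512, "q"> (v8i64_info below) we get KRC = VK8,
// KVT = v8i1, VT = v8i64, TypeVariantName = "i", MemOp = i512mem,
// LdFrag = loadv8i64, FloatVT = v8f64, BroadcastStr = "{1to8}",
// ExeDomain = SSEPackedInt and ZSuffix = "Z".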
def v64i8_info : X86VectorVTInfo<64, i8, VR512, "b">;
def v32i16_info : X86VectorVTInfo<32, i16, VR512, "w">;
def v16i32_info : X86VectorVTInfo<16, i32, VR512, "d">;
def v8i64_info : X86VectorVTInfo<8, i64, VR512, "q">;
def v16f32_info : X86VectorVTInfo<16, f32, VR512, "ps">;
def v8f64_info : X86VectorVTInfo<8, f64, VR512, "pd">;
// "x" in v32i8x_info means RC = VR256X
def v32i8x_info : X86VectorVTInfo<32, i8, VR256X, "b">;
def v16i16x_info : X86VectorVTInfo<16, i16, VR256X, "w">;
def v8i32x_info : X86VectorVTInfo<8, i32, VR256X, "d">;
def v4i64x_info : X86VectorVTInfo<4, i64, VR256X, "q">;
def v8f32x_info : X86VectorVTInfo<8, f32, VR256X, "ps">;
def v4f64x_info : X86VectorVTInfo<4, f64, VR256X, "pd">;
def v16i8x_info : X86VectorVTInfo<16, i8, VR128X, "b">;
def v8i16x_info : X86VectorVTInfo<8, i16, VR128X, "w">;
def v4i32x_info : X86VectorVTInfo<4, i32, VR128X, "d">;
def v2i64x_info : X86VectorVTInfo<2, i64, VR128X, "q">;
def v4f32x_info : X86VectorVTInfo<4, f32, VR128X, "ps">;
def v2f64x_info : X86VectorVTInfo<2, f64, VR128X, "pd">;
// We map scalar types to the smallest (128-bit) vector type
// with the appropriate element type. This allows us to use the same masking logic.
def i32x_info : X86VectorVTInfo<1, i32, GR32, "si">;
def i64x_info : X86VectorVTInfo<1, i64, GR64, "sq">;
def f32x_info : X86VectorVTInfo<1, f32, VR128X, "ss">;
def f64x_info : X86VectorVTInfo<1, f64, VR128X, "sd">;
class AVX512VLVectorVTInfo<X86VectorVTInfo i512, X86VectorVTInfo i256,
X86VectorVTInfo i128> {
X86VectorVTInfo info512 = i512;
X86VectorVTInfo info256 = i256;
X86VectorVTInfo info128 = i128;
}
def avx512vl_i8_info : AVX512VLVectorVTInfo<v64i8_info, v32i8x_info,
v16i8x_info>;
def avx512vl_i16_info : AVX512VLVectorVTInfo<v32i16_info, v16i16x_info,
v8i16x_info>;
def avx512vl_i32_info : AVX512VLVectorVTInfo<v16i32_info, v8i32x_info,
v4i32x_info>;
def avx512vl_i64_info : AVX512VLVectorVTInfo<v8i64_info, v4i64x_info,
v2i64x_info>;
def avx512vl_f32_info : AVX512VLVectorVTInfo<v16f32_info, v8f32x_info,
v4f32x_info>;
def avx512vl_f64_info : AVX512VLVectorVTInfo<v8f64_info, v4f64x_info,
v2f64x_info>;
class X86KVectorVTInfo<RegisterClass _krc, RegisterClass _krcwm,
ValueType _vt> {
RegisterClass KRC = _krc;
RegisterClass KRCWM = _krcwm;
ValueType KVT = _vt;
}
def v2i1_info : X86KVectorVTInfo<VK2, VK2WM, v2i1>;
def v4i1_info : X86KVectorVTInfo<VK4, VK4WM, v4i1>;
def v8i1_info : X86KVectorVTInfo<VK8, VK8WM, v8i1>;
def v16i1_info : X86KVectorVTInfo<VK16, VK16WM, v16i1>;
def v32i1_info : X86KVectorVTInfo<VK32, VK32WM, v32i1>;
def v64i1_info : X86KVectorVTInfo<VK64, VK64WM, v64i1>;
// This multiclass generates the masking variants from the non-masking
// variant. It only provides the assembly pieces for the masking variants.
// It assumes custom ISel patterns for masking which can be provided as
// template arguments.
multiclass AVX512_maskable_custom<bits<8> O, Format F,
dag Outs,
dag Ins, dag MaskingIns, dag ZeroMaskingIns,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
list<dag> Pattern,
list<dag> MaskingPattern,
list<dag> ZeroMaskingPattern,
string MaskingConstraint = "",
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0,
bit IsKCommutable = 0> {
let isCommutable = IsCommutable in
def NAME: AVX512<O, F, Outs, Ins,
OpcodeStr#"\t{"#AttSrcAsm#", $dst|"#
"$dst, "#IntelSrcAsm#"}",
Pattern, itin>;
// Prefer over VMOV*rrk Pat<>
let isCommutable = IsKCommutable in
def NAME#k: AVX512<O, F, Outs, MaskingIns,
OpcodeStr#"\t{"#AttSrcAsm#", $dst {${mask}}|"#
"$dst {${mask}}, "#IntelSrcAsm#"}",
MaskingPattern, itin>,
EVEX_K {
// In case of the 3src subclass this is overridden with a let.
string Constraints = MaskingConstraint;
}
// Zero masking does not add any restrictions to the operand commuting
// transformation, so it is OK to use IsCommutable instead of IsKCommutable.
let isCommutable = IsCommutable in // Prefer over VMOV*rrkz Pat<>
def NAME#kz: AVX512<O, F, Outs, ZeroMaskingIns,
OpcodeStr#"\t{"#AttSrcAsm#", $dst {${mask}} {z}|"#
"$dst {${mask}} {z}, "#IntelSrcAsm#"}",
ZeroMaskingPattern,
itin>,
EVEX_KZ;
}
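// For a hypothetical instantiation with OpcodeStr = "vpaddd" and
// AttSrcAsm = "$src2, $src1", the three variants assemble roughly as
// (AT&T syntax):
//   vpaddd %zmm2, %zmm1, %zmm0            (NAME,    unmasked)
//   vpaddd %zmm2, %zmm1, %zmm0 {%k1}      (NAME#k,  merge-masking)
//   vpaddd %zmm2, %zmm1, %zmm0 {%k1} {z}  (NAME#kz, zero-masking)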
// Common base class of AVX512_maskable and AVX512_maskable_3src.
multiclass AVX512_maskable_common<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs,
dag Ins, dag MaskingIns, dag ZeroMaskingIns,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, dag MaskingRHS,
SDNode Select = vselect,
string MaskingConstraint = "",
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0,
bit IsKCommutable = 0> :
AVX512_maskable_custom<O, F, Outs, Ins, MaskingIns, ZeroMaskingIns, OpcodeStr,
AttSrcAsm, IntelSrcAsm,
[(set _.RC:$dst, RHS)],
[(set _.RC:$dst, MaskingRHS)],
[(set _.RC:$dst,
(Select _.KRCWM:$mask, RHS, _.ImmAllZerosV))],
MaskingConstraint, NoItinerary, IsCommutable,
IsKCommutable>;
// Similar to AVX512_maskable_common, but with scalar types.
multiclass AVX512_maskable_fp_common<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs,
dag Ins, dag MaskingIns, dag ZeroMaskingIns,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
SDNode Select = vselect,
string MaskingConstraint = "",
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0,
bit IsKCommutable = 0> :
AVX512_maskable_custom<O, F, Outs, Ins, MaskingIns, ZeroMaskingIns, OpcodeStr,
AttSrcAsm, IntelSrcAsm,
[], [], [],
MaskingConstraint, NoItinerary, IsCommutable,
IsKCommutable>;
// This multiclass generates the unconditional/non-masking, the masking and
// the zero-masking variant of the vector instruction. In the masking case, the
// preserved vector elements come from a new dummy input operand tied to $dst.
multiclass AVX512_maskable<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS,
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0, bit IsKCommutable = 0,
SDNode Select = vselect> :
AVX512_maskable_common<O, F, _, Outs, Ins,
!con((ins _.RC:$src0, _.KRCWM:$mask), Ins),
!con((ins _.KRCWM:$mask), Ins),
OpcodeStr, AttSrcAsm, IntelSrcAsm, RHS,
(Select _.KRCWM:$mask, RHS, _.RC:$src0), Select,
"$src0 = $dst", itin, IsCommutable, IsKCommutable>;
// This multiclass generates the unconditional/non-masking, the masking and
// the zero-masking variant of the scalar instruction.
multiclass AVX512_maskable_scalar<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS,
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0> :
AVX512_maskable_common<O, F, _, Outs, Ins,
!con((ins _.RC:$src0, _.KRCWM:$mask), Ins),
!con((ins _.KRCWM:$mask), Ins),
OpcodeStr, AttSrcAsm, IntelSrcAsm, RHS,
(X86selects _.KRCWM:$mask, RHS, _.RC:$src0),
X86selects, "$src0 = $dst", itin, IsCommutable>;
// Similar to AVX512_maskable but in this case one of the source operands
// ($src1) is already tied to $dst so we just use that for the preserved
// vector elements. NOTE that the NonTiedIns (the ins dag) should exclude
// $src1.
multiclass AVX512_maskable_3src<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag NonTiedIns, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, bit IsCommutable = 0,
bit IsKCommutable = 0> :
AVX512_maskable_common<O, F, _, Outs,
!con((ins _.RC:$src1), NonTiedIns),
!con((ins _.RC:$src1, _.KRCWM:$mask), NonTiedIns),
!con((ins _.RC:$src1, _.KRCWM:$mask), NonTiedIns),
OpcodeStr, AttSrcAsm, IntelSrcAsm, RHS,
(vselect _.KRCWM:$mask, RHS, _.RC:$src1),
vselect, "", NoItinerary, IsCommutable, IsKCommutable>;
multiclass AVX512_maskable_3src_scalar<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag NonTiedIns, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, bit IsCommutable = 0,
bit IsKCommutable = 0> :
AVX512_maskable_common<O, F, _, Outs,
!con((ins _.RC:$src1), NonTiedIns),
!con((ins _.RC:$src1, _.KRCWM:$mask), NonTiedIns),
!con((ins _.RC:$src1, _.KRCWM:$mask), NonTiedIns),
OpcodeStr, AttSrcAsm, IntelSrcAsm, RHS,
(X86selects _.KRCWM:$mask, RHS, _.RC:$src1),
X86selects, "", NoItinerary, IsCommutable,
IsKCommutable>;
multiclass AVX512_maskable_in_asm<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
list<dag> Pattern> :
AVX512_maskable_custom<O, F, Outs, Ins,
!con((ins _.RC:$src0, _.KRCWM:$mask), Ins),
!con((ins _.KRCWM:$mask), Ins),
OpcodeStr, AttSrcAsm, IntelSrcAsm, Pattern, [], [],
"$src0 = $dst">;
// Instructions with a mask that put the result in a mask register,
// like "compare" and "vptest".
multiclass AVX512_maskable_custom_cmp<bits<8> O, Format F,
dag Outs,
dag Ins, dag MaskingIns,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
list<dag> Pattern,
list<dag> MaskingPattern,
bit IsCommutable = 0> {
let isCommutable = IsCommutable in
def NAME: AVX512<O, F, Outs, Ins,
OpcodeStr#"\t{"#AttSrcAsm#", $dst|"#
"$dst, "#IntelSrcAsm#"}",
Pattern, NoItinerary>;
def NAME#k: AVX512<O, F, Outs, MaskingIns,
OpcodeStr#"\t{"#AttSrcAsm#", $dst {${mask}}|"#
"$dst {${mask}}, "#IntelSrcAsm#"}",
MaskingPattern, NoItinerary>, EVEX_K;
}
multiclass AVX512_maskable_common_cmp<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs,
dag Ins, dag MaskingIns,
string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, dag MaskingRHS,
bit IsCommutable = 0> :
AVX512_maskable_custom_cmp<O, F, Outs, Ins, MaskingIns, OpcodeStr,
AttSrcAsm, IntelSrcAsm,
[(set _.KRC:$dst, RHS)],
[(set _.KRC:$dst, MaskingRHS)], IsCommutable>;
multiclass AVX512_maskable_cmp<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, bit IsCommutable = 0> :
AVX512_maskable_common_cmp<O, F, _, Outs, Ins,
!con((ins _.KRCWM:$mask), Ins),
OpcodeStr, AttSrcAsm, IntelSrcAsm, RHS,
(and _.KRCWM:$mask, RHS), IsCommutable>;
multiclass AVX512_maskable_cmp_alt<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm> :
AVX512_maskable_custom_cmp<O, F, Outs,
Ins, !con((ins _.KRCWM:$mask),Ins), OpcodeStr,
AttSrcAsm, IntelSrcAsm, [],[]>;
// This multiclass generates the unconditional/non-masking, the masking and
// the zero-masking variant of the vector instruction. In the masking case, the
// preserved vector elements come from a new dummy input operand tied to $dst.
multiclass AVX512_maskable_logic<bits<8> O, Format F, X86VectorVTInfo _,
dag Outs, dag Ins, string OpcodeStr,
string AttSrcAsm, string IntelSrcAsm,
dag RHS, dag MaskedRHS,
InstrItinClass itin = NoItinerary,
bit IsCommutable = 0, SDNode Select = vselect> :
AVX512_maskable_custom<O, F, Outs, Ins,
!con((ins _.RC:$src0, _.KRCWM:$mask), Ins),
!con((ins _.KRCWM:$mask), Ins),
OpcodeStr, AttSrcAsm, IntelSrcAsm,
[(set _.RC:$dst, RHS)],
[(set _.RC:$dst,
(Select _.KRCWM:$mask, MaskedRHS, _.RC:$src0))],
[(set _.RC:$dst,
(Select _.KRCWM:$mask, MaskedRHS,
_.ImmAllZerosV))],
"$src0 = $dst", itin, IsCommutable>;
// Bitcasts between 512-bit vector types. Return the original type since
// no instruction is needed for the conversion.
def : Pat<(v8f64 (bitconvert (v8i64 VR512:$src))), (v8f64 VR512:$src)>;
def : Pat<(v8f64 (bitconvert (v16i32 VR512:$src))), (v8f64 VR512:$src)>;
def : Pat<(v8f64 (bitconvert (v32i16 VR512:$src))), (v8f64 VR512:$src)>;
def : Pat<(v8f64 (bitconvert (v64i8 VR512:$src))), (v8f64 VR512:$src)>;
def : Pat<(v8f64 (bitconvert (v16f32 VR512:$src))), (v8f64 VR512:$src)>;
def : Pat<(v16f32 (bitconvert (v8i64 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v16f32 (bitconvert (v16i32 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v16f32 (bitconvert (v32i16 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v16f32 (bitconvert (v64i8 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v16f32 (bitconvert (v8f64 VR512:$src))), (v16f32 VR512:$src)>;
def : Pat<(v8i64 (bitconvert (v16i32 VR512:$src))), (v8i64 VR512:$src)>;
def : Pat<(v8i64 (bitconvert (v32i16 VR512:$src))), (v8i64 VR512:$src)>;
def : Pat<(v8i64 (bitconvert (v64i8 VR512:$src))), (v8i64 VR512:$src)>;
def : Pat<(v8i64 (bitconvert (v8f64 VR512:$src))), (v8i64 VR512:$src)>;
def : Pat<(v8i64 (bitconvert (v16f32 VR512:$src))), (v8i64 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v8i64 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v16f32 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v32i16 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v64i8 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(v16i32 (bitconvert (v8f64 VR512:$src))), (v16i32 VR512:$src)>;
def : Pat<(v32i16 (bitconvert (v8i64 VR512:$src))), (v32i16 VR512:$src)>;
def : Pat<(v32i16 (bitconvert (v16i32 VR512:$src))), (v32i16 VR512:$src)>;
def : Pat<(v32i16 (bitconvert (v64i8 VR512:$src))), (v32i16 VR512:$src)>;
def : Pat<(v32i16 (bitconvert (v8f64 VR512:$src))), (v32i16 VR512:$src)>;
def : Pat<(v32i16 (bitconvert (v16f32 VR512:$src))), (v32i16 VR512:$src)>;
def : Pat<(v64i8 (bitconvert (v8i64 VR512:$src))), (v64i8 VR512:$src)>;
def : Pat<(v64i8 (bitconvert (v16i32 VR512:$src))), (v64i8 VR512:$src)>;
def : Pat<(v64i8 (bitconvert (v32i16 VR512:$src))), (v64i8 VR512:$src)>;
def : Pat<(v64i8 (bitconvert (v8f64 VR512:$src))), (v64i8 VR512:$src)>;
def : Pat<(v64i8 (bitconvert (v16f32 VR512:$src))), (v64i8 VR512:$src)>;
// Alias instruction that maps zero vector to pxor / xorp* for AVX-512.
// This is expanded by ExpandPostRAPseudos to an xorps / vxorps, and then
// swizzled by ExecutionDepsFix to pxor.
// We set canFoldAsLoad because this can be converted to a constant-pool
// load of an all-zeros value if folding it would be beneficial.
let isReMaterializable = 1, isAsCheapAsAMove = 1, canFoldAsLoad = 1,
isPseudo = 1, Predicates = [HasAVX512], SchedRW = [WriteZero] in {
def AVX512_512_SET0 : I<0, Pseudo, (outs VR512:$dst), (ins), "",
[(set VR512:$dst, (v16i32 immAllZerosV))]>;
def AVX512_512_SETALLONES : I<0, Pseudo, (outs VR512:$dst), (ins), "",
[(set VR512:$dst, (v16i32 immAllOnesV))]>;
}
// Alias instructions that allow VPTERNLOG to be used with a mask to create
// a mix of all-ones and all-zeros elements. This is done to force
// the same register to be used as input for all three sources.
let isPseudo = 1, Predicates = [HasAVX512] in {
def AVX512_512_SEXT_MASK_32 : I<0, Pseudo, (outs VR512:$dst),
(ins VK16WM:$mask), "",
[(set VR512:$dst, (vselect (v16i1 VK16WM:$mask),
(v16i32 immAllOnesV),
(v16i32 immAllZerosV)))]>;
def AVX512_512_SEXT_MASK_64 : I<0, Pseudo, (outs VR512:$dst),
(ins VK8WM:$mask), "",
[(set VR512:$dst, (vselect (v8i1 VK8WM:$mask),
(bc_v8i64 (v16i32 immAllOnesV)),
(bc_v8i64 (v16i32 immAllZerosV))))]>;
}
let isReMaterializable = 1, isAsCheapAsAMove = 1, canFoldAsLoad = 1,
isPseudo = 1, Predicates = [HasAVX512], SchedRW = [WriteZero] in {
def AVX512_128_SET0 : I<0, Pseudo, (outs VR128X:$dst), (ins), "",
[(set VR128X:$dst, (v4i32 immAllZerosV))]>;
def AVX512_256_SET0 : I<0, Pseudo, (outs VR256X:$dst), (ins), "",
[(set VR256X:$dst, (v8i32 immAllZerosV))]>;
}
// Alias instructions that map fld0 to xorps for SSE or vxorps for AVX.
// This is expanded by ExpandPostRAPseudos.
let isReMaterializable = 1, isAsCheapAsAMove = 1, canFoldAsLoad = 1,
isPseudo = 1, SchedRW = [WriteZero], Predicates = [HasAVX512] in {
def AVX512_FsFLD0SS : I<0, Pseudo, (outs FR32X:$dst), (ins), "",
[(set FR32X:$dst, fp32imm0)]>;
def AVX512_FsFLD0SD : I<0, Pseudo, (outs FR64X:$dst), (ins), "",
[(set FR64X:$dst, fpimm0)]>;
}
//===----------------------------------------------------------------------===//
// AVX-512 - VECTOR INSERT
//
multiclass vinsert_for_size<int Opcode, X86VectorVTInfo From, X86VectorVTInfo To,
PatFrag vinsert_insert> {
let ExeDomain = To.ExeDomain in {
defm rr : AVX512_maskable<Opcode, MRMSrcReg, To, (outs To.RC:$dst),
(ins To.RC:$src1, From.RC:$src2, u8imm:$src3),
"vinsert" # From.EltTypeName # "x" # From.NumElts,
"$src3, $src2, $src1", "$src1, $src2, $src3",
(vinsert_insert:$src3 (To.VT To.RC:$src1),
(From.VT From.RC:$src2),
(iPTR imm))>, AVX512AIi8Base, EVEX_4V;
defm rm : AVX512_maskable<Opcode, MRMSrcMem, To, (outs To.RC:$dst),
(ins To.RC:$src1, From.MemOp:$src2, u8imm:$src3),
"vinsert" # From.EltTypeName # "x" # From.NumElts,
"$src3, $src2, $src1", "$src1, $src2, $src3",
(vinsert_insert:$src3 (To.VT To.RC:$src1),
(From.VT (bitconvert (From.LdFrag addr:$src2))),
(iPTR imm))>, AVX512AIi8Base, EVEX_4V,
EVEX_CD8<From.EltSize, From.CD8TupleForm>;
}
}
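// For example, with From = X86VectorVTInfo<4, f32, VR128X> this produces the
// mnemonic "vinsertf32x4" (From.EltTypeName = "f32", From.NumElts = 4) in
// register and memory forms, plus the masked variants from AVX512_maskable.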
multiclass vinsert_for_size_lowering<string InstrStr, X86VectorVTInfo From,
X86VectorVTInfo To, PatFrag vinsert_insert,
SDNodeXForm INSERT_get_vinsert_imm , list<Predicate> p> {
let Predicates = p in {
def : Pat<(vinsert_insert:$ins
(To.VT To.RC:$src1), (From.VT From.RC:$src2), (iPTR imm)),
(To.VT (!cast<Instruction>(InstrStr#"rr")
To.RC:$src1, From.RC:$src2,
(INSERT_get_vinsert_imm To.RC:$ins)))>;
def : Pat<(vinsert_insert:$ins
(To.VT To.RC:$src1),
(From.VT (bitconvert (From.LdFrag addr:$src2))),
(iPTR imm)),
(To.VT (!cast<Instruction>(InstrStr#"rm")
To.RC:$src1, addr:$src2,
(INSERT_get_vinsert_imm To.RC:$ins)))>;
}
}
multiclass vinsert_for_type<ValueType EltVT32, int Opcode128,
ValueType EltVT64, int Opcode256> {
let Predicates = [HasVLX] in
defm NAME # "32x4Z256" : vinsert_for_size<Opcode128,
X86VectorVTInfo< 4, EltVT32, VR128X>,
X86VectorVTInfo< 8, EltVT32, VR256X>,
vinsert128_insert>, EVEX_V256;
defm NAME # "32x4Z" : vinsert_for_size<Opcode128,
X86VectorVTInfo< 4, EltVT32, VR128X>,
X86VectorVTInfo<16, EltVT32, VR512>,
vinsert128_insert>, EVEX_V512;
defm NAME # "64x4Z" : vinsert_for_size<Opcode256,
X86VectorVTInfo< 4, EltVT64, VR256X>,
X86VectorVTInfo< 8, EltVT64, VR512>,
vinsert256_insert>, VEX_W, EVEX_V512;
let Predicates = [HasVLX, HasDQI] in
defm NAME # "64x2Z256" : vinsert_for_size<Opcode128,
X86VectorVTInfo< 2, EltVT64, VR128X>,
X86VectorVTInfo< 4, EltVT64, VR256X>,
vinsert128_insert>, VEX_W, EVEX_V256;
let Predicates = [HasDQI] in {
defm NAME # "64x2Z" : vinsert_for_size<Opcode128,
X86VectorVTInfo< 2, EltVT64, VR128X>,
X86VectorVTInfo< 8, EltVT64, VR512>,
vinsert128_insert>, VEX_W, EVEX_V512;
defm NAME # "32x8Z" : vinsert_for_size<Opcode256,
X86VectorVTInfo< 8, EltVT32, VR256X>,
X86VectorVTInfo<16, EltVT32, VR512>,
vinsert256_insert>, EVEX_V512;
}
}
defm VINSERTF : vinsert_for_type<f32, 0x18, f64, 0x1a>;
defm VINSERTI : vinsert_for_type<i32, 0x38, i64, 0x3a>;
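// The two defms above expand to records such as VINSERTF32x4Z256rr,
// VINSERTF32x4Zrr, VINSERTF64x4Zrm, VINSERTI32x4Zrr, etc., each with the
// k/kz masked forms generated by AVX512_maskable.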
// Codegen patterns with the alternative types.
// Only add these if 64x2 and its friends are not supported natively via AVX512DQ.
defm : vinsert_for_size_lowering<"VINSERTF32x4Z256", v2f64x_info, v4f64x_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasVLX, NoDQI]>;
defm : vinsert_for_size_lowering<"VINSERTI32x4Z256", v2i64x_info, v4i64x_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasVLX, NoDQI]>;
defm : vinsert_for_size_lowering<"VINSERTF32x4Z", v2f64x_info, v8f64_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasAVX512, NoDQI]>;
defm : vinsert_for_size_lowering<"VINSERTI32x4Z", v2i64x_info, v8i64_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasAVX512, NoDQI]>;
defm : vinsert_for_size_lowering<"VINSERTF64x4Z", v8f32x_info, v16f32_info,
vinsert256_insert, INSERT_get_vinsert256_imm, [HasAVX512, NoDQI]>;
defm : vinsert_for_size_lowering<"VINSERTI64x4Z", v8i32x_info, v16i32_info,
vinsert256_insert, INSERT_get_vinsert256_imm, [HasAVX512, NoDQI]>;
// Codegen patterns with the alternative types: insert VEC128 into VEC256
defm : vinsert_for_size_lowering<"VINSERTI32x4Z256", v8i16x_info, v16i16x_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasVLX]>;
defm : vinsert_for_size_lowering<"VINSERTI32x4Z256", v16i8x_info, v32i8x_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasVLX]>;
// Codegen patterns with the alternative types: insert VEC128 into VEC512
defm : vinsert_for_size_lowering<"VINSERTI32x4Z", v8i16x_info, v32i16_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasAVX512]>;
defm : vinsert_for_size_lowering<"VINSERTI32x4Z", v16i8x_info, v64i8_info,
vinsert128_insert, INSERT_get_vinsert128_imm, [HasAVX512]>;
// Codegen patterns with the alternative types: insert VEC256 into VEC512
defm : vinsert_for_size_lowering<"VINSERTI64x4Z", v16i16x_info, v32i16_info,
vinsert256_insert, INSERT_get_vinsert256_imm, [HasAVX512]>;
defm : vinsert_for_size_lowering<"VINSERTI64x4Z", v32i8x_info, v64i8_info,
vinsert256_insert, INSERT_get_vinsert256_imm, [HasAVX512]>;
// vinsertps - insert f32 to XMM
let ExeDomain = SSEPackedSingle in {
def VINSERTPSZrr : AVX512AIi8<0x21, MRMSrcReg, (outs VR128X:$dst),
(ins VR128X:$src1, VR128X:$src2, u8imm:$src3),
"vinsertps\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}",
[(set VR128X:$dst, (X86insertps VR128X:$src1, VR128X:$src2, imm:$src3))]>,
EVEX_4V;
def VINSERTPSZrm: AVX512AIi8<0x21, MRMSrcMem, (outs VR128X:$dst),
(ins VR128X:$src1, f32mem:$src2, u8imm:$src3),
"vinsertps\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}",
[(set VR128X:$dst, (X86insertps VR128X:$src1,
(v4f32 (scalar_to_vector (loadf32 addr:$src2))),
imm:$src3))]>, EVEX_4V, EVEX_CD8<32, CD8VT1>;
}
//===----------------------------------------------------------------------===//
// AVX-512 VECTOR EXTRACT
//---
multiclass vextract_for_size<int Opcode,
X86VectorVTInfo From, X86VectorVTInfo To,
PatFrag vextract_extract,
SDNodeXForm EXTRACT_get_vextract_imm> {
let hasSideEffects = 0, ExeDomain = To.ExeDomain in {
// Use AVX512_maskable_in_asm (AVX512_maskable can't be used due to
// vextract_extract); we are interested only in patterns without a mask here,
// the masked patterns are matched separately below.
defm rr : AVX512_maskable_in_asm<Opcode, MRMDestReg, To, (outs To.RC:$dst),
(ins From.RC:$src1, u8imm:$idx),
"vextract" # To.EltTypeName # "x" # To.NumElts,
"$idx, $src1", "$src1, $idx",
[(set To.RC:$dst, (vextract_extract:$idx (From.VT From.RC:$src1),
(iPTR imm)))]>,
AVX512AIi8Base, EVEX;
def mr : AVX512AIi8<Opcode, MRMDestMem, (outs),
(ins To.MemOp:$dst, From.RC:$src1, u8imm:$idx),
"vextract" # To.EltTypeName # "x" # To.NumElts #
"\t{$idx, $src1, $dst|$dst, $src1, $idx}",
[(store (To.VT (vextract_extract:$idx
(From.VT From.RC:$src1), (iPTR imm))),
addr:$dst)]>, EVEX;
let mayStore = 1, hasSideEffects = 0 in
def mrk : AVX512AIi8<Opcode, MRMDestMem, (outs),
(ins To.MemOp:$dst, To.KRCWM:$mask,
From.RC:$src1, u8imm:$idx),
"vextract" # To.EltTypeName # "x" # To.NumElts #
"\t{$idx, $src1, $dst {${mask}}|"
"$dst {${mask}}, $src1, $idx}",
[]>, EVEX_K, EVEX;
}
def : Pat<(To.VT (vselect To.KRCWM:$mask,
(vextract_extract:$ext (From.VT From.RC:$src1),
(iPTR imm)),
To.RC:$src0)),
(!cast<Instruction>(NAME # To.EltSize # "x" # To.NumElts #
From.ZSuffix # "rrk")
To.RC:$src0, To.KRCWM:$mask, From.RC:$src1,
(EXTRACT_get_vextract_imm To.RC:$ext))>;
def : Pat<(To.VT (vselect To.KRCWM:$mask,
(vextract_extract:$ext (From.VT From.RC:$src1),
(iPTR imm)),
To.ImmAllZerosV)),
(!cast<Instruction>(NAME # To.EltSize # "x" # To.NumElts #
From.ZSuffix # "rrkz")
To.KRCWM:$mask, From.RC:$src1,
(EXTRACT_get_vextract_imm To.RC:$ext))>;
}
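// Note that the !cast above reconstructs the instruction name from the
// top-level defm prefix; e.g. for the "32x4Z" instantiation below (To is
// 4 x f32, From is 512-bit) it resolves to VEXTRACTF32x4Zrrk and
// VEXTRACTF32x4Zrrkz.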
// Codegen pattern for the alternative types
multiclass vextract_for_size_lowering<string InstrStr, X86VectorVTInfo From,
X86VectorVTInfo To, PatFrag vextract_extract,
SDNodeXForm EXTRACT_get_vextract_imm, list<Predicate> p> {
let Predicates = p in {
def : Pat<(vextract_extract:$ext (From.VT From.RC:$src1), (iPTR imm)),
(To.VT (!cast<Instruction>(InstrStr#"rr")
From.RC:$src1,
(EXTRACT_get_vextract_imm To.RC:$ext)))>;
def : Pat<(store (To.VT (vextract_extract:$ext (From.VT From.RC:$src1),
(iPTR imm))), addr:$dst),
(!cast<Instruction>(InstrStr#"mr") addr:$dst, From.RC:$src1,
(EXTRACT_get_vextract_imm To.RC:$ext))>;
}
}
multiclass vextract_for_type<ValueType EltVT32, int Opcode128,
ValueType EltVT64, int Opcode256> {
defm NAME # "32x4Z" : vextract_for_size<Opcode128,
X86VectorVTInfo<16, EltVT32, VR512>,
X86VectorVTInfo< 4, EltVT32, VR128X>,
vextract128_extract,
EXTRACT_get_vextract128_imm>,
EVEX_V512, EVEX_CD8<32, CD8VT4>;
defm NAME # "64x4Z" : vextract_for_size<Opcode256,
X86VectorVTInfo< 8, EltVT64, VR512>,
X86VectorVTInfo< 4, EltVT64, VR256X>,
vextract256_extract,
EXTRACT_get_vextract256_imm>,
VEX_W, EVEX_V512, EVEX_CD8<64, CD8VT4>;
let Predicates = [HasVLX] in
defm NAME # "32x4Z256" : vextract_for_size<Opcode128,
X86VectorVTInfo< 8, EltVT32, VR256X>,
X86VectorVTInfo< 4, EltVT32, VR128X>,
vextract128_extract,
EXTRACT_get_vextract128_imm>,
EVEX_V256, EVEX_CD8<32, CD8VT4>;
let Predicates = [HasVLX, HasDQI] in
defm NAME # "64x2Z256" : vextract_for_size<Opcode128,
X86VectorVTInfo< 4, EltVT64, VR256X>,
X86VectorVTInfo< 2, EltVT64, VR128X>,
vextract128_extract,
EXTRACT_get_vextract128_imm>,
VEX_W, EVEX_V256, EVEX_CD8<64, CD8VT2>;
let Predicates = [HasDQI] in {
defm NAME # "64x2Z" : vextract_for_size<Opcode128,
X86VectorVTInfo< 8, EltVT64, VR512>,
X86VectorVTInfo< 2, EltVT64, VR128X>,
vextract128_extract,
EXTRACT_get_vextract128_imm>,
VEX_W, EVEX_V512, EVEX_CD8<64, CD8VT2>;
defm NAME # "32x8Z" : vextract_for_size<Opcode256,
X86VectorVTInfo<16, EltVT32, VR512>,
X86VectorVTInfo< 8, EltVT32, VR256X>,
vextract256_extract,
EXTRACT_get_vextract256_imm>,
EVEX_V512, EVEX_CD8<32, CD8VT8>;
}
}
defm VEXTRACTF : vextract_for_type<f32, 0x19, f64, 0x1b>;
defm VEXTRACTI : vextract_for_type<i32, 0x39, i64, 0x3b>;
// extract_subvector codegen patterns with the alternative types.
// Only add these if 64x2 and its friends are not supported natively via AVX512DQ.
defm : vextract_for_size_lowering<"VEXTRACTF32x4Z", v8f64_info, v2f64x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasAVX512, NoDQI]>;
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z", v8i64_info, v2i64x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasAVX512, NoDQI]>;
defm : vextract_for_size_lowering<"VEXTRACTF64x4Z", v16f32_info, v8f32x_info,
vextract256_extract, EXTRACT_get_vextract256_imm, [HasAVX512, NoDQI]>;
defm : vextract_for_size_lowering<"VEXTRACTI64x4Z", v16i32_info, v8i32x_info,
vextract256_extract, EXTRACT_get_vextract256_imm, [HasAVX512, NoDQI]>;
defm : vextract_for_size_lowering<"VEXTRACTF32x4Z256", v4f64x_info, v2f64x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasVLX, NoDQI]>;
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z256", v4i64x_info, v2i64x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasVLX, NoDQI]>;
// Codegen patterns with the alternative types: extract VEC128 from VEC256
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z256", v16i16x_info, v8i16x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasVLX]>;
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z256", v32i8x_info, v16i8x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasVLX]>;
// Codegen patterns with the alternative types: extract VEC128 from VEC512
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z", v32i16_info, v8i16x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasAVX512]>;
defm : vextract_for_size_lowering<"VEXTRACTI32x4Z", v64i8_info, v16i8x_info,
vextract128_extract, EXTRACT_get_vextract128_imm, [HasAVX512]>;
// Codegen patterns with the alternative types: extract VEC256 from VEC512
defm : vextract_for_size_lowering<"VEXTRACTI64x4Z", v32i16_info, v16i16x_info,
vextract256_extract, EXTRACT_get_vextract256_imm, [HasAVX512]>;
defm : vextract_for_size_lowering<"VEXTRACTI64x4Z", v64i8_info, v32i8x_info,
vextract256_extract, EXTRACT_get_vextract256_imm, [HasAVX512]>;
// A 128-bit subvector extract from the first 512-bit vector position
// is a subregister copy that needs no instruction.
def : Pat<(v2i64 (extract_subvector (v8i64 VR512:$src), (iPTR 0))),
(v2i64 (EXTRACT_SUBREG (v8i64 VR512:$src), sub_xmm))>;
def : Pat<(v2f64 (extract_subvector (v8f64 VR512:$src), (iPTR 0))),
(v2f64 (EXTRACT_SUBREG (v8f64 VR512:$src), sub_xmm))>;
def : Pat<(v4i32 (extract_subvector (v16i32 VR512:$src), (iPTR 0))),
(v4i32 (EXTRACT_SUBREG (v16i32 VR512:$src), sub_xmm))>;
def : Pat<(v4f32 (extract_subvector (v16f32 VR512:$src), (iPTR 0))),
(v4f32 (EXTRACT_SUBREG (v16f32 VR512:$src), sub_xmm))>;
def : Pat<(v8i16 (extract_subvector (v32i16 VR512:$src), (iPTR 0))),
(v8i16 (EXTRACT_SUBREG (v32i16 VR512:$src), sub_xmm))>;
def : Pat<(v16i8 (extract_subvector (v64i8 VR512:$src), (iPTR 0))),
(v16i8 (EXTRACT_SUBREG (v64i8 VR512:$src), sub_xmm))>;
// A 256-bit subvector extract from the first 512-bit vector position
// is a subregister copy that needs no instruction.
def : Pat<(v4i64 (extract_subvector (v8i64 VR512:$src), (iPTR 0))),
(v4i64 (EXTRACT_SUBREG (v8i64 VR512:$src), sub_ymm))>;
def : Pat<(v4f64 (extract_subvector (v8f64 VR512:$src), (iPTR 0))),
(v4f64 (EXTRACT_SUBREG (v8f64 VR512:$src), sub_ymm))>;
def : Pat<(v8i32 (extract_subvector (v16i32 VR512:$src), (iPTR 0))),
(v8i32 (EXTRACT_SUBREG (v16i32 VR512:$src), sub_ymm))>;
def : Pat<(v8f32 (extract_subvector (v16f32 VR512:$src), (iPTR 0))),
(v8f32 (EXTRACT_SUBREG (v16f32 VR512:$src), sub_ymm))>;
def : Pat<(v16i16 (extract_subvector (v32i16 VR512:$src), (iPTR 0))),
(v16i16 (EXTRACT_SUBREG (v32i16 VR512:$src), sub_ymm))>;
def : Pat<(v32i8 (extract_subvector (v64i8 VR512:$src), (iPTR 0))),
(v32i8 (EXTRACT_SUBREG (v64i8 VR512:$src), sub_ymm))>;
let AddedComplexity = 25 in { // to give priority over vinsertf128rm
// A 128-bit subvector insert to the first 512-bit vector position
// is a subregister copy that needs no instruction.
def : Pat<(v8i64 (insert_subvector undef, (v2i64 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
def : Pat<(v8f64 (insert_subvector undef, (v2f64 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v8f64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
def : Pat<(v16i32 (insert_subvector undef, (v4i32 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
def : Pat<(v16f32 (insert_subvector undef, (v4f32 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
def : Pat<(v32i16 (insert_subvector undef, (v8i16 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v32i16 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
def : Pat<(v64i8 (insert_subvector undef, (v16i8 VR128X:$src), (iPTR 0))),
(INSERT_SUBREG (v64i8 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)>;
// A 256-bit subvector insert to the first 512-bit vector position
// is a subregister copy that needs no instruction.
def : Pat<(v8i64 (insert_subvector undef, (v4i64 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
def : Pat<(v8f64 (insert_subvector undef, (v4f64 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v8f64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
def : Pat<(v16i32 (insert_subvector undef, (v8i32 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
def : Pat<(v16f32 (insert_subvector undef, (v8f32 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
def : Pat<(v32i16 (insert_subvector undef, (v16i16 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v32i16 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
def : Pat<(v64i8 (insert_subvector undef, (v32i8 VR256X:$src), (iPTR 0))),
(INSERT_SUBREG (v64i8 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)>;
}
// vextractps - extract 32 bits from XMM
def VEXTRACTPSZrr : AVX512AIi8<0x17, MRMDestReg, (outs GR32:$dst),
(ins VR128X:$src1, u8imm:$src2),
"vextractps\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set GR32:$dst, (extractelt (bc_v4i32 (v4f32 VR128X:$src1)), imm:$src2))]>,
EVEX;
def VEXTRACTPSZmr : AVX512AIi8<0x17, MRMDestMem, (outs),
(ins f32mem:$dst, VR128X:$src1, u8imm:$src2),
"vextractps\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(store (extractelt (bc_v4i32 (v4f32 VR128X:$src1)), imm:$src2),
addr:$dst)]>, EVEX, EVEX_CD8<32, CD8VT1>;
//===---------------------------------------------------------------------===//
// AVX-512 BROADCAST
//---
// broadcast with a scalar argument.
multiclass avx512_broadcast_scalar<bits<8> opc, string OpcodeStr,
X86VectorVTInfo DestInfo, X86VectorVTInfo SrcInfo> {
def : Pat<(DestInfo.VT (X86VBroadcast SrcInfo.FRC:$src)),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#r)
(COPY_TO_REGCLASS SrcInfo.FRC:$src, SrcInfo.RC))>;
def : Pat<(DestInfo.VT (vselect DestInfo.KRCWM:$mask,
(X86VBroadcast SrcInfo.FRC:$src),
DestInfo.RC:$src0)),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#rk)
DestInfo.RC:$src0, DestInfo.KRCWM:$mask,
(COPY_TO_REGCLASS SrcInfo.FRC:$src, SrcInfo.RC))>;
def : Pat<(DestInfo.VT (vselect DestInfo.KRCWM:$mask,
(X86VBroadcast SrcInfo.FRC:$src),
DestInfo.ImmAllZerosV)),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#rkz)
DestInfo.KRCWM:$mask, (COPY_TO_REGCLASS SrcInfo.FRC:$src, SrcInfo.RC))>;
}
multiclass avx512_broadcast_rm<bits<8> opc, string OpcodeStr,
X86VectorVTInfo DestInfo, X86VectorVTInfo SrcInfo> {
let ExeDomain = DestInfo.ExeDomain in {
defm r : AVX512_maskable<opc, MRMSrcReg, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.RC:$src), OpcodeStr, "$src", "$src",
(DestInfo.VT (X86VBroadcast (SrcInfo.VT SrcInfo.RC:$src)))>,
T8PD, EVEX;
defm m : AVX512_maskable<opc, MRMSrcMem, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.ScalarMemOp:$src), OpcodeStr, "$src", "$src",
(DestInfo.VT (X86VBroadcast
(SrcInfo.ScalarLdFrag addr:$src)))>,
T8PD, EVEX, EVEX_CD8<SrcInfo.EltSize, CD8VT1>;
}
def : Pat<(DestInfo.VT (X86VBroadcast
(SrcInfo.VT (scalar_to_vector
(SrcInfo.ScalarLdFrag addr:$src))))),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#m) addr:$src)>;
def : Pat<(DestInfo.VT (vselect DestInfo.KRCWM:$mask,
(X86VBroadcast
(SrcInfo.VT (scalar_to_vector
(SrcInfo.ScalarLdFrag addr:$src)))),
DestInfo.RC:$src0)),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#mk)
DestInfo.RC:$src0, DestInfo.KRCWM:$mask, addr:$src)>;
def : Pat<(DestInfo.VT (vselect DestInfo.KRCWM:$mask,
(X86VBroadcast
(SrcInfo.VT (scalar_to_vector
(SrcInfo.ScalarLdFrag addr:$src)))),
DestInfo.ImmAllZerosV)),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#mkz)
DestInfo.KRCWM:$mask, addr:$src)>;
}
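// Illustration: the 512-bit f32 instantiation below yields VBROADCASTSSZr
// (broadcast the low element of an xmm register) and VBROADCASTSSZm
// (broadcast from a 32-bit memory operand), each with k/kz companions.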
multiclass avx512_fp_broadcast_sd<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _> {
let Predicates = [HasAVX512] in
defm Z : avx512_broadcast_rm<opc, OpcodeStr, _.info512, _.info128>,
avx512_broadcast_scalar<opc, OpcodeStr, _.info512, _.info128>,
EVEX_V512;
let Predicates = [HasVLX] in {
defm Z256 : avx512_broadcast_rm<opc, OpcodeStr, _.info256, _.info128>,
avx512_broadcast_scalar<opc, OpcodeStr, _.info256, _.info128>,
EVEX_V256;
}
}
multiclass avx512_fp_broadcast_ss<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _> {
let Predicates = [HasAVX512] in
defm Z : avx512_broadcast_rm<opc, OpcodeStr, _.info512, _.info128>,
avx512_broadcast_scalar<opc, OpcodeStr, _.info512, _.info128>,
EVEX_V512;
let Predicates = [HasVLX] in {
defm Z256 : avx512_broadcast_rm<opc, OpcodeStr, _.info256, _.info128>,
avx512_broadcast_scalar<opc, OpcodeStr, _.info256, _.info128>,
EVEX_V256;
defm Z128 : avx512_broadcast_rm<opc, OpcodeStr, _.info128, _.info128>,
avx512_broadcast_scalar<opc, OpcodeStr, _.info128, _.info128>,
EVEX_V128;
}
}
defm VBROADCASTSS : avx512_fp_broadcast_ss<0x18, "vbroadcastss",
avx512vl_f32_info>;
defm VBROADCASTSD : avx512_fp_broadcast_sd<0x19, "vbroadcastsd",
avx512vl_f64_info>, VEX_W;
def : Pat<(int_x86_avx512_vbroadcast_ss_512 addr:$src),
(VBROADCASTSSZm addr:$src)>;
def : Pat<(int_x86_avx512_vbroadcast_sd_512 addr:$src),
(VBROADCASTSDZm addr:$src)>;
multiclass avx512_int_broadcast_reg<bits<8> opc, X86VectorVTInfo _,
SDPatternOperator OpNode,
RegisterClass SrcRC> {
let ExeDomain = _.ExeDomain in
defm r : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins SrcRC:$src),
"vpbroadcast"##_.Suffix, "$src", "$src",
(_.VT (OpNode SrcRC:$src))>, T8PD, EVEX;
}
multiclass avx512_int_broadcastbw_reg<bits<8> opc, string Name,
X86VectorVTInfo _, SDPatternOperator OpNode,
RegisterClass SrcRC, SubRegIndex Subreg> {
let ExeDomain = _.ExeDomain in
defm r : AVX512_maskable_custom<opc, MRMSrcReg,
(outs _.RC:$dst), (ins GR32:$src),
!con((ins _.RC:$src0, _.KRCWM:$mask), (ins GR32:$src)),
!con((ins _.KRCWM:$mask), (ins GR32:$src)),
"vpbroadcast"##_.Suffix, "$src", "$src", [], [], [],
"$src0 = $dst">, T8PD, EVEX;
def : Pat <(_.VT (OpNode SrcRC:$src)),
(!cast<Instruction>(Name#r)
(i32 (INSERT_SUBREG (i32 (IMPLICIT_DEF)), SrcRC:$src, Subreg)))>;
def : Pat <(vselect _.KRCWM:$mask, (_.VT (OpNode SrcRC:$src)), _.RC:$src0),
(!cast<Instruction>(Name#rk) _.RC:$src0, _.KRCWM:$mask,
(i32 (INSERT_SUBREG (i32 (IMPLICIT_DEF)), SrcRC:$src, Subreg)))>;
def : Pat <(vselect _.KRCWM:$mask, (_.VT (OpNode SrcRC:$src)), _.ImmAllZerosV),
(!cast<Instruction>(Name#rkz) _.KRCWM:$mask,
(i32 (INSERT_SUBREG (i32 (IMPLICIT_DEF)), SrcRC:$src, Subreg)))>;
}
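// The INSERT_SUBREG above widens the GR8/GR16 source to a GR32 by inserting
// it into an IMPLICIT_DEF at the given subregister index, since the byte and
// word broadcasts are encoded with a 32-bit GPR source register.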
multiclass avx512_int_broadcastbw_reg_vl<bits<8> opc, string Name,
AVX512VLVectorVTInfo _, SDPatternOperator OpNode,
RegisterClass SrcRC, SubRegIndex Subreg, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_int_broadcastbw_reg<opc, Name#Z, _.info512, OpNode, SrcRC,
Subreg>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_int_broadcastbw_reg<opc, Name#Z256, _.info256, OpNode,
SrcRC, Subreg>, EVEX_V256;
defm Z128 : avx512_int_broadcastbw_reg<opc, Name#Z128, _.info128, OpNode,
SrcRC, Subreg>, EVEX_V128;
}
}
multiclass avx512_int_broadcast_reg_vl<bits<8> opc, AVX512VLVectorVTInfo _,
SDPatternOperator OpNode,
RegisterClass SrcRC, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_int_broadcast_reg<opc, _.info512, OpNode, SrcRC>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_int_broadcast_reg<opc, _.info256, OpNode, SrcRC>, EVEX_V256;
defm Z128 : avx512_int_broadcast_reg<opc, _.info128, OpNode, SrcRC>, EVEX_V128;
}
}
defm VPBROADCASTBr : avx512_int_broadcastbw_reg_vl<0x7A, "VPBROADCASTBr",
avx512vl_i8_info, X86VBroadcast, GR8, sub_8bit, HasBWI>;
defm VPBROADCASTWr : avx512_int_broadcastbw_reg_vl<0x7B, "VPBROADCASTWr",
avx512vl_i16_info, X86VBroadcast, GR16, sub_16bit,
HasBWI>;
defm VPBROADCASTDr : avx512_int_broadcast_reg_vl<0x7C, avx512vl_i32_info,
X86VBroadcast, GR32, HasAVX512>;
defm VPBROADCASTQr : avx512_int_broadcast_reg_vl<0x7C, avx512vl_i64_info,
X86VBroadcast, GR64, HasAVX512>, VEX_W;
def : Pat <(v16i32 (X86vzext VK16WM:$mask)),
(VPBROADCASTDrZrkz VK16WM:$mask, (i32 (MOV32ri 0x1)))>;
def : Pat <(v8i64 (X86vzext VK8WM:$mask)),
(VPBROADCASTQrZrkz VK8WM:$mask, (i64 (MOV64ri 0x1)))>;
// Provide aliases for broadcasts from the same register class that
// automatically do the subregister extract.
multiclass avx512_int_broadcast_rm_lowering<X86VectorVTInfo DestInfo,
X86VectorVTInfo SrcInfo> {
def : Pat<(DestInfo.VT (X86VBroadcast (SrcInfo.VT SrcInfo.RC:$src))),
(!cast<Instruction>(NAME#DestInfo.ZSuffix#"r")
(EXTRACT_SUBREG (SrcInfo.VT SrcInfo.RC:$src), sub_xmm))>;
}
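// e.g. a (v16i32 (X86VBroadcast (v16i32 VR512:$src))) node is lowered by
// first extracting the sub_xmm subregister of $src and then reusing the
// ordinary xmm-source broadcast instruction.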
multiclass avx512_int_broadcast_rm_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _, Predicate prd> {
let Predicates = [prd] in {
defm Z : avx512_broadcast_rm<opc, OpcodeStr, _.info512, _.info128>,
avx512_int_broadcast_rm_lowering<_.info512, _.info256>,
EVEX_V512;
// Defined separately to avoid redefinition.
defm Z_Alt : avx512_int_broadcast_rm_lowering<_.info512, _.info512>;
}
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_broadcast_rm<opc, OpcodeStr, _.info256, _.info128>,
avx512_int_broadcast_rm_lowering<_.info256, _.info256>,
EVEX_V256;
defm Z128 : avx512_broadcast_rm<opc, OpcodeStr, _.info128, _.info128>,
EVEX_V128;
}
}
defm VPBROADCASTB : avx512_int_broadcast_rm_vl<0x78, "vpbroadcastb",
avx512vl_i8_info, HasBWI>;
defm VPBROADCASTW : avx512_int_broadcast_rm_vl<0x79, "vpbroadcastw",
avx512vl_i16_info, HasBWI>;
defm VPBROADCASTD : avx512_int_broadcast_rm_vl<0x58, "vpbroadcastd",
avx512vl_i32_info, HasAVX512>;
defm VPBROADCASTQ : avx512_int_broadcast_rm_vl<0x59, "vpbroadcastq",
avx512vl_i64_info, HasAVX512>, VEX_W;
multiclass avx512_subvec_broadcast_rm<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _Dst, X86VectorVTInfo _Src> {
defm rm : AVX512_maskable<opc, MRMSrcMem, _Dst, (outs _Dst.RC:$dst),
(ins _Src.MemOp:$src), OpcodeStr, "$src", "$src",
(_Dst.VT (X86SubVBroadcast
(_Src.VT (bitconvert (_Src.LdFrag addr:$src)))))>,
AVX5128IBase, EVEX;
}
let Predicates = [HasAVX512] in {
// 32-bit targets will fail to load an i64 directly but can use ZEXT_LOAD.
def : Pat<(v8i64 (X86VBroadcast (v8i64 (X86vzload addr:$src)))),
(VPBROADCASTQZm addr:$src)>;
}
let Predicates = [HasVLX, HasBWI] in {
// 32-bit targets will fail to load an i64 directly but can use ZEXT_LOAD.
def : Pat<(v2i64 (X86VBroadcast (v2i64 (X86vzload addr:$src)))),
(VPBROADCASTQZ128m addr:$src)>;
def : Pat<(v4i64 (X86VBroadcast (v4i64 (X86vzload addr:$src)))),
(VPBROADCASTQZ256m addr:$src)>;
// loadi16 is tricky to fold, because !isTypeDesirableForOp, justifiably.
// This means we'll encounter truncated i32 loads; match that here.
def : Pat<(v8i16 (X86VBroadcast (i16 (trunc (i32 (load addr:$src)))))),
(VPBROADCASTWZ128m addr:$src)>;
def : Pat<(v16i16 (X86VBroadcast (i16 (trunc (i32 (load addr:$src)))))),
(VPBROADCASTWZ256m addr:$src)>;
def : Pat<(v8i16 (X86VBroadcast
(i16 (trunc (i32 (zextloadi16 addr:$src)))))),
(VPBROADCASTWZ128m addr:$src)>;
def : Pat<(v16i16 (X86VBroadcast
(i16 (trunc (i32 (zextloadi16 addr:$src)))))),
(VPBROADCASTWZ256m addr:$src)>;
}
//===----------------------------------------------------------------------===//
// AVX-512 BROADCAST SUBVECTORS
//
defm VBROADCASTI32X4 : avx512_subvec_broadcast_rm<0x5a, "vbroadcasti32x4",
v16i32_info, v4i32x_info>,
EVEX_V512, EVEX_CD8<32, CD8VT4>;
defm VBROADCASTF32X4 : avx512_subvec_broadcast_rm<0x1a, "vbroadcastf32x4",
v16f32_info, v4f32x_info>,
EVEX_V512, EVEX_CD8<32, CD8VT4>;
defm VBROADCASTI64X4 : avx512_subvec_broadcast_rm<0x5b, "vbroadcasti64x4",
v8i64_info, v4i64x_info>, VEX_W,
EVEX_V512, EVEX_CD8<64, CD8VT4>;
defm VBROADCASTF64X4 : avx512_subvec_broadcast_rm<0x1b, "vbroadcastf64x4",
v8f64_info, v4f64x_info>, VEX_W,
EVEX_V512, EVEX_CD8<64, CD8VT4>;
let Predicates = [HasAVX512] in {
def : Pat<(v32i16 (X86SubVBroadcast (bc_v16i16 (loadv4i64 addr:$src)))),
(VBROADCASTI64X4rm addr:$src)>;
def : Pat<(v64i8 (X86SubVBroadcast (bc_v32i8 (loadv4i64 addr:$src)))),
(VBROADCASTI64X4rm addr:$src)>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v8f64 (X86SubVBroadcast (v4f64 VR256X:$src))),
(VINSERTF64x4Zrr (INSERT_SUBREG (v8f64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v4f64 VR256X:$src), 1)>;
def : Pat<(v8i64 (X86SubVBroadcast (v4i64 VR256X:$src))),
(VINSERTI64x4Zrr (INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v4i64 VR256X:$src), 1)>;
def : Pat<(v32i16 (X86SubVBroadcast (v16i16 VR256X:$src))),
(VINSERTI64x4Zrr (INSERT_SUBREG (v32i16 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v16i16 VR256X:$src), 1)>;
def : Pat<(v64i8 (X86SubVBroadcast (v32i8 VR256X:$src))),
(VINSERTI64x4Zrr (INSERT_SUBREG (v64i8 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v32i8 VR256X:$src), 1)>;
def : Pat<(v32i16 (X86SubVBroadcast (bc_v8i16 (loadv2i64 addr:$src)))),
(VBROADCASTI32X4rm addr:$src)>;
def : Pat<(v64i8 (X86SubVBroadcast (bc_v16i8 (loadv2i64 addr:$src)))),
(VBROADCASTI32X4rm addr:$src)>;
}
let Predicates = [HasVLX] in {
defm VBROADCASTI32X4Z256 : avx512_subvec_broadcast_rm<0x5a, "vbroadcasti32x4",
v8i32x_info, v4i32x_info>,
EVEX_V256, EVEX_CD8<32, CD8VT4>;
defm VBROADCASTF32X4Z256 : avx512_subvec_broadcast_rm<0x1a, "vbroadcastf32x4",
v8f32x_info, v4f32x_info>,
EVEX_V256, EVEX_CD8<32, CD8VT4>;
def : Pat<(v16i16 (X86SubVBroadcast (bc_v8i16 (loadv2i64 addr:$src)))),
(VBROADCASTI32X4Z256rm addr:$src)>;
def : Pat<(v32i8 (X86SubVBroadcast (bc_v16i8 (loadv2i64 addr:$src)))),
(VBROADCASTI32X4Z256rm addr:$src)>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v8f32 (X86SubVBroadcast (v4f32 VR128X:$src))),
(VINSERTF32x4Z256rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v4f32 VR128X:$src), 1)>;
def : Pat<(v8i32 (X86SubVBroadcast (v4i32 VR128X:$src))),
(VINSERTI32x4Z256rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v4i32 VR128X:$src), 1)>;
def : Pat<(v16i16 (X86SubVBroadcast (v8i16 VR128X:$src))),
(VINSERTI32x4Z256rr (INSERT_SUBREG (v16i16 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v8i16 VR128X:$src), 1)>;
def : Pat<(v32i8 (X86SubVBroadcast (v16i8 VR128X:$src))),
(VINSERTI32x4Z256rr (INSERT_SUBREG (v32i8 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v16i8 VR128X:$src), 1)>;
}
let Predicates = [HasVLX, HasDQI] in {
defm VBROADCASTI64X2Z128 : avx512_subvec_broadcast_rm<0x5a, "vbroadcasti64x2",
v4i64x_info, v2i64x_info>, VEX_W,
EVEX_V256, EVEX_CD8<64, CD8VT2>;
defm VBROADCASTF64X2Z128 : avx512_subvec_broadcast_rm<0x1a, "vbroadcastf64x2",
v4f64x_info, v2f64x_info>, VEX_W,
EVEX_V256, EVEX_CD8<64, CD8VT2>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v4f64 (X86SubVBroadcast (v2f64 VR128X:$src))),
(VINSERTF64x2Z256rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v2f64 VR128X:$src), 1)>;
def : Pat<(v4i64 (X86SubVBroadcast (v2i64 VR128X:$src))),
(VINSERTI64x2Z256rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v2i64 VR128X:$src), 1)>;
}
let Predicates = [HasVLX, NoDQI] in {
def : Pat<(v4f64 (X86SubVBroadcast (loadv2f64 addr:$src))),
(VBROADCASTF32X4Z256rm addr:$src)>;
def : Pat<(v4i64 (X86SubVBroadcast (loadv2i64 addr:$src))),
(VBROADCASTI32X4Z256rm addr:$src)>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v4f64 (X86SubVBroadcast (v2f64 VR128X:$src))),
(VINSERTF32x4Z256rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v2f64 VR128X:$src), 1)>;
def : Pat<(v4i64 (X86SubVBroadcast (v2i64 VR128X:$src))),
(VINSERTI32x4Z256rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(v2i64 VR128X:$src), 1)>;
}
let Predicates = [HasAVX512, NoDQI] in {
def : Pat<(v8f64 (X86SubVBroadcast (loadv2f64 addr:$src))),
(VBROADCASTF32X4rm addr:$src)>;
def : Pat<(v8i64 (X86SubVBroadcast (loadv2i64 addr:$src))),
(VBROADCASTI32X4rm addr:$src)>;
def : Pat<(v16f32 (X86SubVBroadcast (loadv8f32 addr:$src))),
(VBROADCASTF64X4rm addr:$src)>;
def : Pat<(v16i32 (X86SubVBroadcast (bc_v8i32 (loadv4i64 addr:$src)))),
(VBROADCASTI64X4rm addr:$src)>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v16f32 (X86SubVBroadcast (v8f32 VR256X:$src))),
(VINSERTF64x4Zrr (INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v8f32 VR256X:$src), 1)>;
def : Pat<(v16i32 (X86SubVBroadcast (v8i32 VR256X:$src))),
(VINSERTI64x4Zrr (INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v8i32 VR256X:$src), 1)>;
}
let Predicates = [HasDQI] in {
defm VBROADCASTI64X2 : avx512_subvec_broadcast_rm<0x5a, "vbroadcasti64x2",
v8i64_info, v2i64x_info>, VEX_W,
EVEX_V512, EVEX_CD8<64, CD8VT2>;
defm VBROADCASTI32X8 : avx512_subvec_broadcast_rm<0x5b, "vbroadcasti32x8",
v16i32_info, v8i32x_info>,
EVEX_V512, EVEX_CD8<32, CD8VT8>;
defm VBROADCASTF64X2 : avx512_subvec_broadcast_rm<0x1a, "vbroadcastf64x2",
v8f64_info, v2f64x_info>, VEX_W,
EVEX_V512, EVEX_CD8<64, CD8VT2>;
defm VBROADCASTF32X8 : avx512_subvec_broadcast_rm<0x1b, "vbroadcastf32x8",
v16f32_info, v8f32x_info>,
EVEX_V512, EVEX_CD8<32, CD8VT8>;
// Provide fallback in case the load node that is used in the patterns above
// is used by additional users, which prevents the pattern selection.
def : Pat<(v16f32 (X86SubVBroadcast (v8f32 VR256X:$src))),
(VINSERTF32x8Zrr (INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v8f32 VR256X:$src), 1)>;
def : Pat<(v16i32 (X86SubVBroadcast (v8i32 VR256X:$src))),
(VINSERTI32x8Zrr (INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm),
(v8i32 VR256X:$src), 1)>;
}
multiclass avx512_common_broadcast_32x2<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _Dst, AVX512VLVectorVTInfo _Src> {
let Predicates = [HasDQI] in
defm Z : avx512_broadcast_rm<opc, OpcodeStr, _Dst.info512, _Src.info128>,
EVEX_V512;
let Predicates = [HasDQI, HasVLX] in
defm Z256 : avx512_broadcast_rm<opc, OpcodeStr, _Dst.info256, _Src.info128>,
EVEX_V256;
}
multiclass avx512_common_broadcast_i32x2<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _Dst, AVX512VLVectorVTInfo _Src> :
avx512_common_broadcast_32x2<opc, OpcodeStr, _Dst, _Src> {
let Predicates = [HasDQI, HasVLX] in
defm Z128 : avx512_broadcast_rm<opc, OpcodeStr, _Dst.info128, _Src.info128>,
EVEX_V128;
}
defm VBROADCASTI32X2 : avx512_common_broadcast_i32x2<0x59, "vbroadcasti32x2",
avx512vl_i32_info, avx512vl_i64_info>;
defm VBROADCASTF32X2 : avx512_common_broadcast_32x2<0x19, "vbroadcastf32x2",
avx512vl_f32_info, avx512vl_f64_info>;
let Predicates = [HasVLX] in {
def : Pat<(v8f32 (X86VBroadcast (v8f32 VR256X:$src))),
(VBROADCASTSSZ256r (EXTRACT_SUBREG (v8f32 VR256X:$src), sub_xmm))>;
def : Pat<(v4f64 (X86VBroadcast (v4f64 VR256X:$src))),
(VBROADCASTSDZ256r (EXTRACT_SUBREG (v4f64 VR256X:$src), sub_xmm))>;
}
def : Pat<(v16f32 (X86VBroadcast (v16f32 VR512:$src))),
(VBROADCASTSSZr (EXTRACT_SUBREG (v16f32 VR512:$src), sub_xmm))>;
def : Pat<(v16f32 (X86VBroadcast (v8f32 VR256X:$src))),
(VBROADCASTSSZr (EXTRACT_SUBREG (v8f32 VR256X:$src), sub_xmm))>;
def : Pat<(v8f64 (X86VBroadcast (v8f64 VR512:$src))),
(VBROADCASTSDZr (EXTRACT_SUBREG (v8f64 VR512:$src), sub_xmm))>;
def : Pat<(v8f64 (X86VBroadcast (v4f64 VR256X:$src))),
(VBROADCASTSDZr (EXTRACT_SUBREG (v4f64 VR256X:$src), sub_xmm))>;
//===----------------------------------------------------------------------===//
// AVX-512 BROADCAST MASK TO VECTOR REGISTER
//---
multiclass avx512_mask_broadcastm<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _, RegisterClass KRC> {
def rr : AVX512XS8I<opc, MRMSrcReg, (outs _.RC:$dst), (ins KRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(set _.RC:$dst, (_.VT (X86VBroadcastm KRC:$src)))]>, EVEX;
}
multiclass avx512_mask_broadcast<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo, RegisterClass KRC> {
let Predicates = [HasCDI] in
defm Z : avx512_mask_broadcastm<opc, OpcodeStr, VTInfo.info512, KRC>, EVEX_V512;
let Predicates = [HasCDI, HasVLX] in {
defm Z256 : avx512_mask_broadcastm<opc, OpcodeStr, VTInfo.info256, KRC>, EVEX_V256;
defm Z128 : avx512_mask_broadcastm<opc, OpcodeStr, VTInfo.info128, KRC>, EVEX_V128;
}
}
defm VPBROADCASTMW2D : avx512_mask_broadcast<0x3A, "vpbroadcastmw2d",
avx512vl_i32_info, VK16>;
defm VPBROADCASTMB2Q : avx512_mask_broadcast<0x2A, "vpbroadcastmb2q",
avx512vl_i64_info, VK8>, VEX_W;
//===----------------------------------------------------------------------===//
// -- VPERMI2 - 3 source operands form --
multiclass avx512_perm_i<bits<8> opc, string OpcodeStr, X86VectorVTInfo _> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
// The index operand in the pattern should really be an integer type. However,
// if we do that and it happens to come from a bitcast, then it becomes
// difficult to find the bitcast needed to convert the index to the
// destination type for the passthru since it will be folded with the bitcast
// of the index operand.
defm rr: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (X86VPermi2X _.RC:$src1, _.RC:$src2, _.RC:$src3)), 1>, EVEX_4V,
AVX5128IBase;
defm rm: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (X86VPermi2X _.RC:$src1, _.RC:$src2,
(_.VT (bitconvert (_.LdFrag addr:$src3))))), 1>,
EVEX_4V, AVX5128IBase;
}
}
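// VPERMI2* ties the index operand ($src1) to $dst, so the index register is
// overwritten by the permuted result; e.g. in
//   vpermi2d %zmm2, %zmm1, %zmm0
// %zmm0 holds the indices on input and the result on output.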
multiclass avx512_perm_i_mb<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in
defm rmb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, !strconcat("${src3}", _.BroadcastStr,", $src2"),
!strconcat("$src2, ${src3}", _.BroadcastStr ),
(_.VT (X86VPermi2X _.RC:$src1,
_.RC:$src2,(_.VT (X86VBroadcast (_.ScalarLdFrag addr:$src3))))),
1>, AVX5128IBase, EVEX_4V, EVEX_B;
}
multiclass avx512_perm_i_sizes<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
defm NAME: avx512_perm_i<opc, OpcodeStr, VTInfo.info512>,
avx512_perm_i_mb<opc, OpcodeStr, VTInfo.info512>, EVEX_V512;
let Predicates = [HasVLX] in {
defm NAME#128: avx512_perm_i<opc, OpcodeStr, VTInfo.info128>,
avx512_perm_i_mb<opc, OpcodeStr, VTInfo.info128>, EVEX_V128;
defm NAME#256: avx512_perm_i<opc, OpcodeStr, VTInfo.info256>,
avx512_perm_i_mb<opc, OpcodeStr, VTInfo.info256>, EVEX_V256;
}
}
multiclass avx512_perm_i_sizes_bw<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo,
Predicate Prd> {
let Predicates = [Prd] in
defm NAME: avx512_perm_i<opc, OpcodeStr, VTInfo.info512>, EVEX_V512;
let Predicates = [Prd, HasVLX] in {
defm NAME#128: avx512_perm_i<opc, OpcodeStr, VTInfo.info128>, EVEX_V128;
defm NAME#256: avx512_perm_i<opc, OpcodeStr, VTInfo.info256>, EVEX_V256;
}
}
defm VPERMI2D : avx512_perm_i_sizes<0x76, "vpermi2d",
avx512vl_i32_info>, EVEX_CD8<32, CD8VF>;
defm VPERMI2Q : avx512_perm_i_sizes<0x76, "vpermi2q",
avx512vl_i64_info>, VEX_W, EVEX_CD8<64, CD8VF>;
defm VPERMI2W : avx512_perm_i_sizes_bw<0x75, "vpermi2w",
avx512vl_i16_info, HasBWI>,
VEX_W, EVEX_CD8<16, CD8VF>;
defm VPERMI2B : avx512_perm_i_sizes_bw<0x75, "vpermi2b",
avx512vl_i8_info, HasVBMI>,
EVEX_CD8<8, CD8VF>;
defm VPERMI2PS : avx512_perm_i_sizes<0x77, "vpermi2ps",
avx512vl_f32_info>, EVEX_CD8<32, CD8VF>;
defm VPERMI2PD : avx512_perm_i_sizes<0x77, "vpermi2pd",
avx512vl_f64_info>, VEX_W, EVEX_CD8<64, CD8VF>;
// VPERMT2
multiclass avx512_perm_t<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _, X86VectorVTInfo IdxVT> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm rr: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins IdxVT.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (X86VPermt2 _.RC:$src1, IdxVT.RC:$src2, _.RC:$src3)), 1>,
EVEX_4V, AVX5128IBase;
defm rm: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins IdxVT.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (X86VPermt2 _.RC:$src1, IdxVT.RC:$src2,
(bitconvert (_.LdFrag addr:$src3)))), 1>,
EVEX_4V, AVX5128IBase;
}
}
multiclass avx512_perm_t_mb<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _, X86VectorVTInfo IdxVT> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in
defm rmb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins IdxVT.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, !strconcat("${src3}", _.BroadcastStr,", $src2"),
!strconcat("$src2, ${src3}", _.BroadcastStr ),
(_.VT (X86VPermt2 _.RC:$src1,
IdxVT.RC:$src2,(_.VT (X86VBroadcast (_.ScalarLdFrag addr:$src3))))),
1>, AVX5128IBase, EVEX_4V, EVEX_B;
}
multiclass avx512_perm_t_sizes<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo,
AVX512VLVectorVTInfo ShuffleMask> {
defm NAME: avx512_perm_t<opc, OpcodeStr, VTInfo.info512,
ShuffleMask.info512>,
avx512_perm_t_mb<opc, OpcodeStr, VTInfo.info512,
ShuffleMask.info512>, EVEX_V512;
let Predicates = [HasVLX] in {
defm NAME#128: avx512_perm_t<opc, OpcodeStr, VTInfo.info128,
ShuffleMask.info128>,
avx512_perm_t_mb<opc, OpcodeStr, VTInfo.info128,
ShuffleMask.info128>, EVEX_V128;
defm NAME#256: avx512_perm_t<opc, OpcodeStr, VTInfo.info256,
ShuffleMask.info256>,
avx512_perm_t_mb<opc, OpcodeStr, VTInfo.info256,
ShuffleMask.info256>, EVEX_V256;
}
}
multiclass avx512_perm_t_sizes_bw<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo,
AVX512VLVectorVTInfo Idx,
Predicate Prd> {
let Predicates = [Prd] in
defm NAME: avx512_perm_t<opc, OpcodeStr, VTInfo.info512,
Idx.info512>, EVEX_V512;
let Predicates = [Prd, HasVLX] in {
defm NAME#128: avx512_perm_t<opc, OpcodeStr, VTInfo.info128,
Idx.info128>, EVEX_V128;
defm NAME#256: avx512_perm_t<opc, OpcodeStr, VTInfo.info256,
Idx.info256>, EVEX_V256;
}
}
defm VPERMT2D : avx512_perm_t_sizes<0x7E, "vpermt2d",
avx512vl_i32_info, avx512vl_i32_info>, EVEX_CD8<32, CD8VF>;
defm VPERMT2Q : avx512_perm_t_sizes<0x7E, "vpermt2q",
avx512vl_i64_info, avx512vl_i64_info>, VEX_W, EVEX_CD8<64, CD8VF>;
defm VPERMT2W : avx512_perm_t_sizes_bw<0x7D, "vpermt2w",
avx512vl_i16_info, avx512vl_i16_info, HasBWI>,
VEX_W, EVEX_CD8<16, CD8VF>;
defm VPERMT2B : avx512_perm_t_sizes_bw<0x7D, "vpermt2b",
avx512vl_i8_info, avx512vl_i8_info, HasVBMI>,
EVEX_CD8<8, CD8VF>;
defm VPERMT2PS : avx512_perm_t_sizes<0x7F, "vpermt2ps",
avx512vl_f32_info, avx512vl_i32_info>, EVEX_CD8<32, CD8VF>;
defm VPERMT2PD : avx512_perm_t_sizes<0x7F, "vpermt2pd",
avx512vl_f64_info, avx512vl_i64_info>, VEX_W, EVEX_CD8<64, CD8VF>;
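// Note (illustrative only, AT&T syntax): both families permute elements drawn
// from a two-register table; VPERMI2* overwrites the index operand, while
// VPERMT2* overwrites the first table operand:
//   vpermi2d %zmm2, %zmm1, %zmm0   ; zmm0 (indices)  <- table{zmm1,zmm2}[zmm0]
//   vpermt2d %zmm2, %zmm1, %zmm0   ; zmm0 (table lo) <- table{zmm0,zmm2}[zmm1]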
//===----------------------------------------------------------------------===//
// AVX-512 - BLEND using mask
//
multiclass avx512_blendmask<bits<8> opc, string OpcodeStr, X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain, hasSideEffects = 0 in {
def rr : AVX5128I<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst}|${dst}, $src1, $src2}"),
[]>, EVEX_4V;
def rrk : AVX5128I<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst} {${mask}}|${dst} {${mask}}, $src1, $src2}"),
[]>, EVEX_4V, EVEX_K;
def rrkz : AVX5128I<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst} {${mask}} {z}|${dst} {${mask}} {z}, $src1, $src2}"),
[]>, EVEX_4V, EVEX_KZ;
let mayLoad = 1 in {
def rm : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst}|${dst}, $src1, $src2}"),
[]>, EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
def rmk : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.MemOp:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst} {${mask}}|${dst} {${mask}}, $src1, $src2}"),
[]>, EVEX_4V, EVEX_K, EVEX_CD8<_.EltSize, CD8VF>;
def rmkz : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.MemOp:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, ${dst} {${mask}} {z}|${dst} {${mask}} {z}, $src1, $src2}"),
[]>, EVEX_4V, EVEX_KZ, EVEX_CD8<_.EltSize, CD8VF>;
}
}
}
multiclass avx512_blendmask_rmb<bits<8> opc, string OpcodeStr, X86VectorVTInfo _> {
let mayLoad = 1, hasSideEffects = 0 in {
def rmbk : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.ScalarMemOp:$src2),
!strconcat(OpcodeStr,
"\t{${src2}", _.BroadcastStr, ", $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, ${src2}", _.BroadcastStr, "}"),
[]>, EVEX_4V, EVEX_K, EVEX_B, EVEX_CD8<_.EltSize, CD8VF>;
def rmb : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2),
!strconcat(OpcodeStr,
"\t{${src2}", _.BroadcastStr, ", $src1, $dst|",
"$dst, $src1, ${src2}", _.BroadcastStr, "}"),
[]>, EVEX_4V, EVEX_B, EVEX_CD8<_.EltSize, CD8VF>;
}
}
multiclass blendmask_dq <bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
defm Z : avx512_blendmask <opc, OpcodeStr, VTInfo.info512>,
avx512_blendmask_rmb <opc, OpcodeStr, VTInfo.info512>, EVEX_V512;
let Predicates = [HasVLX] in {
defm Z256 : avx512_blendmask<opc, OpcodeStr, VTInfo.info256>,
avx512_blendmask_rmb <opc, OpcodeStr, VTInfo.info256>, EVEX_V256;
defm Z128 : avx512_blendmask<opc, OpcodeStr, VTInfo.info128>,
avx512_blendmask_rmb <opc, OpcodeStr, VTInfo.info128>, EVEX_V128;
}
}
multiclass blendmask_bw <bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
let Predicates = [HasBWI] in
defm Z : avx512_blendmask <opc, OpcodeStr, VTInfo.info512>, EVEX_V512;
let Predicates = [HasBWI, HasVLX] in {
defm Z256 : avx512_blendmask <opc, OpcodeStr, VTInfo.info256>, EVEX_V256;
defm Z128 : avx512_blendmask <opc, OpcodeStr, VTInfo.info128>, EVEX_V128;
}
}
defm VBLENDMPS : blendmask_dq <0x65, "vblendmps", avx512vl_f32_info>;
defm VBLENDMPD : blendmask_dq <0x65, "vblendmpd", avx512vl_f64_info>, VEX_W;
defm VPBLENDMD : blendmask_dq <0x64, "vpblendmd", avx512vl_i32_info>;
defm VPBLENDMQ : blendmask_dq <0x64, "vpblendmq", avx512vl_i64_info>, VEX_W;
defm VPBLENDMB : blendmask_bw <0x66, "vpblendmb", avx512vl_i8_info>;
defm VPBLENDMW : blendmask_bw <0x66, "vpblendmw", avx512vl_i16_info>, VEX_W;
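// Note (illustrative only, AT&T syntax): the blend forms above select
// per-element between the two sources under a mask register:
//   vpblendmd %zmm2, %zmm1, %zmm0 {%k1}  ; zmm0[i] = k1[i] ? zmm2[i] : zmm1[i]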
//===----------------------------------------------------------------------===//
// Compare Instructions
//===----------------------------------------------------------------------===//
// avx512_cmp_scalar - AVX512 CMPSS and CMPSD
multiclass avx512_cmp_scalar<X86VectorVTInfo _, SDNode OpNode, SDNode OpNodeRnd>{
defm rr_Int : AVX512_maskable_cmp<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc)>, EVEX_4V;
let mayLoad = 1 in
defm rm_Int : AVX512_maskable_cmp<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.IntScalarMemOp:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1), _.ScalarIntMemCPat:$src2,
imm:$cc)>, EVEX_4V, EVEX_CD8<_.EltSize, CD8VT1>;
defm rrb_Int : AVX512_maskable_cmp<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(OpNodeRnd (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc,
(i32 FROUND_NO_EXC))>, EVEX_4V, EVEX_B;
// Accept explicit immediate argument form instead of comparison code.
let isAsmParserOnly = 1, hasSideEffects = 0 in {
defm rri_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcReg, _,
(outs VK1:$dst),
(ins _.RC:$src1, _.RC:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, $src2, $src1", "$src1, $src2, $cc">, EVEX_4V;
let mayLoad = 1 in
defm rmi_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, $src2, $src1", "$src1, $src2, $cc">,
EVEX_4V, EVEX_CD8<_.EltSize, CD8VT1>;
defm rrb_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, {sae}, $src2, $src1","$src1, $src2, {sae}, $cc">,
EVEX_4V, EVEX_B;
}// let isAsmParserOnly = 1, hasSideEffects = 0
let isCodeGenOnly = 1 in {
let isCommutable = 1 in
def rr : AVX512Ii8<0xC2, MRMSrcReg,
(outs _.KRC:$dst), (ins _.FRC:$src1, _.FRC:$src2, AVXCC:$cc),
!strconcat("vcmp${cc}", _.Suffix,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode _.FRC:$src1,
_.FRC:$src2,
imm:$cc))],
IIC_SSE_ALU_F32S_RR>, EVEX_4V;
def rm : AVX512Ii8<0xC2, MRMSrcMem,
(outs _.KRC:$dst),
(ins _.FRC:$src1, _.ScalarMemOp:$src2, AVXCC:$cc),
!strconcat("vcmp${cc}", _.Suffix,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode _.FRC:$src1,
(_.ScalarLdFrag addr:$src2),
imm:$cc))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_CD8<_.EltSize, CD8VT1>;
}
}
let Predicates = [HasAVX512] in {
let ExeDomain = SSEPackedSingle in
defm VCMPSSZ : avx512_cmp_scalar<f32x_info, X86cmpms, X86cmpmsRnd>,
AVX512XSIi8Base;
let ExeDomain = SSEPackedDouble in
defm VCMPSDZ : avx512_cmp_scalar<f64x_info, X86cmpms, X86cmpmsRnd>,
AVX512XDIi8Base, VEX_W;
}
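// Note (illustrative only, AT&T syntax): the scalar compares produce a single
// mask bit in the low position of a k-register:
//   vcmpltss %xmm1, %xmm0, %k0     ; k0[0] = (xmm0[0] < xmm1[0])
//   vcmpss $1, %xmm1, %xmm0, %k0   ; same predicate via the explicit-imm form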
multiclass avx512_icmp_packed<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, bit IsCommutable> {
let isCommutable = IsCommutable in
def rr : AVX512BI<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2)))],
IIC_SSE_ALU_F32P_RR>, EVEX_4V;
def rm : AVX512BI<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.MemOp:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2)))))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V;
let isCommutable = IsCommutable in
def rrk : AVX512BI<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2))))],
IIC_SSE_ALU_F32P_RR>, EVEX_4V, EVEX_K;
def rmk : AVX512BI<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.MemOp:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert
(_.LdFrag addr:$src2))))))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K;
}
multiclass avx512_icmp_packed_rmb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, bit IsCommutable> :
avx512_icmp_packed<opc, OpcodeStr, OpNode, _, IsCommutable> {
def rmb : AVX512BI<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.ScalarMemOp:$src2),
!strconcat(OpcodeStr, "\t{${src2}", _.BroadcastStr, ", $src1, $dst",
"|$dst, $src1, ${src2}", _.BroadcastStr, "}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2))))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_B;
def rmbk : AVX512BI<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1,
_.ScalarMemOp:$src2),
!strconcat(OpcodeStr,
"\t{${src2}", _.BroadcastStr, ", $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, ${src2}", _.BroadcastStr, "}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(X86VBroadcast
(_.ScalarLdFrag addr:$src2)))))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K, EVEX_B;
}
multiclass avx512_icmp_packed_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, Predicate prd,
bit IsCommutable = 0> {
let Predicates = [prd] in
defm Z : avx512_icmp_packed<opc, OpcodeStr, OpNode, VTInfo.info512,
IsCommutable>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_icmp_packed<opc, OpcodeStr, OpNode, VTInfo.info256,
IsCommutable>, EVEX_V256;
defm Z128 : avx512_icmp_packed<opc, OpcodeStr, OpNode, VTInfo.info128,
IsCommutable>, EVEX_V128;
}
}
multiclass avx512_icmp_packed_rmb_vl<bits<8> opc, string OpcodeStr,
SDNode OpNode, AVX512VLVectorVTInfo VTInfo,
Predicate prd, bit IsCommutable = 0> {
let Predicates = [prd] in
defm Z : avx512_icmp_packed_rmb<opc, OpcodeStr, OpNode, VTInfo.info512,
IsCommutable>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_icmp_packed_rmb<opc, OpcodeStr, OpNode, VTInfo.info256,
IsCommutable>, EVEX_V256;
defm Z128 : avx512_icmp_packed_rmb<opc, OpcodeStr, OpNode, VTInfo.info128,
IsCommutable>, EVEX_V128;
}
}
defm VPCMPEQB : avx512_icmp_packed_vl<0x74, "vpcmpeqb", X86pcmpeqm,
avx512vl_i8_info, HasBWI, 1>,
EVEX_CD8<8, CD8VF>;
defm VPCMPEQW : avx512_icmp_packed_vl<0x75, "vpcmpeqw", X86pcmpeqm,
avx512vl_i16_info, HasBWI, 1>,
EVEX_CD8<16, CD8VF>;
defm VPCMPEQD : avx512_icmp_packed_rmb_vl<0x76, "vpcmpeqd", X86pcmpeqm,
avx512vl_i32_info, HasAVX512, 1>,
EVEX_CD8<32, CD8VF>;
defm VPCMPEQQ : avx512_icmp_packed_rmb_vl<0x29, "vpcmpeqq", X86pcmpeqm,
avx512vl_i64_info, HasAVX512, 1>,
T8PD, VEX_W, EVEX_CD8<64, CD8VF>;
defm VPCMPGTB : avx512_icmp_packed_vl<0x64, "vpcmpgtb", X86pcmpgtm,
avx512vl_i8_info, HasBWI>,
EVEX_CD8<8, CD8VF>;
defm VPCMPGTW : avx512_icmp_packed_vl<0x65, "vpcmpgtw", X86pcmpgtm,
avx512vl_i16_info, HasBWI>,
EVEX_CD8<16, CD8VF>;
defm VPCMPGTD : avx512_icmp_packed_rmb_vl<0x66, "vpcmpgtd", X86pcmpgtm,
avx512vl_i32_info, HasAVX512>,
EVEX_CD8<32, CD8VF>;
defm VPCMPGTQ : avx512_icmp_packed_rmb_vl<0x37, "vpcmpgtq", X86pcmpgtm,
avx512vl_i64_info, HasAVX512>,
T8PD, VEX_W, EVEX_CD8<64, CD8VF>;
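// Note (illustrative only, AT&T syntax): the packed integer compares above
// produce one mask bit per element:
//   vpcmpeqd %zmm1, %zmm0, %k0         ; k0[i] = (zmm0[i] == zmm1[i])
//   vpcmpgtq (%rdi){1to8}, %zmm0, %k0  ; signed >, 64-bit broadcast form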
multiclass avx512_icmp_packed_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
SDNode OpNode, string InstrStr,
list<Predicate> Preds> {
let Predicates = Preds in {
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rr) _.RC:$src1, _.RC:$src2),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rm) _.RC:$src1, addr:$src2),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2)))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rrk) _.KRCWM:$mask,
_.RC:$src1, _.RC:$src2),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and (_.KVT _.KRCWM:$mask),
(_.KVT (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert
(_.LdFrag addr:$src2))))))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmk) _.KRCWM:$mask,
_.RC:$src1, addr:$src2),
NewInf.KRC)>;
}
}
multiclass avx512_icmp_packed_rmb_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
SDNode OpNode, string InstrStr,
list<Predicate> Preds>
: avx512_icmp_packed_lowering<_, NewInf, OpNode, InstrStr, Preds> {
let Predicates = Preds in {
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2)))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmb) _.RC:$src1, addr:$src2),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and (_.KVT _.KRCWM:$mask),
(_.KVT (OpNode (_.VT _.RC:$src1),
(X86VBroadcast
(_.ScalarLdFrag addr:$src2)))))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmbk) _.KRCWM:$mask,
_.RC:$src1, addr:$src2),
NewInf.KRC)>;
}
}
// VPCMPEQB - i8
defm : avx512_icmp_packed_lowering<v16i8x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i8x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v32i8x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQBZ256", [HasBWI, HasVLX]>;
// VPCMPEQW - i16
defm : avx512_icmp_packed_lowering<v8i16x_info, v16i1_info, X86pcmpeqm,
"VPCMPEQWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v8i16x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v8i16x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i16x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i16x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v32i16_info, v64i1_info, X86pcmpeqm,
"VPCMPEQWZ", [HasBWI]>;
// VPCMPEQD - i32
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v8i1_info, X86pcmpeqm,
"VPCMPEQDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v16i1_info, X86pcmpeqm,
"VPCMPEQDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v16i1_info, X86pcmpeqm,
"VPCMPEQDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v16i32_info, v32i1_info, X86pcmpeqm,
"VPCMPEQDZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v16i32_info, v64i1_info, X86pcmpeqm,
"VPCMPEQDZ", [HasAVX512]>;
// VPCMPEQQ - i64
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v4i1_info, X86pcmpeqm,
"VPCMPEQQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v8i1_info, X86pcmpeqm,
"VPCMPEQQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v16i1_info, X86pcmpeqm,
"VPCMPEQQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v8i1_info, X86pcmpeqm,
"VPCMPEQQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v16i1_info, X86pcmpeqm,
"VPCMPEQQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v32i1_info, X86pcmpeqm,
"VPCMPEQQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v64i1_info, X86pcmpeqm,
"VPCMPEQQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v16i1_info, X86pcmpeqm,
"VPCMPEQQZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v32i1_info, X86pcmpeqm,
"VPCMPEQQZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v64i1_info, X86pcmpeqm,
"VPCMPEQQZ", [HasAVX512]>;
// VPCMPGTB - i8
defm : avx512_icmp_packed_lowering<v16i8x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i8x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v32i8x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTBZ256", [HasBWI, HasVLX]>;
// VPCMPGTW - i16
defm : avx512_icmp_packed_lowering<v8i16x_info, v16i1_info, X86pcmpgtm,
"VPCMPGTWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v8i16x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v8i16x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i16x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v16i16x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_packed_lowering<v32i16_info, v64i1_info, X86pcmpgtm,
"VPCMPGTWZ", [HasBWI]>;
// VPCMPGTD - i32
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v8i1_info, X86pcmpgtm,
"VPCMPGTDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v16i1_info, X86pcmpgtm,
"VPCMPGTDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i32x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v16i1_info, X86pcmpgtm,
"VPCMPGTDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i32x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v16i32_info, v32i1_info, X86pcmpgtm,
"VPCMPGTDZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v16i32_info, v64i1_info, X86pcmpgtm,
"VPCMPGTDZ", [HasAVX512]>;
// VPCMPGTQ - i64
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v4i1_info, X86pcmpgtm,
"VPCMPGTQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v8i1_info, X86pcmpgtm,
"VPCMPGTQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v16i1_info, X86pcmpgtm,
"VPCMPGTQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v2i64x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v8i1_info, X86pcmpgtm,
"VPCMPGTQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v16i1_info, X86pcmpgtm,
"VPCMPGTQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v32i1_info, X86pcmpgtm,
"VPCMPGTQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v4i64x_info, v64i1_info, X86pcmpgtm,
"VPCMPGTQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v16i1_info, X86pcmpgtm,
"VPCMPGTQZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v32i1_info, X86pcmpgtm,
"VPCMPGTQZ", [HasAVX512]>;
defm : avx512_icmp_packed_rmb_lowering<v8i64_info, v64i1_info, X86pcmpgtm,
"VPCMPGTQZ", [HasAVX512]>;
multiclass avx512_icmp_cc<bits<8> opc, string Suffix, SDNode OpNode,
X86VectorVTInfo _> {
let isCommutable = 1 in
def rri : AVX512AIi8<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.RC:$src1, _.RC:$src2, AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
imm:$cc))],
IIC_SSE_ALU_F32P_RR>, EVEX_4V;
def rmi : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.MemOp:$src2, AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
imm:$cc))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V;
let isCommutable = 1 in
def rrik : AVX512AIi8<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.RC:$src2,
AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{$src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
imm:$cc)))],
IIC_SSE_ALU_F32P_RR>, EVEX_4V, EVEX_K;
def rmik : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.MemOp:$src2,
AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{$src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
imm:$cc)))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K;
// Accept explicit immediate argument form instead of comparison code.
let isAsmParserOnly = 1, hasSideEffects = 0 in {
def rri_alt : AVX512AIi8<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.RC:$src1, _.RC:$src2, u8imm:$cc),
!strconcat("vpcmp", Suffix, "\t{$cc, $src2, $src1, $dst|",
"$dst, $src1, $src2, $cc}"),
[], IIC_SSE_ALU_F32P_RR>, EVEX_4V;
let mayLoad = 1 in
def rmi_alt : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.MemOp:$src2, u8imm:$cc),
!strconcat("vpcmp", Suffix, "\t{$cc, $src2, $src1, $dst|",
"$dst, $src1, $src2, $cc}"),
[], IIC_SSE_ALU_F32P_RM>, EVEX_4V;
def rrik_alt : AVX512AIi8<opc, MRMSrcReg,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.RC:$src2,
u8imm:$cc),
!strconcat("vpcmp", Suffix,
"\t{$cc, $src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2, $cc}"),
[], IIC_SSE_ALU_F32P_RR>, EVEX_4V, EVEX_K;
let mayLoad = 1 in
def rmik_alt : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1, _.MemOp:$src2,
u8imm:$cc),
!strconcat("vpcmp", Suffix,
"\t{$cc, $src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2, $cc}"),
[], IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K;
}
}
multiclass avx512_icmp_cc_rmb<bits<8> opc, string Suffix, SDNode OpNode,
X86VectorVTInfo _> :
avx512_icmp_cc<opc, Suffix, OpNode, _> {
def rmib : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.ScalarMemOp:$src2,
AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{${src2}", _.BroadcastStr, ", $src1, $dst|",
"$dst, $src1, ${src2}", _.BroadcastStr, "}"),
[(set _.KRC:$dst, (OpNode (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2)),
imm:$cc))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_B;
def rmibk : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1,
_.ScalarMemOp:$src2, AVX512ICC:$cc),
!strconcat("vpcmp${cc}", Suffix,
"\t{${src2}", _.BroadcastStr, ", $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, ${src2}", _.BroadcastStr, "}"),
[(set _.KRC:$dst, (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2)),
imm:$cc)))],
IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K, EVEX_B;
// Accept explicit immediate argument form instead of comparison code.
let isAsmParserOnly = 1, hasSideEffects = 0, mayLoad = 1 in {
def rmib_alt : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.RC:$src1, _.ScalarMemOp:$src2,
u8imm:$cc),
!strconcat("vpcmp", Suffix,
"\t{$cc, ${src2}", _.BroadcastStr, ", $src1, $dst|",
"$dst, $src1, ${src2}", _.BroadcastStr, ", $cc}"),
[], IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_B;
def rmibk_alt : AVX512AIi8<opc, MRMSrcMem,
(outs _.KRC:$dst), (ins _.KRCWM:$mask, _.RC:$src1,
_.ScalarMemOp:$src2, u8imm:$cc),
!strconcat("vpcmp", Suffix,
"\t{$cc, ${src2}", _.BroadcastStr, ", $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, ${src2}", _.BroadcastStr, ", $cc}"),
[], IIC_SSE_ALU_F32P_RM>, EVEX_4V, EVEX_K, EVEX_B;
}
}
multiclass avx512_icmp_cc_vl<bits<8> opc, string Suffix, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_icmp_cc<opc, Suffix, OpNode, VTInfo.info512>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_icmp_cc<opc, Suffix, OpNode, VTInfo.info256>, EVEX_V256;
defm Z128 : avx512_icmp_cc<opc, Suffix, OpNode, VTInfo.info128>, EVEX_V128;
}
}
multiclass avx512_icmp_cc_rmb_vl<bits<8> opc, string Suffix, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_icmp_cc_rmb<opc, Suffix, OpNode, VTInfo.info512>,
EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_icmp_cc_rmb<opc, Suffix, OpNode, VTInfo.info256>,
EVEX_V256;
defm Z128 : avx512_icmp_cc_rmb<opc, Suffix, OpNode, VTInfo.info128>,
EVEX_V128;
}
}
defm VPCMPB : avx512_icmp_cc_vl<0x3F, "b", X86cmpm, avx512vl_i8_info,
HasBWI>, EVEX_CD8<8, CD8VF>;
defm VPCMPUB : avx512_icmp_cc_vl<0x3E, "ub", X86cmpmu, avx512vl_i8_info,
HasBWI>, EVEX_CD8<8, CD8VF>;
defm VPCMPW : avx512_icmp_cc_vl<0x3F, "w", X86cmpm, avx512vl_i16_info,
HasBWI>, VEX_W, EVEX_CD8<16, CD8VF>;
defm VPCMPUW : avx512_icmp_cc_vl<0x3E, "uw", X86cmpmu, avx512vl_i16_info,
HasBWI>, VEX_W, EVEX_CD8<16, CD8VF>;
defm VPCMPD : avx512_icmp_cc_rmb_vl<0x1F, "d", X86cmpm, avx512vl_i32_info,
HasAVX512>, EVEX_CD8<32, CD8VF>;
defm VPCMPUD : avx512_icmp_cc_rmb_vl<0x1E, "ud", X86cmpmu, avx512vl_i32_info,
HasAVX512>, EVEX_CD8<32, CD8VF>;
defm VPCMPQ : avx512_icmp_cc_rmb_vl<0x1F, "q", X86cmpm, avx512vl_i64_info,
HasAVX512>, VEX_W, EVEX_CD8<64, CD8VF>;
defm VPCMPUQ : avx512_icmp_cc_rmb_vl<0x1E, "uq", X86cmpmu, avx512vl_i64_info,
HasAVX512>, VEX_W, EVEX_CD8<64, CD8VF>;
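// Note (illustrative only): for VPCMP/VPCMPU the immediate selects the
// predicate (0=EQ, 1=LT, 2=LE, 4=NE, 5=NLT, 6=NLE), and the assembler also
// accepts mnemonic aliases, e.g. in AT&T syntax:
//   vpcmpled %zmm1, %zmm0, %k0   ; equivalent to vpcmpd $2, %zmm1, %zmm0, %k0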
multiclass avx512_icmp_cc_packed_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
SDNode OpNode, string InstrStr,
list<Predicate> Preds> {
let Predicates = Preds in {
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rri) _.RC:$src1,
_.RC:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmi) _.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rrik) _.KRCWM:$mask,
_.RC:$src1,
_.RC:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and (_.KVT _.KRCWM:$mask),
(_.KVT (OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert
(_.LdFrag addr:$src2))),
imm:$cc)))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmik) _.KRCWM:$mask,
_.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
}
}
multiclass avx512_icmp_cc_packed_rmb_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
SDNode OpNode, string InstrStr,
list<Predicate> Preds>
: avx512_icmp_cc_packed_lowering<_, NewInf, OpNode, InstrStr, Preds> {
let Predicates = Preds in {
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (OpNode (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2)),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmib) _.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (and (_.KVT _.KRCWM:$mask),
(_.KVT (OpNode (_.VT _.RC:$src1),
(X86VBroadcast
(_.ScalarLdFrag addr:$src2)),
imm:$cc)))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmibk) _.KRCWM:$mask,
_.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
}
}
// VPCMPB - i8
defm : avx512_icmp_cc_packed_lowering<v16i8x_info, v32i1_info, X86cmpm,
"VPCMPBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i8x_info, v64i1_info, X86cmpm,
"VPCMPBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v32i8x_info, v64i1_info, X86cmpm,
"VPCMPBZ256", [HasBWI, HasVLX]>;
// VPCMPW - i16
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v16i1_info, X86cmpm,
"VPCMPWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v32i1_info, X86cmpm,
"VPCMPWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v64i1_info, X86cmpm,
"VPCMPWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i16x_info, v32i1_info, X86cmpm,
"VPCMPWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i16x_info, v64i1_info, X86cmpm,
"VPCMPWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v32i16_info, v64i1_info, X86cmpm,
"VPCMPWZ", [HasBWI]>;
// VPCMPD - i32
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v8i1_info, X86cmpm,
"VPCMPDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v16i1_info, X86cmpm,
"VPCMPDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v32i1_info, X86cmpm,
"VPCMPDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v64i1_info, X86cmpm,
"VPCMPDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v16i1_info, X86cmpm,
"VPCMPDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v32i1_info, X86cmpm,
"VPCMPDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v64i1_info, X86cmpm,
"VPCMPDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v16i32_info, v32i1_info, X86cmpm,
"VPCMPDZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v16i32_info, v64i1_info, X86cmpm,
"VPCMPDZ", [HasAVX512]>;
// VPCMPQ - i64
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v4i1_info, X86cmpm,
"VPCMPQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v8i1_info, X86cmpm,
"VPCMPQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v16i1_info, X86cmpm,
"VPCMPQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v32i1_info, X86cmpm,
"VPCMPQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v64i1_info, X86cmpm,
"VPCMPQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v8i1_info, X86cmpm,
"VPCMPQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v16i1_info, X86cmpm,
"VPCMPQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v32i1_info, X86cmpm,
"VPCMPQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v64i1_info, X86cmpm,
"VPCMPQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v16i1_info, X86cmpm,
"VPCMPQZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v32i1_info, X86cmpm,
"VPCMPQZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v64i1_info, X86cmpm,
"VPCMPQZ", [HasAVX512]>;
// VPCMPUB - i8
defm : avx512_icmp_cc_packed_lowering<v16i8x_info, v32i1_info, X86cmpmu,
"VPCMPUBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i8x_info, v64i1_info, X86cmpmu,
"VPCMPUBZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v32i8x_info, v64i1_info, X86cmpmu,
"VPCMPUBZ256", [HasBWI, HasVLX]>;
// VPCMPUW - i16
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v16i1_info, X86cmpmu,
"VPCMPUWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v32i1_info, X86cmpmu,
"VPCMPUWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v8i16x_info, v64i1_info, X86cmpmu,
"VPCMPUWZ128", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i16x_info, v32i1_info, X86cmpmu,
"VPCMPUWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v16i16x_info, v64i1_info, X86cmpmu,
"VPCMPUWZ256", [HasBWI, HasVLX]>;
defm : avx512_icmp_cc_packed_lowering<v32i16_info, v64i1_info, X86cmpmu,
"VPCMPUWZ", [HasBWI]>;
// VPCMPUD - i32
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v8i1_info, X86cmpmu,
"VPCMPUDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v16i1_info, X86cmpmu,
"VPCMPUDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v32i1_info, X86cmpmu,
"VPCMPUDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i32x_info, v64i1_info, X86cmpmu,
"VPCMPUDZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v16i1_info, X86cmpmu,
"VPCMPUDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v32i1_info, X86cmpmu,
"VPCMPUDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i32x_info, v64i1_info, X86cmpmu,
"VPCMPUDZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v16i32_info, v32i1_info, X86cmpmu,
"VPCMPUDZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v16i32_info, v64i1_info, X86cmpmu,
"VPCMPUDZ", [HasAVX512]>;
// VPCMPUQ - i64
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v4i1_info, X86cmpmu,
"VPCMPUQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v8i1_info, X86cmpmu,
"VPCMPUQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v16i1_info, X86cmpmu,
"VPCMPUQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v32i1_info, X86cmpmu,
"VPCMPUQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v2i64x_info, v64i1_info, X86cmpmu,
"VPCMPUQZ128", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v8i1_info, X86cmpmu,
"VPCMPUQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v16i1_info, X86cmpmu,
"VPCMPUQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v32i1_info, X86cmpmu,
"VPCMPUQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v4i64x_info, v64i1_info, X86cmpmu,
"VPCMPUQZ256", [HasAVX512, HasVLX]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v16i1_info, X86cmpmu,
"VPCMPUQZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v32i1_info, X86cmpmu,
"VPCMPUQZ", [HasAVX512]>;
defm : avx512_icmp_cc_packed_rmb_lowering<v8i64_info, v64i1_info, X86cmpmu,
"VPCMPUQZ", [HasAVX512]>;
multiclass avx512_vcmp_common<X86VectorVTInfo _> {
defm rri : AVX512_maskable_cmp<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst), (ins _.RC:$src1, _.RC:$src2,AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"$src2, $src1", "$src1, $src2",
(X86cmpm (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc), 1>;
defm rmi : AVX512_maskable_cmp<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),(ins _.RC:$src1, _.MemOp:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"$src2, $src1", "$src1, $src2",
(X86cmpm (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
imm:$cc)>;
defm rmbi : AVX512_maskable_cmp<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(X86cmpm (_.VT _.RC:$src1),
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src2))),
imm:$cc)>,EVEX_B;
// Accept explicit immediate argument form instead of comparison code.
let isAsmParserOnly = 1, hasSideEffects = 0 in {
defm rri_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, $src2, $src1", "$src1, $src2, $cc">;
let mayLoad = 1 in {
defm rmi_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.MemOp:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, $src2, $src1", "$src1, $src2, $cc">;
defm rmbi_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcMem, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, ${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr##", $cc">,EVEX_B;
}
}
}
multiclass avx512_vcmp_sae<X86VectorVTInfo _> {
// Comparison code form (VCMP[EQ/LT/LE/...]).
defm rrib : AVX512_maskable_cmp<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),(ins _.RC:$src1, _.RC:$src2, AVXCC:$cc),
"vcmp${cc}"#_.Suffix,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(X86cmpmRnd (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc,
(i32 FROUND_NO_EXC))>, EVEX_B;
let isAsmParserOnly = 1, hasSideEffects = 0 in {
defm rrib_alt : AVX512_maskable_cmp_alt<0xC2, MRMSrcReg, _,
(outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2, u8imm:$cc),
"vcmp"#_.Suffix,
"$cc, {sae}, $src2, $src1",
"$src1, $src2, {sae}, $cc">, EVEX_B;
}
}
multiclass avx512_vcmp<AVX512VLVectorVTInfo _> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcmp_common<_.info512>,
avx512_vcmp_sae<_.info512>, EVEX_V512;
}
let Predicates = [HasAVX512,HasVLX] in {
defm Z128 : avx512_vcmp_common<_.info128>, EVEX_V128;
defm Z256 : avx512_vcmp_common<_.info256>, EVEX_V256;
}
}
defm VCMPPD : avx512_vcmp<avx512vl_f64_info>,
AVX512PDIi8Base, EVEX_4V, EVEX_CD8<64, CD8VF>, VEX_W;
defm VCMPPS : avx512_vcmp<avx512vl_f32_info>,
AVX512PSIi8Base, EVEX_4V, EVEX_CD8<32, CD8VF>;
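// Note (illustrative only, AT&T syntax): packed FP compares, including the
// SAE form that suppresses exceptions on 512-bit vectors:
//   vcmpleps %zmm1, %zmm0, %k0            ; k0[i] = (zmm0[i] <= zmm1[i])
//   vcmpps $2, {sae}, %zmm1, %zmm0, %k0   ; same predicate, {sae} form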
multiclass avx512_fcmp_cc_packed_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
string InstrStr, list<Predicate> Preds> {
let Predicates = Preds in {
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (X86cmpm (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rri) _.RC:$src1,
_.RC:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (X86cmpm (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmi) _.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (X86cmpm (_.VT _.RC:$src1),
(X86VBroadcast (_.ScalarLdFrag addr:$src2)),
imm:$cc)),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rmbi) _.RC:$src1,
addr:$src2,
imm:$cc),
NewInf.KRC)>;
}
}
multiclass avx512_fcmp_cc_packed_sae_lowering<X86VectorVTInfo _, X86KVectorVTInfo NewInf,
string InstrStr, list<Predicate> Preds>
: avx512_fcmp_cc_packed_lowering<_, NewInf, InstrStr, Preds> {
let Predicates = Preds in
def : Pat<(insert_subvector (NewInf.KVT immAllZerosV),
(_.KVT (X86cmpmRnd (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
imm:$cc,
(i32 FROUND_NO_EXC))),
(i64 0)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr##rrib) _.RC:$src1,
_.RC:$src2,
imm:$cc),
NewInf.KRC)>;
}
// VCMPPS - f32
defm : avx512_fcmp_cc_packed_lowering<v4f32x_info, v8i1_info, "VCMPPSZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f32x_info, v16i1_info, "VCMPPSZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f32x_info, v32i1_info, "VCMPPSZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f32x_info, v64i1_info, "VCMPPSZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v8f32x_info, v16i1_info, "VCMPPSZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v8f32x_info, v32i1_info, "VCMPPSZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v8f32x_info, v64i1_info, "VCMPPSZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_sae_lowering<v16f32_info, v32i1_info, "VCMPPSZ",
[HasAVX512]>;
defm : avx512_fcmp_cc_packed_sae_lowering<v16f32_info, v64i1_info, "VCMPPSZ",
[HasAVX512]>;
// VCMPPD - f64
defm : avx512_fcmp_cc_packed_lowering<v2f64x_info, v4i1_info, "VCMPPDZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v2f64x_info, v8i1_info, "VCMPPDZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v2f64x_info, v16i1_info, "VCMPPDZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v2f64x_info, v32i1_info, "VCMPPDZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v2f64x_info, v64i1_info, "VCMPPDZ128",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f64x_info, v8i1_info, "VCMPPDZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f64x_info, v16i1_info, "VCMPPDZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f64x_info, v32i1_info, "VCMPPDZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_lowering<v4f64x_info, v64i1_info, "VCMPPDZ256",
[HasAVX512, HasVLX]>;
defm : avx512_fcmp_cc_packed_sae_lowering<v8f64_info, v16i1_info, "VCMPPDZ",
[HasAVX512]>;
defm : avx512_fcmp_cc_packed_sae_lowering<v8f64_info, v32i1_info, "VCMPPDZ",
[HasAVX512]>;
defm : avx512_fcmp_cc_packed_sae_lowering<v8f64_info, v64i1_info, "VCMPPDZ",
[HasAVX512]>;
//===----------------------------------------------------------------------===//
// FPClass
//===----------------------------------------------------------------------===//
// Handle the scalar fpclass instruction forms:
//   mask = op(reg_scalar, imm)
//   mask = op(mem_scalar, imm)
multiclass avx512_scalar_fpclass<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, Predicate prd> {
let Predicates = [prd] in {
def rr : AVX512<opc, MRMSrcReg, (outs _.KRC:$dst),
(ins _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.KRC:$dst,(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2)))], NoItinerary>;
def rrk : AVX512<opc, MRMSrcReg, (outs _.KRC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix#
"\t{$src2, $src1, $dst {${mask}}|$dst {${mask}}, $src1, $src2}",
[(set _.KRC:$dst,(or _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2))))], NoItinerary>, EVEX_K;
def rm : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.MemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##
"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.KRC:$dst,
(OpNode (_.VT (bitconvert (_.LdFrag addr:$src1))),
(i32 imm:$src2)))], NoItinerary>;
def rmk : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.KRCWM:$mask, _.MemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##
"\t{$src2, $src1, $dst {${mask}}|$dst {${mask}}, $src1, $src2}",
[(set _.KRC:$dst,(or _.KRCWM:$mask,
(OpNode (_.VT (bitconvert (_.LdFrag addr:$src1))),
(i32 imm:$src2))))], NoItinerary>, EVEX_K;
}
}
// Handle the vector fpclass instruction forms:
//   mask = fpclass(reg_vec, reg_vec, imm)
//   mask = fpclass(reg_vec, mem_vec, imm)
//   mask = fpclass(reg_vec, broadcast(eltVt), imm)
multiclass avx512_vector_fpclass<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string mem, string broadcast>{
def rr : AVX512<opc, MRMSrcReg, (outs _.KRC:$dst),
(ins _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.KRC:$dst,(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2)))], NoItinerary>;
def rrk : AVX512<opc, MRMSrcReg, (outs _.KRC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix#
"\t{$src2, $src1, $dst {${mask}}|$dst {${mask}}, $src1, $src2}",
[(set _.KRC:$dst,(or _.KRCWM:$mask,
(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2))))], NoItinerary>, EVEX_K;
def rm : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.MemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##mem#
"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.KRC:$dst,(OpNode
(_.VT (bitconvert (_.LdFrag addr:$src1))),
(i32 imm:$src2)))], NoItinerary>;
def rmk : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.KRCWM:$mask, _.MemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##mem#
"\t{$src2, $src1, $dst {${mask}}|$dst {${mask}}, $src1, $src2}",
[(set _.KRC:$dst, (or _.KRCWM:$mask, (OpNode
(_.VT (bitconvert (_.LdFrag addr:$src1))),
(i32 imm:$src2))))], NoItinerary>, EVEX_K;
def rmb : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.ScalarMemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##broadcast##"\t{$src2, ${src1}"##
_.BroadcastStr##", $dst|$dst, ${src1}"
##_.BroadcastStr##", $src2}",
[(set _.KRC:$dst,(OpNode
(_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src1))),
(i32 imm:$src2)))], NoItinerary>,EVEX_B;
def rmbk : AVX512<opc, MRMSrcMem, (outs _.KRC:$dst),
(ins _.KRCWM:$mask, _.ScalarMemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix##broadcast##"\t{$src2, ${src1}"##
_.BroadcastStr##", $dst {${mask}}|$dst {${mask}}, ${src1}"##
_.BroadcastStr##", $src2}",
[(set _.KRC:$dst,(or _.KRCWM:$mask, (OpNode
(_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src1))),
(i32 imm:$src2))))], NoItinerary>,
EVEX_B, EVEX_K;
}
multiclass avx512_vector_fpclass_all<string OpcodeStr,
AVX512VLVectorVTInfo _, bits<8> opc, SDNode OpNode, Predicate prd,
string broadcast>{
let Predicates = [prd] in {
defm Z : avx512_vector_fpclass<opc, OpcodeStr, OpNode, _.info512, "{z}",
broadcast>, EVEX_V512;
}
let Predicates = [prd, HasVLX] in {
defm Z128 : avx512_vector_fpclass<opc, OpcodeStr, OpNode, _.info128, "{x}",
broadcast>, EVEX_V128;
defm Z256 : avx512_vector_fpclass<opc, OpcodeStr, OpNode, _.info256, "{y}",
broadcast>, EVEX_V256;
}
}
multiclass avx512_fp_fpclass_all<string OpcodeStr, bits<8> opcVec,
bits<8> opcScalar, SDNode VecOpNode, SDNode ScalarOpNode, Predicate prd>{
defm PS : avx512_vector_fpclass_all<OpcodeStr, avx512vl_f32_info, opcVec,
VecOpNode, prd, "{l}">, EVEX_CD8<32, CD8VF>;
defm PD : avx512_vector_fpclass_all<OpcodeStr, avx512vl_f64_info, opcVec,
VecOpNode, prd, "{q}">, EVEX_CD8<64, CD8VF>, VEX_W;
defm SS : avx512_scalar_fpclass<opcScalar, OpcodeStr, ScalarOpNode,
f32x_info, prd>, EVEX_CD8<32, CD8VT1>;
defm SD : avx512_scalar_fpclass<opcScalar, OpcodeStr, ScalarOpNode,
f64x_info, prd>, EVEX_CD8<64, CD8VT1>, VEX_W;
}
defm VFPCLASS : avx512_fp_fpclass_all<"vfpclass", 0x66, 0x67, X86Vfpclass,
X86Vfpclasss, HasDQI>, AVX512AIi8Base,EVEX;
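// Note (illustrative only, AT&T syntax): the fpclass immediate is a bitmask
// of FP categories (QNaN, +0, -0, +Inf, -Inf, denormal, negative, SNaN):
//   vfpclassps $0x18, %zmm0, %k0   ; k0[i] set if zmm0[i] is +Inf or -Inf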
//===----------------------------------------------------------------------===//
// Mask register copy, including
// - copy between mask registers
// - load/store mask registers
// - copy from GPR to mask register and vice versa
//
multiclass avx512_mask_mov<bits<8> opc_kk, bits<8> opc_km, bits<8> opc_mk,
string OpcodeStr, RegisterClass KRC,
ValueType vvt, X86MemOperand x86memop> {
let hasSideEffects = 0 in
def kk : I<opc_kk, MRMSrcReg, (outs KRC:$dst), (ins KRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"), []>;
def km : I<opc_km, MRMSrcMem, (outs KRC:$dst), (ins x86memop:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(set KRC:$dst, (vvt (load addr:$src)))]>;
def mk : I<opc_mk, MRMDestMem, (outs), (ins x86memop:$dst, KRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(store KRC:$src, addr:$dst)]>;
}
multiclass avx512_mask_mov_gpr<bits<8> opc_kr, bits<8> opc_rk,
string OpcodeStr,
RegisterClass KRC, RegisterClass GRC> {
let hasSideEffects = 0 in {
def kr : I<opc_kr, MRMSrcReg, (outs KRC:$dst), (ins GRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"), []>;
def rk : I<opc_rk, MRMSrcReg, (outs GRC:$dst), (ins KRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"), []>;
}
}
let Predicates = [HasDQI] in
defm KMOVB : avx512_mask_mov<0x90, 0x90, 0x91, "kmovb", VK8, v8i1, i8mem>,
avx512_mask_mov_gpr<0x92, 0x93, "kmovb", VK8, GR32>,
VEX, PD;
let Predicates = [HasAVX512] in
defm KMOVW : avx512_mask_mov<0x90, 0x90, 0x91, "kmovw", VK16, v16i1, i16mem>,
avx512_mask_mov_gpr<0x92, 0x93, "kmovw", VK16, GR32>,
VEX, PS;
let Predicates = [HasBWI] in {
defm KMOVD : avx512_mask_mov<0x90, 0x90, 0x91, "kmovd", VK32, v32i1,i32mem>,
VEX, PD, VEX_W;
defm KMOVD : avx512_mask_mov_gpr<0x92, 0x93, "kmovd", VK32, GR32>,
VEX, XD;
defm KMOVQ : avx512_mask_mov<0x90, 0x90, 0x91, "kmovq", VK64, v64i1, i64mem>,
VEX, PS, VEX_W;
defm KMOVQ : avx512_mask_mov_gpr<0x92, 0x93, "kmovq", VK64, GR64>,
VEX, XD, VEX_W;
}
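// Note (illustrative only, AT&T syntax): the KMOV forms above cover all three
// kinds of mask copies:
//   kmovw %k1, %k2      ; mask -> mask
//   kmovw (%rdi), %k1   ; load mask
//   kmovw %k1, (%rdi)   ; store mask
//   kmovw %eax, %k1     ; GR32 -> mask
//   kmovw %k1, %eax     ; mask -> GR32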
// GR from/to mask register
def : Pat<(v16i1 (bitconvert (i16 GR16:$src))),
(COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), GR16:$src, sub_16bit)), VK16)>;
def : Pat<(i16 (bitconvert (v16i1 VK16:$src))),
(EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK16:$src, GR32)), sub_16bit)>;
def : Pat<(v8i1 (bitconvert (i8 GR8:$src))),
(COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), GR8:$src, sub_8bit)), VK8)>;
def : Pat<(i8 (bitconvert (v8i1 VK8:$src))),
(EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK8:$src, GR32)), sub_8bit)>;
def : Pat<(i32 (zext (i16 (bitconvert (v16i1 VK16:$src))))),
(KMOVWrk VK16:$src)>;
def : Pat<(i32 (anyext (i16 (bitconvert (v16i1 VK16:$src))))),
(COPY_TO_REGCLASS VK16:$src, GR32)>;
def : Pat<(i32 (zext (i8 (bitconvert (v8i1 VK8:$src))))),
(MOVZX32rr8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK8:$src, GR32)), sub_8bit))>, Requires<[NoDQI]>;
def : Pat<(i32 (zext (i8 (bitconvert (v8i1 VK8:$src))))),
(KMOVBrk VK8:$src)>, Requires<[HasDQI]>;
def : Pat<(i32 (anyext (i8 (bitconvert (v8i1 VK8:$src))))),
(COPY_TO_REGCLASS VK8:$src, GR32)>;
def : Pat<(v32i1 (bitconvert (i32 GR32:$src))),
(COPY_TO_REGCLASS GR32:$src, VK32)>;
def : Pat<(i32 (bitconvert (v32i1 VK32:$src))),
(COPY_TO_REGCLASS VK32:$src, GR32)>;
def : Pat<(v64i1 (bitconvert (i64 GR64:$src))),
(COPY_TO_REGCLASS GR64:$src, VK64)>;
def : Pat<(i64 (bitconvert (v64i1 VK64:$src))),
(COPY_TO_REGCLASS VK64:$src, GR64)>;
// Load/store kreg
let Predicates = [HasDQI] in {
def : Pat<(store (i8 (bitconvert (v8i1 VK8:$src))), addr:$dst),
(KMOVBmk addr:$dst, VK8:$src)>;
def : Pat<(v8i1 (bitconvert (i8 (load addr:$src)))),
(KMOVBkm addr:$src)>;
def : Pat<(store VK4:$src, addr:$dst),
(KMOVBmk addr:$dst, (COPY_TO_REGCLASS VK4:$src, VK8))>;
def : Pat<(store VK2:$src, addr:$dst),
(KMOVBmk addr:$dst, (COPY_TO_REGCLASS VK2:$src, VK8))>;
def : Pat<(store VK1:$src, addr:$dst),
(KMOVBmk addr:$dst, (COPY_TO_REGCLASS VK1:$src, VK8))>;
def : Pat<(v2i1 (load addr:$src)),
(COPY_TO_REGCLASS (KMOVBkm addr:$src), VK2)>;
def : Pat<(v4i1 (load addr:$src)),
(COPY_TO_REGCLASS (KMOVBkm addr:$src), VK4)>;
}
let Predicates = [HasAVX512, NoDQI] in {
def : Pat<(store VK1:$src, addr:$dst),
(MOV8mr addr:$dst,
(i8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK1:$src, GR32)),
sub_8bit)))>;
def : Pat<(store VK2:$src, addr:$dst),
(MOV8mr addr:$dst,
(i8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK2:$src, GR32)),
sub_8bit)))>;
def : Pat<(store VK4:$src, addr:$dst),
(MOV8mr addr:$dst,
(i8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK4:$src, GR32)),
sub_8bit)))>;
def : Pat<(store VK8:$src, addr:$dst),
(MOV8mr addr:$dst,
(i8 (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS VK8:$src, GR32)),
sub_8bit)))>;
def : Pat<(v8i1 (load addr:$src)),
(COPY_TO_REGCLASS (MOVZX32rm8 addr:$src), VK8)>;
def : Pat<(v2i1 (load addr:$src)),
(COPY_TO_REGCLASS (MOVZX32rm8 addr:$src), VK2)>;
def : Pat<(v4i1 (load addr:$src)),
(COPY_TO_REGCLASS (MOVZX32rm8 addr:$src), VK4)>;
}
let Predicates = [HasAVX512] in {
def : Pat<(store (i16 (bitconvert (v16i1 VK16:$src))), addr:$dst),
(KMOVWmk addr:$dst, VK16:$src)>;
def : Pat<(v1i1 (load addr:$src)),
(COPY_TO_REGCLASS (AND32ri8 (MOVZX32rm8 addr:$src), (i32 1)), VK1)>;
def : Pat<(v16i1 (bitconvert (i16 (load addr:$src)))),
(KMOVWkm addr:$src)>;
}
let Predicates = [HasBWI] in {
def : Pat<(store (i32 (bitconvert (v32i1 VK32:$src))), addr:$dst),
(KMOVDmk addr:$dst, VK32:$src)>;
def : Pat<(v32i1 (bitconvert (i32 (load addr:$src)))),
(KMOVDkm addr:$src)>;
def : Pat<(store (i64 (bitconvert (v64i1 VK64:$src))), addr:$dst),
(KMOVQmk addr:$dst, VK64:$src)>;
def : Pat<(v64i1 (bitconvert (i64 (load addr:$src)))),
(KMOVQkm addr:$src)>;
}
let Predicates = [HasAVX512] in {
multiclass operation_gpr_mask_copy_lowering<RegisterClass maskRC, ValueType maskVT> {
def : Pat<(maskVT (scalar_to_vector GR32:$src)),
(COPY_TO_REGCLASS GR32:$src, maskRC)>;
def : Pat<(i32 (X86Vextract maskRC:$src, (iPTR 0))),
(COPY_TO_REGCLASS maskRC:$src, GR32)>;
def : Pat<(maskVT (scalar_to_vector GR8:$src)),
(COPY_TO_REGCLASS (INSERT_SUBREG (i32 (IMPLICIT_DEF)), GR8:$src, sub_8bit), maskRC)>;
def : Pat<(i8 (X86Vextract maskRC:$src, (iPTR 0))),
(EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS maskRC:$src, GR32)), sub_8bit)>;
def : Pat<(i32 (anyext (i8 (X86Vextract maskRC:$src, (iPTR 0))))),
(COPY_TO_REGCLASS maskRC:$src, GR32)>;
}
defm : operation_gpr_mask_copy_lowering<VK1, v1i1>;
defm : operation_gpr_mask_copy_lowering<VK2, v2i1>;
defm : operation_gpr_mask_copy_lowering<VK4, v4i1>;
defm : operation_gpr_mask_copy_lowering<VK8, v8i1>;
defm : operation_gpr_mask_copy_lowering<VK16, v16i1>;
defm : operation_gpr_mask_copy_lowering<VK32, v32i1>;
defm : operation_gpr_mask_copy_lowering<VK64, v64i1>;
def : Pat<(X86kshiftr (X86kshiftl (v1i1 (scalar_to_vector GR8:$src)), (i8 15)), (i8 15)) ,
(COPY_TO_REGCLASS
(KMOVWkr (AND32ri8 (INSERT_SUBREG (i32 (IMPLICIT_DEF)),
GR8:$src, sub_8bit), (i32 1))), VK1)>;
def : Pat<(X86kshiftr (X86kshiftl (v16i1 (scalar_to_vector GR8:$src)), (i8 15)), (i8 15)) ,
(COPY_TO_REGCLASS
(KMOVWkr (AND32ri8 (INSERT_SUBREG (i32 (IMPLICIT_DEF)),
GR8:$src, sub_8bit), (i32 1))), VK16)>;
def : Pat<(X86kshiftr (X86kshiftl (v8i1 (scalar_to_vector GR8:$src)), (i8 15)), (i8 15)) ,
(COPY_TO_REGCLASS
(KMOVWkr (AND32ri8 (INSERT_SUBREG (i32 (IMPLICIT_DEF)),
GR8:$src, sub_8bit), (i32 1))), VK8)>;
}
// Mask unary operation
// - KNOT
multiclass avx512_mask_unop<bits<8> opc, string OpcodeStr,
RegisterClass KRC, SDPatternOperator OpNode,
Predicate prd> {
let Predicates = [prd] in
def rr : I<opc, MRMSrcReg, (outs KRC:$dst), (ins KRC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(set KRC:$dst, (OpNode KRC:$src))]>;
}
multiclass avx512_mask_unop_all<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode> {
defm B : avx512_mask_unop<opc, !strconcat(OpcodeStr, "b"), VK8, OpNode,
HasDQI>, VEX, PD;
defm W : avx512_mask_unop<opc, !strconcat(OpcodeStr, "w"), VK16, OpNode,
HasAVX512>, VEX, PS;
defm D : avx512_mask_unop<opc, !strconcat(OpcodeStr, "d"), VK32, OpNode,
HasBWI>, VEX, PD, VEX_W;
defm Q : avx512_mask_unop<opc, !strconcat(OpcodeStr, "q"), VK64, OpNode,
HasBWI>, VEX, PS, VEX_W;
}
defm KNOT : avx512_mask_unop_all<0x44, "knot", vnot>;
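// Note (illustrative only, AT&T syntax): knotw %k1, %k2 computes k2 = ~k1
// across all 16 mask bits.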
// KNL does not support KMOVB, so the 8-bit mask is promoted to 16-bit.
let Predicates = [HasAVX512, NoDQI] in
def : Pat<(vnot VK8:$src),
(COPY_TO_REGCLASS (KNOTWrr (COPY_TO_REGCLASS VK8:$src, VK16)), VK8)>;
def : Pat<(vnot VK4:$src),
(COPY_TO_REGCLASS (KNOTWrr (COPY_TO_REGCLASS VK4:$src, VK16)), VK4)>;
def : Pat<(vnot VK2:$src),
(COPY_TO_REGCLASS (KNOTWrr (COPY_TO_REGCLASS VK2:$src, VK16)), VK2)>;
// Mask binary operation
// - KAND, KANDN, KOR, KXNOR, KXOR
multiclass avx512_mask_binop<bits<8> opc, string OpcodeStr,
RegisterClass KRC, SDPatternOperator OpNode,
Predicate prd, bit IsCommutable> {
let Predicates = [prd], isCommutable = IsCommutable in
def rr : I<opc, MRMSrcReg, (outs KRC:$dst), (ins KRC:$src1, KRC:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set KRC:$dst, (OpNode KRC:$src1, KRC:$src2))]>;
}
multiclass avx512_mask_binop_all<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, bit IsCommutable,
Predicate prdW = HasAVX512> {
defm B : avx512_mask_binop<opc, !strconcat(OpcodeStr, "b"), VK8, OpNode,
HasDQI, IsCommutable>, VEX_4V, VEX_L, PD;
defm W : avx512_mask_binop<opc, !strconcat(OpcodeStr, "w"), VK16, OpNode,
prdW, IsCommutable>, VEX_4V, VEX_L, PS;
defm D : avx512_mask_binop<opc, !strconcat(OpcodeStr, "d"), VK32, OpNode,
HasBWI, IsCommutable>, VEX_4V, VEX_L, VEX_W, PD;
defm Q : avx512_mask_binop<opc, !strconcat(OpcodeStr, "q"), VK64, OpNode,
HasBWI, IsCommutable>, VEX_4V, VEX_L, VEX_W, PS;
}
def andn : PatFrag<(ops node:$i0, node:$i1), (and (not node:$i0), node:$i1)>;
def xnor : PatFrag<(ops node:$i0, node:$i1), (not (xor node:$i0, node:$i1))>;
// These nodes use 'vnot' instead of 'not' to support vectors.
def vandn : PatFrag<(ops node:$i0, node:$i1), (and (vnot node:$i0), node:$i1)>;
def vxnor : PatFrag<(ops node:$i0, node:$i1), (vnot (xor node:$i0, node:$i1))>;
defm KAND : avx512_mask_binop_all<0x41, "kand", and, 1>;
defm KOR : avx512_mask_binop_all<0x45, "kor", or, 1>;
defm KXNOR : avx512_mask_binop_all<0x46, "kxnor", vxnor, 1>;
defm KXOR : avx512_mask_binop_all<0x47, "kxor", xor, 1>;
defm KANDN : avx512_mask_binop_all<0x42, "kandn", vandn, 0>;
defm KADD : avx512_mask_binop_all<0x4A, "kadd", add, 1, HasDQI>;
multiclass avx512_binop_pat<SDPatternOperator VOpNode, SDPatternOperator OpNode,
Instruction Inst> {
// With AVX512F, an 8-bit mask is promoted to a 16-bit mask; with the DQI
// extension, this type is legal and the KxxxB instructions are used.
let Predicates = [NoDQI] in
def : Pat<(VOpNode VK8:$src1, VK8:$src2),
(COPY_TO_REGCLASS
(Inst (COPY_TO_REGCLASS VK8:$src1, VK16),
(COPY_TO_REGCLASS VK8:$src2, VK16)), VK8)>;
// All types smaller than 8 bits require conversion anyway
def : Pat<(OpNode VK1:$src1, VK1:$src2),
(COPY_TO_REGCLASS (Inst
(COPY_TO_REGCLASS VK1:$src1, VK16),
(COPY_TO_REGCLASS VK1:$src2, VK16)), VK1)>;
def : Pat<(VOpNode VK2:$src1, VK2:$src2),
(COPY_TO_REGCLASS (Inst
(COPY_TO_REGCLASS VK2:$src1, VK16),
(COPY_TO_REGCLASS VK2:$src2, VK16)), VK2)>;
def : Pat<(VOpNode VK4:$src1, VK4:$src2),
(COPY_TO_REGCLASS (Inst
(COPY_TO_REGCLASS VK4:$src1, VK16),
(COPY_TO_REGCLASS VK4:$src2, VK16)), VK4)>;
}
defm : avx512_binop_pat<and, and, KANDWrr>;
defm : avx512_binop_pat<vandn, andn, KANDNWrr>;
defm : avx512_binop_pat<or, or, KORWrr>;
defm : avx512_binop_pat<vxnor, xnor, KXNORWrr>;
defm : avx512_binop_pat<xor, xor, KXORWrr>;
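// Sketch of what these defms produce: without DQI, '(and VK8:$a, VK8:$b)' is
// matched through VK16, so the selected code is a single
//   kandw %k2, %k1, %k0
// with the 16-bit result simply reinterpreted as an 8-bit mask via
// COPY_TO_REGCLASS.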
// Mask unpacking
multiclass avx512_mask_unpck<string Suffix,RegisterClass KRC, ValueType VT,
RegisterClass KRCSrc, Predicate prd> {
let Predicates = [prd] in {
let hasSideEffects = 0 in
def rr : I<0x4b, MRMSrcReg, (outs KRC:$dst),
(ins KRC:$src1, KRC:$src2),
"kunpck"#Suffix#"\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>,
VEX_4V, VEX_L;
def : Pat<(VT (concat_vectors KRCSrc:$src1, KRCSrc:$src2)),
(!cast<Instruction>(NAME##rr)
(COPY_TO_REGCLASS KRCSrc:$src2, KRC),
(COPY_TO_REGCLASS KRCSrc:$src1, KRC))>;
}
}
defm KUNPCKBW : avx512_mask_unpck<"bw", VK16, v16i1, VK8, HasAVX512>, PD;
defm KUNPCKWD : avx512_mask_unpck<"wd", VK32, v32i1, VK16, HasBWI>, PS;
defm KUNPCKDQ : avx512_mask_unpck<"dq", VK64, v64i1, VK32, HasBWI>, PS, VEX_W;
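// Note the operand swap in the pattern above: concat_vectors places $src1 in
// the low lanes, while kunpck* takes its low half from its second source
// operand. Illustratively (AT&T syntax):
//   kunpckbw %k1, %k2, %k0    ; k0[7:0] = k1[7:0], k0[15:8] = k2[7:0]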
// Mask bit testing
multiclass avx512_mask_testop<bits<8> opc, string OpcodeStr, RegisterClass KRC,
SDNode OpNode, Predicate prd> {
let Predicates = [prd], Defs = [EFLAGS] in
def rr : I<opc, MRMSrcReg, (outs), (ins KRC:$src1, KRC:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1|$src1, $src2}"),
[(set EFLAGS, (OpNode KRC:$src1, KRC:$src2))]>;
}
multiclass avx512_mask_testop_w<bits<8> opc, string OpcodeStr, SDNode OpNode,
Predicate prdW = HasAVX512> {
defm B : avx512_mask_testop<opc, OpcodeStr#"b", VK8, OpNode, HasDQI>,
VEX, PD;
defm W : avx512_mask_testop<opc, OpcodeStr#"w", VK16, OpNode, prdW>,
VEX, PS;
defm Q : avx512_mask_testop<opc, OpcodeStr#"q", VK64, OpNode, HasBWI>,
VEX, PS, VEX_W;
defm D : avx512_mask_testop<opc, OpcodeStr#"d", VK32, OpNode, HasBWI>,
VEX, PD, VEX_W;
}
defm KORTEST : avx512_mask_testop_w<0x98, "kortest", X86kortest>;
defm KTEST : avx512_mask_testop_w<0x99, "ktest", X86ktest, HasDQI>;
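// Sketch: these instructions only set EFLAGS; e.g. 'kortestw %k1, %k2' sets
// ZF when (k1 | k2) is all zeros (and CF when it is all ones), which is what
// the X86kortest/X86ktest nodes consume when branching on mask results.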
// Mask shift
multiclass avx512_mask_shiftop<bits<8> opc, string OpcodeStr, RegisterClass KRC,
SDNode OpNode> {
let Predicates = [HasAVX512] in
def ri : Ii8<opc, MRMSrcReg, (outs KRC:$dst), (ins KRC:$src, u8imm:$imm),
!strconcat(OpcodeStr,
"\t{$imm, $src, $dst|$dst, $src, $imm}"),
[(set KRC:$dst, (OpNode KRC:$src, (i8 imm:$imm)))]>;
}
multiclass avx512_mask_shiftop_w<bits<8> opc1, bits<8> opc2, string OpcodeStr,
SDNode OpNode> {
defm W : avx512_mask_shiftop<opc1, !strconcat(OpcodeStr, "w"), VK16, OpNode>,
VEX, TAPD, VEX_W;
let Predicates = [HasDQI] in
defm B : avx512_mask_shiftop<opc1, !strconcat(OpcodeStr, "b"), VK8, OpNode>,
VEX, TAPD;
let Predicates = [HasBWI] in {
defm Q : avx512_mask_shiftop<opc2, !strconcat(OpcodeStr, "q"), VK64, OpNode>,
VEX, TAPD, VEX_W;
defm D : avx512_mask_shiftop<opc2, !strconcat(OpcodeStr, "d"), VK32, OpNode>,
VEX, TAPD;
}
}
defm KSHIFTL : avx512_mask_shiftop_w<0x32, 0x33, "kshiftl", X86kshiftl>;
defm KSHIFTR : avx512_mask_shiftop_w<0x30, 0x31, "kshiftr", X86kshiftr>;
multiclass avx512_icmp_packed_no_vlx_lowering<SDNode OpNode, string InstStr> {
def : Pat<(v8i1 (OpNode (v8i32 VR256X:$src1), (v8i32 VR256X:$src2))),
(COPY_TO_REGCLASS (!cast<Instruction>(InstStr##Zrr)
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))), VK8)>;
def : Pat<(insert_subvector (v16i1 immAllZerosV),
(v8i1 (OpNode (v8i32 VR256X:$src1), (v8i32 VR256X:$src2))),
(i64 0)),
(KSHIFTRWri (KSHIFTLWri (!cast<Instruction>(InstStr##Zrr)
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
(i8 8)), (i8 8))>;
def : Pat<(insert_subvector (v16i1 immAllZerosV),
(v8i1 (and VK8:$mask,
(OpNode (v8i32 VR256X:$src1), (v8i32 VR256X:$src2)))),
(i64 0)),
(KSHIFTRWri (KSHIFTLWri (!cast<Instruction>(InstStr##Zrrk)
(COPY_TO_REGCLASS VK8:$mask, VK16),
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
(i8 8)), (i8 8))>;
}
multiclass avx512_icmp_packed_cc_no_vlx_lowering<SDNode OpNode, string InstStr,
AVX512VLVectorVTInfo _> {
def : Pat<(v8i1 (OpNode (_.info256.VT VR256X:$src1), (_.info256.VT VR256X:$src2), imm:$cc)),
(COPY_TO_REGCLASS (!cast<Instruction>(InstStr##Zrri)
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm)),
imm:$cc), VK8)>;
def : Pat<(insert_subvector (v16i1 immAllZerosV),
(v8i1 (OpNode (_.info256.VT VR256X:$src1), (_.info256.VT VR256X:$src2), imm:$cc)),
(i64 0)),
(KSHIFTRWri (KSHIFTLWri (!cast<Instruction>(InstStr##Zrri)
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm)),
imm:$cc),
(i8 8)), (i8 8))>;
def : Pat<(insert_subvector (v16i1 immAllZerosV),
(v8i1 (and VK8:$mask,
(OpNode (_.info256.VT VR256X:$src1), (_.info256.VT VR256X:$src2), imm:$cc))),
(i64 0)),
(KSHIFTRWri (KSHIFTLWri (!cast<Instruction>(InstStr##Zrrik)
(COPY_TO_REGCLASS VK8:$mask, VK16),
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(_.info512.VT (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm)),
imm:$cc),
(i8 8)), (i8 8))>;
}
let Predicates = [HasAVX512, NoVLX] in {
defm : avx512_icmp_packed_no_vlx_lowering<X86pcmpgtm, "VPCMPGTD">;
defm : avx512_icmp_packed_no_vlx_lowering<X86pcmpeqm, "VPCMPEQD">;
defm : avx512_icmp_packed_cc_no_vlx_lowering<X86cmpm, "VCMPPS", avx512vl_f32_info>;
defm : avx512_icmp_packed_cc_no_vlx_lowering<X86cmpm, "VPCMPD", avx512vl_i32_info>;
defm : avx512_icmp_packed_cc_no_vlx_lowering<X86cmpmu, "VPCMPUD", avx512vl_i32_info>;
}
// Mask setting all 0s or 1s
multiclass avx512_mask_setop<RegisterClass KRC, ValueType VT, PatFrag Val> {
let Predicates = [HasAVX512] in
let isReMaterializable = 1, isAsCheapAsAMove = 1, isPseudo = 1 in
def #NAME# : I<0, Pseudo, (outs KRC:$dst), (ins), "",
[(set KRC:$dst, (VT Val))]>;
}
multiclass avx512_mask_setop_w<PatFrag Val> {
defm W : avx512_mask_setop<VK16, v16i1, Val>;
defm D : avx512_mask_setop<VK32, v32i1, Val>;
defm Q : avx512_mask_setop<VK64, v64i1, Val>;
}
defm KSET0 : avx512_mask_setop_w<immAllZerosV>;
defm KSET1 : avx512_mask_setop_w<immAllOnesV>;
// With base AVX-512 only, an 8-bit mask is promoted to a 16-bit mask.
let Predicates = [HasAVX512] in {
def : Pat<(v8i1 immAllZerosV), (COPY_TO_REGCLASS (KSET0W), VK8)>;
def : Pat<(v4i1 immAllZerosV), (COPY_TO_REGCLASS (KSET0W), VK4)>;
def : Pat<(v2i1 immAllZerosV), (COPY_TO_REGCLASS (KSET0W), VK2)>;
def : Pat<(v1i1 immAllZerosV), (COPY_TO_REGCLASS (KSET0W), VK1)>;
def : Pat<(v8i1 immAllOnesV), (COPY_TO_REGCLASS (KSET1W), VK8)>;
def : Pat<(v4i1 immAllOnesV), (COPY_TO_REGCLASS (KSET1W), VK4)>;
def : Pat<(v2i1 immAllOnesV), (COPY_TO_REGCLASS (KSET1W), VK2)>;
def : Pat<(v1i1 immAllOnesV), (COPY_TO_REGCLASS (KSET1W), VK1)>;
}
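// Sketch of the expansion: KSET0W/KSET1W are pseudos that are expanded after
// register allocation to an idiom such as 'kxorw %k0, %k0, %k0' or
// 'kxnorw %k0, %k0, %k0'; the COPY_TO_REGCLASS above then merely reinterprets
// the 16-bit result as the narrower mask type.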
// Patterns for kmask insert_subvector/extract_subvector to/from index=0
multiclass operation_subvector_mask_lowering<RegisterClass subRC, ValueType subVT,
RegisterClass RC, ValueType VT> {
def : Pat<(subVT (extract_subvector (VT RC:$src), (iPTR 0))),
(subVT (COPY_TO_REGCLASS RC:$src, subRC))>;
def : Pat<(VT (insert_subvector undef, subRC:$src, (iPTR 0))),
(VT (COPY_TO_REGCLASS subRC:$src, RC))>;
}
defm : operation_subvector_mask_lowering<VK1, v1i1, VK2, v2i1>;
defm : operation_subvector_mask_lowering<VK1, v1i1, VK4, v4i1>;
defm : operation_subvector_mask_lowering<VK1, v1i1, VK8, v8i1>;
defm : operation_subvector_mask_lowering<VK1, v1i1, VK16, v16i1>;
defm : operation_subvector_mask_lowering<VK1, v1i1, VK32, v32i1>;
defm : operation_subvector_mask_lowering<VK1, v1i1, VK64, v64i1>;
defm : operation_subvector_mask_lowering<VK2, v2i1, VK4, v4i1>;
defm : operation_subvector_mask_lowering<VK2, v2i1, VK8, v8i1>;
defm : operation_subvector_mask_lowering<VK2, v2i1, VK16, v16i1>;
defm : operation_subvector_mask_lowering<VK2, v2i1, VK32, v32i1>;
defm : operation_subvector_mask_lowering<VK2, v2i1, VK64, v64i1>;
defm : operation_subvector_mask_lowering<VK4, v4i1, VK8, v8i1>;
defm : operation_subvector_mask_lowering<VK4, v4i1, VK16, v16i1>;
defm : operation_subvector_mask_lowering<VK4, v4i1, VK32, v32i1>;
defm : operation_subvector_mask_lowering<VK4, v4i1, VK64, v64i1>;
defm : operation_subvector_mask_lowering<VK8, v8i1, VK16, v16i1>;
defm : operation_subvector_mask_lowering<VK8, v8i1, VK32, v32i1>;
defm : operation_subvector_mask_lowering<VK8, v8i1, VK64, v64i1>;
defm : operation_subvector_mask_lowering<VK16, v16i1, VK32, v32i1>;
defm : operation_subvector_mask_lowering<VK16, v16i1, VK64, v64i1>;
defm : operation_subvector_mask_lowering<VK32, v32i1, VK64, v64i1>;
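// Sketch: for index-0 moves between mask types no instruction is needed at
// all; e.g. extracting a v1i1 from a v16i1 is just a COPY_TO_REGCLASS, a
// register-class reinterpretation that the register allocator resolves to the
// same physical k-register.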
def : Pat<(v2i1 (extract_subvector (v4i1 VK4:$src), (iPTR 2))),
(v2i1 (COPY_TO_REGCLASS
(KSHIFTRWri (COPY_TO_REGCLASS VK4:$src, VK16), (i8 2)),
VK2))>;
def : Pat<(v4i1 (extract_subvector (v8i1 VK8:$src), (iPTR 4))),
(v4i1 (COPY_TO_REGCLASS
(KSHIFTRWri (COPY_TO_REGCLASS VK8:$src, VK16), (i8 4)),
VK4))>;
def : Pat<(v8i1 (extract_subvector (v16i1 VK16:$src), (iPTR 8))),
(v8i1 (COPY_TO_REGCLASS (KSHIFTRWri VK16:$src, (i8 8)), VK8))>;
def : Pat<(v16i1 (extract_subvector (v32i1 VK32:$src), (iPTR 16))),
(v16i1 (COPY_TO_REGCLASS (KSHIFTRDri VK32:$src, (i8 16)), VK16))>;
def : Pat<(v32i1 (extract_subvector (v64i1 VK64:$src), (iPTR 32))),
(v32i1 (COPY_TO_REGCLASS (KSHIFTRQri VK64:$src, (i8 32)), VK32))>;
// Patterns for kmask shift
multiclass mask_shift_lowering<RegisterClass RC, ValueType VT> {
def : Pat<(VT (X86kshiftl RC:$src, (i8 imm:$imm))),
(VT (COPY_TO_REGCLASS
(KSHIFTLWri (COPY_TO_REGCLASS RC:$src, VK16),
(I8Imm $imm)),
RC))>;
def : Pat<(VT (X86kshiftr RC:$src, (i8 imm:$imm))),
(VT (COPY_TO_REGCLASS
(KSHIFTRWri (COPY_TO_REGCLASS RC:$src, VK16),
(I8Imm $imm)),
RC))>;
}
defm : mask_shift_lowering<VK8, v8i1>, Requires<[HasAVX512, NoDQI]>;
defm : mask_shift_lowering<VK4, v4i1>, Requires<[HasAVX512]>;
defm : mask_shift_lowering<VK2, v2i1>, Requires<[HasAVX512]>;
//===----------------------------------------------------------------------===//
// AVX-512 - Aligned and unaligned load and store
//
multiclass avx512_load<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
PatFrag ld_frag, PatFrag mload,
SDPatternOperator SelectOprr = vselect> {
let hasSideEffects = 0 in {
def rr : AVX512PI<opc, MRMSrcReg, (outs _.RC:$dst), (ins _.RC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"), [],
_.ExeDomain>, EVEX;
def rrkz : AVX512PI<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src),
!strconcat(OpcodeStr, "\t{$src, ${dst} {${mask}} {z}|",
"${dst} {${mask}} {z}, $src}"),
[(set _.RC:$dst, (_.VT (SelectOprr _.KRCWM:$mask,
(_.VT _.RC:$src),
_.ImmAllZerosV)))], _.ExeDomain>,
EVEX, EVEX_KZ;
let canFoldAsLoad = 1, isReMaterializable = 1,
SchedRW = [WriteLoad] in
def rm : AVX512PI<opc, MRMSrcMem, (outs _.RC:$dst), (ins _.MemOp:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(set _.RC:$dst, (_.VT (bitconvert (ld_frag addr:$src))))],
_.ExeDomain>, EVEX;
let Constraints = "$src0 = $dst", isConvertibleToThreeAddress = 1 in {
def rrk : AVX512PI<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src0, _.KRCWM:$mask, _.RC:$src1),
!strconcat(OpcodeStr, "\t{$src1, ${dst} {${mask}}|",
"${dst} {${mask}}, $src1}"),
[(set _.RC:$dst, (_.VT (SelectOprr _.KRCWM:$mask,
(_.VT _.RC:$src1),
(_.VT _.RC:$src0))))], _.ExeDomain>,
EVEX, EVEX_K;
let SchedRW = [WriteLoad] in
def rmk : AVX512PI<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src0, _.KRCWM:$mask, _.MemOp:$src1),
!strconcat(OpcodeStr, "\t{$src1, ${dst} {${mask}}|",
"${dst} {${mask}}, $src1}"),
[(set _.RC:$dst, (_.VT
(vselect _.KRCWM:$mask,
(_.VT (bitconvert (ld_frag addr:$src1))),
(_.VT _.RC:$src0))))], _.ExeDomain>, EVEX, EVEX_K;
}
let SchedRW = [WriteLoad] in
def rmkz : AVX512PI<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.MemOp:$src),
OpcodeStr #"\t{$src, ${dst} {${mask}} {z}|"#
"${dst} {${mask}} {z}, $src}",
[(set _.RC:$dst, (_.VT (vselect _.KRCWM:$mask,
(_.VT (bitconvert (ld_frag addr:$src))), _.ImmAllZerosV)))],
_.ExeDomain>, EVEX, EVEX_KZ;
}
def : Pat<(_.VT (mload addr:$ptr, _.KRCWM:$mask, undef)),
(!cast<Instruction>(NAME#_.ZSuffix##rmkz) _.KRCWM:$mask, addr:$ptr)>;
def : Pat<(_.VT (mload addr:$ptr, _.KRCWM:$mask, _.ImmAllZerosV)),
(!cast<Instruction>(NAME#_.ZSuffix##rmkz) _.KRCWM:$mask, addr:$ptr)>;
def : Pat<(_.VT (mload addr:$ptr, _.KRCWM:$mask, (_.VT _.RC:$src0))),
(!cast<Instruction>(NAME#_.ZSuffix##rmk) _.RC:$src0,
_.KRCWM:$mask, addr:$ptr)>;
}
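// Sketch of the variants each avx512_load instantiation provides, using
// VMOVAPSZ as an illustrative prefix: 'rr' (register move), 'rm' (load),
// 'rrk'/'rmk' (merge-masked, e.g. 'vmovaps (%rdi), %zmm0 {%k1}') and
// 'rrkz'/'rmkz' (zero-masked, e.g. 'vmovaps (%rdi), %zmm0 {%k1} {z}'), plus
// the patterns above that fold masked_load nodes into the masked forms.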
multiclass avx512_alignedload_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _,
Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_load<opc, OpcodeStr, _.info512, _.info512.AlignedLdFrag,
masked_load_aligned512>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_load<opc, OpcodeStr, _.info256, _.info256.AlignedLdFrag,
masked_load_aligned256>, EVEX_V256;
defm Z128 : avx512_load<opc, OpcodeStr, _.info128, _.info128.AlignedLdFrag,
masked_load_aligned128>, EVEX_V128;
}
}
multiclass avx512_load_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _,
Predicate prd,
SDPatternOperator SelectOprr = vselect> {
let Predicates = [prd] in
defm Z : avx512_load<opc, OpcodeStr, _.info512, _.info512.LdFrag,
masked_load_unaligned, SelectOprr>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_load<opc, OpcodeStr, _.info256, _.info256.LdFrag,
masked_load_unaligned, SelectOprr>, EVEX_V256;
defm Z128 : avx512_load<opc, OpcodeStr, _.info128, _.info128.LdFrag,
masked_load_unaligned, SelectOprr>, EVEX_V128;
}
}
multiclass avx512_store<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
PatFrag st_frag, PatFrag mstore, string Name> {
let hasSideEffects = 0 in {
def rr_REV : AVX512PI<opc, MRMDestReg, (outs _.RC:$dst), (ins _.RC:$src),
OpcodeStr # ".s\t{$src, $dst|$dst, $src}",
[], _.ExeDomain>, EVEX, FoldGenData<Name#rr>;
def rrk_REV : AVX512PI<opc, MRMDestReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src),
OpcodeStr # ".s\t{$src, ${dst} {${mask}}|"#
"${dst} {${mask}}, $src}",
[], _.ExeDomain>, EVEX, EVEX_K, FoldGenData<Name#rrk>;
def rrkz_REV : AVX512PI<opc, MRMDestReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src),
OpcodeStr # ".s\t{$src, ${dst} {${mask}} {z}|" #
"${dst} {${mask}} {z}, $src}",
[], _.ExeDomain>, EVEX, EVEX_KZ, FoldGenData<Name#rrkz>;
}
def mr : AVX512PI<opc, MRMDestMem, (outs), (ins _.MemOp:$dst, _.RC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(st_frag (_.VT _.RC:$src), addr:$dst)], _.ExeDomain>, EVEX;
def mrk : AVX512PI<opc, MRMDestMem, (outs),
(ins _.MemOp:$dst, _.KRCWM:$mask, _.RC:$src),
OpcodeStr # "\t{$src, ${dst} {${mask}}|${dst} {${mask}}, $src}",
[], _.ExeDomain>, EVEX, EVEX_K;
def: Pat<(mstore addr:$ptr, _.KRCWM:$mask, (_.VT _.RC:$src)),
(!cast<Instruction>(NAME#_.ZSuffix##mrk) addr:$ptr,
_.KRCWM:$mask, _.RC:$src)>;
}
multiclass avx512_store_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _, Predicate prd,
string Name> {
let Predicates = [prd] in
defm Z : avx512_store<opc, OpcodeStr, _.info512, store,
masked_store_unaligned, Name#Z>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_store<opc, OpcodeStr, _.info256, store,
masked_store_unaligned, Name#Z256>, EVEX_V256;
defm Z128 : avx512_store<opc, OpcodeStr, _.info128, store,
masked_store_unaligned, Name#Z128>, EVEX_V128;
}
}
multiclass avx512_alignedstore_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo _, Predicate prd,
string Name> {
let Predicates = [prd] in
defm Z : avx512_store<opc, OpcodeStr, _.info512, alignedstore512,
masked_store_aligned512, Name#Z>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_store<opc, OpcodeStr, _.info256, alignedstore256,
masked_store_aligned256, Name#Z256>, EVEX_V256;
defm Z128 : avx512_store<opc, OpcodeStr, _.info128, alignedstore,
masked_store_aligned128, Name#Z128>, EVEX_V128;
}
}
defm VMOVAPS : avx512_alignedload_vl<0x28, "vmovaps", avx512vl_f32_info,
HasAVX512>,
avx512_alignedstore_vl<0x29, "vmovaps", avx512vl_f32_info,
HasAVX512, "VMOVAPS">,
PS, EVEX_CD8<32, CD8VF>;
defm VMOVAPD : avx512_alignedload_vl<0x28, "vmovapd", avx512vl_f64_info,
HasAVX512>,
avx512_alignedstore_vl<0x29, "vmovapd", avx512vl_f64_info,
HasAVX512, "VMOVAPD">,
PD, VEX_W, EVEX_CD8<64, CD8VF>;
defm VMOVUPS : avx512_load_vl<0x10, "vmovups", avx512vl_f32_info, HasAVX512,
null_frag>,
avx512_store_vl<0x11, "vmovups", avx512vl_f32_info, HasAVX512,
"VMOVUPS">,
PS, EVEX_CD8<32, CD8VF>;
defm VMOVUPD : avx512_load_vl<0x10, "vmovupd", avx512vl_f64_info, HasAVX512,
null_frag>,
avx512_store_vl<0x11, "vmovupd", avx512vl_f64_info, HasAVX512,
"VMOVUPD">,
PD, VEX_W, EVEX_CD8<64, CD8VF>;
defm VMOVDQA32 : avx512_alignedload_vl<0x6F, "vmovdqa32", avx512vl_i32_info,
HasAVX512>,
avx512_alignedstore_vl<0x7F, "vmovdqa32", avx512vl_i32_info,
HasAVX512, "VMOVDQA32">,
PD, EVEX_CD8<32, CD8VF>;
defm VMOVDQA64 : avx512_alignedload_vl<0x6F, "vmovdqa64", avx512vl_i64_info,
HasAVX512>,
avx512_alignedstore_vl<0x7F, "vmovdqa64", avx512vl_i64_info,
HasAVX512, "VMOVDQA64">,
PD, VEX_W, EVEX_CD8<64, CD8VF>;
defm VMOVDQU8 : avx512_load_vl<0x6F, "vmovdqu8", avx512vl_i8_info, HasBWI>,
avx512_store_vl<0x7F, "vmovdqu8", avx512vl_i8_info,
HasBWI, "VMOVDQU8">,
XD, EVEX_CD8<8, CD8VF>;
defm VMOVDQU16 : avx512_load_vl<0x6F, "vmovdqu16", avx512vl_i16_info, HasBWI>,
avx512_store_vl<0x7F, "vmovdqu16", avx512vl_i16_info,
HasBWI, "VMOVDQU16">,
XD, VEX_W, EVEX_CD8<16, CD8VF>;
defm VMOVDQU32 : avx512_load_vl<0x6F, "vmovdqu32", avx512vl_i32_info, HasAVX512,
null_frag>,
avx512_store_vl<0x7F, "vmovdqu32", avx512vl_i32_info,
HasAVX512, "VMOVDQU32">,
XS, EVEX_CD8<32, CD8VF>;
defm VMOVDQU64 : avx512_load_vl<0x6F, "vmovdqu64", avx512vl_i64_info, HasAVX512,
null_frag>,
avx512_store_vl<0x7F, "vmovdqu64", avx512vl_i64_info,
HasAVX512, "VMOVDQU64">,
XS, VEX_W, EVEX_CD8<64, CD8VF>;
// Special instructions to help with spilling when we don't have VLX. We need
// to load from or store to a ZMM register instead. These are converted in
// expandPostRAPseudos.
let isReMaterializable = 1, canFoldAsLoad = 1,
isPseudo = 1, SchedRW = [WriteLoad], mayLoad = 1, hasSideEffects = 0 in {
def VMOVAPSZ128rm_NOVLX : I<0, Pseudo, (outs VR128X:$dst), (ins f128mem:$src),
"", []>;
def VMOVAPSZ256rm_NOVLX : I<0, Pseudo, (outs VR256X:$dst), (ins f256mem:$src),
"", []>;
def VMOVUPSZ128rm_NOVLX : I<0, Pseudo, (outs VR128X:$dst), (ins f128mem:$src),
"", []>;
def VMOVUPSZ256rm_NOVLX : I<0, Pseudo, (outs VR256X:$dst), (ins f256mem:$src),
"", []>;
}
let isPseudo = 1, mayStore = 1, hasSideEffects = 0 in {
def VMOVAPSZ128mr_NOVLX : I<0, Pseudo, (outs), (ins f128mem:$dst, VR128X:$src),
"", []>;
def VMOVAPSZ256mr_NOVLX : I<0, Pseudo, (outs), (ins f256mem:$dst, VR256X:$src),
"", []>;
def VMOVUPSZ128mr_NOVLX : I<0, Pseudo, (outs), (ins f128mem:$dst, VR128X:$src),
"", []>;
def VMOVUPSZ256mr_NOVLX : I<0, Pseudo, (outs), (ins f256mem:$dst, VR256X:$src),
"", []>;
}
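// Illustrative expansion (a sketch of the expandPostRAPseudos handling noted
// above): a reload such as
//   %xmm16 = VMOVAPSZ128rm_NOVLX <fi#0>
// cannot use a VEX-encoded vmovaps (there is no xmm16-31 there), so it is
// rewritten to load the full zmm super-register, roughly
//   %zmm16 = VMOVAPSZrm <fi#0>
// with the value then used through its xmm sub-register.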
def : Pat<(v8i64 (vselect VK8WM:$mask, (bc_v8i64 (v16i32 immAllZerosV)),
(v8i64 VR512:$src))),
(VMOVDQA64Zrrkz (COPY_TO_REGCLASS (KNOTWrr (COPY_TO_REGCLASS VK8:$mask, VK16)),
VK8), VR512:$src)>;
def : Pat<(v16i32 (vselect VK16WM:$mask, (v16i32 immAllZerosV),
(v16i32 VR512:$src))),
(VMOVDQA32Zrrkz (KNOTWrr VK16WM:$mask), VR512:$src)>;
// These patterns exist to prevent the above patterns from introducing a second
// mask inversion when one already exists.
def : Pat<(v8i64 (vselect (xor VK8:$mask, (v8i1 immAllOnesV)),
(bc_v8i64 (v16i32 immAllZerosV)),
(v8i64 VR512:$src))),
(VMOVDQA64Zrrkz VK8:$mask, VR512:$src)>;
def : Pat<(v16i32 (vselect (xor VK16:$mask, (v16i1 immAllOnesV)),
(v16i32 immAllZerosV),
(v16i32 VR512:$src))),
(VMOVDQA32Zrrkz VK16WM:$mask, VR512:$src)>;
// Patterns for handling v8i1 selects of 256-bit vectors when VLX isn't
// available. Use a 512-bit operation and extract.
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v8f32 (vselect (v8i1 VK8WM:$mask), (v8f32 VR256X:$src1),
(v8f32 VR256X:$src0))),
(EXTRACT_SUBREG
(v16f32
(VMOVAPSZrrk
(v16f32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src0, sub_ymm)),
(COPY_TO_REGCLASS VK8WM:$mask, VK16WM),
(v16f32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)))),
sub_ymm)>;
def : Pat<(v8i32 (vselect (v8i1 VK8WM:$mask), (v8i32 VR256X:$src1),
(v8i32 VR256X:$src0))),
(EXTRACT_SUBREG
(v16i32
(VMOVDQA32Zrrk
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src0, sub_ymm)),
(COPY_TO_REGCLASS VK8WM:$mask, VK16WM),
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)))),
sub_ymm)>;
}
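// Sketch of the resulting code for the v8f32 case above: with the operands
// widened to their zmm super-registers, the select becomes a single 512-bit
// masked move, roughly
//   vmovaps %zmm1, %zmm0 {%k1}
// whose low ymm is then taken via EXTRACT_SUBREG.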
let Predicates = [HasVLX, NoBWI] in {
// 128-bit load/store without BWI.
def : Pat<(alignedstore (v8i16 VR128X:$src), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, VR128X:$src)>;
def : Pat<(alignedstore (v16i8 VR128X:$src), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, VR128X:$src)>;
def : Pat<(store (v8i16 VR128X:$src), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, VR128X:$src)>;
def : Pat<(store (v16i8 VR128X:$src), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, VR128X:$src)>;
// 256-bit load/store without BWI.
def : Pat<(alignedstore256 (v16i16 VR256X:$src), addr:$dst),
(VMOVDQA32Z256mr addr:$dst, VR256X:$src)>;
def : Pat<(alignedstore256 (v32i8 VR256X:$src), addr:$dst),
(VMOVDQA32Z256mr addr:$dst, VR256X:$src)>;
def : Pat<(store (v16i16 VR256X:$src), addr:$dst),
(VMOVDQU32Z256mr addr:$dst, VR256X:$src)>;
def : Pat<(store (v32i8 VR256X:$src), addr:$dst),
(VMOVDQU32Z256mr addr:$dst, VR256X:$src)>;
}
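// Sketch: plain (unmasked) byte/word stores are element-size agnostic, so
// without BWI they can simply be emitted with the dword opcodes above; e.g. a
// v8i16 store becomes 'vmovdqu32 %xmm0, (%rdi)'.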
let Predicates = [HasVLX] in {
// Special patterns for storing subvector extracts of lower 128-bits of 256.
// It's cheaper to just use VMOVAPS/VMOVUPS instead of VEXTRACTF128mr
def : Pat<(alignedstore (v2f64 (extract_subvector
(v4f64 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVAPDZ128mr addr:$dst, (v2f64 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(alignedstore (v4f32 (extract_subvector
(v8f32 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVAPSZ128mr addr:$dst, (v4f32 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(alignedstore (v2i64 (extract_subvector
(v4i64 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQA64Z128mr addr:$dst, (v2i64 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(alignedstore (v4i32 (extract_subvector
(v8i32 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v4i32 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(alignedstore (v8i16 (extract_subvector
(v16i16 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v8i16 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(alignedstore (v16i8 (extract_subvector
(v32i8 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v16i8 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v2f64 (extract_subvector
(v4f64 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVUPDZ128mr addr:$dst, (v2f64 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v4f32 (extract_subvector
(v8f32 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVUPSZ128mr addr:$dst, (v4f32 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v2i64 (extract_subvector
(v4i64 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQU64Z128mr addr:$dst, (v2i64 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v4i32 (extract_subvector
(v8i32 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v4i32 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v8i16 (extract_subvector
(v16i16 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v8i16 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
def : Pat<(store (v16i8 (extract_subvector
(v32i8 VR256X:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v16i8 (EXTRACT_SUBREG VR256X:$src,sub_xmm)))>;
// Special patterns for storing subvector extracts of lower 128-bits of 512.
// It's cheaper to just use VMOVAPS/VMOVUPS instead of VEXTRACTF128mr
def : Pat<(alignedstore (v2f64 (extract_subvector
(v8f64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVAPDZ128mr addr:$dst, (v2f64 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(alignedstore (v4f32 (extract_subvector
(v16f32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVAPSZ128mr addr:$dst, (v4f32 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(alignedstore (v2i64 (extract_subvector
(v8i64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA64Z128mr addr:$dst, (v2i64 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(alignedstore (v4i32 (extract_subvector
(v16i32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v4i32 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(alignedstore (v8i16 (extract_subvector
(v32i16 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v8i16 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(alignedstore (v16i8 (extract_subvector
(v64i8 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z128mr addr:$dst, (v16i8 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v2f64 (extract_subvector
(v8f64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVUPDZ128mr addr:$dst, (v2f64 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v4f32 (extract_subvector
(v16f32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVUPSZ128mr addr:$dst, (v4f32 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v2i64 (extract_subvector
(v8i64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU64Z128mr addr:$dst, (v2i64 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v4i32 (extract_subvector
(v16i32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v4i32 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v8i16 (extract_subvector
(v32i16 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v8i16 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
def : Pat<(store (v16i8 (extract_subvector
(v64i8 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z128mr addr:$dst, (v16i8 (EXTRACT_SUBREG VR512:$src,sub_xmm)))>;
// Special patterns for storing subvector extracts of lower 256-bits of 512.
// It's cheaper to just use VMOVAPS/VMOVUPS instead of VEXTRACTF128mr
def : Pat<(alignedstore256 (v4f64 (extract_subvector
(v8f64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVAPDZ256mr addr:$dst, (v4f64 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
- def : Pat<(alignedstore (v8f32 (extract_subvector
- (v16f32 VR512:$src), (iPTR 0))), addr:$dst),
+ def : Pat<(alignedstore256 (v8f32 (extract_subvector
+ (v16f32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVAPSZ256mr addr:$dst, (v8f32 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(alignedstore256 (v4i64 (extract_subvector
(v8i64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA64Z256mr addr:$dst, (v4i64 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(alignedstore256 (v8i32 (extract_subvector
(v16i32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z256mr addr:$dst, (v8i32 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(alignedstore256 (v16i16 (extract_subvector
(v32i16 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z256mr addr:$dst, (v16i16 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(alignedstore256 (v32i8 (extract_subvector
(v64i8 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQA32Z256mr addr:$dst, (v32i8 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v4f64 (extract_subvector
(v8f64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVUPDZ256mr addr:$dst, (v4f64 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v8f32 (extract_subvector
(v16f32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVUPSZ256mr addr:$dst, (v8f32 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v4i64 (extract_subvector
(v8i64 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU64Z256mr addr:$dst, (v4i64 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v8i32 (extract_subvector
(v16i32 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z256mr addr:$dst, (v8i32 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v16i16 (extract_subvector
(v32i16 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z256mr addr:$dst, (v16i16 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
def : Pat<(store (v32i8 (extract_subvector
(v64i8 VR512:$src), (iPTR 0))), addr:$dst),
(VMOVDQU32Z256mr addr:$dst, (v32i8 (EXTRACT_SUBREG VR512:$src,sub_ymm)))>;
}
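// Sketch: with the patterns above, storing the low subvector of a wider
// register needs no extract instruction at all; e.g.
//   (store (v4f32 (extract_subvector (v16f32 VR512:$src), (iPTR 0))), addr)
// is emitted as a single 'vmovups %xmm0, (%rdi)'.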
// Move Int Doubleword to Packed Double Int
//
let ExeDomain = SSEPackedInt in {
def VMOVDI2PDIZrr : AVX512BI<0x6E, MRMSrcReg, (outs VR128X:$dst), (ins GR32:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set VR128X:$dst,
(v4i32 (scalar_to_vector GR32:$src)))], IIC_SSE_MOVDQ>,
EVEX;
def VMOVDI2PDIZrm : AVX512BI<0x6E, MRMSrcMem, (outs VR128X:$dst), (ins i32mem:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set VR128X:$dst,
(v4i32 (scalar_to_vector (loadi32 addr:$src))))],
IIC_SSE_MOVDQ>, EVEX, EVEX_CD8<32, CD8VT1>;
def VMOV64toPQIZrr : AVX512BI<0x6E, MRMSrcReg, (outs VR128X:$dst), (ins GR64:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set VR128X:$dst,
(v2i64 (scalar_to_vector GR64:$src)))],
IIC_SSE_MOVDQ>, EVEX, VEX_W;
let isCodeGenOnly = 1, ForceDisassemble = 1, hasSideEffects = 0, mayLoad = 1 in
def VMOV64toPQIZrm : AVX512BI<0x6E, MRMSrcMem, (outs VR128X:$dst),
(ins i64mem:$src),
"vmovq\t{$src, $dst|$dst, $src}", []>,
EVEX, VEX_W, EVEX_CD8<64, CD8VT1>;
let isCodeGenOnly = 1 in {
def VMOV64toSDZrr : AVX512BI<0x6E, MRMSrcReg, (outs FR64X:$dst), (ins GR64:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set FR64X:$dst, (bitconvert GR64:$src))],
IIC_SSE_MOVDQ>, EVEX, VEX_W, Sched<[WriteMove]>;
def VMOV64toSDZrm : AVX512XSI<0x7E, MRMSrcMem, (outs FR64X:$dst), (ins i64mem:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set FR64X:$dst, (bitconvert (loadi64 addr:$src)))]>,
EVEX, VEX_W, EVEX_CD8<8, CD8VT8>;
def VMOVSDto64Zrr : AVX512BI<0x7E, MRMDestReg, (outs GR64:$dst), (ins FR64X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set GR64:$dst, (bitconvert FR64X:$src))],
IIC_SSE_MOVDQ>, EVEX, VEX_W, Sched<[WriteMove]>;
def VMOVSDto64Zmr : AVX512BI<0x7E, MRMDestMem, (outs), (ins i64mem:$dst, FR64X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(store (i64 (bitconvert FR64X:$src)), addr:$dst)],
IIC_SSE_MOVDQ>, EVEX, VEX_W, Sched<[WriteStore]>,
EVEX_CD8<64, CD8VT1>;
}
} // ExeDomain = SSEPackedInt
// Move Int Doubleword to Single Scalar
//
let ExeDomain = SSEPackedInt, isCodeGenOnly = 1 in {
def VMOVDI2SSZrr : AVX512BI<0x6E, MRMSrcReg, (outs FR32X:$dst), (ins GR32:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set FR32X:$dst, (bitconvert GR32:$src))],
IIC_SSE_MOVDQ>, EVEX;
def VMOVDI2SSZrm : AVX512BI<0x6E, MRMSrcMem, (outs FR32X:$dst), (ins i32mem:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set FR32X:$dst, (bitconvert (loadi32 addr:$src)))],
IIC_SSE_MOVDQ>, EVEX, EVEX_CD8<32, CD8VT1>;
} // ExeDomain = SSEPackedInt, isCodeGenOnly = 1
// Move doubleword from xmm register to r/m32
//
let ExeDomain = SSEPackedInt in {
def VMOVPDI2DIZrr : AVX512BI<0x7E, MRMDestReg, (outs GR32:$dst), (ins VR128X:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set GR32:$dst, (extractelt (v4i32 VR128X:$src),
(iPTR 0)))], IIC_SSE_MOVD_ToGP>,
EVEX;
def VMOVPDI2DIZmr : AVX512BI<0x7E, MRMDestMem, (outs),
(ins i32mem:$dst, VR128X:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(store (i32 (extractelt (v4i32 VR128X:$src),
(iPTR 0))), addr:$dst)], IIC_SSE_MOVDQ>,
EVEX, EVEX_CD8<32, CD8VT1>;
} // ExeDomain = SSEPackedInt
// Move quadword from xmm1 register to r/m64
//
let ExeDomain = SSEPackedInt in {
def VMOVPQIto64Zrr : I<0x7E, MRMDestReg, (outs GR64:$dst), (ins VR128X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set GR64:$dst, (extractelt (v2i64 VR128X:$src),
(iPTR 0)))],
IIC_SSE_MOVD_ToGP>, PD, EVEX, VEX_W,
Requires<[HasAVX512, In64BitMode]>;
let isCodeGenOnly = 1, ForceDisassemble = 1, hasSideEffects = 0, mayStore = 1 in
def VMOVPQIto64Zmr : I<0x7E, MRMDestMem, (outs), (ins i64mem:$dst, VR128X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[], IIC_SSE_MOVD_ToGP>, PD, EVEX, VEX_W,
Requires<[HasAVX512, In64BitMode]>;
def VMOVPQI2QIZmr : I<0xD6, MRMDestMem, (outs),
(ins i64mem:$dst, VR128X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(store (extractelt (v2i64 VR128X:$src), (iPTR 0)),
addr:$dst)], IIC_SSE_MOVDQ>,
EVEX, PD, VEX_W, EVEX_CD8<64, CD8VT1>,
Sched<[WriteStore]>, Requires<[HasAVX512, In64BitMode]>;
let hasSideEffects = 0 in
def VMOVPQI2QIZrr : AVX512BI<0xD6, MRMDestReg, (outs VR128X:$dst),
(ins VR128X:$src),
"vmovq.s\t{$src, $dst|$dst, $src}",[]>,
EVEX, VEX_W;
} // ExeDomain = SSEPackedInt
// Move Scalar Single to Double Int
//
let ExeDomain = SSEPackedInt, isCodeGenOnly = 1 in {
def VMOVSS2DIZrr : AVX512BI<0x7E, MRMDestReg, (outs GR32:$dst),
(ins FR32X:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(set GR32:$dst, (bitconvert FR32X:$src))],
IIC_SSE_MOVD_ToGP>, EVEX;
def VMOVSS2DIZmr : AVX512BI<0x7E, MRMDestMem, (outs),
(ins i32mem:$dst, FR32X:$src),
"vmovd\t{$src, $dst|$dst, $src}",
[(store (i32 (bitconvert FR32X:$src)), addr:$dst)],
IIC_SSE_MOVDQ>, EVEX, EVEX_CD8<32, CD8VT1>;
} // ExeDomain = SSEPackedInt, isCodeGenOnly = 1
// Move Quadword Int to Packed Quadword Int
//
let ExeDomain = SSEPackedInt in {
def VMOVQI2PQIZrm : AVX512XSI<0x7E, MRMSrcMem, (outs VR128X:$dst),
(ins i64mem:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set VR128X:$dst,
(v2i64 (scalar_to_vector (loadi64 addr:$src))))]>,
EVEX, VEX_W, EVEX_CD8<8, CD8VT8>;
} // ExeDomain = SSEPackedInt
//===----------------------------------------------------------------------===//
// AVX-512 MOVSS, MOVSD
//===----------------------------------------------------------------------===//
multiclass avx512_move_scalar<string asm, SDNode OpNode,
X86VectorVTInfo _> {
def rr : AVX512PI<0x10, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src1, _.FRC:$src2),
!strconcat(asm, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.RC:$dst, (_.VT (OpNode _.RC:$src1,
(scalar_to_vector _.FRC:$src2))))],
_.ExeDomain,IIC_SSE_MOV_S_RR>, EVEX_4V;
def rrkz : AVX512PI<0x10, MRMSrcReg, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.RC:$src1, _.FRC:$src2),
!strconcat(asm, "\t{$src2, $src1, $dst {${mask}} {z}|",
"$dst {${mask}} {z}, $src1, $src2}"),
[(set _.RC:$dst, (_.VT (X86selects _.KRCWM:$mask,
(_.VT (OpNode _.RC:$src1,
(scalar_to_vector _.FRC:$src2))),
_.ImmAllZerosV)))],
_.ExeDomain,IIC_SSE_MOV_S_RR>, EVEX_4V, EVEX_KZ;
let Constraints = "$src0 = $dst" in
def rrk : AVX512PI<0x10, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src0, _.KRCWM:$mask, _.RC:$src1, _.FRC:$src2),
!strconcat(asm, "\t{$src2, $src1, $dst {${mask}}|",
"$dst {${mask}}, $src1, $src2}"),
[(set _.RC:$dst, (_.VT (X86selects _.KRCWM:$mask,
(_.VT (OpNode _.RC:$src1,
(scalar_to_vector _.FRC:$src2))),
(_.VT _.RC:$src0))))],
_.ExeDomain,IIC_SSE_MOV_S_RR>, EVEX_4V, EVEX_K;
let canFoldAsLoad = 1, isReMaterializable = 1 in
def rm : AVX512PI<0x10, MRMSrcMem, (outs _.FRC:$dst), (ins _.ScalarMemOp:$src),
!strconcat(asm, "\t{$src, $dst|$dst, $src}"),
[(set _.FRC:$dst, (_.ScalarLdFrag addr:$src))],
_.ExeDomain, IIC_SSE_MOV_S_RM>, EVEX;
let mayLoad = 1, hasSideEffects = 0 in {
let Constraints = "$src0 = $dst" in
def rmk : AVX512PI<0x10, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src0, _.KRCWM:$mask, _.ScalarMemOp:$src),
!strconcat(asm, "\t{$src, $dst {${mask}}|",
"$dst {${mask}}, $src}"),
[], _.ExeDomain, IIC_SSE_MOV_S_RM>, EVEX, EVEX_K;
def rmkz : AVX512PI<0x10, MRMSrcMem, (outs _.RC:$dst),
(ins _.KRCWM:$mask, _.ScalarMemOp:$src),
!strconcat(asm, "\t{$src, $dst {${mask}} {z}|",
"$dst {${mask}} {z}, $src}"),
[], _.ExeDomain, IIC_SSE_MOV_S_RM>, EVEX, EVEX_KZ;
}
def mr: AVX512PI<0x11, MRMDestMem, (outs), (ins _.ScalarMemOp:$dst, _.FRC:$src),
!strconcat(asm, "\t{$src, $dst|$dst, $src}"),
[(store _.FRC:$src, addr:$dst)], _.ExeDomain, IIC_SSE_MOV_S_MR>,
EVEX;
let mayStore = 1, hasSideEffects = 0 in
def mrk: AVX512PI<0x11, MRMDestMem, (outs),
(ins _.ScalarMemOp:$dst, VK1WM:$mask, _.FRC:$src),
!strconcat(asm, "\t{$src, $dst {${mask}}|$dst {${mask}}, $src}"),
[], _.ExeDomain, IIC_SSE_MOV_S_MR>, EVEX, EVEX_K;
}
defm VMOVSSZ : avx512_move_scalar<"vmovss", X86Movss, f32x_info>,
VEX_LIG, XS, EVEX_CD8<32, CD8VT1>;
defm VMOVSDZ : avx512_move_scalar<"vmovsd", X86Movsd, f64x_info>,
VEX_LIG, XD, VEX_W, EVEX_CD8<64, CD8VT1>;
multiclass avx512_move_scalar_lowering<string InstrStr, SDNode OpNode,
PatLeaf ZeroFP, X86VectorVTInfo _> {
def : Pat<(_.VT (OpNode _.RC:$src0,
(_.VT (scalar_to_vector
(_.EltVT (X86selects (scalar_to_vector (and (i8 (trunc GR32:$mask)), (i8 1))),
(_.EltVT _.FRC:$src1),
(_.EltVT _.FRC:$src2))))))),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr#rrk)
(COPY_TO_REGCLASS _.FRC:$src2, _.RC),
(COPY_TO_REGCLASS GR32:$mask, VK1WM),
(_.VT _.RC:$src0), _.FRC:$src1),
_.RC)>;
def : Pat<(_.VT (OpNode _.RC:$src0,
(_.VT (scalar_to_vector
(_.EltVT (X86selects (scalar_to_vector (and (i8 (trunc GR32:$mask)), (i8 1))),
(_.EltVT _.FRC:$src1),
(_.EltVT ZeroFP))))))),
(COPY_TO_REGCLASS (!cast<Instruction>(InstrStr#rrkz)
(COPY_TO_REGCLASS GR32:$mask, VK1WM),
(_.VT _.RC:$src0), _.FRC:$src1),
_.RC)>;
}
multiclass avx512_store_scalar_lowering<string InstrStr, AVX512VLVectorVTInfo _,
dag Mask, RegisterClass MaskRC> {
def : Pat<(masked_store addr:$dst, Mask,
(_.info512.VT (insert_subvector undef,
(_.info256.VT (insert_subvector undef,
(_.info128.VT _.info128.RC:$src),
(iPTR 0))),
(iPTR 0)))),
(!cast<Instruction>(InstrStr#mrk) addr:$dst,
(COPY_TO_REGCLASS MaskRC:$mask, VK1WM),
(COPY_TO_REGCLASS _.info128.RC:$src, _.info128.FRC))>;
}
multiclass avx512_store_scalar_lowering_subreg<string InstrStr,
AVX512VLVectorVTInfo _,
dag Mask, RegisterClass MaskRC,
SubRegIndex subreg> {
def : Pat<(masked_store addr:$dst, Mask,
(_.info512.VT (insert_subvector undef,
(_.info256.VT (insert_subvector undef,
(_.info128.VT _.info128.RC:$src),
(iPTR 0))),
(iPTR 0)))),
(!cast<Instruction>(InstrStr#mrk) addr:$dst,
(COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), MaskRC:$mask, subreg)), VK1WM),
(COPY_TO_REGCLASS _.info128.RC:$src, _.info128.FRC))>;
}
multiclass avx512_load_scalar_lowering<string InstrStr, AVX512VLVectorVTInfo _,
dag Mask, RegisterClass MaskRC> {
def : Pat<(_.info128.VT (extract_subvector
(_.info512.VT (masked_load addr:$srcAddr, Mask,
(_.info512.VT (bitconvert
(v16i32 immAllZerosV))))),
(iPTR 0))),
(!cast<Instruction>(InstrStr#rmkz)
(COPY_TO_REGCLASS MaskRC:$mask, VK1WM),
addr:$srcAddr)>;
def : Pat<(_.info128.VT (extract_subvector
(_.info512.VT (masked_load addr:$srcAddr, Mask,
(_.info512.VT (insert_subvector undef,
(_.info256.VT (insert_subvector undef,
(_.info128.VT (X86vzmovl _.info128.RC:$src)),
(iPTR 0))),
(iPTR 0))))),
(iPTR 0))),
(!cast<Instruction>(InstrStr#rmk) _.info128.RC:$src,
(COPY_TO_REGCLASS MaskRC:$mask, VK1WM),
addr:$srcAddr)>;
}
multiclass avx512_load_scalar_lowering_subreg<string InstrStr,
AVX512VLVectorVTInfo _,
dag Mask, RegisterClass MaskRC,
SubRegIndex subreg> {
def : Pat<(_.info128.VT (extract_subvector
(_.info512.VT (masked_load addr:$srcAddr, Mask,
(_.info512.VT (bitconvert
(v16i32 immAllZerosV))))),
(iPTR 0))),
(!cast<Instruction>(InstrStr#rmkz)
(COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), MaskRC:$mask, subreg)), VK1WM),
addr:$srcAddr)>;
def : Pat<(_.info128.VT (extract_subvector
(_.info512.VT (masked_load addr:$srcAddr, Mask,
(_.info512.VT (insert_subvector undef,
(_.info256.VT (insert_subvector undef,
(_.info128.VT (X86vzmovl _.info128.RC:$src)),
(iPTR 0))),
(iPTR 0))))),
(iPTR 0))),
(!cast<Instruction>(InstrStr#rmk) _.info128.RC:$src,
(COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), MaskRC:$mask, subreg)), VK1WM),
addr:$srcAddr)>;
}
defm : avx512_move_scalar_lowering<"VMOVSSZ", X86Movss, fp32imm0, v4f32x_info>;
defm : avx512_move_scalar_lowering<"VMOVSDZ", X86Movsd, fp64imm0, v2f64x_info>;
defm : avx512_store_scalar_lowering<"VMOVSSZ", avx512vl_f32_info,
(v16i1 (bitconvert (i16 (trunc (and GR32:$mask, (i32 1)))))), GR32>;
defm : avx512_store_scalar_lowering_subreg<"VMOVSSZ", avx512vl_f32_info,
(v16i1 (bitconvert (i16 (and GR16:$mask, (i16 1))))), GR16, sub_16bit>;
defm : avx512_store_scalar_lowering_subreg<"VMOVSDZ", avx512vl_f64_info,
(v8i1 (bitconvert (i8 (and GR8:$mask, (i8 1))))), GR8, sub_8bit>;
defm : avx512_load_scalar_lowering<"VMOVSSZ", avx512vl_f32_info,
(v16i1 (bitconvert (i16 (trunc (and GR32:$mask, (i32 1)))))), GR32>;
defm : avx512_load_scalar_lowering_subreg<"VMOVSSZ", avx512vl_f32_info,
(v16i1 (bitconvert (i16 (and GR16:$mask, (i16 1))))), GR16, sub_16bit>;
defm : avx512_load_scalar_lowering_subreg<"VMOVSDZ", avx512vl_f64_info,
(v8i1 (bitconvert (i8 (and GR8:$mask, (i8 1))))), GR8, sub_8bit>;
def : Pat<(f32 (X86selects VK1WM:$mask, (f32 FR32X:$src1), (f32 FR32X:$src2))),
(COPY_TO_REGCLASS (VMOVSSZrrk (COPY_TO_REGCLASS FR32X:$src2, VR128X),
VK1WM:$mask, (v4f32 (IMPLICIT_DEF)), FR32X:$src1), FR32X)>;
def : Pat<(f64 (X86selects VK1WM:$mask, (f64 FR64X:$src1), (f64 FR64X:$src2))),
(COPY_TO_REGCLASS (VMOVSDZrrk (COPY_TO_REGCLASS FR64X:$src2, VR128X),
VK1WM:$mask, (v2f64 (IMPLICIT_DEF)), FR64X:$src1), FR64X)>;
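// Sketch of the trick above: only element 0 of the result is observable, so
// the vector operand that supplies the upper lanes can be left as
// IMPLICIT_DEF; the select then becomes a single merge-masked move, roughly
//   vmovss %xmm1, %xmmN, %xmm2 {%k1}    ; %xmmN's upper lanes are don't-care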
def : Pat<(int_x86_avx512_mask_store_ss addr:$dst, VR128X:$src, GR8:$mask),
(VMOVSSZmrk addr:$dst, (COPY_TO_REGCLASS (i32 (INSERT_SUBREG (IMPLICIT_DEF), GR8:$mask, sub_8bit)), VK1WM),
(COPY_TO_REGCLASS VR128X:$src, FR32X))>;
let hasSideEffects = 0 in {
def VMOVSSZrr_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins VR128X:$src1, FR32X:$src2),
"vmovss.s\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[], NoItinerary>, XS, EVEX_4V, VEX_LIG,
FoldGenData<"VMOVSSZrr">;
let Constraints = "$src0 = $dst" in
def VMOVSSZrrk_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins f32x_info.RC:$src0, f32x_info.KRCWM:$mask,
VR128X:$src1, FR32X:$src2),
"vmovss.s\t{$src2, $src1, $dst {${mask}}|"#
"$dst {${mask}}, $src1, $src2}",
[], NoItinerary>, EVEX_K, XS, EVEX_4V, VEX_LIG,
FoldGenData<"VMOVSSZrrk">;
def VMOVSSZrrkz_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins f32x_info.KRCWM:$mask, VR128X:$src1, FR32X:$src2),
"vmovss.s\t{$src2, $src1, $dst {${mask}} {z}|"#
"$dst {${mask}} {z}, $src1, $src2}",
[], NoItinerary>, EVEX_KZ, XS, EVEX_4V, VEX_LIG,
FoldGenData<"VMOVSSZrrkz">;
def VMOVSDZrr_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins VR128X:$src1, FR64X:$src2),
"vmovsd.s\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[], NoItinerary>, XD, EVEX_4V, VEX_LIG, VEX_W,
FoldGenData<"VMOVSDZrr">;
let Constraints = "$src0 = $dst" in
def VMOVSDZrrk_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins f64x_info.RC:$src0, f64x_info.KRCWM:$mask,
VR128X:$src1, FR64X:$src2),
"vmovsd.s\t{$src2, $src1, $dst {${mask}}|"#
"$dst {${mask}}, $src1, $src2}",
[], NoItinerary>, EVEX_K, XD, EVEX_4V, VEX_LIG,
VEX_W, FoldGenData<"VMOVSDZrrk">;
def VMOVSDZrrkz_REV: AVX512<0x11, MRMDestReg, (outs VR128X:$dst),
(ins f64x_info.KRCWM:$mask, VR128X:$src1,
FR64X:$src2),
"vmovsd.s\t{$src2, $src1, $dst {${mask}} {z}|"#
"$dst {${mask}} {z}, $src1, $src2}",
[], NoItinerary>, EVEX_KZ, XD, EVEX_4V, VEX_LIG,
VEX_W, FoldGenData<"VMOVSDZrrkz">;
}
let Predicates = [HasAVX512] in {
let AddedComplexity = 15 in {
// Move a scalar to XMM zero-extended: zero a VR128X, then do a
// MOVS{S,D} to the lower bits.
def : Pat<(v4f32 (X86vzmovl (v4f32 (scalar_to_vector FR32X:$src)))),
(VMOVSSZrr (v4f32 (AVX512_128_SET0)), FR32X:$src)>;
def : Pat<(v4f32 (X86vzmovl (v4f32 VR128X:$src))),
(VMOVSSZrr (v4f32 (AVX512_128_SET0)), (COPY_TO_REGCLASS VR128X:$src, FR32X))>;
def : Pat<(v4i32 (X86vzmovl (v4i32 VR128X:$src))),
(VMOVSSZrr (v4i32 (AVX512_128_SET0)), (COPY_TO_REGCLASS VR128X:$src, FR32X))>;
def : Pat<(v2f64 (X86vzmovl (v2f64 (scalar_to_vector FR64X:$src)))),
(VMOVSDZrr (v2f64 (AVX512_128_SET0)), FR64X:$src)>;
}
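// Sketch of the first pattern above in assembly form:
//   vxorps %xmm1, %xmm1, %xmm1    ; AVX512_128_SET0
//   vmovss %xmm0, %xmm1, %xmm0    ; lane 0 from $src, upper lanes zeroed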
// Move low f32 and clear high bits.
def : Pat<(v8f32 (X86vzmovl (v8f32 VR256X:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (v4f32 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v8f32 VR256X:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v8i32 (X86vzmovl (v8i32 VR256X:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (v4i32 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v8i32 VR256X:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v16f32 (X86vzmovl (v16f32 VR512:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (v4f32 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v16f32 VR512:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v16i32 (X86vzmovl (v16i32 VR512:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (v4i32 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v16i32 VR512:$src), sub_xmm)), sub_xmm)>;
let AddedComplexity = 20 in {
// MOVSSrm zeros the high parts of the register; represent this
// with SUBREG_TO_REG. The AVX versions also write: DST[255:128] <- 0
def : Pat<(v4f32 (X86vzmovl (v4f32 (scalar_to_vector (loadf32 addr:$src))))),
(COPY_TO_REGCLASS (VMOVSSZrm addr:$src), VR128X)>;
def : Pat<(v4f32 (scalar_to_vector (loadf32 addr:$src))),
(COPY_TO_REGCLASS (VMOVSSZrm addr:$src), VR128X)>;
def : Pat<(v4f32 (X86vzmovl (loadv4f32 addr:$src))),
(COPY_TO_REGCLASS (VMOVSSZrm addr:$src), VR128X)>;
def : Pat<(v4f32 (X86vzload addr:$src)),
(COPY_TO_REGCLASS (VMOVSSZrm addr:$src), VR128X)>;
// MOVSDrm zeros the high parts of the register; represent this
// with SUBREG_TO_REG. The AVX versions also write: DST[255:128] <- 0
def : Pat<(v2f64 (X86vzmovl (v2f64 (scalar_to_vector (loadf64 addr:$src))))),
(COPY_TO_REGCLASS (VMOVSDZrm addr:$src), VR128X)>;
def : Pat<(v2f64 (scalar_to_vector (loadf64 addr:$src))),
(COPY_TO_REGCLASS (VMOVSDZrm addr:$src), VR128X)>;
def : Pat<(v2f64 (X86vzmovl (loadv2f64 addr:$src))),
(COPY_TO_REGCLASS (VMOVSDZrm addr:$src), VR128X)>;
def : Pat<(v2f64 (X86vzmovl (bc_v2f64 (loadv4f32 addr:$src)))),
(COPY_TO_REGCLASS (VMOVSDZrm addr:$src), VR128X)>;
def : Pat<(v2f64 (X86vzload addr:$src)),
(COPY_TO_REGCLASS (VMOVSDZrm addr:$src), VR128X)>;
// Represent the same patterns as above, but in the form they appear for
// 256-bit types
def : Pat<(v8i32 (X86vzmovl (insert_subvector undef,
(v4i32 (scalar_to_vector (loadi32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
(v4f32 (scalar_to_vector (loadf32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
def : Pat<(v8f32 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,
(v2f64 (scalar_to_vector (loadf64 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;
def : Pat<(v4f64 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;
// Represent the same patterns as above, but in the form they appear for
// 512-bit types
def : Pat<(v16i32 (X86vzmovl (insert_subvector undef,
(v4i32 (scalar_to_vector (loadi32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
def : Pat<(v16f32 (X86vzmovl (insert_subvector undef,
(v4f32 (scalar_to_vector (loadf32 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
def : Pat<(v16f32 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVSSZrm addr:$src), sub_xmm)>;
def : Pat<(v8f64 (X86vzmovl (insert_subvector undef,
(v2f64 (scalar_to_vector (loadf64 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;
def : Pat<(v8f64 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVSDZrm addr:$src), sub_xmm)>;
}
def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
(v4f32 (scalar_to_vector FR32X:$src)), (iPTR 0)))),
(SUBREG_TO_REG (i32 0), (v4f32 (VMOVSSZrr (v4f32 (AVX512_128_SET0)),
FR32X:$src)), sub_xmm)>;
def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,
(v2f64 (scalar_to_vector FR64X:$src)), (iPTR 0)))),
(SUBREG_TO_REG (i64 0), (v2f64 (VMOVSDZrr (v2f64 (AVX512_128_SET0)),
FR64X:$src)), sub_xmm)>;
def : Pat<(v4i64 (X86vzmovl (insert_subvector undef,
(v2i64 (scalar_to_vector (loadi64 addr:$src))), (iPTR 0)))),
(SUBREG_TO_REG (i64 0), (VMOVQI2PQIZrm addr:$src), sub_xmm)>;
// Move low f64 and clear high bits.
def : Pat<(v4f64 (X86vzmovl (v4f64 VR256X:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSDZrr (v2f64 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v4f64 VR256X:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v8f64 (X86vzmovl (v8f64 VR512:$src))),
(SUBREG_TO_REG (i32 0),
(VMOVSDZrr (v2f64 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v8f64 VR512:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v4i64 (X86vzmovl (v4i64 VR256X:$src))),
(SUBREG_TO_REG (i32 0), (VMOVSDZrr (v2i64 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v4i64 VR256X:$src), sub_xmm)), sub_xmm)>;
def : Pat<(v8i64 (X86vzmovl (v8i64 VR512:$src))),
(SUBREG_TO_REG (i32 0), (VMOVSDZrr (v2i64 (AVX512_128_SET0)),
(EXTRACT_SUBREG (v8i64 VR512:$src), sub_xmm)), sub_xmm)>;
// Extract and store.
def : Pat<(store (f32 (extractelt (v4f32 VR128X:$src), (iPTR 0))),
addr:$dst),
(VMOVSSZmr addr:$dst, (COPY_TO_REGCLASS (v4f32 VR128X:$src), FR32X))>;
// Shuffle with VMOVSS
def : Pat<(v4i32 (X86Movss VR128X:$src1, VR128X:$src2)),
(VMOVSSZrr (v4i32 VR128X:$src1),
(COPY_TO_REGCLASS (v4i32 VR128X:$src2), FR32X))>;
def : Pat<(v4f32 (X86Movss VR128X:$src1, VR128X:$src2)),
(VMOVSSZrr (v4f32 VR128X:$src1),
(COPY_TO_REGCLASS (v4f32 VR128X:$src2), FR32X))>;
// 256-bit variants
def : Pat<(v8i32 (X86Movss VR256X:$src1, VR256X:$src2)),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (EXTRACT_SUBREG (v8i32 VR256X:$src1), sub_xmm),
(EXTRACT_SUBREG (v8i32 VR256X:$src2), sub_xmm)),
sub_xmm)>;
def : Pat<(v8f32 (X86Movss VR256X:$src1, VR256X:$src2)),
(SUBREG_TO_REG (i32 0),
(VMOVSSZrr (EXTRACT_SUBREG (v8f32 VR256X:$src1), sub_xmm),
(EXTRACT_SUBREG (v8f32 VR256X:$src2), sub_xmm)),
sub_xmm)>;
// Shuffle with VMOVSD
def : Pat<(v2i64 (X86Movsd VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
def : Pat<(v2f64 (X86Movsd VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
// 256-bit variants
def : Pat<(v4i64 (X86Movsd VR256X:$src1, VR256X:$src2)),
(SUBREG_TO_REG (i32 0),
(VMOVSDZrr (EXTRACT_SUBREG (v4i64 VR256X:$src1), sub_xmm),
(EXTRACT_SUBREG (v4i64 VR256X:$src2), sub_xmm)),
sub_xmm)>;
def : Pat<(v4f64 (X86Movsd VR256X:$src1, VR256X:$src2)),
(SUBREG_TO_REG (i32 0),
(VMOVSDZrr (EXTRACT_SUBREG (v4f64 VR256X:$src1), sub_xmm),
(EXTRACT_SUBREG (v4f64 VR256X:$src2), sub_xmm)),
sub_xmm)>;
def : Pat<(v2f64 (X86Movlpd VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
def : Pat<(v2i64 (X86Movlpd VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
def : Pat<(v4f32 (X86Movlps VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
def : Pat<(v4i32 (X86Movlps VR128X:$src1, VR128X:$src2)),
(VMOVSDZrr VR128X:$src1, (COPY_TO_REGCLASS VR128X:$src2, FR64X))>;
}
let AddedComplexity = 15 in
def VMOVZPQILo2PQIZrr : AVX512XSI<0x7E, MRMSrcReg, (outs VR128X:$dst),
(ins VR128X:$src),
"vmovq\t{$src, $dst|$dst, $src}",
[(set VR128X:$dst, (v2i64 (X86vzmovl
(v2i64 VR128X:$src))))],
IIC_SSE_MOVQ_RR>, EVEX, VEX_W;
let Predicates = [HasAVX512] in {
let AddedComplexity = 15 in {
def : Pat<(v4i32 (X86vzmovl (v4i32 (scalar_to_vector GR32:$src)))),
(VMOVDI2PDIZrr GR32:$src)>;
def : Pat<(v2i64 (X86vzmovl (v2i64 (scalar_to_vector GR64:$src)))),
(VMOV64toPQIZrr GR64:$src)>;
def : Pat<(v4i64 (X86vzmovl (insert_subvector undef,
(v2i64 (scalar_to_vector GR64:$src)),(iPTR 0)))),
(SUBREG_TO_REG (i64 0), (VMOV64toPQIZrr GR64:$src), sub_xmm)>;
def : Pat<(v8i64 (X86vzmovl (insert_subvector undef,
(v2i64 (scalar_to_vector GR64:$src)),(iPTR 0)))),
(SUBREG_TO_REG (i64 0), (VMOV64toPQIZrr GR64:$src), sub_xmm)>;
}
// AVX 128-bit movd/movq instructions write zeros in the high 128 bits.
let AddedComplexity = 20 in {
def : Pat<(v2i64 (X86vzmovl (v2i64 (scalar_to_vector (zextloadi64i32 addr:$src))))),
(VMOVDI2PDIZrm addr:$src)>;
def : Pat<(v4i32 (X86vzmovl (v4i32 (scalar_to_vector (loadi32 addr:$src))))),
(VMOVDI2PDIZrm addr:$src)>;
def : Pat<(v4i32 (X86vzmovl (bc_v4i32 (loadv4f32 addr:$src)))),
(VMOVDI2PDIZrm addr:$src)>;
def : Pat<(v4i32 (X86vzmovl (bc_v4i32 (loadv2i64 addr:$src)))),
(VMOVDI2PDIZrm addr:$src)>;
def : Pat<(v4i32 (X86vzload addr:$src)),
(VMOVDI2PDIZrm addr:$src)>;
def : Pat<(v8i32 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
def : Pat<(v2i64 (X86vzmovl (loadv2i64 addr:$src))),
(VMOVQI2PQIZrm addr:$src)>;
def : Pat<(v2f64 (X86vzmovl (v2f64 VR128X:$src))),
(VMOVZPQILo2PQIZrr VR128X:$src)>;
def : Pat<(v2i64 (X86vzload addr:$src)),
(VMOVQI2PQIZrm addr:$src)>;
def : Pat<(v4i64 (X86vzload addr:$src)),
(SUBREG_TO_REG (i64 0), (VMOVQI2PQIZrm addr:$src), sub_xmm)>;
}
// Use regular 128-bit instructions to match 256-bit scalar_to_vec+zext.
def : Pat<(v8i32 (X86vzmovl (insert_subvector undef,
(v4i32 (scalar_to_vector GR32:$src)),(iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrr GR32:$src), sub_xmm)>;
def : Pat<(v16i32 (X86vzmovl (insert_subvector undef,
(v4i32 (scalar_to_vector GR32:$src)),(iPTR 0)))),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrr GR32:$src), sub_xmm)>;
// Use regular 128-bit instructions to match 512-bit scalar_to_vec+zext.
def : Pat<(v16i32 (X86vzload addr:$src)),
(SUBREG_TO_REG (i32 0), (VMOVDI2PDIZrm addr:$src), sub_xmm)>;
def : Pat<(v8i64 (X86vzload addr:$src)),
(SUBREG_TO_REG (i64 0), (VMOVQI2PQIZrm addr:$src), sub_xmm)>;
}
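// Sketch: a zero-extending load such as (v8i64 (X86vzload addr:$src)) thus
// lowers to a single 'vmovq (%rdi), %xmm0'; the SUBREG_TO_REG merely records
// that the EVEX instruction already zeroed the rest of the zmm register.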
//===----------------------------------------------------------------------===//
// AVX-512 - Non-temporals
//===----------------------------------------------------------------------===//
let SchedRW = [WriteLoad] in {
def VMOVNTDQAZrm : AVX512PI<0x2A, MRMSrcMem, (outs VR512:$dst),
(ins i512mem:$src), "vmovntdqa\t{$src, $dst|$dst, $src}",
[], SSEPackedInt>, EVEX, T8PD, EVEX_V512,
EVEX_CD8<64, CD8VF>;
let Predicates = [HasVLX] in {
def VMOVNTDQAZ256rm : AVX512PI<0x2A, MRMSrcMem, (outs VR256X:$dst),
(ins i256mem:$src),
"vmovntdqa\t{$src, $dst|$dst, $src}",
[], SSEPackedInt>, EVEX, T8PD, EVEX_V256,
EVEX_CD8<64, CD8VF>;
def VMOVNTDQAZ128rm : AVX512PI<0x2A, MRMSrcMem, (outs VR128X:$dst),
(ins i128mem:$src),
"vmovntdqa\t{$src, $dst|$dst, $src}",
[], SSEPackedInt>, EVEX, T8PD, EVEX_V128,
EVEX_CD8<64, CD8VF>;
}
}
multiclass avx512_movnt<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
PatFrag st_frag = alignednontemporalstore,
InstrItinClass itin = IIC_SSE_MOVNT> {
let SchedRW = [WriteStore], AddedComplexity = 400 in
def mr : AVX512PI<opc, MRMDestMem, (outs), (ins _.MemOp:$dst, _.RC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(st_frag (_.VT _.RC:$src), addr:$dst)],
_.ExeDomain, itin>, EVEX, EVEX_CD8<_.EltSize, CD8VF>;
}
multiclass avx512_movnt_vl<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
let Predicates = [HasAVX512] in
defm Z : avx512_movnt<opc, OpcodeStr, VTInfo.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_movnt<opc, OpcodeStr, VTInfo.info256>, EVEX_V256;
defm Z128 : avx512_movnt<opc, OpcodeStr, VTInfo.info128>, EVEX_V128;
}
}
defm VMOVNTDQ : avx512_movnt_vl<0xE7, "vmovntdq", avx512vl_i64_info>, PD;
defm VMOVNTPD : avx512_movnt_vl<0x2B, "vmovntpd", avx512vl_f64_info>, PD, VEX_W;
defm VMOVNTPS : avx512_movnt_vl<0x2B, "vmovntps", avx512vl_f32_info>, PS;
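// Sketch: only the i64-element vector info is used for the vmovntdq store
// tables above; the patterns below reuse the same instructions for the other
// element types (via bitconvert in the integer cases), since non-temporal
// moves are type-agnostic.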
let Predicates = [HasAVX512], AddedComplexity = 400 in {
def : Pat<(alignednontemporalstore (v16i32 VR512:$src), addr:$dst),
(VMOVNTDQZmr addr:$dst, VR512:$src)>;
def : Pat<(alignednontemporalstore (v32i16 VR512:$src), addr:$dst),
(VMOVNTDQZmr addr:$dst, VR512:$src)>;
def : Pat<(alignednontemporalstore (v64i8 VR512:$src), addr:$dst),
(VMOVNTDQZmr addr:$dst, VR512:$src)>;
def : Pat<(v8f64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZrm addr:$src)>;
def : Pat<(v16f32 (alignednontemporalload addr:$src)),
(VMOVNTDQAZrm addr:$src)>;
def : Pat<(v8i64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZrm addr:$src)>;
def : Pat<(v16i32 (bitconvert (v8i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZrm addr:$src)>;
def : Pat<(v32i16 (bitconvert (v8i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZrm addr:$src)>;
def : Pat<(v64i8 (bitconvert (v8i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZrm addr:$src)>;
}
let Predicates = [HasVLX], AddedComplexity = 400 in {
def : Pat<(alignednontemporalstore (v8i32 VR256X:$src), addr:$dst),
(VMOVNTDQZ256mr addr:$dst, VR256X:$src)>;
def : Pat<(alignednontemporalstore (v16i16 VR256X:$src), addr:$dst),
(VMOVNTDQZ256mr addr:$dst, VR256X:$src)>;
def : Pat<(alignednontemporalstore (v32i8 VR256X:$src), addr:$dst),
(VMOVNTDQZ256mr addr:$dst, VR256X:$src)>;
def : Pat<(v4f64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ256rm addr:$src)>;
def : Pat<(v8f32 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ256rm addr:$src)>;
def : Pat<(v4i64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ256rm addr:$src)>;
  def : Pat<(v8i32 (bitconvert (v4i64 (alignednontemporalload addr:$src)))),
            (VMOVNTDQAZ256rm addr:$src)>;
  def : Pat<(v16i16 (bitconvert (v4i64 (alignednontemporalload addr:$src)))),
            (VMOVNTDQAZ256rm addr:$src)>;
  def : Pat<(v32i8 (bitconvert (v4i64 (alignednontemporalload addr:$src)))),
            (VMOVNTDQAZ256rm addr:$src)>;
def : Pat<(alignednontemporalstore (v4i32 VR128X:$src), addr:$dst),
(VMOVNTDQZ128mr addr:$dst, VR128X:$src)>;
def : Pat<(alignednontemporalstore (v8i16 VR128X:$src), addr:$dst),
(VMOVNTDQZ128mr addr:$dst, VR128X:$src)>;
def : Pat<(alignednontemporalstore (v16i8 VR128X:$src), addr:$dst),
(VMOVNTDQZ128mr addr:$dst, VR128X:$src)>;
def : Pat<(v2f64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ128rm addr:$src)>;
def : Pat<(v4f32 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ128rm addr:$src)>;
def : Pat<(v2i64 (alignednontemporalload addr:$src)),
(VMOVNTDQAZ128rm addr:$src)>;
def : Pat<(v4i32 (bitconvert (v2i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZ128rm addr:$src)>;
def : Pat<(v8i16 (bitconvert (v2i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZ128rm addr:$src)>;
def : Pat<(v16i8 (bitconvert (v2i64 (alignednontemporalload addr:$src)))),
(VMOVNTDQAZ128rm addr:$src)>;
}
//===----------------------------------------------------------------------===//
// AVX-512 - Integer arithmetic
//
multiclass avx512_binop_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, OpndItins itins,
bit IsCommutable = 0> {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, _.RC:$src2)),
itins.rr, IsCommutable>,
AVX512BIBase, EVEX_4V;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1,
(bitconvert (_.LdFrag addr:$src2)))),
itins.rm>,
AVX512BIBase, EVEX_4V;
}
multiclass avx512_binop_rmb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, OpndItins itins,
bit IsCommutable = 0> :
avx512_binop_rm<opc, OpcodeStr, OpNode, _, itins, IsCommutable> {
defm rmb : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(_.VT (OpNode _.RC:$src1,
(X86VBroadcast
(_.ScalarLdFrag addr:$src2)))),
itins.rm>,
AVX512BIBase, EVEX_4V, EVEX_B;
}
multiclass avx512_binop_rm_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, OpndItins itins,
Predicate prd, bit IsCommutable = 0> {
let Predicates = [prd] in
defm Z : avx512_binop_rm<opc, OpcodeStr, OpNode, VTInfo.info512, itins,
IsCommutable>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_binop_rm<opc, OpcodeStr, OpNode, VTInfo.info256, itins,
IsCommutable>, EVEX_V256;
defm Z128 : avx512_binop_rm<opc, OpcodeStr, OpNode, VTInfo.info128, itins,
IsCommutable>, EVEX_V128;
}
}
multiclass avx512_binop_rmb_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, OpndItins itins,
Predicate prd, bit IsCommutable = 0> {
let Predicates = [prd] in
defm Z : avx512_binop_rmb<opc, OpcodeStr, OpNode, VTInfo.info512, itins,
IsCommutable>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_binop_rmb<opc, OpcodeStr, OpNode, VTInfo.info256, itins,
IsCommutable>, EVEX_V256;
defm Z128 : avx512_binop_rmb<opc, OpcodeStr, OpNode, VTInfo.info128, itins,
IsCommutable>, EVEX_V128;
}
}
multiclass avx512_binop_rm_vl_q<bits<8> opc, string OpcodeStr, SDNode OpNode,
OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm NAME : avx512_binop_rmb_vl<opc, OpcodeStr, OpNode, avx512vl_i64_info,
itins, prd, IsCommutable>,
VEX_W, EVEX_CD8<64, CD8VF>;
}
multiclass avx512_binop_rm_vl_d<bits<8> opc, string OpcodeStr, SDNode OpNode,
OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm NAME : avx512_binop_rmb_vl<opc, OpcodeStr, OpNode, avx512vl_i32_info,
itins, prd, IsCommutable>, EVEX_CD8<32, CD8VF>;
}
multiclass avx512_binop_rm_vl_w<bits<8> opc, string OpcodeStr, SDNode OpNode,
OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm NAME : avx512_binop_rm_vl<opc, OpcodeStr, OpNode, avx512vl_i16_info,
itins, prd, IsCommutable>, EVEX_CD8<16, CD8VF>;
}
multiclass avx512_binop_rm_vl_b<bits<8> opc, string OpcodeStr, SDNode OpNode,
OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm NAME : avx512_binop_rm_vl<opc, OpcodeStr, OpNode, avx512vl_i8_info,
itins, prd, IsCommutable>, EVEX_CD8<8, CD8VF>;
}
multiclass avx512_binop_rm_vl_dq<bits<8> opc_d, bits<8> opc_q, string OpcodeStr,
SDNode OpNode, OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm Q : avx512_binop_rm_vl_q<opc_q, OpcodeStr#"q", OpNode, itins, prd,
IsCommutable>;
defm D : avx512_binop_rm_vl_d<opc_d, OpcodeStr#"d", OpNode, itins, prd,
IsCommutable>;
}
multiclass avx512_binop_rm_vl_bw<bits<8> opc_b, bits<8> opc_w, string OpcodeStr,
SDNode OpNode, OpndItins itins, Predicate prd,
bit IsCommutable = 0> {
defm W : avx512_binop_rm_vl_w<opc_w, OpcodeStr#"w", OpNode, itins, prd,
IsCommutable>;
defm B : avx512_binop_rm_vl_b<opc_b, OpcodeStr#"b", OpNode, itins, prd,
IsCommutable>;
}
multiclass avx512_binop_rm_vl_all<bits<8> opc_b, bits<8> opc_w,
bits<8> opc_d, bits<8> opc_q,
string OpcodeStr, SDNode OpNode,
OpndItins itins, bit IsCommutable = 0> {
defm NAME : avx512_binop_rm_vl_dq<opc_d, opc_q, OpcodeStr, OpNode,
itins, HasAVX512, IsCommutable>,
avx512_binop_rm_vl_bw<opc_b, opc_w, OpcodeStr, OpNode,
itins, HasBWI, IsCommutable>;
}
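// How the integer binop multiclasses compose:
//   avx512_binop_rm        - reg/reg and reg/mem forms for one vector type
//   avx512_binop_rmb       - adds the broadcast-from-memory (rmb) form
//   avx512_binop_rm_vl_*   - instantiates Z/Z256/Z128 per element size
//   avx512_binop_rm_vl_bw/_dq/_all - bundle several element sizes at once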
multiclass avx512_binop_rm2<bits<8> opc, string OpcodeStr, OpndItins itins,
                            SDNode OpNode, X86VectorVTInfo _Src,
X86VectorVTInfo _Dst, X86VectorVTInfo _Brdct,
bit IsCommutable = 0> {
defm rr : AVX512_maskable<opc, MRMSrcReg, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Src.RC:$src2), OpcodeStr,
"$src2, $src1","$src1, $src2",
(_Dst.VT (OpNode
(_Src.VT _Src.RC:$src1),
(_Src.VT _Src.RC:$src2))),
itins.rr, IsCommutable>,
AVX512BIBase, EVEX_4V;
defm rm : AVX512_maskable<opc, MRMSrcMem, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Src.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_Dst.VT (OpNode (_Src.VT _Src.RC:$src1),
(bitconvert (_Src.LdFrag addr:$src2)))),
itins.rm>,
AVX512BIBase, EVEX_4V;
defm rmb : AVX512_maskable<opc, MRMSrcMem, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Brdct.ScalarMemOp:$src2),
OpcodeStr,
"${src2}"##_Brdct.BroadcastStr##", $src1",
"$src1, ${src2}"##_Brdct.BroadcastStr,
(_Dst.VT (OpNode (_Src.VT _Src.RC:$src1), (bitconvert
(_Brdct.VT (X86VBroadcast
(_Brdct.ScalarLdFrag addr:$src2)))))),
itins.rm>,
AVX512BIBase, EVEX_4V, EVEX_B;
}
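// avx512_binop_rm2 above serves ops whose source and destination types
// differ (e.g. vpmuldq takes i32 sources and produces i64 results); note its
// broadcast form splats at the _Brdct element size, not the source size. It
// is instantiated by avx512_binop_all further down.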
defm VPADD : avx512_binop_rm_vl_all<0xFC, 0xFD, 0xFE, 0xD4, "vpadd", add,
SSE_INTALU_ITINS_P, 1>;
defm VPSUB : avx512_binop_rm_vl_all<0xF8, 0xF9, 0xFA, 0xFB, "vpsub", sub,
SSE_INTALU_ITINS_P, 0>;
defm VPADDS : avx512_binop_rm_vl_bw<0xEC, 0xED, "vpadds", X86adds,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPSUBS : avx512_binop_rm_vl_bw<0xE8, 0xE9, "vpsubs", X86subs,
SSE_INTALU_ITINS_P, HasBWI, 0>;
defm VPADDUS : avx512_binop_rm_vl_bw<0xDC, 0xDD, "vpaddus", X86addus,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPSUBUS : avx512_binop_rm_vl_bw<0xD8, 0xD9, "vpsubus", X86subus,
SSE_INTALU_ITINS_P, HasBWI, 0>;
defm VPMULLD : avx512_binop_rm_vl_d<0x40, "vpmulld", mul,
SSE_INTALU_ITINS_P, HasAVX512, 1>, T8PD;
defm VPMULLW : avx512_binop_rm_vl_w<0xD5, "vpmullw", mul,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPMULLQ : avx512_binop_rm_vl_q<0x40, "vpmullq", mul,
SSE_INTALU_ITINS_P, HasDQI, 1>, T8PD;
defm VPMULHW : avx512_binop_rm_vl_w<0xE5, "vpmulhw", mulhs, SSE_INTALU_ITINS_P,
HasBWI, 1>;
defm VPMULHUW : avx512_binop_rm_vl_w<0xE4, "vpmulhuw", mulhu, SSE_INTMUL_ITINS_P,
HasBWI, 1>;
defm VPMULHRSW : avx512_binop_rm_vl_w<0x0B, "vpmulhrsw", X86mulhrs, SSE_INTMUL_ITINS_P,
HasBWI, 1>, T8PD;
defm VPAVG : avx512_binop_rm_vl_bw<0xE0, 0xE3, "vpavg", X86avg,
SSE_INTALU_ITINS_P, HasBWI, 1>;
multiclass avx512_binop_all<bits<8> opc, string OpcodeStr, OpndItins itins,
AVX512VLVectorVTInfo _SrcVTInfo, AVX512VLVectorVTInfo _DstVTInfo,
SDNode OpNode, Predicate prd, bit IsCommutable = 0> {
let Predicates = [prd] in
defm NAME#Z : avx512_binop_rm2<opc, OpcodeStr, itins, OpNode,
_SrcVTInfo.info512, _DstVTInfo.info512,
v8i64_info, IsCommutable>,
EVEX_V512, EVEX_CD8<64, CD8VF>, VEX_W;
let Predicates = [HasVLX, prd] in {
defm NAME#Z256 : avx512_binop_rm2<opc, OpcodeStr, itins, OpNode,
_SrcVTInfo.info256, _DstVTInfo.info256,
v4i64x_info, IsCommutable>,
EVEX_V256, EVEX_CD8<64, CD8VF>, VEX_W;
defm NAME#Z128 : avx512_binop_rm2<opc, OpcodeStr, itins, OpNode,
_SrcVTInfo.info128, _DstVTInfo.info128,
v2i64x_info, IsCommutable>,
EVEX_V128, EVEX_CD8<64, CD8VF>, VEX_W;
}
}
defm VPMULDQ : avx512_binop_all<0x28, "vpmuldq", SSE_INTALU_ITINS_P,
avx512vl_i32_info, avx512vl_i64_info,
                                   X86pmuldq, HasAVX512, 1>, T8PD;
defm VPMULUDQ : avx512_binop_all<0xF4, "vpmuludq", SSE_INTMUL_ITINS_P,
avx512vl_i32_info, avx512vl_i64_info,
X86pmuludq, HasAVX512, 1>;
defm VPMULTISHIFTQB : avx512_binop_all<0x83, "vpmultishiftqb", SSE_INTALU_ITINS_P,
avx512vl_i8_info, avx512vl_i8_info,
X86multishift, HasVBMI, 0>, T8PD;
multiclass avx512_packs_rmb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _Src, X86VectorVTInfo _Dst> {
defm rmb : AVX512_maskable<opc, MRMSrcMem, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Src.ScalarMemOp:$src2),
OpcodeStr,
"${src2}"##_Src.BroadcastStr##", $src1",
"$src1, ${src2}"##_Src.BroadcastStr,
(_Dst.VT (OpNode (_Src.VT _Src.RC:$src1), (bitconvert
(_Src.VT (X86VBroadcast
(_Src.ScalarLdFrag addr:$src2))))))>,
EVEX_4V, EVEX_B, EVEX_CD8<_Src.EltSize, CD8VF>;
}
multiclass avx512_packs_rm<bits<8> opc, string OpcodeStr,
                           SDNode OpNode, X86VectorVTInfo _Src,
X86VectorVTInfo _Dst, bit IsCommutable = 0> {
defm rr : AVX512_maskable<opc, MRMSrcReg, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Src.RC:$src2), OpcodeStr,
"$src2, $src1","$src1, $src2",
(_Dst.VT (OpNode
(_Src.VT _Src.RC:$src1),
(_Src.VT _Src.RC:$src2))),
NoItinerary, IsCommutable>,
EVEX_CD8<_Src.EltSize, CD8VF>, EVEX_4V;
defm rm : AVX512_maskable<opc, MRMSrcMem, _Dst, (outs _Dst.RC:$dst),
(ins _Src.RC:$src1, _Src.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_Dst.VT (OpNode (_Src.VT _Src.RC:$src1),
(bitconvert (_Src.LdFrag addr:$src2))))>,
EVEX_4V, EVEX_CD8<_Src.EltSize, CD8VF>;
}
multiclass avx512_packs_all_i32_i16<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
let Predicates = [HasBWI] in
defm NAME#Z : avx512_packs_rm<opc, OpcodeStr, OpNode, v16i32_info,
v32i16_info>,
avx512_packs_rmb<opc, OpcodeStr, OpNode, v16i32_info,
v32i16_info>, EVEX_V512;
let Predicates = [HasBWI, HasVLX] in {
defm NAME#Z256 : avx512_packs_rm<opc, OpcodeStr, OpNode, v8i32x_info,
v16i16x_info>,
avx512_packs_rmb<opc, OpcodeStr, OpNode, v8i32x_info,
v16i16x_info>, EVEX_V256;
defm NAME#Z128 : avx512_packs_rm<opc, OpcodeStr, OpNode, v4i32x_info,
v8i16x_info>,
avx512_packs_rmb<opc, OpcodeStr, OpNode, v4i32x_info,
v8i16x_info>, EVEX_V128;
}
}
multiclass avx512_packs_all_i16_i8<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
let Predicates = [HasBWI] in
defm NAME#Z : avx512_packs_rm<opc, OpcodeStr, OpNode, v32i16_info,
v64i8_info>, EVEX_V512;
let Predicates = [HasBWI, HasVLX] in {
defm NAME#Z256 : avx512_packs_rm<opc, OpcodeStr, OpNode, v16i16x_info,
v32i8x_info>, EVEX_V256;
defm NAME#Z128 : avx512_packs_rm<opc, OpcodeStr, OpNode, v8i16x_info,
v16i8x_info>, EVEX_V128;
}
}
multiclass avx512_vpmadd<bits<8> opc, string OpcodeStr,
SDNode OpNode, AVX512VLVectorVTInfo _Src,
AVX512VLVectorVTInfo _Dst, bit IsCommutable = 0> {
let Predicates = [HasBWI] in
defm NAME#Z : avx512_packs_rm<opc, OpcodeStr, OpNode, _Src.info512,
_Dst.info512, IsCommutable>, EVEX_V512;
let Predicates = [HasBWI, HasVLX] in {
defm NAME#Z256 : avx512_packs_rm<opc, OpcodeStr, OpNode, _Src.info256,
_Dst.info256, IsCommutable>, EVEX_V256;
defm NAME#Z128 : avx512_packs_rm<opc, OpcodeStr, OpNode, _Src.info128,
_Dst.info128, IsCommutable>, EVEX_V128;
}
}
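// The pack instructions narrow with saturation (i32->i16 for vpackssdw/
// vpackusdw, i16->i8 for vpacksswb/vpackuswb), while the vpmadd forms widen:
// vpmaddubsw multiplies i8 pairs into i16 sums, vpmaddwd i16 pairs into i32.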
defm VPACKSSDW : avx512_packs_all_i32_i16<0x6B, "vpackssdw", X86Packss>, AVX512BIBase;
defm VPACKUSDW : avx512_packs_all_i32_i16<0x2b, "vpackusdw", X86Packus>, AVX5128IBase;
defm VPACKSSWB : avx512_packs_all_i16_i8 <0x63, "vpacksswb", X86Packss>, AVX512BIBase;
defm VPACKUSWB : avx512_packs_all_i16_i8 <0x67, "vpackuswb", X86Packus>, AVX512BIBase;
defm VPMADDUBSW : avx512_vpmadd<0x04, "vpmaddubsw", X86vpmaddubsw,
avx512vl_i8_info, avx512vl_i16_info>, AVX512BIBase, T8PD;
defm VPMADDWD : avx512_vpmadd<0xF5, "vpmaddwd", X86vpmaddwd,
avx512vl_i16_info, avx512vl_i32_info, 1>, AVX512BIBase;
defm VPMAXSB : avx512_binop_rm_vl_b<0x3C, "vpmaxsb", smax,
SSE_INTALU_ITINS_P, HasBWI, 1>, T8PD;
defm VPMAXSW : avx512_binop_rm_vl_w<0xEE, "vpmaxsw", smax,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPMAXS : avx512_binop_rm_vl_dq<0x3D, 0x3D, "vpmaxs", smax,
SSE_INTALU_ITINS_P, HasAVX512, 1>, T8PD;
defm VPMAXUB : avx512_binop_rm_vl_b<0xDE, "vpmaxub", umax,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPMAXUW : avx512_binop_rm_vl_w<0x3E, "vpmaxuw", umax,
SSE_INTALU_ITINS_P, HasBWI, 1>, T8PD;
defm VPMAXU : avx512_binop_rm_vl_dq<0x3F, 0x3F, "vpmaxu", umax,
SSE_INTALU_ITINS_P, HasAVX512, 1>, T8PD;
defm VPMINSB : avx512_binop_rm_vl_b<0x38, "vpminsb", smin,
SSE_INTALU_ITINS_P, HasBWI, 1>, T8PD;
defm VPMINSW : avx512_binop_rm_vl_w<0xEA, "vpminsw", smin,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPMINS : avx512_binop_rm_vl_dq<0x39, 0x39, "vpmins", smin,
SSE_INTALU_ITINS_P, HasAVX512, 1>, T8PD;
defm VPMINUB : avx512_binop_rm_vl_b<0xDA, "vpminub", umin,
SSE_INTALU_ITINS_P, HasBWI, 1>;
defm VPMINUW : avx512_binop_rm_vl_w<0x3A, "vpminuw", umin,
SSE_INTALU_ITINS_P, HasBWI, 1>, T8PD;
defm VPMINU : avx512_binop_rm_vl_dq<0x3B, 0x3B, "vpminu", umin,
SSE_INTALU_ITINS_P, HasAVX512, 1>, T8PD;
// PMULLQ: Use the 512-bit version to implement the 128/256-bit forms when
// VLX is unavailable (NoVLX).
let Predicates = [HasDQI, NoVLX] in {
def : Pat<(v4i64 (mul (v4i64 VR256X:$src1), (v4i64 VR256X:$src2))),
(EXTRACT_SUBREG
(VPMULLQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src1, sub_ymm),
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src2, sub_ymm)),
sub_ymm)>;
def : Pat<(v2i64 (mul (v2i64 VR128X:$src1), (v2i64 VR128X:$src2))),
(EXTRACT_SUBREG
(VPMULLQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src1, sub_xmm),
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src2, sub_xmm)),
sub_xmm)>;
}
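// The widen-execute-extract idiom above (INSERT_SUBREG into an IMPLICIT_DEF
// zmm, run the 512-bit instruction, then EXTRACT_SUBREG the result) recurs
// below in the NoVLX patterns for shifts, rotates, and vptest.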
//===----------------------------------------------------------------------===//
// AVX-512 Logical Instructions
//===----------------------------------------------------------------------===//
multiclass avx512_logic_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, bit IsCommutable = 0> {
defm rr : AVX512_maskable_logic<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.i64VT (OpNode (bitconvert (_.VT _.RC:$src1)),
(bitconvert (_.VT _.RC:$src2)))),
(_.VT (bitconvert (_.i64VT (OpNode _.RC:$src1,
_.RC:$src2)))),
IIC_SSE_BIT_P_RR, IsCommutable>,
AVX512BIBase, EVEX_4V;
defm rm : AVX512_maskable_logic<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.i64VT (OpNode (bitconvert (_.VT _.RC:$src1)),
(bitconvert (_.LdFrag addr:$src2)))),
(_.VT (bitconvert (_.i64VT (OpNode _.RC:$src1,
(bitconvert (_.LdFrag addr:$src2)))))),
IIC_SSE_BIT_P_RM>,
AVX512BIBase, EVEX_4V;
}
multiclass avx512_logic_rmb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, bit IsCommutable = 0> :
avx512_logic_rm<opc, OpcodeStr, OpNode, _, IsCommutable> {
defm rmb : AVX512_maskable_logic<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(_.i64VT (OpNode _.RC:$src1,
(bitconvert
(_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2)))))),
(_.VT (bitconvert (_.i64VT (OpNode _.RC:$src1,
(bitconvert
(_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2)))))))),
IIC_SSE_BIT_P_RM>,
AVX512BIBase, EVEX_4V, EVEX_B;
}
multiclass avx512_logic_rmb_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo,
bit IsCommutable = 0> {
let Predicates = [HasAVX512] in
defm Z : avx512_logic_rmb<opc, OpcodeStr, OpNode, VTInfo.info512,
IsCommutable>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_logic_rmb<opc, OpcodeStr, OpNode, VTInfo.info256,
IsCommutable>, EVEX_V256;
defm Z128 : avx512_logic_rmb<opc, OpcodeStr, OpNode, VTInfo.info128,
IsCommutable>, EVEX_V128;
}
}
multiclass avx512_logic_rm_vl_d<bits<8> opc, string OpcodeStr, SDNode OpNode,
bit IsCommutable = 0> {
defm NAME : avx512_logic_rmb_vl<opc, OpcodeStr, OpNode, avx512vl_i32_info,
IsCommutable>, EVEX_CD8<32, CD8VF>;
}
multiclass avx512_logic_rm_vl_q<bits<8> opc, string OpcodeStr, SDNode OpNode,
bit IsCommutable = 0> {
defm NAME : avx512_logic_rmb_vl<opc, OpcodeStr, OpNode, avx512vl_i64_info,
IsCommutable>,
VEX_W, EVEX_CD8<64, CD8VF>;
}
multiclass avx512_logic_rm_vl_dq<bits<8> opc_d, bits<8> opc_q, string OpcodeStr,
SDNode OpNode, bit IsCommutable = 0> {
defm Q : avx512_logic_rm_vl_q<opc_q, OpcodeStr#"q", OpNode, IsCommutable>;
defm D : avx512_logic_rm_vl_d<opc_d, OpcodeStr#"d", OpNode, IsCommutable>;
}
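// The d and q logic forms share a single opcode (e.g. 0xDB for vpand); they
// differ only in VEX_W and the EVEX_CD8 tuple scaling.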
defm VPAND : avx512_logic_rm_vl_dq<0xDB, 0xDB, "vpand", and, 1>;
defm VPOR : avx512_logic_rm_vl_dq<0xEB, 0xEB, "vpor", or, 1>;
defm VPXOR : avx512_logic_rm_vl_dq<0xEF, 0xEF, "vpxor", xor, 1>;
defm VPANDN : avx512_logic_rm_vl_dq<0xDF, 0xDF, "vpandn", X86andnp>;
//===----------------------------------------------------------------------===//
// AVX-512 FP arithmetic
//===----------------------------------------------------------------------===//
multiclass avx512_fp_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode OpNode, SDNode VecNode, OpndItins itins,
bit IsCommutable> {
let ExeDomain = _.ExeDomain in {
defm rr_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (VecNode _.RC:$src1, _.RC:$src2,
(i32 FROUND_CURRENT))),
itins.rr>;
defm rm_Int : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.IntScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (VecNode _.RC:$src1,
_.ScalarIntMemCPat:$src2,
(i32 FROUND_CURRENT))),
itins.rm>;
let isCodeGenOnly = 1, Predicates = [HasAVX512] in {
def rr : I< opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1, _.FRC:$src2))],
itins.rr> {
let isCommutable = IsCommutable;
}
def rm : I< opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.ScalarMemOp:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1,
(_.ScalarLdFrag addr:$src2)))], itins.rm>;
}
}
}
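// The rr_Int/rm_Int forms above operate on full 128-bit vectors, matching
// the intrinsic semantics where the upper elements pass through from $src1;
// the isCodeGenOnly FRC forms are what plain scalar fadd/fsub/etc. select to.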
multiclass avx512_fp_scalar_round<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode VecNode, OpndItins itins, bit IsCommutable = 0> {
let ExeDomain = _.ExeDomain in
defm rrb : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, AVX512RC:$rc), OpcodeStr,
"$rc, $src2, $src1", "$src1, $src2, $rc",
(VecNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 imm:$rc)), itins.rr, IsCommutable>,
EVEX_B, EVEX_RC;
}
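// On reg-reg forms, EVEX_B repurposes the broadcast bit: together with
// EVEX_RC it encodes a static rounding mode (the AVX512RC operand, e.g.
// "{rz-sae}"), and on the SAE-only forms below it encodes "{sae}".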
multiclass avx512_fp_scalar_sae<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode OpNode, SDNode VecNode, SDNode SaeNode,
OpndItins itins, bit IsCommutable> {
let ExeDomain = _.ExeDomain in {
defm rr_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (VecNode _.RC:$src1, _.RC:$src2)),
itins.rr>;
defm rm_Int : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.IntScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (VecNode _.RC:$src1,
_.ScalarIntMemCPat:$src2)),
itins.rm>;
let isCodeGenOnly = 1, Predicates = [HasAVX512] in {
def rr : I< opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1, _.FRC:$src2))],
itins.rr> {
let isCommutable = IsCommutable;
}
def rm : I< opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.ScalarMemOp:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1,
(_.ScalarLdFrag addr:$src2)))], itins.rm>;
}
defm rrb : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(SaeNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 FROUND_NO_EXC))>, EVEX_B;
}
}
multiclass avx512_binop_s_round<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode VecNode,
SizeItins itins, bit IsCommutable> {
defm SSZ : avx512_fp_scalar<opc, OpcodeStr#"ss", f32x_info, OpNode, VecNode,
itins.s, IsCommutable>,
avx512_fp_scalar_round<opc, OpcodeStr#"ss", f32x_info, VecNode,
itins.s, IsCommutable>,
XS, EVEX_4V, VEX_LIG, EVEX_CD8<32, CD8VT1>;
defm SDZ : avx512_fp_scalar<opc, OpcodeStr#"sd", f64x_info, OpNode, VecNode,
itins.d, IsCommutable>,
avx512_fp_scalar_round<opc, OpcodeStr#"sd", f64x_info, VecNode,
itins.d, IsCommutable>,
XD, VEX_W, EVEX_4V, VEX_LIG, EVEX_CD8<64, CD8VT1>;
}
multiclass avx512_binop_s_sae<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode VecNode, SDNode SaeNode,
SizeItins itins, bit IsCommutable> {
defm SSZ : avx512_fp_scalar_sae<opc, OpcodeStr#"ss", f32x_info, OpNode,
VecNode, SaeNode, itins.s, IsCommutable>,
XS, EVEX_4V, VEX_LIG, EVEX_CD8<32, CD8VT1>;
defm SDZ : avx512_fp_scalar_sae<opc, OpcodeStr#"sd", f64x_info, OpNode,
VecNode, SaeNode, itins.d, IsCommutable>,
XD, VEX_W, EVEX_4V, VEX_LIG, EVEX_CD8<64, CD8VT1>;
}
defm VADD : avx512_binop_s_round<0x58, "vadd", fadd, X86faddRnds, SSE_ALU_ITINS_S, 1>;
defm VMUL : avx512_binop_s_round<0x59, "vmul", fmul, X86fmulRnds, SSE_MUL_ITINS_S, 1>;
defm VSUB : avx512_binop_s_round<0x5C, "vsub", fsub, X86fsubRnds, SSE_ALU_ITINS_S, 0>;
defm VDIV : avx512_binop_s_round<0x5E, "vdiv", fdiv, X86fdivRnds, SSE_DIV_ITINS_S, 0>;
defm VMIN : avx512_binop_s_sae <0x5D, "vmin", X86fmin, X86fmins, X86fminRnds,
SSE_ALU_ITINS_S, 0>;
defm VMAX : avx512_binop_s_sae <0x5F, "vmax", X86fmax, X86fmaxs, X86fmaxRnds,
SSE_ALU_ITINS_S, 0>;
// MIN/MAX nodes are commutable under "unsafe-fp-math". In this case we use
// X86fminc and X86fmaxc instead of X86fmin and X86fmax.
multiclass avx512_commutable_binop_s<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _, SDNode OpNode, OpndItins itins> {
let isCodeGenOnly = 1, Predicates = [HasAVX512], ExeDomain = _.ExeDomain in {
def rr : I< opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1, _.FRC:$src2))],
itins.rr> {
let isCommutable = 1;
}
def rm : I< opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.ScalarMemOp:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set _.FRC:$dst, (OpNode _.FRC:$src1,
(_.ScalarLdFrag addr:$src2)))], itins.rm>;
}
}
defm VMINCSSZ : avx512_commutable_binop_s<0x5D, "vminss", f32x_info, X86fminc,
SSE_ALU_ITINS_S.s>, XS, EVEX_4V, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm VMINCSDZ : avx512_commutable_binop_s<0x5D, "vminsd", f64x_info, X86fminc,
SSE_ALU_ITINS_S.d>, XD, VEX_W, EVEX_4V, VEX_LIG,
EVEX_CD8<64, CD8VT1>;
defm VMAXCSSZ : avx512_commutable_binop_s<0x5F, "vmaxss", f32x_info, X86fmaxc,
SSE_ALU_ITINS_S.s>, XS, EVEX_4V, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm VMAXCSDZ : avx512_commutable_binop_s<0x5F, "vmaxsd", f64x_info, X86fmaxc,
SSE_ALU_ITINS_S.d>, XD, VEX_W, EVEX_4V, VEX_LIG,
EVEX_CD8<64, CD8VT1>;
multiclass avx512_fp_packed<bits<8> opc, string OpcodeStr, SDPatternOperator OpNode,
X86VectorVTInfo _, OpndItins itins,
bit IsCommutable> {
let ExeDomain = _.ExeDomain, hasSideEffects = 0 in {
defm rr: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, _.RC:$src2)), itins.rr,
IsCommutable>, EVEX_4V;
let mayLoad = 1 in {
defm rm: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(OpNode _.RC:$src1, (_.LdFrag addr:$src2)), itins.rm>,
EVEX_4V;
defm rmb: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr##_.Suffix,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(OpNode _.RC:$src1, (_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2)))),
itins.rm>, EVEX_4V, EVEX_B;
}
}
}
multiclass avx512_fp_round_packed<bits<8> opc, string OpcodeStr, SDPatternOperator OpNodeRnd,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, AVX512RC:$rc), OpcodeStr##_.Suffix,
"$rc, $src2, $src1", "$src1, $src2, $rc",
(_.VT (OpNodeRnd _.RC:$src1, _.RC:$src2, (i32 imm:$rc)))>,
EVEX_4V, EVEX_B, EVEX_RC;
}
multiclass avx512_fp_sae_packed<bits<8> opc, string OpcodeStr, SDPatternOperator OpNodeRnd,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr##_.Suffix,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(_.VT (OpNodeRnd _.RC:$src1, _.RC:$src2, (i32 FROUND_NO_EXC)))>,
EVEX_4V, EVEX_B;
}
multiclass avx512_fp_binop_p<bits<8> opc, string OpcodeStr, SDPatternOperator OpNode,
Predicate prd, SizeItins itins,
bit IsCommutable = 0> {
let Predicates = [prd] in {
defm PSZ : avx512_fp_packed<opc, OpcodeStr, OpNode, v16f32_info,
itins.s, IsCommutable>, EVEX_V512, PS,
EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_fp_packed<opc, OpcodeStr, OpNode, v8f64_info,
itins.d, IsCommutable>, EVEX_V512, PD, VEX_W,
EVEX_CD8<64, CD8VF>;
}
  // Define only if the AVX512VL feature is present.
let Predicates = [prd, HasVLX] in {
defm PSZ128 : avx512_fp_packed<opc, OpcodeStr, OpNode, v4f32x_info,
itins.s, IsCommutable>, EVEX_V128, PS,
EVEX_CD8<32, CD8VF>;
defm PSZ256 : avx512_fp_packed<opc, OpcodeStr, OpNode, v8f32x_info,
itins.s, IsCommutable>, EVEX_V256, PS,
EVEX_CD8<32, CD8VF>;
defm PDZ128 : avx512_fp_packed<opc, OpcodeStr, OpNode, v2f64x_info,
itins.d, IsCommutable>, EVEX_V128, PD, VEX_W,
EVEX_CD8<64, CD8VF>;
defm PDZ256 : avx512_fp_packed<opc, OpcodeStr, OpNode, v4f64x_info,
itins.d, IsCommutable>, EVEX_V256, PD, VEX_W,
EVEX_CD8<64, CD8VF>;
}
}
multiclass avx512_fp_binop_p_round<bits<8> opc, string OpcodeStr, SDNode OpNodeRnd> {
defm PSZ : avx512_fp_round_packed<opc, OpcodeStr, OpNodeRnd, v16f32_info>,
EVEX_V512, PS, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_fp_round_packed<opc, OpcodeStr, OpNodeRnd, v8f64_info>,
              EVEX_V512, PD, VEX_W, EVEX_CD8<64, CD8VF>;
}
multiclass avx512_fp_binop_p_sae<bits<8> opc, string OpcodeStr, SDNode OpNodeRnd> {
defm PSZ : avx512_fp_sae_packed<opc, OpcodeStr, OpNodeRnd, v16f32_info>,
EVEX_V512, PS, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_fp_sae_packed<opc, OpcodeStr, OpNodeRnd, v8f64_info>,
              EVEX_V512, PD, VEX_W, EVEX_CD8<64, CD8VF>;
}
defm VADD : avx512_fp_binop_p<0x58, "vadd", fadd, HasAVX512,
SSE_ALU_ITINS_P, 1>,
avx512_fp_binop_p_round<0x58, "vadd", X86faddRnd>;
defm VMUL : avx512_fp_binop_p<0x59, "vmul", fmul, HasAVX512,
SSE_MUL_ITINS_P, 1>,
avx512_fp_binop_p_round<0x59, "vmul", X86fmulRnd>;
defm VSUB : avx512_fp_binop_p<0x5C, "vsub", fsub, HasAVX512, SSE_ALU_ITINS_P>,
avx512_fp_binop_p_round<0x5C, "vsub", X86fsubRnd>;
defm VDIV : avx512_fp_binop_p<0x5E, "vdiv", fdiv, HasAVX512, SSE_DIV_ITINS_P>,
avx512_fp_binop_p_round<0x5E, "vdiv", X86fdivRnd>;
defm VMIN : avx512_fp_binop_p<0x5D, "vmin", X86fmin, HasAVX512,
SSE_ALU_ITINS_P, 0>,
avx512_fp_binop_p_sae<0x5D, "vmin", X86fminRnd>;
defm VMAX : avx512_fp_binop_p<0x5F, "vmax", X86fmax, HasAVX512,
SSE_ALU_ITINS_P, 0>,
avx512_fp_binop_p_sae<0x5F, "vmax", X86fmaxRnd>;
let isCodeGenOnly = 1 in {
defm VMINC : avx512_fp_binop_p<0x5D, "vmin", X86fminc, HasAVX512,
SSE_ALU_ITINS_P, 1>;
defm VMAXC : avx512_fp_binop_p<0x5F, "vmax", X86fmaxc, HasAVX512,
SSE_ALU_ITINS_P, 1>;
}
defm VAND : avx512_fp_binop_p<0x54, "vand", null_frag, HasDQI,
SSE_ALU_ITINS_P, 1>;
defm VANDN : avx512_fp_binop_p<0x55, "vandn", null_frag, HasDQI,
SSE_ALU_ITINS_P, 0>;
defm VOR : avx512_fp_binop_p<0x56, "vor", null_frag, HasDQI,
SSE_ALU_ITINS_P, 1>;
defm VXOR : avx512_fp_binop_p<0x57, "vxor", null_frag, HasDQI,
SSE_ALU_ITINS_P, 1>;
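// These FP logic defs use null_frag and so carry no selection patterns of
// their own; plain FP and/or/xor is matched via the bitcast patterns below,
// which mostly target the integer VPAND/VPOR/VPXOR/VPANDN forms.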
// These patterns catch floating-point selects with bitcasted integer logic ops.
multiclass avx512_fp_logical_lowering<string InstrStr, SDNode OpNode,
X86VectorVTInfo _, Predicate prd> {
let Predicates = [prd] in {
// Masked register-register logical operations.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert (_.i64VT (OpNode _.RC:$src1, _.RC:$src2))),
_.RC:$src0)),
(!cast<Instruction>(InstrStr#rrk) _.RC:$src0, _.KRCWM:$mask,
_.RC:$src1, _.RC:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert (_.i64VT (OpNode _.RC:$src1, _.RC:$src2))),
_.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#rrkz) _.KRCWM:$mask, _.RC:$src1,
_.RC:$src2)>;
// Masked register-memory logical operations.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert (_.i64VT (OpNode _.RC:$src1,
(load addr:$src2)))),
_.RC:$src0)),
(!cast<Instruction>(InstrStr#rmk) _.RC:$src0, _.KRCWM:$mask,
_.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert (_.i64VT (OpNode _.RC:$src1, (load addr:$src2)))),
_.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#rmkz) _.KRCWM:$mask, _.RC:$src1,
addr:$src2)>;
// Register-broadcast logical operations.
def : Pat<(_.i64VT (OpNode _.RC:$src1,
(bitconvert (_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2)))))),
(!cast<Instruction>(InstrStr#rmb) _.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert
(_.i64VT (OpNode _.RC:$src1,
(bitconvert (_.VT
(X86VBroadcast
(_.ScalarLdFrag addr:$src2))))))),
_.RC:$src0)),
(!cast<Instruction>(InstrStr#rmbk) _.RC:$src0, _.KRCWM:$mask,
_.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(bitconvert
(_.i64VT (OpNode _.RC:$src1,
(bitconvert (_.VT
(X86VBroadcast
(_.ScalarLdFrag addr:$src2))))))),
_.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#rmbkz) _.KRCWM:$mask,
_.RC:$src1, addr:$src2)>;
}
}
multiclass avx512_fp_logical_lowering_sizes<string InstrStr, SDNode OpNode> {
defm : avx512_fp_logical_lowering<InstrStr#DZ128, OpNode, v4f32x_info, HasVLX>;
defm : avx512_fp_logical_lowering<InstrStr#QZ128, OpNode, v2f64x_info, HasVLX>;
defm : avx512_fp_logical_lowering<InstrStr#DZ256, OpNode, v8f32x_info, HasVLX>;
defm : avx512_fp_logical_lowering<InstrStr#QZ256, OpNode, v4f64x_info, HasVLX>;
defm : avx512_fp_logical_lowering<InstrStr#DZ, OpNode, v16f32_info, HasAVX512>;
defm : avx512_fp_logical_lowering<InstrStr#QZ, OpNode, v8f64_info, HasAVX512>;
}
defm : avx512_fp_logical_lowering_sizes<"VPAND", and>;
defm : avx512_fp_logical_lowering_sizes<"VPOR", or>;
defm : avx512_fp_logical_lowering_sizes<"VPXOR", xor>;
defm : avx512_fp_logical_lowering_sizes<"VPANDN", X86andnp>;
let Predicates = [HasVLX,HasDQI] in {
// Use packed logical operations for scalar ops.
def : Pat<(f64 (X86fand FR64X:$src1, FR64X:$src2)),
(COPY_TO_REGCLASS (VANDPDZ128rr
(COPY_TO_REGCLASS FR64X:$src1, VR128X),
(COPY_TO_REGCLASS FR64X:$src2, VR128X)), FR64X)>;
def : Pat<(f64 (X86for FR64X:$src1, FR64X:$src2)),
(COPY_TO_REGCLASS (VORPDZ128rr
(COPY_TO_REGCLASS FR64X:$src1, VR128X),
(COPY_TO_REGCLASS FR64X:$src2, VR128X)), FR64X)>;
def : Pat<(f64 (X86fxor FR64X:$src1, FR64X:$src2)),
(COPY_TO_REGCLASS (VXORPDZ128rr
(COPY_TO_REGCLASS FR64X:$src1, VR128X),
(COPY_TO_REGCLASS FR64X:$src2, VR128X)), FR64X)>;
def : Pat<(f64 (X86fandn FR64X:$src1, FR64X:$src2)),
(COPY_TO_REGCLASS (VANDNPDZ128rr
(COPY_TO_REGCLASS FR64X:$src1, VR128X),
(COPY_TO_REGCLASS FR64X:$src2, VR128X)), FR64X)>;
def : Pat<(f32 (X86fand FR32X:$src1, FR32X:$src2)),
(COPY_TO_REGCLASS (VANDPSZ128rr
(COPY_TO_REGCLASS FR32X:$src1, VR128X),
(COPY_TO_REGCLASS FR32X:$src2, VR128X)), FR32X)>;
def : Pat<(f32 (X86for FR32X:$src1, FR32X:$src2)),
(COPY_TO_REGCLASS (VORPSZ128rr
(COPY_TO_REGCLASS FR32X:$src1, VR128X),
(COPY_TO_REGCLASS FR32X:$src2, VR128X)), FR32X)>;
def : Pat<(f32 (X86fxor FR32X:$src1, FR32X:$src2)),
(COPY_TO_REGCLASS (VXORPSZ128rr
(COPY_TO_REGCLASS FR32X:$src1, VR128X),
(COPY_TO_REGCLASS FR32X:$src2, VR128X)), FR32X)>;
def : Pat<(f32 (X86fandn FR32X:$src1, FR32X:$src2)),
(COPY_TO_REGCLASS (VANDNPSZ128rr
(COPY_TO_REGCLASS FR32X:$src1, VR128X),
(COPY_TO_REGCLASS FR32X:$src2, VR128X)), FR32X)>;
}
multiclass avx512_fp_scalef_p<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rr: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, _.RC:$src2, (i32 FROUND_CURRENT)))>, EVEX_4V;
defm rm: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(OpNode _.RC:$src1, (_.LdFrag addr:$src2), (i32 FROUND_CURRENT))>, EVEX_4V;
defm rmb: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr##_.Suffix,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(OpNode _.RC:$src1, (_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2))), (i32 FROUND_CURRENT))>,
EVEX_4V, EVEX_B;
}
}
multiclass avx512_fp_scalef_scalar<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rr: AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, _.RC:$src2, (i32 FROUND_CURRENT)))>;
defm rm: AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr##_.Suffix,
"$src2, $src1", "$src1, $src2",
(OpNode _.RC:$src1,
(_.VT (scalar_to_vector (_.ScalarLdFrag addr:$src2))),
(i32 FROUND_CURRENT))>;
}
}
multiclass avx512_fp_scalef_all<bits<8> opc, bits<8> opcScaler, string OpcodeStr, SDNode OpNode, SDNode OpNodeScal> {
defm PSZ : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v16f32_info>,
avx512_fp_round_packed<opc, OpcodeStr, OpNode, v16f32_info>,
EVEX_V512, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v8f64_info>,
avx512_fp_round_packed<opc, OpcodeStr, OpNode, v8f64_info>,
EVEX_V512, VEX_W, EVEX_CD8<64, CD8VF>;
defm SSZ128 : avx512_fp_scalef_scalar<opcScaler, OpcodeStr, OpNodeScal, f32x_info>,
avx512_fp_scalar_round<opcScaler, OpcodeStr##"ss", f32x_info, OpNodeScal, SSE_ALU_ITINS_S.s>,
              EVEX_4V, EVEX_CD8<32, CD8VT1>;
defm SDZ128 : avx512_fp_scalef_scalar<opcScaler, OpcodeStr, OpNodeScal, f64x_info>,
avx512_fp_scalar_round<opcScaler, OpcodeStr##"sd", f64x_info, OpNodeScal, SSE_ALU_ITINS_S.d>,
EVEX_4V, EVEX_CD8<64, CD8VT1>, VEX_W;
  // Define only if the AVX512VL feature is present.
let Predicates = [HasVLX] in {
defm PSZ128 : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v4f32x_info>,
EVEX_V128, EVEX_CD8<32, CD8VF>;
defm PSZ256 : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v8f32x_info>,
EVEX_V256, EVEX_CD8<32, CD8VF>;
defm PDZ128 : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v2f64x_info>,
EVEX_V128, VEX_W, EVEX_CD8<64, CD8VF>;
defm PDZ256 : avx512_fp_scalef_p<opc, OpcodeStr, OpNode, v4f64x_info>,
EVEX_V256, VEX_W, EVEX_CD8<64, CD8VF>;
}
}
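// vscalef computes, roughly, src1 * 2^floor(src2) per element. The scalar
// variants only need one definition each (SSZ128/SDZ128) since scalars
// always live in xmm registers.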
defm VSCALEF : avx512_fp_scalef_all<0x2C, 0x2D, "vscalef", X86scalef, X86scalefs>, T8PD;
//===----------------------------------------------------------------------===//
// AVX-512 VPTESTM instructions
//===----------------------------------------------------------------------===//
multiclass avx512_vptest<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let isCommutable = 1 in
defm rr : AVX512_maskable_cmp<opc, MRMSrcReg, _, (outs _.KRC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2))>,
EVEX_4V;
defm rm : AVX512_maskable_cmp<opc, MRMSrcMem, _, (outs _.KRC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))))>,
EVEX_4V,
EVEX_CD8<_.EltSize, CD8VF>;
}
multiclass avx512_vptest_mb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
defm rmb : AVX512_maskable_cmp<opc, MRMSrcMem, _, (outs _.KRC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(OpNode (_.VT _.RC:$src1), (_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2))))>,
EVEX_B, EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
}
// Use the 512-bit version to implement the 128/256-bit forms when VLX is
// unavailable (NoVLX).
multiclass avx512_vptest_lowering<SDNode OpNode, X86VectorVTInfo ExtendInfo,
X86VectorVTInfo _, string Suffix> {
def : Pat<(_.KVT (OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2))),
(_.KVT (COPY_TO_REGCLASS
(!cast<Instruction>(NAME # Suffix # "Zrr")
(INSERT_SUBREG (ExtendInfo.VT (IMPLICIT_DEF)),
_.RC:$src1, _.SubRegIdx),
(INSERT_SUBREG (ExtendInfo.VT (IMPLICIT_DEF)),
_.RC:$src2, _.SubRegIdx)),
_.KRC))>;
}
multiclass avx512_vptest_dq_sizes<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo _, string Suffix> {
let Predicates = [HasAVX512] in
defm Z : avx512_vptest<opc, OpcodeStr, OpNode, _.info512>,
avx512_vptest_mb<opc, OpcodeStr, OpNode, _.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_vptest<opc, OpcodeStr, OpNode, _.info256>,
avx512_vptest_mb<opc, OpcodeStr, OpNode, _.info256>, EVEX_V256;
defm Z128 : avx512_vptest<opc, OpcodeStr, OpNode, _.info128>,
avx512_vptest_mb<opc, OpcodeStr, OpNode, _.info128>, EVEX_V128;
}
let Predicates = [HasAVX512, NoVLX] in {
defm Z256_Alt : avx512_vptest_lowering< OpNode, _.info512, _.info256, Suffix>;
defm Z128_Alt : avx512_vptest_lowering< OpNode, _.info512, _.info128, Suffix>;
}
}
multiclass avx512_vptest_dq<bits<8> opc, string OpcodeStr, SDNode OpNode> {
defm D : avx512_vptest_dq_sizes<opc, OpcodeStr#"d", OpNode,
avx512vl_i32_info, "D">;
defm Q : avx512_vptest_dq_sizes<opc, OpcodeStr#"q", OpNode,
avx512vl_i64_info, "Q">, VEX_W;
}
multiclass avx512_vptest_wb<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
let Predicates = [HasBWI] in {
defm WZ: avx512_vptest<opc, OpcodeStr#"w", OpNode, v32i16_info>,
EVEX_V512, VEX_W;
defm BZ: avx512_vptest<opc, OpcodeStr#"b", OpNode, v64i8_info>,
EVEX_V512;
}
let Predicates = [HasVLX, HasBWI] in {
defm WZ256: avx512_vptest<opc, OpcodeStr#"w", OpNode, v16i16x_info>,
EVEX_V256, VEX_W;
defm WZ128: avx512_vptest<opc, OpcodeStr#"w", OpNode, v8i16x_info>,
EVEX_V128, VEX_W;
defm BZ256: avx512_vptest<opc, OpcodeStr#"b", OpNode, v32i8x_info>,
EVEX_V256;
defm BZ128: avx512_vptest<opc, OpcodeStr#"b", OpNode, v16i8x_info>,
EVEX_V128;
}
let Predicates = [HasAVX512, NoVLX] in {
defm BZ256_Alt : avx512_vptest_lowering< OpNode, v64i8_info, v32i8x_info, "B">;
defm BZ128_Alt : avx512_vptest_lowering< OpNode, v64i8_info, v16i8x_info, "B">;
defm WZ256_Alt : avx512_vptest_lowering< OpNode, v32i16_info, v16i16x_info, "W">;
defm WZ128_Alt : avx512_vptest_lowering< OpNode, v32i16_info, v8i16x_info, "W">;
}
}
multiclass avx512_vptest_all_forms<bits<8> opc_wb, bits<8> opc_dq, string OpcodeStr,
SDNode OpNode> :
avx512_vptest_wb <opc_wb, OpcodeStr, OpNode>,
avx512_vptest_dq<opc_dq, OpcodeStr, OpNode>;
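// vptestm sets a mask bit where (src1 & src2) is nonzero; vptestnm sets it
// where the AND is zero. Both write their result straight to a k-register.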
defm VPTESTM : avx512_vptest_all_forms<0x26, 0x27, "vptestm", X86testm>, T8PD;
defm VPTESTNM : avx512_vptest_all_forms<0x26, 0x27, "vptestnm", X86testnm>, T8XS;
//===----------------------------------------------------------------------===//
// AVX-512 Shift instructions
//===----------------------------------------------------------------------===//
multiclass avx512_shift_rmi<bits<8> opc, Format ImmFormR, Format ImmFormM,
string OpcodeStr, SDNode OpNode, X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm ri : AVX512_maskable<opc, ImmFormR, _, (outs _.RC:$dst),
(ins _.RC:$src1, u8imm:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, (i8 imm:$src2))),
SSE_INTSHIFT_ITINS_P.rr>;
defm mi : AVX512_maskable<opc, ImmFormM, _, (outs _.RC:$dst),
(ins _.MemOp:$src1, u8imm:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode (_.VT (bitconvert (_.LdFrag addr:$src1))),
(i8 imm:$src2))),
SSE_INTSHIFT_ITINS_P.rm>;
}
}
multiclass avx512_shift_rmbi<bits<8> opc, Format ImmFormM,
string OpcodeStr, SDNode OpNode, X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
defm mbi : AVX512_maskable<opc, ImmFormM, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src1, u8imm:$src2), OpcodeStr,
"$src2, ${src1}"##_.BroadcastStr, "${src1}"##_.BroadcastStr##", $src2",
(_.VT (OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src1)), (i8 imm:$src2))),
SSE_INTSHIFT_ITINS_P.rm>, EVEX_B;
}
multiclass avx512_shift_rrm<bits<8> opc, string OpcodeStr, SDNode OpNode,
ValueType SrcVT, PatFrag bc_frag, X86VectorVTInfo _> {
// src2 is always 128-bit
let ExeDomain = _.ExeDomain in {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, VR128X:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, (SrcVT VR128X:$src2))),
SSE_INTSHIFT_ITINS_P.rr>, AVX512BIBase, EVEX_4V;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, i128mem:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, (bc_frag (loadv2i64 addr:$src2)))),
SSE_INTSHIFT_ITINS_P.rm>, AVX512BIBase,
EVEX_4V;
}
}
multiclass avx512_shift_sizes<bits<8> opc, string OpcodeStr, SDNode OpNode,
ValueType SrcVT, PatFrag bc_frag,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_shift_rrm<opc, OpcodeStr, OpNode, SrcVT, bc_frag,
VTInfo.info512>, EVEX_V512,
                   EVEX_CD8<VTInfo.info512.EltSize, CD8VQ>;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_shift_rrm<opc, OpcodeStr, OpNode, SrcVT, bc_frag,
VTInfo.info256>, EVEX_V256,
EVEX_CD8<VTInfo.info256.EltSize, CD8VH>;
defm Z128 : avx512_shift_rrm<opc, OpcodeStr, OpNode, SrcVT, bc_frag,
VTInfo.info128>, EVEX_V128,
EVEX_CD8<VTInfo.info128.EltSize, CD8VF>;
}
}
multiclass avx512_shift_types<bits<8> opcd, bits<8> opcq, bits<8> opcw,
string OpcodeStr, SDNode OpNode> {
defm D : avx512_shift_sizes<opcd, OpcodeStr#"d", OpNode, v4i32, bc_v4i32,
avx512vl_i32_info, HasAVX512>;
defm Q : avx512_shift_sizes<opcq, OpcodeStr#"q", OpNode, v2i64, bc_v2i64,
avx512vl_i64_info, HasAVX512>, VEX_W;
defm W : avx512_shift_sizes<opcw, OpcodeStr#"w", OpNode, v8i16, bc_v8i16,
avx512vl_i16_info, HasBWI>;
}
multiclass avx512_shift_rmi_sizes<bits<8> opc, Format ImmFormR, Format ImmFormM,
string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo> {
let Predicates = [HasAVX512] in
defm Z: avx512_shift_rmi<opc, ImmFormR, ImmFormM, OpcodeStr, OpNode,
VTInfo.info512>,
avx512_shift_rmbi<opc, ImmFormM, OpcodeStr, OpNode,
VTInfo.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256: avx512_shift_rmi<opc, ImmFormR, ImmFormM, OpcodeStr, OpNode,
VTInfo.info256>,
avx512_shift_rmbi<opc, ImmFormM, OpcodeStr, OpNode,
VTInfo.info256>, EVEX_V256;
defm Z128: avx512_shift_rmi<opc, ImmFormR, ImmFormM, OpcodeStr, OpNode,
VTInfo.info128>,
avx512_shift_rmbi<opc, ImmFormM, OpcodeStr, OpNode,
VTInfo.info128>, EVEX_V128;
}
}
multiclass avx512_shift_rmi_w<bits<8> opcw,
Format ImmFormR, Format ImmFormM,
string OpcodeStr, SDNode OpNode> {
let Predicates = [HasBWI] in
defm WZ: avx512_shift_rmi<opcw, ImmFormR, ImmFormM, OpcodeStr, OpNode,
v32i16_info>, EVEX_V512;
let Predicates = [HasVLX, HasBWI] in {
defm WZ256: avx512_shift_rmi<opcw, ImmFormR, ImmFormM, OpcodeStr, OpNode,
v16i16x_info>, EVEX_V256;
defm WZ128: avx512_shift_rmi<opcw, ImmFormR, ImmFormM, OpcodeStr, OpNode,
v8i16x_info>, EVEX_V128;
}
}
multiclass avx512_shift_rmi_dq<bits<8> opcd, bits<8> opcq,
Format ImmFormR, Format ImmFormM,
string OpcodeStr, SDNode OpNode> {
defm D: avx512_shift_rmi_sizes<opcd, ImmFormR, ImmFormM, OpcodeStr#"d", OpNode,
avx512vl_i32_info>, EVEX_CD8<32, CD8VF>;
defm Q: avx512_shift_rmi_sizes<opcq, ImmFormR, ImmFormM, OpcodeStr#"q", OpNode,
avx512vl_i64_info>, EVEX_CD8<64, CD8VF>, VEX_W;
}
defm VPSRL : avx512_shift_rmi_dq<0x72, 0x73, MRM2r, MRM2m, "vpsrl", X86vsrli>,
avx512_shift_rmi_w<0x71, MRM2r, MRM2m, "vpsrlw", X86vsrli>, AVX512BIi8Base, EVEX_4V;
defm VPSLL : avx512_shift_rmi_dq<0x72, 0x73, MRM6r, MRM6m, "vpsll", X86vshli>,
avx512_shift_rmi_w<0x71, MRM6r, MRM6m, "vpsllw", X86vshli>, AVX512BIi8Base, EVEX_4V;
defm VPSRA : avx512_shift_rmi_dq<0x72, 0x72, MRM4r, MRM4m, "vpsra", X86vsrai>,
avx512_shift_rmi_w<0x71, MRM4r, MRM4m, "vpsraw", X86vsrai>, AVX512BIi8Base, EVEX_4V;
defm VPROR : avx512_shift_rmi_dq<0x72, 0x72, MRM0r, MRM0m, "vpror", X86vrotri>, AVX512BIi8Base, EVEX_4V;
defm VPROL : avx512_shift_rmi_dq<0x72, 0x72, MRM1r, MRM1m, "vprol", X86vrotli>, AVX512BIi8Base, EVEX_4V;
defm VPSLL : avx512_shift_types<0xF2, 0xF3, 0xF1, "vpsll", X86vshl>;
defm VPSRA : avx512_shift_types<0xE2, 0xE2, 0xE1, "vpsra", X86vsra>;
defm VPSRL : avx512_shift_types<0xD2, 0xD3, 0xD1, "vpsrl", X86vsrl>;
// Use the 512-bit VPSRA/VPSRAI forms to implement v2i64/v4i64 shifts when
// VLX is unavailable (NoVLX).
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v4i64 (X86vsra (v4i64 VR256X:$src1), (v2i64 VR128X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPSRAQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
VR128X:$src2)), sub_ymm)>;
def : Pat<(v2i64 (X86vsra (v2i64 VR128X:$src1), (v2i64 VR128X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPSRAQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
VR128X:$src2)), sub_xmm)>;
def : Pat<(v4i64 (X86vsrai (v4i64 VR256X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPSRAQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
imm:$src2)), sub_ymm)>;
def : Pat<(v2i64 (X86vsrai (v2i64 VR128X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPSRAQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
imm:$src2)), sub_xmm)>;
}
//===-------------------------------------------------------------------===//
// Variable Bit Shifts
//===-------------------------------------------------------------------===//
multiclass avx512_var_shift<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1, (_.VT _.RC:$src2))),
SSE_INTSHIFT_ITINS_P.rr>, AVX5128IBase, EVEX_4V;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1,
(_.VT (bitconvert (_.LdFrag addr:$src2))))),
SSE_INTSHIFT_ITINS_P.rm>, AVX5128IBase, EVEX_4V,
EVEX_CD8<_.EltSize, CD8VF>;
}
}
multiclass avx512_var_shift_mb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
defm rmb : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(_.VT (OpNode _.RC:$src1, (_.VT (X86VBroadcast
(_.ScalarLdFrag addr:$src2))))),
SSE_INTSHIFT_ITINS_P.rm>, AVX5128IBase, EVEX_B,
EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
}
multiclass avx512_var_shift_sizes<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo _> {
let Predicates = [HasAVX512] in
defm Z : avx512_var_shift<opc, OpcodeStr, OpNode, _.info512>,
avx512_var_shift_mb<opc, OpcodeStr, OpNode, _.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_var_shift<opc, OpcodeStr, OpNode, _.info256>,
avx512_var_shift_mb<opc, OpcodeStr, OpNode, _.info256>, EVEX_V256;
defm Z128 : avx512_var_shift<opc, OpcodeStr, OpNode, _.info128>,
avx512_var_shift_mb<opc, OpcodeStr, OpNode, _.info128>, EVEX_V128;
}
}
multiclass avx512_var_shift_types<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
defm D : avx512_var_shift_sizes<opc, OpcodeStr#"d", OpNode,
avx512vl_i32_info>;
defm Q : avx512_var_shift_sizes<opc, OpcodeStr#"q", OpNode,
avx512vl_i64_info>, VEX_W;
}
// Use the 512-bit version to implement the 128/256-bit forms when VLX is
// unavailable (NoVLX).
multiclass avx512_var_shift_lowering<AVX512VLVectorVTInfo _, string OpcodeStr,
SDNode OpNode, list<Predicate> p> {
let Predicates = p in {
def : Pat<(_.info256.VT (OpNode (_.info256.VT _.info256.RC:$src1),
(_.info256.VT _.info256.RC:$src2))),
(EXTRACT_SUBREG
(!cast<Instruction>(OpcodeStr#"Zrr")
(INSERT_SUBREG (_.info512.VT (IMPLICIT_DEF)), VR256X:$src1, sub_ymm),
(INSERT_SUBREG (_.info512.VT (IMPLICIT_DEF)), VR256X:$src2, sub_ymm)),
sub_ymm)>;
def : Pat<(_.info128.VT (OpNode (_.info128.VT _.info128.RC:$src1),
(_.info128.VT _.info128.RC:$src2))),
(EXTRACT_SUBREG
(!cast<Instruction>(OpcodeStr#"Zrr")
(INSERT_SUBREG (_.info512.VT (IMPLICIT_DEF)), VR128X:$src1, sub_xmm),
(INSERT_SUBREG (_.info512.VT (IMPLICIT_DEF)), VR128X:$src2, sub_xmm)),
sub_xmm)>;
}
}
multiclass avx512_var_shift_w<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
let Predicates = [HasBWI] in
defm WZ: avx512_var_shift<opc, OpcodeStr, OpNode, v32i16_info>,
EVEX_V512, VEX_W;
let Predicates = [HasVLX, HasBWI] in {
defm WZ256: avx512_var_shift<opc, OpcodeStr, OpNode, v16i16x_info>,
EVEX_V256, VEX_W;
defm WZ128: avx512_var_shift<opc, OpcodeStr, OpNode, v8i16x_info>,
EVEX_V128, VEX_W;
}
}
defm VPSLLV : avx512_var_shift_types<0x47, "vpsllv", shl>,
avx512_var_shift_w<0x12, "vpsllvw", shl>;
defm VPSRAV : avx512_var_shift_types<0x46, "vpsrav", sra>,
avx512_var_shift_w<0x11, "vpsravw", sra>;
defm VPSRLV : avx512_var_shift_types<0x45, "vpsrlv", srl>,
avx512_var_shift_w<0x10, "vpsrlvw", srl>;
defm VPRORV : avx512_var_shift_types<0x14, "vprorv", rotr>;
defm VPROLV : avx512_var_shift_types<0x15, "vprolv", rotl>;
defm : avx512_var_shift_lowering<avx512vl_i64_info, "VPSRAVQ", sra, [HasAVX512, NoVLX]>;
defm : avx512_var_shift_lowering<avx512vl_i16_info, "VPSLLVW", shl, [HasBWI, NoVLX]>;
defm : avx512_var_shift_lowering<avx512vl_i16_info, "VPSRAVW", sra, [HasBWI, NoVLX]>;
defm : avx512_var_shift_lowering<avx512vl_i16_info, "VPSRLVW", srl, [HasBWI, NoVLX]>;
// Special handling for the VPSRAV intrinsics.
multiclass avx512_var_shift_int_lowering<string InstrStr, X86VectorVTInfo _,
list<Predicate> p> {
let Predicates = p in {
def : Pat<(_.VT (X86vsrav _.RC:$src1, _.RC:$src2)),
(!cast<Instruction>(InstrStr#_.ZSuffix#rr) _.RC:$src1,
_.RC:$src2)>;
def : Pat<(_.VT (X86vsrav _.RC:$src1, (bitconvert (_.LdFrag addr:$src2)))),
(!cast<Instruction>(InstrStr#_.ZSuffix##rm)
_.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1, _.RC:$src2), _.RC:$src0)),
(!cast<Instruction>(InstrStr#_.ZSuffix#rrk) _.RC:$src0,
_.KRC:$mask, _.RC:$src1, _.RC:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1, (bitconvert (_.LdFrag addr:$src2))),
_.RC:$src0)),
(!cast<Instruction>(InstrStr#_.ZSuffix##rmk) _.RC:$src0,
_.KRC:$mask, _.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1, _.RC:$src2), _.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#_.ZSuffix#rrkz) _.KRC:$mask,
_.RC:$src1, _.RC:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1, (bitconvert (_.LdFrag addr:$src2))),
_.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#_.ZSuffix##rmkz) _.KRC:$mask,
_.RC:$src1, addr:$src2)>;
}
}
multiclass avx512_var_shift_int_lowering_mb<string InstrStr, X86VectorVTInfo _,
list<Predicate> p> :
avx512_var_shift_int_lowering<InstrStr, _, p> {
let Predicates = p in {
def : Pat<(_.VT (X86vsrav _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src2)))),
(!cast<Instruction>(InstrStr#_.ZSuffix##rmb)
_.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src2))),
_.RC:$src0)),
(!cast<Instruction>(InstrStr#_.ZSuffix##rmbk) _.RC:$src0,
_.KRC:$mask, _.RC:$src1, addr:$src2)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(X86vsrav _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src2))),
_.ImmAllZerosV)),
(!cast<Instruction>(InstrStr#_.ZSuffix##rmbkz) _.KRC:$mask,
_.RC:$src1, addr:$src2)>;
}
}
defm : avx512_var_shift_int_lowering<"VPSRAVW", v8i16x_info, [HasVLX, HasBWI]>;
defm : avx512_var_shift_int_lowering<"VPSRAVW", v16i16x_info, [HasVLX, HasBWI]>;
defm : avx512_var_shift_int_lowering<"VPSRAVW", v32i16_info, [HasBWI]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVD", v4i32x_info, [HasVLX]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVD", v8i32x_info, [HasVLX]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVD", v16i32_info, [HasAVX512]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVQ", v2i64x_info, [HasVLX]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVQ", v4i64x_info, [HasVLX]>;
defm : avx512_var_shift_int_lowering_mb<"VPSRAVQ", v8i64_info, [HasAVX512]>;
// Use the 512-bit VPROL/VPROLI forms to implement v2i64/v4i64 and
// v4i32/v8i32 rotates when VLX is unavailable (NoVLX).
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v2i64 (rotl (v2i64 VR128X:$src1), (v2i64 VR128X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPROLVQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src2, sub_xmm))),
sub_xmm)>;
def : Pat<(v4i64 (rotl (v4i64 VR256X:$src1), (v4i64 VR256X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPROLVQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
sub_ymm)>;
def : Pat<(v4i32 (rotl (v4i32 VR128X:$src1), (v4i32 VR128X:$src2))),
(EXTRACT_SUBREG (v16i32
(VPROLVDZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src2, sub_xmm))),
sub_xmm)>;
def : Pat<(v8i32 (rotl (v8i32 VR256X:$src1), (v8i32 VR256X:$src2))),
(EXTRACT_SUBREG (v16i32
(VPROLVDZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
sub_ymm)>;
def : Pat<(v2i64 (X86vrotli (v2i64 VR128X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPROLQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
imm:$src2)), sub_xmm)>;
def : Pat<(v4i64 (X86vrotli (v4i64 VR256X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPROLQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
imm:$src2)), sub_ymm)>;
def : Pat<(v4i32 (X86vrotli (v4i32 VR128X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v16i32
(VPROLDZri
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
imm:$src2)), sub_xmm)>;
def : Pat<(v8i32 (X86vrotli (v8i32 VR256X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v16i32
(VPROLDZri
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
imm:$src2)), sub_ymm)>;
}
// Use the 512-bit VPROR/VPRORI forms to implement v2i64/v4i64 and
// v4i32/v8i32 rotates when VLX is unavailable (NoVLX).
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v2i64 (rotr (v2i64 VR128X:$src1), (v2i64 VR128X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPRORVQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src2, sub_xmm))),
sub_xmm)>;
def : Pat<(v4i64 (rotr (v4i64 VR256X:$src1), (v4i64 VR256X:$src2))),
(EXTRACT_SUBREG (v8i64
(VPRORVQZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
sub_ymm)>;
def : Pat<(v4i32 (rotr (v4i32 VR128X:$src1), (v4i32 VR128X:$src2))),
(EXTRACT_SUBREG (v16i32
(VPRORVDZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src2, sub_xmm))),
sub_xmm)>;
def : Pat<(v8i32 (rotr (v8i32 VR256X:$src1), (v8i32 VR256X:$src2))),
(EXTRACT_SUBREG (v16i32
(VPRORVDZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
(INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src2, sub_ymm))),
sub_ymm)>;
def : Pat<(v2i64 (X86vrotri (v2i64 VR128X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPRORQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
imm:$src2)), sub_xmm)>;
def : Pat<(v4i64 (X86vrotri (v4i64 VR256X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v8i64
(VPRORQZri
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
imm:$src2)), sub_ymm)>;
def : Pat<(v4i32 (X86vrotri (v4i32 VR128X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v16i32
(VPRORDZri
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR128X:$src1, sub_xmm)),
imm:$src2)), sub_xmm)>;
def : Pat<(v8i32 (X86vrotri (v8i32 VR256X:$src1), (i8 imm:$src2))),
(EXTRACT_SUBREG (v16i32
(VPRORDZri
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF), VR256X:$src1, sub_ymm)),
imm:$src2)), sub_ymm)>;
}
//===-------------------------------------------------------------------===//
// 1-src variable and immediate permutations VPERMB/W/D/Q, VPERMPS/PD
//===-------------------------------------------------------------------===//
multiclass avx512_vperm_dq_sizes<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo _> {
let Predicates = [HasAVX512] in
defm Z : avx512_var_shift<opc, OpcodeStr, OpNode, _.info512>,
avx512_var_shift_mb<opc, OpcodeStr, OpNode, _.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in
defm Z256 : avx512_var_shift<opc, OpcodeStr, OpNode, _.info256>,
avx512_var_shift_mb<opc, OpcodeStr, OpNode, _.info256>, EVEX_V256;
}
multiclass avx512_vpermi_dq_sizes<bits<8> opc, Format ImmFormR, Format ImmFormM,
string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo> {
let Predicates = [HasAVX512] in
defm Z: avx512_shift_rmi<opc, ImmFormR, ImmFormM, OpcodeStr, OpNode,
VTInfo.info512>,
avx512_shift_rmbi<opc, ImmFormM, OpcodeStr, OpNode,
VTInfo.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in
defm Z256: avx512_shift_rmi<opc, ImmFormR, ImmFormM, OpcodeStr, OpNode,
VTInfo.info256>,
avx512_shift_rmbi<opc, ImmFormM, OpcodeStr, OpNode,
VTInfo.info256>, EVEX_V256;
}
multiclass avx512_vperm_bw<bits<8> opc, string OpcodeStr,
Predicate prd, SDNode OpNode,
AVX512VLVectorVTInfo _> {
let Predicates = [prd] in
defm Z: avx512_var_shift<opc, OpcodeStr, OpNode, _.info512>,
EVEX_V512 ;
let Predicates = [HasVLX, prd] in {
defm Z256: avx512_var_shift<opc, OpcodeStr, OpNode, _.info256>,
EVEX_V256 ;
defm Z128: avx512_var_shift<opc, OpcodeStr, OpNode, _.info128>,
EVEX_V128 ;
}
}
defm VPERMW : avx512_vperm_bw<0x8D, "vpermw", HasBWI, X86VPermv,
avx512vl_i16_info>, VEX_W;
defm VPERMB : avx512_vperm_bw<0x8D, "vpermb", HasVBMI, X86VPermv,
avx512vl_i8_info>;
defm VPERMD : avx512_vperm_dq_sizes<0x36, "vpermd", X86VPermv,
avx512vl_i32_info>;
defm VPERMQ : avx512_vperm_dq_sizes<0x36, "vpermq", X86VPermv,
avx512vl_i64_info>, VEX_W;
defm VPERMPS : avx512_vperm_dq_sizes<0x16, "vpermps", X86VPermv,
avx512vl_f32_info>;
defm VPERMPD : avx512_vperm_dq_sizes<0x16, "vpermpd", X86VPermv,
avx512vl_f64_info>, VEX_W;
defm VPERMQ : avx512_vpermi_dq_sizes<0x00, MRMSrcReg, MRMSrcMem, "vpermq",
X86VPermi, avx512vl_i64_info>,
EVEX, AVX512AIi8Base, EVEX_CD8<64, CD8VF>, VEX_W;
defm VPERMPD : avx512_vpermi_dq_sizes<0x01, MRMSrcReg, MRMSrcMem, "vpermpd",
X86VPermi, avx512vl_f64_info>,
EVEX, AVX512AIi8Base, EVEX_CD8<64, CD8VF>, VEX_W;
//===----------------------------------------------------------------------===//
// AVX-512 - VPERMIL
//===----------------------------------------------------------------------===//
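// Semantics reminder for the patterns below: VPERMILPS/PD shuffle elements
// only *within* each 128-bit lane, driven either by an immediate or by a
// per-element control vector. Illustrative example:
//   vpermilps $0x1b, %zmm0, %zmm1   ; reverse the four floats of every lane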
multiclass avx512_permil_vec<bits<8> OpcVar, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, X86VectorVTInfo Ctrl> {
defm rr: AVX512_maskable<OpcVar, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, Ctrl.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode _.RC:$src1,
(Ctrl.VT Ctrl.RC:$src2)))>,
T8PD, EVEX_4V;
defm rm: AVX512_maskable<OpcVar, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, Ctrl.MemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode
_.RC:$src1,
(Ctrl.VT (bitconvert(Ctrl.LdFrag addr:$src2)))))>,
T8PD, EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
defm rmb: AVX512_maskable<OpcVar, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr,
(_.VT (OpNode
_.RC:$src1,
(Ctrl.VT (X86VBroadcast
(Ctrl.ScalarLdFrag addr:$src2)))))>,
T8PD, EVEX_4V, EVEX_B, EVEX_CD8<_.EltSize, CD8VF>;
}
multiclass avx512_permil_vec_common<string OpcodeStr, bits<8> OpcVar,
AVX512VLVectorVTInfo _, AVX512VLVectorVTInfo Ctrl>{
let Predicates = [HasAVX512] in {
defm Z : avx512_permil_vec<OpcVar, OpcodeStr, X86VPermilpv, _.info512,
Ctrl.info512>, EVEX_V512;
}
let Predicates = [HasAVX512, HasVLX] in {
defm Z128 : avx512_permil_vec<OpcVar, OpcodeStr, X86VPermilpv, _.info128,
Ctrl.info128>, EVEX_V128;
defm Z256 : avx512_permil_vec<OpcVar, OpcodeStr, X86VPermilpv, _.info256,
Ctrl.info256>, EVEX_V256;
}
}
multiclass avx512_permil<string OpcodeStr, bits<8> OpcImm, bits<8> OpcVar,
AVX512VLVectorVTInfo _, AVX512VLVectorVTInfo Ctrl>{
defm NAME: avx512_permil_vec_common<OpcodeStr, OpcVar, _, Ctrl>;
defm NAME: avx512_shift_rmi_sizes<OpcImm, MRMSrcReg, MRMSrcMem, OpcodeStr,
X86VPermilpi, _>,
EVEX, AVX512AIi8Base, EVEX_CD8<_.info128.EltSize, CD8VF>;
}
let ExeDomain = SSEPackedSingle in
defm VPERMILPS : avx512_permil<"vpermilps", 0x04, 0x0C, avx512vl_f32_info,
avx512vl_i32_info>;
let ExeDomain = SSEPackedDouble in
defm VPERMILPD : avx512_permil<"vpermilpd", 0x05, 0x0D, avx512vl_f64_info,
avx512vl_i64_info>, VEX_W;
//===----------------------------------------------------------------------===//
// AVX-512 - VPSHUFD, VPSHUFLW, VPSHUFHW
//===----------------------------------------------------------------------===//
defm VPSHUFD : avx512_shift_rmi_sizes<0x70, MRMSrcReg, MRMSrcMem, "vpshufd",
X86PShufd, avx512vl_i32_info>,
EVEX, AVX512BIi8Base, EVEX_CD8<32, CD8VF>;
defm VPSHUFH : avx512_shift_rmi_w<0x70, MRMSrcReg, MRMSrcMem, "vpshufhw",
X86PShufhw>, EVEX, AVX512XSIi8Base;
defm VPSHUFL : avx512_shift_rmi_w<0x70, MRMSrcReg, MRMSrcMem, "vpshuflw",
X86PShuflw>, EVEX, AVX512XDIi8Base;
multiclass avx512_pshufb_sizes<bits<8> opc, string OpcodeStr, SDNode OpNode> {
let Predicates = [HasBWI] in
defm Z: avx512_var_shift<opc, OpcodeStr, OpNode, v64i8_info>, EVEX_V512;
let Predicates = [HasVLX, HasBWI] in {
defm Z256: avx512_var_shift<opc, OpcodeStr, OpNode, v32i8x_info>, EVEX_V256;
defm Z128: avx512_var_shift<opc, OpcodeStr, OpNode, v16i8x_info>, EVEX_V128;
}
}
defm VPSHUFB: avx512_pshufb_sizes<0x00, "vpshufb", X86pshufb>;
//===----------------------------------------------------------------------===//
// Move Low to High and High to Low packed FP Instructions
//===----------------------------------------------------------------------===//
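// Semantics sketch for the two register-only instructions defined below,
// written as f32 element vectors:
//   vmovlhps: dst = { src1[0], src1[1], src2[0], src2[1] }
//   vmovhlps: dst = { src2[2], src2[3], src1[2], src1[3] }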
def VMOVLHPSZrr : AVX512PSI<0x16, MRMSrcReg, (outs VR128X:$dst),
(ins VR128X:$src1, VR128X:$src2),
"vmovlhps\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set VR128X:$dst, (v4f32 (X86Movlhps VR128X:$src1, VR128X:$src2)))],
IIC_SSE_MOV_LH>, EVEX_4V;
def VMOVHLPSZrr : AVX512PSI<0x12, MRMSrcReg, (outs VR128X:$dst),
(ins VR128X:$src1, VR128X:$src2),
"vmovhlps\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set VR128X:$dst, (v4f32 (X86Movhlps VR128X:$src1, VR128X:$src2)))],
IIC_SSE_MOV_LH>, EVEX_4V;
let Predicates = [HasAVX512] in {
// MOVLHPS patterns
def : Pat<(v4i32 (X86Movlhps VR128X:$src1, VR128X:$src2)),
(VMOVLHPSZrr VR128X:$src1, VR128X:$src2)>;
def : Pat<(v2i64 (X86Movlhps VR128X:$src1, VR128X:$src2)),
(VMOVLHPSZrr (v2i64 VR128X:$src1), VR128X:$src2)>;
// MOVHLPS patterns
def : Pat<(v4i32 (X86Movhlps VR128X:$src1, VR128X:$src2)),
(VMOVHLPSZrr VR128X:$src1, VR128X:$src2)>;
}
//===----------------------------------------------------------------------===//
// VMOVHPS/PD VMOVLPS Instructions
// All patterns were taken from the SSE implementation.
//===----------------------------------------------------------------------===//
multiclass avx512_mov_hilo_packed<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
def rm : AVX512<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src1, f64mem:$src2),
!strconcat(OpcodeStr,
"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.RC:$dst,
(OpNode _.RC:$src1,
(_.VT (bitconvert
(v2f64 (scalar_to_vector (loadf64 addr:$src2)))))))],
IIC_SSE_MOV_LH>, EVEX_4V;
}
defm VMOVHPSZ128 : avx512_mov_hilo_packed<0x16, "vmovhps", X86Movlhps,
v4f32x_info>, EVEX_CD8<32, CD8VT2>, PS;
defm VMOVHPDZ128 : avx512_mov_hilo_packed<0x16, "vmovhpd", X86Movlhpd,
v2f64x_info>, EVEX_CD8<64, CD8VT1>, PD, VEX_W;
defm VMOVLPSZ128 : avx512_mov_hilo_packed<0x12, "vmovlps", X86Movlps,
v4f32x_info>, EVEX_CD8<32, CD8VT2>, PS;
defm VMOVLPDZ128 : avx512_mov_hilo_packed<0x12, "vmovlpd", X86Movlpd,
v2f64x_info>, EVEX_CD8<64, CD8VT1>, PD, VEX_W;
let Predicates = [HasAVX512] in {
// VMOVHPS patterns
def : Pat<(X86Movlhps VR128X:$src1,
(bc_v4f32 (v2i64 (scalar_to_vector (loadi64 addr:$src2))))),
(VMOVHPSZ128rm VR128X:$src1, addr:$src2)>;
def : Pat<(X86Movlhps VR128X:$src1,
(bc_v4i32 (v2i64 (X86vzload addr:$src2)))),
(VMOVHPSZ128rm VR128X:$src1, addr:$src2)>;
// VMOVHPD patterns
def : Pat<(v2f64 (X86Unpckl VR128X:$src1,
(scalar_to_vector (loadf64 addr:$src2)))),
(VMOVHPDZ128rm VR128X:$src1, addr:$src2)>;
def : Pat<(v2f64 (X86Unpckl VR128X:$src1,
(bc_v2f64 (v2i64 (scalar_to_vector (loadi64 addr:$src2)))))),
(VMOVHPDZ128rm VR128X:$src1, addr:$src2)>;
// VMOVLPS patterns
def : Pat<(v4f32 (X86Movlps VR128X:$src1, (load addr:$src2))),
(VMOVLPSZ128rm VR128X:$src1, addr:$src2)>;
def : Pat<(v4i32 (X86Movlps VR128X:$src1, (load addr:$src2))),
(VMOVLPSZ128rm VR128X:$src1, addr:$src2)>;
// VMOVLPD patterns
def : Pat<(v2f64 (X86Movlpd VR128X:$src1, (load addr:$src2))),
(VMOVLPDZ128rm VR128X:$src1, addr:$src2)>;
def : Pat<(v2i64 (X86Movlpd VR128X:$src1, (load addr:$src2))),
(VMOVLPDZ128rm VR128X:$src1, addr:$src2)>;
def : Pat<(v2f64 (X86Movsd VR128X:$src1,
(v2f64 (scalar_to_vector (loadf64 addr:$src2))))),
(VMOVLPDZ128rm VR128X:$src1, addr:$src2)>;
}
def VMOVHPSZ128mr : AVX512PSI<0x17, MRMDestMem, (outs),
(ins f64mem:$dst, VR128X:$src),
"vmovhps\t{$src, $dst|$dst, $src}",
[(store (f64 (extractelt
(X86Unpckh (bc_v2f64 (v4f32 VR128X:$src)),
(bc_v2f64 (v4f32 VR128X:$src))),
(iPTR 0))), addr:$dst)], IIC_SSE_MOV_LH>,
EVEX, EVEX_CD8<32, CD8VT2>;
def VMOVHPDZ128mr : AVX512PDI<0x17, MRMDestMem, (outs),
(ins f64mem:$dst, VR128X:$src),
"vmovhpd\t{$src, $dst|$dst, $src}",
[(store (f64 (extractelt
(v2f64 (X86Unpckh VR128X:$src, VR128X:$src)),
(iPTR 0))), addr:$dst)], IIC_SSE_MOV_LH>,
EVEX, EVEX_CD8<64, CD8VT1>, VEX_W;
def VMOVLPSZ128mr : AVX512PSI<0x13, MRMDestMem, (outs),
(ins f64mem:$dst, VR128X:$src),
"vmovlps\t{$src, $dst|$dst, $src}",
[(store (f64 (extractelt (bc_v2f64 (v4f32 VR128X:$src)),
(iPTR 0))), addr:$dst)],
IIC_SSE_MOV_LH>,
EVEX, EVEX_CD8<32, CD8VT2>;
def VMOVLPDZ128mr : AVX512PDI<0x13, MRMDestMem, (outs),
(ins f64mem:$dst, VR128X:$src),
"vmovlpd\t{$src, $dst|$dst, $src}",
[(store (f64 (extractelt (v2f64 VR128X:$src),
(iPTR 0))), addr:$dst)],
IIC_SSE_MOV_LH>,
EVEX, EVEX_CD8<64, CD8VT1>, VEX_W;
let Predicates = [HasAVX512] in {
// VMOVHPD patterns
def : Pat<(store (f64 (extractelt
(v2f64 (X86VPermilpi VR128X:$src, (i8 1))),
(iPTR 0))), addr:$dst),
(VMOVHPDZ128mr addr:$dst, VR128X:$src)>;
// VMOVLPS patterns
def : Pat<(store (v4f32 (X86Movlps (load addr:$src1), VR128X:$src2)),
addr:$src1),
(VMOVLPSZ128mr addr:$src1, VR128X:$src2)>;
def : Pat<(store (v4i32 (X86Movlps
(bc_v4i32 (loadv2i64 addr:$src1)), VR128X:$src2)), addr:$src1),
(VMOVLPSZ128mr addr:$src1, VR128X:$src2)>;
// VMOVLPD patterns
def : Pat<(store (v2f64 (X86Movlpd (load addr:$src1), VR128X:$src2)),
addr:$src1),
(VMOVLPDZ128mr addr:$src1, VR128X:$src2)>;
def : Pat<(store (v2i64 (X86Movlpd (load addr:$src1), VR128X:$src2)),
addr:$src1),
(VMOVLPDZ128mr addr:$src1, VR128X:$src2)>;
}
//===----------------------------------------------------------------------===//
// FMA - Fused Multiply Operations
//
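// Quick reference for the 132/213/231 operand-order convention used by all
// of the multiclasses in this section (dst is tied to src1):
//   vfmadd132: dst = src1 * src3 + src2
//   vfmadd213: dst = src2 * src1 + src3
//   vfmadd231: dst = src2 * src3 + src1
// The memory operand is always src3, so the three forms exist to let any of
// the three logical operands be the one loaded from memory.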
multiclass avx512_fma3p_213_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src2, _.RC:$src1, _.RC:$src3)), 1, 1>,
AVX512FMA3Base;
defm m: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src2, _.RC:$src1, (_.LdFrag addr:$src3))), 1, 0>,
AVX512FMA3Base;
defm mb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, !strconcat("${src3}", _.BroadcastStr,", $src2"),
!strconcat("$src2, ${src3}", _.BroadcastStr ),
(OpNode _.RC:$src2,
_.RC:$src1,(_.VT (X86VBroadcast (_.ScalarLdFrag addr:$src3)))), 1, 0>,
AVX512FMA3Base, EVEX_B;
}
// Additional pattern for folding broadcast nodes in other orders.
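// (Illustrative note: the multiply operands commute, so a DAG in which the
// broadcast load appears as a different multiplicand can still be matched
// to the same broadcast-from-memory instruction form.)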
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src1, _.RC:$src2,
(X86VBroadcast (_.ScalarLdFrag addr:$src3))),
_.RC:$src1)),
(!cast<Instruction>(NAME#Suff#_.ZSuffix#mbk) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3)>;
}
multiclass avx512_fma3_213_round<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, AVX512RC:$rc),
OpcodeStr, "$rc, $src3, $src2", "$src2, $src3, $rc",
(_.VT ( OpNode _.RC:$src2, _.RC:$src1, _.RC:$src3, (i32 imm:$rc))), 1, 1>,
AVX512FMA3Base, EVEX_B, EVEX_RC;
}
multiclass avx512_fma3p_213_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd, AVX512VLVectorVTInfo _,
string Suff> {
let Predicates = [HasAVX512] in {
defm Z : avx512_fma3p_213_rm<opc, OpcodeStr, OpNode, _.info512, Suff>,
avx512_fma3_213_round<opc, OpcodeStr, OpNodeRnd, _.info512,
Suff>, EVEX_V512, EVEX_CD8<_.info512.EltSize, CD8VF>;
}
let Predicates = [HasVLX, HasAVX512] in {
defm Z256 : avx512_fma3p_213_rm<opc, OpcodeStr, OpNode, _.info256, Suff>,
EVEX_V256, EVEX_CD8<_.info256.EltSize, CD8VF>;
defm Z128 : avx512_fma3p_213_rm<opc, OpcodeStr, OpNode, _.info128, Suff>,
EVEX_V128, EVEX_CD8<_.info128.EltSize, CD8VF>;
}
}
multiclass avx512_fma3p_213_f<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd > {
defm PS : avx512_fma3p_213_common<opc, OpcodeStr#"ps", OpNode, OpNodeRnd,
avx512vl_f32_info, "PS">;
defm PD : avx512_fma3p_213_common<opc, OpcodeStr#"pd", OpNode, OpNodeRnd,
avx512vl_f64_info, "PD">, VEX_W;
}
defm VFMADD213 : avx512_fma3p_213_f<0xA8, "vfmadd213", X86Fmadd, X86FmaddRnd>;
defm VFMSUB213 : avx512_fma3p_213_f<0xAA, "vfmsub213", X86Fmsub, X86FmsubRnd>;
defm VFMADDSUB213 : avx512_fma3p_213_f<0xA6, "vfmaddsub213", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD213 : avx512_fma3p_213_f<0xA7, "vfmsubadd213", X86Fmsubadd, X86FmsubaddRnd>;
defm VFNMADD213 : avx512_fma3p_213_f<0xAC, "vfnmadd213", X86Fnmadd, X86FnmaddRnd>;
defm VFNMSUB213 : avx512_fma3p_213_f<0xAE, "vfnmsub213", X86Fnmsub, X86FnmsubRnd>;
multiclass avx512_fma3p_231_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src2, _.RC:$src3, _.RC:$src1)), 1, 1>,
AVX512FMA3Base;
defm m: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src2, (_.LdFrag addr:$src3), _.RC:$src1)), 1, 0>,
AVX512FMA3Base;
defm mb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, "${src3}"##_.BroadcastStr##", $src2",
"$src2, ${src3}"##_.BroadcastStr,
(_.VT (OpNode _.RC:$src2,
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src3))),
_.RC:$src1)), 1, 0>, AVX512FMA3Base, EVEX_B;
}
// Additional patterns for folding broadcast nodes in other orders.
def : Pat<(_.VT (OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1)),
(!cast<Instruction>(NAME#Suff#_.ZSuffix#mb) _.RC:$src1,
_.RC:$src2, addr:$src3)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1),
_.RC:$src1)),
(!cast<Instruction>(NAME#Suff#_.ZSuffix#mbk) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3)>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1),
_.ImmAllZerosV)),
(!cast<Instruction>(NAME#Suff#_.ZSuffix#mbkz) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3)>;
}
multiclass avx512_fma3_231_round<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, AVX512RC:$rc),
OpcodeStr, "$rc, $src3, $src2", "$src2, $src3, $rc",
(_.VT ( OpNode _.RC:$src2, _.RC:$src3, _.RC:$src1, (i32 imm:$rc))), 1, 1>,
AVX512FMA3Base, EVEX_B, EVEX_RC;
}
multiclass avx512_fma3p_231_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd, AVX512VLVectorVTInfo _,
string Suff> {
let Predicates = [HasAVX512] in {
defm Z : avx512_fma3p_231_rm<opc, OpcodeStr, OpNode, _.info512, Suff>,
avx512_fma3_231_round<opc, OpcodeStr, OpNodeRnd, _.info512,
Suff>, EVEX_V512, EVEX_CD8<_.info512.EltSize, CD8VF>;
}
let Predicates = [HasVLX, HasAVX512] in {
defm Z256 : avx512_fma3p_231_rm<opc, OpcodeStr, OpNode, _.info256, Suff>,
EVEX_V256, EVEX_CD8<_.info256.EltSize, CD8VF>;
defm Z128 : avx512_fma3p_231_rm<opc, OpcodeStr, OpNode, _.info128, Suff>,
EVEX_V128, EVEX_CD8<_.info128.EltSize, CD8VF>;
}
}
multiclass avx512_fma3p_231_f<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd > {
defm PS : avx512_fma3p_231_common<opc, OpcodeStr#"ps", OpNode, OpNodeRnd,
avx512vl_f32_info, "PS">;
defm PD : avx512_fma3p_231_common<opc, OpcodeStr#"pd", OpNode, OpNodeRnd,
avx512vl_f64_info, "PD">, VEX_W;
}
defm VFMADD231 : avx512_fma3p_231_f<0xB8, "vfmadd231", X86Fmadd, X86FmaddRnd>;
defm VFMSUB231 : avx512_fma3p_231_f<0xBA, "vfmsub231", X86Fmsub, X86FmsubRnd>;
defm VFMADDSUB231 : avx512_fma3p_231_f<0xB6, "vfmaddsub231", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD231 : avx512_fma3p_231_f<0xB7, "vfmsubadd231", X86Fmsubadd, X86FmsubaddRnd>;
defm VFNMADD231 : avx512_fma3p_231_f<0xBC, "vfnmadd231", X86Fnmadd, X86FnmaddRnd>;
defm VFNMSUB231 : avx512_fma3p_231_f<0xBE, "vfnmsub231", X86Fnmsub, X86FnmsubRnd>;
multiclass avx512_fma3p_132_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src1, _.RC:$src3, _.RC:$src2)), 1, 1>,
AVX512FMA3Base;
defm m: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src1, (_.LdFrag addr:$src3), _.RC:$src2)), 1, 0>,
AVX512FMA3Base;
defm mb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, "${src3}"##_.BroadcastStr##", $src2",
"$src2, ${src3}"##_.BroadcastStr,
(_.VT (OpNode _.RC:$src1,
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src3))),
_.RC:$src2)), 1, 0>, AVX512FMA3Base, EVEX_B;
}
// Additional patterns for folding broadcast nodes in other orders.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src1, _.RC:$src2),
_.RC:$src1)),
(!cast<Instruction>(NAME#Suff#_.ZSuffix#mbk) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3)>;
}
multiclass avx512_fma3_132_round<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, string Suff> {
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, AVX512RC:$rc),
OpcodeStr, "$rc, $src3, $src2", "$src2, $src3, $rc",
(_.VT ( OpNode _.RC:$src1, _.RC:$src3, _.RC:$src2, (i32 imm:$rc))), 1, 1>,
AVX512FMA3Base, EVEX_B, EVEX_RC;
}
multiclass avx512_fma3p_132_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd, AVX512VLVectorVTInfo _,
string Suff> {
let Predicates = [HasAVX512] in {
defm Z : avx512_fma3p_132_rm<opc, OpcodeStr, OpNode, _.info512, Suff>,
avx512_fma3_132_round<opc, OpcodeStr, OpNodeRnd, _.info512,
Suff>, EVEX_V512, EVEX_CD8<_.info512.EltSize, CD8VF>;
}
let Predicates = [HasVLX, HasAVX512] in {
defm Z256 : avx512_fma3p_132_rm<opc, OpcodeStr, OpNode, _.info256, Suff>,
EVEX_V256, EVEX_CD8<_.info256.EltSize, CD8VF>;
defm Z128 : avx512_fma3p_132_rm<opc, OpcodeStr, OpNode, _.info128, Suff>,
EVEX_V128, EVEX_CD8<_.info128.EltSize, CD8VF>;
}
}
multiclass avx512_fma3p_132_f<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd > {
defm PS : avx512_fma3p_132_common<opc, OpcodeStr#"ps", OpNode, OpNodeRnd,
avx512vl_f32_info, "PS">;
defm PD : avx512_fma3p_132_common<opc, OpcodeStr#"pd", OpNode, OpNodeRnd,
avx512vl_f64_info, "PD">, VEX_W;
}
defm VFMADD132 : avx512_fma3p_132_f<0x98, "vfmadd132", X86Fmadd, X86FmaddRnd>;
defm VFMSUB132 : avx512_fma3p_132_f<0x9A, "vfmsub132", X86Fmsub, X86FmsubRnd>;
defm VFMADDSUB132 : avx512_fma3p_132_f<0x96, "vfmaddsub132", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD132 : avx512_fma3p_132_f<0x97, "vfmsubadd132", X86Fmsubadd, X86FmsubaddRnd>;
defm VFNMADD132 : avx512_fma3p_132_f<0x9C, "vfnmadd132", X86Fnmadd, X86FnmaddRnd>;
defm VFNMSUB132 : avx512_fma3p_132_f<0x9E, "vfnmsub132", X86Fnmsub, X86FnmsubRnd>;
// Scalar FMA
let Constraints = "$src1 = $dst" in {
multiclass avx512_fma3s_common<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
dag RHS_VEC_r, dag RHS_VEC_m, dag RHS_VEC_rb,
dag RHS_r, dag RHS_m > {
defm r_Int: AVX512_maskable_3src_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3), OpcodeStr,
"$src3, $src2", "$src2, $src3", RHS_VEC_r, 1, 1>, AVX512FMA3Base;
defm m_Int: AVX512_maskable_3src_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.IntScalarMemOp:$src3), OpcodeStr,
"$src3, $src2", "$src2, $src3", RHS_VEC_m, 1, 1>, AVX512FMA3Base;
defm rb_Int: AVX512_maskable_3src_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, AVX512RC:$rc),
OpcodeStr, "$rc, $src3, $src2", "$src2, $src3, $rc", RHS_VEC_rb, 1, 1>,
AVX512FMA3Base, EVEX_B, EVEX_RC;
let isCodeGenOnly = 1, isCommutable = 1 in {
def r : AVX512FMA3<opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2, _.FRC:$src3),
!strconcat(OpcodeStr,
"\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
[RHS_r]>;
def m : AVX512FMA3<opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2, _.ScalarMemOp:$src3),
!strconcat(OpcodeStr,
"\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
[RHS_m]>;
}// isCodeGenOnly = 1
}
}// Constraints = "$src1 = $dst"
multiclass avx512_fma3s_all<bits<8> opc213, bits<8> opc231, bits<8> opc132,
string OpcodeStr, SDNode OpNode, SDNode OpNodeRnds1,
SDNode OpNodeRnds3, X86VectorVTInfo _ , string SUFF> {
let ExeDomain = _.ExeDomain in {
defm NAME#213#SUFF#Z: avx512_fma3s_common<opc213, OpcodeStr#"213"#_.Suffix , _ ,
// Operands for intrinsic are in 123 order to preserve passthru
// semantics.
(_.VT (OpNodeRnds1 _.RC:$src1, _.RC:$src2, _.RC:$src3, (i32 FROUND_CURRENT))),
(_.VT (OpNodeRnds1 _.RC:$src1, _.RC:$src2,
_.ScalarIntMemCPat:$src3, (i32 FROUND_CURRENT))),
(_.VT (OpNodeRnds1 _.RC:$src1, _.RC:$src2, _.RC:$src3,
(i32 imm:$rc))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src2, _.FRC:$src1,
_.FRC:$src3))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src2, _.FRC:$src1,
(_.ScalarLdFrag addr:$src3))))>;
defm NAME#231#SUFF#Z: avx512_fma3s_common<opc231, OpcodeStr#"231"#_.Suffix , _ ,
(_.VT (OpNodeRnds3 _.RC:$src2, _.RC:$src3, _.RC:$src1, (i32 FROUND_CURRENT))),
(_.VT (OpNodeRnds3 _.RC:$src2, _.ScalarIntMemCPat:$src3,
_.RC:$src1, (i32 FROUND_CURRENT))),
(_.VT ( OpNodeRnds3 _.RC:$src2, _.RC:$src3, _.RC:$src1,
(i32 imm:$rc))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src2, _.FRC:$src3,
_.FRC:$src1))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src2,
(_.ScalarLdFrag addr:$src3), _.FRC:$src1)))>;
defm NAME#132#SUFF#Z: avx512_fma3s_common<opc132, OpcodeStr#"132"#_.Suffix , _ ,
(_.VT (OpNodeRnds1 _.RC:$src1, _.RC:$src3, _.RC:$src2, (i32 FROUND_CURRENT))),
(_.VT (OpNodeRnds1 _.RC:$src1, _.ScalarIntMemCPat:$src3,
_.RC:$src2, (i32 FROUND_CURRENT))),
(_.VT (OpNodeRnds1 _.RC:$src1, _.RC:$src3, _.RC:$src2,
(i32 imm:$rc))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src1, _.FRC:$src3,
_.FRC:$src2))),
(set _.FRC:$dst, (_.EltVT (OpNode _.FRC:$src1,
(_.ScalarLdFrag addr:$src3), _.FRC:$src2)))>;
}
}
multiclass avx512_fma3s<bits<8> opc213, bits<8> opc231, bits<8> opc132,
string OpcodeStr, SDNode OpNode, SDNode OpNodeRnds1,
SDNode OpNodeRnds3> {
let Predicates = [HasAVX512] in {
defm NAME : avx512_fma3s_all<opc213, opc231, opc132, OpcodeStr, OpNode,
OpNodeRnds1, OpNodeRnds3, f32x_info, "SS">,
EVEX_CD8<32, CD8VT1>, VEX_LIG;
defm NAME : avx512_fma3s_all<opc213, opc231, opc132, OpcodeStr, OpNode,
OpNodeRnds1, OpNodeRnds3, f64x_info, "SD">,
EVEX_CD8<64, CD8VT1>, VEX_LIG, VEX_W;
}
}
defm VFMADD : avx512_fma3s<0xA9, 0xB9, 0x99, "vfmadd", X86Fmadd, X86FmaddRnds1,
X86FmaddRnds3>;
defm VFMSUB : avx512_fma3s<0xAB, 0xBB, 0x9B, "vfmsub", X86Fmsub, X86FmsubRnds1,
X86FmsubRnds3>;
defm VFNMADD : avx512_fma3s<0xAD, 0xBD, 0x9D, "vfnmadd", X86Fnmadd,
X86FnmaddRnds1, X86FnmaddRnds3>;
defm VFNMSUB : avx512_fma3s<0xAF, 0xBF, 0x9F, "vfnmsub", X86Fnmsub,
X86FnmsubRnds1, X86FnmsubRnds3>;
//===----------------------------------------------------------------------===//
// AVX-512 Packed Multiply of Unsigned 52-bit Integers and Add the Low 52 Bits (IFMA)
//===----------------------------------------------------------------------===//
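// Per-qword semantics sketch for the instructions below:
//   vpmadd52luq: dst += low52 ( zext52(src2) * zext52(src3) )
//   vpmadd52huq: dst += high52( zext52(src2) * zext52(src3) )
// i.e. an unsigned 52x52 -> 104-bit multiply whose low or high 52 bits are
// accumulated into the 64-bit destination elements.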
let Constraints = "$src1 = $dst" in {
multiclass avx512_pmadd52_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src1, _.RC:$src2, _.RC:$src3))>,
AVX512FMA3Base;
defm m: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3),
OpcodeStr, "$src3, $src2", "$src2, $src3",
(_.VT (OpNode _.RC:$src1, _.RC:$src2, (_.LdFrag addr:$src3)))>,
AVX512FMA3Base;
defm mb: AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3),
OpcodeStr, !strconcat("${src3}", _.BroadcastStr,", $src2"),
!strconcat("$src2, ${src3}", _.BroadcastStr ),
(OpNode _.RC:$src1,
_.RC:$src2,(_.VT (X86VBroadcast (_.ScalarLdFrag addr:$src3))))>,
AVX512FMA3Base, EVEX_B;
}
}
} // Constraints = "$src1 = $dst"
multiclass avx512_pmadd52_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo _> {
let Predicates = [HasIFMA] in {
defm Z : avx512_pmadd52_rm<opc, OpcodeStr, OpNode, _.info512>,
EVEX_V512, EVEX_CD8<_.info512.EltSize, CD8VF>;
}
let Predicates = [HasVLX, HasIFMA] in {
defm Z256 : avx512_pmadd52_rm<opc, OpcodeStr, OpNode, _.info256>,
EVEX_V256, EVEX_CD8<_.info256.EltSize, CD8VF>;
defm Z128 : avx512_pmadd52_rm<opc, OpcodeStr, OpNode, _.info128>,
EVEX_V128, EVEX_CD8<_.info128.EltSize, CD8VF>;
}
}
defm VPMADD52LUQ : avx512_pmadd52_common<0xb4, "vpmadd52luq", x86vpmadd52l,
avx512vl_i64_info>, VEX_W;
defm VPMADD52HUQ : avx512_pmadd52_common<0xb5, "vpmadd52huq", x86vpmadd52h,
avx512vl_i64_info>, VEX_W;
//===----------------------------------------------------------------------===//
// AVX-512 Scalar convert from signed integer to float/double
//===----------------------------------------------------------------------===//
multiclass avx512_vcvtsi<bits<8> opc, SDNode OpNode, RegisterClass SrcRC,
X86VectorVTInfo DstVT, X86MemOperand x86memop,
PatFrag ld_frag, string asm> {
let hasSideEffects = 0 in {
def rr : SI<opc, MRMSrcReg, (outs DstVT.FRC:$dst),
(ins DstVT.FRC:$src1, SrcRC:$src),
!strconcat(asm,"\t{$src, $src1, $dst|$dst, $src1, $src}"), []>,
EVEX_4V;
let mayLoad = 1 in
def rm : SI<opc, MRMSrcMem, (outs DstVT.FRC:$dst),
(ins DstVT.FRC:$src1, x86memop:$src),
!strconcat(asm,"\t{$src, $src1, $dst|$dst, $src1, $src}"), []>,
EVEX_4V;
} // hasSideEffects = 0
let isCodeGenOnly = 1 in {
def rr_Int : SI<opc, MRMSrcReg, (outs DstVT.RC:$dst),
(ins DstVT.RC:$src1, SrcRC:$src2),
!strconcat(asm,"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set DstVT.RC:$dst,
(OpNode (DstVT.VT DstVT.RC:$src1),
SrcRC:$src2,
(i32 FROUND_CURRENT)))]>, EVEX_4V;
def rm_Int : SI<opc, MRMSrcMem, (outs DstVT.RC:$dst),
(ins DstVT.RC:$src1, x86memop:$src2),
!strconcat(asm,"\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set DstVT.RC:$dst,
(OpNode (DstVT.VT DstVT.RC:$src1),
(ld_frag addr:$src2),
(i32 FROUND_CURRENT)))]>, EVEX_4V;
}//isCodeGenOnly = 1
}
multiclass avx512_vcvtsi_round<bits<8> opc, SDNode OpNode, RegisterClass SrcRC,
X86VectorVTInfo DstVT, string asm> {
def rrb_Int : SI<opc, MRMSrcReg, (outs DstVT.RC:$dst),
(ins DstVT.RC:$src1, SrcRC:$src2, AVX512RC:$rc),
!strconcat(asm,
"\t{$src2, $rc, $src1, $dst|$dst, $src1, $rc, $src2}"),
[(set DstVT.RC:$dst,
(OpNode (DstVT.VT DstVT.RC:$src1),
SrcRC:$src2,
(i32 imm:$rc)))]>, EVEX_4V, EVEX_B, EVEX_RC;
}
multiclass avx512_vcvtsi_common<bits<8> opc, SDNode OpNode, RegisterClass SrcRC,
X86VectorVTInfo DstVT, X86MemOperand x86memop,
PatFrag ld_frag, string asm> {
defm NAME : avx512_vcvtsi_round<opc, OpNode, SrcRC, DstVT, asm>,
avx512_vcvtsi<opc, OpNode, SrcRC, DstVT, x86memop, ld_frag, asm>,
VEX_LIG;
}
let Predicates = [HasAVX512] in {
defm VCVTSI2SSZ : avx512_vcvtsi_common<0x2A, X86SintToFpRnd, GR32,
v4f32x_info, i32mem, loadi32, "cvtsi2ss{l}">,
XS, EVEX_CD8<32, CD8VT1>;
defm VCVTSI642SSZ: avx512_vcvtsi_common<0x2A, X86SintToFpRnd, GR64,
v4f32x_info, i64mem, loadi64, "cvtsi2ss{q}">,
XS, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VCVTSI2SDZ : avx512_vcvtsi_common<0x2A, X86SintToFpRnd, GR32,
v2f64x_info, i32mem, loadi32, "cvtsi2sd{l}">,
XD, EVEX_CD8<32, CD8VT1>;
defm VCVTSI642SDZ: avx512_vcvtsi_common<0x2A, X86SintToFpRnd, GR64,
v2f64x_info, i64mem, loadi64, "cvtsi2sd{q}">,
XD, VEX_W, EVEX_CD8<64, CD8VT1>;
def : InstAlias<"vcvtsi2ss\t{$src, $src1, $dst|$dst, $src1, $src}",
(VCVTSI2SSZrm FR64X:$dst, FR64X:$src1, i32mem:$src), 0>;
def : InstAlias<"vcvtsi2sd\t{$src, $src1, $dst|$dst, $src1, $src}",
(VCVTSI2SDZrm FR64X:$dst, FR64X:$src1, i32mem:$src), 0>;
def : Pat<(f32 (sint_to_fp (loadi32 addr:$src))),
(VCVTSI2SSZrm (f32 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f32 (sint_to_fp (loadi64 addr:$src))),
(VCVTSI642SSZrm (f32 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f64 (sint_to_fp (loadi32 addr:$src))),
(VCVTSI2SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f64 (sint_to_fp (loadi64 addr:$src))),
(VCVTSI642SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f32 (sint_to_fp GR32:$src)),
(VCVTSI2SSZrr (f32 (IMPLICIT_DEF)), GR32:$src)>;
def : Pat<(f32 (sint_to_fp GR64:$src)),
(VCVTSI642SSZrr (f32 (IMPLICIT_DEF)), GR64:$src)>;
def : Pat<(f64 (sint_to_fp GR32:$src)),
(VCVTSI2SDZrr (f64 (IMPLICIT_DEF)), GR32:$src)>;
def : Pat<(f64 (sint_to_fp GR64:$src)),
(VCVTSI642SDZrr (f64 (IMPLICIT_DEF)), GR64:$src)>;
defm VCVTUSI2SSZ : avx512_vcvtsi_common<0x7B, X86UintToFpRnd, GR32,
v4f32x_info, i32mem, loadi32,
"cvtusi2ss{l}">, XS, EVEX_CD8<32, CD8VT1>;
defm VCVTUSI642SSZ : avx512_vcvtsi_common<0x7B, X86UintToFpRnd, GR64,
v4f32x_info, i64mem, loadi64, "cvtusi2ss{q}">,
XS, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VCVTUSI2SDZ : avx512_vcvtsi<0x7B, X86UintToFpRnd, GR32, v2f64x_info,
i32mem, loadi32, "cvtusi2sd{l}">,
XD, VEX_LIG, EVEX_CD8<32, CD8VT1>;
defm VCVTUSI642SDZ : avx512_vcvtsi_common<0x7B, X86UintToFpRnd, GR64,
v2f64x_info, i64mem, loadi64, "cvtusi2sd{q}">,
XD, VEX_W, EVEX_CD8<64, CD8VT1>;
def : InstAlias<"vcvtusi2ss\t{$src, $src1, $dst|$dst, $src1, $src}",
(VCVTUSI2SSZrm FR64X:$dst, FR64X:$src1, i32mem:$src), 0>;
def : InstAlias<"vcvtusi2sd\t{$src, $src1, $dst|$dst, $src1, $src}",
(VCVTUSI2SDZrm FR64X:$dst, FR64X:$src1, i32mem:$src), 0>;
def : Pat<(f32 (uint_to_fp (loadi32 addr:$src))),
(VCVTUSI2SSZrm (f32 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f32 (uint_to_fp (loadi64 addr:$src))),
(VCVTUSI642SSZrm (f32 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f64 (uint_to_fp (loadi32 addr:$src))),
(VCVTUSI2SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f64 (uint_to_fp (loadi64 addr:$src))),
(VCVTUSI642SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>;
def : Pat<(f32 (uint_to_fp GR32:$src)),
(VCVTUSI2SSZrr (f32 (IMPLICIT_DEF)), GR32:$src)>;
def : Pat<(f32 (uint_to_fp GR64:$src)),
(VCVTUSI642SSZrr (f32 (IMPLICIT_DEF)), GR64:$src)>;
def : Pat<(f64 (uint_to_fp GR32:$src)),
(VCVTUSI2SDZrr (f64 (IMPLICIT_DEF)), GR32:$src)>;
def : Pat<(f64 (uint_to_fp GR64:$src)),
(VCVTUSI642SDZrr (f64 (IMPLICIT_DEF)), GR64:$src)>;
}
//===----------------------------------------------------------------------===//
// AVX-512 Scalar convert from float/double to integer
//===----------------------------------------------------------------------===//
multiclass avx512_cvt_s_int_round<bits<8> opc, X86VectorVTInfo SrcVT ,
X86VectorVTInfo DstVT, SDNode OpNode, string asm> {
let Predicates = [HasAVX512] in {
def rr : SI<opc, MRMSrcReg, (outs DstVT.RC:$dst), (ins SrcVT.RC:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[(set DstVT.RC:$dst, (OpNode (SrcVT.VT SrcVT.RC:$src),(i32 FROUND_CURRENT)))]>,
EVEX, VEX_LIG;
def rb : SI<opc, MRMSrcReg, (outs DstVT.RC:$dst), (ins SrcVT.RC:$src, AVX512RC:$rc),
!strconcat(asm,"\t{$rc, $src, $dst|$dst, $src, $rc}"),
[(set DstVT.RC:$dst, (OpNode (SrcVT.VT SrcVT.RC:$src),(i32 imm:$rc)))]>,
EVEX, VEX_LIG, EVEX_B, EVEX_RC;
def rm : SI<opc, MRMSrcMem, (outs DstVT.RC:$dst), (ins SrcVT.IntScalarMemOp:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[(set DstVT.RC:$dst, (OpNode
(SrcVT.VT SrcVT.ScalarIntMemCPat:$src),
(i32 FROUND_CURRENT)))]>,
EVEX, VEX_LIG;
} // Predicates = [HasAVX512]
}
// Convert float/double to signed/unsigned int 32/64
defm VCVTSS2SIZ: avx512_cvt_s_int_round<0x2D, f32x_info, i32x_info,
X86cvts2si, "cvtss2si">,
XS, EVEX_CD8<32, CD8VT1>;
defm VCVTSS2SI64Z: avx512_cvt_s_int_round<0x2D, f32x_info, i64x_info,
X86cvts2si, "cvtss2si">,
XS, VEX_W, EVEX_CD8<32, CD8VT1>;
defm VCVTSS2USIZ: avx512_cvt_s_int_round<0x79, f32x_info, i32x_info,
X86cvts2usi, "cvtss2usi">,
XS, EVEX_CD8<32, CD8VT1>;
defm VCVTSS2USI64Z: avx512_cvt_s_int_round<0x79, f32x_info, i64x_info,
X86cvts2usi, "cvtss2usi">, XS, VEX_W,
EVEX_CD8<32, CD8VT1>;
defm VCVTSD2SIZ: avx512_cvt_s_int_round<0x2D, f64x_info, i32x_info,
X86cvts2si, "cvtsd2si">,
XD, EVEX_CD8<64, CD8VT1>;
defm VCVTSD2SI64Z: avx512_cvt_s_int_round<0x2D, f64x_info, i64x_info,
X86cvts2si, "cvtsd2si">,
XD, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VCVTSD2USIZ: avx512_cvt_s_int_round<0x79, f64x_info, i32x_info,
X86cvts2usi, "cvtsd2usi">,
XD, EVEX_CD8<64, CD8VT1>;
defm VCVTSD2USI64Z: avx512_cvt_s_int_round<0x79, f64x_info, i64x_info,
X86cvts2usi, "cvtsd2usi">, XD, VEX_W,
EVEX_CD8<64, CD8VT1>;
// The SSE versions of these instructions are disabled for AVX512.
// Therefore, the SSE intrinsics are mapped to the AVX512 instructions.
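// Illustrative example of such a mapping: a clang-level call like
//   int r = _mm_cvtss_si32(x);   // lowers to int_x86_sse_cvtss2si
// must select VCVTSS2SIZrr below, because the VEX/SSE encodings are not
// selectable once AVX512 is enabled.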
let Predicates = [HasAVX512] in {
def : Pat<(i32 (int_x86_sse_cvtss2si (v4f32 VR128X:$src))),
(VCVTSS2SIZrr VR128X:$src)>;
def : Pat<(i32 (int_x86_sse_cvtss2si sse_load_f32:$src)),
(VCVTSS2SIZrm sse_load_f32:$src)>;
def : Pat<(i64 (int_x86_sse_cvtss2si64 (v4f32 VR128X:$src))),
(VCVTSS2SI64Zrr VR128X:$src)>;
def : Pat<(i64 (int_x86_sse_cvtss2si64 sse_load_f32:$src)),
(VCVTSS2SI64Zrm sse_load_f32:$src)>;
def : Pat<(i32 (int_x86_sse2_cvtsd2si (v2f64 VR128X:$src))),
(VCVTSD2SIZrr VR128X:$src)>;
def : Pat<(i32 (int_x86_sse2_cvtsd2si sse_load_f64:$src)),
(VCVTSD2SIZrm sse_load_f64:$src)>;
def : Pat<(i64 (int_x86_sse2_cvtsd2si64 (v2f64 VR128X:$src))),
(VCVTSD2SI64Zrr VR128X:$src)>;
def : Pat<(i64 (int_x86_sse2_cvtsd2si64 sse_load_f64:$src)),
(VCVTSD2SI64Zrm sse_load_f64:$src)>;
} // HasAVX512
let Predicates = [HasAVX512] in {
def : Pat<(int_x86_sse_cvtsi2ss VR128X:$src1, GR32:$src2),
(VCVTSI2SSZrr_Int VR128X:$src1, GR32:$src2)>;
def : Pat<(int_x86_sse_cvtsi2ss VR128X:$src1, (loadi32 addr:$src2)),
(VCVTSI2SSZrm_Int VR128X:$src1, addr:$src2)>;
def : Pat<(int_x86_sse_cvtsi642ss VR128X:$src1, GR64:$src2),
(VCVTSI642SSZrr_Int VR128X:$src1, GR64:$src2)>;
def : Pat<(int_x86_sse_cvtsi642ss VR128X:$src1, (loadi64 addr:$src2)),
(VCVTSI642SSZrm_Int VR128X:$src1, addr:$src2)>;
def : Pat<(int_x86_sse2_cvtsi2sd VR128X:$src1, GR32:$src2),
(VCVTSI2SDZrr_Int VR128X:$src1, GR32:$src2)>;
def : Pat<(int_x86_sse2_cvtsi2sd VR128X:$src1, (loadi32 addr:$src2)),
(VCVTSI2SDZrm_Int VR128X:$src1, addr:$src2)>;
def : Pat<(int_x86_sse2_cvtsi642sd VR128X:$src1, GR64:$src2),
(VCVTSI642SDZrr_Int VR128X:$src1, GR64:$src2)>;
def : Pat<(int_x86_sse2_cvtsi642sd VR128X:$src1, (loadi64 addr:$src2)),
(VCVTSI642SDZrm_Int VR128X:$src1, addr:$src2)>;
def : Pat<(int_x86_avx512_cvtusi2sd VR128X:$src1, GR32:$src2),
(VCVTUSI2SDZrr_Int VR128X:$src1, GR32:$src2)>;
def : Pat<(int_x86_avx512_cvtusi2sd VR128X:$src1, (loadi32 addr:$src2)),
(VCVTUSI2SDZrm_Int VR128X:$src1, addr:$src2)>;
} // Predicates = [HasAVX512]
// Patterns used for matching vcvtsi2s{s,d} intrinsic sequences from clang,
// which produce unnecessary vmovs{s,d} instructions.
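// Illustrative example of the folded sequence: clang expands
//   __m128 r = _mm_cvtsi32_ss(a, b);
// into (sint_to_fp b) inserted via X86Movss into a; the patterns below
// match that combined DAG directly to VCVTSI2SSZrr_Int, avoiding a
// separate vmovss.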
let Predicates = [HasAVX512] in {
def : Pat<(v4f32 (X86Movss
(v4f32 VR128X:$dst),
(v4f32 (scalar_to_vector (f32 (sint_to_fp GR64:$src)))))),
(VCVTSI642SSZrr_Int VR128X:$dst, GR64:$src)>;
def : Pat<(v4f32 (X86Movss
(v4f32 VR128X:$dst),
(v4f32 (scalar_to_vector (f32 (sint_to_fp GR32:$src)))))),
(VCVTSI2SSZrr_Int VR128X:$dst, GR32:$src)>;
def : Pat<(v2f64 (X86Movsd
(v2f64 VR128X:$dst),
(v2f64 (scalar_to_vector (f64 (sint_to_fp GR64:$src)))))),
(VCVTSI642SDZrr_Int VR128X:$dst, GR64:$src)>;
def : Pat<(v2f64 (X86Movsd
(v2f64 VR128X:$dst),
(v2f64 (scalar_to_vector (f64 (sint_to_fp GR32:$src)))))),
(VCVTSI2SDZrr_Int VR128X:$dst, GR32:$src)>;
} // Predicates = [HasAVX512]
// Convert float/double to signed/unsigned int 32/64 with truncation
multiclass avx512_cvt_s_all<bits<8> opc, string asm, X86VectorVTInfo _SrcRC,
X86VectorVTInfo _DstRC, SDNode OpNode,
SDNode OpNodeRnd, string aliasStr>{
let Predicates = [HasAVX512] in {
def rr : AVX512<opc, MRMSrcReg, (outs _DstRC.RC:$dst), (ins _SrcRC.FRC:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[(set _DstRC.RC:$dst, (OpNode _SrcRC.FRC:$src))]>, EVEX;
let hasSideEffects = 0 in
def rb : AVX512<opc, MRMSrcReg, (outs _DstRC.RC:$dst), (ins _SrcRC.FRC:$src),
!strconcat(asm,"\t{{sae}, $src, $dst|$dst, $src, {sae}}"),
[]>, EVEX, EVEX_B;
def rm : AVX512<opc, MRMSrcMem, (outs _DstRC.RC:$dst), (ins _SrcRC.ScalarMemOp:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[(set _DstRC.RC:$dst, (OpNode (_SrcRC.ScalarLdFrag addr:$src)))]>,
EVEX;
def : InstAlias<asm # aliasStr # "\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "rr") _DstRC.RC:$dst, _SrcRC.FRC:$src), 0>;
def : InstAlias<asm # aliasStr # "\t\t{{sae}, $src, $dst|$dst, $src, {sae}}",
(!cast<Instruction>(NAME # "rb") _DstRC.RC:$dst, _SrcRC.FRC:$src), 0>;
def : InstAlias<asm # aliasStr # "\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "rm") _DstRC.RC:$dst,
_SrcRC.ScalarMemOp:$src), 0>;
let isCodeGenOnly = 1 in {
def rr_Int : AVX512<opc, MRMSrcReg, (outs _DstRC.RC:$dst), (ins _SrcRC.RC:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[(set _DstRC.RC:$dst, (OpNodeRnd (_SrcRC.VT _SrcRC.RC:$src),
(i32 FROUND_CURRENT)))]>, EVEX, VEX_LIG;
def rb_Int : AVX512<opc, MRMSrcReg, (outs _DstRC.RC:$dst), (ins _SrcRC.RC:$src),
!strconcat(asm,"\t{{sae}, $src, $dst|$dst, $src, {sae}}"),
[(set _DstRC.RC:$dst, (OpNodeRnd (_SrcRC.VT _SrcRC.RC:$src),
(i32 FROUND_NO_EXC)))]>,
EVEX,VEX_LIG , EVEX_B;
let mayLoad = 1, hasSideEffects = 0 in
def rm_Int : AVX512<opc, MRMSrcMem, (outs _DstRC.RC:$dst),
(ins _SrcRC.IntScalarMemOp:$src),
!strconcat(asm,"\t{$src, $dst|$dst, $src}"),
[]>, EVEX, VEX_LIG;
} // isCodeGenOnly = 1
} //HasAVX512
}
defm VCVTTSS2SIZ: avx512_cvt_s_all<0x2C, "vcvttss2si", f32x_info, i32x_info,
fp_to_sint, X86cvtts2IntRnd, "{l}">,
XS, EVEX_CD8<32, CD8VT1>;
defm VCVTTSS2SI64Z: avx512_cvt_s_all<0x2C, "vcvttss2si", f32x_info, i64x_info,
fp_to_sint, X86cvtts2IntRnd, "{q}">,
VEX_W, XS, EVEX_CD8<32, CD8VT1>;
defm VCVTTSD2SIZ: avx512_cvt_s_all<0x2C, "vcvttsd2si", f64x_info, i32x_info,
fp_to_sint, X86cvtts2IntRnd, "{l}">,
XD, EVEX_CD8<64, CD8VT1>;
defm VCVTTSD2SI64Z: avx512_cvt_s_all<0x2C, "vcvttsd2si", f64x_info, i64x_info,
fp_to_sint, X86cvtts2IntRnd, "{q}">,
VEX_W, XD, EVEX_CD8<64, CD8VT1>;
defm VCVTTSS2USIZ: avx512_cvt_s_all<0x78, "vcvttss2usi", f32x_info, i32x_info,
fp_to_uint, X86cvtts2UIntRnd, "{l}">,
XS, EVEX_CD8<32, CD8VT1>;
defm VCVTTSS2USI64Z: avx512_cvt_s_all<0x78, "vcvttss2usi", f32x_info, i64x_info,
fp_to_uint, X86cvtts2UIntRnd, "{q}">,
XS,VEX_W, EVEX_CD8<32, CD8VT1>;
defm VCVTTSD2USIZ: avx512_cvt_s_all<0x78, "vcvttsd2usi", f64x_info, i32x_info,
fp_to_uint, X86cvtts2UIntRnd, "{l}">,
XD, EVEX_CD8<64, CD8VT1>;
defm VCVTTSD2USI64Z: avx512_cvt_s_all<0x78, "vcvttsd2usi", f64x_info, i64x_info,
fp_to_uint, X86cvtts2UIntRnd, "{q}">,
XD, VEX_W, EVEX_CD8<64, CD8VT1>;
let Predicates = [HasAVX512] in {
def : Pat<(i32 (int_x86_sse_cvttss2si (v4f32 VR128X:$src))),
(VCVTTSS2SIZrr_Int VR128X:$src)>;
def : Pat<(i32 (int_x86_sse_cvttss2si sse_load_f32:$src)),
(VCVTTSS2SIZrm_Int ssmem:$src)>;
def : Pat<(i64 (int_x86_sse_cvttss2si64 (v4f32 VR128X:$src))),
(VCVTTSS2SI64Zrr_Int VR128X:$src)>;
def : Pat<(i64 (int_x86_sse_cvttss2si64 sse_load_f32:$src)),
(VCVTTSS2SI64Zrm_Int ssmem:$src)>;
def : Pat<(i32 (int_x86_sse2_cvttsd2si (v2f64 VR128X:$src))),
(VCVTTSD2SIZrr_Int VR128X:$src)>;
def : Pat<(i32 (int_x86_sse2_cvttsd2si sse_load_f64:$src)),
(VCVTTSD2SIZrm_Int sdmem:$src)>;
def : Pat<(i64 (int_x86_sse2_cvttsd2si64 (v2f64 VR128X:$src))),
(VCVTTSD2SI64Zrr_Int VR128X:$src)>;
def : Pat<(i64 (int_x86_sse2_cvttsd2si64 sse_load_f64:$src)),
(VCVTTSD2SI64Zrm_Int sdmem:$src)>;
} // HasAVX512
//===----------------------------------------------------------------------===//
// AVX-512 Convert from float to double and back
//===----------------------------------------------------------------------===//
multiclass avx512_cvt_fp_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNode> {
defm rr_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _Src.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode (_.VT _.RC:$src1),
(_Src.VT _Src.RC:$src2),
(i32 FROUND_CURRENT)))>,
EVEX_4V, VEX_LIG, Sched<[WriteCvtF2F]>;
defm rm_Int : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _Src.IntScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(_.VT (OpNode (_.VT _.RC:$src1),
(_Src.VT _Src.ScalarIntMemCPat:$src2),
(i32 FROUND_CURRENT)))>,
EVEX_4V, VEX_LIG, Sched<[WriteCvtF2FLd, ReadAfterLd]>;
let isCodeGenOnly = 1, hasSideEffects = 0 in {
def rr : I<opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _Src.FRC:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>,
EVEX_4V, VEX_LIG, Sched<[WriteCvtF2F]>;
let mayLoad = 1 in
def rm : I<opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _Src.ScalarMemOp:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>,
EVEX_4V, VEX_LIG, Sched<[WriteCvtF2FLd, ReadAfterLd]>;
}
}
// Scalar Conversion with SAE - suppress all exceptions
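// In assembly SAE shows up as the {sae} operand, which requests that the
// conversion not raise floating-point exceptions, e.g. (illustrative):
//   vcvtss2sd {sae}, %xmm2, %xmm1, %xmm0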
multiclass avx512_cvt_fp_sae_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNodeRnd> {
defm rrb_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _Src.RC:$src2), OpcodeStr,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(_.VT (OpNodeRnd (_.VT _.RC:$src1),
(_Src.VT _Src.RC:$src2),
(i32 FROUND_NO_EXC)))>,
EVEX_4V, VEX_LIG, EVEX_B;
}
// Scalar Conversion with rounding control (RC)
multiclass avx512_cvt_fp_rc_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNodeRnd> {
defm rrb_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _Src.RC:$src2, AVX512RC:$rc), OpcodeStr,
"$rc, $src2, $src1", "$src1, $src2, $rc",
(_.VT (OpNodeRnd (_.VT _.RC:$src1),
(_Src.VT _Src.RC:$src2), (i32 imm:$rc)))>,
EVEX_4V, VEX_LIG, Sched<[WriteCvtF2FLd, ReadAfterLd]>,
EVEX_B, EVEX_RC;
}
multiclass avx512_cvt_fp_scalar_sd2ss<bits<8> opc, string OpcodeStr,
SDNode OpNodeRnd, X86VectorVTInfo _src,
X86VectorVTInfo _dst> {
let Predicates = [HasAVX512] in {
defm Z : avx512_cvt_fp_scalar<opc, OpcodeStr, _dst, _src, OpNodeRnd>,
avx512_cvt_fp_rc_scalar<opc, OpcodeStr, _dst, _src,
OpNodeRnd>, VEX_W, EVEX_CD8<64, CD8VT1>, XD;
}
}
multiclass avx512_cvt_fp_scalar_ss2sd<bits<8> opc, string OpcodeStr,
SDNode OpNodeRnd, X86VectorVTInfo _src,
X86VectorVTInfo _dst> {
let Predicates = [HasAVX512] in {
defm Z : avx512_cvt_fp_scalar<opc, OpcodeStr, _dst, _src, OpNodeRnd>,
avx512_cvt_fp_sae_scalar<opc, OpcodeStr, _dst, _src, OpNodeRnd>,
EVEX_CD8<32, CD8VT1>, XS;
}
}
defm VCVTSD2SS : avx512_cvt_fp_scalar_sd2ss<0x5A, "vcvtsd2ss",
X86froundRnd, f64x_info, f32x_info>;
defm VCVTSS2SD : avx512_cvt_fp_scalar_ss2sd<0x5A, "vcvtss2sd",
X86fpextRnd, f32x_info, f64x_info>;
def : Pat<(f64 (fpextend FR32X:$src)),
(VCVTSS2SDZrr (COPY_TO_REGCLASS FR32X:$src, FR64X), FR32X:$src)>,
Requires<[HasAVX512]>;
def : Pat<(f64 (fpextend (loadf32 addr:$src))),
(VCVTSS2SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>,
Requires<[HasAVX512]>;
def : Pat<(f64 (extloadf32 addr:$src)),
(VCVTSS2SDZrm (f64 (IMPLICIT_DEF)), addr:$src)>,
Requires<[HasAVX512, OptForSize]>;
def : Pat<(f64 (extloadf32 addr:$src)),
(VCVTSS2SDZrr (f64 (IMPLICIT_DEF)), (VMOVSSZrm addr:$src))>,
Requires<[HasAVX512, OptForSpeed]>;
def : Pat<(f32 (fpround FR64X:$src)),
(VCVTSD2SSZrr (COPY_TO_REGCLASS FR64X:$src, FR32X), FR64X:$src)>,
Requires<[HasAVX512]>;
def : Pat<(v4f32 (X86Movss
(v4f32 VR128X:$dst),
(v4f32 (scalar_to_vector
(f32 (fpround (f64 (extractelt VR128X:$src, (iPTR 0))))))))),
(VCVTSD2SSZrr_Int VR128X:$dst, VR128X:$src)>,
Requires<[HasAVX512]>;
def : Pat<(v2f64 (X86Movsd
(v2f64 VR128X:$dst),
(v2f64 (scalar_to_vector
(f64 (fpextend (f32 (extractelt VR128X:$src, (iPTR 0))))))))),
(VCVTSS2SDZrr_Int VR128X:$dst, VR128X:$src)>,
Requires<[HasAVX512]>;
//===----------------------------------------------------------------------===//
// AVX-512 Vector convert from signed/unsigned integer to float/double
// and from float/double to signed/unsigned integer
//===----------------------------------------------------------------------===//
multiclass avx512_vcvt_fp<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNode,
string Broadcast = _.BroadcastStr,
string Alias = "", X86MemOperand MemOp = _Src.MemOp> {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _Src.RC:$src), OpcodeStr, "$src", "$src",
(_.VT (OpNode (_Src.VT _Src.RC:$src)))>, EVEX;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins MemOp:$src), OpcodeStr#Alias, "$src", "$src",
(_.VT (OpNode (_Src.VT
(bitconvert (_Src.LdFrag addr:$src)))))>, EVEX;
defm rmb : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _Src.ScalarMemOp:$src), OpcodeStr,
"${src}"##Broadcast, "${src}"##Broadcast,
(_.VT (OpNode (_Src.VT
(X86VBroadcast (_Src.ScalarLdFrag addr:$src)))
))>, EVEX, EVEX_B;
}
// Conversion with SAE - suppress all exceptions
multiclass avx512_vcvt_fp_sae<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNodeRnd> {
defm rrb : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _Src.RC:$src), OpcodeStr,
"{sae}, $src", "$src, {sae}",
(_.VT (OpNodeRnd (_Src.VT _Src.RC:$src),
(i32 FROUND_NO_EXC)))>,
EVEX, EVEX_B;
}
// Conversion with rounding control (RC)
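// The $rc operand encodes a static rounding mode (which also implies SAE),
// written in assembly as {rn-sae}, {rd-sae}, {ru-sae} or {rz-sae}, e.g.
// (illustrative):
//   vcvtpd2ps {rn-sae}, %zmm1, %ymm0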
multiclass avx512_vcvt_fp_rc<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86VectorVTInfo _Src, SDNode OpNodeRnd> {
defm rrb : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _Src.RC:$src, AVX512RC:$rc), OpcodeStr,
"$rc, $src", "$src, $rc",
(_.VT (OpNodeRnd (_Src.VT _Src.RC:$src), (i32 imm:$rc)))>,
EVEX, EVEX_B, EVEX_RC;
}
// Extend Float to Double
multiclass avx512_cvtps2pd<bits<8> opc, string OpcodeStr> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8f64_info, v8f32x_info, fpextend>,
avx512_vcvt_fp_sae<opc, OpcodeStr, v8f64_info, v8f32x_info,
X86vfpextRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2f64x_info, v4f32x_info,
X86vfpext, "{1to2}", "", f64mem>, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4f64x_info, v4f32x_info, fpextend>,
EVEX_V256;
}
}
// Truncate Double to Float
multiclass avx512_cvtpd2ps<bits<8> opc, string OpcodeStr> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8f32x_info, v8f64_info, fpround>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8f32x_info, v8f64_info,
X86vfproundRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4f32x_info, v2f64x_info,
X86vfpround, "{1to2}", "{x}">, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4f32x_info, v4f64x_info, fpround,
"{1to4}", "{y}">, EVEX_V256;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rr") VR128X:$dst, VR128X:$src), 0>;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rm") VR128X:$dst, f128mem:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rr") VR128X:$dst, VR256X:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rm") VR128X:$dst, f256mem:$src), 0>;
}
}
defm VCVTPD2PS : avx512_cvtpd2ps<0x5A, "vcvtpd2ps">,
VEX_W, PD, EVEX_CD8<64, CD8VF>;
defm VCVTPS2PD : avx512_cvtps2pd<0x5A, "vcvtps2pd">,
PS, EVEX_CD8<32, CD8VH>;
def : Pat<(v8f64 (extloadv8f32 addr:$src)),
(VCVTPS2PDZrm addr:$src)>;
let Predicates = [HasVLX] in {
let AddedComplexity = 15 in
def : Pat<(X86vzmovl (v2f64 (bitconvert
(v4f32 (X86vfpround (v2f64 VR128X:$src)))))),
(VCVTPD2PSZ128rr VR128X:$src)>;
def : Pat<(v2f64 (extloadv2f32 addr:$src)),
(VCVTPS2PDZ128rm addr:$src)>;
def : Pat<(v4f64 (extloadv4f32 addr:$src)),
(VCVTPS2PDZ256rm addr:$src)>;
}
// Convert Signed/Unsigned Doubleword to Double
multiclass avx512_cvtdq2pd<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNode128> {
// No rounding in this op
let Predicates = [HasAVX512] in
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8f64_info, v8i32x_info, OpNode>,
EVEX_V512;
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2f64x_info, v4i32x_info,
OpNode128, "{1to2}", "", i64mem>, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4f64x_info, v4i32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Signed/Unsigned Doubleword to Float
multiclass avx512_cvtdq2ps<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNodeRnd> {
let Predicates = [HasAVX512] in
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v16f32_info, v16i32_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v16f32_info, v16i32_info,
OpNodeRnd>, EVEX_V512;
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4f32x_info, v4i32x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v8f32x_info, v8i32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Float to Signed/Unsigned Doubleword with truncation
multiclass avx512_cvttps2dq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v16i32_info, v16f32_info, OpNode>,
avx512_vcvt_fp_sae<opc, OpcodeStr, v16i32_info, v16f32_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v4f32x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v8i32x_info, v8f32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Float to Signed/Unsigned Doubleword
multiclass avx512_cvtps2dq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v16i32_info, v16f32_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v16i32_info, v16f32_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v4f32x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v8i32x_info, v8f32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Double to Signed/Unsigned Doubleword with truncation
multiclass avx512_cvttpd2dq<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNode128, SDNode OpNodeRnd> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i32x_info, v8f64_info, OpNode>,
avx512_vcvt_fp_sae<opc, OpcodeStr, v8i32x_info, v8f64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
// we need "x"/"y" suffixes in order to distinguish between 128 and 256
// memory forms of these instructions in Asm Parser. They have the same
// dest type - 'v4i32x_info'. We also specify the broadcast string explicitly
// due to the same reason.
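// Illustrative: both memory forms write an %xmm destination,
//   vcvttpd2dqx (%rax), %xmm0   ; 128-bit (v2f64) load
//   vcvttpd2dqy (%rax), %xmm0   ; 256-bit (v4f64) load
// so without the suffix the assembler could not infer the source size.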
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v2f64x_info,
OpNode128, "{1to2}", "{x}">, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v4f64x_info, OpNode,
"{1to4}", "{y}">, EVEX_V256;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rr") VR128X:$dst, VR128X:$src), 0>;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rm") VR128X:$dst, i128mem:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rr") VR128X:$dst, VR256X:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rm") VR128X:$dst, i256mem:$src), 0>;
}
}
// Convert Double to Signed/Unsigned Doubleword
multiclass avx512_cvtpd2dq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasAVX512] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i32x_info, v8f64_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8i32x_info, v8f64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasVLX] in {
// we need "x"/"y" suffixes in order to distinguish between 128 and 256
// memory forms of these instructions in Asm Parser. They have the same
// dest type - 'v4i32x_info'. We also specify the broadcast string explicitly
// due to the same reason.
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v2f64x_info, OpNode,
"{1to2}", "{x}">, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i32x_info, v4f64x_info, OpNode,
"{1to4}", "{y}">, EVEX_V256;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rr") VR128X:$dst, VR128X:$src), 0>;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rm") VR128X:$dst, f128mem:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rr") VR128X:$dst, VR256X:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rm") VR128X:$dst, f256mem:$src), 0>;
}
}
// Convert Double to Signed/Unsigned Quadword
multiclass avx512_cvtpd2qq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i64_info, v8f64_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8i64_info, v8f64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2i64x_info, v2f64x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i64x_info, v4f64x_info, OpNode>,
EVEX_V256;
}
}
// Convert Double to Signed/Unsigned Quadword with truncation
multiclass avx512_cvttpd2qq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i64_info, v8f64_info, OpNode>,
avx512_vcvt_fp_sae<opc, OpcodeStr, v8i64_info, v8f64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2i64x_info, v2f64x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i64x_info, v4f64x_info, OpNode>,
EVEX_V256;
}
}
// Convert Signed/Unsigned Quadword to Double
multiclass avx512_cvtqq2pd<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8f64_info, v8i64_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8f64_info, v8i64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2f64x_info, v2i64x_info, OpNode>,
EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4f64x_info, v4i64x_info, OpNode>,
EVEX_V256;
}
}
// Convert Float to Signed/Unsigned Quadword
multiclass avx512_cvtps2qq<bits<8> opc, string OpcodeStr,
SDNode OpNode, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i64_info, v8f32x_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8i64_info, v8f32x_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
// Explicitly specified broadcast string, since we take only 2 elements
// from v4f32x_info source
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2i64x_info, v4f32x_info, OpNode,
"{1to2}", "", f64mem>, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i64x_info, v4f32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Float to Signed/Unsigned Quadword with truncation
multiclass avx512_cvttps2qq<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNode128, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8i64_info, v8f32x_info, OpNode>,
avx512_vcvt_fp_sae<opc, OpcodeStr, v8i64_info, v8f32x_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
// Explicitly specified broadcast string, since we take only 2 elements
// from v4f32x_info source
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v2i64x_info, v4f32x_info, OpNode128,
"{1to2}", "", f64mem>, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4i64x_info, v4f32x_info, OpNode>,
EVEX_V256;
}
}
// Convert Signed/Unsigned Quadword to Float
multiclass avx512_cvtqq2ps<bits<8> opc, string OpcodeStr, SDNode OpNode,
SDNode OpNode128, SDNode OpNodeRnd> {
let Predicates = [HasDQI] in {
defm Z : avx512_vcvt_fp<opc, OpcodeStr, v8f32x_info, v8i64_info, OpNode>,
avx512_vcvt_fp_rc<opc, OpcodeStr, v8f32x_info, v8i64_info,
OpNodeRnd>, EVEX_V512;
}
let Predicates = [HasDQI, HasVLX] in {
// we need "x"/"y" suffixes in order to distinguish between 128 and 256
// memory forms of these instructions in Asm Parser. They have the same
// dest type - 'v4f32x_info'. We also specify the broadcast string explicitly
// due to the same reason.
defm Z128 : avx512_vcvt_fp<opc, OpcodeStr, v4f32x_info, v2i64x_info, OpNode128,
"{1to2}", "{x}">, EVEX_V128;
defm Z256 : avx512_vcvt_fp<opc, OpcodeStr, v4f32x_info, v4i64x_info, OpNode,
"{1to4}", "{y}">, EVEX_V256;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rr") VR128X:$dst, VR128X:$src), 0>;
def : InstAlias<OpcodeStr##"x\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z128rm") VR128X:$dst, i128mem:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rr") VR128X:$dst, VR256X:$src), 0>;
def : InstAlias<OpcodeStr##"y\t{$src, $dst|$dst, $src}",
(!cast<Instruction>(NAME # "Z256rm") VR128X:$dst, i256mem:$src), 0>;
}
}
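// Illustrative: without the suffixes, "vcvtqq2ps xmm0, [mem]" would be
// ambiguous, since both the 128-bit and 256-bit memory forms write an XMM
// destination. The suffixed forms are unambiguous, e.g.:
//   vcvtqq2psx xmm0, xmmword ptr [rax]   ; v2i64 -> v4f32 (high half zeroed)
//   vcvtqq2psy xmm0, ymmword ptr [rax]   ; v4i64 -> v4f32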
defm VCVTDQ2PD : avx512_cvtdq2pd<0xE6, "vcvtdq2pd", sint_to_fp, X86VSintToFP>,
XS, EVEX_CD8<32, CD8VH>;
defm VCVTDQ2PS : avx512_cvtdq2ps<0x5B, "vcvtdq2ps", sint_to_fp,
X86VSintToFpRnd>,
PS, EVEX_CD8<32, CD8VF>;
defm VCVTTPS2DQ : avx512_cvttps2dq<0x5B, "vcvttps2dq", fp_to_sint,
X86cvttp2siRnd>,
XS, EVEX_CD8<32, CD8VF>;
defm VCVTTPD2DQ : avx512_cvttpd2dq<0xE6, "vcvttpd2dq", fp_to_sint, X86cvttp2si,
X86cvttp2siRnd>,
PD, VEX_W, EVEX_CD8<64, CD8VF>;
defm VCVTTPS2UDQ : avx512_cvttps2dq<0x78, "vcvttps2udq", fp_to_uint,
X86cvttp2uiRnd>, PS,
EVEX_CD8<32, CD8VF>;
defm VCVTTPD2UDQ : avx512_cvttpd2dq<0x78, "vcvttpd2udq", fp_to_uint,
X86cvttp2ui, X86cvttp2uiRnd>, PS, VEX_W,
EVEX_CD8<64, CD8VF>;
defm VCVTUDQ2PD : avx512_cvtdq2pd<0x7A, "vcvtudq2pd", uint_to_fp, X86VUintToFP>,
XS, EVEX_CD8<32, CD8VH>;
defm VCVTUDQ2PS : avx512_cvtdq2ps<0x7A, "vcvtudq2ps", uint_to_fp,
X86VUintToFpRnd>, XD,
EVEX_CD8<32, CD8VF>;
defm VCVTPS2DQ : avx512_cvtps2dq<0x5B, "vcvtps2dq", X86cvtp2Int,
X86cvtp2IntRnd>, PD, EVEX_CD8<32, CD8VF>;
defm VCVTPD2DQ : avx512_cvtpd2dq<0xE6, "vcvtpd2dq", X86cvtp2Int,
X86cvtp2IntRnd>, XD, VEX_W,
EVEX_CD8<64, CD8VF>;
defm VCVTPS2UDQ : avx512_cvtps2dq<0x79, "vcvtps2udq", X86cvtp2UInt,
X86cvtp2UIntRnd>,
PS, EVEX_CD8<32, CD8VF>;
defm VCVTPD2UDQ : avx512_cvtpd2dq<0x79, "vcvtpd2udq", X86cvtp2UInt,
X86cvtp2UIntRnd>, VEX_W,
PS, EVEX_CD8<64, CD8VF>;
defm VCVTPD2QQ : avx512_cvtpd2qq<0x7B, "vcvtpd2qq", X86cvtp2Int,
X86cvtp2IntRnd>, VEX_W,
PD, EVEX_CD8<64, CD8VF>;
defm VCVTPS2QQ : avx512_cvtps2qq<0x7B, "vcvtps2qq", X86cvtp2Int,
X86cvtp2IntRnd>, PD, EVEX_CD8<32, CD8VH>;
defm VCVTPD2UQQ : avx512_cvtpd2qq<0x79, "vcvtpd2uqq", X86cvtp2UInt,
X86cvtp2UIntRnd>, VEX_W,
PD, EVEX_CD8<64, CD8VF>;
defm VCVTPS2UQQ : avx512_cvtps2qq<0x79, "vcvtps2uqq", X86cvtp2UInt,
X86cvtp2UIntRnd>, PD, EVEX_CD8<32, CD8VH>;
defm VCVTTPD2QQ : avx512_cvttpd2qq<0x7A, "vcvttpd2qq", fp_to_sint,
X86cvttp2siRnd>, VEX_W,
PD, EVEX_CD8<64, CD8VF>;
defm VCVTTPS2QQ : avx512_cvttps2qq<0x7A, "vcvttps2qq", fp_to_sint, X86cvttp2si,
X86cvttp2siRnd>, PD, EVEX_CD8<32, CD8VH>;
defm VCVTTPD2UQQ : avx512_cvttpd2qq<0x78, "vcvttpd2uqq", fp_to_uint,
X86cvttp2uiRnd>, VEX_W,
PD, EVEX_CD8<64, CD8VF>;
defm VCVTTPS2UQQ : avx512_cvttps2qq<0x78, "vcvttps2uqq", fp_to_uint, X86cvttp2ui,
X86cvttp2uiRnd>, PD, EVEX_CD8<32, CD8VH>;
defm VCVTQQ2PD : avx512_cvtqq2pd<0xE6, "vcvtqq2pd", sint_to_fp,
X86VSintToFpRnd>, VEX_W, XS, EVEX_CD8<64, CD8VF>;
defm VCVTUQQ2PD : avx512_cvtqq2pd<0x7A, "vcvtuqq2pd", uint_to_fp,
X86VUintToFpRnd>, VEX_W, XS, EVEX_CD8<64, CD8VF>;
defm VCVTQQ2PS : avx512_cvtqq2ps<0x5B, "vcvtqq2ps", sint_to_fp, X86VSintToFP,
X86VSintToFpRnd>, VEX_W, PS, EVEX_CD8<64, CD8VF>;
defm VCVTUQQ2PS : avx512_cvtqq2ps<0x7A, "vcvtuqq2ps", uint_to_fp, X86VUintToFP,
X86VUintToFpRnd>, VEX_W, XD, EVEX_CD8<64, CD8VF>;
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v8i32 (fp_to_uint (v8f32 VR256X:$src1))),
(EXTRACT_SUBREG (v16i32 (VCVTTPS2UDQZrr
(v16f32 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
def : Pat<(v4i32 (fp_to_uint (v4f32 VR128X:$src1))),
(EXTRACT_SUBREG (v16i32 (VCVTTPS2UDQZrr
(v16f32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4i32 (fp_to_uint (v4f64 VR256X:$src1))),
(EXTRACT_SUBREG (v8i32 (VCVTTPD2UDQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_xmm)>;
def : Pat<(v4i32 (X86cvttp2ui (v2f64 VR128X:$src))),
(EXTRACT_SUBREG (v8i32 (VCVTTPD2UDQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src, sub_xmm)))), sub_xmm)>;
def : Pat<(v8f32 (uint_to_fp (v8i32 VR256X:$src1))),
(EXTRACT_SUBREG (v16f32 (VCVTUDQ2PSZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
def : Pat<(v4f32 (uint_to_fp (v4i32 VR128X:$src1))),
(EXTRACT_SUBREG (v16f32 (VCVTUDQ2PSZrr
(v16i32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4f64 (uint_to_fp (v4i32 VR128X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTUDQ2PDZrr
(v8i32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_ymm)>;
def : Pat<(v2f64 (X86VUintToFP (v4i32 VR128X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTUDQ2PDZrr
(v8i32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
}
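// Illustrative: with AVX512F but no VLX, the unsigned conversions above are
// emulated at 512-bit width, e.g. a v4f32 fp_to_uint becomes roughly
//   vcvttps2udq zmm0, zmm0
// with the value living in the low 128 bits; the INSERT_SUBREG into
// IMPLICIT_DEF merely reinterprets the register at zmm width.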
let Predicates = [HasAVX512, HasVLX] in {
let AddedComplexity = 15 in {
def : Pat<(X86vzmovl (v2i64 (bitconvert
(v4i32 (X86cvtp2Int (v2f64 VR128X:$src)))))),
(VCVTPD2DQZ128rr VR128X:$src)>;
def : Pat<(v4i32 (bitconvert (X86vzmovl (v2i64 (bitconvert
(v4i32 (X86cvtp2UInt (v2f64 VR128X:$src)))))))),
(VCVTPD2UDQZ128rr VR128X:$src)>;
def : Pat<(X86vzmovl (v2i64 (bitconvert
(v4i32 (X86cvttp2si (v2f64 VR128X:$src)))))),
(VCVTTPD2DQZ128rr VR128X:$src)>;
def : Pat<(v4i32 (bitconvert (X86vzmovl (v2i64 (bitconvert
(v4i32 (X86cvttp2ui (v2f64 VR128X:$src)))))))),
(VCVTTPD2UDQZ128rr VR128X:$src)>;
}
}
let Predicates = [HasAVX512] in {
def : Pat<(v8f32 (fpround (loadv8f64 addr:$src))),
(VCVTPD2PSZrm addr:$src)>;
def : Pat<(v8f64 (extloadv8f32 addr:$src)),
(VCVTPS2PDZrm addr:$src)>;
}
let Predicates = [HasDQI, HasVLX] in {
let AddedComplexity = 15 in {
def : Pat<(X86vzmovl (v2f64 (bitconvert
(v4f32 (X86VSintToFP (v2i64 VR128X:$src)))))),
(VCVTQQ2PSZ128rr VR128X:$src)>;
def : Pat<(X86vzmovl (v2f64 (bitconvert
(v4f32 (X86VUintToFP (v2i64 VR128X:$src)))))),
(VCVTUQQ2PSZ128rr VR128X:$src)>;
}
}
let Predicates = [HasDQI, NoVLX] in {
def : Pat<(v2i64 (fp_to_sint (v2f64 VR128X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPD2QQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4i64 (fp_to_sint (v4f32 VR128X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPS2QQZrr
(v8f32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_ymm)>;
def : Pat<(v4i64 (fp_to_sint (v4f64 VR256X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPD2QQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
def : Pat<(v2i64 (fp_to_uint (v2f64 VR128X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPD2UQQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4i64 (fp_to_uint (v4f32 VR128X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPS2UQQZrr
(v8f32 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_ymm)>;
def : Pat<(v4i64 (fp_to_uint (v4f64 VR256X:$src1))),
(EXTRACT_SUBREG (v8i64 (VCVTTPD2UQQZrr
(v8f64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
def : Pat<(v4f32 (sint_to_fp (v4i64 VR256X:$src1))),
(EXTRACT_SUBREG (v8f32 (VCVTQQ2PSZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_xmm)>;
def : Pat<(v2f64 (sint_to_fp (v2i64 VR128X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTQQ2PDZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4f64 (sint_to_fp (v4i64 VR256X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTQQ2PDZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
def : Pat<(v4f32 (uint_to_fp (v4i64 VR256X:$src1))),
(EXTRACT_SUBREG (v8f32 (VCVTUQQ2PSZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_xmm)>;
def : Pat<(v2f64 (uint_to_fp (v2i64 VR128X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTUQQ2PDZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR128X:$src1, sub_xmm)))), sub_xmm)>;
def : Pat<(v4f64 (uint_to_fp (v4i64 VR256X:$src1))),
(EXTRACT_SUBREG (v8f64 (VCVTUQQ2PDZrr
(v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src1, sub_ymm)))), sub_ymm)>;
}
//===----------------------------------------------------------------------===//
// Half precision conversion instructions
//===----------------------------------------------------------------------===//
multiclass avx512_cvtph2ps<X86VectorVTInfo _dest, X86VectorVTInfo _src,
X86MemOperand x86memop, PatFrag ld_frag> {
defm rr : AVX512_maskable<0x13, MRMSrcReg, _dest ,(outs _dest.RC:$dst), (ins _src.RC:$src),
"vcvtph2ps", "$src", "$src",
(X86cvtph2ps (_src.VT _src.RC:$src),
(i32 FROUND_CURRENT))>, T8PD;
defm rm : AVX512_maskable<0x13, MRMSrcMem, _dest, (outs _dest.RC:$dst), (ins x86memop:$src),
"vcvtph2ps", "$src", "$src",
(X86cvtph2ps (_src.VT (bitconvert (ld_frag addr:$src))),
(i32 FROUND_CURRENT))>, T8PD;
}
multiclass avx512_cvtph2ps_sae<X86VectorVTInfo _dest, X86VectorVTInfo _src> {
defm rb : AVX512_maskable<0x13, MRMSrcReg, _dest ,(outs _dest.RC:$dst), (ins _src.RC:$src),
"vcvtph2ps", "{sae}, $src", "$src, {sae}",
(X86cvtph2ps (_src.VT _src.RC:$src),
(i32 FROUND_NO_EXC))>, T8PD, EVEX_B;
}
let Predicates = [HasAVX512] in {
defm VCVTPH2PSZ : avx512_cvtph2ps<v16f32_info, v16i16x_info, f256mem, loadv4i64>,
avx512_cvtph2ps_sae<v16f32_info, v16i16x_info>,
EVEX, EVEX_V512, EVEX_CD8<32, CD8VH>;
let Predicates = [HasVLX] in {
defm VCVTPH2PSZ256 : avx512_cvtph2ps<v8f32x_info, v8i16x_info, f128mem,
loadv2i64>,EVEX, EVEX_V256, EVEX_CD8<32, CD8VH>;
defm VCVTPH2PSZ128 : avx512_cvtph2ps<v4f32x_info, v8i16x_info, f64mem,
loadv2i64>, EVEX, EVEX_V128, EVEX_CD8<32, CD8VH>;
}
}
multiclass avx512_cvtps2ph<X86VectorVTInfo _dest, X86VectorVTInfo _src,
X86MemOperand x86memop> {
defm rr : AVX512_maskable<0x1D, MRMDestReg, _dest ,(outs _dest.RC:$dst),
(ins _src.RC:$src1, i32u8imm:$src2),
"vcvtps2ph", "$src2, $src1", "$src1, $src2",
(X86cvtps2ph (_src.VT _src.RC:$src1),
(i32 imm:$src2)),
NoItinerary, 0, 0, X86select>, AVX512AIi8Base;
def mr : AVX512AIi8<0x1D, MRMDestMem, (outs),
(ins x86memop:$dst, _src.RC:$src1, i32u8imm:$src2),
"vcvtps2ph\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(store (_dest.VT (X86cvtps2ph (_src.VT _src.RC:$src1),
(i32 imm:$src2))),
addr:$dst)]>;
let hasSideEffects = 0, mayStore = 1 in
def mrk : AVX512AIi8<0x1D, MRMDestMem, (outs),
(ins x86memop:$dst, _dest.KRCWM:$mask, _src.RC:$src1, i32u8imm:$src2),
"vcvtps2ph\t{$src2, $src1, $dst {${mask}}|$dst {${mask}}, $src1, $src2}",
[]>, EVEX_K;
}
multiclass avx512_cvtps2ph_sae<X86VectorVTInfo _dest, X86VectorVTInfo _src> {
let hasSideEffects = 0 in
defm rb : AVX512_maskable_in_asm<0x1D, MRMDestReg, _dest,
(outs _dest.RC:$dst),
(ins _src.RC:$src1, i32u8imm:$src2),
"vcvtps2ph", "$src2, {sae}, $src1", "$src1, {sae}, $src2",
[]>, EVEX_B, AVX512AIi8Base;
}
let Predicates = [HasAVX512] in {
defm VCVTPS2PHZ : avx512_cvtps2ph<v16i16x_info, v16f32_info, f256mem>,
avx512_cvtps2ph_sae<v16i16x_info, v16f32_info>,
EVEX, EVEX_V512, EVEX_CD8<32, CD8VH>;
let Predicates = [HasVLX] in {
defm VCVTPS2PHZ256 : avx512_cvtps2ph<v8i16x_info, v8f32x_info, f128mem>,
EVEX, EVEX_V256, EVEX_CD8<32, CD8VH>;
defm VCVTPS2PHZ128 : avx512_cvtps2ph<v8i16x_info, v4f32x_info, f64mem>,
EVEX, EVEX_V128, EVEX_CD8<32, CD8VH>;
}
}
// Patterns for matching conversions from float to half-float and vice versa.
let Predicates = [HasVLX] in {
// Use MXCSR.RC for rounding instead of explicitly specifying the default
// rounding mode (Nearest-Even, encoded as 0). Both are equivalent in the
// configurations we support (the default). However, falling back to MXCSR is
// more consistent with other instructions, which are always controlled by it.
// It's encoded as 0b100.
def : Pat<(fp_to_f16 FR32X:$src),
(i16 (EXTRACT_SUBREG (VMOVPDI2DIZrr (VCVTPS2PHZ128rr
(COPY_TO_REGCLASS FR32X:$src, VR128X), 4)), sub_16bit))>;
def : Pat<(f16_to_fp GR16:$src),
(f32 (COPY_TO_REGCLASS (VCVTPH2PSZ128rr
(COPY_TO_REGCLASS (MOVSX32rr16 GR16:$src), VR128X)), FR32X)) >;
def : Pat<(f16_to_fp (i16 (fp_to_f16 FR32X:$src))),
(f32 (COPY_TO_REGCLASS (VCVTPH2PSZ128rr
(VCVTPS2PHZ128rr (COPY_TO_REGCLASS FR32X:$src, VR128X), 4)), FR32X)) >;
}
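// Illustrative round trip selected by the patterns above (f32 -> f16 -> f32):
//   vcvtps2ph xmm0, xmm0, 4   ; imm bit 2 set: round using MXCSR.RC
//   vcvtph2ps xmm0, xmm0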
// Patterns for matching float to half-float conversion when AVX512 is supported
// but F16C isn't. In that case we have to use 512-bit vectors.
let Predicates = [HasAVX512, NoVLX, NoF16C] in {
def : Pat<(fp_to_f16 FR32X:$src),
(i16 (EXTRACT_SUBREG
(VMOVPDI2DIZrr
(v8i16 (EXTRACT_SUBREG
(VCVTPS2PHZrr
(INSERT_SUBREG (v16f32 (IMPLICIT_DEF)),
(v4f32 (COPY_TO_REGCLASS FR32X:$src, VR128X)),
sub_xmm), 4), sub_xmm))), sub_16bit))>;
def : Pat<(f16_to_fp GR16:$src),
(f32 (COPY_TO_REGCLASS
(v4f32 (EXTRACT_SUBREG
(VCVTPH2PSZrr
(INSERT_SUBREG (v16i16 (IMPLICIT_DEF)),
(v8i16 (COPY_TO_REGCLASS (MOVSX32rr16 GR16:$src), VR128X)),
sub_xmm)), sub_xmm)), FR32X))>;
def : Pat<(f16_to_fp (i16 (fp_to_f16 FR32X:$src))),
(f32 (COPY_TO_REGCLASS
(v4f32 (EXTRACT_SUBREG
(VCVTPH2PSZrr
(VCVTPS2PHZrr (INSERT_SUBREG (v16f32 (IMPLICIT_DEF)),
(v4f32 (COPY_TO_REGCLASS FR32X:$src, VR128X)),
sub_xmm), 4)), sub_xmm)), FR32X))>;
}
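// Illustrative 512-bit fallback for the same round trip (no VLX/F16C):
//   vcvtps2ph ymm0, zmm0, 4
//   vcvtph2ps zmm0, ymm0
// with the scalar value carried in the low element of the wide registers.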
// Unordered/Ordered scalar fp compare with SAE and set EFLAGS
multiclass avx512_ord_cmp_sae<bits<8> opc, X86VectorVTInfo _,
string OpcodeStr> {
def rb: AVX512<opc, MRMSrcReg, (outs), (ins _.RC:$src1, _.RC:$src2),
!strconcat(OpcodeStr, "\t{{sae}, $src2, $src1|$src1, $src2, {sae}}"),
[], IIC_SSE_COMIS_RR>, EVEX, EVEX_B, VEX_LIG, EVEX_V128,
Sched<[WriteFAdd]>;
}
let Defs = [EFLAGS], Predicates = [HasAVX512] in {
defm VUCOMISSZ : avx512_ord_cmp_sae<0x2E, v4f32x_info, "vucomiss">,
AVX512PSIi8Base, EVEX_CD8<32, CD8VT1>;
defm VUCOMISDZ : avx512_ord_cmp_sae<0x2E, v2f64x_info, "vucomisd">,
AVX512PDIi8Base, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VCOMISSZ : avx512_ord_cmp_sae<0x2F, v4f32x_info, "vcomiss">,
AVX512PSIi8Base, EVEX_CD8<32, CD8VT1>;
defm VCOMISDZ : avx512_ord_cmp_sae<0x2F, v2f64x_info, "vcomisd">,
AVX512PDIi8Base, VEX_W, EVEX_CD8<64, CD8VT1>;
}
let Defs = [EFLAGS], Predicates = [HasAVX512] in {
defm VUCOMISSZ : sse12_ord_cmp<0x2E, FR32X, X86cmp, f32, f32mem, loadf32,
"ucomiss">, PS, EVEX, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm VUCOMISDZ : sse12_ord_cmp<0x2E, FR64X, X86cmp, f64, f64mem, loadf64,
"ucomisd">, PD, EVEX,
VEX_LIG, VEX_W, EVEX_CD8<64, CD8VT1>;
let Pattern = []<dag> in {
defm VCOMISSZ : sse12_ord_cmp<0x2F, FR32X, undef, f32, f32mem, loadf32,
"comiss">, PS, EVEX, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm VCOMISDZ : sse12_ord_cmp<0x2F, FR64X, undef, f64, f64mem, loadf64,
"comisd">, PD, EVEX,
VEX_LIG, VEX_W, EVEX_CD8<64, CD8VT1>;
}
let isCodeGenOnly = 1 in {
defm Int_VUCOMISSZ : sse12_ord_cmp_int<0x2E, VR128X, X86ucomi, v4f32, ssmem,
sse_load_f32, "ucomiss">, PS, EVEX, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm Int_VUCOMISDZ : sse12_ord_cmp_int<0x2E, VR128X, X86ucomi, v2f64, sdmem,
sse_load_f64, "ucomisd">, PD, EVEX,
VEX_LIG, VEX_W, EVEX_CD8<64, CD8VT1>;
defm Int_VCOMISSZ : sse12_ord_cmp_int<0x2F, VR128X, X86comi, v4f32, ssmem,
sse_load_f32, "comiss">, PS, EVEX, VEX_LIG,
EVEX_CD8<32, CD8VT1>;
defm Int_VCOMISDZ : sse12_ord_cmp_int<0x2F, VR128X, X86comi, v2f64, sdmem,
sse_load_f64, "comisd">, PD, EVEX,
VEX_LIG, VEX_W, EVEX_CD8<64, CD8VT1>;
}
}
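// Note: comis* and ucomis* differ only in NaN behavior - comis* raises #IA
// on any NaN operand, while ucomis* raises it only for signaling NaNs; both
// set ZF/PF/CF from the compare result.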
/// avx512_fp14_s rcp14ss, rcp14sd, rsqrt14ss, rsqrt14sd
multiclass avx512_fp14_s<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let Predicates = [HasAVX512], ExeDomain = _.ExeDomain in {
defm rr : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2))>, EVEX_4V;
defm rm : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1),
(_.VT (scalar_to_vector (_.ScalarLdFrag addr:$src2))))>, EVEX_4V;
}
}
defm VRCP14SS : avx512_fp14_s<0x4D, "vrcp14ss", X86frcp14s, f32x_info>,
EVEX_CD8<32, CD8VT1>, T8PD;
defm VRCP14SD : avx512_fp14_s<0x4D, "vrcp14sd", X86frcp14s, f64x_info>,
VEX_W, EVEX_CD8<64, CD8VT1>, T8PD;
defm VRSQRT14SS : avx512_fp14_s<0x4F, "vrsqrt14ss", X86frsqrt14s, f32x_info>,
EVEX_CD8<32, CD8VT1>, T8PD;
defm VRSQRT14SD : avx512_fp14_s<0x4F, "vrsqrt14sd", X86frsqrt14s, f64x_info>,
VEX_W, EVEX_CD8<64, CD8VT1>, T8PD;
/// avx512_fp14_p rcp14ps, rcp14pd, rsqrt14ps, rsqrt14pd
multiclass avx512_fp14_p<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src), OpcodeStr, "$src", "$src",
(_.FloatVT (OpNode _.RC:$src))>, EVEX, T8PD;
defm m: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src), OpcodeStr, "$src", "$src",
(OpNode (_.FloatVT
(bitconvert (_.LdFrag addr:$src))))>, EVEX, T8PD;
defm mb: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src), OpcodeStr,
"${src}"##_.BroadcastStr, "${src}"##_.BroadcastStr,
(OpNode (_.FloatVT
(X86VBroadcast (_.ScalarLdFrag addr:$src))))>,
EVEX, T8PD, EVEX_B;
}
}
multiclass avx512_fp14_p_vl_all<bits<8> opc, string OpcodeStr, SDNode OpNode> {
defm PSZ : avx512_fp14_p<opc, !strconcat(OpcodeStr, "ps"), OpNode, v16f32_info>,
EVEX_V512, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_fp14_p<opc, !strconcat(OpcodeStr, "pd"), OpNode, v8f64_info>,
EVEX_V512, VEX_W, EVEX_CD8<64, CD8VF>;
// Define only if AVX512VL feature is present.
let Predicates = [HasVLX] in {
defm PSZ128 : avx512_fp14_p<opc, !strconcat(OpcodeStr, "ps"),
OpNode, v4f32x_info>,
EVEX_V128, EVEX_CD8<32, CD8VF>;
defm PSZ256 : avx512_fp14_p<opc, !strconcat(OpcodeStr, "ps"),
OpNode, v8f32x_info>,
EVEX_V256, EVEX_CD8<32, CD8VF>;
defm PDZ128 : avx512_fp14_p<opc, !strconcat(OpcodeStr, "pd"),
OpNode, v2f64x_info>,
EVEX_V128, VEX_W, EVEX_CD8<64, CD8VF>;
defm PDZ256 : avx512_fp14_p<opc, !strconcat(OpcodeStr, "pd"),
OpNode, v4f64x_info>,
EVEX_V256, VEX_W, EVEX_CD8<64, CD8VF>;
}
}
defm VRSQRT14 : avx512_fp14_p_vl_all<0x4E, "vrsqrt14", X86frsqrt>;
defm VRCP14 : avx512_fp14_p_vl_all<0x4C, "vrcp14", X86frcp>;
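// Note: the *14* variants compute approximations with a maximum relative
// error of 2^-14; the AVX512ER *28* variants below tighten this to 2^-28.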
/// avx512_fp28_s rcp28ss, rcp28sd, rsqrt28ss, rsqrt28sd
multiclass avx512_fp28_s<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode OpNode> {
let ExeDomain = _.ExeDomain in {
defm r : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 FROUND_CURRENT))>;
defm rb : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"{sae}, $src2, $src1", "$src1, $src2, {sae}",
(OpNode (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 FROUND_NO_EXC))>, EVEX_B;
defm m : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1),
(_.VT (scalar_to_vector (_.ScalarLdFrag addr:$src2))),
(i32 FROUND_CURRENT))>;
}
}
multiclass avx512_eri_s<bits<8> opc, string OpcodeStr, SDNode OpNode> {
defm SS : avx512_fp28_s<opc, OpcodeStr#"ss", f32x_info, OpNode>,
EVEX_CD8<32, CD8VT1>;
defm SD : avx512_fp28_s<opc, OpcodeStr#"sd", f64x_info, OpNode>,
EVEX_CD8<64, CD8VT1>, VEX_W;
}
let Predicates = [HasERI] in {
defm VRCP28 : avx512_eri_s<0xCB, "vrcp28", X86rcp28s>, T8PD, EVEX_4V;
defm VRSQRT28 : avx512_eri_s<0xCD, "vrsqrt28", X86rsqrt28s>, T8PD, EVEX_4V;
}
defm VGETEXP : avx512_eri_s<0x43, "vgetexp", X86fgetexpRnds>, T8PD, EVEX_4V;
/// avx512_fp28_p rcp28ps, rcp28pd, rsqrt28ps, rsqrt28pd
multiclass avx512_fp28_p<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode OpNode> {
let ExeDomain = _.ExeDomain in {
defm r : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src), OpcodeStr, "$src", "$src",
(OpNode (_.VT _.RC:$src), (i32 FROUND_CURRENT))>;
defm m : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src), OpcodeStr, "$src", "$src",
(OpNode (_.FloatVT
(bitconvert (_.LdFrag addr:$src))),
(i32 FROUND_CURRENT))>;
defm mb : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src), OpcodeStr,
"${src}"##_.BroadcastStr, "${src}"##_.BroadcastStr,
(OpNode (_.FloatVT
(X86VBroadcast (_.ScalarLdFrag addr:$src))),
(i32 FROUND_CURRENT))>, EVEX_B;
}
}
multiclass avx512_fp28_p_round<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
SDNode OpNode> {
let ExeDomain = _.ExeDomain in
defm rb : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src), OpcodeStr,
"{sae}, $src", "$src, {sae}",
(OpNode (_.VT _.RC:$src), (i32 FROUND_NO_EXC))>, EVEX_B;
}
multiclass avx512_eri<bits<8> opc, string OpcodeStr, SDNode OpNode> {
defm PS : avx512_fp28_p<opc, OpcodeStr#"ps", v16f32_info, OpNode>,
avx512_fp28_p_round<opc, OpcodeStr#"ps", v16f32_info, OpNode>,
T8PD, EVEX_V512, EVEX_CD8<32, CD8VF>;
defm PD : avx512_fp28_p<opc, OpcodeStr#"pd", v8f64_info, OpNode>,
avx512_fp28_p_round<opc, OpcodeStr#"pd", v8f64_info, OpNode>,
T8PD, EVEX_V512, VEX_W, EVEX_CD8<64, CD8VF>;
}
multiclass avx512_fp_unaryop_packed<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
// Define only if AVX512VL feature is present.
let Predicates = [HasVLX] in {
defm PSZ128 : avx512_fp28_p<opc, OpcodeStr#"ps", v4f32x_info, OpNode>,
EVEX_V128, T8PD, EVEX_CD8<32, CD8VF>;
defm PSZ256 : avx512_fp28_p<opc, OpcodeStr#"ps", v8f32x_info, OpNode>,
EVEX_V256, T8PD, EVEX_CD8<32, CD8VF>;
defm PDZ128 : avx512_fp28_p<opc, OpcodeStr#"pd", v2f64x_info, OpNode>,
EVEX_V128, VEX_W, T8PD, EVEX_CD8<64, CD8VF>;
defm PDZ256 : avx512_fp28_p<opc, OpcodeStr#"pd", v4f64x_info, OpNode>,
EVEX_V256, VEX_W, T8PD, EVEX_CD8<64, CD8VF>;
}
}
let Predicates = [HasERI] in {
defm VRSQRT28 : avx512_eri<0xCC, "vrsqrt28", X86rsqrt28>, EVEX;
defm VRCP28 : avx512_eri<0xCA, "vrcp28", X86rcp28>, EVEX;
defm VEXP2 : avx512_eri<0xC8, "vexp2", X86exp2>, EVEX;
}
defm VGETEXP : avx512_eri<0x42, "vgetexp", X86fgetexpRnd>,
avx512_fp_unaryop_packed<0x42, "vgetexp", X86fgetexpRnd> , EVEX;
multiclass avx512_sqrt_packed_round<bits<8> opc, string OpcodeStr,
SDNode OpNodeRnd, X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in
defm rb: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src, AVX512RC:$rc), OpcodeStr, "$rc, $src", "$src, $rc",
(_.VT (OpNodeRnd _.RC:$src, (i32 imm:$rc)))>,
EVEX, EVEX_B, EVEX_RC;
}
multiclass avx512_sqrt_packed<bits<8> opc, string OpcodeStr,
SDNode OpNode, X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in {
defm r: AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src), OpcodeStr, "$src", "$src",
(_.FloatVT (OpNode _.RC:$src))>, EVEX;
defm m: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src), OpcodeStr, "$src", "$src",
(OpNode (_.FloatVT
(bitconvert (_.LdFrag addr:$src))))>, EVEX;
defm mb: AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src), OpcodeStr,
"${src}"##_.BroadcastStr, "${src}"##_.BroadcastStr,
(OpNode (_.FloatVT
(X86VBroadcast (_.ScalarLdFrag addr:$src))))>,
EVEX, EVEX_B;
}
}
multiclass avx512_sqrt_packed_all<bits<8> opc, string OpcodeStr,
SDNode OpNode> {
defm PSZ : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "ps"), OpNode,
v16f32_info>,
EVEX_V512, PS, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "pd"), OpNode,
v8f64_info>,
EVEX_V512, VEX_W, PD, EVEX_CD8<64, CD8VF>;
// Define only if AVX512VL feature is present.
let Predicates = [HasVLX] in {
defm PSZ128 : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "ps"),
OpNode, v4f32x_info>,
EVEX_V128, PS, EVEX_CD8<32, CD8VF>;
defm PSZ256 : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "ps"),
OpNode, v8f32x_info>,
EVEX_V256, PS, EVEX_CD8<32, CD8VF>;
defm PDZ128 : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "pd"),
OpNode, v2f64x_info>,
EVEX_V128, VEX_W, PD, EVEX_CD8<64, CD8VF>;
defm PDZ256 : avx512_sqrt_packed<opc, !strconcat(OpcodeStr, "pd"),
OpNode, v4f64x_info>,
EVEX_V256, VEX_W, PD, EVEX_CD8<64, CD8VF>;
}
}
multiclass avx512_sqrt_packed_all_round<bits<8> opc, string OpcodeStr,
SDNode OpNodeRnd> {
defm PSZ : avx512_sqrt_packed_round<opc, !strconcat(OpcodeStr, "ps"), OpNodeRnd,
v16f32_info>, EVEX_V512, PS, EVEX_CD8<32, CD8VF>;
defm PDZ : avx512_sqrt_packed_round<opc, !strconcat(OpcodeStr, "pd"), OpNodeRnd,
v8f64_info>, EVEX_V512, VEX_W, PD, EVEX_CD8<64, CD8VF>;
}
multiclass avx512_sqrt_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
string SUFF, SDNode OpNode, SDNode OpNodeRnd> {
let ExeDomain = _.ExeDomain in {
defm r_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNodeRnd (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 FROUND_CURRENT))>;
defm m_Int : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2), OpcodeStr,
"$src2, $src1", "$src1, $src2",
(OpNodeRnd (_.VT _.RC:$src1),
(_.VT (scalar_to_vector
(_.ScalarLdFrag addr:$src2))),
(i32 FROUND_CURRENT))>;
defm rb_Int : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, AVX512RC:$rc), OpcodeStr,
"$rc, $src2, $src1", "$src1, $src2, $rc",
(OpNodeRnd (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 imm:$rc))>,
EVEX_B, EVEX_RC;
let isCodeGenOnly = 1, hasSideEffects = 0 in {
def r : I<opc, MRMSrcReg, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.FRC:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>;
let mayLoad = 1 in
def m : I<opc, MRMSrcMem, (outs _.FRC:$dst),
(ins _.FRC:$src1, _.ScalarMemOp:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>;
}
}
def : Pat<(_.EltVT (OpNode _.FRC:$src)),
(!cast<Instruction>(NAME#SUFF#Zr)
(_.EltVT (IMPLICIT_DEF)), _.FRC:$src)>;
def : Pat<(_.EltVT (OpNode (load addr:$src))),
(!cast<Instruction>(NAME#SUFF#Zm)
(_.EltVT (IMPLICIT_DEF)), addr:$src)>, Requires<[HasAVX512, OptForSize]>;
}
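// Illustrative: the patterns above let a plain scalar fsqrt select e.g.
// "vsqrtss xmm0, xmm0, xmm1"; the first source is IMPLICIT_DEF because only
// the low element matters. The load-folding pattern is gated on OptForSize,
// presumably because folding the load ties the result to the stale upper
// elements of the destination register.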
multiclass avx512_sqrt_scalar_all<bits<8> opc, string OpcodeStr> {
defm SSZ : avx512_sqrt_scalar<opc, OpcodeStr#"ss", f32x_info, "SS", fsqrt,
X86fsqrtRnds>, EVEX_CD8<32, CD8VT1>, EVEX_4V, XS;
defm SDZ : avx512_sqrt_scalar<opc, OpcodeStr#"sd", f64x_info, "SD", fsqrt,
X86fsqrtRnds>, EVEX_CD8<64, CD8VT1>, EVEX_4V, XD, VEX_W;
}
defm VSQRT : avx512_sqrt_packed_all<0x51, "vsqrt", fsqrt>,
avx512_sqrt_packed_all_round<0x51, "vsqrt", X86fsqrtRnd>;
defm VSQRT : avx512_sqrt_scalar_all<0x51, "vsqrt">, VEX_LIG;
let Predicates = [HasAVX512] in {
def : Pat<(f32 (X86frsqrt FR32X:$src)),
(COPY_TO_REGCLASS (VRSQRT14SSrr (v4f32 (IMPLICIT_DEF)), (COPY_TO_REGCLASS FR32X:$src, VR128X)), VR128X)>;
def : Pat<(f32 (X86frsqrt (load addr:$src))),
(COPY_TO_REGCLASS (VRSQRT14SSrm (v4f32 (IMPLICIT_DEF)), addr:$src), VR128X)>,
Requires<[OptForSize]>;
def : Pat<(f32 (X86frcp FR32X:$src)),
(COPY_TO_REGCLASS (VRCP14SSrr (v4f32 (IMPLICIT_DEF)), (COPY_TO_REGCLASS FR32X:$src, VR128X)), VR128X )>;
def : Pat<(f32 (X86frcp (load addr:$src))),
(COPY_TO_REGCLASS (VRCP14SSrm (v4f32 (IMPLICIT_DEF)), addr:$src), VR128X)>,
Requires<[OptForSize]>;
}
multiclass
avx512_rndscale_scalar<bits<8> opc, string OpcodeStr, X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm r : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3), OpcodeStr,
"$src3, $src2, $src1", "$src1, $src2, $src3",
(_.VT (X86RndScales (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 imm:$src3), (i32 FROUND_CURRENT)))>;
defm rb : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3), OpcodeStr,
"$src3, {sae}, $src2, $src1", "$src1, $src2, {sae}, $src3",
(_.VT (X86RndScales (_.VT _.RC:$src1), (_.VT _.RC:$src2),
(i32 imm:$src3), (i32 FROUND_NO_EXC)))>, EVEX_B;
defm m : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, i32u8imm:$src3),
OpcodeStr,
"$src3, $src2, $src1", "$src1, $src2, $src3",
(_.VT (X86RndScales (_.VT _.RC:$src1),
(_.VT (scalar_to_vector (_.ScalarLdFrag addr:$src2))),
(i32 imm:$src3), (i32 FROUND_CURRENT)))>;
}
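// The immediates below follow the (V)RNDSCALE imm8 encoding: bits [1:0]
// select the rounding mode (00 = nearest-even, 01 = down, 10 = up,
// 11 = truncate), bit 2 selects MXCSR.RC instead of imm8[1:0], bit 3
// suppresses the precision exception, and bits [7:4] give the number of
// fraction bits (0 here). Hence 0x9 = floor, 0xa = ceil, 0xb = trunc,
// 0x4 = rint (MXCSR, may raise precision), 0xc = nearbyint (MXCSR, silent).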
let Predicates = [HasAVX512] in {
def : Pat<(ffloor _.FRC:$src), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##r) (_.VT (IMPLICIT_DEF)),
(_.VT (COPY_TO_REGCLASS _.FRC:$src, _.RC)), (i32 0x9))), _.FRC)>;
def : Pat<(fceil _.FRC:$src), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##r) (_.VT (IMPLICIT_DEF)),
(_.VT (COPY_TO_REGCLASS _.FRC:$src, _.RC)), (i32 0xa))), _.FRC)>;
def : Pat<(ftrunc _.FRC:$src), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##r) (_.VT (IMPLICIT_DEF)),
(_.VT (COPY_TO_REGCLASS _.FRC:$src, _.RC)), (i32 0xb))), _.FRC)>;
def : Pat<(frint _.FRC:$src), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##r) (_.VT (IMPLICIT_DEF)),
(_.VT (COPY_TO_REGCLASS _.FRC:$src, _.RC)), (i32 0x4))), _.FRC)>;
def : Pat<(fnearbyint _.FRC:$src), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##r) (_.VT (IMPLICIT_DEF)),
(_.VT (COPY_TO_REGCLASS _.FRC:$src, _.RC)), (i32 0xc))), _.FRC)>;
def : Pat<(ffloor (_.ScalarLdFrag addr:$src)), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##m) (_.VT (IMPLICIT_DEF)),
addr:$src, (i32 0x9))), _.FRC)>;
def : Pat<(fceil (_.ScalarLdFrag addr:$src)), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##m) (_.VT (IMPLICIT_DEF)),
addr:$src, (i32 0xa))), _.FRC)>;
def : Pat<(ftrunc (_.ScalarLdFrag addr:$src)), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##m) (_.VT (IMPLICIT_DEF)),
addr:$src, (i32 0xb))), _.FRC)>;
def : Pat<(frint (_.ScalarLdFrag addr:$src)), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##m) (_.VT (IMPLICIT_DEF)),
addr:$src, (i32 0x4))), _.FRC)>;
def : Pat<(fnearbyint (_.ScalarLdFrag addr:$src)), (COPY_TO_REGCLASS
(_.VT (!cast<Instruction>(NAME##m) (_.VT (IMPLICIT_DEF)),
addr:$src, (i32 0xc))), _.FRC)>;
}
}
defm VRNDSCALESS : avx512_rndscale_scalar<0x0A, "vrndscaless", f32x_info>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<32, CD8VT1>;
defm VRNDSCALESD : avx512_rndscale_scalar<0x0B, "vrndscalesd", f64x_info>, VEX_W,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<64, CD8VT1>;
//===----------------------------------------------------------------------===//
// Integer truncate and extend operations
//===----------------------------------------------------------------------===//
multiclass avx512_trunc_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo SrcInfo, X86VectorVTInfo DestInfo,
X86MemOperand x86memop> {
let ExeDomain = DestInfo.ExeDomain in
defm rr : AVX512_maskable<opc, MRMDestReg, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.RC:$src1), OpcodeStr ,"$src1", "$src1",
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1)))>,
EVEX, T8XS;
// For intrinsic pattern match.
def : Pat<(DestInfo.VT (X86select DestInfo.KRCWM:$mask,
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1))),
undef)),
(!cast<Instruction>(NAME#SrcInfo.ZSuffix##rrkz) DestInfo.KRCWM:$mask ,
SrcInfo.RC:$src1)>;
def : Pat<(DestInfo.VT (X86select DestInfo.KRCWM:$mask,
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1))),
DestInfo.ImmAllZerosV)),
(!cast<Instruction>(NAME#SrcInfo.ZSuffix##rrkz) DestInfo.KRCWM:$mask ,
SrcInfo.RC:$src1)>;
def : Pat<(DestInfo.VT (X86select DestInfo.KRCWM:$mask,
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1))),
DestInfo.RC:$src0)),
(!cast<Instruction>(NAME#SrcInfo.ZSuffix##rrk) DestInfo.RC:$src0,
DestInfo.KRCWM:$mask ,
SrcInfo.RC:$src1)>;
let mayStore = 1, mayLoad = 1, hasSideEffects = 0,
ExeDomain = DestInfo.ExeDomain in {
def mr : AVX512XS8I<opc, MRMDestMem, (outs),
(ins x86memop:$dst, SrcInfo.RC:$src),
OpcodeStr # "\t{$src, $dst|$dst, $src}",
[]>, EVEX;
def mrk : AVX512XS8I<opc, MRMDestMem, (outs),
(ins x86memop:$dst, SrcInfo.KRCWM:$mask, SrcInfo.RC:$src),
OpcodeStr # "\t{$src, $dst {${mask}}|$dst {${mask}}, $src}",
[]>, EVEX, EVEX_K;
} // mayStore = 1, mayLoad = 1, hasSideEffects = 0
}
multiclass avx512_trunc_mr_lowering<X86VectorVTInfo SrcInfo,
X86VectorVTInfo DestInfo,
PatFrag truncFrag, PatFrag mtruncFrag > {
def : Pat<(truncFrag (SrcInfo.VT SrcInfo.RC:$src), addr:$dst),
(!cast<Instruction>(NAME#SrcInfo.ZSuffix##mr)
addr:$dst, SrcInfo.RC:$src)>;
def : Pat<(mtruncFrag addr:$dst, SrcInfo.KRCWM:$mask,
(SrcInfo.VT SrcInfo.RC:$src)),
(!cast<Instruction>(NAME#SrcInfo.ZSuffix##mrk)
addr:$dst, SrcInfo.KRCWM:$mask, SrcInfo.RC:$src)>;
}
multiclass avx512_trunc<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTSrcInfo, X86VectorVTInfo DestInfoZ128,
X86VectorVTInfo DestInfoZ256, X86VectorVTInfo DestInfoZ,
X86MemOperand x86memopZ128, X86MemOperand x86memopZ256,
X86MemOperand x86memopZ, PatFrag truncFrag, PatFrag mtruncFrag,
Predicate prd = HasAVX512>{
let Predicates = [HasVLX, prd] in {
defm Z128: avx512_trunc_common<opc, OpcodeStr, OpNode, VTSrcInfo.info128,
DestInfoZ128, x86memopZ128>,
avx512_trunc_mr_lowering<VTSrcInfo.info128, DestInfoZ128,
truncFrag, mtruncFrag>, EVEX_V128;
defm Z256: avx512_trunc_common<opc, OpcodeStr, OpNode, VTSrcInfo.info256,
DestInfoZ256, x86memopZ256>,
avx512_trunc_mr_lowering<VTSrcInfo.info256, DestInfoZ256,
truncFrag, mtruncFrag>, EVEX_V256;
}
let Predicates = [prd] in
defm Z: avx512_trunc_common<opc, OpcodeStr, OpNode, VTSrcInfo.info512,
DestInfoZ, x86memopZ>,
avx512_trunc_mr_lowering<VTSrcInfo.info512, DestInfoZ,
truncFrag, mtruncFrag>, EVEX_V512;
}
multiclass avx512_trunc_qb<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i64_info,
v16i8x_info, v16i8x_info, v16i8x_info, i16mem, i32mem, i64mem,
StoreNode, MaskedStoreNode>, EVEX_CD8<8, CD8VO>;
}
multiclass avx512_trunc_qw<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i64_info,
v8i16x_info, v8i16x_info, v8i16x_info, i32mem, i64mem, i128mem,
StoreNode, MaskedStoreNode>, EVEX_CD8<16, CD8VQ>;
}
multiclass avx512_trunc_qd<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i64_info,
v4i32x_info, v4i32x_info, v8i32x_info, i64mem, i128mem, i256mem,
StoreNode, MaskedStoreNode>, EVEX_CD8<32, CD8VH>;
}
multiclass avx512_trunc_db<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i32_info,
v16i8x_info, v16i8x_info, v16i8x_info, i32mem, i64mem, i128mem,
StoreNode, MaskedStoreNode>, EVEX_CD8<8, CD8VQ>;
}
multiclass avx512_trunc_dw<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i32_info,
v8i16x_info, v8i16x_info, v16i16x_info, i64mem, i128mem, i256mem,
StoreNode, MaskedStoreNode>, EVEX_CD8<16, CD8VH>;
}
multiclass avx512_trunc_wb<bits<8> opc, string OpcodeStr, SDNode OpNode,
PatFrag StoreNode, PatFrag MaskedStoreNode> {
defm NAME: avx512_trunc<opc, OpcodeStr, OpNode, avx512vl_i16_info,
v16i8x_info, v16i8x_info, v32i8x_info, i64mem, i128mem, i256mem,
StoreNode, MaskedStoreNode, HasBWI>, EVEX_CD8<16, CD8VH>;
}
defm VPMOVQB : avx512_trunc_qb<0x32, "vpmovqb", X86vtrunc,
truncstorevi8, masked_truncstorevi8>;
defm VPMOVSQB : avx512_trunc_qb<0x22, "vpmovsqb", X86vtruncs,
truncstore_s_vi8, masked_truncstore_s_vi8>;
defm VPMOVUSQB : avx512_trunc_qb<0x12, "vpmovusqb", X86vtruncus,
truncstore_us_vi8, masked_truncstore_us_vi8>;
defm VPMOVQW : avx512_trunc_qw<0x34, "vpmovqw", X86vtrunc,
truncstorevi16, masked_truncstorevi16>;
defm VPMOVSQW : avx512_trunc_qw<0x24, "vpmovsqw", X86vtruncs,
truncstore_s_vi16, masked_truncstore_s_vi16>;
defm VPMOVUSQW : avx512_trunc_qw<0x14, "vpmovusqw", X86vtruncus,
truncstore_us_vi16, masked_truncstore_us_vi16>;
defm VPMOVQD : avx512_trunc_qd<0x35, "vpmovqd", X86vtrunc,
truncstorevi32, masked_truncstorevi32>;
defm VPMOVSQD : avx512_trunc_qd<0x25, "vpmovsqd", X86vtruncs,
truncstore_s_vi32, masked_truncstore_s_vi32>;
defm VPMOVUSQD : avx512_trunc_qd<0x15, "vpmovusqd", X86vtruncus,
truncstore_us_vi32, masked_truncstore_us_vi32>;
defm VPMOVDB : avx512_trunc_db<0x31, "vpmovdb", X86vtrunc,
truncstorevi8, masked_truncstorevi8>;
defm VPMOVSDB : avx512_trunc_db<0x21, "vpmovsdb", X86vtruncs,
truncstore_s_vi8, masked_truncstore_s_vi8>;
defm VPMOVUSDB : avx512_trunc_db<0x11, "vpmovusdb", X86vtruncus,
truncstore_us_vi8, masked_truncstore_us_vi8>;
defm VPMOVDW : avx512_trunc_dw<0x33, "vpmovdw", X86vtrunc,
truncstorevi16, masked_truncstorevi16>;
defm VPMOVSDW : avx512_trunc_dw<0x23, "vpmovsdw", X86vtruncs,
truncstore_s_vi16, masked_truncstore_s_vi16>;
defm VPMOVUSDW : avx512_trunc_dw<0x13, "vpmovusdw", X86vtruncus,
truncstore_us_vi16, masked_truncstore_us_vi16>;
defm VPMOVWB : avx512_trunc_wb<0x30, "vpmovwb", X86vtrunc,
truncstorevi8, masked_truncstorevi8>;
defm VPMOVSWB : avx512_trunc_wb<0x20, "vpmovswb", X86vtruncs,
truncstore_s_vi8, masked_truncstore_s_vi8>;
defm VPMOVUSWB : avx512_trunc_wb<0x10, "vpmovuswb", X86vtruncus,
truncstore_us_vi8, masked_truncstore_us_vi8>;
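// Illustrative: the three families differ only in how out-of-range values
// are handled, e.g. per i64 element for the QB forms:
//   vpmovqb   - plain truncation (keep the low 8 bits)
//   vpmovsqb  - signed saturation to [-128, 127]
//   vpmovusqb - unsigned saturation to [0, 255]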
let Predicates = [HasAVX512, NoVLX] in {
def: Pat<(v8i16 (X86vtrunc (v8i32 VR256X:$src))),
(v8i16 (EXTRACT_SUBREG
(v16i16 (VPMOVDWZrr (v16i32 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src, sub_ymm)))), sub_xmm))>;
def: Pat<(v4i32 (X86vtrunc (v4i64 VR256X:$src))),
(v4i32 (EXTRACT_SUBREG
(v8i32 (VPMOVQDZrr (v8i64 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src, sub_ymm)))), sub_xmm))>;
}
let Predicates = [HasBWI, NoVLX] in {
def: Pat<(v16i8 (X86vtrunc (v16i16 VR256X:$src))),
(v16i8 (EXTRACT_SUBREG (VPMOVWBZrr (v32i16 (INSERT_SUBREG (IMPLICIT_DEF),
VR256X:$src, sub_ymm))), sub_xmm))>;
}
multiclass avx512_extend_common<bits<8> opc, string OpcodeStr,
X86VectorVTInfo DestInfo, X86VectorVTInfo SrcInfo,
X86MemOperand x86memop, PatFrag LdFrag, SDPatternOperator OpNode>{
let ExeDomain = DestInfo.ExeDomain in {
defm rr : AVX512_maskable<opc, MRMSrcReg, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.RC:$src), OpcodeStr ,"$src", "$src",
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src)))>,
EVEX;
defm rm : AVX512_maskable<opc, MRMSrcMem, DestInfo, (outs DestInfo.RC:$dst),
(ins x86memop:$src), OpcodeStr ,"$src", "$src",
(DestInfo.VT (LdFrag addr:$src))>,
EVEX;
}
}
multiclass avx512_extend_BW<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi8")> {
let Predicates = [HasVLX, HasBWI] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v8i16x_info,
v16i8x_info, i64mem, LdFrag, InVecNode>,
EVEX_CD8<8, CD8VH>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v16i16x_info,
v16i8x_info, i128mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VH>, T8PD, EVEX_V256;
}
let Predicates = [HasBWI] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v32i16_info,
v32i8x_info, i256mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VH>, T8PD, EVEX_V512;
}
}
multiclass avx512_extend_BD<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi8")> {
let Predicates = [HasVLX, HasAVX512] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v4i32x_info,
v16i8x_info, i32mem, LdFrag, InVecNode>,
EVEX_CD8<8, CD8VQ>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v8i32x_info,
v16i8x_info, i64mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VQ>, T8PD, EVEX_V256;
}
let Predicates = [HasAVX512] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v16i32_info,
v16i8x_info, i128mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VQ>, T8PD, EVEX_V512;
}
}
multiclass avx512_extend_BQ<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi8")> {
let Predicates = [HasVLX, HasAVX512] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v2i64x_info,
v16i8x_info, i16mem, LdFrag, InVecNode>,
EVEX_CD8<8, CD8VO>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v4i64x_info,
v16i8x_info, i32mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VO>, T8PD, EVEX_V256;
}
let Predicates = [HasAVX512] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v8i64_info,
v16i8x_info, i64mem, LdFrag, OpNode>,
EVEX_CD8<8, CD8VO>, T8PD, EVEX_V512;
}
}
multiclass avx512_extend_WD<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi16")> {
let Predicates = [HasVLX, HasAVX512] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v4i32x_info,
v8i16x_info, i64mem, LdFrag, InVecNode>,
EVEX_CD8<16, CD8VH>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v8i32x_info,
v8i16x_info, i128mem, LdFrag, OpNode>,
EVEX_CD8<16, CD8VH>, T8PD, EVEX_V256;
}
let Predicates = [HasAVX512] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v16i32_info,
v16i16x_info, i256mem, LdFrag, OpNode>,
EVEX_CD8<16, CD8VH>, T8PD, EVEX_V512;
}
}
multiclass avx512_extend_WQ<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi16")> {
let Predicates = [HasVLX, HasAVX512] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v2i64x_info,
v8i16x_info, i32mem, LdFrag, InVecNode>,
EVEX_CD8<16, CD8VQ>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v4i64x_info,
v8i16x_info, i64mem, LdFrag, OpNode>,
EVEX_CD8<16, CD8VQ>, T8PD, EVEX_V256;
}
let Predicates = [HasAVX512] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v8i64_info,
v8i16x_info, i128mem, LdFrag, OpNode>,
EVEX_CD8<16, CD8VQ>, T8PD, EVEX_V512;
}
}
multiclass avx512_extend_DQ<bits<8> opc, string OpcodeStr,
SDPatternOperator OpNode, SDPatternOperator InVecNode,
string ExtTy, PatFrag LdFrag = !cast<PatFrag>(ExtTy#"extloadvi32")> {
let Predicates = [HasVLX, HasAVX512] in {
defm Z128: avx512_extend_common<opc, OpcodeStr, v2i64x_info,
v4i32x_info, i64mem, LdFrag, InVecNode>,
EVEX_CD8<32, CD8VH>, T8PD, EVEX_V128;
defm Z256: avx512_extend_common<opc, OpcodeStr, v4i64x_info,
v4i32x_info, i128mem, LdFrag, OpNode>,
EVEX_CD8<32, CD8VH>, T8PD, EVEX_V256;
}
let Predicates = [HasAVX512] in {
defm Z : avx512_extend_common<opc, OpcodeStr, v8i64_info,
v8i32x_info, i256mem, LdFrag, OpNode>,
EVEX_CD8<32, CD8VH>, T8PD, EVEX_V512;
}
}
defm VPMOVZXBW : avx512_extend_BW<0x30, "vpmovzxbw", X86vzext, zext_invec, "z">;
defm VPMOVZXBD : avx512_extend_BD<0x31, "vpmovzxbd", X86vzext, zext_invec, "z">;
defm VPMOVZXBQ : avx512_extend_BQ<0x32, "vpmovzxbq", X86vzext, zext_invec, "z">;
defm VPMOVZXWD : avx512_extend_WD<0x33, "vpmovzxwd", X86vzext, zext_invec, "z">;
defm VPMOVZXWQ : avx512_extend_WQ<0x34, "vpmovzxwq", X86vzext, zext_invec, "z">;
defm VPMOVZXDQ : avx512_extend_DQ<0x35, "vpmovzxdq", X86vzext, zext_invec, "z">;
defm VPMOVSXBW: avx512_extend_BW<0x20, "vpmovsxbw", X86vsext, sext_invec, "s">;
defm VPMOVSXBD: avx512_extend_BD<0x21, "vpmovsxbd", X86vsext, sext_invec, "s">;
defm VPMOVSXBQ: avx512_extend_BQ<0x22, "vpmovsxbq", X86vsext, sext_invec, "s">;
defm VPMOVSXWD: avx512_extend_WD<0x23, "vpmovsxwd", X86vsext, sext_invec, "s">;
defm VPMOVSXWQ: avx512_extend_WQ<0x24, "vpmovsxwq", X86vsext, sext_invec, "s">;
defm VPMOVSXDQ: avx512_extend_DQ<0x25, "vpmovsxdq", X86vsext, sext_invec, "s">;
// EXTLOAD patterns, implemented using vpmovzx
multiclass avx512_ext_lowering<string InstrStr, X86VectorVTInfo To,
X86VectorVTInfo From, PatFrag LdFrag> {
def : Pat<(To.VT (LdFrag addr:$src)),
(!cast<Instruction>("VPMOVZX"#InstrStr#"rm") addr:$src)>;
def : Pat<(To.VT (vselect To.KRCWM:$mask, (LdFrag addr:$src), To.RC:$src0)),
(!cast<Instruction>("VPMOVZX"#InstrStr#"rmk") To.RC:$src0,
To.KRC:$mask, addr:$src)>;
def : Pat<(To.VT (vselect To.KRCWM:$mask, (LdFrag addr:$src),
To.ImmAllZerosV)),
(!cast<Instruction>("VPMOVZX"#InstrStr#"rmkz") To.KRC:$mask,
addr:$src)>;
}
let Predicates = [HasVLX, HasBWI] in {
defm : avx512_ext_lowering<"BWZ128", v8i16x_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"BWZ256", v16i16x_info, v16i8x_info, extloadvi8>;
}
let Predicates = [HasBWI] in {
defm : avx512_ext_lowering<"BWZ", v32i16_info, v32i8x_info, extloadvi8>;
}
let Predicates = [HasVLX, HasAVX512] in {
defm : avx512_ext_lowering<"BDZ128", v4i32x_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"BDZ256", v8i32x_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"BQZ128", v2i64x_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"BQZ256", v4i64x_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"WDZ128", v4i32x_info, v8i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"WDZ256", v8i32x_info, v8i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"WQZ128", v2i64x_info, v8i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"WQZ256", v4i64x_info, v8i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"DQZ128", v2i64x_info, v4i32x_info, extloadvi32>;
defm : avx512_ext_lowering<"DQZ256", v4i64x_info, v4i32x_info, extloadvi32>;
}
let Predicates = [HasAVX512] in {
defm : avx512_ext_lowering<"BDZ", v16i32_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"BQZ", v8i64_info, v16i8x_info, extloadvi8>;
defm : avx512_ext_lowering<"WDZ", v16i32_info, v16i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"WQZ", v8i64_info, v8i16x_info, extloadvi16>;
defm : avx512_ext_lowering<"DQZ", v8i64_info, v8i32x_info, extloadvi32>;
}
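// Note (illustrative): an EXTLOAD leaves the high bits of each element
// undefined, so lowering it with the zero-extending loads above is always
// legal; no separate anyext instruction is needed.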
multiclass AVX512_pmovx_patterns<string OpcPrefix, SDNode ExtOp,
SDNode InVecOp, PatFrag ExtLoad16> {
// 128-bit patterns
let Predicates = [HasVLX, HasBWI] in {
def : Pat<(v8i16 (InVecOp (bc_v16i8 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#BWZ128rm) addr:$src)>;
def : Pat<(v8i16 (InVecOp (bc_v16i8 (v2f64 (scalar_to_vector (loadf64 addr:$src)))))),
(!cast<I>(OpcPrefix#BWZ128rm) addr:$src)>;
def : Pat<(v8i16 (InVecOp (v16i8 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ128rm) addr:$src)>;
def : Pat<(v8i16 (InVecOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ128rm) addr:$src)>;
def : Pat<(v8i16 (InVecOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ128rm) addr:$src)>;
}
let Predicates = [HasVLX] in {
def : Pat<(v4i32 (InVecOp (bc_v16i8 (v4i32 (scalar_to_vector (loadi32 addr:$src)))))),
(!cast<I>(OpcPrefix#BDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (v16i8 (vzmovl_v4i32 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v16i8 (v4i32 (scalar_to_vector (ExtLoad16 addr:$src)))))),
(!cast<I>(OpcPrefix#BQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v16i8 (vzmovl_v4i32 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (bc_v8i16 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#WDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (bc_v8i16 (v2f64 (scalar_to_vector (loadf64 addr:$src)))))),
(!cast<I>(OpcPrefix#WDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (v8i16 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (v8i16 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ128rm) addr:$src)>;
def : Pat<(v4i32 (InVecOp (bc_v8i16 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v8i16 (v4i32 (scalar_to_vector (loadi32 addr:$src)))))),
(!cast<I>(OpcPrefix#WQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v8i16 (vzmovl_v4i32 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v8i16 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v8i16 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v4i32 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#DQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v4i32 (v2f64 (scalar_to_vector (loadf64 addr:$src)))))),
(!cast<I>(OpcPrefix#DQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v4i32 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (v4i32 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ128rm) addr:$src)>;
def : Pat<(v2i64 (InVecOp (bc_v4i32 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ128rm) addr:$src)>;
}
// 256-bit patterns
let Predicates = [HasVLX, HasBWI] in {
def : Pat<(v16i16 (ExtOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ256rm) addr:$src)>;
def : Pat<(v16i16 (ExtOp (v16i8 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ256rm) addr:$src)>;
def : Pat<(v16i16 (ExtOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZ256rm) addr:$src)>;
}
let Predicates = [HasVLX] in {
def : Pat<(v8i32 (ExtOp (bc_v16i8 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#BDZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (v16i8 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (bc_v16i8 (v4i32 (scalar_to_vector (loadi32 addr:$src)))))),
(!cast<I>(OpcPrefix#BQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v16i8 (vzmovl_v4i32 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v16i8 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BQZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (bc_v8i16 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (v8i16 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ256rm) addr:$src)>;
def : Pat<(v8i32 (ExtOp (v8i16 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (bc_v8i16 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#WQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v8i16 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v8i16 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (bc_v8i16 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (bc_v4i32 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v4i32 (vzmovl_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ256rm) addr:$src)>;
def : Pat<(v4i64 (ExtOp (v4i32 (vzload_v2i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZ256rm) addr:$src)>;
}
// 512-bit patterns
let Predicates = [HasBWI] in {
def : Pat<(v32i16 (ExtOp (bc_v32i8 (loadv4i64 addr:$src)))),
(!cast<I>(OpcPrefix#BWZrm) addr:$src)>;
}
let Predicates = [HasAVX512] in {
def : Pat<(v16i32 (ExtOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BDZrm) addr:$src)>;
def : Pat<(v8i64 (ExtOp (bc_v16i8 (v2i64 (scalar_to_vector (loadi64 addr:$src)))))),
(!cast<I>(OpcPrefix#BQZrm) addr:$src)>;
def : Pat<(v8i64 (ExtOp (bc_v16i8 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#BQZrm) addr:$src)>;
def : Pat<(v16i32 (ExtOp (bc_v16i16 (loadv4i64 addr:$src)))),
(!cast<I>(OpcPrefix#WDZrm) addr:$src)>;
def : Pat<(v8i64 (ExtOp (bc_v8i16 (loadv2i64 addr:$src)))),
(!cast<I>(OpcPrefix#WQZrm) addr:$src)>;
def : Pat<(v8i64 (ExtOp (bc_v8i32 (loadv4i64 addr:$src)))),
(!cast<I>(OpcPrefix#DQZrm) addr:$src)>;
}
}
defm : AVX512_pmovx_patterns<"VPMOVSX", X86vsext, sext_invec, extloadi32i16>;
defm : AVX512_pmovx_patterns<"VPMOVZX", X86vzext, zext_invec, loadi16_anyext>;
//===----------------------------------------------------------------------===//
// GATHER - SCATTER Operations
multiclass avx512_gather<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86MemOperand memop, PatFrag GatherNode> {
let Constraints = "@earlyclobber $dst, $src1 = $dst, $mask = $mask_wb",
ExeDomain = _.ExeDomain in
def rm : AVX5128I<opc, MRMSrcMem, (outs _.RC:$dst, _.KRCWM:$mask_wb),
(ins _.RC:$src1, _.KRCWM:$mask, memop:$src2),
!strconcat(OpcodeStr#_.Suffix,
"\t{$src2, ${dst} {${mask}}|${dst} {${mask}}, $src2}"),
[(set _.RC:$dst, _.KRCWM:$mask_wb,
(GatherNode (_.VT _.RC:$src1), _.KRCWM:$mask,
vectoraddr:$src2))]>, EVEX, EVEX_K,
EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass avx512_gather_q_pd<bits<8> dopc, bits<8> qopc,
AVX512VLVectorVTInfo _, string OpcodeStr, string SUFF> {
defm NAME##D##SUFF##Z: avx512_gather<dopc, OpcodeStr##"d", _.info512,
vy512mem, mgatherv8i32>, EVEX_V512, VEX_W;
defm NAME##Q##SUFF##Z: avx512_gather<qopc, OpcodeStr##"q", _.info512,
vz512mem, mgatherv8i64>, EVEX_V512, VEX_W;
let Predicates = [HasVLX] in {
defm NAME##D##SUFF##Z256: avx512_gather<dopc, OpcodeStr##"d", _.info256,
vx256xmem, mgatherv4i32>, EVEX_V256, VEX_W;
defm NAME##Q##SUFF##Z256: avx512_gather<qopc, OpcodeStr##"q", _.info256,
vy256xmem, mgatherv4i64>, EVEX_V256, VEX_W;
defm NAME##D##SUFF##Z128: avx512_gather<dopc, OpcodeStr##"d", _.info128,
vx128xmem, mgatherv4i32>, EVEX_V128, VEX_W;
defm NAME##Q##SUFF##Z128: avx512_gather<qopc, OpcodeStr##"q", _.info128,
vx128xmem, mgatherv2i64>, EVEX_V128, VEX_W;
}
}
multiclass avx512_gather_d_ps<bits<8> dopc, bits<8> qopc,
AVX512VLVectorVTInfo _, string OpcodeStr, string SUFF> {
defm NAME##D##SUFF##Z: avx512_gather<dopc, OpcodeStr##"d", _.info512, vz512mem,
mgatherv16i32>, EVEX_V512;
defm NAME##Q##SUFF##Z: avx512_gather<qopc, OpcodeStr##"q", _.info256, vz256xmem,
mgatherv8i64>, EVEX_V512;
let Predicates = [HasVLX] in {
defm NAME##D##SUFF##Z256: avx512_gather<dopc, OpcodeStr##"d", _.info256,
vy256xmem, mgatherv8i32>, EVEX_V256;
defm NAME##Q##SUFF##Z256: avx512_gather<qopc, OpcodeStr##"q", _.info128,
vy128xmem, mgatherv4i64>, EVEX_V256;
defm NAME##D##SUFF##Z128: avx512_gather<dopc, OpcodeStr##"d", _.info128,
vx128xmem, mgatherv4i32>, EVEX_V128;
defm NAME##Q##SUFF##Z128: avx512_gather<qopc, OpcodeStr##"q", _.info128,
vx64xmem, X86mgatherv2i64>, EVEX_V128;
}
}
defm VGATHER : avx512_gather_q_pd<0x92, 0x93, avx512vl_f64_info, "vgather", "PD">,
avx512_gather_d_ps<0x92, 0x93, avx512vl_f32_info, "vgather", "PS">;
defm VPGATHER : avx512_gather_q_pd<0x90, 0x91, avx512vl_i64_info, "vpgather", "Q">,
avx512_gather_d_ps<0x90, 0x91, avx512vl_i32_info, "vpgather", "D">;
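// Illustrative use of the gathers defined above (Intel syntax):
//   vpgatherdd zmm0 {k1}, [rax + zmm1*4]
// The mask is both an input and a writeback operand ($mask_wb): each bit is
// cleared once its element has been loaded, so a gather that faults partway
// through can be restarted without reloading completed elements.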
multiclass avx512_scatter<bits<8> opc, string OpcodeStr, X86VectorVTInfo _,
X86MemOperand memop, PatFrag ScatterNode> {
let mayStore = 1, Constraints = "$mask = $mask_wb", ExeDomain = _.ExeDomain in
def mr : AVX5128I<opc, MRMDestMem, (outs _.KRCWM:$mask_wb),
(ins memop:$dst, _.KRCWM:$mask, _.RC:$src),
!strconcat(OpcodeStr#_.Suffix,
"\t{$src, ${dst} {${mask}}|${dst} {${mask}}, $src}"),
[(set _.KRCWM:$mask_wb, (ScatterNode (_.VT _.RC:$src),
_.KRCWM:$mask, vectoraddr:$dst))]>,
EVEX, EVEX_K, EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass avx512_scatter_q_pd<bits<8> dopc, bits<8> qopc,
AVX512VLVectorVTInfo _, string OpcodeStr, string SUFF> {
defm NAME##D##SUFF##Z: avx512_scatter<dopc, OpcodeStr##"d", _.info512,
vy512mem, mscatterv8i32>, EVEX_V512, VEX_W;
defm NAME##Q##SUFF##Z: avx512_scatter<qopc, OpcodeStr##"q", _.info512,
vz512mem, mscatterv8i64>, EVEX_V512, VEX_W;
let Predicates = [HasVLX] in {
defm NAME##D##SUFF##Z256: avx512_scatter<dopc, OpcodeStr##"d", _.info256,
vx256xmem, mscatterv4i32>, EVEX_V256, VEX_W;
defm NAME##Q##SUFF##Z256: avx512_scatter<qopc, OpcodeStr##"q", _.info256,
vy256xmem, mscatterv4i64>, EVEX_V256, VEX_W;
defm NAME##D##SUFF##Z128: avx512_scatter<dopc, OpcodeStr##"d", _.info128,
vx128xmem, mscatterv4i32>, EVEX_V128, VEX_W;
defm NAME##Q##SUFF##Z128: avx512_scatter<qopc, OpcodeStr##"q", _.info128,
vx128xmem, mscatterv2i64>, EVEX_V128, VEX_W;
}
}
multiclass avx512_scatter_d_ps<bits<8> dopc, bits<8> qopc,
AVX512VLVectorVTInfo _, string OpcodeStr, string SUFF> {
defm NAME##D##SUFF##Z: avx512_scatter<dopc, OpcodeStr##"d", _.info512, vz512mem,
mscatterv16i32>, EVEX_V512;
defm NAME##Q##SUFF##Z: avx512_scatter<qopc, OpcodeStr##"q", _.info256, vz256xmem,
mscatterv8i64>, EVEX_V512;
let Predicates = [HasVLX] in {
defm NAME##D##SUFF##Z256: avx512_scatter<dopc, OpcodeStr##"d", _.info256,
vy256xmem, mscatterv8i32>, EVEX_V256;
defm NAME##Q##SUFF##Z256: avx512_scatter<qopc, OpcodeStr##"q", _.info128,
vy128xmem, mscatterv4i64>, EVEX_V256;
defm NAME##D##SUFF##Z128: avx512_scatter<dopc, OpcodeStr##"d", _.info128,
vx128xmem, mscatterv4i32>, EVEX_V128;
defm NAME##Q##SUFF##Z128: avx512_scatter<qopc, OpcodeStr##"q", _.info128,
vx64xmem, mscatterv2i64>, EVEX_V128;
}
}
defm VSCATTER : avx512_scatter_q_pd<0xA2, 0xA3, avx512vl_f64_info, "vscatter", "PD">,
avx512_scatter_d_ps<0xA2, 0xA3, avx512vl_f32_info, "vscatter", "PS">;
defm VPSCATTER : avx512_scatter_q_pd<0xA0, 0xA1, avx512vl_i64_info, "vpscatter", "Q">,
avx512_scatter_d_ps<0xA0, 0xA1, avx512vl_i32_info, "vpscatter", "D">;
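// Illustrative scatter (Intel syntax):
//   vpscatterdd [rax + zmm1*4] {k1}, zmm0
// As with gathers, each element that is stored clears its mask bit via the
// $mask_wb writeback operand.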
// Gather/scatter prefetch
multiclass avx512_gather_scatter_prefetch<bits<8> opc, Format F, string OpcodeStr,
RegisterClass KRC, X86MemOperand memop> {
let Predicates = [HasPFI], hasSideEffects = 1 in
def m : AVX5128I<opc, F, (outs), (ins KRC:$mask, memop:$src),
!strconcat(OpcodeStr, "\t{$src {${mask}}|{${mask}}, $src}"),
[]>, EVEX, EVEX_K;
}
defm VGATHERPF0DPS: avx512_gather_scatter_prefetch<0xC6, MRM1m, "vgatherpf0dps",
VK16WM, vz512mem>, EVEX_V512, EVEX_CD8<32, CD8VT1>;
defm VGATHERPF0QPS: avx512_gather_scatter_prefetch<0xC7, MRM1m, "vgatherpf0qps",
VK8WM, vz256xmem>, EVEX_V512, EVEX_CD8<64, CD8VT1>;
defm VGATHERPF0DPD: avx512_gather_scatter_prefetch<0xC6, MRM1m, "vgatherpf0dpd",
VK8WM, vy512mem>, EVEX_V512, VEX_W, EVEX_CD8<32, CD8VT1>;
defm VGATHERPF0QPD: avx512_gather_scatter_prefetch<0xC7, MRM1m, "vgatherpf0qpd",
VK8WM, vz512mem>, EVEX_V512, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VGATHERPF1DPS: avx512_gather_scatter_prefetch<0xC6, MRM2m, "vgatherpf1dps",
VK16WM, vz512mem>, EVEX_V512, EVEX_CD8<32, CD8VT1>;
defm VGATHERPF1QPS: avx512_gather_scatter_prefetch<0xC7, MRM2m, "vgatherpf1qps",
VK8WM, vz256xmem>, EVEX_V512, EVEX_CD8<64, CD8VT1>;
defm VGATHERPF1DPD: avx512_gather_scatter_prefetch<0xC6, MRM2m, "vgatherpf1dpd",
VK8WM, vy512mem>, EVEX_V512, VEX_W, EVEX_CD8<32, CD8VT1>;
defm VGATHERPF1QPD: avx512_gather_scatter_prefetch<0xC7, MRM2m, "vgatherpf1qpd",
VK8WM, vz512mem>, EVEX_V512, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VSCATTERPF0DPS: avx512_gather_scatter_prefetch<0xC6, MRM5m, "vscatterpf0dps",
VK16WM, vz512mem>, EVEX_V512, EVEX_CD8<32, CD8VT1>;
defm VSCATTERPF0QPS: avx512_gather_scatter_prefetch<0xC7, MRM5m, "vscatterpf0qps",
VK8WM, vz256xmem>, EVEX_V512, EVEX_CD8<64, CD8VT1>;
defm VSCATTERPF0DPD: avx512_gather_scatter_prefetch<0xC6, MRM5m, "vscatterpf0dpd",
VK8WM, vy512mem>, EVEX_V512, VEX_W, EVEX_CD8<32, CD8VT1>;
defm VSCATTERPF0QPD: avx512_gather_scatter_prefetch<0xC7, MRM5m, "vscatterpf0qpd",
VK8WM, vz512mem>, EVEX_V512, VEX_W, EVEX_CD8<64, CD8VT1>;
defm VSCATTERPF1DPS: avx512_gather_scatter_prefetch<0xC6, MRM6m, "vscatterpf1dps",
VK16WM, vz512mem>, EVEX_V512, EVEX_CD8<32, CD8VT1>;
defm VSCATTERPF1QPS: avx512_gather_scatter_prefetch<0xC7, MRM6m, "vscatterpf1qps",
VK8WM, vz256xmem>, EVEX_V512, EVEX_CD8<64, CD8VT1>;
defm VSCATTERPF1DPD: avx512_gather_scatter_prefetch<0xC6, MRM6m, "vscatterpf1dpd",
VK8WM, vy512mem>, EVEX_V512, VEX_W, EVEX_CD8<32, CD8VT1>;
defm VSCATTERPF1QPD: avx512_gather_scatter_prefetch<0xC7, MRM6m, "vscatterpf1qpd",
VK8WM, vz512mem>, EVEX_V512, VEX_W, EVEX_CD8<64, CD8VT1>;
// Helper fragments to match sext vXi1 to vXiY.
def v64i1sextv64i8 : PatLeaf<(v64i8
(X86vsext
(v64i1 (X86pcmpgtm
(bc_v64i8 (v16i32 immAllZerosV)),
VR512:$src))))>;
def v32i1sextv32i16 : PatLeaf<(v32i16 (X86vsrai VR512:$src, (i8 15)))>;
def v16i1sextv16i32 : PatLeaf<(v16i32 (X86vsrai VR512:$src, (i8 31)))>;
def v8i1sextv8i64 : PatLeaf<(v8i64 (X86vsrai VR512:$src, (i8 63)))>;
multiclass cvt_by_vec_width<bits<8> opc, X86VectorVTInfo Vec, string OpcodeStr > {
def rr : AVX512XS8I<opc, MRMSrcReg, (outs Vec.RC:$dst), (ins Vec.KRC:$src),
!strconcat(OpcodeStr##Vec.Suffix, "\t{$src, $dst|$dst, $src}"),
[(set Vec.RC:$dst, (Vec.VT (X86vsext Vec.KRC:$src)))]>, EVEX;
}
// Use the 512-bit version to implement the 128/256-bit variants when VLX
// is unavailable (NoVLX).
multiclass avx512_convert_mask_to_vector_lowering<X86VectorVTInfo X86Info,
X86VectorVTInfo _> {
def : Pat<(X86Info.VT (X86vsext (X86Info.KVT X86Info.KRC:$src))),
(X86Info.VT (EXTRACT_SUBREG
(_.VT (!cast<Instruction>(NAME#"Zrr")
(_.KVT (COPY_TO_REGCLASS X86Info.KRC:$src,_.KRC)))),
X86Info.SubRegIdx))>;
}
multiclass cvt_mask_by_elt_width<bits<8> opc, AVX512VLVectorVTInfo VTInfo,
string OpcodeStr, Predicate prd> {
let Predicates = [prd] in
defm Z : cvt_by_vec_width<opc, VTInfo.info512, OpcodeStr>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : cvt_by_vec_width<opc, VTInfo.info256, OpcodeStr>, EVEX_V256;
defm Z128 : cvt_by_vec_width<opc, VTInfo.info128, OpcodeStr>, EVEX_V128;
}
let Predicates = [prd, NoVLX] in {
defm Z256_Alt : avx512_convert_mask_to_vector_lowering<VTInfo.info256,VTInfo.info512>;
defm Z128_Alt : avx512_convert_mask_to_vector_lowering<VTInfo.info128,VTInfo.info512>;
}
}
defm VPMOVM2B : cvt_mask_by_elt_width<0x28, avx512vl_i8_info, "vpmovm2", HasBWI>;
defm VPMOVM2W : cvt_mask_by_elt_width<0x28, avx512vl_i16_info, "vpmovm2", HasBWI>, VEX_W;
defm VPMOVM2D : cvt_mask_by_elt_width<0x38, avx512vl_i32_info, "vpmovm2", HasDQI>;
defm VPMOVM2Q : cvt_mask_by_elt_width<0x38, avx512vl_i64_info, "vpmovm2", HasDQI>, VEX_W;
multiclass convert_vector_to_mask_common<bits<8> opc, X86VectorVTInfo _, string OpcodeStr > {
def rr : AVX512XS8I<opc, MRMSrcReg, (outs _.KRC:$dst), (ins _.RC:$src),
!strconcat(OpcodeStr, "\t{$src, $dst|$dst, $src}"),
[(set _.KRC:$dst, (X86cvt2mask (_.VT _.RC:$src)))]>, EVEX;
}
// Use the 512-bit version to implement the 128/256-bit variants when VLX
// is unavailable (NoVLX).
multiclass convert_vector_to_mask_lowering<X86VectorVTInfo ExtendInfo,
X86VectorVTInfo _> {
def : Pat<(_.KVT (X86cvt2mask (_.VT _.RC:$src))),
(_.KVT (COPY_TO_REGCLASS
(!cast<Instruction>(NAME#"Zrr")
(INSERT_SUBREG (ExtendInfo.VT (IMPLICIT_DEF)),
_.RC:$src, _.SubRegIdx)),
_.KRC))>;
}
multiclass avx512_convert_vector_to_mask<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : convert_vector_to_mask_common <opc, VTInfo.info512, OpcodeStr>,
EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : convert_vector_to_mask_common<opc, VTInfo.info256, OpcodeStr>,
EVEX_V256;
defm Z128 : convert_vector_to_mask_common<opc, VTInfo.info128, OpcodeStr>,
EVEX_V128;
}
let Predicates = [prd, NoVLX] in {
defm Z256_Alt : convert_vector_to_mask_lowering<VTInfo.info512, VTInfo.info256>;
defm Z128_Alt : convert_vector_to_mask_lowering<VTInfo.info512, VTInfo.info128>;
}
}
defm VPMOVB2M : avx512_convert_vector_to_mask<0x29, "vpmovb2m",
avx512vl_i8_info, HasBWI>;
defm VPMOVW2M : avx512_convert_vector_to_mask<0x29, "vpmovw2m",
avx512vl_i16_info, HasBWI>, VEX_W;
defm VPMOVD2M : avx512_convert_vector_to_mask<0x39, "vpmovd2m",
avx512vl_i32_info, HasDQI>;
defm VPMOVQ2M : avx512_convert_vector_to_mask<0x39, "vpmovq2m",
avx512vl_i64_info, HasDQI>, VEX_W;
//===----------------------------------------------------------------------===//
// AVX-512 - COMPRESS and EXPAND
//
multiclass compress_by_vec_width_common<bits<8> opc, X86VectorVTInfo _,
string OpcodeStr> {
defm rr : AVX512_maskable<opc, MRMDestReg, _, (outs _.RC:$dst),
(ins _.RC:$src1), OpcodeStr, "$src1", "$src1",
(_.VT (X86compress _.RC:$src1))>, AVX5128IBase;
let mayStore = 1, hasSideEffects = 0 in
def mr : AVX5128I<opc, MRMDestMem, (outs),
(ins _.MemOp:$dst, _.RC:$src),
OpcodeStr # "\t{$src, $dst|$dst, $src}",
[]>, EVEX_CD8<_.EltSize, CD8VT1>;
def mrk : AVX5128I<opc, MRMDestMem, (outs),
(ins _.MemOp:$dst, _.KRCWM:$mask, _.RC:$src),
OpcodeStr # "\t{$src, $dst {${mask}}|$dst {${mask}}, $src}",
[]>,
EVEX_K, EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass compress_by_vec_width_lowering<X86VectorVTInfo _ > {
def : Pat<(X86mCompressingStore addr:$dst, _.KRCWM:$mask,
(_.VT _.RC:$src)),
(!cast<Instruction>(NAME#_.ZSuffix##mrk)
addr:$dst, _.KRCWM:$mask, _.RC:$src)>;
}
multiclass compress_by_elt_width<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
defm Z : compress_by_vec_width_common<opc, VTInfo.info512, OpcodeStr>,
compress_by_vec_width_lowering<VTInfo.info512>, EVEX_V512;
let Predicates = [HasVLX] in {
defm Z256 : compress_by_vec_width_common<opc, VTInfo.info256, OpcodeStr>,
compress_by_vec_width_lowering<VTInfo.info256>, EVEX_V256;
defm Z128 : compress_by_vec_width_common<opc, VTInfo.info128, OpcodeStr>,
compress_by_vec_width_lowering<VTInfo.info128>, EVEX_V128;
}
}
defm VPCOMPRESSD : compress_by_elt_width <0x8B, "vpcompressd", avx512vl_i32_info>,
EVEX;
defm VPCOMPRESSQ : compress_by_elt_width <0x8B, "vpcompressq", avx512vl_i64_info>,
EVEX, VEX_W;
defm VCOMPRESSPS : compress_by_elt_width <0x8A, "vcompressps", avx512vl_f32_info>,
EVEX;
defm VCOMPRESSPD : compress_by_elt_width <0x8A, "vcompresspd", avx512vl_f64_info>,
EVEX, VEX_W;
// expand
multiclass expand_by_vec_width<bits<8> opc, X86VectorVTInfo _,
string OpcodeStr> {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1), OpcodeStr, "$src1", "$src1",
(_.VT (X86expand _.RC:$src1))>, AVX5128IBase;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src1), OpcodeStr, "$src1", "$src1",
(_.VT (X86expand (_.VT (bitconvert
(_.LdFrag addr:$src1)))))>,
AVX5128IBase, EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass expand_by_vec_width_lowering<X86VectorVTInfo _ > {
def : Pat<(_.VT (X86mExpandingLoad addr:$src, _.KRCWM:$mask, undef)),
(!cast<Instruction>(NAME#_.ZSuffix##rmkz)
_.KRCWM:$mask, addr:$src)>;
def : Pat<(_.VT (X86mExpandingLoad addr:$src, _.KRCWM:$mask,
(_.VT _.RC:$src0))),
(!cast<Instruction>(NAME#_.ZSuffix##rmk)
_.RC:$src0, _.KRCWM:$mask, addr:$src)>;
}
multiclass expand_by_elt_width<bits<8> opc, string OpcodeStr,
AVX512VLVectorVTInfo VTInfo> {
defm Z : expand_by_vec_width<opc, VTInfo.info512, OpcodeStr>,
expand_by_vec_width_lowering<VTInfo.info512>, EVEX_V512;
let Predicates = [HasVLX] in {
defm Z256 : expand_by_vec_width<opc, VTInfo.info256, OpcodeStr>,
expand_by_vec_width_lowering<VTInfo.info256>, EVEX_V256;
defm Z128 : expand_by_vec_width<opc, VTInfo.info128, OpcodeStr>,
expand_by_vec_width_lowering<VTInfo.info128>, EVEX_V128;
}
}
defm VPEXPANDD : expand_by_elt_width <0x89, "vpexpandd", avx512vl_i32_info>,
EVEX;
defm VPEXPANDQ : expand_by_elt_width <0x89, "vpexpandq", avx512vl_i64_info>,
EVEX, VEX_W;
defm VEXPANDPS : expand_by_elt_width <0x88, "vexpandps", avx512vl_f32_info>,
EVEX;
defm VEXPANDPD : expand_by_elt_width <0x88, "vexpandpd", avx512vl_f64_info>,
EVEX, VEX_W;
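For intuition about what these definitions implement: VPCOMPRESS packs the mask-selected lanes contiguously toward the low end of the destination, and VPEXPAND performs the inverse redistribution. A minimal intrinsics sketch (illustrative only, not part of this patch; the helper name is our own):

#include <immintrin.h>

// Keep only the positive elements of v, packed toward the low lanes,
// zeroing the tail: VPCMPGTD builds the mask, then a zero-masked
// VPCOMPRESSD (as defined above) does the packing.
__m512i keep_positive_packed(__m512i v) {
  __mmask16 m = _mm512_cmpgt_epi32_mask(v, _mm512_setzero_si512());
  return _mm512_maskz_compress_epi32(m, v);
}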
// Handle instructions of the form:
//   reg_vec1 = op(reg_vec, imm)
//   reg_vec1 = op(mem_vec, imm)
//   reg_vec1 = op(broadcast(eltVt), imm)
// All of these instructions are created with FROUND_CURRENT.
multiclass avx512_unary_fp_packed_imm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix, "$src2, $src1", "$src1, $src2",
(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2),
(i32 FROUND_CURRENT))>;
defm rmi : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix, "$src2, $src1", "$src1, $src2",
(OpNode (_.VT (bitconvert (_.LdFrag addr:$src1))),
(i32 imm:$src2),
(i32 FROUND_CURRENT))>;
defm rmbi : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix, "$src2, ${src1}"##_.BroadcastStr,
"${src1}"##_.BroadcastStr##", $src2",
(OpNode (_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src1))),
(i32 imm:$src2),
(i32 FROUND_CURRENT))>, EVEX_B;
}
}
// Handle the {sae} variant: reg_vec1 = op(reg_vec2, imm), {sae}
multiclass avx512_unary_fp_sae_packed_imm<bits<8> opc, string OpcodeStr,
SDNode OpNode, X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in
defm rrib : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, i32u8imm:$src2),
OpcodeStr##_.Suffix, "$src2, {sae}, $src1",
"$src1, {sae}, $src2",
(OpNode (_.VT _.RC:$src1),
(i32 imm:$src2),
(i32 FROUND_NO_EXC))>, EVEX_B;
}
multiclass avx512_common_unary_fp_sae_packed_imm<string OpcodeStr,
AVX512VLVectorVTInfo _, bits<8> opc, SDNode OpNode, Predicate prd>{
let Predicates = [prd] in {
defm Z : avx512_unary_fp_packed_imm<opc, OpcodeStr, OpNode, _.info512>,
avx512_unary_fp_sae_packed_imm<opc, OpcodeStr, OpNode, _.info512>,
EVEX_V512;
}
let Predicates = [prd, HasVLX] in {
defm Z128 : avx512_unary_fp_packed_imm<opc, OpcodeStr, OpNode, _.info128>,
EVEX_V128;
defm Z256 : avx512_unary_fp_packed_imm<opc, OpcodeStr, OpNode, _.info256>,
EVEX_V256;
}
}
// Handle instructions of the form:
//   reg_vec1 = op(reg_vec2, reg_vec3, imm)
//   reg_vec1 = op(reg_vec2, mem_vec, imm)
//   reg_vec1 = op(reg_vec2, broadcast(eltVt), imm)
// All of these instructions are created with FROUND_CURRENT.
multiclass avx512_fp_packed_imm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 imm:$src3),
(i32 FROUND_CURRENT))>;
defm rmi : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.MemOp:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT (bitconvert (_.LdFrag addr:$src2))),
(i32 imm:$src3),
(i32 FROUND_CURRENT))>;
defm rmbi : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, ${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr##", $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src2))),
(i32 imm:$src3),
(i32 FROUND_CURRENT))>, EVEX_B;
}
}
// Handle instructions of the form:
//   reg_vec1 = op(reg_vec2, reg_vec3, imm)
//   reg_vec1 = op(reg_vec2, mem_vec, imm)
multiclass avx512_3Op_rm_imm8<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo DestInfo, X86VectorVTInfo SrcInfo>{
let ExeDomain = DestInfo.ExeDomain in {
defm rri : AVX512_maskable<opc, MRMSrcReg, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.RC:$src1, SrcInfo.RC:$src2, u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1),
(SrcInfo.VT SrcInfo.RC:$src2),
(i8 imm:$src3)))>;
defm rmi : AVX512_maskable<opc, MRMSrcMem, DestInfo, (outs DestInfo.RC:$dst),
(ins SrcInfo.RC:$src1, SrcInfo.MemOp:$src2, u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(DestInfo.VT (OpNode (SrcInfo.VT SrcInfo.RC:$src1),
(SrcInfo.VT (bitconvert
(SrcInfo.LdFrag addr:$src2))),
(i8 imm:$src3)))>;
}
}
// Handle instructions of the form:
//   reg_vec1 = op(reg_vec2, reg_vec3, imm)
//   reg_vec1 = op(reg_vec2, mem_vec, imm)
//   reg_vec1 = op(reg_vec2, broadcast(eltVt), imm)
multiclass avx512_3Op_imm8<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _>:
avx512_3Op_rm_imm8<opc, OpcodeStr, OpNode, _, _>{
let ExeDomain = _.ExeDomain in
defm rmbi : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, u8imm:$src3),
OpcodeStr, "$src3, ${src2}"##_.BroadcastStr##", $src1",
"$src1, ${src2}"##_.BroadcastStr##", $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src2))),
(i8 imm:$src3))>, EVEX_B;
}
// Handle scalar instructions of the form:
//   reg_vec1 = op(reg_vec2, reg_vec3, imm)
//   reg_vec1 = op(reg_vec2, mem_scalar, imm)
// All of these instructions are created with FROUND_CURRENT.
multiclass avx512_fp_scalar_imm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 imm:$src3),
(i32 FROUND_CURRENT))>;
defm rmi : AVX512_maskable_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, $src2, $src1", "$src1, $src2, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT (scalar_to_vector
(_.ScalarLdFrag addr:$src2))),
(i32 imm:$src3),
(i32 FROUND_CURRENT))>;
}
}
// Handle the {sae} variant: reg_vec1 = op(reg_vec2, reg_vec3, imm), {sae}
multiclass avx512_fp_sae_packed_imm<bits<8> opc, string OpcodeStr,
SDNode OpNode, X86VectorVTInfo _>{
let ExeDomain = _.ExeDomain in
defm rrib : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, {sae}, $src2, $src1",
"$src1, $src2, {sae}, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 imm:$src3),
(i32 FROUND_NO_EXC))>, EVEX_B;
}
// Handle the scalar {sae} variant: reg_vec1 = op(reg_vec2, reg_vec3, imm), {sae}
multiclass avx512_fp_sae_scalar_imm<bits<8> opc, string OpcodeStr,
SDNode OpNode, X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in
defm NAME#rrib : AVX512_maskable_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1, _.RC:$src2, i32u8imm:$src3),
OpcodeStr, "$src3, {sae}, $src2, $src1",
"$src1, $src2, {sae}, $src3",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(i32 imm:$src3),
(i32 FROUND_NO_EXC))>, EVEX_B;
}
multiclass avx512_common_fp_sae_packed_imm<string OpcodeStr,
AVX512VLVectorVTInfo _, bits<8> opc, SDNode OpNode, Predicate prd>{
let Predicates = [prd] in {
defm Z : avx512_fp_packed_imm<opc, OpcodeStr, OpNode, _.info512>,
avx512_fp_sae_packed_imm<opc, OpcodeStr, OpNode, _.info512>,
EVEX_V512;
}
let Predicates = [prd, HasVLX] in {
defm Z128 : avx512_fp_packed_imm<opc, OpcodeStr, OpNode, _.info128>,
EVEX_V128;
defm Z256 : avx512_fp_packed_imm<opc, OpcodeStr, OpNode, _.info256>,
EVEX_V256;
}
}
multiclass avx512_common_3Op_rm_imm8<bits<8> opc, SDNode OpNode, string OpStr,
AVX512VLVectorVTInfo DestInfo, AVX512VLVectorVTInfo SrcInfo>{
let Predicates = [HasBWI] in {
defm Z : avx512_3Op_rm_imm8<opc, OpStr, OpNode, DestInfo.info512,
SrcInfo.info512>, EVEX_V512, AVX512AIi8Base, EVEX_4V;
}
let Predicates = [HasBWI, HasVLX] in {
defm Z128 : avx512_3Op_rm_imm8<opc, OpStr, OpNode, DestInfo.info128,
SrcInfo.info128>, EVEX_V128, AVX512AIi8Base, EVEX_4V;
defm Z256 : avx512_3Op_rm_imm8<opc, OpStr, OpNode, DestInfo.info256,
SrcInfo.info256>, EVEX_V256, AVX512AIi8Base, EVEX_4V;
}
}
multiclass avx512_common_3Op_imm8<string OpcodeStr, AVX512VLVectorVTInfo _,
bits<8> opc, SDNode OpNode>{
let Predicates = [HasAVX512] in {
defm Z : avx512_3Op_imm8<opc, OpcodeStr, OpNode, _.info512>, EVEX_V512;
}
let Predicates = [HasAVX512, HasVLX] in {
defm Z128 : avx512_3Op_imm8<opc, OpcodeStr, OpNode, _.info128>, EVEX_V128;
defm Z256 : avx512_3Op_imm8<opc, OpcodeStr, OpNode, _.info256>, EVEX_V256;
}
}
multiclass avx512_common_fp_sae_scalar_imm<string OpcodeStr,
X86VectorVTInfo _, bits<8> opc, SDNode OpNode, Predicate prd>{
let Predicates = [prd] in {
defm Z128 : avx512_fp_scalar_imm<opc, OpcodeStr, OpNode, _>,
avx512_fp_sae_scalar_imm<opc, OpcodeStr, OpNode, _>;
}
}
multiclass avx512_common_unary_fp_sae_packed_imm_all<string OpcodeStr,
bits<8> opcPs, bits<8> opcPd, SDNode OpNode, Predicate prd>{
defm PS : avx512_common_unary_fp_sae_packed_imm<OpcodeStr, avx512vl_f32_info,
opcPs, OpNode, prd>, EVEX_CD8<32, CD8VF>;
defm PD : avx512_common_unary_fp_sae_packed_imm<OpcodeStr, avx512vl_f64_info,
opcPd, OpNode, prd>, EVEX_CD8<64, CD8VF>, VEX_W;
}
defm VREDUCE : avx512_common_unary_fp_sae_packed_imm_all<"vreduce", 0x56, 0x56,
X86VReduce, HasDQI>, AVX512AIi8Base, EVEX;
defm VRNDSCALE : avx512_common_unary_fp_sae_packed_imm_all<"vrndscale", 0x08, 0x09,
X86VRndScale, HasAVX512>, AVX512AIi8Base, EVEX;
defm VGETMANT : avx512_common_unary_fp_sae_packed_imm_all<"vgetmant", 0x26, 0x26,
X86VGetMant, HasAVX512>, AVX512AIi8Base, EVEX;
defm VRANGEPD : avx512_common_fp_sae_packed_imm<"vrangepd", avx512vl_f64_info,
0x50, X86VRange, HasDQI>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<64, CD8VF>, VEX_W;
defm VRANGEPS : avx512_common_fp_sae_packed_imm<"vrangeps", avx512vl_f32_info,
0x50, X86VRange, HasDQI>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<32, CD8VF>;
defm VRANGESD: avx512_common_fp_sae_scalar_imm<"vrangesd", f64x_info,
0x51, X86VRange, HasDQI>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<64, CD8VT1>, VEX_W;
defm VRANGESS: avx512_common_fp_sae_scalar_imm<"vrangess", f32x_info,
0x51, X86VRange, HasDQI>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<32, CD8VT1>;
defm VREDUCESD: avx512_common_fp_sae_scalar_imm<"vreducesd", f64x_info,
0x57, X86Reduces, HasDQI>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<64, CD8VT1>, VEX_W;
defm VREDUCESS: avx512_common_fp_sae_scalar_imm<"vreducess", f32x_info,
0x57, X86Reduces, HasDQI>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<32, CD8VT1>;
defm VGETMANTSD: avx512_common_fp_sae_scalar_imm<"vgetmantsd", f64x_info,
0x27, X86GetMants, HasAVX512>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<64, CD8VT1>, VEX_W;
defm VGETMANTSS: avx512_common_fp_sae_scalar_imm<"vgetmantss", f32x_info,
0x27, X86GetMants, HasAVX512>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<32, CD8VT1>;
multiclass avx512_shuff_packed_128<string OpcodeStr, AVX512VLVectorVTInfo _,
bits<8> opc, SDNode OpNode = X86Shuf128>{
let Predicates = [HasAVX512] in {
defm Z : avx512_3Op_imm8<opc, OpcodeStr, OpNode, _.info512>, EVEX_V512;
}
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_3Op_imm8<opc, OpcodeStr, OpNode, _.info256>, EVEX_V256;
}
}
let Predicates = [HasAVX512] in {
def : Pat<(v16f32 (ffloor VR512:$src)),
(VRNDSCALEPSZrri VR512:$src, (i32 0x9))>;
def : Pat<(v16f32 (fnearbyint VR512:$src)),
(VRNDSCALEPSZrri VR512:$src, (i32 0xC))>;
def : Pat<(v16f32 (fceil VR512:$src)),
(VRNDSCALEPSZrri VR512:$src, (i32 0xA))>;
def : Pat<(v16f32 (frint VR512:$src)),
(VRNDSCALEPSZrri VR512:$src, (i32 0x4))>;
def : Pat<(v16f32 (ftrunc VR512:$src)),
(VRNDSCALEPSZrri VR512:$src, (i32 0xB))>;
def : Pat<(v8f64 (ffloor VR512:$src)),
(VRNDSCALEPDZrri VR512:$src, (i32 0x9))>;
def : Pat<(v8f64 (fnearbyint VR512:$src)),
(VRNDSCALEPDZrri VR512:$src, (i32 0xC))>;
def : Pat<(v8f64 (fceil VR512:$src)),
(VRNDSCALEPDZrri VR512:$src, (i32 0xA))>;
def : Pat<(v8f64 (frint VR512:$src)),
(VRNDSCALEPDZrri VR512:$src, (i32 0x4))>;
def : Pat<(v8f64 (ftrunc VR512:$src)),
(VRNDSCALEPDZrri VR512:$src, (i32 0xB))>;
}
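These immediates follow the VRNDSCALE imm8 layout: bits [7:4] give the scale M (0 here, so round to integer), bit 3 suppresses precision exceptions, bit 2, when set, defers to the MXCSR rounding mode, and bits [1:0] give the embedded rounding mode (00 nearest, 01 toward -inf, 10 toward +inf, 11 truncate). Hence 0x9/0xA/0xB map to floor/ceil/trunc with exceptions suppressed, 0xC to nearbyint, and 0x4 to rint. A one-liner showing the same immediate from intrinsics (the helper name is our own):

#include <immintrin.h>

// 0x09 = M=0, suppress precision exceptions, round toward -inf:
// the same immediate the ffloor patterns above select.
__m512 floor_ps512(__m512 v) {
  return _mm512_roundscale_ps(v, 0x09);
}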
defm VSHUFF32X4 : avx512_shuff_packed_128<"vshuff32x4",avx512vl_f32_info, 0x23>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<32, CD8VF>;
defm VSHUFF64X2 : avx512_shuff_packed_128<"vshuff64x2",avx512vl_f64_info, 0x23>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<64, CD8VF>, VEX_W;
defm VSHUFI32X4 : avx512_shuff_packed_128<"vshufi32x4",avx512vl_i32_info, 0x43>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<32, CD8VF>;
defm VSHUFI64X2 : avx512_shuff_packed_128<"vshufi64x2",avx512vl_i64_info, 0x43>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<64, CD8VF>, VEX_W;
let Predicates = [HasAVX512] in {
// Provide a fallback in case the load node used in the broadcast
// patterns above has additional users, which would otherwise prevent
// the patterns from being selected.
def : Pat<(v8f64 (X86SubVBroadcast (v2f64 VR128X:$src))),
(VSHUFF64X2Zrri (INSERT_SUBREG (v8f64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v8f64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
def : Pat<(v8i64 (X86SubVBroadcast (v2i64 VR128X:$src))),
(VSHUFI64X2Zrri (INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
def : Pat<(v16f32 (X86SubVBroadcast (v4f32 VR128X:$src))),
(VSHUFF32X4Zrri (INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
def : Pat<(v16i32 (X86SubVBroadcast (v4i32 VR128X:$src))),
(VSHUFI32X4Zrri (INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
def : Pat<(v32i16 (X86SubVBroadcast (v8i16 VR128X:$src))),
(VSHUFI32X4Zrri (INSERT_SUBREG (v32i16 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v32i16 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
def : Pat<(v64i8 (X86SubVBroadcast (v16i8 VR128X:$src))),
(VSHUFI32X4Zrri (INSERT_SUBREG (v64i8 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
(INSERT_SUBREG (v64i8 (IMPLICIT_DEF)), VR128X:$src, sub_xmm),
0)>;
}
multiclass avx512_valign<string OpcodeStr, AVX512VLVectorVTInfo VTInfo_I> {
defm NAME: avx512_common_3Op_imm8<OpcodeStr, VTInfo_I, 0x03, X86VAlign>,
AVX512AIi8Base, EVEX_4V;
}
defm VALIGND: avx512_valign<"valignd", avx512vl_i32_info>,
EVEX_CD8<32, CD8VF>;
defm VALIGNQ: avx512_valign<"valignq", avx512vl_i64_info>,
EVEX_CD8<64, CD8VF>, VEX_W;
multiclass avx512_vpalignr_lowering<X86VectorVTInfo _ , list<Predicate> p>{
let Predicates = p in
def NAME#_.VTName#rri:
Pat<(_.VT (X86PAlignr _.RC:$src1, _.RC:$src2, (i8 imm:$imm))),
(!cast<Instruction>(NAME#_.ZSuffix#rri)
_.RC:$src1, _.RC:$src2, imm:$imm)>;
}
multiclass avx512_vpalignr_lowering_common<AVX512VLVectorVTInfo _>:
avx512_vpalignr_lowering<_.info512, [HasBWI]>,
avx512_vpalignr_lowering<_.info128, [HasBWI, HasVLX]>,
avx512_vpalignr_lowering<_.info256, [HasBWI, HasVLX]>;
defm VPALIGNR: avx512_common_3Op_rm_imm8<0x0F, X86PAlignr, "vpalignr",
avx512vl_i8_info, avx512vl_i8_info>,
avx512_vpalignr_lowering_common<avx512vl_i16_info>,
avx512_vpalignr_lowering_common<avx512vl_i32_info>,
avx512_vpalignr_lowering_common<avx512vl_f32_info>,
avx512_vpalignr_lowering_common<avx512vl_i64_info>,
avx512_vpalignr_lowering_common<avx512vl_f64_info>,
EVEX_CD8<8, CD8VF>;
defm VDBPSADBW: avx512_common_3Op_rm_imm8<0x42, X86dbpsadbw, "vdbpsadbw",
avx512vl_i16_info, avx512vl_i8_info>, EVEX_CD8<8, CD8VF>;
multiclass avx512_unary_rm<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src1), OpcodeStr,
"$src1", "$src1",
(_.VT (OpNode _.RC:$src1))>, EVEX, AVX5128IBase;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.MemOp:$src1), OpcodeStr,
"$src1", "$src1",
(_.VT (OpNode (bitconvert (_.LdFrag addr:$src1))))>,
EVEX, AVX5128IBase, EVEX_CD8<_.EltSize, CD8VF>;
}
}
multiclass avx512_unary_rmb<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> :
avx512_unary_rm<opc, OpcodeStr, OpNode, _> {
defm rmb : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src1), OpcodeStr,
"${src1}"##_.BroadcastStr,
"${src1}"##_.BroadcastStr,
(_.VT (OpNode (X86VBroadcast
(_.ScalarLdFrag addr:$src1))))>,
EVEX, AVX5128IBase, EVEX_B, EVEX_CD8<_.EltSize, CD8VF>;
}
multiclass avx512_unary_rm_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_unary_rm<opc, OpcodeStr, OpNode, VTInfo.info512>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_unary_rm<opc, OpcodeStr, OpNode, VTInfo.info256>,
EVEX_V256;
defm Z128 : avx512_unary_rm<opc, OpcodeStr, OpNode, VTInfo.info128>,
EVEX_V128;
}
}
multiclass avx512_unary_rmb_vl<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo, Predicate prd> {
let Predicates = [prd] in
defm Z : avx512_unary_rmb<opc, OpcodeStr, OpNode, VTInfo.info512>,
EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_unary_rmb<opc, OpcodeStr, OpNode, VTInfo.info256>,
EVEX_V256;
defm Z128 : avx512_unary_rmb<opc, OpcodeStr, OpNode, VTInfo.info128>,
EVEX_V128;
}
}
multiclass avx512_unary_rm_vl_dq<bits<8> opc_d, bits<8> opc_q, string OpcodeStr,
SDNode OpNode, Predicate prd> {
defm Q : avx512_unary_rmb_vl<opc_q, OpcodeStr#"q", OpNode, avx512vl_i64_info,
prd>, VEX_W;
defm D : avx512_unary_rmb_vl<opc_d, OpcodeStr#"d", OpNode, avx512vl_i32_info,
prd>;
}
multiclass avx512_unary_rm_vl_bw<bits<8> opc_b, bits<8> opc_w, string OpcodeStr,
SDNode OpNode, Predicate prd> {
defm W : avx512_unary_rm_vl<opc_w, OpcodeStr#"w", OpNode, avx512vl_i16_info, prd>;
defm B : avx512_unary_rm_vl<opc_b, OpcodeStr#"b", OpNode, avx512vl_i8_info, prd>;
}
multiclass avx512_unary_rm_vl_all<bits<8> opc_b, bits<8> opc_w,
bits<8> opc_d, bits<8> opc_q,
string OpcodeStr, SDNode OpNode> {
defm NAME : avx512_unary_rm_vl_dq<opc_d, opc_q, OpcodeStr, OpNode,
HasAVX512>,
avx512_unary_rm_vl_bw<opc_b, opc_w, OpcodeStr, OpNode,
HasBWI>;
}
defm VPABS : avx512_unary_rm_vl_all<0x1C, 0x1D, 0x1E, 0x1F, "vpabs", abs>;
// VPABS: use the 512-bit version to implement the 128/256-bit variants
// when VLX is unavailable (NoVLX).
let Predicates = [HasAVX512, NoVLX] in {
def : Pat<(v4i64 (abs VR256X:$src)),
(EXTRACT_SUBREG
(VPABSQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)),
sub_ymm)>;
def : Pat<(v2i64 (abs VR128X:$src)),
(EXTRACT_SUBREG
(VPABSQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)),
sub_xmm)>;
}
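The same widening trick can be expressed with intrinsics: on an AVX-512F-only target, a 256-bit vpabsq is emulated by inserting into a 512-bit register, running the Z-register instruction, and extracting the low half, exactly as the INSERT_SUBREG/EXTRACT_SUBREG patterns above encode. A minimal sketch (the helper name is our own):

#include <immintrin.h>

// NoVLX lowering of v4i64 abs. The upper lanes of the widened
// register are undefined, which is harmless for a lane-wise op.
__m256i abs_epi64_novlx(__m256i v) {
  __m512i wide = _mm512_castsi256_si512(v);
  return _mm512_castsi512_si256(_mm512_abs_epi64(wide));
}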
multiclass avx512_ctlz<bits<8> opc, string OpcodeStr, Predicate prd>{
defm NAME : avx512_unary_rm_vl_dq<opc, opc, OpcodeStr, ctlz, prd>;
}
defm VPLZCNT : avx512_ctlz<0x44, "vplzcnt", HasCDI>;
defm VPCONFLICT : avx512_unary_rm_vl_dq<0xC4, 0xC4, "vpconflict", X86Conflict, HasCDI>;
// VPLZCNT: use the 512-bit version to implement the 128/256-bit variants
// when VLX is unavailable (NoVLX).
let Predicates = [HasCDI, NoVLX] in {
def : Pat<(v4i64 (ctlz VR256X:$src)),
(EXTRACT_SUBREG
(VPLZCNTQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)),
sub_ymm)>;
def : Pat<(v2i64 (ctlz VR128X:$src)),
(EXTRACT_SUBREG
(VPLZCNTQZrr
(INSERT_SUBREG (v8i64 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)),
sub_xmm)>;
def : Pat<(v8i32 (ctlz VR256X:$src)),
(EXTRACT_SUBREG
(VPLZCNTDZrr
(INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR256X:$src, sub_ymm)),
sub_ymm)>;
def : Pat<(v4i32 (ctlz VR128X:$src)),
(EXTRACT_SUBREG
(VPLZCNTDZrr
(INSERT_SUBREG (v16i32 (IMPLICIT_DEF)), VR128X:$src, sub_xmm)),
sub_xmm)>;
}
//===---------------------------------------------------------------------===//
// Count the number of one bits - VPOPCNTD and VPOPCNTQ
//===---------------------------------------------------------------------===//
multiclass avx512_unary_rmb_popcnt<bits<8> opc, string OpcodeStr, X86VectorVTInfo VTInfo> {
let Predicates = [HasVPOPCNTDQ] in
defm Z : avx512_unary_rmb<opc, OpcodeStr, ctpop, VTInfo>, EVEX_V512;
}
// Use the 512-bit version to implement the 128/256-bit variants.
multiclass avx512_unary_lowering<SDNode OpNode, AVX512VLVectorVTInfo _, Predicate prd> {
let Predicates = [prd] in {
def Z256_Alt : Pat<(_.info256.VT(OpNode _.info256.RC:$src1)),
(EXTRACT_SUBREG
(!cast<Instruction>(NAME # "Zrr")
(INSERT_SUBREG(_.info512.VT(IMPLICIT_DEF)),
_.info256.RC:$src1,
_.info256.SubRegIdx)),
_.info256.SubRegIdx)>;
def Z128_Alt : Pat<(_.info128.VT(OpNode _.info128.RC:$src1)),
(EXTRACT_SUBREG
(!cast<Instruction>(NAME # "Zrr")
(INSERT_SUBREG(_.info512.VT(IMPLICIT_DEF)),
_.info128.RC:$src1,
_.info128.SubRegIdx)),
_.info128.SubRegIdx)>;
}
}
defm VPOPCNTD : avx512_unary_rmb_popcnt<0x55, "vpopcntd", v16i32_info>,
avx512_unary_lowering<ctpop, avx512vl_i32_info, HasVPOPCNTDQ>;
defm VPOPCNTQ : avx512_unary_rmb_popcnt<0x55, "vpopcntq", v8i64_info>,
avx512_unary_lowering<ctpop, avx512vl_i64_info, HasVPOPCNTDQ>, VEX_W;
//===---------------------------------------------------------------------===//
// Replicate Single FP - MOVSHDUP and MOVSLDUP
//===---------------------------------------------------------------------===//
multiclass avx512_replicate<bits<8> opc, string OpcodeStr, SDNode OpNode>{
defm NAME: avx512_unary_rm_vl<opc, OpcodeStr, OpNode, avx512vl_f32_info,
HasAVX512>, XS;
}
defm VMOVSHDUP : avx512_replicate<0x16, "vmovshdup", X86Movshdup>;
defm VMOVSLDUP : avx512_replicate<0x12, "vmovsldup", X86Movsldup>;
//===----------------------------------------------------------------------===//
// AVX-512 - MOVDDUP
//===----------------------------------------------------------------------===//
multiclass avx512_movddup_128<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
let ExeDomain = _.ExeDomain in {
defm rr : AVX512_maskable<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src), OpcodeStr, "$src", "$src",
(_.VT (OpNode (_.VT _.RC:$src)))>, EVEX;
defm rm : AVX512_maskable<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.ScalarMemOp:$src), OpcodeStr, "$src", "$src",
(_.VT (OpNode (_.VT (scalar_to_vector
(_.ScalarLdFrag addr:$src)))))>,
EVEX, EVEX_CD8<_.EltSize, CD8VH>;
}
}
multiclass avx512_movddup_common<bits<8> opc, string OpcodeStr, SDNode OpNode,
AVX512VLVectorVTInfo VTInfo> {
defm Z : avx512_unary_rm<opc, OpcodeStr, OpNode, VTInfo.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z256 : avx512_unary_rm<opc, OpcodeStr, OpNode, VTInfo.info256>,
EVEX_V256;
defm Z128 : avx512_movddup_128<opc, OpcodeStr, OpNode, VTInfo.info128>,
EVEX_V128;
}
}
multiclass avx512_movddup<bits<8> opc, string OpcodeStr, SDNode OpNode>{
defm NAME: avx512_movddup_common<opc, OpcodeStr, OpNode,
avx512vl_f64_info>, XD, VEX_W;
}
defm VMOVDDUP : avx512_movddup<0x12, "vmovddup", X86Movddup>;
let Predicates = [HasVLX] in {
def : Pat<(X86Movddup (loadv2f64 addr:$src)),
(VMOVDDUPZ128rm addr:$src)>;
def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
(VMOVDDUPZ128rm addr:$src)>;
def : Pat<(v2f64 (X86VBroadcast f64:$src)),
(VMOVDDUPZ128rr (COPY_TO_REGCLASS FR64X:$src, VR128X))>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (X86Movddup (loadv2f64 addr:$src)),
(v2f64 VR128X:$src0)),
(VMOVDDUPZ128rmk VR128X:$src0, VK2WM:$mask, addr:$src)>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (X86Movddup (loadv2f64 addr:$src)),
(bitconvert (v4i32 immAllZerosV))),
(VMOVDDUPZ128rmkz VK2WM:$mask, addr:$src)>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (v2f64 (X86VBroadcast f64:$src)),
(v2f64 VR128X:$src0)),
(VMOVDDUPZ128rrk VR128X:$src0, VK2WM:$mask,
(COPY_TO_REGCLASS FR64X:$src, VR128X))>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (v2f64 (X86VBroadcast f64:$src)),
(bitconvert (v4i32 immAllZerosV))),
(VMOVDDUPZ128rrkz VK2WM:$mask, (COPY_TO_REGCLASS FR64X:$src, VR128X))>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (v2f64 (X86VBroadcast (loadf64 addr:$src))),
(v2f64 VR128X:$src0)),
(VMOVDDUPZ128rmk VR128X:$src0, VK2WM:$mask, addr:$src)>;
def : Pat<(vselect (v2i1 VK2WM:$mask), (v2f64 (X86VBroadcast (loadf64 addr:$src))),
(bitconvert (v4i32 immAllZerosV))),
(VMOVDDUPZ128rmkz VK2WM:$mask, addr:$src)>;
}
//===----------------------------------------------------------------------===//
// AVX-512 - Unpack Instructions
//===----------------------------------------------------------------------===//
defm VUNPCKH : avx512_fp_binop_p<0x15, "vunpckh", X86Unpckh, HasAVX512,
SSE_ALU_ITINS_S>;
defm VUNPCKL : avx512_fp_binop_p<0x14, "vunpckl", X86Unpckl, HasAVX512,
SSE_ALU_ITINS_S>;
defm VPUNPCKLBW : avx512_binop_rm_vl_b<0x60, "vpunpcklbw", X86Unpckl,
SSE_INTALU_ITINS_P, HasBWI>;
defm VPUNPCKHBW : avx512_binop_rm_vl_b<0x68, "vpunpckhbw", X86Unpckh,
SSE_INTALU_ITINS_P, HasBWI>;
defm VPUNPCKLWD : avx512_binop_rm_vl_w<0x61, "vpunpcklwd", X86Unpckl,
SSE_INTALU_ITINS_P, HasBWI>;
defm VPUNPCKHWD : avx512_binop_rm_vl_w<0x69, "vpunpckhwd", X86Unpckh,
SSE_INTALU_ITINS_P, HasBWI>;
defm VPUNPCKLDQ : avx512_binop_rm_vl_d<0x62, "vpunpckldq", X86Unpckl,
SSE_INTALU_ITINS_P, HasAVX512>;
defm VPUNPCKHDQ : avx512_binop_rm_vl_d<0x6A, "vpunpckhdq", X86Unpckh,
SSE_INTALU_ITINS_P, HasAVX512>;
defm VPUNPCKLQDQ : avx512_binop_rm_vl_q<0x6C, "vpunpcklqdq", X86Unpckl,
SSE_INTALU_ITINS_P, HasAVX512>;
defm VPUNPCKHQDQ : avx512_binop_rm_vl_q<0x6D, "vpunpckhqdq", X86Unpckh,
SSE_INTALU_ITINS_P, HasAVX512>;
//===----------------------------------------------------------------------===//
// AVX-512 - Extract & Insert Integer Instructions
//===----------------------------------------------------------------------===//
multiclass avx512_extract_elt_bw_m<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _> {
def mr : AVX512Ii8<opc, MRMDestMem, (outs),
(ins _.ScalarMemOp:$dst, _.RC:$src1, u8imm:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(store (_.EltVT (trunc (assertzext (OpNode (_.VT _.RC:$src1),
imm:$src2)))),
addr:$dst)]>,
EVEX, EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass avx512_extract_elt_b<string OpcodeStr, X86VectorVTInfo _> {
let Predicates = [HasBWI] in {
def rr : AVX512Ii8<0x14, MRMDestReg, (outs GR32orGR64:$dst),
(ins _.RC:$src1, u8imm:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set GR32orGR64:$dst,
(X86pextrb (_.VT _.RC:$src1), imm:$src2))]>,
EVEX, TAPD;
defm NAME : avx512_extract_elt_bw_m<0x14, OpcodeStr, X86pextrb, _>, TAPD;
}
}
multiclass avx512_extract_elt_w<string OpcodeStr, X86VectorVTInfo _> {
let Predicates = [HasBWI] in {
def rr : AVX512Ii8<0xC5, MRMSrcReg, (outs GR32orGR64:$dst),
(ins _.RC:$src1, u8imm:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set GR32orGR64:$dst,
(X86pextrw (_.VT _.RC:$src1), imm:$src2))]>,
EVEX, PD;
let hasSideEffects = 0 in
def rr_REV : AVX512Ii8<0x15, MRMDestReg, (outs GR32orGR64:$dst),
(ins _.RC:$src1, u8imm:$src2),
OpcodeStr#".s\t{$src2, $src1, $dst|$dst, $src1, $src2}", []>,
EVEX, TAPD, FoldGenData<NAME#rr>;
defm NAME : avx512_extract_elt_bw_m<0x15, OpcodeStr, X86pextrw, _>, TAPD;
}
}
multiclass avx512_extract_elt_dq<string OpcodeStr, X86VectorVTInfo _,
RegisterClass GRC> {
let Predicates = [HasDQI] in {
def rr : AVX512Ii8<0x16, MRMDestReg, (outs GRC:$dst),
(ins _.RC:$src1, u8imm:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(set GRC:$dst,
(extractelt (_.VT _.RC:$src1), imm:$src2))]>,
EVEX, TAPD;
def mr : AVX512Ii8<0x16, MRMDestMem, (outs),
(ins _.ScalarMemOp:$dst, _.RC:$src1, u8imm:$src2),
OpcodeStr#"\t{$src2, $src1, $dst|$dst, $src1, $src2}",
[(store (extractelt (_.VT _.RC:$src1),
imm:$src2),addr:$dst)]>,
EVEX, EVEX_CD8<_.EltSize, CD8VT1>, TAPD;
}
}
defm VPEXTRBZ : avx512_extract_elt_b<"vpextrb", v16i8x_info>;
defm VPEXTRWZ : avx512_extract_elt_w<"vpextrw", v8i16x_info>;
defm VPEXTRDZ : avx512_extract_elt_dq<"vpextrd", v4i32x_info, GR32>;
defm VPEXTRQZ : avx512_extract_elt_dq<"vpextrq", v2i64x_info, GR64>, VEX_W;
multiclass avx512_insert_elt_m<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, PatFrag LdFrag> {
def rm : AVX512Ii8<opc, MRMSrcMem, (outs _.RC:$dst),
(ins _.RC:$src1, _.ScalarMemOp:$src2, u8imm:$src3),
OpcodeStr#"\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}",
[(set _.RC:$dst,
(_.VT (OpNode _.RC:$src1, (LdFrag addr:$src2), imm:$src3)))]>,
EVEX_4V, EVEX_CD8<_.EltSize, CD8VT1>;
}
multiclass avx512_insert_elt_bw<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, PatFrag LdFrag> {
let Predicates = [HasBWI] in {
def rr : AVX512Ii8<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src1, GR32orGR64:$src2, u8imm:$src3),
OpcodeStr#"\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}",
[(set _.RC:$dst,
(OpNode _.RC:$src1, GR32orGR64:$src2, imm:$src3))]>, EVEX_4V;
defm NAME : avx512_insert_elt_m<opc, OpcodeStr, OpNode, _, LdFrag>;
}
}
multiclass avx512_insert_elt_dq<bits<8> opc, string OpcodeStr,
X86VectorVTInfo _, RegisterClass GRC> {
let Predicates = [HasDQI] in {
def rr : AVX512Ii8<opc, MRMSrcReg, (outs _.RC:$dst),
(ins _.RC:$src1, GRC:$src2, u8imm:$src3),
OpcodeStr#"\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}",
[(set _.RC:$dst,
(_.VT (insertelt _.RC:$src1, GRC:$src2, imm:$src3)))]>,
EVEX_4V, TAPD;
defm NAME : avx512_insert_elt_m<opc, OpcodeStr, insertelt, _,
_.ScalarLdFrag>, TAPD;
}
}
defm VPINSRBZ : avx512_insert_elt_bw<0x20, "vpinsrb", X86pinsrb, v16i8x_info,
extloadi8>, TAPD;
defm VPINSRWZ : avx512_insert_elt_bw<0xC4, "vpinsrw", X86pinsrw, v8i16x_info,
extloadi16>, PD;
defm VPINSRDZ : avx512_insert_elt_dq<0x22, "vpinsrd", v4i32x_info, GR32>;
defm VPINSRQZ : avx512_insert_elt_dq<0x22, "vpinsrq", v2i64x_info, GR64>, VEX_W;
//===----------------------------------------------------------------------===//
// VSHUFPS - VSHUFPD Operations
//===----------------------------------------------------------------------===//
multiclass avx512_shufp<string OpcodeStr, AVX512VLVectorVTInfo VTInfo_I,
AVX512VLVectorVTInfo VTInfo_FP>{
defm NAME: avx512_common_3Op_imm8<OpcodeStr, VTInfo_FP, 0xC6, X86Shufp>,
EVEX_CD8<VTInfo_FP.info512.EltSize, CD8VF>,
AVX512AIi8Base, EVEX_4V;
}
defm VSHUFPS: avx512_shufp<"vshufps", avx512vl_i32_info, avx512vl_f32_info>, PS;
defm VSHUFPD: avx512_shufp<"vshufpd", avx512vl_i64_info, avx512vl_f64_info>, PD, VEX_W;
//===----------------------------------------------------------------------===//
// AVX-512 - Byte shift Left/Right
//===----------------------------------------------------------------------===//
multiclass avx512_shift_packed<bits<8> opc, SDNode OpNode, Format MRMr,
Format MRMm, string OpcodeStr, X86VectorVTInfo _>{
def rr : AVX512<opc, MRMr,
(outs _.RC:$dst), (ins _.RC:$src1, u8imm:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.RC:$dst,(_.VT (OpNode _.RC:$src1, (i8 imm:$src2))))]>;
def rm : AVX512<opc, MRMm,
(outs _.RC:$dst), (ins _.MemOp:$src1, u8imm:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _.RC:$dst,(_.VT (OpNode
(_.VT (bitconvert (_.LdFrag addr:$src1))),
(i8 imm:$src2))))]>;
}
multiclass avx512_shift_packed_all<bits<8> opc, SDNode OpNode, Format MRMr,
Format MRMm, string OpcodeStr, Predicate prd>{
let Predicates = [prd] in
defm Z512 : avx512_shift_packed<opc, OpNode, MRMr, MRMm,
OpcodeStr, v64i8_info>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_shift_packed<opc, OpNode, MRMr, MRMm,
OpcodeStr, v32i8x_info>, EVEX_V256;
defm Z128 : avx512_shift_packed<opc, OpNode, MRMr, MRMm,
OpcodeStr, v16i8x_info>, EVEX_V128;
}
}
defm VPSLLDQ : avx512_shift_packed_all<0x73, X86vshldq, MRM7r, MRM7m, "vpslldq",
HasBWI>, AVX512PDIi8Base, EVEX_4V;
defm VPSRLDQ : avx512_shift_packed_all<0x73, X86vshrdq, MRM3r, MRM3m, "vpsrldq",
HasBWI>, AVX512PDIi8Base, EVEX_4V;
multiclass avx512_psadbw_packed<bits<8> opc, SDNode OpNode,
string OpcodeStr, X86VectorVTInfo _dst,
X86VectorVTInfo _src>{
def rr : AVX512BI<opc, MRMSrcReg,
(outs _dst.RC:$dst), (ins _src.RC:$src1, _src.RC:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _dst.RC:$dst,(_dst.VT
(OpNode (_src.VT _src.RC:$src1),
(_src.VT _src.RC:$src2))))]>;
def rm : AVX512BI<opc, MRMSrcMem,
(outs _dst.RC:$dst), (ins _src.RC:$src1, _src.MemOp:$src2),
!strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
[(set _dst.RC:$dst,(_dst.VT
(OpNode (_src.VT _src.RC:$src1),
(_src.VT (bitconvert
(_src.LdFrag addr:$src2))))))]>;
}
multiclass avx512_psadbw_packed_all<bits<8> opc, SDNode OpNode,
string OpcodeStr, Predicate prd> {
let Predicates = [prd] in
defm Z512 : avx512_psadbw_packed<opc, OpNode, OpcodeStr, v8i64_info,
v64i8_info>, EVEX_V512;
let Predicates = [prd, HasVLX] in {
defm Z256 : avx512_psadbw_packed<opc, OpNode, OpcodeStr, v4i64x_info,
v32i8x_info>, EVEX_V256;
defm Z128 : avx512_psadbw_packed<opc, OpNode, OpcodeStr, v2i64x_info,
v16i8x_info>, EVEX_V128;
}
}
defm VPSADBW : avx512_psadbw_packed_all<0xf6, X86psadbw, "vpsadbw",
HasBWI>, EVEX_4V;
// Transforms to swizzle an immediate to enable better matching when the
// memory operand isn't in the right place.
def VPTERNLOG321_imm8 : SDNodeXForm<imm, [{
// Convert a VPTERNLOG immediate by swapping operand 0 and operand 2.
uint8_t Imm = N->getZExtValue();
// Swap bits 1/4 and 3/6.
uint8_t NewImm = Imm & 0xa5;
if (Imm & 0x02) NewImm |= 0x10;
if (Imm & 0x10) NewImm |= 0x02;
if (Imm & 0x08) NewImm |= 0x40;
if (Imm & 0x40) NewImm |= 0x08;
return getI8Imm(NewImm, SDLoc(N));
}]>;
def VPTERNLOG213_imm8 : SDNodeXForm<imm, [{
// Convert a VPTERNLOG immediate by swapping operand 0 and operand 1.
uint8_t Imm = N->getZExtValue();
// Swap bits 2/4 and 3/5.
uint8_t NewImm = Imm & 0xc3;
if (Imm & 0x04) NewImm |= 0x10;
if (Imm & 0x10) NewImm |= 0x04;
if (Imm & 0x08) NewImm |= 0x20;
if (Imm & 0x20) NewImm |= 0x08;
return getI8Imm(NewImm, SDLoc(N));
}]>;
def VPTERNLOG132_imm8 : SDNodeXForm<imm, [{
// Convert a VPTERNLOG immediate by swapping operand 1 and operand 2.
uint8_t Imm = N->getZExtValue();
// Swap bits 1/2 and 5/6.
uint8_t NewImm = Imm & 0x99;
if (Imm & 0x02) NewImm |= 0x04;
if (Imm & 0x04) NewImm |= 0x02;
if (Imm & 0x20) NewImm |= 0x40;
if (Imm & 0x40) NewImm |= 0x20;
return getI8Imm(NewImm, SDLoc(N));
}]>;
def VPTERNLOG231_imm8 : SDNodeXForm<imm, [{
// Convert a VPTERNLOG immediate by moving operand 0 to the end.
uint8_t Imm = N->getZExtValue();
// Move bits 1->2, 2->4, 3->6, 4->1, 5->3, 6->5
uint8_t NewImm = Imm & 0x81;
if (Imm & 0x02) NewImm |= 0x04;
if (Imm & 0x04) NewImm |= 0x10;
if (Imm & 0x08) NewImm |= 0x40;
if (Imm & 0x10) NewImm |= 0x02;
if (Imm & 0x20) NewImm |= 0x08;
if (Imm & 0x40) NewImm |= 0x20;
return getI8Imm(NewImm, SDLoc(N));
}]>;
def VPTERNLOG312_imm8 : SDNodeXForm<imm, [{
// Convert a VPTERNLOG immediate by moving operand 2 to the beginning.
uint8_t Imm = N->getZExtValue();
// Move bits 1->4, 2->1, 3->5, 4->2, 5->6, 6->3
uint8_t NewImm = Imm & 0x81;
if (Imm & 0x02) NewImm |= 0x10;
if (Imm & 0x04) NewImm |= 0x02;
if (Imm & 0x08) NewImm |= 0x20;
if (Imm & 0x10) NewImm |= 0x04;
if (Imm & 0x20) NewImm |= 0x40;
if (Imm & 0x40) NewImm |= 0x08;
return getI8Imm(NewImm, SDLoc(N));
}]>;
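Bit k of a VPTERNLOG immediate is the function's output for inputs (src1, src2, src3) = (bit 2, bit 1, bit 0 of k), so reordering the operands merely permutes the truth-table index bits, which is all the hard-coded tables above do. A self-contained sketch that derives a table generically and cross-checks the 321 case (function names are our own, not LLVM's):

#include <cassert>
#include <cstdint>

// p[i] = which original operand the instruction's operand slot i now
// holds (0 = src1 ... 2 = src3); original operand j drives truth-table
// index bit (2 - j).
static uint8_t swizzle_ternlog_imm(uint8_t imm, const int p[3]) {
  uint8_t out = 0;
  for (int k = 0; k < 8; ++k) {
    int v[3] = {(k >> 2) & 1, (k >> 1) & 1, k & 1}; // new operand values
    int oldK = 0;
    for (int i = 0; i < 3; ++i)
      oldK |= v[i] << (2 - p[i]); // rebuild the original table index
    if ((imm >> oldK) & 1)
      out |= uint8_t(1u << k);
  }
  return out;
}

int main() {
  const int p321[3] = {2, 1, 0}; // swap src1 and src3, as in 321 above
  for (int i = 0; i < 256; ++i) {
    uint8_t Imm = uint8_t(i), New = Imm & 0xa5; // hard-coded 321 table
    if (Imm & 0x02) New |= 0x10;
    if (Imm & 0x10) New |= 0x02;
    if (Imm & 0x08) New |= 0x40;
    if (Imm & 0x40) New |= 0x08;
    assert(swizzle_ternlog_imm(Imm, p321) == New);
  }
  return 0;
}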
multiclass avx512_ternlog<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _>{
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, u8imm:$src4),
OpcodeStr, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.VT _.RC:$src3),
(i8 imm:$src4)), 1, 1>, AVX512AIi8Base, EVEX_4V;
defm rmi : AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3, u8imm:$src4),
OpcodeStr, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.VT (bitconvert (_.LdFrag addr:$src3))),
(i8 imm:$src4)), 1, 0>,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
defm rmbi : AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3, u8imm:$src4),
OpcodeStr, "$src4, ${src3}"##_.BroadcastStr##", $src2",
"$src2, ${src3}"##_.BroadcastStr##", $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.VT (X86VBroadcast(_.ScalarLdFrag addr:$src3))),
(i8 imm:$src4)), 1, 0>, EVEX_B,
AVX512AIi8Base, EVEX_4V, EVEX_CD8<_.EltSize, CD8VF>;
}// Constraints = "$src1 = $dst"
// Additional patterns for matching passthru operand in other positions.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src3, _.RC:$src2, _.RC:$src1, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rrik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, _.RC:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src2, _.RC:$src1, _.RC:$src3, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rrik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, _.RC:$src3, (VPTERNLOG213_imm8 imm:$src4))>;
// Additional patterns for matching loads in other positions.
def : Pat<(_.VT (OpNode (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4))),
(!cast<Instruction>(NAME#_.ZSuffix#rmi) _.RC:$src1, _.RC:$src2,
addr:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (OpNode _.RC:$src1,
(bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4))),
(!cast<Instruction>(NAME#_.ZSuffix#rmi) _.RC:$src1, _.RC:$src2,
addr:$src3, (VPTERNLOG132_imm8 imm:$src4))>;
// Additional patterns for matching zero masking with loads in other
// positions.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4)),
_.ImmAllZerosV)),
(!cast<Instruction>(NAME#_.ZSuffix#rmikz) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src1, (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4)),
_.ImmAllZerosV)),
(!cast<Instruction>(NAME#_.ZSuffix#rmikz) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG132_imm8 imm:$src4))>;
// Additional patterns for matching masked loads with different
// operand orders.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src1, (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG132_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src2, _.RC:$src1,
(bitconvert (_.LdFrag addr:$src3)), (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG213_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src2, (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src1, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG312_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (bitconvert (_.LdFrag addr:$src3)),
_.RC:$src1, _.RC:$src2, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG231_imm8 imm:$src4))>;
// Additional patterns for matching broadcasts in other positions.
def : Pat<(_.VT (OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4))),
(!cast<Instruction>(NAME#_.ZSuffix#rmbi) _.RC:$src1, _.RC:$src2,
addr:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (OpNode _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4))),
(!cast<Instruction>(NAME#_.ZSuffix#rmbi) _.RC:$src1, _.RC:$src2,
addr:$src3, (VPTERNLOG132_imm8 imm:$src4))>;
// Additional patterns for matching zero masking with broadcasts in other
// positions.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4)),
_.ImmAllZerosV)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbikz) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3,
(VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4)),
_.ImmAllZerosV)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbikz) _.RC:$src1,
_.KRCWM:$mask, _.RC:$src2, addr:$src3,
(VPTERNLOG132_imm8 imm:$src4))>;
// Additional patterns for matching masked broadcasts with different
// operand orders.
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG132_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src2, _.RC:$src1, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG321_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src2, _.RC:$src1,
(X86VBroadcast (_.ScalarLdFrag addr:$src3)),
(i8 imm:$src4)), _.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG213_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode _.RC:$src2,
(X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src1, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG312_imm8 imm:$src4))>;
def : Pat<(_.VT (vselect _.KRCWM:$mask,
(OpNode (X86VBroadcast (_.ScalarLdFrag addr:$src3)),
_.RC:$src1, _.RC:$src2, (i8 imm:$src4)),
_.RC:$src1)),
(!cast<Instruction>(NAME#_.ZSuffix#rmbik) _.RC:$src1, _.KRCWM:$mask,
_.RC:$src2, addr:$src3, (VPTERNLOG231_imm8 imm:$src4))>;
}
multiclass avx512_common_ternlog<string OpcodeStr, AVX512VLVectorVTInfo _>{
let Predicates = [HasAVX512] in
defm Z : avx512_ternlog<0x25, OpcodeStr, X86vpternlog, _.info512>, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z128 : avx512_ternlog<0x25, OpcodeStr, X86vpternlog, _.info128>, EVEX_V128;
defm Z256 : avx512_ternlog<0x25, OpcodeStr, X86vpternlog, _.info256>, EVEX_V256;
}
}
defm VPTERNLOGD : avx512_common_ternlog<"vpternlogd", avx512vl_i32_info>;
defm VPTERNLOGQ : avx512_common_ternlog<"vpternlogq", avx512vl_i64_info>, VEX_W;
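For intuition, the imm8 operand of these instructions is a three-input truth table evaluated bitwise: 0xE8 (0b11101000) is set exactly at the indices containing two or three 1 bits, so it computes a bitwise majority in a single VPTERNLOGD. An illustrative intrinsics use (not part of this patch):

#include <immintrin.h>

// majority3(a, b, c): imm 0xE8 has bit k set exactly when
// k = (a<<2)|(b<<1)|c contains at least two set bits.
__m512i majority3(__m512i a, __m512i b, __m512i c) {
  return _mm512_ternarylogic_epi32(a, b, c, 0xE8);
}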
//===----------------------------------------------------------------------===//
// AVX-512 - FixupImm
//===----------------------------------------------------------------------===//
multiclass avx512_fixupimm_packed<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _>{
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.IntVT _.RC:$src3),
(i32 imm:$src4),
(i32 FROUND_CURRENT))>;
defm rmi : AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.MemOp:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.IntVT (bitconvert (_.LdFrag addr:$src3))),
(i32 imm:$src4),
(i32 FROUND_CURRENT))>;
defm rmbi : AVX512_maskable_3src<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, ${src3}"##_.BroadcastStr##", $src2",
"$src2, ${src3}"##_.BroadcastStr##", $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.IntVT (X86VBroadcast(_.ScalarLdFrag addr:$src3))),
(i32 imm:$src4),
(i32 FROUND_CURRENT))>, EVEX_B;
} // Constraints = "$src1 = $dst"
}
multiclass avx512_fixupimm_packed_sae<bits<8> opc, string OpcodeStr,
SDNode OpNode, X86VectorVTInfo _>{
let Constraints = "$src1 = $dst", ExeDomain = _.ExeDomain in {
defm rrib : AVX512_maskable_3src<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, {sae}, $src3, $src2",
"$src2, $src3, {sae}, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_.IntVT _.RC:$src3),
(i32 imm:$src4),
(i32 FROUND_NO_EXC))>, EVEX_B;
}
}
multiclass avx512_fixupimm_scalar<bits<8> opc, string OpcodeStr, SDNode OpNode,
X86VectorVTInfo _, X86VectorVTInfo _src3VT> {
let Constraints = "$src1 = $dst" , Predicates = [HasAVX512],
ExeDomain = _.ExeDomain in {
defm rri : AVX512_maskable_3src_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_src3VT.VT _src3VT.RC:$src3),
(i32 imm:$src4),
(i32 FROUND_CURRENT))>;
defm rrib : AVX512_maskable_3src_scalar<opc, MRMSrcReg, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.RC:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, {sae}, $src3, $src2",
"$src2, $src3, {sae}, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_src3VT.VT _src3VT.RC:$src3),
(i32 imm:$src4),
(i32 FROUND_NO_EXC))>, EVEX_B;
defm rmi : AVX512_maskable_3src_scalar<opc, MRMSrcMem, _, (outs _.RC:$dst),
(ins _.RC:$src2, _.ScalarMemOp:$src3, i32u8imm:$src4),
OpcodeStr##_.Suffix, "$src4, $src3, $src2", "$src2, $src3, $src4",
(OpNode (_.VT _.RC:$src1),
(_.VT _.RC:$src2),
(_src3VT.VT (scalar_to_vector
(_src3VT.ScalarLdFrag addr:$src3))),
(i32 imm:$src4),
(i32 FROUND_CURRENT))>;
}
}
multiclass avx512_fixupimm_packed_all<AVX512VLVectorVTInfo _Vec>{
let Predicates = [HasAVX512] in
defm Z : avx512_fixupimm_packed<0x54, "vfixupimm", X86VFixupimm, _Vec.info512>,
avx512_fixupimm_packed_sae<0x54, "vfixupimm", X86VFixupimm, _Vec.info512>,
AVX512AIi8Base, EVEX_4V, EVEX_V512;
let Predicates = [HasAVX512, HasVLX] in {
defm Z128 : avx512_fixupimm_packed<0x54, "vfixupimm", X86VFixupimm, _Vec.info128>,
AVX512AIi8Base, EVEX_4V, EVEX_V128;
defm Z256 : avx512_fixupimm_packed<0x54, "vfixupimm", X86VFixupimm, _Vec.info256>,
AVX512AIi8Base, EVEX_4V, EVEX_V256;
}
}
defm VFIXUPIMMSS : avx512_fixupimm_scalar<0x55, "vfixupimm", X86VFixupimmScalar,
f32x_info, v4i32x_info>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<32, CD8VT1>;
defm VFIXUPIMMSD : avx512_fixupimm_scalar<0x55, "vfixupimm", X86VFixupimmScalar,
f64x_info, v2i64x_info>,
AVX512AIi8Base, VEX_LIG, EVEX_4V, EVEX_CD8<64, CD8VT1>, VEX_W;
defm VFIXUPIMMPS : avx512_fixupimm_packed_all<avx512vl_f32_info>,
EVEX_CD8<32, CD8VF>;
defm VFIXUPIMMPD : avx512_fixupimm_packed_all<avx512vl_f64_info>,
EVEX_CD8<64, CD8VF>, VEX_W;
// Patterns used to select SSE scalar fp arithmetic instructions from
// either:
//
// (1) a scalar fp operation followed by a blend
//
// The effect is that the backend no longer emits unnecessary vector
// insert instructions immediately after SSE scalar fp instructions
// like addss or mulss.
//
// For example, given the following code:
// __m128 foo(__m128 A, __m128 B) {
// A[0] += B[0];
// return A;
// }
//
// Previously we generated:
// addss %xmm0, %xmm1
// movss %xmm1, %xmm0
//
// We now generate:
// addss %xmm1, %xmm0
//
// (2) a vector packed single/double fp operation followed by a vector insert
//
// The effect is that the backend converts the packed fp instruction
// followed by a vector insert into a single SSE scalar fp instruction.
//
// For example, given the following code:
// __m128 foo(__m128 A, __m128 B) {
// __m128 C = A + B;
// return (__m128) {C[0], A[1], A[2], A[3]};
// }
//
// Previously we generated:
// addps %xmm0, %xmm1
// movss %xmm1, %xmm0
//
// We now generate:
// addss %xmm1, %xmm0
// TODO: Some canonicalization in lowering would simplify the number of
// patterns we have to try to match.
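The first example above compiles as-is with clang, which accepts subscripting on __m128; a self-contained version for reproducing the described codegen (the file name and flags are only a suggestion):

#include <xmmintrin.h>

// With the patterns below, lane-0-only arithmetic like this should
// select a single addss rather than addss + movss.
// Try: clang -O2 -S scalar_math.cpp
__m128 foo(__m128 A, __m128 B) {
  A[0] += B[0];
  return A;
}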
multiclass AVX512_scalar_math_f32_patterns<SDNode Op, string OpcPrefix> {
let Predicates = [HasAVX512] in {
// extracted scalar math op with insert via movss
def : Pat<(v4f32 (X86Movss (v4f32 VR128X:$dst), (v4f32 (scalar_to_vector
(Op (f32 (extractelt (v4f32 VR128X:$dst), (iPTR 0))),
FR32X:$src))))),
(!cast<I>("V"#OpcPrefix#SSZrr_Int) v4f32:$dst,
(COPY_TO_REGCLASS FR32X:$src, VR128X))>;
// extracted scalar math op with insert via blend
def : Pat<(v4f32 (X86Blendi (v4f32 VR128X:$dst), (v4f32 (scalar_to_vector
(Op (f32 (extractelt (v4f32 VR128X:$dst), (iPTR 0))),
FR32X:$src))), (i8 1))),
(!cast<I>("V"#OpcPrefix#SSZrr_Int) v4f32:$dst,
(COPY_TO_REGCLASS FR32X:$src, VR128X))>;
// vector math op with insert via movss
def : Pat<(v4f32 (X86Movss (v4f32 VR128X:$dst),
(Op (v4f32 VR128X:$dst), (v4f32 VR128X:$src)))),
(!cast<I>("V"#OpcPrefix#SSZrr_Int) v4f32:$dst, v4f32:$src)>;
// vector math op with insert via blend
def : Pat<(v4f32 (X86Blendi (v4f32 VR128X:$dst),
(Op (v4f32 VR128X:$dst), (v4f32 VR128X:$src)), (i8 1))),
(!cast<I>("V"#OpcPrefix#SSZrr_Int) v4f32:$dst, v4f32:$src)>;
// extracted masked scalar math op with insert via movss
def : Pat<(X86Movss (v4f32 VR128X:$src1),
(scalar_to_vector
(X86selects VK1WM:$mask,
(Op (f32 (extractelt (v4f32 VR128X:$src1), (iPTR 0))),
FR32X:$src2),
FR32X:$src0))),
(!cast<I>("V"#OpcPrefix#SSZrr_Intk) (COPY_TO_REGCLASS FR32X:$src0, VR128X),
VK1WM:$mask, v4f32:$src1,
(COPY_TO_REGCLASS FR32X:$src2, VR128X))>;
}
}
defm : AVX512_scalar_math_f32_patterns<fadd, "ADD">;
defm : AVX512_scalar_math_f32_patterns<fsub, "SUB">;
defm : AVX512_scalar_math_f32_patterns<fmul, "MUL">;
defm : AVX512_scalar_math_f32_patterns<fdiv, "DIV">;
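// For example (an illustrative sketch, not part of the original patch): with
// the fadd/"ADD" instantiation above, a function such as
//   __m128 foo(__m128 A, __m128 B) {
//     A[0] += B[0];
//     return A;
//   }
// can select VADDSSZrr_Int ("V" # "ADD" # "SSZrr_Int") directly under
// AVX-512, instead of a packed add followed by a scalar move.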
multiclass AVX512_scalar_math_f64_patterns<SDNode Op, string OpcPrefix> {
let Predicates = [HasAVX512] in {
// extracted scalar math op with insert via movsd
def : Pat<(v2f64 (X86Movsd (v2f64 VR128X:$dst), (v2f64 (scalar_to_vector
(Op (f64 (extractelt (v2f64 VR128X:$dst), (iPTR 0))),
FR64X:$src))))),
(!cast<I>("V"#OpcPrefix#SDZrr_Int) v2f64:$dst,
(COPY_TO_REGCLASS FR64X:$src, VR128X))>;
// extracted scalar math op with insert via blend
def : Pat<(v2f64 (X86Blendi (v2f64 VR128X:$dst), (v2f64 (scalar_to_vector
(Op (f64 (extractelt (v2f64 VR128X:$dst), (iPTR 0))),
FR64X:$src))), (i8 1))),
(!cast<I>("V"#OpcPrefix#SDZrr_Int) v2f64:$dst,
(COPY_TO_REGCLASS FR64X:$src, VR128X))>;
// vector math op with insert via movsd
def : Pat<(v2f64 (X86Movsd (v2f64 VR128X:$dst),
(Op (v2f64 VR128X:$dst), (v2f64 VR128X:$src)))),
(!cast<I>("V"#OpcPrefix#SDZrr_Int) v2f64:$dst, v2f64:$src)>;
// vector math op with insert via blend
def : Pat<(v2f64 (X86Blendi (v2f64 VR128X:$dst),
(Op (v2f64 VR128X:$dst), (v2f64 VR128X:$src)), (i8 1))),
(!cast<I>("V"#OpcPrefix#SDZrr_Int) v2f64:$dst, v2f64:$src)>;
// extracted masked scalar math op with insert via movsd
def : Pat<(X86Movsd (v2f64 VR128X:$src1),
(scalar_to_vector
(X86selects VK1WM:$mask,
(Op (f64 (extractelt (v2f64 VR128X:$src1), (iPTR 0))),
FR64X:$src2),
FR64X:$src0))),
(!cast<I>("V"#OpcPrefix#SDZrr_Intk) (COPY_TO_REGCLASS FR64X:$src0, VR128X),
VK1WM:$mask, v2f64:$src1,
(COPY_TO_REGCLASS FR64X:$src2, VR128X))>;
}
}
defm : AVX512_scalar_math_f64_patterns<fadd, "ADD">;
defm : AVX512_scalar_math_f64_patterns<fsub, "SUB">;
defm : AVX512_scalar_math_f64_patterns<fmul, "MUL">;
defm : AVX512_scalar_math_f64_patterns<fdiv, "DIV">;
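// As with the f32 patterns above, each f64 instantiation selects the
// corresponding SD form; e.g. the fadd/"ADD" instantiation selects
// VADDSDZrr_Int.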
diff --git a/lib/Target/X86/X86SchedSandyBridge.td b/lib/Target/X86/X86SchedSandyBridge.td
index 6d85ca6cad64..b8ec5883152c 100644
--- a/lib/Target/X86/X86SchedSandyBridge.td
+++ b/lib/Target/X86/X86SchedSandyBridge.td
@@ -1,2687 +1,275 @@
//=- X86SchedSandyBridge.td - X86 Sandy Bridge Scheduling ----*- tablegen -*-=//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file defines the machine model for Sandy Bridge to support instruction
// scheduling and other instruction cost heuristics.
//
//===----------------------------------------------------------------------===//
def SandyBridgeModel : SchedMachineModel {
// All x86 instructions are modeled as a single micro-op, and SB can decode 4
// instructions per cycle.
// FIXME: Identify instructions that aren't a single fused micro-op.
let IssueWidth = 4;
let MicroOpBufferSize = 168; // Based on the reorder buffer.
let LoadLatency = 4;
let MispredictPenalty = 16;
// Based on the LSD (loop-stream detector) queue size.
let LoopMicroOpBufferSize = 28;
- // This flag is set to allow the scheduler to assign
- // a default model to unrecognized opcodes.
+ // FIXME: SSE4 and AVX are unimplemented. This flag is set to allow
+ // the scheduler to assign a default model to unrecognized opcodes.
let CompleteModel = 0;
}
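// For example (a rough sketch): with IssueWidth = 4, a block of eight
// single-micro-op instructions needs at least two cycles to issue under
// this model.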
let SchedModel = SandyBridgeModel in {
// Sandy Bridge can issue micro-ops to 6 different ports in one cycle.
// Ports 0, 1, and 5 handle all computation.
def SBPort0 : ProcResource<1>;
def SBPort1 : ProcResource<1>;
def SBPort5 : ProcResource<1>;
// Ports 2 and 3 are identical. They handle loads and the address half of
// stores.
def SBPort23 : ProcResource<2>;
// Port 4 gets the data half of stores. Store data can be available later than
// the store address, but since we don't model the latency of stores, we can
// ignore that.
def SBPort4 : ProcResource<1>;
// Many micro-ops are capable of issuing on multiple ports.
-def SBPort01 : ProcResGroup<[SBPort0, SBPort1]>;
def SBPort05 : ProcResGroup<[SBPort0, SBPort5]>;
def SBPort15 : ProcResGroup<[SBPort1, SBPort5]>;
def SBPort015 : ProcResGroup<[SBPort0, SBPort1, SBPort5]>;
// 54 Entry Unified Scheduler
def SBPortAny : ProcResGroup<[SBPort0, SBPort1, SBPort23, SBPort4, SBPort5]> {
let BufferSize=54;
}
// Integer division issued on port 0.
def SBDivider : ProcResource<1>;
// Loads are 4 cycles, so ReadAfterLd registers needn't be available until 4
// cycles after the memory operand.
def : ReadAdvance<ReadAfterLd, 4>;
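// For example (an illustrative sketch): for a load-op instruction such as
//   addq (%rdi), %rax
// the register operand is not read until the load half completes, so a
// producer of %rax that finishes up to 4 cycles after this instruction
// issues adds no extra stall.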
// Many SchedWrites are defined in pairs with and without a folded load.
// Instructions with folded loads are usually micro-fused, so they only appear
// as two micro-ops when queued in the reservation station.
// This multiclass defines the resource usage for variants with and without
// folded loads.
multiclass SBWriteResPair<X86FoldableSchedWrite SchedRW,
ProcResourceKind ExePort,
int Lat> {
// The register variant uses a single cycle on ExePort.
def : WriteRes<SchedRW, [ExePort]> { let Latency = Lat; }
// The memory variant also uses a cycle on port 2/3 and adds 4 cycles to
// the latency.
def : WriteRes<SchedRW.Folded, [SBPort23, ExePort]> {
let Latency = !add(Lat, 4);
}
}
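// For example (sketch): "defm : SBWriteResPair<WriteALU, SBPort015, 1>" below
// expands to a 1-cycle WriteRes on port 0/1/5 for the register form, plus a
// folded-load WriteRes with latency 5 (1 + 4) that also takes a cycle on
// port 2/3.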
// A folded store needs a cycle on port 4 for the store data, but it does not
// need an extra port 2/3 cycle to recompute the address.
def : WriteRes<WriteRMW, [SBPort4]>;
def : WriteRes<WriteStore, [SBPort23, SBPort4]>;
def : WriteRes<WriteLoad, [SBPort23]> { let Latency = 4; }
def : WriteRes<WriteMove, [SBPort015]>;
def : WriteRes<WriteZero, []>;
defm : SBWriteResPair<WriteALU, SBPort015, 1>;
defm : SBWriteResPair<WriteIMul, SBPort1, 3>;
def : WriteRes<WriteIMulH, []> { let Latency = 3; }
defm : SBWriteResPair<WriteShift, SBPort05, 1>;
defm : SBWriteResPair<WriteJump, SBPort5, 1>;
// This is for simple LEAs with one or two input operands.
// The complex ones can only execute on port 1, and they require two cycles on
// the port to read all inputs. We don't model that.
def : WriteRes<WriteLEA, [SBPort15]>;
// This is quite rough; the latency depends on the dividend.
def : WriteRes<WriteIDiv, [SBPort0, SBDivider]> {
let Latency = 25;
let ResourceCycles = [1, 10];
}
def : WriteRes<WriteIDivLd, [SBPort23, SBPort0, SBDivider]> {
let Latency = 29;
let ResourceCycles = [1, 1, 10];
}
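// For example (sketch): ResourceCycles = [1, 10] above means the division
// occupies the non-pipelined SBDivider for ten cycles, so back-to-back
// divisions are throughput-limited by the divider rather than by latency.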
// Scalar and vector floating point.
defm : SBWriteResPair<WriteFAdd, SBPort1, 3>;
defm : SBWriteResPair<WriteFMul, SBPort0, 5>;
-defm : SBWriteResPair<WriteFDiv, SBPort0, 24>;
+defm : SBWriteResPair<WriteFDiv, SBPort0, 12>; // 10-14 cycles.
defm : SBWriteResPair<WriteFRcp, SBPort0, 5>;
defm : SBWriteResPair<WriteFRsqrt, SBPort0, 5>;
-defm : SBWriteResPair<WriteFSqrt, SBPort0, 14>;
+defm : SBWriteResPair<WriteFSqrt, SBPort0, 15>;
defm : SBWriteResPair<WriteCvtF2I, SBPort1, 3>;
defm : SBWriteResPair<WriteCvtI2F, SBPort1, 4>;
defm : SBWriteResPair<WriteCvtF2F, SBPort1, 3>;
defm : SBWriteResPair<WriteFShuffle, SBPort5, 1>;
defm : SBWriteResPair<WriteFBlend, SBPort05, 1>;
def : WriteRes<WriteFVarBlend, [SBPort0, SBPort5]> {
let Latency = 2;
let ResourceCycles = [1, 1];
}
def : WriteRes<WriteFVarBlendLd, [SBPort0, SBPort5, SBPort23]> {
let Latency = 6;
let ResourceCycles = [1, 1, 1];
}
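// Note: variable blends occupy two different ports (0 and 5), so they cannot
// be expressed with SBWriteResPair, which models a single execution port.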
// Vector integer operations.
-defm : SBWriteResPair<WriteVecShift, SBPort5, 1>;
-defm : SBWriteResPair<WriteVecLogic, SBPort5, 1>;
-defm : SBWriteResPair<WriteVecALU, SBPort1, 3>;
+defm : SBWriteResPair<WriteVecShift, SBPort05, 1>;
+defm : SBWriteResPair<WriteVecLogic, SBPort015, 1>;
+defm : SBWriteResPair<WriteVecALU, SBPort15, 1>;
defm : SBWriteResPair<WriteVecIMul, SBPort0, 5>;
-defm : SBWriteResPair<WriteShuffle, SBPort5, 1>;
+defm : SBWriteResPair<WriteShuffle, SBPort15, 1>;
defm : SBWriteResPair<WriteBlend, SBPort15, 1>;
def : WriteRes<WriteVarBlend, [SBPort1, SBPort5]> {
let Latency = 2;
let ResourceCycles = [1, 1];
}
def : WriteRes<WriteVarBlendLd, [SBPort1, SBPort5, SBPort23]> {
let Latency = 6;
let ResourceCycles = [1, 1, 1];
}
-def : WriteRes<WriteMPSAD, [SBPort0,SBPort15]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
+def : WriteRes<WriteMPSAD, [SBPort0, SBPort1, SBPort5]> {
+ let Latency = 6;
+ let ResourceCycles = [1, 1, 1];
}
-def : WriteRes<WriteMPSADLd, [SBPort0,SBPort23,SBPort15]> {
- let Latency = 11;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
+def : WriteRes<WriteMPSADLd, [SBPort0, SBPort1, SBPort5, SBPort23]> {
+ let Latency = 6;
+ let ResourceCycles = [1, 1, 1, 1];
}
////////////////////////////////////////////////////////////////////////////////
// Horizontal add/sub instructions.
////////////////////////////////////////////////////////////////////////////////
// HADD, HSUB PS/PD
// Register forms: x,x (SSE) / v,v,v (AVX).
def : WriteRes<WriteFHAdd, [SBPort1]> {
let Latency = 3;
}
// Memory forms: x,m (SSE) / v,v,m (AVX).
def : WriteRes<WriteFHAddLd, [SBPort1, SBPort23]> {
let Latency = 7;
let ResourceCycles = [1, 1];
}
// PHADD|PHSUB (S) W/D.
// Register form: v <- v,v.
def : WriteRes<WritePHAdd, [SBPort15]>;
// Memory form: v <- v,m.
def : WriteRes<WritePHAddLd, [SBPort15, SBPort23]> {
let Latency = 5;
let ResourceCycles = [1, 1];
}
// String instructions.
// Packed Compare Implicit Length Strings, Return Mask
def : WriteRes<WritePCmpIStrM, [SBPort015]> {
let Latency = 11;
let ResourceCycles = [3];
}
def : WriteRes<WritePCmpIStrMLd, [SBPort015, SBPort23]> {
let Latency = 11;
let ResourceCycles = [3, 1];
}
// Packed Compare Explicit Length Strings, Return Mask
def : WriteRes<WritePCmpEStrM, [SBPort015]> {
let Latency = 11;
let ResourceCycles = [8];
}
def : WriteRes<WritePCmpEStrMLd, [SBPort015, SBPort23]> {
let Latency = 11;
let ResourceCycles = [7, 1];
}
// Packed Compare Implicit Length Strings, Return Index
-def : WriteRes<WritePCmpIStrI, [SBPort0]> {
- let Latency = 11;
- let NumMicroOps = 3;
+def : WriteRes<WritePCmpIStrI, [SBPort015]> {
+ let Latency = 3;
let ResourceCycles = [3];
}
-def : WriteRes<WritePCmpIStrILd, [SBPort0,SBPort23]> {
- let Latency = 17;
- let NumMicroOps = 4;
- let ResourceCycles = [3,1];
+def : WriteRes<WritePCmpIStrILd, [SBPort015, SBPort23]> {
+ let Latency = 3;
+ let ResourceCycles = [3, 1];
}
// Packed Compare Explicit Length Strings, Return Index
def : WriteRes<WritePCmpEStrI, [SBPort015]> {
let Latency = 4;
let ResourceCycles = [8];
}
def : WriteRes<WritePCmpEStrILd, [SBPort015, SBPort23]> {
let Latency = 4;
let ResourceCycles = [7, 1];
}
// AES Instructions.
-def : WriteRes<WriteAESDecEnc, [SBPort5,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
+def : WriteRes<WriteAESDecEnc, [SBPort015]> {
+ let Latency = 8;
+ let ResourceCycles = [2];
}
-def : WriteRes<WriteAESDecEncLd, [SBPort5,SBPort23,SBPort015]> {
- let Latency = 13;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
+def : WriteRes<WriteAESDecEncLd, [SBPort015, SBPort23]> {
+ let Latency = 8;
+ let ResourceCycles = [2, 1];
}
-def : WriteRes<WriteAESIMC, [SBPort5]> {
- let Latency = 12;
- let NumMicroOps = 2;
+def : WriteRes<WriteAESIMC, [SBPort015]> {
+ let Latency = 8;
let ResourceCycles = [2];
}
-def : WriteRes<WriteAESIMCLd, [SBPort5,SBPort23]> {
- let Latency = 18;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
+def : WriteRes<WriteAESIMCLd, [SBPort015, SBPort23]> {
+ let Latency = 8;
+ let ResourceCycles = [2, 1];
}
def : WriteRes<WriteAESKeyGen, [SBPort015]> {
let Latency = 8;
let ResourceCycles = [11];
}
def : WriteRes<WriteAESKeyGenLd, [SBPort015, SBPort23]> {
let Latency = 8;
let ResourceCycles = [10, 1];
}
// Carry-less multiplication instructions.
def : WriteRes<WriteCLMul, [SBPort015]> {
let Latency = 14;
let ResourceCycles = [18];
}
def : WriteRes<WriteCLMulLd, [SBPort015, SBPort23]> {
let Latency = 14;
let ResourceCycles = [17, 1];
}
def : WriteRes<WriteSystem, [SBPort015]> { let Latency = 100; }
def : WriteRes<WriteMicrocoded, [SBPort015]> { let Latency = 100; }
def : WriteRes<WriteFence, [SBPort23, SBPort4]>;
def : WriteRes<WriteNop, []>;
// AVX2 is not supported on this architecture, but we should define the basic
// scheduling resources anyway.
defm : SBWriteResPair<WriteFShuffle256, SBPort0, 1>;
defm : SBWriteResPair<WriteShuffle256, SBPort0, 1>;
defm : SBWriteResPair<WriteVarVecShift, SBPort0, 1>;
-
-// Remaining SNB instrs.
-
-def SBWriteResGroup0 : SchedWriteRes<[SBPort0]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup0], (instregex "CVTSS2SDrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSLLDri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSLLQri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSLLWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSRADri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSRAWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSRLDri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSRLQri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "PSRLWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VCVTSS2SDrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPMOVMSKBrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSLLDri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSLLQri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSLLWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSRADri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSRAWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSRLDri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSRLQri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VPSRLWri")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VTESTPDYrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VTESTPDrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VTESTPSYrr")>;
-def: InstRW<[SBWriteResGroup0], (instregex "VTESTPSrr")>;
-
-def SBWriteResGroup1 : SchedWriteRes<[SBPort1]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup1], (instregex "COMP_FST0r")>;
-def: InstRW<[SBWriteResGroup1], (instregex "COM_FST0r")>;
-def: InstRW<[SBWriteResGroup1], (instregex "UCOM_FPr")>;
-def: InstRW<[SBWriteResGroup1], (instregex "UCOM_Fr")>;
-
-def SBWriteResGroup2 : SchedWriteRes<[SBPort5]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup2], (instregex "ANDNPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ANDNPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ANDPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ANDPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "FDECSTP")>;
-def: InstRW<[SBWriteResGroup2], (instregex "FFREE")>;
-def: InstRW<[SBWriteResGroup2], (instregex "FINCSTP")>;
-def: InstRW<[SBWriteResGroup2], (instregex "FNOP")>;
-def: InstRW<[SBWriteResGroup2], (instregex "INSERTPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "JMP64r")>;
-def: InstRW<[SBWriteResGroup2], (instregex "LD_Frr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOV64toPQIrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVAPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVAPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVDDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVDI2PDIrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVHLPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVLHPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVSDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVSHDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVSLDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVSSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVUPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "MOVUPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ORPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ORPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "RETQ")>;
-def: InstRW<[SBWriteResGroup2], (instregex "SHUFPDrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "SHUFPSrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ST_FPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "ST_Frr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "UNPCKHPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "UNPCKHPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "UNPCKLPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "UNPCKLPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDNPDYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDNPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDNPSYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDNPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VANDPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VEXTRACTF128rr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VINSERTF128rr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VINSERTPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOV64toPQIrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOV64toPQIrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVAPDYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVAPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVAPSYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVAPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVDDUPYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVDDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVHLPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVHLPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSHDUPYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSHDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSLDUPYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSLDUPrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVSSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVUPDYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVUPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVUPSYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VMOVUPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VORPDYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VORPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VORPSYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VORPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPDri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPDrm")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPSri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPSrm")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VPERMILPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VSHUFPDYrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VSHUFPDrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VSHUFPSYrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VSHUFPSrri")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKHPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKHPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKLPDYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKLPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKLPSYrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VUNPCKLPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VXORPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "VXORPSrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "XORPDrr")>;
-def: InstRW<[SBWriteResGroup2], (instregex "XORPSrr")>;
-
-def SBWriteResGroup3 : SchedWriteRes<[SBPort01]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup3], (instregex "LEA64_32r")>;
-
-def SBWriteResGroup4 : SchedWriteRes<[SBPort0]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup4], (instregex "BLENDPDrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BLENDPSrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BT32ri8")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BT32rr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTC32ri8")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTC32rr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTR32ri8")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTR32rr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTS32ri8")>;
-def: InstRW<[SBWriteResGroup4], (instregex "BTS32rr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "CDQ")>;
-def: InstRW<[SBWriteResGroup4], (instregex "CQO")>;
-def: InstRW<[SBWriteResGroup4], (instregex "LAHF")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SAHF")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SAR32ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SAR8ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETAEr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETBr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETEr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETGEr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETGr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETLEr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETLr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETNEr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETNOr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETNPr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETNSr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETOr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETPr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SETSr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHL32ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHL64r1")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHL8r1")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHL8ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHR32ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "SHR8ri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VBLENDPDYrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VBLENDPDrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VBLENDPSYrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VBLENDPSrri")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VMOVDQAYrr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VMOVDQArr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VMOVDQUYrr")>;
-def: InstRW<[SBWriteResGroup4], (instregex "VMOVDQUrr")>;
-
-def SBWriteResGroup5 : SchedWriteRes<[SBPort15]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup5], (instregex "KORTESTBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PABSBrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PABSDrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PABSWrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PADDQirr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PALIGNR64irr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PSHUFBrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PSIGNBrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PSIGNDrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "MMX_PSIGNWrr64")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PABSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PABSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PABSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PACKSSDWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PACKSSWBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PACKUSDWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PACKUSWBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDUSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDUSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PADDWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PALIGNRrri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PAVGBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PAVGWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PBLENDWrri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPEQBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPEQDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPEQQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPEQWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPGTBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPGTDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PCMPGTWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXUBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXUDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMAXUWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINUBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINUDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMINUWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVSXWQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PMOVZXWQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSHUFBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSHUFDri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSHUFHWri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSHUFLWri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSIGNBrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSIGNDrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSIGNWrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSLLDQri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSRLDQri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBUSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBUSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PSUBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKHBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKHDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKHQDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKHWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKLBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKLDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKLQDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "PUNPCKLWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VMASKMOVPSYrm")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPABSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPABSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPABSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPACKSSDWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPACKSSWBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPACKUSDWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPACKUSWBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPADDBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPADDDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPADDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPADDUSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPADDUSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPALIGNRrri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPAVGBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPAVGWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPBLENDWrri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPEQBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPEQDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPEQWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPGTBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPGTDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPCMPGTWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXUBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXUDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMAXUWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINSDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINUBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINUDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMINUWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVSXWQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPMOVZXWQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSHUFBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSHUFDri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSHUFLWri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSIGNBrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSIGNDrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSIGNWrr128")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSLLDQri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSRLDQri")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBUSBrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBUSWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPSUBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKHBWrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKHDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKHWDrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKLDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKLQDQrr")>;
-def: InstRW<[SBWriteResGroup5], (instregex "VPUNPCKLWDrr")>;
-
-def SBWriteResGroup6 : SchedWriteRes<[SBPort015]> {
- let Latency = 1;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup6], (instregex "ADD32ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "ADD32rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "ADD8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "ADD8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "AND32ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "AND64ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "AND64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "AND8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "AND8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CBW")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMC")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMP16ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMP32i32")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMP64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMP8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CMP8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "CWDE")>;
-def: InstRW<[SBWriteResGroup6], (instregex "DEC64r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "DEC8r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "INC64r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "INC8r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MMX_MOVD64from64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MMX_MOVQ2DQrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOV32rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOV8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOV8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVDQArr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVDQUrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVPQI2QIrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVSX32rr16")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVSX32rr8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVZX32rr16")>;
-def: InstRW<[SBWriteResGroup6], (instregex "MOVZX32rr8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "NEG64r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "NEG8r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "NOT64r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "NOT8r")>;
-def: InstRW<[SBWriteResGroup6], (instregex "OR64ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "OR64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "OR8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "OR8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "PANDNrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "PANDrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "PORrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "PXORrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "STC")>;
-def: InstRW<[SBWriteResGroup6], (instregex "SUB64ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "SUB64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "SUB8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "SUB8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "TEST64rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "TEST8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "TEST8rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VMOVPQI2QIrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VMOVZPQILo2PQIrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VPANDNrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VPANDrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VPORrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "VPXORrr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "XOR32rr")>;
-def: InstRW<[SBWriteResGroup6], (instregex "XOR64ri8")>;
-def: InstRW<[SBWriteResGroup6], (instregex "XOR8ri")>;
-def: InstRW<[SBWriteResGroup6], (instregex "XOR8rr")>;
-
-def SBWriteResGroup7 : SchedWriteRes<[SBPort0]> {
- let Latency = 2;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup7], (instregex "MOVMSKPDrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "MOVMSKPSrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "MOVPDI2DIrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "MOVPQIto64rr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "PMOVMSKBrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "VMOVMSKPDYrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "VMOVMSKPDrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "VMOVMSKPSrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "VMOVPDI2DIrr")>;
-def: InstRW<[SBWriteResGroup7], (instregex "VMOVPQIto64rr")>;
-
-def SBWriteResGroup9 : SchedWriteRes<[SBPort0]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [2];
-}
-def: InstRW<[SBWriteResGroup9], (instregex "BLENDVPDrr0")>;
-def: InstRW<[SBWriteResGroup9], (instregex "BLENDVPSrr0")>;
-def: InstRW<[SBWriteResGroup9], (instregex "ROL32ri")>;
-def: InstRW<[SBWriteResGroup9], (instregex "ROL8ri")>;
-def: InstRW<[SBWriteResGroup9], (instregex "ROR32ri")>;
-def: InstRW<[SBWriteResGroup9], (instregex "ROR8ri")>;
-def: InstRW<[SBWriteResGroup9], (instregex "SETAr")>;
-def: InstRW<[SBWriteResGroup9], (instregex "SETBEr")>;
-def: InstRW<[SBWriteResGroup9], (instregex "VBLENDVPDYrr")>;
-def: InstRW<[SBWriteResGroup9], (instregex "VBLENDVPDrr")>;
-def: InstRW<[SBWriteResGroup9], (instregex "VBLENDVPSYrr")>;
-def: InstRW<[SBWriteResGroup9], (instregex "VBLENDVPSrr")>;
-
-def SBWriteResGroup10 : SchedWriteRes<[SBPort15]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [2];
-}
-def: InstRW<[SBWriteResGroup10], (instregex "VPBLENDVBrr")>;
-
-def SBWriteResGroup11 : SchedWriteRes<[SBPort015]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [2];
-}
-def: InstRW<[SBWriteResGroup11], (instregex "SCASB")>;
-def: InstRW<[SBWriteResGroup11], (instregex "SCASL")>;
-def: InstRW<[SBWriteResGroup11], (instregex "SCASQ")>;
-def: InstRW<[SBWriteResGroup11], (instregex "SCASW")>;
-
-def SBWriteResGroup12 : SchedWriteRes<[SBPort0,SBPort1]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup12], (instregex "COMISDrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "COMISSrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "UCOMISDrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "UCOMISSrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "VCOMISDrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "VCOMISSrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "VUCOMISDrr")>;
-def: InstRW<[SBWriteResGroup12], (instregex "VUCOMISSrr")>;
-
-def SBWriteResGroup13 : SchedWriteRes<[SBPort0,SBPort5]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup13], (instregex "CVTPS2PDrr")>;
-def: InstRW<[SBWriteResGroup13], (instregex "PTESTrr")>;
-def: InstRW<[SBWriteResGroup13], (instregex "VCVTPS2PDYrr")>;
-def: InstRW<[SBWriteResGroup13], (instregex "VCVTPS2PDrr")>;
-def: InstRW<[SBWriteResGroup13], (instregex "VPTESTYrr")>;
-def: InstRW<[SBWriteResGroup13], (instregex "VPTESTrr")>;
-
-def SBWriteResGroup14 : SchedWriteRes<[SBPort0,SBPort15]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup14], (instregex "PSLLDrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSLLQrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSLLWrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSRADrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSRAWrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSRLDrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSRLQrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "PSRLWrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "VPSRADrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "VPSRAWrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "VPSRLDrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "VPSRLQrr")>;
-def: InstRW<[SBWriteResGroup14], (instregex "VPSRLWrr")>;
-
-def SBWriteResGroup15 : SchedWriteRes<[SBPort0,SBPort015]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup15], (instregex "FNSTSW16r")>;
-
-def SBWriteResGroup16 : SchedWriteRes<[SBPort1,SBPort0]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup16], (instregex "BSWAP32r")>;
-
-def SBWriteResGroup17 : SchedWriteRes<[SBPort5,SBPort15]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup17], (instregex "PINSRBrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "PINSRDrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "PINSRQrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "PINSRWrri")>;
-def: InstRW<[SBWriteResGroup17], (instregex "VPINSRBrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "VPINSRDrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "VPINSRQrr")>;
-def: InstRW<[SBWriteResGroup17], (instregex "VPINSRWrri")>;
-
-def SBWriteResGroup18 : SchedWriteRes<[SBPort5,SBPort015]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup18], (instregex "MMX_MOVDQ2Qrr")>;
-
-def SBWriteResGroup19 : SchedWriteRes<[SBPort0,SBPort015]> {
- let Latency = 2;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup19], (instregex "ADC64ri8")>;
-def: InstRW<[SBWriteResGroup19], (instregex "ADC64rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "ADC8ri")>;
-def: InstRW<[SBWriteResGroup19], (instregex "ADC8rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVAE32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVB32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVE32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVG32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVGE32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVL32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVLE32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVNE32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVNO32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVNP32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVNS32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVO32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVP32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "CMOVS32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SBB32rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SBB64ri8")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SBB8ri")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SBB8rr")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SHLD32rri8")>;
-def: InstRW<[SBWriteResGroup19], (instregex "SHRD32rri8")>;
-
-def SBWriteResGroup20 : SchedWriteRes<[SBPort0]> {
- let Latency = 3;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup20], (instregex "MMX_PMADDUBSWrr64")>;
-def: InstRW<[SBWriteResGroup20], (instregex "MMX_PMULHRSWrr64")>;
-def: InstRW<[SBWriteResGroup20], (instregex "MMX_PMULUDQirr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMADDUBSWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMADDWDrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULDQrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULHRSWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULHUWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULHWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULLDrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULLWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PMULUDQrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "PSADBWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VMOVMSKPSYrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMADDUBSWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMADDWDrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMULDQrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMULHRSWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMULHWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMULLDrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPMULLWrr")>;
-def: InstRW<[SBWriteResGroup20], (instregex "VPSADBWrr")>;
-
-def SBWriteResGroup21 : SchedWriteRes<[SBPort1]> {
- let Latency = 3;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup21], (instregex "ADDPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADDPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADDSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADDSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADDSUBPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADDSUBPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADD_FPrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADD_FST0r")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ADD_FrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "BSF32rr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "BSR32rr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CMPPDrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CMPPSrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CMPSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CMPSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CRC32r32r32")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CRC32r32r8")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CVTDQ2PSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CVTPS2DQrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "CVTTPS2DQrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MAXPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MAXPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MAXSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MAXSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MINPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MINPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MINSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MINSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MMX_CVTPI2PSirr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MMX_CVTPS2PIirr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MMX_CVTTPS2PIirr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "MUL8r")>;
-def: InstRW<[SBWriteResGroup21], (instregex "POPCNT32rr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ROUNDPDr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ROUNDPSr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ROUNDSDr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "ROUNDSSr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBR_FPrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBR_FST0r")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBR_FrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUBSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUB_FPrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUB_FST0r")>;
-def: InstRW<[SBWriteResGroup21], (instregex "SUB_FrST0")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDPDYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDPSYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSUBPDYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSUBPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSUBPSYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VADDSUBPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VBROADCASTF128")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPPDYrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPPDrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPPSYrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPPSrri")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCMPSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCVTDQ2PSYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCVTDQ2PSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCVTPS2DQYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCVTPS2DQrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VCVTTPS2DQrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXPDYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXPSYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMAXSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMINPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMINPSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMINSDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VMINSSrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VROUNDPDr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VROUNDPSr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VROUNDSDr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VSUBPDYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VSUBPDrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VSUBPSYrr")>;
-def: InstRW<[SBWriteResGroup21], (instregex "VSUBPSrr")>;
-
-def SBWriteResGroup22 : SchedWriteRes<[SBPort0,SBPort5]> {
- let Latency = 3;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup22], (instregex "EXTRACTPSrr")>;
-def: InstRW<[SBWriteResGroup22], (instregex "VEXTRACTPSrr")>;
-
-def SBWriteResGroup23 : SchedWriteRes<[SBPort0,SBPort15]> {
- let Latency = 3;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup23], (instregex "PEXTRBrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "PEXTRDrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "PEXTRQrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "PEXTRWri")>;
-def: InstRW<[SBWriteResGroup23], (instregex "VPEXTRBrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "VPEXTRDrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "VPEXTRQrr")>;
-def: InstRW<[SBWriteResGroup23], (instregex "VPEXTRWri")>;
-def: InstRW<[SBWriteResGroup23], (instregex "SHL64rCL")>;
-def: InstRW<[SBWriteResGroup23], (instregex "SHL8rCL")>;
-
-def SBWriteResGroup24 : SchedWriteRes<[SBPort15]> {
- let Latency = 3;
- let NumMicroOps = 3;
- let ResourceCycles = [3];
-}
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHADDSWrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHADDWrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHADDrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHSUBDrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHSUBSWrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "MMX_PHSUBWrr64")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHADDDrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHADDSWrr128")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHADDWrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHSUBDrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHSUBSWrr128")>;
-def: InstRW<[SBWriteResGroup24], (instregex "PHSUBWrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHADDDrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHADDSWrr128")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHADDWrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHSUBDrr")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHSUBSWrr128")>;
-def: InstRW<[SBWriteResGroup24], (instregex "VPHSUBWrr")>;
-
-def SBWriteResGroup25 : SchedWriteRes<[SBPort015]> {
- let Latency = 3;
- let NumMicroOps = 3;
- let ResourceCycles = [3];
-}
-def: InstRW<[SBWriteResGroup25], (instregex "LEAVE64")>;
-def: InstRW<[SBWriteResGroup25], (instregex "XADD32rr")>;
-def: InstRW<[SBWriteResGroup25], (instregex "XADD8rr")>;
-
-def SBWriteResGroup26 : SchedWriteRes<[SBPort0,SBPort015]> {
- let Latency = 3;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup26], (instregex "CMOVA32rr")>;
-def: InstRW<[SBWriteResGroup26], (instregex "CMOVBE32rr")>;
-
-def SBWriteResGroup27 : SchedWriteRes<[SBPort0,SBPort1]> {
- let Latency = 4;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup27], (instregex "MUL64r")>;
-
-def SBWriteResGroup28 : SchedWriteRes<[SBPort1,SBPort5]> {
- let Latency = 4;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup28], (instregex "CVTDQ2PDrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTPD2DQrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTPD2PSrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTSD2SSrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTSI2SD64rr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTSI2SDrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "CVTTPD2DQrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "MMX_CVTPD2PIirr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "MMX_CVTPI2PDirr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "MMX_CVTTPD2PIirr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTDQ2PDYrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTDQ2PDrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTPD2DQYrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTPD2DQrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTPD2PSYrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTPD2PSrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTSI2SD64rr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTSI2SDrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTTPD2DQYrr")>;
-def: InstRW<[SBWriteResGroup28], (instregex "VCVTTPD2DQrr")>;
-
-def SBWriteResGroup29 : SchedWriteRes<[SBPort1,SBPort015]> {
- let Latency = 4;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup29], (instregex "MOV64sr")>;
-def: InstRW<[SBWriteResGroup29], (instregex "PAUSE")>;
-
-def SBWriteResGroup30 : SchedWriteRes<[SBPort0]> {
- let Latency = 5;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup30], (instregex "MULPDrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MULPSrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MULSDrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MULSSrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MUL_FPrST0")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MUL_FST0r")>;
-def: InstRW<[SBWriteResGroup30], (instregex "MUL_FrST0")>;
-def: InstRW<[SBWriteResGroup30], (instregex "PCMPGTQrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "PHMINPOSUWrr128")>;
-def: InstRW<[SBWriteResGroup30], (instregex "RCPPSr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "RCPSSr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "RSQRTPSr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "RSQRTSSr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULPDYrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULPDrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULPSYrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULPSrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULSDrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VMULSSrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VPCMPGTQrr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VPHMINPOSUWrr128")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VRSQRTPSr")>;
-def: InstRW<[SBWriteResGroup30], (instregex "VRSQRTSSr")>;
-
-def SBWriteResGroup31 : SchedWriteRes<[SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup31], (instregex "MOV32rm")>;
-def: InstRW<[SBWriteResGroup31], (instregex "MOV8rm")>;
-def: InstRW<[SBWriteResGroup31], (instregex "MOVSX32rm16")>;
-def: InstRW<[SBWriteResGroup31], (instregex "MOVSX32rm8")>;
-def: InstRW<[SBWriteResGroup31], (instregex "MOVZX32rm16")>;
-def: InstRW<[SBWriteResGroup31], (instregex "MOVZX32rm8")>;
-def: InstRW<[SBWriteResGroup31], (instregex "PREFETCH")>;
-
-def SBWriteResGroup32 : SchedWriteRes<[SBPort0,SBPort1]> {
- let Latency = 5;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup32], (instregex "CVTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTSD2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTSS2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTSS2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTTSD2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTTSS2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "CVTTSS2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTSS2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTSS2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTTSD2SIrr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTTSS2SI64rr")>;
-def: InstRW<[SBWriteResGroup32], (instregex "VCVTTSS2SIrr")>;
-
-def SBWriteResGroup33 : SchedWriteRes<[SBPort4,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup33], (instregex "MOV64mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOV8mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVAPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVAPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVDQAmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVDQUmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVHPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVHPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVLPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVLPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVNTDQmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVNTI_64mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVNTImr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVNTPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVNTPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVPDI2DImr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVPQI2QImr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVPQIto64mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVSSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVUPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "MOVUPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "PUSH64i8")>;
-def: InstRW<[SBWriteResGroup33], (instregex "PUSH64r")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VEXTRACTF128mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVAPDYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVAPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVAPSYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVAPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVDQAYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVDQAmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVDQUYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVDQUmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVHPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVHPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVLPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVLPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTDQYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTDQmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTPDYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTPSYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVNTPSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVPDI2DImr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVPQI2QImr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVPQIto64mr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVSDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVSSmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVUPDYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVUPDmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVUPSYmr")>;
-def: InstRW<[SBWriteResGroup33], (instregex "VMOVUPSmr")>;
-
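Records like SBWriteResGroup30 (one cycle on SBPort0) and the store group SBWriteResGroup33 just listed (one cycle each on SBPort4 and SBPort23) can be summed into a per-port busy count for a straight-line block. A small sketch of that bookkeeping, under the same bottleneck-port assumption as above:

    from collections import Counter

    def port_pressure(records):
        # Sum busy cycles per port over a list of (ports, cycles) pairs.
        pressure = Counter()
        for ports, cycles in records:
            for port, c in zip(ports, cycles):
                pressure[port] += c
        return pressure

    # One MULPSrr (SBWriteResGroup30) plus one MOVAPSmr (SBWriteResGroup33).
    block = [(["SBPort0"], [1]), (["SBPort4", "SBPort23"], [1, 1])]
    assert port_pressure(block) == Counter(
        {"SBPort0": 1, "SBPort4": 1, "SBPort23": 1})
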
-def SBWriteResGroup34 : SchedWriteRes<[SBPort0,SBPort15]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup34], (instregex "MPSADBWrri")>;
-def: InstRW<[SBWriteResGroup34], (instregex "VMPSADBWrri")>;
-
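Note that NumMicroOps and ResourceCycles are separate fields: the former counts decoded micro-ops, the latter says how long each listed port is held. In the groups shown here the two happen to line up (the sum of ResourceCycles equals NumMicroOps, as in SBWriteResGroup34 above with 3 micro-ops and [1,2] cycles on [SBPort0, SBPort15]), but nothing in the schema requires that, so treat the check below as an observation about this table only:

    # SBWriteResGroup34 above: NumMicroOps = 3, ResourceCycles = [1,2].
    num_micro_ops, resource_cycles = 3, [1, 2]
    assert sum(resource_cycles) == num_micro_ops
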
-def SBWriteResGroup35 : SchedWriteRes<[SBPort1,SBPort5]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup35], (instregex "CLI")>;
-def: InstRW<[SBWriteResGroup35], (instregex "CVTSI2SS64rr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "CVTSI2SSrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "HADDPDrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "HADDPSrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "HSUBPDrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "HSUBPSrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VCVTSI2SS64rr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VCVTSI2SSrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHADDPDrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHADDPSYrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHADDPSrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHSUBPDYrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHSUBPDrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHSUBPSYrr")>;
-def: InstRW<[SBWriteResGroup35], (instregex "VHSUBPSrr")>;
-
-def SBWriteResGroup36 : SchedWriteRes<[SBPort4,SBPort5,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup36], (instregex "CALL64r")>;
-def: InstRW<[SBWriteResGroup36], (instregex "EXTRACTPSmr")>;
-def: InstRW<[SBWriteResGroup36], (instregex "VEXTRACTPSmr")>;
-
-def SBWriteResGroup37 : SchedWriteRes<[SBPort4,SBPort01,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup37], (instregex "VMASKMOVPDYrm")>;
-def: InstRW<[SBWriteResGroup37], (instregex "VMASKMOVPDmr")>;
-def: InstRW<[SBWriteResGroup37], (instregex "VMASKMOVPSmr")>;
-
-def SBWriteResGroup38 : SchedWriteRes<[SBPort4,SBPort23,SBPort0]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup38], (instregex "SETAEm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETBm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETEm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETGEm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETGm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETLEm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETLm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETNEm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETNOm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETNPm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETNSm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETOm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETPm")>;
-def: InstRW<[SBWriteResGroup38], (instregex "SETSm")>;
-
-def SBWriteResGroup39 : SchedWriteRes<[SBPort4,SBPort23,SBPort15]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup39], (instregex "PEXTRBmr")>;
-def: InstRW<[SBWriteResGroup39], (instregex "VPEXTRBmr")>;
-def: InstRW<[SBWriteResGroup39], (instregex "VPEXTRDmr")>;
-def: InstRW<[SBWriteResGroup39], (instregex "VPEXTRWmr")>;
-
-def SBWriteResGroup40 : SchedWriteRes<[SBPort4,SBPort23,SBPort015]> {
- let Latency = 5;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup40], (instregex "MOV8mi")>;
-def: InstRW<[SBWriteResGroup40], (instregex "STOSB")>;
-def: InstRW<[SBWriteResGroup40], (instregex "STOSL")>;
-def: InstRW<[SBWriteResGroup40], (instregex "STOSQ")>;
-def: InstRW<[SBWriteResGroup40], (instregex "STOSW")>;
-
-def SBWriteResGroup41 : SchedWriteRes<[SBPort5,SBPort015]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,3];
-}
-def: InstRW<[SBWriteResGroup41], (instregex "FNINIT")>;
-
-def SBWriteResGroup42 : SchedWriteRes<[SBPort0,SBPort015]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,3];
-}
-def: InstRW<[SBWriteResGroup42], (instregex "CMPXCHG32rr")>;
-def: InstRW<[SBWriteResGroup42], (instregex "CMPXCHG8rr")>;
-
-def SBWriteResGroup43 : SchedWriteRes<[SBPort4,SBPort23,SBPort0]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
-}
-def: InstRW<[SBWriteResGroup43], (instregex "SETAm")>;
-def: InstRW<[SBWriteResGroup43], (instregex "SETBEm")>;
-
-def SBWriteResGroup44 : SchedWriteRes<[SBPort0,SBPort4,SBPort5,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,1,1];
-}
-def: InstRW<[SBWriteResGroup44], (instregex "LDMXCSR")>;
-def: InstRW<[SBWriteResGroup44], (instregex "STMXCSR")>;
-def: InstRW<[SBWriteResGroup44], (instregex "VLDMXCSR")>;
-def: InstRW<[SBWriteResGroup44], (instregex "VSTMXCSR")>;
-
-def SBWriteResGroup45 : SchedWriteRes<[SBPort0,SBPort4,SBPort23,SBPort15]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,1,1];
-}
-def: InstRW<[SBWriteResGroup45], (instregex "PEXTRDmr")>;
-def: InstRW<[SBWriteResGroup45], (instregex "PEXTRQmr")>;
-def: InstRW<[SBWriteResGroup45], (instregex "VPEXTRQmr")>;
-def: InstRW<[SBWriteResGroup45], (instregex "PUSHF16")>;
-def: InstRW<[SBWriteResGroup45], (instregex "PUSHF64")>;
-
-def SBWriteResGroup46 : SchedWriteRes<[SBPort4,SBPort5,SBPort01,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,1,1];
-}
-def: InstRW<[SBWriteResGroup46], (instregex "CLFLUSH")>;
-
-def SBWriteResGroup47 : SchedWriteRes<[SBPort4,SBPort5,SBPort01,SBPort23]> {
- let Latency = 5;
- let NumMicroOps = 5;
- let ResourceCycles = [1,2,1,1];
-}
-def: InstRW<[SBWriteResGroup47], (instregex "FXRSTOR")>;
-
-def SBWriteResGroup48 : SchedWriteRes<[SBPort23]> {
- let Latency = 6;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup48], (instregex "LDDQUrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MMX_MOVD64from64rm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOV64toPQIrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVAPDrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVAPSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVDDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVDI2PDIrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVDQArm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVDQUrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVNTDQArm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVSHDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVSLDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVSSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVUPDrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "MOVUPSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "POP64r")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VBROADCASTSSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VLDDQUYrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VLDDQUrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOV64toPQIrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVAPDrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVAPSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVDDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVDI2PDIrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVDQArm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVDQUrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVNTDQArm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVQI2PQIrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVSDrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVSHDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVSLDUPrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVSSrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVUPDrm")>;
-def: InstRW<[SBWriteResGroup48], (instregex "VMOVUPSrm")>;
-
-def SBWriteResGroup49 : SchedWriteRes<[SBPort5,SBPort23]> {
- let Latency = 6;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup49], (instregex "JMP64m")>;
-def: InstRW<[SBWriteResGroup49], (instregex "MOV64sm")>;
-
-def SBWriteResGroup50 : SchedWriteRes<[SBPort23,SBPort0]> {
- let Latency = 6;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup50], (instregex "BT64mi8")>;
-
-def SBWriteResGroup51 : SchedWriteRes<[SBPort23,SBPort15]> {
- let Latency = 6;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PABSBrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PABSDrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PABSWrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PALIGNR64irm")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PSHUFBrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PSIGNBrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PSIGNDrm64")>;
-def: InstRW<[SBWriteResGroup51], (instregex "MMX_PSIGNWrm64")>;
-
-def SBWriteResGroup52 : SchedWriteRes<[SBPort23,SBPort015]> {
- let Latency = 6;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup52], (instregex "ADD64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "ADD8rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "AND64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "AND8rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP64mi8")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP64mr")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP8mi")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP8mr")>;
-def: InstRW<[SBWriteResGroup52], (instregex "CMP8rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "LODSL")>;
-def: InstRW<[SBWriteResGroup52], (instregex "LODSQ")>;
-def: InstRW<[SBWriteResGroup52], (instregex "OR64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "OR8rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "SUB64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "SUB8rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "XOR64rm")>;
-def: InstRW<[SBWriteResGroup52], (instregex "XOR8rm")>;
-
-def SBWriteResGroup53 : SchedWriteRes<[SBPort4,SBPort23]> {
- let Latency = 6;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup53], (instregex "POP64rmm")>;
-def: InstRW<[SBWriteResGroup53], (instregex "PUSH64rmm")>;
-def: InstRW<[SBWriteResGroup53], (instregex "ST_F32m")>;
-def: InstRW<[SBWriteResGroup53], (instregex "ST_F64m")>;
-def: InstRW<[SBWriteResGroup53], (instregex "ST_FP32m")>;
-def: InstRW<[SBWriteResGroup53], (instregex "ST_FP64m")>;
-def: InstRW<[SBWriteResGroup53], (instregex "ST_FP80m")>;
-
-def SBWriteResGroup54 : SchedWriteRes<[SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup54], (instregex "VBROADCASTSDYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VBROADCASTSSrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVAPDYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVAPSYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVDDUPYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVDQAYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVDQUYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVSHDUPYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVSLDUPYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVUPDYrm")>;
-def: InstRW<[SBWriteResGroup54], (instregex "VMOVUPSYrm")>;
-
-def SBWriteResGroup55 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup55], (instregex "CVTPS2PDrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "CVTSS2SDrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "VCVTPS2PDYrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "VCVTPS2PDrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "VCVTSS2SDrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "VTESTPDrm")>;
-def: InstRW<[SBWriteResGroup55], (instregex "VTESTPSrm")>;
-
-def SBWriteResGroup56 : SchedWriteRes<[SBPort5,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup56], (instregex "ANDNPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "ANDNPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "ANDPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "ANDPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "INSERTPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "MOVHPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "MOVHPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "MOVLPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "MOVLPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "ORPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "ORPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "SHUFPDrmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "SHUFPSrmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "UNPCKHPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "UNPCKHPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "UNPCKLPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "UNPCKLPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VANDNPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VANDNPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VANDPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VANDPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VBROADCASTF128")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VINSERTPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VMOVHPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VMOVHPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VMOVLPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VMOVLPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VORPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VORPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VPERMILPDmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VPERMILPDri")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VPERMILPSmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VPERMILPSri")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VSHUFPDrmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VSHUFPSrmi")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VUNPCKHPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VUNPCKHPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VUNPCKLPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VUNPCKLPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VXORPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "VXORPSrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "XORPDrm")>;
-def: InstRW<[SBWriteResGroup56], (instregex "XORPSrm")>;
-
-def SBWriteResGroup57 : SchedWriteRes<[SBPort5,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup57], (instregex "AESDECLASTrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "AESDECrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "AESENCLASTrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "AESENCrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "KANDQrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "VAESDECLASTrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "VAESDECrr")>;
-def: InstRW<[SBWriteResGroup57], (instregex "VAESENCrr")>;
-
-def SBWriteResGroup58 : SchedWriteRes<[SBPort23,SBPort0]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup58], (instregex "BLENDPDrmi")>;
-def: InstRW<[SBWriteResGroup58], (instregex "BLENDPSrmi")>;
-def: InstRW<[SBWriteResGroup58], (instregex "VBLENDPDrmi")>;
-def: InstRW<[SBWriteResGroup58], (instregex "VBLENDPSrmi")>;
-def: InstRW<[SBWriteResGroup58], (instregex "VINSERTF128rm")>;
-
-def SBWriteResGroup59 : SchedWriteRes<[SBPort23,SBPort15]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup59], (instregex "MMX_PADDQirm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PABSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PABSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PABSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PACKSSDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PACKSSWBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PACKUSDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PACKUSWBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDUSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDUSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PADDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PALIGNRrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PAVGBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PAVGWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PBLENDWrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPEQBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPEQDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPEQQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPEQWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPGTBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPGTDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PCMPGTWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PINSRBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PINSRDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PINSRQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PINSRWrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXUBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXUDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMAXUWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINUBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINUDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMINUWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVSXWQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PMOVZXWQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSHUFBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSHUFDmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSHUFHWmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSHUFLWmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSIGNBrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSIGNDrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSIGNWrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBUSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBUSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PSUBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKHBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKHDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKHQDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKHWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKLBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKLDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKLQDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "PUNPCKLWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPABSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPABSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPABSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPACKSSDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPACKSSWBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPACKUSDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPACKUSWBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDUSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDUSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPADDWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPALIGNRrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPAVGBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPAVGWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPBLENDWrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPEQBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPEQDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPEQQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPEQWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPGTBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPGTDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPCMPGTWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPINSRBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPINSRDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPINSRQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPINSRWrmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXUBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXUDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMAXUWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINSDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINUBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINUDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMINUWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVSXWQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPMOVZXWQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSHUFBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSHUFDmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSHUFHWmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSHUFLWmi")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSIGNBrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSIGNDrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSIGNWrm128")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBUSBrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBUSWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPSUBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKHBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKHDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKHQDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKHWDrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKLBWrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKLDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKLQDQrm")>;
-def: InstRW<[SBWriteResGroup59], (instregex "VPUNPCKLWDrm")>;
-
-def SBWriteResGroup60 : SchedWriteRes<[SBPort23,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup60], (instregex "PANDNrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "PANDrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "PORrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "PXORrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "VPANDNrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "VPANDrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "VPORrm")>;
-def: InstRW<[SBWriteResGroup60], (instregex "VPXORrm")>;
-
-def SBWriteResGroup61 : SchedWriteRes<[SBPort0,SBPort0]> {
- let Latency = 7;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup61], (instregex "VRCPPSr")>;
-def: InstRW<[SBWriteResGroup61], (instregex "VRSQRTPSYr")>;
-
-def SBWriteResGroup62 : SchedWriteRes<[SBPort5,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup62], (instregex "VERRm")>;
-def: InstRW<[SBWriteResGroup62], (instregex "VERWm")>;
-
-def SBWriteResGroup63 : SchedWriteRes<[SBPort23,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup63], (instregex "LODSB")>;
-def: InstRW<[SBWriteResGroup63], (instregex "LODSW")>;
-
-def SBWriteResGroup64 : SchedWriteRes<[SBPort5,SBPort01,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup64], (instregex "FARJMP64")>;
-
-def SBWriteResGroup65 : SchedWriteRes<[SBPort23,SBPort0,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup65], (instregex "ADC64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "ADC8rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVAE64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVB64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVE64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVG64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVGE64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVL64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVLE64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVNE64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVNO64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVNP64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVNS64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVO64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVP64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "CMOVS64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "SBB64rm")>;
-def: InstRW<[SBWriteResGroup65], (instregex "SBB8rm")>;
-
-def SBWriteResGroup66 : SchedWriteRes<[SBPort0,SBPort4,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
-}
-def: InstRW<[SBWriteResGroup66], (instregex "FNSTSWm")>;
-
-def SBWriteResGroup67 : SchedWriteRes<[SBPort1,SBPort5,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup67], (instregex "SLDT32r")>;
-def: InstRW<[SBWriteResGroup67], (instregex "STR32r")>;
-
-def SBWriteResGroup68 : SchedWriteRes<[SBPort4,SBPort5,SBPort23]> {
- let Latency = 7;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
-}
-def: InstRW<[SBWriteResGroup68], (instregex "CALL64m")>;
-def: InstRW<[SBWriteResGroup68], (instregex "FNSTCW16m")>;
-
-def SBWriteResGroup69 : SchedWriteRes<[SBPort4,SBPort23,SBPort0]> {
- let Latency = 7;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup69], (instregex "BTC64mi8")>;
-def: InstRW<[SBWriteResGroup69], (instregex "BTR64mi8")>;
-def: InstRW<[SBWriteResGroup69], (instregex "BTS64mi8")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SAR64mi")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SAR8mi")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHL64m1")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHL64mi")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHL8m1")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHL8mi")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHR64mi")>;
-def: InstRW<[SBWriteResGroup69], (instregex "SHR8mi")>;
-
-def SBWriteResGroup70 : SchedWriteRes<[SBPort4,SBPort23,SBPort015]> {
- let Latency = 7;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup70], (instregex "ADD64mi8")>;
-def: InstRW<[SBWriteResGroup70], (instregex "ADD64mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "ADD8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "ADD8mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "AND64mi8")>;
-def: InstRW<[SBWriteResGroup70], (instregex "AND64mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "AND8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "AND8mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "DEC64m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "DEC8m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "INC64m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "INC8m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "NEG64m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "NEG8m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "NOT64m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "NOT8m")>;
-def: InstRW<[SBWriteResGroup70], (instregex "OR64mi8")>;
-def: InstRW<[SBWriteResGroup70], (instregex "OR64mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "OR8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "OR8mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "SUB64mi8")>;
-def: InstRW<[SBWriteResGroup70], (instregex "SUB64mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "SUB8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "SUB8mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "TEST64rm")>;
-def: InstRW<[SBWriteResGroup70], (instregex "TEST8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "TEST8rm")>;
-def: InstRW<[SBWriteResGroup70], (instregex "XOR64mi8")>;
-def: InstRW<[SBWriteResGroup70], (instregex "XOR64mr")>;
-def: InstRW<[SBWriteResGroup70], (instregex "XOR8mi")>;
-def: InstRW<[SBWriteResGroup70], (instregex "XOR8mr")>;
-
-def SBWriteResGroup71 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup71], (instregex "MMX_PMADDUBSWrm64")>;
-def: InstRW<[SBWriteResGroup71], (instregex "MMX_PMULHRSWrm64")>;
-def: InstRW<[SBWriteResGroup71], (instregex "VTESTPDYrm")>;
-def: InstRW<[SBWriteResGroup71], (instregex "VTESTPSYrm")>;
-
-def SBWriteResGroup72 : SchedWriteRes<[SBPort1,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup72], (instregex "BSF64rm")>;
-def: InstRW<[SBWriteResGroup72], (instregex "BSR64rm")>;
-def: InstRW<[SBWriteResGroup72], (instregex "CRC32r32m16")>;
-def: InstRW<[SBWriteResGroup72], (instregex "CRC32r32m8")>;
-def: InstRW<[SBWriteResGroup72], (instregex "FCOM32m")>;
-def: InstRW<[SBWriteResGroup72], (instregex "FCOM64m")>;
-def: InstRW<[SBWriteResGroup72], (instregex "FCOMP32m")>;
-def: InstRW<[SBWriteResGroup72], (instregex "FCOMP64m")>;
-def: InstRW<[SBWriteResGroup72], (instregex "MUL8m")>;
-
-def SBWriteResGroup73 : SchedWriteRes<[SBPort5,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup73], (instregex "VANDNPDYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VANDNPSYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VANDPDrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VANDPSrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VORPDYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VORPSYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VPERM2F128rm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VPERMILPDYri")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VPERMILPDmi")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VPERMILPSYri")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VPERMILPSmi")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VSHUFPDYrmi")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VSHUFPSYrmi")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VUNPCKHPDrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VUNPCKHPSrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VUNPCKLPDYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VUNPCKLPSYrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VXORPDrm")>;
-def: InstRW<[SBWriteResGroup73], (instregex "VXORPSrm")>;
-
-def SBWriteResGroup74 : SchedWriteRes<[SBPort23,SBPort0]> {
- let Latency = 8;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup74], (instregex "VBLENDPDYrmi")>;
-def: InstRW<[SBWriteResGroup74], (instregex "VBLENDPSYrmi")>;
-
-def SBWriteResGroup75 : SchedWriteRes<[SBPort23,SBPort0]> {
- let Latency = 8;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup75], (instregex "BLENDVPDrm0")>;
-def: InstRW<[SBWriteResGroup75], (instregex "BLENDVPSrm0")>;
-def: InstRW<[SBWriteResGroup75], (instregex "VBLENDVPDrm")>;
-def: InstRW<[SBWriteResGroup75], (instregex "VBLENDVPSrm")>;
-def: InstRW<[SBWriteResGroup75], (instregex "VMASKMOVPDrm")>;
-def: InstRW<[SBWriteResGroup75], (instregex "VMASKMOVPSrm")>;
-
-def SBWriteResGroup76 : SchedWriteRes<[SBPort23,SBPort15]> {
- let Latency = 8;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup76], (instregex "PBLENDVBrr0")>;
-def: InstRW<[SBWriteResGroup76], (instregex "VPBLENDVBrm")>;
-
-def SBWriteResGroup77 : SchedWriteRes<[SBPort0,SBPort1,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup77], (instregex "COMISDrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "COMISSrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "UCOMISDrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "UCOMISSrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "VCOMISDrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "VCOMISSrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "VUCOMISDrm")>;
-def: InstRW<[SBWriteResGroup77], (instregex "VUCOMISSrm")>;
-
-def SBWriteResGroup78 : SchedWriteRes<[SBPort0,SBPort5,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup78], (instregex "PTESTrm")>;
-def: InstRW<[SBWriteResGroup78], (instregex "VPTESTrm")>;
-
-def SBWriteResGroup79 : SchedWriteRes<[SBPort0,SBPort23,SBPort15]> {
- let Latency = 8;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup79], (instregex "PSLLDrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSLLQrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSLLWrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSRADrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSRAWrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSRLDrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSRLQrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "PSRLWrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSLLDri")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSLLQri")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSLLWri")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSRADrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSRAWrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSRLDrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSRLQrm")>;
-def: InstRW<[SBWriteResGroup79], (instregex "VPSRLWrm")>;
-
-def SBWriteResGroup80 : SchedWriteRes<[SBPort23,SBPort15]> {
- let Latency = 8;
- let NumMicroOps = 4;
- let ResourceCycles = [1,3];
-}
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHADDSWrm64")>;
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHADDWrm64")>;
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHADDrm64")>;
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHSUBDrm64")>;
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHSUBSWrm64")>;
-def: InstRW<[SBWriteResGroup80], (instregex "MMX_PHSUBWrm64")>;
-
-def SBWriteResGroup81 : SchedWriteRes<[SBPort23,SBPort015]> {
- let Latency = 8;
- let NumMicroOps = 4;
- let ResourceCycles = [1,3];
-}
-def: InstRW<[SBWriteResGroup81], (instregex "CMPXCHG64rm")>;
-def: InstRW<[SBWriteResGroup81], (instregex "CMPXCHG8rm")>;
-
-def SBWriteResGroup82 : SchedWriteRes<[SBPort23,SBPort0,SBPort015]> {
- let Latency = 8;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup82], (instregex "CMOVA64rm")>;
-def: InstRW<[SBWriteResGroup82], (instregex "CMOVBE64rm")>;
-
-def SBWriteResGroup83 : SchedWriteRes<[SBPort23,SBPort015]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [2,3];
-}
-def: InstRW<[SBWriteResGroup83], (instregex "CMPSB")>;
-def: InstRW<[SBWriteResGroup83], (instregex "CMPSL")>;
-def: InstRW<[SBWriteResGroup83], (instregex "CMPSQ")>;
-def: InstRW<[SBWriteResGroup83], (instregex "CMPSW")>;
-
-def SBWriteResGroup84 : SchedWriteRes<[SBPort4,SBPort5,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [1,2,2];
-}
-def: InstRW<[SBWriteResGroup84], (instregex "FLDCW16m")>;
-
-def SBWriteResGroup85 : SchedWriteRes<[SBPort4,SBPort23,SBPort0]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [1,2,2];
-}
-def: InstRW<[SBWriteResGroup85], (instregex "ROL64mi")>;
-def: InstRW<[SBWriteResGroup85], (instregex "ROL8mi")>;
-def: InstRW<[SBWriteResGroup85], (instregex "ROR64mi")>;
-def: InstRW<[SBWriteResGroup85], (instregex "ROR8mi")>;
-
-def SBWriteResGroup86 : SchedWriteRes<[SBPort4,SBPort23,SBPort015]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [1,2,2];
-}
-def: InstRW<[SBWriteResGroup86], (instregex "MOVSB")>;
-def: InstRW<[SBWriteResGroup86], (instregex "MOVSL")>;
-def: InstRW<[SBWriteResGroup86], (instregex "MOVSQ")>;
-def: InstRW<[SBWriteResGroup86], (instregex "MOVSW")>;
-def: InstRW<[SBWriteResGroup86], (instregex "XADD64rm")>;
-def: InstRW<[SBWriteResGroup86], (instregex "XADD8rm")>;
-
-def SBWriteResGroup87 : SchedWriteRes<[SBPort4,SBPort5,SBPort01,SBPort23]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [1,1,1,2];
-}
-def: InstRW<[SBWriteResGroup87], (instregex "FARCALL64")>;
-
-def SBWriteResGroup88 : SchedWriteRes<[SBPort4,SBPort23,SBPort0,SBPort015]> {
- let Latency = 8;
- let NumMicroOps = 5;
- let ResourceCycles = [1,2,1,1];
-}
-def: InstRW<[SBWriteResGroup88], (instregex "SHLD64mri8")>;
-def: InstRW<[SBWriteResGroup88], (instregex "SHRD64mri8")>;
-
-def SBWriteResGroup89 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup89], (instregex "MMX_PMULUDQirm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMADDUBSWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMADDWDrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULDQrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULHRSWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULHUWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULHWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULLDrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULLWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PMULUDQrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "PSADBWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMADDUBSWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMADDWDrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULDQrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULHRSWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULHUWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULHWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULLDrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULLWrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPMULUDQrm")>;
-def: InstRW<[SBWriteResGroup89], (instregex "VPSADBWrm")>;
-
-def SBWriteResGroup90 : SchedWriteRes<[SBPort1,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup90], (instregex "ADDPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ADDPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ADDSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ADDSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ADDSUBPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ADDSUBPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CMPPDrmi")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CMPPSrmi")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CMPSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CVTDQ2PSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CVTPS2DQrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CVTSI2SD64rm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CVTSI2SDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "CVTTPS2DQrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MAXPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MAXPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MAXSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MAXSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MINPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MINPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MINSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MINSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MMX_CVTPI2PSirm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MMX_CVTPS2PIirm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "MMX_CVTTPS2PIirm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "POPCNT64rm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ROUNDPDm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ROUNDPSm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ROUNDSDm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "ROUNDSSm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "SUBPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "SUBPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "SUBSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "SUBSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDSUBPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VADDSUBPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCMPPDrmi")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCMPPSrmi")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCMPSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCMPSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCVTDQ2PSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCVTPS2DQrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCVTSI2SD64rm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCVTSI2SDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VCVTTPS2DQrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMAXPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMAXPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMAXSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMAXSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMINPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMINPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMINSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VMINSSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VROUNDPDm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VROUNDPSm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VROUNDSDm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VROUNDSSm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VSUBPDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VSUBPSrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VSUBSDrm")>;
-def: InstRW<[SBWriteResGroup90], (instregex "VSUBSSrm")>;
-
-def SBWriteResGroup91 : SchedWriteRes<[SBPort23,SBPort0]> {
- let Latency = 9;
- let NumMicroOps = 3;
- let ResourceCycles = [1,2];
-}
-def: InstRW<[SBWriteResGroup91], (instregex "VBLENDVPDYrm")>;
-def: InstRW<[SBWriteResGroup91], (instregex "VBLENDVPSYrm")>;
-def: InstRW<[SBWriteResGroup91], (instregex "VMASKMOVPDrm")>;
-def: InstRW<[SBWriteResGroup91], (instregex "VMASKMOVPSrm")>;
-
-def SBWriteResGroup92 : SchedWriteRes<[SBPort0,SBPort1,SBPort5]> {
- let Latency = 9;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup92], (instregex "DPPDrri")>;
-def: InstRW<[SBWriteResGroup92], (instregex "VDPPDrri")>;
-
-def SBWriteResGroup93 : SchedWriteRes<[SBPort0,SBPort1,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup93], (instregex "CVTSD2SI64rm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTSD2SIrm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTSS2SI64rm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTSS2SIrm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTTSD2SI64rm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTTSD2SIrm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTTSS2SI64rm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "CVTTSS2SIrm")>;
-def: InstRW<[SBWriteResGroup93], (instregex "MUL64m")>;
-
-def SBWriteResGroup94 : SchedWriteRes<[SBPort0,SBPort5,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup94], (instregex "VPTESTYrm")>;
-
-def SBWriteResGroup95 : SchedWriteRes<[SBPort5,SBPort01,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup95], (instregex "LD_F32m")>;
-def: InstRW<[SBWriteResGroup95], (instregex "LD_F64m")>;
-def: InstRW<[SBWriteResGroup95], (instregex "LD_F80m")>;
-
-def SBWriteResGroup96 : SchedWriteRes<[SBPort23,SBPort15]> {
- let Latency = 9;
- let NumMicroOps = 4;
- let ResourceCycles = [1,3];
-}
-def: InstRW<[SBWriteResGroup96], (instregex "PHADDDrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "PHADDSWrm128")>;
-def: InstRW<[SBWriteResGroup96], (instregex "PHADDWrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "PHSUBDrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "PHSUBSWrm128")>;
-def: InstRW<[SBWriteResGroup96], (instregex "PHSUBWrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHADDDrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHADDSWrm128")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHADDWrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHSUBDrm")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHSUBSWrm128")>;
-def: InstRW<[SBWriteResGroup96], (instregex "VPHSUBWrm")>;
-
-def SBWriteResGroup97 : SchedWriteRes<[SBPort1,SBPort4,SBPort23]> {
- let Latency = 9;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
-}
-def: InstRW<[SBWriteResGroup97], (instregex "IST_F16m")>;
-def: InstRW<[SBWriteResGroup97], (instregex "IST_F32m")>;
-def: InstRW<[SBWriteResGroup97], (instregex "IST_FP16m")>;
-def: InstRW<[SBWriteResGroup97], (instregex "IST_FP32m")>;
-def: InstRW<[SBWriteResGroup97], (instregex "IST_FP64m")>;
-def: InstRW<[SBWriteResGroup97], (instregex "SHL64mCL")>;
-def: InstRW<[SBWriteResGroup97], (instregex "SHL8mCL")>;
-
-def SBWriteResGroup98 : SchedWriteRes<[SBPort4,SBPort23,SBPort015]> {
- let Latency = 9;
- let NumMicroOps = 6;
- let ResourceCycles = [1,2,3];
-}
-def: InstRW<[SBWriteResGroup98], (instregex "ADC64mi8")>;
-def: InstRW<[SBWriteResGroup98], (instregex "ADC8mi")>;
-def: InstRW<[SBWriteResGroup98], (instregex "SBB64mi8")>;
-def: InstRW<[SBWriteResGroup98], (instregex "SBB8mi")>;
-
-def SBWriteResGroup99 : SchedWriteRes<[SBPort4,SBPort23,SBPort0,SBPort015]> {
- let Latency = 9;
- let NumMicroOps = 6;
- let ResourceCycles = [1,2,2,1];
-}
-def: InstRW<[SBWriteResGroup99], (instregex "ADC64mr")>;
-def: InstRW<[SBWriteResGroup99], (instregex "ADC8mr")>;
-def: InstRW<[SBWriteResGroup99], (instregex "SBB64mr")>;
-def: InstRW<[SBWriteResGroup99], (instregex "SBB8mr")>;
-
-def SBWriteResGroup100 : SchedWriteRes<[SBPort4,SBPort5,SBPort23,SBPort0,SBPort015]> {
- let Latency = 9;
- let NumMicroOps = 6;
- let ResourceCycles = [1,1,2,1,1];
-}
-def: InstRW<[SBWriteResGroup100], (instregex "BT64mr")>;
-def: InstRW<[SBWriteResGroup100], (instregex "BTC64mr")>;
-def: InstRW<[SBWriteResGroup100], (instregex "BTR64mr")>;
-def: InstRW<[SBWriteResGroup100], (instregex "BTS64mr")>;
-
-def SBWriteResGroup101 : SchedWriteRes<[SBPort1,SBPort23]> {
- let Latency = 10;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup101], (instregex "ADD_F32m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "ADD_F64m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "ILD_F16m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "ILD_F32m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "ILD_F64m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "SUBR_F32m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "SUBR_F64m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "SUB_F32m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "SUB_F64m")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VADDPDYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VADDPSYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VADDSUBPDYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VADDSUBPSYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VCMPPDYrmi")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VCMPPSYrmi")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VCVTDQ2PSYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VCVTPS2DQYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VCVTTPS2DQrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VMAXPDYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VMAXPSYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VMINPDrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VMINPSrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VROUNDPDm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VROUNDPSm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VSUBPDYrm")>;
-def: InstRW<[SBWriteResGroup101], (instregex "VSUBPSYrm")>;
-
-def SBWriteResGroup102 : SchedWriteRes<[SBPort0,SBPort1,SBPort23]> {
- let Latency = 10;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTSD2SI64rm")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTSS2SI64rm")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTSS2SIrm")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTTSD2SI64rm")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTTSD2SI64rr")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTTSS2SI64rm")>;
-def: InstRW<[SBWriteResGroup102], (instregex "VCVTTSS2SIrm")>;
-
-def SBWriteResGroup103 : SchedWriteRes<[SBPort1,SBPort5,SBPort23]> {
- let Latency = 10;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup103], (instregex "CVTDQ2PDrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTPD2DQrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTPD2PSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTSD2SSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTSI2SS64rm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTSI2SSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "CVTTPD2DQrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "MMX_CVTPD2PIirm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "MMX_CVTPI2PDirm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "MMX_CVTTPD2PIirm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTDQ2PDYrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTDQ2PDrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTPD2DQrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTPD2PSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTSD2SSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTSI2SS64rm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTSI2SSrm")>;
-def: InstRW<[SBWriteResGroup103], (instregex "VCVTTPD2DQrm")>;
-
-def SBWriteResGroup104 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 11;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup104], (instregex "MULPDrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "MULPSrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "MULSDrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "MULSSrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "PCMPGTQrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "PHMINPOSUWrm128")>;
-def: InstRW<[SBWriteResGroup104], (instregex "RCPPSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "RCPSSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "RSQRTPSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "RSQRTSSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VMULPDrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VMULPSrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VMULSDrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VMULSSrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VPCMPGTQrm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VPHMINPOSUWrm128")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VRCPPSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VRCPSSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VRSQRTPSm")>;
-def: InstRW<[SBWriteResGroup104], (instregex "VRSQRTSSm")>;
-
-def SBWriteResGroup105 : SchedWriteRes<[SBPort0]> {
- let Latency = 11;
- let NumMicroOps = 3;
- let ResourceCycles = [3];
-}
-def: InstRW<[SBWriteResGroup105], (instregex "PCMPISTRIrr")>;
-def: InstRW<[SBWriteResGroup105], (instregex "PCMPISTRM128rr")>;
-def: InstRW<[SBWriteResGroup105], (instregex "VPCMPISTRIrr")>;
-def: InstRW<[SBWriteResGroup105], (instregex "VPCMPISTRM128rr")>;
-
-def SBWriteResGroup106 : SchedWriteRes<[SBPort1,SBPort23]> {
- let Latency = 11;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup106], (instregex "FICOM16m")>;
-def: InstRW<[SBWriteResGroup106], (instregex "FICOM32m")>;
-def: InstRW<[SBWriteResGroup106], (instregex "FICOMP16m")>;
-def: InstRW<[SBWriteResGroup106], (instregex "FICOMP32m")>;
-
-def SBWriteResGroup107 : SchedWriteRes<[SBPort1,SBPort5,SBPort23]> {
- let Latency = 11;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup107], (instregex "VCVTPD2DQYrm")>;
-def: InstRW<[SBWriteResGroup107], (instregex "VCVTPD2PSYrm")>;
-def: InstRW<[SBWriteResGroup107], (instregex "VCVTTPD2DQYrm")>;
-
-def SBWriteResGroup108 : SchedWriteRes<[SBPort0,SBPort23,SBPort15]> {
- let Latency = 11;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,2];
-}
-def: InstRW<[SBWriteResGroup108], (instregex "MPSADBWrmi")>;
-def: InstRW<[SBWriteResGroup108], (instregex "VMPSADBWrmi")>;
-
-def SBWriteResGroup109 : SchedWriteRes<[SBPort1,SBPort5,SBPort23]> {
- let Latency = 11;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup109], (instregex "HADDPDrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "HADDPSrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "HSUBPDrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "HSUBPSrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "VHADDPDrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "VHADDPSrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "VHSUBPDrm")>;
-def: InstRW<[SBWriteResGroup109], (instregex "VHSUBPSrm")>;
-
-def SBWriteResGroup110 : SchedWriteRes<[SBPort5]> {
- let Latency = 12;
- let NumMicroOps = 2;
- let ResourceCycles = [2];
-}
-def: InstRW<[SBWriteResGroup110], (instregex "AESIMCrr")>;
-def: InstRW<[SBWriteResGroup110], (instregex "VAESIMCrr")>;
-
-def SBWriteResGroup111 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 12;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup111], (instregex "MUL_F32m")>;
-def: InstRW<[SBWriteResGroup111], (instregex "MUL_F64m")>;
-def: InstRW<[SBWriteResGroup111], (instregex "VMULPDYrm")>;
-def: InstRW<[SBWriteResGroup111], (instregex "VMULPSYrm")>;
-
-def SBWriteResGroup112 : SchedWriteRes<[SBPort0,SBPort1,SBPort5]> {
- let Latency = 12;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup112], (instregex "DPPSrri")>;
-def: InstRW<[SBWriteResGroup112], (instregex "VDPPSYrri")>;
-def: InstRW<[SBWriteResGroup112], (instregex "VDPPSrri")>;
-
-def SBWriteResGroup113 : SchedWriteRes<[SBPort1,SBPort5,SBPort23]> {
- let Latency = 12;
- let NumMicroOps = 4;
- let ResourceCycles = [1,2,1];
-}
-def: InstRW<[SBWriteResGroup113], (instregex "VHADDPDrm")>;
-def: InstRW<[SBWriteResGroup113], (instregex "VHADDPSYrm")>;
-def: InstRW<[SBWriteResGroup113], (instregex "VHSUBPDYrm")>;
-def: InstRW<[SBWriteResGroup113], (instregex "VHSUBPSYrm")>;
-
-def SBWriteResGroup114 : SchedWriteRes<[SBPort1,SBPort23]> {
- let Latency = 13;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup114], (instregex "ADD_FI16m")>;
-def: InstRW<[SBWriteResGroup114], (instregex "ADD_FI32m")>;
-def: InstRW<[SBWriteResGroup114], (instregex "SUBR_FI16m")>;
-def: InstRW<[SBWriteResGroup114], (instregex "SUBR_FI32m")>;
-def: InstRW<[SBWriteResGroup114], (instregex "SUB_FI16m")>;
-def: InstRW<[SBWriteResGroup114], (instregex "SUB_FI32m")>;
-
-def SBWriteResGroup115 : SchedWriteRes<[SBPort5,SBPort23,SBPort015]> {
- let Latency = 13;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup115], (instregex "AESDECLASTrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "AESDECrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "AESENCLASTrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "AESENCrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "VAESDECLASTrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "VAESDECrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "VAESENCLASTrm")>;
-def: InstRW<[SBWriteResGroup115], (instregex "VAESENCrm")>;
-
-def SBWriteResGroup116 : SchedWriteRes<[SBPort0]> {
- let Latency = 14;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup116], (instregex "DIVPSrr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "DIVSSrr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "SQRTPSr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "SQRTSSr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "VDIVPSrr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "VDIVSSrr")>;
-def: InstRW<[SBWriteResGroup116], (instregex "VSQRTPSr")>;
-
-def SBWriteResGroup117 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 14;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup117], (instregex "VSQRTSSm")>;
-
-def SBWriteResGroup118 : SchedWriteRes<[SBPort0,SBPort23,SBPort0]> {
- let Latency = 14;
- let NumMicroOps = 4;
- let ResourceCycles = [2,1,1];
-}
-def: InstRW<[SBWriteResGroup118], (instregex "VRCPPSm")>;
-def: InstRW<[SBWriteResGroup118], (instregex "VRSQRTPSYm")>;
-
-def SBWriteResGroup119 : SchedWriteRes<[SBPort0,SBPort1,SBPort23]> {
- let Latency = 15;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup119], (instregex "MUL_FI16m")>;
-def: InstRW<[SBWriteResGroup119], (instregex "MUL_FI32m")>;
-
-def SBWriteResGroup120 : SchedWriteRes<[SBPort0,SBPort1,SBPort5,SBPort23]> {
- let Latency = 15;
- let NumMicroOps = 4;
- let ResourceCycles = [1,1,1,1];
-}
-def: InstRW<[SBWriteResGroup120], (instregex "DPPDrmi")>;
-def: InstRW<[SBWriteResGroup120], (instregex "VDPPDrmi")>;
-
-def SBWriteResGroup121 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 17;
- let NumMicroOps = 4;
- let ResourceCycles = [3,1];
-}
-def: InstRW<[SBWriteResGroup121], (instregex "PCMPISTRIrm")>;
-def: InstRW<[SBWriteResGroup121], (instregex "PCMPISTRM128rm")>;
-def: InstRW<[SBWriteResGroup121], (instregex "VPCMPISTRIrm")>;
-def: InstRW<[SBWriteResGroup121], (instregex "VPCMPISTRM128rm")>;
-
-def SBWriteResGroup122 : SchedWriteRes<[SBPort5,SBPort23]> {
- let Latency = 18;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup122], (instregex "AESIMCrm")>;
-def: InstRW<[SBWriteResGroup122], (instregex "VAESIMCrm")>;
-
-def SBWriteResGroup123 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 20;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup123], (instregex "DIVPSrm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "DIVSSrm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "SQRTPSm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "SQRTSSm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "VDIVPSrm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "VDIVSSrm")>;
-def: InstRW<[SBWriteResGroup123], (instregex "VSQRTPSm")>;
-
-def SBWriteResGroup124 : SchedWriteRes<[SBPort0]> {
- let Latency = 21;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup124], (instregex "VSQRTSDr")>;
-
-def SBWriteResGroup125 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 21;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup125], (instregex "VSQRTSDm")>;
-
-def SBWriteResGroup126 : SchedWriteRes<[SBPort0]> {
- let Latency = 22;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup126], (instregex "DIVPDrr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "DIVSDrr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "SQRTPDr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "SQRTSDr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "VDIVPDrr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "VDIVSDrr")>;
-def: InstRW<[SBWriteResGroup126], (instregex "VSQRTPDr")>;
-
-def SBWriteResGroup127 : SchedWriteRes<[SBPort0]> {
- let Latency = 24;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup127], (instregex "DIVR_FPrST0")>;
-def: InstRW<[SBWriteResGroup127], (instregex "DIVR_FST0r")>;
-def: InstRW<[SBWriteResGroup127], (instregex "DIVR_FrST0")>;
-def: InstRW<[SBWriteResGroup127], (instregex "DIV_FPrST0")>;
-def: InstRW<[SBWriteResGroup127], (instregex "DIV_FST0r")>;
-def: InstRW<[SBWriteResGroup127], (instregex "DIV_FrST0")>;
-
-def SBWriteResGroup128 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 28;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup128], (instregex "DIVPDrm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "DIVSDrm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "SQRTPDm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "SQRTSDm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "VDIVPDrm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "VDIVSDrm")>;
-def: InstRW<[SBWriteResGroup128], (instregex "VSQRTPDm")>;
-
-def SBWriteResGroup129 : SchedWriteRes<[SBPort0,SBPort0]> {
- let Latency = 29;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup129], (instregex "VDIVPSYrr")>;
-def: InstRW<[SBWriteResGroup129], (instregex "VSQRTPSYr")>;
-
-def SBWriteResGroup130 : SchedWriteRes<[SBPort0,SBPort23]> {
- let Latency = 31;
- let NumMicroOps = 2;
- let ResourceCycles = [1,1];
-}
-def: InstRW<[SBWriteResGroup130], (instregex "DIVR_F32m")>;
-def: InstRW<[SBWriteResGroup130], (instregex "DIVR_F64m")>;
-def: InstRW<[SBWriteResGroup130], (instregex "DIV_F32m")>;
-def: InstRW<[SBWriteResGroup130], (instregex "DIV_F64m")>;
-
-def SBWriteResGroup131 : SchedWriteRes<[SBPort0,SBPort1,SBPort23]> {
- let Latency = 34;
- let NumMicroOps = 3;
- let ResourceCycles = [1,1,1];
-}
-def: InstRW<[SBWriteResGroup131], (instregex "DIVR_FI16m")>;
-def: InstRW<[SBWriteResGroup131], (instregex "DIVR_FI32m")>;
-def: InstRW<[SBWriteResGroup131], (instregex "DIV_FI16m")>;
-def: InstRW<[SBWriteResGroup131], (instregex "DIV_FI32m")>;
-
-def SBWriteResGroup132 : SchedWriteRes<[SBPort0,SBPort23,SBPort0]> {
- let Latency = 36;
- let NumMicroOps = 4;
- let ResourceCycles = [2,1,1];
-}
-def: InstRW<[SBWriteResGroup132], (instregex "VDIVPSYrm")>;
-def: InstRW<[SBWriteResGroup132], (instregex "VSQRTPSYm")>;
-
-def SBWriteResGroup133 : SchedWriteRes<[SBPort0,SBPort0]> {
- let Latency = 45;
- let NumMicroOps = 3;
- let ResourceCycles = [2,1];
-}
-def: InstRW<[SBWriteResGroup133], (instregex "VDIVPDYrr")>;
-def: InstRW<[SBWriteResGroup133], (instregex "VSQRTPDYr")>;
-
-def SBWriteResGroup134 : SchedWriteRes<[SBPort0,SBPort23,SBPort0]> {
- let Latency = 52;
- let NumMicroOps = 4;
- let ResourceCycles = [2,1,1];
-}
-def: InstRW<[SBWriteResGroup134], (instregex "VDIVPDYrm")>;
-def: InstRW<[SBWriteResGroup134], (instregex "VSQRTPDYm")>;
-
-def SBWriteResGroup135 : SchedWriteRes<[SBPort0]> {
- let Latency = 114;
- let NumMicroOps = 1;
- let ResourceCycles = [1];
-}
-def: InstRW<[SBWriteResGroup135], (instregex "VSQRTSSr")>;
-
} // SchedModel
diff --git a/lib/ToolDrivers/llvm-dlltool/DlltoolDriver.cpp b/lib/ToolDrivers/llvm-dlltool/DlltoolDriver.cpp
index fc15dc1e6032..4820b9f7de58 100644
--- a/lib/ToolDrivers/llvm-dlltool/DlltoolDriver.cpp
+++ b/lib/ToolDrivers/llvm-dlltool/DlltoolDriver.cpp
@@ -1,167 +1,183 @@
//===- DlltoolDriver.cpp - dlltool.exe-compatible driver ------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// Defines an interface to a dlltool.exe-compatible driver.
//
//===----------------------------------------------------------------------===//
#include "llvm/ToolDrivers/llvm-dlltool/DlltoolDriver.h"
#include "llvm/Object/ArchiveWriter.h"
#include "llvm/Object/COFF.h"
#include "llvm/Object/COFFImportFile.h"
#include "llvm/Object/COFFModuleDefinition.h"
#include "llvm/Option/Arg.h"
#include "llvm/Option/ArgList.h"
#include "llvm/Option/Option.h"
#include "llvm/Support/Path.h"
#include <string>
#include <vector>
using namespace llvm;
using namespace llvm::object;
using namespace llvm::COFF;
namespace {
enum {
OPT_INVALID = 0,
#define OPTION(_1, _2, ID, _4, _5, _6, _7, _8, _9, _10, _11, _12) OPT_##ID,
#include "Options.inc"
#undef OPTION
};
#define PREFIX(NAME, VALUE) const char *const NAME[] = VALUE;
#include "Options.inc"
#undef PREFIX
static const llvm::opt::OptTable::Info infoTable[] = {
#define OPTION(X1, X2, ID, KIND, GROUP, ALIAS, X7, X8, X9, X10, X11, X12) \
{X1, X2, X10, X11, OPT_##ID, llvm::opt::Option::KIND##Class, \
X9, X8, OPT_##GROUP, OPT_##ALIAS, X7, X12},
#include "Options.inc"
#undef OPTION
};
class DllOptTable : public llvm::opt::OptTable {
public:
DllOptTable() : OptTable(infoTable, false) {}
};
} // namespace
std::vector<std::unique_ptr<MemoryBuffer>> OwningMBs;
// Opens a file. Path has to be resolved already.
// Newly created memory buffers are owned by this driver.
Optional<MemoryBufferRef> openFile(StringRef Path) {
ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> MB = MemoryBuffer::getFile(Path);
if (std::error_code EC = MB.getError()) {
llvm::errs() << "fail openFile: " << EC.message() << "\n";
return None;
}
MemoryBufferRef MBRef = MB.get()->getMemBufferRef();
OwningMBs.push_back(std::move(MB.get())); // take ownership
return MBRef;
}
static MachineTypes getEmulation(StringRef S) {
return StringSwitch<MachineTypes>(S)
.Case("i386", IMAGE_FILE_MACHINE_I386)
.Case("i386:x86-64", IMAGE_FILE_MACHINE_AMD64)
.Case("arm", IMAGE_FILE_MACHINE_ARMNT)
.Default(IMAGE_FILE_MACHINE_UNKNOWN);
}
static std::string getImplibPath(std::string Path) {
SmallString<128> Out = StringRef("lib");
Out.append(Path);
sys::path::replace_extension(Out, ".a");
return Out.str();
}
int llvm::dlltoolDriverMain(llvm::ArrayRef<const char *> ArgsArr) {
DllOptTable Table;
unsigned MissingIndex;
unsigned MissingCount;
llvm::opt::InputArgList Args =
Table.ParseArgs(ArgsArr.slice(1), MissingIndex, MissingCount);
if (MissingCount) {
llvm::errs() << Args.getArgString(MissingIndex) << ": missing argument\n";
return 1;
}
// Handle when no input or output is specified
if (Args.hasArgNoClaim(OPT_INPUT) ||
(!Args.hasArgNoClaim(OPT_d) && !Args.hasArgNoClaim(OPT_l))) {
Table.PrintHelp(outs(), ArgsArr[0], "dlltool", false);
llvm::outs() << "\nTARGETS: i386, i386:x86-64, arm\n";
return 1;
}
if (!Args.hasArgNoClaim(OPT_m) && Args.hasArgNoClaim(OPT_d)) {
llvm::errs() << "error: no target machine specified\n"
<< "supported targets: i386, i386:x86-64, arm\n";
return 1;
}
for (auto *Arg : Args.filtered(OPT_UNKNOWN))
llvm::errs() << "ignoring unknown argument: " << Arg->getSpelling() << "\n";
if (!Args.hasArg(OPT_d)) {
llvm::errs() << "no definition file specified\n";
return 1;
}
Optional<MemoryBufferRef> MB = openFile(Args.getLastArg(OPT_d)->getValue());
if (!MB)
return 1;
if (!MB->getBufferSize()) {
llvm::errs() << "definition file empty\n";
return 1;
}
COFF::MachineTypes Machine = IMAGE_FILE_MACHINE_UNKNOWN;
if (auto *Arg = Args.getLastArg(OPT_m))
Machine = getEmulation(Arg->getValue());
if (Machine == IMAGE_FILE_MACHINE_UNKNOWN) {
llvm::errs() << "unknown target\n";
return 1;
}
Expected<COFFModuleDefinition> Def =
parseCOFFModuleDefinition(*MB, Machine, true);
if (!Def) {
llvm::errs() << "error parsing definition\n"
<< errorToErrorCode(Def.takeError()).message();
return 1;
}
// Do this after the parser because parseCOFFModuleDefinition sets OutputFile.
if (auto *Arg = Args.getLastArg(OPT_D))
Def->OutputFile = Arg->getValue();
if (Def->OutputFile.empty()) {
llvm::errs() << "no output file specified\n";
return 1;
}
std::string Path = Args.getLastArgValue(OPT_l);
if (Path.empty())
Path = getImplibPath(Def->OutputFile);
+ if (Machine == IMAGE_FILE_MACHINE_I386 && Args.getLastArg(OPT_k)) {
+ for (COFFShortExport& E : Def->Exports) {
+ if (E.isWeak() || (!E.Name.empty() && E.Name[0] == '?'))
+ continue;
+ E.SymbolName = E.Name;
+ // Trim off the trailing decoration. Symbols will always have a
+ // starting prefix here (either _ for cdecl/stdcall, @ for fastcall
+ // or ? for C++ functions). (Vectorcall functions will also end up having
+ // a prefix here, even if they shouldn't.)
+ E.Name = E.Name.substr(0, E.Name.find('@', 1));
+ // By making sure E.SymbolName != E.Name for decorated symbols,
+ // writeImportLibrary writes these symbols with the type
+ // IMPORT_NAME_UNDECORATE.
+ }
+ }
+
if (writeImportLibrary(Def->OutputFile, Path, Def->Exports, Machine, true))
return 1;
return 0;
}
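// Editorial sketch, not part of the patch: a minimal standalone illustration
// of the decoration trimming done in the --kill-at loop above. The helper
// name trimDecoration is hypothetical; the substr/find logic is the same.
#include <string>
static std::string trimDecoration(const std::string &Name) {
  // Keep the leading prefix character (_ for cdecl/stdcall, @ for fastcall)
  // by starting the search for '@' at index 1, then drop the "@<n>" suffix.
  return Name.substr(0, Name.find('@', 1));
}
// trimDecoration("_Func@12") yields "_Func"; trimDecoration("@Fast@8")
// yields "@Fast"; an undecorated "_Func" is returned unchanged.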
diff --git a/lib/ToolDrivers/llvm-dlltool/Options.td b/lib/ToolDrivers/llvm-dlltool/Options.td
index 213c6a4d7674..e78182ab8130 100644
--- a/lib/ToolDrivers/llvm-dlltool/Options.td
+++ b/lib/ToolDrivers/llvm-dlltool/Options.td
@@ -1,26 +1,26 @@
include "llvm/Option/OptParser.td"
def m: JoinedOrSeparate<["-"], "m">, HelpText<"Set target machine">;
def m_long : JoinedOrSeparate<["--"], "machine">, Alias<m>;
def l: JoinedOrSeparate<["-"], "l">, HelpText<"Generate an import lib">;
def l_long : JoinedOrSeparate<["--"], "output-lib">, Alias<l>;
def D: JoinedOrSeparate<["-"], "D">, HelpText<"Specify the input DLL Name">;
def D_long : JoinedOrSeparate<["--"], "dllname">, Alias<D>;
def d: JoinedOrSeparate<["-"], "d">, HelpText<"Input .def File">;
def d_long : JoinedOrSeparate<["--"], "input-def">, Alias<d>;
+def k: Flag<["-"], "k">, HelpText<"Kill @n Symbol from export">;
+def k_alias: Flag<["--"], "kill-at">, Alias<k>;
+
//==============================================================================
// The flags below do nothing. They are defined only for dlltool compatibility.
//==============================================================================
-def k: Flag<["-"], "k">, HelpText<"Kill @n Symbol from export">;
-def k_alias: Flag<["--"], "kill-at">, Alias<k>;
-
def S: JoinedOrSeparate<["-"], "S">, HelpText<"Assembler">;
def S_alias: JoinedOrSeparate<["--"], "as">, Alias<S>;
def f: JoinedOrSeparate<["-"], "f">, HelpText<"Assembler Flags">;
def f_alias: JoinedOrSeparate<["--"], "as-flags">, Alias<f>;
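// Editorial usage sketch, not part of the patch, combining the flags above
// (the file names are made up):
//   llvm-dlltool -m i386 -d foo.def -l libfoo.a -k
// This reads exports from foo.def, strips the "@n" decoration from i386
// stdcall exports because of -k/--kill-at, and writes the import library
// to libfoo.a.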
diff --git a/lib/Transforms/Scalar/LowerAtomic.cpp b/lib/Transforms/Scalar/LowerAtomic.cpp
index 08e60b16bedf..6f77c5bd0d07 100644
--- a/lib/Transforms/Scalar/LowerAtomic.cpp
+++ b/lib/Transforms/Scalar/LowerAtomic.cpp
@@ -1,174 +1,173 @@
//===- LowerAtomic.cpp - Lower atomic intrinsics --------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This pass lowers atomic intrinsics to non-atomic form for use in a known
// non-preemptible environment.
//
//===----------------------------------------------------------------------===//
#include "llvm/Transforms/Scalar/LowerAtomic.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Pass.h"
#include "llvm/Transforms/Scalar.h"
using namespace llvm;
#define DEBUG_TYPE "loweratomic"
static bool LowerAtomicCmpXchgInst(AtomicCmpXchgInst *CXI) {
IRBuilder<> Builder(CXI);
Value *Ptr = CXI->getPointerOperand();
Value *Cmp = CXI->getCompareOperand();
Value *Val = CXI->getNewValOperand();
LoadInst *Orig = Builder.CreateLoad(Ptr);
Value *Equal = Builder.CreateICmpEQ(Orig, Cmp);
Value *Res = Builder.CreateSelect(Equal, Val, Orig);
Builder.CreateStore(Res, Ptr);
Res = Builder.CreateInsertValue(UndefValue::get(CXI->getType()), Orig, 0);
Res = Builder.CreateInsertValue(Res, Equal, 1);
CXI->replaceAllUsesWith(Res);
CXI->eraseFromParent();
return true;
}
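// Editorial note, not part of the patch: in IR terms, the lowering above
// turns
//   %pair = cmpxchg i32* %p, i32 %cmp, i32 %new seq_cst seq_cst
// into the non-atomic sequence
//   %orig  = load i32, i32* %p
//   %equal = icmp eq i32 %orig, %cmp
//   %res   = select i1 %equal, i32 %new, i32 %orig
//   store i32 %res, i32* %p
// and rebuilds the {i32, i1} result pair from %orig and %equal, which is
// only sound in the known non-preemptible environment this pass assumes.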
static bool LowerAtomicRMWInst(AtomicRMWInst *RMWI) {
IRBuilder<> Builder(RMWI);
Value *Ptr = RMWI->getPointerOperand();
Value *Val = RMWI->getValOperand();
LoadInst *Orig = Builder.CreateLoad(Ptr);
Value *Res = nullptr;
switch (RMWI->getOperation()) {
default: llvm_unreachable("Unexpected RMW operation");
case AtomicRMWInst::Xchg:
Res = Val;
break;
case AtomicRMWInst::Add:
Res = Builder.CreateAdd(Orig, Val);
break;
case AtomicRMWInst::Sub:
Res = Builder.CreateSub(Orig, Val);
break;
case AtomicRMWInst::And:
Res = Builder.CreateAnd(Orig, Val);
break;
case AtomicRMWInst::Nand:
Res = Builder.CreateNot(Builder.CreateAnd(Orig, Val));
break;
case AtomicRMWInst::Or:
Res = Builder.CreateOr(Orig, Val);
break;
case AtomicRMWInst::Xor:
Res = Builder.CreateXor(Orig, Val);
break;
case AtomicRMWInst::Max:
Res = Builder.CreateSelect(Builder.CreateICmpSLT(Orig, Val),
Val, Orig);
break;
case AtomicRMWInst::Min:
Res = Builder.CreateSelect(Builder.CreateICmpSLT(Orig, Val),
Orig, Val);
break;
case AtomicRMWInst::UMax:
Res = Builder.CreateSelect(Builder.CreateICmpULT(Orig, Val),
Val, Orig);
break;
case AtomicRMWInst::UMin:
Res = Builder.CreateSelect(Builder.CreateICmpULT(Orig, Val),
Orig, Val);
break;
}
Builder.CreateStore(Res, Ptr);
RMWI->replaceAllUsesWith(Orig);
RMWI->eraseFromParent();
return true;
}
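// Editorial note, not part of the patch: for example
//   %old = atomicrmw add i32* %p, i32 %v seq_cst
// becomes a plain load/add/store, with users of %old rewired to the load:
//   %orig = load i32, i32* %p
//   %sum  = add i32 %orig, %v
//   store i32 %sum, i32* %p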
static bool LowerFenceInst(FenceInst *FI) {
FI->eraseFromParent();
return true;
}
static bool LowerLoadInst(LoadInst *LI) {
LI->setAtomic(AtomicOrdering::NotAtomic);
return true;
}
static bool LowerStoreInst(StoreInst *SI) {
SI->setAtomic(AtomicOrdering::NotAtomic);
return true;
}
static bool runOnBasicBlock(BasicBlock &BB) {
bool Changed = false;
for (BasicBlock::iterator DI = BB.begin(), DE = BB.end(); DI != DE;) {
Instruction *Inst = &*DI++;
if (FenceInst *FI = dyn_cast<FenceInst>(Inst))
Changed |= LowerFenceInst(FI);
else if (AtomicCmpXchgInst *CXI = dyn_cast<AtomicCmpXchgInst>(Inst))
Changed |= LowerAtomicCmpXchgInst(CXI);
else if (AtomicRMWInst *RMWI = dyn_cast<AtomicRMWInst>(Inst))
Changed |= LowerAtomicRMWInst(RMWI);
else if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
if (LI->isAtomic())
LowerLoadInst(LI);
} else if (StoreInst *SI = dyn_cast<StoreInst>(Inst)) {
if (SI->isAtomic())
LowerStoreInst(SI);
}
}
return Changed;
}
static bool lowerAtomics(Function &F) {
bool Changed = false;
for (BasicBlock &BB : F) {
Changed |= runOnBasicBlock(BB);
}
return Changed;
}
PreservedAnalyses LowerAtomicPass::run(Function &F, FunctionAnalysisManager &) {
if (lowerAtomics(F))
return PreservedAnalyses::none();
return PreservedAnalyses::all();
}
namespace {
class LowerAtomicLegacyPass : public FunctionPass {
public:
static char ID;
LowerAtomicLegacyPass() : FunctionPass(ID) {
initializeLowerAtomicLegacyPassPass(*PassRegistry::getPassRegistry());
}
bool runOnFunction(Function &F) override {
- if (skipFunction(F))
- return false;
+ // Don't skip optnone functions; atomics still need to be lowered.
FunctionAnalysisManager DummyFAM;
auto PA = Impl.run(F, DummyFAM);
return !PA.areAllPreserved();
}
private:
LowerAtomicPass Impl;
};
}
char LowerAtomicLegacyPass::ID = 0;
INITIALIZE_PASS(LowerAtomicLegacyPass, "loweratomic",
"Lower atomic intrinsics to non-atomic form", false, false)
Pass *llvm::createLowerAtomicPass() { return new LowerAtomicLegacyPass(); }
diff --git a/lib/Transforms/Scalar/Reassociate.cpp b/lib/Transforms/Scalar/Reassociate.cpp
index 29d1ba406ae4..e235e5eb1a06 100644
--- a/lib/Transforms/Scalar/Reassociate.cpp
+++ b/lib/Transforms/Scalar/Reassociate.cpp
@@ -1,2281 +1,2287 @@
//===- Reassociate.cpp - Reassociate binary expressions -------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This pass reassociates commutative expressions in an order that is designed
// to promote better constant propagation, GCSE, LICM, PRE, etc.
//
// For example: 4 + (x + 5) -> x + (4 + 5)
//
// In the implementation of this algorithm, constants are assigned rank = 0,
// function arguments are rank = 1, and other values are assigned ranks
// corresponding to the reverse post-order traversal of the current function
// (starting at 2), which effectively gives values in deep loops higher rank
// than values not in loops.
//
//===----------------------------------------------------------------------===//
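// Editorial example, not part of the patch: in "return 4 + (x + 5)" the
// constants 4 and 5 have rank 0 while the argument x ranks higher, so
// sorting operands by rank regroups the sum as x + (4 + 5), which then
// constant-folds to x + 9.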
#include "llvm/Transforms/Scalar/Reassociate.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/ValueHandle.h"
#include "llvm/Pass.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils/Local.h"
#include <algorithm>
using namespace llvm;
using namespace reassociate;
#define DEBUG_TYPE "reassociate"
STATISTIC(NumChanged, "Number of insts reassociated");
STATISTIC(NumAnnihil, "Number of expr tree annihilated");
STATISTIC(NumFactor , "Number of multiplies factored");
#ifndef NDEBUG
/// Print out the expression identified in the Ops list.
///
static void PrintOps(Instruction *I, const SmallVectorImpl<ValueEntry> &Ops) {
Module *M = I->getModule();
dbgs() << Instruction::getOpcodeName(I->getOpcode()) << " "
<< *Ops[0].Op->getType() << '\t';
for (unsigned i = 0, e = Ops.size(); i != e; ++i) {
dbgs() << "[ ";
Ops[i].Op->printAsOperand(dbgs(), false, M);
dbgs() << ", #" << Ops[i].Rank << "] ";
}
}
#endif
/// Utility class representing a non-constant Xor-operand. We classify
/// non-constant Xor-Operands into two categories:
/// C1) The operand is in the form "X & C", where C is a constant and C != ~0
/// C2)
/// C2.1) The operand is in the form of "X | C", where C is a non-zero
/// constant.
/// C2.2) Any operand E that doesn't fall into C1 or C2.1; we view such an
/// operand as "E | 0"
class llvm::reassociate::XorOpnd {
public:
XorOpnd(Value *V);
bool isInvalid() const { return SymbolicPart == nullptr; }
bool isOrExpr() const { return isOr; }
Value *getValue() const { return OrigVal; }
Value *getSymbolicPart() const { return SymbolicPart; }
unsigned getSymbolicRank() const { return SymbolicRank; }
const APInt &getConstPart() const { return ConstPart; }
void Invalidate() { SymbolicPart = OrigVal = nullptr; }
void setSymbolicRank(unsigned R) { SymbolicRank = R; }
private:
Value *OrigVal;
Value *SymbolicPart;
APInt ConstPart;
unsigned SymbolicRank;
bool isOr;
};
XorOpnd::XorOpnd(Value *V) {
assert(!isa<ConstantInt>(V) && "No ConstantInt");
OrigVal = V;
Instruction *I = dyn_cast<Instruction>(V);
SymbolicRank = 0;
if (I && (I->getOpcode() == Instruction::Or ||
I->getOpcode() == Instruction::And)) {
Value *V0 = I->getOperand(0);
Value *V1 = I->getOperand(1);
const APInt *C;
if (match(V0, PatternMatch::m_APInt(C)))
std::swap(V0, V1);
if (match(V1, PatternMatch::m_APInt(C))) {
ConstPart = *C;
SymbolicPart = V0;
isOr = (I->getOpcode() == Instruction::Or);
return;
}
}
// view the operand as "V | 0"
SymbolicPart = V;
ConstPart = APInt::getNullValue(V->getType()->getScalarSizeInBits());
isOr = true;
}
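// Editorial examples, not part of the patch, of the classification above:
//   (X & 7) -> C1:   SymbolicPart = X, ConstPart = 7, isOr = false
//   (X | 5) -> C2.1: SymbolicPart = X, ConstPart = 5, isOr = true
//   X       -> C2.2: viewed as "X | 0", so ConstPart = 0, isOr = true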
/// Return true if V is an instruction of the specified opcode and if it
/// only has one use.
static BinaryOperator *isReassociableOp(Value *V, unsigned Opcode) {
if (V->hasOneUse() && isa<Instruction>(V) &&
cast<Instruction>(V)->getOpcode() == Opcode &&
(!isa<FPMathOperator>(V) ||
cast<Instruction>(V)->hasUnsafeAlgebra()))
return cast<BinaryOperator>(V);
return nullptr;
}
static BinaryOperator *isReassociableOp(Value *V, unsigned Opcode1,
unsigned Opcode2) {
if (V->hasOneUse() && isa<Instruction>(V) &&
(cast<Instruction>(V)->getOpcode() == Opcode1 ||
cast<Instruction>(V)->getOpcode() == Opcode2) &&
(!isa<FPMathOperator>(V) ||
cast<Instruction>(V)->hasUnsafeAlgebra()))
return cast<BinaryOperator>(V);
return nullptr;
}
void ReassociatePass::BuildRankMap(Function &F,
ReversePostOrderTraversal<Function*> &RPOT) {
unsigned i = 2;
// Assign distinct ranks to function arguments.
for (Function::arg_iterator I = F.arg_begin(), E = F.arg_end(); I != E; ++I) {
ValueRankMap[&*I] = ++i;
DEBUG(dbgs() << "Calculated Rank[" << I->getName() << "] = " << i << "\n");
}
// Traverse basic blocks in ReversePostOrder
for (BasicBlock *BB : RPOT) {
unsigned BBRank = RankMap[BB] = ++i << 16;
// Walk the basic block, adding precomputed ranks for any instructions that
// we cannot move. This ensures that the ranks for these instructions are
// all different in the block.
for (Instruction &I : *BB)
if (mayBeMemoryDependent(I))
ValueRankMap[&I] = ++BBRank;
}
}
unsigned ReassociatePass::getRank(Value *V) {
Instruction *I = dyn_cast<Instruction>(V);
if (!I) {
if (isa<Argument>(V)) return ValueRankMap[V]; // Function argument.
return 0; // Otherwise it's a global or constant, rank 0.
}
if (unsigned Rank = ValueRankMap[I])
return Rank; // Rank already known?
// If this is an expression, return the 1+MAX(rank(LHS), rank(RHS)) so that
// we can reassociate expressions for code motion! Since we do not recurse
// for PHI nodes, we cannot have infinite recursion here, because there
// cannot be loops in the value graph that do not go through PHI nodes.
unsigned Rank = 0, MaxRank = RankMap[I->getParent()];
for (unsigned i = 0, e = I->getNumOperands();
i != e && Rank != MaxRank; ++i)
Rank = std::max(Rank, getRank(I->getOperand(i)));
// If this is a not or neg instruction, do not count it for rank. This
// assures us that X and ~X will have the same rank.
if (!BinaryOperator::isNot(I) && !BinaryOperator::isNeg(I) &&
!BinaryOperator::isFNeg(I))
++Rank;
DEBUG(dbgs() << "Calculated Rank[" << V->getName() << "] = " << Rank << "\n");
return ValueRankMap[I] = Rank;
}
// Canonicalize constants to RHS. Otherwise, sort the operands by rank.
void ReassociatePass::canonicalizeOperands(Instruction *I) {
assert(isa<BinaryOperator>(I) && "Expected binary operator.");
assert(I->isCommutative() && "Expected commutative operator.");
Value *LHS = I->getOperand(0);
Value *RHS = I->getOperand(1);
unsigned LHSRank = getRank(LHS);
unsigned RHSRank = getRank(RHS);
if (isa<Constant>(RHS))
return;
if (isa<Constant>(LHS) || RHSRank < LHSRank)
cast<BinaryOperator>(I)->swapOperands();
}
static BinaryOperator *CreateAdd(Value *S1, Value *S2, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateAdd(S1, S2, Name, InsertBefore);
else {
BinaryOperator *Res =
BinaryOperator::CreateFAdd(S1, S2, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;
}
}
static BinaryOperator *CreateMul(Value *S1, Value *S2, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateMul(S1, S2, Name, InsertBefore);
else {
BinaryOperator *Res =
BinaryOperator::CreateFMul(S1, S2, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;
}
}
static BinaryOperator *CreateNeg(Value *S1, const Twine &Name,
Instruction *InsertBefore, Value *FlagsOp) {
if (S1->getType()->isIntOrIntVectorTy())
return BinaryOperator::CreateNeg(S1, Name, InsertBefore);
else {
BinaryOperator *Res = BinaryOperator::CreateFNeg(S1, Name, InsertBefore);
Res->setFastMathFlags(cast<FPMathOperator>(FlagsOp)->getFastMathFlags());
return Res;
}
}
/// Replace 0-X with X*-1.
static BinaryOperator *LowerNegateToMultiply(Instruction *Neg) {
Type *Ty = Neg->getType();
Constant *NegOne = Ty->isIntOrIntVectorTy() ?
ConstantInt::getAllOnesValue(Ty) : ConstantFP::get(Ty, -1.0);
BinaryOperator *Res = CreateMul(Neg->getOperand(1), NegOne, "", Neg, Neg);
Neg->setOperand(1, Constant::getNullValue(Ty)); // Drop use of op.
Res->takeName(Neg);
Neg->replaceAllUsesWith(Res);
Res->setDebugLoc(Neg->getDebugLoc());
return Res;
}
/// Returns k such that lambda(2^Bitwidth) = 2^k, where lambda is the Carmichael
/// function. This means that x^(2^k) === 1 mod 2^Bitwidth for
/// every odd x, i.e. x^(2^k) = 1 for every odd x in Bitwidth-bit arithmetic.
/// Note that 0 <= k < Bitwidth, and if Bitwidth > 3 then x^(2^k) = 0 for every
/// even x in Bitwidth-bit arithmetic.
static unsigned CarmichaelShift(unsigned Bitwidth) {
if (Bitwidth < 3)
return Bitwidth - 1;
return Bitwidth - 2;
}
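// Editorial check, not part of the patch: for Bitwidth = 8 this returns 6,
// and indeed x^64 === 1 (mod 256) for every odd x (the odd residues mod 2^k
// form a group of exponent 2^(k-2) for k >= 3), while for even x the power
// x^64 carries at least 64 factors of two and so is 0 mod 256.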
/// Add the extra weight 'RHS' to the existing weight 'LHS',
/// reducing the combined weight using any special properties of the operation.
/// The existing weight LHS represents the computation X op X op ... op X where
/// X occurs LHS times. The combined weight represents X op X op ... op X with
/// X occurring LHS + RHS times. If op is "Xor" for example then the combined
/// operation is equivalent to X if LHS + RHS is odd, or 0 if LHS + RHS is even;
/// the routine returns 1 in LHS in the first case, and 0 in LHS in the second.
static void IncorporateWeight(APInt &LHS, const APInt &RHS, unsigned Opcode) {
// If we were working with infinite precision arithmetic then the combined
// weight would be LHS + RHS. But we are using finite precision arithmetic,
// and the APInt sum LHS + RHS may not be correct if it wraps (it is correct
// for nilpotent operations and addition, but not for idempotent operations
// and multiplication), so it is important to correctly reduce the combined
// weight back into range if wrapping would be wrong.
// If RHS is zero then the weight didn't change.
if (RHS.isMinValue())
return;
// If LHS is zero then the combined weight is RHS.
if (LHS.isMinValue()) {
LHS = RHS;
return;
}
// From this point on we know that neither LHS nor RHS is zero.
if (Instruction::isIdempotent(Opcode)) {
// Idempotent means X op X === X, so any non-zero weight is equivalent to a
// weight of 1. Keeping weights at zero or one also means that wrapping is
// not a problem.
assert(LHS == 1 && RHS == 1 && "Weights not reduced!");
return; // Return a weight of 1.
}
if (Instruction::isNilpotent(Opcode)) {
// Nilpotent means X op X === 0, so reduce weights modulo 2.
assert(LHS == 1 && RHS == 1 && "Weights not reduced!");
LHS = 0; // 1 + 1 === 0 modulo 2.
return;
}
if (Opcode == Instruction::Add || Opcode == Instruction::FAdd) {
// TODO: Reduce the weight by exploiting nsw/nuw?
LHS += RHS;
return;
}
assert((Opcode == Instruction::Mul || Opcode == Instruction::FMul) &&
"Unknown associative operation!");
unsigned Bitwidth = LHS.getBitWidth();
// If CM is the Carmichael number then a weight W satisfying W >= CM+Bitwidth
// can be replaced with W-CM. That's because x^W=x^(W-CM) for every Bitwidth
// bit number x, since either x is odd in which case x^CM = 1, or x is even in
// which case both x^W and x^(W - CM) are zero. By subtracting off multiples
// of CM like this weights can always be reduced to the range [0, CM+Bitwidth)
// which by a happy accident means that they can always be represented using
// Bitwidth bits.
// TODO: Reduce the weight by exploiting nsw/nuw? (Could do much better than
// the Carmichael number).
if (Bitwidth > 3) {
/// CM - The value of Carmichael's lambda function.
APInt CM = APInt::getOneBitSet(Bitwidth, CarmichaelShift(Bitwidth));
// Any weight W >= Threshold can be replaced with W - CM.
APInt Threshold = CM + Bitwidth;
assert(LHS.ult(Threshold) && RHS.ult(Threshold) && "Weights not reduced!");
// For Bitwidth 4 or more the following sum does not overflow.
LHS += RHS;
while (LHS.uge(Threshold))
LHS -= CM;
} else {
// To avoid problems with overflow do everything the same as above but using
// a larger type.
unsigned CM = 1U << CarmichaelShift(Bitwidth);
unsigned Threshold = CM + Bitwidth;
assert(LHS.getZExtValue() < Threshold && RHS.getZExtValue() < Threshold &&
"Weights not reduced!");
unsigned Total = LHS.getZExtValue() + RHS.getZExtValue();
while (Total >= Threshold)
Total -= CM;
LHS = Total;
}
}
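// Editorial examples, not part of the patch: for Xor (nilpotent) weights
// 1 + 1 reduce to 0 since X ^ X == 0; for And (idempotent) any nonzero
// weight stays 1 since X & X == X; for Add the weights simply sum, so X
// occurring 3 + 2 times contributes 5*X.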
typedef std::pair<Value*, APInt> RepeatedValue;
/// Given an associative binary expression, return the leaf
/// nodes in Ops along with their weights (how many times the leaf occurs). The
/// original expression is the same as
/// (Ops[0].first op Ops[0].first op ... Ops[0].first) <- Ops[0].second times
/// op
/// (Ops[1].first op Ops[1].first op ... Ops[1].first) <- Ops[1].second times
/// op
/// ...
/// op
/// (Ops[N].first op Ops[N].first op ... Ops[N].first) <- Ops[N].second times
///
/// Note that the values Ops[0].first, ..., Ops[N].first are all distinct.
///
/// This routine may modify the function, in which case it returns 'true'. The
/// changes it makes may well be destructive, changing the value computed by 'I'
/// to something completely different. Thus if the routine returns 'true' then
/// you MUST either replace I with a new expression computed from the Ops array,
/// or use RewriteExprTree to put the values back in.
///
/// A leaf node is either not a binary operation of the same kind as the root
/// node 'I' (i.e. is not a binary operator at all, or is, but with a different
/// opcode), or is the same kind of binary operator but has a use which either
/// does not belong to the expression, or does belong to the expression but is
/// a leaf node. Every leaf node has at least one use that is a non-leaf node
/// of the expression, while for non-leaf nodes (except for the root 'I') every
/// use is a non-leaf node of the expression.
///
/// For example:
///            expression graph          node names
///
///                  +        |        I
///                 / \       |
///                +   +      |      A,  B
///               / \ / \     |
///              *   +   *    |  C,  D,  E
///             / \ / \ / \   |
///                +   *      |    F,  G
///
/// The leaf nodes are C, E, F and G. The Ops array will contain (maybe not in
/// that order) (C, 1), (E, 1), (F, 2), (G, 2).
///
/// The expression is maximal: if some instruction is a binary operator of the
/// same kind as 'I', and all of its uses are non-leaf nodes of the expression,
/// then the instruction also belongs to the expression, is not a leaf node of
/// it, and its operands also belong to the expression (but may be leaf nodes).
///
/// NOTE: This routine will set operands of non-leaf non-root nodes to undef in
/// order to ensure that every non-root node in the expression has *exactly one*
/// use by a non-leaf node of the expression. This destruction means that the
/// caller MUST either replace 'I' with a new expression or use something like
/// RewriteExprTree to put the values back in if the routine indicates that it
/// made a change by returning 'true'.
///
/// In the above example either the right operand of A or the left operand of B
/// will be replaced by undef. If it is B's operand then this gives:
///
///                  +        |        I
///                 / \       |
///                +   +      |      A,  B - operand of B replaced with undef
///               / \   \     |
///              *   +   *    |  C,  D,  E
///             / \ / \ / \   |
///                +   *      |    F,  G
///
/// Note that such undef operands can only be reached by passing through 'I'.
/// For example, if you visit operands recursively starting from a leaf node
/// then you will never see such an undef operand unless you get back to 'I',
/// which requires passing through a phi node.
///
/// Note that this routine may also mutate binary operators of the wrong type
/// that have all uses inside the expression (i.e. only used by non-leaf nodes
/// of the expression) if it can turn them into binary operators of the right
/// type and thus make the expression bigger.
static bool LinearizeExprTree(BinaryOperator *I,
SmallVectorImpl<RepeatedValue> &Ops) {
DEBUG(dbgs() << "LINEARIZE: " << *I << '\n');
unsigned Bitwidth = I->getType()->getScalarType()->getPrimitiveSizeInBits();
unsigned Opcode = I->getOpcode();
assert(I->isAssociative() && I->isCommutative() &&
"Expected an associative and commutative operation!");
// Visit all operands of the expression, keeping track of their weight (the
// number of paths from the expression root to the operand, or if you like
// the number of times that operand occurs in the linearized expression).
// For example, if I = X + A, where X = A + B, then I, X and B have weight 1
// while A has weight two.
// Worklist of non-leaf nodes (their operands are in the expression too) along
// with their weights, representing a certain number of paths to the operator.
// If an operator occurs in the worklist multiple times then we found multiple
// ways to get to it.
SmallVector<std::pair<BinaryOperator*, APInt>, 8> Worklist; // (Op, Weight)
Worklist.push_back(std::make_pair(I, APInt(Bitwidth, 1)));
bool Changed = false;
// Leaves of the expression are values that either aren't the right kind of
// operation (eg: a constant, or a multiply in an add tree), or are, but have
// some uses that are not inside the expression. For example, in I = X + X,
// X = A + B, the value X has two uses (by I) that are in the expression. If
// X has any other uses, for example in a return instruction, then we consider
// X to be a leaf, and won't analyze it further. When we first visit a value,
// if it has more than one use then at first we conservatively consider it to
// be a leaf. Later, as the expression is explored, we may discover some more
// uses of the value from inside the expression. If all uses turn out to be
// from within the expression (and the value is a binary operator of the right
// kind) then the value is no longer considered to be a leaf, and its operands
// are explored.
// Leaves - Keeps track of the set of putative leaves as well as the number of
// paths to each leaf seen so far.
typedef DenseMap<Value*, APInt> LeafMap;
LeafMap Leaves; // Leaf -> Total weight so far.
SmallVector<Value*, 8> LeafOrder; // Ensure deterministic leaf output order.
#ifndef NDEBUG
SmallPtrSet<Value*, 8> Visited; // For sanity checking the iteration scheme.
#endif
while (!Worklist.empty()) {
std::pair<BinaryOperator*, APInt> P = Worklist.pop_back_val();
I = P.first; // We examine the operands of this binary operator.
for (unsigned OpIdx = 0; OpIdx < 2; ++OpIdx) { // Visit operands.
Value *Op = I->getOperand(OpIdx);
APInt Weight = P.second; // Number of paths to this operand.
DEBUG(dbgs() << "OPERAND: " << *Op << " (" << Weight << ")\n");
assert(!Op->use_empty() && "No uses, so how did we get to it?!");
// If this is a binary operation of the right kind with only one use then
// add its operands to the expression.
if (BinaryOperator *BO = isReassociableOp(Op, Opcode)) {
assert(Visited.insert(Op).second && "Not first visit!");
DEBUG(dbgs() << "DIRECT ADD: " << *Op << " (" << Weight << ")\n");
Worklist.push_back(std::make_pair(BO, Weight));
continue;
}
// Appears to be a leaf. Is the operand already in the set of leaves?
LeafMap::iterator It = Leaves.find(Op);
if (It == Leaves.end()) {
// Not in the leaf map. Must be the first time we saw this operand.
assert(Visited.insert(Op).second && "Not first visit!");
if (!Op->hasOneUse()) {
// This value has uses not accounted for by the expression, so it is
// not safe to modify. Mark it as being a leaf.
DEBUG(dbgs() << "ADD USES LEAF: " << *Op << " (" << Weight << ")\n");
LeafOrder.push_back(Op);
Leaves[Op] = Weight;
continue;
}
// No uses outside the expression, try morphing it.
} else {
// Already in the leaf map.
assert(It != Leaves.end() && Visited.count(Op) &&
"In leaf map but not visited!");
// Update the number of paths to the leaf.
IncorporateWeight(It->second, Weight, Opcode);
#if 0 // TODO: Re-enable once PR13021 is fixed.
// The leaf already has one use from inside the expression. As we want
// exactly one such use, drop this new use of the leaf.
assert(!Op->hasOneUse() && "Only one use, but we got here twice!");
I->setOperand(OpIdx, UndefValue::get(I->getType()));
Changed = true;
// If the leaf is a binary operation of the right kind and we now see
// that its multiple original uses were in fact all by nodes belonging
// to the expression, then no longer consider it to be a leaf and add
// its operands to the expression.
if (BinaryOperator *BO = isReassociableOp(Op, Opcode)) {
DEBUG(dbgs() << "UNLEAF: " << *Op << " (" << It->second << ")\n");
Worklist.push_back(std::make_pair(BO, It->second));
Leaves.erase(It);
continue;
}
#endif
// If we still have uses that are not accounted for by the expression
// then it is not safe to modify the value.
if (!Op->hasOneUse())
continue;
// No uses outside the expression, try morphing it.
Weight = It->second;
Leaves.erase(It); // Since the value may be morphed below.
}
// At this point we have a value which, first of all, is not a binary
// expression of the right kind, and secondly, is only used inside the
// expression. This means that it can safely be modified. See if we
// can usefully morph it into an expression of the right kind.
assert((!isa<Instruction>(Op) ||
cast<Instruction>(Op)->getOpcode() != Opcode
|| (isa<FPMathOperator>(Op) &&
!cast<Instruction>(Op)->hasUnsafeAlgebra())) &&
"Should have been handled above!");
assert(Op->hasOneUse() && "Has uses outside the expression tree!");
// If this is a multiply expression, turn any internal negations into
// multiplies by -1 so they can be reassociated.
if (BinaryOperator *BO = dyn_cast<BinaryOperator>(Op))
if ((Opcode == Instruction::Mul && BinaryOperator::isNeg(BO)) ||
(Opcode == Instruction::FMul && BinaryOperator::isFNeg(BO))) {
DEBUG(dbgs() << "MORPH LEAF: " << *Op << " (" << Weight << ") TO ");
BO = LowerNegateToMultiply(BO);
DEBUG(dbgs() << *BO << '\n');
Worklist.push_back(std::make_pair(BO, Weight));
Changed = true;
continue;
}
// Failed to morph into an expression of the right type. This really is
// a leaf.
DEBUG(dbgs() << "ADD LEAF: " << *Op << " (" << Weight << ")\n");
assert(!isReassociableOp(Op, Opcode) && "Value was morphed?");
LeafOrder.push_back(Op);
Leaves[Op] = Weight;
}
}
// The leaves, repeated according to their weights, represent the linearized
// form of the expression.
for (unsigned i = 0, e = LeafOrder.size(); i != e; ++i) {
Value *V = LeafOrder[i];
LeafMap::iterator It = Leaves.find(V);
if (It == Leaves.end())
// Node initially thought to be a leaf wasn't.
continue;
assert(!isReassociableOp(V, Opcode) && "Shouldn't be a leaf!");
APInt Weight = It->second;
if (Weight.isMinValue())
// Leaf already output or weight reduction eliminated it.
continue;
// Ensure the leaf is only output once.
It->second = 0;
Ops.push_back(std::make_pair(V, Weight));
}
// For nilpotent operations or addition there may be no operands, for example
// because the expression was "X xor X" or consisted of 2^Bitwidth additions:
// in both cases the weight reduces to 0 causing the value to be skipped.
if (Ops.empty()) {
Constant *Identity = ConstantExpr::getBinOpIdentity(Opcode, I->getType());
assert(Identity && "Associative operation without identity!");
Ops.emplace_back(Identity, APInt(Bitwidth, 1));
}
return Changed;
}
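// Editorial example, not part of the patch: linearizing I = (A + B) + (A + C),
// with both inner adds used only by I, yields Ops = {(A, 2), (B, 1), (C, 1)}
// (leaf order may differ); the caller must then rebuild I from Ops, e.g. via
// RewriteExprTree.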
/// Now that the operands for this expression tree are
/// linearized and optimized, emit them in-order.
void ReassociatePass::RewriteExprTree(BinaryOperator *I,
SmallVectorImpl<ValueEntry> &Ops) {
assert(Ops.size() > 1 && "Single values should be used directly!");
// Since our optimizations should never increase the number of operations, the
// new expression can usually be written reusing the existing binary operators
// from the original expression tree, without creating any new instructions,
// though the rewritten expression may have a completely different topology.
// We take care to not change anything if the new expression will be the same
// as the original. If more than trivial changes (like commuting operands)
// were made then we are obliged to clear out any optional subclass data like
// nsw flags.
/// NodesToRewrite - Nodes from the original expression available for writing
/// the new expression into.
SmallVector<BinaryOperator*, 8> NodesToRewrite;
unsigned Opcode = I->getOpcode();
BinaryOperator *Op = I;
/// NotRewritable - The operands being written will be the leaves of the new
/// expression and must not be used as inner nodes (via NodesToRewrite) by
/// mistake. Inner nodes are always reassociable, and usually leaves are not
/// (if they were they would have been incorporated into the expression and so
/// would not be leaves), so most of the time there is no danger of this. But
/// in rare cases a leaf may become reassociable if an optimization kills uses
/// of it, or it may momentarily become reassociable during rewriting (below)
/// due to it being removed as an operand of one of its uses. Ensure that misuse
/// of leaf nodes as inner nodes cannot occur by remembering all of the future
/// leaves and refusing to reuse any of them as inner nodes.
SmallPtrSet<Value*, 8> NotRewritable;
for (unsigned i = 0, e = Ops.size(); i != e; ++i)
NotRewritable.insert(Ops[i].Op);
// ExpressionChanged - Non-null if the rewritten expression differs from the
// original in some non-trivial way, requiring the clearing of optional flags.
// Flags are cleared from the operator in ExpressionChanged up to I inclusive.
BinaryOperator *ExpressionChanged = nullptr;
for (unsigned i = 0; ; ++i) {
// The last operation (which comes earliest in the IR) is special as both
// operands will come from Ops, rather than just one with the other being
// a subexpression.
if (i+2 == Ops.size()) {
Value *NewLHS = Ops[i].Op;
Value *NewRHS = Ops[i+1].Op;
Value *OldLHS = Op->getOperand(0);
Value *OldRHS = Op->getOperand(1);
if (NewLHS == OldLHS && NewRHS == OldRHS)
// Nothing changed, leave it alone.
break;
if (NewLHS == OldRHS && NewRHS == OldLHS) {
// The order of the operands was reversed. Swap them.
DEBUG(dbgs() << "RA: " << *Op << '\n');
Op->swapOperands();
DEBUG(dbgs() << "TO: " << *Op << '\n');
MadeChange = true;
++NumChanged;
break;
}
// The new operation differs non-trivially from the original. Overwrite
// the old operands with the new ones.
DEBUG(dbgs() << "RA: " << *Op << '\n');
if (NewLHS != OldLHS) {
BinaryOperator *BO = isReassociableOp(OldLHS, Opcode);
if (BO && !NotRewritable.count(BO))
NodesToRewrite.push_back(BO);
Op->setOperand(0, NewLHS);
}
if (NewRHS != OldRHS) {
BinaryOperator *BO = isReassociableOp(OldRHS, Opcode);
if (BO && !NotRewritable.count(BO))
NodesToRewrite.push_back(BO);
Op->setOperand(1, NewRHS);
}
DEBUG(dbgs() << "TO: " << *Op << '\n');
ExpressionChanged = Op;
MadeChange = true;
++NumChanged;
break;
}
// Not the last operation. The left-hand side will be a sub-expression
// while the right-hand side will be the current element of Ops.
Value *NewRHS = Ops[i].Op;
if (NewRHS != Op->getOperand(1)) {
DEBUG(dbgs() << "RA: " << *Op << '\n');
if (NewRHS == Op->getOperand(0)) {
// The new right-hand side was already present as the left operand. If
// we are lucky then swapping the operands will sort out both of them.
Op->swapOperands();
} else {
// Overwrite with the new right-hand side.
BinaryOperator *BO = isReassociableOp(Op->getOperand(1), Opcode);
if (BO && !NotRewritable.count(BO))
NodesToRewrite.push_back(BO);
Op->setOperand(1, NewRHS);
ExpressionChanged = Op;
}
DEBUG(dbgs() << "TO: " << *Op << '\n');
MadeChange = true;
++NumChanged;
}
// Now deal with the left-hand side. If this is already an operation node
// from the original expression then just rewrite the rest of the expression
// into it.
BinaryOperator *BO = isReassociableOp(Op->getOperand(0), Opcode);
if (BO && !NotRewritable.count(BO)) {
Op = BO;
continue;
}
// Otherwise, grab a spare node from the original expression and use that as
// the left-hand side. If there are no nodes left then the optimizers made
// an expression with more nodes than the original! This usually means that
// they did something stupid but it might mean that the problem was just too
// hard (finding the minimal number of multiplications needed to realize a
// multiplication expression is NP-complete). Whatever the reason, smart or
// stupid, create a new node if there are none left.
BinaryOperator *NewOp;
if (NodesToRewrite.empty()) {
Constant *Undef = UndefValue::get(I->getType());
NewOp = BinaryOperator::Create(Instruction::BinaryOps(Opcode),
Undef, Undef, "", I);
if (NewOp->getType()->isFPOrFPVectorTy())
NewOp->setFastMathFlags(I->getFastMathFlags());
} else {
NewOp = NodesToRewrite.pop_back_val();
}
DEBUG(dbgs() << "RA: " << *Op << '\n');
Op->setOperand(0, NewOp);
DEBUG(dbgs() << "TO: " << *Op << '\n');
ExpressionChanged = Op;
MadeChange = true;
++NumChanged;
Op = NewOp;
}
// If the expression changed non-trivially then clear out all subclass data
// starting from the operator specified in ExpressionChanged, and compactify
// the operators to just before the expression root to guarantee that the
// expression tree is dominated by all of Ops.
if (ExpressionChanged)
do {
// Preserve FastMathFlags.
if (isa<FPMathOperator>(I)) {
FastMathFlags Flags = I->getFastMathFlags();
ExpressionChanged->clearSubclassOptionalData();
ExpressionChanged->setFastMathFlags(Flags);
} else
ExpressionChanged->clearSubclassOptionalData();
if (ExpressionChanged == I)
break;
ExpressionChanged->moveBefore(I);
ExpressionChanged = cast<BinaryOperator>(*ExpressionChanged->user_begin());
} while (1);
// Throw away any left over nodes from the original expression.
for (unsigned i = 0, e = NodesToRewrite.size(); i != e; ++i)
RedoInsts.insert(NodesToRewrite[i]);
}
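// Editorial note, not part of the patch: for instance, regrouping
// "(a +nsw b) +nsw c" into a different association clears the nsw flags on
// the rewritten operators, because the new intermediate sums may wrap even
// when the original ones could not.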
/// Insert instructions before the instruction pointed to by BI,
/// that computes the negative version of the value specified. The negative
/// version of the value is returned, and BI is left pointing at the instruction
/// that should be processed next by the reassociation pass.
/// Also add intermediate instructions to the redo list that are modified while
/// pushing the negates through adds. These will be revisited to see if
/// additional opportunities have been exposed.
static Value *NegateValue(Value *V, Instruction *BI,
SetVector<AssertingVH<Instruction>> &ToRedo) {
if (Constant *C = dyn_cast<Constant>(V)) {
if (C->getType()->isFPOrFPVectorTy()) {
return ConstantExpr::getFNeg(C);
}
return ConstantExpr::getNeg(C);
}
// We are trying to expose opportunity for reassociation. One of the things
// that we want to do to achieve this is to push a negation as deep into an
// expression chain as possible, to expose the add instructions. In practice,
// this means that we turn this:
// X = -(A+12+C+D) into X = -A + -12 + -C + -D = -12 + -A + -C + -D
// so that a later "Y = 12+X" can get reassociated with the -12 to eliminate
// the constants. We assume that instcombine will clean up the mess later if
// we introduce tons of unnecessary negation instructions.
//
if (BinaryOperator *I =
isReassociableOp(V, Instruction::Add, Instruction::FAdd)) {
// Push the negates through the add.
I->setOperand(0, NegateValue(I->getOperand(0), BI, ToRedo));
I->setOperand(1, NegateValue(I->getOperand(1), BI, ToRedo));
if (I->getOpcode() == Instruction::Add) {
I->setHasNoUnsignedWrap(false);
I->setHasNoSignedWrap(false);
}
// We must move the add instruction here, because the neg instructions do
// not dominate the old add instruction in general. By moving it, we are
// assured that the neg instructions we just inserted dominate the
// instruction we are about to insert after them.
//
I->moveBefore(BI);
I->setName(I->getName()+".neg");
// Add the intermediate negates to the redo list as processing them later
// could expose more reassociating opportunities.
ToRedo.insert(I);
return I;
}
// Okay, we need to materialize a negated version of V with an instruction.
// Scan the use lists of V to see if we have one already.
for (User *U : V->users()) {
if (!BinaryOperator::isNeg(U) && !BinaryOperator::isFNeg(U))
continue;
// We found one! Now we have to make sure that the definition dominates
// this use. We do this by moving it to the entry block (if it is a
// non-instruction value) or right after the definition. These negates will
// be zapped by reassociate later, so we don't need much finesse here.
BinaryOperator *TheNeg = cast<BinaryOperator>(U);
// Verify that the negate is in this function, V might be a constant expr.
if (TheNeg->getParent()->getParent() != BI->getParent()->getParent())
continue;
BasicBlock::iterator InsertPt;
if (Instruction *InstInput = dyn_cast<Instruction>(V)) {
if (InvokeInst *II = dyn_cast<InvokeInst>(InstInput)) {
InsertPt = II->getNormalDest()->begin();
} else {
InsertPt = ++InstInput->getIterator();
}
while (isa<PHINode>(InsertPt)) ++InsertPt;
} else {
InsertPt = TheNeg->getParent()->getParent()->getEntryBlock().begin();
}
TheNeg->moveBefore(&*InsertPt);
if (TheNeg->getOpcode() == Instruction::Sub) {
TheNeg->setHasNoUnsignedWrap(false);
TheNeg->setHasNoSignedWrap(false);
} else {
TheNeg->andIRFlags(BI);
}
ToRedo.insert(TheNeg);
return TheNeg;
}
// Insert a 'neg' instruction that subtracts the value from zero to get the
// negation.
BinaryOperator *NewNeg = CreateNeg(V, V->getName() + ".neg", BI, BI);
ToRedo.insert(NewNeg);
return NewNeg;
}
/// Return true if we should break up this subtract of X-Y into (X + -Y).
static bool ShouldBreakUpSubtract(Instruction *Sub) {
// If this is a negation, we can't split it up!
if (BinaryOperator::isNeg(Sub) || BinaryOperator::isFNeg(Sub))
return false;
// Don't break up X - undef.
if (isa<UndefValue>(Sub->getOperand(1)))
return false;
// Don't bother to break this up unless one of the operands is a reassociable
// add or subtract, or this subtract's only user is one.
Value *V0 = Sub->getOperand(0);
if (isReassociableOp(V0, Instruction::Add, Instruction::FAdd) ||
isReassociableOp(V0, Instruction::Sub, Instruction::FSub))
return true;
Value *V1 = Sub->getOperand(1);
if (isReassociableOp(V1, Instruction::Add, Instruction::FAdd) ||
isReassociableOp(V1, Instruction::Sub, Instruction::FSub))
return true;
Value *VB = Sub->user_back();
if (Sub->hasOneUse() &&
(isReassociableOp(VB, Instruction::Add, Instruction::FAdd) ||
isReassociableOp(VB, Instruction::Sub, Instruction::FSub)))
return true;
return false;
}
/// If we have (X-Y), and if either X is an add, or if this is only used by an
/// add, transform this into (X+(0-Y)) to promote better reassociation.
static BinaryOperator *
BreakUpSubtract(Instruction *Sub, SetVector<AssertingVH<Instruction>> &ToRedo) {
// Convert a subtract into an add and a neg instruction. This allows sub
// instructions to be commuted with other add instructions.
//
// Calculate the negative value of Operand 1 of the sub instruction,
// and set it as the RHS of the add instruction we just made.
//
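// Illustrative sketch of the transform in IR terms (value names are
// hypothetical):
//   %s = sub i32 %a, %b
// becomes
//   %b.neg = sub i32 0, %b
//   %s     = add i32 %a, %b.neg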
Value *NegVal = NegateValue(Sub->getOperand(1), Sub, ToRedo);
BinaryOperator *New = CreateAdd(Sub->getOperand(0), NegVal, "", Sub, Sub);
Sub->setOperand(0, Constant::getNullValue(Sub->getType())); // Drop use of op.
Sub->setOperand(1, Constant::getNullValue(Sub->getType())); // Drop use of op.
New->takeName(Sub);
// Everyone now refers to the add instruction.
Sub->replaceAllUsesWith(New);
New->setDebugLoc(Sub->getDebugLoc());
DEBUG(dbgs() << "Negated: " << *New << '\n');
return New;
}
/// If this is a shift of a reassociable multiply or is used by one, change
/// this into a multiply by a constant to assist with further reassociation.
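/// For example, "shl i32 %x, 4" becomes "mul i32 %x, 16", since the
/// multiplier below is computed as 1 << 4.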
static BinaryOperator *ConvertShiftToMul(Instruction *Shl) {
Constant *MulCst = ConstantInt::get(Shl->getType(), 1);
MulCst = ConstantExpr::getShl(MulCst, cast<Constant>(Shl->getOperand(1)));
BinaryOperator *Mul =
BinaryOperator::CreateMul(Shl->getOperand(0), MulCst, "", Shl);
Shl->setOperand(0, UndefValue::get(Shl->getType())); // Drop use of op.
Mul->takeName(Shl);
// Everyone now refers to the mul instruction.
Shl->replaceAllUsesWith(Mul);
Mul->setDebugLoc(Shl->getDebugLoc());
// We can safely preserve the nuw flag in all cases. It's also safe to turn a
// nuw nsw shl into a nuw nsw mul. However, nsw in isolation requires special
// handling.
bool NSW = cast<BinaryOperator>(Shl)->hasNoSignedWrap();
bool NUW = cast<BinaryOperator>(Shl)->hasNoUnsignedWrap();
if (NSW && NUW)
Mul->setHasNoSignedWrap(true);
Mul->setHasNoUnsignedWrap(NUW);
return Mul;
}
/// Scan backwards and forwards among values with the same rank as element i
/// to see if X exists. If X does not exist, return i. This is useful when
/// scanning for 'x' when we see '-x' because they both get the same rank.
static unsigned FindInOperandList(const SmallVectorImpl<ValueEntry> &Ops,
unsigned i, Value *X) {
unsigned XRank = Ops[i].Rank;
unsigned e = Ops.size();
for (unsigned j = i+1; j != e && Ops[j].Rank == XRank; ++j) {
if (Ops[j].Op == X)
return j;
if (Instruction *I1 = dyn_cast<Instruction>(Ops[j].Op))
if (Instruction *I2 = dyn_cast<Instruction>(X))
if (I1->isIdenticalTo(I2))
return j;
}
// Scan backwards.
for (unsigned j = i-1; j != ~0U && Ops[j].Rank == XRank; --j) {
if (Ops[j].Op == X)
return j;
if (Instruction *I1 = dyn_cast<Instruction>(Ops[j].Op))
if (Instruction *I2 = dyn_cast<Instruction>(X))
if (I1->isIdenticalTo(I2))
return j;
}
return i;
}
/// Emit a tree of add instructions, summing Ops together
/// and returning the result. Insert the tree before I.
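/// For example, Ops = [a, b, c] yields the left-nested tree (a + b) + c.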
static Value *EmitAddTreeOfValues(Instruction *I,
SmallVectorImpl<WeakTrackingVH> &Ops) {
if (Ops.size() == 1) return Ops.back();
Value *V1 = Ops.back();
Ops.pop_back();
Value *V2 = EmitAddTreeOfValues(I, Ops);
return CreateAdd(V2, V1, "tmp", I, I);
}
/// If V is an expression tree that is a multiplication sequence,
/// and if this sequence contains a multiply by Factor,
/// remove Factor from the tree and return the new tree.
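/// For example, removing the factor A from A*B*C leaves B*C. Removing -2
/// from 2*X matches the negated constant 2, so the result is X wrapped in a
/// new negate instruction.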
Value *ReassociatePass::RemoveFactorFromExpression(Value *V, Value *Factor) {
BinaryOperator *BO = isReassociableOp(V, Instruction::Mul, Instruction::FMul);
if (!BO)
return nullptr;
SmallVector<RepeatedValue, 8> Tree;
MadeChange |= LinearizeExprTree(BO, Tree);
SmallVector<ValueEntry, 8> Factors;
Factors.reserve(Tree.size());
for (unsigned i = 0, e = Tree.size(); i != e; ++i) {
RepeatedValue E = Tree[i];
Factors.append(E.second.getZExtValue(),
ValueEntry(getRank(E.first), E.first));
}
bool FoundFactor = false;
bool NeedsNegate = false;
for (unsigned i = 0, e = Factors.size(); i != e; ++i) {
if (Factors[i].Op == Factor) {
FoundFactor = true;
Factors.erase(Factors.begin()+i);
break;
}
// If this is the negated version of the factor, remove it.
if (ConstantInt *FC1 = dyn_cast<ConstantInt>(Factor)) {
if (ConstantInt *FC2 = dyn_cast<ConstantInt>(Factors[i].Op))
if (FC1->getValue() == -FC2->getValue()) {
FoundFactor = NeedsNegate = true;
Factors.erase(Factors.begin()+i);
break;
}
} else if (ConstantFP *FC1 = dyn_cast<ConstantFP>(Factor)) {
if (ConstantFP *FC2 = dyn_cast<ConstantFP>(Factors[i].Op)) {
const APFloat &F1 = FC1->getValueAPF();
APFloat F2(FC2->getValueAPF());
F2.changeSign();
if (F1.compare(F2) == APFloat::cmpEqual) {
FoundFactor = NeedsNegate = true;
Factors.erase(Factors.begin() + i);
break;
}
}
}
}
if (!FoundFactor) {
// Make sure to restore the operands to the expression tree.
RewriteExprTree(BO, Factors);
return nullptr;
}
BasicBlock::iterator InsertPt = ++BO->getIterator();
// If this was just a single multiply, remove the multiply and return the only
// remaining operand.
if (Factors.size() == 1) {
RedoInsts.insert(BO);
V = Factors[0].Op;
} else {
RewriteExprTree(BO, Factors);
V = BO;
}
if (NeedsNegate)
V = CreateNeg(V, "neg", &*InsertPt, BO);
return V;
}
/// If V is a single-use multiply, recursively add its operands as factors,
/// otherwise add V to the list of factors.
///
/// Ops is the top-level list of add operands we're trying to factor.
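/// For example, for a single-use tree (a*b)*c this recurses into the right
/// operand first, so Factors ends up as [c, b, a].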
static void FindSingleUseMultiplyFactors(Value *V,
SmallVectorImpl<Value*> &Factors) {
BinaryOperator *BO = isReassociableOp(V, Instruction::Mul, Instruction::FMul);
if (!BO) {
Factors.push_back(V);
return;
}
// Otherwise, add the LHS and RHS to the list of factors.
FindSingleUseMultiplyFactors(BO->getOperand(1), Factors);
FindSingleUseMultiplyFactors(BO->getOperand(0), Factors);
}
/// Optimize a series of operands to an 'and', 'or', or 'xor' instruction.
/// This optimizes based on identities. If it can be reduced to a single Value,
/// it is returned, otherwise the Ops list is mutated as necessary.
static Value *OptimizeAndOrXor(unsigned Opcode,
SmallVectorImpl<ValueEntry> &Ops) {
// Scan the operand lists looking for X and ~X pairs, along with X,X pairs.
// If we find any, we can simplify the expression. X&~X == 0, X|~X == -1.
for (unsigned i = 0, e = Ops.size(); i != e; ++i) {
// First, check for X and ~X in the operand list.
assert(i < Ops.size());
if (BinaryOperator::isNot(Ops[i].Op)) { // Cannot occur for ^.
Value *X = BinaryOperator::getNotArgument(Ops[i].Op);
unsigned FoundX = FindInOperandList(Ops, i, X);
if (FoundX != i) {
if (Opcode == Instruction::And) // ...&X&~X = 0
return Constant::getNullValue(X->getType());
if (Opcode == Instruction::Or) // ...|X|~X = -1
return Constant::getAllOnesValue(X->getType());
}
}
// Next, check for duplicate pairs of values, which we assume are next to
// each other, due to our sorting criteria.
assert(i < Ops.size());
if (i+1 != Ops.size() && Ops[i+1].Op == Ops[i].Op) {
if (Opcode == Instruction::And || Opcode == Instruction::Or) {
// Drop duplicate values for And and Or.
Ops.erase(Ops.begin()+i);
--i; --e;
++NumAnnihil;
continue;
}
// Drop pairs of values for Xor.
assert(Opcode == Instruction::Xor);
if (e == 2)
return Constant::getNullValue(Ops[0].Op->getType());
// Y ^ X^X -> Y
Ops.erase(Ops.begin()+i, Ops.begin()+i+2);
i -= 1; e -= 2;
++NumAnnihil;
}
}
return nullptr;
}
/// Helper function of CombineXorOpnd(). It creates a bitwise-and
/// instruction with the given two operands, and returns the resulting
/// instruction. There are two special cases: 1) if the constant operand is 0,
/// it will return NULL. 2) if the constant is ~0, the symbolic operand will
/// be returned.
static Value *createAndInstr(Instruction *InsertBefore, Value *Opnd,
const APInt &ConstOpnd) {
if (ConstOpnd.isNullValue())
return nullptr;
if (ConstOpnd.isAllOnesValue())
return Opnd;
Instruction *I = BinaryOperator::CreateAnd(
Opnd, ConstantInt::get(Opnd->getType(), ConstOpnd), "and.ra",
InsertBefore);
I->setDebugLoc(InsertBefore->getDebugLoc());
return I;
}
// Helper function of OptimizeXor(). It tries to simplify "Opnd1 ^ ConstOpnd"
// into "R ^ C", where C would be 0, and R is a symbolic value.
//
// If it was successful, true is returned, and "R" and "C" are returned
// via "Res" and "ConstOpnd", respectively; otherwise, false is returned,
// and both "Res" and "ConstOpnd" remain unchanged.
//
bool ReassociatePass::CombineXorOpnd(Instruction *I, XorOpnd *Opnd1,
APInt &ConstOpnd, Value *&Res) {
// Xor-Rule 1: (x | c1) ^ c2 = (x | c1) ^ (c1 ^ c1) ^ c2
// = ((x | c1) ^ c1) ^ (c1 ^ c2)
// = (x & ~c1) ^ (c1 ^ c2)
// It is useful only when c1 == c2.
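// Worked example (4-bit values, chosen for illustration): with x = 0b0101
// and c1 = c2 = 0b1100, (x | 12) ^ 12 = 13 ^ 12 = 1, which equals
// x & ~12 = 0b0101 & 0b0011 = 1.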
if (!Opnd1->isOrExpr() || Opnd1->getConstPart().isNullValue())
return false;
if (!Opnd1->getValue()->hasOneUse())
return false;
const APInt &C1 = Opnd1->getConstPart();
if (C1 != ConstOpnd)
return false;
Value *X = Opnd1->getSymbolicPart();
Res = createAndInstr(I, X, ~C1);
// ConstOpnd was C2, now C1 ^ C2.
ConstOpnd ^= C1;
if (Instruction *T = dyn_cast<Instruction>(Opnd1->getValue()))
RedoInsts.insert(T);
return true;
}
// Helper function of OptimizeXor(). It tries to simplify
// "Opnd1 ^ Opnd2 ^ ConstOpnd" into "R ^ C", where C would be 0, and R is a
// symbolic value.
//
// If it was successful, true is returned, and "R" and "C" are returned
// via "Res" and "ConstOpnd", respectively (if the entire expression is
// evaluated to a constant, Res is set to NULL); otherwise, false is
// returned, and both "Res" and "ConstOpnd" remain unchanged.
bool ReassociatePass::CombineXorOpnd(Instruction *I, XorOpnd *Opnd1,
XorOpnd *Opnd2, APInt &ConstOpnd,
Value *&Res) {
Value *X = Opnd1->getSymbolicPart();
if (X != Opnd2->getSymbolicPart())
return false;
// This many instructions become dead. (At least "Opnd1 ^ Opnd2" will die.)
int DeadInstNum = 1;
if (Opnd1->getValue()->hasOneUse())
DeadInstNum++;
if (Opnd2->getValue()->hasOneUse())
DeadInstNum++;
// Xor-Rule 2:
// (x | c1) ^ (x & c2)
// = (x|c1) ^ (x&c2) ^ (c1 ^ c1) = ((x|c1) ^ c1) ^ (x & c2) ^ c1
// = (x & ~c1) ^ (x & c2) ^ c1 // Xor-Rule 1
// = (x & c3) ^ c1, where c3 = ~c1 ^ c2 // Xor-rule 3
//
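// Worked example (4-bit values, chosen for illustration): with x = 0b0101,
// c1 = 0b1100, c2 = 0b1010: (x|12) ^ (x&10) = 13 ^ 0 = 13, and
// c3 = ~12 ^ 10 = 3 ^ 10 = 9, so (x & 9) ^ 12 = 1 ^ 12 = 13 as well.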
if (Opnd1->isOrExpr() != Opnd2->isOrExpr()) {
if (Opnd2->isOrExpr())
std::swap(Opnd1, Opnd2);
const APInt &C1 = Opnd1->getConstPart();
const APInt &C2 = Opnd2->getConstPart();
APInt C3((~C1) ^ C2);
// Do not increase code size!
if (!C3.isNullValue() && !C3.isAllOnesValue()) {
int NewInstNum = ConstOpnd.getBoolValue() ? 1 : 2;
if (NewInstNum > DeadInstNum)
return false;
}
Res = createAndInstr(I, X, C3);
ConstOpnd ^= C1;
} else if (Opnd1->isOrExpr()) {
// Xor-Rule 3: (x | c1) ^ (x | c2) = (x & c3) ^ c3 where c3 = c1 ^ c2
//
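// Worked example (4-bit values): with x = 0b0101, c1 = 0b1100, c2 = 0b1010:
// (x|12) ^ (x|10) = 13 ^ 15 = 2, and c3 = 12 ^ 10 = 6, so
// (x & 6) ^ 6 = 4 ^ 6 = 2 as well.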
const APInt &C1 = Opnd1->getConstPart();
const APInt &C2 = Opnd2->getConstPart();
APInt C3 = C1 ^ C2;
// Do not increase code size
if (!C3.isNullValue() && !C3.isAllOnesValue()) {
int NewInstNum = ConstOpnd.getBoolValue() ? 1 : 2;
if (NewInstNum > DeadInstNum)
return false;
}
Res = createAndInstr(I, X, C3);
ConstOpnd ^= C3;
} else {
// Xor-Rule 4: (x & c1) ^ (x & c2) = (x & (c1^c2))
//
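// Worked example: (x & 12) ^ (x & 10) = x & 6; for x = 0b0101 both sides
// evaluate to 4: 4 ^ 0 = 4 and 0b0101 & 6 = 4.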
const APInt &C1 = Opnd1->getConstPart();
const APInt &C2 = Opnd2->getConstPart();
APInt C3 = C1 ^ C2;
Res = createAndInstr(I, X, C3);
}
// Put the original operands in the Redo list; hope they will be deleted
// as dead code.
if (Instruction *T = dyn_cast<Instruction>(Opnd1->getValue()))
RedoInsts.insert(T);
if (Instruction *T = dyn_cast<Instruction>(Opnd2->getValue()))
RedoInsts.insert(T);
return true;
}
/// Optimize a series of operands to an 'xor' instruction. If it can be reduced
/// to a single Value, it is returned, otherwise the Ops list is mutated as
/// necessary.
Value *ReassociatePass::OptimizeXor(Instruction *I,
SmallVectorImpl<ValueEntry> &Ops) {
if (Value *V = OptimizeAndOrXor(Instruction::Xor, Ops))
return V;
if (Ops.size() == 1)
return nullptr;
SmallVector<XorOpnd, 8> Opnds;
SmallVector<XorOpnd*, 8> OpndPtrs;
Type *Ty = Ops[0].Op->getType();
APInt ConstOpnd(Ty->getScalarSizeInBits(), 0);
// Step 1: Convert ValueEntry to XorOpnd
for (unsigned i = 0, e = Ops.size(); i != e; ++i) {
Value *V = Ops[i].Op;
const APInt *C;
// TODO: Support non-splat vectors.
if (match(V, PatternMatch::m_APInt(C))) {
ConstOpnd ^= *C;
} else {
XorOpnd O(V);
O.setSymbolicRank(getRank(O.getSymbolicPart()));
Opnds.push_back(O);
}
}
// NOTE: From this point on, do *NOT* add or delete elements of "Opnds".
// Doing so would invalidate the pointers into "Opnds", and hence invalidate
// "OpndPtrs" as well. For a similar reason, do not fuse this loop
// with the previous loop --- iterators into "Opnds" may be invalidated
// when new elements are added to the vector.
for (unsigned i = 0, e = Opnds.size(); i != e; ++i)
OpndPtrs.push_back(&Opnds[i]);
// Step 2: Sort the Xor-Operands in a way such that the operands containing
// the same symbolic value cluster together. For instance, the input operand
// sequence ("x | 123", "y & 456", "x & 789") will be sorted into:
// ("x | 123", "x & 789", "y & 456").
//
// The purpose is twofold:
// 1) Cluster together the operands sharing the same symbolic-value.
// 2) Operands with a smaller symbolic-value rank are permuted earlier, which
// can potentially shorten the critical path and expose more loop-invariants.
// Note that values' ranks are basically defined in RPO order (FIXME).
// So, if Rank(X) < Rank(Y) < Rank(Z), X is defined earlier
// than Y, which is defined earlier than Z. Permuting "x | 1", "y & 2",
// "z" into the order X-Y-Z is better than any other order.
std::stable_sort(OpndPtrs.begin(), OpndPtrs.end(),
[](XorOpnd *LHS, XorOpnd *RHS) {
return LHS->getSymbolicRank() < RHS->getSymbolicRank();
});
// Step 3: Combine adjacent operands
XorOpnd *PrevOpnd = nullptr;
bool Changed = false;
for (unsigned i = 0, e = Opnds.size(); i < e; i++) {
XorOpnd *CurrOpnd = OpndPtrs[i];
// The combined value
Value *CV;
// Step 3.1: Try simplifying "CurrOpnd ^ ConstOpnd"
if (!ConstOpnd.isNullValue() &&
CombineXorOpnd(I, CurrOpnd, ConstOpnd, CV)) {
Changed = true;
if (CV)
*CurrOpnd = XorOpnd(CV);
else {
CurrOpnd->Invalidate();
continue;
}
}
if (!PrevOpnd || CurrOpnd->getSymbolicPart() != PrevOpnd->getSymbolicPart()) {
PrevOpnd = CurrOpnd;
continue;
}
// Step 3.2: When the previous and current operands share the same symbolic
// value, try to simplify "PrevOpnd ^ CurrOpnd ^ ConstOpnd"
//
if (CombineXorOpnd(I, CurrOpnd, PrevOpnd, ConstOpnd, CV)) {
// Remove previous operand
PrevOpnd->Invalidate();
if (CV) {
*CurrOpnd = XorOpnd(CV);
PrevOpnd = CurrOpnd;
} else {
CurrOpnd->Invalidate();
PrevOpnd = nullptr;
}
Changed = true;
}
}
// Step 4: Reassemble the Ops
if (Changed) {
Ops.clear();
for (unsigned int i = 0, e = Opnds.size(); i < e; i++) {
XorOpnd &O = Opnds[i];
if (O.isInvalid())
continue;
ValueEntry VE(getRank(O.getValue()), O.getValue());
Ops.push_back(VE);
}
if (!ConstOpnd.isNullValue()) {
Value *C = ConstantInt::get(Ty, ConstOpnd);
ValueEntry VE(getRank(C), C);
Ops.push_back(VE);
}
unsigned Sz = Ops.size();
if (Sz == 1)
return Ops.back().Op;
if (Sz == 0) {
assert(ConstOpnd.isNullValue());
return ConstantInt::get(Ty, ConstOpnd);
}
}
return nullptr;
}
/// Optimize a series of operands to an 'add' instruction. This
/// optimizes based on identities. If it can be reduced to a single Value, it
/// is returned, otherwise the Ops list is mutated as necessary.
Value *ReassociatePass::OptimizeAdd(Instruction *I,
SmallVectorImpl<ValueEntry> &Ops) {
// Scan the operand lists looking for X and -X pairs. If we find any, we
// can simplify expressions like X+-X == 0 and X+~X == -1. While we're at
// it, scan for any duplicates. We want to canonicalize Y+Y+Y+Z -> 3*Y+Z.
for (unsigned i = 0, e = Ops.size(); i != e; ++i) {
Value *TheOp = Ops[i].Op;
// Check to see if we've seen this operand before. If so, we factor all
// instances of the operand together. Due to our sorting criteria, we know
// that these need to be next to each other in the vector.
if (i+1 != Ops.size() && Ops[i+1].Op == TheOp) {
// Rescan the list, remove all instances of this operand from the expr.
unsigned NumFound = 0;
do {
Ops.erase(Ops.begin()+i);
++NumFound;
} while (i != Ops.size() && Ops[i].Op == TheOp);
DEBUG(dbgs() << "\nFACTORING [" << NumFound << "]: " << *TheOp << '\n');
++NumFactor;
// Insert a new multiply.
Type *Ty = TheOp->getType();
Constant *C = Ty->isIntOrIntVectorTy() ?
ConstantInt::get(Ty, NumFound) : ConstantFP::get(Ty, NumFound);
Instruction *Mul = CreateMul(TheOp, C, "factor", I, I);
// Now that we have inserted a multiply, optimize it. This allows us to
// handle cases that require multiple factoring steps, such as this:
// (X*2) + (X*2) + (X*2) -> (X*2)*3 -> X*6
RedoInsts.insert(Mul);
// If every add operand was a duplicate, return the multiply.
if (Ops.empty())
return Mul;
// Otherwise, we had some input that didn't have the dupe, such as
// "A + A + B" -> "A*2 + B". Add the new multiply to the list of
// things being added by this operation.
Ops.insert(Ops.begin(), ValueEntry(getRank(Mul), Mul));
--i;
e = Ops.size();
continue;
}
// Check for X and -X or X and ~X in the operand list.
if (!BinaryOperator::isNeg(TheOp) && !BinaryOperator::isFNeg(TheOp) &&
!BinaryOperator::isNot(TheOp))
continue;
Value *X = nullptr;
if (BinaryOperator::isNeg(TheOp) || BinaryOperator::isFNeg(TheOp))
X = BinaryOperator::getNegArgument(TheOp);
else if (BinaryOperator::isNot(TheOp))
X = BinaryOperator::getNotArgument(TheOp);
unsigned FoundX = FindInOperandList(Ops, i, X);
if (FoundX == i)
continue;
// Remove X and -X from the operand list.
if (Ops.size() == 2 &&
(BinaryOperator::isNeg(TheOp) || BinaryOperator::isFNeg(TheOp)))
return Constant::getNullValue(X->getType());
// Remove X and ~X from the operand list.
if (Ops.size() == 2 && BinaryOperator::isNot(TheOp))
return Constant::getAllOnesValue(X->getType());
Ops.erase(Ops.begin()+i);
if (i < FoundX)
--FoundX;
else
--i; // Need to back up an extra one.
Ops.erase(Ops.begin()+FoundX);
++NumAnnihil;
--i; // Revisit element.
e -= 2; // Removed two elements.
// If we removed X and ~X, append -1 to the operand list.
if (BinaryOperator::isNot(TheOp)) {
Value *V = Constant::getAllOnesValue(X->getType());
Ops.insert(Ops.end(), ValueEntry(getRank(V), V));
e += 1;
}
}
// Scan the operand list, checking to see if there are any common factors
// between operands. Consider something like A*A+A*B*C+D. We would like to
// reassociate this to A*(A+B*C)+D, which reduces the number of multiplies.
// To efficiently find this, we count the number of times a factor occurs
// for any ADD operands that are MULs.
DenseMap<Value*, unsigned> FactorOccurrences;
// Keep track of each multiply we see, to avoid triggering on (X*4)+(X*4)
// where they are actually the same multiply.
unsigned MaxOcc = 0;
Value *MaxOccVal = nullptr;
for (unsigned i = 0, e = Ops.size(); i != e; ++i) {
BinaryOperator *BOp =
isReassociableOp(Ops[i].Op, Instruction::Mul, Instruction::FMul);
if (!BOp)
continue;
// Compute all of the factors of this added value.
SmallVector<Value*, 8> Factors;
FindSingleUseMultiplyFactors(BOp, Factors);
assert(Factors.size() > 1 && "Bad linearize!");
// Add one to FactorOccurrences for each unique factor in this op.
SmallPtrSet<Value*, 8> Duplicates;
for (unsigned i = 0, e = Factors.size(); i != e; ++i) {
Value *Factor = Factors[i];
if (!Duplicates.insert(Factor).second)
continue;
unsigned Occ = ++FactorOccurrences[Factor];
if (Occ > MaxOcc) {
MaxOcc = Occ;
MaxOccVal = Factor;
}
// If Factor is a negative constant, add the negated value as a factor
// because we can percolate the negate out. Watch for minint, which
// cannot be made positive.
if (ConstantInt *CI = dyn_cast<ConstantInt>(Factor)) {
if (CI->isNegative() && !CI->isMinValue(true)) {
Factor = ConstantInt::get(CI->getContext(), -CI->getValue());
if (!Duplicates.insert(Factor).second)
continue;
unsigned Occ = ++FactorOccurrences[Factor];
if (Occ > MaxOcc) {
MaxOcc = Occ;
MaxOccVal = Factor;
}
}
} else if (ConstantFP *CF = dyn_cast<ConstantFP>(Factor)) {
if (CF->isNegative()) {
APFloat F(CF->getValueAPF());
F.changeSign();
Factor = ConstantFP::get(CF->getContext(), F);
if (!Duplicates.insert(Factor).second)
continue;
unsigned Occ = ++FactorOccurrences[Factor];
if (Occ > MaxOcc) {
MaxOcc = Occ;
MaxOccVal = Factor;
}
}
}
}
}
// If any factor occurred more than one time, we can pull it out.
if (MaxOcc > 1) {
DEBUG(dbgs() << "\nFACTORING [" << MaxOcc << "]: " << *MaxOccVal << '\n');
++NumFactor;
// Create a new instruction that uses the MaxOccVal twice. If we don't do
// this, we could otherwise run into situations where removing a factor
// from an expression will drop a use of maxocc, and this can cause
// RemoveFactorFromExpression on successive values to behave differently.
Instruction *DummyInst =
I->getType()->isIntOrIntVectorTy()
? BinaryOperator::CreateAdd(MaxOccVal, MaxOccVal)
: BinaryOperator::CreateFAdd(MaxOccVal, MaxOccVal);
SmallVector<WeakTrackingVH, 4> NewMulOps;
for (unsigned i = 0; i != Ops.size(); ++i) {
// Only try to remove factors from expressions we're allowed to.
BinaryOperator *BOp =
isReassociableOp(Ops[i].Op, Instruction::Mul, Instruction::FMul);
if (!BOp)
continue;
if (Value *V = RemoveFactorFromExpression(Ops[i].Op, MaxOccVal)) {
// The factorized operand may occur several times. Convert them all in
// one fell swoop.
for (unsigned j = Ops.size(); j != i;) {
--j;
if (Ops[j].Op == Ops[i].Op) {
NewMulOps.push_back(V);
Ops.erase(Ops.begin()+j);
}
}
--i;
}
}
// No need for extra uses anymore.
DummyInst->deleteValue();
unsigned NumAddedValues = NewMulOps.size();
Value *V = EmitAddTreeOfValues(I, NewMulOps);
// Now that we have inserted the add tree, optimize it. This allows us to
// handle cases that require multiple factoring steps, such as this:
// A*A*B + A*A*C --> A*(A*B+A*C) --> A*(A*(B+C))
assert(NumAddedValues > 1 && "Each occurrence should contribute a value");
(void)NumAddedValues;
if (Instruction *VI = dyn_cast<Instruction>(V))
RedoInsts.insert(VI);
// Create the multiply.
Instruction *V2 = CreateMul(V, MaxOccVal, "tmp", I, I);
// Rerun associate on the multiply in case the inner expression turned into
// a multiply. We want to make sure that we keep things in canonical form.
RedoInsts.insert(V2);
// If every add operand included the factor (e.g. "A*B + A*C"), then the
// entire result expression is just the multiply "A*(B+C)".
if (Ops.empty())
return V2;
// Otherwise, we had some input that didn't have the factor, such as
// "A*B + A*C + D" -> "A*(B+C) + D". Add the new multiply to the list of
// things being added by this operation.
Ops.insert(Ops.begin(), ValueEntry(getRank(V2), V2));
}
return nullptr;
}
/// \brief Build up a vector of value/power pairs factoring a product.
///
/// Given a series of multiplication operands, build a vector of factors and
/// the powers each is raised to when forming the final product. Sort them in
/// the order of descending power.
///
/// (x*x) -> [(x, 2)]
/// ((x*x)*x) -> [(x, 3)]
/// ((((x*y)*x)*y)*x) -> [(x, 3), (y, 2)]
///
/// \returns Whether any factors have a power greater than one.
static bool collectMultiplyFactors(SmallVectorImpl<ValueEntry> &Ops,
SmallVectorImpl<Factor> &Factors) {
// FIXME: Have Ops be (ValueEntry, Multiplicity) pairs, simplifying this.
// Compute the sum of powers of simplifiable factors.
unsigned FactorPowerSum = 0;
for (unsigned Idx = 1, Size = Ops.size(); Idx < Size; ++Idx) {
Value *Op = Ops[Idx-1].Op;
// Count the number of occurrences of this value.
unsigned Count = 1;
for (; Idx < Size && Ops[Idx].Op == Op; ++Idx)
++Count;
// Track for simplification all factors which occur 2 or more times.
if (Count > 1)
FactorPowerSum += Count;
}
// We can only simplify factors if the sum of the powers of our simplifiable
// factors is 4 or higher. When that is the case, we will *always* have
// a simplification. This is an important invariant to prevent cyclically
// trying to simplify already minimal formations.
if (FactorPowerSum < 4)
return false;
// Now gather the simplifiable factors, removing them from Ops.
FactorPowerSum = 0;
for (unsigned Idx = 1; Idx < Ops.size(); ++Idx) {
Value *Op = Ops[Idx-1].Op;
// Count the number of occurrences of this value.
unsigned Count = 1;
for (; Idx < Ops.size() && Ops[Idx].Op == Op; ++Idx)
++Count;
if (Count == 1)
continue;
// Move an even number of occurrences to Factors.
Count &= ~1U;
Idx -= Count;
FactorPowerSum += Count;
Factors.push_back(Factor(Op, Count));
Ops.erase(Ops.begin()+Idx, Ops.begin()+Idx+Count);
}
// None of the adjustments above should have reduced the sum of factor powers
// below our minimum of '4'.
assert(FactorPowerSum >= 4);
std::stable_sort(Factors.begin(), Factors.end(),
[](const Factor &LHS, const Factor &RHS) {
return LHS.Power > RHS.Power;
});
return true;
}
/// \brief Build a tree of multiplies, computing the product of Ops.
static Value *buildMultiplyTree(IRBuilder<> &Builder,
SmallVectorImpl<Value*> &Ops) {
if (Ops.size() == 1)
return Ops.back();
Value *LHS = Ops.pop_back_val();
do {
if (LHS->getType()->isIntOrIntVectorTy())
LHS = Builder.CreateMul(LHS, Ops.pop_back_val());
else
LHS = Builder.CreateFMul(LHS, Ops.pop_back_val());
} while (!Ops.empty());
return LHS;
}
/// \brief Build a minimal multiplication DAG for (a^x)*(b^y)*(c^z)*...
///
/// Given a vector of values raised to various powers, where no two values are
/// equal and the powers are sorted in decreasing order, compute the minimal
/// DAG of multiplies to compute the final product, and return that product
/// value.
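/// For example, a^3 * b^2 (Factors = [(a, 3), (b, 2)]) becomes
/// a * ((a*b) * (a*b)): 'a' has an odd power and joins the outer product,
/// the halved powers leave [(a, 1), (b, 1)], and the recursive call yields
/// the shared subexpression a*b, which is then squared.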
Value *
ReassociatePass::buildMinimalMultiplyDAG(IRBuilder<> &Builder,
SmallVectorImpl<Factor> &Factors) {
assert(Factors[0].Power);
SmallVector<Value *, 4> OuterProduct;
for (unsigned LastIdx = 0, Idx = 1, Size = Factors.size();
Idx < Size && Factors[Idx].Power > 0; ++Idx) {
if (Factors[Idx].Power != Factors[LastIdx].Power) {
LastIdx = Idx;
continue;
}
// We want to multiply across all the factors with the same power so that
// we can raise them to that power as a single entity. Build a mini tree
// for that.
SmallVector<Value *, 4> InnerProduct;
InnerProduct.push_back(Factors[LastIdx].Base);
do {
InnerProduct.push_back(Factors[Idx].Base);
++Idx;
} while (Idx < Size && Factors[Idx].Power == Factors[LastIdx].Power);
// Reset the base value of the first factor to the new expression tree.
// We'll remove all the factors with the same power in a second pass.
Value *M = Factors[LastIdx].Base = buildMultiplyTree(Builder, InnerProduct);
if (Instruction *MI = dyn_cast<Instruction>(M))
RedoInsts.insert(MI);
LastIdx = Idx;
}
// Unique factors with equal powers -- we've folded them into the first one's
// base.
Factors.erase(std::unique(Factors.begin(), Factors.end(),
[](const Factor &LHS, const Factor &RHS) {
return LHS.Power == RHS.Power;
}),
Factors.end());
// Iteratively collect the base of each factor with an odd power into the
// outer product, and halve each power in preparation for squaring the
// expression.
for (unsigned Idx = 0, Size = Factors.size(); Idx != Size; ++Idx) {
if (Factors[Idx].Power & 1)
OuterProduct.push_back(Factors[Idx].Base);
Factors[Idx].Power >>= 1;
}
if (Factors[0].Power) {
Value *SquareRoot = buildMinimalMultiplyDAG(Builder, Factors);
OuterProduct.push_back(SquareRoot);
OuterProduct.push_back(SquareRoot);
}
if (OuterProduct.size() == 1)
return OuterProduct.front();
Value *V = buildMultiplyTree(Builder, OuterProduct);
return V;
}
Value *ReassociatePass::OptimizeMul(BinaryOperator *I,
SmallVectorImpl<ValueEntry> &Ops) {
// We can only optimize the multiplies when there is a chain of more than
// three, such that a balanced tree might require fewer total multiplies.
if (Ops.size() < 4)
return nullptr;
// Try to turn linear trees of multiplies without other uses of the
// intermediate stages into minimal multiply DAGs with perfect sub-expression
// re-use.
SmallVector<Factor, 4> Factors;
if (!collectMultiplyFactors(Ops, Factors))
return nullptr; // All distinct factors, so nothing left for us to do.
IRBuilder<> Builder(I);
// The reassociate transformation for FP operations is performed only
// if unsafe algebra is permitted by FastMathFlags. Propagate those flags
// to the newly generated operations.
if (auto FPI = dyn_cast<FPMathOperator>(I))
Builder.setFastMathFlags(FPI->getFastMathFlags());
Value *V = buildMinimalMultiplyDAG(Builder, Factors);
if (Ops.empty())
return V;
ValueEntry NewEntry = ValueEntry(getRank(V), V);
Ops.insert(std::lower_bound(Ops.begin(), Ops.end(), NewEntry), NewEntry);
return nullptr;
}
Value *ReassociatePass::OptimizeExpression(BinaryOperator *I,
SmallVectorImpl<ValueEntry> &Ops) {
// Now that we have the linearized expression tree, try to optimize it.
// Start by folding any constants that we found.
Constant *Cst = nullptr;
unsigned Opcode = I->getOpcode();
while (!Ops.empty() && isa<Constant>(Ops.back().Op)) {
Constant *C = cast<Constant>(Ops.pop_back_val().Op);
Cst = Cst ? ConstantExpr::get(Opcode, C, Cst) : C;
}
// If there was nothing but constants then we are done.
if (Ops.empty())
return Cst;
// Put the combined constant back at the end of the operand list, except if
// there is no point. For example, an add of 0 gets dropped here, while a
// multiplication by zero turns the whole expression into zero.
if (Cst && Cst != ConstantExpr::getBinOpIdentity(Opcode, I->getType())) {
if (Cst == ConstantExpr::getBinOpAbsorber(Opcode, I->getType()))
return Cst;
Ops.push_back(ValueEntry(0, Cst));
}
if (Ops.size() == 1) return Ops[0].Op;
// Handle destructive annihilation due to identities between elements in the
// argument list here.
unsigned NumOps = Ops.size();
switch (Opcode) {
default: break;
case Instruction::And:
case Instruction::Or:
if (Value *Result = OptimizeAndOrXor(Opcode, Ops))
return Result;
break;
case Instruction::Xor:
if (Value *Result = OptimizeXor(I, Ops))
return Result;
break;
case Instruction::Add:
case Instruction::FAdd:
if (Value *Result = OptimizeAdd(I, Ops))
return Result;
break;
case Instruction::Mul:
case Instruction::FMul:
if (Value *Result = OptimizeMul(I, Ops))
return Result;
break;
}
if (Ops.size() != NumOps)
return OptimizeExpression(I, Ops);
return nullptr;
}
// Remove dead instructions and, if any operands become trivially dead, add
// them to Insts so they will be removed as well.
void ReassociatePass::RecursivelyEraseDeadInsts(
Instruction *I, SetVector<AssertingVH<Instruction>> &Insts) {
assert(isInstructionTriviallyDead(I) && "Trivially dead instructions only!");
SmallVector<Value *, 4> Ops(I->op_begin(), I->op_end());
ValueRankMap.erase(I);
Insts.remove(I);
RedoInsts.remove(I);
I->eraseFromParent();
for (auto Op : Ops)
if (Instruction *OpInst = dyn_cast<Instruction>(Op))
if (OpInst->use_empty())
Insts.insert(OpInst);
}
/// Zap the given instruction, adding interesting operands to the work list.
void ReassociatePass::EraseInst(Instruction *I) {
assert(isInstructionTriviallyDead(I) && "Trivially dead instructions only!");
DEBUG(dbgs() << "Erasing dead inst: "; I->dump());
SmallVector<Value*, 8> Ops(I->op_begin(), I->op_end());
// Erase the dead instruction.
ValueRankMap.erase(I);
RedoInsts.remove(I);
I->eraseFromParent();
// Optimize its operands.
SmallPtrSet<Instruction *, 8> Visited; // Detect self-referential nodes.
for (unsigned i = 0, e = Ops.size(); i != e; ++i)
if (Instruction *Op = dyn_cast<Instruction>(Ops[i])) {
// If this is a node in an expression tree, climb to the expression root
// and add that since that's where optimization actually happens.
unsigned Opcode = Op->getOpcode();
while (Op->hasOneUse() && Op->user_back()->getOpcode() == Opcode &&
Visited.insert(Op).second)
Op = Op->user_back();
RedoInsts.insert(Op);
}
MadeChange = true;
}
// Canonicalize expressions of the following form:
// x + (-Constant * y) -> x - (Constant * y)
// x - (-Constant * y) -> x + (Constant * y)
Instruction *ReassociatePass::canonicalizeNegConstExpr(Instruction *I) {
if (!I->hasOneUse() || I->getType()->isVectorTy())
return nullptr;
// Must be a fmul or fdiv instruction.
unsigned Opcode = I->getOpcode();
if (Opcode != Instruction::FMul && Opcode != Instruction::FDiv)
return nullptr;
auto *C0 = dyn_cast<ConstantFP>(I->getOperand(0));
auto *C1 = dyn_cast<ConstantFP>(I->getOperand(1));
// If both operands are constant, let the expression get constant folded away.
if (C0 && C1)
return nullptr;
ConstantFP *CF = C0 ? C0 : C1;
// Must have one constant operand.
if (!CF)
return nullptr;
// Must be a negative ConstantFP.
if (!CF->isNegative())
return nullptr;
// User must be a binary operator with one or more uses.
Instruction *User = I->user_back();
if (!isa<BinaryOperator>(User) || User->use_empty())
return nullptr;
unsigned UserOpcode = User->getOpcode();
if (UserOpcode != Instruction::FAdd && UserOpcode != Instruction::FSub)
return nullptr;
// Subtraction is not commutative. Explicitly, the following transform is
// not valid: (-Constant * y) - x -> x + (Constant * y)
if (!User->isCommutative() && User->getOperand(1) != I)
return nullptr;
+ // Don't canonicalize x + (-Constant * y) -> x - (Constant * y), if the
+ // resulting subtract will be broken up later. This can get us into an
+ // infinite loop during reassociation.
+ if (UserOpcode == Instruction::FAdd && ShouldBreakUpSubtract(User))
+ return nullptr;
+
// Change the sign of the constant.
APFloat Val = CF->getValueAPF();
Val.changeSign();
I->setOperand(C0 ? 0 : 1, ConstantFP::get(CF->getContext(), Val));
// Canonicalize I to RHS to simplify the next bit of logic. E.g.,
// ((-Const*y) + x) -> (x + (-Const*y)).
if (User->getOperand(0) == I && User->isCommutative())
cast<BinaryOperator>(User)->swapOperands();
Value *Op0 = User->getOperand(0);
Value *Op1 = User->getOperand(1);
BinaryOperator *NI;
switch (UserOpcode) {
default:
llvm_unreachable("Unexpected Opcode!");
case Instruction::FAdd:
NI = BinaryOperator::CreateFSub(Op0, Op1);
NI->setFastMathFlags(cast<FPMathOperator>(User)->getFastMathFlags());
break;
case Instruction::FSub:
NI = BinaryOperator::CreateFAdd(Op0, Op1);
NI->setFastMathFlags(cast<FPMathOperator>(User)->getFastMathFlags());
break;
}
NI->insertBefore(User);
NI->setName(User->getName());
User->replaceAllUsesWith(NI);
NI->setDebugLoc(I->getDebugLoc());
RedoInsts.insert(I);
MadeChange = true;
return NI;
}
/// Inspect and optimize the given instruction. Note that erasing
/// instructions is not allowed.
void ReassociatePass::OptimizeInst(Instruction *I) {
// Only consider operations that we understand.
if (!isa<BinaryOperator>(I))
return;
if (I->getOpcode() == Instruction::Shl && isa<ConstantInt>(I->getOperand(1)))
// If an operand of this shift is a reassociable multiply, or if the shift
// is used by a reassociable multiply or add, turn into a multiply.
if (isReassociableOp(I->getOperand(0), Instruction::Mul) ||
(I->hasOneUse() &&
(isReassociableOp(I->user_back(), Instruction::Mul) ||
isReassociableOp(I->user_back(), Instruction::Add)))) {
Instruction *NI = ConvertShiftToMul(I);
RedoInsts.insert(I);
MadeChange = true;
I = NI;
}
// Canonicalize negative constants out of expressions.
if (Instruction *Res = canonicalizeNegConstExpr(I))
I = Res;
// Commute binary operators, to canonicalize the order of their operands.
// This can potentially expose more CSE opportunities, and makes writing other
// transformations simpler.
if (I->isCommutative())
canonicalizeOperands(I);
// Don't optimize floating point instructions that don't have unsafe algebra.
if (I->getType()->isFPOrFPVectorTy() && !I->hasUnsafeAlgebra())
return;
// Do not reassociate boolean (i1) expressions. We want to preserve the
// original order of evaluation for short-circuited comparisons that
// SimplifyCFG has folded to AND/OR expressions. If the expression
// is not further optimized, it is likely to be transformed back to a
// short-circuited form for code gen, and the source order may have been
// optimized for the most likely conditions.
if (I->getType()->isIntegerTy(1))
return;
// If this is a subtract instruction which is not already in negate form,
// see if we can convert it to X+-Y.
if (I->getOpcode() == Instruction::Sub) {
if (ShouldBreakUpSubtract(I)) {
Instruction *NI = BreakUpSubtract(I, RedoInsts);
RedoInsts.insert(I);
MadeChange = true;
I = NI;
} else if (BinaryOperator::isNeg(I)) {
// Otherwise, this is a negation. See if the operand is a multiply tree
// and if this is not an inner node of a multiply tree.
if (isReassociableOp(I->getOperand(1), Instruction::Mul) &&
(!I->hasOneUse() ||
!isReassociableOp(I->user_back(), Instruction::Mul))) {
Instruction *NI = LowerNegateToMultiply(I);
// If the negate was simplified, revisit the users to see if we can
// reassociate further.
for (User *U : NI->users()) {
if (BinaryOperator *Tmp = dyn_cast<BinaryOperator>(U))
RedoInsts.insert(Tmp);
}
RedoInsts.insert(I);
MadeChange = true;
I = NI;
}
}
} else if (I->getOpcode() == Instruction::FSub) {
if (ShouldBreakUpSubtract(I)) {
Instruction *NI = BreakUpSubtract(I, RedoInsts);
RedoInsts.insert(I);
MadeChange = true;
I = NI;
} else if (BinaryOperator::isFNeg(I)) {
// Otherwise, this is a negation. See if the operand is a multiply tree
// and if this is not an inner node of a multiply tree.
if (isReassociableOp(I->getOperand(1), Instruction::FMul) &&
(!I->hasOneUse() ||
!isReassociableOp(I->user_back(), Instruction::FMul))) {
// If the negate was simplified, revisit the users to see if we can
// reassociate further.
Instruction *NI = LowerNegateToMultiply(I);
for (User *U : NI->users()) {
if (BinaryOperator *Tmp = dyn_cast<BinaryOperator>(U))
RedoInsts.insert(Tmp);
}
RedoInsts.insert(I);
MadeChange = true;
I = NI;
}
}
}
// If this instruction is an associative binary operator, process it.
if (!I->isAssociative()) return;
BinaryOperator *BO = cast<BinaryOperator>(I);
// If this is an interior node of a reassociable tree, ignore it until we
// get to the root of the tree, to avoid N^2 analysis.
unsigned Opcode = BO->getOpcode();
if (BO->hasOneUse() && BO->user_back()->getOpcode() == Opcode) {
// During the initial run we will get to the root of the tree.
// But if we get here while we are redoing instructions, there is no
// guarantee that the root will be visited. So queue it for redoing later.
if (BO->user_back() != BO &&
BO->getParent() == BO->user_back()->getParent())
RedoInsts.insert(BO->user_back());
return;
}
// If this is an add tree that is used by a sub instruction, ignore it
// until we process the subtract.
if (BO->hasOneUse() && BO->getOpcode() == Instruction::Add &&
cast<Instruction>(BO->user_back())->getOpcode() == Instruction::Sub)
return;
if (BO->hasOneUse() && BO->getOpcode() == Instruction::FAdd &&
cast<Instruction>(BO->user_back())->getOpcode() == Instruction::FSub)
return;
ReassociateExpression(BO);
}
void ReassociatePass::ReassociateExpression(BinaryOperator *I) {
// First, walk the expression tree, linearizing the tree, collecting the
// operand information.
SmallVector<RepeatedValue, 8> Tree;
MadeChange |= LinearizeExprTree(I, Tree);
SmallVector<ValueEntry, 8> Ops;
Ops.reserve(Tree.size());
for (unsigned i = 0, e = Tree.size(); i != e; ++i) {
RepeatedValue E = Tree[i];
Ops.append(E.second.getZExtValue(),
ValueEntry(getRank(E.first), E.first));
}
DEBUG(dbgs() << "RAIn:\t"; PrintOps(I, Ops); dbgs() << '\n');
// Now that we have linearized the tree to a list and have gathered all of
// the operands and their ranks, sort the operands by their rank. Use a
// stable_sort so that values with equal ranks will have their relative
// positions maintained (and so the compiler is deterministic). Note that
// this sorts so that the highest ranking values end up at the beginning of
// the vector.
std::stable_sort(Ops.begin(), Ops.end());
// Now that we have the expression tree in a convenient
// sorted form, optimize it globally if possible.
if (Value *V = OptimizeExpression(I, Ops)) {
if (V == I)
// Self-referential expression in unreachable code.
return;
// This expression tree simplified to something that isn't a tree,
// eliminate it.
DEBUG(dbgs() << "Reassoc to scalar: " << *V << '\n');
I->replaceAllUsesWith(V);
if (Instruction *VI = dyn_cast<Instruction>(V))
VI->setDebugLoc(I->getDebugLoc());
RedoInsts.insert(I);
++NumAnnihil;
return;
}
// We want to sink immediates as deeply as possible except in the case where
// this is a multiply tree used only by an add, and the immediate is a -1.
// In this case we reassociate to put the negation on the outside so that we
// can fold the negation into the add: (-X)*Y + Z -> Z-X*Y
if (I->hasOneUse()) {
if (I->getOpcode() == Instruction::Mul &&
cast<Instruction>(I->user_back())->getOpcode() == Instruction::Add &&
isa<ConstantInt>(Ops.back().Op) &&
cast<ConstantInt>(Ops.back().Op)->isMinusOne()) {
ValueEntry Tmp = Ops.pop_back_val();
Ops.insert(Ops.begin(), Tmp);
} else if (I->getOpcode() == Instruction::FMul &&
cast<Instruction>(I->user_back())->getOpcode() ==
Instruction::FAdd &&
isa<ConstantFP>(Ops.back().Op) &&
cast<ConstantFP>(Ops.back().Op)->isExactlyValue(-1.0)) {
ValueEntry Tmp = Ops.pop_back_val();
Ops.insert(Ops.begin(), Tmp);
}
}
DEBUG(dbgs() << "RAOut:\t"; PrintOps(I, Ops); dbgs() << '\n');
if (Ops.size() == 1) {
if (Ops[0].Op == I)
// Self-referential expression in unreachable code.
return;
// This expression tree simplified to something that isn't a tree,
// eliminate it.
I->replaceAllUsesWith(Ops[0].Op);
if (Instruction *OI = dyn_cast<Instruction>(Ops[0].Op))
OI->setDebugLoc(I->getDebugLoc());
RedoInsts.insert(I);
return;
}
// Now that we ordered and optimized the expressions, splat them back into
// the expression tree, removing any unneeded nodes.
RewriteExprTree(I, Ops);
}
PreservedAnalyses ReassociatePass::run(Function &F, FunctionAnalysisManager &) {
// Get the function's basic blocks in Reverse Post Order. This order is used by
// BuildRankMap to precalculate ranks correctly. It also excludes dead basic
// blocks (it has been seen that the analysis in this pass could hang when
// analysing dead basic blocks).
ReversePostOrderTraversal<Function *> RPOT(&F);
// Calculate the rank map for F.
BuildRankMap(F, RPOT);
MadeChange = false;
// Traverse the same blocks that were analysed by BuildRankMap.
for (BasicBlock *BI : RPOT) {
assert(RankMap.count(&*BI) && "BB should be ranked.");
// Optimize every instruction in the basic block.
for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); II != IE;)
if (isInstructionTriviallyDead(&*II)) {
EraseInst(&*II++);
} else {
OptimizeInst(&*II);
assert(II->getParent() == &*BI && "Moved to a different block!");
++II;
}
// Make a copy of all the instructions to be redone so we can remove dead
// instructions.
SetVector<AssertingVH<Instruction>> ToRedo(RedoInsts);
// Iterate over all instructions to be reevaluated and remove trivially dead
// instructions. If any operand of the trivially dead instruction becomes
// dead mark it for deletion as well. Continue this process until all
// trivially dead instructions have been removed.
while (!ToRedo.empty()) {
Instruction *I = ToRedo.pop_back_val();
if (isInstructionTriviallyDead(I)) {
RecursivelyEraseDeadInsts(I, ToRedo);
MadeChange = true;
}
}
// Now that we have removed dead instructions, we can reoptimize the
// remaining instructions.
while (!RedoInsts.empty()) {
Instruction *I = RedoInsts.pop_back_val();
if (isInstructionTriviallyDead(I))
EraseInst(I);
else
OptimizeInst(I);
}
}
// We are done with the rank map.
RankMap.clear();
ValueRankMap.clear();
if (MadeChange) {
PreservedAnalyses PA;
PA.preserveSet<CFGAnalyses>();
PA.preserve<GlobalsAA>();
return PA;
}
return PreservedAnalyses::all();
}
namespace {
class ReassociateLegacyPass : public FunctionPass {
ReassociatePass Impl;
public:
static char ID; // Pass identification, replacement for typeid
ReassociateLegacyPass() : FunctionPass(ID) {
initializeReassociateLegacyPassPass(*PassRegistry::getPassRegistry());
}
bool runOnFunction(Function &F) override {
if (skipFunction(F))
return false;
FunctionAnalysisManager DummyFAM;
auto PA = Impl.run(F, DummyFAM);
return !PA.areAllPreserved();
}
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();
AU.addPreserved<GlobalsAAWrapperPass>();
}
};
}
char ReassociateLegacyPass::ID = 0;
INITIALIZE_PASS(ReassociateLegacyPass, "reassociate",
"Reassociate expressions", false, false)
// Public interface to the Reassociate pass
FunctionPass *llvm::createReassociatePass() {
return new ReassociateLegacyPass();
}
diff --git a/lib/Transforms/Utils/CloneFunction.cpp b/lib/Transforms/Utils/CloneFunction.cpp
index 7e75e8847785..9c4e13903ed7 100644
--- a/lib/Transforms/Utils/CloneFunction.cpp
+++ b/lib/Transforms/Utils/CloneFunction.cpp
@@ -1,833 +1,834 @@
//===- CloneFunction.cpp - Clone a function into another function ---------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file implements the CloneFunctionInto interface, which is used as the
// low-level function cloner. This is used by the CloneFunction and function
// inliner to do the dirty work of copying the body of a function around.
//
//===----------------------------------------------------------------------===//
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/ValueMapper.h"
#include <map>
using namespace llvm;
/// See comments in Cloning.h.
BasicBlock *llvm::CloneBasicBlock(const BasicBlock *BB, ValueToValueMapTy &VMap,
const Twine &NameSuffix, Function *F,
ClonedCodeInfo *CodeInfo,
DebugInfoFinder *DIFinder) {
DenseMap<const MDNode *, MDNode *> Cache;
BasicBlock *NewBB = BasicBlock::Create(BB->getContext(), "", F);
if (BB->hasName()) NewBB->setName(BB->getName()+NameSuffix);
bool hasCalls = false, hasDynamicAllocas = false, hasStaticAllocas = false;
Module *TheModule = F ? F->getParent() : nullptr;
// Loop over all instructions, and copy them over.
for (BasicBlock::const_iterator II = BB->begin(), IE = BB->end();
II != IE; ++II) {
if (DIFinder && TheModule) {
if (auto *DDI = dyn_cast<DbgDeclareInst>(II))
DIFinder->processDeclare(*TheModule, DDI);
else if (auto *DVI = dyn_cast<DbgValueInst>(II))
DIFinder->processValue(*TheModule, DVI);
if (auto DbgLoc = II->getDebugLoc())
DIFinder->processLocation(*TheModule, DbgLoc.get());
}
Instruction *NewInst = II->clone();
if (II->hasName())
NewInst->setName(II->getName()+NameSuffix);
NewBB->getInstList().push_back(NewInst);
VMap[&*II] = NewInst; // Add instruction map to value.
hasCalls |= (isa<CallInst>(II) && !isa<DbgInfoIntrinsic>(II));
if (const AllocaInst *AI = dyn_cast<AllocaInst>(II)) {
if (isa<ConstantInt>(AI->getArraySize()))
hasStaticAllocas = true;
else
hasDynamicAllocas = true;
}
}
if (CodeInfo) {
CodeInfo->ContainsCalls |= hasCalls;
CodeInfo->ContainsDynamicAllocas |= hasDynamicAllocas;
CodeInfo->ContainsDynamicAllocas |= hasStaticAllocas &&
BB != &BB->getParent()->getEntryBlock();
}
return NewBB;
}
// Clone OldFunc into NewFunc, transforming the old arguments into references to
// VMap values.
//
void llvm::CloneFunctionInto(Function *NewFunc, const Function *OldFunc,
ValueToValueMapTy &VMap,
bool ModuleLevelChanges,
SmallVectorImpl<ReturnInst*> &Returns,
const char *NameSuffix, ClonedCodeInfo *CodeInfo,
ValueMapTypeRemapper *TypeMapper,
ValueMaterializer *Materializer) {
assert(NameSuffix && "NameSuffix cannot be null!");
#ifndef NDEBUG
for (const Argument &I : OldFunc->args())
assert(VMap.count(&I) && "No mapping from source argument specified!");
#endif
// Copy all attributes other than those stored in the AttributeList. We need
// to remap the parameter indices of the AttributeList.
AttributeList NewAttrs = NewFunc->getAttributes();
NewFunc->copyAttributesFrom(OldFunc);
NewFunc->setAttributes(NewAttrs);
// Fix up the personality function that got copied over.
if (OldFunc->hasPersonalityFn())
NewFunc->setPersonalityFn(
MapValue(OldFunc->getPersonalityFn(), VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges,
TypeMapper, Materializer));
SmallVector<AttributeSet, 4> NewArgAttrs(NewFunc->arg_size());
AttributeList OldAttrs = OldFunc->getAttributes();
// Clone any argument attributes that are present in the VMap.
for (const Argument &OldArg : OldFunc->args()) {
if (Argument *NewArg = dyn_cast<Argument>(VMap[&OldArg])) {
NewArgAttrs[NewArg->getArgNo()] =
OldAttrs.getParamAttributes(OldArg.getArgNo());
}
}
NewFunc->setAttributes(
AttributeList::get(NewFunc->getContext(), OldAttrs.getFnAttributes(),
OldAttrs.getRetAttributes(), NewArgAttrs));
bool MustCloneSP =
OldFunc->getParent() && OldFunc->getParent() == NewFunc->getParent();
DISubprogram *SP = OldFunc->getSubprogram();
if (SP) {
assert(!MustCloneSP || ModuleLevelChanges);
// Add mappings for some DebugInfo nodes that we don't want duplicated
// even if they're distinct.
auto &MD = VMap.MD();
MD[SP->getUnit()].reset(SP->getUnit());
MD[SP->getType()].reset(SP->getType());
MD[SP->getFile()].reset(SP->getFile());
// If we're not cloning into the same module, no need to clone the
// subprogram
if (!MustCloneSP)
MD[SP].reset(SP);
}
SmallVector<std::pair<unsigned, MDNode *>, 1> MDs;
OldFunc->getAllMetadata(MDs);
for (auto MD : MDs) {
NewFunc->addMetadata(
MD.first,
*MapMetadata(MD.second, VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges,
TypeMapper, Materializer));
}
// When we remap instructions, we want to avoid duplicating inlined
// DISubprograms, so record all subprograms we find as we duplicate
// instructions and then freeze them in the MD map.
// We also record information about dbg.value and dbg.declare to avoid
// duplicating the types.
DebugInfoFinder DIFinder;
// Loop over all of the basic blocks in the function, cloning them as
// appropriate. Note that we save BE this way in order to handle cloning of
// recursive functions into themselves.
//
for (Function::const_iterator BI = OldFunc->begin(), BE = OldFunc->end();
BI != BE; ++BI) {
const BasicBlock &BB = *BI;
// Create a new basic block and copy instructions into it!
BasicBlock *CBB = CloneBasicBlock(&BB, VMap, NameSuffix, NewFunc, CodeInfo,
SP ? &DIFinder : nullptr);
// Add basic block mapping.
VMap[&BB] = CBB;
// It is only legal to clone a function if a block address within that
// function is never referenced outside of the function. Given that, we
// want to map block addresses from the old function to block addresses in
// the clone. (This is different from the generic ValueMapper
// implementation, which generates an invalid blockaddress when
// cloning a function.)
if (BB.hasAddressTaken()) {
Constant *OldBBAddr = BlockAddress::get(const_cast<Function*>(OldFunc),
const_cast<BasicBlock*>(&BB));
VMap[OldBBAddr] = BlockAddress::get(NewFunc, CBB);
}
// Note return instructions for the caller.
if (ReturnInst *RI = dyn_cast<ReturnInst>(CBB->getTerminator()))
Returns.push_back(RI);
}
for (DISubprogram *ISP : DIFinder.subprograms()) {
if (ISP != SP) {
VMap.MD()[ISP].reset(ISP);
}
}
for (auto *Type : DIFinder.types()) {
VMap.MD()[Type].reset(Type);
}
// Loop over all of the instructions in the function, fixing up operand
// references as we go. This uses VMap to do all the hard work.
for (Function::iterator BB =
cast<BasicBlock>(VMap[&OldFunc->front()])->getIterator(),
BE = NewFunc->end();
BB != BE; ++BB)
// Loop over all instructions, fixing each one as we find it...
for (Instruction &II : *BB)
RemapInstruction(&II, VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges,
TypeMapper, Materializer);
}
/// Return a copy of the specified function and add it to that function's
/// module. Also, any references specified in the VMap are changed to refer to
/// their mapped value instead of the original one. If any of the arguments to
/// the function are in the VMap, the arguments are deleted from the resultant
/// function. The VMap is updated to include mappings from all of the
/// instructions and basicblocks in the function from their old to new values.
///
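/// A minimal usage sketch (hypothetical caller; assumes F is a Function*
/// in some module):
///   ValueToValueMapTy VMap;
///   Function *NewF = llvm::CloneFunction(F, VMap, /*CodeInfo=*/nullptr);
/// Afterwards NewF lives in F's module and VMap maps old values to clones.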
Function *llvm::CloneFunction(Function *F, ValueToValueMapTy &VMap,
ClonedCodeInfo *CodeInfo) {
std::vector<Type*> ArgTypes;
// The user might be deleting arguments to the function by specifying them in
// the VMap. If so, we must not add those arguments to the arg type vector.
//
for (const Argument &I : F->args())
if (VMap.count(&I) == 0) // Haven't mapped the argument to anything yet?
ArgTypes.push_back(I.getType());
// Create a new function type...
FunctionType *FTy = FunctionType::get(F->getFunctionType()->getReturnType(),
ArgTypes, F->getFunctionType()->isVarArg());
// Create the new function...
Function *NewF =
Function::Create(FTy, F->getLinkage(), F->getName(), F->getParent());
// Loop over the arguments, copying the names of the mapped arguments over...
Function::arg_iterator DestI = NewF->arg_begin();
for (const Argument & I : F->args())
if (VMap.count(&I) == 0) { // Is this argument preserved?
DestI->setName(I.getName()); // Copy the name over...
VMap[&I] = &*DestI++; // Add mapping to VMap
}
SmallVector<ReturnInst*, 8> Returns; // Ignore returns cloned.
CloneFunctionInto(NewF, F, VMap, F->getSubprogram() != nullptr, Returns, "",
CodeInfo);
return NewF;
}
namespace {
/// This is a private class used to implement CloneAndPruneFunctionInto.
struct PruningFunctionCloner {
Function *NewFunc;
const Function *OldFunc;
ValueToValueMapTy &VMap;
bool ModuleLevelChanges;
const char *NameSuffix;
ClonedCodeInfo *CodeInfo;
public:
PruningFunctionCloner(Function *newFunc, const Function *oldFunc,
ValueToValueMapTy &valueMap, bool moduleLevelChanges,
const char *nameSuffix, ClonedCodeInfo *codeInfo)
: NewFunc(newFunc), OldFunc(oldFunc), VMap(valueMap),
ModuleLevelChanges(moduleLevelChanges), NameSuffix(nameSuffix),
CodeInfo(codeInfo) {}
/// The specified block is found to be reachable, clone it and
/// anything that it can reach.
void CloneBlock(const BasicBlock *BB,
BasicBlock::const_iterator StartingInst,
std::vector<const BasicBlock*> &ToClone);
};
}
/// The specified block is found to be reachable, clone it and
/// anything that it can reach.
void PruningFunctionCloner::CloneBlock(const BasicBlock *BB,
BasicBlock::const_iterator StartingInst,
std::vector<const BasicBlock*> &ToClone){
WeakTrackingVH &BBEntry = VMap[BB];
// Have we already cloned this block?
if (BBEntry) return;
// Nope, clone it now.
BasicBlock *NewBB;
BBEntry = NewBB = BasicBlock::Create(BB->getContext());
if (BB->hasName()) NewBB->setName(BB->getName()+NameSuffix);
// It is only legal to clone a function if a block address within that
// function is never referenced outside of the function. Given that, we
// want to map block addresses from the old function to block addresses in
// the clone. (This is different from the generic ValueMapper
// implementation, which generates an invalid blockaddress when
// cloning a function.)
//
// Note that we don't need to fix the mapping for unreachable blocks;
// the default mapping there is safe.
if (BB->hasAddressTaken()) {
Constant *OldBBAddr = BlockAddress::get(const_cast<Function*>(OldFunc),
const_cast<BasicBlock*>(BB));
VMap[OldBBAddr] = BlockAddress::get(NewFunc, NewBB);
}
bool hasCalls = false, hasDynamicAllocas = false, hasStaticAllocas = false;
// Loop over all instructions, and copy them over, DCE'ing as we go. This
// loop doesn't include the terminator.
for (BasicBlock::const_iterator II = StartingInst, IE = --BB->end();
II != IE; ++II) {
Instruction *NewInst = II->clone();
// Eagerly remap operands to the newly cloned instruction, except for PHI
// nodes for which we defer processing until we update the CFG.
if (!isa<PHINode>(NewInst)) {
RemapInstruction(NewInst, VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges);
// If we can simplify this instruction to some other value, simply add
// a mapping to that value rather than inserting a new instruction into
// the basic block.
if (Value *V =
SimplifyInstruction(NewInst, BB->getModule()->getDataLayout())) {
// On the off-chance that this simplifies to an instruction in the old
// function, map it back into the new function.
- if (Value *MappedV = VMap.lookup(V))
- V = MappedV;
+ if (NewFunc != OldFunc)
+ if (Value *MappedV = VMap.lookup(V))
+ V = MappedV;
if (!NewInst->mayHaveSideEffects()) {
VMap[&*II] = V;
NewInst->deleteValue();
continue;
}
}
}
if (II->hasName())
NewInst->setName(II->getName()+NameSuffix);
VMap[&*II] = NewInst; // Add instruction map to value.
NewBB->getInstList().push_back(NewInst);
hasCalls |= (isa<CallInst>(II) && !isa<DbgInfoIntrinsic>(II));
if (CodeInfo)
if (auto CS = ImmutableCallSite(&*II))
if (CS.hasOperandBundles())
CodeInfo->OperandBundleCallSites.push_back(NewInst);
if (const AllocaInst *AI = dyn_cast<AllocaInst>(II)) {
if (isa<ConstantInt>(AI->getArraySize()))
hasStaticAllocas = true;
else
hasDynamicAllocas = true;
}
}
// Finally, clone over the terminator.
const TerminatorInst *OldTI = BB->getTerminator();
bool TerminatorDone = false;
if (const BranchInst *BI = dyn_cast<BranchInst>(OldTI)) {
if (BI->isConditional()) {
// If the condition was a known constant in the callee...
ConstantInt *Cond = dyn_cast<ConstantInt>(BI->getCondition());
// Or is a known constant in the caller...
if (!Cond) {
Value *V = VMap.lookup(BI->getCondition());
Cond = dyn_cast_or_null<ConstantInt>(V);
}
// Constant fold to uncond branch!
if (Cond) {
BasicBlock *Dest = BI->getSuccessor(!Cond->getZExtValue());
VMap[OldTI] = BranchInst::Create(Dest, NewBB);
ToClone.push_back(Dest);
TerminatorDone = true;
}
}
} else if (const SwitchInst *SI = dyn_cast<SwitchInst>(OldTI)) {
// If switching on a value known to be constant in the callee...
ConstantInt *Cond = dyn_cast<ConstantInt>(SI->getCondition());
if (!Cond) { // Or a known constant in the caller...
Value *V = VMap.lookup(SI->getCondition());
Cond = dyn_cast_or_null<ConstantInt>(V);
}
if (Cond) { // Constant fold to uncond branch!
SwitchInst::ConstCaseHandle Case = *SI->findCaseValue(Cond);
BasicBlock *Dest = const_cast<BasicBlock*>(Case.getCaseSuccessor());
VMap[OldTI] = BranchInst::Create(Dest, NewBB);
ToClone.push_back(Dest);
TerminatorDone = true;
}
}
if (!TerminatorDone) {
Instruction *NewInst = OldTI->clone();
if (OldTI->hasName())
NewInst->setName(OldTI->getName()+NameSuffix);
NewBB->getInstList().push_back(NewInst);
VMap[OldTI] = NewInst; // Add instruction map to value.
if (CodeInfo)
if (auto CS = ImmutableCallSite(OldTI))
if (CS.hasOperandBundles())
CodeInfo->OperandBundleCallSites.push_back(NewInst);
// Recursively clone any reachable successor blocks.
const TerminatorInst *TI = BB->getTerminator();
for (const BasicBlock *Succ : TI->successors())
ToClone.push_back(Succ);
}
if (CodeInfo) {
CodeInfo->ContainsCalls |= hasCalls;
CodeInfo->ContainsDynamicAllocas |= hasDynamicAllocas;
CodeInfo->ContainsDynamicAllocas |= hasStaticAllocas &&
BB != &BB->getParent()->front();
}
}
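// Illustrative note (not part of the original file): the terminator handling
// above is what makes this cloning "pruning". Cloning a block that ends in
//
//   br i1 true, label %live, label %dead
//
// emits an unconditional `br label %live` in the clone and enqueues only
// %live on the worklist, so %dead (and anything reachable only through it)
// is never cloned at all. Block names here are hypothetical.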
/// This works like CloneAndPruneFunctionInto, except that it does not clone the
/// entire function. Instead it starts at an instruction provided by the caller
/// and copies (and prunes) only the code reachable from that instruction.
void llvm::CloneAndPruneIntoFromInst(Function *NewFunc, const Function *OldFunc,
const Instruction *StartingInst,
ValueToValueMapTy &VMap,
bool ModuleLevelChanges,
SmallVectorImpl<ReturnInst *> &Returns,
const char *NameSuffix,
ClonedCodeInfo *CodeInfo) {
assert(NameSuffix && "NameSuffix cannot be null!");
ValueMapTypeRemapper *TypeMapper = nullptr;
ValueMaterializer *Materializer = nullptr;
#ifndef NDEBUG
// If the cloning starts at the beginning of the function, verify that
// the function arguments are mapped.
if (!StartingInst)
for (const Argument &II : OldFunc->args())
assert(VMap.count(&II) && "No mapping from source argument specified!");
#endif
PruningFunctionCloner PFC(NewFunc, OldFunc, VMap, ModuleLevelChanges,
NameSuffix, CodeInfo);
const BasicBlock *StartingBB;
if (StartingInst)
StartingBB = StartingInst->getParent();
else {
StartingBB = &OldFunc->getEntryBlock();
StartingInst = &StartingBB->front();
}
// Clone the entry block, and anything recursively reachable from it.
std::vector<const BasicBlock*> CloneWorklist;
PFC.CloneBlock(StartingBB, StartingInst->getIterator(), CloneWorklist);
while (!CloneWorklist.empty()) {
const BasicBlock *BB = CloneWorklist.back();
CloneWorklist.pop_back();
PFC.CloneBlock(BB, BB->begin(), CloneWorklist);
}
// Loop over all of the basic blocks in the old function. If the block was
// reachable, we have cloned it and the old block is now in the value map:
// insert it into the new function in the right order. If not, ignore it.
//
// Defer PHI resolution until rest of function is resolved.
SmallVector<const PHINode*, 16> PHIToResolve;
for (const BasicBlock &BI : *OldFunc) {
Value *V = VMap.lookup(&BI);
BasicBlock *NewBB = cast_or_null<BasicBlock>(V);
if (!NewBB) continue; // Dead block.
// Add the new block to the new function.
NewFunc->getBasicBlockList().push_back(NewBB);
// Handle PHI nodes specially, as we have to remove references to dead
// blocks.
for (BasicBlock::const_iterator I = BI.begin(), E = BI.end(); I != E; ++I) {
// PHI nodes may have been remapped to non-PHI nodes by the caller or
// during the cloning process.
if (const PHINode *PN = dyn_cast<PHINode>(I)) {
if (isa<PHINode>(VMap[PN]))
PHIToResolve.push_back(PN);
else
break;
} else {
break;
}
}
// Finally, remap the terminator instructions, as those can't be remapped
// until all BBs are mapped.
RemapInstruction(NewBB->getTerminator(), VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges,
TypeMapper, Materializer);
}
// Defer PHI resolution until the rest of the function is resolved; PHI
// resolution requires the CFG to be up-to-date.
for (unsigned phino = 0, e = PHIToResolve.size(); phino != e; ) {
const PHINode *OPN = PHIToResolve[phino];
unsigned NumPreds = OPN->getNumIncomingValues();
const BasicBlock *OldBB = OPN->getParent();
BasicBlock *NewBB = cast<BasicBlock>(VMap[OldBB]);
// Map operands for blocks that are live and remove operands for blocks
// that are dead.
for (; phino != PHIToResolve.size() &&
PHIToResolve[phino]->getParent() == OldBB; ++phino) {
OPN = PHIToResolve[phino];
PHINode *PN = cast<PHINode>(VMap[OPN]);
for (unsigned pred = 0, e = NumPreds; pred != e; ++pred) {
Value *V = VMap.lookup(PN->getIncomingBlock(pred));
if (BasicBlock *MappedBlock = cast_or_null<BasicBlock>(V)) {
Value *InVal = MapValue(PN->getIncomingValue(pred),
VMap,
ModuleLevelChanges ? RF_None : RF_NoModuleLevelChanges);
assert(InVal && "Unknown input value?");
PN->setIncomingValue(pred, InVal);
PN->setIncomingBlock(pred, MappedBlock);
} else {
PN->removeIncomingValue(pred, false);
--pred; // Revisit the next entry.
--e;
}
}
}
// The loop above has removed PHI entries for those blocks that are dead
// and has updated others. However, if a block is live (i.e. copied over)
// but its terminator has been changed to not go to this block, then our
// phi nodes will have invalid entries. Update the PHI nodes in this
// case.
PHINode *PN = cast<PHINode>(NewBB->begin());
NumPreds = std::distance(pred_begin(NewBB), pred_end(NewBB));
if (NumPreds != PN->getNumIncomingValues()) {
assert(NumPreds < PN->getNumIncomingValues());
// Count how many times each predecessor comes to this block.
std::map<BasicBlock*, unsigned> PredCount;
for (pred_iterator PI = pred_begin(NewBB), E = pred_end(NewBB);
PI != E; ++PI)
--PredCount[*PI];
// Figure out how many entries to remove from each PHI.
for (unsigned i = 0, e = PN->getNumIncomingValues(); i != e; ++i)
++PredCount[PN->getIncomingBlock(i)];
// At this point, the excess predecessor entries are positive in the
// map. Loop over all of the PHIs and remove excess predecessor
// entries.
BasicBlock::iterator I = NewBB->begin();
for (; (PN = dyn_cast<PHINode>(I)); ++I) {
for (const auto &PCI : PredCount) {
BasicBlock *Pred = PCI.first;
for (unsigned NumToRemove = PCI.second; NumToRemove; --NumToRemove)
PN->removeIncomingValue(Pred, false);
}
}
}
// If the loops above have made these phi nodes have 0 or 1 operand,
// replace them with undef or the input value. We must do this for
// correctness, because 0-operand phis are not valid.
PN = cast<PHINode>(NewBB->begin());
if (PN->getNumIncomingValues() == 0) {
BasicBlock::iterator I = NewBB->begin();
BasicBlock::const_iterator OldI = OldBB->begin();
while ((PN = dyn_cast<PHINode>(I++))) {
Value *NV = UndefValue::get(PN->getType());
PN->replaceAllUsesWith(NV);
assert(VMap[&*OldI] == PN && "VMap mismatch");
VMap[&*OldI] = NV;
PN->eraseFromParent();
++OldI;
}
}
}
// Make a second pass over the PHINodes now that all of them have been
// remapped into the new function, simplifying the PHINode and performing any
// recursive simplifications exposed. This will transparently update the
// WeakTrackingVH in the VMap. Notably, we rely on that so that if we coalesce
// two PHINodes, the iteration over the old PHIs remains valid, and the
// mapping will just map us to the new node (which may not even be a PHI
// node).
const DataLayout &DL = NewFunc->getParent()->getDataLayout();
SmallSetVector<const Value *, 8> Worklist;
for (unsigned Idx = 0, Size = PHIToResolve.size(); Idx != Size; ++Idx)
if (isa<PHINode>(VMap[PHIToResolve[Idx]]))
Worklist.insert(PHIToResolve[Idx]);
// Note that we must test the size on each iteration, as the worklist can grow.
for (unsigned Idx = 0; Idx != Worklist.size(); ++Idx) {
const Value *OrigV = Worklist[Idx];
auto *I = dyn_cast_or_null<Instruction>(VMap.lookup(OrigV));
if (!I)
continue;
// Skip over non-intrinsic callsites; we don't want to remove any nodes from
// the CGSCC.
CallSite CS = CallSite(I);
if (CS && CS.getCalledFunction() && !CS.getCalledFunction()->isIntrinsic())
continue;
// See if this instruction simplifies.
Value *SimpleV = SimplifyInstruction(I, DL);
if (!SimpleV)
continue;
// Stash away all the uses of the old instruction so we can check them for
// recursive simplifications after a RAUW. This is cheaper than checking all
// uses of the replacement value on the recursive step in most cases.
for (const User *U : OrigV->users())
Worklist.insert(cast<Instruction>(U));
// Replace the instruction with its simplified value.
I->replaceAllUsesWith(SimpleV);
// If the original instruction had no side effects, remove it.
if (isInstructionTriviallyDead(I))
I->eraseFromParent();
else
VMap[OrigV] = I;
}
// Now that the inlined function body has been fully constructed, go through
// and zap unconditional fall-through branches. This happens all the time when
// specializing code: code specialization turns conditional branches into
// uncond branches, and this code folds them.
Function::iterator Begin = cast<BasicBlock>(VMap[StartingBB])->getIterator();
Function::iterator I = Begin;
while (I != NewFunc->end()) {
// Check if this block has become dead during inlining or other
// simplifications. Note that the first block will appear dead, as it has
// not yet been wired up properly.
if (I != Begin && (pred_begin(&*I) == pred_end(&*I) ||
I->getSinglePredecessor() == &*I)) {
BasicBlock *DeadBB = &*I++;
DeleteDeadBlock(DeadBB);
continue;
}
// We need to simplify conditional branches and switches with a constant
// operand. We try to prune these out when cloning, but if the
// simplification required looking through PHI nodes, those are only
// available after forming the full basic block. That may leave some here,
// and we still want to prune the dead code as early as possible.
ConstantFoldTerminator(&*I);
BranchInst *BI = dyn_cast<BranchInst>(I->getTerminator());
if (!BI || BI->isConditional()) { ++I; continue; }
BasicBlock *Dest = BI->getSuccessor(0);
if (!Dest->getSinglePredecessor()) {
++I; continue;
}
// We shouldn't be able to get single-entry PHI nodes here, as instsimplify
// above should have zapped all of them.
assert(!isa<PHINode>(Dest->begin()));
// We know all single-entry PHI nodes in the inlined function have been
// removed, so we just need to splice the blocks.
BI->eraseFromParent();
// Make all PHI nodes that referred to Dest now refer to I as their source.
Dest->replaceAllUsesWith(&*I);
// Move all the instructions in the succ to the pred.
I->getInstList().splice(I->end(), Dest->getInstList());
// Remove the dest block.
Dest->eraseFromParent();
// Do not increment I; iteratively merge everything this block branches to.
}
// Make a final pass over the basic blocks from the old function to gather
// any return instructions which survived folding. We have to do this here
// because we can iteratively remove and merge returns above.
for (Function::iterator I = cast<BasicBlock>(VMap[StartingBB])->getIterator(),
E = NewFunc->end();
I != E; ++I)
if (ReturnInst *RI = dyn_cast<ReturnInst>(I->getTerminator()))
Returns.push_back(RI);
}
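// Summary note (not part of the original file): the routine above proceeds in
// five phases: (1) worklist-clone the blocks reachable from StartingInst,
// (2) splice the cloned blocks into NewFunc and remap their terminators,
// (3) resolve the deferred PHIs against the now-complete CFG, (4) iteratively
// simplify PHIs and fold constant terminators, and (5) collect the surviving
// return instructions.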
/// This works exactly like CloneFunctionInto,
/// except that it does some simple constant prop and DCE on the fly. The
/// effect of this is to copy significantly less code in cases where (for
/// example) a function call with constant arguments is inlined, and those
/// constant arguments cause a significant amount of code in the callee to be
/// dead. Since this doesn't produce an exact copy of the input, it can't be
/// used for things like CloneFunction or CloneModule.
void llvm::CloneAndPruneFunctionInto(Function *NewFunc, const Function *OldFunc,
ValueToValueMapTy &VMap,
bool ModuleLevelChanges,
SmallVectorImpl<ReturnInst*> &Returns,
const char *NameSuffix,
ClonedCodeInfo *CodeInfo,
Instruction *TheCall) {
CloneAndPruneIntoFromInst(NewFunc, OldFunc, &OldFunc->front().front(), VMap,
ModuleLevelChanges, Returns, NameSuffix, CodeInfo);
}
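// A hedged usage sketch (not part of the original file): how an inliner-style
// caller might drive CloneAndPruneFunctionInto. `Callee` and `ActualArgs` are
// hypothetical; every formal argument must already be mapped, as the NDEBUG
// assertion in CloneAndPruneIntoFromInst enforces.
#if 0
static void cloneCalleeBody(Function *NewF, const Function *Callee,
                            ArrayRef<Value *> ActualArgs) {
  ValueToValueMapTy VMap;
  unsigned Idx = 0;
  for (const Argument &A : Callee->args())
    VMap[&A] = ActualArgs[Idx++]; // map formals to actuals
  SmallVector<ReturnInst *, 8> Returns;
  CloneAndPruneFunctionInto(NewF, Callee, VMap, /*ModuleLevelChanges=*/false,
                            Returns, /*NameSuffix=*/".i");
}
#endif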
/// \brief Remaps instructions in \p Blocks using the mapping in \p VMap.
void llvm::remapInstructionsInBlocks(
const SmallVectorImpl<BasicBlock *> &Blocks, ValueToValueMapTy &VMap) {
// Rewrite the code to refer to itself.
for (auto *BB : Blocks)
for (auto &Inst : *BB)
RemapInstruction(&Inst, VMap,
RF_NoModuleLevelChanges | RF_IgnoreMissingLocals);
}
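// A hedged usage sketch (not part of the original file): the usual pattern is
// to clone a set of blocks with CloneBasicBlock, record each mapping in VMap,
// and then call remapInstructionsInBlocks so the clones refer to each other
// rather than to the originals. `Region` is a hypothetical block list.
#if 0
static void cloneRegion(ArrayRef<BasicBlock *> Region, Function *F) {
  ValueToValueMapTy VMap;
  SmallVector<BasicBlock *, 8> NewBlocks;
  for (BasicBlock *BB : Region) {
    BasicBlock *NewBB = CloneBasicBlock(BB, VMap, ".copy", F);
    VMap[BB] = NewBB; // blocks map to their clones
    NewBlocks.push_back(NewBB);
  }
  remapInstructionsInBlocks(NewBlocks, VMap); // rewrite intra-region uses
}
#endif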
/// \brief Clones a loop \p OrigLoop. Returns the loop and the blocks in \p
/// Blocks.
///
/// Updates LoopInfo and DominatorTree assuming the loop is dominated by block
/// \p LoopDomBB. Insert the new blocks before block specified in \p Before.
Loop *llvm::cloneLoopWithPreheader(BasicBlock *Before, BasicBlock *LoopDomBB,
Loop *OrigLoop, ValueToValueMapTy &VMap,
const Twine &NameSuffix, LoopInfo *LI,
DominatorTree *DT,
SmallVectorImpl<BasicBlock *> &Blocks) {
assert(OrigLoop->getSubLoops().empty() &&
"Loop to be cloned cannot have inner loop");
Function *F = OrigLoop->getHeader()->getParent();
Loop *ParentLoop = OrigLoop->getParentLoop();
Loop *NewLoop = new Loop();
if (ParentLoop)
ParentLoop->addChildLoop(NewLoop);
else
LI->addTopLevelLoop(NewLoop);
BasicBlock *OrigPH = OrigLoop->getLoopPreheader();
assert(OrigPH && "No preheader");
BasicBlock *NewPH = CloneBasicBlock(OrigPH, VMap, NameSuffix, F);
// Map the old preheader to the new one so the loop PHIs can be remapped later.
VMap[OrigPH] = NewPH;
Blocks.push_back(NewPH);
// Update LoopInfo.
if (ParentLoop)
ParentLoop->addBasicBlockToLoop(NewPH, *LI);
// Update DominatorTree.
DT->addNewBlock(NewPH, LoopDomBB);
for (BasicBlock *BB : OrigLoop->getBlocks()) {
BasicBlock *NewBB = CloneBasicBlock(BB, VMap, NameSuffix, F);
VMap[BB] = NewBB;
// Update LoopInfo.
NewLoop->addBasicBlockToLoop(NewBB, *LI);
// Add DominatorTree node. After seeing all blocks, update to correct IDom.
DT->addNewBlock(NewBB, NewPH);
Blocks.push_back(NewBB);
}
for (BasicBlock *BB : OrigLoop->getBlocks()) {
// Update DominatorTree.
BasicBlock *IDomBB = DT->getNode(BB)->getIDom()->getBlock();
DT->changeImmediateDominator(cast<BasicBlock>(VMap[BB]),
cast<BasicBlock>(VMap[IDomBB]));
}
// Move them physically from the end of the block list.
F->getBasicBlockList().splice(Before->getIterator(), F->getBasicBlockList(),
NewPH);
F->getBasicBlockList().splice(Before->getIterator(), F->getBasicBlockList(),
NewLoop->getHeader()->getIterator(), F->end());
return NewLoop;
}
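// A hedged usage sketch (not part of the original file): callers such as loop
// versioning pair cloneLoopWithPreheader with remapInstructionsInBlocks to fix
// intra-loop references afterwards. `L`, `LI`, and `DT` are hypothetical
// caller-provided analyses; inserting before the original preheader is just
// one plausible placement, and the caller still has to wire up branches into
// the new loop.
#if 0
static Loop *cloneLoopCopy(Loop *L, LoopInfo *LI, DominatorTree *DT) {
  ValueToValueMapTy VMap;
  SmallVector<BasicBlock *, 8> NewBlocks;
  BasicBlock *PH = L->getLoopPreheader(); // assumes L has no inner loops
  Loop *NewL =
      cloneLoopWithPreheader(PH, PH, L, VMap, ".clone", LI, DT, NewBlocks);
  remapInstructionsInBlocks(NewBlocks, VMap); // fix cloned references
  return NewL;
}
#endif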
/// \brief Duplicate non-Phi instructions from the beginning of the block up to
/// the StopAt instruction into a split block between BB and its predecessor.
BasicBlock *
llvm::DuplicateInstructionsInSplitBetween(BasicBlock *BB, BasicBlock *PredBB,
Instruction *StopAt,
ValueToValueMapTy &ValueMapping) {
// We are going to have to map operands from the original block BB to its new
// copy 'NewBB'. If there are PHI nodes in BB, evaluate them to account for
// entry from PredBB.
BasicBlock::iterator BI = BB->begin();
for (; PHINode *PN = dyn_cast<PHINode>(BI); ++BI)
ValueMapping[PN] = PN->getIncomingValueForBlock(PredBB);
BasicBlock *NewBB = SplitEdge(PredBB, BB);
NewBB->setName(PredBB->getName() + ".split");
Instruction *NewTerm = NewBB->getTerminator();
// Clone the non-phi instructions of BB into NewBB, keeping track of the
// mapping and using it to remap operands in the cloned instructions.
for (; StopAt != &*BI; ++BI) {
Instruction *New = BI->clone();
New->setName(BI->getName());
New->insertBefore(NewTerm);
ValueMapping[&*BI] = New;
// Remap operands to patch up intra-block references.
for (unsigned i = 0, e = New->getNumOperands(); i != e; ++i)
if (Instruction *Inst = dyn_cast<Instruction>(New->getOperand(i))) {
auto I = ValueMapping.find(Inst);
if (I != ValueMapping.end())
New->setOperand(i, I->second);
}
}
return NewBB;
}
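// A hedged usage sketch (not part of the original file): jump-threading-style
// duplication of the head of BB into the edge from PredBB, stopping at a
// hypothetical `Barrier` instruction inside BB.
#if 0
static BasicBlock *threadHeadOfBlock(BasicBlock *BB, BasicBlock *PredBB,
                                     Instruction *Barrier) {
  ValueToValueMapTy Mapping;
  BasicBlock *Split =
      DuplicateInstructionsInSplitBetween(BB, PredBB, Barrier, Mapping);
  // The caller can now consult Mapping to rewrite uses along the new edge.
  return Split;
}
#endif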
diff --git a/test/Bitcode/upgrade-module-flag.ll b/test/Bitcode/upgrade-module-flag.ll
index d6741faa837f..de6c9b2cf1bb 100644
--- a/test/Bitcode/upgrade-module-flag.ll
+++ b/test/Bitcode/upgrade-module-flag.ll
@@ -1,9 +1,13 @@
; RUN: llvm-as < %s | llvm-dis | FileCheck %s
; RUN: verify-uselistorder < %s
-!llvm.module.flags = !{!0}
+!llvm.module.flags = !{!0, !1, !2}
-!0 = !{i32 1, !"Objective-C Image Info Version", i32 0}
+!0 = !{i32 1, !"PIC Level", i32 1}
+!1 = !{i32 1, !"PIE Level", i32 1}
+!2 = !{i32 1, !"Objective-C Image Info Version", i32 0}
-; CHECK: !0 = !{i32 1, !"Objective-C Image Info Version", i32 0}
-; CHECK: !1 = !{i32 4, !"Objective-C Class Properties", i32 0}
+; CHECK: !0 = !{i32 7, !"PIC Level", i32 1}
+; CHECK: !1 = !{i32 7, !"PIE Level", i32 1}
+; CHECK: !2 = !{i32 1, !"Objective-C Image Info Version", i32 0}
+; CHECK: !3 = !{i32 4, !"Objective-C Class Properties", i32 0}
diff --git a/test/CodeGen/ARM/Windows/vla-cpsr.ll b/test/CodeGen/ARM/Windows/vla-cpsr.ll
new file mode 100644
index 000000000000..de0f0b68a4d2
--- /dev/null
+++ b/test/CodeGen/ARM/Windows/vla-cpsr.ll
@@ -0,0 +1,13 @@
+; RUN: llc -mtriple thumbv7-windows-itanium -filetype asm -o /dev/null %s -print-machineinstrs=expand-isel-pseudos 2>&1 | FileCheck %s
+
+declare arm_aapcs_vfpcc void @g(i8*) local_unnamed_addr
+
+define arm_aapcs_vfpcc void @f(i32 %i) local_unnamed_addr {
+entry:
+ %vla = alloca i8, i32 %i, align 1
+ call arm_aapcs_vfpcc void @g(i8* nonnull %vla)
+ ret void
+}
+
+; CHECK: tBL pred:14, pred:%noreg, <es:__chkstk>, %LR<imp-def>, %SP<imp-use>, %R4<imp-use,kill>, %R4<imp-def>, %R12<imp-def,dead>, %CPSR<imp-def,dead>
+
diff --git a/test/CodeGen/ARM/vzip.ll b/test/CodeGen/ARM/vzip.ll
index 771bf5f05215..06b49ab94053 100644
--- a/test/CodeGen/ARM/vzip.ll
+++ b/test/CodeGen/ARM/vzip.ll
@@ -1,364 +1,383 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=arm-eabi -mattr=+neon %s -o - | FileCheck %s
define <8 x i8> @vzipi8(<8 x i8>* %A, <8 x i8>* %B) nounwind {
; CHECK-LABEL: vzipi8:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d16, [r1]
; CHECK-NEXT: vldr d17, [r0]
; CHECK-NEXT: vzip.8 d17, d16
; CHECK-NEXT: vadd.i8 d16, d17, d16
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp2 = load <8 x i8>, <8 x i8>* %B
%tmp3 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
%tmp4 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
%tmp5 = add <8 x i8> %tmp3, %tmp4
ret <8 x i8> %tmp5
}
define <16 x i8> @vzipi8_Qres(<8 x i8>* %A, <8 x i8>* %B) nounwind {
; CHECK-LABEL: vzipi8_Qres:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vzip.8 d16, d17
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp2 = load <8 x i8>, <8 x i8>* %B
%tmp3 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
ret <16 x i8> %tmp3
}
define <4 x i16> @vzipi16(<4 x i16>* %A, <4 x i16>* %B) nounwind {
; CHECK-LABEL: vzipi16:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d16, [r1]
; CHECK-NEXT: vldr d17, [r0]
; CHECK-NEXT: vzip.16 d17, d16
; CHECK-NEXT: vadd.i16 d16, d17, d16
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x i16>, <4 x i16>* %A
%tmp2 = load <4 x i16>, <4 x i16>* %B
%tmp3 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%tmp4 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%tmp5 = add <4 x i16> %tmp3, %tmp4
ret <4 x i16> %tmp5
}
define <8 x i16> @vzipi16_Qres(<4 x i16>* %A, <4 x i16>* %B) nounwind {
; CHECK-LABEL: vzipi16_Qres:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vzip.16 d16, d17
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x i16>, <4 x i16>* %A
%tmp2 = load <4 x i16>, <4 x i16>* %B
%tmp3 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
ret <8 x i16> %tmp3
}
; VZIP.32 is equivalent to VTRN.32 for 64-bit vectors.
define <16 x i8> @vzipQi8(<16 x i8>* %A, <16 x i8>* %B) nounwind {
; CHECK-LABEL: vzipQi8:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r1]
; CHECK-NEXT: vld1.64 {d18, d19}, [r0]
; CHECK-NEXT: vzip.8 q9, q8
; CHECK-NEXT: vadd.i8 q8, q9, q8
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp2 = load <16 x i8>, <16 x i8>* %B
%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%tmp4 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%tmp5 = add <16 x i8> %tmp3, %tmp4
ret <16 x i8> %tmp5
}
define <32 x i8> @vzipQi8_QQres(<16 x i8>* %A, <16 x i8>* %B) nounwind {
; CHECK-LABEL: vzipQi8_QQres:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r2]
; CHECK-NEXT: vld1.64 {d18, d19}, [r1]
; CHECK-NEXT: vzip.8 q9, q8
; CHECK-NEXT: vst1.8 {d18, d19}, [r0:128]!
; CHECK-NEXT: vst1.64 {d16, d17}, [r0:128]
; CHECK-NEXT: mov pc, lr
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp2 = load <16 x i8>, <16 x i8>* %B
%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
ret <32 x i8> %tmp3
}
define <8 x i16> @vzipQi16(<8 x i16>* %A, <8 x i16>* %B) nounwind {
; CHECK-LABEL: vzipQi16:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r1]
; CHECK-NEXT: vld1.64 {d18, d19}, [r0]
; CHECK-NEXT: vzip.16 q9, q8
; CHECK-NEXT: vadd.i16 q8, q9, q8
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i16>, <8 x i16>* %A
%tmp2 = load <8 x i16>, <8 x i16>* %B
%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
%tmp4 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
%tmp5 = add <8 x i16> %tmp3, %tmp4
ret <8 x i16> %tmp5
}
define <16 x i16> @vzipQi16_QQres(<8 x i16>* %A, <8 x i16>* %B) nounwind {
; CHECK-LABEL: vzipQi16_QQres:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r2]
; CHECK-NEXT: vld1.64 {d18, d19}, [r1]
; CHECK-NEXT: vzip.16 q9, q8
; CHECK-NEXT: vst1.16 {d18, d19}, [r0:128]!
; CHECK-NEXT: vst1.64 {d16, d17}, [r0:128]
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i16>, <8 x i16>* %A
%tmp2 = load <8 x i16>, <8 x i16>* %B
%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
ret <16 x i16> %tmp3
}
define <4 x i32> @vzipQi32(<4 x i32>* %A, <4 x i32>* %B) nounwind {
; CHECK-LABEL: vzipQi32:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r1]
; CHECK-NEXT: vld1.64 {d18, d19}, [r0]
; CHECK-NEXT: vzip.32 q9, q8
; CHECK-NEXT: vadd.i32 q8, q9, q8
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x i32>, <4 x i32>* %A
%tmp2 = load <4 x i32>, <4 x i32>* %B
%tmp3 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%tmp4 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%tmp5 = add <4 x i32> %tmp3, %tmp4
ret <4 x i32> %tmp5
}
define <8 x i32> @vzipQi32_QQres(<4 x i32>* %A, <4 x i32>* %B) nounwind {
; CHECK-LABEL: vzipQi32_QQres:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r2]
; CHECK-NEXT: vld1.64 {d18, d19}, [r1]
; CHECK-NEXT: vzip.32 q9, q8
; CHECK-NEXT: vst1.32 {d18, d19}, [r0:128]!
; CHECK-NEXT: vst1.64 {d16, d17}, [r0:128]
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x i32>, <4 x i32>* %A
%tmp2 = load <4 x i32>, <4 x i32>* %B
%tmp3 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
ret <8 x i32> %tmp3
}
define <4 x float> @vzipQf(<4 x float>* %A, <4 x float>* %B) nounwind {
; CHECK-LABEL: vzipQf:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r1]
; CHECK-NEXT: vld1.64 {d18, d19}, [r0]
; CHECK-NEXT: vzip.32 q9, q8
; CHECK-NEXT: vadd.f32 q8, q9, q8
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x float>, <4 x float>* %A
%tmp2 = load <4 x float>, <4 x float>* %B
%tmp3 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%tmp4 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%tmp5 = fadd <4 x float> %tmp3, %tmp4
ret <4 x float> %tmp5
}
define <8 x float> @vzipQf_QQres(<4 x float>* %A, <4 x float>* %B) nounwind {
; CHECK-LABEL: vzipQf_QQres:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r2]
; CHECK-NEXT: vld1.64 {d18, d19}, [r1]
; CHECK-NEXT: vzip.32 q9, q8
; CHECK-NEXT: vst1.32 {d18, d19}, [r0:128]!
; CHECK-NEXT: vst1.64 {d16, d17}, [r0:128]
; CHECK-NEXT: mov pc, lr
%tmp1 = load <4 x float>, <4 x float>* %A
%tmp2 = load <4 x float>, <4 x float>* %B
%tmp3 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
ret <8 x float> %tmp3
}
; Undef shuffle indices should not prevent matching to VZIP:
define <8 x i8> @vzipi8_undef(<8 x i8>* %A, <8 x i8>* %B) nounwind {
; CHECK-LABEL: vzipi8_undef:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d16, [r1]
; CHECK-NEXT: vldr d17, [r0]
; CHECK-NEXT: vzip.8 d17, d16
; CHECK-NEXT: vadd.i8 d16, d17, d16
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp2 = load <8 x i8>, <8 x i8>* %B
%tmp3 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <8 x i32> <i32 0, i32 undef, i32 1, i32 9, i32 undef, i32 10, i32 3, i32 11>
%tmp4 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 undef, i32 undef, i32 15>
%tmp5 = add <8 x i8> %tmp3, %tmp4
ret <8 x i8> %tmp5
}
define <16 x i8> @vzipi8_undef_Qres(<8 x i8>* %A, <8 x i8>* %B) nounwind {
; CHECK-LABEL: vzipi8_undef_Qres:
; CHECK: @ BB#0:
; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vzip.8 d16, d17
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp2 = load <8 x i8>, <8 x i8>* %B
%tmp3 = shufflevector <8 x i8> %tmp1, <8 x i8> %tmp2, <16 x i32> <i32 0, i32 undef, i32 1, i32 9, i32 undef, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 undef, i32 undef, i32 15>
ret <16 x i8> %tmp3
}
define <16 x i8> @vzipQi8_undef(<16 x i8>* %A, <16 x i8>* %B) nounwind {
; CHECK-LABEL: vzipQi8_undef:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r1]
; CHECK-NEXT: vld1.64 {d18, d19}, [r0]
; CHECK-NEXT: vzip.8 q9, q8
; CHECK-NEXT: vadd.i8 q8, q9, q8
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp2 = load <16 x i8>, <16 x i8>* %B
%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <16 x i32> <i32 0, i32 16, i32 1, i32 undef, i32 undef, i32 undef, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%tmp4 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <16 x i32> <i32 8, i32 24, i32 9, i32 undef, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 undef, i32 14, i32 30, i32 undef, i32 31>
%tmp5 = add <16 x i8> %tmp3, %tmp4
ret <16 x i8> %tmp5
}
define <32 x i8> @vzipQi8_undef_QQres(<16 x i8>* %A, <16 x i8>* %B) nounwind {
; CHECK-LABEL: vzipQi8_undef_QQres:
; CHECK: @ BB#0:
; CHECK-NEXT: vld1.64 {d16, d17}, [r2]
; CHECK-NEXT: vld1.64 {d18, d19}, [r1]
; CHECK-NEXT: vzip.8 q9, q8
; CHECK-NEXT: vst1.8 {d18, d19}, [r0:128]!
; CHECK-NEXT: vst1.64 {d16, d17}, [r0:128]
; CHECK-NEXT: mov pc, lr
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp2 = load <16 x i8>, <16 x i8>* %B
%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <32 x i32> <i32 0, i32 16, i32 1, i32 undef, i32 undef, i32 undef, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 undef, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 undef, i32 14, i32 30, i32 undef, i32 31>
ret <32 x i8> %tmp3
}
define <8 x i16> @vzip_lower_shufflemask_undef(<4 x i16>* %A, <4 x i16>* %B) {
; CHECK-LABEL: vzip_lower_shufflemask_undef:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vzip.16 d16, d17
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
entry:
%tmp1 = load <4 x i16>, <4 x i16>* %A
%tmp2 = load <4 x i16>, <4 x i16>* %B
%0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 2, i32 6, i32 3, i32 7>
ret <8 x i16> %0
}
+; NOTE: The mask here looks like something that could be done with a vzip,
+; but the current handling of two-result vzip can't do it, so it ends up
+; as a vtrn.
+define <8 x i16> @vzip_lower_shufflemask_undef_rev(<4 x i16>* %A, <4 x i16>* %B) {
+; CHECK-LABEL: vzip_lower_shufflemask_undef_rev:
+; CHECK: @ BB#0: @ %entry
+; CHECK-NEXT: vldr d16, [r1]
+; CHECK-NEXT: vldr d19, [r0]
+; CHECK-NEXT: vtrn.16 d19, d16
+; CHECK-NEXT: vmov r0, r1, d18
+; CHECK-NEXT: vmov r2, r3, d19
+; CHECK-NEXT: mov pc, lr
+entry:
+ %tmp1 = load <4 x i16>, <4 x i16>* %A
+ %tmp2 = load <4 x i16>, <4 x i16>* %B
+ %0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 0, i32 4, i32 undef, i32 undef>
+ ret <8 x i16> %0
+}
+
define <4 x i32> @vzip_lower_shufflemask_zeroed(<2 x i32>* %A) {
; CHECK-LABEL: vzip_lower_shufflemask_zeroed:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vdup.32 q9, d16[0]
; CHECK-NEXT: vzip.32 q8, q9
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
entry:
%tmp1 = load <2 x i32>, <2 x i32>* %A
%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp1, <4 x i32> <i32 0, i32 0, i32 1, i32 0>
ret <4 x i32> %0
}
define <4 x i32> @vzip_lower_shufflemask_vuzp(<2 x i32>* %A) {
; CHECK-LABEL: vzip_lower_shufflemask_vuzp:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vdup.32 q9, d16[0]
; CHECK-NEXT: vzip.32 q8, q9
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr
entry:
%tmp1 = load <2 x i32>, <2 x i32>* %A
%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp1, <4 x i32> <i32 0, i32 2, i32 1, i32 0>
ret <4 x i32> %0
}
define void @vzip_undef_rev_shufflemask_vtrn(<2 x i32>* %A, <4 x i32>* %B) {
; CHECK-LABEL: vzip_undef_rev_shufflemask_vtrn:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vldr d16, [r0]
; CHECK-NEXT: vorr q9, q8, q8
; CHECK-NEXT: vzip.32 q8, q9
; CHECK-NEXT: vext.32 q8, q8, q8, #2
; CHECK-NEXT: vst1.64 {d16, d17}, [r1]
; CHECK-NEXT: mov pc, lr
entry:
%tmp1 = load <2 x i32>, <2 x i32>* %A
%0 = shufflevector <2 x i32> %tmp1, <2 x i32> undef, <4 x i32> <i32 1, i32 1, i32 0, i32 0>
store <4 x i32> %0, <4 x i32>* %B
ret void
}
define void @vzip_vext_factor(<8 x i16>* %A, <4 x i16>* %B) {
; CHECK-LABEL: vzip_vext_factor:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vld1.64 {d16, d17}, [r0]
; CHECK-NEXT: vext.16 d18, d16, d17, #1
; CHECK-NEXT: vext.16 d16, d18, d17, #2
; CHECK-NEXT: vext.16 d16, d16, d16, #1
; CHECK-NEXT: vstr d16, [r1]
; CHECK-NEXT: mov pc, lr
entry:
%tmp1 = load <8 x i16>, <8 x i16>* %A
%0 = shufflevector <8 x i16> %tmp1, <8 x i16> undef, <4 x i32> <i32 4, i32 4, i32 5, i32 3>
store <4 x i16> %0, <4 x i16>* %B
ret void
}
define <8 x i8> @vdup_zip(i8* nocapture readonly %x, i8* nocapture readonly %y) {
; CHECK-LABEL: vdup_zip:
; CHECK: @ BB#0: @ %entry
; CHECK-NEXT: vld1.8 {d16[]}, [r1]
; CHECK-NEXT: vld1.8 {d17[]}, [r0]
; CHECK-NEXT: vzip.8 d17, d16
; CHECK-NEXT: vmov r0, r1, d17
; CHECK-NEXT: mov pc, lr
entry:
%0 = load i8, i8* %x, align 1
%1 = insertelement <8 x i8> undef, i8 %0, i32 0
%lane = shufflevector <8 x i8> %1, <8 x i8> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>
%2 = load i8, i8* %y, align 1
%3 = insertelement <8 x i8> undef, i8 %2, i32 0
%lane3 = shufflevector <8 x i8> %3, <8 x i8> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>
%vzip.i = shufflevector <8 x i8> %lane, <8 x i8> %lane3, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
ret <8 x i8> %vzip.i
}
diff --git a/test/CodeGen/X86/avx-schedule.ll b/test/CodeGen/X86/avx-schedule.ll
index 953f3bdd06e8..78c88f401cbc 100644
--- a/test/CodeGen/X86/avx-schedule.ll
+++ b/test/CodeGen/X86/avx-schedule.ll
@@ -1,2890 +1,2890 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
define <4 x double> @test_addpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_addpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vaddpd (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <4 x double> %a0, %a1
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fadd <4 x double> %1, %2
ret <4 x double> %3
}
define <8 x float> @test_addps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_addps:
; SANDY: # BB#0:
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vaddps (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <8 x float> %a0, %a1
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = fadd <8 x float> %1, %2
ret <8 x float> %3
}
define <4 x double> @test_addsubpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_addsubpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddsubpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddsubpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addsubpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddsubpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addsubpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddsubpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vaddsubpd (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addsubpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddsubpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.addsub.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.addsub.pd.256(<4 x double> %1, <4 x double> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.addsub.pd.256(<4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_addsubps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_addsubps:
; SANDY: # BB#0:
; SANDY-NEXT: vaddsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddsubps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddsubps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addsubps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddsubps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addsubps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddsubps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vaddsubps (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addsubps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddsubps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.addsub.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.addsub.ps.256(<8 x float> %1, <8 x float> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.addsub.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <4 x double> @test_andnotpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_andnotpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vandnpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vandnpd (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vandnpd %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vandnpd (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andnotpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandnpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vandnpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andnotpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandnpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vandnpd (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andnotpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandnpd %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandnpd (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x double> %a0 to <4 x i64>
%2 = bitcast <4 x double> %a1 to <4 x i64>
%3 = xor <4 x i64> %1, <i64 -1, i64 -1, i64 -1, i64 -1>
%4 = and <4 x i64> %3, %2
%5 = load <4 x double>, <4 x double> *%a2, align 32
%6 = bitcast <4 x double> %5 to <4 x i64>
%7 = xor <4 x i64> %4, <i64 -1, i64 -1, i64 -1, i64 -1>
%8 = and <4 x i64> %6, %7
%9 = bitcast <4 x i64> %8 to <4 x double>
%10 = fadd <4 x double> %a1, %9
ret <4 x double> %10
}
define <8 x float> @test_andnotps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_andnotps:
; SANDY: # BB#0:
-; SANDY-NEXT: vandnps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vandnps (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vandnps %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vandnps (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andnotps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandnps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vandnps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andnotps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandnps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vandnps (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andnotps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandnps %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandnps (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <8 x float> %a0 to <4 x i64>
%2 = bitcast <8 x float> %a1 to <4 x i64>
%3 = xor <4 x i64> %1, <i64 -1, i64 -1, i64 -1, i64 -1>
%4 = and <4 x i64> %3, %2
%5 = load <8 x float>, <8 x float> *%a2, align 32
%6 = bitcast <8 x float> %5 to <4 x i64>
%7 = xor <4 x i64> %4, <i64 -1, i64 -1, i64 -1, i64 -1>
%8 = and <4 x i64> %6, %7
%9 = bitcast <4 x i64> %8 to <8 x float>
%10 = fadd <8 x float> %a1, %9
ret <8 x float> %10
}
define <4 x double> @test_andpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_andpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vandpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vandpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: vandpd %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vandpd (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vandpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vandpd (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandpd %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandpd (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x double> %a0 to <4 x i64>
%2 = bitcast <4 x double> %a1 to <4 x i64>
%3 = and <4 x i64> %1, %2
%4 = load <4 x double>, <4 x double> *%a2, align 32
%5 = bitcast <4 x double> %4 to <4 x i64>
%6 = and <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <4 x double>
%8 = fadd <4 x double> %a1, %7
ret <4 x double> %8
}
define <8 x float> @test_andps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_andps:
; SANDY: # BB#0:
-; SANDY-NEXT: vandps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vandps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: vandps %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vandps (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vandps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vandps (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandps %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandps (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <8 x float> %a0 to <4 x i64>
%2 = bitcast <8 x float> %a1 to <4 x i64>
%3 = and <4 x i64> %1, %2
%4 = load <8 x float>, <8 x float> *%a2, align 32
%5 = bitcast <8 x float> %4 to <4 x i64>
%6 = and <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <8 x float>
%8 = fadd <8 x float> %a1, %7
ret <8 x float> %8
}
define <4 x double> @test_blendpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_blendpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3] sched: [1:1.00]
+; SANDY-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3] sched: [1:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],mem[1,2],ymm0[3] sched: [8:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],mem[1,2],ymm0[3] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3] sched: [1:0.33]
; HASWELL-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],mem[1,2],ymm0[3] sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],mem[1,2],ymm0[3] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],mem[1,2],ymm0[3] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fadd <4 x double> %a1, %1
%4 = shufflevector <4 x double> %3, <4 x double> %2, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
ret <4 x double> %4
}
define <8 x float> @test_blendps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_blendps:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3,4,5,6,7] sched: [1:1.00]
-; SANDY-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0,1],mem[2],ymm0[3],mem[4,5,6],ymm0[7] sched: [8:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3,4,5,6,7] sched: [1:0.50]
+; SANDY-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0,1],mem[2],ymm0[3],mem[4,5,6],ymm0[7] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3,4,5,6,7] sched: [1:0.33]
; HASWELL-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0,1],mem[2],ymm0[3],mem[4,5,6],ymm0[7] sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3,4,5,6,7] sched: [1:0.50]
; BTVER2-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0,1],mem[2],ymm0[3],mem[4,5,6],ymm0[7] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2],ymm0[3,4,5,6,7] sched: [1:0.50]
; ZNVER1-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0,1],mem[2],ymm0[3],mem[4,5,6],ymm0[7] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 0, i32 9, i32 10, i32 3, i32 4, i32 5, i32 6, i32 7>
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = shufflevector <8 x float> %1, <8 x float> %2, <8 x i32> <i32 0, i32 1, i32 10, i32 3, i32 12, i32 13, i32 14, i32 7>
ret <8 x float> %3
}
define <4 x double> @test_blendvpd(<4 x double> %a0, <4 x double> %a1, <4 x double> %a2, <4 x double> *%a3) {
; SANDY-LABEL: test_blendvpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendvpd %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
-; SANDY-NEXT: vblendvpd %ymm2, (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendvpd %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:1.00]
+; SANDY-NEXT: vblendvpd %ymm2, (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendvpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendvpd %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; HASWELL-NEXT: vblendvpd %ymm2, (%rdi), %ymm0, %ymm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendvpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendvpd %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:1.00]
; BTVER2-NEXT: vblendvpd %ymm2, (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendvpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendvpd %ymm2, %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: vblendvpd %ymm2, (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double> %a0, <4 x double> %a1, <4 x double> %a2)
%2 = load <4 x double>, <4 x double> *%a3, align 32
%3 = call <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double> %1, <4 x double> %2, <4 x double> %a2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.blendv.pd.256(<4 x double>, <4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_blendvps(<8 x float> %a0, <8 x float> %a1, <8 x float> %a2, <8 x float> *%a3) {
; SANDY-LABEL: test_blendvps:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendvps %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
-; SANDY-NEXT: vblendvps %ymm2, (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendvps %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:1.00]
+; SANDY-NEXT: vblendvps %ymm2, (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendvps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendvps %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; HASWELL-NEXT: vblendvps %ymm2, (%rdi), %ymm0, %ymm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendvps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendvps %ymm2, %ymm1, %ymm0, %ymm0 # sched: [2:1.00]
; BTVER2-NEXT: vblendvps %ymm2, (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendvps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendvps %ymm2, %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: vblendvps %ymm2, (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %a0, <8 x float> %a1, <8 x float> %a2)
%2 = load <8 x float>, <8 x float> *%a3, align 32
%3 = call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float> %1, <8 x float> %2, <8 x float> %a2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>) nounwind readnone
define <8 x float> @test_broadcastf128(<4 x float> *%a0) {
; SANDY-LABEL: test_broadcastf128:
; SANDY: # BB#0:
-; SANDY-NEXT: vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1] sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_broadcastf128:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1] sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_broadcastf128:
; BTVER2: # BB#0:
; BTVER2-NEXT: vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_broadcastf128:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vbroadcastf128 {{.*#+}} ymm0 = mem[0,1,0,1] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <4 x float>, <4 x float> *%a0, align 32
%2 = shufflevector <4 x float> %1, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
ret <8 x float> %2
}
define <4 x double> @test_broadcastsd_ymm(double *%a0) {
; SANDY-LABEL: test_broadcastsd_ymm:
; SANDY: # BB#0:
-; SANDY-NEXT: vbroadcastsd (%rdi), %ymm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vbroadcastsd (%rdi), %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_broadcastsd_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastsd (%rdi), %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_broadcastsd_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vbroadcastsd (%rdi), %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_broadcastsd_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vbroadcastsd (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load double, double *%a0, align 8
%2 = insertelement <4 x double> undef, double %1, i32 0
%3 = shufflevector <4 x double> %2, <4 x double> undef, <4 x i32> zeroinitializer
ret <4 x double> %3
}
define <4 x float> @test_broadcastss(float *%a0) {
; SANDY-LABEL: test_broadcastss:
; SANDY: # BB#0:
-; SANDY-NEXT: vbroadcastss (%rdi), %xmm0 # sched: [6:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vbroadcastss (%rdi), %xmm0 # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_broadcastss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastss (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_broadcastss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vbroadcastss (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_broadcastss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vbroadcastss (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load float, float *%a0, align 4
%2 = insertelement <4 x float> undef, float %1, i32 0
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> zeroinitializer
ret <4 x float> %3
}
define <8 x float> @test_broadcastss_ymm(float *%a0) {
; SANDY-LABEL: test_broadcastss_ymm:
; SANDY: # BB#0:
; SANDY-NEXT: vbroadcastss (%rdi), %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_broadcastss_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastss (%rdi), %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_broadcastss_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vbroadcastss (%rdi), %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_broadcastss_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vbroadcastss (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load float, float *%a0, align 4
%2 = insertelement <8 x float> undef, float %1, i32 0
%3 = shufflevector <8 x float> %2, <8 x float> undef, <8 x i32> zeroinitializer
ret <8 x float> %3
}
define <4 x double> @test_cmppd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_cmppd:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqpd %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
-; SANDY-NEXT: vcmpeqpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: vorpd %ymm0, %ymm1, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vcmpeqpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: vorpd %ymm0, %ymm1, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmppd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqpd %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: vorpd %ymm0, %ymm1, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmppd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqpd %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqpd (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: vorpd %ymm0, %ymm1, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmppd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqpd %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: vorpd %ymm0, %ymm1, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fcmp oeq <4 x double> %a0, %a1
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fcmp oeq <4 x double> %a0, %2
%4 = sext <4 x i1> %1 to <4 x i64>
%5 = sext <4 x i1> %3 to <4 x i64>
%6 = or <4 x i64> %4, %5
%7 = bitcast <4 x i64> %6 to <4 x double>
ret <4 x double> %7
}
define <8 x float> @test_cmpps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_cmpps:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqps %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
-; SANDY-NEXT: vcmpeqps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: vorps %ymm0, %ymm1, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vcmpeqps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: vorps %ymm0, %ymm1, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqps %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: vorps %ymm0, %ymm1, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqps %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqps (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: vorps %ymm0, %ymm1, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqps %ymm1, %ymm0, %ymm1 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: vorps %ymm0, %ymm1, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fcmp oeq <8 x float> %a0, %a1
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = fcmp oeq <8 x float> %a0, %2
%4 = sext <8 x i1> %1 to <8 x i32>
%5 = sext <8 x i1> %3 to <8 x i32>
%6 = or <8 x i32> %4, %5
%7 = bitcast <8 x i32> %6 to <8 x float>
ret <8 x float> %7
}
define <4 x double> @test_cvtdq2pd(<4 x i32> %a0, <4 x i32> *%a1) {
; SANDY-LABEL: test_cvtdq2pd:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtdq2pd %xmm0, %ymm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtdq2pd (%rdi), %ymm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtdq2pd (%rdi), %ymm1 # sched: [8:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtdq2pd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtdq2pd %xmm0, %ymm0 # sched: [6:1.00]
; HASWELL-NEXT: vcvtdq2pd (%rdi), %ymm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtdq2pd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtdq2pd (%rdi), %ymm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtdq2pd %xmm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtdq2pd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtdq2pd (%rdi), %ymm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtdq2pd %xmm0, %ymm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp <4 x i32> %a0 to <4 x double>
%2 = load <4 x i32>, <4 x i32> *%a1, align 16
%3 = sitofp <4 x i32> %2 to <4 x double>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define <8 x float> @test_cvtdq2ps(<8 x i32> %a0, <8 x i32> *%a1) {
; SANDY-LABEL: test_cvtdq2ps:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtdq2ps %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovaps (%rdi), %xmm1 # sched: [6:0.50]
-; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm1, %ymm1 # sched: [7:1.00]
-; SANDY-NEXT: vcvtdq2ps %ymm1, %ymm1 # sched: [3:1.00]
+; SANDY-NEXT: vcvtdq2ps %ymm0, %ymm0 # sched: [4:1.00]
+; SANDY-NEXT: vmovaps (%rdi), %xmm1 # sched: [4:0.50]
+; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm1, %ymm1 # sched: [5:1.00]
+; SANDY-NEXT: vcvtdq2ps %ymm1, %ymm1 # sched: [4:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtdq2ps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtdq2ps %ymm0, %ymm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtdq2ps (%rdi), %ymm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtdq2ps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtdq2ps (%rdi), %ymm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtdq2ps %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtdq2ps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtdq2ps (%rdi), %ymm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtdq2ps %ymm0, %ymm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp <8 x i32> %a0 to <8 x float>
%2 = load <8 x i32>, <8 x i32> *%a1, align 16
%3 = sitofp <8 x i32> %2 to <8 x float>
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
define <8 x i32> @test_cvtpd2dq(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_cvtpd2dq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttpd2dq %ymm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvttpd2dqy (%rdi), %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vcvttpd2dq %ymm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vcvttpd2dqy (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtpd2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttpd2dq %ymm0, %xmm0 # sched: [6:1.00]
; HASWELL-NEXT: vcvttpd2dqy (%rdi), %xmm1 # sched: [10:1.00]
; HASWELL-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtpd2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttpd2dqy (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvttpd2dq %ymm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtpd2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttpd2dqy (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvttpd2dq %ymm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi <4 x double> %a0 to <4 x i32>
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = fptosi <4 x double> %2 to <4 x i32>
%4 = shufflevector <4 x i32> %1, <4 x i32> %3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
ret <8 x i32> %4
}
define <8 x float> @test_cvtpd2ps(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_cvtpd2ps:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtpd2ps %ymm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtpd2psy (%rdi), %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vcvtpd2ps %ymm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vcvtpd2psy (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtpd2ps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtpd2ps %ymm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vcvtpd2psy (%rdi), %xmm1 # sched: [9:1.00]
; HASWELL-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtpd2ps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtpd2psy (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtpd2ps %ymm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtpd2ps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtpd2psy (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtpd2ps %ymm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptrunc <4 x double> %a0 to <4 x float>
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = fptrunc <4 x double> %2 to <4 x float>
%4 = shufflevector <4 x float> %1, <4 x float> %3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
ret <8 x float> %4
}
define <8 x i32> @test_cvtps2dq(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_cvtps2dq:
; SANDY: # BB#0:
; SANDY-NEXT: vcvttps2dq %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vcvttps2dq (%rdi), %ymm1 # sched: [7:1.00]
-; SANDY-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtps2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttps2dq %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vcvttps2dq (%rdi), %ymm1 # sched: [7:1.00]
; HASWELL-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtps2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttps2dq (%rdi), %ymm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvttps2dq %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtps2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttps2dq (%rdi), %ymm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvttps2dq %ymm0, %ymm0 # sched: [5:1.00]
; ZNVER1-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi <8 x float> %a0 to <8 x i32>
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = fptosi <8 x float> %2 to <8 x i32>
%4 = or <8 x i32> %1, %3
ret <8 x i32> %4
}
define <4 x double> @test_divpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_divpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivpd %ymm1, %ymm0, %ymm0 # sched: [45:3.00]
-; SANDY-NEXT: vdivpd (%rdi), %ymm0, %ymm0 # sched: [52:3.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivpd %ymm1, %ymm0, %ymm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivpd (%rdi), %ymm0, %ymm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivpd %ymm1, %ymm0, %ymm0 # sched: [27:2.00]
; HASWELL-NEXT: vdivpd (%rdi), %ymm0, %ymm0 # sched: [31:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivpd %ymm1, %ymm0, %ymm0 # sched: [38:38.00]
; BTVER2-NEXT: vdivpd (%rdi), %ymm0, %ymm0 # sched: [43:38.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivpd %ymm1, %ymm0, %ymm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivpd (%rdi), %ymm0, %ymm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv <4 x double> %a0, %a1
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fdiv <4 x double> %1, %2
ret <4 x double> %3
}
define <8 x float> @test_divps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_divps:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivps %ymm1, %ymm0, %ymm0 # sched: [29:3.00]
-; SANDY-NEXT: vdivps (%rdi), %ymm0, %ymm0 # sched: [36:3.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivps %ymm1, %ymm0, %ymm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivps (%rdi), %ymm0, %ymm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivps %ymm1, %ymm0, %ymm0 # sched: [19:2.00]
; HASWELL-NEXT: vdivps (%rdi), %ymm0, %ymm0 # sched: [23:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivps %ymm1, %ymm0, %ymm0 # sched: [38:38.00]
; BTVER2-NEXT: vdivps (%rdi), %ymm0, %ymm0 # sched: [43:38.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivps %ymm1, %ymm0, %ymm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivps (%rdi), %ymm0, %ymm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv <8 x float> %a0, %a1
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = fdiv <8 x float> %1, %2
ret <8 x float> %3
}
define <8 x float> @test_dpps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_dpps:
; SANDY: # BB#0:
-; SANDY-NEXT: vdpps $7, %ymm1, %ymm0, %ymm0 # sched: [12:2.00]
+; SANDY-NEXT: vdpps $7, %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vdpps $7, (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_dpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdpps $7, %ymm1, %ymm0, %ymm0 # sched: [14:2.00]
; HASWELL-NEXT: vdpps $7, (%rdi), %ymm0, %ymm0 # sched: [18:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_dpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdpps $7, %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vdpps $7, (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_dpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdpps $7, %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vdpps $7, (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.dp.ps.256(<8 x float> %a0, <8 x float> %a1, i8 7)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.dp.ps.256(<8 x float> %1, <8 x float> %2, i8 7)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.dp.ps.256(<8 x float>, <8 x float>, i8) nounwind readnone
define <4 x float> @test_extractf128(<8 x float> %a0, <8 x float> %a1, <4 x float> *%a2) {
; SANDY-LABEL: test_extractf128:
; SANDY: # BB#0:
; SANDY-NEXT: vextractf128 $1, %ymm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vextractf128 $1, %ymm1, (%rdi) # sched: [5:1.00]
+; SANDY-NEXT: vextractf128 $1, %ymm1, (%rdi) # sched: [1:1.00]
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_extractf128:
; HASWELL: # BB#0:
; HASWELL-NEXT: vextractf128 $1, %ymm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vextractf128 $1, %ymm1, (%rdi) # sched: [4:1.00]
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_extractf128:
; BTVER2: # BB#0:
; BTVER2-NEXT: vextractf128 $1, %ymm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vextractf128 $1, %ymm1, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_extractf128:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vextractf128 $1, %ymm0, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: vextractf128 $1, %ymm1, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%2 = shufflevector <8 x float> %a1, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
store <4 x float> %2, <4 x float> *%a2
ret <4 x float> %1
}
define <4 x double> @test_haddpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_haddpd:
; SANDY: # BB#0:
; SANDY-NEXT: vhaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vhaddpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_haddpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
; HASWELL-NEXT: vhaddpd (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_haddpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vhaddpd (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_haddpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhaddpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.hadd.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.hadd.pd.256(<4 x double> %1, <4 x double> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.hadd.pd.256(<4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_haddps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_haddps:
; SANDY: # BB#0:
-; SANDY-NEXT: vhaddps %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
-; SANDY-NEXT: vhaddps (%rdi), %ymm0, %ymm0 # sched: [12:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
+; SANDY-NEXT: vhaddps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_haddps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhaddps %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
; HASWELL-NEXT: vhaddps (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_haddps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vhaddps (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_haddps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhaddps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float> %1, <8 x float> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.hadd.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <4 x double> @test_hsubpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_hsubpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vhsubpd %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
-; SANDY-NEXT: vhsubpd (%rdi), %ymm0, %ymm0 # sched: [12:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
+; SANDY-NEXT: vhsubpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_hsubpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhsubpd %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
; HASWELL-NEXT: vhsubpd (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_hsubpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhsubpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vhsubpd (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_hsubpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhsubpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.hsub.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.hsub.pd.256(<4 x double> %1, <4 x double> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.hsub.pd.256(<4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_hsubps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_hsubps:
; SANDY: # BB#0:
-; SANDY-NEXT: vhsubps %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
-; SANDY-NEXT: vhsubps (%rdi), %ymm0, %ymm0 # sched: [12:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
+; SANDY-NEXT: vhsubps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_hsubps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhsubps %ymm1, %ymm0, %ymm0 # sched: [5:2.00]
; HASWELL-NEXT: vhsubps (%rdi), %ymm0, %ymm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_hsubps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhsubps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vhsubps (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_hsubps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhsubps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float> %1, <8 x float> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.hsub.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <8 x float> @test_insertf128(<8 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; SANDY-LABEL: test_insertf128:
; SANDY: # BB#0:
; SANDY-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm1 # sched: [1:1.00]
-; SANDY-NEXT: vinsertf128 $1, (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: vinsertf128 $1, (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_insertf128:
; HASWELL: # BB#0:
; HASWELL-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm1 # sched: [3:1.00]
; HASWELL-NEXT: vinsertf128 $1, (%rdi), %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_insertf128:
; BTVER2: # BB#0:
; BTVER2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm1 # sched: [1:0.50]
; BTVER2-NEXT: vinsertf128 $1, (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_insertf128:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm1 # sched: [1:0.50]
; ZNVER1-NEXT: vinsertf128 $1, (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a1, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
%2 = shufflevector <8 x float> %a0, <8 x float> %1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
%3 = load <4 x float>, <4 x float> *%a2, align 16
%4 = shufflevector <4 x float> %3, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
%5 = shufflevector <8 x float> %a0, <8 x float> %4, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
%6 = fadd <8 x float> %2, %5
ret <8 x float> %6
}
define <32 x i8> @test_lddqu(i8* %a0) {
; SANDY-LABEL: test_lddqu:
; SANDY: # BB#0:
-; SANDY-NEXT: vlddqu (%rdi), %ymm0 # sched: [6:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vlddqu (%rdi), %ymm0 # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_lddqu:
; HASWELL: # BB#0:
; HASWELL-NEXT: vlddqu (%rdi), %ymm0 # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_lddqu:
; BTVER2: # BB#0:
; BTVER2-NEXT: vlddqu (%rdi), %ymm0 # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_lddqu:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vlddqu (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <32 x i8> @llvm.x86.avx.ldu.dq.256(i8* %a0)
ret <32 x i8> %1
}
declare <32 x i8> @llvm.x86.avx.ldu.dq.256(i8*) nounwind readonly
define <2 x double> @test_maskmovpd(i8* %a0, <2 x i64> %a1, <2 x double> %a2) {
; SANDY-LABEL: test_maskmovpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [8:2.00]
-; SANDY-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [5:1.00]
+; SANDY-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
+; SANDY-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; SANDY-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maskmovpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [4:2.00]
; HASWELL-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [13:1.00]
; HASWELL-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maskmovpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maskmovpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.avx.maskload.pd(i8* %a0, <2 x i64> %a1)
call void @llvm.x86.avx.maskstore.pd(i8* %a0, <2 x i64> %a1, <2 x double> %a2)
ret <2 x double> %1
}
declare <2 x double> @llvm.x86.avx.maskload.pd(i8*, <2 x i64>) nounwind readonly
declare void @llvm.x86.avx.maskstore.pd(i8*, <2 x i64>, <2 x double>) nounwind
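; A "sched: [?:0.000000e+00]" entry (the VMASKMOV forms above) indicates the
; target's scheduling model provides no data for that instruction, so the
; printer emits an unknown-latency/zero-throughput placeholder rather than an
; estimate.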
define <4 x double> @test_maskmovpd_ymm(i8* %a0, <4 x i64> %a1, <4 x double> %a2) {
; SANDY-LABEL: test_maskmovpd_ymm:
; SANDY: # BB#0:
-; SANDY-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [5:1.00]
+; SANDY-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; SANDY-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; SANDY-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maskmovpd_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [4:2.00]
; HASWELL-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [14:1.00]
; HASWELL-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maskmovpd_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maskmovpd_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.maskload.pd.256(i8* %a0, <4 x i64> %a1)
call void @llvm.x86.avx.maskstore.pd.256(i8* %a0, <4 x i64> %a1, <4 x double> %a2)
ret <4 x double> %1
}
declare <4 x double> @llvm.x86.avx.maskload.pd.256(i8*, <4 x i64>) nounwind readonly
declare void @llvm.x86.avx.maskstore.pd.256(i8*, <4 x i64>, <4 x double>) nounwind
define <4 x float> @test_maskmovps(i8* %a0, <4 x i32> %a1, <4 x float> %a2) {
; SANDY-LABEL: test_maskmovps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [8:2.00]
-; SANDY-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [5:1.00]
+; SANDY-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
+; SANDY-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; SANDY-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maskmovps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [4:2.00]
; HASWELL-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [13:1.00]
; HASWELL-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maskmovps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maskmovps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.avx.maskload.ps(i8* %a0, <4 x i32> %a1)
call void @llvm.x86.avx.maskstore.ps(i8* %a0, <4 x i32> %a1, <4 x float> %a2)
ret <4 x float> %1
}
declare <4 x float> @llvm.x86.avx.maskload.ps(i8*, <4 x i32>) nounwind readonly
declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>) nounwind
define <8 x float> @test_maskmovps_ymm(i8* %a0, <8 x i32> %a1, <8 x float> %a2) {
; SANDY-LABEL: test_maskmovps_ymm:
; SANDY: # BB#0:
-; SANDY-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [1:0.50]
+; SANDY-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; SANDY-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; SANDY-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maskmovps_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [4:2.00]
; HASWELL-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [14:1.00]
; HASWELL-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maskmovps_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; BTVER2-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maskmovps_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [?:0.000000e+00]
; ZNVER1-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.maskload.ps.256(i8* %a0, <8 x i32> %a1)
call void @llvm.x86.avx.maskstore.ps.256(i8* %a0, <8 x i32> %a1, <8 x float> %a2)
ret <8 x float> %1
}
declare <8 x float> @llvm.x86.avx.maskload.ps.256(i8*, <8 x i32>) nounwind readonly
declare void @llvm.x86.avx.maskstore.ps.256(i8*, <8 x i32>, <8 x float>) nounwind
define <4 x double> @test_maxpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_maxpd:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxpd (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.max.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.max.pd.256(<4 x double> %1, <4 x double> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.max.pd.256(<4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_maxps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_maxps:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxps (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %1, <8 x float> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.max.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <4 x double> @test_minpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_minpd:
; SANDY: # BB#0:
; SANDY-NEXT: vminpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vminpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vminpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vminpd (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.min.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.min.pd.256(<4 x double> %1, <4 x double> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.min.pd.256(<4 x double>, <4 x double>) nounwind readnone
define <8 x float> @test_minps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_minps:
; SANDY: # BB#0:
; SANDY-NEXT: vminps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vminps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vminps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vminps (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.min.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.min.ps.256(<8 x float> %1, <8 x float> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.min.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <4 x double> @test_movapd(<4 x double> *%a0, <4 x double> *%a1) {
; SANDY-LABEL: test_movapd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovapd (%rdi), %ymm0 # sched: [7:0.50]
+; SANDY-NEXT: vmovapd (%rdi), %ymm0 # sched: [4:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovapd %ymm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovapd %ymm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movapd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovapd (%rdi), %ymm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovapd %ymm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movapd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovapd (%rdi), %ymm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovapd %ymm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movapd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovapd (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovapd %ymm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <4 x double>, <4 x double> *%a0, align 32
%2 = fadd <4 x double> %1, %1
store <4 x double> %2, <4 x double> *%a1, align 32
ret <4 x double> %2
}
define <8 x float> @test_movaps(<8 x float> *%a0, <8 x float> *%a1) {
; SANDY-LABEL: test_movaps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovaps (%rdi), %ymm0 # sched: [7:0.50]
+; SANDY-NEXT: vmovaps (%rdi), %ymm0 # sched: [4:0.50]
; SANDY-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovaps %ymm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovaps %ymm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movaps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovaps (%rdi), %ymm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovaps %ymm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movaps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps (%rdi), %ymm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovaps %ymm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movaps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovaps (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovaps %ymm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <8 x float>, <8 x float> *%a0, align 32
%2 = fadd <8 x float> %1, %1
store <8 x float> %2, <8 x float> *%a1, align 32
ret <8 x float> %2
}
define <4 x double> @test_movddup(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_movddup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovddup {{.*#+}} ymm0 = ymm0[0,0,2,2] sched: [1:1.00]
-; SANDY-NEXT: vmovddup {{.*#+}} ymm1 = mem[0,0,2,2] sched: [7:0.50]
+; SANDY-NEXT: vmovddup {{.*#+}} ymm1 = mem[0,0,2,2] sched: [4:0.50]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movddup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovddup {{.*#+}} ymm0 = ymm0[0,0,2,2] sched: [1:1.00]
; HASWELL-NEXT: vmovddup {{.*#+}} ymm1 = mem[0,0,2,2] sched: [4:0.50]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movddup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovddup {{.*#+}} ymm1 = mem[0,0,2,2] sched: [5:1.00]
; BTVER2-NEXT: vmovddup {{.*#+}} ymm0 = ymm0[0,0,2,2] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movddup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovddup {{.*#+}} ymm1 = mem[0,0,2,2] sched: [8:0.50]
; ZNVER1-NEXT: vmovddup {{.*#+}} ymm0 = ymm0[0,0,2,2] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = shufflevector <4 x double> %2, <4 x double> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define i32 @test_movmskpd(<4 x double> %a0) {
; SANDY-LABEL: test_movmskpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovmskpd %ymm0, %eax # sched: [2:1.00]
+; SANDY-NEXT: vmovmskpd %ymm0, %eax # sched: [1:0.33]
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movmskpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovmskpd %ymm0, %eax # sched: [2:1.00]
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movmskpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovmskpd %ymm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movmskpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovmskpd %ymm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.movmsk.pd.256(<4 x double> %a0)
ret i32 %1
}
declare i32 @llvm.x86.avx.movmsk.pd.256(<4 x double>) nounwind readnone
define i32 @test_movmskps(<8 x float> %a0) {
; SANDY-LABEL: test_movmskps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovmskps %ymm0, %eax # sched: [3:1.00]
+; SANDY-NEXT: vmovmskps %ymm0, %eax # sched: [1:0.33]
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movmskps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovmskps %ymm0, %eax # sched: [2:1.00]
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movmskps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovmskps %ymm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movmskps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovmskps %ymm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.movmsk.ps.256(<8 x float> %a0)
ret i32 %1
}
declare i32 @llvm.x86.avx.movmsk.ps.256(<8 x float>) nounwind readnone
define <4 x double> @test_movntpd(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_movntpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovntpd %ymm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntpd %ymm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovntpd %ymm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovntpd %ymm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovntpd %ymm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <4 x double> %a0, %a0
store <4 x double> %1, <4 x double> *%a1, align 32, !nontemporal !0
ret <4 x double> %1
}
define <8 x float> @test_movntps(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_movntps:
; SANDY: # BB#0:
; SANDY-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovntps %ymm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntps %ymm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovntps %ymm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovntps %ymm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovntps %ymm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <8 x float> %a0, %a0
store <8 x float> %1, <8 x float> *%a1, align 32, !nontemporal !0
ret <8 x float> %1
}
define <8 x float> @test_movshdup(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_movshdup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovshdup {{.*#+}} ymm0 = ymm0[1,1,3,3,5,5,7,7] sched: [1:1.00]
-; SANDY-NEXT: vmovshdup {{.*#+}} ymm1 = mem[1,1,3,3,5,5,7,7] sched: [7:0.50]
+; SANDY-NEXT: vmovshdup {{.*#+}} ymm1 = mem[1,1,3,3,5,5,7,7] sched: [4:0.50]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movshdup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovshdup {{.*#+}} ymm0 = ymm0[1,1,3,3,5,5,7,7] sched: [1:1.00]
; HASWELL-NEXT: vmovshdup {{.*#+}} ymm1 = mem[1,1,3,3,5,5,7,7] sched: [4:0.50]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movshdup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovshdup {{.*#+}} ymm1 = mem[1,1,3,3,5,5,7,7] sched: [5:1.00]
; BTVER2-NEXT: vmovshdup {{.*#+}} ymm0 = ymm0[1,1,3,3,5,5,7,7] sched: [1:0.50]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movshdup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovshdup {{.*#+}} ymm1 = mem[1,1,3,3,5,5,7,7] sched: [8:0.50]
; ZNVER1-NEXT: vmovshdup {{.*#+}} ymm0 = ymm0[1,1,3,3,5,5,7,7] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> undef, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = shufflevector <8 x float> %2, <8 x float> undef, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
define <8 x float> @test_movsldup(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_movsldup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovsldup {{.*#+}} ymm0 = ymm0[0,0,2,2,4,4,6,6] sched: [1:1.00]
-; SANDY-NEXT: vmovsldup {{.*#+}} ymm1 = mem[0,0,2,2,4,4,6,6] sched: [7:0.50]
+; SANDY-NEXT: vmovsldup {{.*#+}} ymm1 = mem[0,0,2,2,4,4,6,6] sched: [4:0.50]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movsldup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovsldup {{.*#+}} ymm0 = ymm0[0,0,2,2,4,4,6,6] sched: [1:1.00]
; HASWELL-NEXT: vmovsldup {{.*#+}} ymm1 = mem[0,0,2,2,4,4,6,6] sched: [4:0.50]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movsldup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovsldup {{.*#+}} ymm1 = mem[0,0,2,2,4,4,6,6] sched: [5:1.00]
; BTVER2-NEXT: vmovsldup {{.*#+}} ymm0 = ymm0[0,0,2,2,4,4,6,6] sched: [1:0.50]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movsldup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovsldup {{.*#+}} ymm1 = mem[0,0,2,2,4,4,6,6] sched: [8:0.50]
; ZNVER1-NEXT: vmovsldup {{.*#+}} ymm0 = ymm0[0,0,2,2,4,4,6,6] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 2, i32 2, i32 4, i32 4, i32 6, i32 6>
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = shufflevector <8 x float> %2, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 2, i32 2, i32 4, i32 4, i32 6, i32 6>
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
define <4 x double> @test_movupd(<4 x double> *%a0, <4 x double> *%a1) {
; SANDY-LABEL: test_movupd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [6:0.50]
-; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [4:0.50]
+; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vextractf128 $1, %ymm0, 16(%rsi) # sched: [5:1.00]
-; SANDY-NEXT: vmovupd %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vextractf128 $1, %ymm0, 16(%rsi) # sched: [1:1.00]
+; SANDY-NEXT: vmovupd %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movupd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovupd (%rdi), %ymm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovupd %ymm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movupd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovupd (%rdi), %ymm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovupd %ymm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movupd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovupd (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovupd %ymm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <4 x double>, <4 x double> *%a0, align 1
%2 = fadd <4 x double> %1, %1
store <4 x double> %2, <4 x double> *%a1, align 1
ret <4 x double> %2
}
define <8 x float> @test_movups(<8 x float> *%a0, <8 x float> *%a1) {
; SANDY-LABEL: test_movups:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [6:0.50]
-; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [4:0.50]
+; SANDY-NEXT: vinsertf128 $1, 16(%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vextractf128 $1, %ymm0, 16(%rsi) # sched: [5:1.00]
-; SANDY-NEXT: vmovups %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vextractf128 $1, %ymm0, 16(%rsi) # sched: [1:1.00]
+; SANDY-NEXT: vmovups %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movups:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovups (%rdi), %ymm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovups %ymm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movups:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovups (%rdi), %ymm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmovups %ymm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movups:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovups (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovups %ymm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <8 x float>, <8 x float> *%a0, align 1
%2 = fadd <8 x float> %1, %1
store <8 x float> %2, <8 x float> *%a1, align 1
ret <8 x float> %2
}
define <4 x double> @test_mulpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_mulpd:
; SANDY: # BB#0:
; SANDY-NEXT: vmulpd %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulpd (%rdi), %ymm0, %ymm0 # sched: [12:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulpd (%rdi), %ymm0, %ymm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulpd %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vmulpd (%rdi), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulpd %ymm1, %ymm0, %ymm0 # sched: [4:4.00]
; BTVER2-NEXT: vmulpd (%rdi), %ymm0, %ymm0 # sched: [9:4.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulpd %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulpd (%rdi), %ymm0, %ymm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul <4 x double> %a0, %a1
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fmul <4 x double> %1, %2
ret <4 x double> %3
}
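; A reading of the numbers above, not a generated check: the memory forms are
; modeled as the register-form latency plus the load latency, e.g. SANDY
; vmulpd goes from [5:1.00] to [9:1.00] once the 4-cycle ymm load (cf. vmovapd
; in test_movapd) is folded in.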
define <8 x float> @test_mulps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_mulps:
; SANDY: # BB#0:
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulps (%rdi), %ymm0, %ymm0 # sched: [12:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps (%rdi), %ymm0, %ymm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vmulps (%rdi), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vmulps (%rdi), %ymm0, %ymm0 # sched: [7:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulps (%rdi), %ymm0, %ymm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul <8 x float> %a0, %a1
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = fmul <8 x float> %1, %2
ret <8 x float> %3
}
define <4 x double> @orpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: orpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vorpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vorpd (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vorpd (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: orpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vorpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vorpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: orpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vorpd (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: orpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vorpd (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x double> %a0 to <4 x i64>
%2 = bitcast <4 x double> %a1 to <4 x i64>
%3 = or <4 x i64> %1, %2
%4 = load <4 x double>, <4 x double> *%a2, align 32
%5 = bitcast <4 x double> %4 to <4 x i64>
%6 = or <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <4 x double>
%8 = fadd <4 x double> %a1, %7
ret <4 x double> %8
}
define <8 x float> @test_orps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_orps:
; SANDY: # BB#0:
-; SANDY-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vorps (%rdi), %ymm0, %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vorps (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_orps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vorps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_orps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vorps (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_orps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vorps %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vorps (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <8 x float> %a0 to <4 x i64>
%2 = bitcast <8 x float> %a1 to <4 x i64>
%3 = or <4 x i64> %1, %2
%4 = load <8 x float>, <8 x float> *%a2, align 32
%5 = bitcast <8 x float> %4 to <4 x i64>
%6 = or <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <8 x float>
%8 = fadd <8 x float> %a1, %7
ret <8 x float> %8
}
define <2 x double> @test_permilpd(<2 x double> %a0, <2 x double> *%a1) {
; SANDY-LABEL: test_permilpd:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0] sched: [1:1.00]
-; SANDY-NEXT: vpermilpd {{.*#+}} xmm1 = mem[1,0] sched: [7:1.00]
+; SANDY-NEXT: vpermilpd {{.*#+}} xmm1 = mem[1,0] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0] sched: [1:1.00]
; HASWELL-NEXT: vpermilpd {{.*#+}} xmm1 = mem[1,0] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilpd {{.*#+}} xmm1 = mem[1,0] sched: [6:1.00]
; BTVER2-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilpd {{.*#+}} xmm1 = mem[1,0] sched: [8:0.50]
; ZNVER1-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> undef, <2 x i32> <i32 1, i32 0>
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = shufflevector <2 x double> %2, <2 x double> undef, <2 x i32> <i32 1, i32 0>
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
define <4 x double> @test_permilpd_ymm(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_permilpd_ymm:
; SANDY: # BB#0:
-; SANDY-NEXT: vpermilpd {{.*#+}} ymm0 = ymm0[1,0,2,3] sched: [8:1.00]
+; SANDY-NEXT: vpermilpd {{.*#+}} ymm0 = ymm0[1,0,2,3] sched: [1:1.00]
; SANDY-NEXT: vpermilpd {{.*#+}} ymm1 = mem[1,0,2,3] sched: [5:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilpd_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilpd {{.*#+}} ymm0 = ymm0[1,0,2,3] sched: [1:1.00]
; HASWELL-NEXT: vpermilpd {{.*#+}} ymm1 = mem[1,0,2,3] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilpd_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilpd {{.*#+}} ymm1 = mem[1,0,2,3] sched: [6:1.00]
; BTVER2-NEXT: vpermilpd {{.*#+}} ymm0 = ymm0[1,0,2,3] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilpd_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilpd {{.*#+}} ymm1 = mem[1,0,2,3] sched: [8:0.50]
; ZNVER1-NEXT: vpermilpd {{.*#+}} ymm0 = ymm0[1,0,2,3] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> undef, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = shufflevector <4 x double> %2, <4 x double> undef, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define <4 x float> @test_permilps(<4 x float> %a0, <4 x float> *%a1) {
; SANDY-LABEL: test_permilps:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,1,0] sched: [1:1.00]
-; SANDY-NEXT: vpermilps {{.*#+}} xmm1 = mem[3,2,1,0] sched: [7:1.00]
+; SANDY-NEXT: vpermilps {{.*#+}} xmm1 = mem[3,2,1,0] sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,1,0] sched: [1:1.00]
; HASWELL-NEXT: vpermilps {{.*#+}} xmm1 = mem[3,2,1,0] sched: [5:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilps {{.*#+}} xmm1 = mem[3,2,1,0] sched: [6:1.00]
; BTVER2-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,1,0] sched: [1:0.50]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilps {{.*#+}} xmm1 = mem[3,2,1,0] sched: [8:0.50]
; ZNVER1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[3,2,1,0] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
define <8 x float> @test_permilps_ymm(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_permilps_ymm:
; SANDY: # BB#0:
-; SANDY-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4] sched: [8:1.00]
+; SANDY-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4] sched: [1:1.00]
; SANDY-NEXT: vpermilps {{.*#+}} ymm1 = mem[3,2,1,0,7,6,5,4] sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilps_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4] sched: [1:1.00]
; HASWELL-NEXT: vpermilps {{.*#+}} ymm1 = mem[3,2,1,0,7,6,5,4] sched: [5:1.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilps_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilps {{.*#+}} ymm1 = mem[3,2,1,0,7,6,5,4] sched: [6:1.00]
; BTVER2-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4] sched: [1:0.50]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilps_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilps {{.*#+}} ymm1 = mem[3,2,1,0,7,6,5,4] sched: [8:0.50]
; ZNVER1-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> undef, <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4>
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = shufflevector <8 x float> %2, <8 x float> undef, <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4>
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
define <2 x double> @test_permilvarpd(<2 x double> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; SANDY-LABEL: test_permilvarpd:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vpermilpd (%rdi), %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpermilpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilvarpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpermilpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilvarpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpermilpd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilvarpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: vpermilpd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.avx.vpermilvar.pd(<2 x double> %a0, <2 x i64> %a1)
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = call <2 x double> @llvm.x86.avx.vpermilvar.pd(<2 x double> %1, <2 x i64> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.avx.vpermilvar.pd(<2 x double>, <2 x i64>) nounwind readnone
define <4 x double> @test_permilvarpd_ymm(<4 x double> %a0, <4 x i64> %a1, <4 x i64> *%a2) {
; SANDY-LABEL: test_permilvarpd_ymm:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; SANDY-NEXT: vpermilpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilvarpd_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vpermilpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilvarpd_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vpermilpd (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilvarpd_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: vpermilpd (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.vpermilvar.pd.256(<4 x double> %a0, <4 x i64> %a1)
%2 = load <4 x i64>, <4 x i64> *%a2, align 32
%3 = call <4 x double> @llvm.x86.avx.vpermilvar.pd.256(<4 x double> %1, <4 x i64> %2)
ret <4 x double> %3
}
declare <4 x double> @llvm.x86.avx.vpermilvar.pd.256(<4 x double>, <4 x i64>) nounwind readnone
define <4 x float> @test_permilvarps(<4 x float> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; SANDY-LABEL: test_permilvarps:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vpermilps (%rdi), %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpermilps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilvarps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpermilps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilvarps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpermilps (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilvarps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: vpermilps (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.avx.vpermilvar.ps(<4 x float> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x float> @llvm.x86.avx.vpermilvar.ps(<4 x float> %1, <4 x i32> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.avx.vpermilvar.ps(<4 x float>, <4 x i32>) nounwind readnone
define <8 x float> @test_permilvarps_ymm(<8 x float> %a0, <8 x i32> %a1, <8 x i32> *%a2) {
; SANDY-LABEL: test_permilvarps_ymm:
; SANDY: # BB#0:
; SANDY-NEXT: vpermilps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; SANDY-NEXT: vpermilps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_permilvarps_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpermilps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vpermilps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_permilvarps_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpermilps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vpermilps (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_permilvarps_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpermilps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; ZNVER1-NEXT: vpermilps (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.vpermilvar.ps.256(<8 x float> %a0, <8 x i32> %a1)
%2 = load <8 x i32>, <8 x i32> *%a2, align 32
%3 = call <8 x float> @llvm.x86.avx.vpermilvar.ps.256(<8 x float> %1, <8 x i32> %2)
ret <8 x float> %3
}
declare <8 x float> @llvm.x86.avx.vpermilvar.ps.256(<8 x float>, <8 x i32>) nounwind readnone
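The memory-operand variants throughout this file fold a load into the instruction, and the annotated latency is then typically the register-form latency plus the model's load latency, so the numbers can be sanity-checked by subtraction. Taking the HASWELL checks for test_mulpd as a worked example:

  vmulpd %ymm1, %ymm0, %ymm0     sched: [5:1.00]   register form
  vmulpd (%rdi), %ymm0, %ymm0    sched: [9:1.00]   folded load
  implied load latency = 9 - 5 = 4 cycles

The same 4-cycle delta recurs across the HASWELL memory forms here, and the SANDY renumbering in this hunk (for example vmulpd 12 -> 9 against a register latency of 5) appears to move that model to the same uniform load delta.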
define <8 x float> @test_rcpps(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_rcpps:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vrcpps (%rdi), %ymm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rcpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps (%rdi), %ymm1 # sched: [11:2.00]
; HASWELL-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rcpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrcpps (%rdi), %ymm1 # sched: [7:2.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rcpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vrcpps (%rdi), %ymm1 # sched: [12:0.50]
; ZNVER1-NEXT: vrcpps %ymm0, %ymm0 # sched: [5:0.50]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float> %a0)
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = call <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float> %2)
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
declare <8 x float> @llvm.x86.avx.rcp.ps.256(<8 x float>) nounwind readnone
define <4 x double> @test_roundpd(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_roundpd:
; SANDY: # BB#0:
; SANDY-NEXT: vroundpd $7, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vroundpd $7, (%rdi), %ymm1 # sched: [7:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundpd $7, %ymm0, %ymm0 # sched: [6:2.00]
; HASWELL-NEXT: vroundpd $7, (%rdi), %ymm1 # sched: [10:2.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundpd $7, (%rdi), %ymm1 # sched: [8:1.00]
; BTVER2-NEXT: vroundpd $7, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundpd $7, (%rdi), %ymm1 # sched: [10:1.00]
; ZNVER1-NEXT: vroundpd $7, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a0, i32 7)
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %2, i32 7)
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
declare <4 x double> @llvm.x86.avx.round.pd.256(<4 x double>, i32) nounwind readnone
define <8 x float> @test_roundps(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_roundps:
; SANDY: # BB#0:
; SANDY-NEXT: vroundps $7, %ymm0, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vroundps $7, (%rdi), %ymm1 # sched: [7:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundps $7, %ymm0, %ymm0 # sched: [6:2.00]
; HASWELL-NEXT: vroundps $7, (%rdi), %ymm1 # sched: [10:2.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundps $7, (%rdi), %ymm1 # sched: [8:1.00]
; BTVER2-NEXT: vroundps $7, %ymm0, %ymm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundps $7, (%rdi), %ymm1 # sched: [10:1.00]
; ZNVER1-NEXT: vroundps $7, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a0, i32 7)
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %2, i32 7)
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
declare <8 x float> @llvm.x86.avx.round.ps.256(<8 x float>, i32) nounwind readnone
define <8 x float> @test_rsqrtps(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_rsqrtps:
; SANDY: # BB#0:
-; SANDY-NEXT: vrsqrtps (%rdi), %ymm1 # sched: [14:3.00]
-; SANDY-NEXT: vrsqrtps %ymm0, %ymm0 # sched: [7:3.00]
+; SANDY-NEXT: vrsqrtps %ymm0, %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: vrsqrtps (%rdi), %ymm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rsqrtps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrsqrtps (%rdi), %ymm1 # sched: [11:2.00]
; HASWELL-NEXT: vrsqrtps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rsqrtps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrsqrtps (%rdi), %ymm1 # sched: [7:2.00]
; BTVER2-NEXT: vrsqrtps %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rsqrtps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vrsqrtps (%rdi), %ymm1 # sched: [12:0.50]
; ZNVER1-NEXT: vrsqrtps %ymm0, %ymm0 # sched: [5:0.50]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.rsqrt.ps.256(<8 x float> %a0)
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = call <8 x float> @llvm.x86.avx.rsqrt.ps.256(<8 x float> %2)
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
declare <8 x float> @llvm.x86.avx.rsqrt.ps.256(<8 x float>) nounwind readnone
define <4 x double> @test_shufpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_shufpd:
; SANDY: # BB#0:
; SANDY-NEXT: vshufpd {{.*#+}} ymm0 = ymm0[1],ymm1[0],ymm0[2],ymm1[3] sched: [1:1.00]
-; SANDY-NEXT: vshufpd {{.*#+}} ymm1 = ymm1[1],mem[0],ymm1[2],mem[3] sched: [8:1.00]
+; SANDY-NEXT: vshufpd {{.*#+}} ymm1 = ymm1[1],mem[0],ymm1[2],mem[3] sched: [5:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_shufpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vshufpd {{.*#+}} ymm0 = ymm0[1],ymm1[0],ymm0[2],ymm1[3] sched: [1:1.00]
; HASWELL-NEXT: vshufpd {{.*#+}} ymm1 = ymm1[1],mem[0],ymm1[2],mem[3] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_shufpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vshufpd {{.*#+}} ymm0 = ymm0[1],ymm1[0],ymm0[2],ymm1[3] sched: [1:0.50]
; BTVER2-NEXT: vshufpd {{.*#+}} ymm1 = ymm1[1],mem[0],ymm1[2],mem[3] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_shufpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vshufpd {{.*#+}} ymm0 = ymm0[1],ymm1[0],ymm0[2],ymm1[3] sched: [1:0.50]
; ZNVER1-NEXT: vshufpd {{.*#+}} ymm1 = ymm1[1],mem[0],ymm1[2],mem[3] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 1, i32 4, i32 2, i32 7>
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = shufflevector <4 x double> %a1, <4 x double> %2, <4 x i32> <i32 1, i32 4, i32 2, i32 7>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define <8 x float> @test_shufps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) nounwind {
; SANDY-LABEL: test_shufps:
; SANDY: # BB#0:
; SANDY-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,0],ymm1[0,0],ymm0[4,4],ymm1[4,4] sched: [1:1.00]
-; SANDY-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,3],mem[0,0],ymm0[4,7],mem[4,4] sched: [8:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,3],mem[0,0],ymm0[4,7],mem[4,4] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_shufps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,0],ymm1[0,0],ymm0[4,4],ymm1[4,4] sched: [1:1.00]
; HASWELL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,3],mem[0,0],ymm0[4,7],mem[4,4] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_shufps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,0],ymm1[0,0],ymm0[4,4],ymm1[4,4] sched: [1:0.50]
; BTVER2-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,3],mem[0,0],ymm0[4,7],mem[4,4] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_shufps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,0],ymm1[0,0],ymm0[4,4],ymm1[4,4] sched: [1:0.50]
; ZNVER1-NEXT: vshufps {{.*#+}} ymm0 = ymm0[0,3],mem[0,0],ymm0[4,7],mem[4,4] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 0, i32 0, i32 8, i32 8, i32 4, i32 4, i32 12, i32 12>
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = shufflevector <8 x float> %1, <8 x float> %2, <8 x i32> <i32 0, i32 3, i32 8, i32 8, i32 4, i32 7, i32 12, i32 12>
ret <8 x float> %3
}
define <4 x double> @test_sqrtpd(<4 x double> %a0, <4 x double> *%a1) {
; SANDY-LABEL: test_sqrtpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtpd (%rdi), %ymm1 # sched: [52:3.00]
-; SANDY-NEXT: vsqrtpd %ymm0, %ymm0 # sched: [45:3.00]
+; SANDY-NEXT: vsqrtpd %ymm0, %ymm0 # sched: [15:1.00]
+; SANDY-NEXT: vsqrtpd (%rdi), %ymm1 # sched: [19:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtpd (%rdi), %ymm1 # sched: [32:2.00]
; HASWELL-NEXT: vsqrtpd %ymm0, %ymm0 # sched: [28:2.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsqrtpd (%rdi), %ymm1 # sched: [59:54.00]
; BTVER2-NEXT: vsqrtpd %ymm0, %ymm0 # sched: [54:54.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsqrtpd (%rdi), %ymm1 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtpd %ymm0, %ymm0 # sched: [20:1.00]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x double> @llvm.x86.avx.sqrt.pd.256(<4 x double> %a0)
%2 = load <4 x double>, <4 x double> *%a1, align 32
%3 = call <4 x double> @llvm.x86.avx.sqrt.pd.256(<4 x double> %2)
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
declare <4 x double> @llvm.x86.avx.sqrt.pd.256(<4 x double>) nounwind readnone
define <8 x float> @test_sqrtps(<8 x float> %a0, <8 x float> *%a1) {
; SANDY-LABEL: test_sqrtps:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtps (%rdi), %ymm1 # sched: [36:3.00]
-; SANDY-NEXT: vsqrtps %ymm0, %ymm0 # sched: [29:3.00]
+; SANDY-NEXT: vsqrtps %ymm0, %ymm0 # sched: [15:1.00]
+; SANDY-NEXT: vsqrtps (%rdi), %ymm1 # sched: [19:1.00]
; SANDY-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtps (%rdi), %ymm1 # sched: [23:2.00]
; HASWELL-NEXT: vsqrtps %ymm0, %ymm0 # sched: [19:2.00]
; HASWELL-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsqrtps (%rdi), %ymm1 # sched: [47:42.00]
; BTVER2-NEXT: vsqrtps %ymm0, %ymm0 # sched: [42:42.00]
; BTVER2-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsqrtps (%rdi), %ymm1 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtps %ymm0, %ymm0 # sched: [20:1.00]
; ZNVER1-NEXT: vaddps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x float> @llvm.x86.avx.sqrt.ps.256(<8 x float> %a0)
%2 = load <8 x float>, <8 x float> *%a1, align 32
%3 = call <8 x float> @llvm.x86.avx.sqrt.ps.256(<8 x float> %2)
%4 = fadd <8 x float> %1, %3
ret <8 x float> %4
}
declare <8 x float> @llvm.x86.avx.sqrt.ps.256(<8 x float>) nounwind readnone
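The two sqrt hunks above change more than the bracketed numbers; the order of the checked instructions flips as well:

  old SANDY: vsqrtpd (%rdi), %ymm1  [52:3.00]   then  vsqrtpd %ymm0, %ymm0   [45:3.00]
  new SANDY: vsqrtpd %ymm0, %ymm0   [15:1.00]   then  vsqrtpd (%rdi), %ymm1  [19:1.00]

With the much shorter retuned latencies the scheduler evidently orders the two independent square roots differently, so update_llc_test_checks.py regenerated the lines in the new order rather than editing them in place; the same reordering shows up in test_rsqrtps.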
define <4 x double> @test_subpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_subpd:
; SANDY: # BB#0:
; SANDY-NEXT: vsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubpd (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vsubpd (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubpd (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub <4 x double> %a0, %a1
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = fsub <4 x double> %1, %2
ret <4 x double> %3
}
define <8 x float> @test_subps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_subps:
; SANDY: # BB#0:
; SANDY-NEXT: vsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubps (%rdi), %ymm0, %ymm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubps %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vsubps (%rdi), %ymm0, %ymm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubps %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubps (%rdi), %ymm0, %ymm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub <8 x float> %a0, %a1
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = fsub <8 x float> %1, %2
ret <8 x float> %3
}
define i32 @test_testpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; SANDY-LABEL: test_testpd:
; SANDY: # BB#0:
; SANDY-NEXT: xorl %eax, %eax # sched: [1:0.33]
-; SANDY-NEXT: vtestpd %xmm1, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: setb %al # sched: [1:1.00]
-; SANDY-NEXT: vtestpd (%rdi), %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vtestpd %xmm1, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: setb %al # sched: [1:0.33]
+; SANDY-NEXT: vtestpd (%rdi), %xmm0 # sched: [5:0.50]
; SANDY-NEXT: adcl $0, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_testpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: xorl %eax, %eax # sched: [1:0.25]
; HASWELL-NEXT: vtestpd %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: setb %al # sched: [1:0.50]
; HASWELL-NEXT: vtestpd (%rdi), %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: adcl $0, %eax # sched: [2:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_testpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: xorl %eax, %eax # sched: [1:0.50]
; BTVER2-NEXT: vtestpd %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: setb %al # sched: [1:0.50]
; BTVER2-NEXT: vtestpd (%rdi), %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: adcl $0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_testpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: xorl %eax, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vtestpd %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: setb %al # sched: [1:0.25]
; ZNVER1-NEXT: vtestpd (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: adcl $0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.vtestc.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call i32 @llvm.x86.avx.vtestc.pd(<2 x double> %a0, <2 x double> %2)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.avx.vtestc.pd(<2 x double>, <2 x double>) nounwind readnone
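The four vtest tests use a compact carry-flag idiom that is easy to misread in the checks: the vtestc intrinsics return the CF result of vtestpd/vtestps, so llc sums the two calls without a branch:

  xorl %eax, %eax        ; zero the accumulator (xor also clears CF)
  vtestpd %xmm1, %xmm0   ; CF = first testc result
  setb %al               ; al = CF
  vtestpd (%rdi), %xmm0  ; CF = second testc result
  adcl $0, %eax          ; eax += CF

so the final retq returns %1 + %3 from the IR in %eax.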
define i32 @test_testpd_ymm(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_testpd_ymm:
; SANDY: # BB#0:
; SANDY-NEXT: xorl %eax, %eax # sched: [1:0.33]
-; SANDY-NEXT: vtestpd %ymm1, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: setb %al # sched: [1:1.00]
-; SANDY-NEXT: vtestpd (%rdi), %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vtestpd %ymm1, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: setb %al # sched: [1:0.33]
+; SANDY-NEXT: vtestpd (%rdi), %ymm0 # sched: [5:0.50]
; SANDY-NEXT: adcl $0, %eax # sched: [1:0.33]
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_testpd_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: xorl %eax, %eax # sched: [1:0.25]
; HASWELL-NEXT: vtestpd %ymm1, %ymm0 # sched: [1:0.33]
; HASWELL-NEXT: setb %al # sched: [1:0.50]
; HASWELL-NEXT: vtestpd (%rdi), %ymm0 # sched: [5:0.50]
; HASWELL-NEXT: adcl $0, %eax # sched: [2:0.50]
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_testpd_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: xorl %eax, %eax # sched: [1:0.50]
; BTVER2-NEXT: vtestpd %ymm1, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: setb %al # sched: [1:0.50]
; BTVER2-NEXT: vtestpd (%rdi), %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: adcl $0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_testpd_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: xorl %eax, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vtestpd %ymm1, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: setb %al # sched: [1:0.25]
; ZNVER1-NEXT: vtestpd (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: adcl $0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.vtestc.pd.256(<4 x double> %a0, <4 x double> %a1)
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = call i32 @llvm.x86.avx.vtestc.pd.256(<4 x double> %a0, <4 x double> %2)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.avx.vtestc.pd.256(<4 x double>, <4 x double>) nounwind readnone
define i32 @test_testps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; SANDY-LABEL: test_testps:
; SANDY: # BB#0:
; SANDY-NEXT: xorl %eax, %eax # sched: [1:0.33]
-; SANDY-NEXT: vtestps %xmm1, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: setb %al # sched: [1:1.00]
-; SANDY-NEXT: vtestps (%rdi), %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vtestps %xmm1, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: setb %al # sched: [1:0.33]
+; SANDY-NEXT: vtestps (%rdi), %xmm0 # sched: [5:0.50]
; SANDY-NEXT: adcl $0, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_testps:
; HASWELL: # BB#0:
; HASWELL-NEXT: xorl %eax, %eax # sched: [1:0.25]
; HASWELL-NEXT: vtestps %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: setb %al # sched: [1:0.50]
; HASWELL-NEXT: vtestps (%rdi), %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: adcl $0, %eax # sched: [2:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_testps:
; BTVER2: # BB#0:
; BTVER2-NEXT: xorl %eax, %eax # sched: [1:0.50]
; BTVER2-NEXT: vtestps %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: setb %al # sched: [1:0.50]
; BTVER2-NEXT: vtestps (%rdi), %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: adcl $0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_testps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: xorl %eax, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vtestps %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: setb %al # sched: [1:0.25]
; ZNVER1-NEXT: vtestps (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: adcl $0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.vtestc.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call i32 @llvm.x86.avx.vtestc.ps(<4 x float> %a0, <4 x float> %2)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.avx.vtestc.ps(<4 x float>, <4 x float>) nounwind readnone
define i32 @test_testps_ymm(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_testps_ymm:
; SANDY: # BB#0:
; SANDY-NEXT: xorl %eax, %eax # sched: [1:0.33]
-; SANDY-NEXT: vtestps %ymm1, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: setb %al # sched: [1:1.00]
-; SANDY-NEXT: vtestps (%rdi), %ymm0 # sched: [8:1.00]
+; SANDY-NEXT: vtestps %ymm1, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: setb %al # sched: [1:0.33]
+; SANDY-NEXT: vtestps (%rdi), %ymm0 # sched: [5:0.50]
; SANDY-NEXT: adcl $0, %eax # sched: [1:0.33]
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_testps_ymm:
; HASWELL: # BB#0:
; HASWELL-NEXT: xorl %eax, %eax # sched: [1:0.25]
; HASWELL-NEXT: vtestps %ymm1, %ymm0 # sched: [1:0.33]
; HASWELL-NEXT: setb %al # sched: [1:0.50]
; HASWELL-NEXT: vtestps (%rdi), %ymm0 # sched: [5:0.50]
; HASWELL-NEXT: adcl $0, %eax # sched: [2:0.50]
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_testps_ymm:
; BTVER2: # BB#0:
; BTVER2-NEXT: xorl %eax, %eax # sched: [1:0.50]
; BTVER2-NEXT: vtestps %ymm1, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: setb %al # sched: [1:0.50]
; BTVER2-NEXT: vtestps (%rdi), %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: adcl $0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_testps_ymm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: xorl %eax, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vtestps %ymm1, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: setb %al # sched: [1:0.25]
; ZNVER1-NEXT: vtestps (%rdi), %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: adcl $0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.avx.vtestc.ps.256(<8 x float> %a0, <8 x float> %a1)
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = call i32 @llvm.x86.avx.vtestc.ps.256(<8 x float> %a0, <8 x float> %2)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.avx.vtestc.ps.256(<8 x float>, <8 x float>) nounwind readnone
define <4 x double> @test_unpckhpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_unpckhpd:
; SANDY: # BB#0:
; SANDY-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3] sched: [1:1.00]
; SANDY-NEXT: vunpckhpd {{.*#+}} ymm1 = ymm1[1],mem[1],ymm1[3],mem[3] sched: [5:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpckhpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3] sched: [1:1.00]
; HASWELL-NEXT: vunpckhpd {{.*#+}} ymm1 = ymm1[1],mem[1],ymm1[3],mem[3] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpckhpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3] sched: [1:0.50]
; BTVER2-NEXT: vunpckhpd {{.*#+}} ymm1 = ymm1[1],mem[1],ymm1[3],mem[3] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpckhpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3] sched: [1:0.50]
; ZNVER1-NEXT: vunpckhpd {{.*#+}} ymm1 = ymm1[1],mem[1],ymm1[3],mem[3] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = shufflevector <4 x double> %a1, <4 x double> %2, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define <8 x float> @test_unpckhps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) nounwind {
; SANDY-LABEL: test_unpckhps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7] sched: [1:1.00]
; SANDY-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],mem[2],ymm0[3],mem[3],ymm0[6],mem[6],ymm0[7],mem[7] sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpckhps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7] sched: [1:1.00]
; HASWELL-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],mem[2],ymm0[3],mem[3],ymm0[6],mem[6],ymm0[7],mem[7] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpckhps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7] sched: [1:0.50]
; BTVER2-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],mem[2],ymm0[3],mem[3],ymm0[6],mem[6],ymm0[7],mem[7] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpckhps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7] sched: [1:0.50]
; ZNVER1-NEXT: vunpckhps {{.*#+}} ymm0 = ymm0[2],mem[2],ymm0[3],mem[3],ymm0[6],mem[6],ymm0[7],mem[7] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 2, i32 10, i32 3, i32 11, i32 6, i32 14, i32 7, i32 15>
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = shufflevector <8 x float> %1, <8 x float> %2, <8 x i32> <i32 2, i32 10, i32 3, i32 11, i32 6, i32 14, i32 7, i32 15>
ret <8 x float> %3
}
define <4 x double> @test_unpcklpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_unpcklpd:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2] sched: [1:1.00]
-; SANDY-NEXT: vunpcklpd {{.*#+}} ymm1 = ymm1[0],mem[0],ymm1[2],mem[2] sched: [8:1.00]
+; SANDY-NEXT: vunpcklpd {{.*#+}} ymm1 = ymm1[0],mem[0],ymm1[2],mem[2] sched: [5:1.00]
; SANDY-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpcklpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2] sched: [1:1.00]
; HASWELL-NEXT: vunpcklpd {{.*#+}} ymm1 = ymm1[0],mem[0],ymm1[2],mem[2] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpcklpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2] sched: [1:0.50]
; BTVER2-NEXT: vunpcklpd {{.*#+}} ymm1 = ymm1[0],mem[0],ymm1[2],mem[2] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpcklpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2] sched: [1:0.50]
; ZNVER1-NEXT: vunpcklpd {{.*#+}} ymm1 = ymm1[0],mem[0],ymm1[2],mem[2] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm1, %ymm0, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x double> %a0, <4 x double> %a1, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
%2 = load <4 x double>, <4 x double> *%a2, align 32
%3 = shufflevector <4 x double> %a1, <4 x double> %2, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
%4 = fadd <4 x double> %1, %3
ret <4 x double> %4
}
define <8 x float> @test_unpcklps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) nounwind {
; SANDY-LABEL: test_unpcklps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5] sched: [1:1.00]
-; SANDY-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],mem[0],ymm0[1],mem[1],ymm0[4],mem[4],ymm0[5],mem[5] sched: [8:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],mem[0],ymm0[1],mem[1],ymm0[4],mem[4],ymm0[5],mem[5] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpcklps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5] sched: [1:1.00]
; HASWELL-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],mem[0],ymm0[1],mem[1],ymm0[4],mem[4],ymm0[5],mem[5] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpcklps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5] sched: [1:0.50]
; BTVER2-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],mem[0],ymm0[1],mem[1],ymm0[4],mem[4],ymm0[5],mem[5] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpcklps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5] sched: [1:0.50]
; ZNVER1-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],mem[0],ymm0[1],mem[1],ymm0[4],mem[4],ymm0[5],mem[5] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x float> %a0, <8 x float> %a1, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 4, i32 12, i32 5, i32 13>
%2 = load <8 x float>, <8 x float> *%a2, align 32
%3 = shufflevector <8 x float> %1, <8 x float> %2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 4, i32 12, i32 5, i32 13>
ret <8 x float> %3
}
define <4 x double> @test_xorpd(<4 x double> %a0, <4 x double> %a1, <4 x double> *%a2) {
; SANDY-LABEL: test_xorpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vxorpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vxorpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: vxorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vxorpd (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_xorpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vxorpd %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vxorpd (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_xorpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vxorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vxorpd (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_xorpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vxorpd %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vxorpd (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x double> %a0 to <4 x i64>
%2 = bitcast <4 x double> %a1 to <4 x i64>
%3 = xor <4 x i64> %1, %2
%4 = load <4 x double>, <4 x double> *%a2, align 32
%5 = bitcast <4 x double> %4 to <4 x i64>
%6 = xor <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <4 x double>
%8 = fadd <4 x double> %a1, %7
ret <4 x double> %8
}
define <8 x float> @test_xorps(<8 x float> %a0, <8 x float> %a1, <8 x float> *%a2) {
; SANDY-LABEL: test_xorps:
; SANDY: # BB#0:
-; SANDY-NEXT: vxorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
-; SANDY-NEXT: vxorps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
+; SANDY-NEXT: vxorps %ymm1, %ymm0, %ymm0 # sched: [1:0.33]
+; SANDY-NEXT: vxorps (%rdi), %ymm0, %ymm0 # sched: [5:0.50]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_xorps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vxorps %ymm1, %ymm0, %ymm0 # sched: [1:1.00]
; HASWELL-NEXT: vxorps (%rdi), %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_xorps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vxorps %ymm1, %ymm0, %ymm0 # sched: [1:0.50]
; BTVER2-NEXT: vxorps (%rdi), %ymm0, %ymm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_xorps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vxorps %ymm1, %ymm0, %ymm0 # sched: [1:0.25]
; ZNVER1-NEXT: vxorps (%rdi), %ymm0, %ymm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <8 x float> %a0 to <4 x i64>
%2 = bitcast <8 x float> %a1 to <4 x i64>
%3 = xor <4 x i64> %1, %2
%4 = load <8 x float>, <8 x float> *%a2, align 32
%5 = bitcast <8 x float> %4 to <4 x i64>
%6 = xor <4 x i64> %3, %5
%7 = bitcast <4 x i64> %6 to <8 x float>
%8 = fadd <8 x float> %a1, %7
ret <8 x float> %8
}
define void @test_zeroall() {
; SANDY-LABEL: test_zeroall:
; SANDY: # BB#0:
; SANDY-NEXT: vzeroall # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_zeroall:
; HASWELL: # BB#0:
; HASWELL-NEXT: vzeroall # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_zeroall:
; BTVER2: # BB#0:
; BTVER2-NEXT: vzeroall # sched: [?:0.000000e+00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_zeroall:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vzeroall # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.avx.vzeroall()
ret void
}
declare void @llvm.x86.avx.vzeroall() nounwind
define void @test_zeroupper() {
; SANDY-LABEL: test_zeroupper:
; SANDY: # BB#0:
; SANDY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_zeroupper:
; HASWELL: # BB#0:
; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_zeroupper:
; BTVER2: # BB#0:
; BTVER2-NEXT: vzeroupper # sched: [?:0.000000e+00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_zeroupper:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.avx.vzeroupper()
ret void
}
declare void @llvm.x86.avx.vzeroupper() nounwind
!0 = !{i32 1}
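One more annotation form deserves a gloss before the next file: vzeroall and vzeroupper print as sched: [?:0.000000e+00] under SANDY, BTVER2, and ZNVER1. The question mark appears to be the printer's placeholder when the selected CPU model carries no latency entry for an instruction, with the throughput field defaulting to zero; HASWELL, which does model them, prints [1:0.00] instead.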
diff --git a/test/CodeGen/X86/avx512-extract-subvector.ll b/test/CodeGen/X86/avx512-extract-subvector.ll
index 2d0a81046b4e..85db44ddd232 100644
--- a/test/CodeGen/X86/avx512-extract-subvector.ll
+++ b/test/CodeGen/X86/avx512-extract-subvector.ll
@@ -1,914 +1,914 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=skx | FileCheck --check-prefix=SKX %s
define <8 x i16> @extract_subvector128_v32i16(<32 x i16> %x) nounwind {
; SKX-LABEL: extract_subvector128_v32i16:
; SKX: ## BB#0:
; SKX-NEXT: vextracti32x4 $2, %zmm0, %xmm0
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
%r1 = shufflevector <32 x i16> %x, <32 x i16> undef, <8 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <8 x i16> %r1
}
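This second file drops the scheduling comments and checks plain lowering. The first case is worth decoding: the shufflevector selects elements 16 through 23 of a <32 x i16>, and the element arithmetic explains the $2 immediate:

  first index 16, at 16 bits per element  ->  bit offset 16 * 16 = 256
  256 / 128 = 128-bit lane 2              ->  vextracti32x4 $2, %zmm0, %xmm0

The vzeroupper before retq is the usual guard against AVX-to-SSE transition penalties, inserted because only xmm0 is live at the return.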
define <8 x i16> @extract_subvector128_v32i16_first_element(<32 x i16> %x) nounwind {
; SKX-LABEL: extract_subvector128_v32i16_first_element:
; SKX: ## BB#0:
; SKX-NEXT: ## kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
%r1 = shufflevector <32 x i16> %x, <32 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
ret <8 x i16> %r1
}
define <16 x i8> @extract_subvector128_v64i8(<64 x i8> %x) nounwind {
; SKX-LABEL: extract_subvector128_v64i8:
; SKX: ## BB#0:
; SKX-NEXT: vextracti32x4 $2, %zmm0, %xmm0
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
%r1 = shufflevector <64 x i8> %x, <64 x i8> undef, <16 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38,i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47>
ret <16 x i8> %r1
}
define <16 x i8> @extract_subvector128_v64i8_first_element(<64 x i8> %x) nounwind {
; SKX-LABEL: extract_subvector128_v64i8_first_element:
; SKX: ## BB#0:
; SKX-NEXT: ## kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
%r1 = shufflevector <64 x i8> %x, <64 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i8> %r1
}
define <16 x i16> @extract_subvector256_v32i16(<32 x i16> %x) nounwind {
; SKX-LABEL: extract_subvector256_v32i16:
; SKX: ## BB#0:
; SKX-NEXT: vextracti64x4 $1, %zmm0, %ymm0
; SKX-NEXT: retq
%r1 = shufflevector <32 x i16> %x, <32 x i16> undef, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
ret <16 x i16> %r1
}
define <32 x i8> @extract_subvector256_v64i8(<64 x i8> %x) nounwind {
; SKX-LABEL: extract_subvector256_v64i8:
; SKX: ## BB#0:
; SKX-NEXT: vextracti64x4 $1, %zmm0, %ymm0
; SKX-NEXT: retq
%r1 = shufflevector <64 x i8> %x, <64 x i8> undef, <32 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
ret <32 x i8> %r1
}
define void @extract_subvector256_v8f64_store(double* nocapture %addr, <4 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v8f64_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextractf128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%1 = bitcast double* %addr to <2 x double>*
store <2 x double> %0, <2 x double>* %1, align 1
ret void
}
define void @extract_subvector256_v8f32_store(float* nocapture %addr, <8 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v8f32_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextractf128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast float* %addr to <4 x float>*
store <4 x float> %0, <4 x float>* %1, align 1
ret void
}
define void @extract_subvector256_v4i64_store(i64* nocapture %addr, <4 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4i64_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextracti128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x i64> %a, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
%1 = bitcast i64* %addr to <2 x i64>*
store <2 x i64> %0, <2 x i64>* %1, align 1
ret void
}
define void @extract_subvector256_v8i32_store(i32* nocapture %addr, <8 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v8i32_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextracti128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i32* %addr to <4 x i32>*
store <4 x i32> %0, <4 x i32>* %1, align 1
ret void
}
define void @extract_subvector256_v16i16_store(i16* nocapture %addr, <16 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v16i16_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextracti128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i16* %addr to <8 x i16>*
store <8 x i16> %0, <8 x i16>* %1, align 1
ret void
}
define void @extract_subvector256_v32i8_store(i8* nocapture %addr, <32 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v32i8_store:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vextracti128 $1, %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i8> %a, <32 x i8> undef, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
%1 = bitcast i8* %addr to <16 x i8>*
store <16 x i8> %0, <16 x i8>* %1, align 1
ret void
}
define void @extract_subvector256_v4f64_store_lo(double* nocapture %addr, <4 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4f64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast double* %addr to <2 x double>*
store <2 x double> %0, <2 x double>* %1, align 1
ret void
}
define void @extract_subvector256_v4f64_store_lo_align_16(double* nocapture %addr, <4 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4f64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x double> %a, <4 x double> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast double* %addr to <2 x double>*
store <2 x double> %0, <2 x double>* %1, align 16
ret void
}
define void @extract_subvector256_v4f32_store_lo(float* nocapture %addr, <8 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4f32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast float* %addr to <4 x float>*
store <4 x float> %0, <4 x float>* %1, align 1
ret void
}
define void @extract_subvector256_v4f32_store_lo_align_16(float* nocapture %addr, <8 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4f32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x float> %a, <8 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast float* %addr to <4 x float>*
store <4 x float> %0, <4 x float>* %1, align 16
ret void
}
define void @extract_subvector256_v2i64_store_lo(i64* nocapture %addr, <4 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v2i64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x i64> %a, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast i64* %addr to <2 x i64>*
store <2 x i64> %0, <2 x i64>* %1, align 1
ret void
}
define void @extract_subvector256_v2i64_store_lo_align_16(i64* nocapture %addr, <4 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v2i64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <4 x i64> %a, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast i64* %addr to <2 x i64>*
store <2 x i64> %0, <2 x i64>* %1, align 16
ret void
}
define void @extract_subvector256_v4i32_store_lo(i32* nocapture %addr, <8 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4i32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i32* %addr to <4 x i32>*
store <4 x i32> %0, <4 x i32>* %1, align 1
ret void
}
define void @extract_subvector256_v4i32_store_lo_align_16(i32* nocapture %addr, <8 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v4i32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i32* %addr to <4 x i32>*
store <4 x i32> %0, <4 x i32>* %1, align 16
ret void
}
define void @extract_subvector256_v8i16_store_lo(i16* nocapture %addr, <16 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v8i16_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i16* %addr to <8 x i16>*
store <8 x i16> %0, <8 x i16>* %1, align 1
ret void
}
define void @extract_subvector256_v8i16_store_lo_align_16(i16* nocapture %addr, <16 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v8i16_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i16* %addr to <8 x i16>*
store <8 x i16> %0, <8 x i16>* %1, align 16
ret void
}
define void @extract_subvector256_v16i8_store_lo(i8* nocapture %addr, <32 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v16i8_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i8> %a, <32 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i8* %addr to <16 x i8>*
store <16 x i8> %0, <16 x i8>* %1, align 1
ret void
}
define void @extract_subvector256_v16i8_store_lo_align_16(i8* nocapture %addr, <32 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector256_v16i8_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i8> %a, <32 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i8* %addr to <16 x i8>*
store <16 x i8> %0, <16 x i8>* %1, align 16
ret void
}
define void @extract_subvector512_v2f64_store_lo(double* nocapture %addr, <8 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v2f64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x double> %a, <8 x double> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast double* %addr to <2 x double>*
store <2 x double> %0, <2 x double>* %1, align 1
ret void
}
define void @extract_subvector512_v2f64_store_lo_align_16(double* nocapture %addr, <8 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v2f64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x double> %a, <8 x double> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast double* %addr to <2 x double>*
store <2 x double> %0, <2 x double>* %1, align 16
ret void
}
define void @extract_subvector512_v4f32_store_lo(float* nocapture %addr, <16 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4f32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x float> %a, <16 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast float* %addr to <4 x float>*
store <4 x float> %0, <4 x float>* %1, align 1
ret void
}
define void @extract_subvector512_v4f32_store_lo_align_16(float* nocapture %addr, <16 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4f32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x float> %a, <16 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast float* %addr to <4 x float>*
store <4 x float> %0, <4 x float>* %1, align 16
ret void
}
define void @extract_subvector512_v2i64_store_lo(i64* nocapture %addr, <8 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v2i64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i64> %a, <8 x i64> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast i64* %addr to <2 x i64>*
store <2 x i64> %0, <2 x i64>* %1, align 1
ret void
}
define void @extract_subvector512_v2i64_store_lo_align_16(i64* nocapture %addr, <8 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v2i64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i64> %a, <8 x i64> undef, <2 x i32> <i32 0, i32 1>
%1 = bitcast i64* %addr to <2 x i64>*
store <2 x i64> %0, <2 x i64>* %1, align 16
ret void
}
define void @extract_subvector512_v4i32_store_lo(i32* nocapture %addr, <16 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4i32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i32> %a, <16 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i32* %addr to <4 x i32>*
store <4 x i32> %0, <4 x i32>* %1, align 1
ret void
}
define void @extract_subvector512_v4i32_store_lo_align_16(i32* nocapture %addr, <16 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4i32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i32> %a, <16 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i32* %addr to <4 x i32>*
store <4 x i32> %0, <4 x i32>* %1, align 16
ret void
}
define void @extract_subvector512_v8i16_store_lo(i16* nocapture %addr, <32 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8i16_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i16> %a, <32 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i16* %addr to <8 x i16>*
store <8 x i16> %0, <8 x i16>* %1, align 1
ret void
}
define void @extract_subvector512_v16i8_store_lo(i8* nocapture %addr, <64 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v16i8_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <64 x i8> %a, <64 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i8* %addr to <16 x i8>*
store <16 x i8> %0, <16 x i8>* %1, align 1
ret void
}
define void @extract_subvector512_v16i8_store_lo_align_16(i8* nocapture %addr, <64 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v16i8_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %xmm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <64 x i8> %a, <64 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i8* %addr to <16 x i8>*
store <16 x i8> %0, <16 x i8>* %1, align 16
ret void
}
define void @extract_subvector512_v4f64_store_lo(double* nocapture %addr, <8 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4f64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x double> %a, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast double* %addr to <4 x double>*
store <4 x double> %0, <4 x double>* %1, align 1
ret void
}
define void @extract_subvector512_v4f64_store_lo_align_16(double* nocapture %addr, <8 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4f64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x double> %a, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast double* %addr to <4 x double>*
store <4 x double> %0, <4 x double>* %1, align 16
ret void
}
define void @extract_subvector512_v4f64_store_lo_align_32(double* nocapture %addr, <8 x double> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4f64_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x double> %a, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast double* %addr to <4 x double>*
store <4 x double> %0, <4 x double>* %1, align 32
ret void
}
define void @extract_subvector512_v8f32_store_lo(float* nocapture %addr, <16 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8f32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x float> %a, <16 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast float* %addr to <8 x float>*
store <8 x float> %0, <8 x float>* %1, align 1
ret void
}
define void @extract_subvector512_v8f32_store_lo_align_16(float* nocapture %addr, <16 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8f32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
-; SKX-NEXT: vmovaps %ymm0, (%rdi)
+; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x float> %a, <16 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast float* %addr to <8 x float>*
store <8 x float> %0, <8 x float>* %1, align 16
ret void
}
define void @extract_subvector512_v8f32_store_lo_align_32(float* nocapture %addr, <16 x float> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8f32_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x float> %a, <16 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast float* %addr to <8 x float>*
store <8 x float> %0, <8 x float>* %1, align 32
ret void
}
define void @extract_subvector512_v4i64_store_lo(i64* nocapture %addr, <8 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4i64_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i64> %a, <8 x i64> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i64* %addr to <4 x i64>*
store <4 x i64> %0, <4 x i64>* %1, align 1
ret void
}
define void @extract_subvector512_v4i64_store_lo_align_16(i64* nocapture %addr, <8 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4i64_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i64> %a, <8 x i64> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i64* %addr to <4 x i64>*
store <4 x i64> %0, <4 x i64>* %1, align 16
ret void
}
define void @extract_subvector512_v4i64_store_lo_align_32(i64* nocapture %addr, <8 x i64> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v4i64_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <8 x i64> %a, <8 x i64> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = bitcast i64* %addr to <4 x i64>*
store <4 x i64> %0, <4 x i64>* %1, align 32
ret void
}
define void @extract_subvector512_v8i32_store_lo(i32* nocapture %addr, <16 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8i32_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i32> %a, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i32* %addr to <8 x i32>*
store <8 x i32> %0, <8 x i32>* %1, align 1
ret void
}
define void @extract_subvector512_v8i32_store_lo_align_16(i32* nocapture %addr, <16 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8i32_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i32> %a, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i32* %addr to <8 x i32>*
store <8 x i32> %0, <8 x i32>* %1, align 16
ret void
}
define void @extract_subvector512_v8i32_store_lo_align_32(i32* nocapture %addr, <16 x i32> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v8i32_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <16 x i32> %a, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i32* %addr to <8 x i32>*
store <8 x i32> %0, <8 x i32>* %1, align 32
ret void
}
define void @extract_subvector512_v16i16_store_lo(i16* nocapture %addr, <32 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v16i16_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i16> %a, <32 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i16* %addr to <16 x i16>*
store <16 x i16> %0, <16 x i16>* %1, align 1
ret void
}
define void @extract_subvector512_v16i16_store_lo_align_16(i16* nocapture %addr, <32 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v16i16_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i16> %a, <32 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i16* %addr to <16 x i16>*
store <16 x i16> %0, <16 x i16>* %1, align 16
ret void
}
define void @extract_subvector512_v16i16_store_lo_align_32(i16* nocapture %addr, <32 x i16> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v16i16_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <32 x i16> %a, <32 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%1 = bitcast i16* %addr to <16 x i16>*
store <16 x i16> %0, <16 x i16>* %1, align 32
ret void
}
define void @extract_subvector512_v32i8_store_lo(i8* nocapture %addr, <64 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v32i8_store_lo:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <64 x i8> %a, <64 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
%1 = bitcast i8* %addr to <32 x i8>*
store <32 x i8> %0, <32 x i8>* %1, align 1
ret void
}
define void @extract_subvector512_v32i8_store_lo_align_16(i8* nocapture %addr, <64 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v32i8_store_lo_align_16:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovups %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <64 x i8> %a, <64 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
%1 = bitcast i8* %addr to <32 x i8>*
store <32 x i8> %0, <32 x i8>* %1, align 16
ret void
}
define void @extract_subvector512_v32i8_store_lo_align_32(i8* nocapture %addr, <64 x i8> %a) nounwind uwtable ssp {
; SKX-LABEL: extract_subvector512_v32i8_store_lo_align_32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: vmovaps %ymm0, (%rdi)
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = shufflevector <64 x i8> %a, <64 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
%1 = bitcast i8* %addr to <32 x i8>*
store <32 x i8> %0, <32 x i8>* %1, align 32
ret void
}
define <4 x double> @test_mm512_mask_extractf64x4_pd(<4 x double> %__W, i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_mask_extractf64x4_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x4 $1, %zmm1, %ymm0 {%k1}
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x double> %__A, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = select <4 x i1> %extract, <4 x double> %shuffle, <4 x double> %__W
ret <4 x double> %1
}
define <4 x double> @test_mm512_maskz_extractf64x4_pd(i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_maskz_extractf64x4_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x4 $1, %zmm0, %ymm0 {%k1} {z}
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x double> %__A, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = select <4 x i1> %extract, <4 x double> %shuffle, <4 x double> zeroinitializer
ret <4 x double> %1
}
define <4 x float> @test_mm512_mask_extractf32x4_ps(<4 x float> %__W, i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_mask_extractf32x4_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x4 $1, %zmm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = bitcast <8 x double> %__A to <16 x float>
%shuffle = shufflevector <16 x float> %0, <16 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %1, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = select <4 x i1> %extract, <4 x float> %shuffle, <4 x float> %__W
ret <4 x float> %2
}
define <4 x float> @test_mm512_maskz_extractf32x4_ps(i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_maskz_extractf32x4_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x4 $1, %zmm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = bitcast <8 x double> %__A to <16 x float>
%shuffle = shufflevector <16 x float> %0, <16 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %1, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = select <4 x i1> %extract, <4 x float> %shuffle, <4 x float> zeroinitializer
ret <4 x float> %2
}
define <2 x double> @test_mm256_mask_extractf64x2_pd(<2 x double> %__W, i8 %__U, <4 x double> %__A) {
; SKX-LABEL: test_mm256_mask_extractf64x2_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x2 $1, %ymm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <4 x double> %__A, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x double> %shuffle, <2 x double> %__W
ret <2 x double> %1
}
define <2 x double> @test_mm256_maskz_extractf64x2_pd(i8 %__U, <4 x double> %__A) {
; SKX-LABEL: test_mm256_maskz_extractf64x2_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x2 $1, %ymm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <4 x double> %__A, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x double> %shuffle, <2 x double> zeroinitializer
ret <2 x double> %1
}
define <2 x i64> @test_mm256_mask_extracti64x2_epi64(<2 x i64> %__W, i8 %__U, <4 x i64> %__A) {
; SKX-LABEL: test_mm256_mask_extracti64x2_epi64:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextracti64x2 $1, %ymm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <4 x i64> %__A, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x i64> %shuffle, <2 x i64> %__W
ret <2 x i64> %1
}
define <2 x i64> @test_mm256_maskz_extracti64x2_epi64(i8 %__U, <4 x i64> %__A) {
; SKX-LABEL: test_mm256_maskz_extracti64x2_epi64:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextracti64x2 $1, %ymm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <4 x i64> %__A, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x i64> %shuffle, <2 x i64> zeroinitializer
ret <2 x i64> %1
}
define <4 x float> @test_mm256_mask_extractf32x4_ps(<4 x float> %__W, i8 %__U, <8 x float> %__A) {
; SKX-LABEL: test_mm256_mask_extractf32x4_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x4 $1, %ymm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x float> %__A, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = select <4 x i1> %extract, <4 x float> %shuffle, <4 x float> %__W
ret <4 x float> %1
}
define <4 x float> @test_mm256_maskz_extractf32x4_ps(i8 %__U, <8 x float> %__A) {
; SKX-LABEL: test_mm256_maskz_extractf32x4_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x4 $1, %ymm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x float> %__A, <8 x float> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%1 = select <4 x i1> %extract, <4 x float> %shuffle, <4 x float> zeroinitializer
ret <4 x float> %1
}
define <2 x i64> @test_mm256_mask_extracti32x4_epi32(<2 x i64> %__W, i8 %__U, <4 x i64> %__A) {
; SKX-LABEL: test_mm256_mask_extracti32x4_epi32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextracti32x4 $1, %ymm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = bitcast <4 x i64> %__A to <8 x i32>
%shuffle = shufflevector <8 x i32> %0, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast <2 x i64> %__W to <4 x i32>
%2 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %2, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%3 = select <4 x i1> %extract, <4 x i32> %shuffle, <4 x i32> %1
%4 = bitcast <4 x i32> %3 to <2 x i64>
ret <2 x i64> %4
}
define <2 x i64> @test_mm256_maskz_extracti32x4_epi32(i8 %__U, <4 x i64> %__A) {
; SKX-LABEL: test_mm256_maskz_extracti32x4_epi32:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextracti32x4 $1, %ymm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%0 = bitcast <4 x i64> %__A to <8 x i32>
%shuffle = shufflevector <8 x i32> %0, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%1 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %1, <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = select <4 x i1> %extract, <4 x i32> %shuffle, <4 x i32> zeroinitializer
%3 = bitcast <4 x i32> %2 to <2 x i64>
ret <2 x i64> %3
}
define <8 x float> @test_mm512_mask_extractf32x8_ps(<8 x float> %__W, i8 %__U, <16 x float> %__A) {
; SKX-LABEL: test_mm512_mask_extractf32x8_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x8 $1, %zmm1, %ymm0 {%k1}
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <16 x float> %__A, <16 x float> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%0 = bitcast i8 %__U to <8 x i1>
%1 = select <8 x i1> %0, <8 x float> %shuffle, <8 x float> %__W
ret <8 x float> %1
}
define <8 x float> @test_mm512_maskz_extractf32x8_ps(i8 %__U, <16 x float> %__A) {
; SKX-LABEL: test_mm512_maskz_extractf32x8_ps:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf32x8 $1, %zmm0, %ymm0 {%k1} {z}
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <16 x float> %__A, <16 x float> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%0 = bitcast i8 %__U to <8 x i1>
%1 = select <8 x i1> %0, <8 x float> %shuffle, <8 x float> zeroinitializer
ret <8 x float> %1
}
define <2 x double> @test_mm512_mask_extractf64x2_pd(<2 x double> %__W, i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_mask_extractf64x2_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x2 $3, %zmm1, %xmm0 {%k1}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x double> %__A, <8 x double> undef, <2 x i32> <i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x double> %shuffle, <2 x double> %__W
ret <2 x double> %1
}
define <2 x double> @test_mm512_maskz_extractf64x2_pd(i8 %__U, <8 x double> %__A) {
; SKX-LABEL: test_mm512_maskz_extractf64x2_pd:
; SKX: ## BB#0: ## %entry
; SKX-NEXT: kmovd %edi, %k1
; SKX-NEXT: vextractf64x2 $3, %zmm0, %xmm0 {%k1} {z}
; SKX-NEXT: vzeroupper
; SKX-NEXT: retq
entry:
%shuffle = shufflevector <8 x double> %__A, <8 x double> undef, <2 x i32> <i32 6, i32 7>
%0 = bitcast i8 %__U to <8 x i1>
%extract = shufflevector <8 x i1> %0, <8 x i1> undef, <2 x i32> <i32 0, i32 1>
%1 = select <2 x i1> %extract, <2 x double> %shuffle, <2 x double> zeroinitializer
ret <2 x double> %1
}
diff --git a/test/CodeGen/X86/extractelement-legalization-store-ordering.ll b/test/CodeGen/X86/extractelement-legalization-store-ordering.ll
index 4d0b5ccc16b0..9d0900f3b424 100644
--- a/test/CodeGen/X86/extractelement-legalization-store-ordering.ll
+++ b/test/CodeGen/X86/extractelement-legalization-store-ordering.ll
@@ -1,59 +1,59 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=i386-apple-darwin -mcpu=yonah | FileCheck %s
target datalayout = "e-m:o-p:32:32-f64:32:64-f80:128-n8:16:32-S128"
; Make sure we don't break load/store ordering when turning an extractelement
; into loads, either off the stack or from a previous store.
; Be very explicit about the ordering/stack offsets.
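; For example, a variable-index extract such as
;   %vecext = extractelement <4 x i32> %am, i32 %ip0
; may be lowered to a scalar load from the memory %am was just stored to,
; so the preceding vector store must not be reordered past that load.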
define void @test_extractelement_legalization_storereuse(<4 x i32> %a, i32* nocapture %x, i32* nocapture readonly %y, i32 %i) #0 {
; CHECK-LABEL: test_extractelement_legalization_storereuse:
; CHECK: ## BB#0: ## %entry
; CHECK-NEXT: pushl %ebx
; CHECK-NEXT: pushl %edi
; CHECK-NEXT: pushl %esi
; CHECK-NEXT: movl {{[0-9]+}}(%esp), %eax
; CHECK-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; CHECK-NEXT: paddd (%ecx), %xmm0
; CHECK-NEXT: movl {{[0-9]+}}(%esp), %edx
-; CHECK-NEXT: movdqa %xmm0, (%ecx)
-; CHECK-NEXT: movl (%ecx), %esi
-; CHECK-NEXT: movl 4(%ecx), %edi
-; CHECK-NEXT: shll $4, %edx
-; CHECK-NEXT: movl 8(%ecx), %ebx
-; CHECK-NEXT: movl 12(%ecx), %ecx
-; CHECK-NEXT: movl %esi, 12(%eax,%edx)
-; CHECK-NEXT: movl %edi, (%eax,%edx)
-; CHECK-NEXT: movl %ebx, 8(%eax,%edx)
-; CHECK-NEXT: movl %ecx, 4(%eax,%edx)
+; CHECK-NEXT: paddd (%edx), %xmm0
+; CHECK-NEXT: movdqa %xmm0, (%edx)
+; CHECK-NEXT: movl (%edx), %esi
+; CHECK-NEXT: movl 4(%edx), %edi
+; CHECK-NEXT: shll $4, %ecx
+; CHECK-NEXT: movl 8(%edx), %ebx
+; CHECK-NEXT: movl 12(%edx), %edx
+; CHECK-NEXT: movl %esi, 12(%eax,%ecx)
+; CHECK-NEXT: movl %edi, (%eax,%ecx)
+; CHECK-NEXT: movl %ebx, 8(%eax,%ecx)
+; CHECK-NEXT: movl %edx, 4(%eax,%ecx)
; CHECK-NEXT: popl %esi
; CHECK-NEXT: popl %edi
; CHECK-NEXT: popl %ebx
; CHECK-NEXT: retl
; CHECK-NEXT: ## -- End function
entry:
%0 = bitcast i32* %y to <4 x i32>*
%1 = load <4 x i32>, <4 x i32>* %0, align 16
%am = add <4 x i32> %a, %1
store <4 x i32> %am, <4 x i32>* %0, align 16
%ip0 = shl nsw i32 %i, 2
%ip1 = or i32 %ip0, 1
%ip2 = or i32 %ip0, 2
%ip3 = or i32 %ip0, 3
%vecext = extractelement <4 x i32> %am, i32 %ip0
%arrayidx = getelementptr inbounds i32, i32* %x, i32 %ip3
store i32 %vecext, i32* %arrayidx, align 4
%vecext5 = extractelement <4 x i32> %am, i32 %ip1
%arrayidx8 = getelementptr inbounds i32, i32* %x, i32 %ip0
store i32 %vecext5, i32* %arrayidx8, align 4
%vecext11 = extractelement <4 x i32> %am, i32 %ip2
%arrayidx14 = getelementptr inbounds i32, i32* %x, i32 %ip2
store i32 %vecext11, i32* %arrayidx14, align 4
%vecext17 = extractelement <4 x i32> %am, i32 %ip3
%arrayidx20 = getelementptr inbounds i32, i32* %x, i32 %ip1
store i32 %vecext17, i32* %arrayidx20, align 4
ret void
}
attributes #0 = { nounwind }
diff --git a/test/CodeGen/X86/f16c-schedule.ll b/test/CodeGen/X86/f16c-schedule.ll
deleted file mode 100644
index 15ae4a49d7d3..000000000000
--- a/test/CodeGen/X86/f16c-schedule.ll
+++ /dev/null
@@ -1,144 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=IVY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
-
-define <4 x float> @test_vcvtph2ps_128(<8 x i16> %a0, <8 x i16> *%a1) {
-; IVY-LABEL: test_vcvtph2ps_128:
-; IVY: # BB#0:
-; IVY-NEXT: vcvtph2ps (%rdi), %xmm1 # sched: [7:1.00]
-; IVY-NEXT: vcvtph2ps %xmm0, %xmm0 # sched: [3:1.00]
-; IVY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; IVY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_vcvtph2ps_128:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: vcvtph2ps (%rdi), %xmm1 # sched: [7:1.00]
-; HASWELL-NEXT: vcvtph2ps %xmm0, %xmm0 # sched: [4:1.00]
-; HASWELL-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_vcvtph2ps_128:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: vcvtph2ps (%rdi), %xmm1 # sched: [8:1.00]
-; BTVER2-NEXT: vcvtph2ps %xmm0, %xmm0 # sched: [3:1.00]
-; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_vcvtph2ps_128:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: vcvtph2ps (%rdi), %xmm1 # sched: [12:1.00]
-; ZNVER1-NEXT: vcvtph2ps %xmm0, %xmm0 # sched: [5:1.00]
-; ZNVER1-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = load <8 x i16>, <8 x i16> *%a1
- %2 = call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %1)
- %3 = call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %a0)
- %4 = fadd <4 x float> %2, %3
- ret <4 x float> %4
-}
-declare <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16>)
-
-define <8 x float> @test_vcvtph2ps_256(<8 x i16> %a0, <8 x i16> *%a1) {
-; IVY-LABEL: test_vcvtph2ps_256:
-; IVY: # BB#0:
-; IVY-NEXT: vcvtph2ps (%rdi), %ymm1 # sched: [7:1.00]
-; IVY-NEXT: vcvtph2ps %xmm0, %ymm0 # sched: [3:1.00]
-; IVY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; IVY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_vcvtph2ps_256:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: vcvtph2ps (%rdi), %ymm1 # sched: [7:1.00]
-; HASWELL-NEXT: vcvtph2ps %xmm0, %ymm0 # sched: [4:1.00]
-; HASWELL-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_vcvtph2ps_256:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: vcvtph2ps (%rdi), %ymm1 # sched: [8:1.00]
-; BTVER2-NEXT: vcvtph2ps %xmm0, %ymm0 # sched: [3:1.00]
-; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_vcvtph2ps_256:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: vcvtph2ps (%rdi), %ymm1 # sched: [12:1.00]
-; ZNVER1-NEXT: vcvtph2ps %xmm0, %ymm0 # sched: [5:1.00]
-; ZNVER1-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = load <8 x i16>, <8 x i16> *%a1
- %2 = call <8 x float> @llvm.x86.vcvtph2ps.256(<8 x i16> %1)
- %3 = call <8 x float> @llvm.x86.vcvtph2ps.256(<8 x i16> %a0)
- %4 = fadd <8 x float> %2, %3
- ret <8 x float> %4
-}
-declare <8 x float> @llvm.x86.vcvtph2ps.256(<8 x i16>)
-
-define <8 x i16> @test_vcvtps2ph_128(<4 x float> %a0, <4 x float> %a1, <4 x i16> *%a2) {
-; IVY-LABEL: test_vcvtps2ph_128:
-; IVY: # BB#0:
-; IVY-NEXT: vcvtps2ph $0, %xmm0, %xmm0 # sched: [3:1.00]
-; IVY-NEXT: vcvtps2ph $0, %xmm1, (%rdi) # sched: [7:1.00]
-; IVY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_vcvtps2ph_128:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: vcvtps2ph $0, %xmm0, %xmm0 # sched: [4:1.00]
-; HASWELL-NEXT: vcvtps2ph $0, %xmm1, (%rdi) # sched: [8:1.00]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_vcvtps2ph_128:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: vcvtps2ph $0, %xmm0, %xmm0 # sched: [3:1.00]
-; BTVER2-NEXT: vcvtps2ph $0, %xmm1, (%rdi) # sched: [8:1.00]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_vcvtps2ph_128:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: vcvtps2ph $0, %xmm0, %xmm0 # sched: [5:1.00]
-; ZNVER1-NEXT: vcvtps2ph $0, %xmm1, (%rdi) # sched: [12:1.00]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = call <8 x i16> @llvm.x86.vcvtps2ph.128(<4 x float> %a0, i32 0)
- %2 = call <8 x i16> @llvm.x86.vcvtps2ph.128(<4 x float> %a1, i32 0)
- %3 = shufflevector <8 x i16> %2, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
- store <4 x i16> %3, <4 x i16> *%a2
- ret <8 x i16> %1
-}
-declare <8 x i16> @llvm.x86.vcvtps2ph.128(<4 x float>, i32)
-
-define <8 x i16> @test_vcvtps2ph_256(<8 x float> %a0, <8 x float> %a1, <8 x i16> *%a2) {
-; IVY-LABEL: test_vcvtps2ph_256:
-; IVY: # BB#0:
-; IVY-NEXT: vcvtps2ph $0, %ymm0, %xmm0 # sched: [3:1.00]
-; IVY-NEXT: vcvtps2ph $0, %ymm1, (%rdi) # sched: [7:1.00]
-; IVY-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; IVY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_vcvtps2ph_256:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: vcvtps2ph $0, %ymm0, %xmm0 # sched: [4:1.00]
-; HASWELL-NEXT: vcvtps2ph $0, %ymm1, (%rdi) # sched: [8:1.00]
-; HASWELL-NEXT: vzeroupper # sched: [1:0.00]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_vcvtps2ph_256:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: vcvtps2ph $0, %ymm0, %xmm0 # sched: [3:1.00]
-; BTVER2-NEXT: vcvtps2ph $0, %ymm1, (%rdi) # sched: [8:1.00]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_vcvtps2ph_256:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: vcvtps2ph $0, %ymm0, %xmm0 # sched: [5:1.00]
-; ZNVER1-NEXT: vcvtps2ph $0, %ymm1, (%rdi) # sched: [12:1.00]
-; ZNVER1-NEXT: vzeroupper # sched: [?:0.000000e+00]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = call <8 x i16> @llvm.x86.vcvtps2ph.256(<8 x float> %a0, i32 0)
- %2 = call <8 x i16> @llvm.x86.vcvtps2ph.256(<8 x float> %a1, i32 0)
- store <8 x i16> %2, <8 x i16> *%a2
- ret <8 x i16> %1
-}
-declare <8 x i16> @llvm.x86.vcvtps2ph.256(<8 x float>, i32)
diff --git a/test/CodeGen/X86/fp128-i128.ll b/test/CodeGen/X86/fp128-i128.ll
index 98082ec611d4..6c6bc8bdc1d1 100644
--- a/test/CodeGen/X86/fp128-i128.ll
+++ b/test/CodeGen/X86/fp128-i128.ll
@@ -1,395 +1,395 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -O2 -mtriple=x86_64-linux-android -mattr=+mmx -enable-legalize-types-checking | FileCheck %s
; RUN: llc < %s -O2 -mtriple=x86_64-linux-gnu -mattr=+mmx -enable-legalize-types-checking | FileCheck %s
; These tests were generated from simplified libm C code.
; When compiled for the x86_64-linux-android target,
; long double is mapped to the f128 type, which should be passed
; in SSE registers. When the f128 calling-convention
; problem was fixed, the old llvm code failed to handle f128 values
; in several f128/i128 type operations. These unit tests hopefully
; will catch regressions in any future change in this area.
; To modify or enhance these test cases, please consult the libm
; code patterns and compile with -target x86_64-linux-android
; to generate IR. If the __float128 keyword is not accepted by
; clang, just define it to "long double".
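; For example, a sketch of that workflow (the file name is hypothetical):
;   clang -target x86_64-linux-android -O2 -S -emit-llvm fp128-test.c -o fp128-test.ll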
;
; typedef long double __float128;
; union IEEEl2bits {
; __float128 e;
; struct {
; unsigned long manl :64;
; unsigned long manh :48;
; unsigned int exp :15;
; unsigned int sign :1;
; } bits;
; struct {
; unsigned long manl :64;
; unsigned long manh :48;
; unsigned int expsign :16;
; } xbits;
; };
; C code:
; void foo(__float128 x);
; void TestUnionLD1(__float128 s, unsigned long n) {
; union IEEEl2bits u;
; __float128 w;
; u.e = s;
; u.bits.manh = n;
; w = u.e;
; foo(w);
; }
define void @TestUnionLD1(fp128 %s, i64 %n) #0 {
; CHECK-LABEL: TestUnionLD1:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movq -{{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movabsq $281474976710655, %rcx # imm = 0xFFFFFFFFFFFF
; CHECK-NEXT: andq %rdi, %rcx
; CHECK-NEXT: movabsq $-281474976710656, %rdx # imm = 0xFFFF000000000000
; CHECK-NEXT: andq -{{[0-9]+}}(%rsp), %rdx
-; CHECK-NEXT: movq %rax, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: orq %rcx, %rdx
+; CHECK-NEXT: movq %rax, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movq %rdx, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movaps -{{[0-9]+}}(%rsp), %xmm0
; CHECK-NEXT: jmp foo # TAILCALL
entry:
%0 = bitcast fp128 %s to i128
%1 = zext i64 %n to i128
%bf.value = shl nuw i128 %1, 64
%bf.shl = and i128 %bf.value, 5192296858534809181786422619668480
%bf.clear = and i128 %0, -5192296858534809181786422619668481
%bf.set = or i128 %bf.shl, %bf.clear
%2 = bitcast i128 %bf.set to fp128
tail call void @foo(fp128 %2) #2
ret void
}
; C code:
; __float128 TestUnionLD2(__float128 s) {
; union IEEEl2bits u;
; __float128 w;
; u.e = s;
; u.bits.manl = 0;
; w = u.e;
; return w;
; }
define fp128 @TestUnionLD2(fp128 %s) #0 {
; CHECK-LABEL: TestUnionLD2:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movq -{{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movq %rax, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movq $0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movaps -{{[0-9]+}}(%rsp), %xmm0
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %s to i128
%bf.clear = and i128 %0, -18446744073709551616
%1 = bitcast i128 %bf.clear to fp128
ret fp128 %1
}
; C code:
; __float128 TestI128_1(__float128 x)
; {
; union IEEEl2bits z;
; z.e = x;
; z.bits.sign = 0;
; return (z.e < 0.1L) ? 1.0L : 2.0L;
; }
define fp128 @TestI128_1(fp128 %x) #0 {
; CHECK-LABEL: TestI128_1:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: subq $40, %rsp
; CHECK-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movabsq $9223372036854775807, %rcx # imm = 0x7FFFFFFFFFFFFFFF
; CHECK-NEXT: andq {{[0-9]+}}(%rsp), %rcx
; CHECK-NEXT: movq %rcx, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq %rax, (%rsp)
; CHECK-NEXT: movaps (%rsp), %xmm0
; CHECK-NEXT: movaps {{.*}}(%rip), %xmm1
; CHECK-NEXT: callq __lttf2
; CHECK-NEXT: xorl %ecx, %ecx
; CHECK-NEXT: testl %eax, %eax
; CHECK-NEXT: sets %cl
; CHECK-NEXT: shlq $4, %rcx
; CHECK-NEXT: movaps {{\.LCPI.*}}(%rcx), %xmm0
; CHECK-NEXT: addq $40, %rsp
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%bf.clear = and i128 %0, 170141183460469231731687303715884105727
%1 = bitcast i128 %bf.clear to fp128
%cmp = fcmp olt fp128 %1, 0xL999999999999999A3FFB999999999999
%cond = select i1 %cmp, fp128 0xL00000000000000003FFF000000000000, fp128 0xL00000000000000004000000000000000
ret fp128 %cond
}
; C code:
; __float128 TestI128_2(__float128 x, __float128 y)
; {
; unsigned short hx;
; union IEEEl2bits ge_u;
; ge_u.e = x;
; hx = ge_u.xbits.expsign;
; return (hx & 0x8000) == 0 ? x : y;
; }
define fp128 @TestI128_2(fp128 %x, fp128 %y) #0 {
; CHECK-LABEL: TestI128_2:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: cmpq $0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: jns .LBB3_2
; CHECK-NEXT: # BB#1: # %entry
; CHECK-NEXT: movaps %xmm1, %xmm0
; CHECK-NEXT: .LBB3_2: # %entry
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%cmp = icmp sgt i128 %0, -1
%cond = select i1 %cmp, fp128 %x, fp128 %y
ret fp128 %cond
}
; C code:
; __float128 TestI128_3(__float128 x, int *ex)
; {
; union IEEEl2bits u;
; u.e = x;
; if (u.bits.exp == 0) {
; u.e *= 0x1.0p514;
; u.bits.exp = 0x3ffe;
; }
; return (u.e);
; }
define fp128 @TestI128_3(fp128 %x, i32* nocapture readnone %ex) #0 {
; CHECK-LABEL: TestI128_3:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: subq $56, %rsp
; CHECK-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movabsq $9223090561878065152, %rcx # imm = 0x7FFF000000000000
; CHECK-NEXT: testq %rcx, %rax
; CHECK-NEXT: je .LBB4_2
; CHECK-NEXT: # BB#1:
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rcx
; CHECK-NEXT: jmp .LBB4_3
; CHECK-NEXT: .LBB4_2: # %if.then
; CHECK-NEXT: movaps {{.*}}(%rip), %xmm1
; CHECK-NEXT: callq __multf3
; CHECK-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rcx
; CHECK-NEXT: movabsq $-9223090561878065153, %rdx # imm = 0x8000FFFFFFFFFFFF
; CHECK-NEXT: andq {{[0-9]+}}(%rsp), %rdx
; CHECK-NEXT: movabsq $4611123068473966592, %rax # imm = 0x3FFE000000000000
; CHECK-NEXT: orq %rdx, %rax
; CHECK-NEXT: .LBB4_3: # %if.end
; CHECK-NEXT: movq %rcx, (%rsp)
; CHECK-NEXT: movq %rax, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movaps (%rsp), %xmm0
; CHECK-NEXT: addq $56, %rsp
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%bf.cast = and i128 %0, 170135991163610696904058773219554885632
%cmp = icmp eq i128 %bf.cast, 0
br i1 %cmp, label %if.then, label %if.end
if.then: ; preds = %entry
%mul = fmul fp128 %x, 0xL00000000000000004201000000000000
%1 = bitcast fp128 %mul to i128
%bf.clear4 = and i128 %1, -170135991163610696904058773219554885633
%bf.set = or i128 %bf.clear4, 85060207136517546210586590865283612672
br label %if.end
if.end: ; preds = %if.then, %entry
%u.sroa.0.0 = phi i128 [ %bf.set, %if.then ], [ %0, %entry ]
%2 = bitcast i128 %u.sroa.0.0 to fp128
ret fp128 %2
}
; C code:
; __float128 TestI128_4(__float128 x)
; {
; union IEEEl2bits u;
; __float128 df;
; u.e = x;
; u.xbits.manl = 0;
; df = u.e;
; return x + df;
; }
define fp128 @TestI128_4(fp128 %x) #0 {
; CHECK-LABEL: TestI128_4:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: subq $40, %rsp
; CHECK-NEXT: movaps %xmm0, %xmm1
; CHECK-NEXT: movaps %xmm1, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movq %rax, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq $0, (%rsp)
; CHECK-NEXT: movaps (%rsp), %xmm0
; CHECK-NEXT: callq __addtf3
; CHECK-NEXT: addq $40, %rsp
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%bf.clear = and i128 %0, -18446744073709551616
%1 = bitcast i128 %bf.clear to fp128
%add = fadd fp128 %1, %x
ret fp128 %add
}
@v128 = common global i128 0, align 16
@v128_2 = common global i128 0, align 16
; C code:
; unsigned __int128 v128, v128_2;
; void TestShift128_2() {
; v128 = ((v128 << 96) | v128_2);
; }
define void @TestShift128_2() #2 {
; CHECK-LABEL: TestShift128_2:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movq {{.*}}(%rip), %rax
; CHECK-NEXT: shlq $32, %rax
; CHECK-NEXT: movq {{.*}}(%rip), %rcx
; CHECK-NEXT: orq v128_2+{{.*}}(%rip), %rax
; CHECK-NEXT: movq %rcx, {{.*}}(%rip)
; CHECK-NEXT: movq %rax, v128+{{.*}}(%rip)
; CHECK-NEXT: retq
entry:
%0 = load i128, i128* @v128, align 16
%shl = shl i128 %0, 96
%1 = load i128, i128* @v128_2, align 16
%or = or i128 %shl, %1
store i128 %or, i128* @v128, align 16
ret void
}
define fp128 @acosl(fp128 %x) #0 {
; CHECK-LABEL: acosl:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: subq $40, %rsp
; CHECK-NEXT: movaps %xmm0, %xmm1
; CHECK-NEXT: movaps %xmm1, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rax
; CHECK-NEXT: movq %rax, {{[0-9]+}}(%rsp)
; CHECK-NEXT: movq $0, (%rsp)
; CHECK-NEXT: movaps (%rsp), %xmm0
; CHECK-NEXT: callq __addtf3
; CHECK-NEXT: addq $40, %rsp
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%bf.clear = and i128 %0, -18446744073709551616
%1 = bitcast i128 %bf.clear to fp128
%add = fadd fp128 %1, %x
ret fp128 %add
}
; Compare i128 values and check i128 constants.
define fp128 @TestComp(fp128 %x, fp128 %y) #0 {
; CHECK-LABEL: TestComp:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: movaps %xmm0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: cmpq $0, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: jns .LBB8_2
; CHECK-NEXT: # BB#1: # %entry
; CHECK-NEXT: movaps %xmm1, %xmm0
; CHECK-NEXT: .LBB8_2: # %entry
; CHECK-NEXT: retq
entry:
%0 = bitcast fp128 %x to i128
%cmp = icmp sgt i128 %0, -1
%cond = select i1 %cmp, fp128 %x, fp128 %y
ret fp128 %cond
}
declare void @foo(fp128) #1
; Test logical operations on fp128 values.
define fp128 @TestFABS_LD(fp128 %x) #0 {
; CHECK-LABEL: TestFABS_LD:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: andps {{.*}}(%rip), %xmm0
; CHECK-NEXT: retq
entry:
%call = tail call fp128 @fabsl(fp128 %x) #2
ret fp128 %call
}
declare fp128 @fabsl(fp128) #1
declare fp128 @copysignl(fp128, fp128) #1
; Test more complicated logical operations generated from copysignl.
define void @TestCopySign({ fp128, fp128 }* noalias nocapture sret %agg.result, { fp128, fp128 }* byval nocapture readonly align 16 %z) #0 {
; CHECK-LABEL: TestCopySign:
; CHECK: # BB#0: # %entry
; CHECK-NEXT: pushq %rbp
; CHECK-NEXT: pushq %rbx
; CHECK-NEXT: subq $40, %rsp
; CHECK-NEXT: movq %rdi, %rbx
; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
; CHECK-NEXT: movaps %xmm1, {{[0-9]+}}(%rsp) # 16-byte Spill
; CHECK-NEXT: movaps %xmm0, (%rsp) # 16-byte Spill
; CHECK-NEXT: callq __gttf2
; CHECK-NEXT: movl %eax, %ebp
; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0 # 16-byte Reload
; CHECK-NEXT: movaps %xmm0, %xmm1
; CHECK-NEXT: callq __subtf3
; CHECK-NEXT: testl %ebp, %ebp
; CHECK-NEXT: jle .LBB10_1
; CHECK-NEXT: # BB#2: # %if.then
; CHECK-NEXT: andps {{.*}}(%rip), %xmm0
; CHECK-NEXT: movaps %xmm0, %xmm1
; CHECK-NEXT: movaps (%rsp), %xmm0 # 16-byte Reload
; CHECK-NEXT: movaps %xmm1, %xmm2
; CHECK-NEXT: jmp .LBB10_3
; CHECK-NEXT: .LBB10_1:
; CHECK-NEXT: movaps (%rsp), %xmm2 # 16-byte Reload
; CHECK-NEXT: .LBB10_3: # %cleanup
; CHECK-NEXT: movaps {{.*}}(%rip), %xmm1
; CHECK-NEXT: andps {{[0-9]+}}(%rsp), %xmm1 # 16-byte Folded Reload
; CHECK-NEXT: andps {{.*}}(%rip), %xmm0
; CHECK-NEXT: orps %xmm1, %xmm0
; CHECK-NEXT: movaps %xmm2, (%rbx)
; CHECK-NEXT: movaps %xmm0, 16(%rbx)
; CHECK-NEXT: movq %rbx, %rax
; CHECK-NEXT: addq $40, %rsp
; CHECK-NEXT: popq %rbx
; CHECK-NEXT: popq %rbp
; CHECK-NEXT: retq
entry:
%z.realp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %z, i64 0, i32 0
%z.real = load fp128, fp128* %z.realp, align 16
%z.imagp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %z, i64 0, i32 1
%z.imag4 = load fp128, fp128* %z.imagp, align 16
%cmp = fcmp ogt fp128 %z.real, %z.imag4
%sub = fsub fp128 %z.imag4, %z.imag4
br i1 %cmp, label %if.then, label %cleanup
if.then: ; preds = %entry
%call = tail call fp128 @fabsl(fp128 %sub) #2
br label %cleanup
cleanup: ; preds = %entry, %if.then
%z.real.sink = phi fp128 [ %z.real, %if.then ], [ %sub, %entry ]
%call.sink = phi fp128 [ %call, %if.then ], [ %z.real, %entry ]
%call5 = tail call fp128 @copysignl(fp128 %z.real.sink, fp128 %z.imag4) #2
%0 = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %agg.result, i64 0, i32 0
%1 = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %agg.result, i64 0, i32 1
store fp128 %call.sink, fp128* %0, align 16
store fp128 %call5, fp128* %1, align 16
ret void
}
attributes #0 = { nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+ssse3,+sse3,+popcnt,+sse,+sse2,+sse4.1,+sse4.2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+ssse3,+sse3,+popcnt,+sse,+sse2,+sse4.1,+sse4.2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind readnone }
diff --git a/test/CodeGen/X86/gather-addresses.ll b/test/CodeGen/X86/gather-addresses.ll
index e09ad3e4e0b8..c3109673468e 100644
--- a/test/CodeGen/X86/gather-addresses.ll
+++ b/test/CodeGen/X86/gather-addresses.ll
@@ -1,90 +1,90 @@
; RUN: llc -mtriple=x86_64-linux -mcpu=nehalem < %s | FileCheck %s --check-prefix=LIN
; RUN: llc -mtriple=x86_64-win32 -mcpu=nehalem < %s | FileCheck %s --check-prefix=WIN
; RUN: llc -mtriple=i686-win32 -mcpu=nehalem < %s | FileCheck %s --check-prefix=LIN32
; rdar://7398554
; When doing vector gather-scatter index calculation with 32-bit indices,
; use an efficient mov/shift sequence rather than shuffling each individual
; element out of the index vector.
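; For example, two 32-bit indices can be recovered from a single 64-bit
; transfer (a sketch of the pattern checked below):
;   movq   %xmm0, %rax   ; elements 0 and 1 in one move
;   movslq %eax, %rcx    ; index 0: sign-extend the low 32 bits
;   sarq   $32, %rax     ; index 1: arithmetic shift exposes the high 32 bits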
; CHECK-LABEL: foo:
; LIN: movdqa (%rsi), %xmm0
; LIN: pand (%rdx), %xmm0
; LIN: pextrq $1, %xmm0, %r[[REG4:.+]]
; LIN: movq %xmm0, %r[[REG2:.+]]
; LIN: movslq %e[[REG2]], %r[[REG1:.+]]
; LIN: sarq $32, %r[[REG2]]
; LIN: movslq %e[[REG4]], %r[[REG3:.+]]
; LIN: sarq $32, %r[[REG4]]
-; LIN: movsd (%rdi,%r[[REG3]],8), %xmm1
-; LIN: movhpd (%rdi,%r[[REG4]],8), %xmm1
-; LIN: movq %rdi, %xmm1
-; LIN: movq %r[[REG3]], %xmm0
+; LIN: movsd (%rdi,%r[[REG1]],8), %xmm0
+; LIN: movhpd (%rdi,%r[[REG2]],8), %xmm0
+; LIN: movsd (%rdi,%r[[REG3]],8), %xmm1
+; LIN: movhpd (%rdi,%r[[REG4]],8), %xmm1
; WIN: movdqa (%rdx), %xmm0
; WIN: pand (%r8), %xmm0
; WIN: pextrq $1, %xmm0, %r[[REG4:.+]]
; WIN: movq %xmm0, %r[[REG2:.+]]
; WIN: movslq %e[[REG2]], %r[[REG1:.+]]
; WIN: sarq $32, %r[[REG2]]
; WIN: movslq %e[[REG4]], %r[[REG3:.+]]
; WIN: sarq $32, %r[[REG4]]
-; WIN: movsd (%rcx,%r[[REG3]],8), %xmm1
-; WIN: movhpd (%rcx,%r[[REG4]],8), %xmm1
-; WIN: movdqa (%r[[REG2]]), %xmm0
-; WIN: movq %r[[REG2]], %xmm1
+; WIN: movsd (%rcx,%r[[REG1]],8), %xmm0
+; WIN: movhpd (%rcx,%r[[REG2]],8), %xmm0
+; WIN: movsd (%rcx,%r[[REG3]],8), %xmm1
+; WIN: movhpd (%rcx,%r[[REG4]],8), %xmm1
define <4 x double> @foo(double* %p, <4 x i32>* %i, <4 x i32>* %h) nounwind {
%a = load <4 x i32>, <4 x i32>* %i
%b = load <4 x i32>, <4 x i32>* %h
%j = and <4 x i32> %a, %b
%d0 = extractelement <4 x i32> %j, i32 0
%d1 = extractelement <4 x i32> %j, i32 1
%d2 = extractelement <4 x i32> %j, i32 2
%d3 = extractelement <4 x i32> %j, i32 3
%q0 = getelementptr double, double* %p, i32 %d0
%q1 = getelementptr double, double* %p, i32 %d1
%q2 = getelementptr double, double* %p, i32 %d2
%q3 = getelementptr double, double* %p, i32 %d3
%r0 = load double, double* %q0
%r1 = load double, double* %q1
%r2 = load double, double* %q2
%r3 = load double, double* %q3
%v0 = insertelement <4 x double> undef, double %r0, i32 0
%v1 = insertelement <4 x double> %v0, double %r1, i32 1
%v2 = insertelement <4 x double> %v1, double %r2, i32 2
%v3 = insertelement <4 x double> %v2, double %r3, i32 3
ret <4 x double> %v3
}
; Check that the sequence previously used above, which bounces the vector off the
; cache, works for x86-32. Note that in this case it will not be used for index
; calculation, since the indices are 32-bit, not 64-bit.
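; For example (a sketch of the x86-32 pattern checked below):
;   movaps %xmm0, (%esp)  ; bounce the vector through the stack
;   movl   (%esp), %eax   ; reload lane 0 as a scalar
;   movl   4(%esp), %ecx  ; reload lane 1 as a scalar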
; CHECK-LABEL: old:
; LIN32: movaps %xmm0, (%esp)
; LIN32-DAG: {{(mov|and)}}l (%esp),
; LIN32-DAG: {{(mov|and)}}l 4(%esp),
; LIN32-DAG: {{(mov|and)}}l 8(%esp),
; LIN32-DAG: {{(mov|and)}}l 12(%esp),
define <4 x i64> @old(double* %p, <4 x i32>* %i, <4 x i32>* %h, i64 %f) nounwind {
%a = load <4 x i32>, <4 x i32>* %i
%b = load <4 x i32>, <4 x i32>* %h
%j = and <4 x i32> %a, %b
%d0 = extractelement <4 x i32> %j, i32 0
%d1 = extractelement <4 x i32> %j, i32 1
%d2 = extractelement <4 x i32> %j, i32 2
%d3 = extractelement <4 x i32> %j, i32 3
%q0 = zext i32 %d0 to i64
%q1 = zext i32 %d1 to i64
%q2 = zext i32 %d2 to i64
%q3 = zext i32 %d3 to i64
%r0 = and i64 %q0, %f
%r1 = and i64 %q1, %f
%r2 = and i64 %q2, %f
%r3 = and i64 %q3, %f
%v0 = insertelement <4 x i64> undef, i64 %r0, i32 0
%v1 = insertelement <4 x i64> %v0, i64 %r1, i32 1
%v2 = insertelement <4 x i64> %v1, i64 %r2, i32 2
%v3 = insertelement <4 x i64> %v2, i64 %r3, i32 3
ret <4 x i64> %v3
}
diff --git a/test/CodeGen/X86/lea32-schedule.ll b/test/CodeGen/X86/lea32-schedule.ll
deleted file mode 100644
index e42ce30c5a6d..000000000000
--- a/test/CodeGen/X86/lea32-schedule.ll
+++ /dev/null
@@ -1,653 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=x86-64 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=knl | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
-
-define i32 @test_lea_offset(i32) {
-; GENERIC-LABEL: test_lea_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal -24(%rdi), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal -24(%rdi), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_offset:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal -24(%rdi), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal -24(%rdi), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal -24(%rdi), %eax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal -24(%rdi), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal -24(%rdi), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = add nsw i32 %0, -24
- ret i32 %2
-}
-
-define i32 @test_lea_offset_big(i32) {
-; GENERIC-LABEL: test_lea_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal 1024(%rdi), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal 1024(%rdi), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal 1024(%rdi), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal 1024(%rdi), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal 1024(%rdi), %eax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal 1024(%rdi), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal 1024(%rdi), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = add nsw i32 %0, 1024
- ret i32 %2
-}
-
-; Function Attrs: norecurse nounwind readnone uwtable
-define i32 @test_lea_add(i32, i32) {
-; GENERIC-LABEL: test_lea_add:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal (%rdi,%rsi), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal (%rdi,%rsi), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add nsw i32 %1, %0
- ret i32 %3
-}
-
-define i32 @test_lea_add_offset(i32, i32) {
-; GENERIC-LABEL: test_lea_add_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal 16(%rdi,%rsi), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal 16(%rdi,%rsi), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_offset:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal 16(%rdi,%rsi), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $16, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $16, %eax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal 16(%rdi,%rsi), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal 16(%rdi,%rsi), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add i32 %0, 16
- %4 = add i32 %3, %1
- ret i32 %4
-}
-
-define i32 @test_lea_add_offset_big(i32, i32) {
-; GENERIC-LABEL: test_lea_add_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal -4096(%rdi,%rsi), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal -4096(%rdi,%rsi), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal -4096(%rdi,%rsi), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $-4096, %eax # imm = 0xF000
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $-4096, %eax # imm = 0xF000
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal -4096(%rdi,%rsi), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal -4096(%rdi,%rsi), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add i32 %0, -4096
- %4 = add i32 %3, %1
- ret i32 %4
-}
-
-define i32 @test_lea_mul(i32) {
-; GENERIC-LABEL: test_lea_mul:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal (%rdi,%rdi,2), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i32 %0, 3
- ret i32 %2
-}
-
-define i32 @test_lea_mul_offset(i32) {
-; GENERIC-LABEL: test_lea_mul_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal -32(%rdi,%rdi,2), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal -32(%rdi,%rdi,2), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul_offset:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal -32(%rdi,%rdi,2), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $-32, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rdi,2), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $-32, %eax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal -32(%rdi,%rdi,2), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal -32(%rdi,%rdi,2), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i32 %0, 3
- %3 = add nsw i32 %2, -32
- ret i32 %3
-}
-
-define i32 @test_lea_mul_offset_big(i32) {
-; GENERIC-LABEL: test_lea_mul_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal 10000(%rdi,%rdi,8), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal 10000(%rdi,%rdi,8), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal 10000(%rdi,%rdi,8), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rdi,8), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $10000, %eax # imm = 0x2710
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rdi,8), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $10000, %eax # imm = 0x2710
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal 10000(%rdi,%rdi,8), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal 10000(%rdi,%rdi,8), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i32 %0, 9
- %3 = add nsw i32 %2, 10000
- ret i32 %3
-}
-
-define i32 @test_lea_add_scale(i32, i32) {
-; GENERIC-LABEL: test_lea_add_scale:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal (%rdi,%rsi,2), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal (%rdi,%rsi,2), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i32 %1, 1
- %4 = add nsw i32 %3, %0
- ret i32 %4
-}
-
-define i32 @test_lea_add_scale_offset(i32, i32) {
-; GENERIC-LABEL: test_lea_add_scale_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal 96(%rdi,%rsi,4), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal 96(%rdi,%rsi,4), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale_offset:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal 96(%rdi,%rsi,4), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi,4), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $96, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi,4), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $96, %eax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal 96(%rdi,%rsi,4), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal 96(%rdi,%rsi,4), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i32 %1, 2
- %4 = add i32 %0, 96
- %5 = add i32 %4, %3
- ret i32 %5
-}
-
-define i32 @test_lea_add_scale_offset_big(i32, i32) {
-; GENERIC-LABEL: test_lea_add_scale_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; GENERIC-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; GENERIC-NEXT: leal -1200(%rdi,%rsi,8), %eax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ATOM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ATOM-NEXT: leal -1200(%rdi,%rsi,8), %eax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SLM-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SLM-NEXT: leal -1200(%rdi,%rsi,8), %eax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; SANDY-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; SANDY-NEXT: leal (%rdi,%rsi,8), %eax # sched: [1:0.50]
-; SANDY-NEXT: addl $-1200, %eax # imm = 0xFB50
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; HASWELL-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; HASWELL-NEXT: leal (%rdi,%rsi,8), %eax # sched: [1:0.50]
-; HASWELL-NEXT: addl $-1200, %eax # imm = 0xFB50
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; BTVER2-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; BTVER2-NEXT: leal -1200(%rdi,%rsi,8), %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: # kill: %ESI<def> %ESI<kill> %RSI<def>
-; ZNVER1-NEXT: # kill: %EDI<def> %EDI<kill> %RDI<def>
-; ZNVER1-NEXT: leal -1200(%rdi,%rsi,8), %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i32 %1, 3
- %4 = add i32 %0, -1200
- %5 = add i32 %4, %3
- ret i32 %5
-}
diff --git a/test/CodeGen/X86/lea64-schedule.ll b/test/CodeGen/X86/lea64-schedule.ll
deleted file mode 100644
index 0ff1574c809d..000000000000
--- a/test/CodeGen/X86/lea64-schedule.ll
+++ /dev/null
@@ -1,534 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=x86-64 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=knl | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
-
-define i64 @test_lea_offset(i64) {
-; GENERIC-LABEL: test_lea_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq -24(%rdi), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq -24(%rdi), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_offset:
-; SLM: # BB#0:
-; SLM-NEXT: leaq -24(%rdi), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq -24(%rdi), %rax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq -24(%rdi), %rax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq -24(%rdi), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq -24(%rdi), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = add nsw i64 %0, -24
- ret i64 %2
-}
-
-define i64 @test_lea_offset_big(i64) {
-; GENERIC-LABEL: test_lea_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq 1024(%rdi), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq 1024(%rdi), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: leaq 1024(%rdi), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq 1024(%rdi), %rax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq 1024(%rdi), %rax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq 1024(%rdi), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq 1024(%rdi), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = add nsw i64 %0, 1024
- ret i64 %2
-}
-
-; Function Attrs: norecurse nounwind readnone uwtable
-define i64 @test_lea_add(i64, i64) {
-; GENERIC-LABEL: test_lea_add:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq (%rdi,%rsi), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add:
-; SLM: # BB#0:
-; SLM-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add nsw i64 %1, %0
- ret i64 %3
-}
-
-define i64 @test_lea_add_offset(i64, i64) {
-; GENERIC-LABEL: test_lea_add_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq 16(%rdi,%rsi), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq 16(%rdi,%rsi), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_offset:
-; SLM: # BB#0:
-; SLM-NEXT: leaq 16(%rdi,%rsi), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $16, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $16, %rax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq 16(%rdi,%rsi), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq 16(%rdi,%rsi), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add i64 %0, 16
- %4 = add i64 %3, %1
- ret i64 %4
-}
-
-define i64 @test_lea_add_offset_big(i64, i64) {
-; GENERIC-LABEL: test_lea_add_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq -4096(%rdi,%rsi), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq -4096(%rdi,%rsi), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: leaq -4096(%rdi,%rsi), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $-4096, %rax # imm = 0xF000
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $-4096, %rax # imm = 0xF000
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq -4096(%rdi,%rsi), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq -4096(%rdi,%rsi), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = add i64 %0, -4096
- %4 = add i64 %3, %1
- ret i64 %4
-}
-
-define i64 @test_lea_mul(i64) {
-; GENERIC-LABEL: test_lea_mul:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq (%rdi,%rdi,2), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul:
-; SLM: # BB#0:
-; SLM-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i64 %0, 3
- ret i64 %2
-}
-
-define i64 @test_lea_mul_offset(i64) {
-; GENERIC-LABEL: test_lea_mul_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq -32(%rdi,%rdi,2), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq -32(%rdi,%rdi,2), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul_offset:
-; SLM: # BB#0:
-; SLM-NEXT: leaq -32(%rdi,%rdi,2), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $-32, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rdi,2), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $-32, %rax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq -32(%rdi,%rdi,2), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq -32(%rdi,%rdi,2), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i64 %0, 3
- %3 = add nsw i64 %2, -32
- ret i64 %3
-}
-
-define i64 @test_lea_mul_offset_big(i64) {
-; GENERIC-LABEL: test_lea_mul_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq 10000(%rdi,%rdi,8), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_mul_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq 10000(%rdi,%rdi,8), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_mul_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: leaq 10000(%rdi,%rdi,8), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_mul_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rdi,8), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $10000, %rax # imm = 0x2710
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_mul_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rdi,8), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $10000, %rax # imm = 0x2710
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_mul_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq 10000(%rdi,%rdi,8), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_mul_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq 10000(%rdi,%rdi,8), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %2 = mul nsw i64 %0, 9
- %3 = add nsw i64 %2, 10000
- ret i64 %3
-}
-
-define i64 @test_lea_add_scale(i64, i64) {
-; GENERIC-LABEL: test_lea_add_scale:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq (%rdi,%rsi,2), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale:
-; SLM: # BB#0:
-; SLM-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:0.50]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq (%rdi,%rsi,2), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i64 %1, 1
- %4 = add nsw i64 %3, %0
- ret i64 %4
-}
-
-define i64 @test_lea_add_scale_offset(i64, i64) {
-; GENERIC-LABEL: test_lea_add_scale_offset:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq 96(%rdi,%rsi,4), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale_offset:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq 96(%rdi,%rsi,4), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale_offset:
-; SLM: # BB#0:
-; SLM-NEXT: leaq 96(%rdi,%rsi,4), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale_offset:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi,4), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $96, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale_offset:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi,4), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $96, %rax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale_offset:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq 96(%rdi,%rsi,4), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale_offset:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq 96(%rdi,%rsi,4), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i64 %1, 2
- %4 = add i64 %0, 96
- %5 = add i64 %4, %3
- ret i64 %5
-}
-
-define i64 @test_lea_add_scale_offset_big(i64, i64) {
-; GENERIC-LABEL: test_lea_add_scale_offset_big:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: leaq -1200(%rdi,%rsi,8), %rax # sched: [1:0.50]
-; GENERIC-NEXT: retq # sched: [1:1.00]
-;
-; ATOM-LABEL: test_lea_add_scale_offset_big:
-; ATOM: # BB#0:
-; ATOM-NEXT: leaq -1200(%rdi,%rsi,8), %rax
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: nop
-; ATOM-NEXT: retq
-;
-; SLM-LABEL: test_lea_add_scale_offset_big:
-; SLM: # BB#0:
-; SLM-NEXT: leaq -1200(%rdi,%rsi,8), %rax # sched: [1:1.00]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_lea_add_scale_offset_big:
-; SANDY: # BB#0:
-; SANDY-NEXT: leaq (%rdi,%rsi,8), %rax # sched: [1:0.50]
-; SANDY-NEXT: addq $-1200, %rax # imm = 0xFB50
-; SANDY-NEXT: # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_lea_add_scale_offset_big:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: leaq (%rdi,%rsi,8), %rax # sched: [1:0.50]
-; HASWELL-NEXT: addq $-1200, %rax # imm = 0xFB50
-; HASWELL-NEXT: # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_lea_add_scale_offset_big:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: leaq -1200(%rdi,%rsi,8), %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_lea_add_scale_offset_big:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: leaq -1200(%rdi,%rsi,8), %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %3 = shl i64 %1, 3
- %4 = add i64 %0, -1200
- %5 = add i64 %4, %3
- ret i64 %5
-}
diff --git a/test/CodeGen/X86/popcnt-schedule.ll b/test/CodeGen/X86/popcnt-schedule.ll
deleted file mode 100644
index c0d11280fc1d..000000000000
--- a/test/CodeGen/X86/popcnt-schedule.ll
+++ /dev/null
@@ -1,167 +0,0 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mattr=+popcnt | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=goldmont | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=knl | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
-
-define i16 @test_ctpop_i16(i16 zeroext %a0, i16 *%a1) {
-; GENERIC-LABEL: test_ctpop_i16:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: popcntw (%rsi), %cx
-; GENERIC-NEXT: popcntw %di, %ax
-; GENERIC-NEXT: orl %ecx, %eax
-; GENERIC-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; GENERIC-NEXT: retq
-;
-; SLM-LABEL: test_ctpop_i16:
-; SLM: # BB#0:
-; SLM-NEXT: popcntw (%rsi), %cx # sched: [6:1.00]
-; SLM-NEXT: popcntw %di, %ax # sched: [3:1.00]
-; SLM-NEXT: orl %ecx, %eax # sched: [1:0.50]
-; SLM-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_ctpop_i16:
-; SANDY: # BB#0:
-; SANDY-NEXT: popcntw (%rsi), %cx # sched: [7:1.00]
-; SANDY-NEXT: popcntw %di, %ax # sched: [3:1.00]
-; SANDY-NEXT: orl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_ctpop_i16:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: popcntw (%rsi), %cx # sched: [7:1.00]
-; HASWELL-NEXT: popcntw %di, %ax # sched: [3:1.00]
-; HASWELL-NEXT: orl %ecx, %eax # sched: [1:0.25]
-; HASWELL-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_ctpop_i16:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: popcntw (%rsi), %cx # sched: [8:1.00]
-; BTVER2-NEXT: popcntw %di, %ax # sched: [3:1.00]
-; BTVER2-NEXT: orl %ecx, %eax # sched: [1:0.50]
-; BTVER2-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_ctpop_i16:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: popcntw (%rsi), %cx # sched: [10:1.00]
-; ZNVER1-NEXT: popcntw %di, %ax # sched: [3:1.00]
-; ZNVER1-NEXT: orl %ecx, %eax # sched: [1:0.25]
-; ZNVER1-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = load i16, i16 *%a1
- %2 = tail call i16 @llvm.ctpop.i16( i16 %1 )
- %3 = tail call i16 @llvm.ctpop.i16( i16 %a0 )
- %4 = or i16 %2, %3
- ret i16 %4
-}
-declare i16 @llvm.ctpop.i16(i16)
-
-define i32 @test_ctpop_i32(i32 %a0, i32 *%a1) {
-; GENERIC-LABEL: test_ctpop_i32:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: popcntl (%rsi), %ecx
-; GENERIC-NEXT: popcntl %edi, %eax
-; GENERIC-NEXT: orl %ecx, %eax
-; GENERIC-NEXT: retq
-;
-; SLM-LABEL: test_ctpop_i32:
-; SLM: # BB#0:
-; SLM-NEXT: popcntl (%rsi), %ecx # sched: [6:1.00]
-; SLM-NEXT: popcntl %edi, %eax # sched: [3:1.00]
-; SLM-NEXT: orl %ecx, %eax # sched: [1:0.50]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_ctpop_i32:
-; SANDY: # BB#0:
-; SANDY-NEXT: popcntl (%rsi), %ecx # sched: [7:1.00]
-; SANDY-NEXT: popcntl %edi, %eax # sched: [3:1.00]
-; SANDY-NEXT: orl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_ctpop_i32:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: popcntl (%rsi), %ecx # sched: [7:1.00]
-; HASWELL-NEXT: popcntl %edi, %eax # sched: [3:1.00]
-; HASWELL-NEXT: orl %ecx, %eax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_ctpop_i32:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: popcntl (%rsi), %ecx # sched: [8:1.00]
-; BTVER2-NEXT: popcntl %edi, %eax # sched: [3:1.00]
-; BTVER2-NEXT: orl %ecx, %eax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_ctpop_i32:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: popcntl (%rsi), %ecx # sched: [10:1.00]
-; ZNVER1-NEXT: popcntl %edi, %eax # sched: [3:1.00]
-; ZNVER1-NEXT: orl %ecx, %eax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = load i32, i32 *%a1
- %2 = tail call i32 @llvm.ctpop.i32( i32 %1 )
- %3 = tail call i32 @llvm.ctpop.i32( i32 %a0 )
- %4 = or i32 %2, %3
- ret i32 %4
-}
-declare i32 @llvm.ctpop.i32(i32)
-
-define i64 @test_ctpop_i64(i64 %a0, i64 *%a1) {
-; GENERIC-LABEL: test_ctpop_i64:
-; GENERIC: # BB#0:
-; GENERIC-NEXT: popcntq (%rsi), %rcx
-; GENERIC-NEXT: popcntq %rdi, %rax
-; GENERIC-NEXT: orq %rcx, %rax
-; GENERIC-NEXT: retq
-;
-; SLM-LABEL: test_ctpop_i64:
-; SLM: # BB#0:
-; SLM-NEXT: popcntq (%rsi), %rcx # sched: [6:1.00]
-; SLM-NEXT: popcntq %rdi, %rax # sched: [3:1.00]
-; SLM-NEXT: orq %rcx, %rax # sched: [1:0.50]
-; SLM-NEXT: retq # sched: [4:1.00]
-;
-; SANDY-LABEL: test_ctpop_i64:
-; SANDY: # BB#0:
-; SANDY-NEXT: popcntq (%rsi), %rcx # sched: [9:1.00]
-; SANDY-NEXT: popcntq %rdi, %rax # sched: [3:1.00]
-; SANDY-NEXT: orq %rcx, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
-;
-; HASWELL-LABEL: test_ctpop_i64:
-; HASWELL: # BB#0:
-; HASWELL-NEXT: popcntq (%rsi), %rcx # sched: [7:1.00]
-; HASWELL-NEXT: popcntq %rdi, %rax # sched: [3:1.00]
-; HASWELL-NEXT: orq %rcx, %rax # sched: [1:0.25]
-; HASWELL-NEXT: retq # sched: [1:1.00]
-;
-; BTVER2-LABEL: test_ctpop_i64:
-; BTVER2: # BB#0:
-; BTVER2-NEXT: popcntq (%rsi), %rcx # sched: [8:1.00]
-; BTVER2-NEXT: popcntq %rdi, %rax # sched: [3:1.00]
-; BTVER2-NEXT: orq %rcx, %rax # sched: [1:0.50]
-; BTVER2-NEXT: retq # sched: [4:1.00]
-;
-; ZNVER1-LABEL: test_ctpop_i64:
-; ZNVER1: # BB#0:
-; ZNVER1-NEXT: popcntq (%rsi), %rcx # sched: [10:1.00]
-; ZNVER1-NEXT: popcntq %rdi, %rax # sched: [3:1.00]
-; ZNVER1-NEXT: orq %rcx, %rax # sched: [1:0.25]
-; ZNVER1-NEXT: retq # sched: [5:0.50]
- %1 = load i64, i64 *%a1
- %2 = tail call i64 @llvm.ctpop.i64( i64 %1 )
- %3 = tail call i64 @llvm.ctpop.i64( i64 %a0 )
- %4 = or i64 %2, %3
- ret i64 %4
-}
-declare i64 @llvm.ctpop.i64(i64)
diff --git a/test/CodeGen/X86/pr34139.ll b/test/CodeGen/X86/pr34139.ll
new file mode 100644
index 000000000000..c20c2cd510c7
--- /dev/null
+++ b/test/CodeGen/X86/pr34139.ll
@@ -0,0 +1,24 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=knl | FileCheck %s
+
+define void @f_f(<16 x double>* %ptr) {
+; CHECK-LABEL: f_f:
+; CHECK: # BB#0:
+; CHECK-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
+; CHECK-NEXT: vmovdqa %xmm0, (%rax)
+; CHECK-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0
+; CHECK-NEXT: vmovapd (%rdi), %zmm1
+; CHECK-NEXT: vmovapd 64(%rdi), %zmm2
+; CHECK-NEXT: vptestmq %zmm0, %zmm0, %k1
+; CHECK-NEXT: vmovapd %zmm0, %zmm1 {%k1}
+; CHECK-NEXT: vmovapd %zmm0, %zmm2 {%k1}
+; CHECK-NEXT: vmovapd %zmm2, 64(%rdi)
+; CHECK-NEXT: vmovapd %zmm1, (%rdi)
+ store <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8>* undef
+ %load_mask8.i.i.i = load <16 x i8>, <16 x i8>* undef
+ %v.i.i.i.i = load <16 x double>, <16 x double>* %ptr
+ %mask_vec_i1.i.i.i51.i.i = icmp ne <16 x i8> %load_mask8.i.i.i, zeroinitializer
+ %v1.i.i.i.i = select <16 x i1> %mask_vec_i1.i.i.i51.i.i, <16 x double> undef, <16 x double> %v.i.i.i.i
+ store <16 x double> %v1.i.i.i.i, <16 x double>* %ptr
+ unreachable
+}
diff --git a/test/CodeGen/X86/pr34177.ll b/test/CodeGen/X86/pr34177.ll
new file mode 100644
index 000000000000..7c210058ae6c
--- /dev/null
+++ b/test/CodeGen/X86/pr34177.ll
@@ -0,0 +1,52 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mattr=+avx512f | FileCheck %s
+; RUN: llc < %s -mattr=+avx512f,+avx512vl,+avx512bw,+avx512dq | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+define void @test() local_unnamed_addr {
+; CHECK-LABEL: test:
+; CHECK: # BB#0:
+; CHECK-NEXT: vmovdqa {{.*#+}} xmm0 = [2,3]
+; CHECK-NEXT: vpextrq $1, %xmm0, %rax
+; CHECK-NEXT: vmovq %xmm0, %rcx
+; CHECK-NEXT: negq %rdx
+; CHECK-NEXT: fld1
+; CHECK-NEXT: fldz
+; CHECK-NEXT: fld %st(0)
+; CHECK-NEXT: fcmove %st(2), %st(0)
+; CHECK-NEXT: cmpq %rax, %rcx
+; CHECK-NEXT: fld %st(1)
+; CHECK-NEXT: fcmove %st(3), %st(0)
+; CHECK-NEXT: cmpq %rax, %rax
+; CHECK-NEXT: fld %st(2)
+; CHECK-NEXT: fcmove %st(4), %st(0)
+; CHECK-NEXT: movl $1, %eax
+; CHECK-NEXT: cmpq %rax, %rax
+; CHECK-NEXT: fld %st(3)
+; CHECK-NEXT: fcmove %st(5), %st(0)
+; CHECK-NEXT: fstp %st(5)
+; CHECK-NEXT: fxch %st(2)
+; CHECK-NEXT: fadd %st(3)
+; CHECK-NEXT: fxch %st(4)
+; CHECK-NEXT: fadd %st(3)
+; CHECK-NEXT: fxch %st(2)
+; CHECK-NEXT: fadd %st(3)
+; CHECK-NEXT: fxch %st(1)
+; CHECK-NEXT: faddp %st(3)
+; CHECK-NEXT: fxch %st(3)
+; CHECK-NEXT: fstpt (%rax)
+; CHECK-NEXT: fxch %st(1)
+; CHECK-NEXT: fstpt (%rax)
+; CHECK-NEXT: fxch %st(1)
+; CHECK-NEXT: fstpt (%rax)
+; CHECK-NEXT: fstpt (%rax)
+ %1 = icmp eq <4 x i64> <i64 0, i64 1, i64 2, i64 3>, undef
+ %2 = select <4 x i1> %1, <4 x x86_fp80> <x86_fp80 0xK3FFF8000000000000000, x86_fp80 0xK3FFF8000000000000000, x86_fp80 0xK3FFF8000000000000000, x86_fp80 0xK3FFF8000000000000000>, <4 x x86_fp80> zeroinitializer
+ %3 = fadd <4 x x86_fp80> undef, %2
+ %4 = shufflevector <4 x x86_fp80> %3, <4 x x86_fp80> undef, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+ store <8 x x86_fp80> %4, <8 x x86_fp80>* undef, align 16
+ unreachable
+}
+
diff --git a/test/CodeGen/X86/pr34271-1.ll b/test/CodeGen/X86/pr34271-1.ll
new file mode 100644
index 000000000000..2e2f0fd0aa94
--- /dev/null
+++ b/test/CodeGen/X86/pr34271-1.ll
@@ -0,0 +1,14 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=avx512vl,avx512bw | FileCheck %s
+
+define <16 x i16> @foo(<16 x i32> %i) {
+; CHECK-LABEL: foo:
+; CHECK: # BB#0:
+; CHECK-NEXT: vpminud {{.*}}(%rip){1to16}, %zmm0, %zmm0
+; CHECK-NEXT: vpmovdw %zmm0, %ymm0
+; CHECK-NEXT: retq
+ %x3 = icmp ult <16 x i32> %i, <i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009>
+ %x5 = select <16 x i1> %x3, <16 x i32> %i, <16 x i32> <i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009, i32 16843009>
+ %x6 = trunc <16 x i32> %x5 to <16 x i16>
+ ret <16 x i16> %x6
+}
diff --git a/test/CodeGen/X86/pr34271.ll b/test/CodeGen/X86/pr34271.ll
new file mode 100644
index 000000000000..40d01617c30d
--- /dev/null
+++ b/test/CodeGen/X86/pr34271.ll
@@ -0,0 +1,14 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu | FileCheck %s
+
+; CHECK: .LCPI0_0:
+; CHECK-NEXT: .zero 16,1
+
+define <4 x i32> @f(<4 x i32> %a) {
+; CHECK-LABEL: f:
+; CHECK: # BB#0:
+; CHECK-NEXT: paddd .LCPI0_0(%rip), %xmm0
+; CHECK-NEXT: retq
+ %v = add nuw nsw <4 x i32> %a, <i32 16843009, i32 16843009, i32 16843009, i32 16843009>
+ ret <4 x i32> %v
+}
diff --git a/test/CodeGen/X86/recip-fastmath.ll b/test/CodeGen/X86/recip-fastmath.ll
index 02a968c6f27d..9102e68f231b 100644
--- a/test/CodeGen/X86/recip-fastmath.ll
+++ b/test/CodeGen/X86/recip-fastmath.ll
@@ -1,803 +1,803 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE --check-prefix=SSE-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=FMA-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=btver2 -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=haswell -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=haswell -mattr=-fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=HASWELL-NO-FMA
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX512 --check-prefix=KNL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX512 --check-prefix=SKX
; If the target's divss/divps instructions are substantially
; slower than rcpss/rcpps with a Newton-Raphson refinement,
; we should generate the estimate sequence.
; See PR21385 ( http://llvm.org/bugs/show_bug.cgi?id=21385 )
; for details about the accuracy, speed, and implementation
; differences of x86 reciprocal estimates.
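; A hedged reference sketch (not part of this patch; the function name is
; hypothetical) of a single Newton-Raphson refinement step. Given an initial
; estimate r0 ~= 1/x, one iteration computes r1 = r0 + r0*(1 - x*r0), which
; is the scalar mul/sub/mul/add pattern the one-step tests below check for.
define float @recip_one_step_reference(float %x, float %r0) nounwind {
  %xr   = fmul fast float %x, %r0    ; x*r0, close to 1.0 for a good estimate
  %err  = fsub fast float 1.0, %xr   ; residual: 1 - x*r0
  %corr = fmul fast float %r0, %err  ; correction term: r0*(1 - x*r0)
  %r1   = fadd fast float %r0, %corr ; refined estimate
  ret float %r1
}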
define float @f32_no_estimate(float %x) #0 {
; SSE-LABEL: f32_no_estimate:
; SSE: # BB#0:
; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: divss %xmm0, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_no_estimate:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vdivss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_no_estimate:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; FMA-RECIP-NEXT: vdivss %xmm0, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_no_estimate:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vdivss %xmm0, %xmm1, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_no_estimate:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [6:0.50]
-; SANDY-NEXT: vdivss %xmm0, %xmm1, %xmm0 # sched: [14:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
+; SANDY-NEXT: vdivss %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_no_estimate:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vdivss %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_no_estimate:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; HASWELL-NO-FMA-NEXT: vdivss %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; AVX512-LABEL: f32_no_estimate:
; AVX512: # BB#0:
; AVX512-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; AVX512-NEXT: vdivss %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 1.0, %x
ret float %div
}
define float @f32_one_step(float %x) #1 {
; SSE-LABEL: f32_one_step:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm2
; SSE-NEXT: mulss %xmm2, %xmm0
; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: subss %xmm0, %xmm1
; SSE-NEXT: mulss %xmm2, %xmm1
; SSE-NEXT: addss %xmm2, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_one_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vsubss %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_one_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_one_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_one_step:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_one_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_one_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm0
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; HASWELL-NO-FMA-NEXT: vsubss %xmm0, %xmm2, %xmm0
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: vaddss %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; AVX512-LABEL: f32_one_step:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm1
; AVX512-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; AVX512-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 1.0, %x
ret float %div
}
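For reference, the "one step" sequence checked above (rcpss, mulss, subss, mulss, addss) is a single Newton-Raphson refinement of the hardware reciprocal estimate. Writing the estimate as \(e = \mathrm{rcp}(x)\), the code computes

\[ x_1 = e + e\,(1 - x\,e) = e\,(2 - x\,e), \]

which, as with any Newton-Raphson step, roughly doubles the number of correct bits in the estimate.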
define float @f32_two_step(float %x) #2 {
; SSE-LABEL: f32_two_step:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm2
; SSE-NEXT: movaps %xmm0, %xmm3
; SSE-NEXT: mulss %xmm2, %xmm3
; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: movaps %xmm1, %xmm4
; SSE-NEXT: subss %xmm3, %xmm4
; SSE-NEXT: mulss %xmm2, %xmm4
; SSE-NEXT: addss %xmm2, %xmm4
; SSE-NEXT: mulss %xmm4, %xmm0
; SSE-NEXT: subss %xmm0, %xmm1
; SSE-NEXT: mulss %xmm4, %xmm1
; SSE-NEXT: addss %xmm4, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_two_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm2
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vsubss %xmm2, %xmm3, %xmm2
; AVX-RECIP-NEXT: vmulss %xmm2, %xmm1, %xmm2
; AVX-RECIP-NEXT: vaddss %xmm2, %xmm1, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vsubss %xmm0, %xmm3, %xmm0
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_two_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; FMA-RECIP-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; FMA-RECIP-NEXT: vmovaps %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; FMA-RECIP-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; FMA-RECIP-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_two_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm2, %xmm1, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_two_step:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vsubss %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm2, %xmm1, %xmm2 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vsubss %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_two_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; HASWELL-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; HASWELL-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; HASWELL-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_two_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm2
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
; HASWELL-NO-FMA-NEXT: vsubss %xmm2, %xmm3, %xmm2
; HASWELL-NO-FMA-NEXT: vmulss %xmm2, %xmm1, %xmm2
; HASWELL-NO-FMA-NEXT: vaddss %xmm2, %xmm1, %xmm1
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm0
; HASWELL-NO-FMA-NEXT: vsubss %xmm0, %xmm3, %xmm0
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: vaddss %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; AVX512-LABEL: f32_two_step:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm1
; AVX512-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; AVX512-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; AVX512-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; AVX512-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; AVX512-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; AVX512-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 1.0, %x
ret float %div
}
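The "two step" variant applies the same refinement a second time, feeding the first result back in as the improved estimate:

\[ x_2 = x_1 + x_1\,(1 - x\,x_1), \]

which is the repeated mul/sub/mul/add pattern in the checks above.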
define <4 x float> @v4f32_no_estimate(<4 x float> %x) #0 {
; SSE-LABEL: v4f32_no_estimate:
; SSE: # BB#0:
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: divps %xmm0, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_no_estimate:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vdivps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_no_estimate:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vmovaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vdivps %xmm0, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_no_estimate:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vdivps %xmm0, %xmm1, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_no_estimate:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
-; SANDY-NEXT: vdivps %xmm0, %xmm1, %xmm0 # sched: [14:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
+; SANDY-NEXT: vdivps %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_no_estimate:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm1 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vdivps %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_no_estimate:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm1 = [1,1,1,1]
; HASWELL-NO-FMA-NEXT: vdivps %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; AVX512-LABEL: v4f32_no_estimate:
; AVX512: # BB#0:
; AVX512-NEXT: vbroadcastss {{.*#+}} xmm1 = [1,1,1,1] sched: [4:0.50]
; AVX512-NEXT: vdivps %xmm0, %xmm1, %xmm0 # sched: [12:1.00]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <4 x float> %div
}
define <4 x float> @v4f32_one_step(<4 x float> %x) #1 {
; SSE-LABEL: v4f32_one_step:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm2
; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: subps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm2, %xmm1
; SSE-NEXT: addps %xmm2, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_one_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_one_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_one_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_one_step:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [7:3.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_one_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_one_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %xmm0, %xmm1
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm0
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1]
; HASWELL-NO-FMA-NEXT: vsubps %xmm0, %xmm2, %xmm0
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: vaddps %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; KNL-LABEL: v4f32_one_step:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; KNL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; KNL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v4f32_one_step:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %xmm0, %xmm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to4}, %xmm1, %xmm0
; SKX-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <4 x float> %div
}
define <4 x float> @v4f32_two_step(<4 x float> %x) #2 {
; SSE-LABEL: v4f32_two_step:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm2
; SSE-NEXT: movaps %xmm0, %xmm3
; SSE-NEXT: mulps %xmm2, %xmm3
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm1, %xmm4
; SSE-NEXT: subps %xmm3, %xmm4
; SSE-NEXT: mulps %xmm2, %xmm4
; SSE-NEXT: addps %xmm2, %xmm4
; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: subps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm4, %xmm1
; SSE-NEXT: addps %xmm4, %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_two_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm2
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %xmm2, %xmm3, %xmm2
; AVX-RECIP-NEXT: vmulps %xmm2, %xmm1, %xmm2
; AVX-RECIP-NEXT: vaddps %xmm2, %xmm1, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vsubps %xmm0, %xmm3, %xmm0
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_two_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %xmm0, %xmm1
; FMA-RECIP-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vmovaps %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; FMA-RECIP-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; FMA-RECIP-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_two_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm2, %xmm1, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_two_step:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [7:3.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm2, %xmm1, %xmm2 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vsubps %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_two_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; HASWELL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; HASWELL-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_two_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %xmm0, %xmm1
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm2
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm3 = [1,1,1,1]
; HASWELL-NO-FMA-NEXT: vsubps %xmm2, %xmm3, %xmm2
; HASWELL-NO-FMA-NEXT: vmulps %xmm2, %xmm1, %xmm2
; HASWELL-NO-FMA-NEXT: vaddps %xmm2, %xmm1, %xmm1
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm0
; HASWELL-NO-FMA-NEXT: vsubps %xmm0, %xmm3, %xmm0
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: vaddps %xmm0, %xmm1, %xmm0
; HASWELL-NO-FMA-NEXT: retq
;
; KNL-LABEL: v4f32_two_step:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; KNL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; KNL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; KNL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; KNL-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v4f32_two_step:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %xmm0, %xmm1
; SKX-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; SKX-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; SKX-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; SKX-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; SKX-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; SKX-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <4 x float> %div
}
define <8 x float> @v8f32_no_estimate(<8 x float> %x) #0 {
; SSE-LABEL: v8f32_no_estimate:
; SSE: # BB#0:
; SSE-NEXT: movaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm2, %xmm3
; SSE-NEXT: divps %xmm0, %xmm3
; SSE-NEXT: divps %xmm1, %xmm2
; SSE-NEXT: movaps %xmm3, %xmm0
; SSE-NEXT: movaps %xmm2, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_no_estimate:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vdivps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_no_estimate:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vmovaps {{.*#+}} ymm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vdivps %ymm0, %ymm1, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_no_estimate:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vdivps %ymm0, %ymm1, %ymm0 # sched: [38:38.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_no_estimate:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovaps {{.*#+}} ymm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
-; SANDY-NEXT: vdivps %ymm0, %ymm1, %ymm0 # sched: [29:3.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
+; SANDY-NEXT: vdivps %ymm0, %ymm1, %ymm0 # sched: [12:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_no_estimate:
; HASWELL: # BB#0:
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm1 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vdivps %ymm0, %ymm1, %ymm0 # sched: [19:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_no_estimate:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm1 = [1,1,1,1,1,1,1,1]
; HASWELL-NO-FMA-NEXT: vdivps %ymm0, %ymm1, %ymm0
; HASWELL-NO-FMA-NEXT: retq
;
; AVX512-LABEL: v8f32_no_estimate:
; AVX512: # BB#0:
; AVX512-NEXT: vbroadcastss {{.*#+}} ymm1 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; AVX512-NEXT: vdivps %ymm0, %ymm1, %ymm0 # sched: [19:2.00]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <8 x float> %div
}
define <8 x float> @v8f32_one_step(<8 x float> %x) #1 {
; SSE-LABEL: v8f32_one_step:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm4
; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: movaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm2, %xmm3
; SSE-NEXT: subps %xmm0, %xmm3
; SSE-NEXT: mulps %xmm4, %xmm3
; SSE-NEXT: addps %xmm4, %xmm3
; SSE-NEXT: rcpps %xmm1, %xmm0
; SSE-NEXT: mulps %xmm0, %xmm1
; SSE-NEXT: subps %xmm1, %xmm2
; SSE-NEXT: mulps %xmm0, %xmm2
; SSE-NEXT: addps %xmm0, %xmm2
; SSE-NEXT: movaps %xmm3, %xmm0
; SSE-NEXT: movaps %xmm2, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_one_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %ymm0, %ymm2, %ymm0
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vaddps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_one_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %ymm1, %ymm0
; FMA-RECIP-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_one_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm1 # sched: [2:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_one_step:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_one_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; HASWELL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_one_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm1
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm0
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1]
; HASWELL-NO-FMA-NEXT: vsubps %ymm0, %ymm2, %ymm0
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0
; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm1, %ymm0
; HASWELL-NO-FMA-NEXT: retq
;
; KNL-LABEL: v8f32_one_step:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; KNL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; KNL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_one_step:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to8}, %ymm1, %ymm0
; SKX-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <8 x float> %div
}
define <8 x float> @v8f32_two_step(<8 x float> %x) #2 {
; SSE-LABEL: v8f32_two_step:
; SSE: # BB#0:
; SSE-NEXT: movaps %xmm1, %xmm2
; SSE-NEXT: rcpps %xmm0, %xmm3
; SSE-NEXT: movaps %xmm0, %xmm4
; SSE-NEXT: mulps %xmm3, %xmm4
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm1, %xmm5
; SSE-NEXT: subps %xmm4, %xmm5
; SSE-NEXT: mulps %xmm3, %xmm5
; SSE-NEXT: addps %xmm3, %xmm5
; SSE-NEXT: mulps %xmm5, %xmm0
; SSE-NEXT: movaps %xmm1, %xmm3
; SSE-NEXT: subps %xmm0, %xmm3
; SSE-NEXT: mulps %xmm5, %xmm3
; SSE-NEXT: addps %xmm5, %xmm3
; SSE-NEXT: rcpps %xmm2, %xmm0
; SSE-NEXT: movaps %xmm2, %xmm4
; SSE-NEXT: mulps %xmm0, %xmm4
; SSE-NEXT: movaps %xmm1, %xmm5
; SSE-NEXT: subps %xmm4, %xmm5
; SSE-NEXT: mulps %xmm0, %xmm5
; SSE-NEXT: addps %xmm0, %xmm5
; SSE-NEXT: mulps %xmm5, %xmm2
; SSE-NEXT: subps %xmm2, %xmm1
; SSE-NEXT: mulps %xmm5, %xmm1
; SSE-NEXT: addps %xmm5, %xmm1
; SSE-NEXT: movaps %xmm3, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_two_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm2
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %ymm2, %ymm3, %ymm2
; AVX-RECIP-NEXT: vmulps %ymm2, %ymm1, %ymm2
; AVX-RECIP-NEXT: vaddps %ymm2, %ymm1, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-RECIP-NEXT: vsubps %ymm0, %ymm3, %ymm0
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vaddps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_two_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm1
; FMA-RECIP-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vmovaps %ymm1, %ymm3
; FMA-RECIP-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; FMA-RECIP-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; FMA-RECIP-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; FMA-RECIP-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_two_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm1 # sched: [2:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm2 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm2, %ymm3, %ymm2 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm2, %ymm1, %ymm2 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm2, %ymm1, %ymm1 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm0, %ymm3, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_two_step:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %ymm2, %ymm3, %ymm2 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm2, %ymm1, %ymm2 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm2, %ymm1, %ymm1 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vsubps %ymm0, %ymm3, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_two_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; HASWELL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; HASWELL-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_two_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm1
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm2
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm3 = [1,1,1,1,1,1,1,1]
; HASWELL-NO-FMA-NEXT: vsubps %ymm2, %ymm3, %ymm2
; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm2
; HASWELL-NO-FMA-NEXT: vaddps %ymm2, %ymm1, %ymm1
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm0
; HASWELL-NO-FMA-NEXT: vsubps %ymm0, %ymm3, %ymm0
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0
; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm1, %ymm0
; HASWELL-NO-FMA-NEXT: retq
;
; KNL-LABEL: v8f32_two_step:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; KNL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; KNL-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; KNL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; KNL-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_two_step:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm1
; SKX-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; SKX-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; SKX-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; SKX-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; SKX-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; SKX-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <8 x float> %div
}
attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }
attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }
attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }
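The three attribute sets above select the expansions under test: "!divf,!vec-divf" disables the reciprocal estimate entirely (a real divide is emitted), "divf,vec-divf" enables it with the default single refinement step, and "divf:2,vec-divf:2" requests two steps. A minimal sketch (hypothetical function name) of driving this from IR, compiled with the same llc invocations as the RUN lines:

define float @recip_sketch(float %x) #1 {
  ; Under attribute set #1 this fdiv is lowered to rcpss plus one
  ; Newton-Raphson refinement step instead of a divss.
  %div = fdiv fast float 1.0, %x
  ret float %div
}
attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }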
diff --git a/test/CodeGen/X86/recip-fastmath2.ll b/test/CodeGen/X86/recip-fastmath2.ll
index c82eab84757f..e6070e41a2b2 100644
--- a/test/CodeGen/X86/recip-fastmath2.ll
+++ b/test/CodeGen/X86/recip-fastmath2.ll
@@ -1,1162 +1,1162 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=SSE --check-prefix=SSE-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=FMA-RECIP
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=btver2 -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=haswell -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=haswell -print-schedule -mattr=-fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=HASWELL-NO-FMA
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX512 --check-prefix=KNL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=AVX --check-prefix=AVX512 --check-prefix=SKX
; These tests provide extra coverage for recip as discussed on D26855.
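; Unlike recip-fastmath.ll, these tests use non-unit numerators (1234.0,
; 3456.0, ...), so each expansion ends with an extra multiply by the
; constant, and the "_2_divs" variants reuse a single refined estimate
; for two divisions.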
define float @f32_no_step_2(float %x) #3 {
; SSE-LABEL: f32_no_step_2:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm0
; SSE-NEXT: mulss {{.*}}(%rip), %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_no_step_2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_no_step_2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm0
; FMA-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_no_step_2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_no_step_2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_no_step_2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_no_step_2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; AVX512-LABEL: f32_no_step_2:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm0
; AVX512-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 1234.0, %x
ret float %div
}
define float @f32_one_step_2(float %x) #1 {
; SSE-LABEL: f32_one_step_2:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm2
; SSE-NEXT: mulss %xmm2, %xmm0
; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: subss %xmm0, %xmm1
; SSE-NEXT: mulss %xmm2, %xmm1
; SSE-NEXT: addss %xmm2, %xmm1
; SSE-NEXT: mulss {{.*}}(%rip), %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_one_step_2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vsubss %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_one_step_2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_one_step_2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_one_step_2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_one_step_2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_one_step_2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; AVX512-LABEL: f32_one_step_2:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm1
; AVX512-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; AVX512-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; AVX512-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 3456.0, %x
ret float %div
}
define float @f32_one_step_2_divs(float %x) #1 {
; SSE-LABEL: f32_one_step_2_divs:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm1
; SSE-NEXT: mulss %xmm1, %xmm0
; SSE-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; SSE-NEXT: subss %xmm0, %xmm2
; SSE-NEXT: mulss %xmm1, %xmm2
; SSE-NEXT: addss %xmm1, %xmm2
; SSE-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; SSE-NEXT: mulss %xmm2, %xmm0
; SSE-NEXT: mulss %xmm2, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_one_step_2_divs:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vsubss %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_one_step_2_divs:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1
; FMA-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_one_step_2_divs:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_one_step_2_divs:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_one_step_2_divs:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; HASWELL-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_one_step_2_divs:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubss %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; AVX512-LABEL: f32_one_step_2_divs:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm1
; AVX512-NEXT: vfnmadd213ss {{.*}}(%rip), %xmm1, %xmm0
; AVX512-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm0
; AVX512-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 3456.0, %x
%div2 = fdiv fast float %div, %x
ret float %div2
}
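Because both divisions share the same denominator, the refined reciprocal \(r \approx 1/x\) is computed once and the result is formed as

\[ \frac{3456}{x \cdot x} \approx (3456 \cdot r) \cdot r, \]

matching the two trailing multiplies in the checks above.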
define float @f32_two_step_2(float %x) #2 {
; SSE-LABEL: f32_two_step_2:
; SSE: # BB#0:
; SSE-NEXT: rcpss %xmm0, %xmm2
; SSE-NEXT: movaps %xmm0, %xmm3
; SSE-NEXT: mulss %xmm2, %xmm3
; SSE-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; SSE-NEXT: movaps %xmm1, %xmm4
; SSE-NEXT: subss %xmm3, %xmm4
; SSE-NEXT: mulss %xmm2, %xmm4
; SSE-NEXT: addss %xmm2, %xmm4
; SSE-NEXT: mulss %xmm4, %xmm0
; SSE-NEXT: subss %xmm0, %xmm1
; SSE-NEXT: mulss %xmm4, %xmm1
; SSE-NEXT: addss %xmm4, %xmm1
; SSE-NEXT: mulss {{.*}}(%rip), %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: f32_two_step_2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm2
; AVX-RECIP-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
; AVX-RECIP-NEXT: vsubss %xmm2, %xmm3, %xmm2
; AVX-RECIP-NEXT: vmulss %xmm2, %xmm1, %xmm2
; AVX-RECIP-NEXT: vaddss %xmm2, %xmm1, %xmm1
; AVX-RECIP-NEXT: vmulss %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vsubss %xmm0, %xmm3, %xmm0
; AVX-RECIP-NEXT: vmulss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddss %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: f32_two_step_2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpss %xmm0, %xmm0, %xmm1
; FMA-RECIP-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; FMA-RECIP-NEXT: vmovaps %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; FMA-RECIP-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; FMA-RECIP-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; FMA-RECIP-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: f32_two_step_2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm2, %xmm1, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubss %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: f32_two_step_2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vsubss %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm2, %xmm1, %xmm2 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vsubss %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: f32_two_step_2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; HASWELL-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; HASWELL-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; HASWELL-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; HASWELL-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: f32_two_step_2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpss %xmm0, %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm2 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubss %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm2, %xmm1, %xmm2 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddss %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vsubss %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddss %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; AVX512-LABEL: f32_two_step_2:
; AVX512: # BB#0:
; AVX512-NEXT: vrcp14ss %xmm0, %xmm0, %xmm1
; AVX512-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; AVX512-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; AVX512-NEXT: vfnmadd213ss %xmm2, %xmm0, %xmm3
; AVX512-NEXT: vfmadd132ss %xmm1, %xmm1, %xmm3
; AVX512-NEXT: vfnmadd213ss %xmm2, %xmm3, %xmm0
; AVX512-NEXT: vfmadd132ss %xmm3, %xmm3, %xmm0
; AVX512-NEXT: vmulss {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; AVX512-NEXT: retq # sched: [1:1.00]
%div = fdiv fast float 6789.0, %x
ret float %div
}
define <4 x float> @v4f32_one_step2(<4 x float> %x) #1 {
; SSE-LABEL: v4f32_one_step2:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm2
; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: subps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm2, %xmm1
; SSE-NEXT: addps %xmm2, %xmm1
; SSE-NEXT: mulps {{.*}}(%rip), %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_one_step2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_one_step2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_one_step2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_one_step2:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [7:3.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_one_step2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_one_step2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v4f32_one_step2:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; KNL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; KNL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; KNL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v4f32_one_step2:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %xmm0, %xmm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to4}, %xmm1, %xmm0
; SKX-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; SKX-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>, %x
ret <4 x float> %div
}
define <4 x float> @v4f32_one_step_2_divs(<4 x float> %x) #1 {
; SSE-LABEL: v4f32_one_step_2_divs:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm1, %xmm0
; SSE-NEXT: movaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: subps %xmm0, %xmm2
; SSE-NEXT: mulps %xmm1, %xmm2
; SSE-NEXT: addps %xmm1, %xmm2
; SSE-NEXT: movaps {{.*#+}} xmm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00]
; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_one_step_2_divs:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %xmm0, %xmm2, %xmm0
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_one_step_2_divs:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %xmm0, %xmm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %xmm1, %xmm0
; FMA-RECIP-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1
; FMA-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_one_step_2_divs:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_one_step_2_divs:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [7:3.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_one_step_2_divs:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; HASWELL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; HASWELL-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_one_step_2_divs:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubps %xmm0, %xmm2, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v4f32_one_step_2_divs:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; KNL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm1, %xmm0
; KNL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; KNL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; KNL-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v4f32_one_step_2_divs:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %xmm0, %xmm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to4}, %xmm1, %xmm0
; SKX-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm0
; SKX-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm1 # sched: [9:0.50]
; SKX-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>, %x
%div2 = fdiv fast <4 x float> %div, %x
ret <4 x float> %div2
}
define <4 x float> @v4f32_two_step2(<4 x float> %x) #2 {
; SSE-LABEL: v4f32_two_step2:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm2
; SSE-NEXT: movaps %xmm0, %xmm3
; SSE-NEXT: mulps %xmm2, %xmm3
; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm1, %xmm4
; SSE-NEXT: subps %xmm3, %xmm4
; SSE-NEXT: mulps %xmm2, %xmm4
; SSE-NEXT: addps %xmm2, %xmm4
; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: subps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm4, %xmm1
; SSE-NEXT: addps %xmm4, %xmm1
; SSE-NEXT: mulps {{.*}}(%rip), %xmm1
; SSE-NEXT: movaps %xmm1, %xmm0
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v4f32_two_step2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %xmm0, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm2
; AVX-RECIP-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %xmm2, %xmm3, %xmm2
; AVX-RECIP-NEXT: vmulps %xmm2, %xmm1, %xmm2
; AVX-RECIP-NEXT: vaddps %xmm2, %xmm1, %xmm1
; AVX-RECIP-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-RECIP-NEXT: vsubps %xmm0, %xmm3, %xmm0
; AVX-RECIP-NEXT: vmulps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vaddps %xmm0, %xmm1, %xmm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v4f32_two_step2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %xmm0, %xmm1
; FMA-RECIP-NEXT: vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vmovaps %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; FMA-RECIP-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; FMA-RECIP-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; FMA-RECIP-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v4f32_two_step2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm1 # sched: [2:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm2, %xmm1, %xmm2 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vsubps %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v4f32_two_step2:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [7:3.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [6:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm2, %xmm1, %xmm2 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vsubps %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v4f32_two_step2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; HASWELL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; HASWELL-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; HASWELL-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v4f32_two_step2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm2 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} xmm3 = [1,1,1,1] sched: [4:0.50]
; HASWELL-NO-FMA-NEXT: vsubps %xmm2, %xmm3, %xmm2 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm2, %xmm1, %xmm2 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddps %xmm2, %xmm1, %xmm1 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vsubps %xmm0, %xmm3, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %xmm0, %xmm1, %xmm0 # sched: [5:0.50]
; HASWELL-NO-FMA-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v4f32_two_step2:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %xmm0, %xmm1 # sched: [5:1.00]
; KNL-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; KNL-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; KNL-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; KNL-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; KNL-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; KNL-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v4f32_two_step2:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %xmm0, %xmm1
; SKX-NEXT: vbroadcastss {{.*#+}} xmm2 = [1,1,1,1] sched: [4:0.50]
; SKX-NEXT: vmovaps %xmm1, %xmm3 # sched: [1:1.00]
; SKX-NEXT: vfnmadd213ps %xmm2, %xmm0, %xmm3
; SKX-NEXT: vfmadd132ps %xmm1, %xmm1, %xmm3
; SKX-NEXT: vfnmadd213ps %xmm2, %xmm3, %xmm0
; SKX-NEXT: vfmadd132ps %xmm3, %xmm3, %xmm0
; SKX-NEXT: vmulps {{.*}}(%rip), %xmm0, %xmm0 # sched: [9:0.50]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>, %x
ret <4 x float> %div
}
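;
; NOTE: A sketch of the math the two-step checks encode. Each
; Newton-Raphson step refines the estimate r for 1/x as:
;   e  = 1 - x * r       (vfnmadd213ps: negate(x * r), then add 1)
;   r' = r + r * e       (vfmadd132ps:  r * e + r)
; and the two-step variants simply apply this pair twice before the
; final multiply by the constant numerator; the NO-FMA prefixes show
; the same steps spelled out as mulps/subps/mulps/addps.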
define <8 x float> @v8f32_one_step2(<8 x float> %x) #1 {
; SSE-LABEL: v8f32_one_step2:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm1, %xmm4
; SSE-NEXT: mulps %xmm4, %xmm1
; SSE-NEXT: movaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm2, %xmm3
; SSE-NEXT: subps %xmm1, %xmm3
; SSE-NEXT: mulps %xmm4, %xmm3
; SSE-NEXT: addps %xmm4, %xmm3
; SSE-NEXT: rcpps %xmm0, %xmm1
; SSE-NEXT: mulps %xmm1, %xmm0
; SSE-NEXT: subps %xmm0, %xmm2
; SSE-NEXT: mulps %xmm1, %xmm2
; SSE-NEXT: addps %xmm1, %xmm2
; SSE-NEXT: mulps {{.*}}(%rip), %xmm2
; SSE-NEXT: mulps {{.*}}(%rip), %xmm3
; SSE-NEXT: movaps %xmm2, %xmm0
; SSE-NEXT: movaps %xmm3, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_one_step2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %ymm0, %ymm2, %ymm0
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vaddps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_one_step2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %ymm1, %ymm0
; FMA-RECIP-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_one_step2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm1 # sched: [2:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [7:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_one_step2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_one_step2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; HASWELL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_one_step2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v8f32_one_step2:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; KNL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; KNL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; KNL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_one_step2:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to8}, %ymm1, %ymm0
; SKX-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; SKX-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0>, %x
ret <8 x float> %div
}
define <8 x float> @v8f32_one_step_2_divs(<8 x float> %x) #1 {
; SSE-LABEL: v8f32_one_step_2_divs:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm2
; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: movaps {{.*#+}} xmm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm3, %xmm4
; SSE-NEXT: subps %xmm0, %xmm4
; SSE-NEXT: mulps %xmm2, %xmm4
; SSE-NEXT: addps %xmm2, %xmm4
; SSE-NEXT: rcpps %xmm1, %xmm0
; SSE-NEXT: mulps %xmm0, %xmm1
; SSE-NEXT: subps %xmm1, %xmm3
; SSE-NEXT: mulps %xmm0, %xmm3
; SSE-NEXT: addps %xmm0, %xmm3
; SSE-NEXT: movaps {{.*#+}} xmm1 = [5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00]
; SSE-NEXT: mulps %xmm3, %xmm1
; SSE-NEXT: movaps {{.*#+}} xmm0 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00]
; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: mulps %xmm3, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_one_step_2_divs:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %ymm0, %ymm2, %ymm0
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vaddps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_one_step_2_divs:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm1
; FMA-RECIP-NEXT: vfnmadd213ps {{.*}}(%rip), %ymm1, %ymm0
; FMA-RECIP-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1
; FMA-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_one_step_2_divs:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm1 # sched: [2:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [7:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_one_step_2_divs:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [12:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [9:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_one_step_2_divs:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; HASWELL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [9:1.00]
; HASWELL-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_one_step_2_divs:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vsubps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [9:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v8f32_one_step_2_divs:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; KNL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm1, %ymm0
; KNL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; KNL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [9:1.00]
; KNL-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_one_step_2_divs:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm1
; SKX-NEXT: vfnmadd213ps {{.*}}(%rip){1to8}, %ymm1, %ymm0
; SKX-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm0
; SKX-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm1 # sched: [9:1.00]
; SKX-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0>, %x
%div2 = fdiv fast <8 x float> %div, %x
ret <8 x float> %div2
}
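;
; NOTE: In the SSE checks above the <8 x float> operation is legalized
; as two independent <4 x float> halves (xmm0 holds the low half, xmm1
; the high half), so rcpps and the refinement sequence appear once per
; 128-bit half, while the AVX variants keep a single 256-bit ymm
; sequence.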
define <8 x float> @v8f32_two_step2(<8 x float> %x) #2 {
; SSE-LABEL: v8f32_two_step2:
; SSE: # BB#0:
; SSE-NEXT: movaps %xmm0, %xmm2
; SSE-NEXT: rcpps %xmm1, %xmm3
; SSE-NEXT: movaps %xmm1, %xmm4
; SSE-NEXT: mulps %xmm3, %xmm4
; SSE-NEXT: movaps {{.*#+}} xmm0 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; SSE-NEXT: movaps %xmm0, %xmm5
; SSE-NEXT: subps %xmm4, %xmm5
; SSE-NEXT: mulps %xmm3, %xmm5
; SSE-NEXT: addps %xmm3, %xmm5
; SSE-NEXT: mulps %xmm5, %xmm1
; SSE-NEXT: movaps %xmm0, %xmm3
; SSE-NEXT: subps %xmm1, %xmm3
; SSE-NEXT: mulps %xmm5, %xmm3
; SSE-NEXT: addps %xmm5, %xmm3
; SSE-NEXT: rcpps %xmm2, %xmm1
; SSE-NEXT: movaps %xmm2, %xmm4
; SSE-NEXT: mulps %xmm1, %xmm4
; SSE-NEXT: movaps %xmm0, %xmm5
; SSE-NEXT: subps %xmm4, %xmm5
; SSE-NEXT: mulps %xmm1, %xmm5
; SSE-NEXT: addps %xmm1, %xmm5
; SSE-NEXT: mulps %xmm5, %xmm2
; SSE-NEXT: subps %xmm2, %xmm0
; SSE-NEXT: mulps %xmm5, %xmm0
; SSE-NEXT: addps %xmm5, %xmm0
; SSE-NEXT: mulps {{.*}}(%rip), %xmm0
; SSE-NEXT: mulps {{.*}}(%rip), %xmm3
; SSE-NEXT: movaps %xmm3, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_two_step2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm2
; AVX-RECIP-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; AVX-RECIP-NEXT: vsubps %ymm2, %ymm3, %ymm2
; AVX-RECIP-NEXT: vmulps %ymm2, %ymm1, %ymm2
; AVX-RECIP-NEXT: vaddps %ymm2, %ymm1, %ymm1
; AVX-RECIP-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-RECIP-NEXT: vsubps %ymm0, %ymm3, %ymm0
; AVX-RECIP-NEXT: vmulps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vaddps %ymm0, %ymm1, %ymm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_two_step2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm1
; FMA-RECIP-NEXT: vmovaps {{.*#+}} ymm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
; FMA-RECIP-NEXT: vmovaps %ymm1, %ymm3
; FMA-RECIP-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; FMA-RECIP-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; FMA-RECIP-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; FMA-RECIP-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_two_step2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [5:1.00]
; BTVER2-NEXT: vrcpps %ymm0, %ymm1 # sched: [2:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm2 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm2, %ymm3, %ymm2 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm2, %ymm1, %ymm2 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm2, %ymm1, %ymm1 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vsubps %ymm0, %ymm3, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:2.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [7:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_two_step2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm1 # sched: [5:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm2 # sched: [5:1.00]
-; SANDY-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [7:0.50]
+; SANDY-NEXT: vmovaps {{.*#+}} ymm3 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [4:0.50]
; SANDY-NEXT: vsubps %ymm2, %ymm3, %ymm2 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm2, %ymm1, %ymm2 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm2, %ymm1, %ymm1 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vsubps %ymm0, %ymm3, %ymm0 # sched: [3:1.00]
; SANDY-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; SANDY-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_two_step2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; HASWELL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; HASWELL-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; HASWELL-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; HASWELL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_two_step2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm2 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vbroadcastss {{.*#+}} ymm3 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vsubps %ymm2, %ymm3, %ymm2 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm2 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vaddps %ymm2, %ymm1, %ymm1 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm0, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vsubps %ymm0, %ymm3, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm1, %ymm0 # sched: [5:1.00]
; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm1, %ymm0 # sched: [3:1.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v8f32_two_step2:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm1 # sched: [7:2.00]
; KNL-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; KNL-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; KNL-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; KNL-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; KNL-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; KNL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_two_step2:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm1
; SKX-NEXT: vbroadcastss {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1] sched: [5:1.00]
; SKX-NEXT: vmovaps %ymm1, %ymm3 # sched: [1:1.00]
; SKX-NEXT: vfnmadd213ps %ymm2, %ymm0, %ymm3
; SKX-NEXT: vfmadd132ps %ymm1, %ymm1, %ymm3
; SKX-NEXT: vfnmadd213ps %ymm2, %ymm3, %ymm0
; SKX-NEXT: vfmadd132ps %ymm3, %ymm3, %ymm0
; SKX-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0>, %x
ret <8 x float> %div
}
define <8 x float> @v8f32_no_step(<8 x float> %x) #3 {
; SSE-LABEL: v8f32_no_step:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm0, %xmm0
; SSE-NEXT: rcpps %xmm1, %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_no_step:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_no_step:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_no_step:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrcpps %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_no_step:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_no_step:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_no_step:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v8f32_no_step:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_no_step:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm0
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
ret <8 x float> %div
}
define <8 x float> @v8f32_no_step2(<8 x float> %x) #3 {
; SSE-LABEL: v8f32_no_step2:
; SSE: # BB#0:
; SSE-NEXT: rcpps %xmm1, %xmm1
; SSE-NEXT: rcpps %xmm0, %xmm0
; SSE-NEXT: mulps {{.*}}(%rip), %xmm0
; SSE-NEXT: mulps {{.*}}(%rip), %xmm1
; SSE-NEXT: retq
;
; AVX-RECIP-LABEL: v8f32_no_step2:
; AVX-RECIP: # BB#0:
; AVX-RECIP-NEXT: vrcpps %ymm0, %ymm0
; AVX-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; AVX-RECIP-NEXT: retq
;
; FMA-RECIP-LABEL: v8f32_no_step2:
; FMA-RECIP: # BB#0:
; FMA-RECIP-NEXT: vrcpps %ymm0, %ymm0
; FMA-RECIP-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0
; FMA-RECIP-NEXT: retq
;
; BTVER2-LABEL: v8f32_no_step2:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrcpps %ymm0, %ymm0 # sched: [2:2.00]
; BTVER2-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [7:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: v8f32_no_step2:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpps %ymm0, %ymm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: v8f32_no_step2:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; HASWELL-NO-FMA-LABEL: v8f32_no_step2:
; HASWELL-NO-FMA: # BB#0:
; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; HASWELL-NO-FMA-NEXT: retq # sched: [1:1.00]
;
; KNL-LABEL: v8f32_no_step2:
; KNL: # BB#0:
; KNL-NEXT: vrcpps %ymm0, %ymm0 # sched: [7:2.00]
; KNL-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; KNL-NEXT: retq # sched: [1:1.00]
;
; SKX-LABEL: v8f32_no_step2:
; SKX: # BB#0:
; SKX-NEXT: vrcp14ps %ymm0, %ymm0
; SKX-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [9:1.00]
; SKX-NEXT: retq # sched: [1:1.00]
%div = fdiv fast <8 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0>, %x
ret <8 x float> %div
}
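;
; NOTE: The two no-step functions above request zero refinement steps
; (attribute "divf:0,vec-divf:0" below), so only the bare estimate
; instruction is emitted. For a numerator of all ones the estimate is
; the result (v8f32_no_step); for any other constant numerator a single
; trailing multiply remains, roughly y = c * rcp(x) (v8f32_no_step2).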
attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }
attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }
attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }
attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:0,vec-divf:0" }
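;
; NOTE: The "reciprocal-estimates" strings read as a comma-separated
; list: a leading '!' disables an estimate kind, a bare name enables it
; with the target's default number of refinement steps, and a ':N'
; suffix requests exactly N Newton-Raphson steps ('divf' covers scalar
; fdiv, 'vec-divf' vector fdiv). A minimal sketch, with a hypothetical
; function name, of requesting one step for vector divides only:
;
; define <4 x float> @recip_one_step(<4 x float> %x) "reciprocal-estimates"="vec-divf:1" {
;   %div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
;   ret <4 x float> %div
; }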
diff --git a/test/CodeGen/X86/sse-schedule.ll b/test/CodeGen/X86/sse-schedule.ll
index 29f726c3df6a..f44cee9db22c 100644
--- a/test/CodeGen/X86/sse-schedule.ll
+++ b/test/CodeGen/X86/sse-schedule.ll
@@ -1,2740 +1,2740 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
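; NOTE: The 'sched: [N:M.MM]' comments emitted by -print-schedule give
; the model's instruction latency in cycles (N) and reciprocal
; throughput (M.MM, cycles per instruction when issued back to back).
; Note also that the ivybridge RUN line reuses the SANDY prefix and the
; skylake RUN line the HASWELL prefix, so those CPUs are checked
; against the same expected output.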
define <4 x float> @test_addps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_addps:
; GENERIC: # BB#0:
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: addps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addps:
; ATOM: # BB#0:
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: addps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addps:
; SLM: # BB#0:
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addps:
; SANDY: # BB#0:
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <4 x float> %a0, %a1
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = fadd <4 x float> %1, %2
ret <4 x float> %3
}
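;
; NOTE: Each test here exercises both the register-register form and
; the folded-load form of the same instruction, pinning both entries in
; the scheduling model. The memory form's latency is the ALU latency
; plus the model's load latency; e.g. for SANDY addps above, 3 cycles
; plus a 4-cycle load gives the corrected sched: [7:1.00].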
define float @test_addss(float %a0, float %a1, float *%a2) {
; GENERIC-LABEL: test_addss:
; GENERIC: # BB#0:
; GENERIC-NEXT: addss %xmm1, %xmm0
; GENERIC-NEXT: addss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addss:
; ATOM: # BB#0:
; ATOM-NEXT: addss %xmm1, %xmm0
; ATOM-NEXT: addss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addss:
; SLM: # BB#0:
; SLM-NEXT: addss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addss:
; SANDY: # BB#0:
; SANDY-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddss (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddss (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddss (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd float %a0, %a1
%2 = load float, float *%a2, align 4
%3 = fadd float %1, %2
ret float %3
}
define <4 x float> @test_andps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_andps:
; GENERIC: # BB#0:
; GENERIC-NEXT: andps %xmm1, %xmm0
; GENERIC-NEXT: andps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_andps:
; ATOM: # BB#0:
; ATOM-NEXT: andps %xmm1, %xmm0
; ATOM-NEXT: andps (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_andps:
; SLM: # BB#0:
; SLM-NEXT: andps %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: andps (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_andps:
; SANDY: # BB#0:
-; SANDY-NEXT: vandps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vandps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vandps %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vandps (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vandps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vandps (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandps %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandps (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x float> %a0 to <4 x i32>
%2 = bitcast <4 x float> %a1 to <4 x i32>
%3 = and <4 x i32> %1, %2
%4 = load <4 x float>, <4 x float> *%a2, align 16
%5 = bitcast <4 x float> %4 to <4 x i32>
%6 = and <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <4 x float>
ret <4 x float> %7
}
define <4 x float> @test_andnotps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_andnotps:
; GENERIC: # BB#0:
; GENERIC-NEXT: andnps %xmm1, %xmm0
; GENERIC-NEXT: andnps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_andnotps:
; ATOM: # BB#0:
; ATOM-NEXT: andnps %xmm1, %xmm0
; ATOM-NEXT: andnps (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_andnotps:
; SLM: # BB#0:
; SLM-NEXT: andnps %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: andnps (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_andnotps:
; SANDY: # BB#0:
-; SANDY-NEXT: vandnps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vandnps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vandnps %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vandnps (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andnotps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandnps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vandnps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andnotps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandnps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vandnps (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andnotps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandnps %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandnps (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x float> %a0 to <4 x i32>
%2 = bitcast <4 x float> %a1 to <4 x i32>
%3 = xor <4 x i32> %1, <i32 -1, i32 -1, i32 -1, i32 -1>
%4 = and <4 x i32> %3, %2
%5 = load <4 x float>, <4 x float> *%a2, align 16
%6 = bitcast <4 x float> %5 to <4 x i32>
%7 = xor <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
%8 = and <4 x i32> %6, %7
%9 = bitcast <4 x i32> %8 to <4 x float>
ret <4 x float> %9
}
define <4 x float> @test_cmpps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_cmpps:
; GENERIC: # BB#0:
; GENERIC-NEXT: cmpeqps %xmm0, %xmm1
; GENERIC-NEXT: cmpeqps (%rdi), %xmm0
; GENERIC-NEXT: orps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cmpps:
; ATOM: # BB#0:
; ATOM-NEXT: cmpeqps %xmm0, %xmm1
; ATOM-NEXT: cmpeqps (%rdi), %xmm0
; ATOM-NEXT: orps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cmpps:
; SLM: # BB#0:
; SLM-NEXT: cmpeqps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: cmpeqps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: orps %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cmpps:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqps %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
-; SANDY-NEXT: vcmpeqps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: vorps %xmm0, %xmm1, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vcmpeqps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vorps %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqps %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: vorps %xmm0, %xmm1, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqps %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: vorps %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqps %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: vorps %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fcmp oeq <4 x float> %a0, %a1
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = fcmp oeq <4 x float> %a0, %2
%4 = or <4 x i1> %1, %3
%5 = sext <4 x i1> %4 to <4 x i32>
%6 = bitcast <4 x i32> %5 to <4 x float>
ret <4 x float> %6
}
define float @test_cmpss(float %a0, float %a1, float *%a2) {
; GENERIC-LABEL: test_cmpss:
; GENERIC: # BB#0:
; GENERIC-NEXT: cmpeqss %xmm1, %xmm0
; GENERIC-NEXT: cmpeqss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cmpss:
; ATOM: # BB#0:
; ATOM-NEXT: cmpeqss %xmm1, %xmm0
; ATOM-NEXT: cmpeqss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cmpss:
; SLM: # BB#0:
; SLM-NEXT: cmpeqss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: cmpeqss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cmpss:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vcmpeqss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmpss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmpss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqss (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmpss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqss (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x float> undef, float %a0, i32 0
%2 = insertelement <4 x float> undef, float %a1, i32 0
%3 = call <4 x float> @llvm.x86.sse.cmp.ss(<4 x float> %1, <4 x float> %2, i8 0)
%4 = load float, float *%a2, align 4
%5 = insertelement <4 x float> undef, float %4, i32 0
%6 = call <4 x float> @llvm.x86.sse.cmp.ss(<4 x float> %3, <4 x float> %5, i8 0)
%7 = extractelement <4 x float> %6, i32 0
ret float %7
}
declare <4 x float> @llvm.x86.sse.cmp.ss(<4 x float>, <4 x float>, i8) nounwind readnone
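;
; NOTE: test_cmpss goes through the llvm.x86.sse.cmp.ss intrinsic with
; immediate predicate 0 (equal, ordered), which is why the checks show
; cmpeqss; the scalar inputs are wrapped in <4 x float> vectors with
; insertelement/extractelement because the intrinsic operates on the
; full xmm register.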
define i32 @test_comiss(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_comiss:
; GENERIC: # BB#0:
; GENERIC-NEXT: comiss %xmm1, %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %cl
; GENERIC-NEXT: andb %al, %cl
; GENERIC-NEXT: comiss (%rdi), %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %dl
; GENERIC-NEXT: andb %al, %dl
; GENERIC-NEXT: orb %cl, %dl
; GENERIC-NEXT: movzbl %dl, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_comiss:
; ATOM: # BB#0:
; ATOM-NEXT: comiss %xmm1, %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %cl
; ATOM-NEXT: andb %al, %cl
; ATOM-NEXT: comiss (%rdi), %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %dl
; ATOM-NEXT: andb %al, %dl
; ATOM-NEXT: orb %cl, %dl
; ATOM-NEXT: movzbl %dl, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_comiss:
; SLM: # BB#0:
; SLM-NEXT: comiss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %cl # sched: [1:0.50]
; SLM-NEXT: andb %al, %cl # sched: [1:0.50]
; SLM-NEXT: comiss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %dl # sched: [1:0.50]
; SLM-NEXT: andb %al, %dl # sched: [1:0.50]
; SLM-NEXT: orb %cl, %dl # sched: [1:0.50]
; SLM-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_comiss:
; SANDY: # BB#0:
; SANDY-NEXT: vcomiss %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %cl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %cl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %cl # sched: [1:0.33]
; SANDY-NEXT: vcomiss (%rdi), %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %dl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %dl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %dl # sched: [1:0.33]
; SANDY-NEXT: orb %cl, %dl # sched: [1:0.33]
; SANDY-NEXT: movzbl %dl, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_comiss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcomiss %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %cl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %cl # sched: [1:0.25]
; HASWELL-NEXT: vcomiss (%rdi), %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %dl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %dl # sched: [1:0.25]
; HASWELL-NEXT: orb %cl, %dl # sched: [1:0.25]
; HASWELL-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_comiss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcomiss %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %cl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %cl # sched: [1:0.50]
; BTVER2-NEXT: vcomiss (%rdi), %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %dl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %dl # sched: [1:0.50]
; BTVER2-NEXT: orb %cl, %dl # sched: [1:0.50]
; BTVER2-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_comiss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcomiss %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %cl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %cl # sched: [1:0.25]
; ZNVER1-NEXT: vcomiss (%rdi), %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %dl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %dl # sched: [1:0.25]
; ZNVER1-NEXT: orb %cl, %dl # sched: [1:0.25]
; ZNVER1-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse.comieq.ss(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 4
%3 = call i32 @llvm.x86.sse.comieq.ss(<4 x float> %a0, <4 x float> %2)
%4 = or i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse.comieq.ss(<4 x float>, <4 x float>) nounwind readnone
define float @test_cvtsi2ss(i32 %a0, i32 *%a1) {
; GENERIC-LABEL: test_cvtsi2ss:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsi2ssl %edi, %xmm1
; GENERIC-NEXT: cvtsi2ssl (%rsi), %xmm0
; GENERIC-NEXT: addss %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsi2ss:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsi2ssl (%rsi), %xmm0
; ATOM-NEXT: cvtsi2ssl %edi, %xmm1
; ATOM-NEXT: addss %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsi2ss:
; SLM: # BB#0:
; SLM-NEXT: cvtsi2ssl (%rsi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: cvtsi2ssl %edi, %xmm1 # sched: [4:0.50]
; SLM-NEXT: addss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsi2ss:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtsi2ssl %edi, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vcvtsi2ssl (%rsi), %xmm1, %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtsi2ssl %edi, %xmm0, %xmm0 # sched: [4:1.00]
+; SANDY-NEXT: vcvtsi2ssl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsi2ss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsi2ssl %edi, %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtsi2ssl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsi2ss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsi2ssl %edi, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtsi2ssl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsi2ss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsi2ssl %edi, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtsi2ssl (%rsi), %xmm1, %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp i32 %a0 to float
%2 = load i32, i32 *%a1, align 4
%3 = sitofp i32 %2 to float
%4 = fadd float %1, %3
ret float %4
}
define float @test_cvtsi2ssq(i64 %a0, i64 *%a1) {
; GENERIC-LABEL: test_cvtsi2ssq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsi2ssq %rdi, %xmm1
; GENERIC-NEXT: cvtsi2ssq (%rsi), %xmm0
; GENERIC-NEXT: addss %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsi2ssq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsi2ssq (%rsi), %xmm0
; ATOM-NEXT: cvtsi2ssq %rdi, %xmm1
; ATOM-NEXT: addss %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsi2ssq:
; SLM: # BB#0:
; SLM-NEXT: cvtsi2ssq (%rsi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: cvtsi2ssq %rdi, %xmm1 # sched: [4:0.50]
; SLM-NEXT: addss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsi2ssq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtsi2ssq %rdi, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vcvtsi2ssq (%rsi), %xmm1, %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtsi2ssq %rdi, %xmm0, %xmm0 # sched: [4:1.00]
+; SANDY-NEXT: vcvtsi2ssq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsi2ssq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsi2ssq %rdi, %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtsi2ssq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsi2ssq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsi2ssq %rdi, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtsi2ssq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsi2ssq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsi2ssq %rdi, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtsi2ssq (%rsi), %xmm1, %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp i64 %a0 to float
%2 = load i64, i64 *%a1, align 8
%3 = sitofp i64 %2 to float
%4 = fadd float %1, %3
ret float %4
}
define i32 @test_cvtss2si(float %a0, float *%a1) {
; GENERIC-LABEL: test_cvtss2si:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtss2si %xmm0, %ecx
; GENERIC-NEXT: cvtss2si (%rdi), %eax
; GENERIC-NEXT: addl %ecx, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtss2si:
; ATOM: # BB#0:
; ATOM-NEXT: cvtss2si (%rdi), %eax
; ATOM-NEXT: cvtss2si %xmm0, %ecx
; ATOM-NEXT: addl %ecx, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtss2si:
; SLM: # BB#0:
; SLM-NEXT: cvtss2si (%rdi), %eax # sched: [7:1.00]
; SLM-NEXT: cvtss2si %xmm0, %ecx # sched: [4:0.50]
; SLM-NEXT: addl %ecx, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtss2si:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtss2si %xmm0, %ecx # sched: [5:1.00]
-; SANDY-NEXT: vcvtss2si (%rdi), %eax # sched: [10:1.00]
+; SANDY-NEXT: vcvtss2si %xmm0, %ecx # sched: [3:1.00]
+; SANDY-NEXT: vcvtss2si (%rdi), %eax # sched: [7:1.00]
; SANDY-NEXT: addl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtss2si:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtss2si %xmm0, %ecx # sched: [4:1.00]
; HASWELL-NEXT: vcvtss2si (%rdi), %eax # sched: [8:1.00]
; HASWELL-NEXT: addl %ecx, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtss2si:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtss2si (%rdi), %eax # sched: [8:1.00]
; BTVER2-NEXT: vcvtss2si %xmm0, %ecx # sched: [3:1.00]
; BTVER2-NEXT: addl %ecx, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtss2si:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtss2si (%rdi), %eax # sched: [12:1.00]
; ZNVER1-NEXT: vcvtss2si %xmm0, %ecx # sched: [5:1.00]
; ZNVER1-NEXT: addl %ecx, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x float> undef, float %a0, i32 0
%2 = call i32 @llvm.x86.sse.cvtss2si(<4 x float> %1)
%3 = load float, float *%a1, align 4
%4 = insertelement <4 x float> undef, float %3, i32 0
%5 = call i32 @llvm.x86.sse.cvtss2si(<4 x float> %4)
%6 = add i32 %2, %5
ret i32 %6
}
declare i32 @llvm.x86.sse.cvtss2si(<4 x float>) nounwind readnone
define i64 @test_cvtss2siq(float %a0, float *%a1) {
; GENERIC-LABEL: test_cvtss2siq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtss2si %xmm0, %rcx
; GENERIC-NEXT: cvtss2si (%rdi), %rax
; GENERIC-NEXT: addq %rcx, %rax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtss2siq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtss2si (%rdi), %rax
; ATOM-NEXT: cvtss2si %xmm0, %rcx
; ATOM-NEXT: addq %rcx, %rax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtss2siq:
; SLM: # BB#0:
; SLM-NEXT: cvtss2si (%rdi), %rax # sched: [7:1.00]
; SLM-NEXT: cvtss2si %xmm0, %rcx # sched: [4:0.50]
; SLM-NEXT: addq %rcx, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtss2siq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtss2si %xmm0, %rcx # sched: [5:1.00]
-; SANDY-NEXT: vcvtss2si (%rdi), %rax # sched: [10:1.00]
+; SANDY-NEXT: vcvtss2si %xmm0, %rcx # sched: [3:1.00]
+; SANDY-NEXT: vcvtss2si (%rdi), %rax # sched: [7:1.00]
; SANDY-NEXT: addq %rcx, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtss2siq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtss2si %xmm0, %rcx # sched: [4:1.00]
; HASWELL-NEXT: vcvtss2si (%rdi), %rax # sched: [8:1.00]
; HASWELL-NEXT: addq %rcx, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtss2siq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtss2si (%rdi), %rax # sched: [8:1.00]
; BTVER2-NEXT: vcvtss2si %xmm0, %rcx # sched: [3:1.00]
; BTVER2-NEXT: addq %rcx, %rax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtss2siq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtss2si (%rdi), %rax # sched: [12:1.00]
; ZNVER1-NEXT: vcvtss2si %xmm0, %rcx # sched: [5:1.00]
; ZNVER1-NEXT: addq %rcx, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x float> undef, float %a0, i32 0
%2 = call i64 @llvm.x86.sse.cvtss2si64(<4 x float> %1)
%3 = load float, float *%a1, align 4
%4 = insertelement <4 x float> undef, float %3, i32 0
%5 = call i64 @llvm.x86.sse.cvtss2si64(<4 x float> %4)
%6 = add i64 %2, %5
ret i64 %6
}
declare i64 @llvm.x86.sse.cvtss2si64(<4 x float>) nounwind readnone
define i32 @test_cvttss2si(float %a0, float *%a1) {
; GENERIC-LABEL: test_cvttss2si:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttss2si %xmm0, %ecx
; GENERIC-NEXT: cvttss2si (%rdi), %eax
; GENERIC-NEXT: addl %ecx, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttss2si:
; ATOM: # BB#0:
; ATOM-NEXT: cvttss2si (%rdi), %eax
; ATOM-NEXT: cvttss2si %xmm0, %ecx
; ATOM-NEXT: addl %ecx, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttss2si:
; SLM: # BB#0:
; SLM-NEXT: cvttss2si (%rdi), %eax # sched: [7:1.00]
; SLM-NEXT: cvttss2si %xmm0, %ecx # sched: [4:0.50]
; SLM-NEXT: addl %ecx, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttss2si:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttss2si %xmm0, %ecx # sched: [5:1.00]
-; SANDY-NEXT: vcvttss2si (%rdi), %eax # sched: [10:1.00]
+; SANDY-NEXT: vcvttss2si %xmm0, %ecx # sched: [3:1.00]
+; SANDY-NEXT: vcvttss2si (%rdi), %eax # sched: [7:1.00]
; SANDY-NEXT: addl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttss2si:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttss2si %xmm0, %ecx # sched: [4:1.00]
; HASWELL-NEXT: vcvttss2si (%rdi), %eax # sched: [8:1.00]
; HASWELL-NEXT: addl %ecx, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttss2si:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttss2si (%rdi), %eax # sched: [8:1.00]
; BTVER2-NEXT: vcvttss2si %xmm0, %ecx # sched: [3:1.00]
; BTVER2-NEXT: addl %ecx, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttss2si:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttss2si (%rdi), %eax # sched: [12:1.00]
; ZNVER1-NEXT: vcvttss2si %xmm0, %ecx # sched: [5:1.00]
; ZNVER1-NEXT: addl %ecx, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi float %a0 to i32
%2 = load float, float *%a1, align 4
%3 = fptosi float %2 to i32
%4 = add i32 %1, %3
ret i32 %4
}
define i64 @test_cvttss2siq(float %a0, float *%a1) {
; GENERIC-LABEL: test_cvttss2siq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttss2si %xmm0, %rcx
; GENERIC-NEXT: cvttss2si (%rdi), %rax
; GENERIC-NEXT: addq %rcx, %rax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttss2siq:
; ATOM: # BB#0:
; ATOM-NEXT: cvttss2si (%rdi), %rax
; ATOM-NEXT: cvttss2si %xmm0, %rcx
; ATOM-NEXT: addq %rcx, %rax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttss2siq:
; SLM: # BB#0:
; SLM-NEXT: cvttss2si (%rdi), %rax # sched: [7:1.00]
; SLM-NEXT: cvttss2si %xmm0, %rcx # sched: [4:0.50]
; SLM-NEXT: addq %rcx, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttss2siq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttss2si %xmm0, %rcx # sched: [5:1.00]
-; SANDY-NEXT: vcvttss2si (%rdi), %rax # sched: [10:1.00]
+; SANDY-NEXT: vcvttss2si %xmm0, %rcx # sched: [3:1.00]
+; SANDY-NEXT: vcvttss2si (%rdi), %rax # sched: [7:1.00]
; SANDY-NEXT: addq %rcx, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttss2siq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttss2si %xmm0, %rcx # sched: [4:1.00]
; HASWELL-NEXT: vcvttss2si (%rdi), %rax # sched: [8:1.00]
; HASWELL-NEXT: addq %rcx, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttss2siq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttss2si (%rdi), %rax # sched: [8:1.00]
; BTVER2-NEXT: vcvttss2si %xmm0, %rcx # sched: [3:1.00]
; BTVER2-NEXT: addq %rcx, %rax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttss2siq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttss2si (%rdi), %rax # sched: [12:1.00]
; ZNVER1-NEXT: vcvttss2si %xmm0, %rcx # sched: [5:1.00]
; ZNVER1-NEXT: addq %rcx, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi float %a0 to i64
%2 = load float, float *%a1, align 4
%3 = fptosi float %2 to i64
%4 = add i64 %1, %3
ret i64 %4
}
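;
; NOTE: The cvtss2si tests above use the llvm.x86.sse.cvtss2si*
; intrinsics, which honor the current MXCSR rounding mode, while the
; cvttss2si tests lower plain fptosi, which always truncates toward
; zero; aside from the opcode, both forms are checked with the same
; register and folded-load pattern.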
define <4 x float> @test_divps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_divps:
; GENERIC: # BB#0:
; GENERIC-NEXT: divps %xmm1, %xmm0
; GENERIC-NEXT: divps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_divps:
; ATOM: # BB#0:
; ATOM-NEXT: divps %xmm1, %xmm0
; ATOM-NEXT: divps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_divps:
; SLM: # BB#0:
; SLM-NEXT: divps %xmm1, %xmm0 # sched: [34:34.00]
; SLM-NEXT: divps (%rdi), %xmm0 # sched: [37:34.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_divps:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivps %xmm1, %xmm0, %xmm0 # sched: [14:1.00]
-; SANDY-NEXT: vdivps (%rdi), %xmm0, %xmm0 # sched: [20:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivps %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivps (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivps %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: vdivps (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivps %xmm1, %xmm0, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: vdivps (%rdi), %xmm0, %xmm0 # sched: [24:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivps %xmm1, %xmm0, %xmm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivps (%rdi), %xmm0, %xmm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv <4 x float> %a0, %a1
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = fdiv <4 x float> %1, %2
ret <4 x float> %3
}
define float @test_divss(float %a0, float %a1, float *%a2) {
; GENERIC-LABEL: test_divss:
; GENERIC: # BB#0:
; GENERIC-NEXT: divss %xmm1, %xmm0
; GENERIC-NEXT: divss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_divss:
; ATOM: # BB#0:
; ATOM-NEXT: divss %xmm1, %xmm0
; ATOM-NEXT: divss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_divss:
; SLM: # BB#0:
; SLM-NEXT: divss %xmm1, %xmm0 # sched: [34:34.00]
; SLM-NEXT: divss (%rdi), %xmm0 # sched: [37:34.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_divss:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivss %xmm1, %xmm0, %xmm0 # sched: [14:1.00]
-; SANDY-NEXT: vdivss (%rdi), %xmm0, %xmm0 # sched: [20:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivss %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivss (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivss %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: vdivss (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivss %xmm1, %xmm0, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: vdivss (%rdi), %xmm0, %xmm0 # sched: [24:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivss %xmm1, %xmm0, %xmm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivss (%rdi), %xmm0, %xmm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv float %a0, %a1
%2 = load float, float *%a2, align 4
%3 = fdiv float %1, %2
ret float %3
}
define void @test_ldmxcsr(i32 %a0) {
; GENERIC-LABEL: test_ldmxcsr:
; GENERIC: # BB#0:
; GENERIC-NEXT: movl %edi, -{{[0-9]+}}(%rsp)
; GENERIC-NEXT: ldmxcsr -{{[0-9]+}}(%rsp)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_ldmxcsr:
; ATOM: # BB#0:
; ATOM-NEXT: movl %edi, -{{[0-9]+}}(%rsp)
; ATOM-NEXT: ldmxcsr -{{[0-9]+}}(%rsp)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_ldmxcsr:
; SLM: # BB#0:
; SLM-NEXT: movl %edi, -{{[0-9]+}}(%rsp) # sched: [1:1.00]
; SLM-NEXT: ldmxcsr -{{[0-9]+}}(%rsp) # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_ldmxcsr:
; SANDY: # BB#0:
; SANDY-NEXT: movl %edi, -{{[0-9]+}}(%rsp) # sched: [1:1.00]
-; SANDY-NEXT: vldmxcsr -{{[0-9]+}}(%rsp) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vldmxcsr -{{[0-9]+}}(%rsp) # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_ldmxcsr:
; HASWELL: # BB#0:
; HASWELL-NEXT: movl %edi, -{{[0-9]+}}(%rsp) # sched: [1:1.00]
; HASWELL-NEXT: vldmxcsr -{{[0-9]+}}(%rsp) # sched: [6:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_ldmxcsr:
; BTVER2: # BB#0:
; BTVER2-NEXT: movl %edi, -{{[0-9]+}}(%rsp) # sched: [1:1.00]
; BTVER2-NEXT: vldmxcsr -{{[0-9]+}}(%rsp) # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_ldmxcsr:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: movl %edi, -{{[0-9]+}}(%rsp) # sched: [1:0.50]
; ZNVER1-NEXT: vldmxcsr -{{[0-9]+}}(%rsp) # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = alloca i32, align 4
%2 = bitcast i32* %1 to i8*
store i32 %a0, i32* %1
call void @llvm.x86.sse.ldmxcsr(i8* %2)
ret void
}
declare void @llvm.x86.sse.ldmxcsr(i8*) nounwind readnone
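; Note (comment added for clarity): @test_ldmxcsr follows the usual pattern
; for the MXCSR intrinsics: spill the i32 argument to a stack slot via alloca,
; bitcast the slot to i8*, and pass that pointer to @llvm.x86.sse.ldmxcsr,
; which lowers to the (v)ldmxcsr load from -{{[0-9]+}}(%rsp) checked above.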
define <4 x float> @test_maxps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_maxps:
; GENERIC: # BB#0:
; GENERIC-NEXT: maxps %xmm1, %xmm0
; GENERIC-NEXT: maxps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_maxps:
; ATOM: # BB#0:
; ATOM-NEXT: maxps %xmm1, %xmm0
; ATOM-NEXT: maxps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_maxps:
; SLM: # BB#0:
; SLM-NEXT: maxps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: maxps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_maxps:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.max.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse.max.ps(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse.max.ps(<4 x float>, <4 x float>) nounwind readnone
define <4 x float> @test_maxss(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_maxss:
; GENERIC: # BB#0:
; GENERIC-NEXT: maxss %xmm1, %xmm0
; GENERIC-NEXT: maxss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_maxss:
; ATOM: # BB#0:
; ATOM-NEXT: maxss %xmm1, %xmm0
; ATOM-NEXT: maxss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_maxss:
; SLM: # BB#0:
; SLM-NEXT: maxss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: maxss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_maxss:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxss (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxss (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxss (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.max.ss(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse.max.ss(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse.max.ss(<4 x float>, <4 x float>) nounwind readnone
define <4 x float> @test_minps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_minps:
; GENERIC: # BB#0:
; GENERIC-NEXT: minps %xmm1, %xmm0
; GENERIC-NEXT: minps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_minps:
; ATOM: # BB#0:
; ATOM-NEXT: minps %xmm1, %xmm0
; ATOM-NEXT: minps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_minps:
; SLM: # BB#0:
; SLM-NEXT: minps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: minps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_minps:
; SANDY: # BB#0:
; SANDY-NEXT: vminps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vminps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vminps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vminps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vminps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.min.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse.min.ps(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse.min.ps(<4 x float>, <4 x float>) nounwind readnone
define <4 x float> @test_minss(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_minss:
; GENERIC: # BB#0:
; GENERIC-NEXT: minss %xmm1, %xmm0
; GENERIC-NEXT: minss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_minss:
; ATOM: # BB#0:
; ATOM-NEXT: minss %xmm1, %xmm0
; ATOM-NEXT: minss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_minss:
; SLM: # BB#0:
; SLM-NEXT: minss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: minss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_minss:
; SANDY: # BB#0:
; SANDY-NEXT: vminss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vminss (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vminss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vminss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vminss (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminss (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.min.ss(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse.min.ss(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse.min.ss(<4 x float>, <4 x float>) nounwind readnone
define void @test_movaps(<4 x float> *%a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_movaps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movaps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm0, %xmm0
; GENERIC-NEXT: movaps %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movaps:
; ATOM: # BB#0:
; ATOM-NEXT: movaps (%rdi), %xmm0
; ATOM-NEXT: addps %xmm0, %xmm0
; ATOM-NEXT: movaps %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movaps:
; SLM: # BB#0:
; SLM-NEXT: movaps (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: addps %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movaps %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movaps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovaps (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovaps (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovaps %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovaps %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movaps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovaps (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovaps %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movaps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovaps %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movaps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovaps (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovaps %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <4 x float>, <4 x float> *%a0, align 16
%2 = fadd <4 x float> %1, %1
store <4 x float> %2, <4 x float> *%a1, align 16
ret void
}
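; Note (comment added for clarity): the load / fadd / store shape used by
; @test_movaps (and by @test_movups below) exercises all three instruction
; forms at once: the standalone load reports its load-to-use latency (e.g.
; [4:0.50] on SANDY), the register arithmetic sits in between, and the store
; reports only its issue cost (e.g. [1:1.00]).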
; TODO (v)movhlps
define <4 x float> @test_movhlps(<4 x float> %a0, <4 x float> %a1) {
; GENERIC-LABEL: test_movhlps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movhlps {{.*#+}} xmm0 = xmm1[1],xmm0[1]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movhlps:
; ATOM: # BB#0:
; ATOM-NEXT: movhlps {{.*#+}} xmm0 = xmm1[1],xmm0[1]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movhlps:
; SLM: # BB#0:
; SLM-NEXT: movhlps {{.*#+}} xmm0 = xmm1[1],xmm0[1] sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movhlps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm1[1],xmm0[1] sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movhlps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm1[1],xmm0[1] sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movhlps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm1[1],xmm0[1] sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movhlps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm1[1],xmm0[1] sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
ret <4 x float> %1
}
; TODO (v)movhps
define void @test_movhps(<4 x float> %a0, <4 x float> %a1, x86_mmx *%a2) {
; GENERIC-LABEL: test_movhps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; GENERIC-NEXT: addps %xmm0, %xmm1
; GENERIC-NEXT: movhlps {{.*#+}} xmm1 = xmm1[1,1]
; GENERIC-NEXT: movlps %xmm1, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movhps:
; ATOM: # BB#0:
; ATOM-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movhlps {{.*#+}} xmm1 = xmm1[1,1]
; ATOM-NEXT: movlps %xmm1, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movhps:
; SLM: # BB#0:
; SLM-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [4:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: pextrq $1, %xmm1, (%rdi) # sched: [4:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movhps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [7:1.00]
+; SANDY-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movhps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movhps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [6:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movhps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast x86_mmx* %a2 to <2 x float>*
%2 = load <2 x float>, <2 x float> *%1, align 8
%3 = shufflevector <2 x float> %2, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%4 = shufflevector <4 x float> %a1, <4 x float> %3, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
%5 = fadd <4 x float> %a0, %4
%6 = shufflevector <4 x float> %5, <4 x float> undef, <2 x i32> <i32 2, i32 3>
store <2 x float> %6, <2 x float>* %1
ret void
}
; TODO (v)movlhps
define <4 x float> @test_movlhps(<4 x float> %a0, <4 x float> %a1) {
; GENERIC-LABEL: test_movlhps:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movlhps:
; ATOM: # BB#0:
; ATOM-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movlhps:
; SLM: # BB#0:
; SLM-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movlhps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movlhps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; HASWELL-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movlhps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movlhps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
%2 = fadd <4 x float> %a1, %1
ret <4 x float> %2
}
define void @test_movlps(<4 x float> %a0, <4 x float> %a1, x86_mmx *%a2) {
; GENERIC-LABEL: test_movlps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1]
; GENERIC-NEXT: addps %xmm0, %xmm1
; GENERIC-NEXT: movlps %xmm1, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movlps:
; ATOM: # BB#0:
; ATOM-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1]
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movlps %xmm1, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movlps:
; SLM: # BB#0:
; SLM-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [4:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movlps %xmm1, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movlps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [7:1.00]
+; SANDY-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [5:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovlps %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovlps %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movlps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [5:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovlps %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movlps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [6:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovlps %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movlps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [8:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovlps %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast x86_mmx* %a2 to <2 x float>*
%2 = load <2 x float>, <2 x float> *%1, align 8
%3 = shufflevector <2 x float> %2, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%4 = shufflevector <4 x float> %a1, <4 x float> %3, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
%5 = fadd <4 x float> %a0, %4
%6 = shufflevector <4 x float> %5, <4 x float> undef, <2 x i32> <i32 0, i32 1>
store <2 x float> %6, <2 x float>* %1
ret void
}
define i32 @test_movmskps(<4 x float> %a0) {
; GENERIC-LABEL: test_movmskps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movmskps %xmm0, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movmskps:
; ATOM: # BB#0:
; ATOM-NEXT: movmskps %xmm0, %eax
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movmskps:
; SLM: # BB#0:
; SLM-NEXT: movmskps %xmm0, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movmskps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovmskps %xmm0, %eax # sched: [2:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovmskps %xmm0, %eax # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movmskps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovmskps %xmm0, %eax # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movmskps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovmskps %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movmskps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovmskps %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse.movmsk.ps(<4 x float> %a0)
ret i32 %1
}
declare i32 @llvm.x86.sse.movmsk.ps(<4 x float>) nounwind readnone
define void @test_movntps(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_movntps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movntps %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movntps:
; ATOM: # BB#0:
; ATOM-NEXT: movntps %xmm0, (%rdi)
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movntps:
; SLM: # BB#0:
; SLM-NEXT: movntps %xmm0, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movntps:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovntps %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntps %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovntps %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovntps %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovntps %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
store <4 x float> %a0, <4 x float> *%a1, align 16, !nontemporal !0
ret void
}
define void @test_movss_mem(float* %a0, float* %a1) {
; GENERIC-LABEL: test_movss_mem:
; GENERIC: # BB#0:
; GENERIC-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; GENERIC-NEXT: addss %xmm0, %xmm0
; GENERIC-NEXT: movss %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movss_mem:
; ATOM: # BB#0:
; ATOM-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; ATOM-NEXT: addss %xmm0, %xmm0
; ATOM-NEXT: movss %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movss_mem:
; SLM: # BB#0:
; SLM-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [3:1.00]
; SLM-NEXT: addss %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movss %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movss_mem:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vaddss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovss %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovss %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movss_mem:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vaddss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovss %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movss_mem:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vaddss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovss %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movss_mem:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vaddss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovss %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load float, float* %a0, align 1
%2 = fadd float %1, %1
store float %2, float *%a1, align 1
ret void
}
define <4 x float> @test_movss_reg(<4 x float> %a0, <4 x float> %a1) {
; GENERIC-LABEL: test_movss_reg:
; GENERIC: # BB#0:
; GENERIC-NEXT: movss {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movss_reg:
; ATOM: # BB#0:
; ATOM-NEXT: movss {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movss_reg:
; SLM: # BB#0:
; SLM-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movss_reg:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movss_reg:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movss_reg:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movss_reg:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3] sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
ret <4 x float> %1
}
define void @test_movups(<4 x float> *%a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_movups:
; GENERIC: # BB#0:
; GENERIC-NEXT: movups (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm0, %xmm0
; GENERIC-NEXT: movups %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movups:
; ATOM: # BB#0:
; ATOM-NEXT: movups (%rdi), %xmm0
; ATOM-NEXT: addps %xmm0, %xmm0
; ATOM-NEXT: movups %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movups:
; SLM: # BB#0:
; SLM-NEXT: movups (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: addps %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movups %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movups:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovups (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovups %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovups %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movups:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovups (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovups %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movups:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovups (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovups %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movups:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovups (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddps %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovups %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <4 x float>, <4 x float> *%a0, align 1
%2 = fadd <4 x float> %1, %1
store <4 x float> %2, <4 x float> *%a1, align 1
ret void
}
define <4 x float> @test_mulps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_mulps:
; GENERIC: # BB#0:
; GENERIC-NEXT: mulps %xmm1, %xmm0
; GENERIC-NEXT: mulps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_mulps:
; ATOM: # BB#0:
; ATOM-NEXT: mulps %xmm1, %xmm0
; ATOM-NEXT: mulps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_mulps:
; SLM: # BB#0:
; SLM-NEXT: mulps %xmm1, %xmm0 # sched: [5:2.00]
; SLM-NEXT: mulps (%rdi), %xmm0 # sched: [8:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mulps:
; SANDY: # BB#0:
; SANDY-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulps (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vmulps (%rdi), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vmulps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulps %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulps (%rdi), %xmm0, %xmm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul <4 x float> %a0, %a1
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = fmul <4 x float> %1, %2
ret <4 x float> %3
}
define float @test_mulss(float %a0, float %a1, float *%a2) {
; GENERIC-LABEL: test_mulss:
; GENERIC: # BB#0:
; GENERIC-NEXT: mulss %xmm1, %xmm0
; GENERIC-NEXT: mulss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_mulss:
; ATOM: # BB#0:
; ATOM-NEXT: mulss %xmm1, %xmm0
; ATOM-NEXT: mulss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_mulss:
; SLM: # BB#0:
; SLM-NEXT: mulss %xmm1, %xmm0 # sched: [5:2.00]
; SLM-NEXT: mulss (%rdi), %xmm0 # sched: [8:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mulss:
; SANDY: # BB#0:
; SANDY-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulss (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulss (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vmulss (%rdi), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vmulss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulss %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulss (%rdi), %xmm0, %xmm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul float %a0, %a1
%2 = load float, float *%a2, align 4
%3 = fmul float %1, %2
ret float %3
}
define <4 x float> @test_orps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_orps:
; GENERIC: # BB#0:
; GENERIC-NEXT: orps %xmm1, %xmm0
; GENERIC-NEXT: orps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_orps:
; ATOM: # BB#0:
; ATOM-NEXT: orps %xmm1, %xmm0
; ATOM-NEXT: orps (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_orps:
; SLM: # BB#0:
; SLM-NEXT: orps %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: orps (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_orps:
; SANDY: # BB#0:
-; SANDY-NEXT: vorps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vorps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vorps %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vorps (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_orps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vorps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vorps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_orps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vorps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vorps (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_orps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vorps %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vorps (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x float> %a0 to <4 x i32>
%2 = bitcast <4 x float> %a1 to <4 x i32>
%3 = or <4 x i32> %1, %2
%4 = load <4 x float>, <4 x float> *%a2, align 16
%5 = bitcast <4 x float> %4 to <4 x i32>
%6 = or <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <4 x float>
ret <4 x float> %7
}
define void @test_prefetchnta(i8* %a0) {
; GENERIC-LABEL: test_prefetchnta:
; GENERIC: # BB#0:
; GENERIC-NEXT: prefetchnta (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_prefetchnta:
; ATOM: # BB#0:
; ATOM-NEXT: prefetchnta (%rdi)
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_prefetchnta:
; SLM: # BB#0:
; SLM-NEXT: prefetchnta (%rdi) # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_prefetchnta:
; SANDY: # BB#0:
-; SANDY-NEXT: prefetchnta (%rdi) # sched: [5:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: prefetchnta (%rdi) # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_prefetchnta:
; HASWELL: # BB#0:
; HASWELL-NEXT: prefetchnta (%rdi) # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_prefetchnta:
; BTVER2: # BB#0:
; BTVER2-NEXT: prefetchnta (%rdi) # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_prefetchnta:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: prefetchnta (%rdi) # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.prefetch(i8* %a0, i32 0, i32 0, i32 1)
ret void
}
declare void @llvm.prefetch(i8* nocapture, i32, i32, i32) nounwind readnone
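; Note (comment added for clarity): in @llvm.prefetch(ptr, rw, locality,
; cache type) the arguments (i32 0, i32 0, i32 1) request a read prefetch
; with minimal temporal locality into the data cache, which is why it lowers
; to prefetchnta rather than one of the prefetcht0/t1/t2 hints.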
define <4 x float> @test_rcpps(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_rcpps:
; GENERIC: # BB#0:
; GENERIC-NEXT: rcpps %xmm0, %xmm1
; GENERIC-NEXT: rcpps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_rcpps:
; ATOM: # BB#0:
; ATOM-NEXT: rcpps (%rdi), %xmm1
; ATOM-NEXT: rcpps %xmm0, %xmm0
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_rcpps:
; SLM: # BB#0:
; SLM-NEXT: rcpps (%rdi), %xmm1 # sched: [8:1.00]
; SLM-NEXT: rcpps %xmm0, %xmm0 # sched: [5:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_rcpps:
; SANDY: # BB#0:
-; SANDY-NEXT: vrcpps %xmm0, %xmm0 # sched: [7:3.00]
-; SANDY-NEXT: vrcpps (%rdi), %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vrcpps %xmm0, %xmm0 # sched: [5:1.00]
+; SANDY-NEXT: vrcpps (%rdi), %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rcpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpps %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vrcpps (%rdi), %xmm1 # sched: [9:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rcpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrcpps (%rdi), %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vrcpps %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rcpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vrcpps (%rdi), %xmm1 # sched: [12:0.50]
; ZNVER1-NEXT: vrcpps %xmm0, %xmm0 # sched: [5:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a0)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %2)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse.rcp.ps(<4 x float>) nounwind readnone
; TODO - rcpss_m
define <4 x float> @test_rcpss(float %a0, float *%a1) {
; GENERIC-LABEL: test_rcpss:
; GENERIC: # BB#0:
; GENERIC-NEXT: rcpss %xmm0, %xmm0
; GENERIC-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; GENERIC-NEXT: rcpss %xmm1, %xmm1
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_rcpss:
; ATOM: # BB#0:
; ATOM-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; ATOM-NEXT: rcpss %xmm0, %xmm0
; ATOM-NEXT: rcpss %xmm1, %xmm1
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_rcpss:
; SLM: # BB#0:
; SLM-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [3:1.00]
; SLM-NEXT: rcpss %xmm0, %xmm0 # sched: [8:1.00]
; SLM-NEXT: rcpss %xmm1, %xmm1 # sched: [8:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_rcpss:
; SANDY: # BB#0:
; SANDY-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vrcpss %xmm1, %xmm1, %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rcpss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vrcpss %xmm1, %xmm1, %xmm1 # sched: [9:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rcpss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: vrcpss %xmm1, %xmm1, %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rcpss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vrcpss %xmm0, %xmm0, %xmm0 # sched: [12:0.50]
; ZNVER1-NEXT: vrcpss %xmm1, %xmm1, %xmm1 # sched: [12:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x float> undef, float %a0, i32 0
%2 = call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %1)
%3 = load float, float *%a1, align 4
%4 = insertelement <4 x float> undef, float %3, i32 0
%5 = call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %4)
%6 = fadd <4 x float> %2, %5
ret <4 x float> %6
}
declare <4 x float> @llvm.x86.sse.rcp.ss(<4 x float>) nounwind readnone
define <4 x float> @test_rsqrtps(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_rsqrtps:
; GENERIC: # BB#0:
; GENERIC-NEXT: rsqrtps %xmm0, %xmm1
; GENERIC-NEXT: rsqrtps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_rsqrtps:
; ATOM: # BB#0:
; ATOM-NEXT: rsqrtps (%rdi), %xmm1
; ATOM-NEXT: rsqrtps %xmm0, %xmm0
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_rsqrtps:
; SLM: # BB#0:
; SLM-NEXT: rsqrtps (%rdi), %xmm1 # sched: [8:1.00]
; SLM-NEXT: rsqrtps %xmm0, %xmm0 # sched: [5:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_rsqrtps:
; SANDY: # BB#0:
; SANDY-NEXT: vrsqrtps %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [11:1.00]
+; SANDY-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rsqrtps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrsqrtps %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [9:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rsqrtps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vrsqrtps %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rsqrtps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [12:0.50]
; ZNVER1-NEXT: vrsqrtps %xmm0, %xmm0 # sched: [5:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a0)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %2)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float>) nounwind readnone
; TODO - rsqrtss_m
define <4 x float> @test_rsqrtss(float %a0, float *%a1) {
; GENERIC-LABEL: test_rsqrtss:
; GENERIC: # BB#0:
; GENERIC-NEXT: rsqrtss %xmm0, %xmm0
; GENERIC-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; GENERIC-NEXT: rsqrtss %xmm1, %xmm1
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_rsqrtss:
; ATOM: # BB#0:
; ATOM-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; ATOM-NEXT: rsqrtss %xmm0, %xmm0
; ATOM-NEXT: rsqrtss %xmm1, %xmm1
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_rsqrtss:
; SLM: # BB#0:
; SLM-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [3:1.00]
; SLM-NEXT: rsqrtss %xmm0, %xmm0 # sched: [8:1.00]
; SLM-NEXT: rsqrtss %xmm1, %xmm1 # sched: [8:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_rsqrtss:
; SANDY: # BB#0:
-; SANDY-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [6:0.50]
-; SANDY-NEXT: vrsqrtss %xmm1, %xmm1, %xmm1 # sched: [5:1.00]
+; SANDY-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
+; SANDY-NEXT: vrsqrtss %xmm1, %xmm1, %xmm1 # sched: [9:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_rsqrtss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vrsqrtss %xmm1, %xmm1, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_rsqrtss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: vrsqrtss %xmm1, %xmm1, %xmm1 # sched: [7:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_rsqrtss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0 # sched: [12:0.50]
; ZNVER1-NEXT: vrsqrtss %xmm1, %xmm1, %xmm1 # sched: [12:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x float> undef, float %a0, i32 0
%2 = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %1)
%3 = load float, float *%a1, align 4
%4 = insertelement <4 x float> undef, float %3, i32 0
%5 = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %4)
%6 = fadd <4 x float> %2, %5
ret <4 x float> %6
}
declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone
define void @test_sfence() {
; GENERIC-LABEL: test_sfence:
; GENERIC: # BB#0:
; GENERIC-NEXT: sfence
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_sfence:
; ATOM: # BB#0:
; ATOM-NEXT: sfence
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_sfence:
; SLM: # BB#0:
; SLM-NEXT: sfence # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_sfence:
; SANDY: # BB#0:
; SANDY-NEXT: sfence # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sfence:
; HASWELL: # BB#0:
; HASWELL-NEXT: sfence # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sfence:
; BTVER2: # BB#0:
; BTVER2-NEXT: sfence # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sfence:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: sfence # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.sse.sfence()
ret void
}
declare void @llvm.x86.sse.sfence() nounwind readnone
define <4 x float> @test_shufps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) nounwind {
; GENERIC-LABEL: test_shufps:
; GENERIC: # BB#0:
; GENERIC-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0]
; GENERIC-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_shufps:
; ATOM: # BB#0:
; ATOM-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0]
; ATOM-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_shufps:
; SLM: # BB#0:
; SLM-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0] sched: [1:1.00]
; SLM-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_shufps:
; SANDY: # BB#0:
; SANDY-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0] sched: [1:1.00]
-; SANDY-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_shufps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0] sched: [1:1.00]
; HASWELL-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_shufps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0] sched: [1:0.50]
; BTVER2-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_shufps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,0] sched: [1:0.50]
; ZNVER1-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,3],mem[0,0] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 0, i32 0, i32 4, i32 4>
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = shufflevector <4 x float> %1, <4 x float> %2, <4 x i32> <i32 0, i32 3, i32 4, i32 4>
ret <4 x float> %3
}
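; Note (comment added for clarity): the shufflevector masks fold directly
; into shufps immediates here: <0,0,4,4> takes lanes 0,0 from the first
; source and lane 0 (twice) of the second, matching the
; "xmm0 = xmm0[0,0],xmm1[0,0]" assembly comment, while the second shuffle
; additionally folds its load from (%rdi).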
define <4 x float> @test_sqrtps(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_sqrtps:
; GENERIC: # BB#0:
; GENERIC-NEXT: sqrtps %xmm0, %xmm1
; GENERIC-NEXT: sqrtps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_sqrtps:
; ATOM: # BB#0:
; ATOM-NEXT: sqrtps %xmm0, %xmm1
; ATOM-NEXT: sqrtps (%rdi), %xmm0
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_sqrtps:
; SLM: # BB#0:
; SLM-NEXT: sqrtps (%rdi), %xmm1 # sched: [18:1.00]
; SLM-NEXT: sqrtps %xmm0, %xmm0 # sched: [15:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_sqrtps:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtps %xmm0, %xmm0 # sched: [14:1.00]
-; SANDY-NEXT: vsqrtps (%rdi), %xmm1 # sched: [20:1.00]
+; SANDY-NEXT: vsqrtps %xmm0, %xmm0 # sched: [15:1.00]
+; SANDY-NEXT: vsqrtps (%rdi), %xmm1 # sched: [19:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtps %xmm0, %xmm0 # sched: [15:1.00]
; HASWELL-NEXT: vsqrtps (%rdi), %xmm1 # sched: [19:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsqrtps (%rdi), %xmm1 # sched: [26:21.00]
; BTVER2-NEXT: vsqrtps %xmm0, %xmm0 # sched: [21:21.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsqrtps (%rdi), %xmm1 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtps %xmm0, %xmm0 # sched: [20:1.00]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.sqrt.ps(<4 x float> %a0)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse.sqrt.ps(<4 x float> %2)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse.sqrt.ps(<4 x float>) nounwind readnone
; TODO - sqrtss_m
define <4 x float> @test_sqrtss(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_sqrtss:
; GENERIC: # BB#0:
; GENERIC-NEXT: sqrtss %xmm0, %xmm0
; GENERIC-NEXT: movaps (%rdi), %xmm1
; GENERIC-NEXT: sqrtss %xmm1, %xmm1
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_sqrtss:
; ATOM: # BB#0:
; ATOM-NEXT: movaps (%rdi), %xmm1
; ATOM-NEXT: sqrtss %xmm0, %xmm0
; ATOM-NEXT: sqrtss %xmm1, %xmm1
; ATOM-NEXT: addps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_sqrtss:
; SLM: # BB#0:
; SLM-NEXT: movaps (%rdi), %xmm1 # sched: [3:1.00]
; SLM-NEXT: sqrtss %xmm0, %xmm0 # sched: [18:1.00]
; SLM-NEXT: sqrtss %xmm1, %xmm1 # sched: [18:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_sqrtss:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtss %xmm0, %xmm0, %xmm0 # sched: [114:1.00]
-; SANDY-NEXT: vmovaps (%rdi), %xmm1 # sched: [6:0.50]
-; SANDY-NEXT: vsqrtss %xmm1, %xmm1, %xmm1 # sched: [114:1.00]
+; SANDY-NEXT: vsqrtss %xmm0, %xmm0, %xmm0 # sched: [19:1.00]
+; SANDY-NEXT: vmovaps (%rdi), %xmm1 # sched: [4:0.50]
+; SANDY-NEXT: vsqrtss %xmm1, %xmm1, %xmm1 # sched: [19:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtss %xmm0, %xmm0, %xmm0 # sched: [19:1.00]
; HASWELL-NEXT: vmovaps (%rdi), %xmm1 # sched: [4:0.50]
; HASWELL-NEXT: vsqrtss %xmm1, %xmm1, %xmm1 # sched: [19:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovaps (%rdi), %xmm1 # sched: [5:1.00]
; BTVER2-NEXT: vsqrtss %xmm0, %xmm0, %xmm0 # sched: [26:21.00]
; BTVER2-NEXT: vsqrtss %xmm1, %xmm1, %xmm1 # sched: [26:21.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovaps (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vsqrtss %xmm0, %xmm0, %xmm0 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtss %xmm1, %xmm1, %xmm1 # sched: [27:1.00]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float> %a0)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float> %2)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float>) nounwind readnone
define i32 @test_stmxcsr() {
; GENERIC-LABEL: test_stmxcsr:
; GENERIC: # BB#0:
; GENERIC-NEXT: stmxcsr -{{[0-9]+}}(%rsp)
; GENERIC-NEXT: movl -{{[0-9]+}}(%rsp), %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_stmxcsr:
; ATOM: # BB#0:
; ATOM-NEXT: stmxcsr -{{[0-9]+}}(%rsp)
; ATOM-NEXT: movl -{{[0-9]+}}(%rsp), %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_stmxcsr:
; SLM: # BB#0:
; SLM-NEXT: stmxcsr -{{[0-9]+}}(%rsp) # sched: [1:1.00]
; SLM-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_stmxcsr:
; SANDY: # BB#0:
-; SANDY-NEXT: vstmxcsr -{{[0-9]+}}(%rsp) # sched: [5:1.00]
-; SANDY-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [5:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vstmxcsr -{{[0-9]+}}(%rsp) # sched: [1:1.00]
+; SANDY-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_stmxcsr:
; HASWELL: # BB#0:
; HASWELL-NEXT: vstmxcsr -{{[0-9]+}}(%rsp) # sched: [7:1.00]
; HASWELL-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_stmxcsr:
; BTVER2: # BB#0:
; BTVER2-NEXT: vstmxcsr -{{[0-9]+}}(%rsp) # sched: [1:1.00]
; BTVER2-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_stmxcsr:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vstmxcsr -{{[0-9]+}}(%rsp) # sched: [1:0.50]
; ZNVER1-NEXT: movl -{{[0-9]+}}(%rsp), %eax # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = alloca i32, align 4
%2 = bitcast i32* %1 to i8*
call void @llvm.x86.sse.stmxcsr(i8* %2)
%3 = load i32, i32* %1, align 4
ret i32 %3
}
declare void @llvm.x86.sse.stmxcsr(i8*) nounwind readnone
define <4 x float> @test_subps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_subps:
; GENERIC: # BB#0:
; GENERIC-NEXT: subps %xmm1, %xmm0
; GENERIC-NEXT: subps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_subps:
; ATOM: # BB#0:
; ATOM-NEXT: subps %xmm1, %xmm0
; ATOM-NEXT: subps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_subps:
; SLM: # BB#0:
; SLM-NEXT: subps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: subps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_subps:
; SANDY: # BB#0:
; SANDY-NEXT: vsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vsubps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub <4 x float> %a0, %a1
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = fsub <4 x float> %1, %2
ret <4 x float> %3
}
define float @test_subss(float %a0, float %a1, float *%a2) {
; GENERIC-LABEL: test_subss:
; GENERIC: # BB#0:
; GENERIC-NEXT: subss %xmm1, %xmm0
; GENERIC-NEXT: subss (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_subss:
; ATOM: # BB#0:
; ATOM-NEXT: subss %xmm1, %xmm0
; ATOM-NEXT: subss (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_subss:
; SLM: # BB#0:
; SLM-NEXT: subss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: subss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_subss:
; SANDY: # BB#0:
; SANDY-NEXT: vsubss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubss (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubss (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vsubss (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubss (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub float %a0, %a1
%2 = load float, float *%a2, align 4
%3 = fsub float %1, %2
ret float %3
}
define i32 @test_ucomiss(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_ucomiss:
; GENERIC: # BB#0:
; GENERIC-NEXT: ucomiss %xmm1, %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %cl
; GENERIC-NEXT: andb %al, %cl
; GENERIC-NEXT: ucomiss (%rdi), %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %dl
; GENERIC-NEXT: andb %al, %dl
; GENERIC-NEXT: orb %cl, %dl
; GENERIC-NEXT: movzbl %dl, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_ucomiss:
; ATOM: # BB#0:
; ATOM-NEXT: ucomiss %xmm1, %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %cl
; ATOM-NEXT: andb %al, %cl
; ATOM-NEXT: ucomiss (%rdi), %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %dl
; ATOM-NEXT: andb %al, %dl
; ATOM-NEXT: orb %cl, %dl
; ATOM-NEXT: movzbl %dl, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_ucomiss:
; SLM: # BB#0:
; SLM-NEXT: ucomiss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %cl # sched: [1:0.50]
; SLM-NEXT: andb %al, %cl # sched: [1:0.50]
; SLM-NEXT: ucomiss (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %dl # sched: [1:0.50]
; SLM-NEXT: andb %al, %dl # sched: [1:0.50]
; SLM-NEXT: orb %cl, %dl # sched: [1:0.50]
; SLM-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_ucomiss:
; SANDY: # BB#0:
; SANDY-NEXT: vucomiss %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %cl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %cl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %cl # sched: [1:0.33]
; SANDY-NEXT: vucomiss (%rdi), %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %dl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %dl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %dl # sched: [1:0.33]
; SANDY-NEXT: orb %cl, %dl # sched: [1:0.33]
; SANDY-NEXT: movzbl %dl, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_ucomiss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vucomiss %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %cl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %cl # sched: [1:0.25]
; HASWELL-NEXT: vucomiss (%rdi), %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %dl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %dl # sched: [1:0.25]
; HASWELL-NEXT: orb %cl, %dl # sched: [1:0.25]
; HASWELL-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_ucomiss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vucomiss %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %cl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %cl # sched: [1:0.50]
; BTVER2-NEXT: vucomiss (%rdi), %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %dl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %dl # sched: [1:0.50]
; BTVER2-NEXT: orb %cl, %dl # sched: [1:0.50]
; BTVER2-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_ucomiss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vucomiss %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %cl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %cl # sched: [1:0.25]
; ZNVER1-NEXT: vucomiss (%rdi), %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %dl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %dl # sched: [1:0.25]
; ZNVER1-NEXT: orb %cl, %dl # sched: [1:0.25]
; ZNVER1-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse.ucomieq.ss(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 4
%3 = call i32 @llvm.x86.sse.ucomieq.ss(<4 x float> %a0, <4 x float> %2)
%4 = or i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse.ucomieq.ss(<4 x float>, <4 x float>) nounwind readnone
define <4 x float> @test_unpckhps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_unpckhps:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; GENERIC-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_unpckhps:
; ATOM: # BB#0:
; ATOM-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; ATOM-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_unpckhps:
; SLM: # BB#0:
; SLM-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; SLM-NEXT: unpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_unpckhps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
-; SANDY-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpckhps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; HASWELL-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpckhps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
; BTVER2-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpckhps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
; ZNVER1-NEXT: vunpckhps {{.*#+}} xmm0 = xmm0[2],mem[2],xmm0[3],mem[3] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = shufflevector <4 x float> %1, <4 x float> %2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
ret <4 x float> %3
}
define <4 x float> @test_unpcklps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_unpcklps:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; GENERIC-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_unpcklps:
; ATOM: # BB#0:
; ATOM-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; ATOM-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_unpcklps:
; SLM: # BB#0:
; SLM-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:1.00]
; SLM-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_unpcklps:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:1.00]
-; SANDY-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpcklps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:1.00]
; HASWELL-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpcklps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:0.50]
; BTVER2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpcklps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:0.50]
; ZNVER1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = shufflevector <4 x float> %1, <4 x float> %2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
ret <4 x float> %3
}
define <4 x float> @test_xorps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_xorps:
; GENERIC: # BB#0:
; GENERIC-NEXT: xorps %xmm1, %xmm0
; GENERIC-NEXT: xorps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_xorps:
; ATOM: # BB#0:
; ATOM-NEXT: xorps %xmm1, %xmm0
; ATOM-NEXT: xorps (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_xorps:
; SLM: # BB#0:
; SLM-NEXT: xorps %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: xorps (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_xorps:
; SANDY: # BB#0:
-; SANDY-NEXT: vxorps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vxorps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vxorps %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vxorps (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_xorps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vxorps %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vxorps (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_xorps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vxorps %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vxorps (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_xorps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vxorps %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vxorps (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <4 x float> %a0 to <4 x i32>
%2 = bitcast <4 x float> %a1 to <4 x i32>
%3 = xor <4 x i32> %1, %2
%4 = load <4 x float>, <4 x float> *%a2, align 16
%5 = bitcast <4 x float> %4 to <4 x i32>
%6 = xor <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <4 x float>
ret <4 x float> %7
}
!0 = !{i32 1}
diff --git a/test/CodeGen/X86/sse2-schedule.ll b/test/CodeGen/X86/sse2-schedule.ll
index 6ee908e0c787..62c194f2fc4b 100644
--- a/test/CodeGen/X86/sse2-schedule.ll
+++ b/test/CodeGen/X86/sse2-schedule.ll
@@ -1,6907 +1,6907 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
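; NOTE: Each "sched: [N:R.RR]" comment emitted by -print-schedule pairs the
; scheduling model's latency N with its reciprocal throughput R.RR, so the
; SANDY-prefixed changes in this patch track an updated SandyBridge model
; rather than hand-edited numbers. Assuming an llc binary is on PATH, the
; assertions can be regenerated with:
;   utils/update_llc_test_checks.py test/CodeGen/X86/sse2-schedule.ll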
define <2 x double> @test_addpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_addpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: addpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addpd:
; ATOM: # BB#0:
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: addpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addpd:
; SLM: # BB#0:
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <2 x double> %a0, %a1
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fadd <2 x double> %1, %2
ret <2 x double> %3
}
define double @test_addsd(double %a0, double %a1, double *%a2) {
; GENERIC-LABEL: test_addsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: addsd %xmm1, %xmm0
; GENERIC-NEXT: addsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addsd:
; ATOM: # BB#0:
; ATOM-NEXT: addsd %xmm1, %xmm0
; ATOM-NEXT: addsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addsd:
; SLM: # BB#0:
; SLM-NEXT: addsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addsd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addsd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddsd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddsd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddsd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd double %a0, %a1
%2 = load double, double *%a2, align 8
%3 = fadd double %1, %2
ret double %3
}
define <2 x double> @test_andpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_andpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: andpd %xmm1, %xmm0
; GENERIC-NEXT: andpd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_andpd:
; ATOM: # BB#0:
; ATOM-NEXT: andpd %xmm1, %xmm0
; ATOM-NEXT: andpd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_andpd:
; SLM: # BB#0:
; SLM-NEXT: andpd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: andpd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_andpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vandpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vandpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vandpd %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vandpd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vandpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vandpd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandpd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandpd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <2 x double> %a0 to <4 x i32>
%2 = bitcast <2 x double> %a1 to <4 x i32>
%3 = and <4 x i32> %1, %2
%4 = load <2 x double>, <2 x double> *%a2, align 16
%5 = bitcast <2 x double> %4 to <4 x i32>
%6 = and <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <2 x double>
%8 = fadd <2 x double> %a1, %7
ret <2 x double> %8
}
define <2 x double> @test_andnotpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_andnotpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: andnpd %xmm1, %xmm0
; GENERIC-NEXT: andnpd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_andnotpd:
; ATOM: # BB#0:
; ATOM-NEXT: andnpd %xmm1, %xmm0
; ATOM-NEXT: andnpd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_andnotpd:
; SLM: # BB#0:
; SLM-NEXT: andnpd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: andnpd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_andnotpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vandnpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vandnpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vandnpd %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vandnpd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_andnotpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vandnpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vandnpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_andnotpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vandnpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vandnpd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_andnotpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vandnpd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vandnpd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <2 x double> %a0 to <4 x i32>
%2 = bitcast <2 x double> %a1 to <4 x i32>
%3 = xor <4 x i32> %1, <i32 -1, i32 -1, i32 -1, i32 -1>
%4 = and <4 x i32> %3, %2
%5 = load <2 x double>, <2 x double> *%a2, align 16
%6 = bitcast <2 x double> %5 to <4 x i32>
%7 = xor <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
%8 = and <4 x i32> %6, %7
%9 = bitcast <4 x i32> %8 to <2 x double>
%10 = fadd <2 x double> %a1, %9
ret <2 x double> %10
}
define <2 x double> @test_cmppd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_cmppd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cmpeqpd %xmm0, %xmm1
; GENERIC-NEXT: cmpeqpd (%rdi), %xmm0
; GENERIC-NEXT: orpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cmppd:
; ATOM: # BB#0:
; ATOM-NEXT: cmpeqpd %xmm0, %xmm1
; ATOM-NEXT: cmpeqpd (%rdi), %xmm0
; ATOM-NEXT: orpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cmppd:
; SLM: # BB#0:
; SLM-NEXT: cmpeqpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: cmpeqpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: orpd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cmppd:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqpd %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
-; SANDY-NEXT: vcmpeqpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: vorpd %xmm0, %xmm1, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vcmpeqpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vorpd %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmppd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqpd %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: vorpd %xmm0, %xmm1, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmppd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqpd %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: vorpd %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmppd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqpd %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: vorpd %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fcmp oeq <2 x double> %a0, %a1
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fcmp oeq <2 x double> %a0, %2
%4 = or <2 x i1> %1, %3
%5 = sext <2 x i1> %4 to <2 x i64>
%6 = bitcast <2 x i64> %5 to <2 x double>
ret <2 x double> %6
}
define double @test_cmpsd(double %a0, double %a1, double *%a2) {
; GENERIC-LABEL: test_cmpsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cmpeqsd %xmm1, %xmm0
; GENERIC-NEXT: cmpeqsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cmpsd:
; ATOM: # BB#0:
; ATOM-NEXT: cmpeqsd %xmm1, %xmm0
; ATOM-NEXT: cmpeqsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cmpsd:
; SLM: # BB#0:
; SLM-NEXT: cmpeqsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: cmpeqsd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cmpsd:
; SANDY: # BB#0:
; SANDY-NEXT: vcmpeqsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vcmpeqsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cmpsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcmpeqsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vcmpeqsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cmpsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcmpeqsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcmpeqsd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cmpsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcmpeqsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vcmpeqsd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <2 x double> undef, double %a0, i32 0
%2 = insertelement <2 x double> undef, double %a1, i32 0
%3 = call <2 x double> @llvm.x86.sse2.cmp.sd(<2 x double> %1, <2 x double> %2, i8 0)
%4 = load double, double *%a2, align 8
%5 = insertelement <2 x double> undef, double %4, i32 0
%6 = call <2 x double> @llvm.x86.sse2.cmp.sd(<2 x double> %3, <2 x double> %5, i8 0)
%7 = extractelement <2 x double> %6, i32 0
ret double %7
}
declare <2 x double> @llvm.x86.sse2.cmp.sd(<2 x double>, <2 x double>, i8) nounwind readnone
define i32 @test_comisd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_comisd:
; GENERIC: # BB#0:
; GENERIC-NEXT: comisd %xmm1, %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %cl
; GENERIC-NEXT: andb %al, %cl
; GENERIC-NEXT: comisd (%rdi), %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %dl
; GENERIC-NEXT: andb %al, %dl
; GENERIC-NEXT: orb %cl, %dl
; GENERIC-NEXT: movzbl %dl, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_comisd:
; ATOM: # BB#0:
; ATOM-NEXT: comisd %xmm1, %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %cl
; ATOM-NEXT: andb %al, %cl
; ATOM-NEXT: comisd (%rdi), %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %dl
; ATOM-NEXT: andb %al, %dl
; ATOM-NEXT: orb %cl, %dl
; ATOM-NEXT: movzbl %dl, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_comisd:
; SLM: # BB#0:
; SLM-NEXT: comisd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %cl # sched: [1:0.50]
; SLM-NEXT: andb %al, %cl # sched: [1:0.50]
; SLM-NEXT: comisd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %dl # sched: [1:0.50]
; SLM-NEXT: andb %al, %dl # sched: [1:0.50]
; SLM-NEXT: orb %cl, %dl # sched: [1:0.50]
; SLM-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_comisd:
; SANDY: # BB#0:
; SANDY-NEXT: vcomisd %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %cl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %cl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %cl # sched: [1:0.33]
; SANDY-NEXT: vcomisd (%rdi), %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %dl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %dl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %dl # sched: [1:0.33]
; SANDY-NEXT: orb %cl, %dl # sched: [1:0.33]
; SANDY-NEXT: movzbl %dl, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_comisd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcomisd %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %cl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %cl # sched: [1:0.25]
; HASWELL-NEXT: vcomisd (%rdi), %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %dl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %dl # sched: [1:0.25]
; HASWELL-NEXT: orb %cl, %dl # sched: [1:0.25]
; HASWELL-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_comisd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcomisd %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %cl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %cl # sched: [1:0.50]
; BTVER2-NEXT: vcomisd (%rdi), %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %dl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %dl # sched: [1:0.50]
; BTVER2-NEXT: orb %cl, %dl # sched: [1:0.50]
; BTVER2-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_comisd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcomisd %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %cl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %cl # sched: [1:0.25]
; ZNVER1-NEXT: vcomisd (%rdi), %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %dl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %dl # sched: [1:0.25]
; ZNVER1-NEXT: orb %cl, %dl # sched: [1:0.25]
; ZNVER1-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse2.comieq.sd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 8
%3 = call i32 @llvm.x86.sse2.comieq.sd(<2 x double> %a0, <2 x double> %2)
%4 = or i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse2.comieq.sd(<2 x double>, <2 x double>) nounwind readnone
define <2 x double> @test_cvtdq2pd(<4 x i32> %a0, <4 x i32> *%a1) {
; GENERIC-LABEL: test_cvtdq2pd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtdq2pd %xmm0, %xmm1
; GENERIC-NEXT: cvtdq2pd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtdq2pd:
; ATOM: # BB#0:
; ATOM-NEXT: cvtdq2pd %xmm0, %xmm1
; ATOM-NEXT: cvtdq2pd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtdq2pd:
; SLM: # BB#0:
; SLM-NEXT: cvtdq2pd %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtdq2pd (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtdq2pd:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtdq2pd %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtdq2pd (%rdi), %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtdq2pd (%rdi), %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtdq2pd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtdq2pd %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtdq2pd (%rdi), %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtdq2pd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtdq2pd (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtdq2pd %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtdq2pd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtdq2pd (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtdq2pd %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%2 = sitofp <2 x i32> %1 to <2 x double>
%3 = load <4 x i32>, <4 x i32>*%a1, align 16
%4 = shufflevector <4 x i32> %3, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%5 = sitofp <2 x i32> %4 to <2 x double>
%6 = fadd <2 x double> %2, %5
ret <2 x double> %6
}
define <4 x float> @test_cvtdq2ps(<4 x i32> %a0, <4 x i32> *%a1) {
; GENERIC-LABEL: test_cvtdq2ps:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtdq2ps %xmm0, %xmm1
; GENERIC-NEXT: cvtdq2ps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtdq2ps:
; ATOM: # BB#0:
; ATOM-NEXT: cvtdq2ps (%rdi), %xmm1
; ATOM-NEXT: cvtdq2ps %xmm0, %xmm0
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtdq2ps:
; SLM: # BB#0:
; SLM-NEXT: cvtdq2ps %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtdq2ps (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtdq2ps:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtdq2ps %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vcvtdq2ps (%rdi), %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vcvtdq2ps %xmm0, %xmm0 # sched: [4:1.00]
+; SANDY-NEXT: vcvtdq2ps (%rdi), %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtdq2ps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtdq2ps %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtdq2ps (%rdi), %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtdq2ps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtdq2ps (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtdq2ps %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtdq2ps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtdq2ps (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtdq2ps %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp <4 x i32> %a0 to <4 x float>
%2 = load <4 x i32>, <4 x i32>*%a1, align 16
%3 = sitofp <4 x i32> %2 to <4 x float>
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
define <4 x i32> @test_cvtpd2dq(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_cvtpd2dq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtpd2dq %xmm0, %xmm1
; GENERIC-NEXT: cvtpd2dq (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtpd2dq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtpd2dq (%rdi), %xmm1
; ATOM-NEXT: cvtpd2dq %xmm0, %xmm0
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtpd2dq:
; SLM: # BB#0:
; SLM-NEXT: cvtpd2dq %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtpd2dq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtpd2dq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtpd2dq %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtpd2dqx (%rdi), %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtpd2dq %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vcvtpd2dqx (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtpd2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtpd2dq %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtpd2dqx (%rdi), %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtpd2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtpd2dqx (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtpd2dq %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtpd2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtpd2dqx (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtpd2dq %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.cvtpd2dq(<2 x double> %a0)
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = call <4 x i32> @llvm.x86.sse2.cvtpd2dq(<2 x double> %2)
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.cvtpd2dq(<2 x double>) nounwind readnone
define <4 x float> @test_cvtpd2ps(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_cvtpd2ps:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtpd2ps %xmm0, %xmm1
; GENERIC-NEXT: cvtpd2ps (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtpd2ps:
; ATOM: # BB#0:
; ATOM-NEXT: cvtpd2ps (%rdi), %xmm1
; ATOM-NEXT: cvtpd2ps %xmm0, %xmm0
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtpd2ps:
; SLM: # BB#0:
; SLM-NEXT: cvtpd2ps %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtpd2ps (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtpd2ps:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtpd2ps %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtpd2psx (%rdi), %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvtpd2ps %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vcvtpd2psx (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtpd2ps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtpd2ps %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtpd2psx (%rdi), %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtpd2ps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtpd2psx (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtpd2ps %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtpd2ps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtpd2psx (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtpd2ps %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse2.cvtpd2ps(<2 x double> %a0)
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse2.cvtpd2ps(<2 x double> %2)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse2.cvtpd2ps(<2 x double>) nounwind readnone
define <4 x i32> @test_cvtps2dq(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_cvtps2dq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtps2dq %xmm0, %xmm1
; GENERIC-NEXT: cvtps2dq (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtps2dq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtps2dq (%rdi), %xmm1
; ATOM-NEXT: cvtps2dq %xmm0, %xmm0
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtps2dq:
; SLM: # BB#0:
; SLM-NEXT: cvtps2dq %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtps2dq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtps2dq:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtps2dq %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vcvtps2dq (%rdi), %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vcvtps2dq (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtps2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtps2dq %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vcvtps2dq (%rdi), %xmm1 # sched: [7:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtps2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtps2dq (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtps2dq %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtps2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtps2dq (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtps2dq %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.cvtps2dq(<4 x float> %a0)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x i32> @llvm.x86.sse2.cvtps2dq(<4 x float> %2)
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.cvtps2dq(<4 x float>) nounwind readnone
define <2 x double> @test_cvtps2pd(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_cvtps2pd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtps2pd %xmm0, %xmm1
; GENERIC-NEXT: cvtps2pd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtps2pd:
; ATOM: # BB#0:
; ATOM-NEXT: cvtps2pd (%rdi), %xmm1
; ATOM-NEXT: cvtps2pd %xmm0, %xmm0
; ATOM-NEXT: addpd %xmm0, %xmm1
; ATOM-NEXT: movapd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtps2pd:
; SLM: # BB#0:
; SLM-NEXT: cvtps2pd %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvtps2pd (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtps2pd:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtps2pd %xmm0, %xmm0 # sched: [2:1.00]
+; SANDY-NEXT: vcvtps2pd %xmm0, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vcvtps2pd (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtps2pd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtps2pd %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vcvtps2pd (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtps2pd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtps2pd (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvtps2pd %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtps2pd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtps2pd (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvtps2pd %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%2 = fpext <2 x float> %1 to <2 x double>
%3 = load <4 x float>, <4 x float> *%a1, align 16
%4 = shufflevector <4 x float> %3, <4 x float> undef, <2 x i32> <i32 0, i32 1>
%5 = fpext <2 x float> %4 to <2 x double>
%6 = fadd <2 x double> %2, %5
ret <2 x double> %6
}
define i32 @test_cvtsd2si(double %a0, double *%a1) {
; GENERIC-LABEL: test_cvtsd2si:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsd2si %xmm0, %ecx
; GENERIC-NEXT: cvtsd2si (%rdi), %eax
; GENERIC-NEXT: addl %ecx, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsd2si:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsd2si (%rdi), %eax
; ATOM-NEXT: cvtsd2si %xmm0, %ecx
; ATOM-NEXT: addl %ecx, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsd2si:
; SLM: # BB#0:
; SLM-NEXT: cvtsd2si (%rdi), %eax # sched: [7:1.00]
; SLM-NEXT: cvtsd2si %xmm0, %ecx # sched: [4:0.50]
; SLM-NEXT: addl %ecx, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsd2si:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtsd2si %xmm0, %ecx # sched: [3:1.00]
; SANDY-NEXT: vcvtsd2si (%rdi), %eax # sched: [7:1.00]
; SANDY-NEXT: addl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsd2si:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsd2si %xmm0, %ecx # sched: [4:1.00]
; HASWELL-NEXT: vcvtsd2si (%rdi), %eax # sched: [8:1.00]
; HASWELL-NEXT: addl %ecx, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsd2si:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsd2si (%rdi), %eax # sched: [8:1.00]
; BTVER2-NEXT: vcvtsd2si %xmm0, %ecx # sched: [3:1.00]
; BTVER2-NEXT: addl %ecx, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsd2si:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsd2si (%rdi), %eax # sched: [12:1.00]
; ZNVER1-NEXT: vcvtsd2si %xmm0, %ecx # sched: [5:1.00]
; ZNVER1-NEXT: addl %ecx, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <2 x double> undef, double %a0, i32 0
%2 = call i32 @llvm.x86.sse2.cvtsd2si(<2 x double> %1)
%3 = load double, double *%a1, align 8
%4 = insertelement <2 x double> undef, double %3, i32 0
%5 = call i32 @llvm.x86.sse2.cvtsd2si(<2 x double> %4)
%6 = add i32 %2, %5
ret i32 %6
}
declare i32 @llvm.x86.sse2.cvtsd2si(<2 x double>) nounwind readnone
define i64 @test_cvtsd2siq(double %a0, double *%a1) {
; GENERIC-LABEL: test_cvtsd2siq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsd2si %xmm0, %rcx
; GENERIC-NEXT: cvtsd2si (%rdi), %rax
; GENERIC-NEXT: addq %rcx, %rax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsd2siq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsd2si (%rdi), %rax
; ATOM-NEXT: cvtsd2si %xmm0, %rcx
; ATOM-NEXT: addq %rcx, %rax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsd2siq:
; SLM: # BB#0:
; SLM-NEXT: cvtsd2si (%rdi), %rax # sched: [7:1.00]
; SLM-NEXT: cvtsd2si %xmm0, %rcx # sched: [4:0.50]
; SLM-NEXT: addq %rcx, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsd2siq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtsd2si %xmm0, %rcx # sched: [5:1.00]
-; SANDY-NEXT: vcvtsd2si (%rdi), %rax # sched: [10:1.00]
+; SANDY-NEXT: vcvtsd2si %xmm0, %rcx # sched: [3:1.00]
+; SANDY-NEXT: vcvtsd2si (%rdi), %rax # sched: [7:1.00]
; SANDY-NEXT: addq %rcx, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsd2siq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsd2si %xmm0, %rcx # sched: [4:1.00]
; HASWELL-NEXT: vcvtsd2si (%rdi), %rax # sched: [8:1.00]
; HASWELL-NEXT: addq %rcx, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsd2siq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsd2si (%rdi), %rax # sched: [8:1.00]
; BTVER2-NEXT: vcvtsd2si %xmm0, %rcx # sched: [3:1.00]
; BTVER2-NEXT: addq %rcx, %rax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsd2siq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsd2si (%rdi), %rax # sched: [12:1.00]
; ZNVER1-NEXT: vcvtsd2si %xmm0, %rcx # sched: [5:1.00]
; ZNVER1-NEXT: addq %rcx, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <2 x double> undef, double %a0, i32 0
%2 = call i64 @llvm.x86.sse2.cvtsd2si64(<2 x double> %1)
%3 = load double, double *%a1, align 8
%4 = insertelement <2 x double> undef, double %3, i32 0
%5 = call i64 @llvm.x86.sse2.cvtsd2si64(<2 x double> %4)
%6 = add i64 %2, %5
ret i64 %6
}
declare i64 @llvm.x86.sse2.cvtsd2si64(<2 x double>) nounwind readnone
define float @test_cvtsd2ss(double %a0, double *%a1) {
; GENERIC-LABEL: test_cvtsd2ss:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsd2ss %xmm0, %xmm1
; GENERIC-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
; GENERIC-NEXT: cvtsd2ss %xmm0, %xmm0
; GENERIC-NEXT: addss %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsd2ss:
; ATOM: # BB#0:
; ATOM-NEXT: movsd {{.*#+}} xmm1 = mem[0],zero
; ATOM-NEXT: cvtsd2ss %xmm0, %xmm2
; ATOM-NEXT: xorps %xmm0, %xmm0
; ATOM-NEXT: cvtsd2ss %xmm1, %xmm0
; ATOM-NEXT: addss %xmm2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsd2ss:
; SLM: # BB#0:
; SLM-NEXT: cvtsd2ss %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero sched: [3:1.00]
; SLM-NEXT: cvtsd2ss %xmm0, %xmm0 # sched: [4:0.50]
; SLM-NEXT: addss %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsd2ss:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtsd2ss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero sched: [6:0.50]
+; SANDY-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero sched: [4:0.50]
; SANDY-NEXT: vcvtsd2ss %xmm1, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsd2ss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsd2ss %xmm0, %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero sched: [4:0.50]
; HASWELL-NEXT: vcvtsd2ss %xmm1, %xmm1, %xmm1 # sched: [4:1.00]
; HASWELL-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsd2ss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero sched: [5:1.00]
; BTVER2-NEXT: vcvtsd2ss %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtsd2ss %xmm1, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsd2ss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero sched: [8:0.50]
; ZNVER1-NEXT: vcvtsd2ss %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtsd2ss %xmm1, %xmm1, %xmm1 # sched: [5:1.00]
; ZNVER1-NEXT: vaddss %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptrunc double %a0 to float
%2 = load double, double *%a1, align 8
%3 = fptrunc double %2 to float
%4 = fadd float %1, %3
ret float %4
}
define double @test_cvtsi2sd(i32 %a0, i32 *%a1) {
; GENERIC-LABEL: test_cvtsi2sd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsi2sdl %edi, %xmm1
; GENERIC-NEXT: cvtsi2sdl (%rsi), %xmm0
; GENERIC-NEXT: addsd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsi2sd:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsi2sdl (%rsi), %xmm0
; ATOM-NEXT: cvtsi2sdl %edi, %xmm1
; ATOM-NEXT: addsd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsi2sd:
; SLM: # BB#0:
; SLM-NEXT: cvtsi2sdl (%rsi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: cvtsi2sdl %edi, %xmm1 # sched: [4:0.50]
; SLM-NEXT: addsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsi2sd:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtsi2sdl %edi, %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtsi2sdl (%rsi), %xmm1, %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vcvtsi2sdl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsi2sd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsi2sdl %edi, %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtsi2sdl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsi2sd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsi2sdl %edi, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtsi2sdl (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsi2sd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsi2sdl %edi, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtsi2sdl (%rsi), %xmm1, %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp i32 %a0 to double
%2 = load i32, i32 *%a1, align 8
%3 = sitofp i32 %2 to double
%4 = fadd double %1, %3
ret double %4
}
define double @test_cvtsi2sdq(i64 %a0, i64 *%a1) {
; GENERIC-LABEL: test_cvtsi2sdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtsi2sdq %rdi, %xmm1
; GENERIC-NEXT: cvtsi2sdq (%rsi), %xmm0
; GENERIC-NEXT: addsd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtsi2sdq:
; ATOM: # BB#0:
; ATOM-NEXT: cvtsi2sdq (%rsi), %xmm0
; ATOM-NEXT: cvtsi2sdq %rdi, %xmm1
; ATOM-NEXT: addsd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtsi2sdq:
; SLM: # BB#0:
; SLM-NEXT: cvtsi2sdq (%rsi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: cvtsi2sdq %rdi, %xmm1 # sched: [4:0.50]
; SLM-NEXT: addsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtsi2sdq:
; SANDY: # BB#0:
; SANDY-NEXT: vcvtsi2sdq %rdi, %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvtsi2sdq (%rsi), %xmm1, %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vcvtsi2sdq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; SANDY-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtsi2sdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtsi2sdq %rdi, %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvtsi2sdq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtsi2sdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvtsi2sdq %rdi, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtsi2sdq (%rsi), %xmm1, %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtsi2sdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvtsi2sdq %rdi, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtsi2sdq (%rsi), %xmm1, %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sitofp i64 %a0 to double
%2 = load i64, i64 *%a1, align 8
%3 = sitofp i64 %2 to double
%4 = fadd double %1, %3
ret double %4
}
; TODO - cvtss2sd_m
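; A hypothetical sketch of the missing cvtss2sd_m coverage (test name and
; shape assumed, not taken from this patch): loading the source operand
; from memory, which would exercise the folded-load form of cvtss2sd if
; the backend selects it:
;   define double @test_cvtss2sd_m(float *%a0) {
;     %1 = load float, float *%a0, align 4
;     %2 = fpext float %1 to double
;     ret double %2
;   }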
define double @test_cvtss2sd(float %a0, float *%a1) {
; GENERIC-LABEL: test_cvtss2sd:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvtss2sd %xmm0, %xmm1
; GENERIC-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; GENERIC-NEXT: cvtss2sd %xmm0, %xmm0
; GENERIC-NEXT: addsd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvtss2sd:
; ATOM: # BB#0:
; ATOM-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; ATOM-NEXT: cvtss2sd %xmm0, %xmm2
; ATOM-NEXT: xorps %xmm0, %xmm0
; ATOM-NEXT: cvtss2sd %xmm1, %xmm0
; ATOM-NEXT: addsd %xmm2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvtss2sd:
; SLM: # BB#0:
; SLM-NEXT: cvtss2sd %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero sched: [3:1.00]
; SLM-NEXT: cvtss2sd %xmm0, %xmm0 # sched: [4:0.50]
; SLM-NEXT: addsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvtss2sd:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvtss2sd %xmm0, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [6:0.50]
-; SANDY-NEXT: vcvtss2sd %xmm1, %xmm1, %xmm1 # sched: [1:1.00]
+; SANDY-NEXT: vcvtss2sd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
+; SANDY-NEXT: vcvtss2sd %xmm1, %xmm1, %xmm1 # sched: [3:1.00]
; SANDY-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvtss2sd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvtss2sd %xmm0, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vcvtss2sd %xmm1, %xmm1, %xmm1 # sched: [2:1.00]
; HASWELL-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvtss2sd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vcvtss2sd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vcvtss2sd %xmm1, %xmm1, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvtss2sd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vcvtss2sd %xmm0, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vcvtss2sd %xmm1, %xmm1, %xmm1 # sched: [5:1.00]
; ZNVER1-NEXT: vaddsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fpext float %a0 to double
%2 = load float, float *%a1, align 4
%3 = fpext float %2 to double
%4 = fadd double %1, %3
ret double %4
}
define <4 x i32> @test_cvttpd2dq(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_cvttpd2dq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttpd2dq %xmm0, %xmm1
; GENERIC-NEXT: cvttpd2dq (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttpd2dq:
; ATOM: # BB#0:
; ATOM-NEXT: cvttpd2dq (%rdi), %xmm1
; ATOM-NEXT: cvttpd2dq %xmm0, %xmm0
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttpd2dq:
; SLM: # BB#0:
; SLM-NEXT: cvttpd2dq %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvttpd2dq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttpd2dq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttpd2dq %xmm0, %xmm0 # sched: [4:1.00]
-; SANDY-NEXT: vcvttpd2dqx (%rdi), %xmm1 # sched: [10:1.00]
+; SANDY-NEXT: vcvttpd2dq %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vcvttpd2dqx (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttpd2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttpd2dq %xmm0, %xmm0 # sched: [4:1.00]
; HASWELL-NEXT: vcvttpd2dqx (%rdi), %xmm1 # sched: [8:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttpd2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttpd2dqx (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvttpd2dq %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttpd2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttpd2dqx (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvttpd2dq %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi <2 x double> %a0 to <2 x i32>
%2 = shufflevector <2 x i32> %1, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%3 = load <2 x double>, <2 x double> *%a1, align 16
%4 = fptosi <2 x double> %3 to <2 x i32>
%5 = shufflevector <2 x i32> %4, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%6 = add <4 x i32> %2, %5
ret <4 x i32> %6
}
define <4 x i32> @test_cvttps2dq(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_cvttps2dq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttps2dq %xmm0, %xmm1
; GENERIC-NEXT: cvttps2dq (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttps2dq:
; ATOM: # BB#0:
; ATOM-NEXT: cvttps2dq (%rdi), %xmm1
; ATOM-NEXT: cvttps2dq %xmm0, %xmm0
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttps2dq:
; SLM: # BB#0:
; SLM-NEXT: cvttps2dq %xmm0, %xmm1 # sched: [4:0.50]
; SLM-NEXT: cvttps2dq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttps2dq:
; SANDY: # BB#0:
; SANDY-NEXT: vcvttps2dq %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vcvttps2dq (%rdi), %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vcvttps2dq (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttps2dq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttps2dq %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vcvttps2dq (%rdi), %xmm1 # sched: [7:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttps2dq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttps2dq (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vcvttps2dq %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttps2dq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttps2dq (%rdi), %xmm1 # sched: [12:1.00]
; ZNVER1-NEXT: vcvttps2dq %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi <4 x float> %a0 to <4 x i32>
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = fptosi <4 x float> %2 to <4 x i32>
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
define i32 @test_cvttsd2si(double %a0, double *%a1) {
; GENERIC-LABEL: test_cvttsd2si:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttsd2si %xmm0, %ecx
; GENERIC-NEXT: cvttsd2si (%rdi), %eax
; GENERIC-NEXT: addl %ecx, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttsd2si:
; ATOM: # BB#0:
; ATOM-NEXT: cvttsd2si (%rdi), %eax
; ATOM-NEXT: cvttsd2si %xmm0, %ecx
; ATOM-NEXT: addl %ecx, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttsd2si:
; SLM: # BB#0:
; SLM-NEXT: cvttsd2si (%rdi), %eax # sched: [7:1.00]
; SLM-NEXT: cvttsd2si %xmm0, %ecx # sched: [4:0.50]
; SLM-NEXT: addl %ecx, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttsd2si:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttsd2si %xmm0, %ecx # sched: [5:1.00]
+; SANDY-NEXT: vcvttsd2si %xmm0, %ecx # sched: [3:1.00]
; SANDY-NEXT: vcvttsd2si (%rdi), %eax # sched: [7:1.00]
; SANDY-NEXT: addl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttsd2si:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttsd2si %xmm0, %ecx # sched: [4:1.00]
; HASWELL-NEXT: vcvttsd2si (%rdi), %eax # sched: [8:1.00]
; HASWELL-NEXT: addl %ecx, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttsd2si:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttsd2si (%rdi), %eax # sched: [8:1.00]
; BTVER2-NEXT: vcvttsd2si %xmm0, %ecx # sched: [3:1.00]
; BTVER2-NEXT: addl %ecx, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttsd2si:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttsd2si (%rdi), %eax # sched: [12:1.00]
; ZNVER1-NEXT: vcvttsd2si %xmm0, %ecx # sched: [5:1.00]
; ZNVER1-NEXT: addl %ecx, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi double %a0 to i32
%2 = load double, double *%a1, align 8
%3 = fptosi double %2 to i32
%4 = add i32 %1, %3
ret i32 %4
}
define i64 @test_cvttsd2siq(double %a0, double *%a1) {
; GENERIC-LABEL: test_cvttsd2siq:
; GENERIC: # BB#0:
; GENERIC-NEXT: cvttsd2si %xmm0, %rcx
; GENERIC-NEXT: cvttsd2si (%rdi), %rax
; GENERIC-NEXT: addq %rcx, %rax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_cvttsd2siq:
; ATOM: # BB#0:
; ATOM-NEXT: cvttsd2si (%rdi), %rax
; ATOM-NEXT: cvttsd2si %xmm0, %rcx
; ATOM-NEXT: addq %rcx, %rax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_cvttsd2siq:
; SLM: # BB#0:
; SLM-NEXT: cvttsd2si (%rdi), %rax # sched: [7:1.00]
; SLM-NEXT: cvttsd2si %xmm0, %rcx # sched: [4:0.50]
; SLM-NEXT: addq %rcx, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_cvttsd2siq:
; SANDY: # BB#0:
-; SANDY-NEXT: vcvttsd2si %xmm0, %rcx # sched: [5:1.00]
-; SANDY-NEXT: vcvttsd2si (%rdi), %rax # sched: [10:1.00]
+; SANDY-NEXT: vcvttsd2si %xmm0, %rcx # sched: [3:1.00]
+; SANDY-NEXT: vcvttsd2si (%rdi), %rax # sched: [7:1.00]
; SANDY-NEXT: addq %rcx, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_cvttsd2siq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vcvttsd2si %xmm0, %rcx # sched: [4:1.00]
; HASWELL-NEXT: vcvttsd2si (%rdi), %rax # sched: [8:1.00]
; HASWELL-NEXT: addq %rcx, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_cvttsd2siq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vcvttsd2si (%rdi), %rax # sched: [8:1.00]
; BTVER2-NEXT: vcvttsd2si %xmm0, %rcx # sched: [3:1.00]
; BTVER2-NEXT: addq %rcx, %rax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_cvttsd2siq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vcvttsd2si (%rdi), %rax # sched: [12:1.00]
; ZNVER1-NEXT: vcvttsd2si %xmm0, %rcx # sched: [5:1.00]
; ZNVER1-NEXT: addq %rcx, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fptosi double %a0 to i64
%2 = load double, double *%a1, align 8
%3 = fptosi double %2 to i64
%4 = add i64 %1, %3
ret i64 %4
}
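; divpd/divsd are among the slowest operations in this file; on SLM
; ([34:34.00]) and BTVER2 ([19:19.00]) the reciprocal throughput equals the
; latency, suggesting the divider on those models is effectively unpipelined.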
define <2 x double> @test_divpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_divpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: divpd %xmm1, %xmm0
; GENERIC-NEXT: divpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_divpd:
; ATOM: # BB#0:
; ATOM-NEXT: divpd %xmm1, %xmm0
; ATOM-NEXT: divpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_divpd:
; SLM: # BB#0:
; SLM-NEXT: divpd %xmm1, %xmm0 # sched: [34:34.00]
; SLM-NEXT: divpd (%rdi), %xmm0 # sched: [37:34.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_divpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivpd %xmm1, %xmm0, %xmm0 # sched: [22:1.00]
-; SANDY-NEXT: vdivpd (%rdi), %xmm0, %xmm0 # sched: [28:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivpd %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivpd (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivpd %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: vdivpd (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivpd %xmm1, %xmm0, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: vdivpd (%rdi), %xmm0, %xmm0 # sched: [24:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivpd %xmm1, %xmm0, %xmm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivpd (%rdi), %xmm0, %xmm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv <2 x double> %a0, %a1
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fdiv <2 x double> %1, %2
ret <2 x double> %3
}
define double @test_divsd(double %a0, double %a1, double *%a2) {
; GENERIC-LABEL: test_divsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: divsd %xmm1, %xmm0
; GENERIC-NEXT: divsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_divsd:
; ATOM: # BB#0:
; ATOM-NEXT: divsd %xmm1, %xmm0
; ATOM-NEXT: divsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_divsd:
; SLM: # BB#0:
; SLM-NEXT: divsd %xmm1, %xmm0 # sched: [34:34.00]
; SLM-NEXT: divsd (%rdi), %xmm0 # sched: [37:34.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_divsd:
; SANDY: # BB#0:
-; SANDY-NEXT: vdivsd %xmm1, %xmm0, %xmm0 # sched: [22:1.00]
-; SANDY-NEXT: vdivsd (%rdi), %xmm0, %xmm0 # sched: [28:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdivsd %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
+; SANDY-NEXT: vdivsd (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_divsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdivsd %xmm1, %xmm0, %xmm0 # sched: [12:1.00]
; HASWELL-NEXT: vdivsd (%rdi), %xmm0, %xmm0 # sched: [16:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_divsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdivsd %xmm1, %xmm0, %xmm0 # sched: [19:19.00]
; BTVER2-NEXT: vdivsd (%rdi), %xmm0, %xmm0 # sched: [24:19.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_divsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdivsd %xmm1, %xmm0, %xmm0 # sched: [15:1.00]
; ZNVER1-NEXT: vdivsd (%rdi), %xmm0, %xmm0 # sched: [22:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fdiv double %a0, %a1
%2 = load double, double *%a2, align 8
%3 = fdiv double %1, %2
ret double %3
}
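; lfence and mfence serialize loads and all memory accesses respectively;
; note the ATOM checks pad the sequence with nops, as that model's
; scheduler output does throughout this file.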
define void @test_lfence() {
; GENERIC-LABEL: test_lfence:
; GENERIC: # BB#0:
; GENERIC-NEXT: lfence
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_lfence:
; ATOM: # BB#0:
; ATOM-NEXT: lfence
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_lfence:
; SLM: # BB#0:
; SLM-NEXT: lfence # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_lfence:
; SANDY: # BB#0:
; SANDY-NEXT: lfence # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_lfence:
; HASWELL: # BB#0:
; HASWELL-NEXT: lfence # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_lfence:
; BTVER2: # BB#0:
; BTVER2-NEXT: lfence # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_lfence:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: lfence # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.sse2.lfence()
ret void
}
declare void @llvm.x86.sse2.lfence() nounwind readnone
define void @test_mfence() {
; GENERIC-LABEL: test_mfence:
; GENERIC: # BB#0:
; GENERIC-NEXT: mfence
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_mfence:
; ATOM: # BB#0:
; ATOM-NEXT: mfence
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_mfence:
; SLM: # BB#0:
; SLM-NEXT: mfence # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mfence:
; SANDY: # BB#0:
; SANDY-NEXT: mfence # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mfence:
; HASWELL: # BB#0:
; HASWELL-NEXT: mfence # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mfence:
; BTVER2: # BB#0:
; BTVER2-NEXT: mfence # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mfence:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: mfence # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.sse2.mfence()
ret void
}
declare void @llvm.x86.sse2.mfence() nounwind readnone
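; maskmovdqu stores selected bytes of %xmm0, under the sign bits of the
; byte mask in %xmm1, to the address held implicitly in %rdi.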
define void @test_maskmovdqu(<16 x i8> %a0, <16 x i8> %a1, i8* %a2) {
; GENERIC-LABEL: test_maskmovdqu:
; GENERIC: # BB#0:
; GENERIC-NEXT: maskmovdqu %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_maskmovdqu:
; ATOM: # BB#0:
; ATOM-NEXT: maskmovdqu %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_maskmovdqu:
; SLM: # BB#0:
; SLM-NEXT: maskmovdqu %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_maskmovdqu:
; SANDY: # BB#0:
; SANDY-NEXT: vmaskmovdqu %xmm1, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maskmovdqu:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaskmovdqu %xmm1, %xmm0 # sched: [14:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maskmovdqu:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaskmovdqu %xmm1, %xmm0 # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maskmovdqu:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaskmovdqu %xmm1, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
call void @llvm.x86.sse2.maskmov.dqu(<16 x i8> %a0, <16 x i8> %a1, i8* %a2)
ret void
}
declare void @llvm.x86.sse2.maskmov.dqu(<16 x i8>, <16 x i8>, i8*) nounwind
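; The max/min tests below call the llvm.x86.sse2.{max,min}.{pd,sd}
; intrinsics rather than using IR fcmp/select, preserving the exact NaN
; and signed-zero behavior of the underlying instructions.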
define <2 x double> @test_maxpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_maxpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: maxpd %xmm1, %xmm0
; GENERIC-NEXT: maxpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_maxpd:
; ATOM: # BB#0:
; ATOM-NEXT: maxpd %xmm1, %xmm0
; ATOM-NEXT: maxpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_maxpd:
; SLM: # BB#0:
; SLM-NEXT: maxpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: maxpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_maxpd:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse2.max.pd(<2 x double>, <2 x double>) nounwind readnone
define <2 x double> @test_maxsd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_maxsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: maxsd %xmm1, %xmm0
; GENERIC-NEXT: maxsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_maxsd:
; ATOM: # BB#0:
; ATOM-NEXT: maxsd %xmm1, %xmm0
; ATOM-NEXT: maxsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_maxsd:
; SLM: # BB#0:
; SLM-NEXT: maxsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: maxsd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_maxsd:
; SANDY: # BB#0:
; SANDY-NEXT: vmaxsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmaxsd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmaxsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_maxsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmaxsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmaxsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_maxsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmaxsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmaxsd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_maxsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmaxsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmaxsd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.max.sd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse2.max.sd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse2.max.sd(<2 x double>, <2 x double>) nounwind readnone
define <2 x double> @test_minpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_minpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: minpd %xmm1, %xmm0
; GENERIC-NEXT: minpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_minpd:
; ATOM: # BB#0:
; ATOM-NEXT: minpd %xmm1, %xmm0
; ATOM-NEXT: minpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_minpd:
; SLM: # BB#0:
; SLM-NEXT: minpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: minpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_minpd:
; SANDY: # BB#0:
; SANDY-NEXT: vminpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vminpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vminpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vminpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vminpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse2.min.pd(<2 x double>, <2 x double>) nounwind readnone
define <2 x double> @test_minsd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_minsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: minsd %xmm1, %xmm0
; GENERIC-NEXT: minsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_minsd:
; ATOM: # BB#0:
; ATOM-NEXT: minsd %xmm1, %xmm0
; ATOM-NEXT: minsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_minsd:
; SLM: # BB#0:
; SLM-NEXT: minsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: minsd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_minsd:
; SANDY: # BB#0:
; SANDY-NEXT: vminsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vminsd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vminsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_minsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vminsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vminsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_minsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vminsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vminsd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_minsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vminsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vminsd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.min.sd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse2.min.sd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse2.min.sd(<2 x double>, <2 x double>) nounwind readnone
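; The move tests load a value, double it, and store it back so that both
; the load and store forms of each move appear. movapd/movdqa use align 16
; operands; movdqu/movupd further below use align 1 to force the unaligned
; encodings.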
define void @test_movapd(<2 x double> *%a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_movapd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movapd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm0, %xmm0
; GENERIC-NEXT: movapd %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movapd:
; ATOM: # BB#0:
; ATOM-NEXT: movapd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm0, %xmm0
; ATOM-NEXT: movapd %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movapd:
; SLM: # BB#0:
; SLM-NEXT: movapd (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: addpd %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movapd %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movapd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovapd (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovapd (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovapd %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovapd %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movapd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovapd (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovapd %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movapd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovapd (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovapd %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movapd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovapd (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovapd %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <2 x double>, <2 x double> *%a0, align 16
%2 = fadd <2 x double> %1, %1
store <2 x double> %2, <2 x double> *%a1, align 16
ret void
}
define void @test_movdqa(<2 x i64> *%a0, <2 x i64> *%a1) {
; GENERIC-LABEL: test_movdqa:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqa (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm0, %xmm0
; GENERIC-NEXT: movdqa %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movdqa:
; ATOM: # BB#0:
; ATOM-NEXT: movdqa (%rdi), %xmm0
; ATOM-NEXT: paddq %xmm0, %xmm0
; ATOM-NEXT: movdqa %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movdqa:
; SLM: # BB#0:
; SLM-NEXT: movdqa (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: paddq %xmm0, %xmm0 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movdqa:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovdqa (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovdqa (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovdqa %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovdqa %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movdqa:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovdqa (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovdqa %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movdqa:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovdqa (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovdqa %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movdqa:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovdqa (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovdqa %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <2 x i64>, <2 x i64> *%a0, align 16
%2 = add <2 x i64> %1, %1
store <2 x i64> %2, <2 x i64> *%a1, align 16
ret void
}
define void @test_movdqu(<2 x i64> *%a0, <2 x i64> *%a1) {
; GENERIC-LABEL: test_movdqu:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqu (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm0, %xmm0
; GENERIC-NEXT: movdqu %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movdqu:
; ATOM: # BB#0:
; ATOM-NEXT: movdqu (%rdi), %xmm0
; ATOM-NEXT: paddq %xmm0, %xmm0
; ATOM-NEXT: movdqu %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movdqu:
; SLM: # BB#0:
; SLM-NEXT: movdqu (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: paddq %xmm0, %xmm0 # sched: [1:0.50]
; SLM-NEXT: movdqu %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movdqu:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovdqu (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovdqu (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovdqu %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovdqu %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movdqu:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovdqu (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovdqu %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movdqu:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovdqu (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovdqu %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movdqu:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovdqu (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovdqu %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <2 x i64>, <2 x i64> *%a0, align 1
%2 = add <2 x i64> %1, %1
store <2 x i64> %2, <2 x i64> *%a1, align 1
ret void
}
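; movd/movq transfer scalars between GPRs and XMM registers; the load
; forms zero-fill the upper elements, as the mem[0],zero,... asm comments
; in the checks show.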
define i32 @test_movd(<4 x i32> %a0, i32 %a1, i32 *%a2) {
; GENERIC-LABEL: test_movd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movd %edi, %xmm1
; GENERIC-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero
; GENERIC-NEXT: paddd %xmm0, %xmm1
; GENERIC-NEXT: paddd %xmm0, %xmm2
; GENERIC-NEXT: movd %xmm2, %eax
; GENERIC-NEXT: movd %xmm1, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movd:
; ATOM: # BB#0:
; ATOM-NEXT: movd {{.*#+}} xmm1 = mem[0],zero,zero,zero
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movd %xmm1, %eax
; ATOM-NEXT: movd %edi, %xmm1
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movd %xmm1, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movd:
; SLM: # BB#0:
; SLM-NEXT: movd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [3:1.00]
; SLM-NEXT: movd %edi, %xmm1 # sched: [1:0.50]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movd %xmm1, (%rsi) # sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm2 # sched: [1:0.50]
; SLM-NEXT: movd %xmm2, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movd:
; SANDY: # BB#0:
; SANDY-NEXT: vmovd %edi, %xmm1 # sched: [1:0.33]
-; SANDY-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [6:0.50]
+; SANDY-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; SANDY-NEXT: vpaddd %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovd %xmm0, %eax # sched: [2:1.00]
-; SANDY-NEXT: vmovd %xmm1, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovd %xmm0, %eax # sched: [1:0.33]
+; SANDY-NEXT: vmovd %xmm1, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovd %edi, %xmm1 # sched: [1:1.00]
; HASWELL-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [4:0.50]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpaddd %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovd %xmm0, %eax # sched: [1:1.00]
; HASWELL-NEXT: vmovd %xmm1, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [5:1.00]
; BTVER2-NEXT: vmovd %edi, %xmm1 # sched: [1:0.17]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vmovd %xmm1, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: vpaddd %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovd %xmm0, %eax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovd {{.*#+}} xmm2 = mem[0],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vmovd %edi, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vmovd %xmm1, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: vpaddd %xmm2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovd %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x i32> undef, i32 %a1, i32 0
%2 = load i32, i32 *%a2
%3 = insertelement <4 x i32> undef, i32 %2, i32 0
%4 = add <4 x i32> %a0, %1
%5 = add <4 x i32> %a0, %3
%6 = extractelement <4 x i32> %4, i32 0
%7 = extractelement <4 x i32> %5, i32 0
store i32 %6, i32* %a2
ret i32 %7
}
define i64 @test_movd_64(<2 x i64> %a0, i64 %a1, i64 *%a2) {
; GENERIC-LABEL: test_movd_64:
; GENERIC: # BB#0:
; GENERIC-NEXT: movq %rdi, %xmm1
; GENERIC-NEXT: movq {{.*#+}} xmm2 = mem[0],zero
; GENERIC-NEXT: paddq %xmm0, %xmm1
; GENERIC-NEXT: paddq %xmm0, %xmm2
; GENERIC-NEXT: movq %xmm2, %rax
; GENERIC-NEXT: movq %xmm1, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movd_64:
; ATOM: # BB#0:
; ATOM-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; ATOM-NEXT: movq %rdi, %xmm2
; ATOM-NEXT: paddq %xmm0, %xmm2
; ATOM-NEXT: paddq %xmm0, %xmm1
; ATOM-NEXT: movq %xmm2, (%rsi)
; ATOM-NEXT: movq %xmm1, %rax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movd_64:
; SLM: # BB#0:
; SLM-NEXT: movq {{.*#+}} xmm2 = mem[0],zero sched: [3:1.00]
; SLM-NEXT: movq %rdi, %xmm1 # sched: [1:0.50]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movq %xmm1, (%rsi) # sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm2 # sched: [1:0.50]
; SLM-NEXT: movq %xmm2, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movd_64:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovq %rdi, %xmm1 # sched: [1:1.00]
-; SANDY-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero sched: [6:0.50]
+; SANDY-NEXT: vmovq %rdi, %xmm1 # sched: [1:0.33]
+; SANDY-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero sched: [4:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; SANDY-NEXT: vpaddq %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovq %xmm0, %rax # sched: [2:1.00]
-; SANDY-NEXT: vmovq %xmm1, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovq %xmm0, %rax # sched: [1:0.33]
+; SANDY-NEXT: vmovq %xmm1, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movd_64:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovq %rdi, %xmm1 # sched: [1:1.00]
; HASWELL-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero sched: [4:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpaddq %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovq %xmm0, %rax # sched: [1:1.00]
; HASWELL-NEXT: vmovq %xmm1, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movd_64:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero sched: [5:1.00]
; BTVER2-NEXT: vmovq %rdi, %xmm1 # sched: [1:0.17]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vmovq %xmm1, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: vpaddq %xmm2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovq %xmm0, %rax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movd_64:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero sched: [8:0.50]
; ZNVER1-NEXT: vmovq %rdi, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vmovq %xmm1, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: vpaddq %xmm2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovq %xmm0, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <2 x i64> undef, i64 %a1, i64 0
%2 = load i64, i64 *%a2
%3 = insertelement <2 x i64> undef, i64 %2, i64 0
%4 = add <2 x i64> %a0, %1
%5 = add <2 x i64> %a0, %3
%6 = extractelement <2 x i64> %4, i64 0
%7 = extractelement <2 x i64> %5, i64 0
store i64 %6, i64* %a2
ret i64 %7
}
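; movhpd/movlpd load or store only the high or low 64-bit half of an XMM
; register, expressed in IR as insertelement/extractelement at lane 1 or
; lane 0 respectively.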
define void @test_movhpd(<2 x double> %a0, <2 x double> %a1, x86_mmx *%a2) {
; GENERIC-LABEL: test_movhpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; GENERIC-NEXT: addpd %xmm0, %xmm1
; GENERIC-NEXT: movhpd %xmm1, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movhpd:
; ATOM: # BB#0:
; ATOM-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; ATOM-NEXT: addpd %xmm0, %xmm1
; ATOM-NEXT: movhpd %xmm1, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movhpd:
; SLM: # BB#0:
; SLM-NEXT: movhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [4:1.00]
; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movhpd %xmm1, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movhpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [7:1.00]
+; SANDY-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovhpd %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovhpd %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movhpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovhpd %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movhpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovhpd %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movhpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovhpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovhpd %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast x86_mmx* %a2 to double*
%2 = load double, double *%1, align 8
%3 = insertelement <2 x double> %a1, double %2, i32 1
%4 = fadd <2 x double> %a0, %3
%5 = extractelement <2 x double> %4, i32 1
store double %5, double* %1
ret void
}
define void @test_movlpd(<2 x double> %a0, <2 x double> %a1, x86_mmx *%a2) {
; GENERIC-LABEL: test_movlpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1]
; GENERIC-NEXT: addpd %xmm0, %xmm1
; GENERIC-NEXT: movlpd %xmm1, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movlpd:
; ATOM: # BB#0:
; ATOM-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1]
; ATOM-NEXT: addpd %xmm0, %xmm1
; ATOM-NEXT: movlpd %xmm1, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movlpd:
; SLM: # BB#0:
; SLM-NEXT: movlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [4:1.00]
; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movlpd %xmm1, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movlpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [7:1.00]
+; SANDY-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovlpd %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovlpd %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movlpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovlpd %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movlpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovlpd %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movlpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovlpd {{.*#+}} xmm1 = mem[0],xmm1[1] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovlpd %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast x86_mmx* %a2 to double*
%2 = load double, double *%1, align 8
%3 = insertelement <2 x double> %a1, double %2, i32 0
%4 = fadd <2 x double> %a0, %3
%5 = extractelement <2 x double> %4, i32 0
store double %5, double* %1
ret void
}
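; movmskpd packs the sign bits of the two double-precision lanes into the
; low two bits of a GPR.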
define i32 @test_movmskpd(<2 x double> %a0) {
; GENERIC-LABEL: test_movmskpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movmskpd %xmm0, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movmskpd:
; ATOM: # BB#0:
; ATOM-NEXT: movmskpd %xmm0, %eax
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movmskpd:
; SLM: # BB#0:
; SLM-NEXT: movmskpd %xmm0, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movmskpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovmskpd %xmm0, %eax # sched: [2:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovmskpd %xmm0, %eax # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movmskpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovmskpd %xmm0, %eax # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movmskpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovmskpd %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movmskpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovmskpd %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse2.movmsk.pd(<2 x double> %a0)
ret i32 %1
}
declare i32 @llvm.x86.sse2.movmsk.pd(<2 x double>) nounwind readnone
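; The movntdq/movntpd stores below are selected from the !nontemporal
; metadata on the IR stores, a hint to write around the caches.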
define void @test_movntdqa(<2 x i64> %a0, <2 x i64> *%a1) {
; GENERIC-LABEL: test_movntdqa:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddq %xmm0, %xmm0
; GENERIC-NEXT: movntdq %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movntdqa:
; ATOM: # BB#0:
; ATOM-NEXT: paddq %xmm0, %xmm0
; ATOM-NEXT: movntdq %xmm0, (%rdi)
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movntdqa:
; SLM: # BB#0:
; SLM-NEXT: paddq %xmm0, %xmm0 # sched: [1:0.50]
; SLM-NEXT: movntdq %xmm0, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movntdqa:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovntdq %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntdq %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntdqa:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovntdq %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntdqa:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovntdq %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntdqa:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddq %xmm0, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovntdq %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = add <2 x i64> %a0, %a0
store <2 x i64> %1, <2 x i64> *%a1, align 16, !nontemporal !0
ret void
}
define void @test_movntpd(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_movntpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: addpd %xmm0, %xmm0
; GENERIC-NEXT: movntpd %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movntpd:
; ATOM: # BB#0:
; ATOM-NEXT: addpd %xmm0, %xmm0
; ATOM-NEXT: movntpd %xmm0, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movntpd:
; SLM: # BB#0:
; SLM-NEXT: addpd %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movntpd %xmm0, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movntpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovntpd %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntpd %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovntpd %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovntpd %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovntpd %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fadd <2 x double> %a0, %a0
store <2 x double> %1, <2 x double> *%a1, align 16, !nontemporal !0
ret void
}
define <2 x i64> @test_movq_mem(<2 x i64> %a0, i64 *%a1) {
; GENERIC-LABEL: test_movq_mem:
; GENERIC: # BB#0:
; GENERIC-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: movq %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movq_mem:
; ATOM: # BB#0:
; ATOM-NEXT: movq {{.*#+}} xmm1 = mem[0],zero
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: movq %xmm0, (%rdi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movq_mem:
; SLM: # BB#0:
; SLM-NEXT: movq {{.*#+}} xmm1 = mem[0],zero sched: [3:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: movq %xmm0, (%rdi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movq_mem:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero sched: [6:0.50]
+; SANDY-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero sched: [4:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vmovq %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovq %xmm0, (%rdi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movq_mem:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero sched: [4:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vmovq %xmm0, (%rdi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movq_mem:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero sched: [5:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vmovq %xmm0, (%rdi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movq_mem:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vmovq %xmm0, (%rdi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load i64, i64* %a1, align 1
%2 = insertelement <2 x i64> zeroinitializer, i64 %1, i32 0
%3 = add <2 x i64> %a0, %2
%4 = extractelement <2 x i64> %3, i32 0
store i64 %4, i64 *%a1, align 1
ret <2 x i64> %3
}
define <2 x i64> @test_movq_reg(<2 x i64> %a0, <2 x i64> %a1) {
; GENERIC-LABEL: test_movq_reg:
; GENERIC: # BB#0:
; GENERIC-NEXT: movq {{.*#+}} xmm0 = xmm0[0],zero
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movq_reg:
; ATOM: # BB#0:
; ATOM-NEXT: movq {{.*#+}} xmm0 = xmm0[0],zero
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movq_reg:
; SLM: # BB#0:
; SLM-NEXT: movq {{.*#+}} xmm0 = xmm0[0],zero sched: [1:0.50]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movq_reg:
; SANDY: # BB#0:
; SANDY-NEXT: vmovq {{.*#+}} xmm0 = xmm0[0],zero sched: [1:0.33]
; SANDY-NEXT: vpaddq %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movq_reg:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovq {{.*#+}} xmm0 = xmm0[0],zero sched: [1:0.33]
; HASWELL-NEXT: vpaddq %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movq_reg:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovq {{.*#+}} xmm0 = xmm0[0],zero sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movq_reg:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovq {{.*#+}} xmm0 = xmm0[0],zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x i64> %a0, <2 x i64> zeroinitializer, <2 x i32> <i32 0, i32 2>
%2 = add <2 x i64> %a1, %1
ret <2 x i64> %2
}
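; test_movsd_mem exercises the scalar load/store forms (the load zeroes
; the upper lane); the register-to-register shuffle in test_movsd_reg is
; matched as unpcklpd instead, as the checks show.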
define void @test_movsd_mem(double* %a0, double* %a1) {
; GENERIC-LABEL: test_movsd_mem:
; GENERIC: # BB#0:
; GENERIC-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
; GENERIC-NEXT: addsd %xmm0, %xmm0
; GENERIC-NEXT: movsd %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movsd_mem:
; ATOM: # BB#0:
; ATOM-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
; ATOM-NEXT: addsd %xmm0, %xmm0
; ATOM-NEXT: movsd %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movsd_mem:
; SLM: # BB#0:
; SLM-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero sched: [3:1.00]
; SLM-NEXT: addsd %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movsd %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movsd_mem:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero sched: [6:0.50]
+; SANDY-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero sched: [4:0.50]
; SANDY-NEXT: vaddsd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovsd %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovsd %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movsd_mem:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero sched: [4:0.50]
; HASWELL-NEXT: vaddsd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovsd %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movsd_mem:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero sched: [5:1.00]
; BTVER2-NEXT: vaddsd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovsd %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movsd_mem:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero sched: [8:0.50]
; ZNVER1-NEXT: vaddsd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovsd %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load double, double* %a0, align 1
%2 = fadd double %1, %1
store double %2, double *%a1, align 1
ret void
}
define <2 x double> @test_movsd_reg(<2 x double> %a0, <2 x double> %a1) {
; GENERIC-LABEL: test_movsd_reg:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; GENERIC-NEXT: movapd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movsd_reg:
; ATOM: # BB#0:
; ATOM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0]
; ATOM-NEXT: movapd %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movsd_reg:
; SLM: # BB#0:
; SLM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],xmm0[0] sched: [1:1.00]
; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movsd_reg:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm0[0] sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movsd_reg:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm0[0] sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movsd_reg:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm0[0] sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movsd_reg:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm1[0],xmm0[0] sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> %a1, <2 x i32> <i32 2, i32 0>
ret <2 x double> %1
}
define void @test_movupd(<2 x double> *%a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_movupd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movupd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm0, %xmm0
; GENERIC-NEXT: movupd %xmm0, (%rsi)
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movupd:
; ATOM: # BB#0:
; ATOM-NEXT: movupd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm0, %xmm0
; ATOM-NEXT: movupd %xmm0, (%rsi)
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movupd:
; SLM: # BB#0:
; SLM-NEXT: movupd (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: addpd %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: movupd %xmm0, (%rsi) # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movupd:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovupd (%rdi), %xmm0 # sched: [6:0.50]
+; SANDY-NEXT: vmovupd (%rdi), %xmm0 # sched: [4:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vmovupd %xmm0, (%rsi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovupd %xmm0, (%rsi) # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movupd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovupd (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vmovupd %xmm0, (%rsi) # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movupd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovupd (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vmovupd %xmm0, (%rsi) # sched: [1:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movupd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovupd (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vmovupd %xmm0, (%rsi) # sched: [1:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <2 x double>, <2 x double> *%a0, align 1
%2 = fadd <2 x double> %1, %1
store <2 x double> %2, <2 x double> *%a1, align 1
ret void
}
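; mulpd/mulsd: packed and scalar double-precision multiply, in register
; and folded-load forms.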
define <2 x double> @test_mulpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_mulpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: mulpd %xmm1, %xmm0
; GENERIC-NEXT: mulpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_mulpd:
; ATOM: # BB#0:
; ATOM-NEXT: mulpd %xmm1, %xmm0
; ATOM-NEXT: mulpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_mulpd:
; SLM: # BB#0:
; SLM-NEXT: mulpd %xmm1, %xmm0 # sched: [5:2.00]
; SLM-NEXT: mulpd (%rdi), %xmm0 # sched: [8:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mulpd:
; SANDY: # BB#0:
; SANDY-NEXT: vmulpd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulpd (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulpd %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vmulpd (%rdi), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulpd %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vmulpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulpd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulpd (%rdi), %xmm0, %xmm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul <2 x double> %a0, %a1
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fmul <2 x double> %1, %2
ret <2 x double> %3
}
define double @test_mulsd(double %a0, double %a1, double *%a2) {
; GENERIC-LABEL: test_mulsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: mulsd %xmm1, %xmm0
; GENERIC-NEXT: mulsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_mulsd:
; ATOM: # BB#0:
; ATOM-NEXT: mulsd %xmm1, %xmm0
; ATOM-NEXT: mulsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_mulsd:
; SLM: # BB#0:
; SLM-NEXT: mulsd %xmm1, %xmm0 # sched: [5:2.00]
; SLM-NEXT: mulsd (%rdi), %xmm0 # sched: [8:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mulsd:
; SANDY: # BB#0:
; SANDY-NEXT: vmulsd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmulsd (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmulsd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mulsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmulsd %xmm1, %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vmulsd (%rdi), %xmm0, %xmm0 # sched: [9:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mulsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmulsd %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vmulsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mulsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmulsd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; ZNVER1-NEXT: vmulsd (%rdi), %xmm0, %xmm0 # sched: [12:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fmul double %a0, %a1
%2 = load double, double *%a2, align 8
%3 = fmul double %1, %2
ret double %3
}
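; There is no IR bitwise-or on floating-point types, so test_orpd bitcasts
; the <2 x double> inputs to <4 x i32>, ors them, and bitcasts back; the
; trailing fadd presumably keeps the value in the FP domain so orpd rather
; than por is selected.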
define <2 x double> @test_orpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_orpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: orpd %xmm1, %xmm0
; GENERIC-NEXT: orpd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_orpd:
; ATOM: # BB#0:
; ATOM-NEXT: orpd %xmm1, %xmm0
; ATOM-NEXT: orpd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_orpd:
; SLM: # BB#0:
; SLM-NEXT: orpd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: orpd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_orpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vorpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vorpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vorpd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_orpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vorpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vorpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_orpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vorpd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_orpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vorpd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <2 x double> %a0 to <4 x i32>
%2 = bitcast <2 x double> %a1 to <4 x i32>
%3 = or <4 x i32> %1, %2
%4 = load <2 x double>, <2 x double> *%a2, align 16
%5 = bitcast <2 x double> %4 to <4 x i32>
%6 = or <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <2 x double>
%8 = fadd <2 x double> %a1, %7
ret <2 x double> %8
}
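; packssdw/packsswb/packuswb narrow each element to half width with signed
; or unsigned saturation; the bitcasts between the two calls feed one pack
; result back in as the next input.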
define <8 x i16> @test_packssdw(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_packssdw:
; GENERIC: # BB#0:
; GENERIC-NEXT: packssdw %xmm1, %xmm0
; GENERIC-NEXT: packssdw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_packssdw:
; ATOM: # BB#0:
; ATOM-NEXT: packssdw %xmm1, %xmm0
; ATOM-NEXT: packssdw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_packssdw:
; SLM: # BB#0:
; SLM-NEXT: packssdw %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: packssdw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_packssdw:
; SANDY: # BB#0:
; SANDY-NEXT: vpackssdw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpackssdw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpackssdw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_packssdw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpackssdw %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpackssdw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_packssdw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpackssdw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpackssdw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_packssdw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpackssdw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpackssdw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32> %a0, <4 x i32> %a1)
%2 = bitcast <8 x i16> %1 to <4 x i32>
%3 = load <4 x i32>, <4 x i32> *%a2, align 16
%4 = call <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32> %2, <4 x i32> %3)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse2.packssdw.128(<4 x i32>, <4 x i32>) nounwind readnone
define <16 x i8> @test_packsswb(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_packsswb:
; GENERIC: # BB#0:
; GENERIC-NEXT: packsswb %xmm1, %xmm0
; GENERIC-NEXT: packsswb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_packsswb:
; ATOM: # BB#0:
; ATOM-NEXT: packsswb %xmm1, %xmm0
; ATOM-NEXT: packsswb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_packsswb:
; SLM: # BB#0:
; SLM-NEXT: packsswb %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: packsswb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_packsswb:
; SANDY: # BB#0:
; SANDY-NEXT: vpacksswb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpacksswb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpacksswb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_packsswb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpacksswb %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpacksswb (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_packsswb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpacksswb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpacksswb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_packsswb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpacksswb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpacksswb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = bitcast <16 x i8> %1 to <8 x i16>
%3 = load <8 x i16>, <8 x i16> *%a2, align 16
%4 = call <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16> %2, <8 x i16> %3)
ret <16 x i8> %4
}
declare <16 x i8> @llvm.x86.sse2.packsswb.128(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_packuswb(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_packuswb:
; GENERIC: # BB#0:
; GENERIC-NEXT: packuswb %xmm1, %xmm0
; GENERIC-NEXT: packuswb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_packuswb:
; ATOM: # BB#0:
; ATOM-NEXT: packuswb %xmm1, %xmm0
; ATOM-NEXT: packuswb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_packuswb:
; SLM: # BB#0:
; SLM-NEXT: packuswb %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: packuswb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_packuswb:
; SANDY: # BB#0:
; SANDY-NEXT: vpackuswb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpackuswb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpackuswb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_packuswb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpackuswb %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpackuswb (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_packuswb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpackuswb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpackuswb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_packuswb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpackuswb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpackuswb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = bitcast <16 x i8> %1 to <8 x i16>
%3 = load <8 x i16>, <8 x i16> *%a2, align 16
%4 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %2, <8 x i16> %3)
ret <16 x i8> %4
}
declare <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_paddb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_paddb:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddb %xmm1, %xmm0
; GENERIC-NEXT: paddb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddb:
; ATOM: # BB#0:
; ATOM-NEXT: paddb %xmm1, %xmm0
; ATOM-NEXT: paddb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddb:
; SLM: # BB#0:
; SLM-NEXT: paddb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddb:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpaddb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = add <16 x i8> %a0, %a1
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = add <16 x i8> %1, %2
ret <16 x i8> %3
}
define <4 x i32> @test_paddd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_paddd:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: paddd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddd:
; ATOM: # BB#0:
; ATOM-NEXT: paddd %xmm1, %xmm0
; ATOM-NEXT: paddd (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddd:
; SLM: # BB#0:
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddd:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpaddd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = add <4 x i32> %a0, %a1
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = add <4 x i32> %1, %2
ret <4 x i32> %3
}
define <2 x i64> @test_paddq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_paddq:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: paddq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddq:
; ATOM: # BB#0:
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: paddq (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddq:
; SLM: # BB#0:
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddq:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpaddq (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = add <2 x i64> %a0, %a1
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = add <2 x i64> %1, %2
ret <2 x i64> %3
}
define <16 x i8> @test_paddsb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_paddsb:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddsb %xmm1, %xmm0
; GENERIC-NEXT: paddsb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddsb:
; ATOM: # BB#0:
; ATOM-NEXT: paddsb %xmm1, %xmm0
; ATOM-NEXT: paddsb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddsb:
; SLM: # BB#0:
; SLM-NEXT: paddsb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddsb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddsb:
; SANDY: # BB#0:
-; SANDY-NEXT: vpaddsb %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vpaddsb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpaddsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddsb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddsb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddsb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddsb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddsb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddsb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.padds.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.padds.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.padds.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_paddsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_paddsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddsw %xmm1, %xmm0
; GENERIC-NEXT: paddsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddsw:
; ATOM: # BB#0:
; ATOM-NEXT: paddsw %xmm1, %xmm0
; ATOM-NEXT: paddsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddsw:
; SLM: # BB#0:
; SLM-NEXT: paddsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddsw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpaddsw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vpaddsw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpaddsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.padds.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.padds.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.padds.w(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_paddusb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_paddusb:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddusb %xmm1, %xmm0
; GENERIC-NEXT: paddusb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddusb:
; ATOM: # BB#0:
; ATOM-NEXT: paddusb %xmm1, %xmm0
; ATOM-NEXT: paddusb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddusb:
; SLM: # BB#0:
; SLM-NEXT: paddusb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddusb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddusb:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpaddusb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddusb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddusb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddusb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddusb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddusb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddusb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddusb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddusb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.paddus.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.paddus.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.paddus.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_paddusw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_paddusw:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddusw %xmm1, %xmm0
; GENERIC-NEXT: paddusw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddusw:
; ATOM: # BB#0:
; ATOM-NEXT: paddusw %xmm1, %xmm0
; ATOM-NEXT: paddusw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddusw:
; SLM: # BB#0:
; SLM-NEXT: paddusw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddusw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddusw:
; SANDY: # BB#0:
; SANDY-NEXT: vpaddusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpaddusw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddusw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddusw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddusw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddusw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddusw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddusw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddusw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddusw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.paddus.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.paddus.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.paddus.w(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_paddw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_paddw:
; GENERIC: # BB#0:
; GENERIC-NEXT: paddw %xmm1, %xmm0
; GENERIC-NEXT: paddw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_paddw:
; ATOM: # BB#0:
; ATOM-NEXT: paddw %xmm1, %xmm0
; ATOM-NEXT: paddw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_paddw:
; SLM: # BB#0:
; SLM-NEXT: paddw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: paddw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_paddw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vpaddw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpaddw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_paddw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpaddw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_paddw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_paddw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = add <8 x i16> %a0, %a1
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = add <8 x i16> %1, %2
ret <8 x i16> %3
}
define <2 x i64> @test_pand(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_pand:
; GENERIC: # BB#0:
; GENERIC-NEXT: pand %xmm1, %xmm0
; GENERIC-NEXT: pand (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pand:
; ATOM: # BB#0:
; ATOM-NEXT: pand %xmm1, %xmm0
; ATOM-NEXT: pand (%rdi), %xmm0
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pand:
; SLM: # BB#0:
; SLM-NEXT: pand %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pand (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pand:
; SANDY: # BB#0:
; SANDY-NEXT: vpand %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: vpand (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpand (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pand:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpand %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: vpand (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pand:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpand %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpand (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pand:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpand %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpand (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = and <2 x i64> %a0, %a1
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = and <2 x i64> %1, %2
%4 = add <2 x i64> %3, %a1
ret <2 x i64> %4
}
define <2 x i64> @test_pandn(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_pandn:
; GENERIC: # BB#0:
; GENERIC-NEXT: pandn %xmm1, %xmm0
; GENERIC-NEXT: movdqa %xmm0, %xmm1
; GENERIC-NEXT: pandn (%rdi), %xmm1
; GENERIC-NEXT: paddq %xmm0, %xmm1
; GENERIC-NEXT: movdqa %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pandn:
; ATOM: # BB#0:
; ATOM-NEXT: pandn %xmm1, %xmm0
; ATOM-NEXT: movdqa %xmm0, %xmm1
; ATOM-NEXT: pandn (%rdi), %xmm1
; ATOM-NEXT: paddq %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pandn:
; SLM: # BB#0:
; SLM-NEXT: pandn %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pandn (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pandn:
; SANDY: # BB#0:
; SANDY-NEXT: vpandn %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: vpandn (%rdi), %xmm0, %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpandn (%rdi), %xmm0, %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pandn:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpandn %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: vpandn (%rdi), %xmm0, %xmm1 # sched: [5:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pandn:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpandn %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpandn (%rdi), %xmm0, %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pandn:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpandn %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpandn (%rdi), %xmm0, %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
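; PANDN computes (~dst) & src, so the IR below expresses each pandn as an xor
; with -1 followed by an and.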
%1 = xor <2 x i64> %a0, <i64 -1, i64 -1>
%2 = and <2 x i64> %a1, %1
%3 = load <2 x i64>, <2 x i64> *%a2, align 16
%4 = xor <2 x i64> %2, <i64 -1, i64 -1>
%5 = and <2 x i64> %3, %4
%6 = add <2 x i64> %2, %5
ret <2 x i64> %6
}
define <16 x i8> @test_pavgb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pavgb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pavgb %xmm1, %xmm0
; GENERIC-NEXT: pavgb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pavgb:
; ATOM: # BB#0:
; ATOM-NEXT: pavgb %xmm1, %xmm0
; ATOM-NEXT: pavgb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pavgb:
; SLM: # BB#0:
; SLM-NEXT: pavgb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pavgb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pavgb:
; SANDY: # BB#0:
; SANDY-NEXT: vpavgb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpavgb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpavgb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pavgb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpavgb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpavgb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pavgb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpavgb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpavgb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pavgb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpavgb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpavgb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.pavg.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.pavg.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.pavg.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_pavgw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pavgw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pavgw %xmm1, %xmm0
; GENERIC-NEXT: pavgw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pavgw:
; ATOM: # BB#0:
; ATOM-NEXT: pavgw %xmm1, %xmm0
; ATOM-NEXT: pavgw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pavgw:
; SLM: # BB#0:
; SLM-NEXT: pavgw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pavgw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pavgw:
; SANDY: # BB#0:
; SANDY-NEXT: vpavgw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpavgw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpavgw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pavgw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpavgw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpavgw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pavgw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpavgw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpavgw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pavgw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpavgw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpavgw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.pavg.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.pavg.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.pavg.w(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_pcmpeqb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpeqb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpeqb %xmm0, %xmm1
; GENERIC-NEXT: pcmpeqb (%rdi), %xmm0
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpeqb:
; ATOM: # BB#0:
; ATOM-NEXT: pcmpeqb %xmm0, %xmm1
; ATOM-NEXT: pcmpeqb (%rdi), %xmm0
; ATOM-NEXT: por %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpeqb:
; SLM: # BB#0:
; SLM-NEXT: pcmpeqb %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pcmpeqb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpeqb:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpeqb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpeqb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpeqb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpeqb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpeqb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpeqb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpeqb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpeqb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpeqb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpeqb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpeqb %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpeqb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp eq <16 x i8> %a0, %a1
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = icmp eq <16 x i8> %a0, %2
%4 = or <16 x i1> %1, %3
%5 = sext <16 x i1> %4 to <16 x i8>
ret <16 x i8> %5
}
define <4 x i32> @test_pcmpeqd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pcmpeqd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpeqd %xmm0, %xmm1
; GENERIC-NEXT: pcmpeqd (%rdi), %xmm0
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpeqd:
; ATOM: # BB#0:
; ATOM-NEXT: pcmpeqd %xmm0, %xmm1
; ATOM-NEXT: pcmpeqd (%rdi), %xmm0
; ATOM-NEXT: por %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpeqd:
; SLM: # BB#0:
; SLM-NEXT: pcmpeqd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pcmpeqd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpeqd:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpeqd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpeqd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpeqd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp eq <4 x i32> %a0, %a1
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = icmp eq <4 x i32> %a0, %2
%4 = or <4 x i1> %1, %3
%5 = sext <4 x i1> %4 to <4 x i32>
ret <4 x i32> %5
}
define <8 x i16> @test_pcmpeqw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pcmpeqw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpeqw %xmm0, %xmm1
; GENERIC-NEXT: pcmpeqw (%rdi), %xmm0
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpeqw:
; ATOM: # BB#0:
; ATOM-NEXT: pcmpeqw %xmm0, %xmm1
; ATOM-NEXT: pcmpeqw (%rdi), %xmm0
; ATOM-NEXT: por %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpeqw:
; SLM: # BB#0:
; SLM-NEXT: pcmpeqw %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pcmpeqw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpeqw:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpeqw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpeqw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpeqw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpeqw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpeqw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpeqw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpeqw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpeqw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp eq <8 x i16> %a0, %a1
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = icmp eq <8 x i16> %a0, %2
%4 = or <8 x i1> %1, %3
%5 = sext <8 x i1> %4 to <8 x i16>
ret <8 x i16> %5
}
define <16 x i8> @test_pcmpgtb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpgtb:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqa %xmm0, %xmm2
; GENERIC-NEXT: pcmpgtb %xmm1, %xmm2
; GENERIC-NEXT: pcmpgtb (%rdi), %xmm0
; GENERIC-NEXT: por %xmm2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpgtb:
; ATOM: # BB#0:
; ATOM-NEXT: movdqa %xmm0, %xmm2
; ATOM-NEXT: pcmpgtb (%rdi), %xmm0
; ATOM-NEXT: pcmpgtb %xmm1, %xmm2
; ATOM-NEXT: por %xmm2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpgtb:
; SLM: # BB#0:
; SLM-NEXT: movdqa %xmm0, %xmm2 # sched: [1:0.50]
; SLM-NEXT: pcmpgtb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: pcmpgtb %xmm1, %xmm2 # sched: [1:0.50]
; SLM-NEXT: por %xmm2, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpgtb:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpgtb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpgtb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpgtb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpgtb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpgtb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpgtb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpgtb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpgtb %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpgtb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpgtb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpgtb %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpgtb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp sgt <16 x i8> %a0, %a1
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = icmp sgt <16 x i8> %a0, %2
%4 = or <16 x i1> %1, %3
%5 = sext <16 x i1> %4 to <16 x i8>
ret <16 x i8> %5
}
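; Note: unlike the neighbouring pcmpgt tests, the memory-operand compare in
; test_pcmpgtd is icmp eq, so the checks expect pcmpeqd for the folded load.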
define <4 x i32> @test_pcmpgtd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pcmpgtd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqa %xmm0, %xmm2
; GENERIC-NEXT: pcmpgtd %xmm1, %xmm2
; GENERIC-NEXT: pcmpeqd (%rdi), %xmm0
; GENERIC-NEXT: por %xmm2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpgtd:
; ATOM: # BB#0:
; ATOM-NEXT: movdqa %xmm0, %xmm2
; ATOM-NEXT: pcmpeqd (%rdi), %xmm0
; ATOM-NEXT: pcmpgtd %xmm1, %xmm2
; ATOM-NEXT: por %xmm2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpgtd:
; SLM: # BB#0:
; SLM-NEXT: movdqa %xmm0, %xmm2 # sched: [1:0.50]
; SLM-NEXT: pcmpeqd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: pcmpgtd %xmm1, %xmm2 # sched: [1:0.50]
; SLM-NEXT: por %xmm2, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpgtd:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpgtd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpgtd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpgtd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpgtd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpgtd %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpgtd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpgtd %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpeqd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp sgt <4 x i32> %a0, %a1
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = icmp eq <4 x i32> %a0, %2
%4 = or <4 x i1> %1, %3
%5 = sext <4 x i1> %4 to <4 x i32>
ret <4 x i32> %5
}
define <8 x i16> @test_pcmpgtw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pcmpgtw:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqa %xmm0, %xmm2
; GENERIC-NEXT: pcmpgtw %xmm1, %xmm2
; GENERIC-NEXT: pcmpgtw (%rdi), %xmm0
; GENERIC-NEXT: por %xmm2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pcmpgtw:
; ATOM: # BB#0:
; ATOM-NEXT: movdqa %xmm0, %xmm2
; ATOM-NEXT: pcmpgtw (%rdi), %xmm0
; ATOM-NEXT: pcmpgtw %xmm1, %xmm2
; ATOM-NEXT: por %xmm2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pcmpgtw:
; SLM: # BB#0:
; SLM-NEXT: movdqa %xmm0, %xmm2 # sched: [1:0.50]
; SLM-NEXT: pcmpgtw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: pcmpgtw %xmm1, %xmm2 # sched: [1:0.50]
; SLM-NEXT: por %xmm2, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpgtw:
; SANDY: # BB#0:
; SANDY-NEXT: vpcmpgtw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
-; SANDY-NEXT: vpcmpgtw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpcmpgtw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpgtw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpgtw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpgtw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpgtw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpgtw %xmm1, %xmm0, %xmm1 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpgtw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpgtw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpgtw %xmm1, %xmm0, %xmm1 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpgtw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpor %xmm0, %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp sgt <8 x i16> %a0, %a1
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = icmp sgt <8 x i16> %a0, %2
%4 = or <8 x i1> %1, %3
%5 = sext <8 x i1> %4 to <8 x i16>
ret <8 x i16> %5
}
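; The "# kill" comments in the checks below are sub-register liveness
; annotations from the asm printer, not executable instructions.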
define i16 @test_pextrw(<8 x i16> %a0) {
; GENERIC-LABEL: test_pextrw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pextrw $6, %xmm0, %eax
; GENERIC-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pextrw:
; ATOM: # BB#0:
; ATOM-NEXT: pextrw $6, %xmm0, %eax
; ATOM-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pextrw:
; SLM: # BB#0:
; SLM-NEXT: pextrw $6, %xmm0, %eax # sched: [4:1.00]
; SLM-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pextrw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpextrw $6, %xmm0, %eax # sched: [3:1.00]
+; SANDY-NEXT: vpextrw $6, %xmm0, %eax # sched: [1:0.50]
; SANDY-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pextrw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpextrw $6, %xmm0, %eax # sched: [1:1.00]
; HASWELL-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pextrw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpextrw $6, %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pextrw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpextrw $6, %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = extractelement <8 x i16> %a0, i32 6
ret i16 %1
}
define <8 x i16> @test_pinsrw(<8 x i16> %a0, i16 %a1, i16 *%a2) {
; GENERIC-LABEL: test_pinsrw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pinsrw $1, %edi, %xmm0
; GENERIC-NEXT: pinsrw $3, (%rsi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pinsrw:
; ATOM: # BB#0:
; ATOM-NEXT: pinsrw $1, %edi, %xmm0
; ATOM-NEXT: pinsrw $3, (%rsi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pinsrw:
; SLM: # BB#0:
; SLM-NEXT: pinsrw $1, %edi, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pinsrw $3, (%rsi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pinsrw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpinsrw $1, %edi, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpinsrw $3, (%rsi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpinsrw $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpinsrw $3, (%rsi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pinsrw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpinsrw $1, %edi, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpinsrw $3, (%rsi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pinsrw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpinsrw $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpinsrw $3, (%rsi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pinsrw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpinsrw $1, %edi, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpinsrw $3, (%rsi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <8 x i16> %a0, i16 %a1, i32 1
%2 = load i16, i16 *%a2
%3 = insertelement <8 x i16> %1, i16 %2, i32 3
ret <8 x i16> %3
}
define <4 x i32> @test_pmaddwd(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmaddwd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaddwd %xmm1, %xmm0
; GENERIC-NEXT: pmaddwd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmaddwd:
; ATOM: # BB#0:
; ATOM-NEXT: pmaddwd %xmm1, %xmm0
; ATOM-NEXT: pmaddwd (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmaddwd:
; SLM: # BB#0:
; SLM-NEXT: pmaddwd %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmaddwd (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaddwd:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmaddwd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaddwd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmaddwd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaddwd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmaddwd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaddwd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaddwd %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmaddwd (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %a0, <8 x i16> %a1)
%2 = bitcast <4 x i32> %1 to <8 x i16>
%3 = load <8 x i16>, <8 x i16> *%a2, align 16
%4 = call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %2, <8 x i16> %3)
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16>, <8 x i16>) nounwind readnone
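; In the updated Sandy Bridge model the vector integer multiplies below
; (pmaddwd, pmulhuw, pmulhw, pmullw, pmuludq) are modelled at 5 cycle latency,
; or 9 cycles with a folded load.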
define <8 x i16> @test_pmaxsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmaxsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxsw %xmm1, %xmm0
; GENERIC-NEXT: pmaxsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmaxsw:
; ATOM: # BB#0:
; ATOM-NEXT: pmaxsw %xmm1, %xmm0
; ATOM-NEXT: pmaxsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmaxsw:
; SLM: # BB#0:
; SLM-NEXT: pmaxsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxsw:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxsw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.pmaxs.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.pmaxs.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.pmaxs.w(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_pmaxub(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pmaxub:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxub %xmm1, %xmm0
; GENERIC-NEXT: pmaxub (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmaxub:
; ATOM: # BB#0:
; ATOM-NEXT: pmaxub %xmm1, %xmm0
; ATOM-NEXT: pmaxub (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmaxub:
; SLM: # BB#0:
; SLM-NEXT: pmaxub %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxub (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxub:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxub (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxub (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxub:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxub (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxub:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxub (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxub:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxub %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxub (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.pmaxu.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.pmaxu.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.pmaxu.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_pminsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pminsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminsw %xmm1, %xmm0
; GENERIC-NEXT: pminsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pminsw:
; ATOM: # BB#0:
; ATOM-NEXT: pminsw %xmm1, %xmm0
; ATOM-NEXT: pminsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pminsw:
; SLM: # BB#0:
; SLM-NEXT: pminsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminsw:
; SANDY: # BB#0:
; SANDY-NEXT: vpminsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminsw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.pmins.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.pmins.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.pmins.w(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_pminub(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pminub:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminub %xmm1, %xmm0
; GENERIC-NEXT: pminub (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pminub:
; ATOM: # BB#0:
; ATOM-NEXT: pminub %xmm1, %xmm0
; ATOM-NEXT: pminub (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pminub:
; SLM: # BB#0:
; SLM-NEXT: pminub %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminub (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminub:
; SANDY: # BB#0:
; SANDY-NEXT: vpminub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminub (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminub (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminub:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminub (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminub:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminub %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminub (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminub:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminub %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminub (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.pminu.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.pminu.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.pminu.b(<16 x i8>, <16 x i8>) nounwind readnone
define i32 @test_pmovmskb(<16 x i8> %a0) {
; GENERIC-LABEL: test_pmovmskb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovmskb %xmm0, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmovmskb:
; ATOM: # BB#0:
; ATOM-NEXT: pmovmskb %xmm0, %eax
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmovmskb:
; SLM: # BB#0:
; SLM-NEXT: pmovmskb %xmm0, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovmskb:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmovmskb %xmm0, %eax # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmovmskb %xmm0, %eax # sched: [1:0.33]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovmskb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovmskb %xmm0, %eax # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovmskb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovmskb %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovmskb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovmskb %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %a0)
ret i32 %1
}
declare i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8>) nounwind readnone
define <8 x i16> @test_pmulhuw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmulhuw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmulhuw %xmm1, %xmm0
; GENERIC-NEXT: pmulhuw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmulhuw:
; ATOM: # BB#0:
; ATOM-NEXT: pmulhuw %xmm1, %xmm0
; ATOM-NEXT: pmulhuw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmulhuw:
; SLM: # BB#0:
; SLM-NEXT: pmulhuw %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmulhuw (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmulhuw:
; SANDY: # BB#0:
; SANDY-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmulhuw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmulhuw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmulhuw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmulhuw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmulhuw (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmulhuw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmulhuw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmulhuw (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.pmulhu.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.pmulhu.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.pmulhu.w(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_pmulhw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmulhw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmulhw %xmm1, %xmm0
; GENERIC-NEXT: pmulhw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmulhw:
; ATOM: # BB#0:
; ATOM-NEXT: pmulhw %xmm1, %xmm0
; ATOM-NEXT: pmulhw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmulhw:
; SLM: # BB#0:
; SLM-NEXT: pmulhw %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmulhw (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmulhw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmulhw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmulhw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmulhw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmulhw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmulhw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmulhw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmulhw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmulhw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmulhw (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmulhw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmulhw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmulhw (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.pmulh.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.pmulh.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.pmulh.w(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_pmullw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmullw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmullw %xmm1, %xmm0
; GENERIC-NEXT: pmullw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmullw:
; ATOM: # BB#0:
; ATOM-NEXT: pmullw %xmm1, %xmm0
; ATOM-NEXT: pmullw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmullw:
; SLM: # BB#0:
; SLM-NEXT: pmullw %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmullw (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmullw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmullw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmullw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmullw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmullw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmullw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmullw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmullw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmullw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmullw (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmullw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmullw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmullw (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = mul <8 x i16> %a0, %a1
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = mul <8 x i16> %1, %2
ret <8 x i16> %3
}
define <2 x i64> @test_pmuludq(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pmuludq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmuludq %xmm1, %xmm0
; GENERIC-NEXT: pmuludq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmuludq:
; ATOM: # BB#0:
; ATOM-NEXT: pmuludq %xmm1, %xmm0
; ATOM-NEXT: pmuludq (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmuludq:
; SLM: # BB#0:
; SLM-NEXT: pmuludq %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmuludq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmuludq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmuludq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmuludq (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmuludq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmuludq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmuludq (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmuludq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmuludq %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmuludq (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmuludq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmuludq %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmuludq (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %a0, <4 x i32> %a1)
%2 = bitcast <2 x i64> %1 to <4 x i32>
%3 = load <4 x i32>, <4 x i32> *%a2, align 16
%4 = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %2, <4 x i32> %3)
ret <2 x i64> %4
}
declare <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32>, <4 x i32>) nounwind readnone
define <2 x i64> @test_por(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_por:
; GENERIC: # BB#0:
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: por (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_por:
; ATOM: # BB#0:
; ATOM-NEXT: por %xmm1, %xmm0
; ATOM-NEXT: por (%rdi), %xmm0
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_por:
; SLM: # BB#0:
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: por (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_por:
; SANDY: # BB#0:
; SANDY-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: vpor (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpor (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_por:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: vpor (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_por:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpor (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_por:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpor (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = or <2 x i64> %a0, %a1
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = or <2 x i64> %1, %2
%4 = add <2 x i64> %3, %a1
ret <2 x i64> %4
}
define <2 x i64> @test_psadbw(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_psadbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psadbw %xmm1, %xmm0
; GENERIC-NEXT: psadbw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psadbw:
; ATOM: # BB#0:
; ATOM-NEXT: psadbw %xmm1, %xmm0
; ATOM-NEXT: psadbw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psadbw:
; SLM: # BB#0:
; SLM-NEXT: psadbw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psadbw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psadbw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsadbw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpsadbw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpsadbw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psadbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsadbw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsadbw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psadbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsadbw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpsadbw (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psadbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsadbw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpsadbw (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8> %a0, <16 x i8> %a1)
%2 = bitcast <2 x i64> %1 to <16 x i8>
%3 = load <16 x i8>, <16 x i8> *%a2, align 16
%4 = call <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8> %2, <16 x i8> %3)
ret <2 x i64> %4
}
declare <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8>, <16 x i8>) nounwind readnone
define <4 x i32> @test_pshufd(<4 x i32> %a0, <4 x i32> *%a1) {
; GENERIC-LABEL: test_pshufd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,0,3,2]
; GENERIC-NEXT: pshufd {{.*#+}} xmm0 = mem[3,2,1,0]
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pshufd:
; ATOM: # BB#0:
; ATOM-NEXT: pshufd {{.*#+}} xmm1 = mem[3,2,1,0]
; ATOM-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,0,3,2]
; ATOM-NEXT: paddd %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pshufd:
; SLM: # BB#0:
; SLM-NEXT: pshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [4:1.00]
; SLM-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,0,3,2] sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pshufd:
; SANDY: # BB#0:
; SANDY-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,0,3,2] sched: [1:0.50]
-; SANDY-NEXT: vpshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [7:0.50]
+; SANDY-NEXT: vpshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pshufd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,0,3,2] sched: [1:1.00]
; HASWELL-NEXT: vpshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pshufd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [6:1.00]
; BTVER2-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,0,3,2] sched: [1:0.50]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pshufd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpshufd {{.*#+}} xmm1 = mem[3,2,1,0] sched: [8:0.50]
; ZNVER1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,0,3,2] sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
%2 = load <4 x i32>, <4 x i32> *%a1, align 16
%3 = shufflevector <4 x i32> %2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
define <8 x i16> @test_pshufhw(<8 x i16> %a0, <8 x i16> *%a1) {
; GENERIC-LABEL: test_pshufhw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pshufhw {{.*#+}} xmm1 = xmm0[0,1,2,3,5,4,7,6]
; GENERIC-NEXT: pshufhw {{.*#+}} xmm0 = mem[0,1,2,3,7,6,5,4]
; GENERIC-NEXT: paddw %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pshufhw:
; ATOM: # BB#0:
; ATOM-NEXT: pshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4]
; ATOM-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6]
; ATOM-NEXT: paddw %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pshufhw:
; SLM: # BB#0:
; SLM-NEXT: pshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [4:1.00]
; SLM-NEXT: pshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:1.00]
; SLM-NEXT: paddw %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pshufhw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:1.00]
-; SANDY-NEXT: vpshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [7:0.50]
-; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:0.50]
+; SANDY-NEXT: vpshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [5:0.50]
+; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pshufhw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:1.00]
; HASWELL-NEXT: vpshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [5:1.00]
; HASWELL-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pshufhw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [6:1.00]
; BTVER2-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:0.50]
; BTVER2-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pshufhw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpshufhw {{.*#+}} xmm1 = mem[0,1,2,3,7,6,5,4] sched: [8:0.50]
; ZNVER1-NEXT: vpshufhw {{.*#+}} xmm0 = xmm0[0,1,2,3,5,4,7,6] sched: [1:0.25]
; ZNVER1-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 5, i32 4, i32 7, i32 6>
%2 = load <8 x i16>, <8 x i16> *%a1, align 16
%3 = shufflevector <8 x i16> %2, <8 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 7, i32 6, i32 5, i32 4>
%4 = add <8 x i16> %1, %3
ret <8 x i16> %4
}
define <8 x i16> @test_pshuflw(<8 x i16> %a0, <8 x i16> *%a1) {
; GENERIC-LABEL: test_pshuflw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pshuflw {{.*#+}} xmm1 = xmm0[1,0,3,2,4,5,6,7]
; GENERIC-NEXT: pshuflw {{.*#+}} xmm0 = mem[3,2,1,0,4,5,6,7]
; GENERIC-NEXT: paddw %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pshuflw:
; ATOM: # BB#0:
; ATOM-NEXT: pshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7]
; ATOM-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7]
; ATOM-NEXT: paddw %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pshuflw:
; SLM: # BB#0:
; SLM-NEXT: pshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [4:1.00]
; SLM-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7] sched: [1:1.00]
; SLM-NEXT: paddw %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pshuflw:
; SANDY: # BB#0:
; SANDY-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7] sched: [1:0.50]
-; SANDY-NEXT: vpshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [7:0.50]
-; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [5:0.50]
+; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pshuflw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7] sched: [1:1.00]
; HASWELL-NEXT: vpshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [5:1.00]
; HASWELL-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pshuflw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [6:1.00]
; BTVER2-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7] sched: [1:0.50]
; BTVER2-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pshuflw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpshuflw {{.*#+}} xmm1 = mem[3,2,1,0,4,5,6,7] sched: [8:0.50]
; ZNVER1-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[1,0,3,2,4,5,6,7] sched: [1:0.25]
; ZNVER1-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 4, i32 5, i32 6, i32 7>
%2 = load <8 x i16>, <8 x i16> *%a1, align 16
%3 = shufflevector <8 x i16> %2, <8 x i16> undef, <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 4, i32 5, i32 6, i32 7>
%4 = add <8 x i16> %1, %3
ret <8 x i16> %4
}
define <4 x i32> @test_pslld(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pslld:
; GENERIC: # BB#0:
; GENERIC-NEXT: pslld %xmm1, %xmm0
; GENERIC-NEXT: pslld (%rdi), %xmm0
; GENERIC-NEXT: pslld $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pslld:
; ATOM: # BB#0:
; ATOM-NEXT: pslld %xmm1, %xmm0
; ATOM-NEXT: pslld (%rdi), %xmm0
; ATOM-NEXT: pslld $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pslld:
; SLM: # BB#0:
; SLM-NEXT: pslld %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pslld (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: pslld $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pslld:
; SANDY: # BB#0:
-; SANDY-NEXT: vpslld %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vpslld (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vpslld $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpslld %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpslld (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpslld $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pslld:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpslld %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpslld (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpslld $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pslld:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpslld %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpslld (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpslld $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pslld:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpslld %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpslld (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpslld $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.psll.d(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse2.psll.d(<4 x i32> %1, <4 x i32> %2)
%4 = call <4 x i32> @llvm.x86.sse2.pslli.d(<4 x i32> %3, i32 2)
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.psll.d(<4 x i32>, <4 x i32>) nounwind readnone
declare <4 x i32> @llvm.x86.sse2.pslli.d(<4 x i32>, i32) nounwind readnone
define <4 x i32> @test_pslldq(<4 x i32> %a0) {
; GENERIC-LABEL: test_pslldq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pslldq:
; ATOM: # BB#0:
; ATOM-NEXT: pslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pslldq:
; SLM: # BB#0:
; SLM-NEXT: pslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11] sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pslldq:
; SANDY: # BB#0:
; SANDY-NEXT: vpslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11] sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pslldq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11] sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pslldq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11] sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pslldq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpslldq {{.*#+}} xmm0 = zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7,8,9,10,11] sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> zeroinitializer, <4 x i32> <i32 4, i32 0, i32 1, i32 2>
ret <4 x i32> %1
}
define <2 x i64> @test_psllq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_psllq:
; GENERIC: # BB#0:
; GENERIC-NEXT: psllq %xmm1, %xmm0
; GENERIC-NEXT: psllq (%rdi), %xmm0
; GENERIC-NEXT: psllq $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psllq:
; ATOM: # BB#0:
; ATOM-NEXT: psllq %xmm1, %xmm0
; ATOM-NEXT: psllq (%rdi), %xmm0
; ATOM-NEXT: psllq $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psllq:
; SLM: # BB#0:
; SLM-NEXT: psllq %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psllq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psllq $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psllq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsllq %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vpsllq (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vpsllq $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsllq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsllq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsllq $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psllq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsllq %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsllq (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsllq $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psllq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsllq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsllq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsllq $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psllq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsllq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsllq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsllq $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse2.psll.q(<2 x i64> %a0, <2 x i64> %a1)
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = call <2 x i64> @llvm.x86.sse2.psll.q(<2 x i64> %1, <2 x i64> %2)
%4 = call <2 x i64> @llvm.x86.sse2.pslli.q(<2 x i64> %3, i32 2)
ret <2 x i64> %4
}
declare <2 x i64> @llvm.x86.sse2.psll.q(<2 x i64>, <2 x i64>) nounwind readnone
declare <2 x i64> @llvm.x86.sse2.pslli.q(<2 x i64>, i32) nounwind readnone
define <8 x i16> @test_psllw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psllw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psllw %xmm1, %xmm0
; GENERIC-NEXT: psllw (%rdi), %xmm0
; GENERIC-NEXT: psllw $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psllw:
; ATOM: # BB#0:
; ATOM-NEXT: psllw %xmm1, %xmm0
; ATOM-NEXT: psllw (%rdi), %xmm0
; ATOM-NEXT: psllw $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psllw:
; SLM: # BB#0:
; SLM-NEXT: psllw %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psllw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psllw $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psllw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsllw %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vpsllw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vpsllw $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsllw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsllw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsllw $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psllw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsllw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsllw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsllw $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psllw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsllw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsllw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsllw $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psllw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsllw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsllw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsllw $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.psll.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.psll.w(<8 x i16> %1, <8 x i16> %2)
%4 = call <8 x i16> @llvm.x86.sse2.pslli.w(<8 x i16> %3, i32 2)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse2.psll.w(<8 x i16>, <8 x i16>) nounwind readnone
declare <8 x i16> @llvm.x86.sse2.pslli.w(<8 x i16>, i32) nounwind readnone
define <4 x i32> @test_psrad(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_psrad:
; GENERIC: # BB#0:
; GENERIC-NEXT: psrad %xmm1, %xmm0
; GENERIC-NEXT: psrad (%rdi), %xmm0
; GENERIC-NEXT: psrad $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psrad:
; ATOM: # BB#0:
; ATOM-NEXT: psrad %xmm1, %xmm0
; ATOM-NEXT: psrad (%rdi), %xmm0
; ATOM-NEXT: psrad $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psrad:
; SLM: # BB#0:
; SLM-NEXT: psrad %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psrad (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psrad $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psrad:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsrad %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpsrad (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: vpsrad $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsrad %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsrad (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsrad $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psrad:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsrad %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsrad (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsrad $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psrad:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsrad %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsrad (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsrad $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psrad:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsrad %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsrad (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsrad $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.psra.d(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse2.psra.d(<4 x i32> %1, <4 x i32> %2)
%4 = call <4 x i32> @llvm.x86.sse2.psrai.d(<4 x i32> %3, i32 2)
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.psra.d(<4 x i32>, <4 x i32>) nounwind readnone
declare <4 x i32> @llvm.x86.sse2.psrai.d(<4 x i32>, i32) nounwind readnone
define <8 x i16> @test_psraw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psraw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psraw %xmm1, %xmm0
; GENERIC-NEXT: psraw (%rdi), %xmm0
; GENERIC-NEXT: psraw $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psraw:
; ATOM: # BB#0:
; ATOM-NEXT: psraw %xmm1, %xmm0
; ATOM-NEXT: psraw (%rdi), %xmm0
; ATOM-NEXT: psraw $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psraw:
; SLM: # BB#0:
; SLM-NEXT: psraw %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psraw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psraw $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psraw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsraw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpsraw (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: vpsraw $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsraw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsraw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsraw $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psraw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsraw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsraw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsraw $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psraw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsraw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsraw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsraw $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psraw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsraw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsraw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsraw $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.psra.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.psra.w(<8 x i16> %1, <8 x i16> %2)
%4 = call <8 x i16> @llvm.x86.sse2.psrai.w(<8 x i16> %3, i32 2)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse2.psra.w(<8 x i16>, <8 x i16>) nounwind readnone
declare <8 x i16> @llvm.x86.sse2.psrai.w(<8 x i16>, i32) nounwind readnone
define <4 x i32> @test_psrld(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_psrld:
; GENERIC: # BB#0:
; GENERIC-NEXT: psrld %xmm1, %xmm0
; GENERIC-NEXT: psrld (%rdi), %xmm0
; GENERIC-NEXT: psrld $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psrld:
; ATOM: # BB#0:
; ATOM-NEXT: psrld %xmm1, %xmm0
; ATOM-NEXT: psrld (%rdi), %xmm0
; ATOM-NEXT: psrld $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psrld:
; SLM: # BB#0:
; SLM-NEXT: psrld %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psrld (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psrld $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psrld:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsrld %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpsrld (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: vpsrld $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsrld %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsrld (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsrld $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psrld:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsrld %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsrld (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsrld $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psrld:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsrld %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsrld (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsrld $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psrld:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsrld %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsrld (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsrld $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse2.psrl.d(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse2.psrl.d(<4 x i32> %1, <4 x i32> %2)
%4 = call <4 x i32> @llvm.x86.sse2.psrli.d(<4 x i32> %3, i32 2)
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.sse2.psrl.d(<4 x i32>, <4 x i32>) nounwind readnone
declare <4 x i32> @llvm.x86.sse2.psrli.d(<4 x i32>, i32) nounwind readnone
define <4 x i32> @test_psrldq(<4 x i32> %a0) {
; GENERIC-LABEL: test_psrldq:
; GENERIC: # BB#0:
; GENERIC-NEXT: psrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psrldq:
; ATOM: # BB#0:
; ATOM-NEXT: psrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psrldq:
; SLM: # BB#0:
; SLM-NEXT: psrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psrldq:
; SANDY: # BB#0:
; SANDY-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psrldq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psrldq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psrldq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsrldq {{.*#+}} xmm0 = xmm0[4,5,6,7,8,9,10,11,12,13,14,15],zero,zero,zero,zero sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> zeroinitializer, <4 x i32> <i32 1, i32 2, i32 3, i32 4>
ret <4 x i32> %1
}
define <2 x i64> @test_psrlq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_psrlq:
; GENERIC: # BB#0:
; GENERIC-NEXT: psrlq %xmm1, %xmm0
; GENERIC-NEXT: psrlq (%rdi), %xmm0
; GENERIC-NEXT: psrlq $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psrlq:
; ATOM: # BB#0:
; ATOM-NEXT: psrlq %xmm1, %xmm0
; ATOM-NEXT: psrlq (%rdi), %xmm0
; ATOM-NEXT: psrlq $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psrlq:
; SLM: # BB#0:
; SLM-NEXT: psrlq %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psrlq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psrlq $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psrlq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsrlq %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpsrlq (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: vpsrlq $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsrlq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsrlq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsrlq $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psrlq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsrlq %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsrlq (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsrlq $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psrlq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsrlq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsrlq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsrlq $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psrlq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsrlq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsrlq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsrlq $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse2.psrl.q(<2 x i64> %a0, <2 x i64> %a1)
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = call <2 x i64> @llvm.x86.sse2.psrl.q(<2 x i64> %1, <2 x i64> %2)
%4 = call <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64> %3, i32 2)
ret <2 x i64> %4
}
declare <2 x i64> @llvm.x86.sse2.psrl.q(<2 x i64>, <2 x i64>) nounwind readnone
declare <2 x i64> @llvm.x86.sse2.psrli.q(<2 x i64>, i32) nounwind readnone
define <8 x i16> @test_psrlw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psrlw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psrlw %xmm1, %xmm0
; GENERIC-NEXT: psrlw (%rdi), %xmm0
; GENERIC-NEXT: psrlw $2, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psrlw:
; ATOM: # BB#0:
; ATOM-NEXT: psrlw %xmm1, %xmm0
; ATOM-NEXT: psrlw (%rdi), %xmm0
; ATOM-NEXT: psrlw $2, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psrlw:
; SLM: # BB#0:
; SLM-NEXT: psrlw %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: psrlw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: psrlw $2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psrlw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpsrlw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpsrlw (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: vpsrlw $2, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsrlw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpsrlw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: vpsrlw $2, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psrlw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsrlw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: vpsrlw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpsrlw $2, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psrlw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsrlw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsrlw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpsrlw $2, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psrlw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsrlw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsrlw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpsrlw $2, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.psrl.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.psrl.w(<8 x i16> %1, <8 x i16> %2)
%4 = call <8 x i16> @llvm.x86.sse2.psrli.w(<8 x i16> %3, i32 2)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse2.psrl.w(<8 x i16>, <8 x i16>) nounwind readnone
declare <8 x i16> @llvm.x86.sse2.psrli.w(<8 x i16>, i32) nounwind readnone
define <16 x i8> @test_psubb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_psubb:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubb %xmm1, %xmm0
; GENERIC-NEXT: psubb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubb:
; ATOM: # BB#0:
; ATOM-NEXT: psubb %xmm1, %xmm0
; ATOM-NEXT: psubb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubb:
; SLM: # BB#0:
; SLM-NEXT: psubb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubb:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sub <16 x i8> %a0, %a1
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = sub <16 x i8> %1, %2
ret <16 x i8> %3
}
define <4 x i32> @test_psubd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_psubd:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubd %xmm1, %xmm0
; GENERIC-NEXT: psubd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubd:
; ATOM: # BB#0:
; ATOM-NEXT: psubd %xmm1, %xmm0
; ATOM-NEXT: psubd (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubd:
; SLM: # BB#0:
; SLM-NEXT: psubd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubd:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sub <4 x i32> %a0, %a1
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = sub <4 x i32> %1, %2
ret <4 x i32> %3
}
define <2 x i64> @test_psubq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_psubq:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubq %xmm1, %xmm0
; GENERIC-NEXT: psubq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubq:
; ATOM: # BB#0:
; ATOM-NEXT: psubq %xmm1, %xmm0
; ATOM-NEXT: psubq (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubq:
; SLM: # BB#0:
; SLM-NEXT: psubq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubq:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubq (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sub <2 x i64> %a0, %a1
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = sub <2 x i64> %1, %2
ret <2 x i64> %3
}
define <16 x i8> @test_psubsb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_psubsb:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubsb %xmm1, %xmm0
; GENERIC-NEXT: psubsb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubsb:
; ATOM: # BB#0:
; ATOM-NEXT: psubsb %xmm1, %xmm0
; ATOM-NEXT: psubsb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubsb:
; SLM: # BB#0:
; SLM-NEXT: psubsb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubsb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubsb:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubsb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubsb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubsb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubsb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubsb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubsb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubsb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.psubs.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.psubs.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.psubs.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_psubsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psubsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubsw %xmm1, %xmm0
; GENERIC-NEXT: psubsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubsw:
; ATOM: # BB#0:
; ATOM-NEXT: psubsw %xmm1, %xmm0
; ATOM-NEXT: psubsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubsw:
; SLM: # BB#0:
; SLM-NEXT: psubsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubsw:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubsw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.psubs.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.psubs.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.psubs.w(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_psubusb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_psubusb:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubusb %xmm1, %xmm0
; GENERIC-NEXT: psubusb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubusb:
; ATOM: # BB#0:
; ATOM-NEXT: psubusb %xmm1, %xmm0
; ATOM-NEXT: psubusb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubusb:
; SLM: # BB#0:
; SLM-NEXT: psubusb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubusb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubusb:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubusb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubusb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubusb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubusb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubusb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubusb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubusb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubusb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubusb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubusb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_psubusw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psubusw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubusw %xmm1, %xmm0
; GENERIC-NEXT: psubusw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubusw:
; ATOM: # BB#0:
; ATOM-NEXT: psubusw %xmm1, %xmm0
; ATOM-NEXT: psubusw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubusw:
; SLM: # BB#0:
; SLM-NEXT: psubusw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubusw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubusw:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubusw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubusw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubusw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubusw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubusw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubusw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubusw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubusw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubusw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubusw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse2.psubus.w(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse2.psubus.w(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse2.psubus.w(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_psubw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psubw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psubw %xmm1, %xmm0
; GENERIC-NEXT: psubw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psubw:
; ATOM: # BB#0:
; ATOM-NEXT: psubw %xmm1, %xmm0
; ATOM-NEXT: psubw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psubw:
; SLM: # BB#0:
; SLM-NEXT: psubw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psubw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psubw:
; SANDY: # BB#0:
; SANDY-NEXT: vpsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsubw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsubw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psubw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsubw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psubw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsubw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psubw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsubw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = sub <8 x i16> %a0, %a1
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = sub <8 x i16> %1, %2
ret <8 x i16> %3
}
define <16 x i8> @test_punpckhbw(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_punpckhbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; GENERIC-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpckhbw:
; ATOM: # BB#0:
; ATOM-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; ATOM-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpckhbw:
; SLM: # BB#0:
; SLM-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15] sched: [1:1.00]
; SLM-NEXT: punpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpckhbw:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15] sched: [1:0.50]
-; SANDY-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpckhbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15] sched: [1:1.00]
; HASWELL-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpckhbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15] sched: [1:0.50]
; BTVER2-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpckhbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15] sched: [1:0.25]
; ZNVER1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],mem[8],xmm0[9],mem[9],xmm0[10],mem[10],xmm0[11],mem[11],xmm0[12],mem[12],xmm0[13],mem[13],xmm0[14],mem[14],xmm0[15],mem[15] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> %a1, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = shufflevector <16 x i8> %1, <16 x i8> %2, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
ret <16 x i8> %3
}
define <4 x i32> @test_punpckhdq(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_punpckhdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; GENERIC-NEXT: punpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3]
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpckhdq:
; ATOM: # BB#0:
; ATOM-NEXT: punpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; ATOM-NEXT: punpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3]
; ATOM-NEXT: paddd %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpckhdq:
; SLM: # BB#0:
; SLM-NEXT: punpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; SLM-NEXT: punpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [4:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpckhdq:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
-; SANDY-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [7:0.50]
+; SANDY-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpckhdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; HASWELL-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpckhdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
; BTVER2-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [6:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpckhdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpckhdq {{.*#+}} xmm0 = xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.25]
; ZNVER1-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm1[2],mem[2],xmm1[3],mem[3] sched: [8:0.50]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> %a1, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = shufflevector <4 x i32> %a1, <4 x i32> %2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
define <2 x i64> @test_punpckhqdq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_punpckhqdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1]
; GENERIC-NEXT: punpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1]
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpckhqdq:
; ATOM: # BB#0:
; ATOM-NEXT: punpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1]
; ATOM-NEXT: punpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1]
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpckhqdq:
; SLM: # BB#0:
; SLM-NEXT: punpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
; SLM-NEXT: punpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [4:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpckhqdq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
-; SANDY-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [7:0.50]
+; SANDY-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:0.50]
+; SANDY-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpckhqdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
; HASWELL-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpckhqdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:0.50]
; BTVER2-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpckhqdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:0.25]
; ZNVER1-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x i64> %a0, <2 x i64> %a1, <2 x i32> <i32 1, i32 3>
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = shufflevector <2 x i64> %a1, <2 x i64> %2, <2 x i32> <i32 1, i32 3>
%4 = add <2 x i64> %1, %3
ret <2 x i64> %4
}
define <8 x i16> @test_punpckhwd(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_punpckhwd:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; GENERIC-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpckhwd:
; ATOM: # BB#0:
; ATOM-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; ATOM-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpckhwd:
; SLM: # BB#0:
; SLM-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:1.00]
; SLM-NEXT: punpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpckhwd:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.50]
-; SANDY-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpckhwd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:1.00]
; HASWELL-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpckhwd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.50]
; BTVER2-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpckhwd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.25]
; ZNVER1-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> %a1, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = shufflevector <8 x i16> %1, <8 x i16> %2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
ret <8 x i16> %3
}
define <16 x i8> @test_punpcklbw(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_punpcklbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; GENERIC-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpcklbw:
; ATOM: # BB#0:
; ATOM-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; ATOM-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpcklbw:
; SLM: # BB#0:
; SLM-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:1.00]
; SLM-NEXT: punpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpcklbw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:1.00]
-; SANDY-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.50]
+; SANDY-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpcklbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:1.00]
; HASWELL-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpcklbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.50]
; BTVER2-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpcklbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7] sched: [1:0.25]
; ZNVER1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3],xmm0[4],mem[4],xmm0[5],mem[5],xmm0[6],mem[6],xmm0[7],mem[7] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> %a1, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = shufflevector <16 x i8> %1, <16 x i8> %2, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
ret <16 x i8> %3
}
define <4 x i32> @test_punpckldq(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_punpckldq:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; GENERIC-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpckldq:
; ATOM: # BB#0:
; ATOM-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
; ATOM-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
; ATOM-NEXT: paddd %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpckldq:
; SLM: # BB#0:
; SLM-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:1.00]
; SLM-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [4:1.00]
; SLM-NEXT: paddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpckldq:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:0.50]
-; SANDY-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [7:0.50]
+; SANDY-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpckldq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:1.00]
; HASWELL-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpckldq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:0.50]
; BTVER2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [6:1.00]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpckldq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1] sched: [1:0.25]
; ZNVER1-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],mem[0],xmm1[1],mem[1] sched: [8:0.50]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> %a1, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = shufflevector <4 x i32> %a1, <4 x i32> %2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
%4 = add <4 x i32> %1, %3
ret <4 x i32> %4
}
define <2 x i64> @test_punpcklqdq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_punpcklqdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; GENERIC-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0]
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpcklqdq:
; ATOM: # BB#0:
; ATOM-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; ATOM-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0]
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpcklqdq:
; SLM: # BB#0:
; SLM-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; SLM-NEXT: punpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [4:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpcklqdq:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
-; SANDY-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [7:0.50]
+; SANDY-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpcklqdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; HASWELL-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpcklqdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
; BTVER2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpcklqdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.25]
; ZNVER1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x i64> %a0, <2 x i64> %a1, <2 x i32> <i32 0, i32 2>
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = shufflevector <2 x i64> %a1, <2 x i64> %2, <2 x i32> <i32 0, i32 2>
%4 = add <2 x i64> %1, %3
ret <2 x i64> %4
}
define <8 x i16> @test_punpcklwd(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_punpcklwd:
; GENERIC: # BB#0:
; GENERIC-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; GENERIC-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3]
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_punpcklwd:
; ATOM: # BB#0:
; ATOM-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
; ATOM-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3]
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_punpcklwd:
; SLM: # BB#0:
; SLM-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; SLM-NEXT: punpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_punpcklwd:
; SANDY: # BB#0:
; SANDY-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
-; SANDY-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_punpcklwd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:1.00]
; HASWELL-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_punpcklwd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.50]
; BTVER2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_punpcklwd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3] sched: [1:0.25]
; ZNVER1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],mem[0],xmm0[1],mem[1],xmm0[2],mem[2],xmm0[3],mem[3] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> %a1, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = shufflevector <8 x i16> %1, <8 x i16> %2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
ret <8 x i16> %3
}
define <2 x i64> @test_pxor(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_pxor:
; GENERIC: # BB#0:
; GENERIC-NEXT: pxor %xmm1, %xmm0
; GENERIC-NEXT: pxor (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pxor:
; ATOM: # BB#0:
; ATOM-NEXT: pxor %xmm1, %xmm0
; ATOM-NEXT: pxor (%rdi), %xmm0
; ATOM-NEXT: paddq %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pxor:
; SLM: # BB#0:
; SLM-NEXT: pxor %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pxor (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pxor:
; SANDY: # BB#0:
; SANDY-NEXT: vpxor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: vpxor (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
+; SANDY-NEXT: vpxor (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pxor:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpxor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: vpxor (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pxor:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpxor %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpxor (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pxor:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpxor %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpxor (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = xor <2 x i64> %a0, %a1
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = xor <2 x i64> %1, %2
%4 = add <2 x i64> %3, %a1
ret <2 x i64> %4
}
define <2 x double> @test_shufpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_shufpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0]
; GENERIC-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1],mem[0]
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_shufpd:
; ATOM: # BB#0:
; ATOM-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0]
; ATOM-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1],mem[0]
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_shufpd:
; SLM: # BB#0:
; SLM-NEXT: shufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0] sched: [1:1.00]
; SLM-NEXT: shufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_shufpd:
; SANDY: # BB#0:
; SANDY-NEXT: vshufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0] sched: [1:1.00]
-; SANDY-NEXT: vshufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [7:1.00]
+; SANDY-NEXT: vshufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_shufpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vshufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0] sched: [1:1.00]
; HASWELL-NEXT: vshufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_shufpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vshufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0] sched: [1:0.50]
; BTVER2-NEXT: vshufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_shufpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vshufpd {{.*#+}} xmm0 = xmm0[1],xmm1[0] sched: [1:0.50]
; ZNVER1-NEXT: vshufpd {{.*#+}} xmm1 = xmm1[1],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> %a1, <2 x i32> <i32 1, i32 2>
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = shufflevector <2 x double> %a1, <2 x double> %2, <2 x i32> <i32 1, i32 2>
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
define <2 x double> @test_sqrtpd(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_sqrtpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: sqrtpd %xmm0, %xmm1
; GENERIC-NEXT: sqrtpd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_sqrtpd:
; ATOM: # BB#0:
; ATOM-NEXT: sqrtpd %xmm0, %xmm1
; ATOM-NEXT: sqrtpd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_sqrtpd:
; SLM: # BB#0:
; SLM-NEXT: sqrtpd (%rdi), %xmm1 # sched: [18:1.00]
; SLM-NEXT: sqrtpd %xmm0, %xmm0 # sched: [15:1.00]
; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_sqrtpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [22:1.00]
-; SANDY-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [28:1.00]
+; SANDY-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [15:1.00]
+; SANDY-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [19:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [15:1.00]
; HASWELL-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [19:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [26:21.00]
; BTVER2-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [21:21.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [20:1.00]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.sqrt.pd(<2 x double> %a0)
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = call <2 x double> @llvm.x86.sse2.sqrt.pd(<2 x double> %2)
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
declare <2 x double> @llvm.x86.sse2.sqrt.pd(<2 x double>) nounwind readnone
; TODO - sqrtsd_m
define <2 x double> @test_sqrtsd(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_sqrtsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: sqrtsd %xmm0, %xmm0
; GENERIC-NEXT: movapd (%rdi), %xmm1
; GENERIC-NEXT: sqrtsd %xmm1, %xmm1
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_sqrtsd:
; ATOM: # BB#0:
; ATOM-NEXT: movapd (%rdi), %xmm1
; ATOM-NEXT: sqrtsd %xmm0, %xmm0
; ATOM-NEXT: sqrtsd %xmm1, %xmm1
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_sqrtsd:
; SLM: # BB#0:
; SLM-NEXT: movapd (%rdi), %xmm1 # sched: [3:1.00]
; SLM-NEXT: sqrtsd %xmm0, %xmm0 # sched: [18:1.00]
; SLM-NEXT: sqrtsd %xmm1, %xmm1 # sched: [18:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_sqrtsd:
; SANDY: # BB#0:
-; SANDY-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0 # sched: [21:1.00]
-; SANDY-NEXT: vmovapd (%rdi), %xmm1 # sched: [6:0.50]
-; SANDY-NEXT: vsqrtsd %xmm1, %xmm1, %xmm1 # sched: [21:1.00]
+; SANDY-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0 # sched: [19:1.00]
+; SANDY-NEXT: vmovapd (%rdi), %xmm1 # sched: [4:0.50]
+; SANDY-NEXT: vsqrtsd %xmm1, %xmm1, %xmm1 # sched: [19:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_sqrtsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0 # sched: [19:1.00]
; HASWELL-NEXT: vmovapd (%rdi), %xmm1 # sched: [4:0.50]
; HASWELL-NEXT: vsqrtsd %xmm1, %xmm1, %xmm1 # sched: [19:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_sqrtsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovapd (%rdi), %xmm1 # sched: [5:1.00]
; BTVER2-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0 # sched: [26:21.00]
; BTVER2-NEXT: vsqrtsd %xmm1, %xmm1, %xmm1 # sched: [26:21.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_sqrtsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovapd (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0 # sched: [27:1.00]
; ZNVER1-NEXT: vsqrtsd %xmm1, %xmm1, %xmm1 # sched: [27:1.00]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double> %a0)
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = call <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double> %2)
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
declare <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double>) nounwind readnone
define <2 x double> @test_subpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_subpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: subpd %xmm1, %xmm0
; GENERIC-NEXT: subpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_subpd:
; ATOM: # BB#0:
; ATOM-NEXT: subpd %xmm1, %xmm0
; ATOM-NEXT: subpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_subpd:
; SLM: # BB#0:
; SLM-NEXT: subpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: subpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_subpd:
; SANDY: # BB#0:
; SANDY-NEXT: vsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vsubpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub <2 x double> %a0, %a1
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fsub <2 x double> %1, %2
ret <2 x double> %3
}
define double @test_subsd(double %a0, double %a1, double *%a2) {
; GENERIC-LABEL: test_subsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: subsd %xmm1, %xmm0
; GENERIC-NEXT: subsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_subsd:
; ATOM: # BB#0:
; ATOM-NEXT: subsd %xmm1, %xmm0
; ATOM-NEXT: subsd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_subsd:
; SLM: # BB#0:
; SLM-NEXT: subsd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: subsd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_subsd:
; SANDY: # BB#0:
; SANDY-NEXT: vsubsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vsubsd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vsubsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_subsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vsubsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vsubsd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_subsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vsubsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vsubsd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_subsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vsubsd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vsubsd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = fsub double %a0, %a1
%2 = load double, double *%a2, align 8
%3 = fsub double %1, %2
ret double %3
}
define i32 @test_ucomisd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_ucomisd:
; GENERIC: # BB#0:
; GENERIC-NEXT: ucomisd %xmm1, %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %cl
; GENERIC-NEXT: andb %al, %cl
; GENERIC-NEXT: ucomisd (%rdi), %xmm0
; GENERIC-NEXT: setnp %al
; GENERIC-NEXT: sete %dl
; GENERIC-NEXT: andb %al, %dl
; GENERIC-NEXT: orb %cl, %dl
; GENERIC-NEXT: movzbl %dl, %eax
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_ucomisd:
; ATOM: # BB#0:
; ATOM-NEXT: ucomisd %xmm1, %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %cl
; ATOM-NEXT: andb %al, %cl
; ATOM-NEXT: ucomisd (%rdi), %xmm0
; ATOM-NEXT: setnp %al
; ATOM-NEXT: sete %dl
; ATOM-NEXT: andb %al, %dl
; ATOM-NEXT: orb %cl, %dl
; ATOM-NEXT: movzbl %dl, %eax
; ATOM-NEXT: retq
;
; SLM-LABEL: test_ucomisd:
; SLM: # BB#0:
; SLM-NEXT: ucomisd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %cl # sched: [1:0.50]
; SLM-NEXT: andb %al, %cl # sched: [1:0.50]
; SLM-NEXT: ucomisd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: setnp %al # sched: [1:0.50]
; SLM-NEXT: sete %dl # sched: [1:0.50]
; SLM-NEXT: andb %al, %dl # sched: [1:0.50]
; SLM-NEXT: orb %cl, %dl # sched: [1:0.50]
; SLM-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_ucomisd:
; SANDY: # BB#0:
; SANDY-NEXT: vucomisd %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %cl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %cl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %cl # sched: [1:0.33]
; SANDY-NEXT: vucomisd (%rdi), %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: setnp %al # sched: [1:1.00]
-; SANDY-NEXT: sete %dl # sched: [1:1.00]
+; SANDY-NEXT: setnp %al # sched: [1:0.33]
+; SANDY-NEXT: sete %dl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %dl # sched: [1:0.33]
; SANDY-NEXT: orb %cl, %dl # sched: [1:0.33]
; SANDY-NEXT: movzbl %dl, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_ucomisd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vucomisd %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %cl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %cl # sched: [1:0.25]
; HASWELL-NEXT: vucomisd (%rdi), %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: setnp %al # sched: [1:0.50]
; HASWELL-NEXT: sete %dl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %dl # sched: [1:0.25]
; HASWELL-NEXT: orb %cl, %dl # sched: [1:0.25]
; HASWELL-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_ucomisd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vucomisd %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %cl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %cl # sched: [1:0.50]
; BTVER2-NEXT: vucomisd (%rdi), %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: setnp %al # sched: [1:0.50]
; BTVER2-NEXT: sete %dl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %dl # sched: [1:0.50]
; BTVER2-NEXT: orb %cl, %dl # sched: [1:0.50]
; BTVER2-NEXT: movzbl %dl, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_ucomisd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vucomisd %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %cl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %cl # sched: [1:0.25]
; ZNVER1-NEXT: vucomisd (%rdi), %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: setnp %al # sched: [1:0.25]
; ZNVER1-NEXT: sete %dl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %dl # sched: [1:0.25]
; ZNVER1-NEXT: orb %cl, %dl # sched: [1:0.25]
; ZNVER1-NEXT: movzbl %dl, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse2.ucomieq.sd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 8
%3 = call i32 @llvm.x86.sse2.ucomieq.sd(<2 x double> %a0, <2 x double> %2)
%4 = or i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse2.ucomieq.sd(<2 x double>, <2 x double>) nounwind readnone
define <2 x double> @test_unpckhpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_unpckhpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
; GENERIC-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1]
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_unpckhpd:
; ATOM: # BB#0:
; ATOM-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1]
; ATOM-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1]
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_unpckhpd:
; SLM: # BB#0:
; SLM-NEXT: unpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
; SLM-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_unpckhpd:
; SANDY: # BB#0:
; SANDY-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
-; SANDY-NEXT: vunpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [7:1.00]
+; SANDY-NEXT: vunpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpckhpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:1.00]
; HASWELL-NEXT: vunpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpckhpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:0.50]
; BTVER2-NEXT: vunpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpckhpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpckhpd {{.*#+}} xmm0 = xmm0[1],xmm1[1] sched: [1:0.50]
; ZNVER1-NEXT: vunpckhpd {{.*#+}} xmm1 = xmm1[1],mem[1] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> %a1, <2 x i32> <i32 1, i32 3>
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = shufflevector <2 x double> %a1, <2 x double> %2, <2 x i32> <i32 1, i32 3>
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
define <2 x double> @test_unpcklpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_unpcklpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; GENERIC-NEXT: movapd %xmm0, %xmm1
; GENERIC-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; GENERIC-NEXT: addpd %xmm0, %xmm1
; GENERIC-NEXT: movapd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_unpcklpd:
; ATOM: # BB#0:
; ATOM-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; ATOM-NEXT: movapd %xmm0, %xmm1
; ATOM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],mem[0]
; ATOM-NEXT: addpd %xmm0, %xmm1
; ATOM-NEXT: movapd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_unpcklpd:
; SLM: # BB#0:
; SLM-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; SLM-NEXT: movapd %xmm0, %xmm1 # sched: [1:1.00]
; SLM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [4:1.00]
; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_unpcklpd:
; SANDY: # BB#0:
; SANDY-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
-; SANDY-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [7:1.00]
+; SANDY-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [5:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_unpcklpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
; HASWELL-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_unpcklpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
; BTVER2-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_unpcklpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:0.50]
; ZNVER1-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> %a1, <2 x i32> <i32 0, i32 2>
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = shufflevector <2 x double> %1, <2 x double> %2, <2 x i32> <i32 0, i32 2>
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
define <2 x double> @test_xorpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_xorpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: xorpd %xmm1, %xmm0
; GENERIC-NEXT: xorpd (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_xorpd:
; ATOM: # BB#0:
; ATOM-NEXT: xorpd %xmm1, %xmm0
; ATOM-NEXT: xorpd (%rdi), %xmm0
; ATOM-NEXT: addpd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_xorpd:
; SLM: # BB#0:
; SLM-NEXT: xorpd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: xorpd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_xorpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vxorpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
-; SANDY-NEXT: vxorpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: vxorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: vxorpd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_xorpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vxorpd %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vxorpd (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_xorpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vxorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vxorpd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_xorpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vxorpd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vxorpd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = bitcast <2 x double> %a0 to <4 x i32>
%2 = bitcast <2 x double> %a1 to <4 x i32>
%3 = xor <4 x i32> %1, %2
%4 = load <2 x double>, <2 x double> *%a2, align 16
%5 = bitcast <2 x double> %4 to <4 x i32>
%6 = xor <4 x i32> %3, %5
%7 = bitcast <4 x i32> %6 to <2 x double>
%8 = fadd <2 x double> %a1, %7
ret <2 x double> %8
}
!0 = !{i32 1}
diff --git a/test/CodeGen/X86/sse3-schedule.ll b/test/CodeGen/X86/sse3-schedule.ll
index ad38d1c6ff49..5f41ccda0fde 100644
--- a/test/CodeGen/X86/sse3-schedule.ll
+++ b/test/CodeGen/X86/sse3-schedule.ll
@@ -1,517 +1,517 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mattr=+sse3 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
define <2 x double> @test_addsubpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_addsubpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: addsubpd %xmm1, %xmm0
; GENERIC-NEXT: addsubpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addsubpd:
; ATOM: # BB#0:
; ATOM-NEXT: addsubpd %xmm1, %xmm0
; ATOM-NEXT: addsubpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addsubpd:
; SLM: # BB#0:
; SLM-NEXT: addsubpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addsubpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addsubpd:
; SANDY: # BB#0:
; SANDY-NEXT: vaddsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddsubpd (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddsubpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addsubpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddsubpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addsubpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddsubpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addsubpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddsubpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse3.addsub.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse3.addsub.pd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse3.addsub.pd(<2 x double>, <2 x double>) nounwind readnone
define <4 x float> @test_addsubps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_addsubps:
; GENERIC: # BB#0:
; GENERIC-NEXT: addsubps %xmm1, %xmm0
; GENERIC-NEXT: addsubps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_addsubps:
; ATOM: # BB#0:
; ATOM-NEXT: addsubps %xmm1, %xmm0
; ATOM-NEXT: addsubps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_addsubps:
; SLM: # BB#0:
; SLM-NEXT: addsubps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addsubps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_addsubps:
; SANDY: # BB#0:
; SANDY-NEXT: vaddsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vaddsubps (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vaddsubps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_addsubps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vaddsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vaddsubps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_addsubps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vaddsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddsubps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_addsubps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vaddsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddsubps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float>, <4 x float>) nounwind readnone
define <2 x double> @test_haddpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_haddpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: haddpd %xmm1, %xmm0
; GENERIC-NEXT: haddpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_haddpd:
; ATOM: # BB#0:
; ATOM-NEXT: haddpd %xmm1, %xmm0
; ATOM-NEXT: haddpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_haddpd:
; SLM: # BB#0:
; SLM-NEXT: haddpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: haddpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_haddpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vhaddpd %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vhaddpd (%rdi), %xmm0, %xmm0 # sched: [11:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vhaddpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_haddpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhaddpd %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
; HASWELL-NEXT: vhaddpd (%rdi), %xmm0, %xmm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_haddpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vhaddpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_haddpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhaddpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse3.hadd.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse3.hadd.pd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse3.hadd.pd(<2 x double>, <2 x double>) nounwind readnone
define <4 x float> @test_haddps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_haddps:
; GENERIC: # BB#0:
; GENERIC-NEXT: haddps %xmm1, %xmm0
; GENERIC-NEXT: haddps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_haddps:
; ATOM: # BB#0:
; ATOM-NEXT: haddps %xmm1, %xmm0
; ATOM-NEXT: haddps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_haddps:
; SLM: # BB#0:
; SLM-NEXT: haddps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: haddps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_haddps:
; SANDY: # BB#0:
-; SANDY-NEXT: vhaddps %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vhaddps (%rdi), %xmm0, %xmm0 # sched: [11:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vhaddps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_haddps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhaddps %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
; HASWELL-NEXT: vhaddps (%rdi), %xmm0, %xmm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_haddps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vhaddps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_haddps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhaddps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse3.hadd.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse3.hadd.ps(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse3.hadd.ps(<4 x float>, <4 x float>) nounwind readnone
define <2 x double> @test_hsubpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_hsubpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: hsubpd %xmm1, %xmm0
; GENERIC-NEXT: hsubpd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_hsubpd:
; ATOM: # BB#0:
; ATOM-NEXT: hsubpd %xmm1, %xmm0
; ATOM-NEXT: hsubpd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_hsubpd:
; SLM: # BB#0:
; SLM-NEXT: hsubpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: hsubpd (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_hsubpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vhsubpd %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vhsubpd (%rdi), %xmm0, %xmm0 # sched: [11:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vhsubpd (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_hsubpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhsubpd %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
; HASWELL-NEXT: vhsubpd (%rdi), %xmm0, %xmm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_hsubpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vhsubpd (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_hsubpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhsubpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhsubpd (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse3.hsub.pd(<2 x double> %a0, <2 x double> %a1)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse3.hsub.pd(<2 x double> %1, <2 x double> %2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse3.hsub.pd(<2 x double>, <2 x double>) nounwind readnone
define <4 x float> @test_hsubps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_hsubps:
; GENERIC: # BB#0:
; GENERIC-NEXT: hsubps %xmm1, %xmm0
; GENERIC-NEXT: hsubps (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_hsubps:
; ATOM: # BB#0:
; ATOM-NEXT: hsubps %xmm1, %xmm0
; ATOM-NEXT: hsubps (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_hsubps:
; SLM: # BB#0:
; SLM-NEXT: hsubps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: hsubps (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_hsubps:
; SANDY: # BB#0:
-; SANDY-NEXT: vhsubps %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
-; SANDY-NEXT: vhsubps (%rdi), %xmm0, %xmm0 # sched: [11:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vhsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vhsubps (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_hsubps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vhsubps %xmm1, %xmm0, %xmm0 # sched: [5:2.00]
; HASWELL-NEXT: vhsubps (%rdi), %xmm0, %xmm0 # sched: [9:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_hsubps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vhsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vhsubps (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_hsubps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vhsubps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vhsubps (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float> %a0, <4 x float> %a1)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float> %1, <4 x float> %2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse3.hsub.ps(<4 x float>, <4 x float>) nounwind readnone
define <16 x i8> @test_lddqu(i8* %a0) {
; GENERIC-LABEL: test_lddqu:
; GENERIC: # BB#0:
; GENERIC-NEXT: lddqu (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_lddqu:
; ATOM: # BB#0:
; ATOM-NEXT: lddqu (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_lddqu:
; SLM: # BB#0:
; SLM-NEXT: lddqu (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_lddqu:
; SANDY: # BB#0:
-; SANDY-NEXT: vlddqu (%rdi), %xmm0 # sched: [6:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vlddqu (%rdi), %xmm0 # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_lddqu:
; HASWELL: # BB#0:
; HASWELL-NEXT: vlddqu (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_lddqu:
; BTVER2: # BB#0:
; BTVER2-NEXT: vlddqu (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_lddqu:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vlddqu (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse3.ldu.dq(i8* %a0)
ret <16 x i8> %1
}
declare <16 x i8> @llvm.x86.sse3.ldu.dq(i8*) nounwind readonly
define <2 x double> @test_movddup(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_movddup:
; GENERIC: # BB#0:
; GENERIC-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0]
; GENERIC-NEXT: movddup {{.*#+}} xmm0 = mem[0,0]
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movddup:
; ATOM: # BB#0:
; ATOM-NEXT: movddup {{.*#+}} xmm1 = mem[0,0]
; ATOM-NEXT: movddup {{.*#+}} xmm0 = xmm0[0,0]
; ATOM-NEXT: addpd %xmm0, %xmm1
; ATOM-NEXT: movapd %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movddup:
; SLM: # BB#0:
; SLM-NEXT: movddup {{.*#+}} xmm1 = xmm0[0,0] sched: [1:1.00]
; SLM-NEXT: movddup {{.*#+}} xmm0 = mem[0,0] sched: [3:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movddup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovddup {{.*#+}} xmm0 = xmm0[0,0] sched: [1:1.00]
-; SANDY-NEXT: vmovddup {{.*#+}} xmm1 = mem[0,0] sched: [6:0.50]
+; SANDY-NEXT: vmovddup {{.*#+}} xmm1 = mem[0,0] sched: [4:0.50]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movddup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovddup {{.*#+}} xmm0 = xmm0[0,0] sched: [1:1.00]
; HASWELL-NEXT: vmovddup {{.*#+}} xmm1 = mem[0,0] sched: [4:0.50]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movddup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovddup {{.*#+}} xmm1 = mem[0,0] sched: [5:1.00]
; BTVER2-NEXT: vmovddup {{.*#+}} xmm0 = xmm0[0,0] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movddup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovddup {{.*#+}} xmm1 = mem[0,0] sched: [8:0.50]
; ZNVER1-NEXT: vmovddup {{.*#+}} xmm0 = xmm0[0,0] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> undef, <2 x i32> zeroinitializer
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = shufflevector <2 x double> %2, <2 x double> undef, <2 x i32> zeroinitializer
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
define <4 x float> @test_movshdup(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_movshdup:
; GENERIC: # BB#0:
; GENERIC-NEXT: movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]
; GENERIC-NEXT: movshdup {{.*#+}} xmm0 = mem[1,1,3,3]
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movshdup:
; ATOM: # BB#0:
; ATOM-NEXT: movshdup {{.*#+}} xmm1 = mem[1,1,3,3]
; ATOM-NEXT: movshdup {{.*#+}} xmm0 = xmm0[1,1,3,3]
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movshdup:
; SLM: # BB#0:
; SLM-NEXT: movshdup {{.*#+}} xmm1 = xmm0[1,1,3,3] sched: [1:1.00]
; SLM-NEXT: movshdup {{.*#+}} xmm0 = mem[1,1,3,3] sched: [3:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movshdup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3] sched: [1:1.00]
-; SANDY-NEXT: vmovshdup {{.*#+}} xmm1 = mem[1,1,3,3] sched: [6:0.50]
+; SANDY-NEXT: vmovshdup {{.*#+}} xmm1 = mem[1,1,3,3] sched: [4:0.50]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movshdup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3] sched: [1:1.00]
; HASWELL-NEXT: vmovshdup {{.*#+}} xmm1 = mem[1,1,3,3] sched: [4:0.50]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movshdup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovshdup {{.*#+}} xmm1 = mem[1,1,3,3] sched: [5:1.00]
; BTVER2-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3] sched: [1:0.50]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movshdup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovshdup {{.*#+}} xmm1 = mem[1,1,3,3] sched: [8:0.50]
; ZNVER1-NEXT: vmovshdup {{.*#+}} xmm0 = xmm0[1,1,3,3] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 3, i32 3>
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 3, i32 3>
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
define <4 x float> @test_movsldup(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_movsldup:
; GENERIC: # BB#0:
; GENERIC-NEXT: movsldup {{.*#+}} xmm1 = xmm0[0,0,2,2]
; GENERIC-NEXT: movsldup {{.*#+}} xmm0 = mem[0,0,2,2]
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_movsldup:
; ATOM: # BB#0:
; ATOM-NEXT: movsldup {{.*#+}} xmm1 = mem[0,0,2,2]
; ATOM-NEXT: movsldup {{.*#+}} xmm0 = xmm0[0,0,2,2]
; ATOM-NEXT: addps %xmm0, %xmm1
; ATOM-NEXT: movaps %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_movsldup:
; SLM: # BB#0:
; SLM-NEXT: movsldup {{.*#+}} xmm1 = xmm0[0,0,2,2] sched: [1:1.00]
; SLM-NEXT: movsldup {{.*#+}} xmm0 = mem[0,0,2,2] sched: [3:1.00]
; SLM-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movsldup:
; SANDY: # BB#0:
; SANDY-NEXT: vmovsldup {{.*#+}} xmm0 = xmm0[0,0,2,2] sched: [1:1.00]
-; SANDY-NEXT: vmovsldup {{.*#+}} xmm1 = mem[0,0,2,2] sched: [6:0.50]
+; SANDY-NEXT: vmovsldup {{.*#+}} xmm1 = mem[0,0,2,2] sched: [4:0.50]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movsldup:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovsldup {{.*#+}} xmm0 = xmm0[0,0,2,2] sched: [1:1.00]
; HASWELL-NEXT: vmovsldup {{.*#+}} xmm1 = mem[0,0,2,2] sched: [4:0.50]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movsldup:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovsldup {{.*#+}} xmm1 = mem[0,0,2,2] sched: [5:1.00]
; BTVER2-NEXT: vmovsldup {{.*#+}} xmm0 = xmm0[0,0,2,2] sched: [1:0.50]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movsldup:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovsldup {{.*#+}} xmm1 = mem[0,0,2,2] sched: [8:0.50]
; ZNVER1-NEXT: vmovsldup {{.*#+}} xmm0 = xmm0[0,0,2,2] sched: [1:0.50]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = shufflevector <4 x float> %2, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
diff --git a/test/CodeGen/X86/sse41-schedule.ll b/test/CodeGen/X86/sse41-schedule.ll
index 26cca98816a3..ac600fed0ea0 100644
--- a/test/CodeGen/X86/sse41-schedule.ll
+++ b/test/CodeGen/X86/sse41-schedule.ll
@@ -1,2247 +1,2247 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mattr=+sse4.1 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
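; NOTE: Each `sched: [N:M.MM]` annotation below comes from -print-schedule and
; reads as [latency : reciprocal throughput] in cycles for the model selected
; by -mcpu, so `sched: [5:0.50]` means a 5-cycle latency with up to two such
; instructions issued per cycle. Per the RUN lines above, the SANDY prefix
; covers both sandybridge and ivybridge, and HASWELL covers both haswell and
; skylake.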
define <2 x double> @test_blendpd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_blendpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],mem[1]
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_blendpd:
; SLM: # BB#0:
; SLM-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:1.00]
; SLM-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_blendpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:1.00]
+; SANDY-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:0.50]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:0.33]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:0.50]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1] sched: [1:0.50]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],mem[1] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <2 x double> %a0, <2 x double> %a1, <2 x i32> <i32 0, i32 3>
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = fadd <2 x double> %a1, %1
%4 = shufflevector <2 x double> %3, <2 x double> %2, <2 x i32> <i32 0, i32 3>
ret <2 x double> %4
}
define <4 x float> @test_blendps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_blendps:
; GENERIC: # BB#0:
; GENERIC-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3]
; GENERIC-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3]
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_blendps:
; SLM: # BB#0:
; SLM-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:1.00]
; SLM-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_blendps:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:1.00]
-; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:0.50]
+; SANDY-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:0.33]
; HASWELL-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:0.50]
; BTVER2-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3] sched: [1:0.50]
; ZNVER1-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],mem[1],xmm0[2,3] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x float> %a0, <4 x float> %a1, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = shufflevector <4 x float> %1, <4 x float> %2, <4 x i32> <i32 0, i32 5, i32 2, i32 3>
ret <4 x float> %3
}
define <2 x double> @test_blendvpd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> *%a3) {
; GENERIC-LABEL: test_blendvpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movapd %xmm0, %xmm3
; GENERIC-NEXT: movaps %xmm2, %xmm0
; GENERIC-NEXT: blendvpd %xmm0, %xmm1, %xmm3
; GENERIC-NEXT: blendvpd %xmm0, (%rdi), %xmm3
; GENERIC-NEXT: movapd %xmm3, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_blendvpd:
; SLM: # BB#0:
; SLM-NEXT: movapd %xmm0, %xmm3 # sched: [1:1.00]
; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: blendvpd %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
; SLM-NEXT: blendvpd %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
; SLM-NEXT: movapd %xmm3, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_blendvpd:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:2.00]
-; SANDY-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
+; SANDY-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendvpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:2.00]
; HASWELL-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendvpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendvpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse41.blendvpd(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2)
%2 = load <2 x double>, <2 x double> *%a3, align 16
%3 = call <2 x double> @llvm.x86.sse41.blendvpd(<2 x double> %1, <2 x double> %2, <2 x double> %a2)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse41.blendvpd(<2 x double>, <2 x double>, <2 x double>) nounwind readnone
define <4 x float> @test_blendvps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> *%a3) {
; GENERIC-LABEL: test_blendvps:
; GENERIC: # BB#0:
; GENERIC-NEXT: movaps %xmm0, %xmm3
; GENERIC-NEXT: movaps %xmm2, %xmm0
; GENERIC-NEXT: blendvps %xmm0, %xmm1, %xmm3
; GENERIC-NEXT: blendvps %xmm0, (%rdi), %xmm3
; GENERIC-NEXT: movaps %xmm3, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_blendvps:
; SLM: # BB#0:
; SLM-NEXT: movaps %xmm0, %xmm3 # sched: [1:1.00]
; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: blendvps %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
; SLM-NEXT: blendvps %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
; SLM-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_blendvps:
; SANDY: # BB#0:
-; SANDY-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:2.00]
-; SANDY-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:2.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
+; SANDY-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_blendvps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:2.00]
; HASWELL-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_blendvps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_blendvps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; ZNVER1-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse41.blendvps(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2)
%2 = load <4 x float>, <4 x float> *%a3
%3 = call <4 x float> @llvm.x86.sse41.blendvps(<4 x float> %1, <4 x float> %2, <4 x float> %a2)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse41.blendvps(<4 x float>, <4 x float>, <4 x float>) nounwind readnone
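; The SSE forms of blendvpd/blendvps take their selection mask implicitly in
; xmm0, which is why the GENERIC and SLM sequences above shuffle the mask into
; xmm0 before blending; the VEX-encoded forms name the mask as an explicit
; fourth operand, so no extra moves are needed.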
define <2 x double> @test_dppd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_dppd:
; GENERIC: # BB#0:
; GENERIC-NEXT: dppd $7, %xmm1, %xmm0
; GENERIC-NEXT: dppd $7, (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_dppd:
; SLM: # BB#0:
; SLM-NEXT: dppd $7, %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: dppd $7, (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_dppd:
; SANDY: # BB#0:
-; SANDY-NEXT: vdppd $7, %xmm1, %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: vdppd $7, (%rdi), %xmm0, %xmm0 # sched: [15:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vdppd $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vdppd $7, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_dppd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdppd $7, %xmm1, %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: vdppd $7, (%rdi), %xmm0, %xmm0 # sched: [13:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_dppd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdppd $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vdppd $7, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_dppd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdppd $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vdppd $7, (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse41.dppd(<2 x double> %a0, <2 x double> %a1, i8 7)
%2 = load <2 x double>, <2 x double> *%a2, align 16
%3 = call <2 x double> @llvm.x86.sse41.dppd(<2 x double> %1, <2 x double> %2, i8 7)
ret <2 x double> %3
}
declare <2 x double> @llvm.x86.sse41.dppd(<2 x double>, <2 x double>, i8) nounwind readnone
define <4 x float> @test_dpps(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_dpps:
; GENERIC: # BB#0:
; GENERIC-NEXT: dpps $7, %xmm1, %xmm0
; GENERIC-NEXT: dpps $7, (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_dpps:
; SLM: # BB#0:
; SLM-NEXT: dpps $7, %xmm1, %xmm0 # sched: [3:1.00]
; SLM-NEXT: dpps $7, (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_dpps:
; SANDY: # BB#0:
-; SANDY-NEXT: vdpps $7, %xmm1, %xmm0, %xmm0 # sched: [12:2.00]
+; SANDY-NEXT: vdpps $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: vdpps $7, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_dpps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vdpps $7, %xmm1, %xmm0, %xmm0 # sched: [14:2.00]
; HASWELL-NEXT: vdpps $7, (%rdi), %xmm0, %xmm0 # sched: [18:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_dpps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vdpps $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vdpps $7, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_dpps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vdpps $7, %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vdpps $7, (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse41.dpps(<4 x float> %a0, <4 x float> %a1, i8 7)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse41.dpps(<4 x float> %1, <4 x float> %2, i8 7)
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse41.dpps(<4 x float>, <4 x float>, i8) nounwind readnone
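; dppd/dpps take an 8-bit immediate ($7 here) that controls which lanes feed
; the dot product and which result lanes are written; the models above price
; them very differently (e.g. a [3:1.00] register form on SANDY, BTVER2 and
; ZNVER1 against [14:2.00] on HASWELL for dpps).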
define <4 x float> @test_insertps(<4 x float> %a0, <4 x float> %a1, float *%a2) {
; GENERIC-LABEL: test_insertps:
; GENERIC: # BB#0:
; GENERIC-NEXT: insertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3]
; GENERIC-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0]
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_insertps:
; SLM: # BB#0:
; SLM-NEXT: insertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3] sched: [1:1.00]
; SLM-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_insertps:
; SANDY: # BB#0:
; SANDY-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3] sched: [1:1.00]
-; SANDY-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [7:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_insertps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3] sched: [1:1.00]
; HASWELL-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_insertps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3] sched: [1:0.50]
; BTVER2-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_insertps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vinsertps {{.*#+}} xmm0 = zero,xmm1[0],xmm0[2,3] sched: [1:0.50]
; ZNVER1-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],mem[0] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %a0, <4 x float> %a1, i8 17)
%2 = load float, float *%a2
%3 = insertelement <4 x float> %1, float %2, i32 3
ret <4 x float> %3
}
declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i8) nounwind readnone
define <2 x i64> @test_movntdqa(i8* %a0) {
; GENERIC-LABEL: test_movntdqa:
; GENERIC: # BB#0:
; GENERIC-NEXT: movntdqa (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_movntdqa:
; SLM: # BB#0:
; SLM-NEXT: movntdqa (%rdi), %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_movntdqa:
; SANDY: # BB#0:
-; SANDY-NEXT: vmovntdqa (%rdi), %xmm0 # sched: [6:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmovntdqa (%rdi), %xmm0 # sched: [4:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_movntdqa:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmovntdqa (%rdi), %xmm0 # sched: [4:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_movntdqa:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmovntdqa (%rdi), %xmm0 # sched: [5:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_movntdqa:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmovntdqa (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse41.movntdqa(i8* %a0)
ret <2 x i64> %1
}
declare <2 x i64> @llvm.x86.sse41.movntdqa(i8*) nounwind readnone
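; movntdqa is a non-temporal load hint; the models above still cost it like an
; ordinary aligned 16-byte load, so only each target's plain load latency
; appears in the checks.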
define <8 x i16> @test_mpsadbw(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_mpsadbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: mpsadbw $7, %xmm1, %xmm0
; GENERIC-NEXT: mpsadbw $7, (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_mpsadbw:
; SLM: # BB#0:
; SLM-NEXT: mpsadbw $7, %xmm1, %xmm0 # sched: [7:1.00]
; SLM-NEXT: mpsadbw $7, (%rdi), %xmm0 # sched: [10:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_mpsadbw:
; SANDY: # BB#0:
-; SANDY-NEXT: vmpsadbw $7, %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vmpsadbw $7, (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vmpsadbw $7, %xmm1, %xmm0, %xmm0 # sched: [6:1.00]
+; SANDY-NEXT: vmpsadbw $7, (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_mpsadbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vmpsadbw $7, %xmm1, %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: vmpsadbw $7, (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_mpsadbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vmpsadbw $7, %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; BTVER2-NEXT: vmpsadbw $7, (%rdi), %xmm0, %xmm0 # sched: [8:2.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_mpsadbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vmpsadbw $7, %xmm1, %xmm0, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: vmpsadbw $7, (%rdi), %xmm0, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse41.mpsadbw(<16 x i8> %a0, <16 x i8> %a1, i8 7)
%2 = bitcast <8 x i16> %1 to <16 x i8>
%3 = load <16 x i8>, <16 x i8> *%a2, align 16
%4 = call <8 x i16> @llvm.x86.sse41.mpsadbw(<16 x i8> %2, <16 x i8> %3, i8 7)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse41.mpsadbw(<16 x i8>, <16 x i8>, i8) nounwind readnone
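; The ZNVER1 [100:0.00] entries above read as placeholder values for an
; instruction the scheduling model does not yet describe (a reciprocal
; throughput of 0.00 is not physically meaningful), rather than a genuine
; 100-cycle estimate.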
define <8 x i16> @test_packusdw(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_packusdw:
; GENERIC: # BB#0:
; GENERIC-NEXT: packusdw %xmm1, %xmm0
; GENERIC-NEXT: packusdw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_packusdw:
; SLM: # BB#0:
; SLM-NEXT: packusdw %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: packusdw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_packusdw:
; SANDY: # BB#0:
; SANDY-NEXT: vpackusdw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpackusdw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpackusdw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_packusdw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpackusdw %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpackusdw (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_packusdw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpackusdw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpackusdw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_packusdw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpackusdw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpackusdw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse41.packusdw(<4 x i32> %a0, <4 x i32> %a1)
%2 = bitcast <8 x i16> %1 to <4 x i32>
%3 = load <4 x i32>, <4 x i32> *%a2, align 16
%4 = call <8 x i16> @llvm.x86.sse41.packusdw(<4 x i32> %2, <4 x i32> %3)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.sse41.packusdw(<4 x i32>, <4 x i32>) nounwind readnone
define <16 x i8> @test_pblendvb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> %a2, <16 x i8> *%a3) {
; GENERIC-LABEL: test_pblendvb:
; GENERIC: # BB#0:
; GENERIC-NEXT: movdqa %xmm0, %xmm3
; GENERIC-NEXT: movaps %xmm2, %xmm0
; GENERIC-NEXT: pblendvb %xmm0, %xmm1, %xmm3
; GENERIC-NEXT: pblendvb %xmm0, (%rdi), %xmm3
; GENERIC-NEXT: movdqa %xmm3, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pblendvb:
; SLM: # BB#0:
; SLM-NEXT: movdqa %xmm0, %xmm3 # sched: [1:0.50]
; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pblendvb %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
; SLM-NEXT: pblendvb %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
; SLM-NEXT: movdqa %xmm3, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pblendvb:
; SANDY: # BB#0:
; SANDY-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpblendvb %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpblendvb %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pblendvb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:2.00]
; HASWELL-NEXT: vpblendvb %xmm2, (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pblendvb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpblendvb %xmm2, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pblendvb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; ZNVER1-NEXT: vpblendvb %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> %a2)
%2 = load <16 x i8>, <16 x i8> *%a3, align 16
%3 = call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %1, <16 x i8> %2, <16 x i8> %a2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8>, <16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_pblendw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pblendw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7]
; GENERIC-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7]
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pblendw:
; SLM: # BB#0:
; SLM-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7] sched: [1:1.00]
; SLM-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pblendw:
; SANDY: # BB#0:
; SANDY-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7] sched: [1:0.50]
-; SANDY-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pblendw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7] sched: [1:1.00]
; HASWELL-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [4:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pblendw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7] sched: [1:0.50]
; BTVER2-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pblendw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7] sched: [1:0.50]
; ZNVER1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],mem[2,3],xmm0[4,5,6],mem[7] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> %a1, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = shufflevector <8 x i16> %1, <8 x i16> %2, <8 x i32> <i32 0, i32 1, i32 10, i32 11, i32 4, i32 5, i32 6, i32 15>
ret <8 x i16> %3
}
define <2 x i64> @test_pcmpeqq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_pcmpeqq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpeqq %xmm1, %xmm0
; GENERIC-NEXT: pcmpeqq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpeqq:
; SLM: # BB#0:
; SLM-NEXT: pcmpeqq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pcmpeqq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpeqq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpcmpeqq %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vpcmpeqq (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpcmpeqq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpcmpeqq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpeqq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpeqq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpcmpeqq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpeqq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpeqq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpeqq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpeqq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpeqq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpeqq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp eq <2 x i64> %a0, %a1
%2 = sext <2 x i1> %1 to <2 x i64>
%3 = load <2 x i64>, <2 x i64>*%a2, align 16
%4 = icmp eq <2 x i64> %2, %3
%5 = sext <2 x i1> %4 to <2 x i64>
ret <2 x i64> %5
}
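; The icmp eq plus sext <2 x i1> to <2 x i64> pair above is the canonical IR
; form that selects to a single pcmpeqq, which writes an all-ones or all-zeros
; mask per element; the second compare consumes the first mask directly.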
define i32 @test_pextrb(<16 x i8> %a0, i8 *%a1) {
; GENERIC-LABEL: test_pextrb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pextrb $3, %xmm0, %eax
; GENERIC-NEXT: pextrb $1, %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pextrb:
; SLM: # BB#0:
; SLM-NEXT: pextrb $3, %xmm0, %eax # sched: [1:1.00]
; SLM-NEXT: pextrb $1, %xmm0, (%rdi) # sched: [4:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pextrb:
; SANDY: # BB#0:
-; SANDY-NEXT: vpextrb $3, %xmm0, %eax # sched: [3:1.00]
+; SANDY-NEXT: vpextrb $3, %xmm0, %eax # sched: [1:0.50]
; SANDY-NEXT: vpextrb $1, %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pextrb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpextrb $3, %xmm0, %eax # sched: [1:1.00]
; HASWELL-NEXT: vpextrb $1, %xmm0, (%rdi) # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pextrb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpextrb $3, %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: vpextrb $1, %xmm0, (%rdi) # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pextrb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpextrb $3, %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vpextrb $1, %xmm0, (%rdi) # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = extractelement <16 x i8> %a0, i32 3
%2 = extractelement <16 x i8> %a0, i32 1
store i8 %2, i8 *%a1
%3 = zext i8 %1 to i32
ret i32 %3
}
define i32 @test_pextrd(<4 x i32> %a0, i32 *%a1) {
; GENERIC-LABEL: test_pextrd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pextrd $3, %xmm0, %eax
; GENERIC-NEXT: pextrd $1, %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pextrd:
; SLM: # BB#0:
; SLM-NEXT: pextrd $3, %xmm0, %eax # sched: [1:1.00]
; SLM-NEXT: pextrd $1, %xmm0, (%rdi) # sched: [4:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pextrd:
; SANDY: # BB#0:
-; SANDY-NEXT: vpextrd $3, %xmm0, %eax # sched: [3:1.00]
+; SANDY-NEXT: vpextrd $3, %xmm0, %eax # sched: [1:0.50]
; SANDY-NEXT: vpextrd $1, %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pextrd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpextrd $3, %xmm0, %eax # sched: [1:1.00]
; HASWELL-NEXT: vpextrd $1, %xmm0, (%rdi) # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pextrd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpextrd $3, %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: vpextrd $1, %xmm0, (%rdi) # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pextrd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpextrd $3, %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vpextrd $1, %xmm0, (%rdi) # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = extractelement <4 x i32> %a0, i32 3
%2 = extractelement <4 x i32> %a0, i32 1
store i32 %2, i32 *%a1
ret i32 %1
}
define i64 @test_pextrq(<2 x i64> %a0, <2 x i64> %a1, i64 *%a2) {
; GENERIC-LABEL: test_pextrq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pextrq $1, %xmm0, %rax
; GENERIC-NEXT: pextrq $1, %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pextrq:
; SLM: # BB#0:
; SLM-NEXT: pextrq $1, %xmm0, %rax # sched: [1:1.00]
; SLM-NEXT: pextrq $1, %xmm0, (%rdi) # sched: [4:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pextrq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpextrq $1, %xmm0, %rax # sched: [3:1.00]
+; SANDY-NEXT: vpextrq $1, %xmm0, %rax # sched: [1:0.50]
; SANDY-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pextrq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpextrq $1, %xmm0, %rax # sched: [1:1.00]
; HASWELL-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pextrq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpextrq $1, %xmm0, %rax # sched: [1:0.50]
; BTVER2-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pextrq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpextrq $1, %xmm0, %rax # sched: [1:0.25]
; ZNVER1-NEXT: vpextrq $1, %xmm0, (%rdi) # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = extractelement <2 x i64> %a0, i32 1
%2 = extractelement <2 x i64> %a0, i32 1
store i64 %2, i64 *%a2
ret i64 %1
}
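; Both extractelement operations above use index 1, so the checks expect two
; pextrq $1 instructions: one extracting to a GPR for the return value and one
; storing straight to memory.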
define i32 @test_pextrw(<8 x i16> %a0, i16 *%a1) {
; GENERIC-LABEL: test_pextrw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pextrw $3, %xmm0, %eax
; GENERIC-NEXT: pextrw $1, %xmm0, (%rdi)
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pextrw:
; SLM: # BB#0:
; SLM-NEXT: pextrw $3, %xmm0, %eax # sched: [4:1.00]
; SLM-NEXT: pextrw $1, %xmm0, (%rdi) # sched: [4:2.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pextrw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpextrw $3, %xmm0, %eax # sched: [3:1.00]
+; SANDY-NEXT: vpextrw $3, %xmm0, %eax # sched: [1:0.50]
; SANDY-NEXT: vpextrw $1, %xmm0, (%rdi) # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pextrw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpextrw $3, %xmm0, %eax # sched: [1:1.00]
; HASWELL-NEXT: vpextrw $1, %xmm0, (%rdi) # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pextrw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpextrw $3, %xmm0, %eax # sched: [1:0.50]
; BTVER2-NEXT: vpextrw $1, %xmm0, (%rdi) # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pextrw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpextrw $3, %xmm0, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vpextrw $1, %xmm0, (%rdi) # sched: [8:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = extractelement <8 x i16> %a0, i32 3
%2 = extractelement <8 x i16> %a0, i32 1
store i16 %2, i16 *%a1
%3 = zext i16 %1 to i32
ret i32 %3
}
define <8 x i16> @test_phminposuw(<8 x i16> *%a0) {
; GENERIC-LABEL: test_phminposuw:
; GENERIC: # BB#0:
; GENERIC-NEXT: phminposuw (%rdi), %xmm0
; GENERIC-NEXT: phminposuw %xmm0, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_phminposuw:
; SLM: # BB#0:
; SLM-NEXT: phminposuw (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: phminposuw %xmm0, %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phminposuw:
; SANDY: # BB#0:
-; SANDY-NEXT: vphminposuw (%rdi), %xmm0 # sched: [11:1.00]
+; SANDY-NEXT: vphminposuw (%rdi), %xmm0 # sched: [9:1.00]
; SANDY-NEXT: vphminposuw %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phminposuw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphminposuw (%rdi), %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: vphminposuw %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phminposuw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphminposuw (%rdi), %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: vphminposuw %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phminposuw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphminposuw (%rdi), %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: vphminposuw %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = load <8 x i16>, <8 x i16> *%a0, align 16
%2 = call <8 x i16> @llvm.x86.sse41.phminposuw(<8 x i16> %1)
%3 = call <8 x i16> @llvm.x86.sse41.phminposuw(<8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse41.phminposuw(<8 x i16>) nounwind readnone
define <16 x i8> @test_pinsrb(<16 x i8> %a0, i8 %a1, i8 *%a2) {
; GENERIC-LABEL: test_pinsrb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pinsrb $1, %edi, %xmm0
; GENERIC-NEXT: pinsrb $3, (%rsi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pinsrb:
; SLM: # BB#0:
; SLM-NEXT: pinsrb $1, %edi, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pinsrb $3, (%rsi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pinsrb:
; SANDY: # BB#0:
-; SANDY-NEXT: vpinsrb $1, %edi, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpinsrb $3, (%rsi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpinsrb $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpinsrb $3, (%rsi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pinsrb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpinsrb $1, %edi, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpinsrb $3, (%rsi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pinsrb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpinsrb $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpinsrb $3, (%rsi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pinsrb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpinsrb $1, %edi, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpinsrb $3, (%rsi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <16 x i8> %a0, i8 %a1, i32 1
%2 = load i8, i8 *%a2
%3 = insertelement <16 x i8> %1, i8 %2, i32 3
ret <16 x i8> %3
}
define <4 x i32> @test_pinsrd(<4 x i32> %a0, i32 %a1, i32 *%a2) {
; GENERIC-LABEL: test_pinsrd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pinsrd $1, %edi, %xmm0
; GENERIC-NEXT: pinsrd $3, (%rsi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pinsrd:
; SLM: # BB#0:
; SLM-NEXT: pinsrd $1, %edi, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pinsrd $3, (%rsi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pinsrd:
; SANDY: # BB#0:
-; SANDY-NEXT: vpinsrd $1, %edi, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpinsrd $3, (%rsi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpinsrd $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpinsrd $3, (%rsi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pinsrd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpinsrd $1, %edi, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpinsrd $3, (%rsi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pinsrd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpinsrd $1, %edi, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpinsrd $3, (%rsi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pinsrd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpinsrd $1, %edi, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpinsrd $3, (%rsi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <4 x i32> %a0, i32 %a1, i32 1
%2 = load i32, i32 *%a2
%3 = insertelement <4 x i32> %1, i32 %2, i32 3
ret <4 x i32> %3
}
define <2 x i64> @test_pinsrq(<2 x i64> %a0, <2 x i64> %a1, i64 %a2, i64 *%a3) {
; GENERIC-LABEL: test_pinsrq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pinsrq $1, %rdi, %xmm0
; GENERIC-NEXT: pinsrq $1, (%rsi), %xmm1
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pinsrq:
; SLM: # BB#0:
; SLM-NEXT: pinsrq $1, (%rsi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pinsrq $1, %rdi, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pinsrq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pinsrq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pinsrq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pinsrq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpinsrq $1, (%rsi), %xmm1, %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = insertelement <2 x i64> %a0, i64 %a2, i32 1
%2 = load i64, i64 *%a3
%3 = insertelement <2 x i64> %a1, i64 %2, i32 1
%4 = add <2 x i64> %1, %3
ret <2 x i64> %4
}
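; test_pinsrq builds one vector with a GPR-source insert and another with a
; memory-source insert before adding them, so both pinsrq forms survive into
; the output; the per-target ordering of the two inserts differs only because
; scheduling may reorder the independent operations.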
define <16 x i8> @test_pmaxsb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pmaxsb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxsb %xmm1, %xmm0
; GENERIC-NEXT: pmaxsb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmaxsb:
; SLM: # BB#0:
; SLM-NEXT: pmaxsb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxsb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxsb:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxsb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxsb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxsb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxsb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxsb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxsb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxsb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse41.pmaxsb(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse41.pmaxsb(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse41.pmaxsb(<16 x i8>, <16 x i8>) nounwind readnone
define <4 x i32> @test_pmaxsd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pmaxsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxsd %xmm1, %xmm0
; GENERIC-NEXT: pmaxsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmaxsd:
; SLM: # BB#0:
; SLM-NEXT: pmaxsd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxsd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxsd:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxsd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxsd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxsd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxsd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxsd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxsd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse41.pmaxsd(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse41.pmaxsd(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.sse41.pmaxsd(<4 x i32>, <4 x i32>) nounwind readnone
define <4 x i32> @test_pmaxud(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pmaxud:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxud %xmm1, %xmm0
; GENERIC-NEXT: pmaxud (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmaxud:
; SLM: # BB#0:
; SLM-NEXT: pmaxud %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxud (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxud:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxud (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxud (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxud:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxud (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxud:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxud (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxud:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxud %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxud (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse41.pmaxud(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse41.pmaxud(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.sse41.pmaxud(<4 x i32>, <4 x i32>) nounwind readnone
define <8 x i16> @test_pmaxuw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmaxuw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaxuw %xmm1, %xmm0
; GENERIC-NEXT: pmaxuw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmaxuw:
; SLM: # BB#0:
; SLM-NEXT: pmaxuw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pmaxuw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaxuw:
; SANDY: # BB#0:
; SANDY-NEXT: vpmaxuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmaxuw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmaxuw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaxuw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaxuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpmaxuw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaxuw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaxuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpmaxuw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaxuw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaxuw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpmaxuw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse41.pmaxuw(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse41.pmaxuw(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse41.pmaxuw(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_pminsb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pminsb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminsb %xmm1, %xmm0
; GENERIC-NEXT: pminsb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pminsb:
; SLM: # BB#0:
; SLM-NEXT: pminsb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminsb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminsb:
; SANDY: # BB#0:
; SANDY-NEXT: vpminsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminsb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminsb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminsb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminsb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminsb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminsb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminsb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminsb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminsb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse41.pminsb(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse41.pminsb(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse41.pminsb(<16 x i8>, <16 x i8>) nounwind readnone
define <4 x i32> @test_pminsd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pminsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminsd %xmm1, %xmm0
; GENERIC-NEXT: pminsd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pminsd:
; SLM: # BB#0:
; SLM-NEXT: pminsd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminsd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminsd:
; SANDY: # BB#0:
; SANDY-NEXT: vpminsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminsd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminsd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminsd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminsd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminsd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminsd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminsd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse41.pminsd(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse41.pminsd(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.sse41.pminsd(<4 x i32>, <4 x i32>) nounwind readnone
define <4 x i32> @test_pminud(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pminud:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminud %xmm1, %xmm0
; GENERIC-NEXT: pminud (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pminud:
; SLM: # BB#0:
; SLM-NEXT: pminud %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminud (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminud:
; SANDY: # BB#0:
; SANDY-NEXT: vpminud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminud (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminud (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminud:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminud (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminud:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminud %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminud (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminud:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminud %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminud (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.sse41.pminud(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.sse41.pminud(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.sse41.pminud(<4 x i32>, <4 x i32>) nounwind readnone
define <8 x i16> @test_pminuw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pminuw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pminuw %xmm1, %xmm0
; GENERIC-NEXT: pminuw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pminuw:
; SLM: # BB#0:
; SLM-NEXT: pminuw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pminuw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pminuw:
; SANDY: # BB#0:
; SANDY-NEXT: vpminuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpminuw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpminuw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pminuw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpminuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpminuw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pminuw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpminuw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpminuw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pminuw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpminuw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpminuw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.sse41.pminuw(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.sse41.pminuw(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.sse41.pminuw(<8 x i16>, <8 x i16>) nounwind readnone
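; The pmax*/pmin* tests all follow the same reg-reg then reg-mem pattern, and
; the memory-form numbers are consistently the register-form latency plus each
; model's load latency: +4 cycles on SANDY and HASWELL, +5 on BTVER2, +7 on
; ZNVER1.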
define <8 x i16> @test_pmovsxbw(<16 x i8> %a0, <8 x i8> *%a1) {
; GENERIC-LABEL: test_pmovsxbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxbw %xmm0, %xmm1
; GENERIC-NEXT: pmovsxbw (%rdi), %xmm0
; GENERIC-NEXT: paddw %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxbw:
; SLM: # BB#0:
; SLM-NEXT: pmovsxbw (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxbw %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddw %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxbw:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxbw %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxbw (%rdi), %xmm1 # sched: [7:0.50]
-; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmovsxbw (%rdi), %xmm1 # sched: [5:0.50]
+; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxbw %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxbw (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxbw (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxbw %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxbw (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxbw %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%2 = sext <8 x i8> %1 to <8 x i16>
%3 = load <8 x i8>, <8 x i8>* %a1, align 1
%4 = sext <8 x i8> %3 to <8 x i16>
%5 = add <8 x i16> %2, %4
ret <8 x i16> %5
}
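; The sign-extension tests load their narrow source vectors with align 1 since
; only the low 8/4/2 bytes are read; pmovsxbw/bd/bq fold that load, so the
; memory-form latencies above are the register-form latency plus the load.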
define <4 x i32> @test_pmovsxbd(<16 x i8> %a0, <4 x i8> *%a1) {
; GENERIC-LABEL: test_pmovsxbd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxbd %xmm0, %xmm1
; GENERIC-NEXT: pmovsxbd (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxbd:
; SLM: # BB#0:
; SLM-NEXT: pmovsxbd (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxbd %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxbd:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxbd %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxbd (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpmovsxbd (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxbd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxbd %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxbd (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxbd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxbd (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxbd %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxbd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxbd (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxbd %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = sext <4 x i8> %1 to <4 x i32>
%3 = load <4 x i8>, <4 x i8>* %a1, align 1
%4 = sext <4 x i8> %3 to <4 x i32>
%5 = add <4 x i32> %2, %4
ret <4 x i32> %5
}

define <2 x i64> @test_pmovsxbq(<16 x i8> %a0, <2 x i8> *%a1) {
; GENERIC-LABEL: test_pmovsxbq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxbq %xmm0, %xmm1
; GENERIC-NEXT: pmovsxbq (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxbq:
; SLM: # BB#0:
; SLM-NEXT: pmovsxbq (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxbq %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxbq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxbq %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxbq (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpmovsxbq (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxbq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxbq %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxbq (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxbq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxbq (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxbq %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxbq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxbq (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxbq %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <2 x i32> <i32 0, i32 1>
%2 = sext <2 x i8> %1 to <2 x i64>
%3 = load <2 x i8>, <2 x i8>* %a1, align 1
%4 = sext <2 x i8> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <2 x i64> @test_pmovsxdq(<4 x i32> %a0, <2 x i32> *%a1) {
; GENERIC-LABEL: test_pmovsxdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxdq %xmm0, %xmm1
; GENERIC-NEXT: pmovsxdq (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxdq:
; SLM: # BB#0:
; SLM-NEXT: pmovsxdq (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxdq %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxdq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxdq %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxdq (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpmovsxdq (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxdq %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxdq (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxdq (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxdq %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxdq (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxdq %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%2 = sext <2 x i32> %1 to <2 x i64>
%3 = load <2 x i32>, <2 x i32>* %a1, align 1
%4 = sext <2 x i32> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <4 x i32> @test_pmovsxwd(<8 x i16> %a0, <4 x i16> *%a1) {
; GENERIC-LABEL: test_pmovsxwd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxwd %xmm0, %xmm1
; GENERIC-NEXT: pmovsxwd (%rdi), %xmm0
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxwd:
; SLM: # BB#0:
; SLM-NEXT: pmovsxwd (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxwd %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxwd:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxwd %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxwd (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpmovsxwd (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxwd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxwd %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxwd (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxwd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxwd (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxwd %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxwd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxwd (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxwd %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = sext <4 x i16> %1 to <4 x i32>
%3 = load <4 x i16>, <4 x i16>* %a1, align 1
%4 = sext <4 x i16> %3 to <4 x i32>
%5 = add <4 x i32> %2, %4
ret <4 x i32> %5
}

define <2 x i64> @test_pmovsxwq(<8 x i16> %a0, <2 x i16> *%a1) {
; GENERIC-LABEL: test_pmovsxwq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovsxwq %xmm0, %xmm1
; GENERIC-NEXT: pmovsxwq (%rdi), %xmm0
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovsxwq:
; SLM: # BB#0:
; SLM-NEXT: pmovsxwq (%rdi), %xmm1 # sched: [4:1.00]
; SLM-NEXT: pmovsxwq %xmm0, %xmm0 # sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovsxwq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovsxwq %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpmovsxwq (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpmovsxwq (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovsxwq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovsxwq %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpmovsxwq (%rdi), %xmm1 # sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovsxwq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovsxwq (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpmovsxwq %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovsxwq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovsxwq (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpmovsxwq %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <2 x i32> <i32 0, i32 1>
%2 = sext <2 x i16> %1 to <2 x i64>
%3 = load <2 x i16>, <2 x i16>* %a1, align 1
%4 = sext <2 x i16> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <8 x i16> @test_pmovzxbw(<16 x i8> %a0, <8 x i8> *%a1) {
; GENERIC-LABEL: test_pmovzxbw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxbw {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
; GENERIC-NEXT: pmovzxbw {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
; GENERIC-NEXT: paddw %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxbw:
; SLM: # BB#0:
; SLM-NEXT: pmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [4:1.00]
; SLM-NEXT: pmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero sched: [1:1.00]
; SLM-NEXT: paddw %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxbw:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [7:0.50]
-; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [5:0.50]
+; SANDY-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxbw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [5:1.00]
; HASWELL-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxbw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero sched: [1:0.50]
; BTVER2-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxbw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxbw {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%2 = zext <8 x i8> %1 to <8 x i16>
%3 = load <8 x i8>, <8 x i8>* %a1, align 1
%4 = zext <8 x i8> %3 to <8 x i16>
%5 = add <8 x i16> %2, %4
ret <8 x i16> %5
}

define <4 x i32> @test_pmovzxbd(<16 x i8> %a0, <4 x i8> *%a1) {
; GENERIC-LABEL: test_pmovzxbd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxbd {{.*#+}} xmm1 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
; GENERIC-NEXT: pmovzxbd {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxbd:
; SLM: # BB#0:
; SLM-NEXT: pmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [4:1.00]
; SLM-NEXT: pmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxbd:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [7:0.50]
+; SANDY-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxbd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxbd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero sched: [1:0.50]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxbd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxbd {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = zext <4 x i8> %1 to <4 x i32>
%3 = load <4 x i8>, <4 x i8>* %a1, align 1
%4 = zext <4 x i8> %3 to <4 x i32>
%5 = add <4 x i32> %2, %4
ret <4 x i32> %5
}

define <2 x i64> @test_pmovzxbq(<16 x i8> %a0, <2 x i8> *%a1) {
; GENERIC-LABEL: test_pmovzxbq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxbq {{.*#+}} xmm1 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero
; GENERIC-NEXT: pmovzxbq {{.*#+}} xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxbq:
; SLM: # BB#0:
; SLM-NEXT: pmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [4:1.00]
; SLM-NEXT: pmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxbq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [7:0.50]
+; SANDY-NEXT: vpmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxbq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxbq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxbq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxbq {{.*#+}} xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxbq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,zero,zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <16 x i8> %a0, <16 x i8> undef, <2 x i32> <i32 0, i32 1>
%2 = zext <2 x i8> %1 to <2 x i64>
%3 = load <2 x i8>, <2 x i8>* %a1, align 1
%4 = zext <2 x i8> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <2 x i64> @test_pmovzxdq(<4 x i32> %a0, <2 x i32> *%a1) {
; GENERIC-LABEL: test_pmovzxdq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxdq {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero
; GENERIC-NEXT: pmovzxdq {{.*#+}} xmm0 = mem[0],zero,mem[1],zero
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxdq:
; SLM: # BB#0:
; SLM-NEXT: pmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [4:1.00]
; SLM-NEXT: pmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxdq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [7:0.50]
+; SANDY-NEXT: vpmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxdq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxdq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxdq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxdq {{.*#+}} xmm1 = mem[0],zero,mem[1],zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <4 x i32> %a0, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
%2 = zext <2 x i32> %1 to <2 x i64>
%3 = load <2 x i32>, <2 x i32>* %a1, align 1
%4 = zext <2 x i32> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <4 x i32> @test_pmovzxwd(<8 x i16> %a0, <4 x i16> *%a1) {
; GENERIC-LABEL: test_pmovzxwd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxwd {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
; GENERIC-NEXT: pmovzxwd {{.*#+}} xmm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
; GENERIC-NEXT: paddd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxwd:
; SLM: # BB#0:
; SLM-NEXT: pmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [4:1.00]
; SLM-NEXT: pmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero sched: [1:1.00]
; SLM-NEXT: paddd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxwd:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [7:0.50]
+; SANDY-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [5:0.50]
; SANDY-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxwd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [5:1.00]
; HASWELL-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxwd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero sched: [1:0.50]
; BTVER2-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxwd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxwd {{.*#+}} xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%2 = zext <4 x i16> %1 to <4 x i32>
%3 = load <4 x i16>, <4 x i16>* %a1, align 1
%4 = zext <4 x i16> %3 to <4 x i32>
%5 = add <4 x i32> %2, %4
ret <4 x i32> %5
}

define <2 x i64> @test_pmovzxwq(<8 x i16> %a0, <2 x i16> *%a1) {
; GENERIC-LABEL: test_pmovzxwq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmovzxwq {{.*#+}} xmm1 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero
; GENERIC-NEXT: pmovzxwq {{.*#+}} xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero
; GENERIC-NEXT: paddq %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmovzxwq:
; SLM: # BB#0:
; SLM-NEXT: pmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [4:1.00]
; SLM-NEXT: pmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero sched: [1:1.00]
; SLM-NEXT: paddq %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmovzxwq:
; SANDY: # BB#0:
; SANDY-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero sched: [1:0.50]
-; SANDY-NEXT: vpmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [7:0.50]
+; SANDY-NEXT: vpmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [5:0.50]
; SANDY-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmovzxwq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero sched: [1:1.00]
; HASWELL-NEXT: vpmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [5:1.00]
; HASWELL-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmovzxwq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [6:1.00]
; BTVER2-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero sched: [1:0.50]
; BTVER2-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmovzxwq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmovzxwq {{.*#+}} xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero sched: [8:0.50]
; ZNVER1-NEXT: vpmovzxwq {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero sched: [1:0.25]
; ZNVER1-NEXT: vpaddq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> undef, <2 x i32> <i32 0, i32 1>
%2 = zext <2 x i16> %1 to <2 x i64>
%3 = load <2 x i16>, <2 x i16>* %a1, align 1
%4 = zext <2 x i16> %3 to <2 x i64>
%5 = add <2 x i64> %2, %4
ret <2 x i64> %5
}

define <2 x i64> @test_pmuldq(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pmuldq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmuldq %xmm1, %xmm0
; GENERIC-NEXT: pmuldq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmuldq:
; SLM: # BB#0:
; SLM-NEXT: pmuldq %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmuldq (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmuldq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmuldq %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmuldq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmuldq (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmuldq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmuldq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmuldq (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmuldq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmuldq %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmuldq (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmuldq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmuldq %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmuldq (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x i64> @llvm.x86.sse41.pmuldq(<4 x i32> %a0, <4 x i32> %a1)
%2 = bitcast <2 x i64> %1 to <4 x i32>
%3 = load <4 x i32>, <4 x i32> *%a2, align 16
%4 = call <2 x i64> @llvm.x86.sse41.pmuldq(<4 x i32> %2, <4 x i32> %3)
ret <2 x i64> %4
}
declare <2 x i64> @llvm.x86.sse41.pmuldq(<4 x i32>, <4 x i32>) nounwind readnone

define <4 x i32> @test_pmulld(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_pmulld:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmulld %xmm1, %xmm0
; GENERIC-NEXT: pmulld (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pmulld:
; SLM: # BB#0:
; SLM-NEXT: pmulld %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmulld (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmulld:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmulld %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmulld %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmulld (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmulld:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmulld %xmm1, %xmm0, %xmm0 # sched: [10:2.00]
; HASWELL-NEXT: vpmulld (%rdi), %xmm0, %xmm0 # sched: [10:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmulld:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmulld %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmulld (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmulld:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmulld %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmulld (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = mul <4 x i32> %a0, %a1
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = mul <4 x i32> %1, %2
ret <4 x i32> %3
}

define i32 @test_ptest(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_ptest:
; GENERIC: # BB#0:
; GENERIC-NEXT: ptest %xmm1, %xmm0
; GENERIC-NEXT: setb %al
; GENERIC-NEXT: ptest (%rdi), %xmm0
; GENERIC-NEXT: setb %cl
; GENERIC-NEXT: andb %al, %cl
; GENERIC-NEXT: movzbl %cl, %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_ptest:
; SLM: # BB#0:
; SLM-NEXT: ptest %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: setb %al # sched: [1:0.50]
; SLM-NEXT: ptest (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: setb %cl # sched: [1:0.50]
; SLM-NEXT: andb %al, %cl # sched: [1:0.50]
; SLM-NEXT: movzbl %cl, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_ptest:
; SANDY: # BB#0:
-; SANDY-NEXT: vptest %xmm1, %xmm0 # sched: [2:1.00]
-; SANDY-NEXT: setb %al # sched: [1:1.00]
-; SANDY-NEXT: vptest (%rdi), %xmm0 # sched: [8:1.00]
-; SANDY-NEXT: setb %cl # sched: [1:1.00]
+; SANDY-NEXT: vptest %xmm1, %xmm0 # sched: [1:0.33]
+; SANDY-NEXT: setb %al # sched: [1:0.33]
+; SANDY-NEXT: vptest (%rdi), %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: setb %cl # sched: [1:0.33]
; SANDY-NEXT: andb %al, %cl # sched: [1:0.33]
; SANDY-NEXT: movzbl %cl, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_ptest:
; HASWELL: # BB#0:
; HASWELL-NEXT: vptest %xmm1, %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: setb %al # sched: [1:0.50]
; HASWELL-NEXT: vptest (%rdi), %xmm0 # sched: [2:1.00]
; HASWELL-NEXT: setb %cl # sched: [1:0.50]
; HASWELL-NEXT: andb %al, %cl # sched: [1:0.25]
; HASWELL-NEXT: movzbl %cl, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_ptest:
; BTVER2: # BB#0:
; BTVER2-NEXT: vptest %xmm1, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: setb %al # sched: [1:0.50]
; BTVER2-NEXT: vptest (%rdi), %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: setb %cl # sched: [1:0.50]
; BTVER2-NEXT: andb %al, %cl # sched: [1:0.50]
; BTVER2-NEXT: movzbl %cl, %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_ptest:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vptest %xmm1, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: setb %al # sched: [1:0.25]
; ZNVER1-NEXT: vptest (%rdi), %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: setb %cl # sched: [1:0.25]
; ZNVER1-NEXT: andb %al, %cl # sched: [1:0.25]
; ZNVER1-NEXT: movzbl %cl, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse41.ptestc(<2 x i64> %a0, <2 x i64> %a1)
%2 = load <2 x i64>, <2 x i64> *%a2, align 16
%3 = call i32 @llvm.x86.sse41.ptestc(<2 x i64> %a0, <2 x i64> %2)
%4 = and i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse41.ptestc(<2 x i64>, <2 x i64>) nounwind readnone

define <2 x double> @test_roundpd(<2 x double> %a0, <2 x double> *%a1) {
; GENERIC-LABEL: test_roundpd:
; GENERIC: # BB#0:
; GENERIC-NEXT: roundpd $7, %xmm0, %xmm1
; GENERIC-NEXT: roundpd $7, (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_roundpd:
; SLM: # BB#0:
; SLM-NEXT: roundpd $7, (%rdi), %xmm1 # sched: [6:1.00]
; SLM-NEXT: roundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_roundpd:
; SANDY: # BB#0:
; SANDY-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundpd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [10:2.00]
; HASWELL-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundpd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundpd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [10:1.00]
; ZNVER1-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a0, i32 7)
%2 = load <2 x double>, <2 x double> *%a1, align 16
%3 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %2, i32 7)
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
declare <2 x double> @llvm.x86.sse41.round.pd(<2 x double>, i32) nounwind readnone

define <4 x float> @test_roundps(<4 x float> %a0, <4 x float> *%a1) {
; GENERIC-LABEL: test_roundps:
; GENERIC: # BB#0:
; GENERIC-NEXT: roundps $7, %xmm0, %xmm1
; GENERIC-NEXT: roundps $7, (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_roundps:
; SLM: # BB#0:
; SLM-NEXT: roundps $7, (%rdi), %xmm1 # sched: [6:1.00]
; SLM-NEXT: roundps $7, %xmm0, %xmm0 # sched: [3:1.00]
; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_roundps:
; SANDY: # BB#0:
; SANDY-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [9:1.00]
+; SANDY-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [7:1.00]
; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundps:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [10:2.00]
; HASWELL-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundps:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [8:1.00]
; BTVER2-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundps:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [10:1.00]
; ZNVER1-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a0, i32 7)
%2 = load <4 x float>, <4 x float> *%a1, align 16
%3 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %2, i32 7)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse41.round.ps(<4 x float>, i32) nounwind readnone

define <2 x double> @test_roundsd(<2 x double> %a0, <2 x double> %a1, <2 x double> *%a2) {
; GENERIC-LABEL: test_roundsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: movaps %xmm0, %xmm2
; GENERIC-NEXT: roundsd $7, %xmm1, %xmm2
; GENERIC-NEXT: roundsd $7, (%rdi), %xmm0
; GENERIC-NEXT: addpd %xmm2, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_roundsd:
; SLM: # BB#0:
; SLM-NEXT: movaps %xmm0, %xmm2 # sched: [1:1.00]
; SLM-NEXT: roundsd $7, (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: roundsd $7, %xmm1, %xmm2 # sched: [3:1.00]
; SLM-NEXT: addpd %xmm2, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_roundsd:
; SANDY: # BB#0:
; SANDY-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
-; SANDY-NEXT: vroundsd $7, (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: vroundsd $7, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; SANDY-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [6:2.00]
; HASWELL-NEXT: vroundsd $7, (%rdi), %xmm0, %xmm0 # sched: [10:2.00]
; HASWELL-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vroundsd $7, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; ZNVER1-NEXT: vroundsd $7, (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: vaddpd %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a0, <2 x double> %a1, i32 7)
%2 = load <2 x double>, <2 x double>* %a2, align 16
%3 = call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a0, <2 x double> %2, i32 7)
%4 = fadd <2 x double> %1, %3
ret <2 x double> %4
}
declare <2 x double> @llvm.x86.sse41.round.sd(<2 x double>, <2 x double>, i32) nounwind readnone

define <4 x float> @test_roundss(<4 x float> %a0, <4 x float> %a1, <4 x float> *%a2) {
; GENERIC-LABEL: test_roundss:
; GENERIC: # BB#0:
; GENERIC-NEXT: movaps %xmm0, %xmm2
; GENERIC-NEXT: roundss $7, %xmm1, %xmm2
; GENERIC-NEXT: roundss $7, (%rdi), %xmm0
; GENERIC-NEXT: addps %xmm2, %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_roundss:
; SLM: # BB#0:
; SLM-NEXT: movaps %xmm0, %xmm2 # sched: [1:1.00]
; SLM-NEXT: roundss $7, (%rdi), %xmm0 # sched: [6:1.00]
; SLM-NEXT: roundss $7, %xmm1, %xmm2 # sched: [3:1.00]
; SLM-NEXT: addps %xmm2, %xmm0 # sched: [3:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_roundss:
; SANDY: # BB#0:
; SANDY-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
-; SANDY-NEXT: vroundss $7, (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
+; SANDY-NEXT: vroundss $7, (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; SANDY-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_roundss:
; HASWELL: # BB#0:
; HASWELL-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [6:2.00]
; HASWELL-NEXT: vroundss $7, (%rdi), %xmm0, %xmm0 # sched: [10:2.00]
; HASWELL-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_roundss:
; BTVER2: # BB#0:
; BTVER2-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; BTVER2-NEXT: vroundss $7, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
; BTVER2-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_roundss:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
; ZNVER1-NEXT: vroundss $7, (%rdi), %xmm0, %xmm0 # sched: [10:1.00]
; ZNVER1-NEXT: vaddps %xmm0, %xmm1, %xmm0 # sched: [3:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a0, <4 x float> %a1, i32 7)
%2 = load <4 x float>, <4 x float> *%a2, align 16
%3 = call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a0, <4 x float> %2, i32 7)
%4 = fadd <4 x float> %1, %3
ret <4 x float> %4
}
declare <4 x float> @llvm.x86.sse41.round.ss(<4 x float>, <4 x float>, i32) nounwind readnone
diff --git a/test/CodeGen/X86/sse42-schedule.ll b/test/CodeGen/X86/sse42-schedule.ll
index adf857e12179..2a502e809bca 100644
--- a/test/CodeGen/X86/sse42-schedule.ll
+++ b/test/CodeGen/X86/sse42-schedule.ll
@@ -1,556 +1,556 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
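; Notation (editorial gloss, not part of the autogenerated checks): each
; "sched: [N:M]" annotation emitted by -print-schedule pairs the scheduling
; model's latency estimate (N cycles) with its reciprocal throughput (M);
; e.g. "sched: [7:1.00]" models a 7-cycle latency with at most one such
; instruction able to start per cycle on the CPU selected by -mcpu.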
define i32 @crc32_32_8(i32 %a0, i8 %a1, i8 *%a2) {
; GENERIC-LABEL: crc32_32_8:
; GENERIC: # BB#0:
; GENERIC-NEXT: crc32b %sil, %edi
; GENERIC-NEXT: crc32b (%rdx), %edi
; GENERIC-NEXT: movl %edi, %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: crc32_32_8:
; SLM: # BB#0:
; SLM-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; SLM-NEXT: crc32b (%rdx), %edi # sched: [6:1.00]
; SLM-NEXT: movl %edi, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: crc32_32_8:
; SANDY: # BB#0:
; SANDY-NEXT: crc32b %sil, %edi # sched: [3:1.00]
-; SANDY-NEXT: crc32b (%rdx), %edi # sched: [8:1.00]
+; SANDY-NEXT: crc32b (%rdx), %edi # sched: [7:1.00]
; SANDY-NEXT: movl %edi, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: crc32_32_8:
; HASWELL: # BB#0:
; HASWELL-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; HASWELL-NEXT: crc32b (%rdx), %edi # sched: [7:1.00]
; HASWELL-NEXT: movl %edi, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: crc32_32_8:
; BTVER2: # BB#0:
; BTVER2-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; BTVER2-NEXT: crc32b (%rdx), %edi # sched: [8:1.00]
; BTVER2-NEXT: movl %edi, %eax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: crc32_32_8:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; ZNVER1-NEXT: crc32b (%rdx), %edi # sched: [10:1.00]
; ZNVER1-NEXT: movl %edi, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse42.crc32.32.8(i32 %a0, i8 %a1)
%2 = load i8, i8 *%a2
%3 = call i32 @llvm.x86.sse42.crc32.32.8(i32 %1, i8 %2)
ret i32 %3
}
declare i32 @llvm.x86.sse42.crc32.32.8(i32, i8) nounwind

define i32 @crc32_32_16(i32 %a0, i16 %a1, i16 *%a2) {
; GENERIC-LABEL: crc32_32_16:
; GENERIC: # BB#0:
; GENERIC-NEXT: crc32w %si, %edi
; GENERIC-NEXT: crc32w (%rdx), %edi
; GENERIC-NEXT: movl %edi, %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: crc32_32_16:
; SLM: # BB#0:
; SLM-NEXT: crc32w %si, %edi # sched: [3:1.00]
; SLM-NEXT: crc32w (%rdx), %edi # sched: [6:1.00]
; SLM-NEXT: movl %edi, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: crc32_32_16:
; SANDY: # BB#0:
; SANDY-NEXT: crc32w %si, %edi # sched: [3:1.00]
-; SANDY-NEXT: crc32w (%rdx), %edi # sched: [8:1.00]
+; SANDY-NEXT: crc32w (%rdx), %edi # sched: [7:1.00]
; SANDY-NEXT: movl %edi, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: crc32_32_16:
; HASWELL: # BB#0:
; HASWELL-NEXT: crc32w %si, %edi # sched: [3:1.00]
; HASWELL-NEXT: crc32w (%rdx), %edi # sched: [7:1.00]
; HASWELL-NEXT: movl %edi, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: crc32_32_16:
; BTVER2: # BB#0:
; BTVER2-NEXT: crc32w %si, %edi # sched: [3:1.00]
; BTVER2-NEXT: crc32w (%rdx), %edi # sched: [8:1.00]
; BTVER2-NEXT: movl %edi, %eax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: crc32_32_16:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: crc32w %si, %edi # sched: [3:1.00]
; ZNVER1-NEXT: crc32w (%rdx), %edi # sched: [10:1.00]
; ZNVER1-NEXT: movl %edi, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse42.crc32.32.16(i32 %a0, i16 %a1)
%2 = load i16, i16 *%a2
%3 = call i32 @llvm.x86.sse42.crc32.32.16(i32 %1, i16 %2)
ret i32 %3
}
declare i32 @llvm.x86.sse42.crc32.32.16(i32, i16) nounwind

define i32 @crc32_32_32(i32 %a0, i32 %a1, i32 *%a2) {
; GENERIC-LABEL: crc32_32_32:
; GENERIC: # BB#0:
; GENERIC-NEXT: crc32l %esi, %edi
; GENERIC-NEXT: crc32l (%rdx), %edi
; GENERIC-NEXT: movl %edi, %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: crc32_32_32:
; SLM: # BB#0:
; SLM-NEXT: crc32l %esi, %edi # sched: [3:1.00]
; SLM-NEXT: crc32l (%rdx), %edi # sched: [6:1.00]
; SLM-NEXT: movl %edi, %eax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: crc32_32_32:
; SANDY: # BB#0:
; SANDY-NEXT: crc32l %esi, %edi # sched: [3:1.00]
; SANDY-NEXT: crc32l (%rdx), %edi # sched: [7:1.00]
; SANDY-NEXT: movl %edi, %eax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: crc32_32_32:
; HASWELL: # BB#0:
; HASWELL-NEXT: crc32l %esi, %edi # sched: [3:1.00]
; HASWELL-NEXT: crc32l (%rdx), %edi # sched: [7:1.00]
; HASWELL-NEXT: movl %edi, %eax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: crc32_32_32:
; BTVER2: # BB#0:
; BTVER2-NEXT: crc32l %esi, %edi # sched: [3:1.00]
; BTVER2-NEXT: crc32l (%rdx), %edi # sched: [8:1.00]
; BTVER2-NEXT: movl %edi, %eax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: crc32_32_32:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: crc32l %esi, %edi # sched: [3:1.00]
; ZNVER1-NEXT: crc32l (%rdx), %edi # sched: [10:1.00]
; ZNVER1-NEXT: movl %edi, %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse42.crc32.32.32(i32 %a0, i32 %a1)
%2 = load i32, i32 *%a2
%3 = call i32 @llvm.x86.sse42.crc32.32.32(i32 %1, i32 %2)
ret i32 %3
}
declare i32 @llvm.x86.sse42.crc32.32.32(i32, i32) nounwind

define i64 @crc32_64_8(i64 %a0, i8 %a1, i8 *%a2) nounwind {
; GENERIC-LABEL: crc32_64_8:
; GENERIC: # BB#0:
; GENERIC-NEXT: crc32b %sil, %edi
; GENERIC-NEXT: crc32b (%rdx), %edi
; GENERIC-NEXT: movq %rdi, %rax
; GENERIC-NEXT: retq
;
; SLM-LABEL: crc32_64_8:
; SLM: # BB#0:
; SLM-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; SLM-NEXT: crc32b (%rdx), %edi # sched: [6:1.00]
; SLM-NEXT: movq %rdi, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: crc32_64_8:
; SANDY: # BB#0:
; SANDY-NEXT: crc32b %sil, %edi # sched: [3:1.00]
-; SANDY-NEXT: crc32b (%rdx), %edi # sched: [8:1.00]
+; SANDY-NEXT: crc32b (%rdx), %edi # sched: [7:1.00]
; SANDY-NEXT: movq %rdi, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: crc32_64_8:
; HASWELL: # BB#0:
; HASWELL-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; HASWELL-NEXT: crc32b (%rdx), %edi # sched: [7:1.00]
; HASWELL-NEXT: movq %rdi, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: crc32_64_8:
; BTVER2: # BB#0:
; BTVER2-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; BTVER2-NEXT: crc32b (%rdx), %edi # sched: [8:1.00]
; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: crc32_64_8:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: crc32b %sil, %edi # sched: [3:1.00]
; ZNVER1-NEXT: crc32b (%rdx), %edi # sched: [10:1.00]
; ZNVER1-NEXT: movq %rdi, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i64 @llvm.x86.sse42.crc32.64.8(i64 %a0, i8 %a1)
%2 = load i8, i8 *%a2
%3 = call i64 @llvm.x86.sse42.crc32.64.8(i64 %1, i8 %2)
ret i64 %3
}
declare i64 @llvm.x86.sse42.crc32.64.8(i64, i8) nounwind

define i64 @crc32_64_64(i64 %a0, i64 %a1, i64 *%a2) {
; GENERIC-LABEL: crc32_64_64:
; GENERIC: # BB#0:
; GENERIC-NEXT: crc32q %rsi, %rdi
; GENERIC-NEXT: crc32q (%rdx), %rdi
; GENERIC-NEXT: movq %rdi, %rax
; GENERIC-NEXT: retq
;
; SLM-LABEL: crc32_64_64:
; SLM: # BB#0:
; SLM-NEXT: crc32q %rsi, %rdi # sched: [3:1.00]
; SLM-NEXT: crc32q (%rdx), %rdi # sched: [6:1.00]
; SLM-NEXT: movq %rdi, %rax # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: crc32_64_64:
; SANDY: # BB#0:
; SANDY-NEXT: crc32q %rsi, %rdi # sched: [3:1.00]
; SANDY-NEXT: crc32q (%rdx), %rdi # sched: [7:1.00]
; SANDY-NEXT: movq %rdi, %rax # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: crc32_64_64:
; HASWELL: # BB#0:
; HASWELL-NEXT: crc32q %rsi, %rdi # sched: [3:1.00]
; HASWELL-NEXT: crc32q (%rdx), %rdi # sched: [7:1.00]
; HASWELL-NEXT: movq %rdi, %rax # sched: [1:0.25]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: crc32_64_64:
; BTVER2: # BB#0:
; BTVER2-NEXT: crc32q %rsi, %rdi # sched: [3:1.00]
; BTVER2-NEXT: crc32q (%rdx), %rdi # sched: [8:1.00]
; BTVER2-NEXT: movq %rdi, %rax # sched: [1:0.17]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: crc32_64_64:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: crc32q %rsi, %rdi # sched: [3:1.00]
; ZNVER1-NEXT: crc32q (%rdx), %rdi # sched: [10:1.00]
; ZNVER1-NEXT: movq %rdi, %rax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i64 @llvm.x86.sse42.crc32.64.64(i64 %a0, i64 %a1)
%2 = load i64, i64 *%a2
%3 = call i64 @llvm.x86.sse42.crc32.64.64(i64 %1, i64 %2)
ret i64 %3
}
declare i64 @llvm.x86.sse42.crc32.64.64(i64, i64) nounwind

define i32 @test_pcmpestri(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpestri:
; GENERIC: # BB#0:
; GENERIC-NEXT: movl $7, %eax
; GENERIC-NEXT: movl $7, %edx
; GENERIC-NEXT: pcmpestri $7, %xmm1, %xmm0
; GENERIC-NEXT: movl %ecx, %esi
; GENERIC-NEXT: movl $7, %eax
; GENERIC-NEXT: movl $7, %edx
; GENERIC-NEXT: pcmpestri $7, (%rdi), %xmm0
; GENERIC-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; GENERIC-NEXT: leal (%rcx,%rsi), %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpestri:
; SLM: # BB#0:
; SLM-NEXT: movl $7, %eax # sched: [1:0.50]
; SLM-NEXT: movl $7, %edx # sched: [1:0.50]
; SLM-NEXT: pcmpestri $7, %xmm1, %xmm0 # sched: [21:21.00]
; SLM-NEXT: movl $7, %eax # sched: [1:0.50]
; SLM-NEXT: movl $7, %edx # sched: [1:0.50]
; SLM-NEXT: movl %ecx, %esi # sched: [1:0.50]
; SLM-NEXT: pcmpestri $7, (%rdi), %xmm0 # sched: [21:21.00]
; SLM-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; SLM-NEXT: leal (%rcx,%rsi), %eax # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpestri:
; SANDY: # BB#0:
; SANDY-NEXT: movl $7, %eax # sched: [1:0.33]
; SANDY-NEXT: movl $7, %edx # sched: [1:0.33]
; SANDY-NEXT: vpcmpestri $7, %xmm1, %xmm0 # sched: [4:2.67]
; SANDY-NEXT: movl %ecx, %esi # sched: [1:0.33]
; SANDY-NEXT: movl $7, %eax # sched: [1:0.33]
; SANDY-NEXT: movl $7, %edx # sched: [1:0.33]
; SANDY-NEXT: vpcmpestri $7, (%rdi), %xmm0 # sched: [4:2.33]
; SANDY-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; SANDY-NEXT: leal (%rcx,%rsi), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpestri:
; HASWELL: # BB#0:
; HASWELL-NEXT: movl $7, %eax # sched: [1:0.25]
; HASWELL-NEXT: movl $7, %edx # sched: [1:0.25]
; HASWELL-NEXT: vpcmpestri $7, %xmm1, %xmm0 # sched: [11:3.00]
; HASWELL-NEXT: movl %ecx, %esi # sched: [1:0.25]
; HASWELL-NEXT: movl $7, %eax # sched: [1:0.25]
; HASWELL-NEXT: movl $7, %edx # sched: [1:0.25]
; HASWELL-NEXT: vpcmpestri $7, (%rdi), %xmm0 # sched: [11:3.00]
; HASWELL-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; HASWELL-NEXT: leal (%rcx,%rsi), %eax # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpestri:
; BTVER2: # BB#0:
; BTVER2-NEXT: movl $7, %eax # sched: [1:0.17]
; BTVER2-NEXT: movl $7, %edx # sched: [1:0.17]
; BTVER2-NEXT: vpcmpestri $7, %xmm1, %xmm0 # sched: [13:2.50]
; BTVER2-NEXT: movl $7, %eax # sched: [1:0.17]
; BTVER2-NEXT: movl $7, %edx # sched: [1:0.17]
; BTVER2-NEXT: movl %ecx, %esi # sched: [1:0.17]
; BTVER2-NEXT: vpcmpestri $7, (%rdi), %xmm0 # sched: [18:2.50]
; BTVER2-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; BTVER2-NEXT: leal (%rcx,%rsi), %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpestri:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: movl $7, %eax # sched: [1:0.25]
; ZNVER1-NEXT: movl $7, %edx # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpestri $7, %xmm1, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: movl $7, %eax # sched: [1:0.25]
; ZNVER1-NEXT: movl $7, %edx # sched: [1:0.25]
; ZNVER1-NEXT: movl %ecx, %esi # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpestri $7, (%rdi), %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; ZNVER1-NEXT: leal (%rcx,%rsi), %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse42.pcmpestri128(<16 x i8> %a0, i32 7, <16 x i8> %a1, i32 7, i8 7)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call i32 @llvm.x86.sse42.pcmpestri128(<16 x i8> %a0, i32 7, <16 x i8> %2, i32 7, i8 7)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse42.pcmpestri128(<16 x i8>, i32, <16 x i8>, i32, i8) nounwind readnone

define <16 x i8> @test_pcmpestrm(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpestrm:
; GENERIC: # BB#0:
; GENERIC-NEXT: movl $7, %eax
; GENERIC-NEXT: movl $7, %edx
; GENERIC-NEXT: pcmpestrm $7, %xmm1, %xmm0
; GENERIC-NEXT: movl $7, %eax
; GENERIC-NEXT: movl $7, %edx
; GENERIC-NEXT: pcmpestrm $7, (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpestrm:
; SLM: # BB#0:
; SLM-NEXT: movl $7, %eax # sched: [1:0.50]
; SLM-NEXT: movl $7, %edx # sched: [1:0.50]
; SLM-NEXT: pcmpestrm $7, %xmm1, %xmm0 # sched: [17:17.00]
; SLM-NEXT: movl $7, %eax # sched: [1:0.50]
; SLM-NEXT: movl $7, %edx # sched: [1:0.50]
; SLM-NEXT: pcmpestrm $7, (%rdi), %xmm0 # sched: [17:17.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpestrm:
; SANDY: # BB#0:
; SANDY-NEXT: movl $7, %eax # sched: [1:0.33]
; SANDY-NEXT: movl $7, %edx # sched: [1:0.33]
; SANDY-NEXT: vpcmpestrm $7, %xmm1, %xmm0 # sched: [11:2.67]
; SANDY-NEXT: movl $7, %eax # sched: [1:0.33]
; SANDY-NEXT: movl $7, %edx # sched: [1:0.33]
; SANDY-NEXT: vpcmpestrm $7, (%rdi), %xmm0 # sched: [11:2.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpestrm:
; HASWELL: # BB#0:
; HASWELL-NEXT: movl $7, %eax # sched: [1:0.25]
; HASWELL-NEXT: movl $7, %edx # sched: [1:0.25]
; HASWELL-NEXT: vpcmpestrm $7, %xmm1, %xmm0 # sched: [10:4.00]
; HASWELL-NEXT: movl $7, %eax # sched: [1:0.25]
; HASWELL-NEXT: movl $7, %edx # sched: [1:0.25]
; HASWELL-NEXT: vpcmpestrm $7, (%rdi), %xmm0 # sched: [10:3.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpestrm:
; BTVER2: # BB#0:
; BTVER2-NEXT: movl $7, %eax # sched: [1:0.17]
; BTVER2-NEXT: movl $7, %edx # sched: [1:0.17]
; BTVER2-NEXT: vpcmpestrm $7, %xmm1, %xmm0 # sched: [13:2.50]
; BTVER2-NEXT: movl $7, %eax # sched: [1:0.17]
; BTVER2-NEXT: movl $7, %edx # sched: [1:0.17]
; BTVER2-NEXT: vpcmpestrm $7, (%rdi), %xmm0 # sched: [18:2.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpestrm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: movl $7, %eax # sched: [1:0.25]
; ZNVER1-NEXT: movl $7, %edx # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpestrm $7, %xmm1, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: movl $7, %eax # sched: [1:0.25]
; ZNVER1-NEXT: movl $7, %edx # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpestrm $7, (%rdi), %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse42.pcmpestrm128(<16 x i8> %a0, i32 7, <16 x i8> %a1, i32 7, i8 7)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse42.pcmpestrm128(<16 x i8> %1, i32 7, <16 x i8> %2, i32 7, i8 7)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse42.pcmpestrm128(<16 x i8>, i32, <16 x i8>, i32, i8) nounwind readnone

define i32 @test_pcmpistri(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpistri:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpistri $7, %xmm1, %xmm0
; GENERIC-NEXT: movl %ecx, %eax
; GENERIC-NEXT: pcmpistri $7, (%rdi), %xmm0
; GENERIC-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; GENERIC-NEXT: leal (%rcx,%rax), %eax
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpistri:
; SLM: # BB#0:
; SLM-NEXT: pcmpistri $7, %xmm1, %xmm0 # sched: [17:17.00]
; SLM-NEXT: movl %ecx, %eax # sched: [1:0.50]
; SLM-NEXT: pcmpistri $7, (%rdi), %xmm0 # sched: [17:17.00]
; SLM-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; SLM-NEXT: leal (%rcx,%rax), %eax # sched: [1:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpistri:
; SANDY: # BB#0:
-; SANDY-NEXT: vpcmpistri $7, %xmm1, %xmm0 # sched: [11:3.00]
+; SANDY-NEXT: vpcmpistri $7, %xmm1, %xmm0 # sched: [3:1.00]
; SANDY-NEXT: movl %ecx, %eax # sched: [1:0.33]
-; SANDY-NEXT: vpcmpistri $7, (%rdi), %xmm0 # sched: [17:3.00]
+; SANDY-NEXT: vpcmpistri $7, (%rdi), %xmm0 # sched: [3:1.00]
; SANDY-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; SANDY-NEXT: leal (%rcx,%rax), %eax # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpistri:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpistri $7, %xmm1, %xmm0 # sched: [11:3.00]
; HASWELL-NEXT: movl %ecx, %eax # sched: [1:0.25]
; HASWELL-NEXT: vpcmpistri $7, (%rdi), %xmm0 # sched: [11:3.00]
; HASWELL-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; HASWELL-NEXT: leal (%rcx,%rax), %eax # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpistri:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpistri $7, %xmm1, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: movl %ecx, %eax # sched: [1:0.17]
; BTVER2-NEXT: vpcmpistri $7, (%rdi), %xmm0 # sched: [11:1.00]
; BTVER2-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; BTVER2-NEXT: leal (%rcx,%rax), %eax # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpistri:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpistri $7, %xmm1, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: movl %ecx, %eax # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpistri $7, (%rdi), %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: # kill: %ECX<def> %ECX<kill> %RCX<def>
; ZNVER1-NEXT: leal (%rcx,%rax), %eax # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call i32 @llvm.x86.sse42.pcmpistri128(<16 x i8> %a0, <16 x i8> %a1, i8 7)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call i32 @llvm.x86.sse42.pcmpistri128(<16 x i8> %a0, <16 x i8> %2, i8 7)
%4 = add i32 %1, %3
ret i32 %4
}
declare i32 @llvm.x86.sse42.pcmpistri128(<16 x i8>, <16 x i8>, i8) nounwind readnone
define <16 x i8> @test_pcmpistrm(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pcmpistrm:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpistrm $7, %xmm1, %xmm0
; GENERIC-NEXT: pcmpistrm $7, (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpistrm:
; SLM: # BB#0:
; SLM-NEXT: pcmpistrm $7, %xmm1, %xmm0 # sched: [13:13.00]
; SLM-NEXT: pcmpistrm $7, (%rdi), %xmm0 # sched: [13:13.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpistrm:
; SANDY: # BB#0:
-; SANDY-NEXT: vpcmpistrm $7, %xmm1, %xmm0 # sched: [11:3.00]
-; SANDY-NEXT: vpcmpistrm $7, (%rdi), %xmm0 # sched: [17:3.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpcmpistrm $7, %xmm1, %xmm0 # sched: [11:1.00]
+; SANDY-NEXT: vpcmpistrm $7, (%rdi), %xmm0 # sched: [11:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpistrm:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpistrm $7, %xmm1, %xmm0 # sched: [10:3.00]
; HASWELL-NEXT: vpcmpistrm $7, (%rdi), %xmm0 # sched: [10:3.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpistrm:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpistrm $7, %xmm1, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: vpcmpistrm $7, (%rdi), %xmm0 # sched: [12:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpistrm:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpistrm $7, %xmm1, %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: vpcmpistrm $7, (%rdi), %xmm0 # sched: [100:0.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.sse42.pcmpistrm128(<16 x i8> %a0, <16 x i8> %a1, i8 7)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.sse42.pcmpistrm128(<16 x i8> %1, <16 x i8> %2, i8 7)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.sse42.pcmpistrm128(<16 x i8>, <16 x i8>, i8) nounwind readnone
define <2 x i64> @test_pcmpgtq(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> *%a2) {
; GENERIC-LABEL: test_pcmpgtq:
; GENERIC: # BB#0:
; GENERIC-NEXT: pcmpgtq %xmm1, %xmm0
; GENERIC-NEXT: pcmpgtq (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; SLM-LABEL: test_pcmpgtq:
; SLM: # BB#0:
; SLM-NEXT: pcmpgtq %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: pcmpgtq (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pcmpgtq:
; SANDY: # BB#0:
-; SANDY-NEXT: vpcmpgtq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
-; SANDY-NEXT: vpcmpgtq (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpcmpgtq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vpcmpgtq (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pcmpgtq:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpcmpgtq %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpcmpgtq (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pcmpgtq:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpcmpgtq %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpcmpgtq (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pcmpgtq:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpcmpgtq %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpcmpgtq (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = icmp sgt <2 x i64> %a0, %a1
%2 = sext <2 x i1> %1 to <2 x i64>
%3 = load <2 x i64>, <2 x i64>*%a2, align 16
%4 = icmp sgt <2 x i64> %2, %3
%5 = sext <2 x i1> %4 to <2 x i64>
ret <2 x i64> %5
}
diff --git a/test/CodeGen/X86/ssse3-schedule.ll b/test/CodeGen/X86/ssse3-schedule.ll
index 24ace69ebb9e..fb3530667ce7 100644
--- a/test/CodeGen/X86/ssse3-schedule.ll
+++ b/test/CodeGen/X86/ssse3-schedule.ll
@@ -1,850 +1,850 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mattr=+ssse3 | FileCheck %s --check-prefix=CHECK --check-prefix=GENERIC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=atom | FileCheck %s --check-prefix=CHECK --check-prefix=ATOM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=slm | FileCheck %s --check-prefix=CHECK --check-prefix=SLM
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=sandybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=ivybridge | FileCheck %s --check-prefix=CHECK --check-prefix=SANDY
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=haswell | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=skylake | FileCheck %s --check-prefix=CHECK --check-prefix=HASWELL
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s --check-prefix=CHECK --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=znver1 | FileCheck %s --check-prefix=CHECK --check-prefix=ZNVER1
define <16 x i8> @test_pabsb(<16 x i8> %a0, <16 x i8> *%a1) {
; GENERIC-LABEL: test_pabsb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pabsb %xmm0, %xmm1
; GENERIC-NEXT: pabsb (%rdi), %xmm0
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pabsb:
; ATOM: # BB#0:
; ATOM-NEXT: pabsb (%rdi), %xmm1
; ATOM-NEXT: pabsb %xmm0, %xmm0
; ATOM-NEXT: por %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pabsb:
; SLM: # BB#0:
; SLM-NEXT: pabsb %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pabsb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pabsb:
; SANDY: # BB#0:
; SANDY-NEXT: vpabsb %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpabsb (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpabsb (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pabsb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpabsb %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpabsb (%rdi), %xmm1 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pabsb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpabsb (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpabsb %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pabsb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpabsb (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpabsb %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.ssse3.pabs.b.128(<16 x i8> %a0)
%2 = load <16 x i8>, <16 x i8> *%a1, align 16
%3 = call <16 x i8> @llvm.x86.ssse3.pabs.b.128(<16 x i8> %2)
%4 = or <16 x i8> %1, %3
ret <16 x i8> %4
}
declare <16 x i8> @llvm.x86.ssse3.pabs.b.128(<16 x i8>) nounwind readnone
define <4 x i32> @test_pabsd(<4 x i32> %a0, <4 x i32> *%a1) {
; GENERIC-LABEL: test_pabsd:
; GENERIC: # BB#0:
; GENERIC-NEXT: pabsd %xmm0, %xmm1
; GENERIC-NEXT: pabsd (%rdi), %xmm0
; GENERIC-NEXT: por %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pabsd:
; ATOM: # BB#0:
; ATOM-NEXT: pabsd (%rdi), %xmm1
; ATOM-NEXT: pabsd %xmm0, %xmm0
; ATOM-NEXT: por %xmm0, %xmm1
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pabsd:
; SLM: # BB#0:
; SLM-NEXT: pabsd %xmm0, %xmm1 # sched: [1:0.50]
; SLM-NEXT: pabsd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: por %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pabsd:
; SANDY: # BB#0:
; SANDY-NEXT: vpabsd %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpabsd (%rdi), %xmm1 # sched: [7:0.50]
+; SANDY-NEXT: vpabsd (%rdi), %xmm1 # sched: [5:0.50]
; SANDY-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pabsd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpabsd %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpabsd (%rdi), %xmm1 # sched: [5:0.50]
; HASWELL-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.33]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pabsd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpabsd (%rdi), %xmm1 # sched: [6:1.00]
; BTVER2-NEXT: vpabsd %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pabsd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpabsd (%rdi), %xmm1 # sched: [8:0.50]
; ZNVER1-NEXT: vpabsd %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpor %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.ssse3.pabs.d.128(<4 x i32> %a0)
%2 = load <4 x i32>, <4 x i32> *%a1, align 16
%3 = call <4 x i32> @llvm.x86.ssse3.pabs.d.128(<4 x i32> %2)
%4 = or <4 x i32> %1, %3
ret <4 x i32> %4
}
declare <4 x i32> @llvm.x86.ssse3.pabs.d.128(<4 x i32>) nounwind readnone
define <8 x i16> @test_pabsw(<8 x i16> %a0, <8 x i16> *%a1) {
; GENERIC-LABEL: test_pabsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pabsw %xmm0, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pabsw:
; ATOM: # BB#0:
; ATOM-NEXT: pabsw %xmm0, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pabsw:
; SLM: # BB#0:
; SLM-NEXT: pabsw %xmm0, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pabsw:
; SANDY: # BB#0:
; SANDY-NEXT: vpabsw %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pabsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpabsw %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pabsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpabsw %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pabsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpabsw %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.pabs.w.128(<8 x i16> %a0)
%2 = load <8 x i16>, <8 x i16> *%a1, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.pabs.w.128(<8 x i16> %2)
%4 = or <8 x i16> %1, %3
ret <8 x i16> %1
}
declare <8 x i16> @llvm.x86.ssse3.pabs.w.128(<8 x i16>) nounwind readnone
define <8 x i16> @test_palignr(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_palignr:
; GENERIC: # BB#0:
; GENERIC-NEXT: palignr {{.*#+}} xmm1 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5]
; GENERIC-NEXT: palignr {{.*#+}} xmm1 = mem[14,15],xmm1[0,1,2,3,4,5,6,7,8,9,10,11,12,13]
; GENERIC-NEXT: movdqa %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_palignr:
; ATOM: # BB#0:
; ATOM-NEXT: palignr {{.*#+}} xmm1 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5]
; ATOM-NEXT: palignr {{.*#+}} xmm1 = mem[14,15],xmm1[0,1,2,3,4,5,6,7,8,9,10,11,12,13]
; ATOM-NEXT: movdqa %xmm1, %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_palignr:
; SLM: # BB#0:
; SLM-NEXT: palignr {{.*#+}} xmm1 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5] sched: [1:1.00]
; SLM-NEXT: palignr {{.*#+}} xmm1 = mem[14,15],xmm1[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [4:1.00]
; SLM-NEXT: movdqa %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_palignr:
; SANDY: # BB#0:
; SANDY-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5] sched: [1:0.50]
-; SANDY-NEXT: vpalignr {{.*#+}} xmm0 = mem[14,15],xmm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpalignr {{.*#+}} xmm0 = mem[14,15],xmm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_palignr:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5] sched: [1:1.00]
; HASWELL-NEXT: vpalignr {{.*#+}} xmm0 = mem[14,15],xmm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_palignr:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5] sched: [1:0.50]
; BTVER2-NEXT: vpalignr {{.*#+}} xmm0 = mem[14,15],xmm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_palignr:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpalignr {{.*#+}} xmm0 = xmm0[6,7,8,9,10,11,12,13,14,15],xmm1[0,1,2,3,4,5] sched: [1:0.25]
; ZNVER1-NEXT: vpalignr {{.*#+}} xmm0 = mem[14,15],xmm0[0,1,2,3,4,5,6,7,8,9,10,11,12,13] sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = shufflevector <8 x i16> %a0, <8 x i16> %a1, <8 x i32> <i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10>
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = shufflevector <8 x i16> %2, <8 x i16> %1, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
ret <8 x i16> %3
}
define <4 x i32> @test_phaddd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_phaddd:
; GENERIC: # BB#0:
; GENERIC-NEXT: phaddd %xmm1, %xmm0
; GENERIC-NEXT: phaddd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phaddd:
; ATOM: # BB#0:
; ATOM-NEXT: phaddd %xmm1, %xmm0
; ATOM-NEXT: phaddd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phaddd:
; SLM: # BB#0:
; SLM-NEXT: phaddd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phaddd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phaddd:
; SANDY: # BB#0:
-; SANDY-NEXT: vphaddd %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphaddd (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphaddd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phaddd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphaddd %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphaddd (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phaddd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphaddd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phaddd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphaddd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphaddd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.ssse3.phadd.d.128(<4 x i32>, <4 x i32>) nounwind readnone
define <8 x i16> @test_phaddsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_phaddsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: phaddsw %xmm1, %xmm0
; GENERIC-NEXT: phaddsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phaddsw:
; ATOM: # BB#0:
; ATOM-NEXT: phaddsw %xmm1, %xmm0
; ATOM-NEXT: phaddsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phaddsw:
; SLM: # BB#0:
; SLM-NEXT: phaddsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phaddsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phaddsw:
; SANDY: # BB#0:
-; SANDY-NEXT: vphaddsw %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphaddsw (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphaddsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phaddsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphaddsw %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphaddsw (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phaddsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphaddsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phaddsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphaddsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphaddsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.ssse3.phadd.sw.128(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_phaddw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_phaddw:
; GENERIC: # BB#0:
; GENERIC-NEXT: phaddw %xmm1, %xmm0
; GENERIC-NEXT: phaddw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phaddw:
; ATOM: # BB#0:
; ATOM-NEXT: phaddw %xmm1, %xmm0
; ATOM-NEXT: phaddw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phaddw:
; SLM: # BB#0:
; SLM-NEXT: phaddw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phaddw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phaddw:
; SANDY: # BB#0:
-; SANDY-NEXT: vphaddw %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphaddw (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphaddw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phaddw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphaddw %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphaddw (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phaddw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphaddw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phaddw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphaddw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphaddw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.ssse3.phadd.w.128(<8 x i16>, <8 x i16>) nounwind readnone
define <4 x i32> @test_phsubd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_phsubd:
; GENERIC: # BB#0:
; GENERIC-NEXT: phsubd %xmm1, %xmm0
; GENERIC-NEXT: phsubd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phsubd:
; ATOM: # BB#0:
; ATOM-NEXT: phsubd %xmm1, %xmm0
; ATOM-NEXT: phsubd (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phsubd:
; SLM: # BB#0:
; SLM-NEXT: phsubd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phsubd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phsubd:
; SANDY: # BB#0:
-; SANDY-NEXT: vphsubd %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphsubd (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphsubd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phsubd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphsubd %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphsubd (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phsubd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphsubd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phsubd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphsubd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphsubd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.ssse3.phsub.d.128(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.ssse3.phsub.d.128(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.ssse3.phsub.d.128(<4 x i32>, <4 x i32>) nounwind readnone
define <8 x i16> @test_phsubsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_phsubsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: phsubsw %xmm1, %xmm0
; GENERIC-NEXT: phsubsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phsubsw:
; ATOM: # BB#0:
; ATOM-NEXT: phsubsw %xmm1, %xmm0
; ATOM-NEXT: phsubsw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phsubsw:
; SLM: # BB#0:
; SLM-NEXT: phsubsw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phsubsw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phsubsw:
; SANDY: # BB#0:
-; SANDY-NEXT: vphsubsw %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphsubsw (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphsubsw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phsubsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphsubsw %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphsubsw (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phsubsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphsubsw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phsubsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphsubsw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphsubsw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.ssse3.phsub.sw.128(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_phsubw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_phsubw:
; GENERIC: # BB#0:
; GENERIC-NEXT: phsubw %xmm1, %xmm0
; GENERIC-NEXT: phsubw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_phsubw:
; ATOM: # BB#0:
; ATOM-NEXT: phsubw %xmm1, %xmm0
; ATOM-NEXT: phsubw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_phsubw:
; SLM: # BB#0:
; SLM-NEXT: phsubw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: phsubw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_phsubw:
; SANDY: # BB#0:
-; SANDY-NEXT: vphsubw %xmm1, %xmm0, %xmm0 # sched: [3:1.50]
-; SANDY-NEXT: vphsubw (%rdi), %xmm0, %xmm0 # sched: [9:1.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vphsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
+; SANDY-NEXT: vphsubw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_phsubw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vphsubw %xmm1, %xmm0, %xmm0 # sched: [3:2.00]
; HASWELL-NEXT: vphsubw (%rdi), %xmm0, %xmm0 # sched: [6:2.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_phsubw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vphsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vphsubw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_phsubw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vphsubw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vphsubw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.phsub.w.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.phsub.w.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.ssse3.phsub.w.128(<8 x i16>, <8 x i16>) nounwind readnone
define <8 x i16> @test_pmaddubsw(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pmaddubsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmaddubsw %xmm1, %xmm0
; GENERIC-NEXT: pmaddubsw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmaddubsw:
; ATOM: # BB#0:
; ATOM-NEXT: pmaddubsw %xmm1, %xmm0
; ATOM-NEXT: pmaddubsw (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmaddubsw:
; SLM: # BB#0:
; SLM-NEXT: pmaddubsw %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: pmaddubsw (%rdi), %xmm0 # sched: [7:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmaddubsw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmaddubsw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
+; SANDY-NEXT: vpmaddubsw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; SANDY-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmaddubsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmaddubsw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0 # sched: [9:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmaddubsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmaddubsw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0 # sched: [7:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmaddubsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmaddubsw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: vpmaddubsw (%rdi), %xmm0, %xmm0 # sched: [11:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = bitcast <8 x i16> %1 to <16 x i8>
%4 = call <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8> %3, <16 x i8> %2)
ret <8 x i16> %4
}
declare <8 x i16> @llvm.x86.ssse3.pmadd.ub.sw.128(<16 x i8>, <16 x i8>) nounwind readnone
define <8 x i16> @test_pmulhrsw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_pmulhrsw:
; GENERIC: # BB#0:
; GENERIC-NEXT: pmulhrsw %xmm1, %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pmulhrsw:
; ATOM: # BB#0:
; ATOM-NEXT: pmulhrsw %xmm1, %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pmulhrsw:
; SLM: # BB#0:
; SLM-NEXT: pmulhrsw %xmm1, %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pmulhrsw:
; SANDY: # BB#0:
-; SANDY-NEXT: vpmulhrsw %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpmulhrsw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pmulhrsw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpmulhrsw %xmm1, %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pmulhrsw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpmulhrsw %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pmulhrsw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpmulhrsw %xmm1, %xmm0, %xmm0 # sched: [4:1.00]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.pmul.hr.sw.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.pmul.hr.sw.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %1
}
declare <8 x i16> @llvm.x86.ssse3.pmul.hr.sw.128(<8 x i16>, <8 x i16>) nounwind readnone
define <16 x i8> @test_pshufb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_pshufb:
; GENERIC: # BB#0:
; GENERIC-NEXT: pshufb %xmm1, %xmm0
; GENERIC-NEXT: pshufb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_pshufb:
; ATOM: # BB#0:
; ATOM-NEXT: pshufb %xmm1, %xmm0
; ATOM-NEXT: pshufb (%rdi), %xmm0
; ATOM-NEXT: retq
;
; SLM-LABEL: test_pshufb:
; SLM: # BB#0:
; SLM-NEXT: pshufb %xmm1, %xmm0 # sched: [1:1.00]
; SLM-NEXT: pshufb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_pshufb:
; SANDY: # BB#0:
; SANDY-NEXT: vpshufb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpshufb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpshufb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_pshufb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpshufb %xmm1, %xmm0, %xmm0 # sched: [1:1.00]
; HASWELL-NEXT: vpshufb (%rdi), %xmm0, %xmm0 # sched: [5:1.00]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_pshufb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpshufb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpshufb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_pshufb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpshufb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpshufb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.ssse3.pshuf.b.128(<16 x i8>, <16 x i8>) nounwind readnone
define <16 x i8> @test_psignb(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> *%a2) {
; GENERIC-LABEL: test_psignb:
; GENERIC: # BB#0:
; GENERIC-NEXT: psignb %xmm1, %xmm0
; GENERIC-NEXT: psignb (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psignb:
; ATOM: # BB#0:
; ATOM-NEXT: psignb %xmm1, %xmm0
; ATOM-NEXT: psignb (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psignb:
; SLM: # BB#0:
; SLM-NEXT: psignb %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psignb (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psignb:
; SANDY: # BB#0:
; SANDY-NEXT: vpsignb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsignb (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsignb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psignb:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsignb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsignb (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psignb:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsignb %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsignb (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psignb:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsignb %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsignb (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <16 x i8> @llvm.x86.ssse3.psign.b.128(<16 x i8> %a0, <16 x i8> %a1)
%2 = load <16 x i8>, <16 x i8> *%a2, align 16
%3 = call <16 x i8> @llvm.x86.ssse3.psign.b.128(<16 x i8> %1, <16 x i8> %2)
ret <16 x i8> %3
}
declare <16 x i8> @llvm.x86.ssse3.psign.b.128(<16 x i8>, <16 x i8>) nounwind readnone
define <4 x i32> @test_psignd(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> *%a2) {
; GENERIC-LABEL: test_psignd:
; GENERIC: # BB#0:
; GENERIC-NEXT: psignd %xmm1, %xmm0
; GENERIC-NEXT: psignd (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psignd:
; ATOM: # BB#0:
; ATOM-NEXT: psignd %xmm1, %xmm0
; ATOM-NEXT: psignd (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psignd:
; SLM: # BB#0:
; SLM-NEXT: psignd %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psignd (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psignd:
; SANDY: # BB#0:
; SANDY-NEXT: vpsignd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsignd (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsignd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psignd:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsignd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsignd (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psignd:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsignd %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsignd (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psignd:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsignd %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsignd (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <4 x i32> @llvm.x86.ssse3.psign.d.128(<4 x i32> %a0, <4 x i32> %a1)
%2 = load <4 x i32>, <4 x i32> *%a2, align 16
%3 = call <4 x i32> @llvm.x86.ssse3.psign.d.128(<4 x i32> %1, <4 x i32> %2)
ret <4 x i32> %3
}
declare <4 x i32> @llvm.x86.ssse3.psign.d.128(<4 x i32>, <4 x i32>) nounwind readnone
define <8 x i16> @test_psignw(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> *%a2) {
; GENERIC-LABEL: test_psignw:
; GENERIC: # BB#0:
; GENERIC-NEXT: psignw %xmm1, %xmm0
; GENERIC-NEXT: psignw (%rdi), %xmm0
; GENERIC-NEXT: retq
;
; ATOM-LABEL: test_psignw:
; ATOM: # BB#0:
; ATOM-NEXT: psignw %xmm1, %xmm0
; ATOM-NEXT: psignw (%rdi), %xmm0
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: nop
; ATOM-NEXT: retq
;
; SLM-LABEL: test_psignw:
; SLM: # BB#0:
; SLM-NEXT: psignw %xmm1, %xmm0 # sched: [1:0.50]
; SLM-NEXT: psignw (%rdi), %xmm0 # sched: [4:1.00]
; SLM-NEXT: retq # sched: [4:1.00]
;
; SANDY-LABEL: test_psignw:
; SANDY: # BB#0:
; SANDY-NEXT: vpsignw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
-; SANDY-NEXT: vpsignw (%rdi), %xmm0, %xmm0 # sched: [7:0.50]
-; SANDY-NEXT: retq # sched: [1:1.00]
+; SANDY-NEXT: vpsignw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
+; SANDY-NEXT: retq # sched: [5:1.00]
;
; HASWELL-LABEL: test_psignw:
; HASWELL: # BB#0:
; HASWELL-NEXT: vpsignw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; HASWELL-NEXT: vpsignw (%rdi), %xmm0, %xmm0 # sched: [5:0.50]
; HASWELL-NEXT: retq # sched: [1:1.00]
;
; BTVER2-LABEL: test_psignw:
; BTVER2: # BB#0:
; BTVER2-NEXT: vpsignw %xmm1, %xmm0, %xmm0 # sched: [1:0.50]
; BTVER2-NEXT: vpsignw (%rdi), %xmm0, %xmm0 # sched: [6:1.00]
; BTVER2-NEXT: retq # sched: [4:1.00]
;
; ZNVER1-LABEL: test_psignw:
; ZNVER1: # BB#0:
; ZNVER1-NEXT: vpsignw %xmm1, %xmm0, %xmm0 # sched: [1:0.25]
; ZNVER1-NEXT: vpsignw (%rdi), %xmm0, %xmm0 # sched: [8:0.50]
; ZNVER1-NEXT: retq # sched: [5:0.50]
%1 = call <8 x i16> @llvm.x86.ssse3.psign.w.128(<8 x i16> %a0, <8 x i16> %a1)
%2 = load <8 x i16>, <8 x i16> *%a2, align 16
%3 = call <8 x i16> @llvm.x86.ssse3.psign.w.128(<8 x i16> %1, <8 x i16> %2)
ret <8 x i16> %3
}
declare <8 x i16> @llvm.x86.ssse3.psign.w.128(<8 x i16>, <8 x i16>) nounwind readnone
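The "sched: [N:M.MM]" annotations matched throughout the schedule tests above come from llc's -print-schedule mode: for each instruction, the active CPU's scheduling model reports its latency in cycles (N) and its reciprocal throughput (M.MM, the average cycle cost per instruction when independent copies issue back to back). As a minimal, self-contained sketch of how one such check line is obtained — the function name and CHECK pattern here are illustrative, not part of this diff — the btver2 model reports [1:0.50] for a register-register vpabsb:
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -print-schedule -mcpu=btver2 | FileCheck %s
; One SSSE3 op with no memory operand; btver2 lists 1 cycle latency and a
; reciprocal throughput of 0.50, so llc appends "sched: [1:0.50]" as a comment.
define <16 x i8> @min_pabsb(<16 x i8> %a0) {
; CHECK: vpabsb {{.*}} # sched: [1:0.50]
  %1 = call <16 x i8> @llvm.x86.ssse3.pabs.b.128(<16 x i8> %a0)
  ret <16 x i8> %1
}
declare <16 x i8> @llvm.x86.ssse3.pabs.b.128(<16 x i8>) nounwind readnone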
diff --git a/test/DllTool/coff-decorated.def b/test/DllTool/coff-decorated.def
new file mode 100644
index 000000000000..5a908f388480
--- /dev/null
+++ b/test/DllTool/coff-decorated.def
@@ -0,0 +1,26 @@
+; RUN: llvm-dlltool -k -m i386 --input-def %s --output-lib %t.a
+; RUN: llvm-readobj %t.a | FileCheck %s
+; RUN: llvm-nm %t.a | FileCheck %s -check-prefix=CHECK-NM
+
+LIBRARY test.dll
+EXPORTS
+CdeclFunction
+StdcallFunction@4
+@FastcallFunction@4
+StdcallAlias@4=StdcallFunction@4
+??_7exception@@6B@
+
+; CHECK: Name type: noprefix
+; CHECK: Symbol: __imp__CdeclFunction
+; CHECK: Symbol: _CdeclFunction
+; CHECK: Name type: undecorate
+; CHECK: Symbol: __imp__StdcallFunction@4
+; CHECK: Symbol: _StdcallFunction@4
+; CHECK: Name type: undecorate
+; CHECK: Symbol: __imp_@FastcallFunction@4
+; CHECK: Symbol: @FastcallFunction@4
+; CHECK: Name type: name
+; CHECK: Symbol: __imp_??_7exception@@6B@
+; CHECK: Symbol: ??_7exception@@6B@
+; CHECK-NM: w _StdcallAlias@4
+; CHECK-NM: U _StdcallFunction@4
diff --git a/test/Feature/optnone-opt.ll b/test/Feature/optnone-opt.ll
index 6410afb6be99..ae0e1a48acc5 100644
--- a/test/Feature/optnone-opt.ll
+++ b/test/Feature/optnone-opt.ll
@@ -1,73 +1,72 @@
; RUN: opt -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-O0
; RUN: opt -O1 -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-O1
; RUN: opt -O2 -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-O1 --check-prefix=OPT-O2O3
; RUN: opt -O3 -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-O1 --check-prefix=OPT-O2O3
; RUN: opt -dce -die -gvn-hoist -loweratomic -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-MORE
; RUN: opt -indvars -licm -loop-deletion -loop-extract -loop-idiom -loop-instsimplify -loop-reduce -loop-reroll -loop-rotate -loop-unroll -loop-unswitch -S -debug %s 2>&1 | FileCheck %s --check-prefix=OPT-LOOP
; REQUIRES: asserts
; This test verifies that we don't run target independent IR-level
; optimizations on optnone functions.
; Function Attrs: noinline optnone
define i32 @_Z3fooi(i32 %x) #0 {
entry:
%x.addr = alloca i32, align 4
store i32 %x, i32* %x.addr, align 4
br label %while.cond
while.cond: ; preds = %while.body, %entry
%0 = load i32, i32* %x.addr, align 4
%dec = add nsw i32 %0, -1
store i32 %dec, i32* %x.addr, align 4
%tobool = icmp ne i32 %0, 0
br i1 %tobool, label %while.body, label %while.end
while.body: ; preds = %while.cond
br label %while.cond
while.end: ; preds = %while.cond
ret i32 0
}
attributes #0 = { optnone noinline }
; Nothing that runs at -O0 gets skipped.
; OPT-O0-NOT: Skipping pass
; IR passes run at -O1 and higher.
; OPT-O1-DAG: Skipping pass 'Aggressive Dead Code Elimination'
; OPT-O1-DAG: Skipping pass 'Combine redundant instructions'
; OPT-O1-DAG: Skipping pass 'Dead Store Elimination'
; OPT-O1-DAG: Skipping pass 'Early CSE'
; OPT-O1-DAG: Skipping pass 'Jump Threading'
; OPT-O1-DAG: Skipping pass 'MemCpy Optimization'
; OPT-O1-DAG: Skipping pass 'Reassociate expressions'
; OPT-O1-DAG: Skipping pass 'Simplify the CFG'
; OPT-O1-DAG: Skipping pass 'Sparse Conditional Constant Propagation'
; OPT-O1-DAG: Skipping pass 'SROA'
; OPT-O1-DAG: Skipping pass 'Tail Call Elimination'
; OPT-O1-DAG: Skipping pass 'Value Propagation'
; Additional IR passes run at -O2 and higher.
; OPT-O2O3-DAG: Skipping pass 'Global Value Numbering'
; OPT-O2O3-DAG: Skipping pass 'SLP Vectorizer'
; Additional IR passes that opt doesn't turn on by default.
; OPT-MORE-DAG: Skipping pass 'Dead Code Elimination'
; OPT-MORE-DAG: Skipping pass 'Dead Instruction Elimination'
-; OPT-MORE-DAG: Skipping pass 'Lower atomic intrinsics'
; Loop IR passes that opt doesn't turn on by default.
; OPT-LOOP-DAG: Skipping pass 'Delete dead loops'
; OPT-LOOP-DAG: Skipping pass 'Extract loops into new functions'
; OPT-LOOP-DAG: Skipping pass 'Induction Variable Simplification'
; OPT-LOOP-DAG: Skipping pass 'Loop Invariant Code Motion'
; OPT-LOOP-DAG: Skipping pass 'Loop Strength Reduction'
; OPT-LOOP-DAG: Skipping pass 'Recognize loop idioms'
; OPT-LOOP-DAG: Skipping pass 'Reroll loops'
; OPT-LOOP-DAG: Skipping pass 'Rotate Loops'
; OPT-LOOP-DAG: Skipping pass 'Simplify instructions in loops'
; OPT-LOOP-DAG: Skipping pass 'Unroll loops'
; OPT-LOOP-DAG: Skipping pass 'Unswitch loops'
diff --git a/test/Linker/module-flags-pic-1-a.ll b/test/Linker/module-flags-pic-1-a.ll
index ea933359ac66..9074aa6e593f 100644
--- a/test/Linker/module-flags-pic-1-a.ll
+++ b/test/Linker/module-flags-pic-1-a.ll
@@ -1,9 +1,9 @@
; RUN: llvm-link %s %p/Inputs/module-flags-pic-1-b.ll -S -o - | FileCheck %s
; test linking modules with specified and default PIC levels
-!0 = !{ i32 1, !"PIC Level", i32 1 }
+!0 = !{ i32 7, !"PIC Level", i32 1 }
!llvm.module.flags = !{!0}
; CHECK: !llvm.module.flags = !{!0}
-; CHECK: !0 = !{i32 1, !"PIC Level", i32 1}
+; CHECK: !0 = !{i32 7, !"PIC Level", i32 1}
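For context on this two-line change: the leading i32 in an !llvm.module.flags entry selects the merge behavior applied when two linked modules carry the same flag. Behavior 1 is Error (a value mismatch aborts the link), while behavior 7 is Max (the larger value wins), which lets a module with an explicit "PIC Level" link cleanly against one that only carries the default. A one-line illustration using the same flag as the test — the trailing comment is explanatory, not emitted by any tool:
!0 = !{ i32 7, !"PIC Level", i32 1 }  ; behavior 7 = Max: on conflict, keep the larger PIC level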
diff --git a/test/Transforms/Inline/recursive.ll b/test/Transforms/Inline/recursive.ll
index e189339e224b..ded12dddf63f 100644
--- a/test/Transforms/Inline/recursive.ll
+++ b/test/Transforms/Inline/recursive.ll
@@ -1,39 +1,70 @@
; RUN: opt -inline -S < %s | FileCheck %s
; RUN: opt -passes='cgscc(inline)' -S < %s | FileCheck %s
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i386-apple-darwin10.0"
; rdar://10853263
; Make sure that the callee is still here.
; CHECK-LABEL: define i32 @callee(
define i32 @callee(i32 %param) {
%yyy = alloca [100000 x i8]
%r = bitcast [100000 x i8]* %yyy to i8*
call void @foo2(i8* %r)
ret i32 4
}
; CHECK-LABEL: define i32 @caller(
; CHECK-NEXT: entry:
; CHECK-NOT: alloca
; CHECK: ret
define i32 @caller(i32 %param) {
entry:
%t = call i32 @foo(i32 %param)
%cmp = icmp eq i32 %t, -1
br i1 %cmp, label %exit, label %cont
cont:
%r = call i32 @caller(i32 %t)
%f = call i32 @callee(i32 %r)
br label %cont
exit:
ret i32 4
}
declare void @foo2(i8* %in)
declare i32 @foo(i32 %param)
+; Check that when inlining a non-recursive path into a function's own body that
+; we get the re-mapping of instructions correct.
+define i32 @test_recursive_inlining_remapping(i1 %init, i8* %addr) {
+; CHECK-LABEL: define i32 @test_recursive_inlining_remapping(
+bb:
+ %n = alloca i32
+ br i1 %init, label %store, label %load
+; CHECK-NOT: alloca
+;
+; CHECK: %[[N:.*]] = alloca i32
+; CHECK-NEXT: br i1 %init,
+
+store:
+ store i32 0, i32* %n
+ %cast = bitcast i32* %n to i8*
+ %v = call i32 @test_recursive_inlining_remapping(i1 false, i8* %cast)
+ ret i32 %v
+; CHECK-NOT: call
+;
+; CHECK: store i32 0, i32* %[[N]]
+; CHECK-NEXT: %[[CAST:.*]] = bitcast i32* %[[N]] to i8*
+; CHECK-NEXT: %[[INLINED_LOAD:.*]] = load i32, i32* %[[N]]
+; CHECK-NEXT: ret i32 %[[INLINED_LOAD]]
+;
+; CHECK-NOT: call
+
+load:
+ %castback = bitcast i8* %addr to i32*
+ %n.load = load i32, i32* %castback
+ ret i32 %n.load
+}
diff --git a/test/Transforms/LowerAtomic/atomic-swap.ll b/test/Transforms/LowerAtomic/atomic-swap.ll
index 77000527a11f..59a5caed481c 100644
--- a/test/Transforms/LowerAtomic/atomic-swap.ll
+++ b/test/Transforms/LowerAtomic/atomic-swap.ll
@@ -1,28 +1,39 @@
; RUN: opt < %s -loweratomic -S | FileCheck %s
define i8 @cmpswap() {
; CHECK-LABEL: @cmpswap(
%i = alloca i8
%pair = cmpxchg i8* %i, i8 0, i8 42 monotonic monotonic
%j = extractvalue { i8, i1 } %pair, 0
; CHECK: [[OLDVAL:%[a-z0-9]+]] = load i8, i8* [[ADDR:%[a-z0-9]+]]
; CHECK-NEXT: [[SAME:%[a-z0-9]+]] = icmp eq i8 [[OLDVAL]], 0
; CHECK-NEXT: [[TO_STORE:%[a-z0-9]+]] = select i1 [[SAME]], i8 42, i8 [[OLDVAL]]
; CHECK-NEXT: store i8 [[TO_STORE]], i8* [[ADDR]]
; CHECK-NEXT: [[TMP:%[a-z0-9]+]] = insertvalue { i8, i1 } undef, i8 [[OLDVAL]], 0
; CHECK-NEXT: [[RES:%[a-z0-9]+]] = insertvalue { i8, i1 } [[TMP]], i1 [[SAME]], 1
; CHECK-NEXT: [[VAL:%[a-z0-9]+]] = extractvalue { i8, i1 } [[RES]], 0
ret i8 %j
; CHECK: ret i8 [[VAL]]
}
define i8 @swap() {
; CHECK-LABEL: @swap(
%i = alloca i8
%j = atomicrmw xchg i8* %i, i8 42 monotonic
; CHECK: [[INST:%[a-z0-9]+]] = load
; CHECK-NEXT: store
ret i8 %j
; CHECK: ret i8 [[INST]]
}
+
+
+define i8 @swap_optnone() noinline optnone {
+; CHECK-LABEL: @swap_optnone(
+ %i = alloca i8
+ %j = atomicrmw xchg i8* %i, i8 42 monotonic
+; CHECK: [[INST:%[a-z0-9]+]] = load
+; CHECK-NEXT: store
+ ret i8 %j
+; CHECK: ret i8 [[INST]]
+}
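The @swap and @swap_optnone checks encode what -loweratomic actually does: each atomic operation is rewritten into its plain, non-atomic equivalent (the pass is only sound where atomicity is not actually required). Note that @swap_optnone still expects the load/store expansion, which together with the "Skipping pass 'Lower atomic intrinsics'" line removed from optnone-opt.ll earlier in this diff pins down that this lowering runs even on optnone functions. A minimal sketch of the cmpxchg expansion that the @cmpswap checks match, with illustrative value names that the pass itself does not produce:
; %pair = cmpxchg i8* %p, i8 %cmp, i8 %new monotonic monotonic
; becomes, after -loweratomic:
;   %old  = load i8, i8* %p                      ; read the current value
;   %same = icmp eq i8 %old, %cmp                ; did it match the expected value?
;   %sel  = select i1 %same, i8 %new, i8 %old    ; pick the value to write back
;   store i8 %sel, i8* %p                        ; unconditional, non-atomic store
;   %t    = insertvalue { i8, i1 } undef, i8 %old, 0
;   %pair = insertvalue { i8, i1 } %t, i1 %same, 1   ; rebuild the { value, success } pair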
diff --git a/test/Transforms/Reassociate/canonicalize-neg-const.ll b/test/Transforms/Reassociate/canonicalize-neg-const.ll
index 465460cb53b1..7cb2c3a10e2d 100644
--- a/test/Transforms/Reassociate/canonicalize-neg-const.ll
+++ b/test/Transforms/Reassociate/canonicalize-neg-const.ll
@@ -1,156 +1,178 @@
; RUN: opt -reassociate -gvn -S < %s | FileCheck %s
; (x + 0.1234 * y) * (x + -0.1234 * y) -> (x + 0.1234 * y) * (x - 0.1234 * y)
define double @test1(double %x, double %y) {
; CHECK-LABEL: @test1
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fadd double %x, %mul
; CHECK-NEXT: fsub double %x, %mul
; CHECK-NEXT: fmul double %add{{.*}}, %add{{.*}}
; CHECK-NEXT: ret double %mul
%mul = fmul double 1.234000e-01, %y
%add = fadd double %mul, %x
%mul1 = fmul double -1.234000e-01, %y
%add2 = fadd double %mul1, %x
%mul3 = fmul double %add, %add2
ret double %mul3
}
; (x + -0.1234 * y) * (x + -0.1234 * y) -> (x - 0.1234 * y) * (x - 0.1234 * y)
define double @test2(double %x, double %y) {
; CHECK-LABEL: @test2
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fsub double %x, %mul
; CHECK-NEXT: fmul double %add{{.*}}, %add{{.*}}
; CHECK-NEXT: ret double %mul
%mul = fmul double %y, -1.234000e-01
%add = fadd double %mul, %x
%mul1 = fmul double %y, -1.234000e-01
%add2 = fadd double %mul1, %x
%mul3 = fmul double %add, %add2
ret double %mul3
}
; (x + 0.1234 * y) * (x - -0.1234 * y) -> (x + 0.1234 * y) * (x + 0.1234 * y)
define double @test3(double %x, double %y) {
; CHECK-LABEL: @test3
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fadd double %x, %mul
; CHECK-NEXT: fmul double %add{{.*}}, %add{{.*}}
; CHECK-NEXT: ret double
%mul = fmul double %y, 1.234000e-01
%add = fadd double %mul, %x
%mul1 = fmul double %y, -1.234000e-01
%add2 = fsub double %x, %mul1
%mul3 = fmul double %add, %add2
ret double %mul3
}
; Canonicalize (x - -0.1234 * y)
define double @test5(double %x, double %y) {
; CHECK-LABEL: @test5
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fadd double %x, %mul
; CHECK-NEXT: ret double
%mul = fmul double -1.234000e-01, %y
%sub = fsub double %x, %mul
ret double %sub
}
; Don't modify (-0.1234 * y - x)
define double @test6(double %x, double %y) {
; CHECK-LABEL: @test6
; CHECK-NEXT: fmul double %y, -1.234000e-01
; CHECK-NEXT: fsub double %mul, %x
; CHECK-NEXT: ret double %sub
%mul = fmul double -1.234000e-01, %y
%sub = fsub double %mul, %x
ret double %sub
}
; Canonicalize (-0.1234 * y + x) -> (x - 0.1234 * y)
define double @test7(double %x, double %y) {
; CHECK-LABEL: @test7
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fsub double %x, %mul
; CHECK-NEXT: ret double %add
%mul = fmul double -1.234000e-01, %y
%add = fadd double %mul, %x
ret double %add
}
; Canonicalize (y * -0.1234 + x) -> (x - 0.1234 * y)
define double @test8(double %x, double %y) {
; CHECK-LABEL: @test8
; CHECK-NEXT: fmul double %y, 1.234000e-01
; CHECK-NEXT: fsub double %x, %mul
; CHECK-NEXT: ret double %add
%mul = fmul double %y, -1.234000e-01
%add = fadd double %mul, %x
ret double %add
}
; Canonicalize (x - -0.1234 / y)
define double @test9(double %x, double %y) {
; CHECK-LABEL: @test9
; CHECK-NEXT: fdiv double 1.234000e-01, %y
; CHECK-NEXT: fadd double %x, %div
; CHECK-NEXT: ret double
%div = fdiv double -1.234000e-01, %y
%sub = fsub double %x, %div
ret double %sub
}
; Don't modify (-0.1234 / y - x)
define double @test10(double %x, double %y) {
; CHECK-LABEL: @test10
; CHECK-NEXT: fdiv double -1.234000e-01, %y
; CHECK-NEXT: fsub double %div, %x
; CHECK-NEXT: ret double %sub
%div = fdiv double -1.234000e-01, %y
%sub = fsub double %div, %x
ret double %sub
}
; Canonicalize (-0.1234 / y + x) -> (x - 0.1234 / y)
define double @test11(double %x, double %y) {
; CHECK-LABEL: @test11
; CHECK-NEXT: fdiv double 1.234000e-01, %y
; CHECK-NEXT: fsub double %x, %div
; CHECK-NEXT: ret double %add
%div = fdiv double -1.234000e-01, %y
%add = fadd double %div, %x
ret double %add
}
; Canonicalize (y / -0.1234 + x) -> (x - y / 0.1234)
define double @test12(double %x, double %y) {
; CHECK-LABEL: @test12
; CHECK-NEXT: fdiv double %y, 1.234000e-01
; CHECK-NEXT: fsub double %x, %div
; CHECK-NEXT: ret double %add
%div = fdiv double %y, -1.234000e-01
%add = fadd double %div, %x
ret double %add
}
; Don't create an NSW violation
define i4 @test13(i4 %x) {
; CHECK-LABEL: @test13
; CHECK-NEXT: %[[mul:.*]] = mul nsw i4 %x, -2
; CHECK-NEXT: %[[add:.*]] = add i4 %[[mul]], 3
%mul = mul nsw i4 %x, -2
%add = add i4 %mul, 3
ret i4 %add
}
+
+; This tests used to cause an infinite loop where we would loop between
+; canonicalizing the negated constant (i.e., (X + Y*-5.0) -> (X - Y*5.0)) and
+; breaking up a subtract (i.e., (X - Y*5.0) -> X + (0 - Y*5.0)). To break the
+; cycle, we don't canonicalize the negative constant if we're going to later
+; break up the subtract.
+;
+; Check to make sure we don't canonicalize
+; (%pow2*-5.0 + %sub) -> (%sub - %pow2*5.0)
+; as we would later break up this subtract causing a cycle.
+;
+; CHECK-LABEL: @pr34078
+; CHECK: %mul5.neg = fmul fast double %pow2, -5.000000e-01
+; CHECK: %sub1 = fadd fast double %mul5.neg, %sub
+define double @pr34078(double %A) {
+ %sub = fsub fast double 1.000000e+00, %A
+ %pow2 = fmul double %A, %A
+ %mul5 = fmul fast double %pow2, 5.000000e-01
+ %sub1 = fsub fast double %sub, %mul5
+ %add = fadd fast double %sub1, %sub1
+ ret double %add
+}