SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

The vector types here are named with the same convention as in Factor's SIMD library. Each name is the element type followed by the number of elements packed into one 128-bit register:

char-16
uchar-16
short-8
ushort-8
int-4
uint-4
longlong-2
ulonglong-2
float-4
double-2

Instruction set

The number next to each instruction is the SSE version:

1: SSE
2: SSE2
3: SSE3
3.3: SSSE3
4.1: SSE4.1
4.2: SSE4.2

	char-16	uchar-16	short-8	ushort-8	int-4	uint-4	longlong-2	ulonglong-2	float-4	double-2
move^*	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOV[AU]PS 1	MOV[AU]PD 2
add	PADDB 2	PADDB 2	PADDW 2	PADDW 2	PADDD 2	PADDD 2	PADDQ 2	PADDQ 2	ADDPS 1	ADDPD 2
subtract	PSUBB 2	PSUBB 2	PSUBW 2	PSUBW 2	PSUBD 2	PSUBD 2	PSUBQ 2	PSUBQ 2	SUBPS 1	SUBPD 2
saturated add	PADDSB 2	PADDUSB 2	PADDSW 2	PADDUSW 2
saturated subtract	PSUBSB 2	PSUBUSB 2	PSUBSW 2	PSUBUSW 2
add-subtract									ADDSUBPS 3	ADDSUBPD 3
horizontal add			PHADDW 3.3	PHADDW 3.3	PHADDD 3.3	PHADDD 3.3			HADDPS 3	HADDPD 3
multiply			PMULLW 2	PMULLW 2	PMULLD 4.1	PMULLD 4.1			MULPS 1	MULPD 2
divide									DIVPS 1	DIVPD 2
absolute value	PABSB 3.3		PABSW 3.3		PABSD 3.3
minimum	PMINSB 4.1	PMINUB 2	PMINSW 2	PMINUW 4.1	PMINSD 4.1	PMINUD 4.1			MINPS 1	MINPD 2
maximum	PMAXSB 4.1	PMAXUB 2	PMAXSW 2	PMAXUW 4.1	PMAXSD 4.1	PMAXUD 4.1			MAXPS 1	MAXPD 2
approx reciprocal									RCPPS 1
square root									SQRTPS 1	SQRTPD 2
comparison	PCMPxxB^† 2	PCMPxxB^† 2	PCMPxxW^† 2	PCMPxxW^† 2	PCMPxxD^† 2	PCMPxxD^† 2			CMPxxxPS^‡ 1	CMPxxxPD^‡ 2
bitwise and	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	ANDPS 1	ANDPD 2
bitwise or	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	ORPS 1	ORPD 2
bitwise xor	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	XORPS 1	XORPD 2
load mask	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	MOVMSKPS 1	MOVMSKPD 2
shift left			PSLLW 2	PSLLW 2	PSLLD 2	PSLLD 2	PSLLQ 2	PSLLQ 2
shift right			PSRAW 2	PSRLW 2	PSRAD 2	PSRLD 2		PSRLQ 2
unpack low	PUNPCKLBW 2	PUNPCKLBW 2	PUNPCKLWD 2	PUNPCKLWD 2	PUNPCKLDQ 2	PUNPCKLDQ 2	PUNPCKLQDQ 2	PUNPCKLQDQ 2	UNPCKLPS 1	UNPCKLPD 2
unpack high	PUNPCKHBW 2	PUNPCKHBW 2	PUNPCKHWD 2	PUNPCKHWD 2	PUNPCKHDQ 2	PUNPCKHDQ 2	PUNPCKHQDQ 2	PUNPCKHQDQ 2	UNPCKHPS 1	UNPCKHPD 2
static shuffle^§			PSHUF[HL]W^‖ 2	PSHUF[HL]W^‖ 2	PSHUFD 2	PSHUFD 2	PSHUFD 2	PSHUFD 2	SHUFPS^¶ 1	SHUFPD^¶ 2
dynamic shuffle	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3

Notes:

The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
There are many more instructions that do not fit in this grid, but these are the most important ones to know.
* Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
† Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed greater-than-or-equal, perform the PCMPEQ and PCMPGT comparisons and POR the results together; for less-than-or-equal, do the same with the PCMPGT operands swapped. For unsigned tests, first bias both inputs by PXORing 0x80, 0x8000, or 0x80000000 into each component, then use the signed comparison.
‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is any NaN. EQ, NEQ, UNORD, and ORD raise it only for signaling NaNs.
§ Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
‖ 16-bit element shuffles only shuffle half of the register at a time: PSHUFLW permutes the low four elements and PSHUFHW the high four, and the other half is copied through unchanged.
¶ Floating-point shuffle words select the low elements of the result from the destination register and the high elements from the source. To shuffle a single vector, use the same register for source and destination.
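The unsigned-comparison trick in note † can be sketched in C with SSE2 intrinsics. This is a minimal illustration, not part of the original reference: the function name uint4_greater_mask is hypothetical, and the intrinsics are assumed to compile to the instructions named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-component a > b for uint-4.
   Returns a 4-bit mask; bit i is set when a[i] > b[i] as unsigned values. */
int uint4_greater_mask(const uint32_t a[4], const uint32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* MOVDQU */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* MOVDQU */
    __m128i bias = _mm_set1_epi32(INT32_MIN);          /* 0x80000000 in each component */

    /* Flipping the top bit maps unsigned order onto signed order. */
    va = _mm_xor_si128(va, bias);                      /* PXOR */
    vb = _mm_xor_si128(vb, bias);                      /* PXOR */

    __m128i gt = _mm_cmpgt_epi32(va, vb);              /* PCMPGTD (signed) */
    return _mm_movemask_ps(_mm_castsi128_ps(gt));      /* MOVMSKPS */
}
```

Without the bias, a component such as 0xFFFFFFFF would be treated as -1 and compare as less than 1.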

Idioms

int-4

Select nth component

Gather four integers into a vector

punpckldq xmm0, xmm1  ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3  ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0
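The same gather can be sketched in C with SSE2 intrinsics, assuming each integer starts in the low component of its own register as in the assembly above. The function name gather_int4 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the int-4 vector (d c b a), a in component 0. */
__m128i gather_int4(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);   /* MOVD: 0 0 0 a */
    __m128i x1 = _mm_cvtsi32_si128(b);   /* MOVD: 0 0 0 b */
    __m128i x2 = _mm_cvtsi32_si128(c);   /* MOVD: 0 0 0 c */
    __m128i x3 = _mm_cvtsi32_si128(d);   /* MOVD: 0 0 0 d */

    x0 = _mm_unpacklo_epi32(x0, x1);     /* PUNPCKLDQ:  0 0 b a */
    x2 = _mm_unpacklo_epi32(x2, x3);     /* PUNPCKLDQ:  0 0 d c */
    return _mm_unpacklo_epi64(x0, x2);   /* PUNPCKLQDQ: d c b a */
}
```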

float-4

Select nth component

Gather four floats into a vector

movss dst, src1     ; dst  => ? ? ? 1
unpcklps dst, src2  ; dst  => ? ? 2 1
unpcklps src3, src4 ; src3 => ? ? 4 3 (src3 is overwritten)
movlhps dst, src3   ; dst  => 4 3 2 1
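A rough C equivalent using SSE intrinsics is sketched below. The function name gather_float4 is hypothetical, and _mm_set_ss stands in for the initial scalar loads, so the unused upper components are zero here rather than undefined.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: build the float-4 vector (d c b a), a in component 0. */
__m128 gather_float4(float a, float b, float c, float d)
{
    __m128 lo = _mm_unpacklo_ps(_mm_set_ss(a), _mm_set_ss(b));  /* UNPCKLPS: 0 0 b a */
    __m128 hi = _mm_unpacklo_ps(_mm_set_ss(c), _mm_set_ss(d));  /* UNPCKLPS: 0 0 d c */
    return _mm_movelh_ps(lo, hi);                               /* MOVLHPS:  d c b a */
}
```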

Broadcast float into four components

movss dst, src       ; dst => ? ? ? x
shufps dst, dst, 0x0 ; dst => x x x x
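In C the same broadcast might look like the sketch below. The function name broadcast_float4 is hypothetical. The SSE header also provides _mm_set1_ps for this purpose, though which instructions a compiler emits for it is not guaranteed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: copy x into all four components. */
__m128 broadcast_float4(float x)
{
    __m128 v = _mm_set_ss(x);            /* 0 0 0 x */
    return _mm_shuffle_ps(v, v, 0x00);   /* SHUFPS with selector 0: x x x x */
}
```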

Absolute value
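One common way to do this, offered here as an assumption rather than as this page's original idiom, is to clear the sign bit of each component with a bitwise AND. The sketch below uses a hypothetical function name, abs_float4, and builds the 0x7fffffff mask with an SSE2 integer intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE header) */

/* Hypothetical helper: per-component absolute value of a float-4. */
__m128 abs_float4(__m128 v)
{
    /* All bits set except the IEEE 754 sign bit, in each component. */
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_and_ps(v, mask);          /* ANDPS */
}
```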

Horizontal add with SSE2

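movaps xmm1, xmm0       ; xmm1 => d c b a
shufps xmm0, xmm1, 0xb1 ; xmm0 => c d a b
addps xmm0, xmm1        ; xmm0 => c+d c+d a+b a+b
movaps xmm1, xmm0       ; xmm1 => c+d c+d a+b a+b
shufps xmm0, xmm0, 0x0a ; xmm0 => a+b a+b c+d c+d
addps xmm0, xmm1        ; xmm0 => a+b+c+d in every component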
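The two shuffle-and-add steps can be sketched in C as follows. The function name hsum_float4 is hypothetical, and the shuffle constants are the same 0xb1 and 0x0a used in the assembly. Every instruction involved is available from the first SSE version onward.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum of the four components of v (d c b a). */
float hsum_float4(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);          /* c d a b */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* c+d c+d a+b a+b */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);  /* a+b a+b c+d c+d */
    __m128 total   = _mm_add_ps(pairs, crossed);          /* sum in every component */
    return _mm_cvtss_f32(total);                          /* read component 0 */
}
```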

Special shuffles

order (component 0 first)	code
0 0 2 2	movsldup dst, src
1 1 3 3	movshdup dst, src
0 1 0 1	movlhps dst, dst
2 3 2 3	movhlps dst, dst
0 0 1 1	unpcklps dst, dst
2 2 3 3	unpckhps dst, dst
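Four of these rows need only SSE and can be sketched in C as below. MOVSLDUP and MOVSHDUP are left out because they require SSE3. All function names are hypothetical, and the comments list the selected components with component 0 first.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helpers, one per special shuffle. */
__m128 shuffle_0101(__m128 v) { return _mm_movelh_ps(v, v); }    /* MOVLHPS:  0 1 0 1 */
__m128 shuffle_2323(__m128 v) { return _mm_movehl_ps(v, v); }    /* MOVHLPS:  2 3 2 3 */
__m128 shuffle_0011(__m128 v) { return _mm_unpacklo_ps(v, v); }  /* UNPCKLPS: 0 0 1 1 */
__m128 shuffle_2233(__m128 v) { return _mm_unpackhi_ps(v, v); }  /* UNPCKHPS: 2 2 3 3 */
```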

double-2

Select nth component

Gather two doubles into a vector

movsd dst, src1    ; dst => ? 1
unpcklpd dst, src2 ; dst => 2 1
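A C sketch of the same gather, using SSE2 intrinsics and a hypothetical function name, gather_double2:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the double-2 vector (b a), a in component 0. */
__m128d gather_double2(double a, double b)
{
    /* _mm_set_sd places each scalar in component 0 and zeroes component 1. */
    return _mm_unpacklo_pd(_mm_set_sd(a), _mm_set_sd(b));  /* UNPCKLPD: b a */
}
```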

Broadcast double into two components

movddup dst, src
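MOVDDUP is an SSE3 instruction. Where only SSE2 can be assumed, unpacking a register with itself gives the same result. The sketch below shows that fallback in C; it is not the instruction listed above, and the function name broadcast_double2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: copy x into both components using SSE2 only. */
__m128d broadcast_double2(double x)
{
    __m128d v = _mm_set_sd(x);       /* 0 x */
    return _mm_unpacklo_pd(v, v);    /* UNPCKLPD: x x */
}
```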

Absolute value

Horizontal add with SSE2

movapd xmm1, xmm0   ; xmm1 => b a
unpckhpd xmm1, xmm1 ; xmm1 => b b
addsd xmm0, xmm1    ; xmm0 => b a+b (sum is in the low component)
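In C with SSE2 intrinsics, the same reduction might be written as in this sketch. The function name hsum_double2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sum of the two components of v (b a). */
double hsum_double2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);    /* UNPCKHPD: b b */
    __m128d sum  = _mm_add_sd(v, high);      /* ADDSD: b a+b */
    return _mm_cvtsd_f64(sum);               /* read component 0 */
}
```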

References

For full details, consult Intel's or AMD's instruction set reference documentation.

This revision created on Mon, 28 Sep 2009 18:20:12 by jckarter
