This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

The vector types here are named with the same convention as in Factor's SIMD library: the element type, followed by the number of elements that fit in a 128-bit XMM register:

- char-16
- uchar-16
- short-8
- ushort-8
- int-4
- uint-4
- longlong-2
- ulonglong-2
- float-4
- double-2

The number next to each instruction is the SSE version that introduced it (3.3 denotes SSSE3):

| | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
|---|---|---|---|---|---|---|---|---|---|---|
| move* | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
| add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
| subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
| saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
| saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
| add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
| horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
| multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
| divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
| absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
| minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
| maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
| approx reciprocal | | | | | | | | | RCPPS 1 | |
| square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
| comparison | PCMPxxB† 2 | PCMPxxB† 2 | PCMPxxW† 2 | PCMPxxW† 2 | PCMPxxD† 2 | PCMPxxD† 2 | | | CMPxxxPS‡ 1 | CMPxxxPD‡ 2 |
| bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
| bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
| bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
| bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
| load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
| shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
| shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
| unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
| unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
| static shuffle§ | | | PSHUF[HL]W‖ 2 | PSHUF[HL]W‖ 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS¶ 1 | SHUFPD¶ 2 |
| variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
| static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
| variable blend# | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |

Notes:

- The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
- There are many more instructions that do not fit in this grid, but these are the most important ones to know.
- * Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
- † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed greater-than-or-equal, perform the PCMPEQ and PCMPGT comparisons and POR the results together; for less-than-or-equal, do the same with the PCMPGT operands swapped. For unsigned tests, first bias both inputs by PXORing each component with 0x80, 0x8000, or 0x80000000 (for 8-, 16-, or 32-bit elements), then use the signed comparison.
- ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons: they raise the Invalid floating-point exception if either input is a NaN, quiet or signaling. The other predicates raise it only for signaling NaNs.
- § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
- ‖ 16-bit element shuffles only shuffle half of the register at a time.
- ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
- # Variable blends take the blend mask from XMM0 as an implicit operand.
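
The integer comparison recipes in the † note can be sketched with compiler intrinsics. This is a minimal sketch, assuming a compiler that provides `<emmintrin.h>`; the helper names `cmpge_epi32` and `cmpgt_epu8` are hypothetical and are not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Signed a >= b per 32-bit element: (a == b) | (a > b),
   i.e. PCMPEQD and PCMPGTD combined with POR. */
static inline __m128i cmpge_epi32(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_cmpeq_epi32(a, b), _mm_cmpgt_epi32(a, b));
}

/* Unsigned a > b per byte: flip the sign bit of both inputs
   (PXOR with 0x80), then use the signed compare (PCMPGTB). */
static inline __m128i cmpgt_epu8(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}
```

Each result element is all ones where the test holds and all zeros otherwise, so the result can be fed to PMOVMSKB or used as a blend mask.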

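    ; mask is in xmm0
    ; if-true is in xmm1
    ; if-false is in xmm2
    ; blended result is in xmm3
    pand   xmm1, xmm0    ; xmm1 = if-true & mask (xmm1 is overwritten)
    movdqa xmm3, xmm0    ; xmm3 = mask
    pandn  xmm3, xmm2    ; xmm3 = ~mask & if-false
    por    xmm3, xmm1    ; xmm3 = blended result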
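
The same and / and-not / or blend can be written with SSE2 intrinsics. A minimal sketch, assuming `<emmintrin.h>` is available; `blend_si128` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-bit select: where a mask bit is 1 take if_true, otherwise if_false. */
static inline __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* pand                      */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* pandn: ~mask & if_false   */
    return _mm_or_si128(t, f);                     /* por                       */
}
```

Note that `_mm_andnot_si128` complements its first argument, matching PANDN, which complements its destination operand.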

Directly to a GPR: (requires SSE 4.1)

pextrd eax, xmm0, n

To low element of an XMM register:

; tbw

Use `movd eax, xmm0` to move the selected element to a GPR.
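
The step that brings element `n` into the low position is left unwritten above. One possible approach, which is an assumption here and not taken from this page, is a PSHUFD followed by MOVD. A minimal sketch with SSE2 intrinsics; `extract_epi32_sse2` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Return 32-bit element n (0..3) of v without SSE 4.1.
   Assumption: PSHUFD moves the wanted element to the low position,
   then MOVD copies it to a general-purpose register.
   The shuffle immediate must be a compile-time constant, hence the switch. */
static inline int extract_epi32_sse2(__m128i v, int n)
{
    switch (n & 3) {
    case 0:  return _mm_cvtsi128_si32(v);
    case 1:  return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0x55));
    case 2:  return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0xaa));
    default: return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0xff));
    }
}
```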

Directly from GPRs: (requires SSE 4.1)

    pinsrd xmm0, r8d, 0
    pinsrd xmm0, r9d, 1
    pinsrd xmm0, r10d, 2
    pinsrd xmm0, r11d, 3

From low elements of XMM registers:

    punpckldq  xmm0, xmm1 ; xmm0 => ? ? 1 0
    punpckldq  xmm2, xmm3 ; xmm2 => ? ? 3 2
    punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0

Use `movd xmm0, eax` to load the low element from a GPR.
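
The SSE2 packing sequence (MOVD into the low elements, then two PUNPCKLDQ and one PUNPCKLQDQ) might look like this with intrinsics. A minimal sketch; `pack4_epi32` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build the vector (e3 e2 e1 e0) from four scalars using only SSE2. */
static inline __m128i pack4_epi32(int e0, int e1, int e2, int e3)
{
    __m128i x0 = _mm_cvtsi32_si128(e0);       /* movd: upper elements are zero */
    __m128i x1 = _mm_cvtsi32_si128(e1);
    __m128i x2 = _mm_cvtsi32_si128(e2);
    __m128i x3 = _mm_cvtsi32_si128(e3);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* punpckldq:  0 0 e1 e0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* punpckldq:  0 0 e3 e2 */
    return _mm_unpacklo_epi64(lo, hi);        /* punpcklqdq: e3 e2 e1 e0 */
}
```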

Element 0 is a no-op:

movss dst, src

Element 1:

movshdup dst, src

Element 2:

movhlps dst, src

Element 3:

    movaps dst, src
    shufps dst, dst, 0xff ; 3 3 3 3
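
The four per-element cases can be folded into one helper with intrinsics. A minimal sketch, assuming `<xmmintrin.h>`; `extract_ps` is a hypothetical name, and element 1 uses SHUFPS instead of MOVSHDUP so that the sketch needs only SSE1.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Return float element n (0..3) of v by moving it to the low position first. */
static inline float extract_ps(__m128 v, int n)
{
    switch (n & 3) {
    case 0:  return _mm_cvtss_f32(v);                            /* already low  */
    case 1:  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 0x55));   /* shufps       */
    case 2:  return _mm_cvtss_f32(_mm_movehl_ps(v, v));          /* movhlps      */
    default: return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 0xff));   /* shufps 0xff  */
    }
}
```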

    unpcklps xmm0, xmm1 ; xmm0 => ? ? 1 0
    unpcklps xmm2, xmm3 ; xmm2 => ? ? 3 2
    movlhps  xmm0, xmm2 ; xmm0 => 3 2 1 0
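
The UNPCKLPS / MOVLHPS packing sequence can be tried out with intrinsics. A minimal sketch, assuming `<xmmintrin.h>`; `pack4_ps` is a hypothetical helper name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Build the vector (e3 e2 e1 e0) from four scalars held in low elements. */
static inline __m128 pack4_ps(float e0, float e1, float e2, float e3)
{
    /* _mm_set_ss puts the scalar in element 0 and zeros the rest. */
    __m128 lo = _mm_unpacklo_ps(_mm_set_ss(e0), _mm_set_ss(e1));  /* 0 0 e1 e0 */
    __m128 hi = _mm_unpacklo_ps(_mm_set_ss(e2), _mm_set_ss(e3));  /* 0 0 e3 e2 */
    return _mm_movelh_ps(lo, hi);                                 /* e3 e2 e1 e0 */
}
```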

    movaps dst, src      ; copy all four elements
    shufps dst, dst, n

Where `n` selects the element:

| element | `n` |
|---|---|
| 0 | `0x00` |
| 1 | `0x55` |
| 2 | `0xaa` |
| 3 | `0xff` |
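
The shuffle immediates `0x00`, `0x55`, `0xaa`, and `0xff` each copy one element into all four positions. A minimal sketch with intrinsics, assuming `<xmmintrin.h>`; `splat_ps` is a hypothetical helper name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Copy element n (0..3) of v into every element of the result.
   Each immediate repeats the 2-bit index n four times. */
static inline __m128 splat_ps(__m128 v, int n)
{
    switch (n & 3) {
    case 0:  return _mm_shuffle_ps(v, v, 0x00);
    case 1:  return _mm_shuffle_ps(v, v, 0x55);
    case 2:  return _mm_shuffle_ps(v, v, 0xaa);
    default: return _mm_shuffle_ps(v, v, 0xff);
    }
}
```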

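    movaps xmm1, xmm0
    shufps xmm0, xmm1, 0xb1 ; 1 0 3 2 (source elements, listed low to high)
    addps  xmm0, xmm1       ; pairwise sums: 0+1 0+1 2+3 2+3
    movaps xmm1, xmm0
    shufps xmm0, xmm0, 0x0a ; 2 2 0 0
    addps  xmm0, xmm1       ; every element of xmm0 now holds the total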
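
The two-step shuffle-and-add reduction can be written with intrinsics as follows. A minimal sketch, assuming `<xmmintrin.h>`; `hsum_ps` is a hypothetical helper name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the four floats of v using two shuffle + add steps (SSE1 only). */
static inline float hsum_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);          /* 1 0 3 2, low to high */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* 0+1 0+1 2+3 2+3      */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);  /* 2 2 0 0              */
    return _mm_cvtss_f32(_mm_add_ps(pairs, crossed));     /* low element = total  */
}
```

On SSE3 hardware two HADDPS instructions give the same total; the shuffle version is shown because it needs only SSE1.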

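    ; mask is in xmm0
    ; if-true is in xmm1
    ; if-false is in xmm2
    ; blended result is in xmm3
    andps  xmm1, xmm0    ; xmm1 = if-true & mask (xmm1 is overwritten)
    movaps xmm3, xmm0    ; xmm3 = mask
    andnps xmm3, xmm2    ; xmm3 = ~mask & if-false
    orps   xmm3, xmm1    ; xmm3 = blended result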
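
A blend mask usually comes from a comparison. The following sketch combines CMPLTPS with the and / and-not / or blend to replace negative elements with zero. It is an illustration under assumptions, not the page's own example; `select_ps` and `clamp_negative_to_zero` are hypothetical helper names.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Per-bit select: where mask is all ones take if_true, otherwise if_false. */
static inline __m128 select_ps(__m128 mask, __m128 if_true, __m128 if_false)
{
    return _mm_or_ps(_mm_and_ps(mask, if_true),       /* andps           */
                     _mm_andnot_ps(mask, if_false));  /* andnps, orps    */
}

/* Replace every element below zero with 0.0f; other elements are unchanged. */
static inline __m128 clamp_negative_to_zero(__m128 v)
{
    __m128 zero   = _mm_setzero_ps();
    __m128 is_neg = _mm_cmplt_ps(v, zero);  /* cmpltps: all ones where v < 0 */
    return select_ps(is_neg, zero, v);
}
```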

| order | code |
|---|---|
| 0 0 2 2 | `movsldup dst, src` |
| 1 1 3 3 | `movshdup dst, src` |
| 0 1 0 1 | `movlhps dst, dst` |
| 2 3 2 3 | `movhlps dst, dst` |
| 0 0 1 1 | `unpcklps dst, dst` |
| 2 2 3 3 | `unpckhps dst, dst` |
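
The orders in this table list source element indices from the low element to the high element. A minimal sketch that exercises the four SSE1 entries through intrinsics (the SSE3 MOVSLDUP and MOVSHDUP rows are left out so that only `<xmmintrin.h>` is assumed); the `shuf_*` helper names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Each helper name gives the resulting element order, low to high. */
static inline __m128 shuf_0101(__m128 v) { return _mm_movelh_ps(v, v); }    /* movlhps dst, dst  */
static inline __m128 shuf_2323(__m128 v) { return _mm_movehl_ps(v, v); }    /* movhlps dst, dst  */
static inline __m128 shuf_0011(__m128 v) { return _mm_unpacklo_ps(v, v); }  /* unpcklps dst, dst */
static inline __m128 shuf_2233(__m128 v) { return _mm_unpackhi_ps(v, v); }  /* unpckhps dst, dst */
```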

unpcklpd xmm0, xmm1

Element 0:

movddup xmm0, xmm1

Element 1:

    movapd   xmm0, xmm1
    unpckhpd xmm0, xmm0
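
For double-2 vectors, packing two scalars and copying one element into both positions each take a single unpack. A minimal sketch with SSE2 intrinsics; `pack2_pd` and `splat_pd` are hypothetical helper names, and element 0 uses UNPCKLPD instead of MOVDDUP so that SSE3 is not required.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build the vector (e1 e0) from two scalars held in low elements. */
static inline __m128d pack2_pd(double e0, double e1)
{
    return _mm_unpacklo_pd(_mm_set_sd(e0), _mm_set_sd(e1));  /* unpcklpd */
}

/* Copy element n (0 or 1) of v into both elements of the result. */
static inline __m128d splat_pd(__m128d v, int n)
{
    return (n & 1) ? _mm_unpackhi_pd(v, v)   /* unpckhpd dst, dst */
                   : _mm_unpacklo_pd(v, v);  /* unpcklpd dst, dst */
}
```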

    movapd   xmm1, xmm0
    unpckhpd xmm1, xmm1  ; low element of xmm1 = element 1
    addsd    xmm0, xmm1  ; low element of xmm0 = element 0 + element 1
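
The same two-element sum with intrinsics, as a minimal sketch assuming `<emmintrin.h>`; `hsum_pd` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum the two doubles of v: move the high element down, then add scalars. */
static inline double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       /* unpckhpd: low element = v[1] */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* addsd: v[0] + v[1]           */
}
```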

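    ; mask is in xmm0
    ; if-true is in xmm1
    ; if-false is in xmm2
    ; blended result is in xmm3
    andpd  xmm1, xmm0    ; xmm1 = if-true & mask (xmm1 is overwritten)
    movapd xmm3, xmm0    ; xmm3 = mask
    andnpd xmm3, xmm2    ; xmm3 = ~mask & if-false
    orpd   xmm3, xmm1    ; xmm3 = blended result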

| order | code |
|---|---|
| 0 0 | `unpcklpd dst, dst` or `movddup dst, src` |
| 1 1 | `unpckhpd dst, dst` |

For full details, consult Intel's or AMD's instruction set reference documentation.

*This revision created on Mon, 28 Sep 2009 19:18:39 by jckarter*