This is a quick reference for Intel's Streaming SIMD Extensions (SSE). Feel free to make additions or corrections!

# Vector types

The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the number of elements. Every type fills one 128-bit XMM register:

- char-16
- uchar-16
- short-8
- ushort-8
- int-4
- uint-4
- float-4
- double-2

# Instruction set

The number next to each instruction is the SSE version that introduced it for XMM registers; 3.3 stands for SSSE3 and 4.1 for SSE4.1:

| operation | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | float-4 | double-2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| move | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
| add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | ADDPS 1 | ADDPD 2 |
| subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | SUBPS 1 | SUBPD 2 |
| add with saturation | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | |
| subtract with saturation | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | |
| add-subtract | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
| horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | HADDPS 3 | HADDPD 3 |
| multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | MULPS 1 | MULPD 2 |
| divide | | | | | | | DIVPS 1 | DIVPD 2 |
| absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | |
| minimum | | PMINUB 2 | PMINSW 2 | | | | MINPS 1 | MINPD 2 |
| maximum | | PMAXUB 2 | PMAXSW 2 | | | | MAXPS 1 | MAXPD 2 |
| approximate reciprocal | | | | | | | RCPPS 1 | |
| square root | | | | | | | SQRTPS 1 | SQRTPD 2 |
| bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
| bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
| bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
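
The difference between the plain and saturating add rows is easiest to see on values that overflow. The following is a minimal sketch in C using the SSE2 intrinsics from `<emmintrin.h>`, which compilers typically turn into PADDB and PADDUSB; the function names `add_wrap_u8` and `add_sat_u8` are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* uchar-16 add with wraparound (PADDB): 250 + 10 gives 4. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* uchar-16 add with unsigned saturation (PADDUSB): 250 + 10 gives 255. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

The signed variants (`_mm_adds_epi8`, `_mm_adds_epi16`) clamp to the signed range instead.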

Notes:

- The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
- There are many more instructions that do not fit in this grid, but these are the most important ones to know.
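- Every move instruction in the grid has an aligned (A) and an unaligned (U) form. The aligned form raises a general-protection fault if the memory address is not a multiple of 16 bytes. On older processors it is also noticeably faster; on more recent ones the unaligned form usually costs about the same when the address happens to be aligned.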
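
As an illustration of the aligned and unaligned move forms, here is a small C sketch using the SSE intrinsics `_mm_load_ps` (MOVAPS) and `_mm_loadu_ps` (MOVUPS). The helper name `sum4` and the run-time alignment check are assumptions made for the example; real code usually knows the alignment statically.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdint.h>

/* Sums the four floats starting at p, picking the load form by alignment. */
float sum4(const float *p)
{
    __m128 v = ((uintptr_t)p % 16 == 0)
        ? _mm_load_ps(p)    /* aligned form: p must be a multiple of 16 */
        : _mm_loadu_ps(p);  /* unaligned form: any address is fine */
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling `_mm_load_ps` directly on a misaligned pointer is what triggers the fault described above.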

# Idioms

## int-4

### Select nth component
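
One possible approach, sketched in C with SSE2 intrinsics: broadcast the wanted lane with PSHUFD (`_mm_shuffle_epi32`), then read the low lane with MOVD (`_mm_cvtsi128_si32`). The function name `select_int4` is hypothetical. The shuffle immediate must be a compile-time constant, hence the switch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns lane n (0 = lowest) of an int-4 vector. */
int select_int4(__m128i v, int n)
{
    switch (n & 3) {
    case 0:  return _mm_cvtsi128_si32(v);
    case 1:  return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0x55));
    case 2:  return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0xAA));
    default: return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 0xFF));
    }
}
```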

### Gather four integers into a vector

punpckldq xmm0, xmm1 ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3 ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0
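
The same sequence can be written with SSE2 intrinsics in C. This is a sketch that assumes the four integers start out in general-purpose registers, so each is first moved into the low lane of its own vector with MOVD; `gather_int4` is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Builds the int-4 vector "d c b a" (a in the lowest lane). */
__m128i gather_int4(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);        /* MOVD: 0 0 0 a */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* PUNPCKLDQ: 0 0 b a */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* PUNPCKLDQ: 0 0 d c */
    return _mm_unpacklo_epi64(lo, hi);        /* PUNPCKLQDQ: d c b a */
}
```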

## float-4

### Select nth component
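
One possible approach, sketched in C with SSE intrinsics: broadcast the wanted lane with SHUFPS (`_mm_shuffle_ps`), then take the low lane (`_mm_cvtss_f32`). The function name `select_float4` is hypothetical, and the switch exists only because the shuffle immediate must be a compile-time constant.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns lane n (0 = lowest) of a float-4 vector. */
float select_float4(__m128 v, int n)
{
    switch (n & 3) {
    case 0:  return _mm_cvtss_f32(v);
    case 1:  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 0x55));
    case 2:  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 0xAA));
    default: return _mm_cvtss_f32(_mm_shuffle_ps(v, v, 0xFF));
    }
}
```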

### Gather four floats into a vector

movss dst, src1 ; dst => ? ? ? 1
unpcklps dst, src2 ; dst => ? ? 2 1
unpcklps src3, src4 ; src3 => ? ? 4 3 (src3 is overwritten)
movlhps dst, src3 ; dst => 4 3 2 1
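
For comparison, a C sketch of the same gather using SSE intrinsics. `gather_float4` is a made-up name, and `_mm_set_ss` stands in for whatever places each scalar in the low lane of a register.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Builds the float-4 vector "d c b a" (a in the lowest lane). */
__m128 gather_float4(float a, float b, float c, float d)
{
    __m128 lo = _mm_unpacklo_ps(_mm_set_ss(a), _mm_set_ss(b));  /* UNPCKLPS: 0 0 b a */
    __m128 hi = _mm_unpacklo_ps(_mm_set_ss(c), _mm_set_ss(d));  /* UNPCKLPS: 0 0 d c */
    return _mm_movelh_ps(lo, hi);                               /* MOVLHPS: d c b a */
}
```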

### Broadcast float into four components

movss dst, src ; dst => ? ? ? x
shufps dst, dst, 0x0 ; dst => x x x x
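
The same broadcast as a C sketch with SSE intrinsics; the name `broadcast_float4` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Copies x into all four lanes. */
__m128 broadcast_float4(float x)
{
    __m128 v = _mm_set_ss(x);           /* 0 0 0 x */
    return _mm_shuffle_ps(v, v, 0x00);  /* SHUFPS with 0x0: x x x x */
}
```

In practice `_mm_set1_ps(x)` does the same job in one call and leaves the instruction choice to the compiler.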

### Absolute value
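
There is no packed absolute-value instruction for floats in the grid above. A common approach, sketched here in C under that assumption, is to clear the sign bit of every lane with ANDNPS (`_mm_andnot_ps`). The name `abs_float4` is made up for the example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Clears the sign bit of each of the four floats. */
__m128 abs_float4(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only bit 31 set per lane */
    return _mm_andnot_ps(sign_mask, v);           /* ANDNPS: (~mask) & v */
}
```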

### Horizontal add with SSE2

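movaps xmm1, xmm0 ; xmm1 => d c b a
shufps xmm0, xmm1, 0xb1 ; xmm0 => c d a b
addps xmm0, xmm1 ; xmm0 => c+d c+d a+b a+b
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; xmm0 => a+b a+b c+d c+d
addps xmm0, xmm1 ; xmm0 => a+b+c+d in all four lanes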
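
The shuffle-and-add sequence above translates to C intrinsics roughly as follows. This is a sketch with a hypothetical function name, `hadd_float4`, that returns the sum as a scalar.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns the sum of the four lanes of v = "d c b a". */
float hadd_float4(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xB1);          /* c d a b */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* c+d c+d a+b a+b */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0A);  /* a+b a+b c+d c+d */
    return _mm_cvtss_f32(_mm_add_ps(pairs, crossed));     /* low lane holds the total */
}
```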

## double-2

### Select nth component
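
One possible approach, sketched in C with SSE2 intrinsics: the low lane can be read directly, and the high lane is first copied down with UNPCKHPD (`_mm_unpackhi_pd`). The function name `select_double2` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns lane n (0 = low, 1 = high) of a double-2 vector. */
double select_double2(__m128d v, int n)
{
    if (n & 1)
        v = _mm_unpackhi_pd(v, v);  /* UNPCKHPD: high lane copied to low */
    return _mm_cvtsd_f64(v);
}
```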

### Gather two doubles into a vector

movsd dst, src1 ; dst => ? 1
unpcklpd dst, src2 ; dst => 2 1
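
A C sketch of the same gather with SSE2 intrinsics; `gather_double2` is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Builds the double-2 vector "b a" (a in the low lane). */
__m128d gather_double2(double a, double b)
{
    return _mm_unpacklo_pd(_mm_set_sd(a), _mm_set_sd(b));  /* UNPCKLPD */
}
```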

### Broadcast double into two components

movsd dst, src ; dst => ? x
unpcklpd dst, dst ; dst => x x
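
The same broadcast as a C sketch with SSE2 intrinsics; the name `broadcast_double2` is hypothetical, and `_mm_set1_pd(x)` is the usual one-call alternative.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Copies x into both lanes. */
__m128d broadcast_double2(double x)
{
    __m128d v = _mm_set_sd(x);     /* 0 x */
    return _mm_unpacklo_pd(v, v);  /* UNPCKLPD: x x */
}
```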

### Absolute value
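
As with floats, a common approach is to clear the sign bit of each lane, here with ANDNPD (`_mm_andnot_pd`). The following C sketch assumes that approach; `abs_double2` is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Clears the sign bit of both doubles. */
__m128d abs_double2(__m128d v)
{
    const __m128d sign_mask = _mm_set1_pd(-0.0);  /* only bit 63 set per lane */
    return _mm_andnot_pd(sign_mask, v);           /* ANDNPD: (~mask) & v */
}
```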

### Horizontal add with SSE2

movapd xmm1, xmm0 ; xmm1 => b a
unpckhpd xmm1, xmm1 ; xmm1 => b b
addsd xmm0, xmm1 ; xmm0 => b a+b (sum in the low lane only)
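
The same three steps as a C sketch with SSE2 intrinsics, returning the sum as a scalar; `hadd_double2` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns the sum of both lanes of v = "b a". */
double hadd_double2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);       /* UNPCKHPD: b b */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));  /* ADDSD: low lane = a+b */
}
```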

# References

For full details, consult Intel's or AMD's instruction set reference documentation.