Last year Ashot told us about programming techniques for video cards. GPUs can crunch a large-scale data-parallel task much faster than a CPU. Still, they can hardly be called a universal solution. When data arrives in small batches or the job has sub-linear complexity, transferring it over the PCI-E bus takes too long to pay off.
What to do instead? Use SIMD instructions. Most modern processors have them, and you have already paid a high price for them: wide registers appeared in the hardware, the CISC-to-RISC translator grew, the scheduler became more complicated... And compilers still have not learned to use these complex instructions on their own. So we have to write them by hand!
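To give a flavor of what "writing them by hand" means, here is a minimal sketch that sums an array of floats four lanes at a time with SSE intrinsics. It is only an illustration, not code from the talk or from Unum: the function names `sum_scalar` and `sum_simd` are hypothetical, and the sketch assumes an x86 target where the compiler defines `__SSE2__`, falling back to the plain loop everywhere else.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE/SSE2 intrinsics (assumes an x86 target)
#endif

// Hypothetical baseline: a plain scalar loop.
float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Hypothetical hand-written SIMD version: four floats per iteration.
float sum_simd(const float* data, std::size_t n) {
#if defined(__SSE2__)
    __m128 acc = _mm_setzero_ps();                     // four partial sums, all zero
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i)); // unaligned load, lane-wise add

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                         // spill the partial sums
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; ++i)                                 // leftover 0..3 elements
        total += data[i];
    return total;
#else
    return sum_scalar(data, n);                        // no SSE2: scalar fallback
#endif
}
```

Calling `sum_simd(values.data(), values.size())` on a `std::vector<float>` is enough to try it. Note that the SIMD version adds the numbers in a different order than the scalar loop, so for general floating-point input the two results may differ in the last bits.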
Ashot will share key SIMD optimizations and a few anti-patterns he faced while developing Unum.