On 32-bit targets this just compiles into two 32-bit copies, which gcc can't optimize as well as it can the byte copy (it seems to love to do some AVX stuff, at least in my tests).
So this is only useful on 64-bit capable platforms (the compiler does not seem to do 64-bit copies for the byte copy case, so this ends up being faster).
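
For illustration, a rough sketch of the two variants being compared and one way to pick between them at compile time. This is not the actual patch: the names `copy_bytes`, `copy_words64` and `copy_auto` are made up here, and using `UINTPTR_MAX` as the test for "64-bit capable" is an assumption (it checks pointer width, so an ABI with 64-bit registers but 32-bit pointers would take the byte path).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain byte copy: the compiler is free to vectorize this loop itself. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical 64-bit variant: moves 8 bytes per iteration, then the tail. */
void copy_words64(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        /* Fixed-size memcpy sidesteps alignment and strict-aliasing issues;
           on a 32-bit target this typically turns into two 32-bit moves. */
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }
    for (; i < n; i++)          /* remaining 0..7 bytes */
        dst[i] = src[i];
}

/* Assumption: pointer width stands in for "64-bit capable platform". */
void copy_auto(unsigned char *dst, const unsigned char *src, size_t n)
{
#if UINTPTR_MAX > 0xFFFFFFFFu
    copy_words64(dst, src, n);
#else
    copy_bytes(dst, src, n);
#endif
}
```

Whether the 64-bit path actually wins depends on the compiler version and flags, so it is worth checking the generated assembly (for example with `gcc -O2 -S`) on each target before relying on it.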