On x86 the architecture guarantees memory read/write ordering (TSO; the only visible reordering is a store being delayed past a later load). The CPU is not allowed to make some otherwise-important optimizations because they would change the observed order of memory reads and writes. DMA or SMP makes no difference to this.
On most other architectures the CPU can and will reorder memory reads and writes. As a result there is a lot of multi-threaded code that works correctly on x86 but fails when run elsewhere.
FWIW: this gets argued about occasionally, but the consensus seems to be that the cited line in the SDM is documenting a misfeature of an older CPU (though which one escapes me). That effect is, IIRC, not observable on current hardware.
Maybe. I’ve implemented a ring buffer used between two virtual machine domains. There were a few places where hardware memory barriers were needed; if they were removed, the ring buffer would start corrupting data. These barriers are in addition to the many obviously needed compiler barriers.