FWIW: this gets argued about occasionally, but the consensus seems to be that the cited line in the SDM documents a misfeature of an older CPU (I don't recall which one). IIRC, that effect is not observable on current hardware.
Maybe. I’ve implemented a ring buffer that is shared between two virtual machine domains. A few places needed barriers; with those removed, the ring buffer would start corrupting data. These barriers are in addition to the many obviously needed compiler barriers.
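For anyone following along, here is a rough sketch of the kind of ordering a single-producer/single-consumer ring relies on. It is not the ring buffer described above, just a minimal illustration in C11 atomics; `struct ring`, `ring_push`, `ring_pop`, and `RING_SIZE` are hypothetical names, and a real cross-domain ring would live in shared memory with its own layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity; must be a power of two */

struct ring {
    _Atomic uint32_t head;      /* advanced only by the producer */
    _Atomic uint32_t tail;      /* advanced only by the consumer */
    uint32_t slots[RING_SIZE];  /* payload storage */
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct ring *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store to tail, so the
     * slot we are about to overwrite has already been read. */
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;

    r->slots[head & (RING_SIZE - 1)] = value;
    /* Release: the payload store must become visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct ring *r, uint32_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the producer's release store to head, so the
     * payload read below sees the data that was published. */
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;

    *out = r->slots[tail & (RING_SIZE - 1)];
    /* Release: finish reading the slot before handing it back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

On x86 the acquire and release operations in this sketch typically compile to plain loads and stores with only compiler-level ordering. A pattern that stores to one location and then loads a different one (say, publishing an index and then checking a hypothetical "peer is asleep" flag) is different: x86 may reorder a store with a later load, so that case needs a full fence such as `atomic_thread_fence(memory_order_seq_cst)`.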