
The article states that the CPU has a limit of 4 instructions per cycle, but the sum2 method issues 5 instructions per cycle. Presumably one of them (maybe the increment) is trivial enough to be executed as a fifth instruction.


gpderetta is right -- test/cmp + jump will get fused.

uiCA is a very nice tool which tries to simulate how instructions will get scheduled, e.g. this is the trace it produces for sum3 on Haswell, showing the fusion: https://uica.uops.info/tmp/75182318511042c98d4d74bc026db179_... .


It's cool. I would love to have this for an ARMv8 Mac.


The LLVM project has a tool called llvm-mca that does this. Example: https://gcc.godbolt.org/z/7zcova1ce

The version in the Compiler Explorer wouldn't work on AArch64 without an -mcpu flag and I didn't know what to pass, so I copied -mcpu=cyclone from https://djolertrk.github.io/2021/11/05/optimize-AARCH64-back.... You'd have to look up the correct one for your Mac's CPU.


Some nominally 4-wide Intel CPUs can execute 5 or 6 instructions per cycle when some of them are macrofused. For example, a cmp and a following conditional jXX can be macrofused into a single uop.
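
To make the counting concrete, here is a rough toy model of macrofusion, not a simulator. It assumes that only cmp/test fuse, and only with a conditional jump directly after them. Real eligibility rules vary by microarchitecture. The function names and the loop body are made up for illustration and are not taken from the article.

```python
# Toy model only: real fusion rules depend on the microarchitecture
# and on which compare/jump pairs are eligible.
FUSIBLE_FIRST = {"cmp", "test"}  # assumption: only these instructions fuse


def is_cond_jump(mnemonic):
    # Simplification: any 'j...' mnemonic except the unconditional jmp.
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fused_slots(mnemonics):
    """Count issue slots after fusing each cmp/test with a
    directly following conditional jump."""
    slots = 0
    i = 0
    while i < len(mnemonics):
        if (mnemonics[i] in FUSIBLE_FIRST
                and i + 1 < len(mnemonics)
                and is_cond_jump(mnemonics[i + 1])):
            i += 2  # the fused pair occupies a single slot
        else:
            i += 1
        slots += 1
    return slots


# Hypothetical 5-instruction loop body, not the article's actual code.
loop_body = ["add", "add", "add", "cmp", "jne"]
print(len(loop_body), "instructions ->", fused_slots(loop_body), "slots")
# prints "5 instructions -> 4 slots"
```

Under those assumptions, five instructions fit into four slots, which is how a nominally 4-wide core could retire a 5-instruction loop body in one cycle.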



