I'm sure that when you've reverse engineered shit in the past, you've come across instructions that seem useless, like the one in the title of this post: mov eax, eax. Seriously, wtf, right?
Yes, it can be treated as a 2-byte NOP, and in 99% of cases it is used for code alignment (Intel recommends 16-byte boundaries), but is there anything else it can be used for?
- It moves the value of eax into eax itself
- No flags are modified by a mov instruction
I usually just think of it as the compiler doing some silly shit. However, as it turns out, there is a bit more to it.
While doing some studying I found an interesting paper online; you can find it here. It's an optimization guide for assembly programmers. This brings back a lot of memories from university! :)
The rest of the post is either a summary of, or copy-pasta from, the article: some key things I found useful and would like to note down for future reference. There is A LOT more in the article that I will definitely revisit when I have time.
μop = micro-operation, e.g.:
add eax,90 generates 1 μop, while
add eax,[ebx] generates 2 μops (1 to read the value at the address held in ebx and the other to add this value to eax).
NOTE: Micro-op fusion
The register renaming (RAT) and retirement (RRF) stages in the pipeline are bottlenecks with a maximum throughput of 3 μops per clock cycle. In order to get more through these bottlenecks, the designers have joined some operations together that were split in two μops in previous processors. They call this μop fusion.
So on processors with μop fusion, the example above, add eax,[ebx], only generates 1 μop.
Take this piece of code as an example:
    ; Example Register renaming
    mov eax, [mem1]
    mov ebx, [mem2]
    add ebx, eax
    imul eax, 6
    mov [mem3], eax
    mov [mem4], ebx
Here eax is multiplied, and the value of eax is also added to ebx. To be able to execute these in parallel, the microprocessor assigns a new temporary register every time a register is modified or written to. This process is called register renaming and is handled by the register alias table (RAT), which has a queue of 10 that can fill up pretty quickly if there is a blockage further down the pipeline.
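To make the renaming idea concrete, here is a minimal Python sketch. It is a toy model under my own assumptions, not how a real RAT is built: the function name `rename` and the temporaries `t0`, `t1`, ... are made up for illustration. Every write gets a fresh temporary, and every read refers to the latest temporary for that register.

```python
from itertools import count

def rename(instrs):
    """Toy register renaming: every register write gets a fresh temporary.

    instrs is a list of (dest, sources) tuples using architectural
    register names. Returns the same list rewritten in terms of
    temporaries t0, t1, ... (hypothetical names for this sketch).
    """
    alias = {}       # architectural register -> current temporary
    fresh = count()  # source of new temporary numbers
    renamed = []
    for dest, sources in instrs:
        # Reads use the most recent alias; untouched registers keep their name.
        srcs = tuple(alias.get(r, r) for r in sources)
        # The write allocates a new temporary and updates the alias table.
        tmp = f"t{next(fresh)}"
        alias[dest] = tmp
        renamed.append((tmp, srcs))
    return renamed

# Register flow of the example above (memory operands and stores left out):
program = [
    ("eax", ()),              # mov eax, [mem1]
    ("ebx", ()),              # mov ebx, [mem2]
    ("ebx", ("ebx", "eax")),  # add ebx, eax
    ("eax", ("eax",)),        # imul eax, 6
]

for dest, srcs in rename(program):
    print(dest, "<-", srcs)
```

After renaming, the add writes `t2` and the imul writes `t3`, and both only read the earlier temporaries, so in this model neither has to wait for the other.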
From the article (may not be applicable to the latest microprocessors):
The RAT can handle 3 μops per clock cycle. This means that the overall throughput of the microprocessor can never exceed 3 μops per clock cycle on average.
Re-Order Buffer (ROB) read
The ROB-read stage is where the values of the renamed registers are stored in the ROB entry, if they are available.
Each ROB entry can have up to 2 input registers and 2 output registers.
There are 3 possibilities for the value of an input register:
- The register has not been modified recently. The ROB-read stage reads the value from the permanent register file and stores it in the ROB entry.
- The value has been modified recently. The new value is the output of a μop that has been executed but not yet retired. The ROB-read stage will read the value from the not yet retired ROB entry and store it in the new ROB entry.
- The value is not ready yet. The needed value is the coming output of a μop that is queued but not yet executed. The new value cannot be written yet, but it will be written to the new ROB entry by the execution unit as soon as it is ready.
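The three cases above can be written down as a small decision function. This is only a sketch of the rule as described in the list: the `in_flight` map and the names `Source` and `operand_source` are hypothetical bookkeeping for illustration, not something from the article.

```python
from enum import Enum

class Source(Enum):
    PERMANENT_FILE = "read from the permanent register file"
    ROB_ENTRY = "copied from an executed, not yet retired ROB entry"
    PENDING = "filled in later by the execution unit"

def operand_source(reg, in_flight):
    """Classify where ROB-read gets the value of `reg`.

    `in_flight` (hypothetical bookkeeping for this sketch) maps a register
    to True if the μop producing it has executed but not retired, or False
    if that μop is still queued. Registers missing from the map have not
    been modified recently.
    """
    if reg not in in_flight:
        return Source.PERMANENT_FILE
    if in_flight[reg]:
        return Source.ROB_ENTRY
    return Source.PENDING

# eax untouched, edx already computed, edi still waiting to execute
in_flight = {"edx": True, "edi": False}
for reg in ("eax", "edx", "edi"):
    print(reg, "->", operand_source(reg, in_flight).value)
```

Only the first case goes through the permanent register file, which is the case that matters for the rest of this post.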
What we are interested in is the first scenario, where the ROB-read stage reads from the permanent register file. Unfortunately, the permanent register file only has 2 read ports (again, this probably only applies to older CPUs, as the problem is solved from Sandy Bridge onward). This means that any RAT group that requires more than 2 permanent register reads will cost extra clock cycles. When that happens, the preceding RAT stage is stalled until ROB-read is ready again. This known problem is called a register read stall.
Since ROB-read can handle only two permanent register reads per clock cycle, the instructions in the example below will cost 3 clock cycles for the ROB-read to finish.
    mov ebx, eax    ; read eax
    add edx, eax    ; read eax and edx
    sub edi, [eax]  ; read edi and eax
This is because we have 5 reads from permanent registers, as annotated in the example, and 5 reads through 2 ports take 3 clock cycles. You probably noticed that eax is read 3 times. This is because eax has not been modified by any of these instructions, so every use of it requires a read from the permanent register file.
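The arithmetic behind that number is just a rounded-up division. Here is a tiny Python sketch of it, assuming the 2 read ports mentioned above and assuming that a group always takes at least one cycle; `rob_read_cycles` is a name I made up for this note.

```python
import math

READ_PORTS = 2  # read ports on the permanent register file, per the article

def rob_read_cycles(permanent_reads, ports=READ_PORTS):
    """Cycles one RAT group spends in ROB-read.

    Assumption of this sketch: a group takes at least 1 cycle even when
    it needs no permanent register reads at all.
    """
    return max(1, math.ceil(permanent_reads / ports))

print(rob_read_cycles(5))  # the group above needs 5 permanent reads -> 3
```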
It is even more interesting when you have a set of operations such as this in the RAT queue:
    mov ebx, eax    ; read eax
    add edx, eax    ; read eax and edx
    sub edi, [eax]  ; read edi and eax
    sub edi, [eax]  ; read eax
    imul edx, eax   ; read eax
    mov esi, eax    ; read eax
Notice that at the instruction imul edx, eax, we did not need to read the value of edx. This is because edx was recently modified in the previous set of 3 μops, so the freshly baked value is still available in a ROB entry and does not require a permanent read.
Let's follow the ROB-read operation for every 3 μops (the maximum RAT throughput): the first 3 μops will cost 3 clock cycles due to 5 permanent register reads (there are only 2 read ports), and the following 3 μops will cost another 2 clock cycles for 3 reads. This is 5 clock cycles altogether.
Now let's see what happens if a mov eax, eax instruction is introduced into this piece of code.
    mov eax, eax    ; read eax
    mov ebx, eax
    add edx, eax    ; read edx
    sub edi, [eax]  ; read edi
    sub edi, [eax]
    imul edx, eax
    mov esi, eax
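Because the mov instruction "modifies" the value of eax (by moving it to itself), the subsequent reads of eax are free: the ROB simply pulls the eax value from one ROB entry to another and does not need to read it from the permanent register file. Only 3 permanent register reads remain (eax, edx and edi), and no group of 3 μops needs more than 2 of them, so the register read stalls are gone. Even with the extra instruction, the 7 μops get through ROB-read in 3 clock cycles instead of 5.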
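If you want to play with the counting yourself, here is a rough Python sketch of the model used in this post. It is a simplification under several assumptions of mine: every instruction is 1 μop, the RAT takes μops in groups of 3, nothing retires while the sequence runs, and every read of an unmodified register counts as one permanent read. The operands follow the "; read" annotations in the listings above, and all function names are made up.

```python
import math

GROUP_SIZE = 3  # μops the RAT handles per clock cycle (from the article)
READ_PORTS = 2  # read ports on the permanent register file

def permanent_reads_per_group(uops):
    """uops: list of (dest, sources). Returns permanent reads per RAT group."""
    written = set()  # registers modified so far (assumed not yet retired)
    reads = []
    for i, (dest, sources) in enumerate(uops):
        if i % GROUP_SIZE == 0:
            reads.append(0)  # start a new group of 3 μops
        # Only registers nobody has written yet come from the permanent file.
        reads[-1] += sum(1 for r in sources if r not in written)
        written.add(dest)
    return reads

def total_cycles(uops):
    """Each group costs at least 1 cycle, more if it exceeds the read ports."""
    return sum(max(1, math.ceil(n / READ_PORTS))
               for n in permanent_reads_per_group(uops))

original = [
    ("ebx", ("eax",)),        # mov: reads eax
    ("edx", ("edx", "eax")),  # add: reads edx and eax
    ("edi", ("edi", "eax")),  # sub: reads edi and eax
    ("edi", ("edi", "eax")),  # sub: edi is fresh, reads eax
    ("edx", ("edx", "eax")),  # imul: edx is fresh, reads eax
    ("esi", ("eax",)),        # mov: reads eax
]
refreshed = [("eax", ("eax",))] + original  # mov eax, eax in front

print(permanent_reads_per_group(original), total_cycles(original))
print(permanent_reads_per_group(refreshed), total_cycles(refreshed))
```

In this toy model the original sequence comes out at 5 cycles, while the refreshed one never needs more than 2 permanent reads in a group, so the read ports stop being the bottleneck.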
So there we go, mov eax, eax is definitely not useless; if anything, it is the exact opposite. It can optimise your assembly code for older CPUs.
Anyway, I went through this at ~4am and I hope my understanding is correct and helped you in any way. =p