Optimizing C++ Code : Dead Code Elimination

If you have arrived in the middle of this blog series, you might want instead to begin at the beginning.

This post examines the optimization called Dead-Code-Elimination, which I’ll abbreviate to DCE.  It does what it says: discards any calculations whose results are not actually used by the program.

Now, you will probably assert that your code calculates only results that are used, and never any results that are not used: only an idiot, after all, would gratuitously add useless code – calculating the first 1000 digits of pi, for example, whilst also doing something useful.  So when would the DCE optimization ever have an effect?

The reason for describing DCE so early in this series is that it could otherwise wreak havoc and confusion throughout our  exploration of other, more interesting optimizations: consider this tiny example program, held in a file called Sum.cpp:

int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
}

We are interested in how fast the loop executes, calculating the sum of the first billion integers.  (Yes, this is a spectacularly silly example, since the result has a closed formula, taught in high school.  But that's not the point.)
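For reference, that closed formula is simply n(n+1)/2.  Here is a quick sketch – my own check, not part of the original example – showing that the two agree:

#include <stdio.h>

int main() {
    const long long n = 1000000000;
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;   // the loop from Sum.cpp
    long long formula = n * (n + 1) / 2;         // high-school closed formula
    printf("%lld %lld\n", s, formula);           // both print 500000000500000000
}

(Everything that follows, though, measures the original Sum.cpp, loop and all.)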

Build this program with the command:  CL /Od /FA Sum.cpp  and run with the command Sum.  Note that this build disables optimizations, via the /Od switch.  On my PC, it takes about 4 seconds to run.  Now try compiling optimized-for-speed, using CL /O2 /FA Sum.cpp.  On my PC, this version runs so fast there’s no perceptible delay.  Has the compiler really done such a fantastic job at optimizing our code?  The answer is no (but in a surprising way, also yes):

Let’s look first at the code it generates for the /Od case, stored into Sum.asm.  I have trimmed and annotated the text to show only the loop body:

       mov    QWORD PTR s$[rsp], 0                     ;; long long s = 0
       mov    QWORD PTR i$1[rsp], 1                    ;; long long i = 1
       jmp    SHORT $LN3@main  
$LN2@main:

       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       inc    rax                                      ;; rax += 1
       mov    QWORD PTR i$1[rsp], rax                  ;; i = rax 
$LN3@main:
       cmp    QWORD PTR i$1[rsp], 1000000000           ;; i <= 1000000000 ?
       jg     SHORT $LN1@main                          ;; no – we’re done
       mov    rax, QWORD PTR i$1[rsp]                  ;; rax = i
       mov    rcx, QWORD PTR s$[rsp]                   ;; rcx = s
       add    rcx, rax                                 ;; rcx += rax
       mov    rax, rcx                                 ;; rax = rcx
       mov    QWORD PTR s$[rsp], rax                   ;; s = rax
       jmp    SHORT $LN2@main                          ;; loop
$LN1@main:

The instructions look pretty much like you would expect.  The variable i is held on the stack at offset i$1 from the location pointed-to by the RSP register; elsewhere in the asm file, we find that i$1 = 0.  We use the RAX register to increment i.  Similarly, variable s is held on the stack (at offset s$ from the location pointed-to by the RSP register; elsewhere in the asm file, we find that s$ = 8).  The code uses RCX to calculate the running sum each time around the loop.

Notice how we load the value for i from its “home” location on the stack, every time around the loop; and write the new value back to its “home” location.  The same for the variable s.  We could describe this code as “naïve” – it’s been generated by a dumb compiler (i.e., one with optimizations disabled).  For example, it’s wasteful to keep accessing memory for every iteration of the loop, when we could have kept the values for i and s in registers throughout.

So much for the non-optimized code.  What about the code generated for the optimized case?  Let’s look at the corresponding Sum.asm for the optimized, /O2, build.  Again, I have trimmed the file down to just the part that implements the loop body, and the answer is:

                                                       ;; there’s nothing here!

Yes – it’s empty!  There are no instructions that calculate the value of s.  

Well, that answer is clearly wrong, you might say.  But how do we know it is wrong?  The optimizer has reasoned that the program never actually makes use of the value of s, and so it has not bothered to include any code to calculate it.  You can't say the answer is wrong unless you actually check that answer, right?

We have just fallen victim to, been mugged in the street by, and lost our shirt to, the DCE optimization.  If you don’t observe an answer, the program (often) won’t calculate that answer. 

You might view this effect in the optimizer as analogous, in a profoundly shallow way, to a fundamental result in Quantum Physics, often paraphrased in articles on popular science as “If a tree falls in the forest, and there’s nobody around to hear, does it make a sound?”

Well, let’s “observe” the answer in our program, by adding a printf of variable s, into our code, as follows:

#include <stdio.h>
int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
    printf("%lld ", s);
}

The /Od version of this program prints the correct answer, still taking about 4 seconds to run.  The /O2 version prints the same, correct answer, but runs much faster.   (See the optional section below for the value of “much faster” – in fact, the speedup is around 7X)

At this point, we have explained the main point of this blog post: always be very careful, when analyzing compiler optimizations and measuring their benefit, that we are not being misled by DCE.  Here are four steps that help to notice, and ward off, the unasked-for attention of DCE:

  • Check that the timings have not suddenly improved by an order of magnitude
  • Examine the generated code (using the /FA switch)
  • If in doubt add a strategic printf
  • Put the code of interest into its own .CPP file, separate from the one holding main.  This works, so long as you do not request whole-program optimization (via the /GL switch, that we’ll cover later)
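As a concrete illustration of that last step, a minimal arrangement might look like this (the file and function names are my own, not from the original post).  Without /GL, the compiler cannot see across the two files, so it cannot prove that the loop's result goes unused:

// SumLoop.cpp  – compile with:  CL /c /O2 /FA SumLoop.cpp
long long SumTo(long long n) {
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;                              // the result escapes via the return value
}

// Main.cpp  – compile and link with:  CL /O2 Main.cpp SumLoop.obj
long long SumTo(long long n);              // declaration only; definition is in SumLoop.cpp
int main() {
    SumTo(1000000000);                     // the compiler cannot see into SumTo, so it
}                                          // must keep this call – and SumTo keeps its loop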

However, we can wring several more interesting lessons from this tiny example.  I have marked these sections below as “Optional-1” through “Optional-4”.

  

Optional-1 : Codegen for /O2

Why is our /O2 version (including a printf to defeat DCE), so much faster than the /Od version?  Here is the code generated for the loop of the /O2 version, extracted from the Sum.asm file:

       xor    edx, edx
       mov    eax, 1 
       mov    ecx, edx
       mov    r8d, edx
       mov    r9d, edx
       npad   13
$LL3@main:
       inc    r9
       add    r8, 2
       add    rcx, 3
       add    r9, rax                           ;; r9  = 2  8 18 32 50 ...
       add    r8, rax                           ;; r8  = 3 10 21 36 55 ...
       add    rcx, rax                          ;; rcx = 4 12 24 40 60 ...
       add    rdx, rax                          ;; rdx = 1  6 15 28 45 ...
       add    rax, 4                            ;; rax = 1  5  9 13 17 ...
       cmp    rax, 1000000000                   ;; i <= 1000000000 ?
       jle    SHORT $LL3@main                   ;; yes, so loop back

Note that the loop body contains approximately the same number of instructions as the non-optimized build, and yet it runs much faster.  That’s mostly because the instructions in this optimized loop use registers, rather than memory locations.  As we all know, accessing a register is much faster than accessing memory.  Here are approximate latencies that demonstrate how memory access can slow your program to a snail’s pace:

Location     Latency
Register     1 cycle
L1           4 cycles
L2           10 cycles
L3           75 cycles
DRAM         60 ns

So, the non-optimized version is reading and writing to stack locations, which will quickly migrate into L2 (10 cycle access time) and L1 cache (4 cycle access time).  Both are slower than if all the calculation is done in registers, with access times around a single cycle.

But there’s more going on here to make the code run faster.  Notice how the /Od version increments the loop counter by 1 each time around the loop.  But the /O2 version increments the loop counter (held in register RAX) by 4 each time around the loop.  Eh?

The optimizer has unrolled the loop.  So it adds four items together on each iteration, like this:

s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .

By unrolling this loop, the test for loop-end is made every four iterations, rather than on every iteration – so the CPU spends more time doing useful work than forever checking whether it can stop yet!

Also, rather than accumulate the results into a single location, it has decided to use 4 separate registers, to accumulate 4 partial sums, like this:

RDX = 1 + 5 +  9 + 13 + ...  =  1,  6, 15, 28 ...
R9  = 2 + 6 + 10 + 14 + ...  =  2,  8, 18, 32 ...
R8  = 3 + 7 + 11 + 15 + ...  =  3, 10, 21, 36 ...
RCX = 4 + 8 + 12 + 16 + ...  =  4, 12, 24, 40 ...

At the end of the loop, it adds the partial sums, in these four registers, together, to get the final answer.
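In C++ terms, the transformed loop behaves roughly like this hand-written sketch – my own approximation of the optimizer's output, not a literal transcription of the assembler:

long long SumUnrolled(long long n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;    // four independent partial sums
    for (long long i = 1; i + 3 <= n; i += 4) {  // loop-end test once every 4 iterations
        s0 += i;                                 // 1,  6, 15, 28, ...  (RDX)
        s1 += i + 1;                             // 2,  8, 18, 32, ...  (R9)
        s2 += i + 2;                             // 3, 10, 21, 36, ...  (R8)
        s3 += i + 3;                             // 4, 12, 24, 40, ...  (RCX)
    }
    // (left-over iterations omitted – see the exercise below)
    return s0 + s1 + s2 + s3;                    // combine the partial sums at the end
}

Using four independent accumulators, rather than one, also breaks the dependency chain between the additions, giving the CPU more opportunity to execute them in parallel.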

(I will leave it as an exercise for the reader to work out how the optimizer copes with a loop whose trip count is not a multiple of 4.)

 

Optional-2 : Accurate Performance Measurements

Earlier, I said that the /O2 version of the program, without a printf inhibitor, “runs so fast there’s no perceptible delay”.  Here is a program that makes that statement more precise:

#include <windows.h>
#include <stdio.h>
int main() {
  LARGE_INTEGER start, stop;
  QueryPerformanceCounter(&start);
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
  QueryPerformanceCounter(&stop);
  double diff = stop.QuadPart - start.QuadPart;
  printf("%f", diff);
}

It uses QueryPerformanceCounter to measure the elapsed time.  (This is a “sawn-off” version of a high-resolution timer described in a previous blog).  There are lots of caveats to bear in-mind when measuring performance (you might want to browse a list I wrote previously), but they don’t matter for this particular case, as we shall see in a moment:

On my PC, the /Od version of this program prints a value for  diff of about 7 million somethings.  (The units of the answer don’t matter – just know that the number gets bigger as the program takes longer to run).  The /O2 version prints a value for diff of 0.  And the cause, as explained above, is our friend DCE.

When we add a printf to prevent DCE, the /O2 version runs in about 1 million somethings – a speedup of about 7X over the /Od version.
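If you want those "somethings" in real units, QueryPerformanceCounter ticks can be converted to seconds via QueryPerformanceFrequency.  A minimal addition to the program above – reusing its start and stop variables – would be:

  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);                    // counter ticks per second
  double seconds = (stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
  printf("%f seconds\n", seconds);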

 

Optional-3 : x64 Assembler “widening”

If we look carefully again at the assembler listings in this post, we find a few surprises in the part of the program that initializes our registers:

       xor    edx, edx                          ;; rdx = 0     (64-bit!)
       mov    eax, 1                            ;; rax = i = 1 (64-bit!)
       mov    ecx, edx                          ;; rcx = 0     (64-bit!)
       mov    r8d, edx                          ;; r8  = 0     (64-bit!)
       mov    r9d, edx                          ;; r9  = 0     (64-bit!)
       npad   13                                ;; multi-byte nop alignment padding
$LL3@main:

Recall first that the original C++ program uses long long variables for both the loop counter and the sum.  In the VC++ compiler, these map onto 64-bit integers.  So we should expect the generated code to make use of x64’s 64-bit registers.

We already explained, in a previous post, that xor reg, reg is a compact way to zero out the contents of reg.  But our first instruction applies xor to register EDX – the lower 32 bits of the RDX register.  The next instruction moves a 1 into EAX, the lower 32 bits of the RAX register.  This same pattern – of moving a value into a 32-bit register – continues with the next 3 instructions.  Taken at face value, this would leave the higher 32 bits of each target register containing random junk.  And yet the loop body performs calculations on the full-width, 64-bit registers.  How can the answers possibly be right?

The answer is that the x64 instruction set, originally defined by AMD, chose to automatically zero-extend 32-bit results into the high 32 bits of the 64-bit destination register.  Here are the two applicable bullets from section 3.4.5 of AMD's architecture manual:

  • Zero-Extension of 32-Bit Results: 32-bit results are zero-extended into the high 32 bits of 64-bit GPR destination registers.
  • No Extension of 8-Bit and 16-Bit Results: 8-bit and 16-bit results leave the high 56 or 48 bits, respectively, of 64-bit GPR destination registers unchanged.
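One practical consequence (my own illustration, not taken from the manual) is that the compiler can initialize a 64-bit register using the shorter 32-bit instruction encodings, exactly as we saw above.  In C++ terms:

#include <stdio.h>

int main() {
    unsigned int low = 1;              // a 32-bit value
    unsigned long long wide = low;     // no extra instruction needed to clear bits 32..63:
                                       // a 32-bit "mov" already zero-extends into the full register
    printf("%llu\n", wide);            // prints 1
}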

Finally, notice the npad 13 instruction (really a pseudo-operation – a directive to the assembler).  This ensures the next instruction – which is the start of the loop body – will be aligned in memory on an address that’s a multiple of 16 bytes.  This improves performance (sometimes, and on some micro-architectures).

  

Optional-4 : printf versus std::cout

You might ask why I used C’s printf function, rather than C++ std::cout as a way to defeat DCE in the above experiment.  Try it and see.  Both work ok, but the asm file generated by the latter is much bigger, and therefore more difficult to navigate around: 0.7 Mbytes compared with 1.7 Kbytes.
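For reference, the std::cout version – which works just as well as a DCE inhibitor – looks like this:

#include <iostream>
int main() {
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
    std::cout << s << "\n";            // observing s keeps the loop alive
}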
