A common belief among programmers is that you should write code that says what you mean, because the compiler will probably do a better job of optimising than you can off the top of your head.
On the other hand, *Writing Efficient Android Code* says "It is unwise to rely on a compiler to 'save' you and make your code fast enough" and recommends that programmers do things like caching member variables (such as array lengths) in local variables. This isn't just premature optimisation; it's trivial stuff we'd expect the most basic of compilers to handle!
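For concreteness, here's a minimal sketch of the kind of hand-caching that document recommends (the `mData` field and `sum()` method are hypothetical examples, not taken from the Android docs):

```java
public class Summer {
    private int[] mData;

    public int sum() {
        int[] data = mData;   // cache the member variable in a local
        int n = data.length;  // cache the array length in a local too
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += data[i];
        }
        return total;
    }
}
```

The rest of this article looks at whether the toolchain really makes this sort of manual caching necessary.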
To understand why, it helps to know how your code ends up running on device:
- Source files are compiled to Java class files with (usually Sun's) javac.
- Java class files are compiled to Dalvik bytecode with the Android SDK's dx.
- Dalvik bytecode is interpreted by the Dalvik VM running on device. Dalvik (in Android 1.0) has no JIT.1
All three stages could perform optimisation (after all, a JIT is effectively an optimisation of the interpreter). So why don't they?
- There's not much point optimising Java bytecode if you have an optimising JIT, so Java compilers don't do it:
  - You need to handle inefficient bytecode anyway (from existing JARs, competitors' compilers, old versions of your compiler, ...). Optimised bytecode is probably harder to optimise in a JIT.2
  - The only thing you save is the size of your class files -- it's a waste of effort.
  - Efficient Java bytecode helps everybody's VM, whereas an optimising JIT gives you a competitive edge.3
- If you're in a rush to ship (and Android 1.0 feels slightly rushed), you gain a lot more by rewriting the bottlenecks in C than you do by optimising (optimising is hard -- if it were easy, GCC wouldn't be so bad at it).
- A fast, optimising JIT is even harder to get right. Sun's JVM didn't have a JIT for nearly 3 years -- plenty of time to gain a reputation for being reeeeeally slow.
Let's take some sample code. We'll start with a pattern familiar to most programmers...
```java
public void toggle() {
    mRunning = !mRunning;
    if (mRunning) {
        start();
    } else {
        stop();
    }
}
```
There are two main ways the resulting Dalvik bytecode is suboptimal (one has already been pointed out). For brevity, we'll look at functions that don't call `start()` and `stop()` (the code there is fairly uninteresting):
```java
public boolean toggle1() {
    mB = !mB;
    return mB;
}

public boolean toggle2() {
    boolean b = !mB;
    mB = b;
    return b;
}

public boolean toggle3() {
    return (mB = !mB);
}
```
The resulting bytecode is not as succinct:
```
Java                              Dalvik

public boolean toggle1();         Test.toggle1:()Z:
   0: aload_0                     regs: 0002; ins: 0001; outs: 0000
   1: aload_0                       0: iget-boolean v0, v1, Test.mB:Z
   2: getfield #5; //Field mB:Z     2: if-nez v0, 000a // +0008
   5: ifne 12                       4: const/4 v0, #int 1 // #1
   8: iconst_1                      5: iput-boolean v0, v1, Test.mB:Z
   9: goto 13                       7: iget-boolean v0, v1, Test.mB:Z
  12: iconst_0                      9: return v0
  13: putfield #5; //Field mB:Z     a: const/4 v0, #int 0 // #0
  16: aload_0                       b: goto 0005 // -0006
  17: getfield #5; //Field mB:Z
  20: ireturn

public boolean toggle2();         Test.toggle2:()Z:
   0: aload_0                     regs: 0002; ins: 0001; outs: 0000
   1: getfield #5; //Field mB:Z     0: iget-boolean v0, v1, Test.mB:Z
   4: ifne 11                       2: if-nez v0, 0008 // +0006
   7: iconst_1                      4: const/4 v0, #int 1 // #1
   8: goto 12                       5: iput-boolean v0, v1, Test.mB:Z
  11: iconst_0                      7: return v0
  12: istore_1                      8: const/4 v0, #int 0 // #0
  13: aload_0                       9: goto 0005 // -0004
  14: iload_1
  15: putfield #5; //Field mB:Z
  18: iload_1
  19: ireturn

public boolean toggle3();         Test.toggle3:()Z:
   0: aload_0                     regs: 0002; ins: 0001; outs: 0000
   1: aload_0                       0: iget-boolean v0, v1, Test.mB:Z
   2: getfield #5; //Field mB:Z     2: if-nez v0, 0008 // +0006
   5: ifne 12                       4: const/4 v0, #int 1 // #1
   8: iconst_1                      5: iput-boolean v0, v1, Test.mB:Z
   9: goto 13                       7: return v0
  12: iconst_0                      8: const/4 v0, #int 0 // #0
  13: dup_x1                        9: goto 0005 // -0004
  14: putfield #5; //Field mB:Z
  17: ireturn
```
- The original Java bytecode is terrible.
- Using a local variable replaces `aload-…-putfield-aload-getfield` with `istore-aload-iload-putfield-iload`. That's two more instructions just to use a local variable (no wonder Java interpreters are slow!).
- Using the result of the assignment replaces `aload-…-putfield-aload-getfield` with `aload-…-dup-putfield`, thus saving one instruction (and a member-variable load).
`dx` translates Java's stack and local variable accesses into Dalvik register operations. Since there's no longer a differentiation between the stack and local variables, `dup`, `iload`, and `istore` should all get compiled into `move`. Let's see what actually happens:
- The original Java bytecode turns into equally terrible Dalvik bytecode. It's safe to cache `mRunning` since it's not `volatile`, but `dx` doesn't realise this.
- Both the local-variable and assignment-result versions produce the exact same Dalvik bytecode, replacing the second `iget-boolean` with nothing. The local-variable access is optimised away!4
So neither javac nor dx gets rid of the second member-variable read, but dx does optimise away the local-variable usage. We knew this -- Google told us. It still doesn't explain why boolean negation results in a compare-and-branch; it should just be an XOR. Let's look at some more code:
```java
public static final boolean not1(boolean b) {
    return !b;
}

public static final boolean not2(boolean b) {
    return b ^ true;
}
```
The difference in bytecode is staggering:
```
Java                          Dalvik

boolean not1(boolean):        Test.not1:(Z)Z:
   0: iload_0                 regs: 0002; ins: 0001; outs: 0000
   1: ifeq 8                    0: if-eqz v1, 0004 // +0004
   4: iconst_0                  2: const/4 v0, #int 0 // #0
   5: goto 9                    3: return v0
   8: iconst_1                  4: const/4 v0, #int 1 // #1
   9: ireturn                   5: goto 0003 // -0002

boolean not2(boolean):        Test.not2:(Z)Z:
   0: iload_0                 regs: 0002; ins: 0001; outs: 0000
   1: iconst_1                  0: xor-int/lit8 v0, v1, #int 1 // #01
   2: ixor                      2: return v0
   3: ireturn
```
javac seems to compile `!b` as `b ? false : true`, and dx isn't clever enough to realise that booleans can only be 0 or 1 and optimise it into an XOR.
At the end of the day, I settled on `if (mRunning ^= true) { ... }`. It's probably terrible "code style", but it means there are fewer things to get wrong when I end up copy-and-pasting it.
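For clarity, here's what that rewrite of the earlier `toggle()` looks like (a minimal sketch, assuming `start()` and `stop()` as before):

```java
public void toggle() {
    // mRunning ^= true flips the flag; the compound assignment
    // evaluates to the *new* value, so the if tests the updated state.
    if (mRunning ^= true) {
        start();
    } else {
        stop();
    }
}
```

The negation compiles to an `ixor` rather than a compare-and-branch, for the same reason `not2` above does.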
I tried similar code6 with `gcc -std=c99` for x86, ARM, and Thumb. GCC seems to treat `!b` exactly like `b == 0`, and `b ^ true` nearly like `b == 1` or `(bool)(b ^ 1)` (but not like `(b ^ 1) != 0`). `b == 2` is optimised away to 0 (even on -O0), so GCC knows that bools can only be 0 or 1, but it seems bool is mostly treated as an unsigned char for reading.
In summary?
- With the Android SDK, member-variable access is expensive, but local-variable "allocation" is free.
- A built-in boolean type might not optimise as well as you'd expect, even for "mature" compilers like GCC.
- While you might blame the inefficiency on javac, gcj doesn't do much better.
- Most importantly, look at compiler output before you try to "optimise" your source code. You'll probably be surprised.
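If you want to poke at your own code: the Java listings above are `javap -c` output, and the Dalvik side can be disassembled with the SDK's `dexdump -d classes.dex` (footnote 5 uses the same tool to count instruction sizes).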
In the next issue I'll look at the register allocator.
1. The Dalvik VM is designed to conserve memory, which is one of the reasons it doesn't use a JIT. Interestingly, Dalvik bytecode is about 30% larger than the corresponding Java bytecode, since Dalvik instructions are larger. Try Dalvik VM Internals @ Google I/O for a productive use of YouTube. ↩
2. Inefficient stack-based bytecode is so easy to generate -- take the parse tree, output it in reverse-polish notation, and voilà: `A=B+(C*D)` translates to `load B, load C, load D, mul, add, store A`. If you don't optimise, reversing this is easy: apply a bunch of rules like `load X, load Y, mul` -> `X*Y`. If you optimise `A=A+A*A` to `load A, dup, dup, mul, add, store A`, you need to do a lot more work to extract any meaning out of it. ↩
3. GCJ doesn't optimise either -- it produces nearly the same terrible code. Jikes compiles `!b` as `b ^ true`, but this is due to a better parse-tree-to-bytecode translation rule for `!`, not an optimisation pass -- the --optimize option is ignored, according to the man page. ↩
4. A naive Java-to-Dalvik compiler could use one register for each local variable and one more for each "slot" of stack space (Java bytecode methods specify their maximum stack size), but this would cause a lot of unnecessary copying. com.android.dx.ssa.Optimizer converts code to SSA form and performs move-combining, constant propagation, "literal upgrade", "constant collection", dead-code removal, and register allocation on conversion back to register-op form. It's pretty basic and quite effective -- dx --no-optimize results in about 55% larger code5 -- but these are the only optimisations performed, as far as I can tell. ↩
5. `dexdump classes.dex | grep 'insns size' | cut -d: -f2 | cut -d' ' -f2 | python -c 'import sys; print reduce(lambda a,b:a+int(b),sys.stdin.read().split(),0)'` ↩
6. Actually `_Bool` and `((_Bool)1)`, as defined in C99. It means I don't have to worry about `<stdbool.h>`. ↩