How to get the best performance from CACAO

There are several things you can try to get the most out of CACAO.

Use Java 1.5 Bytecode

Java 1.5 bytecode loads constant class references (class literals such as Foo.class) directly with the LDC instruction, which is much more efficient than the synthetic class$ helper methods that compilers generate when targeting the Java 1.4 platform.

You should at least compile the VM classes of CACAO targeting Java 1.5 bytecode.

Configure Options

Add the following options when configuring CACAO:

--disable-debug --disable-disassembler CFLAGS="-O2"

Add to CFLAGS any options that produce better code for your particular processor.

(Note: --disable-disassembler is not yet available in the releases; just leave it out there.)

Important: if for some reason you want the CACAO interpreter to be fast, you must add the option -fno-reorder-blocks to CFLAGS. Otherwise the dynamic superinstructions will not work properly.

You can also add the configure option --disable-verifier which removes all verifier code, but be aware that what you get then is no longer a JVM complying with the Java Virtual Machine Specification. You will have to trust the bytecode you execute...

(Note: --disable-verifier is currently broken, see PR81. You can, however, just use the runtime argument -noverify instead.)

Use __thread

A patch to enable the use of __thread was originally posted on the mailing list and has been committed to the CACAO repository as of Dec 26 2008. The --enable-__thread switch is on by default. It should probably be renamed to --enable-tls.
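For readers who have not used __thread before, here is a minimal sketch of what the keyword does. It is not taken from the CACAO sources: the names tls_counter, bump_counter, bump_on_new_thread and tls_demo are made up for illustration, and the example assumes GCC or Clang, where __thread is available as an extension. Each thread gets its own copy of the variable, which the compiler can typically address directly instead of going through a library lookup. That is presumably where the speedup for frequently accessed per-thread VM state comes from.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread counter. __thread is the GCC/Clang storage-class
// extension; C++11 spells the same idea `thread_local`.
static __thread int tls_counter = 0;

// Increment the calling thread's private copy n times and return it.
int bump_counter(int n)
{
    for (int i = 0; i < n; ++i)
        ++tls_counter;
    return tls_counter;
}

// Run bump_counter(n) on a fresh thread and return what that thread saw.
int bump_on_new_thread(int n)
{
    int seen = -1;
    std::thread worker([&seen, n] { seen = bump_counter(n); });
    worker.join();
    return seen;
}

// Usage: the new thread starts from its own zero-initialized copy and
// does not disturb the copy that belongs to the calling thread.
void tls_demo()
{
    int before = bump_counter(3);          // caller's copy grows by 3
    int other  = bump_on_new_thread(5);    // fresh copy: 0 + 5
    assert(other == 5);
    assert(bump_counter(0) == before);     // caller's copy is untouched
}
```

Depending on the toolchain, this may need to be built with -pthread.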

On my old P4 the __thread patch gives the following performance improvements. The first and second columns are run times in milliseconds; the third column is the ratio of the second to the first, so a value below 1.0 is an improvement.

compress     (   528.0) : (   529.0) - (1.002)
jess         (   596.0) : (   581.0) - (0.975)
db           (    21.0) : (    20.0) - (0.952)
mpegaudio    (    90.0) : (    88.0) - (0.978)
jack         (   791.0) : (   740.0) - (0.936)
antlr        (   449.0) : (   403.0) - (0.898)
bloat        (  4850.0) : (  4326.0) - (0.892)
fop          (   748.0) : (   678.0) - (0.906)
hsqldb       (   792.0) : (   735.0) - (0.928)
jython       (  3444.0) : (  2943.0) - (0.855)
luindex      (   806.0) : (   768.0) - (0.953)
lusearch     (  7298.0) : (  6966.0) - (0.955)
pmd          (    64.0) : (    48.0) - (0.750)
xalan        (  9952.0) : (  9402.0) - (0.945)

Use mfence

Every non-ancient x86 supports the mfence instruction (it is part of SSE2, so it has been around since the P4, I think). This small patch makes good use of it.

diff -r 9a5247aed3b3 src/vm/jit/i386/md-atomic.hpp
--- a/src/vm/jit/i386/md-atomic.hpp     Mon Dec 01 11:28:17 2008 +0100
+++ b/src/vm/jit/i386/md-atomic.hpp     Mon Dec 01 19:43:50 2008 +0100
@@ -90,7 +90,7 @@
  */
 inline void Atomic::memory_barrier(void)
 {
-       __asm__ __volatile__ ("lock; add $0, 0(%%esp)" : : : "memory" );
+       __asm__ __volatile__ ("mfence" : : : "memory" );
 }
 
 

Let's see if this helps (on my Core 2 MacBook):

db           ( 10829.0) : ( 10387.0) - (0.959)

It helps a bit (about 4% here) for lock-heavy benchmarks like "db". Nothing spectacular, though.
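To try the barrier outside of the CACAO tree, a standalone sketch along the lines of the patched Atomic::memory_barrier() could look like this. The names full_memory_barrier, publish, shared_data, shared_ready and barrier_demo, as well as the non-x86 fallback, are assumptions made for this example and are not part of the patch. It assumes GCC or Clang inline assembly syntax.

```cpp
#include <cassert>

// Full memory barrier, modeled on the patched Atomic::memory_barrier().
// The free-function name and the non-x86 fallback are illustrative only.
inline void full_memory_barrier()
{
#if defined(__i386__) || defined(__x86_64__)
    // mfence orders all earlier loads and stores before all later ones;
    // the "memory" clobber also keeps the compiler from reordering
    // memory accesses across this point.
    __asm__ __volatile__ ("mfence" : : : "memory");
#else
    // Not x86: fall back to the GCC/Clang full-barrier builtin.
    __sync_synchronize();
#endif
}

// Hypothetical shared state: a payload and a flag announcing it.
static volatile int shared_data  = 0;
static volatile int shared_ready = 0;

// Store the payload, then the flag, with a full barrier in between so
// the payload store is ordered before the flag store.
void publish(int value)
{
    shared_data = value;
    full_memory_barrier();
    shared_ready = 1;
}

// Usage: single-threaded smoke test that the barrier compiles and runs.
void barrier_demo()
{
    publish(42);
    assert(shared_ready == 1);
    assert(shared_data == 42);
}
```

This only checks that the instruction assembles and executes. Any timing difference would have to be measured inside the VM, as in the "db" run above.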

Contribute

If you want CACAO to be faster and have an idea how to achieve that, join us! We can always use help. :) (see CacaoChat)

cacaowiki: BestPerformanceHowto (last edited 2009-01-13 19:22:03 by StefanRing)