How Java’s Garbage Collector Works?
The program specifications and hardware constraints of embedded systems result in some unique problems regarding garbage collection. Different garbage collection algorithms are appropriate depending on the embedded systems application, thus preventing garbage collection from being a “silver bullet” problem.
The efficiency of garbage collection algorithms depends on numerous factors. One factor is the hardware of the targeted platform: its CPU type, the amount of memory available, and the availability of memory management hardware support. Another factor is the type of runtime system (compiled or interpreted) and the level of support it provides for memory allocation (supervised or unsupervised). Finally, factors depending on the application itself, notably its memory usage patterns (memory cells connectivity, relative size, and life span of allocated objects), are also relevant.
Once these factors have been taken into consideration, the suitable garbage collection algorithms should be reviewed in order to take into account their processing costs, the space overhead they impose (both on the allocated data and on the code size) as well as the performance degradation they incur as the heap fills up.
Garbage collection relies on the observation that a previously allocated memory location that is no longer referenced by any pointers is no longer accessible to the application and is therefore reclaimable. The role of a garbage collector is to identify and recycle these inaccessible allocated memory chunks. There are two families of garbage collecting techniques — reference counting and tracing .
Reference counting keeps track of the number of references to a memory cell. The cell is considered reclaimable whenever this number becomes zero. The compiler or the runtime system is responsible for adjusting the reference count every time a pointer is updated or copied. The main advantage of reference counting is that the computational overhead is distributed throughout the execution of the program, which ensures a smooth response time. Reclaiming an isolated cell doesn’t imply accessing other cells, and there are optimizations that reduce the cost of deleting the last reference to an entire group of cells. The major weakness of reference counting is its inability to reclaim cyclic structures such as doubly-linked lists or trees featuring leaf nodes with pointers back to their roots.
Reference counting can be optimized to reduce the number of times reference count fields have to be updated and to reduce the size of the reference count field. It can also be modified to detect pointers that are part of a cycle or can even be combined with a tracing algorithm, described below, in order to reclaim cyclic data.
Reachable (live) memory locations are those that are directly or indirectly pointed to by a primordial set of pointers called the roots. Tracing algorithms start from the roots and examine each allocated memory cells’ connectivity.
There are two kinds of tracing algorithms: “mark and sweep” and “copying.” Mark and sweep marks active memory cells (those that are reachable through pointer reference, starting with the roots) whenever the amount of free memory goes below a certain threshold, and then sweeps through the allocated cells. When sweeping is complete, unmarked cells are returned to the available memory pool. The advantage of this approach is its ability to reclaim cyclic data structures. But if implemented as described, it may not be acceptable for an application with some real-time constraints for reasons I will explain below.
The copying algorithm uses a heap divided into halves called semi-spaces. One semi-space, called the current space, is used for memory allocation. When garbage collection occurs, active memory cells are traced from the current space and copied into the second space. When done, the second space contains only live cells (non-garbage) and becomes the new active space. The cells get automatically compacted as a result of this process. Unreachable cells (garbage) are not copied and are thus naturally reclaimed. The copying algorithm ensures fast allocation of new cells and reduces fragmentation. On the other hand, it halves the memory available to the application and touches all the living memory cells during the copy algorithm, which may lead to a considerable number of page faults on a system using pagination.
Tracing garbage collectors can be either accurate/exact (able to distinguish pointer from non-pointer data) or conservative (consider every word encountered as a potential pointer). The type of tracing garbage collector you use — exact or conservative — affects the way pointers are identified within the location where roots can be found (registers, stack, static areas) and the way the garbage collector scans cells looking for pointers to other cells.
A conservative garbage collector needs to examine every memory word as a potential pointer. This process isn’t guaranteed to succeed for all instances of the considered words and may lead to misidentification errors. This implies that memory cells that are actually garbage may accidentally be retained. Furthermore, a fully conservative garbage collector can’t venture moving memory blocks around, wrongly updating memory references it thinks point to them. It implies that a fully conservative garbage collector can’t be a fully compacting collector.
Conservative collectors can be altered to be mostly copying: objects that are solely accessible from other heap-allocated object are copied. They are deployed on systems where no cooperation from the compiler or the runtime system is expected to help the collector tell whether a word is a pointer or not.
Garbage Collection and Java
The Java runtime relies on garbage collection to identify and reclaim objects no longer in use automatically. Garbage collection doesn’t alter the behavior of an application written in Java, with the exception of classes that declare a finalizer method. Finalizers are executed right before objects found to be garbage are discarded, giving the developer a chance to explicitly release non-memory resources the class uses. Tricky finalizers may resuscitate an object by installing a new reference to it, but finalizers are executed only once. This guarantees that objects may be brought back to life only once. Beginning with JDK 1.1, data belonging to classes is also garbage collected, allowing unused classes to be automatically discarded (this happens whenever their class load gets recycled.) Note that system classes are never garbage collected.
The initial set of roots of a Java runtime are the live thread objects, the system classes (for instance java.lang.String), the local variables and operands on the JVM stack, the references from the JNI, the pending exceptions and the references from the native stack. Stemming from these roots, all three types of Java entities feature reachable components that will be examined by the collector. Objects have instance fields and class that are reclaimable, arrays have element objects and class that can be garbage collected and classes feature a set of components that can be recycled: super class, interfaces, linked classes, string literals, static fields, class loader, signers, and protection domains. Up to JDK 1.1.4, Sun uses a partially conservative compacting mark and sweep algorithm, which makes a conservative assumption when the native stack is scanned, but allows the Java heap and stack to be scanned in a non-conservative way. The algorithm compacts the heap, substantially reducing its fragmentation.
Sun’s implementation relies on an implementation detail to keep the cost of objects’ relocation low. The reference to an object (by which an object is know to exists to other objects and the VM) is implemented as a non-moving handle pointing to the actual object’s data. Once the actual object’s relocation performed, simply updating its handle allows the object to be accessed at its new location.
The garbage collection code is executed in a separate thread when the system runs out of memory or when System.gc is explicitly invoked. The collector may also run asynchronously when it detects that the system is idle, but this may require some support from the native thread package. In any cases, execution of the garbage collector halts all other application threads and scans the entire heap.
Which Garbage Collection for Embedded Java?
In order to adapt Java garbage collection to embedded systems, you must consider four issues: the nature of the runtime environment, the hardware characteristics and limitations of the embedded platform, the expected performances (acceptable general overhead and maximum pause times), and the way in which critical memory situations need be handled.
The runtime environment will mainly influence the way roots are identified. An interpreted or semi-compiled (JIT) environment can make references clearly identifiable so the Java heap and the Java stack can be scanned non-conservatively. On the other hand, the pre-compiled approach — like the one currently being developed at Cygnus — needs to rely on conservative scanning, unless additional information is made available to the runtime by a cooperative compiler.
Memory, both RAM and ROM, is a scarce resource. Memory constraints usually rule out all copying and generational algorithms and prevents sophisticated solutions like adaptive collectors (that are dynamically changing the way they garbage collect the memory to achieve greater efficiency) or cooperative collectors (featuring several algorithms in one collector, to get the best of several worlds) because of the amount of ROM required for their implementation.
Additionally, reference counting requires the count field of an allocated object to be roughly the same size as a pointer or some other value that can represent all valid addresses of the system. This field has to remain tied to the object. On the other hand, information tags required by a mark and sweep range from a minimum of 1 bit to several bits for the implementation of three colors algorithms (which are based on an abstraction introduced by Dijkstra, attributing different colors to memory cells depending on the need for the collector to visit, re-consider or discard them) or pointer-reversal marking algorithms (a time and space bound algorithm that stores in the currently examined memory cells the value of its parent pointer).
Most embedded systems don’t provide swapped virtual memory, which implies that accessing a random portion of the memory has a fixed cost. This diminishes the impact of the mark-sweep and the space-costly copying collectors that visit all live cells before reclaiming the dead ones. On the other hand, this also means the heap can’t be expanded. When reaching a low water mark situation, the collector can’t count on tapping some virtual memory as a temporary solution.
An embedded application running Java has a strong chance of being interactively operated by a user. This demands a fast, non-disruptive garbage collector that also provides predictable performance for future memory allocation. Periodically performing partial garbage collection must be sufficient to prevent the need to run the garbage collector when a cell is allocated. However, no noticeable slow down should occur in the extreme case that the collector is forced to run.
You can enhance predictability by improving heap fragmentation so that the low-level operation of reserving memory doesn’t have to scan the heap trying to find sufficient free memory for an indeterminate amount of time. It should also be guaranteed to succeed if enough free memory is known to be present. It is unacceptable for an embedded system to have a memory allocation failure because the fragmentation of the heap prevents the allocator from finding a memory block of a sufficient size.
Fragmentation can be fought by compacting the heap. While compacting the heap, cells can be moved either by using a technique called sliding, where the cells’ allocations are retained, or by using a technique called linearizing, where a cell pointed to by another cell is moved to a close adjacent position. In addition to the copying algorithm, which compacts the heap as a side-effect of copying a live cell from one semi-space to an other one, there are several compacting mark-sweep algorithms (also called mark-compact) with complexities ranging from O(M) to O(M log M) (where M is the size of the heap). Note that conservative scanning imposes restrictions on objects that can be moved.
Non-disruptive Garbage Collection
To be able to provide non-disruptive garbage collection, it is necessary to use an incremental algorithm that runs for a fixed amount of time and ensures that further memory allocation requests won’t fail. Unfortunately, real-time algorithms [1,2] usually rely on additional processors or huge amounts of memory so that they can run copying type collectors. The Wallace-Runciman  algorithm combines mark-during-sweep where portions of the heap are marked and then swept as the mark phase works on the next region, and stack collect, a modified mark-sweep where states of the marking phase are pushed on a stack. Stack-collect avoids mark-during-sweep worst case execution time behavior which is quadratic to the size of the heap. This algorithm has a small 3 bits per cell overhead, can operate with very small stack depth, and its execution time is bounded. Unfortunately, its execution time accounted for 30%-45% of an application run time in early implementations, targeting a functional language and assuming a heap exclusively made of binary cells.
Baker’s  incremental copying algorithm is a popular real-time collector that doesn’t really fit in a embedded environment. It requires at least twice the minimum amount of memory in order to be implemented. Wilson and Johnstone  devised a incremental non-copying implicit reclamation collector that meets real-time constraints. It uses a write barrier to mark pointers to non-examined objects stored in cells that have already been examined so they are later re-examined by the collector. This is combined with a non-copying implicit reclamation strategy where unreached objects don’t need to be traversed in order to be reclaimed. The main cost resides in the necessity of maintaining data structures used to keep track of reachable cells in order to deduce unreachable (garbage) objects. The algorithm can be altered in order to fight the fragmentation it naturally induces and requires a strong cooperation from the runtime system side (or the compiler) to implement the write barriers efficiently.
The Kaffe conservative garbage collector
During the development phase of the Cygnus GNUPro compiler for the Java language, we’ve been experimenting with Kaffe, a free VM mainly written by Tim Wilkinson. It features an implementation of the classic three colors conservative scan algorithm. The source code, available from ftp.transvirtual.com, has been widely ported to several platforms and OSs.
The three colors model
The three colors model first introduced by Dijkstra in  is useful to describe incremental or non-incremental tracing garbage collectors. It assigns a color to a heap cell depending on the degree of completion of its collection. At the start of collection, all cells are white. If the cell and its immediate descendants have been visited, it is colored black and won’t be reclaimed. A cell entirely visited with the exception of its immediate descendants is colored grey. If the mutator can run in parallel with the collector and changes any pointer on a black cell, it has to change its color back to grey. Consequently, the collector must reconsider a grey cell since its collection was incomplete or changed by the action of a concurrently running mutator.
Under the three colors model, a collector runs until all reachable cells are blackened. White (unvisited) cells are garbage and hence reclaimable.
The source code
The source code for the virtual machine found under kaffevm/ features two files that are relevant to the garbage collection. gc-mem.c implements the GC aware memory allocator and some low level GC operations. gc-incremental.c implements the conservative collector.
Kaffe distinguishes several allocation needs. On one hand, it has to deal with Java objects: classes, arrays and class instances. On the other hand, it has to allocate memory for different internal data structures that are required by the VM and its components (like the JIT) to function properly. All these pieces of memory need to be garbaged collected and/or finalized differently.
For example, the memory allocated for a class instance may contain references to other objects and is scanned (walked) conservatively, but also may or may not require finalization. Note that the actual implementation that entirely and conservatively walks a class instance memory isn’t the most efficient. It could be optimized by the use of the object’s class information to distinguish, in the object’s memory, references to other objects from non collectable data.
On the other hand, the data structure representing a java.lang.Class object contains a lot of fields. It turns out that only some of them need to be walked (sometimes optionally), and classes are never finalized. Some other data structures, like thread locks or global paths are never garbage collected. Finally, live threads native stacks need to be scanned conservatively, because they may contain references objects.
To deal with these requirements, one of the bookkeeping information Kaffe ties to a garbage collected chunk of memory is a pointer to a particular GC function set. A GC function set is a data structure featuring two function pointers: the first one is the collection routine and defines how the allocated chunk of memory should be walked. The second one specifies an object’s finalization routine, or if not a function pointer, it can also be a flag used to mark the memory chunk’s special properties, tagging normal, fixed or root objects.
This data structure forms the typedef gcFuncs and several elements of that type are statically defined in gc-incremental.c.
When the garbage collector has to operate on a memory chunk, it retrieves its private GC function set and applies the function that is required to walk or finalize it. The table below summarizes how some of the memory allocations serving particular purposes are linked to a dedicated GC function set:
Allocating garbage collected memory
The main function used to allocate garbage collected memory is gc_malloc(). It allocates a memory block guaranteed to be big enough to fit the requested size by calling gc_heap_malloc(), which belongs to the low level memory allocator. It then ties to that block a pointer to an instance of the gcFuncs data structure. For example, to allocate 52 bytes of garbage collected memory that will represent an object’s instance which needs to be finalized, one can write:
void *newObject = gc_malloc (52, &gcFinalizeObject);
The GC function set is also used to determine whether the allocated memory is a GC root in which case it is linked to the rootObjects list. For the allocation of a collectable object, gc_malloc() also sets its color to WHITE and inserts the whole garbage collected page it belongs to in the white list.
The function gc_malloc_fixed() is a short cut to the invocation of gc_malloc() with the gcFixed GC function set as a second argument. There are other functions that can attach or reattach a random piece of memory to the garbage collector. In that case, an indirect root is created and collected separately.
Execution of the garbage collector and finalizer
Created during the initialization phase of Kaffe, the conservative collector runs as a separate daemon thread set to run the gcMan() function. At the same time, the finalizer thread is created with a hook to the finaliserMan() function.
Low level memory allocation
The low level memory allocator has two different behaviors depending on the size of the object is has to allocate. It will try to fit as many small objects of the same size as possible within a piece of memory the size of a page. This page will contain a local gc_block header and a computed amount of blocks of a rounded up size. This number is computed so that each block has somewhere in a special section of the page two private entries. The first one is a pointer to an instance of the gcFuncs data structure. The second private entry is an eight bit unsigned integer which is used as a state flag and also retains color information. A page of garbage collected memory intended to store small objects is organized as:
The size of this entire block is typically the system’s page size. There are N gcFuncs pointers `F‘, N bytes of flag information `f‘, and N `Data‘ blocks of a given size. For example, on a typical 32 bit system with a page size of 4KB, a gc_block of 48 bytes and 4 bytes pointers, a garbage collected block of memory will be able to store 32 ‘Data’ blocks of 120 bytes, using up to 4048 bytes. On that kind of system object sizes range from 8 bytes to 4032 bytes. They’re all multiple of the size of a pointer.
The low level memory allocation function is gc_heap_malloc() which, when called for the first time, initializes the heap by invoking gc_heap_initialize(). This function sets the sztable table, whose entries will be later used to round the size of a small allocations (below MAX_SMALL_OBJECT_SIZE bytes) and tell which free list of garbage collected pages should be processed to get a new chunk of that size.
If the size of the block to allocate is found to be big, then gc_large_block() is called to allocates a round number of pages sufficient enough to contain the object and its private collector information. This piece of memory will contain a gc_block header. The state of the memory chunk is set to NORMAL, and its color set to FREE.
Otherwise, in the case of a small block, sztable is addressed using the size of the allocation to figure a rounded up size and a free list index. Then, the allocation of the normalized chunk takes a number of steps:
- gc_small_block() is called once but may fail. For example, if gc_heap_malloc() is called for the first time, the pointer to the primary free list (gc_prim_freelist) is null and needs to be allocated. Or the system might also be running out of memory. For all those cases, execution continues with the steps 2 and further. Otherwise, the execution goes directly to step 3.
- After having eventually performed some garbage collection (if and only if there is a GC thread and some heap already allocated — which is not the case during the first invocation), gc_small_block() is called again with the same original size. If this fails, the size of the chunk to allocate is changed to the value of the heap allocation size (ALLOC_HEAPSIZE, typically 1MB) and some memory is allocated from the system, invoking gc_system_alloc(). This will get pagealloc() called which in its turn relies, depending on how Kaffe was built, on sbrk(), memalign() or malloc() to perform the real allocation. This block is then inserted in the gc_objecthash hash table and then made a free list available entry.
- The routine allocates an object of the rounded-up size. The free list is processed to find a fit for a chunk of the size of gc_pgsize which is usually set to match the size of a system’s page. This chunk of memory will be prepared to receive a given amount of small object of the rounded up size. Within that page, blocks of the normalized size are chained into a free list whose head pointer is maintained in the free field of the local gc_block header. Each of these blocks has its state tagged NORMAL and its color set to FREE.
- The next available entry of the rounded-up size is extracted from the allocated block, using the free list information contained in its header. Its state is reset to normal and it content zeroed and returned.
Things might take place slightly differently on an embedded system. The sztable can computed by advance and stored in ROM. pagealloc() might end up doing something different to obtain a new memory block, depending on the memory management routines supported by the embedded device.
Conservatively guessing the nature of a pointer
A conservative GC has to determine if an address is a valid garbage collected memory location. The function gc_heap_isobject() provides this functionality. Its implementation successively looks for several properties featured by a memory location pointing to an object. By doing so, it reduces the chances of a misidentification that would lead to the unintentional retention of a garbage object and cause a memory leak.
gc_heap_isobject() first tries to retrieve the gc_block associated with the memory location. It then computes the memory location’s hash table index, from which it gets an entry in the gc_objecthash array. This linked list of allocated blocks is walked until an address equal to the gc_block of the processed memory location is found. The final test makes sure that the examined address fits within this allocated piece of collectable memory and that it has a color. If one of the test fails during the process, the pointer is not considered to be a piece of garbage collected memory, and the function returns false. Otherwise, true is returned.
There are several functions of the walk family that are used to collect single memory locations or blocks of memory. Some of them are used to specify the collector parts of the GC function sets, while others are used internally.
walkMemory() first colors a memory block to black and inserts the page it belongs to in the black list. It then applies the collector function of this chunk of memory to its entirety.
walkBlock() processes each grey memory chunks of a garbage collected page, setting their color back to black and individually applying their private GC functions.
walkConservative() calls scanConservative(), which chops a memory chunk into pointers and calls markObject() on each one if it’s a GC heap objects. This is a function that basically traces references to other objects within an object. In this process, it considers every word found inside an object’s memory as a pointer to an other live object. Since it may or may not be the case, it calls gc_heap_isobject() to filter pointers to valid objects, discarding all other address values: pointers to some other memory locations, good-looking integers and any other random array of bits.
walkNull() doesn’t perform any collection. It’s the collector attached to fixed objects, which are objects that are never collected, like the GC thread.
walkIndirectRoot() is used to collect the indirect roots that are created when a random piece of memory is attached to the garbage collector. It simply calls its own collector for the amount of memory this block represents.
The garbage collection process
There are several possible invocations of the garbage collector. gcMan(), whose code is periodically executed by the GC thread, is an endless loop. After having made other thread waiting upon completion, its calls the function startGC(). This processes all root objects, set their color to white and calls markObject() on them. markObject() first makes sure that the object belongs to garbage collected memory by calling gc_heap_isobject(). The object is then turned grey and the page it belongs to is moved to the grey list.
startGC() ends by calling walkLiveThreads(). This function processes the list of live threads object and calls walkMemory() on each of them. It turns out that for live threads, the private GC routine is walkThread() which calls markObject() on private thread info and then applies walkConservative() on the thread’s stack, looking for object’s references to be collected.
gcMan() continues by processing the list of grey allocated pages that it eventually obtained from the startGC() pass, and calls the function walkBlock() on them.
The white (garbage) list is processed and objects that need a finalization are marked calling markObject(). Next, potentially new grey objects are processed the same way grey objects were previously processed.
We’re reaching the end of the collection process. finishGC() is called to process garbage (white) objects. Objects that need to be finalized are moved to the finalise list and those which don’t require finalization but are white are returned to the local free storage of the page they belong to, calling gc_heap_free(). The finalizer thread is then woken up. Finally, black objects are moved back to the white list and marked accordingly.
Another way of invoking the garbage collector is by calling the invokeGC() which basically wakes up the GC thread. This is used to implement native functions like System.gc() or Runtime.gc().
The finalizer thread, periodically allowed to run by the GC thread, walks the finalise list and processes each entries (which are pages of garbage collected memory) as follows: the page is first moved to the white list. Then, elements of the page marked to be finalized are individually processed by their own finalizer function.
Incremental garbage collection with Kaffe
The collector is designed for incremental garbage collection by implementing the three colors model; and some code is already in place to handle it. Several problems have to be solved, including the crucial issue of the implementation of software write barriers.
The heap in an embedded system can’t be expanded because of the lack of virtual memory. This makes running out of memory a critical situation that applications should avoid using preventive measures. Since it’s impossible for a Java application to explicitly free memory (it can only make its best effort to ensure that references to an object are suppressed), it can be worthwhile to encourage references to huge but easily recomputable objects using weak references.
Weak references were introduced in JDK 1.2. They allow references to an object to be maintained without preventing the object from being reclaimed by the collector. Applications may also use the MemoryAdvice interface to be implemented by the RunTime class in JDK 1.2. It defines the getMemoryAdvice method that returns four indicators of the memory usage, as estimated by the runtime environment, from GREEN (everything is fine) to RED (take every conceivable action to discard objects).
In the case of an application running out of memory, an aggressive garbage collection pass should be performed. The garbage collector must be guaranteed the memory this operation requires. Since some stack and some heap may be necessary, the system needs to monitor the low-water mark, and prevent the application from totally running out of memory before taking action. If no memory can still be found, the system throws an OutOfMemoryError error and it’s up to the application to selectively choose which parts should be discarded.
Being able to characterize the dynamic memory allocation behavior of an application helps in choosing a garbage collection algorithm. For example, if it appears that the embedded application is known to build no cyclic data structures, then a derived reference counting algorithm might be appropriate. If allocated objects are known to be fairly similar in sizes, then a non compacting algorithm might be used.
Instrumentation of memory allocations behavior, including peak size, object sizes and life spans statistics will help a developer to select the correct garbage collection algorithm. Such a monitoring can be performed during development and the test phases by using a special version of the runtime environment. The next step is to assist the developer by allowing them to plug and tune the garbage collector of their choice.
These issues are being seriously examined at Cygnus Solutions, which is developing a Java runtime and Java extensions for the GCC compiler [see side bar “The Cygnus GNUPro compiler for the Java language”]. Being in control of the tools that generate the code gives us opportunities to ease the work of a garbage collector in many ways. We can precisely perform any relevant static allocation of objects and implement write barriers, even though the implementation of this may be challenging if we want to support pluggable collectors. An efficient write barrier needs to interact very closely with the collector in use, being aware of either its internal data structures or the object marking in use. They usually feature between ten to thirty instructions and can’t really afford the versatility of using function calls. One solution might be to specify the write barrier as a user-editable piece of code, which can be inserted as necessary by the compiler during the instruction generation phase. Developing our own compiler also gives us a chance to analyze the way finalizers are written, thus detecting actions that resuscitate objects being thrown away. This could be done by analyzing the use of this as an expression right-hand side or method parameter.
As it is the case for other programming languages and execution environment, designing and implementing a garbage collector suitable to the use of the Java language in embedded systems is not an easy task. It requires a deep analysis of numerous factors ranging from the hardware in use to the dynamic of the application to run on the embedded system. A Java runtime capable of instrumenting the memory allocation patterns coupled with the possibility of a pluggable garbage collector seem to be attractive enough, so that several player in the field are seriously considering it as a solution. With more time spent studying the memory allocations behavior of embedded applications written in Java, we might in the future see the emergence of a collection of ready-to-use garbage collector algorithm that embedded system developers will be able to directly incorporate into their design.