Thread Suspension timeouts in ART
---------------------------------
ART occasionally needs to "suspend" threads for a variety of reasons. "Suspended" threads may
continue to run, but may not access data structures related to the Java heap. Please see
`mutator_gc_coord.md` for details.

The suspension process usually involves setting a flag for the thread to be "suspended", possibly
causing the thread being "suspended" to generate a SIGSEGV at an opportune point, so that it
notices the flag, and then having it acknowledge that it is now "suspended".

This process is time-limited so that it does not hang a misbehaving process indefinitely. A
timeout crashes the process with an abort message indicating a timeout in one of `SuspendAll`,
`SuspendThreadByPeer`, or `SuspendThreadByThreadId`. It will normally occur after 4 seconds if the
thread requesting the suspension has high priority, and either 8 or 12 seconds otherwise.

Any such timeout has the inherent downside that it may occur on a sufficiently overcommitted
device even when there is no deadlock or similar bug involved. Clearly this should be
extremely rare.

Android 15 changed the handling of such timeouts in several ways:

1) The underlying suspension code was changed to improve correctness and better report timeouts.
This included reducing the timeout in some cases to avoid the danger of reporting such timeouts as
hard-to-analyze ANRs.

2) When such a timeout is encountered, we now aggressively try to abort the thread refusing to
suspend, so that the main reported stack trace gives a better indication of what went wrong. The
thread originating the suspension request will still abort if this failed or took too long.

3) The timeout abort message should contain a fair amount of information about the thread failing
to abort, including two prefixes of the `/proc/<pid>/task/<tid>/stat` for the offending thread,
taken a second or more apart.
These snapshots contain several bits of useful information, such as
kernel process states, the thread priority, and the `utime` and `stime` fields indicating the
amount of time for which the thread was scheduled. See `man proc`, and look for `/proc/pid/stat`.
(Initial Android 15 versions reported `/proc/<tid>/stat` instead, which includes process rather
than thread CPU time.)

Thread suspension has been known to time out for several reasons:

1) A deadlock involving thread suspension. The issues here are discussed in `mutator_gc_coord.md`.
A common cause of these appears to be "native" C++ locks that are both held while executing Java
code, and acquired in `@CriticalNative` or `@FastNative` JNI calls. These are clear bugs that,
once identified, usually have a fairly clear-cut fix.

2) Overcommitting the cores, so that the thread being "suspended" just does not get a chance to
run within the timeout of 4 or more seconds.

3) Either ART or `@CriticalNative`/`@FastNative` code that continues in Java `kRunnable` state for
too long without checking suspension requests.

4) The thread being suspended is either itself running at a low thread priority, or is waiting for
a thread at low thread priority. A Java priority 10 thread has Linux niceness -8, but a priority 1
thread has niceness 20. This means the former gets roughly 1.25^28, or more than 500, times the
CPU share of the latter when the device's cores are overcommitted. It is worth noting that
priority 5 (NORMAL) corresponds to niceness 0, while priority 4 corresponds to niceness 10, which
is already almost a factor of 10 difference.

When we do see such timeouts, they are often a combination of the last three. The fixes in such a
case tend to be less clear. Cores may become significantly overcommitted due to attempts to avoid
unused cores, particularly during startup.
There are currently times when ART needs to perform IO
or paging operations while the Java heap is not in a consistent state. Priority issues can be
difficult to address, since temporary priority changes may race with other priority changes.

Different suspension timeout failures will usually need to be addressed individually.
There is no single "silver bullet" fix for all of them. There is ongoing work
to improve the tools available for handling priority issues. Currently the possible fixes
include:

- Remove any newly discovered deadlocks, e.g. by removing an `@FastNative` annotation to prevent
  a lock from being acquired while the thread already has Java heap access, or by no longer
  holding native locks across calls to Java.
- Reduce the amount of time spent continuously in Java runnable state. For application code, that
  may again involve removing `@FastNative` or `@CriticalNative` annotations. For ART internal
  code, break up `ScopedObjectAccess` sections or the like, being careful to not hold native
  pointers to Java heap objects across such sections.
- Avoid excessive parallelism that is causing some threads to starve.
- Reduce differences in thread priorities and, if necessary, avoid very low priority threads, for
  the same reason.
- On slow devices, if you are in a position to do so, consider setting `ro.hw_timeout_multiplier`
  to a value greater than one.
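
When triaging a timeout, it can help to compare the two `stat` snapshots from the abort message mechanically. The following is a minimal Python sketch, not part of ART. It assumes each reported prefix extends at least through field 19 (`nice`) and uses the field numbering from `man proc`. The helper names (`parse_stat`, `ticks_between`) and the sample lines are hypothetical and made up for illustration.

```python
from typing import NamedTuple


class StatSnapshot(NamedTuple):
    state: str     # field 3: kernel state, e.g. "R" (running/runnable), "S", "D"
    utime: int     # field 14: user-mode CPU time, in clock ticks
    stime: int     # field 15: kernel-mode CPU time, in clock ticks
    priority: int  # field 18: kernel scheduling priority
    nice: int      # field 19: niceness


def parse_stat(line: str) -> StatSnapshot:
    """Parse a (possibly truncated) stat line that reaches at least field 19."""
    # Field 2 (comm) is parenthesized and may itself contain spaces or ')',
    # so split after the LAST ')' rather than on whitespace alone.
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3, so field N lives at rest[N - 3].
    return StatSnapshot(
        state=rest[0],
        utime=int(rest[11]),
        stime=int(rest[12]),
        priority=int(rest[15]),
        nice=int(rest[16]),
    )


def ticks_between(first: str, second: str) -> int:
    """CPU ticks (user + kernel) the thread accumulated between two snapshots."""
    a, b = parse_stat(first), parse_stat(second)
    return (b.utime - a.utime) + (b.stime - a.stime)


# Hypothetical snapshots of the same thread, taken about a second apart.
SAMPLE_FIRST = "12345 (Binder:1234_5) R 1 1234 0 0 -1 4194368 100 0 0 0 50 20 0 0 30 10 42 0"
SAMPLE_SECOND = "12345 (Binder:1234_5) R 1 1234 0 0 -1 4194368 100 0 0 0 50 21 0 0 30 10 42 0"

if __name__ == "__main__":
    snap = parse_stat(SAMPLE_SECOND)
    print(snap.state, snap.nice, ticks_between(SAMPLE_FIRST, SAMPLE_SECOND))
```

As a rough heuristic only: a thread that stays in state `R` with positive niceness while accumulating almost no ticks between the snapshots is more likely being starved (reasons 2 and 4 above), whereas a thread that accumulates many ticks may be running too long without checking for suspension requests (reason 3).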