Thread Suspension timeouts in ART
---------------------------------
ART occasionally needs to "suspend" threads for a variety of reasons. "Suspended" threads may
continue to run, but may not access data structures related to the Java heap. Please see
`mutator_gc_coord.md` for details.

The suspension process usually involves setting a flag for the thread to be "suspended", possibly
causing the thread being "suspended" to generate a SIGSEGV at an opportune point, so that it
notices the flag, and then having it acknowledge that it is now "suspended".

This process is time-limited so that it does not hang a misbehaving process indefinitely. A
timeout crashes the process with an abort message indicating a timeout in one of `SuspendAll`,
`SuspendThreadByPeer`, or `SuspendThreadByThreadId`. It will normally occur after 4 seconds if the
thread requesting the suspension has high priority, and either 8 or 12 seconds otherwise.
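The flag, acknowledgement, and time limit described above can be modeled with a small toy
program. This is only an illustrative sketch and not ART's implementation: `ToyThread`,
`request_suspend`, `resume`, and `run_worker` are hypothetical names, the target polls the flag
rather than taking a SIGSEGV, and a timeout returns `False` instead of aborting the process.

```python
import threading
import time


class ToyThread:
    """Per-thread state for a one-shot toy suspension handshake (hypothetical, not ART's Thread)."""

    def __init__(self):
        self.suspend_requested = threading.Event()  # the "flag" set by the requester
        self.acknowledged = threading.Event()       # set by the target once it has stopped
        self.resumed = threading.Event()            # set by the requester to let it continue

    def check_suspend(self):
        """Called by the target thread at points where it is safe to stop."""
        if self.suspend_requested.is_set():
            self.acknowledged.set()
            self.resumed.wait()


def request_suspend(target, timeout_s):
    """Set the flag and wait for the acknowledgement; False means the request timed out."""
    target.suspend_requested.set()
    return target.acknowledged.wait(timeout_s)


def resume(target):
    """Withdraw the request and release the target if it is waiting."""
    target.suspend_requested.clear()
    target.resumed.set()


def run_worker(target, seconds_between_checks, stop):
    """Target thread body: do some 'work', then check for a pending suspend request."""
    while not stop.is_set():
        time.sleep(seconds_between_checks)  # stands in for work done without checking
        target.check_suspend()
```

In this model, a worker that checks every few milliseconds acknowledges well within the limit,
while a worker that goes a long time between checks makes `request_suspend` return `False`, which
is the situation the real timeout reports.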

Any such timeout has the inherent downside that it may occur on a sufficiently overcommitted
device even when there is no deadlock or similar bug involved. Clearly this should be
extremely rare.

Android 15 changed the handling of such timeouts in several ways:

1) The underlying suspension code was changed to improve correctness and better report timeouts.
This included reducing the timeout in some cases to avoid the danger of reporting such timeouts as
hard-to-analyze ANRs.

2) When such a timeout is encountered, we now aggressively try to abort the thread refusing to
suspend, so that the main reported stack trace gives a better indication of what went wrong. The
thread originating the suspension request will still abort if this fails or takes too long.

3) The timeout abort message should contain a fair amount of information about the thread failing
to abort, including two prefixes of the `/proc/<pid>/task/<tid>/stat` file for the offending
thread, taken a second or more apart. These snapshots contain several bits of useful information,
such as kernel process states, the thread priority, and the `utime` and `stime` fields indicating
the amount of time for which the thread was scheduled. See `man proc`, and look for
`/proc/pid/stat`. (Initial Android 15 versions reported `/proc/<tid>/stat` instead, which includes
process rather than thread CPU time.)
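When comparing the two `stat` snapshots by hand, it can help to pull out just the fields mentioned
above. The following is a minimal sketch, assuming the snapshot text has already been copied out
of the abort message; `parse_stat_prefix` and `ticks_scheduled` are hypothetical helper names, and
the field positions follow the numbering in `man proc` (state is field 3, `utime` 14, `stime` 15,
priority 18, nice 19). Because the message holds only a prefix, later fields may be missing, and a
field cut off mid-number would be misread.

```python
def parse_stat_prefix(text):
    """Extract a few fields from a prefix of /proc/<pid>/task/<tid>/stat."""
    # comm (field 2) is parenthesized and may itself contain spaces or ')',
    # so locate it using the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    rest = text[close_paren + 1:].split()

    def int_field(n):
        # rest[0] is field 3 (state) in proc(5) numbering, so field n is rest[n - 3].
        # Returns None when the prefix ends before this field.
        index = n - 3
        return int(rest[index]) if index < len(rest) else None

    return {
        "tid": int(text[:open_paren]),
        "comm": text[open_paren + 1:close_paren],
        "state": rest[0] if rest else None,
        "utime": int_field(14),
        "stime": int_field(15),
        "priority": int_field(18),
        "nice": int_field(19),
    }


def ticks_scheduled(first, second):
    """CPU time, in clock ticks, that the thread received between two snapshots."""
    return (second["utime"] - first["utime"]) + (second["stime"] - first["stime"])
```

A thread that stays in state `R` while `ticks_scheduled` is at or near zero was runnable but
received almost no CPU time between the snapshots, which would be consistent with starvation
rather than a deadlock.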

Thread suspension has been known to time out for several reasons:

1) A deadlock involving thread suspension. The issues here are discussed in `mutator_gc_coord.md`.
A common cause of these appears to be "native" C++ locks that are both held while executing Java
code, and acquired in `@CriticalNative` or `@FastNative` JNI calls. These are clear bugs that,
once identified, usually have a fairly clear-cut fix.

2) Overcommitting the cores, so that the thread being "suspended" just does not get a chance to
run within the timeout of 4 or more seconds.

3) Either ART or `@CriticalNative`/`@FastNative` code that continues in Java `kRunnable` state for
too long without checking suspension requests.

4) The thread being suspended is either itself running at a low thread priority, or is waiting for
a thread at low thread priority. A Java priority 10 thread has Linux niceness -8, but a priority 1
thread has niceness 20. This means the former gets roughly 1.25^28, or more than 500, times the
CPU share of the latter when the device's cores are overcommitted. It is worth noting that
priority 5 (NORMAL) corresponds to niceness 0, while priority 4 corresponds to niceness 10, which
is already almost a factor of 10 difference.
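The CPU-share figures in item 4 follow from the rule of thumb that each niceness step changes a
thread's share by a factor of about 1.25. The sketch below simply redoes that arithmetic with the
niceness values quoted above; `cpu_share_ratio` is a hypothetical helper, and the result only
approximates what the scheduler does when both threads are continuously runnable on an
overcommitted core.

```python
SHARE_FACTOR_PER_NICE_STEP = 1.25  # approximation used in the text above


def cpu_share_ratio(favored_nice, other_nice):
    """Approximate CPU share of the favored thread relative to the other one."""
    return SHARE_FACTOR_PER_NICE_STEP ** (other_nice - favored_nice)


if __name__ == "__main__":
    # Java priority 10 (niceness -8) versus priority 1 (niceness 20, as quoted above).
    print(f"priority 10 vs 1: {cpu_share_ratio(-8, 20):.0f}x")
    # Java priority 5 (niceness 0) versus priority 4 (niceness 10).
    print(f"priority 5 vs 4:  {cpu_share_ratio(0, 10):.1f}x")
```

The first ratio comes out at a little over 500 and the second at a little over 9, matching the
"more than 500" and "almost a factor of 10" figures.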

When we do see such timeouts, they are often a combination of the last three. The fixes in such a
case tend to be less clear. Cores may become significantly overcommitted due to attempts to avoid
unused cores, particularly during startup. There are currently times when ART needs to perform IO
or paging operations while the Java heap is not in a consistent state. Priority issues can be
difficult to address, since temporary priority changes may race with other priority changes.

Different suspension timeout failures will usually need to be addressed individually.
There is no single "silver bullet" fix for all of them. There is ongoing work
to improve the tools available for handling priority issues. Currently the possible fixes
include:

- Remove any newly discovered deadlocks, e.g. by removing an `@FastNative` annotation to prevent
  a lock from being acquired while the thread already has Java heap access, or by no longer
  holding native locks across calls to Java.
- Reduce the amount of time spent continuously in Java runnable state. For application code, that
  may again involve removing `@FastNative` or `@CriticalNative` annotations. For ART internal
  code, break up `ScopedObjectAccess` sections or the like, being careful to not hold native
  pointers to Java heap objects across such sections.
- Avoid excessive parallelism that is causing some threads to starve.
- Reduce differences in thread priorities and, if necessary, avoid very low priority threads, for
  the same reason.
- On slow devices, if you are in a position to do so, consider setting `ro.hw_timeout_multiplier`
  to a value greater than one.