1# Adding or extending a family of adaptive instructions.
2
3## Families of instructions
4
5The core part of PEP 659 (specializing adaptive interpreter) is the families
6of instructions that perform the adaptive specialization.
7
8A family of instructions has the following fundamental properties:
9
10* It corresponds to a single instruction in the code
11  generated by the bytecode compiler.
12* It has a single adaptive instruction that records an execution count and,
13  at regular intervals, attempts to specialize itself. If not specializing,
14  it executes the non-adaptive instruction.
15* It has at least one specialized form of the instruction that is tailored
16  for a particular value or set of values at runtime.
17* All members of the family must have the same number of inline cache entries,
18  to ensure correct execution.
19  Individual family members do not need to use all of the entries,
20  but must skip over any unused entries when executing.
21
22The current implementation also requires the following,
23although these are not fundamental and may change:
24
25* All families uses one or more inline cache entries,
26  the first entry is always the counter.
27* All instruction names should start with the name of the non-adaptive
28  instruction.
29* The adaptive instruction should end in `_ADAPTIVE`.
30* Specialized forms should have names describing their specialization.
31
32## Example family
33
34The `LOAD_GLOBAL` instruction (in Python/ceval.c) already has an adaptive
35family that serves as a relatively simple example.
36
37The `LOAD_GLOBAL_ADAPTIVE` instruction performs adaptive specialization,
38calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
39
40There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
41which is specialized for global variables in the module, and
42`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
43
44## Performance analysis
45
46The benefit of a specialization can be assessed with the following formula:
47`Tbase/Tadaptive`.
48
49Where `Tbase` is the mean time to execute the base instruction,
50and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
51
52`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
53
54`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
55the number of times that instruction is executed.
56`Tmiss` is the time to process a miss, including de-optimzation
57and the time to execute the base instruction.
58
59The ideal situation is where misses are rare and the specialized
60forms are much faster than the base instruction.
61`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
62In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
63Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
64`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
65we would expect the specialization of `LOAD_GLOBAL` to be profitable.
66
67## Design considerations
68
69While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
70`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
71low for all specialized instructions and `Nmiss` as low as possible.
72
73Keeping `Nmiss` low means that there should be specializations for almost
74all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
75keeping `Ti` low which means minimizing branches and dependent memory
76accesses (pointer chasing). These two objectives may be in conflict,
77requiring judgement and experimentation to design the family of instructions.
78
79The size of the inline cache should as small as possible,
80without impairing performance, to reduce the number of
81`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
82
83### Gathering data
84
85Before choosing how to specialize an instruction, it is important to gather
86some data. What are the patterns of usage of the base instruction?
87Data can best be gathered by instrumenting the interpreter. Since a
88specialization function and adaptive instruction are going to be required,
89instrumentation can most easily be added in the specialization function.
90
91### Choice of specializations
92
93The performance of the specializing adaptive interpreter relies on the
94quality of specialization and keeping the overhead of specialization low.
95
96Specialized instructions must be fast. In order to be fast,
97specialized instructions should be tailored for a particular
98set of values that allows them to:
991. Verify that incoming value is part of that set with low overhead.
1002. Perform the operation quickly.
101
102This requires that the set of values is chosen such that membership can be
103tested quickly and that membership is sufficient to allow the operation to
104performed quickly.
105
106For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
107dictionaries that have a keys with the expected version.
108
109This can be tested quickly:
110* `globals->keys->dk_version == expected_version`
111
112and the operation can be performed quickly:
113* `value = entries[cache->index].me_value;`.
114
115Because it is impossible to measure the performance of an instruction without
116also measuring unrelated factors, the assessment of the quality of a
117specialization will require some judgement.
118
119As a general rule, specialized instructions should be much faster than the
120base instruction.
121
122### Implementation of specialized instructions
123
124In general, specialized instructions should be implemented in two parts:
1251. A sequence of guards, each of the form
126  `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
1272. The operation, which should ideally have no branches and
128  a minimum number of dependent memory accesses.
129
130In practice, the parts may overlap, as data required for guards
131can be re-used in the operation.
132
133If there are branches in the operation, then consider further specialization
134to eliminate the branches.
135
136### Maintaining stats
137
138Finally, take care that stats are gather correctly.
139After the last `DEOPT_IF` has passed, a hit should be recorded with
140`STAT_INC(BASE_INSTRUCTION, hit)`.
141After a optimization has been deferred in the `ADAPTIVE` form,
142that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
143