:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Collections with a mix of types are also undefined
and implementation-dependent.  If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.
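As a small sketch of that advice (the sample values below are invented), coercing a mixed list of ``int``, ``Fraction`` and ``Decimal`` values to ``float`` first makes the result type predictable:

```python
from statistics import mean
from fractions import Fraction
from decimal import Decimal

# Hypothetical mixed-type dataset; behaviour of the statistics
# functions on the raw mix is undefined.
mixed = [1, Fraction(1, 2), Decimal("0.75")]

# Coercing every element to float first gives a well-defined float result.
coerced = list(map(float, mixed))
print(mean(coerced))   # 0.75
```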

Some datasets use ``NaN`` (not a number) values to represent missing data.
Since NaNs have unusual comparison semantics, they cause surprising or
undefined behaviors in the statistics functions that sort data or that count
occurrences.  The functions affected are ``median()``, ``median_low()``,
``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
``quantiles()``.  The ``NaN`` values should be stripped before calling these
functions::

    >>> from statistics import median
    >>> from math import isnan
    >>> from itertools import filterfalse

    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
    >>> sorted(data)  # This has surprising behavior
    [20.7, nan, 14.4, 18.3, 19.2, nan]
    >>> median(data)  # This result is unexpected
    16.35

    >>> sum(map(isnan, data))    # Number of missing values
    2
    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
    >>> clean
    [20.7, 19.2, 18.3, 14.4]
    >>> sorted(clean)  # Sorting now works as expected
    [14.4, 18.3, 19.2, 20.7]
    >>> median(clean)       # This result is now well defined
    18.75


Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean, with optional weighting.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or
   iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points. For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data, weights=None)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   Optional weighting is supported.  For example, a professor assigns a
   grade for a course by weighting quizzes at 20%, homework at 20%, a
   midterm exam at 30%, and a final exam at 30%:

   .. doctest::

      >>> grades = [85, 92, 83, 91]
      >>> weights = [0.20, 0.20, 0.30, 0.30]
      >>> fmean(grades, weights)
      87.6

   If *weights* is supplied, it must be the same length as the *data* or
   a :exc:`ValueError` will be raised.

   .. versionadded:: 3.8

   .. versionchanged:: 3.11
      Added support for *weights*.


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.  If *weights* is omitted or *None*, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data. For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey. What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input.  This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.

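To make the tie-breaking rules above concrete, here is a small sketch (the sample values are invented) contrasting the first-encountered mode with the smallest and largest modes:

```python
from statistics import mode, multimode

data = [1, 1, 2, 2, 3]          # two values tie for most common
print(mode(data))               # 1 -- the first mode encountered in the data
print(multimode(data))          # [1, 2] -- all modes, in order of first appearance
print(min(multimode(data)))     # 1 -- the smallest mode
print(max(multimode(data)))     # 2 -- the largest mode
```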

.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

        >>> multimode('aabbbbccddddeeffffgg')
        ['b', 'd', 'f']
        >>> multimode('')
        []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).  See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*.  It can also be used to compute the second moment around a
   point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².  When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles which gives the 99 cut points that separate
   *data* into 100 equal sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points.  For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

        # Decile cut points for empirically sampled data
        >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
        ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
        ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
        ...         103, 107, 101, 81, 109, 104]
        >>> [round(q, 1) for q in quantiles(data, n=10)]
        [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
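The effect of *method* can also be seen on a tiny invented dataset: with ``method='inclusive'`` the minimum and maximum are treated as the 0th and 100th percentiles, pulling the outer cut points inward relative to the exclusive default:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]                       # invented five-value sample
print(quantiles(data))                       # [1.5, 3.0, 4.5]  exclusive (default)
print(quantiles(data, method='inclusive'))   # [2.0, 3.0, 4.0]  min/max pinned to 0%/100%
```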

   .. versionadded:: 3.8

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*. Covariance
   is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs. Pearson's correlation coefficient *r* takes values
   between -1 and +1. It measures the strength and direction of the linear
   relationship, where +1 means very strong, positive linear relationship,
   -1 very strong, negative linear relationship, and 0 no linear relationship.

   Both inputs must be of the same length (no less than two), and neither
   can be constant; otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(x, y, /, *, proportional=False)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares. Simple linear
   regression describes the relationship between an independent variable *x* and
   a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   If *proportional* is true, the independent variable *x* and the
   dependent variable *y* are assumed to be directly proportional.
   The data is fit to a line passing through the origin.
   Since the *intercept* will always be 0.0, the underlying linear
   function simplifies to:

      *y = slope \* x + noise*

   .. versionadded:: 3.10

   .. versionchanged:: 3.11
      Added support for *proportional*.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.


:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

    Returns a new *NormalDist* object where *mu* represents the `arithmetic
    mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
    represents the `standard deviation
    <https://en.wikipedia.org/wiki/Standard_deviation>`_.

    If *sigma* is negative, raises :exc:`StatisticsError`.

    .. attribute:: mean

       A read-only property for the `arithmetic mean
       <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
       distribution.

    .. attribute:: median

       A read-only property for the `median
       <https://en.wikipedia.org/wiki/Median>`_ of a normal
       distribution.

    .. attribute:: mode

       A read-only property for the `mode
       <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
       distribution.

    .. attribute:: stdev

       A read-only property for the `standard deviation
       <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
       distribution.

    .. attribute:: variance

       A read-only property for the `variance
       <https://en.wikipedia.org/wiki/Variance>`_ of a normal
       distribution. Equal to the square of the standard deviation.

    .. classmethod:: NormalDist.from_samples(data)

       Makes a normal distribution instance with *mu* and *sigma* parameters
       estimated from the *data* using :func:`fmean` and :func:`stdev`.

       The *data* can be any :term:`iterable` and should consist of values
       that can be converted to type :class:`float`.  If *data* does not
       contain at least two elements, raises :exc:`StatisticsError` because it
       takes at least one point to estimate a central value and at least two
       points to estimate dispersion.
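A quick sketch of the estimation step (the sample data here is invented); the resulting *mu* and *sigma* match ``fmean`` and ``stdev`` of the same data:

```python
from statistics import NormalDist, fmean, stdev

data = [1.0, 2.0, 3.0]                 # invented sample data
nd = NormalDist.from_samples(data)
print(nd.mean)                         # 2.0, same as fmean(data)
print(nd.stdev)                        # 1.0, same as stdev(data)
```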

    .. method:: NormalDist.samples(n, *, seed=None)

       Generates *n* random samples for a given mean and standard deviation.
       Returns a :class:`list` of :class:`float` values.

       If *seed* is given, creates a new instance of the underlying random
       number generator.  This is useful for creating reproducible results,
       even in a multi-threading context.
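For example, a fixed *seed* (the value chosen here is arbitrary) makes two draws identical, which is handy in tests:

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)    # arbitrary example parameters
first = nd.samples(5, seed=42)       # same seed ...
second = nd.samples(5, seed=42)      # ... gives the same samples
print(first == second)               # True
print(len(first))                    # 5
```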

    .. method:: NormalDist.pdf(x)

       Using a `probability density function (pdf)
       <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
       the relative likelihood that a random variable *X* will be near the
       given value *x*.  Mathematically, it is the limit of the ratio ``P(x <=
       X < x+dx) / dx`` as *dx* approaches zero.

       The relative likelihood is computed as the probability of a sample
       occurring in a narrow range divided by the width of the range (hence
       the word "density").  Since the likelihood is relative to other points,
       its value can be greater than ``1.0``.
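A short sketch of both points: the standard normal density peaks near 0.3989, while a narrow distribution (the sigma of 0.1 is an arbitrary choice) has a density well above 1.0 at its mean:

```python
from statistics import NormalDist

print(round(NormalDist().pdf(0.0), 4))        # 0.3989, peak of the standard normal
print(round(NormalDist(0, 0.1).pdf(0.0), 4))  # 3.9894, a density can exceed 1.0
```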

    .. method:: NormalDist.cdf(x)

       Using a `cumulative distribution function (cdf)
       <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
       compute the probability that a random variable *X* will be less than or
       equal to *x*.  Mathematically, it is written ``P(X <= x)``.
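For instance (the distribution parameters below are invented), half of the mass lies below the mean, and differencing two cdf values gives the mass in an interval:

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)            # arbitrary example parameters
print(nd.cdf(100))                           # 0.5, half the mass lies below the mean
print(round(nd.cdf(115) - nd.cdf(85), 4))    # 0.6827, mass within one standard deviation
```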

    .. method:: NormalDist.inv_cdf(p)

       Compute the inverse cumulative distribution function, also known as the
       `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
       or the `percent-point
       <https://web.archive.org/web/20190203145224/https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
       function.  Mathematically, it is written ``x : P(X <= x) = p``.

       Finds the value *x* of the random variable *X* such that the
       probability of the variable being less than or equal to that value
       equals the given probability *p*.
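
       For example, the median is the point where the cdf reaches 50%, and
       ``inv_cdf()`` inverts ``cdf()``:

       .. doctest::

           >>> iq = NormalDist(100, 15)
           >>> iq.inv_cdf(0.5)
           100.0
           >>> round(iq.cdf(iq.inv_cdf(0.9)), 4)
           0.9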

    .. method:: NormalDist.overlap(other)

       Measures the agreement between two normal probability distributions.
       Returns a value between 0.0 and 1.0 giving `the overlapping area for
       the two probability density functions
       <https://www.rasch.org/rmt/rmt101r.htm>`_.
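
       For example, two identical distributions overlap completely, while
       two widely separated distributions barely overlap at all:

       .. doctest::

           >>> NormalDist(2.5, 1.0).overlap(NormalDist(2.5, 1.0))
           1.0
           >>> round(NormalDist(0, 1).overlap(NormalDist(100, 1)), 4)
           0.0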

    .. method:: NormalDist.quantiles(n=4)

       Divide the normal distribution into *n* continuous intervals with
       equal probability.  Returns a list of (n - 1) cut points separating
       the intervals.

       Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.
       Set *n* to 100 for percentiles, which gives the 99 cut points that
       separate the normal distribution into 100 equal-sized groups.
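
       For example, the quartile cut points of a distribution with a mean of
       100 and a standard deviation of 15:

       .. doctest::

           >>> list(map(round, NormalDist(100, 15).quantiles()))
           [90, 100, 110]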

    .. method:: NormalDist.zscore(x)

       Compute the
       `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
       describing *x* in terms of the number of standard deviations
       above or below the mean of the normal distribution:
       ``(x - mean) / stdev``.

       .. versionadded:: 3.9
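
       For example, with a mean of 100 and a standard deviation of 15, a
       score of 130 is two standard deviations above the mean:

       .. doctest::

           >>> NormalDist(100, 15).zscore(130)
           2.0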

    Instances of :class:`NormalDist` support addition, subtraction,
    multiplication and division by a constant.  These operations
    are used for translation and scaling.  For example:

    .. doctest::

        >>> temperature_february = NormalDist(5, 2.5)             # Celsius
        >>> temperature_february * (9/5) + 32                     # Fahrenheit
        NormalDist(mu=41.0, sigma=4.5)

    Dividing a constant by an instance of :class:`NormalDist` is not supported
    because the result wouldn't be normally distributed.

    Since normal distributions arise from additive effects of independent
    variables, it is possible to `add and subtract two independent normally
    distributed random variables
    <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
    represented as instances of :class:`NormalDist`.  For example:

    .. doctest::

        >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
        >>> drug_effects = NormalDist(0.4, 0.15)
        >>> combined = birth_weights + drug_effects
        >>> round(combined.mean, 1)
        3.1
        >>> round(combined.stdev, 1)
        0.5

    .. versionadded:: 3.8


:class:`NormalDist` Examples and Recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`NormalDist` readily solves classic probability problems.

For example, given `historical data for SAT exams
<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
that scores are normally distributed with a mean of 1060 and a standard
deviation of 195, determine the percentage of students with test scores
between 1100 and 1200, after rounding to the nearest whole number:

.. doctest::

    >>> sat = NormalDist(1060, 195)
    >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
    >>> round(fraction * 100.0, 1)
    18.4

Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:

.. doctest::

    >>> list(map(round, sat.quantiles()))
    [928, 1060, 1192]
    >>> list(map(round, sat.quantiles(n=10)))
    [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

    >>> def model(x, y, z):
    ...     return (3*x + 7*x*y - 5*y) / (11 * z)
    ...
    >>> n = 100_000
    >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
    >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
    >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
    >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
    [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <https://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500-person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

    >>> n = 750             # Sample size
    >>> p = 0.65            # Preference for Python
    >>> q = 1.0 - p         # Preference for Ruby
    >>> k = 500             # Room capacity

    >>> # Approximation using the cumulative normal distribution
    >>> from math import sqrt
    >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
    0.8402

    >>> # Solution using the cumulative binomial distribution
    >>> from math import comb, fsum
    >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
    0.8402

    >>> # Approximation using a simulation
    >>> from random import seed, choices
    >>> seed(8675309)
    >>> def trial():
    ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
    >>> mean(trial() <= k for i in range(10_000))
    0.8398
Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Person_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

    >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
    >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
    >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
    >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
    >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
    >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

    >>> ht = 6.0        # height
    >>> wt = 130        # weight
    >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

    >>> prior_male = 0.5
    >>> prior_female = 0.5
    >>> posterior_male = (prior_male * height_male.pdf(ht) *
    ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

    >>> posterior_female = (prior_female * height_female.pdf(ht) *
    ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

    >>> 'male' if posterior_male > posterior_female else 'female'
    'female'


..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;