:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab.  It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported.  Collections with a mix of types are also undefined
and implementation-dependent.  If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

Some datasets use ``NaN`` (not a number) values to represent missing data.
Since NaNs have unusual comparison semantics, they cause surprising or
undefined behaviors in the statistics functions that sort data or that count
occurrences.  The functions affected are ``median()``, ``median_low()``,
``median_high()``, ``median_grouped()``, ``mode()``, ``multimode()``, and
``quantiles()``.
The ``NaN`` values should be stripped before calling these
functions::

    >>> from statistics import median
    >>> from math import isnan
    >>> from itertools import filterfalse

    >>> data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]
    >>> sorted(data)  # This has surprising behavior
    [20.7, nan, 14.4, 18.3, 19.2, nan]
    >>> median(data)  # This result is unexpected
    16.35

    >>> sum(map(isnan, data))    # Number of missing values
    2
    >>> clean = list(filterfalse(isnan, data))  # Strip NaN values
    >>> clean
    [20.7, 19.2, 18.3, 14.4]
    >>> sorted(clean)  # Sorting now works as expected
    [14.4, 18.3, 19.2, 20.7]
    >>> median(clean)  # This result is now well defined
    18.75


Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean, with optional weighting.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.
=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

Statistics for relations between two inputs
-------------------------------------------

These functions calculate statistics regarding relations between two inputs.

=========================  =====================================================
:func:`covariance`         Sample covariance for two variables.
:func:`correlation`        Pearson's correlation coefficient for two variables.
:func:`linear_regression`  Slope and intercept for simple linear regression.
=========================  =====================================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by `outliers
      <https://en.wikipedia.org/wiki/Outlier>`_ and is not necessarily a
      typical example of the data points.  For a more robust, although less
      efficient, measure of `central tendency
      <https://en.wikipedia.org/wiki/Central_tendency>`_, see :func:`median`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


.. function:: fmean(data, weights=None)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`.  The *data* may be a sequence or iterable.  If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   Optional weighting is supported.  For example, a professor assigns a
   grade for a course by weighting quizzes at 20%, homework at 20%, a
   midterm exam at 30%, and a final exam at 30%:

   .. doctest::

      >>> grades = [85, 92, 83, 91]
      >>> weights = [0.20, 0.20, 0.30, 0.30]
      >>> fmean(grades, weights)
      87.6

   If *weights* is supplied, it must be the same length as the *data* or
   a :exc:`ValueError` will be raised.

   .. versionadded:: 3.8

   .. versionchanged:: 3.11
      Added support for *weights*.


.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).
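As a sketch of the underlying idea (not necessarily the module's exact implementation), the geometric mean of *n* positive values can be computed as the exponential of the arithmetic mean of their logarithms; the hypothetical helper below reproduces the documented result for ``geometric_mean([54, 24, 36])``:

```python
from math import exp, fsum, log

def geometric_mean_sketch(data):
    # Illustrative only: exp of the mean of the logs.
    # Requires every value to be greater than zero.
    values = list(map(float, data))
    return exp(fsum(map(log, values)) / len(values))

print(round(geometric_mean_sketch([54, 24, 36]), 1))  # 36.0
```

The logarithm trick avoids the overflow that multiplying many large values directly could cause.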
   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers.  If *weights* is omitted or *None*, then
   equal weighting is assumed.

   The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
   reciprocals of the data.  For example, the harmonic mean of three values *a*,
   *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``.  If one of the
   values is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging
   ratios or rates, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey.  What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input.  This means that the subsequent inputs are not tested
   for validity.  (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers.  When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterable.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).
      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.


.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*.  If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value.  This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*.  Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

      >>> multimode('aabbbbccddddeeffffgg')
      ['b', 'd', 'f']
      >>> multimode('')
      []

   .. versionadded:: 3.8


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).
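As a quick consistency check (an illustration, not the module's implementation), the result of :func:`pstdev` agrees with the square root of :func:`pvariance` up to floating-point rounding:

```python
from math import sqrt
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
# pstdev is defined as the square root of the population variance,
# so the two computations agree to within rounding error.
assert abs(pstdev(data) - sqrt(pvariance(data))) < 1e-12
```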
   See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers.  Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data.  A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*.  It can also be used to compute the second moment around a
   point that is not the mean.  If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².
      When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument.  Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population.  To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

.. function:: quantiles(data, *, n=4, method='exclusive')

   Divide *data* into *n* continuous intervals with equal probability.
   Returns a list of ``n - 1`` cut points separating the intervals.

   Set *n* to 4 for quartiles (the default).  Set *n* to 10 for deciles.  Set
   *n* to 100 for percentiles, which gives the 99 cut points that separate
   *data* into 100 equal-sized groups.  Raises :exc:`StatisticsError` if *n*
   is not at least 1.

   The *data* can be any iterable containing sample data.  For meaningful
   results, the number of data points in *data* should be larger than *n*.
   Raises :exc:`StatisticsError` if there are not at least two data points.

   The cut points are linearly interpolated from the
   two nearest data points.
   For example, if a cut point falls one-third
   of the distance between two sample values, ``100`` and ``112``, the
   cut-point will evaluate to ``104``.

   The *method* for computing quantiles can be varied depending on
   whether the *data* includes or excludes the lowest and
   highest possible values from the population.

   The default *method* is "exclusive" and is used for data sampled from
   a population that can have more extreme values than found in the
   samples.  The portion of the population falling below the *i-th* of
   *m* sorted data points is computed as ``i / (m + 1)``.  Given nine
   sample values, the method sorts them and assigns the following
   percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.

   Setting the *method* to "inclusive" is used for describing population
   data or for samples that are known to include the most extreme values
   from the population.  The minimum value in *data* is treated as the 0th
   percentile and the maximum value is treated as the 100th percentile.
   The portion of the population falling below the *i-th* of *m* sorted
   data points is computed as ``(i - 1) / (m - 1)``.  Given 11 sample
   values, the method sorts them and assigns the following percentiles:
   0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

   .. doctest::

      # Decile cut points for empirically sampled data
      >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
      ...         100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
      ...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
      ...         111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
      ...         103, 107, 101, 81, 109, 104]
      >>> [round(q, 1) for q in quantiles(data, n=10)]
      [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

   .. versionadded:: 3.8

.. function:: covariance(x, y, /)

   Return the sample covariance of two inputs *x* and *y*.
   Covariance is a measure of the joint variability of two inputs.

   Both inputs must be of the same length (no less than two), otherwise
   :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
      >>> covariance(x, y)
      0.75
      >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> covariance(x, z)
      -7.5
      >>> covariance(z, x)
      -7.5

   .. versionadded:: 3.10

.. function:: correlation(x, y, /)

   Return the `Pearson's correlation coefficient
   <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
   for two inputs.  Pearson's correlation coefficient *r* takes values
   between -1 and +1.  It measures the strength and direction of the linear
   relationship, where +1 means very strong, positive linear relationship,
   -1 very strong, negative linear relationship, and 0 no linear relationship.

   Both inputs must be of the same length (no less than two), and neither
   can be constant, otherwise :exc:`StatisticsError` is raised.

   Examples:

   .. doctest::

      >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
      >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
      >>> correlation(x, x)
      1.0
      >>> correlation(x, y)
      -1.0

   .. versionadded:: 3.10

.. function:: linear_regression(x, y, /, *, proportional=False)

   Return the slope and intercept of `simple linear regression
   <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
   parameters estimated using ordinary least squares.
   Simple linear regression describes the relationship between an independent
   variable *x* and a dependent variable *y* in terms of this linear function:

      *y = slope \* x + intercept + noise*

   where ``slope`` and ``intercept`` are the regression parameters that are
   estimated, and ``noise`` represents the
   variability of the data that was not explained by the linear regression
   (it is equal to the difference between predicted and actual values
   of the dependent variable).

   Both inputs must be of the same length (no less than two), and
   the independent variable *x* cannot be constant;
   otherwise a :exc:`StatisticsError` is raised.

   For example, we can use the `release dates of the Monty
   Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
   to predict the cumulative number of Monty Python films
   that would have been produced by 2019
   assuming that they had kept the pace.

   .. doctest::

      >>> year = [1971, 1975, 1979, 1982, 1983]
      >>> films_total = [1, 2, 3, 4, 5]
      >>> slope, intercept = linear_regression(year, films_total)
      >>> round(slope * 2019 + intercept)
      16

   If *proportional* is true, the independent variable *x* and the
   dependent variable *y* are assumed to be directly proportional.
   The data is fit to a line passing through the origin.
   Since the *intercept* will always be 0.0, the underlying linear
   function simplifies to:

      *y = slope \* x + noise*

   .. versionadded:: 3.10

   .. versionchanged:: 3.11
      Added support for *proportional*.

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
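Because :exc:`StatisticsError` subclasses :exc:`ValueError`, an ``except ValueError`` clause also catches it; a small illustration:

```python
from statistics import StatisticsError, mean

assert issubclass(StatisticsError, ValueError)

# A ValueError handler catches statistics errors as well.
try:
    mean([])  # empty data raises StatisticsError
except ValueError as exc:
    print(type(exc).__name__)  # StatisticsError
```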
:class:`NormalDist` objects
---------------------------

:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_.  It is a
class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
of applications in statistics.

.. class:: NormalDist(mu=0.0, sigma=1.0)

   Returns a new *NormalDist* object where *mu* represents the `arithmetic
   mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
   represents the `standard deviation
   <https://en.wikipedia.org/wiki/Standard_deviation>`_.

   If *sigma* is negative, raises :exc:`StatisticsError`.

   .. attribute:: mean

      A read-only property for the `arithmetic mean
      <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
      distribution.

   .. attribute:: median

      A read-only property for the `median
      <https://en.wikipedia.org/wiki/Median>`_ of a normal
      distribution.

   .. attribute:: mode

      A read-only property for the `mode
      <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
      distribution.

   .. attribute:: stdev

      A read-only property for the `standard deviation
      <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
      distribution.

   .. attribute:: variance

      A read-only property for the `variance
      <https://en.wikipedia.org/wiki/Variance>`_ of a normal
      distribution.  Equal to the square of the standard deviation.

   .. classmethod:: NormalDist.from_samples(data)

      Makes a normal distribution instance with *mu* and *sigma* parameters
      estimated from the *data* using :func:`fmean` and :func:`stdev`.
789 790 The *data* can be any :term:`iterable` and should consist of values 791 that can be converted to type :class:`float`. If *data* does not 792 contain at least two elements, raises :exc:`StatisticsError` because it 793 takes at least one point to estimate a central value and at least two 794 points to estimate dispersion. 795 796 .. method:: NormalDist.samples(n, *, seed=None) 797 798 Generates *n* random samples for a given mean and standard deviation. 799 Returns a :class:`list` of :class:`float` values. 800 801 If *seed* is given, creates a new instance of the underlying random 802 number generator. This is useful for creating reproducible results, 803 even in a multi-threading context. 804 805 .. method:: NormalDist.pdf(x) 806 807 Using a `probability density function (pdf) 808 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute 809 the relative likelihood that a random variable *X* will be near the 810 given value *x*. Mathematically, it is the limit of the ratio ``P(x <= 811 X < x+dx) / dx`` as *dx* approaches zero. 812 813 The relative likelihood is computed as the probability of a sample 814 occurring in a narrow range divided by the width of the range (hence 815 the word "density"). Since the likelihood is relative to other points, 816 its value can be greater than ``1.0``. 817 818 .. method:: NormalDist.cdf(x) 819 820 Using a `cumulative distribution function (cdf) 821 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_, 822 compute the probability that a random variable *X* will be less than or 823 equal to *x*. Mathematically, it is written ``P(X <= x)``. 824 825 .. 
method:: NormalDist.inv_cdf(p) 826 827 Compute the inverse cumulative distribution function, also known as the 828 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_ 829 or the `percent-point 830 <https://web.archive.org/web/20190203145224/https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_ 831 function. Mathematically, it is written ``x : P(X <= x) = p``. 832 833 Finds the value *x* of the random variable *X* such that the 834 probability of the variable being less than or equal to that value 835 equals the given probability *p*. 836 837 .. method:: NormalDist.overlap(other) 838 839 Measures the agreement between two normal probability distributions. 840 Returns a value between 0.0 and 1.0 giving `the overlapping area for 841 the two probability density functions 842 <https://www.rasch.org/rmt/rmt101r.htm>`_. 843 844 .. method:: NormalDist.quantiles(n=4) 845 846 Divide the normal distribution into *n* continuous intervals with 847 equal probability. Returns a list of (n - 1) cut points separating 848 the intervals. 849 850 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. 851 Set *n* to 100 for percentiles which gives the 99 cuts points that 852 separate the normal distribution into 100 equal sized groups. 853 854 .. method:: NormalDist.zscore(x) 855 856 Compute the 857 `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_ 858 describing *x* in terms of the number of standard deviations 859 above or below the mean of the normal distribution: 860 ``(x - mean) / stdev``. 861 862 .. versionadded:: 3.9 863 864 Instances of :class:`NormalDist` support addition, subtraction, 865 multiplication and division by a constant. These operations 866 are used for translation and scaling. For example: 867 868 .. 
doctest:: 869 870 >>> temperature_february = NormalDist(5, 2.5) # Celsius 871 >>> temperature_february * (9/5) + 32 # Fahrenheit 872 NormalDist(mu=41.0, sigma=4.5) 873 874 Dividing a constant by an instance of :class:`NormalDist` is not supported 875 because the result wouldn't be normally distributed. 876 877 Since normal distributions arise from additive effects of independent 878 variables, it is possible to `add and subtract two independent normally 879 distributed random variables 880 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_ 881 represented as instances of :class:`NormalDist`. For example: 882 883 .. doctest:: 884 885 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5]) 886 >>> drug_effects = NormalDist(0.4, 0.15) 887 >>> combined = birth_weights + drug_effects 888 >>> round(combined.mean, 1) 889 3.1 890 >>> round(combined.stdev, 1) 891 0.5 892 893 .. versionadded:: 3.8 894 895 896:class:`NormalDist` Examples and Recipes 897^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 898 899:class:`NormalDist` readily solves classic probability problems. 900 901For example, given `historical data for SAT exams 902<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing 903that scores are normally distributed with a mean of 1060 and a standard 904deviation of 195, determine the percentage of students with test scores 905between 1100 and 1200, after rounding to the nearest whole number: 906 907.. doctest:: 908 909 >>> sat = NormalDist(1060, 195) 910 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5) 911 >>> round(fraction * 100.0, 1) 912 18.4 913 914Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles 915<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores: 916 917.. 

   >>> list(map(round, sat.quantiles()))
   [928, 1060, 1192]
   >>> list(map(round, sat.quantiles(n=10)))
   [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]

To estimate the distribution for a model that isn't easy to solve
analytically, :class:`NormalDist` can generate input samples for a `Monte
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:

.. doctest::

   >>> def model(x, y, z):
   ...     return (3*x + 7*x*y - 5*y) / (11 * z)
   ...
   >>> n = 100_000
   >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
   >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
   >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
   >>> quantiles(map(model, X, Y, Z))       # doctest: +SKIP
   [1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions can be used to approximate `Binomial
distributions <https://mathworld.wolfram.com/BinomialDistribution.html>`_
when the sample size is large and when the probability of a successful
trial is near 50%.

For example, an open source conference has 750 attendees and two rooms with a
500 person capacity.  There is a talk about Python and another about Ruby.
In previous conferences, 65% of the attendees preferred to listen to Python
talks.  Assuming the population preferences haven't changed, what is the
probability that the Python room will stay within its capacity limits?

.. doctest::

   >>> n = 750             # Sample size
   >>> p = 0.65            # Preference for Python
   >>> q = 1.0 - p         # Preference for Ruby
   >>> k = 500             # Room capacity

   >>> # Approximation using the cumulative normal distribution
   >>> from math import sqrt
   >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
   0.8402

   >>> # Solution using the cumulative binomial distribution
   >>> from math import comb, fsum
   >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
   0.8402

   >>> # Approximation using a simulation
   >>> from random import seed, choices
   >>> seed(8675309)
   >>> def trial():
   ...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
   >>> mean(trial() <= k for i in range(10_000))
   0.8398

Normal distributions commonly arise in machine learning problems.

Wikipedia has a `nice example of a Naive Bayesian Classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Person_classification>`_.
The challenge is to predict a person's gender from measurements of normally
distributed features including height, weight, and foot size.

We're given a training dataset with measurements for eight people.  The
measurements are assumed to be normally distributed, so we summarize the data
with :class:`NormalDist`:

.. doctest::

   >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
   >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
   >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
   >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
   >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
   >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

Next, we encounter a new person whose feature measurements are known but whose
gender is unknown:

.. doctest::

   >>> ht = 6.0        # height
   >>> wt = 130        # weight
   >>> fs = 8          # foot size

Starting with a 50% `prior probability
<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
we compute the posterior as the prior times the product of likelihoods for the
feature measurements given the gender:

.. doctest::

   >>> prior_male = 0.5
   >>> prior_female = 0.5
   >>> posterior_male = (prior_male * height_male.pdf(ht) *
   ...                   weight_male.pdf(wt) * foot_size_male.pdf(fs))

   >>> posterior_female = (prior_female * height_female.pdf(ht) *
   ...                     weight_female.pdf(wt) * foot_size_female.pdf(fs))

The final prediction goes to the largest posterior.  This is known as the
`maximum a posteriori
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:

.. doctest::

   >>> 'male' if posterior_male > posterior_female else 'female'
   'female'


..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
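As a brief supplementary sketch (not part of the official examples above), the
``inv_cdf()``, ``zscore()`` and ``overlap()`` methods documented earlier can be
exercised together with the SAT distribution from the recipes; the values used
here are illustrative:

```python
from statistics import NormalDist

sat = NormalDist(1060, 195)

# inv_cdf() inverts cdf(): find the score at the 75th percentile,
# then check that cdf() maps it back to 0.75.
p75 = sat.inv_cdf(0.75)
assert round(sat.cdf(p75), 4) == 0.75

# zscore() rescales a measurement into standard-deviation units:
# (1255 - 1060) / 195 is exactly one standard deviation above the mean.
z = sat.zscore(1255)
assert z == 1.0

# overlap() reports the shared area under two probability density
# functions: identical distributions overlap completely, and shifting
# the mean reduces the overlap below 1.0.
same = sat.overlap(NormalDist(1060, 195))
shifted = sat.overlap(NormalDist(1160, 195))
assert same == 1.0 and shifted < 1.0
```

Because ``inv_cdf()`` is the exact mathematical inverse of ``cdf()``, the
round trip holds to well beyond four decimal places for any ``p`` strictly
between 0.0 and 1.0.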