To perform descriptive statistics in Python, one may need to import different Python libraries and modules. Python has an abundance of additional modules or libraries that augment the base framework and functionality of the language.
Libraries for Descriptive Statistics in Python
A library is a collection of functions that can be used to complete certain programming tasks without having to write your own algorithms. The following are some important libraries that are used to perform different descriptive (both numerical and graphical) and inferential (comparative and relational) analyses.
- Numpy is a library for working with multi-dimensional arrays and matrices.
- Pandas provides high-performance, easy-to-use data structures and data analysis tools.
- Scipy is a library of techniques for numerical and scientific computing.
- Matplotlib is a library for making graphs.
- Seaborn is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.
- Statsmodels is a library that implements many statistical techniques.
Import Libraries
To perform different statistics, one needs to import the libraries first.
```python
# pip install pandas
# pip install --upgrade pandas

import pandas as pd
import numpy as np
import statistics as stat
```
Note that if a library is not installed on a system, it needs to be installed first. The commented lines can be used to install and upgrade the pandas library.
After importing the required libraries, one needs data to perform some statistical analysis on it. For this Descriptive Statistics Tutorial in Python, we will use the mpg.csv data file.
It is better to copy the mpg.csv file to the working directory. In this case, the data file can be easily imported into the Python workspace. Write the code below to import the data file into a variable.
```python
data = pd.read_csv("mpg.csv", index_col=0)
```
It is also a good idea to get some insight into the imported data.

```python
data.head(10)
data.shape
data.info()
```
Descriptive Statistics in Python
One can compute descriptive statistics in Python for the complete data set or for any of the variables in it. Let us save the hwy variable from the data frame and compute some basic descriptive statistics such as the mean, minimum value, maximum value, and median of the variable.
```python
hwy = data['hwy']

# mean of a variable
print("mean=", np.mean(data["hwy"]))
# median of a variable
print("median=", np.median(data["hwy"]))
# minimum and maximum value
print("minimum=", np.min(data["hwy"]))
print("maximum=", np.max(data["hwy"]))

# output
# mean= 23.44017094017094
# median= 24.0
# minimum= 12
# maximum= 44
```
One can also compute and print (display on screen) formatted results by typing the following lines of code:
```python
mean = np.mean(data["hwy"])
print("The mean value of hwy variable is {}".format(round(mean, 5)))
print("The mean value of hwy variable is", round(mean, 5))

# output
# The mean value of hwy variable is 23.44017
# The mean value of hwy variable is 23.44017
```
Instead of computing different descriptive statistics in Python separately, one can use the describe() method, which computes descriptive statistics such as the mean, median, standard deviation, quartiles, etc., in a single call. The measures of central tendency and measures of dispersion from the describe() method in Python are:
```python
# descriptive statistics for the displ variable
data["displ"].describe()

# descriptive statistics for the displ, cyl, and hwy variables
data[['displ', 'cyl', 'hwy']].describe()

# transpose the output
data[['displ', 'cyl', 'hwy']].describe().T
```
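Since the snippets above depend on the mpg.csv file, here is a minimal self-contained sketch of the same describe() pattern on a small made-up data frame (the values are invented for illustration):

```python
import pandas as pd

# a small synthetic data frame so describe() can be tried without mpg.csv
df = pd.DataFrame({"displ": [1.8, 2.0, 2.8, 3.1, 4.6],
                   "cyl":   [4, 4, 6, 6, 8]})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
print(summary.T)          # transposed: one row per variable
```

The 50% row of the output is the median, so summary.loc["50%", "displ"] is 2.8 for these values.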
The average values for more than one variable in the data frame can be computed as given below:

```python
means = data[["displ", 'cyl']].mean()
print(means)
```
The measures of dispersion can be computed using the numpy and pandas libraries.
```python
# Using Numpy
print('---Results using Numpy---')
sd = np.std(data["displ"], ddof=1)
print(round(sd, 3))
sd = np.std(data["displ"], ddof=0)
print(sd)

# Using Pandas
print('\n---Results using Pandas---')
print(data['displ'].std())
print(data['displ'].std(ddof=0))

# numpy uses ddof=0 (population SD) by default
# pandas uses ddof=1 (sample SD) by default
# ddof (delta degrees of freedom) specifies sample or population SD

### Output
# ---Results using Numpy---
# 1.292
# 1.2891954791898101
#
# ---Results using Pandas---
# 1.2919590310839344
# 1.2891954791898101
```
```python
# variance (numpy uses ddof=0 by default here as well)
np.var(data["displ"])

## Output: 1.6620249835634442
```
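The effect of the ddof argument is easiest to see on a tiny hand-checkable example (the numbers below are made up so the arithmetic can be verified by eye):

```python
import numpy as np

# tiny example showing the effect of ddof on variance and SD
x = np.array([2.0, 4.0, 6.0])            # mean = 4, squared deviations sum to 8
# population variance: sum((x - mean)**2) / n       = 8 / 3
# sample variance:     sum((x - mean)**2) / (n - 1) = 8 / 2 = 4
pop_var = np.var(x, ddof=0)
samp_var = np.var(x, ddof=1)
samp_sd = np.std(x, ddof=1)
print(pop_var, samp_var, samp_sd)   # ~2.667, 4.0, 2.0
```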
One can compute the range (a measure of dispersion) by computing the maximum and minimum values of the variable.
```python
# Range = maximum - minimum
rng = np.max(data["displ"]) - np.min(data["displ"])  # 'rng' avoids shadowing the built-in range()
print(rng)

# or
data["displ"].max() - data["displ"].min()

# Output: 5.4
```
The standard error of the mean ($SE=\frac{\sigma}{\sqrt{n}}$) can be computed from the Scipy library or by writing your own Python code.
```python
# Standard Error of the Mean
from scipy import stats

se = stats.sem(data['displ'])
print(se)

# or
print(np.std(data['displ'], ddof=1) / np.sqrt(len(data['displ'])))

# function form ('std_err' avoids shadowing the 'se' variable above)
def std_err(sigma, n):
    return sigma / np.sqrt(n)

sigma = np.std(data['displ'], ddof=1)
n = len(data['displ'])
print(std_err(sigma, n))

# output
# 0.08445800397768476
# 0.08445800397768473
# 0.08445800397768473
```
Empirical Rule
In statistics, the 68–95–99.7 rule, also known as the empirical rule, describes the percentage of values that lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively:
\begin{align*}
\overline{X} & \pm 1 SD \quad \quad 68\%\\
\overline{X} & \pm 2 SD \quad \quad 95\%\\
\overline{X} & \pm 3 SD \quad \quad 99.7\%
\end{align*}
```python
# Mean +/- 1 SD
Mean = data["displ"].mean()
SD = data["displ"].std()
LL = Mean - 1 * SD
UL = Mean + 1 * SD
print(LL, UL)

# check whether values fall within the interval or not
res = (data["displ"] >= LL) & (data["displ"] <= UL)
print(res)

# confirm the percentage
print(res.sum() / res.count() * 100)
```
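The same check extends to two and three standard deviations. A minimal sketch on synthetic normally distributed data (so it runs without mpg.csv); the percentages should land near 68%, 95%, and 99.7%:

```python
import numpy as np

# 100,000 draws from a standard normal distribution
rng_gen = np.random.default_rng(42)
x = rng_gen.normal(loc=0.0, scale=1.0, size=100_000)

mean, sd = x.mean(), x.std(ddof=1)
pcts = {}
for k in (1, 2, 3):
    # fraction of values within k standard deviations of the mean
    within = ((x >= mean - k * sd) & (x <= mean + k * sd)).mean() * 100
    pcts[k] = within
    print(f"within {k} SD: {within:.1f}%")
```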
Computing Z Score of a Variable
The Z-score is a measure of how many standard deviations below or above the population mean a data point is. It can be used to detect outliers. Z-scores can be computed using scipy.stats.zscore().
```python
# Load/import the required data
from vega_datasets import data   # note: this rebinds the name 'data' used for mpg above
iris = data.iris()
iris.head()
```
```python
from scipy import stats
iris['Zscore'] = stats.zscore(iris['sepalLength'])
print(iris['Zscore'])
```
The Z-scores for the sepalLength variable will be computed and stored in the Zscore column.
```python
# display all rows whose z-score is greater than 2 or less than -2
iris[(iris['Zscore'] > 2) | (iris['Zscore'] < -2)]

# for any required variable
iris['sepalLength'][(iris['Zscore'] > 2) | (iris['Zscore'] < -2)]
```
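The underlying formula is $z = (x - \bar{x}) / s$; it can also be applied by hand with numpy. A minimal check on made-up values (scipy.stats.zscore uses the population SD, ddof=0, by default, so the manual version below matches it):

```python
import numpy as np

# z-score by hand: subtract the mean, divide by the (population) SD
x = np.array([12.0, 15.0, 18.0, 21.0, 24.0])
z = (x - x.mean()) / x.std(ddof=0)
print(z)
```

By construction, the resulting z-scores have mean 0 and (population) standard deviation 1.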
Descriptive Statistics in Python for Categorical Variable
Descriptive statistics in Python for categorical variables can also be computed, for example:

```python
iris['species'].describe()
iris['species'].value_counts()
```
The groupby() command can be used to compute different descriptive statistics in Python for each group.
```python
iris.groupby(iris['species']).describe()
iris.groupby(iris['species']).describe().T  # transposed output
```
The groupby() command can be used to compute different aggregate functions for a quantitative variable for each group.
```python
iris.groupby('species').agg({'sepalLength': ['mean', 'std', 'var', 'median']})

iris[iris.species == 'setosa'].agg({'sepalLength': ['mean', 'std', 'var']})
iris.groupby(iris.species).quantile(.25)
iris['sepalLength'].groupby(iris.species).quantile(.25)
```
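Since the iris example depends on the vega_datasets package, here is a self-contained sketch of the same groupby pattern on synthetic data (the column and group names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

group_means = df.groupby("species")["length"].mean()
print(group_means)                 # a -> 2.0, b -> 20.0

# several aggregates per group in one call
print(df.groupby("species")["length"].agg(["mean", "median", "std"]))

# first quartile per group (pandas interpolates linearly by default)
q25 = df.groupby("species")["length"].quantile(0.25)
print(q25)                         # a -> 1.5, b -> 15.0
```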