To perform descriptive statistics in Python, one may need to import different Python libraries and modules. Python has an abundance of additional modules or libraries that augment the base framework and functionality of the language.
Libraries for Descriptive Statistics in Python
A library is a collection of functions that can be used to complete certain programming tasks without having to write your own algorithms. The following are some important libraries that are used to perform different descriptive (both numerical and graphical) and inferential (comparative and relational) analyses.
- Numpy is a library for working with multi-dimensional arrays and matrices.
- Pandas provides high-performance, easy-to-use data structures and data analysis tools.
- Scipy is a library of techniques for numerical and scientific computing.
- Matplotlib is a library for making graphs.
- Seaborn is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.
- Statsmodels is a library that implements many statistical techniques.
Import Libraries
To perform different statistics, one needs to import the libraries first.
```python
# pip install pandas
# pip install --upgrade pandas

import pandas as pd
import numpy as np
import statistics as stat
```
Note that if a library is not installed on a system, it needs to be installed first. The commented lines can be used to install and upgrade the pandas library.
After importing the required libraries, one needs data to perform some statistical analysis on it. For this Descriptive Statistics Tutorial in Python, we will use the mpg.csv data file.
It is better to copy the mpg.csv file to the working directory. In this case, the data file can be easily imported into the Python workspace. Write the code below to import the data file into a variable.
```python
data = pd.read_csv("mpg.csv", index_col=0)
```
It is also a good idea to get some insight into the imported data.

```python
data.head(10)
data.shape
data.info()
```
Descriptive Statistics in Python
One can compute descriptive statistics in Python for the complete data set or for any of the variables in it. Let us save the hwy variable from the data frame and compute some basic descriptive statistics such as the mean, minimum value, maximum value, and median of the variable.
```python
hwy = data['hwy']

# mean of a variable
print("mean=", np.mean(data["hwy"]))
# median of a variable
print("median=", np.median(data["hwy"]))
# minimum and maximum value
print("minimum=", np.min(data["hwy"]))
print("maximum=", np.max(data["hwy"]))

# output
# mean= 23.44017094017094
# median= 24.0
# minimum= 12
# maximum= 44
```
One can also compute and print (display on screen) formatted results by typing the following lines of code:
```python
mean = np.mean(data["hwy"])
print("The mean value of hwy variable is {}".format(round(mean, 5)))
print("The mean value of hwy variable is", round(mean, 5))

# output
# The mean value of hwy variable is 23.44017
# The mean value of hwy variable is 23.44017
```
Instead of computing different descriptive statistics in Python separately, one can use the describe() method, which computes descriptive statistics such as the mean, median, standard deviation, quartiles, etc., in a single call. The measures of central tendency and measures of dispersion from the describe() method in Python are:
```python
# descriptive statistics for the displ variable
data["displ"].describe()

# descriptive statistics for the displ, cyl, and hwy variables
data[['displ', 'cyl', 'hwy']].describe()

# transpose the output
data[['displ', 'cyl', 'hwy']].describe().T
```
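Since the snippets above depend on the mpg.csv file, here is a minimal self-contained sketch of the same describe() pattern on a small made-up data frame (the values are invented for illustration):

```python
import pandas as pd

# a small synthetic data frame so describe() can be tried without mpg.csv
df = pd.DataFrame({"displ": [1.8, 2.0, 2.8, 3.1, 4.6],
                   "cyl":   [4, 4, 6, 6, 8]})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
print(summary.T)          # transposed: one row per variable
```

The 50% row of the output is the median, so summary.loc["50%", "displ"] is 2.8 for these values.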
The average values for more than one variable in the data frame can be computed as given below:

```python
means = data[["displ", 'cyl']].mean()
print(means)
```
The measures of dispersion can be computed using the numpy and pandas libraries.
```python
# Using Numpy
print('---Results using Numpy---')
sd = np.std(data["displ"], ddof=1)
print(round(sd, 3))
sd = np.std(data["displ"], ddof=0)
print(sd)

# Using Pandas
print('\n---Results using Pandas---')
print(data['displ'].std())
print(data['displ'].std(ddof=0))

# numpy uses ddof=0 (population SD) by default
# pandas uses ddof=1 (sample SD) by default
# ddof (delta degrees of freedom) specifies sample or population SD

### Output
# ---Results using Numpy---
# 1.292
# 1.2891954791898101
#
# ---Results using Pandas---
# 1.2919590310839344
# 1.2891954791898101
```
```python
# variance (numpy uses ddof=0 by default here as well)
np.var(data["displ"])

## Output: 1.6620249835634442
```
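The effect of the ddof argument is easiest to see on a tiny hand-checkable example (the numbers below are made up so the arithmetic can be verified by eye):

```python
import numpy as np

# tiny example showing the effect of ddof on variance and SD
x = np.array([2.0, 4.0, 6.0])            # mean = 4, squared deviations sum to 8
# population variance: sum((x - mean)**2) / n       = 8 / 3
# sample variance:     sum((x - mean)**2) / (n - 1) = 8 / 2 = 4
pop_var = np.var(x, ddof=0)
samp_var = np.var(x, ddof=1)
samp_sd = np.std(x, ddof=1)
print(pop_var, samp_var, samp_sd)   # ~2.667, 4.0, 2.0
```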
One can compute the range (a measure of dispersion) by computing the maximum and minimum values of the variable.
```python
# Range = maximum - minimum
rng = np.max(data["displ"]) - np.min(data["displ"])  # 'rng' avoids shadowing the built-in range()
print(rng)

# or
data["displ"].max() - data["displ"].min()

# Output: 5.4
```
The standard error of the mean ($SE=\frac{\sigma}{\sqrt{n}}$) can be computed from the Scipy library or by writing your own Python code.
```python
# Standard Error of the Mean
from scipy import stats

se = stats.sem(data['displ'])
print(se)

# or
print(np.std(data['displ'], ddof=1) / np.sqrt(len(data['displ'])))

# function form ('std_err' avoids shadowing the 'se' variable above)
def std_err(sigma, n):
    return sigma / np.sqrt(n)

sigma = np.std(data['displ'], ddof=1)
n = len(data['displ'])
print(std_err(sigma, n))

# output
# 0.08445800397768476
# 0.08445800397768473
# 0.08445800397768473
```
Empirical Rule
In statistics, the 68–95–99.7 rule, also known as the empirical rule, describes the percentage of values that lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively:
\begin{align*}
\overline{X} & \pm 1 SD \quad \quad 68\%\\
\overline{X} & \pm 2 SD \quad \quad 95\%\\
\overline{X} & \pm 3 SD \quad \quad 99.7\%
\end{align*}
```python
# Mean +/- 1 SD
Mean = data["displ"].mean()
SD = data["displ"].std()
LL = Mean - 1 * SD
UL = Mean + 1 * SD
print(LL, UL)

# check whether values fall within the interval or not
res = (data["displ"] >= LL) & (data["displ"] <= UL)
print(res)

# confirm the percentage
print(res.sum() / res.count() * 100)
```
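The same check extends to two and three standard deviations. A minimal sketch on synthetic normally distributed data (so it runs without mpg.csv); the percentages should land near 68%, 95%, and 99.7%:

```python
import numpy as np

# 100,000 draws from a standard normal distribution
rng_gen = np.random.default_rng(42)
x = rng_gen.normal(loc=0.0, scale=1.0, size=100_000)

mean, sd = x.mean(), x.std(ddof=1)
pcts = {}
for k in (1, 2, 3):
    # fraction of values within k standard deviations of the mean
    within = ((x >= mean - k * sd) & (x <= mean + k * sd)).mean() * 100
    pcts[k] = within
    print(f"within {k} SD: {within:.1f}%")
```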
Computing Z Score of a Variable
The Z-score is a measure of how many standard deviations below or above the population mean a data point is. It can be used to detect outliers. Z-scores can be computed using scipy.stats.zscore().
```python
# Load/import the required data
from vega_datasets import data   # note: this rebinds the name 'data' used for mpg above
iris = data.iris()
iris.head()
```
```python
from scipy import stats
iris['Zscore'] = stats.zscore(iris['sepalLength'])
print(iris['Zscore'])
```
The Z-scores for the sepalLength variable will be computed and stored in the Zscore column.
```python
# display all rows whose z-score is greater than 2 or less than -2
iris[(iris['Zscore'] > 2) | (iris['Zscore'] < -2)]

# for any required variable
iris['sepalLength'][(iris['Zscore'] > 2) | (iris['Zscore'] < -2)]
```
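The underlying formula is $z = (x - \bar{x}) / s$; it can also be applied by hand with numpy. A minimal check on made-up values (scipy.stats.zscore uses the population SD, ddof=0, by default, so the manual version below matches it):

```python
import numpy as np

# z-score by hand: subtract the mean, divide by the (population) SD
x = np.array([12.0, 15.0, 18.0, 21.0, 24.0])
z = (x - x.mean()) / x.std(ddof=0)
print(z)
```

By construction, the resulting z-scores have mean 0 and (population) standard deviation 1.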
Descriptive Statistics in Python for Categorical Variable
Descriptive statistics in Python for categorical variables can also be computed, for example:

```python
iris['species'].describe()
iris['species'].value_counts()
```
The groupby() command can be used to compute different descriptive statistics in Python for each group.
```python
iris.groupby(iris['species']).describe()
iris.groupby(iris['species']).describe().T  # transposed output
```
The groupby() command can be used to compute different aggregate functions for a quantitative variable for each group.
```python
iris.groupby('species').agg({'sepalLength': ['mean', 'std', 'var', 'median']})

iris[iris.species == 'setosa'].agg({'sepalLength': ['mean', 'std', 'var']})
iris.groupby(iris.species).quantile(.25)
iris['sepalLength'].groupby(iris.species).quantile(.25)
```
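Since the iris example depends on the vega_datasets package, here is a self-contained sketch of the same groupby pattern on synthetic data (the column and group names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

group_means = df.groupby("species")["length"].mean()
print(group_means)                 # a -> 2.0, b -> 20.0

# several aggregates per group in one call
print(df.groupby("species")["length"].agg(["mean", "median", "std"]))

# first quartile per group (pandas interpolates linearly by default)
q25 = df.groupby("species")["length"].quantile(0.25)
print(q25)                         # a -> 1.5, b -> 15.0
```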