Descriptive Statistics in Python

To compute descriptive statistics in Python, one usually needs to import additional libraries and modules. Python has an abundance of such libraries, which augment the base framework and functionality of the language.

Python Libraries

A library is a collection of functions that are used to complete certain programming tasks without having to write the algorithms yourself. The following are some important libraries used to perform different descriptive (both numerical and graphical) and inferential (comparison and relationship) analyses.

  • NumPy is a library for working with multi-dimensional arrays and matrices.
  • pandas provides high-performance, easy-to-use data structures and data analysis tools.
  • SciPy is a library of techniques for numerical and scientific computing.
  • Matplotlib is a library for making graphs.
  • seaborn is a higher-level interface to Matplotlib that simplifies many graphing tasks.
  • statsmodels is a library that implements many statistical techniques.

Import Libraries

To perform different statistics, one needs to import the libraries first.

# pip install pandas
# pip install --upgrade pandas

import pandas as pd
import numpy as np
import statistics as stat

Note that if a library is not installed on a system, it needs to be installed first. The commented lines above can be used to install and upgrade the pandas library.

After importing the required libraries, one needs data to perform some statistical analysis on it. For this Descriptive Statistics Tutorial in Python, we will use the mpg.csv data file.

It is better to copy the mpg.csv file to the working directory, so the data file can be imported easily into the Python workspace. Use the code below to read the data file into a variable.

data = pd.read_csv("mpg.csv", index_col=0)

It is also good practice to get some insight into the imported data.

data.head(10)
data.shape
data.info()
(Figure: Data Frame in Python)
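If the mpg.csv file is not at hand, a small stand-in data frame (column names matching the tutorial, values made up purely for illustration) lets one follow along with the same commands:

```python
import pandas as pd

# Hypothetical stand-in for mpg.csv; the values are illustrative only
demo = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 5.7],
    "cyl":   [4, 4, 6, 6, 8],
    "hwy":   [29, 31, 26, 25, 17],
})

print(demo.shape)    # number of rows and columns
print(demo.head())   # first rows of the frame
demo.info()          # column names, dtypes, and non-null counts
```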

Descriptive Statistics in Python

One can compute the descriptive statistics for complete data or any of the variables in the data set. Let us save the hwy variable from the data frame and compute some basic descriptive statistics such as mean, minimum value, maximum value, and median of the variable.

hwy = data['hwy']

# mean of a variable
print("mean=", np.mean(hwy))

# Median of a variable
print("median=", np.median(hwy))

# Minimum and Maximum value
print("minimum=", np.min(hwy))
print("maximum=", np.max(hwy))

# output
mean= 23.44017094017094
median= 24.0
minimum= 12
maximum= 44

One can also compute, store, and then print (display on screen) a statistic by typing the following lines of code:

mean = np.mean(data["hwy"])

print("The mean value of hwy variable is {}".format(round(mean, 5) )  )

print("The mean value of hwy variable is", round(mean, 5) ) 

# output
The mean value of hwy variable is 23.44017
The mean value of hwy variable is 23.44017
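The same result can also be formatted with an f-string (available since Python 3.6), where a format specification replaces the call to round():

```python
mean = 23.44017094017094  # the value computed above

print(f"The mean value of hwy variable is {round(mean, 5)}")
print(f"The mean value of hwy variable is {mean:.5f}")  # format spec, no round() needed
```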

Instead of computing different descriptive statistics separately, one can use the describe() method, which computes descriptive statistics such as the mean, standard deviation, quartiles, and the minimum and maximum values. The measures of central tendency and measures of dispersion from the describe() method are:

# descriptive statistics for displ variable
data["displ"].describe()

# descriptive statistics for displ, cyl, and hwy variables
data[['displ', 'cyl', 'hwy']].describe()

# Transpose the output
data[['displ', 'cyl', 'hwy']].describe().T
(Figure: output of the describe() method)

The average value of more than one variable in the data frame can be computed as given below:

means = data[["displ", 'cyl']].mean()

print(means)

The measures of dispersions can be computed from numpy and pandas libraries.

# Using Numpy
print('---Results using Numpy--')
sd = np.std(data["displ"], ddof = 1)
print(round(sd, 3))

sd =  np.std(data["displ"], ddof = 0)
print(sd)

print('\n--results using Pandas--')
print(data['displ'].std())
print(data['displ'].std(ddof = 0))

# NumPy uses ddof=0 (population SD) as the default
# pandas uses ddof=1 (sample SD) as the default
# ddof (delta degrees of freedom) specifies whether the sample or population SD is computed


# Output
---Results using Numpy--
1.292
1.2891954791898101

--results using Pandas--
1.2919590310839344
1.2891954791898101
The variance can be computed in the same way; np.var() likewise defaults to ddof=0 (the population variance).

np.var(data["displ"])

# Output
1.6620249835634442
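The ddof behaviour can be verified on a small hand-checkable sample; for [2, 4, 4, 4, 5, 5, 7, 9] the mean is 5 and the sum of squared deviations is 32, so the population standard deviation is exactly 2:

```python
import numpy as np
import pandas as pd

x = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, sum of squared deviations = 32

print(np.std(x))               # population SD: sqrt(32/8) = 2.0
print(np.std(x, ddof=1))       # sample SD:     sqrt(32/7) ~ 2.138
print(pd.Series(x).std())      # pandas default matches ddof=1
```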

One can compute the range (a measure of dispersion) by computing the maximum and minimum values of the variable.

# Range
rng = np.max(data["displ"]) - np.min(data["displ"])  # avoid shadowing the built-in range()

print(rng)

# or

data["displ"].max() - data["displ"].min()

# Output
5.4
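NumPy also provides np.ptp() ("peak to peak"), which computes the maximum minus the minimum in a single call:

```python
import numpy as np

x = [12, 44, 23, 17, 29]   # illustrative values

print(np.ptp(x))           # 44 - 12 = 32
```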

The standard error of the mean ($SE=\frac{\sigma}{\sqrt{n}}$) can be computed from Scipy library or by writing your own Python code.

# Standard Error of MEAN 
from scipy import stats

se = stats.sem(data['displ'])
print(se)

# or
print(np.std(data['displ'], ddof=1)/np.sqrt(len(data['displ'])))

# function form
def se(sigma, n):
    return sigma/np.sqrt(n)

sigma = np.std(data['displ'], ddof = 1)
n = len(data['displ'])

se(sigma, n)

# output
0.08445800397768476
0.08445800397768473
0.08445800397768473

Empirical Rule

In statistics, the 68–95–99.7 rule, also known as the empirical rule, states the percentage of values that lie within an interval around the mean in a normal distribution: about 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively:

\begin{align*}
\overline{X} & \pm 1 SD \quad \quad 68\%\\
\overline{X} & \pm 2 SD \quad \quad 95\%\\
\overline{X} & \pm 3 SD \quad \quad 99.7\%
\end{align*}

# Mean +/- 1 SD
Mean = data["displ"].mean()
SD = data["displ"].std()

LL = Mean - 1 * SD
UL = Mean + 1 * SD

print(LL, UL)

# check whether each value lies within the interval
res = (data["displ"] >= LL) & (data["displ"] <= UL)
print(res)

# confirm percentage
print(res.sum()/res.count()*100)
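The same check can be wrapped in a small helper function (a sketch, assuming NumPy is available) and run for k = 1, 2, 3 standard deviations; on roughly normal data the percentages should come out close to 68, 95, and 99.7:

```python
import numpy as np

def pct_within(x, k):
    """Percentage of values within k sample standard deviations of the mean."""
    x = np.asarray(x)
    mean, sd = x.mean(), x.std(ddof=1)
    return np.mean((x >= mean - k * sd) & (x <= mean + k * sd)) * 100

# Illustration on simulated standard-normal data
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=10_000)

for k in (1, 2, 3):
    print(k, round(pct_within(sample, k), 1))
```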

Computing Z Score of a Variable

The Z-score is a measure of how many standard deviations below or above the population mean a data point is. It can be used to detect outliers. Z-scores can be computed using scipy.stats.zscore().

# Load/ import the required data
from vega_datasets import data
iris = data.iris()
iris.head()
(Figure: first rows of the iris data)
from scipy import stats
iris['Zscore'] = stats.zscore(iris['sepalLength'])
print(iris['Zscore'])

The Z-scores for the sepalLength variable are computed and stored in the Zscore column.

# Display all rows whose z-score is greater than 2 or less than -2
iris[ (iris['Zscore'] > 2) | (iris['Zscore'] < -2) ]

# for any required variable
iris['sepalLength'][ (iris['Zscore'] > 2) | (iris['Zscore'] < -2) ]
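The Z-score can also be computed by hand, which makes the formula $z = (x - \overline{x})/\sigma$ explicit; note that stats.zscore uses the population standard deviation (ddof=0) by default, matching NumPy's default:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative values, mean = 3

z = (x - x.mean()) / x.std(ddof=0)        # same default as scipy.stats.zscore
print(z)
```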

Descriptive Statistics for Categorical Variable

The descriptive statistics for a categorical variable can be computed as well, for example:

iris['species'].describe()
iris['species'].value_counts()

The groupby() command can be used to compute different statistics for each group.

iris.groupby(iris['species']).describe()
iris.groupby(iris['species']).describe().T  # Transposed output

The groupby() command can also be used to compute different aggregate functions for a quantitative variable with respect to each group.

iris.groupby('species').agg( {'sepalLength':['mean', 'std', 'var', 'median']} )
(Figure: descriptive statistics using groupby and aggregate functions)
iris[iris.species=='setosa'].agg({'sepalLength':['mean', 'std', 'var']})

iris.groupby(iris.species).quantile(.25)

iris['sepalLength'].groupby(iris.species).quantile(.25)
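pandas also supports named aggregation (available since pandas 0.25), which gives the output columns readable names; a minimal sketch on a tiny made-up frame (values illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepalLength": [5.0, 5.2, 6.4, 6.8],
})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("species").agg(
    mean_sl=("sepalLength", "mean"),
    sd_sl=("sepalLength", "std"),
)
print(summary)
```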

https://itfeature.com