Data Sets in Python

A data set can be stored in various file formats, such as .csv, .json, and .xlsx. It can live on your local machine or online. Many Python libraries also come bundled with data sets that can be loaded once the library is installed. For example, vega_datasets is a Python library that contains some popular data sets.

# install vega datasets
pip install vega_datasets

After successful installation of the vega_datasets library, one needs to import and load the data.

from vega_datasets import data
iris = data.iris()

The iris data set is bundled with vega_datasets, so it is available locally and can be loaded without an internet connection. Some other data sets in the library are fetched from the web when they are loaded. The names of the available data sets can be listed with data.list_datasets().

Getting Data Set Information in Python

Since the iris data set is assigned to the variable iris, one can now perform different computations on this variable.

# unique values of the species column in iris
iris.species.unique()

# identifying missing observations in the data set
iris.isna().any()

# checking the data types of variables in the data set
iris.dtypes
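The same checks can be tried without installing vega_datasets. The following is a minimal sketch on a small hand-made data frame; the name toy and its values are made up for illustration and are not the real iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame (not the real iris data), with one missing value
toy = pd.DataFrame({
    "sepalLength": [5.1, 4.9, np.nan, 6.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Distinct values of a categorical column
unique_species = toy["species"].unique()

# True for every column that contains at least one missing value
has_missing = toy.isna().any()

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Data type of each column
column_types = toy.dtypes

print(unique_species)
print(has_missing)
print(missing_counts)
print(column_types)
```

Counting with isna().sum() is often more informative than isna().any(), because it shows how many values are missing rather than only whether any are.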

Computing Statistics in Python

After inspecting and cleaning the data, one can compute various statistics on it. For example,

# Descriptive Statistics
iris.describe()
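By default, describe() summarizes only the numeric columns. The sketch below, on a made-up data frame named toy (an assumption, not the iris data), shows how non-numeric columns can be included and how a statistic can be computed per group.

```python
import pandas as pd

# Hypothetical toy data frame used only for illustration
toy = pd.DataFrame({"length": [1.0, 2.0, 3.0], "species": ["a", "a", "b"]})

# Default: count, mean, std, min, quartiles, and max for numeric columns only
numeric_summary = toy.describe()

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = toy.describe(include="all")

# Mean of a numeric column within each category
by_species = toy.groupby("species")["length"].mean()

print(numeric_summary)
print(full_summary)
print(by_species)
```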

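If a variable is not required for further computations, one can drop it and store the result as a new data set. For example, if the "species" variable of the iris data set is not required, one can omit it as follows. The result is assigned to a new name so that the data object imported from vega_datasets is not overwritten.

# drop the species column; the original iris data frame is left unchanged
iris_numeric = iris.drop(['species'], axis = 1)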

The argument axis=1 tells drop() that the labels refer to columns of the data frame, while axis=0 (the default) means that the labels refer to rows. A series has only one axis, axis=0.
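The difference between the two axes can be seen in a small sketch. The data frame toy and its values are hypothetical and serve only to illustrate the axis argument.

```python
import pandas as pd

# Hypothetical toy data frame used only to illustrate the axis argument
toy = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
    "species": ["a", "a", "b"],
})

# axis=1: the label refers to a column
without_species = toy.drop(["species"], axis=1)

# axis=0: the label refers to a row (here the row with index label 0)
without_first_row = toy.drop([0], axis=0)

# Equivalent keyword forms that avoid remembering the axis numbers
same_columns = toy.drop(columns=["species"])
same_rows = toy.drop(index=[0])

print(without_species.shape)
print(without_first_row.shape)
```

Note that drop() returns a new data frame; toy itself is not modified.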

Reading Online Data Sets in Python

Note that one can also read the data from a URL. For example,

import pandas as pd

URL = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
# header = None: the file has no header row, so the first line is read as data
df = pd.read_csv(URL, header = None)

The column (variable) names of any data set can be changed. Because the file above was read with header = None, pandas labels the columns 0, 1, 2, …, 25. One can assign more descriptive names to the columns, like

headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns = headers
df
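Column names can also be supplied while reading the file, through the names argument of read_csv. The sketch below uses a tiny in-memory CSV in place of the remote file; the three rows and the shortened header list are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV without a header row, standing in for a remote file
raw = io.StringIO("3,alfa-romero,gas\n1,audi,gas\n0,bmw,diesel\n")

# Hypothetical column names for this toy example
headers = ["symboling", "make", "fuel-type"]

# header=None: the first line is data; names=...: labels to use for the columns
cars = pd.read_csv(raw, header=None, names=headers)

# rename() changes only the selected labels and returns a new data frame
cars = cars.rename(columns={"fuel-type": "fuel_type"})

print(cars.columns.tolist())
print(cars)
```

Passing names at read time avoids a separate assignment to df.columns, and rename() is convenient when only a few labels need to change.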

Exporting Data Sets in Python

Data sets in Python can also be exported to different file formats. For example,

# export as csv file
df.to_csv("data.csv", index = False)

# export as Excel file (requires an Excel writer engine such as openpyxl)
df.to_excel("data.xlsx", index = False)
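A quick way to check an export is to read the file back. The following sketch writes a made-up data frame to a temporary directory and reloads it; the names toy and restored and the values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data frame used only for illustration
toy = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")

    # index=False leaves the row index out of the file
    toy.to_csv(path, index=False)

    # Read the file back to confirm what was written
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored)
```

Without index = False, the row index would be written as an extra unnamed first column and would reappear as a data column when the file is read back.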

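Instead of importing a data set, one can also generate one's own data set. The example below simulates 100 weights before and after a treatment.

import pandas as pd
from scipy import stats

# weights before: normal with mean 250 and standard deviation 30
before = stats.norm.rvs(scale = 30, loc = 250, size = 100)
# weights after: before plus a normal change with mean -1.25 and standard deviation 5
after = before + stats.norm.rvs(scale = 5, loc = -1.25, size = 100)

df = pd.DataFrame(
    {"Weight_before": before,
     "Weight_after": after,
     "diff": after - before}
)
df.describe()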
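Randomly generated data differ on every run unless the random number generator is seeded. The sketch below swaps in NumPy's seeded generator for the scipy call used above, so that the same data set is produced each time; the seed value 42 and the name weights are arbitrary choices. As an alternative that stays with scipy, stats.norm.rvs also accepts a random_state argument.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed gives the same numbers on every run
rng = np.random.default_rng(42)

# Same distributions as in the text: mean 250, sd 30, and change with mean -1.25, sd 5
before = rng.normal(loc=250, scale=30, size=100)
after = before + rng.normal(loc=-1.25, scale=5, size=100)

weights = pd.DataFrame({
    "Weight_before": before,
    "Weight_after": after,
    "diff": after - before,
})

print(weights.describe())
```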
