29. Data science (with Pandas module)
In progress.
In this module we explore an important, practical application of programming: data science. Here, we start looking at ways of dealing with real-world data. This includes:
exploring and understanding the data, seeing what kinds of quantities are present (categorical and quantitative)
cleaning the data, because there can be missing values and outliers, each of which affects our ability to represent the actual data as meaningfully as possible
visualizing the data---as they say, a picture is worth a thousand words! ... and it also helps us interpret the data and really understand it more deeply.
We will also make use of a new module, called Pandas, whose name derives from "panel data" and is also a play on "Python data analysis". This contains a powerful container for loading/reading in and organizing our data, called a DataFrame. While previously we have looked at storing data values in arrays and lists, those have some limitations for dealing with datasets collected in the real world.
For example, we would likely have both quantitative information (e.g., to be stored in ints or floats), as well as descriptive/categorical information (to be stored as strings or lists, like country name, date, lists of crops, etc.), within the same dataset. That dtype-based diversity is hard to handle with plain arrays. And we will also be adding, deleting and rearranging the data---so while column and row indices are useful, it will quickly become hard to keep track of what numbers represent what. Having column headers and functionality to select based on labels for each column will be veeery useful. For these reasons and more that we will see, the Pandas DataFrame will be a useful container for our data, helping to explore, clean and visualize it all, while also helping us keep our sanity. It also supports common data formats.
To start, we import the Pandas module, as well as a couple old favorites, as:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
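As a tiny preview of why this container is handy, here is a minimal sketch (with made-up values) of a DataFrame that mixes text and numbers, and a selection by column label:
# a small, made-up table mixing text and numbers
preview = pd.DataFrame({'country': ['Mali', 'Niger'],
                        'year': [2014, 2014],
                        'value': [386594945, 2360207]})
print(preview['country'])   # select a column by its label, not by an index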
This section has been contributed by an AIMS-Senegal alum, Nafissatou Pouye, who works in the field---jerejef!
29.1. Reading in data to a dataframe
In this section we will investigate the following dataset: dataframe.csv. It is a comma-separated values (CSV) file, which can be viewed using a text editor or a spreadsheet program (like Excel or LibreOffice Calc). This is a widely used data storage filetype, and the Pandas library integrates well with this and other common data formats.
We can use the Pandas function read_csv() to read in the file, specifying the delimiter that separates the data entries. (As the name implies, the delimiter in a CSV file is typically a comma ',', but it can also be other characters like a semi-colon ';'.) The function returns a DataFrame object, which we assign:
df = pd.read_csv('dataframe.csv', delimiter = ',')
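As an aside, if a file instead used semi-colons as its delimiter (say, a hypothetical 'dataframe_semi.csv'), the call would simply change to:
# hypothetical file whose entries are separated by ';'
df_semi = pd.read_csv('dataframe_semi.csv', delimiter=';')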
We can display the contents of our dataframe, df:
df
... which we can see is a mix of numbers and text fields (some abbreviated), organized as columns of data and rows of subjects or entries:
   Region    RegionName        Admin0  Indicator       IndicatorName Unit  Date        Value
0       5  Burkina Faso  Burkina Faso   11820916  Principal repay...  US$  2014     81720000
1      32    Mauritania    Mauritania   11820916  Principal repay...  US$  2014    205969000
2       5  Burkina Faso  Burkina Faso   11881516  Proportion of s...  NaN  2014         18,9
3      37         Niger         Niger   11850216  Ability to trac...  NaN  2014    2,360,207
4      31          Mali          Mali   11760916  Population in t...  NaN  2014  386,594,945
5      37         Niger         Niger   11850516  Competence and ...  NaN  2014    2,276,874
6      31          Mali          Mali   11881516  Proportion of s...  NaN  2014          9,5
7      37         Niger         Niger   11850716  Quality of trad...  NaN  2014    2,079,882
The leftmost column is not really a column of data in the file, but it just shows the row number of the data. Note that the enumeration starts after the very first row, which is a set of labels for each column. Unlike when using arrays, we will often not use row (or column) indices to pick out data of interest---instead we will use the more descriptive labels and/or data values. This is discussed further below.
The output of type(df) is pandas.core.frame.DataFrame. Just briefly, the top of the type/class description from help(pd.core.frame.DataFrame) (or, more simply after defining the particular DataFrame above, help(df)) is:
class DataFrame(pandas.core.generic.NDFrame)
 |  DataFrame(data=None, index: Union[Collection, NoneType] = None,
 |  columns: Union[Collection, NoneType] = None, dtype: Union[str,
 |  numpy.dtype, ForwardRef('ExtensionDtype'), NoneType] = None,
 |  copy: bool = False)
 |
 |  Two-dimensional, size-mutable, potentially heterogeneous tabular data.
 |
 |  Data structure also contains labeled axes (rows and columns).
 |  Arithmetic operations align on both row and column labels. Can be
 |  thought of as a dict-like container for Series objects. The primary
 |  pandas data structure.
 |
 ...
What is a pandas dataframe? A dataframe is a two-dimensional (tabular) data structure, with labeled axes. The principal components of a DataFrame are:
the data, called values or records,
the rows, called observations,
and the columns, also called variables.
In general, a pandas dataframe will be created by loading a dataset from existing storage such as a CSV or Excel file, or an SQL database. The dataframe used in this section is an extract of the (publicly available) Socio-Economic Database compiled by the African Development Bank Statistics Department. It represents different socio-economic indicators for 5 African countries in 2014.
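For reference, loading from those other storage types looks very similar; here is a minimal sketch, where the Excel file and the database/table names are hypothetical:
# hypothetical Excel file (pandas uses an engine such as openpyxl here)
df_xl = pd.read_excel('dataframe.xlsx')
# hypothetical SQLite database containing a table called 'indicators'
import sqlite3
conn = sqlite3.connect('dataframe.db')
df_sql = pd.read_sql('SELECT * FROM indicators', conn)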
29.2. Exploring the data
There are many methods associated with DataFrames, particularly for selecting the data in various ways.
Check the dataframe shape:
print(df.shape)
... that is, the number of observations (rows) and variables (columns); in the present case, it is (44, 8), to start with.
To get an idea of the data, we can look at a few parts of it instead of going through the entire dataframe. For example:
We can look at the first few rows of the dataframe:
print(df.head())
The default number to display is 5 (check the help in df.head?), or we could specify any particular number as an argument to the method, such as:
print(df.head(7))
We can also take a look at the 5 last rows:
df.tail()
Or we can even look at 5 random rows:
df.sample(5)
We can display the dataframe's columns as a pandas Index object (via df.columns), or slightly more cleanly as a list:
df.columns.values.tolist()
... which outputs:
['Region',
'RegionName',
'Admin0',
'Indicator',
'IndicatorName',
'Unit',
'Date',
'Value']
We can use the column names to view subsets of the data frame, selecting by column (called filtering by column):
df[['RegionName']]
df[['RegionName','IndicatorName']]
... which can also be combined with row selection, such as showing the first 5 rows of filtered columns:
df.head()[['RegionName','IndicatorName']]
We can search for (filter) specific values. For example, we may want to look for a specific country within the 'RegionName' column, or one particular value within the 'Unit' column:
df[df.RegionName == 'Niger']
df[df.Unit == 'US$ million']
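We can also combine conditions with & (and) or | (or), wrapping each in parentheses, check against several values at once with isin(), or use .loc to filter rows and select columns together. A couple of sketches (the values here are taken from the data shown above):
# rows for Niger whose unit is US$
df[(df.RegionName == 'Niger') & (df.Unit == 'US$')]
# rows for either Mali or Niger, showing just two columns
df.loc[df.RegionName.isin(['Mali', 'Niger']), ['IndicatorName', 'Value']]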
29.3. Cleaning the data
After exploring the data, we can start working on the data cleaning process. Data cleaning can be defined as the process of treating data in order to make it machine-readable and ready for analysis. (This also involves more data exploration, but with a more critical eye on the details!)
29.3.1. Check+fix variable types
First, let us take a look at the columns (or variable types). A variable can be categorical, meaning it takes values over a finite set. For example, in the present data 'RegionName' is a categorical variable. A variable can also be continuous, meaning it is numerical and arithmetic operations can be performed on it---what variable(s) look continuous, as displayed during the data exploration?
To view the variable types, we can use the following method:
print(df.dtypes)
... where the categorical variables are listed as object and the continuous ones should be listed as some kind of numerical type (int, float, etc.). In the present data, the above command outputs:
Region int64
RegionName object
Admin0 object
Indicator int64
IndicatorName object
Unit object
Date int64
Value object
dtype: object
If we look closely, we can see some issues with certain variables; let's look at a few observations with df.sample(3), just to remind ourselves of what the data looks like again:
Region RegionName Admin0 Indicator IndicatorName Unit Date Value
4 31 Mali Mali 11760916 Population in ... NaN 2014 386,594,945
28 37 Niger Niger 11850816 Efficiency of ... NaN 2014 2,485,543
42 5 Burkina Faso Burkina Faso 11850716 Quality of tra... NaN 2014 2,348,214
We can detect some issues with the types of certain variables:
Issue #1: The type of the variable 'Value' should be continuous, since it holds numerical (often monetary) values, yet it is listed as object; this occurs because the commas are not recognized as being part of a number. It is a bit similar to trying to perform type conversion in the following way: float('2,360,207'), which produces an error (demonstrated just after this list).
Issue #2: The 'Date' datatype shouldn't be integer, though we can see why that happens: the single year values get interpreted as a continuous variable.
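To see that error concretely:
# commas make the string unparseable as a number
try:
    float('2,360,207')
except ValueError as err:
    print(err)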
We can fix those issues by converting the data types of these variables as follows:
Fix #1: We can include another kwarg when loading the CSV, so that commas in numbers are recognized as thousands separators between groups of digits:
df = pd.read_csv('dataframe.csv', delimiter = ',', thousands=',')
Note: we have re-imported the data here, so any pre-existing changes to the dataframe would be lost.
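If re-reading the file were not an option, a roughly equivalent in-place fix for Issue #1 would be to strip the commas from the strings and convert them to numbers directly (a sketch, applied before any other handling of the commas):
# remove the thousands separators, then convert to a numerical type
df['Value'] = pd.to_numeric(df['Value'].str.replace(',', ''))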
Fix #2: We can explicitly convert the date to a specialized date/time format:
df['Date'] = pd.to_datetime(df['Date'], format='%Y')
We can check our work by seeing which columns contain continuous and categorical variables in the current dataframe. We could use the above df.dtypes to see all columns at once, or see each group as a separate list, using the pandas method select_dtypes() in the following ways.
Select the continuous (i.e. numerical) variables:
df_num = df.select_dtypes(include=[np.number])
num_cols = df_num.columns.values.tolist()
print(num_cols)
... which produces:
['Region', 'Indicator', 'Value']
Select categorical (i.e., non-numerical) variables:
df_cat = df.select_dtypes(exclude=[np.number])
cat_cols = df_cat.columns.values.tolist()
print(cat_cols)
... which produces:
['RegionName', 'Admin0', 'IndicatorName', 'Unit', 'Date']
29.3.2. Deal with missing values
When some records are missing in a dataframe, they are called missing values. These are typically coded as "NA" within a CSV, and shown as NaN when the dataframe is displayed. It is important to detect those missing values and to deal with them accordingly, for which there are a few options.
First, we must find the missing values. Here, we loop over each
column, and use the isnull()
method in Pandas to calculate the
fraction of missing values:
for col in df.columns:
pct_missing = np.mean(df[col].isnull())
print('{:15s} - {:6.1%}'.format(col, pct_missing))
... which outputs the following table showing the percentage of missing values for each variable:
Region - 0.0%
RegionName - 0.0%
Admin0 - 0.0%
Indicator - 0.0%
IndicatorName - 0.0%
Unit - 79.5%
Date - 0.0%
Value - 0.0%
We see that there is, indeed, one variable that has a lot of (or here, any) missing values: 'Unit'.
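Incidentally, pandas can compute all of these fractions at once, without an explicit loop:
# per-column fraction of missing values, returned as a Series
print(df.isnull().mean())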
Now, we have to decide what to do about the missing values. In this case, since the percentage of missing values is so high (79.5%), we may want to drop the variable:
df = df.drop('Unit', axis=1)
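For reference, had only a small fraction of values been missing, common alternatives (applied before dropping the whole column) would include dropping just the affected rows, or filling in the gaps; minimal sketches:
# drop only the rows where 'Unit' is missing
df_rows = df.dropna(subset=['Unit'])
# or fill the missing entries with a placeholder label
df_fill = df.fillna({'Unit': 'unknown'})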
29.3.3. Detect outliers
Continuous variables
To get an overall sense of the values stored in a continuous variable column, we can take a look at its descriptive statistics. These summary statistics give a quick and simple description of the data.
For example, from our list of num_cols, we can check out the descriptive statistics of the 'Indicator' column with:
df['Indicator'].describe()
... which outputs:
count 4.400000e+01
mean 1.182708e+07
std 3.961010e+04
min 1.176082e+07
25% 1.182092e+07
50% 1.185032e+07
75% 1.185072e+07
max 1.188152e+07
Name: Indicator, dtype: float64
The following table describes each of the returned fields, which summarize the central values of the variable (mean and median), as well as its spread (std, min/max and interquartile ranges).
Item  | Information
------|------------
count | number of non-missing values
mean  | average of the data values
std   | sample standard deviation
min   | smallest data value
25%   | quartile: 25% of data <= this value
50%   | quartile: 50% of data <= this value (= median)
75%   | quartile: 75% of data <= this value
max   | largest value in data
Name  | variable (column) name
dtype | datatype of column
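These quartiles also give a common rule of thumb for flagging outliers (a general statistical convention, not something specific to pandas): values lying more than 1.5 interquartile ranges (IQRs) beyond the quartiles are suspicious. A sketch, applied to the now-numeric 'Value' column:
# quartiles and interquartile range of the 'Value' column
q1 = df['Value'].quantile(0.25)
q3 = df['Value'].quantile(0.75)
iqr = q3 - q1
# rows whose 'Value' lies outside the 1.5*IQR "whiskers"
outliers = df[(df['Value'] < q1 - 1.5*iqr) | (df['Value'] > q3 + 1.5*iqr)]
print(outliers[['RegionName', 'IndicatorName', 'Value']])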
Categorical variables
There is also summary information available for categorical variables.
For example, from our list of cat_cols, we can check out the descriptive statistics of the 'RegionName' column with:
df['RegionName'].describe()
... which outputs:
count 44
unique 6
top Niger
freq 10
Name: RegionName, dtype: object
Item   | Information
-------|------------
count  | number of non-missing values
unique | number of separate categories
top    | category which appears most
freq   | the "top" category's number of appearances
Name   | variable (column) name
dtype  | datatype of column
We can also visualize the distribution of a variable, which helps to spot unusual values. A histogram applies to continuous (numerical) data, while for a categorical variable a bar plot of the category counts is the natural analogue:
# make a histogram of a continuous variable
plt.figure('hist')
df['Value'].hist(bins=20)
plt.show()
# make a bar plot of category counts for a categorical variable
plt.figure('bars')
df['RegionName'].value_counts().plot.bar()
plt.show()