29. Data science (with Pandas module)

In progress.

In this module we explore an important, practical application of programming: data science. Here, we start looking at ways of dealing with real-world data. This includes:

  • exploring and understanding the data, seeing what kinds of quantities are present (categorical and quantitative)

  • cleaning the data, because there can be missing values and outliers, each of which affects our ability to represent the actual data as meaningfully as possible

  • visualizing the data---as they say, a picture is worth a thousand words! ... and it also helps us interpret the data and really understand it more deeply.

We will also make use of a new module, called Pandas, whose name derives from "panel data" (and doubles as shorthand for "Python data analysis"). This contains a powerful container for loading/reading in and organizing our data, called a DataFrame. While previously we have looked at storing data values in arrays and lists, those have some limitations for dealing with datasets collected in the real world.

For example, we would likely have both quantitative information (e.g., to be stored in ints or floats) and descriptive/categorical information (to be stored as strings or lists, like country names, dates, lists of crops, etc.) within the same dataset. That diversity of dtypes is impossible for a single array. We will also be adding, deleting and rearranging the data---so while column and row indices are useful, it quickly becomes hard to keep track of which numbers represent what. Having column headers, and the functionality to select data based on the labels of each column, will be very useful. For these reasons and more that we will see, the Pandas DataFrame will be a convenient container for our data, helping us explore, clean and visualize it all, while also helping us keep our sanity. It also supports common data formats.

To start, we import the Pandas module, as well as a couple old favorites, as:

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
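
The motivation above (mixed dtypes plus labeled columns) can be sketched with a tiny dataset; the country names and values here are invented for illustration:

```python
import pandas as pd

# Columns with different dtypes coexist in one table,
# which a single NumPy array cannot do cleanly.
df_toy = pd.DataFrame({
    'country': ['Senegal', 'Mali'],   # strings (categorical)
    'year':    [2014, 2014],          # integers
    'value':   [3.14, 2.72],          # floats
})

print(df_toy.dtypes)       # each column keeps its own dtype
print(df_toy['country'])   # select by label, not by column number
```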

This section has been contributed by an AIMS-Senegal alum, Nafissatou Pouye, who works in the field---jerejef!

29.1. Reading in data to a dataframe

In this section we will investigate the following dataset: dataframe.csv. It is a comma-separated values (CSV) file, which can be viewed using a text editor or a spreadsheet program (like Excel or LibreOffice Calc). This is a widely used data storage filetype, and the Pandas library integrates well with this and other common data formats.

We can use the Pandas function read_csv() to read in the file, specifying the delimiter that separates the data entries. (As the name implies, the delimiter in a CSV file is typically a comma ',', but it can also be other characters like a semi-colon ';'.) The function returns a DataFrame object, which we assign:

df = pd.read_csv('dataframe.csv', delimiter = ',')
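
If the CSV file is not at hand, the same call can be tried on an in-memory string (a hypothetical stand-in, with invented values) by passing it through io.StringIO, which lets read_csv treat the string as if it were a file:

```python
import io
import pandas as pd

# A small stand-in for a CSV file (contents invented for illustration):
csv_text = "RegionName,Date,Value\nNiger,2014,100\nMali,2014,200\n"
df_toy = pd.read_csv(io.StringIO(csv_text), delimiter=',')

print(df_toy.shape)   # (2, 3): two rows, three columns
# For a semicolon-separated file we would pass delimiter=';' instead.
```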

We can display its contents:

df

... which we can see is a mix of numbers and text fields (some abbreviated), organized as columns of data and rows of subjects or entries:

   Region    RegionName        Admin0  Indicator       IndicatorName Unit  Date        Value
0       5  Burkina Faso  Burkina Faso   11820916  Principal repay...  US$  2014     81720000
1      32    Mauritania    Mauritania   11820916  Principal repay...  US$  2014    205969000
2       5  Burkina Faso  Burkina Faso   11881516  Proportion of s...  NaN  2014         18,9
3      37         Niger         Niger   11850216  Ability to trac...  NaN  2014    2,360,207
4      31          Mali          Mali   11760916  Population in t...  NaN  2014  386,594,945
5      37         Niger         Niger   11850516  Competence and ...  NaN  2014    2,276,874
6      31          Mali          Mali   11881516  Proportion of s...  NaN  2014          9,5
7      37         Niger         Niger   11850716  Quality of trad...  NaN  2014    2,079,882

The leftmost column is not really a column of data in the file, but it just shows the row number of the data. Note that the enumeration starts after the very first row, which is a set of labels for each column. Unlike when using arrays, we will often not use row (or column) indices to pick out data of interest---instead we will use the more descriptive labels and/or data values. This is discussed further below.
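
As a small preview on a toy frame (values invented, not the real dataset), compare position-based selection, as we did with arrays, with the label/value-based selection we will usually prefer:

```python
import pandas as pd

toy = pd.DataFrame({'RegionName': ['Niger', 'Mali', 'Niger'],
                    'Value': [10, 20, 30]})

# Position-based selection (array-style):
print(toy.iloc[0])                        # first row

# Value-based selection, using the descriptive column label:
print(toy[toy['RegionName'] == 'Niger'])  # all rows for Niger
```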

The output of type(df) is pandas.core.frame.DataFrame. Just briefly, the top of the type/class description from help(pd.core.frame.DataFrame) (or, more simply after defining the particular DataFrame above, help(df)) is:

class DataFrame(pandas.core.generic.NDFrame)
 |  DataFrame(data=None, index: Union[Collection, NoneType] = None,
 |      columns: Union[Collection, NoneType] = None, dtype: Union[str,
 |      numpy.dtype, ForwardRef('ExtensionDtype'), NoneType] = None,
 |      copy: bool = False)
 |
 |  Two-dimensional, size-mutable, potentially heterogeneous tabular data.
 |
 |  Data structure also contains labeled axes (rows and columns).
 |  Arithmetic operations align on both row and column labels. Can be
 |  thought of as a dict-like container for Series objects. The primary
 |  pandas data structure.
 |
 ...

What is a pandas dataframe? A dataframe is a two-dimensional (tabular) data structure with labeled axes. The principal components of a DataFrame are:

  • the data, called values or records,

  • the rows, called observations,

  • and the columns, also called variables.

In general, a pandas dataframe will be created by loading a dataset from existing storage, such as a CSV or Excel file or a SQL database. The dataframe used in this section is an extract of the (publicly available) Socio-Economic Database compiled by the African Development Bank Statistics Department. It contains different socio-economic indicators for 5 African countries in 2014.

29.2. Exploring the data

There are many methods associated with DataFrames, particularly for selecting the data in various ways.

Check the dataframe shape:

print(df.shape)

... that is, the number of observations (rows) and variables (columns); in the present case, it is (44, 8), to start with.

To get an idea of the data, we can look at parts of it instead of going through the entire dataframe. For example:

  • We can look at the first few rows of the dataframe:

    print(df.head())
    

    The default number to display is 5 (check the help in df.head?), or we could specify any particular number as an argument to the method, such as:

    print(df.head(7))
    
  • We can also take a look at the last 5 rows:

    df.tail()
    
  • Or we can even look at 5 random rows:

    df.sample(5)
    

We can display the dataframe's columns as a pandas Index object (via df.columns), or slightly more cleanly as a list:

df.columns.values.tolist()

... which outputs:

['Region',
 'RegionName',
 'Admin0',
 'Indicator',
 'IndicatorName',
 'Unit',
 'Date',
 'Value']

We can use the column names to view subsets of the data frame, selecting by column (called filtering by column):

df[['RegionName']]
df[['RegionName','IndicatorName']]

... which can also be combined with row selection, such as showing the first 5 rows of the filtered columns:

df.head()[['RegionName','IndicatorName']]

We can search for (filter) specific values. For example, we may want to look for a specific country within the 'RegionName' column, or one particular value within the 'Unit' column:

df[df.RegionName == 'Niger']
df[df.Unit == 'US$ million']
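
As a self-contained illustration on a toy frame (values invented, not the real dataset), these filters work by comparing a whole column element-wise, producing a boolean mask that picks out rows:

```python
import pandas as pd

toy = pd.DataFrame({'RegionName': ['Niger', 'Mali', 'Mauritania'],
                    'Unit': ['US$ million', None, 'US$ million']})

mask = toy.RegionName == 'Niger'   # a boolean Series, one entry per row
print(mask.tolist())               # [True, False, False]
print(toy[mask])                   # keeps only the rows where mask is True
```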

29.3. Cleaning the data

After exploring the data, we can start working on the data cleaning process. Data cleaning can be defined as the process of treating data in order to make it machine-readable and ready for analysis. (This also involves further data exploration, but with a more critical eye on the details!)

29.3.1. Check+fix variable types

First, let us take a look at the columns (or variable types). A variable can be categorical, meaning it takes values over a finite set. For example, in the present data 'RegionName' is a categorical variable. A variable can also be continuous, meaning it is numerical and arithmetic operations can be performed on it---which variable(s) look continuous, as displayed during the data exploration?

To view the variable types, we can use the following method:

print(df.dtypes)

... where the categorical variables are listed as object and the continuous ones should be listed as some kind of numerical type (int, float, etc.). In the present data, the above command outputs:

Region            int64
RegionName       object
Admin0           object
Indicator         int64
IndicatorName    object
Unit             object
Date              int64
Value            object
dtype: object

If we look closely, we can see some issues with certain variables; let's look at a few observations with df.sample(3), just to remind ourselves of what the data looks like again:

    Region    RegionName        Admin0  Indicator      IndicatorName Unit  Date        Value
4       31          Mali          Mali   11760916  Population in ...  NaN  2014  386,594,945
28      37         Niger         Niger   11850816  Efficiency of ...  NaN  2014    2,485,543
42       5  Burkina Faso  Burkina Faso   11850716  Quality of tra...  NaN  2014    2,348,214

We can detect some issues with the types of certain variables:

  • Issue #1: The type of the variable 'Value' should be continuous, since it stores numerical quantities, but it is read in as object because the commas are not recognized as being part of a number. It is a bit like trying to perform the type conversion float('2,360,207'), which produces an error.

  • Issue #2: The 'Date' datatype shouldn't be integer, though we can see why that happens: the bare year values get interpreted as a continuous (integer) variable.

We can fix those issues by converting the data types of these variables as follows:

  • Fix #1: We can include another kwarg when loading the CSV, so that commas in numbers are recognized as being spacers of digits:

    df = pd.read_csv('dataframe.csv', delimiter = ',', thousands=',')
    

    Note: we have re-imported the data here, so any pre-existing changes to the dataframe would be lost.

  • Fix #2: We can explicitly convert the date to a specialized date/time format:

    df['Date'] = pd.to_datetime(df['Date'], format='%Y')
    
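Both fixes can be checked end-to-end on a small in-memory stand-in for the CSV (invented values), using io.StringIO in place of a file:

```python
import io
import pandas as pd

# Toy CSV text with thousands-separated numbers, as in the real dataset:
csv_text = 'Date,Value\n2014,"2,360,207"\n2014,"386,594,945"\n'

# Fix #1: thousands=',' makes the commas read as digit spacers:
df_toy = pd.read_csv(io.StringIO(csv_text), delimiter=',', thousands=',')
print(df_toy['Value'].dtype)   # now a numeric (integer) dtype

# Fix #2: interpret the year as a proper datetime:
df_toy['Date'] = pd.to_datetime(df_toy['Date'], format='%Y')
print(df_toy['Date'].dtype)
```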

We can check our work by seeing which columns contain continuous and categorical variables in the current dataframe. We could use the above df.dtypes to see all columns once, or see each as a separate list, using the pandas method select_dtypes() in the following ways.

  • Select the continuous (i.e. numerical) variables:

    df_num   = df.select_dtypes(include=[np.number])
    num_cols = df_num.columns.values.tolist()
    print(num_cols)
    

    ... which produces:

    ['Region', 'Indicator', 'Value']
    
  • Select categorical (i.e., non-numerical) variables:

    df_cat   = df.select_dtypes(exclude=[np.number])
    cat_cols = df_cat.columns.values.tolist()
    print(cat_cols)
    

    ... which produces:

    ['RegionName', 'Admin0', 'IndicatorName', 'Unit', 'Date']
    

29.3.2. Deal with missing values

When some records are missing in a dataframe, they are called missing values. These often appear as empty fields or "NA" within a CSV, and are displayed as NaN in the dataframe. It is important to detect those missing values and to deal with them accordingly, for which there are a few options.

First, we must find the missing values. Here, we loop over each column, and use the isnull() method in Pandas to calculate the fraction of missing values:

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{:15s} - {:6.1%}'.format(col, pct_missing))

... which outputs the following table showing the percentage of missing values for each variable:

Region          -   0.0%
RegionName      -   0.0%
Admin0          -   0.0%
Indicator       -   0.0%
IndicatorName   -   0.0%
Unit            -  79.5%
Date            -   0.0%
Value           -   0.0%

We see that there is, indeed, one variable that has a lot (or here, any) missing values: 'Unit'.
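
Note that the per-column loop above can be condensed into one expression, and, as a sketch on a toy frame (invented values, not the real dataset), there are other common treatments besides dropping a whole column:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'RegionName': ['Niger', 'Mali'],
                    'Unit': [np.nan, np.nan],
                    'Value': [100, np.nan]})

# The per-column loop can be written as a single expression:
print(toy.isnull().mean())   # fraction of missing values per column

# Other common treatments of missing values:
print(toy.dropna())          # drop rows containing any NaN
print(toy.fillna(0))         # or replace NaN with a chosen value
```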

Now, we have to decide what to do about the missing values. In this case, since the percentage of missing values is >= 50%, we may want to drop the variable. (Note that drop() returns a new dataframe rather than modifying df in place, so we assign the result back:)

df = df.drop('Unit', axis=1)

29.3.3. Detect outliers

Continuous variables

To get an overall sense of the values stored in a continuous variable column, we can take a look at its descriptive statistics. These summary statistics give a quick and simple description of the data.

For example, from our list of num_cols, we can check out the descriptive statistics of the 'Indicator' column with:

df['Indicator'].describe()

... which outputs:

count    4.400000e+01
mean     1.182708e+07
std      3.961010e+04
min      1.176082e+07
25%      1.182092e+07
50%      1.185032e+07
75%      1.185072e+07
max      1.188152e+07
Name: Indicator, dtype: float64

The following table describes each of the returned fields, which basically describe the average values of the variable (mean and median), as well as their spread (std, min/max and interquartile ranges).

Table of outputs from a dataframe's describe() method for a continuous variable

Item     Information
-----    -----------
count    number of non-missing values
mean     average of the data values
std      sample standard deviation
min      smallest data value
25%      quartile: 25% of data <= this value
50%      quartile: 50% of data <= this value (= median)
75%      quartile: 75% of data <= this value
max      largest data value
Name     variable (column) name
dtype    datatype of column
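
As a quick sanity check of these fields, here is a toy Series (values invented) whose describe() outputs can be reproduced by hand or with NumPy:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0])
desc = s.describe()

print(desc['mean'])           # 2.5, the average
print(desc['50%'])            # 2.5, the median
print(np.percentile(s, 75))   # 3.25, matching desc['75%']
```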


Categorical variables

There is also summary information available for categorical variables. For example, from our list of cat_cols, we can check out the descriptive statistics of the 'RegionName' column with:

df['RegionName'].describe()

... which outputs:

count        44
unique        6
top       Niger
freq         10
Name: RegionName, dtype: object

Table of outputs from a dataframe's describe() method for a categorical variable

Item      Information
------    -----------
count     number of non-missing values
unique    number of separate categories
top       category which appears most often
freq      the "top" category's number of appearances
Name      variable (column) name
dtype     datatype of column
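
The unique/top/freq fields come from the counts of each category, which we can also get directly with value_counts(); a toy example (values invented):

```python
import pandas as pd

s = pd.Series(['Niger', 'Mali', 'Niger', 'Burkina Faso'])

counts = s.value_counts()   # number of appearances per category
print(counts)               # 'Niger' appears most often (twice)
print(s.describe())         # count=4, unique=3, top='Niger', freq=2
```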

We can also visualize the distribution of a categorical variable, either as a histogram or as a bar plot of the category counts:

# make a histogram
plt.figure('hist')
df['RegionName'].hist(bins=200)
plt.show()

# make a bar plot
plt.figure('bars')
df['RegionName'].value_counts().plot.bar()
plt.show()

29.4. Dataviz with Python

29.5. Practice