:tocdepth: 2

.. PT: a note on dependencies for env (if using conda, say):
   + To read XLSX files in with pd.read_excel(), there is an
     additional dependency to include beyond just pandas:
     conda install -c anaconda xlrd

.. _mod_data_sci:

***********************************
Data science (with Pandas module)
***********************************

.. contents:: :local:

.. highlight:: python

**In progress.**

In this module we explore an important, practical application of
programming: data science.  Here, we start looking at ways of dealing
with real-world data.  This includes:

* exploring and understanding the data, seeing what kinds of
  quantities are present (categorical and quantitative)

* cleaning the data, because there can be missing values and
  outliers, each of which affects our ability to represent the actual
  data as meaningfully as possible

* visualizing the data---as they say, a picture is worth a thousand
  words! \.\.\. and it also helps us interpret the data and really
  understand it more deeply.

We will also make use of a new module, called Pandas, whose name
plays on both "panel data" and "Python data analysis".  This contains
a powerful container for loading/reading in and organizing our data,
called a **DataFrame**.  While previously we have looked at storing
data values in arrays and lists, those have some limitations for
dealing with datasets collected in the real world.  For example, we
would likely have both quantitative information (e.g., to be stored
as ints or floats) and descriptive/categorical information (to be
stored as strings or lists, like country name, date, lists of crops,
etc.) within the same dataset.  A standard NumPy array cannot
comfortably hold that diversity of dtypes in one structure.  We will
also be adding, deleting and rearranging the data---so while column
and row indices are useful, it would quickly become hard to keep
track of which numbers represent what.  Having column headers, and
the functionality to select data based on the labels of each column,
will be very useful.

For these reasons and more that we will see, the Pandas DataFrame
will be a useful container for our data, helping us to explore, clean
and visualize it all, while also helping us keep our sanity.

To start, we import the Pandas module, as well as a couple of old
favorites, as::

  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt

This section has been contributed by an AIMS-Senegal alum,
Nafissatou Pouye, who works in the field---*jerejef!*


Reading in data to a dataframe
===========================================================================

In this section we will investigate the following dataset:
:download:`dataframe.csv`.  It is a comma-separated values (CSV)
file, which can be viewed using a text editor or a spreadsheet
program (like Excel or LibreOffice Calc).  This is a widely used data
storage filetype, and the Pandas library integrates well with this
and other common data formats.

We can use the Pandas function ``read_csv()`` to read in the file,
specifying the **delimiter** that separates the data entries.  (As
the name implies, the delimiter in a CSV file is *typically* a comma
``','``, but it can also be other characters like a semi-colon
``';'``.)  The function returns a DataFrame object, which we
assign::

  df = pd.read_csv('dataframe.csv', delimiter=',')
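As a small sketch of that delimiter argument in action: if we instead
had a semicolon-separated version of the file (the filename
``dataframe_semi.csv`` here is hypothetical, just for illustration),
the same function would read it with one keyword changed::

  # hypothetical semicolon-delimited file, for illustration only
  df_semi = pd.read_csv('dataframe_semi.csv', delimiter=';')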
We can display its contents::

  df

\.\.\. which we can see is a mix of numbers and text fields (some
abbreviated), organized as columns of data and rows of subjects or
entries:

.. code-block:: none
   :linenos:

       Region    RegionName        Admin0  Indicator       IndicatorName Unit  Date        Value
   0        5  Burkina Faso  Burkina Faso   11820916  Principal repay...  US$  2014     81720000
   1       32    Mauritania    Mauritania   11820916  Principal repay...  US$  2014    205969000
   2        5  Burkina Faso  Burkina Faso   11881516  Proportion of s...  NaN  2014         18,9
   3       37         Niger         Niger   11850216  Ability to trac...  NaN  2014    2,360,207
   4       31          Mali          Mali   11760916  Population in t...  NaN  2014  386,594,945
   5       37         Niger         Niger   11850516  Competence and ...  NaN  2014    2,276,874
   6       31          Mali          Mali   11881516  Proportion of s...  NaN  2014          9,5
   7       37         Niger         Niger   11850716  Quality of trad...  NaN  2014    2,079,882

The leftmost column is not really a column of data in the file; it
just shows the row number of each entry.  Note that the enumeration
starts after the very first row, which is a set of labels for each
column.  Unlike when using arrays, we will often *not* use row (or
column) indices to pick out data of interest---instead we will use
the more descriptive labels and/or the data values themselves.  This
is discussed further below.

The output of ``type(df)`` is ``pandas.core.frame.DataFrame``.  Just
briefly, the top of the type/class description from
``help(pd.core.frame.DataFrame)`` (or, more simply after defining the
particular DataFrame above, ``help(df)``) is:

.. code-block:: none
   :linenos:

   class DataFrame(pandas.core.generic.NDFrame)
    |  DataFrame(data=None, index: Union[Collection, NoneType] = None, columns: Union[Collection, NoneType] = None, dtype: Union[str, numpy.dtype, ForwardRef('ExtensionDtype'), NoneType] = None, copy: bool = False)
    |
    |  Two-dimensional, size-mutable, potentially heterogeneous tabular data.
    |
    |  Data structure also contains labeled axes (rows and columns).
    |  Arithmetic operations align on both row and column labels. Can be
    |  thought of as a dict-like container for Series objects. The primary
    |  pandas data structure.
    |
    ...

What is a pandas dataframe?  A dataframe is a two-dimensional
(tabular) data structure, with labeled axes.  The principal
components of a DataFrame are:

* the data, called **values** or **records**,

* the rows, called **observations**,

* and the columns, also called **variables**.

In general, a pandas dataframe will be created by loading a dataset
from existing storage, such as a CSV or Excel file, or an SQL
database.  The dataframe used in this section is an extract of the
(publicly available) Socio-Economic Database compiled by the African
Development Bank Statistics Department.  It represents different
socio-economic indicators for 5 African countries in 2014.


Exploring the data
===========================================================================

There are many methods associated with DataFrames, particularly for
selecting the data in various ways.

Check the dataframe shape::

  print(df.shape)

\.\.\. that is, the number of observations (rows) and variables
(columns); in the present case, it is ``(44, 8)``, to start with.
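As a quick aside, the DataFrame's ``info()`` method bundles much of
this structural overview---the shape, column names, dtypes and
non-null counts---into a single call::

  # print a summary of the dataframe: shape, dtypes, non-null counts
  df.info()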
To get an idea of the data, we can look at some parts of it, instead
of going through the entire data frame.  For example:

* We can look at the first few rows of the dataframe::

    print(df.head())

  The default number to display is 5 (check the help in
  ``df.head?``), or we could specify any particular number as an
  argument to the method, such as::

    print(df.head(7))

* We can also take a look at the 5 last rows::

    df.tail()

* Or we can even look at 5 random rows::

    df.sample(5)

We can display the dataframe's columns as a pandas index object (via
``df.columns``), or slightly more cleanly as a list::

  df.columns.values.tolist()

\.\.\. which outputs::

  ['Region', 'RegionName', 'Admin0', 'Indicator', 'IndicatorName',
   'Unit', 'Date', 'Value']

We can use the column names to view subsets of the data frame,
selecting by column (called **filtering** by column)::

  df[['RegionName']]
  df[['RegionName', 'IndicatorName']]

\.\.\. which can also be combined with row selection, such as showing
the first 5 rows of the filtered columns::

  df.head()[['RegionName', 'IndicatorName']]

We can search for (filter) specific values.  For example, we may want
to look for a specific country within the ``'RegionName'`` column, or
one particular value within the ``'Unit'`` column::

  df[df.RegionName == 'Niger']
  df[df.Unit == 'US$ million']
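Such conditions can also be combined, using the element-wise logical
operators ``&`` (and) and ``|`` (or), with each condition wrapped in
parentheses.  As a sketch (this particular combination of values is
just for illustration)::

  # rows for Niger whose 'Unit' is recorded as 'US$'
  df[(df.RegionName == 'Niger') & (df.Unit == 'US$')]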
Cleaning the data
===========================================================================

After exploring the data, we can start working on the **data
cleaning** process.  Data cleaning can be defined as the process of
treating data in order to make it machine-readable and ready for
analysis.  (This also involves a bit more data exploration, but with
a more critical eye on the details!)

Check+fix variable types
---------------------------------------------------------------------------

First, let us take a look at the columns (or variable types).  A
variable can be **categorical**, meaning it takes values over a
finite set.  For example, in the present data ``'RegionName'`` is a
categorical variable.  A variable can also be **continuous**, meaning
it takes numerical values on which arithmetic operations can be
performed---what variable(s) look continuous, as displayed during the
data exploration?

To view the variable types, we can use the following method::

  print(df.dtypes)

\.\.\. where the categorical variables are listed as ``object`` and
the continuous ones should be listed as some kind of numerical type
(int, float, etc.).  In the present data, the above command outputs:

.. code-block:: none

   Region            int64
   RegionName       object
   Admin0           object
   Indicator         int64
   IndicatorName    object
   Unit             object
   Date              int64
   Value            object
   dtype: object

If we look closely, we can see some issues with certain variables;
let's look at a few observations with ``df.sample(3)``, just to
remind ourselves of what the data looks like again:

.. code-block:: none

       Region    RegionName        Admin0  Indicator      IndicatorName Unit  Date        Value
   4       31          Mali          Mali   11760916  Population in ...  NaN  2014  386,594,945
   28      37         Niger         Niger   11850816  Efficiency of ...  NaN  2014    2,485,543
   42       5  Burkina Faso  Burkina Faso   11850716  Quality of tra...  NaN  2014    2,348,214

We can detect some issues with the types of certain variables:

* *Issue #1:* The type of the variable ``'Value'`` should be
  continuous, since it is a monetary value; it was read in as
  ``object`` because the commas within the numbers are not recognized
  as digit separators.  It is a bit similar to trying to perform type
  conversion in the following way: ``float('2,360,207')``, which
  produces an error.

  .. PT comm: in the fix to this, we ignore the commas (basically
     remove them and convert the values to be int) -> should we be
     concerned that '18,9' actually might represent '18,900', and
     hence convert to '18900', rather than to '189'?

* *Issue #2:* The ``'Date'`` datatype shouldn't be integer, though we
  can see why that happens: the single year values get interpreted as
  a continuous variable.

We can fix those issues by converting the data types of these
variables as follows:

* *Fix #1:* We can include another kwarg when loading the CSV, so
  that commas in numbers are recognized as separators of digits::

    df = pd.read_csv('dataframe.csv', delimiter=',', thousands=',')

  Note: we have re-imported the data here, so any pre-existing
  changes to the dataframe would be lost.

  .. PT: and still do this float conversion?  We can explicitly
     convert the type to be a float::
       df['Value'].astype('float')

* *Fix #2:* We can explicitly convert the date to a specialized
  date/time format::

    df['Date'] = pd.to_datetime(df['Date'], format='%Y')

We can check our work by seeing which columns contain continuous and
categorical variables in the current dataframe.  We could use the
above ``df.dtypes`` to see all columns at once, or see each group as
a separate list, using the pandas method ``select_dtypes()`` in the
following ways.

* Select the continuous (i.e., numerical) variables::

    df_num = df.select_dtypes(include=[np.number])
    num_cols = df_num.columns.values.tolist()
    print(num_cols)

  \.\.\. which produces::

    ['Region', 'Indicator', 'Value']

* Select the categorical (i.e., non-numerical) variables::

    df_cat = df.select_dtypes(exclude=[np.number])
    cat_cols = df_cat.columns.values.tolist()
    print(cat_cols)

  \.\.\. which produces::

    ['RegionName', 'Admin0', 'IndicatorName', 'Unit', 'Date']

Deal with missing values
---------------------------------------------------------------------------

When some records are missing from a dataframe, they are called
**missing values**.  These are typically coded as "NA" within a CSV,
and displayed as ``NaN`` in the displayed dataframes.  It is
important to detect those missing values and to deal with them
accordingly, for which there are a few options.

First, we must *find* the missing values.  Here, we loop over each
column, and use the ``isnull()`` method in Pandas to calculate the
fraction of missing values::

  for col in df.columns:
      pct_missing = np.mean(df[col].isnull())
      print('{:15s} - {:6.1%}'.format(col, pct_missing))

.. pt: the round() function, originally used here, has weird
   behavior.  From reading the help, with no args given, it SHOULD
   round to no decimal places; this appears to be the case if the
   dtype of the number is float, but NOT if it is np.float64.  That
   annoys me.  From googling a bit, it also has some other odd
   behavior.  Let's just use string formatting, as now above.

\.\.\. which outputs the following table showing the percentage of
missing values for each variable:

.. code-block:: none

   Region          -   0.0%
   RegionName      -   0.0%
   Admin0          -   0.0%
   Indicator       -   0.0%
   IndicatorName   -   0.0%
   Unit            -  79.5%
   Date            -   0.0%
   Value           -   0.0%

We see that there is, indeed, one variable that has a lot of (or
here, *any*) missing values: ``'Unit'``.

Now, we have to decide what to do about the missing values.  In this
case, since the percentage of missing values is :math:`\geq 50\%`, we
may want to drop the variable entirely.  Note that ``drop()`` returns
a new dataframe rather than changing ``df`` itself, so we reassign
the output to keep the change::

  df = df.drop('Unit', axis=1)
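Dropping the whole column is not the only possible approach.  When a
variable has only a small fraction of missing values, common
alternatives are to drop just the affected rows, or to fill the holes
with a constant or a summary statistic.  Here is a brief sketch of
both, applied to a fresh copy of the data (since we just dropped the
``'Unit'`` column from ``df``), with ``'unknown'`` as an arbitrary
placeholder value::

  # re-read the data, so the 'Unit' column is present again
  df2 = pd.read_csv('dataframe.csv', delimiter=',', thousands=',')

  # option A: drop only the rows where 'Unit' is missing
  df2_rows = df2.dropna(subset=['Unit'])

  # option B: fill the missing entries with a placeholder value
  df2_fill = df2.fillna({'Unit': 'unknown'})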
Detect outliers
---------------------------------------------------------------------------

**Continuous variables**

To get an overall sense of the values stored in a continuous variable
column, we can take a look at its descriptive statistics.  These
summary statistics give a quick and simple description of the data.
For example, from our list of ``num_cols``, we can check out the
descriptive statistics of the ``'Indicator'`` column with::

  df['Indicator'].describe()

\.\.\. which outputs:

.. code-block:: none

   count    4.400000e+01
   mean     1.182708e+07
   std      3.961010e+04
   min      1.176082e+07
   25%      1.182092e+07
   50%      1.185032e+07
   75%      1.185072e+07
   max      1.188152e+07
   Name: Indicator, dtype: float64

The following table describes each of the returned fields, which
summarize the central values of the variable (mean and median), as
well as their spread (std, min/max and interquartile ranges).

.. list-table:: Table of outputs from a dataframe's ``describe()`` method for a continuous variable
   :header-rows: 1
   :widths: 20 80

   * - Item
     - Information
   * - **count**
     - number of non-missing values
   * - **mean**
     - average of the data values
   * - **std**
     - sample standard deviation
   * - **min**
     - smallest data value
   * - **25%**
     - quartile: 25% of data <= this value
   * - **50%**
     - quartile: 50% of data <= this value (= median)
   * - **75%**
     - quartile: 75% of data <= this value
   * - **max**
     - largest value in data
   * - **Name**
     - variable (column) name
   * - **dtype**
     - datatype of column

|

**Categorical variables**

There is also summary information available for categorical
variables.  For example, from our list of ``cat_cols``, we can check
out the descriptive statistics of the ``'RegionName'`` column with::

  df['RegionName'].describe()

\.\.\. which outputs:

.. code-block:: none

   count        44
   unique        6
   top       Niger
   freq         10
   Name: RegionName, dtype: object

.. list-table:: Table of outputs from a dataframe's ``describe()`` method for a categorical variable
   :header-rows: 1
   :widths: 20 80

   * - Item
     - Information
   * - **count**
     - number of non-missing values
   * - **unique**
     - number of separate categories
   * - **top**
     - category which appears most
   * - **freq**
     - the "top" category's number of appearances
   * - **Name**
     - variable (column) name
   * - **dtype**
     - datatype of column

We can also picture these distributions.  For a continuous variable
we can make a histogram of its values, while for a categorical
variable the analogous picture is a bar plot of the category counts
(a histogram of text labels is not defined)::

  # make a histogram of a continuous variable
  plt.figure('hist')
  df['Value'].hist(bins=20)
  plt.show()

  # make a bar plot of a categorical variable's counts
  plt.figure('bars')
  df['RegionName'].value_counts().plot.bar()
  plt.show()


Dataviz with Python
===========================================================================


Practice
===========================================================================