Chapter 5 Pandas

Pandas is a powerful library for data manipulation and analysis, borrowing the idea of data frames from the R language. An essential import setup for Pandas (NumPy is imported as well, since several examples below use np.nan) is

import numpy as np
import pandas as pd
from datetime import datetime, date

5.1 Series

Similar to a one-dimensional NumPy array, with the addition of an index, which can comprise arbitrary values

pd.Series([2,5,6,7])
#> 0    2
#> 1    5
#> 2    6
#> 3    7
#> dtype: int64

In the example above, the index has not been specified, and it defaults to the standard 0-based integer index.

s = pd.Series([2,5,6,7],index=("prime1","prime2","prime3","prime4"))
print(s)
#> prime1    2
#> prime2    5
#> prime3    6
#> prime4    7
#> dtype: int64

The index labels can then be used for selection, for example with a list of labels:

s[["prime2","prime4"]]
#> prime2    5
#> prime4    7
#> dtype: int64
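
Label-based slices are also supported; unlike positional slices, they include the endpoint:

s["prime2":"prime4"]
#> prime2    5
#> prime3    6
#> prime4    7
#> dtype: int64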

When a Series is instantiated from a dictionary, the keys are used to create the index

s = pd.Series({'Friday':0.2,'Saturday':1.0,'Sunday':4.2,'Monday':0.0,'Tuesday':1.2})
print(s[1:3])  # integer slice: positional, upper bound excluded
#> Saturday    1.0
#> Sunday      4.2
#> dtype: float64

The overall behaviour of Series is similar to that of ndarray, as far as slicing and indexing are concerned:

def fahrenheit(t):
    # vectorized: operates element-wise on a whole Series
    return 9*t/5+32

s[s<1.2]  # boolean indexing, as with ndarray
#> Friday      0.2
#> Saturday    1.0
#> Monday      0.0
#> dtype: float64
fahrenheit(s)
#> Friday      32.36
#> Saturday    33.80
#> Sunday      39.56
#> Monday      32.00
#> Tuesday     34.16
#> dtype: float64

Series has some features of dict, like assignment by index and testing of index value membership

s['Monday'] = np.nan
pd.isna(s['Monday'])
#> True
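# dict-like membership test on the index
'Monday' in s
#> True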
dt=pd.date_range('2019-10-29','2019-11-02')
print(dt)
#> DatetimeIndex(['2019-10-29', '2019-10-30', '2019-10-31', '2019-11-01',
#>                '2019-11-02'],
#>               dtype='datetime64[ns]', freq='D')
precip_sp = pd.Series([0.2,1.0,4.2,0.0,1.2],index=dt)
print(precip_sp)
#> 2019-10-29    0.2
#> 2019-10-30    1.0
#> 2019-10-31    4.2
#> 2019-11-01    0.0
#> 2019-11-02    1.2
#> Freq: D, dtype: float64
precip_bu = pd.Series([0.0,0.0,0.0,1.2,4.2,0.0],
                      index=pd.date_range('2019-10-27','2019-11-01'))
# NaN appears where the indexes of precip_sp and precip_bu do not overlap
print(precip_sp-precip_bu)
#> 2019-10-27    NaN
#> 2019-10-28    NaN
#> 2019-10-29    0.2
#> 2019-10-30   -0.2
#> 2019-10-31    0.0
#> 2019-11-01    0.0
#> 2019-11-02    NaN
#> Freq: D, dtype: float64

5.2 DataFrame

A 2-dimensional, table-like data structure with columns of possibly different types and an index for its rows.

You can create a DataFrame by passing a dictionary of Series objects:

people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "martino", "franco"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["alice", "martino", "franco"], name="year"),
    "children": pd.Series([0, 3], index=["martino", "franco"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "martino"]),
}

people = pd.DataFrame(people_dict)
people
#>          weight  birthyear  children    hobby
#> alice        68       1984       NaN   Biking
#> franco      112       1992       3.0      NaN
#> martino      83       1985       0.0  Dancing

Another convenient way to create a DataFrame is to pass all the values to the constructor as an ndarray, or a list of lists, and specify the column names and row index labels separately:

values = [
    [1985, np.nan, "Biking",   68],
    [1984, 3,      "Dancing",  83],
    [1992, 0,      np.nan,    112]
]

d = pd.DataFrame(
        values,
        columns=["birthyear", "children", "hobby", "weight"],
        index=["alice", "martino", "franco"])
d
#>          birthyear  children    hobby  weight
#> alice         1985       NaN   Biking      68
#> martino       1984       3.0  Dancing      83
#> franco        1992       0.0      NaN     112
precips = pd.DataFrame({'San Pietro Capofiume': precip_sp,
                        'Bologna Urbana': precip_bu})
print(precips)
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-27                   NaN             0.0
#> 2019-10-28                   NaN             0.0
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0
#> 2019-11-02                   1.2             NaN

Pandas’ DataFrame() constructor is flexible as to the types of its inputs. Given a list of dictionaries, each dictionary in the list becomes one row of the data frame:

city_loc = pd.DataFrame([{'Lat':44.49381,'Long':11.33875},
                         {'Lat':41.89193,'Long':12.51133}],
                        index=['Bologna','Rome'])
print(city_loc)
#>               Lat      Long
#> Bologna  44.49381  11.33875
#> Rome     41.89193  12.51133

A DataFrame behaves like a dictionary in which the keys are column names and the values are index-aligned Series

city_loc.columns
#> Index(['Lat', 'Long'], dtype='object')
city_loc['Lat']
#> Bologna    44.49381
#> Rome       41.89193
#> Name: Lat, dtype: float64
# add an elevation column, e.g. from a plain list
city_loc['Elev'] = [54,21]
city_loc
#>               Lat      Long  Elev
#> Bologna  44.49381  11.33875    54
#> Rome     41.89193  12.51133    21

5.2.1 Indexing

5.2.1.1 .loc[] attribute

Label-based indexing is supported via the .loc[] attribute; selecting a single row label returns a Series object.

precips.loc['2019-10-30']
#> San Pietro Capofiume    1.0
#> Bologna Urbana          1.2
#> Name: 2019-10-30 00:00:00, dtype: float64
type(precips.loc['2019-10-30'])
#> <class 'pandas.core.series.Series'>

.loc[] supports slicing. However, unlike ndarray integer-based slicing, it includes the upper label:

precips.loc['2019-10-29':'2019-11-01']
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0

and it also accepts a list of labels:

# Note: here the list must be converted to a DatetimeIndex, because the Series
# used to build the data frame were constructed with a DatetimeIndex
precips.loc[pd.DatetimeIndex(['2019-10-31','2019-11-01','2019-11-02'])]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0
#> 2019-11-02                   1.2             NaN

It is also possible to index with booleans:

# boolean mask marking the rows with at least one missing value
na = (pd.isna(precips['Bologna Urbana']) | pd.isna(precips['San Pietro Capofiume']))
# keep only the rows that are not (~) masked
precips.loc[~na]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0

5.2.1.2 .iloc[] attribute

The .iloc attribute provides integer-based indexing, following the usual Python conventions. The allowed types of input are the same as .loc, but integers must be specified instead of labels. The following examples return the same results as the ones above:

precips.iloc[3]
#> San Pietro Capofiume    1.0
#> Bologna Urbana          1.2
#> Name: 2019-10-30 00:00:00, dtype: float64
precips.iloc[2:6]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0
precips.iloc[[4,5,6]]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0
#> 2019-11-02                   1.2             NaN

Indexing with booleans is also possible, but the mask must first be converted to a NumPy array with .to_numpy():

precips.iloc[~na.to_numpy()]
#>             San Pietro Capofiume  Bologna Urbana
#> 2019-10-29                   0.2             0.0
#> 2019-10-30                   1.0             1.2
#> 2019-10-31                   4.2             4.2
#> 2019-11-01                   0.0             0.0

5.2.1.3 .query()

The .query() method filters a DataFrame according to a query expression; it is the most convenient choice when several conditions must be combined in the row selection.

df.query("(column1 == 'value') and (column2 > 1000)")

n = 200
df.query("(column1 == 'value') and (column2 > @n)")

df.query(f"(column1 == 'value') or (column2 > {n})")

5.2.2 Reading Data

5.2.2.1 read_csv

The read_csv function reads a CSV file into a DataFrame, either from a local path or directly from a URL:

# iris1 = pd.read_csv("https://raw.githubusercontent.com/SakshamAyush/Iris_Flower_Classification/master/data/bezdekIris.csv")

  • The compression argument lets you specify the decompression algorithm when the input is a compressed CSV file; the value 'infer' chooses it from the file extension.

  • By default, time series are not recognized: the dtype of such a column is object, while it should be datetime64[ns]. The parse_dates argument specifies the columns that must be parsed as dates (e.g. parse_dates=['year'] or parse_dates=[0]).

  • The index_col argument sets one or more columns as the row labels (e.g. index_col='Time'). A sketch combining these arguments follows this list.
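
A sketch combining these arguments, assuming a hypothetical compressed file precip.csv.gz with a 'Time' column:

df = pd.read_csv('precip.csv.gz', compression='infer',
                 parse_dates=['Time'], index_col='Time')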

5.2.2.2 read_excel

The read_excel function enables the reading of .xls and .xlsx files. Installing either xlrd (for legacy .xls) or openpyxl (for Excel 2007+ .xlsx files) is a prerequisite.
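
A minimal sketch, assuming a hypothetical workbook data.xlsx:

df = pd.read_excel('data.xlsx', sheet_name=0)  # read the first sheet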

5.2.3 Data Wrangling

The pd.concat() function concatenates a list of Series or DataFrame objects:

# row labels are kept as-is, so the result contains duplicate index values
s1=pd.Series([1,2,3])
s2=pd.Series([4,5,6])
s=pd.concat([s1, s2])
s
#> 0    1
#> 1    2
#> 2    3
#> 0    4
#> 1    5
#> 2    6
#> dtype: int64

Be careful when concatenating DataFrame objects. Let C1 and C2 be the column labels of the first and second frame: the columns of C2 missing from C1 are added to the first frame, and the columns of C1 missing from C2 are added to the second, all filled with NaN. Then the rows of the second frame are appended to the first.

df1=pd.DataFrame({'a': [1,2,3],'b': [4,5,6]})
df2=pd.DataFrame({'b': [7,8,9],'c': [10,11,12]})
df=pd.concat([df1, df2])
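
Here df1 and df2 share only column b, so a and c are NaN-filled on the side that lacks them, and the row labels 0, 1, 2 appear twice:

print(df)
#>      a  b     c
#> 0  1.0  4   NaN
#> 1  2.0  5   NaN
#> 2  3.0  6   NaN
#> 0  NaN  7  10.0
#> 1  NaN  8  11.0
#> 2  NaN  9  12.0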

5.2.3.1 Multi-indexing

If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:

d = pd.DataFrame(
  {
    ("public", "birthyear"):
        {("Paris","alice"):1985, ("Paris","martino"): 1984, ("London","franco"):1992},
    ("public", "hobby"):
        {("Paris","alice"):"Biking", ("Paris","martino"): "Dancing"},
    ("private", "weight"):
        {("Paris","alice"):68, ("Paris","martino"): 83, ("London","franco"): 112},
    ("private", "children"):
        {("Paris", "alice"):np.nan, ("Paris","martino"): 3, ("London","franco"): 0}
  }
)
d
#>                   public          private         
#>                birthyear    hobby  weight children
#> Paris  alice        1985   Biking      68      NaN
#>        martino      1984  Dancing      83      3.0
#> London franco       1992      NaN     112      0.0
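
With a column multi-index, selecting a top-level label returns the corresponding sub-frame:

d["public"]
#>                birthyear    hobby
#> Paris  alice        1985   Biking
#>        martino      1984  Dancing
#> London franco       1992      NaN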

You can transpose columns and indices using the .T attribute:

dT = d.T

There are two levels of columns, and two levels of indices. We can drop a column level by calling droplevel() (the same goes for indices):

dT.columns = dT.columns.droplevel(level = 0)
dT
#>                     alice  martino franco
#> public  birthyear    1985     1984   1992
#>         hobby      Biking  Dancing    NaN
#> private weight         68       83    112
#>         children      NaN      3.0    0.0

5.2.3.2 Stacking and unstacking levels

Calling the stack() method pushes the lowest column level into the row index, after the lowest index level:

d.stack()
#>                           private   public
#> Paris  alice   birthyear      NaN     1985
#>                hobby          NaN   Biking
#>                weight        68.0      NaN
#>        martino birthyear      NaN     1984
#>                children       3.0      NaN
#>                hobby          NaN  Dancing
#>                weight        83.0      NaN
#> London franco  birthyear      NaN     1992
#>                children       0.0      NaN
#>                weight       112.0      NaN
dT.stack()
#> public   birthyear  alice         1985
#>                     martino       1984
#>                     franco        1992
#>          hobby      alice       Biking
#>                     martino    Dancing
#> private  weight     alice           68
#>                     martino         83
#>                     franco         112
#>          children   martino        3.0
#>                     franco         0.0
#> dtype: object

Note that many NaN values appeared. This makes sense, because many combinations did not exist before (e.g. birthyear is a public column, so there is no private birthyear value).

Calling unstack() will do the reverse, once again creating many NaN values.

dT.unstack()
#>             alice                          ...    franco                      
#>         birthyear children   hobby weight  ... birthyear children hobby weight
#> private       NaN      NaN     NaN     68  ...       NaN      0.0   NaN    112
#> public       1985      NaN  Biking    NaN  ...      1992      NaN   NaN    NaN
#> 
#> [2 rows x 12 columns]

The stack() and unstack() methods let you select the level to stack/unstack. You can even stack/unstack multiple levels at once, specifying the level parameter.
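
For example, stacking the top column level (level 0) instead of the lowest one:

d.stack(level=0)
#>                         birthyear  children    hobby  weight
#> Paris  alice   private        NaN       NaN      NaN    68.0
#>                public      1985.0       NaN   Biking     NaN
#>        martino private        NaN       3.0      NaN    83.0
#>                public      1984.0       NaN  Dancing     NaN
#> London franco  private        NaN       0.0      NaN   112.0
#>                public      1992.0       NaN      NaN     NaN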

5.2.3.3 Adding or removing columns

people
#>          weight  birthyear  children    hobby
#> alice        68       1984       NaN   Biking
#> franco      112       1992       3.0      NaN
#> martino      83       1985       0.0  Dancing
people["age"] = 2018 - people["birthyear"] # adds a new column "age" 
people["over 30"] = people["age"] > 30 # adds another column "over 30" 
birthyears = people.pop("birthyear")  # removes and returns the column
del people["children"]                # removes the column in place

people
#>          weight    hobby  age  over 30
#> alice        68   Biking   34     True
#> franco      112      NaN   26    False
#> martino      83  Dancing   33     True

When you add a new column from a plain list or array, it must have the same number of rows as the DataFrame. When you add a Series instead, it is aligned on the index: missing rows are filled with NaN, and extra rows are ignored.

When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the .insert() method.

people.insert(1, "height", [172, 181, 185])

(people
 .assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100) ** 2)
 .assign(overweight = lambda df: df["body_mass_index"] > 25)
)
#>          weight  height    hobby  age  over 30  body_mass_index  overweight
#> alice        68     172   Biking   34     True        22.985398       False
#> franco      112     181      NaN   26    False        34.186991        True
#> martino      83     185  Dancing   33     True        24.251278       False
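
Note that .assign() returns a new DataFrame, so people itself is left unchanged; chaining assignments this way is a common pandas idiom.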

5.2.3.4 Sorting a DataFrame

You can sort a DataFrame by calling its sort_index method. By default it sorts the rows by their index label, in ascending order, but let’s reverse the order:

people.sort_index(ascending=False)
#>          weight  height    hobby  age  over 30
#> martino      83     185  Dancing   33     True
#> franco      112     181      NaN   26    False
#> alice        68     172   Biking   34     True

Note that sort_index returned a sorted copy of the DataFrame. To modify people directly, set the inplace argument to True. You can also sort the columns instead of the rows by setting axis=1:
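
people.sort_index(axis=1)
#>          age  height    hobby  over 30  weight
#> alice     34     172   Biking     True      68
#> franco    26     181      NaN    False     112
#> martino   33     185  Dancing     True      83

To sort by the values instead of the labels, use sort_values() and specify the column to sort by: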

people.sort_values(by="age", inplace=True)
people
#>          weight  height    hobby  age  over 30
#> franco      112     181      NaN   26    False
#> martino      83     185  Dancing   33     True
#> alice        68     172   Biking   34     True

5.2.3.5 Aggregating

Similar to the SQL language, pandas lets you split your data into groups with .groupby() and run calculations over each group.

Pandas also supports spreadsheet-like pivot tables that allow quick data summarization. To illustrate both, let’s create a simple DataFrame:
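
A minimal sketch, using made-up precipitation records for two stations:

obs = pd.DataFrame({'station': ['SP', 'SP', 'BU', 'BU'],
                    'month':   [10, 11, 10, 11],
                    'precip':  [5.4, 1.2, 5.4, 0.0]})
obs.groupby('station')['precip'].sum()
#> station
#> BU    5.4
#> SP    6.6
#> Name: precip, dtype: float64
obs.pivot_table(values='precip', index='station', columns='month')
#> month     10   11
#> station
#> BU       5.4  0.0
#> SP       5.4  1.2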

5.2.4 Evaluating an expression

A great feature supported by pandas is expression evaluation. This relies on the numexpr library which must be installed.

people.eval("weight / (height/100) ** 2 > 25")
#> franco      True
#> martino    False
#> alice      False
#> dtype: bool

Assignment expressions are also supported. Let’s set inplace=True to directly modify the DataFrame rather than getting a modified copy:

people.eval("body_mass_index = weight / (height/100) ** 2", inplace=True)
people
#>          weight  height    hobby  age  over 30  body_mass_index
#> franco      112     181      NaN   26    False        34.186991
#> martino      83     185  Dancing   33     True        24.251278
#> alice        68     172   Biking   34     True        22.985398

5.2.5 Handling Missing Data

Dealing with missing data is a frequent task when working with real life data. Pandas offers a few tools to handle missing data.

  • .isnull() and .isna() indicate whether values are missing (True/False). Summing the result, .isnull().sum(0), gives an overview of the number of NAs in each column (see the example after this list).

  • df[df.isnull().any(axis=1)] keeps only the rows that contain at least one null
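
For example, counting the missing values per column of the precips frame built earlier:

precips.isnull().sum()
#> San Pietro Capofiume    2
#> Bologna Urbana          1
#> dtype: int64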

grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array, columns=["sep", "oct", "nov"],
                      index=["alice", "bob", "charles", "darwin"])
grades
#>          sep  oct  nov
#> alice      8    8    9
#> bob       10    9    9
#> charles    4    8    2
#> darwin     9   10   10
bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])
bonus_points = pd.DataFrame(bonus_array, columns=["oct", "nov", "dec"],
                            index=["bob", "colin", "darwin", "charles"])
bonus_points
#>          oct  nov  dec
#> bob      0.0  NaN  2.0
#> colin    NaN  1.0  0.0
#> darwin   0.0  1.0  0.0
#> charles  3.0  3.0  0.0
grades + bonus_points
#>          dec   nov   oct  sep
#> alice    NaN   NaN   NaN  NaN
#> bob      NaN   NaN   9.0  NaN
#> charles  NaN   5.0  11.0  NaN
#> colin    NaN   NaN   NaN  NaN
#> darwin   NaN  11.0  10.0  NaN

Looks like the addition worked in some cases but way too many elements are now empty. That’s because when aligning the DataFrames, some columns and rows were only present on one side, and thus they were considered missing on the other side (NaN). Then adding NaN to a number results in NaN, hence the result.

5.2.5.1 .fillna()

Let’s try to fix the problem above. For example, we can decide that missing data should result in a zero instead of NaN. We can replace all NaN values with any value using the fillna() method:

(grades + bonus_points).fillna(0) #fill with 0s
#>          dec   nov   oct  sep
#> alice    0.0   0.0   0.0  0.0
#> bob      0.0   0.0   9.0  0.0
#> charles  0.0   5.0  11.0  0.0
#> colin    0.0   0.0   0.0  0.0
#> darwin   0.0  11.0  10.0  0.0

It’s a bit unfair that we’re setting grades to zero in September, though. Perhaps missing grades should remain missing grades, while missing bonus points are replaced by zeros (or, as below, interpolated from the neighbouring months):

better_bonus_points = bonus_points.copy()
better_bonus_points.insert(0, "sep", 0)   # no bonus in September
better_bonus_points.loc["alice"] = 0      # alice got no bonus at all
# fill the remaining NaNs by linear interpolation along each row
better_bonus_points = better_bonus_points.interpolate(axis=1)

final_grades = grades + better_bonus_points
final_grades
#>          dec   nov   oct   sep
#> alice    NaN   9.0   8.0   8.0
#> bob      NaN  10.0   9.0  10.0
#> charles  NaN   5.0  11.0   4.0
#> colin    NaN   NaN   NaN   NaN
#> darwin   NaN  11.0  10.0   9.0

5.2.5.2 .dropna()

So let’s call the .dropna() method to get rid of rows that are full of NaNs:

final_grades_clean = final_grades.dropna(how="all")
final_grades_clean
#>          dec   nov   oct   sep
#> alice    NaN   9.0   8.0   8.0
#> bob      NaN  10.0   9.0  10.0
#> charles  NaN   5.0  11.0   4.0
#> darwin   NaN  11.0  10.0   9.0

To remove columns that are full of NaNs, we set the axis argument to 1:

final_grades_clean = final_grades_clean.dropna(axis=1, how="all")
final_grades_clean
#>           nov   oct   sep
#> alice     9.0   8.0   8.0
#> bob      10.0   9.0  10.0
#> charles   5.0  11.0   4.0
#> darwin   11.0  10.0   9.0

5.2.6 Data Manipulation

  • pd.to_numeric() converts the values of a column to a numeric dtype

  • .astype() casts values to another type, for example .astype(str) converts them to str

  • .groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True): a groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

df.groupby('column1').sum()

  • Use the pd.cut() function to create a categorical column whose labels depend on the values. For example, suppose we have a variable ‘median_income’ in a frame housing and we want to split it into 5 categories in a new column ‘income_cat’ (a small sketch follows this list):

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=['low income', 'low-mid income', 'mid income',
                                       'mid-high income', 'high income'])

5.2.7 Overview Functions

  • The .info() method prints out a summary of each column’s contents.

  • The .describe() method gives a nice overview of the main aggregated values over each column:
      – count: number of non-null (not NaN) values
      – mean: mean of non-null values
      – std: standard deviation of non-null values
      – min: minimum of non-null values
      – 25%, 50%, 75%: 25th, 50th and 75th percentiles of non-null values
      – max: maximum of non-null values

  • To inspect the characteristics of the elements, the .unique() method returns the distinct values, while .nunique() counts them (see the example after this list).

  • For object columns, there may exist multiple categories (factors). To explore them, use .value_counts().

  • .shape (an attribute, not a method) reports the shape of the table (rows × columns)

  • .corr() computes pairwise correlation of columns, excluding NA/null values.
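
For instance, on the grades frame from above:

grades.shape
#> (4, 3)
grades['sep'].unique()
#> array([ 8, 10,  4,  9])
grades['sep'].nunique()
#> 4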